MongoDB compression options
Many of us have faced the classic task of saving a vast amount of monotonous data for later retrieval from the database and analysis. This can be long numeric series, series of strings, or texts of impressive size (for example, compilation logs or any other text logs, configurations, etc.).
Let's look at the database's options to optimize the storage of such data.
MongoDB compression algorithms
Data compression in MongoDB is provided by the WiredTiger storage engine and is not available if your MongoDB uses another storage engine (as of version 6.0, only two storage engines are available: wiredTiger and inMemory).
WiredTiger supports the compression option none and three compression algorithms: snappy, zlib, and zstd. By default, snappy is used, providing a balance between computation cost and a reasonable compression ratio. Index compression (prefix compression in WiredTiger) is controlled by a separate boolean option: true or false.
The compression option can only be chosen at collection creation time, and it affects only the collection being created, so you can have collections with different compression algorithms in the same database. It cannot be changed after the collection is created. Let's create a collection with a non-default compression option:
db.createCollection("logs", {storageEngine: {wiredTiger: {configString: "block_compressor=zlib"}}})
We can now check our collection options via db.getCollectionInfos() and see the following:
[{
    "name" : "logs",
    "type" : "collection",
    "options" : {
        "storageEngine" : {
            "wiredTiger" : {
                "configString" : "block_compressor=zlib"
            }
        }
    },
    "info" : {
        ...
    }
}]
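The same collection can be created from a driver. A minimal pymongo sketch (the database name here is illustrative); create_collection() passes extra keyword arguments through to the create command, so the storage engine options look the same as in the shell:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost")
db = client["ci"]  # illustrative database name

# block_compressor can be none, snappy, zlib or zstd
db.create_collection(
    "logs",
    storageEngine={"wiredTiger": {"configString": "block_compressor=zlib"}},
)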
Comparison of compression algorithms in a real case
For my experiment, I ran four identical Docker containers with MongoDB that differed only in compression method: none, snappy, zlib, and zstd.
Let's imagine a more or less real case when you need to store the logs of every compilation/build/deployment/autotest run in CI/CD. Every document will contain two dates (start and finish), a float for the duration (which you could also compute from the previous values instead of storing), an integer for the response code, a string for the status, and several ObjectId values as references to pipelines, jobs, scopes, runners, etc. The remaining fields of the document hold large text data (environment variables, configs, versions of compilers and other tools, and, most importantly, the stderr and stdout of the compilation process). It is hard to gain much compression on numeric or date values, but autotest logs compress very well because they contain many repeating sentences.
The total number of documents in my experiment is 112233, where every document is a real autotest log.
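For reference, each document looks roughly like this (the field names below are illustrative; the exact schema is not important for the comparison):

import datetime
from bson import ObjectId

# Illustrative shape of a single autotest/build log document.
log_doc = {
    "started_at": datetime.datetime(2022, 10, 1, 12, 0, 0),
    "finished_at": datetime.datetime(2022, 10, 1, 12, 3, 27),
    "duration": 207.3,                  # float, could also be computed on the fly
    "response_code": 0,                 # integer exit/response code
    "status": "success",                # short string
    "pipeline_id": ObjectId(),          # references to pipeline, job, runner, ...
    "job_id": ObjectId(),
    "runner_id": ObjectId(),
    "env": "CC=gcc-11\nCFLAGS=-O2\n...",   # large, highly repetitive text fields
    "stdout": "...",
    "stderr": "...",
}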
I inserted the same documents into the MongoDB instances created above and got the following results:
| compression | average document size | total collection size | % of default compression method (snappy) |
| --- | --- | --- | --- |
| w/o compression | 9.79 KB | 1.13 GB | 162% |
| snappy | 9.69 KB | 711.54 MB | 100% |
| zlib | 9.79 KB | 449.48 MB | 63% |
| zstd | 9.79 KB | 380.43 MB | 53% |
As we can see, zstd shows amazing results. With proper indexes, MongoDB's own caching, and server-side micro-caching, we can save a lot of disk space while the performance slowdown may be completely non-critical.
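Numbers like these can be taken from the database itself via collStats, which reports both the logical size and the on-disk size of a collection; a minimal pymongo sketch (database and collection names are illustrative):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost")
stats = client["ci"].command("collStats", "logs")

# "size" is the uncompressed (logical) size of the documents,
# "storageSize" is what the collection actually occupies on disk,
# "avgObjSize" is the average document size reported by MongoDB.
print(stats["size"], stats["storageSize"], stats["avgObjSize"])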
Additional experiment
When it comes to a test system, we are most likely searching for documents by status, tags, pipeline, flags, etc., and hardly ever looking for a specific word in stdout. If the goal is only to show a log to an interested developer after the document has been found by other criteria, we can try to store the logs already compressed.
I tried compressing the stdout/stderr output with zlib before inserting the documents into the database. In this run only the large text field "stdout" was compressed, and all other values are stored as is. I got the following results:
| compression | average document size | total collection size | % of default compression method (snappy) |
| --- | --- | --- | --- |
| w/o compression | 4.35 KB | 505.49 MB | 112% |
| snappy | 4.35 KB | 449.93 MB | 100% |
| zlib | 4.35 KB | 432.99 MB | 96% |
| zstd | 4.35 KB | 432.61 MB | 96% |
The overall results are now better for snappy and zlib and worse for zstd, which is logical, because we have already removed the redundancy that the block compressor would otherwise exploit. Keep in mind that compression is applied not to a single document but to an entire collection (we can verify this from the database's own statistics: the collection size shrinks while the average document size stays the same), so data that is already compressed cannot be used to build a useful compression dictionary.
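A minimal sketch of this client-side approach, assuming the illustrative document shape from above (only the big text field is compressed and stored as BSON binary; decompression happens in the application when the log needs to be shown):

import zlib
from bson.binary import Binary
from pymongo import MongoClient

client = MongoClient("mongodb://localhost")
logs = client["ci"]["logs"]

def insert_log(doc):
    # Compress only the large text field; everything else stays as is.
    doc["stdout"] = Binary(zlib.compress(doc["stdout"].encode("utf-8"), 9))
    logs.insert_one(doc)

def read_stdout(query):
    doc = logs.find_one(query)
    return zlib.decompress(doc["stdout"]).decode("utf-8")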
Network compression
We can also improve the performance of interaction with the database by setting network compression options, which will affect only data transmission over the network but will not affect how the data is stored.
The network compression algorithm can only be set when a client connects, and it requires the client to support the chosen algorithm. For example, for Python 3 you need to install the python-snappy or zstandard package, while zlib support is part of the standard library and works out of the box.
In Python, we can set compression right in the constructor of the MongoClient class:
client = MongoClient('mongodb://localhost', compressors='zstd')
For zlib you can also set the compression level:
client = MongoClient('mongodb://localhost', compressors='zlib', zlibCompressionLevel=9)
In Golang the compression can be set in the following way:
compressionLevel := 9
opts := &options.ClientOptions{
	Compressors: []string{"zlib"},
	ZlibLevel:   &compressionLevel,
}
cli, err := mongo.NewClient(opts.ApplyURI(uri))
Let's try to understand how much this speeds up data exchange with the database using a simple Python script that repeatedly reads documents from a cursor and calculates the rate at which they are received. The results should be treated with a certain dose of skepticism, since they can vary a lot with network conditions, computing power, the specifics of the documents you store, and so on.
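A simplified sketch of such a measurement (the connection string, collection name, and the way the data amount is limited are illustrative; run it against clients created with different compressors values to compare them):

import time

import bson
from pymongo import MongoClient

client = MongoClient("mongodb://localhost", compressors="zstd")
logs = client["ci"]["logs"]

def measure(limit):
    """Read `limit` documents from a cursor and report the receive rate."""
    start = time.monotonic()
    count = 0
    total_bytes = 0
    for doc in logs.find().limit(limit):
        total_bytes += len(bson.encode(doc))  # size of the document in BSON
        count += 1
    elapsed = time.monotonic() - start
    print(f"{count / elapsed:.0f} records/second, "
          f"{total_bytes / elapsed / 1024:.2f} KB/s")

measure(1000)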
| data amount | snappy (speed) | snappy (records rate) | zstd (speed) | zstd (records rate) |
| --- | --- | --- | --- | --- |
| 1 MB | 36.50 KB/s | 140 records/second | 47.5 KB/s | 190 records/second |
| 3 MB | 31.20 KB/s | 121 records/second | 39.8 KB/s | 175 records/second |
| 5 MB | 29.10 KB/s | 111 records/second | 35.8 KB/s | 158 records/second |
| 7 MB | 28.40 KB/s | 108 records/second | 35.0 KB/s | 154 records/second |
| 10 MB | 27.0 KB/s | 102 records/second | 34.3 KB/s | 151 records/second |
Switching the compression algorithm to zstd in my case increased the speed of data transmission by up to 27%, but it also increased CPU consumption both on the client and on the database server. Using extra compression is probably a bad idea for clients with limited computing power, but we can get around this by defining the connection settings in the client dynamically. If your server application is configured via flags or environment variables, don't forget to add an option that allows redefining the default compression algorithm.
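For example, in Python the compressor can be taken from the environment instead of being hard-coded (a sketch; the variable name MONGO_COMPRESSORS is arbitrary):

import os

from pymongo import MongoClient

# MONGO_COMPRESSORS is an arbitrary variable name used for this example,
# e.g. "zstd", "zlib", "snappy" or a comma-separated preference list.
compressors = os.getenv("MONGO_COMPRESSORS")

if compressors:
    client = MongoClient("mongodb://localhost", compressors=compressors)
else:
    # No value set: connect without requesting network compression.
    client = MongoClient("mongodb://localhost")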
Summary
We can see that the additional compression options in MongoDB can easily decrease your database size, and you can get almost twice-compressed data if you store large text output from routine operations such as automatic tests or compilations. Setting the zstd compression algorithm for your database can give you great results and let you get rid of data compression functions in your own code. We can supplement this with network compression if the system has enough resources.
Changing the standard algorithm to more efficient compression is especially important for databases used to store files via GridFS. Data compression works just as well on large binary sequences (files), but it is unlikely to have much effect on video or audio files, since such data already has few repeated blocks thanks to constantly evolving encoders. Images like png, jpg, and webp are usually well compressed by their own encoders too. Pdf, svg, executable binaries, and containers like tar and zip are still very popular formats that compress very well, although storing them in a database is an unpopular decision.
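If you do store files this way, one possible approach is to pre-create the GridFS chunks collection with the desired block compressor before the first file is written, so the chunks land in a compressed collection (a sketch, assuming the default fs bucket name; the database name and file are illustrative):

import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost")
db = client["artifacts"]  # illustrative database name

# Pre-create the chunks collection with zstd so that GridFS chunks
# end up in a compressed collection (run once, before the first write).
if "fs.chunks" not in db.list_collection_names():
    db.create_collection(
        "fs.chunks",
        storageEngine={"wiredTiger": {"configString": "block_compressor=zstd"}},
    )

fs = gridfs.GridFS(db)
with open("build.log", "rb") as f:  # illustrative file
    file_id = fs.put(f, filename="build.log")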