I have recently been looking at compressing backups on the fly. That is, I generate a backup stream, pipe it through a compression utility and then send it to disk. Plain and simple, nothing great about it. However, I also wanted to utilize all those cores and hyperthreads lying around to speed up the compression process. In other words, I wanted parallel compression and parallel decompression ability.
After looking around I found a bunch of utilities that do this. Among the several I found, notable ones include pigz, pbzip2, pxz, p7zip, libbsc and Nanozip (not open source). Most of these are parallel implementations of specific algorithms that try to stay compatible with the existing file formats. For example, pbzip2 retains the existing bzip2 file format by creating multiple compression streams within the file. Some of them, like pxz, only scale to physical cores and ignore the hyperthreads.
I however had other ideas. I wanted a technique that was algorithm agnostic and would allow me to leverage multiple algorithms based on the dataset I am trying to compress. The simple way to do this is to break up the data into chunks and compress the chunks in parallel. Decompression can also be done in parallel. This of course affects compression efficiency but provides great speedup on multicore and multisocket boxes. In addition if one has lots of RAM then large chunk sizes can be used to minimize the compression ratio hit as much as possible. It is also possible in this scenario to combine multiple compression algorithms and extract the best result which I call adaptive mode.
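The chunking idea above can be sketched in a few lines of Python (pcompress itself is written in C; the names here are purely illustrative). This uses zlib with a thread pool, since zlib releases the GIL while compressing, so the chunks really are processed in parallel:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20  # 1 MiB chunks; larger chunks lessen the ratio hit

def _compress_chunk(chunk: bytes) -> bytes:
    return zlib.compress(chunk, 6)

def parallel_compress(data: bytes, workers: int = 4) -> list:
    # Split into fixed-size chunks and compress them concurrently.
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves ordering, so the chunk sequence is retained
        return list(pool.map(_compress_chunk, chunks))

def parallel_decompress(blocks, workers: int = 4) -> bytes:
    # Each chunk is an independent stream, so decompression parallelizes too.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blocks))
```

Because every chunk is a self-contained compressed stream, each one costs a fresh dictionary, which is exactly the compression-ratio hit mentioned above and why bigger chunks help.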
Digressing a bit, there is a lot of material around on the net regarding compression techniques, tools and algorithms. Some of the best lossless compression is achieved by the PAQ family of content mixing algorithms. They however can take inordinately long to work their magic and are not practical in many cases. You can visit Matt Mahoney’s Large Text Compression Benchmark page to see the top performers with detailed analysis in the text compression category. Another useful website is www.compressionratings.com.
Coming back to the topic, I searched Google for a while but could not find an existing open-source tool that does what I was thinking of. Maybe I did not search enough, or maybe I just have an impulsive urge to play around with code. In any case I got around to developing a utility that has the following features/capabilities/characteristics:
- Can split a file in chunks and do parallel compression and decompression of the chunks.
- Can compress/decompress from stdin to stdout.
- It is capable of utilizing multiple algorithms; the user can specify the algorithm [s]he wants to use.
- Currently supports: Gzip, Bzip2, Lzma, PPMD (from the LZMA SDK).
- It has an adaptive mode where it uses multiple algorithms to compress the same chunk and chooses the smallest result.
- Adaptive mode compression is of course the slowest but decompression speed is unaffected even in adaptive mode.
- It provides 14 compression levels automatically choosing various algorithm parameters.
- It uses a simple internal Slab Allocator to speed up memory allocation as the allocation patterns are 100% repeating.
- The allocator can display statistics and detect leaks at the end.
- Has very low metadata overhead.
- Computes and stores a CRC64 checksum for each chunk to do integrity verification.
- Has a pluggable source structure which makes it easy to add new compression algorithm support.
- Bundles a modified LZMA code from the LZMA SDK which adds performance optimizations including SSE intrinsics.
- Tries to overlap compression in threads and reading chunks from file as much as possible while retaining sequence.
- Can scale to all logical cores or a user-specified number of threads.
You can download the current tarball from here. I will register a project on Github soon. This is a work in progress; I have only tested it on Fedora so far, with Illumos/OpenIndiana on the TODO list. There are a bunch of things I’d like to do:
- Add support for filters like BCJ2, XWRT and Delta Encoding.
- Auto-detect data properties to automatically apply some filters, for example using BCJ2 for 8-bit binary data.
- Track and print various compression statistics.
- Add libbsc integration with another adaptive mode that uses libbsc instead of bzip2.
- Possibilities of doing dedup:
  - Use things like Lrzip or srep as a dedup preprocessor.
  - Use chunk checksums to detect duplicates and then compare them. Will it be effective?
  - Can we use some Rabin Fingerprinting derivative to auto-determine the chunk size, which would help dedup?
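To make the last idea concrete, here is a toy content-defined chunker using a Rabin-style rolling hash (a simple polynomial hash, not a true Rabin fingerprint, and all the constants are arbitrary choices of mine). Chunk boundaries are placed where the hash's low bits are zero, so boundaries depend on content rather than offset — an insertion early in the data shifts only nearby chunks, and unchanged regions still produce identical chunks that a checksum index could dedup:

```python
WINDOW = 16            # rolling-hash window in bytes
PRIME = 31             # polynomial base
MOD = 1 << 32          # keep the hash in 32 bits
TOP = pow(PRIME, WINDOW - 1, MOD)  # weight of the byte leaving the window
MASK = (1 << 12) - 1   # cut when low 12 bits are zero: ~4 KiB average chunk

def cdc_chunks(data: bytes, min_len: int = 2048, max_len: int = 1 << 16):
    chunks, start, h = [], 0, 0
    for i in range(len(data)):
        if i - start >= WINDOW:
            # Slide the window: drop the oldest byte's contribution.
            h = (h - data[i - WINDOW] * TOP) % MOD
        h = (h * PRIME + data[i]) % MOD
        length = i - start + 1
        # Cut on a content-defined boundary, or force one at max_len.
        if (length >= min_len and (h & MASK) == 0) or length >= max_len:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Hashing each resulting chunk and indexing the digests would then answer the duplicate-detection question above, at least as an experiment.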
I did some simple measurements using the enwik9 dataset on my Core i5 M430 laptop running at 2.27 GHz with 3GB RAM. It has 2 hyperthreaded cores, giving 4 CPU threads in total. The results are presented in graphical and tabular form below. The next measurement will be using the Linux6 Git repo tarball.
[Graph and table: Enwik9 Compression Results — columns: Cmd, Size in Bytes, Time in Seconds]
The naming scheme is one of two formats:
<Name>_<Mode>_<Compression Level>_<Chunk Size>_t<Number of Threads>
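For the curious, the format shown above is easy to pull apart mechanically. A small parser sketch (the field labels and the example name are my own, just to illustrate the scheme):

```python
import re

# <Name>_<Mode>_<Compression Level>_<Chunk Size>_t<Number of Threads>
NAME_RE = re.compile(
    r"(?P<name>.+)_(?P<mode>[^_]+)_(?P<level>\d+)_"
    r"(?P<chunk>[^_]+)_t(?P<threads>\d+)$"
)

def parse_test_name(s: str) -> dict:
    m = NAME_RE.match(s)
    if not m:
        raise ValueError("unrecognized test name: " + s)
    d = m.groupdict()
    d["level"] = int(d["level"])      # one of the 14 compression levels
    d["threads"] = int(d["threads"])  # thread count after the "t"
    return d
```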
You may notice that some of the tests are run using only 1 or 2 threads. That is simply because my laptop does not have enough RAM to handle the large memory requirements of those higher compression levels. That also shows up in the longer times for the adaptive modes on the graph. My desktop is an older Athlon64 with only 2 cores but more RAM; I have yet to test on it.
There is something interesting to note here. The Xz/Pxz utilities are the slowest among the LZMA compressors; one is better off using p7zip. The single-threaded adaptive mode of pcompress with extreme compression takes longer than pxz, but pxz is using 2 threads. So with 2 threads the extreme adaptive mode takes less time than pxz, while still trying out multiple compression algorithms and giving a better result!