Parallel Compression Revisited

Parallel Compression RevisitedI have recently been looking at compressing backups on the fly. That is I will generate a backup stream and pipe it through a compression utility and then send it to disk. Plain an simple, nothing great about it. However I also wanted to utilize all those cores and hyperthreads lying around to speed up the compression process. In other words I wanted parallel compression and parallel decompression ability.

After looking around I found a bunch of utilities that do this. Among the several I found, notable ones include pigz, pbzip2, pxz, p7zip, libbsc and Nanozip (not opensource). Most of these implement parallel implementations of specific algorithms while trying to be compatible with the existing file formats. For example pbzip2 retains the existing bzip2 file format by creating multiple compression streams within the file. Some of them like pxz would only scale to physical cores and ignore the Hyperthreads.

I however had other ideas. I wanted a technique that was algorithm agnostic and would allow me to leverage multiple algorithms based on the dataset I am trying to compress. The simple way to do this is to break up the data into chunks and compress the chunks in parallel. Decompression can also be done in parallel. This of course affects compression efficiency but provides great speedup on multicore and multisocket boxes. In addition if one has lots of RAM then large chunk sizes can be used to minimize the compression ratio hit as much as possible. It is also possible in this scenario to combine multiple compression algorithms and extract the best result which I call adaptive mode.

Digressing a bit, there is a lot of material around on the net regarding compression techniques, tools and algorithms. Some of the best lossless compression is achieved by the PAQ family of content mixing algorithms. They however can take inordinately long to work their magic and are not practical in many cases. You can visit Matt Mahoney’s Large Text Compression Benchmark page to see the top performers with detailed analysis in the text compression category. Another useful website is

Coming back to the topic, I looked around in Google for a while I could not find an existing opensource tool to do what I was thinking of. Maybe I did not search enough or maybe I just have an impulsive urge to play around with code. In any case I got around to developing a utility that has the following features/capabilities/characteristics:

  • Can split a file in chunks and do parallel compression and decompression of the chunks.
  • Can compress/decompress from stdin to stdout.
  • It is capable of utilizing multiple algorithms, the user can specify the algorithm [s]he wants to use.
  • Currently supports: Gzip, Bzip2, Lzma, PPMD (from the LZMA SDK).
  • It has an adaptive mode where it uses multiple algorithms to compress the same chunk and choose the smallest result.
  • Adaptive mode compression is of course the slowest but decompression speed is unaffected even in adaptive mode.
  • It provides 14 compression levels automatically choosing various algorithm parameters.
  • It uses a simple internal Slab Allocator to speed up memory allocation as the allocation patterns are 100% repeating.
  • The allocator can display statistics and detect leaks at the end.
  • Has very low metadata overhead.
  • Computes and stores a CRC64 checksum for each chunk to do integrity verification.
  • Has a pluggable source structure which makes it easy to add new compression algorithm support.
  • Bundles a modified LZMA code from the LZMA SDK which adds performance optimizations including SSE intrinsics.
  • Tries to overlap compression in threads and reading chunks from file as much as possible while retaining sequence.
  • Can scale to all logical cores or a user-specified number of threads.

You can download the current tarball from here. I will register a project on Github soon. This is a work in progress and I have only tested it on Fedora as yet with Illumos/OpenIndiana in the TODO list. There are a bunch of things I’d like to do:

  • Add support for filters like BCJ2, XWRT and Delta Encoding.
  • Auto-Detect data properties to auto-apply some filters like use BCJ2 for 8-bit binary data.
  • Track and print various compression statistics.
  • Add libbsc integration with another adaptive mode that uses libbsc instead of bzip2.
  • Possibilities of doing dedup:
    • Use things like Lrzip or srep as a dedup preprocessor.
    • Use chunk checksums to detect duplicates and then compare them. Will it be effective ?
    • Can we use some Rabin Fingerprinting derivative to auto-determine chunk size which will help dedup ?

I did some simple measurements using the enwik9 dataset on my Core i5 laptop with M430 processor running at 2.27 GHz and 3GB RAM. It’s got 2 hyperthreaded cores giving 4 CPU threads in total. The result is presented in graphical and tabular way below. The next measurement will be using the Linux6 Git repo tarball.

Enwik9 Compression Results
Cmd Size in Bytes Time in Seconds
Uncompressed 1000000000 0
gzip_6 323742884 76.41
gzip_9 322591997 93.81
pigz_6 323973501 31.41
pigz_9 322819489 35.29
pcompress_gzip_6_16m 324016149 32.16
pcompress_gzip_6_32m 323974316 31.4
pcompress_gzip_9_16m 322856796 38.03
pcompress_gzip_9_64m 322801875 38.24
bzip2_9 253977891 199.87
pbzip2_9 254067824 80.08
pcompress_bzip2_9_16m 254282504 86.45
pcompress_bzip2_9_64m 254032393 82.08
p7zip_6 227905649 838.94
p7zip_9 213336010 1177.47
xz_6 230153052 1184.77
xz_9 213370900 1645.03
pxz_9 215675816 1287.17
pcompress_lzma_9_32m 193378238 132.06
pcompress_lzma_9_64m 221297989 527.04
pcompress_lzma_10_64m 187103881 154.49
pcompress_lzma_14_64m_t2 184876720 265.52
pcompress_adapt_6_32m 232115301 477.68
pcompress_adapt_9_16m 233419207 470
pcompress_adapt_9_64m 227271086 541.82
pcompress_adapt2_9_64m_t2 190955872 899.78
pcompress_adapt_10_64m 227271086 543.17
pcompress_adapt2_14_64m_t2 184876720 1284.3
pcompress_adapt2_14_128m_t1 184552964 2105.52
pcompress_adapt2_14_500m_t1 183656405 2228.429
pcompress_adapt2_14_900m_t1 183575811 2212.369

Naming Scheme is one the two formats:
<Name>_<Compression level>
<Name>_<Mode>_<Compression Level>_<Chunk Size>_t<Number of Threads>

If you notice some of the tests are run using only 1 thread or 2 threads. That is simply because my laptop does not have enough RAM to handle the large memory requirements of those higher compression levels. That also reflects on the longer times taken for the adaptive modes on the graph. My desktop is an older Athlon64 having only 2 cores but more RAM. I am yet to test on that.

There is something interesting to note here. The Xz/Pxz utilities are slowest among the LZMA compressors. One is better off using p7zip. The single threaded adaptive mode of pcompress with extreme compression takes longer than pxz but pxz is using 2 threads. So using 2 threads the extreme adaptive mode will take less time than pxz while still trying out multiple compression algorithms and giving a better result!


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s