Tag Archives: Delta Compression

Making Delta Compression effective

I blogged about Delta Compression earlier and compared results with and without it in a previous post on Compression Benchmarks. While those benchmark results are now outdated, you may notice that delta compression had little, if any, impact on the final compression ratio. In some cases it even made compression worse.

Recently I sat down to take a closer look at the results and found delta compression to have a negative impact for most compression algorithms, especially at high compression levels. After pondering this for some time it occurred to me that delta compression of similar chunks via Bsdiff was actually eliminating patterns from the data. Compression algorithms look for patterns and collapse them, so eliminating some of those patterns beforehand can hurt them.

Many compression algorithms also use a dictionary (sliding window or otherwise) to store pattern codes so that newly found patterns can be looked up to see if they have occurred in the past. So I thought of adding a constraint: delta compression between two similar chunks is only performed if they are farther apart in the data stream than the dictionary size in use. I call this the TOO-NEAR constraint. It took a little experimentation with various datasets and compression algorithms to determine the optimum TOO-NEAR distance for different compression algorithms. I always used the maximum compression level with the maximum window sizes to ensure the benefit holds in almost all cases.
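
To make the constraint concrete, here is a minimal sketch of the TOO-NEAR check, assuming chunk offsets and lengths are already tracked while scanning the data stream. The names (chunk_t, should_delta, dict_size) are illustrative and are not the actual Pcompress identifiers.

    /*
     * Minimal sketch of the TOO-NEAR constraint. If two similar chunks
     * are closer together than the compressor's dictionary (sliding
     * window) size, the back-end algorithm can already see and collapse
     * the redundancy itself, so we skip delta compression in that case.
     */
    #include <stdint.h>

    typedef struct {
        uint64_t offset;   /* byte offset of the chunk in the data stream */
        uint64_t length;   /* chunk length in bytes */
    } chunk_t;

    static int
    should_delta(const chunk_t *prev, const chunk_t *cur, uint64_t dict_size)
    {
        /* Distance between the end of the earlier chunk and the start
         * of the current one. */
        uint64_t distance = cur->offset - (prev->offset + prev->length);

        return (distance > dict_size);   /* TOO-NEAR constraint */
    }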

Eventually this paid off and Delta Compression started to show a slight benefit. The extent of delta compression is obviously reduced, but that also made it faster while improving the final achievable compression. For details on how similar chunks are identified, see below.

Pcompress 1.4 Released

A new release of Pcompress is now available from here: http://code.google.com/p/pcompress/downloads/list

I have added the following capabilities into this release:

  1. AES is better optimized using VPAES and AES-NI with CPU auto-detection. Since I had to support linking with older versions of OpenSSL, I have copied the relevant implementation files into the Pcompress repo.
  2. Added XSalsa20 from Dan Bernstein's NaCl library (SSE2 and reference code). It is extremely fast and has excellent security properties, as far as I could glean from various articles and posts.
  3. Deduplication performance has been improved by 95% by optimizing the Rabin sliding window chunking algorithm and doubling the chunk hash-table size.
  4. From version 1.3 onwards, Delta Compression actually benefits the overall compression ratio.
  5. All hashes are now better parallelized using Merkle Tree style hashing (see the sketch after this list). This happens only in solid mode, when the entire file is being compressed in a single chunk.
  6. Encryption key length is no longer a compile-time constant. It can be selected at runtime to be 128 or 256 bits.
  7. The Header HMAC now includes Nonce, Salt and Key length properties as well. This was an oversight in the earlier release.
  8. Better cleanup of temporary variables on the stack in various crypto functions.
  9. The Global Deduplication feature is still a work in progress. I have added some code that builds but is not yet functional.
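
As a rough illustration of the Merkle Tree style hashing in item 5, the sketch below splits a buffer into pieces, hashes each piece independently (each piece could be handed to its own thread), and then hashes the concatenated leaf digests to produce the root. It uses OpenSSL's SHA-256 for brevity; the function name, leaf count and the serial loop are illustrative, and this is not the actual Pcompress code.

    /* Merkle-tree style hashing sketch: leaf digests, then a root digest. */
    #include <openssl/sha.h>
    #include <stddef.h>

    #define NUM_LEAVES 4   /* e.g. one leaf per worker thread */

    static void
    merkle_sha256(const unsigned char *buf, size_t len,
                  unsigned char out[SHA256_DIGEST_LENGTH])
    {
        unsigned char leaves[NUM_LEAVES][SHA256_DIGEST_LENGTH];
        size_t piece = len / NUM_LEAVES;

        for (int i = 0; i < NUM_LEAVES; i++) {
            size_t off = (size_t)i * piece;
            size_t n = (i == NUM_LEAVES - 1) ? len - off : piece;

            /* In a parallel implementation each leaf hash runs in its
             * own thread; they are computed serially here for clarity. */
            SHA256(buf + off, n, leaves[i]);
        }

        /* The root digest covers the concatenated leaf digests. */
        SHA256((const unsigned char *)leaves, sizeof (leaves), out);
    }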

Adaptive Run-Length Delta Encoding

I recently added a small new feature to Pcompress that I am calling Adaptive Run-Length Delta Encoding. Delta Encoding comes in many variants. The simplest form is a data transformation that looks at byte values and detects arithmetic progressions in them. If a progression or monotonic series is found, the starting value and the differences between subsequent values are output. Since the difference is constant, we get a repeating sequence that compresses far more easily than the original data.
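
As a toy example of that simplest form, the sketch below replaces every byte with its difference from the previous byte, so a run like 10, 13, 16, 19 becomes 10, 3, 3, 3. This only illustrates the idea and is not the Pcompress implementation.

    #include <stddef.h>
    #include <stdint.h>

    /* Byte-wise delta transform: emit the difference (modulo 256)
     * between each byte and its predecessor. An arithmetic progression
     * turns into a run of identical values, which compresses easily. */
    static void
    delta_encode_bytes(const uint8_t *in, uint8_t *out, size_t len)
    {
        uint8_t prev = 0;

        for (size_t i = 0; i < len; i++) {
            out[i] = (uint8_t)(in[i] - prev);
            prev = in[i];
        }
    }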

A series may exist for individual byte values or for multi-byte integers. I added a simple implementation in Pcompress that tries to identify multi-byte integer sequences in the data. The number of bytes constituting an integer is called the stride length and can range from 3 to 8 bytes. The bytes are packed into a big-endian integer representation. The code initially does a dry run with all the stride lengths to find the one that gives maximum benefit and then uses integer sequences of that stride size to encode the data.
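
A rough sketch of the stride handling follows. It shows the big-endian packing and a crude dry-run score for a single candidate stride; running it for strides 3 through 8 and picking the highest score mirrors the idea described above. The names and the scoring heuristic are illustrative and differ from the actual Pcompress code.

    #include <stddef.h>
    #include <stdint.h>

    /* Pack 'stride' consecutive bytes at 'p' into a big-endian integer. */
    static uint64_t
    pack_be(const uint8_t *p, int stride)
    {
        uint64_t v = 0;

        for (int i = 0; i < stride; i++)
            v = (v << 8) | p[i];
        return (v);
    }

    /*
     * Dry run for one candidate stride: count how often the difference
     * between adjacent stride-sized integers repeats. A higher count
     * suggests more arithmetic sequences at this stride.
     */
    static size_t
    score_stride(const uint8_t *buf, size_t len, int stride)
    {
        size_t hits = 0;
        uint64_t prev = 0;
        int64_t prev_delta = 0;
        int seen = 0;   /* 0 = nothing, 1 = have prev, 2 = have prev delta */

        for (size_t off = 0; off + (size_t)stride <= len; off += (size_t)stride) {
            uint64_t v = pack_be(buf + off, stride);

            if (seen >= 1) {
                int64_t d = (int64_t)(v - prev);

                if (seen == 2 && d == prev_delta)
                    hits++;
                prev_delta = d;
                seen = 2;
            } else {
                seen = 1;
            }
            prev = v;
        }
        return (hits);
    }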

This approach is similar to Bulat Ziganshin’s Delta compression preprocessor. However there are some differences. Firstly, my implementation does not try to do an exhaustive search (using sliding windows etc.) for embedded sequences at odd offsets, so it is quite fast at the cost of lower effectiveness. Secondly, I added a simple twist to the basic delta encoding. Whenever a sequence is detected I apply Run Length Encoding to collapse the sequence into its parameters: Count, Starting Value and Increment Delta. This pre-compresses the data and benefits all the standard compression algorithms. I found this to be most effective when combined with LZP, so using Delta automatically enables LZP pre-compression.
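
The run-length twist can be sketched as below, assuming the stride-sized integers have already been decoded into an array. Each maximal arithmetic run is collapsed into a (count, starting value, increment delta) triple. The struct layout and names are illustrative and the real delta2 encoding in Pcompress may differ; in particular, real code would emit short runs as literals to avoid expansion.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint64_t count;  /* number of integers in the run */
        uint64_t start;  /* first value of the run */
        int64_t  delta;  /* constant increment between consecutive values */
    } rle_delta_t;

    /*
     * Collapse 'n' stride-decoded integers into arithmetic runs. 'runs'
     * is assumed to have room for up to 'n' entries. Returns the number
     * of runs produced. A run of length 1 degenerates into a triple
     * with count 1.
     */
    static size_t
    rle_delta_encode(const uint64_t *vals, size_t n, rle_delta_t *runs)
    {
        size_t nruns = 0, i = 0;

        while (i < n) {
            uint64_t start = vals[i];
            int64_t delta = (i + 1 < n) ? (int64_t)(vals[i + 1] - vals[i]) : 0;
            size_t j = i + 1;

            /* Extend the run while the difference stays constant. */
            while (j < n && (int64_t)(vals[j] - vals[j - 1]) == delta)
                j++;

            runs[nruns].count = j - i;
            runs[nruns].start = start;
            runs[nruns].delta = delta;
            nruns++;
            i = j;
        }
        return (nruns);
    }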

The code is at https://github.com/moinakg/pcompress/tree/master/delta2. This delta encoding is different from the delta compression and similarity matching that I had implemented earlier.

Compression Benchmarks

I spent the last few days running some benchmarks to see how Pcompress in its current incarnation fares in terms of compression performance and effectiveness. This was needed given the extent of changes that have happened. I have tested only compression performance here; I still need to check how decompression and crypto perform.

My test system is my laptop, so the absolute numbers do not mean much; it is the relative numbers comparing different utilities and algorithms that are interesting. My laptop config: Intel Core i5 M430 @ 2.27 GHz (2011 model), 8GB RAM, 5400 RPM HDD (Hitachi HTS725050A9A364). It is an HP Pavilion Dv6 laptop dating back to Oct 2011. My desktop is a 2006-model Athlon 64 X2 with 4GB RAM, so I decided to test on the more capable laptop. I used 4 different datasets: the Linux 3.6 RC2 tarball, the enwik9 corpus, the glibc 2.12.1 tarball, and a tarball containing CentOS 6.0 and 6.2 VMDK images. Except for the VMware images these are mostly different kinds of text, since this is a traditional compression test.

Except where mentioned, all the pcompress tests used a 256MB compression chunk size. Files bigger than 256MB were compressed using multiple threads. Only the Glibc tarball, at 102MB, was compressed with a single pcompress thread: Pcompress detects when the segment size exceeds the file size and reduces it to fit the whole file in a single segment. The actual commands that were executed are listed in the leftmost graph column. The magenta bars show original and compressed sizes in megabytes while the pink lines denote time taken in seconds.

The Pcompress version used was 0.9.1, standard build. The standard utilities were used as distributed with Fedora 16. Some may have reservations that the same compiler flags were not used to rebuild everything; in that case take the numbers with a grain of salt. I will rerun the standard utility tests with custom builds when I get time. I have also left out other parallel tools like Pigz and Pbzip2; I will post those comparisons for the Linux tarball shortly.

Finally, except where mentioned, Pcompress was always run with Chunk-level Deduplication (-D) and LZ-Prediction (-L) enabled. Identity Deduplication always adds value. The VMDK image dataset includes a comparison of compression ratio with and without Deduplication.

Linux 3.6 Git clone Tarball

Of course the Lzma algorithm implementations give the best compression ratio, but they also take more time. As usual the Xz utility performs quite badly and I keep wondering why anyone would use it in practice. I wonder if the XZ-Utils folks are considering performance aspects at all. The parallelized Pxz is a lot better, though I missed running Pxz in extreme mode in this test. I will fill this gap shortly.

What is most interesting, however, is the performance and compression ratio achieved by Pcompress. Even with extreme settings, parallel Pcompress using lzmaMt (multi-threaded lzma) is faster than non-extreme parallel Pxz. Part of the speed comes from the low-level optimizations (like SSE) I have done to the lzma implementation. The original 7-Zip utility performs quite badly with extreme settings (-mx9 -m0fb=128 -m0d=256M) since it is not fully parallelized. The PPMd algorithm in Pcompress is by far the rock star in this test: it gives the best compression ratio while still having awesome speed. PPMd works excellently for text files. The LZFX algorithm is blindingly fast with a decent compression ratio.

Pcompress was run with Deduplication (-D), both with and without Delta Compression (-E). Delta compression at this scale does not seem to add much value.

Glibc 2.12.1 Release Tarball

This test mostly shows single-threaded behavior. The glibc tarball is about 102MB, so with the 256MB Pcompress compression chunk size it fits into a single chunk. For the Pcompress tests only lzmaMt is therefore multithreaded, and only because it uses multiple threads for the match-finder. As you will notice, extreme-level Pcompress with lzmaMt is still significantly faster than both Pxz and 7-Zip with extreme settings (-mx9 -m0fb=128 -m0d=256M).

Once again Delta compression at this scale did not add much value.

The Enwik9 Corpus

This test uses enwik9 (an initial segment of an English Wikipedia XML dump) from http://mattmahoney.net/dc/textdata.html. This is XML text data. Interestingly, with a 256MB chunk size Libbsc (BWT with Quantized Local Frequency Coding and dual LZ-Prediction) gives the best results on XML data, better even than Lzma and PPMd.

So I wanted to see how much better libbsc would work if I put the entire enwik9 file into a single chunk of 1GB. So I ran ‘pcompress -D -c libbsc -l 14 -L -s 1g’ to check. Indeed this produces a smaller 156.3MB (163870946 bytes) file, taking 233.8 seconds. For reference, the older Enwik8 corpus compresses to 20785844 bytes using the same command. This puts us at the 15th position on Matt Mahoney’s Large Text Compression Benchmark, slightly above the original BSC. I have not made any changes to libbsc; Pcompress just links to it as an external dependency. However I have copied the LZP portion and tweaked it a bit. I believe that this result can be improved by tweaking the Rolling Hash function to get actual dedupe out of XML data. At the moment deduplication (-D) is having no impact on Enwik9 as it is not finding any duplicates. You can build Pcompress from source configured with the ‘--enable-debug-stats’ option to check.

Tarball of CentOS 6 and CentOS 6.2 VMDK Images

For this test I did not try all the variations and specifically left out Lzma. For such large multi-gigabyte data there are very few practical use cases where one would spend hours using slow high-compression algorithms for a lower ROI. The objective here is rather to balance compression effectiveness with practical speed.

The distinct benefits of parallelism are clearly visible from these results. With multi-gigabyte data, using multiple threads far outstrips the traditional utilities. Deduplication also ensures that we get smaller compressed files; compare the standalone bzip2 result with the pcompress bzip2 result with Deduplication. Since this is in-chunk dedupe and not global dedupe, the effect of using larger chunks is also shown.

These results are quite interesting and put Pcompress at the top of the heap of general data compression utilities along with Pigz, Pbzip2 etc. However it also provides stronger data integrity checks using SKEIN 256/512 and SHA 256/512 checksums while still maintaining performance.