I had blogged about Delta Compression earlier and compared various results with and without delta compression in an earlier post on Compression Benchmarks. While those benchmark results are now outdated you can notice that delta compression had little if any impact on the final compression ratio. It even made compression worse in some cases.
Recently I sat down to take a closer look at the results and found delta compression to have a negative impact for most compression algorithms especially in the high compression levels. After pondering on this for some time it occurred to me that delta compression of similar chunks via Bsdiff was actually eliminating patterns from the data. Compression algorithms look for patterns and collapse them, so eliminating some patterns could impact them negatively.
Many compression algorithms also typically use a dictionary (sliding window or otherwise) to store pattern codes so that newly found patterns can be looked up to see if they has occurred in the past. So if I thought of adding a constraint that delta compression between 2 similar chunks can only occur if they are further apart in the datastream than the dictionary size in use. This can be called as the TOO-NEAR constraint. It took a little experimentation with various datasets and compression algorithms to determine the optimum size of the TOO-NEAR constraint for different compression algorithms. I always used the maximum compression level with the maximum window sizes to ensure benefits are seen in almost all cases.
Eventually this paid off. Delta Compression started to have a slight benefit. Obviously the extent of delta compression is reduced but that made it faster and also improved final achievable compression. For details on how similar chunks are identified see below.
- Similarity Matching round #3 (moinakg.wordpress.com)