Making Delta Compression effective

I had blogged about Delta Compression earlier and compared various results with and without delta compression in an earlier post on Compression Benchmarks. While those benchmark results are now outdated you can notice that delta compression had little if any impact on the final compression ratio. It even made compression worse in some cases.

Recently I sat down to take a closer look at the results and found delta compression to have a negative impact for most compression algorithms especially in the high compression levels. After pondering on this for some time it occurred to me that delta compression of similar chunks via Bsdiff was actually eliminating patterns from the data. Compression algorithms look for patterns and collapse them, so eliminating some patterns could impact them negatively.

Many compression algorithms also typically use a dictionary (sliding window or otherwise) to store pattern codes so that newly found patterns can be looked up to see if they has occurred in the past. So if I thought of adding a constraint that delta compression between 2 similar chunks can only occur if they are further apart in the datastream than the dictionary size in use. This can be called as the TOO-NEAR constraint. It took a little experimentation with various datasets and compression algorithms to determine the optimum size of the TOO-NEAR constraint for different compression algorithms. I always used the maximum compression level with the maximum window sizes to ensure benefits are seen in almost all cases.

Eventually this paid off. Delta Compression started to have a slight benefit. Obviously the extent of delta compression is reduced but that made it faster and also improved final achievable compression. For details on how similar chunks are identified see below.


4 thoughts on “Making Delta Compression effective

  1. slm

    I am curious whether your adaptive delta encoding could help compress electrophysiological data, which consists of long sequences of 2 or 4 byte integers (in some implementations, floats) for some number of channels (depending on patient ECoG grid density or number of sites on depth-electrode.) At high sampling rates (20-40kHz) successive samples are similar. Some formats interleave channels, so subsequent samples are (nChannels-1)*nBytesPerSample bytes apart, however it is trivial for us to pre-process and serialize.

    Finally my question: can your implementation (-P) in pcompress work with deltas between multi-byte samples?

    1. moinakg Post author

      Adaptive Delta looks for arithmetic sequences in multi-byte numbers of the following lengths: 3, 5, 7 and 8 bytes. That is subsequent numbers are monotonically increasing or decreasing by a constant amount. If you are looking sampling data from physiological measurements, they may or may not contain sequences. So without testing I am not sure what results this approach will yield on your data.
      However, since it is possible for subsequent samples to be similar the numeric representations will also share bytes in common. If a necessarily large set of numbers are found which have columns of bytes in common, then it is possible to consider the sequence as a rectangular matrix with columnar similarity. Then transposing such a matrix will line up columnar data into rows, which will allow any compression algorithm (including simple RLE) to compress the data effectively:
      I have been thinking about auto-detecting such data and plan to work on adding this into the delta feature. However it is easy enough to add an explicit option to perform this transpose if data is known to contain similar samples of specific byte lengths.

      1. slm

        Matrix transpose is a nice idea and I suspect would give results for us if set to 2 bytes. Why are the lengths set to 3, 5, 7 and 8 for adaptive delta? Will this still give good results with data which are 2 or 4 byte numbers? Forgive my naivete. I am happy to test versions with 2 byte alignments to see how it performs. Straight (but terribly slow) xz on the serialized signals (chan1[1:n],chan2[1:n] instead of interleaved channels) gives about 40% compression (to 60% of original size) but I think we could do much better… without needing hours or many gigs ram.

      2. moinakg Post author

        These numbers are chosen based on multiples. If there are sequences of 2 byte numbers, then we will still find sequences if 8 bytes are taken at a time. So there is no need to check for every byte count. Just checking for the higher multiple is sufficient. This improves performance. Xz is dramatically slow. The original 7zip is much faster. I will add a transpose option and post here. In the meantime you can also check if zpaq is faster and/or provides better reduction.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s