I recently added a small new feature to Pcompress that I am calling Adaptive Run-Length Delta Encoding. Delta Encoding comes in many variants. The simplest form is a data transformation that looks at byte values and detects arithmetic progressions in them. If a progression or monotonic series is found, the starting value and the differences between subsequent values are output. Since the difference is constant, we get a repeating sequence that compresses more easily than the original data.
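To make the basic idea concrete, here is a minimal Python sketch of byte-wise delta encoding. It is only an illustration and not the Pcompress code (which is written in C); the function names `delta_encode` and `delta_decode` are made up for this example, and differences are taken modulo 256 so that each one fits in a byte.

```python
def delta_encode(data: bytes) -> bytes:
    """Emit the first byte, then the difference between each byte and its predecessor."""
    if not data:
        return b""
    out = bytearray([data[0]])
    for prev, cur in zip(data, data[1:]):
        out.append((cur - prev) & 0xFF)  # wrap around so the delta fits in one byte
    return bytes(out)


def delta_decode(encoded: bytes) -> bytes:
    """Invert delta_encode by accumulating the differences."""
    if not encoded:
        return b""
    out = bytearray([encoded[0]])
    for delta in encoded[1:]:
        out.append((out[-1] + delta) & 0xFF)
    return bytes(out)


# An arithmetic progression 10, 13, 16, ... turns into 10 followed by a run of 3s.
progression = bytes(range(10, 50, 3))
encoded = delta_encode(progression)
```

After the transform, the progression has become one starting byte followed by the same delta repeated, which is the kind of input any general-purpose compressor handles well.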
A series may exist for individual byte values or for multi-byte integers. I added a simple implementation in Pcompress that tries to identify multi-byte integer sequences in the data. The number of bytes constituting an integer is called the stride length and can range from 3 bytes to 8 bytes. The bytes are packed into a big-endian integer representation. The code initially does a dry run with all the stride lengths to find the one that will give the maximum benefit, and then uses integer sequences of that stride size to encode the data.
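The dry run can be pictured roughly as follows. This is a hedged Python sketch rather than the actual Pcompress logic: the scoring rule (bytes covered by runs of at least `min_run` equal differences), the `min_run` threshold, and the names `count_sequence_bytes` and `best_stride` are all assumptions made for illustration.

```python
def count_sequence_bytes(data: bytes, stride: int, min_run: int = 3) -> int:
    """Rough score: bytes covered by runs of at least min_run equal deltas."""
    mask = (1 << (8 * stride)) - 1
    # Pack each aligned group of `stride` bytes into a big-endian integer.
    values = [int.from_bytes(data[i:i + stride], "big")
              for i in range(0, len(data) - stride + 1, stride)]
    covered = 0
    run = 0            # number of consecutive equal deltas seen so far
    prev_delta = None
    for a, b in zip(values, values[1:]):
        delta = (b - a) & mask
        if delta == prev_delta:
            run += 1
        else:
            if run >= min_run:
                covered += (run + 1) * stride  # run deltas span run + 1 values
            run = 1
            prev_delta = delta
    if run >= min_run:
        covered += (run + 1) * stride
    return covered


def best_stride(data: bytes, strides=range(3, 9)):
    """Try every stride and return the best-scoring one, or None if none helps."""
    scores = {s: count_sequence_bytes(data, s) for s in strides}
    best = max(scores, key=scores.get)  # ties go to the smaller stride
    return best if scores[best] > 0 else None


# 64 four-byte big-endian integers forming the progression 1000, 1007, 1014, ...
sample = b"".join((1000 + 7 * i).to_bytes(4, "big") for i in range(64))
chosen = best_stride(sample)
```

Note that this sketch only looks at integers aligned to multiples of the stride from the start of the buffer, and it makes one full pass over the data per candidate stride.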
This approach is similar to Bulat Ziganshin’s Delta compression preprocessor. However, there are some differences. Firstly, my implementation does not try to do an exhaustive search (using sliding windows etc.) for embedded sequences at odd offsets, so it is quite fast at the cost of lower effectiveness. Secondly, I added a simple twist to the basic delta encoding. Whenever a sequence is detected, I apply Run Length Encoding to collapse the sequence into its parameters: Count, Starting Value and Increment Delta. This pre-compresses the data and benefits all the standard compression algorithms. I found this to be most effective when combined with LZP, so using Delta automatically enables LZP pre-compression.
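The run-length twist can be sketched in a few lines of Python. Again this is an illustration under assumptions and not the format Pcompress actually writes: the token tuples, the `min_count` threshold and the function names `rle_delta_encode` and `rle_delta_decode` are hypothetical.

```python
def rle_delta_encode(data: bytes, stride: int, min_count: int = 4):
    """Return tokens: ('run', count, start, delta) or ('lit', raw_bytes)."""
    mask = (1 << (8 * stride)) - 1
    n = len(data) // stride
    vals = [int.from_bytes(data[i * stride:(i + 1) * stride], "big")
            for i in range(n)]
    tokens, lit, i = [], bytearray(), 0
    while i < n:
        count, delta = 1, 0
        if i + 1 < n:
            delta = (vals[i + 1] - vals[i]) & mask
            count = 2
            # Extend the run while the difference stays constant.
            while (i + count < n and
                   (vals[i + count] - vals[i + count - 1]) & mask == delta):
                count += 1
        if count >= min_count:
            if lit:                       # flush pending literal bytes first
                tokens.append(("lit", bytes(lit)))
                lit = bytearray()
            tokens.append(("run", count, vals[i], delta))
            i += count
        else:
            lit += vals[i].to_bytes(stride, "big")
            i += 1
    lit += data[n * stride:]              # trailing bytes shorter than one stride
    if lit:
        tokens.append(("lit", bytes(lit)))
    return tokens


def rle_delta_decode(tokens, stride: int) -> bytes:
    """Expand the tokens back into the original byte stream."""
    mask = (1 << (8 * stride)) - 1
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out += tok[1]
        else:
            _, count, start, delta = tok
            for k in range(count):
                out += ((start + k * delta) & mask).to_bytes(stride, "big")
    return bytes(out)


# 100 four-byte integers in steps of 12, followed by three stray bytes.
sample = b"".join((5000 + 12 * i).to_bytes(4, "big") for i in range(100)) + b"xyz"
tokens = rle_delta_encode(sample, 4)
```

In this example 400 bytes of sequence data shrink to a single (count, start, delta) triple, which shows why the collapsed form helps whatever compressor runs afterwards.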
The code is at https://github.com/moinakg/pcompress/tree/master/delta2. This delta encoding is different from the delta compression and similarity matching that I had implemented earlier.
- Similarity Matching round #3 (moinakg.wordpress.com)