# Making Delta Compression effective

I had blogged about Delta Compression earlier and compared various results with and without delta compression in an earlier post on Compression Benchmarks. While those benchmark results are now outdated you can notice that delta compression had little if any impact on the final compression ratio. It even made compression worse in some cases.

Recently I sat down to take a closer look at the results and found delta compression to have a negative impact for most compression algorithms especially in the high compression levels. After pondering on this for some time it occurred to me that delta compression of similar chunks via Bsdiff was actually eliminating patterns from the data. Compression algorithms look for patterns and collapse them, so eliminating some patterns could impact them negatively.

Many compression algorithms also typically use a dictionary (sliding window or otherwise) to store pattern codes so that newly found patterns can be looked up to see if they has occurred in the past. So if I thought of adding a constraint that delta compression between 2 similar chunks can only occur if they are further apart in the datastream than the dictionary size in use. This can be called as the TOO-NEAR constraint. It took a little experimentation with various datasets and compression algorithms to determine the optimum size of the TOO-NEAR constraint for different compression algorithms. I always used the maximum compression level with the maximum window sizes to ensure benefits are seen in almost all cases.

Eventually this paid off. Delta Compression started to have a slight benefit. Obviously the extent of delta compression is reduced but that made it faster and also improved final achievable compression. For details on how similar chunks are identified see below.

# Pcompress 1.3 released

I have put up a new release of Pcompress on the Google Code download page. This release focusses primarily on performance enhancements across the board and a few bug fixes. The changes are summarized below:

1. Deduplication performance has improved by at least 2.5X as a result of a variety of tweaks to the core chunking code.
• One of the interesting changes is to use a 16-byte SSE register as the sliding window since I am using a window size of 16. This avoids a lot of memory accesses but requires SSE4.
• The perf utility allowed me to see that using the window position counter as a context variable causes a spurious memory store for every byte! Using a local variable allows optimization via a register. This optimization affects the situation where we do not have SSE4.
• Compute the full fingerprint only when at least minimum chunk length bytes have been consumed.
2. Delta Compression performance and effectiveness have both been improved. I have tweaked the minhash approach to avoid storing and using fingerprints. That approach was causing memory write amplification and significant slowdown. Rather I am just treating the raw data as a sequence of 64-bit integers and heapifying them. Bsdiff performance has been improved along with RLE encoding. I also tweaked the matching approach. It now checks for similar blocks that are some distance apart depending on the compression algorithm. This actually causes long range similar blocks to be delta-ed eventually helping the overall compression.
3. One of the big changes is the inclusion of the BLAKE2 checksum algorithm and making it the default. BLAKE2 is one of the highest-performing cryptographic checksums, exceeding even MD5 in performance on 64-bit platforms. It is derived from BLAKE, one of the NIST SHA3 runner ups with a large security margin.
4. I have tweaked Yann Collet’s xxHash algorithm (non-cryptographic hash) to vectorize it and make it work with 32-byte blocks. Performance is improved for both vectorized and non-vectorized versions. I have covered this in detail in my previous post: Vectorizing xxHash for fun and profit.
5. I have tweaked the AES CTR mode implementation to vectorize it. CTR mode encrypts a 16-byte block consisting of a 64-bit nonce or salt value and a 64-bit block counter value concatenated together. This is then XOR-ed with 16 bytes of plaintext to generate 16 bytes of ciphertext. The block counter is then incremented and the process repeated. This XOR handling with 16-bytes can be nicely done in an XMM register. The result is faster even when using unaligned SSE2 loads helped a little with data prefetch instructions.
6. Apart from BLAKE2 I also included Intel’s optimized SHA512 implementation for x86 processors and moved to using SHA512/256. This improves SHA2 performance significantly on x86 platforms.
7. BLAKE2 includes a parallel mode. I also included simple 2-way parallel modes for other hashes including KECCAK when compressing a single file in a single chunk. This is essentially a single-threaded operation so other forms of parallelism need to be employed.
8. With all the vectorization being thrown around with SSE2/3/4 and AVX1/2 versions of various stuff, I have also added runtime CPU feature detection to invoke the appropriate version for the CPU. At least SSE2 capability is assumed. At this point I really have no intention of supporting Pentium and Atom processors! This also requires one to use at least the Gcc 4.4 compiler so that things like SSE4.2 and AVX intrinsics can be compiled even if CPU support for them is not available.
9. In addition to all the above some bug fixes have also gone into this release.

However this is in no way the full gamut of optimizations possible. There are more changes to be done. For example I need to add support for optimized AES GCM mode. This is a block cipher mode of operation which combines encryption and authentication avoiding the need to for a separate HMAC. HMAC is still useful for situations where one may want to authenticate but not encrypt. Deduplication performance can be further improved by at least 2X. The current chunking code has a silly oversight.  HMAC needs to support parallel modes. I also need to enable parallel operation for LZP in single-chunk modes. In addition I want to explore use of GPGPUs and CUDA for hashing, chunking etc.

# Linear Algebra meets Data Compression

I recently introduced another kind of data encoding into Pcompress that leverages a common technique from linear algebra called Matrix Transpose. There are multiple ways of transposing a matrix. The simplest, conceptually is a transformation where columns of the source matrix become rows in the target and vice versa. In math terms if A is the source matrix and A” is the target then transposition boils down to $A[m X n] -> A"[n X m]$ where m and n are dimensions of the 2D matrices.

Now where does this fit into the seemingly far removed world of data compression? Let us consider a dataset which is a sequence of 32-bit numbers. In my previous post on the Adaptive Run-Length Delta Encoding feature I mentioned how it detects sequences of numbers in data which are in simple arithmetic progression. Once an arithmetic progression is found it can be very easily encoded. Now let us assume that we do have a series of numbers but they are not in a simple incrementing series. So Delta Encoding will not be useful there.

As an example let us consider the following series of numbers: 4567, 3479, 9345. Let us also assume that they are stored in the data in big-endian row-major representation. We choose big-endian since that is the common platform-independent way of storing numeric data values. So if we look at the byte stream it will be thus (in hex): 00  00  11  D7  00  00  0D  97  00  00  24  81. Notice that there are a bunch of zeroes but they are mixed up with non-zero bytes. We can look at this data as a 4×3 matrix. 4-byte integers totalling 3 in number: $A[4X3] = \left| \begin{array}{cccc} 00 & 00 & 11 & D7 \\ 00 & 00 & 0D & 97 \\ 00 & 00 & 24 & 81 \end{array} \right|$.

Now looking at the matrix do you see what I see? The zeroes are aligned up in columns. Since the storage in memory is row-major the zeroes are mixed up with the other bytes. Now let us check out the transposed matrix: $A"[3X4] = \left| \begin{array}{ccc} 00 & 00 & 00 \\ 00 & 00 & 00 \\ 11 & 0D & 24 \\ D7 & 97 & 81 \end{array} \right|$. If we store this in memory the byte sequence becomes: 00  00  00  00  00  00  11  0D  24  D7  97  81. Voila, all the zeroes have lined up. Any compression algorithm on the planet will now be able to collapse that sequence of zeroes. This example shows zeroes, but in fact, it can be any repeating byte value.

This technique have been used in other places. For example in HDF5 where they unassumingly call it Shuffle. In Pcompress it is used in 2 places. Once within the Delta2 encoding phase and secondly to encode the Deduplication Index table. So compression ratios are improved across the board. To recover the data it is a simple matter of effecting the inverse transpose.

# Adaptive Run-Length Delta Encoding

I recently added a small new feature into Pcompress that I am calling as Adaptive Run-Length Delta Encoding. Delta Encoding can come in many variants. The simplest form is a type of data transformation that looks at byte values and detects arithmetic progressions in them. If a progression or monotonic series is found then the starting value and differences between subsequent values are output. Since the difference is constant we get a repeating sequence that can be easily compressed as compared to the original data.

A series may exist for individual byte values or multiple byte integers. I added a simple implementation in Pcompress that tries to identify multi-byte integer sequences in the data. The number of bytes constituting an integer is called the stride length and can range from 3 bytes to 8 bytes. The bytes are packed into a big-endian integer representation. The code initially does a dry run with all the stride lengths to find out the one that will give maximum benefit and then uses integers sequences of that stride size to encode the data.

This approach is similar to Bulat Ziganshin’s Delta compression preprocessor. However there are some differences. Firstly my implementation does not try to do an exhaustive search (using sliding windows etc.) for embedded sequences at odd offsets, so it is quite fast at the cost of lower effectiveness. Secondly I added a simple twist to to the basic delta encoding. Whenever a sequence is detected I apply Run Length Encoding to collapse the sequence into it’s parameters: Count, Starting Value and Increment Delta. This pre-compresses the data and benefits all the standard compression algorithms. I found this to be most effective when combined with LZP so using Delta automatically enables LZP pre-compression.

The code is at https://github.com/moinakg/pcompress/tree/master/delta2. This delta encoding is different from the delta compression and similarity matching that I had implemented earlier.