Tag Archives: Linear algebra

Linear Algebra meets Data Compression

I recently introduced another kind of data encoding into Pcompress that leverages a common technique from linear algebra called Matrix Transpose. There are multiple ways of transposing a matrix. The simplest, conceptually is a transformation where columns of the source matrix become rows in the target and vice versa. In math terms if A is the source matrix and A” is the target then transposition boils down to A[m X n] -> A"[n X m] where m and n are dimensions of the 2D matrices.

Now where does this fit into the seemingly far removed world of data compression? Let us consider a dataset which is a sequence of 32-bit numbers. In my previous post on the Adaptive Run-Length Delta Encoding feature I mentioned how it detects sequences of numbers in data which are in simple arithmetic progression. Once an arithmetic progression is found it can be very easily encoded. Now let us assume that we do have a series of numbers but they are not in a simple incrementing series. So Delta Encoding will not be useful there.

As an example let us consider the following series of numbers: 4567, 3479, 9345. Let us also assume that they are stored in the data in big-endian row-major representation. We choose big-endian since that is the common platform-independent way of storing numeric data values. So if we look at the byte stream it will be thus (in hex): 00  00  11  D7  00  00  0D  97  00  00  24  81. Notice that there are a bunch of zeroes but they are mixed up with non-zero bytes. We can look at this data as a 4×3 matrix. 4-byte integers totalling 3 in number: A[4X3] = \left| \begin{array}{cccc} 00 & 00 & 11 & D7 \\ 00 & 00 & 0D & 97 \\ 00 & 00 & 24 & 81 \end{array} \right|.

Now looking at the matrix do you see what I see? The zeroes are aligned up in columns. Since the storage in memory is row-major the zeroes are mixed up with the other bytes. Now let us check out the transposed matrix: A"[3X4] = \left| \begin{array}{ccc} 00 & 00 & 00 \\ 00 & 00 & 00 \\ 11 & 0D & 24 \\ D7 & 97 & 81 \end{array} \right|. If we store this in memory the byte sequence becomes: 00  00  00  00  00  00  11  0D  24  D7  97  81. Voila, all the zeroes have lined up. Any compression algorithm on the planet will now be able to collapse that sequence of zeroes. This example shows zeroes, but in fact, it can be any repeating byte value.

This technique have been used in other places. For example in HDF5 where they unassumingly call it Shuffle. In Pcompress it is used in 2 places. Once within the Delta2 encoding phase and secondly to encode the Deduplication Index table. So compression ratios are improved across the board. To recover the data it is a simple matter of effecting the inverse transpose.