# Tag Archives: Pcompress

Posts related to the Pcompress utility.

# Galois Fields @ USENIX FAST 2013

USENIX sponsors an annual storage-related conference called File and Storage Technologies, or FAST for short. It has been running for more than a decade, and this year's FAST has lined up an impressive list of sessions. Of all those, one session caught my attention: https://www.usenix.org/conference/fast13/screaming-fast-galois-field-arithmetic-using-sse2-extensions

A Galois Field, or Finite Field, is a construct from Abstract Algebra that raises its spooky head wherever Cryptography, Error Correcting Codes, Rabin Deduplication, number theory and related topics are concerned. It is interesting to see EMC having done work on speeding up GF computations on modern x86 CPUs using the SSE SIMD instruction set. However, it does not appear to be entirely new work; there has already been work done on this front. For example, look at the following links:
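
To give a flavour of what GF arithmetic looks like in plain scalar code, here is a minimal textbook shift-and-reduce multiplication in GF(2^8) using the AES reduction polynomial. This is purely illustrative and has nothing to do with the SSE-optimized approach the paper describes:

```c
#include <stdint.h>
#include <stdio.h>

/* Multiply two elements of GF(2^8) modulo the AES polynomial
 * x^8 + x^4 + x^3 + x + 1 (0x11B), using shift-and-reduce. */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
	uint8_t p = 0;

	while (b) {
		if (b & 1)
			p ^= a;            /* "add" (XOR) the current multiple */
		uint8_t hi = a & 0x80;
		a <<= 1;
		if (hi)
			a ^= 0x1B;         /* reduce modulo the field polynomial */
		b >>= 1;
	}
	return (p);
}

int main(void)
{
	printf("0x57 * 0x13 = 0x%02X\n", gf256_mul(0x57, 0x13));
	return (0);
}
```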

Optimizing Forward ECC Codes using SIMD instructions

Impact of Intel’s New Instruction Sets on Software Implementation of GF(2)[x] Multiplication

gf2x – A library for multiplying polynomials over the binary field

EMC is obviously interested in speeding up Data Deduplication in its Data Domain products, so we should see subsequent releases of Data Domain coming with this optimization. It will be interesting to see what other work the EMC paper cites when it is eventually made public.

Of the above three links the first one is of most interest to me. I am looking to implement error correcting codes in Pcompress and that article’s author provides a nice SIMD-optimized ECC library called FECpp.

# Pcompress 1.2 Released

Pcompress 1.2 is now available and can be grabbed from here: https://code.google.com/p/pcompress/downloads/list. This is a major release containing a large number of bug fixes and improvements: performance and stability improvements, code cleanup, resolution of corner cases and so on. The biggest new additions in this release are the new Delta2 Encoding and support for the Keccak message digest. Keccak has been announced as the NIST SHA3 standard secure hash. The SIMD (SSE) optimized variant of Keccak runs faster than SHA256 on x86 platforms. However, it is still slower than SKEIN, so SKEIN remains the default hash algorithm for data integrity verification. In addition, Deduplication is now significantly faster.

Delta2 Encoding, as I had mentioned in a previous post, probes for embedded tables of numeric sequences in the data and encodes them by collapsing each arithmetic sequence into its parameters: starting value, increment/decrement and number of terms. This generally provides benefits across different kinds of data and can be combined with LZP preprocessing to enable the final compression algorithm to achieve the maximum compression ratio, beyond what it can normally achieve. This encoding runs very fast and still manages to detect a good number of numeric sequences if they are present in the data.

I have extended the statistics mode to display additional data including throughput figures. Here is an output from a compression run of the silesia corpus test data set:

## CPU: Core i5 430M
## RAM: 8GB
## Compiler: Gcc 4.7.2
##
## Compression
##
./pcompress -D -c lz4 -l1 -L -P -s200m silesia.tar
Scaling to 1 thread
Checksum computed at 241.321 MB/s
Original size: 206612480, blknum: 46913
Number of maxlen blocks: 0
Total Hashtable bucket collisions: 17225
Merge count: 46750
Deduped size: 206197375, blknum: 242, delta_calls: 0, delta_fails: 0
Chunking speed 112.189 MB/s, Overall Dedupe speed 100.880 MB/s
LZP: Insize: 206196371, Outsize: 192127556
LZP: Processed at 55.839 MB/s
DELTA2: srclen: 192127556, dstlen: 191899643
DELTA2: Processed at 382.530 MB/s
Chunk compression speed 207.908 MB/s
##
## Decompression
##
./pcompress -d silesia.tar.pz silesia.tar.1
Scaling to 4 threads
Chunk decompression speed 383.488 MB/s
DELTA2: Decoded at 3030.724 MB/s
Checksum computed at 244.235 MB/s

ls -l silesia.tar.pz
-rw-rw-r--. 1 testuser testuser 99115899 Jan  5 21:36 silesia.tar.pz

Note that these are single-threaded performance figures. The entire file is being compressed in a single chunk. The default checksum is SKEIN. Look at the decoding speed of the Delta2 implementation: it is close to a 3 GB/s rate. Next let's check the performance of SSE-optimized Keccak:

./pcompress -D -c lz4 -l1 -P -S KECCAK256 -s200m silesia.tar
Scaling to 1 thread
Checksum computed at 143.904 MB/s
Original size: 206612480, blknum: 46913
Number of maxlen blocks: 0
Total Hashtable bucket collisions: 17225
Merge count: 46750
Deduped size: 206197375, blknum: 242, delta_calls: 0, delta_fails: 0
Chunking speed 111.601 MB/s, Overall Dedupe speed 100.352 MB/s
DELTA2: srclen: 206196371, dstlen: 201217172
DELTA2: Processed at 360.383 MB/s
Chunk compression speed 213.226 MB/s

ls -l silesia.tar.pz
-rw-rw-r--. 1 testuser testuser 100204566 Jan  5 21:34 silesia.tar.pz

This time I left out LZP to show the reduction using Delta2 alone. As you can see, combining LZP and Delta2 gives the greatest reduction. Also see how much slower Keccak is compared to SKEIN. Note that I am using an optimized 64-bit assembly implementation of SKEIN, but it does not use SSE, whereas Keccak does.

Next let's have a look at a dataset that has lots of embedded numeric data. I used Global Topographic Elevation map data from USGS:

#
# Compression
#
./pcompress -c lz4 -l1 -P -s100m e020n40.tar
Scaling to 1 thread
Checksum computed at 237.584 MB/s
DELTA2: srclen: 86599680, dstlen: 43707440
DELTA2: Processed at 279.484 MB/s
Chunk compression speed 211.112 MB/s

ls -l e020n40.tar.pz
-rw-rw-r--. 1 testuser testuser 35360062 Jan  5 21:46 e020n40.tar.pz
#
# Decompression
#
./pcompress -d e020n40.tar.pz e020n40.tar.1
Scaling to 4 threads
Chunk decompression speed 622.394 MB/s
DELTA2: Decoded at 1971.282 MB/s
Checksum computed at 246.015 MB/s

This time I left out dedupe and LZP. The Delta2 benefits are much larger since there is a lot of numeric data. Also, because there are many more Delta spans in the encoded dataset, the decoding speed is lower. It still decodes at about 1.9 GB/s, however.

As can be seen, Delta2 performance is on par with LZ4, and it can be used to improve LZ4 results with very little overhead.

# GCC Link time Optimizations need some Salt

I wanted to make use of the much-talked-about feature in Gcc 4.5+ called Link Time Optimization (LTO) in Pcompress. Link time optimization promises to bring advanced optimizations across compilation contexts. For example, if you declare a function as "inline" in a source file and call it from other places in the same source file, then all the calls will be replaced with the function body itself. However, if you call the same function from another source file which is compiled separately, then inlining will not happen across the files. Earlier Gcc support for this was called whole-program optimization, but it was cumbersome and in some cases impossible to use. LTO adds a repeat optimization and re-compilation pass during linking of the final executable, providing the ability to do such cross-file optimizations elegantly.

I am using Fedora 16, which has Gcc 4.6.3 with support for LTO, so it was a simple matter of tweaking the makefile. However, after adding the LTO flag something felt not quite right: the program felt a tad slower. When doing performance tweaks, "feel" is the last thing you want to depend on, so I added simple timing measurements to various modules within Pcompress. They record the starting and ending monotonic wall-clock times in milliseconds for processing a chunk of data and compute the throughput in MB per second. What I saw after that was quite strange.
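
The measurements are nothing fancy. A minimal sketch of the idea, using clock_gettime() with CLOCK_MONOTONIC (illustrative code, not the exact utility functions in Pcompress):

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Monotonic wall-clock timestamp in milliseconds. */
static double monotonic_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (ts.tv_sec * 1000.0 + ts.tv_nsec / 1000000.0);
}

/* Dummy stand-in for a processing stage (compression, dedupe etc.). */
static void process_chunk(unsigned char *buf, size_t len)
{
	memset(buf, 0, len);
}

int main(void)
{
	static unsigned char buf[64 * 1024 * 1024];
	double start, elapsed_ms, mbps;

	start = monotonic_ms();
	process_chunk(buf, sizeof (buf));
	elapsed_ms = monotonic_ms() - start;

	mbps = (sizeof (buf) / (1024.0 * 1024.0)) / (elapsed_ms / 1000.0);
	printf("Processed at %.3f MB/s\n", mbps);
	return (0);
}
```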

LTO appeared to actually reduce throughput by as much as 60%! My laptop had the following specs: Core i5 430M 2.27GHz, 8GB RAM, with the native Gcc 4.6.3 build on Fedora. Suspecting something with the distro or the specific Gcc build, I looked at another laptop I had: an AMD Piledriver 1.9 GHz with 4GB RAM running Linux Mint 13. It also had Gcc 4.6.3, but of course a different Ubuntu-derived build. This also produced the same results: around a 50% performance drop with LTO enabled. The outputs are below.

Output of “Normal” build without LTO

time ./pcompress -D -c lzmaMt -l14 -L -P -s110m w020n40.tar
Scaling to 2 threads
Original size: 86528000, blknum: 12726
Number of maxlen blocks: 116
Total Hashtable bucket collisions: 4477
Merge count: 12320
Deduped size: 85195878, blknum: 562, delta_calls: 0, delta_fails: 0
Dedupe speed 107.758 MB/s
LZP: Insize: 85193594, Outsize: 35087799
LZP: Processed at 178.023 MB/s
DELTA2: srclen: 35087799, dstlen: 35086691
DELTA2: Processed at 247.827 MB/s
Chunk compression speed 1.513 MB/s

real    0m24.058s
user    0m40.485s
sys     0m0.448s

Output of build with LTO enabled

time ./pcompress -D -c lzmaMt -l14 -L -P -s110m w020n40.tar
Scaling to 2 threads
Original size: 86528000, blknum: 12726
Number of maxlen blocks: 116
Total Hashtable bucket collisions: 4477
Merge count: 12320
Deduped size: 85195878, blknum: 562, delta_calls: 0, delta_fails: 0
Dedupe speed 32.146 MB/s
LZP: Insize: 85193594, Outsize: 35087799
LZP: Processed at 67.585 MB/s
DELTA2: srclen: 35087799, dstlen: 35086691
DELTA2: Processed at 84.044 MB/s
Chunk compression speed 0.828 MB/s

real    0m45.138s
user    1m17.835s
sys     0m0.612s

The pcompress invocations above use LZMA compression in extreme mode (-l14) and also enable Deduplication, LZP and Delta2 encoding. The tarfile "w020n40.tar" contains Global Topographic Data (a Digital Elevation Model) from USGS. I used that dataset since it contains a lot of embedded tables of repeating numeric data that Delta Encoding and LZP (as you can see above, mostly LZP in this case) can detect and collapse.

The performance drop with LTO is drastic to say the least. To rule out any mistake in my throughput computation I also included the “time” command and the performance difference is clear in the time output as well. Similar results were visible on the AMD laptop running Linux Mint 13. In addition to LTO I was passing the following flags to the compiler:

-m64 -msse3 -c  -O3  -ftree-vectorize -floop-interchange -floop-block

I then experimented with omitting those flags and reducing the optimization level to -O2, but to no avail. I even tried "-fno-inline-functions", thinking excessive inlining might be causing loops to overflow the caches. But that produced the same results. LTO kept churning out lower performance numbers regardless of what I did. Subsequently I tried Gcc 4.7.2, built from upstream sources, and repeated my experiments with exactly the same results as before! Something was broken.

So my next step was to ask for help from that gem of a utility on Linux called simply Perf. It can profile the system and applications and also provide CPU performance counter metrics. This tool is similar to some OpenSolaris tools like cpustat, intrstat etc. Perf has a whole bunch of features, but I looked at two capabilities: I used "perf record" to collect profiling data from a pcompress run and "perf stat -d" to dump detailed performance counter statistics. The outputs of perf stat were interesting but provided little clue as to the root cause:

Output of “perf stat -d” for normal non-LTO build

Performance counter stats for './pcompress -D -c lzmaMt -l14 -L -P -s110m w020n40.tar':

46724.903534 task-clock                #    1.682 CPUs utilized
21,763 context-switches          #    0.466 K/sec
1,800 CPU-migrations            #    0.039 K/sec
49,634 page-faults               #    0.001 M/sec
97,919,627,864 cycles                    #    2.096 GHz                     [40.01%]
60,976,690,063 stalled-cycles-frontend   #   62.27% frontend cycles idle    [39.99%]
42,558,445,705 stalled-cycles-backend    #   43.46% backend  cycles idle    [39.97%]
83,132,473,549 instructions              #    0.85  insns per cycle
#    0.73  stalled cycles per insn [49.93%]
10,757,966,536 branches                  #  230.241 M/sec                   [49.95%]
674,594,850 branch-misses             #    6.27% of all branches         [50.01%]
20,128,203,574 L1-dcache-loads           #  430.781 M/sec                   [50.05%]
819,620,127 L1-dcache-load-misses     #    4.07% of all L1-dcache hits   [50.02%]
470,081,090 LLC-loads                 #   10.061 M/sec                   [40.06%]
291,066,964 LLC-load-misses           #   61.92% of all LL-cache hits    [39.99%]

27.781254809 seconds time elapsed

Output of “perf stat -d” for LTO enabled build

Performance counter stats for './pcompress -D -c lzmaMt -l14 -L -P -s110m w020n40.tar':

80593.375461 task-clock                #    1.759 CPUs utilized
26,507 context-switches          #    0.329 K/sec
889 CPU-migrations            #    0.011 K/sec
2,696 page-faults               #    0.033 K/sec
169,713,512,965 cycles                    #    2.106 GHz                     [40.02%]
116,359,748,762 stalled-cycles-frontend   #   68.56% frontend cycles idle    [40.02%]
44,479,101,832 stalled-cycles-backend    #   26.21% backend  cycles idle    [40.04%]
147,677,082,105 instructions              #    0.87  insns per cycle
#    0.79  stalled cycles per insn [50.04%]
12,176,677,580 branches                  #  151.088 M/sec                   [50.05%]
710,765,008 branch-misses             #    5.84% of all branches         [50.01%]
73,456,165,490 L1-dcache-loads           #  911.442 M/sec                   [49.98%]
576,238,614 L1-dcache-load-misses     #    0.78% of all L1-dcache hits   [50.00%]
361,655,574 LLC-loads                 #    4.487 M/sec                   [39.94%]
236,828,595 LLC-load-misses           #   65.48% of all LL-cache hits    [39.97%]

45.811620815 seconds time elapsed

A few of the differences are of particular interest. As you can see, LLC or Last Level Cache (L3 cache) misses increased by about 5% with LTO, while the L1 data cache hit ratio actually increased. The frontend (primarily instruction fetch and decode) had more stalled cycles with LTO. While a stall can be due to many reasons, coupled with the higher LLC miss ratio it appears that more time was spent waiting to fetch data from memory.

Side note: you may be wondering why the LLC miss ratio is very high even in the normal non-LTO case. Does that point to severe inefficiencies within Pcompress? Not really; it is LZMA that is the culprit here. LZMA's memory access pattern is cache-unfriendly, a cost one has to bear to get its almost unbeatable compression ratio. For comparison, I have included the perf stat output for LZ4 compression in the Appendix below.

The next item on the agenda was to capture profile data via "perf record" and view it via "perf report". Perf report typically lists the functions or modules within an executable along with the percentage of time they hogged during execution, sorted in descending order. I am reproducing the top six items from each profile below.

Perf report snippet for normal non-LTO case

Samples: 172K of event 'cycles', Event count (approx.): 88229375574
50.31%  pcompress  pcompress              [.] GetMatchesSpec1
32.22%  pcompress  pcompress              [.] LzmaEnc_CodeOneBlock
4.02%  pcompress  pcompress              [.] MixMatches3
2.61%  pcompress  pcompress              [.] GetHeads4b
1.63%  pcompress  pcompress              [.] dedupe_compress
1.45%  pcompress  pcompress              [.] MatchFinderMt_GetMatches

Perf report snippet for LTO build

Samples: 326K of event 'cycles', Event count (approx.): 169052159163
39.39%  pcompress  pcompress              [.] GetMatchesSpec1
29.70%  pcompress  pcompress              [.] GetOptimum.4390
4.91%  pcompress  pcompress              [.] LitEnc_GetPriceMatched.4145
3.22%  pcompress  pcompress              [.] MixMatches3
2.87%  pcompress  pcompress              [.] dedupe_compress
2.72%  pcompress  pcompress              [.] GetHeads4b.5262

We can immediately notice a few differences. The LTO-enabled build lists a few new symbols that are not present in the non-LTO case. These are actually functions within the LZMA implementation; the numeric suffix on the function names is an LTO artifact. Why are these functions not showing up in the non-LTO build? Let's check the different binaries using objdump.

Objdump output for normal non-LTO binary

objdump -t pcompress | grep GetOptimum
[pcompress]$

Objdump output for LTO enabled binary

objdump -t pcompress | grep GetOptimum
00000000004284c4 g     F .text	00000000000004d5        .hidden GetOptimumFast.4368
000000000042a351 g     F .text	0000000000002412        .hidden GetOptimum.4390

Objdump output for non-LTO DEBUG binary (No Optimizations)

objdump -t pcompress | grep GetOptimum
000000000040fd05 l     F .text	0000000000002570        GetOptimum
0000000000412275 l     F .text	0000000000000515        GetOptimumFast

Something jumps out as obvious now. The GetOptimum function is not visible in the optimized non-LTO build but is visible in the DEBUG build. This is a static function in LzmaEnc.c and is large, but it still gets inlined in the normal optimized build. LTO, however, is actually preventing some useful inlining from happening. I inspected the disassembly using "objdump -d" to verify that. In the LzmaEnc.c source file one of the call sequences is thus:

LzmaEnc_CodeOneMemBlock -> LzmaEnc_CodeOneBlock -> GetOptimum

In the normal non-LTO optimized build LzmaEnc_CodeOneMemBlock() is a single large function with the other functions above inlined and further optimized. In the LTO build the call sequence is thus:

LzmaEnc_CodeOneMemBlock -> LzmaEnc_CodeOneBlock.4411 ->
LzmaEnc_CodeOneBlock.part.8.4405 -> GetOptimum.4390

No wonder this causes a major performance drop, since inlining in turn enables a series of other optimizations like Value Range Propagation and Code Motion, which LTO is preventing by preventing the inlining. So the lesson of the day is to take LTO with a spoonful of salt. Do not just use it and be happy; actually benchmark and verify whether you are benefiting from it or not. We should see some improvement in this space with Gcc 4.8.

Appendix

Output of perf stat -d for LZ4 compression method

Performance counter stats for './pcompress -D -c lz4 -l14 -L -P -s110m w020n40.tar':

3943.008621 task-clock                #    0.996 CPUs utilized
440 context-switches          #    0.112 K/sec
9 CPU-migrations            #    0.002 K/sec
2,179 page-faults               #    0.553 K/sec
8,275,383,339 cycles                    #    2.099 GHz                     [39.97%]
3,510,592,770 stalled-cycles-frontend   #   42.42% frontend cycles idle    [39.91%]
2,314,266,052 stalled-cycles-backend    #   27.97% backend  cycles idle    [39.93%]
10,907,435,475 instructions              #    1.32  insns per cycle
#    0.32  stalled cycles per insn [49.95%]
1,754,196,387 branches                  #  444.888 M/sec                   [49.93%]
60,384,786 branch-misses             #    3.44% of all branches         [50.05%]
2,588,170,552 L1-dcache-loads           #  656.395 M/sec                   [50.10%]
196,213,428 L1-dcache-load-misses     #    7.58% of all L1-dcache hits   [50.17%]
57,790,609 LLC-loads                 #   14.656 M/sec                   [40.14%]
2,870,453 LLC-load-misses           #    4.97% of all LL-cache hits    [40.11%]

3.957535121 seconds time elapsed

The interesting thing to note here is the LLC miss ratio. Due to its near-linear access pattern, LZ4 makes good use of the cache. This is one of the reasons why it is so fast. Of course LZ4 cannot even think of matching LZMA in terms of compression ratio, but then, that is not its intent either. In addition, it also shows that LZP and my Dedupe and Delta2 Encoding implementations are cache-efficient as well.

# Linear Algebra meets Data Compression

I recently introduced another kind of data encoding into Pcompress that leverages a common technique from linear algebra called the Matrix Transpose. There are multiple ways of transposing a matrix. The simplest, conceptually, is a transformation where columns of the source matrix become rows in the target and vice versa. In math terms, if $A$ is the source matrix and $A^{T}$ is the target, then transposition boils down to $A[m \times n] \rightarrow A^{T}[n \times m]$, where $m$ and $n$ are the dimensions of the 2D matrices.

Now where does this fit into the seemingly far removed world of data compression? Let us consider a dataset which is a sequence of 32-bit numbers. In my previous post on the Adaptive Run-Length Delta Encoding feature I mentioned how it detects sequences of numbers in data which are in simple arithmetic progression. Once an arithmetic progression is found it can be very easily encoded. Now let us assume that we do have a series of numbers but they are not in a simple incrementing series. So Delta Encoding will not be useful there.

As an example let us consider the following series of numbers: 4567, 3479, 9345. Let us also assume that they are stored in the data in big-endian, row-major representation. We choose big-endian since that is the common platform-independent way of storing numeric data values. So if we look at the byte stream it will be thus (in hex): 00  00  11  D7  00  00  0D  97  00  00  24  81. Notice that there are a bunch of zeroes but they are mixed up with non-zero bytes. We can look at this data as a 3×4 matrix: three 4-byte integers, one integer per row: $A[3 \times 4] = \left| \begin{array}{cccc} 00 & 00 & 11 & D7 \\ 00 & 00 & 0D & 97 \\ 00 & 00 & 24 & 81 \end{array} \right|$.

Now, looking at the matrix, do you see what I see? The zeroes are lined up in columns. Since the storage in memory is row-major the zeroes are mixed up with the other bytes. Now let us check out the transposed matrix: $A^{T}[4 \times 3] = \left| \begin{array}{ccc} 00 & 00 & 00 \\ 00 & 00 & 00 \\ 11 & 0D & 24 \\ D7 & 97 & 81 \end{array} \right|$. If we store this in memory the byte sequence becomes: 00  00  00  00  00  00  11  0D  24  D7  97  81. Voila, all the zeroes have lined up. Any compression algorithm on the planet will now be able to collapse that sequence of zeroes. This example shows zeroes, but in fact it can be any repeating byte value.
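
In code the transform is just a pair of nested loops. A minimal sketch (illustrative only, not the actual Pcompress routine) that reproduces the example above:

```c
#include <stdint.h>
#include <stdio.h>

/* Byte-wise transpose ("shuffle"): treat the buffer as nrows integers of
 * 'stride' bytes each, stored row-major, and write them out column-major
 * so that corresponding bytes of every integer become adjacent. */
static void transpose_bytes(const uint8_t *src, uint8_t *dst,
    size_t nrows, size_t stride)
{
	for (size_t col = 0; col < stride; col++)
		for (size_t row = 0; row < nrows; row++)
			dst[col * nrows + row] = src[row * stride + col];
}

int main(void)
{
	/* The three 32-bit big-endian values 4567, 3479, 9345 from the example. */
	uint8_t src[12] = { 0x00, 0x00, 0x11, 0xD7,
	                    0x00, 0x00, 0x0D, 0x97,
	                    0x00, 0x00, 0x24, 0x81 };
	uint8_t dst[12];

	transpose_bytes(src, dst, 3, 4);
	for (int i = 0; i < 12; i++)
		printf("%02X ", dst[i]);
	printf("\n");    /* prints: 00 00 00 00 00 00 11 0D 24 D7 97 81 */
	return (0);
}
```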

This technique has been used in other places, for example in HDF5, where it is unassumingly called Shuffle. In Pcompress it is used in two places: once within the Delta2 encoding phase, and again to encode the Deduplication Index table. So compression ratios are improved across the board. To recover the data it is a simple matter of applying the inverse transpose.

# Adaptive Run-Length Delta Encoding

I recently added a small new feature to Pcompress that I am calling Adaptive Run-Length Delta Encoding. Delta Encoding can come in many variants. The simplest form is a type of data transformation that looks at byte values and detects arithmetic progressions in them. If a progression or monotonic series is found, then the starting value and the differences between subsequent values are output. Since the difference is constant we get a repeating sequence that can be compressed much more easily than the original data.

A series may exist over individual byte values or multi-byte integers. I added a simple implementation in Pcompress that tries to identify multi-byte integer sequences in the data. The number of bytes constituting an integer is called the stride length and can range from 3 bytes to 8 bytes. The bytes are packed into a big-endian integer representation. The code initially does a dry run with all the stride lengths to find the one that will give maximum benefit and then uses integer sequences of that stride size to encode the data.

This approach is similar to Bulat Ziganshin's Delta compression preprocessor. However, there are some differences. Firstly, my implementation does not try to do an exhaustive search (using sliding windows etc.) for embedded sequences at odd offsets, so it is quite fast at the cost of lower effectiveness. Secondly, I added a simple twist to the basic delta encoding. Whenever a sequence is detected I apply Run Length Encoding to collapse the sequence into its parameters: Count, Starting Value and Increment Delta. This pre-compresses the data and benefits all the standard compression algorithms. I found this to be most effective when combined with LZP, so using Delta automatically enables LZP pre-compression.
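
A rough sketch of the core detection step, finding a run of constant-delta big-endian integers that can then be emitted as just (count, starting value, delta), might look like the following. This is illustrative code only; it has none of the stride-selection dry run, minimum-run thresholds or on-disk format of the real implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* Read a 'stride'-byte big-endian integer starting at p. */
static uint64_t read_be(const uint8_t *p, int stride)
{
	uint64_t v = 0;

	for (int i = 0; i < stride; i++)
		v = (v << 8) | p[i];
	return (v);
}

/*
 * Count how many consecutive 'stride'-byte big-endian integers at 'buf'
 * form an arithmetic progression. The caller would only RLE-encode the
 * run as (count, start, delta) if count is above some minimum.
 */
size_t arle_run_length(const uint8_t *buf, size_t len, int stride,
    uint64_t *start, int64_t *delta)
{
	size_t n = len / stride;
	if (n < 2)
		return (0);

	uint64_t first = read_be(buf, stride);
	uint64_t prev = read_be(buf + stride, stride);
	int64_t d = (int64_t)(prev - first);
	size_t count = 2;

	for (size_t i = 2; i < n; i++) {
		uint64_t cur = read_be(buf + i * stride, stride);
		if ((int64_t)(cur - prev) != d)
			break;              /* progression ends here */
		prev = cur;
		count++;
	}
	*start = first;
	*delta = d;
	return (count);
}
```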

The code is at https://github.com/moinakg/pcompress/tree/master/delta2. This delta encoding is different from the delta compression and similarity matching that I had implemented earlier.

# Pcompress updated to 1.1

I just put out a new release that implements proper HMAC computation as described in my previous post. I added a bunch of new tests that exercise error conditions and error handling for out-of-range parameters and corrupted archives. These tests surfaced problems that had gone unnoticed earlier, mostly improper handling of error situations. As you can see, testing may be boring but it is absolutely necessary.

I also optimized the use of Zlib by switching to raw Zlib streams. Normal Zlib output carries a header and an Adler32 checksum, among other things. However, Pcompress adds its own headers and computes cryptographic checksums like SHA and SKEIN, which are light years ahead of the simple Adler32 in terms of data integrity verification. So the Adler32 computation was unnecessary overhead.
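
Switching to raw streams is essentially a one-line change in how the Zlib stream is initialized: passing a negative windowBits value to deflateInit2() omits the zlib header and Adler32 trailer. A minimal sketch (error handling trimmed, not the actual Pcompress wrapper code):

```c
#include <zlib.h>
#include <string.h>

/*
 * Compress 'in' into 'out' as a raw deflate stream (no zlib header,
 * no Adler32 trailer). On success *outlen holds the compressed size.
 */
int raw_deflate(const unsigned char *in, size_t inlen,
    unsigned char *out, size_t *outlen)
{
	z_stream strm;
	int rv;

	memset(&strm, 0, sizeof (strm));
	if (deflateInit2(&strm, Z_BEST_SPEED, Z_DEFLATED,
	    -15,                    /* negative windowBits => raw stream */
	    8, Z_DEFAULT_STRATEGY) != Z_OK)
		return (-1);

	strm.next_in = (unsigned char *)in;
	strm.avail_in = inlen;
	strm.next_out = out;
	strm.avail_out = *outlen;

	rv = deflate(&strm, Z_FINISH);
	deflateEnd(&strm);
	if (rv != Z_STREAM_END)
		return (-1);
	*outlen = strm.total_out;
	return (0);
}
```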

I had also tweaked the rolling hash used for deduplication while doing the chunking analysis. This produces better results and improved identity dedupe reduction. Finally I also added header checksums in plain non-crypto mode.

# Authenticated Encryption and Pcompress

I had got HMAC into pcompress but was not too happy with the way I was using it to verify the chunk headers and the regular digest. The operation was thus: $HMAC(Digest(Plaintext), Header) || Encrypt(Compress(Plaintext))$. However, this approach looked suspect to me. Eventually, after reading up some more, it turns out that Message Authentication and Encryption can be combined into Authenticated Encryption. This article provides an excellent background on the whole topic: http://tonyarcieri.com/all-the-crypto-code-youve-ever-written-is-probably-broken.

In addition to that, Wei Dai's Cryptopp wiki also has good, concise info: http://www.cryptopp.com/wiki/Authenticated_Encryption. Whoever thought that the ubiquitous SSH we take for granted is technically insecure! The most commonly recommended encryption mode for combining message authentication with encryption is EAX mode. There is a more advanced, better-performing OCB mode, but it is patented. Now, I had the choice of pulling out the EAX mode implementation from Cryptopp and using it with AES. However, I also needed to HMAC the headers without encrypting them and still have an integrity checksum. Also, importing and integrating the EAX code from Cryptopp is somewhat painstaking, with lots of code changes. So I decided to follow IPSec.

IPSec encrypts and then computes an HMAC on the encrypted data. As Wei Dai points out in the wiki, this approach is secure. I was already computing an HMAC of the header, so it was a simple matter to extend it to cover the entire encrypted data. In addition, I could avoid computing a separate digest of the plaintext, which would be unnecessary overhead: the HMAC authenticates and also verifies data integrity. So now the approach becomes: $HMAC(Header, Encrypt(Compress(Plaintext)))$. The HMAC is then inserted into the header; the HMAC portion of the header is zeroed while actually computing the HMAC. Since the HMAC is computed over the compressed data, it processes a smaller dataset, which benefits performance.
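
The overall shape of encrypt-then-MAC is straightforward. Here is a minimal sketch using OpenSSL's EVP and HMAC interfaces with AES-256-CTR and HMAC-SHA256; this only illustrates the construction and is not the actual Pcompress code, key handling or algorithm choices:

```c
#include <openssl/evp.h>
#include <openssl/hmac.h>

/*
 * Encrypt-then-MAC sketch: encrypt the (compressed) plaintext, then HMAC
 * the header followed by the ciphertext. The caller zeroes the HMAC field
 * in the header before calling this and stores 'mac' into it afterwards.
 * Both keys are assumed to be 32 bytes. Returns ciphertext length or -1.
 */
int encrypt_then_mac(const unsigned char *enc_key, const unsigned char *mac_key,
    const unsigned char *iv, const unsigned char *hdr, size_t hdrlen,
    const unsigned char *plaintext, int ptlen,
    unsigned char *ciphertext, unsigned char mac[32])
{
	EVP_CIPHER_CTX *ctx;
	HMAC_CTX *hctx;
	unsigned int maclen = 32;
	int len, total;

	/* Encrypt the compressed plaintext with AES-256 in CTR mode. */
	if ((ctx = EVP_CIPHER_CTX_new()) == NULL)
		return (-1);
	if (EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, enc_key, iv) != 1 ||
	    EVP_EncryptUpdate(ctx, ciphertext, &len, plaintext, ptlen) != 1) {
		EVP_CIPHER_CTX_free(ctx);
		return (-1);
	}
	total = len;
	if (EVP_EncryptFinal_ex(ctx, ciphertext + total, &len) != 1) {
		EVP_CIPHER_CTX_free(ctx);
		return (-1);
	}
	total += len;
	EVP_CIPHER_CTX_free(ctx);

	/* The MAC covers the header and the ciphertext, never the plaintext. */
	if ((hctx = HMAC_CTX_new()) == NULL)
		return (-1);
	HMAC_Init_ex(hctx, mac_key, 32, EVP_sha256(), NULL);
	HMAC_Update(hctx, hdr, hdrlen);
	HMAC_Update(hctx, ciphertext, total);
	HMAC_Final(hctx, mac, &maclen);
	HMAC_CTX_free(hctx);

	return (total);
}
```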

This change is already in the Pcompress git repo and will make it to the 1.1 release.

# J-Bit Encoding

I came across a very interesting research paper today on a new mechanism to pre-compress data in a way that helps other traditional compression algorithms achieve a better result: http://arxiv.org/abs/1209.1045. The author calls this J-Bit Encoding. I already have an idea to enhance the technique further and can't wait to try it out in Pcompress.

# Inside Content Defined Chunking in Pcompress – Part 2

After cooling off in the hills for a few days, curiosity got the better of me. In my previous post on the chunk size distribution produced by the rolling hash algorithm, I discussed a potential trend that can be inferred: regions of greater variability produce more, smaller chunks while regions of uniformity produce fewer, larger ones.

I wanted to do some more analysis to see whether there is indeed a trend, short of looking at actual data samples. A trend or correlation would indicate that the rolling-hash fingerprint based chunking algorithm is working nicely, with desirable properties. Now the question is: how do we define uniformity, or conversely variability, of data? I considered two ways of looking at it:

1. Spans or run-lengths of identical byte values. Fewer, longer spans indicate low variability and vice versa.
2. Zones where the same pattern repeats, that is, uniform repeating patterns.

Now point #2 is not suitable for our case, since repeating patterns can generate repeating break-points in the rolling hash, giving rise to many small chunks depending on the repeating pattern size. So from the point of view of our rolling hash this is a form of repeating variability rather than uniformity. This leaves point #1. Long spans of identical byte values have less chance of triggering break-points, since our break-point condition checks whether a few least significant bits of the fingerprint are zero. Spans of zeroes will of course trigger many break-points, but we can ignore that for this analysis as our dataset is quite varied.
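
For reference, the general shape of this kind of chunking is a rolling fingerprint over a small window with a mask test on its low bits, bounded by minimum and maximum chunk sizes. The following is a deliberately simplified sketch with made-up constants; it is not Pcompress's actual rabin fingerprinting code or parameters:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative constants only; not Pcompress's actual parameters. */
#define WINDOW     16              /* sliding fingerprint window (bytes)   */
#define MIN_CHUNK  2048            /* never emit a chunk smaller than this */
#define MAX_CHUNK  65536           /* force a break-point at this size     */
#define BREAK_MASK 0x0FFF          /* low 12 bits zero => ~4KB average     */
#define PRIME      153191ULL       /* multiplier for the polynomial hash   */

/* Return the length of the next content-defined chunk starting at buf. */
size_t next_chunk(const uint8_t *buf, size_t len)
{
	uint64_t hash = 0, pow_w = 1;
	size_t i;

	if (len <= MIN_CHUNK)
		return (len);

	/* PRIME^WINDOW, used to drop the byte leaving the window. */
	for (i = 0; i < WINDOW; i++)
		pow_w *= PRIME;

	for (i = 0; i < len && i < MAX_CHUNK; i++) {
		hash = hash * PRIME + buf[i];            /* slide new byte in  */
		if (i >= WINDOW)
			hash -= pow_w * buf[i - WINDOW];     /* slide old byte out */

		/* Break-point: a few least significant bits of the fingerprint are zero. */
		if (i >= MIN_CHUNK && (hash & BREAK_MASK) == 0)
			return (i + 1);
	}
	return (i);
}
```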

I used an initial 24GB portion of the earlier 360GB dataset, since I would be generating a large number of data points. This initial portion also contains a wide variety of textual, binary (Linux and Windows) and media files. I decided to split the data into 1MB segments and chunk them with an average chunk size of 4KB. I also computed the following properties per segment (in a single thread, of course):

• Variability Index
• Ratio of  Total size of small chunks : Total segment size (in bytes)
• Ratio of  Total size of large chunks : Total segment size (in bytes)

Where the following hold:

$\text{Variability Index} = \text{Number of RLE spans} / \text{Segment Size}$

$\text{Small chunk} = \text{Chunk Size} < 16KB$

$\text{Large chunk} = \text{Chunk Size} \ge 16KB$

I somewhat arbitrarily chose the partition value of 16KB, keeping in mind the average chunk size of 4KB. All three values above are ratios and can therefore be directly compared. I also computed the average chunk size per segment and plotted it separately.
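
Computing the per-segment properties is straightforward. A small sketch of the two kinds of measures (illustrative only, not the code actually used for this analysis):

```c
#include <stddef.h>
#include <stdint.h>

/* Variability index of a segment: number of runs of identical bytes
 * divided by the segment size. A value near 1.0 means hardly any byte
 * repeats its predecessor (high variability); long runs push it toward 0. */
double variability_index(const uint8_t *seg, size_t len)
{
	size_t spans = 0;

	for (size_t i = 0; i < len; i++)
		if (i == 0 || seg[i] != seg[i - 1])
			spans++;
	return (len ? (double)spans / (double)len : 0.0);
}

/* Ratio of bytes in chunks smaller than 'threshold' to the segment size. */
double small_chunk_ratio(const size_t *chunk_sizes, size_t nchunks,
    size_t threshold, size_t segment_size)
{
	size_t small_bytes = 0;

	for (size_t i = 0; i < nchunks; i++)
		if (chunk_sizes[i] < threshold)
			small_bytes += chunk_sizes[i];
	return (segment_size ? (double)small_bytes / (double)segment_size : 0.0);
}
```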

Chunk Size Distribution vs Data Variability

Splitting approximately 24GB of data into approximately 1MB segments produced more than 23400 data items. The chunking algorithm in Pcompress splits segments at a rabin chunk boundary so that every segment gets a set of complete content-defined chunks; this causes segment sizes to vary slightly.

Since the number of points is very large, the lines in the line graph merge together and look like an area graph. Looking at the graphs for a few moments yields the view that there is indeed a correlation between data variability and chunk size distribution. Regions of higher variability do indeed tend to produce smaller chunks and vice versa. Now let's zoom into the interesting middle portion of the distribution graph to see some more detail (more than 11000 points):

The correlation between the three lines is more clearly visible here, but it is not a 100% fit. But then, is anything in life and this universe perfect? So our expectation that greater data variability produces smaller chunks, and vice versa, holds in general but not all of the time. We can do some quick statistical analysis to get a better idea. The first thing to do is to add a couple of polynomial trend lines.

We can clearly see that the total size of large chunks is, with a high degree of probability, inversely related to the data variability. This is exactly what we desire. So over a large amount of data, localized variations even out and general trends are observed. Finally, I also computed Pearson's correlation coefficients:

$COEFF(\text{RLE Ratio}, \text{Small Chunk Ratio}) = 0.819695$

$COEFF(\text{RLE Ratio}, \text{Large Chunk Ratio}) = -0.81969$
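
These are ordinary Pearson product-moment coefficients computed over the two per-segment ratio series. For completeness, a small helper of the following form computes the same statistic (an illustrative sketch, not the actual analysis script):

```c
#include <math.h>
#include <stddef.h>

/* Pearson correlation coefficient of two equal-length series. */
double pearson(const double *x, const double *y, size_t n)
{
	double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;

	for (size_t i = 0; i < n; i++) {
		sx += x[i];
		sy += y[i];
		sxx += x[i] * x[i];
		syy += y[i] * y[i];
		sxy += x[i] * y[i];
	}
	/* covariance and variances, each scaled by n (the scale cancels out) */
	double cov = sxy - sx * sy / n;
	double vx = sxx - sx * sx / n;
	double vy = syy - sy * sy / n;
	return (cov / sqrt(vx * vy));
}
```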

So, combined with all the previous analysis, these values let us say that there clearly exists a strong correlation of the form that is desirable. Pcompress appears to have a good chunking algorithm unless, of course, someone more experienced than me in this area points out shortcomings or mistakes.

# Inside Content-Defined chunking in Pcompress

After having played with the content-defined (or content-aware) chunking process in Pcompress and having tweaked its parameters quite a bit, I decided to actually look at what kind of chunk (or block, in Pcompress terms) sizes it is producing. I had never looked at this aspect in the past and was curious to see whether there is any abnormal skew.

I had to run the chunking process over large amounts of data and capture a full list of the chunk (or block) lengths it produces. A simple printf() to output the length is enough. Gathering a sufficiently large dataset that covered a wide range of file types was more challenging! I eventually managed to cobble together something from the gigabytes of stuff I already had in my ZFS pools and some I had gathered over a period of time for testing Pcompress.

Basic Dataset Properties

• Total size: 360 GB
• File Types:
• Program text from Linux and Unix, a Linux git checkout tarball, etc.
• Various kinds of general text including English text
• A 2.5GB portion of the OANC dataset
• XML and HTML (The enwik9 and enwik8 datasets)
• Email backups
• Full OpenSolaris root filesystem image consisting of binaries, translations and other textual data
• Windows SysWOW64 and Program Files contents
• Gigabytes of Multimedia files of various formats: MP3, WAV, MKV, AVI, XVID, OGG etc.
• Various ISO images of movies, and LiveCD OS distributions
• Image files of various formats: BMP, GIF, PNG, JPEG etc.
• Linux binaries, /usr/share, /var, log files, backups and other misc contents.
• Already compressed files and tarballs using Gzip, Bzip2, Compress (.Z) etc.

As you can see the dataset contains a wide variety of files for exercising the chunking algorithm across a wide range of value types. I ran the deduplication code with two different average chunk sizes of 4KB and 8KB and produced a frequency distribution of various actual chunk sizes. The process went like this:  Capture chunk size list -> Sort the list -> Count repeated occurrences of same value -> Produce frequency distribution.

The plots below show the results. Since Content-Defined Chunking produces variable-length chunks, the actual average chunk size is typically close to the desired value but rarely exact. The graphs below show Frequency on the left and Chunk size on the right; both are log scales. The point where the Frequency line and the Chunk size line cross is the average value.

As you can see, smaller chunk sizes have higher frequencies, with the count dropping off quite rapidly and the variability/jitter in frequencies increasing quite rapidly as we get to larger chunk sizes. I made a small tweak to the algorithm today to use a fixed mask value for the rolling-hash break pattern, which appears to produce more consistent average chunk sizes across a range of desired sizes. The minimum-size check is there anyway to ensure that the desired average chunk size is reached.

The graph below shows frequency distribution of chunks grouped into 6 distinct power of two bins: 4KB, 8KB, 16KB, 32KB, 64KB, 128KB.

As you can see the peak is at around 4KB, which is the desired average size in this case. Next come the plots for an 8KB desired average chunk size.

These plots look similar to the ones for 4KB chunks and show similar patterns. The detailed frequency distribution plot for 8KB is strikingly similar to the one for 4KB chunks. Also notice how the frequency drops off rapidly at larger chunk sizes. I am tempted to theorize that the algorithm produces more, smaller chunks in regions of high change and fewer, larger ones in uniform zones. The last few tweaks I have made brought the average chunk size closer to the requested value and improved the dedupe ratio, which would seem to support the hypothesis. However, one needs to look at some of the data around the small and large chunks to actually check. The final tweaks and changes will be rolled into a 1.0.1 release I will make in a while.

It is interesting to note that with all this, Pcompress now produces smaller compressed files than shown in the graphs I put out in some of the previous posts. Everything now appears to be in fair shape for the next item on my table: Global Deduplication.