
High Performance Content Defined Chunking

In Pcompress, I have implemented a variant of the rolling hash based Content Defined Chunking that provides both deduplication accuracy and high performance. This post attempts to explain the chunking process, covers the chunking computations that are done in Pcompress and then talks about the new optimizations for very fast sliding window chunking (on the order of 500MB/s to 600MB/s throughput depending on processor).

Background

Data Deduplication requires splitting a data stream into chunks and then searching for duplicate chunks. Once duplicates are found, only one copy of the duplicate is stored and the remaining chunks become references to that copy. The splitting of data into chunks appears to be an ordinary process but is crucial to finding duplicates effectively. The simplest approach is of course splitting data into fixed-size blocks. It is screaming fast, requiring virtually no processing, but it comes with the limitation of the data shifting problem.

The diagram below illustrates the problem. The two 64-character patterns are mostly similar with only two characters differing. Initially fixed-block chunking provides good duplicate detection. However the insertion of a single character at the beginning shifts the entire data while chunk boundaries are fixed. So no duplicates are found even though the patterns are mostly similar.

static_chunking

The diagram shows insertion, but the same thing can happen for deletion. In general with static chunking duplicate detection is lost after the point where insertion or deletion has taken place.

In order to deal with this, most dedupe solutions use content defined chunking, which marks cut points based on patterns in the data. So if the data patterns shift, the cut points shift with them. The diagram below illustrates this.

dynamic_chunking

The chunks are split based on patterns in data so they are of variable length (but average size is close to the desired length). Since the chunk boundaries shift along with the data patterns, duplicates are still found. Only the modified chunks are unique.

The Rolling Hash computation

Now the question is: what data patterns should we look for when determining the chunk boundaries or cut points? The common technique is to compute a hash value of a few consecutive bytes at every byte position in the data stream. If the hash value matches a certain predefined pattern we can declare a chunk boundary at that position. To do this computation efficiently a technique called the rolling hash was devised. It uses a sliding window that scans over the data bytes and provides a hash value at each point. The hash value at position ‘i’ can be cheaply computed from the hash at position ‘i-1’. In other words H(X_{(i,n)}) = (H(X_{(i-1,n)}) + X_i - X_{(i-n)}) \bmod M where ‘n’ is the window size and X_{(i,n)} represents the window bytes at byte position ‘i’. In mathematical terms this is a recurrence relation. Rolling hashes have been used in contexts like Rabin-Karp substring search and Rsync. Today they are used extensively for chunk splitting in the context of data deduplication.
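
To make the recurrence concrete, here is a minimal sketch of a plain additive rolling hash scan. The window size, modulus and cut-point test are illustrative assumptions, and this is not the Pcompress scheme, which is described next.

#include <stddef.h>
#include <stdint.h>

/* Illustrative only: plain additive rolling hash over a 16-byte window. */
void scan_rolling_hash(const uint8_t *buf, size_t len)
{
    const size_t window = 16;
    const uint32_t modulus = 1u << 30;       /* example modulus; a power of two keeps unsigned wraparound harmless */
    uint32_t hash = 0;

    for (size_t i = 0; i < len; i++) {
        hash += buf[i];                      /* add incoming byte X_i      */
        if (i >= window)
            hash -= buf[i - window];         /* drop outgoing byte X_(i-n) */
        hash %= modulus;

        /* A chunker would now test 'hash' against a cut-point pattern,
         * e.g. (hash & 0xFFF) == 0 for a ~4KB average chunk size. */
    }
}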

One of the common rolling hashes used in Data Deduplication is Rabin Fingerprinting, devised originally by Turing award winner Michael O. Rabin in his seminal paper titled “Fingerprinting By Random Polynomials”. The mathematically inclined will enjoy reading it. There are other rolling hash techniques such as the one used in Rsync, the TTTD algorithm devised by HP, the FBC algorithm, etc.

rolling_hash

While I am not so much of a mathematically inclined person, I still needed a good rolling hash in order to do content defined chunking in Pcompress. After looking at various implementations like the one in LBFS and a few other open-source approaches like n-gram hashing, I came up with an approach that worked well and produced average chunk sizes close to the desired value.

I used a small sliding window of 16 bytes that produces a 64-bit fingerprint at each byte position, requiring an addition, a subtraction, a multiplication and conditionally an XOR per byte. A chunk boundary is declared if the bottom Y bits of the fingerprint are zero, where the value of Y depends on the desired average chunk size. For example, for a 4KB average size one would look for the bottom 12 bits to be zero. The core of the approach is derived from Rabin Fingerprinting. A good description is here: http://www.infoarena.ro/blog/rolling-hash. The hashing approach is a multiplicative scheme of the form:

rollhash = (rollhash * PRIME + inbyte - outbyte * POW) \% MODULUS

Where inbyte is the incoming byte at the sliding window head, outbyte is the outgoing byte at the sliding window tail and POW = (PRIME ^ {windowsize}) \% MODULUS. The PRIME number I am using is the same value used by Bulat Ziganshin in his SREP tool. Experimentation showed it to produce good results. In addition to this I precompute a table using the irreducible polynomial (represented in GF(2)) from LBFS. The outbyte is used to index the table and the value is XOR-ed with the hash value to produce the final fingerprint. I did some analysis of the chunking approach which is documented in two earlier posts. The results were good.
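
To make the scheme above concrete, here is a hedged sketch of the precomputation and the per-byte update. The constants and names (PRIME, MODULUS, poly_table) are placeholders, not the actual values or code in Pcompress; the real implementation is linked in the footnote.

#include <stdint.h>

#define WINDOW_SIZE 16
#define MODULUS     (1ULL << 40)            /* placeholder, a power of two              */
static const uint64_t PRIME = 1000003;      /* placeholder prime, not the Pcompress one */
static uint64_t POW;                        /* (PRIME^WINDOW_SIZE) % MODULUS            */
static uint64_t poly_table[256];            /* filled from the LBFS irreducible polynomial
                                             * in the real code; left empty here        */

static void init_pow(void)
{
    uint64_t p = 1;
    for (int i = 0; i < WINDOW_SIZE; i++)
        p = (p * PRIME) % MODULUS;
    POW = p;
}

/* Roll the hash by one byte and derive the fingerprint that is tested
 * for a cut point (bottom Y bits being zero). */
static uint64_t update_fingerprint(uint64_t *rollhash, uint8_t inbyte, uint8_t outbyte)
{
    *rollhash = (*rollhash * PRIME + inbyte - (uint64_t)outbyte * POW) % MODULUS;
    return (*rollhash ^ poly_table[outbyte]);
}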

A window size of only 16 bytes will raise some eyebrows as typically much larger windows are used. LBFS for example used a 48-byte window and others have used even larger windows. However in practice, as is evident from the above analysis, this implementation does produce good results and the window size of 16 bytes allows an optimization as we will see below.

Optimizations

While addition and multiplication are extremely fast on modern processors, performance overheads remained. Even though I was using a small window of 16 bytes, it still required performing the computation over the entire series of bytes in order to find cut points. That is far more computationally expensive than simply splitting data into fixed-size chunks. A couple of optimizations are immediately apparent from the above hash formula:

  • Since we are dealing with bytes it is possible to pre-compute a table for outbyte * POW
  • The MODULUS operation can be replaced with masking if it is a power of 2.
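
Continuing the sketch above, this is roughly how those two tweaks change the update step (again with assumed names, not the actual Pcompress code):

static uint64_t out_table[256];             /* out_table[b] = (b * POW) % MODULUS */

static void init_out_table(void)            /* call after init_pow()              */
{
    for (int b = 0; b < 256; b++)
        out_table[b] = ((uint64_t)b * POW) % MODULUS;
}

static uint64_t update_fingerprint_fast(uint64_t *rollhash, uint8_t inbyte, uint8_t outbyte)
{
    /* Table lookup replaces the multiply; masking replaces the modulus
     * since MODULUS is a power of two. */
    *rollhash = (*rollhash * PRIME + inbyte - out_table[outbyte]) & (MODULUS - 1);
    return (*rollhash ^ poly_table[outbyte]);
}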

This gives some gains; however, the overhead of scanning the data and constantly updating a sliding window in memory remains. Eventually I implemented a couple of key new optimizations in Pcompress that made a significant difference:

  • Since the sliding window is just 16 bytes it is possible to keep it entirely in a 128-bit SSE register.
  • Since we have minimum and maximum limits for chunk sizes, it is possible to skip minlength minus a small constant bytes after a breakpoint is found and only then start scanning. This provides a significant improvement in performance by avoiding scanning the majority of the data stream.

Experimentation with different types of data shows that the second optimization results in scanning only 28% to 40% of the data. The remaining data are just skipped. The minimum and maximum limits are used to retain a distribution of chunk sizes close to the average. Since rolling hash cut points below the minimum size are ignored it does not make sense to scan that data.
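
Putting the pieces together, here is a hedged sketch of a scan loop with the skip optimization, reusing the earlier sketch's update function. The limits and names are assumptions, not the real Pcompress loop.

#define MIN_CHUNK  (3 * 1024)      /* example minimum chunk size           */
#define MAX_CHUNK  (64 * 1024)     /* example maximum chunk size           */
#define SKIP_SLACK 256             /* the "small constant" mentioned above */
#define CUT_MASK   0xFFF           /* bottom 12 bits zero -> ~4KB average  */

static size_t find_next_cut(const uint8_t *buf, size_t len)
{
    uint64_t rollhash = 0, fp;
    size_t limit = (len < MAX_CHUNK) ? len : MAX_CHUNK;

    /* Skip ahead: cut points below the minimum size are ignored anyway, so
     * most of the region before MIN_CHUNK is never scanned. The slack gives
     * the rolling hash a short run-up before cut points are accepted. */
    size_t start = (limit > MIN_CHUNK) ? (MIN_CHUNK - SKIP_SLACK) : 0;

    for (size_t pos = start; pos < limit; pos++) {
        /* Only subtract bytes that were actually added after the restart. */
        uint8_t outbyte = (pos >= start + WINDOW_SIZE) ? buf[pos - WINDOW_SIZE] : 0;
        fp = update_fingerprint_fast(&rollhash, buf[pos], outbyte);
        if (pos >= MIN_CHUNK && (fp & CUT_MASK) == 0)
            return (pos + 1);      /* breakpoint found: chunk ends here     */
    }
    return (limit);                /* no breakpoint: forced max-size cut    */
}

A caller would invoke this repeatedly, advancing the buffer by the returned chunk length each time.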

opt_dynamic_chunking

All these optimizations combined provide an average chunking throughput of 530 MB/s per core on my 2nd generation Core i5 running at 2.2 GHz. Of course faster, more recent processors will produce better results. The throughput also depends on the nature of the data. If the data has a very specific pattern that causes more large chunks to be produced, the performance degrades (think about why this should be the case). This brings us to the worst case behaviour.

Worst Case performance profile

The worst case performance profile of the optimized chunking approach occurs when all chunks produced are of the maximum size. That is, the data is such that no breakpoints are found, degenerating to fixed-block chunking behaviour at the max chunk size of 64KB while still paying the rolling hash computation overhead. In this case the majority of the data is scanned and computed, but how much?

If we assume a minimum chunk size of 3KB, a maximum of 64KB and 100MB of data, we will have 100MB / 64KB = 1600 chunks (considering the worst case of all max-length chunks). For every chunk, 3KB minus a small constant of data will be skipped. In my current implementation the value of the small constant is 256, though it can be smaller. So the actual skipped size is 3072 - 256 = 2816 bytes. In total the number of skipped bytes will be 2816 * 1600 = 4505600 bytes out of 100MB of data. In percentage terms that is 4505600 / 104857600 * 100 = 4.29%. In other words more than 95% of the data will be scanned, degrading the performance by more than half.

Now the question is: what kind of data will produce this worst case behaviour? If you look at the rolling hash computation details in Pcompress above, the eventual fingerprint is computed via an XOR with a polynomial computation result from a table. Those values are non-zero and we check for breakpoints based on the bottom 12 bits of the fingerprint being zero. So if the computed hash is zero, the XOR will set those bits and the bottom 12 bits will become non-zero. The hash will be zero if the data is zero. That is, if we have a file of only zero bytes we will hit the worst case.

I created a file of only zero bytes, tested this, and got a throughput of 200 MB/s with all chunks at the max 64KB length. In real datasets zero-byte regions do occur, however very large entirely zero-byte files are uncommon, at least to my knowledge. One place having zero-byte regions is VMDK/VDI files. So I tested on a virtual harddisk file of a Fedora 18 installation in VirtualBox and still got a majority of 4KB chunks but with a small peak at 64KB. The throughput was 490 MB/s with approx 41% of the data being scanned. So even a virtual harddisk file has non-zero bytes inside, like formatting markers. It is rare to get 100s of megabytes of files with only zero bytes. Finally, from an overall deduplication perspective such files will be deduplicated maximally with almost 98% data reduction, and the final compression stage will also be extremely fast (only zero bytes). So even though chunking suffers, overall deduplication will be fast.

Footnote

If you are interested to look at the implementation in Pcompress, it is here: https://github.com/moinakg/pcompress/blob/master/rabin/rabin_dedup.c#L598

Vectorizing xxHash for Fun and Profit

Update: Thanks to Yann Collet’s comment I have fixed the mistaken reference to Hyperthreading, when in fact I was referring to Instruction Level Parallelism via Superscalar processing.

I was doing various performance optimizations in Pcompress for the last few days and the xxHash piece caught my eye. This non-cryptographic hash implementation is one of the fastest 32-bit hashes available today. I use it in Pcompress to hash data chunks for deduplication. The hash value is used to insert the chunk-id into a hashtable for lookups. xxHash is already extremely fast, working at 5.4 GB/s on a fast processor, however its inner loop looked very interesting.

 do
 {
     v1 += XXH_LE32(p) * PRIME32_2; v1 = XXH_rotl32(v1, 13); v1 *= PRIME32_1; p+=4;
     v2 += XXH_LE32(p) * PRIME32_2; v2 = XXH_rotl32(v2, 13); v2 *= PRIME32_1; p+=4;
     v3 += XXH_LE32(p) * PRIME32_2; v3 = XXH_rotl32(v3, 13); v3 *= PRIME32_1; p+=4;
     v4 += XXH_LE32(p) * PRIME32_2; v4 = XXH_rotl32(v4, 13); v4 *= PRIME32_1; p+=4;
 } while (p<=limit);

 h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) + XXH_rotl32(v4, 18);

There are 4 independent compute streams consuming data into 4 separate accumulators which are then combined into a single hash value later. These 4 independent but identical compute streams look like a perfect fit for vectorizing using the SSE instructions on modern x86 processors. The SSE registers can hold multiple packed integers or floats, and a single operation, for example addition, applied to two SSE registers affects all the packed values in parallel. The image below shows this behavior diagrammatically when the SSE registers are populated with 4 32-bit quantities.

simd

An SSE register is 128-bit and can perfectly hold four 32-bit packed integers. I had to use at least SSE 4.1 to get the parallel multiply of packed 32-bit values. The rotate (XXH_rotl32) operation is easily doable with 2 bitwise shifts and a bitwise or (as is done in the original xxHash code as well). Without having to deal with inline assembly language we can fairly easily use the compiler SSE intrinsics. These appear to the programmer as functions but in fact they result in specific CPU instructions being emitted. SSE intrinsics also provide the benefit of letting the compiler handle optimizations like instruction scheduling. Using inline assembly leaves all that optimization burden to the programmer, which is often tricky to get right except for short segments. In addition some of the intrinsics are composite intrinsics: they emit a collection of CPU instructions encapsulating some commonly used operation (like _mm_set_epi32(…)).
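
As a toy illustration of the intrinsics style (separate from the xxHash code), the snippet below adds two sets of four packed 32-bit integers with a single paddd:

#include <stdio.h>
#include <emmintrin.h>              /* SSE2: __m128i, _mm_set_epi32, _mm_add_epi32 */

int main(void)
{
    /* Pack four 32-bit integers into each 128-bit register
     * (composite intrinsic: may expand to several instructions). */
    __m128i a = _mm_set_epi32(4, 3, 2, 1);
    __m128i b = _mm_set_epi32(40, 30, 20, 10);

    /* One paddd instruction adds all four lanes in parallel. */
    __m128i sum = _mm_add_epi32(a, b);

    int out[4];
    _mm_storeu_si128((__m128i *)out, sum);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  /* prints 11 22 33 44 */
    return (0);
}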

You can get a good overview of SSE programming from the CodeProject article. So my initial effort led to the following code shown below.

#include <smmintrin.h>
static inline __m128i _x_mm_rotl_epi32(const __m128i a, int bits)
{
    __m128i tmp1 = _mm_slli_epi32(a, bits);
    __m128i tmp2 = _mm_srli_epi32(a, 32 - bits);
    return (_mm_or_si128(tmp1, tmp2));
}
...
...
 __m128i accum = _mm_set_epi32(v4, v3, v2, v1);
 __m128i prime1 = _mm_set1_epi32(PRIME32_1);
 __m128i prime2 = _mm_set1_epi32(PRIME32_2);
 do {
     __m128i mem = _mm_loadu_si128((__m128i *)p);
     mem = _mm_mullo_epi32(mem, prime2);
     accum = _mm_add_epi32(accum, mem);
     accum = _x_mm_rotl_epi32(accum, 13);
     accum = _mm_mullo_epi32(accum, prime1);
     p += 16;
 } while (p<=limit);
 _mm_storeu_si128((__m128i *)vx, accum);
 v1 = vx[0]; v2 = vx[1];
 v3 = vx[2]; v4 = vx[3];
 h32 = XXH_rotl32(v1,1) + XXH_rotl32(v2,7) + XXH_rotl32(v3,12) + XXH_rotl32(v4,18);
...

It worked perfectly. However there was a big problem. It was slower than before, and by a good margin! Look at the benchmarks below using the standard silesia test dataset.

Original
========
~/xxhash-read-only $ ./bench silesia.tar
*** Benchmark utility using xxHash algorithm ,by Yann Collet (Jan 19 2013) ***
silesia.tar      :  206612480 ->  3571.1 MB/s   0x61B1160F

Modified
========
~/xxhash-read-only $ ./bench silesia.tar
*** Benchmark utility using xxHash algorithm ,by Yann Collet (Jan 19 2013) ***
silesia.tar      :  206612480 ->  3013.3 MB/s   0x61B1160F

It is 15% slower after vectorizing. This seemed ridiculous to me. I was expecting at the minimum a 10% speedup and here we have parallelized code that is in fact slower! If you notice, my initial code does unaligned loads via “_mm_loadu_si128((__m128i *)p)”. Fearing that to be the cause of the latency, I did some tweaks to ensure aligned access and to process the initial unaligned bytes separately. The performance difference still remained!

I first thought that the SSE instructions had higher latency than the ordinary integer ones. So I did some measurements using the Perf tool. The results are shown below.

Original
========
1) Annotation from perf record:
 10.42 │ 30:   mov    (%rdi),%r10d                                                                                                                                     
 14.03 │       imul   $0x85ebca77,%r10d,%r10d                                                                                                                          
  0.34 │       add    %r10d,%r8d                                                                                                                                       
  0.21 │       mov    0x4(%rdi),%r10d                                                                                                                                  
 10.53 │       rol    $0xd,%r8d                                                                                                                                        
  0.23 │       imul   $0x9e3779b1,%r8d,%r8d                                                                                                                            
  0.77 │       imul   $0x85ebca77,%r10d,%r10d                                                                                                                          
  0.02 │       add    %r10d,%ecx                                                                                                                                       
 10.54 │       mov    0x8(%rdi),%r10d                                                                                                                                  
  0.06 │       rol    $0xd,%ecx                                                                                                                                        
  0.34 │       imul   $0x9e3779b1,%ecx,%ecx                                                                                                                            
 10.38 │       imul   $0x85ebca77,%r10d,%r10d                                                                                                                          
  0.30 │       add    %r10d,%edx                                                                                                                                       
  0.02 │       mov    0xc(%rdi),%r10d                                                                                                                                  
  0.07 │       add    $0x10,%rdi                                                                                                                                       
 10.07 │       rol    $0xd,%edx                                                                                                                                        
  0.32 │       imul   $0x9e3779b1,%edx,%edx                                                                                                                            
 10.30 │       imul   $0x85ebca77,%r10d,%r10d                                                                                                                          
  0.12 │       add    %r10d,%eax                                                                                                                                       
  0.22 │       rol    $0xd,%eax                                                                                                                                        
  0.15 │       imul   $0x9e3779b1,%eax,%eax                                                                                                                            
 20.52 │       cmp    %rdi,%r11                                                                                                                                        
  0.03 │     ↑ jae    30

2) Performance counters from perf stat:
*** Benchmark utility using xxHash algorithm , by Yann Collet (Jan 19 2013) ***
silesia.tar      :  206612480 ->  3541.4 MB/s   0x61B1160F

 Performance counter stats for './bench silesia.tar':

    6260.562213 task-clock                #    0.996 CPUs utilized
            544 context-switches          #    0.087 K/sec
              4 CPU-migrations            #    0.001 K/sec
         50,624 page-faults               #    0.008 M/sec
 13,278,187,173 cycles                    #    2.121 GHz                     [83.41%]
  3,657,672,543 stalled-cycles-frontend   #   27.55% frontend cycles idle    [83.30%]
  6,834,837,811 stalled-cycles-backend    #   51.47% backend  cycles idle    [66.70%]
 31,374,782,178 instructions              #    2.36  insns per cycle
                                          #    0.22  stalled cycles per insn [83.40%]
  1,384,012,974 branches                  #  221.068 M/sec                   [83.30%]
         95,141 branch-misses             #    0.01% of all branches         [83.30%]

    6.288590138 seconds time elapsed

Modified
========
1) Annotation from perf record:
  9.38 │ 70:┌─→movdqu (%rdi),%xmm0
  9.52 │    │  add    $0x10,%rdi
  0.00 │    │  cmp    %rdi,%r9
  0.00 │    │  pmulld %xmm3,%xmm0
  9.04 │    │  paddd  %xmm1,%xmm0
  0.15 │    │  movdqa %xmm0,%xmm1
  0.04 │    │  psrld  $0x13,%xmm0
  9.19 │    │  pslld  $0xd,%xmm1
  0.27 │    │  por    %xmm0,%xmm1
  8.95 │    │  pmulld %xmm2,%xmm1
 53.44 │    └──jae    70

2) Performance counters from perf stat:
 Performance counter stats for './bench silesia.tar':

    6265.021121 task-clock                #    0.996 CPUs utilized
            546 context-switches          #    0.087 K/sec
              2 CPU-migrations            #    0.000 K/sec
         50,625 page-faults               #    0.008 M/sec
 13,288,539,258 cycles                    #    2.121 GHz                     [83.36%]
  8,562,225,224 stalled-cycles-frontend   #   64.43% frontend cycles idle    [83.31%]
  3,763,596,002 stalled-cycles-backend    #   28.32% backend  cycles idle    [66.72%]
 12,813,511,969 instructions              #    0.96  insns per cycle
                                          #    0.67  stalled cycles per insn [83.41%]
  1,176,939,765 branches                  #  187.859 M/sec                   [83.31%]
        102,390 branch-misses             #    0.01% of all branches         [83.30%]

    6.292536751 seconds time elapsed

There are a few interesting observations here. Perf annotation tells us that there is not much difference in combined instruction latencies between the normal C and the SSE implementations. However the performance counter statistics have a story to tell. I have highlighted the interesting entries in bold. Backend stall cycles decreased but Frontend stall cycles increased 2.3x. What is more interesting is that the normal C version issued 2.36 instructions per cycle while the SSE version issued only 0.96, less than half of that. There is not much code in the small loop to cause stalls, so the only explanation for the increased Frontend stalls is memory read latency, which also causes fewer instructions to get issued.

The big question was: why was there so much memory access latency in the SSE version compared to the normal C version? I pondered this for some time while trying out a variety of experiments to no avail. I tried things like hand-optimized assembly and trying to hide memory latency by loop hoisting. Then, looking at the disassembly from the perf annotate of the original code again, something looked interesting. The instructions from the 4 independent computation lines are all interleaved. More precisely, instructions from independent computation sequences were interleaved in two pairs. Suddenly a bulb lit up somewhere. We have a gotcha named superscalar processing and the key thing here is “independent computations”. Intel and AMD x86 CPUs can execute multiple instructions per clock cycle, not necessarily in program order, if they are independent of each other and utilize separate processing resources within the core. This is a form of instruction level parallelism that is different from vector processing. Within the core some resources are duplicated and some shared.

I was using a Core i5 from the Nehalem generation which typically has 4 decoders in the front-end pipeline and a superscalar out-of-order execution engine that can dispatch up to 6 micro-ops per cycle. The advantage of doing this is that if some instruction from one stream stalls waiting on a resource, like memory, then independent instructions from the other streams can be processed. In addition, if two compute operations require different resources they can be executed simultaneously.
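
The effect is easy to demonstrate outside of hashing. As a toy example (not from the original post), summing an array with a single accumulator forms one long dependency chain, while splitting the work across independent accumulators lets the out-of-order core overlap the additions, much like xxHash's four lanes:

#include <stddef.h>
#include <stdint.h>

/* Single dependency chain: each add waits for the previous one. */
uint64_t sum_serial(const uint32_t *a, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return (s);
}

/* Four independent chains: the out-of-order core can issue these
 * adds in parallel, similar to xxHash's four accumulators. */
uint64_t sum_ilp(const uint32_t *a, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;

    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* handle the tail bytes */
        s0 += a[i];
    return (s0 + s1 + s2 + s3);
}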

superscalar

The original xxHash code fully utilizes this feature, helped by the GCC compiler which recognizes this capability and interleaves independent instructions together to help the superscalar engine achieve maximum instruction level parallelism. In my SSE version I was utilizing a vector unit that can handle four 32-bit integers in a single instruction, so there was only a single vectorized instruction stream to handle all 4 independent operations. The whole Frontend, or in other words the fetch-decode unit, had to stall during a memory fetch.

So the next question was: could we improve this situation? How about multiple independent SIMD SSE streams to better utilize the ILP capability? It turns out that this is doable but results in a slight change in the hashing approach. I implemented 2 independent SSE streams, each processing 16 bytes into an accumulator. There are enough SSE registers available to do this without spurious memory accesses. The accumulators are then combined at the end into the final hash result. This changes the resulting hash value since there are now 2 independent streams processing data in independent strides without combining them. Even then the xxHash properties are retained due to the way the final combination is done. My new code is shown below.

#include <smmintrin.h>
static inline __m128i _x_mm_rotl_epi32(const __m128i a, int bits)
{
     __m128i tmp1 = _mm_slli_epi32(a, bits);
     __m128i tmp2 = _mm_srli_epi32(a, 32 - bits);
     return (_mm_or_si128(tmp1, tmp2));
}
...
...
 unsigned int vx[4], vx1[4];
 __m128i accum = _mm_set_epi32(v4, v3, v2, v1);
 __m128i accum1 = _mm_set_epi32(v4, v3, v2, v1);
 __m128i prime1 = _mm_set1_epi32(PRIME32_1);
 __m128i prime2 = _mm_set1_epi32(PRIME32_2);

 /*
  * 4-way SIMD calculations with 4 ints in two blocks for two
  * accumulators will interleave to some extent on a superscalar
  * processor(instruction level parallelism) providing 10% - 14%
  * speedup over original xxhash depending on processor. We could
  * have used aligned loads but we actually want the unaligned
  * penalty. It helps to interleave better for a slight benefit
  * over aligned loads here!
  */
do {
     __m128i mem = _mm_loadu_si128((__m128i *)p);
     p += 16;
     mem = _mm_mullo_epi32(mem, prime2);
     accum = _mm_add_epi32(accum, mem);
     accum = _x_mm_rotl_epi32(accum, 13);
     accum = _mm_mullo_epi32(accum, prime1);

     mem = _mm_loadu_si128((__m128i *)p);
     p += 16;
     mem = _mm_mullo_epi32(mem, prime2);
     accum1 = _mm_add_epi32(accum1, mem);
     accum1 = _x_mm_rotl_epi32(accum1, 13);
     accum1 = _mm_mullo_epi32(accum1, prime1);
 } while (p<=limit);

 _mm_storeu_si128((__m128i *)vx, accum);
 _mm_storeu_si128((__m128i *)vx1, accum1);

 /*
  * Combine the two accumulators into a single hash value.
  */
 v1 = vx[0]; v2 = vx[1];
 v3 = vx[2]; v4 = vx[3];
 v1 += vx1[0] * PRIME32_2; v1 = XXH_rotl32(v1, 13); v1 *= PRIME32_1;
 v2 += vx1[1] * PRIME32_2; v2 = XXH_rotl32(v2, 13); v2 *= PRIME32_1;
 v3 += vx1[2] * PRIME32_2; v3 = XXH_rotl32(v3, 13); v3 *= PRIME32_1;
 v4 += vx1[3] * PRIME32_2; v4 = XXH_rotl32(v4, 13); v4 *= PRIME32_1;
 h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) + XXH_rotl32(v4, 18);

Functionally this is processing data in interleaved blocks of 16 bytes per block. This gives us two independent processing streams which should leverage the superscalar out-of-order execution capabilities of x86 processors and give a speedup. Diagrammatically this processing can be depicted as below.

sse_hash_small

Now let's look at the performance metrics of this new approach.

1) xxHash benchmark tool:
~/xxhash-read-only $ ./bench silesia.tar
*** Benchmark utility using xxHash algorithm , by Yann Collet (Jan 19 2013) ***
silesia.tar      :  206612480 ->  4028.9 MB/s   0x140BD4B4

2) Perf annotate output:
  5.95 │ 70:   movdqu (%rdi),%xmm1
  0.16 │       movdqu 0x10(%rdi),%xmm0
 14.91 │       pmulld %xmm5,%xmm1
  5.97 │       paddd  %xmm3,%xmm1
  0.05 │       movdqa %xmm1,%xmm3
  0.14 │       psrld  $0x13,%xmm1
  5.51 │       add    $0x20,%rdi
  0.60 │       pmulld %xmm5,%xmm0
  0.86 │       paddd  %xmm2,%xmm0
  5.32 │       movdqa %xmm0,%xmm2
  0.80 │       pslld  $0xd,%xmm3
  2.57 │       psrld  $0x13,%xmm0
  9.24 │       por    %xmm1,%xmm3
  0.05 │       pslld  $0xd,%xmm2
  5.94 │       por    %xmm0,%xmm2
  5.92 │       cmp    %rdi,%rax
  0.08 │       pmulld %xmm4,%xmm3
 27.12 │       pmulld %xmm4,%xmm2
  8.82 │     ↑ jae    70

3) Perf stat output:
*** Benchmark utility using xxHash algorithm , by Yann Collet (Jan 19 2013) ***
silesia.tar      :  206612480 ->  3997.0 MB/s   0x140BD4B4

 Performance counter stats for './bench silesia.tar':

    6183.451857 task-clock                #    0.995 CPUs utilized
            531 context-switches          #    0.086 K/sec
              7 CPU-migrations            #    0.001 K/sec
         50,625 page-faults               #    0.008 M/sec
 13,114,838,000 cycles                    #    2.121 GHz                     [83.28%]
  7,687,625,582 stalled-cycles-frontend   #   58.62% frontend cycles idle    [83.32%]
  2,157,403,461 stalled-cycles-backend    #   16.45% backend  cycles idle    [66.68%]
 14,508,523,299 instructions              #    1.11  insns per cycle
                                          #    0.53  stalled cycles per insn [83.34%]
    783,565,815 branches                  #  126.720 M/sec                   [83.32%]
        109,017 branch-misses             #    0.01% of all branches         [83.40%]
    6.211583495 seconds time elapsed

As you can notice there is a reduction in the Frontend cycles idle parameter and a good increase in the instructions per cycle. What is more significant is that throughput has increased from 3571 MB/s in the original to 4028 MB/s in the new one. A good 12.8% improvement over the original. I also did another benchmark using SMhasher, a hash function test suite. The results are below.

Original
========
Bulk speed test - 262144-byte keys
Alignment  0 -  1.865 bytes/cycle - 5335.63 MiB/sec @ 3 ghz
Alignment  1 -  1.863 bytes/cycle - 5329.93 MiB/sec @ 3 ghz
Alignment  2 -  1.863 bytes/cycle - 5329.93 MiB/sec @ 3 ghz
Alignment  3 -  1.863 bytes/cycle - 5329.94 MiB/sec @ 3 ghz
Alignment  4 -  1.868 bytes/cycle - 5345.33 MiB/sec @ 3 ghz
Alignment  5 -  1.864 bytes/cycle - 5334.29 MiB/sec @ 3 ghz
Alignment  6 -  1.864 bytes/cycle - 5334.27 MiB/sec @ 3 ghz
Alignment  7 -  1.864 bytes/cycle - 5334.28 MiB/sec @ 3 ghz

Modified
========
Bulk speed test - 262144-byte keys
Alignment  0 -  2.144 bytes/cycle - 6132.86 MiB/sec @ 3 ghz
Alignment  1 -  2.123 bytes/cycle - 6074.16 MiB/sec @ 3 ghz
Alignment  2 -  2.123 bytes/cycle - 6074.21 MiB/sec @ 3 ghz
Alignment  3 -  2.133 bytes/cycle - 6102.57 MiB/sec @ 3 ghz
Alignment  4 -  2.122 bytes/cycle - 6072.41 MiB/sec @ 3 ghz
Alignment  5 -  2.122 bytes/cycle - 6072.35 MiB/sec @ 3 ghz
Alignment  6 -  2.123 bytes/cycle - 6073.10 MiB/sec @ 3 ghz
Alignment  7 -  2.123 bytes/cycle - 6073.19 MiB/sec @ 3 ghz

SMhasher reports almost a 14% improvement in the worst case. For aligned data blocks it is processing at almost 6 GB/s! The small-key results are not shown here, but there is an increase of 3 cycles in the worst case. I feel that this marginal reduction in performance for small buffers/keys is acceptable given the gain for large buffers.

All these numbers are from my laptop. I’d expect better gains on better boxes and processors. I will have to run the full SMHasher test suite to validate my approach. My profit from all this is the learning experience and the potential speedup of deduplication in Pcompress.