
Pcompress 1.3 released

I have put up a new release of Pcompress on the Google Code download page. This release focuses primarily on performance enhancements across the board, along with a few bug fixes. The changes are summarized below:

  1. Deduplication performance has improved by at least 2.5X as a result of a variety of tweaks to the core chunking code.
    • One of the interesting changes is to use a 16-byte SSE register as the sliding window since I am using a window size of 16. This avoids a lot of memory accesses but requires SSE4.
    • The perf utility allowed me to see that using the window position counter as a context variable causes a spurious memory store for every byte! Using a local variable allows optimization via a register. This optimization affects the situation where we do not have SSE4.
    • Compute the full fingerprint only when at least minimum chunk length bytes have been consumed.
  2. Delta Compression performance and effectiveness have both been improved. I have tweaked the minhash approach to avoid storing and using fingerprints; that approach was causing memory write amplification and a significant slowdown. Instead I am just treating the raw data as a sequence of 64-bit integers and heapifying them. Bsdiff performance has been improved along with RLE encoding. I also tweaked the matching approach: it now checks for similar blocks that are some distance apart, depending on the compression algorithm. This causes long-range similar blocks to eventually get delta-ed, which helps the overall compression.
  3. One of the big changes is the inclusion of the BLAKE2 checksum algorithm and making it the default. BLAKE2 is one of the highest-performing cryptographic checksums, exceeding even MD5 in performance on 64-bit platforms. It is derived from BLAKE, one of the NIST SHA3 finalists, and retains a large security margin.
  4. I have tweaked Yann Collet’s xxHash algorithm (non-cryptographic hash) to vectorize it and make it work with 32-byte blocks. Performance is improved for both vectorized and non-vectorized versions. I have covered this in detail in my previous post: Vectorizing xxHash for fun and profit.
  5. I have tweaked the AES CTR mode implementation to vectorize it. CTR mode encrypts a 16-byte block consisting of a 64-bit nonce or salt value and a 64-bit block counter value concatenated together. This is then XOR-ed with 16 bytes of plaintext to generate 16 bytes of ciphertext. The block counter is then incremented and the process repeated. This 16-byte XOR can be done neatly in an XMM register (see the sketch after this list). The result is faster even with unaligned SSE2 loads, helped a little by data prefetch instructions.
  6. Apart from BLAKE2 I also included Intel’s optimized SHA512 implementation for x86 processors and moved to using SHA512/256. This improves SHA2 performance significantly on x86 platforms.
  7. BLAKE2 includes a parallel mode. I also included simple 2-way parallel modes for other hashes including KECCAK when compressing a single file in a single chunk. This is essentially a single-threaded operation so other forms of parallelism need to be employed.
  8. With all the vectorization being thrown around with SSE2/3/4 and AVX1/2 versions of various stuff, I have also added runtime CPU feature detection to invoke the appropriate version for the CPU. At least SSE2 capability is assumed. At this point I really have no intention of supporting Pentium and Atom processors! This also requires one to use at least the Gcc 4.4 compiler so that things like SSE4.2 and AVX intrinsics can be compiled even if CPU support for them is not available.
  9. In addition to all the above some bug fixes have also gone into this release.
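
Below is a minimal, hypothetical sketch of the CTR XOR step described in item 5. Here encrypt_block() merely stands in for the actual AES block encryption and the counter-block layout is simplified, so treat this as an illustration of the XMM XOR rather than the actual Pcompress code.

#include <stdint.h>
#include <string.h>
#include <emmintrin.h>  /* SSE2 */

/* Stand-in for the AES block encryption of the counter block. */
extern void encrypt_block(const uint8_t in[16], uint8_t out[16], const void *key);

/*
 * keystream = E(nonce || counter); ciphertext = plaintext XOR keystream.
 * The 16-byte XOR is done in a single XMM register.
 */
void ctr_xor(const uint8_t *pt, uint8_t *ct, size_t nblocks,
             uint64_t nonce, uint64_t counter, const void *key)
{
    uint8_t cblk[16], ks[16];
    __m128i p, k;
    size_t i;

    for (i = 0; i < nblocks; i++) {
        memcpy(cblk, &nonce, 8);        /* 64-bit nonce/salt */
        memcpy(cblk + 8, &counter, 8);  /* 64-bit block counter */
        encrypt_block(cblk, ks, key);   /* one keystream block */

        p = _mm_loadu_si128((const __m128i *)(pt + i * 16));
        k = _mm_loadu_si128((const __m128i *)ks);
        _mm_storeu_si128((__m128i *)(ct + i * 16), _mm_xor_si128(p, k));
        counter++;
    }
}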

However this is in no way the full gamut of optimizations possible. There are more changes to be done. For example I need to add support for an optimized AES GCM mode. This is a block cipher mode of operation which combines encryption and authentication, avoiding the need for a separate HMAC. HMAC is still useful for situations where one may want to authenticate but not encrypt. Deduplication performance can be further improved by at least 2X; the current chunking code has a silly oversight. HMAC needs to support parallel modes. I also need to enable parallel operation for LZP in single-chunk modes. In addition I want to explore use of GPGPUs and CUDA for hashing, chunking etc.

Vectorizing xxHash for Fun and Profit

Update: Thanks to Yann Collet’s comment I have fixed the mistaken reference to Hyperthreading; what I was in fact referring to is Instruction Level Parallelism via superscalar processing.

I was doing various performance optimizations in Pcompress over the last few days and the xxHash piece caught my eye. This non-cryptographic hash implementation is one of the fastest 32-bit hashes available today. I use it in Pcompress to hash data chunks for deduplication; the hash value is used to insert the chunk-id into a hashtable for lookups. xxHash is already extremely fast, working at 5.4 GB/s on a fast processor, however its inner loop looked very interesting.

 do
 {
     v1 += XXH_LE32(p) * PRIME32_2; v1 = XXH_rotl32(v1, 13); v1 *= PRIME32_1; p+=4;
     v2 += XXH_LE32(p) * PRIME32_2; v2 = XXH_rotl32(v2, 13); v2 *= PRIME32_1; p+=4;
     v3 += XXH_LE32(p) * PRIME32_2; v3 = XXH_rotl32(v3, 13); v3 *= PRIME32_1; p+=4;
     v4 += XXH_LE32(p) * PRIME32_2; v4 = XXH_rotl32(v4, 13); v4 *= PRIME32_1; p+=4;
 } while (p<=limit);

 h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) + XXH_rotl32(v4, 18);

There are 4 independent compute streams consuming data into 4 separate accumulators which are then combined into a single hash value later. These 4 independent but identical compute streams look like a perfect fit for vectorizing using the SSE instructions on modern x86 processors. The SSE registers can hold multiple packed integers or floats, and a single operation, for example addition, applied to two SSE registers affects all the packed values in parallel. The image below shows this behavior diagrammatically when the SSE registers are populated with 4 32-bit quantities.

[Image: simd]
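
As a tiny standalone illustration (not code from Pcompress or xxHash), a single _mm_add_epi32 adds four pairs of packed 32-bit integers in one shot:

#include <stdio.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

int main(void)
{
    /* _mm_set_epi32 takes its arguments from most significant lane to least. */
    __m128i a = _mm_set_epi32(4, 3, 2, 1);       /* lanes: 1, 2, 3, 4 */
    __m128i b = _mm_set_epi32(40, 30, 20, 10);   /* lanes: 10, 20, 30, 40 */
    int r[4];

    _mm_storeu_si128((__m128i *)r, _mm_add_epi32(a, b));
    printf("%d %d %d %d\n", r[0], r[1], r[2], r[3]);  /* prints 11 22 33 44 */
    return (0);
}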

An SSE register is 128 bits wide and can perfectly hold four packed 32-bit integers. I had to use at least SSE 4.1 to get the parallel multiply of packed 32-bit values. The rotate (XXH_rotl32) operation is easily done with 2 bitwise shifts and a bitwise or (as in the original xxHash code). Without having to deal with inline assembly language we can fairly easily use the compiler SSE intrinsics. These appear to the programmer as functions but in fact they result in specific CPU instructions being emitted. SSE intrinsics also provide the benefit of letting the compiler handle optimizations like instruction scheduling. Using inline assembly leaves all that optimization burden to the programmer, which is often tricky to get right except for short segments. In addition some of the intrinsics are composite intrinsics: they emit a collection of CPU instructions encapsulating some commonly used operation (like _mm_set_epi32(…)).

You can get a good overview of SSE programming from the CodeProject article. My initial effort led to the code shown below.

#include <smmintrin.h>
static inline __m128i _x_mm_rotl_epi32(const __m128i a, int bits)
{
    __m128i tmp1 = _mm_slli_epi32(a, bits);
    __m128i tmp2 = _mm_srli_epi32(a, 32 - bits);
    return (_mm_or_si128(tmp1, tmp2));
}
...
...
 __m128i accum = _mm_set_epi32(v4, v3, v2, v1);
 __m128i prime1 = _mm_set1_epi32(PRIME32_1);
 __m128i prime2 = _mm_set1_epi32(PRIME32_2);
 do {
     __m128i mem = _mm_loadu_si128((__m128i *)p);
     mem = _mm_mullo_epi32(mem, prime2);
     accum = _mm_add_epi32(accum, mem);
     accum = _x_mm_rotl_epi32(accum, 13);
     accum = _mm_mullo_epi32(accum, prime1);
     p += 16;
 } while (p<=limit);
 _mm_storeu_si128((__m128i *)vx, accum);
 v1 = vx[0]; v2 = vx[1];
 v3 = vx[2]; v4 = vx[3];
 h32 = XXH_rotl32(v1,1) + XXH_rotl32(v2,7) + XXH_rotl32(v3,12) + XXH_rotl32(v4,18);
...

It worked perfectly. However there was a big problem. It was slower than before, and by a good margin! Look at the benchmarks below using the standard silesia test dataset.

Original
========
~/xxhash-read-only $ ./bench silesia.tar
*** Benchmark utility using xxHash algorithm ,by Yann Collet (Jan 19 2013) ***
silesia.tar      :  206612480 ->  3571.1 MB/s   0x61B1160F

Modified
========
~/xxhash-read-only $ ./bench silesia.tar
*** Benchmark utility using xxHash algorithm ,by Yann Collet (Jan 19 2013) ***
silesia.tar      :  206612480 ->  3013.3 MB/s   0x61B1160F

It is 15% slower after vectorizing. This seemed ridiculous to me. I was expecting at least a 10% speedup, and here we have parallelized code that is in fact slower! If you notice, my initial code does unaligned loads via “_mm_loadu_si128((__m128i *)p)”. Fearing that to be the cause of the latency I did some tweaks to ensure aligned access, processing the initial unaligned bytes separately. The performance difference still remained!

I first thought that the SSE instructions had higher latency than the ordinary integer ones. So I did some measurements using the Perf tool. The results are shown below.

Original
========
1) Annotation from perf record:
 10.42 │ 30:   mov    (%rdi),%r10d                                                                                                                                     
 14.03 │       imul   $0x85ebca77,%r10d,%r10d                                                                                                                          
  0.34 │       add    %r10d,%r8d                                                                                                                                       
  0.21 │       mov    0x4(%rdi),%r10d                                                                                                                                  
 10.53 │       rol    $0xd,%r8d                                                                                                                                        
  0.23 │       imul   $0x9e3779b1,%r8d,%r8d                                                                                                                            
  0.77 │       imul   $0x85ebca77,%r10d,%r10d                                                                                                                          
  0.02 │       add    %r10d,%ecx                                                                                                                                       
 10.54 │       mov    0x8(%rdi),%r10d                                                                                                                                  
  0.06 │       rol    $0xd,%ecx                                                                                                                                        
  0.34 │       imul   $0x9e3779b1,%ecx,%ecx                                                                                                                            
 10.38 │       imul   $0x85ebca77,%r10d,%r10d                                                                                                                          
  0.30 │       add    %r10d,%edx                                                                                                                                       
  0.02 │       mov    0xc(%rdi),%r10d                                                                                                                                  
  0.07 │       add    $0x10,%rdi                                                                                                                                       
 10.07 │       rol    $0xd,%edx                                                                                                                                        
  0.32 │       imul   $0x9e3779b1,%edx,%edx                                                                                                                            
 10.30 │       imul   $0x85ebca77,%r10d,%r10d                                                                                                                          
  0.12 │       add    %r10d,%eax                                                                                                                                       
  0.22 │       rol    $0xd,%eax                                                                                                                                        
  0.15 │       imul   $0x9e3779b1,%eax,%eax                                                                                                                            
 20.52 │       cmp    %rdi,%r11                                                                                                                                        
  0.03 │     ↑ jae    30

2) Performance counters from perf stat:
*** Benchmark utility using xxHash algorithm , by Yann Collet (Jan 19 2013) ***
silesia.tar      :  206612480 ->  3541.4 MB/s   0x61B1160F

 Performance counter stats for './bench silesia.tar':

    6260.562213 task-clock                #    0.996 CPUs utilized
            544 context-switches          #    0.087 K/sec
              4 CPU-migrations            #    0.001 K/sec
         50,624 page-faults               #    0.008 M/sec
 13,278,187,173 cycles                    #    2.121 GHz                     [83.41%]
  3,657,672,543 stalled-cycles-frontend   #   27.55% frontend cycles idle    [83.30%]
  6,834,837,811 stalled-cycles-backend    #   51.47% backend  cycles idle    [66.70%]
 31,374,782,178 instructions              #    2.36  insns per cycle
                                          #    0.22  stalled cycles per insn [83.40%]
  1,384,012,974 branches                  #  221.068 M/sec                   [83.30%]
         95,141 branch-misses             #    0.01% of all branches         [83.30%]

    6.288590138 seconds time elapsed

Modified
========
1) Annotation from perf record:
  9.38 │ 70:┌─→movdqu (%rdi),%xmm0
  9.52 │    │  add    $0x10,%rdi
  0.00 │    │  cmp    %rdi,%r9
  0.00 │    │  pmulld %xmm3,%xmm0
  9.04 │    │  paddd  %xmm1,%xmm0
  0.15 │    │  movdqa %xmm0,%xmm1
  0.04 │    │  psrld  $0x13,%xmm0
  9.19 │    │  pslld  $0xd,%xmm1
  0.27 │    │  por    %xmm0,%xmm1
  8.95 │    │  pmulld %xmm2,%xmm1
 53.44 │    └──jae    70

2) Performance counters from perf stat:
 Performance counter stats for './bench silesia.tar':

    6265.021121 task-clock                #    0.996 CPUs utilized
            546 context-switches          #    0.087 K/sec
              2 CPU-migrations            #    0.000 K/sec
         50,625 page-faults               #    0.008 M/sec
 13,288,539,258 cycles                    #    2.121 GHz                     [83.36%]
  8,562,225,224 stalled-cycles-frontend   #   64.43% frontend cycles idle    [83.31%]
  3,763,596,002 stalled-cycles-backend    #   28.32% backend  cycles idle    [66.72%]
 12,813,511,969 instructions              #    0.96  insns per cycle
                                          #    0.67  stalled cycles per insn [83.41%]
  1,176,939,765 branches                  #  187.859 M/sec                   [83.31%]
        102,390 branch-misses             #    0.01% of all branches         [83.30%]

    6.292536751 seconds time elapsed

There are a few interesting observations here. The perf annotation tells us that there is not much difference in combined instruction latencies between the normal C and the SSE implementations. However the performance counter statistics have a story to tell; the interesting entries are the stall counters and instructions per cycle. Backend stalls decreased but Frontend stalls increased about 2.3x. What is more interesting is that the normal C version issued 2.36 instructions per cycle while the SSE version dropped to 0.96, well under half of that. There is not much code in the small loop to cause stalls, so the only explanation for the increased Frontend stalls is memory read latency, which also causes fewer instructions to get issued.

The big question was why there was so much memory access latency in the SSE version compared to the normal C version. I pondered this for some time while trying out a variety of experiments to no avail. I tried things like hand-optimized assembly and trying to hide memory latency by loop hoisting. Then, looking at the disassembly from the perf annotate of the original code again, something looked interesting. The instructions from the 4 independent computation lines are all interleaved. More precisely, instructions from independent computation sequences were interleaved in two pairs. Suddenly a bulb lit up somewhere. We have a gotcha named superscalar processing, and the key thing here is “independent computations”. Intel and AMD x86 CPUs can execute multiple instructions per clock cycle, not necessarily in chronological order, if they are independent of each other and utilize separate processing resources within the core. This is a form of instruction-level parallelism that is different from vector processing. Within the core some resources are duplicated and some shared.

I was using a Core i5 from the Nehalem generation which typically has 4 decoders in the front-end pipeline and a superscalar out-of-order execution engine that can dispatch up to 6 micro-ops per cycle. The advantage of this is that if an instruction from one stream stalls waiting on a resource, like memory, then independent instructions from the other streams can be processed. In addition, if two compute operations require different resources they can be executed simultaneously.

[Image: superscalar]

The original xxHash code fully utilizes this feature, helped by the Gcc compiler which recognizes this capability and interleaves independent instructions to help the superscalar engine achieve maximum instruction-level parallelism. In my SSE version I was utilizing a vector parallel unit that can handle four 32-bit integers in a single instruction. So there was only a single vectorized instruction stream to handle all 4 independent operations. The whole Frontend, in other words the fetch-decode unit, had to stall during a memory fetch.

So the next question was: could we improve this situation? How about multiple independent SIMD SSE streams to better utilize the ILP capability? It turns out that this is doable but results in a slight change to the hashing approach. I implemented 2 independent SSE streams, each processing 16 bytes into an accumulator. There are enough SSE registers available to do this without spurious memory accesses. The accumulators are then combined at the end into the final hash result. This changes the resulting hash value since there are now 2 independent streams processing data in independent strides without combining them. Even then the xxHash properties are retained due to the way the final combination is done. My new code is shown below.

#include <smmintrin.h>
static inline __m128i _x_mm_rotl_epi32(const __m128i a, int bits)
{
     __m128i tmp1 = _mm_slli_epi32(a, bits);
     __m128i tmp2 = _mm_srli_epi32(a, 32 - bits);
     return (_mm_or_si128(tmp1, tmp2));
}
...
...
 unsigned int vx[4], vx1[4];
 __m128i accum = _mm_set_epi32(v4, v3, v2, v1);
 __m128i accum1 = _mm_set_epi32(v4, v3, v2, v1);
 __m128i prime1 = _mm_set1_epi32(PRIME32_1);
 __m128i prime2 = _mm_set1_epi32(PRIME32_2);

 /*
  * 4-way SIMD calculations with 4 ints in two blocks for two
  * accumulators will interleave to some extent on a superscalar
  * processor(instruction level parallelism) providing 10% - 14%
  * speedup over original xxhash depending on processor. We could
  * have used aligned loads but we actually want the unaligned
  * penalty. It helps to interleave better for a slight benefit
  * over aligned loads here!
  */
do {
     __m128i mem = _mm_loadu_si128((__m128i *)p);
     p += 16;
     mem = _mm_mullo_epi32(mem, prime2);
     accum = _mm_add_epi32(accum, mem);
     accum = _x_mm_rotl_epi32(accum, 13);
     accum = _mm_mullo_epi32(accum, prime1);

     mem = _mm_loadu_si128((__m128i *)p);
     p += 16;
     mem = _mm_mullo_epi32(mem, prime2);
     accum1 = _mm_add_epi32(accum1, mem);
     accum1 = _x_mm_rotl_epi32(accum1, 13);
     accum1 = _mm_mullo_epi32(accum1, prime1);
 } while (p<=limit);

 _mm_storeu_si128((__m128i *)vx, accum);
 _mm_storeu_si128((__m128i *)vx1, accum1);

 /*
  * Combine the two accumulators into a single hash value.
  */
 v1 = vx[0]; v2 = vx[1];
 v3 = vx[2]; v4 = vx[3];
 v1 += vx1[0] * PRIME32_2; v1 = XXH_rotl32(v1, 13); v1 *= PRIME32_1;
 v2 += vx1[1] * PRIME32_2; v2 = XXH_rotl32(v2, 13); v2 *= PRIME32_1;
 v3 += vx1[2] * PRIME32_2; v3 = XXH_rotl32(v3, 13); v3 *= PRIME32_1;
 v4 += vx1[3] * PRIME32_2; v4 = XXH_rotl32(v4, 13); v4 *= PRIME32_1;
 h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) + XXH_rotl32(v4, 18);

Functionally this is processing data in interleaved blocks of 16 bytes each. This gives us two independent processing streams which should leverage the superscalar out-of-order execution capabilities of x86 processors and give a speedup. Diagrammatically this processing can be depicted as below.

[Image: sse_hash_small]

Now let’s look at the performance metrics of this new approach.

1) xxHash benchmark tool:
~/xxhash-read-only $ ./bench silesia.tar
*** Benchmark utility using xxHash algorithm , by Yann Collet (Jan 19 2013) ***
silesia.tar      :  206612480 ->  4028.9 MB/s   0x140BD4B4

2) Perf annotate output:
  5.95 │ 70:   movdqu (%rdi),%xmm1
  0.16 │       movdqu 0x10(%rdi),%xmm0
 14.91 │       pmulld %xmm5,%xmm1
  5.97 │       paddd  %xmm3,%xmm1
  0.05 │       movdqa %xmm1,%xmm3
  0.14 │       psrld  $0x13,%xmm1
  5.51 │       add    $0x20,%rdi
  0.60 │       pmulld %xmm5,%xmm0
  0.86 │       paddd  %xmm2,%xmm0
  5.32 │       movdqa %xmm0,%xmm2
  0.80 │       pslld  $0xd,%xmm3
  2.57 │       psrld  $0x13,%xmm0
  9.24 │       por    %xmm1,%xmm3
  0.05 │       pslld  $0xd,%xmm2
  5.94 │       por    %xmm0,%xmm2
  5.92 │       cmp    %rdi,%rax
  0.08 │       pmulld %xmm4,%xmm3
 27.12 │       pmulld %xmm4,%xmm2
  8.82 │     ↑ jae    70

3) Perf stat output:
*** Benchmark utility using xxHash algorithm , by Yann Collet (Jan 19 2013) ***
silesia.tar      :  206612480 ->  3997.0 MB/s   0x140BD4B4

 Performance counter stats for './bench silesia.tar':

    6183.451857 task-clock                #    0.995 CPUs utilized
            531 context-switches          #    0.086 K/sec
              7 CPU-migrations            #    0.001 K/sec
         50,625 page-faults               #    0.008 M/sec
 13,114,838,000 cycles                    #    2.121 GHz                     [83.28%]
  7,687,625,582 stalled-cycles-frontend   #   58.62% frontend cycles idle    [83.32%]
  2,157,403,461 stalled-cycles-backend    #   16.45% backend  cycles idle    [66.68%]
 14,508,523,299 instructions              #    1.11  insns per cycle
                                          #    0.53  stalled cycles per insn [83.34%]
    783,565,815 branches                  #  126.720 M/sec                   [83.32%]
        109,017 branch-misses             #    0.01% of all branches         [83.40%]
    6.211583495 seconds time elapsed

As you can see there is a reduction in the Frontend cycles idle parameter and a good increase in the instructions per cycle. What is more significant is that throughput has increased from 3571 MB/s in the original to 4028 MB/s in the new version, a good 12.8% improvement over the original. I also did another benchmark using SMHasher, a hash function test suite. The results are below.

Original
========
Bulk speed test - 262144-byte keys
Alignment  0 -  1.865 bytes/cycle - 5335.63 MiB/sec @ 3 ghz
Alignment  1 -  1.863 bytes/cycle - 5329.93 MiB/sec @ 3 ghz
Alignment  2 -  1.863 bytes/cycle - 5329.93 MiB/sec @ 3 ghz
Alignment  3 -  1.863 bytes/cycle - 5329.94 MiB/sec @ 3 ghz
Alignment  4 -  1.868 bytes/cycle - 5345.33 MiB/sec @ 3 ghz
Alignment  5 -  1.864 bytes/cycle - 5334.29 MiB/sec @ 3 ghz
Alignment  6 -  1.864 bytes/cycle - 5334.27 MiB/sec @ 3 ghz
Alignment  7 -  1.864 bytes/cycle - 5334.28 MiB/sec @ 3 ghz

Modified
========
Bulk speed test - 262144-byte keys
Alignment  0 -  2.144 bytes/cycle - 6132.86 MiB/sec @ 3 ghz
Alignment  1 -  2.123 bytes/cycle - 6074.16 MiB/sec @ 3 ghz
Alignment  2 -  2.123 bytes/cycle - 6074.21 MiB/sec @ 3 ghz
Alignment  3 -  2.133 bytes/cycle - 6102.57 MiB/sec @ 3 ghz
Alignment  4 -  2.122 bytes/cycle - 6072.41 MiB/sec @ 3 ghz
Alignment  5 -  2.122 bytes/cycle - 6072.35 MiB/sec @ 3 ghz
Alignment  6 -  2.123 bytes/cycle - 6073.10 MiB/sec @ 3 ghz
Alignment  7 -  2.123 bytes/cycle - 6073.19 MiB/sec @ 3 ghz

SMHasher reports almost a 14% improvement in the worst case. For aligned data blocks it is processing at almost 6 GB/s! The small-key results are not shown here, but there is an increase of 3 cycles in the worst case. I feel that this marginal reduction in performance for small buffers/keys is acceptable given the gain for large buffers.

All these numbers are from my laptop. I’d expect better gains on better boxes and processors. I will have to run the full SMHasher test suite to validate my approach. My profit from all this is the learning experience and the potential speedup of deduplication in Pcompress.

Galois Fields @ USENIX FAST 2013

USENIX sponsors an annual storage-related conference called File and Storage Technologies, or FAST for short. It has been happening for more than a decade and this year’s FAST has lined up an impressive list of sessions. Of all those, there is one session that caught my attention: https://www.usenix.org/conference/fast13/screaming-fast-galois-field-arithmetic-using-sse2-extensions

Galois Field or Finite Field is a construct out of Abstract Algebra which raises its spooky head whenever Cryptography, Error Correcting Codes, Rabin Deduplication, number theory and related topics are concerned. It is interesting to see EMC having done work on speeding up GF computations on modern x86 CPUs using the SSE SIMD instruction set. However it does not appear to be entirely new work; there has already been work done on this aspect. For example look at the following links:

Optimizing Forward ECC Codes using SIMD instructions

Impact of Intel’s New Instruction Sets on Software Implementation of GF(2)[x] Multiplication

gf2x – A library for multiplying polynomials over the binary field
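
For readers who have not run into it, the “arithmetic” in question is carry-less: addition is XOR and multiplication is a polynomial multiply reduced modulo an irreducible polynomial. A minimal scalar GF(2^8) multiply using the AES polynomial (x^8 + x^4 + x^3 + x + 1) looks like the sketch below; it only illustrates the operation that such papers and libraries vectorize, and is not taken from any of them.

#include <stdint.h>

/* "Russian peasant" multiplication in GF(2^8), reduction polynomial 0x11B. */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;

    while (b) {
        if (b & 1)
            p ^= a;                              /* addition in GF(2^n) is XOR */
        if (a & 0x80)
            a = (uint8_t)((a << 1) ^ 0x1B);      /* reduce after shifting out x^7 */
        else
            a <<= 1;
        b >>= 1;
    }
    return (p);
}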

EMC is obviously interested in speeding up Data Deduplication in its Data Domain products. So we should see subsequent releases of Data Domain coming with this optimization. It will be interesting to see what other work the EMC paper cites when it is eventually made public.

Of the above three links the first one is of most interest to me. I am looking to implement error correcting codes in Pcompress and that article’s author provides a nice SIMD-optimized ECC library called FECpp.

Pcompress 1.2 Released

Pcompress 1.2 is now available and can be grabbed from here: https://code.google.com/p/pcompress/downloads/list. This is a major release containing a large number of bug fixes and improvements. There are performance and stability improvements, code cleanup, resolution of corner cases etc. The biggest new additions to this release are the new Delta2 Encoding and support for the Keccak message digest. Keccak has been announced as the NIST SHA3 standard secure hash. The SIMD (SSE) optimized variant of Keccak runs faster than SHA256 on x86 platforms. However it is still slower than SKEIN, so SKEIN remains the default hash algorithm for data integrity verification. In addition Deduplication is now significantly faster.

Delta2 Encoding, as I had mentioned in a previous post, probes for embedded tables of numeric sequences in the data and encodes them by collapsing each arithmetic sequence into its parameters: starting value, increment/decrement and number of terms (a small sketch of the idea follows below). This generally provides benefits across different kinds of data and can be combined with LZP preprocessing to enable the final compression algorithm to achieve a compression ratio beyond what it can normally achieve. This encoding works very fast and still manages to detect a good number of numeric sequences if they are present in the data.
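
The idea can be sketched with a hypothetical run descriptor; the actual Delta2 header layout and scanning logic in Pcompress are different, so this only shows how an arithmetic progression collapses into three fields.

#include <stdint.h>

typedef struct {
    uint64_t start;  /* first value of the progression */
    int64_t  delta;  /* common increment/decrement */
    uint32_t count;  /* number of terms */
} delta_run_t;

/*
 * If the n values form a single arithmetic progression, emit the 3-field
 * descriptor instead of n * 8 bytes of literal data. Returns 1 on success.
 */
static int encode_run(const uint64_t *vals, uint32_t n, delta_run_t *out)
{
    uint32_t i;
    int64_t d;

    if (n < 3)
        return (0);  /* too short to be worth collapsing */
    d = (int64_t)(vals[1] - vals[0]);
    for (i = 2; i < n; i++) {
        if ((int64_t)(vals[i] - vals[i - 1]) != d)
            return (0);  /* not a simple progression */
    }
    out->start = vals[0];
    out->delta = d;
    out->count = n;
    return (1);
}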

I have extended the statistics mode to display additional data including throughput figures. Here is an output from a compression run of the silesia corpus test data set:

## CPU: Core i5 430M
## RAM: 8GB
## Compiler: Gcc 4.7.2
##
## Compression
##
./pcompress -D -c lz4 -l1 -L -P -s200m silesia.tar 
Scaling to 1 thread
Checksum computed at 241.321 MB/s
Original size: 206612480, blknum: 46913
Number of maxlen blocks: 0
Total Hashtable bucket collisions: 17225
Merge count: 46750
Deduped size: 206197375, blknum: 242, delta_calls: 0, delta_fails: 0
Chunking speed 112.189 MB/s, Overall Dedupe speed 100.880 MB/s
LZP: Insize: 206196371, Outsize: 192127556
LZP: Processed at 55.839 MB/s
DELTA2: srclen: 192127556, dstlen: 191899643
DELTA2: header overhead: 50800
DELTA2: Processed at 382.530 MB/s
Chunk compression speed 207.908 MB/s
##
## Decompression
##
./pcompress -d silesia.tar.pz silesia.tar.1 
Scaling to 4 threads
Chunk decompression speed 383.488 MB/s
DELTA2: Decoded at 3030.724 MB/s
Checksum computed at 244.235 MB/s

ls -l silesia.tar.pz
-rw-rw-r--. 1 testuser testuser 99115899 Jan  5 21:36 silesia.tar.pz

Note that these are single-threaded performance figures. The entire file is being compressed in a single chunk. The default checksum is SKEIN. Look at the decoding speed of the Delta2 implementation: it is close to a 3 GB/s rate. Next let’s check the performance of the SSE-optimized Keccak:

./pcompress -D -c lz4 -l1 -P -S KECCAK256 -s200m silesia.tar 
Scaling to 1 thread
Checksum computed at 143.904 MB/s
Original size: 206612480, blknum: 46913
Number of maxlen blocks: 0
Total Hashtable bucket collisions: 17225
Merge count: 46750
Deduped size: 206197375, blknum: 242, delta_calls: 0, delta_fails: 0
Chunking speed 111.601 MB/s, Overall Dedupe speed 100.352 MB/s
DELTA2: srclen: 206196371, dstlen: 201217172
DELTA2: header overhead: 570448
DELTA2: Processed at 360.383 MB/s
Chunk compression speed 213.226 MB/s

ls -l silesia.tar.pz
-rw-rw-r--. 1 testuser testuser 100204566 Jan  5 21:34 silesia.tar.pz

This time I left out LZP to show the reduction from Delta2 alone. As you can see, combining LZP and Delta2 gives the greatest reduction. Also see how much slower Keccak is compared to SKEIN. Note that I am using an optimized 64-bit assembly implementation of Skein, but it does not use SSE, whereas Keccak does.

Next let’s have a look at a dataset that has lots of embedded numeric data. I used Global Topographic Elevation map data from USGS:

#
# Compression
#
./pcompress -c lz4 -l1 -P -s100m e020n40.tar 
Scaling to 1 thread
Checksum computed at 237.584 MB/s
DELTA2: srclen: 86599680, dstlen: 43707440
DELTA2: header overhead: 2024320
DELTA2: Processed at 279.484 MB/s
Chunk compression speed 211.112 MB/s

ls -l e020n40.tar.pz 
-rw-rw-r--. 1 testuser testuser 35360062 Jan  5 21:46 e020n40.tar.pz
#
# Decompression
#
./pcompress -d e020n40.tar.pz e020n40.tar.1 
Scaling to 4 threads
Chunk decompression speed 622.394 MB/s
DELTA2: Decoded at 1971.282 MB/s
Checksum computed at 246.015 MB/s

This time I left out dedupe and LZP. The Delta2 benefits are much greater since there is a lot of numeric data. Because there are many more Delta spans in the encoded dataset the decoding speed is lower as well; however it still decodes at 1.9 GB/s.

As can be seen, Delta2 performance is on par with LZ4 and it can be used to improve LZ4 results with very little overhead.

GCC Link time Optimizations need some Salt

I wanted to make use of the much talked about feature in Gcc 4.5+ called Link Time Optimization in Pcompress. Link time optimization promises to bring advanced optimizations across compilation units. For example if you declare a function as “inline” in a source file and call it from other places in the same source file, then all calls will be replaced with the function body itself. However if you call the same function from another source file which is compiled separately, then inlining will not happen across the files. Earlier Gcc support for this was called whole-program optimization but it was cumbersome and in some cases impossible to use. LTO enables a repeat optimization and re-compilation pass during linking of the final executable, providing the ability to do such cross-file optimizations elegantly.
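
To make the cross-file case concrete, here is a hypothetical two-file example (the file names and function are made up, not from Pcompress). Without LTO the compiler cannot inline scale() into main.c because its body is not visible when main.c is compiled; with -flto passed at both the compile and the link steps GCC can perform that inlining at link time.

/* util.c */
int scale(int x) { return (x * 3 + 1); }

/* main.c */
#include <stdio.h>
int scale(int x);

int main(void)
{
    int i, s = 0;

    for (i = 0; i < 1000000; i++)
        s += scale(i);  /* candidate for cross-file inlining under LTO */
    printf("%d\n", s);
    return (0);
}

The corresponding build, roughly what the makefile tweak amounts to:

gcc -O3 -flto -c util.c
gcc -O3 -flto -c main.c
gcc -O3 -flto util.o main.o -o app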

I am using Fedora 16 which has Gcc 4.6.3 with support for LTO, so it was a simple matter of tweaking the makefile. However, after adding the LTO flag something did not feel right. The program felt a tad slower. When doing performance tweaks “feel” is the last thing you want to depend upon, so I added simple timing measurements to various modules within Pcompress. They measure the starting and ending monotonic wall clock times in milliseconds for processing a chunk of data and compute the throughput in terms of MBs per second. What I saw after that was quite strange.

LTO appeared to actually reduce throughput performance by as much as 60%! My laptop had the following specs: Core i5 430M 2.27GHz, 8GB RAM. I was using the Gcc 4.6.3 native build on Fedora. Suspecting something with the distro or the specific Gcc build, I looked at another laptop I had: an AMD Piledriver 1.9 GHz with 4GB RAM running Linux Mint 13. It also had Gcc 4.6.3, albeit a different Ubuntu-derived build. This also produced the same results, a 50% performance drop with LTO enabled. The outputs are below.

Output of “Normal” build without LTO

time ./pcompress -D -c lzmaMt -l14 -L -P -s110m w020n40.tar
Scaling to 2 threads
Original size: 86528000, blknum: 12726
Number of maxlen blocks: 116
Total Hashtable bucket collisions: 4477
Merge count: 12320
Deduped size: 85195878, blknum: 562, delta_calls: 0, delta_fails: 0
Dedupe speed 107.758 MB/s
LZP: Insize: 85193594, Outsize: 35087799
LZP: Processed at 178.023 MB/s
DELTA2: srclen: 35087799, dstlen: 35086691
DELTA2: header overhead: 192
DELTA2: Processed at 247.827 MB/s
Chunk compression speed 1.513 MB/s

real    0m24.058s
user    0m40.485s
sys     0m0.448s

Output of build with LTO enabled

time ./pcompress -D -c lzmaMt -l14 -L -P -s110m w020n40.tar
Scaling to 2 threads
Original size: 86528000, blknum: 12726
Number of maxlen blocks: 116
Total Hashtable bucket collisions: 4477
Merge count: 12320
Deduped size: 85195878, blknum: 562, delta_calls: 0, delta_fails: 0
Dedupe speed 32.146 MB/s
LZP: Insize: 85193594, Outsize: 35087799
LZP: Processed at 67.585 MB/s
DELTA2: srclen: 35087799, dstlen: 35086691
DELTA2: header overhead: 192
DELTA2: Processed at 84.044 MB/s
Chunk compression speed 0.828 MB/s

real    0m45.138s
user    1m17.835s
sys     0m0.612s

The pcompress invocations above use LZMA compression in extreme mode (-l14) and also enable Deduplication, LZP and Delta2 encoding. The tarfile “w020n40.tar” is Global Topographic Data (a Digital Elevation Model) from USGS. I used that dataset since it contains a lot of embedded tables of repeating numeric data that Delta Encoding and LZP (as you can see above in this case) can detect and collapse.

The performance drop with LTO is drastic to say the least. To rule out any mistake in my throughput computation I also included the “time” command, and the performance difference is clear in the time output as well. Similar results were visible on the AMD laptop running Linux Mint 13. In addition to LTO I was passing the following flags to the compiler:

-m64 -msse3 -c  -O3  -ftree-vectorize -floop-interchange -floop-block

I then experimented with omitting those flags and reducing the optimization level to -O2, but to no avail. I even tried “-fno-inline-functions” thinking excessive inlining might be causing cache overflows of loops. That produced the same results. LTO kept churning out lower performance numbers regardless of what I did. Subsequently I tried Gcc 4.7.2, built from upstream sources, and repeated my experiments with exactly the same results as before! Something was broken.

So my next step was to ask for help from the gem of a utility on Linux called simply Perf. This can do a bunch of profiling on the system and apps and also provide CPU performance counter metrics. The tool is similar to some OpenSolaris tools like cpustat, intrstat etc. Perf has a whole bunch of features but I looked at a couple of capabilities. I used “perf record” to collect profiling data from a pcompress run and used “perf stat -d” to dump detailed performance counter statistics. The outputs of perf stat were interesting but provided little clue as to the root cause:

Output of “perf stat -d” for normal non-LTO build

Performance counter stats for './pcompress -D -c lzmaMt -l14 -L -P -s110m w020n40.tar':

   46724.903534 task-clock                #    1.682 CPUs utilized
         21,763 context-switches          #    0.466 K/sec
          1,800 CPU-migrations            #    0.039 K/sec
         49,634 page-faults               #    0.001 M/sec
 97,919,627,864 cycles                    #    2.096 GHz                     [40.01%]
 60,976,690,063 stalled-cycles-frontend   #   62.27% frontend cycles idle    [39.99%]
 42,558,445,705 stalled-cycles-backend    #   43.46% backend  cycles idle    [39.97%]
 83,132,473,549 instructions              #    0.85  insns per cycle
                                          #    0.73  stalled cycles per insn [49.93%]
 10,757,966,536 branches                  #  230.241 M/sec                   [49.95%]
    674,594,850 branch-misses             #    6.27% of all branches         [50.01%]
 20,128,203,574 L1-dcache-loads           #  430.781 M/sec                   [50.05%]
    819,620,127 L1-dcache-load-misses     #    4.07% of all L1-dcache hits   [50.02%]
    470,081,090 LLC-loads                 #   10.061 M/sec                   [40.06%]
    291,066,964 LLC-load-misses           #   61.92% of all LL-cache hits    [39.99%]

    27.781254809 seconds time elapsed

Output of “perf stat -d” for LTO enabled build

Performance counter stats for './pcompress -D -c lzmaMt -l14 -L -P -s110m w020n40.tar':

   80593.375461 task-clock                #    1.759 CPUs utilized
         26,507 context-switches          #    0.329 K/sec
            889 CPU-migrations            #    0.011 K/sec
          2,696 page-faults               #    0.033 K/sec
169,713,512,965 cycles                    #    2.106 GHz                     [40.02%]
116,359,748,762 stalled-cycles-frontend   #   68.56% frontend cycles idle    [40.02%]
 44,479,101,832 stalled-cycles-backend    #   26.21% backend  cycles idle    [40.04%]
147,677,082,105 instructions              #    0.87  insns per cycle
                                          #    0.79  stalled cycles per insn [50.04%]
 12,176,677,580 branches                  #  151.088 M/sec                   [50.05%]
    710,765,008 branch-misses             #    5.84% of all branches         [50.01%]
 73,456,165,490 L1-dcache-loads           #  911.442 M/sec                   [49.98%]
    576,238,614 L1-dcache-load-misses     #    0.78% of all L1-dcache hits   [50.00%]
    361,655,574 LLC-loads                 #    4.487 M/sec                   [39.94%]
    236,828,595 LLC-load-misses           #   65.48% of all LL-cache hits    [39.97%]

    45.811620815 seconds time elapsed

Some of the differences are of particular interest. As you can see, LLC or Last Level Cache (L3 cache) misses increased by about 5% with LTO while the L1 data cache hit rate actually increased. The frontend (primarily the instruction fetch and decode stages) had more stalled cycles with LTO. While a stall can be due to many reasons, coupled with the higher LLC miss ratio it appears that more time was spent waiting to fetch data from memory.

Side Note: You may be wondering why the LLC cache miss ratio is so high even in the normal non-LTO case. Does that point to severe inefficiencies within Pcompress? Not really, since it is LZMA which is the culprit here. LZMA’s memory access pattern is cache-unfriendly, a cost one has to bear to get its almost unbeatable compression ratio. For comparison, I have included the perf stat output for LZ4 compression in the Appendix below.

Next item on the agenda was to capture profile data via “perf record” and view it via “perf report”. Perf report typically lists functions or modules within an executable along with the percentage of time they hogged during the execution, sorted in descending order. I am reproducing the top 6 entries from each profile below.

Perf report snippet for normal non-LTO case

Samples: 172K of event 'cycles', Event count (approx.): 88229375574                                                                                 
 50.31%  pcompress  pcompress              [.] GetMatchesSpec1
 32.22%  pcompress  pcompress              [.] LzmaEnc_CodeOneBlock
  4.02%  pcompress  pcompress              [.] MixMatches3
  2.61%  pcompress  pcompress              [.] GetHeads4b
  1.63%  pcompress  pcompress              [.] dedupe_compress
  1.45%  pcompress  pcompress              [.] MatchFinderMt_GetMatches

Perf report snippet for LTO build

Samples: 326K of event 'cycles', Event count (approx.): 169052159163                                                                                
 39.39%  pcompress  pcompress              [.] GetMatchesSpec1
 29.70%  pcompress  pcompress              [.] GetOptimum.4390
  4.91%  pcompress  pcompress              [.] LitEnc_GetPriceMatched.4145
  3.22%  pcompress  pcompress              [.] MixMatches3
  2.87%  pcompress  pcompress              [.] dedupe_compress
  2.72%  pcompress  pcompress              [.] GetHeads4b.5262

We can immediately notice a few differences. The LTO-enabled build lists a few new symbols that are not present in the non-LTO case. These are actually functions within the LZMA implementation. The numeric suffix on the function names is an LTO artifact. Why are these functions not showing up in the non-LTO build? Let’s check the different binaries using objdump.

Objdump output for normal non-LTO binary

objdump -t pcompress | grep GetOptimum 
[pcompress]$

Objdump output for LTO enabled binary

objdump -t pcompress | grep GetOptimum 
00000000004284c4 g     F .text	00000000000004d5        .hidden GetOptimumFast.4368
000000000042a351 g     F .text	0000000000002412        .hidden GetOptimum.4390

Objdump output for non-LTO DEBUG binary (No Optimizations)

objdump -t pcompress | grep GetOptimum 
000000000040fd05 l     F .text	0000000000002570        GetOptimum
0000000000412275 l     F .text	0000000000000515        GetOptimumFast

Something jumps out as obvious now. The GetOptimum function is not visible in the optimized non-LTO build but is visible in the DEBUG build. This is a large static function in LzmaEnc.c, yet it still gets inlined in the normal optimized build. LTO, however, is actually preventing some useful inlining from happening. I inspected the disassembly using “objdump -d” to verify that. In the LzmaEnc.c source file one of the call sequences is thus:

LzmaEnc_CodeOneMemBlock -> LzmaEnc_CodeOneBlock -> GetOptimum

In the normal non-LTO optimized build LzmaEnc_CodeOneMemBlock() is a single large function with the other functions above inlined and further optimized. In the LTO build the call sequence is thus:

LzmaEnc_CodeOneMemBlock -> LzmaEnc_CodeOneBlock.4411 ->
LzmaEnc_CodeOneBlock.part.8.4405 -> GetOptimum.4390

No wonder this causes a major performance drop: inlining in turn enables a series of other optimizations like Value Range Propagation, Code Motion etc., which LTO forfeits by preventing the inlining. So the lesson of the day is to take LTO with a spoonful of salt. Do not just enable it and be happy. Actually benchmark and verify whether you are benefiting from it or not. We should see some improvement in this space with Gcc 4.8 (see below).

Appendix

Output of perf stat -d for LZ4 compression method

Performance counter stats for './pcompress -D -c lz4 -l14 -L -P -s110m w020n40.tar':

    3943.008621 task-clock                #    0.996 CPUs utilized
            440 context-switches          #    0.112 K/sec
              9 CPU-migrations            #    0.002 K/sec
          2,179 page-faults               #    0.553 K/sec
  8,275,383,339 cycles                    #    2.099 GHz                     [39.97%]
  3,510,592,770 stalled-cycles-frontend   #   42.42% frontend cycles idle    [39.91%]
  2,314,266,052 stalled-cycles-backend    #   27.97% backend  cycles idle    [39.93%]
 10,907,435,475 instructions              #    1.32  insns per cycle
                                          #    0.32  stalled cycles per insn [49.95%]
  1,754,196,387 branches                  #  444.888 M/sec                   [49.93%]
     60,384,786 branch-misses             #    3.44% of all branches         [50.05%]
  2,588,170,552 L1-dcache-loads           #  656.395 M/sec                   [50.10%]
    196,213,428 L1-dcache-load-misses     #    7.58% of all L1-dcache hits   [50.17%]
     57,790,609 LLC-loads                 #   14.656 M/sec                   [40.14%]
      2,870,453 LLC-load-misses           #    4.97% of all LL-cache hits    [40.11%]

      3.957535121 seconds time elapsed

The interesting thing to note here is the LLC cache miss ratio. Due to its near-linear access pattern LZ4 makes good use of the cache. This is one of the reasons why it is so fast. Of course LZ4 cannot even think of matching LZMA in terms of compression ratio, but then that is not its intent either. In addition this also shows that LZP and my Dedupe and Delta2 Encoding implementations are cache-efficient as well.

NetApp advanced Data Compression (from BeleniX?)


I came across this article on a so-called advanced compression feature in OnTap: https://communities.netapp.com/docs/DOC-14329. There is a bunch of marketing fluff at the beginning of the article till you come to the second section where the real technique is illustrated. This is a simple idea being drummed up a bit. There is nothing special compared to what I had implemented for BeleniX and the OpenSolaris livecd years back. It is the very same feature.

Don’t believe me? Head over to my livemedia slides: http://www.osdevcon.org/2007/slides/osol_livemedia.pdf (slide 13). This is the presentation I delivered at the First OpenSolaris Developer Conference in 2007 in Berlin. It was hosted by the good folks of the  German Unix User’s Group.

[Image: comp_index]

At the high level there are but a couple of differences with the NetApp implementation. The OnTap version of course provides write capability which I did not implement since I was dealing with CDROMs. OnTap uses a compressibility threshold of 25% while I used 12% in the livecd case.
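
In pseudo-C, the core decision in both cases boils down to something like the sketch below. The helper names and block size are made up, so this is only the shape of the idea, not the OnTap or livecd code: compress a block, and keep the compressed form only if it shrinks by at least the chosen threshold.

#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 65536  /* illustrative block size */

/* Placeholder helpers, not real interfaces. */
extern size_t compress_block(const uint8_t *in, size_t insize,
    uint8_t *out, size_t outcap);
extern int write_compressed(const uint8_t *buf, size_t len);
extern int write_raw(const uint8_t *buf, size_t len);

int store_block(const uint8_t *in, size_t insize, size_t threshold_pct)
{
    uint8_t out[BLOCK_SIZE];
    size_t outsize = compress_block(in, insize, out, sizeof (out));

    /* Keep the compressed form only if it saved at least threshold_pct percent. */
    if (outsize > 0 && outsize <= insize - (insize * threshold_pct) / 100)
        return (write_compressed(out, outsize));
    return (write_raw(in, insize));
}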

When I wanted to leave SUN and was looking for a change I had interviewed briefly with NetApp, but decided not to pursue it for the sake of BeleniX at that time (OpenSolaris+ZFS was NetApp competition). However I had put all the details of my CDROM filesystem I/O scheduler and transparent compression implementation, with URLs, in my resume. Who knows if someone at NetApp decided to pull the good ideas from that into their product – ha.

Linear Algebra meets Data Compression

I recently introduced another kind of data encoding into Pcompress that leverages a common technique from linear algebra called the Matrix Transpose. There are multiple ways of transposing a matrix. The simplest, conceptually, is a transformation where columns of the source matrix become rows in the target and vice versa. In math terms, if A is the source matrix and A^T is the target then transposition boils down to A[m \times n] \rightarrow A^T[n \times m] where m and n are the dimensions of the 2D matrices.

Now where does this fit into the seemingly far removed world of data compression? Let us consider a dataset which is a sequence of 32-bit numbers. In my previous post on the Adaptive Run-Length Delta Encoding feature I mentioned how it detects sequences of numbers in data which are in simple arithmetic progression. Once an arithmetic progression is found it can be very easily encoded. Now let us assume that we do have a series of numbers but they are not in a simple incrementing series. So Delta Encoding will not be useful there.

As an example let us consider the following series of numbers: 4567, 3479, 9345. Let us also assume that they are stored in the data in big-endian row-major representation. We choose big-endian since that is the common platform-independent way of storing numeric data values. So if we look at the byte stream it will be thus (in hex): 00  00  11  D7  00  00  0D  97  00  00  24  81. Notice that there are a bunch of zeroes but they are mixed up with non-zero bytes. We can look at this data as a 3×4 matrix of bytes, three 4-byte integers with one integer per row: A[3 \times 4] = \left| \begin{array}{cccc} 00 & 00 & 11 & D7 \\ 00 & 00 & 0D & 97 \\ 00 & 00 & 24 & 81 \end{array} \right|.

Now looking at the matrix do you see what I see? The zeroes are aligned up in columns. Since the storage in memory is row-major the zeroes are mixed up with the other bytes. Now let us check out the transposed matrix: A^T[4 \times 3] = \left| \begin{array}{ccc} 00 & 00 & 00 \\ 00 & 00 & 00 \\ 11 & 0D & 24 \\ D7 & 97 & 81 \end{array} \right|. If we store this in memory the byte sequence becomes: 00  00  00  00  00  00  11  0D  24  D7  97  81. Voila, all the zeroes have lined up. Any compression algorithm on the planet will now be able to collapse that sequence of zeroes. This example shows zeroes but in fact it can be any repeating byte value.
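
A minimal scalar sketch of this byte transposition and its inverse for 32-bit values follows; it is just the idea shown above, not the actual Pcompress code.

#include <stdint.h>
#include <stddef.h>

/* Gather byte k of every 32-bit value into its own contiguous run. */
void transpose_bytes(const uint8_t *in, uint8_t *out, size_t nvals)
{
    size_t i, k;

    for (i = 0; i < nvals; i++)
        for (k = 0; k < 4; k++)
            out[k * nvals + i] = in[i * 4 + k];
}

/* The inverse transform simply swaps the indexing. */
void untranspose_bytes(const uint8_t *in, uint8_t *out, size_t nvals)
{
    size_t i, k;

    for (i = 0; i < nvals; i++)
        for (k = 0; k < 4; k++)
            out[i * 4 + k] = in[k * nvals + i];
}

Running transpose_bytes() over the three big-endian values above turns 00 00 11 D7 00 00 0D 97 00 00 24 81 into 00 00 00 00 00 00 11 0D 24 D7 97 81, exactly the transposed byte order shown.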

This technique has been used in other places as well, for example in HDF5 where they unassumingly call it Shuffle. In Pcompress it is used in 2 places: once within the Delta2 encoding phase and again to encode the Deduplication Index table. So compression ratios are improved across the board. To recover the data it is a simple matter of applying the inverse transpose.