
Pcompress 1.3 released

I have put up a new release of Pcompress on the Google Code download page. This release focuses primarily on performance enhancements across the board and includes a few bug fixes. The changes are summarized below:

  1. Deduplication performance has improved by at least 2.5X as a result of a variety of tweaks to the core chunking code.
    • One of the interesting changes is to use a 16-byte SSE register as the sliding window since I am using a window size of 16. This avoids a lot of memory accesses but requires SSE4.
    • The perf utility allowed me to see that using the window position counter as a context variable causes a spurious memory store for every byte! Using a local variable allows optimization via a register. This optimization affects the situation where we do not have SSE4.
    • Compute the full fingerprint only when at least minimum chunk length bytes have been consumed.
  2. Delta Compression performance and effectiveness have both been improved. I have tweaked the minhash approach to avoid storing and using fingerprints. That approach was causing memory write amplification and a significant slowdown. Instead, I now treat the raw data as a sequence of 64-bit integers and heapify them. Bsdiff performance has been improved along with RLE encoding. I also tweaked the matching approach. It now checks for similar blocks that are some distance apart, with the distance depending on the compression algorithm. This causes long-range similar blocks to be delta-ed, which eventually helps the overall compression.
  3. One of the big changes is the inclusion of the BLAKE2 checksum algorithm and making it the default. BLAKE2 is one of the highest-performing cryptographic checksums, exceeding even MD5 in performance on 64-bit platforms. It is derived from BLAKE, one of the NIST SHA3 runners-up, which has a large security margin.
  4. I have tweaked Yann Collet’s xxHash algorithm (non-cryptographic hash) to vectorize it and make it work with 32-byte blocks. Performance is improved for both vectorized and non-vectorized versions. I have covered this in detail in my previous post: Vectorizing xxHash for fun and profit.
  5. I have tweaked the AES CTR mode implementation to vectorize it. CTR mode encrypts a 16-byte block consisting of a 64-bit nonce or salt value and a 64-bit block counter value concatenated together. This is then XOR-ed with 16 bytes of plaintext to generate 16 bytes of ciphertext. The block counter is then incremented and the process repeated. This 16-byte XOR can be done nicely in an XMM register. The result is faster even with unaligned SSE2 loads, helped a little by data prefetch instructions.
  6. Apart from BLAKE2 I also included Intel’s optimized SHA512 implementation for x86 processors and moved to using SHA512/256. This improves SHA2 performance significantly on x86 platforms.
  7. BLAKE2 includes a parallel mode. I also included simple 2-way parallel modes for other hashes including KECCAK when compressing a single file in a single chunk. This is essentially a single-threaded operation so other forms of parallelism need to be employed.
  8. With all the vectorization being thrown around with SSE2/3/4 and AVX1/2 versions of various stuff, I have also added runtime CPU feature detection to invoke the appropriate version for the CPU. At least SSE2 capability is assumed. At this point I really have no intention of supporting Pentium and Atom processors! This also requires one to use at least the Gcc 4.4 compiler so that things like SSE4.2 and AVX intrinsics can be compiled even if CPU support for them is not available.
  9. In addition to all the above some bug fixes have also gone into this release.
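The CTR data flow described in item 5 can be sketched in a few lines of Python. This is only an illustration of the counter-block layout and the XOR step, not the Pcompress code: `stand_in_block_encrypt` is a hypothetical placeholder built from SHA-256 instead of AES (Python's standard library has no AES), the big-endian packing of the nonce and counter is an assumption, and there is no vectorization here.

```python
import hashlib
import struct

BLOCK = 16  # bytes per block, the same width as an XMM register


def stand_in_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Hypothetical placeholder for AES block encryption (NOT AES, NOT secure).

    CTR mode only uses the forward direction of the cipher, so a keyed hash
    truncated to 16 bytes is enough to show the data flow.
    """
    return hashlib.sha256(key + block).digest()[:BLOCK]


def ctr_xor(key: bytes, nonce: int, data: bytes, counter: int = 0) -> bytes:
    """Encrypt or decrypt data; in CTR mode the two operations are identical."""
    out = bytearray()
    for off in range(0, len(data), BLOCK):
        # 64-bit nonce followed by a 64-bit block counter (big-endian assumed).
        counter_block = struct.pack(">QQ", nonce, counter)
        keystream = stand_in_block_encrypt(key, counter_block)
        chunk = data[off:off + BLOCK]
        # XOR the keystream with up to 16 bytes of input; zip stops at the
        # end of a short final chunk.
        out.extend(p ^ k for p, k in zip(chunk, keystream))
        counter = (counter + 1) & 0xFFFFFFFFFFFFFFFF
    return bytes(out)


key = b"demo key"
nonce = 0x1122334455667788
message = b"Each 16-byte block is XOR-ed with a fresh keystream block."
ciphertext = ctr_xor(key, nonce, message)
roundtrip = ctr_xor(key, nonce, ciphertext)
print(roundtrip == message)
```

Because every keystream block depends only on the nonce and the counter, the blocks are independent of each other, which is what makes the mode a natural fit for SIMD registers and prefetching.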

However, this is in no way the full gamut of optimizations possible. There are more changes to be done. For example, I need to add support for an optimized AES GCM mode. This is a block cipher mode of operation that combines encryption and authentication, avoiding the need for a separate HMAC. HMAC is still useful for situations where one may want to authenticate but not encrypt. Deduplication performance can be further improved by at least 2X. The current chunking code has a silly oversight. HMAC needs to support parallel modes. I also need to enable parallel operation for LZP in single-chunk modes. In addition, I want to explore the use of GPGPUs and CUDA for hashing, chunking etc.


Pcompress 1.2 Released

Pcompress 1.2 is now available and can be grabbed from here: https://code.google.com/p/pcompress/downloads/list. This is a major release containing a large number of bug fixes and improvements. There are performance and stability improvements, code cleanup, resolution of corner cases etc. The biggest new additions in this release are the new Delta2 Encoding and support for the Keccak message digest. Keccak has been announced as the NIST SHA3 standard secure hash. The SIMD (SSE) optimized variant of Keccak runs faster than SHA256 on x86 platforms. However, it is still slower than SKEIN, so SKEIN remains the default hash algorithm for data integrity verification. In addition, Deduplication is now significantly faster.

Delta2 Encoding, as I had mentioned in a previous post, probes for embedded tables of numeric sequences in the data and encodes them by collapsing each arithmetic sequence into its parameters: starting value, increment/decrement, number of terms. This generally provides benefits across different kinds of data and can be combined with LZP preprocessing to let the final compression algorithm achieve a higher compression ratio than it normally would. This encoding works very fast and still manages to detect a good amount of numeric sequences if they are present in the data.
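The idea of collapsing an arithmetic sequence into (starting value, step, number of terms) can be illustrated with a toy Python sketch. It works on a list of integers, not on raw bytes, and it does not reproduce the actual Delta2 on-disk format or its probing logic; the `MIN_RUN` threshold and the token layout are assumptions made purely for illustration.

```python
from typing import List, Tuple, Union

Run = Tuple[int, int, int]   # (start, step, count)
Token = Union[Run, int]      # a collapsed run or a literal value

MIN_RUN = 4  # hypothetical threshold: shorter runs are kept as literals


def encode(values: List[int]) -> List[Token]:
    """Replace arithmetic runs of at least MIN_RUN terms by their parameters."""
    tokens: List[Token] = []
    i, n = 0, len(values)
    while i < n:
        if i + 1 < n:
            step = values[i + 1] - values[i]
            j = i + 1
            # Extend the run while the difference stays constant.
            while j + 1 < n and values[j + 1] - values[j] == step:
                j += 1
            count = j - i + 1
            if count >= MIN_RUN:
                tokens.append((values[i], step, count))
                i = j + 1
                continue
        tokens.append(values[i])
        i += 1
    return tokens


def decode(tokens: List[Token]) -> List[int]:
    """Expand runs back into their terms; literals are copied through."""
    out: List[int] = []
    for tok in tokens:
        if isinstance(tok, tuple):
            start, step, count = tok
            out.extend(start + k * step for k in range(count))
        else:
            out.append(tok)
    return out


data = [7, 100, 104, 108, 112, 116, 3, 50, 45, 40, 35, 9]
tokens = encode(data)
print(tokens)
print(decode(tokens) == data)
```

Decoding is just a multiply-and-add per term with no searching, which is consistent with decoding being much faster than encoding.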

I have extended the statistics mode to display additional data including throughput figures. Here is an output from a compression run of the silesia corpus test data set:

## CPU: Core i5 430M
## RAM: 8GB
## Compiler: Gcc 4.7.2
## Compression
./pcompress -D -c lz4 -l1 -L -P -s200m silesia.tar 
Scaling to 1 thread
Checksum computed at 241.321 MB/s
Original size: 206612480, blknum: 46913
Number of maxlen blocks: 0
Total Hashtable bucket collisions: 17225
Merge count: 46750
Deduped size: 206197375, blknum: 242, delta_calls: 0, delta_fails: 0
Chunking speed 112.189 MB/s, Overall Dedupe speed 100.880 MB/s
LZP: Insize: 206196371, Outsize: 192127556
LZP: Processed at 55.839 MB/s
DELTA2: srclen: 192127556, dstlen: 191899643
DELTA2: header overhead: 50800
DELTA2: Processed at 382.530 MB/s
Chunk compression speed 207.908 MB/s
## Decompression
./pcompress -d silesia.tar.pz silesia.tar.1 
Scaling to 4 threads
Chunk decompression speed 383.488 MB/s
DELTA2: Decoded at 3030.724 MB/s
Checksum computed at 244.235 MB/s

ls -l silesia.tar.pz 
-rw-rw-r--. 1 testuser testuser 99115899 Jan  5 21:36 silesia.tar.pz

Note that these are single-threaded performance figures. The entire file is being compressed in a single chunk. The default checksum is SKEIN. Look at the decoding speed of the Delta2 implementation: it is close to 3 GB/s. Next, let's check the performance of SSE-optimized Keccak:

./pcompress -D -c lz4 -l1 -P -S KECCAK256 -s200m silesia.tar 
Scaling to 1 thread
Checksum computed at 143.904 MB/s
Original size: 206612480, blknum: 46913
Number of maxlen blocks: 0
Total Hashtable bucket collisions: 17225
Merge count: 46750
Deduped size: 206197375, blknum: 242, delta_calls: 0, delta_fails: 0
Chunking speed 111.601 MB/s, Overall Dedupe speed 100.352 MB/s
DELTA2: srclen: 206196371, dstlen: 201217172
DELTA2: header overhead: 570448
DELTA2: Processed at 360.383 MB/s
Chunk compression speed 213.226 MB/s

ls -l silesia.tar.pz 
-rw-rw-r--. 1 testuser testuser 100204566 Jan  5 21:34 silesia.tar.pz

This time I left out LZP to show the reduction using Delta2 alone. As you can see, combining LZP and Delta2 gives the greatest reduction. Also see how much slower Keccak is compared to SKEIN. Note that I am using an optimized 64-bit assembly implementation of Skein that does not use SSE, whereas Keccak uses SSE.
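To put numbers on the comparison, the figures from the two silesia runs above can be plugged into a short Python calculation. All constants are copied from the logs; only the variable names are made up.

```python
original = 206612480          # "Original size" reported by both runs

size_with_lzp = 99115899      # first run: dedupe + LZP + Delta2 + LZ4, SKEIN
size_without_lzp = 100204566  # second run: dedupe + Delta2 + LZ4, KECCAK256

ratio_with_lzp = size_with_lzp / original
ratio_without_lzp = size_without_lzp / original
extra_saved = size_without_lzp - size_with_lzp

skein_mbps = 241.321          # "Checksum computed at" in the first run
keccak_mbps = 143.904         # "Checksum computed at" in the second run
skein_speedup = skein_mbps / keccak_mbps

print(f"With LZP:    {ratio_with_lzp:.1%} of original size")
print(f"Without LZP: {ratio_without_lzp:.1%} of original size")
print(f"LZP saves an extra {extra_saved} bytes")
print(f"SKEIN checksum throughput is {skein_speedup:.2f}x that of KECCAK256")
```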

Next, let's have a look at a dataset that has lots of embedded numeric data. I used a Global Topographic Elevation map dataset from USGS:

# Compression
./pcompress -c lz4 -l1 -P -s100m e020n40.tar 
Scaling to 1 thread
Checksum computed at 237.584 MB/s
DELTA2: srclen: 86599680, dstlen: 43707440
DELTA2: header overhead: 2024320
DELTA2: Processed at 279.484 MB/s
Chunk compression speed 211.112 MB/s

ls -l e020n40.tar.pz 
-rw-rw-r--. 1 testuser testuser 35360062 Jan  5 21:46 e020n40.tar.pz
# Decompression
./pcompress -d e020n40.tar.pz e020n40.tar.1 
Scaling to 4 threads
Chunk decompression speed 622.394 MB/s
DELTA2: Decoded at 1971.282 MB/s
Checksum computed at 246.015 MB/s

This time I left out dedupe and LZP. The Delta2 benefits are much greater since there is a lot of numeric data. Because there are many more Delta spans in the encoded dataset, the decoding speed is also lower. However, it still decodes at 1.9 GB/s.
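The difference in Delta2 effectiveness between the two datasets can be read straight off the `DELTA2:` log lines. The small Python calculation below uses only numbers from the logs above (the silesia figures come from the run without LZP); the helper name is hypothetical.

```python
def delta2_stats(srclen: int, dstlen: int, header_overhead: int):
    """Return (output/input ratio, share of the output taken by headers)."""
    return dstlen / srclen, header_overhead / dstlen


# Elevation map data: srclen, dstlen, header overhead from the log.
elev_ratio, elev_header_share = delta2_stats(86599680, 43707440, 2024320)

# Silesia corpus, run without LZP.
sil_ratio, sil_header_share = delta2_stats(206196371, 201217172, 570448)

print(f"Elevation data: Delta2 output is {elev_ratio:.1%} of its input, "
      f"{elev_header_share:.1%} of that is header overhead")
print(f"Silesia:        Delta2 output is {sil_ratio:.1%} of its input, "
      f"{sil_header_share:.1%} of that is header overhead")
```

The larger header share for the elevation data reflects the greater number of encoded spans, which fits the lower decoding speed seen in that run.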

As can be seen, Delta2 performance is on par with LZ4, and it can be used to improve LZ4 results with very little overhead.