Tag Archives: AVX

SHA512 Performance in Pcompress

Pcompress provides 2 cryptographic hashes from the SHA2 family, namely SHA512-256 and SHA512. The core SHA512 block function is the implementation from Intel: http://edc.intel.com/Download.aspx?id=6548

Intel’s implementation provides two heavily optimized versions. One for SSE4 and one for AVX. Intel only provided the core compression function. I had to add supporting code from other sources to get a complete hash implementation including padding and IVs for 512-bit and the 256-bit truncation. Since I am bothered only with 64-bit CPUs using SHA512-256 is the most optimal choice. SHA512 is much faster that native SHA256 on 64-bit CPUs. I did some benchmarks using the SMHasher suite to check how this implementation fares. SMHasher is primarily designed to test various qualities of non-cryptographic hash functions, however it’s benchmarking implementation is good and I used just that part. I had to modify SMhasher to add all the various SHA2 implementations and a tweak to support 512-bit hashes.

I only tested the SSE4 version since i do not currently have an AVX capable CPU. SMHasher shows bytes/cycle and I just did the reciprocal to get cycles/byte. All this was done on my laptop which has a Core i5 430M, 2.27 GHz CPU (non sandy bridge). The OpenSSL version used is 1.0.1c from Linux Mint 14. I also used GCC 4.7.2 with -O3 and -ftree-vectorize flags. The results are shown below.


Clearly Intel’s SSE4 optimized version is superior than the rest on x64 platforms with OpenSSL not too far behind. The AVX version should give even better results. The other result shows cycles/hash for tiny 31-byte buffers.

sha512_small.Fast hashing is important when using it for data integrity verification. If you are thinking of slow hashes for hashing passwords then please look at Scrypt or PBKDF2.


Pcompress 1.3 released

Parallel Compression RevisitedI have put up a new release of Pcompress on the Google Code download page. This release focusses primarily on performance enhancements across the board and a few bug fixes. The changes are summarized below:

  1. Deduplication performance has improved by at least 2.5X as a result of a variety of tweaks to the core chunking code.
    • One of the interesting changes is to use a 16-byte SSE register as the sliding window since I am using a window size of 16. This avoids a lot of memory accesses but requires SSE4.
    • The perf utility allowed me to see that using the window position counter as a context variable causes a spurious memory store for every byte! Using a local variable allows optimization via a register. This optimization affects the situation where we do not have SSE4.
    • Compute the full fingerprint only when at least minimum chunk length bytes have been consumed.
  2. Delta Compression performance and effectiveness have both been improved. I have tweaked the minhash approach to avoid storing and using fingerprints. That approach was causing memory write amplification and significant slowdown. Rather I am just treating the raw data as a sequence of 64-bit integers and heapifying them. Bsdiff performance has been improved along with RLE encoding. I also tweaked the matching approach. It now checks for similar blocks that are some distance apart depending on the compression algorithm. This actually causes long range similar blocks to be delta-ed eventually helping the overall compression.
  3. One of the big changes is the inclusion of the BLAKE2 checksum algorithm and making it the default. BLAKE2 is one of the highest-performing cryptographic checksums, exceeding even MD5 in performance on 64-bit platforms. It is derived from BLAKE, one of the NIST SHA3 runner ups with a large security margin.
  4. I have tweaked Yann Collet’s xxHash algorithm (non-cryptographic hash) to vectorize it and make it work with 32-byte blocks. Performance is improved for both vectorized and non-vectorized versions. I have covered this in detail in my previous post: Vectorizing xxHash for fun and profit.
  5. I have tweaked the AES CTR mode implementation to vectorize it. CTR mode encrypts a 16-byte block consisting of a 64-bit nonce or salt value and a 64-bit block counter value concatenated together. This is then XOR-ed with 16 bytes of plaintext to generate 16 bytes of ciphertext. The block counter is then incremented and the process repeated. This XOR handling with 16-bytes can be nicely done in an XMM register. The result is faster even when using unaligned SSE2 loads helped a little with data prefetch instructions.
  6. Apart from BLAKE2 I also included Intel’s optimized SHA512 implementation for x86 processors and moved to using SHA512/256. This improves SHA2 performance significantly on x86 platforms.
  7. BLAKE2 includes a parallel mode. I also included simple 2-way parallel modes for other hashes including KECCAK when compressing a single file in a single chunk. This is essentially a single-threaded operation so other forms of parallelism need to be employed.
  8. With all the vectorization being thrown around with SSE2/3/4 and AVX1/2 versions of various stuff, I have also added runtime CPU feature detection to invoke the appropriate version for the CPU. At least SSE2 capability is assumed. At this point I really have no intention of supporting Pentium and Atom processors! This also requires one to use at least the Gcc 4.4 compiler so that things like SSE4.2 and AVX intrinsics can be compiled even if CPU support for them is not available.
  9. In addition to all the above some bug fixes have also gone into this release.

However this is in no way the full gamut of optimizations possible. There are more changes to be done. For example I need to add support for optimized AES GCM mode. This is a block cipher mode of operation which combines encryption and authentication avoiding the need to for a separate HMAC. HMAC is still useful for situations where one may want to authenticate but not encrypt. Deduplication performance can be further improved by at least 2X. The current chunking code has a silly oversight.  HMAC needs to support parallel modes. I also need to enable parallel operation for LZP in single-chunk modes. In addition I want to explore use of GPGPUs and CUDA for hashing, chunking etc.

Pcompress goes to 0.9.1

Now that encryption and message verification/authentication is done pushed out a couple of releases. Here’s some of the salient features that have come in after the initial implementation of AES encryption:

  1. Hash-based Message Authentication (HMAC)
  2. Optimized SHA256 for AMD platforms
  3. Fix build on Debian platforms (Linux Mint)
  4. Portable printf format string for unsigned int64 value to avoid gcc 4.6 warnings.

The HMAC functionality is available for all chunk digest algorithms namely SKEIN 256/512, SHA 256/512. For CRC64, SHA256 HMAC is used. The HMAC uses the same encryption key. However question now arises which items to verify using the HMAC. Ideally it should be everything in the file to ensure tampering is prevented. Now a Pcompress data file has a file header and every chunk in the file have their chunk headers. Every chunk includes a normal cryptographic disgest which can be SKEIN or SHA-2 to verify data integrity.

At this point applying a HMAC which is the same digest strengthened using an encryption key to the entire chunk data is wasteful. So what I have done is to apply the HMAC to the file header and the chunk headers. So file header is verified and chunk header is verified. The chunk header includes the normal chunk data digest which also gets verified by the HMAC. It is then used to verify integrity of the chunk data. This is good for performance since computing a HMAC is much more expensive than computing a standard cryptographic digest. Apart from these I added a few data validation checks during decompression to ensure that things like chunk sizes mentioned are within bounds.

With respect to the SHA256 optimizations, I found that Intel’s optimized implementations based on SSE 4.2 and AVX1 work fine on AMD Bulldozer and later platforms (Piledriver). So I have enabled those to be used on AMD with the SSE 4.2 and AVX1 feature detection. Thankfully AMD uses the same bits in ECX to indicate SSE4.2 and AVX1 capabilities.