I now have added a fair bit of different hash functions into Pcompress, so thought of writing a bit about their purposes and some implementation details. Each kind of hash function is needed for a specific purpose. The description below is categorized by purpose:
- Chunking or Blocking hash:This is a rolling hash function used to break the data into individual blocks based on content defined boundaries. After some work in this area I am finally using a prime-number based multiplicative rolling hash that is coupled with an Irreducible Polynomial to provide the Rabin algorithm based fingerprinting. Shortcuts are employed to avoid doing divisions and modulus and keep things very fast and practical. Only one multiplication, one XOR and a couple of adds are performed per byte to compute the fingerprint.Multiplication is very fast on modern processors.The multiplicative hash has very good distribution properties. Block boundaries are selected when the bottom X bits of the fingerprint are zero. X is the number of bits in the average block size. The fingerprint is derived from the rolling hash by XOR-ing with values from a precomputed polynomial table. The byte sliding out of the rolling hash window is used to index into the table. In practice this gives a slightly skewed fingerprint distribution (skewed towards 0 and some neighbouring values). So given average block size of 4K, the sizes can vary from 3.8K to 10K in practice. Using the byte at an Nth bit position in the rolling hash to index into the table gives a far better distribution with 4.5K block sizes on average across a wide range of data. However I found experimentally that the former approach gives better dedupe ratio even with the more varying block sizes.
- Similarity hash or MinHash: This was described on my previous post titled “Similarity Matching round #3“. This is basically an xxHash of the K minimum fingerprints for a block. 40%, 50% and 60% similarity hashes are computed based on the extent of similarity desired for doing Delta Compression of blocks.
- Block Content Hash: This is a straightforward xxHash of the block content. This is used to find identical chunks and dedupe among them. Of course a byte for byte memory compare is also done to verify that blocks are indeed the same. xxHash fast version is a non-cryptographic hash and can have collisions but it is extremely fast to compute coming in at around 7.5 GB/s performance.
- Hash table hash: Sometime back I got rid of my earlier quicksort approach to find duplicate blocks and replaced it with an in-memory hash table for much better performance. The hash index computation requires one division and one modulus for lower collisions. Collisions are handled using bucket chaining. On average I found chain lengths to be around 4 – 7 buckets.
- Data Integrity Hash:Pcompress computes a cryptographic checksum for the entire chunk for data integrity verification. The checksum is computed over the original data and recomputeed and compared during decompression. There are options to use SKEIN 256, 512 or SHA 256, 512. In addition CRC64 is also provided if one wants a fast non-cryptographic alternative. SKEIN is one of the 5 NIST SHA3 finalists. Eventually NIST chose Keccak as the final SHA3 choice not SKEIN. I continue to retain use of SKEIN due to its awesome performance and robust security properties. As per what I read NIST’s choice of Keccak was driven more by the desire to move to an entirely new Mathematical model from the earlier one that derives its roots from MD5. So if the earlier model is somehow fundamentally broken using new Mathematical tools it will not take out all the SHA hashes.Providing SHA 256 was also an obvious choice. It is used widely and till date have held up well against Cryptanalysis. It has not been practically broken in so many years of its existence. However one issue with SHA 256 is its speed of computation on 64-bit platforms.Standard SHA 256 is much slower to compute than SKEIN on 64-bit. It has been an oft-investigated issue and I was looking around for optimized implementations to use. OpenSSL’s crypto library comes across as an obvious choice. It has got optimized (C and Assembly) implementations for various platforms. However I also came across a recent by Intel titled “Fast SHA-256 Implementations on Intel Architecture Processors“. They provide 4 highly optimized variants written in assembly language using SSE4, AVX1 and two AVX2 versions.The SHA 256 algorithm works on 64-byte blocks of data at a time with some padding mechanism specified in the standard when data length is not an exact multiple of 64. Intel only provides implementations of the core function that works on these 64-byte blocks. One still needs the higher-level interfaces that sets up the IVs, handles padding and finalization. Essentially in OpenSSL parlance we need SHA256_Init(), SHA256_Update() and SHA256_Final() functions.There are lots of SHA256 implementations around, however I came across the very interesting liberally licensed ones written by Allan Saddi. I stripped out the higher-level interface functions from his code, optimized them and modified to call the core Intel routines. In particular I removed the loop that called memcpy() per 64-byte block since the Intel routines take the number of blocks to process in the data. One still has to handle leftover bytes and data lengths less than 64.Now comes the question of which Intel function to use. One easy way is to detect the processor during compile time and build that appropriate function variant. A more elegant approach is to build all the functions and select the appropriate one at runtime based on cpuid flags. Glibc uses this technique along with some great runtime linker glue called IFUNC to select processor specific optimizations at runtime. IFUNC provides user-level access to this feature but I did not use this due to non-portability. In any case I had to use cpuid to discover whether the processor is Intel or AMD, whether it has SSE4 or AVX1 (I left out AVX2 functions for now) and call the appropriate function variant. I did not want to use a complete cpuid library or tool to achieve this small requirement. So eventually I ended up slicing out some code from libcpuid and doing my own thing with that. The result is that I have a very highly optimized SHA 256 implementation for Intel chips that are noticeably faster than OpenSSL. I still need to add checks for AMD processors like bulldozer/fusion that have SSE4 and AVX.