Scaling Deduplication in Pcompress – Petascale in RAM

My core objective with the deduplication experiments in Pcompress has been to achieve large-scale data handling capability using tiny chunks (4KB and even 2KB). This has traditionally been extremely difficult, to the point of being nearly impractical. The algorithms in Pcompress can now achieve this without loss of performance, at least for archival data streams. Read on to find out more.


I have in the past alluded to an indexing algorithm in Pcompress that allows me to deduplicate large data using only a tiny memory-resident index. I finally found some time to write about it. The basic idea is quite simple and is not entirely new. What is new is the way certain components of the approach are implemented. The steps are simple:

  1. Break dataset into large segments, such that each segment is a collection of adjacent variable-length chunks as derived using a chunking algorithm.
  2. For each segment create a chunk-list file on disk which contains the list of chunk hashes.
  3. For each segment compute one or more similarity identifiers and store in an index.
  4. If one or more similarity identifiers for two segments match then we know that the segments have at least a few chunks in common. In that case load the chunk lists of the segments from disk and deduplicate the chunks.
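The indexing side of the steps above can be sketched as follows. This is a minimal illustration, not the Pcompress implementation: `similarity_ids` stands in for the per-segment similarity sketch described later in this post, and the chunk-list files on disk are left out.

```python
from collections import defaultdict

def match_segments(segments, similarity_ids):
    """Sketch of steps 3 and 4. 'segments' is an iterable of
    (segment_id, chunk_hashes) pairs; 'similarity_ids' is a placeholder
    for the per-segment similarity sketch (e.g. a K-min-values sketch).
    Returns, for each segment, the earlier segments whose chunk lists
    should be loaded from disk and deduplicated against it."""
    index = defaultdict(set)            # similarity id -> segment ids (RAM-resident)
    matches = {}
    for seg_id, chunk_hashes in segments:
        hits = set()
        for sid in similarity_ids(chunk_hashes):
            hits |= index[sid]          # step 4: one or more ids in common
            index[sid].add(seg_id)      # step 3: store the ids in the index
        matches[seg_id] = hits
    return matches
```

Only the small index lives in memory; the expensive chunk-by-chunk comparison happens solely for segments that share at least one similarity identifier.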

If the segments are large enough then a relatively tiny similarity index can address large amounts of data. This scheme requires only one disk read per batch of chunks, reducing lookup and access times. Also, each chunk list is a single file read sequentially, improving read times; the chunk list files can be stored on SSDs to speed things up even more. The similarity index can be a RAM-resident hashtable, giving the fastest possible lookup times. Since each segment is a collection of chunks, I experimented with different chunk counts and found excellent results with 2048 chunks per segment. Thus the segment size is variable: for an average chunk size of 4KB we get 8MB segments on average, for 8KB chunks we get 16MB segments, and so on.

Many similarity-based deduplication approaches use files as the objects for checking similarity. However, this suffers from the problem that very large files can give rise to loss of accuracy and a poor deduplication ratio, while too many tiny files can bloat the similarity index. The approach described above avoids both extremes. A somewhat similar approach has been described in the SiLo scheme.

The old guard – MinHash

The effectiveness of the approach described above hinges on the effectiveness of the similarity indexing scheme. Here I based my implementation on the good old MinHash technique. This technique has been used commonly in data mining and web search to determine similar documents, detect plagiarism, perform clustering and so on.

Essentially we break up a document into a set of semantic pieces, compute multiple hashes per piece and identify a subset of pieces that possess the numerically lowest hash values. If such k-min subsets of two documents match then we can say that the documents are somewhat similar to each other. In this deduplication use case a document is nothing but a segment. Since each segment is a set of chunks, each semantic piece is a chunk as determined by the chunking algorithm. The question is what hashes to compute for each chunk in order to apply the MinHash algorithm? (See MinHash for Dummies for a layman’s explanation of MinHash.)

After lots of experimentation and head scratching it turns out that the cryptographic hashes computed per chunk can themselves be used, via truncation. Truncated cryptographic hashes are themselves independent hashes. I took this idea to its logical extreme and split each cryptographic hash into a set of smaller hashes. These can be thought of as the permutations of the hash function. Then I can sort these hashes numerically and select the K lowest values to get my K-min-values sketch. If one or more elements of the KMV sketches of two segments match then the segments are deemed to have at least a few chunks in common. How far can a hash be truncated? It turns out that 64-bit pieces provide a good approximation. So if we are using, say, SHA-256, then we get 4 smaller hashes from each chunk hash. These are then numerically sorted and the 20 lowest unique values are chosen. Pictorially this approach can be depicted as below.
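A minimal sketch of this KMV construction, using Python's standard hashlib (the real code computes the chunk hashes anyway as part of dedupe, so the splitting is nearly free):

```python
import hashlib

def kmv_sketch(chunks, k=20):
    """Split each chunk's SHA-256 digest into four independent 64-bit
    hashes, then keep the k numerically smallest unique values."""
    values = set()
    for chunk in chunks:
        digest = hashlib.sha256(chunk).digest()
        for i in range(0, 32, 8):               # 256 bits -> 4 x 64-bit pieces
            values.add(int.from_bytes(digest[i:i + 8], "big"))
    return sorted(values)[:k]

def maybe_similar(sketch_a, sketch_b):
    """Two segments are dedupe candidates if their sketches share
    at least one element."""
    return not set(sketch_a).isdisjoint(sketch_b)
```

With 20 values of 8 bytes each, a segment's similarity indicators occupy just 160 bytes, which is what drives the memory figures discussed below.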


The number 20 and the other thresholds were arrived at experimentally, to give a good balance of memory usage vs deduplication effectiveness. Increasing beyond 20 hashes results in diminishing returns on dedupe effectiveness while linearly increasing memory consumption. The point of inflection actually comes at 16 hashes, below which dedupe effectiveness falls rapidly. The chart below shows the results of testing on a 360GB dataset that I have used on previous occasions.

Here, Duplicate Elimination Ratio = \frac{input\, stream\, length}{total\, size\, of\, chunks\, stored\, +\, metadata\, size}.
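As a quick worked example of this ratio (the numbers here are purely illustrative, not taken from the chart):

```python
def dedupe_ratio(input_len, stored_chunks, metadata):
    # Duplicate Elimination Ratio = input stream length /
    # (total size of chunks stored + metadata size)
    return input_len / (stored_chunks + metadata)

# Illustrative only: a 360GB stream reduced to 200GB of unique
# chunk data plus 5GB of dedupe metadata.
der = dedupe_ratio(360, 200, 5)   # about 1.76
```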

The interesting thing here is that this approach achieves high deduplication effectiveness, to the extent of 90% to 95% of the effectiveness of straightforward exact dedupe using a simple chunk index. The choice of the hash function (SHA2, SHA3, BLAKE2 etc.) has little bearing on the outcome. Here is another chart comparing this similarity-based deduplication (using various datasets and hash functions) with the exact deduplication baseline.


Memory Usage

Let’s consider the case of 4KB average chunk size. Each segment contains approximately 2048 chunks, which gives us an 8MB average segment size. For each chunk we derive a 256-bit crypto hash, which yields four 64-bit similarity hashes per chunk. We select the 20 lowest-valued unique hashes per segment as the similarity indicators, so we need 160 bytes of similarity indicators per segment. In addition we need to store 64-bit pointers to the actual segment data on disk (the segment data being the chunk hash and data pointer list stored on disk), which doubles the storage requirement to 320 bytes. The similarity indicators for a segment are kept in a hashtable, so we need to account for some data structure overheads as well. Assuming 4 bytes of overhead per indicator on average, we have a final storage requirement of 400 bytes per segment.

Now assuming 8MB segment size and one petabyte of 100% random data where no duplicates exist (the worst case), we would need 134217728 segments. This translates to 50GB of worst-case memory usage. If we use an 8KB chunk size, the same calculation leads to a 25GB RAM requirement in the worst case. These memory requirements are not excessive by present-day standards, and typically data will have duplicates, so the RAM requirement will come down in proportion to the duplicates present, especially when using the deduplicating storage architecture I described earlier. If we limit data handling to, say, 500TB, then even smaller 2KB chunks can be used. These are practical resource limitations; the algorithm does not have any inherent limits. If we use a well-endowed server with, say, 256GB or more of RAM, then petascale data can be handled using 2KB chunks as well, though performance will of course be lower.
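The arithmetic behind these figures can be checked directly (using the 400 bytes per segment derived above):

```python
def worst_case_index_ram(total_bytes, avg_chunk=4096,
                         chunks_per_seg=2048, bytes_per_seg=400):
    """Worst-case similarity-index RAM for fully random, dedupe-free data.
    bytes_per_seg is the per-segment cost derived above: 160 B of
    similarity indicators + 160 B of pointers + hashtable overhead."""
    seg_size = avg_chunk * chunks_per_seg   # 8 MB for 4 KB average chunks
    segments = total_bytes // seg_size      # 134217728 segments per PB
    return segments * bytes_per_seg
```

For one petabyte this yields exactly the 50GB (4KB chunks) and 25GB (8KB chunks) figures quoted above.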

To the best of my knowledge there is as yet no dedupe product that can handle a petabyte of data using small 4KB chunks and an in-memory index. I may be wrong, so please add a comment if you know of such a system. If you are looking at less than a petabyte then even 2KB chunks are possible – virtually no one does this today.

This indexing scheme is actually implemented in Pcompress today and works quite well. It can reach a raw dedupe processing throughput of up to 500MB/s (discounting I/O) on my average laptop. Of course there are many other optimizations, both algorithmic and architectural, some of which I have posted about earlier. I am presently working on the persistent similarity index, which will allow creating versioned incremental archives.

The hash functions in Pcompress

I have now added a fair number of different hash functions to Pcompress, so I thought of writing a bit about their purposes and some implementation details. Each kind of hash function is needed for a specific purpose. The description below is categorized by purpose:

  1. Chunking or Blocking hash: This is a rolling hash function used to break the data into individual blocks based on content-defined boundaries. After some work in this area I am finally using a prime-number based multiplicative rolling hash that is coupled with an irreducible polynomial to provide Rabin-style fingerprinting. Shortcuts are employed to avoid doing divisions and modulus operations and keep things very fast and practical: only one multiplication, one XOR and a couple of adds are performed per byte to compute the fingerprint. Multiplication is very fast on modern processors, and the multiplicative hash has very good distribution properties. Block boundaries are selected when the bottom X bits of the fingerprint are zero, where X is the number of bits in the average block size. The fingerprint is derived from the rolling hash by XOR-ing with values from a precomputed polynomial table; the byte sliding out of the rolling hash window is used to index into the table. In practice this gives a slightly skewed fingerprint distribution (skewed towards 0 and some neighbouring values), so given an average block size of 4K, the sizes can vary from 3.8K to 10K in practice. Using the byte at an Nth bit position in the rolling hash to index into the table gives a far better distribution, with 4.5K block sizes on average across a wide range of data. However, I found experimentally that the former approach gives a better dedupe ratio even with the more varying block sizes.
  2. Similarity hash or MinHash: This was described in my previous post titled “Similarity Matching round #3”. It is basically an xxHash of the K minimum fingerprints of a block. 40%, 50% and 60% similarity hashes are computed based on the extent of similarity desired when doing Delta Compression of blocks.
  3. Block Content Hash: This is a straightforward xxHash of the block content, used to find identical chunks and dedupe among them. Of course, a byte-for-byte memory compare is also done to verify that blocks are indeed the same. The fast version of xxHash is a non-cryptographic hash and can have collisions, but it is extremely fast to compute, coming in at around 7.5 GB/s.
  4. Hash table hash: Some time back I got rid of my earlier quicksort approach to finding duplicate blocks and replaced it with an in-memory hashtable for much better performance. The hash index computation requires one division and one modulus for lower collisions. Collisions are handled using bucket chaining. On average I found chain lengths to be around 4 – 7 buckets.
  5. Data Integrity Hash: Pcompress computes a cryptographic checksum for the entire chunk for data integrity verification. The checksum is computed over the original data, then recomputed and compared during decompression. There are options to use SKEIN 256/512 or SHA 256/512. In addition, CRC64 is provided if one wants a fast non-cryptographic alternative. SKEIN was one of the 5 NIST SHA3 finalists. Eventually NIST chose Keccak as the final SHA3, not SKEIN, but I continue to use SKEIN due to its awesome performance and robust security properties. From what I read, NIST’s choice of Keccak was driven more by the desire to move to an entirely new mathematical model, away from the earlier one that derives its roots from MD5: if the earlier model is somehow fundamentally broken using new mathematical tools, it will not take out all the SHA hashes. Providing SHA 256 was also an obvious choice. It is used widely and to date has held up well against cryptanalysis; it has not been practically broken in its many years of existence. However, one issue with SHA 256 is its speed of computation on 64-bit platforms: standard SHA 256 is much slower to compute than SKEIN on 64-bit. This has been an oft-investigated issue and I was looking around for optimized implementations to use. OpenSSL’s crypto library comes across as an obvious choice, with optimized (C and assembly) implementations for various platforms. However, I also came across a recent paper by Intel titled “Fast SHA-256 Implementations on Intel Architecture Processors”. They provide 4 highly optimized variants written in assembly language using SSE4, AVX1 and two AVX2 versions. The SHA 256 algorithm works on 64-byte blocks of data at a time, with a padding mechanism specified in the standard for when the data length is not an exact multiple of 64. Intel only provides implementations of the core function that works on these 64-byte blocks.
One still needs the higher-level interface that sets up the IVs and handles padding and finalization: essentially, in OpenSSL parlance, the SHA256_Init(), SHA256_Update() and SHA256_Final() functions. There are lots of SHA256 implementations around; I came across the very interesting, liberally licensed ones written by Allan Saddi. I stripped out the higher-level interface functions from his code, optimized them and modified them to call the core Intel routines. In particular, I removed the loop that called memcpy() per 64-byte block, since the Intel routines take the number of blocks to process; one still has to handle leftover bytes and data lengths of less than 64. Now comes the question of which Intel function to use. One easy way is to detect the processor at compile time and build the appropriate function variant. A more elegant approach is to build all the functions and select the appropriate one at runtime based on cpuid flags. Glibc uses this technique, along with some great runtime linker glue called IFUNC, to select processor-specific optimizations at runtime. IFUNC provides user-level access to this feature but I did not use it due to non-portability. In any case I had to use cpuid to discover whether the processor is Intel or AMD and whether it has SSE4 or AVX1 (I left out the AVX2 functions for now), and call the appropriate function variant. I did not want to pull in a complete cpuid library or tool for this small requirement, so I eventually ended up slicing out some code from libcpuid and doing my own thing with it. The result is a very highly optimized SHA 256 implementation for Intel chips that is noticeably faster than OpenSSL’s. I still need to add checks for AMD processors like Bulldozer/Fusion that have SSE4 and AVX.
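As a rough illustration of the chunking hash in item 1 above, here is a simplified content-defined chunker: a prime multiplicative rolling hash over a sliding window, declaring a block boundary when the bottom X bits of the hash are zero. The prime, the window size, and the omission of the irreducible-polynomial table XOR are my simplifications for clarity, not the actual Pcompress parameters.

```python
PRIME = 153191          # an arbitrary prime multiplier (illustrative choice)
WINDOW = 16             # sliding-window size in bytes
MASK = (1 << 64) - 1    # keep the hash in 64 bits

def chunk_boundaries(data, avg_bits=12):
    """Yield block-boundary offsets into 'data'. avg_bits is X, the number
    of bits in the average block size (12 -> ~4 KB average blocks)."""
    boundary_mask = (1 << avg_bits) - 1
    # Weight of the byte sliding out of the window: PRIME ** WINDOW mod 2^64.
    out_factor = pow(PRIME, WINDOW, 1 << 64)
    h = 0
    for i, byte in enumerate(data):
        h = (h * PRIME + byte) & MASK            # slide the new byte in
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * out_factor) & MASK  # slide old byte out
            if (h & boundary_mask) == 0:         # bottom X bits zero: boundary
                yield i + 1
```

Per byte this costs one multiply, one add, and (amortized) one multiply-subtract for the outgoing byte, in the same spirit as the one-multiply-plus-XOR scheme described above.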

Similarity Matching round #3

As I had posted in Improving the Similarity Matching, I had a fair process for doing fuzzy matching. However, it was an experimentally derived, non-standard process which did come up with a bunch of false positives, even though they were low in percentage terms. I kept reading up on various aspects of fuzzy matching, and most of the articles were very mathematical or appeared fairly elaborate, for example Broder’s seminal paper from 1997. I could follow the core ideas of the paper, but I am not a math geek and get put off by too many mathematical details. However, the core idea of that paper is actually called MinHashing, and it has been used quite widely, starting with the AltaVista search engine.

There are various articles, blog posts and code samples on that technique, and it is popular because the core concept is so simple. There are many variants of the technique used in different places. For textual documents this is also known as a W-shingling approach and is enhanced using other techniques like punctuation elimination, capital reduction, stop word stemming etc. For generic blocks of data, which can contain non-textual binary, the approach reduces to using a rolling checksum like Rabin fingerprints. For a given block one has to compute all the rolling checksum values and select the K minimum checksum values from the set. We can then match the K-min-values sketches of all the blocks to determine fuzzy similarity between blocks. The ratio of K to block size determines the extent of similarity we are looking for. While all this is fine, it is not immediately apparent how to go about doing this practically in one’s application. Storing and dealing with K minimum checksum values per block is out of the question, and what is the efficient way to get the K min values list in the first place? Do we have to sort, or is there a better way?

Since the approach is so popular there are also lots of example code available which shows how to do this in practice. However it also requires a bit of tweaking to suit one’s application needs. I am listing some of the resources I read through to understand and evaluate the approaches:

  1. Sketch of the Day: K-Minimum Values
  2. Finding similar items using minhashing
  3. Similarity and Dissimilarity Measures
  4. Set Similarity and Min Hash
  5. Calculating Similarity
  6. Finding the top K items in a list efficiently. While this talks about top-K, the approach has to be inverted for bottom-K.

If you get overwhelmed after reading all those links and the Wikipedia articles above, I can fully sympathize. However, patience and perseverance are required, especially if maths is not your second language. If I take stock here, things boil down to this:

  1. Compute all the rolling hashes for a block and store them.
  2. I do not need to store the entire 8-byte hash; the lower 4 bytes are mostly enough.
  3. I am using block sizes of 4K to 128K, so I need 8KB to 500KB to store the list for a block.
  4. Since pcompress does memory buffer to memory buffer compression I can temporarily use space in the target buffer for the list, avoiding an allocation.
  5. Choose K = 40% of block size to allow for 40% approximate similarity.
  6. Once the K minimum values are obtained, compute a hash over that list to get a Similarity Index.
  7. I am using xxHash here as it is extremely fast and has very good properties.
  8. If the Similarity Indexes of two blocks are the same, they should be similar to the extent of 40% or greater.
  9. Okay, so now how do we efficiently find the K minimum fingerprint values?
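The bottom-K selection the steps above call for can be done with a bounded max-heap, which is the top-K approach inverted. A minimal sketch using Python's heapq (a min-heap, so values are negated to simulate a max-heap; this illustrates the idea rather than reproducing the stripped-down C code):

```python
import heapq

def bottom_k(values, k):
    """Keep the k smallest values seen so far in a bounded heap.
    Negating each value turns heapq's min-heap into a max-heap, so the
    root (-heap[0]) is the largest of the current k smallest and gets
    evicted whenever a smaller value arrives."""
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif v < -heap[0]:
            heapq.heapreplace(heap, -v)   # evict current largest, insert v
    return [-x for x in heap]             # heap order, deliberately unsorted
```

Each element costs at most O(log k) work, versus O(n log n) for sorting the whole list, and the result is left in heap order rather than sorted.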

Sorting is an option but is sub-optimal: I just need K values and should not waste computation sorting all the values. It appears that a heap-based priority queue is the optimal choice in this situation. Thanks to Steve Hanov for the tip. While Steve talks about finding the top-K items, finding the bottom-K items is the same approach, tweaked. I eventually ended up with an in-place min-heap implementation from Python’s heapq. Also see this StackOverflow thread for more details. Python’s heapq is implemented in C and is nice and compact, with very high performance. I stripped out only the portions of the heapq module that I needed, then simplified and optimized them further as much as I could. I also decided not to sort the collection of K min values that this gives: the relative positions of values in the input block affect where the values end up in the in-place heap, which acts as an added similarity measure automatically.

So after all this circus I now have much better Delta Compression for Pcompress, where the potential for false similarity matches is about 0.001% on average. A false similarity is one where the two blocks have less than 40% in common. The Delta Compression performance is much better than before. In addition, the better matching predictability helps all compression algorithms improve their compression ratio. This change is currently in the Github repo and will make it into the next release in a while. See here and here if you are interested in the actual changes.