**Approach**

I have in the past alluded to a an indexing algorithm in Pcompress that allows me to deduplicate large data using only a tiny memory-resident index. I finally found some time to write about it. The basic idea is quite simple and is not entirely new. What is new, is the way certain components of the approach are implemented. The steps are simple:

- Break dataset into large segments, such that each segment is a collection of adjacent variable-length chunks as derived using a chunking algorithm.
- For each segment create a chunk-list file on disk which contains the list of chunk hashes.
- For each segment compute one or more similarity identifiers and store in an index.
- If one or more similarity identifiers for two segments match then we know that the segments have at least a few chunks in common. In that case load the chunk lists of the segments from disk and deduplicate the chunks.

If the segments are large enough then a relatively tiny similarity index can address large amounts of data. This scheme requires one disk read per bunch of chunks reducing lookup and access times. Also each chunk list is one file and is read sequentially improving read times. The chunk list files can also be stored on SSDs to speed up even more. The similarity index can be a RAM-resident hashtable giving the fastest possible lookup times. Since each segment is a collection of chunks, I experimented using different chunk counts and found excellent results having 2048 chunks per segment. Thus the segment size is of variable length. For an average chunk size of 4KB we get 8MB segments on average, for 8KB chunks we get 16MB segments and so on.

Many Similarity based Deduplication approaches use files as the objects for checking similarity. However this suffers from a problem that very large files can give rise to loss of accuracy and poor deduplication ratio. Conversely too many tiny files can bloat the similarity index. The approach described above avoids these extremes. A somewhat similar approach has been described in the SiLo scheme.

**The old guard – MinHash**

The effectiveness of the approach described above hinges on the effectiveness of the similarity indexing scheme. Here I based my implementation on the good old MinHash technique. This technique has been used commonly in data mining and web search to determine similar documents, detect plagiarism, perform clustering and so on.

Essentially we break up a document into a set of semantic pieces, compute multiple hashes per piece and identify a subset of pieces that possess the numerically lowest hash values. If such k-min subsets of two documents match then we can say that the documents are somewhat similar to each other. In this deduplication use case a document is nothing but a segment. Since each segment is a set of chunks, each sematic piece is a chunk as determined by the chunking algorithm. The question is what hashes to compute for each chunk in order to apply the MinHash algorithm? (See MinHash for Dummies for a layman’s explanation of MinHash).

After lots of experimentation and head scratching it turns out that the cryptographic hashes computed per chunk can themselves be used via truncating. Truncated cryptographic hashes are themselves independent hashes. I took this idea to its logical extreme and split each cryptographic hash into a set of smaller hashes. These can be thought of as the permutations of the hash function. Then I can sort these hashes numerically and select the K lowest values to get my K-min-values sketch. If one or more elements of the KMV sketches of two segments match then the segments are deemed to have at least a few chunks in common. How small to truncate a hash? It turns out that 64-bit hashes provide a good approximation. So if we are using, say SHA-256, then we get 4 smaller hashes from each chunk hash. These are then numerically sorted and the 20 lowest unique values are chosen. Pictorially this approach can be depicted as below.

The number 20 and other thresholds were arrived at experimentally, to give a good balance of memory usage vs deduplication effectiveness. Increasing beyond 20 hashes, results in diminishing returns on dedupe effectiveness while linearly increasing memory consumption. The point of inflection actually comes at 16 hashes, below which, dedupe effectiveness falls rapidly. The chart below shows the results of testing on a 360GB dataset that I have used in previous occasions.

The interesting thing here is that this approach results in high deduplication effectiveness to the extent of 90% to 95% of the effectiveness of straight forward exact dedupe using a simple chunk index. The choice of the hash function (SHA2, SHA3, BLAKE2 etc) has little bearing on the outcome. Here is another chart showing a comparison of this similarity based deduplication (using various datasets and hash funtions) with the exact deduplication baseline.

**Memory Usage
**

Lets consider the case of 4KB average chunk size. Each segment contains 2048 (approx) chunks, which gives us 8MB average segment size. For each chunk we derive a 256-bit crypto hash which results in 4 64-bit similarity hashes per chunk. We select the lowest valued 20 unique hashes per segment which form the similarity indicators. So we need 160 bytes of similarity indicators per segment. In addition to this we need to store 64-bit pointers to the actual segment data on disk. The segment data is basically the chunk hash and data pointer list stored on disk. So the storage requirement is doubled to 320 bytes. The similarity indicators for a segment are kept in a hash table. So we need to consider some data structure overheads as well. Assuming 4 bytes of overhead on average we have a final storage requirement of 400 bytes per segment.

Now assuming 8MB segment size and one Petabyte of 100% random data where no duplicates exist (worst case), we would need 134217728 segments. This translates to 50GB of worst case memory usage. If we use 8KB chunk size, then the calculations lead to 25GB RAM requirement for the worst case. These memory values are not too much by present day standards and typically data will have duplicates. So RAM requirement will come down by the extent of duplicates present, especially when using the deduplicating storage architecture I had described earlier. If we limit data handling to say 500TB, then even smaller 2KB chunks can be used. These are practical resource limitations. The algorithm does not have any inherent limitations. If we use a well-endowed server with say 256GB or more of RAM then petascale data can be handled using 2KB chunks as well. Performance will be lower of course.

To the best of my knowledge there is as yet no dedupe product that can handle a petabyte of data using small 4KB chunks and an in-memory index. I may be wrong, so please add comments if you know of another such system.

This indexing scheme is actually implemented in Pcompress today and works quite well. It can actually reach raw a dedupe processing throughput of upto 500MB/s (discounting I/O) on my average laptop. Of course there are many other optimizations both algorithmic and architectural some of which I have posted about earlier. I am presently working on the persistent similarity index part which would allow creating versioned incremental archives.