The last few weeks I have been heads down busy with a multitude of things at work and in personal life with hardly any time for anything else. One of the biggest items that kept me busy during my spare times has of course been the release of Pcompress 2.0.
This release brings to fruition some of the hobby research work I had been doing around scalable deduplication of very large datasets. Pcompress 2.0 includes support for Global Deduplication which eliminates duplicate chunks across the entire dataset or file. Pcompress already had support for Data Deduplication but it removed duplicates only within a segment of the data. The larger the segment size, the more effective is the deduplication. This mode is very fast since there is no central index and no serialization. However dedupe effectiveness gets limited.
Global Deduplication introduces a central in-memory index for looking up chunk hashes. Data is first split into fixed-size or variable-length Rabin chunks as usual. Each 4KB (or larger) chunk of data has an associated 256-bit or larger cryptographic checksum (SHA256, BLAKE2 etc.). These hashes are looked up and inserted into a central hashtable. If a chunk hash entry is already present in the hashtable then the chunk is considered a duplicate and a reference to the existing chunk is inserted into the datastream. This is a simple full chunk index based exact deduplication approach which is very effective using 4KB chunk sizes. However there is a problem.
The size of a full chunk index grows rapidly with the dataset. If we are looking at 4KB chunks then we get 268435456 chunks for 1TB of data. Each chunk entry in the hashtable needs to have the 256-bit checksum, a 64-bit file offset and a 32-bit length value. So total size of the index entries is approax 11GB for unique data not considering the additional overheads of the hashtable structure. So if we consider hundreds of terabytes then the index is too big to fit in memory. In fact the index becomes so big that it becomes very costly to lookup chunk hashes slowing the dedupe process to a crawl. Virtually all commercial dedupe products do not even use 4KB chunks. The minimum is 8KB used in Data Domain with most other products using chunk sizes much larger than that. Larger chunk sizes reduce the index size but also reduce dedupe effectiveness.
One of the ways of scaling Data Deduplication to petascale is to look at similarity matching techniques that can determine regions of data that are approximately similar to each other and then compare their cryptographic chunk hashes to actually eliminate exact matching chunks. A variant of this technique uses Delta Differencing instead of hash matching to eliminate redundancy at the byte level. However I was interested in the former.
Pcompress 2.0 includes two approaches to Global Deduplication. If a simple full chunk index can fit into 75% of available free RAM then it is used directly. This is fast and most effective at eliminating duplicates. By default 4KB chunks are used and it gives good performance even with chunks this small. This is lower than what most other commercial or open-source dedupe products recommend or offer. Once file sizes start becoming larger and the index size overflows the memory limit then Pcompress automatically switches to Segmented Similarity Based Deduplication.
In Segmented Similarity mode data is split into 4KB (or larger) chunks as usual (Variable-length Rabin or Fixed-block). Then groups of 2048 chunks are collected to form a segment. With 4KB chunks this results in an average segment size of 8MB. The list of cryptographic chunk hashes for the segment are stored in a temporary segment file. Then these cryptographic chunks hashes are analysed to produce 25 similarity hashes. Each similarity hash is essentially a 64-bit CRC of a min-value entry. These hashes are then inserted or looked up in a central index. If another segment is found that matches at least one of the 25 similarity hashes then that segment is considered approximately similar to the current segment. It’s chunk hash list is then memory mapped into the process address space and exact crypto hash based chunk matching is done to perform the actual deduplication.
This approach results is an index size that is approximately 0.0023% of the dataset size. So Pcompress will require upto a 25GB index to deduplicate 1PB of data. That is assuming 100% random 1PB data with no duplicates. In practice the index will be smaller. This approach provides >90% dedupe efficiency of using a full chunk index while providing high scalability. Even though disk I/O is not completely avoided, it requires one disk write and only a few disk reads for every 2048 chunks. To balance performance and predictable behaviour, the write is synced to disk after every few segments. Using mmap(), instead of a read, helps performance and the disk offsets to be mmap-ed are sorted in ascending order to reduce random access to the segment chunk list file. This file is always written to at the end and extended but existing data is never modified. So it is ideal to place it on a Solid State drive to get a very good performance boost. Finally, access to the central index is coordinated by the threads cooperating using a set of semaphores allowing for lock-free access to critical sections. See: https://moinakg.wordpress.com/2013/03/26/coordinated-parallelism-using-semaphores/
I had been working out the details of this approach for quite a while now and Pcompress 2.0 contains the practical implementation of it. In addition to this Pcompress now includes two additional streaming modes. When compressing a file the output file can be specified as ‘-‘ to stream the compressed data to stdout. Decompression can take the input file as ‘-‘ to read compressed data from stdin.
Global Deduplication in Pcompress together with streaming modes and with help from utilities like Netcat or Ncat can be used to optimize network transfer of large datasets. Eventually I intend to implement proper WAN Optimization capabilities in a later release.