I have recently added deduplication support to the pcompress utility. Deduplication happens at the chunk level and is therefore done in parallel. Compression is done after deduplication, so dedup removes duplicates at the level of relatively large blocks while compression works on byte-level similarities. There is some interaction between the dedup block size, the break pattern and the compression algorithm used. In particular, the dedup block size can affect the compression ratio depending on the compression algorithm’s window size. For example, since LZMA works on large windows of up to 128MB, using a dedup block size smaller than 4K appears to hurt LZMA’s compression ratio. For other algorithms like Bzip2 or Zlib, a block size of 2K actually helps.
So based on some experimentation I have selected different minimum block sizes and Rabin break patterns. The dedup metadata is kept as low as feasible, typically 1% or less of the final compressed data size (dedup + compress). The dedup implementation adds very little computation-time overhead to the overall process. The general logic is as follows:
- Use a 31-byte Rabin sliding window to create a list of content-aware blocks. The minimum block length is 2K-4K while the maximum is kept at 128K.
- Also compute a CRC64 checksum for each block.
- Sort the block pointers based on the checksum value. This brings all identical blocks together, though hash collisions are of course possible.
- Scan the block pointers, replacing runs of identical block entries with pointers back to the first one. Both length and exact byte-level comparisons are done using memcmp(). This rules out data corruption due to hash collisions.
- Do a second pass through the block entries, this time building another block entry table. Each entry in the table is a 32-bit unsigned integer which is either the length of the data block at the corresponding offset in the byte stream, or an index pointing to another block if this one is a duplicate. The high-order bit is used as a flag to distinguish between the two, so the maximum index number or maximum length of a single block is 2^31. This maximum block length is important since block merging is done in the next step. A maximum index count of 2^31 means that, using the smallest block size of 2K, a single chunk can span up to 4TB. Of course we will be very far from that in practice.
- In the second pass, also merge consecutive runs of non-duplicate block entries into a single entry. Up to 2GB of consecutive blocks can be merged. This has the effect of drastically reducing dedup metadata. Since merging changes the block indexes, all index pointers from duplicate blocks must also be updated. A translation table is built in this step to simplify this process.
- A third and final pass is made over the reduced index table to copy all data blocks into the target buffer, updating the duplicate index entries from the translation table at the same time.
- Compress the final dedup index using LZMA and the remaining data blocks using whatever compression the user specified.
- Additional memory usage for dedup is kept under control by using space in the target buffer for some of the intermediate arrays.
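The sort-and-verify duplicate detection in the steps above can be sketched roughly as follows. This is a simplified Python sketch, not pcompress's actual code: CRC32 stands in for the CRC64 checksum, and the `find_duplicates` helper and its in-memory representation are hypothetical.

```python
import zlib

def find_duplicates(blocks):
    """Return refs where refs[i] == i for a unique block, or the index
    of the first identical block when block i is a duplicate."""
    # Sort block indexes by (checksum, content) so identical blocks land
    # next to each other; the stable sort keeps original index order.
    order = sorted(range(len(blocks)),
                   key=lambda i: (zlib.crc32(blocks[i]), blocks[i]))
    refs = list(range(len(blocks)))
    for prev, cur in zip(order, order[1:]):
        # A matching checksum is only a hint; an exact length and byte
        # comparison (memcmp() in the C code) rules out hash collisions.
        if blocks[cur] == blocks[prev]:
            refs[cur] = refs[prev]
    return refs
```

Because each duplicate entry is resolved against its already-resolved predecessor in sorted order, a whole run of identical blocks collapses to a single reference to the first occurrence.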
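The 32-bit index entry with its high-order flag bit might look like this. Only the bit layout comes from the description above; the function names are illustrative.

```python
FLAG = 1 << 31          # high-order bit: set => duplicate index entry
VALUE_MASK = FLAG - 1   # low 31 bits hold the length or the index

def encode_length(length):
    """Entry for a unique data block: its length, flag bit clear."""
    assert 0 <= length <= VALUE_MASK
    return length

def encode_index(index):
    """Entry for a duplicate block: the referenced index, flag bit set."""
    assert 0 <= index <= VALUE_MASK
    return FLAG | index

def decode(entry):
    """Split a 32-bit entry back into its kind and 31-bit value."""
    if entry & FLAG:
        return ("dup", entry & VALUE_MASK)
    return ("unique", entry)
```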
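The block-merging pass with its translation table can be sketched as follows. This is a simplified Python model, not pcompress's code; in particular, the assumption that blocks referenced by duplicates are left un-merged (so duplicate references still resolve to the right data) is mine, and the real merge rules may differ.

```python
def merge_runs(entries, referenced):
    """Merge consecutive runs of non-duplicate entries into single entries.

    entries:    list of ("literal", length) / ("dup", old_index) tuples,
                a readable stand-in for the packed 32-bit words.
    referenced: set of old indexes that some "dup" entry points at; these
                are kept un-merged here so duplicate references stay valid.
    """
    merged, growable, xlat = [], [], {}
    for old, (kind, val) in enumerate(entries):
        if (kind == "literal" and old not in referenced
                and merged and growable[-1]):
            # Extend the previous run of unique blocks.
            xlat[old] = len(merged) - 1
            merged[-1] = ("literal", merged[-1][1] + val)
        else:
            xlat[old] = len(merged)
            merged.append((kind, val))
            growable.append(kind == "literal" and old not in referenced)
    # Repoint duplicate entries through the translation table.
    return [(k, xlat[v]) if k == "dup" else (k, v) for k, v in merged]
```

Merging a run only ever touches the lengths of unique, unreferenced blocks, so the duplicate entries stay correct once remapped through the translation table.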
This deduplication scheme is simple and fast. Rabin computations are all done using shifts and additions, avoiding costly multiplication and modulus operations. The byte-level data scanning also provides an opportunity to discover characteristics of the data that can be leveraged to improve compression later. I will be looking at this going forward.
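For illustration, the content-aware chunking can be sketched with a multiplication-free rolling hash. Here a cyclic-polynomial (buzhash-style) hash built from rotations, XORs and a lookup table stands in for pcompress's actual Rabin fingerprint; the break mask and table below are illustrative, not the real constants.

```python
import random

WINDOW = 31                  # sliding-window width from the post
MIN_BLOCK = 2048             # 2K minimum block size
MAX_BLOCK = 131072           # 128K maximum block size
BREAK_MASK = (1 << 11) - 1   # illustrative break pattern

random.seed(42)
# Byte -> random 32-bit value table for the rolling hash.
T = [random.getrandbits(32) for _ in range(256)]

def _rotl32(x, n):
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def chunk(data):
    """Split data into content-defined blocks with a rolling hash."""
    blocks, start, h = [], 0, 0
    for i in range(len(data)):
        h = _rotl32(h, 1) ^ T[data[i]]
        if i - start >= WINDOW:
            # Drop the contribution of the byte leaving the window.
            h ^= _rotl32(T[data[i - WINDOW]], WINDOW)
        size = i - start + 1
        at_break = size >= MIN_BLOCK and (h & BREAK_MASK) == BREAK_MASK
        if at_break or size >= MAX_BLOCK:
            blocks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        blocks.append(data[start:])
    return blocks
```

Because block boundaries depend only on window contents, inserting bytes early in a stream shifts boundaries only locally, which is what makes the blocks "content-aware" rather than fixed-offset.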
I am observing good results combining dedup and compression, and the ability to parallelize speeds up both. Dedup metadata is kept to a minimum using a block-merging technique. Finally, compression dictionaries for the data are not polluted, since the dedup index is compressed separately. I will be posting updated measurements soon. As usual, source code is available here: https://github.com/moinakg/pcompress.