Delta compression, also known as delta encoding, is the process of finding the differences between two documents that are similar to each other but not 100% identical. Delta encoding comes in various forms, from finding differences between individual byte sequences to diffing two whole documents/files. A complete description of delta encoding can be found in the Wikipedia article. Delta encoding is widely used in version control repositories, for example, where successive edits to a file are stored as a sequence of deltas against the original file (with some optimizations when the number of deltas grows large).
When optimizing compression with dedupe it also makes sense to include delta compression. Once dedupe has removed identical duplicate blocks, we can look at blocks that are closely similar to each other and store only the diff between them. For example, if B is 80% similar to A, we can store a delta in place of B such that fetching A and applying the delta reproduces B. A good delta tool will generate a near-optimal diff using only about 25% to 30% of the original space, so we achieve a good reduction.
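To make the "fetch A, apply delta, get B" idea concrete, here is a toy delta in C for same-length blocks: the delta is just a list of (offset, byte) pairs. The names and the format here are made up for illustration; real tools like BSDIFF handle insertions, deletions and approximate matches and are far more sophisticated.

```c
#include <stddef.h>
#include <string.h>

/* One differing byte: where it is and what B has there. */
typedef struct {
	size_t off;
	unsigned char byte;
} delta_ent;

/* Build a toy delta from A to B (same length); returns the number of
 * entries written into delta[]. */
size_t
make_delta(const unsigned char *a, const unsigned char *b, size_t len,
    delta_ent *delta)
{
	size_t n = 0;
	for (size_t i = 0; i < len; i++)
		if (a[i] != b[i])
			delta[n++] = (delta_ent){ i, b[i] };
	return (n);
}

/* Patch A in place so it becomes B. */
void
apply_delta(unsigned char *a, const delta_ent *delta, size_t n)
{
	for (size_t i = 0; i < n; i++)
		a[delta[i].off] = delta[i].byte;
}
```

If the blocks are 80% similar, the delta holds entries for only the differing 20%, which is where the space saving comes from.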
Now this is all good, but the challenge here is to find a way to determine which blocks are similar. One way to measure similarity is the Hamming distance. However, Hamming distance is a pairwise operation requiring repeated byte-scans of blocks, and with ‘n’ blocks it easily becomes an O(n * (n - 1) / 2) operation. It is too expensive to use in practice. What I actually needed here is an indicator, preferably one or more numeric values like hashes, which indicates similarity between blocks and allows us to do fuzzy matching. Two blocks which are mostly similar should have close or identical numeric values. Cryptographic hashes are not suitable since they are built to diverge pseudo-randomly even if two blocks of data differ by a single bit. This is called the avalanche effect and is a desirable property of a cryptographic hash; a similarity hash needs the opposite behavior.
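For reference, a byte-wise Hamming distance function looks like the sketch below. The point is that every comparison scans both blocks in full, and all-pairs matching over n blocks needs n * (n - 1) / 2 such scans, which is why it is rejected above.

```c
#include <stddef.h>

/* Count the number of differing bits between two equal-length blocks. */
size_t
hamming_distance(const unsigned char *a, const unsigned char *b, size_t len)
{
	size_t dist = 0;
	for (size_t i = 0; i < len; i++) {
		unsigned char x = a[i] ^ b[i];
		while (x) {          /* count set bits in the XOR */
			dist += x & 1;
			x >>= 1;
		}
	}
	return (dist);
}
```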
I was reading up on this topic and chanced across some papers and articles. The most interesting article I found is this one: http://www.armedia.com/wp/SimilarityIndex.pdf. It has a detailed treatment of the topic and evaluates various approaches to building a similarity key. The measures are interesting, and we arrive at the concept of a sketch, which is something out of category theory in mathematics. So we essentially have to build a sketch of a document or data block and compare it to the sketches of other documents to find similarity. There are many approaches to building sketches; however, the ones recommended in the paper are not entirely suitable in the Pcompress case. The paper focuses on English textual documents and uses things like soundex codes to measure closeness of the language and introduce fuzziness into the match. I needed to do a generic match for arbitrary binary or mixed textual data blocks. I tried using the Sum 0 mod 25 sketch that, combined with soundex and stop-word stemming, works so well in the paper for textual documents. It did not work out well for me, and in any case I cannot use soundex or stop-word stemming: those do not apply to arbitrary data blocks.
After some pondering and experimentation I decided on an approach that uses some concepts from the paper in a different way. What I have done is to identify some of the “maximal features” of a data block and generate a single key from them. One way to identify maximal features is to maintain occurrence counts of Rabin fingerprint values and sum the top N occurring fingerprints, where N is an arbitrary number determined empirically (or 400, as mentioned in the paper). This requires maintaining a hashtable of fingerprints with occurrence counts and then sorting those when a block boundary is hit. To simplify things, and also to get some fuzziness, I took a somewhat different approach. For each KB of data processed I maintain a table indexed by the bottom X bits of the fingerprint, where X is the number of bits in the average block size. So if the average block size is 4KB then we get a 4096-entry table. The table is initialized to 0 and current fingerprint values are added to the table entries based on the bottom-X-bit indexing. I also maintain a pointer to the index of the current highest-valued entry. Whenever 1 KB of data has been processed, the highest value is added to the running “Super-Sketch” value and the table is re-initialized to 0. Once a block boundary is hit, the Super-Sketch value is stored as the block’s similarity index. To additionally determine identical matches I of course do a memcmp() – phew. My experimentation to date, albeit limited, has produced good results, and this approach actually identifies close similarities. The use of “char” as opposed to “unsigned char” when computing the semi-Rabin fingerprint also helps matters somewhat.
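As a rough illustration, here is a self-contained C sketch of the Super-Sketch idea described above. Note the assumptions: `toy_fingerprint` is a made-up stand-in for the real semi-Rabin fingerprint, and the table size and sub-block size are hard-coded for the 4KB-average case; this is not the actual Pcompress code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TBL_BITS  12                 /* bottom X bits: 4096 entries for a 4KB avg block */
#define TBL_SIZE  (1 << TBL_BITS)
#define SUB_BLOCK 1024               /* fold in a maximal feature every 1 KB */

/* Toy rolling fingerprint (rotate-left-1 then XOR the byte); a mere
 * stand-in for the semi-Rabin fingerprint used in Pcompress. */
static uint64_t
toy_fingerprint(uint64_t fp, unsigned char c)
{
	return (((fp << 1) | (fp >> 63)) ^ (uint64_t)c);
}

/* Compute a "Super-Sketch" for a block: per KB, accumulate fingerprint
 * values into a table indexed by the bottom TBL_BITS bits, track the
 * highest-valued entry, and fold that maximum into the running sketch. */
uint64_t
super_sketch(const unsigned char *buf, size_t len)
{
	uint64_t tbl[TBL_SIZE];
	uint64_t fp = 0, sketch = 0;
	size_t i, max_idx = 0;

	memset(tbl, 0, sizeof (tbl));
	for (i = 0; i < len; i++) {
		fp = toy_fingerprint(fp, buf[i]);
		size_t idx = (size_t)(fp & (TBL_SIZE - 1));
		tbl[idx] += fp;
		if (tbl[idx] > tbl[max_idx])
			max_idx = idx;   /* pointer to the current max entry */
		if ((i + 1) % SUB_BLOCK == 0) {
			sketch += tbl[max_idx];
			memset(tbl, 0, sizeof (tbl));
			max_idx = 0;
		}
	}
	if (len % SUB_BLOCK != 0)        /* fold in a partial trailing sub-block */
		sketch += tbl[max_idx];
	return (sketch);
}
```

Two mostly-similar blocks share most fingerprint values, so their per-KB maxima, and therefore their Super-Sketch sums, tend to land close together, which is exactly the fuzziness we want.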
Now I had a way to identify similarities, but what should I use to do the actual grind, that is, generate a minimal diff? I needed a binary diff utility, and some Googling led me to the inevitable BSDIFF utility. It is widely used, for example in Mozilla and Chromium, to deliver binary patches, and it produces some of the smallest diffs compared to other utilities like Zdelta, Xdelta etc. As an aside, Google has a nifty technology called Courgette for Chromium to further minimize executable patches beyond what Bsdiff alone can achieve. The license is also quite friendly. Its problems are memory consumption and the time taken to do the suffix sort. In Pcompress I am only dealing with 4KB to 128KB blocks, so the tradeoff is completely okay for me.
However, BSDIFF uses Bzip2 to compress the delta, which I could not keep since in my case the final chunk, consisting of all the data blocks produced after dedupe, is eventually compressed by a user-specified algorithm. Having Bzip2-compressed hunks liberally scattered about hurts overall compression so much that the result is larger than not doing delta compression in the first place! Of course, Bzip2 over Bzip2 is a worst case. Modifying BSDIFF to leave out Bzip2 causes another problem: the output diff data is larger than the original! I printed the diff data byte by byte to find out what was going on. It transpires that BSDIFF produces large runs (thousands) of zero bytes, with the actual non-zero diff data positioned at appropriate places as per its control index. That is why a simple Bzip2 pass collapses it so well and produces a tiny diff output. Bzip2 was not an option for me, but a simple run-length encoding is enough to collapse these neutral zones to only a few bytes. So I conjured up a run-length encoder that encodes every run of at least 4 zero bytes into 2 bytes. It also pays a 2-byte overhead to maintain the count for each run of non-zero mixed bytes. Given the BSDIFF output, it actually works well in practice. Delta compression now starts to have the desired positive impact.
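A zero-run RLE along those lines could look like the sketch below. The token format here (a tag byte plus a count byte, runs longer than 255 split) is my own assumption for the example; the actual encoding in Pcompress may differ.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical zero-run RLE: a zero run of length >= 4 becomes the
 * 2-byte token (0x00, count); everything else becomes a literal run
 * with a 2-byte (0x01, count) header followed by the raw bytes.
 * The caller must size out[] for worst-case expansion. Returns the
 * encoded length. */
size_t
zero_rle_encode(const unsigned char *in, size_t len, unsigned char *out)
{
	size_t i = 0, o = 0;

	while (i < len) {
		/* Measure the zero run starting here, capped at 255. */
		size_t zrun = 0;
		while (i + zrun < len && in[i + zrun] == 0 && zrun < 255)
			zrun++;
		if (zrun >= 4) {
			out[o++] = 0x00;
			out[o++] = (unsigned char)zrun;
			i += zrun;
			continue;
		}
		/* Literal run: stop before the next zero run of length >= 4. */
		size_t start = i, lit = 0;
		while (i < len && lit < 255) {
			size_t z = 0;
			while (i + z < len && in[i + z] == 0 && z < 4)
				z++;
			if (z >= 4)
				break;
			i++;
			lit++;
		}
		out[o++] = 0x01;
		out[o++] = (unsigned char)lit;
		memcpy(out + o, in + start, lit);
		o += lit;
	}
	return (o);
}

/* Inverse transform, for round-tripping. */
size_t
zero_rle_decode(const unsigned char *in, size_t len, unsigned char *out)
{
	size_t i = 0, o = 0;

	while (i < len) {
		unsigned char tag = in[i++];
		unsigned char cnt = in[i++];
		if (tag == 0x00) {
			memset(out + o, 0, cnt);
		} else {
			memcpy(out + o, in + i, cnt);
			i += cnt;
		}
		o += cnt;
	}
	return (o);
}
```

With this scheme a thousand-byte neutral zone in the BSDIFF output collapses to four 2-byte tokens, while dense literal data only pays the occasional 2-byte header.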
There are things above that can certainly be improved. Take the sketch computation, for one: the way the fingerprints are sorted to arrange them by similarity and then scanned. Instead of calling the platform’s quicksort function we could maintain a tree structure and insert fingerprints into the appropriate nodes. It is possible that the current code misses some delta and/or dedupe opportunities because of this. In addition to the similarity hash we could also compute a SHA-256 hash and use it for identity comparison instead of a costly memcmp(). The probability of a SHA-256 collision is less than the probability of random hardware failures, disk failures or bit flips in RAM, and this technique is used in many places – ZFS, for example. Of course, folks are welcome to play with the code and contribute.