As I had posted in Improving the Similarity Matching, I had a fair process of doing fuzzy matching. However it was an experimentally derived non-standard process which did come up with a bunch of false positives even though they were low in percentage terms. I kept reading up on various aspects of fuzzy matching and most of the articles were very mathematical or appeared fairly elaborate, for example Broder’s seminal paper from 1997. I could follow the core ideas of the paper but I am not a math geek and get put off by too many mathematical details. However the core idea of that paper is actually called MinHashing which is used quite widely starting with the Altavista search engine.
There are various articles, blog posts and code samples on that technique and it is popular because the core concept is so simple. There are many variants of the technique used in different places. For textual documents this is also known as a W-shingling approach and is enhanced using other techniques like punctuation elimination, capital reduction, stop word stemming etc. For generic blocks of data which can contain non-textual binary the approach reduces to using a rolling checksum like Rabin Fingerprints. For a given block one has to get all the rolling checksum values and select K minimum checksum values from the set. We can then match all the K-min-values Sketches of all the blocks to determine fuzzy similarity between blocks. The ratio K to block size determines the extent of similarity we are looking for. While all this is fine, it is not immediately apparent how to go about doing this practically in one’s application. Storing and dealing with K minimum checksum values is out of the question, and what is the efficient way to get the K min values list in the first place ? Do we have to sort or is there a better way ?
Since the approach is so popular there are also lots of example code available which shows how to do this in practice. However it also requires a bit of tweaking to suit one’s application needs. I am listing some of the resources I read through to understand and evaluate the approaches:
- Sketch of the Day: K-Minimum Values
- Finding similar items using minhashing
- Similarity and Dissimilarity Measures
- Set Similarity and Min Hash
- Calculating Similarity
- Finding the top K items in a list efficiently. While this talks about top-K the approach has to be inverted for bottom-K.
If you get overwhelmed after reading all those links and the Wikipedia articles above I can fully sympathize. However patience and perseverance is required especially if Maths is not your second language. If I take a stock here, things boil down to this:
- Compute all the rolling hashes for a block and store them.
- I do not need to store the entire 8-byte hash, the lower 4 bytes are mostly enough.
- I am using block sizes of 4K to 128K, so I need 8KB to 500KB to store the list for a block.
- Since pcompress does memory buffer to memory buffer compression I can temporarily use space in the target buffer for the list avoiding an allocation.
- Chose K = 40% of block size to allow for 40% approximate similarity.
- Once the K minimum values are obtained compute a hash over that list to get a Similarity Index.
- I am using xxHash here as it is extremely fast and has very good properties.
- If the Similarity Index of two blocks are same they should be similar to the extent of 40% or greater.
- Okay so now how to efficiently find the K minimum fingerprint values ?
Sorting is an option but is sub-optimal. I just need K values and should not waste computation sorting all the values. I appears that a Heap based Priority Queue is the optimal choice in this situation. Thanks to Steve Hanov for the tip. While Steve talks about finding top-K items, finding the bottom-K items is the same approach tweaked. I eventually landed up with an in-place Min Heap implementation from Python’s heapq. Also see this StackOverflow thread for more details. Python’s heapq is implemented in C and is nice and compact. It also has very high performance. I stripped out only portions from the heapqmodule that I needed and simplified and optimized them further as much as I could think of. I also decided not to sort the collection of K min values that this gives. The relative positions of values in the input block can affect where the values end up in the in-place heap and is an added similarity measure automatically.
So after all this circus I now have a much better Delta Compression for Pcompress where the potential for false similarity matches is about 0.001% on average. False similarity is one where the two blocks have less than 40% in common. The Delta Compression performance is much better than before. In addition because of better matching predictability it helps all compression algorithms to improve compression ratio. This change is currently in the Github repo and will make it into the next release in a while. See here and here if you are interested in the actual changes.