Improving the Similarity Matching

In the last update I had mentioned that I was seeing up-to 11% false positives in the similarity matching scheme within Pcompress. It turned out that it was happening only when the minimum Rabin block size is auto-selected to 2K. It happens when the lightweight fast algorithms are used. For a 4K minimum size false positives were limited to 5% of the total delta blocks.

In any case I started looking at using other statistical properties of the data to refine the similarity checksum. One approach was to use a standard deviation of the sketch values and match those within some tolerance. However it did not help much and in addition involved slower floating point computations, square roots etc. I like to remain in the integer domain and use basic arithmetic. I thought of capturing byte patterns but then the fingerprint was doing that. Eventually I took at approach of counting byte value frequencies and taking into account the top 2 occuring byte values. They are averaged of 1K blocks and added to the existing similarity checksum.

Experimentation with mixed textual, binary and vmdk data showed good improvement. Even though extent of delta compression drops somewhat but it leaves more redundancy on the table for compression algorithms to use and produce better final compression ratios. The biggest benefit is of course in reduction of false positives. Now the max false positives I am seeing varies from around 0.2% to 1.5% of the total delta blocks.

Advertisements

2 thoughts on “Improving the Similarity Matching

  1. Klaus W.

    Dear Moinakg, I am happy that there is finally someone who picks up the important task. Pcompress is really the tool needed – if not everywhere, so at least in many many places. Please let me comment on some detail before I have even started testing: CRC64 is not exactly a wise choice, because it’s an algorithm made for bit loss and bit disturbance errors, which nowadays nearly never ever happen to occur. One should consider using a modern message digest algorithm instead. Keep up the good work & 1000 thanks.

    Reply
    1. moinakg Post author

      Thanks a lot for your encouraging comments. Yes CRC was only a temporary choice. I plan to move to SHA 512-256 or Skein.

      Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s