In the last update I mentioned that I was seeing up to 11% false positives in the similarity matching scheme within Pcompress. It turned out that this was happening only when the minimum Rabin block size is auto-selected to 2K, which occurs when the lightweight fast algorithms are used. With a 4K minimum size, false positives were limited to 5% of the total delta blocks.
In any case, I started looking at other statistical properties of the data to refine the similarity checksum. One approach was to compute the standard deviation of the sketch values and match those within some tolerance. However, it did not help much and in addition involved slower floating-point computations, square roots and so on. I prefer to remain in the integer domain and use basic arithmetic. I thought of capturing byte patterns, but the fingerprint was already doing that. Eventually I took the approach of counting byte value frequencies and taking into account the top two occurring byte values. These are averaged over 1K blocks and added to the existing similarity checksum.
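To make the idea concrete, here is a minimal sketch of that refinement in C. This is a hypothetical helper, not the actual Pcompress code: the function name, signature and the way the averages are folded into the checksum are my assumptions; it only illustrates the integer-only scheme of finding the top two byte values per 1K sub-block, averaging them across sub-blocks and adding the result to an existing similarity checksum.

```c
#include <stdint.h>
#include <stddef.h>

#define SUB_BLOCK 1024  /* 1K sub-blocks, as described above */

/* Hypothetical helper: refine an existing similarity checksum using
 * the top two occurring byte values, averaged over 1K sub-blocks. */
uint64_t refine_checksum(uint64_t cksum, const uint8_t *buf, size_t len)
{
    uint64_t top1_sum = 0, top2_sum = 0;
    size_t nblocks = 0;

    for (size_t off = 0; off < len; off += SUB_BLOCK) {
        size_t blen = (len - off < SUB_BLOCK) ? len - off : SUB_BLOCK;
        uint32_t freq[256] = {0};

        /* Count byte value frequencies in this sub-block. */
        for (size_t i = 0; i < blen; i++)
            freq[buf[off + i]]++;

        /* Find the two most frequent byte values. */
        int top1, top2;
        if (freq[1] > freq[0]) { top1 = 1; top2 = 0; }
        else                   { top1 = 0; top2 = 1; }
        for (int b = 2; b < 256; b++) {
            if (freq[b] > freq[top1]) {
                top2 = top1;
                top1 = b;
            } else if (freq[b] > freq[top2]) {
                top2 = b;
            }
        }
        top1_sum += (uint64_t)top1;
        top2_sum += (uint64_t)top2;
        nblocks++;
    }

    /* Average the top byte values over all sub-blocks and add them
     * to the existing checksum -- integer arithmetic only. */
    if (nblocks > 0)
        cksum += (top1_sum / nblocks) + (top2_sum / nblocks);
    return cksum;
}
```

Note that everything stays in plain integer arithmetic: one pass over the data, a 256-entry counter array per sub-block, and two divisions at the end, which keeps the cost low compared to standard-deviation style measures.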
Experimentation with mixed textual, binary and VMDK data showed good improvement. Even though the extent of delta compression drops somewhat, this leaves more redundancy on the table for the compression algorithms to exploit and produces better final compression ratios. The biggest benefit is of course the reduction in false positives: the maximum I am now seeing varies from around 0.2% to 1.5% of the total delta blocks.