I decided to bump the version to 0.7 and move from alpha stage to a beta release. A large number of fixes and improvements have gone in since the last time I updated about Pcompress here. I did a bunch of testing and rolled in a whole set of functionality and stability fixes. This also pointed me to the need of a test suite which I will be working on soon. I have also added a bunch of performance optimizations and tweaked various parameters to improve compression ratios for different algorithms. For LZMA the levels 11 to 14 now indicate Ultra modes with large dictionaries and match lengths. There is now a “none” compression option to just do dedupe and delta compression so that other compression utilities can be subsequently used on the file. A point to note here is that using dedupe and LZMA with high compression levels together have a negative effect on the ratio unless the files are at least a GB or more.
One of the biggest changes is an improvement to the similarity detection algorithm in Delta compression. Purely tracking the maximal fingerprint values per 1K of data and then doing a sum, which is commutative, was not good several cases. For example if a block is mostly zeroes the fingerprint for those will be minimal with respect to the few non-zero items. So zero commonality is lost in the process. In general higher byte values were being preferred. This is also the reason why using signed byte values in the Rabin-style fingerprint calculation was helping. The signed-ness was helping to normalize higher byte values but could not help in the zero case.
After some experimentation I figured a process to bias the fingerprint value with the occurrence counts. Essentially take the low-order X bits of the rolling fingerprint where X is the bits in the average block size. This goes into the lower 4 bytes of the 64-bit sketch value. The occurrence count goes into the upper 4 bytes. Using unsigned byte values now is required. In addition I am also computing the mean of the sketches and comparing that as well. This scheme can be further improved since the super-sketch is just a sum. Experimentally (Text, Binary data like WAV files, VM Images) I am seeing 7% – 11% false positives when the algorithm indicates 2 blocks are similar but after doing bsdiff the non-similar extra data is too much and the delta encoding is aborted resulting in wasted computation time. I have tried to use a second level hash of the sketches where I tried another Rabin fingerprint implementation but it did not work out well. So this is something I am still experimenting on.