I have been fairly busy over the past few months with multiple things demanding my attention at the same time both at work and at home. In addition, I was doing a bunch of testing and analysis of the algorithms I am using in Pcompress, which resulted in some useful findings, improvements and fine tunings of the parameters. I also wanted to test with a few hundred GBs worth of data and had to procure additional disks to perform testing within realistic time scales. The combination of a Stardom iTANK external enclosure and a WD Caviar Black 2TB drive over eSATA worked nicely providing upto 137MB/s of sustained sequential read throughput. I have finally managed to pull together all the changes and put out a 2.3 release.
Among other things, I have made some changes to the KMV Sketch computation and updated it to select unique similarity indicators. Earlier it was just selecting the lowest K indicators, some of which were duplicates. This also allowed me to reduce the number of IDs used for the similarity match. This in turn results in a smaller index while improving the similarity match at the same time. Data splitting between threads is also improved now, resulting in fewer unbalanced chunks. These improvements and some code cleanups have resulted in performance improvements as well. As of this release, the similarity based near-exact deduplication is quite faster than exact deduplication using a simple chunk index with upto 98% effectiveness as compared to the latter.
One of the changes in this release is to move all the functionality into a shared library and make the main executable as a simple wrapper. This will allow me to add a programmatic API going forward. The release can be downloaded from here: http://code.google.com/p/pcompress/downloads/detail?name=pcompress-2.3.tar.bz2