Pcompress has gone through a sea of changes since the last time I ran benchmarks comparing performance and effectiveness with other utilities. So I spent several days running various benchmark scripts generating and collating a lot of results in the process.
Due to the sheer volume of the results and limited time, I took the easy way out of importing all the CSV data into Excel, formatting and charting them and exporting to HTML. The generated HTML code looks complex and messy but at least it shows up correctly in Firefox, Chrome and IE.
The first set of results can be seen here: http://moinakg.github.io/pcompress/results1.html. This is basically comparing pcompress with Segment-level and Global Deduplication to other standard utilities. It also contrasts effectiveness of Global Dedupe with Segment-level Dedupe.
The Datasets used
- A tar of the VMDK files of installed CentOS 6.2 x86-64 version.
- Linux 3.6 RC2 source tarball.
- Two copies of the Silesia corpus tar concatenated together. This results in a file that is double the size of the original Silesia corpus but has 100% duplicate data.
- A tarball of the “Program Files” directory on my 32-bit Windows 7 installation.
- As is quite clear, Pcompress is both faster and more effective compared to the standard utilities tested: Gzip, Bzip2, 7za, Xz and Pxz (Parallel Xz).
- As usual Xz performs the worst. The time graph shows a steep spike. Pxz is a lot better but is still half as slow as Pcompress. In addition remember that Pcompress is having a bunch of additional processing overheads that the other utilities do not have: SHA256, BLAKE2, LZP and Delta2 processing.
- Interestingly the LZ4 mode along with Dedupe and all the preprocessing produces results that are close to traditional Gzip while being more than twice as fast. In fact two datasets shows results smaller than Gzip. This result is notable when one wants good compression done extremely fast.
- Global Dedupe of course is more effective than Segment-level Dedupe but what is more surprising is that it is also faster overall, even though Global Dedupe requires serialized access to a central index and Segmented Dedupe is fully parallel. I can attribute three causes: my test system is low-end with constrained RAM bandwidth and conflicts arising from parallel access; Segment-level dedupe also uses memcmp() while Global Dedupe does not; Global Dedupe reduces data further resulting in lesser work for the final compression algorithm.
- The concatenated Silesia corpus with 100% duplicate data of course shows the maximum benefit from Global Dedupe that removes long-range redundancies in data.
- In some cases compression levels 9 and 14 show marginally lesser compression than level 6. This appears to be because of LZP side-effects. At higher levels, LZP parameters are tweaked to work more aggressively so it may be taking out a little too much redundancy that affects the compression algorithm’s effectiveness. This is something that I will have to tweak going forward.
I will be posting more results soon and will include a comparison with Lrzip that uses an improved Rzip implementation to take out long-range redundancies in data at a finer granularity compared to 4KB variable-block Deduplication.