I have added the 3rd and final set of benchmark results comparing Pcompress to two other data dedupe utilities, Lrzip and eXdupe here: http://moinakg.github.io/pcompress/results3.html. Lrzip does not do traditional dedupe of 4KB blocks or above. Rather it uses the Rzip algorithm which is derived from Rsync.
Rzip also does variable block dedupe but at much smaller sizes than 4KB. However I am not sure if Rzip can be adapted as a multi-file generalized deduplication store as the index blow-up is quite extravagant. Though it might be possible to do segmented matching and then apply Rzip across Segment data. It will require re-reading old segment data and the dedupe solution will necessarily be offline or post-process.
The observations from the results are summarized below:
- If we just do Dedupe and avoid compression of data (“Dedupe Only” result in the graphs) then Lrzip produces smaller archives. This is obvious since Pcompress does traditional Dedupe at average 4KB variable blocks while Lrzip finds matches are much smaller lengths. Exdupe cannot be compared here as it has no option to avoid compression. At high compression levels Pcompress consistently gives the fastest times. However except for LZ4 option Pcompress produces slightly larger archives for all other algorithms when compared with Lrzip. Lrzip uses Lzo not LZ4. I tried using Lrzip to just do rzip and then compress the result with LZ4 for the CentOS tarball. I got a size of 662751240 bytes with data split into 256MB chunks. So Lrzip would have produced a smaller archive if it had integrated LZ4.
- LZ4 is a fantastic algorithm. The combination of speed and compression ratio is unparalleled.
- At fast compression levels Pcompress matches or exceeds Exdupe in speed (depending on the dataset) while producing a better compression ratio. Once again LZ4 has a big contribution to the result. Lrzip loses out handily in terms of speed but compression ratio is good.
- In general Pcompress gives some of the best combinations of compression ratio and speed.
- One of the possible reasons for the larger Exdupe file sizes can be extra metadata. Exdupe allows differential backups to be taken against an initial full backup. In order to do block-level differential backup, in other words deduplicated backup, it needs to store additional metadata for existing blocks.
Remember this is just a small system with 2 cores and 2 hyperthreads, or 4 logical cores. On systems will more cores Pcompress performance will scale appropriately.