
Pcompress updated to 1.1

I just put out a new release that implements proper HMAC computation as described in my previous post. I added a bunch of new tests that exercise error conditions and error handling for out-of-range parameters and corrupted archives. These tests exposed problems that had gone unnoticed earlier, mostly improper error handling situations. As you can see, testing may be boring but it is absolutely necessary.
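For readers who have not seen the previous post, the general idea of MACing an archive can be sketched in a few lines of Python. The header layout, the raw key, and the SHA-256 digest below are illustrative only, not the actual Pcompress on-disk format:

```python
import hmac
import hashlib

# Sketch of archive authentication with an HMAC. The 32-byte SHA-256
# tag and the header/payload layout are hypothetical choices.
def seal(key: bytes, header: bytes, payload: bytes) -> bytes:
    """MAC the header and payload together so neither can be tampered with."""
    tag = hmac.new(key, header + payload, hashlib.sha256).digest()
    return header + tag + payload

def open_sealed(key: bytes, blob: bytes, header_len: int) -> bytes:
    """Verify the HMAC before returning the payload."""
    header = blob[:header_len]
    tag = blob[header_len:header_len + 32]
    payload = blob[header_len + 32:]
    expect = hmac.new(key, header + payload, hashlib.sha256).digest()
    # compare_digest() runs in constant time, avoiding a timing side channel
    if not hmac.compare_digest(tag, expect):
        raise ValueError("HMAC mismatch: corrupted or tampered archive")
    return payload
```

The point the tests above hammer on is the failure path: a single flipped bit anywhere in the header or payload must make verification fail cleanly rather than crash or silently succeed.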

I also optimized the use of Zlib by switching to raw Zlib streams. A normal Zlib stream includes a header and computes an Adler32 checksum, among other things. However, Pcompress adds its own headers and computes cryptographic checksums like SHA and SKEIN, which are light years ahead of the simple Adler32 in terms of data integrity verification. So the Adler32 computation overhead was unnecessary.
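The same trick is easy to demonstrate with Python's zlib module, where negative window bits select a raw deflate stream with no Zlib header and no Adler32 trailer:

```python
import zlib

data = b"some repetitive payload " * 1000

# A normal Zlib stream: 2-byte header + deflate data + 4-byte Adler32 trailer.
normal = zlib.compress(data, 6)

# A raw deflate stream: negative window bits suppress the Zlib header
# and the Adler32 trailer entirely (and skip the Adler32 computation).
co = zlib.compressobj(6, zlib.DEFLATED, -15)
raw = co.compress(data) + co.flush()

# Decompressing a raw stream also needs negative window bits.
do = zlib.decompressobj(-15)
assert do.decompress(raw) == data

# With identical compression parameters, the raw stream is smaller by
# exactly the header-plus-trailer framing bytes.
print(len(normal) - len(raw))
```

The saving per stream is tiny; the real win is dropping the redundant Adler32 work when a stronger digest is computed anyway.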

I also tweaked the rolling hash used for deduplication while doing the chunking analysis. This produces better chunking and improves identity dedupe reduction. Finally, I added header checksums in plain non-crypto mode.
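To give an idea of what the rolling hash does during chunking, here is a minimal content-defined chunking sketch in Python. The window size, multiplier, and break mask are hypothetical; Pcompress's actual rolling hash and tuning are different:

```python
# Content-defined chunking with a simple polynomial (Karp-Rabin style)
# rolling hash. A chunk boundary is declared wherever the low bits of
# the hash match a fixed pattern, so boundaries depend on content, not
# position, and survive insertions/deletions elsewhere in the data.
WINDOW = 48            # bytes in the rolling window (illustrative)
PRIME = 153191         # hash multiplier (hypothetical choice)
MASK = 0xFFF           # 12 bits -> average chunk size around 4KB
MIN_CHUNK = 2048
MAX_CHUNK = 65536

def chunk_offsets(data: bytes):
    """Yield the end offset of each chunk in data."""
    h = 0
    start = 0
    # factor needed to remove the outgoing byte from the hash
    pow_out = pow(PRIME, WINDOW - 1, 1 << 32)
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            # slide the window: subtract the contribution of the oldest byte
            h = (h - data[i - WINDOW] * pow_out) & 0xFFFFFFFF
        h = (h * PRIME + b) & 0xFFFFFFFF
        size = i - start + 1
        if size >= MAX_CHUNK or (size >= MIN_CHUNK and (h & MASK) == MASK):
            yield i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield len(data)
```

With boundaries chosen this way, identical regions of data tend to produce identical blocks, which is what makes identity dedupe with a ~4KB average block size work at all.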



Compression Benchmarks #2

As a follow-up to my previous Compression Benchmarks, here is a second set of benchmarks comparing Pcompress with existing parallel utilities and showing the impact of Deduplication.

Note: I am using very large chunks in these tests in order to let Deduplication find enough duplicate data with an average block size of 4KB. Pcompress does not yet have a Global Deduplication facility, so I am using this approach. I will put up a later post with test results using smaller chunks of, say, 1MB – 8MB.

The datasets: This time I used the Linux 3.6 tarball and the CentOS VMDK tarball from the earlier tests. In addition, I created a tarball containing Windows 7 64-bit system files: the contents of the “Windows\SysWOW64” and “Program Files” directories. This time I did not include a graph bar for the original size, in order to highlight the compressed size differences clearly.

Linux 3.6 Git clone Tarball

As you can see, Pcompress is slightly slower than Pigz and Pbzip2 on this dataset but offers a somewhat better compression ratio, especially with LZ-Prediction and Deduplication. In addition, one needs to keep in mind that Pcompress provides far stronger data integrity verification via the default SKEIN 256 message digest, which takes more cycles to compute than the simple 32-bit checksums in Pigz and Pbzip2.

The total dataset size is 466MB, so using a 256MB chunk size produces two chunks: one of 256MB and one of 210MB. The total time taken is therefore the time taken to process the larger chunk. Deduplication has only a marginal impact on this dataset, as there are very few duplicate blocks with an average size of 4KB within the Linux kernel source tree. The main benefit comes from LZ-Prediction.

Tarball of CentOS 6 and CentOS 6.2 VMDK Images

Notice that the Libbsc algorithm provides the best results on this dataset (aside from LZMA, which is not included here; see my previous post). However, Libbsc is not included in the 1GB chunk results, as a memory inefficiency in the implementation causes it to gobble too much RAM and exceed my system’s 8GB budget. I will fix this in an upcoming release.

There are two items of interest here that jump out to anyone looking closely at the graphs. First, Bzip2 performs better than Zlib (gzip) on this dataset in all respects, whether we are using the parallel utilities (Pigz, Pbzip2) or Pcompress. Secondly, Pcompress produces much better compression than the parallel utilities; both Deduplication and LZP preprocessing add real value. Of course, it is a known fact that VMDK images dedupe well.

So the properties of VMDKs lend themselves well to Bzip2, both in terms of speed and compression ratio. Consider using the Bzip2 algorithm over Zlib (gzip) when packing VMDKs. Another point is that PPMd spins really badly in terms of speed on non-textual data.

Windows 7 64-Bit system folders: Windows\SysWOW64, Program Files

Once again, Pcompress produces smaller compressed files than Pigz and Pbzip2 while performing almost the same in terms of speed. Deduplication produces benefits here as well, with very small overhead. As usual, PPMd spins badly on this primarily binary dataset.

From the previous enwik9 result and these results, one possible enhancement idea emerges. In adaptive2 mode, Pcompress checks whether a chunk is primarily text and uses PPMd for text; otherwise, for binary data, it uses LZMA. In addition, I can add detection for XML data and use Libbsc if the chunk is primarily XML text.
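That kind of per-chunk detection can be sketched roughly as follows. The thresholds and the angle-bracket test for XML are hypothetical placeholders, not the actual adaptive2 heuristics in Pcompress:

```python
# Sketch of per-chunk content detection for an adaptive mode: classify
# a chunk as binary, XML-ish text, or plain text, then pick a backend
# algorithm accordingly. All thresholds here are illustrative.
def classify_chunk(chunk: bytes) -> str:
    if not chunk:
        return "binary"
    sample = chunk[:65536]  # sampling a prefix keeps classification cheap
    printable = sum(1 for b in sample if 32 <= b < 127 or b in (9, 10, 13))
    if printable / len(sample) < 0.9:   # mostly non-printable -> binary
        return "binary"
    # crude XML signal: many angle brackets relative to chunk size
    if (sample.count(b"<") + sample.count(b">")) / len(sample) > 0.02:
        return "xml"
    return "text"

def pick_algorithm(chunk: bytes) -> str:
    """Map the detected content class to a compression backend."""
    return {"text": "ppmd", "xml": "libbsc", "binary": "lzma"}[classify_chunk(chunk)]
```

Since the classification only samples a prefix of each chunk, its cost is negligible next to the compression itself, which is what makes an adaptive mode attractive in the first place.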