
Pcompress gets archiving features

Despite a busy personal schedule over the last two months, I have managed to work quite a bit on adding archiving features to Pcompress. Thanks to the excellent LibArchive, Pcompress can now bundle a set of files into a single compressed archive. This is a desirable and useful capability that had been missing until now.

With the addition of archiving capability, Pcompress can now perform advanced detection of file data and tweak its compression behaviour to achieve the best results. Below is a short list of the features and behaviour that the GitHub code has as of this writing:

  1. Pcompress enumerates the list of files to be archived and sorts them by extension/name and size, using an incremental merge sort to minimize memory use. This sorting groups related files together and clusters small files, which gives the best compression and deduplication behaviour (see the first sketch after this list). A similar technique for improving deduplication is discussed in this paper: https://www.usenix.org/legacy/event/atc11/tech/final_files/Xia.pdf
  2. File types are detected via extension and/or by parsing file headers for magic numbers. Compression buffers are split at boundaries where files change from one type to another, to avoid mixing unrelated files in a single compression buffer. This helps improve the compression ratio.
  3. More importantly, this file type detection is used to apply data-specific compression techniques more effectively, making the Adaptive modes in Pcompress extremely powerful. The following data-specific algorithms are used:
    • LZMA – Most binary data.
    • PPMd – Most textual data.
    • Libbsc – DNA sequences, XML/HTML etc., BMP and TIFF images.
    • Dispack – Preprocess 32-bit x86 executable binaries.
    • PackJPG – Reduce JPEG size by up to 40%. This is a new lossless JPEG compression technique by Matthias Stirner.
    • Wavpack – Compress WAV files better than regular general-purpose compression. This is still a work in progress.
    • Already-compressed files are detected, and for some heavily compressed data LZ4 is simply used to squeeze out internal headers and zero padding. This avoids wasting time trying to compress data that is already heavily compressed.
    • There are other data-specific filters around, like MAFISC, which I am looking at.
    • For Dispack, 32-bit x86 executables are detected and the data buffer is then split into 32K blocks. Some approximate instruction statistics are checked to determine whether to Dispack each block.
  4. Compression buffers are split at either hash-based boundaries or data-type-change boundaries, improving both compression and deduplication.
  5. LibArchive is used as the backend archiving library, and its output is passed to the buffering, deduplication and compression stages in a logical pipeline. LibArchive runs in a single thread and the data fetch from the archiver to the compression stage also happens at a single point, so there is exactly one producer and exactly one consumer. This keeps synchronization simple, using plain semaphores (see the second sketch after this list).
  6. To the extent possible, data copying is avoided: LibArchive’s callback routines write data directly into the compression buffers without resorting to pipes and the like.
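To make the grouping in item 1 concrete, here is a minimal C sketch of sorting a file list by extension and then by size. The struct and function names are illustrative, and qsort() stands in for the incremental merge sort that Pcompress actually uses to keep memory use low.

```c
/*
 * Minimal sketch of grouping an archive's file list by extension and
 * size before compression. Illustrative names; qsort() stands in for
 * the incremental merge sort used in the real code.
 */
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

typedef struct {
    char     path[4096];
    char     ext[16];      /* lower-cased filename extension, "" if none */
    uint64_t size;         /* file size in bytes */
} file_entry_t;

static int
cmp_entries(const void *a, const void *b)
{
    const file_entry_t *fa = a, *fb = b;
    int c = strcmp(fa->ext, fb->ext);   /* group related file types together */

    if (c != 0)
        return (c);
    if (fa->size < fb->size)            /* then cluster by size so that small */
        return (-1);                    /* files end up adjacent */
    return (fa->size > fb->size);
}

void
sort_file_list(file_entry_t *list, size_t nfiles)
{
    qsort(list, nfiles, sizeof (*list), cmp_entries);
}
```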
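And here is a minimal sketch of the single-producer/single-consumer handoff mentioned in item 5, using two POSIX semaphores over one shared buffer. The buffer layout and the fill_from_libarchive()/compress_buffer() helpers are hypothetical placeholders, not the actual Pcompress routines.

```c
/*
 * Single producer (archiver thread) and single consumer (compression
 * stage) coordinated with two POSIX semaphores.
 *
 * Initialization, done once by the caller:
 *   sem_init(&buf_free, 0, 1);     buffer starts out available
 *   sem_init(&buf_filled, 0, 0);   nothing to consume yet
 */
#include <semaphore.h>
#include <stddef.h>

size_t fill_from_libarchive(unsigned char *buf, size_t cap);   /* hypothetical */
void   compress_buffer(const unsigned char *buf, size_t len);  /* hypothetical */

static sem_t buf_free;                  /* posted when the buffer may be refilled */
static sem_t buf_filled;                /* posted when the buffer holds new data */
static unsigned char shared_buf[1 << 20];
static size_t shared_len;

void *
archiver_thread(void *arg)              /* producer: LibArchive side */
{
    (void) arg;
    for (;;) {
        sem_wait(&buf_free);            /* wait until the compressor is done */
        shared_len = fill_from_libarchive(shared_buf, sizeof (shared_buf));
        sem_post(&buf_filled);          /* hand the buffer to the compressor */
        if (shared_len == 0)
            break;                      /* end of the archive stream */
    }
    return (NULL);
}

void *
compressor_thread(void *arg)            /* consumer: compression stage */
{
    (void) arg;
    for (;;) {
        sem_wait(&buf_filled);
        if (shared_len == 0)
            break;                      /* producer signalled end of stream */
        compress_buffer(shared_buf, shared_len);
        sem_post(&buf_free);            /* allow the next fill */
    }
    return (NULL);
}
```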

Filters like Wavpack and PackJPG need to work alongside LibArchive. However, LibArchive does not support plugging in external filter routines, so it took a while to work out how to pipeline external file filters before LibArchive. Note that since Pcompress uses a custom file format and consumes the output of LibArchive, there is no need for strict compatibility with standard archiver formats like Tar, Pax, Cpio etc. LibArchive itself, for its own requirements, obviously strives for strict conformance and allows no user-defined headers.

So one of the big problems was flagging which files had been processed by a custom filter. One easy way would have been to add an extended attribute programmatically, but LibArchive does not provide a way to delete a single attribute during extraction; there is only a call to clear all attributes! One does not want internal, programmatic-use attributes to be extracted to disk, so I was stuck. Eventually it turned out that I could use contextual inference: a file preprocessor like PackJPG adds its own magic header to the file. So during archiving I can look for a JPEG header and only then pass the file through PackJPG, and during extraction I can look for the PackJPG header.

However, the question arises: what if I already have some PackJPG-processed files and am archiving them with Pcompress? Won’t they be reverted to normal JPEGs during extraction even though I do not want that? Well, the filename extension is also checked. During archiving, normal JPEGs are filtered but their extension remains jpg or jpeg, so only files that have a JPEG extension yet carry a PackJPG header are unpacked during extraction. If you use the standalone PackJPG utility to pack your JPEGs, they will get a .pjg extension and will be left untouched by the Pcompress filters during extraction. Truth be told, though, LibArchive just needs to add a simple single-xattr deletion function to avoid all this jugglery.
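To illustrate this contextual inference, here is a rough C sketch of the header and extension checks. The JPEG Start-Of-Image marker (0xFF 0xD8) is standard; the PJG_MAGIC bytes are an assumed value for illustration only, and the function names are mine, not Pcompress's.

```c
/*
 * Header/extension checks for the PackJPG filter decision.
 * JPEG_SOI is the standard JPEG marker; PJG_MAGIC is assumed.
 */
#include <string.h>
#include <stddef.h>

static const unsigned char JPEG_SOI[2]  = { 0xFF, 0xD8 };  /* standard JPEG marker */
static const unsigned char PJG_MAGIC[2] = { 'J', 'S' };    /* assumed PackJPG magic */

/* Archiving: run a file through PackJPG only if it really is a JPEG. */
int
should_pack(const unsigned char *hdr, size_t len)
{
    return (len >= 2 && memcmp(hdr, JPEG_SOI, 2) == 0);
}

/*
 * Extraction: unpack only files that kept a JPEG extension but now carry
 * a PackJPG header, i.e. files that were filtered during archiving.
 * Standalone .pjg files fail the extension test and are left alone.
 */
int
should_unpack(const char *ext, const unsigned char *hdr, size_t len)
{
    return ((strcmp(ext, "jpg") == 0 || strcmp(ext, "jpeg") == 0) &&
        len >= 2 && memcmp(hdr, PJG_MAGIC, 2) == 0);
}
```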

During archiving, file types are detected by a combination of filename extension and magic-header inspection. To look up filename extensions one obviously needs a hash table, but there is a bit of detail here. I have a predefined list of known filename extensions with their corresponding file types, so instead of a general hash function I wanted a perfect hash function: the number of slots in the table equals the number of keys, and each known key maps to exactly one slot. An unknown key is easily detected either because its slot number lies outside the table range or because it does not match the key stored at that slot. I used the old ‘Minimal Perfect Hashing’ technique courtesy of Bob Jenkins, which works nicely for fast hashing of filename extensions.
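Here is a small sketch of what that lookup looks like. The table contents are illustrative, and phash_ext() is a placeholder for the perfect hash function generated offline from the fixed key list (which is where Jenkins' technique comes in); only the lookup-and-verify logic is shown.

```c
/*
 * Perfect-hash extension lookup: one slot per known key, unknown keys
 * rejected by a range check plus a single string comparison.
 */
#include <string.h>

typedef enum { TYPE_UNKNOWN, TYPE_TEXT, TYPE_JPEG, TYPE_WAV, TYPE_EXE } ftype_t;

typedef struct {
    const char *ext;
    ftype_t     type;
} ext_entry_t;

/* The table is exactly as large as the key set (minimal perfect hash). */
static const ext_entry_t ext_table[] = {
    { "txt",  TYPE_TEXT },
    { "jpg",  TYPE_JPEG },
    { "jpeg", TYPE_JPEG },
    { "wav",  TYPE_WAV  },
    { "exe",  TYPE_EXE  },
};
#define NUM_EXTS  (sizeof (ext_table) / sizeof (ext_table[0]))

unsigned int phash_ext(const char *ext);  /* generated offline; placeholder here */

ftype_t
lookup_ext(const char *ext)
{
    unsigned int slot = phash_ext(ext);

    /*
     * An unknown key either hashes outside the table or lands on a slot
     * holding a different key, so one comparison settles it.
     */
    if (slot >= NUM_EXTS || strcmp(ext_table[slot].ext, ext) != 0)
        return (TYPE_UNKNOWN);
    return (ext_table[slot].type);
}
```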

The next item on the list is support for multi-volume archives. This should be quite easy since Pcompress already splits data into independent buffers, each with its own header. A volume then just needs to hold a set of compressed buffers along with a sequence indicator, so that the volumes can be correctly concatenated to restore the original archive.
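As a sketch of one possible approach (not Pcompress's actual on-disk format), each compressed buffer could carry a header like the following, where a global sequence number lets the extractor reassemble buffers in order regardless of which volume they landed in.

```c
/* Illustrative per-buffer header for multi-volume support. */
#include <stdint.h>

typedef struct {
    uint64_t seq;       /* global buffer sequence number across all volumes */
    uint32_t volume;    /* which volume this buffer was written to */
    uint8_t  algo;      /* compression algorithm used for this buffer */
    uint64_t clen;      /* compressed length of the buffer payload */
    uint64_t ulen;      /* uncompressed length, for allocating the output */
} buf_hdr_t;
```

On extraction, the buffers gathered from all volumes would simply be ordered by seq before being decompressed and concatenated.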


Compression Benchmarks #2

As a follow-up to my previous Compression Benchmarks, here is a second set of benchmarks comparing Pcompress with existing parallel utilities and showing the impact of Deduplication.

Note: I am using very large chunks in these tests in order to let Deduplication, with an average block size of 4KB, find enough duplicate data. Pcompress does not yet have a Global Deduplication facility, so I am using this approach. I will follow up with a later post showing test results with smaller chunks of, say, 1MB – 8MB.

The datasets: this time I used the Linux 3.6 tarball and the CentOS VMDK tarball from the earlier tests. In addition, I created a tarball containing Windows 7 64-bit system files, namely the contents of the “Windows\SysWOW64” and “Program Files” directories. This time I did not include a graph bar for the original size, in order to highlight the compressed-size differences clearly.

Linux 3.6 Git clone Tarball

As you can see, on this dataset Pcompress is slightly slower than Pigz and Pbzip2 but offers a somewhat better compression ratio, especially with LZ-Prediction and Deduplication. In addition, one needs to keep in mind that Pcompress provides far stronger data integrity verification via the default SKEIN-256 message digest, which takes more cycles to compute than the simple 32-bit checksums used by Pigz and Pbzip2.

The total dataset size is 466MB, so using a 256MB chunk size produces two chunks, one of 256MB and one of 210MB, and the total time taken is essentially the time taken for the larger chunk. Deduplication has only a marginal impact on this dataset, as there are very few duplicate blocks of 4KB average size within the Linux kernel source tree. The main benefit comes from LZ-Prediction.

Tarball of CentOS 6 and CentOS 6.2 VMDK Images

Notice that the Libbsc algorithm provides the best results on this dataset (aside from LZMA, which is not included here; see my previous post). However, Libbsc is missing from the 1GB-chunk results because a memory inefficiency in the implementation causes it to gobble too much RAM and exceed my system’s 8GB budget. I will fix this in an upcoming release.

Two items of interest jump out to anyone looking closely at the graphs. First, Bzip2 performs better than Zlib (gzip) on this dataset in all respects, whether we use the parallel utilities (pigz, pbzip2) or Pcompress. Secondly, Pcompress produces much better compression than the parallel utilities; both Deduplication and LZP preprocessing add real value. Of course, it is well known that VMDK images dedupe well.

So the properties of VMDKs lend themselves well to Bzip2, both in terms of speed and compression ratio; consider using the bzip2 algorithm over zlib (gzip) when packing VMDKs. Another point is that PPMd spins really badly, speed-wise, on non-textual data.

Windows 7 64-Bit system folders: Windows\SysWOW64, Program Files

Once again Pcompress produces smaller compressed files than pigz and pbzip2 while performing almost the same in terms of speed. Deduplication brings benefits here as well, with very small overhead. As usual, PPMd spins badly on this primarily binary dataset.

From the previous enwik9 result and these results, one possible enhancement idea emerges. In adaptive2 mode Pcompress checks whether a chunk is primarily text and uses PPMd for text; otherwise, for binary data, it uses LZMA. In addition to this, I can add detection for XML data and use Libbsc if the chunk is primarily XML text.
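A rough sketch of such a chunk classifier might look like the following. The thresholds and the markup test are illustrative guesses, not Pcompress's actual adaptive2 heuristics.

```c
/*
 * Route a chunk to LZMA (binary), PPMd (plain text) or Libbsc (markup)
 * based on simple byte statistics. Thresholds are illustrative only.
 */
#include <ctype.h>
#include <stddef.h>

typedef enum { ALGO_LZMA, ALGO_PPMD, ALGO_LIBBSC } algo_t;

algo_t
pick_algo(const unsigned char *chunk, size_t len)
{
    size_t printable = 0, tags = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = chunk[i];

        if (isprint(c) || c == '\n' || c == '\r' || c == '\t')
            printable++;
        if (c == '<')
            tags++;
    }

    if (printable * 10 < len * 9)   /* less than ~90% printable: binary */
        return (ALGO_LZMA);
    if (tags * 100 > len)           /* more than ~1% '<' bytes: markup */
        return (ALGO_LIBBSC);
    return (ALGO_PPMD);             /* plain text */
}
```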