Archiving | The Pseudo Random Bit Bucket

I use a Libarchive copy in Pcompress to do archiving. Pcompress prepares the list of pathnames to be archived and calls low-level Libarchive APIs to archive them, which gives fine-grained control. For example, the pathnames are sorted via fairly involved heuristics. The output from Libarchive is passed to the compressor stage. In addition, a few file-type-specific filters are applied before file data is written via the Libarchive APIs.

The libarchive output stream is split into buffers and compressed in parallel, the basic approach in Pcompress. This results in a solid compression mode that achieves a high compression ratio. However this also has a few problems:

Libarchive produces streaming archives, so archive metadata is inline within the data stream. This breaks up the actual data stream and pollutes the compression context, which slightly reduces the compression ratio.
Simply listing the archive members, without actually extracting them to disk, requires decompressing all of the data.
Similarly, extracting a single member means decompressing everything before it (which is the problem with things like tar.gz anyway).

So separating out data and metadata would be beneficial. Metadata can be kept in a separate compressed stream from the data which would allow fast access in the couple of use cases above. To achieve this, Libarchive would need to indicate to the client callbacks the type of request: data or metadata. This is tricky. The model of Libarchive is to just request some data as in:

When extracting, it invokes the callback. The callback returns a blob of data and reports its length. When Libarchive has consumed all of the blob, it requests more.
When archiving, it invokes the callback with a block of data to write and a specified length.
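The pull model described above can be sketched in a few lines. This is only a toy illustration in Python, not Libarchive's actual code: the names PullReader and make_callback are made up for this example, and an empty blob is assumed to signal end of stream.

```python
import io


class PullReader:
    """Toy consumer that pulls blobs from a client callback on demand."""

    def __init__(self, read_callback):
        self._read_callback = read_callback
        self._blob = b""
        self._pos = 0

    def read(self, n):
        out = bytearray()
        while len(out) < n:
            if self._pos == len(self._blob):
                # Current blob fully consumed: ask the client for the next one.
                self._blob = self._read_callback()
                self._pos = 0
                if not self._blob:
                    break  # an empty blob is treated as end of stream
            take = min(n - len(out), len(self._blob) - self._pos)
            out += self._blob[self._pos:self._pos + take]
            self._pos += take
        return bytes(out)


def make_callback(stream, blob_size):
    """Build a client callback that hands out blobs of at most blob_size bytes."""
    calls = []

    def callback():
        blob = stream.read(blob_size)
        calls.append(len(blob))  # record the blob length for inspection
        return blob

    return callback, calls


callback, calls = make_callback(io.BytesIO(b"header|file-data|header|more"), 8)
reader = PullReader(callback)
first = reader.read(6)
rest = reader.read(100)
print(first, rest, calls)
```

Note that the callback has no idea whether the bytes it hands over will be parsed as a header or as file contents, which is exactly the limitation discussed next.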

There is no differentiation between data and metadata, as the formats are typically streaming formats. Pcompress uses PAX. The immediate thought that arises is to have Libarchive indicate via a flag in the archive structure whether the request is for data or metadata. This works fine when archiving, since blocks of data and metadata are written separately. The client callback can make an API call on the archive structure to determine whether data or metadata is being written. The trouble arises during extraction.

This simple technique of using a flag within Libarchive is not sufficient when extracting an archive. Looking into the archive_read code, one can see a filter structure that keeps track of the current buffer passed from the client callback and of the internal cache, in order to implement a virtually zero-copy architecture. The filter structures can be cascaded if there are multiple filters in the chain. The root filter is the NONE filter. See archive_read_open1() in archive_read.c. Once all data in the client buffer is consumed, the callback is invoked again to request more. So, if data and metadata are stored separately, the client callback has to stitch them together and re-create the original archive stream, which then has to be passed back to Libarchive. This is complicated and very expensive in practice. In particular, the buffer copying required would defeat Libarchive’s zero-copy approach.

Eventually, the solution I landed upon was to introduce a secondary filter structure accompanying each normal filter struct. I call this the shadow filter. It is identical to the main filter struct and is only initialized when metadata streaming is being done. Libarchive can then use this shadow filter structure to keep track of the separate metadata stream. Whenever a metadata request is made, the client can return a metadata buffer, which is tagged onto the shadow filter. Data requests are handled via the normal filter structure.
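To make the shadow filter idea concrete, here is a minimal Python sketch. It is not the code in the Pcompress copy of Libarchive: Filter, Reader, read_header, read_data and the is_metadata callback argument are hypothetical names for illustration, and the real filter struct holds much more state. The point is only that each stream keeps its own buffer and position, so the client never has to stitch the two streams back together.

```python
class Filter:
    """Tracks the current client buffer and how much of it has been consumed."""

    def __init__(self):
        self.buffer = b""
        self.pos = 0

    def avail(self):
        return len(self.buffer) - self.pos


class Reader:
    def __init__(self, client_callback, metadata_streaming=False):
        self.client_callback = client_callback
        self.main = Filter()
        # The shadow filter exists only when metadata streaming is enabled.
        self.shadow = Filter() if metadata_streaming else None

    def _consume(self, n, want_metadata):
        use_shadow = want_metadata and self.shadow is not None
        flt = self.shadow if use_shadow else self.main
        out = bytearray()
        while len(out) < n:
            if flt.avail() == 0:
                # Tell the client which kind of buffer is being requested.
                flt.buffer = self.client_callback(use_shadow)
                flt.pos = 0
                if not flt.buffer:
                    break
            take = min(n - len(out), flt.avail())
            out += flt.buffer[flt.pos:flt.pos + take]
            flt.pos += take
        return bytes(out)

    def read_header(self, n):
        return self._consume(n, want_metadata=True)

    def read_data(self, n):
        return self._consume(n, want_metadata=False)


def make_client(meta_chunks, data_chunks):
    """Client side: two independent streams, selected by the request type."""
    meta, data = iter(meta_chunks), iter(data_chunks)

    def callback(is_metadata):
        return next(meta if is_metadata else data, b"")

    return callback


reader = Reader(make_client([b"HDR1HDR2"], [b"aaaab", b"bbb"]),
                metadata_streaming=True)
entries = []
for _ in range(2):
    header = reader.read_header(4)
    body = reader.read_data(4)
    entries.append((header, body))
print(entries)
```

Header and data reads interleave freely here even though the two buffers arrive in unrelated chunk sizes, and no buffer is ever copied into a merged stream.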

The changes needed in Libarchive to get this working were, in fact, smaller than I anticipated and are all inside the Libarchive copy in Pcompress trunk (changeset 1, changeset 2). A couple of obvious new API calls are needed: archive_set_metadata_streaming() and archive_request_is_metadata().

The current Pcompress trunk packs metadata into 3MB chunks and marks them as metadata chunks. These chunks are compressed using the Delta2 filter and Bzip2 compression. During extraction, it opens two handles to the archive file. One handle is used by the metadata thread to read metadata chunks and skip data chunks while the main thread does vice versa. This of course precludes pipe-mode streaming operation.
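The two-handle arrangement can be illustrated with a toy container. The chunk header layout below (one type byte plus a four-byte length) is invented for this sketch and is not the Pcompress file format. The Delta2 filter is omitted, zlib merely stands in for whatever compresses the data chunks, and in-memory buffers stand in for two handles on the same file.

```python
import bz2
import io
import struct
import zlib

# Hypothetical chunk header: 1-byte chunk type, 4-byte payload length.
HEADER = struct.Struct("<BI")
DATA, META = 0, 1


def write_chunk(out, kind, payload):
    out.write(HEADER.pack(kind, len(payload)))
    out.write(payload)


def read_chunks(handle, wanted):
    """Yield payloads of the wanted type and seek past all other chunks."""
    while True:
        raw = handle.read(HEADER.size)
        if len(raw) < HEADER.size:
            return
        kind, length = HEADER.unpack(raw)
        if kind == wanted:
            yield handle.read(length)
        else:
            handle.seek(length, io.SEEK_CUR)  # skip without decompressing


archive = io.BytesIO()
write_chunk(archive, DATA, zlib.compress(b"A" * 1000))
write_chunk(archive, META, bz2.compress(b"file1.txt\nfile2.txt\n"))
write_chunk(archive, DATA, zlib.compress(b"B" * 1000))
blob = archive.getvalue()

# Two independent handles on the same archive, one per reader.
meta_handle, data_handle = io.BytesIO(blob), io.BytesIO(blob)
listing = b"".join(bz2.decompress(c) for c in read_chunks(meta_handle, META))
data = b"".join(zlib.decompress(c) for c in read_chunks(data_handle, DATA))
print(listing.decode().split())
```

Skipping relies on seeking, which is why this scheme cannot work when the archive arrives through a pipe.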

Currently, listing of archive contents is extremely fast as Pcompress has to just decompress the metadata chunks and return that to Libarchive. The data sections are simply skipped. I am still working on a mechanism to do optimized selective extraction. For this I will have to store additional metadata to indicate which data chunk holds the start of the file contents so that the data decompression thread can quickly skip forward to that one.

So what’s the difference in terms of compression ratio and timings, with and without metadata streaming? Using the 10GB compression benchmark from Matt Mahoney, here are the results (83435 archive members) on a late 2013 MacBook Pro:

With metadata streaming:
- pcompress -a -l14 -s60m -t5 10gb 10gb.pz -> archive size: 2936436327 bytes
Without metadata streaming (notice the -T flag):
- pcompress -a -l14 -s60m -t5 -T 10gb 10gb.pz -> archive size: 2939247636 bytes
Time to list archive contents with metadata stream:
- ./pcompress -i 10gb.pz 0.90s user 0.10s system 84% cpu 1.184 total
Time to list archive contents without metadata stream:
- ./pcompress -i 10gb.pz 1106.16s user 28.63s system 346% cpu 5:27.38 total

So compression done using inline metadata (without metadata streaming) produces an archive that is about 2.8MB (2,811,309 bytes), or roughly 0.1%, larger. It is a small amount given the archive size, but a difference nevertheless. Larger relative benefits should be visible for archives having a large number of small files. However, the really massive benefit is in the total time to list archive contents: 1.2 seconds versus about 5.5 minutes. It really is a no-brainer.
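For anyone who wants to check the arithmetic, the figures above can be recomputed directly from the reported sizes and wall-clock times. The variable names are just for this snippet.

```python
# Archive sizes in bytes, as reported above.
with_streaming = 2936436327
without_streaming = 2939247636

diff_bytes = without_streaming - with_streaming
diff_percent = 100.0 * diff_bytes / with_streaming

# Wall-clock listing times in seconds (1.184 total and 5:27.38 total).
list_with_streaming = 1.184
list_without_streaming = 5 * 60 + 27.38
speedup = list_without_streaming / list_with_streaming

print(diff_bytes, round(diff_percent, 3), round(speedup))
```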

Despite a busy personal schedule over the last two months, I have managed to work quite a bit on adding archiving features to Pcompress. Thanks to the excellent LibArchive, Pcompress can now bundle up a bunch of files into a compressed archive. This is a desirable and useful capability that was missing until now.

With the addition of archiving capability Pcompress can now perform advanced detection of file data and tweak its compression behaviour to achieve the best results. Below is a short list of features and behaviour that the github code has as of this writing:

Pcompress enumerates the file list to be archived and sorts the files by extension/name and size using an incremental merge sort to minimize memory use. This sorting groups related files together and clusters small files to achieve the best compression and deduplication behaviour. For example, see this paper, where a similar technique is discussed to improve deduplication: https://www.usenix.org/legacy/event/atc11/tech/final_files/Xia.pdf
File types are detected via extension and/or file header parsing for magic numbers. Compression buffers are split at boundaries where files change from one type to another to avoid mixing unrelated files in a single compression buffer. It helps to improve compression ratio.
More importantly, this file type detection is used to apply data-specific compression techniques more effectively, making the Adaptive modes in Pcompress extremely powerful. The following data specific algorithms are used:
- LZMA – Most binary data.
- PPMD – Most Textual data.
- Libbsc – DNA Sequences, XML/HTML etc, BMP and TIFF images.
- Dispack – Preprocess 32-bit x86 executable binaries.
- PackJPG – Reduce JPEG size by up to 40%. This is a new lossless JPEG compression technique by Matthias Stirner.
- Wavpack – Compress WAV files better than any regular compression technique. This is still a work in progress.
- Detect already compressed files and for some heavily compressed data just use LZ4 to suppress some internal headers and zero padding. This avoids wasting time trying to compress data that is already heavily compressed.
- There are other data specific filters around like MAFISC which I am looking at.
- For Dispack, 32-bit x86 executables are detected and the data buffer is then split into 32K blocks. Some approximate instruction statistics are checked to determine whether to Dispack that block.
Compression buffers are split either at hash-based or data type change based boundaries improving both compression and deduplication.
LibArchive is used as the backend archiving library whose output is passed to the buffering, deduplication and compression stages in a logical pipeline. Synchronization is kept simple by using semaphores. LibArchive runs in a single thread and the data fetch from archiver to compression is also done at a single point. Thus there is exactly one producer and one consumer. This simplifies synchronization.
To the extent possible data copying is avoided. LibArchive’s callback routines are used to copy data directly into the compression buffers without resorting to pipes and such.
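As a rough illustration of the sorting and buffer-splitting ideas in the list above, here is a small Python sketch. It is not the Pcompress heuristic, which is more involved and uses an incremental merge sort. The sort key (extension, then size, then name), the use of the extension as a stand-in for the detected file type, and the names sort_key and split_by_type are all assumptions made for this example.

```python
import os
from itertools import groupby


def sort_key(entry):
    """Order by extension first, then size, then name (an assumed ordering)."""
    path, size = entry
    name = os.path.basename(path)
    ext = os.path.splitext(name)[1].lower()
    return (ext, size, name)


def split_by_type(entries, max_buffer):
    """Group sorted files into buffers that never mix two file types."""
    buffers = []
    ordered = sorted(entries, key=sort_key)
    for ext, group in groupby(ordered, key=lambda e: sort_key(e)[0]):
        current, used = [], 0
        for path, size in group:
            if current and used + size > max_buffer:
                buffers.append((ext, current))  # buffer full: start a new one
                current, used = [], 0
            current.append(path)
            used += size
        if current:
            buffers.append((ext, current))  # type boundary: close the buffer
    return buffers


entries = [
    ("docs/b.txt", 300),
    ("img/x.jpg", 900),
    ("docs/a.txt", 100),
    ("img/y.jpg", 200),
    ("notes.txt", 700),
]
buffers = split_by_type(entries, max_buffer=1000)
for ext, paths in buffers:
    print(ext, paths)
```

Small files of the same type end up next to each other in one buffer, while a change of type always starts a fresh buffer.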

The filters like Wavpack and PackJPG need to work with LibArchive. However LibArchive does not support using external filter routines so it took a while to work out how to have external file filters pipelined before LibArchive. Note that since Pcompress uses a custom file format and consumes the output of LibArchive, there is no need for strict compatibility with standard archiver formats like Tar, Pax, Cpio etc. LibArchive for its own requirements obviously strives to attain strict conformance allowing no user-defined headers. So one of the big problems was to flag which files have been processed by a custom filter. One easy way was to add an extended attribute programmatically. However LibArchive does not provide a way to delete a single attribute during extraction. There is a call to clear all attributes! One does not want internal, programmatic use attributes to be extracted to disk. I was stuck. Eventually it turned out that I could use contextual inference. A file preprocessor like PackJPG will add its own magic header to the file. Thus during archiving I can look for a JPEG header and only then pass the file through PackJPG. During extraction I can look for the PackJPG header.

However, a question arises: what if I have some PackJPG-processed files and am archiving them using Pcompress? Won’t they revert to normal JPEGs during extraction even though I do not want them to? Well, the filename extension is also checked. During archiving, normal JPEGs are filtered but their extension remains jpg or jpeg. So only files having a JPEG extension but a PackJPG header are unpacked during extraction. If you use the standalone PackJPG utility to pack your JPEGs, they will get a .pjg extension, which will be left untouched by the Pcompress filters during extraction. However, truly speaking, LibArchive needs to add a simple xattr deletion function to avoid all this juggling.
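The contextual inference described above boils down to two small predicates. The sketch below is an illustration and not the Pcompress code. The JPEG signature bytes are the standard start-of-image marker, but PACKED_MAGIC is an assumed placeholder for whatever header PackJPG writes, and requiring the JPEG extension at archive time as well is my assumption to keep the round trip symmetric.

```python
import os

JPEG_MAGIC = b"\xff\xd8\xff"   # standard JPEG start-of-image signature
PACKED_MAGIC = b"JS"           # ASSUMPTION: placeholder for the PackJPG header
JPEG_EXTS = {".jpg", ".jpeg"}


def has_jpeg_ext(name):
    return os.path.splitext(name)[1].lower() in JPEG_EXTS


def should_pack(name, head):
    """Archiving: filter only files that look like plain JPEGs."""
    return has_jpeg_ext(name) and head.startswith(JPEG_MAGIC)


def should_unpack(name, head):
    """Extraction: unpack only JPEG-named files carrying the packed header."""
    return has_jpeg_ext(name) and head.startswith(PACKED_MAGIC)


print(should_pack("photo.jpg", b"\xff\xd8\xff\xe0rest"))   # plain JPEG
print(should_unpack("photo.jpg", PACKED_MAGIC + b"rest"))  # packed by the filter
print(should_unpack("photo.pjg", PACKED_MAGIC + b"rest"))  # packed standalone
```

A file packed by the standalone utility keeps its .pjg name, so the extension check alone is enough to leave it alone on extraction.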

File types during archiving are detected by a combination of filename extension and magic header inspection. To look up filename extensions one obviously needs a hashtable. However, there is a bit of detail here. I have a predefined list of known filename extensions with their corresponding file types, so instead of using a general hash function I needed a perfect hash function. That is, the number of slots in the table equals the number of keys and each known key maps to its own slot. An unknown key is easily detected by comparing it with the key stored at the slot, or by the slot number lying outside the table range. I used the old ‘Minimal Perfect Hashing’ technique courtesy of Bob Jenkins. It works nicely for fast hashing of filename extensions.
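To show what a minimal perfect hash buys for a fixed key set, here is a toy Python construction. It is deliberately not Bob Jenkins's algorithm: it simply brute-forces a seed for a seeded FNV-1a hash until every known extension lands in its own slot. The extension list, the type labels and the function names are made up for this sketch.

```python
from itertools import count


def fnv1a(key, seed):
    """32-bit FNV-1a with the seed mixed into the offset basis."""
    h = (2166136261 ^ seed) & 0xFFFFFFFF
    for byte in key.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h


def build_table(known):
    """Search for a seed that maps the n known keys onto n distinct slots."""
    keys = list(known)
    n = len(keys)
    for seed in count():
        if len({fnv1a(k, seed) % n for k in keys}) == n:
            table = [None] * n
            for k in keys:
                table[fnv1a(k, seed) % n] = (k, known[k])
            return seed, table


def lookup(ext, seed, table):
    """One hash, one comparison: unknown keys fail the key check at the slot."""
    key, file_type = table[fnv1a(ext, seed) % len(table)]
    return file_type if key == ext else None


# Hypothetical extension list for illustration only.
KNOWN = {"jpg": "jpeg", "jpeg": "jpeg", "wav": "audio",
         "txt": "text", "xml": "markup", "exe": "binary"}
seed, table = build_table(KNOWN)
print(lookup("wav", seed, table), lookup("zip", seed, table))
```

Brute-force seed search is only practical for small key sets; for a longer extension list a proper generator is the better tool.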

The next item to do is to support multi-volume archives. This is quite easy to do since Pcompress already splits data into independent buffers, each with its own header. So a volume needs to contain a set of compressed buffers with some sequence indicator so that they can be correctly concatenated together to restore the original archive.
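Since this feature is not implemented yet, the following is only a sketch of the idea under my own assumptions: each volume carries a sequence number and the total volume count, and the compressed buffers are treated as opaque byte strings. The dictionary layout and the function names are hypothetical.

```python
def split_into_volumes(chunks, volume_size):
    """Pack whole chunks into volumes of at most volume_size bytes each."""
    groups, current, used = [], [], 0
    for chunk in chunks:
        if current and used + len(chunk) > volume_size:
            groups.append(current)
            current, used = [], 0
        current.append(chunk)  # a chunk is never split across volumes
        used += len(chunk)
    if current:
        groups.append(current)
    return [{"seq": i, "total": len(groups), "chunks": g}
            for i, g in enumerate(groups)]


def join_volumes(volumes):
    """Reorder volumes by sequence number and concatenate their chunks."""
    if not volumes:
        return []
    ordered = sorted(volumes, key=lambda v: v["seq"])
    if [v["seq"] for v in ordered] != list(range(ordered[0]["total"])):
        raise ValueError("missing or duplicate volume")
    return [chunk for v in ordered for chunk in v["chunks"]]


chunks = [b"a" * 40, b"b" * 40, b"c" * 40, b"d" * 10]
volumes = split_into_volumes(chunks, volume_size=100)
restored = join_volumes(list(reversed(volumes)))
print(len(volumes), restored == chunks)
```

Because every buffer already has its own header, the volumes only need to agree on ordering, not on any shared compression state.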


Moinakg's Ramblings


Pulling out Libarchive metadata

Pcompress gets archiving features