Inside Content-Defined chunking in Pcompress

After playing with the content-defined (or content-aware) chunking process in Pcompress and tweaking its parameters quite a bit, I decided to actually look at what kind of chunk (or block, in Pcompress terms) sizes it produces. I had never looked at this aspect before and was curious to see whether there is any abnormal skewness.

I had to run the chunking process over a large amount of data and capture the full list of chunk (or block) lengths it produces. A simple printf() of each length is enough. Gathering a sufficiently large dataset covering a wide range of file types was more challenging! I eventually cobbled together something from the gigabytes of data already in my ZFS pools and material I had gathered over time for testing Pcompress.

Basic Dataset Properties

  • Total size: 360 GB
  • File Types:
    • Program source code for Linux and Unix, a Linux git checkout tarball, etc.
    • Various kinds of general text, including English text
    • A 2.5GB portion of the OANC dataset
    • XML and HTML (The enwik9 and enwik8 datasets)
    • Email backups
    • Full OpenSolaris root filesystem image consisting of binaries, translations and other textual data
    • Windows SysWOW64 and Program Files contents
    • Gigabytes of Multimedia files of various formats: MP3, WAV, MKV, AVI, XVID, OGG etc.
    • Various ISO images of movies, and LiveCD OS distributions
    • Image files of various formats: BMP, GIF, PNG, JPEG etc.
    • Linux binaries, /usr/share, /var, log files, backups and other misc contents.
    • Already compressed files and tarballs using Gzip, Bzip2, Compress (.Z) etc.

As you can see, the dataset contains a wide variety of files that exercise the chunking algorithm across a wide range of data types. I ran the deduplication code with two different average chunk sizes, 4KB and 8KB, and produced a frequency distribution of the actual chunk sizes. The process went like this: capture the chunk size list -> sort the list -> count repeated occurrences of the same value -> produce the frequency distribution.
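The capture -> sort -> count pipeline above is simple enough to sketch in a few lines of Python (illustrative only; the actual capture in Pcompress is just a printf() in C):

```python
from collections import Counter

def frequency_distribution(chunk_sizes):
    """Count how often each distinct chunk size occurs and
    return (size, frequency) pairs sorted by chunk size."""
    counts = Counter(chunk_sizes)
    return sorted(counts.items())

# Hypothetical captured lengths, just to show the shape of the output.
sizes = [4096, 4100, 4096, 8192, 4100, 4096]
print(frequency_distribution(sizes))
# [(4096, 3), (4100, 2), (8192, 1)]
```

Counter does the sort-and-count steps in one pass; the sorted pairs are exactly what gets plotted below.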

The plots below show the results. Since content-defined chunking produces variable-length chunks, the actual average chunk size is typically close to the desired value but rarely exact. The graphs below plot frequency on the left axis and chunk size on the right, both on log scales. The point where the frequency line and the chunk size line cross is the average value.

As you can see, smaller chunk sizes have higher frequencies, with the count dropping off quite rapidly and the variability/jitter in frequencies increasing quite rapidly as we get to larger chunk sizes. I made a small tweak to the algorithm today to use a fixed mask value for the rolling-hash break pattern, which appears to produce more consistent average chunk sizes across a range of desired sizes. The minimum-size check is there anyway to ensure that the desired average chunk size is reached.
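To make the fixed-mask break pattern and the minimum-size check concrete, here is a minimal sketch of content-defined chunking in Python. This is illustrative only: Pcompress's actual rolling hash and parameters differ, and the window size, mask, and limits below are assumptions, not the real values. The fingerprint here is a CRC32 of a small sliding window rather than a true rolling hash, purely for brevity.

```python
import os
import zlib

WINDOW = 16          # fingerprint window in bytes (assumed)
MASK = 0x0FFF        # fixed break mask: ~1-in-4096 boundary probability
MIN_CHUNK = 2048     # minimum chunk size, enforced before testing the mask
MAX_CHUNK = 65536    # hard ceiling so a chunk cannot grow without bound

def chunk_lengths(data):
    """Split data at content-defined boundaries; return chunk lengths."""
    lengths = []
    start = 0
    for i in range(len(data)):
        length = i - start + 1
        if length < MIN_CHUNK:
            continue                      # minimum-size check comes first
        fp = zlib.crc32(data[max(0, i - WINDOW + 1):i + 1])
        # Declare a boundary when the fingerprint's low bits match the
        # fixed mask, or when the chunk reaches the maximum size.
        if (fp & MASK) == MASK or length >= MAX_CHUNK:
            lengths.append(length)
            start = i + 1
    if start < len(data):
        lengths.append(len(data) - start)
    return lengths

sizes = chunk_lengths(os.urandom(1 << 20))   # 1 MB of random data
print(sum(sizes) / len(sizes))               # average lands in the few-KB range
```

Because the boundary test only starts after MIN_CHUNK bytes, the expected chunk size is roughly MIN_CHUNK plus the reciprocal of the mask probability, which is why the observed average tracks the desired value without being exact.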

The graph below shows the frequency distribution of chunks grouped into 6 distinct power-of-two bins: 4KB, 8KB, 16KB, 32KB, 64KB, 128KB.

As you can see, the peak is at around 4KB, which is the desired average size in this case. Next come the plots for an 8KB desired average chunk size.
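The power-of-two grouping used for that graph can be sketched as follows. The exact bin-edge convention is an assumption on my part (each size goes into the smallest bin that can hold it, with the last bin catching everything larger); the post's plotting script may bin differently.

```python
def bin_by_power_of_two(sizes, bins=(4096, 8192, 16384, 32768, 65536, 131072)):
    """Group chunk sizes into power-of-two bins: each size is counted in
    the smallest bin >= the size; oversized chunks fall into the last bin."""
    counts = {b: 0 for b in bins}
    for s in sizes:
        for b in bins:
            if s <= b:
                counts[b] += 1
                break
        else:
            counts[bins[-1]] += 1       # larger than every bin edge
    return counts

print(bin_by_power_of_two([3000, 5000, 9000, 70000, 200000]))
```

Plotting the six resulting counts gives the binned histogram shown above.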

These plots look similar to the ones for 4KB chunks and show similar patterns. The detailed frequency distribution plot for 8KB is strikingly similar to the one for 4KB chunks. Also note how the frequency drops off rapidly with larger chunk sizes. I am tempted to theorize that the algorithm produces more small chunks in regions of high change and fewer, larger ones in uniform zones. The last few tweaks I made brought the average chunk size closer to the given value and improved the dedupe ratio, which would seem to support the hypothesis. However, one needs to look at some of the data around the small and large chunks to actually check. The final tweaks and changes will be rolled into a 1.0.1 release I will make in a while.

It is interesting to note that with this, Pcompress now produces smaller compressed files than the graphs I put out in some of the previous posts show. All of this now appears to be in fair shape for the next item on my table: Global Deduplication.
