# Architecture for a Deduplicated Archival Store: Part 2

In the previous post on this topic I had put down my thoughts around the requirements I am looking at. In this post I will jot down some detailed notes around the design of the on-disk data store format that I am thinking of.

The Archival Chunk Store

From the most basic viewpoint we have data streams which are split into variable length chunks. After deduplication these chunks can be references to other chunks in the same dataset or chunks in other datasets. So we need to have metadata that identifies the dataset (like name, timestamp, length etc.) and then have a list of pointers to data chunks. This is not much different to a traditional file system which has inodes storing metadata and then pointers to blocks/pages on disk. It is conceptually simple to consider a single data block to have multiple references. It is intuitive. However additional metadata is needed to maintain information like reference counts.

The key difference between a file system and content-defined deduplication storage is that in the former all the blocks are of fixed length and potentially grouped into allocation units. In the latter chunks are of variable length. So we need additional metadata giving chunk lengths, and on-disk storage requires a second layer of disk block allocation data. Software like OpenDedup has implemented FUSE-based file systems, however it only deals with the simpler fixed-length chunking approach and offers primary storage dedupe.

I do not need a full file system route since I am not dealing with primary storage in this case and it also avoids a lot of complexity. There are existing file systems like OpenDedup, LiveDFS, Lessfs and scale-out approaches like Ceph, Tahoe-LAFS etc. where the scalable, variable-chunked dedupe features will be useful, but that is something for later. So I am thinking of storing the data chunks in files that I will call extents, along with the minimum additional metadata in separate metadata extents. The following diagram is a schematic of my approach to storing the chunks on disk.

The following are the characteristics that follow from this schematic:

• A Dataset is identified by some metadata and a sequence of extents in a linked list.
• Each extent is a collection of segments. Extents are essentially numbered files.
• Each segment is a collection of variable-length data chunks.
• Each extent stores segment data and metadata in separate files. A naming convention is used to associate extent metadata and corresponding data files.
• Each extent can contain a fixed maximum number of segments. I am considering up to 2048 segments per extent. Incoming segments are appended to the last extent in the dataset till it fills up and a new extent is allocated.
• Notice that a separate extent metadata section is not required. An extent is just a file.
• The scalable Segmented Similarity based Deduplication is being used here. Each segment contains up to 2048 variable-length chunks. So with 4KB chunk size, each segment is 8MB in size.
• Segment metadata consists of a chunk count, chunk hashes and offsets. The chunk size is not stored. Instead it can be computed by subtracting current chunk’s offset from the next chunk’s offset. Since a 64-bit segment offset is stored the chunk offsets can be relative to it and only need to be 32-bit values.
• The Similarity Index contains similarity hashes that point to segments within the extents. So the pointer has to be the extent number followed by the segment offset within the extent metadata file. Incoming segments from a new datastream are chunked, their similarity hashes computed and then approximate-match segments are looked up in the index.
• Segment data is compressed before storing in the segment. So segment entries in the data extent are of variable length.
• Each segment entry in the metadata extent can also be of variable length since the number of chunks can be less than the maximum. However segment entries in the metadata extent are added when an entry is made in the index, so the exact offset can be recorded.
• Similarly a segment entry in the metadata extent needs to point to the offset of the segment data in the data extent. However since segments are compressed later in parallel and stored into the extent, the metadata entries are updated later once the segment data is appended. Keeping segment data in a separate data extent allows this parallel processing while still allowing similarity matches to be processed from the metadata extent.
• Duplicate chunk references are maintained in the metadata extents. A duplicate reference consists of the extent number, segment offset in the compressed file and chunk number within the segment.
• The index is obviously persistent on disk but is loaded in memory in its entirety when doing lookups. Any insertion into the index is written immediately onto the disk. I’d obviously have to use a NoSQL key-value store for this. I am currently interested in Hamsterdb.
• Keeping a separate metadata extent allows staging metadata on a separate high-performance storage media like flash to reduce access latency.
• It is possible to store reference counts at the segment level within the index for the purpose of capping number of references to “popular” chunks. This can reduce dedupe ratio since not all chunks will have reached the max reference count. However the advantage of this is it avoids storing and updating reference counts in scattered records in extent files which in turn avoids some random I/O during data ingestion. Each segment has 25 similarity indicators representing different portions of the segment. So all 25 indicators should have reached the maximum reference count to completely remove the entire segment from consideration.
• The entire segment is compressed and stored instead of per-chunk compression. This provides better compression ratio but is also an overhead especially if we just have to retrieve one chunk from a referenced segment. However due to data locality in backups most similar segments will have several chunks in common. In addition the fast LZ4 compression algorithm and caching of uncompressed segments should provide for low overheads. This is something that I have to test in practice.
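
The segment metadata layout described in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: the field order, the little-endian packing, the 32-byte hash length and the trailing end offset (used to derive the size of the last chunk) are my own choices for the sketch, and the names `pack_segment_meta` and `unpack_segment_meta` are hypothetical.

```python
import struct

HASH_LEN = 32  # assumed: 256-bit chunk hashes

def pack_segment_meta(segment_offset, chunk_hashes, chunk_offsets, segment_len):
    """Serialize one segment entry: 64-bit segment offset, 32-bit chunk count,
    then (hash, 32-bit relative offset) per chunk, then a 32-bit end offset."""
    out = struct.pack("<QI", segment_offset, len(chunk_hashes))
    for digest, offset in zip(chunk_hashes, chunk_offsets):
        out += digest + struct.pack("<I", offset)
    # Assumption: a final offset equal to the segment length is stored so the
    # size of the last chunk can also be derived by subtraction.
    out += struct.pack("<I", segment_len)
    return out

def unpack_segment_meta(buf):
    """Parse an entry and derive chunk sizes from adjacent offsets."""
    segment_offset, count = struct.unpack_from("<QI", buf, 0)
    pos = struct.calcsize("<QI")
    hashes, offsets = [], []
    for _ in range(count):
        hashes.append(buf[pos:pos + HASH_LEN])
        pos += HASH_LEN
        offsets.append(struct.unpack_from("<I", buf, pos)[0])
        pos += 4
    end = struct.unpack_from("<I", buf, pos)[0]
    # Chunk size = next chunk's offset minus this chunk's offset.
    sizes = [nxt - cur for cur, nxt in zip(offsets, offsets[1:] + [end])]
    return segment_offset, hashes, offsets, sizes
```

With this layout each chunk costs 36 bytes of metadata, and no explicit length field is needed.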

Supporting Deletion and Forward Referencing

Deleting datasets means deleting all the extents that belong to it. However this is easier said than done because the extent may have segments which contain chunks which are referred to by other extents. So we cannot simply delete. There are two ways to support effective deletion.

The first approach is to load the segments one by one from the extents and conditionally store them into a new file. First the segment’s similarity indicators are re-computed and looked up in the index. This will give us the reference count associated with the similarity indicator along with the segment it points to. If the indicator points to another segment then its reference count is decremented. Otherwise if the associated reference count is zero, it is first removed from the index. If the reference count is zero for all similarity indicators of the segment or all its similarity indicators point to other segments then the segment is not stored into the new file. However a seek is performed on the target file to sparsely extend it. This preserves the relative offsets of the segments which need to be retained.
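
The offset-preserving rewrite can be sketched as below. This is a minimal illustration, not the planned implementation: the function name `rewrite_extent`, the `(offset, length)` segment list and the `keep` callback are hypothetical, and the index lookups that decide whether a segment is retained are left out entirely. Whether the skipped ranges really occupy no disk space depends on the underlying file system.

```python
import os

def rewrite_extent(src_path, dst_path, segments, keep):
    """Copy retained segments into dst at their original offsets.

    segments: list of (offset, length) pairs in ascending offset order.
    keep:     callable taking a segment number, returning True to retain it.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        end = 0
        for number, (offset, length) in enumerate(segments):
            end = max(end, offset + length)
            if not keep(number):
                continue              # the skipped range is left as a hole
            src.seek(offset)
            dst.seek(offset)          # seeking past EOF extends the file on write
            dst.write(src.read(length))
        dst.truncate(end)             # keep the original length if the tail was dropped
```

Because every retained segment stays at its old offset, references held by other extents remain valid without any rewriting.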

Second approach is dependent on a technique called Forward Referencing. In this incoming data is stored as-is. If new chunks are duplicate to older chunks then the older chunk entries are updated to point to the new chunks. This means that deletion can be simply performed on the oldest dataset without any further checks as all references will be to newer chunks. I will need to apply the constraint that intermediate datasets cannot be deleted. The big advantage of Forward Referencing is that it speeds up restore times a lot because the latest dataset is typically the one that you want to restore and it is stored as whole and read sequentially. However Forward Referencing requires post-process deduplication in order to be performant and avoid too much random I/O during backup for example. Also technically it precludes source side dedupe as the data has to appear wholly on the backup store.
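
To make the idea concrete, here is a toy in-memory model of Forward Referencing. It is purely a sketch: the class `ForwardRefStore`, its entry tuples and the `holder` map are hypothetical names, whole chunks are passed in already split, and nothing here reflects an on-disk format.

```python
import hashlib

class ForwardRefStore:
    """Datasets are lists of entries: ("data", bytes) or ("ref", dataset_id, position)."""

    def __init__(self):
        self.datasets = {}   # dataset_id -> list of entries, oldest first
        self.holder = {}     # chunk hash -> (dataset_id, position) holding the bytes

    def ingest(self, dataset_id, chunks):
        entries = []
        self.datasets[dataset_id] = entries
        for pos, chunk in enumerate(chunks):
            digest = hashlib.sha256(chunk).digest()
            old = self.holder.get(digest)
            if old is not None:
                # Turn the older copy into a reference to the new copy.
                old_id, old_pos = old
                self.datasets[old_id][old_pos] = ("ref", dataset_id, pos)
            entries.append(("data", chunk))   # new data is always stored as-is
            self.holder[digest] = (dataset_id, pos)

    def read(self, dataset_id):
        out = []
        for entry in self.datasets[dataset_id]:
            while entry[0] == "ref":          # follow references forward in time
                entry = self.datasets[entry[1]][entry[2]]
            out.append(entry[1])
        return b"".join(out)

    def delete_oldest(self):
        oldest = next(iter(self.datasets))    # dicts preserve insertion order
        del self.datasets[oldest]
        self.holder = {h: loc for h, loc in self.holder.items() if loc[0] != oldest}
```

Since references only ever point to newer datasets, `delete_oldest` needs no reference checks, while the newest dataset is read without following any references to older ones.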

The third approach combines the above two approaches. Inline dedupe is done and then a post-process optimization pass can be kicked off to re-organize the data to a forward referenced layout. This requires temporary extra metadata space to record a log of all references per referenced extent so that we can invert the references an extent at a time. This can be somewhat tricky to get right.

At present I am looking at the first approach and intend to explore the third optimization technique at a later date.

# Requirements

Pcompress as it stands today is a powerful single-file lossless compression program that applies a variety of compression and data deduplication algorithms to effectively reduce the dataset size. However as far as data deduplication goes it can only apply the algorithms to a single dataset to remove internal duplicates. What is more useful is to be able to apply deduplication to remove common blocks across datasets to achieve even greater savings especially in backup scenarios. This is why we see a slew of products in this space boasting of up to 90% reduction in backup storage requirements.

In the open source space we have filesystems like OpenDedup, Lessfs, S3QL, ZFS etc that provide deduplication even for primary online storage. While that is a desirable feature in itself, these software lack many of the advanced features of commercial products like Sepaton, HP StoreOnce or EMC DataDomain. Pcompress implements a bunch of those advanced algorithms today (I am writing a couple of papers on this) so it makes sense to extend the software into a proper scalable archival store for backup requirements. In this topic it is worthwhile to take note of eXdupe which provides archival deduplicated backup capabilities but it is quite simplistic providing only differential storage against a single initial backup dataset. It is much like a full backup followed by incremental backups. Just that there is no real multi-file dedupe. One can only dedupe the latest backup data against the first non-differential backup data. It is not a scalable chunk store that can chunk any incoming dataset and store only the unique chunks.

If we look at open source backup software like Amanda or Bacula, none of them have block-level dedupe capability, let alone sliding-window variable block chunking. So, in a nutshell, we can summarize the requirements as follows:

1. A Deduplicated, Scalable Chunk Store that stores unique chunks and provides fast read access.
2. The Chunk Store is meant for backups and archival storage and assumes immutable chunks. I am not looking at online primary storage in this case. However the system should support deletion of old datasets.
3. It should be able to do inline dedupe. With inline dedupe we can do source side dedupe reducing the amount of backup data transferred over the network.
4. Pcompress can potentially utilize all the cores on the system and this archival store should be no different.
5. Metadata overhead should be kept to a minimum and I will be using the Segmented similarity based indexing to use a global index that can fit in RAM.
6. Data and Metadata should be kept separate such that metadata can be located on high-speed storage like SSDs to speed up access. While this increases the number of multiple separate disk accesses during restore, the effect can be reduced by locality sensitive caching in addition to SSDs.
7. The system should of course be able to scale to petabytes.
8. It should be possible to integrate the system with existing backup software like Amanda, Bacula etc. This is needed if we want to do source-side dedupe.
9. There should be a chunk reference count with a max limit to avoid too many datasets referencing the same chunk. The loss of a multiple referenced chunk can corrupt multiple backups. Having an upper limit reduces the risk. In addition we need replication but that is not in my charter at this time. Filesystem replication/distribution can be used for the purpose. Software like DRBD can also be used.
10. Another feature is to limit deduplication to the last X backup sets, much like a sliding window. This allows cleanly removing really old backups and prevents recent backups from referencing chunks in that old data.
11. All this applies to archival storage on disk. Deduping backups onto tape is a different can of worms that I will probably look at later.
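
Requirement 9 above, the capped reference count, can be illustrated with a small sketch. The class `CappedChunkStore` and its fields are hypothetical, and the policy shown (store a fresh copy once the newest copy reaches the cap) is one possible reading of the requirement rather than a settled design.

```python
import hashlib

class CappedChunkStore:
    def __init__(self, max_refs):
        self.max_refs = max_refs
        self.copies = {}   # chunk hash -> list of [chunk_id, refcount]
        self.chunks = []   # chunk_id -> bytes

    def put(self, data):
        """Store or reference a chunk and return the chunk id to record."""
        digest = hashlib.sha256(data).digest()
        copies = self.copies.setdefault(digest, [])
        if copies and copies[-1][1] < self.max_refs:
            copies[-1][1] += 1            # still below the cap: just add a reference
            return copies[-1][0]
        # No copy yet, or the newest copy is at the cap: store a fresh copy.
        self.chunks.append(data)
        chunk_id = len(self.chunks) - 1
        copies.append([chunk_id, 1])
        return chunk_id
```

With `max_refs=2`, five writes of the same chunk end up as three stored copies, so losing any one copy damages at most two references.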

I plan to go at all these requirements in phases. For example I’d not initially look at source-side dedupe. Rather the initial focus will be to get a high-performance stable backend. If one is wondering about some of the terms used here, then look at the Wikipedia article for explanations.

# Updated Compression Benchmarks – part 3

I have added the 3rd and final set of benchmark results comparing Pcompress to two other data dedupe utilities, Lrzip and eXdupe here: http://moinakg.github.io/pcompress/results3.html. Lrzip does not do traditional dedupe of 4KB blocks or above. Rather it uses the Rzip algorithm which is derived from Rsync.

Rzip also does variable block dedupe but at much smaller sizes than 4KB. However I am not sure if Rzip can be adapted as a multi-file generalized deduplication store as the index blow-up is quite extravagant. Though it might be possible to do segmented matching and then apply Rzip across Segment data. It will require re-reading old segment data and the dedupe solution will necessarily be offline or post-process.

The observations from the results are summarized below:

• If we just do Dedupe and avoid compression of data (“Dedupe Only” result in the graphs) then Lrzip produces smaller archives. This is obvious since Pcompress does traditional Dedupe at average 4KB variable blocks while Lrzip finds matches at much smaller lengths. Exdupe cannot be compared here as it has no option to avoid compression. At high compression levels Pcompress consistently gives the fastest times. However except for the LZ4 option Pcompress produces slightly larger archives for all other algorithms when compared with Lrzip. Lrzip uses Lzo not LZ4. I tried using Lrzip to just do rzip and then compress the result with LZ4 for the CentOS tarball. I got a size of 662751240 bytes with data split into 256MB chunks. So Lrzip would have produced a smaller archive if it had integrated LZ4.
• LZ4 is a fantastic algorithm. The combination of speed and compression ratio is unparalleled.
• At fast compression levels Pcompress matches or exceeds Exdupe in speed (depending on the dataset) while producing a better compression ratio. Once again LZ4 has a big contribution to the result. Lrzip loses out handily in terms of speed but compression ratio is good.
• In general Pcompress gives some of the best combinations of compression ratio and speed.
• One of the possible reasons for the larger Exdupe file sizes can be extra metadata. Exdupe allows differential backups to be taken against an initial full backup. In order to do block-level differential backup, in other words deduplicated backup, it needs to store additional metadata for existing blocks.

Remember this is just a small system with 2 cores and 2 hyperthreads, or 4 logical cores. On systems with more cores Pcompress performance will scale appropriately.

# Pcompress 2.2 released

I decided to pull another release of Pcompress primarily due to some bugfixes that went in. One of them is a build issue on Debian6 and non-SSE4 processor and the others are a couple of crashes with invalid input.

In addition to fixing stuff I have rewritten the Min-Heap code and taken out all the Python derived stuff. It is now much simpler and much faster than before. While doing this re-write I found and fixed a problem with the earlier Min-Heap approach. Thus Delta Differencing is now faster and more accurate than before.

I also improved the scalable Segmented Global Dedupe and it now works with greater than 95% efficiency in finding duplicate chunks. It appears that using larger segments for larger dedupe block sizes results in better accuracy. If you come to think of it this is also logical since one would want faster processing with smaller indexes when using larger and larger dedupe blocks. Correspondingly larger segments enable just that.

# Updated Compression Benchmarks

Pcompress has gone through a sea of changes since the last time I ran benchmarks comparing performance and effectiveness with other utilities. So I spent several days running various benchmark scripts generating and collating a lot of results in the process.

Due to the sheer volume of the results and limited time, I took the easy way out of importing all the CSV data into Excel, formatting and charting them and exporting to HTML. The generated HTML code looks complex and messy but at least it shows up correctly in Firefox, Chrome and IE.

The first set of results can be seen here: http://moinakg.github.io/pcompress/results1.html. This is basically comparing pcompress with Segment-level and Global Deduplication to other standard utilities. It also contrasts effectiveness of Global Dedupe with Segment-level Dedupe.

The Datasets used

1. A tar of the VMDK files of installed CentOS 6.2 x86-64 version.
2. Linux 3.6 RC2 source tarball.
3. Two copies of the Silesia corpus tar concatenated together. This results in a file that is double the size of the original Silesia corpus but has 100% duplicate data.
4. A tarball of the “Program Files” directory on my 32-bit Windows 7 installation.

Some Observations

1. As is quite clear, Pcompress is both faster and more effective compared to the standard utilities tested: Gzip, Bzip2, 7za, Xz and Pxz (Parallel Xz).
2. As usual Xz performs the worst. The time graph shows a steep spike. Pxz is a lot better but is still half as slow as Pcompress. In addition remember that Pcompress has a bunch of additional processing overheads that the other utilities do not have: SHA256, BLAKE2, LZP and Delta2 processing.
3. Interestingly the LZ4 mode along with Dedupe and all the preprocessing produces results that are close to traditional Gzip while being more than twice as fast. In fact two datasets show results smaller than Gzip. This result is notable when one wants good compression done extremely fast.
4. Global Dedupe of course is more effective than Segment-level Dedupe but what is more surprising is that it is also faster overall, even though Global Dedupe requires serialized access to a central index and Segmented Dedupe is fully parallel. I can attribute three causes: my test system is low-end with constrained RAM bandwidth and conflicts arising from parallel access; Segment-level dedupe also uses memcmp() while Global Dedupe does not; Global Dedupe reduces data further resulting in lesser work for the final compression algorithm.
5. The concatenated Silesia corpus with 100% duplicate data of course shows the maximum benefit from Global Dedupe that removes long-range redundancies in data.
6. In some cases compression levels 9 and 14 show marginally lesser compression than level 6. This appears to be because of LZP side-effects. At higher levels, LZP parameters are tweaked to work more aggressively so it may be taking out a little too much redundancy that affects the compression algorithm’s effectiveness. This is something that I will have to tweak going forward.

I will be posting more results soon and will include a comparison with Lrzip that uses an improved Rzip implementation to take out long-range redundancies in data at a finer granularity compared to 4KB variable-block Deduplication.

# Pcompress 2.1 released with fixes and performance enhancements

A couple of the key things are improvement in Global Dedupe accuracy and ability to set the dedupe block hash independent of the data verification hash. From a conservative viewpoint the default block hash is set to the proven SHA256. This however can be changed via an environment variable called ‘PCOMPRESS_CHUNK_HASH_GLOBAL’. SKEIN is one of the alternatives supported for this. SKEIN is a solid NIST SHA3 finalist with a good amount of cryptanalysis done and no practical weakness found. It is also faster than SHA256. These choices give a massive margin of safety against random hash collisions and unexpected data corruptions considering that other commercial and open-source dedupe offerings tend to use weaker options like SHA1 (collision attack found, see below), Tiger24 or even the non-cryptographic Murmur3-128! All this for the sake of performance. Albeit some of them did not have too many choices at the time development started on those products. In addition even with a collision attack it is still impractical to get a working exploit for a dedupe storage engine that uses SHA1 like say Data Domain, and corrupt stored data.

The Segmented Global Dedupe algorithm used for scalability now gives around 95% of the data reduction efficiency of simple full chunk index based dedupe.

# Pcompress 2.0 with Global Deduplication

The last few weeks I have been heads down busy with a multitude of things at work and in personal life with hardly any time for anything else. One of the biggest items that kept me busy during my spare times has of course been the release of Pcompress 2.0.

This release brings to fruition some of the hobby research work I had been doing around scalable deduplication of very large datasets. Pcompress 2.0 includes support for Global Deduplication which eliminates duplicate chunks across the entire dataset or file. Pcompress already had support for Data Deduplication but it removed duplicates only within a segment of the data. The larger the segment size, the more effective is the deduplication. This mode is very fast since there is no central index and no serialization. However dedupe effectiveness gets limited.

Global Deduplication introduces a central in-memory index for looking up chunk hashes. Data is first split into fixed-size or variable-length Rabin chunks as usual. Each 4KB (or larger) chunk of data has an associated 256-bit or larger cryptographic checksum (SHA256, BLAKE2 etc.). These hashes are looked up and inserted into a central hashtable. If a chunk hash entry is already present in the hashtable then the chunk is considered a duplicate and a reference to the existing chunk is inserted into the datastream. This is a simple full chunk index based exact deduplication approach which is very effective using 4KB chunk sizes. However there is a problem.
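
The full chunk index approach can be sketched in a few lines. This is only an illustration and not Pcompress code: it uses fixed-size chunks for brevity, a Python dict as the central hashtable, and the names `dedupe_stream` and `restore` are hypothetical.

```python
import hashlib

def dedupe_stream(data, chunk_size=4096):
    """Return (unique_chunks, recipe); recipe lists chunk positions in stream order."""
    index = {}            # SHA-256 digest -> position in unique_chunks
    unique_chunks = []
    recipe = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        pos = index.get(digest)
        if pos is None:
            # First time this hash is seen: store the chunk and index it.
            pos = len(unique_chunks)
            index[digest] = pos
            unique_chunks.append(chunk)
        recipe.append(pos)   # duplicates become references to the stored chunk
    return unique_chunks, recipe

def restore(unique_chunks, recipe):
    """Rebuild the original stream from the unique chunks and the recipe."""
    return b"".join(unique_chunks[i] for i in recipe)
```

Every chunk costs one index entry here, which is what makes the index grow in step with the amount of unique data.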

The size of a full chunk index grows rapidly with the dataset. If we are looking at 4KB chunks then we get 268435456 chunks for 1TB of data. Each chunk entry in the hashtable needs to have the 256-bit checksum, a 64-bit file offset and a 32-bit length value. So total size of the index entries is approx 11GB for unique data not considering the additional overheads of the hashtable structure. So if we consider hundreds of terabytes then the index is too big to fit in memory. In fact the index becomes so big that it becomes very costly to lookup chunk hashes slowing the dedupe process to a crawl. Virtually all commercial dedupe products do not even use 4KB chunks. The minimum is 8KB used in Data Domain with most other products using chunk sizes much larger than that. Larger chunk sizes reduce the index size but also reduce dedupe effectiveness.
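
The arithmetic behind those numbers can be checked with a short calculation. The helper name `full_index_bytes` is hypothetical, sizes are taken in binary units, and hashtable overhead is ignored as in the text.

```python
TIB = 1 << 40
chunk_size = 4096
# 256-bit checksum + 64-bit file offset + 32-bit length = 44 bytes per entry
entry_bytes = 256 // 8 + 64 // 8 + 32 // 8

def full_index_bytes(dataset_bytes):
    """Index entry bytes for a dataset made entirely of unique chunks."""
    return (dataset_bytes // chunk_size) * entry_bytes

chunks = TIB // chunk_size
print(chunks)                               # 268435456
print(full_index_bytes(TIB) / (1 << 30))    # 11.0
```

At 100TB the same formula gives roughly 1.1TB of index entries, which is why a full index stops being practical at that scale.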

One of the ways of scaling Data Deduplication to petascale is to look at similarity matching techniques that can determine regions of data that are approximately similar to each other and then compare their cryptographic chunk hashes to actually eliminate exact matching chunks. A variant of this technique uses Delta Differencing instead of hash matching to eliminate redundancy at the byte level. However I was interested in the former.

Pcompress 2.0 includes two approaches to Global Deduplication. If a simple full chunk index can fit into 75% of available free RAM then it is used directly. This is fast and most effective at eliminating duplicates. By default 4KB chunks are used and it gives good performance even with chunks this small. This is lower than what most other commercial or open-source dedupe products recommend or offer. Once file sizes start becoming larger and the index size overflows the memory limit then Pcompress automatically switches to Segmented Similarity Based Deduplication.

In Segmented Similarity mode data is split into 4KB (or larger) chunks as usual (Variable-length Rabin or Fixed-block). Then groups of 2048 chunks are collected to form a segment. With 4KB chunks this results in an average segment size of 8MB. The list of cryptographic chunk hashes for the segment is stored in a temporary segment file. Then these cryptographic chunk hashes are analysed to produce 25 similarity hashes. Each similarity hash is essentially a 64-bit CRC of a min-value entry. These hashes are then inserted or looked up in a central index. If another segment is found that matches at least one of the 25 similarity hashes then that segment is considered approximately similar to the current segment. Its chunk hash list is then memory mapped into the process address space and exact crypto hash based chunk matching is done to perform the actual deduplication.
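
A rough sketch of the similarity step follows. Several details are assumptions made for illustration only: the chunk hash list is split positionally into 25 portions, the minimum hash of each portion is taken as the min-value entry, and an 8-byte BLAKE2b digest stands in for the 64-bit CRC because the Python standard library has no CRC64. The names `similarity_hashes` and `find_similar` are hypothetical.

```python
import hashlib

NUM_SIM = 25

def similarity_hashes(chunk_hashes, count=NUM_SIM):
    """Reduce the minimum chunk hash of each portion of a segment to 64 bits."""
    n = len(chunk_hashes)
    sims = []
    for i in range(count):
        portion = chunk_hashes[i * n // count:(i + 1) * n // count]
        if not portion:
            continue                      # segment has fewer chunks than portions
        smallest = min(portion)
        # Stand-in for the 64-bit CRC described in the text.
        sims.append(hashlib.blake2b(smallest, digest_size=8).digest())
    return sims

def find_similar(index, segment_id, sims):
    """Return ids of earlier segments sharing a similarity hash, then
    register this segment's hashes in the index."""
    matches = []
    for sim in sims:
        other = index.get(sim)
        if other is not None and other not in matches:
            matches.append(other)
        index.setdefault(sim, segment_id)
    return matches
```

Only the segments returned by `find_similar` need their full chunk hash lists loaded for exact matching, which is what keeps the central index small.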

This approach results in an index size that is approximately 0.0023% of the dataset size. So Pcompress will require up to a 25GB index to deduplicate 1PB of data. That is assuming 100% random 1PB data with no duplicates. In practice the index will be smaller. This approach provides >90% dedupe efficiency of using a full chunk index while providing high scalability. Even though disk I/O is not completely avoided, it requires one disk write and only a few disk reads for every 2048 chunks. To balance performance and predictable behaviour, the write is synced to disk after every few segments. Using mmap(), instead of a read, helps performance and the disk offsets to be mmap-ed are sorted in ascending order to reduce random access to the segment chunk list file. This file is always written to at the end and extended but existing data is never modified. So it is ideal to place it on a Solid State drive to get a very good performance boost. Finally, access to the central index is coordinated by the threads cooperating using a set of semaphores allowing for lock-free access to critical sections. See: https://moinakg.wordpress.com/2013/03/26/coordinated-parallelism-using-semaphores/
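
The index size figure can be reproduced with a quick calculation. It assumes, for illustration, that each index entry costs only the 8 bytes of its similarity hash and that sizes are binary units; real entries would also need a pointer to the segment, so treat this as a lower bound.

```python
segment_bytes = 2048 * 4096      # 2048 chunks of 4KB: 8MB average segment
sim_bytes = 25 * 8               # 25 similarity hashes of 64 bits each

ratio = sim_bytes / segment_bytes
index_for_1pb = (1 << 50) // segment_bytes * sim_bytes

print(f"{ratio * 100:.4f}%")          # 0.0024%
print(index_for_1pb / (1 << 30))      # 25.0
```

The ratio rounds to about 0.0024%, in line with the roughly 0.0023% quoted above, and 1PB of fully unique data comes out at 25GB of similarity hashes.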

I had been working out the details of this approach for quite a while now and Pcompress 2.0 contains the practical implementation of it. In addition to this Pcompress now includes two additional streaming modes. When compressing a file the output file can be specified as ‘-’ to stream the compressed data to stdout. Decompression can take the input file as ‘-’ to read compressed data from stdin.

Global Deduplication in Pcompress together with streaming modes and with help from utilities like Netcat or Ncat can be used to optimize network transfer of large datasets. Eventually I intend to implement proper WAN Optimization capabilities in a later release.

# Adding GPU processing to Pcompress

GPGPUs provide an intriguing opportunity to speed up some aspects of Pcompress. Typically GPUs represent a large cluster of ALUs with access to a few different types of high-speed memory on the board. GPUs are typically suited for highly-parallel workloads, especially the class of problems that can be termed embarrassingly parallel. An example is Monte-Carlo simulations. However many otherwise serial algorithms or logic can be converted into parallel forms with a little bit of effort.

There are a few places within Pcompress where GPUs can be of use:

1. Parallel hashing. I have already implemented a Merkle-style parallel hashing but the approach currently uses only 4 threads via OpenMP. This is only used when compressing an entire file in a single segment which is essentially a single-thread operation with some operations like hashing, HMAC (and multithread LZMA) parallelized via different approaches. With GPUs parallel hashing can be used in all cases, but there is a slight problem. Normally parallel hashing produces different hash values as compared to the serial version so I need to work out a way where the same underlying hashing approach is used in both serial and parallel cases so identical results are produced. If one uses GPUs to generate data checksums on one machine it cannot be assumed that every machine where the data is extracted back will have a GPU! Changes to the hashing approach will make current archives incompatible with future versions of Pcompress so current code paths will have to be retained for backward compatibility.
2. Using AES on GPU. It is possible to speed up AES on the GPU, especially with the CTR mode that I am using. There is a GPU Gems article on this.
3. Parallel data chunking for deduplication. This is possible but more complex to implement than the previous two items. There is a research paper on a system called Shredder that provides an approach to do data deduplication chunking on the GPU. My approach to chunking is quite novel and different than what is described in the Shredder paper. So I have to do some work here.
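
The compatibility concern in item 1 can be shown with a small two-level hash sketch. This is not the Pcompress implementation: SHA-256, the 1MB block size, the thread pool and the names `serial_hash` and `merkle_hash` are all assumptions chosen for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def serial_hash(data):
    """Plain single-pass hash of the whole buffer."""
    return hashlib.sha256(data).digest()

def merkle_hash(data, block_size=1 << 20, workers=4):
    """Hash fixed-size blocks independently, then hash the concatenated digests."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        leaves = list(pool.map(lambda block: hashlib.sha256(block).digest(), blocks))
    return hashlib.sha256(b"".join(leaves)).digest()
```

The two-level result differs from the plain serial hash, but it depends only on the block size and not on the number of workers. So if the same two-level construction is used everywhere, a machine without a GPU can verify the checksum with a single worker.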

There are a few issues to deal with when programming GPGPUs other than the initial high learning curve:

1. GPUs are devices that sit on the PCI bus, so data needs to be transferred to and fro. This is the biggest stumbling block when dealing with GPUs. The computation to be performed must be large enough to offset the cost of data transfer. There are ways to hide the latency: performing one compute while transferring the data for the next computation, using pinned memory in the host computer’s RAM to speed up data transfer, and transferring large blocks of data in one shot as opposed to many small transfers. The biggest gain comes from pipelining computation stages and overlapping compute and data transfer.
2. Code on the GPU runs in an execution context that has hundreds of hardware threads each of which runs the same code path but works on a different slice of data in memory. This is essentially Single Instruction Multiple Data Model (Nvidia calls it SIMT). The access to data by the different threads needs to be ordered or, in other words, be adjacent in a range to get maximum throughput. This is the coalesced access requirement. This is becoming less of an issue as GPGPUs evolve and newer improved devices come to the market.
3. Need to use a form of explicit caching via shared memory. This is again improving by the introduction of L1/L2 caches in newer GPGPUs like the Nvidia Tesla C2XXX series.
4. Having to worry about Thread block and grid sizing. Some libraries like Thrust handle sizing internally and provide a high-level external API.

Pcompress has to remain modular. It needs to detect the presence of GPUs in a system and optionally allow using them. Since I will be using CUDA, it needs to depend on the presence of CUDA and the Nvidia accelerated drivers as well.

Finally, the big question is how all of this will scale. Will using GPUs allow faster processing in Pcompress as compared to modern Sandy Bridge and Piledriver CPUs with vector units? Only experimentation will tell.

# Pcompress 1.3 released

I have put up a new release of Pcompress on the Google Code download page. This release focusses primarily on performance enhancements across the board and a few bug fixes. The changes are summarized below:

1. Deduplication performance has improved by at least 2.5X as a result of a variety of tweaks to the core chunking code.
• One of the interesting changes is to use a 16-byte SSE register as the sliding window since I am using a window size of 16. This avoids a lot of memory accesses but requires SSE4.
• The perf utility allowed me to see that using the window position counter as a context variable causes a spurious memory store for every byte! Using a local variable allows optimization via a register. This optimization affects the situation where we do not have SSE4.
• Compute the full fingerprint only when at least minimum chunk length bytes have been consumed.
2. Delta Compression performance and effectiveness have both been improved. I have tweaked the minhash approach to avoid storing and using fingerprints. That approach was causing memory write amplification and significant slowdown. Instead I am just treating the raw data as a sequence of 64-bit integers and heapifying them. Bsdiff performance has been improved along with RLE encoding. I also tweaked the matching approach. It now checks for similar blocks that are some distance apart depending on the compression algorithm. This causes long-range similar blocks to be delta-ed, eventually helping the overall compression.
3. One of the big changes is the inclusion of the BLAKE2 checksum algorithm and making it the default. BLAKE2 is one of the highest-performing cryptographic checksums, exceeding even MD5 in performance on 64-bit platforms. It is derived from BLAKE, one of the NIST SHA-3 runners-up with a large security margin.
4. I have tweaked Yann Collet’s xxHash algorithm (non-cryptographic hash) to vectorize it and make it work with 32-byte blocks. Performance is improved for both vectorized and non-vectorized versions. I have covered this in detail in my previous post: Vectorizing xxHash for fun and profit.
5. I have tweaked the AES CTR mode implementation to vectorize it. CTR mode encrypts a 16-byte block consisting of a 64-bit nonce or salt value and a 64-bit block counter value concatenated together. This is then XOR-ed with 16 bytes of plaintext to generate 16 bytes of ciphertext. The block counter is then incremented and the process repeated. This XOR handling of 16 bytes can be nicely done in an XMM register. The result is faster even when using unaligned SSE2 loads, helped a little by data prefetch instructions.
6. Apart from BLAKE2 I also included Intel’s optimized SHA512 implementation for x86 processors and moved to using SHA512/256. This improves SHA2 performance significantly on x86 platforms.
7. BLAKE2 includes a parallel mode. I also included simple 2-way parallel modes for other hashes including KECCAK when compressing a single file in a single chunk. This is essentially a single-threaded operation so other forms of parallelism need to be employed.
8. With all the vectorization being thrown around with SSE2/3/4 and AVX1/2 versions of various stuff, I have also added runtime CPU feature detection to invoke the appropriate version for the CPU. At least SSE2 capability is assumed. At this point I really have no intention of supporting Pentium and Atom processors! This also requires one to use at least the GCC 4.4 compiler so that things like SSE4.2 and AVX intrinsics can be compiled even if CPU support for them is not available.
9. In addition to all the above some bug fixes have also gone into this release.
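
The CTR mode flow described in point 5 can be sketched in a few lines. This is only an illustration of the structure, not the Pcompress code: the function names are hypothetical, the big-endian layout of the counter block is an assumption, and since the Python standard library has no AES, a keyed BLAKE2b truncated to 16 bytes stands in for the AES block encryption. The vectorized version does the 16-byte XOR in a single XMM register instead of byte by byte.

```python
import hashlib
import struct

BLOCK = 16  # bytes per counter block and per keystream block


def keystream_block(key, nonce, counter):
    # Counter block: 64-bit nonce followed by 64-bit block counter (16 bytes).
    # Big-endian packing is an assumption made for this sketch.
    counter_block = struct.pack(">QQ", nonce, counter)
    # STAND-IN for AES block encryption, NOT AES: keyed BLAKE2b, 16-byte output.
    return hashlib.blake2b(counter_block, key=key, digest_size=BLOCK).digest()


def ctr_xor(key, nonce, data, counter=0):
    """Encrypt or decrypt: XOR data with the keystream, 16 bytes at a time."""
    out = bytearray()
    for off in range(0, len(data), BLOCK):
        ks = keystream_block(key, nonce, counter)
        part = data[off:off + BLOCK]
        out.extend(p ^ k for p, k in zip(part, ks))
        counter += 1  # next block uses the next counter value
    return bytes(out)


key = b"k" * 32
msg = b"CTR mode turns a block cipher into a stream cipher."
ct = ctr_xor(key, 42, msg)
assert ctr_xor(key, 42, ct) == msg  # the same operation decrypts
```

Because each keystream block depends only on the nonce and the counter, blocks are independent of each other, which is what makes the mode easy to vectorize and parallelize.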

However this is in no way the full gamut of optimizations possible. There are more changes to be done. For example I need to add support for optimized AES GCM mode. This is a block cipher mode of operation which combines encryption and authentication, avoiding the need for a separate HMAC. HMAC is still useful for situations where one may want to authenticate but not encrypt. Deduplication performance can be further improved by at least 2X. The current chunking code has a silly oversight. HMAC needs to support parallel modes. I also need to enable parallel operation for LZP in single-chunk modes. In addition I want to explore use of GPGPUs and CUDA for hashing, chunking etc.

# Inside Content Defined Chunking in Pcompress – Part 2

After cooling off in the hills for a few days, curiosity got the better of me. I discussed in my previous post on chunk size distribution from the rolling hash algorithm that there is a potential trend that can be inferred. Regions having greater variability produce more smaller chunks while regions of uniformity produce fewer larger ones.

I wanted to do some more analysis to see whether there is indeed a trend, short of looking at actual data samples. A trend or correlation will indicate that the rolling hash fingerprint based chunking algorithm is working nicely with desirable properties. Now the question is: how do we define uniformity or, conversely, variability of data? I considered two ways of looking at it:

1. Spans or run-lengths of identical byte values. Fewer longer spans will indicate low variability and vice versa.
2. Zones where same pattern repeats. That is uniform repeating patterns.

Now point #2 is not suitable for our case since repeating patterns can generate repeating break-points in the rolling hash giving rise to many small chunks depending on the repeating pattern size. So from the point of view of our rolling hash this is a form of repeating variability rather than uniformity. This leaves point #1. Long spans of identical byte values will have less chance of triggering break-points since we check for a few least significant bits being zero as our break-point. Spans of zero will of course trigger many break-points but we can ignore that for this analysis as our dataset is quite varied.
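
To make the break-point test concrete, here is a minimal content-defined chunker. It is a sketch and not the Pcompress implementation: the multiplier, the minimum and maximum chunk sizes and the plain polynomial rolling hash are all assumptions chosen for illustration. Only the 16-byte window, the roughly 4KB average and the "low bits are zero" check come from the posts.

```python
WINDOW = 16                  # sliding window size in bytes
PRIME = 153191               # hypothetical multiplier for the polynomial hash
MOD = 1 << 32                # keep the hash in 32 bits
MASK = (1 << 12) - 1         # 12 low bits zero: a break-point about every 4KB
MIN_CHUNK = 2048             # hypothetical minimum chunk length
MAX_CHUNK = 65536            # hypothetical maximum chunk length


def chunk(data):
    """Split data into content-defined chunks and return them as a list."""
    out_factor = pow(PRIME, WINDOW, MOD)  # weight of the byte leaving the window
    chunks = []
    start = 0   # offset where the current chunk begins
    h = 0       # rolling hash over the last WINDOW bytes of the current chunk
    for i, b in enumerate(data):
        h = (h * PRIME + b) % MOD
        if i - start >= WINDOW:
            # Slide the window: drop the contribution of the oldest byte.
            h = (h - data[i - WINDOW] * out_factor) % MOD
        length = i - start + 1
        at_break = length >= MIN_CHUNK and (h & MASK) == 0
        if at_break or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


# A span of zeros hashes to zero everywhere, so it breaks at every opportunity,
# which here means as soon as the minimum chunk length is reached.
print([len(c) for c in chunk(bytes(10000))])  # [2048, 2048, 2048, 2048, 1808]
```

For any other constant byte value the hash also stops changing once the window is full, so a long span either never hits the break-point condition (and runs to the maximum chunk size) or hits it continuously, depending on the value.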

I used an initial 24GB portion of the earlier 360GB dataset since I will be generating a large number of data points. This initial portion also contains a wide variety of textual, binary (Linux and Windows) and media files. I decided to split the data into 1MB segments and chunk them with an average of 4KB chunk size. I also computed the following properties per segment (of course in a single thread):

• Variability Index
• Ratio of total size of small chunks : total segment size (in bytes)
• Ratio of total size of large chunks : total segment size (in bytes)

Where the following hold:

$\text{Variability Index} = \text{Number of RLE spans} / \text{Segment Size}$

$\text{Small chunk}: \text{Chunk Size} < 16\text{KB}$

$\text{Large chunk}: \text{Chunk Size} \ge 16\text{KB}$

I somewhat arbitrarily chose the partition value of 16KB keeping in mind average chunk size of 4KB. Eventually all the three values above are ratios and can be directly compared. I also computed the average chunk size per segment and plotted it separately.
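
The three per-segment values are simple to compute. The sketch below shows one way to do it. The function names are hypothetical, and it assumes the chunk sizes for a segment are already available as a list of byte counts.

```python
from itertools import groupby

PARTITION = 16 * 1024  # the 16KB boundary between small and large chunks


def variability_index(segment):
    """Number of RLE spans (runs of identical bytes) divided by segment size."""
    spans = sum(1 for _ in groupby(segment))
    return spans / len(segment)


def chunk_ratios(chunk_sizes):
    """Return (small ratio, large ratio) for one segment's chunk sizes."""
    total = sum(chunk_sizes)  # the chunks cover the whole segment
    small = sum(size for size in chunk_sizes if size < PARTITION)
    return small / total, (total - small) / total


# b"aaabbc" has three spans (aaa, bb, c) in six bytes.
print(variability_index(b"aaabbc"))                # 0.5
# Three small chunks (16384 bytes) and one large chunk (16384 bytes).
print(chunk_ratios([4096, 4096, 16384, 8192]))     # (0.5, 0.5)
```

Since every chunk falls on one side of the partition or the other, the two chunk ratios of a segment always add up to 1.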

Chunk Size Distribution vs Data Variability

Splitting approx 24GB of data into approx 1MB sized segments produced more than 23400 data items. The chunking algorithm in Pcompress splits segments at a Rabin chunk boundary so that every segment gets a set of complete content-defined chunks. This causes a slightly variable segment size.

Since the number of points is very large, the lines in the line graph merge together and look like an area graph. Looking at the graphs for a few moments yields the view that there is indeed a correlation between data variability and chunk size distribution. Regions of higher variability do indeed tend to produce smaller chunks and vice versa. Now let’s zoom into the interesting middle portion of the Distribution graph to see some more detail (more than 11000 points):

The correlation between the three lines is more clearly visible here, though it is not a 100% fit. But then, is anything perfect in life and this universe? So our requirement of greater data variability producing smaller chunks and vice versa holds in general but does not hold all the time. We can do some quick statistical analysis to get a better idea. First thing to do is to add a couple of polynomial trend lines.

We can clearly see that the total size of large chunks tends to fall as data variability rises, and that the trend is a strong one. This is exactly what we desire. So over a large amount of data localized variations even out and general trends are observed. Finally I also computed Pearson’s correlation coefficients:

$\text{COEFF}(\text{RLE Ratio}, \text{Small Chunk Ratio}) = 0.819695$

$\text{COEFF}(\text{RLE Ratio}, \text{Large Chunk Ratio}) = -0.81969$
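
For reference, Pearson’s coefficient is the covariance of the two series divided by the product of their standard deviations. A minimal sketch follows. The function name is hypothetical and the sample values are made up purely to exercise the code; they are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up per-segment values, for illustration only.
variability = [0.10, 0.25, 0.40, 0.55, 0.70]
small_ratio = [0.30, 0.45, 0.50, 0.80, 0.85]
large_ratio = [1 - s for s in small_ratio]  # the two ratios sum to 1

r_small = pearson(variability, small_ratio)
r_large = pearson(variability, large_ratio)
assert abs(r_small + r_large) < 1e-9  # equal magnitude, opposite sign
```

Because the small and large chunk ratios of a segment sum to 1, the two coefficients are necessarily equal in magnitude and opposite in sign, which is why the two values above mirror each other.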

So combined with these values and all the previous analysis we can say that clearly there exists a strong correlation of the form which is desirable. Pcompress appears to have a good chunking algorithm unless, of course, someone more experienced than me in this aspect points out shortcomings or mistakes.