Tag Archives: Pcompress

Posts related to the Pcompress utility.

Architecture for a Deduplicated Archival Store: Part 1


Pcompress as it stands today is a powerful single-file lossless compression program that applies a variety of compression and data deduplication algorithms to effectively reduce the dataset size. However as far as data deduplication goes it can only apply the algorithms to a single dataset to remove internal duplicates. What is more useful is to be able to apply deduplication to remove common blocks across datasets to achieve even greater savings especially in backup scenarios. This is why we see a slew of products in this space boasting of upto 90% reduction in backup storage requirements.

In the open source space we have filesystems like OpenDedup, Lessfs, S3QL, ZFS etc that provide deduplication even for primary online storage. While that is a desirable feature in itself, these software lack many of the advanced features of commercial products like Sepaton, HP StoreOnce or EMC DataDomain. Pcompress implements a bunch of those advanced algorithms today (I am writing a couple of papers on this) so it makes sense to extend the software into a proper scalable archival store for backup requirements. In this topic it is worthwhile to take note of eXdupe which provides archival deduplicated backup capabilities but it is quite simplistic providing only differential storage against a single initial backup dataset. It is much like a full backup followed by incremental backups. Just that there is no real multi-file dedupe. One can only dedupe the latest backup data against the first non-differential backup data. It is not a scalable chunk store that can chunk any incoming dataset and store only the unique chunks.

If we look at open source backup software like Amanda or Bacula, none of them have block-level dedupe capability, leave alone sliding-window variable block chunking. So, in a nutshell, we can summarize the requirements as follows:

  1. A Deduplicated, Scalable Chunk Store that stores unique chunks and provides fast read access.
  2. The Chunk Store is meant for backups and archival storage and assumes immutable chunks. I am not looking at online primary storage in this case. However the system should support deletion of old datasets.
  3. It should be able to do inline dedupe. With inline dedupe we can do source side dedupe reducing the amount of backup data transferred over the network.
  4. Pcompress can potentially utilize all the cores on the system and this archival store should be no different.
  5. Metadata overhead should be kept to a minimum and I will be using the Segmented similarity based indexing to use a global index that can fit in RAM.
  6. Data and Metadata should be kept separate such that metadata can be located on high-speed storage like SSDs to speed up access. While this increases the number of multiple separate disk accesses during restore, the effect can be reduced by locality sensitive caching in addition to SSDs.
  7. The system should of course be able to scale to petabytes.
  8. It should be possible to integrate the system with existing backup software like Amanda, Bacula etc. This is needed if we want to do source-side dedupe.
  9. There should be a chunk reference count with a max limit to avoid too many datasets referencing the same chunk. The loss of a multiple referenced chunk can corrupt multiple backups. Having an upper limit reduces the risk. In addition we need replication but that is not in my charter at this time. Filesystem replication/distribution can be used for the purpose. Software like DRBD can also be used.
  10. Another feature is to limit deduplication to the last X backup sets much like a sliding window. This allows cleanly removing really old backups and avoid recent backups from referencing chunks in a those old data.
  11. All this applies to archival storage on disk. Deduping backups onto tape is a different can of worms that I will probably look at later.

I plan to go at all these requirements in phases. For example I’d not initially look at source-side dedupe. Rather the initial focus will be to get a high-performance stable backend. If one is wondering about some of the terms used here, then look at the Wikipedia article for explanations.


Updated Compression Benchmarks – part 3

I have added the 3rd and final set of benchmark results comparing Pcompress to two other data dedupe utilities, Lrzip and eXdupe here: http://moinakg.github.io/pcompress/results3.html. Lrzip does not do traditional dedupe of 4KB blocks or above. Rather it uses the Rzip algorithm which is derived from Rsync.

Rzip also does variable block dedupe but at much smaller sizes than 4KB. However I am not sure if Rzip can be adapted as a multi-file generalized deduplication store as the index blow-up is quite extravagant. Though it might be possible to do segmented matching and then apply Rzip across Segment data. It will require re-reading old segment data and the dedupe solution will necessarily be offline or post-process.

The observations from the results are summarized below:

      • If we just do Dedupe and avoid compression of data (“Dedupe Only” result in the graphs) then Lrzip produces smaller archives. This is obvious since Pcompress does traditional Dedupe at average 4KB variable blocks while Lrzip finds matches are much smaller lengths. Exdupe cannot be compared here as it has no option to avoid compression. At high compression levels Pcompress consistently gives the fastest times. However except for LZ4 option Pcompress produces slightly larger archives for all other algorithms when compared with Lrzip. Lrzip uses Lzo not LZ4. I tried using Lrzip to just do rzip and then compress the result with LZ4 for the CentOS tarball. I got a size of 662751240 bytes with data split into 256MB chunks. So Lrzip would have produced a smaller archive if it had integrated LZ4.
      • LZ4 is a fantastic algorithm. The combination of speed and compression ratio is unparalleled.
      • At fast compression levels Pcompress matches or exceeds Exdupe in speed (depending on the dataset) while producing a better compression ratio. Once again LZ4 has a big contribution to the result. Lrzip loses out handily in terms of speed but compression ratio is good.
      • In general Pcompress gives some of the best combinations of compression ratio and speed.
      • One of the possible reasons for the larger Exdupe file sizes can be extra metadata. Exdupe allows differential backups to be taken against an initial full backup. In order to do block-level differential backup, in other words deduplicated backup, it needs to store additional metadata for existing blocks.

Remember this is just a small system with 2 cores and 2 hyperthreads, or 4 logical cores. On systems will more cores Pcompress performance will scale appropriately.

Pcompress 2.2 released

I decided to pull another release of Pcompress primarily due to some bugfixes that went in. One of them is a build issue on Debian6 and non-SSE4 processor and the others are a couple of crashes with invalid input.

In addition to fixing stuff I have re-wrote the Min-Heap code and took out all the Python derived stuff. It is now much simpler and much faster than before. While doing this re-write I found and fixed a problem with the earlier Min-Heap approach. Thus Delta Differencing is now faster and more accurate than before.

I also improved the scalable Segmented Global Dedupe and it now works with greater than 95% efficiency in finding duplicate chunks. it appears that using larger segments for larger dedupe block sizes results in better accuracy. If you come to think of it this is also logical since one would want faster processing with smaller indexes when using larger and larger dedupe blocks. Corresponding larger segments enable just that.

Updated Compression Benchmarks – part 2

I have added the second set of benchmarks that demonstrate the effect of the different pre-processing options on compression ratio and speed. The results are available here: http://moinakg.github.io/pcompress/results2.html

All of these results have Global Dedupe enabled. These results also compare the effect of various compression algorithms on two completely different datasets. One is a set of VMDK files and another purely textual data. Some observations below:

  • In virtually all the cases using ‘-L’ and ‘-P’ switches results in the smallest file. Only in case of LZMA these options marginally deteriorate the compression ratio indicating that the reduction of redundancy is hurting LZMA. To identify which of the two hurts more I repeated the command (see the terminology in results page) with lzmaMt algo and only option ‘-L’ at compression level 6 on the CentOS vmdk tarball. The resultant size came to: 472314917. The size got from running with only option ‘-P’ is available in the results page: 469153825. Thus it is the LZP preprocessing that unsettles LZMA the most along with segment size of 64MB. Delta2 actually helps. Running the command with segment size of 256MB we see the following results – ‘-L’ and ‘-P’: 467946789, ‘-P’ only: 466076733, ‘-L’ only: . Once again Delta2 helps. At higher compression however, Delta2 is marginally worse as well.
  • There is some interesting behavior with respect to the PPMD algorithm. The time graph (red line) shows a relative spike for the CentOS graphs as compared to the Linux source tarball graphs. PPMD is an algorithm primarily suited for textual data so using it on non-textual data provides good compression but takes more time.
  • Both Libbsc and PPMD are especially good on the textual Linux source tar and are comparable to LZMA results while only taking a fraction of the time taken by LZMA. Especially Libbsc really rocks by producing better compression and being much faster as compared to LZMA. However i have seen decompression time with Libbsc to be quite high as compared to PPMD.

Updated Compression Benchmarks

Pcompress has gone through a sea of changes since the last time I ran benchmarks comparing performance and effectiveness with other utilities. So I spent several days running various benchmark scripts generating and collating a lot of results in the process.

Due to the sheer volume of the results and limited time, I took the easy way out of importing all the CSV data into Excel, formatting and charting them and exporting to HTML. The generated HTML code looks complex and messy but at least it shows up correctly in Firefox, Chrome and IE.

The first set of results can be seen here: http://moinakg.github.io/pcompress/results1.html. This is basically comparing pcompress with Segment-level and Global Deduplication to other standard utilities. It also contrasts effectiveness of Global Dedupe with Segment-level Dedupe.

The Datasets used

  1. A tar of the VMDK files of installed CentOS 6.2 x86-64 version.
  2. Linux 3.6 RC2 source tarball.
  3. Two copies of the Silesia corpus tar concatenated together. This results in a file that is double the size of the original Silesia corpus but has 100% duplicate data.
  4. A tarball of the “Program Files” directory on my 32-bit Windows 7 installation.

Some Observations

  1. As is quite clear, Pcompress is both faster and more effective compared to the standard utilities tested: Gzip, Bzip2, 7za, Xz and Pxz (Parallel Xz).
  2. As usual Xz performs the worst. The time graph shows a steep spike. Pxz is a lot better but is still half as slow as Pcompress. In addition remember that Pcompress is having a bunch of additional processing overheads that the other utilities do not have: SHA256, BLAKE2, LZP and Delta2 processing.
  3. Interestingly the LZ4 mode along with Dedupe and all the preprocessing produces results that are close to traditional Gzip while being more than twice as fast. In fact two datasets shows results smaller than Gzip. This result is notable when one wants good compression done extremely fast.
  4. Global Dedupe of course is more effective than Segment-level Dedupe but what is more surprising is that it is also faster overall, even though Global Dedupe requires serialized access to a central index and Segmented Dedupe is fully parallel. I can attribute three causes: my test system is low-end with constrained RAM bandwidth and conflicts arising from parallel access; Segment-level dedupe also uses memcmp() while Global Dedupe does not; Global Dedupe reduces data further resulting in lesser work for the final compression algorithm.
  5. The concatenated Silesia corpus with 100% duplicate data of course shows the maximum benefit from Global Dedupe that removes long-range redundancies in data.
  6. In some cases compression levels 9 and 14 show marginally lesser compression than level 6. This appears to be because of LZP side-effects. At higher levels, LZP parameters are tweaked to work more aggressively so it may be taking out a little too much redundancy that affects the compression algorithm’s effectiveness. This is something that I will have to tweak going forward.

I will be posting more results soon and will include a comparison with Lrzip that uses an improved Rzip implementation to take out long-range redundancies in data at a finer granularity compared to 4KB variable-block Deduplication.

Pcompress 2.1 released with fixes and performance enhancements

I just uploaded a new release of Pcompress with a load of fixes and performance tweaks. You can see the download and some details of the changes here: https://code.google.com/p/pcompress/downloads/detail?name=pcompress-2.1.tar.bz2&can=2&q=

A couple of the key things are improvement in Global Dedupe accuracy and ability to set the dedupe block hash independent of the data verification hash. From a conservative viewpoint the default block hash is set to the proven SHA256. This however can be changed via an environment variable called ‘PCOMPRESS_CHUNK_HASH_GLOBAL’. SKEIN is one of the alternatives supported for this. SKEIN is a solid NIST SHA3 finalist with good amount of cryptanalysis done and no practical weakness found. It is also faster than SHA256. These choices give a massive margin of safety against random hash collisions and unexpected data corruptions considering that other commercial and open-source dedupe offerings tend to use weaker options like SHA1(Collision attack found, see below), Tiger24 or even the non-cryptographic Murmur3-128! All this for the sake of performance. Albeit some of them did not have too many choices at the time development started on those products. In addition even with a collision attack it is still impractical to get a working exploit for a dedupe storage engine that uses SHA1 like say Data Domain, and corrupt stored data.

The Segmented Global Dedupe algorithm used for scalability now gives around 95% of the data reduction efficiency of simple full chunk index based dedupe.

Pcompress 2.0 with Global Deduplication

The last few weeks I have been heads down busy with a multitude of things at work and in personal life with hardly any time for anything else. One of the biggest items that kept me busy during my spare times has of course been the release of Pcompress 2.0.

This release brings to fruition some of the hobby research work I had been doing around scalable deduplication of very large datasets. Pcompress 2.0 includes support for Global Deduplication which eliminates duplicate chunks across the entire dataset or file. Pcompress already had support for Data Deduplication but it removed duplicates only within a segment of the data. The larger the segment size, the more effective is the deduplication. This mode is very fast since there is no central index and no serialization. However dedupe effectiveness gets limited.

Global Deduplication introduces a central in-memory index for looking up chunk hashes. Data is first split into fixed-size or variable-length Rabin chunks as usual. Each 4KB (or larger) chunk of data has an associated 256-bit or larger cryptographic checksum (SHA256, BLAKE2 etc.). These hashes are looked up and inserted into a central hashtable. If a chunk hash entry is already present in the hashtable then the chunk is considered a duplicate and a reference to the existing chunk is inserted into the datastream. This is a simple full chunk index based exact deduplication approach which is very effective using 4KB chunk sizes. However there is a problem.

The size of a full chunk index grows rapidly with the dataset. If we are looking at 4KB chunks then we get 268435456 chunks for 1TB of data. Each chunk entry in the hashtable needs to have the 256-bit checksum, a 64-bit file offset and a 32-bit length value. So total size of the index entries is approax 11GB for unique data not considering the additional overheads of the hashtable structure. So if we consider hundreds of terabytes then the index is too big to fit in memory. In fact the index becomes so big that it becomes very costly to lookup chunk hashes slowing the dedupe process to a crawl. Virtually all commercial dedupe products do not even use 4KB chunks. The minimum is 8KB used in Data Domain with most other products using chunk sizes much larger than that. Larger chunk sizes reduce the index size but also reduce dedupe effectiveness.

One of the ways of scaling Data Deduplication to petascale is to look at similarity matching techniques that can determine regions of data that are approximately similar to each other and then compare their cryptographic chunk hashes to actually eliminate exact matching chunks. A variant of this technique uses Delta Differencing instead of hash matching to eliminate redundancy at the byte level. However I was interested in the former.

Pcompress 2.0 includes two approaches to Global Deduplication. If a simple full chunk index can fit into 75% of available free RAM then it is used directly. This is fast and most effective at eliminating duplicates. By default 4KB chunks are used and it gives good performance even with chunks this small. This is lower than what most other commercial or open-source dedupe products recommend or offer. Once file sizes start becoming larger and the index size overflows the memory limit then Pcompress automatically switches to Segmented Similarity Based Deduplication.

In Segmented Similarity mode data is split into 4KB (or larger) chunks as usual (Variable-length Rabin or Fixed-block). Then groups of 2048 chunks are collected to form a segment. With 4KB chunks this results in an average segment size of 8MB. The list of cryptographic chunk hashes for the segment are stored in a temporary segment file. Then these cryptographic chunks hashes are analysed to produce 25 similarity hashes. Each similarity hash is essentially a 64-bit CRC of a min-value entry. These hashes are then inserted or looked up in a central index. If another segment is found that matches at least one of the 25 similarity hashes then that segment is considered approximately similar to the current segment. It’s chunk hash list is then memory mapped into the process address space and exact crypto hash based chunk matching is done to perform the actual deduplication.

This approach results is an index size that is approximately 0.0023% of the dataset size. So Pcompress will require upto a 25GB index to deduplicate 1PB of data. That is assuming 100% random 1PB data with no duplicates. In practice the index will be smaller. This approach provides >90% dedupe efficiency of using a full chunk index while providing high scalability. Even though disk I/O is not completely avoided, it requires one disk write and only a few disk reads for every 2048 chunks. To balance performance and predictable behaviour, the write is synced to disk after every few segments. Using mmap(), instead of a read, helps performance and the disk offsets to be mmap-ed are sorted in ascending order to reduce random access to the segment chunk list file. This file is always written to at the end and extended but existing data is never modified. So it is ideal to place it on a Solid State drive to get a very good performance boost. Finally, access to the central index is coordinated by the threads cooperating using a set of semaphores allowing for lock-free access to critical sections. See: https://moinakg.wordpress.com/2013/03/26/coordinated-parallelism-using-semaphores/

I had been working out the details of this approach for quite a while now and Pcompress 2.0 contains the practical implementation of it. In addition to this Pcompress now includes two additional streaming modes. When compressing a file the output file can be specified as ‘-‘ to stream the compressed data to stdout. Decompression can take the input file as ‘-‘ to read compressed data from stdin.

Global Deduplication in Pcompress together with streaming modes and with help from utilities like Netcat or Ncat can be used to optimize network transfer of large datasets. Eventually I intend to implement proper WAN Optimization capabilities in a later release.

Related Research

  1. SiLo: A Similarity-Locality based Near-Exact Deduplication
  2. The Design of a Similarity Based Deduplication System
  3. Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality
  4. Similarity Based Deduplication with Small Data Chunks