Tag Archives: Pcompress

Posts related to the Pcompress utility.

Updated Compression Benchmarks – part 3

I have added the 3rd and final set of benchmark results comparing Pcompress to two other data dedupe utilities, Lrzip and eXdupe here: http://moinakg.github.io/pcompress/results3.html. Lrzip does not do traditional dedupe of 4KB blocks or above. Rather it uses the Rzip algorithm which is derived from Rsync.

Rzip also does variable block dedupe but at much smaller sizes than 4KB. However I am not sure if Rzip can be adapted as a multi-file generalized deduplication store as the index blow-up is quite extravagant. Though it might be possible to do segmented matching and then apply Rzip across Segment data. It will require re-reading old segment data and the dedupe solution will necessarily be offline or post-process.

The observations from the results are summarized below:

      • If we just do Dedupe and avoid compression of data (“Dedupe Only” result in the graphs) then Lrzip produces smaller archives. This is expected since Pcompress does traditional Dedupe at average 4KB variable blocks while Lrzip finds matches at much smaller lengths. Exdupe cannot be compared here as it has no option to avoid compression.
      • At high compression levels Pcompress consistently gives the fastest times. However, except for the LZ4 option, Pcompress produces slightly larger archives than Lrzip for all other algorithms. Lrzip uses Lzo, not LZ4. I tried using Lrzip to just do rzip and then compressed the result with LZ4 for the CentOS tarball. I got a size of 662751240 bytes with data split into 256MB chunks. So Lrzip would have produced a smaller archive if it had integrated LZ4.
      • LZ4 is a fantastic algorithm. The combination of speed and compression ratio is unparalleled.
      • At fast compression levels Pcompress matches or exceeds Exdupe in speed (depending on the dataset) while producing a better compression ratio. Once again LZ4 has a big contribution to the result. Lrzip loses out handily in terms of speed but compression ratio is good.
      • In general Pcompress gives some of the best combinations of compression ratio and speed.
      • One of the possible reasons for the larger Exdupe file sizes can be extra metadata. Exdupe allows differential backups to be taken against an initial full backup. In order to do block-level differential backup, in other words deduplicated backup, it needs to store additional metadata for existing blocks.

Remember this is just a small system with 2 cores and 2 hyperthreads, or 4 logical cores. On systems with more cores Pcompress performance will scale appropriately.

Pcompress 2.2 released

I decided to pull another release of Pcompress primarily due to some bugfixes that went in. One of them is a build issue on Debian6 and non-SSE4 processor and the others are a couple of crashes with invalid input.

In addition to fixing stuff I have rewritten the Min-Heap code and taken out all the Python derived stuff. It is now much simpler and much faster than before. While doing this rewrite I found and fixed a problem with the earlier Min-Heap approach. Thus Delta Differencing is now faster and more accurate than before.

I also improved the scalable Segmented Global Dedupe and it now works with greater than 95% efficiency in finding duplicate chunks. It appears that using larger segments for larger dedupe block sizes results in better accuracy. If you come to think of it this is also logical since one would want faster processing with smaller indexes when using larger and larger dedupe blocks. Correspondingly larger segments enable just that.

Updated Compression Benchmarks – part 2

I have added the second set of benchmarks that demonstrate the effect of the different pre-processing options on compression ratio and speed. The results are available here: http://moinakg.github.io/pcompress/results2.html

All of these results have Global Dedupe enabled. These results also compare the effect of various compression algorithms on two completely different datasets. One is a set of VMDK files and another purely textual data. Some observations below:

  • In virtually all the cases using the ‘-L’ and ‘-P’ switches results in the smallest file. Only in the case of LZMA do these options marginally deteriorate the compression ratio, indicating that the reduction of redundancy is hurting LZMA. To identify which of the two hurts more I repeated the command (see the terminology in results page) with lzmaMt algo and only option ‘-L’ at compression level 6 on the CentOS vmdk tarball. The resultant size came to: 472314917. The size obtained from running with only option ‘-P’ is available in the results page: 469153825. Thus it is the LZP preprocessing that unsettles LZMA the most along with segment size of 64MB. Delta2 actually helps. Running the command with segment size of 256MB we see the following results – ‘-L’ and ‘-P’: 467946789, ‘-P’ only: 466076733, ‘-L’ only: . Once again Delta2 helps. At higher compression however, Delta2 is marginally worse as well.
  • There is some interesting behavior with respect to the PPMD algorithm. The time graph (red line) shows a relative spike for the CentOS graphs as compared to the Linux source tarball graphs. PPMD is an algorithm primarily suited for textual data so using it on non-textual data provides good compression but takes more time.
  • Both Libbsc and PPMD are especially good on the textual Linux source tar and are comparable to LZMA results while only taking a fraction of the time taken by LZMA. Libbsc especially rocks, producing better compression and being much faster as compared to LZMA. However, I have seen decompression time with Libbsc to be quite high as compared to PPMD.

Updated Compression Benchmarks

Pcompress has gone through a sea of changes since the last time I ran benchmarks comparing performance and effectiveness with other utilities. So I spent several days running various benchmark scripts generating and collating a lot of results in the process.

Due to the sheer volume of the results and limited time, I took the easy way out of importing all the CSV data into Excel, formatting and charting them and exporting to HTML. The generated HTML code looks complex and messy but at least it shows up correctly in Firefox, Chrome and IE.

The first set of results can be seen here: http://moinakg.github.io/pcompress/results1.html. This is basically comparing pcompress with Segment-level and Global Deduplication to other standard utilities. It also contrasts effectiveness of Global Dedupe with Segment-level Dedupe.

The Datasets used

  1. A tar of the VMDK files of installed CentOS 6.2 x86-64 version.
  2. Linux 3.6 RC2 source tarball.
  3. Two copies of the Silesia corpus tar concatenated together. This results in a file that is double the size of the original Silesia corpus but has 100% duplicate data.
  4. A tarball of the “Program Files” directory on my 32-bit Windows 7 installation.

Some Observations

  1. As is quite clear, Pcompress is both faster and more effective compared to the standard utilities tested: Gzip, Bzip2, 7za, Xz and Pxz (Parallel Xz).
  2. As usual Xz performs the worst. The time graph shows a steep spike. Pxz is a lot better but is still about half as fast as Pcompress. In addition remember that Pcompress has a bunch of additional processing overheads that the other utilities do not have: SHA256, BLAKE2, LZP and Delta2 processing.
  3. Interestingly the LZ4 mode along with Dedupe and all the preprocessing produces results that are close to traditional Gzip while being more than twice as fast. In fact two datasets show results smaller than Gzip. This result is notable when one wants good compression done extremely fast.
  4. Global Dedupe of course is more effective than Segment-level Dedupe but what is more surprising is that it is also faster overall, even though Global Dedupe requires serialized access to a central index and Segmented Dedupe is fully parallel. I can attribute this to three causes: my test system is low-end with constrained RAM bandwidth and conflicts arising from parallel access; Segment-level dedupe also uses memcmp() while Global Dedupe does not; Global Dedupe reduces data further, resulting in less work for the final compression algorithm.
  5. The concatenated Silesia corpus with 100% duplicate data of course shows the maximum benefit from Global Dedupe that removes long-range redundancies in data.
  6. In some cases compression levels 9 and 14 show marginally lesser compression than level 6. This appears to be because of LZP side-effects. At higher levels, LZP parameters are tweaked to work more aggressively so it may be taking out a little too much redundancy that affects the compression algorithm’s effectiveness. This is something that I will have to tweak going forward.

I will be posting more results soon and will include a comparison with Lrzip that uses an improved Rzip implementation to take out long-range redundancies in data at a finer granularity compared to 4KB variable-block Deduplication.

Pcompress 2.1 released with fixes and performance enhancements

I just uploaded a new release of Pcompress with a load of fixes and performance tweaks. You can see the download and some details of the changes here: https://code.google.com/p/pcompress/downloads/detail?name=pcompress-2.1.tar.bz2&can=2&q=

A couple of the key things are improvement in Global Dedupe accuracy and the ability to set the dedupe block hash independent of the data verification hash. From a conservative viewpoint the default block hash is set to the proven SHA256. This however can be changed via an environment variable called ‘PCOMPRESS_CHUNK_HASH_GLOBAL’. SKEIN is one of the alternatives supported for this. SKEIN is a solid NIST SHA3 finalist with a good amount of cryptanalysis done and no practical weakness found. It is also faster than SHA256. These choices give a massive margin of safety against random hash collisions and unexpected data corruptions considering that other commercial and open-source dedupe offerings tend to use weaker options like SHA1 (collision attack found, see below), Tiger24 or even the non-cryptographic Murmur3-128! All this for the sake of performance, although some of them did not have too many choices at the time development started on those products. In addition, even with a collision attack it is still impractical to get a working exploit for a dedupe storage engine that uses SHA1, like say Data Domain, and corrupt stored data.

The Segmented Global Dedupe algorithm used for scalability now gives around 95% of the data reduction efficiency of simple full chunk index based dedupe.

Pcompress 2.0 with Global Deduplication

The last few weeks I have been heads down busy with a multitude of things at work and in personal life with hardly any time for anything else. One of the biggest items that kept me busy during my spare time has of course been the release of Pcompress 2.0.

This release brings to fruition some of the hobby research work I had been doing around scalable deduplication of very large datasets. Pcompress 2.0 includes support for Global Deduplication which eliminates duplicate chunks across the entire dataset or file. Pcompress already had support for Data Deduplication but it removed duplicates only within a segment of the data. The larger the segment size, the more effective is the deduplication. This mode is very fast since there is no central index and no serialization. However dedupe effectiveness gets limited.

Global Deduplication introduces a central in-memory index for looking up chunk hashes. Data is first split into fixed-size or variable-length Rabin chunks as usual. Each 4KB (or larger) chunk of data has an associated 256-bit or larger cryptographic checksum (SHA256, BLAKE2 etc.). These hashes are looked up and inserted into a central hashtable. If a chunk hash entry is already present in the hashtable then the chunk is considered a duplicate and a reference to the existing chunk is inserted into the datastream. This is a simple full chunk index based exact deduplication approach which is very effective using 4KB chunk sizes. However there is a problem.
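The full chunk index approach described above can be illustrated with a minimal Python sketch. It is not the Pcompress implementation: it assumes fixed-size 4KB chunks (Rabin chunking is left out), uses SHA256 as the chunk hash, and the names `dedupe_stream` and `restore` are made up for this example.

```python
import hashlib

CHUNK_SIZE = 4096  # assumption: fixed-size 4KB chunks, no Rabin chunking


def dedupe_stream(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Return records: ('data', chunk) or ('ref', offset, length)."""
    index = {}    # chunk hash -> (offset, length) of the first occurrence
    records = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        if digest in index:
            # Duplicate: emit a backward reference to the earlier chunk.
            records.append(("ref",) + index[digest])
        else:
            index[digest] = (offset, len(chunk))
            records.append(("data", chunk))
    return records


def restore(records) -> bytes:
    """Rebuild the original stream by resolving backward references."""
    out = bytearray()
    for rec in records:
        if rec[0] == "data":
            out += rec[1]
        else:
            _, offset, length = rec
            out += out[offset:offset + length]
    return bytes(out)
```

Because every reference points backwards into data that has already been restored, the stream can be rebuilt in a single pass.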

The size of a full chunk index grows rapidly with the dataset. If we are looking at 4KB chunks then we get 268435456 chunks for 1TB of data. Each chunk entry in the hashtable needs to have the 256-bit checksum, a 64-bit file offset and a 32-bit length value. So the total size of the index entries is approx 11GB for unique data, not considering the additional overheads of the hashtable structure. So if we consider hundreds of terabytes then the index is too big to fit in memory. In fact the index becomes so big that it becomes very costly to look up chunk hashes, slowing the dedupe process to a crawl. Virtually all commercial dedupe products do not even use 4KB chunks. The minimum is 8KB used in Data Domain with most other products using chunk sizes much larger than that. Larger chunk sizes reduce the index size but also reduce dedupe effectiveness.
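A quick way to check the arithmetic above is the following Python sketch. It counts only the three fields named in the text (checksum, offset, length) and ignores hashtable overheads; the function name `full_index_bytes` is illustrative.

```python
TIB = 1 << 40                                # 1TB taken as 2^40 bytes
ENTRY_BYTES = 256 // 8 + 64 // 8 + 32 // 8   # checksum + offset + length = 44


def full_index_bytes(dataset_bytes: int, chunk_bytes: int) -> int:
    """Size of the raw index entries for fully unique data."""
    return (dataset_bytes // chunk_bytes) * ENTRY_BYTES


chunks_4k = TIB // 4096                              # 268435456 chunks
index_4k_gib = full_index_bytes(TIB, 4096) / (1 << 30)   # 11.0
index_8k_gib = full_index_bytes(TIB, 8192) / (1 << 30)   # 5.5
```

Doubling the chunk size halves the index, which is the trade-off against dedupe effectiveness mentioned above.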

One of the ways of scaling Data Deduplication to petascale is to look at similarity matching techniques that can determine regions of data that are approximately similar to each other and then compare their cryptographic chunk hashes to actually eliminate exact matching chunks. A variant of this technique uses Delta Differencing instead of hash matching to eliminate redundancy at the byte level. However I was interested in the former.

Pcompress 2.0 includes two approaches to Global Deduplication. If a simple full chunk index can fit into 75% of available free RAM then it is used directly. This is fast and most effective at eliminating duplicates. By default 4KB chunks are used and it gives good performance even with chunks this small. This is lower than what most other commercial or open-source dedupe products recommend or offer. Once file sizes start becoming larger and the index size overflows the memory limit then Pcompress automatically switches to Segmented Similarity Based Deduplication.

In Segmented Similarity mode data is split into 4KB (or larger) chunks as usual (Variable-length Rabin or Fixed-block). Then groups of 2048 chunks are collected to form a segment. With 4KB chunks this results in an average segment size of 8MB. The list of cryptographic chunk hashes for the segment is stored in a temporary segment file. Then these cryptographic chunk hashes are analysed to produce 25 similarity hashes. Each similarity hash is essentially a 64-bit CRC of a min-value entry. These hashes are then inserted or looked up in a central index. If another segment is found that matches at least one of the 25 similarity hashes then that segment is considered approximately similar to the current segment. Its chunk hash list is then memory mapped into the process address space and exact crypto hash based chunk matching is done to perform the actual deduplication.
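The general shape of this similarity lookup can be sketched in Python as below. Several details are assumptions made for illustration and are not taken from Pcompress: the similarity hashes are derived from the 25 smallest chunk hashes of a segment, a BLAKE2b digest truncated to 64 bits stands in for the 64-bit CRC, and an in-memory list stands in for the memory-mapped segment file. The class and function names are hypothetical.

```python
import hashlib

NUM_SIM_HASHES = 25  # similarity hashes per segment, as in the text


def similarity_hashes(chunk_hashes, count=NUM_SIM_HASHES):
    """Reduce the `count` smallest chunk hashes to 64-bit values.

    Assumption: min-value selection by sorting, BLAKE2b-64 instead of CRC64.
    """
    smallest = sorted(set(chunk_hashes))[:count]
    return [hashlib.blake2b(h, digest_size=8).digest() for h in smallest]


class SimilarityIndex:
    def __init__(self):
        self.by_sim_hash = {}  # 64-bit similarity hash -> segment id
        self.segments = []     # segment id -> chunk hash list (segment file stand-in)

    def add_segment(self, chunk_hashes):
        """Return (segment id, chunk hashes already present in similar segments)."""
        sims = similarity_hashes(chunk_hashes)
        similar = {self.by_sim_hash[s] for s in sims if s in self.by_sim_hash}
        known = set()
        for seg_id in similar:
            known.update(self.segments[seg_id])  # exact matching against similar segments only
        duplicates = {h for h in chunk_hashes if h in known}
        new_id = len(self.segments)
        self.segments.append(list(chunk_hashes))
        for s in sims:
            self.by_sim_hash.setdefault(s, new_id)
        return new_id, duplicates
```

Only the small similarity hashes live in the central index; the full chunk hash lists are consulted just for segments that look similar, which is what keeps the index small at the cost of missing some duplicates.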

This approach results in an index size that is approximately 0.0023% of the dataset size. So Pcompress will require up to a 25GB index to deduplicate 1PB of data. That is assuming 100% random 1PB data with no duplicates. In practice the index will be smaller. This approach provides >90% of the dedupe efficiency of using a full chunk index while providing high scalability. Even though disk I/O is not completely avoided, it requires one disk write and only a few disk reads for every 2048 chunks. To balance performance and predictable behaviour, the write is synced to disk after every few segments. Using mmap(), instead of a read, helps performance and the disk offsets to be mmap-ed are sorted in ascending order to reduce random access to the segment chunk list file. This file is always written to at the end and extended but existing data is never modified. So it is ideal to place it on a Solid State drive to get a very good performance boost. Finally, access to the central index is coordinated by the threads cooperating using a set of semaphores allowing for lock-free access to critical sections. See: https://moinakg.wordpress.com/2013/03/26/coordinated-parallelism-using-semaphores/
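The index size figures above can be reproduced with a few lines of Python. The sketch assumes that each 8MB segment contributes only its 25 similarity hashes of 8 bytes each to the central index and ignores any per-entry overhead, which is a simplification.

```python
CHUNK_BYTES = 4 * 1024
SEGMENT_CHUNKS = 2048
SIM_HASHES = 25
SIM_HASH_BYTES = 8   # 64-bit similarity hash

segment_bytes = CHUNK_BYTES * SEGMENT_CHUNKS        # 8MB per segment
index_per_segment = SIM_HASHES * SIM_HASH_BYTES     # 200 bytes per segment
fraction = index_per_segment / segment_bytes
percent = fraction * 100                            # about 0.00238

PIB = 1 << 50                                       # 1PB taken as 2^50 bytes
index_for_1pb_gib = PIB * fraction / (1 << 30)      # 25.0
```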

I had been working out the details of this approach for quite a while now and Pcompress 2.0 contains the practical implementation of it. In addition to this Pcompress now includes two additional streaming modes. When compressing a file the output file can be specified as ‘-’ to stream the compressed data to stdout. Decompression can take the input file as ‘-’ to read compressed data from stdin.

Global Deduplication in Pcompress together with streaming modes and with help from utilities like Netcat or Ncat can be used to optimize network transfer of large datasets. Eventually I intend to implement proper WAN Optimization capabilities in a later release.

Related Research

  1. SiLo: A Similarity-Locality based Near-Exact Deduplication
  2. The Design of a Similarity Based Deduplication System
  3. Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality
  4. Similarity Based Deduplication with Small Data Chunks

Coordinated Parallelism using Semaphores

One of the key features of my Pcompress utility is, of course, parallelism: the ability to split and process data in parallel in multiple threads, with all the threads doing virtually the same work. There is some limited variability depending on the nature of the work and the nature of the data. For example some data segments may not be compressible so they will have to be stored as-is. Whenever there are multiple threads there is typically some need for synchronization. There are of course scenarios where thread processing is completely independent and no synchronization is needed whatsoever. However in Pcompress that is not the case. There are a couple of cases where synchronization is needed:

  1. Ordering of input data segments. The data segments in the compressed file must be written in the same order as they were input otherwise data will be corrupt.
  2. With the Global Deduplication feature, access to a single chunk index must be serialized and ordered in the same sequence as they were input.

The second feature is a recent addition and also requires ordering since we want all duplicate chunk references to be backward references. That is, in a data stream, duplicate chunks point backwards to whole chunks at the head of the stream. So data segments containing the chunks must go through the index lookup in the same order as they were input. The rest of the processing, like the actual pre-compression stage, chunk splitting stage, compression stage, optional encryption stage and so on, can work completely in parallel without dependencies. The flow can be illustrated by the following diagram:

[Diagram: parallel_flow]

As you can notice there are 3 points where some form of synchronization is necessary: the input, the index lookup for global dedupe and the final writer stage. In addition data ordering as per input has to be maintained for index lookup and when writing the output data.

There are several ways of achieving this flow. The most common technique is to use a thread pool and some queues. Perhaps the simplest approach is to use barrier synchronization: we can put one barrier prior to the index lookup and another barrier prior to the writer. In each case a simple loop takes care of the serial processing, maintaining the proper data ordering. However, both approaches have drawbacks. Using queues and thread pools has resource overheads for the data structures and locking. Barriers are not strictly needed here, and using barriers means that some amount of potential concurrency is lost waiting at the barrier. The time spent waiting at the barrier is the time taken for the slowest, or typically the last, thread to complete processing. One of my intentions was to have as much overlapped processing as possible. If one thread is accessing the index and another thread does not need it, then the latter should be allowed to proceed.

So I played around with POSIX semaphores. Using semaphores in a producer-consumer setup is a common approach. However Pcompress threads are a little more involved than simple producers and consumers. A bunch of semaphores are needed to signal and control the execution flow of the threads. After some experimentation I came up with the following approach.

A dispatcher thread reads data segments from the input file and schedules them to worker threads in a round robin fashion and the writer thread reads processed data segments from worker threads in a round robin loop as well. This preserves data ordering at input and output. The ordering of index lookup and dedupe processing is done by one thread signaling the other. The diagram below illustrates this showing an example with 2 threads.

[Diagram: parallel_flow]

The green arrows in the diagram show the direction of the semaphore signals. At each synchronization point a semaphore is signaled to indicate completion of some activity. The critical section of the index lookup and update operation is highlighted in blue. Each thread holds a reference to the index semaphore of the next thread in sequence. The last thread holds a reference to the index semaphore of the first thread. Each thread first waits for its own index semaphore to be signaled, then performs the index update and signals the next guy to proceed. The dispatcher thread signals the index semaphore of the first thread to start the ball rolling. Effectively this approach is equivalent to a round-robin token ring network. Whoever holds the token can access the common resource. Lock contention is completely avoided, so this can potentially scale to thousands of threads.
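The token ring idea can be sketched with Python's `threading.Semaphore` in place of POSIX semaphores. This is a toy model, not the Pcompress code: the function name `run_ring` is made up, the "index" is just a list that records the order of access, and each thread handles a fixed number of segments.

```python
import threading


def run_ring(num_threads: int, segments_per_thread: int):
    """Return the order in which segments passed through the critical section."""
    index_order = []
    # One index semaphore per thread, all initially 0 (nobody holds the token).
    index_sems = [threading.Semaphore(0) for _ in range(num_threads)]

    def worker(tid):
        my_sem = index_sems[tid]
        next_sem = index_sems[(tid + 1) % num_threads]  # last thread wraps to the first
        for round_no in range(segments_per_thread):
            segment_no = round_no * num_threads + tid   # round-robin assignment
            # ... parallel work (chunking, hashing) would happen here ...
            my_sem.acquire()                 # wait for the token
            index_order.append(segment_no)   # critical section: index lookup/update
            next_sem.release()               # pass the token to the next thread

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    index_sems[0].release()                  # the dispatcher starts the ball rolling
    for t in threads:
        t.join()
    return index_order
```

For example, `run_ring(4, 3)` visits the critical section in segment order 0 through 11 regardless of how the threads are scheduled, because only the token holder can enter it.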

The key to data ordering are the two loops, one in the dispatcher and one in the writer thread. The dispatcher always assigns data segments to threads in a fixed order. In the above example Thread 1 gets all the odd segments and Thread 2 gets all the even ones. The writer thread also waits for threads in order eventually ensuring that data ordering is preserved.
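The dispatcher and writer loops described above can be modelled in Python as follows. This is a simplified sketch under stated assumptions: each worker has a single buffer slot, three hypothetical semaphores per worker (`start`, `done`, `free`) carry the signals, `None` is used as a shutdown sentinel, and `bytes.upper` stands in for the real per-segment processing.

```python
import threading


def process_in_order(segments, num_threads=2, transform=bytes.upper):
    """Process segments in parallel and return results in input order."""
    slots = [None] * num_threads
    start = [threading.Semaphore(0) for _ in range(num_threads)]  # dispatcher -> worker
    done = [threading.Semaphore(0) for _ in range(num_threads)]   # worker -> writer
    free = [threading.Semaphore(1) for _ in range(num_threads)]   # writer -> dispatcher, primed
    output = []

    def worker(tid):
        while True:
            start[tid].acquire()
            if slots[tid] is None:        # sentinel: no more work
                return
            slots[tid] = transform(slots[tid])
            done[tid].release()

    def writer():
        for i in range(len(segments)):
            tid = i % num_threads         # wait for the workers in a fixed order
            done[tid].acquire()
            output.append(slots[tid])
            free[tid].release()           # the slot can be reused

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    threads.append(threading.Thread(target=writer))
    for t in threads:
        t.start()

    # Dispatcher loop: hand out segments in round-robin order.
    for i, seg in enumerate(segments):
        tid = i % num_threads
        free[tid].acquire()
        slots[tid] = seg
        start[tid].release()
    # Shut the workers down once their last segment has been written.
    for tid in range(num_threads):
        free[tid].acquire()
        slots[tid] = None
        start[tid].release()
    for t in threads:
        t.join()
    return output
```

With two workers, the first gets segments 1, 3, 5 and so on and the second gets 2, 4, 6, yet the writer still emits them in input order because it waits on the `done` semaphores in the same round-robin sequence.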

Looking at all this in another way, the synchronization approach can be viewed in simplified form as three concentric rings, with the processing flows as a set of radii converging to the center of the circle and intersecting the rings. The processing flow direction is inwards towards the center and all the tokens flow along the rings in one direction, for example clockwise (black arrows). The green curved arrows show signaling of the sync points to forward tokens. That is, when a processing flow reaches the writer sink ring it forwards the token it received at the dedupe ring to the next flow. The final sync point at the center completes the data write and forwards the token at the previous radius intersection point on the outermost ring. This approach ensures ordering and avoids races. To have maximum concurrency right from the beginning, all the sync points on the outermost ring get one-time-use tokens so all the initial processing can begin immediately. This is somewhat like priming a water pump.

[Diagram: parallel_flow_ring]

This flow allows overlapped operations to happen concurrently. In addition the dispatcher does a simple double buffering by reading the next segment into a spare buffer after signaling the current thread to start processing. A bit of concurrency can be lost when the writer thread is waiting for thread 1 and thread 2 has already completed. That situation typically arises at the end of a file where the last segment can be a small one. It can also arise if one segment cannot be deduplicated and the rest of the dedupe processing is aborted. However the impact of these cases is relatively small compared to the overall processing being done, so a lot of multi-core parallelism is effectively utilized in practice. Finally, a bunch of overheads in using specific data structures and/or parallel threading libraries are also avoided.