
Scaling Deduplication in Pcompress – Petascale in RAM

My core objective in the deduplication experiments with Pcompress has been to achieve large-scale data handling capability using tiny chunks (4KB and even 2KB). This has traditionally been extremely difficult, to the point of being nearly impractical. The algorithms in Pcompress can now achieve this without loss of performance, at least for archival data streams. Read on to find out more.

Approach

I have in the past alluded to an indexing algorithm in Pcompress that allows me to deduplicate large data using only a tiny memory-resident index. I finally found some time to write about it. The basic idea is quite simple and is not entirely new. What is new is the way certain components of the approach are implemented. The steps are simple:

  1. Break dataset into large segments, such that each segment is a collection of adjacent variable-length chunks as derived using a chunking algorithm.
  2. For each segment create a chunk-list file on disk which contains the list of chunk hashes.
  3. For each segment compute one or more similarity identifiers and store in an index.
  4. If one or more similarity identifiers for two segments match then we know that the segments have at least a few chunks in common. In that case load the chunk lists of the segments from disk and deduplicate the chunks.

If the segments are large enough then a relatively tiny similarity index can address large amounts of data. This scheme requires only one disk read per group of chunks, reducing lookup and access times. Also, each chunk list is a single file read sequentially, which improves read times. The chunk list files can also be stored on SSDs to speed things up further. The similarity index can be a RAM-resident hashtable, giving the fastest possible lookup times. Since each segment is a collection of chunks, I experimented with different chunk counts and found excellent results at 2048 chunks per segment. Segments are thus of variable length: for an average chunk size of 4KB we get 8MB segments on average, for 8KB chunks we get 16MB segments, and so on.
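To make the flow concrete, here is a minimal sketch of such a RAM-resident similarity index in C. This is not the actual Pcompress data structure; the names (sim_index_insert, sim_index_lookup), the table size and the chained collision handling are illustrative assumptions only.

/*
 * Minimal sketch of the segment similarity index described above.
 * NOT the actual Pcompress code; names and structures are hypothetical.
 */
#include <stdint.h>
#include <stdlib.h>

#define NUM_BUCKETS (1u << 20)    /* illustrative hash table size */

typedef struct seg_entry {
    uint64_t sim_hash;            /* one 64-bit similarity identifier        */
    uint64_t chunklist_offset;    /* where the segment's chunk list lives on disk/SSD */
    struct seg_entry *next;       /* chaining for collisions                 */
} seg_entry_t;

static seg_entry_t *buckets[NUM_BUCKETS];

/* Insert one similarity identifier for a segment into the in-memory index. */
void sim_index_insert(uint64_t sim_hash, uint64_t chunklist_offset)
{
    uint32_t b = (uint32_t)(sim_hash % NUM_BUCKETS);
    seg_entry_t *e = malloc(sizeof(*e));
    if (e == NULL)
        return;
    e->sim_hash = sim_hash;
    e->chunklist_offset = chunklist_offset;
    e->next = buckets[b];
    buckets[b] = e;
}

/*
 * Look up a similarity identifier of an incoming segment. A hit means the two
 * segments share at least a few chunks, so the caller loads the stored
 * segment's chunk list sequentially from disk and deduplicates chunk by chunk.
 */
seg_entry_t *sim_index_lookup(uint64_t sim_hash)
{
    uint32_t b = (uint32_t)(sim_hash % NUM_BUCKETS);
    for (seg_entry_t *e = buckets[b]; e != NULL; e = e->next) {
        if (e->sim_hash == sim_hash)
            return e;
    }
    return NULL;   /* no similar segment known; index the new segment instead */
}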

Many similarity-based deduplication approaches use files as the objects for checking similarity. However, this suffers from the problem that very large files can cause loss of accuracy and a poor deduplication ratio, while too many tiny files can bloat the similarity index. The approach described above avoids both extremes. A somewhat similar approach has been described in the SiLo scheme.

The old guard – MinHash

The effectiveness of the approach described above hinges on the effectiveness of the similarity indexing scheme. Here I based my implementation on the good old MinHash technique. This technique has been used commonly in data mining and web search to determine similar documents, detect plagiarism, perform clustering and so on.

Essentially we break up a document into a set of semantic pieces, compute multiple hashes per piece and identify a subset of pieces that possess the numerically lowest hash values. If such k-min subsets of two documents match then we can say that the documents are somewhat similar to each other. In this deduplication use case a document is nothing but a segment. Since each segment is a set of chunks, each semantic piece is a chunk as determined by the chunking algorithm. The question is what hashes to compute for each chunk in order to apply the MinHash algorithm? (See MinHash for Dummies for a layman’s explanation of MinHash).

After lots of experimentation and head scratching, it turns out that the cryptographic hashes computed per chunk can themselves be used, by truncating them. Truncated cryptographic hashes are themselves independent hashes. I took this idea to its logical extreme and split each cryptographic hash into a set of smaller hashes. These can be thought of as permutations of the hash function. I can then sort these hashes numerically and select the K lowest values to get my K-min-values sketch. If one or more elements of the KMV sketches of two segments match then the segments are deemed to have at least a few chunks in common. How far should a hash be truncated? It turns out that 64-bit pieces provide a good approximation. So if we are using, say, SHA-256, then we get 4 smaller hashes from each chunk hash. These are then numerically sorted and the 20 lowest unique values are chosen. Pictorially this approach can be depicted as below.

[Figure: deriving the K-min-values sketch from truncated chunk hashes]
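The sketch below shows, in simplified C, how a segment's K-min-values sketch could be derived from its chunks' 256-bit hashes as described above. The function name and fixed-size interface are hypothetical; the real Pcompress code is more involved.

/*
 * Illustrative computation of a segment's K-min-values sketch from the
 * chunks' crypto hashes. Simplified sketch, not the Pcompress implementation.
 */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SUBHASHES_PER_DIGEST 4   /* a 256-bit digest yields four 64-bit hashes */
#define KMV_K 20                 /* keep the 20 numerically lowest unique values */

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/*
 * digests: array of nchunks SHA-256 digests (32 bytes each).
 * sketch:  output array with room for at least KMV_K entries.
 * Returns the number of sketch values produced (up to KMV_K), or -1 on error.
 */
int compute_kmv_sketch(const uint8_t (*digests)[32], size_t nchunks, uint64_t *sketch)
{
    size_t total = nchunks * SUBHASHES_PER_DIGEST;
    uint64_t *vals = malloc(total * sizeof(uint64_t));
    if (vals == NULL)
        return -1;

    /* Split every 256-bit chunk hash into four 64-bit sub-hashes. */
    for (size_t i = 0; i < nchunks; i++)
        for (int j = 0; j < SUBHASHES_PER_DIGEST; j++)
            memcpy(&vals[i * SUBHASHES_PER_DIGEST + j], &digests[i][j * 8], 8);

    /* Sort numerically and pick the K lowest unique values. */
    qsort(vals, total, sizeof(uint64_t), cmp_u64);
    int k = 0;
    for (size_t i = 0; i < total && k < KMV_K; i++) {
        if (i == 0 || vals[i] != vals[i - 1])
            sketch[k++] = vals[i];
    }
    free(vals);
    return k;
}

If any of the returned sketch values of a new segment matches a value already in the similarity index, the two segments are treated as similar and their chunk lists are compared.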

The number 20 and other thresholds were arrived at experimentally, to give a good balance of memory usage vs deduplication effectiveness. Increasing beyond 20 hashes results in diminishing returns on dedupe effectiveness while linearly increasing memory consumption. The point of inflection actually comes at 16 hashes, below which dedupe effectiveness falls rapidly. The chart below shows the results of testing on a 360GB dataset that I have used on previous occasions.

[Chart: Duplicate Elimination Ratio vs. number of similarity hashes per segment, 360GB dataset]

Here, Duplicate Elimination Ratio = \frac{\text{input stream length}}{\text{total size of chunks stored} + \text{metadata size}}.

The interesting thing here is that this approach achieves 90% to 95% of the deduplication effectiveness of straightforward exact dedupe using a simple chunk index. The choice of the hash function (SHA2, SHA3, BLAKE2 etc.) has little bearing on the outcome. Here is another chart showing a comparison of this similarity-based deduplication (using various datasets and hash functions) with the exact deduplication baseline.

[Chart: Duplicate Elimination Ratio of similarity-based deduplication vs. the exact deduplication baseline, across datasets and hash functions]

Memory Usage

Let's consider the case of a 4KB average chunk size. Each segment contains approximately 2048 chunks, which gives us an 8MB average segment size. For each chunk we derive a 256-bit crypto hash, which yields four 64-bit similarity hashes per chunk. We select the 20 lowest-valued unique hashes per segment as the similarity indicators, so we need 160 bytes of similarity indicators per segment. In addition to this we need to store 64-bit pointers to the actual segment data on disk (the segment data is basically the chunk hash and data pointer list stored on disk), so the storage requirement is doubled to 320 bytes. The similarity indicators for a segment are kept in a hash table, so we need to consider some data structure overheads as well. Assuming 4 bytes of overhead per entry on average, we have a final storage requirement of 400 bytes per segment.

Now assuming 8MB segment size and one Petabyte of 100% random data where no duplicates exist (worst case), we would need 134217728 segments. This translates to about 50GB of worst-case memory usage. If we use an 8KB chunk size, then the calculations lead to a 25GB RAM requirement for the worst case. These memory requirements are not excessive by present-day standards, and typical data will have duplicates, so the RAM requirement comes down by the extent of duplicates present, especially when using the deduplicating storage architecture I had described earlier. If we limit data handling to, say, 500TB, then even smaller 2KB chunks can be used. These are practical resource limitations; the algorithm does not have any inherent limitations. If we use a well-endowed server with say 256GB or more of RAM then petascale data can be handled using 2KB chunks as well, though performance will of course be lower.
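For readers who want to verify the arithmetic, here is a tiny back-of-the-envelope calculation of the worst-case figure quoted above (1PB of unique data, 8MB segments, 400 bytes of index per segment). It is plain arithmetic, not Pcompress code.

/* Back-of-the-envelope check of the worst-case index size quoted above. */
#include <stdio.h>

int main(void)
{
    double petabyte      = 1125899906842624.0;  /* ~1 PB taken as 2^50 bytes        */
    double seg_size      = 8.0 * 1024 * 1024;   /* 8MB average segment (4KB chunks) */
    double bytes_per_seg = 20 * 8               /* 20 similarity hashes of 8 bytes  */
                         + 20 * 8               /* matching 64-bit segment pointers */
                         + 20 * 4;              /* assumed per-entry table overhead */

    double nsegs  = petabyte / seg_size;        /* number of segments               */
    double ram_gb = nsegs * bytes_per_seg / (1024.0 * 1024 * 1024);

    printf("segments: %.0f, index RAM: %.1f GB\n", nsegs, ram_gb);
    return 0;
}

Running this prints roughly 134 million segments and 50GB of index, matching the worst-case figure above.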

To the best of my knowledge there is as yet no dedupe product that can handle a petabyte of data using small 4KB chunks and an in-memory index. I may be wrong, so please add comments if you know of another such system. If you are looking at less than a Petabyte then even 2KB chunks are possible – virtually no one does this today.

This indexing scheme is actually implemented in Pcompress today and works quite well. It can reach a raw dedupe processing throughput of up to 500MB/s (discounting I/O) on my average laptop. Of course there are many other optimizations, both algorithmic and architectural, some of which I have posted about earlier. I am presently working on the persistent similarity index part, which would allow creating versioned incremental archives.


Scaling Deduplication – Sepaton’s Big Data Backup appliance

Came across this news piece on The Register dating back to October: http://www.theregister.co.uk/2013/10/16/sepatons_superduper_deduper/

It potentially handles up to 16PB of backup storage with Global Deduplication – great! The performance and features on paper are really top of the line and beat the competition in most respects. If I look at the deduplication features, a few things look interesting vis-a-vis those that I have put into Pcompress.

It mixes “Hash-based inline deduplication” and “Post-process, content-aware deduplication”. The article is not clear about what exactly this means. There are two possibilities. Firstly, it could detect duplicate files during ingestion, store only a single copy and then do block-level dedupe as a post-process. Secondly, it could deduplicate using large chunks during backup ingestion and then dedupe using small blocks as a post-process. This is of course done so as not to hurt backup performance while scaling to large datasets. Post-process deduplication is a common technique to scale deduplication without affecting the I/O throughput of in-flight data. It has also been used effectively in Windows Server 2012 to do primary data deduplication.

Sepaton can do analysis of data types and change rates in data to apply the most efficient dedupe mechanism or even skip dedupe for encrypted and compressed files that virtually do not deduplicate at all.

The other interesting features include byte-level deduplication for databases that store data in block sizes less than 8KB and using flash based SSDs to store the global index. I am not sure what this “byte-level deduplication” exactly means but it appears to be a delta-differencing mechanism. Now the question is how efficient restore can be when delta-differencing is used.

In some of the posts on Pcompress design I have already mentioned using SSDs for storing all kinds of metadata. Fast metadata access is critical and this is the logical choice. However, the other new aspect in Pcompress is the ability to use small blocks for deduplication without losing performance while still giving good scalability. This is a key feature that most of the current solutions seem to be missing. Pcompress can use blocks (or chunks) as small as 2KB without losing too much performance. With 2KB chunks it can potentially scale to 500TB of 100% random data using a 40GB in-memory global index. If the data has duplicates then the index becomes smaller. This deduplication occurs at 95% of the effectiveness of a brute-force dedupe using a full chunk index. This single capability solves a sticky problem that dedupe solutions have been dealing with for quite some time. The metadata structure that I have discussed in earlier posts also helps with overall performance. The approach is based on similarity detection of large regions in the data stream. The chunk lists of those regions are then sequentially loaded from SSD and compared to perform the actual deduplication. The similarity detection technique is simple and novel. It avoids any kind of complicated math, fuzzy hashing etc. I will detail it later.

There are other unique techniques in Pcompress like a partially vectorized rolling hash, scanning less than 30% of the data to locate chunk boundaries, parallelized deduplication and others that contribute to the overall performance. I have posted about a few of these earlier.

In addition to the above, the recent zip-like archiver capabilities that I have added to Pcompress introduce data type detection and automatic selection of filters and compression techniques to suit the data type. However the big missing piece in all this is that Pcompress is still a stand-alone utility. It needs work to turn it into an archival store where data can be ingested incrementally and selective data streams extracted for restore. Also, efficient partitioned indexing is needed to scale deduplication in a cluster without losing deduplication ratio.

On Nimble Storage Compression

I recently came across this old blog by Nimble storage co-founder Umesh Maheshwari: http://www.nimblestorage.com/blog/technology/better-than-dedupe-unduped/

The post has intriguing views on inline compression vs dedupe, and the approach of using snapshots on primary storage as a short-term backup mechanism is quite cool. COW snapshots by definition avoid duplicates the instant they are created. In addition, consecutive snapshots will share common blocks with no special effort. This aspect is not much different from ZFS, which has the same real-time compression and snapshot features, and the same benefits apply there as well.

However, it is the conclusions and the graphs that I humbly find to be missing some points and even misleading to an extent. Deduplication does not only remove duplicates from successive backups; it can also remove internal duplicates within the volume, which in turn helps compression. In all my tests with Pcompress I have found deduplication providing additional gain when added to standard compression. See the entry for Pcompress here: http://www.mattmahoney.net/dc/10gb.html. ZFS for that matter provides deduplication as well, though it has scalability bottlenecks.

While inline compression is cool, compression works within a limited window, and the window size varies with the algorithm. For example Zlib uses a 32KB window, Bzip2 uses 900KB max, and LZMA can use a gigabyte window size. Repeating patterns or redundancy within the window get compressed. Deduplication, in contrast, finds duplicate blocks within the entire dataset of gigabytes to hundreds of terabytes. There is no theoretical window size limitation (though practical scaling considerations can put a limit). So I really cannot accept that snapshotting + inline compression alone will be superior. Deduplication + snapshotting + inline compression will provide greater capacity optimization. For example see the section on “Compression” here: http://www.ciosolutions.com/Nimble+Storage+vs+Netapp+-+CASL+WAFL

Now if Post-Process Variable-Block Similarity based deduplication (of the type I added into Pcompress) can be added to ZFS, things will be very interesting.

Compression Benchmarks

I spent the last few days running some benchmarks to see how Pcompress in its current incarnation fares in terms of compression performance and effectiveness. This was needed given the extent of changes that have happened. I have tested only the compression performance here. I still need to check out how decompression and crypto perform.

My test system is my laptop, so the absolute numbers do not make much sense; it is only the relative numbers comparing different utilities and algorithms that are interesting. My laptop config: Intel Core i5 M430 @ 2.27 GHz (2011 model), 8GB RAM, 5400 RPM HDD (Hitachi HTS725050A9A364). It is an HP Pavilion Dv6 laptop dating back to Oct 2011. My desktop is a 2006-model Athlon 64 X2 with 4GB RAM, so I decided to test on the more capable laptop. I used 4 different data sets: the Linux 3.6 RC2 tarball, the enwik9 corpus, the glibc 2.12.1 tarball, and a tarball containing CentOS 6.0 and 6.2 vmdk images. Except for the VMware images these are mostly different kinds of text, since this is a traditional compression test.

Except where mentioned all the pcompress tests used a 256MB compression chunk size. Files bigger than 256MB were compressed using multiple threads. Only the Glibc tarball is 102MB in size and a single pcompress thread was used. Pcompress detects if the segment size is more than the file size and reduces it to fit the whole file in a single segment. The actual commands that were executed are listed in the leftmost graph column. The magenta bar shows original and compressed sizes in megabytes while the pink lines denote time taken in seconds.

Pcompress version used was 0.9.1, standard build. The standard utilities were used as distributed with Fedora 16. Some may have reservations that the same compiler flags were not used to rebuild everything; in that case take the numbers with a grain of salt. I will rerun the standard utility tests with custom builds when I get time in the future. I have also missed out other parallel tools like Pigz and Pbzip2; I will post those comparisons shortly for the Linux tarball.

Finally, except where mentioned, Pcompress was always run with Chunk-level Deduplication (-D) and LZ-Prediction (-L) enabled. Identity deduplication always adds value. The VMDK image dataset includes a comparison of compression ratio with and without deduplication.

Linux 3.6 Git clone Tarball

Of course the Lzma algorithm implementations give the best compression ratio but they also take more time. As usual the Xz utility performs quite badly and I keep wondering why anyone would use it in practice. I wonder if the XZ-Utils folks are considering performance aspects at all. The parallelized Pxz is a lot better, though I missed out running Pxz in extreme mode in this test. I will fill this gap shortly.

What is most interesting, however, is the performance and compression ratio achieved by Pcompress. Even with extreme settings, parallel Pcompress using lzmaMt (lzma multi-thread) is faster than non-extreme parallel Pxz. Part of the speed comes from the low-level optimizations (like SSE) I have done to the lzma implementation. The original 7-Zip utility performs quite badly with extreme settings (-mx9 -m0fb=128 -m0d=256M) since it is not fully parallelized. The PPMd algorithm in Pcompress is by far the rock star in this test: it gives the best compression ratio while still having awesome speed. PPMd works excellently for text files. The LZFX algorithm is blindingly fast with a decent compression ratio.

Pcompress is run with Deduplication (-D) and with and without Delta Compression (-E). Delta compression at this scale does not seem to add much value.

Glibc 2.12.1 Release Tarball

This test mostly shows single-threaded behavior. Since the glibc tarball is about 102MB in size and given the Pcompress compression chunk size of 256MB, it fits into a single chunk. So for the Pcompress tests only lzmaMt is multithreaded. That actually uses multiple threads for the match-finder. As you will notice, extreme level Pcompress with lzmaMt is significantly faster than both Pxz and 7-zip with extreme settings (-mx9 -m0fb=128 -m0d=256M).

Once again Delta compression at this scale did not add much value.

The Enwik9 Corpus

This test uses the enwik9 corpus (English Wikipedia initial segment, 2009) from http://mattmahoney.net/dc/textdata.html. This is XML text data. Interestingly, with a 256MB chunk size, Libbsc (BWT with Quantized Local Frequency Coding and dual LZ-Prediction) gives the best results on XML data, even compared to Lzma and PPMd.

So I wanted to see how much better libbsc would work if I put the entire enwik9 file into a single chunk of 1GB size, so I ran ‘pcompress -D -c libbsc -l 14 -L -s 1g’ to check. Indeed this produces a smaller 156.3MB (163870946 bytes) file, taking 233.8 seconds. For reference, the older Enwik8 corpus compresses to 20785844 bytes using the same command. This puts us at the 15th position on Matt Mahoney’s Large Text Compression Benchmark, slightly above the original BSC. I have not made any changes to libbsc; Pcompress just links to it as an external dependency. However I have copied the LZP portion and tweaked it a bit. I believe that this result can be improved by tweaking the rolling hash function to get actual dedupe out of XML data. At the moment deduplication (-D) is having no impact on Enwik9 as it is not finding any duplicates. You can build Pcompress from source configured with the ‘--enable-debug-stats’ option to check.

Tarball of CentOS 6 and CentOS 6.2 VMDK Images

For this test I did not try all the variations and specifically left out Lzma. For such large multi-gigabyte datasets there are very few practical use cases where one will spend hours using slow high-compression algorithms with lower ROI. The objective rather is to balance compression effectiveness with practical speed.

The distinct benefits of parallelism are clearly visible from these results. With multi-gigabyte data, using multiple threads far outstrips the traditional utilities. Deduplication also ensures that we get smaller compressed files; look at the standalone bzip2 result versus the pcompress bzip2 result with deduplication. Since this is in-chunk dedupe and not global dedupe, the effect of using larger chunks is also shown.

These results are quite interesting and put Pcompress at the top of the heap of general data compression utilities, along with Pigz, Pbzip2 etc. However it also provides stronger data integrity checks using SKEIN 256/512 and SHA 256/512 checksums while still maintaining performance.

 

What next for Pcompress

With the current release of 0.9.1, Pcompress now comes with a whole bunch of features and capabilities that are close to being a logically complete set for a stand-alone data compression utility. It will be hard to find another open-source utility that has the same combination of features and capabilities in a single tool. There are a few other things that can be added from a single-utility point of view, the biggest being a GUI and possibly portability to Windows. However, beyond all this I have been looking at extending Pcompress into a more comprehensive archival mechanism. Not necessarily a full backup solution though; for that we have excellent software like Bacula and Amanda. However it can be an add-on for those, providing an archival data store with capabilities that they lack.

I am listing out some of the thoughts and new ideas that have been crawling inside my cranium for the past few weeks.

Global Deduplication

Pcompress does deduplication but only at the level of individual chunks. It is a logical next step to evolve it into a complete deduplication archival store on the lines of commercial products like Data Domain, Sepaton, Permabit Albireo, HP StoreOnce etc. There are big challenges in this area, the biggest being the disk bottleneck arising out of index lookups for chunk hashes. This becomes increasingly a problem as we scale into hundreds of TB and to petascale. Then come efficient storage of chunks, efficient retrieval of archived data, encryption of deduplicated data, source-side dedupe to reduce network traffic etc. The industry has come up with a variety of solutions and optimizations but nothing significantly comparable is available in the open-source domain.

A couple of the open-source solutions are SDFS and LessFS. I have tried both and did not have much success with LessFS, as it would hang every now and then on my Fedora 16 laptop; I am sure I am doing something wrong in setting it up. SDFS, however, worked reliably and fast. Both of these are in-line dedupe filesystems that work on fixed-block boundaries. In addition, backup solutions like Bacula and Amanda only do whole-file dedupe or, in other words, single-instance storage.

For Pcompress I am looking at providing more advanced capabilities. Pcompress already has sliding-window content-aware chunking capability. The challenge is to design a scalable deduplicated archival store that can handle Petabytes. To this effect I have been reading up on a bunch of research papers and capabilities touted by various products and have formed some detailed design ideas. Some of the features required for this capability include:

  1. Sparse Indexing: We need some approximate matching capability to probabilistically identify duplicate and unique chunks with low memory overhead. Data Domain, for example, uses a Bloom Filter that they call a Summary Vector to quickly identify unique chunks (a generic sketch of such a filter appears after this list), along with other capabilities like locality-based caching. Other approaches include Albireo from Permabit, and SILT. In that respect the following paper seems very interesting: https://code.google.com/p/binarywarriors/downloads/detail?name=INLINE_DEDUP.pdf&can=2&q= . Even more interesting is this HP paper that reflects a more polished form of that work: http://static.usenix.org/events/fast09/tech/full_papers/lillibridge/lillibridge.pdf . It is essentially a locality-based grouping of chunks along with stratified sampling to do group-level approximate matching on a smaller index. Locality-based grouping is very helpful, but instead of doing stratified sampling I have a different idea of using similarity matching at various confidence levels to find related groups. My idea will enable deduplicating 1PB of data requiring just 14GB of memory for the index.
  2. Source side dedupe: Doing chunking and hashing at the source side can reduce load on the backup server and not sending chunks already present on the server reduces network bandwidth drastically. In addition the archive data itself will have many duplicates inside it even before comparing with the previous backup archive. So that will help as well. One extra thing I am looking at here is for the server to have a configurable capability to re-compute the chunk hashes and verify. This is an extra security mechanism that prevents malicious uploading of garbage chunks with invalid hashes. This of course puts extra load on the server but current gen Sandy Bridge or Bulldozers are solidly up to the task.
  3. Forward Referencing: Typically, as backup archives keep entering the deduplication data store, their chunks keep pointing to older chunks that are duplicates. The first archive will be stored in full (after in-archive dedupe) and chunks in subsequent archives keep pointing backwards to duplicate chunks in older archives. So restoring from the newest backup will have lower performance due to the numerous random reads it will have to do to read those referenced chunks from older archives. Older archives will restore faster! In some documents this is called re-hydration. Logically we want the reverse to happen. If we can make chunks in older archives point to duplicate chunks in newer archives then the newest archive can be stored sequentially in its logical chunk order. However doing this in-line is not practical because many references need to be updated, so this is typically done with post-process dedupe and Sepaton is the only vendor doing this. This however precludes the use of source-side dedupe. I have given this a lot of thought and come up with an alternative that can give the benefits of normal in-line and source-side dedupe while also providing forward-referencing capability.
  4. Security Considerations: Suffice it to say that this paper has a comprehensive coverage on this topic: http://www.ssrc.ucsc.edu/Papers/storer-storagess08.pdf .
  5. On-disk Container-Based storage mechanism: Storing chunks on the disk is another can of worms. Typical dedupe products implement a mini-filesystem for this. Storing individual groups of chunks in individual files is too expensive as there will be too many of them and filesystem metadata-lookup overheads will have significant impact. Chunk groups or segments can be further grouped into larger containers to solve this. I do not intend to implement a low-level mini-filesystem. Just use existing scalable filesystems like XFS.
  6. Replication: This is obviously a very important need. Especially with deduplicated data the loss of a single chunk invalidates all references to it and corrupts a whole host of files and archives. So replication is mandatorily needed as a protection against this. Instead of trying to build something I am looking at leveraging existing software like the excellent GlusterFS.
  7. GPGPU Acceleration: Chunking and hashing can be sped up by leveraging GPGPUs. This paper provides the background: http://static.usenix.org/events/fast/tech/full_papers/Bhatotia2-5-12.pdf . Very interestingly, Keccak has been ported to work optimally on Nvidia CUDA: https://sites.google.com/site/keccaktreegpu/ .
  8. Clustering: Distributing deduplication storage tasks across a cluster of computers can help greatly with scalability as opposed to a single appliance approach. It is possible to distribute chunks or segments across a cluster of computers based on their hash values. SDFS uses this technique. This however affects dedupe ratio.
  9. Integration with backup software: Neither Bacula nor Amanda includes variable-block dedupe capability.
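As mentioned in item 1, products like Data Domain use a Bloom filter (their "Summary Vector") to cheaply rule out chunks that are definitely new. The following is a generic, textbook-style sketch of such a filter in C; it is not the Pcompress or Data Domain implementation, and the table size and hash slicing are illustrative only. A negative answer means the chunk is certainly new; a positive answer means it is probably a duplicate and the real on-disk/SSD index must be consulted.

/* Generic Bloom-filter "summary vector" sketch; sizes are illustrative only. */
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS   (1ull << 27)   /* 2^27 bits = a 16 MiB table, for illustration */
#define BLOOM_HASHES 4

static uint8_t bloom[BLOOM_BITS / 8];

/* Derive the i-th bit position from a chunk's crypto hash by slicing it. */
static uint64_t bloom_pos(const uint8_t *digest, int i)
{
    uint64_t v;
    memcpy(&v, digest + i * 8, 8);   /* assumes a digest of at least 32 bytes */
    return v % BLOOM_BITS;
}

/* Record a chunk hash in the summary vector. */
void bloom_add(const uint8_t *digest)
{
    for (int i = 0; i < BLOOM_HASHES; i++) {
        uint64_t p = bloom_pos(digest, i);
        bloom[p / 8] |= (uint8_t)(1u << (p % 8));
    }
}

/* Returns 0 if the chunk is certainly not in the store, 1 if it might be. */
int bloom_maybe_present(const uint8_t *digest)
{
    for (int i = 0; i < BLOOM_HASHES; i++) {
        uint64_t p = bloom_pos(digest, i);
        if (!(bloom[p / 8] & (uint8_t)(1u << (p % 8))))
            return 0;
    }
    return 1;
}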

Miscellaneous Features

  1. GUI: Everybody likes a GUI … everybody likes a GUI … well, except for Unix CLI freaks (myself included).
  2. Windows Port: This will broaden the usage base but requires having a GUI first.
  3. Reorder Files: Sort files based on similarity, size and filename. This should give better dedupe and compression results.
  4. Backup to online storage: Amazon S3, Glacier, DropBox etc should be supported as back-end storage targets.

Parallel Deduplication in Pcompress

I have recently added deduplication support to the pcompress utility. Deduplication happens at the chunk level and is therefore done in parallel. Compression is done after deduplication, so dedup removes duplicates based on relatively large blocks while compression works on byte-level similarities. There is some interaction between dedup block size, break pattern and the compression algorithm used. In particular, the dedup block size can affect compression ratio depending on the compression algorithm's window size. Since, for example, LZMA works on large windows of up to 128MB, using a dedup block size smaller than 4K appears to hurt LZMA's compression ratio. For other algorithms like Bzip2 or Zlib a block size of 2K actually helps.

So, based on some experimentation, I have selected different minimum block sizes and Rabin break patterns. The dedup metadata is kept as low as feasible and is typically 1% or less of the final compressed data size (dedup + compress). The dedup implementation adds very little computation time overhead to the overall process. The general logic is as follows:

  1. Use a 31-byte Rabin sliding window to create a list of content-aware blocks. Minimum length is 2K-4K while max is kept at 128K.
  2. Also compute a CRC64 checksum for each block.
  3. Sort the block pointers based on the checksum value. This brings all identical (or colliding) blocks together; hash collisions are of course possible.
  4. Scan the block pointers, making runs of identical block entries point back to the first one. Both length and exact byte-level comparisons are done using memcmp(), which rules out data corruption due to hash collisions. (A rough sketch of steps 3 and 4 appears after this list.)
  5. Do a second pass through the block entries, this time building another block entry table. Each entry in the table is a 32-bit unsigned integer which is either the length of the data block at the corresponding offset in the byte stream, or an index pointing to another block if this is a duplicate. The high-order bit is used as a flag to distinguish between the two. So the max index number or max length of a single block can be 2^31. This max block length is important since block merging is done in the next step. A max index count of 2^31 means that, using the smallest block size of 2K, a single chunk can span up to 4TB. Of course we will be very far from that in practice.
  6. In the second pass also merge consecutive runs of non-duplicate block entries into a single entry. Up to 2GB of consecutive blocks can be merged. This has the effect of drastically reducing dedup metadata. Since this also changes the block indexes, all index pointers from duplicate blocks must also be updated. A translation table is built in this step to simplify this process.
  7. A third and final pass is made over the reduced index table to copy all data blocks into the target buffer while updating the duplicate index entries at the same time from the translation table.
  8. Compress the final dedup index using LZMA and the remaining data blocks using whatever compression the user specified.
  9. Additional memory usage for doing dedup is kept under control by using space in the target buffer for some of the intermediate arrays.
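The following is a rough C sketch of steps 3 and 4 above: sort the block descriptors by checksum, then confirm candidate duplicates with an exact memcmp() before linking them to an earlier identical block. Structures and names are hypothetical simplifications, not the actual Pcompress code.

/* Rough sketch of steps 3 and 4: sort by checksum, verify with memcmp(). */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint64_t cksum;     /* CRC64 of the block                         */
    uint64_t offset;    /* offset of the block in the chunk buffer    */
    uint32_t length;    /* block length in bytes                      */
    int64_t  dup_of;    /* index of an identical earlier block, or -1 */
} blk_t;

static int cmp_blk(const void *a, const void *b)
{
    const blk_t *x = a, *y = b;
    return (x->cksum > y->cksum) - (x->cksum < y->cksum);
}

void dedupe_blocks(uint8_t *buf, blk_t *blks, size_t nblks)
{
    /* Step 3: bring identical (and colliding) checksums together. */
    qsort(blks, nblks, sizeof(blk_t), cmp_blk);

    /* Step 4: within each run of equal checksums, verify byte-for-byte. */
    for (size_t i = 0; i < nblks; i++) {
        blks[i].dup_of = -1;
        for (size_t j = i; j > 0; j--) {
            blk_t *prev = &blks[j - 1];
            if (prev->cksum != blks[i].cksum)
                break;                              /* left the run of equal checksums */
            if (prev->dup_of == -1 &&
                prev->length == blks[i].length &&
                memcmp(buf + prev->offset, buf + blks[i].offset,
                       blks[i].length) == 0) {
                blks[i].dup_of = (int64_t)(j - 1);  /* point back to the earlier copy */
                break;
            }
        }
    }
}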

This deduplication scheme is simple and fast. Rabin computations are all done using shifts and additions, avoiding costly multiplication and modulus operations. The byte-level data scanning provides an opportunity to discover characteristics of the data that can be leveraged to improve compression later. I will be looking at this going forward.
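To illustrate the shift-and-add idea (without claiming to reproduce the actual, much stronger Pcompress rolling hash), here is a simplified content-defined chunker that uses only shifts, additions and a bitmask test to find block boundaries. The 31-byte window and the 4K/128K block limits follow the values mentioned in the list above; the hash itself and the break mask are toy placeholders.

/* Simplified shift-and-add rolling hash with content-defined boundaries. */
#include <stdint.h>
#include <stddef.h>

#define WINDOW     31                /* 31-byte sliding window, as above       */
#define MIN_BLOCK  (4 * 1024)
#define MAX_BLOCK  (128 * 1024)
#define BREAK_MASK 0x1FFF            /* ~8KB average block for a well-mixed hash */
#define BREAK_PATT 0x1FFF

/* Scan buf[] and return the length of the next content-defined block. */
size_t next_block_len(const uint8_t *buf, size_t len)
{
    uint64_t hash = 0;
    size_t i;

    if (len <= MIN_BLOCK)
        return len;

    for (i = 0; i < len && i < MAX_BLOCK; i++) {
        /* Slide the window: add the incoming byte, drop the byte that leaves. */
        hash = (hash << 1) + buf[i];
        if (i >= WINDOW)
            hash -= (uint64_t)buf[i - WINDOW] << WINDOW;

        /* Declare a boundary when the low bits match the break pattern. */
        if (i >= MIN_BLOCK && (hash & BREAK_MASK) == BREAK_PATT)
            return i + 1;
    }
    return i;   /* hit MAX_BLOCK (or end of data) without a natural boundary */
}

Only shifts, adds and subtracts are involved in the rolling update; the subtraction exactly removes the contribution of the byte leaving the 31-byte window, so no modulus is needed.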

I am observing good results combining dedup and compression, and the ability to parallelize speeds up both. Dedup metadata is kept at a minimum using a block-merging technique. Finally, the compression dictionaries for the data are not polluted, since the dedup index is compressed separately. I will be posting updated measurements soon. As usual, source code is available here: https://github.com/moinakg/pcompress.