Tag Archives: Pcompress

Posts related to the Pcompress utility.

Scaling Deduplication in Pcompress – Petascale in RAM

My core objective with the deduplication experiments in Pcompress has been to achieve large-scale data handling capability using tiny chunks (4KB and even 2KB). This has traditionally been extremely difficult, to the point of being nearly impractical. The algorithms in Pcompress can now achieve this without loss of performance, at least for archival data streams. Read on to find out more.

Approach

I have in the past alluded to an indexing algorithm in Pcompress that allows me to deduplicate large data using only a tiny memory-resident index. I finally found some time to write about it. The basic idea is quite simple and is not entirely new. What is new is the way certain components of the approach are implemented. The steps are simple:

  1. Break dataset into large segments, such that each segment is a collection of adjacent variable-length chunks as derived using a chunking algorithm.
  2. For each segment create a chunk-list file on disk which contains the list of chunk hashes.
  3. For each segment compute one or more similarity identifiers and store in an index.
  4. If one or more similarity identifiers for two segments match then we know that the segments have at least a few chunks in common. In that case load the chunk lists of the segments from disk and deduplicate the chunks.

If the segments are large enough then a relatively tiny similarity index can address large amounts of data. This scheme requires only one disk read per batch of chunks, reducing lookup and access times. Also, each chunk list is a single file that is read sequentially, improving read times. The chunk list files can also be stored on SSDs to speed things up even more. The similarity index can be a RAM-resident hashtable giving the fastest possible lookup times. Since each segment is a collection of chunks, I experimented with different chunk counts and found excellent results with 2048 chunks per segment. Thus segments are of variable length: for an average chunk size of 4KB we get 8MB segments on average, for 8KB chunks we get 16MB segments and so on.
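
To make the scheme concrete, here is a minimal sketch of the in-memory side of the index. The structure layout and names are illustrative assumptions for this post, not the actual Pcompress data structures.

```c
/*
 * Illustrative sketch of the segmented similarity index. Only the small
 * similarity identifiers live in RAM; each segment's full chunk-hash list
 * stays in a file on disk (or SSD).
 */
#include <stdint.h>

#define SIM_IDS_PER_SEGMENT 20   /* K lowest 64-bit hash values per segment */

typedef struct sim_entry {
    uint64_t sim_id;          /* similarity identifier (hashtable key)        */
    uint64_t chunklist_off;   /* where this segment's chunk list sits on disk */
    struct sim_entry *next;   /* hashtable chaining                           */
} sim_entry_t;

/*
 * Lookup flow for a new segment:
 *   1. compute its similarity identifiers (see the MinHash discussion below);
 *   2. probe the in-memory hashtable with each identifier;
 *   3. on a match, read the matching segment's chunk list sequentially from
 *      disk and deduplicate the new segment's chunks against it;
 *   4. insert the new segment's identifiers and append its chunk list to disk.
 */
```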

Many similarity based deduplication approaches use whole files as the objects for checking similarity. However this suffers from the problem that very large files can cause loss of accuracy and a poor deduplication ratio. Conversely, too many tiny files can bloat the similarity index. The approach described above avoids these extremes. A somewhat similar approach has been described in the SiLo scheme.

The old guard – MinHash

The effectiveness of the approach described above hinges on the effectiveness of the similarity indexing scheme. Here I based my implementation on the good old MinHash technique. This technique has been used commonly in data mining and web search to determine similar documents, detect plagiarism, perform clustering and so on.

Essentially we break up a document into a set of semantic pieces, compute multiple hashes per piece and identify a subset of pieces that possess the numerically lowest hash values. If such k-min subsets of two documents match then we can say that the documents are somewhat similar to each other. In this deduplication use case a document is nothing but a segment. Since each segment is a set of chunks, each semantic piece is a chunk as determined by the chunking algorithm. The question is: what hashes do we compute for each chunk in order to apply the MinHash algorithm? (See MinHash for Dummies for a layman’s explanation of MinHash.)

After lots of experimentation and head scratching it turns out that the cryptographic hashes computed per chunk can themselves be used via truncation. Truncated cryptographic hashes are themselves independent hashes. I took this idea to its logical extreme and split each cryptographic hash into a set of smaller hashes. These can be thought of as the permutations of the hash function. Then I can sort these hashes numerically and select the K lowest values to get my K-min-values sketch. If one or more elements of the KMV sketches of two segments match then the segments are deemed to have at least a few chunks in common. How far can a hash be truncated? It turns out that 64-bit hashes provide a good approximation. So if we are using, say, SHA-256, then we get four smaller hashes from each chunk hash. These are then numerically sorted and the 20 lowest unique values are chosen. Pictorially this approach can be depicted as below.
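
A minimal sketch of this computation, assuming 256-bit chunk hashes and a sketch size of 20, is shown below. The function and variable names are illustrative rather than the exact Pcompress code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SUBHASHES_PER_HASH  4   /* a 256-bit hash yields four 64-bit values */
#define KMV_SIZE            20  /* number of similarity indicators kept     */

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/*
 * chunk_hashes: nchunks digests, each 32 bytes (e.g. SHA-256)
 * sketch:       output array of up to KMV_SIZE 64-bit similarity indicators
 * Returns the number of indicators produced, or -1 on allocation failure.
 */
int compute_kmv_sketch(const uint8_t *chunk_hashes, size_t nchunks,
                       uint64_t *sketch)
{
    size_t total = nchunks * SUBHASHES_PER_HASH;
    uint64_t *vals = malloc(total * sizeof(uint64_t));
    size_t i, n = 0;

    if (vals == NULL)
        return -1;

    /* Split every 256-bit digest into four 64-bit sub-hashes. */
    for (i = 0; i < total; i++)
        memcpy(&vals[i], chunk_hashes + i * sizeof(uint64_t), sizeof(uint64_t));

    /* Sort numerically and pick the K lowest *unique* values. */
    qsort(vals, total, sizeof(uint64_t), cmp_u64);
    for (i = 0; i < total && n < KMV_SIZE; i++) {
        if (n == 0 || vals[i] != sketch[n - 1])
            sketch[n++] = vals[i];
    }
    free(vals);
    return (int)n;
}
```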

[Figure: splitting each chunk hash into 64-bit sub-hashes and selecting the K lowest unique values as the segment's similarity identifiers]

The number 20 and other thresholds were arrived at experimentally to give a good balance of memory usage vs deduplication effectiveness. Increasing beyond 20 hashes results in diminishing returns on dedupe effectiveness while linearly increasing memory consumption. The point of inflection actually comes at 16 hashes, below which dedupe effectiveness falls rapidly. The chart below shows the results of testing on a 360GB dataset that I have used on previous occasions.

[Chart: Duplicate Elimination Ratio vs number of similarity indicators, 360GB dataset]

Here, Duplicate Elimination Ratio = \frac{input\, stream\, length}{total\, size\, of\, chunks\, stored\, +\, metadata\, size}.

The interesting thing here is that this approach achieves 90% to 95% of the deduplication effectiveness of straightforward exact dedupe using a simple chunk index. The choice of the hash function (SHA2, SHA3, BLAKE2 etc.) has little bearing on the outcome. Here is another chart showing a comparison of this similarity based deduplication (using various datasets and hash functions) with the exact deduplication baseline.

[Chart: Duplicate Elimination Ratio of similarity based dedupe vs the exact dedupe baseline, across datasets and hash functions]

Memory Usage

Let's consider the case of 4KB average chunk size. Each segment contains approximately 2048 chunks, which gives us an 8MB average segment size. For each chunk we derive a 256-bit crypto hash, which yields four 64-bit similarity hashes per chunk. We select the 20 lowest-valued unique hashes per segment as the similarity indicators, so we need 160 bytes of similarity indicators per segment. In addition to this we need to store 64-bit pointers to the actual segment data on disk (the segment data is basically the chunk hash and data pointer list stored on disk), which doubles the storage requirement to 320 bytes. The similarity indicators for a segment are kept in a hash table, so we need to consider some data structure overheads as well. Assuming 4 bytes of overhead per indicator on average, we have a final storage requirement of roughly 400 bytes per segment.

Now assuming 8MB segment size and one petabyte of 100% random data where no duplicates exist (worst case), we would need 134217728 segments. This translates to about 50GB of worst-case memory usage. If we use 8KB chunks, the same calculation leads to a 25GB RAM requirement for the worst case. These memory values are not excessive by present day standards, and typically data will have duplicates, so the RAM requirement will come down in proportion to the duplicates present, especially when using the deduplicating storage architecture I had described earlier. If we limit data handling to, say, 500TB, then even smaller 2KB chunks can be used. These are practical resource limitations; the algorithm does not have any inherent limitations. If we use a well-endowed server with say 256GB or more of RAM then petascale data can be handled using 2KB chunks as well. Performance will be lower of course.
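
The arithmetic above can be reproduced with a few lines of C (purely illustrative, using the numbers discussed in this post):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t petabyte       = 1ULL << 50;   /* 1 PB of unique data (worst case) */
    uint64_t avg_chunk      = 4096;         /* 4KB average chunk size           */
    uint64_t chunks_per_seg = 2048;
    uint64_t seg_size       = avg_chunk * chunks_per_seg;  /* 8MB segments */
    uint64_t bytes_per_seg  = 400;          /* similarity IDs + pointers + overhead */

    uint64_t nsegments = petabyte / seg_size;               /* 134217728 */
    double   index_gb  = (double)(nsegments * bytes_per_seg) / (1ULL << 30);

    printf("segments: %llu, worst-case index size: %.1f GB\n",
           (unsigned long long)nsegments, index_gb);
    return 0;
}
```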

To the best of my knowledge there is as yet no dedupe product that can handle a petabyte of data using small 4KB chunks and an in-memory index. I may be wrong, so please add comments if you know of another such system. If you are looking at less than a Petabyte then even 2KB chunks are possible – virtually no one does this today.

This indexing scheme is actually implemented in Pcompress today and works quite well. It can reach a raw dedupe processing throughput of up to 500MB/s (discounting I/O) on my average laptop. Of course there are many other optimizations, both algorithmic and architectural, some of which I have posted about earlier. I am presently working on the persistent similarity index part, which would allow creating versioned incremental archives.

Scaling Deduplication – Sepaton’s Big Data Backup appliance

Came across this news piece on The Register dating back to October: http://www.theregister.co.uk/2013/10/16/sepatons_superduper_deduper/

Potentially handles up to 16PB of backup storage with Global Deduplication – great! The performance and features on paper are really top of the line, beating the competition on most counts. If I look at the deduplication features, a few things look interesting vis-a-vis those that I have put into Pcompress.

It mixes “Hash-based inline deduplication” and “Post-process, content-aware deduplication”. The article is not clear about what exactly this means, but there are two possibilities. First, it can detect duplicate files during ingestion, store only a single copy and then do block-level dedupe as a post-process. Second, it can deduplicate using large chunks during backup ingestion and then dedupe using small blocks as a post-process. This is of course done to avoid hurting backup performance while scaling to large datasets. Post-process deduplication is a common technique to scale deduplication without affecting the I/O throughput of in-flight data. It has also been used effectively in Windows Server 2012 to do primary data deduplication.

Sepaton can do analysis of data types and change rates in data to apply the most efficient dedupe mechanism or even skip dedupe for encrypted and compressed files that virtually do not deduplicate at all.

The other interesting features include byte-level deduplication for databases that store data in block sizes of less than 8KB, and the use of flash-based SSDs to store the global index. I am not sure what this “byte-level deduplication” exactly means, but it appears to be a delta-differencing mechanism. The question then is how efficient restores can be when delta differencing is used.

In some of the posts on Pcompress design I have already mentioned using SSDs for storing all kinds of metadata. Fast metadata access is critical and this is the logical choice. However, the other new aspect of Pcompress is the ability to use small blocks for deduplication without losing performance while still scaling well. This is a key feature that most of the current solutions seem to be missing. Pcompress can use blocks (or chunks) as small as 2KB without losing too much performance. With 2KB chunks it can potentially scale to 500TB of 100% random data using a 40GB in-memory global index. If the data has duplicates then the index size becomes smaller. This deduplication occurs at 95% of the efficiency of a full chunk-index, brute-force dedupe. This single capability solves a sticky problem that dedupe solutions have been dealing with for quite some time. The metadata structure that I have discussed in earlier posts also helps with overall performance. The approach is based on similarity detection of large regions in the data stream. The chunk lists of those regions are then sequentially loaded from SSD and compared to perform the actual deduplication. The similarity detection technique is simple and novel. It avoids any kind of complicated math, fuzzy hashing etc. I will detail it later.

There are other unique techniques in Pcompress like a partially vectorized rolling hash, scanning less than 30% of the data to locate chunk boundaries, parallelized deduplication and others that contribute to the overall performance. I have posted about a few of these earlier.

In addition to the above, the recent zip-like archiver capabilities that I have added to Pcompress introduce data type detection and automatic selection of filters and compression techniques to suit the data type. However, the big missing piece in all this is that Pcompress is still a stand-alone utility. It needs work to turn it into an archival store where data can be ingested incrementally and selective data streams extracted for restore. Also, efficient partitioned indexing is needed to scale deduplication across a cluster without losing deduplication ratio.

Pcompress gets archiving features

Despite a busy personal schedule over the last two months, I have managed to spend quite a bit of time adding archiving features to Pcompress. Thanks to the excellent LibArchive, Pcompress can now bundle up a bunch of files into a compressed archive. This is a desirable and useful capability that had been missing till now.

With the addition of archiving capability, Pcompress can now perform advanced detection of file data and tweak its compression behaviour to achieve the best results. Below is a short list of features and behaviours that the GitHub code has as of this writing:

  1. Pcompress enumerates the file list to be archived and sorts the files by extension/name and size using an incremental merge sort to minimize memory use. This sorting groups related files together and clusters small files to achieve the best compression and deduplication behaviour. For example, see this paper where a similar technique is discussed to improve deduplication: https://www.usenix.org/legacy/event/atc11/tech/final_files/Xia.pdf
  2. File types are detected via extension and/or file header parsing for magic numbers. Compression buffers are split at boundaries where files change from one type to another to avoid mixing unrelated files in a single compression buffer. It helps to improve compression ratio.
  3. More importantly, this file type detection is used to apply data-specific compression techniques more effectively, making the Adaptive modes in Pcompress extremely powerful. The following data specific algorithms are used:
    • LZMA – Most binary data.
    • PPMD – Most Textual data.
    • Libbsc – DNA Sequences, XML/HTML etc, BMP and TIFF images.
    • Dispack – Preprocess 32-bit x86 executable binaries.
    • PackJPG – Reduce JPEG size by up to 40%. This is a new lossless JPEG compression technique by Matthias Stirner.
    • Wavpack – Compress WAV files better than any regular compression technique. This is still a work in progress.
    • Detect already compressed files and for some heavily compressed data just use LZ4 to suppress some internal headers and zero padding. This avoids wasting time trying to compress data that is already heavily compressed.
    • There are other data specific filters around like MAFISC which I am looking at.
    • For Dispack, 32-bit x86 executables are detected and the data buffer is then split into 32K blocks. Some approximate instruction statistics are checked to determine whether to Dispack that block.
  4. Compression buffers are split either at hash-based or data type change based boundaries improving both compression and deduplication.
  5. LibArchive is used as the backend archiving library whose output is passed to the buffering, deduplication and compression stages in a logical pipeline. Synchronization is kept simple by using semaphores. LibArchive runs in a single thread and the data fetch from archiver to compression is also done at a single point. Thus there is exactly one producer and one consumer. This simplifies synchronization.
  6. To the extent possible, data copying is avoided. LibArchive’s callback routines are used to copy data directly into the compression buffers without resorting to pipes and such (a sketch of this callback approach follows this list).
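
Here is a rough sketch of how LibArchive's write callback can be pointed at a compression buffer. The buffer structure and the choice of the pax_restricted format are illustrative assumptions; the actual Pcompress pipeline hands full buffers to the dedupe/compression stage via semaphores.

```c
#include <archive.h>
#include <archive_entry.h>
#include <string.h>

/* Illustrative compression buffer that the archiver thread fills. */
typedef struct {
    unsigned char *buf;
    size_t         pos;
    size_t         size;
} comp_buffer_t;

/* LibArchive calls this whenever it has archive bytes to emit. */
static la_ssize_t
archive_write_cb(struct archive *a, void *client_data,
                 const void *data, size_t len)
{
    comp_buffer_t *cb = client_data;

    (void)a;
    /*
     * The real pipeline would hand a full buffer off to the dedupe and
     * compression stage (signalled via a semaphore) and start a new one.
     * This sketch simply assumes the buffer is large enough.
     */
    memcpy(cb->buf + cb->pos, data, len);
    cb->pos += len;
    return (la_ssize_t)len;
}

static struct archive *
open_archiver(comp_buffer_t *cb)
{
    struct archive *a = archive_write_new();

    archive_write_set_format_pax_restricted(a);
    /* No open/close callbacks are needed for this sketch. */
    archive_write_open(a, cb, NULL, archive_write_cb, NULL);
    return a;
}
```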

Filters like WavPack and PackJPG need to work with LibArchive. However, LibArchive does not support using external filter routines, so it took a while to work out how to pipeline external file filters before LibArchive. Note that since Pcompress uses a custom file format and consumes the output of LibArchive, there is no need for strict compatibility with standard archiver formats like Tar, Pax, Cpio etc. LibArchive, for its own requirements, obviously strives for strict conformance and allows no user-defined headers. So one of the big problems was to flag which files have been processed by a custom filter. One easy way was to add an extended attribute programmatically. However LibArchive does not provide a way to delete a single attribute during extraction. There is a call to clear all attributes! One does not want internal, programmatic-use attributes to be extracted to disk. I was stuck. Eventually it turned out that I could use contextual inference. A file preprocessor like PackJPG adds its own magic header to the file. Thus during archiving I can look for a JPEG header and only then pass the file through PackJPG. During extraction I can look for the PackJPG header.

However the question arises: what if I already have some PackJPG-processed files and am archiving them using Pcompress? Won’t they revert to normal JPEG during extraction even though I do not want them to? Well, the filename extension is also checked. During archiving, normal JPEGs are filtered but their extension remains jpg or jpeg. So only files having a JPEG extension but carrying a PackJPG header are unpacked during extraction. If you use the standalone PackJPG utility to pack your JPEGs, they will get a .pjg extension which will be untouched by the Pcompress filters during extraction. However, truly speaking, LibArchive needs to add a simple xattr deletion function to avoid all this jugglery.
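
The contextual inference can be sketched roughly as below. The JPEG SOI marker check is standard; the PackJPG magic value shown is a placeholder assumption and the real check must match whatever bytes the PackJPG library writes at the start of its output.

```c
#include <stdint.h>
#include <string.h>

/* JPEG files start with the SOI marker FF D8 FF. */
static int is_jpeg_header(const uint8_t *buf, size_t len)
{
    return len >= 3 && buf[0] == 0xFF && buf[1] == 0xD8 && buf[2] == 0xFF;
}

/* Placeholder magic; check the PackJPG source for the actual header bytes. */
static int is_packjpg_header(const uint8_t *buf, size_t len)
{
    static const uint8_t PACKJPG_MAGIC[2] = { 'J', 'S' };  /* assumed value */
    return len >= 2 && memcmp(buf, PACKJPG_MAGIC, 2) == 0;
}

/*
 * Contextual inference during extraction: only a file whose *name* says JPEG
 * but whose *contents* carry the PackJPG magic gets unpacked. A .pjg file
 * packed by the standalone utility is left untouched.
 */
static int should_unpack_jpeg(const char *ext, const uint8_t *buf, size_t len)
{
    int jpeg_ext = (strcmp(ext, "jpg") == 0 || strcmp(ext, "jpeg") == 0);
    return jpeg_ext && !is_jpeg_header(buf, len) && is_packjpg_header(buf, len);
}
```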

During archiving, file types are detected by a combination of filename extension and magic header inspection. To look up filename extensions one obviously needs a hashtable, but there is a bit of detail here. I have a predefined list of known filename extensions with their corresponding file types, so instead of using a general hash function I needed a perfect hash function. That is, the number of slots in the table is the number of keys and each known key maps to exactly one slot. An unknown key is easily detected by comparing it with the key stored at its slot, or by its slot number falling outside the table range. I used the old ‘Minimal Perfect Hashing’ technique courtesy of Bob Jenkins. It works nicely for fast hashing of filename extensions.
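
The lookup-and-verify pattern can be illustrated with a toy example. The real table and hash function are generated with Bob Jenkins' minimal perfect hashing code; the tiny hash below merely happens to be collision-free for the three sample keys shown.

```c
#include <string.h>

/*
 * Toy illustration of the perfect-hash lookup pattern. Each known key maps to
 * exactly one slot; unknown keys are rejected by comparing against the key
 * stored at the computed slot.
 */
typedef struct {
    const char *ext;
    int         ftype;   /* application-defined type id */
} ext_entry_t;

enum { TYPE_UNKNOWN = -1, TYPE_AUDIO = 1, TYPE_EXE = 2, TYPE_JPEG = 3 };

#define NUM_EXTS 3
static const ext_entry_t ext_table[NUM_EXTS] = {
    { "wav", TYPE_AUDIO },   /* slot 0 */
    { "exe", TYPE_EXE   },   /* slot 1 */
    { "jpg", TYPE_JPEG  },   /* slot 2 */
};

/* Stand-in for the generated minimal perfect hash function. */
static unsigned int toy_phash(const char *ext)
{
    return ((unsigned char)ext[0] + (unsigned char)ext[2]) % NUM_EXTS;
}

int lookup_file_type(const char *ext)
{
    unsigned int slot;

    if (strlen(ext) < 3)
        return TYPE_UNKNOWN;
    slot = toy_phash(ext);
    /* Unknown keys either fall outside the table or fail the key compare. */
    if (slot >= NUM_EXTS || strcmp(ext_table[slot].ext, ext) != 0)
        return TYPE_UNKNOWN;
    return ext_table[slot].ftype;
}
```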

The next item to do is to support multi-volume archives. This is quite easy to do since Pcompress already splits data into independent buffers, each with its own header. So a volume needs to contain a set of compressed buffers with some sequence indicator so that they can be correctly concatenated together to restore the original archive.

Persisting the In-Memory Hash Index in Pcompress

Pcompress uses an in-memory hash table to find similar regions of data, loads the chunk hashes of those regions and matches them to find exact chunk matches for deduplication. Currently this works on a single archive, producing a compressed file as the output. The index is not required during decompression and is discarded after use. An archival store, on the other hand, needs to deal with multiple archives and needs a persistent index to keep deduplicating across archives that get added to the store. So there is an archive server which receives data streams from clients. Clients are responsible for identifying the files to be backed up, rolling them into an archive format like tar and sending the data stream to the server. The archive server holds the aforementioned similarity index.

The similarity index in Pcompress is a hash table, so when looking at persistence one quickly thinks of the numerous NoSQL solutions: evaluate and benchmark one and use it. Even SQLite can be considered, as it is embeddable, fast and reliable – good enough for the use case at hand. After pondering this somewhat, it occurred to me that my approach in Pcompress is a scalable in-memory index. The key thing here is the “in-memory” piece. The design is centered around that and does not use an index that can overflow to disk. This in turn means that I do not require an on-disk index format. I just need to stash the records in a file and load them into an in-memory hash when the archive server daemon is started. So all I need is a simple flat file with a sequence of fixed-length records. When a new KV pair is added to the hash it is first appended to the flat file. This is almost like journaling, with the journal being the data store itself, and write I/O remains largely sequential. When keys are deleted they can be marked invalid in the flat file. A separate cleanup process (like Postgres VACUUM) can be used to eliminate the deleted records. Marking deleted records requires in-place updates, which is simple because records are of fixed length.

Thus I can dispense with a NoSQL library and keep things very simple and fast. This approach is similar to the Sparkey KV store from Spotify, the primary differences being that the hash table is not stored to disk and that deletes are supported. Of course, unlike Sparkey, I want robustness against data corruption. Firstly, I will use synchronous writes to the flat file. Secondly, the record will be inserted into the in-memory hash only after the disk write is successful.
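
A minimal sketch of this journal-style flat file is given below. The record layout and field names are illustrative assumptions; the important points are the fixed-length records, the synchronous append, and the in-place invalidation of deleted entries.

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Fixed-length on-disk record; the layout is illustrative. */
typedef struct {
    uint64_t sim_id;          /* similarity identifier (key)          */
    uint64_t chunklist_off;   /* pointer to the segment's chunk list  */
    uint32_t flags;           /* bit 0: 1 = valid, 0 = deleted        */
    uint32_t pad;
} index_record_t;

/*
 * Append a record with a synchronous write. Only if this succeeds does the
 * caller insert the key into the in-memory hashtable.
 */
int journal_append(int fd, const index_record_t *rec)
{
    if (write(fd, rec, sizeof(*rec)) != (ssize_t)sizeof(*rec))
        return -1;
    return fsync(fd);
}

/* Mark a record invalid in place; a VACUUM-style pass reclaims it later. */
int journal_delete(int fd, off_t rec_offset)
{
    index_record_t rec;

    if (pread(fd, &rec, sizeof(rec), rec_offset) != (ssize_t)sizeof(rec))
        return -1;
    rec.flags &= ~1u;
    if (pwrite(fd, &rec, sizeof(rec), rec_offset) != (ssize_t)sizeof(rec))
        return -1;
    return fsync(fd);
}
```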

Pcompress 2.3

I have been fairly busy over the past few months with multiple things demanding my attention at the same time, both at work and at home. In addition, I was doing a bunch of testing and analysis of the algorithms I am using in Pcompress, which resulted in some useful findings, improvements and fine-tuning of the parameters. I also wanted to test with a few hundred GBs worth of data and had to procure additional disks to perform testing within realistic time scales. The combination of a Stardom iTANK external enclosure and a WD Caviar Black 2TB drive over eSATA worked nicely, providing up to 137MB/s of sustained sequential read throughput. I have finally managed to pull together all the changes and put out a 2.3 release.

Among other things, I have made some changes to the KMV sketch computation and updated it to select unique similarity indicators. Earlier it was just selecting the lowest K indicators, some of which were duplicates. This also allowed me to reduce the number of IDs used for the similarity match, which in turn results in a smaller index while improving the similarity match at the same time. Data splitting between threads is also improved now, resulting in fewer unbalanced chunks. These improvements and some code cleanups have resulted in performance improvements as well. As of this release, the similarity based near-exact deduplication is quite a bit faster than exact deduplication using a simple chunk index, with up to 98% of the latter's effectiveness.

One of the changes in this release is to move all the functionality into a shared library and make the main executable a simple wrapper. This will allow me to add a programmatic API going forward. The release can be downloaded from here: http://code.google.com/p/pcompress/downloads/detail?name=pcompress-2.3.tar.bz2

High Performance Content Defined Chunking

In Pcompress, I have implemented a variant of the rolling hash based Content Defined Chunking that provides both deduplication accuracy and high performance. This post attempts to explain the chunking process, covers the chunking computations that are done in Pcompress and then talks about the new optimizations for very fast sliding window chunking (on the order of 500MB/s to 600MB/s throughput depending on processor).

Background

Data deduplication requires splitting a data stream into chunks and then searching for duplicate chunks. Once duplicates are found, only one copy of the duplicate is stored and the remaining chunks become references to that copy. The splitting of data into chunks appears to be an ordinary process but is crucial to finding duplicates effectively. The simplest approach is of course splitting data into fixed-size blocks. It is screaming fast, requiring virtually no processing, but it comes with the limitation of the data shifting problem.

The diagram below illustrates the problem. The two 64-character patterns are mostly similar with only two characters differing. Initially fixed-block chunking provides good duplicate detection. However the insertion of a single character at the beginning shifts the entire data while chunk boundaries are fixed. So no duplicates are found even though the patterns are mostly similar.

[Figure: static chunking]

The diagram shows insertion, but the same thing can happen for deletion. In general with static chunking duplicate detection is lost after the point where insertion or deletion has taken place.

In order to deal with this, most dedupe solutions use content defined chunking, which marks cut points based on patterns in the data. So if the data patterns shift, the cut points also shift with them. The diagram below illustrates.

[Figure: content defined chunking]

The chunks are split based on patterns in data so they are of variable length (but average size is close to the desired length). Since the chunk boundaries shift along with the data patterns, duplicates are still found. Only the modified chunks are unique.

The Rolling Hash computation

Now the question is: what data patterns should we look for when determining the chunk boundaries or cut points? The common technique is to compute a hash value of a few consecutive bytes at every byte position in the data stream. If the hash value matches a certain predefined pattern we can declare a chunk boundary at that position. To do this computation efficiently a technique called the rolling hash was devised. It uses a sliding window that scans over the data bytes and provides a hash value at each point. The hash value at position i can be cheaply computed from the hash at position i-1. In other words H(X_{(i,n)}) \Leftarrow (H(X_{(i-1,n)}) + X_i - X_{(i-n)}) \bmod M where ‘n’ is the window size and X_{(i,n)} represents the window bytes at byte position ‘i’. In mathematical terms this is a recurrence relation. Rolling hashes have been used in contexts like Rabin-Karp substring search and Rsync. Today they are used extensively for chunk splitting in the context of data deduplication.

One of the common rolling hashes used in Data Deduplication is Rabin Fingerprinting devised originally by Turing award winner Michael O. Rabin in his seminal paper titled “Fingerprinting By Random Polynomials“. The mathematically inclined will enjoy reading it. There are other rolling hash techniques such as the one used in Rsync, the TTTD algorithm devised by HP, the FBC algorithm etc.

[Figure: rolling hash sliding window]

While I am not so much of a mathematically inclined person, I still needed a good rolling hash in order to do content defined chunking in Pcompress. After looking at various implementations like the one in LBFS and a few other open-source software like n-gram hashing, I came up with an approach that worked well and produced average chunk sizes close to the desired value.

I used a small sliding window of 16 bytes that produces a 64-bit fingerprint at each byte position, requiring an addition, a subtraction, a multiplication and conditionally an XOR for each byte position. It declares a chunk boundary if the bottom Y bits of the fingerprint are zero. The value of Y depends on the average chunk size desired; for example, for a 4KB average size one would look for the bottom 12 bits to be zero. The core of the approach is derived from Rabin fingerprinting. A good description is here: http://www.infoarena.ro/blog/rolling-hash. The hashing approach is a multiplicative scheme of the form:

rollhash = (rollhash * PRIME + inbyte - outbyte * POW) \% MODULUS

Where inbyte is the incoming byte at the sliding window head, outbyte is the outgoing byte at the sliding window tail, and POW = (PRIME ^ {windowsize}) \% MODULUS. The PRIME number I am using is the same value used by Bulat Ziganshin in his SREP tool; experimentation showed it to produce good results. In addition to this I precompute a table using the irreducible polynomial (represented in GF(2)) from LBFS. The outbyte is used to index the table and the value is XOR-ed with the hash value to produce the final fingerprint. I did some analysis of the chunking approach which is documented in two earlier posts. The results were good.
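
Putting the above together, the per-byte update can be sketched as follows. The multiplier value and the zero-filled polynomial table are placeholders so the sketch compiles; Pcompress uses the SREP prime and a table precomputed from the LBFS polynomial, and keeps the window in an SSE register.

```c
#include <stdint.h>

#define WINDOW_SIZE 16
#define CHUNK_MASK  ((1u << 12) - 1)   /* bottom 12 bits zero => ~4KB average */

/* Illustrative multiplier; Pcompress uses the prime from Bulat Ziganshin's SREP. */
#define PRIME 0x100000001B3ULL

/* In Pcompress this table is precomputed from the LBFS irreducible polynomial
 * over GF(2); it is left zero-filled here just so the sketch compiles. */
static uint64_t poly_table[256];

typedef struct {
    uint8_t  window[WINDOW_SIZE];  /* kept in a 128-bit SSE register in Pcompress */
    int      pos;
    uint64_t rollhash;
    uint64_t pow;                  /* PRIME^WINDOW_SIZE (mod 2^64) */
} chunker_t;

static void chunker_init(chunker_t *c)
{
    int i;

    c->pos = 0;
    c->rollhash = 0;
    c->pow = 1;
    for (i = 0; i < WINDOW_SIZE; i++) {
        c->window[i] = 0;
        c->pow *= PRIME;           /* modulus 2^64 via natural wraparound */
    }
}

/* Slide the window by one byte and return the fingerprint at this position. */
static uint64_t roll_one_byte(chunker_t *c, uint8_t inbyte)
{
    uint8_t outbyte = c->window[c->pos];

    c->window[c->pos] = inbyte;
    c->pos = (c->pos + 1) % WINDOW_SIZE;

    /* rollhash = rollhash * PRIME + inbyte - outbyte * POW  (mod 2^64) */
    c->rollhash = c->rollhash * PRIME + inbyte - outbyte * c->pow;

    /* Final fingerprint: XOR with the polynomial table entry for the outgoing
     * byte. A cut point is declared when (fingerprint & CHUNK_MASK) == 0. */
    return c->rollhash ^ poly_table[outbyte];
}
```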

A window size of only 16 bytes will raise some eyebrows as typically much larger windows are used. LBFS for example used a 48-byte window and others have used even larger windows. However in practice, as is evident from the above analysis, this implementation does produce good results and the window size of 16 bytes allows an optimization as we will see below.

Optimizations

While addition and multiplication are extremely fast on modern processors, performance overheads remained. Even though I was using a small window of 16 bytes, it still required performing computations over the entire series of bytes in order to find cut points. This is computationally expensive compared to the simple splitting of data into fixed-size chunks. A couple of optimizations are immediately apparent from the above hash formula:

  • Since we are dealing with bytes it is possible to pre-compute a table for outbyte * POW
  • The MODULUS operation can be replaced with masking if it is a power of 2.

This gives some gains; however, the overhead of scanning the data and constantly updating a sliding window in memory remains. Eventually I implemented a couple of key new optimizations in Pcompress that made a significant difference:

  • Since the sliding window is just 16 bytes it is possible to keep it entirely in a 128-bit SSE register.
  • Since we have minimum and maximum limits for chunk sizes, it is possible to skip (minlength - small constant) bytes after a breakpoint is found and only then start scanning. This provides a significant improvement in performance by avoiding scanning the majority of the data stream.

Experimentation with different types of data shows that the second optimization results in scanning only 28% to 40% of the data; the rest is simply skipped. The minimum and maximum limits are used to retain a distribution of chunk sizes close to the average. Since rolling hash cut points below the minimum size are ignored, it does not make sense to scan that data.
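
Building on the rolling hash sketch from the previous section, the scan loop with the skip optimization can look roughly like this. The limits and the small constant are the ones discussed in this post; this is an illustrative sketch, not the actual Pcompress code.

```c
#include <stddef.h>
#include <stdint.h>

#define MIN_CHUNK  (3 * 1024)
#define MAX_CHUNK  (64 * 1024)
#define SKIP_SLACK 256            /* the "small constant" subtracted from minlength */

/* chunker_t, roll_one_byte() and CHUNK_MASK as in the rolling hash sketch above. */

/* Return the length of the next chunk starting at data[0]. */
size_t find_next_boundary(chunker_t *c, const uint8_t *data, size_t len)
{
    size_t i = 0;
    size_t limit = (len < MAX_CHUNK) ? len : MAX_CHUNK;

    /* Skip (minlength - small constant) bytes outright: a cut point cannot be
     * accepted there anyway, so the rolling hash is never computed over them. */
    if (limit > MIN_CHUNK - SKIP_SLACK)
        i = MIN_CHUNK - SKIP_SLACK;

    for (; i < limit; i++) {
        uint64_t fp = roll_one_byte(c, data[i]);

        if (i + 1 >= MIN_CHUNK && (fp & CHUNK_MASK) == 0)
            return i + 1;         /* breakpoint found */
    }
    return limit;                 /* hit max chunk size or end of buffer */
}
```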

[Figure: optimized content defined chunking]

All these optimizations combined provide an average chunking throughput of 530 MB/s per core on my 2nd generation Core i5 running at 2.2 GHz. Of course faster, more recent processors will produce better results. The throughput also depends on the nature of the data. If the data has a very specific pattern that causes more large chunks to be produced the performance degrades (Think why this should be the case). This brings us to the worst case behaviour.

Worst Case performance profile

The worst case performance profile of the optimized chunking approach occurs when all chunks produced are of the maximum size. That is, the data is such that no breakpoints are produced, resulting in degeneration to fixed-block chunking behaviour at the max chunk size of 64KB, with the added cost of the rolling hash computation overhead. In this case the majority of the data is scanned and computed, but how much?

If we assume minimum chunk size of 3KB, maximum 64KB and 100MB data we will have 100MB / 64KB = 1600 chunks (considering worst case all max-length chunks). For every chunk 3KB - small constant of data will be skipped. In my current implementation the value of small constant is 256, though it can be smaller. So the actual skipped size is 3072 - 256 = 2816 bytes. In total the number of skipped bytes will be 2816 * 1600 = 4505600 bytes out of 100MB data. In percentage terms it is 4505600 / 104857600 * 100 = 4.29\%. In other words 95% of the data will be scanned degrading the performance by more than half.

Now the question is what kind of data will produce this worst case behaviour. As seen in the rolling hash computation details above, the eventual fingerprint is computed via an XOR with a polynomial computation result from a table. Those values are non-zero, and we check for breakpoints based on the bottom 12 bits of the fingerprint being zero. So if the computed hash is zero, the XOR will set the bits and the bottom 12 bits will become non-zero. The hash will be zero if the data is zero. That is, if we have a file of only zero bytes we will hit the worst case.

I created a file filled with zero bytes and tested this, getting a throughput of 200 MB/s with all chunks at the max 64KB length. In real datasets zero-filled regions can happen, however very large entirely zero-filled files are uncommon, at least to my knowledge. One place where zero-filled regions occur is VMDK/VDI files. So I tested on a virtual harddisk file of a Fedora 18 installation in VirtualBox and still got a majority of 4KB chunks, with a small peak at 64KB. The throughput was 490 MB/s with approx 41% of the data being scanned. So even a virtual harddisk file has non-zero bytes inside, like formatting markers; it is rare to get hundreds of megabytes of only zero bytes. Finally, from an overall deduplication perspective such files will be deduplicated maximally with almost 98% data reduction, and the final compression stage will also be extremely fast (only zero bytes). So even though chunking suffers, overall deduplication will be fast.

Footnote

If you are interested to look at the implementation in Pcompress, it is here: https://github.com/moinakg/pcompress/blob/master/rabin/rabin_dedup.c#L598

Architecture for a Deduplicated Archival Store: Part 1

Requirements

Pcompress as it stands today is a powerful single-file lossless compression program that applies a variety of compression and data deduplication algorithms to effectively reduce the dataset size. However, as far as data deduplication goes, it can only apply the algorithms to a single dataset to remove internal duplicates. What is more useful is to be able to apply deduplication to remove common blocks across datasets to achieve even greater savings, especially in backup scenarios. This is why we see a slew of products in this space boasting of up to 90% reduction in backup storage requirements.

In the open source space we have filesystems like OpenDedup, Lessfs, S3QL, ZFS etc. that provide deduplication even for primary online storage. While that is a desirable feature in itself, this software lacks many of the advanced features of commercial products like Sepaton, HP StoreOnce or EMC DataDomain. Pcompress implements a bunch of those advanced algorithms today (I am writing a couple of papers on this), so it makes sense to extend the software into a proper scalable archival store for backup requirements. In this context it is worthwhile to take note of eXdupe, which provides deduplicated archival backup capabilities but is quite simplistic, providing only differential storage against a single initial backup dataset. It is much like a full backup followed by incremental backups; there is no real multi-file dedupe. One can only dedupe the latest backup data against the first non-differential backup data. It is not a scalable chunk store that can chunk any incoming dataset and store only the unique chunks.

If we look at open source backup software like Amanda or Bacula, none of them have block-level dedupe capability, let alone sliding-window variable-block chunking. So, in a nutshell, we can summarize the requirements as follows:

  1. A Deduplicated, Scalable Chunk Store that stores unique chunks and provides fast read access.
  2. The Chunk Store is meant for backups and archival storage and assumes immutable chunks. I am not looking at online primary storage in this case. However the system should support deletion of old datasets.
  3. It should be able to do inline dedupe. With inline dedupe we can do source side dedupe reducing the amount of backup data transferred over the network.
  4. Pcompress can potentially utilize all the cores on the system and this archival store should be no different.
  5. Metadata overhead should be kept to a minimum, and I will be using the segmented similarity based indexing so that the global index can fit in RAM.
  6. Data and metadata should be kept separate such that metadata can be located on high-speed storage like SSDs to speed up access. While this increases the number of separate disk accesses during restore, the effect can be reduced by locality sensitive caching in addition to SSDs.
  7. The system should of course be able to scale to petabytes.
  8. It should be possible to integrate the system with existing backup software like Amanda, Bacula etc. This is needed if we want to do source-side dedupe.
  9. There should be a chunk reference count with a maximum limit, to avoid too many datasets referencing the same chunk. The loss of a multiply-referenced chunk can corrupt multiple backups, so having an upper limit reduces the risk. In addition we need replication, but that is not in my charter at this time; filesystem replication/distribution can be used for the purpose. Software like DRBD can also be used.
  10. Another feature is to limit deduplication to the last X backup sets, much like a sliding window. This allows cleanly removing really old backups and prevents recent backups from referencing chunks in those old datasets.
  11. All this applies to archival storage on disk. Deduping backups onto tape is a different can of worms that I will probably look at later.

I plan to tackle these requirements in phases. For example, I will not initially look at source-side dedupe; rather, the initial focus will be to get a high-performance, stable backend. If you are wondering about some of the terms used here, take a look at the Wikipedia article for explanations.