I have been super busy for the last few months getting a new house completed and moving into it. It takes an incredible amount of effort. Me being the DIY type also invested some energy in being hands-on with Carpentry, Masonry, Plumbing, Electricals, Gardening and sundry other stuff, to the consternation of my family and my utter satisfaction and utter exhaustion! Anyway that being settled mostly, I am getting back to catching up on digital stuff.
One of the very interesting developments that caught me, was this post on the Register: http://www.theregister.co.uk/2014/01/31/bluedbm_fpga_storage/. A storage controller on a FPGA linked together with other such controllers on a low-latency bus. How cool that is in terms of distributed storage. From the outside, we get a high-performance distributed storage pool. Obviously flash is also there in the picture. Thinking further on this one can also consider this system acting as a front-end cache for a remote WAN-linked storage archive.
I am wondering, if they can implement some storage controller functionality using a FPGA, can it also be done using CUDA. In that case, we can use something like GPUDirect for low-latency Infiniband access from the GPU. This would avoid having to deal with custom silicon. From yet another perspective, it might be possible to do something similar with the Pluribus Server-Switch built from off-the-shelf components.
Came across this news piece on Register dated back to October: http://www.theregister.co.uk/2013/10/16/sepatons_superduper_deduper/
Potentially handles upto 16PB of backup storage with Global Deduplication – Great! The performance and features on paper are really top of the line. Beats the competition on most aspects. If I look at the deduplication features a few things look interesting vis-a-vis those that I have put into Pcompress.
It mixes “Hash-based inline deduplication” and “Post-process, content-aware deduplication”. The article is not clear what exactly this means. There are two possibilities. Firstly it can detect duplicate files during ingestion, store only a single copy and then do block-level dedupe as post-process. Secondly it can deduplicate using large chunks during backup ingestion and then dedupe using small blocks as post-process. This is of course to not hurt backup performance and scale to large datasets. Post-process deduplication is a common technique to scale deduplication without affecting I/O throughput of in-flight data. It has also been used effectively in Windows Server 2012 to do primary data deduplication.
Sepaton can do analysis of data types and change rates in data to apply the most efficient dedupe mechanism or even skip dedupe for encrypted and compressed files that virtually do not deduplicate at all.
The other interesting features include byte-level deduplication for databases that store data in block sizes less than 8KB and using flash based SSDs to store the global index. I am not sure what this “byte-level deduplication” exactly means but it appears to be a delta-differencing mechanism. Now the question is how efficient restore can be when delta-differencing is used.
In some of the posts on Pcompress design I have already mentioned about using SSDs for storing all kinds of metadata. Fast metadata access is critical and this is the logical choice. However the other “new aspect in Pcompress is the ability to use small blocks for deduplication without losing performance and giving good scalability“. This is a key feature that most of the current solutions seem to be missing. Pcompress can use blocks (or chunks) as small as 2KB without losing too much performance. With 2KB chunks it can potentially scale to 500TB of 100% random data using a 40GB in-memory global index. If the data has duplicates then the index size becomes smaller. This deduplication occurs with 95% efficiency of a full chunk index based brute-force dedupe. This single capability solves a sticky problem that dedupe solutions has been dealing with for quite some time. The metadata structure that I have discussed in earlier posts also helps with overall performance. The approach is based on similarity detection of large regions in the data stream. The chunks lists of those regions are then sequentially loaded from SSD and compared to perform actual deduplication. The similarity detection technique is simple and novel. It avoids any kind of complicated math, fuzzy hashing etc.I will detail it later.
There are other unique techniques in Pcompress like a partially vectorized rolling hash, scanning less than 30% of the data to locate chunk boundaries, parallelized deduplication and others that contribute to the overall performance. I have posted about a few of these earlier.
In addition to the above, the recent zip-like archiver capabilities that I have added into Pcompress introduce data type detection and automatic selection of filters and compression techniques to suit the data type. However the big missing piece in all this is that Pcompress is still a stand-alone utility. It needs work to turn it into an archival store where data can be ingested incrementally and selective data streams extracted for restore. Also an efficient partitioned indexing is needed to be able to scale deduplication in a cluster without losing deduplication ratio.
- Pcompress gets archiving features (moinakg.wordpress.com)
- Storage Myths: Dedupe for Databases (flashdba.com)
- Architecture for a Deduplicated Archival Store: Part 1 (moinakg.wordpress.com)
- Architecture for a Deduplicated Archival Store: Part 2 (moinakg.wordpress.com)
- High Performance Content Defined Chunking (moinakg.wordpress.com)
- Inside Content-Defined chunking in Pcomporess (moinakg.wordpress.com)
- Inside Content-Defined chunking in Pcomporess – Part 2 (moinakg.wordpress.com)
I recently came across this old blog by Nimble storage co-founder Umesh Maheshwari: http://www.nimblestorage.com/blog/technology/better-than-dedupe-unduped/
The post has intriguing views on inline compression vs Dedupe and the approach of using snapshots on primary storage as a short-term backup mechanism is quite cool. COW snapshots by definition avoid duplicates the instant they are created. In addition consecutive snapshots will share common blocks with no special effort. This aspect not much different from ZFS. It has the same real-time compression and snapshot features and the same benefits apply there as well.
However it is the conclusions and the graphs that I humbly find missing some points and even misleading to an extent. Deduplication does not only remove duplicates from successive backups but it can remove internal duplicates within the volume. It helps compression in turn. In all my tests with Pcompress I have found Deduplication providing additional gain when added to standard compression. See the entry for Pcompress here: http://www.mattmahoney.net/dc/10gb.html. ZFS for that matter provides deduplication as well, though it has scalability bottlenecks.
While inline compression is cool, compression works within a limited window. The window size varies with the algorithm. For example Zlib uses a 32KB window, Bzip2 uses 900KB max, LZMA can use a gigabyte window size. Repeating patterns or redundancy within the window are compressed. Deduplication finds duplicate blocks within the entire dataset of gigabytes to hundreds of terabytes. There is no theoretical window size limitation (Though practical scaling considerations can put a limit). So I really cannot digest that just snapshotting + inline compression will be superior. Deduplication + snapshotting + inline compression will provide greater capacity optimization. For example see the section on “Compression” here: http://www.ciosolutions.com/Nimble+Storage+vs+Netapp+-+CASL+WAFL
Now if Post-Process Variable-Block Similarity based deduplication (of the type I added into Pcompress) can be added to ZFS, things will be very interesting.
- Nimble Storage says “me too!” (neuralytix.com)
- Nimble Storage Named a 2013 Emerging Technology Vendor by CRN (virtual-strategy.com)
Pcompress uses an in-memory hash table to find similar regions of data, loads the chunk hashes of those regions and matches them to find exact chunk matches for deduplication. Currently this works on a single archive producing a compressed file as the output. The index is not required during decompression and is discarded after use. An archival store on the other hand needs to deal with multiple archives and needs a persistent index to keep deduplicating across archives that get added to the store. So there is an archive server which receives data streams from clients. Clients are responsible for identifying the files to be backed up, rolling them into a archive format like tar and sending the data stream to the server. The archive server holds the aforementioned similarity index.
The similarity index in Pcompress is a hash table, so looking at persistence, one quickly thinks of the numerous NoSQL solutions. Evaluate and benchmark one and use it. Even SQlite can be looked at, as it is embeddable, fast and reliable. Good enough for the use case at hand. After pondering this somewhat, it occurred to me that my approach in Pcompress is a scalable in-memory index. The key thing here is the “in-memory” piece. The design is centered around that and does not use an index that can overflow to disk. This in turn means that I do not require an on-disk index format. I just need to stash the records in a file and load them into an in-memory hash when the archive server daemon is started. So all I need is a simple flat file with a sequence of fixed-length records. When a new KV pair is added to the hash it is first appended to the flat file. This is almost like journaling with the journal being the data store itself. Write I/O remains largely sequential. When keys are deleted it can marked invalid in the flat file. A separate cleanup process (like Postgres VACUUMDB) can be used to eliminate the deleted records. Marking deleted records requires in-place updates which is simple because records are of fixed-length.
Thus I can dispense with a NoSQL library and keep things very simple and fast. This approach is similar to the Sparkey KV store from Spotify. The primary differences being the hash not stored to disk and ability to do deletes. Of course unlike Sparkey I want robustness against data corruption. Firstly I will use sync writes to the flat file. Secondly the record will be inserted into the in-memory hash only after disk write is successful.
In Pcompress, I have implemented a variant of the rolling hash based Content Defined Chunking that provides both deduplication accuracy and high performance. This post attempts to explain the chunking process, covers the chunking computations that are done in Pcompress and then talks about the new optimizations for very fast sliding window chunking (on the order of 500MB/s to 600MB/s throughput depending on processor).
Data Deduplication requires splitting a data stream into chunks and then searching for duplicate chunks. Once duplicates are found only one copy of the duplicate is stored and the remaining chunks are references to that copy. The splitting of data into chunks appears to be an ordinary process but is crucial to finding duplicates effectively. The simplest is of course splitting data into fixed size blocks. It is screaming fast, requiring virtually no processing. It however comes with the limitation of the data shifting problem.
The diagram below illustrates the problem. The two 64-character patterns are mostly similar with only two characters differing. Initially fixed-block chunking provides good duplicate detection. However the insertion of a single character at the beginning shifts the entire data while chunk boundaries are fixed. So no duplicates are found even though the patterns are mostly similar.
In order to deal with this, most dedupe solutions use content defined chunking that mark cut points based on patterns in the data. So if the data patterns shift the cut points also shift with them. The diagram below illustrates.
The chunks are split based on patterns in data so they are of variable length (but average size is close to the desired length). Since the chunk boundaries shift along with the data patterns, duplicates are still found. Only the modified chunks are unique.
The Rolling Hash computation
Now the question comes as to what data patterns to look out for when determining the chunk boundaries or cut points? The common technique is to compute a hash value of a few consecutive bytes at every byte position in the data stream. If the hash value matches a certain predefined pattern we can declare a chunk boundary at that position. To do this computation efficiently a technique called the rolling hash was devised. It uses a sliding window that scans over the data bytes and provides a hash value at each point. The hash value at position I can be cheaply computed from the hash at position I-1. In other words where ‘n’ is the window size and represents the window bytes at byte position ‘i’. In mathematical terms this is a recurrence relation. Rolling hashes have been used in contexts like Rabin-Karp substring search and Rsync. Today they are used extensively in chunk splitting in the context of data deduplication.
One of the common rolling hashes used in Data Deduplication is Rabin Fingerprinting devised originally by Turing award winner Michael O. Rabin in his seminal paper titled “Fingerprinting By Random Polynomials“. The mathematically inclined will enjoy reading it. There are other rolling hash techniques such as the one used in Rsync, the TTTD algorithm devised by HP, the FBC algorithm etc.
While I am not so much of a mathematically inclined person I still needed a good rolling hash in order to do content defined chunking in Pcompress. After looking at various implementations like the one in LBFS and few other open-source software like n-gram hashing, I came up with an approach that worked well and produced average chunk sizes close to the desired value.
I used a small sliding window of 16 bytes that produces a 64-bit fingerprint at each byte position requiring an addition, subtraction, multiplication and conditionally an XOR for each byte position. It would declare a chunk boundary if the bottom Y bits of the fingerprint were zero. The value of Y would depend on the average chunk size desired. For example for 4KB average size one would look for bottom 12 bits to be zero. The core of the approach is derived from Rabin Fingerprinting. A good description is here: http://www.infoarena.ro/blog/rolling-hash. The hashing approach is a multiplicative scheme of the form:
Where inbyte is Incoming byte into sliding window head, outbyte is outgoing byte from sliding window tail and . The PRIME number I am using is the same value used by Bulat Ziganishin in his SREP tool. Experimentation showed it to produce good results. In addition to this I precompute a table using the irreducible polynomial (represented in GF(2)) from LBFS. The outbyte is used to index the table and the value is XOR-ed with the hash value to produce the final fingerprint. I did some analysis of the chunking approach which is documented in two earlier posts. The results were good.
- Inside Content-Defined Chunking in Pcompress, part – 1
- Inside Content-Defined Chunking in Pcompress, part – 2
A window size of only 16 bytes will raise some eyebrows as typically much larger windows are used. LBFS for example used a 48-byte window and others have used even larger windows. However in practice, as is evident from the above analysis, this implementation does produce good results and the window size of 16 bytes allows an optimization as we will see below.
While addition, multiplication are extremely fast on modern processors, performance overheads remained. Even though I was using a small window of 16 bytes it still required performing computations over the entire series of bytes in order to find cut points. It is very much computationally expensive compared to the simple splitting of data into fixed-size chunks. A couple of optimizations are immediately apparent from the above hash formula:
- Since we are dealing with bytes it is possible to pre-compute a table for
- The MODULUS operation can be replaced with masking if it is a power of 2.
This gives some gains however the overhead of scanning the data and constantly updating a sliding window in memory remains. Eventually I implemented a couple of key new optimizations in Pcompress that made a significant difference:
- Since the sliding window is just 16 bytes it is possible to keep it entirely in a 128-bit SSE register.
- Since we have minimum and maximum limits for chunk sizes, it is possible to skip minlength – small constant bytes after a breakpoint is found and then start scanning. This provides for a significant improvement in performance by avoiding scanning majority of the data stream.
Experimentation with different types of data shows that the second optimization results in scanning only 28% to 40% of the data. The remaining data are just skipped. The minimum and maximum limits are used to retain a distribution of chunk sizes close to the average. Since rolling hash cut points below the minimum size are ignored it does not make sense to scan that data.
All these optimizations combined provide an average chunking throughput of 530 MB/s per core on my 2nd generation Core i5 running at 2.2 GHz. Of course faster, more recent processors will produce better results. The throughput also depends on the nature of the data. If the data has a very specific pattern that causes more large chunks to be produced the performance degrades (Think why this should be the case). This brings us to the worst case behaviour.
Worst Case performance profile
The worst case performance profile of the optimized chunking approach happens when all chunks produced are of the maximum size. That is the data is such that no breakpoints are produced resulting in a degeneration to the fixed block chunking behaviour at max chunksize of 64KB and at the cost of rolling hash computation overhead. In this case the majority of the data is scanned and computed, but how much ?
If we assume minimum chunk size of 3KB, maximum 64KB and 100MB data we will have chunks (considering worst case all max-length chunks). For every chunk of data will be skipped. In my current implementation the value of small constant is 256, though it can be smaller. So the actual skipped size is bytes. In total the number of skipped bytes will be bytes out of 100MB data. In percentage terms it is . In other words 95% of the data will be scanned degrading the performance by more than half.
Now the question is what kind of data will produce this worst case behaviour? If you have seen the rolling hash computation details in Pcompress above, the eventual fingerprint is computed via an XOR with a polynomial computation result from a table. Those values are non-zero and we check for breakpoints based on bottom 12 bits of the fingerprint being zero. So if the computed hash is zero the XOR will set the bits and bottom 12 bits will become non-zero. The hash will be zero if the data is zero. That is if we have a file of only zero bytes we will hit the worst case.
I created a zero byte file and tested this and got a throughput of 200 MB/s and all chunks of the max 64KB length. In real datasets zero byte regions can happen, however very large entirely zero byte files are uncommon, at least to my knowledge. One place having zero byte regions is VMDK/VDI files. So I tested on a virtual harddisk file of a Fedora 18 installation in VirtualBox and still got a majority of 4KB chunks but with a small peak at 64KB. The throughput was 490 MB/s with approx 41% of the data being scanned. So even a virtual harddisk file will have non-zero bytes inside like formatting markers. It is rare to get 100s of megabytes of files with only zero bytes. Finally from an overall deduplication perspective such files will be deduplicated maximally with almost 98% data reduction and final compression stage will also be extremely fast (only zero bytes). So even though chunking suffers, overall deduplication will be fast.
If you are interested to look at the implementation in Pcompress, it is here: https://github.com/moinakg/pcompress/blob/master/rabin/rabin_dedup.c#L598
In the previous post on this topic I had put down my thoughts around the requirements I am looking at. In this post I will jot down some detailed notes around the design of the on-disk data store format that I am thinking of.
The Archival Chunk Store
From the most basic viewpoint we have data streams which are split into variable length chunks. After deduplication these chunks can be references to other chunks in the same dataset or chunks in other datasets. So we need to have metadata that identifies the dataset (like name, timestamp, length etc.) and then have a list of pointers to data chunks. This is not much different to a traditional file system which has inodes storing metadata and then pointers to blocks/pages on disk. It is conceptually simple to consider a single data block to have multiple references. It is intuitive. However additional metadata is needed to maintain information like reference counts.
The key difference of a file system and a content-defined deduplication storage is that in the former all the blocks are of fixed length and potentially grouped into allocation units. In the latter chunks are of variable length. So we need additional metadata giving chunk lengths and on-disk storage requires a second layer of disk block allocation data. Software like OpenDedup have implemented FuSE based file systems however they only deal with the simpler fixed-length chunking approach and offer primary storage dedupe.
I do not need a full file system route since I am not dealing with primary storage in this case and it also avoids a lot of complexity. There are existing file systems like OpenDedup, LiveDFS, Lessfs and scale-out approaches like Ceph, Tahoe-LAFS etc. where the scalable, variable-chunked dedupe features will be useful, but that is something for later. So I am thinking of storing the data chunks in files that I will call extents, along with the minimum additional metadata in separate metadata extents. The following diagram is a schematic of my approach to storing the chunks on disk.
- A Dataset is identified by some metadata and a sequence of extents in a linked list.
- Each extent is a collection of segments. Extents are essentially numbered files.
- Each segment is a collection of variable-length data chunks.
- Each extent stores segment data and metadata in separate files. A naming convention is used to associate extent metadata and corresponding data files.
- Each extent can contain a fixed maximum number of segments. I am considering up to 2048 segments per extent. Incoming segments are appended to the last extent in the dataset till it fills up and a new extent is allocated.
- Notice that a separate extent metadata section is not required. A extent is just a file.
- The scalable Segmented Similarity based Deduplication is being used here. Each segment contains up to 2048 variable-length chunks. So with 4KB chunk size, each segment is 8MB in size.
- Segment metadata consists of a chunk count, chunk hashes and offsets. The chunk size is not stored. Instead it can be computed by subtracting current chunk’s offset from the next chunk’s offset. Since a 64-bit segment offset is stored the chunk offsets can be relative to it and only need to be 32-bit values.
- The Similarity Index contains similarity hashes that point to segments within the extents. So the pointer has to be the extent number followed by the segment offset within the extent metadata file. Incoming segments from a new datastream are chunked, their similarity hashes computed and then approximate-match segments are looked up in the index.
- Segment data is compressed before storing in the segment. So segment entries in the data extent are of variable length.
- Each segment entry in the metadata extent can also be of variable length since the number of chunks can be less than the maximum. However segment entries in the metadata extent are added when an entry is made in the index, so the exact offset can be recorded.
- Similary a segment entry in the metadata extent needs to point to the offset of the segment data in the data extent. However since segments are compressed later in parallel and stored into the extent, the metadata entries are updated later once the segment data is appended. Keeping segment data in a separate data extent allows this parallel processing while still allowing similarity matches to be processed from the metadata extent.
- Duplicate chunk references are maintained in the metadata extents. A duplicate reference consists of the extent number, segment offset in the compressed file and chunk number within the segment.
- The index is obviously persistent on disk but is loaded in memory in it’s entirety when doing lookups. Any insertion into the index is written immediately onto the disk. I’d obviously have to use a NoSQL key-value store for this. I am currently interested in Hamsterdb.
- Keeping a separate metadata extent allows staging metadata on a separate high-performance storage media like flash to reduce access latency.
- It is possible to store reference counts at the segment level within the index for the purpose of capping number of references to “popular” chunks. This can reduce dedupe ratio since not all chunks will have reached the max reference count. However the advantage of this is it avoids storing and updating reference counts in scattered records in extent files which in turn avoids some random I/O during data ingestion. Each segment has 25 similarity indicators representing different portions of the segment. So all 25 indicators should have reached the maximum reference count to completely remove the entire segment from consideration.
- The entire segment is compressed and stored instead of per-chunk compression. This provides better compression ratio but is also an overhead especially if we just have to retrieve one chunk from a referenced segment. However due to data locality in backups most similar segments will have several chunks in common. In addition the fast LZ4 compression algorithm and caching of uncompressed segments should provide for low overheads. This is something that I have to test in practice.
Supporting Deletion and Forward Referencing
Deleting datasets means deleting all the extents that belong to it. However this is easier said than done because the extent may have segments which contain chunks which are referred to by other extents. So we cannot simply delete. There are two ways to support effective deletion.
First approach is to load the segments one by one from the extents and conditionally store them into a new file. First the segment’s similarity indicators are re-computed and looked up in the index. This will give us the reference count associated with the similarity indicator along with the segment it points to. If the indicator points to another segment then it’s reference count is decremented. Otherwise if the associated reference count is zero, it is first removed from the index. If the reference count is zero for all similarity indicators of the segment or all it’s similarity indicators point to other segments then the segment is not stored into the new file. However a seek is performed on the target file to sparsely extend it. This preserves the relative offsets of the segments which need to be retained.
Second approach is dependent on a technique called Forward Referencing. In this incoming data is stored as-is. If new chunks are duplicate to older chunks then the older chunk entries are updated to point to the new chunks. This means that deletion can be simply performed on the oldest dataset without any further checks as all references will be to newer chunks. I will need to apply the constraint that intermediate datasets cannot be deleted. The big advantage of Forward Referencing is that it speeds up restore times a lot because the latest dataset is typically the one that you want to restore and it is stored as whole and read sequentially. However Forward Referencing requires post-process deduplication in order to be performant and avoid too much random I/O during backup for example. Also technically it precludes source side dedupe as the data has to appear wholly on the backup store.
The third approach combines the above two approaches. Inline dedupe is done and then a post-process optimization pass can be kicked off to re-organize the data to a forward referenced layout. This requires temporary extra metadata space to record a log of all references per referenced extent so that we can invert the references an extent at a time. This can somewhat tricky to get right.
At present I am looking at the first approach and intend to explore the third optimization technique at a later date.
- Architecture for a Deduplicated Archival Store: Part 1 (moinakg.wordpress.com)
- A Tale about key value databases.