Tag Archives: backup

Scaling Deduplication – Sepaton’s Big Data Backup appliance

Came across this news piece on Register dated back to October: http://www.theregister.co.uk/2013/10/16/sepatons_superduper_deduper/

Potentially handles upto 16PB of backup storage with Global Deduplication – Great! The performance and features on paper are really top of the line. Beats the competition on most aspects. If I look at the deduplication features a few things look interesting vis-a-vis those that I have put into Pcompress.

It mixes “Hash-based inline deduplication” and “Post-process, content-aware deduplication”. The article is not clear what exactly this means. There are two possibilities. Firstly it can detect duplicate files during ingestion, store only a single copy and then do block-level dedupe as post-process. Secondly it can deduplicate using large chunks during backup ingestion and then dedupe using small blocks as post-process. This is of course to not hurt backup performance and scale to large datasets. Post-process deduplication is a common technique to scale deduplication without affecting I/O throughput of in-flight data. It has also been used effectively in Windows Server 2012 to do primary data deduplication.

Sepaton can do analysis of data types and change rates in data to apply the most efficient dedupe mechanism or even skip dedupe for encrypted and compressed files that virtually do not deduplicate at all.

The other interesting features include byte-level deduplication for databases that store data in block sizes less than 8KB and using flash based SSDs to store the global index. I am not sure what this “byte-level deduplication” exactly means but it appears to be a delta-differencing mechanism. Now the question is how efficient restore can be when delta-differencing is used.

In some of the posts on Pcompress design I have already mentioned about using SSDs for storing all kinds of metadata. Fast metadata access is critical and this is the logical choice. However the other “new aspect in Pcompress is the ability to use small blocks for deduplication without losing performance and giving good scalability“. This is a key feature that most of the current solutions seem to be missing. Pcompress can use blocks (or chunks) as small as 2KB without losing too much performance. With 2KB chunks it can potentially scale to 500TB of 100% random data using a 40GB in-memory global index. If the data has duplicates then the index size becomes smaller. This deduplication occurs with 95% efficiency of a full chunk index based brute-force dedupe. This single capability solves a sticky problem that dedupe solutions has been dealing with for quite some time. The metadata structure that I have discussed in earlier posts also helps with overall performance. The approach is based on similarity detection of large regions in the data stream. The chunks lists of those regions are then sequentially loaded from SSD and compared to perform actual deduplication. The similarity detection technique is simple and novel. It avoids any kind of complicated math, fuzzy hashing etc.I will detail it later.

There are other unique techniques in Pcompress like a partially vectorized rolling hash, scanning less than 30% of the data to locate chunk boundaries, parallelized deduplication and others that contribute to the overall performance. I have posted about a few of these earlier.

In addition to the above, the recent zip-like archiver capabilities that I have added into Pcompress introduce data type detection and automatic selection of filters and compression techniques to suit the data type. However the big missing piece in all this is that Pcompress is still a stand-alone utility. It needs work to turn it into an archival store where data can be ingested incrementally and selective data streams extracted for restore. Also an efficient partitioned indexing is needed to be able to scale deduplication in a cluster without losing deduplication ratio.

Architecture for a Deduplicated Archival Store: Part 2

Golf Disc StorageIn the previous post on this topic I had put down my thoughts around the requirements I am looking at. In this post I will jot down some detailed notes around the design of the on-disk data store format that I am thinking of.

The Archival Chunk Store

From the most basic viewpoint we have data streams which are split into variable length chunks. After deduplication these chunks can be references to other chunks in the same dataset or chunks in other datasets. So we need to have metadata that identifies the dataset (like name, timestamp, length etc.) and then have a list of pointers to data chunks. This is not much different to a traditional file system which has inodes storing metadata and then pointers to blocks/pages on disk. It is conceptually simple to consider a single data block to have multiple references. It is intuitive. However additional metadata is needed to maintain information like reference counts.

The key difference of a file system and a content-defined deduplication storage is that in the former all the blocks are of fixed length and potentially grouped into allocation units. In the latter chunks are of variable length. So we need additional metadata giving chunk lengths and on-disk storage requires a second layer of disk block allocation data. Software like OpenDedup have implemented FuSE based file systems however they only deal with the simpler fixed-length chunking approach and offer primary storage dedupe.

I do not need a full file system route since I am not dealing with primary storage in this case and it also avoids a lot of complexity. There are existing file systems like OpenDedup, LiveDFS, Lessfs and scale-out approaches like Ceph, Tahoe-LAFS etc. where the scalable, variable-chunked dedupe features will be useful, but that is something for later. So I am thinking of storing the data chunks in files that I will call extents, along with the minimum additional metadata in separate metadata extents. The following diagram is a schematic of my approach to storing the chunks on disk.

ChunkstoreThe following are the characteristics that imply from this schematic:

  • A Dataset is identified by some metadata and a sequence of extents in a linked list.
  • Each extent is a collection of segments. Extents are essentially numbered files.
  • Each segment is a collection of variable-length data chunks.
  • Each extent stores segment data and metadata in separate files. A naming convention is used to associate extent metadata and corresponding data files.
  • Each extent can contain a fixed maximum number of segments. I am considering up to 2048 segments per extent. Incoming segments are appended to the last extent in the dataset till it fills up and a new extent is allocated.
  • Notice that a separate extent metadata section is not required. A extent is just a file.
  • The scalable Segmented Similarity based Deduplication is being used here. Each segment contains up to 2048 variable-length chunks. So with 4KB chunk size, each segment is 8MB in size.
  • Segment metadata consists of a chunk count, chunk hashes and offsets. The chunk size is not stored. Instead it can be computed by subtracting current chunk’s offset from the next chunk’s offset. Since a 64-bit segment offset is stored the chunk offsets can be relative to it and only need to be 32-bit values.
  • The Similarity Index contains similarity hashes that point to segments within the extents. So the pointer has to be the extent number followed by the segment offset within the extent metadata file. Incoming segments from a new datastream are chunked, their similarity hashes computed and then approximate-match segments are looked up in the index.
  • Segment data is compressed before storing in the segment. So segment entries in the data extent are of variable length.
  • Each segment entry in the metadata extent can also be of variable length since the number of chunks can be less than the maximum. However segment entries in the metadata extent are added when an entry is made in the index, so the exact offset can be recorded.
  • Similary a segment entry in the metadata extent needs to point to the offset of the segment data in the data extent. However since segments are compressed later in parallel and stored into the extent, the metadata entries are updated later once the segment data is appended. Keeping segment data in a separate data extent allows this parallel processing while still allowing similarity matches to be processed from the metadata extent.
  • Duplicate chunk references are maintained in the metadata extents. A duplicate reference consists of the extent number, segment offset in the compressed file and chunk number within the segment.
  • The index is obviously persistent on disk but is loaded in memory in it’s entirety when doing lookups. Any insertion into the index is written immediately onto the disk. I’d obviously have to use a NoSQL key-value store for this. I am currently interested in Hamsterdb.
  • Keeping a separate metadata extent allows staging metadata on a separate high-performance storage media like flash to reduce access latency.
  • It is possible to store reference counts at the segment level within the index for the purpose of capping number of references to “popular” chunks. This can reduce dedupe ratio since not all chunks will have reached the max reference count. However the advantage of this is it avoids storing and updating reference counts in scattered records in extent files which in turn avoids some random I/O during data ingestion. Each segment has 25 similarity indicators representing different portions of the segment. So all 25 indicators should have reached the maximum reference count to completely remove the entire segment from consideration.
  • The entire segment is compressed and stored instead of per-chunk compression. This provides better compression ratio but is also an overhead especially if we just have to retrieve one chunk from a referenced segment. However due to data locality in backups most similar segments will have several chunks in common. In addition the fast LZ4 compression algorithm and caching of uncompressed segments should provide for low overheads. This is something that I have to test in practice.

Supporting Deletion and Forward Referencing

Deleting datasets means deleting all the extents that belong to it. However this is easier said than done because the extent may have segments which contain chunks which are referred to by other extents. So we cannot simply delete. There are two ways to support effective deletion.

First approach is to load the segments one by one from the extents and conditionally store them into a new file. First the segment’s similarity indicators are re-computed and looked up in the index. This will give us the reference count associated with the similarity indicator along with the segment it points to. If the indicator points to another segment then it’s reference count is decremented. Otherwise if the associated reference count is zero, it is first removed from the index. If the reference count is zero for all similarity indicators of the segment or all it’s similarity indicators point to other segments then the segment is not stored into the new file. However a seek is performed on the target file to sparsely extend it. This preserves the relative offsets of the segments which need to be retained.

Second approach is dependent on a technique called Forward Referencing. In this incoming data is stored as-is. If new chunks are duplicate to older chunks then the older chunk entries are updated to point to the new chunks. This means that deletion can be simply performed on the oldest dataset without any further checks as all references will be to newer chunks. I will need to apply the constraint that intermediate datasets cannot be deleted. The big advantage of Forward Referencing is that it speeds up restore times a lot because the latest dataset is typically the one that you want to restore and it is stored as whole and read sequentially. However Forward Referencing requires post-process deduplication in order to be performant and avoid too much random I/O during backup for example. Also technically it precludes source side dedupe as the data has to appear wholly on the backup store.

The third approach combines the above two approaches. Inline dedupe is done and then a post-process optimization pass can be kicked off to re-organize the data to a forward referenced layout. This requires temporary extra metadata space to record a log of all references per referenced extent so that we can invert the references an extent at a time. This can somewhat tricky to get right.

At present I am looking at the first approach and intend to explore the third optimization technique at a later date.

Architecture for a Deduplicated Archival Store: Part 1


Pcompress as it stands today is a powerful single-file lossless compression program that applies a variety of compression and data deduplication algorithms to effectively reduce the dataset size. However as far as data deduplication goes it can only apply the algorithms to a single dataset to remove internal duplicates. What is more useful is to be able to apply deduplication to remove common blocks across datasets to achieve even greater savings especially in backup scenarios. This is why we see a slew of products in this space boasting of upto 90% reduction in backup storage requirements.

In the open source space we have filesystems like OpenDedup, Lessfs, S3QL, ZFS etc that provide deduplication even for primary online storage. While that is a desirable feature in itself, these software lack many of the advanced features of commercial products like Sepaton, HP StoreOnce or EMC DataDomain. Pcompress implements a bunch of those advanced algorithms today (I am writing a couple of papers on this) so it makes sense to extend the software into a proper scalable archival store for backup requirements. In this topic it is worthwhile to take note of eXdupe which provides archival deduplicated backup capabilities but it is quite simplistic providing only differential storage against a single initial backup dataset. It is much like a full backup followed by incremental backups. Just that there is no real multi-file dedupe. One can only dedupe the latest backup data against the first non-differential backup data. It is not a scalable chunk store that can chunk any incoming dataset and store only the unique chunks.

If we look at open source backup software like Amanda or Bacula, none of them have block-level dedupe capability, leave alone sliding-window variable block chunking. So, in a nutshell, we can summarize the requirements as follows:

  1. A Deduplicated, Scalable Chunk Store that stores unique chunks and provides fast read access.
  2. The Chunk Store is meant for backups and archival storage and assumes immutable chunks. I am not looking at online primary storage in this case. However the system should support deletion of old datasets.
  3. It should be able to do inline dedupe. With inline dedupe we can do source side dedupe reducing the amount of backup data transferred over the network.
  4. Pcompress can potentially utilize all the cores on the system and this archival store should be no different.
  5. Metadata overhead should be kept to a minimum and I will be using the Segmented similarity based indexing to use a global index that can fit in RAM.
  6. Data and Metadata should be kept separate such that metadata can be located on high-speed storage like SSDs to speed up access. While this increases the number of multiple separate disk accesses during restore, the effect can be reduced by locality sensitive caching in addition to SSDs.
  7. The system should of course be able to scale to petabytes.
  8. It should be possible to integrate the system with existing backup software like Amanda, Bacula etc. This is needed if we want to do source-side dedupe.
  9. There should be a chunk reference count with a max limit to avoid too many datasets referencing the same chunk. The loss of a multiple referenced chunk can corrupt multiple backups. Having an upper limit reduces the risk. In addition we need replication but that is not in my charter at this time. Filesystem replication/distribution can be used for the purpose. Software like DRBD can also be used.
  10. Another feature is to limit deduplication to the last X backup sets much like a sliding window. This allows cleanly removing really old backups and avoid recent backups from referencing chunks in a those old data.
  11. All this applies to archival storage on disk. Deduping backups onto tape is a different can of worms that I will probably look at later.

I plan to go at all these requirements in phases. For example I’d not initially look at source-side dedupe. Rather the initial focus will be to get a high-performance stable backend. If one is wondering about some of the terms used here, then look at the Wikipedia article for explanations.

Deduped storage with SQL front-end

Came across this very interesting piece on Forbes: http://www.forbes.com/sites/tomgroenfeldt/2013/02/14/managing-financial-data-growth-with-30-40x-compression/

The unique thing about this product is the ability to do SQL queries requiring obviously an additional overhead but much less so than tapes. With very high data reduction ratios the product claims to be an cost-effective big-data storage container for medium to long term storage that can be queried much easily than retrieving data from tapes.

However tapes have certain economics and cater to specific operational models that are tricky to match with an appliance. So it will be interesting to watch how RainStor fares. Also whenever I hear about claims of extreme compression above 90% effectiveness I start to add salt. Compression can only remove as much data redundancies as they exist within the data. Of course some compression algorithms are better at finding the redundancies than others and compression combined with other things line rzip, deduplication and content-specific data transformation filters can take out global redundancies effectively from large datasets. Still all these techniques are not something magical. If the data does not contain redundancies then they will fail to reduce the data volume. What tends to happen though in the real world is that business data tend to be structured with repeating content and successive snapshots of data tend to contain a lot in common with previous ones. So we can potentially see a lot of reduction. One can only determine this by doing a thorough evaluation.


What next for Pcompress

With the current release of 0.9.1 Pcompress now comes with a whole bunch of features and capabilities that are close to being a logically complete set for a stand-alone data compression utility.  It will be hard to find another open-source utility that has the same combination of features and capabilities in a single utility. There are a few other stuff that can be added from a single utility point of view the biggest being a GUI and possibly portability to Windows. However beyond all this I have been looking at extending Pcompress to being a more comprehensive archival mechanism. Not necessarily a full Backup solution though. For that we have excellent software like Bacula and Amanda. However it can be an add-on for those providing an archival data store with capabilities that they lack.

I am listing out some of the thoughts and new ideas that have been crawling inside my cranium for the past few weeks.

Global Deduplication

Pcompress does Deduplication but only at the level of individual chunks. It is a logical next step to evolve it into a complete deduplication archival store on the lines of commercial products like Data Domain, Sepaton, Permabit Albireo, HP StoreOnce etc. There are big challenges in this area. The biggest being the disk bottleneck arising out of index lookups for chunk hashes. This becomes increasingly a problem as we scale into hundreds of TB and to Petascale. Then comes efficient storage of chunks, efficient retrieval of archived data, encryption of deduplicated data, sources-side dedupe to reduce network traffic etc. The industry has come up with a variety of solutions and optimizations but nothing significantly comparable is available in the open-source domain.

A couple of the open-source solutions are SDFS and LessFS. I have tried both and did not have much success with LessFS as it would hang every now and then in my Fedora 16 laptop. However SDFS worked reliably and fast. I am sure I am doing something wrong in setting up LessFS. Both of these are in-line dedupe filesystems that work on fixed-block boundaries. In addition the backup solutions like Bacula and Amanda only do whole-file dedupe or, in other words, single-instance storage.

For Pcompress I am looking at providing more advanced capabilities. Pcompress already has sliding-window content-aware chunking capability. The challenge is to design a scalable deduplicated archival store that can handle Petabytes. To this effect I have been reading up on a bunch of research papers and capabilities touted by various products and have formed some detailed design ideas. Some of the features required for this capability include:

  1. Sparse Indexing: We need to use some approximate matching capability to probabilistically identify duplicate and unique chunks with low memory overhead. Data Domain for example uses a Bloom Filter that they call a Summary Vector to quickly identify unique chunks along with other capabilities like locality based caching. Other different approaches include Albireo from Permabit, and SILT. In that respect the following paper seems very interesting: https://code.google.com/p/binarywarriors/downloads/detail?name=INLINE_DEDUP.pdf&can=2&q= . What is more interesting is this HP paper that reflects a more polished form of that paper: http://static.usenix.org/events/fast09/tech/full_papers/lillibridge/lillibridge.pdf . This is essentially a locality based grouping of chunks along with a stratified sampling to do group-level approximate matching on a smaller index size. Locality based grouping is very helpful and instead of doing stratified sampling I have a different idea of using similarity matching at various confidence levels to find related groups. My idea will enable deduplicating 1 PB of data requiring just 14GB of memory for the index.
  2. Source side dedupe: Doing chunking and hashing at the source side can reduce load on the backup server and not sending chunks already present on the server reduces network bandwidth drastically. In addition the archive data itself will have many duplicates inside it even before comparing with the previous backup archive. So that will help as well. One extra thing I am looking at here is for the server to have a configurable capability to re-compute the chunk hashes and verify. This is an extra security mechanism that prevents malicious uploading of garbage chunks with invalid hashes. This of course puts extra load on the server but current gen Sandy Bridge or Bulldozers are solidly up to the task.
  3. Forward Referencing: Typically as backup archives keep on entering the deduplication data store their chunks keep pointing to older chunks that are duplicates. The first archive will be stored in full (after in-archive dedupe) and chunks in subsequent archives keep pointing backwards to duplicate chunks in older archives. So restoring from the newest backup will have lower performance due to the numerous random reads that it will have to do read those referenced chunks from older archives. Older archives will restore faster! In some documents this is called re-hydration. Logically we want the reverse to happen. If we can make chunks in older archives point to duplicate chunks in newer archives then the newest archive can be stored sequentially in it’s logical chunk order. However doing this in-line is not practical because many references need to be updated. So this is typically done with post-process dedupe and Sepaton is the only vendor doing this. This however precludes the use of Source-side dedupe. I have given this a lot of thought and come up with an alternative that can give the benefits of normal in-line and source-side dedupe while also providing forward-referencing capability.
  4. Security Considerations: Suffice it to say that this paper has a comprehensive coverage on this topic: http://www.ssrc.ucsc.edu/Papers/storer-storagess08.pdf .
  5. On-disk Container-Based storage mechanism: Storing chunks on the disk is another can of worms. Typical dedupe products implement a mini-filesystem for this. Storing individual groups of chunks in individual files is too expensive as there will be too many of them and filesystem metadata-lookup overheads will have significant impact. Chunk groups or segments can be further grouped into larger containers to solve this. I do not intend to implement a low-level mini-filesystem. Just use existing scalable filesystems like XFS.
  6. Replication: This is obviously a very important need. Especially with deduplicated data the loss of a single chunk invalidates all references to it and corrupts a whole host of files and archives. So replication is mandatorily needed as a protection against this. Instead of trying to build something I am looking at leveraging existing software like the excellent GlusterFS.
  7. GPGPU Acceleration: Chunking and hashing can be speeded up by leveraging GPGPUs. This paper provides the background: http://static.usenix.org/events/fast/tech/full_papers/Bhatotia2-5-12.pdf . Very interestingly Keccack has been ported to work optimally on Nvidia CUDA: https://sites.google.com/site/keccaktreegpu/ .
  8. Clustering: Distributing deduplication storage tasks across a cluster of computers can help greatly with scalability as opposed to a single appliance approach. It is possible to distribute chunks or segments across a cluster of computers based on their hash values. SDFS uses this technique. This however affects dedupe ratio.
  9. Integration with backup software: Neither Bacula nor Amanda includes variable-block dedupe capability.

Miscellaneous Features

  1. GUI: Everybody likes a GUI … everybody likes gui …. well except for Unix CLI freaks (myself included).
  2. Windows Port: This will broaden the usage base but requires having a GUI first.
  3. Reorder Files: Sort files based on similarity, size and filename. This should give better dedupe and compression results.
  4. Backup to online storage: Amazon S3, Glacier, DropBox etc should be supported as back-end storage targets.