Architecture for a Deduplicated Archival Store: Part 1

Requirements

Pcompress as it stands today is a powerful single-file lossless compression program that applies a variety of compression and data deduplication algorithms to effectively reduce the dataset size. However, as far as data deduplication goes, it can only apply those algorithms within a single dataset to remove internal duplicates. What is more useful is to apply deduplication to remove common blocks across datasets, which yields even greater savings, especially in backup scenarios. This is why we see a slew of products in this space boasting of up to 90% reduction in backup storage requirements.
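To make the cross-dataset idea concrete, here is a minimal sketch of the ingest path: every incoming chunk is fingerprinted and looked up in an index shared by all datasets, so a chunk already stored by an earlier backup is only referenced, never written again. This is an illustration, not Pcompress code; `index_lookup`, `store_chunk` and `add_reference` are hypothetical placeholders for whatever interface the chunk store ends up providing, and SHA-256 stands in for whichever digest is finally chosen.

```c
#include <stdint.h>
#include <stddef.h>
#include <openssl/sha.h>   /* SHA-256 used here purely for illustration */

/* Hypothetical chunk store interface -- placeholders, not Pcompress APIs. */
typedef struct { uint8_t bytes[SHA256_DIGEST_LENGTH]; } chunk_digest_t;

extern int  index_lookup(const chunk_digest_t *d);          /* 1 if chunk already stored */
extern void store_chunk(const chunk_digest_t *d,
                        const uint8_t *data, size_t len);   /* write a unique chunk */
extern void add_reference(const chunk_digest_t *d);         /* record a dedupe hit */

/*
 * Process one chunk produced by the variable-block chunker. Because the index
 * is shared across every ingested dataset, duplicates are detected globally,
 * which is what turns single-file dedupe into a multi-dataset chunk store.
 */
void ingest_chunk(const uint8_t *data, size_t len)
{
    chunk_digest_t d;
    SHA256(data, len, d.bytes);

    if (index_lookup(&d))
        add_reference(&d);           /* duplicate: reference only */
    else
        store_chunk(&d, data, len);  /* unique: store data and index it */
}
```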

In the open source space we have filesystems like OpenDedup, Lessfs, S3QL, ZFS etc. that provide deduplication even for primary online storage. While that is a desirable feature in itself, these projects lack many of the advanced features of commercial products like Sepaton, HP StoreOnce or EMC DataDomain. Pcompress implements a bunch of those advanced algorithms today (I am writing a couple of papers on this), so it makes sense to extend the software into a proper scalable archival store for backup requirements. In this context it is worth taking note of eXdupe, which provides deduplicated archival backup capabilities but is quite simplistic, offering only differential storage against a single initial backup dataset. It is much like a full backup followed by incremental backups, except that there is no real multi-file dedupe: one can only dedupe the latest backup data against the first non-differential backup. It is not a scalable chunk store that can chunk any incoming dataset and store only the unique chunks.

If we look at open source backup software like Amanda or Bacula, none of them have block-level dedupe capability, let alone sliding-window variable-block chunking. So, in a nutshell, we can summarize the requirements as follows:

  1. A Deduplicated, Scalable Chunk Store that stores unique chunks and provides fast read access.
  2. The Chunk Store is meant for backups and archival storage and assumes immutable chunks. I am not looking at online primary storage in this case. However, the system should support deletion of old datasets.
  3. It should be able to do inline dedupe. With inline dedupe we can also do source-side dedupe, reducing the amount of backup data transferred over the network.
  4. Pcompress can potentially utilize all the cores on the system and this archival store should be no different.
  5. Metadata overhead should be kept to a minimum. I will be using segmented similarity-based indexing so that a global index can fit in RAM.
  6. Data and metadata should be kept separate so that metadata can be located on high-speed storage like SSDs to speed up access. While this increases the number of separate disk accesses during restore, the effect can be reduced by locality-sensitive caching in addition to SSDs.
  7. The system should of course be able to scale to petabytes.
  8. It should be possible to integrate the system with existing backup software like Amanda, Bacula etc. This is needed if we want to do source-side dedupe.
  9. There should be a chunk reference count with a maximum limit, to avoid too many datasets referencing the same chunk. The loss of a chunk that is referenced many times can corrupt multiple backups, so having an upper limit reduces the risk (a rough sketch of such a reference-counted chunk record follows this list). In addition we need replication, but that is not in my charter at this time; filesystem replication/distribution can be used for the purpose, and software like DRBD can also be used.
  10. Another feature is to limit deduplication to the last X backup sets, much like a sliding window. This allows cleanly removing really old backups and prevents recent backups from referencing chunks in those old datasets.
  11. All this applies to archival storage on disk. Deduping backups onto tape is a different can of worms that I will probably look at later.
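As a rough illustration of requirements 6 and 9, the sketch below shows what a per-chunk metadata record might look like: the metadata (digest, location, reference count) lives in its own store, which could sit on SSD, while the chunk data lives elsewhere, and the reference count is capped so that no single chunk backs an unbounded number of backups. The record layout and the MAX_CHUNK_REFS constant are assumptions for the sake of illustration, not an actual on-disk format.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative cap -- in practice this would be a tunable. */
#define MAX_CHUNK_REFS  1024

/*
 * Per-chunk metadata record, kept in the metadata store (e.g. on SSD),
 * separate from the chunk data itself (requirement 6).
 */
typedef struct chunk_meta {
    uint8_t  digest[32];     /* content hash identifying the chunk      */
    uint64_t data_offset;    /* where the chunk lives in the data store */
    uint32_t length;         /* stored chunk length                     */
    uint32_t refcount;       /* number of backup sets referencing it    */
} chunk_meta_t;

/*
 * Decide whether an incoming duplicate may reference this chunk or must be
 * stored again. Capping the refcount bounds the blast radius of losing one
 * chunk (requirement 9): at most MAX_CHUNK_REFS backups can depend on it.
 */
bool can_add_reference(chunk_meta_t *m)
{
    if (m->refcount >= MAX_CHUNK_REFS)
        return false;        /* treat as new: store another copy */
    m->refcount++;
    return true;
}
```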

I plan to tackle these requirements in phases. For example, I will not initially look at source-side dedupe; rather, the initial focus will be on getting a high-performance, stable backend. If you are wondering about some of the terms used here, look at the Wikipedia article for explanations.