Architecture for a Deduplicated Archival Store: Part 2

In the previous post on this topic I put down my thoughts on the requirements I am looking at. In this post I will jot down some detailed notes on the design of the on-disk data store format that I have in mind.

The Archival Chunk Store

From the most basic viewpoint we have data streams which are split into variable-length chunks. After deduplication these chunks can be references to other chunks in the same dataset or to chunks in other datasets. So we need metadata that identifies the dataset (name, timestamp, length etc.) and then a list of pointers to data chunks. This is not much different from a traditional file system, which has inodes storing metadata and then pointers to blocks/pages on disk. It is conceptually simple and intuitive for a single data block to have multiple references. However, additional metadata is needed to maintain information like reference counts.

The key difference between a file system and a content-defined deduplication store is that in the former all blocks are of fixed length and potentially grouped into allocation units, while in the latter chunks are of variable length. So we need additional metadata giving chunk lengths, and on-disk storage requires a second layer of disk block allocation data. Software like OpenDedup has implemented FUSE-based file systems, but it only deals with the simpler fixed-length chunking approach and offers primary storage dedupe.
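To make that difference concrete, here is a tiny Python sketch, purely for exposition: fixed-size blocks can be located by arithmetic alone, whereas content-defined variable-length chunks need an explicit per-chunk offset table carried in metadata. The names and the 4KB block size are illustrative, not part of the design.

```python
BLOCK_SIZE = 4096

def fixed_block_offset(block_no):
    # Fixed-size blocks need no per-block metadata at all.
    return block_no * BLOCK_SIZE

def variable_chunk_extent(chunk_no, chunk_offsets, total_len):
    # Variable-length chunks need an offset table stored on disk;
    # the chunk length falls out of consecutive offsets.
    start = chunk_offsets[chunk_no]
    end = chunk_offsets[chunk_no + 1] if chunk_no + 1 < len(chunk_offsets) else total_len
    return start, end - start
```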

I do not need to go the full file system route since I am not dealing with primary storage in this case, and skipping it avoids a lot of complexity. There are existing file systems like OpenDedup, LiveDFS, Lessfs and scale-out approaches like Ceph, Tahoe-LAFS etc. where scalable, variable-chunked dedupe features would be useful, but that is something for later. So I am thinking of storing the data chunks in files that I will call extents, along with the minimum additional metadata in separate metadata extents. The following diagram is a schematic of my approach to storing the chunks on disk.

[Diagram: Chunkstore schematic]

The following characteristics follow from this schematic:

  • A Dataset is identified by some metadata and a sequence of extents in a linked list.
  • Each extent is a collection of segments. Extents are essentially numbered files.
  • Each segment is a collection of variable-length data chunks.
  • Each extent stores segment data and metadata in separate files. A naming convention is used to associate extent metadata and corresponding data files.
  • Each extent can contain a fixed maximum number of segments. I am considering up to 2048 segments per extent. Incoming segments are appended to the last extent in the dataset till it fills up and a new extent is allocated.
  • Notice that a separate extent metadata section is not required. An extent is just a file.
  • The scalable Segmented Similarity based Deduplication approach is being used here. Each segment contains up to 2048 variable-length chunks, so with a 4KB average chunk size each segment is about 8MB in size.
  • Segment metadata consists of a chunk count, chunk hashes and offsets. The chunk size is not stored; instead it can be computed by subtracting the current chunk’s offset from the next chunk’s offset. Since a 64-bit segment offset is stored, the chunk offsets can be relative to it and only need to be 32-bit values (see the layout sketch after this list).
  • The Similarity Index contains similarity hashes that point to segments within the extents. So the pointer has to be the extent number followed by the segment offset within the extent metadata file (see the index sketch after this list). Incoming segments from a new data stream are chunked, their similarity hashes computed and then approximate-match segments are looked up in the index.
  • Segment data is compressed before storing in the segment. So segment entries in the data extent are of variable length.
  • Each segment entry in the metadata extent can also be of variable length since the number of chunks can be less than the maximum. However segment entries in the metadata extent are added when an entry is made in the index, so the exact offset can be recorded.
  • Similarly, a segment entry in the metadata extent needs to point to the offset of the segment data in the data extent. However, since segments are compressed later in parallel and stored into the extent, the metadata entries are updated once the segment data is appended. Keeping segment data in a separate data extent allows this parallel processing while still allowing similarity matches to be processed from the metadata extent.
  • Duplicate chunk references are maintained in the metadata extents. A duplicate reference consists of the extent number, segment offset in the compressed file and chunk number within the segment.
  • The index is obviously persistent on disk but is loaded into memory in its entirety when doing lookups. Any insertion into the index is written immediately to disk. I’d have to use a NoSQL key-value store for this; I am currently interested in Hamsterdb.
  • Keeping a separate metadata extent allows staging metadata on a separate high-performance storage media like flash to reduce access latency.
  • It is possible to store reference counts at the segment level within the index for the purpose of capping the number of references to “popular” chunks. This can reduce the dedupe ratio since not all chunks will have reached the max reference count. However, the advantage is that it avoids storing and updating reference counts in scattered records in extent files, which in turn avoids some random I/O during data ingestion. Each segment has 25 similarity indicators representing different portions of the segment, so all 25 indicators must have reached the maximum reference count before the entire segment is removed from consideration.
  • The entire segment is compressed and stored instead of using per-chunk compression. This provides a better compression ratio but is also an overhead, especially if we just have to retrieve one chunk from a referenced segment. However, due to data locality in backups, most similar segments will have several chunks in common. In addition, the fast LZ4 compression algorithm and caching of uncompressed segments should keep the overhead low. This is something I have to test in practice.
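As a concrete illustration of the segment metadata layout described above, here is a minimal Python sketch. The header fields, the 32-byte hash width and the field order are my assumptions purely for illustration, not a final on-disk format; the parts it does take from the design are the 64-bit segment offset, the 32-bit chunk offsets relative to the segment, and chunk sizes derived by subtracting consecutive offsets.

```python
import struct

# Hypothetical layout for one segment entry in a metadata extent.
# Header: 64-bit offset of the compressed segment within the data extent,
#         32-bit chunk count.
# Body:   per chunk, a 32-byte hash and a 32-bit offset relative to the
#         start of the uncompressed segment; chunk sizes are derived.
SEG_HEADER = struct.Struct("<QI")       # segment offset, chunk count
CHUNK_ENTRY = struct.Struct("<32sI")    # chunk hash, relative offset

def pack_segment_entry(segment_offset, chunks):
    """chunks: list of (hash_bytes, relative_offset), ascending by offset."""
    parts = [SEG_HEADER.pack(segment_offset, len(chunks))]
    for chash, rel_off in chunks:
        parts.append(CHUNK_ENTRY.pack(chash, rel_off))
    return b"".join(parts)

def unpack_segment_entry(buf, uncompressed_len):
    """Return (segment_offset, [(hash, offset, size), ...])."""
    segment_offset, nchunks = SEG_HEADER.unpack_from(buf, 0)
    entries = []
    pos = SEG_HEADER.size
    for _ in range(nchunks):
        chash, rel_off = CHUNK_ENTRY.unpack_from(buf, pos)
        entries.append([chash, rel_off, 0])
        pos += CHUNK_ENTRY.size
    # A chunk's size is the next chunk's offset minus its own offset;
    # the last chunk runs to the end of the uncompressed segment.
    for i, entry in enumerate(entries):
        nxt = entries[i + 1][1] if i + 1 < nchunks else uncompressed_len
        entry[2] = nxt - entry[1]
    return segment_offset, [tuple(e) for e in entries]
```

The uncompressed segment length is passed in to size the last chunk; in a real layout that could just as well be another header field.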
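Similarly, here is a toy in-memory stand-in for the similarity index, with made-up field widths and a made-up reference cap. The real index would live in a persistent key-value store such as Hamsterdb; what this sketch shows is the value being a pointer (extent number plus segment offset in the metadata extent) together with the segment-level reference count used for capping.

```python
import struct

# Hypothetical similarity-index record. The key is the similarity hash
# itself (e.g. a minhash-derived value); the value records the segment it
# points at plus a reference count used to cap "popular" segments.
INDEX_VALUE = struct.Struct("<IQI")   # extent number, metadata offset, refcount
MAX_REFS = 1024                       # illustrative cap, not a design constant

class SimilarityIndex:
    """Toy in-memory stand-in for the on-disk key-value store."""

    def __init__(self):
        self.kv = {}                  # sim_hash -> packed INDEX_VALUE

    def insert(self, sim_hash, extent_no, meta_offset):
        self.kv[sim_hash] = INDEX_VALUE.pack(extent_no, meta_offset, 1)

    def lookup(self, sim_hash):
        """Return (extent_no, meta_offset) and bump the refcount, or None
        if the indicator is unknown or capped (treated as a miss)."""
        packed = self.kv.get(sim_hash)
        if packed is None:
            return None
        extent_no, meta_offset, refs = INDEX_VALUE.unpack(packed)
        if refs >= MAX_REFS:
            return None
        self.kv[sim_hash] = INDEX_VALUE.pack(extent_no, meta_offset, refs + 1)
        return extent_no, meta_offset
```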

Supporting Deletion and Forward Referencing

Deleting a dataset means deleting all the extents that belong to it. However, this is easier said than done because an extent may have segments containing chunks that are referred to by other extents, so we cannot simply delete it. There are two basic ways to support effective deletion, plus a third that combines them.

The first approach is to load the segments one by one from the extents and conditionally store them into a new file. First the segment’s similarity indicators are re-computed and looked up in the index. This gives us the reference count associated with each similarity indicator along with the segment it points to. If the indicator points to another segment then its reference count is decremented. Otherwise, if the associated reference count is zero, the indicator is removed from the index. If the reference count is zero for all of the segment’s similarity indicators, or all of its indicators point to other segments, then the segment is not stored into the new file; however, a seek is performed on the target file to sparsely extend it. This preserves the relative offsets of the segments that need to be retained.
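Below is a rough Python sketch of this compaction pass. The read_segments and similarity_indicators helpers and the index object’s lookup/decref/remove methods are placeholders I made up to keep the example self-contained, not names from any real code; the parts taken from the design are the conditional copy and the sparse extension that keeps surviving segments at their original offsets.

```python
import os

def compact_extent(old_path, new_path, index, read_segments, similarity_indicators):
    """Conditionally copy live segments from a data extent whose dataset is
    being deleted. read_segments(path) is assumed to yield (offset, raw_bytes)
    per compressed segment; similarity_indicators(raw_bytes) re-computes the
    segment's similarity hashes."""
    with open(new_path, "wb") as out:
        for offset, raw in read_segments(old_path):
            keep = False
            for sim_hash in similarity_indicators(raw):
                entry = index.lookup(sim_hash)
                if entry is None:
                    continue
                if entry.segment_offset != offset:
                    # Indicator now points at some other segment; drop this
                    # dataset's claim on it.
                    index.decref(sim_hash)
                elif entry.refcount == 0:
                    index.remove(sim_hash)
                else:
                    # A live indicator still points here: the segment survives.
                    keep = True
            if keep:
                out.seek(offset)          # preserve the segment's offset
                out.write(raw)
        # Sparse-extend the file so retained segments keep their relative
        # offsets even when trailing segments were dropped.
        out.truncate(os.path.getsize(old_path))
```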

The second approach depends on a technique called Forward Referencing. Here incoming data is stored as-is. If new chunks are duplicates of older chunks then the older chunk entries are updated to point to the new chunks. This means that deletion can simply be performed on the oldest dataset without any further checks, as all references will point to newer chunks. I will need to apply the constraint that intermediate datasets cannot be deleted. The big advantage of Forward Referencing is that it speeds up restores a lot, because the latest dataset is typically the one you want to restore and it is stored whole and read sequentially. However, Forward Referencing requires post-process deduplication in order to be performant and to avoid too much random I/O during backup. Also, technically it precludes source-side dedupe, as the data has to appear wholly on the backup store.
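A toy in-memory model of the idea, just to show the direction of the references (nothing here reflects the on-disk layout): each dataset is a table whose entries are either raw data or a forward reference, and ingesting a new dataset rewrites older duplicate entries to point at the new copies.

```python
# Toy model of forward referencing. Entries are ("data", bytes) or
# ("ref", dataset_id, index). New data is always stored whole; when it
# duplicates an older chunk, the *older* entry is rewritten to point
# forward at the new copy, so the oldest dataset can be deleted outright.
def ingest(datasets, location, dataset_id, chunks):
    """chunks: iterable of (chunk_hash, chunk_bytes); location maps a chunk
    hash to the (dataset_id, index) where its bytes currently live."""
    table = []
    for idx, (chash, data) in enumerate(chunks):
        table.append(("data", data))                 # newest copy stored whole
        if chash in location:
            old_ds, old_idx = location[chash]
            datasets[old_ds][old_idx] = ("ref", dataset_id, idx)
        location[chash] = (dataset_id, idx)          # the hash now lives here
    datasets[dataset_id] = table
```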

The third approach combines the above two. Inline dedupe is done and then a post-process optimization pass can be kicked off to re-organize the data into a forward-referenced layout. This requires temporary extra metadata space to record a log of all references per referenced extent so that we can invert the references one extent at a time. This can be somewhat tricky to get right.

At present I am looking at the first approach and intend to explore the third, the post-process optimization, at a later date.


6 thoughts on “Architecture for a Deduplicated Archival Store: Part 2”

  1. Zooko Wilcox-O'Hearn

    Oh, I see that you mentioned Tahoe-LAFS in this post. Tahoe-LAFS already has deduplication, but only at the level of granularity of a whole file. Of course, you can always break your files up (say, with unix split) before passing them to Tahoe-LAFS, but nobody does this. What I’m interested in is adding variable-length deduplication such as that implemented by “bup” and “backshift”, and whatever ideas you have, provided that they aren’t patented or otherwise troublesome for us.

    1. moinakg Post author

      Good to know Tahoe-LAFS is doing single-instance storage, which others like Ceph and GlusterFS are not attempting yet.
      I had a look at bup and backshift. Bup looks more mature and has great capabilities.

      I am using a rolling checksum in pcompress as well but have figured out an optimization that avoids scanning the entire data for breakpoints. Only 25% of the data on average needs to be scanned. This brings performance closer to fixed-length splitting. In addition to that I am not using a bloom filter. The segment similarity based matching I have come up with results in a much smaller in-memory index of 0.0023% of the data size, at the cost of 5% missed opportunities but no false positives.

      I have read a lot of literature on the topic over the last year, including research papers and patents, and I think I have skirted any issues. Of course, at the end of the day IANAL, but as far as I have seen no one is using the crypto hash and minhash concept based similarity indicator derivation.

      1. moinakg Post author

        BTW, as another point, the majority of solutions, including Bup, Data Domain etc. (except possibly IBM Storwize), use at least 8KB chunks, and frequently larger, to be scalable. In Pcompress I have been able to get scalable indexing even with 4KB chunks. Data Domain mitigates the effect of the larger chunk size by additionally using Delta Encoding for locality-derived chunks.

      2. Zooko Wilcox-O'Hearn

        Is there a write-up of this technique to avoid scanning most of the data looking for breakpoints? I don’t immediately see how that is possible.

        You didn’t answer my first question: is this a hobby project, part of your day job, the beginnings of a new for-profit project for you, or what? Thanks!

      3. moinakg Post author

        This is an open source hobby project. I do not really have a profit motive with this.

        There is a write-up of this technique, but it is easy enough to explain here. Consider a rolling hash that produces a good number of breakpoints, and also consider a minimum chunk size restriction. So we can start scanning a little before the minimum chunk size point, rather than from the beginning, and then start testing for breakpoints once the minimum is reached. Once a breakpoint is found we can skip minimum_chunk_size - small_constant bytes and start scanning again.
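        A minimal sketch of that skip-ahead idea follows, using a toy additive rolling sum rather than the actual rolling hash pcompress uses; the window size, mask and chunk-size limits are illustrative numbers only.

```python
WINDOW = 48
MIN_SIZE = 4096
MAX_SIZE = 16384
SMALL_CONSTANT = WINDOW            # resume scanning just before the minimum
MASK = 0x0FFF                      # ~1 in 4096 window positions matches

def split_chunks(data):
    """Return a list of content-defined chunks of `data` (bytes)."""
    chunks = []
    start = 0
    n = len(data)
    while start < n:
        # Start scanning a little before the minimum chunk size point.
        pos = start + MIN_SIZE - SMALL_CONSTANT
        if pos + WINDOW >= n:
            chunks.append(data[start:])        # short tail, emit as-is
            break
        rolling = sum(data[pos:pos + WINDOW])  # warm up the rolling window
        end = None
        i = pos + WINDOW
        limit = min(start + MAX_SIZE, n)
        while i < limit:
            rolling += data[i] - data[i - WINDOW]
            i += 1
            if (rolling & MASK) == MASK:       # breakpoint test
                end = i
                break
        if end is None:
            end = limit                        # forced cut at max chunk size
        chunks.append(data[start:end])
        start = end   # next scan begins MIN_SIZE - SMALL_CONSTANT bytes later
    return chunks
```

        The skip after each breakpoint is what saves the scanning work: only the bytes from just below the minimum chunk size up to the breakpoint are ever examined.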

        This is how Pcompress does the chunk splitting and it produces excellent results. I have tested with 360GB of mixed data. Results here: http://moinakg.github.io/pcompress/#deduplication-chunking-analysis

        This is one of the 5 ideas I alluded to in my Twitter post. In addition I have compared the dedupe ratio with Dirk Meister’s fs-c tool and found Pcompress dedupe ratio to be equal to or greater than what that tool estimates: https://code.google.com/p/fs-c/
