With the current release of 0.9.1 Pcompress now comes with a whole bunch of features and capabilities that are close to being a logically complete set for a stand-alone data compression utility. It will be hard to find another open-source utility that has the same combination of features and capabilities in a single utility. There are a few other stuff that can be added from a single utility point of view the biggest being a GUI and possibly portability to Windows. However beyond all this I have been looking at extending Pcompress to being a more comprehensive archival mechanism. Not necessarily a full Backup solution though. For that we have excellent software like Bacula and Amanda. However it can be an add-on for those providing an archival data store with capabilities that they lack.
I am listing out some of the thoughts and new ideas that have been crawling inside my cranium for the past few weeks.
Pcompress does Deduplication but only at the level of individual chunks. It is a logical next step to evolve it into a complete deduplication archival store on the lines of commercial products like Data Domain, Sepaton, Permabit Albireo, HP StoreOnce etc. There are big challenges in this area. The biggest being the disk bottleneck arising out of index lookups for chunk hashes. This becomes increasingly a problem as we scale into hundreds of TB and to Petascale. Then comes efficient storage of chunks, efficient retrieval of archived data, encryption of deduplicated data, sources-side dedupe to reduce network traffic etc. The industry has come up with a variety of solutions and optimizations but nothing significantly comparable is available in the open-source domain.
A couple of the open-source solutions are SDFS and LessFS. I have tried both and did not have much success with LessFS as it would hang every now and then in my Fedora 16 laptop. However SDFS worked reliably and fast. I am sure I am doing something wrong in setting up LessFS. Both of these are in-line dedupe filesystems that work on fixed-block boundaries. In addition the backup solutions like Bacula and Amanda only do whole-file dedupe or, in other words, single-instance storage.
For Pcompress I am looking at providing more advanced capabilities. Pcompress already has sliding-window content-aware chunking capability. The challenge is to design a scalable deduplicated archival store that can handle Petabytes. To this effect I have been reading up on a bunch of research papers and capabilities touted by various products and have formed some detailed design ideas. Some of the features required for this capability include:
- Sparse Indexing: We need to use some approximate matching capability to probabilistically identify duplicate and unique chunks with low memory overhead. Data Domain for example uses a Bloom Filter that they call a Summary Vector to quickly identify unique chunks along with other capabilities like locality based caching. Other different approaches include Albireo from Permabit, and SILT. In that respect the following paper seems very interesting: https://code.google.com/p/binarywarriors/downloads/detail?name=INLINE_DEDUP.pdf&can=2&q= . What is more interesting is this HP paper that reflects a more polished form of that paper: http://static.usenix.org/events/fast09/tech/full_papers/lillibridge/lillibridge.pdf . This is essentially a locality based grouping of chunks along with a stratified sampling to do group-level approximate matching on a smaller index size. Locality based grouping is very helpful and instead of doing stratified sampling I have a different idea of using similarity matching at various confidence levels to find related groups. My idea will enable deduplicating 1 PB of data requiring just 14GB of memory for the index.
- Source side dedupe: Doing chunking and hashing at the source side can reduce load on the backup server and not sending chunks already present on the server reduces network bandwidth drastically. In addition the archive data itself will have many duplicates inside it even before comparing with the previous backup archive. So that will help as well. One extra thing I am looking at here is for the server to have a configurable capability to re-compute the chunk hashes and verify. This is an extra security mechanism that prevents malicious uploading of garbage chunks with invalid hashes. This of course puts extra load on the server but current gen Sandy Bridge or Bulldozers are solidly up to the task.
- Forward Referencing: Typically as backup archives keep on entering the deduplication data store their chunks keep pointing to older chunks that are duplicates. The first archive will be stored in full (after in-archive dedupe) and chunks in subsequent archives keep pointing backwards to duplicate chunks in older archives. So restoring from the newest backup will have lower performance due to the numerous random reads that it will have to do read those referenced chunks from older archives. Older archives will restore faster! In some documents this is called re-hydration. Logically we want the reverse to happen. If we can make chunks in older archives point to duplicate chunks in newer archives then the newest archive can be stored sequentially in it’s logical chunk order. However doing this in-line is not practical because many references need to be updated. So this is typically done with post-process dedupe and Sepaton is the only vendor doing this. This however precludes the use of Source-side dedupe. I have given this a lot of thought and come up with an alternative that can give the benefits of normal in-line and source-side dedupe while also providing forward-referencing capability.
- Security Considerations: Suffice it to say that this paper has a comprehensive coverage on this topic: http://www.ssrc.ucsc.edu/Papers/storer-storagess08.pdf .
- On-disk Container-Based storage mechanism: Storing chunks on the disk is another can of worms. Typical dedupe products implement a mini-filesystem for this. Storing individual groups of chunks in individual files is too expensive as there will be too many of them and filesystem metadata-lookup overheads will have significant impact. Chunk groups or segments can be further grouped into larger containers to solve this. I do not intend to implement a low-level mini-filesystem. Just use existing scalable filesystems like XFS.
- Replication: This is obviously a very important need. Especially with deduplicated data the loss of a single chunk invalidates all references to it and corrupts a whole host of files and archives. So replication is mandatorily needed as a protection against this. Instead of trying to build something I am looking at leveraging existing software like the excellent GlusterFS.
- GPGPU Acceleration: Chunking and hashing can be speeded up by leveraging GPGPUs. This paper provides the background: http://static.usenix.org/events/fast/tech/full_papers/Bhatotia2-5-12.pdf . Very interestingly Keccack has been ported to work optimally on Nvidia CUDA: https://sites.google.com/site/keccaktreegpu/ .
- Clustering: Distributing deduplication storage tasks across a cluster of computers can help greatly with scalability as opposed to a single appliance approach. It is possible to distribute chunks or segments across a cluster of computers based on their hash values. SDFS uses this technique. This however affects dedupe ratio.
- Integration with backup software: Neither Bacula nor Amanda includes variable-block dedupe capability.
- GUI: Everybody likes a GUI … everybody likes gui …. well except for Unix CLI freaks (myself included).
- Windows Port: This will broaden the usage base but requires having a GUI first.
- Reorder Files: Sort files based on similarity, size and filename. This should give better dedupe and compression results.
- Backup to online storage: Amazon S3, Glacier, DropBox etc should be supported as back-end storage targets.