Came across this news piece on Register dated back to October: http://www.theregister.co.uk/2013/10/16/sepatons_superduper_deduper/
Potentially handles upto 16PB of backup storage with Global Deduplication – Great! The performance and features on paper are really top of the line. Beats the competition on most aspects. If I look at the deduplication features a few things look interesting vis-a-vis those that I have put into Pcompress.
It mixes “Hash-based inline deduplication” and “Post-process, content-aware deduplication”. The article is not clear what exactly this means. There are two possibilities. Firstly it can detect duplicate files during ingestion, store only a single copy and then do block-level dedupe as post-process. Secondly it can deduplicate using large chunks during backup ingestion and then dedupe using small blocks as post-process. This is of course to not hurt backup performance and scale to large datasets. Post-process deduplication is a common technique to scale deduplication without affecting I/O throughput of in-flight data. It has also been used effectively in Windows Server 2012 to do primary data deduplication.
Sepaton can do analysis of data types and change rates in data to apply the most efficient dedupe mechanism or even skip dedupe for encrypted and compressed files that virtually do not deduplicate at all.
The other interesting features include byte-level deduplication for databases that store data in block sizes less than 8KB and using flash based SSDs to store the global index. I am not sure what this “byte-level deduplication” exactly means but it appears to be a delta-differencing mechanism. Now the question is how efficient restore can be when delta-differencing is used.
In some of the posts on Pcompress design I have already mentioned about using SSDs for storing all kinds of metadata. Fast metadata access is critical and this is the logical choice. However the other “new aspect in Pcompress is the ability to use small blocks for deduplication without losing performance and giving good scalability“. This is a key feature that most of the current solutions seem to be missing. Pcompress can use blocks (or chunks) as small as 2KB without losing too much performance. With 2KB chunks it can potentially scale to 500TB of 100% random data using a 40GB in-memory global index. If the data has duplicates then the index size becomes smaller. This deduplication occurs with 95% efficiency of a full chunk index based brute-force dedupe. This single capability solves a sticky problem that dedupe solutions has been dealing with for quite some time. The metadata structure that I have discussed in earlier posts also helps with overall performance. The approach is based on similarity detection of large regions in the data stream. The chunks lists of those regions are then sequentially loaded from SSD and compared to perform actual deduplication. The similarity detection technique is simple and novel. It avoids any kind of complicated math, fuzzy hashing etc.I will detail it later.
There are other unique techniques in Pcompress like a partially vectorized rolling hash, scanning less than 30% of the data to locate chunk boundaries, parallelized deduplication and others that contribute to the overall performance. I have posted about a few of these earlier.
In addition to the above, the recent zip-like archiver capabilities that I have added into Pcompress introduce data type detection and automatic selection of filters and compression techniques to suit the data type. However the big missing piece in all this is that Pcompress is still a stand-alone utility. It needs work to turn it into an archival store where data can be ingested incrementally and selective data streams extracted for restore. Also an efficient partitioned indexing is needed to be able to scale deduplication in a cluster without losing deduplication ratio.
Among a busy personal schedule for the last two months, I have managed to work quite a bit on adding archiving features to Pcompress. Thanks to the excellent LibArchive, Pcompress can now bundle up a bunch of files into a compressed archive. This is a desirable and useful capability that was missing till date.
With the addition of archiving capability Pcompress can now perform advanced detection of file data and tweak its compression behaviour to achieve the best results. Below is a short list of features and behaviour that the github code has as of this writing:
- Pcompress enumerates the file list to be archived and sorts the files by extension/name and size using an incremental merge sort to minimize memory use. This sorting, groups related files together and clusters small files to achieve the best compression and deduplication behaviour. For example see this paper where a similar technique has been discussed to improve deduplication: https://www.usenix.org/legacy/event/atc11/tech/final_files/Xia.pdf
- File types are detected via extension and/or file header parsing for magic numbers. Compression buffers are split at boundaries where files change from one type to another to avoid mixing unrelated files in a single compression buffer. It helps to improve compression ratio.
- More importantly, this file type detection is used to apply data-specific compression techniques more effectively, making the Adaptive modes in Pcompress extremely powerful. The following data specific algorithms are used:
- LZMA – Most binary data.
- PPMD – Most Textual data.
- Libbsc – DNA Sequences, XML/HTML etc, BMP and TIFF images.
- Dispack – Preprocess 32-bit x86 executable binaries.
- PackJPG – Reduce JPEG size by upto 40%. This is new lossless JPEG compression technique by Matthias Stirner.
- Wavpack – Compress WAV files better than any regular compression technique. This is still a work in progress.
- Detect already compressed files and for some heavily compressed data just use LZ4 to suppress some internal headers and zero padding. This avoids wasting time trying to compress data that is already heavily compressed.
- There are other data specific filters around like MAFISC which I am looking at.
- For Dispack, 32-bit x86 executables are detected and the data buffer is then split into 32K blocks. Some approximate instruction statistics are checked to determine whether to Dispack that block.
- Compression buffers are split either at hash-based or data type change based boundaries improving both compression and deduplication.
- LibArchive is used as the backend archiving library whose output is passed to the buffering, deduplication and compression stages in a logical pipeline. Synchronization is kept simple by using semaphores. LibArchive runs in a single thread and the data fetch from archiver to compression is also done at a single point. Thus there is exactly one producer and one consumer. This simplifies synchronization.
- To the extent possible data copying is avoided. LibArchive’s callback routines are used to copy data directly into the compression buffers without resorting to pipes and such.
The filters like Wavpack and PackJPG need to work with LibArchive. However LibArchive does not support using external filter routines so it took a while to work out how to have external file filters pipelined before LibArchive. Note that since Pcompress uses a custom file format and consumes the output of LibArchive, there is no need for strict compatibility with standard archiver formats like Tar, Pax, Cpio etc. LibArchive for its own requirements obviously strives to attain strict conformance allowing no user-defined headers. So one of the big problems was to flag which files have been processed by a custom filter. One easy way was to add an extended attribute programmatically. However LibArchive does not provide a way to delete a single attribute during extraction. There is a call to clear all attributes! One does not want internal, programmatic use attributes to be extracted to disk. I was stuck. Eventually it turned out that I could use contextual inference. A file preprocessor like PackJPG will add its own magic header to the file. Thus during archiving I can look for a JPEG header and only then pass the file through PackJPG. During extraction I can look for the PackJPG header.
However the question comes, what if I have some PackJPG processed files and are archiving them using Pcompress? Won’t it revert to normal JPEG during extraction even though I do not want it to? Well the filename extension is also checked. During archiving, normal JPEGs are filtered but their extension remains as jpg or jpeg. So only files having a Jpeg extension but having a PackJPG header are unpacked during extraction. If you use the standalone PackJPG utility to pack your JPEGs, then will get a .pjg extension which will be untouched by Pcompress filters during extraction. However, truely speaking, LibArchive needs to add a simple xattr deletion function to avoid all this jugglery.
File types during archiving, are detected by a combination of filename extension and magic header inspection. To lookup filename extensions one obviously needs a hashtable. However there is a bit of detail here. I have predefined list of known filename extensions with their corresponding file types, so instead of using a general hash function I needed a perfect hash function. That is, the number of slots in the table is the number of keys and each known key maps to one slot. An unknown key can be easily found by comparing with key value at the slot, or if the slot number lies outside the table range. I used the old ‘Minimal Perfect Hashing’ technique courtesy of Bob Jenkins. It works nicely for fast hashing of filename extensions.
The next item to do is to support multi-volume archives. This is quite easy to do since Pcompress already splits data into independent buffers, each with its own header. So a volume needs to contain a set of compressed buffers with some sequence indicator so that they can be correctly concatenated together to restore the original archive.
I recently came across this old blog by Nimble storage co-founder Umesh Maheshwari: http://www.nimblestorage.com/blog/technology/better-than-dedupe-unduped/
The post has intriguing views on inline compression vs Dedupe and the approach of using snapshots on primary storage as a short-term backup mechanism is quite cool. COW snapshots by definition avoid duplicates the instant they are created. In addition consecutive snapshots will share common blocks with no special effort. This aspect not much different from ZFS. It has the same real-time compression and snapshot features and the same benefits apply there as well.
However it is the conclusions and the graphs that I humbly find missing some points and even misleading to an extent. Deduplication does not only remove duplicates from successive backups but it can remove internal duplicates within the volume. It helps compression in turn. In all my tests with Pcompress I have found Deduplication providing additional gain when added to standard compression. See the entry for Pcompress here: http://www.mattmahoney.net/dc/10gb.html. ZFS for that matter provides deduplication as well, though it has scalability bottlenecks.
While inline compression is cool, compression works within a limited window. The window size varies with the algorithm. For example Zlib uses a 32KB window, Bzip2 uses 900KB max, LZMA can use a gigabyte window size. Repeating patterns or redundancy within the window are compressed. Deduplication finds duplicate blocks within the entire dataset of gigabytes to hundreds of terabytes. There is no theoretical window size limitation (Though practical scaling considerations can put a limit). So I really cannot digest that just snapshotting + inline compression will be superior. Deduplication + snapshotting + inline compression will provide greater capacity optimization. For example see the section on “Compression” here: http://www.ciosolutions.com/Nimble+Storage+vs+Netapp+-+CASL+WAFL
Now if Post-Process Variable-Block Similarity based deduplication (of the type I added into Pcompress) can be added to ZFS, things will be very interesting.
Pcompress uses an in-memory hash table to find similar regions of data, loads the chunk hashes of those regions and matches them to find exact chunk matches for deduplication. Currently this works on a single archive producing a compressed file as the output. The index is not required during decompression and is discarded after use. An archival store on the other hand needs to deal with multiple archives and needs a persistent index to keep deduplicating across archives that get added to the store. So there is an archive server which receives data streams from clients. Clients are responsible for identifying the files to be backed up, rolling them into a archive format like tar and sending the data stream to the server. The archive server holds the aforementioned similarity index.
The similarity index in Pcompress is a hash table, so looking at persistence, one quickly thinks of the numerous NoSQL solutions. Evaluate and benchmark one and use it. Even SQlite can be looked at, as it is embeddable, fast and reliable. Good enough for the use case at hand. After pondering this somewhat, it occurred to me that my approach in Pcompress is a scalable in-memory index. The key thing here is the “in-memory” piece. The design is centered around that and does not use an index that can overflow to disk. This in turn means that I do not require an on-disk index format. I just need to stash the records in a file and load them into an in-memory hash when the archive server daemon is started. So all I need is a simple flat file with a sequence of fixed-length records. When a new KV pair is added to the hash it is first appended to the flat file. This is almost like journaling with the journal being the data store itself. Write I/O remains largely sequential. When keys are deleted it can marked invalid in the flat file. A separate cleanup process (like Postgres VACUUMDB) can be used to eliminate the deleted records. Marking deleted records requires in-place updates which is simple because records are of fixed-length.
Thus I can dispense with a NoSQL library and keep things very simple and fast. This approach is similar to the Sparkey KV store from Spotify. The primary differences being the hash not stored to disk and ability to do deletes. Of course unlike Sparkey I want robustness against data corruption. Firstly I will use sync writes to the flat file. Secondly the record will be inserted into the in-memory hash only after disk write is successful.
The car parking is open . . . . and closed!
Came across this interesting paper from the recently concluded 22nd USENIX Security Symposium: https://www.usenix.org/conference/usenixsecurity13/encryption-deduplicated-storage-dupless. The System is called DupLESS.
Typically, mixing encryption and deduplication does not yield good results. As each user uses their own key to encrypt, the resulting ciphertexts are different even if two users are encrypting the same files. So deduplication becomes impossible. Message Locked Encryption was mooted to work around this. To put it simply this encrypts each chunk of data using it’s cryptographic hash as the key. So two identical chunks will produce identical ciphertexts and and can still be deduplicated. However this leaks information and it is possible to brute force plaintexts, if data is not thoroughly unpredictable. Also, as another example, it is possible for privileged users having access to the storage server to store files and check their deduplication with other user’s data, thereby getting an idea of other user’s contents even if they are encrypted.
The DupLESS system above introduces a Key Server into the picture to perform authentication and then serve chunk encryption keys in a secure manner. From my understanding this means that all users authenticating to the same key server will be able to deduplicate data amongst themselves. An organization, using a cloud storage service, which does deduplication at the back-end will be able to deduplicate data among it’s users by using a local secured key server. This will prevent the storage provider or any external privileged user from gleaning any information about the data. A trusted third party can also provide a key service that can be shared among groups or categories of users, while not allowing the storage provider access to the key service. Neither the key server nor the storage service can glean any information about the plaintext data.
Very interesting stuff. The source code and a bunch of background is available at this link: http://cseweb.ucsd.edu/users/skeelvee/dupless/
While searching and reading up on various conference proceedings, I came across this awesome presentation by Dawn Foster (Puppet Labs) at the 2013 USENIX Women in Advanced Computing Summit. Even though the conference deals with technology participation of the fairer gender, this is one video that I think, applies to every person irrespective of gender, on a technology career path. You can watch the video or download it here: https://www.usenix.org/conference/wiac13/building-successful-technology-career
The video is 45 mins long but it is 45 mins worth spent. There are many points from the presentation which I can harmonize with, having been through an interesting 15 year tech career till date.