Category Archives: Technical

Persisting the In-Memory Hash Index in Pcompress

Pcompress uses an in-memory hash table to find similar regions of data, loads the chunk hashes of those regions and matches them to find exact chunk matches for deduplication. Currently this works on a single archive producing a compressed file as the output. The index is not required during decompression and is discarded after use. An archival store on the other hand needs to deal with multiple archives and needs a persistent index to keep deduplicating across archives that get added to the store. So there is an archive server which receives data streams from clients. Clients are responsible for identifying the files to be backed up, rolling them into a archive format like tar and sending the data stream to the server. The archive server holds the aforementioned similarity index.

The similarity index in Pcompress is a hash table, so looking at persistence, one quickly thinks of the numerous NoSQL solutions. Evaluate and benchmark one and use it. Even SQlite can be looked at, as it is embeddable, fast and reliable. Good enough for the use case at hand. After pondering this somewhat, it occurred to me that my approach in Pcompress is a scalable in-memory index. The key thing here is the “in-memory” piece. The design is centered around that and does not use an index that can overflow to disk. This in turn means that I do not require an on-disk index format. I just need to stash the records in a file and load them into an in-memory hash when the archive server daemon is started. So all I need is a simple flat file with a sequence of fixed-length records. When a new KV pair is added to the hash it is first appended to the flat file. This is almost like journaling with the journal being the data store itself. Write I/O remains largely sequential. When keys are deleted it can marked invalid in the flat file. A separate cleanup process (like Postgres VACUUMDB) can be used to eliminate the deleted records. Marking deleted records requires in-place updates which is simple because records are of fixed-length.

Thus I can dispense with a NoSQL library and keep things very simple and fast. This approach is similar to the Sparkey KV store from Spotify. The primary differences being the hash not stored to disk and ability to do deletes. Of course unlike Sparkey I want robustness against data corruption. Firstly I will use sync writes to the flat file. Secondly the record will be inserted into the in-memory hash only after disk write is successful.

New approach to secure Deduplication: DupLESS

Came across this interesting paper from the recently concluded 22ndĀ  USENIX Security Symposium: https://www.usenix.org/conference/usenixsecurity13/encryption-deduplicated-storage-dupless. The System is called DupLESS.

Typically, mixing encryption and deduplication does not yield good results. As each user uses their own key to encrypt, the resulting ciphertexts are different even if two users are encrypting the same files. So deduplication becomes impossible. Message Locked Encryption was mooted to work around this. To put it simply this encrypts each chunk of data using it’s cryptographic hash as the key. So two identical chunks will produce identical ciphertexts and and can still be deduplicated. However this leaks information and it is possible to brute force plaintexts, if data is not thoroughly unpredictable. Also, as another example, it is possible for privileged users having access to the storage server to store files and check their deduplication with other user’s data, thereby getting an idea of other user’s contents even if they are encrypted.

The DupLESS system above introduces a Key Server into the picture to perform authentication and then serve chunk encryption keys in a secure manner. From my understanding this means that all users authenticating to the same key server will be able to deduplicate data amongst themselves. An organization, using a cloud storage service, which does deduplication at the back-end will be able to deduplicate data among it’s users by using a local secured key server. This will prevent the storage provider or any external privileged user from gleaning any information about the data. A trusted third party can also provide a key service that can be shared among groups or categories of users, while not allowing the storage provider access to the key service. Neither the key server nor the storage service can glean any information about the plaintext data.

Very interesting stuff. The source code and a bunch of background is available at this link: http://cseweb.ucsd.edu/users/skeelvee/dupless/

 

Pcompress 2.3

I have been fairly busy over the past few months with multiple things demanding my attention at the same time both at work and at home. In addition, I was doing a bunch of testing and analysis of the algorithms I am using in Pcompress, which resulted in some useful findings, improvements and fine tunings of the parameters. I also wanted to test with a few hundred GBs worth of data and had to procure additional disks to perform testing within realistic time scales. The combination of a Stardom iTANK external enclosure and a WD Caviar Black 2TB drive over eSATA worked nicely providing upto 137MB/s of sustained sequential read throughput. I have finally managed to pull together all the changes and put out a 2.3 release.

Among other things, I have made some changes to the KMV Sketch computation and updated it to select unique similarity indicators. Earlier it was just selecting the lowest K indicators, some of which were duplicates. This also allowed me to reduce the number of IDs used for the similarity match. This in turn results in a smaller index while improving the similarity match at the same time. Data splitting between threads is also improved now, resulting in fewer unbalanced chunks. These improvements and some code cleanups have resulted in performance improvements as well. As of this release, the similarity based near-exact deduplication is quite faster than exact deduplication using a simple chunk index with upto 98% effectiveness as compared to the latter.

One of the changes in this release is to move all the functionality into a shared library and make the main executable as a simple wrapper. This will allow me to add a programmatic API going forward. The release can be downloaded from here: http://code.google.com/p/pcompress/downloads/detail?name=pcompress-2.3.tar.bz2

Nonblocking Algorithms for Multicore processors

Came across this awesome article on ACM Queue: http://queue.acm.org/detail.cfm?id=2492433. This is a very nice exposition of the topic with example code and detailed walk through of the effects of multiple access on shared state. The other interesting piece in this same context is of course transactional memory in Intel’s Haswell processor described here: http://www.realworldtech.com/haswell-tm/

 

Parallel Directory Tree Compare in Python

A friend asked me if there is any way to quickly use multiple threads to compare two directory trees. He was dealing with data migration scenarios with directory trees containing potentially millions of files and he would like to compare the copy with the original. A normal recursive diff would take days.

After trying out a few experiments with shell scripts and some such it finally boiled down to Python and it’s great multiprocessing module to get the work done. The full code is pasted below and is mostly self-explanatory. On my dual-core hyperthreaded laptop it takes half the time as compared to a recursive diff (with 26000 files). However this thing is primarily I/O bound and the tinny laptop’s drive cannot keep up with the I/O demands from multiple threads. A good server will give decent speedup compared to normal diff.

import os, sys
import hashlib
import stat
import multiprocessing
from multiprocessing import Process, JoinableQueue

num_procs = multiprocessing.cpu_count()
num_procs *= 4

def get_hash(fh, sz):
        h = hashlib.sha256()
        while True:
                buf = fh.read(65536)
                if len(buf) == 0: break
                h.update(buf)
        return h.digest()

def chk(path, lf, dir):
        path1 = os.path.join(dir, path)
        st = None
        st1 = None
        try:
                st = os.lstat(path)
                st1 = os.lstat(path1)
        except:
                lf.write("Missing: " + path1 + "\n")
                lf.flush()
                return

        if not stat.S_ISREG(st.st_mode):
                return
        if st.st_size != st1.st_size:
                lf.write("Size differ: " + path1 + "\n")
                lf.flush()
                return
        if st.st_size == 0: return

        hv = None
        hv1 = None
        try:
                fh = open(path, "r")
                hv = get_hash(fh, st.st_size)
                fh.close()
                fh = open(path1, "r")
                hv1 = get_hash(fh, st.st_size)
                fh.close()
        except:
                lf.write("Open error: " + path1 + "\n")
                lf.flush()
                return

        if hv != hv1:
                lf.write("Digests differ: " + path1 + "\n")
                lf.flush()

def proc_chk(q, lf, dir):
        while True:
                path1 = q.get()
                if path1 == "done":
                        break
                chk(path1, lf, dir)
                q.task_done()
        q.task_done()

q = JoinableQueue()
lf = open("/var/tmp/differ_files", "w+")
dir1 = os.path.realpath(sys.argv[1])
dir2 = os.path.realpath(sys.argv[2])

o_cwd = os.getcwd()
os.chdir(dir1)
cwd = os.getcwd()

for i in range(0, num_procs):
        p = Process(target=proc_chk, args=(q, lf, dir2))
        p.start()

for dirpath, dirnames, filenames in os.walk(".", followlinks=False):
        for f in filenames:
                q.put(os.path.join(dirpath, f))

for i in range(0, num_procs):
        q.put("done")q.join()
lf.close()
os.chdir(o_cwd)

It should be called like this: python pdiff.py <dir1> <dir2> . It will generate a list of any differing files in /var/tmp/differ_files.

High Performance Content Defined Chunking

In Pcompress, I have implemented a variant of the rolling hash based Content Defined Chunking that provides both deduplication accuracy and high performance. This post attempts to explain the chunking process, covers the chunking computations that are done in Pcompress and then talks about the new optimizations for very fast sliding window chunking (on the order of 500MB/s to 600MB/s throughput depending on processor).

Background

Data Deduplication requires splitting a data stream into chunks and then searching for duplicate chunks. Once duplicates are found only one copy of the duplicate is stored and the remaining chunks are references to that copy. The splitting of data into chunks appears to be an ordinary process but is crucial to finding duplicates effectively. The simplest is of course splitting data into fixed size blocks. It is screaming fast, requiring virtually no processing. It however comes with the limitation of the data shifting problem.

The diagram below illustrates the problem. The two 64-character patterns are mostly similar with only two characters differing. Initially fixed-block chunking provides good duplicate detection. However the insertion of a single character at the beginning shifts the entire data while chunk boundaries are fixed. So no duplicates are found even though the patterns are mostly similar.

static_chunkingThe diagram shows insertion, but the same thing can happen for deletion. In general with static chunking duplicate detection is lost after the point where insertion or deletion has taken place.

In order to deal with this, most dedupe solutions use content defined chunking that mark cut points based on patterns in the data. So if the data patterns shift the cut points also shift with them. The diagram below illustrates.

dynamic_chunkingThe chunks are split based on patterns in data so they are of variable length (but average size is close to the desired length). Since the chunk boundaries shift along with the data patterns, duplicates are still found. Only the modified chunks are unique.

The Rolling Hash computation

Now the question comes as to what data patterns to look out for when determining the chunk boundaries or cut points? The common technique is to compute a hash value of a few consecutive bytes at every byte position in the data stream. If the hash value matches a certain predefined pattern we can declare a chunk boundary at that position. To do this computation efficiently a technique called the rolling hash was devised. It uses a sliding window that scans over the data bytes and provides a hash value at each point. The hash value at position I can be cheaply computed from the hash at position I-1. In other words H(X_{(i,n)}) \Leftarrow (H(X_{(i-1,n)}) + X_i - X_{(i-n)}) \bmod M where ‘n’ is the window size and X_{(i,n)} represents the window bytes at byte position ‘i’. In mathematical terms this is a recurrence relation. Rolling hashes have been used in contexts like Rabin-Karp substring search and Rsync. Today they are used extensively in chunk splitting in the context of data deduplication.

One of the common rolling hashes used in Data Deduplication is Rabin Fingerprinting devised originally by Turing award winner Michael O. Rabin in his seminal paper titled “Fingerprinting By Random Polynomials“. The mathematically inclined will enjoy reading it. There are other rolling hash techniques such as the one used in Rsync, the TTTD algorithm devised by HP, the FBC algorithm etc.

rolling_hashWhile I am not so much of a mathematically inclined person I still needed a good rolling hash in order to do content defined chunking in Pcompress. After looking at various implementations like the one in LBFS and few otherĀ open-source software like n-gram hashing, I came up with an approach that worked well and produced average chunk sizes close to the desired value.

I used a small sliding window of 16 bytes that produces a 64-bit fingerprint at each byte position requiring an addition, subtraction, multiplication and conditionally an XOR for each byte position. It would declare a chunk boundary if the bottom Y bits of the fingerprint were zero. The value of Y would depend on the average chunk size desired. For example for 4KB average size one would look for bottom 12 bits to be zero. The core of the approach is derived from Rabin Fingerprinting. A good description is here: http://www.infoarena.ro/blog/rolling-hash. The hashing approach is a multiplicative scheme of the form:

rollhash = (rollhash * PRIME + inbyte - outbyte * POW) \% MODULUS

Where inbyte is Incoming byte into sliding window head, outbyte is outgoing byte from sliding window tail and POW = (PRIME ^ {windowsize}) \% MODULUS. The PRIME number I am using is the same value used by Bulat Ziganishin in his SREP tool. Experimentation showed it to produce good results. In addition to this I precompute a table using the irreducible polynomial (represented in GF(2)) from LBFS. The outbyte is used to index the table and the value is XOR-ed with the hash value to produce the final fingerprint. I did some analysis of the chunking approach which is documented in two earlier posts. The results were good.

A window size of only 16 bytes will raise some eyebrows as typically much larger windows are used. LBFS for example used a 48-byte window and others have used even larger windows. However in practice, as is evident from the above analysis, this implementation does produce good results and the window size of 16 bytes allows an optimization as we will see below.

Optimizations

While addition, multiplication are extremely fast on modern processors, performance overheads remained. Even though I was using a small window of 16 bytes it still required performing computations over the entire series of bytes in order to find cut points. It is very much computationally expensive compared to the simple splitting of data into fixed-size chunks. A couple of optimizations are immediately apparent from the above hash formula:

  • Since we are dealing with bytes it is possible to pre-compute a table for outbyte * POW
  • The MODULUS operation can be replaced with masking if it is a power of 2.

This gives some gains however the overhead of scanning the data and constantly updating a sliding window in memory remains. Eventually I implemented a couple of key new optimizations in Pcompress that made a significant difference:

  • Since the sliding window is just 16 bytes it is possible to keep it entirely in a 128-bit SSE register.
  • Since we have minimum and maximum limits for chunk sizes, it is possible to skip minlength – small constant bytes after a breakpoint is found and then start scanning. This provides for a significant improvement in performance by avoiding scanning majority of the data stream.

Experimentation with different types of data shows that the second optimization results in scanning only 28% to 40% of the data. The remaining data are just skipped. The minimum and maximum limits are used to retain a distribution of chunk sizes close to the average. Since rolling hash cut points below the minimum size are ignored it does not make sense to scan that data.

opt_dynamic_chunkingAll these optimizations combined provide an average chunking throughput of 530 MB/s per core on my 2nd generation Core i5 running at 2.2 GHz. Of course faster, more recent processors will produce better results. The throughput also depends on the nature of the data. If the data has a very specific pattern that causes more large chunks to be produced the performance degrades (Think why this should be the case). This brings us to the worst case behaviour.

Worst Case performance profile

The worst case performance profile of the optimized chunking approach happens when all chunks produced are of the maximum size. That is the data is such that no breakpoints are produced resulting in a degeneration to the fixed block chunking behaviour at max chunksize of 64KB and at the cost of rolling hash computation overhead. In this case the majority of the data is scanned and computed, but how much ?

If we assume minimum chunk size of 3KB, maximum 64KB and 100MB data we will have 100MB / 64KB = 1600 chunks (considering worst case all max-length chunks). For every chunk 3KB - small constant of data will be skipped. In my current implementation the value of small constant is 256, though it can be smaller. So the actual skipped size is 3072 - 256 = 2816 bytes. In total the number of skipped bytes will be 2816 * 1600 = 4505600 bytes out of 100MB data. In percentage terms it is 4505600 / 104857600 * 100 = 4.29\%. In other words 95% of the data will be scanned degrading the performance by more than half.

Now the question is what kind of data will produce this worst case behaviour? If you have seen the rolling hash computation details in Pcompress above, the eventual fingerprint is computed via an XOR with a polynomial computation result from a table. Those values are non-zero and we check for breakpoints based on bottom 12 bits of the fingerprint being zero. So if the computed hash is zero the XOR will set the bits and bottom 12 bits will become non-zero. The hash will be zero if the data is zero. That is if we have a file of only zero bytes we will hit the worst case.

I created a zero byte file and tested this and got a throughput of 200 MB/s and all chunks of the max 64KB length. In real datasets zero byte regions can happen, however very large entirely zero byte files are uncommon, at least to my knowledge. One place having zero byte regions is VMDK/VDI files. So I tested on a virtual harddisk file of a Fedora 18 installation in VirtualBox and still got a majority of 4KB chunks but with a small peak at 64KB. The throughput was 490 MB/s with approx 41% of the data being scanned. So even a virtual harddisk file will have non-zero bytes inside like formatting markers. It is rare to get 100s of megabytes of files with only zero bytes. Finally from an overall deduplication perspective such files will be deduplicated maximally with almost 98% data reduction and final compression stage will also be extremely fast (only zero bytes). So even though chunking suffers, overall deduplication will be fast.

Footnote

If you are interested to look at the implementation in Pcompress, it is here: https://github.com/moinakg/pcompress/blob/master/rabin/rabin_dedup.c#L598

The Funny KVM benchmarks

RedHat Summit 2013 concluded recently and while browsing some of the presentation PDFs I came across something funny. In general the content is good and there is a bunch of interesting stuff available. However this particular PDF ruffled me up: http://rhsummit.files.wordpress.com/2013/06/sarathy_t_1040_kvm_hypervisor_roadmap_and_overview.pdf

This presentation talks about KVM technology in general with a bunch of marketing content thrown in which is all fine. However fast forward to slide 12 and something looks odd. The slide seems to scream KVM’s outstanding performance on SPECvirt_sc2010 as compared to ESXi5/4. Great isn’t it ? The “Eureka” feeling lasts till you look at the bottom of the graphs. Every comparison is done on dissimilar hardware! Suddenly Archimedes comes crashing to the floor.

Take for example the 2-socket 16-core benchmarks. The HP DL385 G7 box is a Generation 7 AMD bulldozer piece while DL380p Gen8 is a Generation 8 Sandy Bridge piece. RedHat is putting ESXi5 on an older generation hardware and KVM on the latest, greatest. If we consider the highest bin processors then the DL385 will get AMD Opteron 6220, 3.0 GHz processors with 16MB cache while DL380p will get Xeon E5-2690, 2.9 GHz processors with 20MB cache. Even if the Opteron’s clock is marginally higher a Bulldozer is simply no match for a big juicy Sandy Bridge beast. Second the Bulldozers get HT links with 6.4 GT/s throughput while the Xeons get QPI with 8 GT/s throughput. The Gen7 box gets PCIe Gen 2.0 while Gen 8 boxes get PCIe Gen 3.0. Similarly the story goes on and on. So we have a no-contest here. The Gen8 box wins hands down even if one puts fewer VMs on the Gen7 box.

Let’s look at the 4-socket 40 cores comparo. First the two boxes are from two different vendors. Second they are comparing ESXi4.1 with latest KVM. Whatever happened to ESXi5 here ? Does it not support that hardware ? At least the processors on the two boxes IBM x3850 x5 and DL580 G7 are comparable 10-core Xeon E7-4870 ones (considering the highest bin 10-core processors). However older ESX version skews the game.

Similarity the processors on the other comparisons are similar but the ESX version is older one that everyone is migrating off. If I am going to do a comparison, I will install latest ESX on a hardware, measure, reinstall latest KVM on the same hardware and measure not play games.

RedHat is nonchalantly tying one hand behind ESX’s back. Helpfully for the marketing fuzz types we have this fine print at the bottom: “Comparison based on best performing Red Hat and VMware solutions by cpu core count published at http://www.spec.org”. That is we are going by earlier measurements that our competitors published, so everyone chant after us: KVM is faster than ESX, KVM is faster than ESX, KVM is faster than ESX … ah well, let me grab that can of Diet Coke sitting nearby (or should it be salt rather?).

Disclaimer

I am NOT a Linux or KVM hater. On the other hand I use Linux Mint day in and day out and work with open-source in general. However above all I am a technologist and I like to take things as they really are, free of all the fuzz. Fuzz dilutes the values that various technologies bring to the table.