Tag Archives: Pcompress

Posts related to the Pcompress utility.

Updated Compression Benchmarks

Pcompress has gone through a sea of changes since the last time I ran benchmarks comparing its performance and effectiveness with other utilities. So I spent several days running various benchmark scripts, generating and collating a lot of results in the process.

Due to the sheer volume of the results and limited time, I took the easy way out: I imported all the CSV data into Excel, formatted and charted it, and exported to HTML. The generated HTML looks complex and messy, but at least it renders correctly in Firefox, Chrome and IE.

The first set of results can be seen here: http://moinakg.github.io/pcompress/results1.html. This is basically comparing pcompress with Segment-level and Global Deduplication to other standard utilities. It also contrasts effectiveness of Global Dedupe with Segment-level Dedupe.

The Datasets used

  1. A tar of the VMDK files of an installed CentOS 6.2 x86-64 system.
  2. Linux 3.6 RC2 source tarball.
  3. Two copies of the Silesia corpus tar concatenated together. This results in a file that is double the size of the original Silesia corpus but has 100% duplicate data.
  4. A tarball of the “Program Files” directory on my 32-bit Windows 7 installation.

Some Observations

  1. As is quite clear, Pcompress is both faster and more effective than the standard utilities tested: Gzip, Bzip2, 7za, Xz and Pxz (Parallel Xz).
  2. As usual, Xz performs the worst; the time graph shows a steep spike. Pxz is a lot better but is still only about half as fast as Pcompress. In addition, remember that Pcompress carries a bunch of processing overheads that the other utilities do not have: SHA256, BLAKE2, LZP and Delta2 processing.
  3. Interestingly, the LZ4 mode along with Dedupe and all the preprocessing produces results that are close to traditional Gzip while being more than twice as fast. In fact two datasets show results smaller than Gzip's. This result is notable when one wants good compression done extremely fast.
  4. Global Dedupe of course is more effective than Segment-level Dedupe, but what is more surprising is that it is also faster overall, even though Global Dedupe requires serialized access to a central index while Segmented Dedupe is fully parallel. I can attribute this to three causes: my test system is low-end with constrained RAM bandwidth and conflicts arising from parallel access; Segment-level dedupe also uses memcmp() while Global Dedupe does not; Global Dedupe reduces data further, resulting in less work for the final compression algorithm.
  5. The concatenated Silesia corpus with 100% duplicate data of course shows the maximum benefit from Global Dedupe that removes long-range redundancies in data.
  6. In some cases compression levels 9 and 14 show marginally less compression than level 6. This appears to be because of LZP side-effects. At higher levels, LZP parameters are tweaked to work more aggressively, so it may be removing a little too much redundancy, which hurts the compression algorithm's effectiveness. This is something that I will have to tweak going forward.

I will be posting more results soon and will include a comparison with Lrzip, which uses an improved Rzip implementation to take out long-range redundancies in data at a finer granularity than 4KB variable-block Deduplication.

Pcompress 2.1 released with fixes and performance enhancements

I just uploaded a new release of Pcompress with a load of fixes and performance tweaks. You can see the download and some details of the changes here: https://code.google.com/p/pcompress/downloads/detail?name=pcompress-2.1.tar.bz2&can=2&q=

A couple of the key changes are improved Global Dedupe accuracy and the ability to set the dedupe block hash independently of the data verification hash. From a conservative viewpoint the default block hash is set to the proven SHA256. This can however be changed via an environment variable called ‘PCOMPRESS_CHUNK_HASH_GLOBAL’. SKEIN is one of the alternatives supported for this. SKEIN is a solid NIST SHA3 finalist with a good amount of cryptanalysis done and no practical weakness found, and it is also faster than SHA256. These choices give a massive margin of safety against random hash collisions and unexpected data corruption, considering that other commercial and open-source dedupe offerings tend to use weaker options like SHA1 (collision attack found, see below), Tiger24 or even the non-cryptographic Murmur3-128, all for the sake of performance. Admittedly, some of them did not have too many choices at the time development started on those products. In addition, even with a collision attack it is still impractical to craft a working exploit against a dedupe storage engine that uses SHA1, such as Data Domain, and corrupt stored data.

The Segmented Global Dedupe algorithm used for scalability now gives around 95% of the data reduction efficiency of simple full chunk index based dedupe.

Pcompress 2.0 with Global Deduplication

For the last few weeks I have been heads-down busy with a multitude of things at work and in my personal life, with hardly any time for anything else. One of the biggest items that kept me busy during my spare time has of course been the release of Pcompress 2.0.

This release brings to fruition some of the hobby research work I had been doing around scalable deduplication of very large datasets. Pcompress 2.0 includes support for Global Deduplication, which eliminates duplicate chunks across the entire dataset or file. Pcompress already had support for Data Deduplication, but it removed duplicates only within a segment of the data. The larger the segment size, the more effective the deduplication. This mode is very fast since there is no central index and no serialization, but dedupe effectiveness is limited.

Global Deduplication introduces a central in-memory index for looking up chunk hashes. Data is first split into fixed-size or variable-length Rabin chunks as usual. Each 4KB (or larger) chunk of data has an associated 256-bit or larger cryptographic checksum (SHA256, BLAKE2 etc.). These hashes are looked up and inserted into a central hashtable. If a chunk hash entry is already present in the hashtable then the chunk is considered a duplicate and a reference to the existing chunk is inserted into the datastream. This is a simple full chunk index based exact deduplication approach which is very effective using 4KB chunk sizes. However there is a problem.
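Conceptually, the full chunk index boils down to something like the following sketch in C. The names, structure layout and helper functions are illustrative only and not the actual Pcompress implementation:

    #include <stdint.h>

    /* One entry in the full chunk index: crypto hash -> location of the chunk. */
    typedef struct chunk_entry {
        uint8_t  hash[32];        /* e.g. SHA256 of the chunk */
        uint64_t offset;          /* where the whole chunk lives in the stream */
        uint32_t length;
        struct chunk_entry *next; /* hashtable collision chain */
    } chunk_entry_t;

    /* Illustrative hashtable helpers, assumed to exist elsewhere. */
    chunk_entry_t *index_lookup(const uint8_t hash[32]);
    void index_insert(const uint8_t hash[32], uint64_t offset, uint32_t length);

    /*
     * For each chunk: if its hash is already in the index, emit a reference
     * to the earlier chunk; otherwise emit the chunk data and add it to the
     * index so later duplicates can point back to it.
     */
    void dedupe_chunk(const uint8_t hash[32], uint64_t offset, uint32_t length,
                      void (*emit_ref)(uint64_t, uint32_t),
                      void (*emit_data)(uint64_t, uint32_t))
    {
        chunk_entry_t *e = index_lookup(hash);
        if (e != NULL) {
            emit_ref(e->offset, e->length);     /* duplicate: backward reference */
        } else {
            emit_data(offset, length);          /* unique: store chunk as-is */
            index_insert(hash, offset, length);
        }
    }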

The size of a full chunk index grows rapidly with the dataset. If we are looking at 4KB chunks then we get 268435456 chunks for 1TB of data. Each chunk entry in the hashtable needs to have the 256-bit checksum, a 64-bit file offset and a 32-bit length value. So the total size of the index entries is approx 11GB for unique data, not counting the additional overheads of the hashtable structure. So if we consider hundreds of terabytes then the index is too big to fit in memory. In fact the index becomes so big that it becomes very costly to look up chunk hashes, slowing the dedupe process to a crawl. Virtually no commercial dedupe product even uses 4KB chunks; the minimum is the 8KB used in Data Domain, with most other products using much larger chunk sizes. Larger chunk sizes reduce the index size but also reduce dedupe effectiveness.
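To spell out the arithmetic behind those numbers (ignoring the hashtable structure overhead mentioned above):

    chunks per TiB  = 2^40 B / 4 KiB = 268,435,456
    bytes per entry = 32 (SHA256) + 8 (offset) + 4 (length) = 44
    index size      = 268,435,456 x 44 B ≈ 11.8 GB (about 11 GiB)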

One of the ways of scaling Data Deduplication to petascale is to look at similarity matching techniques that can determine regions of data that are approximately similar to each other and then compare their cryptographic chunk hashes to actually eliminate exact matching chunks. A variant of this technique uses Delta Differencing instead of hash matching to eliminate redundancy at the byte level. However I was interested in the former.

Pcompress 2.0 includes two approaches to Global Deduplication. If a simple full chunk index can fit into 75% of available free RAM then it is used directly. This is fast and most effective at eliminating duplicates. By default 4KB chunks are used, and performance is good even with chunks this small; this is a smaller chunk size than most other commercial or open-source dedupe products recommend or offer. Once the file grows large enough that the index would overflow the memory limit, Pcompress automatically switches to Segmented Similarity Based Deduplication.

In Segmented Similarity mode data is split into 4KB (or larger) chunks as usual (Variable-length Rabin or Fixed-block). Then groups of 2048 chunks are collected to form a segment. With 4KB chunks this results in an average segment size of 8MB. The list of cryptographic chunk hashes for the segment is stored in a temporary segment file. Then these cryptographic chunk hashes are analysed to produce 25 similarity hashes. Each similarity hash is essentially a 64-bit CRC of a min-value entry. These hashes are then inserted or looked up in a central index. If another segment is found that matches at least one of the 25 similarity hashes then that segment is considered approximately similar to the current segment. Its chunk hash list is then memory mapped into the process address space and exact crypto hash based chunk matching is done to perform the actual deduplication.
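For illustration, here is a rough sketch of the general min-value sampling idea in C. It is not the exact algorithm Pcompress uses; the bucketing scheme and the crc64() helper are assumptions made only to show the shape of the computation (a segment's chunk hashes reduced to a few 64-bit similarity hashes):

    #include <stdint.h>
    #include <string.h>

    #define SIM_HASHES 25

    /* Hypothetical helper: any 64-bit CRC implementation will do here. */
    uint64_t crc64(const void *buf, size_t len);

    /*
     * Sketch: deal the segment's 32-byte chunk hashes into SIM_HASHES buckets,
     * pick the minimum value in each bucket, and take a 64-bit CRC of that
     * minimum as one similarity hash for the segment.
     */
    void segment_similarity_hashes(const uint8_t (*chunk_hashes)[32],
                                   size_t nchunks, uint64_t sim[SIM_HASHES])
    {
        const uint8_t *min_entry[SIM_HASHES] = { NULL };

        for (size_t i = 0; i < nchunks; i++) {
            int b = (int)(i % SIM_HASHES);   /* bucket for this chunk hash */
            if (min_entry[b] == NULL ||
                memcmp(chunk_hashes[i], min_entry[b], 32) < 0)
                min_entry[b] = chunk_hashes[i];
        }
        for (int b = 0; b < SIM_HASHES; b++)
            sim[b] = min_entry[b] ? crc64(min_entry[b], 32) : 0;
    }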

This approach results in an index size that is approximately 0.0023% of the dataset size. So Pcompress will require up to a 25GB index to deduplicate 1PB of data. That is assuming 100% random 1PB data with no duplicates; in practice the index will be smaller. This approach provides >90% of the dedupe efficiency of a full chunk index while providing high scalability. Even though disk I/O is not completely avoided, it requires one disk write and only a few disk reads for every 2048 chunks. To balance performance and predictable behaviour, the write is synced to disk after every few segments. Using mmap() instead of a read helps performance, and the disk offsets to be mmap-ed are sorted in ascending order to reduce random access to the segment chunk-list file. This file is only ever appended to at the end; existing data is never modified. So it is ideal to place it on a Solid State Drive to get a very good performance boost. Finally, access to the central index is coordinated by the threads using a set of semaphores, allowing lock-free access to critical sections. See: https://moinakg.wordpress.com/2013/03/26/coordinated-parallelism-using-semaphores/

I had been working out the details of this approach for quite a while now and Pcompress 2.0 contains the practical implementation of it. In addition to this Pcompress now includes two additional streaming modes. When compressing a file the output file can be specified as ‘-’ to stream the compressed data to stdout. Decompression can take the input file as ‘-’ to read compressed data from stdin.

Global Deduplication in Pcompress together with streaming modes and with help from utilities like Netcat or Ncat can be used to optimize network transfer of large datasets. Eventually I intend to implement proper WAN Optimization capabilities in a later release.

Related Research

  1. SiLo: A Similarity-Locality based Near-Exact Deduplication
  2. The Design of a Similarity Based Deduplication System
  3. Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality
  4. Similarity Based Deduplication with Small Data Chunks

Coordinated Parallelism using Semaphores

One of the key features of my Pcompress utility is, of course, parallelism: the ability to split and process data in parallel in multiple threads, with all the threads doing virtually the same work. There is some limited variability depending on the nature of the work and the nature of the data. For example some data segments may not be compressible, so they will have to be stored as-is. Whenever there are multiple threads there is typically some need for synchronization. There are of course scenarios where thread processing is completely independent and no synchronization is needed whatsoever. However in Pcompress that is not the case. There are a couple of cases where synchronization is needed:

  1. Ordering of input data segments. The data segments in the compressed file must be written in the same order as they were input otherwise data will be corrupt.
  2. With the Global Deduplication feature, access to a single chunk index must be serialized and ordered in the same sequence as they were input.

The second feature is a recent addition and also requires ordering since we want all duplicate chunk references to be backward references. That is, in a data stream, duplicate chunks point backwards to whole chunks earlier in the stream, so data segments containing the chunks must go through the index lookup in the same order as they were input. The rest of the processing, like the pre-compression stage, chunk splitting stage, compression stage, optional encryption stage and so on, can proceed completely in parallel without dependencies. The flow can be illustrated by the following diagram:

[Figure: parallel_flow (overall processing flow)]

As you can see, there are 3 points where some form of synchronization is necessary: the input, the index lookup for global dedupe, and the final writer stage. In addition, input data ordering has to be maintained for the index lookup and when writing the output data.

There are several ways of achieving this flow; the most common techniques use a thread pool and some queues. Perhaps the simplest approach is barrier synchronization: put one barrier prior to the index lookup and another prior to the writer, with a simple loop in each case taking care of the serial processing while maintaining proper data ordering. However, both approaches have drawbacks. Queues and thread pools have resource overheads for the data structures and locking. Barriers are not strictly needed here, and using them means that some potential concurrency is lost waiting at the barrier; the time spent waiting is the time taken by the slowest, typically the last, thread to complete processing. One of my intentions was to have as much overlapped processing as possible: if one thread is accessing the index and another thread does not need it, the latter should be allowed to proceed.

So I played around with POSIX semaphores. Using semaphores in a producer-consumer setup is a common approach. However Pcompress threads are a little more involved than simple producers and consumers. A bunch of semaphores are needed to signal and control the execution flow of the threads. After some experimentation I came up with the following approach.

A dispatcher thread reads data segments from the input file and schedules them to worker threads in a round robin fashion and the writer thread reads processed data segments from worker threads in a round robin loop as well. This preserves data ordering at input and output. The ordering of index lookup and dedupe processing is done by one thread signaling the other. The diagram below illustrates this showing an example with 2 threads.

[Figure: parallel_flow (semaphore signaling with 2 worker threads)]

The green arrows in the diagram show the direction of the semaphore signals. At each synchronization point a semaphore is signaled to indicate completion of some activity. The critical section of the index lookup and update operation is highlighted in blue. Each thread holds a reference to the index semaphore of the next thread in sequence, and the last thread holds a reference to the index semaphore of the first thread. Each thread first waits for its own index semaphore to be signaled, then performs the index update and signals the next thread to proceed. The dispatcher thread signals the index semaphore of the first thread to start the ball rolling. Effectively this approach is equivalent to a round-robin token ring network: whoever holds the token can access the common resource. Lock contention is completely avoided, so this can potentially scale to thousands of threads.
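As an illustration, here is a minimal, self-contained sketch of the token-ring handoff using POSIX semaphores. It is not the actual Pcompress code; index_update() is a stand-in for the real chunk-index critical section and the per-segment work is elided:

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NTHREADS 4

    static sem_t index_sem[NTHREADS];   /* one "token" semaphore per thread */

    /* Placeholder for the real chunk-index lookup/update critical section. */
    static void index_update(int id) { printf("thread %d in index\n", id); }

    static void *worker(void *arg)
    {
        int id = (int)(intptr_t)arg;
        int next = (id + 1) % NTHREADS;  /* last thread wraps to the first */

        /* ... parallel per-segment work (chunking, hashing) happens here ... */

        sem_wait(&index_sem[id]);        /* wait for my token */
        index_update(id);                /* serialized critical section */
        sem_post(&index_sem[next]);      /* pass the token to the next thread */

        /* ... more parallel work (compression, encryption) happens here ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
            sem_init(&index_sem[i], 0, 0);
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);

        sem_post(&index_sem[0]);         /* dispatcher starts the ball rolling */

        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }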

The key to data ordering are the two loops, one in the dispatcher and one in the writer thread. The dispatcher always assigns data segments to threads in a fixed order. In the above example Thread 1 gets all the odd segments and Thread 2 gets all the even ones. The writer thread also waits for threads in order eventually ensuring that data ordering is preserved.

Looked at another way, the synchronization approach can be viewed, simplified, as three concentric rings, with the processing flows being a set of radii converging to the center of the circle and intersecting the rings. The processing flow direction is inwards towards the center and all the tokens flow along the rings in one direction, for example clockwise (black arrows). The green curved arrows show the sync points forwarding tokens: when a processing flow reaches the writer sink ring it forwards the token it received at the dedupe ring to the next flow. The final sync point at the center completes the data write and forwards the token at the previous radius intersection point on the outermost ring. This approach ensures ordering and avoids races. To have maximum concurrency right from the beginning, all the sync points on the outermost ring get one-time-use tokens so all the initial processing can begin immediately. This is somewhat like priming a water pump.

[Figure: parallel_flow_ring (concentric ring view of the synchronization)]

This flow allows overlapped operations to happen concurrently. In addition, the dispatcher does simple double buffering by reading the next segment into a spare buffer after signaling the current thread to start processing. A bit of concurrency can be lost when the writer thread is waiting for thread 1 while thread 2 has already completed. That situation typically arises at the end of a file, where the last segment can be a small one. It can also arise if one segment cannot be deduplicated and the rest of the dedupe processing is aborted. However the impact of these is relatively small compared to the overall processing being done, so a lot of multi-core parallelism is effectively utilized in practice. Finally, the overheads of extra data structures and/or parallel threading libraries are also avoided.

Adding GPU processing to Pcompress

 

[Image: CUDA processing flow (Photo credit: Wikipedia)]

GPGPUs provide an intriguing opportunity to speed up some aspects of Pcompress. Typically GPUs represent a large cluster of ALUs with access to a few different types of high-speed memory on the board. GPUs are typically suited for highly-parallel workloads, especially the class of problems that can be termed embarrassingly parallel. An example is Monte-Carlo simulations. However many otherwise serial algorithms or logic can be converted into parallel forms with a little bit of effort.

There are a few places within Pcompress where GPUs can be of use:

  1. Parallel hashing. I have already implemented Merkle-style parallel hashing but the approach currently uses only 4 threads via OpenMP. This is only used when compressing an entire file in a single segment, which is essentially a single-threaded operation with some operations like hashing, HMAC (and multithreaded LZMA) parallelized via different approaches. With GPUs parallel hashing can be used in all cases, but there is a slight problem: parallel hashing normally produces different hash values compared to the serial version, so I need to work out a way where the same underlying hashing approach is used in both the serial and parallel cases so that identical results are produced. If GPUs are used to generate data checksums on one machine, it cannot be assumed that every machine where the data is extracted will have a GPU! Changes to the hashing approach will make current archives incompatible with future versions of Pcompress, so current code paths will have to be retained for backward compatibility. (A rough sketch of the Merkle-style scheme follows this list.)
  2. Using AES on GPU. It is possible to speed up AES on the GPU, especially with the CTR mode that I am using. There is a GPU Gems article on this.
  3. Parallel data chunking for deduplication. This is possible but more complex to implement than the previous two items. There is a research paper on a system called Shredder that provides an approach to doing data deduplication chunking on the GPU. My approach to chunking is quite novel and different from what is described in the Shredder paper, so I have some work to do here.
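As promised in item 1, here is a minimal sketch of the Merkle-style scheme using OpenSSL's SHA256 and OpenMP: hash fixed-size leaves in parallel, then hash the concatenation of the leaf digests. This is only an illustration of the general construction, not the exact one used in Pcompress, and as noted above it produces a different digest from a plain serial SHA256 of the same data:

    #include <openssl/sha.h>
    #include <stdlib.h>

    /*
     * Merkle-style parallel hash sketch: split the buffer into nleaves pieces,
     * hash each leaf independently (parallelizable), then hash the concatenated
     * leaf digests to produce the root digest.
     */
    void merkle_sha256(const unsigned char *buf, size_t len, int nleaves,
                       unsigned char root[SHA256_DIGEST_LENGTH])
    {
        unsigned char *leaf_digests =
            malloc((size_t)nleaves * SHA256_DIGEST_LENGTH);
        size_t leaf_sz = (len + nleaves - 1) / nleaves;

        #pragma omp parallel for
        for (int i = 0; i < nleaves; i++) {
            size_t off = (size_t)i * leaf_sz;
            size_t n = off < len ? (len - off < leaf_sz ? len - off : leaf_sz) : 0;
            SHA256(n ? buf + off : buf, n,
                   leaf_digests + (size_t)i * SHA256_DIGEST_LENGTH);
        }
        /* Root = hash of the concatenated leaf digests. */
        SHA256(leaf_digests, (size_t)nleaves * SHA256_DIGEST_LENGTH, root);
        free(leaf_digests);
    }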

There are a few issues to deal with when programming GPGPUs other than the initial steep learning curve:

  1. GPUs are devices that sit on the PCI bus, so data needs to be transferred to and fro. This is the biggest stumbling block when dealing with GPUs: the computation to be performed must be large enough to offset the cost of data transfer. There are ways to hide the latency, like performing one computation while transferring the data for the next one, using pinned memory in the host computer’s RAM to speed up data transfer, and transferring large blocks of data in one shot as opposed to many small transfers. The biggest gain comes from pipelining computation stages and overlapping compute and data transfer (a rough sketch follows this list).
  2. Code on the GPU runs in an execution context that has hundreds of hardware threads, each of which runs the same code path but works on a different slice of data in memory. This is essentially the Single Instruction Multiple Data model (Nvidia calls it SIMT). The access to data by the different threads needs to be ordered or, in other words, be adjacent in a range to get maximum throughput. This is the coalesced access requirement. It is becoming less of an issue as GPGPUs evolve and newer improved devices come to the market.
  3. Need to use a form of explicit caching via shared memory. This is again improving by the introduction of L1/L2 caches in newer GPGPUs like the Nvidia Tesla C2XXX series.
  4. Having to worry about Thread block and grid sizing. Some libraries like Thrust handle sizing internally and provide a high-level external API.
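To illustrate the pipelining point from item 1, here is a rough host-side sketch in C using two CUDA streams and pinned buffers so that the upload of one chunk overlaps with the processing of another. It is a sketch under assumptions: process_chunk_on_gpu() and next_chunk() are hypothetical placeholders, and error checking and cleanup are omitted:

    #include <cuda_runtime.h>
    #include <string.h>

    #define NSTREAMS 2
    #define CHUNK_SZ (4 * 1024 * 1024)

    /* Hypothetical wrapper around a kernel launch, defined elsewhere. */
    void process_chunk_on_gpu(void *dev_buf, size_t len, cudaStream_t stream);

    /* next_chunk() is a hypothetical callback returning chunks of <= CHUNK_SZ. */
    void process_file_chunks(const void *(*next_chunk)(size_t *len))
    {
        void *host_buf[NSTREAMS], *dev_buf[NSTREAMS];
        cudaStream_t stream[NSTREAMS];

        for (int i = 0; i < NSTREAMS; i++) {
            cudaMallocHost(&host_buf[i], CHUNK_SZ);   /* pinned host memory */
            cudaMalloc(&dev_buf[i], CHUNK_SZ);
            cudaStreamCreate(&stream[i]);
        }

        size_t len;
        const void *chunk;
        for (int i = 0; (chunk = next_chunk(&len)) != NULL; i = (i + 1) % NSTREAMS) {
            cudaStreamSynchronize(stream[i]);          /* reuse buffer when free */
            memcpy(host_buf[i], chunk, len);
            /* Async copy + compute on stream i overlap with the other stream. */
            cudaMemcpyAsync(dev_buf[i], host_buf[i], len,
                            cudaMemcpyHostToDevice, stream[i]);
            process_chunk_on_gpu(dev_buf[i], len, stream[i]);
        }
        for (int i = 0; i < NSTREAMS; i++)
            cudaStreamSynchronize(stream[i]);
        /* Cleanup of streams and buffers omitted for brevity. */
    }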

Pcompress has to remain modular. It needs to detect the presence of GPUs in a system and optionally allow using them. Since I will be using CUDA, it needs to depend on the presence of CUDA and the Nvidia accelerated drivers as well.

Finally, the big question is how all of this will scale. Will using GPUs allow faster processing in Pcompress compared to modern Sandy Bridge and Piledriver CPUs with their vector units? Only experimentation will tell.

 

Blake2 Users

[Image: a typical cryptographic hash function (Photo credit: Wikipedia)]

Just noticed that Pcompress is listed on the BLAKE2 homepage as one of the users of the hash. Great!

The hash is a good fit for many use cases compared to the moth-eaten MD5 and even SHA1, so the users section deserves more entries. The Gluster project has been looking at bit rot detection, which requires computing fast cryptographic hashes for files or fragments of files. BLAKE2 is ideal for the purpose and I submitted this bugzilla entry a while back. SHA256 is just too slow for that task; while Intel’s optimized SHA256 code is fast, BLAKE2-256 is a lot faster.

Making Delta Compression effective

I had blogged about Delta Compression earlier and compared various results with and without it in an earlier post on Compression Benchmarks. While those benchmark results are now outdated, you can see that delta compression had little if any impact on the final compression ratio; it even made compression worse in some cases.

Recently I sat down to take a closer look at the results and found delta compression to have a negative impact for most compression algorithms, especially at the higher compression levels. After pondering this for some time it occurred to me that delta compression of similar chunks via Bsdiff was actually eliminating patterns from the data. Compression algorithms look for patterns and collapse them, so eliminating some patterns could impact them negatively.

Many compression algorithms also typically use a dictionary (sliding window or otherwise) to store pattern codes so that newly found patterns can be looked up to see if they have occurred in the past. So I thought of adding a constraint that delta compression between 2 similar chunks can only occur if they are further apart in the datastream than the dictionary size in use. This can be called the TOO-NEAR constraint. It took a little experimentation with various datasets and compression algorithms to determine the optimum size of the TOO-NEAR constraint for different compression algorithms. I always used the maximum compression level with the maximum window sizes to ensure benefits are seen in almost all cases.
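A minimal sketch of that constraint in C; the function and parameter names are hypothetical, only the distance check itself reflects the idea described above:

    #include <stdint.h>
    #include <stdbool.h>

    /*
     * TOO-NEAR constraint sketch: only allow delta compression between two
     * similar chunks if they are farther apart in the datastream than the
     * dictionary/window size of the compression algorithm in use.
     */
    static bool delta_allowed(uint64_t chunk_off, uint64_t similar_off,
                              uint64_t dict_window_size)
    {
        uint64_t distance = chunk_off > similar_off ?
                            chunk_off - similar_off : similar_off - chunk_off;
        return distance > dict_window_size;  /* otherwise the compressor can
                                                find the redundancy itself */
    }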

Eventually this paid off. Delta Compression started to have a slight benefit. Obviously the extent of delta compression is reduced but that made it faster and also improved final achievable compression. For details on how similar chunks are identified see below.

Pcompress 1.4 Released

A new release of Pcompress is now available from here: http://code.google.com/p/pcompress/downloads/list

I have added the following capabilities into this release:

  1. AES is better optimized using VPAES and AES-NI with CPU auto-detection. Since I had to support linking with older versions of OpenSSL I have copied the relevant implementation files into the Pcompress repo.
  2. Added XSalsa20 from Dan Bernstein’s NaCl library (SSE2 and reference code). It is extremely fast and has excellent security properties, as far as I could glean from reading various articles and posts.
  3. Deduplication performance has been improved by 95% by optimizing the Rabin sliding window chunking algorithm and doubling the chunk hash-table size.
  4. From version 1.3 onwards Delta Compression now actually benefits overall compression ratio.
  5. All hashes have been parallelized better by using Merkle Tree style hashing. This happens only in solid mode when entire file is being compressed in a single chunk.
  6. Encryption key length is no longer a compile-time constant. It can be selected at runtime to be 128 or 256 bits.
  7. The Header HMAC now includes Nonce, Salt and Key length properties as well. This was an oversight in the earlier release.
  8. Better cleanup of temporary variables on the stack in various crypto functions.
  9. The Global Deduplication feature is still a work in progress. I have added some code that builds but is not functional as yet.


SHA512 Performance in Pcompress

Pcompress provides 2 cryptographic hashes from the SHA2 family, namely SHA512-256 and SHA512. The core SHA512 block function is the implementation from Intel: http://edc.intel.com/Download.aspx?id=6548

Intel’s implementation provides two heavily optimized versions, one for SSE4 and one for AVX. Intel only provided the core compression function, so I had to add supporting code from other sources to get a complete hash implementation, including padding and IVs for the 512-bit version and the 256-bit truncation. Since I am concerned only with 64-bit CPUs, SHA512-256 is the optimal choice; SHA512 is much faster than native SHA256 on 64-bit CPUs. I did some benchmarks using the SMHasher suite to check how this implementation fares. SMHasher is primarily designed to test various qualities of non-cryptographic hash functions, but its benchmarking implementation is good and I used just that part. I had to modify SMHasher to add the various SHA2 implementations and a tweak to support 512-bit hashes.

I only tested the SSE4 version since I do not currently have an AVX-capable CPU. SMHasher reports bytes/cycle and I just took the reciprocal to get cycles/byte. All this was done on my laptop, which has a Core i5 430M, 2.27 GHz CPU (not Sandy Bridge). The OpenSSL version used is 1.0.1c from Linux Mint 14. I used GCC 4.7.2 with the -O3 and -ftree-vectorize flags. The results are shown below.

[Chart: sha512 (cycles/byte for the SHA512 implementations tested)]

Clearly Intel’s SSE4-optimized version is superior to the rest on x64 platforms, with OpenSSL not too far behind. The AVX version should give even better results. The other chart shows cycles/hash for tiny 31-byte buffers.

[Chart: sha512_small (cycles/hash for tiny 31-byte buffers)]

Fast hashing is important when using a hash for data integrity verification. If you are thinking of slow hashes for hashing passwords, please look at Scrypt or PBKDF2.

Pcompress 1.3 released

I have put up a new release of Pcompress on the Google Code download page. This release focuses primarily on performance enhancements across the board and a few bug fixes. The changes are summarized below:

  1. Deduplication performance has improved by at least 2.5X as a result of a variety of tweaks to the core chunking code.
    • One of the interesting changes is to use a 16-byte SSE register as the sliding window since I am using a window size of 16. This avoids a lot of memory accesses but requires SSE4.
    • The perf utility allowed me to see that using the window position counter as a context variable causes a spurious memory store for every byte! Using a local variable allows optimization via a register. This optimization affects the situation where we do not have SSE4.
    • Compute the full fingerprint only when at least minimum chunk length bytes have been consumed.
  2. Delta Compression performance and effectiveness have both been improved. I have tweaked the minhash approach to avoid storing and using fingerprints; that approach was causing memory write amplification and a significant slowdown. Instead I am just treating the raw data as a sequence of 64-bit integers and heapifying them. Bsdiff performance has been improved along with the RLE encoding. I also tweaked the matching approach: it now checks for similar blocks that are some distance apart, depending on the compression algorithm. This actually causes long-range similar blocks to be delta-ed, eventually helping the overall compression.
  3. One of the big changes is the inclusion of the BLAKE2 checksum algorithm and making it the default. BLAKE2 is one of the highest-performing cryptographic checksums, exceeding even MD5 in performance on 64-bit platforms. It is derived from BLAKE, one of the NIST SHA3 runners-up, with a large security margin.
  4. I have tweaked Yann Collet’s xxHash algorithm (non-cryptographic hash) to vectorize it and make it work with 32-byte blocks. Performance is improved for both vectorized and non-vectorized versions. I have covered this in detail in my previous post: Vectorizing xxHash for fun and profit.
  5. I have tweaked the AES CTR mode implementation to vectorize it. CTR mode encrypts a 16-byte block consisting of a 64-bit nonce or salt value and a 64-bit block counter value concatenated together; the result is then XOR-ed with 16 bytes of plaintext to generate 16 bytes of ciphertext. The block counter is then incremented and the process repeated. This 16-byte XOR can be done nicely in an XMM register (a small sketch follows this list). The result is faster even when using unaligned SSE2 loads, helped a little by data prefetch instructions.
  6. Apart from BLAKE2 I also included Intel’s optimized SHA512 implementation for x86 processors and moved to using SHA512/256. This improves SHA2 performance significantly on x86 platforms.
  7. BLAKE2 includes a parallel mode. I also included simple 2-way parallel modes for other hashes including KECCAK when compressing a single file in a single chunk. This is essentially a single-threaded operation so other forms of parallelism need to be employed.
  8. With all the vectorization being thrown around with SSE2/3/4 and AVX1/2 versions of various stuff, I have also added runtime CPU feature detection to invoke the appropriate version for the CPU. At least SSE2 capability is assumed. At this point I really have no intention of supporting Pentium and Atom processors! This also requires one to use at least the Gcc 4.4 compiler so that things like SSE4.2 and AVX intrinsics can be compiled even if CPU support for them is not available.
  9. In addition to all the above some bug fixes have also gone into this release.
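As an aside, the XOR step described in item 5 can be sketched as follows. This is only an illustration of the idea: encrypt_counter_block() is a placeholder for the AES encryption of the nonce/counter block, not an actual Pcompress or OpenSSL function:

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>

    /* Placeholder: AES-encrypt the 16-byte (nonce || counter) block. */
    void encrypt_counter_block(uint64_t nonce, uint64_t counter,
                               uint8_t keystream[16]);

    /* XOR 16 bytes of plaintext with the keystream block using an XMM register. */
    void ctr_xor_block(const uint8_t *plaintext, uint8_t *ciphertext,
                       uint64_t nonce, uint64_t counter)
    {
        uint8_t keystream[16];
        encrypt_counter_block(nonce, counter, keystream);

        __m128i p = _mm_loadu_si128((const __m128i *)plaintext);
        __m128i k = _mm_loadu_si128((const __m128i *)keystream);
        _mm_storeu_si128((__m128i *)ciphertext, _mm_xor_si128(p, k));
    }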

However this is in no way the full gamut of possible optimizations; there are more changes to be done. For example, I need to add support for an optimized AES GCM mode. This is a block cipher mode of operation that combines encryption and authentication, avoiding the need for a separate HMAC. HMAC is still useful for situations where one may want to authenticate but not encrypt. Deduplication performance can be further improved by at least 2X; the current chunking code has a silly oversight. HMAC needs to support parallel modes. I also need to enable parallel operation for LZP in single-chunk modes. In addition I want to explore the use of GPGPUs and CUDA for hashing, chunking etc.