Tag Archives: software

SHA512 Performance in Pcompress

Pcompress provides 2 cryptographic hashes from the SHA2 family, namely SHA512-256 and SHA512. The core SHA512 block function is the implementation from Intel: http://edc.intel.com/Download.aspx?id=6548

Intel’s implementation provides two heavily optimized versions. One for SSE4 and one for AVX. Intel only provided the core compression function. I had to add supporting code from other sources to get a complete hash implementation including padding and IVs for 512-bit and the 256-bit truncation. Since I am bothered only with 64-bit CPUs using SHA512-256 is the most optimal choice. SHA512 is much faster that native SHA256 on 64-bit CPUs. I did some benchmarks using the SMHasher suite to check how this implementation fares. SMHasher is primarily designed to test various qualities of non-cryptographic hash functions, however it’s benchmarking implementation is good and I used just that part. I had to modify SMhasher to add all the various SHA2 implementations and a tweak to support 512-bit hashes.

I only tested the SSE4 version since i do not currently have an AVX capable CPU. SMHasher shows bytes/cycle and I just did the reciprocal to get cycles/byte. All this was done on my laptop which has a Core i5 430M, 2.27 GHz CPU (non sandy bridge). The OpenSSL version used is 1.0.1c from Linux Mint 14. I also used GCC 4.7.2 with -O3 and -ftree-vectorize flags. The results are shown below.


Clearly Intel’s SSE4 optimized version is superior than the rest on x64 platforms with OpenSSL not too far behind. The AVX version should give even better results. The other result shows cycles/hash for tiny 31-byte buffers.

sha512_small.Fast hashing is important when using it for data integrity verification. If you are thinking of slow hashes for hashing passwords then please look at Scrypt or PBKDF2.

Pcompress reaches 1.0 stable release

This is the first milestone for the Pcompress utility. After lots of development and testing I decided to push the first stable release. From a stand-alone data compression utility point of view it has reached a maturity and unique combination of powerful features that most other such utilities lack to the best of my knowledge. It does all this while retaining ease of use. Keep in mind that Pcompress is not a complete archiver. It compresses single files. So you will have to use tar for example, to create the archive which can then be compressed.

Here is the announcement that should appear shortly on Freecode.com:

“This is the first stable release of Pcompress. An extensive test suite was added and used to validate and fix bugs. The file format is versioned and backward compatiblity will be maintained. Deduplication parameters have been tweaked to get better distribution of content-aware blocks and improved dedupe ratio. Adaptive compression modes have been improved with more heuristics. This release also provides best in class SHA256 digest performance on x86 platforms. Pcompress today provides a combination of features and performance that makes it unique among data compression utilities.”

What next for Pcompress

With the current release of 0.9.1 Pcompress now comes with a whole bunch of features and capabilities that are close to being a logically complete set for a stand-alone data compression utility.  It will be hard to find another open-source utility that has the same combination of features and capabilities in a single utility. There are a few other stuff that can be added from a single utility point of view the biggest being a GUI and possibly portability to Windows. However beyond all this I have been looking at extending Pcompress to being a more comprehensive archival mechanism. Not necessarily a full Backup solution though. For that we have excellent software like Bacula and Amanda. However it can be an add-on for those providing an archival data store with capabilities that they lack.

I am listing out some of the thoughts and new ideas that have been crawling inside my cranium for the past few weeks.

Global Deduplication

Pcompress does Deduplication but only at the level of individual chunks. It is a logical next step to evolve it into a complete deduplication archival store on the lines of commercial products like Data Domain, Sepaton, Permabit Albireo, HP StoreOnce etc. There are big challenges in this area. The biggest being the disk bottleneck arising out of index lookups for chunk hashes. This becomes increasingly a problem as we scale into hundreds of TB and to Petascale. Then comes efficient storage of chunks, efficient retrieval of archived data, encryption of deduplicated data, sources-side dedupe to reduce network traffic etc. The industry has come up with a variety of solutions and optimizations but nothing significantly comparable is available in the open-source domain.

A couple of the open-source solutions are SDFS and LessFS. I have tried both and did not have much success with LessFS as it would hang every now and then in my Fedora 16 laptop. However SDFS worked reliably and fast. I am sure I am doing something wrong in setting up LessFS. Both of these are in-line dedupe filesystems that work on fixed-block boundaries. In addition the backup solutions like Bacula and Amanda only do whole-file dedupe or, in other words, single-instance storage.

For Pcompress I am looking at providing more advanced capabilities. Pcompress already has sliding-window content-aware chunking capability. The challenge is to design a scalable deduplicated archival store that can handle Petabytes. To this effect I have been reading up on a bunch of research papers and capabilities touted by various products and have formed some detailed design ideas. Some of the features required for this capability include:

  1. Sparse Indexing: We need to use some approximate matching capability to probabilistically identify duplicate and unique chunks with low memory overhead. Data Domain for example uses a Bloom Filter that they call a Summary Vector to quickly identify unique chunks along with other capabilities like locality based caching. Other different approaches include Albireo from Permabit, and SILT. In that respect the following paper seems very interesting: https://code.google.com/p/binarywarriors/downloads/detail?name=INLINE_DEDUP.pdf&can=2&q= . What is more interesting is this HP paper that reflects a more polished form of that paper: http://static.usenix.org/events/fast09/tech/full_papers/lillibridge/lillibridge.pdf . This is essentially a locality based grouping of chunks along with a stratified sampling to do group-level approximate matching on a smaller index size. Locality based grouping is very helpful and instead of doing stratified sampling I have a different idea of using similarity matching at various confidence levels to find related groups. My idea will enable deduplicating 1 PB of data requiring just 14GB of memory for the index.
  2. Source side dedupe: Doing chunking and hashing at the source side can reduce load on the backup server and not sending chunks already present on the server reduces network bandwidth drastically. In addition the archive data itself will have many duplicates inside it even before comparing with the previous backup archive. So that will help as well. One extra thing I am looking at here is for the server to have a configurable capability to re-compute the chunk hashes and verify. This is an extra security mechanism that prevents malicious uploading of garbage chunks with invalid hashes. This of course puts extra load on the server but current gen Sandy Bridge or Bulldozers are solidly up to the task.
  3. Forward Referencing: Typically as backup archives keep on entering the deduplication data store their chunks keep pointing to older chunks that are duplicates. The first archive will be stored in full (after in-archive dedupe) and chunks in subsequent archives keep pointing backwards to duplicate chunks in older archives. So restoring from the newest backup will have lower performance due to the numerous random reads that it will have to do read those referenced chunks from older archives. Older archives will restore faster! In some documents this is called re-hydration. Logically we want the reverse to happen. If we can make chunks in older archives point to duplicate chunks in newer archives then the newest archive can be stored sequentially in it’s logical chunk order. However doing this in-line is not practical because many references need to be updated. So this is typically done with post-process dedupe and Sepaton is the only vendor doing this. This however precludes the use of Source-side dedupe. I have given this a lot of thought and come up with an alternative that can give the benefits of normal in-line and source-side dedupe while also providing forward-referencing capability.
  4. Security Considerations: Suffice it to say that this paper has a comprehensive coverage on this topic: http://www.ssrc.ucsc.edu/Papers/storer-storagess08.pdf .
  5. On-disk Container-Based storage mechanism: Storing chunks on the disk is another can of worms. Typical dedupe products implement a mini-filesystem for this. Storing individual groups of chunks in individual files is too expensive as there will be too many of them and filesystem metadata-lookup overheads will have significant impact. Chunk groups or segments can be further grouped into larger containers to solve this. I do not intend to implement a low-level mini-filesystem. Just use existing scalable filesystems like XFS.
  6. Replication: This is obviously a very important need. Especially with deduplicated data the loss of a single chunk invalidates all references to it and corrupts a whole host of files and archives. So replication is mandatorily needed as a protection against this. Instead of trying to build something I am looking at leveraging existing software like the excellent GlusterFS.
  7. GPGPU Acceleration: Chunking and hashing can be speeded up by leveraging GPGPUs. This paper provides the background: http://static.usenix.org/events/fast/tech/full_papers/Bhatotia2-5-12.pdf . Very interestingly Keccack has been ported to work optimally on Nvidia CUDA: https://sites.google.com/site/keccaktreegpu/ .
  8. Clustering: Distributing deduplication storage tasks across a cluster of computers can help greatly with scalability as opposed to a single appliance approach. It is possible to distribute chunks or segments across a cluster of computers based on their hash values. SDFS uses this technique. This however affects dedupe ratio.
  9. Integration with backup software: Neither Bacula nor Amanda includes variable-block dedupe capability.

Miscellaneous Features

  1. GUI: Everybody likes a GUI … everybody likes gui …. well except for Unix CLI freaks (myself included).
  2. Windows Port: This will broaden the usage base but requires having a GUI first.
  3. Reorder Files: Sort files based on similarity, size and filename. This should give better dedupe and compression results.
  4. Backup to online storage: Amazon S3, Glacier, DropBox etc should be supported as back-end storage targets.

Parallel Compression Revisited

Parallel Compression RevisitedI have recently been looking at compressing backups on the fly. That is I will generate a backup stream and pipe it through a compression utility and then send it to disk. Plain an simple, nothing great about it. However I also wanted to utilize all those cores and hyperthreads lying around to speed up the compression process. In other words I wanted parallel compression and parallel decompression ability.

After looking around I found a bunch of utilities that do this. Among the several I found, notable ones include pigz, pbzip2, pxz, p7zip, libbsc and Nanozip (not opensource). Most of these implement parallel implementations of specific algorithms while trying to be compatible with the existing file formats. For example pbzip2 retains the existing bzip2 file format by creating multiple compression streams within the file. Some of them like pxz would only scale to physical cores and ignore the Hyperthreads.

I however had other ideas. I wanted a technique that was algorithm agnostic and would allow me to leverage multiple algorithms based on the dataset I am trying to compress. The simple way to do this is to break up the data into chunks and compress the chunks in parallel. Decompression can also be done in parallel. This of course affects compression efficiency but provides great speedup on multicore and multisocket boxes. In addition if one has lots of RAM then large chunk sizes can be used to minimize the compression ratio hit as much as possible. It is also possible in this scenario to combine multiple compression algorithms and extract the best result which I call adaptive mode.

Digressing a bit, there is a lot of material around on the net regarding compression techniques, tools and algorithms. Some of the best lossless compression is achieved by the PAQ family of content mixing algorithms. They however can take inordinately long to work their magic and are not practical in many cases. You can visit Matt Mahoney’s Large Text Compression Benchmark page to see the top performers with detailed analysis in the text compression category. Another useful website is www.compressionratings.com.

Coming back to the topic, I looked around in Google for a while I could not find an existing opensource tool to do what I was thinking of. Maybe I did not search enough or maybe I just have an impulsive urge to play around with code. In any case I got around to developing a utility that has the following features/capabilities/characteristics:

  • Can split a file in chunks and do parallel compression and decompression of the chunks.
  • Can compress/decompress from stdin to stdout.
  • It is capable of utilizing multiple algorithms, the user can specify the algorithm [s]he wants to use.
  • Currently supports: Gzip, Bzip2, Lzma, PPMD (from the LZMA SDK).
  • It has an adaptive mode where it uses multiple algorithms to compress the same chunk and choose the smallest result.
  • Adaptive mode compression is of course the slowest but decompression speed is unaffected even in adaptive mode.
  • It provides 14 compression levels automatically choosing various algorithm parameters.
  • It uses a simple internal Slab Allocator to speed up memory allocation as the allocation patterns are 100% repeating.
  • The allocator can display statistics and detect leaks at the end.
  • Has very low metadata overhead.
  • Computes and stores a CRC64 checksum for each chunk to do integrity verification.
  • Has a pluggable source structure which makes it easy to add new compression algorithm support.
  • Bundles a modified LZMA code from the LZMA SDK which adds performance optimizations including SSE intrinsics.
  • Tries to overlap compression in threads and reading chunks from file as much as possible while retaining sequence.
  • Can scale to all logical cores or a user-specified number of threads.

You can download the current tarball from here. I will register a project on Github soon. This is a work in progress and I have only tested it on Fedora as yet with Illumos/OpenIndiana in the TODO list. There are a bunch of things I’d like to do:

  • Add support for filters like BCJ2, XWRT and Delta Encoding.
  • Auto-Detect data properties to auto-apply some filters like use BCJ2 for 8-bit binary data.
  • Track and print various compression statistics.
  • Add libbsc integration with another adaptive mode that uses libbsc instead of bzip2.
  • Possibilities of doing dedup:
    • Use things like Lrzip or srep as a dedup preprocessor.
    • Use chunk checksums to detect duplicates and then compare them. Will it be effective ?
    • Can we use some Rabin Fingerprinting derivative to auto-determine chunk size which will help dedup ?

I did some simple measurements using the enwik9 dataset on my Core i5 laptop with M430 processor running at 2.27 GHz and 3GB RAM. It’s got 2 hyperthreaded cores giving 4 CPU threads in total. The result is presented in graphical and tabular way below. The next measurement will be using the Linux6 Git repo tarball.

Enwik9 Compression Results
Cmd Size in Bytes Time in Seconds
Uncompressed 1000000000 0
gzip_6 323742884 76.41
gzip_9 322591997 93.81
pigz_6 323973501 31.41
pigz_9 322819489 35.29
pcompress_gzip_6_16m 324016149 32.16
pcompress_gzip_6_32m 323974316 31.4
pcompress_gzip_9_16m 322856796 38.03
pcompress_gzip_9_64m 322801875 38.24
bzip2_9 253977891 199.87
pbzip2_9 254067824 80.08
pcompress_bzip2_9_16m 254282504 86.45
pcompress_bzip2_9_64m 254032393 82.08
p7zip_6 227905649 838.94
p7zip_9 213336010 1177.47
xz_6 230153052 1184.77
xz_9 213370900 1645.03
pxz_9 215675816 1287.17
pcompress_lzma_9_32m 193378238 132.06
pcompress_lzma_9_64m 221297989 527.04
pcompress_lzma_10_64m 187103881 154.49
pcompress_lzma_14_64m_t2 184876720 265.52
pcompress_adapt_6_32m 232115301 477.68
pcompress_adapt_9_16m 233419207 470
pcompress_adapt_9_64m 227271086 541.82
pcompress_adapt2_9_64m_t2 190955872 899.78
pcompress_adapt_10_64m 227271086 543.17
pcompress_adapt2_14_64m_t2 184876720 1284.3
pcompress_adapt2_14_128m_t1 184552964 2105.52
pcompress_adapt2_14_500m_t1 183656405 2228.429
pcompress_adapt2_14_900m_t1 183575811 2212.369

Naming Scheme is one the two formats:
<Name>_<Compression level>
<Name>_<Mode>_<Compression Level>_<Chunk Size>_t<Number of Threads>

If you notice some of the tests are run using only 1 thread or 2 threads. That is simply because my laptop does not have enough RAM to handle the large memory requirements of those higher compression levels. That also reflects on the longer times taken for the adaptive modes on the graph. My desktop is an older Athlon64 having only 2 cores but more RAM. I am yet to test on that.

There is something interesting to note here. The Xz/Pxz utilities are slowest among the LZMA compressors. One is better off using p7zip. The single threaded adaptive mode of pcompress with extreme compression takes longer than pxz but pxz is using 2 threads. So using 2 threads the extreme adaptive mode will take less time than pxz while still trying out multiple compression algorithms and giving a better result!