Monthly Archives: May 2013

Pcompress 2.2 released

I decided to pull another release of Pcompress, primarily due to some bugfixes that went in. One of them addresses a build issue on Debian 6 and non-SSE4 processors, and the others fix a couple of crashes on invalid input.

In addition to fixing stuff I have rewritten the Min-Heap code and taken out all the Python-derived code. It is now much simpler and much faster than before. While doing this rewrite I found and fixed a problem with the earlier Min-Heap approach. Thus Delta Differencing is now faster and more accurate than before.
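For intuition, here is a minimal Python sketch (with illustrative parameters, not the actual Pcompress internals) of the kind of bounded min-heap use that similarity-based delta differencing can rely on: keep only the k smallest shingle hashes of each chunk as a compact sketch, then compare sketches to find delta candidates:

```python
import heapq
import zlib

def kmin_sketch(data, k=8, shingle=16):
    """Keep the k smallest shingle hashes of a chunk (a k-min sketch).
    A bounded heap of negated values acts as a max-heap, so we never
    hold more than k entries. Parameters here are hypothetical."""
    heap = []  # negated hashes; -heap[0] is the largest hash kept so far
    for i in range(max(1, len(data) - shingle + 1)):
        h = zlib.crc32(data[i:i + shingle])
        if len(heap) < k:
            heapq.heappush(heap, -h)
        elif -heap[0] > h:
            heapq.heapreplace(heap, -h)   # evict the largest kept hash
    return frozenset(-x for x in heap)

def similarity(a, b):
    """Resemblance estimate between two sketches (Jaccard index)."""
    return len(a & b) / len(a | b)
```

Two chunks with identical content yield identical sketches (similarity 1.0), while unrelated chunks share almost no minimum hashes, which is what lets the heap-based sketch cheaply shortlist chunks worth delta-encoding against each other.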

I also improved the scalable Segmented Global Dedupe, and it now finds duplicate chunks with greater than 95% efficiency. It appears that using larger segments for larger dedupe block sizes results in better accuracy. If you come to think of it, this is also logical: one would want faster processing with smaller indexes when using larger and larger dedupe blocks, and correspondingly larger segments enable just that.
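To make the segment-index idea concrete, here is a toy Python sketch (names and parameters are mine, not Pcompress's): only one representative fingerprint per segment goes into the global index, so larger segments mean a smaller index, while the per-segment hash comparison still catches individual duplicate chunks:

```python
import hashlib

def segmented_dedupe(chunks, segment_size=4):
    """Toy segment-level global dedupe. Only the minimum chunk hash of
    a segment is inserted into the global index; on a fingerprint hit,
    the matching segment's full hash list is consulted."""
    global_index = {}   # representative fingerprint -> chunk hashes seen
    unique, dupes = [], 0
    for start in range(0, len(chunks), segment_size):
        segment = chunks[start:start + segment_size]
        hashes = [hashlib.sha256(c).digest() for c in segment]
        fp = min(hashes)                       # representative fingerprint
        seen = set(global_index.get(fp, []))
        for c, h in zip(segment, hashes):
            if h in seen:
                dupes += 1                     # duplicate chunk eliminated
            else:
                unique.append(c)
        global_index.setdefault(fp, []).extend(hashes)
    return unique, dupes
```

The trade-off the post describes falls out of this structure: the index holds one entry per segment rather than one per chunk, so growing the segment alongside the dedupe block size keeps the index small without losing much matching accuracy.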


Updated Compression Benchmarks – part 2

I have added the second set of benchmarks that demonstrate the effect of the different pre-processing options on compression ratio and speed. The results are available here:

All of these results have Global Dedupe enabled. These results also compare the effect of various compression algorithms on two completely different datasets: one a set of VMDK files, the other purely textual data. Some observations below:

  • In virtually all the cases, using the ‘-L’ and ‘-P’ switches results in the smallest file. Only in the case of LZMA do these options marginally degrade the compression ratio, indicating that the reduction of redundancy is hurting LZMA. To identify which of the two hurts more, I repeated the command (see the terminology in the results page) with the lzmaMt algo and only option ‘-L’ at compression level 6 on the CentOS vmdk tarball. The resultant size came to 472314917. The size obtained by running with only option ‘-P’ is available in the results page: 469153825. Thus it is the LZP preprocessing that unsettles LZMA the most at a segment size of 64MB; Delta2 actually helps. Running the command with a segment size of 256MB we see the following results – ‘-L’ and ‘-P’: 467946789, ‘-P’ only: 466076733, ‘-L’ only: . Once again Delta2 helps, though at higher compression levels Delta2 is marginally worse as well.
  • There is some interesting behavior with respect to the PPMD algorithm. The time graph (red line) shows a relative spike for the CentOS graphs as compared to the Linux source tarball graphs. PPMD is an algorithm primarily suited to textual data, so using it on non-textual data still provides good compression but takes more time.
  • Both Libbsc and PPMD are especially good on the textual Linux source tar, producing results comparable to LZMA while taking only a fraction of the time. Libbsc in particular really rocks, producing better compression while being much faster than LZMA. However, I have seen decompression time with Libbsc to be quite high compared to PPMD.
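As an aside, the effect that Delta2-style preprocessing has on a downstream compressor is easy to demonstrate. The sketch below is a toy, not the real Delta2 (which detects fixed-stride numeric runs embedded inside arbitrary binary data automatically); it simply delta-encodes an arithmetic sequence and shows that the result compresses far better:

```python
import itertools
import struct
import zlib

def delta2_encode(values):
    """Keep the first value, then successive differences.
    A toy stand-in for stride-delta preprocessing."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta2_decode(deltas):
    """Invert the encoding with a running sum."""
    return list(itertools.accumulate(deltas))

# An arithmetic run becomes a constant run after delta encoding:
seq = list(range(1000, 9000, 8))
raw = struct.pack("<%dI" % len(seq), *seq)                  # 4 KB of rising ints
enc = struct.pack("<%di" % len(seq), *delta2_encode(seq))   # mostly the value 8
```

Compressing `enc` with any general-purpose compressor yields a much smaller output than compressing `raw`, which is the whole point of the transform; the flip side, as the LZMA numbers above suggest, is that removing redundancy up front can occasionally step on a strong compressor's own modeling.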

Stress-testing Systems

Many a time we need software that will enable us to load a system and check its stability under bad conditions. This can be a burn-in test, or it can be the generation of load to cause borderline faulty hardware to start acting up. This allows one, for example, to isolate system crashes to either hardware or software issues. In my experience the latter has been a common scenario: system crash logs appear to look like hardware issues, but diagnostic tools, including the vendor-provided ones, come out clean, and then the system crashes again soon after being put back into production use. Eventually some back-and-forth investigation and trial-and-error replacement of potentially faulty components is done till things are stable again or the box itself is replaced.

One of the choices here is to run something that loads the system hard, causing hidden faults to surface faster than they otherwise would. When it comes to stress-testing tools there are a whole bunch of choices, but most of them focus on one piece at a time. Most commonly it starts with the CPU, then RAM, and of course Disk I/O. However, I am yet to come across something that comprehensively loads the entire system. By entire system I mean CPU, RAM, Disk and Network together. In addition, by loading the CPU I mean loading virtually every component inside the CPU: FPU, SSE, AVX, Fetch, Decode and so on. Running a single computation, like Prime95 for example, may heat up the CPU and/or RAM modules, but it exercises only a few components within. The key here is to stress-test everything in parallel.
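The parallel-load idea can be sketched roughly as follows. This uses Python threads for brevity and entirely hypothetical load functions; a real harness would use one process per load and per core to sidestep the GIL, and far more varied compute kernels:

```python
import os
import tempfile
import threading
import time

def fpu_load(seconds):
    """Tight floating-point loop (exercises the FPU)."""
    end = time.monotonic() + seconds
    x = 1.0001
    while time.monotonic() < end:
        x = x * x % 1e9 + 1.0001

def mem_load(seconds, size=1 << 20):
    """Touch a buffer page by page to generate memory traffic."""
    buf = bytearray(size)
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        for off in range(0, size, 4096):
            buf[off] = (buf[off] + 1) & 0xFF

def disk_load(seconds, size=1 << 16):
    """Non-destructive disk I/O against a scratch file."""
    end = time.monotonic() + seconds
    with tempfile.TemporaryFile() as f:
        while time.monotonic() < end:
            f.seek(0); f.write(os.urandom(size)); f.flush(); os.fsync(f.fileno())
            f.seek(0); f.read()

def stress(seconds=1.0):
    """Run all the loads concurrently for the given duration."""
    threads = [threading.Thread(target=t, args=(seconds,))
               for t in (fpu_load, mem_load, disk_load)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(not t.is_alive() for t in threads)
```

The point of the structure, rather than any individual loop, is that the CPU, memory and disk loads run simultaneously, which is what distinguishes this from single-subsystem burn-in tools.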

Eventually all this should result in the system’s ambient temperature rising by a few degrees, even when the box is located inside a chilled datacenter and even when the server’s fans are spinning at a higher RPM. Once we have stressed the box we can then look at diagnostic logs like the IML (HP Integrated Management Log) and run diagnostic tools that will hopefully have a better chance of picking up something odd.

I have worked on something similar at my job, where we have successfully used it on several occasions for troubleshooting faults, evaluating new server models and commissioning new datacenter field layouts. I have now started an open-source project along the same lines, but more comprehensive:

At the moment this is a work in progress, and one will only find a few items in that GitHub repo, mostly dealing with creating a mini Fedora live image which is a core part of the system. The objectives for this system are listed below.

  • Parallel stress testing of CPU, RAM, Disk, and Network together, or a chosen subset, on Linux. Of course, the core test framework should lend itself to being ported to other platforms like BSD or Illumos.
  • Attempt to load virtually every sub-component.
  • Non-destructive disk tests.
  • Network Interface Card testing that will not flood the network with packets or frames.
  • Post-test verification and diagnostics scan.
  • Self-contained live-bootable environment to allow scheduling tests via PXE boot for example.
  • Ability to pass parameters via PXE/DHCP options.
  • The live environment should allow restricted root access that primarily does not provide filesystem utilities like mount but allows reading from the block device. In addition, the restricted shell should provide only a small subset of Linux utilities to prevent backdoors. This will allow systems engineers to do diagnostics etc. while providing no ability to access production data on the disk filesystems.
  • An HTTP-based graphical console to remotely access the live environment and look at logs, run tests, do diagnostics etc.
  • The live bootable image should be as small as feasible and should be able to load itself entirely in RAM and boot and run off a ramdisk.
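As a rough illustration of the PXE/DHCP parameter objective: options appended at boot time end up on the kernel command line, where the live environment could pick them out of /proc/cmdline. The `stest.` prefix and the key names below are purely hypothetical, not part of the project:

```python
def parse_cmdline(cmdline, prefix="stest."):
    """Extract key=value pairs carrying a given prefix from a kernel
    command line string (as PXE-appended options would appear in
    /proc/cmdline). A bare key with no value becomes a boolean flag."""
    params = {}
    for tok in cmdline.split():
        if tok.startswith(prefix):
            key, _, val = tok[len(prefix):].partition("=")
            params[key] = val or True
    return params
```

For example, a PXE config appending `stest.tests=cpu,ram stest.duration=3600` would let the booted image decide which tests to run and for how long without any local configuration.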

The GitHub project repo currently provides a Fedora kickstart file that goes to great lengths to minimize the live bootable ISO image (approx. 139MB including EFI boot capability). The live environment boots and automatically logs in to a restricted root environment. One will require Fedora 18 and the Fedora livecd-creator to build it (see the README).



Updated Compression Benchmarks

Pcompress has gone through a sea of changes since the last time I ran benchmarks comparing performance and effectiveness with other utilities. So I spent several days running various benchmark scripts generating and collating a lot of results in the process.

Due to the sheer volume of the results and limited time, I took the easy way out of importing all the CSV data into Excel, formatting and charting them and exporting to HTML. The generated HTML code looks complex and messy but at least it shows up correctly in Firefox, Chrome and IE.

The first set of results can be seen here: This basically compares Pcompress, with Segment-level and Global Deduplication, to other standard utilities. It also contrasts the effectiveness of Global Dedupe with that of Segment-level Dedupe.

The Datasets used

  1. A tar of the VMDK files of an installed CentOS 6.2 x86-64 instance.
  2. Linux 3.6 RC2 source tarball.
  3. Two copies of the Silesia corpus tar concatenated together. This results in a file that is double the size of the original Silesia corpus but has 100% duplicate data.
  4. A tarball of the “Program Files” directory on my 32-bit Windows 7 installation.

Some Observations

  1. As is quite clear, Pcompress is both faster and more effective compared to the standard utilities tested: Gzip, Bzip2, 7za, Xz and Pxz (Parallel Xz).
  2. As usual Xz performs the worst; its time graph shows a steep spike. Pxz is a lot better but is still only about half as fast as Pcompress. In addition, remember that Pcompress has a bunch of additional processing overheads that the other utilities do not: SHA256, BLAKE2, LZP and Delta2 processing.
  3. Interestingly, LZ4 mode along with Dedupe and all the preprocessing produces results that are close to traditional Gzip while being more than twice as fast. In fact, two datasets show results smaller than Gzip. This is notable when one wants good compression done extremely fast.
  4. Global Dedupe of course is more effective than Segment-level Dedupe, but what is more surprising is that it is also faster overall, even though Global Dedupe requires serialized access to a central index while Segmented Dedupe is fully parallel. I can attribute this to three causes: my test system is low-end, with constrained RAM bandwidth and conflicts arising from parallel access; Segment-level Dedupe also uses memcmp() while Global Dedupe does not; and Global Dedupe reduces the data further, resulting in less work for the final compression algorithm.
  5. The concatenated Silesia corpus with 100% duplicate data of course shows the maximum benefit from Global Dedupe that removes long-range redundancies in data.
  6. In some cases compression levels 9 and 14 show marginally lesser compression than level 6. This appears to be because of LZP side-effects. At higher levels, LZP parameters are tweaked to work more aggressively so it may be taking out a little too much redundancy that affects the compression algorithm’s effectiveness. This is something that I will have to tweak going forward.
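The effectiveness gap between Global and Segment-level Dedupe is easy to model. In this toy Python sketch (not the Pcompress implementation), `scope` is how many consecutive chunks share one index: the whole file for global dedupe, a small window for segment-level. Duplicates that recur beyond the scope, like the second copy of the Silesia corpus, are simply missed:

```python
import hashlib

def dedupe_ratio(chunks, scope):
    """Fraction of chunks eliminated when each run of 'scope'
    consecutive chunks shares one hash index. scope == len(chunks)
    models global dedupe; a smaller scope models segment-level."""
    removed = 0
    for start in range(0, len(chunks), scope):
        index = set()
        for c in chunks[start:start + scope]:
            h = hashlib.sha256(c).digest()
            if h in index:
                removed += 1
            index.add(h)
    return removed / len(chunks)
```

With data whose duplicates repeat at a period longer than the segment, segment-level dedupe finds nothing while global dedupe removes nearly everything, which is exactly the pattern the doubled Silesia corpus exposes.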

I will be posting more results soon and will include a comparison with Lrzip that uses an improved Rzip implementation to take out long-range redundancies in data at a finer granularity compared to 4KB variable-block Deduplication.

Deduped storage with SQL front-end

Came across this very interesting piece on Forbes:

The unique thing about this product is the ability to run SQL queries, which obviously incurs additional overhead, but much less than tapes do. With very high data reduction ratios the product claims to be a cost-effective big-data storage container for medium- to long-term storage that can be queried much more easily than retrieving data from tapes.

However, tapes have certain economics and cater to specific operational models that are tricky to match with an appliance, so it will be interesting to watch how RainStor fares. Also, whenever I hear claims of extreme compression above 90% effectiveness, I take them with a pinch of salt. Compression can only remove as much redundancy as exists within the data. Of course some compression algorithms are better at finding redundancies than others, and compression combined with other techniques like rzip, deduplication and content-specific data transformation filters can take out global redundancies effectively from large datasets. Still, none of these techniques is magical: if the data does not contain redundancies, they will fail to reduce the data volume. What tends to happen in the real world, though, is that business data tends to be structured with repeating content, and successive snapshots of data tend to have a lot in common with previous ones, so we can potentially see a lot of reduction. One can only determine this by doing a thorough evaluation.


Innovation does not emerge out of nothing

Very insightful post on HBR blogs:

To add to the topic I have a few observations of my own on this oft-beaten drum:

More often than not we see the terms “Innovation” and “Center of Excellence” being thrown around, mostly within corporate environments. People are expected to produce innovation out of context, having no clue about the domain or the market. In my experience, albeit limited, I see this more pronounced in Bangalore. It is not difficult, for example, to walk around IT parks in Bangalore and find product managers in product companies who have no idea of the competitive landscape. They are then asked to innovate. In other cases good people are not provided the proper exposure or opportunity, and when something becomes hyped they are asked to jump on the bandwagon and deliver innovation within strict deadlines, based solely on what they have learned within the four walls of their office building.

This article is a distinct reminder that innovation cannot come out of thin air. It requires study and understanding of the domain, technology, products, existing work/research and insights from adjacent domains. One has to build on the shoulders of others. That also requires time and patience. It is not a factory where you turn a big gear called innovation and ready-made units drop off a conveyor belt at some point.

Pcompress 2.1 released with fixes and performance enhancements

I just uploaded a new release of Pcompress with a load of fixes and performance tweaks. You can see the download and some details of the changes here:

A couple of the key things are an improvement in Global Dedupe accuracy and the ability to set the dedupe block hash independently of the data verification hash. From a conservative viewpoint the default block hash is set to the proven SHA256. This can however be changed via an environment variable called ‘PCOMPRESS_CHUNK_HASH_GLOBAL’. SKEIN is one of the alternatives supported for this. SKEIN is a solid NIST SHA3 finalist with a good amount of cryptanalysis done and no practical weakness found. It is also faster than SHA256. These choices give a massive margin of safety against random hash collisions and unexpected data corruption, considering that other commercial and open-source dedupe offerings tend to use weaker options like SHA1 (collision attack found, see below), Tiger24 or even the non-cryptographic Murmur3-128 – all for the sake of performance. Admittedly, some of them did not have too many choices at the time development started on those products. In addition, even with a collision attack it is still impractical to craft a working exploit against a dedupe storage engine that uses SHA1, like say Data Domain, and corrupt stored data.
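To see why the strength of the block hash matters so much, consider a toy content-addressed chunk store (illustrative code, not the Pcompress engine). Dedupe trusts the digest completely and never compares bytes: if two different chunks ever collided on the key, `get()` would silently return the wrong data, which is why a collision-resistant hash like SHA256 or SKEIN is the safe default:

```python
import hashlib

class ChunkStore:
    """Toy content-addressed store keyed by a configurable digest,
    echoing how a dedupe engine trusts its block hash."""
    def __init__(self, algo="sha256"):
        self.algo = algo
        self.blocks = {}
    def put(self, data):
        """Store a chunk under its digest; a repeated chunk dedupes
        to the existing entry (first writer wins)."""
        key = hashlib.new(self.algo, data).digest()
        self.blocks.setdefault(key, data)
        return key
    def get(self, key):
        """Retrieve a chunk purely by its digest - no byte comparison."""
        return self.blocks[key]
```

Switching `algo` is the analogue of setting the environment variable above: the storage logic is unchanged, only the collision margin moves.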

The Segmented Global Dedupe algorithm used for scalability now delivers around 95% of the data reduction efficiency of simple, full chunk-index-based dedupe.