On Nimble Storage Compression

I recently came across this old blog by Nimble storage co-founder Umesh Maheshwari: http://www.nimblestorage.com/blog/technology/better-than-dedupe-unduped/

The post has intriguing views on inline compression vs. dedupe, and the approach of using snapshots on primary storage as a short-term backup mechanism is quite cool. COW snapshots by definition avoid duplicates the instant they are created, and consecutive snapshots share common blocks with no special effort. This aspect is not much different from ZFS, which has the same real-time compression and snapshot features, so the same benefits apply there as well.
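
As a rough illustration of why snapshots avoid duplicates by construction, here is a toy copy-on-write model in Python. It is purely illustrative and does not reflect ZFS or CASL internals; the block() helper is just a stand-in for an on-disk block identified by its content hash.

    # Toy copy-on-write snapshot model (illustration only, not ZFS/CASL internals).
    # A "volume" maps logical block numbers to data blocks; a snapshot is just a
    # frozen copy of that mapping, so unmodified blocks stay shared physically.
    import hashlib

    def block(data: bytes) -> str:
        # Stand-in for an on-disk block, identified by the hash of its contents.
        return hashlib.sha256(data).hexdigest()

    volume = {i: block(b"data-%d" % i) for i in range(100)}  # live volume, 100 blocks

    snap1 = dict(volume)  # snapshot 1: copies only the block map, not the data

    # Overwrite 5 blocks after the snapshot; COW allocates new blocks for the
    # live volume while the snapshot keeps referencing the old ones.
    for i in range(5):
        volume[i] = block(b"new-data-%d" % i)

    snap2 = dict(volume)  # snapshot 2: shares 95 old blocks and the 5 new ones

    physical = set(volume.values()) | set(snap1.values()) | set(snap2.values())
    print("logical blocks :", len(volume) + len(snap1) + len(snap2))  # 300
    print("physical blocks:", len(physical))                          # 105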

However, it is the conclusions and the graphs that I humbly find to be missing some points and even misleading to an extent. Deduplication not only removes duplicates across successive backups; it can also remove internal duplicates within the volume, which in turn helps compression. In all my tests with Pcompress I have found that deduplication provides additional gain when added to standard compression. See the entry for Pcompress here: http://www.mattmahoney.net/dc/10gb.html. ZFS, for that matter, provides deduplication as well, though it has scalability bottlenecks.
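
A quick way to see the additional gain is to run a fixed-block dedup pass before compressing. The layout below is synthetic (64 random 4KB blocks recurring far apart in a 4 MiB buffer), so the exact numbers are only illustrative, but they show the effect: dedup removes repeats that the compressor alone cannot reach.

    # Synthetic experiment: fixed-block dedup before zlib vs. zlib alone.
    import os, zlib, hashlib

    BLOCK = 4096
    distinct = [os.urandom(BLOCK) for _ in range(64)]
    volume = b"".join(distinct[i % 64] for i in range(1024))  # 4 MiB, repeats far apart

    plain = len(zlib.compress(volume, 6))

    # Dedup pass: store each distinct block once, keep a list of references.
    index, store, refs = {}, [], []
    for off in range(0, len(volume), BLOCK):
        blk = volume[off:off + BLOCK]
        key = hashlib.sha256(blk).digest()
        if key not in index:
            index[key] = len(store)
            store.append(blk)
        refs.append(index[key])

    # Crude accounting: compressed unique data plus a 4-byte reference per block.
    deduped = len(zlib.compress(b"".join(store), 6)) + 4 * len(refs)

    print("compression only   :", plain)    # close to the full 4 MiB
    print("dedup + compression:", deduped)  # roughly 1/16th of that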

While inline compression is cool, compression works within a limited window, and the window size varies with the algorithm. For example, Zlib uses a 32KB window, Bzip2 uses at most 900KB, and LZMA can use a window as large as a gigabyte. Only repeating patterns or redundancy within the window get compressed. Deduplication, on the other hand, finds duplicate blocks across the entire dataset, from gigabytes to hundreds of terabytes; there is no theoretical window-size limitation (though practical scaling considerations can impose one). So I really cannot accept the claim that snapshotting + inline compression alone will be superior. Deduplication + snapshotting + inline compression will provide greater capacity optimization. For example, see the section on “Compression” here: http://www.ciosolutions.com/Nimble+Storage+vs+Netapp+-+CASL+WAFL
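
The window limit is easy to demonstrate: with zlib's roughly 32KB window, a duplicate 4KB block is found when the two copies sit close together but missed when they are a megabyte apart, whereas a block-hash dedup index has no such distance limit. Again, synthetic data and purely illustrative:

    # Two copies of the same 4 KB block: once ~8 KB apart, once ~1 MB apart.
    import os, zlib

    blk = os.urandom(4096)
    near = blk + os.urandom(8 * 1024) + blk       # second copy inside the 32 KB window
    far  = blk + os.urandom(1024 * 1024) + blk    # second copy far outside the window

    print("near:", len(near), "->", len(zlib.compress(near, 9)))
    print("far :", len(far),  "->", len(zlib.compress(far, 9)))
    # "near" shrinks by roughly one block; "far" does not shrink at all.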

Now, if Post-Process Variable-Block Similarity-based deduplication (of the type I added to Pcompress) can be added to ZFS, things will get very interesting.

5 thoughts on “On Nimble Storage Compression”

    1. moinakg Post author

      Bup and Backshift both do simple chunk splitting using rolling checksums. Bup uses a formula derived from librsync, while Backshift uses something else based on a linear congruential pseudorandom number generator. The rolling checksum in Pcompress is somewhat similar to Backshift's but has important differences. I found this polynomial rolling checksum to have very good properties, and I vectorized it along with other optimizations. See: https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/ and https://moinakg.wordpress.com/2012/11/15/inside-content-defined-chunking-in-pcompress-part-2/. Its performance and effectiveness are currently better than those of any other technique I have looked at (a much-simplified sketch of the general idea follows at the end of this comment).

      I have implemented lock-free parallel chunking and dedup for multi-core systems. See: https://moinakg.wordpress.com/2013/03/26/coordinated-parallelism-using-semaphores/. I also use a new similarity-based algorithm to match chunks that can potentially scale to petabytes of data using a small in-memory index, yet retains >90% duplicate-elimination effectiveness with high performance. This is a completely new algorithm that I have not seen anywhere else. I cannot give many more details on it at the moment, though the code is out there to read.

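      For anyone curious, here is a much-simplified sketch of the general rolling-hash chunking idea. It is not the vectorized polynomial checksum Pcompress actually uses, and the constants (window size, boundary mask, chunk limits) are arbitrary illustrative choices.

          # Simplified content-defined chunking with a polynomial rolling hash.
          # Boundaries depend on content, so identical regions produce identical
          # chunks even when surrounding data shifts by a few bytes.
          WINDOW = 48                  # bytes in the rolling window
          MASK = (1 << 13) - 1         # boundary when low 13 bits are set -> ~8 KB chunks
          PRIME = 153191
          MIN_CHUNK, MAX_CHUNK = 2048, 65536

          def chunk_boundaries(data: bytes):
              """Yield (offset, length) pairs for content-defined chunks."""
              pow_out = pow(PRIME, WINDOW - 1, 1 << 32)  # weight of the outgoing byte
              start = 0
              while start < len(data):
                  h = 0
                  end = min(start + MAX_CHUNK, len(data))
                  cut = end
                  for i in range(start, end):
                      if i - start >= WINDOW:            # slide the window forward
                          h = (h - data[i - WINDOW] * pow_out) & 0xFFFFFFFF
                      h = (h * PRIME + data[i]) & 0xFFFFFFFF
                      if i - start >= MIN_CHUNK and (h & MASK) == MASK:
                          cut = i + 1                    # content-defined boundary
                          break
                  yield start, cut - start
                  start = cut

      Hashing each chunk (for example with SHA-256) then gives the keys for a dedup index.
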
    2. moinakg Post author

      I am interested in looking at Tahoe-LAFS but have a time crunch for the next two months due to personal reasons and a list of things to complete for Pcompress (archiving using libarchive, a deduplicated store for backups, splittable compression for Hadoop, etc.).

  1. Umesh Maheshwari

    Hello, I am the author of the Nimble Storage blog.

    Moinakg is right that dedupe+compression is necessarily more space-efficient than compression alone.

    What my blog is pointing out, however, is that dedupe+compression on traditional backups (where the primary copy and the backups are stored on separate storage systems and therefore do not share blocks) is not as space efficient as compression alone with snapshot-based backups (where the primary copy shares blocks with the backups). In other words, even the most space-efficient systems for storing traditional backups are not as efficient as systems that store snapshot-based backups and compress them.

    The reason for the above is this, to reiterate from my blog: “unduped converged storage keeps only one baseline copy of the volume, while separate deduped storage keeps two—one on primary storage and one on backup storage”. Here, by “converged storage” I meant a system that stores both primary and backup copies.

    Hope that helps. Thanks for pointing out the potential confusion.

    1. moinakg Post author

      Thanks for clarifying. Yes, storing medium-term backups on the same media as primary storage, by way of snapshots, is definitely more space efficient than keeping a separate backup. In addition, recovery is a breeze. A back-of-envelope comparison with assumed numbers is below.

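      Assuming, purely for illustration, a 1 TB volume that compresses 2:1 and 30 daily backups each changing 2% of it, the baseline-copy difference looks roughly like this:

          # Back-of-envelope space comparison; all figures above are assumed.
          TB = 1024  # GB
          volume = 1 * TB
          ratio = 0.5                  # 2:1 compression
          daily_change = 0.02 * volume
          days = 30

          # Converged: primary plus snapshot backups on one system, compressed.
          converged = volume * ratio + days * daily_change * ratio

          # Separate: compressed primary, plus a backup system holding one deduped
          # baseline and the compressed daily changes.
          separate = volume * ratio + (volume * ratio + days * daily_change * ratio)

          print("converged snapshots + compression: %d GB" % converged)  # ~819 GB
          print("separate dedupe + compression    : %d GB" % separate)   # ~1331 GB

      The extra ~512 GB in the second case is the second baseline copy Umesh describes.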
