<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>The Pseudo Random Bit Bucket</title>
	<atom:link href="http://moinakg.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://moinakg.wordpress.com</link>
	<description>Moinakg&#039;s Ramblings</description>
	<lastBuildDate>Tue, 18 Jun 2013 15:27:06 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='moinakg.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>The Pseudo Random Bit Bucket</title>
		<link>http://moinakg.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://moinakg.wordpress.com/osd.xml" title="The Pseudo Random Bit Bucket" />
	<atom:link rel='hub' href='http://moinakg.wordpress.com/?pushpress=hub'/>
		<item>
		<title>The Funny KVM benchmarks</title>
		<link>http://moinakg.wordpress.com/2013/06/18/the-funny-kvm-benchmarks/</link>
		<comments>http://moinakg.wordpress.com/2013/06/18/the-funny-kvm-benchmarks/#comments</comments>
		<pubDate>Tue, 18 Jun 2013 15:27:05 +0000</pubDate>
		<dc:creator>moinakg</dc:creator>
				<category><![CDATA[Technical]]></category>
		<category><![CDATA[Advanced Micro Devices]]></category>
		<category><![CDATA[ESX]]></category>
		<category><![CDATA[ESX performance]]></category>
		<category><![CDATA[KVM]]></category>
		<category><![CDATA[KVM performance]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[Red Hat]]></category>
		<category><![CDATA[Sandy Bridge]]></category>
		<category><![CDATA[VMware]]></category>

		<guid isPermaLink="false">http://moinakg.wordpress.com/?p=1445</guid>
		<description><![CDATA[RedHat Summit 2013 concluded recently and while browsing some of the presentation PDFs I came across something funny. In general the content is good and there is a bunch of interesting stuff available. However this particular PDF ruffled me up: http://rhsummit.files.wordpress.com/2013/06/sarathy_t_1040_kvm_hypervisor_roadmap_and_overview.pdf This presentation talks about KVM technology in general with a bunch of marketing content [&#8230;]]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.redhat.com/summit/">RedHat Summit 2013</a> concluded recently and while browsing some of the <a href="http://www.redhat.com/summit/2013/presentations/">presentation PDFs</a> I came across something funny. In general the content is good and there is a bunch of interesting stuff available. However this particular PDF ruffled me up: <a href="http://rhsummit.files.wordpress.com/2013/06/sarathy_t_1040_kvm_hypervisor_roadmap_and_overview.pdf">http://rhsummit.files.wordpress.com/2013/06/sarathy_t_1040_kvm_hypervisor_roadmap_and_overview.pdf</a></p>
<p>This presentation talks about <a href="http://www.linux-kvm.org/page/Main_Page">KVM</a> technology in general with a bunch of marketing content thrown in, which is all fine. However, fast forward to slide 12 and something looks odd. The slide seems to scream KVM&#8217;s outstanding performance on <a href="http://www.spec.org/virt_sc2010/">SPECvirt_sc2010</a> as compared to ESXi5/4. Great, isn&#8217;t it? The &#8220;Eureka&#8221; feeling lasts till you look at the bottom of the graphs. Every comparison is done on dissimilar hardware! Suddenly Archimedes comes crashing to the floor.</p>
<p>Take for example the 2-socket 16-core benchmarks. The HP DL385 G7 box is a Generation 7 AMD Bulldozer piece while the DL380p Gen8 is a Generation 8 <a class="zem_slink" title="Sandy Bridge" href="http://en.wikipedia.org/wiki/Sandy_Bridge" target="_blank" rel="wikipedia">Sandy Bridge</a> piece. RedHat is putting ESXi5 on older-generation hardware and KVM on the latest and greatest. If we consider the highest bin processors then the DL385 will get AMD <a class="zem_slink" title="Opteron" href="http://en.wikipedia.org/wiki/Opteron" target="_blank" rel="wikipedia">Opteron</a> 6220, 3.0 GHz processors with 16MB cache while the DL380p will get <a class="zem_slink" title="Xeon" href="http://en.wikipedia.org/wiki/Xeon" target="_blank" rel="wikipedia">Xeon</a> E5-2690, 2.9 GHz processors with 20MB cache. Even if the Opteron&#8217;s clock is marginally higher, a <a class="zem_slink" title="Bulldozer (microarchitecture)" href="http://en.wikipedia.org/wiki/Bulldozer_%28microarchitecture%29" target="_blank" rel="wikipedia">Bulldozer</a> is simply no match for a big juicy Sandy Bridge beast. In addition, the Bulldozers get HT links with 6.4 GT/s throughput while the Xeons get <a class="zem_slink" title="Intel QuickPath Interconnect" href="http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect" target="_blank" rel="wikipedia">QPI</a> with 8 GT/s throughput. The Gen7 box gets <a class="zem_slink" title="PCI Express" href="http://en.wikipedia.org/wiki/PCI_Express" target="_blank" rel="wikipedia">PCIe</a> Gen 2.0 while the Gen8 box gets PCIe Gen 3.0. The story goes on and on. So we have a no-contest here. The Gen8 box wins hands down even if one puts fewer VMs on the Gen7 box.</p>
<p>Let&#8217;s look at the 4-socket 40-core comparo. First, the two boxes are from two different vendors. Second, they are comparing ESXi4.1 with the latest KVM. Whatever happened to ESXi5 here? Does it not support that hardware? At least the processors on the two boxes, the IBM x3850 X5 and the DL580 G7, are comparable 10-core Xeon E7-4870 ones (considering the highest bin 10-core processors). However the older <a class="zem_slink" title="VMware ESX" href="http://en.wikipedia.org/wiki/VMware_ESX" target="_blank" rel="wikipedia">ESX</a> version skews the game.</p>
<p>Similarly, the processors in the other comparisons are comparable but the ESX version is an older one that everyone is migrating off. If I were to do a comparison, I would install the latest ESX on a machine, measure, then install the latest KVM on the same hardware and measure again, not play games.</p>
<p>RedHat is nonchalantly tying one hand behind ESX&#8217;s back. Helpfully for the marketing fuzz types we have this fine print at the bottom: &#8220;Comparison based on best performing Red Hat and VMware solutions by cpu core count published at <a href="http://www.spec.org" rel="nofollow">http://www.spec.org</a>&#8221;. That is, we are going by earlier measurements that our competitors published, so everyone chant after us: KVM is faster than ESX, KVM is faster than ESX, KVM is faster than ESX &#8230; ah well, let me grab that can of Diet Coke sitting nearby (or should it be salt rather?).</p>
<h4>Disclaimer</h4>
<p>I am NOT a Linux or KVM hater. On the contrary, I use Linux Mint day in and day out and work with open-source in general. However, above all I am a technologist and I like to take things as they really are, free of all the fuzz. Fuzz dilutes the value that various technologies bring to the table.</p>
]]></content:encoded>
			<wfw:commentRss>http://moinakg.wordpress.com/2013/06/18/the-funny-kvm-benchmarks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e2674e69c4ce84db533d3a25ca6ae46?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">moinakg</media:title>
		</media:content>
	</item>
		<item>
		<title>Architecture for a Deduplicated Archival Store: Part 2</title>
		<link>http://moinakg.wordpress.com/2013/06/15/architecture-for-a-deduplicated-archival-store-part-2/</link>
		<comments>http://moinakg.wordpress.com/2013/06/15/architecture-for-a-deduplicated-archival-store-part-2/#comments</comments>
		<pubDate>Sat, 15 Jun 2013 18:31:23 +0000</pubDate>
		<dc:creator>moinakg</dc:creator>
				<category><![CDATA[Storage]]></category>
		<category><![CDATA[Technical]]></category>
		<category><![CDATA[backup]]></category>
		<category><![CDATA[Computer data storage]]></category>
		<category><![CDATA[Data Compression]]></category>
		<category><![CDATA[Data deduplication]]></category>

		<guid isPermaLink="false">http://moinakg.wordpress.com/?p=1407</guid>
		<description><![CDATA[In the previous post on this topic I had put down my thoughts around the requirements I am looking at. In this post I will jot down some detailed notes around the design of the on-disk data store format that I am thinking of. The Archival Chunk Store From the most basic viewpoint we have [&#8230;]]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/88719648@N00/5772755813" target="_blank"><img class="zemanta-img-inserted zemanta-img-configured  aligncenter" title="Golf Disc Storage" alt="Golf Disc Storage" src="http://farm3.static.flickr.com/2360/5772755813_00783896eb_m.jpg" width="240" height="135" /></a>In the <a href="https://moinakg.wordpress.com/2013/06/11/architecture-for-a-deduplicated-archival-store-part-1/">previous post</a> on this topic I had put down my thoughts around the requirements I am looking at. In this post I will jot down some detailed notes around the design of the on-disk data store format that I am thinking of.</p>
<p><strong>The Archival Chunk Store</strong></p>
<p>From the most basic viewpoint we have data streams which are split into variable length chunks. After deduplication these chunks can be references to other chunks in the same dataset or chunks in other datasets. So we need to have <a class="zem_slink" title="Metadata" href="http://en.wikipedia.org/wiki/Metadata" target="_blank" rel="wikipedia">metadata</a> that identifies the dataset (like name, timestamp, length etc.) and then have a list of pointers to data chunks. This is not much different to a traditional file system which has inodes storing metadata and then pointers to blocks/pages on disk. It is conceptually simple to consider a single data block to have multiple references. It is intuitive. However additional metadata is needed to maintain information like reference counts.</p>
<p>The key difference between a file system and content-defined deduplication storage is that in the former all the blocks are of fixed length and potentially grouped into allocation units. In the latter chunks are of variable length. So we need additional metadata giving chunk lengths, and on-disk storage requires a second layer of disk block allocation data. Software like OpenDedup implements <a class="zem_slink" title="Filesystem in Userspace" href="http://fuse.sourceforge.net" target="_blank" rel="homepage">FuSE</a> based file systems, however they only deal with the simpler fixed-length chunking approach and offer <a class="zem_slink" title="Computer data storage" href="http://en.wikipedia.org/wiki/Computer_data_storage" target="_blank" rel="wikipedia">primary storage</a> dedupe.</p>
<p>I do not need the full file system route since I am not dealing with primary storage in this case, and skipping it also avoids a lot of complexity. There are existing file systems like <a href="http://opendedup.org/">OpenDedup</a>, <a href="http://ansrlab.cse.cuhk.edu.hk/software/livedfs/">LiveDFS</a>, <a href="http://www.lessfs.com/wordpress/">Lessfs</a> and scale-out approaches like <a class="zem_slink" title="Ceph" href="http://ceph.com" target="_blank" rel="homepage">Ceph</a>, <a class="zem_slink" title="Tahoe-LAFS" href="https://tahoe-lafs.org/" target="_blank" rel="homepage">Tahoe-LAFS</a> etc. where the scalable, variable-chunked dedupe features will be useful, but that is something for later. So I am thinking of storing the data chunks in files that I will call extents, along with the minimum additional metadata in separate metadata extents. The following diagram is a schematic of my approach to storing the chunks on disk.</p>
<p><a href="http://moinakg.files.wordpress.com/2013/06/chunkstore2.png"><img class="aligncenter size-full wp-image-1438" alt="Chunkstore" src="http://moinakg.files.wordpress.com/2013/06/chunkstore2.png?w=625&#038;h=615" width="625" height="615" /></a>The following characteristics follow from this schematic:</p>
<ul>
<li>A Dataset is identified by some metadata and a sequence of extents in a linked list.</li>
<li>Each extent is a collection of segments. Extents are essentially numbered files.</li>
<li>Each segment is a collection of variable-length data chunks.</li>
<li>Each extent stores segment data and metadata in separate files. A naming convention is used to associate extent metadata and corresponding data files.</li>
<li>Each extent can contain a fixed maximum number of segments. I am considering up to 2048 segments per extent. Incoming segments are appended to the last extent in the dataset till it fills up and a new extent is allocated.</li>
<li>Notice that a separate extent metadata section is not required. An extent is just a file.</li>
<li>The scalable Segmented Similarity based Deduplication is being used here. Each segment contains up to 2048 variable-length chunks. So with a 4KB average chunk size, each segment is about 8MB in size.</li>
<li>Segment metadata consists of a chunk count, chunk hashes and offsets. The chunk size is not stored. Instead it can be computed by subtracting current chunk&#8217;s offset from the next chunk&#8217;s offset. Since a 64-bit segment offset is stored the chunk offsets can be relative to it and only need to be 32-bit values.</li>
<li>The Similarity Index contains similarity hashes that point to segments within the extents. So the pointer has to be the extent number followed by the segment offset within the extent metadata file. Incoming segments from a new datastream are chunked, their similarity hashes computed and then approximate-match segments are looked up in the index.</li>
<li>Segment data is compressed before storing in the segment. So segment entries in the data extent are of variable length.</li>
<li>Each segment entry in the metadata extent can also be of variable length since the number of chunks can be less than the maximum. However segment entries in the metadata extent are added when an entry is made in the index, so the exact offset can be recorded.</li>
<li>Similarly, a segment entry in the metadata extent needs to point to the offset of the segment data in the data extent. However since segments are compressed later in parallel and stored into the extent, the metadata entries are updated later once the segment data is appended. Keeping segment data in a separate data extent allows this parallel processing while still allowing similarity matches to be processed from the metadata extent.</li>
<li>Duplicate chunk references are maintained in the metadata extents. A duplicate reference consists of the extent number, segment offset in the compressed file and chunk number within the segment.</li>
<li>The index is obviously persistent on disk but is loaded in memory in its entirety when doing lookups. Any insertion into the index is written immediately onto the disk. I&#8217;d obviously have to use a <a href="https://en.wikipedia.org/wiki/NoSQL">NoSQL</a> key-value store for this. I am currently interested in <a href="http://hamsterdb.com/">Hamsterdb</a>.</li>
<li>Keeping a separate metadata extent allows staging metadata on a separate high-performance storage media like flash to reduce access latency.</li>
<li>It is possible to store reference counts at the segment level within the index for the purpose of capping number of references to &#8220;popular&#8221; chunks. This can reduce dedupe ratio since not all chunks will have reached the max reference count. However the advantage of this is it avoids storing and updating reference counts in scattered records in extent files which in turn avoids some random I/O during data ingestion. Each segment has 25 similarity indicators representing different portions of the segment. So all 25 indicators should have reached the maximum reference count to completely remove the entire segment from consideration.</li>
<li>The entire segment is compressed and stored instead of per-chunk compression. This provides better compression ratio but is also an overhead especially if we just have to retrieve one chunk from a referenced segment. However due to data locality in backups most similar segments will have several chunks in common. In addition the fast LZ4 compression algorithm and caching of uncompressed segments should provide for low overheads. This is something that I have to test in practice.</li>
</ul>
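<p>To make the segment metadata layout a little more concrete, here is a minimal Python sketch of one metadata entry. The field order, the SHA-256 chunk hash, the extra end-of-segment offset and the function names are all assumptions made for illustration, not a finalized on-disk format. It only shows how a 64-bit segment offset plus 32-bit relative chunk offsets are enough to recover the chunk sizes by subtraction.</p>

```python
import hashlib
import struct

HASH_LEN = 32  # assumption: SHA-256 chunk hashes


def pack_segment_meta(segment_offset, chunks):
    """Serialize one segment's metadata entry (hypothetical layout).

    u64  offset of the segment in the data extent
    u32  chunk count
    (count + 1) x u32  chunk offsets relative to the segment start;
                       the last one is an end sentinel (an assumption)
    count x 32 bytes   chunk hashes
    """
    offsets = [0]
    for chunk in chunks:
        offsets.append(offsets[-1] + len(chunk))
    header = struct.pack("<QI", segment_offset, len(chunks))
    offset_table = struct.pack("<%dI" % len(offsets), *offsets)
    hashes = b"".join(hashlib.sha256(c).digest() for c in chunks)
    return header + offset_table + hashes


def unpack_segment_meta(buf):
    """Return (segment_offset, chunk_offsets, chunk_sizes, chunk_hashes)."""
    segment_offset, count = struct.unpack_from("<QI", buf, 0)
    pos = struct.calcsize("<QI")
    offsets = struct.unpack_from("<%dI" % (count + 1), buf, pos)
    pos += 4 * (count + 1)
    hashes = [buf[pos + i * HASH_LEN:pos + (i + 1) * HASH_LEN]
              for i in range(count)]
    # Chunk size is never stored: it is the next offset minus this one.
    sizes = [offsets[i + 1] - offsets[i] for i in range(count)]
    return segment_offset, list(offsets[:-1]), sizes, hashes
```

<p>The end sentinel costs four bytes per segment and avoids a special case for the last chunk. A full segment of 2048 chunks averaging 4KB each comes to 2048 x 4096 bytes, which is the 8MB figure mentioned above.</p>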
<p><strong>Supporting Deletion and Forward Referencing</strong></p>
<p>Deleting datasets means deleting all the extents that belong to them. However this is easier said than done because an extent may have segments which contain chunks that are referred to by other extents. So we cannot simply delete. There are two basic ways to support effective deletion, and a third that combines them.</p>
<p>The first approach is to load the segments one by one from the extents and conditionally store them into a new file. First the segment&#8217;s similarity indicators are re-computed and looked up in the index. This will give us the reference count associated with the similarity indicator along with the segment it points to. If the indicator points to another segment then its reference count is decremented. Otherwise if the associated reference count is zero, it is first removed from the index. If the reference count is zero for all similarity indicators of the segment or all its similarity indicators point to other segments then the segment is not stored into the new file. However a seek is performed on the target file to sparsely extend it. This preserves the relative offsets of the segments which need to be retained.</p>
<p>The second approach is dependent on a technique called Forward Referencing. In this incoming data is stored as-is. If new chunks are duplicates of older chunks then the older chunk entries are updated to point to the new chunks. This means that deletion can be simply performed on the oldest dataset without any further checks as all references will be to newer chunks. I will need to apply the constraint that intermediate datasets cannot be deleted. The big advantage of Forward Referencing is that it speeds up restore times a lot because the latest dataset is typically the one that you want to restore and it is stored whole and read sequentially. However Forward Referencing requires post-process deduplication in order to be performant and avoid too much random I/O during backup, for example. Also, technically it precludes source side dedupe as the data has to appear wholly on the backup store.</p>
<p>The third approach combines the above two approaches. Inline dedupe is done and then a post-process optimization pass can be kicked off to re-organize the data to a forward referenced layout. This requires temporary extra metadata space to record a log of all references per referenced extent so that we can invert the references one extent at a time. This can be somewhat tricky to get right.</p>
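<p>The rewrite pass of the first approach can be sketched roughly as below. This is only an illustration under assumptions: the index is modelled as a plain dict mapping a similarity indicator to a <code>[owner_segment, refcount]</code> pair, the indicators are passed in rather than re-computed, and a zero reference count is taken to mean that nothing else refers to the segment. The name <code>compact_extent</code> is made up for the example.</p>

```python
import io


def compact_extent(segments, index, out):
    """Copy only the still-needed segments into `out`, keeping their offsets.

    segments: iterable of (segment_id, offset, data, indicators)
    index:    dict indicator -> [owner_segment_id, refcount]  (hypothetical)
    out:      seekable binary file object opened for writing
    """
    kept = []
    for seg_id, offset, data, indicators in segments:
        keep = False
        for ind in indicators:
            entry = index.get(ind)
            if entry is None:
                continue
            owner, refs = entry
            if owner != seg_id:
                # Indicator points at another segment: drop our reference.
                entry[1] = refs - 1
            elif refs == 0:
                # Our own indicator with no remaining references.
                del index[ind]
            else:
                # Something else still depends on this segment.
                keep = True
        if keep:
            # Seeking past skipped segments leaves a hole, so retained
            # segments stay at their original offsets.
            out.seek(offset)
            out.write(data)
            kept.append(seg_id)
    return kept
```

<p>On a real file system the skipped ranges become sparse holes, assuming the file system supports sparse files. With an in-memory <code>io.BytesIO</code> target the gaps are simply zero-filled, which is handy for testing the offset logic.</p>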
<p>At present I am looking at the first approach and intend to explore the third optimization technique at a later date.</p>
<h6 class="zemanta-related-title" style="font-size:1em;">Related articles</h6>
<ul class="zemanta-article-ul">
<li class="zemanta-article-ul-li"><a href="http://moinakg.wordpress.com/2013/06/11/architecture-for-a-deduplicated-archival-store-part-1/" target="_blank">Architecture for a Deduplicated Archival Store: Part 1</a> (moinakg.wordpress.com)</li>
<li class="zemanta-article-ul-li"><a href="http://www.lessfs.com/wordpress/?p=378">A Tale about key value databases</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://moinakg.wordpress.com/2013/06/15/architecture-for-a-deduplicated-archival-store-part-2/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e2674e69c4ce84db533d3a25ca6ae46?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">moinakg</media:title>
		</media:content>

		<media:content url="http://farm3.static.flickr.com/2360/5772755813_00783896eb_m.jpg" medium="image">
			<media:title type="html">Golf Disc Storage</media:title>
		</media:content>

		<media:content url="http://moinakg.files.wordpress.com/2013/06/chunkstore2.png" medium="image">
			<media:title type="html">Chunkstore</media:title>
		</media:content>
	</item>
		<item>
		<title>Tumblr Architecture and one oddity</title>
		<link>http://moinakg.wordpress.com/2013/06/12/tumblr-architecture-and-one-oddity/</link>
		<comments>http://moinakg.wordpress.com/2013/06/12/tumblr-architecture-and-one-oddity/#comments</comments>
		<pubDate>Wed, 12 Jun 2013 14:12:33 +0000</pubDate>
		<dc:creator>moinakg</dc:creator>
				<category><![CDATA[Technical]]></category>
		<category><![CDATA[Tumblr]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://moinakg.wordpress.com/?p=1403</guid>
		<description><![CDATA[Going through the StorageMojo website I came across a tweet that pointed to this High Scalability article: http://highscalability.com/blog/2013/5/20/the-tumblr-architecture-yahoo-bought-for-a-cool-billion-doll.html It is fascinating to learn about the technologies that Tumblr uses to operate at a mind boggling scale. It is not a joke that Yahoo! paid $1.1 billion for it. With all due respect to the amazing [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Going through the <a href="http://storagemojo.com/">StorageMojo</a> website I came across a tweet that pointed to this High Scalability article: <a href="http://highscalability.com/blog/2013/5/20/the-tumblr-architecture-yahoo-bought-for-a-cool-billion-doll.html">http://highscalability.com/blog/2013/5/20/the-tumblr-architecture-yahoo-bought-for-a-cool-billion-doll.html</a></p>
<p>It is fascinating to learn about the technologies that <a class="zem_slink" title="Tumblr" href="http://tumblr.com" target="_blank" rel="homepage">Tumblr</a> uses to operate at a mind-boggling scale. It is not a joke that <a class="zem_slink" title="Yahoo!" href="http://www.yahoo.com" target="_blank" rel="homepage">Yahoo!</a> <a href="http://money.cnn.com/2013/05/20/technology/yahoo-buys-tumblr/index.html">paid $1.1 billion</a> for it. With all due respect to the amazing engineering that Tumblr has accomplished, there is but one piece that strikes me as odd:</p>
<p>&#8220;&#8230;Example, for a new ID generator they needed A JVM process to generate service responses in less the 1ms at a rate at 10K requests per second with a 500 MB RAM limit with High Availability. They found the serial collector gave the lowest latency for this particular work load. Spent a lot of time on JVM tuning&#8230;&#8221;</p>
<p>Especially the part &#8220;&#8230;Spent a lot of time on JVM tuning&#8230;&#8221;. This is clearly a niche low-latency use case. For such things why not just drop to native code and maybe a <a class="zem_slink" title="Slab allocation" href="http://en.wikipedia.org/wiki/Slab_allocation" target="_blank" rel="wikipedia">slab allocator</a> and be done with it? Why spend &#8220;lots of time&#8221; fighting with the garbage collector and related effects? What about using the right tool for the job?</p>
<p>Maybe there is something else that I am missing completely.</p>
]]></content:encoded>
			<wfw:commentRss>http://moinakg.wordpress.com/2013/06/12/tumblr-architecture-and-one-oddity/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e2674e69c4ce84db533d3a25ca6ae46?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">moinakg</media:title>
		</media:content>
	</item>
		<item>
		<title>Architecture for a Deduplicated Archival Store: Part 1</title>
		<link>http://moinakg.wordpress.com/2013/06/11/architecture-for-a-deduplicated-archival-store-part-1/</link>
		<comments>http://moinakg.wordpress.com/2013/06/11/architecture-for-a-deduplicated-archival-store-part-1/#comments</comments>
		<pubDate>Tue, 11 Jun 2013 17:58:45 +0000</pubDate>
		<dc:creator>moinakg</dc:creator>
				<category><![CDATA[Storage]]></category>
		<category><![CDATA[Technical]]></category>
		<category><![CDATA[backup]]></category>
		<category><![CDATA[Bacula]]></category>
		<category><![CDATA[Data Compression]]></category>
		<category><![CDATA[Data deduplication]]></category>
		<category><![CDATA[EMC DataDomain]]></category>
		<category><![CDATA[Exdupe]]></category>
		<category><![CDATA[Pcompress]]></category>
		<category><![CDATA[Sepaton]]></category>
		<category><![CDATA[ZFS]]></category>

		<guid isPermaLink="false">http://moinakg.wordpress.com/?p=1400</guid>
		<description><![CDATA[Requirements Pcompress as it stands today is a powerful single-file lossless compression program that applies a variety of compression and data deduplication algorithms to effectively reduce the dataset size. However as far as data deduplication goes it can only apply the algorithms to a single dataset to remove internal duplicates. What is more useful is [&#8230;]]]></description>
				<content:encoded><![CDATA[<h1>Requirements</h1>
<p><a href="http://moinakg.github.io/pcompress/">Pcompress</a> as it stands today is a powerful single-file lossless compression program that applies a variety of compression and <a class="zem_slink" title="Data deduplication" href="http://en.wikipedia.org/wiki/Data_deduplication" target="_blank" rel="wikipedia">data deduplication</a> algorithms to effectively reduce the dataset size. However as far as data deduplication goes it can only apply the algorithms to a single dataset to remove internal duplicates. What is more useful is to be able to apply deduplication to remove common blocks across datasets to achieve even greater savings, especially in backup scenarios. This is why we see a slew of products in this space boasting of up to 90% reduction in backup storage requirements.</p>
<p>In the open source space we have filesystems like <a href="http://opendedup.org/">OpenDedup</a>, <a href="http://www.lessfs.com/">Lessfs</a>, <a href="http://code.google.com/p/s3ql/">S3QL</a>, <a class="zem_slink" title="ZFS" href="http://en.wikipedia.org/wiki/ZFS" target="_blank" rel="wikipedia">ZFS</a> etc. that provide deduplication even for primary online storage. While that is a desirable feature in itself, these systems lack many of the advanced features of commercial products like <a class="zem_slink" title="Sepaton" href="http://www.sepaton.com/" target="_blank" rel="homepage">Sepaton</a>, <a href="http://h18006.www1.hp.com/storage/pdfs/hpstoreonce.pdf">HP StoreOnce</a> or <a href="http://www.emc.com/domains/datadomain/index.htm">EMC DataDomain</a>. Pcompress implements a bunch of those advanced algorithms today (I am writing a couple of papers on this) so it makes sense to extend the software into a proper scalable archival store for backup requirements. On this topic it is worthwhile to take note of <a href="http://www.exdupe.com/">eXdupe</a>, which provides archival deduplicated backup capabilities but is quite simplistic, providing only differential storage against a single initial backup dataset. It is much like a full backup followed by incremental backups, except that there is no real multi-file dedupe. One can only dedupe the latest backup data against the first non-differential backup data. It is not a scalable chunk store that can chunk any incoming dataset and store only the unique chunks.</p>
<p>If we look at open source backup software like <a href="http://www.amanda.org/">Amanda</a> or <a class="zem_slink" title="Bacula" href="http://www.bacula.org/" target="_blank" rel="homepage">Bacula</a>, none of them have block-level dedupe capability, let alone sliding-window variable block chunking. So, in a nutshell, we can summarize the requirements as follows:</p>
<ol>
<li>A Deduplicated, Scalable Chunk Store that stores unique chunks and provides fast read access.</li>
<li>The Chunk Store is meant for backups and archival storage and assumes immutable chunks. I am not looking at online primary storage in this case. However the system should support deletion of old datasets.</li>
<li>It should be able to do inline dedupe. With inline dedupe we can do source side dedupe reducing the amount of backup data transferred over the network.</li>
<li>Pcompress can potentially utilize all the cores on the system and this archival store should be no different.</li>
<li>Metadata overhead should be kept to a minimum and I will be using the Segmented similarity based indexing to use a global index that can fit in RAM.</li>
<li>Data and Metadata should be kept separate such that metadata can be located on high-speed storage like SSDs to speed up access. While this increases the number of separate disk accesses during restore, the effect can be reduced by locality sensitive caching in addition to SSDs.</li>
<li>The system should of course be able to scale to petabytes.</li>
<li>It should be possible to integrate the system with existing backup software like Amanda, Bacula etc. This is needed if we want to do source-side dedupe.</li>
<li>There should be a chunk reference count with a max limit to avoid too many datasets referencing the same chunk. The loss of a multiply referenced chunk can corrupt multiple backups. Having an upper limit reduces the risk. In addition we need replication but that is not in my charter at this time. Filesystem replication/distribution can be used for the purpose. Software like <a href="http://www.drbd.org/">DRBD</a> can also be used.</li>
<li>Another feature is to limit deduplication to the last X backup sets much like a sliding window. This allows cleanly removing really old backups and prevents recent backups from referencing chunks in that old data.</li>
<li>All this applies to archival storage on disk. Deduping backups onto tape is a different can of worms that I will probably look at later.</li>
</ol>
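<p>To make the reference-count requirement a little more concrete, here is a minimal Python sketch of a content-addressed chunk store with a capped reference count. It is only an illustration and not the planned implementation: the <code>ChunkStore</code> class, the <code>max_refs</code> parameter, the use of SHA-256 as the chunk digest and the policy of storing a fresh copy once the cap is reached are all assumptions made for this example, and everything is kept in memory.</p>

```python
import hashlib


class ChunkStore:
    """Toy in-memory deduplicated chunk store with a capped reference count.

    Hypothetical sketch: once a stored copy of a chunk has max_refs
    references, the next reference gets a fresh copy of the same data.
    """

    def __init__(self, max_refs=3):
        self.max_refs = max_refs
        # digest -> list of [refcount, data] entries, one per stored copy
        self.copies = {}

    def put(self, data):
        """Store a chunk and return a (digest, copy_index) reference."""
        digest = hashlib.sha256(data).hexdigest()
        copies = self.copies.setdefault(digest, [])
        for index, copy in enumerate(copies):
            # Reuse an existing copy only while it is below the cap.
            if copy is not None and self.max_refs > copy[0]:
                copy[0] += 1
                return (digest, index)
        copies.append([1, data])
        return (digest, len(copies) - 1)

    def get(self, ref):
        """Return the chunk data for a reference returned by put()."""
        digest, index = ref
        return self.copies[digest][index][1]

    def release(self, ref):
        """Drop one reference, e.g. when an old dataset is deleted."""
        digest, index = ref
        copy = self.copies[digest][index]
        copy[0] -= 1
        if copy[0] == 0:
            # Keep the slot so that other copy indexes stay valid.
            self.copies[digest][index] = None

    def stored_copies(self):
        """Count the physical copies currently held."""
        return sum(
            1
            for copies in self.copies.values()
            for copy in copies
            if copy is not None
        )
```

<p>With <code>max_refs=2</code>, storing the same chunk three times keeps two physical copies, so losing one copy damages at most two references. Releasing references, as would happen when an old dataset is deleted, frees a copy once its count drops to zero.</p>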
<p>I plan to tackle these requirements in phases. For example, I will not initially look at source-side dedupe; rather, the initial focus will be on getting a high-performance, stable backend. If you are wondering about some of the terms used here, the <a href="http://en.wikipedia.org/wiki/Data_deduplication">Wikipedia article</a> has explanations.</p>
]]></content:encoded>
			<wfw:commentRss>http://moinakg.wordpress.com/2013/06/11/architecture-for-a-deduplicated-archival-store-part-1/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e2674e69c4ce84db533d3a25ca6ae46?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">moinakg</media:title>
		</media:content>
	</item>
		<item>
		<title>Findings by Google on NUMA Performance</title>
		<link>http://moinakg.wordpress.com/2013/06/05/findings-by-google-on-numa-performance/</link>
		<comments>http://moinakg.wordpress.com/2013/06/05/findings-by-google-on-numa-performance/#comments</comments>
		<pubDate>Wed, 05 Jun 2013 17:36:09 +0000</pubDate>
		<dc:creator>moinakg</dc:creator>
				<category><![CDATA[Technical]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Non-Uniform Memory Access]]></category>
		<category><![CDATA[NUMA]]></category>

		<guid isPermaLink="false">http://moinakg.wordpress.com/?p=1397</guid>
		<description><![CDATA[Very interesting and surprising findings by Google with respect to NUMA: http://highscalability.com/blog/2013/5/30/google-finds-numa-up-to-20-slower-for-gmail-and-websearch.html It is curious that cache contention and NUMA have such an interplay depending on the workload being presented. The most interesting learning is from this paragraph: &#8220;In conclusion, surprisingly, some running scenarios with more remote memory accesses may outperform scenarios with more local accesses [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Very interesting and surprising findings by Google with respect to <a class="zem_slink" title="Non-Uniform Memory Access" href="http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access" target="_blank" rel="wikipedia">NUMA</a>: <a href="http://highscalability.com/blog/2013/5/30/google-finds-numa-up-to-20-slower-for-gmail-and-websearch.html">http://highscalability.com/blog/2013/5/30/google-finds-numa-up-to-20-slower-for-gmail-and-websearch.html</a></p>
<p>It is curious that <a class="zem_slink" title="CPU cache" href="http://en.wikipedia.org/wiki/CPU_cache" target="_blank" rel="wikipedia">cache</a> contention and NUMA have such an interplay depending on the workload being presented. The most interesting takeaway is in this paragraph:</p>
<p>&#8220;<em>In conclusion, surprisingly, some running scenarios with more remote memory accesses may outperform scenarios with more local accesses due to an increased amount of cache contention for the latter, especially when 100% local accesses cannot be guaranteed. This tradeoff between NUMA and cache sharing/contention varies for different applications and when the application’s corunner changes. The tradeoff also depends on the remote access penalty and the impact of cache contention on a given machine platform. On our <a class="zem_slink" title="Nehalem (microarchitecture)" href="http://en.wikipedia.org/wiki/Nehalem_%28microarchitecture%29" target="_blank" rel="wikipedia">Intel Westmere</a>, more often, NUMA has a more significant impact than cache contention. This may be due to the fact that this platform has a fairly large shared cache while the remote access latency is as large as 1.73x of local latency.</em>&#8221;</p>
<p>These findings have implications for NUMA-aware thread schedulers in the OS. They would need to compute NUMA policy parameters based on the platform and load characteristics (from CPU performance counters). It is even worth pondering whether it makes sense to optionally let threads give NUMA policy hints to the scheduler programmatically. That is, a thread could declare whether memory locality or avoiding cache contention matters more to it.</p>
<p>Apart from memory, other system components are also becoming socket-local in order to scale better. Network interfaces and I/O connections are two recent examples. The considerations raised by the NUMA study call for similar studies of these other components as well.</p>
<p><a href="http://moinakg.files.wordpress.com/2013/06/numa.png"><img class="aligncenter size-full wp-image-1398" alt="NUMA vs UMA" src="http://moinakg.files.wordpress.com/2013/06/numa.png?w=625&#038;h=271" width="625" height="271" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://moinakg.wordpress.com/2013/06/05/findings-by-google-on-numa-performance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e2674e69c4ce84db533d3a25ca6ae46?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">moinakg</media:title>
		</media:content>

		<media:content url="http://moinakg.files.wordpress.com/2013/06/numa.png" medium="image">
			<media:title type="html">NUMA vs UMA</media:title>
		</media:content>
	</item>
		<item>
		<title>R.I.P. Atul Chitnis &#8211; End of a Chapter</title>
		<link>http://moinakg.wordpress.com/2013/06/03/r-i-p-atul-chitnis-end-of-a-chapter/</link>
		<comments>http://moinakg.wordpress.com/2013/06/03/r-i-p-atul-chitnis-end-of-a-chapter/#comments</comments>
		<pubDate>Mon, 03 Jun 2013 14:20:18 +0000</pubDate>
		<dc:creator>moinakg</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[Atul Chitnis]]></category>
		<category><![CDATA[FOSS.IN]]></category>
		<category><![CDATA[Free Open Source Software]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[PCQuest (magazine)]]></category>

		<guid isPermaLink="false">http://moinakg.wordpress.com/?p=1393</guid>
		<description><![CDATA[I was very saddened this morning to hear the news of Atul Chitnis passing away. He had been battling cancer for a while and finally lost the fight. I am sure he will be at peace in the Happy Computing Community. I have known him since his early PCQuest days, and my awareness of Linux was primarily due to [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I was very saddened this morning to hear the news of <a class="zem_slink" title="Atul Chitnis" href="http://AtulChitnis.net" target="_blank" rel="homepage">Atul Chitnis</a> passing away. He had been battling cancer for a while and finally lost the fight. I am sure he will be at peace in the <a href="http://en.wikipedia.org/wiki/Happy_hunting_ground">Happy Computing Community</a>.</p>
<p>I have known him since his early <a class="zem_slink" title="PCQuest (magazine)" href="http://en.wikipedia.org/wiki/PCQuest_%28magazine%29" target="_blank" rel="homepage">PCQuest</a> days, and my awareness of Linux was primarily due to his <a href="https://www.adityanag.com/articles/pcqlinux-a-fedora-based-distro-for-india/" target="_blank">PCQlinux</a> distribution initiative. However, he will be remembered most for the <a href="http://foss.in/">FOSS.IN</a> conference. With his passing, <a href="http://foss.in/">FOSS.IN</a> has lost a father figure. I have been attending the conference since the time it was called Linux Bangalore, and his influence over the flocks gathering there was unmistakable. He did have his quirks and his share of disagreements with others in the Indian FOSS community, but his far-reaching contributions to the Indian FOSS scene overshadow everything else.</p>
<p><a href="http://www.firstpost.com/tech/indian-tech-world-mourns-death-of-open-source-guru-atul-chitnis-837705.html">http://www.firstpost.com/tech/indian-tech-world-mourns-death-of-open-source-guru-atul-chitnis-837705.html</a></p>
<p><a href="http://en.wikipedia.org/wiki/Atul_Chitnis">http://en.wikipedia.org/wiki/Atul_Chitnis</a></p>
]]></content:encoded>
			<wfw:commentRss>http://moinakg.wordpress.com/2013/06/03/r-i-p-atul-chitnis-end-of-a-chapter/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e2674e69c4ce84db533d3a25ca6ae46?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">moinakg</media:title>
		</media:content>
	</item>
		<item>
		<title>Updated Compression Benchmarks &#8211; part 3</title>
		<link>http://moinakg.wordpress.com/2013/06/01/updated-compression-benchmarks-3/</link>
		<comments>http://moinakg.wordpress.com/2013/06/01/updated-compression-benchmarks-3/#comments</comments>
		<pubDate>Sat, 01 Jun 2013 18:15:28 +0000</pubDate>
		<dc:creator>moinakg</dc:creator>
				<category><![CDATA[Technical]]></category>
		<category><![CDATA[Data Compression]]></category>
		<category><![CDATA[Data deduplication]]></category>
		<category><![CDATA[Exdupe]]></category>
		<category><![CDATA[Lrzip]]></category>
		<category><![CDATA[Pcompress]]></category>
		<category><![CDATA[Rzip]]></category>

		<guid isPermaLink="false">http://moinakg.wordpress.com/?p=1390</guid>
		<description><![CDATA[I have added the 3rd and final set of benchmark results comparing Pcompress to two other data dedupe utilities, Lrzip and eXdupe here: http://moinakg.github.io/pcompress/results3.html. Lrzip does not do traditional dedupe of 4KB blocks or above. Rather it uses the Rzip algorithm which is derived from Rsync. Rzip also does variable block dedupe but at much [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I have added the 3rd and final set of benchmark results comparing <a href="http://moinakg.github.io/pcompress/">Pcompress</a> to two other data dedupe utilities, <a class="zem_slink" title="Rzip" href="http://en.wikipedia.org/wiki/Rzip" target="_blank" rel="wikipedia">Lrzip</a> and <a href="http://www.exdupe.com/">eXdupe</a> here: <a href="http://moinakg.github.io/pcompress/results3.html">http://moinakg.github.io/pcompress/results3.html</a>. Lrzip does not do traditional dedupe of 4KB blocks or above. Rather it uses the <a href="http://rzip.samba.org/">Rzip</a> algorithm which is derived from <a class="zem_slink" title="Rsync" href="http://en.wikipedia.org/wiki/Rsync" target="_blank" rel="wikipedia">Rsync</a>.</p>
<p>Rzip also does variable block dedupe but at much smaller sizes than 4KB. However, I am not sure if Rzip can be adapted as a multi-file generalized deduplication store, as the index blow-up is quite large. It might be possible to do segmented matching and then apply Rzip across segment data, but that would require re-reading old segment data, so the dedupe solution would necessarily be offline or post-process.</p>
<p>The observations from the results are summarized below:</p>
<ul>
<li>If we just do Dedupe and avoid compression of data (&#8220;Dedupe Only&#8221; result in the graphs) then Lrzip produces smaller archives. This is expected, since Pcompress does traditional Dedupe with variable blocks averaging 4KB while Lrzip finds matches at much smaller lengths. Exdupe cannot be compared here as it has no option to avoid compression.</li>
<li>At high compression levels Pcompress consistently gives the fastest times. However, except with the LZ4 option, Pcompress produces slightly larger archives than Lrzip for every algorithm. Lrzip uses <a class="zem_slink" title="Lempel–Ziv–Oberhumer" href="http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Oberhumer" target="_blank" rel="wikipedia">LZO</a>, not LZ4. I tried using Lrzip to just do rzip and then compressed the result with LZ4 for the CentOS tarball. I got a size of 662751240 bytes with data split into 256MB chunks. So Lrzip would have produced a smaller archive if it had integrated LZ4.</li>
<li><a href="http://code.google.com/p/lz4/" target="_blank">LZ4</a> is a fantastic algorithm. The combination of speed and compression ratio is unparalleled.</li>
<li>At fast compression levels Pcompress matches or exceeds Exdupe in speed (depending on the dataset) while producing a better compression ratio. Once again LZ4 contributes a great deal to the result. Lrzip loses out handily in terms of speed but its compression ratio is good.</li>
<li>In general Pcompress gives some of the best combinations of compression ratio and speed.</li>
<li>One possible reason for the larger Exdupe file sizes is extra metadata. Exdupe allows differential backups to be taken against an initial full backup. In order to do block-level differential backup, in other words deduplicated backup, it needs to store additional metadata for existing blocks.</li>
</ul>
<p>Remember this is just a small system with 2 cores and 2 hyperthreads per core, or 4 logical cores. On systems with more cores Pcompress performance will scale accordingly.</p>
<h6 class="zemanta-related-title" style="font-size:1em;">Related articles</h6>
<ul class="zemanta-article-ul">
<li class="zemanta-article-ul-li"><a href="http://moinakg.wordpress.com/2013/05/26/updated-compression-benchmarks/" target="_blank">Updated Compression Benchmarks</a> (moinakg.wordpress.com)</li>
<li class="zemanta-article-ul-li"><a href="http://moinakg.wordpress.com/2013/05/27/updated-compression-benchmarks-part-2/" target="_blank">Updated Compression Benchmarks &#8211; part 2</a> (moinakg.wordpress.com)</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://moinakg.wordpress.com/2013/06/01/updated-compression-benchmarks-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e2674e69c4ce84db533d3a25ca6ae46?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">moinakg</media:title>
		</media:content>
	</item>
		<item>
		<title>Pcompress 2.2 released</title>
		<link>http://moinakg.wordpress.com/2013/05/28/pcompress-2-2-released/</link>
		<comments>http://moinakg.wordpress.com/2013/05/28/pcompress-2-2-released/#comments</comments>
		<pubDate>Tue, 28 May 2013 16:44:11 +0000</pubDate>
		<dc:creator>moinakg</dc:creator>
				<category><![CDATA[Technical]]></category>
		<category><![CDATA[Data Compression]]></category>
		<category><![CDATA[Data deduplication]]></category>
		<category><![CDATA[dedupe]]></category>
		<category><![CDATA[Delta Differencing]]></category>
		<category><![CDATA[Pcompress]]></category>
		<category><![CDATA[Segmented Global Dedupe]]></category>

		<guid isPermaLink="false">http://moinakg.wordpress.com/?p=1388</guid>
		<description><![CDATA[I decided to put out another release of Pcompress, primarily because of some bugfixes that went in. One of them is a build issue on Debian 6 with non-SSE4 processors and the others are a couple of crashes with invalid input. In addition to fixing stuff I have rewritten the Min-Heap code and taken out all the [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I decided to put out <a href="http://code.google.com/p/pcompress/downloads/detail?name=pcompress-2.2.tar.bz2">another release of Pcompress</a>, primarily because of some bugfixes that went in. One of them is a build issue on Debian 6 with non-SSE4 processors and the others are a couple of crashes with invalid input.</p>
<p>In addition to fixing stuff I have rewritten the Min-Heap code and taken out all the <a class="zem_slink" title="Python (programming language)" href="http://www.python.org/" target="_blank" rel="homepage">Python</a>-derived stuff. It is now much simpler and much faster than before. While doing this rewrite I found and fixed a problem with the earlier Min-Heap approach. Thus <a href="http://searchstorage.techtarget.com/definition/delta-differencing">Delta Differencing</a> is now faster and more accurate than before.</p>
<p>I also improved the scalable Segmented Global Dedupe and it now works with greater than 95% efficiency in finding duplicate chunks. It appears that using larger segments for larger dedupe block sizes results in better accuracy. Come to think of it, this is also logical, since one would want faster processing with smaller indexes when using larger and larger dedupe blocks. Correspondingly larger segments enable just that.</p>
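<p>For readers unfamiliar with the idea, here is a minimal Python sketch of segment-level similarity indexing in general. It is not Pcompress code, which is written in C and differs in detail. The <code>SegmentIndex</code> class, the <code>keys_per_segment</code> parameter, the use of SHA-256 and the choice of the smallest chunk hashes as similarity keys are all assumptions made for this illustration.</p>

```python
import hashlib
import heapq


def chunk_hashes(chunks):
    """Hash every chunk of a segment (SHA-256 is an assumption here)."""
    return [hashlib.sha256(chunk).digest() for chunk in chunks]


def segment_keys(hashes, keys_per_segment=2):
    """Pick the smallest distinct chunk hashes as the segment's similarity keys."""
    return heapq.nsmallest(keys_per_segment, set(hashes))


class SegmentIndex:
    """Toy index that keeps only a few keys per segment in the global index."""

    def __init__(self, keys_per_segment=2):
        self.keys_per_segment = keys_per_segment
        self.by_key = {}    # similarity key -> segment id (the small global index)
        self.segments = []  # segment id -> set of chunk hashes (would live on disk)

    def add_segment(self, chunks):
        """Index one segment and return how many of its chunks were already known."""
        hashes = chunk_hashes(chunks)
        keys = segment_keys(hashes, self.keys_per_segment)
        # Only segments sharing a similarity key are consulted for duplicates.
        candidates = {self.by_key[key] for key in keys if key in self.by_key}
        known = set()
        for segment_id in candidates:
            known |= self.segments[segment_id]
        duplicates = sum(1 for digest in hashes if digest in known)
        segment_id = len(self.segments)
        self.segments.append(set(hashes))
        for key in keys:
            self.by_key.setdefault(key, segment_id)
        return duplicates
```

<p>Because only a handful of keys per segment go into the global index, the index stays small, at the cost of missing duplicates that live in segments which do not share a key. In this toy version, larger segments mean fewer index entries for the same amount of data.</p>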
]]></content:encoded>
			<wfw:commentRss>http://moinakg.wordpress.com/2013/05/28/pcompress-2-2-released/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e2674e69c4ce84db533d3a25ca6ae46?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">moinakg</media:title>
		</media:content>
	</item>
		<item>
		<title>Updated Compression Benchmarks &#8211; part 2</title>
		<link>http://moinakg.wordpress.com/2013/05/27/updated-compression-benchmarks-part-2/</link>
		<comments>http://moinakg.wordpress.com/2013/05/27/updated-compression-benchmarks-part-2/#comments</comments>
		<pubDate>Mon, 27 May 2013 18:27:05 +0000</pubDate>
		<dc:creator>moinakg</dc:creator>
				<category><![CDATA[Technical]]></category>
		<category><![CDATA[compression algorithms]]></category>
		<category><![CDATA[Data Compression]]></category>
		<category><![CDATA[LZMA]]></category>
		<category><![CDATA[LZP]]></category>
		<category><![CDATA[Pcompress]]></category>
		<category><![CDATA[PPMD]]></category>

		<guid isPermaLink="false">http://moinakg.wordpress.com/?p=1384</guid>
		<description><![CDATA[I have added the second set of benchmarks that demonstrate the effect of the different pre-processing options on compression ratio and speed. The results are available here: http://moinakg.github.io/pcompress/results2.html All of these results have Global Dedupe enabled. These results also compare the effect of various compression algorithms on two completely different datasets. One is a set [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I have added the second set of benchmarks that demonstrate the effect of the different pre-processing options on compression ratio and speed. The results are available here: <a href="http://moinakg.github.io/pcompress/results2.html">http://moinakg.github.io/pcompress/results2.html</a></p>
<p>All of these results have Global Dedupe enabled. These results also compare the effect of various compression algorithms on two completely different datasets. One is a set of VMDK files and the other is purely textual data. Some observations below:</p>
<ul>
<li>In virtually all cases using the &#8216;-L&#8217; and &#8216;-P&#8217; switches results in the smallest file. Only in the case of <a class="zem_slink" title="Lempel–Ziv–Markov chain algorithm" href="http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm" target="_blank" rel="wikipedia">LZMA</a> do these options marginally worsen the compression ratio, indicating that the reduction of redundancy is hurting LZMA. To identify which of the two hurts more, I repeated the command (see the terminology in the results page) with the lzmaMt algo and only option &#8216;-L&#8217; at compression level 6 on the CentOS vmdk tarball. The resultant size came to 472314917. The size obtained from running with only option &#8216;-P&#8217; is available in the results page: 469153825. Thus it is the LZP preprocessing that unsettles LZMA the most, along with the segment size of 64MB. Delta2 actually helps. Running the command with a segment size of 256MB we see the following results &#8211; &#8216;-L&#8217; and &#8216;-P&#8217;: 467946789, &#8216;-P&#8217; only: 466076733, &#8216;-L&#8217; only: . Once again Delta2 helps. At higher compression levels, however, Delta2 is marginally worse as well.</li>
<li>There is some interesting behavior with respect to the <a class="zem_slink" title="Prediction by partial matching" href="http://en.wikipedia.org/wiki/Prediction_by_partial_matching" target="_blank" rel="wikipedia">PPMD</a> algorithm. The time graph (red line) shows a relative spike for the CentOS graphs as compared to the Linux source tarball graphs. PPMD is an algorithm primarily suited for textual data, so using it on non-textual data provides good compression but takes more time.</li>
<li>Both Libbsc and PPMD are especially good on the textual Linux source tar and are comparable to LZMA results while taking only a fraction of the time taken by LZMA. Libbsc in particular really shines, producing better compression while being much faster than LZMA. However, I have seen decompression time with Libbsc to be quite high compared to PPMD.</li>
</ul>
<h6 class="zemanta-related-title" style="font-size:1em;">Related articles</h6>
<ul class="zemanta-article-ul">
<li class="zemanta-article-ul-li"><a href="http://moinakg.wordpress.com/2013/05/26/updated-compression-benchmarks/" target="_blank">Updated Compression Benchmarks</a> (moinakg.wordpress.com)</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://moinakg.wordpress.com/2013/05/27/updated-compression-benchmarks-part-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e2674e69c4ce84db533d3a25ca6ae46?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">moinakg</media:title>
		</media:content>
	</item>
		<item>
		<title>Stress-testing Systems</title>
		<link>http://moinakg.wordpress.com/2013/05/26/stress-testing-systems/</link>
		<comments>http://moinakg.wordpress.com/2013/05/26/stress-testing-systems/#comments</comments>
		<pubDate>Sun, 26 May 2013 10:02:08 +0000</pubDate>
		<dc:creator>moinakg</dc:creator>
				<category><![CDATA[Technical]]></category>
		<category><![CDATA[Diagnostics]]></category>
		<category><![CDATA[diagnostics tools]]></category>
		<category><![CDATA[Fedora]]></category>
		<category><![CDATA[Stress testing]]></category>
		<category><![CDATA[system crashes]]></category>

		<guid isPermaLink="false">http://moinakg.wordpress.com/?p=1379</guid>
		<description><![CDATA[Many a time we need software that lets us load a system and check its stability under bad conditions. This can be a burn-in test, or it can be the generation of load to make borderline faulty hardware start acting up. This allows one to isolate system crashes to either hardware or [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Many a time we need software that lets us load a system and check its stability under bad conditions. This can be a burn-in test, or it can be the generation of load to make borderline faulty hardware start acting up. This allows one to isolate system crashes to either hardware or software issues, for example. In my experience a common scenario is one where system crash logs look like hardware issues but the diagnostic tools, including the vendor-provided ones, come out clean, and then the system crashes again soon after being put back into production use. Eventually, after some back-and-forth investigation, potentially faulty components are replaced by trial and error until things are stable again or the box itself is replaced.</p>
<p>One of the choices here is to run something that loads the system hard, causing hidden faults to surface faster than they otherwise would. When it comes to stress testing tools there are a whole bunch of choices, but most of them focus on one piece at a time. Most commonly it starts with the CPU, then RAM and Disk I/O of course. However, I have yet to come across something that comprehensively loads the entire system. By entire system I mean CPU, RAM, Disk and Network together. In addition, by loading the CPU I mean loading virtually every component inside the CPU: FPU, SSE, AVX, Fetch, Decode and so on. Just running a single computation like <a href="http://files.extremeoverclocking.com/file.php?f=103">Prime95</a>, for example, may heat up the CPU and/or RAM modules but it exercises only a few components within. The key here is to stress test everything in parallel.</p>
<p>Eventually all this should raise the system&#8217;s ambient temperature by a few degrees, even when it is located inside a chilled datacenter and even when the server&#8217;s fans are spinning at a higher RPM. Once we have stressed the box we can then look at diagnostic logs like the IML (HP Integrated Management Log) and run diagnostic tools that will hopefully have a better chance of picking up something odd.</p>
<p>I have worked on something like this at work, where we have successfully used it on several occasions for troubleshooting faults, evaluating new server models and commissioning new datacenter field layouts. I have now started an open-source project along the same lines but more comprehensive: <a href="https://github.com/moinakg/systemroller">https://github.com/moinakg/systemroller</a></p>
<p>At the moment this is a work in progress, and one will only find a few items in that GitHub repo, mostly dealing with creating a mini <a href="http://fedoraproject.org/">Fedora</a> live image, which is a core part of the system. The objectives for this system are listed below.</p>
<ul>
<li>Parallel stress testing of CPU, RAM, Disk, and Network together or a chosen subset on Linux. Of course the core test framework should lend itself to be ported to other platforms like BSD or <a href="http://wiki.illumos.org/display/illumos/illumos+Home">Illumos</a>.</li>
<li>Attempt to load virtually every sub-component.</li>
<li>Non-destructive disk tests.</li>
<li>Network interface Card testing that will not flood the network with packets or frames.</li>
<li>Post-test verification and diagnostics scan.</li>
<li>Self-contained live-bootable environment to allow scheduling tests via <a href="http://en.wikipedia.org/wiki/Preboot_Execution_Environment">PXE</a> boot for example.</li>
<li>Ability to pass parameters via PXE/DHCP options.</li>
<li>The live environment should allow restricted root access that withholds filesystem utilities like mount but allows reading from the block device. In addition, the restricted shell should provide only a small subset of Linux utilities to prevent backdoors. This will allow systems engineers to run diagnostics etc. while providing no ability to access production data on the disk filesystems.</li>
<li>An HTTP-based graphical console to remotely access the live environment and look at logs, run tests, do diagnostics etc.</li>
<li>The live bootable image should be as small as feasible and should be able to load itself entirely in RAM and boot and run off a ramdisk.</li>
</ul>
<p>The GitHub project repo currently provides a Fedora kickstart file that goes to great lengths to minimize the live bootable <a class="zem_slink" title="ISO image" href="http://en.wikipedia.org/wiki/ISO_image" target="_blank" rel="wikipedia">ISO image</a> (approx. 139MB including EFI boot capability). The live environment boots and automatically logs in to a restricted root environment. One will require Fedora 18 and the Fedora livecd-creator to build it (see the README).</p>
]]></content:encoded>
			<wfw:commentRss>http://moinakg.wordpress.com/2013/05/26/stress-testing-systems/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e2674e69c4ce84db533d3a25ca6ae46?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">moinakg</media:title>
		</media:content>
	</item>
	</channel>
</rss>
