<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments for The Pseudo Random Bit Bucket</title>
	<atom:link href="http://moinakg.wordpress.com/comments/feed/" rel="self" type="application/rss+xml" />
	<link>http://moinakg.wordpress.com</link>
	<description>Moinakg&#039;s Ramblings</description>
	<lastBuildDate>Sat, 18 May 2013 12:32:25 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>Comment on Reducing OpenSolaris ramdisk greed by moinakg</title>
		<link>http://moinakg.wordpress.com/2009/01/01/reducing-opensolaris-ramdisk-greed/#comment-2557</link>
		<dc:creator><![CDATA[moinakg]]></dc:creator>
		<pubDate>Sat, 18 May 2013 12:32:25 +0000</pubDate>
		<guid isPermaLink="false">http://moinakg.wordpress.com/?p=207#comment-2557</guid>
		<description><![CDATA[Thanks. I am using one of the predefined WordPress themes. Will check it out.]]></description>
		<content:encoded><![CDATA[<p>Thanks. I am using one of the predefined WordPress themes. Will check it out.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Reducing OpenSolaris ramdisk greed by 83597</title>
		<link>http://moinakg.wordpress.com/2009/01/01/reducing-opensolaris-ramdisk-greed/#comment-2556</link>
		<dc:creator><![CDATA[83597]]></dc:creator>
		<pubDate>Sat, 18 May 2013 12:23:07 +0000</pubDate>
		<guid isPermaLink="false">http://moinakg.wordpress.com/?p=207#comment-2556</guid>
		<description><![CDATA[Simply wanted to let you know that I found your blog on and while 
I appreciated checking out your post, it appears your blog 
acts up in a couple browsers. If I use Firefox, it comes up okay, but 
if I use Chrome, it comes up appearing overlapped and off-kilter.

Just so you know.]]></description>
		<content:encoded><![CDATA[<p>Simply wanted to let you know that I found your blog on and while<br />
I appreciated checking out your post, it appears your blog<br />
acts up in a couple browsers. If I use Firefox, it comes up okay, but<br />
if I use Chrome, it comes up appearing overlapped and off-kilter.</p>
<p>Just so you know.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Hadoop &#8211; too much hype by moinakg</title>
		<link>http://moinakg.wordpress.com/2013/02/18/hadoop-too-much-hype/#comment-2543</link>
		<dc:creator><![CDATA[moinakg]]></dc:creator>
		<pubDate>Sun, 12 May 2013 17:50:46 +0000</pubDate>
		<guid isPermaLink="false">http://moinakg.wordpress.com/?p=1203#comment-2543</guid>
		<description><![CDATA[Yes, I agree that the HotSpot JIT does a darn good job, though it can require tweaking via the code cache size parameter. However, my big beef is with the GC piece. I see significant effort being directed at GC-related issues in most Java software R&amp;D, leading eventually to fancy off-heap storage for big datasets, which looks like new/delete in disguise. I have the odd wish of seeing a Java variant with explicit memory management and no GC at all. The VM could be made a lot more lightweight in that case.]]></description>
		<content:encoded><![CDATA[<p>Yes, I agree that the HotSpot JIT does a darn good job, though it can require tweaking via the code cache size parameter. However, my big beef is with the GC piece. I see significant effort being directed at GC-related issues in most Java software R&amp;D, leading eventually to fancy off-heap storage for big datasets, which looks like new/delete in disguise. I have the odd wish of seeing a Java variant with explicit memory management and no GC at all. The VM could be made a lot more lightweight in that case.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Hadoop &#8211; too much hype by Jean-Francois Im (@jeanfrancoisim)</title>
		<link>http://moinakg.wordpress.com/2013/02/18/hadoop-too-much-hype/#comment-2542</link>
		<dc:creator><![CDATA[Jean-Francois Im (@jeanfrancoisim)]]></dc:creator>
		<pubDate>Sun, 12 May 2013 16:44:00 +0000</pubDate>
		<guid isPermaLink="false">http://moinakg.wordpress.com/?p=1203#comment-2542</guid>
		<description><![CDATA[It&#039;s certainly possible to write relatively high-performance code on the JVM; just because there are a crapton of poor Java developers doesn&#039;t mean that the technology itself sucks. It&#039;s not perfect (I don&#039;t think it does SIMD ops yet on x86/x64), but it certainly does allow one to write reasonably fast code in a fraction of the time it would take to do so in C/C++ or assembly (for example, I added clustering and multi-node distributed processing to an app in a short afternoon on the JVM, which would&#039;ve been a major pain in C++, even with 0MQ).

As we both agree, though, people who don&#039;t understand how the computer works underneath and all the layers in between will write crap code that&#039;s suboptimal.]]></description>
		<content:encoded><![CDATA[<p>It&#8217;s certainly possible to write relatively high-performance code on the JVM; just because there are a crapton of poor Java developers doesn&#8217;t mean that the technology itself sucks. It&#8217;s not perfect (I don&#8217;t think it does SIMD ops yet on x86/x64), but it certainly does allow one to write reasonably fast code in a fraction of the time it would take to do so in C/C++ or assembly (for example, I added clustering and multi-node distributed processing to an app in a short afternoon on the JVM, which would&#8217;ve been a major pain in C++, even with 0MQ).</p>
<p>As we both agree, though, people who don&#8217;t understand how the computer works underneath and all the layers in between will write crap code that&#8217;s suboptimal.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Pcompress 2.1 released with fixes and performance enhancements by moinakg</title>
		<link>http://moinakg.wordpress.com/2013/05/09/pcompress-2-1-released-with-fixes-and-performance-enhancements/#comment-2541</link>
		<dc:creator><![CDATA[moinakg]]></dc:creator>
		<pubDate>Sun, 12 May 2013 06:09:43 +0000</pubDate>
		<guid isPermaLink="false">http://moinakg.wordpress.com/?p=1361#comment-2541</guid>
		<description><![CDATA[Thanks for letting me know. This looks like a bug in an invalid input handling routine! &#039;Adapt2&#039; mode uses libbsc which can take a max segment size of 1024m. Obviously &#039;-s 1280m&#039; exceeds that. But then the &#039;Floating point exception&#039; looks like a simple division by zero bug while printing the statistics.]]></description>
		<content:encoded><![CDATA[<p>Thanks for letting me know. This looks like a bug in an invalid input handling routine! &#8216;Adapt2&#8217; mode uses libbsc which can take a max segment size of 1024m. Obviously &#8216;-s 1280m&#8217; exceeds that. But then the &#8216;Floating point exception&#8217; looks like a simple division by zero bug while printing the statistics.</p>
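<p>If the statistics code divides by the chunk count, a guard along these lines would avoid the crash when no chunk was processed. This is only a minimal sketch: safe_div_u64 and print_stats are hypothetical names, not functions from the Pcompress source.</p>

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not taken from the Pcompress source: divide only
   when the divisor is non-zero. On x86 Linux an integer division by zero
   raises SIGFPE, which the shell reports as "Floating point exception". */
static uint64_t safe_div_u64(uint64_t num, uint64_t den)
{
    return den == 0 ? 0 : num / den;
}

/* Hypothetical statistics printer: with zero chunks it prints zeros
   instead of dividing by the chunk count. */
static void print_stats(uint64_t total_bytes, uint64_t total_chunks)
{
    printf("Total chunks       : %llu\n", (unsigned long long)total_chunks);
    printf("Average chunk size : %llu B\n",
           (unsigned long long)safe_div_u64(total_bytes, total_chunks));
}
```

<p>Calling print_stats(0, 0) in this sketch prints zeros rather than faulting, which is the behaviour one would want after a failed compression run.</p>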
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Pcompress 2.1 released with fixes and performance enhancements by Slowpoke</title>
		<link>http://moinakg.wordpress.com/2013/05/09/pcompress-2-1-released-with-fixes-and-performance-enhancements/#comment-2540</link>
		<dc:creator><![CDATA[Slowpoke]]></dc:creator>
		<pubDate>Sun, 12 May 2013 01:11:36 +0000</pubDate>
		<guid isPermaLink="false">http://moinakg.wordpress.com/?p=1361#comment-2540</guid>
		<description><![CDATA[Thanks for this nice and useful program!
Did a quick test on Ubuntu 12.10, pcompress crashed with these switches :( If need be you can get the archive here: http://sourceforge.net/projects/mingw-w64/files/Toolchains%20targetting%20Win64/Automated%20Builds/

% pcompress -E -D -L -P -B 1 -M -C -c adapt2 -l 14 -s 1280m mingw-w64-bin_x86_64-linux_20130505.tar 
Scaling to 1 thread
Max allowed chunk size for LIBBSC is: 1073741824 
Error compressing file: mingw-w64-bin_x86_64-linux_20130505.tar

Compression Statistics
======================
Total chunks           : 0
Best compressed chunk  : 0 B(0.00%)
Worst compressed chunk : 0 B(0.00%)
Floating point exception (core dumped)
 

(Linux 3.5.0-28-generic #48-Ubuntu SMP Tue Apr 23 23:03:38 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux, gcc (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2, Core-i5/16GB RAM)]]></description>
		<content:encoded><![CDATA[<p>Thanks for this nice and useful program!<br />
Did a quick test on Ubuntu 12.10, pcompress crashed with these switches <img src='http://s0.wp.com/wp-includes/images/smilies/icon_sad.gif' alt=':(' class='wp-smiley' />  If need be you can get the archive here: <a href="http://sourceforge.net/projects/mingw-w64/files/Toolchains%20targetting%20Win64/Automated%20Builds/" rel="nofollow">http://sourceforge.net/projects/mingw-w64/files/Toolchains%20targetting%20Win64/Automated%20Builds/</a></p>
<p>% pcompress -E -D -L -P -B 1 -M -C -c adapt2 -l 14 -s 1280m mingw-w64-bin_x86_64-linux_20130505.tar<br />
Scaling to 1 thread<br />
Max allowed chunk size for LIBBSC is: 1073741824<br />
Error compressing file: mingw-w64-bin_x86_64-linux_20130505.tar</p>
<p>Compression Statistics<br />
======================<br />
Total chunks           : 0<br />
Best compressed chunk  : 0 B(0.00%)<br />
Worst compressed chunk : 0 B(0.00%)<br />
Floating point exception (core dumped)</p>
<p>(Linux 3.5.0-28-generic #48-Ubuntu SMP Tue Apr 23 23:03:38 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux, gcc (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2, Core-i5/16GB RAM)</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on About by moinakg</title>
		<link>http://moinakg.wordpress.com/about/#comment-2537</link>
		<dc:creator><![CDATA[moinakg]]></dc:creator>
		<pubDate>Tue, 07 May 2013 17:49:37 +0000</pubDate>
		<guid isPermaLink="false">#comment-2537</guid>
		<description><![CDATA[Thanks. Though Pcompress is far from being an FS, a few of its core ideas can be leveraged in existing filesystems like Btrfs. For example, the scalable segmented deduplication idea can be leveraged in the volume layer for offline dedupe. The variable-block sliding-window chunking I am using has performance close to fixed-block chunking, so it can be used as well, among other things.
So one is welcome to leverage stuff for Btrfs if it helps. I could help anyone willing to do that. I have intentions to evolve Pcompress into a deduplicated archival/object storage appliance with the features sitting within the lower level block I/O layer with some existing filesystem on the top.]]></description>
		<content:encoded><![CDATA[<p>Thanks. Though Pcompress is far from being an FS, a few of its core ideas can be leveraged in existing filesystems like Btrfs. For example, the scalable segmented deduplication idea can be leveraged in the volume layer for offline dedupe. The variable-block sliding-window chunking I am using has performance close to fixed-block chunking, so it can be used as well, among other things.<br />
So one is welcome to leverage stuff for Btrfs if it helps. I could help anyone willing to do that. I have intentions to evolve Pcompress into a deduplicated archival/object storage appliance with the features sitting within the lower level block I/O layer with some existing filesystem on the top.</p>
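<p>For readers unfamiliar with the idea, sliding-window (content-defined) chunking can be sketched roughly as below. This is not the Pcompress algorithm: the rolling sum, the window size, the mask and the size limits are all assumptions chosen to keep the example short.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* All constants are illustrative assumptions, not Pcompress values. */
#define WINDOW    16u    /* bytes covered by the rolling sum            */
#define MASK      0x3Fu  /* cut when the low 6 bits of the sum are set  */
#define MIN_CHUNK 32u    /* never cut before this many bytes            */
#define MAX_CHUNK 256u   /* always cut by this many bytes               */

/* Return the length of the next chunk starting at data[0].
   A rolling sum over the last WINDOW bytes is updated per byte, and a
   chunk boundary is declared where the sum matches the mask, so the
   boundary depends on nearby content rather than on a fixed offset. */
static size_t next_chunk_len(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    if (len <= MIN_CHUNK)
        return len;

    for (i = 0; i < len && i < MAX_CHUNK; i++) {
        sum += data[i];                 /* byte entering the window */
        if (i >= WINDOW)
            sum -= data[i - WINDOW];    /* byte leaving the window  */
        if (i + 1 >= MIN_CHUNK && (sum & MASK) == MASK)
            return i + 1;               /* content-defined cut      */
    }
    return i;                           /* forced cut or end of data */
}
```

<p>Because a cut point depends only on the bytes inside the window, inserting data early in a stream tends to move later boundaries along with the content instead of re-cutting every following chunk, which is what makes this style of chunking attractive for dedupe.</p>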
]]></content:encoded>
	</item>
	<item>
		<title>Comment on About by Bgm0</title>
		<link>http://moinakg.wordpress.com/about/#comment-2536</link>
		<dc:creator><![CDATA[Bgm0]]></dc:creator>
		<pubDate>Tue, 07 May 2013 16:52:53 +0000</pubDate>
		<guid isPermaLink="false">#comment-2536</guid>
		<description><![CDATA[You know, Pcompress is so awesome that it could become an FS. Btrfs is trying to include features present in Pcompress. How about a merge of the two?]]></description>
		<content:encoded><![CDATA[<p>You know, Pcompress is so awesome that it could become an FS. Btrfs is trying to include features present in Pcompress. How about a merge of the two?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Vectorizing xxHash for Fun and Profit by sanmayce</title>
		<link>http://moinakg.wordpress.com/2013/01/19/vectorizing-xxhash-for-fun-and-profit/#comment-2512</link>
		<dc:creator><![CDATA[sanmayce]]></dc:creator>
		<pubDate>Tue, 23 Apr 2013 18:10:05 +0000</pubDate>
		<guid isPermaLink="false">http://moinakg.wordpress.com/?p=990#comment-2512</guid>
		<description><![CDATA[Hi again moinakg,
I am happy to share my first XMM attempt: FNV1A_YoshimitsuTRIADiiXMM:

I wanted to see what a SIMD main loop would look like:
One of the main goals: to stress the 128-bit registers only and nothing else. The source uses 6 in total for now; the Intel compiler in fact uses all 8.
Current approach: instead of rotating by 5 bits within each DWORD of the quadruplet, I chose to do it across the entire DQWORD, i.e. XMMWORD.

// #define xmmloadu(p) _mm_loadu_si128((__m128i const*)(p))
// #define _rotl_KAZE128(x, n) _mm_or_si128(_mm_slli_si128(x, n) , _mm_srli_si128(x, 128-n))
// uint32_t FNV1A_Hash_YoshimitsuTRIADiiXMM(const char *str, uint32_t wrdlen)
// {
//	const char *p = str;
// ...
// if (wrdlen &gt;= 4*24) {  // Actually 4*24 is the minimum and not useful, 200++ makes more sense.
// 	Loop_Counter = (wrdlen/(4*24));
//	Loop_Counter++;
//	Second_Line_Offset = wrdlen-(Loop_Counter)*(4*3*4);
//	for(; Loop_Counter; Loop_Counter--, p += 4*3*sizeof(uint32_t)) {
// 		xmm0 = xmmloadu(p+0*16);
// 		xmm1 = xmmloadu(p+0*16+Second_Line_Offset);
// 		xmm2 = xmmloadu(p+1*16);
// 		xmm3 = xmmloadu(p+1*16+Second_Line_Offset);
// 		xmm4 = xmmloadu(p+2*16);
// 		xmm5 = xmmloadu(p+2*16+Second_Line_Offset);
// 		hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , _mm_xor_si128(_rotl_KAZE128(xmm0,5) , xmm1)) , PRIMExmm);       
// 		hash32Bxmm = _mm_mullo_epi32(_mm_xor_si128(hash32Bxmm , _mm_xor_si128(_rotl_KAZE128(xmm3,5) , xmm2)) , PRIMExmm);        
// 		hash32Cxmm = _mm_mullo_epi32(_mm_xor_si128(hash32Cxmm , _mm_xor_si128(_rotl_KAZE128(xmm4,5) , xmm5)) , PRIMExmm);      
//	}
// 	// The simplest mumbo-jumbo mix:
// 	hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , _rotl_KAZE128(hash32Bxmm,5)) , PRIMExmm);       
// 	hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , _rotl_KAZE128(hash32Cxmm,5)) , PRIMExmm);       
//	hash32 = (hash32 ^ _rotl_KAZE(hash32xmm.m128i_u32[0],5) ) * PRIME;
//	hash32 = (hash32 ^ _rotl_KAZE(hash32xmm.m128i_u32[1],5) ) * PRIME;
//	hash32 = (hash32 ^ _rotl_KAZE(hash32xmm.m128i_u32[2],5) ) * PRIME;
//	hash32 = (hash32 ^ _rotl_KAZE(hash32xmm.m128i_u32[3],5) ) * PRIME;
//	return hash32 ^ (hash32 &gt;&gt; 16);
// } else if (wrdlen &gt;= 24) {
// ...
// }
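
A side note on _rotl_KAZE128 as posted: _mm_slli_si128 and _mm_srli_si128 shift by bytes, not bits, and a byte count above 15 clears the register. With n = 5 the macro therefore behaves as a plain 5-byte left shift (the psrldq by 123 in the listing contributes zero), not as a 5-bit rotate of the XMMWORD. The sketch below is a portable C model of that difference; the u128 struct and all function names are assumptions made for illustration and are not part of the original hasher.

```c
#include <stdint.h>

/* Hypothetical portable model of one XMM register: two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128;

/* Models pslldq: shift the 128-bit value left by whole bytes,
   zero-filling. Counts above 15 clear the register. */
static u128 shl_bytes128(u128 x, unsigned bytes)
{
    u128 r = {0, 0};
    unsigned bits = bytes * 8;
    if (bytes == 0) return x;
    if (bytes > 15) return r;
    if (bits < 64) {
        r.hi = (x.hi << bits) | (x.lo >> (64 - bits));
        r.lo = x.lo << bits;
    } else {
        r.hi = x.lo << (bits - 64);
    }
    return r;
}

/* Models psrldq: shift right by whole bytes, zero-filling. */
static u128 shr_bytes128(u128 x, unsigned bytes)
{
    u128 r = {0, 0};
    unsigned bits = bytes * 8;
    if (bytes == 0) return x;
    if (bytes > 15) return r;
    if (bits < 64) {
        r.lo = (x.lo >> bits) | (x.hi << (64 - bits));
        r.hi = x.hi >> bits;
    } else {
        r.lo = x.hi >> (bits - 64);
    }
    return r;
}

/* The macro as posted: byte shifts by n and by 128-n, OR-ed together. */
static u128 rotl_as_posted(u128 x, unsigned n)
{
    u128 a = shl_bytes128(x, n);
    u128 b = shr_bytes128(x, 128 - n);
    u128 r = { a.lo | b.lo, a.hi | b.hi };
    return r;
}

/* A true rotate left by n bits across all 128 bits, for 0 < n < 64. */
static u128 rotl128(u128 x, unsigned n)
{
    u128 r;
    r.hi = (x.hi << n) | (x.lo >> (64 - n));
    r.lo = (x.lo << n) | (x.hi >> (64 - n));
    return r;
}
```

Under this model, rotl_as_posted applied to a value with only bit 0 set moves that bit to position 40, while rotl128 moves it to position 5; a value with only bit 127 set becomes zero in the first case and wraps around to bit 4 in the second.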

/*
; mark_description &quot;Intel(R) C++ Compiler XE for applications running on IA-32, Version 12.1.1.258 Build 20111011&quot;;
; mark_description &quot;-Ox -TcHASH_linearspeed_FURY.c -FaHASH_linearspeed_FURY_Intel_IA-32_12 -FA&quot;;

.B4.4:
        lea       edi, DWORD PTR [esi+esi*2]                    
        inc       esi                                           
        shl       edi, 4                                        
        cmp       esi, edx                                      
        movdqu    xmm7, XMMWORD PTR [ecx+edi]                   
        movdqu    xmm6, XMMWORD PTR [16+ebx+edi]                
        movdqu    xmm5, XMMWORD PTR [32+ecx+edi]                
        movdqa    xmm1, xmm7                                    
        pslldq    xmm1, 5                                       
        psrldq    xmm7, 123                                     
        por       xmm1, xmm7                                    
        movdqu    xmm7, XMMWORD PTR [ebx+edi]                   
        pxor      xmm1, xmm7                                    
        pxor      xmm2, xmm1                                    
        movdqa    xmm1, xmm6                                    
        pslldq    xmm1, 5                                       
        psrldq    xmm6, 123                                     
        por       xmm1, xmm6                                    
        movdqu    xmm6, XMMWORD PTR [16+ecx+edi]                
        pxor      xmm1, xmm6                                    
        movdqa    xmm6, xmm5                                    
        pslldq    xmm6, 5                                       
        pxor      xmm3, xmm1                                    
        psrldq    xmm5, 123                                     
        por       xmm6, xmm5                                    
        movdqu    xmm5, XMMWORD PTR [32+ebx+edi]                
        pxor      xmm6, xmm5                                    
        pxor      xmm4, xmm6                                    
        pmulld    xmm2, xmm0                                    
        pmulld    xmm3, xmm0                                    
        pmulld    xmm4, xmm0                                    
        jb        .B4.4
*/


I wrote YoYo r.1+, my [CR]LF lines hasher, to see what dispersion the function has. In the dump below it hashed 1,048,576 Knight Tours into a 20-bit hash table:

YoYo - [CR]LF lines hasher, r.1+ copyleft Kaze.
Note1: Incoming textual file can exceed 4GB and lines can be up to 1048576 chars long.
Note2: FNV1A_YoshimitsuTRIADiiXMM needs SSE4.1, so if not present YoYo will crash.
Polynomial(s) used:
CRC32C2_8slice: 0x8F6E37A0 
HashSizeInBits = 20
Allocating KEY memory 1024KB ... OK
Allocating HASH memory 4MB ... OK
Allocating HASH memory 4MB ... OK
Allocating HASH memory 4MB ... OK
Hashing all the LF ending lines encountered in 136,314,880 bytes long file ...
Keys vs Slots ratio: 1:1 or 1,048,576:1,048,576
FNV1A_YoshimitsuTRIADiiXMM : Keys = 00,000,000,000,001,048,576; 000,000,004 x MAXcollisionsAtSomeSlots = 0,000,000,010; HASHfreeSLOTS = 0,000,413,289; HashUtilization = 060%; Collisions = 0,000,413,289
FNV1A_YoshimitsuTRIADii    : Keys = 00,000,000,000,001,048,576; 000,000,002 x MAXcollisionsAtSomeSlots = 0,000,000,009; HASHfreeSLOTS = 0,000,385,367; HashUtilization = 063%; Collisions = 0,000,385,367
CRC32C2_8slice             : Keys = 00,000,000,000,001,048,576; 000,000,001 x MAXcollisionsAtSomeSlots = 0,000,000,009; HASHfreeSLOTS = 0,000,385,451; HashUtilization = 063%; Collisions = 0,000,385,451
Physical Lines: 1,048,576
Shortest Line : 128
Longest Line  : 128

Obviously the function needs some tuning. Sadly, my T7500 CPU supports only up to SSSE3, so I cannot play with it there (a Q9550S was used for the dump above).
In my view FNV1A_YoshimitsuTRIADiiXMM is the fastest hasher in terms of bandwidth; I expect it to be dethroned only by the upcoming FNV1A_JAMIROQUAI (aka FNV1A_YoshimitsuTRIADiiYMM).
http://en.wikipedia.org/wiki/File:Jamiroquai_2.jpg

I won&#039;t disturb you until I buy an AVX machine and write &#039;JAMIROQUAI&#039;.
Regards]]></description>
		<content:encoded><![CDATA[<p>Hi again moinakg,<br />
I am happy to share my first XMM attempt: FNV1A_YoshimitsuTRIADiiXMM:</p>
<p>I wanted to see what a SIMD main loop would look like:<br />
One of the main goals: to stress the 128-bit registers only and nothing else. The source uses 6 in total for now; the Intel compiler in fact uses all 8.<br />
Current approach: instead of rotating by 5 bits within each DWORD of the quadruplet, I chose to do it across the entire DQWORD, i.e. XMMWORD.</p>
<p>// #define xmmloadu(p) _mm_loadu_si128((__m128i const*)(p))<br />
// #define _rotl_KAZE128(x, n) _mm_or_si128(_mm_slli_si128(x, n) , _mm_srli_si128(x, 128-n))<br />
// uint32_t FNV1A_Hash_YoshimitsuTRIADiiXMM(const char *str, uint32_t wrdlen)<br />
// {<br />
//	const char *p = str;<br />
// &#8230;<br />
// if (wrdlen &gt;= 4*24) {  // Actually 4*24 is the minimum and not useful, 200++ makes more sense.<br />
// 	Loop_Counter = (wrdlen/(4*24));<br />
//	Loop_Counter++;<br />
//	Second_Line_Offset = wrdlen-(Loop_Counter)*(4*3*4);<br />
//	for(; Loop_Counter; Loop_Counter--, p += 4*3*sizeof(uint32_t)) {<br />
// 		xmm0 = xmmloadu(p+0*16);<br />
// 		xmm1 = xmmloadu(p+0*16+Second_Line_Offset);<br />
// 		xmm2 = xmmloadu(p+1*16);<br />
// 		xmm3 = xmmloadu(p+1*16+Second_Line_Offset);<br />
// 		xmm4 = xmmloadu(p+2*16);<br />
// 		xmm5 = xmmloadu(p+2*16+Second_Line_Offset);<br />
// 		hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , _mm_xor_si128(_rotl_KAZE128(xmm0,5) , xmm1)) , PRIMExmm);<br />
// 		hash32Bxmm = _mm_mullo_epi32(_mm_xor_si128(hash32Bxmm , _mm_xor_si128(_rotl_KAZE128(xmm3,5) , xmm2)) , PRIMExmm);<br />
// 		hash32Cxmm = _mm_mullo_epi32(_mm_xor_si128(hash32Cxmm , _mm_xor_si128(_rotl_KAZE128(xmm4,5) , xmm5)) , PRIMExmm);<br />
//	}<br />
// 	// The simplest mumbo-jumbo mix:<br />
// 	hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , _rotl_KAZE128(hash32Bxmm,5)) , PRIMExmm);<br />
// 	hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , _rotl_KAZE128(hash32Cxmm,5)) , PRIMExmm);<br />
//	hash32 = (hash32 ^ _rotl_KAZE(hash32xmm.m128i_u32[0],5) ) * PRIME;<br />
//	hash32 = (hash32 ^ _rotl_KAZE(hash32xmm.m128i_u32[1],5) ) * PRIME;<br />
//	hash32 = (hash32 ^ _rotl_KAZE(hash32xmm.m128i_u32[2],5) ) * PRIME;<br />
//	hash32 = (hash32 ^ _rotl_KAZE(hash32xmm.m128i_u32[3],5) ) * PRIME;<br />
//	return hash32 ^ (hash32 &gt;&gt; 16);<br />
// } else if (wrdlen &gt;= 24) {<br />
// &#8230;<br />
// }</p>
<p>/*<br />
; mark_description &#8220;Intel(R) C++ Compiler XE for applications running on IA-32, Version 12.1.1.258 Build 20111011&#8221;;<br />
; mark_description &#8220;-Ox -TcHASH_linearspeed_FURY.c -FaHASH_linearspeed_FURY_Intel_IA-32_12 -FA&#8221;;</p>
<p>.B4.4:<br />
        lea       edi, DWORD PTR [esi+esi*2]<br />
        inc       esi<br />
        shl       edi, 4<br />
        cmp       esi, edx<br />
        movdqu    xmm7, XMMWORD PTR [ecx+edi]<br />
        movdqu    xmm6, XMMWORD PTR [16+ebx+edi]<br />
        movdqu    xmm5, XMMWORD PTR [32+ecx+edi]<br />
        movdqa    xmm1, xmm7<br />
        pslldq    xmm1, 5<br />
        psrldq    xmm7, 123<br />
        por       xmm1, xmm7<br />
        movdqu    xmm7, XMMWORD PTR [ebx+edi]<br />
        pxor      xmm1, xmm7<br />
        pxor      xmm2, xmm1<br />
        movdqa    xmm1, xmm6<br />
        pslldq    xmm1, 5<br />
        psrldq    xmm6, 123<br />
        por       xmm1, xmm6<br />
        movdqu    xmm6, XMMWORD PTR [16+ecx+edi]<br />
        pxor      xmm1, xmm6<br />
        movdqa    xmm6, xmm5<br />
        pslldq    xmm6, 5<br />
        pxor      xmm3, xmm1<br />
        psrldq    xmm5, 123<br />
        por       xmm6, xmm5<br />
        movdqu    xmm5, XMMWORD PTR [32+ebx+edi]<br />
        pxor      xmm6, xmm5<br />
        pxor      xmm4, xmm6<br />
        pmulld    xmm2, xmm0<br />
        pmulld    xmm3, xmm0<br />
        pmulld    xmm4, xmm0<br />
        jb        .B4.4<br />
*/</p>
<p>I wrote YoYo r.1+, my [CR]LF lines hasher, to see what dispersion the function has. In the dump below it hashed 1,048,576 Knight Tours into a 20-bit hash table:</p>
<p>YoYo &#8211; [CR]LF lines hasher, r.1+ copyleft Kaze.<br />
Note1: Incoming textual file can exceed 4GB and lines can be up to 1048576 chars long.<br />
Note2: FNV1A_YoshimitsuTRIADiiXMM needs SSE4.1, so if not present YoYo will crash.<br />
Polynomial(s) used:<br />
CRC32C2_8slice: 0x8F6E37A0<br />
HashSizeInBits = 20<br />
Allocating KEY memory 1024KB &#8230; OK<br />
Allocating HASH memory 4MB &#8230; OK<br />
Allocating HASH memory 4MB &#8230; OK<br />
Allocating HASH memory 4MB &#8230; OK<br />
Hashing all the LF ending lines encountered in 136,314,880 bytes long file &#8230;<br />
Keys vs Slots ratio: 1:1 or 1,048,576:1,048,576<br />
FNV1A_YoshimitsuTRIADiiXMM : Keys = 00,000,000,000,001,048,576; 000,000,004 x MAXcollisionsAtSomeSlots = 0,000,000,010; HASHfreeSLOTS = 0,000,413,289; HashUtilization = 060%; Collisions = 0,000,413,289<br />
FNV1A_YoshimitsuTRIADii    : Keys = 00,000,000,000,001,048,576; 000,000,002 x MAXcollisionsAtSomeSlots = 0,000,000,009; HASHfreeSLOTS = 0,000,385,367; HashUtilization = 063%; Collisions = 0,000,385,367<br />
CRC32C2_8slice             : Keys = 00,000,000,000,001,048,576; 000,000,001 x MAXcollisionsAtSomeSlots = 0,000,000,009; HASHfreeSLOTS = 0,000,385,451; HashUtilization = 063%; Collisions = 0,000,385,451<br />
Physical Lines: 1,048,576<br />
Shortest Line : 128<br />
Longest Line  : 128</p>
<p>Obviously the function needs some tuning. Sadly, my T7500 CPU supports only up to SSSE3, so I cannot play with it there (a Q9550S was used for the dump above).<br />
In my view FNV1A_YoshimitsuTRIADiiXMM is the fastest hasher in terms of bandwidth; I expect it to be dethroned only by the upcoming FNV1A_JAMIROQUAI (aka FNV1A_YoshimitsuTRIADiiYMM).<br />
<a href="http://en.wikipedia.org/wiki/File:Jamiroquai_2.jpg" rel="nofollow">http://en.wikipedia.org/wiki/File:Jamiroquai_2.jpg</a></p>
<p>I won&#8217;t disturb you until I buy an AVX machine and write &#8216;JAMIROQUAI&#8217;.<br />
Regards</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Vectorizing xxHash for Fun and Profit by moinakg</title>
		<link>http://moinakg.wordpress.com/2013/01/19/vectorizing-xxhash-for-fun-and-profit/#comment-2502</link>
		<dc:creator><![CDATA[moinakg]]></dc:creator>
		<pubDate>Thu, 18 Apr 2013 14:27:36 +0000</pubDate>
		<guid isPermaLink="false">http://moinakg.wordpress.com/?p=990#comment-2502</guid>
		<description><![CDATA[That is an awesome result. In the current xxHash code there are only two levels of interleaving, which leaves enough XMM registers for two more levels and should improve speed further. However, there was also the question of retaining hash value compatibility with non-XMM code. Having more levels of interleaving will actually slow down the non-XMM loop on CPUs that do not have SSE 4.1, as the number of intermediate variables will overflow the available registers.]]></description>
		<content:encoded><![CDATA[<p>That is an awesome result. In the current xxHash code there are only two levels of interleaving, which leaves enough XMM registers for two more levels and should improve speed further. However, there was also the question of retaining hash value compatibility with non-XMM code. Having more levels of interleaving will actually slow down the non-XMM loop on CPUs that do not have SSE 4.1, as the number of intermediate variables will overflow the available registers.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
