Vectorizing xxHash for Fun and Profit

Update: Thanks to Yann Collet’s comment I have fixed the mistaken reference to Hyperthreading; what I was actually referring to is Instruction Level Parallelism via superscalar processing.

I have been doing various performance optimizations in Pcompress over the last few days, and the xxHash piece caught my eye. This non-cryptographic hash is one of the fastest 32-bit hashes available today. I use it in Pcompress to hash data chunks for deduplication; the hash value is used to insert the chunk-id into a hashtable for lookups. xxHash is already extremely fast, running at 5.4 GB/s on a fast processor, but its inner loop looked very interesting.

 do
 {
     v1 += XXH_LE32(p) * PRIME32_2; v1 = XXH_rotl32(v1, 13); v1 *= PRIME32_1; p+=4;
     v2 += XXH_LE32(p) * PRIME32_2; v2 = XXH_rotl32(v2, 13); v2 *= PRIME32_1; p+=4;
     v3 += XXH_LE32(p) * PRIME32_2; v3 = XXH_rotl32(v3, 13); v3 *= PRIME32_1; p+=4;
     v4 += XXH_LE32(p) * PRIME32_2; v4 = XXH_rotl32(v4, 13); v4 *= PRIME32_1; p+=4;
 } while (p<=limit);

 h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) + XXH_rotl32(v4, 18);

There are 4 independent compute streams consuming data into 4 separate accumulators, which are later combined into a single hash value. These 4 independent but identical compute streams look like a perfect fit for vectorization using the SSE instructions on modern x86 processors. An SSE register can hold multiple packed integers or floats, and a single operation, for example an addition, applied to two SSE registers affects all the packed values in parallel. The image below shows this behavior diagrammatically when the SSE registers are populated with four 32-bit quantities.

[Figure: a single SSE operation (here an addition) applied to two registers of four packed 32-bit integers, producing all four results in parallel]

An SSE register is 128 bits wide and can hold exactly four packed 32-bit integers. I had to use at least SSE 4.1 to get the parallel multiply of packed 32-bit values. The rotate (XXH_rotl32) is easily done with two bitwise shifts and a bitwise OR, as in the original xxHash code. Instead of dealing with inline assembly we can fairly easily use the compiler's SSE intrinsics. These appear to the programmer as functions but in fact result in specific CPU instructions being emitted. Intrinsics also let the compiler handle optimizations like instruction scheduling; inline assembly leaves all of that burden to the programmer, which is often tricky to get right except for short segments. In addition, some intrinsics are composite: they emit a small collection of CPU instructions encapsulating a commonly used operation (like _mm_set_epi32(…)).
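To make the packed-operation idea concrete, here is a tiny standalone example of my own (it is not part of xxHash): two sets of four 32-bit integers are added with a single intrinsic. Note that _mm_set_epi32 takes its arguments in high-to-low lane order.

 #include <emmintrin.h> /* SSE2 is enough for packed 32-bit addition */
 #include <stdio.h>

 int main(void)
 {
     /* Pack four 32-bit integers into each 128-bit register. */
     __m128i a = _mm_set_epi32(4, 3, 2, 1);
     __m128i b = _mm_set_epi32(40, 30, 20, 10);
     /* One instruction performs four independent 32-bit additions. */
     __m128i sum = _mm_add_epi32(a, b);

     int out[4];
     _mm_storeu_si128((__m128i *)out, sum);
     printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 11 22 33 44 */
     return 0;
 }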

You can get a good overview of SSE programming from the CodeProject article on the topic. My initial effort led to the code shown below.

#include <smmintrin.h>
static inline __m128i _x_mm_rotl_epi32(const __m128i a, int bits)
{
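    /* Rotate each packed 32-bit lane left by 'bits'. SSE has no packed
     * rotate instruction, so it is built from two shifts and an OR. */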
    __m128i tmp1 = _mm_slli_epi32(a, bits);
    __m128i tmp2 = _mm_srli_epi32(a, 32 - bits);
    return (_mm_or_si128(tmp1, tmp2));
}
...
...
 __m128i accum = _mm_set_epi32(v4, v3, v2, v1);
 __m128i prime1 = _mm_set1_epi32(PRIME32_1);
 __m128i prime2 = _mm_set1_epi32(PRIME32_2);
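 /* Each iteration loads 16 unaligned bytes and updates all four lanes at
  * once: multiply the loaded words by PRIME32_2, add to the accumulators,
  * rotate left by 13, then multiply by PRIME32_1, the same steps as the
  * scalar loop. */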
 do {
     __m128i mem = _mm_loadu_si128((__m128i *)p);
     mem = _mm_mullo_epi32(mem, prime2);
     accum = _mm_add_epi32(accum, mem);
     accum = _x_mm_rotl_epi32(accum, 13);
     accum = _mm_mullo_epi32(accum, prime1);
     p += 16;
 } while (p<=limit);
 _mm_storeu_si128((__m128i *)vx, accum);
 v1 = vx[0]; v2 = vx[1];
 v3 = vx[2]; v4 = vx[3];
 h32 = XXH_rotl32(v1,1) + XXH_rotl32(v2,7) + XXH_rotl32(v3,12) + XXH_rotl32(v4,18);
...

It worked perfectly. However, there was a big problem: it was slower than before, and by a good margin! Look at the benchmarks below using the standard Silesia test dataset.

Original
========
~/xxhash-read-only $ ./bench silesia.tar
*** Benchmark utility using xxHash algorithm ,by Yann Collet (Jan 19 2013) ***
silesia.tar      :  206612480 ->  3571.1 MB/s   0x61B1160F

Modified
========
~/xxhash-read-only $ ./bench silesia.tar
*** Benchmark utility using xxHash algorithm ,by Yann Collet (Jan 19 2013) ***
silesia.tar      :  206612480 ->  3013.3 MB/s   0x61B1160F

It is about 15% slower after vectorizing. This seemed ridiculous to me. I was expecting at minimum a 10% speedup, and here we have parallelized code that is in fact slower! Notice that my initial code does unaligned loads via “_mm_loadu_si128((__m128i *)p)”. Fearing that to be the cause of the latency, I made some tweaks to process the initial unaligned bytes separately and use aligned access for the rest (roughly as sketched below). The performance difference still remained!
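The aligned-access variant was roughly along these lines (a sketch of the idea, not the exact code): after consuming the leading unaligned bytes separately, p sits on a 16-byte boundary and _mm_load_si128 can replace the unaligned load in the loop above.

 #include <stdint.h> /* uintptr_t */
 ...
 /* Sketch: once the unaligned head has been handled, p is 16-byte
  * aligned and the body of the loop can use aligned loads. */
 if (((uintptr_t)p & 15) == 0) {
     do {
         __m128i mem = _mm_load_si128((const __m128i *)p); /* aligned load */
         mem = _mm_mullo_epi32(mem, prime2);
         accum = _mm_add_epi32(accum, mem);
         accum = _x_mm_rotl_epi32(accum, 13);
         accum = _mm_mullo_epi32(accum, prime1);
         p += 16;
     } while (p <= limit);
 }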

My first thought was that the SSE instructions had higher latency than the ordinary integer ones, so I did some measurements using the perf tool. The results are shown below.

Original
========
1) Annotation from perf record:
 10.42 │ 30:   mov    (%rdi),%r10d                                                                                                                                     
 14.03 │       imul   $0x85ebca77,%r10d,%r10d                                                                                                                          
  0.34 │       add    %r10d,%r8d                                                                                                                                       
  0.21 │       mov    0x4(%rdi),%r10d                                                                                                                                  
 10.53 │       rol    $0xd,%r8d                                                                                                                                        
  0.23 │       imul   $0x9e3779b1,%r8d,%r8d                                                                                                                            
  0.77 │       imul   $0x85ebca77,%r10d,%r10d                                                                                                                          
  0.02 │       add    %r10d,%ecx                                                                                                                                       
 10.54 │       mov    0x8(%rdi),%r10d                                                                                                                                  
  0.06 │       rol    $0xd,%ecx                                                                                                                                        
  0.34 │       imul   $0x9e3779b1,%ecx,%ecx                                                                                                                            
 10.38 │       imul   $0x85ebca77,%r10d,%r10d                                                                                                                          
  0.30 │       add    %r10d,%edx                                                                                                                                       
  0.02 │       mov    0xc(%rdi),%r10d                                                                                                                                  
  0.07 │       add    $0x10,%rdi                                                                                                                                       
 10.07 │       rol    $0xd,%edx                                                                                                                                        
  0.32 │       imul   $0x9e3779b1,%edx,%edx                                                                                                                            
 10.30 │       imul   $0x85ebca77,%r10d,%r10d                                                                                                                          
  0.12 │       add    %r10d,%eax                                                                                                                                       
  0.22 │       rol    $0xd,%eax                                                                                                                                        
  0.15 │       imul   $0x9e3779b1,%eax,%eax                                                                                                                            
 20.52 │       cmp    %rdi,%r11                                                                                                                                        
  0.03 │     ↑ jae    30

2) Performance counters from perf stat:
*** Benchmark utility using xxHash algorithm , by Yann Collet (Jan 19 2013) ***
silesia.tar      :  206612480 ->  3541.4 MB/s   0x61B1160F

 Performance counter stats for './bench silesia.tar':

    6260.562213 task-clock                #    0.996 CPUs utilized
            544 context-switches          #    0.087 K/sec
              4 CPU-migrations            #    0.001 K/sec
         50,624 page-faults               #    0.008 M/sec
 13,278,187,173 cycles                    #    2.121 GHz                     [83.41%]
  3,657,672,543 stalled-cycles-frontend   #   27.55% frontend cycles idle    [83.30%]
  6,834,837,811 stalled-cycles-backend    #   51.47% backend  cycles idle    [66.70%]
 31,374,782,178 instructions              #    2.36  insns per cycle
                                          #    0.22  stalled cycles per insn [83.40%]
  1,384,012,974 branches                  #  221.068 M/sec                   [83.30%]
         95,141 branch-misses             #    0.01% of all branches         [83.30%]

    6.288590138 seconds time elapsed

Modified
========
1) Annotation from perf record:
  9.38 │ 70:┌─→movdqu (%rdi),%xmm0
  9.52 │    │  add    $0x10,%rdi
  0.00 │    │  cmp    %rdi,%r9
  0.00 │    │  pmulld %xmm3,%xmm0
  9.04 │    │  paddd  %xmm1,%xmm0
  0.15 │    │  movdqa %xmm0,%xmm1
  0.04 │    │  psrld  $0x13,%xmm0
  9.19 │    │  pslld  $0xd,%xmm1
  0.27 │    │  por    %xmm0,%xmm1
  8.95 │    │  pmulld %xmm2,%xmm1
 53.44 │    └──jae    70

2) Performance counters from perf stat:
 Performance counter stats for './bench silesia.tar':

    6265.021121 task-clock                #    0.996 CPUs utilized
            546 context-switches          #    0.087 K/sec
              2 CPU-migrations            #    0.000 K/sec
         50,625 page-faults               #    0.008 M/sec
 13,288,539,258 cycles                    #    2.121 GHz                     [83.36%]
  8,562,225,224 stalled-cycles-frontend   #   64.43% frontend cycles idle    [83.31%]
  3,763,596,002 stalled-cycles-backend    #   28.32% backend  cycles idle    [66.72%]
 12,813,511,969 instructions              #    0.96  insns per cycle
                                          #    0.67  stalled cycles per insn [83.41%]
  1,176,939,765 branches                  #  187.859 M/sec                   [83.31%]
        102,390 branch-misses             #    0.01% of all branches         [83.30%]

    6.292536751 seconds time elapsed

There are a few interesting observations here. The perf annotation tells us that there is not much difference in combined instruction latencies between the normal C and the SSE implementations. The performance counter statistics, however, have a story to tell. Look at the stalled-cycles and instructions-per-cycle entries: backend stall cycles decreased, but frontend stall cycles increased about 2.3x. What is more interesting is that the normal C version issued 2.36 instructions per cycle while the SSE version managed only 0.96, well under half of that. There is not much code in the small loop to cause stalls, so the only explanation for the increased frontend stalls is memory read latency, which also causes fewer instructions to be issued.

The big question was why there was so much memory access latency in the SSE version compared to the normal C version. I pondered this for some time while trying a variety of experiments to no avail: hand-optimized assembly, hiding memory latency by hoisting loads, and so on. Then, looking again at the disassembly from the perf annotation of the original code, something looked interesting. The instructions from the 4 independent computation lines are all interleaved; more precisely, instructions from independent computation sequences are interleaved in pairs. Suddenly a bulb lit up: the gotcha is superscalar processing, and the key phrase here is “independent computations”. Intel and AMD x86 CPUs can execute multiple instructions per clock cycle, not necessarily in program order, provided they are independent of each other and use separate execution resources within the core. This is a form of instruction level parallelism that is distinct from vector processing. Within the core some resources are duplicated and some are shared.

I was using a Core i5 from the Nehalem generation, which has 4 decoders in the front-end pipeline and a superscalar out-of-order execution engine that can dispatch up to 6 micro-ops per cycle. The advantage is that if an instruction from one stream stalls waiting on a resource, such as a memory fetch, independent instructions from the other streams can still make progress. In addition, if two operations need different execution units they can run simultaneously.
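To make the ILP point concrete, here is a toy example of my own (it has nothing to do with xxHash): the first sum is one long dependency chain, where every addition waits for the previous result, while the second gives the out-of-order engine four independent chains to overlap. A modern compiler may well unroll such a trivial loop by itself, but it illustrates the dependency-chain idea.

 #include <stddef.h>

 /* One dependency chain: each add must wait for the previous result. */
 unsigned int sum_serial(const unsigned int *a, size_t n)
 {
     unsigned int s = 0;
     for (size_t i = 0; i < n; i++)
         s += a[i];
     return s;
 }

 /* Four independent chains: the core can execute several adds per cycle. */
 unsigned int sum_ilp(const unsigned int *a, size_t n)
 {
     unsigned int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
     size_t i;
     for (i = 0; i + 4 <= n; i += 4) {
         s0 += a[i];
         s1 += a[i + 1];
         s2 += a[i + 2];
         s3 += a[i + 3];
     }
     for (; i < n; i++)
         s0 += a[i];
     return s0 + s1 + s2 + s3;
 }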

[Figure: superscalar out-of-order execution within a single core]

The original xxHash code fully exploits this feature, helped by GCC, which recognizes the capability and interleaves independent instructions so that the superscalar engine can extract maximum instruction level parallelism. In my SSE version I was using a vector unit that handles four 32-bit integers in a single instruction, so there was only a single vectorized instruction stream carrying all 4 formerly independent operations. The whole frontend, in other words the fetch-decode unit, had to stall during a memory fetch.

So the next question was: can we improve this situation? How about multiple independent SIMD streams to better exploit the ILP capability? It turns out this is doable, but it results in a slight change to the hashing approach. I implemented 2 independent SSE streams, each processing 16 bytes per iteration into its own accumulator; there are enough SSE registers available to do this without spilling to memory. The accumulators are then combined at the end into the final hash result. This changes the resulting hash value, since there are now 2 independent streams processing interleaved strides of data, but the xxHash properties are retained due to the way the final combination is done. My new code is shown below.

#include <smmintrin.h>
static inline __m128i _x_mm_rotl_epi32(const __m128i a, int bits)
{
     __m128i tmp1 = _mm_slli_epi32(a, bits);
     __m128i tmp2 = _mm_srli_epi32(a, 32 - bits);
     return (_mm_or_si128(tmp1, tmp2));
}
...
...
 unsigned int vx[4], vx1[4];
 __m128i accum = _mm_set_epi32(v4, v3, v2, v1);
 __m128i accum1 = _mm_set_epi32(v4, v3, v2, v1);
 __m128i prime1 = _mm_set1_epi32(PRIME32_1);
 __m128i prime2 = _mm_set1_epi32(PRIME32_2);

 /*
  * 4-way SIMD calculations with 4 ints in two blocks for two
  * accumulators will interleave to some extent on a superscalar
  * processor(instruction level parallelism) providing 10% - 14%
  * speedup over original xxhash depending on processor. We could
  * have used aligned loads but we actually want the unaligned
  * penalty. It helps to interleave better for a slight benefit
  * over aligned loads here!
  */
do {
     __m128i mem = _mm_loadu_si128((__m128i *)p);
     p += 16;
     mem = _mm_mullo_epi32(mem, prime2);
     accum = _mm_add_epi32(accum, mem);
     accum = _x_mm_rotl_epi32(accum, 13);
     accum = _mm_mullo_epi32(accum, prime1);

     mem = _mm_loadu_si128((__m128i *)p);
     p += 16;
     mem = _mm_mullo_epi32(mem, prime2);
     accum1 = _mm_add_epi32(accum1, mem);
     accum1 = _x_mm_rotl_epi32(accum1, 13);
     accum1 = _mm_mullo_epi32(accum1, prime1);
 } while (p<=limit);

 _mm_storeu_si128((__m128i *)vx, accum);
 _mm_storeu_si128((__m128i *)vx1, accum1);

 /*
  * Combine the two accumulators into a single hash value.
  */
 v1 = vx[0]; v2 = vx[1];
 v3 = vx[2]; v4 = vx[3];
 v1 += vx1[0] * PRIME32_2; v1 = XXH_rotl32(v1, 13); v1 *= PRIME32_1;
 v2 += vx1[1] * PRIME32_2; v2 = XXH_rotl32(v2, 13); v2 *= PRIME32_1;
 v3 += vx1[2] * PRIME32_2; v3 = XXH_rotl32(v3, 13); v3 *= PRIME32_1;
 v4 += vx1[3] * PRIME32_2; v4 = XXH_rotl32(v4, 13); v4 *= PRIME32_1;
 h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) + XXH_rotl32(v4, 18);

Functionally this processes data in interleaved blocks of 16 bytes each. This gives us two independent processing streams, which should leverage the superscalar out-of-order execution capabilities of x86 processors and give a speedup. Diagrammatically this processing can be depicted as below.

[Figure: two SSE streams, accum and accum1, each consuming alternating 16-byte blocks of the input]

Now let's look at the performance metrics of this new approach.

1) xxHash benchmark tool:
~/xxhash-read-only $ ./bench silesia.tar
*** Benchmark utility using xxHash algorithm , by Yann Collet (Jan 19 2013) ***
silesia.tar      :  206612480 ->  4028.9 MB/s   0x140BD4B4

2) Perf annotate output:
  5.95 │ 70:   movdqu (%rdi),%xmm1
  0.16 │       movdqu 0x10(%rdi),%xmm0
 14.91 │       pmulld %xmm5,%xmm1
  5.97 │       paddd  %xmm3,%xmm1
  0.05 │       movdqa %xmm1,%xmm3
  0.14 │       psrld  $0x13,%xmm1
  5.51 │       add    $0x20,%rdi
  0.60 │       pmulld %xmm5,%xmm0
  0.86 │       paddd  %xmm2,%xmm0
  5.32 │       movdqa %xmm0,%xmm2
  0.80 │       pslld  $0xd,%xmm3
  2.57 │       psrld  $0x13,%xmm0
  9.24 │       por    %xmm1,%xmm3
  0.05 │       pslld  $0xd,%xmm2
  5.94 │       por    %xmm0,%xmm2
  5.92 │       cmp    %rdi,%rax
  0.08 │       pmulld %xmm4,%xmm3
 27.12 │       pmulld %xmm4,%xmm2
  8.82 │     ↑ jae    70

3) Perf stat output:
*** Benchmark utility using xxHash algorithm , by Yann Collet (Jan 19 2013) ***
silesia.tar      :  206612480 ->  3997.0 MB/s   0x140BD4B4

 Performance counter stats for './bench silesia.tar':

    6183.451857 task-clock                #    0.995 CPUs utilized
            531 context-switches          #    0.086 K/sec
              7 CPU-migrations            #    0.001 K/sec
         50,625 page-faults               #    0.008 M/sec
 13,114,838,000 cycles                    #    2.121 GHz                     [83.28%]
  7,687,625,582 stalled-cycles-frontend   #   58.62% frontend cycles idle    [83.32%]
  2,157,403,461 stalled-cycles-backend    #   16.45% backend  cycles idle    [66.68%]
 14,508,523,299 instructions              #    1.11  insns per cycle
                                          #    0.53  stalled cycles per insn [83.34%]
    783,565,815 branches                  #  126.720 M/sec                   [83.32%]
        109,017 branch-misses             #    0.01% of all branches         [83.40%]
    6.211583495 seconds time elapsed

As you can see, there is a reduction in the frontend cycles idle parameter and a good increase in instructions per cycle. More significantly, throughput has increased from 3571 MB/s in the original to 4028 MB/s in the new version, a good 12.8% improvement. I also ran another benchmark using SMHasher, a hash function test suite. The results are below.

Original
========
Bulk speed test - 262144-byte keys
Alignment  0 -  1.865 bytes/cycle - 5335.63 MiB/sec @ 3 ghz
Alignment  1 -  1.863 bytes/cycle - 5329.93 MiB/sec @ 3 ghz
Alignment  2 -  1.863 bytes/cycle - 5329.93 MiB/sec @ 3 ghz
Alignment  3 -  1.863 bytes/cycle - 5329.94 MiB/sec @ 3 ghz
Alignment  4 -  1.868 bytes/cycle - 5345.33 MiB/sec @ 3 ghz
Alignment  5 -  1.864 bytes/cycle - 5334.29 MiB/sec @ 3 ghz
Alignment  6 -  1.864 bytes/cycle - 5334.27 MiB/sec @ 3 ghz
Alignment  7 -  1.864 bytes/cycle - 5334.28 MiB/sec @ 3 ghz

Modified
========
Bulk speed test - 262144-byte keys
Alignment  0 -  2.144 bytes/cycle - 6132.86 MiB/sec @ 3 ghz
Alignment  1 -  2.123 bytes/cycle - 6074.16 MiB/sec @ 3 ghz
Alignment  2 -  2.123 bytes/cycle - 6074.21 MiB/sec @ 3 ghz
Alignment  3 -  2.133 bytes/cycle - 6102.57 MiB/sec @ 3 ghz
Alignment  4 -  2.122 bytes/cycle - 6072.41 MiB/sec @ 3 ghz
Alignment  5 -  2.122 bytes/cycle - 6072.35 MiB/sec @ 3 ghz
Alignment  6 -  2.123 bytes/cycle - 6073.10 MiB/sec @ 3 ghz
Alignment  7 -  2.123 bytes/cycle - 6073.19 MiB/sec @ 3 ghz

SMHasher reports almost a 14% improvement in the worst case, and for aligned data blocks it is processing at almost 6 GB/s! The small-key results are not shown here, but there is an increase of 3 cycles in the worst case. I feel that this marginal reduction in performance for small buffers/keys is an acceptable trade-off for the gain on large buffers.

All these numbers are from my laptop. I’d expect better gains on better boxes and processors. I will have to run the full SMHasher test suite to validate my approach. My profit from all this is the learning experience and the potential speedup of deduplication in Pcompress.

10 thoughts on “Vectorizing xxHash for Fun and Profit”

  1. Yann Collet (@Cyan4973)

    Great reading and great analysis moinakg.

    Just a quick comment : I guess the word “hyperthreading” is not the one you want in this context.
    “Hyperthreading” is an Intel technology, which makes a single physical core look like 2 logical ones.
    In order to use this capability, you need a multi-threaded application, so that the workload is distributed among the 2 logical cores.
    However, here, the algorithm is run within a single thread, so hyperthreading is not going to help.

    What you may be looking for is “ILP”, for Instruction Level Parallelism.
    ILP is the natural consequence of having several ALUs packed into a single core, which can be put to good use thanks to OoOEx (Out of Order Execution).
    The number of ALUs is architecture dependent, so the amount of parallelism that can be extracted from a core varies.
    But a basic rule always stands: for 2 operations to be executed in parallel, it is necessary for them to have no direct relation, since otherwise the second operation needs to wait for the first one.
    And that’s exactly what the inner loop of xxhash has been designed for….

    1. moinakg Post author

      Thanks for pointing out the mistake. I was thinking of Superscalar processing and ILP all the time and did mention them but eventually wrote about hyperthreading. I will update the posting.

  2. Frank Hileman

    It’s nice to increase speed, but the biggest question is the quality of the hash function. Does your function produce a good distribution compared to xxHash, say via SMHasher?

    1. moinakg Post author

      Yup, I have tested with SMHasher with the same results as the original xxHash. If you look at the approach closely, it is in fact a Merkle-style hashing approach, which preserves the qualities of the original linear hash.

  3. sanmayce

    Hi moinakg,
    I enjoyed your post.

    You say 6- GB/s on i5, but it doesn’t impress me much.
    In order to make it even more interesting (speaking of speed comparisons) my suggestion is to check the FASTEST hasher known to me ‘YoshimitsuTRIADii’ at ‘Creating a Fast Hash Function’ page: http://locklessinc.com/articles/fast_hash/

    Please keep the attempts, I believe you can do it better.
    Regards

  4. sanmayce

    Just wanted to see how speedy the 64bit compile of FNV1A_YoshimitsuTRIADii64 is gonna be (named fnv1a-teslatriadii in the test, FNV1A-Tesla has 2 hash lines with 8+8 bytes stride), the main loops below are for 3 hash lines with 8+8 bytes interleaved stride i.e. 48 bytes per loop:

    ; mark_description “Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.1.258 Build 20111″;

    ;;; hash64 = (hash64 ^ (ROL64(*(uint64_t *)(p+0),5) ^ *(uint64_t *)(p+0+Second_Line_Offset))) * PRIME;

    0004d 48 89 5c 24 20 mov QWORD PTR [32+rsp], rbx
    00052 48 ba b9 7b 4a
    7f b9 79 37 9e mov rdx, 09e3779b97f4a7bb9H
    0005c 48 8d 76 00 ALIGN 16
    ; LOE rdx rcx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 r15 eax xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
    .B2.4:: ; Preds .B2.4 .B2.3
    00060 48 8b 19 mov rbx, QWORD PTR [rcx]
    00063 48 c1 c3 05 rol rbx, 5
    00067 4a 33 1c 01 xor rbx, QWORD PTR [rcx+r8]
    0006b 4c 33 cb xor r9, rbx

    ;;; hash64B = (hash64B ^ (ROL64(*(uint64_t *)(p+2*4+Second_Line_Offset),5) ^ *(uint64_t *)(p+2*4))) * PRIME;

    0006e 4a 8b 5c 01 08 mov rbx, QWORD PTR [8+rcx+r8]
    00073 48 c1 c3 05 rol rbx, 5
    00077 48 33 59 08 xor rbx, QWORD PTR [8+rcx]
    0007b 4c 33 d3 xor r10, rbx

    ;;; hash64C = (hash64C ^ (ROL64(*(uint64_t *)(p+2*8),5) ^ *(uint64_t *)(p+2*8+Second_Line_Offset))) * PRIME;

    0007e 48 8b 59 10 mov rbx, QWORD PTR [16+rcx]
    00082 48 c1 c3 05 rol rbx, 5
    00086 4a 33 5c 01 10 xor rbx, QWORD PTR [16+rcx+r8]
    0008b 48 83 c1 18 add rcx, 24
    0008f 4c 33 db xor r11, rbx
    00092 4c 0f af ca imul r9, rdx
    00096 4c 0f af d2 imul r10, rdx
    0009a 4c 0f af da imul r11, rdx
    0009e ff c8 dec eax
    000a0 75 be jne .B2.4 ; Prob 82%

    E:\benchmark-0.12_NOT-ORIGINAL\src>benchmark_Intel_12.1_O2.exe CityHash128 CityHash64 SpookyHash fnv1a-jesteress fnv1a-tesla fnv1a-teslatriadii xxhash-fast xxhash-strong xxhash256 -i77 200MB_as_one_line.TXT
    memcpy: 109 ms, 209715202 bytes = 1834 MB/s
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    CityHash128 1.0.3
    209715218 (x 1.000) 3389 MB/s 3278 MB/s 277e15 268e15
    CityHash64 1.0.3
    209715210 (x 1.000) 3389 MB/s 3448 MB/s 277e15 282e15
    SpookyHash 2012-03-30
    209715218 (x 1.000) 4166 MB/s 4166 MB/s 341e15 341e15
    fnv1a-jesteress v2
    209715206 (x 1.000) 3333 MB/s 3333 MB/s 273e15 273e15
    fnv1a-tesla v2
    209715210 (x 1.000) 4651 MB/s 4651 MB/s 381e15 381e15
    fnv1a-teslatriadii v2
    209715210 (x 1.000) 4651 MB/s 4651 MB/s 381e15 381e15
    xxhash-fast r3
    209715206 (x 1.000) 4000 MB/s 4000 MB/s 327e15 327e15
    xxhash-strong r3
    209715206 (x 1.000) 2816 MB/s 2816 MB/s 230e15 230e15
    xxhash256 r3
    209715234 (x 1.000) 4166 MB/s 4166 MB/s 341e15 341e15
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    done… (77×1 iteration(s)).

    ; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.30319.01

    ; Function compile flags: /Ogtpy
    ; File e:\benchmark-0.12_not-original\src\codecs\sanmayce.c

    00175 66 66 66 0f 1f
    84 00 00 00 00
    00 npad 11
    $LL10@FNV1A_Hash@2:

    ; 114 : hash64 = (hash64 ^ (ROL64(*(uint64_t *)(p+0),5) ^ *(uint64_t *)(p+0+Second_Line_Offset))) * PRIME;

    00180 4a 8b 04 16 mov rax, QWORD PTR [rsi+r10]
    00184 49 83 c2 18 add r10, 24
    00188 48 c1 c0 05 rol rax, 5
    0018c 49 33 42 e0 xor rax, QWORD PTR [r10-32]
    00190 4c 33 c0 xor r8, rax

    ; 115 : hash64B = (hash64B ^ (ROL64(*(uint64_t *)(p+2*4+Second_Line_Offset),5) ^ *(uint64_t *)(p+2*4))) * PRIME;

    00193 49 8b 42 e8 mov rax, QWORD PTR [r10-24]
    00197 4d 0f af c3 imul r8, r11
    0019b 48 c1 c0 05 rol rax, 5
    0019f 4a 33 44 13 e8 xor rax, QWORD PTR [rbx+r10-24]
    001a4 4c 33 c8 xor r9, rax

    ; 116 : hash64C = (hash64C ^ (ROL64(*(uint64_t *)(p+2*8),5) ^ *(uint64_t *)(p+2*8+Second_Line_Offset))) * PRIME;

    001a7 4a 8b 44 17 e8 mov rax, QWORD PTR [rdi+r10-24]
    001ac 4d 0f af cb imul r9, r11
    001b0 48 c1 c0 05 rol rax, 5
    001b4 49 33 42 f0 xor rax, QWORD PTR [r10-16]
    001b8 48 33 c8 xor rcx, rax
    001bb 49 0f af cb imul rcx, r11
    001bf ff ca dec edx
    001c1 75 bd jne SHORT $LL10@FNV1A_Hash@2

    E:\benchmark-0.12_NOT-ORIGINAL\src>benchmark_Microsoft_VS2010_Ox.exe CityHash128 CityHash64 SpookyHash fnv1a-jesteress fnv1a-tesla fnv1a-teslatriadii xxhash-fast xxhash-strong xxhash256 -i77 200MB_as_one_line.TXT
    memcpy: 109 ms, 209715202 bytes = 1834 MB/s
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    CityHash128 1.0.3
    209715218 (x 1.000) 4347 MB/s 4347 MB/s 356e15 356e15
    CityHash64 1.0.3
    209715210 (x 1.000) 4166 MB/s 4166 MB/s 341e15 341e15
    SpookyHash 2012-03-30
    209715218 (x 1.000) 4000 MB/s 4000 MB/s 327e15 327e15
    fnv1a-jesteress v2
    209715206 (x 1.000) 3278 MB/s 3278 MB/s 268e15 268e15
    fnv1a-tesla v2
    209715210 (x 1.000) 4545 MB/s 4545 MB/s 372e15 372e15
    fnv1a-teslatriadii v2
    209715210 (x 1.000) 4545 MB/s 4545 MB/s 372e15 372e15
    xxhash-fast r3
    209715206 (x 1.000) 4166 MB/s 4166 MB/s 341e15 341e15
    xxhash-strong r3
    209715206 (x 1.000) 2857 MB/s 2857 MB/s 234e15 234e15
    xxhash256 r3
    209715234 (x 1.000) 4166 MB/s 4166 MB/s 341e15 341e15
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    done… (77×1 iteration(s)).
    E:\benchmark-0.12_NOT-ORIGINAL\src>

    E:\Benchmark_LuckyLight_r1>benchmark_Intel_12.1_O2.exe CityHash128 CityHash64 SpookyHash fnv1a-jesteress fnv1a-yoshimura fnv1a-YoshimitsuTRIADii fnv1a-tesla fnv1a-tesla3 xxhash-fast xxhash-strong xxhash256 -i77 200MB_as_one_line.TXT

    fnv1a-YoshimitsuTRIADii v2
    209715206 (x 1.000) 4347 MB/s 4347 MB/s 356e15 356e15 !!! 32bit function compiled as 64bit code !!!
    fnv1a-tesla v2
    209715210 (x 1.000) 4651 MB/s 4651 MB/s 381e15 381e15

    When running my benchmark ‘Lucky Light’:

    BURST_Read_8XMM128bit: (64MB block); 524288MB fetched in 113085 clocks or 4.636MB per clock
    FNV1A_YoshimitsuTRIADii: (64MB block); 65536MB hashed in 14539 clocks or 4.508MB per clock !!! 32bit function compiled as 32bit code !!!

    BURST_Read_8XMM128bit: (16KB block); 524288MB fetched in 17456 clocks or 30.035MB per clock
    FNV1A_YoshimitsuTRIADii: (16KB block); 65536MB hashed in 8923 clocks or 7.345MB per clock !!! 32bit function compiled as 32bit code !!!

    When in L1: 29+ GB/s, YIPPEE!

    All the results are obtained on Core 2 T7500 2.2GHz laptop.

    FNV1A_YoshimitsuTriadii (3×8=24 stride):
    uint32_t HashLineA_1x2x32bit 4+4
    uint32_t HashLineB_1x2x32bit 4+4
    uint32_t HashLineC_1x2x32bit 4+4

    !!! 24+24+24+24=4x3x8 !!! 2x3x16 or 6 XMM registers, 2 more left as 32bit code !!!

    FNV1A_YoshimitsuTriadii24x4 (4x3x8=96 stride):
    uint32_t HashLineA_4x2x32bit 4+4 4+4 4+4 4+4 = 16+16 interleaved
    uint32_t HashLineB_4x2x32bit 4+4 4+4 4+4 4+4 = 16+16 interleaved
    uint32_t HashLineC_4x2x32bit 4+4 4+4 4+4 4+4 = 16+16 interleaved

    When data is cached XMM approach delivers thunderous boost, while in MAIN RAM according to my limited experience XMM registers are slightly faster than general ones, no?
    It is good ALWAYS to have and know the ROOF of reading/hashing the MAIN RAM with 32/64/128 and why not with 256 bit registers.
    Personally I don’t need a long runner but a sprinter (my interest is in 200- bytes strings), anyway it is a good exercise/exploration to utilize SIMD in order to utilize fully RAM bandwidth … for similar but heavy tasks.
    Somebody, versed in XMM, should write someday/somenight such a benchmarker, it will answer many basic but important questions on how RAM should be accessed.

    Sky is the limit, nah-nah, sky’s way is limitless:

    Delerium – Sky (Tears From Heaven) (feat. Kristy Thirsk)
    Why is the world so in pain?
    Isn’t there enough love to go around

    I can hear when Heaven cries so dim the lights
    And close my eyes…

    1. moinakg Post author

      That is an awesome result. In the current xxHash code there are only two levels of interleaving, leaving enough XMM registers to do two more levels of interleaving, which should improve speed much more. However, there is also the question of retaining hash value compatibility with non-XMM code. Having more levels of interleaving will actually slow down the non-XMM loop on CPUs that do not have SSE 4.1, as the number of intermediate variables will overflow the available registers.

  5. sanmayce

    Hi again moinakg,
    I am happy to share my first XMM attempt: FNV1A_YoshimitsuTRIADiiXMM:

    Wanted to see how SIMDed main loop would look like:
    One of the main goals: to stress 128bit registers only and nothing else, for now 6 in total, in fact Intel uses the all 8.
    Current approach: instead of rotating the 5 bits within the DWORD quadruplets I chose to do it within the entire DQWORD i.e. XMMWORD.

    // #define xmmloadu(p) _mm_loadu_si128((__m128i const*)(p))
    // #define _rotl_KAZE128(x, n) _mm_or_si128(_mm_slli_si128(x, n) , _mm_srli_si128(x, 128-n))
    // uint32_t FNV1A_Hash_YoshimitsuTRIADiiXMM(const char *str, uint32_t wrdlen)
    // {
    // const char *p = str;
    // …
    // if (wrdlen >= 4*24) { // Actually 4*24 is the minimum and not useful, 200++ makes more sense.
    // Loop_Counter = (wrdlen/(4*24));
    // Loop_Counter++;
    // Second_Line_Offset = wrdlen-(Loop_Counter)*(4*3*4);
    // for(; Loop_Counter; Loop_Counter--, p += 4*3*sizeof(uint32_t)) {
    // xmm0 = xmmloadu(p+0*16);
    // xmm1 = xmmloadu(p+0*16+Second_Line_Offset);
    // xmm2 = xmmloadu(p+1*16);
    // xmm3 = xmmloadu(p+1*16+Second_Line_Offset);
    // xmm4 = xmmloadu(p+2*16);
    // xmm5 = xmmloadu(p+2*16+Second_Line_Offset);
    // hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , _mm_xor_si128(_rotl_KAZE128(xmm0,5) , xmm1)) , PRIMExmm);
    // hash32Bxmm = _mm_mullo_epi32(_mm_xor_si128(hash32Bxmm , _mm_xor_si128(_rotl_KAZE128(xmm3,5) , xmm2)) , PRIMExmm);
    // hash32Cxmm = _mm_mullo_epi32(_mm_xor_si128(hash32Cxmm , _mm_xor_si128(_rotl_KAZE128(xmm4,5) , xmm5)) , PRIMExmm);
    // }
    // // The simplest mumbo-jumbo mix:
    // hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , _rotl_KAZE128(hash32Bxmm,5)) , PRIMExmm);
    // hash32xmm = _mm_mullo_epi32(_mm_xor_si128(hash32xmm , _rotl_KAZE128(hash32Cxmm,5)) , PRIMExmm);
    // hash32 = (hash32 ^ _rotl_KAZE(hash32xmm.m128i_u32[0],5) ) * PRIME;
    // hash32 = (hash32 ^ _rotl_KAZE(hash32xmm.m128i_u32[1],5) ) * PRIME;
    // hash32 = (hash32 ^ _rotl_KAZE(hash32xmm.m128i_u32[2],5) ) * PRIME;
    // hash32 = (hash32 ^ _rotl_KAZE(hash32xmm.m128i_u32[3],5) ) * PRIME;
    // return hash32 ^ (hash32 >> 16);
    // } else if (wrdlen >= 24) {
    // …
    // }

    /*
    ; mark_description “Intel(R) C++ Compiler XE for applications running on IA-32, Version 12.1.1.258 Build 20111011″;
    ; mark_description “-Ox -TcHASH_linearspeed_FURY.c -FaHASH_linearspeed_FURY_Intel_IA-32_12 -FA”;

    .B4.4:
    lea edi, DWORD PTR [esi+esi*2]
    inc esi
    shl edi, 4
    cmp esi, edx
    movdqu xmm7, XMMWORD PTR [ecx+edi]
    movdqu xmm6, XMMWORD PTR [16+ebx+edi]
    movdqu xmm5, XMMWORD PTR [32+ecx+edi]
    movdqa xmm1, xmm7
    pslldq xmm1, 5
    psrldq xmm7, 123
    por xmm1, xmm7
    movdqu xmm7, XMMWORD PTR [ebx+edi]
    pxor xmm1, xmm7
    pxor xmm2, xmm1
    movdqa xmm1, xmm6
    pslldq xmm1, 5
    psrldq xmm6, 123
    por xmm1, xmm6
    movdqu xmm6, XMMWORD PTR [16+ecx+edi]
    pxor xmm1, xmm6
    movdqa xmm6, xmm5
    pslldq xmm6, 5
    pxor xmm3, xmm1
    psrldq xmm5, 123
    por xmm6, xmm5
    movdqu xmm5, XMMWORD PTR [32+ebx+edi]
    pxor xmm6, xmm5
    pxor xmm4, xmm6
    pmulld xmm2, xmm0
    pmulld xmm3, xmm0
    pmulld xmm4, xmm0
    jb .B4.4

    I wrote YoYo r.1+, my [CR]LF lines hasher in order to see what dispersion it has, in the next dump it hashed 1,048,576 Knight Tours using 20bit hash table:

    YoYo – [CR]LF lines hasher, r.1+ copyleft Kaze.
    Note1: Incoming textual file can exceed 4GB and lines can be up to 1048576 chars long.
    Note2: FNV1A_YoshimitsuTRIADiiXMM needs SSE4.1, so if not present YoYo will crash.
    Polynomial(s) used:
    CRC32C2_8slice: 0x8F6E37A0
    HashSizeInBits = 20
    Allocating KEY memory 1024KB … OK
    Allocating HASH memory 4MB … OK
    Allocating HASH memory 4MB … OK
    Allocating HASH memory 4MB … OK
    Hashing all the LF ending lines encountered in 136,314,880 bytes long file …
    Keys vs Slots ratio: 1:1 or 1,048,576:1,048,576
    FNV1A_YoshimitsuTRIADiiXMM : Keys = 00,000,000,000,001,048,576; 000,000,004 x MAXcollisionsAtSomeSlots = 0,000,000,010; HASHfreeSLOTS = 0,000,413,289; HashUtilization = 060%; Collisions = 0,000,413,289
    FNV1A_YoshimitsuTRIADii : Keys = 00,000,000,000,001,048,576; 000,000,002 x MAXcollisionsAtSomeSlots = 0,000,000,009; HASHfreeSLOTS = 0,000,385,367; HashUtilization = 063%; Collisions = 0,000,385,367
    CRC32C2_8slice : Keys = 00,000,000,000,001,048,576; 000,000,001 x MAXcollisionsAtSomeSlots = 0,000,000,009; HASHfreeSLOTS = 0,000,385,451; HashUtilization = 063%; Collisions = 0,000,385,451
    Physical Lines: 1,048,576
    Shortest Line : 128
    Longest Line : 128

    Obviously the function needs some tuning, sadly my CPU T7500 supports up to SSSE3 and I cannot play with it (Q9550S was used for above dump).
    In my view FNV1A_YoshimitsuTRIADiiXMM is the fastest (regarding bandwidth) hasher, I see it dethroned only by incoming FNV1A_JAMIROQUAI (aka FNV1A_YoshimitsuTRIADiiYMM).
    http://en.wikipedia.org/wiki/File:Jamiroquai_2.jpg

    I won’t disturb you until I buy AVX machine and write ‘JAMIROQUAI’.
    Regards

  6. Bulat Ziganshin (@justbulat)

    it should be much faster with reordered code and unrolled loop. for sandy bridge optimum is 3 SSE registers (since PMUL has latency of 3) although even more may improve speed if code isn’t optimal. with optimal scheduling, it should process 16 bytes in 3 cycles on SB
