Fast Salsa20 crypto using CUDA

NVIDIA GeForce 6600GT AGP without heatsink, showing the GPU and the AGP bridge chip (Photo credit: Wikipedia)

I was playing with some GPU stuff for the past week and decided to try out some crypto on a GPU. There is existing code that ports various hash functions and AES encryption to the GPU, and there is one port of the eSTREAM ciphers, including Salsa20, to the GPU (see references) that I found later. However, I wanted to experiment myself in any case. By GPU here I mean specifically Nvidia CUDA. AMD supports OpenCL, has a Stream SDK, and also has APUs. It will be interesting to try those out, but that is for another day.

Salsa20 is designed to provide very fast software encryption without compromising on security, even when compared to AES. Optimized implementations exist for different platforms like x86 and ARM; the x86 code uses SSSE3 vector instructions.
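For context, the Salsa20 core itself is tiny. Here is a plain-C sketch of the 20-round core, written from the published spec (the function names are mine, not from any particular implementation):

```c
#include <stdint.h>
#include <string.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* One Salsa20 quarter-round, modifying four 32-bit words in place. */
static void quarterround(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    *b ^= ROTL32(*a + *d, 7);
    *c ^= ROTL32(*b + *a, 9);
    *d ^= ROTL32(*c + *b, 13);
    *a ^= ROTL32(*d + *c, 18);
}

/* Salsa20/20 core: 10 double-rounds (column round + row round)
 * followed by the feed-forward addition of the input state. */
void salsa20_core(uint32_t out[16], const uint32_t in[16])
{
    uint32_t x[16];
    memcpy(x, in, sizeof(x));
    for (int i = 0; i < 10; i++) {
        /* column round */
        quarterround(&x[0],  &x[4],  &x[8],  &x[12]);
        quarterround(&x[5],  &x[9],  &x[13], &x[1]);
        quarterround(&x[10], &x[14], &x[2],  &x[6]);
        quarterround(&x[15], &x[3],  &x[7],  &x[11]);
        /* row round */
        quarterround(&x[0],  &x[1],  &x[2],  &x[3]);
        quarterround(&x[5],  &x[6],  &x[7],  &x[4]);
        quarterround(&x[10], &x[11], &x[8],  &x[9]);
        quarterround(&x[15], &x[12], &x[13], &x[14]);
    }
    for (int i = 0; i < 16; i++)
        out[i] = x[i] + in[i];  /* feed-forward makes the core non-invertible */
}
```

The whole cipher is just additions, XORs and rotations over a 16-word state, which is why it is so cheap per block relative to memory traffic.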

I have a low-end GPU at my disposal on my laptop. It is a GeForce GT 230M with 1GB of onboard RAM and 48 CUDA cores (6 multiprocessors) clocked at 1.1 GHz. It only supports CUDA compute capability 1.2.

For the GPU port I started with the reference C code and focused on CTR-mode Salsa20, since it is embarrassingly parallel: each block can be processed independently of the others. Very soon I had a working kernel function that mapped one 64-byte Salsa20 block to one CUDA thread. It worked, but was only 32% faster than the reference C implementation, which is hardly anything; the SSSE3-optimized version is about 245% faster than the reference code. Clearly I was not using the GPU properly, as even this low-end GPU should not be that bad. The initial metrics looked like the following:
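The one-block-per-thread mapping is just launch arithmetic: global thread i encrypts block i, whose CTR counter value is i and whose byte offset is i * 64. A host-side sketch in plain C (the helper names here are hypothetical, not from my code; the kernel derives the thread id from blockIdx and threadIdx):

```c
#include <stdint.h>

#define SALSA20_BLOCK 64

/* Hypothetical launch-geometry helper: pick a CUDA-style grid so that
 * global thread id i = blockIdx.x * blockDim.x + threadIdx.x covers
 * every 64-byte Salsa20 block, rounding the grid size up. */
typedef struct { unsigned threads_per_block, blocks_per_grid; } launch_t;

launch_t plan_launch(unsigned long nblocks, unsigned threads_per_block)
{
    launch_t l;
    l.threads_per_block = threads_per_block;
    l.blocks_per_grid =
        (unsigned)((nblocks + threads_per_block - 1) / threads_per_block);
    return l;
}

/* In CTR mode, thread i's counter and byte offset follow directly. */
unsigned long ctr_for_thread(unsigned long tid)    { return tid; }
unsigned long offset_for_thread(unsigned long tid) { return tid * SALSA20_BLOCK; }
```

Threads past the last whole block simply return early in the kernel, so the rounded-up grid is harmless.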

===== 1st Try =====
Vector Encryption
Initializing input data
Allocating device buffer
Copying buffer to device
Invoking kernel
Copying buffer back to host memory
Verifying result
Data transfer time    : 268.483002 msec
GPU computation time  : 1093.133057 msec
CPU computation time  : 1628.641448 msec
[vectorAdd] test results...
PASSED

I was using a 244MB buffer with 4,000,000 Salsa20 blocks of 64 bytes each. I am only considering whole blocks for the GPU, since it is easier to process the last leftover bytes on the CPU.

I started reading the CUDA C Best Practices Guide and various other related presentations. Then I spent some time with the Nsight and nvprof utilities, and a few things became quite apparent:

  1. Having local arrays on the function stack meant that they would be placed in local memory, which is relatively slow to access. One needs to use shared memory for this purpose.
  2. I was accessing global memory byte by byte in a loop when doing the XORs. This is very bad for memory bandwidth.
  3. Processing only one block per CUDA thread wastes compute capability, since Salsa20's compute overhead is extremely low compared to the memory access. The whole process is memory-bound rather than compute-bound.
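Point 2 is easy to fix in principle: a 64-byte block is just sixteen 32-bit words, so the XOR can run word-wise, turning 64 narrow memory transactions per block into 16 full-width ones. A plain-C sketch (the helper name is mine):

```c
#include <stdint.h>

/* XOR a 64-byte keystream block into data as sixteen 32-bit words
 * instead of 64 individual bytes. On the GPU this quarters the number
 * of global memory transactions per Salsa20 block. */
void xor_block32(uint32_t *data, const uint32_t *keystream)
{
    for (int i = 0; i < 16; i++)
        data[i] ^= keystream[i];
}
```

Since XOR is its own inverse, applying the same keystream twice recovers the plaintext, which also makes a handy correctness check.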

So as a next attempt I declared the keystream and input counter blocks as shared memory arrays. I also changed all keystream and data byte accesses to 32-bit integer accesses. In addition, I got rid of most of the explicit little-endian load/store functions from the reference code: Nvidia GPUs are little-endian, since they need to directly process data from x86 and ARM hosts, which are also little-endian platforms. After all this I got some improvement, but it was still not great:
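To illustrate why the explicit conversions become redundant on a little-endian target, here is a plain-C sketch comparing a reference-style byte-wise load against a native 32-bit load (function names are mine):

```c
#include <stdint.h>
#include <string.h>

/* Portable little-endian load, in the style of the reference Salsa20
 * code: assemble the word byte by byte, lowest byte first. */
uint32_t u8to32_little(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* On a little-endian machine (x86, typical ARM, and Nvidia GPUs) a
 * plain 32-bit load of the same bytes yields the same value, so the
 * explicit conversion can be dropped in the CUDA kernel. */
uint32_t load32_native(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof(v));  /* memcpy sidesteps alignment issues */
    return v;
}
```

On a big-endian host the two would differ, which is the only reason the reference code carries the explicit versions at all.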

===== 2nd Try =====
Vector Encryption
Initializing input data
Allocating device buffer
Copying buffer to device
Invoking kernel
Copying buffer back to host memory
Verifying result
Data transfer time    : 219.335999 msec
GPU computation time  : 797.606018 msec
CPU computation time  : 1570.171277 msec
[vectorAdd] test results...
PASSED

As you can see, things improved, but it appeared that I was still wasting GPU bandwidth and compute capability. I started to dig deeper, using the nvprof utility to query the various performance counters. My low-end GPU does not expose too many counters, but fortunately it does provide some of the most critical ones. One counter called ‘warp_serialize’ looked quite bad:

======== NVPROF is profiling vecCrypt...
======== Command: vecCrypt
[vecCrypt] starting...

Vector Encryption
Initializing input data
Allocating device buffer
Copying buffer to device
Invoking kernel
Copying buffer back to host memory
Verifying result
Data transfer time    : 219.335999 msec
GPU computation time  : 797.606018 msec
CPU computation time  : 1570.171277 msec
[vecCrypt] test results...
PASSED

> exiting in 3 seconds: 3...2...1...done!

======== Profiling result:
                  Invocations       Avg       Min       Max  Event Name
Device 0
        Kernel: VecCrypt(unsigned char*, unsigned int, unsigned long)
                            1 182462803 182462803 182462803  warp_serialize

The counter gives the number of times a thread warp was serialized. Each multiprocessor schedules threads in small groups called warps. Serialization within a warp means that its threads were executed serially rather than in parallel. This can occur when kernel functions have divergent branches due to “if ()” statements; it can also happen due to shared memory bank conflicts. Since I did not have any divergent “if ()” conditions in my code, bank conflicts were the culprit.

In fact, I was storing and accessing the per-thread keystream and input blocks in shared memory in traditional C row-major order, as one would normally do on the CPU. This is bad on the GPU: adjacent CUDA threads need to access adjacent shared memory locations to avoid bank conflicts. That is, one has to access the memory in a strided fashion or, in other words, column-major order. The following StackOverflow thread gives details on this particular issue: http://stackoverflow.com/questions/3841877/what-is-a-bank-conflict-doing-cuda-opencl-programming
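A small plain-C model makes the conflict pattern concrete. On compute capability 1.x hardware, shared memory has 16 four-byte banks accessed by a half-warp of 16 threads. The sketch below (assuming 64 threads per CUDA block and 16-word keystream blocks, both assumptions of this model) counts the worst-case number of threads hitting the same bank when every thread reads word w of its own block:

```c
#define NBANKS          16  /* shared memory banks on cc 1.x */
#define HALF_WARP       16  /* threads accessing shared memory together */
#define WORDS_PER_BLOCK 16  /* 64-byte Salsa20 block = 16 words */
#define NTHREADS        64  /* threads per CUDA block (assumed) */

/* Worst-case number of half-warp threads mapped to one bank when each
 * thread reads word w of its keystream block, under either layout. */
int max_conflict(int column_major, int w)
{
    int count[NBANKS] = {0};
    int worst = 0;
    for (int tid = 0; tid < HALF_WARP; tid++) {
        int idx = column_major
                ? w * NTHREADS + tid          /* strided, column-major */
                : tid * WORDS_PER_BLOCK + w;  /* row-major, CPU style */
        int bank = idx % NBANKS;
        if (++count[bank] > worst)
            worst = count[bank];
    }
    return worst;
}
```

Row-major puts all 16 threads of a half-warp on the same bank (a 16-way conflict, serialized 16x), while the column-major layout spreads them across all 16 banks so every access proceeds in parallel.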

This required a bunch of changes throughout, including a few changes in the core encryption function, but the results were worth it. I had also been passing the key and the sigma string via global memory; I changed that to use constant memory, since it is cached and faster to access from multiple threads. In addition, I made each thread process more than one Salsa20 block. Through experimentation I found that 4 Salsa20 blocks per CUDA thread gave the best results. There were a bunch of other trial-and-error experiments that I did, but leaving those out, the final nvprof results showed warp_serialize to have come down to zero. In addition, there were no incoherent accesses to global memory.

I also made a final change to use pinned memory via “cudaMallocHost()”, which reduced the data transfer overhead significantly. I added more timing and throughput calculations to the code and also included the optimized x86_64 assembly implementation for comparison. The final results look like the following on my laptop:

[vecCrypt] starting...

Vector Encryption
Initializing input data
Allocating device buffer
Copying buffer to device
Invoking kernel
Copying buffer back to host memory
Verifying result
Data transfer time (pinned mem)         : 173.084999 msec
GPU computation time                    : 214.147995 msec
GPU throughput                          : 1140.055619 MB/s
GPU throughput including naive transfer : 630.474750 MB/s
CPU computation time (reference code)   : 1635.980728 msec
CPU throughput (reference code)         : 149.231969 MB/s
CPU computation time (optimized code)   : 474.828611 msec
CPU throughput (optimized code)         : 514.165784 MB/s
[vecCrypt] test results...
PASSED

> exiting in 3 seconds: 3...2...1...done!

As you can see, the GPU version gives approximately 1.1 GB/s native throughput, while including the PCI data transfer overhead it drops to 630 MB/s. The single-thread CPU performance is 514 MB/s, so multi-threaded CPU code, using OpenMP for example, will beat this low-end GPU. However, the story will be a lot different with a Tesla M-series card having hundreds of CUDA cores. One thing to note here is that processing one or more Salsa20 blocks per CUDA thread means that we are accessing global memory in a strided fashion: adjacent CUDA threads access memory locations separated by at least 64 bytes. So fully-coalesced memory access does not occur, and we are not leveraging the full memory bandwidth of the GPU. Once again, the impact of this is smaller on the newer Tesla GPUs. Is it at all possible to do fully coalesced access? I can’t think of a way. Maybe some CUDA experts can comment on this.
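For the curious, the throughput figures in the output are simply the 244MB (256,000,000-byte) buffer divided by the elapsed time. A quick plain-C check (the helper name is hypothetical) reproduces the printed numbers:

```c
/* MB/s from a byte count and a millisecond timing, using 1 MB = 2^20
 * bytes, which is how the 256,000,000-byte buffer comes out as 244MB. */
double throughput_mb_s(double bytes, double msec)
{
    return (bytes / (1024.0 * 1024.0)) / (msec / 1000.0);
}
```

Plugging in the GPU computation time of 214.147995 msec gives the printed 1140.055619 MB/s, and the reference-CPU time of 1635.980728 msec gives 149.231969 MB/s.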

There are a couple of other things to do for a complete implementation:

  1. Use CUDA streams to overlap data transfer with compute and hide the PCI latency.
  2. Does it make sense to compute only the keystream blocks on the GPU and perform the XOR with the plaintext on the CPU side? It requires only one PCI data transfer; however, it may not be beneficial considering the high onboard memory bandwidth and high parallelism of the Tesla cards. Some experimentation needs to be done.
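On point 2, the trade-off can be framed as PCI traffic. XORing on the GPU moves the plaintext down and the ciphertext back up; shipping only the keystream up halves the transfer, at the cost of doing the XOR on the CPU. A trivial accounting sketch (plain C, names mine):

```c
/* Bytes crossing the PCI bus to encrypt n plaintext bytes. */

/* GPU does the XOR: plaintext down + ciphertext back up. */
unsigned long pci_bytes_fused(unsigned long n) { return 2 * n; }

/* GPU generates only the keystream: one transfer up; the host then
 * XORs the keystream into the plaintext locally. */
unsigned long pci_bytes_split(unsigned long n) { return n; }
```

Whether the halved bus traffic beats the extra CPU-side XOR pass is exactly what needs measuring, especially on Tesla-class cards.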

The source code for all this is available at https://github.com/moinakg/salsa20_core_cuda. Note that it carries Nvidia’s SDK sample code license, since I started by modifying the vectorAdd sample. Also note that this is not a complete Salsa20 encryption implementation; it just benchmarks the core Salsa20 GPU implementation. Among other things, it does not do key generation; it just uses a pre-set key value and a zero nonce.
