
Fast Salsa20 crypto using CUDA

[Image: NVIDIA GeForce 6600GT AGP, shown without heatsink with the GPU and AGP bridge chip visible (Photo credit: Wikipedia)]

I have been playing with some GPU stuff for the past week and decided to try out some crypto on a GPU. There is existing code that ports various hash functions and AES encryption to the GPU, and there is one port of the eSTREAM ciphers, including Salsa20, to the GPU (see references) that I found later. However, I wanted to experiment myself in any case. By GPU here I mean specifically Nvidia CUDA. AMD supports OpenCL, has a Stream SDK and also has APUs. It will be interesting to try those out, but that is for another day.

Salsa20 is designed to provide very fast software encryption without compromising security, even when compared to AES. Optimized implementations exist for different platforms like x86 and ARM; the x86 code uses SSSE3 vector instructions.
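For context, the Salsa20 core is built purely from 32-bit additions, rotations and XORs, which is what makes it fast in software on commodity CPUs and also a good fit for GPU ALUs. The quarter-round from the specification, rendered in plain C:

#include <stdint.h>

#define ROTL32(v, c) (((v) << (c)) | ((v) >> (32 - (c))))

/* Salsa20 quarter-round: only 32-bit add, rotate and XOR operations. */
static void quarterround(uint32_t *y0, uint32_t *y1, uint32_t *y2, uint32_t *y3)
{
    *y1 ^= ROTL32(*y0 + *y3, 7);
    *y2 ^= ROTL32(*y1 + *y0, 9);
    *y3 ^= ROTL32(*y2 + *y1, 13);
    *y0 ^= ROTL32(*y3 + *y2, 18);
}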

I have a low-end GPU at my disposal on my laptop: a GeForce GT 230M with 1 GB of onboard RAM and 48 CUDA cores (6 multiprocessors) clocked at 1.1 GHz. It only supports CUDA compute capability 1.2.

For the GPU port I started with the reference C code and focused on CTR-mode Salsa20, since it is embarrassingly parallel: each block can be processed independently of the others. Very soon I had a working kernel function that mapped one 64-byte Salsa20 block to one CUDA thread. It worked, but was only 32% faster than the reference C implementation, which is hardly anything; the SSSE3-optimized version is about 245% faster than the reference code. Clearly I was not using the GPU properly, as even this low-end GPU should not be that bad. The initial metrics looked like the following:

===== 1st Try =====
Vector Encryption
Initializing input data
Allocating device buffer
Copying buffer to device
Invoking kernel
Copying buffer back to host memory
Verifying result
Data transfer time    : 268.483002 msec
GPU computation time  : 1093.133057 msec
CPU computation time  : 1628.641448 msec
[vectorAdd] test results...
PASSED

I was using a 244MB buffer with 4,000,000 Salsa20 blocks of 64 bytes each. I am only considering whole blocks for the GPU, since it is easier to process the last leftover bytes on the CPU.
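For reference, the first-try kernel was shaped roughly like the sketch below: one 64-byte block per thread, a per-thread keystream array and a byte-wise XOR. The salsa20_block() call stands in for the reference core; the names and signature here are illustrative, not the actual code.

/* Sketch of the first-try kernel: one 64-byte Salsa20 block per CUDA thread.
 * salsa20_block() stands in for the reference core that writes 64 keystream
 * bytes for the given block counter. */
__device__ void salsa20_block(unsigned char *ks, unsigned long blknum);

__global__ void VecCryptNaive(unsigned char *data, unsigned long nblocks)
{
    unsigned long blk = (unsigned long)blockIdx.x * blockDim.x + threadIdx.x;
    if (blk >= nblocks)
        return;

    unsigned char ks[64];            /* per-thread array ends up in slow local memory */
    salsa20_block(ks, blk);          /* keystream for this counter value */

    for (int i = 0; i < 64; i++)     /* byte-by-byte XOR into global memory */
        data[blk * 64 + i] ^= ks[i];
}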

I started reading the CUDA C Best Practices Guide and various other related presentations. Then I spent some time with the Nsight and nvprof utilities, and a few things became quite apparent:

  1. Having local arrays on the function stack means they end up in local memory, which is relatively slow to access. One needs to use shared memory for this purpose.
  2. I was accessing global memory byte by byte in a loop when doing the XORs. This is very bad for memory bandwidth (see the sketch after this list).
  3. Processing only one block per CUDA thread wastes compute capability, since the Salsa20 compute overhead is extremely low compared to the memory accesses; the whole process is memory bound rather than compute bound.
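To illustrate item 2, the fix (which the next attempt adopts) is to do the XOR a 32-bit word at a time; a sketch of the change inside the kernel body, not the actual code:

/* Before: byte-by-byte XOR, 64 separate global memory accesses per block. */
for (int i = 0; i < 64; i++)
    data[blk * 64 + i] ^= ks[i];

/* After: 32-bit word accesses, 16 per block, far friendlier to the memory bus
 * (the 64-byte block offsets keep the casts properly aligned). */
unsigned int *d32 = (unsigned int *)(data + blk * 64);
unsigned int *k32 = (unsigned int *)ks;
for (int i = 0; i < 16; i++)
    d32[i] ^= k32[i];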

So as a next attempt I declared the keystream and input counter blocks in shared memory. I also changed all keystream and data byte accesses to 32-bit integer accesses. In addition I got rid of most of the explicit little-endian load/store functions from the reference code: Nvidia GPUs are little-endian, since they need to directly process host data from x86 and ARM, which are also little-endian platforms. After all this I got some improvement, but it was still not great:

===== 2nd Try =====
Vector Encryption
Initializing input data
Allocating device buffer
Copying buffer to device
Invoking kernel
Copying buffer back to host memory
Verifying result
Data transfer time    : 219.335999 msec
GPU computation time  : 797.606018 msec
CPU computation time  : 1570.171277 msec
[vectorAdd] test results...
PASSED

As you can see, things improved, but it appeared that I was still wasting GPU bandwidth and compute capability. I started to dig deeper using the nvprof utility to query the various performance counters. My low-end GPU does not expose many counters, but fortunately it does provide some of the most critical ones. One counter called ‘warp_serialize’ looked quite bad:

======== NVPROF is profiling vecCrypt...
======== Command: vecCrypt
[vecCrypt] starting...

Vector Encryption
Initializing input data
Allocating device buffer
Copying buffer to device
Invoking kernel
Copying buffer back to host memory
Verifying result
Data transfer time    : 219.335999 msec
GPU computation time  : 797.606018 msec
CPU computation time  : 1570.171277 msec
[vecCrypt] test results...
PASSED

> exiting in 3 seconds: 3...2...1...done!

======== Profiling result:
                  Invocations       Avg       Min       Max  Event Name
Device 0
        Kernel: VecCrypt(unsigned char*, unsigned int, unsigned long)
                            1 182462803 182462803 182462803  warp_serialize

The counter gives the number of times a thread warp was serialized. Each multiprocessor schedules threads in small groups called warps. Thread serialization within a warp means that the threads were executed serially rather than in parallel. Such serialization can occur if kernel functions have divergent branches in the code due to “if ()” statements. It can also happen due to shared memory bank conflicts. Since I did not have any divergent “if ()” conditions in my code, bank conflicts were the culprit.

In fact I was storing and accessing the per-thread keystream and input blocks in shared memory serially, in traditional C row-major order, as one would normally do on the CPU. This is bad on the GPU. Adjacent CUDA threads need to access adjacent shared memory locations to avoid bank conflicts. That is, one has to access the memory in a strided fashion or, in other words, column-major order. The following StackOverflow thread gives details on this particular issue: http://stackoverflow.com/questions/3841877/what-is-a-bank-conflict-doing-cuda-opencl-programming
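To make this concrete, here is roughly what the layout change looks like for the per-thread keystream words in shared memory. THREADS_PER_BLOCK and the index arithmetic are illustrative, and only one of the two layouts would actually be used; this is a fragment from inside a kernel body, not the actual code.

#define THREADS_PER_BLOCK 128

/* Row-major: thread t owns ks_row[t][0..15]. With 16 banks, word i of every
 * thread in a half-warp lands in the same bank, so the accesses serialize. */
__shared__ unsigned int ks_row[THREADS_PER_BLOCK][16];

/* Column-major (strided): word i of thread t lives at ks_col[i][t], so
 * adjacent threads hit adjacent words in different banks, with no conflict. */
__shared__ unsigned int ks_col[16][THREADS_PER_BLOCK];

for (int i = 0; i < 16; i++) {
    unsigned int w_bad  = ks_row[threadIdx.x][i];  /* conflicting access pattern */
    unsigned int w_good = ks_col[i][threadIdx.x];  /* conflict-free access pattern */
}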

Switching to this layout required a bunch of changes throughout, including a few changes in the core encryption function, but the results were worth it. I was also passing the key and the sigma string via global memory; I changed that to use constant memory, since it is cached and faster to access from multiple threads. In addition I made each thread process more than one Salsa20 block. With experimentation I found that 4 Salsa20 blocks per CUDA thread gave the best results. There were a bunch of other trial-and-error experiments that I did, but leaving those out, the final results with nvprof showed that warp_serialize had come down to zero. In addition there were no incoherent accesses to global memory.
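The constant memory part of the change is small; a sketch of the idea, with the symbol and variable names chosen here for illustration:

/* Key and the Salsa20 sigma constant placed in constant memory, which is
 * cached and broadcast efficiently when all threads read the same words. */
#include <cuda_runtime.h>

__constant__ unsigned int c_key[8];      /* 256-bit key as eight 32-bit words */
__constant__ unsigned int c_sigma[4];    /* the "expand 32-byte k" constant */

/* Host side: copy the values once before launching the kernel. */
void load_crypto_constants(const unsigned int *key_words, const unsigned int *sigma_words)
{
    cudaMemcpyToSymbol(c_key, key_words, sizeof(c_key));
    cudaMemcpyToSymbol(c_sigma, sigma_words, sizeof(c_sigma));
}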

I also made a final change to use pinned memory via “cudaMallocHost()”, which reduced the data transfer overhead significantly. I added more timing and throughput calculations to the code and also included the optimized x86_64 assembly implementation for comparison. The final results look like the following on my laptop:

[vecCrypt] starting...

Vector Encryption
Initializing input data
Allocating device buffer
Copying buffer to device
Invoking kernel
Copying buffer back to host memory
Verifying result
Data transfer time (pinned mem)         : 173.084999 msec
GPU computation time                    : 214.147995 msec
GPU throughput                          : 1140.055619 MB/s
GPU throughput including naive transfer : 630.474750 MB/s
CPU computation time (reference code)   : 1635.980728 msec
CPU throughput (reference code)         : 149.231969 MB/s
CPU computation time (optimized code)   : 474.828611 msec
CPU throughput (optimized code)         : 514.165784 MB/s
[vecCrypt] test results...
PASSED

> exiting in 3 seconds: 3...2...1...done!

As you can see, the GPU version gives approximately 1.1 GB/s native throughput, while including the PCI data transfer overhead it drops to 630 MB/s. The single-thread CPU performance is 514 MB/s, so multi-threaded CPU code using OpenMP, for example, would beat this low-end GPU. However, the story will be a lot different with a Tesla M-series card with hundreds of CUDA cores. One thing to note here is that processing one or more Salsa20 blocks per CUDA thread means we are accessing global memory in a strided fashion: adjacent CUDA threads access memory locations separated by at least 64 bytes. So fully coalesced memory access does not occur, and we are not leveraging the full memory bandwidth of the GPU. Once again, the impact of this is smaller on the newer Tesla GPUs. Is it at all possible to do fully coalesced access? I can’t think of a way. Maybe some CUDA experts can comment on this.

There are a couple of other things to do for a complete implementation:

  1. Use CUDA streams to overlap data transfer and compute and hide the PCI latency (see the sketch after this list).
  2. Does it make sense to compute only the keystream blocks on the GPU and perform the XOR with the plaintext on the CPU side? That requires only one PCI data transfer, but it may not be beneficial considering the high onboard memory bandwidth and high parallelism of the Tesla cards. Some experimentation needs to be done.
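Here is a rough sketch of what item 1 could look like, with the chunking, stream count and kernel signature all illustrative rather than taken from the actual code. The pinned host memory from cudaMallocHost() is what allows the asynchronous copies to actually overlap with compute.

/* Split the buffer into chunks and issue the host-to-device copy, kernel and
 * device-to-host copy for each chunk into separate CUDA streams, so that
 * transfers for one chunk overlap with compute on another. */
#include <cuda_runtime.h>

/* Stand-in for the real Salsa20 kernel; body omitted in this sketch. */
__global__ void VecCryptChunk(unsigned char *buf, unsigned long nblocks) { }

void encrypt_pipelined(unsigned long total_blocks)
{
    const int NSTREAMS = 4;                                     /* illustrative */
    unsigned long blocks_per_chunk = total_blocks / NSTREAMS;   /* assume divisible */
    size_t chunk_bytes = (size_t)blocks_per_chunk * 64;

    unsigned char *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, chunk_bytes * NSTREAMS);    /* pinned host memory */
    cudaMalloc((void **)&d_buf, chunk_bytes * NSTREAMS);

    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; i++)
        cudaStreamCreate(&streams[i]);

    dim3 block(128);
    dim3 grid((unsigned int)((blocks_per_chunk + block.x - 1) / block.x));

    for (int i = 0; i < NSTREAMS; i++) {
        size_t off = (size_t)i * chunk_bytes;
        cudaMemcpyAsync(d_buf + off, h_buf + off, chunk_bytes,
                        cudaMemcpyHostToDevice, streams[i]);
        VecCryptChunk<<<grid, block, 0, streams[i]>>>(d_buf + off, blocks_per_chunk);
        cudaMemcpyAsync(h_buf + off, d_buf + off, chunk_bytes,
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < NSTREAMS; i++)
        cudaStreamDestroy(streams[i]);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
}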

The source code for all this is available at https://github.com/moinakg/salsa20_core_cuda. Note that it carries Nvidia’s SDK sample code license, since I started by modifying the vectorAdd sample code. Also note that this is not a complete Salsa20 encryption implementation; it just benchmarks the core Salsa20 GPU implementation. Among other things it does not do key generation and just uses a pre-set key value and a zero nonce.

Adding GPU processing to Pcompress

 

[Image: CUDA processing flow (Photo credit: Wikipedia)]

GPGPUs provide an intriguing opportunity to speed up some aspects of Pcompress. A GPU is essentially a large cluster of ALUs with access to a few different types of high-speed memory on the board. GPUs are suited to highly parallel workloads, especially the class of problems that can be termed embarrassingly parallel; Monte Carlo simulations are a classic example. However, many otherwise serial algorithms or pieces of logic can be converted into parallel forms with a little bit of effort.

There are a few places within Pcompress where GPUs can be of use:

  1. Parallel hashing. I have already implemented Merkle-style parallel hashing, but the approach currently uses only 4 threads via OpenMP. It is only used when compressing an entire file in a single segment, which is essentially a single-threaded operation with some pieces, like hashing and HMAC (and multi-threaded LZMA), parallelized via different approaches. With GPUs, parallel hashing can be used in all cases, but there is a slight problem: parallel hashing normally produces different hash values than the serial version, so I need to work out a way for the same underlying hashing approach to be used in both the serial and parallel cases and produce identical results. If one uses a GPU to generate data checksums on one machine, it cannot be assumed that every machine where the data is extracted will have a GPU! Changes to the hashing approach will also make current archives incompatible with future versions of Pcompress, so the current code paths will have to be retained for backward compatibility (see the sketch after this list).
  2. Using AES on the GPU. It is possible to speed up AES on the GPU, especially with the CTR mode that I am using. There is a GPU Gems article on this.
  3. Parallel data chunking for deduplication. This is possible but more complex to implement than the previous two items. There is a research paper on a system called Shredder that provides an approach to deduplication chunking on the GPU. My approach to chunking is quite novel and different from what is described in the Shredder paper, so I have some work to do here.
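To make the constraint in item 1 concrete, the shape I am after is something like the host-side sketch below: fixed-size segments hashed independently and then combined, so the result is identical whether the segment digests are computed serially, with OpenMP or on a GPU. This is hypothetical, not Pcompress code; hash_func() and SEG_SIZE are placeholders.

/* Hash fixed-size segments independently, then hash the concatenated segment
 * digests to get the final value. Because segment boundaries are fixed, the
 * result does not depend on how the per-segment digests were computed. */
#include <stdlib.h>

#define SEG_SIZE  (4UL * 1024 * 1024)   /* illustrative fixed segment size */
#define DIGEST_SZ 32                    /* digest size of the underlying hash */

void hash_func(const unsigned char *in, size_t len, unsigned char *out);  /* placeholder */

void merkle_style_hash(const unsigned char *buf, size_t len, unsigned char *out)
{
    size_t nsegs = (len + SEG_SIZE - 1) / SEG_SIZE;
    unsigned char *segdig = (unsigned char *)malloc(nsegs * DIGEST_SZ);

    /* Independent per-segment digests: this loop is the part that can be
     * parallelized in any way without changing the final result. */
    #pragma omp parallel for
    for (size_t i = 0; i < nsegs; i++) {
        size_t off = i * SEG_SIZE;
        size_t sz  = (off + SEG_SIZE <= len) ? SEG_SIZE : len - off;
        hash_func(buf + off, sz, segdig + i * DIGEST_SZ);
    }

    /* Combine the segment digests into the final value. */
    hash_func(segdig, nsegs * DIGEST_SZ, out);
    free(segdig);
}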

There are a few issues to deal with when programming GPGPUs, beyond the initial steep learning curve:

  1. GPUs are devices that sit on the PCI bus, so data needs to be transferred back and forth. This is the biggest stumbling block when dealing with GPUs: the computation to be performed must be large enough to offset the cost of data transfer. There are ways to hide this latency, like performing one computation while transferring the data for the next one, using pinned memory in the host computer’s RAM to speed up transfers, and transferring large blocks of data in one shot as opposed to many small transfers. The biggest gain comes from pipelining computation stages and overlapping compute and data transfer.
  2. Code on the GPU runs in an execution context that has hundreds of hardware threads, each of which runs the same code path but works on a different slice of data in memory. This is essentially a Single Instruction Multiple Data model (Nvidia calls its variant SIMT). Accesses to data by the different threads need to be ordered or, in other words, adjacent within a range to get maximum throughput. This is the coalesced access requirement. It is becoming less of an issue as GPGPUs evolve and newer, improved devices come to market.
  3. One needs to use a form of explicit caching via shared memory. This is again improving with the introduction of L1/L2 caches in newer GPGPUs like the Nvidia Tesla C2XXX series.
  4. One has to worry about thread block and grid sizing. Some libraries like Thrust handle sizing internally and provide a high-level external API.

Pcompress has to remain modular. It needs to detect the presence of GPUs in the system and optionally allow using them. Since I will be using CUDA, this will also depend on the presence of the CUDA runtime and the Nvidia accelerated drivers.
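The detection part is straightforward with the CUDA runtime; a minimal sketch:

#include <cuda_runtime.h>

/* Minimal runtime check: is there at least one CUDA-capable device we can use?
 * If not, the existing CPU code paths would simply be used. */
int have_cuda_gpu(void)
{
    int count = 0;

    if (cudaGetDeviceCount(&count) != cudaSuccess)
        return 0;               /* no driver / runtime, or it is broken */
    return count > 0;           /* at least one CUDA-capable device present */
}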

Finally, the big question is how all of this will scale. Will using GPUs allow faster processing in Pcompress compared to modern Sandy Bridge and Piledriver CPUs with their vector units? Only experimentation will tell.