Fast Salsa20 crypto using CUDA – Part #2

[Image: CUDA PTX Rip (Photo credit: Travis Goodspeed)]

Update: Added results of testing using a Tesla device. See the end of this article.

I had identified a few follow-up items from my initial experiments using CUDA to run Salsa20 on the GPU. I have since done a fair amount of work on the topic, trying out various combinations of optimizations, and have been able to improve performance.

There are a number of techniques one can employ to improve GPU throughput. However, as will be evident shortly, the challenge with Salsa20 is that it is bound by device/memory transfer throughput rather than by compute.

  1. The first thing I looked at was streamlining the code. Since I had lifted and tweaked the reference C code for the GPU, there were a few inefficiencies. There was no need to use local buffers to hold the CTR keystream data for each block. Everything is computed in 16 32-bit variables which are in turn held entirely in registers, as a GPU has tens of thousands of registers per multiprocessor. So it made sense to XOR the variables directly with the plaintext in global memory. This, along with a couple of other minor tweaks, got rid of shared memory altogether (a minimal kernel sketch appears after this list).
  2. Use a pinned buffer to hold the data transferred to and from the device. This causes a significant reduction in data transfer time.
  3. I used CUDA streams to overlap data transfer and compute, in a loop with the ability to change the stream count and experiment. I found that 16 streams over approximately 256MB of pinned buffer gave good performance. This translates to overlapped PCI transfers and kernel compute chunks of approximately 16MB each.
  4. The last optimization was to avoid transferring plaintext to the GPU at all: generate the CTR-mode keystream blocks on the GPU, transfer only those back to the host, and perform the XOR using optimized multithreaded host code. What I did was to create one host thread per CUDA stream, which waits for the asynchronous CUDA operations to complete and then performs the XOR. This approach provided by far the biggest speedup since it cuts the PCI transfer requirement in half, and that transfer is then overlapped with compute to hide the latency. In addition, the per-stream host thread does host-side compute in parallel with GPU-side compute in the other streams (a host-side sketch appears a little further below).
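
As an illustration of item 1, here is a minimal sketch of such a kernel. It is not the exact code from the repository; the thread-to-block mapping and the init_state argument are assumptions. Each thread produces one 64-byte keystream block entirely in registers and XORs it straight into the plaintext in global memory:

#include <stdint.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* One thread per 64-byte Salsa20 block. init_state holds the constants, key and
 * nonce words; words 8 and 9 are the CTR block counter, patched per thread.
 * The whole state stays in registers and the keystream is XORed directly over
 * the plaintext in global memory (little-endian word layout assumed). */
__global__ void salsa20_ctr_xor(uint32_t *buf, const uint32_t *init_state,
                                uint64_t nblocks)
{
    uint64_t blk = blockIdx.x * (uint64_t)blockDim.x + threadIdx.x;
    if (blk >= nblocks) return;

    uint32_t x[16], s[16];
#pragma unroll
    for (int i = 0; i < 16; i++) s[i] = init_state[i];
    s[8] = (uint32_t)blk;           /* low word of the 64-bit block counter */
    s[9] = (uint32_t)(blk >> 32);   /* high word */
#pragma unroll
    for (int i = 0; i < 16; i++) x[i] = s[i];

    /* 20 rounds = 10 double rounds (column round then row round). */
    for (int i = 0; i < 10; i++) {
        x[ 4] ^= ROTL32(x[ 0] + x[12],  7); x[ 8] ^= ROTL32(x[ 4] + x[ 0],  9);
        x[12] ^= ROTL32(x[ 8] + x[ 4], 13); x[ 0] ^= ROTL32(x[12] + x[ 8], 18);
        x[ 9] ^= ROTL32(x[ 5] + x[ 1],  7); x[13] ^= ROTL32(x[ 9] + x[ 5],  9);
        x[ 1] ^= ROTL32(x[13] + x[ 9], 13); x[ 5] ^= ROTL32(x[ 1] + x[13], 18);
        x[14] ^= ROTL32(x[10] + x[ 6],  7); x[ 2] ^= ROTL32(x[14] + x[10],  9);
        x[ 6] ^= ROTL32(x[ 2] + x[14], 13); x[10] ^= ROTL32(x[ 6] + x[ 2], 18);
        x[ 3] ^= ROTL32(x[15] + x[11],  7); x[ 7] ^= ROTL32(x[ 3] + x[15],  9);
        x[11] ^= ROTL32(x[ 7] + x[ 3], 13); x[15] ^= ROTL32(x[11] + x[ 7], 18);
        x[ 1] ^= ROTL32(x[ 0] + x[ 3],  7); x[ 2] ^= ROTL32(x[ 1] + x[ 0],  9);
        x[ 3] ^= ROTL32(x[ 2] + x[ 1], 13); x[ 0] ^= ROTL32(x[ 3] + x[ 2], 18);
        x[ 6] ^= ROTL32(x[ 5] + x[ 4],  7); x[ 7] ^= ROTL32(x[ 6] + x[ 5],  9);
        x[ 4] ^= ROTL32(x[ 7] + x[ 6], 13); x[ 5] ^= ROTL32(x[ 4] + x[ 7], 18);
        x[11] ^= ROTL32(x[10] + x[ 9],  7); x[ 8] ^= ROTL32(x[11] + x[10],  9);
        x[ 9] ^= ROTL32(x[ 8] + x[11], 13); x[10] ^= ROTL32(x[ 9] + x[ 8], 18);
        x[12] ^= ROTL32(x[15] + x[14],  7); x[13] ^= ROTL32(x[12] + x[15],  9);
        x[14] ^= ROTL32(x[13] + x[12], 13); x[15] ^= ROTL32(x[14] + x[13], 18);
    }

    /* Feed-forward, then XOR the keystream block straight over the plaintext. */
    uint32_t *p = buf + blk * 16;
#pragma unroll
    for (int i = 0; i < 16; i++) p[i] ^= x[i] + s[i];
}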

All of this resulted in a good speedup compared to the initial version and started giving good results compared to the optimized CPU code. However, if one actually measures on a Tesla device, the achieved throughput is still a fraction of the native GPU bandwidth on the higher-end cards. I created three implementations. The first is a simple GPU-based version without streams, which was my initial experiment slightly improved. The second is a streams-based overlapped version where the plaintext is transferred to the GPU and optimizations #1 to #3 are used. The final one is the version that transfers only the keystream from the GPU to the host and does the XOR on the CPU. The source code for all three variants is available at https://github.com/moinakg/salsa20_core_cuda.
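
In simplified form, the host side of that third variant looks roughly like the sketch below. It is not the exact code from the repository: salsa20_keystream is a hypothetical keystream-only kernel, error checking is omitted, and the buffer length is assumed to split evenly into per-stream chunks that are multiples of the 64-byte block size.

#include <pthread.h>
#include <stdint.h>
#include <cuda_runtime.h>

#define NSTREAMS 16

__global__ void salsa20_keystream(uint32_t *ks, const uint32_t *init_state,
                                  uint64_t first_block, uint64_t nblocks);

typedef struct {
    cudaStream_t stream;
    uint8_t     *plaintext;   /* host chunk to encrypt in place */
    uint8_t     *keystream;   /* pinned host chunk the GPU copies into */
    size_t       len;
} chunk_job_t;

/* One worker thread per stream: wait for the async GPU work, then do the XOR. */
static void *xor_worker(void *arg)
{
    chunk_job_t *job = (chunk_job_t *)arg;
    cudaStreamSynchronize(job->stream);
    for (size_t i = 0; i < job->len; i++)
        job->plaintext[i] ^= job->keystream[i];
    return NULL;
}

void encrypt_buffer(uint8_t *plaintext, size_t len, const uint32_t *d_init_state)
{
    cudaStream_t streams[NSTREAMS];
    pthread_t    threads[NSTREAMS];
    chunk_job_t  jobs[NSTREAMS];
    uint8_t     *h_ks, *d_ks;
    size_t       chunk = len / NSTREAMS;   /* ~16MB chunks for a 256MB buffer */

    /* Pinned host memory makes the async device-to-host copies much faster. */
    cudaHostAlloc((void **)&h_ks, len, cudaHostAllocDefault);
    cudaMalloc((void **)&d_ks, len);

    for (int i = 0; i < NSTREAMS; i++) {
        size_t   off     = (size_t)i * chunk;
        uint64_t nblocks = chunk / 64;

        cudaStreamCreate(&streams[i]);

        /* Generate only the keystream on the GPU and copy only that back. */
        salsa20_keystream<<<(unsigned)((nblocks + 127) / 128), 128, 0, streams[i]>>>(
            (uint32_t *)(d_ks + off), d_init_state, off / 64, nblocks);
        cudaMemcpyAsync(h_ks + off, d_ks + off, chunk,
                        cudaMemcpyDeviceToHost, streams[i]);

        /* The per-stream host thread overlaps the CPU XOR with GPU work in
         * the other streams. */
        jobs[i].stream    = streams[i];
        jobs[i].plaintext = plaintext + off;
        jobs[i].keystream = h_ks + off;
        jobs[i].len       = chunk;
        pthread_create(&threads[i], NULL, xor_worker, &jobs[i]);
    }

    for (int i = 0; i < NSTREAMS; i++) {
        pthread_join(threads[i], NULL);
        cudaStreamDestroy(streams[i]);
    }
    cudaFree(d_ks);
    cudaFreeHost(h_ks);
}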

As is clear from running the code, a workload that is transfer-limited rather than compute-limited requires quite a bit of tuning to show good results on GPGPUs, and is not an entirely suitable workload for these devices even after all the tuning and optimizations. The GPU really shines where heavy compute can be massively parallelized.

So the next step in my experimentation is to include the message authentication (MAC) computation on the GPU along with encryption. In practice, plain encryption without a MAC is not suitable, and computing a MAC introduces additional compute overhead, so the GPU should begin to show its full capabilities in that case. It would also be interesting to add a comparison with multithreaded CPU code that runs the optimized CPU version with the buffer split across multiple threads (a sketch of such a split is shown below).
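
Such a split is easy in CTR mode because each chunk only needs its own starting block counter. A rough sketch of what that comparison could look like, here using OpenMP (salsa20_encrypt_opt stands in for the optimized CPU routine and its signature is an assumption):

#include <stdint.h>
#include <stddef.h>

void salsa20_encrypt_opt(uint8_t *buf, size_t len, const uint8_t key[32],
                         const uint8_t nonce[8], uint64_t start_block);

void cpu_encrypt_mt(uint8_t *buf, size_t len, const uint8_t key[32],
                    const uint8_t nonce[8], int nthreads)
{
    /* Chunks are multiples of the 64-byte Salsa20 block so every thread can
     * seed its own CTR counter independently. */
    size_t chunk = ((len / nthreads) / 64) * 64;

    #pragma omp parallel for num_threads(nthreads)
    for (int t = 0; t < nthreads; t++) {
        size_t off = (size_t)t * chunk;
        size_t n   = (t == nthreads - 1) ? len - off : chunk;
        salsa20_encrypt_opt(buf + off, n, key, nonce, off / 64);
    }
}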

Shown below are the results from the GT 230M on my laptop for all three variants:

Version 1
==============================================================
./vecCrypt
Salsa20 Vector Encryption
Initializing input data
Allocating device buffer
Copying buffer to device
Invoking kernel
Copying buffer back to host memory
Computing reference code on CPU
Verifying result
Computing optimized code on CPU
Data transfer time (pinned mem)         : 174.018162 msec
GPU computation time                    : 196.317783 msec
GPU throughput                          : 1243.599134 MB/s
GPU throughput including naive transfer : 659.240963 MB/s
CPU computation time (reference code)   : 1538.825479 msec
CPU throughput (reference code)         : 158.653875 MB/s
CPU computation time (optimized code)   : 469.963965 msec
CPU throughput (optimized code)         : 519.487968 MB/s
PASSED

Version 2
==============================================================
./vecCrypt_strm
Salsa20 Vector Encryption using CUDA streams
Initializing input data
Allocating device buffer
Starting GPU Calls
Computing reference code on CPU
Verifying result
Computing optimized code on CPU
Data transfer was pinned
GPU computation time (with transfer)    : 261.696066 msec
GPU throughput (with transfer)          : 932.916680 MB/s
CPU computation time (reference code)   : 1538.681007 msec
CPU throughput (reference code)         : 158.668771 MB/s
CPU computation time (optimized code)   : 469.699561 msec
CPU throughput (optimized code)         : 519.780399 MB/s
PASSED

Version 3
==============================================================
./vecCrypt_strm_cpuxor
Salsa20 Vector Encryption using CUDA streams and multi-threaded XOR on CPU
Initializing input data
Allocating device buffer
Starting GPU Calls
Computing reference code on CPU
Verifying result
Computing optimized code on CPU
Data transfer was pinned
GPU+CPU computation time (with transfer): 227.668396 msec
GPU+CPU throughput (with transfer)      : 1072.351847 MB/s
CPU computation time (reference code)   : 1540.072748 msec
CPU throughput (reference code)         : 158.525385 MB/s
CPU computation time (optimized code)   : 470.163246 msec
CPU throughput (optimized code)         : 519.267780 MB/s
PASSED

Results on Tesla
Wanting to check all this on a GPGPU that matters, I decided to give Amazon Web Services a try. One can get GPGPU cluster instances on AWS in the US East (Virginia) and EU (Ireland) regions, and they cost just $2.10 per hour to run. I only needed one for 30 minutes to set up and check the performance; that is less than what I’d pay for a snack at Cafe Coffee Day. In addition, I actually wanted to try out AWS since I had never used it before. As a first-time user I found it fairly painless, though it took me a while to figure out how to get a GPGPU box. It is a Xen VM instance with dual-GPU pass-through.

Configuration

Processor: Xeon X5570 @ 2.93 GHz, 8 Cores
GPU: Tesla M2050 with 2GB global memory and 448 CUDA cores.

Results

Version 1
==============================================================
./vecCrypt
Salsa20 Vector Encryption
Initializing input data
Allocating device buffer
Copying buffer to device
Invoking kernel
Copying buffer back to host memory
Computing reference code on CPU
Verifying result
Computing optimized code on CPU
Data transfer time (pinned mem)         : 82.574656 msec
GPU computation time                    : 15.681968 msec
GPU throughput                          : 15568.238948 MB/s
GPU throughput including naive transfer : 2484.724338 MB/s
CPU computation time (reference code)   : 1265.675568 msec
CPU throughput (reference code)         : 192.893528 MB/s
CPU computation time (optimized code)   : 339.388304 msec
CPU throughput (optimized code)         : 719.354857 MB/s
PASSED

Version 2
==============================================================
./vecCrypt_strm
Salsa20 Vector Encryption using CUDA streams
Initializing input data
Allocating device buffer
Starting GPU Calls
Computing reference code on CPU
Verifying result
Computing optimized code on CPU
Data transfer was pinned
GPU computation time (with transfer)    : 52.542304 msec
GPU throughput (with transfer)          : 4646.553471 MB/s
CPU computation time (reference code)   : 1263.651728 msec
CPU throughput (reference code)         : 193.202462 MB/s
CPU computation time (optimized code)   : 342.387216 msec
CPU throughput (optimized code)         : 713.054149 MB/s
PASSED

Version 3
==============================================================
./vecCrypt_strm_cpuxor                                      
Salsa20 Vector Encryption using CUDA streams and multi-threaded XOR on CPU
Initializing input data
Allocating device buffer
Starting GPU Calls
Computing reference code on CPU
Verifying result
Computing optimized code on CPU
Data transfer was pinned
GPU+CPU computation time (with transfer): 48.651936 msec
GPU+CPU throughput (with transfer)      : 5018.107090 MB/s
CPU computation time (reference code)   : 1256.252896 msec
CPU throughput (reference code)         : 194.340348 MB/s
CPU computation time (optimized code)   : 335.179280 msec
CPU throughput (optimized code)         : 728.388178 MB/s
PASSED

The Tesla results are quite interesting compared to the tiny GPU in my laptop. The raw GPU throughput of the kernel from “Version 1” is an astounding 15 GB/s. However, once PCI transfer requirements come into the picture, we come rapidly down to the ground from the clouds. The overlapped transfer and compute in “Version 2” shows good results, helped also by the two copy engines that can handle two PCI transfers at a time. My code detects CUDA compute capability 2.0 or greater and issues the device-to-host copies in a different way to take advantage of this (see the sketch below). Once again, “Version 3” is the fastest option since the PCI transfer requirement is cut in half.
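
The check itself is simple; a minimal sketch of the idea (the exact logic in my code may differ). Fermi-class parts such as the M2050 report two asynchronous copy engines, so an upload and a download can be in flight at the same time when issued in separate streams:

#include <cuda_runtime.h>

/* Returns non-zero if the device can overlap host-to-device and
 * device-to-host copies (CC 2.0+ with two copy engines). */
int has_dual_copy_engines(int dev)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    return (prop.major >= 2 && prop.asyncEngineCount >= 2);
}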

Now, encryption at 5GB/s is not exactly bad, but given the amount of hardware here one would surely want better differentiation with respect to the CPU, especially when a single CPU thread delivers 728 MB/s. It will be interesting to look at the latest Kepler K20 GPGPUs on a Sandy Bridge box with PCIe Gen3. However, I feel the real differentiation will start to materialize when we throw in extra compute requirements in the form of a MAC/HMAC.

Finally, there is AES of course, and several ports of AES to CUDA already exist. I want to look at that piece as well, later.
