# New approach to secure Deduplication: DupLESS

Came across this interesting paper from the recently concluded 22nd  USENIX Security Symposium: https://www.usenix.org/conference/usenixsecurity13/encryption-deduplicated-storage-dupless. The System is called DupLESS.

Typically, mixing encryption and deduplication does not yield good results. As each user uses their own key to encrypt, the resulting ciphertexts are different even if two users are encrypting the same files. So deduplication becomes impossible. Message Locked Encryption was mooted to work around this. To put it simply this encrypts each chunk of data using it’s cryptographic hash as the key. So two identical chunks will produce identical ciphertexts and and can still be deduplicated. However this leaks information and it is possible to brute force plaintexts, if data is not thoroughly unpredictable. Also, as another example, it is possible for privileged users having access to the storage server to store files and check their deduplication with other user’s data, thereby getting an idea of other user’s contents even if they are encrypted.

The DupLESS system above introduces a Key Server into the picture to perform authentication and then serve chunk encryption keys in a secure manner. From my understanding this means that all users authenticating to the same key server will be able to deduplicate data amongst themselves. An organization, using a cloud storage service, which does deduplication at the back-end will be able to deduplicate data among it’s users by using a local secured key server. This will prevent the storage provider or any external privileged user from gleaning any information about the data. A trusted third party can also provide a key service that can be shared among groups or categories of users, while not allowing the storage provider access to the key service. Neither the key server nor the storage service can glean any information about the plaintext data.

Very interesting stuff. The source code and a bunch of background is available at this link: http://cseweb.ucsd.edu/users/skeelvee/dupless/

# Fast Salsa20 crypto using CUDA – Part #2

CUDA PTX Rip (Photo credit: Travis Goodspeed)

I had identified a few follow-up items from my initial experiments using CUDA to run Salsa20 on the GPU. I have since done a bunch of work on the topic trying out various combinations of optimizations and have been able to get increased performance.

There are a bunch of techniques that one can employ to improve GPU throughput. However as will be evident shortly the challenge with Salsa20 is that it is device/memory throughput bound rather than compute.

1. The first thing I looked is to streamline the code better. Since I had lifted and tweaked the reference C code for the GPU there were a few inefficiencies. There was no need to use local buffers to hold CTR keystream data for the block. Everything is computed in 16 32-bit variables which are in turn held entirely in registers as a GPU has tens of thousands of registers per block. So It made sense to directly XOR the variables with plaintext in global memory. This along with a couple of other minor tweaks got rid of shared memory altogether.
2. Use a pinned buffer to hold data transferred to a from the device. This causes a significant reduction in the data transfer time.
3. I used CUDA Streams to overlap data transfer and compute in a loop with the ability to change the value and experiment. I found that 16 streams for approximately 256MB of pinned buffer was giving good performance. This translates to overlapped PCI data transfer and kernel compute sizes of approx 16MB each.
4. The last optimization was to avoid transferring plaintext to the GPU. That is to generate the CTR mode keystream blocks on the GPU transfer them back to the host and perform the XOR using optimized multithreaded host code. What I did was to create one host thread per CUDA stream which would wait for the async CUDA operations to complete and then perform the XOR. This approach provided the maximum speedup by an order of magnitude since it cuts the PCI transfer requirement in half which is then overlapped with compute to hide the latency. In addition the per-stream host thread does host side compute in parallel with GPU-side compute in other streams.

All of this resulted in a good speedup compared to the initial version and started giving good results as compared to the optimized CPU code. However if one actually measures with a Tesla device the actual throughput is still a fraction of the native GPU bandwidth on the higher end cards. I created 3 implementations one is a simple GPU based one simple version without streams which was my initial experiment but slightly improved. One is a streams based overlapped version where plaintext is transferred to the GPU and optimizations from #1 to #3 are used. The final one is the version that only transfers keystream from GPU to host and does XOR on the CPU. The source code for all the tree variants is available at https://github.com/moinakg/salsa20_core_cuda.

As is clear by running the code, low compute bandwidth limited code requires quite a bit of tuning to show good results with GPGPUs and is not an entirely suitable workload for these devices even after all the tuning and optimizations. The GPU really shines where heavy compute can be heavily parallelized

So the next step in my experimentation is include the Message Authentication computation on the GPU along with encryption. In practice simple encryption without a MAC is not suitable and computing a MAC introduces additional compute overheads so the GPU should begin to show it’s full capabilities in that case. Consider what would happen if, In this example, we also added a comparison with multithreaded code that run the optimized CPU version with the buffer split into multiple threads ?

Shown below are the results from the GT 230M on my laptop for all the three variants:

``````Version 1
==============================================================
./vecCrypt
Salsa20 Vector Encryption
Initializing input data
Allocating device buffer
Copying buffer to device
Invoking kernel
Copying buffer back to host memory
Computing reference code on CPU
Verifying result
Computing optimized code on CPU
Data transfer time (pinned mem)         : 174.018162 msec
GPU computation time                    : 196.317783 msec
GPU throughput                          : 1243.599134 MB/s
GPU throughput including naive transfer : 659.240963 MB/s
CPU computation time (reference code)   : 1538.825479 msec
CPU throughput (reference code)         : 158.653875 MB/s
CPU computation time (optimized code)   : 469.963965 msec
CPU throughput (optimized code)         : 519.487968 MB/s
PASSED

Version 2
==============================================================
./vecCrypt_strm
Salsa20 Vector Encryption using CUDA streams
Initializing input data
Allocating device buffer
Starting GPU Calls
Computing reference code on CPU
Verifying result
Computing optimized code on CPU
Data transfer was pinned
GPU computation time (with transfer)    : 261.696066 msec
GPU throughput (with transfer)          : 932.916680 MB/s
CPU computation time (reference code)   : 1538.681007 msec
CPU throughput (reference code)         : 158.668771 MB/s
CPU computation time (optimized code)   : 469.699561 msec
CPU throughput (optimized code)         : 519.780399 MB/s
PASSED

Version 3
==============================================================
./vecCrypt_strm_cpuxor
Salsa20 Vector Encryption using CUDA streams and multi-threaded XOR on CPU
Initializing input data
Allocating device buffer
Starting GPU Calls
Computing reference code on CPU
Verifying result
Computing optimized code on CPU
Data transfer was pinned
GPU+CPU computation time (with transfer): 227.668396 msec
GPU+CPU throughput (with transfer)      : 1072.351847 MB/s
CPU computation time (reference code)   : 1540.072748 msec
CPU throughput (reference code)         : 158.525385 MB/s
CPU computation time (optimized code)   : 470.163246 msec
CPU throughput (optimized code)         : 519.267780 MB/s
PASSED
``````

Results on Tesla
Wanting to check all this on a GPGPU that matters, I decided to give Amazon Web Services a try. One can get GPGPU Cluster instances on AWS in the US East (Virginia) and EU (Ireland) regions and they cost just \$2.10 per hour to run. I only needed it for 30 mins to setup and check the performance. That is less than what I’d pay for a snack at Cafe Coffee Day. In addition I actually wanted to try out AWS as I have never used it before. As a first time user it was fairly painless, however it took me a while to figure how I can get a GPGPU box. It is a Xen VM instance with dual-GPU pass-through.

Configuration

Processor: Xeon X5570 @ 2.93 GHz, 8 Cores
GPU: Tesla M2050 with 2GB global memory and 448 CUDA cores.

Results

``````Version 1
==============================================================
./vecCrypt
Salsa20 Vector Encryption
Initializing input data
Allocating device buffer
Copying buffer to device
Invoking kernel
Copying buffer back to host memory
Computing reference code on CPU
Verifying result
Computing optimized code on CPU
Data transfer time (pinned mem)         : 82.574656 msec
GPU computation time                    : 15.681968 msec
GPU throughput                          : 15568.238948 MB/s
GPU throughput including naive transfer : 2484.724338 MB/s
CPU computation time (reference code)   : 1265.675568 msec
CPU throughput (reference code)         : 192.893528 MB/s
CPU computation time (optimized code)   : 339.388304 msec
CPU throughput (optimized code)         : 719.354857 MB/s
PASSED

Version 2
==============================================================
./vecCrypt_strm
Salsa20 Vector Encryption using CUDA streams
Initializing input data
Allocating device buffer
Starting GPU Calls
Computing reference code on CPU
Verifying result
Computing optimized code on CPU
Data transfer was pinned
GPU computation time (with transfer)    : 52.542304 msec
GPU throughput (with transfer)          : 4646.553471 MB/s
CPU computation time (reference code)   : 1263.651728 msec
CPU throughput (reference code)         : 193.202462 MB/s
CPU computation time (optimized code)   : 342.387216 msec
CPU throughput (optimized code)         : 713.054149 MB/s
PASSED

Version 3
==============================================================
./vecCrypt_strm_cpuxor
Salsa20 Vector Encryption using CUDA streams and multi-threaded XOR on CPU
Initializing input data
Allocating device buffer
Starting GPU Calls
Computing reference code on CPU
Verifying result
Computing optimized code on CPU
Data transfer was pinned
GPU+CPU computation time (with transfer): 48.651936 msec
GPU+CPU throughput (with transfer)      : 5018.107090 MB/s
CPU computation time (reference code)   : 1256.252896 msec
CPU throughput (reference code)         : 194.340348 MB/s
CPU computation time (optimized code)   : 335.179280 msec
CPU throughput (optimized code)         : 728.388178 MB/s
PASSED
``````

The Tesla results are fairly interesting as compared to the tinny GPU on my laptop. The raw GPU throughput of the kernel from “Version 1” is an astounding 15 GB/s. However once PCI transfer requirements come into the picture we come rapidly down to ground from the clouds. The overlapped transfer and compute in “Version 2” shows good results contributed also by the 2 copy engines than can handle 2 PCI transfers at a time. My code is written to detect CUDA Capability 2.0 or greater and issue device to host copies in a different way. Once again “Version 3” is the fastest option since PCI transfer requirement is cut in half.

Now encryption at 5GB/s is not exactly bad but given the amount of hardware here one can surely want a better differentiation with respect to the CPU especially when a single CPU thread is delivering 728 MB/s. It will be interesting to look at the latest Kepler K20 GPGPUs on a sandy bridge box with PCIe Gen3. However I feel the real differentiation will start to materialize when we throw in extra compute requirements in terms of a MAC/HMAC.

Finally AES is there of course and several ports of AES to CUDA are already around. I want to look at that piece as well, later.

# Fast Salsa20 crypto using CUDA

English: NVIDIA GeForce 6600GT AGP with GPU Deutsch: NVIDIA GeForce 6600GT AGP ohne Kühlkörper mit GPU und AGP-Wandlerchip (Photo credit: Wikipedia)

I was playing with some GPU stuff for the past week and decided to try out some crypto on a GPU. There is existing code that ports various hash functions and AES encryption to GPU and there is one port of eSTREAM Ciphers including Salsa20 to GPU (See references) that I found later. However I wanted to experiment myself in any case. By GPU here I mean specifically Nvidia CUDA. AMD supports OpenCL, has a Stream SDK and also has APUs. It will be interesting to try those out but that is for another day.

Salsa20 is designed to provide very fast software encryption performance without compromising on security aspects even when compared to AES. Optimized implementations exist for different platforms like x86 and ARM. The x86 code uses SSSE3 vector instructions.

I have low-end GPU at my disposal on my laptop. It is a Geforce GT 230M with 1GB onboard RAM and 48 CUDA cores (6 Multiprocessors)  clocked at 1.1 GHz. It only supports CUDA capability 1.2.

For the GPU port I started with the reference C code and focused on the CTR mode Salsa20 since it is embarrassingly parallel. Each block can be processed independently of the other. Very soon I had a working kernel function that mapped one Salsa20 64-byte block to one CUDA thread. It worked but was only 32% faster than the reference C implementation. This is hardly anything. The SSSE3 optimized version is about 245% faster than the reference code. Clearly I was not using the GPU properly as even this low-end GPU will not be so bad. The initial metrics looked like the following:

``````===== 1st Try =====
Vector Encryption
Initializing input data
Allocating device buffer
Copying buffer to device
Invoking kernel
Copying buffer back to host memory
Verifying result
Data transfer time    : 268.483002 msec
GPU computation time  : 1093.133057 msec
CPU computation time  : 1628.641448 msec
PASSED
``````

I was using a 244MB buffer with 4000000 Salsa20 blocks of 64-byte each. I am only considering whole blocks for the GPU since it is easier to process the last leftover bytes on the CPU.

I started reading the CUDA C best practices guide, and various other related presentations. Then I spent some time with the Nsight and nvprof utilities and a few things became quite apparent:

1. Having local arrays in the function stack meant that they will be in local memory which is relatively slow to access. One needs to use shared memory for this purpose.
2. I was accessing global memory byte by byte in a loop when doing the XORs. This is very bad for memory bandwidth.
3. Processing only one block per CUDA thread wastes compute capability since Salsa20 compute overhead is extremely low compared to the memory access. the whole process is more memory access bound rather than compute bound.

So as a next attempt I declared the keystream and input counter blocks as shared memory contexts. I also changed all keystream and data byte accesses to 32-bit integer accesses. In addition I got rid of most of the explicit little-endian load-store functions from the reference code. Nvidia GPUs are little-endian since they need to be able to directly process x86 and ARM host data which are also little-endian platforms. After all this I got some improvement but it was still not great:

``````===== 2nd Try =====
Vector Encryption
Initializing input data
Allocating device buffer
Copying buffer to device
Invoking kernel
Copying buffer back to host memory
Verifying result
Data transfer time    : 219.335999 msec
GPU computation time  : 797.606018 msec
CPU computation time  : 1570.171277 msec
PASSED
``````

As you can see things improved but it appeared that I was still wasting GPU bandwidth and compute capabilities. I started to dig deeper using the nvprof utility to query the various performance counters. My low-end GPU does not have too many counters but fortunately it does provide some of the most critical ones. One counter called ‘warp_serialize’ looked quite bad:

``````======== NVPROF is profiling vecCrypt...
======== Command: vecCrypt
[vecCrypt] starting...

Vector Encryption
Initializing input data
Allocating device buffer
Copying buffer to device
Invoking kernel
Copying buffer back to host memory
Verifying result
Data transfer time    : 219.335999 msec
GPU computation time  : 797.606018 msec
CPU computation time  : 1570.171277 msec
[vecCrypt] test results...
PASSED

> exiting in 3 seconds: 3...2...1...done!======== Profiling result:
Invocations       Avg       Min       Max  Event Name
Device 0
Kernel: VecCrypt(unsigned char*, unsigned int, unsigned long)
1 182462803 182462803 182462803  warp_serialize
``````

The counter gives the number of times a thread-warp was serialized. Each multiprocessor schedules threads in terms of small groups called warps. Thread serialization within warps means that the threads were executed serially rather than in parallel. Such serialization can occur if kernel functions have divergent branches in the code due to “if ()” statements. It can also happen due to shared memory bank conflicts. Since I did not have any divergent “if ()” conditions in my code, bank conflicts was the culprit.

In fact I was storing and accessing the per-thread keystream and input blocks in shared memory serially in traditional C row-major order as one would normally do on the CPU. This is bad on the GPU. Adjacent CUDA threads need to access adjacent shared memory locations to avoid bank conflicts. That is one has to access the memory in a strided fashion, or, in other words, column-major order. The following StackOverflow thread gives details on this particular issue: http://stackoverflow.com/questions/3841877/what-is-a-bank-conflict-doing-cuda-opencl-programming

This required a bunch of changes throughout, including a few changes in the core encryption function. However the results were worth it. I was also passing the key and the sigma string via global memory. I changed that to use constant memory since it is cached and is faster to access from multiple threads. In addition I made each thread process more than one Salsa20 block. With experimentation I found that 4 Salsa20 blocks per CUDA thread gave best results. There were a bunch of other trial and error experiments that I did but leaving that out, the final results with nvprof showed warp_serialize to have come down to zero. In addition there were no incoherent accesses to global memory.

I also made a final change to use pinned memory via “cudaMallocHost()” which reduced the data transfer overhead significantly. I added more timings and throughput calculations to the code and also included the optimized x86_64 assembly implementation for comparison. The final results look like the following on my laptop:

``````[vecCrypt] starting...Vector Encryption
Initializing input data
Allocating device buffer
Copying buffer to device
Invoking kernel
Copying buffer back to host memory
Verifying result
Data transfer time (pinned mem)         : 173.084999 msec
GPU computation time                    : 214.147995 msec
GPU throughput                          : 1140.055619 MB/s
GPU throughput including naive transfer : 630.474750 MB/s
CPU computation time (reference code)   : 1635.980728 msec
CPU throughput (reference code)         : 149.231969 MB/s
CPU computation time (optimized code)   : 474.828611 msec
CPU throughput (optimized code)         : 514.165784 MB/s
[vecCrypt] test results...
PASSED> exiting in 3 seconds: 3...2...1...done!
``````

As you can see the GPU version gives approximately 1.1 GB/s native throughput while including the PCI data transfer overhead it drops to 630 MB/s. The single-thread CPU performance is 514 MB/s. So a multi-thread CPU code using OpenMP for example will beat this low-end GPU. However the story will be a lot different if we use a Tesla M-series card with hundreds of CUDA cores. One thing to note here is that processing one or more Salsa20 blocks per CUDA thread means that we are accessing global memory in a strided fashion. Adjacent CUDA threads access memory location separated by at least 64-bytes. So fully-coalesced memory access does not occur and we are not leveraging the full memory bandwidth on the GPU. Once again the impact of this is lesser on the newer Tesla GPUs. Is it at all possible to do fully coalesced access ? I can’t think of a way. Maybe some CUDA experts can comment on this.

There are a couple of other things to do for a complete implementation:

1. Use CUDA streams to overlap data transfer and compute and hide the PCI latency.
2. Does it make sense to only compute the keystream blocks on the GPU and perform the XOR with the plaintext on the CPU side ? it requires only one PCI data transfer, however it may not be beneficial considering the high onboard memory bandwidth on the Tesla cards and the high parallelism. Some experimentation needs to be done

The source code for all this is available at https://github.com/moinakg/salsa20_core_cuda. Note that it has Nvidia’s SDK sample code license since I started by using the vectorAdd sample code and modifying it. Also note that this is not a complete Salsa20 encryption implementation, it just benchmarks the core Salsa20 GPU implementation. Among other things it does not do the key generation. Just uses a pre-set key value and a zero nonce.

# Authenticated Encryption and Pcompress

I had got HMAC into pcompress but was not too happy with the way I was using it to verify the chunk headers and the regular digest. The operation was thus: $HMAC(Digest(Plaintext), Header)||Encrypt(Compress(Plaintext))$. However this approach looked suspect to me. Eventually after reading up more stuff it turns out that Message Authentication and Encryption can be combined to make Authenticated Encryption. This article provides an excellent background to the entire thing: http://tonyarcieri.com/all-the-crypto-code-youve-ever-written-is-probably-broken.

In addition to that Wei Dai’s Cryptopp wiki also has good concise info: http://www.cryptopp.com/wiki/Authenticated_Encryption. Whoever thought that the ubiquitous SSH we take for granted is technically insecure! The most common recommended encryption mode for embedding message authentication with encryption is EAX mode. There is a more advanced and better performant OCB mode but it is patented. Now I had a choice of pulling out the EAX mode implementation from Cryptopp and using it with AES. However I also need to HMAC the headers without having to encrypt them and have an integrity checksum. Also importing and integrating the EAX code from Cryptopp is somewhat painstaking with lots of code changes. So I decided to follow IPSec.

IPSec encrypts and then computes a HMAC on the encrypted data. As We Dai points out in the wiki this approach is secure. I was already computing HMAC of the header so it was a simple matter to extend it to cover the entire encrypted data. In addition I had to avoid computing another digest of the plaintext as that is an unnecessary overhead. HMAC authenticates and also verifies data integrity.  So now the approach becomes: $HMAC(Header, Encrypt(Compress(Plaintext)))$. The HMAC is then inserted into the Header. The HMAC portion of the header is zeroed when actually computing the HMAC. Since the HMAC is computed over the compressed data, it needs to process a smaller dataset and benefits performance.

This change is already in the Pcompress git repo and will make it to the 1.1 release.

# Encryption in Pcompress

I just completed adding support for AES encryption in Pcompress – whew! On the surface it is simple, just encrypt and decrypt using a password provided by the user. However there are a myriad of security pieces around this that make actual implementations lengthy and involved. First and foremost password based encryption requires a symmetric encryption key to be generated from the user password. This step is fraught with problems. We need to make dictionary attacks reasonably hard even if the user provides a weak password. There is a NIST standard for this called PBKDF2 – Password Based Key Derivation Function 2. However given modern distributed computing techniques, botnets, GPGPUs etc it is still possible to do brute force dictionary attacks practically. The online cloud back service Tarsnap provides a unique algorithm called Scrypt that attempts to make this hard enough to be impractical due to high resource requirements for the key derivation process.

Using Scrypt is just one step in the process. One also needs to generate a random salt value as input to the key derivation function. One ideally needs to get high-quality random bytes from the operating system’s entropy pool. This can be done using the RAND_bytes() function in OpenSSL. However entropy may not always be available immediately. So if OpenSSL returns a not ready status then we need to use other alternatives. So second good quality option is to use “/dev/urandom“. This will be available on Linux and other Unixes. If for some reason this fails as well then we need another lower quality but reasonable fallback than the simple pseudorandom rand() function in the standard library. I looked around and picked up ideas from the approach used by PHP. The PHP idea also uses a Mersenne Twister which I will add in the future. All this results in a unique key being generated every time Pcompress is invoked even when using the same password. Decryption of course recovers the key used to encrypt the file.

Inputting passwords is another piece. OpenSSL has some old UI compat functions to do this but they seem to be artifacts retained only for backward compatibility. So I decided to roll my own based on what is being done by Python’s getpass module. Passwords can also be input via a file. This needs to be a writable temp file since Pcompress zeroes out the file after reading the password from it. The next thing is to generate an nonce value. Tarsnap uses sequential nonces that appear to start at 0. I wanted to use sequential nonces but starting at something other than 0. The nonce is 64-bit of which the top 32 bits can be selected at runtime and bottom 32 bits can be zero, incrementing by chunk number. I am using the salt value, monotonic clock value passed through PBKDF2 to get a 256-bit quantity. This I then pass through CRC64 to get a 64-bit quantity from which the top 32 bits are extracted.

Finally the salt and starting nonce are stored in the compressed file in clear. This means that the file format has changed a bit and is now version 4. Encryption is performed on the compressed data as compression takes out redundancies and makes for stronger encryption. It is also faster. In addition temporary values on the stack and in memory are cleared out with zeroes at every stage.

However after all these I do not yet have a per-chunk HMAC to verify the encrypted data has not been tampered with (In addition to the plaintext message digest for integrity). I plan to add it in the next few days.