# Pcompress updated to 1.1

I just put out a new release that implements proper HMAC computation as described in my previous post. I added a bunch of new tests that test error conditions and error handling for out of range parameters and corrupted archives. These tests caused problems to show up that went unnoticed earlier. These were mostly improper error handling situations. As you see testing may be boring but is absolutely necessary.

I also optimized the use of Zlib by switching to raw Zlib streams. Normal Zlib puts a header and computes an Adler32 checksum among other things. However Pcompress adds its own headers and computes cryptographic checksums like SHA and SKEIN which are light years ahead of the simple Adler32 in terms of data integrity verification. So the Adler32 computation overhead was unnecessary.

I had also tweaked the rolling hash used for deduplication while doing the chunking analysis. This produces better results and improved identity dedupe reduction. Finally I also added header checksums in plain non-crypto mode.

# Authenticated Encryption and Pcompress

I had got HMAC into pcompress but was not too happy with the way I was using it to verify the chunk headers and the regular digest. The operation was thus: $HMAC(Digest(Plaintext), Header)||Encrypt(Compress(Plaintext))$. However this approach looked suspect to me. Eventually after reading up more stuff it turns out that Message Authentication and Encryption can be combined to make Authenticated Encryption. This article provides an excellent background to the entire thing: http://tonyarcieri.com/all-the-crypto-code-youve-ever-written-is-probably-broken.

In addition to that Wei Dai’s Cryptopp wiki also has good concise info: http://www.cryptopp.com/wiki/Authenticated_Encryption. Whoever thought that the ubiquitous SSH we take for granted is technically insecure! The most common recommended encryption mode for embedding message authentication with encryption is EAX mode. There is a more advanced and better performant OCB mode but it is patented. Now I had a choice of pulling out the EAX mode implementation from Cryptopp and using it with AES. However I also need to HMAC the headers without having to encrypt them and have an integrity checksum. Also importing and integrating the EAX code from Cryptopp is somewhat painstaking with lots of code changes. So I decided to follow IPSec.

IPSec encrypts and then computes a HMAC on the encrypted data. As We Dai points out in the wiki this approach is secure. I was already computing HMAC of the header so it was a simple matter to extend it to cover the entire encrypted data. In addition I had to avoid computing another digest of the plaintext as that is an unnecessary overhead. HMAC authenticates and also verifies data integrity.  So now the approach becomes: $HMAC(Header, Encrypt(Compress(Plaintext)))$. The HMAC is then inserted into the Header. The HMAC portion of the header is zeroed when actually computing the HMAC. Since the HMAC is computed over the compressed data, it needs to process a smaller dataset and benefits performance.

This change is already in the Pcompress git repo and will make it to the 1.1 release.

# Encryption in Pcompress

I just completed adding support for AES encryption in Pcompress – whew! On the surface it is simple, just encrypt and decrypt using a password provided by the user. However there are a myriad of security pieces around this that make actual implementations lengthy and involved. First and foremost password based encryption requires a symmetric encryption key to be generated from the user password. This step is fraught with problems. We need to make dictionary attacks reasonably hard even if the user provides a weak password. There is a NIST standard for this called PBKDF2 – Password Based Key Derivation Function 2. However given modern distributed computing techniques, botnets, GPGPUs etc it is still possible to do brute force dictionary attacks practically. The online cloud back service Tarsnap provides a unique algorithm called Scrypt that attempts to make this hard enough to be impractical due to high resource requirements for the key derivation process.

Using Scrypt is just one step in the process. One also needs to generate a random salt value as input to the key derivation function. One ideally needs to get high-quality random bytes from the operating system’s entropy pool. This can be done using the RAND_bytes() function in OpenSSL. However entropy may not always be available immediately. So if OpenSSL returns a not ready status then we need to use other alternatives. So second good quality option is to use “/dev/urandom“. This will be available on Linux and other Unixes. If for some reason this fails as well then we need another lower quality but reasonable fallback than the simple pseudorandom rand() function in the standard library. I looked around and picked up ideas from the approach used by PHP. The PHP idea also uses a Mersenne Twister which I will add in the future. All this results in a unique key being generated every time Pcompress is invoked even when using the same password. Decryption of course recovers the key used to encrypt the file.

Inputting passwords is another piece. OpenSSL has some old UI compat functions to do this but they seem to be artifacts retained only for backward compatibility. So I decided to roll my own based on what is being done by Python’s getpass module. Passwords can also be input via a file. This needs to be a writable temp file since Pcompress zeroes out the file after reading the password from it. The next thing is to generate an nonce value. Tarsnap uses sequential nonces that appear to start at 0. I wanted to use sequential nonces but starting at something other than 0. The nonce is 64-bit of which the top 32 bits can be selected at runtime and bottom 32 bits can be zero, incrementing by chunk number. I am using the salt value, monotonic clock value passed through PBKDF2 to get a 256-bit quantity. This I then pass through CRC64 to get a 64-bit quantity from which the top 32 bits are extracted.

Finally the salt and starting nonce are stored in the compressed file in clear. This means that the file format has changed a bit and is now version 4. Encryption is performed on the compressed data as compression takes out redundancies and makes for stronger encryption. It is also faster. In addition temporary values on the stack and in memory are cleared out with zeroes at every stage.

However after all these I do not yet have a per-chunk HMAC to verify the encrypted data has not been tampered with (In addition to the plaintext message digest for integrity). I plan to add it in the next few days.