Adding GPU processing to Pcompress


English: CUDA processing flow

English: CUDA processing flow (Photo credit: Wikipedia)

GPGPUs provide an intriguing opportunity to speed up some aspects of Pcompress. Typically GPUs represent a large cluster of ALUs with access to a few different types of high-speed memory on the board. GPUs are typically suited for highly-parallel workloads, especially the class of problems that can be termed embarrassingly parallel. An example is Monte-Carlo simulations. However many otherwise serial algorithms or logic can be converted into parallel forms with a little bit of effort.

There are a few places within Pcompress where GPUs can be of use:

  1. Parallel hashing. I have already implemented a Merkle-style parallel hashing but the approach currently uses only 4 threads via OpenMP. This is only used when compressing an entire file in a single segment which is essentially a single-thread operation with some operations like hashing, HMAC (and multithread LZMA) parallelized via different approaches. With GPUs parallel hashing can be used in all cases, but there is a slight problem. Normally parallel hashing produces different hash values as compared to the serial version so I need to work out a way where the same underlying hashing approach is used in both serial and parallel cases so identical results are produced. If one uses GPUs to generate data checksums on one machine it cannot be assumed that every machine where the data is extracted back will have a GPU! Changes to the hashing approach will make current archives incompatible with future versions of Pcompress so current code paths will have to be retained for backward compatibility.
  2. Using AES on GPU. It is possible to speed up AES on the GPU, especially with the CTR mode that I am using. There is a GPU Gems article on this.
  3. Parallel data chunking for deduplication. This is possible but more complex to implement than the previous two items. There is a research paper on a system called Shredder that provides an approach to do data deduplication chunking on the GPU. My approach to chunking is quite novel and different than what is described in the Shredder paper. So I have to do some work here.

There are a few issues to deal with when programming GPGPUs other than the initial high learning curve:

  1. GPUs are devices that sit on the PCI bus, so data needs to be transferred to and fro. This is the biggest stumbling block when dealing with GPUs. The computation to be performed must be large enough to offset the cost of data transfer. There are other ways to hide the latency like performing one compute while transferring the data for the next computation to be done. Using pinned memory on the host computer’s RAM to speed up data transfer. Transferring large blocks of data in one shot as opposed to many small transfers. The biggest gain comes from pipelining computation stages and overlapping compute and data transfer.
  2. Code on the GPU runs in an execution context that has hundreds of hardware threads each of which runs the same code path but works on a different slice of data in memory. This is essentially Single Instruction Multiple Data Model (Nvidia calls it SIMT). The access to data by the different threads need to be ordered or, in other words, be adjacent in a range to get maximum throughput. This is the coalesced access requirement. This is becoming less of an issue as GPGPUs evolve and newer improved devices come to the market.
  3. Need to use a form of explicit caching via shared memory. This is again improving by the introduction of L1/L2 caches in newer GPGPUs like the Nvidia Tesla C2XXX series.
  4. Having to worry about Thread block and grid sizing. Some libraries like Thrust handle sizing internally and provide a high-level external API.

Pcompress has to remain modular. It needs to detect the presence of GPUs in a system and optionally allow using them. Since I will be using CUDA, it needs to depend on the presence of CUDA and the Nvidia accelerated drivers as well.

Finally the big questions will be how do all these scale? Will using GPUs allow faster processing in Pcompress as compared to the modern Sandy Bridge and Piledriver CPUs with vector units. Only experimentation will tell.



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s