Tag Archives: parallel programming

Nonblocking Algorithms for Multicore processors

Came across this awesome article on ACM Queue: http://queue.acm.org/detail.cfm?id=2492433. This is a very nice exposition of the topic with example code and detailed walk through of the effects of multiple access on shared state. The other interesting piece in this same context is of course transactional memory in Intel’s Haswell processor described here: http://www.realworldtech.com/haswell-tm/



Coordinated Parallelism using Semaphores

teamOne of the key features of my Pcompress utility is of course, parallelism. The ability to split and process data in parallel in multiple threads with all the threads doing virtually the same work. There is some limited variability depending on the nature of the work and the nature of the data. For example some data segments may not be compressible so they will have to be stored as-is. Whenever there are multiple threads there is typically some need for synchronization. There are of course scenarios where thread processing is completely independent and no synchronization is needed whatsoever. However in Pcompress that is not the case. There are a couple of cases where synchronization is needed:

  1. Ordering of input data segments. The data segments in the compressed file must be written in the same order as they were input otherwise data will be corrupt.
  2. With the Global Deduplication feature, access to a single chunk index must be serialized and ordered in the same sequence as they were input.

The second feature is a recent addition and also requires ordering since we want all duplicate chunk references to be backward references. That is in a data stream duplicate chunks point backwards to whole chunks at the head of the stream. So data segments containing the chunks must go through the index lookup in the same order as they were input. Rest of the processing like actual pre-compression stage, chunk splitting stage, compression stage, optional encryption stage and so on can work completely parallel without dependencies. The flow can be illustrated by the following diagram:


As you can notice there are 3 points where some form of  synchronization is necessary. The input, Index Lookup for global dedupe and final writer stage. In addition data ordering as per input has to be maintained for index lookup and when writing the output data.

There are several ways of achieving this flow, the most common techniques are using a thread pool and some queues. Perhaps the simplest approach is to use barrier synchronization. We can put one barrier prior to the index lookup and another barrier prior to the writer. In each case a simple loop takes care of the serial processing maintaining the proper data ordering. However both the approaches have drawbacks. Using queues and thread pools have resource overheads for the data structures and locking.  Barriers are not strictly needed here and using barriers mean that some amount of potential concurrency is lost waiting at the barrier. The time spent waiting at the barrier is the time taken for the slowest or typically the last thread to complete processing. One of the intentions I had was to have as much overlapped processing as possible. if one thread is accessing the index and another thread does not need it, then, it should be allowed to proceed.

So I played around with POSIX semaphores. Using semaphores in a producer-consumer setup is a common approach. However Pcompress threads are a little more involved than simple producers and consumers. A bunch of semaphores are needed to signal and control the execution flow of the threads. After some experimentation I came up with the following approach.

A dispatcher thread reads data segments from the input file and schedules them to worker threads in a round robin fashion and the writer thread reads processed data segments from worker threads in a round robin loop as well. This preserves data ordering at input and output. The ordering of index lookup and dedupe processing is done by one thread signaling the other. The diagram below illustrates this showing an example with 2 threads.


The green arrows in the diagram shows the direction of the semaphore signals. At each synchronization point a semaphore is signaled to indicate completion of some activity. The critical section of the index lookup and update operation is highlighted in blue. Each thread holds a reference to the index semaphore of the next thread in sequence. The last thread holds a reference of the index semaphore of the first thread. Each thread first waits for it’s own index semaphore to be signaled, then performs the index update and signals the next guy to proceed. The dispatcher thread signals the index semaphore of the first thread to start the ball rolling. Effectively this approach is equivalent to a round-robin token ring network. Whoever holds the token can access the common resource. Lock contention is completely avoided, so this can potentially scale to thousands of threads.

The key to data ordering are the two loops, one in the dispatcher and one in the writer thread. The dispatcher always assigns data segments to threads in a fixed order. In the above example Thread 1 gets all the odd segments and Thread 2 gets all the even ones. The writer thread also waits for threads in order eventually ensuring that data ordering is preserved.

Looking at all this in another way the synchronization approach can be viewed simplified as three concentric rings and the processing flows are a set of radii converging to the center of the circle and intersecting the rings. The processing flow direction is inwards towards the center and all the tokens flow along the rings in one direction, for example clockwise (black arrows). The green curved arrows show signaling of the synch points to forward tokens. That is when processing flow reaches the writer sink ring it forwards the token it received at the dedupe ring to the next flow. The final sync point at the centre completes the data write and forwards the token at the previous radius intersection point on the outermost ring. This approach ensures ordering and avoids races. To have maximum concurrency right from the beginning, all the synch points on the outermost ring get one-time-use tokens so all the initial processing can begin immediately. This is somewhat like priming a water pump.


This flow allows overlapped operations to happen concurrently. In addition the dispatcher does a simple double buffering by reading the next segment into a spare buffer after signaling the current thread to start processing. A bit of concurrency can be lost when the writer thread is waiting for thread 1 and thread 2 has already completed. That situation typically arises at the end of a file where the last segment can be a small one. It can also arise if one segment cannot be deduplicated and the rest of the dedupe processing is aborted. However the impact of these are relatively small compared to the overall processing being done, so a lot of multi-core parallelism is effectively utilized in practice. Finally a bunch of overheads in using specific data structures and/or parallel threading libraries are also avoided.