Software! Math! Data! The blog of R. Sean Bowman
June 04 2015

When I was a kid, if you were lucky enough to have a computer, it had one core and not much memory and it ran really slowly. It’s a wonder anything got done. We couldn’t even order toilet paper from Amazon because Amazon didn’t exist yet. But now, lots of cores are the norm, and things are only going to get better (worse?). How can we deal with the additional complexity brought on by the need to use these cores effectively?

Clearly we have to program with concurrency in mind, but there are lots of strategies for doing so. When I was in school, the focus was on shared memory architectures that use some sort of locking for coordinating between different threads of execution. That’s still common, but other ideas have gained some traction. Let’s look at another strategy that seems promising: software transactional memory, or STM.

STM for atomic operations

STM is a strategy for dealing with memory contention that has its roots in relational database systems. These systems provide atomic and isolated operations. The first means that we may specify that when some state changes, it either all changes, or none of it does. The second means that an outside observer sees the state before the operation or the state after, but never some in-between state. STM provides atomic, isolated transactions for shared memory processors like the ones in most computers today.
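To make those two properties concrete before we get to the real example, here's a tiny sketch using the gcc syntax we'll meet below (the account variables are invented purely for illustration):

// Toy illustration of atomicity and isolation (needs gcc's -fgnu-tm, described later).
int checking = 100;
int savings  = 0;

void transfer(int amount) {
    __transaction_atomic {
        checking -= amount;   // either both updates happen or neither does (atomicity)...
        savings  += amount;   // ...and no other thread ever sees just one of them (isolation)
    }
}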

I think that STM might be a better introduction to parallel programming than what some people encounter in school: mutexes, semaphores, condition variables, endless complexity and the potential for headaches around every corner. (Of course, those ideas are crucially important for modern high performance code. My point is just that they are not simple to think about.)

Let’s make the idea of STM concrete with an example. This example is an interpretation of a question posed by a colleague of mine months ago. I still don’t have a great answer to his question, but here’s an idea.

The problem is this: we have a counter that keeps track of several kinds of events (let's say two for this example). Several threads perform the events we want to track, so they all write to our shared counter. We'd like to periodically see how many events have occurred. The first thing we can do to make our lives easier is keep per-thread counts and add them up when we need the totals. Our counter class looks like this:

#include <iostream>

using std::cout;
using std::endl;

constexpr int n_threads = 2;

class Counter {
    // One counter slot per thread, so ordinary increments never touch another thread's slot.
    int event1_count[n_threads] = {0};
    int event2_count[n_threads] = {0};
public:
    void events_occurred(int thread_id) {
        event1_count[thread_id]++;
        event2_count[thread_id]++;
    }

    void print_totals(void) {
        int total1_count = 0, total2_count = 0;
        for (auto& e : event1_count)
            total1_count += e;
        for (auto& e : event2_count)
            total2_count += e;
        cout << total1_count << " event 1 and ";
        cout << total2_count << " event 2." << endl;
    }
};

Now the idea is that we keep track of how many events each thread performs by using the arrays event1_count and event2_count. Since the events_occurred method increments both types of events for a given thread id, we should see that the total number of events printed is the same. But when we run with several threads, it's not! For example, when I ran this code on my dual-core laptop, I got the output

1696385 event 1 and 1696431 event 2.

Close, but different! Now, this is no surprise to most of you who have seen this stuff before. What’s happening is that as we call print_totals from one thread, the other is calling events_occurred. If the instructions interleave just so, the counts will differ.
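For reference, the kind of driver I have in mind looks roughly like this; the details (iteration counts, where print_totals is called) are simplified here, so see the repo for the full harness:

#include <thread>
#include <vector>

int main() {
    Counter counter;
    std::vector<std::thread> workers;
    // One worker per slot; each worker only touches its own entries.
    for (int id = 0; id < n_threads; ++id)
        workers.emplace_back([&counter, id] {
            for (int i = 0; i < 1000000; ++i)
                counter.events_occurred(id);
        });
    counter.print_totals();   // may run while the workers are still counting
    for (auto& w : workers)
        w.join();
    counter.print_totals();   // totals after every worker has finished
    return 0;
}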

Making the counts right

What we need is some way of incrementing both counts in an atomic and isolated way. This will ensure that the counts don’t get out of sync, and that they don’t change while we’re printing them out. We can do that with STM, and in particular I’m going to use the STM available in gcc since 4.7. (There are many other implementations, but the ones I’ve checked out are more difficult to use.) All we’re going to do is add some annotations to tell the compiler that we want certain blocks of code to happen atomically, using the __transaction_atomic statement. Here’s what it looks like:

class Counter {
    int event1_count[n_threads] = {0};
    int event2_count[n_threads] = {0};
public:
    void events_occurred(int thread_id) {
        // Both increments commit together or not at all.
        __transaction_atomic {
            event1_count[thread_id]++;
            event2_count[thread_id]++;
        }
    }

    void print_totals(void) {
        int total1_count = 0, total2_count = 0;
        // Read a consistent snapshot of all the per-thread counts.
        __transaction_atomic {
            for (auto& e : event1_count)
                total1_count += e;
            for (auto& e : event2_count)
                total2_count += e;
        }
        cout << total1_count << " event 1 and ";
        cout << total2_count << " event 2." << endl;
    }
};

Almost the same code, right? We’ve merely told the compiler that we expect the counts to be incremented atomically, and that the totals should be summed up without interference from other threads modifying the counts.

The really nice thing about this approach, in theory, is that calls to the events_occurred method can indeed happen concurrently. This is because each thread has a unique id, and no thread modifies memory belonging to another. The STM system should detect that and allow concurrent calls to this method. This is in contrast to coarse-grained locking, in which only one thread at a time would be allowed to increment the counts.
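For contrast, a coarse-grained locking version looks something like this (a simplified sketch, not necessarily identical to the lock-based version in the repo). Every thread, reader or writer, has to take the same lock:

#include <mutex>

class LockedCounter {
    int event1_count[n_threads] = {0};
    int event2_count[n_threads] = {0};
    std::mutex m;   // one lock guards everything
public:
    void events_occurred(int thread_id) {
        std::lock_guard<std::mutex> guard(m);   // all writers serialize here
        event1_count[thread_id]++;
        event2_count[thread_id]++;
    }

    void print_totals() {
        int total1_count = 0, total2_count = 0;
        {
            std::lock_guard<std::mutex> guard(m);
            for (auto& e : event1_count)
                total1_count += e;
            for (auto& e : event2_count)
                total2_count += e;
        }
        cout << total1_count << " event 1 and ";
        cout << total2_count << " event 2." << endl;
    }
};

Here even two writers that touch completely disjoint slots still wait on each other, which is exactly the serialization the transactional version avoids.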

Back to the STM version: once we're inside the atomic section of print_totals, any conflicting transaction in events_occurred gets rolled back and retried, because print_totals reads the very memory those transactions write. The upshot is that we see consistent numbers: something like

1696385 event 1 and 1696385 event 2.

Cool!

Some details about compilation and performance

You can check out the code on my github page. There's a makefile you can use or crib from; the crucial bit is the -fgnu-tm flag telling gcc to enable its transactional memory extensions.
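If you'd rather compile by hand than use the makefile, the invocation is roughly this (my guess at a minimal command line; check the makefile for the exact flags and file names):

g++ -std=c++11 -fgnu-tm -O2 -pthread counter.cpp -o counter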

I’ve included a comparison with the standard library’s std::mutex as well as a mutex from Intel’s Threading Building Blocks (TBB), which performs much, much better. On my machine, STM falls somewhere in between the two. I should add that when I originally did these experiments in 2014 using gcc 4.7, the STM code was much slower than any alternative. Today, it’s more than twice as fast as std::mutex! That’s impressive.
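As an illustration of the TBB approach, a spin-lock variant looks roughly like this (tbb::spin_mutex is an assumption on my part, and you'd link against the TBB library; see the repo for the exact mutex and setup used in the benchmark):

#include <tbb/spin_mutex.h>

class TbbCounter {
    int event1_count[n_threads] = {0};
    int event2_count[n_threads] = {0};
    tbb::spin_mutex m;   // spins instead of sleeping, which is cheap for tiny critical sections
public:
    void events_occurred(int thread_id) {
        tbb::spin_mutex::scoped_lock guard(m);   // released automatically at end of scope
        event1_count[thread_id]++;
        event2_count[thread_id]++;
    }
};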

In my limited experience, the performance of STM varies. There is almost certainly some operator error here; my knowledge of STM is rudimentary at best. But in any case, STM implementations are getting better, and they’re getting hardware support (some high-end Intel chips already have the so-called TSX extensions for this purpose). I believe we should expect the performance to improve greatly in the next few years.

Approx. 1125 words, plus code