Monday, January 7, 2008

volatile - a few notes on its usefulness

There have been a few threads on the MSDN Forums lately asking various questions more or less related to the volatile keyword and its usefulness in threaded code. I've only briefly commented on it earlier in this blog, so I suppose today is as good a time as any to clarify a bit.

Historically speaking, the volatile keyword was meant to ensure that memory possibly updated by other hardware would be re-read on every access. In its search for ways to speed up (or even shrink) the code, the compiler could apply optimizations such as register allocation or redundant-load elimination. The former would cause variable values to be cached in registers, and thus not re-read from memory at each use. The latter, on the other hand, would cause code such as

x = *ptr; // get value at [ptr]
y = *ptr; // also get value at [ptr] => may not be the same as the value we read for x

to be optimized to

x = *ptr;
y = x;

This may be applied when the compiler can prove that the value at [ptr] isn't being updated by the application. Of course that doesn't mean that other hardware (or other applications) can't update the memory on their own, and thus break that proof.

To deal with this, the volatile keyword was born. In all its simplicity, it forbids the compiler from caching or optimizing away accesses to the volatile-qualified value: every read and write in the source must actually be performed.
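
As a quick sketch of that original purpose (the device address here is made up purely for illustration), consider polling a memory-mapped status register:

// Hypothetical memory-mapped status register of some device.
volatile unsigned int* const status_reg =
    reinterpret_cast<volatile unsigned int*>(0xFFFF0004);

void wait_for_ready()
{
    // Without volatile, the compiler could hoist this load out of the
    // loop and spin forever on a stale, register-cached value.
    while ((*status_reg & 0x1) == 0)
    {
        // busy-wait until the hardware sets the ready bit
    }
}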

Moving slightly forward in time, multi-threading became an every-day feature of applications. C++, of course, remained oblivious to threads, and so programmers had to rely on other ways of making sure applications would run as intended under these new premises. Once again the volatile keyword came into use, as it would also ensure that values written in one thread could be read in another. Neither of the threads would cache the variable, or optimize it away, and so the application could live on to see another day.
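
The classic example of this usage is a volatile stop flag, something along these lines:

volatile bool stop = false; // shared between the two threads

void worker()
{
    while (!stop) // re-read from memory on every iteration, never cached
    {
        // ... do a unit of work ...
    }
}

void controller()
{
    stop = true; // the worker's loop condition will eventually see this
}

It compiles, it often works, and, as the rest of this post explains, it's still not enough.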

That's about where the victory march ends for the volatile keyword. While the last paragraph is true, it's not the full truth, nor is it the full problem. First of all, compilers are cunning little things, and the C++ standard (as noted in earlier posts) is not one to prohibit them from being so. Unless the volatile keyword is applied pervasively (and even then), it's astonishingly difficult to fully battle the effects of optimizations, especially when you bring classes into the puzzle. Initialization of members may be delayed, constructors may be inlined, and writes may be re-ordered. From a single-threaded point of view, it's all good; and from the historical perspective, the volatile keyword still lives up to its promise and assures that hardware-updated values are fresh. From a multi-threaded point of view, however, we're just about whacked.
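
To make the class problem concrete, here's a sketch of the kind of publication that can go wrong; the names are mine, but the hazard is the one just described:

struct Widget
{
    int a;
    int b;
    Widget() : a(1), b(2) {}
};

Widget* volatile g_widget = 0; // only the pointer is volatile

void creator()
{
    // The constructor may be inlined, and the member stores re-ordered
    // around the pointer store; volatile constrains the pointer access,
    // not the non-volatile writes to a and b.
    g_widget = new Widget();
}

void reader()
{
    if (g_widget != 0)
    {
        int x = g_widget->a; // may observe a partially constructed Widget
    }
}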

And as you come stumbling out of a match you were unlikely to win in the first place, all bloodied and bruised, SMP is there to finish you off. Symmetric multiprocessing, in the shared-memory form it usually takes, involves multiple CPUs (or cores) and a common memory store. Multiple threads, on multiple CPUs, may be accessing the same data. On a single-processor system, the multi-threading issue is simply whether the compiler refrains from optimizing the accesses and actually writes the values to memory. In that case, volatile is your friend. On multi-processor systems, however, there's also the processor cache to contend with. Reading from and writing to main memory is not a cheap operation, and doing so usually hogs the memory pipeline. If the CPUs didn't utilize separate caches, you'd quickly get a stand-still similar to the Friday afternoon rush out of town. With the caches in place, the memory pipeline is used less frequently, and system speed is greatly improved. For multi-threaded applications, on the other hand, it brings trouble. You now have to worry about whether or not a different thread, on a different CPU, has actually seen your updated variable. Although you've declared your variable volatile, and expect it to be written to memory, there's no guarantee that the value has made it past the cache. There's similarly no guarantee that once the value *is* written to main memory, it'll arrive in the same order as in your code. There's not a thing volatile, in all its historical hardware glory, can do about this.

What could save you, at least on some platforms, are the memory model and cache-coherency protocols. The x86 and x64 both have strong memory models, accompanied by what's known as "snooping" cache-coherency protocols. The memory model dictates that reads may be re-ordered ahead of earlier writes, but that no write may pass another write. The cache protocol, on its side, assures that changes made to one CPU's cached view of a memory block are visible in another CPU's cached view of the same block. On other platforms, such as the Itanium, you're not so lucky.
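
That permitted read re-ordering is easy to state as the classic litmus test. Assuming the two functions run concurrently on different CPUs (the names are mine), even x86 allows both r1 and r2 to end up 0, because each CPU may perform its load ahead of its own earlier store:

int x = 0;
int y = 0;
int r1, r2;

void cpu0() { x = 1; r1 = y; } // the load of y may be re-ordered before the store to x
void cpu1() { y = 1; r2 = x; } // the load of x may be re-ordered before the store to y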

So, when the memory model is weak, and the cache is working against you, what do you do? Volatile is of no help, and was never intended to be. What you should do at this point is what you really should have been doing all along: use synchronization primitives. All good threading libraries (e.g. those shipped with your platform, or provided by Boost) have primitives with the necessary means to battle compiler optimizations, weak memory models and caches alike. For the most part, their weapons are compiler- and architecture-specific assembly instructions, such as memory fences and pipeline flushes. Some cost more than others, but they all save you from the havoc that would otherwise be multi-threading.
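
A minimal sketch with Boost.Thread (the names and values are mine): the lock and unlock operations carry the fences that volatile lacks, so the consumer is guaranteed to see the producer's writes, and in order:

#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/thread/mutex.hpp>

int shared_value = 0; // note: no volatile in sight
bool done = false;
boost::mutex guard;

void producer()
{
    boost::mutex::scoped_lock lock(guard);
    shared_value = 42;
    done = true;
} // releasing the lock publishes both writes

void consumer()
{
    for (;;)
    {
        boost::mutex::scoped_lock lock(guard);
        if (done)
        {
            std::cout << shared_value << std::endl; // guaranteed to print 42
            return;
        }
    }
}

int main()
{
    boost::thread t1(producer);
    boost::thread t2(consumer);
    t1.join();
    t2.join();
}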

The moral here: volatile is a good thing, as long as it's used for what it was designed to do. There are numerous articles online about so-called anti-patterns which misuse volatile (such as the Double-Checked Locking Pattern), so it should be clear what you *shouldn't* do with it. And in multi-threading there's really only one rule to follow: synchronize, synchronize, synchronize. Save the optimizations and clever tricks for other code, and don't be afraid of letting your application spend a little extra time processing cross-thread updates.