Sunday, November 16, 2014

WriterReaderPhaser: A story about a new (?) synchronization primitive

I recently added a synchronization primitive to my HdrHistogram and LatencyUtils code, which I think has generic use for some very common operations. Specifically, when wait-free writers are updating stuff that background analyzers or loggers need to look at. I've isolated it in what I now call a WriterReaderPhaser. The name is very intentional, and we'll get to that in a moment. And to the code (all 66 actual lines of it, 200 with elaborate comments). But first, I'll stray into some "how did this come about" storytelling.

WriterReaderPhaser is a new (I think) synchronization primitive: It provides a straightforward interface and API to coordinate wait-free writing to a shared data structure with blocking reading operations of the same data. Readers view a stable (i.e. non changing, coherent) data set while writers continue to modify data without waiting. And readers are guaranteed forward progress, and will only block for other readers and for writers that may have been "in flight" at the time the reader establishes a stable view of the data.

How did this come about?

This sometimes happens when I build stuff: I find myself in need of some behavior that I thought would be common, but for which I can't find an existing implementation, or a name, or a description. This can obviously be ascribed to my weak Google-fu skills, but after a while I give up and just build the thing, because "it's not that complicated". So I build a one-off implementation into whatever I am doing at the time, and move on with life. At some later point, I find myself needing the same thing again. And since I had already solved that problem once, I go back to my old code and (let's be honest) copy-and-paste my first implementation into whatever new thing I'm working on. Sometimes the little guy on my right shoulder wins over the other guy, and I come back and refactor the behavior into a separate class and build an API for more generic use, at which point the "does this deserve its own library? Its own repo?" thinking starts, coupled with much Yak Shaving [1]. Sometimes the guy on the left shoulder wins, and I actually get on with the real work I was supposed to be doing. I'll leave it to you to decide which little guy is red and which is white.

Sometimes (usually much later) I realize that what I built was actually new. That even though I thought it was a common use case, and built my version simply out of impatience or frustration at not finding something I could use as-is, I may actually be the first person to solve it. Most of those times, this realization is quickly followed by someone showing me a paper or a piece of code that is 30 years old that makes me go "oh... right.". But sometimes that doesn't happen. Sometimes it really is new.

HdrHistogram itself started this way. It was nothing more than about 100 lines of code in a one-off "JitterMeter" tool I was playing with, which needed to record latencies very quickly and report accurate percentiles with many nines in them. Then I found myself building all sorts of variations on jitter meters and sharing them (jHiccup is an evolved version with a better name). And then I found that people (myself included) were taking the code and ripping out just the histogram trick inside, because they needed a histogram that was actually useful for talking about latencies. Recognizing that a fast histogram with good precision and accurate and fine grained quantile reporting capability is actually a very common use case, I decided to build a Yak shaving co-op on github and called it HdrHistogram. The first Yak hair I produced was Java-colored but others have recently added other colors and breeds.

HdrHistogram is a [presumably] successful example of this process going the distance. More often than not, it doesn't. That's probably what my stale repos on github with 2 stars and no forks represent.

WriterReaderPhaser is currently about halfway through this precarious process, but at this point I'm pretty sure it's not going to die. It's a class on its own, but not yet its own library. Certainly not its own repo yet. It will need to find a home, but org.giltene.stuff is probably not where it needs to end up. Since it's so short, this blog entry is as good a home as any for now.

Most importantly, it looks like it may actually be a new and generically useful synchronization primitive. More accurately: nobody has shown me that "oh... right." link or paper yet, and I'm done holding my breath for now.

So what is WriterReaderPhaser about? 

Have you ever had a need for logging or analyzing data that is actively being updated? Have you ever wanted to do that without stalling the writers (recorders) in any way? If so, then WriterReaderPhaser is for you.

I'm not talking about logging messages or text lines here. I'm talking about data. Data larger than one word of memory. Data that holds actual interesting state. Data that keeps being updated, but needs to be viewed in a stable and coherent way for analysis or logging. Data like frame buffers. Data like histograms. Data like usage counts. Data that changes.

Existing solutions

Sure, you can use channels, queues or magic rings to move data updates and safely process them in background copies of the data. You can use persistent data structures and all sorts of immutable trickery. But those are expensive. As in orders of magnitude more expensive than updating in-cache state in place. When this data thing you want to look at could be updated millions of times per second, you invariably end up with some sort of double-buffered (or multi buffered) scheme: Updates are done to an active copy, and analysis is done "in the background" on stable, inactive copies.

Double-buffered schemes usually involve some sort of "phase flipping". At some point the notion of which copy is active changes. Writers update the "new" active copy, and readers access a stable and coherent copy that used to be active, but now isn't. It's this phase flipping that usually gets in the way of keeping writers from blocking.

There are all sorts of variations on how to do this flipping. We can obviously use some form of mutual exclusion lock to protect the writes and the flip. But then writers will block each other, and be blocked by the flipping operation. We can use ReaderWriter locks backwards: where the state being protected by the ReaderWriter lock would be the notion of which data set is the "active" one (the one writers write to). In this scheme writers take the read lock for the duration of their active state modification operations, while readers take the write lock to flip the roles of active and inactive data sets. This can be [much] better than complete mutual exclusion when multiple writers are involved, since writers no longer block other writers, but readers still block writers during a flip. Also, when you start asking yourself "what does 'read' mean again in this context?" that is a good sign you have a problem. Most people write buggier code when standing on their head and juggling. I'm sure there are a whole bunch of other schemes people use, but in my looking around thus far, I didn't find any examples that were non-blocking for the writers.
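To make the "backwards" ReaderWriter lock scheme concrete, here is a minimal sketch of it in Java. The class name BackwardsRwLockCounters and its methods are hypothetical and exist only for illustration; the sketch assumes a set of counters as the protected data, and serializes readers with a plain synchronized method. It is one possible rendering of the idea, not code from HdrHistogram.

```java
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration of using a ReaderWriter lock "backwards":
// the lock protects the notion of which counter set is the active one.
class BackwardsRwLockCounters {
    private final ReentrantReadWriteLock activeRoleLock = new ReentrantReadWriteLock();
    private AtomicLongArray active;   // swapped only under the write lock
    private AtomicLongArray inactive; // swapped only under the write lock

    BackwardsRwLockCounters(int size) {
        active = new AtomicLongArray(size);
        inactive = new AtomicLongArray(size);
    }

    // Writer path: takes the READ lock, so writers do not exclude each other.
    // A writer still blocks whenever a flip holds (or is waiting for) the write lock.
    void increment(int n) {
        activeRoleLock.readLock().lock();
        try {
            active.incrementAndGet(n);
        } finally {
            activeRoleLock.readLock().unlock();
        }
    }

    // Reader path: synchronized so that readers do not flip under each other.
    synchronized long[] sampleInterval() {
        // Writers never touch the inactive set, so it can be cleared up front.
        for (int i = 0; i < inactive.length(); i++) {
            inactive.set(i, 0);
        }
        // Taking the WRITE lock waits for in-flight writers and stalls new ones.
        activeRoleLock.writeLock().lock();
        try {
            AtomicLongArray tmp = active;
            active = inactive;
            inactive = tmp;
        } finally {
            activeRoleLock.writeLock().unlock();
        }
        // The set that was active before the swap is now stable.
        long[] snapshot = new long[inactive.length()];
        for (int i = 0; i < snapshot.length; i++) {
            snapshot[i] = inactive.get(i);
        }
        return snapshot;
    }
}
```

The writer's wait on readLock() during a flip is exactly the blocking that this scheme cannot get rid of.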

Why did I care?

The thing I actually wanted to double-buffer was a histogram. And not just any histogram. A fixed-footprint histogram that supports lossless recording of experienced latencies, such that later computation of precise percentiles will be possible, all the way to the as-many-9s-as-there-are-in-the-data level. The very purpose of such a histogram is often to capture and analyze latency outlier behavior. The recording operation cannot be allowed to be a cause of the very outliers it is trying to measure. For the latency recording mechanism to have any susceptibility to blocking or locking would be unacceptable.

These latency histograms are basically non-blocking data structures with tens (or hundreds) of kilobytes of state that is rapidly being mutated by critical path "writer" code. But I wanted to log their contents over intervals that are short enough to be interesting for monitoring purposes, and for later time based analysis. In order to log the latency information being captured, I needed a logging "reader" to somehow gain access to a stable, coherent "snapshot" of the latency data that was recorded during some prior interval. To do this, I needed a way for the reader to flip the roles of the active and inactive histograms, but I needed to do that without ever blocking the writers. This is a classic case of an asymmetric synchronization need. I'm fine blocking, delaying and pausing the reader. I just can't afford for the writers to ever block or otherwise delay the execution of the thread they are recording in.

In comes WriterReaderPhaser. And the best starting point for understanding what it does is to dissect the name:

The Phaser part is there because its main function is to coordinate phase shifts between the writers and the readers. Besides, I couldn't bring myself to call this thing a lock. It's not a lock. Not in its most important function, which is phase shift coordination. Writers remain lock-free in all cases (they actually remain wait-free on architectures that support atomic increment operations). They never block or lock. Calling WriterReaderPhaser a lock would be like calling an AtomicLong an "add lock" because someone could also construct a spin-lock around it...

The WriterReader part is a reversal of the commonly used ReaderWriter (or ReadWrite) term. ReaderWriter locks are asymmetric, but in the reverse direction of what I needed: they enable [relatively] smooth reader operation while causing the writers to block. The really cool wait-free Left-Right which Martin Thompson had pointed me to achieves perfectly smooth reader operation, but that's still not what I needed. WriterReaderPhaser works for the exactly reversed need: Writers remain non-blocking and perfectly smooth, while only readers suffer.

The desired behaviors I was looking for in a WriterReaderPhaser were:

1. Writers remaining lock-free at all times. Ideally they will remain wait-free at all times.

2. A Reader can coordinate a phase flip and access to the inactive data such that:

2.1 Other readers will not flip a phase while this reader is still interested in the inactive data.

2.2 No writer modification will be made to the inactive data after the phase flip operation is complete, and for as long as the reader is interested in the inactive data.

2.3 Readers are guaranteed forward progress (even in the presence of heavy and continuous writer activity, and even when there is no writer activity at all).

Defining WriterReaderPhaser:

With these high-level desired behaviors stated, let's clearly define the qualities and guarantees that a well-implemented WriterReaderPhaser primitive would provide to users, and the relevant rules that users must adhere to in order to maintain those qualities and guarantees:

A WriterReaderPhaser instance provides the following 5 operations:
  • writerCriticalSectionEnter
  • writerCriticalSectionExit
  • readerLock
  • readerUnlock
  • flipPhase
When a WriterReaderPhaser instance is used to protect an actively updated data structure [or set of data structures] involving [potentially multiple] writers and [potentially multiple] readers, the assumptions on how readers and writers act are:
  • There are two sets of data structures (an "active" set and an "inactive" set)
  • Writing is done to the perceived active version (as perceived by the writer), and only within critical sections delineated by writerCriticalSectionEnter and writerCriticalSectionExit operations.
  • Only readers switch the perceived roles of the active and inactive data structures. They do so only while holding the readerLock, and the switch is only done before executing a flipPhase operation.
  • Readers do not hold onto readerLock indefinitely. 
  • Only readers perform readerLock and readerUnlock.
  • Writers do not remain in their critical sections indefinitely. 
  • Only writers perform writerCriticalSectionEnter and writerCriticalSectionExit.
  • Only readers perform flipPhase operations, and only while holding the readerLock.

When the above assumptions are met, WriterReaderPhaser guarantees that, once a flipPhase operation has completed, the inactive data structures are not modified by any writer while they are being read under readerLock protection.

The following progress guarantees are provided to writers and readers that adhere to the above stated assumptions:
  • Writer operations (writerCriticalSectionEnter and writerCriticalSectionExit) are wait-free (on architectures that support wait-free atomic increment operations).
  • flipPhase operations are guaranteed to make forward progress, and will only be blocked by writers whose critical sections were entered prior to the start of the reader's flipPhase operation, and have not yet exited their critical sections.
  • readerLock only blocks for other readers that are holding the readerLock.

Example use

Imagine a simple use case where a large set of rapidly updated counters is being modified by writers, and a reader needs to gain access to stable interval samples of those counters for reporting and other analysis purposes. 

The counters are represented in a volatile array of values (it is the array reference that is volatile, not the value cells within it):

volatile long counts[];

A writer updates a specific count (n) in the set of counters:

   // done between writerCriticalSectionEnter and writerCriticalSectionExit
   counts[n]++; // should use atomic increment if multi-writer

A reader gains access to a stable set of counts collected during an interval, reports on it, and accumulates it:

long interval_counts[];
long accumulated_counts[];

readerLock
   long tmp[] = counts;
   counts = interval_counts;
   interval_counts = tmp;
flipPhase
   // At this point, interval_counts content is stable:
   // report on it and add it into accumulated_counts
readerUnlock
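For readers who prefer something they can compile, here is a rough, self-contained Java rendering of the counters example. The IntervalCounters class, its method names, and the nested Phaser are hypothetical and written only for this illustration: the nested Phaser is a compact stand-in with the five operations listed above, not the HdrHistogram class itself. The sketch also assumes an AtomicLongArray in place of the plain long array so that multiple writers are safe.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical wrapper around the interval-counters example.
class IntervalCounters {
    // Compact stand-in phaser (an assumption for this sketch, not the HdrHistogram source).
    static final class Phaser {
        private final AtomicLong startEpoch = new AtomicLong(0);
        private final AtomicLong evenEndEpoch = new AtomicLong(0);
        private final AtomicLong oddEndEpoch = new AtomicLong(Long.MIN_VALUE);
        private final ReentrantLock readerLock = new ReentrantLock();

        long writerCriticalSectionEnter() { return startEpoch.getAndIncrement(); }

        void writerCriticalSectionExit(long valueAtEnter) {
            (valueAtEnter < 0 ? oddEndEpoch : evenEndEpoch).getAndIncrement();
        }

        void readerLock() { readerLock.lock(); }

        void readerUnlock() { readerLock.unlock(); }

        void flipPhase() {
            boolean nextIsEven = startEpoch.get() < 0;
            long initial = nextIsEven ? 0 : Long.MIN_VALUE;
            (nextIsEven ? evenEndEpoch : oddEndEpoch).set(initial);
            long startAtFlip = startEpoch.getAndSet(initial);
            AtomicLong previousEnd = nextIsEven ? oddEndEpoch : evenEndEpoch;
            // Wait for writers that entered before the flip to exit.
            while (previousEnd.get() != startAtFlip) {
                Thread.yield();
            }
        }
    }

    private final Phaser phaser = new Phaser();
    private volatile AtomicLongArray counts;  // the active set (reference is volatile)
    private AtomicLongArray intervalCounts;   // the inactive set, touched only under readerLock
    private final long[] accumulatedCounts;   // touched only under readerLock

    IntervalCounters(int size) {
        counts = new AtomicLongArray(size);
        intervalCounts = new AtomicLongArray(size);
        accumulatedCounts = new long[size];
    }

    // Writer: never blocks; safe to call from multiple threads.
    void increment(int n) {
        long valueAtEnter = phaser.writerCriticalSectionEnter();
        try {
            counts.incrementAndGet(n);
        } finally {
            phaser.writerCriticalSectionExit(valueAtEnter);
        }
    }

    // Reader: returns the counts recorded since the previous call and accumulates them.
    long[] sampleInterval() {
        phaser.readerLock();
        try {
            for (int i = 0; i < intervalCounts.length(); i++) {
                intervalCounts.set(i, 0); // clear the inactive set before it becomes active
            }
            AtomicLongArray tmp = counts;
            counts = intervalCounts;
            intervalCounts = tmp;
            phaser.flipPhase();
            // At this point, intervalCounts content is stable.
            long[] report = new long[intervalCounts.length()];
            for (int i = 0; i < report.length; i++) {
                report[i] = intervalCounts.get(i);
                accumulatedCounts[i] += report[i];
            }
            return report;
        } finally {
            phaser.readerUnlock();
        }
    }

    long[] accumulated() {
        phaser.readerLock();
        try {
            return accumulatedCounts.clone();
        } finally {
            phaser.readerUnlock();
        }
    }
}
```

Note the ordering in sampleInterval(): the roles are swapped first and flipPhase() is called second, both under readerLock, matching the assumptions listed earlier.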

A working implementation

Under the hood, my WriterReaderPhaser implementation achieves these qualities in a fairly straightforward way, by using a dual set of epoch counters (an "odd" set and an "even" set) to coordinate the phase flip operations, coupled with a read lock that is used purely to protect readers from each other in multi-reader situations: i.e. to prevent one reader from flipping a phase or changing the notion of active or inactive data while another reader is still operating on it. Many other implementation mechanisms are possible, but this one is certainly sufficient for the job at hand.

Rather than describe the logic in text, it is easiest to list it as code at this point. Below is the entire WriterReaderPhaser class as implemented in my current HdrHistogram repository, spelled out in Java code (most of which is detailed comments). The mechanism can obviously be ported to any language and environment that supports atomic increment and atomic swap operations. It's the API and documentation (in this case, the details in the JavaDoc comments) that are more important. A simple example of how this is used in practice can be found in HdrHistogram's various interval histogram recorders: the original recorder is probably the simplest example, while its more recent replacements add some unrelated and more complicated logic that deals with safely avoiding some copy costs on getIntervalHistogram() variants.
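As a condensed illustration of the odd/even epoch mechanism described above, here is a minimal sketch with the detailed comments stripped down. It is a simplified rendering written for this page, not the HdrHistogram source verbatim: the class name WriterReaderPhaserSketch is hypothetical, it uses AtomicLong fields for brevity, and its flipPhase() simply yields while waiting.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only (hypothetical class name, not the HdrHistogram source).
class WriterReaderPhaserSketch {
    // Writers entering during an "even" phase see values >= 0; during an "odd" phase, values < 0.
    private final AtomicLong startEpoch = new AtomicLong(0);
    private final AtomicLong evenEndEpoch = new AtomicLong(0);
    private final AtomicLong oddEndEpoch = new AtomicLong(Long.MIN_VALUE);
    // Used purely to keep readers from interfering with each other.
    private final ReentrantLock readerLock = new ReentrantLock();

    // Wait-free wherever getAndIncrement maps to a hardware atomic add.
    long writerCriticalSectionEnter() {
        return startEpoch.getAndIncrement();
    }

    // The sign of the value seen at enter tells which phase's end counter to bump.
    void writerCriticalSectionExit(long valueAtEnter) {
        if (valueAtEnter < 0) {
            oddEndEpoch.getAndIncrement();
        } else {
            evenEndEpoch.getAndIncrement();
        }
    }

    void readerLock() {
        readerLock.lock();
    }

    void readerUnlock() {
        readerLock.unlock();
    }

    // Must be called while holding readerLock, after swapping active/inactive roles.
    void flipPhase() {
        if (!readerLock.isHeldByCurrentThread()) {
            throw new IllegalStateException("flipPhase() requires holding readerLock()");
        }
        boolean nextPhaseIsEven = startEpoch.get() < 0; // the current phase is odd
        long initialStartValue = nextPhaseIsEven ? 0 : Long.MIN_VALUE;
        AtomicLong nextEnd = nextPhaseIsEven ? evenEndEpoch : oddEndEpoch;
        AtomicLong previousEnd = nextPhaseIsEven ? oddEndEpoch : evenEndEpoch;

        // Clear the end counter of the phase that is about to start.
        nextEnd.set(initialStartValue);
        // Atomically start the new phase, retaining the start value the old phase reached.
        long startValueAtFlip = startEpoch.getAndSet(initialStartValue);
        // Wait only for writers that entered before the flip and have not yet exited.
        while (previousEnd.get() != startValueAtFlip) {
            Thread.yield();
        }
    }
}
```

Writers that enter after the getAndSet land in the new phase and bump the other end counter, so the wait loop is never extended by newly arriving writers.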

And yes, it is now all in the public domain.


[1] For an apparent etymology of the term "Yak Shaving", read the example story attributed here.