Strong eventual consistency

Most people will have seen the “Call Me Maybe” series (so named for the song by Jepsen) of blog posts about data loss in the face of network partition. Midway through the last post in the series is what is almost an off-the-cuff comment, but I think it’s everything:

“Consistency is a property of your data, not of your nodes.”

We tend to get overwhelmed with replication configurations, high-availability solutions, sharding strategies, and worrying about how a given database will react under various failure modes.

And yet, the essential truth that we’re so busy worrying about what’s stored on disk that we forget that we don’t care about consistency of what’s on disk. We need to care about the consistency of our data. It’s easy for a misbehaving program to write garbage, but not to worry! we’re absolutely certain that garbage is consistently replicated across the cluster. Yeah, well done there.

So the much bigger challenge in high-availability distributed systems, is making sure we have sane rules for propagating changes so that we can have a safe view of our data.

About 10 years ago I was working with a Java based object-oriented database (which is a grandiose name for what was as much a disk-backed datastore as anything else, but if you’re morbidly curious about what sort of API such a beast would have, you can read about db4o in a series of posts I wrote about it). It was surprisingly easy to use, and came along at a time when I was prepared to do just about anything to escape the object-relational mapping hell.

They got significant adoption in embedded devices where zero-administration is a necessity and where developers don’t want to deal with the machinery of a full scale RDMBS just to store e.g. configuration parameters. But surprise, it wasn’t long before users started asking for replication features. Now, usually when you hear that term you think of master/slave replication being done at database engine level in a high-availability setup. In this case, however, they had disconnected devices re-establishing connectivity to enterprise datastores, and because of that you had to cope with significant conflicts when it came time to synchronize.

Because the data model was articulated in terms of Java code (to a naive first approximation, you were just storing Java objects), you had the data model living in the same place as the application code, domain layer, and validation logic. This meant that when it came time to cope with those conflicts, the natural place to put do that was in the same Java code. This was interesting, because for just about every other database engine out there data is opaque. Oh, sure, RDBMS have types (though that there are people who think VARCHAR(256) actually tells you anything useful remains a source of wonder; alas, I digress), but if you have a high availability configuration and you’ve allowed concurrent activity during a network partition, then you have to deal with diverged replicas and thus have to merge them. Database doesn’t know what to do; how could it? No: consistency is a property of your data, not the datastore; the rules to decide how to synchronize are a business decision, so where better to put it than in the business logic?

Peter Miller suggests the example of booking flights: multiple passengers can end up allocated the same seat on an oversold flight, but the decision about who gets which seat happens at check-in and conflict resolution is a business one made by the airline staff, not the database.

Throughout the Jepsen posts, you’ll see occasional mention of “CRDTs” as an alternative to the problems of attempting to achieve simultaneous write safety in a distributed system. Finding out just what a CRDT is took a bit more doing that I would have expected; hence wanting to write this post.

Convergent and Commutative Replicated Data Types

It’s easy to have Consistency when you impose synchronous access to your data. But the locks needed to give that property don’t scale to distributed systems; you need to have data that can cope with delay. The idea of self-healing systems have been around for a while, but there hasn’t been much formal study of what data types meet these requirements. If you’re at all interested, I’d encourage you to have a read of “A comprehensive study of Convergent and Commutative Replicated Data Types” by Shapiro, Preguiça, Baquero, and Zawirski.

They use set notation and a form of psuedocode to describe the different data types which all makes the read a bit more serious than it needs to be, but having had my head buried in this paper for a few days I can say the effort has paid off. They articulate a set of conditions that would make either a state based system able to handle merges — which basically works out because the requirement is for the datatype to be a join semilatice; if it is, then they show the replicas will converge — or an operation based one (aka command pattern to us programmer types) — where the requirement is for manipulations of the datatype to be commutative, and if so, ditto [They also show these are equivalent, which is handy].

Here’s an schematic illustration of a state-based convergent replicated data type, from the paper:

state-based CRDT

The idea being that if you have a merge function, then it doesn’t matter where a state change is made; it will eventually make its way to all replicas.

Which raises the topic of eventual consistency. Anyone who has worked with Amazon S3 has discovered (the hard way, inevitably) that mutating an existing value has wildly undefined behaviour as to when other readers will see that change. CRDTs, on the other hand, exhibit “strong eventual consistency” (or perhaps better “strong eventual convergence”, as Murat Demirbas put his analysis of the topic), whereby the propagation behaviour is well defined.

The surface area you can use one of these data types on is limited. Because the data type is neither synchronous nor is a consensus protocol used to maintain the appearance of a single entity you cannot by definition have a global invariant. So you can track all the additions and subtractions to an integer (summing the like and dislike clicks on a page, for example); addition commutes and eventually all the operations will end up being applied to all the replicas. What you can’t do is something like enforce that the variable never goes below zero (an account balance, say) because two machines with the value at 1 could simultaneously apply a -1 operation, breaking the invariant once that operation propagates. If this seems a bit hypothetical, consider the well documented shopping cart problem encountered by a certain major global online bookseller: delete a book from your cart, and sure enough, five minutes later it’s back again. Classic case of the failure mode encountered by distributed key-value stores.

At first you’d think that this limitation would seriously cramp your style or that there wouldn’t be any real world data types that meet these requirements, but it turns out there are. The significant contribution of the paper is they come up with a formal definition of what a CRDT would need to look like, then explore around a bit and show a number of different datatypes that do meet the requirements.

The paper also includes an impressive reference list & discussion of prior art in the space, so it’s worth a read. There’s also “Conflict-free Replicated Data Types” by the same authors which formalizes SEC.

Back to the effect of network partitions on data safety:

What about Ceph?

Good question.

What I would be interested in now is how Ceph‘s various inter-related pieces hold up in the face of the sort of aggressive network partition testing conducted in the Jepsen survey. Reading a recent blog article about how the Ceph monitor services have re-implemented their use of Paxos struck me as being extraordinarily complicated. “One Paxos to rule them all”? Oh dear.

I’m doing a back-of-the-envelope examination but I think I already know the answer: you’re not going to get a write acknowledged until it is durably stored — which is Consistency. Ceph is a complex system, and parts of it can be offline when others are continuing to provide service. So you’d have to break it down to the provision of a single piece of mutable data before you could study the Availability of the system properly. I’d love to find someone who would like us do a real analysis using the Jepsen techniques; be interesting to see.

But this all reminds us why we’re interested in CRDTs in the first place: systems where you can build synchronous communication (or an external appearance thereof care of the use of consensus protocols internally) to achieve Consistency are in essence limited to highly controlled clusters in an individual data center. Most real world systems involve components distributed across geographic, temporal, and logical distances, and that means you must take into account the limitations of the speed of information propagation. While most people immediately think about the light-speed problem, it applies just as much to any distributed environment; and in any real world information system we need to serve clients concurrently, and that means the technique of using a CRDT where possible might very well be worth the effort.