Extending Eventually Consistent Cloud Databases for Enforcing Numeric Invariants

Valter Balegas, Diogo Serra, Sérgio Duarte, Carla Ferreira, Rodrigo Rodrigues, Nuno Preguiça
NOVA LINCS / FCT / Universidade Nova de Lisboa

Marc Shapiro, Mahsa Najafzadeh
INRIA / LIP6
Abstract—Geo-replicated databases often operate under the principle of eventual consistency to offer high availability with low latency on a simple key/value store abstraction. Recently, some have adopted commutative data types to provide seamless reconciliation for special-purpose data types, such as counters. Despite this, the inability to enforce numeric invariants across all replicas still remains a key shortcoming of relying on the limited guarantees of eventual consistency storage. We present a new replicated data type, called Bounded Counter, which adds support for numeric invariants to eventually consistent geo-replicated databases. We describe how this can be implemented on top of existing cloud stores without modifying them, using Riak as an example. Our approach adapts ideas from escrow transactions to devise a solution that is decentralized, fault-tolerant and fast. Our evaluation shows much lower latency and better scalability than the traditional approach of using strong consistency to enforce numeric invariants, thus alleviating the tension between consistency and availability.
I. INTRODUCTION

Scalable cloud databases with a key/value store interface have emerged as the platform of choice for providing online services that operate on a global scale, such as Facebook [15], Amazon [11], or Yahoo! [9]. In this context, a common technique for improving the user experience is geo-replication [11], [9], [27], i.e., maintaining copies of application data and logic in multiple data centers scattered across the globe. This decreases the latency for handling user requests by routing them to nearby data centers, but at the expense of resorting to weaker data consistency guarantees, in order to avoid a costly coordination across replicas for executing operations.

When executing under such weaker consistency models, applications have to deal with concurrent operations executing without being aware of each other, which implies that a merge strategy is required for reconciling concurrent updates. A common approach is to rely on a last-writer-wins strategy [19], [20], [15], but this is not appropriate in all situations. A prominent example is the proper handling of counters, which are a useful abstraction for implementing features such as like buttons, votes, ad and page views, and all sorts of resource counting. For counters, using last-writer-wins leads to lost updates, breaking the intended semantics. To address this limitation, cloud databases such as Cassandra [1], DynamoDB and Riak [6] have extended their interfaces with support for correct counters, implemented using specific merge algorithms.

Even though these approaches provide a principled handling of concurrent updates to counter objects, they fall short on supporting the enforcement of crucial invariants or database integrity constraints, which are often required for maintaining correct operation [17]. Real-world examples where enforcing invariants is essential are advertisement services, virtual wallets, or maintaining stocks.
However, enforcing such conditions using counters implemented on an eventually consistent cloud database is impossible. This is because counter updates can occur concurrently, making it impossible to detect whether the limit is exceeded before the operation concludes. Maintaining this type of invariant would be trivial in systems that offer strong consistency guarantees, namely those that serialize all updates, and therefore preclude that two operations execute without seeing the effects of one another [10], [27], [17]. The problem with these systems is that they require coordination among replicas, leading to increased latency. In particular, in a geo-replicated scenario, this latency may amount to hundreds of milliseconds, which suffices to impact application usability [23]. In this paper we show that it is possible to achieve the best of both worlds, i.e., that fast geo-replicated operations on counters can coexist with strong invariants. To this end, we propose a novel abstract data type called a
Bounded Counter. This replicated object, like conventional CRDTs [24], allows for operations to execute locally, automatically merges concurrent updates, and, in contrast to previous CRDTs, also enforces numeric invariants while avoiding coordination in most cases. Implementing the Bounded Counter in a fast and portable way required overcoming a series of challenges, which form the main technical contributions of this work.

First, we propose an extension to the main idea behind escrow transactions [21], which is to partition the difference between the current value of a counter and the limit to be enforced among existing replicas. These parts are distributed among replicas, which can locally execute operations that do not exceed their allocated part. Unlike previous solutions that include some central authority and are often based on synchronous interactions between nodes [21], [5], [22], [25], our approach is completely decentralized and asynchronous, with each replica relying only on a local and possibly stale view of the information and on peer-to-peer asynchronous interactions. This design makes it easy to deploy our system, since we do not need to add a new master server (or replica group) that controls the allocation of operations on the counter. Furthermore, this avoids situations where the temporary unreachability of the data center where the master server is located can prevent operations from making progress.

Second, and building on the fact that we did not have to add any new master servers to enforce invariants, we show how it is possible to layer our design on top of existing eventually consistent storage systems, while making very few assumptions about the underlying system. In particular, we only assume that the underlying storage system executes operations in a serializable way in each replica (not necessarily by the same order across replicas) and that it provides a reconciliation mechanism for merging concurrent updates. This makes our solution generic and portable, but raises the bar for achieving a performance that is comparable to directly accessing the underlying storage.
Furthermore, we propose two alternative designs, where the first one is implemented using only a client-side library, whereas the second one includes a server-side component deployed in a distributed hash table, which provides better scalability by minimizing the number of operations executed in the underlying storage system.

The evaluation of our prototypes running on top of Riak shows that: 1) when compared to using weak consistency, our approach with the cache and a write batching mechanism has higher throughput with a very small increase in latency, while guaranteeing that invariants are not broken; 2) when compared to using strong consistency, our approach can enforce invariants without paying the latency price for replica coordination, which is considerable for all but the local clients; 3) the client-based design performs well under low contention, but does not scale when contention on the same counter is large; 4) the server-based design scales well with the number of clients executing operations, providing even higher throughput than weak consistency.

The remainder of the paper is organized as follows. Section II overviews our solution and its requirements; Section III introduces the Bounded Counter CRDT; Section IV presents our two designs that extend Riak with numeric invariant preservation; Section V evaluates our prototypes; Section VI discusses extensions to the proposed design; Section VII discusses related work; and Section VIII concludes the paper.

II. SYSTEM OVERVIEW
A. Assumptions
We target a typical geo-replicated scenario, with copies of application data and logic maintained in multiple data centers (DCs) scattered across the globe. End clients contact the closest DC for executing application operations in the application server running in that DC. The execution of this application logic leads to issuing a sequence of operations on the database system where application data resides.

We consider that system processes (or nodes) are connected by an asynchronous network (i.e., subject to arbitrary delays, including partitions). We assume a finite set Π = {p_0, p_1, ..., p_{n−1}} of processes that may fail by crashing. A crashed process may remain crashed forever, or may recover with its persistent memory intact. A non-crashed process is said to be correct.

For simplicity, our presentation considers a single data object replicated in all processes of Π, with r_i representing the replica of the object at process p_i. The model trivially generalizes to the case where multiple data objects exist; in such a case, for each object o, we need to consider only the set Π_o of the processes that replicate o.

  % Regular data operations
  get(key): object | fail
  put(key, object): ok | fail

  % Bounded Counter operations
  create(key, type, bound): ok | error
  read(key): integer | error
  inc(key, delta, flag): ok | fail | retry
  dec(key, delta, flag): ok | fail | retry

Fig. 1. System API.
B. System API
Our middleware system is built on top of a weakly consistent key-value database. Figure 1 summarizes the programming interface of our system, with the usual get and put operations for accessing regular data, and additional operations for creating a new Bounded Counter, reading its current state, and incrementing or decrementing its value. As any other data, Bounded Counters are identified in all operations by an opaque key.

The create operation creates a new bounded counter. The type argument specifies whether it is an upper- or a lower-Bounded Counter, and the bound argument provides the global invariant limit to be maintained; e.g., create("X", upper, 1000) creates a Bounded Counter that maintains the invariant that the value must be smaller than or equal to 1000. The counter is initialized to the value of the bound.

The read operation returns the current value of the given counter. The returned value is computed based on local information and may not be globally accurate. To update a counter, the application submits inc or dec operations. These operations include a flag to decide whether the execution is strictly local or whether global execution is attempted. In both cases, the operation attempts to run locally first. When the local information cannot guarantee that the value remains within bounds, in the case of a strictly local operation, the API returns an error and a hint regarding whether global execution is likely to succeed; otherwise, in the case of a global operation, the system tries to contact remote replicas to safely execute the operation and only returns an error if this coordination with remote replicas cannot ensure the preservation of the invariant (namely when the counter has reached its limit).

C. Consistency Guarantees
We build our middleware on top of an eventually consistent database, extending the underlying guarantees with invariant preservation for counters. In particular, the eventual consistency model means that the outcome of each operation reflects the effects of only a subset of the operations that all clients have previously invoked; these are the operations that have already been executed by the replica that the client has contacted. However, for each operation that successfully returns at a client, there is a point in time after which its effect becomes visible to every operation that is invoked after that time, i.e., operations are eventually executed by all replicas.

In terms of the invariant preservation guarantee, this means that the bounds on the counter value are never violated, neither locally nor globally. By locally, this means that the bounds must be obeyed in each replica at all times, i.e., the subset of operations seen by the replica must obey:

  lower bound ≤ initial value + ∑ inc − ∑ dec ≤ upper bound.

By globally, this means that, at any instant in the execution of the system, when considering the union of all the operations executed by each replica, the same bounds must hold.

Note that the notion of causality is orthogonal to our design and guarantees, in the sense that if the underlying storage system offers causal consistency, then we also provide numeric invariant-preserving causal consistency.

D. Enforcing Numeric Invariants
To enforce numeric invariants, our design borrows ideas from the escrow transactional model [21]. The key idea of this model is to consider that the difference between the value of a counter and its bound can be seen as a set of rights to execute operations. For example, in a counter n with initial value n = 40 and invariant n ≥ 10, there are 30 (40 − 10) rights to execute decrement operations. Executing dec(5) consumes 5 of these rights. Executing inc(5) creates 5 rights. In this model, these rights are split among the replicas of the counter; e.g., if there are 3 replicas, each replica can be assigned 10 rights. If the rights needed to execute some operation exist in the local replica, the operation can safely execute locally, knowing that the global invariant will not be broken. In the previous example, if the decrements of each replica are less than or equal to 10, it follows that the total number of decrements does not exceed 30, and therefore the invariant is preserved. If not enough rights exist, then either the operation fails or additional rights must be obtained from other replicas.

Our approach encompasses two components that work together to achieve the goal of our system: a novel data structure, the Bounded Counter CRDT, to maintain the necessary information for locally verifying whether it is safe to execute an operation or not; and a middleware layer to manipulate instances of this data structure stored in the underlying cloud database. The first component is detailed in Section III, while alternative designs for the second are detailed in Section IV.
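The escrow reasoning above can be sketched in a few lines. This is an illustration only, not the paper's design: the helper names (split_rights, local_dec) are hypothetical, and a real deployment would assign and transfer rights dynamically rather than splitting them once.

```python
# A minimal sketch of the escrow idea: the slack between a counter's
# value and its lower bound is split into per-replica "rights"; a
# replica may decrement locally only while it holds enough rights.

def split_rights(value, lower_bound, n_replicas):
    """Evenly partition the available rights among replicas."""
    total = value - lower_bound           # e.g. 40 - 10 = 30 rights
    share, rest = divmod(total, n_replicas)
    return [share + (1 if i < rest else 0) for i in range(n_replicas)]

def local_dec(rights, replica, amount):
    """Consume rights locally; return False if not enough are held."""
    if rights[replica] >= amount:
        rights[replica] -= amount
        return True
    return False

rights = split_rights(40, 10, 3)          # [10, 10, 10]
assert local_dec(rights, 0, 5)            # consumes 5 of r0's rights
assert not local_dec(rights, 0, 6)        # only 5 left at r0: refused
# Even if every replica spends all its rights, total decrements cannot
# exceed 30, so the invariant value >= 10 holds without coordination.
```

The point of the sketch is that each replica decides using only its local share, so no run of purely local decrements can ever drive the global value below the bound.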
III. DESIGN OF BOUNDED COUNTER CRDT

This section presents the design of Bounded Counter, a CRDT that can be used to enforce numeric invariants without requiring coordination for most operation executions. Instead, coordination is normally executed outside of the normal execution flow of an operation and amortized over multiple operations.
A. CRDT Basics
Conflict-free replicated data types (CRDTs) [24] are a class of distributed data types that allow replicas to be modified without coordination, while guaranteeing that replicas converge to the same correct value after all updates are propagated and executed in all replicas. Two types of CRDTs have been defined: operation-based CRDTs, where modifications are propagated as operations (or patches) and executed on every replica; and state-based CRDTs, where modifications are propagated as states, and merged with the state of every replica.

In this work, we have adopted the state-based model, as we have built our work on top of a key-value store that synchronizes replicas by propagating the state of the database objects. In this model, an operation submitted at one site executes in the local replica. Updates are propagated among replicas in peer-to-peer interactions, where a replica r1 propagates its state to another replica r2, which merges its local and received state by executing the merge function.

State-based CRDTs build on the definition of a join semi-lattice (or just semi-lattice), which is a partial order ≤ equipped with a least upper bound (LUB) ⊔ for all pairs: m = x ⊔ y is a least upper bound of {x, y} under ≤ iff x ≤ m ∧ y ≤ m ∧ ∀m′, x ≤ m′ ∧ y ≤ m′ ⇒ m ≤ m′.

It has been proven that a sufficient condition for guaranteeing the convergence of the replicas of state-based CRDTs is that the object conforms to the properties of a monotonic semi-lattice object [24], in which: (i) the set S of possible states forms a semi-lattice ordered by ≤; (ii) the result of merging state s with remote state s′ is the result of computing the LUB of the two states in the semi-lattice of states, i.e., merge(s, s′) = s ⊔ s′; (iii) the state is monotonically non-decreasing across updates, i.e., for any update u, s ≤ u(s).

  payload integer[n][n] R, integer[n] U, integer min
    initial [[0,0,...,0], ..., [0,0,...,0]], [0,0,...,0], K
  query value(): integer v
    v = min + ∑_{i∈Ids} R[i][i] − ∑_{i∈Ids} U[i]
  query localRights(): integer v
    id = repId()   % id of the local replica
    v = R[id][id] + ∑_{i≠id} R[i][id] − ∑_{i≠id} R[id][i] − U[id]
  update increment(integer n)
    id = repId()
    R[id][id] = R[id][id] + n
  update decrement(integer n)
    pre-condition localRights() ≥ n
    id = repId()
    U[id] = U[id] + n
  update transfer(integer n, replicaId to): boolean b
    pre-condition b = (localRights() ≥ n)
    from = repId()
    R[from][to] := R[from][to] + n
  update merge(S)
    R[i][j] = max(R[i][j], S.R[i][j]), ∀i, j ∈ Ids
    U[i] = max(U[i], S.U[i]), ∀i ∈ Ids

Fig. 2. Bounded Counter for maintaining the invariant larger or equal to K.

B. Bounded Counter CRDT
We now detail the Bounded Counter, a CRDT for maintaining the invariant larger or equal to K. The pseudocode for the Bounded Counter is presented in Figure 2.

Bounded Counter state: The Bounded Counter must maintain the necessary information to verify whether it is safe to locally execute operations or not. This information consists of the rights each replica holds (as in the escrow transactional model [21]). To maintain this information, for a system with n replicas, we use two data structures. The first, R, is a matrix of n lines by n columns with: R[i][i] recording the increments executed at r_i, which define an equal number of rights initially assigned to r_i; R[i][j] recording the rights transferred from r_i to r_j. The second, U, is a vector of n lines with U[i] recording the successful decrements executed at r_i, which consume an equal number of rights.

Fig. 3. Example of the state of a Bounded Counter for maintaining the invariant larger or equal to 10.

For simplicity, our specification assumes every replica maintains a complete copy of these data structures, but we later discuss how to avoid this in practice.
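Assuming the concrete numbers of the Figure 3 example (30 increments at r0, transfers of 10 from r0 to each of r1 and r2, an increment of 1 at r1, and decrements of 5, 4 and 2), the state and the two queries can be written down directly as a quick sanity check:

```python
# Sketch of the Bounded Counter state for the Figure 3 example
# (invariant: value >= 10, initial value 40, three replicas).
K = 10                         # lower bound
R = [[30, 10, 10],             # r0: 30 increments, sent 10 to r1 and r2
     [ 0,  1,  0],             # r1: one increment of 1
     [ 0,  0,  0]]             # r2: no increments, no transfers
U = [5, 4, 2]                  # decrements executed at r0, r1, r2

# value(): K + sum of increments - sum of decrements
value = K + sum(R[i][i] for i in range(3)) - sum(U)
assert value == 30             # 10 + (30 + 1 + 0) - (5 + 4 + 2)

# localRights() at r0: own increments + transfers in
#                      - transfers out - own decrements
i = 0
rights_r0 = (R[i][i] + sum(R[j][i] for j in range(3) if j != i)
             - sum(R[i][j] for j in range(3) if j != i) - U[i])
assert rights_r0 == 5          # 30 + (0 + 0) - (10 + 10) - 5
```

The two assertions reproduce the values 30 and 5 discussed in the text for this example.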
Operations: When a counter is created, we assume that the initial value of the counter is equal to the minimum value allowed by the invariant, K. Thus, no rights are assigned to any replica and both R and U are initialized with all entries equal to 0. To overcome the limiting assumption of the initial value being K, we can immediately execute an increment operation in the freshly created Bounded Counter. Figure 3 shows an example of the state of a Bounded Counter for maintaining the invariant larger or equal to 10, with initial value 40. This initial value led to the creation of 30 rights assigned to r0; this value is recorded in R[0][0].

An increment executed at r_i updates the number of increments for r_i by updating the value of R[i][i]. In the example of Figure 3, the value of R[1][1] is 1, which is the result of incrementing the counter by 1 at r1.

A decrement executed at r_i updates the number of decrements for r_i by updating the value of U[i]. This operation can only execute if r_i holds enough rights before executing the operation. The decrement operation fails if not enough local rights exist. In the example of Figure 3, the values of U reflect the execution of 5, 4 and 2 decrements at r0, r1 and r2, respectively.

The rights the local replica r_i holds, returned by function localRights, are computed by: (a) adding the increments executed in the local replica, R[i][i]; (b) adding the rights transferred from other replicas to r_i, R[j][i], ∀j ≠ i; (c) subtracting the rights transferred from r_i to other replicas, R[i][j], ∀j ≠ i; and (d) subtracting the decrements executed at r_i, U[i]. In the example of Figure 3, replica r0 holds 5 rights (obtained from 30 + (0 + 0) − (10 + 10) − 5).

Computing the value of the counter consists of: (a) adding the minimum value, K; (b) adding the sum of increment operations executed at any replica, R[i][i], ∀i; and (c) subtracting the sum of the decrement operations executed at any replica, U[i], ∀i. In the example of Figure 3, the current value is 30 (obtained from 10 + (30 + 1) − (5 + 4 + 2)).

The operation transfer executed at replica r_i transfers rights from r_i to some other replica r_j, by updating the value recorded in R[i][j]. This operation can only execute if enough local rights exist. In the example of Figure 3, transfers of 10 rights from r0 to each of r1 and r2 are recorded in the values of R[0][1] and R[0][2].

The merge operation is executed during peer-to-peer synchronization, when a replica receives the state of a remote replica. The local state is updated by taking, for each entry of both data structures, the maximum of the local and the received value.
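The operations described above can be transcribed from the Figure 2 pseudocode almost mechanically. The following is an illustrative sketch, not the paper's implementation; repId() becomes a constructor argument, and the pre-conditions become explicit checks that return False on failure:

```python
class BoundedCounter:
    """Sketch of the Figure 2 CRDT: enforces value >= k with n replicas."""

    def __init__(self, n, k, replica_id):
        self.n, self.k, self.id = n, k, replica_id
        self.R = [[0] * n for _ in range(n)]  # R[i][i]: increments; R[i][j]: rights sent i -> j
        self.U = [0] * n                      # U[i]: decrements executed at replica i

    def value(self):
        return self.k + sum(self.R[i][i] for i in range(self.n)) - sum(self.U)

    def local_rights(self):
        i = self.id
        return (self.R[i][i]
                + sum(self.R[j][i] for j in range(self.n) if j != i)
                - sum(self.R[i][j] for j in range(self.n) if j != i)
                - self.U[i])

    def increment(self, amount):
        self.R[self.id][self.id] += amount

    def decrement(self, amount):
        if self.local_rights() < amount:
            return False                      # pre-condition failed
        self.U[self.id] += amount
        return True

    def transfer(self, amount, to):
        if self.local_rights() < amount:
            return False
        self.R[self.id][to] += amount
        return True

    def merge(self, other):
        for i in range(self.n):
            for j in range(self.n):
                self.R[i][j] = max(self.R[i][j], other.R[i][j])
            self.U[i] = max(self.U[i], other.U[i])

# Two replicas of the same counter with invariant value >= 10:
a, b = BoundedCounter(2, 10, 0), BoundedCounter(2, 10, 1)
a.increment(30)                 # creates 30 rights at r0
a.transfer(10, 1)               # hand 10 of them to r1
b.merge(a)                      # b learns of the transfer
assert b.local_rights() == 10 and b.decrement(7)
assert not b.decrement(4)       # only 3 rights left at r1
```

Note how r1 can spend transferred rights without contacting r0 again, and how a decrement that would overrun the local rights is simply refused.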
Correctness: For showing the correctness of the Bounded Counter, it is necessary to show that all replicas of the Bounded Counter eventually converge to the same state, i.e., that the Bounded Counter is a correct CRDT, and that the execution of concurrent operations will not break the invariant. We now sketch an argument for why these properties are satisfied.

For showing that replicas eventually converge to the same state, it is necessary to prove that the specification is a monotonic semi-lattice object. As the elements of R and U are monotonically increasing (since operations never decrement the value of these variables), the semi-lattice properties are immediately satisfied: two states s1, s2 are related by a partial order relation, s1 ≤ s2, whenever all values of R and U in s2 are greater than or equal to the corresponding values in s1 (i.e., ∀i, j: s1.R[i][j] ≤ s2.R[i][j] ∧ s1.U[i] ≤ s2.U[i]). Furthermore, the merge of two states is the LUB, as the function just takes the maximum of each entry.

To guarantee that the invariant is not broken, it is necessary to guarantee that a replica does not execute an operation (decrement or transfer) without holding enough rights to do it. As operations execute sequentially and verify whether the local replica holds enough rights before execution, it is necessary to prove that if a replica believes it has N rights, it owns at least N rights. The construction of the algorithms guarantees that line i of R and U is only updated by operations executed at replica r_i. Thus, replica r_i necessarily has the most recent value for line i of both R and U. As the rights of replica r_i are consumed by decrement operations, recorded in U[i], and transfer operations, recorded in R[i][j], it follows immediately that replica r_i knows of all rights it has consumed. Thus, when computing the local rights, the value computed locally is always conservative (as replica r_i may not yet know of some transfer to r_i executed by some other replica). This guarantees that the invariant is not broken when operations execute locally in a single replica.

We wrote the specification of the Bounded Counter in TLA [16] and successfully verified that the invariant holds for all the cases that the tool generated.
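The convergence half of the argument can be checked mechanically on small states: because merge takes entrywise maxima, it is commutative, associative and idempotent, so any delivery order yields the same merged state. A throwaway sketch (the example states are made up for illustration):

```python
from itertools import permutations

def merge(s, t):
    """Entrywise max of two (R, U) states; returns a new state."""
    R1, U1 = s
    R2, U2 = t
    R = [[max(a, b) for a, b in zip(r1, r2)] for r1, r2 in zip(R1, R2)]
    U = [max(a, b) for a, b in zip(U1, U2)]
    return (R, U)

# Three divergent replica states of a 2-replica counter.
states = [
    ([[5, 2], [0, 0]], [1, 0]),   # r0 incremented 5, sent 2, decremented 1
    ([[5, 0], [0, 3]], [0, 2]),   # r1 incremented 3, decremented 2
    ([[5, 2], [0, 3]], [1, 2]),   # a state that already saw both
]

# Folding the states in every order produces the same merged state.
results = set()
for order in permutations(states):
    acc = order[0]
    for s in order[1:]:
        acc = merge(acc, s)
    results.add(repr(acc))
assert len(results) == 1
```

This is of course no substitute for the TLA verification mentioned above; it only illustrates why max-based merging makes delivery order irrelevant.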
Extensions: It is possible to define a Bounded Counter that enforces an invariant of the form smaller or equal to K by using a similar approach, where rights represent the possibility of executing increment operations instead of decrement operations. The specification would be similar to the one presented in Figure 2, with the necessary adaptations to the different meaning of the rights.

A Bounded Counter that can maintain an invariant of the form larger or equal to K1 and smaller or equal to K2 can be created by combining the information of two Bounded Counters, one for each invariant, and updating both on each operation.
Optimizations: The state of the Bounded Counter, as presented, has complexity O(n^2). In practice, the impact of this is expected to be small, as the number of data centers in common deployments is typically small and each data center will typically hold a single logical replica.

In the cases when this is not true, we can leverage the following observations to lower the space complexity of Bounded Counters to O(n). For computing the local rights, replica r_i only uses line i and column i of R and line i of U. For computing the local value of the counter, replica r_i additionally uses the entries R[i][i], ∀i, and the remaining entries of U. This leads to a space complexity of 4·n for storage, which compares with 2·n as the minimal complexity of a state-based counter [8]. In this case, when synchronizing, a replica only needs to send the information both replicas store. Thus, a replica r_i would send to r_j only R[i][i], ∀i, R[i][j] and U, lowering the space complexity for messages to 2·n.

When this optimization is not in place, and every replica maintains the complete data structure, we can still lower the communication costs by propagating the information epidemically. This means that it is not necessary for every replica to communicate directly with every other replica. In particular, we can allow for the communication to be reactive instead of proactive: a replica r_i only needs to communicate directly with r_j when it transfers rights to r_j (e.g., upon request in order to execute an operation) so that r_j knows about the new rights. Note that the lack of communication does not affect correctness regarding invariant violation, as each replica always has a conservative view of its available rights.

IV. MIDDLEWARE FOR ENFORCING NUMERIC INVARIANTS
We now present two middleware designs for extending the Riak database with numeric invariants, using the Bounded Counter. The proposed designs can be applied to any database that provides the following two properties, essential for the Bounded Counter to work properly. First, each replica needs to execute operations referring to each counter in a serializable way, i.e., as if they had been executed in sequence. This does not, however, preclude concurrency: operations for different counters are not constrained by this requirement, and even within the same counter there are protocols that allow for some concurrency while maintaining the illusion of a serializable execution. This serialization is necessary to guarantee that two concurrent operations do not use the same rights. Second, the replication model must ensure no lost updates, i.e., updates executed concurrently in different replicas must be merged using the CRDT merge function. This is necessary for the CRDT to work properly.

Before presenting the middleware designs, we present an overview of the functionalities of Riak that are relevant for the deployment of Bounded Counters.

A. Overview of Riak 2.0
Riak 2.0 is a key/value database inspired by Dynamo [11]. It supports geo-replication in its Enterprise Edition, where each DC maintains a full replica of the database. Riak provides an API supporting a read (get) and write (put) interface, where a write associates a new value with a key, and a read returns the value(s) associated with the key.

By default, writes on a key can proceed concurrently, with the system maintaining the multiple concurrent versions and exposing them to clients in subsequent read operations. Additionally, Riak includes native support for storing CRDTs, dubbed Riak data types, where concurrent writes are automatically merged.

Riak keys can be marked as strongly consistent. For these keys, Riak uses a conditional writing mode where a write fails if a concurrent write has been executed. These keys are not geo-replicated (each DC has its local view of the data) and they cannot store a Riak data type object.

[Footnote: Replicas r_i and r_j also share R[j][i], but as this value is only updated at r_j, it is not necessary to send it.]

Fig. 4. Client-based middleware for deploying Bounded Counters.

Fig. 5. Server-based middleware for deploying Bounded Counters.

B. Alternative 1: Client-based Middleware
Our first design, depicted in Figure 4, is based on a client-side middleware. Supporting operations on Bounded Counters is fairly simple, given the functionality provided by Riak. The state of a Bounded Counter is stored as an opaque object in the Riak database, which is marked as strongly consistent. Rights for executing operations on a Bounded Counter are associated with each DC, i.e., each DC is considered as a single replica of a Bounded Counter. An increment (resp. decrement) executes in the client library by first reading the current value of the counter (executing a get operation in Riak), then executing the increment (resp. decrement) operation in the Bounded Counter, and writing the new value of the counter back into the database. If the operation in the Bounded Counter fails, the client can try to obtain additional rights by requesting the execution of a transfer operation from another DC. If the operation in the CRDT succeeds but the conditional write fails, the operation must be re-executed until it succeeds.

Given that Bounded Counters are marked as strongly consistent, updates are serialized in each DC through the conditional writing mechanism. Concurrent updates to the same Bounded Counter can only appear due to geo-replication. If this is the case, then concurrent versions can be merged by the client library when reading the counter.

For propagating the updated values across DCs, we were not able to reuse the geo-replication mechanism from Riak, since it does not support multi-data center replication for objects that use strong consistency. As such, we had to implement a custom synchronization mechanism for Bounded Counters. This custom synchronization forwards modified counters to other DCs periodically. A DC receiving a remote version of a counter merges the received version with the local version.
C. Alternative 2: Server-based Middleware
The client-based middleware has an important limitation, as pointed out by the evaluation in Section V: the conditional writing mechanism for serializing operation execution works well under low load, but leads to an increased number of failed writes when the load increases. To address this issue, we propose a server-based middleware design that serializes all operations executed in each DC for each counter.

The server-based middleware is built using a DHT communication substrate (riak_core [13] in our prototype) running side by side with each node of the Riak database. The key feature that is employed is the ability to look up the DHT node that is responsible for a given key. This primitive is used to route all requests for a given key to the same node, which serializes their execution. For operations on regular objects, the client library calls Riak directly (without contacting DHT nodes).

When an application wants to execute an operation on a counter, the operation is sent to the DHT node responsible for that counter. The DHT node executes the operation by reading the counter from Riak, executing the operation and writing back the new value. Bounded Counters are marked as strongly consistent, with writes being executed using conditional writes. In the normal case, when there are no reconfigurations, the conditional write will succeed, since a single DHT node is responsible for any given key and executes all operations for each counter in sequence. In contrast, when a new node enters the DHT or some node fails, the DHT is automatically reconfigured and it becomes possible for two nodes to concurrently process operations for the same key. In this case, only the first write will succeed, since the following concurrent writes will fail due to the conditional write mechanism. This guarantees the correctness of the system by serializing all updates.

Since Riak does not geo-replicate keys marked as strongly consistent, our middleware had to include a mechanism for propagating updates to Bounded Counters to other DCs. To this end, each DHT node periodically propagates its updated entries to the corresponding DHT nodes in other DCs. With this approach, each value that is sent can include the effects of a sequence of operations, thus reducing the communication overhead. As in the previous version, when a Bounded Counter is received in a DC from another DC, it is merged with the local replica using the CRDT merge function. For other objects, we rely on Riak's normal built-in multi-data center replication.
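The routing step can be sketched as follows. This is an illustrative Python model only (the prototype builds on riak_core, which is Erlang); the class and node names are hypothetical, but the idea is the same: every middleware node computes the same owner for a key, so all operations on one counter reach one node, which serializes them.

```python
# Sketch of routing counter operations by key (hypothetical names; the
# prototype uses riak_core's ring, not this code).
import hashlib
from bisect import bisect_right

class Ring:
    """Minimal consistent-hash ring: maps each key to a single owner node."""
    def __init__(self, nodes, vnodes=64):
        # Each node claims several positions on the ring for balance.
        self._ring = sorted(
            (int(hashlib.sha1(f"{n}:{i}".encode()).hexdigest(), 16), n)
            for n in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    def owner(self, key):
        """The first ring position at or after the key's hash owns the key."""
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        idx = bisect_right(self._hashes, h) % len(self._ring)
        return self._ring[idx][1]

ring = Ring(["node-a", "node-b", "node-c"])
# The same key always routes to the same node, so that node can execute
# all operations on the counter in sequence; the conditional write then
# normally succeeds, and only fails during ring reconfigurations.
assert ring.owner("stock:item42") == ring.owner("stock:item42")
```
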
Optimizations: Our prototype includes a number of optimizations to improve its efficiency. The first optimization is to cache Bounded Counters at the middleware layer. This allows us to reduce the number of Riak operations necessary for processing each update on a Bounded Counter from two to one: only the write is necessary.

Under high contention on a Bounded Counter, the design described so far is not very efficient, since an operation must complete before the next operation starts being processed. In particular, since processing an update requires writing the modified Bounded Counter back to the Riak database, which involves contacting remote nodes, each operation can take a few milliseconds to complete. To improve throughput, while a remote write to Riak is taking place, incoming operations are executed on the local copy of the Bounded Counter. If the counter cannot be incremented or decremented, the result is immediately returned to the client. Otherwise, no result is immediately returned and the operation becomes pending. When the previous write to the Riak database completes, the local version of the Bounded Counter, which has absorbed the modifications of all pending operations, is written to the Riak database. If this second conditional write succeeds, all pending operations complete by returning success to the clients. Otherwise, clients are notified of the failure.
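A minimal sketch of this batching scheme, with hypothetical names (`BatchingCounter`, `reply`); the actual Riak conditional write happens asynchronously in the middleware and is elided here, represented only by the `write_completed` callback:

```python
# Sketch of write batching: while one Riak write is in flight, further
# operations are absorbed into the cached local copy and become pending.
class BatchingCounter:
    def __init__(self, value, lower_bound=0):
        self.value = value            # cached local copy of the counter
        self.lower = lower_bound      # invariant: value >= lower
        self.pending = []             # reply callbacks awaiting the write
        self.write_in_flight = False

    def decrement(self, n, reply):
        if self.value - n < self.lower:
            reply("fail")             # invariant check is local and immediate
            return
        self.value -= n               # absorb the update into the cached copy
        self.pending.append(reply)    # no result yet: operation is pending
        if not self.write_in_flight:
            self.write_in_flight = True   # one Riak write covers all pending ops

    def write_completed(self, ok):
        # Called when the (single) conditional write to Riak completes.
        for reply in self.pending:
            reply("ok" if ok else "fail")
        self.pending.clear()
        self.write_in_flight = False
```

For example, with a counter at 10 and decrements of 3, the fourth decrement fails immediately (it would drop below the bound), while the first three are all acknowledged by the one write that follows.
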
D. Transferring Rights
For executing an operation that may violate an invariant, a replica needs to own enough rights. Given that it is impossible to anticipate the rights needed at each replica, it is necessary to redistribute rights among replicas.

In our middleware designs, replicas proactively exchange rights in the background. A replica that has fewer rights than a given threshold periodically asks for additional rights from replicas that have more rights (as reflected in the local replica of the Bounded Counter). The number of rights requested is half of the difference between the rights of the remote and the local replicas. A replica receiving an asynchronous transfer request never accepts to transfer more than half of its available rights. This strategy provisions replicas with rights without impairing latency during operation execution.

Nonetheless, it may happen that an operation does not succeed because it has insufficient local rights during execution. In this situation, the programmer can choose to get the rights from a remote replica or abort the operation. Programmatically, the decision is made through the flag parameter in the decrement and increment operations, as presented in Figure 1.

To execute a transfer, replica r_i checks the local state to choose the best candidate replica to request rights from (e.g., the remote replica holding the most rights), r_j, and sends a transfer request and a flag saying whether it is a synchronous or an asynchronous request. Upon receiving the request, the remote replica r_j checks if it can satisfy the request and, if so, executes a local transfer operation to move the rights from r_j to r_i. If the request was asynchronous, the replication mechanism will asynchronously propagate the update to the requester; otherwise r_j stores the transfer locally and replies to r_i immediately with the new state of the counter.

Replying to every transfer request may lead to a request being satisfied more than once, either because a request message was lost and replayed or because the requester sent the request more than once (possibly to multiple replicas). To avoid this situation, r_i includes in the request to r_j the number of rights already transferred from r_j to r_i (R[j][i]).
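The two halves of the exchange policy (ask for half the difference; grant at most half of what is held) can be sketched as two small functions. The function names are hypothetical; the real policy runs inside the middleware against the local Bounded Counter state.

```python
# Sketch of the proactive rights-transfer policy (hypothetical names).
def rights_to_request(local_rights, remote_rights, threshold):
    """A replica below `threshold` asks the richest remote replica for
    half of the difference between that replica's rights and its own."""
    if local_rights >= threshold or remote_rights <= local_rights:
        return 0
    return (remote_rights - local_rights) // 2

def rights_to_grant(requested, available):
    """A replica serving an asynchronous request never transfers more
    than half of the rights it currently holds."""
    return min(requested, available // 2)
```

For instance, a replica holding 10 rights (below a threshold of 20) that sees a remote replica with 100 rights requests (100 - 10) / 2 = 45; the remote replica, holding 100, can grant all 45, since that is below half of its holdings.
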
The receiver ignores a request if it has already transferred more rights. A property of the way transfer is implemented is that it does not require any strong synchronization between the replica asking for rights and the one providing them. Thus, the request for a transfer and the synchronization of the information about transferred rights can be done completely asynchronously, which simplifies the system design.

E. Fault-tolerance
We now analyze how our middleware designs provide fault-tolerance, building on the fault-tolerance of the underlying cloud database. We start by noting that, for the Bounded Counters, each DC acts as a Bounded Counter replica. A DC is assumed to have sufficient internal redundancy to never lose its state. In Riak, the level of fault-tolerance in each DC can be controlled by changing the size of the quorums used to store data. Thus, an update to a Bounded Counter executed in a DC is never lost unless the DC fails forever.

A failure of a node in the DC may cause the DHT used in our server-based middleware to reconfigure. As explained before, this does not affect correctness, as we rely on conditional writes to guarantee that the operations on each counter are serialized in each DC.

During a network partition, rights can be used on both sides of the partition; the only restriction is that it is impossible to transfer rights between any two partitioned DCs. If an entire DC becomes unavailable, the rights owned by the unreachable DC become temporarily unavailable. If a DC fails permanently, since the Bounded Counter records the rights owned by every replica, it is possible to recover the rights that were owned by the failed DC.

V. EVALUATION
We implemented both middleware designs for extending Riak with numeric invariants and evaluated the prototypes experimentally. This evaluation tries to address the following main questions: (i) How much overhead is introduced by our designs? (ii) What is the performance penalty when the bounds are close to being exceeded? (iii) How does the performance vary with the level of contention on the same counter?

In our designs, operations on Bounded Counters are handled by our middleware. All other operations are executed directly in the Riak database. For this reason, our evaluation focuses on the performance of Bounded Counters, using micro-benchmarks to test different properties of the system.
A. Configurations
In the experiments, we compare the client-based middleware, BCclt, and the server-based middleware, BCsrv, with the following configurations.

Weakly Consistent Counters (Weak). This configuration uses Riak's native counters operating under weak consistency. Before issuing a decrement, a client reads the current counter value and issues a decrement only if the value is positive.

Strongly Consistent Counters (Strong). This configuration uses Riak's native strong consistency, with the Riak database running in a single DC, which receives requests from clients in the local and remote DCs. As Riak strong consistency cannot be used with Riak data types, the value of the counter is stored as an object that is opaque to Riak. A counter is updated by reading its value, updating its state if the value is positive, and writing back the new state (using a conditional write).
B. Experimental Setup
Our experiments comprised 3 Amazon EC2 DCs distributed across the globe. The latency between the DCs is shown in Table I. In each DC, we use three m1.large machines with 7.5GB of memory for running the database servers and the server-based middleware, and three m1.large machines for running the clients.

For Weak, we used Riak 2.0 Enterprise Edition (EE), with support for geo-replication. For the other configurations we used Riak 2.0 Community Edition (CE), with support for strong consistency. Both versions share the same code, except for the support for strong consistency and geo-replication, the latter of which is only available in the enterprise edition.

TABLE I. RTT latency between data centers in Amazon EC2.

    RTT (ms)   US-E   US-W    EU
    US-East      -     80     96
    US-West     83      -    163
    EU          93    161      -

Fig. 6. Throughput vs. latency with a single counter.

In Strong, data is stored in the US-East DC, which is the location that minimizes the latency for remote clients. In the remaining configurations, data is fully geo-replicated in all DCs, with clients accessing the replicas in the local DC. Riak operations use a quorum of 3 replicas for writes and 1 replica for reads.
C. Single Counter
The objective of this experiment is to evaluate the performance of the middleware designs under contention. In this case, we use a single counter initialized to a value that is large enough to never break the invariant. Clients execute 20% increments and 80% decrements in a closed loop with a think time of 100 ms. Each experiment runs for two minutes after the initialization of the database. The load is controlled by tuning the number of clients running in each experiment; clients are always evenly distributed among the client machines.

a) Throughput vs. latency: Figure 6 presents the variation of throughput vs. latency as more operations are injected into the system. For the throughput values we consider only the operations that have succeeded, while for the latency we consider the average over all (succeeded or failed) operations. (This only affects the results for Strong.)

The results of BCclt and Strong present a similar trend: the throughput quickly starts degrading as the load increases. By analyzing the results of the operations, we found that this is explained by the fact that the percentage of operations that fail increases very quickly with the number of clients. This is because concurrent updates fail due to the conditional write mechanism; e.g., for Strong, 50% of operations fail with 100 clients and 90% with 200 clients. The 3x higher throughput of BCclt is explained by the fact that clients execute operations in their local DC, while in Strong all operations are sent to a single DC.

Fig. 7. Median latency with a single counter, per region of clients (the line is the value for the 99th percentile).
Fig. 8. Latency of each operation over time for BCsrv.
Fig. 9. Decrements executed in excess, violating the invariant.

The higher average latency in Strong is explained by the latency of operations from remote clients. This explains why we chose to report the latency of all operations, including failed ones: since most remote operations fail, considering only operations that succeed would lead to latency values close to those of BCclt.

The throughput of Weak is much larger, and it does not degrade with the increase of the load: when it reaches its maximum throughput, increasing the load just leads to an increase in latency. Our server-based middleware, BCsrv, has an even higher throughput, with slightly higher latency. The higher latency is expected, as the middleware introduces communication overhead. The higher throughput is due to the batching mechanism introduced in BCsrv, which batches a sequence of updates into a single Riak write, thus leading to a constant rate of Riak operations. To test this hypothesis, we ran the same experiment with batching turned off, writing every update to Riak; the results are presented as BCsrv-nobatch. In this case, we can observe that the throughput is much lower than Weak's but, unlike BCclt, it does not degrade with the load: the reason is that the middleware serializes updates and Riak still sees a constant rate of writes. The same approach of batching multiple operations into a single Riak write could be used with other configurations, such as Weak, to improve their scalability.

b) Latency under low load:
Figure 7 presents the median latency experienced by clients in different regions when the load is low (with 15 threads in each client machine). As expected, the results show that for Strong, remote clients experience high latency for operation execution, while local clients are fast. The latency for all the other configurations is very low, with BCsrv introducing a slight overhead (of about 2 ms), due to the additional communication steps for processing the request. If Bounded Counters were added to the Riak database itself, this overhead could be eliminated.

c) Effects of exhausting rights:
In this experiment we evaluate the behavior of our middleware when the value of the counter approaches the limit. To this end, we run the experiment with BCsrv and 5 clients executing 100% decrements, initializing the counter with the value 6000 and running the experiment until all rights are consumed.

Figure 8 shows that most operations have low latency, with a few peaks of high latency whenever a replica needs to obtain additional rights. The number of peaks is small because most of the time the proactive mechanism for exchanging rights is able to provision a replica with enough rights before all rights are used.

Fig. 10. Throughput vs. latency with multiple counters.
These peaks can occur at any time during the experiment, but are more frequent when the resources are low and replicas exchange rights more often, i.e., close to the end of the experiment. After all rights are consumed, the latency remains low because a replica does not ask for rights from replicas that are expected to have none (according to the local copy of the Bounded Counter). Thus, when all rights are consumed, operations fail locally.

d) Invariant Preservation:
To evaluate the severity of the risk of invariant violation, we computed how many decrements in excess were executed with success in the different solutions. The counter is initialized with the value of 6, and decrements in excess occur only for Weak, as expected. The figure shows that an increase in the number of clients directly impacts the severity of the invariant violation. This is because in Weak the client reads a counter, checks if its value is greater than the limit, and decrements it. Since this is not an atomic operation, the value of the counter can change between the read and the update, and that difference is directly affected by the number of concurrent updates, which leads to more invariant violations.
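The read-check-decrement race behind these violations can be replayed deterministically. The snippet below is illustrative, not the Riak client: it fixes the one interleaving where both clients read before either writes, so both checks pass and the invariant is broken.

```python
# Deterministic replay of the check-then-act race under Weak.
counter = 1          # invariant: counter >= 0, so only one decrement is allowed

# Interleaving: both clients read the counter before either one writes.
read_a = counter     # client A reads 1
read_b = counter     # client B reads 1 (stale by the time it writes)

if read_a > 0:
    counter -= 1     # client A's decrement succeeds
if read_b > 0:
    counter -= 1     # client B's check also passed, based on the stale read

assert counter == -1  # invariant violated: one decrement too many
```
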
D. Multiple Counters
To evaluate how the system behaves in the common case where clients access multiple counters, we ran the same experiment of Section V-C with 100 counters. For each operation, a client selects the counter to update randomly with uniform distribution. Increasing the number of counters reduces the contention on each key and contributes to balancing the load among nodes.

The results presented in Figure 10 show that both BCclt and Strong now scale to a larger throughput (when compared with the results with a single key). The reason is that, by increasing the number of counters, the number of concurrent writes to the same key is lower, leading to a smaller number of failed operations. This number in turn increases with the load, as expected. Additionally, when the maximum throughput is reached, the latency degrades but the throughput remains almost constant. The higher average latency in Strong is explained by the fact that remote operations have high latency, as shown before.

The Weak configuration scales up to a much larger value (9K decrements/s, compared with 3K decrements/s for a single counter). As each Riak node includes multiple virtual nodes, when using multiple counters the load is balanced among them, enabling multi-core capabilities to process multiple requests in parallel (whereas with a single counter, a single virtual node is used, resulting in requests being processed sequentially).

The results show that BCsrv has a low latency close to Weak's as long as the number of writes can be handled by Riak's strong consistency mode in a timely manner. In contrast with the experiment with a single counter, Riak's capacity is shared among all the keys, each contributing writes to Riak. Therefore, as the load increases, writing batches to Riak takes longer to complete, and latency starts to accumulate sooner than in the single-key case. Nevertheless, batching still allows multiple client requests to be processed per Riak operation, leading to better throughput. The maximum throughput even surpasses the results for the Weak configuration. The results for BCsrv-nobatch, where each individual update is written using one Riak operation, can be seen as the worst case of our middleware, in which batching has no effect. Still, since all BCsrv operations are local to a given DC and access only a quorum of Riak nodes, one can expect that increasing the local cluster's capacity would have a positive effect on both latency and throughput.

VI. DISCUSSION
In this section we discuss how to extend our approach to support other cloud databases and additional invariants.
A. Supporting Other Cloud Databases
Although our middleware designs run on top of the Riak database, it would be straightforward to implement a similar prototype on top of any database that provides conditional writes, such as DynamoDB [11]. Given that we had to implement geo-replication in the middleware, we do not even require native support for geo-replication.

Alternatively, if the database provides a way to serialize all operations on a given key, it would be easy to adapt the current design. We note that this could be done in two different ways: either the cloud database already supports these strong semantics, in which case there is no need to add any further logic, or the DHT has a way to ensure that messages routed to a given key are delivered in sequence, in which case the DHT can keep track of the latest operation issued to the database.
B. Supporting Other Invariants
Some applications might require that a counter be involved in more than one numeric invariant, or that some invariants refer to multiple counters; e.g., we may want to have x >= 0 ∧ y >= 0 ∧ x + y >= K. To address this, the invariant x + y >= K can be maintained by a Bounded Counter that represents the value of x + y. In this case, when updating the value of x (resp. y), it is necessary to update both the Bounded Counter for x (resp. y) and the one for x + y, with the operation succeeding only if both execute with success. For maintaining such invariants, this needs to be done atomically but not in isolation. In other words, either both Bounded Counters are updated or neither, but it is safe for an application to observe a state where only one of the Bounded Counters has been updated.

Without considering failures, this allows for a simple implementation where, if one Bounded Counter operation fails, the operation on the other Bounded Counter is compensated [12] by executing the inverse operation. When considering failures, it is necessary to include a transactional mechanism for guaranteeing that either both updates execute or neither; recently, eventually consistent cloud databases have started to support such features [19], [20].

A number of other invariants, such as referential integrity and key constraints, can be encoded as numeric invariants, as discussed by Barbará-Millá and Garcia-Molina [5]. Those approaches could be adapted for use with Bounded Counters.

VII. RELATED WORK
Many cloud databases supporting geo-replication have been developed in recent years. Several of them [11], [19], [20], [2], [15], [7], [26] offer variants of eventual/weak consistency where operations return immediately once executed in a single data center. Such an approach is favored for the low latency it can achieve by selecting a data center close to the end-user (among several scattered across the world). Each variant addresses particular requirements, such as: reading a causally consistent view of the database [19], [2]; writing a set of updates atomically [20]; or supporting application-specific or type-specific reconciliation with no lost updates [11], [19], [27], [26], [7]. Our work focuses on the complementary requirement of having counters that enforce a global numeric invariant.

For some applications, eventual consistency needs to be complemented or replaced with strong consistency to ensure correctness. Spanner [10] provides strong consistency for the whole database, at the cost of high coordination overhead for all updates. Transaction chains [29] are an alternative that offers transaction serializability with latency proportional to the latency to the first replica accessed.

Often, only specific operations require strong consistency. Walter [27] and RedBlue consistency in Gemini [17] can mix eventual and strong consistency (snapshot isolation in Walter) to allow eventually consistent operations to be fast. PNUTS [9], DynamoDB [26] and Riak [7] also combine weak consistency with per-object strong consistency, by relying on conditional writes that fail in the presence of concurrent writes. Megastore [4] offers strong consistency inside a partition and weak consistency across partitions. In contrast, our work extends eventual consistency with numeric invariants. This allows, for the specific case of applications that require numeric invariants to be preserved, their correctness to be met while still allowing most operations to execute in a single replica.

Bailis et al. [3] examine which operations in database systems require coordination to meet invariants. We provide a low-cost solution for operations that may break numeric invariants, which require coordination under their analysis. This is possible because we secure the necessary rights prior to executing the operations, thereby moving coordination outside the critical path of operation execution.

Escrow transactions [21], initially proposed for increasing the concurrency of transactions in single databases, have also been used for supporting disconnected operation in mobile computing environments, relying on either centralized [22], [28] or peer-to-peer [25] protocols for escrow distribution. The demarcation protocol [5] enforces invariants across multiple objects located in different nodes. The underlying protocols are similar to escrow-based ones, with peer-to-peer interaction. MDCC [14] recently proposed a variant of this protocol for enforcing data invariants in quorum systems.

Our work combines convergent data types [24] with ideas from these systems to provide a decentralized approach with replicated data that offers both automatic convergence and invariant preservation with no central authority. Additionally, we describe, implement and evaluate how such a solution can be integrated into existing eventually consistent cloud databases.

Warranties [18] provide time-limited assertions over the state of the database and have been used to improve the latency of read operations in cloud databases. The goal of warranties is to support linearizability efficiently, whereas ours is to permit concurrent updates while enforcing invariants.

VIII. CONCLUSION
This paper proposed two middleware designs for extending eventually consistent cloud databases with the ability to enforce numeric invariants. Our designs allow most operations to complete within a single DC by moving the necessary coordination outside of the critical path of operation execution, thus combining the benefits of eventual consistency (low latency, high availability) with those of strong consistency (enforcing global invariants). The evaluation of our prototypes shows that our client-based middleware does not scale when contention is high, but our server-based middleware, featuring a cache and a write-batching mechanism, scales even better than Riak's native weak consistency mechanism, where invariants can be compromised.

ACKNOWLEDGMENTS

This research is supported in part by EU FP7 SyncFree project (609551), FCT/MCT SFRH/BD/87540/2012, PTDC/EEI-SCR/1837/2012 and PEst-OE/EEI/UI0527/2014. The research of Rodrigo Rodrigues is supported by the European Research Council under an ERC Starting Grant.

REFERENCES

[1] Cassandra counters. http://wiki.apache.org/cassandra/Counters.
[2] Almeida, S., Leitão, J. A., and Rodrigues, L. ChainReaction: A causal+ consistent datastore based on chain replication. In Proc. EuroSys '13 (2013).
[3] Bailis, P., Fekete, A., Franklin, M. J., Ghodsi, A., Hellerstein, J. M., and Stoica, I. Coordination-avoiding database systems. CoRR abs/1402.2237 (2014).
[4] Baker, J., Bond, C., Corbett, J. C., Furman, J. J., Khorlin, A., Larson, J., Leon, J.-M., Li, Y., Lloyd, A., and Yushprakh, V. Megastore: Providing scalable, highly available storage for interactive services. In Proc. CIDR 2011 (2011).
[5] Barbará-Millá, D., and Garcia-Molina, H. The demarcation protocol: A technique for maintaining constraints in distributed database systems. The VLDB Journal 3 (July 1994).
[6] Basho. Using Data Types. http://docs.basho.com/riak/2.0.0/dev/using/data-types/. Accessed Dec/2014.
[7] Basho. Riak. http://basho.com/riak/. Accessed Dec/2014.
[8] Burckhardt, S., Gotsman, A., Yang, H., and Zawirski, M. Replicated data types: Specification, verification, optimality. In Proc. POPL '14 (2014).
[9] Cooper, B. F., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P., Jacobsen, H.-A., Puz, N., Weaver, D., and Yerneni, R. PNUTS: Yahoo!'s hosted data serving platform. Proc. VLDB Endow. 1 (Aug. 2008).
[10] Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J. J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P., Hsieh, W., Kanthak, S., Kogan, E., Li, H., Lloyd, A., Melnik, S., Mwaura, D., Nagle, D., Quinlan, S., Rao, R., Rolig, L., Saito, Y., Szymaniak, M., Taylor, C., Wang, R., and Woodford, D. Spanner: Google's globally-distributed database. In Proc. OSDI'12 (2012).
[11] DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. Dynamo: Amazon's highly available key-value store. In Proc. SOSP '07 (2007).
[12] Garcia-Molina, H., and Salem, K. Sagas. In Proc. SIGMOD '87 (1987).
[13] Klophaus, R. Riak Core: Building distributed applications without shared state. In CUFP '10 (2010).
[14] Kraska, T., Pang, G., Franklin, M. J., Madden, S., and Fekete, A. MDCC: Multi-data center consistency. In Proc. EuroSys '13 (2013).
[15] Lakshman, A., and Malik, P. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev. 44, 2 (Apr. 2010), 35–40.
[16] Lamport, L. The temporal logic of actions. ACM Trans. Program. Lang. Syst. 16, 3 (May 1994), 872–923.
[17] Li, C., Porto, D., Clement, A., Gehrke, J., Preguiça, N., and Rodrigues, R. Making geo-replicated systems fast as possible, consistent when necessary. In Proc. OSDI'12 (2012).
[18] Liu, J., Magrino, T., Arden, O., George, M. D., and Myers, A. C. Warranties for faster strong consistency. In Proc. NSDI'14 (2014).
[19] Lloyd, W., Freedman, M. J., Kaminsky, M., and Andersen, D. G. Don't settle for eventual: Scalable causal consistency for wide-area storage with COPS. In Proc. SOSP '11 (2011).
[20] Lloyd, W., Freedman, M. J., Kaminsky, M., and Andersen, D. G. Stronger semantics for low-latency geo-replicated storage. In Proc. NSDI'13 (2013).
[21] O'Neil, P. E. The escrow transactional method. ACM Trans. Database Syst. 11, 4 (Dec. 1986), 405–430.
[22] Preguiça, N., Martins, J. L., Cunha, M., and Domingos, H. Reservations for conflict avoidance in a mobile database system. In Proc. MobiSys '03 (2003).
[23] Schurman, E., and Brutlag, J. Performance related changes and their user impact. Presented at Velocity Web Performance and Operations Conference, 2009.
[24] Shapiro, M., Preguiça, N., Baquero, C., and Zawirski, M. Conflict-free replicated data types. In Proc. SSS '11 (2011).
[25] Shrira, L., Tian, H., and Terry, D. Exo-leasing: Escrow synchronization for mobile clients of commodity storage servers. In Proc. Middleware '08 (2008).
[26] Sivasubramanian, S. Amazon DynamoDB: A seamlessly scalable non-relational database service. In Proc. SIGMOD '12 (2012).
[27] Sovran, Y., Power, R., Aguilera, M. K., and Li, J. Transactional storage for geo-replicated systems. In Proc. SOSP '11 (2011).
[28] Walborn, G. D., and Chrysanthis, P. K. Supporting semantics-based transaction processing in mobile database applications. In Proc. SRDS '95 (1995).
[29] Zhang, Y., Power, R., Zhou, S., Sovran, Y., and Aguilera, M. K. Transaction chains. In Proc. SOSP '13 (2013).