DKVF: A Framework for Rapid Prototyping and Evaluating Distributed Key-value Stores
Mohammad Roohitavaf and Sandeep Kulkarni
Computer Science and Engineering Department, Michigan State University, East Lansing, MI, USA
Email: {roohitav, sandeep}@cse.msu.edu

Abstract—We present our framework DKVF that enables one to quickly prototype and evaluate new protocols for key-value stores and compare them with existing protocols based on selected benchmarks. Due to limitations imposed by the CAP theorem, new protocols must be developed that achieve the desired trade-off between consistency and availability for the given application at hand. Hence, both academic and industrial communities focus on developing new protocols that identify a different (and hopefully better in one or more aspects) point on this trade-off curve. While these protocols are often based on a simple intuition, evaluating them to ensure that they indeed provide increased availability, consistency, or performance is a tedious task. Our framework, DKVF, enables one to quickly prototype a new protocol as well as identify how it performs compared to existing protocols for pre-specified benchmarks. Our framework relies on YCSB (Yahoo! Cloud Serving Benchmark) for benchmarking. We demonstrate DKVF by implementing four existing protocols (eventual consistency, COPS, GentleRain, and CausalSpartan) with it. We compare the performance of these protocols under different loading conditions. We find that the performance is similar to our implementation of these protocols from scratch, and the comparison of these protocols is consistent with what has been reported in the literature. Moreover, implementation of these protocols was much more natural, as we only needed to translate the pseudocode into Java (and add the necessary error handling). Hence, it was possible to achieve this in just 1-2 days per protocol. Finally, our framework is extensible. It is possible to replace individual components in the framework (e.g., the storage component).
Keywords — Distributed Data Stores, Key-value Stores, Framework, Prototyping, YCSB, Geo-replication
I. INTRODUCTION
Key-value stores, together with other forms of NoSQL storage systems, have gained popularity in recent years due to their advantages over relational databases for modern workloads. The schemaless approach of NoSQL databases has made them a good choice for today's web applications with changing requirements. The need for greater scalability for very large datasets and very high write throughput has led to the ever-increasing use of NoSQL databases for big data and real-time web applications [15].

With the huge amount of data and very high query throughput produced by a large number of users across the world, storing data on a single machine does not work for any major business. Thus, we have to distribute the data across several machines. When we distribute our data, an important challenge is the consistency between different copies (i.e., replicas) of the data. There is an inherent trade-off between consistency and availability/performance [14]. Different levels of consistency come with different levels of availability/performance overhead. Even to achieve a certain level of consistency, two different protocols may have different levels of overhead. In general, this suggests that developers need to develop new protocols to improve performance, provide higher levels of consistency, reduce communication requirements, reduce storage requirements, and so on. When developers intuitively identify a new approach to design such a protocol, the natural question that arises is how to evaluate the new protocol by comparing it with different existing protocols. Generally, the concept of the new protocol is often a simple innovation/intuition (e.g., explicitly keeping track of dependencies as in [19], or using time as a way to decide when keys should be made visible as in [13]), but its evaluation is more complicated. Distributed data stores are complex systems, which makes an accurate analytical performance evaluation infeasible for them. A more practical option is experimental performance evaluation via benchmarking a prototype running the protocol.

One way to have a prototype is to build it from scratch. This approach has the advantage of maximum flexibility. However, building our prototype from scratch may take a long time, which can slow down research or development. Furthermore, if the protocol suffers from some undesirable properties (e.g., low performance), a substantial amount of development time is wasted. Another important obstacle is that this approach makes comparison to other protocols especially hard. Imagine that we want to compare our new protocol to several other existing systems, developed by other groups. Since each of them is implemented in a different code base, we need to implement the other protocols with the same code base as ours to have a fair comparison, which requires more time.

Another approach is to create our prototype by modifying an already existing system. There are many open-source NoSQL databases that can be modified for prototyping purposes. An important advantage of this approach is that by building a system on top of a tested system, we can benefit from all of its good features, and save time. However, modifying an existing system has its own disadvantages. The most important problem is the lack of flexibility. Although we can always change the code of an open-source data store, to change a system correctly, we need to understand a possibly massive implementation thoroughly, which may take even more time than creating a prototype from scratch.
In addition, by changing an existing product, we may lose the advantage of reusing some of its components, which was the whole purpose of using an existing product. For instance, suppose an existing system uses a certain replication policy. If the replication policy of our protocol is different, we have to change the whole replication mechanism of the underlying system.

Another problem of changing an existing product is the problem of being locked in by that product. For instance, suppose that we have implemented a prototype to evaluate an algorithm for causal consistency by forking from a current system like Cassandra [17]. If in the future we are interested to see how our algorithm would perform if we used another system, say Voldemort [8], we have no choice but to build another system based on Voldemort as well. That would be especially necessary if we want to compare our algorithm with another system based on Voldemort.

In addition to the implementation of a protocol, running experiments is also another burden. Different research groups may evaluate their systems in different ways, making comparisons unfair. Yahoo! Cloud Serving Benchmark (YCSB) [11] is a good candidate for a unified way of comparing different storage systems. The YCSB drivers required for benchmarking with YCSB are already available for many systems. Although YCSB helps us to benchmark our system, writing the driver, running clusters and clients on several machines, and obtaining and aggregating the results is a task that we have to do every time we want to evaluate a new protocol.

In this paper, we introduce the Distributed Key-Value Framework (DKVF) that allows protocol designers to quickly create prototypes running their protocols to see how they work in practice. We want to help researchers to focus only on their high-level protocol and let DKVF do all the lower-level tasks. For instance, consider the GentleRain protocol proposed in [13]. The server side of this protocol is only 31 lines of pseudocode provided in Algorithm 2 of [13]. However, to have a prototype running this protocol, we need to write hundreds of lines of code to handle lower-level tasks that are independent of the protocol. Our goal is to provide a framework that helps researchers to create their prototypes by writing code that is very close to the pseudocode that they publish in their research papers. We believe this framework, together with a toolset that helps us to run experiments, can significantly save time in implementing and benchmarking new protocols. We hope our framework expedites research in the field. The following are the advantages of our framework:

• The framework allows us to easily define our protocol in a high-level abstraction with an event-driven approach. Specifically, we can define our protocol as a set of event handlers, which is the same way researchers typically present their protocols in their papers. It makes the code much clearer, and reduces the number of lines of code that protocol designers need to write.
• The clear separation of concerns that the framework provides expedites debugging the system, and improves the maintainability of our code.
• We can easily compare any two protocols that are implemented on top of the framework, as both of them are implemented with the same code base.
• We provide the implementation of four protocols with this paper. Adding other protocols to the repository is part of the future work. Also, other groups can add their protocols to the repository, making them publicly available.
Having a library of protocol implementations allows researchers to easily compare their protocols to previous ones.
• We can easily change the storage engine of our key-value store without changing the logic of our higher consistency protocol. This makes comparison easy, as we can use the storage engine that another system is built on for comparison purposes.
• The framework and its toolset streamline the use of YCSB for benchmarking protocols. It encourages researchers to use a standardized framework for benchmarking instead of performing experiments in individual ways.
• The framework comes with a command line application called Cluster Manager that lets us conveniently run a cluster over the network. Using Cluster Manager, we can easily run a cluster on cloud systems such as Amazon Web Services (AWS) on Windows or Linux instances. It allows us to monitor connections, network latencies, the current load on nodes, and so on.
• Cluster Manager also lets us specify a set of experiments to benchmark the system. It takes care of running YCSB clients, and collecting and aggregating the results.
• The framework is also accompanied by a graphical tool called Cluster Designer that lets us easily define our cluster and experiments. We can visually create a graph of servers and clients, and define different workloads to run on the cluster.

The rest of the paper is organized as follows: In Section II, we provide a quick background on distributed key-value stores and the problem of consistency. In Section III, we review the overall structure of DKVF and the components that protocol designers need to write. Next, in Section IV, we explain how to implement a prototype with DKVF in detail. In Section V, we focus on using YCSB to benchmark prototypes created by DKVF. We introduce DKVF tools in Section VI. We provide our experimental results and analysis in Section VII. Finally, we provide our conclusion and future work in Section VIII.

II. BACKGROUND
In this section, we briefly provide an overview of distributed key-value stores. We also very briefly explain the consistency protocols that we have implemented with DKVF in this paper.
A. Distributed Key-value Stores
Key-value stores provide a simple abstraction to store and retrieve our data. Each key-value store is a set of ⟨key, value⟩ pairs. A key-value store has two basic operations: PUT(k, v) and GET(k). PUT(k, v) writes a new version with value v for the data item with key k, and GET(k) reads the value of the data item with key k. A key-value store can store multiple versions for each key, or store only one version for each key. In the single-version type, any time that we update the value, we overwrite the previous version, while in the multi-version type we keep a version chain for each key. Key-value stores are the basis of document-oriented databases. These databases can handle one-to-many relations more efficiently than relational databases. Specifically, we can encode all pieces of data related to a key as a JSON (or XML) document, and store it as the value for that key in our key-value store. This approach provides better locality than the multi-table schemas that are normally used in relational databases [15].

The schemaless data model of document-oriented databases gives us freedom from strict schemas at write time, and lets the application interpret the structure of the encoded value at read time. This also eliminates the need for the awkward object-relational mapping layers that we usually need for relational databases. Specifically, we can easily encode/decode fields of an object to/from documents that we store in our database. Many new storage systems such as MongoDB [4], RethinkDB [7], CouchDB [1], and Espresso [20] support the document-oriented data model.

Typically, for practical systems, to increase the performance, we must distribute our key-value store. The two main techniques to distribute the data are partitioning and replication. Partitioning (also known as sharding) allows us to store our data on more than one machine. Specifically, we can divide the key space into several parts, and store each part on a different machine. Usually, we partition the data in a way that each key is assigned to exactly one partition. Partitioning increases the scalability of our system, as we do not need to fit the whole key space on a single machine. It also enables us to scale our query throughput by adding more nodes [15].

Replication, on the other hand, improves durability, availability, and performance. Specifically, by keeping multiple copies of the data in several replicas, we can increase the durability of our data. It also improves the availability of the system, as, if a replica fails, other replicas can serve the clients. Moreover, using replication, we can keep data geographically close to the clients, thereby reducing the network latency. Because of these benefits, geo-replicated data stores have become an important building block of today's Internet services [19].
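To make the PUT/GET interface and version chains concrete, here is a minimal in-memory sketch (ours, not part of DKVF) of a multi-version key-value store; synchronization is omitted for brevity.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Minimal multi-version key-value store: each key maps to a version
// chain with the most recent version at the head.
public class VersionedStore<V> {
    private final Map<String, Deque<V>> chains = new HashMap<>();

    // PUT(k, v): prepend a new version to the version chain of key k.
    public void put(String k, V v) {
        chains.computeIfAbsent(k, key -> new ArrayDeque<>()).addFirst(v);
    }

    // GET(k): return the most recent version, or null if the key is absent.
    public V get(String k) {
        Deque<V> chain = chains.get(k);
        return (chain == null) ? null : chain.peekFirst();
    }
}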
B. Consistency

When we copy the same data across several replicas, we need to make sure that all clients see a consistent view of the data. Different levels of consistency are defined for distributed data stores. In light of the CAP theorem [14], we know there is an inherent trade-off between availability and consistency. The trade-off also exists between performance and consistency [9]. Specifically, the stronger a consistency model is, the higher is its performance overhead. Even for a certain consistency model, it matters how we achieve it, i.e., two different protocols that achieve equal levels of consistency do not necessarily have the same performance. Thus, any time that we come up with a new consistency protocol, we need to evaluate its overhead. The goal of DKVF is to help protocol designers to create a prototype key-value store running their protocols for evaluation purposes. To show the effectiveness of DKVF, we have implemented four protocols using DKVF that we briefly review here.

The first protocol is the eventual consistency protocol. Eventual consistency requires that two connected replicas must finally converge to the same state in the absence of new writes [22]. We can achieve eventual consistency as follows: each replica sends the new updates that occur in it to the other replicas. The receiving replicas simply apply the new updates as they receive them. To converge to the same values, however, all replicas must follow the same rule in applying the updates. Specifically, they must order the versions of a key in the same way. Dynamo [12] and Cassandra [17] are two examples of data stores that provide eventual consistency.

In addition to the eventual consistency protocol, we also consider three causal consistency protocols. These protocols guarantee that when a version is visible to a client, all of its causal dependencies [21] are also visible. COPS [19] is one of the first protocols for causally consistent distributed key-value stores. It guarantees causal consistency by explicitly tracking the causal dependencies of a version. Specifically, we keep track of the versions that a client reads. Then, once the client writes a value for some key, we consider all the versions that the client has read as the causal dependencies of the new version being written by the client. Each replica sends any new update done by itself to the other replicas. When a replica receives a replicate message, it does not make it visible to its clients until it has made sure that all of the dependencies of the version are visible in the replica. Since, inside each replica, we have multiple partitions, we have to send dependency check messages to other partitions to check the dependencies. This explicit dependency tracking can affect the performance of the system.

GentleRain [13] is another protocol that we implement using DKVF. To avoid the explicit dependency checking mechanism of COPS, GentleRain uses implicit dependency tracking via synchronized physical clocks. It assigns each version a timestamp which is the value of the physical clock at the time of the write. GentleRain assigns timestamps in such a way that if one version depends on another, the timestamp assigned to it is greater than that of the version it depends on. To satisfy this requirement, GentleRain may need to delay some PUT operations if the physical clocks of the partitions are not perfectly synchronized. Next, each replica calculates a Global Stable Time (GST), which is a value such that any version with a timestamp smaller than it is visible inside the replica. Partitions inside a replica need to communicate with each other to calculate the GST.
Now, when a client asks for a key, we give the client the most recent version of the key that has a timestamp smaller than the GST. This guarantees that versions are visible only after their causal dependencies.

The delay in PUT operations that GentleRain requires can affect the write throughput of the key-value store. This is especially important for modern workloads that require very high write throughput [15]. This issue is made worse in the presence of query amplification, where a single end-user request translates to many internal operations. In this situation, any delay in any of the internal queries increases the final response time perceived by the client, and affects the end-user experience [10], [21]. CausalSpartan solves this issue by replacing the physical clock with a Hybrid Logical Clock (HLC) [16]. HLC, as the name suggests, has a hybrid nature that includes the benefits of both logical clocks [18] and physical clocks. On one hand, it provides the logical clock property (i.e., if f depends on e, then the HLC timestamp of f is larger than that of e). On the other hand, unlike a logical clock that has no relation to the physical clock, HLC values are very close to the physical clock. Also, like a physical clock, HLC advances spontaneously. Using HLCs, we do not need to force any delay for PUT operations, thereby decreasing the response time and improving the throughput. CausalSpartan also improves update visibility latency, which allows us to make remote updates visible sooner than GentleRain in case of a slow replica in the system. To lower update visibility latency, however, CausalSpartan increases the size of the metadata (proportional to the number of replicas).
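To give a flavor of how HLC combines logical and physical clocks, the following sketch shows the HLC update rule for a send or local event, adapted from [16]; it is illustrative and not DKVF code (the receive rule, which also merges the remote timestamp, is omitted).

// Sketch of the HLC update rule for a send or local event [16].
// l tracks the largest physical time seen so far; c is a counter that
// captures causality among events with the same l.
public class HybridLogicalClock {
    private long l = 0; // physical component of the HLC timestamp
    private long c = 0; // logical (counter) component

    // pt is the current reading of the local physical clock.
    public synchronized long[] sendOrLocalEvent(long pt) {
        long lPrev = l;
        l = Math.max(lPrev, pt);      // the HLC never falls behind the physical clock
        c = (l == lPrev) ? c + 1 : 0; // bump the counter only if l did not advance
        return new long[] { l, c };
    }
}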
III. OVERVIEW OF DKVF

DKVF is written in Java. Each key-value store created based on DKVF has two sides: 1) a server side, and 2) a client side. The server side (respectively, client side) extends the server side (respectively, client side) of DKVF by implementing the respective abstract methods and adding new methods required for the protocol at hand.

When we create a new protocol, in addition to the actual data consisting of the key-value pairs, we will likely need to store some metadata with each record. For example, we may need to store a timestamp with each version, or we may need to store the ID of the replica where the version has been written. Each protocol requires its own metadata. DKVF relies on Google Protocol Buffers [3] (referred to as protobuf from now on) for marshalling/unmarshalling data for storage and transmission. An important advantage of protobuf is its convenience for the protocol designer to describe the metadata, as the protocol designer only needs to write a simple text file, and protobuf takes care of creating the necessary code. Another important advantage of protobuf is its effective way of compressing the data using variable-length (varint) encoding that saves storage space and network bandwidth. The protobuf description together with the server and client sides of the protocol are the components that the protocol designer needs to provide for any key-value store based on DKVF. These components are shown by dark rectangles in Figure 1. We will focus on these components in Section IV.

Once a key-value store is ready, we can use it for our applications. An application can be any program (website, mobile application, etc.) that needs to access storage resources through the network. We refer to the entity that provides the storage resources as the storage provider. The storage provider runs the server side of the key-value store. To do that, the storage provider needs to write a configuration file. This configuration file is an XML file according to
Config.xsd [2], and describes the cluster and server-side parameters. Once the server side is running, different applications can connect to it through the client side of the key-value store and use the storage resources. The application developers also need to write a configuration file that specifies the servers to connect to and client-side parameters. The server-side and client-side configurations are shown by white rectangles in Figure 1. These configuration files, together with the three components that the protocol designer needs to write, are the five components that we need to provide to have a running key-value store based on DKVF.
The Application rectangle in Figure 1 captures any client program that uses a key-value store server. While the exact application is orthogonal to DKVF, DKVF can be used to provide suitable benchmarks so that the application designer can choose the suitable protocol based on these benchmarks. DKVF relies on YCSB [11] for benchmarking. When we benchmark our key-value store using YCSB, the YCSB client becomes our application in Figure 1.

Fig. 1. Typical usage of DKVF

DKVF can be configured to use any storage engine, provided the storage developer implements the necessary drivers. DKVF comes with a driver for Berkeley DB. We can configure the default storage to be multi-version or single-version. In the case of multi-version storage, we have to provide a comparator function to order versions with the same key. DKVF also provides a simple approach for the addition of new storage engines. An important question regarding storage is how we want to replicate data. Data replication can be done either by the storage engine itself or by the protocol. The default storage delegates data replication to the protocol. This gives full control of data replication to the protocol designer. However, we can configure DKVF for the case where the storage engine handles data replication.
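For example, if records carry the GentleRain metadata shown later in Listing 1 (an update time ut and a source replica ID sr), a version comparator could order versions from newest to oldest. This is an illustrative sketch, not the exact comparator interface expected by the default storage.

import java.util.Comparator;

public class RecordComparators {
    // Order versions of the same key from newest to oldest: compare by
    // update time (ut) and break ties by source replica ID (sr).
    // Assumes the protobuf-generated Record class of Listing 1.
    public static final Comparator<Record> NEWEST_FIRST =
            Comparator.comparingLong(Record::getUt)
                      .thenComparingInt(Record::getSr)
                      .reversed();
}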
IV. CREATING A PROTOTYPE USING DKVF

The overall usage of DKVF is as shown in Figure 1. When a designer intends to develop a new protocol, he/she needs to specify three components (shown by dark rectangles in Figure 1). These components are 1) the metadata description, 2) the server side of the protocol, and 3) the client side of the protocol. We explain each of these components in this section.

A. Metadata Description
Describing metadata is done by writing a text file with a .proto extension that contains a set of message blocks. You can think of a message as a class or struct in a programming language. Each message has a set of fields. Each field has a type that is either a primitive type like integer, or another message. Any metadata description written for DKVF must include four messages: 1) Record, 2) ClientMessage, 3) ServerMessage, and 4) ClientReply. Record describes the records that will be stored in the key-value store. For instance, if we want to store a timestamp with each record, we need to add an int64 field to the Record message to store a 64-bit Java long variable with each record. ClientMessage and ServerMessage describe client and server messages, respectively. ClientReply describes a response to a client message.

As an illustration, consider the case where we implement the GentleRain protocol [13] using DKVF. In GentleRain, each record (i.e., data item) is a tuple ⟨k, v, ut, sr⟩ where k is the key, v is the value, ut is the update time (timestamp of the current version), and sr is the ID of the replica where the current version has been written. Listing 1 shows the corresponding protobuf description for GentleRain records. The numbers in front of the fields are tag numbers that protobuf uses for optimization. Tag numbers must be unique positive integers, and we should assign smaller values to the fields that are used more frequently [3].

Listing 1. Protobuf description for GentleRain record

message Record {
  string k = 1;
  bytes v = 2;
  int64 ut = 3;
  int32 sr = 4;
}

In addition to the metadata for the records, we also use protobuf to describe the messages that servers/clients send in our protocol. For instance, consider the GETREQ message in GentleRain [13]. Each GETREQ message has a string to specify the key that we want to read, and an integer for the GST value used by the protocol to find a consistent version (see Section II-B). Listing 2 shows the necessary protobuf description for the GETREQ message. Similarly, we can write descriptions for the other messages of the protocol.

Listing 2. Protobuf description for GentleRain GETREQ message

message GetMessage {
  string k = 1;
  int64 gst = 2;
}

After writing the metadata description, we need to compile it using protobuf. protobuf will create a Java class that contains all the necessary data structures for marshalling/unmarshalling our data. This class is shown by the Metadata Class rectangle in Figure 1.
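As a quick illustration of the generated class, the following sketch builds, serializes, and parses a GentleRain record. It assumes the Record message of Listing 1 and that the generated Record class is on the classpath (the exact package depends on the options in the .proto file).

import com.google.protobuf.ByteString;
import com.google.protobuf.InvalidProtocolBufferException;

public class RecordRoundTrip {
    public static void main(String[] args) throws InvalidProtocolBufferException {
        // Build a record with the fields of Listing 1.
        Record rec = Record.newBuilder()
                .setK("user:42")
                .setV(ByteString.copyFromUtf8("some value"))
                .setUt(System.currentTimeMillis()) // update time
                .setSr(1)                          // source replica ID
                .build();

        // protobuf handles marshalling/unmarshalling for storage and transmission.
        byte[] wire = rec.toByteArray();
        Record decoded = Record.parseFrom(wire);
        System.out.println(decoded.getK() + " written at replica " + decoded.getSr());
    }
}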
B. Server Side Implementation
To implement the server side of a protocol, we need to write a class that extends the abstract class DKVFServer. (The full metadata description needed for the GentleRain protocol is provided in the Appendix.)

DKVF follows an event-driven approach to define a protocol. Specifically, we can define a protocol as a set of event handlers. The two main event handlers that will be called by the framework are handleServerMessage and handleClientMessage of the DKVFServer class. Inside these two main event handlers, the protocol designer can call detailed event handlers for different events. A protocol can also have other event handlers that are not called by the framework. For instance, GentleRain [13] and CausalSpartan [21] have event handlers that are constantly called at a certain rate. Listing 3 shows the overall structure of the GentleRainServer class that implements GentleRain on top of DKVF. The bodies of the event handlers are left blank for the sake of presentation.
Listing 3. Overall structure of GentleRainServer

public class GentleRainServer extends DKVFServer {
    @Override
    public void handleClientMessage(ClientMessageAgent cma) {
        if (cma.getClientMessage().hasGetMessage()) {
            handleGetMessage(cma);
        } else if (cma.getClientMessage().hasPutMessage()) {
            handlePutMessage(cma);
        }
    }

    @Override
    public void handleServerMessage(ServerMessage sm) {
        if (sm.hasReplicateMessage()) {
            handleReplicateMessage(sm);
        } else if (sm.hasHeartbeatMessage()) {
            handleHeartbeatMessage(sm);
        } else if (sm.hasVvMessage()) {
            handleVvMessage(sm);
        } else if (sm.hasGstMessage()) {
            handleGstMessage(sm);
        }
    }

    private void handleGetMessage(ClientMessageAgent cma) {
        // TODO Handle GET messages here
    }

    private void handlePutMessage(ClientMessageAgent cma) {
        // TODO Handle PUT messages here
    }

    private void handleReplicateMessage(ServerMessage sm) {
        // TODO Handle Replicate messages here
    }

    private void handleHeartbeatMessage(ServerMessage sm) {
        // TODO Handle Heartbeat messages here
    }

    private void handleVvMessage(ServerMessage sm) {
        // TODO Handle VV messages here
    }

    private void handleGstMessage(ServerMessage sm) {
        // TODO Handle GST messages here
    }
}

handleServerMessage receives an object of the class ServerMessage, which is created by protobuf from our metadata description explained in Section IV-A. handleClientMessage receives an object of the class ClientMessageAgent that includes an object of the class ClientMessage created by protobuf.

While we are processing server or client messages in handleServerMessage and handleClientMessage, we may need to send messages to other servers, or send responses to clients. To send a message to another server, the framework provides the convenient sendToServer method that receives the ID of the destination and an object of the ServerMessage class. The mapping between server IDs and their actual addresses must be defined in the configuration file. DKVF takes care of asynchronous reliable FIFO delivery of the message to the destination. Specifically, if the receiver cannot receive the message (e.g., it has crashed, or there is a network partition), DKVF stores the message and tries to send it later. In the configuration file, we can specify the amount of time to wait before resending the message. Also, we can set the capacity of the queue of undelivered messages; if the limit of waiting messages is reached, DKVF throws an exception. Calling sendToServer is thread-safe. Thus, the protocol designer does not need to worry about concurrency or failure issues. To send the response to the client, the ClientMessageAgent class provides the sendReply method that allows us to send the response to the client message.

While we are processing client/server messages, we also need to store or retrieve data from the storage engine. The DKVFServer class provides methods that can be used for this purpose. Two main methods are read(String k, Predicate<Record> p, List<Record> result), which retrieves into result the versions of key k that satisfy predicate p, and a corresponding insert method for writing new versions. As an illustration, Listing 4 shows the DKVF implementation of the GET request handler of GentleRain, whose pseudocode (copied from [13]) is shown in Algorithm 1. The handler sends the reply to the client via the ClientMessageAgent object. Note that in Listing 4, we ignore exception and error handling for the sake of presentation. (The full DKVF implementation of the GentleRain protocol is provided in the Appendix.)

Listing 4. DKVF implementation of the GentleRain GET request handler

private void handleGetMessage(ClientMessageAgent cma) {
    GetMessage gm = cma.getClientMessage().getGetMessage();
    updateGst(gm.getGst()); // thread-safely update the GST
    List<Record> result = new ArrayList<>();
    StorageStatus ss = read(gm.getKey(), (Record r) -> {
        return (m == r.getSr() || r.getUt() <= gst.get()); // m is the local replica ID
    }, result);
    Record rec = result.get(0);
    ClientReply cr = ClientReply.newBuilder()
            .setStatus(true)
            .setGetReply(GetReply.newBuilder()
                    .setValue(rec.getValue())
                    .setUt(rec.getUt())
                    .setGst(gst.get()))
            .build();
    cma.sendReply(cr);
}
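As an illustration of sendToServer, a protocol's PUT handler typically propagates a newly written version to the peer partitions in other replicas. The sketch below is illustrative only: the peerIds collection and the fields of the replicate message are our assumptions, not the exact messages of the GentleRain implementation in [2].

// Illustrative sketch: after storing a new version locally, propagate
// it to peer partitions in other replicas. The replicate message layout
// and the peerIds collection are assumptions for illustration.
private void replicate(Record rec, Iterable<String> peerIds) {
    ServerMessage sm = ServerMessage.newBuilder()
            .setReplicateMessage(ReplicateMessage.newBuilder()
                    .setRec(rec)) // assumed field carrying the new version
            .build();
    for (String id : peerIds) {
        sendToServer(id, sm); // asynchronous reliable FIFO delivery per destination
    }
}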
Algorithm 1. Pseudocode of the GET request handler of the GentleRain protocol (copied from [13])

Upon receive ⟨GETREQ k, gst⟩
    GSTmn ← max(GSTmn, gst)
    obtain latest version d from version chain of key k s.t. d.sr = m, or d.ut < GSTmn
    send ⟨GETREPLY d.v, d.ut, GSTmn⟩ to client
C. Client Side Implementation

To implement the client side of a protocol, we need to extend the client part of the framework. Specifically, we need to write a class that extends the class DKVFClient. When we extend DKVFClient, we have to implement the two abstract methods put and get, which are the basic PUT and GET operations of a key-value store. These methods are the operations that the protocol designer needs to provide for the application developer. The application developer can later use these methods to use the data store (see Figure 1). The protocol designer can also add more complex operations to the implementation, but these two methods are required for any implementation.

To process application requests, the client part needs to send client messages and receive responses from the servers. Finding the correct node to send a request to is the problem of service discovery [15]. DKVF does not force any service discovery policy, and lets the protocol define it. DKVF, on the other hand, provides convenient ways to send/receive messages to/from servers via their IDs specified in the client configuration file. Specifically, sendToServer(String id, ClientMessage cm) sends a client message to the server with ID id, and readFromServer(String id) reads the response from the server with ID id.

Now, let us consider the client side of the PUT operation of GentleRain. Algorithm 2 shows the PUT operation at the client side in GentleRain. Listing 5 shows the corresponding DKVF code. To find the correct server to send the PUT request to, we call the findPartition function. The DKVF Utils library provides utilities to distribute the keys according to their hash values. The rest of the handler is clear and identical to the pseudocode.
Listing 5. DKVF implementation of the GentleRain client-side PUT handler

public boolean put(String key, byte[] value) {
    ClientMessage cm = ClientMessage.newBuilder()
            .setPutMessage(PutMessage.newBuilder()
                    .setDt(dt)
                    .setKey(key)
                    .setValue(ByteString.copyFrom(value)))
            .build();
    String serverId = findPartition(key); // finds the ID of the server responsible for this key
    sendToServer(serverId, cm);
    ClientReply cr = readFromServer(serverId);
    dt = Math.max(dt, cr.getPutReply().getUt());
    return cr.getStatus();
}
Algorithm 2. Pseudocode of the PUT handler of the client side of the GentleRain protocol (copied from [13])

PUT(key k, value v)
    send ⟨PUTREQ k, v, DTc⟩ to server
    receive ⟨PUTREPLY ut⟩
    DTc = max(DTc, ut)
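From the application developer's point of view, the resulting client is then used like any other key-value store client. The following is a minimal sketch; it assumes a client class named GentleRainClient built as above, that construction follows the client configuration file of Section III, and that get returns the raw value bytes.

// Illustrative application-side usage of a DKVF-based client.
public class GreetingApp {
    // The client is assumed to be constructed according to the
    // client configuration file described in Section III.
    static void demo(GentleRainClient client) throws Exception {
        client.put("greeting", "Hello DKVF".getBytes()); // basic PUT operation
        byte[] read = client.get("greeting");            // basic GET operation
        System.out.println(new String(read));
    }
}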
V. BENCHMARKING WITH YCSB

YCSB, originally developed by Yahoo!, is a tool for evaluating the performance of key-value and cloud serving stores [11]. To use YCSB, we need to write a YCSB driver that lets the YCSB client class use our key-value store. YCSB has a core workload generator. We can specify different parameters for the core workload generator such as the read proportion, insert proportion, value size, number of client threads, number of operations, and so on. Once we have specified the workload and driver for YCSB, we can run it to benchmark our system. YCSB gives us different measurements such as throughput and latencies.

DKVF comes with a driver for YCSB. Thus, any key-value store written based on DKVF has its YCSB driver ready. DKVF also includes a workload generator. The DKVF YCSB workload generator extends the YCSB core workload generator by adding new operations such as amplified insert to benchmark the system against query amplification (see Section II-B). This feature allows us to evaluate the performance of macro operations that reveal bottlenecks when a query results in multiple operations on the key-value store.

Figure 2 shows the components involved in benchmarking a key-value store created by DKVF. The person who wants to benchmark the system, referred to as the benchmark generator, needs to provide the two components shown by dark rectangles in Figure 2. The first component is the workload properties. The benchmark generator can specify any YCSB core properties for the workload. For benchmarking query amplification, we can specify the amplification factor. The benchmark generator also needs to provide a client configuration file that specifies the servers to connect to, and other client-side parameters (see Section III).

The workload generator is also extensible. Specifically, if we want to benchmark an operation that is not included in DKVF, we need to implement a customized YCSB driver and workload generator. We refer the reader to [11] for details.
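DKVF already bundles such a driver, but to give an idea of its shape, the sketch below forwards YCSB operations to a DKVF-based client. It is illustrative only: the class and member names are ours, the single-field encoding is a simplification, and the com.yahoo.ycsb.DB signatures shown here follow recent YCSB releases and may differ across versions.

import com.yahoo.ycsb.ByteArrayByteIterator;
import com.yahoo.ycsb.ByteIterator;
import com.yahoo.ycsb.DB;
import com.yahoo.ycsb.Status;

import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.Vector;

// Illustrative sketch of a YCSB driver that delegates to a DKVF client.
public class IllustrativeDkvfDriver extends DB {
    private GentleRainClient client; // would be created in init() from the client configuration

    @Override
    public Status read(String table, String key, Set<String> fields,
                       Map<String, ByteIterator> result) {
        byte[] value = client.get(key);
        if (value == null) {
            return Status.NOT_FOUND;
        }
        result.put("field0", new ByteArrayByteIterator(value)); // single-field simplification
        return Status.OK;
    }

    @Override
    public Status insert(String table, String key, Map<String, ByteIterator> values) {
        // Collapse all YCSB fields into one value; a real driver encodes fields properly.
        byte[] value = values.values().iterator().next().toArray();
        return client.put(key, value) ? Status.OK : Status.ERROR;
    }

    @Override
    public Status update(String table, String key, Map<String, ByteIterator> values) {
        return insert(table, key, values); // PUT overwrites, so update behaves like insert
    }

    @Override
    public Status scan(String table, String startkey, int recordcount, Set<String> fields,
                       Vector<HashMap<String, ByteIterator>> result) {
        return Status.NOT_IMPLEMENTED; // the stores here expose only PUT/GET
    }

    @Override
    public Status delete(String table, String key) {
        return Status.NOT_IMPLEMENTED;
    }
}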
Fig. 2. Using YCSB for evaluating a prototype created by DKVF
VI. TOOLS
In this section, we introduce two tools that help protocol designers to run and benchmark their distributed key-value stores created with DKVF. These tools can save us a great deal of time and headache in running and benchmarking our systems.
A. Cluster Manager
Cluster Manager is a command line application to facilitate managing clusters running key-value stores created with DKVF. It also helps us to run distributed YCSB experiments. Using this tool, we can benchmark our key-value store without directly setting up YCSB; we only need to define our desired workload, and Cluster Manager takes care of the rest.

To run a cluster, we need to write a cluster descriptor file. This descriptor file is an XML file according to ClusterDiscriptor.xsd [2], and specifies various aspects such as the IP addresses of the servers, the port numbers to listen on for incoming client/server messages, the topology of the servers, and so on. After loading a cluster descriptor file, we can use Cluster Manager to start all servers. Cluster Manager also enables us to monitor the servers. For instance, we can see if the servers have properly started and connected to each other, what the network latencies are, or how many clients are connected to each server.

Cluster Manager also helps us to test and debug our key-value store. Specifically, after running a cluster, we can connect to any server in the cluster and run commands on the servers. For instance, suppose we want to test the convergence of our protocol. We can connect to a replica, and write a value for some key. Next, we can connect to another replica to see if our write has been replicated to the second replica properly. This kind of debugging is very convenient with Cluster Manager. Cluster Manager uses an instance of the client side of our key-value store to interact with the servers. Thus, we need to specify our client class for Cluster Manager in the cluster descriptor file.

After running a cluster and testing it with Cluster Manager, we can conduct an experiment to see how well our protocol performs. We need to write an experiment descriptor file for each experiment. The experiment descriptor file is an XML file according to ExperimentDescriptor.xsd [2], and specifies experiment-related parameters such as how many clients we want to run, the addresses of the client machines, which servers each client is connected to, the workloads, and so on. We can define a set of experiments for Cluster Manager in our descriptor file. After loading an experiment descriptor file, we can use Cluster Manager to run the experiments. Cluster Manager conducts the experiments one by one by running YCSB clients, and gathers the results from the clients. To aggregate the results, Cluster Manager provides us with a minimal query language that lets us select the measurements we want to aggregate and specify how we want to aggregate them (e.g., taking the average).
B. Cluster Designer
Although Cluster Manager is a convenient tool that can significantly reduce the time and headache of debugging and benchmarking our protocol, writing cluster and experiment descriptor files can be a tedious and of course error-prone task for larger clusters. To solve this issue, we provide the Cluster Designer tool. Cluster Designer is a graphical tool that allows us to define our cluster and experiments visually. The tool provides an area where we can add servers and clients. We can connect servers and clients by lines to specify network connections. When we have several components that all need to be connected to each other, we can use hubs to avoid connecting them one by one. Figure 3 shows the interface of Cluster Designer. In this network, we have 6 servers and 6 clients. We will talk about this network in more detail in Section VII.

Fig. 3. The graphical interface of Cluster Designer

TABLE I. The number of lines of code that we wrote to implement different protocols with DKVF.

Protocol       | Server Side | Client Side | Metadata
Eventual       | 95          | 58          | 32
COPS           | 269         | 84          | 45
GentleRain     | 226         | 61          | 50
CausalSpartan  | 292         | 118         | 53

We can define default configurations for servers/clients. We can later tailor the default configurations for an individual server/client. After designing our cluster and experiments, we can use Cluster Designer to export the descriptor files. We can later use Cluster Manager to run our cluster and experiments as explained in Section VI-A.

VII. EXPERIMENTAL RESULTS
In this section, we present some of the results that we obtained from implementing and evaluating three causal consistency protocols, namely COPS [19], GentleRain [13], and CausalSpartan [21], using DKVF. (We have implemented a simplified version of COPS without garbage collection.) We also implemented eventual consistency for comparison. Table I shows the number of lines of code that we wrote to implement each of these protocols. For each protocol, we have reported the number of lines that we wrote for the server side, the client side, and for describing our metadata in the .proto file for protobuf. Of course, the number of lines of code is not an accurate indicator, as different people may write the same program in different ways, but we report them here just to give an estimate of the coding effort that we needed to put in to implement these protocols using DKVF. You can access our implementations at [2].

Without DKVF, we needed 843 lines to implement CausalSpartan and 769 lines to implement GentleRain. In implementing CausalSpartan and GentleRain without DKVF, we used Netty [5] for network communications. The number of lines of code that we needed to implement CausalSpartan and GentleRain with DKVF is around 40 percent of what we needed to implement them without DKVF. Note that reducing the number of lines of code is not the only goal of DKVF. Instead, using DKVF has all the benefits that we mentioned in the introduction.

We have not implemented COPS and eventual consistency without DKVF. With DKVF, it took only 2 days to implement COPS based on the description of the protocol in [19]. Furthermore, all the code developed with DKVF essentially required us to convert the pseudocode in the respective papers into Java and add error handling that is generally omitted in the pseudocode. In this sense, writing the code required with DKVF was straightforward.
A. Experimental Setup
We consider a replicated and partitioned data store as shown in Figure 3. The data store consists of two replicas, and each replica consists of three partitions. We assume full replication, i.e., each replica has a copy of the entire key space. The key space inside each replica is partitioned among its servers. In Figure 3, we have connected the servers inside each replica together with a hub. Partitions are also connected to their peers in the other replica. For servers, we use AWS m3.medium instances with the following specification: 1 vCPU, 2.5 GHz Intel Xeon E5-2670v2, 3.75 GiB memory, 1 x 4 GB SSD storage.

Connected to each replica, we have a set of clients. We allocate three client machines to run clients. We run 30 threads of YCSB clients on each client machine. All causal consistency protocols that we study here assume locality of traffic, i.e., clients always access one replica. Thus, clients are connected to only one replica, as shown in Figure 3. We run clients on c3.large machines with the following specification: 2 vCPUs, 2.8 GHz Intel Xeon E5-2680v2, 3.75 GiB memory, 2 x 16 GB SSD storage. We have used more powerful machines for clients to better utilize our servers.

B. The Effect of Workload on Performance
The workloads of different applications have different characteristics. Some workloads are write-heavy; others, like those in data analytics, are read-heavy. In this section, we want to study how the characteristics of our workload affect the performance of different consistency protocols. In all experiments, we set the size of the values written by clients to 64 bytes.

Figure 4 shows how the GET:PUT proportion affects the throughput. As we move from the left side of the plot to its right side, the workload nature changes from write-heavy to read-heavy. The throughputs of all protocols increase as the proportion of GET operations increases. These results confirm previous studies [13], [19], and are expected, as GET operations are lighter than PUT. As expected, eventual consistency has the highest throughput. COPS, on the other hand, has the lowest throughput. These results confirm the results published in [13], and are due to the overhead of the dependency check messages that partitions send to each other to make sure the causal dependencies of an update are visible in the other partitions (see Section II-B).

Fig. 4. Throughput vs. GET:PUT Proportion

Figure 5 shows how the GET:PUT proportion affects the response time of PUT operations. In all protocols, the response time of PUT operations decreases as we move to read-heavier workloads. This is due to the lower load on the servers for read-heavier workloads. Eventual consistency has the shortest response time thanks to its minimal metadata. CausalSpartan has more metadata than GentleRain, resulting in a higher PUT response time. COPS has the highest response time because of its dependency check messages and its explicit dependency tracking approach. Like the other protocols, the trend of the PUT response time for COPS is decreasing as we move toward read-heavier workloads, which can be explained by the lower load on the machines. However, for 0.05:0.95, the PUT response time increases. This increase can be understood by considering the dependency tracking mechanism of COPS. At point 0.05:0.95, clients read many keys before writing a key. That results in longer dependency lists, which make PUT messages heavier to transmit and process. Note that we have implemented a basic version of the COPS protocol without client metadata garbage collection. The COPS authors suggest a garbage collection mechanism to cope with this problem [19].

Fig. 5. Average PUT Response Time vs. GET:PUT Proportion

Figure 6 shows how the GET:PUT proportion affects the response time of GET operations. As in the case of PUT operations, the response time of GET operations also decreases as we move toward read-heavier workloads. It is interesting that GentleRain and CausalSpartan have a lower response time for GET operations compared to eventual consistency for write-heavy workloads. This can be explained by the synchronization that occurs between threads in GentleRain and CausalSpartan. Specifically, there is contention between threads while performing PUT operations in GentleRain/CausalSpartan. This contention occurs for obtaining a lock that we used to guarantee that updates with smaller timestamps are replicated to other nodes before updates with higher timestamps. This increases the PUT response time, which results in the lower overall throughput of GentleRain/CausalSpartan for write-heavy workloads. While the threads serving PUT operations are waiting for synchronization, the server can handle GET operations. On the other hand, in eventual consistency, there is no competition between PUT operations. Thus, there are more active threads serving PUT operations, leading to higher competition over the CPU, which finally results in a higher GET response time compared to GentleRain/CausalSpartan. Note that this happens for write-heavy workloads with a low GET proportion. Therefore, eventual consistency still has the highest overall throughput in all cases (see Figure 4).

Fig. 6. Average GET Response Time vs. GET:PUT Proportion
C. The Effect of Query Amplification
In this section, we study the effect of query amplification on the performance of the system. Here, we only consider one replica consisting of three partitions. We consider a workload that purely consists of amplified insert operations. Each amplified insert consists of several internal PUT operations. The number of internal PUT operations is defined by the amplification factor.

Figure 7 shows the effect of the amplification factor on the client request throughput. Note that this throughput represents the number of client macro operations (not individual PUT operations) that are served in one second. As the amplification factor increases, the throughput of all protocols decreases, which is expected, as requests with a higher amplification factor include more internal operations, which means more work to do for each request. Eventual consistency has the highest throughput. The pure-write workload is an ideal write scenario for COPS, as dependency lists have at most one entry. Thus, the throughput of COPS is the highest after eventual consistency for this scenario. GentleRain has the lowest throughput. That is due to the delay that GentleRain imposes on PUT operations in case of clock skew between servers. Note that we synchronized the physical clocks of the system with NTP [6], but the effect of clock skew still shows up in the results. These results confirm previous results presented in [21]. CausalSpartan has higher throughput than GentleRain, as CausalSpartan eliminates the need for the delay before PUT operations by utilizing HLCs instead of physical clocks [21].

Fig. 7. Throughput vs. Amplification Factor

Figure 8 shows the request response time for the different protocols. Again, because of the delays that GentleRain forces on PUT operations, the request response time has the highest value for GentleRain.

Fig. 8. Request Response Time vs. Amplification Factor

VIII. CONCLUSION AND FUTURE WORK
In this paper, we introduced DKVF, a framework for rapid prototyping and benchmarking of distributed key-value stores. It streamlines the evaluation of the performance of consistency protocols for distributed key-value stores. To show the effectiveness of our framework, we implemented four consistency protocols using DKVF. Thanks to the convenience of DKVF, we were able to implement each of these protocols in less than 2 days. We were able to implement CausalSpartan and GentleRain with significantly less effort than our previous implementations without DKVF. Note that in implementing CausalSpartan and GentleRain without DKVF, we used Netty [5], which helped us with network communications. Although frameworks like Netty streamline network programming, to implement a distributed key-value store we still have to write code for many parts that are independent of the logic of the protocol. DKVF and its toolset provide a more straightforward framework that is specialized for developing distributed key-value stores. We believe other groups can also benefit from DKVF by reducing the necessary implementation efforts.

DKVF relies on YCSB for benchmarking. The toolset that comes with the framework helps protocol designers to easily evaluate their prototypes. We evaluated the prototypes that we developed with DKVF using the tools provided by the framework. Our results are consistent with what has been previously reported in the literature, and also with our previous results from prototypes that we developed without DKVF.

We can use any storage system as the storage engine for the key-value stores that we develop with DKVF. This enables protocol designers to flexibly change their storage engine without touching the implementation of their consistency protocol. To use a given storage system with DKVF, we need to write a driver for it that enables DKVF to interact with it. DKVF comes with a driver for Berkeley DB. Writing drivers for other storage systems is part of the future work. Also, DKVF is designed to be extensible so that these drivers can be easily added by others.

REFERENCES
[1] Apache CouchDB. http://couchdb.apache.org/.
[2] DKVF. https://github.com/roohitavaf/DKVF.
[3] Google Protocol Buffers. https://developers.google.com/protocol-buffers/.
[4] MongoDB. https://www.mongodb.com/.
[5] Netty. https://netty.io/.
[6] NTP: The Network Time Protocol. http://www.ntp.org/.
[7] RethinkDB. https://www.rethinkdb.com/.
[8] Voldemort. http://www.project-voldemort.com/.
[9] Daniel Abadi. Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. Computer, 45(2):37–42, 2012.
[10] Phillipe Ajoux, Nathan Bronson, Sanjeev Kumar, Wyatt Lloyd, and Kaushik Veeraraghavan. Challenges to adopting stronger consistency at scale. In HotOS, 2015.
[11] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, pages 143–154. ACM, 2010.
[12] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. SIGOPS Oper. Syst. Rev., 41(6):205–220, October 2007.
[13] Jiaqing Du, Călin Iorgulescu, Amitabha Roy, and Willy Zwaenepoel. GentleRain: Cheap and scalable causal consistency with physical clocks. In Proceedings of the ACM Symposium on Cloud Computing, SOCC '14, pages 4:1–4:13, New York, NY, USA, 2014.
[14] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 33(2):51–59, June 2002.
[15] Martin Kleppmann. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly Media, Inc., 2017.
[16] Sandeep S. Kulkarni, Murat Demirbas, Deepak Madappa, Bharadwaj Avva, and Marcelo Leone. Logical physical clocks. In International Conference on Principles of Distributed Systems, pages 17–32. Springer, 2014.
[17] Avinash Lakshman and Prashant Malik. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2):35–40, 2010.
[18] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558–565, July 1978.
[19] Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, and David G. Andersen. Don't settle for eventual: Scalable causal consistency for wide-area storage with COPS. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, SOSP '11, pages 401–416, New York, NY, USA, 2011.
[20] Lin Qiao, Kapil Surlaker, Shirshanka Das, Tom Quiggle, Bob Schulman, Bhaskar Ghosh, Antony Curtis, Oliver Seeliger, Zhen Zhang, Aditya Auradar, et al. On brewing fresh Espresso: LinkedIn's distributed data serving platform. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 1135–1146. ACM, 2013.
[21] Mohammad Roohitavaf, Murat Demirbas, and Sandeep Kulkarni. CausalSpartan: Causal consistency for distributed data stores using hybrid logical clocks. In Reliable Distributed Systems (SRDS), 2017 IEEE 36th Symposium on, pages 184–193. IEEE, 2017.
[22] Werner Vogels. Eventually consistent. Commun. ACM, 52(1):40–44, January 2009.

APPENDIX

Listing 6. DKVF protobuf description for GentleRain protocol
Listing 7. Server side for GentleRain protocol