Cornus: One-Phase Commit for Cloud Databases with Storage Disaggregation
Zhihan Guo*, University of Wisconsin-Madison, Madison, [email protected]
Xinyu Zeng*, University of Wisconsin-Madison, Madison, [email protected]
Ziwei Ren, University of Wisconsin-Madison, Madison, [email protected]
Xiangyao Yu, University of Wisconsin-Madison, Madison, [email protected]
* Both authors contributed equally
ABSTRACT
Two-phase commit (2PC) has been widely used in distributed databases to ensure atomicity for distributed transactions. However, 2PC suffers from two limitations. First, 2PC incurs long latency as it requires two logging operations on the critical path. Second, when a coordinator fails, a participant may be blocked waiting for the coordinator's decision, leading to indefinitely long latency and low throughput. We make a key observation that modern cloud databases feature a storage disaggregation architecture, which allows a transaction's final decision to not rely on the central coordinator. We propose Cornus, a one-phase commit (1PC) protocol specifically designed for this architecture. Cornus can solve the two problems mentioned above by leveraging the fact that all compute nodes are able to access and modify the log data on any storage node. We present Cornus in detail, formally prove its correctness, develop certain optimization techniques, and evaluate it against 2PC on YCSB and TPC-C workloads. The results show that Cornus can achieve 1.5× speedup in latency.

Modern database management systems (DBMS) are increasingly distributed due to the growing data volume and the diverse demands of modern Internet services. To ensure the atomicity of distributed transactions, an atomic commitment protocol (ACP) is required for transactions that access data across distributed machines.
Two-phase commit (2PC) is so far the most widely used ACP. Albeit widely implemented, 2PC has two major problems that limit its performance.
First, 2PC requires two round-trip network messages and the associated logging operations. Previous works [8, 9, 16, 20, 23, 24, 33] have demonstrated that 2PC can account for the majority of a transaction's execution time due to the incurred network messages and disk logging, which directly affects the query response time that a user experiences.
Second, 2PC has a well-known blocking problem [11, 12, 25]. If a coordinator crashes before notifying participants of the decision, the participants may not know the decision and will be blocked until the coordinator recovers. Meanwhile, uncertain transactions cannot release their locks, which blocks other transactions from making forward progress.

The two problems above have inspired two separate lines of research seeking solutions. To mitigate the long latency problem, previous works have proposed one-phase commit (1PC) protocols [8, 13, 22, 26] that remove one phase from the commit procedure. Existing 1PC protocols, however, make extra assumptions beyond those of 2PC [8]. Most of these assumptions are impractical in a production environment, which stymies the wide adoption of existing 1PC protocols. To solve the blocking problem, previous works have proposed three-phase commit (3PC) protocols [25] such that an uncertain transaction can learn the decision even if the coordinator crashes. However, as the name suggests, these protocols pay the overhead of one extra phase in the commit procedure, further exacerbating the latency problem. According to the fundamental nonblocking theorem [25], no protocol can achieve nonblocking behavior without introducing extra assumptions or communication.

In this paper, we make the key observation that the architectural paradigm shift happening in cloud databases (i.e., storage disaggregation [2, 3, 5, 10, 15, 30, 32]) is fundamentally changing the design space of atomic commitment protocols. Specifically, disaggregating the storage from computation allows a database server to directly access all the storage nodes, rather than only its own storage as in a conventional shared-nothing architecture. This insight allows us to design a new 1PC protocol that achieves low latency and non-blocking behavior altogether, without making further assumptions besides the disaggregation of storage.

To this end, we propose Cornus, a new non-blocking 1PC protocol designed for the storage-disaggregation architecture. Cornus solves the latency problem by eliminating the decision logging at the coordinator. Instead, a transaction relies on the collective logs of all the participating nodes for its final decision. This change is made possible because all the logs are accessible to all transactions in a disaggregation architecture. If any failure occurs, an uncertain transaction can rebuild the decision by accessing all the logs. Cornus is also non-blocking: if any participant fails to flush to its log, other uncertain transactions can insert an abort record on behalf of the non-responding node. We introduce the
LogIfNotExist() function to avoid race conditions in corner cases. In summary, this paper makes the following key contributions:

• We develop Cornus, a one-phase commit (1PC) protocol designed for a storage-disaggregation architecture to reduce the latency overhead of 2PC and alleviate the blocking problem at the same time.

• We prove the correctness of Cornus by showing that it satisfies all five properties of an atomic commitment protocol that 2PC satisfies [11, 12].

• We evaluate Cornus in a great variety of settings on both YCSB and TPC-C workloads. Cornus shows an improvement in latency of 50% compared to 2PC.

Figure 1: Illustration of Two-Phase Commit (2PC) — The lifecycle of a committing transaction (a) and a scenario of coordinator failure (b). The compute node and the corresponding storage node are drawn close to each other. (a) 2PC with no failure. (b) 2PC with coordinator failure.
Section 2.1 describes two-phase commit (2PC). We discuss two major concerns with the protocol, long latency and blocking, and briefly describe existing works addressing each problem. Section 2.2 discusses storage disaggregation as a trending architecture in cloud databases and how it motivates the design of Cornus.
In a distributed database management system (DDBMS), data are partitioned across multiple sites, which can be accessed by a distributed transaction. After the execution, all the sites involved must reach a consensus on committing or aborting the transaction to ensure the transaction's atomicity. An atomic commitment protocol (ACP) is required to achieve this goal.
Two-phase commit (2PC) [23] is a widely used ACP in current DDBMSs. It contains a prepare phase and a commit phase. A demonstration of the protocol is shown in Figure 1. When no failure happens and all participating nodes agree to commit the transaction, the protocol behaves as in Figure 1a. For each transaction, one node is pre-designated to be the coordinator and the other nodes involved in the transaction become participants.

During the prepare phase, the coordinator of the transaction starts by sending prepare requests (also called vote requests) to all the participants and, in parallel, writing a START-2PC log record to its own log file. Upon receiving the prepare request, each participant logs a VOTE-YES record (assuming a committing transaction) to the corresponding log file and then responds to the coordinator. After receiving all the votes, the coordinator enters the commit phase by logging the final decision (i.e., commit/abort) and notifies the caller. The coordinator then forwards the decision to each participant, which logs the decision accordingly and releases all the locks held by the concurrency control protocol.

If the coordinator fails before sending the decision to participants, as shown in Figure 1b, participants may not know the decision of the transaction. Upon a timeout, a participant will initiate a termination protocol to contact other nodes to learn the decision; it will repeat this process until at least one node replies with the decision. In case the coordinator takes a long time to recover, the nodes under uncertainty cannot learn the decision, and the associated transactions will block. In 2PC, the coordinator's decision log record serves as the ground truth of the commit/abort decision — the final outcome of the transaction relies on the success of logging this record.
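To make the two logging operations and the blocking point concrete, the coordinator side of standard 2PC can be sketched roughly as follows (illustrative Python; the helper callables and message names are our own stand-ins, not code from any particular system):

    from typing import Callable, List

    def run_2pc_coordinator(
        txn_id: str,
        participants: List[str],
        append_log: Callable[[str, str], None],      # durably append a record to the coordinator's log
        send_and_wait: Callable[[str, str], str],    # send a message to a node and block for its reply
        send_async: Callable[[str, str], None],      # fire-and-forget message
    ) -> str:
        # Prepare phase: log START-2PC and ask every participant to vote (first logging operation).
        append_log(txn_id, "START-2PC")
        votes = [send_and_wait(p, "VOTE-REQ") for p in participants]

        # Commit phase: in 2PC the coordinator's decision record is the ground truth,
        # so it must be flushed (second logging operation) before the caller is acknowledged.
        decision = "COMMIT" if all(v == "VOTE-YES" for v in votes) else "ABORT"
        append_log(txn_id, decision)
        for p in participants:
            send_async(p, decision)   # participants log the decision and release their locks
        return decision               # the caller sees one round trip plus two log flushes

If the coordinator crashes between collecting the votes and broadcasting the decision, the participants have no way to learn the decision, which is exactly the blocking scenario of Figure 1b.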
Limitation 1: The Latency of Two Phases
In the standard 2PC protocol, the transaction caller experiences an average latency of one network round-trip and two logging operations, as shown in Figure 1a. Such a delay directly affects the query response time that an end-user will experience.

Previous works have proposed various one-phase commit (1PC) protocols to reduce this latency. Some works combine the voting phase with the execution of the transaction [7, 9, 22, 26, 27] to remove one phase, yet they make assumptions that are too strong to be practical [8]. These protocols assume that serialization and consistency are ensured before an acknowledgment of each operation is sent from the participant to the coordinator, and no abort due to consistency or serialization is allowed after the successful execution of all operations. Thus, they do not support common concurrency control protocols such as Optimistic Concurrency Control [21] and Timestamp Ordering [12], according to Abdallah et al. [8]. Moreover, most protocols either pay extra overhead such as blocking I/O during execution [26] or violate site autonomy [9, 22, 26, 27] — a property that allows each node to manage its data and recovery independently [8].
Limitation 2: The Blocking Problem
In 2PC, a participant learns the decision of a transaction either directly from the coordinator or indirectly from other participants. In an unfortunate corner case shown in Figure 1b, where the coordinator fails before sending any notifications, no participant can make or learn the decision since it is unclear whether the decision has been made. Meanwhile, the participants must hold the locks on tuples until the coordinator is recovered. This is the well-known blocking problem, which causes certain data to be inaccessible because some other node not holding the data is down, limiting the performance and data availability of 2PC. Existing works [11, 19, 25] resolve the blocking problem by introducing extra inter-node communication and imposing more assumptions on the failure model.
For instance, the three-phase commit (3PC) protocol eliminates blocking by introducing an extra prepared-to-commit phase [25]. The coordinator will not send out a commit message until all participants have acknowledged that they are prepared to commit. Although this approach can eliminate blocking, it introduces another network round-trip in each transaction and exacerbates the long latency of 2PC. Moreover, 3PC assumes a synchronous system where the network delays among nodes are bounded. In practical systems with unbounded delay, 3PC cannot guarantee atomicity [18].
Modern cloud-native databases are shifting to a storage disaggregation architecture where the storage and computation are separately managed as different layers of services (Figure 2b). This brings significant benefits including lower cost, simpler fault tolerance, and higher hardware utilization, compared to the common shared-nothing architecture (Figure 2a). A number of cloud-native databases are adopting such an architecture, with examples including Aurora [32], Redshift Spectrum [3], Athena [2], Presto [5], Hive [30], SparkSQL [10], and Snowflake [15]. A storage disaggregation architecture has the following two key properties:
First, all the data in the storage layer are accessible to all the computation nodes. In terms of logging, this means a computation node can access not only its own log file but also the log files corresponding to other computation nodes. This contrasts with a shared-nothing architecture (Figure 2a), where a computation node can only directly access its own storage — accessing remote storage requires sending an explicit request to a remote computation node. Moreover, modern disaggregated storage services like Amazon S3 [4] are typically highly available and fault-tolerant.

Second, the storage layer can perform certain computation tasks. The storage layer is typically implemented as a cluster of servers that can perform general-purpose computation. Although the storage servers are not as powerful as computation servers and cannot communicate with each other arbitrarily, such computation capability can substantially improve the performance of a database, for both OLTP [32] and OLAP [34] workloads. This contrasts with a conventional shared-disk architecture, where the disks are passive devices and perform no computation.

Through scrutinizing the 2PC protocol, we learned that it is designed with a basic assumption of a shared-nothing architecture; namely, a computation node can learn the content of a remote log only through explicitly contacting the remote node. If the remote node is down, the log on its local storage is no longer accessible. With the disaggregation architecture, however, a computation node can learn the content of a remote log by directly accessing the log itself. Furthermore, it may even directly manipulate the log content if necessary. With this subtle difference, we can optimize 2PC to solve both limitations discussed in Section 2.1 — first, we can eliminate one phase from the protocol to reduce latency; second, we are able to solve the blocking problem. The details of our solution will be discussed in the following section.

Figure 2: Shared-Nothing vs. Storage-Disaggregation Architectures. (a) Shared-nothing. (b) Storage-disaggregation.
We first present the high-level ideas of Cornus in Section 3.1. We then describe the APIs of the protocol in detail in Sections 3.2 and 3.3. Section 3.4 describes how the protocol handles failures and recovery. Section 3.5 proves the correctness of Cornus. Section 3.6 discusses some optimization techniques in Cornus.
Figure 3: Illustration of Cornus — The lifecycle of a committing transaction.
This section describes the high-level intuition behind Cornus. In particular, we explain why the two properties of a disaggregation architecture (explained in Section 2.2) can reduce the protocol latency and eliminate blocking at the same time.
Latency reduction. Storage disaggregation enables a 1PC design that reduces latency. In the conventional 2PC protocol, the ground truth of a transaction's outcome (i.e., commit or abort) is the coordinator's decision log; in Cornus, the ground truth is instead the collective votes in all participants' logs. For example, a transaction reaches the commit decision once each participant's local log contains VOTE-YES. Such a decision cannot be rolled back once reached. If a particular node is uncertain about the outcome (e.g., the node times out while waiting for the coordinator's decision), the node can directly check all participating nodes' logs to learn the final decision and thus does not have to rely on the coordinator's decision log. This means the coordinator's decision logging no longer has to be on the critical path of the protocol. In other words, the coordinator can respond to the caller of the transaction immediately after receiving votes from participants, without logging first. Given that a logging operation is quite expensive in a highly available distributed DBMS, this can substantially reduce a transaction's latency. Figure 3 shows the procedure of a committing transaction using Cornus, which saves the latency of one logging operation; the figure can be compared with Figure 1a to see the difference between 2PC and Cornus.
Algorithm 1: Cornus API on the Storage Nodes — The implementation of the Log() and LogIfNotExist() functions on each storage node.

Function Storage::Log(txn, content)
    append content to the local log

Function Storage::LogIfNotExist(txn, content)
    if content == VOTE-YES then
        return ABORT if an ABORT record exists for txn in the log; otherwise log and return VOTE-YES
    else
        return VOTE-YES or COMMIT if such a record exists for txn in the log; otherwise log ABORT (if it does not already exist) and return ABORT

Note that the optimization cannot be applied to 2PC in a conventional shared-nothing architecture, because a node cannot directly access the log of another node if the remote node has failed.

Non-blocking. Cornus addresses the blocking problem in 2PC without introducing significant complexity. When a participant experiences a timeout while waiting for the coordinator's decision, it executes the termination protocol. In 2PC, the participants contact the coordinator for the decision and must block if the coordinator cannot be reached (e.g., the coordinator has failed). In Cornus, in contrast, the termination protocol checks all the votes of the other participants to learn the final decision. This is doable in a disaggregation architecture because all the logs can be accessed from any computation node and the storage layer itself is highly available. In case a particular vote is missing in the storage (e.g., the corresponding participant failed before logging), the current node running the termination protocol will write an
ABORT into the log of the failed node. To guarantee atomicity, we implement the LogIfNotExist() function in the storage layer, which guarantees that only one of the two votes (i.e., VOTE-YES or ABORT) can exist for any transaction on any node. The protocol guarantees that neither the coordinator nor the participants will block due to unknown decisions. Note that Cornus does not introduce extra assumptions, unlike the previous 1PC protocols mentioned in Section 2.1, besides the storage disaggregation architecture, making it applicable to more general settings. In the next two sections, we describe the APIs of Cornus on the storage and compute nodes respectively.
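The decision rule sketched above can be stated compactly in code (our own illustration, not taken from the paper's implementation):

    from typing import Dict, Set

    def global_decision(logs: Dict[str, Set[str]]) -> str:
        """Derive a transaction's outcome purely from the per-participant log contents.

        `logs` maps each participating node to the set of records it has durably
        written for this transaction (e.g., {"VOTE-YES"} or {"ABORT"})."""
        if any("ABORT" in records for records in logs.values()):
            return "ABORT"          # a single ABORT record fixes the outcome to abort
        if all("VOTE-YES" in records for records in logs.values()):
            return "COMMIT"         # all votes present fixes the outcome to commit
        return "UNDETERMINED"       # some vote is still missing

Because every record is reachable from every compute node in a disaggregated architecture, any node can evaluate this rule on its own, which is what makes the coordinator's decision log unnecessary on the critical path.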
We first describe our RPC notation, followed by the functions that are supported on the storage nodes in Cornus.
Remote Procedure Calls (RPC)
In this paper, we model the communication across nodes through RPCs. An RPC can be either synchronous or asynchronous. A synchronous call means the program logic will block until the RPC returns; an asynchronous call means the program can continue executing until it is made to wait explicitly for the response. We represent RPCs using the following notation: RPC^n_{sync/async}::FuncName(), where the subscript can be sync or async for synchronous and asynchronous RPCs, respectively. The superscript n denotes the destination node of the RPC. Finally, FuncName() is the function that will be called through this RPC on the remote node; the function can take arbitrary arguments if needed.
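As a rough illustration of how this notation could map onto code (our own naming; the paper's implementation uses gRPC, as described in Section 4.1), a synchronous RPC blocks on the result while an asynchronous RPC returns a future that is awaited explicitly:

    from concurrent.futures import Future, ThreadPoolExecutor
    from typing import Any, Callable, Dict

    # Hypothetical in-process stand-in for the network: node name -> {function name -> handler}.
    HANDLERS: Dict[str, Dict[str, Callable[..., Any]]] = {}
    _pool = ThreadPoolExecutor(max_workers=8)

    def rpc_sync(node: str, func: str, *args: Any) -> Any:
        """RPC^n_sync::Func(): block until the remote function returns."""
        return HANDLERS[node][func](*args)

    def rpc_async(node: str, func: str, *args: Any) -> Future:
        """RPC^n_async::Func(): return a future immediately; wait on it explicitly later."""
        return _pool.submit(HANDLERS[node][func], *args)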
Log(txn, type)
The Log(txn, type) function simply appends a log record of a certain type to the end of transaction txn's log. It is the log function that is used in conventional 2PC protocols.
LogIfNotExist(txn, type)
In Cornus, we introduce a new log function, LogIfNotExist(txn, type), to guarantee that different nodes do not write conflicting log records. Cornus uses both types of log functions. LogIfNotExist() is called only when a node logs VOTE-YES (lines 2 & 17 of Algorithm 2) or when a node logs ABORT on behalf of a remote node when calling the termination protocol (line 30 of Algorithm 2).

Algorithm 1 shows the pseudocode for the LogIfNotExist() function. When a storage node receives an RPC call on this function, it first checks whether a conflicting decision has already been logged for the transaction. If so, the transaction's most recent status in the log (i.e., ABORT, COMMIT, or VOTE-YES) is returned. Otherwise, it appends the requested log record and returns its content.

Specifically, when a compute node tries to log VOTE-YES on its own storage node (line 4), it checks whether other nodes have already logged ABORT on its behalf. If so, the function returns ABORT (line 6); otherwise, the function logs and returns VOTE-YES (line 7). Note that the check and the append must be done atomically, such that a conflicting log record cannot be appended after the check is performed.

LogIfNotExist() is also called to log ABORT on behalf of another compute node during the termination protocol. In this case, if a VOTE-YES or COMMIT log record already exists, ABORT will not be logged and the existing log record is returned (line 11); otherwise the ABORT decision is logged and returned (line 12). Later, in Section 3.6, we describe more details on how to implement the LogIfNotExist() function efficiently on storage nodes.
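A minimal in-memory sketch of the storage-node side of Algorithm 1 (our own illustration: a mutex provides the required check-and-append atomicity, and a real storage node would of course persist the records durably):

    import threading
    from collections import defaultdict
    from typing import Dict, List

    class StorageNode:
        def __init__(self) -> None:
            self._log: Dict[str, List[str]] = defaultdict(list)   # txn id -> records for that txn
            self._lock = threading.Lock()                         # makes the check and append atomic

        def log(self, txn: str, record: str) -> None:
            """Storage::Log — unconditionally append a record, as in conventional 2PC."""
            with self._lock:
                self._log[txn].append(record)

        def log_if_not_exist(self, txn: str, record: str) -> str:
            """Storage::LogIfNotExist — append only if no conflicting record exists."""
            with self._lock:
                existing = self._log[txn]
                if record == "VOTE-YES":
                    if "ABORT" in existing:        # another node already aborted on our behalf
                        return "ABORT"
                    if "VOTE-YES" not in existing:
                        existing.append("VOTE-YES")
                    return "VOTE-YES"
                # A remote node is trying to log ABORT via the termination protocol.
                for r in ("COMMIT", "VOTE-YES"):
                    if r in existing:              # a vote or decision already exists; return it
                        return r
                if "ABORT" not in existing:
                    existing.append("ABORT")
                return "ABORT"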
This section explains in detail how Cornus works on the compute nodes. The pseudocode is shown in Algorithm 2. We highlight the key changes of Cornus relative to standard 2PC with a gray background color. In the following, we go through the pseudocode for the coordinator's procedure, the participant's procedure, and the termination protocol.
Coordinator::Start1PC(txn)
After a transaction txn finishes the execution phase, it starts the atomic commitment protocol by calling Start1PC(txn) at the coordinator. The coordinator logs VOTE-YES asynchronously (line 2) and sends out vote requests, along with a list of all nodes involved in the transaction, to all participants simultaneously (lines 3–4). There are two major differences compared to a standard 2PC protocol: First, the coordinator logs through LogIfNotExist() instead of append-only logging. Second, the coordinator logs VOTE-YES instead of START-2PC — this second change is because all nodes can recover independently without the involvement of the coordinator, as we will explain in Section 3.4.

Then the coordinator waits for responses from all the participants (line 5). If an ABORT is received, the transaction reaches an abort decision (line 6); if all responses are received and none of them is an ABORT (i.e., all responses are VOTE-YES), the transaction reaches a commit decision (line 7); if there is a timeout, the termination protocol is executed to finalize a decision (line 8). Note that the last condition is different from 2PC, which would unilaterally abort the transaction without running the termination protocol.

Once the decision is reached, it can be replied to the transaction caller immediately, before the decision is logged durably (line 9). This is a key difference between Cornus and 2PC; the latter replies to the caller only after the decision log is flushed. This optimization reduces the caller-observed latency by one logging operation.

Finally, the coordinator asynchronously writes the decision to its local storage node (line 10) and also broadcasts the decision to all the participants (lines 11–12).
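Expressed as code, the coordinator's procedure could look roughly like this (a sketch with injected callables; the names and the sequential vote collection are our own simplifications — the real protocol sends the vote requests in parallel):

    from typing import Callable, List

    def cornus_coordinator(
        txn_id: str,
        participants: List[str],
        log_if_not_exist_async: Callable[[str, str], None],   # LogIfNotExist() on the local storage node
        log_async: Callable[[str, str], None],                 # Log() on the local storage node
        send_vote_req_and_wait: Callable[[str], str],          # returns VOTE-YES/ABORT, or raises TimeoutError
        send_async: Callable[[str, str], None],
        reply_to_caller: Callable[[str, str], None],
        run_termination_protocol: Callable[[str], str],
    ) -> str:
        # Difference 1: log VOTE-YES through LogIfNotExist() instead of a START-2PC record.
        log_if_not_exist_async(txn_id, "VOTE-YES")
        try:
            votes = [send_vote_req_and_wait(p) for p in participants]
            decision = "ABORT" if "ABORT" in votes else "COMMIT"
        except TimeoutError:
            # Difference 2: on timeout, consult the logs instead of unilaterally aborting.
            decision = run_termination_protocol(txn_id)

        # Difference 3: acknowledge the caller before the decision record is durably logged.
        reply_to_caller(txn_id, decision)
        log_async(txn_id, decision)          # decision logging is off the critical path
        for p in participants:
            send_async(p, decision)
        return decision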
Participant::Start1PC(txn)
The logic executed by a participant is very similar between Cornus and 2PC. A participant waits for a VOTE-REQ message from the coordinator (line 14). If a timeout occurs, the participant can unilaterally abort the transaction (line 15), which can involve rolling back the database state, releasing locks, and logging ABORT.

Once VOTE-REQ is received, the participant votes VOTE-YES or VOTE-NO based on its local state of the transaction. For a VOTE-NO, an ABORT record is written to the storage and a reply is sent back to the coordinator (lines 26–27). This log can be asynchronous, following the presumed abort optimization in conventional 2PC [23].

For a VOTE-YES, the record is logged to the corresponding storage node through LogIfNotExist() (line 17). There are two possible outcomes. If the function returns ABORT, another node has already aborted the transaction on behalf of the current node through the termination protocol. In this case, the current node aborts the transaction and returns ABORT to the coordinator (lines 18–19). Otherwise, the storage node returns VOTE-YES, in which case the participant also returns VOTE-YES to the coordinator (line 21) and starts to wait for the decision message from the coordinator (line 22). Upon a timeout, the termination protocol is executed (lines 22–23). Upon receiving the decision, it is logged to the storage node (line 24); here we mark the log as asynchronous because other transactional logic, such as releasing locks, does not need to wait for this decision log to complete.
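The participant side can be sketched analogously (again with hypothetical callables; the initial wait for VOTE-REQ and the unilateral abort on its timeout are omitted for brevity):

    from typing import Callable

    def cornus_participant(
        txn_id: str,
        votes_yes: bool,                                  # local outcome of the execution phase
        log_if_not_exist: Callable[[str, str], str],      # synchronous LogIfNotExist() on the local storage node
        log_async: Callable[[str, str], None],
        reply_to_coordinator: Callable[[str], None],
        wait_for_decision: Callable[[], str],             # raises TimeoutError if the coordinator stays silent
        run_termination_protocol: Callable[[str], str],
    ) -> str:
        if not votes_yes:
            log_async(txn_id, "ABORT")                    # presumed abort: this log need not be forced
            reply_to_coordinator("ABORT")
            return "ABORT"

        # Vote yes through LogIfNotExist(); another node may already have aborted on our behalf.
        resp = log_if_not_exist(txn_id, "VOTE-YES")
        if resp == "ABORT":
            reply_to_coordinator("ABORT")
            return "ABORT"

        reply_to_coordinator("VOTE-YES")
        try:
            decision = wait_for_decision()
        except TimeoutError:
            decision = run_termination_protocol(txn_id)   # consult all logs instead of blocking
        log_async(txn_id, decision)                       # decision logging is off the critical path
        return decision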
TerminationProtocol(txn)
In both 2PC and Cornus, the termination protocol is executed when a compute node has a timeout while waiting for a message and the node cannot unilaterally abort the transaction. In 2PC, the node running the termination protocol contacts all the other nodes for the outcome of the transaction. If any node returns the final outcome, the uncertainty is resolved. However, in certain corner cases, no active node has the outcome — for example, the coordinator has failed right before sending out the final outcome to any participant.
Algorithm 2: API of Compute Nodes in Cornus — Assuming a committing transaction. Differences between 1PC and 2PC are highlighted in gray.

Function Coordinator::Start1PC(txn)
    RPC^{local SN}_async::LogIfNotExist(VOTE-YES)
    for p in txn.participants do
        send VOTE-REQ to p asynchronously
    wait for all responses from participants and storage node
        on receiving ABORT: decision ← ABORT
        on receiving all responses: decision ← COMMIT
        on timeout: decision ← TerminationProtocol(txn)
    reply decision to the txn caller
    RPC^{local SN}_async::Log(decision)
    for p in txn.participants do
        send decision to p asynchronously

Function Participant::Start1PC(txn)
    wait for VOTE-REQ from coordinator
        on timeout: RPC^{local SN}_sync::Log(ABORT); return
    if participant votes yes for txn then
        resp ← RPC^{local SN}_sync::LogIfNotExist(VOTE-YES)
        if resp is ABORT then
            abort the transaction
            reply ABORT to coordinator
        else
            reply VOTE-YES to coordinator
            wait for decision from coordinator
                on timeout: decision ← TerminationProtocol(txn)
            RPC^{local SN}_async::Log(decision)
    else
        RPC^{local SN}_sync::Log(ABORT)
        reply ABORT to coordinator

Function TerminationProtocol(txn)
    for every node p participating in txn other than self do
        RPC^{p.SN}_async::LogIfNotExist(ABORT)
    wait for responses
        on receiving ABORT: decision ← ABORT
        on receiving COMMIT: decision ← COMMIT
        on receiving all responses: decision ← COMMIT
        on timeout: retry from the beginning
    return decision

In this case, the transaction can neither commit nor abort, and must block with all the locks held, until the failed node has recovered. Cornus avoids the problem described above. Specifically, the node running the termination protocol contacts all the participating storage nodes rather than peer compute nodes, by trying to log an
ABORT record to each storage node through LogIfNotExist() (lines 29–30). If the remote storage node has already received a decision log record (i.e., COMMIT or ABORT) for this transaction, such a decision will be returned and followed by the current node (lines 32–33). If the remote storage node has not received any log record yet, the ABORT record will be logged and returned (line 32). The last case is that a
VOTE-YES record is logged at the remote node and is returned; if the current node receives such responses from all the remote storage nodes, the transaction will also reach a commit decision (line 34). Finally, if the current node experiences a timeout again, it will retry the termination protocol (line 35).

Figure 4: Cornus under Failures — The behavior of Cornus under two failure scenarios: (a) the coordinator fails before sending the decision; (b) a participant fails before logging its vote.

Note that as long as the storage nodes are accessible, the protocol above is non-blocking. A compute node can always reach a decision with a small number of messages. The only case in which Cornus will run into blocking (line 35) is when the storage service cannot be reached. However, as discussed in Section 2.2, we assume this case is rare, given a highly available storage service that is maintained on its own.
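The termination protocol itself then reduces to a loop over the other participants' storage nodes (a sketch; `log_if_not_exist_on` stands in for the LogIfNotExist() RPC to a remote storage node):

    from typing import Callable, List

    def termination_protocol(
        txn_id: str,
        other_nodes: List[str],
        log_if_not_exist_on: Callable[[str, str, str], str],   # (storage node, txn, record) -> logged/existing record
    ) -> str:
        """Resolve an uncertain transaction by consulting every other participant's log directly."""
        while True:
            try:
                responses = [log_if_not_exist_on(n, txn_id, "ABORT") for n in other_nodes]
            except TimeoutError:
                continue              # storage unreachable: retry from the beginning
            if "ABORT" in responses:
                return "ABORT"        # some node had not voted (we aborted on its behalf) or had aborted
            if "COMMIT" in responses:
                return "COMMIT"       # another node already learned and logged the decision
            return "COMMIT"           # all responses are VOTE-YES: the global decision is commit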
This section discusses the behavior of Cornus when failures occur. For simplicity, we discuss cases where only a single node fails at a time. Specifically, Table 1 and Table 2 describe the system behavior when the coordinator and a participant fail, respectively. The tables also describe the behavior of the failed node after it recovers.
Coordinator Failure
Table 1 lists the system behaviors if the coordinator fails at different points of the Cornus protocol.
Case 1:
The coordinator fails before the protocol starts. In this case, a participant will experience a timeout waiting for VOTE-REQ (line 15 in Algorithm 2). Therefore, all the participants will unilaterally abort the transaction locally. After the coordinator is later recovered, it can run the termination protocol to learn this outcome.
Case 2:
The coordinator fails after sending some but not all vote requests. For participants that did not receive the request, the behavior is the same as in Case 1, namely, they will unilaterally abort the transaction. Participants that received the request will log their votes to the storage nodes, send responses back to the coordinator, and experience a timeout while waiting for the final decision because the coordinator has failed (line 23 in Algorithm 2). They will then run the termination protocol. The protocol checks the votes of the participants that have unilaterally aborted the transaction. It either learns the abort decision, or appends an ABORT to their logs, thereby aborting the transaction. Once the coordinator recovers from the failure, it learns the outcome through the termination protocol.
Case 3:
The coordinator fails after sending all the vote requests but before sending out any decision. In this case, all participants have logged their votes to the storage nodes. They will all time out while waiting for the decision from the coordinator, and they will all run the termination protocol to learn the final outcome of the transaction and act accordingly. After the coordinator is recovered, it will also run the termination protocol to learn the outcome. Note that if this scenario occurs in 2PC, the participants will block instead.

Figure 4a illustrates an example of such a case. The figure can be compared with Figure 1b, showing how Cornus avoids blocking. After a participant's timeout, instead of contacting the coordinator, which has failed, it contacts all the storage nodes using the LogIfNotExist() function. Since all nodes have VOTE-YES in their logs, each participant learns the decision of COMMIT and avoids blocking.
Case 4:
The coordinator fails after sending out the decision to some but not all participants. For participants that have already received the decision, their local 1PC protocol will terminate. The other participants will time out waiting for the decision and run the termination protocol to learn the final decision.
Case 5:
The coordinator fails after sending out the decision to all participants. In this case, all participants have completed the local 1PC protocol, so the failure has no effect on them. After the coordinator is recovered, if the final decision has been logged in its storage node, that decision is used; otherwise, the coordinator runs the termination protocol to learn the final decision.
Participant Failure
Table 2 lists the effects of a participant failure at different points of the Cornus protocol.
Case 1:
The participant fails before receiving the vote request from the coordinator. In this case, the coordinator will experience a timeout waiting for all responses from participants (line 8 in Algorithm 2) and then run the termination protocol. The coordinator will log an ABORT record for the failed participant, thereby aborting the transaction. The coordinator will then broadcast the decision to the remaining participants. It is also possible that another participant has a timeout and initiates the termination protocol; the end effect would be the same. After the failed participant is recovered, it runs the termination protocol to learn the final decision.
Case 2:
The participant fails after it receives the vote request but before its vote is logged. In this case, the behavior of the system is the same as in Case 1. This is because, from the other nodes' perspectives, the behavior of the failed node is identical.

Figure 4b shows an example of such a case. The coordinator receives a VOTE-YES from participant 2 and has a timeout while waiting for the response from participant 1. At this point, the coordinator runs the termination protocol by issuing LogIfNotExist() to the
storage node of every participant, trying to log ABORT on behalf of the participant. As the coordinator logs an ABORT for participant 1 and learns that participant 2 has already logged VOTE-YES, it reaches a decision of abort and sends it to participant 2. For simplicity, the example assumes that only the coordinator has a timeout. It is possible that participant 2 also experiences a timeout, which would lead to the same outcome.

Table 1: Effects of Coordinator Failures.
Time of Coordinator Failure | Effect of Failure | After Node is Recovered
Before 1PC starts | Participants (if any) will timeout and unilaterally abort the transaction. | Abort the transaction through the termination protocol.
After sending some vote requests | Participants that did not receive the request will timeout and abort unilaterally. Participants that received the request will timeout waiting for the decision and execute the termination protocol, which aborts the transaction. | Run the termination protocol to learn the decision, which is abort.
After sending all vote requests but before sending any decision | All participants will timeout while waiting for the decision and execute the termination protocol to learn the outcome. | Run the termination protocol to learn the decision.
After sending some decisions | Participants that did not receive the decision will run the termination protocol to learn the decision. | Same as above.
After sending all decisions | No effect. | If the decision is logged, follow that decision; otherwise run the termination protocol to learn it.

Table 2: Effects of Participant Failures.
Time of Participant Failure | Effect of Failure | After Node is Recovered
Before receiving the vote request | The coordinator will timeout and run the termination protocol to abort the transaction. | Abort the transaction.
After receiving the vote request, before logging the vote | Same as above. | Same as above.
After logging the vote, before replying to the coordinator | The coordinator will run the termination protocol to see the vote and learn the final outcome. | Abort the transaction if the local vote is abort; otherwise run the termination protocol to learn the outcome.
After replying the vote to the coordinator | No effect. | If a decision log exists, follow the decision. Otherwise, same as above.
Case 3:
The participant fails after it logs the vote, but before replying to the coordinator. In this case, the coordinator will experience a timeout waiting for votes and run the termination protocol. It can then see all the participants' votes in their storage nodes and learn the outcome. The remaining participants will learn the decision either from the coordinator or by running the termination protocol themselves. After the failed participant is recovered, it will abort the transaction if its local vote is an abort; otherwise, it will run the termination protocol to learn the outcome.
Case 4:
The participant fails after sending out the vote. This failure does not affect the rest of the nodes — the coordinator and remaining participants will execute the rest of the protocol normally. After the failed participant is recovered, it will follow its own decision log if it exists; otherwise, it will run the termination protocol to learn the outcome.
This section formally proves the correctness of Cornus. Our proof follows the structure used in [12], where an atomic commitment protocol is proven as five separate properties (AC1–5, see below). We start by introducing the following definition of a global decision of a distributed transaction.
Definition 1 [Global Decision]: A transaction reaches a COMMIT global decision if all nodes have logged VOTE-YES; it reaches an ABORT global decision if any node has logged ABORT. Otherwise the decision is undetermined.

We first introduce and prove the following lemma.
Lemma 1 [Irreversible Global Decision]: Once a global decision is reached for a transaction, the decision will not change.
Proof:
There are two cases to consider. In case 1, an abort global decision has been reached, meaning one log contains an ABORT record. According to the semantics of LogIfNotExist(), no VOTE-YES can be appended to that log anymore, meaning that the global decision cannot switch to commit. In case 2, a commit global decision is reached and all nodes have VOTE-YES in their logs. The only way to append an ABORT record is through the termination protocol. But according to the logic of LogIfNotExist() (i.e., Algorithm 1), ABORT will not be appended since VOTE-YES already exists. □

We now prove the five properties in order.
Theorem 1 [AC1]: The decision of each participant is identical to the global decision.
Proof:
A participant can learn the global decision in two ways: (1) by receiving the decision from the coordinator (line 22 in Algorithm 2) or (2) by running the termination protocol (line 23 in Algorithm 2). Following the protocol, the decision at the coordinator is identical to the global decision, and thus in the first case the participant's decision is also identical to the global decision. In the second case, the termination protocol collects the votes of each individual participant and reaches a local decision that is identical to the global one. □

Theorem 2 [AC2]: A participant cannot reverse its decision after it has reached one.
Proof:
Due to Lemma 1, once a global decision is reached, it cannot reverse. Due to Theorem 1, each participant will reach the same decision as the global decision, finishing the proof. □

The correctness of 2PC requires the following two properties according to [12].
AC3:
The commit decision can only be reached if all participants voted Yes.
AC4:
If there are no failures and all participants voted Yes, then the decision will be to commit.

For the proof of Cornus, we combine the two properties into the following theorem.
Theorem 3 [AC3&4]: The decision of a transaction is a commit if and only if all participants vote Yes and write VOTE-YES to their corresponding logs.
Proof:
According to Definition 1, a transaction's global decision is a commit if and only if all participants write VOTE-YES to their logs. According to Theorem 1, the decision of each participant is identical to the global decision, finishing the proof. □

Finally, 2PC requires the following property.
AC5:
Consider any execution containing only failures that the algorithm is designed to tolerate. At any point in this execution, if all existing failures are repaired and no new failures occur for sufficiently long, then all processes will eventually reach a decision.

For Cornus, we prove the following theorem, which achieves a stronger property.
Theorem 4 [AC5]: Assuming the storage layer is fault tolerant, then under any failures of the compute nodes, the remaining participants will always reach a decision without requiring the failed nodes to be recovered.
Proof:
Since the storage layer is fault tolerant, once a global decision is reached, an active participant can always learn the global decision through the termination protocol. The only case where a decision cannot be reached is when one participant fails to log its vote. In this case, a coordinator or participant that experiences a timeout will run the termination protocol and directly write an ABORT into the pending participant's log, which enforces a global decision. □

This section discusses some optimization techniques for Cornus, including optimizing the LogIfNotExist() function and optimizing for readonly transactions.
Optimizing LogIfNotExist()
The behavior of LogIfNotExist() depends on the current content of the log (e.g., whether a particular log record exists). With a naive implementation, the system would scan the entire log on the storage node to decide whether a vote has already been logged for a particular transaction. Since the log can be large, this naive solution can lead to very long processing times. Below we describe two techniques to reduce this overhead.

We observe that LogIfNotExist() is called only in two cases (according to Algorithm 2): (1) a coordinator or participant logs VOTE-YES, or (2) the termination protocol logs ABORT. The termination protocol is called only after a timeout, so the first case is much more common than the second. We propose the following two techniques to optimize for both cases.
Optimize for logging VOTE-YES: We propose the following design to optimize for the common case. Each storage node maintains a hash table of ABORT records from remote nodes (i.e., those initiated through the termination protocol). During normal processing, a VOTE-YES looks up the hash table (which should be empty in the common case). If no ABORT exists in the hash table for the transaction, then VOTE-YES can be written without scanning the entire log. Otherwise, if an ABORT exists in the hash table for the transaction, the storage node immediately replies with it to the compute node. With this optimization, the overhead of logging VOTE-YES through LogIfNotExist() becomes minimal.
Optimize for logging ABORT in the termination protocol: The naive implementation scans the entire log searching for VOTE-YES upon receiving LogIfNotExist() for an ABORT record. Although the termination protocol is rarely called, we still want to reduce this overhead. Our idea is to identify a watermark position in the log such that the VOTE-YES cannot be located before the watermark. Therefore, only the tail of the log after the watermark needs to be scanned. There are many ways to identify such a safe watermark. For example, every network message can be associated with the current log sequence number (LSN) of the corresponding storage node. The coordinator of a transaction can collect these LSNs and send them to participants during the commit procedure. For every node, we know for sure that no log record can possibly exist for the transaction before the collected LSN, because the transaction does not touch the corresponding node before that LSN. Such LSNs can serve as our watermarks.

In summary, with the two techniques described above, LogIfNotExist() requests sent during normal execution (lines 2 and 17 in Algorithm 2) only require an in-memory hash table lookup; LogIfNotExist() requests sent during the termination protocol (line 31 in Algorithm 2) require scanning only the tail of the log after the watermark, which is much smaller than the entire log.
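The two techniques can be layered on the storage-node log roughly as follows (our own sketch: the hash table of remote ABORTs and the watermark argument are illustrative, and a real log would be durable rather than an in-memory list):

    import threading
    from typing import List, Set, Tuple

    class OptimizedStorageLog:
        def __init__(self) -> None:
            self._records: List[Tuple[str, str]] = []     # append-only log of (txn, record); index = LSN
            self._remote_aborts: Set[str] = set()         # txns aborted here on behalf of this node
            self._lock = threading.Lock()

        def log_vote_yes(self, txn: str) -> str:
            """Common case: consult the small hash table instead of scanning the log."""
            with self._lock:
                if txn in self._remote_aborts:
                    return "ABORT"
                self._records.append((txn, "VOTE-YES"))
                return "VOTE-YES"

        def log_abort_from_remote(self, txn: str, watermark_lsn: int) -> str:
            """Termination-protocol case: scan only the log tail after the watermark.

            The watermark is an LSN below which the transaction cannot have any record
            (e.g., the LSN observed before the transaction first touched this node)."""
            with self._lock:
                for t, rec in reversed(self._records[watermark_lsn:]):   # most recent status first
                    if t == txn and rec in ("VOTE-YES", "COMMIT"):
                        return rec
                self._remote_aborts.add(txn)
                self._records.append((txn, "ABORT"))
                return "ABORT"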
Optimizing for Readonly Transactions
Conventional 2PC can be optimized for readonly transactions [23]. Specifically, if a node is readonly, it does not need to log anything during the prepare phase and can simply release locks and end the local transaction. In 1PC, this optimization has a small subtlety. If the coordinator does not know that a participant is actually readonly, the participant cannot avoid logging VOTE-YES, because otherwise it might be aborted by a remote node executing the termination protocol. This may cause a performance degradation. However, we believe that in many practical scenarios, the coordinator does learn that a participant is readonly during the execution phase. Then, when 1PC starts, the coordinator can send the list of non-readonly nodes (rather than the list of all nodes) together with the prepare requests to participants. This way, 1PC no longer needs to log VOTE-YES for readonly nodes. Even if some nodes run the termination protocol, they will not check the logs of readonly nodes.
We now evaluate the performance of Cornus with respect to conventional 2PC. We first introduce the experimental setup in Section 4.1. In the following sections, we evaluate Cornus under various settings to illustrate its efficacy in reducing latency. Accordingly, we use transaction latency as the major metric and present the latency breakdown by phases (i.e., execution, prepare, and commit) to verify that Cornus reduces latency by shrinking the commit phase. We also report throughput results.
We implement the protocols on Sundial [33], an open-source distributed DBMS testbed. We will open-source our implementation. The system has a storage-disaggregation architecture and contains two types of nodes: compute nodes and storage nodes. In our setup, each compute node is paired with one storage node, which stores the log for transactions executed on the corresponding compute node. A compute node can also read and write logs stored on other storage nodes, but this happens only when a timeout occurs (e.g., when a node fails). Data is partitioned across compute nodes. One compute node may send remote requests to other compute nodes for data and remote requests to all storage nodes for logging.

Communication across nodes is implemented through gRPC [1] and can be either synchronous or asynchronous. Each node in the system has a gRPC client for issuing remote requests and a gRPC server for receiving remote requests. Each node manages a pool of server threads to handle remote requests.
Most of the experiments are performed on a cluster with up to eight servers running Ubuntu 18.04 on CloudLab [17]. Half of the servers serve as compute nodes and the others serve as storage nodes. Each server contains two Intel Xeon Silver 4114 CPUs (10 cores ×).

We use two different OLTP workloads for performance evaluation. All transactions are executed as stored procedures that contain program logic intermixed with queries.
YCSB:
The Yahoo! Cloud Serving Benchmark [14] is a synthetic benchmark modeled after cloud services. It contains a single table that is partitioned across servers in a round-robin fashion. Each partition contains 10 GB of data with 1 KB tuples. Each transaction accesses 16 tuples as a mixture of reads (50%) and writes (50%), with on average 5% of the accesses being remote (selected uniformly at random). The queries access tuples following a power-law distribution controlled by a parameter 𝜃. By default, we use 𝜃 = 0, which means data accesses are uniformly distributed.

TPC-C:
This is a standard benchmark for evaluating the performance of OLTP DBMSs [29]. TPC-C models a warehouse-centric order-processing application that contains five transaction types. All the tables except ITEM are partitioned based on the warehouse ID. By default, the ITEM table is replicated at each server. We use a single warehouse per server to model high contention. Each warehouse contains around 100 MB of data. Of the five transaction types, 10% of NEW-ORDER and 15% of PAYMENT transactions access data across multiple servers; the other transactions access data on a single server.
Unless otherwise specified, we use the following default parameter settings: we evaluate the system on four compute nodes and four storage nodes. The number of worker threads executing the transaction logic is 20 per node, and the number of server threads handling remote requests is also 20. The default concurrency control algorithm is NO-WAIT. For each data point, we run five trials of 30 seconds each and take the result from the trial with the maximum throughput. Running the experiments for a longer time does not change the conclusions.

For read-only transactions, we assume the coordinator of a transaction can learn that it is read-only at the end of the execution phase, such that both Cornus and 2PC can skip both the prepare and commit phases for read-only transactions [23].
In this experiment, we vary the percentage of distributed transactions in the YCSB workload to compare Cornus with 2PC. Note that we set the number of worker threads to a low number (4 threads) to reduce the interference of high variance in network latency.

Figure 5a shows the latency comparison. On the x-axis we change the amount of distributed transactions from 0% to nearly 100%. We report the average and the 99th-percentile tail latency of both Cornus and 2PC as lines, and the latency speedup of Cornus over 2PC as bars. As the figure shows, Cornus's latency speedup increases as the number of distributed transactions increases. The maximum speedup is about 1.4×, which is achieved when nearly all the transactions are distributed. The speedup of tail latency is similar to that of the average latency.

Figure 5b shows the latency breakdown for both local and distributed transactions for both Cornus and 2PC, when the majority (i.e., 97%) of transactions are distributed. For local transactions, there is no need to run the commit protocol, so there is no prepare phase but only the commit phase. We notice that for local transactions, Cornus is slightly slower than 2PC. This is because Cornus allows more transactions to run concurrently, incurring more resource contention. For distributed transactions, Cornus almost completely eliminates the latency of the commit phase that a user experiences. This is mainly due to two reasons: (1) as discussed in Section 3.1, the decision logging in the commit phase can be done asynchronously in both the coordinator and the participants, thereby eliminating the extra delay; (2) the logging overhead in cloud storage is significantly higher than a simple network round-trip.

Figure 5c shows the throughput of 2PC and Cornus as the percentage of distributed transactions increases. We see that Cornus consistently outperforms 2PC in throughput. Higher throughput improvement is achieved when more transactions are distributed. The gain in throughput speedup with respect to increased distributed transactions is relatively small compared to the latency speedup. Note that throughput is not the primary metric that Cornus is trying to improve; in later experiments, we will focus more on the latency speedup instead of the throughput speedup.

Figure 5: Percentage of Distributed Transactions — YCSB with varying percentage of distributed transactions. (a) Latency; (b) Latency breakdown (97% distributed txns); (c) Throughput.

We now evaluate the performance of Cornus under YCSB with different percentages of read-only transactions. In our system, the read ratio is a per-request setting, and we control the percentage of read-only transactions indirectly by controlling the read ratio of each data access request. In this experiment, we expect Cornus to obtain a latency speedup only for read-write transactions, since both Cornus and 2PC omit the prepare and commit phases for read-only transactions as described in Section 4.1.4.

Figure 6: YCSB varying percentage of read-only transactions. (a) Latency; (b) Latency breakdown; (c) Throughput.

The results shown in Figure 6a match the expectation that the improvement of Cornus increases as the percentage of read-only transactions decreases. When all transactions are read-only, Cornus and 2PC have the same performance. When there are more than 80% read-write transactions, the average latency speedup of Cornus over 2PC is nearly 1.5×.

The results in Figure 6b show that when all transactions are read-only, Cornus and 2PC perform the same, as expected. The commit phases in both protocols are short due to the optimizations for read-only transactions described in Section 4.1.4. The prepare phases are the same because, in our implementation, we do not omit the prepare phase, and there is no logging in the prepare phase but only vote requests. When all transactions are read-write, we can clearly see that Cornus eliminates the commit phase in the latency breakdown. We still see that Cornus is slightly slower in the prepare phase, which we ascribe to increased network traffic and resource contention.

The results in Figure 6c show that there is still no significant benefit for Cornus on throughput for this workload; the numbers are close for both protocols. For the speedup in throughput, although there are some variations in the middle due to the close numbers, we can still see that when all the transactions are read-write, Cornus achieves the best speedup in throughput.

In real-world applications, the time spent on logging varies due to factors like variance in network latency and different choices of underlying storage services. For example, a geo-replicated highly-available storage service can be orders of magnitude slower than a local disk. In this experiment, we evaluate the performance of Cornus as the latency of logging increases. We simulate the effect by introducing artificial delays in logging.

Figure 7a shows that the speedup of Cornus increases as the latency of logging increases. With 32 ms of extra logging delay introduced, Cornus can achieve a 2× speedup with respect to 2PC in average latency. Figure 7b further demonstrates that the improvements in Cornus are due to the savings from logging in the commit phase. Figure 7c again shows a constant but slight benefit of Cornus over 2PC on throughput.

Figure 7: YCSB varying logging delay. (a) Latency; (b) Latency breakdown; (c) Throughput.
Figure 8: YCSB varying data distribution 𝜃. (a) Latency (log scale in y-axis); (b) Latency breakdown (execution, prepare, commit, abort); (c) Throughput.

Figure 9: TPC-C varying the number of warehouses. (a) Latency; (b) Latency breakdown; (c) Throughput.

Figure 10: Scalability. (a) YCSB latency; (b) YCSB throughput; (c) TPC-C latency; (d) TPC-C throughput.
These results indicate that Cornus is particularly beneficial when the storage layer incurs longer delays. This is the case when compute nodes and storage nodes are geo-distributed and the storage needs to maintain consistency among multiple replicas, as done in modern cloud storage services.
In this section, we evaluate the performance of Cornus under different levels of contention on YCSB and TPC-C workloads, respectively.

We vary the level of contention in YCSB by adjusting the Zipfian distribution of data accesses through θ, as mentioned in Section 4.1.3; the larger the θ, the higher the level of contention (a minimal sampler sketch is shown at the end of this section). Figure 8a shows the results. Overall, the improvement of Cornus can be close to 2× at high contention. Interestingly, the speedup of Cornus follows a v-curve as θ increases: it first decreases until θ reaches 0.9 and then increases as θ grows beyond 0.9.

Figure 8b suggests an interpretation of the v-curve. When contention is low (θ < 0.9), aborts are rare and the main benefit of Cornus comes from the time saved in the commit phase. When contention is high (θ ≥ 0.9), the time spent on aborts increases significantly and becomes a dominant factor in overall latency. In this regime, transactions in Cornus hold locks for a shorter time because of their lower average latency, so Cornus gains additional improvement from having fewer aborts. Figure 8c shows a similar pattern for throughput: Cornus benefits from the shorter lock-holding time.

We also evaluate the effect of contention on TPC-C by controlling the number of warehouses. In Figure 9a, the speedup of Cornus increases as the level of contention decreases. The benefit of saving one logging operation diminishes as contention increases and the time spent on aborts comes to dominate the runtime, as shown in Figure 9b. Figure 9c shows that Cornus and 2PC have similar throughput due to the low percentage of distributed transactions in TPC-C.

Finally, we evaluate the scalability of Cornus on both YCSB and TPC-C as the number of compute nodes varies from 2 to 8. We set the parameters to the default values described in Section 4.1.3 and run each setup five times with a 40-second run time per trial. The results in Figure 10 demonstrate that both Cornus and 2PC scale well on YCSB and TPC-C: the latency of both protocols remains stable and the throughput increases linearly as the number of nodes increases, and the speedups in latency and throughput remain constant.
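As a concrete illustration of how θ controls contention, the sketch below (an illustrative example rather than the implementation used in the experiments; the key count, seed, and function name are hypothetical) samples keys with probability proportional to 1/(rank+1)^θ, the YCSB-style Zipfian access pattern. Small θ yields a nearly uniform spread, while larger θ concentrates accesses on a few hot keys, which is what produces the lock conflicts discussed above.

import bisect
import random

def make_zipfian_sampler(num_keys: int, theta: float, seed: int = 1):
    # Key at rank i gets weight proportional to 1 / (i + 1)^theta.
    weights = [1.0 / (i + 1) ** theta for i in range(num_keys)]
    total = sum(weights)
    cdf = []
    acc = 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    rng = random.Random(seed)

    def sample() -> int:
        # Inverse-CDF sampling; min() guards against floating-point round-off.
        return min(bisect.bisect_left(cdf, rng.random()), num_keys - 1)

    return sample

# theta = 0.9 is the inflection point of the v-curve in Figure 8.
sample = make_zipfian_sampler(num_keys=10_000, theta=0.9)
print([sample() for _ in range(8)])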
This section describes additional related work on reducing the latency of 2PC and on solving its blocking problem.
Many previous works propose to remove one phase from 2PC by combining the voting phase with the execution of the transaction, which means the log must be forced before the commit procedure starts. There are two common approaches. The first was originally introduced as early prepare (EP) [26]: participants must force a log write before acknowledging every remote operation, which introduces blocking I/O for every remote operation. To address this overhead, other works propose that participants send their logs along with the acknowledgements to the coordinator, and the coordinator forces the log before the commit phase. This approach is applied in coordinator log (CL) [26, 27], implicit yes-vote (IYV) [9], and Lee and Yeom's protocol [22] for in-memory databases. Although some of these protocols try to reduce the amount of log data kept at the coordinator, they all suffer from larger acknowledgement messages. Furthermore, they lose site autonomy, a property requiring that a participant's recovery rely largely on its own logs; these protocols instead rely on the coordinator's log to recover a participant [7, 8]. To preserve site autonomy and avoid the communication cost of piggybacking redo logs during normal processing, Abdallah et al. [7, 8] proposed logical logging, in which the coordinator logs operations instead of values before issuing remote operations.

However, all of these protocols that embed voting into the execution phase place strict restrictions on the choice of concurrency control protocol. They assume a transaction can commit once all of its operations have been acknowledged as successfully executed, with no aborts allowed afterwards due to serialization or consistency [8]. This assumption is incompatible with concurrency control protocols in which aborts due to serializability validation may occur after an operation has executed, such as optimistic concurrency control. These strong assumptions make such 1PC protocols impractical for real-world systems.

Other works save one phase for a specific use case or system. For example, Congiu et al. [13] designed a 1PC protocol tailored for metadata services; in their design, the voting phase is omitted because only two nodes (the metadata servers) are involved. A recent parallel commit protocol [6, 31] in CockroachDB [28] removes one network round trip from 2PC. It bears similarity to Cornus in that it also considers a transaction committed once all the participants' writes succeed. However, the proposal is designed for a specific system with a specific architecture and concurrency control protocol, and a formal specification of the protocol and its proofs have not been published, so it is unclear whether the protocol can be generalized like 2PC and integrated into any cloud database. A comparison with Cornus can be carried out in the future if details of the protocol are published.
Skeen [25] showed the necessary and sufficient conditions for a correct non-blocking commitment protocol, known as the fundamental nonblocking theorem. Specifically, nodes in 2PC have four possible states: initial, wait, abort, and commit. The paper pointed out that, under the same assumptions made in 2PC, a non-blocking protocol can be achieved by introducing another state, a buffer state, in the transition from the wait state to the commit state. Adding such a state requires one more network round trip, turning the protocol into three-phase commit (3PC). Although it solves the blocking issue, 3PC exacerbates the latency problem of 2PC.

Babaoglu and Toueg [11] proposed a non-blocking atomic commitment protocol based on 2PC. The protocol applies three strategies: (1) synchronizing clocks across nodes so that messages that arrive too late can be ignored; (2) having participants forward the decision to the other participants upon receiving it from the coordinator (the Uniform Timed Reliable Broadcast algorithm); and (3) presuming abort instead of running a termination protocol upon timeout. The first two strategies enable the last one to eliminate the chance of blocking. However, the protocol introduces more communication across nodes even when no failure happens, and its correctness relies on synchronized clocks, a non-trivial requirement for real-world applications.

Similar to Babaoglu and Toueg's protocol, EasyCommit [19] requires participants to forward decisions to each other upon receiving a decision but before logging it. When a participant times out, a leader is elected; the leader then consults all active nodes and can decide to abort if no active node has a decision. However, the protocol satisfies the atomic commitment properties only under the assumption that a forwarded message is delivered to at least one node, without delay or loss, before the decision log is flushed. Moreover, it still introduces extra communication overhead when no failure occurs, as well as the complexity of leader election when failures happen.
We proposed Cornus, a one-phase commit protocol designed for the storage disaggregation architecture that is widely used in cloud databases. Cornus solves both the long latency and the blocking problems of 2PC at the same time, without introducing extra impractical assumptions. We formally proved the correctness of Cornus and proposed several optimizations. Our evaluations on two benchmarks show a speedup of 1.5× in latency.

REFERENCES
[1] 2015. gRPC: A high performance, open-source universal RPC framework. https://grpc.io/.
[2] 2018. Amazon Athena - Serverless Interactive Query Service. https://aws.amazon.com/athena.
[3] 2018. Amazon Redshift. https://aws.amazon.com/redshift.
[4] 2018. Amazon S3. https://aws.amazon.com/s3/.
[5] 2018. Presto. https://prestodb.io.
[6] 2020. Parallel Commits.
[7] In Proceedings of the National Conference Bases de Données Avancées. Citeseer.
[8] Maha Abdallah, Rachid Guerraoui, and Philippe Pucheral. 1998. One-phase commit: does it make sense?. In Proceedings 1998 International Conference on Parallel and Distributed Systems (Cat. No. 98TB100250). IEEE, 182–192.
[9] Y. Al-Houmaily and P. Chrysanthis. 1995. Two-phase commit in gigabit-networked distributed databases. In Int. Conf. on Parallel and Distributed Computing Systems (PDCS).
[10] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, et al. 2015. Spark SQL: Relational Data Processing in Spark. In SIGMOD.
[11] Ozalp Babaoglu and Sam Toueg. 1993. Understanding non-blocking atomic commitment. Distributed Systems (1993).
[12] Philip A. Bernstein. 1987. Concurrency Control and Recovery in Database Systems. Vol. 370. Addison-Wesley, New York.
[13] Giuseppe Congiu, Matthias Grawinkel, Sai Narasimhamurthy, and André Brinkmann. 2012. One phase commit: A low overhead atomic commitment protocol for scalable metadata services. IEEE, 16–24.
[14] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing. 143–154.
[15] Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, et al. 2016. The Snowflake Elastic Data Warehouse. In SIGMOD.
[16] Aleksandar Dragojević, Dushyanth Narayanan, Edmund B. Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, and Miguel Castro. 2015. No Compromises: Distributed Transactions with Consistency, Availability, and Performance. In SOSP. 54–70.
[17] Dmitry Duplyakin, Robert Ricci, Aleksander Maricq, Gary Wong, Jonathon Duerig, Eric Eide, Leigh Stoller, Mike Hibler, David Johnson, Kirk Webb, Aditya Akella, Kuangching Wang, Glenn Ricart, Larry Landweber, Chip Elliott, Michael Zink, Emmanuel Cecchet, Snigdhaswin Kar, and Prabodh Mishra. 2019. The Design and Operation of CloudLab. In Proceedings of the USENIX Annual Technical Conference (ATC).
[18] Proceedings. 14th Symposium on Reliable Distributed Systems. IEEE, 41–50.
[19] Suyash Gupta and Mohammad Sadoghi. 2018. EasyCommit: A Non-blocking Two-phase Commit Protocol. In EDBT. 157–168.
[20] Rachael Harding, Dana Van Aken, Andrew Pavlo, and Michael Stonebraker. 2017. An Evaluation of Distributed Concurrency Control. VLDB (2017), 553–564.
[21] Hsiang-Tsung Kung and John T. Robinson. 1981. On optimistic methods for concurrency control. ACM Transactions on Database Systems (TODS) 6, 2 (1981), 213–226.
[22] Inseon Lee and Heon Young Yeom. 2002. A single phase distributed commit protocol for main memory database systems. In Proceedings 16th International Parallel and Distributed Processing Symposium. IEEE, 8 pp.
[23] C. Mohan, Bruce Lindsay, and Ron Obermarck. 1986. Transaction management in the R* distributed database management system. ACM Transactions on Database Systems (TODS) 11, 4 (1986), 378–396.
[24] George Samaras, Kathryn Britton, Andrew Citron, and C. Mohan. 1993. Two-phase commit optimizations and tradeoffs in the commercial environment. In Proceedings of IEEE 9th International Conference on Data Engineering. IEEE, 520–529.
[25] Dale Skeen. 1981. Nonblocking commit protocols. In Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data. 133–142.
[26] James W. Stamos and Flaviu Cristian. 1990. A low-cost atomic commit protocol. In Proceedings Ninth Symposium on Reliable Distributed Systems. IEEE, 66–75.
[27] James W. Stamos and Flaviu Cristian. 1993. Coordinator log transaction execution protocol. Distributed and Parallel Databases 1, 4 (1993), 383–408.
[28] Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, et al. 2020. CockroachDB: The resilient geo-distributed SQL database. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1493–1509.
[29] The Transaction Processing Council. 2007. TPC-C Benchmark (Revision 5.9.0).
[30] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, and Raghotham Murthy. 2010. Hive - A Petabyte Scale Data Warehouse Using Hadoop. In ICDE.
[31] Nathan VanBenschoten. 2019. Parallel Commits: An Atomic Commit Protocol For Globally Distributed Transactions.
[32] Alexandre Verbitski, Anurag Gupta, Debanjan Saha, et al. 2017. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. In Proceedings of the 2017 ACM International Conference on Management of Data. 1041–1052.
[33] Xiangyao Yu, Yu Xia, Andrew Pavlo, Daniel Sanchez, Larry Rudolph, and Srinivas Devadas. 2018. Sundial: harmonizing concurrency control and caching in a distributed OLTP database management system. Proceedings of the VLDB Endowment 11, 10 (2018), 1289–1302.
[34] Xiangyao Yu, Matt Youill, Matthew Woicik, Abdurrahman Ghanem, Marco Serafini, Ashraf Aboulnaga, and Michael Stonebraker. 2020. PushdownDB: Accelerating a DBMS using S3 Computation. In 2020 IEEE 36th International Conference on Data Engineering (ICDE).