Design and Verification of a Logless Dynamic Reconfiguration Protocol in MongoDB Replication
WILLIAM SCHULTZ,
Northeastern University, United States
SIYUAN ZHOU,
MongoDB, Inc., United States
STAVROS TRIPAKIS,
Northeastern University, United States
We present a novel dynamic reconfiguration protocol for the MongoDB replication system that extends and generalizes the single server reconfiguration protocol of the Raft consensus algorithm. Our protocol decouples the processing of configuration changes from the main database operation log, which allows reconfigurations to proceed in cases when the main log is prevented from processing new operations. Additionally, this decoupling allows for configuration state to be managed by a logless replicated state machine, by optimizing away the explicit log and storing only the latest version of the configuration, avoiding the complexities of a log-based protocol. We provide a formal specification of the protocol along with results from automated verification of its safety properties. We also provide an experimental evaluation of the protocol's benefits, showing how reconfigurations are able to quickly restore a system to healthy operation in scenarios where node failures have stalled the main operation log.
Distributed replication systems based on the replicated state machine model [35] have become ubiquitous as the foundation of modern, fault-tolerant data storage systems. In order for these systems to ensure availability in the presence of faults, they must be able to dynamically replace failed nodes with healthy ones, a process known as dynamic reconfiguration. The protocols for building distributed replication systems have been well studied and implemented in a variety of systems [10, 12, 14, 43, 47]. Paxos [16] and, more recently, the Raft algorithm [31], have served as the logical basis for building provably correct distributed replication systems. Dynamic reconfiguration, however, is an additionally challenging and subtle problem [5] that has not been explored as extensively as the foundational consensus protocols underlying these systems. Variants of Paxos have examined the problem of dynamic reconfiguration, but these reconfiguration techniques may require changes to a running system that impact availability [25] or require the use of an external configuration master [20]. The Raft consensus protocol, originally published in 2014, provided a dynamic reconfiguration algorithm in its initial publication, but did not include a precise discussion of its correctness or a formal specification or proof. A critical safety bug [29] in one of its reconfiguration protocols was found after initial publication, and has since been fixed. The discovery of bugs like these demonstrates that the design and verification of a safe dynamic reconfiguration protocol is a non-trivial task, and that the correctness of published protocols may not be particularly well understood by system designers and engineers, which is often important if optimizations or modifications need to be made to an underlying protocol.
As a rule of thumb, we believe that if distributed systems researchers can make errors designing these protocols, it is even harder for system engineers to understand the subtleties of these protocols when implementing them. Thus, it is important for their behaviors and characteristics to be well studied, formalized, and understood in different contexts and systems.

MongoDB [3] is a general purpose, document-oriented database which implements a distributed replication system [41] for providing high availability and fault tolerance. Since its inception, MongoDB's distributed replication system has provided a mechanism for clients to dynamically reconfigure replica membership, but this legacy protocol was unsafe in certain cases. In recent versions of MongoDB, reconfiguration has become a more common operation, which necessitated a redesigned, safe reconfiguration protocol. The new reconfiguration protocol was designed with a goal of provable correctness guarantees while minimizing changes to the existing protocol where possible. In this paper we propose a new protocol that satisfies these goals, and that improves upon and generalizes the single server reconfiguration protocol of standard Raft. This protocol, which we refer to as MongoRaftReconfig, provides logless dynamic reconfiguration by decoupling the processing of configuration changes from the main database operation log. This allows for design modularity in addition to improving reconfiguration performance by letting reconfigurations run in parallel to the main operation log. We present this protocol along with a formal specification and verification of its safety properties and an experimental evaluation of its performance benefits.
We additionally provide a discussion of how our protocol relates to and generalizes Raft's original reconfiguration protocol, which helps to establish a deeper understanding of Raft's behaviors and properties. To summarize, we make the following contributions:

• We present MongoRaftReconfig, a novel extension of the static MongoDB replication protocol that allows for logless dynamic reconfiguration, and which generalizes and optimizes the single server reconfiguration protocol of standard Raft.
• We provide a formal specification of MongoRaftReconfig in TLA+ [26] along with results from automated verification of its key safety properties using the TLC model checker [46].
• We discuss how the concepts and behaviors of our protocol can be mapped to reconfiguration in standard Raft. Specifically, we show how our protocol optimizes Raft reconfiguration by avoiding unnecessary commitment of writes during reconfigurations, and how it simplifies the log structure for managing configuration state.
• We provide an experimental evaluation of MongoRaftReconfig's benefits, demonstrating how it improves upon reconfiguration in standard Raft.
Throughout this paper we consider a set of server processes Server = {s1, s2, ..., sn} that communicate by sending messages. We assume an asynchronous network model in which messages can be arbitrarily dropped or delayed. We assume servers can fail by stopping but do not act maliciously, i.e. we assume a "fail-stop" model with no Byzantine failures. We define a member set as an element m ∈ P(Server), where P is the powerset operator. We define a quorum similarly, as an element q ∈ P(Server). Member sets and quorums have the same type but refer to different conceptual entities. For any member set m, there is an associated set of quorums, denoted Quorums(m), which contains all quorums in P(m) with at least a majority of elements in m:

Quorums(m) = {s ∈ P(m) : |s| ∗ 2 > |m|}    (1)

where |S| denotes the cardinality of a set S. For two member sets mi, mj, we say that they satisfy the quorum overlap condition if any two quorums of either set have at least one common member, i.e.

QuorumOverlap(mi, mj) = ∀ qi ∈ Quorums(mi), qj ∈ Quorums(mj) : qi ∩ qj ≠ ∅    (2)

For any two member sets mi, mj that differ by at most one element, the quorum overlap condition is satisfied, i.e.

∀ mi, mj ∈ P(Server) : |mi Δ mj| ≤ 1 ⇒ QuorumOverlap(mi, mj)    (3)

where Δ represents the symmetric difference between two sets. This fact is demonstrated in Section 4.1 of [28].

Raft [28] is a consensus protocol for implementing a replicated log in a system of distributed servers. It has been implemented in a variety of systems across the industry [30]. The core Raft protocol implements a replicated state machine using a static set of servers. In the protocol, time is divided into terms of arbitrary length, where terms are numbered with consecutive integers. Each term has at most one leader, which is selected via an election that occurs at the beginning of a term.
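As a concrete illustration of Formulas (1)-(3) above, the following Python sketch enumerates quorums by brute force. This is our own illustration, not part of the paper's specification; the names quorums, quorum_overlap, and single_node_change are hypothetical.

```python
from itertools import chain, combinations

def quorums(m):
    """Quorums(m), Formula (1): all subsets q of m with 2 * |q| > |m|."""
    members = list(m)
    subsets = chain.from_iterable(
        combinations(members, k) for k in range(len(members) + 1))
    return [set(q) for q in subsets if 2 * len(q) > len(members)]

def quorum_overlap(mi, mj):
    """QuorumOverlap(mi, mj), Formula (2): every quorum of mi
    intersects every quorum of mj."""
    return all(qi & qj for qi in quorums(mi) for qj in quorums(mj))

def single_node_change(mi, mj):
    """Premise of Formula (3): member sets differ by at most one element."""
    return len(mi ^ mj) <= 1
```

Adding one node to a three-node member set preserves quorum overlap, whereas replacing two nodes at once can yield disjoint quorums, which is exactly why Formula (3) restricts the symmetric difference to one.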
To dynamically change the set of servers operating the protocol, Raft includes two different, alternate algorithms: single server membership change and joint consensus. Joint consensus adopts a two phase approach, where the system must move through an intermediate configuration before reaching a specified target configuration. The single server change approach avoids this complexity by restricting reconfigurations to only add or remove a single node. In both algorithms, reconfiguration is accomplished by writing a special reconfiguration entry into the main Raft operation log that alters the local configuration of a node. Throughout, we refer to the original Raft protocol as described and specified in [28] as standard Raft. In this paper we are primarily concerned with Raft's single server change reconfiguration protocol, so when referring to reconfiguration in standard Raft, we assume it to mean the single server change protocol.
MongoDB is a document oriented database that stores data in JSON-like objects. A MongoDB database consists of a set of collections, where a collection is a set of unique documents. To provide high availability, MongoDB provides the ability to run a database as a replica set, which is a set of MongoDB servers that act as a consensus group, where each node maintains a logical copy of the database state.
MongoDB replica sets utilize a replication protocol that is similar to Raft, with some extensions. We refer to MongoDB's abstract replication protocol, without dynamic reconfiguration, as MongoStaticRaft. This protocol can be viewed as a modified version of standard Raft that satisfies the same underlying correctness properties. A more in depth description of MongoStaticRaft is given in [41, 48], but we provide a high level overview here, since our reconfiguration protocol, MongoRaftReconfig, is built on top of MongoStaticRaft. In a replica set running MongoStaticRaft there exists a single primary server and a set of secondary servers. As in standard Raft, there is a single primary elected per term. The primary server accepts client writes and inserts them into an ordered operation log known as the oplog. The oplog is a logical log where each entry contains information about how to apply a single database operation. Each entry is assigned a monotonically increasing timestamp, and these timestamps are unique and totally ordered within a node's log. These log entries are then replicated to secondaries which apply them in order, leading to a consistent database state on all servers. When the primary learns that enough servers have replicated a log entry in its term, the primary will mark it as committed, meaning that the entry is permanently durable in the replica set. Clients of the replica set can issue writes with a specified write concern level, which indicates the durability guarantee that must be satisfied before the write can be acknowledged to the client. Providing a write concern level of majority ensures that a write will not be acknowledged until it has been marked as committed in the replica set.

MongoRaftReconfig: A LOGLESS DYNAMIC RECONFIGURATION PROTOCOL
Dynamic reconfiguration allows the set of servers operating as part of a replica set to be modified while maintaining the core safety guarantees of the replication protocol. Many consensus based replication protocols [25, 31, 42] utilize the main operation log (the oplog, in MongoDB) to manage configuration changes by writing special reconfiguration log entries. The MongoRaftReconfig protocol instead decouples configuration updates from the main operation log by managing the configuration state of a replica set in a separate replicated state machine, which we refer to as the config log. The config log is maintained alongside the oplog, and manages the configuration state used by the overall protocol. Decoupling these two conceptually distinct logs, the oplog and the config log, enables certain optimizations and simplifications in MongoRaftReconfig which would not be possible in a protocol where both logs are interleaved with each other. First, it allows for a simplification of the config log structure by observing that configuration changes are an "update only" operation. This obviates the need to store the entire log history, allowing the config log to operate as a logless replicated state machine, storing only the latest version of the configuration state. Second, it prevents the dynamics of either log from unnecessarily impacting the other. For example, it is possible to commit writes in either log independently, without requiring previous writes in the other log to also become committed. This can allow the config log to bypass the oplog, allowing for reconfigurations in cases where a slow or stalled oplog replication channel would otherwise prevent reconfigurations from proceeding. We examine these benefits experimentally in Section 6.2. In the remainder of this section we give a high level overview of the behaviors of MongoRaftReconfig and how it operates safely. In Section 4 we present our formal specification of the protocol in TLA+, which allows for a more precise description and enables automated verification of the protocol's safety properties, which we present in Section 5.
Dynamic reconfiguration in MongoRaftReconfig consists of two main aspects: (1) updating the current configuration and (2) propagating new configurations between servers. Configurations also have an impact on election behavior, which we discuss below in Section 3.1.2. Formally, a configuration is defined as a tuple (m, v, t), where m is a non-empty member set, v ∈ N is a numeric configuration version, and t ∈ N is the numeric term of the configuration. Each server of a replica set maintains a single, durable configuration, and it is assumed that, initially, all nodes begin with a common configuration.

To update the current configuration of a replica set, a client issues a reconfiguration command to a primary server with a new, desired configuration, Cnew. Reconfigurations can only be executed on primary servers, and they update the primary's current local configuration Cold to the specified configuration Cnew. The version of the new configuration, Cnew.v, must be greater than the version of the primary's current configuration, Cold.v, and the term of Cnew is set equal to the current term of the primary processing the reconfiguration. After a reconfiguration has occurred on a primary, the updated configuration needs to be communicated to other servers in the replica set. This is achieved in a simple, gossip like manner. Secondaries receive information about the configurations of other servers via periodic heartbeats. They need to have some mechanism, however, for determining whether another configuration is newer than their own. This is achieved by totally ordering configurations by their (version, term) pair, where term is compared first, followed by version. If configuration Cj compares as greater than configuration Ci based on this ordering, we say that Cj is newer than Ci. A secondary can install any configuration that is newer than its own.
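This (version, term) ordering can be sketched in a few lines of Python. The names is_newer and maybe_install are hypothetical, our own illustration, with configurations represented as (member set, version, term) tuples:

```python
def is_newer(cj, ci):
    """True iff configuration cj is newer than ci:
    compare terms first, then versions."""
    (_, vj, tj), (_, vi, ti) = cj, ci
    return (tj, vj) > (ti, vi)

def maybe_install(local, remote):
    """Gossip-style install step: a secondary adopts a received
    configuration only if it is strictly newer than its own."""
    return remote if is_newer(remote, local) else local
```

Note that a configuration written in a higher term compares as newer even if its version number is lower, which is what lets a newly elected primary's configuration supersede concurrent reconfigurations from older terms.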
If it learns that some other server has a newer configuration, it will fetch that server's configuration, verify that it is still newer than its own upon receipt, and install it locally. For sake of convenience, we refer to the elements of a configuration tuple C = (m, v, t) as, respectively, C.m, C.v and C.t.

The above provides a basic outline of how reconfigurations occur and how configurations are propagated between servers in
MongoRaftReconfig. In order for the protocol to operate safely, however, there are several additional restrictions that are imposed on both reconfigurations and elections, which we discuss in more detail below.

In MongoStaticRaft, which does not allow reconfiguration, the safety of the protocol depends on the fact that quorum overlap is satisfied for the member sets of any two configurations, since there is a single, uniform configuration that is never modified. For any pair of arbitrary configurations, however, their member sets may not satisfy this property. So, in order for MongoRaftReconfig to operate safely, extra restrictions are needed on how nodes are allowed to move between configurations. First, any reconfiguration that moves from Cold to Cnew is required to satisfy the quorum overlap condition, i.e. QuorumOverlap(Cold.m, Cnew.m). To satisfy this, it is sufficient to enforce a single node change condition, which requires that no more than a single member is added or removed in a single reconfiguration. The sufficiency of this condition to enforce quorum overlap is illustrated in Formula 3. Although the single node change condition ensures quorum overlap between two adjacent configurations, it may not be ensured between all configurations that the system passes through over time. So, there are two additional preconditions that must be satisfied before a primary node can execute a reconfiguration out of its current configuration C.

P1. Config Commitment: The primary's current configuration, C, must be replicated to and installed on some quorum of servers in its current member set, C.m, that are in the primary's current term.
P2. Oplog Commitment: Any oplog entries that were committed by the current primary in its previous configuration must be committed on some quorum of servers in its current member set, C.m.

At a high level, these preconditions enforce, respectively, two fundamental requirements needed for safe reconfiguration: deactivation of old configurations and state transfer from old configurations to new configurations. P1 ensures that configurations earlier than C can no longer independently form a quorum for electing a node or committing a log entry. P2 ensures that previously committed oplog entries are properly transferred to the current configuration, which ensures that any primary elected in a subsequent configuration will contain these entries. When a node runs for election in
MongoStaticRaft, it must ensure its log is appropriately up to date and that it can garner a quorum of voters in its term. In MongoRaftReconfig, there is an additional restriction on voting behavior that depends on configuration ordering. If a replica set server is a candidate for election in configuration Ci, then a prospective voter in configuration Cj may only cast a vote for the candidate if Ci is newer than or equal to Cj. Furthermore, when a node wins an election, it must commit a configuration in its own term before it is allowed to execute subsequent reconfigurations. This is achieved by requiring nodes to atomically re-write their existing configuration with their new term upon winning an election. That is, if a node with current configuration (m, v, t) wins election in term t′, it will update its configuration to (m, v, t′) before allowing any reconfigurations to be processed. This behavior is necessary to appropriately disable concurrent reconfigurations that may occur on primaries in a different term. This configuration re-writing behavior is analogous to the write in Raft's corrected membership change protocol proposed in [29].

The above provides a high level description of the behaviors of MongoRaftReconfig and how it operates safely. In the following section we present our formal specification of the protocol in TLA+, which allows us to define the protocol and its safety properties precisely. Additionally, it allows for automated verification of the protocol's correctness, which we discuss in Section 5.

MongoRaftReconfig behaves as an extension of
MongoStaticRaft that allows for dynamic reconfiguration. Thus, it can be formally viewed as a composition of two distinct subprotocols: one for managing the oplog, and one for managing the config log. The oplog is maintained by MongoStaticRaft, and the config log is maintained by a protocol we refer to as MongoLoglessDynamicRaft, which implements a logless replicated state machine that stores the configuration state of the replica set. The complete, formal description of MongoRaftReconfig is given in the accompanying TLA+ specification [39], which is summarized in Figure 1. Since MongoRaftReconfig is an extension of the existing MongoStaticRaft protocol, we first present a specification of MongoStaticRaft, followed by the formal specification of MongoRaftReconfig.

Note that TLA+ does not impose an underlying system or communication model (e.g. message passing, shared memory), which allows one to write specifications at a wide range of abstraction levels. Our specifications are written at a deliberately high level of abstraction, ignoring some lower level details of the protocol and system model. In practice, we have found the abstraction level of our specifications most useful for understanding and communicating the essential behaviors and safety characteristics of the protocol, while also serving to make automated verification feasible, which we examine further in Section 5.2.
We use the TLA+ language [18] to formally describe our reconfiguration protocol. TLA+ is a formal specification language for describing distributed and concurrent systems that is based on first order and temporal logic [33]. Specifying a system in TLA+ consists of defining a set of state variables, vars, along with a temporal logic formula which describes the set of permitted system behaviors over these variables. The canonical way of defining a specification is as the conjunction of an initial state predicate, Init, and a next state relation, Next, which determine, respectively, the set of allowed initial states and how the protocol may transition between states. The overall system is then defined by the temporal formula Init ∧ □[Next]vars, where □ denotes the "always" operator of temporal logic, meaning that a formula holds true at every step of a behavior. [Next]vars is equivalent to the expression Next ∨ (vars′ = vars), which means that specifications of this form allow for stuttering steps, i.e. transitions that do not change the state. A primed variable, expressed by attaching a ′ symbol, denotes the value of a variable in the next state of a system behavior. The next state relation is typically written as a disjunction A1 ∨ A2 ∨ ... ∨ An of actions Ai, where an action is a logical predicate that depends on both the current and next state of a behavior. For example, for variables x and y, the following specification, Spec, describes a system whose initial state is ⟨x = 0, y = 0⟩, and which at each step can non-deterministically increment x or y by 1, or leave both variables unchanged.

Init ≜ x = 0 ∧ y = 0
Next ≜ (x′ = x + 1) ∨ (y′ = y + 1)
Spec ≜ Init ∧ □[Next]⟨x,y⟩

Correctness properties and system specifications in TLA+ are both written as temporal logic formulas. This allows one to express notions of property satisfaction and refinement in a concise and similar manner.
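To make these semantics concrete, here is a hypothetical Python sketch, in the spirit of an explicit-state checker like TLC, that enumerates the reachable states of the example Spec above up to a bound. Stuttering steps are omitted from exploration since they produce no new states; all function names here are our own.

```python
def init_states():
    # Init: x = 0 and y = 0
    return [(0, 0)]

def next_states(x, y):
    # Next: non-deterministically increment x or y by 1
    return [(x + 1, y), (x, y + 1)]

def reachable(bound):
    """Exhaustively explore states with x, y <= bound, starting from Init."""
    seen = set()
    frontier = list(init_states())
    while frontier:
        state = frontier.pop()
        if state in seen:
            continue
        seen.add(state)
        for nxt in next_states(*state):
            if all(v <= bound for v in nxt):
                frontier.append(nxt)
    return seen
```

An invariant can then be checked by evaluating a predicate at every state in the reachable set, which is essentially what TLC does for the finite protocol instances described in Section 5.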
We say that a specification S satisfies a property P iff the formula S ⇒ P is valid (i.e. true under all assignments). We say that a specification S1 refines (or is a refinement of) S2 iff S1 ⇒ S2 is valid, i.e. every behavior of S1 is a valid behavior of S2 [4].

(a)
module MongoStaticRaft
  variable log
  variable committed
  variable currentTerm
  variable state
  variable config
  Next ≜ ∃ s, t ∈ Server : ∃ Q ∈ Quorums(config[s]) :
      ∨ ClientRequest(s)
      ∨ GetEntries(s, t)
      ∨ RollbackEntries(s, t)
      ∨ BecomeLeader(s, Q)
      ∨ CommitEntry(s, Q)
      ∨ UpdateTerms(s, t)

(b)
module MongoLoglessDynamicRaft
  variable currentTerm
  variable state
  variable config
  variable configVersion
  variable configTerm
  Next ≜ ∃ s, t ∈ Server : ∃ Q ∈ Quorums(config[s]) :
      ∨ SendConfig(s, t)
      ∨ Reconfig(s)
      ∨ BecomeLeaderDynamic(s, Q)
      ∨ UpdateTermsDynamic(s, t)

(c)
 1  module MongoRaftReconfig
 2  variable log, committed, currentTerm, state,
 3      config, configVersion, configTerm
 4
 5  configVars ≜ ⟨configVersion, configTerm⟩
 6  oplogVars ≜ ⟨log, committed⟩
 7  Next ≜ ∃ s, t ∈ Server :
 8      ∃ Q ∈ Quorums(config[s]) :
 9      ∨ ClientRequest(s) ∧ UNCHANGED configVars
10      ∨ GetEntries(s, t) ∧ UNCHANGED configVars
11      ∨ RollbackEntries(s, t) ∧ UNCHANGED configVars
12      ∨ CommitEntry(s, Q) ∧ UNCHANGED configVars
13      ∨ SendConfig(s, t) ∧ UNCHANGED oplogVars
14      ∨ OplogCommitment(s) ∧ Reconfig(s) ∧ UNCHANGED oplogVars
15      ∨ BecomeLeader(s, Q) ∧ BecomeLeaderDynamic(s, Q)
16      ∨ UpdateTerms(s, t) ∧ UpdateTermsDynamic(s, t)

Fig. 1. The MongoRaftReconfig transition relation defined as a composition of its subprotocols, MongoStaticRaft and MongoLoglessDynamicRaft. Actions in blue represent those of MongoStaticRaft, and actions in orange represent those of MongoLoglessDynamicRaft. The UNCHANGED construct indicates that a set of variables do not change on a transition.

Notation.
TLA+ includes sets, functions, sequences, and records as primitive data types. The expression Seq(S) refers to the set of all sequences with elements in the set S. For a sequence s, the expression Len(s) gives its length, and s[n] refers to the n-th element of s, 1-indexed. Additionally, for a function f : S → T, we denote f[s] as the value of f on input s ∈ S. For sets S, T, the set of all functions with domain S and range T is denoted by the expression [S → T]. The construct UNCHANGED x is defined as x′ = x.

MongoStaticRaft Formal Specification
The high level behaviors of MongoStaticRaft were described informally in Section 2.3.1. Here we give a summary of the protocol's formal specification in TLA+, which is provided in [40]. We note that, although the MongoStaticRaft protocol existed prior to the work presented in this paper, it did not have a published formal specification. So, we developed one in order to describe how MongoRaftReconfig extends the protocol.

The state variables of the MongoStaticRaft specification are summarized in Figure 2, and its core actions are shown in Figure 1a. The specification represents the local state of each server in a set of global variables that are functions with domain Server. The log variable represents the oplog stored on each server, which is a sequence of term ∈ N values. The expression log[s][i] refers to the term of the log entry of server s at index i. We alternately refer to such an entry as the (index, term) pair (i, log[s][i]). The variable state represents whether a server is currently acting in Primary or Secondary role, and currentTerm is the local term of each server. The committed variable is a set of (index, term) pairs that represents the set of log entries that have been marked committed. Our specification also includes a config variable, which is the member set that each server considers to be part of the replica set. The config variable is not strictly required for specifying the behavior of MongoStaticRaft, since it is given an initial, uniform value on all servers and never changes. We include it, though, to make it clearer how the protocol is extended to include dynamic reconfiguration, which is described below, in Section 4.3.

In the initial state of the protocol, state[s] = Secondary, currentTerm[s] = 0, log[s] = ⟨⟩, and config[s] = m for all servers s ∈ Server, where m ∈ P(Server) is an arbitrary, non-empty member set. The high level descriptions of the core protocol actions are as follows:

• ClientRequest(s): a log entry is written on a primary server s.
• GetEntries(s, t): a log entry is replicated from server s to server t.
• RollbackEntries(s, t): server s deletes its last diverged log entry with respect to server t.
• BecomeLeader(s, Q): primary server s is elected with set of voters Q.
• CommitEntry(s, Q): a log entry is marked committed with a set of servers Q by primary server s.
• UpdateTerms(s, t): server t updates its term to the newer term of server s, and reverts to Secondary state.
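As one illustration of these actions, the commit rule behind CommitEntry(s, Q) can be sketched in Python. This is a hypothetical model of ours, with logs represented as in the specification: a map from server to a sequence of entry terms.

```python
def commit_entry(logs, term, index, quorum, committed):
    """CommitEntry(s, Q) sketch: mark (index, term) committed once every
    server in the quorum Q holds an entry with that term at that
    (1-indexed) log index. Returns whether the commit succeeded."""
    replicated = all(
        len(logs[n]) >= index and logs[n][index - 1] == term
        for n in quorum)
    if replicated:
        committed.add((index, term))
    return replicated
```

The term check matters: a primary only commits entries written in its own term, since an entry with the right index but a different term may still be rolled back.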
MongoRaftReconfig Formal Specification
We now give an overview of the formal specification of MongoRaftReconfig, which extends MongoStaticRaft with dynamic reconfiguration. Reconfiguration in MongoRaftReconfig is managed by the MongoLoglessDynamicRaft subprotocol, so we focus on the behaviors of this subprotocol and how it interacts with MongoStaticRaft to form the overall protocol, MongoRaftReconfig. The state variables of MongoRaftReconfig, along with their corresponding types and the subprotocol in which they are used, are summarized in Figure 2. As in MongoStaticRaft, the local state of each server is represented in a set of global variables that are functions with domain Server. The specification represents the configuration C of a server with three separate state variables, config, configVersion, and configTerm, which represent, respectively, the member set, version, and term of a server's current configuration. That is, the current configuration of a server s is C = (config[s], configVersion[s], configTerm[s]). The initial states of the shared protocol variables are the same as in MongoStaticRaft, and initially configVersion[s] = 1 and configTerm[s] = 0 for all s ∈ Server.

Protocol                   State Variable   Type
MongoStaticRaft            log              [Server → Seq(N)]
                           committed        P(N × N)
Shared                     currentTerm      [Server → N]
                           state            [Server → {Primary, Secondary}]
                           config           [Server → P(Server)]
MongoLoglessDynamicRaft    configVersion    [Server → N]
                           configTerm       [Server → N]

Fig. 2. State variables of the MongoRaftReconfig protocol and their corresponding types.
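Mirroring Figure 2, each state variable is a function with domain Server; in a hypothetical Python model this is simply a dict keyed by server, from which a server's configuration tuple C can be assembled. The snapshot values below are illustrative, not taken from the specification.

```python
# Hypothetical in-memory snapshot of the Figure 2 variables for a 3-node set.
state_vars = {
    "currentTerm":   {"s1": 1, "s2": 1, "s3": 0},
    "state":         {"s1": "Primary", "s2": "Secondary", "s3": "Secondary"},
    "config":        {s: {"s1", "s2", "s3"} for s in ("s1", "s2", "s3")},
    "configVersion": {"s1": 1, "s2": 1, "s3": 1},
    "configTerm":    {"s1": 0, "s2": 0, "s3": 0},
}

def current_config(sv, s):
    """C = (config[s], configVersion[s], configTerm[s]) for server s."""
    return (sv["config"][s], sv["configVersion"][s], sv["configTerm"][s])
```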
The core actions of MongoRaftReconfig are shown in Figure 1c. Reconfigurations are modeled by the Reconfig(s) action of MongoLoglessDynamicRaft, shown in Figure 1b, which represents a reconfiguration that occurs on server s and enforces the config commitment precondition, P1. The complete reconfiguration behavior is depicted on line 14 of Figure 1c, which includes the precondition OplogCommitment(s) to enforce condition P2, which depends on the set of committed oplog entries. Configuration propagation is modeled by the SendConfig(s, t) action, which represents the propagation of a configuration from server s to server t, and is shown in Figure 1b. The election behavior of MongoRaftReconfig is defined as a conjunction of the BecomeLeader(s, Q) and BecomeLeaderDynamic(s, Q) actions, which represents the election of node s with voter quorum Q and is shown on line 15 of Figure 1c. This conjunction simply means that both actions must be executed jointly, i.e. in the same transition. In our specification of MongoRaftReconfig we allow term information to propagate between servers at any time. The action UpdateTerms(s, t) propagates the term of a server s to server t, where currentTerm[s] > currentTerm[t]. Server t updates its term to currentTerm[s] and reverts to Secondary state if necessary. The definition of UpdateTerms(s, t) and UpdateTermsDynamic(s, t) is the same in both MongoLoglessDynamicRaft and MongoStaticRaft, so the behavior of their composition on line 16 of Figure 1c has the same effect as an UpdateTerms(s, t) action.
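The joint-execution semantics can be sketched as follows (our own illustration, with hypothetical names). Each action maps a state to an updated state, or None when not enabled; applying the conjuncts in sequence matches TLA+ conjunction here only because the paired actions, such as BecomeLeader and BecomeLeaderDynamic, update disjoint variables.

```python
def joint(*actions):
    """Compose actions that must execute in the same transition: the
    composite is enabled only if every conjunct is enabled, and all
    updates take effect together in one step."""
    def step(state):
        for act in actions:
            state = act(state)
            if state is None:  # some conjunct is not enabled: no transition
                return None
        return state
    return step
```

For example, a toy election action that is only enabled on a Secondary can be joined with a config-term rewrite so that neither effect can occur without the other.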
MongoRaftReconfig is specified as a composition of the subprotocols MongoLoglessDynamicRaft and MongoStaticRaft. The formal definition of this composition is shown in Figure 1c, where actions are colorized according to which subprotocol they belong to. The full definitions of the actions referenced there are given in the complete specification [39]. Specifying MongoRaftReconfig as a composition of these two subprotocols helps make the conceptual subcomponents of the protocol clearer, in addition to facilitating the scalability of automated safety verification, which we discuss in more detail in Section 5.2. Importantly, the composition of these protocols does not impact the underlying safety properties of MongoLoglessDynamicRaft. That is, MongoRaftReconfig only restricts the behaviors of MongoLoglessDynamicRaft; it does not add any new behaviors. So, any properties that MongoLoglessDynamicRaft satisfies in isolation are satisfied when it operates as a subprotocol of MongoRaftReconfig. We examine this property more formally in Section 5.3.
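The "restriction only" argument can be illustrated with a toy example: a composed protocol whose next-state relation conjoins a subprotocol's relation with extra preconditions can only remove transitions, never add them, so the subprotocol's invariants carry over. The two relations below are invented purely for illustration:

```python
# Toy illustration: conjoining transition relations only restricts behavior.

def mldr_next(u, v):
    # hypothetical subprotocol rule: a version number may never decrease
    return v >= u

def mrr_next(u, v):
    # composed rule: the subprotocol rule AND an extra precondition
    return mldr_next(u, v) and (v - u <= 1)

states = range(5)
mrr_steps = {(u, v) for u in states for v in states if mrr_next(u, v)}
mldr_steps = {(u, v) for u in states for v in states if mldr_next(u, v)}

# Every step of the composition is a legal step of the subprotocol, so any
# invariant over the subprotocol's transitions also holds for the composition.
assert mrr_steps <= mldr_steps
```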
The fundamental safety property of MongoDB's core replication protocol, MongoStaticRaft, is the StateMachineSafety property, which states that if an oplog entry has been marked committed at a particular log index, no conflicting log entry will ever be marked committed at the same index. We can formally state this property as a predicate on the committed variable, which stores the set of committed log entries as (index, term) pairs:

StateMachineSafety ≜ ∀ ci, cj ∈ committed : (ci[1] = cj[1]) ⇒ (ci = cj)

where c[1] is the index component of a committed pair. We want to verify that MongoRaftReconfig satisfies the same property. This property is an invariant, meaning that all reachable states of the protocol must satisfy it. Thus, our goal can be stated formally as:

MongoRaftReconfig ⇒ □ StateMachineSafety    (4)

Note that the MongoLoglessDynamicRaft protocol also operates as a Raft based state machine, and it is responsible for safely managing the configuration state. So, an auxiliary correctness goal which we also verify is that MongoLoglessDynamicRaft satisfies the StateMachineSafety property, which we examine in more detail below, in Section 5.2.3.
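Read as an executable predicate, the invariant simply forbids two distinct committed entries from sharing a log index. A minimal Python rendering (illustrative only, using 0-based tuple indexing for the index component) is:

```python
# StateMachineSafety over a set of committed (index, term) pairs: any two
# committed entries at the same log index must be the same entry.

def state_machine_safety(committed):
    return all(ci == cj
               for ci in committed
               for cj in committed
               if ci[0] == cj[0])

assert state_machine_safety({(1, 1), (2, 1), (3, 2)})   # no index clash
assert not state_machine_safety({(1, 1), (1, 2)})       # conflict at index 1
```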
We undertook an automated approach to verifying safety using TLC [46], an explicit state model checker for TLA+ specifications. We verified finite instances of the protocol to provide a sound guarantee of protocol correctness up to a certain size. It has been observed elsewhere [24] that relatively small, finite instances of distributed protocols are often sufficient to exhibit behaviors that are generalizable to larger (potentially infinite) instances, which provides confidence in our approach. We automatically verified the StateMachineSafety invariant by model checking a finite instance of the MongoRaftReconfig specification. Verifying this specification, however, encountered scalability issues even for very small models. To alleviate this, we additionally verified the MongoLoglessDynamicRaft protocol in isolation, which allowed us to check finite models with significantly larger parameters. In Section 5.3 we show, by a refinement based argument, that any properties of MongoLoglessDynamicRaft hold in MongoRaftReconfig. This allows us to assume that our verification results for MongoLoglessDynamicRaft hold in MongoRaftReconfig, providing stronger confidence in the correctness of the overall protocol.
TLC is an explicit state model checker that can check temporal properties of a given TLA+ specification. It is provided as a Java program that takes as input a TLA+ module file, a model checker configuration file, and a set of command line parameters. For checking safety properties, TLC assumes a TLA+ specification of the form Init ∧ □[Next]_vars. The configuration file tells TLC the name of the specification to check and of the properties to be checked. In addition, the configuration file defines a model of the specification, which is an assignment of values to any constant parameters of the specification. It is also possible to provide a state constraint, which is a state predicate that can be used to constrain the set of reachable states. If TLC discovers a reachable state that violates the state constraint predicate, it will not add the state to its current graph of reachable states. TLC also allows definition of a symmetry set, which causes the model checker to consider states that have the same constant values under some permutation as equivalent, which can significantly reduce the set of reachable states for certain models [11]. A more complete and in-depth explanation of TLC behavior and parameters can be found in [18]. For all model checking runs discussed below we used TLC version 2.15 (adc67eb) running with a single worker thread on CentOS Linux 7, with a 2.30GHz Intel Xeon Gold 5118 CPU. For checking safety of MongoRaftReconfig we used a model we refer to as MCMongoRaftReconfig, which imposes finite bounds on the MongoRaftReconfig TLA+ specification. The complete, runnable TLC configuration for this model can be found in [37]. The model sets
Server = {n1, n2, n3, n4} and imposes the following state constraint:

StateConstraint ≜ ∀ s ∈ Server :
    ∧ currentTerm[s] ≤ MaxTerm
    ∧ Len(log[s]) ≤ MaxLogLen
    ∧ configVersion[s] ≤ MaxConfigVersion

Fig. 3. Summary of TLC Model Checking Results. States is the number of reachable, distinct states discovered by TLC. Depth is the length of the longest behavior.
(a) MCMongoRaftReconfig: Server = {n1, n2, n3, n4}, MaxLogLen = 2, MaxTerm = 2, MaxConfigVersion = 3; constraint: StateConstraint; symmetry: Permutations(Server); invariant: StateMachineSafety; States: 18,955,578; Depth: 29; Duration: 6h 51min.
(b) MCMongoLoglessDynamicRaftAux: Server = {n1, n2, n3, n4}, MaxTerm = 4, MaxConfigVersion = 4; constraint: StateConstraint; symmetry: Permutations(Server); invariant: StateMachineSafety; States: 124,438,466; Depth: 30; Duration: 11h 35min.
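The complete, runnable configuration files are given in [37] and [36]; the sketch below shows only the general shape of such a TLC configuration file for this model. The definition names (Spec, StateConstraint, Symmetry) are assumed to be defined in the accompanying TLA+ module, and n1 through n4 are model values:

```
SPECIFICATION Spec

CONSTANTS
    Server = {n1, n2, n3, n4}
    MaxLogLen = 2
    MaxTerm = 2
    MaxConfigVersion = 3

CONSTRAINT StateConstraint
SYMMETRY Symmetry
INVARIANT StateMachineSafety
```

Here the module would define Symmetry == Permutations(Server), so that TLC treats states equal up to a renaming of the servers as one state.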
This constraint, along with a finite Server set, is sufficient to make the reachable state space of this model finite, since it limits the size of the three unbounded variables of the specification: terms, logs, and configuration versions. It restricts logs to a maximum finite length, and imposes a finite upper bound on terms and configuration versions. Figure 3a shows the parameters and results for this model. Permutations is an operator in the TLC.tla standard module [2], defined as the set of all permutations of elements in a given set. Under our symmetry declaration, any two states that are equal up to permutation of server identifiers are treated as equivalent by the model checker.
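To make the roles of the state constraint and the symmetry declaration concrete, the toy sketch below explores a trivial two-server system the way an explicit state checker does: it prunes successor states that violate a bound (the state constraint) and identifies states that are equal up to a renaming of servers (symmetry reduction). It illustrates the technique only; it is not TLC:

```python
from itertools import permutations

SERVERS = ("n1", "n2")
MAX_TERM = 2  # plays the role of the MaxTerm bound in the state constraint

def next_states(state):
    # toy protocol: any server may increment its own term
    term = dict(state)
    for s in SERVERS:
        yield tuple(sorted({**term, s: term[s] + 1}.items()))

def constraint(state):
    # prune states whose terms exceed the bound, as a CONSTRAINT would
    return all(t <= MAX_TERM for _, t in state)

def canon(state):
    # canonical representative of a state under all server permutations
    term = dict(state)
    reps = []
    for perm in permutations(SERVERS):
        renamed = {new: term[old] for old, new in zip(SERVERS, perm)}
        reps.append(tuple(sorted(renamed.items())))
    return min(reps)

init = tuple(sorted({s: 0 for s in SERVERS}.items()))
seen, frontier = {canon(init)}, [init]
while frontier:
    state = frontier.pop()
    for nxt in next_states(state):
        if constraint(nxt) and canon(nxt) not in seen:
            seen.add(canon(nxt))
            frontier.append(nxt)

# 6 distinct states up to symmetry, one per term multiset:
# {0,0}, {0,1}, {0,2}, {1,1}, {1,2}, {2,2}
```

Without the symmetry reduction the same search visits all 9 ordered pairs of terms in 0..2, which is the kind of saving that becomes substantial at the scale of the models in Figure 3.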
Even with finite constraints and the use of symmetry optimizations, the complexity of the complete MongoRaftReconfig protocol limited the scalability of our verification efforts. As seen in Figure 3a, a model with 4 servers and MaxLogLen = 2, MaxTerm = 2, MaxConfigVersion = 3 already required nearly 7 hours of model checking time. To push verification further, we also verified the MongoLoglessDynamicRaft subprotocol independently. MongoLoglessDynamicRaft operates its own replicated state machine and so must uphold the necessary safety properties in order for MongoRaftReconfig to operate safely. The compositional structure of MongoRaftReconfig makes it possible to verify MongoLoglessDynamicRaft in isolation and assume that its safety properties hold in MongoRaftReconfig. This technique allowed us to verify MongoLoglessDynamicRaft on models with significantly larger finite parameters and ensure that our results hold for MongoRaftReconfig. It provides stronger confidence that this subprotocol, which handles the main behaviors related to dynamic reconfiguration, is correct. For model checking, we used an augmented version of the MongoLoglessDynamicRaft specification that we refer to as MongoLoglessDynamicRaftAux, whose complete definition is given in [38]. MongoLoglessDynamicRaftAux is a simple extension of MongoLoglessDynamicRaft that adds a committed variable, which is a history variable [21] that records the set of committed configurations as (version, term) pairs. Since the core MongoLoglessDynamicRaft protocol does not explicitly record committed configurations, this variable is necessary in order to state the StateMachineSafety property. Note that a history variable is a “passive” state variable that does not change the semantics of a specification, i.e. it does not change a specification's set of behaviors. The full model checking results for our model MCMongoLoglessDynamicRaftAux, whose definition is provided in [36], are presented in Figure 3b. In the following section we show formally that it is sound to assume these results hold for MongoRaftReconfig, by showing that
MongoRaftReconfig ⇒ MongoLoglessDynamicRaft.

In order to ensure that the safety properties of MongoLoglessDynamicRaft hold for MongoRaftReconfig, we must demonstrate that the behaviors of MongoLoglessDynamicRaft are not augmented when it operates as a subprotocol of MongoRaftReconfig. Formally, we want to show that MongoRaftReconfig ⇒ MongoLoglessDynamicRaft. This requires showing that, for any behavior 𝜎 of MongoRaftReconfig, the initial state of 𝜎 is a valid initial state of MongoLoglessDynamicRaft and every transition in 𝜎 is a valid transition of MongoLoglessDynamicRaft. For sake of brevity below, we use MRR and MLDR, respectively, as abbreviations for MongoRaftReconfig and MongoLoglessDynamicRaft. Formally, we must show

MRR!Init ⇒ MLDR!Init    (5)
[MRR!Next]_vars_MRR ⇒ [MLDR!Next]_vars_MLDR    (6)

where vars_MRR and vars_MLDR are, respectively, the variables of MongoRaftReconfig and MongoLoglessDynamicRaft, as summarized in Figure 2. For a specification S, the expressions S!Init and S!Next refer, respectively, to the initial state predicate and next state relation of S. Recall that [N]_vars = N ∨ UNCHANGED vars, i.e. it is an action that allows for stuttering steps. We define vars_MSR = ⟨log, committed⟩ as the set of variables that are private to MongoStaticRaft, meaning they are not contained in vars_MLDR.

In order to prove Formula 5, it is sufficient to show that all initial states allowed by MRR!Init satisfy MLDR!Init. As discussed in Section 4.3.1, the valid initial states of both MRR and MLDR are the same, i.e. currentTerm[s] = 0, state[s] = Secondary, configVersion[s] = 1, configTerm[s] = 0, and config[s] = m for all servers s and some member set m. In order to prove Formula 6, we must show that each transition allowed by MRR!Next is a valid MLDR!Next transition. For a TLA+ expression N = A1 ∨ ... ∨ An, for actions Ai, we call each Ai a subaction of N. Note that for any subaction Ai of N, it holds that Ai ⇒ N. So, it is sufficient to show that, for each subaction A of MRR!Next, the following holds:

A ⇒ MLDR!Next ∨ UNCHANGED vars_MLDR    (7)

We first consider actions which modify only the private variables vars_MSR. Each such action, A, is shown below as A_⟨vars⟩, where vars is the list of variables modified by that action:

• ClientRequest(s)_⟨log⟩
• GetEntries(s, t)_⟨log⟩
• RollbackEntries(s, t)_⟨log⟩
• CommitEntry(s, Q)_⟨committed⟩

The actions above modify only the log or committed variable, so they trivially satisfy UNCHANGED vars_MLDR, i.e. they are stuttering steps of MLDR, which is sufficient to satisfy the implication of Formula 7. Next we examine the remaining actions:

• SendConfig(s, t)
• OplogCommitment(s) ∧ Reconfig(s)
• BecomeLeader(s, Q) ∧ BecomeLeaderDynamic(s, Q)
• UpdateTerms(s, t) ∧ UpdateTermsDynamic(s, t)

Each of these actions is of the form P ∧ A_MLDR, where A_MLDR is a subaction of MLDR!Next. We know that for any such subaction, A_MLDR ⇒ MLDR!Next. Since P ∧ A_MLDR ⇒ A_MLDR, it holds that P ∧ A_MLDR ⇒ MLDR!Next, which implies that Formula 7 holds for each action. Thus, we have proved Formula 6, which was our goal. Establishing that
MRR ⇒ MLDR allows us to soundly assume that the safety properties of MongoLoglessDynamicRaft hold when it operates as a subprotocol of MongoRaftReconfig. That is, for any safety property P such that MLDR ⇒ P, it holds, by transitivity of implication, that MRR ⇒ P. This allows us to assume that the safety properties of MongoLoglessDynamicRaft verified in Section 5.2.3 hold in MongoRaftReconfig. We do not formalize the interface between MongoRaftReconfig and MongoLoglessDynamicRaft here, but our verification efforts provide confidence in the safety of MongoLoglessDynamicRaft, the most important and novel subcomponent of MongoRaftReconfig.

MongoRaftReconfig can be viewed as a generalization and optimization of the standard Raft dynamic reconfiguration protocol. To show how our protocol relates to and extends standard Raft, we discuss the two primary aspects of the protocol that set it apart from Raft: (1) decoupling of the oplog and config log and (2) logless optimization of the config log.
In standard Raft, the main operation log is used for both normal operations and reconfiguration operations. This coupling between logs has the benefit of providing a single, unified data structure to manage system state, but it also imposes fundamental restrictions on the operation of the two logs. Most importantly, in order for a write to commit in one log, it must commit all previous writes in the other. For example, if a reconfiguration log entry Cj has been written at log index j on primary s, and there is a sequence of uncommitted log entries U = ⟨i, i+1, ..., j−1⟩ in the log of s, then in order for a reconfiguration from Cj to Ck to occur, all entries of U must become committed. This behavior, however, is stronger than necessary for safety, i.e. it is not strictly necessary to commit these log entries before executing a reconfiguration. The only fundamental requirements are that previously committed log entries are committed by the rules of the current configuration, and that the current configuration has become committed, i.e. it has propagated to a quorum of servers. Raft achieves this goal implicitly, but more conservatively than necessary, by committing the entry Cj and all entries behind it. This ensures that all previously committed log entries, in addition to the uncommitted operations U, are now committed in Cj, but it is not strictly necessary to pipeline a reconfiguration behind commitment of U. MongoRaftReconfig avoids this by separating the two logs and their commitment rules, allowing reconfigurations to bypass the oplog if necessary. We examine this benefit experimentally in Section 6.2.
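The contrast between the two commitment rules can be sketched with a toy quorum model. The helper names and the exact rules below are illustrative simplifications, not the production protocol:

```python
# Coupled (Raft-style) vs. decoupled reconfiguration commitment.

def quorum_acked(acks, members):
    return len(acks & members) * 2 > len(members)

# Raft-style: a config entry lives in the oplog, so committing it also
# requires every earlier (possibly unrelated) entry to reach a quorum.
def raft_reconfig_can_commit(oplog_acks, config_index, members):
    return all(quorum_acked(oplog_acks[i], members)
               for i in range(config_index + 1))

# Decoupled: the configuration commits once it alone reaches a quorum of
# the current members; the oplog-related precondition (P2) concerns only
# entries that are already committed, not the uncommitted prefix U.
def decoupled_reconfig_can_commit(config_acks, members):
    return quorum_acked(config_acks, members)

members = {"n1", "n2", "n3"}
# oplog entries 0 and 1 are stuck on the primary; the config (at index 2
# in the Raft-style model) has reached servers n1 and n2:
oplog_acks = [{"n1"}, {"n1"}, {"n1", "n2"}]

assert not raft_reconfig_can_commit(oplog_acks, 2, members)  # blocked by U
assert decoupled_reconfig_can_commit({"n1", "n2"}, members)  # proceeds
```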
Decoupling the config log from the main operation log allows for an optimization that is enabled by the fact that reconfigurations are “update-only” operations on the replicated state machine. This means that it is sufficient to store only the latest version of the replicated state, since the latest version can be viewed as a “rolled-up” version of the entire (infinite) log. This logless optimization, which is implemented in MongoLoglessDynamicRaft, allows the configuration state machine to avoid complexities related to garbage collection of old log entries, and it simplifies the mechanism for state propagation between servers. Normally, log entries are replicated incrementally, either one at a time or in batches from one server to another. Additionally, servers may need an explicit procedure for deleting (i.e. rolling back) log entries that will never become committed. In the logless replicated state machine, all of these mechanisms can be combined into a single conceptual action, which we refer to as MergeEntries. This action conceptually subsumes the GetEntries and RollbackEntries actions of the MongoStaticRaft specification described in Section 4.2, which is a log-based protocol. We do not formally define MergeEntries here, but it can be viewed as an action where one server s atomically transfers its entire log to another server t, if the log of s is newer, based on the index and term of its last entry. In MongoLoglessDynamicRaft, the SendConfig action implements this behavior to transfer configuration state between servers.
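Under the assumption that a configuration is identified by a (version, term) pair ordered first by term and then by version within a term, the merge rule can be sketched as follows. This is an illustrative model; the actual action is defined in the TLA+ specification [39]:

```python
# Logless configuration propagation: a server atomically replaces its whole
# configuration with a newer one, rather than shipping and truncating log
# entries. This single rule subsumes both GetEntries and RollbackEntries.

def config_is_newer(a, b):
    """True if config a = (version, term) is newer than b: compare terms
    first, then versions within the same term."""
    (va, ta), (vb, tb) = a, b
    return (ta, va) > (tb, vb)

def send_config(configs, members, s, t):
    """SendConfig(s, t): t adopts s's configuration if it is newer."""
    if config_is_newer(configs[s], configs[t]):
        configs[t] = configs[s]
        members[t] = members[s]

configs = {"n1": (3, 2), "n2": (5, 1)}                    # (version, term)
members = {"n1": {"n1", "n2", "n3"}, "n2": {"n1", "n2"}}
send_config(configs, members, "n1", "n2")
# n2 adopts (3, 2): the config written in term 2 dominates the higher
# version written in the older term 1, implicitly "rolling back" the latter
```

Note how the replacement of n2's configuration discards a config version that would never become committed, which is exactly the role RollbackEntries plays in a log-based protocol.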
In a healthy replica set, it is possible that a failure event causes some subset of replica set servers to degrade in performance, causing the main oplog replication channel to become lagged or stall entirely. If this occurs on a majority of nodes, then the replica set will be prevented from committing new writes until the performance degradation is resolved. For example, consider a 3 node replica set consisting of nodes {n1, n2, n3}, where nodes n2 and n3 become degraded and stop replicating new oplog entries. To restore write availability, the degraded nodes must be replaced with healthy ones, e.g. n4, so that the system can return to a healthy operational state. This requires a series of two reconfigurations, one to add a healthy node and one to remove a degraded node. In standard Raft, this would require the ability to commit at least one reconfiguration oplog entry with one of the degraded nodes (n2 or n3). In MongoRaftReconfig, however, reconfigurations bypass the oplog replication channel, committing without the need to commit writes in the oplog. This allows the protocol to successfully reconfigure the system in such a degraded state, restoring oplog write availability by removing the failed nodes and adding in new, healthy nodes.
To demonstrate the benefits of MongoRaftReconfig in this type of scenario, we designed an experiment to measure how quickly a replica set can reconfigure in new nodes to restore majority write availability when it faces periodic phases of degradation. For comparison, we implemented a simulated version of the Raft reconfiguration algorithm in MongoDB by having reconfigurations write a no-op oplog entry and requiring it to become committed before the reconfiguration can complete. Our experiment initiates a 5 node replica set with servers we refer to as {n1, n2, n3, n4, n5}. We run the server processes co-located on a single Amazon EC2 t2.xlarge instance with 4 vCPU cores, 16GB memory, and a 100GB EBS disk volume, running Ubuntu 20.04. Co-location of the server processes is acceptable since the workload of the experiment does not saturate any resource (e.g. CPU, disk) of the machine. The servers run MongoDB version v4.4-39f10d with a patch to fix a minor bug [1] that prevents optimal configuration propagation speed in some cases. (The source code used for our experiments is available upon request.) Initially, {n1, n2, n3} are voting servers and {n4, n5} are non-voting. In a MongoDB replica set, a server can be assigned either 0 or 1 votes. A non-voting server has zero votes and does not contribute to a commit majority, i.e. it is not considered a member of the consensus group. Our experiment has a single writer thread that continuously inserts small documents into a collection with write concern majority, with a write concern timeout of 100 milliseconds. There is a concurrent fault injector thread that periodically simulates a degradation of performance on two secondary
nodes by temporarily pausing oplog replication on those nodes. This thread alternates between steady periods and degraded periods of time, starting out in steady mode, where all nodes are operating normally. It runs for 5 seconds in steady mode, then transitions to degraded mode for 2.5 seconds, before transitioning back to steady mode and repeating this cycle. When the fault injector enters degraded mode, the main test thread simulates a “fault detection” scenario (assuming some external module detected the performance degradation) by sleeping for 500 milliseconds, and then starting a series of reconfigurations to add two new, healthy secondaries and remove the two degraded secondaries. Over the course of the experiment, which has a 1 minute duration, we measure the latency of each operation executed by the writer thread. These latencies are depicted in the graphs of Figure 4. Red points indicate writes that failed to commit, i.e. that timed out at 100 milliseconds. The successful completion of reconfigurations is depicted with vertical blue bars.

Fig. 4. Latency of majority writes in the face of node degradation and reconfiguration to recover. Red points indicate writes that timed out, i.e. failed to commit. Orange horizontal bars indicate intervals of time where the system entered a degraded mode. Thin, vertical blue bars indicate successful completion of reconfiguration events. (The two panels plot latency (ms) against time (s), for Raft reconfiguration and for logless reconfiguration respectively.)
It can be seen how, when a period of degradation begins, the logless reconfiguration protocol is able to complete a series of reconfigurations quickly, returning the system to a healthy state where writes are able to commit again and latencies drop back to their normal levels. In the case of Raft reconfiguration, writes continue failing until the period of degradation ends, since the reconfigurations to add in new healthy nodes cannot complete.
Dynamic reconfiguration in consensus based systems has been explored from a variety of perspectives for Paxos based systems. In Lamport's presentation of Paxos [17], he suggests using a fixed parameter 𝛼 such that the configuration for a consensus instance i is governed by the configuration at instance i − 𝛼. This restricts the number of commands that can be executed until the new configuration becomes committed, since the system cannot execute instance i until it knows what configuration to use, potentially causing availability issues if reconfigurations are slow to commit. Stoppable Paxos [25] was an alternative method later proposed where a Paxos system can be reconfigured by stopping the current state machine and starting up a new instance of the state machine with a potentially different configuration. This “stop-the-world” approach can hurt availability of the system while a reconfiguration is being processed. Vertical Paxos allows a Paxos state machine to be reconfigured in the middle of reaching agreement, but it assumes the existence of an external configuration master [20]. In [10], the authors describe the Paxos implementation underlying Google's Chubby lock service, but do not include details of their approach to dynamic reconfiguration, stating that “While group membership with the core Paxos algorithm is straightforward, the exact details are non-trivial when we introduce Multi-Paxos...”. They remark that the details, though minor, are “...subtle and beyond the scope of this paper”. The Raft consensus protocol, published in 2014 by Ongaro and Ousterhout [31], presented two methods for dynamic membership changes: single server membership change and joint consensus. A correctness proof of the core Raft protocol, excluding dynamic reconfiguration, was included in Ongaro's PhD dissertation [28].
Formal verification of Raft's linearizability guarantees was later completed in Verdi [45], a framework for verifying distributed systems in the Coq proof assistant [7], but formalization of dynamic reconfiguration was not included. In 2015, after Raft's initial publication, a safety bug in the single server reconfiguration approach was found by Amos and Zhang [6], at the time PhD students working on a project to formalize parts of Raft's original reconfiguration algorithm. A fix was proposed shortly after by Ongaro [29], but the project was never extended to include the fixed version of the protocol. The Zab replication protocol, implemented in Apache Zookeeper [42], also includes a dynamic reconfiguration approach for primary-backup clusters that is similar in nature to Raft's joint consensus approach. The concept of decoupling reconfiguration from the main data replication channel has previously appeared in other replication systems not based on Raft. RAMBO [13], an algorithm for implementing a distributed shared memory service, implements a dynamic reconfiguration module that is loosely coupled with the main read-write functionality. Additionally, Matchmaker Paxos [44] is a more recent approach for reconfiguration in Paxos based protocols that adds dedicated nodes for managing reconfigurations, which decouples reconfiguration from the main processing path, preventing performance degradation during configuration changes. There has also been prior work on reconfiguration using weaker models than consensus [15], and approaches to logless implementations of Paxos based replicated state machine protocols [34], which bear conceptual similarities to our logless protocol for managing configuration state. Our formal specification and verification efforts follow prior lines of work on formally verifying distributed protocols, e.g. Paxos and its variants [9, 19], the Chord protocol [23], the Pastry distributed hash table [22], and others [8, 27]. Distributed protocols are subtle and challenging to design correctly, so they benefit greatly from precise, machine checkable descriptions. More recent progress has also been made on tools to help automate the verification and proof of protocols like these even further, e.g. Ivy [32] and I4 [24].

REFERENCES

[1] 2020. MongoDB JIRA SERVER-46907. https://jira.mongodb.org/browse/SERVER-46907
[2] 2020.
TLC.tla Module. https://github.com/tlaplus/tlaplus/blob/master/tlatools/org.lamport.tlatools/src/tla2sany/StandardModules/TLC.tla
[3] 2021. MongoDB Github Project. https://github.com/mongodb/mongo
[4] Martín Abadi and Leslie Lamport. 1991. The existence of refinement mappings. Theoretical Computer Science (1991). https://doi.org/10.1016/0304-3975(91)90224-P
[5] Marcos Aguilera, Idit Keidar, Dahlia Malkhi, Jean-Philippe Martin, and Alexander Shraer. 2010. Reconfiguring Replicated Atomic Storage: A Tutorial. Bulletin of the European Association for Theoretical Computer Science EATCS (2010).
[6] Brandon Amos and Huanchen Zhang. 2015. Specifying and proving cluster membership for the Raft distributed consensus algorithm.
[7] Yves Bertot and Pierre Castéran. 2013. Interactive theorem proving and program development: Coq'Art: the calculus of inductive constructions. Springer Science & Business Media.
[8] Sean Braithwaite, Ethan Buchman, Igor Konnov, Zarko Milosevic, Ilina Stoilkovska, Josef Widder, and Anca Zamfir. 2020. Formal Specification and Model Checking of the Tendermint Blockchain Synchronization Protocol (Short Paper). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
[9] Saksham Chand, Yanhong A. Liu, and Scott D. Stoller. 2016. Formal verification of multi-Paxos for distributed consensus. In International Symposium on Formal Methods. Springer, 119–136.
[10] Tushar D. Chandra, Robert Griesemer, and Joshua Redstone. 2007. Paxos Made Live: An Engineering Perspective. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing (PODC '07). Association for Computing Machinery, New York, NY, USA, 398–407. https://doi.org/10.1145/1281100.1281103
[11] E. M. Clarke, E. A. Emerson, S. Jha, and A. P. Sistla. 1998. Symmetry reductions in model checking. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). https://doi.org/10.1007/bfb0028741
[12] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google's globally-distributed database. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2012. https://doi.org/10.1145/2518037.2491245
[13] Seth Gilbert, Nancy A. Lynch, and Alexander A. Shvartsman. 2010. Rambo: A robust, reconfigurable atomic memory service for dynamic networks. Distributed Computing (2010). https://doi.org/10.1007/s00446-010-0117-1
[14] Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan Pei, and Xin Tang. 2020. TiDB: a Raft-based HTAP database. Proceedings of the VLDB Endowment (2020). https://doi.org/10.14778/3415478.3415535
[15] Leander Jehl and Hein Meling. 2014. Asynchronous reconfiguration for Paxos state machines. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). https://doi.org/10.1007/978-3-642-45249-9_8
[16] Leslie Lamport. 1998. The Part-Time Parliament. ACM Transactions on Computer Systems (1998). https://doi.org/10.1145/279227.279229
[17] Leslie Lamport. 2001. Paxos Made Simple. ACM SIGACT News (2001). https://doi.org/10.1145/568425.568433
[18] Leslie Lamport. 2002. Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley.
[19] Leslie Lamport. 2011. Byzantizing Paxos by refinement. In International Symposium on Distributed Computing. Springer, 211–224.
[20] Leslie Lamport, Dahlia Malkhi, and Lidong Zhou. 2009. Vertical Paxos and Primary-Backup Replication. In Proceedings of the 28th ACM Symposium on Principles of Distributed Computing (PODC '09). Association for Computing Machinery, New York, NY, USA, 312–313. https://doi.org/10.1145/1582716.1582783
[21] Leslie Lamport and Stephan Merz. 2017. Auxiliary variables in TLA+. arXiv:1703.05121
[22] Tianxiang Lu, Stephan Merz, and Christoph Weidenbach. 2011. Towards Verification of the Pastry Protocol Using TLA+. In Formal Techniques for Distributed Systems, Roberto Bruni and Juergen Dingel (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 244–258.
[23] Jørgen Aarmo Lund. 2019. Verification of the Chord protocol in TLA+. Master's thesis. UiT Norges arktiske universitet.
[24] Haojun Ma, Aman Goel, Jean Baptiste Jeannin, Manos Kapritsos, Baris Kasikci, and Karem A. Sakallah. 2019. I4: Incremental inference of inductive invariants for verification of distributed protocols. In SOSP 2019 - Proceedings of the 27th ACM Symposium on Operating Systems Principles. https://doi.org/10.1145/3341301.3359651
[25] Dahlia Malkhi, Leslie Lamport, and Lidong Zhou. 2008. Stoppable Paxos.
[26] Stephan Merz. 2008. The Specification Language TLA+. Springer Berlin Heidelberg, Berlin, Heidelberg, 401–451. https://doi.org/10.1007/978-3-540-74107-7_8
[27] Chris Newcombe, Tim Rath, Fan Zhang, Bogdan Munteanu, Marc Brooker, and Michael Deardeuff. 2014. Use of formal methods at Amazon Web Services. http://research.microsoft.com/en-us/um/people/lamport/tla/formal-methods-amazon.pdf (2014).
[28] Diego Ongaro. 2014. Consensus: Bridging Theory and Practice. Doctoral thesis (2014).
[29] Diego Ongaro. 2015. Bug in single-server membership changes. https://groups.google.com/g/raft-dev/c/t4xj6dJTP6E/m/d2D9LrWRza8J
[30] Diego Ongaro. 2021. The Raft Consensus Algorithm. https://raft.github.io/
[31] Diego Ongaro and John Ousterhout. 2014. In Search of an Understandable Consensus Algorithm. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference (USENIX ATC '14). USENIX Association, USA, 305–320.
[32] Oded Padon, Kenneth L. McMillan, Aurojit Panda, Mooly Sagiv, and Sharon Shoham. 2016. Ivy: safety verification by interactive generalization. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation. 614–630.
[33] Amir Pnueli. 1977. The Temporal Logic of Programs. (1977).
[34] Denis Rystsov. 2018. CASPaxos: Replicated State Machines without logs. arXiv:1802.07000
[35] Fred B. Schneider. 1990. Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Computing Surveys (1990). https://doi.org/10.1145/98163.98167
[36] William Schultz. 2021. MCMongoLoglessDynamicRaftAux TLC Model. https://github.com/will62794/logless-reconfig/blob/63edb2573cb9fd38f681890283235aaf6cd320e7/specs/models/MCMongoLoglessDynamicRaftAux-4Servers-T4-CV4.cfg
[37] William Schultz. 2021. MCMongoRaftReconfig TLC Model. https://github.com/will62794/logless-reconfig/blob/63edb2573cb9fd38f681890283235aaf6cd320e7/specs/models/MCMongoRaftReconfig-4Servers-L2-T2-CV3.cfg
[38] William Schultz. 2021. MongoLoglessDynamicRaftAux TLA+ Specification. https://github.com/will62794/logless-reconfig/blob/63edb2573cb9fd38f681890283235aaf6cd320e7/specs/MongoLoglessDynamicRaftAux.tla
[39] William Schultz. 2021. MongoRaftReconfig TLA+ Specification. https://github.com/will62794/logless-reconfig/blob/63edb2573cb9fd38f681890283235aaf6cd320e7/specs/MongoRaftReconfig.tla
[40] William Schultz. 2021. MongoStaticRaft TLA+ Specification. https://github.com/will62794/logless-reconfig/blob/63edb2573cb9fd38f681890283235aaf6cd320e7/specs/MongoStaticRaft.tla
[41] William Schultz, Tess Avitabile, and Alyson Cabral. 2018. Tunable consistency in MongoDB. In Proceedings of the VLDB Endowment. https://doi.org/10.14778/3352063.3352125
[42] Alexander Shraer, Benjamin Reed, Dahlia Malkhi, and Flavio Junqueira. 2012. Dynamic reconfiguration of primary/backup clusters. In Proceedings of the 2012 USENIX Annual Technical Conference, USENIX ATC 2012.
[43] Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, and Peter Mattis. 2020. CockroachDB: The Resilient Geo-Distributed SQL Database. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20). Association for Computing Machinery, New York, NY, USA, 1493–1509. https://doi.org/10.1145/3318464.3386134
[44] Michael Whittaker, Neil Giridharan, Adriana Szekeres, Joseph M. Hellerstein, Heidi Howard, Faisal Nawab, and Ion Stoica. 2020. Matchmaker Paxos: A Reconfigurable Consensus Protocol [Technical Report]. arXiv:2007.09468 [cs.DC]
[45] Doug Woos, James R. Wilcox, Steve Anton, Zachary Tatlock, Michael D. Ernst, and Thomas Anderson. 2016. Planning for change in a formal verification of the Raft consensus protocol. In CPP 2016 - Proceedings of the 5th ACM SIGPLAN Conference on Certified Programs and Proofs, co-located with POPL 2016. https://doi.org/10.1145/2854065.2854081
[46] Yuan Yu, Panagiotis Manolios, and Leslie Lamport. 1999. Model checking TLA+ specifications. In Advanced Research Working Conference on Correct Hardware Design and Verification Methods. Springer, 54–66.
[47] Jianjun Zheng, Qian Lin, Jiatao Xu, Cheng Wei, Chuwei Zeng, Pingan Yang, and Yunfan Zhang. 2017. PaxosStore: High-availability storage made practical in WeChat. In
Proceedings of the VLDB Endowment . https://doi.org/10.14778/3137765.3137778[48] Siyuan Zhou and Shuai Mu. 2021. Fault-Tolerant Replication with Pull-Based Consensus in MongoDB. In18th {USENIX} Symposium on NetworkedSystems Design and Implementation ({NSDI} 21)