On the Practicality of `Practical' Byzantine Fault Tolerance
Nikos Chondros
University of Athens [email protected]
Konstantinos Kokordelis
University of Athens [email protected]
Mema Roussopoulos
University of Athens [email protected]
Abstract
Byzantine Fault Tolerant (BFT) systems are considered by the systems research community to be state of the art with regards to providing reliability in distributed systems. BFT systems provide safety and liveness guarantees with reasonable assumptions, amongst a set of nodes where at most f nodes display arbitrarily incorrect behaviors, known as Byzantine faults. Despite this, BFT systems are still rarely used in practice. In this paper we describe our experience, from an application developer's perspective, trying to leverage the publicly available and highly-tuned "PBFT" middleware (by Castro and Liskov) to provide provable reliability guarantees for an electronic voting application with high security and robustness needs. The PBFT middleware has been the focus of most BFT research efforts over the past twelve years; all direct descendent systems depend on its initial code base.

We describe several obstacles we encountered and drawbacks we identified in the PBFT approach. These include some that we tackled, such as lack of support for dynamic client management and leaving state management completely up to the application. Others still remaining include the lack of robust handling of non-determinism, lack of support for web-based applications, lack of support for stronger cryptographic primitives, and others. We find that, while many of the obstacles could be overcome with a revised BFT middleware implementation that is tuned specifically for the needs of the particular application, they require significant engineering effort and time and their performance implications for the end-application are unclear. An application developer is thus unlikely to be willing to invest the time and effort to do so to leverage the BFT approach.
We conclude that the research community needs to focus on the usability of BFT algorithms for real world applications, from the end-developer perspective, in addition to continuing to improve the BFT middleware performance, robustness and deployment layouts.
1. Introduction
Byzantine Fault Tolerant (BFT) systems are considered by the systems research community to be state of the art with regards to providing reliability in distributed systems. A BFT system implements a replicated state machine [28] typically consisting of n = 3f + 1 replica servers that each provide a finite state machine and execute operations from clients in the same order. BFT systems assume a pessimistic failure model, based on the classic Byzantine generals' problem [23] which provides agreement amongst a set of nodes where at most f nodes display arbitrarily incorrect behaviors, known as Byzantine faults. BFT systems are attractive because they provide guaranteed safety and liveness properties when the assumption of up to f faulty nodes holds. Early work on BFT systems was widely considered to be impractical for use by real systems because the systems were either too slow to be used in practice or assumed synchronous environments that rely on known message delay bounds. However, the seminal work of Castro and Liskov [8], published in 1999, changed this view. This work proposed and implemented Practical Byzantine Fault Tolerance, achieving impressive peak throughput of several tens of thousands of null operations per second, previously thought unattainable. As has been noted by others [11], over the last twelve years, the research community has seen a flurry of excitement with several efforts to improve the performance and/or cost of BFT replication systems. These efforts include studies aimed at increasing throughput or reducing latency of client requests [4, 11, 12, 14, 16, 20, 21, 31–33], efforts to reduce the number of replica servers needed to withstand f faults to achieve lower replication cost [15, 32, 33], and efforts to boost the robustness of the protocol under both faulty servers and faulty clients [6, 11].
A majority of these systems [6, 11, 16, 20, 21, 32, 33] are direct descendants of the Castro and Liskov system, herein referred to as the PBFT approach (for Practical Byzantine Fault Tolerance). Both the implementations and evaluations of these systems depend on the initial PBFT code base.

Despite PBFT's attractive correctness guarantees, BFT systems are still rarely used in practice. This is unfortunate, given the ever-increasing need for reliability in real-world distributed systems. More and more applications require utmost security and reliability to be both trustworthy to users and successful in use (e.g., electronic voting and digital preservation). The lack of wide deployment of state-of-the-art BFT technologies is also puzzling. The open-source PBFT code initially provided by Castro and later modified by others has been publicly available, improved, and fine-tuned for several years, and while readily sized up by the academic community for research purposes, it has not been used in practice in real-world systems.

In this paper, we examine, from the perspective of an application developer, the practicality, i.e., feasibility, of using the PBFT protocol and accompanying implementation to provide provable reliability guarantees for a real-world application. Our motivating application is a state-of-the-art electronic voting system, offered as a public Internet service. The current version is centralized [19]. Given the critical nature of the application, our aim is to build a system that has no centralized component. Every aspect of the system's design should be distributed to avoid single points of attack and failure. Our aim is to leverage the correctness guarantees provided by PBFT systems to improve the security and reliability properties of the system.
In such a system, clients (on behalf of users/voters) connect to the voting service, view the election procedures in which they have a right to participate, send the user's vote, and potentially reconnect at a later point to view the progress and/or results of the election. Our aim has been to gauge, from the perspective of a developer in need of providing reliability beyond simple crash-fault recovery, how easily the PBFT approach and accompanying system could be molded to fit the application developer's needs.

We have focused on the original PBFT implementation for several reasons. First, over the past twelve years, the majority of research efforts on improving BFT systems have relied on the PBFT approach and implementation. This is the most stable code base that is publicly available and has been fine-tuned and improved over several years and by several developers. Second, even as the debate over improving BFT systems continues, the interface to application developers provided by the PBFT middleware remains the same. This means that any later developments in the PBFT system suite can be easily leveraged by applications. Since later systems are not as fine-tuned as the original PBFT code base, we have chosen the original for more stability. Third, our particular electronic voting application is written in C; the PBFT code base is written in C++. A recent effort called UpRight [10], aimed at easing the application developer's effort to make use of BFT technology, is written in Java, still has several key features missing (e.g., view changes are unimplemented), and seems to be a work-in-progress that has not seen much development in the last year and a half. Thus, for a developer wanting to leverage the attractive reliability guarantees of BFT now, the original PBFT system offers the most promise. We describe our experience trying to leverage the PBFT approach and code base to enhance the reliability of our e-voting application.
We describe several obstacles we encountered and drawbacks we identified in the PBFT approach. One key drawback we identified is that PBFT-based systems assume static membership, i.e., clients and replica servers know each other a priori, before system initialization. Most Internet services require support for dynamic client management, particularly when the number of envisioned clients is large. The PBFT literature (original as well as all subsequent descendants of PBFT) does not address this issue. Another key drawback is that PBFT leaves state management completely to the application developer, who is required to manually manage a raw memory region, while also issuing notifications to the library before changing memory contents. This may be fine when developing system services, but is not a very convenient base for an application. Additionally, PBFT treats a replica server's memory as stable storage, by assuming the use of uninterruptible power supplies [8]. Many Internet application services, particularly an electronic voting system, cannot afford to rely on this assumption and instead require traditional ACID semantics to ensure data stored is consistent and persists despite crashes and faults. The PBFT system suite leaves state management to the application developer. This means that an application developer wishing to make use of an available legacy database to provide the required ACID semantics is faced with the decision of implementing these semantics from scratch in the application or retrofitting the BFT middleware to interface with and support the legacy database.

In addition to the above, we describe a number of other drawbacks including: the mechanism used by PBFT to handle nondeterminism in applications, the lack of support for stronger cryptography, the lack of support for web-based applications, and others.
The description of our experience may seem pedantic, with many minute low-level details, but we provide these here to give the reader a clear understanding, from a holistic systems perspective, of the obstacles faced by a developer trying to put the PBFT system to real, practical use. These are details that are often considered "not important enough" to warrant attention and space in many research papers (and prototype implementations, for that matter), usually due to time and space constraints. Nonetheless they can trip up a third-party developer hoping to make use of the novel research prototype. In practice, it is the details that make or break the widespread deployment and use of a system.

We find that while many of the obstacles we describe could be overcome with a "better" or "revised" BFT middleware implementation that is tuned specifically for the needs of the particular application, they require significant engineering effort and time. Even less encouraging is the fact that the performance implications of the changes required to meet the application's needs are unclear. For example, we describe how we overcome the first two drawbacks above. While adding support for dynamic client management does not significantly affect system performance, measured in null operations per second, retrofitting the PBFT middleware to support a legacy database reveals a throughput performance of real operations that is two orders of magnitude smaller than the null ones advertised by prior BFT studies. To date, only two publications on BFT that we are aware of have noted that reporting null operations per second as throughput is not representative of real applications and thus not helpful to the end-developer [29, 32]. This is understandable, as the focus of most BFT research efforts has not been on end-application use but on improving the BFT middleware itself.
Nonetheless, a developer faced with having to make a slew of modifications to the BFT middleware to get an end-system that has unknown performance properties is hesitant to invest the effort to do so.

This paper makes the following contributions:

• We identify a number of drawbacks in the PBFT protocol suite, from the perspective of an end-application developer trying to leverage PBFT reliability guarantees, and we describe a number of potential solutions to address these. The sheer number of drawbacks severely affects the ease with which a developer can leverage the PBFT approach.

• We present changes we made to the PBFT protocol and implementation to enable dynamic client management, a must for many Internet service applications in use today. We show that these changes can be made with minimal additions to the PBFT protocol, thus not affecting its provable reliability guarantees. We demonstrate, via empirical experiments, that support for dynamic client management can be achieved with minimal performance impact.

• We evaluate the performance impact of retrofitting the PBFT middleware to support ACID semantics via a widely-used legacy database, to ease the state management burden of many applications requiring these semantics. We evaluate the impact on performance of this change, and show that for non-null operations, the throughput can be many times smaller than the tens of thousands of null operations per second presented in prior PBFT-based studies.
2. Background
The Castro-Liskov algorithm for Practical Byzantine Fault Tolerance [8] (abbreviated as PBFT) is a replication algorithm that can tolerate arbitrary faults. It is based on State Machine Replication [22, 28], where transitions are applied to an instance of the application's state and result in a new, deterministic instance of the state. The general idea is that a group of replicas form a static group that provides a service. At each instance in time, one of them is the primary and is responsible for sequencing the requests, providing total order. This in turn guarantees linearizability [18], which is a correctness condition for concurrent objects where a concurrent computation is equivalent to a legal sequential computation. A view is the epoch where the primary is stable. The remaining replicas monitor client requests and the primary's behavior and, if the latter is found misbehaving, begin a view change procedure and elect a new primary.

The algorithm is asynchronous and provides liveness and safety guarantees when less than a third of the replicas are faulty. More specifically, to tolerate f Byzantine faults, the group needs at least 3f + 1 members. Safety, formally proved by using the I/O Automaton model [25], guarantees that replies will be correct according to linearizability. Liveness assures that clients will eventually receive replies to their requests. The algorithm does not rely on synchrony to provide safety but does rely on a weak synchrony assumption to provide liveness: that delay(t) does not grow faster than t indefinitely. Here, delay(t) represents the time interval between initial message transmission (t) and message delivery to the replica process. For the protocol to be live, the client is expected to keep retransmitting its request until it finally obtains the reply. Further assumptions include independence of node failures and the inability of an attacker to subvert cryptographic protocols.

In normal operation, the client sends a request to the primary.
The primary assigns a monotonically increasing sequence number to the request and begins a 3-phase agreement protocol with the other replicas, at the end of which each node executes the request and directly transmits the reply to the client. The latter will accept the reply as correct only when f + 1 replies match. The 3-phase protocol consists of the exchange of the following messages, where the target of a multicast is the set of replicas:

1. Pre-prepare, multicast from the primary, which assigns a sequence number to a request and forwards its contents

2. Prepare, multicast by each replica, agreeing to the sequence number assignment

3. Commit, multicast by each replica, which helps guarantee total ordering across views

After the commit, each replica will execute the request and transmit the reply directly to the client. In all the above message exchanges, the sender is expected to sign the contents with his private key. This operation is depicted in Figure 1.

Figure 1. Normal PBFT operation

Certain optimizations were applied by Castro and Liskov to this basic mode of operation in order to improve the latency and throughput of the system. First of all, the use of asymmetric cryptography was reduced, by introducing Message Authentication Codes. The client assigns a different key to each replica and sends the key to it, signed with the node's public key. From then on, all requests are accompanied by an 'authenticator', which is a structure that contains one MAC for each replica. This considerably boosted performance, as we confirm in Section 4. Another optimization is the tentative execution of requests before the commit phase. The client cooperates in this mode of operation as it expects 2f + 1 tentative replies (marked as such by each replica) instead of the normal f + 1. If such a quorum is not assembled, the client simply retransmits the request message. As the replicas will in turn retransmit the last reply for this client (which by now should be marked as stable, since the Commit phase should be over), a smaller quorum of f + 1 stable (non-tentative) replies may be enough.

Yet another optimization is the special treatment of read-only and big requests. A request is considered big if its size exceeds a configurable threshold, while the read-only status is explicitly set by the client. These differentiated requests are multicast from the client to all replicas, to relieve the primary of this burden. This mechanism is utilized by default to the maximum extent, by defining the threshold as 0, resulting in all requests being treated as big.
Read-only requests are treated specially and are executed as soon as they are received, sequencing permitting, of course. Finally, request batching is employed to minimize network usage and agreement latency. A congestion window is defined as the number of requests that have been received but not yet executed by the primary; its size is an adjustable parameter of the system. When the primary receives a request message, it calculates the difference between the last locally executed sequence number and the sequence number assigned to the new request. If this difference exceeds the defined congestion window, it postpones issuing the pre-prepare message, giving itself time to catch up on request execution. Once it does, it includes in a single pre-prepare message as many outstanding request messages as possible, thus minimizing latency due to individual agreement. Note that batched requests capture parallelism from different clients, as each client is allowed only a single outstanding request.

An implementation of the protocol was developed by the author, Miguel Castro, and published as open source along with his dissertation. The environment chosen was:

• Linux as the platform (but mostly POSIX compliant)

• C++ as the base language

• UDP as the network protocol

• An implementation of the Rabin cryptosystem for asymmetric cryptography

• An implementation of UMAC32 for MAC operations

• An implementation of MD5 for digests

This implementation defines application "state" as a single continuous virtual memory region. In fact, it splits this region in two, the first part for the internal library needs and the remainder for the application. The library has a subsystem that manages the synchronization and checkpointing of this state using copy-on-write techniques and Merkle (hash) trees [3]. The general idea is that the state is divided in pages of equal length.
A hash tree is formed where the leaves are the actual data pages while the inner nodes are the hashes of their children (either of the data pages at level height-1, or of the hash text at smaller depths). At the root, a single digest uniquely identifies the complete memory region. A checkpoint message communicates this root hash to the rest of the replicas, to agree that the state is properly synchronized. If a peer finds itself out of sync, an efficient tree walking algorithm is started from the root, to identify the (hopefully few) data pages that are different and have them retransmitted by the rest of the group.

The server part of an application wishing to use PBFT services is expected to initialize the library and then wait for up-calls from it, to service requests and produce replies. While executing, it has free read access to arbitrary memory regions inside the "state" managed by PBFT, but is expected to notify the library before making any changes.
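The page-level hash tree described above can be sketched as follows. This is a simplified illustration of the idea, not the PBFT library's code: a toy 64-bit FNV-1a hash stands in for the MD5 digests, and the page count, function names, and fixed power-of-two layout are all assumptions made for brevity.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096
#define NPAGES    8   /* power of two keeps this toy tree perfectly balanced */

/* Toy 64-bit FNV-1a hash; stands in for the MD5 digests of the real library. */
static uint64_t fnv1a(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint64_t h = 1469598103934665603ULL;
    while (len--) { h ^= *p++; h *= 1099511628211ULL; }
    return h;
}

/* Root digest of a hash tree whose leaves are the state pages.  A replica
 * compares this root against the one agreed upon in a checkpoint message;
 * a mismatch triggers a top-down walk to locate the divergent pages. */
static uint64_t merkle_root(const unsigned char *state, uint64_t level[NPAGES])
{
    size_t n = NPAGES;
    for (size_t i = 0; i < n; i++)          /* hash each data page (leaves) */
        level[i] = fnv1a(state + i * PAGE_SIZE, PAGE_SIZE);
    while (n > 1) {                          /* combine pairs up to the root */
        for (size_t i = 0; i < n / 2; i++) {
            uint64_t pair[2] = { level[2 * i], level[2 * i + 1] };
            level[i] = fnv1a(pair, sizeof pair);
        }
        n /= 2;
    }
    return level[0];
}
```

Combined with the copy-on-write tracking mentioned above, only pages dirtied since the last checkpoint need rehashing, and an out-of-sync replica can walk the tree top-down to fetch just the mismatched pages.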
It is very hard to reason about the behavior of a distributed system when it is run on multiple hosts, without a common clock. Although solutions such as vector clocks exist, it would be too intrusive to retrofit them into the existing library just for the sake of monitoring its operation. To this end, we modified the library to be able to run multiple times on the same host, using different port numbers. We also created a log of all messages exchanged between replicas that, given the common clock, allowed us to reason about the behavior of the system. All further observations are based on this groundwork.
In an attempt to closely monitor and better understand the recovery process, we stopped and restarted a replica, using the default optimal configuration. We immediately witnessed erratic behavior in the recovery process, which started and resynchronized the state to the latest checkpoint, but was unable to execute the few requests remaining in the log after that point, because they failed the authentication test. What we found after further investigation was that the use of authenticators, introduced for efficiency, impeded the recovery process, because the transient state of the restarted replica had no recording of the authenticators to use for validating client requests. The solution the existing system implements is the blind retransmission of the authenticators from each node to all replicas, based on a timer. This way, once the recovering replica receives the authenticators of the clients, it will be able to resume the recovery process from the next checkpoint. The only way to lower the time frame for this service interruption is to reduce the authenticator retransmission timeout, which results in increased load for the network. We investigated other solutions, including on-demand retransmission of the authenticators; we did not pursue this, however, because retransmissions could introduce denial-of-service vulnerabilities, as a faulty replica could simply bombard the clients with authenticator retransmission requests.

The definition of a Byzantine fault, for the PBFT library, is any possible fault, including an error as trivial as a UDP packet loss. This creates interesting behaviors. We observed that UDP packets were indeed lost in our experiments, even on the loop-back interface, due to congestion caused by stress-testing the system. The impact of this is profound, as such an error will leave a replica lagging behind in transaction execution and will cause the recovery process to commence on the next checkpoint.
Unfortunately, although this approach is theoretically very elegant, it is unacceptable for a production environment to lose nodes to such trivial errors.

One of the optimizations described above, regarding the special handling of big requests, combined with a trivial UDP packet loss, can greatly affect the robustness of the system. In this case, big requests are multicast to all replicas only once, from the client. The primary will then use only the digest of the request body for further communication with the rest of the replicas. Consider what happens if one of the packets traveling from the client to one of the replicas is dropped on the way. All replicas will begin the three-phase protocol to commit and execute the request, but when execution time comes, the replica that missed the request body will be unable to execute, and will be stuck at this point until the next checkpoint arrives and the recovery process kicks in. For a request not marked as big, though, the process is different and more stable. Here, if the request from the client to the primary is dropped, the client will time out and retransmit the request, resulting in a request execution workflow in which either all replicas participate or none at all. Even in this case, a replica-to-replica packet loss would again result in interruption of service for one of the replicas, but perhaps in some environments one can assume this to be less frequent than client-to-replica packet loss.
In the original PBFT implementation, a feature was introduced to resolve the non-deterministic characteristics of most applications. The primary makes an application-specific up-call, which returns a set of values that are attached by the primary to the Pre-Prepare message. This data becomes common to all replicas executing the request, thus providing deterministic behavior on request execution. Subsequent work on the PBFT protocol [9] added an extra mechanism to validate this data on each replica. A new application-specific up-call was established that, when passed the non-deterministic data, is expected to validate it and return success or failure. The idea is, for example, that the primary attaches the system clock to the Pre-Prepare message, and each replica validates the passed value against its own clock to make sure it is appropriate.

However, the handling of non-determinism described above introduces a subtle issue. It is not always clear how the application can validate the non-deterministic data passed to it via the new upcall. The hurdle for such a validation is the instance in time at which it is supposed to happen. In the normal, fault-free lifetime of a request, the validation happens as soon as the Pre-Prepare message is received, which is almost immediately after it is transmitted. Thus validating against a time delta is viable. However, when a request is replayed from the log during recovery, the time drift can be quite large and validating using a time delta will fail and impede the recovery process. A solution to this issue would be to differentiate message processing for the recovery process and completely skip non-deterministic data validation during recovery. This, however, is again a non-trivial exercise, as message execution in PBFT is completely orthogonal to its origin.
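To make the timing hazard concrete, the pair of upcalls described above might look like the following sketch. The function names, struct, and drift bound are hypothetical, not the PBFT library's actual interface; the clock is passed in as a parameter to keep the example deterministic.

```c
#include <stdint.h>

/* Hypothetical shape of the two non-determinism upcalls described above.
 * On the primary, choose_nondet() samples the clock; on every replica,
 * validate_nondet() checks the proposed value against the local clock.
 * All names and the MAX_DRIFT_SECS bound are illustrative only. */

#define MAX_DRIFT_SECS 10  /* acceptable delta in normal, fault-free operation */

typedef struct { int64_t timestamp; } nondet_t;

/* Primary-side upcall: produce the value attached to the Pre-Prepare. */
static void choose_nondet(nondet_t *out, int64_t now)
{
    out->timestamp = now;
}

/* Replica-side upcall: accept iff within the drift bound.  During log
 * replay in recovery the delta can be arbitrarily large, which is
 * exactly the subtle failure mode discussed in the text. */
static int validate_nondet(const nondet_t *nd, int64_t now)
{
    int64_t delta = now - nd->timestamp;
    if (delta < 0) delta = -delta;
    return delta <= MAX_DRIFT_SECS;  /* 1 = valid, 0 = reject */
}
```

A request replayed from the log long after it was sequenced fails exactly this check, which is why recovery stalls unless validation is skipped for replayed messages.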
3. PBFT Deployment Drawbacks and Obstacles
In this section, we present in detail a number of the obstacles we encountered in trying to leverage the PBFT approach and implementation for our Internet e-voting service. While some of the details may seem pedantic and low-level, we include them here to give the reader a clear idea of the kinds of issues an application developer must face in porting his application to a BFT version.
3.1 Dynamic client membership

The existing PBFT protocol and implementation assume completely static membership, where each node in the system, client or replica, needs a priori knowledge of the address, port, and public key of every other node. For many applications, particularly Internet service applications with a large number of clients, such a closed system does not suffice. Our goal is to remedy this to enable clients to join and leave the replicated service dynamically, while letting the replicas remain statically bound to one another. The end result is that clients only need information regarding replicas, but no information regarding other clients, allowing for a more scalable deployment.

To achieve support for dynamic client membership, the replicas need to identify each client in an identical (deterministic) manner. This leads us to store the client identifiers in the shared state of the service (i.e., in the continuous memory region). When a client requests to join or leave the group, each replica needs to process the request using the same version of the shared state. Thus, all such client requests need to be totally ordered, at least with respect to one another. We define two special system requests, namely a Join and a Leave, which follow the same life-cycle as all other application-level (client) requests. This results in a single total order across all requests, application or system, fulfilling our requirement. The Join and Leave system requests are processed by the middleware library and are invisible to the application.

We introduce a level of indirection between what the PBFT library already uses as a node identifier and what the client reception module assigns to new clients, for efficiency of message evaluation. Instead of using a single address range of [ .. max clients ], an arbitrary identifier is assigned to each new client and a table maps this number to the index in the array of client and server node entries. This way, when a client request arrives, the system first checks to see if the identifier exists in the redirection table before going into the more lengthy process of verifying its signature or authenticator.

Originally, our idea was for the client to multicast a simple Join system request to all replicas, carrying its address, its public key and a random nonce, signed with its private key. Each replica would assign the same new identifier and transmit it back in the reply. However, nothing stops a malicious client from initiating an infinite number of connections, using phony addresses, thus exhausting the bounded maximum number of node entries in each replica. To address this vulnerability, we improve the connection process by splitting the Join operation into two phases. In the first phase, the client submits its data as previously described and awaits a challenge. Upon receiving the challenge, the client calculates a response and transmits it back to the replicated service in the second phase of the Join. Only then will the replica add the client to the system as a full member. This approach ensures the client indeed owns the address he claims, as receiving the challenge is imperative to compute the response.

We also add an application-level identification buffer to the Join message. This buffer is passed to the application for authorization. It might include, for example, an encrypted user id and password.
The application then returns an identifier to be associated with this client (such as the user id). The middleware library will then guarantee that only a single session can be active at a time for this specific identifier, by terminating all previous sessions when a new one is established. This way, even in a distributed denial of service attack, the attacker can only establish as many sessions as the number of credentials he has managed to obtain.

The Leave system request is much simpler, as it simply instructs each replica to remove the client from its internal tables. All further communication with the service is prohibited for this client.

We need timeouts to enforce cleanup of stale sessions once the node structures are full. To achieve some common ground regarding time across all replicas, all requests are timestamped with the time of the primary; when each request is executed, its timestamp is recorded for each client. When a join request arrives that cannot be serviced because the client/server node table is full, a cleanup process is started that will locate all clients with a last executed request older than the current join request minus a configurable threshold. All such sessions are cleared to make room for the new connection. If no such stale sessions are found, the new Join request is denied.

The Join process is depicted in the UML sequence diagram shown in Figure 2.

Figure 2. PBFT dynamic client join sequence diagram

Note that we have enhanced the PBFT protocol with support for dynamic client membership without changing the inherent properties and message exchanges of the protocol. Thus, our changes do not affect the safety and liveness guarantees offered by PBFT.

3.2 A higher level state abstraction

In a replicated state machine, the term 'state' is an abstract definition of the persistent workspace of the application. PBFT defines state to be a continuous virtual memory region where both the application and the middleware library store their non-transient state, in contiguous non-overlapping partitions. The middleware library has full access to this memory region while the application code is not executing, since it is responsible for managing replication and synchronization of this state across replicas. The application, on the other hand, has free read access to it, but is required to notify the library before making changes to any region, thus permitting copy-on-write optimizations of state synchronization.

While the above approach relieves the application considerably from having to deal with state synchronization, it creates a number of questions which the application developer must face: What can a modern application do with just a pointer to a memory region? How is this state persistently stored on disk when the service stops? And how does the developer avoid the havoc caused by a misbehaving application which fails to notify the library before modifying memory?

To answer these questions in a satisfactory manner, we decided to adapt an embedded relational database engine to intervene between the PBFT middleware library and the application.
This way, the application has SQL-level access to its state, and the embedded engine takes care of interfacing with the PBFT library to satisfy its requirements.

In our search for an embedded relational database engine, the major feature we were after was storage of data in a single file, which we could map to virtual memory. We selected SQLite [1] because it exhibits this feature and because it is mature and widely deployed. SQLite is an embedded, in-process library that implements a self-contained relational database engine using SQL as its command language and a C call-level interface for the application. It stores all data objects in a single database file that is binary compatible across machine architectures (endianness) and word sizes.

In SQLite's quest to be a multi-platform product, its authors have defined an abstraction layer called VFS (Virtual File System), which sits between the relational engine and the operating system. By hooking into this subsystem, we can not only manage memory mapping and perform the PBFT-required memory modification notifications, but also re-implement non-deterministic functions, such as system time and random values, using the upcalls described in Section 2. Interaction with the VFS is illustrated in Figure 3.

SQLite uses two disk files to manage the database, for reliability reasons. The first file is the actual database, which we map to virtual memory. The second file is the rollback journal (or write-ahead log, in a different mode of operation), which is used to roll back failed transactions. We left this second file stored on disk, since it allows the engine to recover in the case of system failure and it is not actually part of the application state. In any case, the database file is synchronized with its disk image on transaction commit.

Figure 3. SQLite with its VFS inside a PBFT application

We gain many advantages with this approach. First, a committed transaction will be durable, even in the case of a system crash. That is, when the replica node restarts operation, its state will include the last committed transaction, and PBFT recovery will commence from this point. Second, even if the node is removed from the replicated service, its data will be usable on its own, being just another database file. Moreover, an uncommitted transaction will be rolled back on the next attempt to access the database file, whether from the replicated service or on its own. These advantages are simply the by-product of the ACID semantics that SQLite provides, and excellent reasons why developers will likely want to take advantage of it.

One obstacle we faced was that, while SQLite can freely manage the growth and shrinkage of its database file, PBFT is not so permissive, because it requires knowledge of the size of the memory region that represents the state during its initialization. To alleviate this, we use a sparse file that is defined to be of a large enough size on initialization, without actually occupying that space on disk, a solution that is reasonable in modern 64-bit operating systems with large virtual memory address ranges.

The application code now simply passes the name of the database file to the PBFT initialization function responsible for starting up the replica server and setting up any data structures needed by the middleware. The function returns to the application code a standard SQLite database handle. Using this handle, the application can call standard SQLite library functions (e.g., sqlite3_exec, sqlite3_prepare_v2, sqlite3_step) to access the database while executing during the appropriate PBFT upcall.
This way, an application already using SQLite is immediately portable to the PBFT middleware with only minor changes to its initialization code.
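To make the resulting programming model concrete, here is a minimal Python analogue of the access pattern described above. The real interface is SQLite's C API (sqlite3_exec, sqlite3_prepare_v2) invoked during the PBFT execute upcall; the function and table names below (open_replica_state, votes) are hypothetical and exist only for illustration.

```python
import sqlite3

def open_replica_state(db_path):
    """Stand-in for the modified PBFT initialization described above.
    In the real system, a C function maps the database file into the
    replicated state region and hands back a standard SQLite handle;
    here we simply open the file to show the access pattern the
    application keeps after porting."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS votes (voter TEXT PRIMARY KEY, vote TEXT)")
    return conn

def execute_request(conn, voter, vote):
    # Application code runs ordinary SQL during the PBFT execute upcall;
    # the commit makes the transaction durable in the database file.
    with conn:
        conn.execute("INSERT INTO votes (voter, vote) VALUES (?, ?)",
                     (voter, vote))

conn = open_replica_state(":memory:")
execute_request(conn, "voter-1", "candidate-A")
```

The point of the design is visible here: nothing in the application code mentions replication; the database handle is the entire interface.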
3.3 Remaining issues

We now describe a number of remaining issues we encountered in the process of applying the PBFT approach to our electronic voting application service.

3.3.1 Cryptography

Applications requiring strong cryptography, such as private key generation and storage on the server side of the application, are not well supported by the current PBFT implementation. For key generation, strong random values are required. Unfortunately, even if the primary obtains such strong randomness from its local OS services, for example via /dev/random, there is no way such values can be verified by the remaining replicas, by the very definition of being random. Because of this, an adversary can obtain access to one of the execution replicas, wait until it becomes the primary, and use predetermined values instead of random values. In this manner, the adversary can trigger the generation of well-known private and public keys and thus violate confidentiality. To alleviate such attacks, one solution would be to enforce a threshold signature scheme [13] for such authentication requirements, provided for by the middleware library. In such a scheme, private key information for each replica would never be transmitted over the network, as it would not be stored in shared state. In a (f+1, n) (where n = 3f+1) threshold signature scheme, the set of n replicas would collectively generate a digital signature despite up to f Byzantine faults. Of course, the PBFT protocol would have to be modified to provide for such cryptographic operations.

Another confidentiality issue is the matter of protecting storage of sensitive information. This has been studied by Yin et al. [33], who propose separating the agreement part of the PBFT protocol from the execution part, while also adding an intermediate cluster of 'privacy firewall' nodes. In this layout, f+1 agreement nodes receive the client requests and forward them to f+1 execution nodes for execution. To ensure that a faulty execution node cannot disclose sensitive information, a privacy firewall set of h+1 rows by h+1 columns of nodes is positioned between the agreement and execution clusters, which allows tolerating up to h faulty firewall nodes. This obviously increases both deployment complexity and request execution latency.

The current implementation of the PBFT protocol purposely ignores the notion of client-specific state. This, however, severely limits the target applications to those that are either stateless by nature, or manage session state on their own using the global state abstraction; the latter need to pass session identifiers inside the request and reply bodies, without any assistance from the middleware library. This is not an inherent limitation of the State Machine Replication approach; it is simply a consequence of the lack of appropriate mechanisms in the PBFT library. With our addition of application-level sign-on messages to the protocol, resulting in identification of specific sessions, a library-level subsystem can be developed that maps parts of the state to a specific session. This would enable easier porting of stateful applications to the BFT world.
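Returning to the (f+1, n) threshold suggestion above: the sharing arithmetic behind such schemes can be illustrated with plain Shamir secret sharing, where any f+1 of the n = 3f+1 shares reconstruct a secret while f shares reveal essentially nothing. This is only a sketch of the splitting and reconstruction math; a real deployment would use a proper threshold signature scheme [13], in which the signing key is never assembled in one place at all.

```python
import random

P = 2**127 - 1  # a Mersenne prime; all arithmetic is over GF(P)

def share_secret(secret, f, n):
    """Split `secret` so that any f+1 of n shares reconstruct it, via a
    random polynomial of degree f with `secret` as its constant term."""
    coeffs = [secret] + [random.randrange(P) for _ in range(f)]
    def poly(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, poly(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    # Lagrange interpolation at x = 0 recovers the constant term.
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

f = 1
n = 3 * f + 1  # 4 replicas tolerate f = 1 Byzantine fault
shares = share_secret(123456789, f, n)
assert reconstruct(shares[:f + 1]) == 123456789  # any f+1 shares suffice
```

With f or fewer shares, every candidate secret remains equally likely, which is precisely why no single compromised replica can leak the key.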
Our end goal is to provide a web application to end users, giving them hassle-free access to the server counterpart of the e-voting service. We aim to achieve this without sacrificing BFT semantics. To this end, the browser-hosted part of the application, typically written in JavaScript, will have to directly access each and every replica. This communication, however, cannot be carried over UDP, because that protocol is not allowed in the JavaScript runtime environment. Moreover, binary messages are highly inconvenient in this context. Higher-level protocols, such as WebSocket, and structures like JSON or XML need to be used. Support for these technologies needs to be incorporated into the middleware library, a task that is not trivial because of the need to switch from point-to-point message-based communication to connected channel-oriented communication. Additionally, cryptographic functions will need to be available in the browser-hosted client part, which requires transitioning from Rabin to more widely available cryptosystems, such as RSA.

Additionally, we aim to have the replicas located in different physical locations, to obtain real independence from faults caused by network partitions. This requirement dictates operation in a Wide Area Network environment, where the quadratic message complexity of PBFT will most probably prove costly with regard to request latency. Although we tried to simulate a WAN deployment scenario using BFTsim [30], the simulator could not scale to a large enough number of nodes to obtain meaningful results. This issue is already studied in [5], though no open source implementation is readily available.

Summary: The above issues can be overcome, but require a significant amount of engineering effort. An application developer wanting to leverage and deploy PBFT now is likely to be unwilling to invest the time and effort required to retrofit the PBFT approach to match the needs of his/her application.
4. Evaluation
In this section we present empirical measurements of the PBFT library, both with and without our modifications supporting dynamic client management and seamless state management for applications requiring ACID semantics provided by a legacy database.

We test the PBFT library and our modifications to it on a cluster of 8 machines connected with a 1 Gbit Ethernet switch. The first four machines are Intel Xeon E5620 at 2.40 GHz under CentOS 5.5 with Linux kernel 2.6.18-194. The remaining four are Intel Core 2 Duo E6600 at 2.40 GHz under Debian 5.0 with Linux kernel 2.6.26. All eight machines run 64-bit versions of their corresponding operating systems. Ping round-trip time is measured at 134-183 microseconds between all hosts. Bandwidth is measured, using iperf, at 938 Mbits/sec. For all tests, we generate a server and client executable using a particular library configuration set, so as to measure the effect of turning on or off a particular optimization and/or modification. We designed the client to connect to the library and wait for a signal. On signal reception, it records the current time, starts its operation, and then measures and reports elapsed time. To coordinate all processes running on different hosts while at the same time collecting and aggregating measurements, we implemented a test framework using Python and netcat, where the latter runs on each host and allows a single controller to submit scripts (i.e., experiments) and collect the results.

We first conduct an experiment without the SQL state abstraction modifications we made, in order to benchmark the plain PBFT implementation. Our goal is to measure the impact on system throughput of turning on/off the optimizations described in Section 2. Recall that the use of certain optimizations (such as the use of MACs and special handling of big requests) increases performance at the cost of decreased robustness (e.g., slow recovery) of the system. We generate and test a series of PBFT library configurations, shown in Table 1.
The first configuration is the default configuration preferred and recommended by Castro, with all optimizations enabled, including the use of MACs, special treatment of all requests as big requests, and request batching. Since batching is the only optimization for which we did not observe faulty behavior, we isolate it and test all other combinations of configurations with batching enabled and disabled, to show its impact. The last four rows of Table 1 depict the most robust configurations (use of MACs and big request handling turned off). Since our particular application has stringent security and reliability requirements, we choose to measure the impact of adding support for dynamic client management using these configurations. We believe other Internet service applications with similar high security and robustness needs would need to run the PBFT library using these configurations. The client and server programs built to measure throughput transmit null requests and responses of varying sizes: 256, 1024, 2048 and 4096 bytes. We test the system using 12 clients spread evenly across 4 machines while being serviced by 4 replicas, each running alone on a single host. In all cases, IP-level multicasting was turned off, as the networks we are targeting (WANs) do not support it.

The results for varying request and response sizes are similar, so for brevity we show a representative plot, for a size of 1024 bytes, in Figure 4.

Figure 4. PBFT tests

From Table 1 and Figure 4, it is clear that the first configuration, which is the default configuration of the PBFT library with all optimizations turned on, achieves the best throughput performance. In our experiments, this configuration achieves approximately 17,000 null operations per second, while for the most robust configurations the throughput drops to about 1,000 null operations per second.

We observe that disabling the batching optimization seriously affects performance when using MACs. When switching to signing with private keys, the delay introduced is so large that batching can no longer assist in any way. Moreover, when disabling big request handling, performance drops to 18% of the optimal, while disabling the use of MACs causes performance to drop to 7.5% of the optimal. Disabling both big request handling and MAC use causes performance to drop to 6% of the optimal. While we observe differences in performance amongst the configurations where some subset of optimizations is turned off, the bottom line is that performance takes a big hit when turning off any of the optimizations. However, for an application with high security requirements, we conjecture robustness is favored over performance.

We evaluate the impact on performance of adding support for dynamic client management using the most robust configurations. The performance decrease is 0.5% (988 vs. 992 transactions per second), which is negligible. This small decrease is attributable to the cost of accessing the redirection table that converts assigned customer ids to indexes in the tables tracking participating nodes (clients and servers).

We emphasize that the above tests are artificial, because they are testing "null" operations. The software on the replica spends no time executing application code; it simply manages the network protocol. The large majority of prior BFT studies present throughput in terms of null operations per second. This is understandable, as the focus is on providing a baseline benchmark against which varying BFT protocols can be compared, but it is not helpful to the application developer who needs to understand how the system would behave using real application requests.
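As a quick sanity check, the percentage drops quoted above follow directly from the batched-configuration throughputs in Table 1 (variable names below are ours, not the library's):

```python
# Null-request throughput figures from Table 1 (batching enabled).
best = 17014       # all optimizations on (default configuration)
no_allbig = 3030   # big-request handling disabled
no_mac = 1291      # MAC authenticators disabled (signatures used instead)
no_both = 992      # both disabled: the most robust static configuration

assert round(no_allbig / best * 100) == 18   # "drops to 18% of the optimal"
assert round(no_mac / best * 100, 1) == 7.6  # quoted as roughly 7.5%
assert round(no_both / best * 100) == 6      # "6% of the optimal"
```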
In this subsection, we evaluate the performance of adding seamless state management for applications requiring ACID semantics provided by a legacy database. Null operations are thus not realistic to use in this setting. For our client application request we choose the insertion of a single row into a database table. This is the operation our e-voting service must perform to record a user's vote in an ongoing election.

Name                          Static client mgmt  Using MACs  All requests big  Batching  TPS     StDev
sta_mac_allbig_batch          Yes                 Yes         Yes               Yes       17,014  66
sta_mac_allbig_nobatch        Yes                 Yes         Yes               No        1,051   56
sta_mac_noallbig_batch        Yes                 Yes         No                Yes       3,030   57
sta_mac_noallbig_nobatch      Yes                 Yes         No                No        1,109   103
sta_nomac_allbig_batch        Yes                 No          Yes               Yes       1,291   4
sta_nomac_allbig_nobatch      Yes                 No          Yes               No        1,199   12
sta_nomac_noallbig_batch      Yes                 No          No                Yes       992     2
sta_nomac_noallbig_nobatch    Yes                 No          No                No        1,186   7
nosta_nomac_noallbig_batch    No                  No          No                Yes       988     1
nosta_nomac_noallbig_nobatch  No                  No          No                No        1,205   1

Table 1. PBFT library configurations we test. TPS is transactions per second, where a transaction is simply a null request. Null request and null response sizes are 1024 bytes.

The tuple inserted into the database includes a simple key and value text (representing voter identity and accompanying vote), in addition to a timestamp and a random value. We purposefully added the timestamp and random value to test that replies are indeed identical across all replicas. For this experiment, we enabled request batching and varied turning on and off the remaining options (use of MACs, big request handling, and support for dynamic clients). ACID semantics are provided using the rollback journal mode of SQLite. Throughput performance, measured as database insertion transactions per second, is illustrated in Figure 5.
Figure 5. PBFT + SQL benchmark

In this experiment, the big request handling optimization pays no dividends, because the system now spends time executing a real, non-null request which requires accessing the hard disk. This dominates the overall request execution lifetime. At any rate, the most robust configuration with dynamic clients enabled is now at 43% of the best (sta_mac_noallbig). Since disk access is a big factor in this experiment, we perform two more experiments to isolate its impact. In these experiments, we measure the most robust configuration (where the use of MACs and big request handling are disabled) with dynamic clients and ACID semantics (as above), and we measure another configuration without ACID semantics (no rollback journal and no flushing to disk on each operation). The ACID version achieves 534 TPS while the non-ACID one scores 1155, an approximately 2x performance boost.
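The ACID/non-ACID gap corresponds directly to SQLite durability settings. The sketch below shows the two configurations side by side; the exact pragmas used in our build are not spelled out above, so the specific choices here are an assumption rather than a transcript of the experiment.

```python
import os
import sqlite3
import tempfile

def configure(conn, acid):
    """ACID mode keeps the rollback journal and fsyncs on every commit;
    the non-ACID variant (as in our second experiment) disables the
    journal and the per-operation flush to disk."""
    if acid:
        conn.execute("PRAGMA journal_mode=DELETE")  # rollback journal on
        conn.execute("PRAGMA synchronous=FULL")     # fsync on commit
    else:
        conn.execute("PRAGMA journal_mode=OFF")     # no rollback journal
        conn.execute("PRAGMA synchronous=OFF")      # no flush per operation

path = os.path.join(tempfile.mkdtemp(), "state.db")
conn = sqlite3.connect(path)
configure(conn, acid=True)
conn.execute("CREATE TABLE votes (voter TEXT, vote TEXT)")
with conn:  # one durable transaction per insertion, as in the benchmark
    conn.execute("INSERT INTO votes VALUES ('v1', 'A')")
```

Turning off the journal and synchronous writes trades crash recovery for throughput, which is exactly the 534 vs. 1155 TPS trade-off measured above.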
Summary:
The optimizations turned on by default in the PBFT library lead to the high throughput numbers reported in prior studies, but as we have shown in Section 2, using some simple fault scenarios (such as UDP packet loss), the high performance numbers come at the cost of decreased robustness of the system. Moreover, the performance numbers reported by a large majority of prior BFT studies are based on a metric of null operations per second. This is not a helpful metric for the end-application developer, particularly for a developer whose application makes use of a legacy database for ACID semantics.
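The insertion request in Section 4.2 deliberately included a timestamp and a random value, because these are the classic sources of replica divergence. The standard remedy, which our VFS hooks implement via the PBFT upcalls, is to have such values proposed by the primary and agreed upon with the request, so that every replica executes deterministically. A minimal sketch of that idea (all names here are hypothetical):

```python
import hashlib

def execute_insert(state, voter, vote, primary_ts, primary_seed):
    """Replicas must produce byte-identical replies, so non-deterministic
    inputs (time, randomness) are agreed upon along with the request
    rather than sampled locally on each replica."""
    # Derive the 'random' value deterministically from the agreed seed.
    rand = hashlib.sha256(primary_seed + voter.encode()).hexdigest()
    state[voter] = (vote, primary_ts, rand)
    return state[voter]

# Two replicas executing the same agreed request produce the same reply.
r1 = execute_insert({}, "v1", "A", 1700000000, b"agreed-seed")
r2 = execute_insert({}, "v1", "A", 1700000000, b"agreed-seed")
assert r1 == r2
```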
5. Related Work
As cited in Section 1, since the seminal 1999 publication on PBFT by Castro and Liskov [8], there has been a flurry of research activity focused on improving BFT middleware performance [4, 11, 12, 14, 16, 20, 21, 31-33], replication cost [15, 32, 33], and robustness under both faulty servers and faulty clients [6, 11]. A majority of these systems [6, 11, 16, 20, 21, 32, 33] are direct descendants of the Castro and Liskov PBFT system. Of all of these systems, the only codebase that has been made widely available and refined for several years is the PBFT system. For this reason, we have focused on this system. Since all PBFT descendants use the same codebase, the obstacles we encountered as application developers in using the PBFT system apply to its descendants as well.

We highlight, below, some related works that either directly focus on bringing BFT systems closer to widespread deployment in real applications, or raise issues that affect the practical deployment of our (and other) security-critical applications.

Wood et al. [32] write "no commercial data center uses BFT techniques despite the wealth of research in this area" and posit that this is due to the high cost of replication required by BFT protocols. They aptly point out that, for applications such as web servers and database servers, it is the execution of client requests and not the agreement on request ordering that dominates the performance of a BFT protocol. They propose lowering the number of active execution replicas to f+1 by using virtual machines as execution nodes and ZFS snapshots for quick state checkpointing. When the f+1 replicas produce inconsistent replies, a paused execution node is revived and starts executing requests immediately. The middleware library fetches the state needed by these requests on demand, to amortize the cost of state transfer. The paper claims that for applications running over a WAN environment, the time to perform state transfer is minimal compared to WAN latencies. The focus of the paper is on reducing replication cost while maintaining good performance. While this is welcome for an application to be deployed in a data center, the paper does not address how the application developer can easily make use of the system, stating simply that applications must be rewritten to take advantage of it.

Clement et al. [10] introduce UpRight, with the goal of making it easy for application developers to convert a crash-fault tolerant application into a BFT application. It includes a number of state-of-the-art BFT techniques, including separation of agreement from execution, insights from the Aardvark protocol [11] on dealing with faulty clients and alleviating denial-of-service attacks, as well as more flexible state management (though not at as high a level as a relational engine). It also allows individual tailoring of crash-fault (Up) and arbitrary-fault (Right) tolerance. Unfortunately, it is still a work in progress, with several key features missing (e.g., view changes are unimplemented), and does not seem to have seen much development since March 2010 [2], so it is not helpful to a developer wishing to make use of BFT techniques now.

Several attempts have been made to address the inability of replicated BFT services to mesh with the rest of the infrastructure in today's multi-tier world. Merideth et al. [26] introduced Thema, which aims to mask BFT complexity from the application developer of web-services-based applications. An agent, visible to the unaffected outside world, plays the role of the client of a BFT system. Additionally, a proxy collects the multiple out-call requests from the replicas of a BFT system, issues the actual out-call on their behalf, and returns the reply when available. Unfortunately, both the agent and the proxy are centralized components, which are inappropriate for applications such as ours that require a completely distributed design.

Pallemulle et al. [27] focus on interoperability between BFT systems, while enforcing fault isolation, and introduce a whole new protocol, named Perpetual, to achieve this. Sen et al. [29], in a system called Prophecy designed to increase BFT performance, introduce a Sketcher component that tries to trade space for performance by storing a historical log of request/reply pairs and allowing the application to differentiate its requests, asking for possible log-based replies. In its distributed incarnation, D-Prophecy, it is simply an attempt to avoid re-execution of repetitive requests. In the centralized one, Prophecy, the Sketcher completely avoids BFT access but now becomes a centralized component.

Amir et al. [5] introduce Steward, a hierarchical BFT architecture that tries to scale BFT to a wide-area network by introducing an abstraction layer above PBFT using a Paxos-based protocol. It uses a threshold signature scheme to assure the recipient of a cross-domain message that enough replicas at the originating site agreed with the request. Both of these features are welcome to security-conscious Internet application services. Unfortunately, no source code is readily available.

Vandiver et al. [31] and Garcia et al. [16] introduce middleware for BFT database replication. Incorporating legacy databases into a BFT system is important for a wide range of Internet applications. Unfortunately, both systems assume closed systems with a finite number of clients. The developer of an Internet-facing application service still must deal with the issue of having end-user clients issue requests to the replicated database system. Either these systems need to provide support for dynamic client management, or they must offload the Internet-facing application component accepting customer/user requests to a centralized component, something not appropriate for our particular application.

Finally, Guerraoui et al. [17] introduce a new abstraction allowing for the construction of new BFT protocols with a fraction of the code currently necessary, thus vastly simplifying the BFT researcher's task. Having waded through the 20,000 lines of PBFT code, we applaud this effort and emphasize here the need to simplify the end application developer's task as well.
6. Conclusion
This paper is a call to the systems community to look more closely at BFT from the perspective of a real-world application developer. Our experience in trying to apply the PBFT approach to a real-world application with stringent security and reliability needs reveals a slew of difficulties that the application developer must face if he wants to use even the mature, stable and well-tuned PBFT protocol and codebase upon which a large majority of subsequent BFT systems is based. While the difficulties encountered by the developer can be overcome, they require significant engineering effort and have unclear performance ramifications. These two characteristics are likely to make the developer hesitant to invest the effort to leverage BFT techniques.

The systems community prides itself on building and measuring real systems. If BFT systems are to see widespread deployment in real-world systems, then the research community needs to focus on the usability of BFT algorithms for real-world applications, from the end-developer perspective, in addition to continuing to improve BFT middleware performance, robustness, and deployment layouts.

Interestingly, the current BFT debate may evolve to resemble the microkernel debate [24], with one camp advocating that the BFT concept is ultimately impractical for real-world applications [7], and the other camp advocating that it is not the concept that is impractical/faulty, but the implementation. Building a complete implementation that supports a real application for a long duration rather than for the length of time it takes to build and test a prototype, that does not cut corners, that is not missing features, that does not make optimizations that break down in corner cases, that can be applied to more than one application, and that has good performance will go a long way toward settling the debate. A tall order, for sure.

References

[1] SQLite embedded database engine.
[2] UpRight: Making distributed systems up (available) and right (correct). http://code.google.com/p/upright/w/list.
[3] A digital signature based on a conventional encryption function. In CRYPTO, 1987.
[4] M. Abd-El-Malek, G. Ganger, G. Goodson, M. Reiter, and J. Wylie. Fault-scalable Byzantine fault-tolerant services. In SOSP, October 2005.
[5] Y. Amir, C. Danilov, D. Dolev, J. Kirsch, J. Lane, C. Nita-Rotaru, J. Olsen, and D. Zage. Steward: Scaling Byzantine fault-tolerant systems to wide area networks. In DSN, 2006.
[6] Y. Amir, B. Coan, J. Kirsch, and J. Lane. Byzantine replication under attack. In DSN, June 2008.
[7] K. P. Birman. Reliable Distributed Systems. Springer, first edition, 2005.
[8] M. Castro and B. Liskov. Practical Byzantine fault tolerance. In OSDI, February 1999.
[9] M. Castro, R. Rodrigues, and B. Liskov. BASE: Using abstraction to improve fault tolerance. ACM TOCS, 21(3), August 2003.
[10] A. Clement, M. Kapritsos, S. Lee, Y. Wang, L. Alvisi, M. Dahlin, and T. Riche. UpRight cluster services. In SOSP, October 2009.
[11] A. Clement, E. Wong, L. Alvisi, and M. Dahlin. Making Byzantine fault tolerant systems tolerate Byzantine faults. In NSDI, April 2009.
[12] J. Cowling, D. Myers, B. Liskov, R. Rodrigues, and L. Shrira. HQ replication: A hybrid quorum protocol for Byzantine fault tolerance. In OSDI, November 2006.
[13] Y. Desmedt and Y. Frankel. Threshold cryptosystems. In CRYPTO, 1989.
[14] T. Distler and R. Kapitza. Increasing performance in Byzantine fault-tolerant systems with on-demand replica consistency. In EuroSys, April 2011.
[15] T. Distler, R. Kapitza, I. Popov, H. Reiser, and W. Schroder-Preikschat. SPARE: Replicas on hold. In NDSS, February 2011.
[16] R. Garcia, R. Rodrigues, and N. Preguica. Efficient middleware for Byzantine fault tolerant database replication. In EuroSys, April 2011.
[17] R. Guerraoui, N. Knezevic, V. Quema, and M. Vukolic. The next 700 BFT protocols. In EuroSys, April 2010.
[18] M. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM TOPLAS, 12(3):463-492, July 1990.
[19] A. Kiayias, M. Korman, and D. Walluck. An internet voting system supporting user privacy. In ACSAC, December 2006.
[20] R. Kotla and M. Dahlin. High throughput Byzantine fault tolerance. In DSN, June 2004.
[21] R. Kotla, L. Alvisi, M. Dahlin, A. Clement, and E. Wong. Zyzzyva: Speculative Byzantine fault tolerance. In SOSP, October 2007.
[22] L. Lamport. The implementation of reliable distributed multiprocess systems. Computer Networks, 2, 1978.
[23] L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM TOPLAS, 4(3):382-401, July 1982.
[24] J. Liedtke. On micro-kernel construction. ACM SIGOPS Operating Systems Review, 29(5), December 1995.
[25] N. Lynch. Distributed Algorithms. Morgan Kaufmann, 1996.
[26] M. Merideth, A. Iyengar, T. Mikalsen, S. Tai, I. Rouvellou, and P. Narasimhan. Thema: Byzantine-fault-tolerant middleware for web-service applications. In SRDS, October 2005.
[27] S. L. Pallemulle, H. D. Thorvaldsson, and K. J. Goldman. Byzantine fault-tolerant web services for n-tier and service oriented architectures. In ICDCS, June 2008.
[28] F. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Computing Surveys, 22(4):299-319, December 1990.
[29] S. Sen, W. Lloyd, and M. Freedman. Prophecy: Using history for high-throughput fault tolerance. In NSDI, April 2010.
[30] A. Singh, T. Das, P. Maniatis, P. Druschel, and T. Roscoe. BFT protocols under fire. In NSDI, 2008.
[31] B. Vandiver, H. Balakrishnan, B. Liskov, and S. Madden. Tolerating Byzantine faults in transaction processing systems using commit barrier scheduling. In SOSP, October 2007.
[32] T. Wood, R. Singh, A. Venkataramani, P. Shenoy, and E. Cecchet. ZZ and the art of practical BFT. In EuroSys, April 2011.
[33] J. Yin, J.-P. Martin, A. Venkataramani, L. Alvisi, and M. Dahlin. Separating agreement from execution for Byzantine fault tolerant services. In SOSP, October 2003.