Byzantine-Fault-Tolerant Consensus via Reinforcement Learning for Permissioned Blockchain Implemented in a V2X Network
Seungmo Kim, Member, IEEE, and Ahmed S. Ibrahim, Member, IEEE
Abstract—Blockchain has been at the center of various trust-promoting applications for vehicle-to-everything (V2X) networks. Recently, permissioned blockchains have gained practical popularity thanks to their improved scalability and their ability to serve the diverse needs of different organizations. One representative example of a permissioned blockchain is Hyperledger Fabric. Due to its unique execute-order procedure, there is a critical need for a client to select an optimal number of peers. The problem that this paper targets is the tradeoff in the number of peers: too large a number lowers scalability, while too small a number leaves a narrow margin in the number of peers sufficing Byzantine fault tolerance (BFT). This channel selection issue becomes especially challenging in V2X networks due to mobility: a transaction must be executed and the associated block committed before the vehicle leaves the network. To this end, this paper proposes an optimal channel selection mechanism based on reinforcement learning (RL) to keep a Hyperledger Fabric-empowered V2X network impervious to the dynamicity caused by mobility. We model the RL as a contextual multi-armed bandit (MAB) problem. The results demonstrate that the proposed scheme outperforms the baselines.
Index Terms—V2X; Reinforcement learning; Contextual multi-armed bandit; Permissioned blockchain; Hyperledger Fabric; Byzantine fault tolerant consensus
I. INTRODUCTION
A. Background
Vehicle-to-Everything (V2X) communications are acknowledged to have massive potential to significantly decrease the number of vehicle crashes, thereby reducing the number of associated fatalities [1]. This capability gave V2X communications the central role in the constitution of the intelligent transportation system (ITS) for connected and autonomous vehicles (CAVs).

Meanwhile, blockchain technology has been gaining widespread interest based on its capability of providing secure, access-regulated interactions and transactions. However, for application in V2X networks, the key challenge lies in maintaining the performance of a consensus under the networks' dynamicity attributed to mobility [2][3]. In general, a consensus algorithm is defined as a process to achieve agreement on a single data value among distributed processes or systems. Consensus algorithms are designed to achieve a sufficient degree of reliability even in a network involving multiple unreliable nodes.

Permissioned blockchains are gaining popularity as a means to address this issue. In many distributed blockchains that are not permissioned (also known as "public"), such as Ethereum and Bitcoin, any node can participate in the consensus process, wherein transactions are ordered and bundled into blocks. Because of this characteristic, these systems rely on probabilistic consensus algorithms, which eventually guarantee ledger consistency to a high degree of probability, but which are still vulnerable to divergent ledgers (also known as a ledger "fork"), where different participants in the network have a different view of the accepted order of transactions. Permissioned blockchains work differently: they aim at a deterministic consensus among all the nodes participating in a validation process.

S. Kim is with the Department of Electrical and Computer Engineering, Georgia Southern University, Statesboro, GA, USA. A. S. Ibrahim is with the Department of Electrical and Computer Engineering, Florida International University, Miami, FL, USA.
B. Hyperledger Fabric
Hyperledger Fabric ("Fabric" hereafter) enjoys the widest popularity these days owing to its modular consensus protocol design, which allows the system to be tailored to particular use cases and trust models. It features a node called an orderer (also known as an "ordering node") that performs transaction ordering and, along with other orderer nodes, forms an ordering service. Because Fabric's design relies on deterministic consensus algorithms, any block validated by a peer is guaranteed to be final and correct. Ledgers cannot fork the way they do in many other distributed and permissionless blockchain networks.

Also, Fabric employs an execute-order architecture, which requires all peers to execute every transaction and all transactions to be deterministic. Conversely, existing blockchain systems employ the opposite "order-execute" architecture: examples range from public ones such as Ethereum, with a consensus mechanism based on proof-of-work (PoW), to permissioned ones, with consensus based on crash fault tolerance (CFT) or Byzantine fault tolerance (BFT) [5]. Although this existing architecture is not immediately apparent in all systems, because the additional transaction validation step may blur it, its limitations are inherent in all of them: every peer executes every transaction, and transactions must be deterministic.

In a Fabric network, scalability is dominated by the endorsement policy complexity [6] and by the ordering service where a consensus has to be reached [7]. Specifically, validating a transaction's endorsements requires evaluating the endorsement policy expression against the collected endorsements and checking for satisfiability [8], which is usually achieved via a gossip protocol in a BFT-based consensus mechanism. This is the key bottleneck in scalability: a larger number of peers participating in validation usually causes a longer latency and hence a lower throughput.

Moreover, there is a major pitfall in Fabric's execute-order structure [4]. Since an application is executed before validation of the associated transaction, the key drawback of this system occurs when the transaction turns out to be invalid at the end. This incurs a security problem and also a waste of the resources spent executing an application that does not comply with the endorsement policy.

C. Reinforcement Learning for Performance Optimization
In this paper, we propose applying reinforcement learning (RL) to optimize the selection of a channel in a Fabric network implemented in a V2X environment. However, there still are challenges to address. Specifically, the learning itself is complicated by the same dynamicity, which necessitates that the learning framework itself be resilient and flexible with respect to the environment.

This paper proposes a learning mechanism formulated as a multi-armed bandit (MAB) problem, which enables a vehicle, without any assistance from external infrastructure, to autonomously learn the environment and adapt its channel access behavior accordingly.

The MAB simplifies RL by removing the learning dependency on state, thus providing evaluative feedback that depends entirely on the action taken (a 1-step RL problem). The actions are usually decided upon in a greedy manner by updating the benefit estimate of each action independently from the other actions. To consider the state in a bandit solution, contextual bandits may be used [10]. In many cases, a bandit solution may perform as well as a more complicated RL solution, or even better. Many bandit algorithms enjoy strong theoretical guarantees on their performance even under adversarial settings [11]. These bounds suggest that, in the limit, the proposed algorithm would be no worse than using the best fixed system configuration in hindsight.
Thompson sampling (TS), also known as posterior sampling, provides an elegant approach to the exploration-exploitation dilemma by maintaining a posterior over models and choosing actions in proportion to the probability that they are optimal [12]. We will show in this paper that the endorsing peer selection problem can be solved via Thompson sampling.
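As a concrete illustration, a minimal Beta-Bernoulli Thompson sampling loop can be sketched as follows. This is a generic sketch, not the paper's implementation; the arm reward probabilities and the horizon in the example are illustrative assumptions.

```python
import random

def thompson_step(successes, failures):
    """One Thompson-sampling step over Bernoulli arms.

    successes[k] and failures[k] hold the Beta posterior counts for
    arm k (starting from a Beta(1, 1) prior); returns the chosen arm.
    """
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

def run_bandit(true_probs, horizon, seed=0):
    """Simulate TS on Bernoulli arms; returns per-arm pull counts."""
    random.seed(seed)
    n = len(true_probs)
    succ, fail, pulls = [0] * n, [0] * n, [0] * n
    for _ in range(horizon):
        k = thompson_step(succ, fail)
        reward = 1 if random.random() < true_probs[k] else 0
        succ[k] += reward          # posterior update: alpha_k += r
        fail[k] += 1 - reward      # posterior update: beta_k += 1 - r
        pulls[k] += 1
    return pulls
```

Run long enough, the sampler concentrates its pulls on the arm with the highest success probability, which is exactly the behavior exploited later for channel selection.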
D. Contribution of This Paper
Motivated by the limitations of the state of the art, this paper proposes an endorser selection mechanism based on RL that is performed autonomously by a client. Specifically, it features the following contributions:

● It is the first work proposing integration of Hyperledger Fabric into V2X networking.

● It (i) adopts RL for accomplishing the aforementioned novel consensus protocol and (ii) models the optimization as a contextual MAB problem as the means to achieve RL. Prior work such as [13] provided only little technical detail on the RL scheme itself while focusing on the BFT protocol design; this paper takes a more balanced view of both the BFT and the RL.

● It provides a spatiotemporal analysis framework for performance evaluation of a blockchain system applied to a V2X network. The framework contributes on three fronts: (i) the dynamics of vehicles are modeled using stochastic processes; (ii) the time effects on the blockchain performance are evaluated; and (iii) the performance of RL is evaluated via Bayesian statistics.

II. SYSTEM MODEL
1) Blockchain:
This paper assumes a V2X network on which a permissioned blockchain is formed based on Hyperledger Fabric v2.0. Specifically, the RSUs act as peers that participate in endorsement and consensus (i.e., validation and commit) in the permissioned blockchain, while the OBUs (i.e., vehicles) serve as clients. Applying Fabric to this architecture, the RSUs have the authority to validate and order a block, which means that all the endorsing peers and orderers are selected from the RSUs, while OBUs are treated as clients of the execution and ordering services. With this architecture, we intend to make the blockchain system operate stably despite the vehicles' frequent entry into and departure from the blockchain network.

We note that this architecture makes practical sense because a blockchain system will likely be managed by a certain party, such as a state or federal organization or a private enterprise, through which vehicles pass; some of those vehicles may generate blocks that should be processed in the blockchain managed by such an organization.

Also, as a significant remark, we note that from v2.0, via an ordering service named "Raft," Fabric started to provide a BFT-based consensus for validation and commit of a block. We emphasize that this also suits the dynamicity of a V2X network, which is highly dynamic in its network topology and member composition at any given time, implying a far higher possibility of malice or fault. As such, the employment of Fabric is justified in both aspects of efficiency and security.

As shall be detailed in Section III, the key problem statement of this paper is based on the tradeoff of channel selection. By definition, channels partition a Fabric network in such a way that only the stakeholders can view the transactions. In this way, organizations are able to utilize the same network while maintaining separation between multiple blockchains. The mechanism works by delegating transactions to different ledgers. Members of a particular channel can communicate and transact privately, while other members of the network cannot see the transactions on that channel. The Raft consensus service allows an orderer to select a channel through which it will serve the ordering service. As such, this paper focuses on finding an optimal channel that minimizes the latency and maximizes the throughput.

Next, we assume that not all RSUs are connected to each other. An RSU usually has no wired connection, and consequently only a finite coverage [21]. Reflecting this practical consideration, we assume that only RSUs falling into each other's communication range are connected. Fabric does consider this type of situation, which leads to the employment of a Gossip protocol for disseminating information to reach a consensus during a block validation procedure.

Lastly, we consider a discrete-time setting: time is divided into periods t = 1, ⋯, T, where T ∈ N is a finite time horizon.

Fig. 1: Proposed endorser selection mechanism. (a) Proposed algorithm (steps ①-⑨, including the RL-based channel selection and RL update); (b) Spatiotemporal view (a client (OBU) interacting with orderer and peer RSUs, from entry to the network, through T_train and the 1st to N-th transactions, to departure).

It is an asynchronous network, wherein each of the clients (i.e., vehicles) and peers (i.e., RSUs) has its own clock to measure the discrete time t. As such, in evaluating this network's performance, we measure at each node (i.e., a vehicle or an RSU) the number of slots consumed to process a transaction and append a block to the chain. We note, however, that the same slot length t is assumed at all nodes.
2) Geometry:
Albeit not directly connected, as mentioned in the previous subsection, every node is equipped with communication functionality and hence is able to exchange a transaction or a block.

This paper adopts stochastic geometry for characterization of a V2X network on a space [22]-[31]. These works commonly rely on the fact that uniform distributions of nodes on the X and Y axes of a Cartesian two-dimensional space yield a Poisson point process (PPP) on the number of nodes in the space. The distributions of RSUs and OBUs are modeled as independent homogeneous PPPs Φ_r and Φ_o with densities λ_r and λ_o, respectively.

A two-dimensional space R is defined with length l and width w meters (m). To capture a more dynamic and realistic movement of nodes in a vehicular network, this system model considers no separation of lanes. Notice that such a generalized model makes the subsequent analyses more widely applicable [3]. Furthermore, to consider the most generic vehicle movement characteristics, this model assumes that every vehicle can move in any direction, which enables the system to capture every possible movement scenario, including flight of unmanned aerial vehicles (UAVs), lane changing, intersections, and pedestrian walking.

III. PROPOSED MECHANISM
As was introduced in Section I-D, this paper proposes to enable a vehicle to (i) autonomously learn about a channel that provides an optimal number of peers and (ii) hence minimize the latency and maximize the throughput.
A. Improvement to the Hyperledger Fabric Architecture
In Fabric, a client selects only a certain subset of the peers depending on the endorsement policy under which it operates [5]. Unless certain peers are designated in the policy, the client randomly selects the peers that will endorse its transaction. This is the part that this paper targets to improve: we propose a mechanism in which a client learns to improve its selection of a channel.

Fig. 1a demonstrates the proposed RL-based execute-order procedure in a Fabric network. The entire proposed mechanism proceeds as follows:
① Each vehicle completes a training phase at the beginning of joining a network;
② When an application invokes a transaction, the vehicle sends a proposal to the selected number of endorsers;
③ The endorsers send the result of execution back to the client;
④ The client sends the endorsed transaction to an orderer;
⑤ The orderer puts the transaction into a block along with other transactions and multicasts the block to a set of endorsers that are directly connected to it;
⑥ The endorsers use a gossip protocol to disseminate the block among themselves;
⑦ The endorsers compare their ledgers for finality and validate the new block;
⑧ Once the endorsers reach a consensus, they append the block to the chain.
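The steps above can be condensed into a toy model. This is purely illustrative; the class and method names are our own assumptions, not the Fabric SDK, and the fault-free behavior is hard-coded.

```python
class Peer:
    """Toy endorsing peer; in the system model, RSUs play this role."""
    def __init__(self):
        self.ledger = []

    def execute(self, tx):
        # Steps 2-3: simulate the transaction and return an endorsement.
        return ("endorsed", tx)

    def validate(self, block):
        # Step 7: check the block against the local ledger view
        # (always consistent in this idealized, fault-free sketch).
        return True

    def commit(self, block):
        # Step 8: append the validated block to the chain.
        self.ledger.append(block)

class Orderer:
    """Toy ordering service (one Raft orderer in the system model)."""
    def order(self, txs):
        # Step 5: batch endorsed transactions into a block.
        return tuple(txs)

def execute_order_flow(tx, endorsers, orderer, policy_min):
    """Steps 2-8 of the proposed procedure, collapsed into one call."""
    endorsements = [p.execute(tx) for p in endorsers]        # Steps 2-3
    if sum(e is not None for e in endorsements) < policy_min:
        return "ENDORSEMENT_FAILED"
    block = orderer.order([tx])                              # Steps 4-5
    if all(p.validate(block) for p in endorsers):            # Steps 6-7
        for p in endorsers:
            p.commit(block)                                  # Step 8
        return "COMMITTED"
    return "INVALIDATED"
```

For example, `execute_order_flow("tx1", [Peer() for _ in range(4)], Orderer(), policy_min=3)` returns `"COMMITTED"` in this idealized fault-free setting, with the block appended to every peer's ledger.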
B. RL for Channel Selection
Now, let us take a closer look at the RL components of the proposed framework.
1) Spatiotemporal View:
It is critical for a vehicle to collect the prior distribution for each channel. Fig. 1b illustrates a spatiotemporal view of a vehicle from its first entry into a Hyperledger Fabric network to its departure. Before entry, the vehicle sends a Join Request (REQ) to the closest RSU, from which it receives a Join Confirm (CFM) upon entry to the network. The Join CFM message contains the information necessary for the vehicle to train itself: i.e., the minimum required number of endorsers and the latest number of clients queued at each endorser.

An arbitrary vehicle i is designed to spend a certain length of time, T_train, observing the context c_i and updating the reward r_i. After T_train elapses, the vehicle exploits the learned rewards among the arms.
2) Problem Formulation:
We model the RL of the proposed framework as a MAB problem. Therein, in time slot t, vehicle i (i.e., a client of the blockchain) becomes an agent that observes the context and chooses an action based on the reward achieved from the action. The MAB problem is formulated as follows. The proposed RL problem can be characterized as a bandit problem for the design of reinforcement learning to operate the proposed channel selection mechanism, which is formally written as

c_i,t = max c(N_peer) | T_dwell    (1)
s.t. N_peer,min ≤ N_peer ≤ N_peer,max

Associated with (1), we further characterize the proposed framework as a contextual MAB problem. Since client i (the bandit) does not know a priori the optimal action to take for a context, it learns which action to select according to the context and hence becomes able to optimize the throughput. In order to learn the policy, the agent has to try out different arms (i.e., (i) the number and (ii) the IDs of voting peers) on different contexts over time, which forms a contextual MAB problem.

A bandit problem is defined as a single-state version of a Markov decision process (MDP). In our proposed system, the state of the agent (i.e., a vehicle) does not change after taking a certain action. For instance, suppose that a vehicle enters a network governed by the Fabric blockchain. Although mobility is a key factor distinguishing this work from other Hyperledger Fabric-based frameworks, as described in Fig. 1b, the mobility can be translated down to a single variable, T_dwell. By this, the proposed framework can be modeled with a state that does not change when a vehicle takes an action (i.e., pulls an arm).
This means that the only factor affecting the agent's action is the context, T_dwell, since it subsumes other influencing factors such as the vehicle's position relative to the network's coverage, the vehicle's speed, etc.

Therefore, we model our problem as a contextual MAB problem since, in this way, the vehicle does not simply learn which channel is optimal on average; instead, it exploits additional information about the channels under a given traffic situation.

As such, in the proposed MAB problem, a newly entering vehicle (i.e., a client from the blockchain's point of view) is regarded as a bandit, and each channel is modeled as an arm of the bandit.

The key challenge in a MAB problem lies in solving the exploration vs. exploitation dilemma, since all actions should be explored sufficiently often to learn their rewards, but those actions that have already yielded high rewards should also be exploited [32]. Since we model our problem as a MAB, each vehicle needs to identify the best channel by carefully selecting the one minimizing the latency and maximizing the throughput. The additional challenge in a contextual MAB problem is how to exploit historical reward observations under similar contexts. More technically, the problem of selecting an optimal channel comes from a tradeoff described in the following lemma:
Lemma 1
Regarding the constraint in (1), for a client, a tradeoff is formed in selecting a channel through which a transaction is executed and committed. In particular, the latency and throughput depend on the number of peers: too many peers cause a higher latency for endorsement and consensus, while too few peers make a consensus failure more likely when Byzantine faults occur.

3) Context:
Minimizing the need for modification to the current version of Fabric, we propose to design the context as quantities that can be defined within an endorsement policy.
Definition 1 (Context: Client's dwelling time in a Fabric network).
A client (i.e., a bandit in the MAB) takes an action based on its dwelling time in the Fabric blockchain network, denoted by T_dwell. The geographic information (e.g., the estimated radius of the network's boundary) is provided by the network via an endorsement policy in the Join CFM upon joining the network. Based on this information, each vehicle estimates its T_dwell and uses it as the context for the selection of a channel. It is formally written as T_dwell = r / v, where r is the radius of the Fabric network and v denotes the speed of the tagged vehicle.

4) Reward: We characterize this MAB as a "Beta-Bernoulli bandit," where the reward measured by vehicle i in time t, r_i,t, is modeled to be either 1 (i.e., a success) or 0 (i.e., a failure).

Definition 2 (Reward: Beta-Bernoulli bandit).
The reward for an action by client i in time t is defined as

r_i,t = r^ec_i,t ∩ r^ld_i,t    (2)

where r^ec_i,t and r^ld_i,t are indicator functions associated with the following sets: S_ec contains transactions making it through both execution and commit, and S_ld contains transactions with a latency shorter than the client's dwelling time within the Fabric network.

Also, this is a Bayesian bandit problem. Most learning strategies, such as ε-greedy and TS, require that some information on the prior distribution be known to each bandit [33]. However, a newly entering vehicle has no prior information about the channels available in the Fabric network. This emphasizes the significance of a training period for the vehicle in order to make a decision that is close to optimal.

The regret of learning is defined as the difference between the reward achieved by vehicle i in time slot t and the optimal, which is formulated as

ρ_i,t = | r_i,t − r*_i,t |    (3)

where r*_i,t denotes the reward that can be achieved by an optimal channel selection.
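Definitions 1 and 2 and the regret in (3) translate directly into code. The following is a minimal sketch; the function and argument names are our own.

```python
def dwell_time(radius_m, speed_mps):
    """Context of Definition 1: T_dwell = r / v."""
    if speed_mps <= 0:
        raise ValueError("speed must be positive")
    return radius_m / speed_mps

def reward(committed, latency, t_dwell):
    """Reward of Definition 2: the AND of the two indicators.

    committed       -> transaction survived execution and commit (S_ec);
    latency<=t_dwell -> commit happened within the dwelling time (S_ld).
    """
    r_ec = 1 if committed else 0
    r_ld = 1 if latency <= t_dwell else 0
    return r_ec & r_ld

def regret(r, r_opt):
    """Per-slot regret of Eq. (3): |r_{i,t} - r*_{i,t}|."""
    return abs(r - r_opt)
```

For example, a vehicle crossing a network of radius 600 m at 20 m/s has T_dwell = 30 s; a transaction that commits after 3 s then earns reward 1, while one committing after the vehicle has left earns 0.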
5) Algorithm:
Now, we propose an online learning algorithm implementing the proposed contextual MAB problem. As described in Algorithm 1, the proposed framework integrates an RL mechanism to decide the channel to which the client sends a transaction proposal. As shown in Line 4, a new vehicle can enter a Fabric network after receiving a Join CFM message from the network admin server.
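Before the formal listing, the training-then-selection loop can be sketched in Python against a simulated network, where each channel k commits a transaction within the dwelling time with an unknown probability p_k. The environment model and parameter names here are our own assumptions, not the paper's simulator.

```python
import random

def run_client(channel_success_probs, t_dwell, t_train, eps=0.1,
               strategy="ts", seed=0):
    """Sketch of the proposed loop: uniform exploration for t_train
    slots, then epsilon-greedy or Thompson sampling over the channels.
    Returns the (alpha, beta) Beta posterior counts per channel."""
    rng = random.Random(seed)
    n = len(channel_success_probs)
    alpha, beta = [1] * n, [1] * n              # Beta(1, 1) priors
    mean = lambda k: alpha[k] / (alpha[k] + beta[k])
    for t in range(t_dwell):
        if t < t_train:                          # training period
            k = rng.randrange(n)                 # explore uniformly
        elif strategy == "eps":                  # epsilon-greedy
            if rng.random() <= eps:
                k = rng.randrange(n)             # explore
            else:
                k = max(range(n), key=mean)      # exploit best mean
        else:                                    # Thompson sampling
            theta = [rng.betavariate(alpha[j], beta[j]) for j in range(n)]
            k = max(range(n), key=theta.__getitem__)
        # Steps 2-8 collapsed: did the transaction commit in time?
        r = 1 if rng.random() < channel_success_probs[k] else 0
        alpha[k] += r                            # posterior update
        beta[k] += 1 - r
    return alpha, beta
```

With one clearly superior channel, the posterior counts concentrate on it after the training period, which is the behavior Algorithm 1 formalizes.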
Algorithm 1: Proposed RL-based Hyperledger Fabric execute-order algorithm

Input: T_train
Initialize: c_i, r_i, a_i
for t = 1, ⋯, T_dwell do
    Send Join REQ and receive Join CFM;
    if t ≤ T_train then
        %-- Training --%
        c_i ← c_i,t; r_i ← r_i,t,k;
        r_i,k ∼ Beta(α_k, β_k) ← (α_k, β_k)_t for k ∈ {1, ⋯, N_arm};
    else
        %-- Step 1: Channel selection --%
        if ε-greedy then
            if rand ≤ ε then
                k̂_i,t = U(N_peer,min, N_peer,max); % Explore
            else
                k̂_i,t = argmax_k r_i,k |_{1, ⋯, t−1}; % Exploit
            end
        else
            %-- Thompson sampling --%
            θ̂_i,t ∼ Beta(α_k, β_k) for k = 1, ⋯, N_arm;
            k_i,t ← argmax_k θ̂_i,t;
        end
        %-- Steps 2 and 3 --%
        Send a transaction to the peers in channel k_i,t;
        Receive endorsement results from the peers;
        if Endorsement successful then
            Request order; % Step 4
            if Validation successful then
                r^ec_i,t ← 1;
            else
                r^ec_i,t ← 0;
            end
        else
            r^ec_i,t ← 0;
        end
        %-- Latency examination --%
        if Latency ≤ T_dwell then
            r^ld_i,t ← 1;
        else
            r^ld_i,t ← 0;
        end
        r_i,t ← r^ec_i,t ∩ r^ld_i,t; % Reward
        (α_k, β_k) ← (α_k, β_k)_t + r_i,t; % Step 9
    end
end

As described in Line 6, vehicle i starts a training period of T_train epochs upon new entry to a network. It observes the context c_i,t and collects the history of rewards r_i to update the prior distribution of the reward from each channel k. Specifically, after T_train has elapsed, for each arm k the vehicle has accumulated successes and failures onto the prior, which is characterized as r_i,k ∼ Beta(α_k, β_k).

Fig. 2: Average latency versus {number of peers, probability of failure at each peer}

From Line 11, the vehicle now utilizes the learned prior distribution to select a channel when it needs to execute a transaction. As shown in Lines 13-23, a vehicle can choose between two representative strategies. In ε-greedy, the vehicle still explores at the rate of ε and selects channel k̂_i,t at random.
At the other rate of 1 − ε, it selects the arm k having the greatest mean reward so far. TS, on the other hand, draws a sample for each of the N_arm arms and selects the channel k̂_i,t showing the largest sample.

Once the vehicle has selected a channel, it can send a proposal to the peers belonging to the channel, k_i,t, whenever a transaction is generated by an application. Based on Fabric, the peers execute the application and simulate the transaction, checking its validity as per the endorsement policy. If valid, each peer sends the vehicle an endorsement.

Upon collection of a sufficient number of endorsements, the client requests validation of the transaction from the orderer. It examines r^ec_i,t of (2). If the vehicle has been able to receive a reply from the orderer confirming commit of the transaction while it still dwells in the network, the vehicle sets r^ld_i,t = 1, and 0 otherwise.

Finally, the reward r_i,t is computed and used to update the prior for each channel k.

IV. NUMERICAL RESULTS
This section presents a detailed numerical evaluation of the proposed framework in terms of its (i) learning convergence and (ii) throughput. We established simulations of a Hyperledger Fabric network in MATLAB. Our test Fabric network consists of three organizations, each with 5-10 endorsing peers, for a total of 100 peer nodes. There are {10, 20, 30} channels established on subsets of peers from each of the organizations. There is one orderer node operating Raft.

A. Scenario
Fig. 2 shows the average latency versus (i) the number of peers in a channel and (ii) the probability of failure at each peer in a consensus procedure. It is obvious from the figure that a higher probability of failure at each peer incurs a higher latency. However, the pattern is less clear versus the number of peers. The reason is the tradeoff that was described in Lemma 1. As such, we evaluate the performance in the following manner. A vehicle passes through the network for T_dwell seconds. There are N_ch channels in the Fabric network, each of which is assumed to have (i) a number of peers and (ii) a p_f among those presented in Fig. 2. For instance, Channel 1, having 50 peers, is expected to incur 2.308 seconds of latency; Channel 2, despite a higher p_f, will incur only 1.733 seconds of latency because it has only 5 peers.

In the following results, via Figs. 3 through 5b, the proposed mechanism will be shown to find the channel giving the minimum latency (and thus the maximum throughput). Exploiting the fact that we model the MAB problem as a Bernoulli bandit, we evaluated two representative algorithms for finding an optimal arm in a MAB problem.

Fig. 3: Convergence of the proposed RL algorithm (with ε = 0.1; for each of subfigures (a) ε-greedy and (b) TS: upper, selected channel at each t; lower, probability of each channel selection over t)

Fig. 3 shows the convergence performance of the two techniques. While ε-greedy can focus on a proven arm at the rate of 90% (since ε = 0.1), it showed inefficiency by wasting time by still
selecting irrelevant arms. On the other hand, TS is shown to better focus on the three successful arms as the learning progresses. In fact, TS has been evidenced to outperform alternatives such as ε-greedy and the upper confidence bound (UCB) [34].

Fig. 4: Average regret vs. training time length (T_train/T ∈ {0.1, 0.2, 0.3, 0.4, 0.5}; number of channels ∈ {10, 20, 30}; (a) ε-greedy, (b) TS)
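The tradeoff behind Fig. 2 and Lemma 1 can be reproduced qualitatively with a toy latency model (our own illustrative model, not the paper's simulation): each consensus attempt costs time proportional to the number of peers, while the probability that consensus succeeds, i.e., that the realized number of faulty peers stays within the BFT bound, depends on both the channel size and the per-peer failure probability p_f.

```python
import math

def consensus_success_prob(n_peers, p_fail):
    """P(the realized fault count f satisfies n_peers >= 3f + 1)
    when each peer fails independently with probability p_fail."""
    max_faults = (n_peers - 1) // 3
    return sum(math.comb(n_peers, f)
               * p_fail ** f * (1 - p_fail) ** (n_peers - f)
               for f in range(max_faults + 1))

def expected_latency(n_peers, p_fail, per_peer_cost=0.04):
    """Illustrative expected latency: each attempt costs time
    proportional to the endorsement/gossip fan-out, and a failed
    consensus forces a retry (expected attempts = 1 / P(success))."""
    return per_peer_cost * n_peers / consensus_success_prob(n_peers, p_fail)
```

Under this model, a large channel pays a scalability penalty (higher fan-out cost per attempt), while a small channel pays a reliability penalty (lower consensus success probability), which is exactly the tension that the proposed channel selection resolves.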
B. Convergence and Time Complexity
Via Figs. 3 and 4, we demonstrate the convergence and time complexity evaluation of each component of the proposed framework.

To evaluate the time complexity, we computed the average regret versus various lengths of T_train. Each of TS and ε-greedy was run for repeated iterations to demonstrate average convergence performance. The results reveal the following four observations about the time complexity. First, TS shows a better concentration on the eligible channels, while ε-greedy converges to a suboptimum, which, however, is still a success. Second, within each of TS and ε-greedy, a larger N_arm was found to increase the regret, due to a larger search load, which can be translated as a higher time complexity. Third, commonly in the two techniques, a longer T_train yielded a lower regret. Lastly, within ε-greedy, a higher value of ε yields a lower regret since it gives a chance to learn from more explorations.

Fig. 5: Scalability (block size = 40, Nesr(min,max) = (50, 5); (a) average latency vs. number of clients, (b) average throughput (TPS) vs. number of clients; curves for the Optimal, A/B, and RL schemes with pFail ∈ {0.1, 0.4, 0.8})
C. Scalability
Figs. 5a and 5b show the scalability via the latency and throughput versus the number of clients, as a result of the proposed RL mechanism applied in the proposed Fabric system framework for V2X. Note the definitions: throughput is the rate at which transactions are committed to the ledger, and latency is the time taken from the application sending the transaction proposal to the transaction commit.

The key observation is that the proposed mechanism (dotted lines) achieves a performance far closer to the optimal than the current Fabric's channel selection mechanism. The rationale is that the proposed RL scheme enables a vehicle to select a channel that provides a close-to-optimal number of peers, addressing the tradeoff that was described in Section III-B2.

V. RELATED WORK
1) Scalability Trilemma:
Permissioned blockchains such as Hyperledger Fabric [5] and Zyzzyva [18] employ speculative consensus methods, which increase scalability under the assumption that security can be kept as long as 3f + 1 nodes participate in a consensus (where f is the number of faults). Their limitation: no consideration of further optimizing the selection of voting peers.
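The participation bound above can be made explicit with the classical BFT quorum arithmetic, n ≥ 3f + 1 — a textbook relation, shown here only for concreteness:

```python
def min_peers_for_bft(f):
    """Minimum consensus nodes needed to tolerate f Byzantine faults
    under the classical n >= 3f + 1 requirement."""
    return 3 * f + 1

def max_tolerable_faults(n):
    """Largest f with n >= 3f + 1, i.e., f = floor((n - 1) / 3)."""
    return (n - 1) // 3
```

For example, a channel of 10 peers tolerates at most 3 Byzantine peers, so shrinking a channel below 4 peers leaves no fault margin at all — the small-channel side of the tradeoff in Lemma 1.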
2) Blockchain-Empowered Networks:
An RL-based IoT framework was proposed in [13]. A deep reinforcement learning (DRL)-based performance optimization framework for blockchain-enabled IoV was proposed in [14], where transactional throughput is maximized while guaranteeing the decentralization, latency, and security of the underlying blockchain system. However, that framework makes little practical sense: regardless of whether the consensus method is Nakamoto- or voting-based, there is no single party that is able to select a certain consensus method. A consensus method involves all of the peers since it is about a quorum; it is impossible to switch how they reach a quorum in the middle of operation.

Another proposal focused on the endorsement procedure of Hyperledger Fabric [16], arguing that revealing the identity of an endorser to the peer nodes may not be suitable for transactions in which the endorsing peers have different preferences. However, it shows a limitation: in V2X, not every transaction will require anonymous endorsement. Thus, this proposal lacks generality.

In the current version of Fabric, a client could only guess in selecting endorsing peers for a transaction [5]. The implementation was not dynamically reactive to network changes (such as the addition of peers who have installed the relevant chaincode, or peers that are temporarily offline). Static configurations also do not allow applications to react to changes of the endorsement policy itself (as might happen when a new organization joins a channel). Furthermore, the client application had no way of knowing which peers have updated ledgers and which do not, so it might submit proposals to peers whose ledger data is not in sync with the rest of the network, resulting in the transaction being invalidated upon commit. That was a waste of both time and resources. V2X is too dynamic to rely on this probabilistic method; fast processing is needed while keeping liveness. As a remedy, Fabric recently added the service discovery [15].
The service discovery, however, comes at the cost of higher complexity due to the need to deliver additional information to each client. A scalability issue is anticipated with a very large number of clients, and there is a higher security threat of malicious clients masquerading as legitimate ones. Also, another proposal suggested anonymity of endorsing peers in order to prevent bias [16]. However, not every application is biased; thus, it may incur unnecessary inefficiency when an application does not need anonymity. Overall, since blockchain methods and applications are so diverse these days, no one-size-fits-all solution may exist.

VI. CONCLUSIONS
This paper proposed an RL-based channel selection framework for Hyperledger Fabric applied to V2X networks. We formulated the machine learning as a contextual MAB problem with the length of a vehicle's dwelling time in a Fabric network as the context. Specifically, we found that a tradeoff exists in the number of peers in a channel: a procedure of endorsement and consensus becomes (i) less scalable with too many peers and (ii) susceptible to faults with too few peers. Also, since the vehicle has no prior information on the peers' probability of fault upon joining a network, there is no way to anticipate the performance of each channel until it has learned about it. As an actual means to perform the learning, the proposed framework enabled a vehicle to adopt ε-greedy and/or Thompson sampling (TS). The results of our experiments showed that the proposed RL mechanism led to stable selection of channels fulfilling the success condition. More precisely, the proposed algorithm showed latency and throughput close to the optimal.

This work is expected to have significant impact on future applications across technologies gaining high research interest, namely Hyperledger Fabric and V2X. Despite its unique modularized, execute-order structure, the Hyperledger system still has many aspects unproven when applied to V2X. One possible extension of this work is to extend the proposed RL mechanism to incorporate other dynamic factors such as network condition and evaluate the resulting performance impacts.
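The learning loop summarized above can be sketched as follows. This is an illustrative simplification with our own names, not the paper's implementation: the dwell-time context is abstracted into a single binary reward ("the block committed before the vehicle left the network"), and each channel keeps a Beta posterior over that success probability.

```python
import random


class ChannelBandit:
    """Bandit over Fabric channels; reward is 1 when the transaction's
    block commits before the vehicle leaves the network."""

    def __init__(self, n_channels: int, epsilon: float = 0.1):
        self.eps = epsilon
        self.succ = [1] * n_channels  # Beta(1, 1) priors per channel
        self.fail = [1] * n_channels

    def select_eps_greedy(self) -> int:
        if random.random() < self.eps:  # explore a random channel
            return random.randrange(len(self.succ))
        means = [s / (s + f) for s, f in zip(self.succ, self.fail)]
        return max(range(len(means)), key=means.__getitem__)  # exploit

    def select_thompson(self) -> int:
        # Sample a success rate from each channel's Beta posterior
        draws = [random.betavariate(s, f)
                 for s, f in zip(self.succ, self.fail)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, ch: int, committed_in_time: bool) -> None:
        if committed_in_time:
            self.succ[ch] += 1
        else:
            self.fail[ch] += 1


# Toy run: channel 1 commits in time far more often than channel 0,
# so Thompson sampling concentrates its pulls on channel 1.
random.seed(0)
p_success = [0.2, 0.9]
bandit = ChannelBandit(2)
for _ in range(500):
    ch = bandit.select_thompson()
    bandit.update(ch, random.random() < p_success[ch])
print(bandit.succ[1] > bandit.succ[0])
```

The ε-greedy variant trades the posterior sampling for a fixed exploration rate; both converge to the channel whose peers commit within the vehicle's dwell time.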
ACKNOWLEDGEMENT
The work of Ahmed S. Ibrahim is supported in part by the National Science Foundation under Award No. CNS-1816112.

REFERENCES
[1] USDOT, "Vehicle-to-vehicle communication technology," V2V Fact Sheet.
[2] IEEE Commun. Lett., vol. 23, no. 2, Feb. 2019.
[3] S. Kim, "Impacts of mobility on performance of blockchain in VANET," IEEE Access, vol. 7, May 2019.
[4] M. Brandenburger, C. Cachin, R. Kapitza, and A. Sorniotti, "Blockchain and trusted computing: problems, pitfalls, and a solution for Hyperledger Fabric," arXiv:1805.08541, May 2018.
[5] E. Androulaki et al., "Hyperledger Fabric: a distributed operating system for permissioned blockchains," in Proc. ACM EuroSys 2018, Apr. 2018.
[6] P. Thakkar, S. Nathan, and B. Viswanathan, "Performance benchmarking and optimizing Hyperledger Fabric blockchain platform," in Proc. IEEE MASCOTS 2018.
[7] H. Sukhwani, N. Wang, K. S. Trivedi, and A. Rindos, "Performance modeling of Hyperledger Fabric (permissioned blockchain network)," in Proc. IEEE NCA 2018.
[8] J. Gottlieb, E. Marchiori, and C. Rossi, "Evolutionary algorithms for the satisfiability problem," Evol. Comput., vol. 10, no. 1, Mar. 2002.
[9] S. Nakamoto, "Bitcoin: a peer-to-peer electronic cash system," Oct. 2008. [Online]. Available: https://bitcoin.org/bitcoin.pdf
[10] W. Chu, L. Li, L. Reyzin, and R. Schapire, "Contextual bandits with linear payoff functions," in Proc. Int. Conf. Artificial Intell. Statist. 2011.
[11] A. Haj-Ali, N. K. Ahmed, T. Willke, J. E. Gonzalez, K. Asanovic, and I. Stoica, "A view on deep reinforcement learning in system optimization," arXiv:1908.01275v3, Sep. 2019.
[12] C. Riquelme, G. Tucker, and J. Snoek, "Deep Bayesian bandits showdown," arXiv:1802.09127v1, Feb. 2018.
[13] M. Liu, F. R. Yu, Y. Teng, V. C. M. Leung, and M. Song, "Performance optimization for blockchain-enabled industrial internet of things (IIoT) systems: a deep reinforcement learning approach," IEEE Trans. Ind. Informat., vol. 15, no. 6, Jun. 2019.
[14] C. Qiu, F. R. Yu, F. Xu, H. Yao, and C. Zhao, "Blockchain-based distributed software-defined vehicular networks via deep Q-learning," in Proc. ACM DIVANet 2018.
[15] Y. Manevich, A. Barger, and Y. Tock, "Service discovery for Hyperledger Fabric," IBM J. Res. Dev., vol. 63, Feb. 2019.
[16] S. Mazumdar and S. Ruj, "Design of anonymous endorsement system in Hyperledger Fabric," IEEE Trans. Emerg. Topics Comput., Jun. 2019.
[17] M. Castro and B. Liskov, "Practical Byzantine fault tolerance," in Proc. ACM OSDI 1999.
[18] R. Kotla, L. Alvisi, M. Dahlin, A. Clement, and E. Wong, "Zyzzyva: speculative Byzantine fault tolerance," ACM Trans. Comput. Syst., vol. 27, no. 4, Jan. 2010.
[19] M. Abd-El-Malek, G. R. Ganger, G. R. Goodson, M. K. Reiter, and J. J. Wylie, "Fault-scalable Byzantine fault-tolerant services," in Proc. ACM SOSP 2005.
[20] A. Singh, T. Das, P. Maniatis, P. Druschel, and T. Roscoe, "BFT protocols under fire," in Proc. ACM NSDI 2008.
PhD Dissertation, Virginia Tech, Jul. 2017.
[23] T. Dessalgn and S. Kim, "Danger aware vehicular networking," in Proc. IEEE SoutheastCon 2019.
[24] S. Kim and B. J. Kim, "Reinforcement learning for accident risk-adaptive V2X networking," arXiv:2004.02379, Apr. 2020.
[25] S. Kim and C. Dietrich, "A novel method for evaluation of coexistence between DSRC and Wi-Fi at 5.9 GHz," in Proc. IEEE Globecom 2018.
[26] S. Kim and T. Dessalgn, "Mitigation of civilian-to-military interference in DSRC for urban operations," in Proc. IEEE MILCOM 2019.
[27] S. Kim and B. J. Kim, "Prioritization of basic safety message in DSRC based on distance to danger," arXiv:2003.09724, Mar. 2020.
[28] J. W. Tantra, C. H. Foh, and A. Ben Mnaouer, "Throughput and delay analysis of the IEEE 802.11e EDCA saturation," in Proc. IEEE ICC 2005.
[29] S. Kim and M. Bennis, "Spatiotemporal analysis on broadcast performance of DSRC with external interference in 5.9 GHz band," arXiv:1912.02537, Dec. 2019. [Online]. Available: https://arxiv.org/pdf/1912.02537.pdf
[30] S. Kim and M. J. Suh, "Adequacy of 5.9 GHz band for safety-critical operations with DSRC," arXiv:2005.13528, May 2020.
[31] S. Kim and B. J. Kim, "Novel backoff mechanism for mitigation of congestion in DSRC broadcast," arXiv:2005.08921, May 2020.
[32] D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen, "A tutorial on Thompson sampling," Foundations and Trends in Machine Learning, vol. 11, no. 1, Jul. 2018.
[33] A. Slivkins, "Introduction to multi-armed bandits," arXiv:1904.07272v5, Sep. 2019.
[34] O. Chapelle and L. Li, "An empirical evaluation of Thompson sampling,"