Renaissance: A Self-Stabilizing Distributed SDN Control Plane (Technical Report)

Marco Canini, Iosif Salem, Liron Schiff, Elad Michael Schiller, Stefan Schmid

Université catholique de Louvain · Chalmers University of Technology · GuardiCore Labs · University of Vienna & Aalborg University
Abstract
By introducing programmability, automated verification, and innovative debugging tools, Software-Defined Networks (SDNs) are poised to meet the increasingly stringent dependability requirements of today's communication networks. However, the design of fault-tolerant SDNs remains an open challenge. This paper considers the design of dependable SDNs through the lenses of self-stabilization — a very strong notion of fault-tolerance. In particular, we develop algorithms for an in-band and distributed control plane for SDNs, called Renaissance, which tolerate a wide range of (concurrent) controller, link, and communication failures. Our self-stabilizing algorithms ensure that after the occurrence of an arbitrary combination of failures, (i) every non-faulty SDN controller can reach any switch (or another controller) in the network within a bounded communication delay (in the presence of a bounded number of concurrent failures) and (ii) every switch is managed by at least one controller (as long as at least one controller is not faulty). We evaluate Renaissance through a rigorous worst-case analysis as well as a prototype implementation (based on OVS and Floodlight), and we report on our experiments using Mininet.
Context and Motivation.
Software-Defined Network (SDN) technologies have emerged as a promising alternative to the vendor-specific, complex, and hence error-prone, operation of traditional communication networks. In particular, by outsourcing and consolidating the control over the data plane elements to a logically centralized software, SDNs support programmatic verification and enable new debugging tools. Furthermore, the decoupling of the control plane from the data plane allows the former to evolve independently of the constraints of the latter, enabling faster innovations.

However, while the literature articulates well the benefits of the separation between control and data plane and the need for distributing the control plane (e.g., for performance and fault-tolerance), the question of how connectivity between these two planes is maintained (i.e., the communication channels from controllers to switches and between controllers) has not received much attention. Providing such connectivity is critical for ensuring the availability and robustness of SDNs. Guaranteeing that each switch is managed, at any time, by at least one controller is challenging, especially if control is in-band, i.e., if control and data traffic is forwarded along the same links and devices and hence arrives at the same ports. In-band control is desirable as it avoids the need to build, operate, and ensure the reliability of a separate out-of-band management network. Moreover, in-band management can in principle improve the resiliency of a network, by leveraging a higher path diversity (beyond connectivity to the management port).

The goal of this paper is the design of a highly fault-tolerant distributed and in-band control plane for SDNs. In particular, we aim to develop a self-stabilizing software-defined network: an SDN that recovers from controller, switch, and link failures, as well as a wide range of communication failures (such as packet omissions, duplications, or reorderings). As such, our work is inspired by Radia Perlman's pioneering work [38]: Perlman's work envisioned a self-stabilizing Internet and enabled today's link state routing protocols to be robust, scalable, and easy to manage. Perlman also showed how to modify the ARPANET routing broadcast scheme so that it becomes self-stabilizing [39], and provided a self-stabilizing spanning tree algorithm for interconnecting bridges [40]. Yet, while the Internet core is "conceptually self-stabilizing", Perlman's vision remains an open challenge, especially when it comes to recent developments in computer networks, such as SDNs, for which we propose self-stabilizing algorithms.

Fault Model.
We consider (i) fail-stop failures of controllers, which failure detectors can observe, (ii) link failures, and (iii) communication failures, such as packet omission, duplication, and reordering. In particular, our fault model includes up to κ link failures, for some parameter κ ∈ Z+. In addition to the failures captured in our model, we also aim to recover from transient faults, i.e., any temporary violation of assumptions according to which the system and network were designed to behave, e.g., the corruption of the packet forwarding rules or changes to the availability of links, switches, and controllers. We assume that (an arbitrary combination of) these transient faults can corrupt the system state in unpredictable manners. In particular, when modeling the system, we assume that these violations bring the system to an arbitrary state (while keeping the program code intact). Starting from an arbitrary state, the correctness proof of self-stabilizing systems [20, 18] has to demonstrate the return to correct behavior within a bounded period, which brings the system to a legitimate state.

The Problem.
This paper answers the following question: How can all non-faulty controllers maintain bounded (in-band) communication delays to any switch as well as to any other controller? We interpret the requirements for provable (in-band) bounded communication delays to imply (i) the absence of out-of-band communications or any kind of external support, and yet (ii) the possibility of fail-stop failures of controllers and link failures, as well as (iii) the need for guaranteed bounded recovery time after the occurrence of arbitrary transient faults. These faults are transient violations of the assumptions according to which the system was designed to behave.
Our Contributions.
We present an important module for dependable networked systems: a self-stabilizing software-defined network. In particular, we provide a (distributed) self-stabilizing algorithm for distributed SDN control planes that, relying solely on in-band communications, recovers (from a wide spectrum of controller, link, and communication failures as well as transient faults) by re-establishing connectivity in a robust manner. Concretely, we present a system, henceforth called Renaissance, which, to the best of our knowledge, is the first to provide:

1. A robust, efficient, and distributed control plane: We maintain short, O(D)-length control plane paths in the presence of controller and link (at most κ many) failures. (The word renaissance means 'rebirth' in French and symbolizes the ability of the proposed system to recover after the occurrence of transient faults that corrupt the system state. D ≤ N is the (largest) network diameter, when considering any possible network topology changes over time, and N is the number of nodes in the network.) More specifically, suppose that throughout the recovery period the network topology was (κ+1)-edge-connected and included at least one (non-failed) controller. We prove that, starting from a legitimate state, i.e., after recovery, our self-stabilizing solution can:

   • Deal with fail-stop failures of controllers: These failures require the removal of stale information (related to unreachable controllers) from the switch configurations. Cleaning up stale information avoids inconsistencies and having to store large amounts of history data.

   • Deal with link failures: Starting from a legitimate system state, the controllers maintain an O(D)-length path to all nodes (including switches and other controllers), as long as at most κ links fail. That is, after the recovery period the communication delays are bounded.

2. Recovery from transient faults: We show that our control plane can even recover after the occurrence of transient faults. That is, starting from an arbitrary state, the system recovers within time O(D N) to a legitimate state. In a legitimate state, the number of packet forwarding rules per switch is at most |P_C| times the optimal, where |P_C| is the number of controllers. The proposed algorithm is memory adaptive [4], i.e., after the recovery from transient faults, each node's use of local memory depends on the actual number, n_C, of controllers in the system, rather than the upper bound, N_C, on the number of controllers in the system.

3. The proposed algorithm is memory adaptive. That is, after its recovery from transient faults, each node's use of local memory depends on the actual number of controllers in the system, n_C, rather than the upper bound on the number of controllers in the system, N_C. We present a non-memory-adaptive variation on the proposed algorithm that recovers within a period of Θ(D) after the occurrence of transient faults. This is indeed faster than the O(D N) recovery time of the proposed algorithm. However, the cost of memory use after stabilization can be N_C/n_C times higher than for the proposed algorithm. Moreover, the fact that the recovery time of the proposed memory-adaptive solution is longer is relevant only in the presence of rare faults that can corrupt the system state arbitrarily, because for the case of benign failures, we demonstrate recovery within Θ(D).

While we are not the first to consider the design of self-stabilizing systems which maintain redundant paths also beyond transient faults, the challenge and novelty of our approach comes from the specific restrictions imposed by SDNs (and in particular the switches). In this setting not all nodes can compute and communicate, and in particular, SDN switches can merely forward packets according to the rules that are decided by other nodes, the controllers. This not only changes the model, but also requires different proof techniques, e.g., regarding the number of resets and illegitimate rule deletions.

In order to validate and evaluate our model and algorithms, we implemented a prototype of Renaissance in Floodlight using Open vSwitch (OVS), complementing our worst-case analysis.
Figure 1:
The system architecture, which is based on self-stabilizing versions of existing network layers. The external building blocks for rule generation and local topology discovery appear in the dotted boxes. The proposed contributions of a self-stabilizing SDN controller and a self-stabilizing abstract switch appear in bold.

Our experiments in Mininet demonstrate the feasibility of our approach, indicating that in-band control can be bootstrapped and maintained efficiently and automatically, also in the presence of failures. To ensure reproducibility and to facilitate research on improved and alternative algorithms, we have released the source code and evaluation data to the community at [52]. We also discuss relevant extensions to the proposed solution (Section 8.2), such as combining both in-band and out-of-band communications, as well as coordinating the actions of the different controllers using a reconfigurable replicated state machine.
Organization.
We give an overview of our system and the components it interfaces with in Section 2 and introduce our formal model in Section 3. Our algorithm is presented in Section 4, analyzed in Section 5, and validated in Section 6. We then discuss related work (Section 7) before drawing the conclusions from our study (Section 8).
Our self-stabilizing SDN control plane can be seen as one critical piece of a larger architecture for providing fault-tolerant communications. Indeed, a self-stabilizing SDN control plane can be used together with existing self-stabilizing protocols on other layers of the OSI stack, e.g., self-stabilizing link layer and self-stabilizing transmission control protocols [25, 21], which provide logical FIFO communication channels. To put things into perspective, we provide a short overview of the overall network architecture we envision. Our proposal includes new self-stabilizing components that leverage existing self-stabilizing protocols towards an overall network architecture that is more robust than existing SDNs. We consider an architecture (Figure 1) that comprises mechanisms for local topology discovery and a logic for packet forwarding rule generation. We contribute to this architecture a self-stabilizing abstract switch as well as a self-stabilizing SDN control platform.

The network includes a set P_C = {p_1, ..., p_{n_C}} of n_C (remote) controllers, and a set P_S = {p_{n_C+1}, ..., p_{n_C+n_S}} of the n_S (packet forwarding) switches, where i is the unique identifier of node p_i ∈ P = P_C ∪ P_S. We denote by N_c(i) ⊆ P (communication neighborhood) the set of nodes p_j that are directly connected to node p_i ∈ P, i.e., p_j ∈ N_c(i). At any given time, and for any given node p_i ∈ P, the set N_o(i) (operational neighborhood) refers to p_i's directly connected nodes for which ports are currently available for packet forwarding. The local topology information in N_o(i) is liable to change rapidly and without notice. We denote the operational and connected communication topology as G_o = (P, E_o), and respectively, as G_c = (P, E_c), where E_x = {(p_i, p_j) ∈ P × P : p_j ∈ N_x(i)} for x ∈ {o, c}.

Each switch p_i ∈ P_S stores a set of rules that the controllers install in order to define which packets have to be forwarded to which ports. In the out-of-band control scenario, a controller communicates the forwarding rules via a dedicated management port to the control module of the switch. In contrast, in an in-band setting, the control traffic is interleaved with the data plane traffic, which is the traffic between hosts (as opposed to controller-to-controller and controller-to-switch traffic): switches can be connected to hosts through data ports and may have additional rules installed in order to correctly forward their traffic. We do not assume anything about the hosts' network service, except that their traffic may traverse any network link.

In an in-band setting, control and data plane traffic arrive through the same ports at the switch, which implies a need for being able to demultiplex control and data plane traffic: switches need to know whether to forward (data) traffic out of another port or (control) traffic to the control module. In other words, control plane packets need to be logically distinguished from data plane traffic by some tag (or another deterministic discriminator).

Figure 2 illustrates the switch model considered in this paper. Our self-stabilizing control plane considers a proposal for abstract switches that do not require the extensive functionality that existing SDN switches provide. An abstract switch can be managed either via the management port or in-band. It stores forwarding (match-action) rules.
These rules are used to forward data plane packets to ports leading to neighboring switches, or to forward control packets to the local control module (e.g., instructing the control module to change existing rules). Rules can also drop all the matched packets. The match part of a rule can either be an exact match or optionally include wildcards.

Maintaining the forwarding rules with in-band control is the key challenge addressed in this paper: for example, these rules must ensure (in a self-stabilizing manner) that control and data packets are demultiplexed correctly (e.g., using tagging). Moreover, it must be ensured that we do not end up with a set of misconfigured forwarding rules that drop all arriving (data plane and control plane) packets: in this case, a controller would never be able to manage the switch anymore in the future.

In the following, we will assume a local topology discovery mechanism that each node uses to report to the controllers the availability of their direct neighbors. Also, we assume access to self-stabilizing protocols for the link layer (and the transport layer) [25, 21] that provide reliable, bidirectional FIFO communication channels over unreliable media that are prone to packet omission, reordering, and duplication.

Each switch p_i ∈ P_S stores a set of forwarding rules which are installed by the controllers (servers) and define which packets have to be forwarded to which ports. In an out-of-band network, a controller communicates the forwarding rules via a dedicated management port to the control module of the switch. In contrast, in an in-band setting, the control traffic is interleaved with the data plane traffic and is communicated (possibly along multiple hops, in the case of a remote controller) to a regular switch port.
Figure 2:
Abstract SDN switch illustration.

This implies that in-band control requires the switch to demultiplex control and data plane traffic. In other words, the data plane of a switch is not only used to connect the switch ports internally, but also to connect to the control module.

In this paper, we make the natural assumption that switches have a bounded amount of memory. Moreover, we assume that rules come in the form of match-action pairs, where the match can optionally include wildcards and the action part mainly defines a forwarding operation (cf. Figure 2).

More formally, suppose that p_i ∈ P_S is a switch that receives a packet with p_src ∈ P_C and p_dest ∈ P as the packet source and destination, respectively. We refer to a rule (for packet forwarding at the switch) by a tuple ⟨k, i, src, dest, prt, j, metadata⟩. The fields of a rule refer to p_k as the controller that created this rule, prt ∈ {1, ..., n_prt} : n_prt ≥ κ+1 as a priority that p_k assigns to this rule, p_j ∈ N_c(i) as a port on which the packet can be sent whenever p_j ∈ N_o(i), and metadata as an (optional) opaque data value. Our self-stabilizing abstract switch considers only rules that are installed on the switches indefinitely, i.e., until a controller explicitly requests to delete them, rather than setting up rules with expiration timeouts.

We say that the rule r = ⟨k, i, src, dest, prt, j, metadata⟩ is applicable for a packet that reaches switch p_i and has source p_src and destination p_dest, when r is the rule with the highest prt (priority) that matches the packet's source and destination fields, and p_j ∈ N_o(i), i.e., the link (p_i, p_j) is operational. We say that the set of rules of switch p_i, rules(i), is unambiguous, if for every received packet there is at most one applicable rule. Thus, a packet can be forwarded if there exists only one applicable rule in the switch's memory. We assume an interface function myRules() which outputs the unambiguous rules that a controller p_k ∈ P_C needs to install at a switch p_j ∈ P_S, based on p_k's knowledge of the network's topology. We require rules to be unambiguous and to offer resilience against at most κ link failures (details appear in Section 2.2.2).

The main task of switches is to forward traffic according to the rules installed by the controllers. In addition, switches provide basic functionalities for interacting with the controllers. While OpenFlow, the de facto standard specification for the switch interface, as well as other suggestions (Forwarding Metamorphosis [12], P4 [11], and SNAP [5]) provide innovative abstractions with respect to data plane functionality and means to implement efficient network services, there is less work regarding the control plane abstraction, especially with respect to fault tolerance. We consider a slightly simpler switch model that does not include all the functionality one may find in an existing SDN switch. In particular, the proposed abstract SDN switch only supports the equal roles approach (where multiple "equal" controllers manage the switch); the master-slave setup usually used by switches [34] is not relevant toward the design of our self-stabilizing distributed SDN control plane. We elaborate more on the interface in the following.
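For concreteness, the following Python sketch shows one way to evaluate the applicability condition above. The Rule class, the exact-match-only matching, and the convention that a smaller prt value denotes a higher priority are illustrative assumptions, not part of the abstract switch specification.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set

@dataclass(frozen=True)
class Rule:
    controller: int   # k: controller that created this rule
    switch: int       # i: switch that holds the rule
    src: int          # matched packet source (a controller ID)
    dest: int         # matched packet destination
    prt: int          # priority in {1, ..., n_prt}; smaller value = higher priority here
    out_port: int     # j: neighbor the packet is forwarded to (if operational)
    metadata: object = None

def applicable_rule(rules: Iterable[Rule], src: int, dest: int,
                    operational_neighbors: Set[int]) -> Optional[Rule]:
    """Return the applicable rule: the highest-priority rule matching the packet's
    (src, dest) fields whose out-port currently leads to an operational neighbor.
    Returns None if no rule matches, or if the rule set is ambiguous for this packet."""
    matching = [r for r in rules
                if r.src == src and r.dest == dest
                and r.out_port in operational_neighbors]
    if not matching:
        return None
    best_prt = min(r.prt for r in matching)
    best = [r for r in matching if r.prt == best_prt]
    return best[0] if len(best) == 1 else None  # unambiguity: at most one applicable rule
```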
Configuration queries (via a direct neighbor)
As long as the system rules and operational links support (bidirectional) packet forwarding between controller p_i and switch p_j, the abstract switch allows p_i to access p_j's configuration remotely, i.e., via the interface functions manager(j) (query and update), rules(j) (query and update), as well as N_c(j) (query-only), where manager(j) ⊆ P_C is p_j's set of assigned managers and rules(j) is p_j's rule set. Also, a switch p_j, upon arrival of a query of a controller p_i, responds to p_i with the tuple ⟨j, N_c(j), manager(j), rules(j)⟩.

The abstract switch also allows controller p_i to query node p_j via p_j's direct neighbor, p_k, as long as p_i knows p_k's local topology. In case p_j is a switch, p_i can also modify p_j's configuration (via p_j's abstract switch) to include a flow to p_i (via p_k) and then to add itself as a manager of p_j. (The term flow refers here to rules installed on a path in the network in a way that allows packet exchange between the path's ends.) We refer to this as the query (and modify)-by-neighbor functionality.
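A minimal sketch of this configuration interface follows, assuming a toy in-memory switch; the class and function names (AbstractSwitch, adopt_switch) and the flat data representation are ours and only meant to illustrate the query and query-by-neighbor interactions.

```python
from dataclasses import dataclass, field

@dataclass
class AbstractSwitch:
    """Toy stand-in for the queryable configuration of an abstract switch p_j."""
    node_id: int
    neighbors: set                                  # N_c(j)
    managers: set = field(default_factory=set)      # manager(j) ⊆ P_C
    rules: list = field(default_factory=list)       # rules(j)

    def query(self):
        # Reply with the tuple <j, N_c(j), manager(j), rules(j)>.
        return (self.node_id, set(self.neighbors), set(self.managers), list(self.rules))

    def add_manager(self, controller_id: int):
        self.managers.add(controller_id)

def adopt_switch(controller_id: int, switch: AbstractSwitch, flow_rules: list):
    """Query-by-neighbor sketch: after learning about the switch via one of its
    neighbors, the controller installs a flow back to itself and registers as a
    manager, then reads back the switch's configuration."""
    switch.rules.extend(flow_rules)
    switch.add_manager(controller_id)
    return switch.query()
```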
The switch memory management

The number of rules and controllers (that manage switches) that each switch can store is bounded by maxRules and maxManagers, respectively. The abstract switch has a way to deal with clogged memory by storing the rules and managers in a FIFO manner (say, using local counters that serve as timestamps in the meta-information (metadata) part of each rule). Whenever a controller accesses a switch, that switch refreshes these timestamps, i.e., all switch configuration items related to this controller. When the switch memory has more than maxRules rules, the switch removes the rule that has the earliest timestamp so that a new rule can be added. This mechanism prioritizes newer rules (and manager information) that controllers install. Note that, as long as a switch has sufficient memory to store the rules of all controllers in P_C, the above mechanism does not need to remove any rule of controller p_i ∈ P_C after the first time that p_i has refreshed its rules on that switch. Similarly, we assume that whenever the number of managers that a switch stores exceeds maxManagers, the manager that was stored (or accessed) least recently is removed so that a new manager can be added.
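The following sketch illustrates the timestamp-based eviction described above, under the assumption that rules are plain dictionaries carrying a 'controller' field; the counter and helper names are hypothetical.

```python
import itertools

_clock = itertools.count()  # logical counter used as a timestamp in rule metadata

def refresh(switch_rules: list, controller_id: int) -> None:
    """When a controller accesses the switch, refresh the timestamps of its items."""
    for rule in switch_rules:
        if rule["controller"] == controller_id:
            rule["ts"] = next(_clock)

def install(switch_rules: list, new_rule: dict, max_rules: int) -> None:
    """Install a rule; if memory is clogged, evict the rule with the earliest timestamp."""
    new_rule["ts"] = next(_clock)
    switch_rules.append(new_rule)
    while len(switch_rules) > max_rules:
        oldest = min(switch_rules, key=lambda r: r["ts"])
        switch_rules.remove(oldest)
```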
2.2 Building blocks

Our architecture relies on a fault-tolerant mechanism for topology discovery. We use such a mechanism as an external building block. Moreover, we require a notion of resilient flows. We next discuss both of these aspects.

We assume a mechanism for local neighborhood discovery. We consider a system that uses an (ever running) failure detection mechanism, such as the self-stabilizing Θ failure detector [8, Section 6]: it discovers the switch neighborhood by identifying the failed/non-failed status of its attached links and neighbors. We assume that this mechanism reports, for node p_i ∈ P, the set of directly connected nodes p_j, i.e., p_j ∈ N_c(i).

We consider fault-resilient flows that are reminiscent of the flows in [33]. The definition of κ-fault-resilient flows considers the network topology G_c and assumes that G_c is not subject to changes. The idea is that the network can forward the data packets along the shortest routes, and use alternative routes in the presence of link failures, based on conditional forwarding rules [9]; these failover rules provide a backup for every edge and an enhancement of this redundancy for the case in which at most κ links fail, as we describe next.

Let (p_{r_1}, ..., p_{r_n}) ∈ P^n be a directed path in the communication network G_c, where n ∈ {1, ..., |P|}. Given an operational network G_o, we say that (p_{r_1}, ..., p_{r_n}) is a flow (over a simple path) in G_o when the rules stored in p_{r_1}, ..., p_{r_n} relay packets from source p_{r_1} to destination p_{r_n} using the switches in the sequence p_{r_2}, ..., p_{r_{n-1}} for packet forwarding (relay nodes). Let G_o(k) be an operational network that is obtained from G_c by an arbitrary removal of k links. We say there is a κ-fault-resilient flow from p_i to p_j in G_c when for any k ≤ κ there is a flow (over a simple path) from p_i to p_j in any G_o(k). We note that, when considering a communication graph G_c with a general topology, the construction of κ-fault-resilient flows is possible when κ < λ(G_c), where λ(G_c) is the edge-connectivity of G_c (i.e., the minimum number of edges whose removal can disconnect G_c).
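As a concrete (brute-force) reading of this definition, the sketch below checks whether a κ-fault-resilient flow from src to dst can exist in G_c by testing connectivity under every removal of at most κ links. It is meant only to illustrate the definition, not as an efficient construction of the failover rules; the function names are ours.

```python
from collections import deque
from itertools import combinations

def reachable(edges, src, dst):
    """BFS reachability over an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return True
        for w in adj.get(u, set()) - seen:
            seen.add(w)
            queue.append(w)
    return False

def kappa_resilient(edges, src, dst, kappa):
    """True iff src can still reach dst after removing any k <= kappa links,
    i.e., a kappa-fault-resilient flow from src to dst can exist in G_c."""
    edges = list(edges)
    for k in range(kappa + 1):
        for removed in combinations(edges, k):
            remaining = [e for e in edges if e not in removed]
            if not reachable(remaining, src, dst):
                return False
    return True
```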
This section presents a formal model of the studied system (Figure 1), which serves as the framework for our correctness analysis of the proposed self-stabilizing algorithms (Section 5).

We model the control plane as a message-passing system that has no notion of clocks (nor explicit timeout mechanisms); however, it has access to link failure detectors (in a way that is similar to the Paxos model [8, 32]). We borrow from [8, Section 6] a technique for local link monitoring (Section 2.2.1), which assumes that every abstract switch can complete at least one round-trip communication with any of its direct neighbors while it completes at most Θ round-trips with any other directly connected neighbor. In other words, in our analytical model, but not in our emulation-based evaluation, we assume that nodes have a mechanism to locally detect temporary link failures (e.g., a link may also be unavailable due to congestion); a link which is unavailable for a longer time period will be flagged as a permanent failure by a failure detector, which we borrow from [8, Section 6]. Apart from this monitoring of link status, we consider the control plane as an asynchronous system. Note that once the system installs a κ-fault-resilient flow between controller p_i ∈ P_C and node p_j ∈ P \ {p_i}, the network provides a communication channel between p_i and p_j that has a bounded delay (because we assume that there are never more than κ link failures). Moreover, these bounded delays are offered by the data plane while the control plane is still asynchronous as described above (since, for example, we assume no bound on the time it takes a controller to perform a local computation).

Self-stabilizing algorithms usually consist of a do forever loop that contains communication operations and validations that the system is in a consistent state as part of the transition decision. An iteration (of the do forever loop) is said to be complete if it starts in the loop's first line and ends at the last (regardless of whether it enters branches). As long as every non-failed node eventually completes its do forever loop, the proposed algorithm is oblivious to the rate at which this completion occurs. Moreover, exact time considerations can be added later for the sake of fine-tuning performance.

We are given reliable end-to-end FIFO channels over capacitated links, as implemented, e.g., by [25, 21], which guarantee reliable message transfer regardless of packet omission, duplication, and reordering. After the recovery period of the channel algorithm [25, 21], it holds that, at any time, there is exactly one token pkt ∈ {act, ack} in the channel: either pkt is in transit from the sender p_i ∈ P to the receiver p_j ∈ P, i.e., channel_{i,j} = {act} ∧ channel_{j,i} = ∅, or the token pkt is in transit from p_j to p_i, i.e., channel_{i,j} = ∅ ∧ channel_{j,i} = {ack}. During the recovery period (after the last occurrence of a transient fault), it can be the case that the sender sends a message m for which it receives a (false) acknowledgment ack without m having gone through a complete round-trip. However, that can occur at most ∆_comm times. Thereafter, whenever the sender sends a message m and receives its acknowledgment ack, the channel algorithm [25, 21] guarantees that m has completed a round-trip.

When node p_i sends a packet, pkt ∈ {act, ack}, to node p_j, the operation send inserts a copy of pkt into the FIFO queue that represents the above communication channel from p_i to p_j, while respecting the above token circulation constraint. When p_j receives pkt from p_i, node p_j delivers pkt from the channel's queue and transfers pkt's acknowledgment to the channel from p_j to p_i immediately after.
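The following toy model illustrates the token circulation assumed above: the sender may transmit only when it holds the ack token, so at most one of act/ack is ever in transit. The class and its initial token placement are illustrative assumptions, not the channel algorithm of [25, 21].

```python
class TokenChannel:
    """Toy model of a self-stabilizing channel between a sender p_i and receiver p_j:
    at any time exactly one token (act or ack) is in transit in one of the directions."""

    def __init__(self):
        self.to_receiver = []        # channel_{i,j}
        self.to_sender = ["ack"]     # channel_{j,i}; the sender holds the token initially

    def send(self, payload):
        # The sender may transmit only when the ack token has come back to it.
        if self.to_sender and self.to_sender.pop() == "ack":
            self.to_receiver.append(("act", payload))
            return True
        return False                 # token still in transit: retry (resend) later

    def deliver(self):
        # The receiver consumes the act token and immediately returns the ack token.
        if self.to_receiver:
            _, payload = self.to_receiver.pop()
            self.to_sender.append("ack")
            return payload
        return None
```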
For our analysis, we consider the standard interleaving model [20], in which there is a single (atomic) step at any given time. An input event can be either a packet reception or a periodic timer triggering p_i to resend while executing the do forever loop. In our setting, the timer rate is completely unknown, and the only assumption that we make is that every non-failing node executes its do forever loop infinitely often.

We model a node (switch or controller) using a state machine that executes its program by taking a sequence of (atomic) steps, where a step of a controller starts with local computations and ends with a single communication operation: either a send or a receive of a packet. A step of the (control module of an) abstract switch starts with a single message reception, continues with internal processing, and ends with a single message send.

The state of node p_i, denoted by s_i, consists of the values of all the variables of the node, including its communication channels. The term (system) state is used for a tuple of the form (s_1, s_2, ..., s_n, G_o), where each s_i is the state of node p_i (including messages in transit to p_i) and G_o is the operational network that is determined by the environment. We define an execution (or run) R = c_0, a_0, c_1, a_1, ... as an alternating sequence of system states c_x and steps a_x, such that each state c_{x+1}, except the initial system state c_0, is obtained from the preceding state c_x by applying step a_x.

For the sake of a simple presentation of the correctness proof, we assume that the abstract switch deals with one controller at a time, e.g., when requesting a configuration update or a query. Moreover, we assume that within a single atomic step, the abstract switch can receive the controller request, perform the update, and send a reply to the controller.

We consider a system in which maxRules is large enough to store all the rules that all controllers need to install at any given switch, and maxManagers ≥ N_C. We assume that |P_C| = n_C and |P_S| = n_S are known only by their upper bounds, i.e., N_C ≥ |P_C|, and respectively, N_S ≥ |P_S|. We use these bounds only for estimating the memory requirements per node, in terms of maxRules and maxManagers, i.e., the maximum number of rules, and respectively, managers at any switch.

Suppose that a κ-fault-resilient flow from p_i to p_j is installed in the network. The term primary path refers to the path along which the network forwards packets from p_i to p_j in the absence of failures. We assume that myRules() returns rules that encode κ-fault-resilient flows for a given network topology. The primary paths encoded by myRules() are also the shortest paths in G_c (with the highest rule priority). A rule in myRules() corresponding to k link failures (a k-fault-resilient flow) has the (k+1)-highest rule priority.

Due to the presence of faults in the system, we do not consider any bound on the communication delay, which could be, for example, the result of the absence of properly installed flows between the sender and the receiver. Nevertheless, when a flow is properly installed, the channel is not disconnected, and thus we assume that sending a packet infinitely often implies its reception infinitely often. We refer to the latter assumption as the communication fairness property. We make the same assumptions both for the link and transport layers.
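A sketch of the priority scheme that myRules() is assumed to follow: given precomputed next hops for 0, 1, ..., κ link failures (how these are computed is outside this sketch), the k-failure backup gets the (k+1)-highest priority. Field names mirror Algorithm 2's rule records, and treating a smaller prt value as a higher priority is our assumption.

```python
def my_rules(controller_id, switch_id, src, dest, next_hops_by_failures, tag):
    """Encode one (src, dest) flow at one switch as prioritized rules.
    next_hops_by_failures[k] is the neighbor to use when k links have failed;
    k = 0 is the primary (shortest-path) rule and gets priority 1 (the highest)."""
    return [{
        "cID": controller_id, "sID": switch_id,
        "src": src, "dest": dest,
        "prt": k + 1,                 # (k+1)-highest priority for the k-failure backup
        "fwd": next_hop, "tag": tag,
    } for k, next_hop in enumerate(next_hops_by_failures)]
```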
This work proposes a solution for bootstrapping in-band communication in SDNs. The correctness proof depends on the nodes' ability to exchange messages during this bootstrapping. The proof uses the notion of a message round-trip, which includes sending a message to a node and receiving a reply from that node. Note that this process spans many system states.

We give a detailed definition of round-trips as follows. Let p_i ∈ P_C be a controller and p_j ∈ P \ {p_i} be a network node. Suppose that immediately after state c, node p_i sends a message m to p_j, for which p_i awaits a response. At state c', which follows state c, node p_j receives message m and sends a response message r_m to p_i. Then, at state c'', which follows state c', node p_i receives p_j's response, r_m. In this case, we say that p_i has completed with p_j a round-trip of message m.

We define an iteration of a self-stabilizing algorithm in our model. Let P_i be the set of nodes with whom p_i completes a message round-trip infinitely often in execution R. Suppose that immediately after the state c_begin, controller p_i takes a step that includes the execution of the first line of the do forever loop, and immediately after system state c_end, it holds that: (i) p_i has completed the iteration it started immediately after c_begin (regardless of whether it enters branches), and (ii) every message m that p_i has sent to any node p_j ∈ P_i during that iteration has completed its round-trip. In this case, we say that p_i's iteration (with round-trips) starts at c_begin and ends at c_end.
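The notion of an iteration with round-trips can be made operational with a small bookkeeping structure like the one below; it is only an illustration, not part of the algorithm. The iteration has ended only when the loop body has completed and no sent message is still awaiting its reply.

```python
class IterationTracker:
    """Tracks when an iteration of the do-forever loop has also completed all of
    its round-trips, i.e., every message sent during the iteration was answered."""

    def __init__(self):
        self.pending = set()

    def start_iteration(self):
        self.pending.clear()

    def sent(self, msg_id):
        self.pending.add(msg_id)

    def reply_received(self, msg_id):
        self.pending.discard(msg_id)

    def iteration_with_round_trips_done(self, loop_body_finished: bool) -> bool:
        return loop_body_finished and not self.pending
```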
We characterize faults by their duration, that is, they are either transient or permanent. We consider the occurrence frequency of transient faults to be either rare or not rare. We illustrate our fault model in Figure 3.

Transient packet failures, such as omissions, duplications, and reordering, may occur often. Recall that we assume communication fairness and the use of a self-stabilizing link layer (and transport layer) [25, 21]. This protocol assures that the system's unreliable media, which are prone to packet omission, reordering, and duplication, can be used for providing reliable, bidirectional FIFO communication channels without omissions, duplications, or reordering. Note that the assumption that the communication is fair may still imply that there are periods in which a link is temporarily unavailable. We assume that at any time there are no more than κ such link failures.

We model rare faults to occur only before the system starts running. That is, during the system run, G_c does not change and it is (κ+1)-edge connected.

A permanent link failure or addition results in the removal, and respectively, the inclusion of that link from the network. The fail-stop failure of node p_j is a transient fault that results in the removal of (p_i, p_j) from the network and p_j from N_c(i), for every p_i ∈ N_c(j). Naturally, node addition is combined with a number of new link additions that include the new node.

Other than the above faults, we also consider any violation of the assumptions according to which the system is assumed to operate (as long as the code stays intact). We refer to these as (rare) transient faults. They can model, for example, the event in which more than κ links fail concurrently. A transient fault can also corrupt the state of the nodes or the messages in the communication channels.
Duration and frequency of faults:
- Transient, rare: any violation of the assumptions according to which the system is assumed to operate (as long as the code stays intact); this can result in any state corruption.
- Transient, not rare: packet failures (omissions, duplications, reordering, assuming communication fairness holds) and link failures (assuming at most κ link failures).
- Permanent: node and link failures.

[Chart labels: execution's starting state; recovery period; legal execution (LE); prior to the system start, consider all faults; consider only non-transient faults; consider only benign faults; all states are legitimate.]
Figure 3:
The table above details our fault model and the chart illustrates when each fault set is relevant. The chart's gray boxes represent the system execution, and the white boxes specify the failures considered to be possible at different execution parts and the recovery guarantees of the proposed self-stabilizing algorithm. The set of benign faults includes both transient link failures as well as permanent link and node failures.
We define the set of benign faults to include any fault that is not both rare and transient. The correctness proof of the proposed algorithm demonstrates the system's ability to recover after the occurrence of either benign or transient faults, which are not necessarily rare. Our experiments, however, consider all benign faults and no rare transient faults, due to the computational limitations that exist when considering all possible ways to corrupt the system state (Section 6.1).
We define the system's task by a set of executions called legal executions (LE) in which the task's requirements hold. That is, each controller p_i constructs a κ-fault-resilient flow to every node p_j ∈ P (either a switch or a controller). We say that a system state c is legitimate when every execution R that starts from c is in LE. A system is self-stabilizing [20] with relation to task LE when every (unbounded) system execution reaches a legitimate state with relation to LE (cf. Figure 3). The criteria of self-stabilization in the presence of faults [20, Section 6.4] require the system to recover within a bounded period after the occurrence of a single benign failure during legal executions (in addition to the design criteria of self-stabilization that require recovery within a bounded time after the occurrence of the last transient fault). We demonstrate self-stabilization in Section 5.4 and self-stabilization in the presence of faults in Section 5.5.

Self-stabilizing systems require the use of bounded memory, because real-world systems only have access to bounded memory. Moreover, the number of messages sent during an execution does not have an immediate relevance in the context of self-stabilization. The reason is that self-stabilizing algorithms can never terminate and stop sending messages, because if they did it would not be possible for the system to recover from transient faults (cf. [20, Chapter 2.3]). That is, suppose that the algorithm includes a predicate such that when the predicate is true the algorithm forever stops sending messages. Then, a single transient fault can cause this predicate to be true in the starting state of an execution, from which the system can never recover. The latter holds because the algorithm will never send any message, and yet in the starting system state any variable that is not considered by the predicate can be corrupted.

We say that a system execution is fair when every step that is applicable infinitely often is executed infinitely often and fair communication is kept (both at the link and the transport layer). Note that only failing nodes ever stop taking steps, and thus a violation of the fairness (communication or execution) assumptions implies the presence of transient faults, which we assume to happen only before the starting system state of any execution.
The first (asynchronous) frame in a fair execution R is the shortest prefix R' of R = R' ◦ R'', such that each controller starts and ends at least one complete iteration (with round-trips) during R' (see Section 3.3.2), where ◦ denotes an operation that concatenates two executions. The second frame in execution R is the first frame in execution R'', and so on.

The stabilization time (or recovery period from transient faults) of a self-stabilizing system is the number of asynchronous frames it takes a fair execution to reach a legitimate system state when starting from an arbitrary one. The recovery period from benign faults is also measured by the number of asynchronous frames it takes the system to return to a legal execution after the occurrence of a single benign failure.

We also consider the design criterion of memory adaptiveness by Anagnostou et al. [4]. This criterion requires that, after the recovery period, the use of memory by each node is a function of the actual network dimensions. In our system, a memory-adaptive algorithm has space requirements that depend on n_C, which is the actual number of controllers rather than their upper bound, N_C. Moreover, when considering non-adaptive solutions, one can achieve a shorter recovery period from transient faults (Section 8).

For the sake of a simple presentation, our theoretical analysis assumes that all local computations are done within a negligible time that is independent of, for example, the number of messages sent and received during each frame. We do, however, consider all network dimensions that are related to the recovery costs (including the number of messages sent and received during each frame) during the evaluation of the proposed prototype (Section 6).

Algorithm 1: Self-stabilizing SDN, high-level code description for controller p_i. Algorithm 2 is a detailed version of this algorithm.

Local state: replyDB ⊆ {m(j) : p_j ∈ P} has the most recently received query replies; currTag and prevTag are p_i's current and previous synchronization round, respectively;
Interface: myRules(G, j, tag): returns the rules of p_i on switch p_j given a topology G on round tag;

do forever begin
    Remove from replyDB any reply from unreachable (in terms of graph connectivity) senders or not from round prevTag or currTag. Also, remove from replyDB any response from p_i and then add a record that includes the directly connected neighbors, N_c(i);
    if replyDB includes a reply (with tag currTag) from every node that is reachable (in terms of graph connectivity) according to the accumulated local topology, G, in replyDB then
        Store currTag's value in prevTag and get a new and unique tag for currTag. By that, p_i starts a new synchronization round;
    foreach switch p_j ∈ P_S and p_j's most recently received reply do
        if this is the start of a new synchronization round then
            Remove from p_j's configuration any manager p_k or rule of p_k that was not discovered to be reachable during round prevTag;
        Add p_i to p_j's managers (if it is not already included) and replace p_i's rules in p_j with myRules(G, j, currTag);
    foreach p_j ∈ P that is reachable from p_i according to the most recently received replies in replyDB do
        send to p_j (with tag currTag) an update message (if p_j ∈ P_S is a switch) and query p_j's configuration;

upon query reply m from p_j begin
    if there is no space in replyDB for storing m then perform a C-reset by including in replyDB only the direct neighborhood, N_c(i);
    if m's tag equals currTag then include m in replyDB after removing the previous response from p_j;

upon arrival of a query (with a syncTag) from p_j begin
    send to p_j a response that includes the local topology, N_c(i), and syncTag;
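To complement the pseudocode, the following Python sketch mimics one iteration of Algorithm 1 for a single controller. The data layout (replyDB as a dictionary of reply records), the next_tag stand-in, and the outbox/message encoding are our own simplifications and not the paper's interface.

```python
from collections import deque
from itertools import count

_tag = count(1)
def next_tag():                       # stand-in for the self-stabilizing tag generator
    return next(_tag)

def reachable_nodes(replies, src):
    """Nodes reachable from src in the topology accumulated from query replies."""
    adj = {}
    for m in replies.values():
        adj.setdefault(m["ID"], set()).update(m["Nc"])
        for n in m["Nc"]:
            adj.setdefault(n, set()).add(m["ID"])
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, set()) - seen:
            seen.add(w)
            queue.append(w)
    return seen

class Controller:
    """Rough sketch of Algorithm 1; my_rules and the switch set are supplied from
    the outside, and outbox collects the messages that would be sent."""

    def __init__(self, node_id, neighbors, my_rules):
        self.i, self.neighbors, self.my_rules = node_id, set(neighbors), my_rules
        self.replyDB, self.currTag, self.prevTag = {}, next_tag(), None
        self.outbox = []

    def iterate(self, switch_ids):
        # Keep only replies from reachable senders of rounds prevTag/currTag; refresh own record.
        self.replyDB = {j: m for j, m in self.replyDB.items()
                        if j != self.i and m["tag"] in (self.prevTag, self.currTag)
                        and j in reachable_nodes(self.replyDB, self.i)}
        self.replyDB[self.i] = {"ID": self.i, "Nc": self.neighbors, "tag": self.currTag}

        reach = reachable_nodes(self.replyDB, self.i)
        new_round = all(j in self.replyDB and self.replyDB[j]["tag"] == self.currTag
                        for j in reach)
        if new_round:                      # every reachable node answered: start a new round
            self.prevTag, self.currTag = self.currTag, next_tag()

        for j in reach & set(switch_ids):  # refresh managers and rules on discovered switches
            if new_round:
                self.outbox.append((j, ("cleanup_stale", self.prevTag)))
            self.outbox.append((j, ("addMngr", self.i)))
            self.outbox.append((j, ("updateRules", self.my_rules(self.replyDB, j, self.currTag))))

        for j in reach - {self.i}:         # query every reachable node with the round's tag
            self.outbox.append((j, ("query", self.currTag)))

    def on_reply(self, m, max_replies=64):
        if len(self.replyDB) + 1 > max_replies:   # C-reset when memory would overflow
            self.replyDB = {self.i: {"ID": self.i, "Nc": self.neighbors, "tag": self.currTag}}
        if m["tag"] == self.currTag:
            self.replyDB[m["ID"]] = m
```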
We present a self-stabilizing SDN control plane, called Renaissance, that enables each controller to discover the network, remove any stale information in the configuration of the discovered unmanaged switches (e.g., rules of failed controllers), and construct a κ-fault-resilient flow to any other node (switch or controller) that it discovers in the network. For the sake of presentation clarity, we start with a high-level description of the proposed solution in Algorithm 1 before we present the solution details in Algorithm 2.

Algorithm 1 creates an iterative process of topology discovery that, first, lets each controller identify the set of nodes that it is directly connected to; from there, it finds the nodes that are directly connected to them; and so on. This network discovery process is combined with another process for bootstrapping communication between any controller and any node in the network, i.e., connecting each controller to its direct neighbors, and then to their direct neighbors, and so on, until it is connected to the entire reachable network.

Each controller independently associates each iteration with a unique tag [3, 43, 44] that synchronizes a round in which the controller performs configuration updates and queries. Controller p_i also maintains the variables currTag and prevTag (line 2) of the round synchronization procedure, which starts when p_i queries all reachable nodes and ends when it receives replies from all of these nodes (cf. lines 6–7, as well as Section 3). Upon receiving a query response, p_i runs lines 13–15 and replies to other controllers' queries in lines 16–17.

A controller p_i ∈ P_C keeps a local state of query replies (cf. Section 2.1) from other nodes (line 1). These replies allow p_i to accumulate information about the network topology, according to which the switch configurations are updated in each round. The following three basic functionalities of Algorithm 1 are provided by the do-forever loop in lines 4–12, which we detail below.

A controller p_i ∈ P_C can communicate with and manage a switch p_j ∈ P_S only after p_i has installed rules at all the switches on a path between p_i and p_j. This, of course, depends on whether there are no permanent link failures on the path. In order to discover these link failures, we use local mechanisms for failure detection at each node for querying about the status of every link (cf. Section 2.2.1). These mechanisms consider any permanent link failure as a transient fault, and we assume that Algorithm 1 starts running only after the last occurrence of any transient fault (cf. Figure 3). Thus, as soon as there is a flow installed between p_i and p_j and there are no permanent failures on the primary path (Section 3), p_i and p_j can exchange messages that arrive eventually, since their arrival only depends on the temporary availability of the links, which is covered by the communication fairness assumption (Section 3.3.1).

The above iterative process of network topology discovery and the process of rule installation consider κ-fault-resilient flows (cf. Section 2.2.2 and the myRules() function in Section 3). These flows are computed through the interface myRules(G, j, tag) (line 3), where G is the input topology, p_j is the switch to store these rules, and tag is the tag of the synchronization round. Once the entire network topology is discovered, Algorithm 1 guarantees the installation of a κ-fault-resilient flow between p_i and p_j.
Thus, once the system is in a legitimate state, the availability of κ-fault-resilient flows implies that the system is resilient to the occurrence of at most κ temporary link failures (and recoveries), and p_i can communicate with any node in the network within a bounded time.

Algorithm 1 lets the controllers connect to each other via κ-fault-resilient flows. Moreover, Algorithm 1 can detect situations in which controller p_k ∈ P_C is not reachable from controller p_i (line 5). The reason is that p_i is guaranteed to (i) discover the entire network eventually, and (ii) communicate with any node in the network. This means that p_i eventually gets a response from every node in the network. Once that happens, the set of nodes that respond to p_i equals the set of nodes that were discovered by p_i (line 6), and thus p_i can restart the process of discovering the network (line 7).

The start of a new round (in which p_i rediscovers the network) allows p_i to also remove information at the switches that is related to any unreachable controller p_k ∈ P_C, but only once p_i has succeeded in discovering the network and bootstrapping communication.

Command type    | Command                     | Switch p_j's control module action
new round       | ⟨'newRound', t_metaRule⟩    | updates the current synchronization tag of the switch
update command  | ⟨'delMngr', k⟩              | deletes p_k from manager(j)
                | ⟨'addMngr', k⟩              | adds p_k to manager(j)
                | ⟨'delAllRules', k⟩          | deletes all rules of p_k
                | ⟨'updateRules', newRules⟩   | replaces all rules of p_i with newRules
query command   | ⟨'query', t_query⟩          | sends query response m(j) to p_i

Figure 4: Abstract switch p_j's control module interface, for each controller p_i ∈ P_C.

We note that, during new rounds (line 9), p_i removes information related to p_k from any switch p_j (line 10), whether this information is a rule or p_k's membership in p_j's management set. This stale-information clean-up eventually brings the system to a legitimate state, as we will prove in Section 5.

Recall that we regard the long-term failure of links (or of more than κ links) as transient faults. After the occurrence of the last transient fault, the network returns to fulfill our assumptions about the topology G_c, i.e., G_c is (κ+1)-edge connected. Then, Algorithm 1 brings the system back to a legitimate state (Section 5). The do-forever loop of Algorithm 1 completes by sending rule and manager updates to every switch that has a reply in replyDB, as well as querying every reachable node, with the current synchronization round's tag (line 12).

After the provision of a high-level description of the proposed solution in Algorithm 1, we provide the solution details in Algorithm 2, which requires more notation, interfaces, and building blocks.
Local Variables
Each controller's state includes replyDB (line 3), which is the set of the most recent query replies, and the tags currTag and prevTag, which are p_i's current, and respectively, previous synchronization round tags. Each response m(j) ∈ replyDB can arrive from either a switch or another controller and has the form ⟨j, N_c(j), manager(j), rules(j)⟩, for p_j ∈ P. The code denotes by N_c(j) the neighborhood of p_j, by manager(j) ⊆ P_C the controllers of p_j, and by rules(j) ⊆ {⟨k, j, src, dest, prt, z, tag⟩ : (p_k, p_j, p_z, p_dest ∈ P) ∧ (p_src ∈ P_C) ∧ prt ∈ {1, ..., n_prt} ∧ tag ∈ tagDomain} the rule set of p_j. Throughout Algorithm 2, and for ease of presentation, we refer to the elements of responses and rules using the struct notation, which is used by the C programming language. We refer to the fields of m = ⟨ID, N_c, Mng, rules⟩ stated above by m.ID = j, m.N_c = N_c(j), m.Mng = manager(j), and m.rules = rules(j). We assume that the size of replyDB is bounded by maxReplies ≥ 2(N_C + N_S); hence the local state has bounded size (the factor of 2 is due to responses from the rounds prevTag and currTag).
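A possible concrete representation of these records, shown only to fix notation: the field names follow Algorithm 2's reply and rule tuples, while the use of Python dataclasses and integer tags is our assumption.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

@dataclass(frozen=True)
class Rule:
    cID: int      # controller that installed the rule
    sID: int      # switch holding the rule
    src: int
    dest: int
    prt: int      # priority in {1, ..., n_prt}
    fwd: int      # neighbor to forward matching packets to
    tag: int      # synchronization-round tag (tagDomain abstracted as int here)

@dataclass(frozen=True)
class Reply:
    ID: int                      # respondent
    Nc: FrozenSet[int]           # its communication neighborhood
    Mng: FrozenSet[int]          # its set of managers (subset of P_C)
    rules: Tuple[Rule, ...]      # its installed rules

# A controller would store such records in replyDB and bound its size by maxReplies,
# leaving room for replies tagged with both prevTag and currTag.
replyDB: set = set()
```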
An internal building block: round synchronization

An SDN controller accesses the abstract switch in synchronized rounds. Each round has a unique tag that distinguishes the given round from its predecessors. We assume access to a self-stabilizing algorithm that generates unique tags of bounded size from a finite domain of tags, tagDomain. The algorithm provides a function called nextTag() that, during a legal execution, returns a unique tag. That is, immediately before calling nextTag() there is no tag anywhere in the system that has the returned value from that call. Given two tags, t_1 and t_2, we require that t_1 = t_2 holds if, and only if, they have identical values.

At the start of every synchronization round, p_i ∈ P_C generates a new tag and stores that tag in the variable currTag ← nextTag(). Controller p_i then attempts to install at every reachable switch p_j ∈ P_S a special meta-rule ⟨i, j, ⊥, ⊥, n_prt, ⊥, t_metaRule⟩, which includes, in addition to p_i's identity, the tag t_metaRule = currTag and has the lowest priority (before making any configuration update on that switch). It then sends a query to all (possibly) reachable nodes in the network and combines that query with the tag t_query = currTag. The response to that query from other controllers p_j ∈ P_C includes the query tag, t_query. The response to the query from a switch p_k ∈ P_S includes the tag t_metaRule of the most recently installed meta-rule that p_k has in its configuration. The controller p_i ends its current round once it has received a response from every (possibly) reachable node in the network and that response has the tag of currTag.

We note the existence of self-stabilizing algorithms, such as the one by Alon et al. [3], that in fair executions (that are legal with respect to the self-stabilizing end-to-end communication protocol) provide unique tags within a number of synchronization rounds that is bounded (by a constant whenever the execution is legal with respect to the self-stabilizing end-to-end communication protocol). We refer to that known bound by ∆_synch and note that during a legal execution of the round synchronization algorithm, controller p_i receives only a response message m that matches currTag, i.e., it discards any message with a different tag. Moreover, since during legal executions nextTag() returns only unique tags, m and its acknowledgment are guaranteed to form a complete round-trip. Note that we do not require nextTag() to support concurrent calls, since every controller manages its own synchronization rounds, one round at a time. We note the existence of other relevant synchronizers, such as the α-synchronizer by Awerbuch et al. [6, 20], which have simpler tags than [3]. However, we prefer the elegant interface defined in [3].

Algorithm 2: Self-stabilizing SDN, code for controller p_i (new notation draft).

Symbols and operators: '•' stands for 'any sequence of values', () is the empty sequence, ◦ (binary) is the sequence concatenation operator, and ⊙ (unary) concatenates a set's items in an arbitrary order.
Constants: N_c(i) ⊆ P, p_i's directly connected nodes; maxRules and maxManagers, maximum number of rules and managers, respectively; maxReplies: maximum size of the set replyDB;
Local state: A controller's local state is the set replyDB, which stores the most recently received query replies. A query reply m = ⟨ID, N_c, Mng, rules⟩ includes the respondent's ID, m.ID ∈ P, its communication neighborhood, m.N_c ⊆ P, its set of managers, m.Mng ⊆ P_C, and its set of installed rules, m.rules. A rule r = ⟨cID, sID, src, dest, prt, fwd, tag⟩ ∈ m.rules includes the switch's ID, r.sID, the ID of the controller which installed the rule, r.cID, the source and destination fields, r.src, and respectively, r.dest, the rule's priority, r.prt, the ID of the neighbor to which the packet should be forwarded, r.fwd, and the rule's tag, r.tag, where r.sID, r.fwd, r.dest ∈ P, r.cID, r.src ∈ P_C, r.prt ∈ {1, ..., n_prt}, and r.tag ∈ tagDomain. A command record x includes the switch's ID, x.sID, and the command, x.cmd; currTag and prevTag are p_i's current, and respectively, previous synchronization round tags;
Interfaces: Section 2.2.2, Figure 4, as well as the following: myRules(G, j, tag): creates p_i's rules at switch p_j according to G with tag tag (cf. Section 2.2.2);
Macros:
    res(x) := {m ∈ replyDB : ∀r ∈ m.rules r.tag = x} ∪ {⟨i, N_c(i), ∅, ∅⟩};
    G(S) := ({p_k : ∃m ∈ S : (m.ID = k ∨ p_k ∈ m.N_c)}, {(j, k) : ∃m ∈ S : (m.ID = j ∧ p_k ∈ m.N_c)});
    fusion := res(currTag) ∪ {m ∈ res(prevTag) : ∄m' ∈ res(currTag) m'.ID = m.ID};
    p_j →_G p_k := true if there is a path from p_j to p_k in G;

do forever begin
    /* Remove replies from unreachable senders or not from round prevTag or currTag. */
    replyDB ← {m ∈ replyDB : m.ID = k ≠ i ∧ (∃x ∈ {currTag, prevTag} m ∈ res(x) ∧ p_i →_{G(res(x))} p_k)} ∪ {⟨i, N_c(i), ∅, ∅⟩};
    let (newRound, msg) := (false, ∅);  /* newRound and msg get their default values */
    /* a new round with a new tag; remove replies with tag currTag */
    if ∀p_ℓ ∈ G(res(currTag)) (p_i →_{G(res(currTag))} p_ℓ ⟹ ∃m ∈ res(currTag) m.ID = ℓ) then
        (newRound, prevTag) ← (true, currTag); currTag ← nextTag(); replyDB ← replyDB \ res(currTag);
    /* The reference tag, referTag, is currTag when a topology change is discovered */
    if G(fusion) = G(res(prevTag)) then let referTag := prevTag else let referTag := currTag;
    foreach p_j ∈ P_S : ∃m ∈ res(referTag) m.ID = j do  /* manage switch p_j's managers and rules */
        /* p_i is switch p_j's manager; remove unreachable managers on new rounds and nodes with no rules */
        let M := {p_k ∈ m.Mng : (∃r ∈ m.rules r.cID = k) ∧ (¬newRound ∨ p_i →_{G(res(prevTag))} p_k)} ∪ {p_i};
        msg ← msg ∪ {(p_j, ⟨'delMngr', k⟩) : p_k ∈ (m.Mng \ M)} ∪ {(p_j, ⟨'addMngr', i⟩)};
        /* Remove any rule of p_j that is associated with an unreachable node, p_k */
        msg ← msg ∪ {(p_j, ⟨'delAllRules', k⟩) : (∃r ∈ m.rules r.cID = k) ∧ p_k ∉ M};
        /* p_i refreshes all of its rules at switch p_j according to referTag */
        msg ← msg ∪ {(p_j, ⟨'updateRules', myRules(G(res(referTag)), j, currTag)⟩)};
    /* Send the prepared messages to all reachable nodes in an aggregated form */
    foreach p_j : p_i →_{G(fusion)} p_j do
        send (⟨'newRound', currTag⟩) ◦ ⊙{x.cmd : x ∈ msg ∧ x.sID = j} ◦ (⟨'query', currTag⟩) to p_j;

upon query reply m from p_j begin
    /* p_i tests that there is room to store m and that m's tag matches currTag */
    if |replyDB ∪ {m}| > maxReplies then replyDB ← {⟨i, N_c(i), ∅, ∅⟩};  /* C-reset */
    if (∃r ∈ m.rules r.tag = currTag) then replyDB ← (replyDB \ {m' ∈ replyDB : m'.ID = m.ID}) ∪ {m};

upon arrival of (• ◦ (⟨'query', tag⟩)) from p_j do
    send ⟨i, N_c(i), ⊥, {⟨j, i, ⊥, ⊥, ⊥, ⊥, tag⟩}⟩ to p_j;
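The macros of Algorithm 2 can be read as the following Python helpers, where a reply is represented as a tuple (ID, Nc, Mng, rules) and a rule as a tuple whose last component is its tag; this representation and the helper names are ours, intended only to clarify how res(x), G(S), fusion, and the reachability relation →_G are evaluated.

```python
from collections import deque

def res(replyDB, x, i, Nc_i):
    """Macro res(x): replies whose rules all carry tag x, plus p_i's own record."""
    own = (i, frozenset(Nc_i), frozenset(), ())
    return {m for m in replyDB if all(r[-1] == x for r in m[3])} | {own}

def graph_of(S):
    """Macro G(S): the topology induced by a set of replies (ID, Nc, Mng, rules)."""
    nodes = {m[0] for m in S} | {n for m in S for n in m[1]}
    edges = {(m[0], n) for m in S for n in m[1]}
    return nodes, edges

def fusion(replyDB, currTag, prevTag, i, Nc_i):
    """Macro fusion: currTag replies, completed with prevTag replies from nodes
    that have not yet answered in the current round."""
    cur = res(replyDB, currTag, i, Nc_i)
    cur_ids = {m[0] for m in cur}
    return cur | {m for m in res(replyDB, prevTag, i, Nc_i) if m[0] not in cur_ids}

def reaches(G, src, dst):
    """Relation p_src ->_G p_dst: is there a (directed) path from src to dst in G?"""
    nodes, edges = G
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return True
        for w in adj.get(u, set()) - seen:
            seen.add(w)
            queue.append(w)
    return False
```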
Interfaces

Controller p_i can send requests or queries to any other node p_j (which could be either another controller or a switch). We detail the switch interface below and illustrate it in Figure 4. The controllers send command batches, which are sequences of commands. The special meta-data command ⟨'newRound', t_metaRule⟩ is always the first command and updates the special meta-rule to store t_metaRule. We use it for starting a new round (where t_metaRule = t is the round's tag). This starting command can be followed by a number of commands, such as ⟨'delMngr', k⟩ for the removal of controller p_k from the management of switch p_j, ⟨'addMngr', k⟩ for the addition of controller p_k to the management of switch p_j, and ⟨'delAllRules', k⟩ for the deletion of all of p_k's rules from the configuration of switch p_j, where p_k ∈ P_C \ {p_i}. The rules' update is done via ⟨'updateRules', newRules⟩, which is the second-to-last command. This update replaces all of p_i's rules at switch p_j (except for the special meta-rule) with the rules in newRules. These commands are followed by the round's query ⟨'query', t_query⟩, where t_query = t is the query's tag. The switch p_j replies to a query by sending m = ⟨j, N_c(j), manager(j), rules(j)⟩ to p_i, such that the rule set also includes the special meta-rule ⟨i, •, t⟩ ∈ rules(j). Whenever p_j ∈ P_C is another controller, the response to a query is simply ⟨i, N_c(i), ⊥, {⟨j, i, ⊥, ⊥, ⊥, ⊥, t_query⟩}⟩ (line 28). Note that controller p_j simply ignores all other types of commands. We use the interface function myRules(G, j, tag) (Section 2.2.2) for creating the packet forwarding rules that controller p_i installs at switch p_j when p_i's current view of the network topology is G in round tag (line 6).
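As an illustration only of the command ordering just described (the Python function, tuple encodings, and argument names below are ours, not the controllers' actual API), a batch for one switch could be assembled as follows.

    def build_batch(round_tag, my_id, stale_managers, stale_rule_owners, new_rules):
        # Order follows the switch interface: 'newRound' first, then the optional
        # manager/rule deletions and the manager addition, then 'updateRules' as the
        # second-to-last command, and the round's 'query' last.
        batch = [('newRound', round_tag)]
        batch += [('delMngr', k) for k in stale_managers]
        batch += [('addMngr', my_id)]
        batch += [('delAllRules', k) for k in stale_rule_owners]
        batch += [('updateRules', new_rules)]
        batch += [('query', round_tag)]
        return batch

    # Hypothetical example: controller 1 refreshes switch 7 for round tag 42,
    # dropping a stale controller 4; the rule tuple mirrors <cID, sID, src, dest, prt, fwd, tag>.
    example = build_batch(round_tag=42, my_id=1, stale_managers=[4],
                          stale_rule_owners=[4], new_rules=[(1, 7, 1, 9, 0, 3, 42)])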
Algorithm details

Algorithm 2 presents the proposed solution in greater detail than Algorithm 1. Algorithm 2 is centered around a do-forever loop, which starts by removing stale information from replyDB (line 12). This removal action includes refreshing the information related to controller p_i, which deletes information about any node that is not reachable from p_i. The reachability test uses the currently known information about the network topology, G, and the relation →_G (line 10), which tells whether node p_j is reachable from controller p_i in G, given the information in replyDB.

Algorithm 2 accesses the switch configurations in synchronization rounds. Lines 13–16 manage the start (and end) of synchronization rounds. When a new round starts, i.e., the condition of the if-statement of line 14 holds, controller p_i marks the start of a new round (newRound_i = true), updates the values of the tags prevTag_i and currTag_i, and clears from replyDB_i any record with tag currTag (lines 15 and 16).

Algorithm 2 refreshes (and reconstructs) the information about remote nodes (controllers and switches, including the ones that are directly attached to it) by sending queries (line 24) and updating the set of stored replies (line 27). Notice that controller p_i also responds to query requests coming from other controllers (line 28). Algorithm 2 uses these replies for completing the information about the switches that are directly connected to a remote controller (and thus the other fields in the response messages are the empty sets).

The heart of Algorithm 2 is the update of every switch p_j ∈ P_S (lines 18 to 21). For every switch p_j (line 18), controller p_i considers p_j's stored response ⟨j, Ngb_i, Mng_i, Rul_i⟩, for which it prepares a set of commands to be stored in the set msg_i (lines 13, 20, 21, 22 and 24). To that end, p_i first calculates the set of managers that p_j should have in the following manner. If this iteration of the do-forever loop (lines 11 to 24) is the first one for the round currTag_i, the value of newRound_i is true (line 15); this leads p_i to remove any controller p_k that is not reachable according to G(res(prevTag)) (lines 19 to 21). Whenever the iteration is not the first one, p_i merely asserts that it is a manager of p_j.

Controller p_i removes any rules of an unreachable controller p_k (line 21) and updates all of its rules at switch p_j (line 22) using the interface function myRules() (line 22) and the reference tag, referTag (line 8 and line 17). The proposed algorithm selects referTag's value to be prevTag during legal executions. During recovery periods, the discovered topology can differ from the one that is stored with the tag prevTag; in that case, the algorithm selects currTag as the reference tag. After preparing these commands for all the switches, controller p_i prepares query commands for all reachable nodes (including both controllers and switches) and then sends all prepared commands to their designated destinations. Note that each of these configuration updates is done via a single message that aggregates all commands for a given destination (line 24).

We note that when a query response arrives at p_i, before the update of the response set (line 27), p_i checks that there is sufficient storage space for the arriving response (line 26). If space is lacking, p_i performs what we call a 'C-reset'. Note that p_i stores replies only for the current synchronization round, currTag.
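For concreteness, here is a loose Python rendering of the reply handler just described (lines 25–27); the function and argument names, and the tuple layout of a reply, are our own illustrative assumptions.

    def on_query_reply(reply_db, reply, curr_tag, max_replies, self_entry):
        node_id, _, _, rules = reply
        # C-reset: if storing the reply would exceed the memory bound, keep only the
        # controller's own entry about its direct neighborhood (cf. line 26).
        if node_id not in reply_db and len(reply_db) + 1 > max_replies:
            reply_db = {self_entry[0]: self_entry}
        # Store the reply only if its rules carry the current round tag (cf. line 27);
        # a newer reply from the same node replaces the older one.
        if any(rule[-1] == curr_tag for rule in rules):
            reply_db[node_id] = reply
        return reply_db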
Correctness Proof

We prove the correctness of Algorithm 2 by showing that when the system starts in an arbitrary state, it reaches a legitimate state (Definition 1) within a bounded period of (((∆_comm + ∆_synch) + 2)D + 1)·[((∆_comm + ∆_synch)D + 1)·N_S + N_C + 1] frames (Theorem 2). Moreover, we show that when starting from a legitimate state, the system satisfies the task requirements and is also resilient to a bounded number of failures (lemmas 7 and 8).

We refer to the value of variable X at node p_i (controller or switch) as X_i, i.e., the variable name with a subscript that indicates the node index. Similarly, we refer to the return value of function f at controller p_k as f_k.

Definition 1 (Legitimate System State). A state c ∈ R is legitimate with respect to Algorithm 2 when, for every controller p_i ∈ P_C and node p_k ∈ P \ {p_i}, the following conditions hold.
1. ⟨k, N_c(k), manager(k), rules(k)⟩ ∈ replyDB_i if, and only if, N_c(k), manager(k), and rules(k) are p_k's neighborhood, managers, and respectively, set of packet forwarding rules (line 3), as well as p_i →_G p_k (line 10). Moreover, for the case of a controller p_k ∈ P_C, the task does not require p_k to have any managers or rules, i.e., manager(k) = ∅ and rules(k) = ∅.
2. Every controller is the manager of every switch and only these controllers can be the managers of any switch, i.e., p_i ∈ P_C ∧ p_k ∈ P_S ⟺ p_i ∈ manager(k).
3. The rules installed in the switches encode κ-fault-resilient flows between controller p_i and node p_k in the network G_c (Section 2.2.2).
4. The end-to-end protocol (Section 3.1) as well as the round synchronization protocol (Section 2.2.1) between p_i and p_k are in a legitimate state.

The proof of Theorem 2 starts by establishing bounds on the number of rules that each switch needs to store (Lemma 1). The proof arguments are based on the bounded network size and the memory management scheme of the abstract switch (Section 2.1.1), which guarantees that, during a legal execution, all non-failing controllers are able to store their rules (Lemma 1). The bounded network size also helps to bound, during a legal execution, the amount of memory that each controller needs to have (Lemma 2). This proof also bounds the number of C-resets that a controller might take (line 26) during the period in which the system recovers from transient faults. (This corresponds to line 14 in Algorithm 1.) Note that this bound on the number of C-resets is important because C-resets delete all the information that a controller has about the network state.

C-resets are not the only disturbing actions that might occur during the recovery period. The system cannot reach a legitimate state before it removes stale information from the configuration of every switch. Note that failing controllers cannot remove stale information that is associated with them, and therefore non-failing controllers have to remove this information for them. Due to transient faults, it could be the case that one controller removes information that is associated with another non-failing controller. We refer to these 'mistakes' as illegitimate deletions of rules or managers (Section 5.3). Note that illegitimate deletions occur when the (stale) information that a controller has about the network topology differs from the actual network topology, G_c.
Moreover, due to stale information in the communication channels, a controller might aggregate (possibly stale) information about the network more than once and thus instruct a switch more than once to illegitimately delete the rules of other controllers.

Theorem 1 bounds the number of these illegitimate deletions. It does so by counting the number of possible steps in which a controller might have stale information about the network and that stale information leads the controller to perform an illegitimate deletion. The proof arguments start by considering a starting state in which controller p_i ∈ P_C is just about to take a step that instructs the switches to perform illegitimate deletions. The proof then argues that, between any two such steps, controller p_i has to aggregate information about the network in such a way that its view of the network topology is complete. But this can only happen after receiving a reply from every node in the preserved topology (Claim 5.1). By induction on the distance k between controller p_i ∈ P_C and node p_j ∈ P \ {p_i}, the proof shows that the information that p_i has about p_j becomes correct within k · (∆_comm + ∆_synch + 1) + 1 steps in which p_i instructs the switches to perform an illegitimate deletion, because there is only a bounded amount of stale information in the communication channel between p_i and p_j (Lemma 4). Thus, the total number of illegitimate deletions is at most D · (∆_comm + ∆_synch + 1) + 1.

The proof demonstrates recovery from transient faults by considering a period in which there are no C-resets and no illegitimate deletions (Section 5.4). In such a period, all the controllers construct κ-fault-resilient flows to any other node in the network (Lemma 5). This part of the proof is again by induction on the distance k between controller p_i ∈ P_C and node p_j ∈ P \ {p_i}. The induction shows that, within ((∆_comm + ∆_synch) + 2)k frames, p_i correctly discovers its distance-k neighborhood and establishes a communication channel between p_i and p_j. This means that within ((∆_comm + ∆_synch) + 2)D frames in which there are no C-resets and no illegitimate deletions, the system reaches a legitimate state (Lemma 6).

The above allows Theorem 2 to show that within (((∆_comm + ∆_synch) + 2)D + 1)·[((∆_comm + ∆_synch)D + 1)·N_S + N_C + 1] frames in R, there is a period of ((∆_comm + ∆_synch) + 2)D + 1 frames in which there are no C-resets and no illegitimate deletions, and thus the system reaches a legitimate state. Lemma 7 shows that, when starting from a legitimate state and then letting a single link be added to or removed from G_c, the system recovers within O(D) frames. The arguments consider the number of frames it takes for each controller to notice the change and to update all the switches. By similar arguments, Lemma 8 shows that, after the addition or removal of nodes, as long as G_c remains connected, the system recovers within O(D) frames.

Lemmas 1 and 2 bound the memory needed at every node during a legal execution. Recall that we assume that the switches implement a mechanism for dealing with clogged memory (Section 2.1.1), such that once controller p_i ∈ P_C refreshes its rules on a given switch, that switch never removes p_i's rules.
Lemma 1 bounds the memory that every switch needs, and thus relates to an event that can delay recovery, namely the removal of a rule at a switch due to lack of space.

Lemma 1 (Bounded Switch Memory). Suppose that R is a legal execution of Algorithm 2. A switch needs to let (1) no more than maxManagers ≥ N_C controllers manage it and to store (2) no more than maxRules ≥ N_C · (N_C + N_S − 1) · n_prt packet forwarding rules.

Proof. Let p_j ∈ P_S be a switch.

Number of managers.
Recall that we assume that maxManagers ≥ N_C ≥ |P_C|, i.e., the bound is large enough to store all managers (once all stale information is removed in the FIFO manner explained in Section 2.1.1). During a legal execution R of Algorithm 2, every controller accesses every switch repeatedly (line 24). This way, every p_i ∈ P_C is always among the N_C most recently installed controllers at p_j ∈ P_S.

Number of rules.
Recall that a rule is a tuple of the form ⟨k, i, src, dest, prt, j, tag⟩, where p_k ∈ P_C is the controller that created this rule, p_i ∈ P_S is the switch that stores this rule, p_src ∈ P_C and p_dest ∈ P are the source, and respectively, the destination of the packet, prt is the packet's priority, p_j ∈ P is the relay node (i.e., the rule's action field), and tag is the synchronization round tag.

To show that a switch needs to store no more than N_C · (N_C + N_S − 1) · n_prt rules, recall that each of the N_C controllers p_src ∈ P_C constructs κ-fault-resilient flows to every node p_dest ∈ P \ {p_src} in the network. Thus, switch p_i ∈ P_S might be a hop on the κ-fault-resilient flow between p_src and p_dest. That is, there are at most N_C · (N_C + N_S − 1) such flows that pass via p_i, because for each of the N_C possible flow sources p_src there are exactly (N_C + N_S − 1) destinations p_dest. Each such flow stores at most n_prt ≥ κ + 1 rules at p_i, i.e., one for each priority. Note that, during a legal execution, each switch p_i ∈ P_S stores at most one tag per p_src ∈ P_C (line 24).
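As a purely illustrative instantiation of this bound (the numbers are ours, not taken from the paper's evaluation), consider a network with N_C = 3 controllers, N_S = 9 switches, and n_prt = κ + 1 = 3 priorities; then each switch needs room for at most
\[
  maxRules \;\ge\; N_C \cdot (N_C + N_S - 1) \cdot n_{prt} \;=\; 3 \cdot (3 + 9 - 1) \cdot 3 \;=\; 99
\]
packet forwarding rules.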
Lemma 2 considers the C-reset event, which can delay recovery.

Lemma 2 (Bounded Controller Memory). (1) Let a_x ∈ R be the first step in which controller p_i runs lines 25–27 (upon query reply). For every state in R that follows step a_x, node p_i stores no more than maxReplies replies in the set replyDB_i. (2) Suppose that R is a legal execution. Controller p_i ∈ P_C needs to store, in the set replyDB_i, no more than maxReplies ≥ 2 · (N_C + N_S) items. (3) Suppose that R is any execution, which may start in an arbitrary state. Controller p_i performs a C-reset at most once in R, i.e., takes a step a_x′ ∈ R that includes the execution of line 26 in which the if-statement condition is true.

Proof. Part (1). We note that p_i modifies replyDB_i only in lines 12 and 16 of the do-forever loop (lines 11–24), and in lines 26 and 27 of the query-reply procedure (lines 25–27). In lines 12 and 16, the size of replyDB_i either decreases (possible only at the first step in which p_i executes line 12 or line 16) or stays the same. Thus, the rest of this proof focuses only on lines 26 and 27, where the set replyDB_i grows due to the addition of an incoming reply (line 27).

Let a_x′ be the first step in R in which controller p_i executes lines 25–27 due to a message m_j that p_i receives from node p_j. By line 26, if |replyDB_i ∪ {m_j}| > maxReplies holds, then p_i performs a C-reset, i.e., sets replyDB_i ← {⟨i, N_c(i), ∅, ∅⟩}, which implies that |replyDB_i| = 1 after the execution of line 26. Hence, after the execution of line 27 in step a_x′, |replyDB_i| < maxReplies holds for the state c_{x′+1}, which follows a_x′ immediately. Similarly, since the size of replyDB_i increases only when p_i executes line 27, for every step a_x″ and the system state c_{x″+1} that appears in R after c_{x′+1}, it is true that |replyDB_i| ≤ maxReplies holds in c_{x″+1}, due to line 26. Thus, for every system state that follows the first step a_x′ ∈ R, it holds that |replyDB_i| ≤ maxReplies.

Part (2). Line 12 removes from replyDB_i any response whose synchronization round tag is not in the set {prevTag_i, currTag_i}, and line 27 does not add to replyDB_i a response whose synchronization round tag is not currTag_i. Moreover, line 16 makes sure that when finishing one synchronization round and transitioning to the next one, replyDB_i includes only replies whose synchronization round tag is prevTag_i. Therefore, no more than two synchronization round tags can be simultaneously present in replyDB_i. Moreover, line 12 also removes any response from an unreachable node, because item 1 of Definition 1 holds in every system state of a legal execution. This further limits the set replyDB_i to include responses from at most N_C + N_S nodes. Therefore, |replyDB_i| ≤ 2 · (N_C + N_S).
Part (3). Suppose that p_i does perform a C-reset during R. Once that happens, parts (1) and (2) of this proof imply that this can never happen again.

Lemma 3 demonstrates that the proposed algorithm requires messages of bounded size.

Lemma 3.
The message size before and after the recovery period is in O(maxRules · log N), and respectively, O(∆ N log N) bits, where N = N_C + N_S and ∆ is the maximum node degree.

Proof. The size of the messages sent differs during and after the recovery period. Algorithm 2 involves messages sent from a controller to any other node and the subsequent replies to the controller. A message from a controller to a switch is a set of commands msg, initialized to the empty set in line 13. Commands are appended to msg in lines 20, 21, and 22, before the controller appends two more commands to msg (line 24) and sends it to a switch. We denote by msg_20, msg_21, and msg_22 the sets of commands appended to msg in the respective lines. Thus, |msg| = |msg_20| + |msg_21| + |msg_22| + O(log c_tag) bits, where |msg_x| refers to the message size due to line x and c_tag is the maximum size of a tag. Note that when using tags based on the ones in [3], O(log N) bits are needed, whereas using the ones by Awerbuch et al. [6, 20] requires O(1) bits.

We now calculate the size of each msg_x, for each line x mentioned above, following the analysis of the current section. Recall from Section 2.1 that the size of a single rule is in O(log N_C + log N_S + log n_prt + log c_tag) bits, where n_prt ≥ ∆ + 1 suffices for expressing all rules. A command in msg_20, msg_21, and msg_22 has size in O(log N_C + log N_S), O(log N_C + log N_S), and respectively, O((N_C + N_S − 1) n_prt (log N_C + log N_S + log n_prt + log c_tag)) bits. During recovery, the following hold for the product of cardinality and command size for each set: |msg_20| ∈ O(maxManagers · (log N_C + log N_S)), |msg_21| ∈ O(maxRules · (log N_C + log N_S)), and |msg_22| ∈ O((N_C + N_S − 1) n_prt (log N_C + log N_S + log n_prt + log c_tag)). Similarly, during a legal execution the following hold: |msg_20| ∈ O(log N_C + log N_S), |msg_21| = 0, and |msg_22| ∈ O((N_C + N_S − 1) n_prt (log N_C + log N_S + log n_prt + log c_tag)). Summing up, during recovery |msg| ∈ O((maxRules + maxManagers)(log N_C + log N_S) + (N_C + N_S − 1) n_prt (log N_C + log N_S + log n_prt + log c_tag)), and during a legal execution |msg| ∈ O((log N_C + log N_S) + (N_C + N_S − 1) n_prt (log N_C + log N_S + log n_prt + log c_tag)).

We now turn to calculate the message size for a query response. Since the query response of a switch is larger than that of a controller (by definition), we present only the case of switches. During recovery, a switch query response has size in O(log N_S + ∆(log N_S + log N_C) + maxManagers log N_C + maxRules (log N_C + log N_S + log n_prt + log c_tag)) bits, while during a legal execution the response size is in O(log N_S + ∆(log N_S + log N_C) + N_C log N_C + (N_C + N_S − 1) n_prt (log N_C + log N_S + log n_prt + log c_tag)) bits, where ∆ is the maximum degree.

The proof of Lemma 3 reveals that the proposed solution is communication adaptive [26], because after stabilization the message size is reduced.

We consider another kind of event that might delay recovery (Definition 2) and prove that it can occur a bounded number of times. Recall that ∆_comm is the number of frames within which the end-to-end protocol stabilizes (Section 3.1) and ∆_synch is the number of frames within which the round synchronization mechanism stabilizes (Section 4.2).
Definition 2 (Illegitimate deletions). A switch p_j performs an illegitimate deletion when it removes a non-failing controller p_ℓ ∈ P_C from its manager set (or removes p_ℓ's rules) due to a command that it received from another controller p_k ∈ P_C.

Theorem 1 (Bounded number of illegitimate deletions). Let a_{x_k} ∈ R be the k-th step in which controller p_i ∈ P_C executes lines 15–16 during execution R. Suppose that R includes at least ((∆_comm + ∆_synch)D + 1) such steps a_{x_k}, where D is the network diameter. Let R′ be the prefix of R = R′ ∘ R″ that includes the steps a_{x_1}, . . . , a_{x_{(∆_comm+∆_synch)D+1}} ∈ R′, and let R″ be the matching suffix. Controller p_i does not take steps a_{x′_k} ∈ R″ that send a message m_k to p_j ∈ P_S such that p_j performs an illegitimate deletion (Definition 2) upon receiving m_k.

Proof. This proof uses Claim 5.1 and Lemma 4. Theorem 1 follows from the case of k ≥ D in Lemma 4 and then applying Part (ii) of Claim 5.1.

Claim 5.1. (i) The condition in the if-statement of line 14 holds if, and only if, V_reported = V_reporting, where V_reported = {p_k : ∃⟨j, N_c(j), •, rls⟩ ∈ replyDB_i ((k = j ∨ p_k ∈ N_c(j)) ∧ ∃⟨i, j, •, currTag_i⟩ ∈ rls)} ∪ {⟨i, N_c(i), ∅, ∅⟩} and V_reporting = {p_j : ⟨j, •, rls⟩ ∈ replyDB_i ∧ (∃⟨i, j, •, currTag_i⟩ ∈ rls)}.
(ii) Suppose that every node p_j in G_c has sent a response ⟨j, •⟩ to p_i. Suppose that p_i stores these replies in replyDB_i together with p_i's report about its directly connected neighborhood, ⟨i, N_c(i), ∅, ∅⟩, cf. lines 7 and 12. In this case, the condition in the if-statement of line 14 holds.

Proof of Claim 5.1. The proof of Part (i).
The condition in the if-statement of line 14 is (∀p_ℓ : p_i →_{G(res_i(currTag_i))} p_ℓ ⟹ ⟨ℓ, •⟩ ∈ res_i(currTag_i)). When V_reported = V_reporting holds, the following two claims also hold by the definition of these sets (and vice versa): (a) p_i's response is in replyDB_i, and (b) for every node p_j that was queried with tag currTag_i, such that before the query either p_j had a response in replyDB_i or a direct neighbor of p_j had a response in replyDB_i, there exists a response from p_j in replyDB_i with rules that have the tag currTag_i. Hence, the condition in the if-statement of line 14 is true.

The proof of Part (ii).
This is just the particular case in which P = V_reported = V_reporting.

Lemma 4.
Let p_{j_k} ∈ P be a node at distance k from p_i in G_c, such that p_{j_0}, p_{j_1}, . . . , p_{j_k} is any shortest path from p_i to p_{j_k} and p_{j_0} = p_i. Let c_{x_y} ∈ R be the system state that immediately follows step a_{x_y} ∈ {a_{x_1}, . . . , a_{x_{k·(∆_comm+∆_synch)+1}}} ⊂ R′.
1. Let ℓ > k · ∆_comm + 1. The system state c_{x_ℓ} is legal with respect to the end-to-end protocol of the channel between p_i and p_{j_k}, and it holds that m = ⟨j_k, •⟩ is a message arriving from p_{j_k} through the channel to p_i, which is an acknowledgment of p_i's message to p_{j_k}.
2. Let ℓ > k · (∆_comm + ∆_synch) + 1. The system state c_{x_ℓ} is legal with respect to the round synchronization protocol between p_i and p_{j_k}. That is, for any message m = ⟨j_k, •, rls⟩ that arrives on the channel from p_{j_k} to p_i, it holds that m ∈ replyDB_i ∧ ∃r ∈ rls r = ⟨i, j_k, •, currTag_i⟩. Moreover, message m is an acknowledgement of a message m′ that p_i has sent to p_{j_k}, and together m′ and m form a completed round-trip.

Proof of Lemma 4. We note that the first step, a_{x_1}, could occur because the system starts in an arbitrary state in which the condition of the if-statement of line 14 holds; hence the addition of 1 in k · (∆_comm + ∆_synch). The proof is by induction on k > 0.
That is, we consider the steps a_{x_y} ∈ {a_{x_1}, . . . , a_{x_{k·(∆_comm+∆_synch)+1}}}.

The base case of k = 1. Claim 5.1 says that the condition in the if-statement of line 14 holds if, and only if, V_reported = V_reporting, where {⟨i, N_c(i), ∅, ∅⟩} ⊆ V_reported (line 7). Therefore, for any ℓ > 1,
we have that a_{x_ℓ} ∈ {a_{x_1}, . . . , a_{x_{k·(∆_comm+∆_synch+1)+1}}} implies that {⟨i, N_c(i), ∅, ∅⟩} ⊆ V_reporting holds immediately before a_{x_ℓ}.

Claim 5.2.
Between a_{x_{k−1}} and a_{x_k}, a message ⟨j_k, •, rls⟩ : ∃r ∈ rls r = ⟨i, j_k, •, currTag_i⟩ arrives on the channel from p_{j_k} ∈ N_c(i) to p_i, which p_i stores in replyDB_i, where k ≥ 2.

Proof of Claim 5.2. During the step a_{x_{k−1}}, controller p_i removes any response ⟨j_k, •, rls⟩ : ∃r ∈ rls r = ⟨i, j_k, •, currTag_i⟩ (line 16), and the only way in which ⟨j_k, •, rls⟩ : ∃r ∈ rls r = ⟨i, j_k, •, currTag_i⟩ can hold immediately before a_{x_k} is the following. Between a_{x_{k−1}} and a_{x_k}, a message arrives through the channel from p_{j_k} ∈ N_c(j_{k−1}) : j_0 = i to p_i, which p_i stores in replyDB_i (line 27). This is true because no other line in the code that accesses replyDB_i adds that message to replyDB_i (cf. lines 12, 16, and 27).

The proof of Part (1).
It can be the case that p_i sends a message for which it receives a (false) acknowledgement from p_j, i.e., without that message having gone through a complete round-trip. However, by ∆_comm's definition (Section 3.1), this can occur at most ∆_comm times.

The proof of Part (2).
It can be the case that p_i receives a message m from p_j for which the following condition does not hold in c_j: m = ⟨•, rls⟩ ∈ replyDB_i ∧ ∃r ∈ rls r = ⟨i, j_k, •, currTag_i⟩. However, by ∆_synch's definition (Section 2.2.2), this can occur at most ∆_synch times. The rest of the proof is implied by the properties of the round synchronization algorithm (Section 2.2.2).

The induction step.
Suppose that, within more than (∆_comm k + 1) and ((∆_comm + ∆_synch)k + 1) synchronization rounds from R's starting state, the system reaches a state in which conditions (1), and respectively, (2) hold with respect to some k ≥ 1.
We show that in c_{x_{∆_comm(k+1)+1}} and c_{x_{(∆_comm+∆_synch)(k+1)+1}}, conditions (1), and respectively, (2) hold with respect to k + 1.

The proof of Part (1).
Claim 5.1 says that the condition in the if-statement of line 14 holds if, and only if, V_reported = V_reporting. By the induction hypothesis, condition (2) holds with respect to k in c_{x_{(∆_comm+∆_synch)k+1}}, and therefore A(k+1) ∪ {⟨i, N_c(i), ∅, ∅⟩} ⊆ V_reported, where A(k) = {⟨j_{k′}, N_c(j_{k′}), •, rls⟩ : 1 < k′ ≤ k ∧ ∃r ∈ rls r = ⟨i, j_{k′}, •, currTag_i⟩}. Therefore, the fact that a_{x_{(∆_comm+∆_synch)(k+1)+2}} ∈ a_{x_1} . . . a_{x_{k·(∆_comm+∆_synch+1)+1}} implies that A(k+1) ∪ {⟨i, N_c(i), ∅, ∅⟩} ⊆ V_reporting holds in the system state that appears in R immediately before the step a_{x_{(∆_comm+∆_synch)(k+1)+2}}. Claim 5.2 implies the rest of the proof.

The proof of Part (2).
The proof here follows by arguments similar to the ones that appear in the proof of item (2) of the base case.

Part (3) of Lemma 2 and Theorem 1 imply Corollary 1.
Corollary 1. Any execution R of Algorithm 2 includes no more than N_C C-resets (Lemma 2) and ((∆_comm + ∆_synch)D + 1) · N_S illegitimate deletions (Theorem 1).

In this section we prove that Algorithm 2 is self-stabilizing. Lemma 5 shows that (under some conditions, such as reset freedom) controller p_i eventually discovers the local topology of a switch p_{j_k} that is at distance k from p_i in the graph G_c. This means that p_i has all the information that it needs for constructing (at least) a 0-fault-resilient flow to p_{j_k} and for discovering any switch p_{j_{k+1}} ∈ N_c(p_{j_k}) that is at distance k + 1 from p_i. Then, Lemma 6 shows that, within a bounded number of frames, no stale information exists in the system. Theorem 2 combines Corollary 1 and Lemma 6 to show that, within a bounded number of frames, the system reaches a legitimate state from which only a legal execution may continue.

We start by giving some necessary definitions. Let G_i be the value of G(referTag_i) (line 17) that controller p_i ∈ P_C computes in a step a_x ∈ R. We say that there is a path between p_i ∈ P and p_j ∈ P when there exist p_{j_0}, p_{j_1}, . . . , p_{j_k} ∈ P such that (1) p_{j_0} = p_i, (2) p_{j_k} = p_j, (3) p_{j_1}, . . . , p_{j_{k−1}} ∈ P_S, and (4) the rules installed by a controller p_ℓ ∈ P_C at the switches p_{j_1}, . . . , p_{j_{k−1}} (and also at p_i or p_j if they are also switches) forward packets from p_i to p_j as well as from p_j to p_i (when the respective links are operational). We say that two nodes p_i ∈ P and p_j ∈ P can exchange packets when there is a path between p_i and p_j. Moreover, we say that the rules installed in the switches p_s ∈ P_S facilitate κ-fault-resilient flows between p_i and p_j if, in the event of at most κ link failures, there exists a path between p_i and p_j. Let p_x and p_y be two nodes in P and recall that we assume that every node p_z ∈ P has a fixed ordering of its neighbors, i.e., N_c(z) = {p_{i_1}, . . . , p_{i_{|N_c(z)|}}}. We define the first shortest path between p_x and p_y to be the shortest path between p_x and p_y that includes the nodes with minimum indices according to the neighborhood orderings (among all the shortest paths between these two nodes).
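One natural way to obtain such a canonical shortest path is a breadth-first search that expands neighbors in each node's fixed ordering; the Python sketch below is ours and is only meant to illustrate the idea (whether it matches the paper's exact tie-breaking depends on how ties across hops are resolved).

    from collections import deque

    def first_shortest_path(neighbors, src, dst):
        # neighbors: dict mapping a node to its neighbor list, given in that node's
        # fixed ordering; BFS visits neighbors in this order, so the first time dst
        # is reached, the recorded parents spell out one canonical shortest path.
        parent = {src: None}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if u == dst:
                path = []
                while u is not None:
                    path.append(u)
                    u = parent[u]
                return list(reversed(path))
            for v in neighbors[u]:
                if v not in parent:
                    parent[v] = u
                    queue.append(v)
        return None  # dst is unreachable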
Let p_i ∈ P_C be a controller and p_{j_k} ∈ P a node at distance k from p_i in G_c, such that p_{j_0}, p_{j_1}, . . . , p_{j_k} is the first shortest path from p_i to p_{j_k} and p_{j_0} = p_i in G_c. Suppose that C-resets (Lemma 2) and illegitimate deletions (Theorem 1) do not occur in R. For every k ≥ 1, and any system state that follows the first ((∆_comm + ∆_synch) + 2)k frames from the beginning of R, the following hold.
1. ⟨j_k, N_c(j_k), manager_i(j_k), rules_i(j_k)⟩ ∈ res_i(prevTag_i), where N_c(j_k), manager_i(j_k), and rules_i(j_k) are p_{j_k}'s neighborhood, managers, and respectively, rules that p_i has received from p_{j_k}. Moreover, for the case of a controller p_{j_k} ∈ P_C, it holds that manager(j_k) = ∅ and rules(j_k) = ∅.
2. p_i ∈ manager_{j_k}(j_k).
3. The rules in rules_{j_1}(j_1), rules_{j_2}(j_2), . . . , rules_{j_k}(j_k) facilitate packet exchange between p_i and p_{j_k} along p_{j_0}, p_{j_1}, . . . , p_{j_k} (when the respective links are operational).
4. The end-to-end protocol as well as the round synchronization protocol between p_i and p_{j_k} are in a legitimate state.

Proof. The proof is by induction on k.

The base case.
Claims 5.3, 5.4, and 5.5 imply that the lemma statement holds for k = 1. Claim 5.3.
Within one frame from R's beginning, the system reaches a state in which condition (1) is fulfilled with respect to p_i and any node in p_i's distance-1 neighborhood in G_c.

Proof of Claim 5.3. During the first frame (with round-trips) of R, controller p_i starts and completes at least one iteration in which it sends a query (line 24) to every node p_j ∈ P that is in p_i's distance-1 neighborhood in G_c (this includes both switches, as we explain in Section 2.1.1, as well as other controllers, which respond according to line 28). Moreover, during that first frame, p_j receives that query and replies to p_i (lines 25–27) within one step (Section 3.2). Thus, the first part of condition (1) is fulfilled, because controller p_i then adds (or updates) the latest (query) replies that it received from these neighbors to replyDB_i. The second part of condition (1) is implied by the first part of condition (1) and by line 28.

Claim 5.4.
Within two frames from the beginning of R, the system reaches a state in which conditions (2) and (3) are fulfilled with respect to p_i and any node in p_i's distance-1 neighborhood in G_c.

Proof of Claim 5.4. This proof uses Claim 5.3 by first showing that, within one frame from the beginning of an execution in which condition (1) holds, the system reaches a state in which conditions (2) and (3) are fulfilled with respect to p_i and any node p_j ∈ N_c(i). This indeed implies that conditions (2) and (3) are fulfilled within two frames of R for p_i's direct neighbors. Let R∗ be a suffix of R such that in R∗'s starting system state condition (1) is fulfilled with respect to p_i and any node in p_i's distance-1 neighborhood in G_c. During the first frame (with round-trips) of R∗, controller p_i starts and completes at least one iteration (with round-trips) in which it is able to include p_i in p_j's manager set, manager_j(j) (lines 19 to 21), and to install rules at p_j ∈ N_c(i) (line 22). We know that this installation is possible because p_i is a direct neighbor of p_j ∈ N_c(i) (Section 2.1.1). Once these rules are installed, packet exchange between p_i and p_j ∈ N_c(i) is feasible. This implies that conditions (2) and (3) are fulfilled within one frame of R∗ (and two frames of R) for p_i's direct neighbors.

Claim 5.5.
Within ((∆_comm + ∆_synch) + 2) frames from the beginning of R, the system reaches a state in which condition (4) is fulfilled with respect to p_i and any node in p_i's distance-1 neighborhood in G_c.

Proof of Claim 5.5. Since conditions (2) and (3) hold within two frames with respect to k = 1, controller p_i and p_j can maintain an end-to-end communication channel between them, because the network part between p_i and p_j includes all the needed flows. By ∆_comm's definition (Section 3.1), within ∆_comm frames the system reaches a legitimate state with respect to the end-to-end protocol between p_i and p_j. Similarly, by ∆_synch's definition (Section 2.2.2), within ∆_synch frames the system reaches a legitimate state with respect to the round synchronization protocol between p_i and p_j. Thus, condition (4) holds within ((∆_comm + ∆_synch) + 2) frames from R's beginning.

The induction step. Suppose that, within ((∆_comm + ∆_synch) + 2)k frames from R's starting state, the system reaches a state c_x ∈ R in which conditions (1), (2), (3) and (4) hold with respect to k. We show that within (∆_comm + ∆_synch) + 2 frames from c_x, the system reaches a state in which the lemma's statements hold with respect to k + 1 as well.

Showing that, within one frame from c_x, processor p_i knows all of its distance-(k+1) neighbors. This part of the proof starts by showing that within one frame from c_x, execution R reaches a state such that p_i →_{G_i} p_j holds for every distance-(k+1) neighbor p_j of p_i in G_c. The system state c_x encodes (packet forwarding) rules that allow p_i to exchange packets with its distance-k neighbors in G_c (since by the induction hypothesis, conditions (3) and (4) hold with respect to k in c_x). Moreover, p_i stores in res(prevTag_i) replies from p_i's distance-k neighbors in G_c (since by the induction hypothesis, condition (1) holds for k in c_x). The latter implies that p_i knows, as part of G_i in c_x, all of its distance-(k+1) neighbors, {p_k : ∃⟨j, N_c(j), •⟩ ∈ res_i(prevTag_i) ∧ (k = j ∨ k ∈ N_c(j, prevTag_i))}, since every reply of a distance-k neighbor p_{j∗} in G_c (which res_i(prevTag_i) stores in c_x) includes p_{j∗}'s neighborhood.

Condition (1) holds with respect to k + 1 within ((∆_comm + ∆_synch) + 2)k + 1 frames. Using the above, we show that within one frame from c_x, controller p_i ∈ P_C queries all of its distance-(k+1) neighbors (line 24), receives their replies, and stores them in replyDB_i (lines 25–27), i.e., ⟨j_{k+1}, N_c(j_{k+1}), manager_i(j_{k+1}), rules_i(j_{k+1})⟩ ∈ res_i(currTag_i) for every distance-(k+1) neighbor p_{j_{k+1}} of p_i in G_i. Recall that c_x encodes rules that let p_i exchange packets with its distance-k neighbors in G_c (condition (3) holds for k in c_x). By the query-by-neighbor functionality (Section 2.1.1), every such distance-k neighbor reports on its direct neighbors (which include p_i's distance-(k+1) neighbors), i.e., it forwards the query message to p_i's distance-(k+1) neighbors as well as the replies back to p_i. Therefore, within ((∆_comm + ∆_synch) + 2)k + 1 frames, the system reaches a state, c_x′, in which condition (1) holds with respect to k + 1.

Conditions (2) and (3) hold with respect to k + 1 within ((∆_comm + ∆_synch) + 2)k + 2 frames.
The next step of the proof is to show that within one frame from c_x′, the system reaches a state c_x″ in which conditions (2) and (3) hold with respect to k + 1 (in addition to condition (1)). By the functionality for querying (and modifying) by neighbor (Section 2.1.1), for every switch p_j that is a distance-(k+1) neighbor of p_i in G_c, it holds that between c_x′ and c_x″: (a) p_i adds itself to the manager set manager(j) of p_j (lines 19 to 21), and (b) p_i installs its rules in p_j's configuration (line 22). (We note that when p_j is another controller, there is no need to show that conditions (2) and (3) hold.)

Condition (4) holds for k + 1 within ((∆_comm + ∆_synch) + 2)(k + 1) frames. The proof follows arguments similar to the ones in the proof of Claim 5.5. Thus, conditions (1), (2), (3), and (4) hold for k + 1 within ((∆_comm + ∆_synch) + 2)(k + 1) frames in R, and the proof is complete.

Lemma 6 bounds the number of frames before the system reaches a legitimate system state.

Lemma 6.
Let R = R′ ∘ R″ be an execution of Algorithm 2 that includes a prefix, R′, of ((∆_comm + ∆_synch) + 2)D + 1 frames with no occurrence of C-resets or illegitimate deletions. (1) Any system state in R″ is legitimate (Definition 1). (2) Let a_x ∈ R″ be a step that includes the execution of the do-forever loop that starts in line 12 and ends in line 24. During that step a_x, the value of msg_i, which p_i sends to p_j ∈ P in line 24, includes neither the record ⟨'delMngr', •⟩ nor the record ⟨'delAllRules', •⟩, i.e., no deletions of managers or rules, whether illegitimate or not. (3) No controller p_i takes a step in R″ during which the condition of line 26 holds, which implies that p_i performs no C-reset during R″.

Proof. When comparing the conditions of Definition 1 and the conditions of Lemma 5, we see that Lemma 5 guarantees that within ((∆_comm + ∆_synch) + 2)D frames the system reaches a state c_almostSafe ∈ R′ in which all the conditions of Definition 1 hold, except condition 2 with respect to controllers p_j ∉ P_C that do not exist in the system (and their rules that are stored by the switches). From condition 1 of Definition 1, we have that at each controller p_i ∈ P_C it holds that G(res(currTag_i)) = G(fusion_i) = G_c. This implies that p_i can identify correctly any stale information related to p_j and remove it from the configuration of every switch that is in the system (see lines 18 to 22) during the round that follows c_almostSafe, which takes one frame because condition 1 of Definition 1 holds. This means that within ((∆_comm + ∆_synch) + 2)D + 1 frames the system reaches a legitimate state in which all the conditions of Definition 1 hold, and thus R″ is a legal execution, i.e., the first part of the lemma holds. Part (2) of this lemma is implied by the fact that there is no controller p_j ∉ P_C that controller p_i ∈ P_C needs to remove from the configuration of any switch during the legal execution R″. Part (3) is implied by Part (3) of Lemma 2 and the fact that R″ is a legal execution.

Theorem 2 (Self-Stabilization). Within (((∆_comm + ∆_synch) + 2)D + 1)·[((∆_comm + ∆_synch)D + 1) · N_S + N_C + 1] frames in R, the system reaches a state c_safe ∈ R that is legitimate (Definition 1). Moreover, no execution that starts from c_safe ∈ R includes a C-reset or an illegitimate deletion of managers or rules.

Proof. In this proof, we say that an execution R_adm is admissible when it includes at least ((∆_comm + ∆_synch) + 2)D + 1 frames and no C-reset or illegitimate deletion. Let R be an execution of Algorithm 2. Let us consider R's longest possible prefix R′ such that R′ does not include any sub-execution that is admissible, i.e., R = R′ ∘ R″. Recall that by Corollary 1 the prefix R′ has no more than ((∆_comm + ∆_synch)D + 1) · N_S + N_C C-resets or illegitimate deletions. By the pigeonhole principle, the prefix R′ has no more than (((∆_comm + ∆_synch) + 2)D + 1)·[((∆_comm + ∆_synch)D + 1) · N_S + N_C + 1] frames. By Lemma 6, R″ does not include C-resets or deletions of managers or rules, and the system has reached a safe state, which is c_safe.
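To get a feel for the magnitude of this bound, consider illustrative values that we pick here (they are not measured quantities from the paper): ∆_comm = ∆_synch = 1, D = 4, N_S = 20, and N_C = 3. Then
\[
  \big(((\Delta_{comm}+\Delta_{synch})+2)D+1\big)\,\big[((\Delta_{comm}+\Delta_{synch})D+1)N_S + N_C + 1\big]
  = 17 \cdot 184 = 3128
\]
frames suffice for stabilization under these assumptions.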
This part of the proof considers executions in which the system starts in a state c′ that is obtained by taking a system state c_safe that satisfies the requirements for a legitimate system state (Definition 1) and then applying a bounded number of failures and recoveries. We discuss the conditions under which no packet loss occurs when starting from c′, which is obtained from c_safe and (i) the events of up to r link failures and up to ℓ link additions (Lemma 7), as well as (ii) the events of up to r controller failures and up to ℓ controller additions (Lemma 8).

Lemma 7.
Suppose that c′ is obtained from a legitimate system state c_safe by the removal of at most r links and the addition of at most ℓ links (and no further failures), and R is an execution of Algorithm 2 that starts in c′. It holds that no packet loss occurs in R as long as r ≤ κ and ℓ ≥ 0. For the case of r ≤ κ ∧ ℓ ≥ 0, recovery occurs within O(D) frames, while for the case of r > κ, bounded communication delays can no longer be guaranteed.

Proof. We consider the following cases.
The case of r ≤ κ and ℓ = 0. Suppose that a single link e has failed, i.e., it has been permanently removed from G_c, in a state c′ that follows a legitimate system state c_safe. Say that e is included either in a primary path Π_0 in G_o(0) or in one of the alternative paths of Π_0, Π_k in G_o(k), where k > 0, for a controller p_i (cf. the definitions of the function myRules() and the graphs G_o(k) in Section 2.2.2). For every such case, since e's failure occurs after a legitimate state, communication is maintained when at most κ − 1 other links (besides e) are non-operational. Let s be the index in {0, 1, . . . , κ} for which e ∈ Π_s. Due to the construction of the paths Π_k, k ∈ {0, 1, . . . , κ}, in the computation of the function myRules() at p_i, if s = 0, then each alternative path Π_k before e's failure is now considered as path Π_{k−1}, for k ∈ {1, . . . , κ}. Otherwise, if s ≠ 0, the paths Π_k remain the same for k ∈ {0, . . . , s − 1}, and each path Π_k is now considered as the alternative path Π_{k−1} for k ∈ {s + 1, . . . , κ}. In both cases, a new path Π_κ is computed and installed in the switches if that is possible given the edge-connectivity of G_c; if that is not the case, the rules installed in the network's switches facilitate (κ − 1)-fault-resilient flows (if e belongs to some path Π_k), since the removal of link e occurs after a legitimate state and all nodes in the network can be reached by every controller p_i ∈ P_C. Note that if e is not part of any flow, then its failure has no effect on maintaining bounded communication delays. By extension of the argument above, bounded communication delays can be maintained when at most κ link failures occur. That is, in the worst case when exactly κ link failures occur, bounded communication delays are maintained due to the existence of the κ-th alternative paths and the assumption that no further failures occur in the network.

The case of r = 0 ∧ ℓ > 0. A link addition can violate the first shortest path optimality; thus, in this case, all paths should be constructed from scratch. Since the link addition occurs after a legitimate state, no stale information exists in the system, and no resets or illegitimate deletions occur. Hence, by Lemma 5 (for k = D), within 2D frames it is possible to (re-)build the κ-fault-resilient flows throughout all nodes in the network and reach a legitimate system state (since the edge-connectivity cannot decrease with link additions).

The case of r ≤ κ and ℓ > 0. Note that by the first case, bounded communication delays are maintained, since r ≤ κ. Since ℓ links are added in G_c, the controllers require O(D) frames to install new paths (by Lemma 5), even though the connectivity of G_c might be less than κ + 1 (but at least 1). Hence, bounded communication delays are guaranteed in this case, given that no more failures occur.

The case of r > κ. In this case, we do not guarantee bounded communication delays. This holds because the removal of more than κ edges might break connectivity in G_c, which makes the existence of alternative paths for r > κ link failures impossible.
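The index promotion described in the first case can be pictured with the following Python sketch; it is our own illustration (promote_paths and compute_new_path are hypothetical names), not the paper's myRules() implementation.

    def promote_paths(paths, failed_link, compute_new_path):
        # paths: list [Pi_0, Pi_1, ..., Pi_kappa]; each path is a collection of links.
        # Find the index s whose path uses the failed link; if none, nothing changes.
        s = next((k for k, p in enumerate(paths) if failed_link in p), None)
        if s is None:
            return paths
        # Paths Pi_0..Pi_{s-1} stay in place; Pi_{s+1}..Pi_kappa shift down by one index.
        promoted = paths[:s] + paths[s + 1:]
        # Try to compute a replacement Pi_kappa; if the residual edge-connectivity does
        # not allow it, the remaining paths still give (kappa-1)-fault resilience.
        new_last = compute_new_path(promoted, failed_link)
        if new_last is not None:
            promoted.append(new_last)
        return promoted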
Suppose that c′ is obtained from a legitimate system state c_safe by the removal of at most r nodes and the addition of at most ℓ nodes (and no further failures), and R is an execution of Algorithm 2 that starts in c′. It holds that no packet loss occurs in R if, and only if, G_c remains connected (and N_C ≥ 1 ∧ N_S ≥ 1), and in this case the network recovers within O(D) frames. For the case of r > 0 ∧ ℓ = 0, bounded communication delays can no longer be guaranteed.

Proof. We study the following cases.
The case of r > 0 and ℓ = 0. The removal of a switch p_j is equivalent to the removal of all the links adjacent to p_j. Since the edge-connectivity is at least κ + 1, the minimum degree of every node in G_c is at least κ + 1. Thus, a switch removal (equivalently, the removal of at least κ + 1 links) would violate the assumption of at most κ link failures, possibly violating connectivity or affecting all the alternative paths between two endpoints in the network. In this case, Algorithm 2 can only guarantee that the controllers will install ˜κ-fault-resilient flows, where 0 ≤ ˜κ ≤ κ. The case of removing a controller p_i can be handled by Algorithm 2 if we assume that the communication graph G_c stays (at least) (κ + 1)-edge-connected after removing p_i. In that case, each controller p_i′ can discover the removal of p_i and delete it from replyDB_i′ within one frame, and then, in the subsequent frame, p_i′ can delete p_i's rules from rules_j(j) and p_i from manager_j(j), for every switch p_j. Hence, within 2 frames the system recovers to a legitimate state, since the existing rules of the other controllers stay intact.

The case of r = 0 and ℓ > 0. We assume that if controller or switch additions occur (including their adjacent links) after a legitimate state, the new node is initialized with empty memory. That is, replyDB_i is empty if a new controller p_i is added, and manager_j(j) = rules_j(j) = ∅ if a new switch p_j is added. Note that the new node should not violate the assumption that G_c's edge-connectivity is at least κ + 1. In both cases, and similarly to link additions, the first shortest path optimality might be violated, and hence (as in the case of link additions) a period of 2D frames is needed (Lemma 5) to (re-)build the κ-fault-resilient flows (since no stale information exists, and no resets or illegitimate deletions occur).

The case of r > 0 and ℓ > 0. Let G′_c be G_c after the removal of at most r nodes and the addition of at most ℓ nodes. If G′_c is ˜κ-edge-connected, where 1 < ˜κ ≤ κ, then bounded communication delays in the occurrence of at most ˜κ link failures can be guaranteed by following the arguments of Section 5.4 for κ = ˜κ.

In order to evaluate our approach, and in particular, to complement our theoretical worst-case analysis as well as study the performance in different settings, we implemented a prototype using Open vSwitch (OVS) and Floodlight. To ensure reproducibility and to facilitate research on improved and alternative algorithms, the source code and evaluation data are accessible via [52]. In the following, we first explain our expectations with respect to the performance (Section 6.1) and discuss details related to the implementation of the proposed solution (Section 6.2), before presenting the setup of our experiments (Section 6.3). In particular, we empirically evaluate the time to bootstrap an SDN (after the occurrence of different kinds of transient failures), the recovery time (after the occurrence of different kinds of benign failures), as well as the throughput during a recovery period that follows a single link failure (Section 6.4).
We study Renaissance's ability to recover from failures in a wide range of topologies and settings. We note that the scope of our work does not include an empirical demonstration of recovery after the occurrence of arbitrary transient faults, because such a result would need to consider all possible starting system states. Nevertheless, we do consider recovery after changes in the topology, which Section 3.4 models as transient faults. However, in these cases, we mostly consider a single change to the topology, i.e., a node or link failure (after the recovery from any other transient fault).

Figure 5: Bootstrap time for the networks using 3 controllers. The network diameters are 4, 5, 8, 10 and 11 (left to right order).

The basis for our performance expectation is the analysis presented in Section 5. Specifically, we use lemmas 5, 7 and 8 to anticipate an O(D) bootstrap time and recovery period after the occurrence of benign failures. Recall that, for the sake of simple presentation, our theoretical analysis does not consider the number of messages sent and received (Section 3.5.3), which depends on the number of nodes in the case of Renaissance. Thus, we do not expect the asymptotic bounds of lemmas 5, 7 and 8 to offer an exact prediction of the system performance, since our aim in Section 5 is merely to demonstrate bounded recovery time. The measurements presented in this section show that
Renaissance's performance is in the ballpark of the estimation presented in Section 5.

Network Name   Number of Nodes   Network Diameter
B4             12                5
Clos           20                4
Telstra        57                8
AT&T           172               10
EBONE          208               11

Table 1: The number of nodes and diameter of the studied networks.
In this evaluation section, we demonstrate Renaissance's ability to recover from failures without distinguishing between transient and permanent faults, as our model does (Figure 3), because there is no definitive distinction between transient and permanent faults in real-world systems. Moreover, our implementation uses a variation on Algorithm 2. The reason that we need this variation is that this evaluation section considers changes to the network topology during legal executions, whereas our model considers such changes as transient faults that can occur before the system starts running. In detail, Algorithm 2 installs rules on the switches using two tags, which are currTag and prevTag (line 4). That is, as the new rules for currTag are being installed, the ones for prevTag are being removed. Our variation uses a third tag, beforePrevTag, which tags the rules of the synchronization round that preceded the one that prevTag refers to. When Renaissance installs new rules that are tagged with currTag, it does not remove the rules tagged with prevTag; instead, it removes the rules that are tagged with beforePrevTag. This one extra round in which the switches hold on to the rules installed for prevTag's synchronization round allows Renaissance to use the κ-fault-resilient flows that are associated with prevTag for dealing with link failures (without having them removed, as Algorithm 2 does). The above variation allows us to observe the beneficial and complementary coexistence of the mechanisms for tolerating transient and permanent link failures, i.e., Renaissance's construction of κ-fault-resilient flows, and respectively, the update of such flows according to changes reported by Renaissance's topology discovery.
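A sketch (ours, in Python, with hypothetical names) of the rule garbage-collection implied by this variation: a switch's rules for a controller are kept for the current and the previous round, and only older rounds are reclaimed.

    def rules_to_remove(installed_rules, curr_tag, prev_tag):
        # installed_rules: iterable of (controller_id, ..., tag) tuples that a switch
        # holds for one controller. Rules tagged with currTag or prevTag are kept;
        # everything older (beforePrevTag or staler) is reclaimed.
        keep = {curr_tag, prev_tag}
        return [rule for rule in installed_rules if rule[-1] not in keep]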
We consider a spectrum of different topologies (varying in size and diameter), including B4 (Google's inter-datacenter WAN based on SDN), Clos datacenter networks, and Rocketfuel networks (namely Telstra, AT&T and EBONE). The relevant statistics of these networks can be found in Table 1. The hosts for traffic and round-trip time (RTT) evaluation are placed such that the distance between them is as large as the network diameter. The evaluation was conducted on a PC running Ubuntu 16.04 LTS, with an Intel(R) Core(TM) i5-4570S CPU @ 2.9 GHz (4x CPU) processor and 32 GB RAM. During the experiments in large networks, the rule sets need to accommodate many rules that controllers and switches have to exchange. Therefore, the maximum transfer unit (MTU) of each link is set to 65536 bytes in all experiments.

Paths are computed according to Breadth First Search (BFS) and we use OpenFlow fast-failover groups for backup paths. We introduce a delay before every repetition of the algorithm's do-forever loop as well as between each interval in which the abstract switch discovers its neighborhood. In our experiments, the default delay value was 500 ms. However, in an experiment related to the bootstrap time (Figure 7), we varied the delay values.

Figure 6: Bootstrap time for Telstra (T), AT&T (A) and EBONE (E) for 1 to 7 controllers.

The link status detector (for switches and controllers) has a parameter called Θ, similar to the one used in [8, Section 6]. This threshold parameter refers to scenarios in which the abstract switch queries a non-failing neighboring node without receiving a query reply while receiving Θ replies from all other neighbors. The parameter Θ can balance a trade-off between the certainty that a node is indeed failing and the time it takes to detect a failure, which affects the recovery time. We selected Θ to be 10 for B4 and Clos, and 30 for Telstra, AT&T and Ebone. We observed that with these settings the discovery of the entire network topology always occurred, while still providing rapid fault detection.
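The following Python sketch (our own illustration, not the implementation's code) captures the Θ-threshold heuristic: a neighbor is suspected as failed only after Θ consecutive query rounds in which every other neighbor answered but the suspected one did not.

    def update_missed_counts(missed_counts, replied, neighbors):
        # Called once per query round with the set of neighbors that replied.
        for n in neighbors:
            if n not in replied and all(m in replied for m in neighbors if m != n):
                missed_counts[n] = missed_counts.get(n, 0) + 1
            else:
                missed_counts[n] = 0
        return missed_counts

    def suspect_failed(missed_counts, neighbor, theta):
        # A neighbor is suspected only after theta such rounds in a row.
        return missed_counts.get(neighbor, 0) >= theta

A larger Θ (e.g., 30 for the Rocketfuel topologies versus 10 for B4 and Clos) reduces false suspicions at the cost of slower failure detection.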
We structure our evaluation of Renaissance around the main questions related to SDN bootstrap, recovery times, and overhead, as well as the throughput during failures.

For illustrating our data in figures 5–6 and 8–13, we use violin plots [28]. In these plots, we indicate the median with a white dot. The first and third quartiles are the endpoints of a thick black line (hence the white dot representing the median is a point on the black line). The thick black line is extended with thin black lines to denote the two extrema of all the data (as the whiskers of box plots). Finally, the vertical boundary of each surface denotes the kernel density estimation (same on both sides) and the horizontal boundary only closes the surface. We ran each experiment 20 times. For the case of violin plots, we used all measurements except the two extrema. For the case of the other plots, we dismissed the two extrema from the 20 measurements, and then calculated average values and used them in the plots.

Figure 7: Bootstrap time for B4, Clos, Telstra, AT&T and EBONE using seven controllers, as a function of query intervals. Recall that the task delay is the added time between any repetition of the algorithm's do-forever loop as well as between each interval in which the abstract switch discovers its neighborhood.
How fast does Renaissance bootstrap an SDN?
We first study how fast we can establish a stable network starting from empty switch configurations. Towards this end, we measure how long it takes until all controllers in the network reach a legitimate state in which each controller can communicate with any other node in the network (by installing packet-forwarding rules on the switches). For the smaller networks (B4 [29] and Clos [2]), we use three controllers, and for the Rocketfuel networks [51, 50] (Telstra, AT&T and EBONE), we use up to seven controllers.

Figure 8: Communication cost per node needed from a maximally loaded global controller to reach a stable network. Note that we divide the number of messages by the number of iterations it takes to converge.
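The bootstrap time itself is measured by polling until this legitimate state is reached. The sketch below illustrates the idea; can_reach is a hypothetical predicate standing in for the concrete connectivity check (e.g., probing the in-band path established by the installed rules) and is not part of our prototype.

```python
import time

def measure_bootstrap_time(controllers, nodes, can_reach, poll_interval=0.5):
    """Poll until every controller can reach every other node and return the
    elapsed time; can_reach(c, n) is a hypothetical connectivity predicate."""
    start = time.time()
    while True:
        if all(can_reach(c, n) for c in controllers for n in nodes if n != c):
            return time.time() - start
        time.sleep(poll_interval)
```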
Bootstrapping time.
We are indeed able to bootstrap in any of the configurations studied in our experiments. Lemma 5 predicts an O(D) bootstrap time when starting from an all-empty switch configuration; that prediction does not consider the number of nodes, as explained above. Note that in such executions, no controller sends commands that perform (illegitimate) deletions before it discovers the entire network topology, and thus no illegitimate deletion is ever performed by any controller. In terms of performance, we observe that the recovery time grows (Figure 5) as the network dimensions increase (diameter and number of nodes). It also depends somewhat on the number of controllers in the larger networks (Figure 6): more controllers result in slightly longer bootstrap times. We note that the recovery process over a growing number of controllers follows trends that appear when considering the maximum value over a growing number of random variables.
Figure 9: Recovery time after a fail-stop failure of a controller.

Specifically, when an abstract switch updates its rules, the time it takes to update all of the rules that were sent by many controllers can appear as a brief bottleneck. Note that the shown bootstrap times only provide qualitative insights: they are, up to a certain point, proportional to the frequency at which controllers request configurations and install flows (Figure 7). Specifically, the rightmost peaks in the charts are due to the congestion caused by task delays that overwhelm the networks. These peaks rise earlier for networks with more switches. This is not a surprise, because the proposed algorithm establishes more and longer flows in larger networks and thus generates more network traffic as the number of nodes grows.
Communication overhead.
The study of bootstrap time raises interesting questions regarding the communication overhead during the network bootstrap period. Concretely, we measure the maximum number of controller messages, using three controllers for the smaller networks (B4 and Clos) and seven controllers for the Rocketfuel networks (Telstra, AT&T and EBONE) in these experiments. While the communication overhead naturally depends on the network size, Figure 8 suggests that when normalized, i.e., divided by the number of iterations it takes to recover, the overhead is similar across the different networks (and slightly higher for the two largest networks).
Figure 10: Recovery time after the fail-stop failure of 1 to 6 controllers in Telstra (T), AT&T (A) and EBONE (E).
How fast does Renaissance recover from link and node failures?
In order to study the recovery from benign failures, we distinguish between three types: (i) fail-stop failures of controllers, (ii) permanent switch-failures, and (iii) permanent link-failures. The experiments start from a legitimate system state, into which we inject the above failures.

(i) Recovery after the occurrence of a controller's fail-stop failure.
We injected a fail-stop failure by disconnecting a single controller chosen uniformly at random (Figure 9). We have also conducted an experiment in which we disconnected many-but-not-all controllers (Figure 10). That is, we disconnected a single controller, initially chosen at random, and measured the recovery time; the procedure was then repeated, disconnecting one further controller at a time and recording the measurements, until only one controller was left. Lemma 8, which does not take into consideration the time it takes to send or receive messages, suggests that, after the removal of at most N_C − 1 controllers, the system recovers within O(D) time.
Figure 11: Recovery time after a permanent switch-failure.

We observe in Figure 9 results that are in the ballpark of that prediction. Moreover, we also measured the recovery time when disconnecting one to six random controllers simultaneously for the Rocketfuel networks (Telstra, AT&T, and EBONE), while keeping controller number 7 running. Note that we could not observe a relation between the number of failing controllers and the recovery time, see Figure 10.

(ii) Recovery after the occurrence of a switch's fail-stop failure.
We have experimented with recovery after permanent switch-failures. These experiments started by allowing the network to reach a legitimate (stable) state. Once in a legitimate (stable) state, a switch (selected uniformly at random) was disconnected from the network. We then measured the time it takes the system to regain legitimacy (stability). By Lemma 8, the recovery time here should be at most in the order of the network diameter. Figure 11 presents measurements that are in the ballpark of that prediction; that is, the longest recovery time for each of the studied networks grows with the network diameter. We also observe a rather large variance in the recovery time, especially for the larger networks. This is not a surprise, since the selection of the disconnected switch is random.
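All three failure-injection experiments follow the same pattern: wait for a legitimate (stable) state, inject the failure, and measure the time until legitimacy is regained. The following sketch captures this procedure; net, legitimate, and the Mininet-style switch.stop() call are placeholders and assumptions rather than our prototype's code.

```python
import random
import time

def measure_recovery_time(net, legitimate, inject_failure, poll_interval=0.5):
    """Generic failure-injection harness (sketch): wait for a legitimate state,
    inject a failure, and return the time it takes to regain legitimacy."""
    while not legitimate(net):
        time.sleep(poll_interval)
    inject_failure(net)
    start = time.time()
    while not legitimate(net):
        time.sleep(poll_interval)
    return time.time() - start

def fail_random_switch(net):
    """Disconnect a switch chosen uniformly at random (assumes a Mininet-like
    object exposing net.switches and switch.stop())."""
    switch = random.choice(net.switches)
    switch.stop()
```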
(iii) Recovery after the occurrence of permanent link-failures.

During the experiments, we waited until the system reached a legitimate state, then disconnected a link and waited for the system to recover.
Figure 12: Recovery time after a permanent link-failure.

Lemma 8 predicts recovery within O(D) time. Figure 12 presents results that are in the ballpark of that prediction. We also investigated the case of multiple, simultaneous permanent link failures that were selected at random. Figure 13 suggests that the number of simultaneous failures does not play a significant role with respect to the recovery time.

Besides connectivity, we are also interested in performance metrics such as the throughput and message loss during the recovery period that follows a single link failure. Recall that we model such failures as transient faults, and therefore there is a need to investigate empirically the system's behavior during such recovery periods, since the mechanism for fault-resilient flows (Section 2.2.2) is always active. Our experiments show that the combination of the proposed algorithm and the mechanism for fault-resilient flows performs rather well. That is, the recovery period after a single permanent link failure is brief and has a limited impact on the throughput. In the following, we measure the TCP throughput between two hosts (placed at a maximal distance from each other), in the presence of a link-failure located as close to the middle of the primary path as possible.
Figure 13: Recovery time after multiple (2, 4 or 6) permanent link-failures chosen at random for B4 (B), Clos (C), Telstra (T), AT&T (A) and EBONE (E).

To generate traffic, we use Iperf. The link to fail is chosen such that a backup path between the hosts remains available. The maximum link bandwidth is set to 1000 Mbits/s. During the experiments, we measure the throughput over a period of 30 seconds. The link-failure occurs after 10 seconds, and we expect a throughput drop due to the traffic being rerouted to a backup path. We note that our prototype utilizes packet tagging for consistent updates [42] using the variation of Algorithm 2 (presented in Section 6.2), which allows the controllers to repair the κ-fault-resilient flows without removing the ones tagged with the previous tag. We can see in Figure 14 that one throughput valley indeed occurs (down to around 480–510 Mbits/s). For comparison, Figure 15 shows the throughput over time without the recovery that includes consistent updates [42]: only the backup paths are used in these experiments, and no new primary paths are calculated or used after the link-failure at the 10th second. The results in figures 14 and 15 are very similar: there is a strong correlation between these two methods in terms of performance, see Table 2.
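The throughput experiment described above can be scripted roughly as follows; this is only a sketch, and the Mininet-style calls (popen, configLinkStatus) as well as the host and switch names are assumptions about the test-bed API rather than an excerpt of our prototype.

```python
import time

def throughput_with_link_failure(net, src="h1", dst="h2", link=("s3", "s4")):
    """Sketch of the throughput experiment: 30 s of iperf traffic between two
    hosts, with a mid-path link brought down after 10 s (names are placeholders)."""
    h1, h2 = net.get(src), net.get(dst)
    server = h2.popen("iperf -s -i 1")                      # iperf server on h2
    client = h1.popen("iperf -c %s -t 30 -i 1" % h2.IP())   # 30 s measurement
    time.sleep(10)
    net.configLinkStatus(link[0], link[1], "down")          # fail the chosen link
    client.wait()                                           # let the 30 s run finish
    server.terminate()
```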
Figure 14: Throughput for the different networks using network updates with tags. Here, a single link failure causes the drop after the 10th second.

Network   Correlation
Clos      0.94
B4        0.95
Telstra   0.92
EBONE     0.96
Exodus    0.94
Table 2: Correlation coefficient of the average throughput for the experiments in Figure 14 and Figure 15.

In order to gain more insights, we used Wireshark [35] to investigate the number of retransmissions (after the link-failure) for the Telstra, AT&T and EBONE network topologies. We observed that an increased share of the packets sent at the 11th second (after the link-failure) are retransmissions (Figure 16) or carry "BAD TCP" flags (Figure 17). This increase was from levels below 1% to levels between 10% and 15%, and it quickly de-escalated. We have also observed a much smaller presence of out-of-order packets (Figure 18). These phenomena (and the slight irregularity in the throughput) are related to the TCP congestion control protocol, which is TCP Reno [36].
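For reference, the per-second retransmission, "BAD TCP", and out-of-order counts can be extracted from a packet capture with tshark (Wireshark's command-line tool); the capture file name below is a placeholder, while the display filters are Wireshark's standard TCP-analysis filters.

```python
import subprocess

def count_packets(capture, display_filter):
    """Count packets in a pcap file that match a Wireshark display filter
    (sketch; assumes tshark is installed and capture names a pcap file)."""
    out = subprocess.run(["tshark", "-r", capture, "-Y", display_filter],
                         capture_output=True, text=True, check=True).stdout
    return len(out.splitlines())

retransmissions = count_packets("run.pcap", "tcp.analysis.retransmission")
bad_tcp = count_packets("run.pcap", "tcp.analysis.flags && !tcp.analysis.window_update")
out_of_order = count_packets("run.pcap", "tcp.analysis.out_of_order")
```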
Figure 15: Throughput for the different networks with no recovery after the link-failure. This experiment considers a single link failure that causes the drop after the 10th second.

Indeed, whenever congestion is suspected, Reno's fast recovery mechanism halves the current congestion window (while skipping the slow start mechanism).

Related Work.

The design of distributed SDN control planes has been studied intensively in the last few years [7, 31, 19, 53, 27, 13, 48], both for performance and robustness reasons. While we are not aware of any existing solution for our problem (supporting an in-band and distributed network control), there exists interesting work on bootstrapping connectivity in an OpenFlow network [49, 30] that does not consider self-stabilization. In contrast to our paper, Sharma et al. [49] do not consider how to support multiple controllers nor how to establish the control network. Moreover, their approach relies on switch support for traditional STP and requires modifying DHCP on the switches. We do consider multiple controllers and establish an in-band control network in a self-stabilizing manner. Katiyar et al. [30] suggest bootstrapping the control plane of SDN networks, supporting multiple controller associations and also non-SDN switches. However, the authors do not consider fault-tolerance.
Figure 16: Retransmission percentage for the packets sent at each second.

We provide a very strong notion of fault-tolerance, namely self-stabilization. To the best of our knowledge, our paper is the first to present a comprehensive model and a rigorous approach to the design of in-band distributed control planes providing self-stabilizing properties. As such, our approach complements much ongoing, often more applied, related research. In particular, our control plane can be used together with and support distributed systems such as ONOS [7], ONIX [31], ElastiCon [19], Beehive [53], Kandoo [27], and STN [13], to name a few. Our paper also provides missing links for the interesting work by Akella and Krishnamurthy [1], whose switch-to-controller and controller-to-controller communication mechanisms rely on strong primitives, such as consensus protocols, consistent snapshots, and reliable flooding, which are not currently available in OpenFlow switches. We also note that our approach is not limited to a specific technology, but offers flexibility and can be configured with additional robustness mechanisms, such as warm backups, local fast failover [41], or alternative spanning trees [10, 37]. Our paper also contributes to the active discussion of which functionality can and should be implemented in OpenFlow. DevoFlow [17] was one of the first works proposing a modification of the OpenFlow model, namely pushing the responsibility for most flows to the switches and adding efficient statistics collection mechanisms.
Figure 17: Percentage of "BAD TCP" flags during the recovery period that follows a single link failure.

SmartSouth [45] shows that in recent OpenFlow versions, interesting network functions (such as anycast or network traversals) can readily be implemented in-band. More closely related to our paper, [46] shows that it is possible to implement atomic read-modify-write operations on an OpenFlow switch, which can serve as a powerful synchronization and coordination primitive also for distributed control planes; however, such an atomic operation is not required in our system: a controller can claim a switch with a simple write operation. In this paper, we presented a first discussion of how to implement a strong notion of fault-tolerance, namely a self-stabilizing SDN [20, 18]. We are not the first to consider self-stabilization in the presence of faults that are not just transient (see [20], Chapter 6 and references therein). Thus far, however, these self-stabilizing algorithms consider networks in which all nodes can compute and communicate. In the context of the studied problem, some nodes, i.e., the switches, can merely forward packets according to rules that are decided by other nodes, i.e., the controllers. To the best of our knowledge, we are the first to demonstrate a rigorous proof for the existence of self-stabilizing algorithms for an SDN control plane. This proof uses a number of techniques, such as the one for assuring a bounded number of resets and illegitimate rule deletions, which, to the best of our knowledge, had not previously been used in the context of self-stabilizing bootstrapping of communication.
Figure 18: Percentage of out-of-order packets during the recovery period that follows a single link failure.
Bibliographic note.
We reported on preliminary insights on the design of in-band control planesin two short papers on
Medieval [46, 47]. However, Medieval is not self-stabilizing, because its design depends on the presence of non-corrupted configuration data, e.g., related to the controllers' IP addresses, which goes against the idea of self-stabilization. A self-organizing version of Medieval appeared in [15].
Renaissance provides a rigorous algorithm and proof of self-stabilization; it appeared as an extended abstract [16] and as a technical report [14].
Conclusion.

While the benefits of the separation between control and data planes have been studied intensively in the SDN literature, the important question of how to connect these planes has received less attention. This paper presented a first model and algorithm, as well as a detailed analysis and a proof-of-concept implementation of a self-stabilizing SDN control plane called Renaissance.
Θ(D) stabilization time variation (without memory adaptiveness)

Before concluding the paper, we would like to point out the existence of a straightforward Ω(D) lower bound for the studied task, which we match with an O(D) upper bound. Indeed, consider the case of a single controller that needs to construct at least one flow to every switch in the network. Starting from a system state in which no switch encodes any rule and the controller is unaware of the network topology, an in-band bootstrapping of this network cannot be achieved in fewer than Ω(D) frames, where D is the network diameter (even in the absence of any kind of failure).

We also present a variation of the proposed algorithm that provides no memory adaptiveness. In this variation, no controller ever removes rules installed by another controller (line 21). Instead, the variation simply relies on the memory management mechanism of the abstract switches (Section 2.1.1) to eventually remove stale rules (that were either installed by failing controllers or appeared in the starting system state). Recall that, since the switches have sufficient memory to store the rules of all controllers in P_C, this mechanism never removes any rule of a controller p_i ∈ P_C after the first time that p_i has refreshed its rules on that switch. Similarly, this variation of the algorithm neither removes managers (line 19) nor performs C-resets (line 26). Instead, these sets are implemented as constant-size queues, and similar memory management mechanisms eventually remove stale set items. We note the existence of bounds for these queues that ensure they have sufficient memory to store the needed non-failing managers and replies, namely maxManagers and 3 · maxRules, respectively.

Recall the conditions of Lemma 5, which assume that no C-resets and no illegitimate deletions occur during the system execution. It implies that the system reaches a legitimate state within ((∆_comm + ∆_synch) + 2)·D + 1 frames from the beginning of the system execution. However, the memory use after stabilization can be N_C/n_C times higher than that of the proposed algorithm. We note that Lemma 5's bound is asymptotically the same as the recovery time from benign faults (lemmas 7 and 8). Theorem 2 provides an upper bound for the proposed algorithm that is (((∆_comm + ∆_synch)·D + 1) · N_S + N_C + 1) times larger than that of the above variation with respect to the period it takes the system to reach a legitimate state. However, Theorem 2 considers arbitrary transient faults, which are rare. Thus, the fact that the recovery time of the proposed memory-adaptive solution is longer is relevant only in the presence of these rare faults.

We note that the proposed algorithm can serve as the basis for even more advanced solutions. In particular, while we have deliberately focused on the more challenging in-band control scenario only, we anticipate that our approach can also be used in networks that combine both in-band and out-of-band control, e.g., depending on the network sub-regions. Another possible extension is the use of a self-stabilizing reconfigurable replicated state machine [23, 24, 22] for coordinating the actions of the different controllers, similar to ONOS [7].
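To illustrate the constant-size queues mentioned above, the following sketch shows one way a bounded manager set with refresh-driven eviction could look; the class is purely illustrative (maxManagers plays the role of the bound), and it is not the mechanism's actual implementation.

```python
from collections import OrderedDict

class BoundedManagerSet:
    """Illustrative constant-size manager queue: re-inserting an entry refreshes
    it, and inserting into a full queue evicts the least recently refreshed
    entry, so stale entries are eventually removed."""

    def __init__(self, max_managers):
        self.max_managers = max_managers
        self._queue = OrderedDict()   # manager id -> latest payload

    def refresh(self, manager_id, payload=None):
        if manager_id in self._queue:
            self._queue.move_to_end(manager_id)      # refreshed: keep it fresh
        elif len(self._queue) >= self.max_managers:
            self._queue.popitem(last=False)          # evict the stalest entry
        self._queue[manager_id] = payload

    def members(self):
        return list(self._queue)
```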
Acknowledgments.
Part of this research was supported by the Danish Villum Fonden blokstipendier project
Reliable Computer Networks (ReNet). This research is (in part) supported by the European Union's Horizon 2020 research and innovation programme under the ENDEAVOUR project (grant agreement 644960). We are grateful to Michael Tran, Ivan Tannerud and Anton Lundgren for developing the prototype and for the many discussions. We are also thankful to Emelie Ekenstedt for helping to improve the presentation.
References

[1] Aditya Akella and Arvind Krishnamurthy. A Highly Available Software Defined Fabric, 2014.
[2] Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. A scalable, commodity data center network architecture. ACM SIGCOMM Computer Communication Review, 38(4):63–74, 2008.
[3] Noga Alon, Hagit Attiya, Shlomi Dolev, Swan Dubois, Maria Potop-Butucaru, and Sébastien Tixeuil. Practically stabilizing SWMR atomic memory in message-passing systems. J. Comput. Syst. Sci., 81(4):692–701, 2015.
[4] Efthymios Anagnostou, Ran El-Yaniv, and Vassos Hadzilacos. Memory adaptive self-stabilizing protocols (extended abstract). In WDAG, volume 647 of Lecture Notes in Computer Science, pages 203–220. Springer, 1992.
[5] Mina Tahmasbi Arashloo, Yaron Koral, Michael Greenberg, Jennifer Rexford, and David Walker. SNAP: stateful network-wide abstractions for packet processing. In Proceedings of the ACM SIGCOMM 2016 Conference, Florianopolis, Brazil, August 22-26, 2016, pages 29–43. ACM, 2016.
[6] Baruch Awerbuch, Shay Kutten, Yishay Mansour, Boaz Patt-Shamir, and George Varghese. A time-optimal self-stabilizing synchronizer using a phase clock. IEEE Trans. Dependable Sec. Comput., 4(3):180–190, 2007.
[7] Pankaj Berde, Matteo Gerola, Jonathan Hart, Yuta Higuchi, Masayoshi Kobayashi, Toshio Koide, Bob Lantz, Brian O'Connor, Pavlin Radoslavov, William Snow, and Guru M. Parulkar. ONOS: towards an open, distributed SDN OS. In Proceedings of the third workshop on Hot topics in software defined networking, HotSDN '14, Chicago, Illinois, USA, August 22, 2014, pages 1–6, 2014.
[8] Peva Blanchard, Shlomi Dolev, Joffroy Beauquier, and Sylvie Delaët. Practically self-stabilizing Paxos replicated state-machine. In NETYS, volume 8593 of Lecture Notes in Computer Science, pages 99–121. Springer, 2014.
[9] Michael Borokhovich, Liron Schiff, and Stefan Schmid. Provable data plane connectivity with local fast failover: introducing OpenFlow graph algorithms. In Proc. 3rd Workshop on Hot Topics in Software Defined Networking (HotSDN), pages 121–126, 2014.
[10] Michael Borokhovich, Liron Schiff, and Stefan Schmid. Provable Data Plane Connectivity with Local Fast Failover: Introducing OpenFlow Graph Algorithms, 2014.
[11] Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, and David Walker. P4: programming protocol-independent packet processors. Computer Communication Review, 44(3):87–95, 2014.
[12] Pat Bosshart, Glen Gibb, Hun-Seok Kim, George Varghese, Nick McKeown, Martin Izzard, Fernando A. Mujica, and Mark Horowitz. Forwarding metamorphosis: fast programmable match-action processing in hardware for SDN. In ACM SIGCOMM 2013 Conference, SIGCOMM'13, Hong Kong, China, August 12-16, 2013, pages 99–110. ACM, 2013.
[13] Marco Canini, Petr Kuznetsov, Dan Levin, and Stefan Schmid. A Distributed and Robust SDN Control Plane for Transactional Network Updates, 2015.
[14] Marco Canini, Iosif Salem, Liron Schiff, Elad Michael Schiller, and Stefan Schmid. Renaissance: Self-stabilizing distributed SDN control plane. CoRR, abs/1712.07697, 2017.
[15] Marco Canini, Iosif Salem, Liron Schiff, Elad Michael Schiller, and Stefan Schmid. A self-organizing distributed and in-band SDN control plane. In ICDCS, pages 2656–2657. IEEE Computer Society, 2017.
[16] Marco Canini, Iosif Salem, Liron Schiff, Elad Michael Schiller, and Stefan Schmid. Renaissance: A self-stabilizing distributed SDN control plane. In , pages 233–243. IEEE Computer Society, 2018.
[17] Andrew R. Curtis, Jeffrey C. Mogul, Jean Tourrilhes, Praveen Yalagandula, Puneet Sharma, and Sujata Banerjee. DevoFlow: Scaling Flow Management for High-performance Networks, 2011.
[18] Edsger W. Dijkstra. Self-stabilizing systems in spite of distributed control. Commun. ACM, 17(11):643–644, 1974.
[19] Advait Dixit, Fang Hao, Sarit Mukherjee, T.V. Lakshman, and Ramana Kompella. Towards an Elastic Distributed SDN Controller, 2013.
[20] Shlomi Dolev. Self-Stabilization. MIT Press, 2000.
[21] Shlomi Dolev, Swan Dubois, Maria Potop-Butucaru, and Sébastien Tixeuil. Stabilizing data-link over non-FIFO channels with optimal fault-resilience. Inf. Process. Lett., 111(18):912–920, 2011.
[22] Shlomi Dolev, Chryssis Georgiou, Ioannis Marcoullis, and Elad Michael Schiller. Self-stabilizing reconfiguration. In NETYS, volume 10299 of Lecture Notes in Computer Science, pages 51–68, 2017.
[23] Shlomi Dolev, Chryssis Georgiou, Ioannis Marcoullis, and Elad Michael Schiller. Practically-self-stabilizing virtual synchrony. J. Comput. Syst. Sci., 96:50–73, 2018.
[24] Shlomi Dolev, Chryssis Georgiou, Ioannis Marcoullis, and Elad Michael Schiller. Self-stabilizing Byzantine tolerant replicated state machine based on failure detectors. In CSCML, volume 10879 of Lecture Notes in Computer Science, pages 84–100. Springer, 2018.
[25] Shlomi Dolev, Ariel Hanemann, Elad Michael Schiller, and Shantanu Sharma. Self-stabilizing end-to-end communication in (bounded capacity, omitting, duplicating and non-FIFO) dynamic networks. In Proc. International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS), pages 133–147, 2012.
[26] Shlomi Dolev and Elad Schiller. Communication adaptive self-stabilizing group membership service. IEEE Trans. Parallel Distrib. Syst., 14(7):709–720, 2003.
[27] Soheil Hassas Yeganeh and Yashar Ganjali. Kandoo: A Framework for Efficient and Scalable Offloading of Control Applications, 2012.
[28] Jerry L. Hintze and Ray D. Nelson. Violin plots: a box plot-density trace synergism. The American Statistician, 52(2):181–184, 1998.
[29] Sushant Jain, Alok Kumar, Subhasree Mandal, Joon Ong, Leon Poutievski, Arjun Singh, Subbaiah Venkata, Jim Wanderer, Junlan Zhou, Min Zhu, Jon Zolla, Urs Hölzle, Stephen Stuart, and Amin Vahdat. B4: Experience with a Globally-Deployed Software Defined WAN, 2013.
[30] Rohit Katiyar, Prakash Pawar, Abhay Gupta, and Kotaro Kataoka. Auto-configuration of SDN switches in SDN/non-SDN hybrid network. In Proceedings of the Asian Internet Engineering Conference, AINTEC 2015, Bangkok, Thailand, November 18-20, 2015, pages 48–53. ACM, 2015.
[31] Teemu Koponen, Martin Casado, Natasha Gude, Jeremy Stribling, Leon Poutievski, Min Zhu, Rajiv Ramanathan, Yuichiro Iwata, Hiroaki Inoue, Takayuki Hama, and Scott Shenker. Onix: A Distributed Control Platform for Large-scale Production Networks, 2010.
[32] Leslie Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16(2):133–169, 1998.
[33] Junda Liu, Baohua Yang, Scott Shenker, and Michael Schapira. Data-driven network connectivity. In Proc. ACM HotNets, page 8, 2011.
[34] Open Networking Foundation. OpenFlow Switch Specification Version 1.3.4.
[35] Angela Orebaugh, Gilbert Ramirez, Jay Beale, and Joshua Wright. Wireshark & Ethereal Network Protocol Analyzer Toolkit. Syngress Publishing, 2007.
[36] Jitendra Padhye, Victor Firoiu, Donald F. Towsley, and James F. Kurose. Modeling TCP Reno performance: a simple model and its empirical validation. IEEE/ACM Trans. Netw., 8(2):133–145, 2000.
[37] Merav Parter. Dual failure resilient BFS structure. In Proceedings of the 2015 ACM Symposium on Principles of Distributed Computing, PODC 2015, Donostia-San Sebastián, Spain, July 21-23, 2015, pages 481–490. ACM, 2015.
[38] Radia Perlman. Interconnections (2nd Ed.): Bridges, Routers, Switches, and Internetworking Protocols. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2000.
[39] Radia J. Perlman. Fault-tolerant broadcast of routing information. Computer Networks, 7:395–405, 1983.
[40] Radia J. Perlman. An algorithm for distributed computation of a spanning tree in an extended LAN. In SIGCOMM '85, Proceedings of the Ninth Symposium on Data Communications, British Columbia, Canada, September 10-12, 1985, pages 44–53. ACM, 1985.
[41] Mark Reitblatt, Marco Canini, Arjun Guha, and Nate Foster. FatTire: Declarative Fault Tolerance for Software-defined Networks, 2013.
[42] Mark Reitblatt, Nate Foster, Jennifer Rexford, and David Walker. Consistent updates for software-defined networks: change you can believe in! In Tenth ACM Workshop on Hot Topics in Networks (HotNets-X), HOTNETS '11, Cambridge, MA, USA - November 14-15, 2011, page 7, 2011.
[43] Iosif Salem and Elad Michael Schiller. Practically-self-stabilizing vector clocks in the absence of execution fairness. CoRR, abs/1712.08205, 2017.
[44] Iosif Salem and Elad Michael Schiller. Practically-self-stabilizing vector clocks in the absence of execution fairness (best paper award). In , Lecture Notes in Computer Science, 2018.
[45] Liron Schiff, Michael Borokhovich, and Stefan Schmid. Reclaiming the Brain: Useful OpenFlow Functions in the Data Plane, 2014.
[46] Liron Schiff, Petr Kuznetsov, and Stefan Schmid. In-Band Synchronization for Distributed SDN Control Planes. SIGCOMM Comput. Commun. Rev., 46(1), January 2016.
[47] Liron Schiff, Stefan Schmid, and Marco Canini. Ground control to major faults: Towards a fault tolerant and adaptive SDN control network, 2016.
[48] Stefan Schmid and Jukka Suomela. Exploiting locality in distributed SDN control, 2013.
[49] Sachin Sharma, Dimitri Staessens, Didier Colle, Mario Pickavet, and Piet Demeester. In-Band Control, Queuing, and Failure Recovery Functionalities for OpenFlow. IEEE Network, 30(1), January 2016.
[50] Neil Spring, Ratul Mahajan, and Thomas Anderson. Quantifying the causes of path inflation, 2003.
[51] Neil Spring, Ratul Mahajan, David Wetherall, and Thomas Anderson. Measuring ISP topologies with Rocketfuel. IEEE/ACM Trans. Netw., 12(1):2–16, 2004.
[52] Ivan Tannerud, Anton Lundgren, and Michael Tran. Renaissance: a self-stabilizing distributed SDN control plane. Available via