007: Democratically Finding The Cause of Packet Drops (Extended Version)
Behnaz Arzani, Selim Ciraci, Luiz Chamon, Yibo Zhu, Hongqiang Liu, Jitu Padhye, Boon Thau Loo, Geoff Outhred
Microsoft Research, Microsoft, University of Pennsylvania
Abstract – Network failures continue to plague datacenter operators, as their symptoms may not have a direct correlation with where or why they occur. We introduce 007, a lightweight, always-on diagnosis application that can find problematic links and also pinpoint problems for each TCP connection. 007 is completely contained within the end host. During its two month deployment in a tier-1 datacenter, it detected every problem found by previously deployed monitoring tools while also finding the sources of other problems that previously went undetected.
007 has an ambitious goal: for every packet drop on a TCP flow in a datacenter, find the link that dropped the packet, and do so with negligible overhead and no changes to the network infrastructure. This goal may sound like overkill: after all, TCP is supposed to be able to deal with a few packet losses. Moreover, packet losses might occur due to congestion instead of network equipment failures. Even network failures might be transient. Above all, there is a danger of drowning in a sea of data without generating any actionable intelligence.

These objections are valid, but so is the need to diagnose "failures" that can result in severe problems for applications. For example, in our datacenters, VM images are stored in a storage service. When a VM boots, the image is mounted over the network. Even a small network outage or a few lossy links can cause the VM to "panic" and reboot. In fact, 17% of our VM reboots are due to network issues, and in over 70% of these none of our monitoring tools were able to find the links that caused the problem.

VM reboots affect customers and we need to understand their root cause. Any persistent pattern in such transient failures is a cause for concern and is potentially actionable. One example is silent packet drops [1]. These types of problems are nearly impossible to detect with traditional monitoring tools (e.g., SNMP). If a switch is experiencing these problems, we may want to reboot or replace it. These interventions are "costly" as they affect a large number of flows/VMs. Therefore, careful blame assignment is necessary. Naturally, this is only one example that would benefit from such a detection system.

There is a lot of prior work on network failure diagnosis, though none of the existing systems meets our ambitious goal. Pingmesh [1] sends periodic probes to detect failures and can leave "gaps" in coverage, as it must manage the overhead of probing. Also, since it uses out-of-band probes, it cannot detect failures that affect only in-band data. Roy et al.
[2] monitor all paths to detect failures but require modifications to routers and special features in the switch (§10). Everflow [3] can be used to find the location of packet drops, but it would require capturing all traffic and is not scalable. We asked our operators what would be the most useful solution for them. Responses included: "In a network with this many links, it's a reasonable assumption that there is a non-zero chance that a number (>10) of these links are bad (due to device, port, or cable, etc.) and we cannot fix them simultaneously. Therefore, fixes need to be prioritized based on customer impact. However, currently we do not have a direct way to correlate customer impact with bad links." This shows that current systems do not satisfy operator needs, as they do not provide application- and connection-level context.

To address these limitations, we propose 007, a simple, lightweight, always-on monitoring tool. 007 records the path of TCP connections (flows) suffering from one or more retransmissions and assigns proportional "blame" to each link on the path. It then provides a ranking of links that represents their relative drop rates. Using this ranking, it can find the most likely cause of drops in each TCP flow.

007 has several noteworthy properties. First, it does not require any changes to the existing networking infrastructure. Second, it does not require changes to the client software: the monitoring agent is an independent entity that sits on the side. Third, it detects in-band failures. Fourth, it continues to perform well in the presence of noise (e.g., lone packet drops). Finally, its overhead is negligible.

While the high-level design of 007 appears simple, the practical challenges of making 007 work and the theoretical challenge of proving that it works are non-trivial. For example, its path discovery is based on a traceroute-like approach. Due to the use of ECMP, traceroute packets have to be carefully crafted to ensure that they follow the same path as the TCP flow. Also, we must ensure that we do not overwhelm routers by sending too many traceroutes (traceroute responses are handled by the control-plane CPUs of routers, which are quite puny). Thus, we need to ensure that our sampling strikes the right balance between accuracy and the overhead on the switches.
On the theoretical side, we are able to show that 007's simple blame assignment scheme is highly accurate even in the presence of noise. We make the following contributions: (i) we design 007, a simple, lightweight, and yet accurate fault localization system for datacenter networks; (ii) we prove that 007 is accurate without imposing excessive burden on the switches; (iii) we prove that its blame assignment scheme correctly finds the failed links with high probability; and (iv) we show how to tackle numerous practical challenges involved in deploying 007 in a real datacenter. Our results from a two month deployment of 007 in a datacenter show that it finds all problems found by other previously deployed monitoring tools while also finding the sources of problems for which information is not provided by these monitoring tools.
007 aims to identify the cause of retransmissions with high probability. It is driven by two practical requirements: (i) it should scale to datacenter-size networks and (ii) it should be deployable in a running datacenter with as little change to the infrastructure as possible. Our current focus is mainly on analyzing infrastructure traffic, especially connections to services such as storage, as failures there can have severe consequences (see §1, [4]). Nevertheless, the same mechanisms can be used in other contexts as well (see §9). We deliberately include congestion-induced retransmissions: if episodes of congestion, however short-lived, are common on a link, we want to be able to flag them. Of course, in practice, any such system needs to deal with a certain amount of noise, a concept we formalize later.

There are a number of ways to find the cause of packet drops. One can monitor switch counters, but these are inherently unreliable [5], and monitoring thousands of switches at a fine time granularity is not scalable. One can use new hardware capabilities to gather more useful information [6], but correlating this data with each retransmission reliably is difficult. Furthermore, time is needed until such hardware is production-ready and switches are upgraded. Complicating matters, operators may be unwilling to incur the expense and overhead of such changes [4]. One can use PingMesh [1] to send probe packets and monitor link status. Such systems suffer from a probing-rate trade-off: sending too many probes creates unacceptable overhead, whereas reducing the probing rate leaves temporal and spatial gaps in coverage. More importantly, the probe traffic does not capture what the end user and TCP flows see. Instead, we choose to use data traffic itself as probe traffic. Using data traffic has the advantage that the system introduces little to no monitoring overhead.

As one might expect, almost all traffic in our datacenters is TCP traffic. One way to monitor TCP traffic is to use a system like Everflow.
Everflow inserts a special tag in every packet and has the switches mirror tagged packets to special collection servers. Thus, if a tagged packet is dropped, we can determine the link on which it happened. Unfortunately, there is no way to know in advance which packet is going to be dropped, so we would have to tag and mirror every TCP packet. This is clearly infeasible. We could tag only a fraction of packets, but doing so would result in another sampling-rate trade-off. Hence, we choose to rely on some form of network tomography [7, 8, 9]. We can take advantage of the fact that TCP is a connection-oriented, reliable delivery protocol, so that any packet loss results in retransmissions that are easy to detect.

If we knew the path of all flows, we could set up an optimization to find which link dropped the packet. Such an optimization would minimize the number of "blamed" links while simultaneously explaining the cause of all drops. Indeed, past approaches such as MAX COVERAGE and Tomo [10, 11] aim to approximate the solution of such an optimization (see §12 for an example). There are problems with this approach: (i) the optimization is NP-hard [12], and solving it at datacenter scale is infeasible; (ii) tracking the path of every flow in the datacenter is not scalable in our setting. We could use alternative solutions such as Everflow or the approach of [2] to track the path of SYN packets, but both rely on making changes to the switches. The only way to find the path of a flow without any special infrastructure support is to employ something like traceroute. Traceroute relies on getting ICMP TTL-exceeded messages back from the switches. These messages are generated by the control plane, i.e., the switch CPU. To avoid overloading the CPU, our administrators have capped the rate of ICMP responses to 100 per second. This severely limits the number of flows we can track. Given these limitations, what can we do?
We analyzed the drop patterns in two of our datacenters and found that, typically, when there are packet drops, multiple flows experience drops. We show this in Figure 1a for TCP flows in production datacenters. The figure shows the number of flows experiencing drops in the datacenter conditioned on the total number of packets dropped in that datacenter in 30 second intervals. The data spans one day. We see that the more packets are dropped in the datacenter,
Figure 1: Observations from a production network: (a) CDF of the number of flows with at least one retransmission; (b) CDF of the fraction of drops belonging to each flow in each 30 second interval.
Figure 2: Overview of 007 architecture.
the more flows experience drops; 95% of the time, at least 3 flows see drops when we condition on intervals with ≥ 10 drops. We condition on the ≥ 10 case because lower values mostly capture noisy drops due to one-off packet drops by healthy links. In most cases drops are distributed across flows and no single flow sees more than 40% of the total packet drops. This is shown in Figure 1b (we have discarded all flows with 0 drops and cases where the total number of drops was less than 10). We see that in ≥ 80% of cases, no single flow captures more than 34% of all drops.

Based on these observations and the high path diversity in datacenter networks [13], we show that if we: (a) only track the path of those flows that have retransmissions, (b) assign each link on the path of such a flow a vote of 1/h, where h is the path length, and (c) sum up the votes during a given period, then the top-voted links are almost always the ones dropping packets (see §5)! Unlike the optimization, our scheme is able to provide a ranking of the links in terms of their drop rates, i.e., if link A has a higher vote than link B, it is also dropping more packets (with high probability). This gives us a heat-map of our network which highlights the links with the most impact on a given application/customer (because we know which links impact a particular flow).

Figure 2 shows the overall architecture of 007. It is deployed alongside other applications on each end-host as a user-level process running in the host OS. 007 consists of three agents, responsible for TCP monitoring, path discovery, and analysis.

The TCP monitoring agent detects retransmissions at each end-host. The Event Tracing for Windows (ETW) [14] framework (similar functionality exists in Linux) notifies the agent as soon as an active flow suffers a retransmission. Upon a retransmission, the monitoring agent triggers the path discovery agent (§4), which identifies the flow's path to the destination IP (DIP). At the end-hosts, a voting scheme (§5) is used based on the paths of flows that had retransmissions. At regular intervals of 30 s, the votes are tallied by a centralized analysis agent to find the top-voted links. Although we use an aggregation interval of 30 s, failures do not have to last for 30 s.

007's implementation consists of 6000 lines of C++ code. Its memory usage never goes beyond 600 KB on any of our production hosts, its CPU utilization is minimal (1-3%), and its bandwidth utilization due to traceroute is minimal (a maximum of 200 KBps per host). 007 is proven to be accurate (§5) in typical datacenter conditions (a full description of the assumed conditions can be found in §9).

The path discovery agent uses traceroute packets to find the path of flows that suffer retransmissions. These packets are used solely to identify the path of a flow; they do not need to be dropped for 007 to operate. We first ensure that the number of traceroutes sent by the agent does not overload our switches (§4.1). Then, we briefly describe the key engineering issues and how we solve them (§4.2).
Generating ICMP packets in response to traceroute consumes switch CPU, which is a valuable resource. In our network, there is a cap of T_max = 100 on the number of ICMP messages a switch can send per second. To ensure that the traceroute load does not exceed T_max, we start by noticing that only a small fraction of flows go through tier-3 (T3) switches. Indeed, after monitoring all TCP flows in our network for one hour, only 2.1% went through a T3 switch. Thus we can ignore T3 switches in our analysis. Given that our network is a Clos topology and assuming that hosts under a top of the rack switch (ToR) communicate with hosts under a different ToR uniformly at random (see §6 for when this is not the case):

Theorem 1. The rate of ICMP packets sent by any switch due to traceroute is below T_max if the rate C_t at which hosts send traceroutes is upper bounded as

    C_t ≤ (T_max / (n_0 H)) · min[ n_1, n_2 (n_0 n_pod − n_0) / (n_1 (n_pod − 1)) ],    (1)

where n_0, n_1, and n_2 are the numbers of ToR, T1, and T2 switches respectively, n_pod is the number of pods, and H is the number of hosts under each ToR.

See §12 for the proof. The upper bound on C_t in our datacenters is 10. As long as hosts do not have more than 10 flows with retransmissions per second, we can guarantee that the number of traceroutes sent by 007 will not go above T_max. We use C_t as a threshold to limit the traceroute rate of each host. Note that there are two independent rate limits, one set at the host by 007 and the other set by the network operators on the switch (T_max). Additionally, the agent triggers path discovery for a given connection no more than once every epoch, to further limit the number of traceroutes. We will show in §5 that this number is sufficient to ensure high accuracy.

As in most datacenters, our network also uses ECMP. All packets of a given flow, defined by the five-tuple, follow the same path [15]. Thus, traceroute packets must have the same five-tuple as the flow we want to trace. To ensure this, we must account for load balancers. TCP connections are initiated in our datacenter in a way similar to that described in [16]. The connection is first established to a virtual IP (VIP), and the SYN packet (containing the VIP as destination) goes to a software load balancer (SLB) which assigns that flow to a physical destination IP (DIP) and a service port associated with that VIP.
The SLB then sends a configuration message to the virtual switch (vSwitch) in the hypervisor of the source machine that registers that DIP with that vSwitch. All subsequent packets in that flow have the DIP as their destination and do not go through the SLB. For the path of the traceroute packets to match that of the data packets, their headers should contain the DIP and not the VIP. Thus, before tracing the path of a flow, the path discovery agent first queries the SLB for the VIP-to-DIP mapping for that flow. An alternative is to query the vSwitch; however, in the instances where the failure also results in connection termination, the mapping may be removed from the vSwitch table. It is therefore more reliable to query the SLB. Note that there are cases where the TCP connection establishment itself may fail due to packet loss. Path discovery is not triggered for such connections. It is also not triggered when the query to the SLB fails, to avoid tracerouting the Internet.
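The two host-side limits described in §4.1 (a per-second traceroute cap C_t and at most one path discovery per connection per epoch) can be sketched as follows; the class, parameter names, and defaults are hypothetical, not 007's actual code:

```python
import time

class TracerouteLimiter:
    """Sketch of 007's host-side limits: at most `ct_per_sec` traceroutes
    per second (the C_t bound of Theorem 1) and at most one path
    discovery per flow per epoch. `now` is injectable for testing."""
    def __init__(self, ct_per_sec=10, epoch_secs=30, now=time.monotonic):
        self.ct = ct_per_sec
        self.epoch_secs = epoch_secs
        self.now = now
        self.window_start = now()       # start of the current 1 s window
        self.sent_this_window = 0
        self.epoch_start = self.window_start
        self.traced_this_epoch = set()  # flows already traced this epoch

    def allow(self, five_tuple):
        t = self.now()
        if t - self.epoch_start >= self.epoch_secs:  # new epoch: forget flows
            self.traced_this_epoch.clear()
            self.epoch_start = t
        if t - self.window_start >= 1.0:             # new 1 s window
            self.sent_this_window = 0
            self.window_start = t
        if five_tuple in self.traced_this_epoch:     # once per flow per epoch
            return False
        if self.sent_this_window >= self.ct:         # per-second cap C_t
            return False
        self.traced_this_epoch.add(five_tuple)
        self.sent_this_window += 1
        return True
```

Both checks are local to the host; the switch-side cap T_max is enforced independently by the operators, as the text notes.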
Re-routing and packet drops. Traceroute itself may fail. This may happen if the link drop rate is high or due to a blackhole. This actually helps us, as it directly pinpoints the faulty link, and our analysis engine (§5) is able to use such partial traceroutes. A more insidious possibility is that routing may change by the time traceroute starts. We use BGP in our datacenter, and a lossy link may cause one or more BGP sessions to fail, triggering rerouting. Then, the traceroute packets may take a different path than the original connection. However, RTTs in a datacenter are typically less than 1 or 2 ms, so TCP retransmits a dropped packet quickly. The ETW framework notifies the monitoring agent immediately, which invokes the path discovery agent. The only additional delay is the time required to query the SLB to obtain the VIP-to-DIP mapping, which is typically less than a millisecond. Thus, as long as paths are stable for a few milliseconds after a packet drop, the traceroute packets will follow the same path as the flow and the probability of error is low. Past work has shown this to be usually the case [17]. Our network also makes use of link aggregation (LAG) [18]. However, unless all the links in the aggregation group fail, the L3 path is not affected.
Router aliasing [19]. This problem is easily solved in a datacenter, as we know the topology, names, and IPs of all routers and interfaces. We can simply map the IPs from the traceroutes to the switch names.

To summarize, 007's path discovery implementation is as follows: once the TCP monitoring agent notifies the path discovery agent that a flow has suffered a retransmission, the path discovery agent checks its cache of discovered paths for that epoch and, if need be, queries the SLB for the DIP. It then sends 15 appropriately crafted TCP packets with TTL values ranging from 0–15. In order to disambiguate the responses, the TTL value is also encoded in the IP ID field [20]. This allows for concurrent traceroutes to multiple destinations. The TCP packets deliberately carry a bad checksum so that they do not interfere with the ongoing connection.
Here, we describe 007's analysis agent, focusing on its voting-based scheme. We also present alternative NP-hard optimization solutions for comparison.

In each epoch, every link on the path of a flow is voted good if the flow had no retransmissions and bad otherwise. Each vote has a value that is tallied at the end of every epoch, providing a natural ranking of the links. We set the value of good votes to 0 (if a flow has no retransmission, no traceroute is needed). Bad votes are assigned a value of 1/h, where h is the number of hops on the path, since each link on the path is equally likely to be responsible for the drop.

The ranking obtained after compiling the votes allows us to identify the most likely cause of drops on each flow: links ranked higher have higher drop rates (Theorem 2). To further guard against high levels of noise, we can use our knowledge of the topology to adjust the links' votes. Namely, we iteratively pick the most-voted link l_max and estimate the portion of votes obtained by all other links due to failures on l_max. This estimate is obtained for each link k by (i) assuming all flows having retransmissions and going through l_max had drops due to l_max and (ii) finding what fraction of these flows go through k by assuming ECMP distributes flows uniformly at random. Our evaluations showed that this results in a 5% reduction in false positives.

Algorithm 1 Finding the most problematic links in the network.
  L ← set of all links
  P ← set of all possible paths
  v(l_i) ← number of votes for l_i ∈ L
  B ← set of most problematic links, initially ∅
  l_max ← link with maximum votes in L ∩ B^c
  while v(l_max) ≥ 0.01 · Σ_{l_i ∈ L} v(l_i) do
      l_max ← argmax_{l_i ∈ L ∩ B^c} v(l_i)
      B ← B ∪ {l_max}
      for l_i ∈ L ∩ B^c do
          if ∃ p_i ∈ P s.t. l_i ∈ p_i and l_max ∈ p_i then
              adjust the score of l_i
          end if
      end for
  end while
  return B
007 can also be used to detect failed links using Algorithm 1. The algorithm sorts the links based on their votes and uses a threshold to determine if there are problematic links. If so, it adjusts the votes of all other links and repeats until no link has votes above the threshold. In Algorithm 1, we use a threshold of 1% of the total votes cast, based on a parameter sweep where we found that it provides a reasonable trade-off between precision and recall: higher values reduce false positives but increase false negatives. Here we have focused on detecting link failures. 007 can also be used to detect switch failures in a similar fashion by applying votes to switches instead of links. This is beyond the scope of this work.
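A compact sketch of Algorithm 1's outer loop, with a simplified score adjustment (the votes of every flow that crosses the chosen link are discarded outright, a cruder version of the fractional re-estimation described above); names and the `paths` input format are illustrative:

```python
def most_problematic_links(paths, threshold=0.01):
    """Simplified sketch of Algorithm 1. `paths` is the list of
    link-paths of flows that saw retransmissions this epoch. Links are
    repeatedly picked by top vote until no remaining link holds at
    least `threshold` (1% by default) of the total votes cast."""
    def tally(ps):
        votes = {}
        for p in ps:
            for link in p:
                votes[link] = votes.get(link, 0.0) + 1.0 / len(p)
        return votes

    bad, remaining = [], list(paths)
    total = sum(tally(paths).values())       # total votes cast this epoch
    while remaining:
        votes = tally(remaining)
        l_max = max(votes, key=votes.get)
        if votes[l_max] < threshold * total:
            break
        bad.append(l_max)
        # Crude adjustment: drop all flows that crossed the blamed link.
        remaining = [p for p in remaining if l_max not in p]
    return bad
```

With a suitably chosen threshold, a link shared by many failing flows is flagged while links that only appear on one noisy path fall below the cutoff.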
Can 007 deliver on its promise of finding the most probable cause of packet drops on each flow? This is not trivial. In its voting scheme, failed connections contribute votes to good links as well as bad ones. Moreover, in a large datacenter such as ours, occasional, lone, and sporadic drops can and will happen due to good links. These failures are akin to noise and can cause severe inaccuracies in any detection system [21], 007 included. We show that the likelihood of 007 making these errors is small. Given our topology (Clos):
Theorem 2.
For n_pod ≥ n_0/n_2 + 1, 007 will find with probability 1 − e^{−O(N)} the k < n_2 (n_0 n_pod − n_0) / (n_1 (n_pod − 1)) bad links that drop packets with probability p_b among good links that drop packets with probability p_g if

    p_g ≤ (n_u α)^{−1} [1 − (1 − p_b)^{n_l}],

where N is the total number of flows between hosts, n_l and n_u are lower and upper bounds, respectively, on the number of packets per connection, and

    α = n_1 (4 n_0 − k)(n_pod − 1) / (n_2 (n_0 n_pod − n_0) − n_1 (n_pod − 1) k).    (2)

The proof is deferred to the appendices due to space constraints. Theorem 2 states that, under mild conditions, links with higher drop rates are ranked higher by 007. Since a single flow is unlikely to go through more than one failed link in a network with thousands of links, this allows 007 to find the most likely cause of packet drops on each flow.

A corollary of Theorem 2 is that in the absence of noise (p_g = 0), 007 can find all bad links with high probability. In the presence of noise, 007 can still identify the bad links as long as the probability of dropping packets on non-failed links is low enough, i.e., as long as the signal-to-noise ratio is large enough. This is compatible with typical values found in practice. As an example, let n_l and n_u be the 10th and 90th percentiles, respectively, of the number of packets sent by TCP flows across all hosts in a 3 hour period. Then, for the failure drop rates p_b seen in practice, the tolerable noise level p_g remains well above the drop rates of healthy links, which in a production datacenter are typically orders of magnitude lower [22].

Another important consequence of Theorem 2 is that it establishes that the probability of errors in 007's results diminishes exponentially with N, so that even with the limits imposed by Theorem 1 we can accurately identify the failed links. The conditions in Theorem 2 are sufficient but not necessary. In fact, §6 shows how well 007 performs even when the conditions in Theorem 2 do not hold.

One of the advantages of 007's voting scheme is its simplicity.
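The noise condition of Theorem 2 is easy to evaluate numerically. The sketch below treats α, n_l, and n_u as given parameters; it simply evaluates the bound, it is not part of 007:

```python
def max_tolerable_noise(p_b, n_l, n_u, alpha):
    """Largest good-link drop rate p_g for which Theorem 2's condition
    p_g <= (n_u * alpha)^{-1} * (1 - (1 - p_b)**n_l) still holds.
    The stronger the failure signal (higher p_b, longer flows n_l),
    the more noise 007 tolerates."""
    return (1.0 - (1.0 - p_b) ** n_l) / (n_u * alpha)
```

For instance, increasing the bad-link drop rate p_b strictly increases the amount of noise the scheme can absorb, which matches the signal-to-noise reading of the theorem.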
Given additional time and resources, we may consider searching for the optimal set of failed links by finding the most likely cause of drops given the available evidence. For instance, we can find the smallest number of links that explain all failures, as we know the flows that had packet drops and their paths. This can be written as an optimization problem we call the binary program. Explicitly,

    minimize ‖p‖_0  subject to  Ap ≥ s,  p ∈ {0, 1}^L,    (3)

where A is a C × L routing matrix; s is a C × 1 vector (an entry of s is 1 if the corresponding connection experienced at least one retransmission and 0 otherwise); L is the number of links; C is the number of connections in an epoch; and ‖p‖_0 denotes the number of nonzero entries of the vector p. Indeed, if the solution of (3) is p⋆, then the i-th element of p⋆ indicates whether the binary program estimates that link i failed.

Problem (3) is the NP-hard minimum set covering problem [23] and is intractable. Its solutions can be approximated greedily as in MAX COVERAGE or Tomo [10, 11] (see appendix). For benchmarking, we compare 007 to the true solution of (3) obtained by a mixed-integer linear program (MILP) solver [24]. Our evaluations showed that 007 (Algorithm 1) significantly outperforms this binary optimization (by more than 50% in the presence of noise). We illustrate this point in Figures 4 and 10, but otherwise omit results for this optimization in §6 for clarity.

The binary program (3) does not provide a ranking of links. We also consider a solution in which we determine the number of packets dropped by each link, thus creating a natural ranking. The integer program can be written as

    minimize ‖p‖_0  subject to  Ap ≥ c,  ‖p‖_1 = ‖c‖_1,  p_i ∈ N ∪ {0},    (4)

where N is the set of natural numbers and c is a C × 1 vector whose entries are the number of retransmissions of each connection. The solution p⋆ of (4) represents the number of packets dropped by each link, which provides a ranking. The constraint ‖p‖_1 = ‖c‖_1 ensures each failure is explained only once. As with (3), this problem is NP-hard [12] and is only used as a benchmark.
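A greedy approximation of the binary program, in the spirit of MAX COVERAGE and Tomo [10, 11], can be sketched as follows (an illustration, not the benchmarked implementations):

```python
def greedy_cover(failed_flow_paths):
    """Greedy set cover over failed flows: repeatedly pick the link
    that explains the most still-unexplained failures, until every
    failed flow is covered by at least one blamed link."""
    uncovered = {i: set(p) for i, p in enumerate(failed_flow_paths)}
    chosen = []
    while uncovered:
        # Count how many uncovered flows each link would explain.
        counts = {}
        for links in uncovered.values():
            for l in links:
                counts[l] = counts.get(l, 0) + 1
        best = max(counts, key=counts.get)
        chosen.append(best)
        uncovered = {i: ls for i, ls in uncovered.items()
                     if best not in ls}
    return chosen
```

The greedy choice gives the classic logarithmic approximation to minimum set cover, but, as the text notes, it inherits the binary program's weaknesses: no ranking of links and high sensitivity to noisy lone drops.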
As it uses more information than the binary program (the number of failures), (4) performs better (see §6).

In the next three sections, we present our evaluation of 007 in simulations (§6), in a test cluster (§7), and in one of our production datacenters (§8). We start by evaluating in simulations, where we know the ground truth. 007 first finds flows whose drops were due to noise and marks them as "noise drops". It then finds the link most likely responsible for drops on the remaining set of flows ("failure drops"). A noisy drop is defined as one where the corresponding link only dropped a single packet. 007 never marked a connection as noisy incorrectly. We therefore focus on the accuracy for connections that 007 puts into the failure drop class.
Performance metrics.
Our measure for the performance of 007 is accuracy, which is the proportion of correctly identified drop causes. For evaluating Algorithm 1, we use recall and precision. Recall is a measure of reliability and shows how many of the failures 007 can detect (false negatives). For example, if there are 100 failed links and 007 detects 90 of them, its recall is 90%. Precision is a measure of accuracy and shows to what extent 007's results can be trusted (false positives). For example, if 007 flags 100 links as bad, but only 90 of those links actually failed, its precision is 90%.
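These two metrics can be computed as follows (a small helper for illustration, not part of 007):

```python
def precision_recall(flagged, truly_bad):
    """Precision: fraction of flagged links that are truly bad
    (penalizes false positives). Recall: fraction of truly bad links
    that were flagged (penalizes false negatives)."""
    flagged, truly_bad = set(flagged), set(truly_bad)
    tp = len(flagged & truly_bad)                  # true positives
    precision = tp / len(flagged) if flagged else 1.0
    recall = tp / len(truly_bad) if truly_bad else 1.0
    return precision, recall
```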
Simulation setup.
We use a flow-level simulator [25] implemented in MATLAB. Our topology consists of 4160 links, 2 pods, and 20 ToRs per pod. Each host establishes 2 connections per second to a random ToR outside of its rack. The simulator has two types of links. For good links, packets are dropped at a very low rate, chosen uniformly at random, to simulate noise. Failed links have a higher drop rate, to simulate failures. We omit results for MAX COVERAGE and Tomo as they are approximations of the binary program (see [27]).

The bounds of Theorem 2 are sufficient (not necessary) conditions for accuracy. We first validate that 007 can achieve high levels of accuracy as expected when these bounds hold. We set the drop rates on the failed links so that the conditions of Theorem 2 are satisfied.
Figure 3: When Theorem 2 holds.
Figure 4: Algorithm 1 when Theorem 2 holds.
We refer the reader to [2] for why these drop rates are reasonable.
Accuracy.
Figure 3 shows that 007 has an average accuracy higher than 96% in almost all cases. Due to its robustness to noise, it also outperforms the optimization algorithm (§5.3) in most cases.
Recall & precision.
Figure 4 shows that even when failed links have low packet drop rates, 007 detects them with high recall/precision.

We proceed to evaluate 007's accuracy when the bounds in Theorem 2 do not hold. This shows that these conditions are not necessary for good performance.
Our next experiment aims to push the boundaries of Theorem 2 by varying the failed links' drop rates below the conservative bounds of Theorem 2.
Single Failure.
Figure 5a shows results for different drop rates on a single failed link. It shows that 007 can find the cause of drops on each flow with high accuracy. Even as the drop rate decreases below the bounds of Theorem 2, we see that 007 can maintain accuracy on par with the optimization.
Multiple Failures.
Figure 5b shows that 007 is successful at finding the link responsible for a drop even when links have very different drop rates. Prior work has reported the difficulty of detecting such cases [2]. However, 007's accuracy remains high.
We vary noise levels by changing the drop rate of good links. We see that higher noise levels have little impact on 007's ability to find the cause of drops on individual flows (Figure 6a).
Multiple Failures.
We repeat this experiment for thecase of 5 failed links. Figure 6b shows the results. A cc u r a c y (a) Single failure0 0.2 0.4 0.6 0.8 1 Packet drop rate, (%) A cc u r a c y (b) Multiple failures6 10 14 Number of failed links
Figure 5: 007’s accuracy for varying drop rates. A cc u r a c y (a) Single failure Packet drop rate, A cc u r a c y (b) Multiple failures007Integer optimization Packet drop rate,
Figure 6: 007’s accuracy for varying noise levels.007 shows little sensitivity to the increase in noisewhen finding the cause of per-flow drops. Note thatthe large confidence intervals of the optimization isa result of its high sensitivity to noise.
In previous experiments, hosts opened 60 connections per epoch. Here, we allow hosts to choose the number of connections they create per epoch uniformly at random, with as few as 10 connections per epoch.

Single Failure.
Figure 7a shows the results. 007 accurately finds the cause of packet drops on each connection. It also outperforms the optimization when the failed link has a low drop rate. This is because the optimization has multiple optimal points and is not sufficiently constrained.
Multiple Failures.
Figure 7b shows the results for multiple failures. The optimization suffers from a lack of information to constrain the set of results; it therefore has a large variance (wide confidence intervals). 007, on the other hand, maintains a high probability of detection no matter the number of failures.
We next demonstrate 007's ability to detect the cause of drops even under heavily skewed traffic. We pick 10 ToRs at random (25% of the ToRs). To skew the traffic, 80% of the flows have destinations set to hosts under these 10 ToRs. The remaining flows are routed to randomly chosen hosts. Figure 8a shows that the optimization is much more heavily impacted by the skew than 007. 007 continues to detect the cause of drops with high probability.
Figure 7: Varying the number of connections.
Figure 8: 007's accuracy under skewed traffic.

Multiple Failures.
We repeated the above for multiple failures. Figure 8b shows that the optimization's accuracy suffers. It consistently shows a low detection rate, as its constraints are not sufficient to guide the optimizer to the right solution. 007 maintains a detection rate of ≥ 98% at all times.
Hot ToR.
A special instance of traffic skew occurs in the presence of a single hot ToR which acts as a sink for a large number of flows. Figure 9 shows how 007 performs in these situations. 007 can tolerate up to 50% skew, i.e., 50% of all flows going to the hot ToR, with negligible accuracy degradation. However, skews above 50% negatively impact its accuracy in the presence of a large number of failures.

In our previous experiments, we focused on 007's accuracy on a per-connection basis. In our next experiment, we evaluate its ability to detect bad links.
Single Failure.
Figure 10 shows the results. 007 outperforms the optimization, as it does not require a fully specified set of equations to provide a best guess as to which links failed. We also evaluate the impact of failure location on our results (Figure 11).

Figure 9: Impact of a hot ToR on 007's accuracy (accuracy vs. number of failures, for 10%, 30%, 50%, and 70% skew).

Figure 10: Algorithm 1 with single failure. (a) Precision (%) and (b) Recall (%) vs. packet drop rate (%), for 007, the integer optimization, and the binary optimization.

Figure 11: Impact of link location on Algorithm 1 (ToR-T1, T1-T2, T2-T1, and T1-ToR failures).
Multiple Failures.
We heavily skew the drop rates on the failed links. Specifically, at least one failed link has a drop rate between 10 and 100%, while all other failed links have much lower drop rates. Had the top k links been selected, 007's recall would have been close to 100%.

Figure 12: Algorithm 1 with multiple failures. The drop rates on the links are heavily skewed. (a) Precision (%) and (b) Recall (%) vs. number of failed links.

Finally, we evaluate 007 in larger networks. Its accuracy when finding a single failure was 98%, 92%, 91%, and 90% on average in a network with 1, 2, 3, and 4 pods respectively. In contrast, the optimization had an average accuracy of 94%, 72%, 79%, and 77% respectively. Algorithm 1 continues to have Recall ≥ 98% for up to 6 pods (it drops to 85% for 7 pods). Precision remains 100% for all pod sizes. We also evaluate both 007's and the optimization's ability to find the cause of per-flow drops when the number of failed links is ≥ 30. We observe that both approaches' performance remained unchanged for the most part, e.g., the accuracy of 007 in an example with 30 failed links is 98%.

We next evaluate 007 in the more realistic environment of a test cluster with 10 ToRs and a total of 80 links. We control 50 hosts in the cluster, while the others are production machines. Therefore, the T1 switches see real production traffic. We recorded 6 hours of traffic from a host in production and replayed it from our hosts in the cluster (with different starting times). Using Everflow-like functionality [3] on the ToR switches, we induced different rates of drops on T1-to-ToR links. Our goal is to find the cause of packet drops on each flow (§7.2) and to validate whether Algorithm 1 works in practice (§7.3).

We first validate a clean testbed environment. We repave the cluster by setting all devices to a clean state. We then run 007 without injecting any failures. We see that in the newly-repaved cluster, links arriving at a particular ToR switch had abnormally high vote tallies, approximately 22 on average. We thus suspected that this ToR was experiencing problems. After rebooting it, the total votes of the links went down to 0, validating our suspicions. This exercise also provides one example of when 007 is extremely effective at identifying links with low drop rates.
Can 007 identify the cause of drops when links have very different drop rates? To find out, we induce drop rates of 0.2% and 0.05% on two different links for an hour. We only know the ground truth when a flow goes through at least one of the two failed links; thus, we only consider such flows. 007 correctly identified the culprit link for roughly 90% of these flows.

Figure 13: Empirical CDF of the difference between votes on bad links and the maximum vote on good links ([Bad link votes] − [Maximum good link votes]) for different bad link drop rates (1%, 0.5%, and 0.05%).
We next validate Algorithm 1 and its ability to detect failed links. We inject different drop rates on a chosen link and determine whether there is a correlation between total votes and drop rates. Specifically, we look at the difference between the vote tally on the bad link and that of the most-voted good link. We induced packet drop rates of 1%, 0.5%, and 0.05% on a T1-to-ToR link in the test cluster. Figure 13 shows the distribution for the various drop rates. The failed link has the highest vote out of all links when the drop rate is 1% or 0.5%. At a drop rate of 0.05%, it has the highest vote in 89% of the instances (mostly due to occasional lone drops on healthy links); however, it is always one of the 2 links with the highest votes. Figure 13 also shows the high correlation between the probability of packet drop on a link and its vote tally. This trivially shows that 007 is 100% accurate in finding the cause of packet drops on each flow given a single link failure: the failed link has the highest votes among all links. We compare 007 with the optimization problem in (4). We find that the latter also returns the correct result every time, albeit at the cost of a large number of false positives. To illustrate this point: the number of links marked as bad by (4) is on average 1.5, 1.18, and 1.47 times higher than the number given by 007 for the drop rates of 1%, 0.5%, and 0.05% respectively.

What about multiple failures? This is a harder experiment to configure due to the smaller number of links in this test cluster and its lower path diversity. We induce different drop rates on the failed links (p = 0.2% on one and a lower rate on the other).

We have deployed 007 in one of our datacenters. (The monitoring agent itself has been deployed across all our data centers for over 2 years.) Notable examples of problems 007 found include: power supply undervoltages [28], FCS errors [29], switch reconfigurations, continuous BGP state changes, link flaps, and software bugs [30]. Also, 007 found every problem that was caught by our previously deployed diagnosis tools.

Table 1: Number of ICMPs per second per switch (T): T = 0 for 69% of measurements, 30.98% in an intermediate range, and 0.02% in the highest range. We see max(T) ≤ T_max.

Table 1 shows the distribution of the number of ICMP messages sent by each switch in each epoch over a week. The number of ICMP messages generated by 007 never exceeds T_max (Theorem 1).

In addition to finding problematic links, 007 identifies the most likely cause of drops on each flow. Knowing when each individual packet is dropped in production is hard, so we perform a semi-controlled experiment to test the accuracy of 007. Our environment consists of thousands of hosts/links. To find the "ground truth", we compare 007's results to those obtained by EverFlow. EverFlow captures all packets going through each switch on which it is enabled; it is expensive to run for extended periods of time. We thus only run EverFlow for 5 hours and configure it to capture all outgoing IP traffic from 9 random hosts. The captures for each host were conducted on different days. We filter all flows that were detected to have at least one retransmission during this time and, using EverFlow, find where their packets were dropped. We then check whether the detected link matches that found by 007. We found that 007 was accurate in every single case. In this test we also verified that each path recorded by 007 matches exactly the path taken by that flow's packets as captured by EverFlow. This confirms that it is unlikely for paths to change fast enough to cause errors in 007's path discovery.
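To make the vote-tally mechanism discussed above concrete, below is a minimal sketch of 007-style voting. The function names, the toy topology, and the equal 1/|path| split of each bad flow's vote are our illustrative assumptions, not 007's actual code: each flow that experiences a retransmission votes for every link on its (traceroute-discovered) path, and links are ranked by accumulated votes.

```python
from collections import defaultdict

def tally_votes(bad_flows):
    """Accumulate votes per link: each flow that saw a retransmission
    splits one vote equally among the links on its path (our assumed
    voting rule for this sketch)."""
    votes = defaultdict(float)
    for path in bad_flows:          # path: list of (hop_a, hop_b) links
        if not path:
            continue
        w = 1.0 / len(path)
        for link in path:
            votes[link] += w
    return dict(votes)

def rank_links(votes, top_k=1):
    """Return the top_k most-voted links, i.e., the most likely culprits."""
    return sorted(votes, key=votes.get, reverse=True)[:top_k]

# Toy example: two bad flows traverse the faulty link ("t1a", "tor2");
# a third flow saw a lone drop on a healthy path, adding noise.
flows = [
    [("tor1", "t1a"), ("t1a", "tor2")],
    [("tor3", "t1a"), ("t1a", "tor2")],
    [("tor1", "t1b"), ("t1b", "tor4")],
]
votes = tally_votes(flows)
print(rank_links(votes, top_k=1))  # → [('t1a', 'tor2')]
```

Note how the shared bad link accumulates the highest tally even with noise votes present, which mirrors the "most-voted good link vs. bad link" comparison in Figure 13.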
During our deployment, there were 281 VM reboots in the datacenter for which there was no explanation. 007 found a link as the cause of problems in each case. Upon further investigation of the SNMP system logs, we observe that in 262 cases there were transient drops on the host-to-ToR link, a number of which were correlated with high CPU usage on the host. Two were due to high drop rates on the ToR. In another 15, the endpoints of the links found were undergoing configuration updates. In the remaining 2 instances, the link was flapping.

Finally, we looked at our data for one cluster for one day. 007 identifies on average fewer than one link (± 0.12) as dropping packets per epoch, and the average across all epochs of the maximum vote tally was approximately 2. The majority of the detected problems were due to T1-ToR links, and 6% were due to T1-T2 link failures.
007 is highly effective in finding the cause of packet drops on individual flows. By doing so, it provides flow-level context, which is useful in finding the cause of problems for specific applications. In this section we discuss a number of other factors we considered in its design.
The proofs of Theorems 1 and 2 and the design of the path discovery agent (§4) are based on a number of assumptions:
ACK loss on reverse path.
It is possible that packet loss on the reverse path is so severe that the loss of ACK packets triggers a timeout at the sender. If this happens, the traceroute would not traverse any link that triggered the packet drop. Since TCP ACKs are cumulative, this is typically not a problem, and 007 assumes retransmissions in such cases are unlikely. This is true unless loss rates are very high, in which case the severity of the problem is such that the cause is apparent. Spurious retransmissions triggered by timeouts may also occur if there is a sudden increase in delay on the forward or reverse path. This can happen due to rerouting or large queue buildups. 007 treats these retransmissions like any other.
Source NATs.
Source network address translators (SNATs) change the source IP of a packet before it is sent out to a VIP. Our current implementation of 007 assumes connections bypass the SNAT. However, if flows are SNATed, the ICMP messages will not have the right source address for 007 to get the responses to its traceroutes. This can be fixed by a query to the SLB. Details are omitted.
L2 networks.
Traceroute is not a viable option for finding paths when datacenters operate using L2 routing. In such cases we recommend one of the following: (a) If access to the destination is not a problem and switches can be upgraded, one can use the path discovery methods of [2, 31]. 007 is still useful, as it allows for finding the cause of failures when multiple failures are present and for individual flows. (b) Alternatively, EverFlow can be used to find paths. 007's sampling is necessary here, as EverFlow does not scale to capture the paths of all flows.
Network topology.
The calculations in §5 assume a known topology (Clos). The same calculations can be carried out for any known topology by updating the values used for ECMP. The accuracy of 007 is tied to the degree of path diversity, i.e., to multiple paths being available at each hop: the higher the degree of path diversity, the better 007 performs. This is also a desired property in any datacenter topology, most of which follow the Clos topology [13, 32, 33].
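As a back-of-the-envelope sketch of how such ECMP-dependent values arise, consider a simplified two-tier leaf-spine network (a stand-in for the paper's multi-tier Clos; the function names and the uniform-ECMP assumption are ours, not taken from §5). With uniform ECMP over n spines, a given ToR-spine link carries any particular inter-ToR flow from that ToR with probability 1/n, so more spines means fewer flows per link and better vote separation:

```python
def p_link_on_path(n_spines: int) -> float:
    """In a 2-tier leaf-spine, uniform ECMP picks one of n_spines spine
    switches per flow, so a given ToR-spine link is on the path of a
    flow from that ToR with probability 1/n_spines."""
    return 1.0 / n_spines

def expected_flows_per_link(n_flows: int, n_spines: int) -> float:
    """Expected number of same-source-ToR flows crossing one ToR-spine
    link; higher path diversity spreads flows more thinly per link."""
    return n_flows * p_link_on_path(n_spines)

# Example: 60 connections per epoch from one ToR, 16 spines.
print(expected_flows_per_link(60, 16))  # → 3.75
```

Repeating this per tier with the topology's actual fan-outs is how the ECMP "values" referenced above would be updated for a different known topology.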
ICMP rate limit.
In rare instances, the severity of a failure or the number of flows impacted by it may be such that it triggers 007's ICMP rate limit, at which point 007 stops sending traceroute messages for that epoch. This does not impact the accuracy of Algorithm 1: by the time 007 reaches its rate limit, it has enough data to localize the problematic links. However, it does limit 007's ability to find the cause of drops on flows for which it did not identify the path. We accept this trade-off in accuracy for the simplicity and lower overhead of 007.
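The per-epoch cap described above can be sketched as a simple counter that resets at epoch boundaries. This is a minimal illustration under our own assumptions (class name, 30-second epoch, and cap value are hypothetical), not 007's implementation:

```python
import time

class EpochRateLimiter:
    """Cap the number of traceroutes issued per epoch; once the cap is
    hit, further path-discovery requests in that epoch are skipped."""

    def __init__(self, max_per_epoch: int, epoch_s: float = 30.0):
        self.max_per_epoch = max_per_epoch
        self.epoch_s = epoch_s
        self.epoch_start = time.monotonic()
        self.sent = 0

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self.epoch_start >= self.epoch_s:  # new epoch: reset counter
            self.epoch_start = now
            self.sent = 0
        if self.sent >= self.max_per_epoch:         # cap reached: skip traceroute
            return False
        self.sent += 1
        return True

limiter = EpochRateLimiter(max_per_epoch=3, epoch_s=30.0)
print([limiter.allow() for _ in range(5)])  # → [True, True, True, False, False]
```

Flows denied by `allow()` simply go undiagnosed for that epoch, which matches the accuracy trade-off described above.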
Unpredictability of ECMP.
If the topology and the ECMP functions on all the routers are known, the path of a packet can be found by inspecting its header. However, ECMP functions are typically proprietary and have initialization "seeds" that change with every reboot of the switch. Furthermore, ECMP functions change after link failures and recoveries. Tracking all link failures/recoveries in real time is not feasible at datacenter scale.
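The seed-dependence argument above can be illustrated with a toy hash-based next-hop selector. This sketch deliberately uses SHA-256 as a stand-in for a switch's proprietary hash (the function name and hashing scheme are our assumptions): the same 5-tuple maps deterministically to one link for a fixed seed, but a new seed, e.g., after a reboot, can remap flows arbitrarily, which is why 007 discovers paths with probes rather than computing them from headers.

```python
import hashlib

def ecmp_next_hop(five_tuple, seed: int, n_links: int) -> int:
    """Toy hash-based ECMP choice: deterministic for a fixed
    (5-tuple, seed) pair, but opaque and seed-dependent, mimicking
    (not reproducing) a proprietary switch hash."""
    key = (repr(five_tuple) + str(seed)).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_links

flow = ("10.0.0.1", "10.0.0.2", 4242, 443, "tcp")
before_reboot = ecmp_next_hop(flow, seed=1, n_links=16)
after_reboot = ecmp_next_hop(flow, seed=2, n_links=16)  # seed changed on reboot

# Deterministic for a fixed seed, so path discovery probes that reuse the
# flow's 5-tuple land on the same links as the flow itself:
print(before_reboot == ecmp_next_hop(flow, seed=1, n_links=16))  # → True
```

Since the seed (and thus the mapping) is unknown to end hosts, the only reliable option is in-band measurement: TTL-limited probes that carry the flow's own 5-tuple hash onto the same path.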
007 has been designed for a specific use case, namely finding the cause of packet drops on individual connections in order to provide application context. This resulted in a number of design choices:
Detecting congestion. 007 should not avoid detecting major congestion events, as they signal severe traffic imbalance and/or incast and are actionable. However, the more prevalent congestion events with very low packet drop rates [29] are treated by 007 as noise and are not detected. Standard congestion control protocols can effectively react to them.

Finding the cause of other problems.
VM traffic problems. 007 can explain the cause of drops when they occur. Many of these are not actionable and do not require operator intervention. The tally of votes on a given link provides a starting point for deciding when such intervention is needed.
10 Related Work
Finding the source of failures in distributed systems, specifically networks, is a mature topic. We outline some of the key differences between 007 and these works. The most closely related work to ours is perhaps [2], which requires modifications to routers and both endpoints, a limitation that 007 does not have. Often services (e.g., storage) are unwilling to incur the additional overhead of new monitoring software on their machines, and in many instances the two endpoints are in separate organizations [4]. Moreover, in order to apply their approach to our datacenter, a number of engineering problems would need to be overcome, including finding a substitute for their use of the DSCP bit, which is used for other purposes in our datacenter. Lastly, while the statistical testing methods used in [2] (as well as others) are useful when the paths of both failed and non-failed flows are available, they cannot be used in our setting, as the limited number of traceroutes 007 can send prevents it from tracking the paths of all flows. In addition, 007 allows for diagnosis of individual connections and works well in the presence of multiple simultaneous failures, features that [2] does not provide. Indeed, finding paths only when they are needed is one of the most attractive features of 007, as it minimizes its overhead on the system. Maximum-cover algorithms [31, 35] suffer from many of the same limitations described earlier for the binary optimization, since MAX COVERAGE and Tomo are approximations of (3). Other related work can be loosely categorized as follows:
Inference and Trace-Based Algorithms [1, 2, 3, 36, 37, 38, 39, 40, 41, 42] use anomaly detection and trace-based algorithms to find sources of failures. They require knowledge/inference of the location of logical devices, e.g., load balancers, in the connection path. While this information is available to network operators, it is not clear which instance of these entities a flow will go over. This reduces the accuracy of the results. Everflow [3] aims to accurately identify the path of packets of interest. However, it does not scale to be used as an always-on diagnosis system. Furthermore, it requires additional features to be enabled on the switch. Similarly, [26, 31] provide other means of path discovery; however, such approaches require deploying new applications to the remote endpoints, which we want to avoid (for the reasons described in [4]). Also, they depend on SDN-enabled networks and are not applicable to our setting, where routing is based on BGP-enabled switches. Some inference approaches aim at covering the full topology, e.g., [1]. While this is useful, they typically provide only a sampled view of connection liveness and do not achieve the type of always-on monitoring that 007 provides. The time between probes for [1], for example, is currently 5 minutes. It is likely that failures that happen at finer time scales slip through the cracks of its monitoring probes. Other such work, e.g., [2, 39, 40, 41], requires access to both endpoints and/or switches. Such access may not always be possible. Finally, NetPoirot [4] can only identify the general type of a problem (network, client, server) rather than the responsible device.
Network tomography [7, 8, 9, 11, 21, 43, 44, 45, 46, 47, 48, 49, 50] typically consists of two aspects: (i) the gathering and filtering of network traffic data to be used for identifying the points of failure [7, 45] and (ii) using the information found in the previous step to identify where/why failures occurred [8, 9, 10, 21, 43, 49, 51]. 007 utilizes ongoing traffic to detect problems, unlike these approaches, which require the much heavier-weight operation of gathering large volumes of data. Tomography-based approaches are also better suited for non-transient failures, while 007 can handle both transient and persistent errors. 007 also has coverage that extends to the entire network infrastructure, and does not limit coverage to only paths between designated monitors as some such approaches do. Work on analyzing failures [7, 21, 43, 45] is complementary and can be applied to 007 to improve our accuracy.
Anomaly detection [52, 53, 54, 55, 56, 57, 58] finds when a failure has occurred using machine learning [52, 54] and Fourier transforms [56]. 007 goes a step further by finding the device responsible.
Fault Localization by Consensus [59] assumes that a failure on a node common to the paths used by a subset of clients will result in failures for a significant number of them. NetPoirot [4] illustrates why this approach fails in the face of a subset of problems that are common to datacenters. While our work builds on this idea, it provides a confidence measure that identifies how reliable a diagnosis report is.
Fault Localization using TCP statistics [2, 60, 61, 62, 63] uses TCP metrics for diagnosis. [60] requires heavyweight active probing. [61] uses learning techniques. Both [61] and T-Rat [62] rely on continuous packet captures, which do not scale. SNAP [63] identifies performance problems/causes for connections by acquiring TCP information gathered by querying socket options. It also combines routing data with topology data to compare the TCP statistics of flows that share the same host, link, ToR, or aggregator switch. Given their lack of continuous monitoring, all of these approaches fail to detect the type of problems 007 is designed to detect. Furthermore, the goal of 007 is more ambitious, namely to find the link that causes packet drops for each TCP connection.
Learning-Based Approaches [4, 64, 65, 66] do failure detection in home and mobile networks. Our application domain is different.
Application diagnosis [67, 68] aims at identifying the cause of problems in a distributed application's execution path. The limitations and complexities associated with diagnosing network-level paths are different. Obtaining all execution paths seen by an application is plausible in such systems but is not an option in ours.
Failure resilience in datacenters [13, 69, 70, 71, 72, 73, 74, 75, 76, 77] targets resilience to failures in datacenters. 007 can be helpful to a number of these algorithms, as it can find problematic areas which these tools can then help avoid.
Understanding datacenter failures [22, 78] aims to find the various types of failures in datacenters. Such studies are useful for understanding the types of problems that arise in practice and for ensuring that our diagnosis engines are well equipped to find them. 007's analysis agent uses the findings of [22].
11 Conclusion

We introduced 007, an always-on and scalable monitoring/diagnosis system for datacenters. 007 can accurately identify drop rates as low as 0.05% in datacenters with thousands of links by monitoring the status of ongoing TCP flows.
12 Acknowledgements
This work was supported by grants NSF CNS-1513679 and DARPA/I2O HR0011-15-C-0098. The authors would like to thank T. Adams, D. Dhariwal, A. Aditya, M. Ghobadi, O. Alipourfard, A. Haeberlen, J. Cao, I. Menache, S. Saroiu, and our shepherd H. Madhyastha for their help.
References [1]
Guo, C., Yuan, L., Xiang, D., Dang, Y., Huang, R.,Maltz, D., Liu, Z., Wang, V., Pang, B., Chen, H.,et al.
Pingmesh: A large-scale system for data centernetwork latency measurement and analysis. In
ACMSIGCOMM (2015), pp. 139–152.[2]
Roy, A., Bagga, J., Zeng, H., and Snoeren, A.
Pas-sive realtime datacenter fault detection. In
ACM NSDI (2017).[3]
Zhu, Y., Kang, N., Cao, J., Greenberg, A., Lu, G.,Mahajan, R., Maltz, D., Yuan, L., Zhang, M., Zhao,B. Y., et al.
Packet-level telemetry in large datacenternetworks. In
ACM SIGCOMM (2015), pp. 479–491.[4]
Arzani, B., Ciraci, S., Loo, B. T., Schuster, A.,Outhred, G., et al.
Taking the blame game out of datacenters operations with NetPoirot. In
ACM SIGCOMM (2016), pp. 440–453.[5]
Wu, X., Turner, D., Chen, C.-C., Maltz, D. A., Yang,X., Yuan, L., and Zhang, M.
NetPilot: Automatingdatacenter network failure mitigation.
ACM SIGCOMMComputer Communication Review 42 , 4 (2012), 419–430.[6]
Narayana, S., Sivaraman, A., Nathan, V., Goyal, P.,Arun, V., Alizadeh, M., Jeyakumar, V., and Kim, C.
Language-directed hardware design for network perfor-mance monitoring. In
Proceedings of the Conference ofthe ACM Special Interest Group on Data Communication (2017), ACM, pp. 85–98.[7]
Zhang, Y., Roughan, M., Willinger, W., and Qiu, L.
Spatio-temporal compressive sensing and internet trafficmatrices.
ACM SIGCOMM Computer CommunicationReview 39 , 4 (2009), 267–278.[8]
Ma, L., He, T., Swami, A., Towsley, D., Leung, K. K.,and Lowe, J.
Node failure localization via networktomography. In
ACM SIGCOMM IMC (2014), pp. 195–208.[9]
Liu, C., He, T., Swami, A., Towsley, D., Salonidis,T., and Leung, K. K.
Measurement design frameworkfor network tomography using fisher information.
ITAAFM (2013).[10]
Dhamdhere, A., Teixeira, R., Dovrolis, C., and Diot,C.
NetDiagnoser: Troubleshooting network unreachabili-ties using end-to-end probes and routing data. In
ACMCoNEXT (2007). [11]
Kompella, R. R., Yates, J., Greenberg, A., andSnoeren, A. C.
IP fault localization via risk modeling.In
USENIX NSDI (2005), pp. 57–70.[12]
Bertsimas, D., and Tsitsiklis, J. N.
Introduction tolinear optimization . Athena Scientific, 1997.[13]
Ports, D. R. K., Li, J., Liu, V., Sharma, N. K., andKrishnamurthy, A.
Designing distributed systems us-ing approximate synchrony in data center networks. In
USENIX NSDI (2015), pp. 43–57.[14]
Microsoft . Windows ETW, 2000. https://msdn.microsoft.com/en-us/library/windows/desktop/bb968803(v=vs.85).aspx .[15]
Hopps, C. E.
RFC 2992: Analysis of an Equal-CostMulti-Path algorithm, 2000.[16]
Patel, P., Bansal, D., Yuan, L., Murthy, A., Green-berg, A., Maltz, D. A., Kern, R., Kumar, H., Zikos,M., Wu, H., et al.
Ananta: Cloud scale load balancing.
ACM SIGCOMM Computer Communication Review 43 ,4 (2013), 207–218.[17]
Liu, H. H., Kandula, S., Mahajan, R., Zhang, M.,and Gelernter, D.
Traffic engineering with forwardfault correction. In
ACM SIGCOMM Computer Com-munication Review (2014), vol. 44, ACM, pp. 527–538.[18]
Johnson, B. W., Kim, S. H., Leo Jr, E. J., and Lee,D.
Link aggregation path selection method, 2003. USPatent 6,535,504.[19]
Gunes, M. H., and Sarac, K.
Resolving IP aliases inbuilding traceroute-based internet maps.
IEEE/ACMTransactions on Networking 17 , 6 (2009), 1738–1751.[20]
Institute, I. S.
RFC 791: Internet Protocol, 1981.DARPA.[21]
Mysore, R. N., Mahajan, R., Vahdat, A., and Vargh-ese, G.
Gestalt: Fast, unified fault localization for net-worked systems. In
USENIX ATC (2014), pp. 255–267.[22]
Zhuo, D., Ghobadi, M., Mahajan, R., Phanishayee,A., Zou, X. K., Guan, H., Krishnamurthy, A., andAnderson, T.
RAIL: A case for Redundant Arrays ofInexpensive Links in data center networks. In
USENIXNSDI (2017).[23]
Bernhard, K., and Vygen, J.
Combinatorial optimiza-tion: Theory and algorithms.
Springer, Third Edition,2005. (2008).[24]
Mosek, A.
The mosek optimization software. (2010), 2–1.[25]
Arzani, B.
Simulation source codes. Tech. rep., Mi-crosoft Research, 2018. https://github.com/behnazak/Vigil-007SourceCode.git .[26]
Tammana, P., Agarwal, R., and Lee, M.
Cherrypick:Tracing packet trajectory in software-defined datacenternetworks. In
Proceedings of the 1st ACM SIGCOMMSymposium on Software Defined Networking Research (2015), ACM, p. 23.[27]
Arzani, B., Ciraci, S., Chamon, L., Zhu, Y., Liu, H., Padhye, J., Thau Loo, B., and Outhred, G. 007: Democratically finding the cause of packet drops (extended version). arXiv preprint (2018). [28] Arista. Arista EOS system message guide. Tech. rep., Arista Networks, 2015. http://simatinc.com/simatftp/4.14/EOS-4.14.6M/EOS-4.14.6M-SysMsgGuide.pdf .[29]
Zhuo, D., Ghobadi, M., Mahajan, R., Förster, K.-T.,Krishnamurthy, A., and Anderson, T.
Understandingand mitigating packet corruption in data center networks.In
Proceedings of the Conference of the ACM SpecialInterest Group on Data Communication (2017), ACM,pp. 362–375.[30]
Cisco . Cisco bug: Cscut86141 - sfp-h10gb-cu2.255m,hardware type changed to no-transceiver on n3k. Tech.rep., Cisco, 2017. https://quickview.cloudapps.cisco.com/quickview/bug/CSCut86141 .[31]
Tammana, P., Agarwal, R., and Lee, M.
Simplifyingdatacenter network debugging with pathdump. In
OSDI (2016), pp. 233–248.[32]
Al-Fares, M., Loukissas, A., and Vahdat, A.
Ascalable, commodity data center network architecture.In
ACM SIGCOMM Computer Communication Review (2008), vol. 38, ACM, pp. 63–74.[33]
Greenberg, A., Hamilton, J. R., Jain, N., Kandula,S., Kim, C., Lahiri, P., Maltz, D. A., Patel, P., andSengupta, S.
Vl2: a scalable and flexible data centernetwork. In
ACM SIGCOMM computer communicationreview (2009), vol. 39, ACM, pp. 51–62.[34]
Firestone, D.
Vfp: A virtual switch platform for hostsdn in the public cloud. In
NSDI (2017), pp. 315–328.[35]
Kompella, R. R., Yates, J., Greenberg, A., andSnoeren, A. C.
Detection and localization of networkblack holes. In
INFOCOM 2007. 26th IEEE InternationalConference on Computer Communications. IEEE (2007),IEEE, pp. 2180–2188.[36]
Bahl, P., Chandra, R., Greenberg, A., Kandula, S.,Maltz, D. A., and Zhang, M.
Towards highly reliableenterprise network services via inference of multi-level de-pendencies.
ACM SIGCOMM Computer CommunicationReview 37 , 4 (2007), 13–24.[37]
Adair, K. L., Levis, A. P., and Hruska, S. I.
Expertnetwork development environment for automating ma-chine fault diagnosis. In
SPIE Applications and Scienceof Artificial Neural Networks (1996), pp. 506–515.[38]
Ghasemi, M., Benson, T., and Rexford, J.
RINC: Real-time Inference-based Network diagnosis in the Cloud. Tech. rep., Princeton University, 2015. [39]
Mahajan, R., Spring, N., Wetherall, D., and Ander-son, T.
User-level internet path diagnosis.
ACM SIGOPSOperating Systems Review 37 , 5 (2003), 106–119.[40]
Liú, Y., Miao, R., Kim, C., and Yuú, M.
LossRadar:Fast detection of lost packets in data center networks. In
ACM CoNEXT (2016), pp. 481–495.[41]
Li, Y., Miao, R., Kim, C., and Yu, M.
FlowRadar:A better NetFlow for data centers. In
USENIX NSDI (2016), pp. 311–324. [42]
Heller, B., Scott, C., McKeown, N., Shenker, S.,Wundsam, A., Zeng, H., Whitlock, S., Jeyakumar,V., Handigol, N., McCauley, J., et al.
LeveragingSDN layering to systematically troubleshoot networks.In
ACM SIGCOMM HotSDN (2013), pp. 37–42.[43]
Duffield, N.
Network tomography of binary networkperformance characteristics.
IEEE Transactions on In-formation Theory 52 , 12 (2006), 5373–5388.[44]
Kandula, S., Katabi, D., and Vasseur, J.-P.
Shrink:A tool for failure diagnosis in IP networks. In
ACMSIGCOMM MineNet (2005), pp. 173–178.[45]
Ogino, N., Kitahara, T., Arakawa, S., Hasegawa,G., and Murata, M.
Decentralized boolean network to-mography based on network partitioning. In
IEEE/IFIPNOMS (2016), pp. 162–170.[46]
Chen, Y., Bindel, D., Song, H., and Katz, R. H.
An algebraic approach to practical and scalable over-lay network monitoring.
ACM SIGCOMM ComputerCommunication Review 34 , 4 (2004), 55–66.[47]
Zhao, Y., Chen, Y., and Bindel, D.
Towards unbi-ased end-to-end network diagnosis.
ACM SIGCOMMComputer Communication Review 36 , 4 (2006), 219–230.[48]
Huang, Y., Feamster, N., and Teixeira, R.
Practicalissues with using network tomography for fault diagnosis.
ACM SIGCOMM Computer Communication Review 38 ,5 (2008), 53–58.[49]
Duffield, N. G., Arya, V., Bellino, R., Friedman, T.,Horowitz, J., Towsley, D., and Turletti, T.
Networktomography from aggregate loss reports.
PerformanceEvaluation 62 , 1 (2005), 147–163.[50]
Herodotou, H., Ding, B., Balakrishnan, S., Outhred,G., and Fitter, P.
Scalable near real-time failure local-ization of data center networks. In
ACM KDD (2014),pp. 1689–1698.[51]
Banerjee, D., Madduri, V., and Srivatsa, M.
Aframework for distributed monitoring and root causeanalysis for large IP networks. In
IEEE SRDS (2009),pp. 246–255.[52]
Fu, Q., Lou, J.-G., Wang, Y., and Li, J.
Executionanomaly detection in distributed systems through un-structured log analysis. In
IEEE ICDM (2009), pp. 149–158.[53]
Huang, L., Nguyen, X., Garofalakis, M., Jordan,M. I., Joseph, A., and Taft, N.
In-network PCA andanomaly detection. In
NIPS (2006), pp. 617–624.[54]
Gabel, M., Sato, K., Keren, D., Matsuoka, S., andSchuster, A.
Latent fault detection with unbalancedworkloads. In
EPForDM (2015).[55]
Ibidunmoye, O., Hernández-Rodriguez, F., and Elm-roth, E.
Performance anomaly detection and bottleneckidentification.
ACM Computing Surveys 48 , 1 (2015).[56]
Zhang, Y., Ge, Z., Greenberg, A., and Roughan, M.
Network anomography. In
ACM SIGCOMM IMC (2005).[57]
Crovella, M., and Lakhina, A.
Method and apparatus for whole-network anomaly diagnosis and method to detect and classify network anomalies using traffic feature distributions, 2014. US Patent 8,869,276. [58] Kind, A., Stoecklin, M. P., and Dimitropoulos, X.
Histogram-based traffic anomaly detection.
IEEE Trans-actions on Network and Service Management 6 , 2 (2009).[59]
Padmanabhan, V. N., Ramabhadran, S., and Padhye,J.
Netprofiler: Profiling wide-area networks using peer cooperation. In IPTPS (2005), pp. 80–92.
[60] Mathis, M., Heffner, J., O'Neil, P., and Siemsen, P. Pathdiag: Automated TCP diagnosis. In PAM (2008), pp. 152–161.
[61] Widanapathirana, C., Li, J., Sekercioglu, Y. A., Ivanovich, M., and Fitzpatrick, P. Intelligent automated diagnosis of client device bottlenecks in private clouds. In IEEE UCC (2011), pp. 261–266.
[62] Zhang, Y., Breslau, L., Paxson, V., and Shenker, S. On the characteristics and origins of internet flow rates. ACM SIGCOMM Computer Communication Review 32, 4 (2002), 309–322.
[63] Yu, M., Greenberg, A. G., Maltz, D. A., Rexford, J., Yuan, L., Kandula, S., and Kim, C. Profiling network performance for multi-tier data center applications. In USENIX NSDI (2011).
[64] Chen, M., Zheng, A. X., Lloyd, J., Jordan, M., Brewer, E., et al. Failure diagnosis using decision trees. In IEEE ICAC (2004), pp. 36–43.
[65] Dimopoulos, G., Leontiadis, I., Barlet-Ros, P., Papagiannaki, K., and Steenkiste, P. Identifying the root cause of video streaming issues on mobile devices.
[66] Agarwal, B., Bhagwan, R., Das, T., Eswaran, S., Padmanabhan, V. N., and Voelker, G. M. NetPrints: Diagnosing home network misconfigurations using shared knowledge. In USENIX NSDI (2009), vol. 9, pp. 349–364.
[67] Chen, Y.-Y. M., Accardi, A., Kiciman, E., Patterson, D. A., Fox, A., and Brewer, E. A. Path-based failure and evolution management. In USENIX NSDI (2004).
[68] Aguilera, M. K., Mogul, J. C., Wiener, J. L., Reynolds, P., and Muthitacharoen, A. Performance debugging for distributed systems of black boxes. ACM SIGOPS Operating Systems Review 37, 5 (2003), 74–89.
[69] Liu, J., Panda, A., Singla, A., Godfrey, B., Schapira, M., and Shenker, S. Ensuring connectivity via data plane mechanisms. In USENIX NSDI (2013), pp. 113–126.
[70] Alizadeh, M., Edsall, T., Dharmapurikar, S., Vaidyanathan, R., Chu, K., Fingerhut, A., Matus, F., Pan, R., Yadav, N., Varghese, G., et al. CONGA: Distributed congestion-aware load balancing for datacenters. ACM SIGCOMM Computer Communication Review 44, 4 (2014), 503–514.
[71] Paasch, C., and Bonaventure, O. Multipath TCP. Communications of the ACM 57, 4 (2014), 51–57.
[72] Chen, G., Lu, Y., Meng, Y., Li, B., Tan, K., Pei, D., Cheng, P., Luo, L. L., Xiong, Y., Wang, X., et al. Fast and cautious: Leveraging multi-path diversity for transport loss recovery in data centers. In USENIX ATC (2016).
[73] Schiff, L., Schmid, S., and Canini, M. Ground control to major faults: Towards a fault tolerant and adaptive SDN control network. In IEEE/IFIP DSN (2016), pp. 90–96.
[74] Reitblatt, M., Canini, M., Guha, A., and Foster, N. FatTire: Declarative fault tolerance for software-defined networks. In ACM SIGCOMM HotSDN (2013), pp. 109–114.
[75] Kuźniar, M., Perešíni, P., Vasić, N., Canini, M., and Kostić, D. Automatic failure recovery for software-defined networks. In ACM SIGCOMM HotSDN (2013), pp. 159–160.
[76] Bodík, P., Menache, I., Chowdhury, M., Mani, P., Maltz, D. A., and Stoica, I. Surviving failures in bandwidth-constrained datacenters. In ACM SIGCOMM (2012), pp. 431–442.
[77] Wundsam, A., Mehmood, A., Feldmann, A., and Maennel, O. Network troubleshooting with mirror VNets. In IEEE GLOBECOM (2010), pp. 283–287.
[78] Gill, P., Jain, N., and Nagappan, N. Understanding network failures in data centers: Measurement, analysis, and implications. ACM SIGCOMM Computer Communication Review 41, 4 (2011), 350–361.
[79] Arratia, R., and Gordon, L. Tutorial on large deviations for the binomial distribution. Bulletin of Mathematical Biology 51, 1 (1989), 125–131.
[80] Feller, W. An Introduction to Probability Theory and Its Applications, vol. 1. Wiley, 1968.
[81] Cover, T., and Thomas, J. Elements of Information Theory. Wiley-Interscience, 2006.

A Application example: VM reboots
In the introduction (§ 1), we describe an instance in which the failure detection capabilities of 007 can be useful: pinpointing the cause of VM reboots. Indeed, in our datacenters, VM images are stored in a storage service. When a customer boots a VM, the image is mounted over the network. Thus, even a small network outage can cause the host kernel to "panic" and reboot the guest VM. We mentioned that over 70% of VM reboots caused by network issues in our datacenters cannot be explained using currently deployed monitoring systems. To further illustrate how important this issue can be, Figure 14 shows the number of VM reboots due to unexplained network problems over one day of operations: on average, there were 10 such reboots per hour.
Figure 14: Number of network-related reboots in a day (by hour of the day).
B Network tomography example
Knowing the path of all flows, it is possible to determine with confidence which link dropped a packet. To do so, consider the example network in Figure 15 and suppose that the link between nodes 2 and 4 drops packets. Flows 1–2 and 3–2 suffer from drops, but flow 1–3 does not. A set cover optimization that minimizes the number of "blamed" links, such as the one used by MAX COVERAGE and Tomo [10, 11], will correctly find the cause of the drops. This problem, however, is equivalent to the set covering optimization problem, which is known to be NP-complete [23].
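To make the set-cover view concrete, here is a minimal, hypothetical sketch (not code from 007) of the usual greedy heuristic applied to the Figure 15 example; the flow and link labels are assumptions for illustration:

```python
# Greedy set cover over the example of Figure 15: the lossy link (2, 4)
# is traversed by both failed flows, so it is blamed first and alone.
failed_flows = {"1-2", "3-2"}      # flows that saw retransmissions
link_to_flows = {                  # failed flows each link could explain
    ("1", "4"): {"1-2"},
    ("3", "4"): {"3-2"},
    ("2", "4"): {"1-2", "3-2"},
}

def greedy_cover(uncovered, candidates):
    """Repeatedly blame the link explaining the most remaining failures."""
    blamed, uncovered = [], set(uncovered)
    while uncovered:
        best = max(candidates, key=lambda l: len(candidates[l] & uncovered))
        if not candidates[best] & uncovered:
            break                  # leftover failures cannot be explained
        blamed.append(best)
        uncovered -= candidates[best]
    return blamed

print(greedy_cover(failed_flows, link_to_flows))  # [('2', '4')]
```

The greedy choice is the standard logarithmic-factor approximation for set cover; Algorithm 2 in Appendix D follows the same pattern.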
C Proofs
Definition 1 (Clos topology). A Clos topology has $n_{pod}$ pods, each with $n_0$ top of the rack (ToR) switches under which lie $H$ hosts. The ToR switches are connected to $n_1$ tier-1 switches by a complete network ($n_0 n_1$ links). Links between tier-0 and tier-1 switches are referred to as level 1 links. The tier-1 switches within each pod are connected to $n_2$ tier-2 switches by another complete network ($n_1 n_2$ links). Links between these switches are called level 2 links. This notation is illustrated in Figure 16.

Figure 15: Simple tomography example.
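As a quick sanity check on Definition 1, the link counts it implies can be computed directly; the parameter values below are illustrative assumptions, not our datacenter's:

```python
# Link counts implied by Definition 1: each pod has a complete bipartite
# ToR/tier-1 network (n0*n1 level 1 links) and a complete bipartite
# tier-1/tier-2 network (n1*n2 level 2 links).
def clos_link_counts(n_pod, n0, n1, n2):
    level1 = n_pod * n0 * n1  # level 1: ToR <-> tier-1
    level2 = n_pod * n1 * n2  # level 2: tier-1 <-> tier-2
    return level1, level2

print(clos_link_counts(n_pod=4, n0=8, n1=4, n2=4))  # (128, 64)
```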
Remark 1 (Communication and failure model). Assume that connections occur uniformly at random between hosts under different ToR switches. Since the number of hosts under each ToR switch is the same, this is equivalent to saying that connections occur uniformly at random directly between ToR switches. Also, assume that link failures and connection routing are independent and that links drop packets independently, across links and across packets.
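The routing side of this model can be checked with a small Monte Carlo sketch (parameters and seed are arbitrary): under uniform ToR-to-ToR connections and uniform ECMP uplink choice, a fixed level-1 uplink should carry roughly a $1/(n_0 n_1 n_{pod})$ fraction of connections.

```python
import random

# Monte Carlo check of the uniform communication/routing model: estimate
# the fraction of connections crossing one fixed level-1 uplink.
def traversal_fraction(n_pod, n0, n1, trials, seed=7):
    rng = random.Random(seed)
    tors = n_pod * n0
    hits = 0
    for _ in range(trials):
        src = rng.randrange(tors)       # source ToR, uniform
        _dst = rng.randrange(tors - 1)  # destination ToR (irrelevant here)
        uplink = rng.randrange(n1)      # uniform ECMP choice at the source
        if src == 0 and uplink == 0:    # the uplink we are watching
            hits += 1
    return hits / trials

est = traversal_fraction(n_pod=2, n0=4, n1=2, trials=200_000)
print(abs(est - 1 / 16) < 0.01)  # estimate is close to 1/(n0*n1*n_pod)
```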
Remark 2 (Notation). We use calligraphic letters ($\mathcal{A}$) to denote sets and boldface font ($\mathbf{A}$) to denote random variables. Also, we write $[M]$ to mean the set of integers between 1 and $M$, i.e., $[M] = \{1, \ldots, M\}$.

C.1 Proof of Theorem 1
Proof.
Start by noticing that the number of hosts below each ToR switch is the same, so we can consider that traceroutes are sent on flows uniformly at random between ToR switches at a rate $C_t H$. Moreover, note that routing probabilities are the same for links on the same level, so the traceroute rate depends only on whether the link is on level 1 or level 2.

Since the probability of a switch routing a connection through any link is uniform, the traceroute rate over a level 1 link is given by

$$R_1 = \frac{1}{n_1} C_t H, \qquad (5)$$

Similarly, for a level 2 link:

$$R_2 = \frac{n_0}{n_1 n_2} \cdot \frac{n_0 (n_{pod} - 1)}{n_0 n_{pod} - 1} C_t H, \qquad (6)$$

where the second fraction represents the probability of a host connecting to another host outside its own pod, i.e., of going through a level 2 link. Since $n_0$ level 1 links are connected to a tier-1 switch and $n_1$ level 2 links are connected to a tier-2 switch, the rate of ICMP packets at any switch is bounded by $T \leq \max[n_0 R_1, n_1 R_2]$. Taking $\max[n_0 R_1, n_1 R_2] \leq T_{max}$ yields (1).

C.2 Proof of Theorem 2
We prove the following more precise statement of Theorem 2.

Table 2: Notation and nomenclature

$n_{pod}$: Number of pods
$n_0$: Number of top of the rack (ToR) switches per pod
$n_1$: Number of tier-1 switches per pod
$n_2$: Number of tier-2 switches
Level 1 link: Link between a ToR and a tier-1 switch
Level 2 link: Link between a tier-1 and a tier-2 switch
$\mathcal{T}_0^s$: Set of all ToR switches in pod $s$
$\mathcal{T}_1^s$: Set of all tier-1 switches in pod $s$
$\mathcal{T}_0$: Set of all ToR switches ($\mathcal{T}_0 = \mathcal{T}_0^1 \cup \cdots \cup \mathcal{T}_0^{n_{pod}}$)
$\mathcal{T}_1$: Set of all tier-1 switches ($\mathcal{T}_1 = \mathcal{T}_1^1 \cup \cdots \cup \mathcal{T}_1^{n_{pod}}$)
$\mathcal{T}_2$: Set of all tier-2 switches
$k$: Number of failed links in the network
$c_u$: Upper bound on the number of packets per connection
$c_l$: Lower bound on the number of packets per connection
$p_g$: Probability that a good link drops a packet
$p_b$: Probability that a failed link drops a packet
$v_g$: Probability that a good link receives a vote
$v_b$: Probability that a bad link receives a vote
$r_g$: Probability that a good link causes a retransmission (drops at least one packet)
$r_b$: Probability that a bad link causes a retransmission (drops at least one packet)
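Before the formal statement, the voting mechanism the theorem analyzes can be illustrated with a quick simulation using only the quantities of Table 2 (all probabilities and counts below are made-up values): when $v_b > v_g$, bad links collect more votes than good links once the number of connections $N$ is large.

```python
import random

# Votes are binomial: each of N connections votes for a given bad link
# with probability v_b and for a given good link with probability v_g.
def simulate_votes(n_links, vote_prob, n_conns, rng):
    return [sum(rng.random() < vote_prob for _ in range(n_conns))
            for _ in range(n_links)]

rng = random.Random(0)
N = 5_000                                # connections in an epoch
bad = simulate_votes(3, 0.05, N, rng)    # v_b = 0.05 (assumed)
good = simulate_votes(20, 0.01, N, rng)  # v_g = 0.01 (assumed)
print(min(bad) > max(good))  # every bad link outranks every good link
```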
Theorem 3.
In a Clos topology with $n_0 \geq 2 n_2$ and $n_{pod} \geq \max\left[\frac{n_0}{n_1} + 1,\ \frac{n_0^2 - n_2}{n_0(n_0 - n_2)}\right]$, 007 will rank with probability $(1 - \epsilon)$ the $k < \frac{n_2(n_0 n_{pod} - 1)}{n_0(n_{pod} - 1)}$ bad links that drop packets with probability $p_b$ above all good links that drop packets with probability $p_g$ as long as

$$p_g \leq \frac{1 - (1 - p_b)^{c_l}}{\alpha c_u}, \qquad (7)$$

where $c_l$ and $c_u$ are lower and upper bounds, respectively, on the number of packets per connection,

$$\alpha = \frac{n_0 (4 n_0 - k)(n_{pod} - 1)}{n_2(n_0 n_{pod} - 1) - n_0 (n_{pod} - 1) k}, \qquad (8)$$

and

$$\epsilon \leq e^{-N D_{KL}((1+\delta) v_g \| v_g)} + e^{-N D_{KL}((1-\delta) v_b \| v_b)} = 2 e^{-\mathcal{O}(N)}, \qquad (9)$$

with $v_g$ and $v_b$ being the probabilities of a good and bad link receiving a vote, respectively, $N$ being the total number of connections between hosts, and $D_{KL}(q \| r)$ denoting the Kullback-Leibler divergence between two Bernoulli distributions with probabilities of success $q$ and $r$.

Before proceeding, note that in the typical scenario in which $n_0 \geq 2 n_2$ and $\frac{n_0^2 - n_2}{n_0(n_0 - n_2)} \leq \frac{n_0}{n_1} + 1$, as in our datacenter, the condition on the number of pods from Theorem 3 reduces to $n_{pod} \geq \frac{n_0}{n_1} + 1$.

Proof.
The proof proceeds as follows. First, we show that if a link has a higher probability of receiving a vote, then it receives more votes when a large enough number of connections ($N$) is established. We do so using large deviation theory [79], with which we can show that the probability that this does not happen decreases exponentially in $N$.

Lemma 1. If $v_b \geq v_g$, 007 will rank bad links above good links with probability $(1 - \epsilon)$ for $\epsilon$ as in (9).

With Lemma 1 in hand, we then need to relate the probabilities of a link receiving a vote ($v_b$, $v_g$) to the link drop rates ($p_b$, $p_g$). This will allow us to derive the signal-to-noise ratio condition in (7). Note that the probability of a link receiving a vote is the probability of a flow going through the link and a retransmission occurring (i.e., some link in the flow's path drops at least one packet). Hence, we relate these probabilities by exploiting the combinatorial structure of ECMP in the Clos topology.

Lemma 2.
In a Clos topology with $n_0 \geq 2 n_2$ and $n_{pod} \geq \max\left[\frac{n_0}{n_1} + 1,\ \frac{n_0^2 - n_2}{n_0(n_0 - n_2)}\right]$, it holds for $k \leq n_0$ bad links that

$$v_b \geq \frac{r_b}{n_0 n_1 n_{pod}} \qquad \text{(10a)}$$

$$v_g \leq \frac{n_0 (n_{pod} - 1)}{n_1 n_2 n_{pod} (n_0 n_{pod} - 1)} \left[ \left(4 - \frac{k}{n_0}\right) r_g + \frac{k}{n_0} r_b \right] \qquad \text{(10b)}$$

where $r_b$ and $r_g$ are the probabilities of a retransmission occurring due to a bad and a good link, respectively.

Before proving these lemmata, let us see how they imply Theorem 3. From (10) in Lemma 2, it holds that

$$r_b \geq \underbrace{\frac{n_0 (4 n_0 - k)(n_{pod} - 1)}{n_2(n_0 n_{pod} - 1) - n_0 (n_{pod} - 1) k}}_{\alpha} r_g \ \Rightarrow\ v_b \geq v_g, \qquad (11)$$

for $k < \frac{n_2(n_0 n_{pod} - 1)}{n_0(n_{pod} - 1)} < 4 n_0$. Thus, in a Clos topology, if the probability of retransmission due to a bad link is large enough compared to a good link, i.e., $r_b \geq \alpha r_g$ for $\alpha$ as in (8), then we have that the probability of a bad link receiving a vote is larger than that of a good link ($v_b \geq v_g$).

Still, (11) gives a relation in terms of the probabilities of retransmission ($r_g$, $r_b$) instead of the packet drop rates ($p_g$, $p_b$) as in (7). To obtain (7), note that the probability $r$ of a retransmission during a connection with $c$ packets due to a link that drops packets with probability $p$ is $r = 1 - (1 - p)^c$. Since $r$ is monotonically increasing in $c$, we have that $r_b \geq 1 - (1 - p_b)^{c_l}$. Similarly, $r_g \leq 1 - (1 - p_g)^{c_u}$. Using the fact that $(1 - x)^n \geq 1 - nx$ yields (7). We now proceed with the proofs of Lemmata 1 and 2.

Proof of Lemma 1.
We start by noting that in a datacenter-sized Clos network, almost every connection has a hop count of 5; in our datacenter, this is the case for 97.5% of connections. Therefore, we can approximate link votes by assuming all bad votes have the same value, and it suffices to determine how many votes each link receives.

Since links cause retransmissions independently across connections (see Remark 1), the number of votes received by a bad link is a binomial random variable $\mathbf{B}$ with parameters $N$, the total number of connections, and $v_b$, the probability of a bad link receiving a vote. Similarly, let $\mathbf{G}$ be the number of votes on a good link, a binomial random variable with parameters $N$ and $v_g$. 007 will correctly rank the bad links if $\mathbf{B} \geq \mathbf{G}$, i.e., when bad links receive more votes than good links. This event contains the event $D = \{\mathbf{G} \leq (1+\delta) N v_g\} \cap \{\mathbf{B} \geq (1-\delta) N v_b\}$ for $\delta \leq \frac{v_b - v_g}{v_b + v_g}$. Using the union bound $P[\cup_i E_i] \leq \sum_i P[E_i]$ [80], the probability of 007 identifying the correct links is therefore bounded by

$$P(\mathbf{B} \geq \mathbf{G}) \geq P[\mathbf{G} \leq (1+\delta) N v_g \cap \mathbf{B} \geq (1-\delta) N v_b] \geq 1 - P[\mathbf{G} \geq (1+\delta) N v_g] - P[\mathbf{B} \leq (1-\delta) N v_b] \qquad (12)$$

To proceed, note that the probabilities in (12) can be bounded using the large deviation principle [79]. Indeed, let $\mathbf{S}$ be a binomial random variable with parameters $M$ and $q$. For $\delta > 0$,

$$P[\mathbf{S} \geq (1+\delta) q M] \leq e^{-M D_{KL}((1+\delta) q \| q)} \qquad \text{(13a)}$$

$$P[\mathbf{S} \leq (1-\delta) q M] \leq e^{-M D_{KL}((1-\delta) q \| q)} \qquad \text{(13b)}$$

where $D_{KL}(q \| r)$ is the Kullback-Leibler divergence between two Bernoulli distributions with probabilities of success $q$ and $r$ [81]. Explicitly,

$$D_{KL}(q \| r) = q \log\left(\frac{q}{r}\right) + (1 - q) \log\left(\frac{1 - q}{1 - r}\right).$$

Substituting the inequalities (13) into (12) yields (9).

Figure 16: Illustration of the notation for the Clos topology used in the proof of Lemma 2 (level 1 links; level 2 links).

Proof of Lemma 2.
Before proceeding, let $\mathcal{T}_0$, $\mathcal{T}_1$, and $\mathcal{T}_2$ denote the sets of ToR, tier-1, and tier-2 switches, respectively (Figure 16). Also let $\mathcal{T}_0^s$ and $\mathcal{T}_1^s$, $s \in [n_{pod}]$, denote the tier-0 and tier-1 switches in pod $s$, respectively. Note that $\mathcal{T}_0 = \mathcal{T}_0^1 \cup \cdots \cup \mathcal{T}_0^{n_{pod}}$ and $\mathcal{T}_1 = \mathcal{T}_1^1 \cup \cdots \cup \mathcal{T}_1^{n_{pod}}$. Note that we use subscripts to denote the switch tier and superscripts to denote its pod. To clarify the derivations, we maintain this notation for indices. For instance, $i^s$ is the $i$-th tier-0 switch from pod $s$, i.e., $i^s \in \mathcal{T}_0^s$, and $\ell$ is the $\ell$-th tier-2 switch. Note that tier-2 switches do not belong to specific pods. We write $(i^s, j^s)$ to denote the level 1 link that connects $i^s$ to $j^s$ (as in Figure 16) and use $r(i^s, j^s) = r(j^s, i^s)$ to refer to the probability of link $(i^s, j^s)$ causing a retransmission. Note that $r$ is also a function of the number of packets in a connection, but we omit this dependence for clarity.

The bounds in (10) are obtained by decomposing the events that 007 votes for a level 1 or level 2 link into a union of simpler events. Before proceeding, note that each connection only goes through one link in each level and in each direction, so that events such as "going through a ToR to tier-1 link" are disjoint.

Starting with level 1, let $A_0$ be the event that a connection goes through link $(i^s, j^s)$, i.e., a link that connects a ToR to a tier-1 switch in any pod. This event happens with probability

$$P[A_0] = \frac{1}{n_0 n_1 n_{pod}}, \qquad \text{(14a)}$$

given that there are $n_0 n_1 n_{pod}$ level 1 links and that connections occur uniformly at random.
The link $(i^s, j^s)$ will get a vote if one of five things occurs: (i) it causes a retransmission; (ii) the connection stays within the pod and some other link causes a retransmission; (iii) the connection leaves the pod and a link between a tier-1 and tier-2 switch causes the retransmission; (iv) the connection leaves the pod and a link between a tier-2 and tier-1 switch causes the retransmission; or (v) the connection leaves the pod and a link between a tier-1 and ToR switch in the other pod causes the retransmission. Formally, the link $(i^s, j^s)$ receives a vote if a connection goes through it (event $A_0$) and either of the following occurs:

• event $A_1$: $(i^s, j^s)$ causes a retransmission, i.e.,
$$P[A_1] = r(i^s, j^s) \qquad \text{(14b)}$$

• event $A_2$: the connection also goes through some $(j^s, k^s)$, $k^s \neq i^s$, and $(j^s, k^s)$ causes a retransmission. Therefore,
$$P[A_2] = \underbrace{\frac{1}{n_0 n_{pod} - 1}}_{\text{connect to } k^s} \sum_{k^s \in \mathcal{T}_0^s \setminus \{i^s\}} r(j^s, k^s) \qquad \text{(14c)}$$

• event $A_3$: the connection also goes through some $(j^s, \ell)$ and $(j^s, \ell)$ causes a retransmission, which occurs with probability
$$P[A_3] = \underbrace{\frac{n_0 (n_{pod} - 1)}{n_0 n_{pod} - 1}}_{\text{leave pod } s} \underbrace{\frac{1}{n_2}}_{\text{go through } \ell} \sum_{\ell \in \mathcal{T}_2} r(j^s, \ell) \qquad \text{(14d)}$$

• event $A_4$: the connection also goes through some $(\ell, m^t)$, $t \neq s$, and $(\ell, m^t)$ causes a retransmission, so that
$$P[A_4] = \underbrace{\frac{n_0}{n_0 n_{pod} - 1}}_{\text{go to pod } t} \underbrace{\frac{1}{n_1 n_2}}_{\text{go through } (\ell, m^t)} \sum_{\ell \in \mathcal{T}_2,\, m^t \in \mathcal{T}_1^t,\, t \in [n_{pod}] \setminus s} r(\ell, m^t) \qquad \text{(14e)}$$

• event $A_5$: the connection also goes through some $(m^t, u^t)$, $t \neq s$, and $(m^t, u^t)$ causes a retransmission.
Thus,
$$P[A_5] = \underbrace{\frac{1}{n_0 n_{pod} - 1}}_{\text{go to pod } t} \underbrace{\frac{1}{n_1}}_{\text{go through } m^t} \sum_{m^t \in \mathcal{T}_1^t,\, u^t \in \mathcal{T}_0^t,\, t \in [n_{pod}] \setminus s} r(m^t, u^t) \qquad \text{(14f)}$$

Similarly for level 2, let $B_0$ be the event that a connection goes through link $(j^s, \ell)$, so that its probability is

$$P[B_0] = \underbrace{\frac{1}{n_{pod}}}_{\text{start in pod } s} \underbrace{\frac{n_0 (n_{pod} - 1)}{n_0 n_{pod} - 1}}_{\text{leave pod } s} \underbrace{\frac{1}{n_1 n_2}}_{\text{go through } (j^s, \ell)} \qquad \text{(15a)}$$

For this link to receive a vote, either (i) it causes a retransmission; (ii) a level 1 link from the origin pod causes a retransmission; (iii) a link between a tier-2 and tier-1 switch causes the retransmission; or (iv) a level 1 link in the destination pod causes the retransmission. Thus, link $(j^s, \ell)$ gets a vote if a connection goes through $(j^s, \ell)$ (event $B_0$) and either of the following occurs:

• event $B_1$: $(j^s, \ell)$ causes a retransmission, i.e.,
$$P[B_1] = r(j^s, \ell) \qquad \text{(15b)}$$

• event $B_2$: the connection also goes through some $(i^s, j^s)$ and $(i^s, j^s)$ causes a retransmission. Then,
$$P[B_2] = \underbrace{\frac{1}{n_0}}_{\text{start in } i^s} \sum_{i^s \in \mathcal{T}_0^s} r(i^s, j^s) \qquad \text{(15c)}$$

• event $B_3$: the connection also goes through some $(\ell, m^t)$, $t \neq s$, and $(\ell, m^t)$ causes a retransmission, which yields
$$P[B_3] = \underbrace{\frac{1}{n_1 (n_{pod} - 1)}}_{\text{go through } m^t} \sum_{m^t \in \mathcal{T}_1^t,\, t \in [n_{pod}] \setminus s} r(\ell, m^t) \qquad \text{(15d)}$$

• event $B_4$: the connection also goes through some $(m^t, u^t)$, $t \neq s$, and $(m^t, u^t)$ causes a retransmission. Therefore,
$$P[B_4] = \underbrace{\frac{1}{n_0 n_1 (n_{pod} - 1)}}_{\text{go through } (m^t, u^t),\ t \neq s} \sum_{m^t \in \mathcal{T}_1^t,\, u^t \in \mathcal{T}_0^t,\, t \in [n_{pod}] \setminus s} r(m^t, u^t) \qquad \text{(15e)}$$

To obtain the lower bound in (10a), note that a bad link receives at least as many votes as the retransmissions it causes. Therefore, the probability of 007 voting for a bad link is larger than the probability of that link causing a retransmission.
Explicitly, using the fact that failures and routing are independent and that $r = r_b$ for bad links, (14) and (15) give

$$v_b \geq \min[P(A_0 \cap A_1),\ P(B_0 \cap B_1)] = \min\left[\frac{1}{n_0 n_1 n_{pod}},\ \frac{n_0 (n_{pod} - 1)}{n_1 n_2 n_{pod} (n_0 n_{pod} - 1)}\right] r_b.$$

The assumption that $n_{pod} \geq \frac{n_0^2 - n_2}{n_0(n_0 - n_2)}$ makes the first term smaller than the second and yields (10a).

In contrast, the upper bound in (10b) is obtained by applying the union bound [80] to (14) and (15). Indeed, this leads to the following inequalities for the probability of 007 voting for a good level 1 and level 2 link:

$$v_{g,1} = P[A_0 \cap (A_1 \cup A_2 \cup A_3 \cup A_4 \cup A_5)] \leq P[A_0] \left( \sum_{i=1}^{5} P[A_i] \right) \qquad \text{(16a)}$$

$$v_{g,2} = P[B_0 \cap (B_1 \cup B_2 \cup B_3 \cup B_4)] \leq P[B_0] \left( \sum_{i=1}^{4} P[B_i] \right) \qquad \text{(16b)}$$

where $v_{g,1}$ and $v_{g,2}$ denote the probability of a good level 1 and level 2 link being voted bad, respectively. Note that we once again used the independence between failures and routing. From (16), it is straightforward to see that $v_g \leq \max[v_{g,1}, v_{g,2}]$.

To obtain (10b), we first bound (16) by assuming that all $k$ bad links belong to the events $A_i$ and $B_i$, $i \geq 2$, that maximize $v_{g,1}$ and $v_{g,2}$. For a good level 1 link, it is straightforward to see from (14) that since $n_0 \geq 2 n_2$, event $A_3$ has the largest coefficient. Thus, taking all links to be good except for $k$ bad links satisfying $A_3$, one has

$$v_{g,1} \leq \frac{n_0 (n_{pod} - 1)}{n_1 n_2 n_{pod} (n_0 n_{pod} - 1)} \times \left[ \left(2 - \frac{k}{n_0} + \frac{2(n_0 - 1)}{n_0 (n_{pod} - 1)}\right) r_g + \frac{k}{n_0} r_b \right], \qquad (17)$$

Algorithm 2
Finding the most problematic links in the network.

F: set of failed links
C: set of failed connections

F ← ∅
while C ≠ ∅ do
    l ← link that explains the largest number of additional failures
    L ← failures explained by l
    F ← F ∪ {l}
    C ← C − L
end while
return F

which holds for $k \leq 2 n_0$. Similarly for a good level 2 link, since $n_{pod} \geq \frac{n_0}{n_1} + 1$ it holds from (15) that event $B_2$ has the largest coefficient. Therefore,

$$v_{g,2} \leq \frac{n_0 (n_{pod} - 1)}{n_1 n_2 n_{pod} (n_0 n_{pod} - 1)} \times \left[ \left(4 - \frac{k}{n_0}\right) r_g + \frac{k}{n_0} r_b \right], \qquad (18)$$

which holds for $k \leq 4 n_0$. Straightforward algebra shows that for $n_{pod} \geq 2$, $v_{g,2} \geq v_{g,1}$, from which (10b) follows.

D Greedy solution of the binary program
When discussing optimization-based alternatives to 007's voting scheme, we presented the following problem, which we dubbed the binary program:

$$\begin{array}{ll} \text{minimize} & \|p\|_0 \\ \text{subject to} & A p \geq s \\ & p \in \{0, 1\}^L \end{array} \qquad (19)$$

where $A$ is a $C \times L$ routing matrix; $s$ is a $C \times 1$ vector ($s_i$ is 1 if the $i$-th connection experienced at least one retransmission and 0 otherwise); $L$ is the number of links; $C$ is the number of connections in an epoch; and $\|p\|_0$ denotes the number of nonzero entries of the vector $p$.

Problem (19) can be described as one of looking for the smallest number of links that explains all failures. To see this is the case, start by noting that $p$ is an $L \times 1$ vector whose $i$-th entry describes whether link $i$ is believed to have failed or not. Thus, since $A$ is the routing matrix, the $C \times 1$ vector $A p$ describes whether $p$ explains a possible failure in each connection or not: if $[A p]_i = 0$, then $p$ does not explain a possible failure in the $i$-th connection; if $[A p]_i >$