Localizing Faults in Cloud Systems
Leonardo Mariani, Cristina Monni, Mauro Pezzé, Oliviero Riganelli, Rui Xin
Università degli Studi di Milano-Bicocca, Viale Sarca 336, 20126 Milano, Italy
Email: {leonardo.mariani, mauro.pezze, oliviero.riganelli}@unimib.it
Università della Svizzera italiana (USI), Via Buffi 13, 6900 Lugano, Switzerland
Email: {cristina.monni, mauro.pezze, rui.xin}@usi.ch

Abstract—By leveraging large clusters of commodity hardware, the Cloud offers great opportunities to optimize the operating costs of software systems, but it significantly impacts the reliability of software applications. The lack of control of applications over Cloud execution environments largely limits the applicability of state-of-the-art approaches that address reliability issues by relying on heavyweight training with injected faults. In this paper, we propose LOUD, a lightweight fault localization approach that relies on positive training only, and can thus operate within the constraints of Cloud systems. LOUD relies on machine learning and graph theory. It trains machine learning models with correct executions only, and compensates the inaccuracy that derives from training with positive samples by elaborating the outcome of machine learning techniques with graph theory algorithms. The experimental results reported in this paper confirm that LOUD can localize faults with high precision, relying only on lightweight positive training.
I. INTRODUCTION
Runtime failures are unavoidable in complex systems, and challenge the reliability of software applications [11]. Runtime failures are particularly hard to prevent in the cloud [48], since cloud systems rely on commodity hardware that reduces the overall system reliability, change rapidly to match evolving and sometimes conflicting requirements, and suffer from confidentiality and visibility issues that may derive from the diffuse ownership of the system components [2]. The growing role of cloud systems in business-critical applications, like virtual retail stores [10], and the trend towards moving reliability-critical applications to the cloud, like the next generation telecom infrastructures [18], introduce strict reliability requirements that further exacerbate the problem. The substantive characteristics of the cloud and the stringent reliability requirements of software applications running in the cloud, hereafter cloud applications, can be addressed at runtime with approaches that predict and localize faults [15, 16, 8, 9, 20, 19, 35, 29, 38, 41, 43, 44, 45, 48, 49] to trigger either automatic or manual recovery actions.

State-of-the-art fault localization approaches rely on long training sessions based on injected faults, which is an expensive and seldom-applicable practice, since it is hard to inject multiple classes of faults in several machines with different ownerships, and to repeat this process while the system evolves.

In this paper we propose LOUD, an approach to accurately localize faulty components in cloud virtualized environments. LOUD locates faults as precisely as competing approaches [51, 9, 38], and is much more applicable in practice, since it exploits models trained under normal execution conditions only.

LOUD combines machine learning with graph centrality algorithms: it relies on Key Performance Indicators (KPIs), that is, metrics commonly collected on running systems; it exploits machine learning to both detect anomalies in KPIs and reveal causal relationships among them; and it complements machine learning with algorithms based on centrality indices to localize the faulty resources responsible for generating and propagating anomalies. LOUD originally exploits the causal relationships among KPIs and centrality indices to identify the causes of failures, thus compensating the imprecision of anomaly detection based on models trained with positive samples only, which is notoriously prone to false positives.

LOUD can be paired with any state-of-the-art failure predictor to localize the likely faulty resources before observing the failure, to enable automatic and manual healing. The results of the experiments reported in Section VI indicate that LOUD can locate faults with a precision comparable to state-of-the-art approaches that rely on models trained with injected faults, and with an accuracy that varies over time.

The main contributions of this paper are: (i) an approach for modelling temporal correlations among anomalous values of KPIs, (ii) a study of the relation among anomalous KPIs during faulty executions, (iii) a technique to detect the anomalies that are most likely related to the fault, based on a set of graph centrality indices, (iv) the experimental evaluation of the proposed approach for locating different types of faults in the context of various kinds of faulty resources.

The paper is organized as follows. Section II overviews the LOUD approach. Sections III and IV present the core data analytics and localizer components, respectively. Section V discusses the methodology we used to evaluate the effectiveness of the proposed approach. Section VI illustrates the experimental results about the effectiveness of
LOUD. Section VII discusses related work. Section VIII summarizes the main contributions of the paper.

II. LOUD

Fig. 1. The LOUD approach.

LOUD is an online metric-driven fault localization technique, which analyses the dependencies among anomalous
Key Performance Indicators (KPIs) commonly available in software systems at different abstraction levels, to pinpoint the faulty resources that are likely responsible for future failures.

Figure 1 illustrates the offline and (subsequent) online phases that characterize the LOUD logical architecture. In the offline phase, LOUD trains a model that captures the normal system behavior. In the online phase, LOUD uses the model to pinpoint the faulty resources that are likely to be responsible for a possible failure.

The LOUD Data Analytics offline phase monitors the cloud system that operates in the field to model the system execution under normal conditions in the form of relations among Key Performance Indicators (monitored KPIs), which are metrics collected on a regular basis from target resources of the system at different abstraction levels [12]. Examples of KPIs are the amount of occupied memory in a physical server, the CPU consumption of a virtual machine and the number of connections accepted by an application. These KPIs are sampled with probes that are deployed in the monitored resources [17, 3].

The LOUD Data Analytics offline phase produces a set of baseline models of the KPIs and a causality graph. The KPI baseline models represent the values of the KPIs that characterize normal behaviors. The causality graph models the causal dependencies among KPIs. Causal dependencies indicate the strength of the correlation between pairs of KPIs, that is, the extent to which changes in one KPI can cause changes in another KPI. The nodes in the causality graph correspond to KPIs, and the weighted edges indicate the causal relationships among KPIs. LOUD computes the baseline models and the causality graph using data collected when monitoring normal executions. LOUD exploits both the KPI baseline models and the causality graph in the online phase to localize faulty resources.

During the online operations, LOUD regularly monitors and verifies the KPIs against the baseline models (Data Analytics–online), creates the propagation graph, and localizes the faulty resource (Localization). LOUD executes the online as well as offline activities on an independent machine to avoid side effects on the production system, which is monitored with lightweight monitoring probes. The
LOUD Data Analytics online phase relies on
IBM ITOA-PI [17] to identify the anomalous KPIs, that is, KPI values that violate the KPI baseline models. The LOUD Localization phase combines the set of anomalous KPIs with the causality graph to derive the propagation graph, which is the subgraph of the causality graph with anomalous KPIs only. Intuitively, it models the mutual influence among the anomalous KPIs. As the set of anomalies changes over time, the propagation graph also changes at every timestamp.

LOUD is designed to operate in cascade with a failure predictor. By relying solely on the prediction of a possible failure, LOUD can be integrated with any off-the-shelf predictor [32, 13, 38, 34, 47, 50, 46]. When triggered by a failure prediction, the LOUD Localization phase analyzes the propagation graphs produced at runtime, and identifies the faulty resource that corresponds to the likely root cause of the failure.

LOUD localizes faults under the hypothesis that the anomalous KPIs related to a fault are highly correlated and form a connected subgraph of the propagation graph. In particular, LOUD assumes that the incorrect behavior of a faulty resource is likely to result in incorrect behavior of neighbor resources, which are resources that interact with the faulty one either directly or indirectly. Intuitively, a misbehaving resource, such as a client that generates an unusually high load of requests or a virtual machine that consumes an unusually high amount of memory, usually affects related resources, such as the server application that processes the requests issued by the client or the host machine that runs the VM. Based on this assumption, LOUD exploits graph centrality indices [24, 27, 39] to identify the anomalous KPIs that best characterize the root cause of a predicted failure. Since KPIs are metrics collected on specific resources, the KPIs with the highest centrality scores likely indicate the faulty resources that might be responsible for the predicted failure. The empirical results reported in Section VI indicate that the
LOUD strategy based on centrality indices can effectively localize several classes of relevant faults in resources of different types. In the next sections we describe the Data Analytics, KPI Analysis and Localization phases.

III. DATA ANALYTICS AND KPI ANALYSIS
The
Data Analytics component is composed of an offline and an online engine. The offline engine builds both a baseline model for each KPI, which captures the legal behavior of the KPI, and a causality graph, which captures causal relations between KPIs. The online engine identifies anomalies, that is, KPIs that violate their baseline models. Both the offline and online data analytics engines can be implemented with off-the-shelf IT operations analytics tool suites. We implement the
LOUD
Data Analytics engine with
IBM ITOA-PI, a reliable product that provides advanced analytics technologies. To make the paper self-contained, we provide essential information about IBM ITOA-PI. The interested reader can refer to the documentation [17] for additional details.

The Data Analytics component elaborates data about KPIs provided in the form of time-ordered sequences of data points. A KPI is a pair ⟨M, R⟩, where M is a metric and R is the monitored resource. The metrics capture measurable aspects of the behavior of the monitored system, for example memory consumption, number of requests served per minute, and so on. The resource can be any element of a system, for instance a host, a virtual machine, or a specific application.

Table I shows the sample case of the same metric, Busy CPU, which indicates the percentage of CPU usage, collected from two sample resources, the virtual machines bono and ellis.

TABLE I
EXAMPLE DATA COLLECTED FOR TWO KPIS
Timestamps | ⟨Busy CPU, Bono⟩ | ⟨Busy CPU, Ellis⟩

IBM ITOA-PI produces a baseline model for each KPI and a causality graph for the whole system from the KPI data collected during an initial training phase, which spans multiple weeks of runtime, and a constant retraining phase, which takes place when new KPI data are available. The baseline model represents information about the average values and the acceptable variations over time of a KPI, by considering data seasonality. A KPI value monitored after the training phase is reported as anomalous if it falls outside the acceptable variations coded in the baseline model or if the causal relationships represented in the graph are broken.
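The baseline models are built inside IBM ITOA-PI and their internal form is not described here. As a rough illustration of the idea of a seasonal baseline with an acceptable-variation band, the following sketch buckets training samples by hour of the week and flags values outside a mean ± k standard deviations band; all names, the bucketing granularity and the tolerance k are hypothetical simplifications, not the IBM ITOA-PI model.

```python
from statistics import mean, stdev

def build_baseline(samples, k=3.0):
    """Simplified seasonal baseline: one acceptable-variation band per
    hour-of-week bucket (illustrative only, not the IBM ITOA-PI model)."""
    buckets = {}
    for hour_of_week, value in samples:          # samples: (hour_of_week, value) pairs
        buckets.setdefault(hour_of_week, []).append(value)
    baseline = {}
    for hour, values in buckets.items():
        m = mean(values)
        s = stdev(values) if len(values) > 1 else 0.0
        baseline[hour] = (m - k * s, m + k * s)  # acceptable variation band
    return baseline

def is_anomalous(baseline, hour_of_week, value):
    """A KPI value is reported as anomalous if it falls outside the band."""
    low, high = baseline.get(hour_of_week, (float("-inf"), float("inf")))
    return not (low <= value <= high)
```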
IBM ITOA-PI updates the model periodically at runtime to prevent false negatives and to reduce any bias that might derive from incorrect behaviour occurring during the learning phase.

The causality graph is a directed weighted graph where nodes correspond to KPIs and edges represent causal relations among KPIs. A directed edge with a weight w between nodes n_a and n_b, which correspond to the KPIs a and b respectively, indicates that a is correlated with b according to the Granger statistical test with a probability w. The Granger causality test is specifically designed to capture correlation among time series data [14, 1].

The KPI Analysis phase exploits the causality graph and the set of anomalous KPIs to produce the propagation graph, which is a model that represents the anomalous KPIs and their interactions. Since the set of anomalous KPIs changes at each timestamp, the propagation graph is also different at each timestamp.

The propagation graph can be derived from the causality graph by preserving only the nodes corresponding to anomalous KPIs and their direct connections. More formally, given a set of anomalous KPIs K_A and a causality graph (K, E), where K is the set of KPIs and E is the set of weighted edges e = (n_a, w, n_b) between KPIs, the propagation graph is the maximal connected subgraph (K_A, E_A), with E_A = {(n_a, w, n_b) ∈ E | n_a, n_b ∈ K_A}. Figure 2 shows a sample propagation graph (on the right hand side) that has been generated from the causality graph (on the left hand side) in our experiments.
Fig. 2. A sample causality graph annotated with anomalous KPIs at time t (left) and the corresponding propagation graph at time t (right)
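As a minimal sketch of the derivation just defined, the snippet below keeps only the edges whose endpoints are both anomalous KPIs; it assumes the causality graph is stored as a list of (source KPI, weight, target KPI) triples and KPIs are (metric, resource) pairs. These data structures are illustrative choices, not the representation used by the LOUD prototype.

```python
def propagation_graph(causality_edges, anomalous_kpis):
    """Edges of the subgraph (K_A, E_A): both endpoints must be anomalous KPIs."""
    anomalous = set(anomalous_kpis)
    return [(src, w, dst) for (src, w, dst) in causality_edges
            if src in anomalous and dst in anomalous]

# Hypothetical example: KPIs are (metric, resource) pairs.
edges = [
    (("busy_cpu", "bono"), 0.8, ("latency", "sprout")),
    (("busy_cpu", "bono"), 0.4, ("mem_used", "homer")),
    (("latency", "sprout"), 0.6, ("queue_len", "homestead")),
]
anomalies_at_t = {("busy_cpu", "bono"), ("latency", "sprout")}
print(propagation_graph(edges, anomalies_at_t))
# [(('busy_cpu', 'bono'), 0.8, ('latency', 'sprout'))]
```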
IV. LOCALIZATION
When triggered by a failure prediction, the
LOUD Localization phase identifies the likely faulty resource by first ranking the KPIs whose values are detected as anomalous, and then selecting the faulty resource based on the ranking. The Localization phase is grounded on the observation that, once activated, faults cause increasingly many anomalies that spread both within the faulty resource and to resources either directly or indirectly interconnected to the faulty one. For instance, when a memory leak in a process exhausts the memory of the server, it may prevent the process from generating responses to incoming requests, thus propagating to processes that share the memory with the faulty process, such as the virtual and host machines that run the process, as well as to other client processes that do not receive the expected responses, thus spreading anomalies within the same and to correlated resources.

The problem of identifying the fault location can be formulated as the problem of locating the nodes that originated the spreading of the anomalies represented in the series of propagation graphs built at runtime. The problem cannot be solved by simply selecting the node that becomes anomalous first in the time series of propagation graphs because (i) the data analytics component identifies anomalies both in normal and in faulty executions, so it is hard to distinguish between a "root" anomaly and a false alarm, and (ii) often the first anomaly does not correspond to the actual fault, but may correspond to an indirect effect that is not always easily traceable to the root fault.

LOUD works under the assumption that the faulty resource is likely to generate an increasing number of strongly correlated anomalies over time during a faulty execution. This assumption derives from the experimental evaluation of all the performed experiments, both on faulty and failure-free executions.

The propagation graph is a powerful model to study the spread and the impact of anomalies across resources, since its edges capture dependencies among KPIs. For instance, the propagation graph may capture the indirect correlation between the KPI that corresponds to the number of packets sent by a resource and the KPI that corresponds to the memory consumption of a machine that does not even receive these packets, but is indirectly impacted by the generated traffic.

LOUD identifies the nodes in the propagation graph that correspond to the faulty resource by assigning scores through centrality indices, which are commonly used to identify the relevant nodes in weighted graphs with respect to the connectivity among nodes. The nodes in the propagation graph with the highest centrality scores identify the anomalous KPIs most likely related to the faulty resource. Thus we obtain a series of highly scored KPIs, one for each timestamp.

We considered different centrality indices, focusing on the ones that take into account the global influence of a single node on the whole graph. For this reason, we discarded: (i) the degree centrality and its generalizations, because they do not consider how anomalies can spread on a graph; (ii) the betweenness centrality and closeness centrality indices, since they are based on shortest paths and thus do not take into account the global structure of the graph, which is important for the diffusion of anomalies. We considered eigenvector centrality and its generalizations, non-backtracking centrality, the HITS algorithm for hubs and authorities, and
PageRank. These indices meet our requirements because they assign the highest score to the node with the highest information flow, and take into account the directions and weights of the links, and the presence of noise in the spread of the information within the graph.

Below, we introduce the centrality indices that we exploited for fault localization. We define all the indices referring to the same representation of the propagation graph based on the adjacency matrix W, where W_ij is the value of the weight w(i, j) of the edge from node i to node j. The value is 0 if there is no edge between node i and node j.

The eigenvector centrality is a vector c whose i-th component c_i represents the score of the node i [39]. In a graph with n nodes, the score c_i is

c_i = μ Σ_{j=1}^{n} W_ij c_j    (1)

that is, proportional to the sum of the weights of the edges that connect the node i to its neighbors, where μ is a proportionality constant that derives from the matrix of the eigenvalues. The eigenvector centrality score can be computed iteratively from Equation (1), and the solution is unique for a strongly connected graph with non-negative weights, as is the case of propagation graphs.
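A minimal sketch of the iterative computation of Equation (1) by power iteration, assuming the propagation graph is given as a weighted adjacency matrix W; the normalization step absorbs the proportionality constant μ. This is only an illustration, not the LOUD implementation.

```python
import numpy as np

def eigenvector_centrality(W, iters=100, tol=1e-9):
    """Power iteration for Equation (1): c converges to the leading eigenvector of W.
    W[i, j] is the weight of the edge from node i to node j (0 if absent)."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    c = np.ones(n) / n
    for _ in range(iters):
        c_new = W @ c                       # c_i proportional to sum_j W_ij * c_j
        norm = np.linalg.norm(c_new)
        if norm == 0:
            return c                        # degenerate graph: keep the uniform scores
        c_new = c_new / norm                # the constant mu is absorbed by normalization
        if np.linalg.norm(c_new - c) < tol:
            return c_new
        c = c_new
    return c
```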
Fig. 3. Sample rankings for the nodes of the propagation graph of Figure 2

The eigenvector centrality may produce misleading results when the graph contains very few nodes that act as strong hubs and create a "condensation" or "winner-takes-all" phenomenon [27]. This is because the eigenvector centrality assigns to such nodes a much higher score than to most of the nodes, thus hiding the relevance of the other nodes, which get a score close to zero even if they might be related to the faulty resource. In our case, this phenomenon may potentially bias the localization and cause false positives.

The non-backtracking centrality is a variation of the eigenvector centrality that produces more accurate centrality scores in the presence of a condensation effect. Intuitively, the non-backtracking centrality scores a node i as the sum of the scores of its neighbors computed in the absence of the node i itself [27]. The non-backtracking centrality can be computed from Equation (1) by substituting the adjacency matrix with the non-backtracking matrix B, defined as B_ij = W_ij iff (i, j) forms a non-backtracking path of length 2. A non-backtracking path is a directed path in which no edge is the inverse of the preceding edge [31]. The assumption behind the eigenvector and non-backtracking centrality is that a node is important if it is attached to many edges coming from many other important nodes. In the propagation graph, these nodes represent the KPIs that are most influenced by the spread of anomalies.

The Hyperlink Induced Topic Search (HITS) algorithm assigns a high score to nodes that are linked with other highly scored nodes [23]. This intuitively corresponds to identifying the KPIs that have a strong influence on the behavior of the system, and thus the corresponding resources. These relevant nodes might be of two kinds: authorities, which are the nodes that are the destination of edges from highly ranked nodes, and hubs, which are the nodes with many outgoing edges leading to highly ranked nodes. The equations to score nodes as authorities (a_i) or hubs (h_i) are

Σ_{j=1}^{n} W_ij a_j = δ h_i ,    Σ_{k=1}^{n} W_ki h_k = ν a_i

where δ and ν are proportionality constants related to the eigenvalues of W. The HITS algorithm combines the two equations to compute iteratively a hub score and an authority score for each node in the graph.

PageRank [24] scores nodes according to both the number of incoming edges and the probability of anomalies to randomly spread through the graph (teleportation). The PageRank score r_i of node i is computed iteratively from the scores r_j of the nodes j that are sources of edges directed to i:

r_i = α Σ_{j ∈ N_in(i)} W_ji r_j + (1 − α)/n    (2)

where N_in(i) is the set of nodes j that are sources of edges directed to i, and α is the teleportation probability (LOUD uses α = 0.). The interested reader can refer to the excellent description of Langville et al. [24] for a detailed survey on the HITS algorithm and PageRank.

Figure 3 exemplifies the effects of the different centrality indices on the propagation graph of Figure 2. The Figure shows in different colors the nodes with the highest score for each algorithm. The red node is the top-ranked node of both eigenvector and non-backtracking centrality, which means that the centrality index is evenly distributed through many important nodes in the graph, and the variations of the non-backtracking centrality have a low impact on the scores. The Figure illustrates the dual role of hubs and authorities: there is a directed edge from the top HITS Hub node to the top HITS Authority node.
The role of the purple highest-PageRank node is similar to the role of the red highest-centrality node, since both are destinations of highly weighted edges. The two nodes are different due to the impact of the PageRank teleportation.

LOUD computes the four centrality indices with the power iteration algorithm, which computes the indices with an increasingly better approximation at each iteration, and can thus be interrupted and later resumed at any time to improve the precision of the localization.
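For concreteness, the following sketch computes a weighted PageRank with teleportation in the spirit of Equation (2) by power iteration. The row normalization, the uniform handling of dangling nodes, and the damping value 0.85 are common conventions adopted here only for illustration; they are not necessarily the exact choices made in the LOUD prototype.

```python
import numpy as np

def pagerank(W, alpha=0.85, iters=100, tol=1e-10):
    """Weighted PageRank by power iteration.
    W[i, j] is the weight of the edge from node i to node j; rows are normalized
    so that each node distributes its score proportionally to its outgoing weights."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    row_sums = W.sum(axis=1, keepdims=True)
    # Nodes without outgoing edges spread their score uniformly (teleport).
    P = np.divide(W, row_sums, out=np.full_like(W, 1.0 / n), where=row_sums > 0)
    r = np.ones(n) / n
    for _ in range(iters):
        r_new = alpha * (P.T @ r) + (1 - alpha) / n   # teleportation term
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r
```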
LOUD identifies the likely faulty resource as the most frequent resource in the top 20 KPIs according to the selected centrality index, and does not suggest a fault location in the absence of a most frequent resource in the set. The intuition is that a faulty resource is likely to show an anomalous behavior for multiple KPIs, which in turn are likely to affect several other resources of the system, and thus their KPIs. As a consequence, a faulty resource is likely to be present with multiple KPIs in the top part of the ranking.
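The selection rule can be sketched as follows, assuming the anomalous KPIs are already sorted by decreasing centrality score; treating a tie between the two most frequent resources as the absence of a most frequent resource is our reading of the rule above.

```python
from collections import Counter

def localize(ranked_kpis, top_n=20):
    """Return the most frequent resource among the top-ranked KPIs,
    or None when no single most frequent resource exists.
    Each KPI is a (metric, resource) pair, sorted by centrality score."""
    resources = [resource for _metric, resource in ranked_kpis[:top_n]]
    counts = Counter(resources).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None                     # tie: no most frequent resource
    return counts[0][0]
```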
V. EVALUATION
In this section, we describe the experimental setting for evaluating LOUD: we introduce the research questions (Section V-A), the testbed that implements the cloud-based system that we use in the evaluation (Section V-B), the process for executing the experiments and collecting the data from the testbed (Section V-C), and the set of faults that we seeded in the system for evaluating
LOUD (Section V-D).
A. Research Questions
In the experiments, we addressed three research questions that qualify the effectiveness and overhead of LOUD:

RQ1: Does the choice of centrality index impact the precision of
LOUD fault localization?
As discussed in Section IV,
LOUD implements five centrality-based indices that are potentially suitable for fault localization. We experimentally compared the effectiveness of the different indices, to evaluate their impact on fault localization.
RQ2:
Do the type of the faulty resource, the type of the fault and the activation pattern impact the effectiveness of
LOUD localization?
We evaluated the accuracy of
LOUD with different kinds of faults injected in various types of resources, with distinct activation patterns, to evaluate the impact of these factors on LOUD precision. We executed the experiments with the PageRank centrality index, which the experiments for RQ1 indicate as the most effective index for the
LOUD localization.
RQ3:
What is the overhead of
LOUD on the cloud environment?
LOUD is a lightweight approach that isolates the resource-consuming activities on a dedicated server, and deploys only probes in the cloud environment. Thus, probes are the only LOUD elements that may impact the cloud system. We evaluated the impact of the LOUD probes by comparing the execution time of the monitored resources with and without active probes.
B. Testbed
We executed the experiments on a cloud environment running ClearWater, an open source implementation of an IP Multimedia Subsystem. The cluster running ClearWater consists of 8 server machines that share a local subnet. Each machine runs Ubuntu 14.04 LTS. We installed OpenStack Icehouse to manage the cluster. We configured the cluster with six compute nodes that run the KVM hypervisor and host the Virtual Machine (VM) instances that run ClearWater, a controller node that offers VM management services, from the basic ones, like creation, deletion and reboot, to more advanced ones, like identity services, image service and dashboard, and a network node that routes virtual networks through NAT. Table II shows the hardware and software configuration of each machine.

TABLE II
INFRASTRUCTURE AND PLATFORM CONFIGURATION
Role:  Controller | Network | Compute (x2) | Compute (x4)
CPU:   Intel(R) Core(TM)2 Quad CPU Q9650 (12M Cache, 3.00 GHz, 1333 MHz FSB)
RAM:
Disk:  250 GB SATA hard disk
NIC:   Intel(R) 82545EM Gigabit Ethernet Controller
Clearwater [6] exploits six virtual machines to provide messaging, voice and video communication using the SIP protocol. In particular: Bono serves as the entry point for user connections; Sprout routes the user requests forwarded by Bono and checks authentication; Homestead provides the user authentication and profile data, which are stored in a database that Sprout queries through a web service interface; Homer stores the MMTEL service settings of users in the form of standard XML; Ralf provides offline billing features by interacting with Bono and Sprout; Ellis runs as a web server for users to register and manage personal profiles and settings. Figure 4 illustrates the logical architecture of the system, indicating the dependencies across layers.
Fig. 4. Reference Logical Architecture: the ClearWater virtual machines (Ellis, Bono, Sprout, Homer, Homestead, Ralf) interacting through SIP, HTTP and XCAP, deployed on the OpenStack control, network and compute nodes.
C. Data Generation and Collection
We executed our testbed with a workload implemented with the SIPp open source SIP traffic generator (Gayraud Richard and Jacques Olivier. SIPp. http://sipp.sourceforge.net. Last access: May 2015). We shaped the number of users and calls in the workload based on week and day patterns: within each week, the number of users in working days is higher than in weekends; within each day, the number of users grows at daytime and decreases at nighttime, with peaks at 9am and 7pm.
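As an illustration of such a week/day pattern, the sketch below computes a target number of users per hour, higher on working days and peaking around 9am and 7pm; the base loads and peak widths are placeholder values, not the SIPp configuration actually used in the experiments.

```python
import math

def target_users(day_of_week, hour, weekday_base=200, weekend_base=80):
    """Hypothetical load profile: higher base load on working days (0-4),
    with daytime bulges peaking around 9:00 and 19:00 (placeholder numbers)."""
    base = weekday_base if day_of_week < 5 else weekend_base
    morning_peak = math.exp(-((hour - 9) ** 2) / 8.0)
    evening_peak = math.exp(-((hour - 19) ** 2) / 8.0)
    return int(base * (0.3 + morning_peak + evening_peak))
```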
We based the LOUD monitoring on KPIs collected from both the operating system and ClearWater. At the operating system level, LOUD uses a set of monitoring probes that collect 25 KPIs on the status of both the machines and the communication interfaces. LOUD implements the monitoring probes with common Linux monitoring tools, such as iostat, sar, vmstat, free, ps, ping, and the Psutil Python library.
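As an illustration, a probe based on the Psutil library mentioned above could collect a few OS-level KPIs per sampling round as follows; the KPI names and the subset of metrics are hypothetical and cover only a small part of the 25 KPIs collected per machine.

```python
import time
import psutil

def collect_os_kpis(resource_name):
    """One sampling round of a lightweight OS-level probe (illustrative subset)."""
    vmem = psutil.virtual_memory()
    net = psutil.net_io_counters()
    timestamp = int(time.time())
    return timestamp, {
        ("busy_cpu", resource_name): psutil.cpu_percent(interval=1),
        ("mem_used_pct", resource_name): vmem.percent,
        ("net_bytes_sent", resource_name): net.bytes_sent,
        ("net_bytes_recv", resource_name): net.bytes_recv,
    }
```

Each probe would then send the sampled values to the analysis server, for example once per minute, as described below.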
LOUD monitors the operating systems running on both the host and the virtual machines, and the machines that host the compute nodes. LOUD collects a total of 300 KPIs at the operating system level for the 6 compute nodes and 6 virtual machines. At the
Application level,
LOUD uses the SNMPv2c [5] monitoring service for ClearWater, which provides a total of 162 KPIs for the 6 ClearWater machines. The LOUD monitor samples the 462 KPIs every minute, and sends the collected data to the analysis server that runs the LOUD approach. In our setup, the analysis server runs Red Hat Enterprise Linux Server 6.3 with an Intel(R) Core(TM)2 Quad Q9650 processor at 3 GHz and 16 GB RAM. We trained LOUD offline with data collected by executing the testbed for 2 weeks under normal conditions, as recommended by IBM ITOA-PI. We used the default setup for IBM ITOA-PI, which aggregates and analyzes data in time intervals of 5 minutes.
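The aggregation itself happens inside IBM ITOA-PI; the sketch below only illustrates, on hypothetical data, what analyzing 1-minute samples at a 5-minute granularity means, using pandas resampling.

```python
import pandas as pd

# Hypothetical 1-minute samples for one KPI, as sent by a probe.
samples = pd.Series(
    [12.0, 14.5, 13.2, 40.1, 41.8, 39.7],
    index=pd.date_range("2017-10-01 09:00", periods=6, freq="1min"),
)
five_minute_view = samples.resample("5min").mean()   # analysis granularity
print(five_minute_view)
```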
D. Investigated Faults
We simulated faulty scenarios using fault injection techniques. We injected fault types commonly used in the evaluation of state-of-the-art fault localization approaches [38, 36, 16, 22]: packet loss, memory leak and CPU hog. For each fault we considered different severity growth patterns: (i) linear pattern, the fault is triggered with the same frequency over time; (ii) exponential pattern, the fault is activated with a frequency that increases exponentially, resulting in a shorter time to failure; (iii) random pattern, the fault is activated randomly over time (the three schedules are sketched below). We injected the faults by adapting the ChaosMonkey open-source failure generator (Cory Bennett. Chaos Monkey. https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey. Last access: Oct. 2017) to enable the various types of fault severity growth.

We injected each type of fault in Bono, Sprout and Homestead, which are the three core components of ClearWater. To increase the generality of the results, we repeated the injection process four times for each type of fault. This produces a total of 108 experimented cases: 3 types of faults (packet loss, memory leak, CPU hog) injected in 3 different resources (Bono, Sprout, Homestead) with 3 different activation patterns, repeated four times.
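A rough sketch of the three activation schedules follows; the periods and the growth factor are placeholders, since the actual injection is performed by the adapted ChaosMonkey and its parameters are not restated here.

```python
import random

def activation_times(pattern, duration_min=120, base_period=10):
    """Hypothetical activation schedules for the three severity growth patterns."""
    times, t = [], 0.0
    if pattern == "linear":                    # constant activation rate
        while t + base_period <= duration_min:
            t += base_period
            times.append(t)
    elif pattern == "exponential":             # intervals shrink over time
        interval = base_period
        while t + interval <= duration_min:
            t += interval
            times.append(t)
            interval = max(0.5, interval * 0.8)
    elif pattern == "random":                  # activations drawn uniformly
        times = sorted(random.uniform(0, duration_min)
                       for _ in range(duration_min // base_period))
    return [round(x, 1) for x in times]
```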
E. Quality Metrics

We measure the effectiveness of
LOUD (RQ1 and RQ2) by computing precision, recall, and F1-score of the localization at each timestamp after the activation of the fault. In particular, we define true/false positives/negatives as follows:
• true positive (TP) at time t: the LOUD localization produced at time t is correct,
• false positive (FP) at time t: the LOUD localization produced at time t is wrong,
• false negative (FN) at time t: LOUD cannot identify a most frequent resource in the top 20 anomalies, and thus does not produce a localization at time t.
We compute the precision as the rate of correct localizations out of the total set of localizations produced at time t: TP / (TP + FP). We compute the recall as the rate of correct localizations out of the total set of localizations that should have been generated at time t: TP / (TP + FN). We compute the F1-score as the harmonic average between precision and recall: 2 · precision · recall / (precision + recall). We measure the intrusiveness of the probes by computing the overhead of the probes (RQ3).
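The per-timestamp scores can be computed directly from the TP, FP and FN counts defined above; a minimal sketch:

```python
def quality_scores(tp, fp, fn):
    """Precision, recall and F1-score from per-timestamp localization counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```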
Fig. 5. F1-score of each technique per fault type

VI. EXPERIMENTAL RESULTS

In this section, we discuss the results of our experiments, organized according to the research questions presented above.
A. RQ1: Does the choice of centrality index impact the precision of
LOUD fault localization?
We evaluate the impact of the centrality index by measuring the effectiveness of LOUD in localizing faults, when executing LOUD with the different indices for various types of injected faults. Figure 5 shows how the effectiveness of the localization changes over time for the five centrality indices that we identified in Section IV (eigenvector centrality, non-backtracking centrality, HITS algorithm for hubs and authorities, and PageRank) and for the three classes of faults that we considered (packet loss, memory leak, and CPU hog).

The diagrams reported in this and all the figures of the paper show the F1-score computed for experiments starting from a fault injected at time 0 and lasting 120 minutes. The null value of the F1-score for the first 20 minutes of the experiments derives from IBM ITOA-PI, which does not compute anomalies for the first 20 minutes of execution. In all the experiments, the cloud system failed after the 120-minute time interval shown in the plots. Figure 5 allows us to draw some considerations.

LOUD performs better with PageRank than with any other considered index: LOUD with PageRank dominates every other algorithm for all fault types through most of the observed time interval. This indicates that PageRank, and the concept of teleportation that encodes the probability of anomalies to spread non-uniformly across the graph, as may happen in real scenarios, well captures how anomalies can spread in cloud systems.
The effectiveness of
LOUD localization depends on the fault type: The effectiveness of fault localization strongly depends on the kind of injected fault. Indeed, anomalies spread according to significantly different patterns for different fault types. Although the centrality indices present different effectiveness for different fault types, the relative difficulty to localize a fault does not depend on the specific centrality index. Packet loss faults are the hardest faults to localize for all the indices, likely because the effect of network problems easily and quickly spreads through the system resources, in a way that is difficult to trace back to the source node. Memory leaks are also challenging, but definitely easier to localize than packet loss faults. CPU hogs spread with patterns that are the easiest to trace back to the root cause among the considered faults, for all algorithms.
The effectiveness of the
LOUD localization does not always improve with time: The results do not confirm the intuition that the quality of the localization improves over time, due to the increased evidence of both the failures and their causes over time. CPU hogs are the only type of fault with an increasing quality of the localization over time. This is likely due to the increasingly stronger impact of the CPU utilization on the affected resource compared to the other resources of the system. In the presence of both packet loss and memory leaks, the quality of the localization first increases and then decreases, with some perturbation in the intermediate phase. This suggests that in an initial phase the anomalies incrementally produced by a fault accumulate in a way that facilitates the localization task, but at some point the spreading is so extensive that it gets confused within the noise of the system, that is, with the anomalies that are regularly produced by the system despite the presence of faults.

Having identified PageRank as the index that leads to the best performance compared to the other centrality indices, we conducted all the other experiments with PageRank.

Fig. 6. F1-score, precision and recall of PageRank per fault type
Fig. 7. F1-score, precision and recall of PageRank per activation pattern
B. RQ2: Do the type of the faulty resource, the type of the fault and the activation pattern impact the effectiveness of
LOUD localization?
We investigate the effectiveness of the
LOUD localization with respect to three key dimensions: fault types, fault activation patterns, and faulty resources.

Figure 6 shows F1-score, precision, and recall of LOUD with packet loss, memory leak and CPU hog injected faults. The precision and recall follow some interesting and complementary trends for the different classes of faults. The localization of packet loss faults reaches a high F1-score only after a long delay (about 65 minutes after the fault injection, and 40 minutes after the first IBM ITOA-PI anomaly), and does not stabilize; thus the localization often fails in identifying the faulty resource, as witnessed by a recall higher than precision. The localization of packet loss faults is precise only in a fairly short time interval (about 10 minutes) in terms of F1-score. LOUD addresses memory leak faults well despite an unstable F1-score, as witnessed by an always high precision with a drop of the recall when far from the fault activation. Thus, LOUD may fail in identifying the fault location, but it identifies it precisely when it succeeds. LOUD addresses CPU hogs well, both in terms of precision and recall.

Figure 7 shows F1-score, precision and recall for the different activation patterns, averaged over fault types. LOUD shows a similar trend for all the activation patterns, thus suggesting a reasonable independence of the effectiveness of the technique with respect to the growth rate. The effectiveness of LOUD remains stable over time for both the linear and random activation patterns, and slightly decreases in the long term for the exponential pattern. This is likely due to the rapid increase and spread of anomalies that impact the effectiveness of the localization algorithm.

Figure 8 shows the F1-score, precision and recall per resource, and indicates that LOUD is precise for all the resources, slightly less stable for Homestead. A careful inspection of the results indicates that the localization phase sometimes erroneously identifies Sprout as the faulty resource instead of Homestead. The architecture of the testbed discussed in Section V-B shows that Homestead and Sprout are highly interacting. Since LOUD locates faults well for other highly interacting resources, for instance Bono and Sprout, the relatively lower effectiveness for Homestead may depend on the specific characteristics of the interaction. In our prototype setting, the KPIs are unevenly distributed between Sprout and Homestead: 92 KPIs for Sprout, 71 KPIs for Homestead. Since a fault in Homestead is likely to impact both resources, the smaller number of KPIs for Homestead than for Sprout may bias the fault localization in favour of Sprout. The results might likely be improved by either collecting a set of evenly distributed KPIs or introducing a normalization strategy that takes into account the number of KPIs extracted from each resource. Both studies are part of our current research plan.

Fig. 8. F1-score, precision and recall of PageRank per resource
C. RQ3: What is the overhead of
LOUD on the cloud environment?
The
LOUD overhead on the production system derives only from the monitoring probes, which collect and send data to the analysis node running LOUD. LOUD can be executed either on a separate node of the cloud system or on a separate machine outside of the cloud environment. In the former case, the cloud infrastructure has to allocate enough resources to support the execution of the analysis node; in the latter case, LOUD does not require additional resources from the cloud infrastructure. In both cases, LOUD does not introduce significant overhead on the monitored resources: in our experiments, we measured an average increase of 2.63% in CPU usage and 1.91% in memory usage of the monitored resources, which is a negligible overhead for cloud services.

Overall, LOUD proves effective in localizing faults in the cloud, with an effectiveness that depends on the kind of fault to be localized, but is largely independent of the fault activation pattern. In particular, LOUD is very effective with CPU hogs (the LOUD localization is precise, quick and constantly available after the activation of the fault), memory leaks (the LOUD localization is always precise, although sometimes hindered by the noise in the anomalies), and useful with packet loss (the LOUD localization is precise, at least in some time interval after the fault activation). These results are extremely good, since the effectiveness of LOUD is in line with the effectiveness of competing approaches, but LOUD relies solely on training with normal executions, while competing approaches require long training sessions with injected faults [38]. Sessions with injected faults strongly limit the applicability of the approaches to large and evolving cloud systems, which tolerate artificially faulty sessions badly, if at all, and cannot be replicated in the laboratory.
D. Threats to Validity
The main threats to the validity of our results derive from the fault injection strategy, the configuration of the prototype and the type of faults that we considered in the experiments. We mitigated the bias that may derive from the specific fault injection strategy that we adopted by both using ChaosMonkey, a freely available tool for fault injection, and considering multiple activation patterns to increase the generality.

The specific configuration of our prototype and of the set of faults that we investigated may limit the generality of our results. Building complex infrastructures and repeating the experiments with multiple testbeds is extremely expensive. We mitigated this bias by experimenting with ClearWater, an environment widely used in academia and industry, since it is designed for scalable deployment in the cloud [38, 4]. The scalability of LOUD has not been tested, but it is potentially supported by the nature of the algorithm, which is based on IBM ITOA-PI, a tool for big data analytics, and PageRank, which is a scalable algorithm by design. We experimented with a relatively small set of types of faults. Although
LOUD does not depend on the specific class of faults, its effectiveness may depend on the faults, as observed in the paper. For this reason, we report and discuss the results at the level of individual fault types.

VII. RELATED WORK
So far, cloud-based fault localization [42, 25] has focused on few fault types, the most common ones being performance faults, resource faults, and operational faults. Performance faults are faults that cause the unexpected degradation of the system performance, such as packet loss faults. Performance faults may be due to chronic problems [19, 21], task interference [30, 8], and latency increments [28, 15, 16, 37]. Resource faults [20, 7, 38, 33, 43] are faults that cause the incorrect allocation and use of system resources, such as memory leaks and CPU hogs. Resource faults may also derive from the excessive use and exhaustion of resources. Operational faults [9, 49, 26, 41, 29] are execution faults that impact performance, and originate from deadlocks, infinite loops, unhandled exceptions and emergent behaviors. Our experimental evaluation with LOUD considered both performance and resource faults.

Fault localization techniques implement different analysis techniques: latency analysis, data analytics, machine learning, graph-based approaches and their combinations.

Latency Analysis consists of identifying operations with anomalous latency to diagnose the possible faulty elements responsible for the anomalies. The localization can be at different granularity levels, from individual methods to network nodes. For example, CloudDiag [28] captures user requests, traces the execution time of the individual methods, and identifies the method calls that are likely responsible for an observed anomaly, based on the distribution of the latency time. Sometimes the cause of a problem is not a single method but a sequence of method calls. This is why DARC [45] identifies the root cause path, which is a call-path that starts from a given function and includes the largest latency contributors to a given peak time. Khanna et al. define an architecture for the localization of faulty nodes and links by collecting network measurements [22]. LOUD is not limited to latency analysis but collects and analyzes a range of data, including the data necessary for performance analysis, as reported in the empirical evaluation.

Data Analytics exploits various kinds of statistical techniques to localize faults. PeerWatch [20] uses canonical correlation analysis (CCA) to identify the correlations between the characteristics of multiple application instances and localize faults. PerfCompass [8] collects data about the performance of system calls while virtual machines behave correctly, and analyzes the data collected in the presence of performance problems to classify the root cause of the fault as either external or internal. Johnsson et al. define an algorithm to determine the root cause of a performance degradation problem by using network level measurements and a graphical model of the network [19]. Herodotou et al. focus on data center failures [16], using active monitoring techniques, such as pings, to produce a ranked list of devices and links that are likely related to the root cause. While statistical techniques may capture the resources with a behavior well correlated with the observed failures, they cannot always trace a problem to its root cause, because they do not consider how misbehaviors propagate across the resources of the system, as centrality indices do in
LOUD.

Machine Learning is another widely adopted solution to localize faults. UBL [7] leverages Self-Organizing Maps (SOM), a particular type of Artificial Neural Network, to capture emergent behaviors and unknown anomalies. Yuan et al. exploit machine learning to analyze system call information on Windows XP, and classify the information according to known faults [51]. POD-Diagnosis [49] deals with faults in sporadic operations, for instance upgrade and replacement, by extracting events from logs, detecting errors from events, and using a pre-constructed tree that encodes fault knowledge to identify the root causes. BCT [26] exploits behavioral models inferred from test executions to identify the causes of software failures. Sauvanaud et al. [38] use Random Forest to detect Service Level Agreement (SLA) violations and infer the root cause. While many techniques based on machine learning build their models involving a prior characterization of the fault, through either fault injection or human knowledge, LOUD localizes faults by exploiting centrality indices and studying the propagation of the anomalies in the cloud. In this way, LOUD requires only data from normal executions and needs no predefined fault profile.

Graph-based approaches localize faults by using graph models derived from the network topology and the dependencies between services. Sharma et al. propose an algorithm to localize the problematic metrics by using a dependency graph built with invariants among system metrics [40]. Gestalt [29] combines the features of existing algorithms to find the culprit component in a transaction failure, with the help of a dependency graph. Tati et al. propose an adaptive algorithm to diagnose large-scale failures in computer networks [44], by analyzing a graph that represents the network topology. Graph-based approaches that refer to structural dependencies might miss the often relevant indirect impact that the behavior of a resource may have on another resource. LOUD analyzes a graph that represents causal relations between anomalous KPIs, thus referring to the monitored mutual impact among the resources in the system.

Some approaches combine different techniques similarly to LOUD, but focus on a single fault type. PerfScope [9] combines hierarchical clustering, outlier detection and finite state machine based matching to address operational faults. Kahuna [43] combines clustering and latency analysis on Hadoop network resource faults. CloudPD [41] combines Hidden Markov Models (HMM), k-nearest-neighbor (kNN), and k-means clustering to build the performance model, and uses statistical correlations to identify anomalies and signature-based classification to identify operational faults. Pingmesh [15] combines data analytics and latency analysis to detect network faults.

VIII. CONCLUSIONS
In this paper we present
LOUD, an approach for localizing faults in cloud systems, which is based on graph centrality algorithms to efficiently locate faults with limited interference with the target system, both during training and under operational conditions. The LOUD approach improves over state-of-the-art techniques by (i) relying on a training phase that requires only normal executions, thus supporting model training in the field without the relevant interferences that derive from the fault injection required by current techniques, (ii) running on an independent machine that works on metrics usually already collected in the field, and thus with negligible interference with the system, and (iii) locating faults with an efficiency in line with the best state-of-the-art approaches.

The unique LOUD feature of relying only on data from normal executions, jointly with the negligible execution overhead and the good effectiveness, makes LOUD well suited for the many large, continuously running systems that cannot be suspended and sand-boxed to perform in-lab experiments.

ACKNOWLEDGMENT

This work has been partially supported by the H2020 Learn project, which has been funded under the ERC Consolidator Grant 2014 program (ERC Grant Agreement n. 646867), the Italian Ministry of Education, University, and Research (MIUR) with the PRIN projects IDEAS (grant n. PRIN-2012E47TM2 006) and GAUSS (grant n. 2015KWREMX), by the H2020 NGPaaS project (grant n. 761557), and by IC Information Company AG.
REFERENCES

[1] Andrew Arnold, Yan Liu, and Naoki Abe. "Temporal Causal Modeling with Graphical Granger Methods". In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '07. ACM, 2007, pp. 66–75.
[2] Eric Bauer and Randee Adams. Reliability and Availability of Cloud Computing. Wiley, 2012.
[3] CA Unified Infrastructure Management Probes. https://docops.ca.com/ca-unified-infrastructure-management-probes/ga/en/alphabetical-probe-articles/openstack-openstack-cloud-monitoring.
[4] Lianjie Cao et al. "NFV-VITAL: A framework for characterizing the performance of virtual network functions". In: Proceedings of the IEEE Conference on Network Function Virtualization and Software Defined Networks. NFV-SDN '15. IEEE Computer Society, 2015, pp. 93–99.
[5] J. D. Case, K. McCloghrie, M. Rose, and S. Waldbusser. Introduction to Community-based SNMPv2. IETF RFC 1901. Jan. 1996.
[6] Project Clearwater.
IMS in the Cloud
Proceedings of the International Conference on Autonomic Computing. ICAC '12. ACM, 2012, pp. 191–200.
[8] Daniel J. Dean et al. "PerfCompass: Online Performance Anomaly Fault Localization and Inference in Infrastructure-as-a-Service Clouds". In:
IEEE Transac-tions on Parallel and Distributed Systems
PP.99 (2015),pp. 1–1.[9] Daniel J. Dean et al. “PerfScope: Practical OnlineServer Performance Bug Inference in Production CloudComputing Infrastructures”. In:
Proceedings of theAnnual Symposium on Cloud Computing . SoCC ’14.ACM, 2014, 8:1–8:13.[10] Giuseppe DeCandia et al. “Dynamo: amazon’s highlyavailable key-value store”. In:
ACM SIGOPS OperatingSystems Review
Software EngineeringInstitute
International Conference on Emerging Network In-telligence . EMERGING 2013. IARIA, 2013, pp. 60–64.[13] Errin W. Fulp, Glenn A Fink, and Jereme N Haack.“Predicting Computer System Failures Using SupportVector Machines.” In:
Proceedings of the USENIX con-ference on Analysis of system logs . WASL’08. USENIXAssociation, 2008, pp. 5–5. [14] Clive W. J. Granger. “Investigating Causal Relations byEconometric Models and Cross-spectral Methods”. In:
Econometrica
37 (1969), pp. 424–438.[15] Chuanxiong Guo et al. “Pingmesh: A Large-Scale Sys-tem for Data Center Network Latency Measurementand Analysis”. In:
Proceedings of the ACM SIGCOMMConference . SIGCOMM ’15. ACM, 2015, pp. 139–152.[16] Herodotos Herodotou et al. “Scalable Near Real-timeFailure Localization of Data Center Networks”. In:
Pro-ceedings of the ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining . SIGKDD’14. ACM, 2014, pp. 1689–1698.[17] IBM Corporation.
IBM SmartCloud Analytics - Predictive Insights. https://developer.ibm.com/itoa/docs/ibm-operations-analytics/. Last access: Oct. 2017.
[18] European Telecommunications Standards Institute.
Net-work Functions Virtualisation
Proceedings of the Conference on NetworkOperations and Management Symposium . NOMS ’14.IEEE Computer Society, 2014, pp. 1–8.[20] Hui Kang, Haifeng Chen, and Guofei Jiang. “Peer-Watch: A Fault Detection and Diagnosis Tool for Vir-tualized Consolidation Systems”. In:
Proceedings ofthe International Conference on Autonomic Computing .ICAC ’10. ACM, 2010, pp. 119–128.[21] Soila Kavulya et al. “Draco: Statistical diagnosis ofchronic problems in large distributed systems”. In:
Pro-ceedings of the International Conference on Depend-able Systems and Networks . DSN ’12. IEEE ComputerSociety, 2012, pp. 1–12.[22] R. Khanna et al.
Monitoring and detecting causes offailures of network paths . US Patent 8,661,295. Feb.2014.
URL
J. ACM
SIAM Review
Science of Computer Programming
IEEE Transactions on Software Engineering (2011).[27] Travis Martin, Xiao Zhang, and M. E. J. Newman.“Localization and centrality in networks”. In:
Phys. Rev.E
90 (5 Nov. 2014), p. 052808.
[28] Haibo Mi et al. "Toward Fine-Grained, Unsupervised, Scalable Performance Diagnosis for Production Cloud Computing Systems". In:
IEEE Transactions on Paralleland Distributed Systems
Proceedings of the USENIX Conference on AnnualTechnical Conference . ATC ’14. USENIX Association,2014, pp. 255–267.[30] Dejan Novakovi´c et al. “DeepDive: Transparently Iden-tifying and Managing Performance Interference in Vir-tualized Environments”. In:
Proceedings of the USENIX Conference on Annual Technical Conference. ATC '13. USENIX Association, 2013, pp. 219–230.
[31] Omer Angel, Joel Friedman, and Shlomo Hoory. "The non-backtracking spectrum of the universal cover of a graph". In:
Transactions of the American MathematicalSociety
IEEE Transac-tions on Software Engineering
ACM SIGMETRICS Perfor-mance Evaluation Review
Proceedings of the In-ternational Conference on Computer Communicationsand Networks . ICCCN ’10. IEEE Computer Society,2010, pp. 1–6.[35] Oliviero Riganelli et al. “Power Optimization in Fault-Tolerant Mobile Ad Hoc Networks”. In:
Proceedings ofthe International Symposium on High Assurance Sys-tems Engineering . HASE ’08. IEEE Computer Society,2008, pp. 362–370.[36] Arjun Roy et al. “Passive Realtime Datacenter FaultDetection and Localization”. In:
Proceedings of theConference on Networked Systems Design & Implemen-tation . NSDI ’17. Boston, MA: USENIX Association,2017, pp. 595–612.[37] Swati Roy and Nick Feamster. “Characterizing corre-lated latency anomalies in broadband access networks”.In:
Proceedings of the ACM SIGCOMM ComputerCommunication Review . CCR ’13. IEEE Computer So-ciety, 2013, pp. 525–526.[38] C. Sauvanaud et al. “Anomaly Detection and RootCause Localization in Virtual Network Functions”. In:
Proceedings of the 27th International Symposium onSoftware Reliability Engineering . ISSRE ’16. IEEEComputer Society, 2016, pp. 196–206.[39] John P. Scott and Peter J. Carrington.
The SAGE Hand-book of Social Network Analysis . Sage PublicationsLtd., 2011.[40] Abhishek B. Sharma et al. “Fault detection and local-ization in distributed systems using invariant relation-ships”. In:
Proceedings of the International Conference on Dependable Systems and Networks . DSN ’13. IEEEComputer Society, 2013, pp. 1–8.[41] Bikash Sharma et al. “CloudPD: Problem determinationand diagnosis in shared dynamic clouds”. In:
Proceed-ings of the International Conference on DependableSystems and Networks . DSN ’13. IEEE Computer So-ciety, 2013, pp. 1–12.[42] James P. G. Sterbenz et al. “Resilience and Survivabilityin Communication Networks: Strategies, Principles, andSurvey of Disciplines”. In:
Computer Networks: TheInternational Journal of Computer and Telecommuni-cations Networking
Proceedings of the Conference on Network Operationsand Management Symposium . NOMS ’10. IEEE Com-puter Society, 2010, pp. 112–119.[44] Srikar Tati et al. “Adaptive Algorithms for DiagnosingLarge-Scale Failures in Computer Networks”. In:
IEEETransactions on Parallel and Distributed Systems
Proceedings of the International Conferenceon Measurement and Modeling of Computer Systems .SIGMETRICS ’08. ACM, 2008, pp. 277–288.[46] Ricardo Vilalta and Sheng Ma. “Predicting rare eventsin temporal domains”. In:
Proceedings of the Interna-tional Conference on Data Mining . ICDM ’02. IEEEComputer Society, 2002, pp. 474–481.[47] Andrew W. Williams, Soila M Pertet, and PriyaNarasimhan. “Tiresias: Black-box failure prediction indistributed systems”. In:
Proceedings of the Interna-tional Parallel and Distributed Processing Symposium .IPDPS ’07. IEEE Computer Society. 2007, pp. 1–8.[48] Chen Xin, Lu Charng-Da, and Pattabiraman Karthik.“Failure Analysis of Jobs in Compute Clouds: A GoogleCluster Case Study”. In:
Proceedings of the Interna-tional Symposium on Software Reliability Engineering .ISSRE ’14. IEEE Computer Society, 2014, pp. 167–177.[49] Xiwei Xu et al. “POD-Diagnosis: Error Diagnosis ofSporadic Operations on Cloud Applications”. In:
Pro-ceedings of the International Conference on Depend-able Systems and Networks . DSN ’14. IEEE ComputerSociety, 2014, pp. 252–263.[50] Tan Yongmin et al. “PREPARE: Predictive PerformanceAnomaly Prevention for Virtualized Cloud Systems”.In:
Proceedings of the International Conference on Dis-tributed Computing Systems . ICDCS ’12. IEEE Com-puter Society, 2012, pp. 285–294.[51] Chun Yuan et al. “Automated Known Problem Diag-nosis with Event Traces”. In: