Causality-Guided Adaptive Interventional Debugging
Technical Report
Anna Fariha
University of Massachusetts, Amherst, MA
[email protected]
Suman Nath
Microsoft Research, Redmond, WA
Alexandra Meliou
University of Massachusetts, Amherst, MA
[email protected]
ABSTRACT
Runtime nondeterminism is a fact of life in modern database applications. Previous research has shown that nondeterminism can cause applications to intermittently crash, become unresponsive, or experience data corruption. We propose Adaptive Interventional Debugging (AID) for debugging such intermittent failures. AID combines existing statistical debugging, causal analysis, fault injection, and group testing techniques in a novel way to (1) pinpoint the root cause of an application's intermittent failure and (2) generate an explanation of how the root cause triggers the failure. AID works by first identifying a set of runtime behaviors (called predicates) that are strongly correlated to the failure. It then utilizes temporal properties of the predicates to (over-)approximate their causal relationships. Finally, it uses fault injection to execute a sequence of interventions on the predicates and discover their true causal relationships. This enables AID to identify the true root cause and its causal relationship to the failure. We theoretically analyze how fast AID can converge to the identification. We evaluate AID with six real-world applications that intermittently fail under specific inputs. In each case, AID was able to identify the root cause and explain how the root cause triggered the failure, much faster than group testing and more precisely than statistical debugging. We also evaluate AID with many synthetically generated applications with known root causes and confirm that the benefits also hold for them.
Keywords
Root cause analysis; trace analysis; causal intervention; group testing; concurrency bug
1. INTRODUCTION
Modern data management systems and database-backed applications run on commodity hardware and heavily rely on asynchronous and concurrent processing [16, 24, 60, 51]. As a result, they commonly experience runtime nondeterminism such as transient faults and variability in timing and thread scheduling. Unfortunately, software bugs related to handling nondeterminism are also common to these systems. Previous studies reported such bugs in MySQL [45, 11], PostgreSQL [44], NoSQL systems [39, 70], and database-backed applications [8], and showed that the bugs can cause crashes, unresponsiveness, and data corruption. It is, therefore, crucial to identify and fix these bugs as early as possible.

Unfortunately, localizing root causes of intermittent failures is extremely challenging [46, 71, 42]. For example, concurrency bugs such as deadlocks, order and atomicity violations, race conditions, etc. may appear only under very specific thread interleavings. Even when an application executes with the same input in the same environment, these bugs may appear only rarely (e.g., in flaky unit tests [46]). When a concurrency bug is confirmed to exist, the debugging process is further complicated by the fact that the bug cannot be consistently reproduced. Heavyweight techniques based on record-replay [3] and fine-grained tracing with lineage [2, 54] can provide insights on root causes after a bug manifests; but their runtime overheads often interfere with thread timing and scheduling, making it even harder for the intermittent bugs to manifest in the first place [37].
Statistical Debugging (SD) [31, 41, 43, 29] is a data-driven technique that partly addresses the above challenge. SD uses lightweight logging to capture an application's runtime (mis)behaviors, called predicates. An example predicate indicates whether a method returns null in a particular execution or not. Given an application that intermittently fails, SD logs predicates from many successful and failed executions. SD then uses statistical analyses of the logs to identify discriminative predicates that are highly correlated with the failure.

SD has two key limitations. First, SD can produce many discriminative predicates that are correlated to, but not a true cause of, a failure. Second, SD does not provide enough insights that can explain how a predicate may eventually lead to the failure. Lack of such insights and the presence of many non-causal predicates make it hard for a developer to identify the true root cause of a failure. SD expects that a developer has sufficient domain knowledge about if/how a predicate can eventually cause a failure, even when the predicate is examined in isolation without additional context. This is often hard in practice, as is reported by real-world surveys [55].
Example 1.
To motivate our work, we consider a recently reported issue in Npgsql [52], an open-source ADO.NET data provider for PostgreSQL. On its GitHub repository, a user reported that a database application intermittently crashes when it tries to create a new PostgreSQL connection (GitHub issue
In Section 7, we describe five other case studies that show the same general problem: SD produces too many predicates, only a small subset of which are causally related to the failure. Thus, SD is not specific enough, and it leaves the developer with the task of identifying the root causes from a large number of candidates. This task is particularly challenging, since SD does not provide explanations of how a potential predicate can eventually lead to the failure.

In this paper, we address these limitations with a new data-driven technique called Adaptive Interventional Debugging (AID). Given predicate logs from successful and failed executions of an application, AID can pinpoint why the application failed, by identifying one (or a small number of) predicate that indicates the real root cause (instead of producing a large number of potentially unrelated predicates). Moreover, AID can explain how the root cause leads to the failure, by automatically generating a causal chain of predicates linking the root cause, subsequent effects, and the failure. By doing so, AID enables a developer to quickly localize (and fix) the bug, even without deep knowledge about the application.

AID achieves the above by combining SD with causal analysis [50, 49, 48], fault injection [2, 23, 34], and group testing [26] in a novel way. Like SD, it starts by identifying discriminative predicates from successful and failed executions. In addition, AID uses temporal properties of the predicates to build an approximate causal DAG (Directed Acyclic Graph), which contains a superset of all true causal relationships among predicates. AID then progressively refines the DAG. In each round of refinement, AID uses ideas from adaptive group testing to carefully select a subset of predicates.
Then, AID re-executes the application, during which it intervenes on the selected predicates (i.e., modifies the application's behavior by, e.g., injecting faults). Depending on whether the intervened executions fail or not, AID confirms or discards causal relationships in the approximate causal DAG, assuming counterfactual causality (C is a counterfactual cause of F iff F would not occur unless C occurs) and a single root cause. A sequence of interventions enables AID to identify the root cause and generate a causal explanation path, a sequence of causally-related predicates that connect the root cause to the failure.

A key benefit of AID is its efficiency: it can identify root-cause and explanation predicates with significantly fewer rounds of interventions than adaptive group testing. In group testing, predicates are considered independent, and hence each round selects a random subset of predicates to intervene on and makes causality decisions about only those intervened predicates. In contrast, AID uses potential causality among predicates (in the approximate causal DAG). This enables AID to (1) make decisions not only about the intervened predicates, but also about other predicates; and (2) carefully select predicates whose intervention would maximize the effect of (1). Through theoretical and empirical analyses we show that this can significantly reduce the number of required interventions. This is an important benefit in practice, since each round of intervention involves executing the application with fault injection and hence is time-consuming.

We evaluated AID on 3 open-source applications: Npgsql, Apache Kafka, and Microsoft Azure Cosmos DB, and on 3 proprietary applications in Microsoft. We used known issues that cause these applications to intermittently fail even for the same inputs. In each case, AID was able to identify the root cause of the failure and generate an explanation that is consistent with the explanation provided by the respective developers.
Moreover, AID achieved this with significantly fewer interventions than traditional adaptive group testing. We also performed a sensitivity analysis of AID with a set of synthetic workloads. The results show that AID requires fewer interventions than traditional adaptive group testing, and has significantly better worst-case performance than other variants. In summary, we make the following contributions:

• We propose Adaptive Interventional Debugging (AID), a data-driven technique that localizes the root cause of an intermittent failure through a novel combination of statistical debugging, causal analysis, fault injection, and group testing (Section 2). AID provides significant benefits over state-of-the-art Statistical Debugging (SD) techniques by (1) pinpointing the root cause of an application's failure and (2) generating an explanation of how the root cause triggers the failure (Sections 3–5). In contrast, SD techniques generate a large number of potential causes without explaining how a potential cause may trigger the failure.

• We use an information-theoretic analysis to show that AID, by utilizing causal relationships among predicates, can converge to the true root cause and explanation significantly faster than traditional adaptive group testing (Section 6).

• We evaluate AID with six real-world applications that intermittently fail under specific inputs (Section 7). AID was able to identify the root causes and explain how the root causes triggered the failures, much faster than adaptive group testing and more precisely than SD. We also evaluate AID with many synthetically generated applications with known root causes and confirm that the benefits hold for them as well.
2. BACKGROUND AND PRELIMINARIES
AID combines several existing techniques in a novel way. We now briefly review these techniques.
Statistical Debugging
Statistical debugging (SD) aims to automatically pinpoint likely causes for an application's failure by statistically analyzing its execution logs from many successful and failed executions. It works by instrumenting an application to capture runtime predicates about the application's behavior. Examples of predicates include "the program takes the false branch at line 31", "the method foo() returns null", etc. Executing the instrumented application generates a sequence of predicate values, which we refer to as predicate logs. Without loss of generality, we assume that all predicates are Boolean.

Intuitively, the true root cause of the failure will cause certain predicates to be true only in the failed logs (or, only in the successful logs). Given logs from many successful executions and many failed executions of an application, SD aims to identify those discriminative predicates. Discriminative predicates encode program behaviors of failed executions that deviate from the ideal behaviors of the successful executions. Without loss of generality, we assume that discriminative predicates are true during failed executions. The predicates can further be ranked based on their precision and recall, two well-known metrics that capture their discriminatory power:

precision(P) = (# failed executions where P is true) / (# executions where P is true)
recall(P) = (# failed executions where P is true) / (# failed executions)

Causality
Informally, causality characterizes the relationship between an event and an outcome: the event is a cause if the outcome is a consequence of the event. There are several definitions of causality [22, 57]. In this work, we focus on counterfactual causes. According to counterfactual causality, C causes E iff E would not occur unless C occurs. Reasoning about causality frequently relies on a mechanism for interventions [56, 25, 68, 61], where one or more variables are forced to particular values, while the mechanisms controlling other variables remain unperturbed. Such interventions uncover counterfactual dependencies between variables.

Trivially, executing a program is a cause of its failure: if the program was not executed in the first place, the failure would not have occurred. However, our analysis targets fully-discriminative predicates (with 100% precision and 100% recall), thereby eliminating such trivial predicates that are program invariants.
Fault Injection
In software testing, fault injection [2, 23, 34, 47] is a technique to force an application, by instrumenting it or by manipulating the runtime environment, to execute a different code path than usual. We use the technique to intervene on (i.e., repair) discriminative predicates. Consider a method ExecQuery() that returns a result object in all successful executions and null in all failed executions. Then, the predicate "ExecQuery() returns null" is discriminative. The predicate can be intervened on by forcing ExecQuery() to return the correct result object. Similarly, the predicate "there is a data race on X" can be intervened on by delaying one access to X, or by putting a lock around the code segments that access X to avoid simultaneous accesses to X.
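As an illustration, the lock-based repair of a data-race predicate can be sketched as follows. This is a minimal Python sketch: the `Account` class, its racy `deposit` method, and the wrapper installer are hypothetical stand-ins for the instrumented methods and the shared object X; the paper's actual tooling injects faults into compiled applications.

```python
import threading

# Hypothetical shared object playing the role of "X" in the text.
class Account:
    def __init__(self):
        self.balance = 0

    def deposit(self, amount):
        # Racy read-modify-write on shared state.
        tmp = self.balance
        tmp += amount
        self.balance = tmp

def install_lock_intervention(account):
    """Intervene on the predicate "there is a data race on balance"
    by wrapping the racy code segment in a lock, as the text describes."""
    lock = threading.Lock()
    original = account.deposit
    def locked_deposit(amount):
        with lock:                      # serialize accesses to X
            original(amount)
    account.deposit = locked_deposit

acct = Account()
install_lock_intervention(acct)
threads = [threading.Thread(target=acct.deposit, args=(1,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(acct.balance)  # 100: the intervention prevents lost updates
```

If the application no longer fails under this intervention, the data-race predicate is implicated as a cause.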
Group Testing
Given a set of discriminative predicates, a naïve approach to identify which predicates cause the failure is to intervene on one predicate at a time and observe if the intervention causes an execution to succeed. However, the number of required interventions is linear in the number of predicates.
Group testing reduces the number of interventions. Group testing refers to the procedure that identifies certain items (e.g., defective ones) among a set of items while minimizing the number of group tests required. Formally, given a set P of N elements where D of them are defective, group testing performs k group tests, each on a group P_i ⊆ P. The result of the test on group P_i is positive if ∃P ∈ P_i s.t. P is defective, and negative otherwise. The objective is to minimize k, i.e., the number of group tests required. In our context, a group test is a simultaneous intervention on a group of predicates, and the goal is to identify the predicates that cause the failure.

Two variations of group testing are studied in the literature: adaptive and non-adaptive. Our approach is based on adaptive group testing, where the i-th group to test is decided after we observe the results of all previous group tests j, 1 ≤ j < i. A trivial upper bound for adaptive group testing [26] is O(D log N): a simple binary search algorithm can find each of the D defective items in at most log N group tests, and hence a total of D log N group tests are sufficient to identify all defective items. Note that if D ≥ N/log N, then a linear strategy is preferable over any group testing scheme. Hence, we assume that D < N/log N.
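The binary-search strategy behind the O(D log N) bound can be sketched as follows. This is a Python sketch under hypothetical inputs: `group_test` plays the role of one group test (in our context, one simultaneous intervention) whose outcome reveals whether the tested group contains a defective item.

```python
def find_defectives(items, defective_set):
    """Adaptive group testing: repeatedly binary-search for one defective
    item among the candidates, then remove it. Each search costs at most
    ceil(log2 N) group tests, matching the O(D log N) bound in the text."""
    tests = 0

    def group_test(group):
        nonlocal tests
        tests += 1
        return any(x in defective_set for x in group)  # positive test

    found = []
    candidates = list(items)
    while group_test(candidates):           # some defective item remains
        lo = candidates
        while len(lo) > 1:                  # binary search for one defective
            half = lo[:len(lo) // 2]
            lo = half if group_test(half) else lo[len(lo) // 2:]
        found.append(lo[0])
        candidates.remove(lo[0])
    return found, tests

# Hypothetical example: 32 items, 2 defectives.
defects, k = find_defectives(range(32), {5, 17})
print(sorted(defects), k)  # finds both defectives in far fewer than 32 tests
```

The invariant is that each binary search runs only on a group known to contain a defective, so halving always makes progress.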
3. AID OVERVIEW
Adaptive Interventional Debugging (AID) targets applications (e.g., flaky tests [46]) that, even with the same inputs, intermittently fail due to various runtime nondeterminism such as thread scheduling and timing. Given predicate logs of successful and failed executions of an application, the goals of AID are to (1) identify what predicate actually causes the failure, and (2) generate an explanation of how the root cause leads to the failure (via a sequence of intermediate predicates). This is in contrast with traditional statistical debugging, which generates a set of potential root-cause predicates (often a large number), without any explanation of how each potential root cause may lead to the failure.

Figure 1: Adaptive Interventional Debugging workflow.

Figure 1 shows an overview of AID. First, the framework employs standard SD techniques on predicate logs to identify a set of fully-discriminative predicates, i.e., predicates that always appear in the failed executions and never appear in the successful executions. Then, AID uses the temporal relationships of predicates to infer approximate causality: if P1 temporally precedes P2 in all logs where they both appear, then P1 may cause P2. AID represents this approximate causality in a DAG called the Approximate Causal DAG (AC-DAG), where predicates are nodes and edges indicate these possible causal relationships. We describe the AC-DAG in Section 4.

Based on its construction, the AC-DAG is guaranteed to contain all the true root-cause predicates and causal relationships among predicates. However, it may also contain additional predicates and edges that are not truly causal. The key insight of AID is that we can refine the AC-DAG and prune the non-causal nodes and edges through a sequence of interventions. To intervene on a predicate, AID changes the application's behavior through fault injection so that the predicate's value matches its value in successful executions. If the failure does not occur under the intervention, then, based on counterfactual causality, the predicate is guaranteed to be a root cause of the failure. Over several iterations, AID intervenes on a set of carefully chosen predicates, refines the set of discriminative predicates, and prunes the AC-DAG, until it discovers the true root cause and the path that leads to the failure. We describe the intervention mechanism of AID in Section 5.

We now describe how AID adapts existing approaches in SD and fault injection for two of its core ideas: predicates and interventions. We refer to the Appendix for additional details and discussion.
Predicate design:
Similar to traditional SD techniques, AID is effective only if the initial set of predicates (in the predicate logs) contains a root-cause predicate that causes the failure. Predicate design is orthogonal to AID. We use predicates used by existing SD techniques, especially the ones used for finding root causes of concurrency bugs [29], a key reason behind intermittent failures [46]. Figure 2 shows examples of predicates in AID (column 1).
Predicate extraction:
AID automatically instruments a target application to generate its execution trace (see Appendix). The trace contains each executed method's start and end time, its thread id, the ids of objects it accesses, return values, whether it throws an exception or not, and so on. This trace is then analyzed offline to evaluate a set of predicates at each execution point. This results in a sequence of predicates, called a predicate log. The instrumented application is executed multiple times with the same input to generate a set of predicate logs, each labeled as a successful or failed execution. Figure 2 shows the runtime conditions used to extract predicates (column 2).
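The filtering step that AID starts from can be sketched as follows: given predicate logs labeled successful or failed, keep the predicates that are fully discriminative (precision = recall = 1). This is a minimal Python sketch over hypothetical logs, each represented simply as the set of predicates observed to be true in that execution.

```python
# Hypothetical predicate logs: each log is the set of true predicates.
failed_logs = [{"P1", "P3", "F"}, {"P1", "P2", "P3", "F"}]
success_logs = [{"P2"}, {"P2", "P4"}, set()]

def precision_recall(pred):
    """precision = (# failed runs where pred is true) / (# runs where true);
    recall = (# failed runs where pred is true) / (# failed runs)."""
    true_in_failed = sum(pred in log for log in failed_logs)
    true_anywhere = true_in_failed + sum(pred in log for log in success_logs)
    precision = true_in_failed / true_anywhere if true_anywhere else 0.0
    recall = true_in_failed / len(failed_logs)
    return precision, recall

all_preds = set().union(*failed_logs, *success_logs)
# Fully discriminative: always true in failed runs, never in successful runs.
fully_discriminative = {p for p in all_preds
                        if precision_recall(p) == (1.0, 1.0)}
print(sorted(fully_discriminative))  # ['F', 'P1', 'P3']
```

Here P2 appears in a successful run and P4 never appears in a failed run, so both are discarded before any intervention is attempted.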
Modeling nondeterminism:
In practice, some predicates may cause a failure nondeterministically: for example, two predicates A and B only in conjunction cause a failure. AID does not consider such individual predicates, since they are not fully discriminative (recall < 1). Instead, AID supports compound predicates, adapted from state-of-the-art SD techniques [29], which model conjunctions. These compound predicates ("A and B") would deterministically cause the failure and hence be fully discriminative. Note that AID focuses on counterfactual causality and thus does not support disjunctive root causes (as they are not counterfactual). In Section 5, we discuss AID's assumptions and their impact in practice.

Intervention mechanism:
AID uses an existing fault injection tool (similar to LFI [47]) to intervene on fully-discriminative predicates; interventions change a predicate to match its value in a successful execution. In a way, AID's interventions try to locally "repair" a failed execution. Figure 2 shows examples of AID's interventions (column 3). Most of the interventions rely on changing timing and thread scheduling, which can occur naturally in the underlying execution environment and runtime. More specifically, AID can slow down the execution of a method (by injecting delays), force or prevent concurrent execution of methods in different threads (by using synchronization primitives such as locks), change the execution order of concurrent threads (by injecting delays), etc. Such interventions can repair many concurrency bugs.
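A timing intervention of this kind can be sketched as a simple wrapper. This is a hypothetical Python sketch: the `inject_delay` helper and the `fetch` method are illustrative stand-ins; the actual tool injects delays into the compiled application rather than wrapping Python functions.

```python
import functools
import time

def inject_delay(fn, seconds):
    """Fault injection sketch: slow a method down by injecting a delay
    before it runs, one of the scheduling interventions listed above."""
    @functools.wraps(fn)
    def delayed(*args, **kwargs):
        time.sleep(seconds)          # perturb timing / thread interleaving
        return fn(*args, **kwargs)
    return delayed

def fetch():
    # Stand-in for an instrumented application method.
    return "row"

slow_fetch = inject_delay(fetch, 0.01)
start = time.perf_counter()
result = slow_fetch()
elapsed = time.perf_counter() - start
print(result)  # row (behavior preserved, only timing changed)
```

Because the wrapper preserves the method's return value, such a delay matches the "safe" interventions discussed next: it perturbs scheduling without altering program state.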
Validity of intervention:
AID supports two additional intervention types, return-value alteration and exception-handling, which, in theory, can have undesirable runtime side-effects. Consider two predicates: (1) method QueryAvgSalary fails, returning null, and (2) method UpdateSalary fails, returning error. AID can intervene to match their return values in successful executions (e.g., a non-null result and OK, respectively). The intervention on the first predicate does not modify any program state and, as the successful execution shows, the return value can be safely used by the application. However, altering the return value of UpdateSalary, but not updating the salary, may not be a sufficient intervention: other parts of the application that rely on the updated salary may fail. Inferring such side-effects is hard, if not impossible.

AID is restricted to safe interventions. It relies on developers to indicate which methods do not change (internal or external) application states and limits return-value interventions to only those methods (e.g., to QueryAvgSalary, but not to UpdateSalary). The same holds for exception-handling interventions. AID removes from predicate logs any predicates that cannot be safely intervened on without undesirable side-effects. This ensures that the rest of the AID pipeline can safely intervene on any subset of predicates. Excluding some interventions may limit AID's precision, as it may eliminate a root-cause predicate. In such cases, AID may find another intervenable predicate that is causally related to the root cause, and is still useful for debugging. In our experiments (Section 7) we did not observe this issue, since the root-cause predicates were safe to intervene on.
4. APPROXIMATING CAUSALITY
AID relies on traditional SD to derive a set of fully-discriminative predicates. Using the logs of successful and failed executions, AID extracts temporal relationships among these predicates, and uses temporal precedence to approximate causality. It is clear that in the absence of feedback loops, a cause temporally precedes an effect [58]. To handle loops, AID considers multiple executions of the same program statement (e.g., within a loop, recursion, or multiple method calls) as separate instances, identified by their relative order of appearance during program execution, and maps them to separate predicates (see Appendix). This ensures that temporal precedence among predicates correctly over-approximates causality.
Approximate causal DAG.
AID represents the approximation of causality in a DAG: each node represents a predicate, and an edge P1 → P2 indicates that P1 temporally precedes P2 in all logs where both predicates appear. Figure 4(a) shows an example of the approximate causal DAG (AC-DAG). We use circles to explicitly depict junctions in the AC-DAG; junctions are not themselves predicates, but denote splits or merges in the precedence ordering of predicates. Therefore, each predicate has in- and out-degrees of at most 1, while junctions have in- or out-degrees greater than 1. Note that, for clarity of visuals, in our depictions of the AC-DAG, we omit edges implied by transitive closure (e.g., an edge P1 → P3 implied by P1 → P2 and P2 → P3 is not depicted). AID enforces the assumption of counterfactual causality by excluding from the AC-DAG any predicates that were not observed in all failed executions: if some executions failed without manifesting P, then P cannot be a cause of the failure.

Completeness of AC-DAG.
The AC-DAG is complete with respect to the available, and safely-intervenable, predicates: it contains all fully-discriminative predicates that are safe to intervene on, and if P1 causes P2, it includes the edge P1 → P2. However, it may not be complete with respect to all possible true root causes, as a root cause may not always be represented by the available predicates (e.g., if the true root cause is a data race and no predicate is used to capture it). In such cases, AID will identify the (intervenable) predicate that is closest to the root cause and is causally related to the failure.

Since temporal precedence among predicates is a necessary condition for causality, the AC-DAG is guaranteed to contain the true causal relationships. However, temporal precedence is not sufficient for causality, and thus some edges in the AC-DAG may not be truly causal.

Temporal precedence.
Capturing temporal precedence is not always straightforward. For simplicity of implementation, AID relies on computer clocks, which works reasonably well in practice. Relying on computer clocks is not always precise, as the time gap between two events may be too small for the granularity of the clock; moreover, events may occur on different cores or machines whose clocks are not perfectly synchronized. These issues can be addressed with the use of logical clocks such as Lamport's clock [38].

Another challenge is that some predicates are associated with time windows, rather than time points. The correct policy to resolve the temporal precedence of two temporally overlapping predicates often depends on their semantics. However, the predicate types give important clues regarding the correct policy. In AID, predicate design involves specifying a set of rules that dictates the temporal precedence of two predicates. In constructing the AC-DAG, AID uses those rules. For example, consider a scenario where foo() calls bar() and waits for bar() to end, so foo() starts before but ends after bar().

• (Case 1): Consider two predicates P1: "foo() is running slow" and P2: "bar() is running slow". Here, P2 can cause P1 but not the other way around. In this case, AID uses the policy that end-time implies temporal precedence.

• (Case 2): Now consider P1: "foo() starts later than expected" and P2: "bar() starts later than expected". Here, P1 can cause P2 but not the other way around. Therefore, in this case, start-time implies temporal precedence.
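The edge-construction rule of the AC-DAG can be sketched as follows. This is a Python sketch over hypothetical logs that record, for each predicate, a single observation time; real predicates may span time windows, which the precedence policies above resolve.

```python
from itertools import permutations

# Hypothetical failed-run logs: predicate -> time it was observed.
logs = [
    {"P1": 1.0, "P2": 2.0, "F": 3.0},
    {"P1": 0.5, "P2": 0.9, "F": 1.1},
]

def ac_dag_edges(logs):
    """Add edge P -> Q iff P temporally precedes Q in every log
    in which both predicates appear (the AC-DAG rule in the text)."""
    preds = set().union(*logs)
    edges = set()
    for p, q in permutations(preds, 2):
        both = [log for log in logs if p in log and q in log]
        if both and all(log[p] < log[q] for log in both):
            edges.add((p, q))
    return edges

print(sorted(ac_dag_edges(logs)))
# [('P1', 'F'), ('P1', 'P2'), ('P2', 'F')] -- includes the transitive
# edge ('P1', 'F'), which depictions of the AC-DAG omit for clarity
```

Because precedence is only necessary, not sufficient, for causality, edges such as ('P1', 'P2') here may still be pruned later by interventions.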
(1) Predicate | (2) Extraction condition | (3) Intervention mechanism
There is a data race involving methods M1 and M2 | M1 and M2 temporally overlap accessing some object X while one of them is a write | Put locks around the code segments within M1 and M2 that access X
Method M fails | M throws an exception | Put M in a try-catch block
Method M runs too fast | M's duration is less than the minimum duration for M among all successful executions | Insert delay before M's return statement
Method M runs too slow | M's duration is greater than the maximum duration for M among all successful executions | Prematurely return from M the correct value that M returns in all successful executions
Method M returns incorrect value | M's return value ≠ x, where x is the correct value returned by M in all successful executions | Alter M's return statement to force it to return the correct value x

Figure 2: A few example predicates, the conditions used to extract them, and the corresponding interventions using fault injection.
Notation | Description
G | Approximate causal DAG (AC-DAG)
P (path) | Causal path
F | Failure-indicating predicate
P | A predicate
P (set) | Set of predicates
P(r) | Predicate P is observed in execution r
¬P(r) | Predicate P is not observed in execution r
P1 ⇝ P2 | There is a path from P1 to P2 in G

Figure 3: Summary of notations used in Section 5.

AID works with any policy for deciding precedence, as long as it does not create cycles in the AC-DAG. Since temporal precedence is a necessary condition for causality, any conservative heuristic for deriving temporal precedence would work. A conservative heuristic may introduce more false positives (edges that are not truly causal), but those will be pruned by interventions (Section 5).
5. CAUSAL INTERVENTION
In this section, we describe AID's core component, which refines the AC-DAG through a series of causal interventions. An intervention on a predicate forces the predicate to a particular state; the execution of the application under the intervention asserts or contradicts the causal connection of the predicate with the failure, and AID prunes the AC-DAG accordingly. Interventions can be costly, as they require the application to be re-executed. AID minimizes this cost by (1) smartly selecting the proper predicates to intervene on, (2) grouping interventions that can be applied in a single application execution, and (3) aggressively pruning predicates even without direct intervention, based on the outcomes of other interventions. Figure 3 summarizes the notations used in this section.

We start by formalizing the problem of causal path discovery and stating our assumptions (Section 5.1). Then we provide an illustrative example to show how AID works (Section 5.2). We proceed to describe the interventional pruning that AID applies to aggressively prune predicates during group intervention rounds (Section 5.3). Then we present AID's causality-guided group intervention algorithm (Section 5.4), which administers group interventions to derive the causal path.
Given an application that intermittently fails, our goal is to provide an informative explanation for the failure. To that end, given a set of fully-discriminative predicates P, we want to find an ordered subset of P that defines the causal path from the root-cause predicate to the predicate indicating the failure. Informally, AID finds a chain of predicates that starts from the root-cause predicate, ends at the failure predicate, and contains the maximal number of explanation predicates such that each is caused by the previous one in the chain. We address the problem in a similar setting as SD, and make the following two assumptions:

Assumption 1 (Single Root-cause Predicate).
The root cause of a failure is the predicate whose absence (i.e., a value of false) certainly avoids the failure, and there is no other predicate that causes the root cause. We assume that in all the failed executions, there is exactly one root-cause predicate.

This assumption is prevalent in the SD literature [41, 43, 29], and is supported by several studies on real-world concurrency bug characteristics [45, 67, 64], which show that a vast majority of root causes can be captured with reasonably simple single predicates (see Appendix). In practice, even with specific inputs, a program may fail in multiple ways. However, failures by the same root cause generate a unique failure signature and hence can be grouped together using metadata (e.g., stack trace of the failure, location of the failure in the program binary, etc.) collected by failure trackers [20]. AID can then treat each group separately, targeting a single root cause for a specific failure. Moreover, the single-root-cause assumption is reasonable in many simpler settings, such as unit tests that exercise small parts of an application.

Note that this assumption does not imply that the root cause consists of a single event; a predicate can be arbitrarily complex to capture multiple events. For example, the predicate "there is a data race on X" is true when two threads access the same shared memory X at the same time, the accesses are not lock-protected, and one of the accesses is a write operation. Whether a single predicate is sufficient to capture the root cause depends on predicate design, which is orthogonal to AID. AID adapts the state-of-the-art predicate design, tailored to capture root causes of concurrency bugs [29], which is sophisticated enough to capture all common root causes using single predicates. If no single predicate captures the true root cause, AID still finds the predicate closest to the true root cause in the true causal path.
Assumption 2 (Deterministic Effect).
A root-cause predicate, if triggered, causes a fixed sequence of intermediate predicates (i.e., effects) before eventually causing the failure. We call this sequence the causal path, and we assume that there is a unique one for each root-cause–failure pair.

Prior work has considered, and shown evidence of, a unique causal path between a root cause and the failure in sequential applications [62, 30]. The unique causal path assumption is likely to hold in concurrent applications as well, for two key reasons. First, the predicates in AID's causal path may remain unchanged, despite nondeterminism in the underlying instruction sequence. For example, the predicate "there is a data race between methods X and Y" is not affected by which method starts first, as long as they temporally overlap. Second, AID only considers fully-discriminative predicates. If such predicates exist to capture the root cause and its effects, by the definition of being fully discriminative, there will be a unique causal path (of predicates) from the root cause to the failure. In all six of our real-world case studies (Section 7), such predicates existed and there were unique causal paths from the root causes to the failures.

Note that it is possible to observe some degree of disjointness within the true causal paths. For example, consider a case where the root cause C triggers the failure F in two ways: in some failed executions, the causal path is C → A1 → B → F and, for others, C → A2 → B → F. This implies that neither A1 nor A2 is fully discriminative. Since AID only considers fully-discriminative predicates, both of them are excluded from the AC-DAG. In this case, AID reports C → B → F as the causal path; this is the shared part of the two causal paths, which includes all counterfactual predicates and omits any disjunctive predicates.
One could potentially relax this assumption by encoding the interaction of such predicates through a fully-discriminative predicate (e.g., A = A1 ∨ A2 encodes the disjunction and is fully discriminative).

Based on these assumptions, we define the causal path discovery problem formally as follows.

Definition 1 (Causal Path Discovery). Given an approximate causal DAG G = (V, E) and a predicate F ∈ V indicating a specific failure, the causal path discovery problem seeks a path P = ⟨C1, C2, . . . , Cn⟩ such that the following conditions hold:
• C1 is the root cause of the failure and Cn = F.
• ∀ 1 ≤ i ≤ n, Ci ∈ V, and ∀ 1 ≤ i < n, (Ci, Ci+1) ∈ E.
• ∀ 1 ≤ i < j ≤ n, Ci is a counterfactual cause of Cj.
• |P| is maximized.

AID performs causal path discovery through an intervention algorithm (Section 5.4). Here, we illustrate the main steps and intuitions through an example.

Figure 4(a) shows an AC-DAG derived by AID (Section 4). The AC-DAG contains all edges implied by transitive closure, but we do not depict them to have clearer visuals. The true causal path for the failure F is P1 → P2 → P11 → F, depicted with a dashed red outline. The AC-DAG is a superset of the actual causal graph, which is shown in Figure 4(b).

AID follows an intervention-centric approach for discovering the causal path. Intervening on a predicate forces it to behave the way it does in the successful executions, which is, by definition, the opposite of the failed executions. (Recall that, without loss of generality, we assume that all predicates are boolean.) Following the adaptive group testing paradigm, AID performs group intervention, which is simultaneous intervention on a set of predicates, to reduce the total number of interventions. Figure 4(c) shows the steps of the intervention algorithm, numbered ①–⑧.

AID first aims to reduce the AC-DAG by pruning entire chains that are not associated with the failure, through a process called branch pruning (Section 5.4).
Starting from the root of the AC-DAG, AID discovers the first junction, after predicate P3. For each child of a junction, AID creates a compound predicate, called an independent branch, or simply branch, that is a disjunction over the child and all its descendants that are not descendants of the other children. So, for the junction after P3, we get branches B1 = P4 ∨ P5 ∨ P6 and B2 = P7 ∨ P8 ∨ P9 ∨ P10. AID intervenes on one of the branches chosen at random (in this case B1) at step ①; this requires an intervention on all of its disjunctive predicates (P4, P5, and P6) in order to make the branch predicate False. Despite the intervention, the program continues to fail, and AID prunes the entire branch B1, resolving the junction after P3. For a junction of B branches, AID would need log B interventions to resolve it using a divide-and-conquer approach. At step ②, AID similarly prunes a branch at the junction after P7. At this point, AID is done with branch pruning, since it is left with just a chain of predicates (step ③).

What is left for AID is to prune any non-causal predicates from the remaining chain. AID achieves that through a divide-and-conquer strategy that intervenes on groups of predicates at a time (Algorithm 1). It intervenes on the top half of the chain, {P1, P2, P3}, which stops the failure and confirms that the root cause is in this group (step ③). With two more steps that narrow down the interventions (steps ④ and ⑤), AID discovers that P1 is the root cause. Note that we cannot simply assume that the root of the AC-DAG is a cause, because the edges are not all necessarily causal.

After the discovery of the root cause, AID needs to derive the causal path. Continuing the divide-and-conquer steps, it intervenes on P2 (step ⑥). This stops the failure, confirming that P2 is in the causal path.
In addition, since P3 is not causally dependent on P2, the intervention on P2 does not stop P3 from occurring. This observation allows AID to prune P3 without intervening on it directly. At step ⑦, AID intervenes on P7. The effect of this intervention is that the failure is still observed, but P10 no longer occurs, indicating that P10 is causally connected to P7, but not to the failure; this allows AID to prune both P7 and P10. Finally, at step ⑧, AID intervenes on P11 and confirms that it is causal, completing the causal path derivation. AID discovered the causal path in 8 interventions, while naively we would have needed 11, one for each predicate.

In the initial construction of the AC-DAG, AID excludes predicates based on a simple rule: a predicate P is excluded if there exists a program execution r where P occurs and the failure does not (P(r) ∧ ¬F(r)), or P does not occur and the failure does (¬P(r) ∧ F(r)). Intervening executions follow the same basic intuition for pruning the intervened predicate C: by definition, C does not occur in an execution r_C that intervenes on predicate C (¬C(r_C)); thus, if the failure still occurs on r_C (F(r_C)), then C is pruned from the AC-DAG.

As we saw in the illustrative example, intervention on a predicate C may also lead to the pruning of additional predicates. However, the same basic pruning logic needs to be applied more carefully in this case. In particular, we can never prune predicates that precede C in the AC-DAG, as their potential causal effect on the failure may be muted by the intervention on C. Thus, we can only apply the pruning rule to a predicate X that is not an ancestor of C in the AC-DAG (X ̸⇝ C). We formalize the predicate pruning strategy over G(V, E) in the following definition.

Definition 2 (Interventional Pruning). Let R_C be a set of program executions intervening on a group of predicates C ⊆ V. Every C ∈ C is pruned from G iff ∃ r ∈ R_C such that F(r).
Any other predicate P ∉ C is pruned from G iff ∄ C ∈ C such that P ⇝ C, and ∃ r ∈ R_C such that (P(r) ∧ ¬F(r)) ∨ (¬P(r) ∧ F(r)).

Because of nondeterminism in concurrent applications, we execute a program multiple times with the same intervention. However, it is sufficient to identify a single counterexample execution to invoke the pruning rule.
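Definition 2 can be read operationally. The sketch below applies both pruning rules to a batch of intervening executions; the data layout (per-run predicate valuations plus a failure flag, and a `reaches` map encoding the ⇝ relation) is a hypothetical encoding of our own, not AID's:

```python
def interventional_prune(runs, intervened, reaches):
    """runs: list of (valuation, failed) pairs observed in executions that
    intervene on the predicate group `intervened` (all forced to False).
    reaches: dict mapping each predicate to the set of predicates it
    precedes (the ⇝ relation of the AC-DAG).
    Returns the set of predicates pruned under Definition 2."""
    pruned = set()
    # Rule 1: every intervened predicate is spurious if some run still failed.
    if any(failed for _, failed in runs):
        pruned |= set(intervened)
    # Rule 2: prune a non-intervened predicate P only if P does not precede
    # any intervened predicate and some run shows a counterfactual violation.
    for p in reaches:
        if p in intervened:
            continue
        if any(c in reaches[p] for c in intervened):
            continue                      # P ⇝ some intervened C: cannot prune
        if any(val[p] != failed for val, failed in runs):
            pruned.add(p)                 # (P ∧ ¬F) ∨ (¬P ∧ F) observed
    return pruned
```

The comparison `val[p] != failed` is exactly the counterfactual-violation disjunction of the definition, since both sides are booleans.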
Figure 4: (a) AC-DAG as constructed by AID. The DAG includes all edges implied by transitive closure, but we omit them for clarity of the visuals. We indicate the predicates in the causal path with the dashed red outline. (b) The actual causal DAG is a subgraph of the AC-DAG. (c) Step-by-step illustration of discovering the causal path (shown at bottom right). Steps ① and ② perform branch pruning, steps ③–⑧ perform group intervention with pruning on the predicate chain, and steps ⑥ and ⑦ apply interventional pruning.
AID's core intervention method is described in Algorithm 1: Group Intervention With Pruning (GIWP). GIWP applies adaptive group testing to derive causal and spurious (non-causal) nodes in the AC-DAG. The algorithm applies a divide-and-conquer approach that groups predicates based on their topological order (a linear ordering of the nodes such that for every directed edge P1 → P2, P1 comes before P2 in the ordering). In every iteration, GIWP selects the predicates in the lowest half of the topological order, resolving ties randomly, and intervenes by setting all of them to False (lines 4–5). The intervention returns a set of predicate logs.

If the failure is not observed in any of the intervening executions (line 6), based on counterfactual causality, GIWP concludes that the intervened group contains at least one predicate that causes the failure. If the group contains a single predicate, it is marked as causal (line 8). Otherwise, GIWP recurses to trace the causal predicates within the group (line 10).

During each intervention round, GIWP applies Definition 2 to prune predicates that are determined to be non-causal (lines 13–17). First, if the algorithm discovers an intervening execution that still exhibits the failure, then it labels all intervened predicates as spurious and marks them for removal (line 14). Second, GIWP examines each other predicate that does not precede any intervened predicate, and observes whether any of the intervened executions demonstrate a counterfactual violation between the predicate and the failure. If a violation is found, that predicate is pruned (line 17).

At the completion of each intervention round, GIWP refines the predicate pool by eliminating all confirmed causes and spurious predicates (line 18) and enters the next intervention round.
It continues the interventions until all predicates are either marked as causal or spurious and the remaining predicate pool is empty. Finally, GIWP returns two disjoint predicate sets: the causal predicates and the spurious predicates (line 19).
Branch Pruning
GIWP is sufficient for most practical applications and can work directly on the AC-DAG. However, when the AC-DAG satisfies certain conditions (analyzed in Section 6.3.1), we can reduce the number of required interventions through a process called branch pruning. The intuition is that since there is a single causal path that explains the failure, junctions (where multiple paths exist) can be used to quickly identify independent branches to be pruned or confirmed as causal as a group. The branches can be used to more effectively identify groups for intervention, reducing the overall number of required interventions.

Branch pruning iteratively prunes branches at junctions (steps ① and ② in the illustrative example) to reduce the AC-DAG to a chain of predicates. The process is detailed in Algorithm 2. The algorithm traverses the DAG based on its topological order, and does not intervene while it encounters a single node at a time, which means it is still in a chain (line 5). When it encounters multiple nodes at the same topological level, it means it has encountered a junction (line 7). At a junction, the true causal path can only continue in one direction, and AID can perform group intervention to discover it. The algorithm invokes GIWP to perform this intervention over a set of special predicates constructed from the branches at the encountered junction (lines 10–12). A branch at predicate P is defined as a disjunctive predicate over P and all descendants of P that are not descendants of any other predicate at the same topological level as P. An example branch from our illustrative example is B1 = P4 ∨ P5 ∨ P6. To intervene on a branch, one has to intervene on all of its disjunctive predicates. The algorithm defines B as the union of all branches, which corresponds to a completely disconnected graph (no edges between the nodes); thus all branch predicates are at the same topological level. GIWP is then invoked (line 13) to identify the causal branch. The algorithm removes any predicate that is not causally connected to the failure (line 17) or is no longer reachable from the correct causal chain (line 18), and updates the AC-DAG accordingly. At the completion of branch pruning, AID has reduced the AC-DAG to a simple chain of predicates.

Finally, Algorithm 3 presents the overall method that AID uses to perform causal path discovery, which optionally invokes branch pruning before the divide-and-conquer group intervention through GIWP.

Algorithm 1: GIWP(P, G, F)
Input: A set of candidate predicates, P; AC-DAG, G; failure-indicating predicate, F
Output: The set of counterfactual causes of F, C; the set of spurious predicates, X
 1: C = ∅                                   /* causal predicate set */
 2: X = ∅                                   /* spurious predicate set */
 3: while P ≠ ∅ do
 4:     P′ = first half of P in topological order
 5:     R_P′ = Intervene(P′)
 6:     if ∄ r ∈ R_P′ s.t. F(r) then        /* failure stopped */
 7:         if P′ contains a single predicate then
 8:             C = C ∪ P′                  /* a cause is confirmed */
 9:         else                            /* need to confirm causes */
10:             C′, X′ = GIWP(P′, G, F)
11:             C = C ∪ C′                  /* confirmed causes */
12:             X = X ∪ X′                  /* spurious predicates */
        /* interventional pruning */
13:     if ∃ r ∈ R_P′ s.t. F(r) then        /* failure didn't stop */
14:         X = X ∪ P′                      /* pruning */
15:     foreach P ∈ P − P′ s.t. ∀ P″ ∈ P′: P ̸⇝ P″ do
16:         if ∃ r ∈ R_P′ s.t. (P(r) ∧ ¬F(r)) ∨ (¬P(r) ∧ F(r)) then
17:             X = X ∪ {P}                 /* pruning */
18:     P = P − (C ∪ X)                     /* remove confirmed and spurious predicates from the candidate pool */
19: return C, X
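For intuition, here is a compact executable sketch of GIWP over a simulated program. The `intervene` callback, which forces a group of predicates to False and reports each run's predicate valuation and failure outcome, stands in for real fault injection; all names and the toy program are our own, not AID's implementation:

```python
def giwp(preds, topo, reaches, intervene):
    """Sketch of Algorithm 1. preds: candidate predicates; topo: predicate ->
    topological rank; reaches: predicate -> set of predicates it precedes (⇝);
    intervene(group): run the program with `group` forced to False, returning
    a list of (valuation, failed) observations.
    Returns (causal, spurious) predicate sets."""
    causal, spurious = set(), set()
    pool = set(preds)
    while pool:
        # lowest half of the topological order (lines 4-5)
        half = sorted(pool, key=lambda p: topo[p])[:max(1, len(pool) // 2)]
        runs = intervene(set(half))
        if not any(failed for _, failed in runs):       # failure stopped
            if len(half) == 1:
                causal.add(half[0])                     # cause confirmed (line 8)
            else:
                c, x = giwp(half, topo, reaches, intervene)  # recurse (line 10)
                causal |= c
                spurious |= x
        else:                                           # failure didn't stop
            spurious |= set(half)                       # line 14
        # interventional pruning of non-ancestors (lines 15-17)
        for p in pool - set(half):
            if any(q in reaches[p] for q in half):
                continue            # p may precede an intervened predicate
            if any(val[p] != failed for val, failed in runs):
                spurious.add(p)
        pool -= causal | spurious                       # line 18
    return causal, spurious

# A tiny simulated program: P1 causes P2, and P2 causes the failure; P3 is
# unrelated noise. One deterministic run per intervention.
def simulate(group):
    p1 = "P1" not in group                 # P1 occurs unless intervened on
    p2 = p1 and "P2" not in group          # P2 occurs iff P1 occurred and P2 is free
    p3 = "P3" not in group                 # P3 occurs regardless of the failure
    return [({"P1": p1, "P2": p2, "P3": p3}, p2)]  # failure iff P2 occurs
```

On this simulation, `giwp({"P1","P2","P3"}, {"P1": 0, "P2": 1, "P3": 2}, {"P1": {"P2"}, "P2": set(), "P3": set()}, simulate)` classifies P1 and P2 as causal and prunes P3 without ever intervening on it directly.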
6. THEORETICAL ANALYSIS
In this section, we theoretically analyze the performance of AID in terms of the number of interventions required to identify all causal predicates, which are the predicates causally related to the failure. Similar to the analysis of group testing algorithms, we study the information-theoretic lower bound, which specifies the minimum number of bits of information that an algorithm must extract to identify all causal predicates for any instance of a problem. We also study the lower and the upper bounds that quantify the minimum and the maximum number of group interventions required to identify all causal predicates, respectively, for AID versus a Traditional Adaptive Group Testing (TAGT) algorithm. (Causal predicates correspond to faulty predicates in group testing. This distinction in terminology exists because group testing does not meaningfully reason about causality.)

Any group testing algorithm takes N items (predicates), D of which are faulty (causal), and aims to identify all faulty items
Algorithm 2: Branch-Prune(G, F)
Input: AC-DAG, G = (V, E); failure-indicating predicate, F
Output: Reduces G to an approximate causal chain
 1: C = ∅                                   /* potential causal predicate set */
 2: X = ∅                                   /* spurious predicate set */
 3: while V − C ≠ ∅ do
 4:     P = predicates at the lowest topological level in V − C
 5:     if P contains a single predicate then
 6:         C = C ∪ P                       /* add to potential causal set */
 7:     else                                /* this is a junction */
 8:         B = ∅
 9:         foreach P ∈ P do
10:             B_P = ⋁ {Q : P ⇝ Q ∧ ∀ P′ ∈ P − {P}: P′ ̸⇝ Q}
11:             B_P = P ∨ B_P
12:             B = B ∪ {B_P}               /* set of branches */
13:         C′, X′ = GIWP(B, G, F)
14:         C = C ∪ C′                      /* add to potential causal set */
15:         X = X ∪ X′                      /* add to spurious set */
        /* refining G */
16:     U = {U : C ≠ ∅ ∧ ∀ C ∈ C: C ̸⇝ U}   /* unreachable */
17:     V = V − X                           /* remove spurious predicates */
18:     V = V − U                           /* remove unreachable predicates */
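The branch construction of lines 10–12 amounts to a reachability computation. Here is a sketch under an assumed adjacency-list encoding of the AC-DAG; the toy graph in the usage below is our own, not the exact structure of Figure 4:

```python
def descendants(adj, src):
    """All nodes reachable from src in a DAG given as adjacency lists."""
    seen, stack = set(), [src]
    while stack:
        for nxt in adj.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def branches(adj, junction_children):
    """For each child P at a junction, build the branch P ∨ (descendants of P
    that no sibling child reaches), represented as a frozenset of disjuncts."""
    desc = {p: descendants(adj, p) for p in junction_children}
    out = {}
    for p in junction_children:
        others = set().union(*(desc[q] for q in junction_children if q != p))
        out[p] = frozenset({p} | (desc[p] - others))
    return out
```

For a junction whose children are P4 and P7 in a small chain-shaped DAG, shared descendants (such as the failure node both branches reach) are automatically excluded from every branch, exactly as line 10 requires.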
Algorithm 3: Causal-Path-Discovery(G, F, Flag_B)
Input: AC-DAG, G = (V, E); failure-indicating predicate, F; Flag_B, whether to apply branch pruning or not
Output: A causal path
 1: if Flag_B then
 2:     Branch-Prune(G, F)
 3: C, X = GIWP(V − {F}, G, F)
 4: return C

using as few group interventions as possible. Since there are $\binom{N}{D}$ possible outcomes, the information-theoretic lower bound for this problem is $\log \binom{N}{D}$. The upper bound on the number of interventions using TAGT is $O(D \log N)$, since $\log N$ group interventions are sufficient to reveal each causal predicate. Here, we assume $D < N / \log N$; otherwise, a linear approach that intervenes on one predicate at a time is preferable.

We now show that the Causal Path Discovery (CPD) problem (Definition 1) can reduce the lower bound on the number of required interventions compared to Group Testing (GT). We also show that the upper bound on the number of interventions is lower for AID than for TAGT, because of the two assumptions of CPD (Section 5.1). In TAGT, predicates are assumed to be independent of each other, and hence, after each intervention, decisions (about whether predicates are causal) can be made only about the intervened predicates. In contrast, AID uses the precedence relationships among predicates in the AC-DAG to (1) aggressively prune, by making decisions not only about the intervened predicates but also about other predicates, and to (2) select predicates based on the topological order, which enables effective pruning during each intervention.

Example 2.
Consider the AC-DAG of Figure 5(a), consisting of N = 6 predicates and the failure predicate F. If AID intervenes on all predicates in one branch (e.g., {A1, B1, C1}) and finds a causal connection to the failure, it can avoid intervening on predicates in the other branch, according to the deterministic effect assumption. AID can also use the structure of the AC-DAG to intervene on A1 (or A2) before other predicates, since the intervention can prune a large set of predicates. Since GT algorithms do not assume relationships among predicates, they can only intervene on predicates in random order and can make decisions about only the intervened predicates.

Figure 5: (a) An AC-DAG with failure predicate F. (b) Horizontal and vertical expansion. (c) A symmetric AC-DAG with J junctions, where each junction has B branches and each branch has n predicates.

The temporal precedence and potential causality encoded in the AC-DAG restrict the possible causal paths and significantly reduce the search space of CPD compared to GT.
Example 3.
In the example of Figure 5(a), GT considers all subsets of the 6 predicates as possible solutions, and thus its search space includes $2^6 = 64$ candidates. CPD leverages the AC-DAG and the deterministic effect assumption (Section 5.1) to identify invalid candidates and reduce the search space considerably. For example, the candidate solution {A1, B2, C1} is not possible under CPD, because it involves predicates in separate paths of the AC-DAG. In fact, based on the AC-DAG, CPD does not need to explore any solutions with more than 3 predicates. The complete search space of CPD includes all subsets of predicates along each branch of length 3, thus a total of $2 \cdot (2^3 - 1) + 1 = 15$ possible solutions.
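These counts, and the group-testing bounds above, are straightforward to evaluate with the standard library; the function names here are ours:

```python
import math

def gt_search_space(n):
    """GT considers every subset of the n predicates."""
    return 2 ** n

def gt_lower_bound(n, d):
    """Information-theoretic lower bound: log2 of the C(n, d) outcomes."""
    return math.log2(math.comb(n, d))

def tagt_upper_bound(n, d):
    """TAGT reveals each of the d causal predicates with about log2(n)
    group interventions."""
    return d * math.ceil(math.log2(n))

def cpd_search_space_two_branches(length):
    """Example 3's shape: two parallel branches of the given length; a CPD
    solution is a subset of a single branch, and the empty solution is
    shared between the branches, so it is counted once."""
    return 2 * (2 ** length - 1) + 1
```

For instance, `gt_search_space(6)` gives the 64 GT candidates of Example 3, while `cpd_search_space_two_branches(3)` gives the 15 CPD candidates.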
We proceed to characterize the search space of CPD compared to GT more generally. We use |G| to denote the number of predicates in an AC-DAG represented by G, and $W^{GT}_G$ and $W^{CPD}_G$ to denote the size of the search space for GT and CPD, respectively. We start from the simplest case of a DAG, a chain, and then, using the notions of horizontal and vertical expansion, we can derive the search space for any DAG.

If G is a simple chain of predicates, then GT and CPD have the same search space: $2^{|G|}$. CPD reduces the search space drastically when junctions split the predicates into separate branches, as in Example 3. We call this case a horizontal expansion: a DAG $G_H$ is a horizontal expansion of two subgraphs $G_1$ and $G_2$ if it connects them in parallel through two junctions, at the roots (lowest topological level) and the leaves (highest topological level). In contrast, $G_V$ is a vertical expansion if it connects them sequentially via a junction. Horizontal and vertical expansion are depicted in Figure 5(b). In horizontal expansion, the search space of CPD is additive over the combined DAGs, while in vertical expansion it is multiplicative.

Lemma 1 (DAG expansion). Let $W^{CPD}_{G_1}$ and $W^{CPD}_{G_2}$ be the numbers of valid solutions for CPD over DAGs $G_1$ and $G_2$, respectively. Let $G_H$ and $G_V$ represent their horizontal and vertical expansion, respectively. Then:

$W^{CPD}_{G_H} = 1 + (W^{CPD}_{G_1} - 1) + (W^{CPD}_{G_2} - 1)$
$W^{CPD}_{G_V} = W^{CPD}_{G_1} \cdot W^{CPD}_{G_2}$

In contrast, in both cases, the search space of GT is $2^{|G_1| + |G_2|}$.

Intuitively, in horizontal expansion, the valid solutions for $G_H$ are those of $G_1$ and those of $G_2$, but no combinations between the two are possible. Note that both $W^{CPD}_{G_1}$ and $W^{CPD}_{G_2}$ have the empty set as a common solution, so in the computation of $W^{CPD}_{G_H}$, one solution is subtracted from each search space ($W^{CPD}_{G_i} - 1$) and then added back to the overall result.

Symmetric AC-DAG.
Lemma 1 allows us to derive the size of the search space for CPD over any AC-DAG. To further highlight the difference between GT and CPD, we analyze their search spaces over a special type of AC-DAG, a symmetric AC-DAG, depicted in Figure 5(c). A symmetric AC-DAG has J junctions, and B branches at each junction, where each branch is a simple chain of n predicates. Therefore, the total number of predicates in the symmetric AC-DAG is N = JBn, and the search space of GT is $W^{GT} = 2^{JBn}$. For CPD, based on horizontal expansion, the subgraph in-between two subsequent junctions has a total of $1 + \sum_{i=1}^{B} (2^n - 1) = 1 + B(2^n - 1)$ candidate solutions. Then, based on vertical expansion, the overall search space of CPD is:

$W^{CPD} = (B(2^n - 1) + 1)^J$

We now show that, due to the predicate pruning mechanisms and the strategy of picking predicates according to topological order, the lower bound on the required number of interventions in CPD is significantly reduced. For the sake of simplicity, we drop the deterministic effect assumption in this analysis. In GT, each group test yields at most 1 bit of information, and all $\log \binom{N}{D}$ bits must be extracted before the remaining uncertainty drops to 0; therefore, the number of required interventions in GT is bounded below by $\log \binom{N}{D}$. In contrast, for CPD, we have the following theorem. (Proofs are in the Appendix.)

Theorem 2.
The number of required group interventions in CPD is bounded below by $\frac{N}{N + DS} \log \binom{N}{D}$, where at least S predicates are discarded (either pruned using the pruning rule or marked as causal) during each group intervention.

Since $\frac{DS}{N} > 0$, we obtain a lower bound for the number of required interventions in CPD that is smaller than the one for GT. In general, as S increases, the lower bound in CPD decreases. Note that we are not claiming that AID achieves this lower bound for CPD; rather, this establishes the possibility that better algorithms can be designed in the setting of CPD than in GT.

Symmetric AC-DAG.
Figure 6 shows the lower bound on the number of required interventions in CPD and GT for the symmetric AC-DAG of Figure 5(c), assuming that each intervention discards at least S predicates in CPD.

(A lower bound is a theoretical bound stating that it might be possible to design an algorithm that solves the problem in a number of steps equal to the lower bound; it does not imply that such an algorithm exists.)

          Search space              | Lower bound                                        | Upper bound
CPD:  $(B(2^n - 1) + 1)^J$          | $\frac{JBn}{JBn + DS} \log \binom{JBn}{D}$         | $J \log B + D \log(Jn) - \frac{D(D - S)}{Jn}$
GT:   $2^{JBn}$                     | $\log \binom{JBn}{D}$                              | $D \log B + D \log(Jn) - \frac{D(D - 1)}{JBn}$

Figure 6: Theoretical comparison between CPD and GT for the symmetric AC-DAG of Figure 5(c).
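Lemma 1 and the symmetric closed form can be cross-checked numerically; the sketch below rebuilds $(B(2^n - 1) + 1)^J$ by repeated expansion (function names are ours):

```python
def w_chain(n):
    """CPD (and GT) search space of a simple chain of n predicates."""
    return 2 ** n

def w_horizontal(w1, w2):
    """Lemma 1, horizontal expansion: solutions come from one side or the
    other; the shared empty solution is counted only once."""
    return 1 + (w1 - 1) + (w2 - 1)

def w_vertical(w1, w2):
    """Lemma 1, vertical expansion: solutions on the two sides combine freely."""
    return w1 * w2

def w_cpd_symmetric(j, b, n):
    """Closed form from the text: (B(2^n - 1) + 1)^J."""
    return (b * (2 ** n - 1) + 1) ** j

def w_cpd_symmetric_by_expansion(j, b, n):
    """The same quantity built up by repeated expansion: horizontally combine
    the B chains of one junction block, then vertically stack J blocks."""
    block = w_chain(n)
    for _ in range(b - 1):
        block = w_horizontal(block, w_chain(n))
    total = 1
    for _ in range(j):
        total = w_vertical(total, block)
    return total
```

For J = 2, B = 3, n = 2, both routes give 100 candidate solutions for CPD, far below GT's $2^{JBn} = 4096$; the two-branch case of Example 3 falls out as `w_horizontal(w_chain(3), w_chain(3)) == 15`.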
We now analyze the upper bound on the number of interventions for AID under (1) branch pruning, which exploits the deterministic effect assumption, and (2) predicate pruning.
Whenever AID encounters a junction, it has the option to apply branch pruning. In CPD, at most one branch can be causal at each junction; hence, we can find the causal branch using $\log B$ interventions at each junction, where B is the number of branches at that junction. Also, B is upper-bounded by the number of threads T in the program. This holds since we assume that the program inputs are fixed and there is no different conditional branching due to input variation in different failed executions within the same thread. If there are J junctions and at most T branches at each junction, the number of interventions required to reduce the AC-DAG to a chain is at most $J \log T$. Now let us assume that the maximum number of predicates in any path of the AC-DAG is $N_M$. Therefore, the chain found after branch pruning can contain at most $N_M$ predicates. If D of them are causal predicates, we need at most $D \log N_M$ interventions to find them. Therefore, the total number of required interventions for AID is at most $J \log T + D \log N_M$. In contrast, the number of required interventions for TAGT, which does not prune branches, is at most $D \log(T N_M) = D \log T + D \log N_M$. Therefore, whenever J < D, the upper bound on the number of interventions for AID is smaller than the upper bound for TAGT.
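The two upper bounds are easy to compare numerically (plain formula evaluation; function names are ours):

```python
import math

def aid_upper_with_branch_pruning(j, t, d, n_m):
    """J*log2(T) interventions to reduce the AC-DAG to a chain, plus
    D*log2(N_M) to locate the D causal predicates on that chain."""
    return j * math.log2(t) + d * math.log2(n_m)

def tagt_upper_without_branch_pruning(t, d, n_m):
    """D*log2(T*N_M) = D*log2(T) + D*log2(N_M)."""
    return d * math.log2(t * n_m)
```

With J = 2 junctions, T = 8 threads, D = 4 causal predicates, and a longest path of N_M = 16, AID's bound is 22 interventions versus 28 for TAGT, illustrating the J < D advantage.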
For an AC-DAG with N predicates, D of which are causal, we now focus on the upper bound on the number of interventions in AID using only predicate pruning. In the worst case, when no pruning is possible, the number of required interventions is the same as that of TAGT without pruning, i.e., $O(D \log N)$.

Theorem 3.
If at least S predicates are discarded (pruned or marked as causal) from the candidate predicate pool during each causal predicate discovery, then the number of required interventions for AID is at most $D \log N - \frac{D(D - S)}{N}$.

Hence, the reduction depends on S. The case S = 1 corresponds to TAGT in the absence of pruning, because once TAGT finds a causal predicate, it removes only that predicate from the candidate predicate pool.

Symmetric AC-DAG.
Figure 6 shows the upper bound on the number of required interventions using AID and TAGT for the symmetric AC-DAG of Figure 5(c), assuming that at least S predicates are discarded during each causal predicate discovery by AID.
7. EXPERIMENTAL EVALUATION
We now empirically evaluate AID. We first use AID on six real-world applications to demonstrate its effectiveness in identifying root causes and generating explanations of how each root cause leads to the failure. Then we use a synthetic benchmark to compare AID and its variants against the traditional adaptive group testing approach, and to perform a sensitivity analysis of AID on various parameters of the benchmark.
We now use three real-world open-source applications and three proprietary applications to demonstrate AID's effectiveness in identifying root causes of transient failures. Figure 7 summarizes the results and highlights the key benefits of AID:
• AID is able to identify the true root cause and generate an explanation that is consistent with the explanation provided by the developers in the corresponding GitHub issues.
• AID requires significantly fewer interventions than traditional adaptive group testing (TAGT), which does not utilize causality among predicates (columns 5 and 6).
• In contrast, SD generates a large number of discriminative predicates (column 3), only a small number of which are actually causally related to the failures (column 4).
As a case study, we first consider a recently discovered concurrency bug in Npgsql [52], an open-source ADO.NET Data Provider for PostgreSQL. The bug (GitHub issue
IndexOutOfRange exception; (4) the application fails to handle the exception and crashes. This explanation matches the root cause provided by the developer who reported the bug to the Npgsql GitHub repository.
Next, we use AID on an application built on Kafka [32], a distributed message queue. On Kafka's GitHub repository, a user reported an issue [33] that causes a Kafka application to intermittently crash or hang. The user also provided sample code to reproduce the issue; we use similar code for this case study.

As before, we collected predicate logs from 50 successful and 50 failed executions. Using SD, we identified 72 discriminative predicates. The AC-DAG identified 30 predicates with no causal path to the failure-indicating predicate, which were hence discarded. AID then used the intervention algorithm on the remaining 42 predicates. After a sequence of 7 interventions, AID could identify the root-cause predicate. It took an additional 10 rounds (17 total) of interventions to discover a causal path of 5 predicates that connects the root cause and the failure. The causal path gives the following explanation: (1) The main thread that creates a Kafka consumer
C starts a child thread; (2) the child thread runs too slowly before calling a method on C; (3) the main thread disposes C; (4) the child thread calls a commit method on C; (5) since C has already been disposed by the main thread, the previous step causes an exception, causing the failure. The explanation matches well with the description provided in GitHub.

Overall, AID required 17 interventions to discover the root cause and explanation. In contrast, SD generates 72 predicates, without pinpointing the true root cause or explanation. TAGT could identify all predicates in the explanation, but it takes 33 interventions in the worst case.

(1) Application    | (2) GitHub issue | (3) Discriminative predicates | (4) Causal predicates | (5) AID interventions | (6) TAGT interventions
Network            | N/A              | 24                            | 1                     | 2                     | 5
BuildAndTest       | N/A              | 25                            | 3                     | 10                    | 15
HealthTelemetry    | N/A              | 93                            | 10                    | 40                    | 70

Figure 7: Results from case studies of real-world applications. SD produces far too many spurious predicates beyond the correct causal predicates (columns 3 & 4). SD actually produces even more predicates, but here we only report the number of fully-discriminative predicates. AID and traditional adaptive group testing (TAGT) both pinpoint the correct causal predicates using interventions, but AID does so with significantly fewer interventions (columns 5 & 6).

Next, we use AID on an application built on Azure Cosmos DB [14], Microsoft's globally distributed database service for operational and analytics workloads. The application has an intermittent timing bug similar to the one mentioned in a Cosmos DB pull request on GitHub [15]. In summary, the application populates a cache with several entries that would expire after 1 second, performs a few tasks, and then accesses one of the cached entries. During successful executions, the tasks run fast and end before the cached entries expire. However, a transient fault triggers expensive fault-handling code that makes a task run longer than the cache expiry time. This makes the application fail, as it cannot find the entry in the cache (i.e., it has already expired).

Using SD, we identified 64 discriminative predicates from successful and failed executions of the application.
Applying AID on them required 15 interventions, and it generated an explanation consisting of 7 predicates that is consistent with the aforementioned informal explanation. In contrast, SD would generate 64 predicates, and TAGT would take 42 interventions in the worst case.
We applied AID to find the root causes of intermittent failures of several proprietary applications inside Microsoft. We here report our experience with three of the applications, which we name as follows (Figure 7): (1) Network: the control plane of a data center network, (2) BuildAndTest: a large-scale software build and test platform, and (3) HealthTelemetry: a module used by various services to report their runtime health. Parts of these applications (and associated tests) had been intermittently failing for several months, and their developers could not identify the exact root causes. This highlights that the root causes of these failures were non-trivial. AID identified the root causes and generated explanations for how the root causes lead to failures: for Network, the root cause was a random number collision; for BuildAndTest, it was an order violation of two events; and for HealthTelemetry, it was a race condition. Developers of the applications confirmed that the root causes identified by AID are indeed the correct ones, and that the explanations given by AID correctly showed how the root causes lead to the (intermittent) failures.

Figure 7 also shows the performance of AID with these applications. As before, SD produces many discriminative predicates, only a subset of which are causally related to the failures. Moreover, for all applications, AID requires significantly fewer interventions than what TAGT would require in the worst case.

Figure 8: Number of interventions required in the average and worst case by traditional adaptive group testing (TAGT) and different variations of AID with varying MAX_t. For the average-case analysis, the total number of predicates is shown using a grey dotted line. The total number of predicates is not shown for the worst-case analysis, because the worst cases vary across approaches.
We further evaluate AID on a benchmark of synthetically generated applications, designed to fail intermittently and with known root causes. We generate multi-threaded applications, varying the maximum number of threads MAX_t from 2 to 40. For each parameter setting, we generate 500 applications. In these applications, the total number of predicates N ranges from 4 to 284, and we randomly choose the number of causal predicates in the range [1, N/log N].

For this experiment, we compare four approaches: TAGT, AID, AID without predicate pruning (AID-P), and AID without predicate or branch pruning (AID-P-B). All four approaches derive the correct causal paths but differ in the number of required interventions. Figure 8 shows the average (left) and the maximum (right) number of interventions required by each approach. The grey dotted line in the average case shows the average number of predicates over the 500 instances for that setting. This experiment provides two key observations:

Interventions in topological order converge faster. Causally-related predicates are likely to be topologically close to each other in the AC-DAG. AID discards all predicates in an intervened group only when none are causal. This is unlikely to occur when predicates are grouped randomly. For this reason, AID-P-B, which uses topological ordering, requires fewer interventions than TAGT.
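The effect of topological grouping can be sketched with a minimal adaptive group-testing model in Python (illustrative only: an "intervention" is reduced to an oracle that reports whether a group contains any causal predicate, and the predicate names are hypothetical):

```python
def find_causal(predicates, is_any_causal, counter):
    """Adaptive group testing: recursively halve the candidate set,
    discarding any group whose intervention shows no causal predicate."""
    counter[0] += 1  # one intervention per group test
    if not is_any_causal(predicates):
        return []    # whole group pruned with a single intervention
    if len(predicates) == 1:
        return list(predicates)
    mid = len(predicates) // 2
    return (find_causal(predicates[:mid], is_any_causal, counter)
            + find_causal(predicates[mid:], is_any_causal, counter))

preds = ["p%d" % i for i in range(8)]

# Causal predicates adjacent, as topological ordering tends to make them:
adjacent = {"p2", "p3"}
c1 = [0]
assert find_causal(preds, lambda g: any(p in adjacent for p in g), c1) == ["p2", "p3"]

# The same number of causal predicates scattered across the ordering:
scattered = {"p1", "p6"}
c2 = [0]
assert find_causal(preds, lambda g: any(p in scattered for p in g), c2) == ["p1", "p6"]
assert c1[0] < c2[0]  # 7 vs. 11 interventions in this example
```

Grouping saves interventions exactly when non-causal predicates are contiguous, which is the situation that topological ordering makes likely.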
Pruning reduces the required number of interventions. We observe that both predicate and branch pruning reduce the number of interventions. Pruning is a key differentiating factor of AID from TAGT. In the worst-case setting in particular, the margin between AID and TAGT is significant: TAGT requires up to 217 interventions in one case, while the highest number of interventions for AID is 52.

8. RELATED WORK
Causal inference has long been applied to root-cause analysis of program failures. Attariyan et al. [3, 4] observe causality within application components through runtime control and data flow, but only report a list of root causes ordered by the likelihood of being faulty, without providing further causal connections between root causes and performance anomalies. Going beyond statistical association (e.g., correlation) between root cause and failure, a few techniques [5, 6, 59, 18] apply statistical causal inference on observational data for software fault localization. However, observational data collected from program execution logs is often limited in capturing certain scenarios, and hence observational studies are ill-equipped to identify the intermediate explanation predicates. This is because observational data is not generated by randomized controlled experiments and, therefore, may not satisfy conditional exchangeability (the data can be treated as if it came from a randomized experiment [27]) and positivity (all possible combinations of values for the variables are observed in the data)—two key requirements for applying causal inference on observational data [59]. While observational studies are extremely useful in many settings, AID's problem setting permits interventional studies, which offer increased reliability and accuracy.
Explanation-centric approaches are relevant to AID, as they also aim at generating informative, yet minimal, explanations of certain incidents, such as data errors [65] and binary outcomes [19]; however, these do not focus on interventions. Viska [21] allows users to perform interventions on system parameters to understand the underlying causes of performance differences across different systems. None of these systems is applicable for finding causally connected paths that explain intermittent failures due to concurrency bugs.
Statistical debugging approaches [13, 29, 41, 43, 63, 71, 36] employ statistical diagnosis to rank program predicates based on their likelihood of being the root causes of program failures. However, all statistical debugging approaches suffer from the issue of not separating correlated predicates from causal ones, and fail to provide contextual information regarding how the root causes lead to program failures.
Predicates in AID are extracted from execution traces of the application. Ball et al. [10] provide algorithms for efficiently tracing execution with minimal instrumentation. While the authors had a different goal (i.e., path profiling) than ours, such traces can be used to extract AID predicates.
Fault injection techniques [2, 23, 34, 47] intervene in an application's runtime behavior with the goal of testing whether the application can handle the injected faults. In these techniques, the faults to be injected are chosen based on whether they can occur in practice. In contrast, AID intervenes with the goal of verifying the presence or absence of causal relationships among runtime predicates, and its faults are chosen based on whether they can alter selected predicates.
Group testing [1, 7, 26, 9, 40, 17, 35] has been applied to fault diagnosis in prior literature [72]. Specifically, adaptive group testing is related to AID's intervention algorithm. However, none of the existing works considers the scenario where a group test might reveal additional information, and they thus offer an inefficient solution for causal path discovery.
Control flow graph-based techniques [12, 28] aim at identifying bug signatures for sequential programs using discriminative subgraphs within the program's control flow graph, or at generating faulty control flow paths that link many bug predictors. However, these approaches do not consider the causal connections among the bug predictors and the program failure.
Differential slicing [30] aims to discover causal paths of execution differences, but requires a complete program execution trace generated by execution indexing [69]. Dual slicing [66] is another program-slicing-based technique that discovers statement-level causal paths for concurrent program failures. However, this approach does not consider compound predicates that capture certain runtime conditions observed in concurrent programs. Moreover, program-slicing-based approaches cannot deal with a set of executions; instead, they only consider two executions—one successful and one failed.
9. CONCLUSIONS
In this work, we defined the problem of causal path discovery for explaining failures of concurrent programs. Our key contribution is the novel Adaptive Interventional Debugging (AID) framework, which combines existing statistical debugging, causal analysis, fault injection, and group testing techniques in a novel way to discover the root cause of a program failure and generate the causal path that explains how the root cause triggers the failure. Such an explanation provides better interpretability for understanding and analyzing the root causes of program failures. We showed both theoretically and empirically that AID is efficient and effective in solving the causal path discovery problem. As a future direction, we plan to incorporate additional information regarding program behavior to better approximate the causal relationships among predicates, and to address the cases of multiple root causes and multiple causal paths. Furthermore, we plan to address the challenge of explaining multiple types of failures.
10. REFERENCES

[1] A. Agarwal, S. Jaggi, and A. Mazumdar. Novel impossibility results for group-testing. In , pages 2579–2583, 2018.
[2] P. Alvaro, J. Rosen, and J. M. Hellerstein. Lineage-driven fault injection. In
Proceedings of the 2015 ACM SIGMOD InternationalConference on Management of Data, Melbourne, Victoria, Australia,May 31 - June 4, 2015 , pages 331–346, 2015.[3] M. Attariyan, M. Chow, and J. Flinn. X-ray: Automating root-causediagnosis of performance anomalies in production software. In , pages 307–320, 2012.[4] M. Attariyan and J. Flinn. Automating configuration troubleshootingwith dynamic information flow analysis. In , pages237–250, 2010.[5] G. K. Baah, A. Podgurski, and M. J. Harrold. Causal inference forstatistical fault localization. In
Proceedings of the 19th InternationalSymposium on Software Testing and Analysis , ISSTA ’10, pages73–84, New York, NY, USA, 2010. ACM.[6] G. K. Baah, A. Podgurski, and M. J. Harrold. Mitigating theconfounding effects of program dependences for effective faultlocalization. In
SIGSOFT/FSE’11 19th ACM SIGSOFT Symposiumon the Foundations of Software Engineering (FSE-19) and ESEC’11:13th European Software Engineering Conference (ESEC-13), Szeged,Hungary, September 5-9, 2011 , pages 146–156, 2011.[7] Y. Bai, Q. Wang, C. Lo, M. Liu, J. P. Lynch, and X. Zhang. Adaptivebayesian group testing: Algorithms and performance.
SignalProcessing , 156:191–207, 2019.[8] P. Bailis, A. Fekete, M. J. Franklin, A. Ghodsi, J. M. Hellerstein, andI. Stoica. Feral concurrency control: An empirical investigation ofmodern application integrity. In
Proceedings of the 2015 ACMSIGMOD International Conference on Management of Data , pages1327–1342. ACM, 2015.
[9] L. Baldassini, O. Johnson, and M. Aldridge. The capacity of adaptive group testing. In
Proceedings of the 2013 IEEE InternationalSymposium on Information Theory, Istanbul, Turkey, July 7-12, 2013 ,pages 2676–2680, 2013.[10] T. Ball and J. R. Larus. Optimally profiling and tracing programs.
ACM Trans. Program. Lang. Syst. , 16(4):1319–1360, 1994.[11] A. Bovenzi, D. Cotroneo, R. Pietrantuono, and S. Russo. On theaging effects due to concurrency bugs: A case study on mysql. In , pages 211–220. IEEE, 2012.[12] H. Cheng, D. Lo, Y. Zhou, X. Wang, and X. Yan. Identifying bugsignatures using discriminative graph mining. In
Proceedings of theEighteenth International Symposium on Software Testing andAnalysis, ISSTA 2009, Chicago, IL, USA, July 19-23, 2009 , pages141–152, 2009.[13] T. M. Chilimbi, B. Liblit, K. K. Mehra, A. V. Nori, and K. Vaswani.HOLMES: effective statistical debugging via efficient path profiling.In , pages34–44, 2009.[14] Azure cosmos db. https://docs.microsoft.com/en-us/azure/cosmos-db/ .[15] Cosmosdb bug. https://github.com/Azure/azure-cosmos-dotnet-v3/pull/713 .[16] J. Dean and S. Ghemawat. Mapreduce: simplified data processing onlarge clusters.
Commun. ACM , 51(1):107–113, 2008.[17] D.-Z. Du and F. K. Hwang.
Combinatorial group testing and itsapplications . World Scientific, Singapore River Edge, N.J, 1993.[18] F. Feyzi and S. Parsa. Inforence: Effective fault localization based oninformation-theoretic analysis and statistical causal inference.
CoRR ,abs/1712.03361, 2017.[19] K. E. Gebaly, P. Agrawal, L. Golab, F. Korn, and D. Srivastava.Interpretable and informative explanations of outcomes.
PVLDB ,8(1):61–72, 2014.[20] K. Glerum, K. Kinshumann, S. Greenberg, G. Aul, V. R. Orgovan,G. Nichols, D. Grant, G. Loihle, and G. C. Hunt. Debugging in the(very) large: ten years of implementation and experience. In
SOSP ,pages 103–116, 2009.[21] H. Gudmundsdottir, B. Salimi, M. Balazinska, D. R. K. Ports, andD. Suciu. A demonstration of interactive analysis of performancemeasurements with viska. In
Proceedings of the 2017 ACMInternational Conference on Management of Data, SIGMODConference 2017, Chicago, IL, USA, May 14-19, 2017 , pages1707–1710, 2017.[22] J. Y. Halpern and J. Pearl. Causes and explanations: Astructural-model approach: Part 1: Causes. In
UAI ’01: Proceedingsof the 17th Conference in Uncertainty in Artificial Intelligence,University of Washington, Seattle, Washington, USA, August 2-5,2001 , pages 194–202, 2001.[23] S. Han, K. G. Shin, and H. A. Rosenberg. Doctor: An integratedsoftware fault injection environment for distributed real-timesystems. In
Proceedings of 1995 IEEE International ComputerPerformance and Dependability Symposium , pages 204–213. IEEE,1995.[24] M. Herlihy. A methodology for implementing highly concurrent dataobjects (abstract).
Operating Systems Review , 26(2):12, 1992.[25] C. Hitchcock. Conditioning, intervening, and decision.
Synthese ,193, 03 2015.[26] F. K. Hwang. A method for detecting all defective members in apopulation by group testing.
Journal of the American StatisticalAssociation , 67(339):605–608, 1972.[27] D. D. Jensen, J. Burroni, and M. J. Rattigan. Object conditioning forcausal inference. In
Proceedings of the Thirty-Fifth Conference onUncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July22-25, 2019 , page 393, 2019.[28] L. Jiang and Z. Su. Context-aware statistical debugging: from bugpredictors to faulty control flow paths. In , pages 184–193, 2007.[29] G. Jin, A. V. Thakur, B. Liblit, and S. Lu. Instrumentation andsampling strategies for cooperative concurrency bug isolation. In
Proceedings of the 25th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, andApplications, OOPSLA 2010, October 17-21, 2010, Reno/Tahoe,Nevada, USA , pages 241–255, 2010.[30] N. M. Johnson, J. Caballero, K. Z. Chen, S. McCamant,P. Poosankam, D. Reynaud, and D. Song. Differential slicing:Identifying causal execution differences for security applications. In
IEEE Symposium on Security and Privacy , pages 347–362. IEEEComputer Society, 2011.[31] J. A. Jones and M. J. Harrold. Empirical evaluation of the tarantulaautomatic fault-localization technique. In , pages 273–282,2005.[32] Kafka - a distributed streaming platform. https://kafka.apache.org/ .[33] Kafka bug. https://github.com/confluentinc/confluent-kafka-dotnet/issues/279 .[34] G. A. Kanawati, N. A. Kanawati, and J. A. Abraham. FERRARI: Aflexible software-based fault and error injection system.
IEEE Trans.Computers , 44(2):248–260, 1995.[35] A. Karbasi and M. Zadimoghaddam. Sequential group testing withgraph constraints. In , pages 292–296, 2012.[36] B. Kasikci, W. Cui, X. Ge, and B. Niu. Lazy diagnosis ofin-production concurrency bugs. In
SOSP , pages 582–598. ACM,2017.[37] W. Lam, P. Godefroid, S. Nath, A. Santhiar, and S. Thummalapenta.Root causing flaky tests in a large-scale industrial setting. In
Proceedings of the 28th ACM SIGSOFT International Symposium onSoftware Testing and Analysis , ISSTA 2019, pages 101–111, NewYork, NY, USA, 2019. ACM.[38] L. Lamport. Time, clocks, and the ordering of events in a distributedsystem.
Commun. ACM , 21(7):558–565, 1978.[39] T. Leesatapornwongsa, J. F. Lukman, S. Lu, and H. S. Gunawi.Taxdc: A taxonomy of non-deterministic concurrency bugs indatacenter distributed systems.
ACM SIGPLAN Notices ,51(4):517–530, 2016.[40] T. Li, C. L. Chan, W. Huang, T. Kaced, and S. Jaggi. Group testingwith prior statistics. In ,pages 2346–2350, 2014.[41] B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan. Scalablestatistical bug isolation. In
Proceedings of the ACM SIGPLAN 2005Conference on Programming Language Design and Implementation,Chicago, IL, USA, June 12-15, 2005 , pages 15–26, 2005.[42] B. Liu, Z. Qi, B. Wang, and R. Ma. Pinso: Precise isolation ofconcurrency bugs via delta triaging. In
ICSME , pages 201–210. IEEEComputer Society, 2014.[43] C. Liu, L. Fei, X. Yan, J. Han, and S. P. Midkiff. Statisticaldebugging: A hypothesis testing-based approach.
IEEE Trans.Software Eng. , 32(10):831–848, 2006.[44] S. Lu, S. Park, C. Hu, X. Ma, W. Jiang, Z. Li, R. A. Popa, andY. Zhou. MUVI: automatically inferring multi-variable accesscorrelations and detecting related semantic and concurrency bugs. In
Proceedings of the 21st ACM Symposium on Operating SystemsPrinciples 2007, SOSP 2007, Stevenson, Washington, USA, October14-17, 2007 , pages 103–116, 2007.[45] S. Lu, S. Park, E. Seo, and Y. Zhou. Learning from mistakes: acomprehensive study on real world concurrency bug characteristics.In
Proceedings of the 13th International Conference on ArchitecturalSupport for Programming Languages and Operating Systems,ASPLOS 2008, Seattle, WA, USA, March 1-5, 2008 , pages 329–339,2008.[46] Q. Luo, F. Hariri, L. Eloussi, and D. Marinov. An empirical analysisof flaky tests. In
Proceedings of the 22nd ACM SIGSOFTInternational Symposium on Foundations of Software Engineering ,pages 643–653. ACM, 2014.[47] P. D. Marinescu and G. Candea. Lfi: A practical and generallibrary-level fault injector. In , pages 379–388.IEEE, 2009.
[48] A. Meliou, W. Gatterbauer, K. F. Moore, and D. Suciu. WHY so? or WHY no? functional causality for explaining query answers. In
Proceedings of the Fourth International VLDB workshop onManagement of Uncertain Data (MUD 2010) in conjunction withVLDB 2010, Singapore, September 13, 2010. , pages 3–17, 2010.[49] A. Meliou, W. Gatterbauer, S. Nath, and D. Suciu. Tracing dataerrors with view-conditioned causality. In
Proceedings of the ACMSIGMOD International Conference on Management of Data,SIGMOD 2011, Athens, Greece, June 12-16, 2011 , pages 505–516,2011.[50] A. Meliou, S. Roy, and D. Suciu. Causality and explanations indatabases.
PVLDB , 7(13):1715–1716, 2014.[51] Mysql. .[52] Npgsql - .net access to postgresql. .[53] Npgsql bug. https://github.com/npgsql/npgsql/issues/2485 .[54] L. Oldenburg, X. Zhu, K. Ramasubramanian, and P. Alvaro. Fixed itfor you: Protocol repair using lineage graphs. In
CIDR 2019, 9thBiennial Conference on Innovative Data Systems Research, Asilomar,CA, USA, January 13-16, 2019, Online Proceedings , 2019.[55] C. Parnin and A. Orso. Are automated debugging techniques actuallyhelping programmers? In
ISSTA , pages 199–209. ACM, 2011.[56] J. Pearl.
Causality: Models, Reasoning, and Inference . CambridgeUniversity Press, New York, NY, USA, 2000.[57] J. Pearl. The algorithmization of counterfactuals.
Ann. Math. Artif.Intell. , 61(1):29–39, 2011.[58] J. Pearl and T. Verma. A theory of inferred causation. In
Proceedingsof the 2nd International Conference on Principles of KnowledgeRepresentation and Reasoning (KR’91). Cambridge, MA, USA, April22-25, 1991 , pages 441–452, 1991.[59] G. Shu, B. Sun, A. Podgurski, and F. Cao. MFL: method-level faultlocalization with causal inference. In
ICST , pages 124–133. IEEEComputer Society, 2013.[60] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The hadoopdistributed file system. In
IEEE 26th Symposium on Mass StorageSystems and Technologies, MSST 2012, Lake Tahoe, Nevada, USA,May 3-7, 2010 , pages 1–10, 2010.[61] P. Spirtes, C. Glymour, and R. Scheines.
Causation, Prediction, andSearch . MIT press, 2nd edition, 2000.[62] W. N. Sumner and X. Zhang. Algorithms for automaticallycomputing the causal paths of failures. In
International Conferenceon Fundamental Approaches to Software Engineering , pages355–369. Springer, 2009.[63] A. V. Thakur, R. Sen, B. Liblit, and S. Lu. Cooperative crugisolation. In
Proceedings of the International Workshop on DynamicAnalysis: held in conjunction with the ACM SIGSOFT InternationalSymposium on Software Testing and Analysis (ISSTA 2009), WODA2009, Chicago, IL, USA, July, 2009. , pages 35–41, 2009.[64] A. Vahabzadeh, A. M. Fard, and A. Mesbah. An empirical study ofbugs in test code. In , pages 101–110, 2015.[65] X. Wang, X. L. Dong, and A. Meliou. Data x-ray: A diagnostic toolfor data errors. In
Proceedings of the 2015 ACM SIGMODInternational Conference on Management of Data, Melbourne,Victoria, Australia, May 31 - June 4, 2015 , pages 1231–1245, 2015.[66] D. Weeratunge, X. Zhang, W. N. Sumner, and S. Jagannathan.Analyzing concurrency bugs using dual slicing. In
ISSTA , pages253–264. ACM, 2010.[67] W. E. Wong and V. Debroy. A survey of software fault localization.
Department of Computer Science, University of Texas at Dallas,Tech. Rep. UTDCS-45 , 9, 2009.[68] J. Woodward.
Making Things Happen: A Theory of CausalExplanation . Oxford scholarship online. Oxford University Press,2003.[69] B. Xin, W. N. Sumner, and X. Zhang. Efficient program executionindexing. In
Proceedings of the ACM SIGPLAN 2008 Conference onProgramming Language Design and Implementation, Tucson, AZ,USA, June 7-13, 2008 , pages 238–248, 2008.[70] D. Yuan, Y. Luo, X. Zhuang, G. R. Rodrigues, X. Zhao, Y. Zhang, P. U. Jain, and M. Stumm. Simple testing can prevent most criticalfailures: An analysis of production failures in distributeddata-intensive systems. In , pages 249–265,2014.[71] A. X. Zheng, M. I. Jordan, B. Liblit, M. Naik, and A. Aiken.Statistical debugging: simultaneous identification of multiple bugs.In
Machine Learning, Proceedings of the Twenty-Third InternationalConference (ICML 2006), Pittsburgh, Pennsylvania, USA, June25-29, 2006 , pages 1105–1112, 2006.[72] A. X. Zheng, I. Rish, and A. Beygelzimer. Efficient test selection inactive diagnosis via entropy approximation. In
UAI ’05, Proceedingsof the 21st Conference in Uncertainty in Artificial Intelligence,Edinburgh, Scotland, July 26-29, 2005 , page 675, 2005.
APPENDIX

A. PROGRAM INSTRUMENTATION
AID separates program instrumentation from predicate extraction, unlike prior SD techniques [41, 29, 43]. One advantage of this separation is that it enables us to design predicates after collecting the application's execution traces. In contrast, prior work in SD instruments applications to directly extract the predicates. For example, to assess whether two methods return the same value, prior work would instrument the program using a hard-coded conditional statement "pred = (foo() == bar())". In contrast, our instrumentation simply collects the return values of the two methods and stores them in the execution trace. AID later evaluates the predicates based on the execution traces. This gives us the flexibility to design predicates post-execution, often based on the knowledge of a domain expert. For example, in this case, we can design multiple predicates, such as whether two values are equal, unequal, or satisfy any custom relation.
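A minimal Python sketch of this separation follows (the trace layout and method names are illustrative; AID's actual traces record full method signatures with timing and access information):

```python
# Instrumentation only records raw values into the execution trace...
trace = {"foo": 7, "bar": 7}  # return values collected at runtime

# ...and predicates are designed afterwards, as functions over the trace:
predicates = {
    "equal":     lambda t: t["foo"] == t["bar"],
    "unequal":   lambda t: t["foo"] != t["bar"],
    "less_than": lambda t: t["foo"] < t["bar"],
}

# Evaluate every predicate against the recorded trace post-execution.
evaluated = {name: p(trace) for name, p in predicates.items()}
# evaluated == {"equal": True, "unequal": False, "less_than": False}
```

New predicates can be added and evaluated over old traces without re-running or re-instrumenting the application, which is the point of the separation.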
Instrumentation granularity.
Instrumentation granularity is orthogonal to AID. Like prior SD work, we could have instrumented at a finer granularity, such as at each conditional branch; but instrumenting method calls was sufficient for our purpose. Since our instrumentation is of much sparser granularity than that of existing SD work [41, 29, 43], which employs sampling-based finer-granularity instrumentation, we do not use any sampling.
B. PREDICATE EXTRACTION AND FAULT INJECTION
Figure 9 shows the complete pipeline of predicate extraction and fault injection for the Npgsql bug of Example 1, whose simplified source code is shown in Figure 9(a). Executions of the instrumented application generate a list of runtime method signatures per execution, called execution traces. Two partial execution traces—one for a successful and the other for a failed execution—are shown in Figure 9(b). We then extract predicates and compute their precision and recall, as shown in Figure 9(c).

In AID, we use existing fault injection techniques—which are able to change a method's input and return value, cause a method to throw an exception, and cause a method to run slower or run before/after/concurrently with another method in another thread—to intervene on discriminative predicates. For example, to enable the return-value-alteration intervention, AID modifies the application by adding (1) an optional parameter to each function, and (2) a conditional statement at the end of each function specifying that "if a value is passed to the optional parameter, the function should return that specific value, and the actually computed value otherwise". As another example, the predicate "there is a data race on X" can be intervened on by delaying one access to X, or by putting a lock around the code segments that access X to prevent simultaneous access to X. Figure 9(d) shows how a fault is injected by putting a lock to intervene on the data race predicate of Figure 9(c).

Figure 9: (a) Simplified code for the Npgsql bug of Example 1. Global variables: _pools, an array of connector pools; _nextSlot, the next unused slot in _pools. Initial values: _pools.Length = 10, _nextSlot = 10.

    ConnectorPool TryGetValue(key) {
        var localPools = _pools;               // B1
        for (i = 0; i < _nextSlot; i++)        // B2
            if (localPools[i].key == key)
                return localPools[i];
        return null;
    }

    ConnectorPool GetOrAdd(key, pool) {
        lock {
            var p = TryGetValue(key);
            if (p != null) return p;
            if (_nextSlot == _pools.Length)
                _pools = ResizeDouble(_pools);
            _pools[_nextSlot++] = pool;        // B3
            return pool;
        }
    }

    Successful execution: B1 (Thread1) → B2 (Thread1) → B3 (Thread2)
    Failed execution:     B1 (Thread1) → B3 (Thread2) → B2 (Thread1)

(b) Partial execution traces of one successful and one failed execution. The start-time and end-time of the events reflect concurrent read/write access to the shared variable _nextSlot.

    Successful execution: method execution signature list (partial)
    Event                     Accessed Object  Access Type  Thread ID  Start Time  End Time  ...
    Method call: TryGetValue  _nextSlot        Read         1          100         200
    Method call: GetOrAdd     _nextSlot        Write        2          230         250

    Failed execution: method execution signature list (partial)
    Event                     Accessed Object  Access Type  Thread ID  Start Time  End Time  ...
    Method call: TryGetValue  _nextSlot        Read         1          100         200
    Method call: GetOrAdd     _nextSlot        Write        2          150         190

(c) The race predicate is one of the discriminative predicates.

    Predicate                                                        Precision  Recall
    TryGetValue and GetOrAdd are accessing the same object
    (_nextSlot) concurrently and one of them is a write operation    100%       100%
    ...

(d) Fault injection to intervene on (disable) the race predicate: putting a lock around the instructions within TryGetValue().

    ConnectorPool TryGetValue(key) {
        lock {
            // B1, B2 as in (a)
        }
    }
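The return-value intervention can be sketched in Python with a wrapper that plays the role of AID's added optional parameter (a simplified model; AID instruments the application's functions directly rather than wrapping them):

```python
def with_forced_return(fn):
    """Mimic AID's added optional parameter: if an intervention supplies
    a value, return it instead of the actually computed one."""
    def wrapper(*args, _forced_return=None, **kwargs):
        result = fn(*args, **kwargs)
        # None means "no intervention"; the sketch assumes the function
        # never legitimately returns None.
        return result if _forced_return is None else _forced_return
    return wrapper

@with_forced_return
def compute():
    return 7

assert compute() == 7                     # normal execution
assert compute(_forced_return=42) == 42   # intervened execution
```

Because the extra parameter defaults to "no intervention", uninstrumented call sites behave exactly as before, and an intervention can target any single call.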
C. REAL-WORLD CONCURRENCY BUG CHARACTERISTICS
Studies on real-world concurrency bug characteristics [45, 67, 64] show that a vast majority of root causes can be captured with reasonably simple single predicates, and hence this assumption is very common in the SD literature [41, 43, 29]. Some notable findings include: (1) "97% of the non-deadlock concurrency bugs are covered by two simple patterns: atomicity violation and order violation" [45], (2) "66% of the non-deadlock concurrency bugs involve only one variable" [45], (3) "The manifestation of 96% of the concurrency bugs involves no more than two threads." [45], (4) "most fault localization approaches assume that each buggy source file has exactly one line of faulty code" [67], and (5) "The majority of flaky test bugs occur when the test does not wait properly for asynchronous calls during the exercise phase of testing." [64].
D. PROOF OF THEOREM 2
Proof.
After the first intervention, we get at least $\log\binom{N}{D} - \log\binom{N-S}{D} + 1$ bits of information. Suppose that there are $m$ interventions. Since, after retrieving all the information, the remaining information should be $\le 0$:

$$\log\binom{N}{D} - \sum_{i=1}^{m}\left(\log\binom{N-(i-1)S}{D} - \log\binom{N-iS}{D} + 1\right) \le 0$$

The sum telescopes, leaving:

$$\log\binom{N-mS}{D} - m \le 0
\;\Longrightarrow\; m \ge \log\frac{(N-mS)!}{D!\,(N-mS-D)!}$$

$$\Longrightarrow\; m \ge \log\frac{(N-mS)^D}{D!} \qquad \left[\frac{(N-mS)!}{(N-mS-D)!} \approx (N-mS)^D\right]$$

$$\Longrightarrow\; m \ge D\log(N-mS) - \log(D!)$$

$$\Longrightarrow\; m \ge D\log\!\Big(N\big(1-\tfrac{mS}{N}\big)\Big) - \log(D!)$$

$$\Longrightarrow\; m \ge D\log N + D\log\big(1-\tfrac{mS}{N}\big) - \log(D!)$$

Since $\log(1-x) \approx -x$ for small $x$, we assume $\tfrac{mS}{N}$ to be small:

$$\Longrightarrow\; m \ge D\log N - \frac{mDS}{N} - \log(D!)$$

$$\Longrightarrow\; m\Big(1+\frac{DS}{N}\Big) \ge D\log N - \log(D!) = \log\frac{N^D}{D!}$$

$$\Longrightarrow\; m\Big(1+\frac{DS}{N}\Big) \ge \log\frac{N!}{D!\,(N-D)!} \qquad \left[N^D \approx \frac{N!}{(N-D)!}\right]$$

$$\Longrightarrow\; m \ge \frac{\log\binom{N}{D}}{1+\frac{DS}{N}}$$

E. PROOF OF THEOREM 3
Proof.