BugDoc: Algorithms to Debug Computational Processes
Raoni Lourenço
New York University
[email protected]

Juliana Freire
New York University
[email protected]

Dennis Shasha
New York University
[email protected]
Abstract
Data analysis for scientific experiments and enterprises, large-scale simulations, and machine learning tasks all entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous outputs, the pipeline may fail to execute or produce incorrect results. Inferring the root cause(s) of such failures is challenging, usually requiring time and much human thought, while still being error-prone. We propose a new approach that makes use of iteration and provenance to automatically infer the root causes and derive succinct explanations of failures. Through a detailed experimental evaluation, we assess the cost, precision, and recall of our approach compared to the state of the art. Our experimental data and processing software is available for use, reproducibility, and enhancement.
CCS Concepts • Information systems → Data provenance.

Computational pipelines are widely used in many domains, from astrophysics and biology to enterprise analytics. They are characterized by interdependent modules, associated parameters, and data inputs. Results derived from these pipelines lead to conclusions and, potentially, actions. If one or more modules in a pipeline produce erroneous or unexpected outputs, these conclusions may be incorrect. Thus, it is critical to identify the causes of such failures.

Discovering the root cause of failures in a pipeline is challenging because problems can come from many different sources, including bugs in the code, input data, software updates, and improper parameter settings. Connecting the erroneous result to its root cause is especially difficult for long pipelines or when multiple pipelines are composed. Consider the following real but sanitized examples.
Example: Enterprise Analytics.
In an application deployed by a major software company, plots for sales forecasts showed a sharp decrease compared to historical values. After much investigation, the problem was tracked down to a data feed (coming from an external data provider) whose temporal resolution had changed from monthly to weekly. The change in resolution affected the predictions of a machine learning pipeline, leading to incorrect forecasts.
Figure 1: Machine learning pipeline and its provenance. A data scientist can explore different input datasets and classifier estimators to identify a suitable solution for a classification problem.
Example: Exploring Supernovas.
In an astronomy experiment, some visualizations of supernovas presented unusual artifacts that could have indicated a discovery. The experimental analysis consisted of multiple pipelines run at different sites, including data collection at the telescope site, data processing at a high-performance computing facility, and data analysis run on the physicist's desktop. After spending substantial time trying to verify the results, the physicists found that a bug introduced in the new version of the data processing software had caused the artifacts.

To debug such problems, users currently expend considerable effort reasoning about the effects of the many possible different settings. This requires them to tune and execute new pipeline instances to test hypotheses manually, which is tedious, time-consuming, and error-prone. We propose new methods and a system that automatically and iteratively identifies one or more minimal causes of failures in general computational pipelines (or workflows).
The Need for Systematic Iteration.
Consider the example in Figure 1, which shows a generic template for a machine learning pipeline and a log of different instances that were run with their associated results. The pipeline reads a dataset, splits it into training and test subsets, creates and executes an estimator, and computes the F-measure score using 10-fold cross-validation. A data scientist uses this template to understand how different estimators perform for different types of input data and, ultimately, to derive a pipeline instance that leads to high scores.

Analyzing the provenance of the runs, we can see that gradient boosting leads to low scores for two of the datasets (Iris and Digits), but it has a high score for Images. By contrast, decision trees work well for both the Iris and Digits datasets, and logistic regression leads to a high score for Iris. This may suggest that there is a problem with the gradient boosting module for some parameters, that decision trees provide a suitable compromise for different data, and that logistic regression is good for the Iris data. Because each run used different parameters for each method depending on the dataset, a definitive conclusion has to await additional testing of these hyperparameters. Doing so manually is time-consuming and error-prone, while BugDoc automates this process.
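To make the pipeline template concrete, the sketch below shows one way such a template could be scripted. It is an illustration only, assuming scikit-learn and its bundled Iris and Digits datasets; the dataset names, estimator choices, and the 10-fold F-measure computation mirror Figure 1, but the exact code is not part of BugDoc.

```python
# Hypothetical instantiation of the pipeline template in Figure 1,
# assuming scikit-learn. Each (dataset, estimator) choice is one
# pipeline instance; its provenance is the parameter-value list plus
# the resulting score.
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

DATASETS = {"Iris": datasets.load_iris(), "Digits": datasets.load_digits()}
ESTIMATORS = {
    "Gradient Boosting": GradientBoostingClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

def run_instance(dataset_name, estimator_name):
    """Execute one pipeline instance and return its F-measure score."""
    data = DATASETS[dataset_name]
    est = ESTIMATORS[estimator_name]
    # Macro-averaged F-measure with 10-fold cross-validation.
    scores = cross_val_score(est, data.data, data.target,
                             cv=10, scoring="f1_macro")
    return scores.mean()

provenance = []
for d in DATASETS:
    for e in ESTIMATORS:
        score = run_instance(d, e)
        provenance.append({"Dataset": d, "Estimator": e, "Score": score})
```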
Identifying Root Causes of Failures: Challenges.
As the above examples illustrate, there are many potential causes for a given problem. Prior work used provenance to explain errors in computational processes that derive data [18, 46]. However, to test these hypotheses and obtain complete (and accurate) explanations, new pipeline instances must be executed that vary the different components of the pipeline. Trying all possible combinations of parameter-values leads to a combinatorial explosion of instances to execute and therefore can be prohibitively expensive. Thus, a critical challenge lies in the design of a strategy that is provably efficient (often requiring only a linear number of pipeline executions in the number of parameters) for finding root causes. Causes of errors can include multiple parameters, each of which may have large domains. So, it is important to have clear and concise explanations in terms of the parameter values already tried.
Contributions.
In this paper, we introduce BugDoc, a new approach that makes use of iteration and provenance to automatically infer the root causes and derive succinct explanations of failures in pipelines. Our contributions can be summarized as follows:

(1) BugDoc finds root causes autonomously and iteratively, intelligently selecting so-far untested combinations.
(2) We propose debugging algorithms that find root causes using fewer pipeline instances than state-of-the-art methods, avoiding unnecessary costly computations. In fact, BugDoc often finds root causes using only a number of pipeline instances linear in the number of parameters.
(3) The BugDoc system further reduces time by exploiting parallelism.
(4) Finally, BugDoc derives concise explanations to facilitate the tasks of human debuggers.
Outline. The remainder of this paper is organized as follows. We review related work in Section 2. Section 3 introduces the model we use for computational pipelines and formally defines the problem we address. In Section 4, we present algorithms to search for simple and complex causes of failures. We compare BugDoc with the state of the art in Section 5 and conclude in Section 6, where we outline directions for future work.
Debugging Data and Pipelines.
Recently, the problem of explaining query results and interesting features in data has received substantial attention in the literature [4, 14, 18, 39, 46]. Some have focused on explaining where and how errors occur in the data generation process [46] and which data items are most likely to be causes of relational query outputs [39, 47]. Others have attempted to use data to explain salient features in data (e.g., outliers) by discovering relationships among attribute values [4, 14, 18]. In contrast, BugDoc aims to diagnose abnormal behavior in computational pipelines that may be due to errors in data, programs, or sequencing of operations.

Previous work on pipeline debugging has focused on analyzing execution histories to identify problematic parameter settings or inputs, but such work does not iteratively infer and test new workflow instances. Bala and Chana [5] applied several machine learning algorithms to predict whether a particular pipeline instance will fail to execute in a cloud environment. The goal is to reduce the consumption of expensive resources by recommending against executing the instance if it has a high probability of failure. The system does not attempt to find the root causes of such failures. Chen et al. [12] developed a system that identifies problems by finding the differences between provenance (encoded as trees) of good and bad runs. However, in general, these differences do not necessarily identify root causes, though they often contain them.

Some systems have been developed to debug specific applications. Viska [24] helps users identify the underlying causes of performance differences for a set of configurations. Users infer hypotheses by exploring performance data and then test these hypotheses by asking questions about the causal relationships between a set of selected features and the resulting performance. Thus, Viska can be used to validate hypotheses but not to identify root causes. Molly [1] combines the analysis of lineage with SAT solvers to find bugs in fault-tolerance protocols for distributed systems. It simulates failures, such as permanent crash failures, message loss, and temporary network partitions, in order to test fault-tolerance protocols over a specified period.

Although not designed for computational pipelines, Data X-Ray [46] provides a mechanism for explaining the systematic causes of errors in the data generation process. The system finds shared features among corrupt data elements and produces a diagnosis of the problems. Given the provenance of pipeline instances together with error annotations, Data X-Ray derives explanations consisting of features that describe the parameter-value pairs responsible for the errors. Explanation Tables [18] provides explanations for binary outcomes. Like Data X-Ray, it forms hypotheses based on a log of executions, but it does not propose new ones. Based on a table with a set of categorical columns (attributes) and one binary column (outcome), the algorithm produces interpretable explanations of the causes for the outcome in terms of combinations of attribute-value pairs. The explanations consist of a disjunction of patterns, and each pattern is a conjunction of attribute-value pairs. As discussed in Section 5, BugDoc produces explanations that are similar to those of Data X-Ray and Explanation Tables, but they are also minimal and able to express inequalities and negations. Furthermore, BugDoc employs a systematic method to intelligently generate new instances that enable it to derive concise explanations that are root causes for a problem.
Hyperparameter Tuning. Our work is related algorithmically to approaches from hyperparameter tuning [6, 8, 17, 42, 43], since we can view the generation of new pipeline instances for debugging as an exploration of the space of its hyperparameters. Bayesian optimization methods are considered the state of the art for the hyperparameter optimization problem [7, 8, 17, 42, 43]. These methods approximate a probability model of the performance outcome given a parameter configuration, which is updated from a history of executions. Gaussian Processes and Tree-structured Parzen Estimators are examples of probability models [6] used to optimize an unknown loss function using the expected improvement criterion as the acquisition function. To do this, they assume the search space is smooth and differentiable. This assumption, however, does not hold in general for arbitrary computational pipelines. Moreover, our goal is not to identify bad configurations (we usually have those to begin with), but to identify the root cause(s), which are due to a subset of the parameters. Optimization, by contrast, seeks entire (in their case, good) configurations.

Examples of hyperparameter tuning techniques include OtterTune and BOAT. OtterTune [45] is a system that uses supervised learning techniques to find optimal settings of database system administrator knobs given a database workload and a set of metrics (optimization functions). BOAT [16] also optimizes database system configurations using Bayesian Optimization. However, instead of starting the optimization with a standard Gaussian process, it allows a user to input an initial probabilistic model that exploits previous knowledge of the problem.
Software Testing.
State-of-the-art techniques for software testing [21, 30], statistical debugging [35, 51], and bug localization [2, 3, 25] are often application-specific and/or require a user-defined test suite. Some approaches require the instrumentation of binaries or source code in the form of predicates that can be observed during computational runs [35, 51]. Such information, if available, can be helpful to localize and explain bugs. BugDoc, however, does not assume any knowledge of the internal code of the computational processes: it was designed to debug black-box pipelines where we can observe only the inputs and outputs. Hence, our explanations are expressed in terms of input parameters. However, an interesting direction for future work would be to consider variables (or predicates) that can be observed but not manipulated in our formalism, to generate potentially richer explanations. Approaches have also been proposed for bug localization in a black-box scenario; however, these were designed for specific applications and environments, e.g., Pinpoint for J2EE [13]. By contrast, BugDoc was designed to support language-independent workflows.

Automated test generation techniques also derive new tests (or instances, in our terminology). However, they do not aim to identify root causes (see, e.g., [19, 22, 27]). One exception is Causal Testing [30]. Similar to BugDoc, Causal Testing aims to help users identify root causes for problems. However, it requires the user to specify a (single) suspect variable to be investigated in a white-box scenario, while BugDoc searches for potential causes of failures in a black-box scenario. Further, these causes may include multiple variables and value assignments.

BugDoc helps a user trace back the potential cause of a given behavior to a component of a pipeline. Nevertheless, since a pipeline can orchestrate a multitude of sophisticated tools, it may be necessary to drill down into an individual component to identify and correct the bug. If source code is available for that component, traditional debugging techniques can be used.
Identifying Denial Constraints.
Our approach is also related to the discovery of denial constraints in relational tables [9, 15], particularly functional dependencies. The similarity can be illustrated as follows: imagine that there is a column indicating "successful instance" or "failed instance" for some set of parameter-values. Call it SuccessOrFail. If a failure occurs exactly when parameter A = 5 and B = 6, then that would manifest as a functional dependency AB → SuccessOrFail, i.e., the result is a function of parameters A and B. However, if the failure happens when a disjunction holds, e.g., A = 5 or B = 6, the same functional dependency would be inferred. No more-minimal functional dependencies such as A → SuccessOrFail would be inferred because, for example, when A = 4, there can be success or failure depending on the B value. Thus, functional dependencies are not expressive enough to characterize root causes.

Intuitively, given a set of computational pipeline instances, some of which lead to bad or questionable results, our goal is to find the root causes of failures, possibly by creating and executing new pipeline instances.

Definition 1 (Pipeline, instance, parameter-value pairs, value universe, results). A computational pipeline (or workflow) CP is a collection of programs connected together that contains a set of manipulable parameters P (including hyperparameters, input data, versions of programs, and computational modules). We denote by CP_i a pipeline instance of CP that defines values for the parameters for a particular run of CP. Thus, an instance CP_i is associated with a list of parameter-value pairs Pv_i containing an assignment (p, v) for each p ∈ P. We denote by CP_i[p] = v the assignment of value v to parameter p in the instance CP_i. For each parameter p ∈ P, the parameter-value universe U_p is the set of all values assigned to p by any pipeline instance thus far, i.e., U_p = { v | ∃i : (p, v) ∈ Pv_i }. The universe is U = { (p, U_p) | p ∈ P }. As we discuss in Section 4, the initial parameter-value universe U can be expanded by explicitly defining the parameter domains (e.g., a parameter can take integer values between 1 and 10).

Definition 2 (Evaluation). Let E be a procedure that evaluates the result of an instance such that E(CP_i) = succeed if the results are acceptable, and E(CP_i) = fail otherwise. Normally, the evaluation procedure will be code that looks at some property of the result of a given pipeline instance.

Thus a bug, for the purposes of this paper, is a collection of pipelines that, when executed, evaluate to fail. Note that this is a deterministic definition that does not capture intermittent failures, e.g., timing bugs or non-deterministic failures. Even in such cases, however, if the bugs occur often enough, then BugDoc may help, though without guarantee.
Definition 3 (Hypothetical root cause of failure). Given a set of instances G = CP_1, ..., CP_k and associated evaluations E(CP_1), ..., E(CP_k), a hypothetical root cause of failure is a set C_f consisting of a Boolean conjunction of parameter-comparator-value triples (e.g., a triple may be of the form A > 10) which obeys the following conditions among the instances in G: (i) there is at least one CP_i such that Pv_i satisfies C_f and E(CP_i) = fail; and (ii) if E(CP_i) = succeed, then the parameter-value pairs Pv_i of CP_i do not satisfy the conjunction C_f.

Example. To illustrate the converse of point (ii), if C_f = (A > 10 ∧ B = 7), and some CP_i has the parameter values A = 15 and B = 7 but E(CP_i) = succeed, then C_f does not obey condition (ii) of a hypothetical root cause of failure.

C_f is called hypothetical because, based on the evidence so far, C_f leads to fail, but further evidence may refute that hypothesis. We should note that the root causes defined here should not be interpreted as the actual causes of pipeline problems as characterized by causality theory [40]. The goal of BugDoc is to help the user identify sets of parameter-value pairs for which a black-box pipeline will always fail. However, the root causes we output are not counterfactuals [34], i.e., the pipeline would not necessarily succeed had the root cause not been observed, because perhaps another root cause may come into play. We simply want to determine the following implication definitively: root-cause ⟹ fail for a single root cause. BugDoc can, however, also discover disjunctive combinations of configurations that lead to failure.

Definition 4 (Definitive root cause of failure). A hypothetical root cause of failure D is a definitive root cause of failure if there is no instance CP_q from the universe U with the property that E(CP_q) = succeed and Pv_q satisfies D. Informally, no pipeline instance that includes D as a subset of its parameter-value settings leads to succeed.

Definition 5 (Minimal Definitive Root Cause). A definitive root cause D is minimal if no proper subset of D is a definitive root cause.

The example in Figure 1 illustrates these concepts using the simple machine learning pipeline from the introduction. A possible evaluation procedure would test whether the resulting score is greater than 0.6. In this case, Data being different from Images and Estimator equal to gradient boosting is a hypothetical root cause of failure. Section 4 presents algorithms that determine whether this root cause is definitive and minimal.
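These definitions translate directly into a few lines of code. The sketch below is a minimal, hypothetical rendering (not BugDoc's actual implementation): an instance is a parameter-to-value mapping, an evaluation maps an instance to succeed or fail, and a candidate root cause is a set of (parameter, comparator, value) triples checked against the log of evaluated instances.

```python
import operator

# Comparators allowed in parameter-comparator-value triples.
OPS = {"=": operator.eq, "!=": operator.ne,
       "<": operator.lt, "<=": operator.le,
       ">": operator.gt, ">=": operator.ge}

def satisfies(instance, cause):
    """True if the instance's parameter-values satisfy every triple
    in the conjunction `cause` (a set of (param, comparator, value))."""
    return all(OPS[cmp](instance[p], v) for (p, cmp, v) in cause)

def is_hypothetical_root_cause(cause, log):
    """`log` is a list of (instance, evaluation) pairs, where
    evaluation is 'succeed' or 'fail' (Definitions 2 and 3)."""
    some_failure = any(ev == "fail" and satisfies(inst, cause)
                       for inst, ev in log)
    no_satisfying_success = all(not satisfies(inst, cause)
                                for inst, ev in log if ev == "succeed")
    return some_failure and no_satisfying_success

# Example: the hypothetical root cause from Figure 1.
log = [({"Data": "Iris", "Estimator": "gradient boosting"}, "fail"),
       ({"Data": "Images", "Estimator": "gradient boosting"}, "succeed")]
cause = {("Data", "!=", "Images"), ("Estimator", "=", "gradient boosting")}
print(is_hypothetical_root_cause(cause, log))  # True
```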
Problem Definition.
Given a computational pipeline CP (e.g., a query, script, or simulation) and a set of parameter-value pairs associated with previously-run instances G = CP_1, ..., CP_k, we consider two goals: (i) to find at least one minimal definitive root cause or (ii) to find all minimal definitive root causes. Our cost measure for both goals is the number of executed pipeline instances beyond any given, previously run, instances.

Given a set of pipeline instances, BugDoc identifies minimal definitive root causes for failures. As noted above, a naive strategy would be to try every possible parameter-value pair combination of the parameter-value universe, requiring the testing of a number of pipeline instances that is exponential in the number of parameters. Instead, BugDoc uses heuristics that turn out to be quite effective at finding promising configurations.
BugDoc uses two iterative debugging algorithms in turn. The first, called Shortcut, discovers definitive root causes (which we sometimes abbreviate to, simply, bugs) consisting of a single conjunction of parameter-value (formally, parameter-equality-value) pairs. The second, called Debugging Decision Trees and introduced in [36], discovers more complex definitive root causes involving inequalities (e.g., A takes a value between 5 and 13).

Because the results of the Debugging Decision Trees algorithm consist of disjunctions of conjunctions, they may contain redundancies, which we simplify using the Quine-McCluskey algorithm [28]. The goal is to create concise explanations, making it easy for users to understand and act on them.
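For intuition on the simplification step, the snippet below shows one way to reduce a redundant disjunction of conjunctions, assuming SymPy (whose simplify_logic routine provides Quine-McCluskey-style minimization); this is an illustration, not BugDoc's code.

```python
# Simplifying a redundant disjunction of conjunctions, assuming SymPy.
# Here a, b encode parameter-value predicates, e.g., a = "(A, =, 5)".
from sympy import symbols
from sympy.logic import simplify_logic

a, b = symbols("a b")
# (a AND b) OR (a AND NOT b) is redundant: it reduces to a.
explanation = (a & b) | (a & ~b)
print(simplify_logic(explanation, form="dnf"))  # a
```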
The Shortcut algorithm, shown in Algorithm 1, starts from a pipeline instance CP_f that evaluates to fail. It then uses pipeline instances that succeeded and are disjoint from CP_f, i.e., that share no parameter-values with it, to construct new tests.

Definition 6 (Disjoint Instances). Two pipeline instances CP_x and CP_y are disjoint if CP_x[p] ≠ CP_y[p], ∀p ∈ P associated with CP.

Intuitively, the Shortcut algorithm starts with the failing pipeline instance CP_f and a disjoint successful instance CP_g. The existence of such a disjoint succeeding pipeline instance is a requirement for the theoretical results that follow and is called the Disjointness Condition. If the Disjointness Condition does not hold, then this method may still be useful as a heuristic.

The current instance CP_current is initialized to CP_f. Then, using some order among parameters, for each parameter p, an instance CP_current′ is executed that consists of a copy of CP_current except that CP_current′[p] = CP_g[p]. If the instance CP_current′ fails, then CP_current is changed to CP_current′ and the next parameter is considered. The intuition is that the value of p in CP_f did not cause the failure. In the end, the minimal definitive root cause asserted by Shortcut will be the subset of the pipeline instance CP_f that is still present in the final instance CP_current. We denote that subset as D.

The algorithm then performs a sanity check to see whether any superset of the hypothetical minimal root cause D appears in an already executed successful execution. If so, then the Shortcut algorithm has found a proper subset of the minimal definitive root cause, but not an actual minimal definitive root cause.

As noted above, if the Disjointness Condition does not hold, then the Shortcut algorithm can still be used as a heuristic: take an instance that differs in as many parameter-values as possible. While the theoretical results that follow will not hold, this will often be good enough, as the experimental results show (Section 5). Here is an example that illustrates how the Shortcut algorithm works.

Example 1. Consider the machine learning pipeline in Figure 1 again. Here, the user is interested in investigating pipelines that lead to low F-measure scores and defines an evaluation function that returns succeed if score ≥ 0.6 and fail otherwise. For this pipeline, the user investigates three parameters: Dataset, the input data to be classified; Estimator, the classification algorithm to be executed; and Library Version, which indicates the version of the machine learning library used. Table 1 shows examples of three executions of the pipeline.
Table 1: An initial (given) set of classification pipeline instances.

Dataset | Estimator         | Library Version | Score | Evaluation (score ≥ 0.6)
…       | …                 | …               | …     | succeed
Digits  | Decision Tree     | 1.0             | 0.8   | succeed
Iris    | Gradient Boosting | 2.0             | 0.2   | fail
In the initial traces shown in Table 1, there are only two disjoint instances with different evaluations:

CP_g = {(Dataset, Digits), (Estimator, Decision Tree), (Library Version, 1.0)}
CP_f = {(Dataset, Iris), (Estimator, Gradient Boosting), (Library Version, 2.0)}

Examining parameter Dataset, we replace its corresponding value in the current instance to be executed from Iris to Digits. Because the execution evaluates to fail, we keep this replacement in the current instance.
Table 2: Set of classification pipeline instances, including the new instances (marked with *) created by Shortcut by substituting values of parameters in CP_f with the corresponding values from CP_g.

Dataset  | Estimator         | Library Version | Score | Evaluation (score ≥ 0.6)
…        | …                 | …               | …     | succeed
Digits   | Decision Tree     | 1.0             | 0.8   | succeed
Iris     | Gradient Boosting | 2.0             | 0.2   | fail
Digits * | Gradient Boosting | 2.0             | 0.2   | fail
Digits * | Decision Tree     | 2.0             | 0.3   | fail
Digits * | Decision Tree     | 1.0             | 0.8   | succeed

Similarly, when we update the value of parameter
Estimator to Decision Tree, the instance evaluation is still fail, so we keep that replacement as well. However, when Library Version is changed to 1.0, the resulting configuration evaluates to succeed. This suggests that (Library Version, 2.0) belongs to the root cause, so this replacement is not kept. Table 2 shows the instances created by the Shortcut algorithm. For pipelines with root causes similar to the ones in Example 1, the algorithm will find a minimal definitive root cause.
Algorithm 1: Shortcut Algorithm

Input: CPI, the set of pipeline instances in the execution history, characterized by their parameter-values
Input: E, the evaluation function
Input: P, list of parameters
Input: CP_f, pipeline instance evaluated as fail
Input: CP_g, pipeline instance evaluated as succeed, disjoint from CP_f
Output: D, asserted minimal definitive root cause

  /* Initialization */
  CP_current ← CP_f
  for p ∈ P do
      CP_current′ ← CP_current
      CP_current′[p] ← CP_g[p]
      if E(CP_current′) = fail then
          CP_current ← CP_current′
      end
  end
  D ← CP_current ∩ CP_f
  for CP_i ∈ CPI do
      if D ⊆ CP_i and E(CP_i) = succeed then
          return ∅
      end
  end
  return D
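A direct transcription of Algorithm 1 into Python follows. It is a minimal sketch that assumes instances are dictionaries and that evaluate(instance) runs the pipeline and returns 'succeed' or 'fail'; the function and variable names are illustrative, not BugDoc's actual API.

```python
def shortcut(cpi, evaluate, params, cp_f, cp_g):
    """Algorithm 1: assert a minimal definitive root cause.

    cpi      -- execution history: list of (instance, evaluation) pairs
    evaluate -- runs a new instance, returning 'succeed' or 'fail'
    params   -- list of parameter names
    cp_f     -- a failing instance (dict), disjoint from cp_g
    cp_g     -- a succeeding instance (dict)
    """
    current = dict(cp_f)
    for p in params:
        candidate = dict(current)
        candidate[p] = cp_g[p]           # borrow p's value from the good run
        if evaluate(candidate) == "fail":
            current = candidate          # p's value in cp_f did not cause failure
    # D is the part of cp_f still present in the final instance.
    d = {p: v for p, v in cp_f.items() if current[p] == v}
    # Sanity check: a past successful instance containing D refutes it.
    for inst, evaluation in cpi:
        if evaluation == "succeed" and all(inst.get(p) == v
                                           for p, v in d.items()):
            return {}
    return d
```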
Theorem 1. If all definitive root causes are singleton parameter-values and the Disjointness Condition holds, then the Shortcut algorithm will always assert exactly a minimal definitive root cause.

Proof. By construction. If all definitive root causes are singletons, then CP_g cannot contain any element of a root cause; otherwise E(CP_g) = fail. By contrast, CP_f must contain at least one root cause. When iterating over parameter p, the Shortcut algorithm will replace CP_f[p] by CP_g[p] (because the values must be different on all parameters p by the Disjointness Condition) while there is still one root cause in CP_current. Therefore, by the end of the algorithm, only the root cause would remain. □

Guarantees of the Shortcut Algorithm.
The Shortcut algorithm may be too aggressive in the sense that it can return a root cause D that is a proper subset of an actual minimal definitive root cause of failure.

Example 2. Suppose that we have two minimal definitive root causes:
(1) D_1 = {(p_1, v_1), (p_2, v_2)}
(2) D_2 = {(p_1, v_1′), (p_2, v_2)}

Consider also a computational pipeline consisting of three parameters P = {p_1, p_2, p_3}, and CP_f and CP_g as follows:
• CP_f = {(p_1, v_1), (p_2, v_2), (p_3, v_3)}
• CP_g = {(p_1, v_1′), (p_2, v_2′), (p_3, v_3′)}

Clearly D_1 ⊆ CP_f; therefore it is the root cause of the failure of CP_f. However, when iterating over parameter p_1, the Shortcut algorithm updates CP_current[p_1] = v_1′. But E(CP_current′) = fail because D_2 ⊆ CP_current′. The same is observed when the algorithm iterates over parameter p_3. Consequently, the algorithm outputs D = {(p_2, v_2)} as the root cause, but that is a proper subset of the minimal definitive root cause D_1.

In this case, we say that D is a truncated assertion, i.e., it is too short. Note, however, that D will never be too long.

Theorem 2.
The Shortcut algorithm never asserts a superset of a minimal definitive root cause, provided the Disjointness Condition holds.

Proof. By contradiction. Assume that there exists (p, v) ∈ D such that (p, v) is not a necessary condition for an instance to fail. By the construction at the beginning of the Shortcut algorithm, if (p, v) ∈ D, then CP_f[p] = v and CP_g[p] ≠ v by the Disjointness Condition. When the Shortcut algorithm iterates over parameter p, we observe CP_current[p] = CP_f[p] and CP_current′[p] = CP_g[p]. Hence, since (p, v) is not needed for an instance to fail, at this iteration E(CP_current′) = fail, so (p, v) would be removed from CP_current and therefore would never be asserted to be part of the root cause. Contradiction. □

To address the problem of truncated assertions, let us first observe another case in which they do not arise, beyond the singleton case of Theorem 1.

Example 3.
Consider a slight modification of Example 2, where we add another parameter-value pair to D_2, defining the following scenario:
• D_1 = {(p_1, v_1), (p_2, v_2)}
• D_2 = {(p_1, v_1′), (p_2, v_2′′), (p_3, v_3)}
• CP_f = {(p_1, v_1), (p_2, v_2), (p_3, v_3)}
• CP_g = {(p_1, v_1′), (p_2, v_2′), (p_3, v_3′)}

When iterating over parameter p_1, the Shortcut algorithm does not update CP_current[p_1] = v_1′, since E(CP_current′) = succeed because D_1 ⊈ CP_current′ and D_2 ⊈ CP_current′. Similarly, the value of CP_current[p_2] is not changed. Only CP_current[p_3] is updated, to v_3′. Thereafter, the algorithm asserts D = {(p_1, v_1), (p_2, v_2)} = D_1 as the minimal definitive root cause, which is correct.

In Example 3, both D_1 and D_2 contain values for p_1 and p_2 that are distinct from their counterparts in the other definitive root cause, i.e., D_1[p_1] ≠ D_2[p_1] and D_1[p_2] ≠ D_2[p_2]. We say that D_1 and D_2 are sufficiently different. This characteristic directly influences when the Shortcut algorithm will yield truncated assertions and is formally defined as follows.

Definition 7 (Sufficiently different instances). Two definitive root causes D_x and D_y are sufficiently different if (i) they share at least two properties and (ii) for all properties they have in common, they differ in their values. Formally:
(i) |P_{D_x} ∩ P_{D_y}| ≥ 2;
(ii) D_x[p] ≠ D_y[p], ∀p ∈ P_{D_x} ∩ P_{D_y}.

Theorem 3.
If the Disjointness Condition holds and all minimal definitive root causes are pairwise sufficiently different, then the Shortcut algorithm will never produce a truncated assertion.

Proof. By contradiction. Suppose there are two sufficiently different minimal definitive root causes D_x and D_y, such that D_x ⊆ CP_f, CP_current is initialized to CP_f, and at some point the Shortcut algorithm creates an instance CP_current such that D_y ⊆ CP_current. We will show that this cannot happen. Consider the first parameter p ∈ P_{D_x} for which CP_current′[p] = CP_g[p] would have to hold with E(CP_current′) = fail. Now, D_x ⊈ CP_current′ because D_x and D_y differ on at least two properties. In addition, D_y ⊈ CP_current′, since CP_current′[p] is taken from a parameter in P_{D_x}. Therefore, E(CP_current′) = succeed because of the pairwise sufficient difference condition, so CP_current[p] will not change its value. Thus, D_y ⊆ CP_current will never occur. □

Stacked Shortcut Algorithm.
Clearly, we cannot be sure a priori that all definitive root causes are single parameter-value pairs or that the minimal definitive root causes are sufficiently different, either of which would ensure that Shortcut makes no truncated assertions. However, even if neither holds, we may be able to avoid truncated assertions by a specific reapplication of Shortcut.

To see how, we first observe that Shortcut makes truncated assertions only if all elements of a minimal root cause are contained in the union of CP_f and CP_g. This union property is formally described in Theorem 4.

Theorem 4. The Shortcut algorithm will yield a truncated assertion for a given CP_f and CP_g only if there is a minimal definitive root cause D such that D ⊆ CP_f ∪ CP_g and D ⊈ CP_f.

Proof. In the course of the Shortcut algorithm, all property values in CP_current come from CP_f or CP_g. By construction, the asserted root cause is the intersection of CP_f and CP_current. So if the asserted root cause is truncated, CP_current must have elements from CP_g that cause CP_current to evaluate to fail. Therefore, there is a minimal definitive root cause in the union of CP_f and CP_g. □

Based on the previous theorems, we extended the Shortcut algorithm to the
Stacked Shortcut algorithm, which basically runs a given failed configuration CP_f individually against multiple disjoint good configurations and then takes the union of the inferred root causes. Algorithm 2 shows the algorithm's pseudo-code. Stacked Shortcut is guaranteed to produce a correct solution if BugDoc can find k mutually disjoint successful instances and there are at most k distinct minimal root causes.

Recall that two instances CP_1 and CP_2 are disjoint if they have different values for all properties; that is, ∀p, CP_1[p] ≠ CP_2[p]. A set of instances is mutually disjoint if every pair of instances is disjoint.

Algorithm 2: Stacked Shortcut Algorithm

Input: CPI, the set of pipeline instances in the execution history, characterized by their parameter-values
Input: E, the evaluation function
Input: P, list of parameters
Output: D, asserted minimal definitive root cause

  /* Initialization */
  D ← ∅
  /* Find an instance that evaluates to fail */
  Let CP_f be such that CP_f ∈ CPI and E(CP_f) = fail
  /* Find k successful instances disjoint with respect to CP_f, and mutually disjoint if possible */
  CPG ← {CP_1, CP_2, ..., CP_k}, such that the CP_i, for i ∈ {1, 2, ..., k}, are mutually disjoint and E(CP_i) = succeed
  for CP_g ∈ CPG do
      D ← D ∪ Shortcut(CPI, E, P, CP_f, CP_g)
  end
  return D
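The corresponding Python sketch is short, reusing the shortcut function from above; picking the mutually disjoint good instances greedily is one plausible strategy for this sketch, not necessarily the one BugDoc uses.

```python
def stacked_shortcut(cpi, evaluate, params):
    """Algorithm 2: union the root causes inferred from one failing
    instance paired with several disjoint succeeding instances.
    cpi is the execution history as (instance, evaluation) pairs."""
    cp_f = next(inst for inst, ev in cpi if ev == "fail")
    # Greedily collect succeeding instances disjoint from cp_f
    # and from each other (a heuristic choice for this sketch).
    chosen = []
    for inst, ev in cpi:
        if ev != "succeed":
            continue
        if all(inst[p] != other[p] for other in [cp_f] + chosen
               for p in params):
            chosen.append(inst)
    d = {}
    for cp_g in chosen:
        d.update(shortcut(cpi, evaluate, params, cp_f, cp_g))
    return d
```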
Theorem 5. If all CP_i such that E(CP_i) = succeed, for i ∈ {1, 2, ..., k}, are mutually disjoint and disjoint from CP_f, and there are fewer than or equal to k distinct minimal definitive root causes, then the Stacked Shortcut algorithm will never make a truncated assertion.

Proof. By construction. For each other minimal definitive root cause D ⊈ CP_f, there can be at most one CP_i with the property that D ⊆ CP_i ∪ CP_f, since all instances are disjoint. Because there are at most k distinct minimal definitive root causes by assumption, there exists at least one CP_i which does not have the union property with respect to CP_f. So, by the construction of D, the Stacked Shortcut algorithm will yield an assertion (candidate root cause) that is not truncated. □
Note that even if all successful instances are not mutually disjoint (perhaps because some parameters have very few values), each additional call to Shortcut (i.e., each call with a different disjoint good instance) reduces the likelihood of yielding a truncated assertion. The reason is that the second-to-last line of the Stacked Shortcut algorithm can only grow the hypothetical root causes.

Finally, note that both Shortcut and Stacked Shortcut are linear in the number of parameters, a very useful property when there are hundreds of parameters having at least two values each.
While the Shortcut and Stacked Shortcut algorithms can find a single minimal definitive root cause very efficiently, usually without truncation (as we will see in the experimental section), characterizing all minimal definitive root causes is challenging. For this purpose, we use an algorithm that is exponential (in the number of parameters) in the worst case, but can characterize inequalities as well as equalities and does well heuristically even with a small budget [36].

The algorithm constructs a debugging decision tree using the parameters of the pipeline as features and the evaluation of the instances as the target. Thus a leaf is either purely succeed, if all pipeline instances so far tested that lead to that leaf evaluate to succeed; purely fail, if all pipeline instances leading to that leaf evaluate to fail; or mixed. The algorithm works as follows (a code sketch appears after this discussion):

(1) Given an initial set of instances CPI, construct a decision tree based on the evaluation results for those instances (succeed or fail). An inner node of the decision tree is a triple (Parameter, Comparator, Value), where the Comparator indicates whether a given Parameter has a value equal to, greater than (or equal to), less than (or equal to), or unequal to Value.
(2) If a conjunction involving a set of parameters, say, P_1, P_2, and P_3, leads to a consistently failing execution (a pure leaf in decision tree terms), then that combination becomes a suspect.
(3) Each suspect is used as a filter in a Cartesian product of the parameter values from which new experiments will be sampled.

Step 3 requires some explanation. Consider an example where all comparators denote equality. Suppose a path in the decision tree consists of P_1 = v_1, P_2 = v_2, and P_3 = v_3. To test that path, all other parameters will be varied. If every instance having the parameter-values P_1 = v_1, P_2 = v_2, and P_3 = v_3 leads to failure, then that conjunction constitutes a definitive root cause of failure. If the path consists of non-equality comparators (e.g., P_1 = v_1, P_2 = v_2, and P_3 > v), we sample concrete values that satisfy the inequalities (e.g., P_3 = 7) and choose pipeline instances having those values (e.g., all pipelines with P_1 = v_1, P_2 = v_2, and P_3 = 7). After each newly executed (e.g., succeed) pipeline instance, the decision tree is rebuilt, taking into account the whole set of executed pipeline instances CPI, and a new suspect path is tried.
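The sketch below illustrates the suspect-extraction idea using scikit-learn's decision tree as a stand-in; it assumes parameters are numerically encoded, the tree is grown without pruning, and pure fail leaves yield suspect paths. The actual Debugging Decision Trees implementation [36] may differ.

```python
# Minimal sketch of extracting suspect paths from an unpruned
# decision tree, assuming scikit-learn and numerically encoded
# parameter values. Each suspect is a conjunction of
# (parameter index, comparator, threshold) triples.
from sklearn.tree import DecisionTreeClassifier

def suspect_paths(X, y):
    """X: instance parameter matrix; y: 1 for fail, 0 for succeed."""
    clf = DecisionTreeClassifier().fit(X, y)   # complete tree, no pruning
    tree = clf.tree_
    suspects = []

    def walk(node, path):
        if tree.children_left[node] == -1:         # leaf node
            counts = tree.value[node][0]
            if counts[0] == 0 and counts[1] > 0:   # pure 'fail' leaf
                suspects.append(list(path))
            return
        p, t = tree.feature[node], tree.threshold[node]
        walk(tree.children_left[node], path + [(p, "<=", t)])
        walk(tree.children_right[node], path + [(p, ">", t)])

    walk(0, [])
    return suspects
```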
Note that BugDoc uses decision trees in an unusual way. We are not trying to predict whether an untested configuration will lead to succeed or fail, but simply use the tree to discover short paths, possibly characterized by inequalities, that lead to fail. Those will be our suspects. For that reason, we build a complete decision tree, i.e., with no pruning.

The most time-consuming aspect of debugging is the execution of pipeline instances. Fortunately, each pipeline instance is independent. Hence, different instances can be run in parallel. However, such an approach may lead to the execution of pipelines that are ultimately unnecessary (e.g., if one pipeline instance shows that A = v is not a definitive root cause, then further tests on A = v may not be useful). If the search space is large, this extra overhead turns out to be small, as we show in Section 5.2.

To evaluate the effectiveness of
BugDoc, we compare it against state-of-the-art methods for deriving explanations as well as for hyperparameter optimization, using both real and synthetic pipelines. We examine different scenarios, including when a single minimal definitive root cause is sought and when a budget for the number of instances that can be run is set. We also evaluate the scalability of BugDoc when multiple cores are available to execute pipeline instances in parallel, and when the number of parameters and values increase.
Baselines.
Because no previous approach both creates new instances and derives explanations, we compare our approach against combinations of state-of-the-art methods. We use Data X-Ray [46] and Explanation Tables [18] to derive explanations. To generate instances for all explanation algorithms, we use both the instances from BugDoc and Sequential Model-Based Algorithm Configuration (SMAC) [29]. SMAC is a method for hyperparameter optimization that is often more effective at searching configuration spaces than grid search [7]. We also ran experiments using random search as an alternative, i.e., randomly generating instances and then analyzing them. However, the results were always worse than those obtained using SMAC or BugDoc. Therefore, for simplicity of presentation and to avoid cluttering the plots, we omit the random search results.

The explanation approaches analyze the provenance of the pipelines, i.e., the instances previously run and their results, but do not suggest new ones. By contrast, SMAC iteratively proposes new pipeline instances, but it always outputs a complete pipeline instance: the best it can find given a budget of instances to run and a criterion. This procedure makes sense for SMAC's primary use case, which is to find a set of parameter-values that performs well, but it is less helpful for debugging because it does not attempt to find a minimal root cause. For example, if a minimal definitive root cause of a pipeline is that parameter P_i must have a value of 5, SMAC will return a pipeline that fails, which has P_i set to 5. But since the pipeline may have many other parameter-values, the user has no way of knowing that P_i = 5 is the root cause, in contrast to BugDoc. Since SMAC looks for good instances, mostly for machine learning pipelines, we change its goal to look for bad pipeline instances.
Evaluation Criteria.
We consider two goals: (i) FindOne – find at least one minimal definitive root cause in each pipeline; (ii) FindAll – find all minimal definitive root causes. The use case for FindOne is a debugging setting where it might be useful to work on one bug at a time, in the hope that resolving one may resolve or at least mitigate others. The use case for FindAll is when a team of debuggers can work on many bugs in parallel. FindAll may also be useful to provide an overview of the set of issues encountered. We use precision and recall to measure quality. These are defined differently for the FindOne case than for the FindAll case. Formally, let
UCP be a set of computational pipelines, where each pipeline CP ∈ UCP (e.g., the pipeline of Figure 1) is associated with a set of minimal definitive root causes R(CP). Given a set of root causes A(CP) asserted by an algorithm A on pipeline CP, for the FindOne case we check whether A(CP) has at least one actual root cause. Precision is then the number of computational pipelines for which at least one minimal definitive root cause is found, divided by the sum of the total number of pipelines where at least one minimal definitive root cause is found and the number of false positives (predicted root causes that are not, in fact, minimal definitive root causes). Formally, the precision for FindOne is:

\[ \frac{\sum_{CP \in UCP} |A(CP) \cap R(CP) \neq \emptyset|}{\sum_{CP \in UCP} \left( |A(CP) \cap R(CP) \neq \emptyset| + |A(CP) - R(CP)| \right)} \]

where |A(CP) ∩ R(CP) ≠ ∅| evaluates to 1 if A(CP) corresponds to at least one of the conjuncts in R(CP). Recall for FindOne is the fraction of the |UCP| pipelines for which a minimal definitive root cause is found by A. The recall for FindOne is thus:

\[ \frac{\sum_{CP \in UCP} |A(CP) \cap R(CP) \neq \emptyset|}{|UCP|} \]

For FindAll, precision is the fraction of root causes that A identifies that are, in fact, minimal definitive root causes. The precision for FindAll is defined as:

\[ \frac{\sum_{CP \in UCP} |A(CP) \cap R(CP)|}{\sum_{CP \in UCP} |A(CP)|} \]

Recall for FindAll is the fraction of all the minimal definitive root causes R(CP), over all CP ∈ UCP, that are found by the algorithm:

\[ \frac{\sum_{CP \in UCP} |A(CP) \cap R(CP)|}{\sum_{CP \in UCP} |R(CP)|} \]

For both FindOne and FindAll, we also report the F-measure, i.e., the harmonic mean of the respective measures of precision and recall:

\[ \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
Our first set of tests allows BugDoc to find at least one minimal definitive root cause using each of its algorithms (Shortcut, Stacked Shortcut, and Debugging Decision Trees). The experiment then grants the same number of instances to all other methods. Thus, the precision and recall for each algorithm are based on the same instance budget. In these tests, Data X-Ray and Explanation Tables are given (i) the instances generated by BugDoc and, in a separate test, (ii) the instances generated by SMAC.
Pipeline Benchmark.
We evaluate our approach using both synthetic and real pipelines. We have created synthetic data that reflect typical pipelines in data science and computational science, which often involve multiple components and associated parameters. The pipelines have between three and fifteen parameters, and each parameter has between five and thirty values. The parameter values are either ordinal (e.g., temperature) or categorical (e.g., color), each with probability 1/2. Each synthetic pipeline consists of a parameter space and a definitive root cause of failure automatically generated as follows (see the generator sketch at the end of this subsection):

(1) We uniformly sample a non-empty subset of parameters to be part of a conjunction.
(2) For each parameter in the subset, we uniformly sample from its values.
(3) For each parameter-value pair, we uniformly sample from the set of comparators C = { =, ≤, >, ≠ }.
(4) After adding a conjunctive root cause, we add another conjunctive root cause with a certain probability.

The example below illustrates the parameter space and the definitive root cause for one of the synthetic pipelines.

Example 4. A pipeline having three parameters with four possible values each could be characterized as follows:
• Parameter Space: p_1 ∈ [w_1, w_2, w_3, w_4], p_2 ∈ [x_1, x_2, x_3, x_4], and p_3 ∈ ["a", "b", "c", "d"].
• Minimal Definitive Root Cause: (p_2 = x_2) or (p_1 < w_3 and p_3 ≠ "b").

We also evaluate the debugging strategies on real-world computational pipelines (see Section 5.3).
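The following sketch mirrors steps (1)-(4); the domain sizes and the probability of adding a second conjunct are illustrative assumptions.

```python
import random

COMPARATORS = ["=", "<=", ">", "!="]

def make_synthetic_root_cause(domains, second_conjunct_prob=0.5):
    """domains: dict mapping parameter name -> list of values.
    Returns a disjunction (list) of conjunctions (lists of triples)."""
    def one_conjunction():
        k = random.randint(1, len(domains))          # step (1)
        params = random.sample(list(domains), k)
        return [(p, random.choice(COMPARATORS),      # steps (2)-(3)
                 random.choice(domains[p])) for p in params]
    cause = [one_conjunction()]
    if random.random() < second_conjunct_prob:       # step (4)
        cause.append(one_conjunction())
    return cause

# Illustrative domains: three parameters, four values each (cf. Example 4).
domains = {"p1": [0.1, 0.2, 0.3, 0.4],
           "p2": [1, 2, 3, 4],
           "p3": ["a", "b", "c", "d"]}
print(make_synthetic_root_cause(domains))
```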
Implementation and Experimental Setup.
The current prototype of BugDoc contains a dispatching component that runs in a single thread and spawns multiple pipeline instances in parallel. In our experiments, we used five execution-engine workers to run the instances. We used the SMAC version for Python 3.6. We also used the code, implemented by the respective authors, for both the Data X-Ray algorithm (implemented in Java 7) [46] and Explanation Tables [18] (written in Python 2.7). Since Data X-Ray does not generate new tests, we use the pipeline instances created by BugDoc as input to the feature model of Data X-Ray. Separately, we converted the pipeline instances created by SMAC into input to the feature model of Data X-Ray. Similarly, we used the pipeline instances generated by both BugDoc and SMAC to populate the database schema required by Explanation Tables. All experiments were run on a Linux desktop (Ubuntu 14.04, 32GB RAM, 3.5GHz CPU).

The results for the synthetically generated pipelines are reported according to the characteristics of their definitive root causes. The characteristics span three scenarios, consisting of multiple pipelines and covering different lengths of definitive root causes:

(1) a single parameter-comparator-value triple;
(2) a single conjunction of parameter-comparator-value triples; and
(3) a disjunction of conjunctions of parameter-comparator-value triples.

These scenarios are useful to assess the generality and expressiveness of the different approaches to explanation.
Precision, Recall, and F-measure.
Figure 2 shows the precision, recall, and F-measure for the FindOne problem for the three types of definitive root causes. On the horizontal axis of each plot, we group all debugging methods by the maximum number of instances they were allowed to use to derive explanations, i.e., the number of new instances it took Shortcut, Stacked Shortcut with four shortcuts, and Debugging Decision Trees to solve the problem.

BugDoc's algorithms outperform Data X-Ray and Explanation Tables in all three scenarios, both when the baselines use instances generated by BugDoc and by SMAC. If the definitive root cause is a single parameter-comparator-value triple (Figures 2a, 2b, and 2c), Shortcut and Stacked Shortcut achieve precision and recall similar to Debugging Decision Trees, which dominates in the other scenarios.

Since we look for individual parameter-comparator-value triples with Shortcut and for disjoint patterns in the data with decision trees, the likelihood that Shortcut does not find a definitive answer is higher in the scenario where a definitive root cause is a conjunction of factors, as can be seen in the relatively lower recall in Figure 2e. Conjunctions that are composed of equalities and inequalities have a high probability of presenting configurations with the union property. Hence, the Shortcut and Stacked Shortcut algorithms generate more truncated assertions, and their precision score is lower in Figure 2d as compared to Figures 2a and 2g. However, the shortcut algorithms still give better performance than the state-of-the-art algorithms.

Also note that in most cases, the state-of-the-art methods using instances generated by BugDoc outperform those methods using the SMAC instances. This suggests that our approach effectively proposes more useful test cases.

Similar relative results hold for the FindAll problem, as Figure 3 shows, although we observe the expected decrease in recall in Figure 3b, since a single root cause is no longer sufficient. The non-minimal approach of Data X-Ray pays off in this scenario with multiple reasons for a pipeline to fail. However, Debugging Decision Trees presents a better trade-off between precision and recall (Figure 3c).
Figure 2: Synthetic Pipelines. Metrics for the FindOne problem when the root cause is a single parameter-comparator-value triple (top row: (a) Precision, (b) Recall, (c) F-measure), a single conjunction (middle row: (d) Precision, (e) Recall, (f) F-measure), or a disjunction of conjunctions (bottom row: (g) Precision, (h) Recall, (i) F-measure). In each figure, the leftmost group uses as many instances as does Shortcut, the middle as many as Stacked Shortcut, and the rightmost as many as Debugging Decision Trees.

Discussion. The answers provided by Explanation Tables represent a prediction of the pipeline instance evaluation result, expressed as a real number. Because the state-of-the-art methods perform better with the instances generated by BugDoc than when using the instances generated by SMAC, we omit the SMAC configurations from the case studies with real-world pipelines presented later in this section.

The takeaway message from the experiments is that BugDoc dominates the other methods based on F-measure in every case, with Debugging Decision Trees dominating the shortcut methods unless the budget is small.
Conciseness of Explanation.
Figure 4 shows that BugDoc's algorithms not only provide explanations that are more concise in the number of parameters than Data X-Ray and Explanation Tables (Figure 4a), but also do not assert more root causes than there are (Figure 4b).
Figure 3: Synthetic Pipelines. Metrics for the FindAll problem when the root cause is a disjunction of conjunctions: (a) Precision, (b) Recall, (c) F-measure. In each sub-figure, the leftmost group uses as many instances as does Shortcut, the middle group as many as Stacked Shortcut, and the rightmost as many as Debugging Decision Trees.

Figure 4: Synthetic Pipelines. (a) Average number of parameters per asserted root cause for each algorithm and (b) average logarithmic number of asserted root causes per actual definitive root cause for each method.

The primary computational cost for all algorithms we consider is the cost of running the pipeline instances. Figure 5 shows the number of instances created by each of BugDoc's algorithms as a function of the number of parameters of the pipeline. Shortcut and Stacked Shortcut increase linearly, as expected. Because the time performance of Debugging Decision Trees has no simple relationship with root causes and could be exponential in the number of parameters, the user should choose Shortcut or Stacked Shortcut if there are many parameters and instances are expensive to run.

Figure 5: Instances required to execute each algorithm as a function of the number of parameters.

As noted above, the pipeline instances to test can be run in parallel, but at some risk of unnecessary computation. To evaluate scalability, we re-executed the experiment with synthetic data, described in Section 5.1, with different numbers of parallel computational cores and checked how many instances each core processed. As Figure 6 shows, the scale-up is essentially linear with the number of cores for the Debugging Decision Trees algorithm solving the FindAll problem. Thus, given sufficient computing power, even Debugging Decision Trees can explore relatively large parameter spaces.

Figure 6: Scalability of BugDoc when running the Debugging Decision Trees algorithm on multiple cores.
Data Polygamy Framework.
Data Polygamy aims to discover statistically significant relationships in a large number of spatio-temporal datasets [14]. We created a VisTrails [20] pipeline that reproduces an experiment designed by the Data Polygamy authors to evaluate the p-value and false discovery rate for their approach under different scenarios. Specifically, the pipeline evaluates different methods for determining statistical significance. The datasets used are synthetically generated, and their features are given as input parameters for the experiment. This process is a good use case for our approach because it has the following properties:

• The experiment requires a complex pipeline, including steps for data cleaning, data transformation, feature identification, multiple-hypothesis testing, and other activities.
• The input data is heterogeneous – over 300 datasets at different spatio-temporal resolutions.
• The parameter space is large, consisting of 2 boolean, 3 categorical (3 to 10 possible values), and 7 numerical parameters. Each instance takes 20 minutes to run, making manual debugging impractical.

For this experiment, we selected different data types and steps of the computational pipeline. Each parameter can conceivably take on any value belonging to its type (e.g., Integer or Boolean). Given a set of pipeline instances, some of which crash and some of which execute to completion, we want to find at least one minimal set of parameter-values or combinations of parameter-values, among those in the given pipeline instances, that cause the execution to crash.
GAN Training.
Generative adversarial networks (GANs) [23] are widely applied to image generation and semi-supervised learning [41, 50]. Training these generative models involves an expensive computational process with several configuration parameters, such as architecture choices, and a high-dimensional hyperparameter space to tune. Sequential model-based approaches like Bayesian Optimization are prohibitively expensive in practice, since a single configuration could take more than a week to train. The most extensive study on the pathology of GAN training [10] entailed modifying baseline architectures and setting hyperparameters manually over three months, using hundreds of cores of a Google TPUv3 Pod [31]. Lucic et al. [37] evaluated seven different GAN architectures and their hyperparameter configurations, performing a random search in an experimental setting that would take approximately 6.85 years using a single NVIDIA P100.

We created a computational pipeline that trains a modified SAGAN [49] on CIFAR-10 [32] and applied BugDoc to find root causes of one of the most common problems of GAN training: mode collapse [11]. Our evaluation function sets a threshold on the Frechet Inception Distance (FID) [26] metric, which is a proxy for mode collapse. This pipeline specified only 6 parameters, limited to 5 possible values each. The bottleneck was the execution time, because each configuration trains in approximately 10 hours, depending on the discriminator and generator learning rates and the number of steps.
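Such an evaluation function reduces to a few lines; the threshold value and the compute_fid helper below are hypothetical stand-ins for whatever FID implementation the pipeline uses.

```python
FID_THRESHOLD = 100.0  # hypothetical cutoff; tuned per experiment

def evaluate_gan_run(run_artifacts, compute_fid):
    """Evaluation function E for the GAN pipeline: a run whose FID
    exceeds the threshold is treated as mode collapse, i.e., fail."""
    fid = compute_fid(run_artifacts)
    return "succeed" if fid <= FID_THRESHOLD else "fail"
```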
Transactional Database Performance.
DBSherlock [48] is a tool designed to help database administrators diagnose online transaction processing (OLTP) performance problems. DBSherlock analyzes hundreds of statistics and configurations from OLTP logs and tries to identify which subsets of that data are potential root causes of the problems. In their experiments, the authors ran different settings of the TPC-C benchmark [44], introducing 10 distinct classes of performance anomalies and varying the duration of the abnormal behavior. For each type of anomaly, they collected the workload logs, creating a dataset of logs, each labeled as normal or anomalous. This dataset was used by Bailis et al. [4] to demonstrate Macrobase's ability to distinguish abnormal behavior in OLTP servers, where a classifier was trained to identify servers presenting performance degradation.
We ran BugDoc on this data to identify the root causes of each class of performance anomaly. This experiment poses two additional challenges. The first challenge is that, for this example, it is not possible to derive and run additional instances. We simulated the creation of new instances by reading only part of the provenance and testing the algorithms on the unread data, with an early stop when the pipeline instance to be tested was not present. The second challenge was the number of properties: a total of 202 numerical statistics. We applied feature selection and aggregated the values into buckets in order to increase the probability that configurations share parameter-value combinations. This reduced the configuration space to 15 parameters with 8 possible values (buckets) each. Since we were dealing with historical data, the instance execution time here is negligible. The bucketing step is sketched below.
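For illustration, the sketch below realizes the bucketing step with equal-width binning; the binning scheme and the bucket count are assumptions of the sketch, not a specification of our implementation.

import numpy as np

def bucketize(values, n_buckets=8):
    """Discretize one numeric statistic into `n_buckets` categorical
    values (equal-width bins), so that distinct log entries are more
    likely to share parameter-value pairs."""
    lo, hi = float(np.min(values)), float(np.max(values))
    if lo == hi:
        return np.zeros(len(values), dtype=int)  # constant column
    edges = np.linspace(lo, hi, n_buckets + 1)
    # Map each value to its bin index in [0, n_buckets).
    return np.digitize(values, edges[1:-1])

# Example: a raw latency statistic becomes 8 discrete parameter values.
latencies = np.array([1.2, 3.4, 250.0, 7.7, 190.5])
print(bucketize(latencies))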
We split the dataset into three parts: 50% of the data was used for training; 25% was the budget for pipeline instances that any sub-method of BugDoc requested; and the remaining 25% was a holdout used to assess the accuracy of BugDoc's minimal root causes as a classifier that predicts when a pipeline instance will fail. Precisely, if a pipeline instance is a superset of a minimal root cause, we predict failure. This method is accurate 98% of the time, a result comparable to those reported in [4]. Thus, BugDoc achieves both concise explanations of the bugs and high classification accuracy. The prediction rule is sketched below.
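The prediction rule itself is a simple superset test. In the Python sketch below, each minimal root cause is represented as a set of (parameter, value) pairs, and the parameter names are hypothetical placeholders.

def predicts_failure(config, minimal_root_causes):
    """Predict that a pipeline instance fails iff its configuration
    contains (is a superset of) at least one minimal root cause."""
    instance = set(config.items())
    return any(cause <= instance for cause in minimal_root_causes)

# Example with one mined root cause (hypothetical parameter names).
causes = [{("buffer_pool_bucket", 7)}]
print(predicts_failure({"buffer_pool_bucket": 7, "cpu_bucket": 2}, causes))  # True
print(predicts_failure({"buffer_pool_bucket": 3, "cpu_bucket": 2}, causes))  # False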
Quality Measures.
The root causes identified for all the aforementioned pipelines were manually investigated to assess their soundness and to create ground truth for the real-world data. The ground truth allowed us to compute precision and recall and to compare with Data X-Ray and Explanation Tables.
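For reference, the F-measure reported alongside precision and recall below is the standard balanced harmonic mean of the two (stated here for completeness; it is not a BugDoc-specific quantity):

F = 2 · precision · recall / (precision + recall)

A method must score well on both precision and recall to obtain a high F-measure, which is why we use it to summarize the comparison.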
Figure 7: Real-world pipelines. BugDoc (using Stacked Shortcut and Debugging Decision Trees combined) outperforms Data X-Ray and Explanation Tables.
Empirical Results.
The recall metric in Figure 7 shows that BugDoc's methods found all the parameter-comparator-value triples that cause the execution of the pipelines to fail. As in Section 5.1, Data X-Ray sometimes produces spurious root causes, yielding lower precision. By contrast, Explanation Tables shows high precision but low recall.
To the best of our knowledge, BugDoc is the first method that autonomously finds minimal, definitive root causes in computational pipelines or workflows. BugDoc achieves this by analyzing previously executed pipeline instances, selectively executing new pipeline instances, and finding minimal explanations.
When each root cause is due to a single parameter-value setting or a single conjunction of parameter-equality-values, the shortcut methods of BugDoc provably find at least one root cause in time proportional to the number of parameters (rather than exponential in the number of parameters, as exhaustive search requires). Further, the shortcut approaches are guaranteed to find at least a subset of the parameter-values constituting a root cause in time linear in the number of parameters. When there are few parameters or sufficient computation time, the Debugging Decision Trees method of BugDoc performs best. One linear-time strategy in this spirit is sketched below.
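For illustration (and not as a description of BugDoc's actual shortcut algorithms), one linear-time strategy of this flavor starts from one passing and one failing configuration and flips a single parameter at a time:

def single_parameter_root_cause(good_cfg, bad_cfg, run_pipeline):
    """Sketch of a linear-time search: test one hybrid configuration per
    parameter, each taking exactly one parameter's value from the
    failing configuration. If a lone parameter-value setting is the
    root cause, the corresponding hybrid must fail, exposing it.
    `run_pipeline(cfg)` is assumed to return True on success."""
    suspects = []
    for param, bad_value in bad_cfg.items():
        if good_cfg.get(param) == bad_value:
            continue  # identical settings cannot explain the difference
        hybrid = dict(good_cfg)
        hybrid[param] = bad_value  # flip exactly one parameter
        if not run_pipeline(hybrid):
            suspects.append((param, bad_value))
    return suspects

This strategy issues one pipeline execution per parameter, hence the linear cost; the shortcut methods additionally handle conjunctions of parameter-value settings.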
Compared to the state of the art, BugDoc makes no statistical assumptions (unlike Bayesian optimization approaches such as SMAC), but generally achieves better precision and recall given the same number of pipeline instances. In all cases, BugDoc dominates the other methods on the F-measure, though it may sometimes lose on precision or recall individually. BugDoc also parallelizes well: pipeline instances can be executed in parallel, opening up the possibility of exploring large parameter spaces.
There are two main avenues we plan to pursue in future work. First, we would like to make BugDoc available on a wide variety of provenance systems that support pipeline execution, to broaden its applicability. Second, we would like to explore group testing [33, 38] to identify problematic data elements when a dataset has been identified as a root cause. Another potential direction is the inclusion of observed variables (or predicates), i.e., properties that cannot be manipulated. While these cannot be used for deriving new instances, they can help enrich the explanations.
Acknowledgments.
We thank the authors of Data X-Ray and Explanation Tables for sharing their code with us. We are also grateful to Fernando Chirigati, Neel Dey, and Peter Bailis for providing the real-world pipelines. This work has been supported in part by NSF grants MCB-1158273, IOS-1339362, and MCB-1412232, CNPq (Brazil) grant 209623/2014-4, the DARPA D3M program, and NYU WIRELESS. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.
References
[1] Peter Alvaro, Joshua Rosen, and Joseph M. Hellerstein. 2015. Lineage-driven Fault Injection. In Proceedings of ACM SIGMOD. 331–346.
[2] Mona Attariyan, Michael Chow, and Jason Flinn. 2012. X-Ray: Automating Root-Cause Diagnosis of Performance Anomalies in Production Software. In Proceedings of USENIX OSDI. 307–320.
[3] Mona Attariyan and Jason Flinn. 2011. Automating Configuration Troubleshooting with ConfAid. ;login:.
[4] Peter Bailis, Edward Gan, Samuel Madden, Deepak Narayanan, Kexin Rong, and Sahaana Suri. 2017. MacroBase: Prioritizing Attention in Fast Data. In Proceedings of ACM SIGMOD. 541–556.
[5] Anju Bala and Inderveer Chana. 2015. Intelligent Failure Prediction Models for Scientific Workflows. Expert Systems with Applications (2015).
[6] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for Hyper-Parameter Optimization. In Proceedings of NIPS. 2546–2554.
[7] James Bergstra and Yoshua Bengio. 2012. Random Search for Hyper-parameter Optimization. JMLR (Feb. 2012), 281–305.
[8] J. Bergstra, D. Yamins, and D. D. Cox. 2013. Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. In Proceedings of ICML. 115–123.
[9] Tobias Bleifuß, Sebastian Kruse, and Felix Naumann. 2017. Efficient Denial Constraint Discovery with Hydra. Proceedings of the VLDB Endowment (2017).
[12] In Proceedings of CIDR. 1–7.
[13] Mike Y. Chen, Emre Kiciman, Eugene Fratkin, Armando Fox, and Eric Brewer. 2002. Pinpoint: Problem Determination in Large, Dynamic Internet Services. In Proceedings of IEEE DSN. 595–604.
[14] Fernando Chirigati, Harish Doraiswamy, Theodoros Damoulas, and Juliana Freire. 2016. Data Polygamy: The Many-Many Relationships Among Urban Spatio-Temporal Data Sets. In Proceedings of ACM SIGMOD. 1011–1025.
[15] Xu Chu, Ihab F. Ilyas, and Paolo Papotti. 2013. Discovering Denial Constraints. Proceedings of the VLDB Endowment 6, 13 (Aug. 2013), 1498–1509.
[16] Valentin Dalibard, Michael Schaarschmidt, and Eiko Yoneki. 2017. BOAT: Building Auto-Tuners with Structured Bayesian Optimization. In Proceedings of WWW. 479–488.
[17] Nima Dolatnia, Alan Fern, and Xiaoli Fern. 2016. Bayesian Optimization with Resource Constraints and Production. In Proceedings of ICAPS. 115–123.
[18] Kareem El Gebaly, Parag Agrawal, Lukasz Golab, Flip Korn, and Divesh Srivastava. 2014. Interpretable and Informative Explanations of Outcomes. Proceedings of the VLDB Endowment (2014).
[19] IEEE Transactions on Software Engineering.
[20] Computer (2011), 367–386.
[21] Sainyam Galhotra, Yuriy Brun, and Alexandra Meliou. 2017. Fairness Testing: Testing Software for Discrimination. CoRR (2017), 1–13. arXiv:1709.03221
[22] Patrice Godefroid, Michael Y. Levin, and David A. Molnar. 2008. Automated Whitebox Fuzz Testing. In Proceedings of NDSS. 151–166.
[23] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Networks. arXiv:1406.2661
[24] Helga Gudmundsdottir, Babak Salimi, Magdalena Balazinska, Dan R. K. Ports, and Dan Suciu. 2017. A Demonstration of Interactive Analysis of Performance Measurements with Viska. In Proceedings of ACM SIGMOD. 1707–1710.
[25] Muhammad Ali Gulzar, Siman Wang, and Miryung Kim. 2018. BigSift: Automated Debugging of Big Data Analytics in Data-Intensive Scalable Computing. In Proceedings of ESEC/FSE. 863–866.
[26] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv:1706.08500
[27] Christian Holler, Kim Herzig, and Andreas Zeller. 2012. Fuzzing with Code Fragments. In Proceedings of USENIX Security Symposium. 445–458.
[28] Jiangbo Huang. 2014. Programing Implementation of the Quine-McCluskey Method for Minimization of Boolean Expression. CoRR (2014), 1–22. arXiv:1410.1059
[29] F. Hutter, H. H. Hoos, and K. Leyton-Brown. 2011. Sequential Model-Based Optimization for General Algorithm Configuration. In Proceedings of LION-5. 507–523.
[30] Brittany Johnson, Yuriy Brun, and Alexandra Meliou. 2018. Causal Testing: Finding Defects' Root Causes. CoRR (2018), 1–12. arXiv:1809.06991
[31] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. SIGARCH Computer Architecture News (2017).
[32] Alex Krizhevsky. 2009. Learning Multiple Layers of Features from Tiny Images. Technical Report.
[33] Kang Wook Lee, Ramtin Pedarsani, and Kannan Ramchandran. 2015. SAFFRON: A Fast, Efficient, and Robust Framework for Group Testing Based on Sparse-Graph Codes. IEEE Transactions on Signal Processing (Aug. 2015), 1–10.
[34] David Lewis. 2013. Counterfactuals.
[35] Ben Liblit, Mayur Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jordan. 2005. Scalable Statistical Bug Isolation. In Proceedings of ACM SIGPLAN. 15–26.
[36] Raoni Lourenço, Juliana Freire, and Dennis Shasha. 2019. Debugging Machine Learning Pipelines. In Proceedings of DEEM. Article 3, 10 pages.
[37] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. 2017. Are GANs Created Equal? A Large-Scale Study. arXiv:1711.10337
[38] Anthony J. Macula and Leonard J. Popyack. 2004. A Group Testing Method for Finding Patterns in Data. Discrete Applied Mathematics.
[39] PVLDB 13 (2014), 1715–1716.
[40] Judea Pearl. 2009. Causality: Models, Reasoning and Inference (2nd ed.).
[41] Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434
[42] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian Optimization of Machine Learning Algorithms. In Proceedings of NIPS. 2951–2959.
[43] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat Prabhat, and Ryan P. Adams. 2015. Scalable Bayesian Optimization Using Deep Neural Networks. In Proceedings of ICML.
[44] Transaction Processing Performance Council. TPC Benchmark C (TPC-C). http://www.tpc.org/tpcc/
[45] In Proceedings of ACM SIGMOD. 1009–1024.
[46] Xiaolan Wang, Xin Luna Dong, and Alexandra Meliou. 2015. Data X-Ray: A Diagnostic Tool for Data Errors. In Proceedings of ACM SIGMOD. 1231–1245.
[47] Xiaolan Wang, Alexandra Meliou, and Eugene Wu. 2017. QFix: Diagnosing Errors Through Query Histories. In Proceedings of ACM SIGMOD. 1369–1384.
[48] Dong Young Yoon, Ning Niu, and Barzan Mozafari. 2016. DBSherlock: A Performance Diagnostic Tool for Transactional Databases. In Proceedings of ACM SIGMOD. 1599–1614.
[49] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. 2018. Self-Attention Generative Adversarial Networks. arXiv:1805.08318
[50] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. 2017. StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks. In Proceedings of ICCV. 5908–5916.
[51] Alice X. Zheng, Michael I. Jordan, Ben Liblit, Mayur Naik, and Alex Aiken. 2006. Statistical Debugging: Simultaneous Identification of Multiple Bugs. In Proceedings of ICML.