Explaining Inference Queries with Bayesian Optimization

Brandon Lockhart♦, Jinglin Peng♦, Weiyuan Wu♦, Jiannan Wang♦, Eugene Wu†
♦Simon Fraser University, †Columbia University
{brandon_lockhart, jinglin_peng, youngw, jnwang}@sfu.ca, [email protected]

ABSTRACT
Obtaining an explanation for an SQL query result can enrich the analysis experience, reveal data errors, and provide deeper insight into the data. Inference query explanation seeks to explain unexpected aggregate query results on inference data; such queries are challenging to explain because an explanation may need to be derived from the source, training, or inference data in an ML pipeline. In this paper, we model an objective function as a black-box function and propose BOExplain, a novel framework for explaining inference queries using Bayesian optimization (BO). An explanation is a predicate defining the input tuples that should be removed so that the query result of interest is significantly affected. BO, a technique for finding the global optimum of a black-box function, is used to find the best predicate. We develop two new techniques (individual contribution encoding and warm start) to handle categorical variables. We perform experiments showing that the predicates found by BOExplain have a higher degree of explanation compared to those found by the state-of-the-art query explanation engines. We also show that BOExplain is effective at deriving explanations for inference queries from source and training data on three real-world datasets.
1 INTRODUCTION

Data scientists often need to execute aggregate SQL queries on inference data to inspect a machine learning (ML) model's performance. We call such queries inference queries; an inference query can be seen as an SQL query whose expressions may perform model inference. Consider an inference dataset with four variables (customer_id, age, sex, M.predict(I)), where M.predict(I) represents a variable whose values denote whether the model 𝑀 predicts each customer to be a repeat buyer or not. Running the following inference query returns the number of (predicted) female repeat buyers:

    SELECT COUNT(*)
    FROM InferenceData
    WHERE sex = 'female' AND M.predict(I) = 'repeat buyer'

If the query result is surprising, e.g., the number of repeat buyers is higher than expected, the data scientist may seek an explanation for the unexpected result. One popular explanation method is to find a subset of the input data such that when this subset is removed and the query is re-executed, the unexpected result no longer manifests [34, 44]. This method is known as a provenance- or intervention-based explanation [29]. Specifically, there are two types of explanations in the intervention-based setting: fine-grained (a set of tuples) and coarse-grained (a predicate) [28]. In this paper, we focus on coarse-grained explanation. Predicates, unlike sets of tuples, provide a comprehensible explanation and identify common properties of the input tuples that cause the unexpected result. For the above example, it may return a predicate like sex = 'female' AND age ≤ 25, which suggests that if the young female customers were removed from the inference data, the query result would look normal. The data scientist can then look into these customers more closely and conduct further investigation.
                  | SQL Explain [1, 33, 34, 44] | Rain [45]     | BOExplain
Inference Data    | Supported                   | Supported     | Supported
Training Data     | Not Supported               | Supported     | Supported
Source Data       | Not Supported               | Not Supported | Supported
Explanation Type  | Coarse-grained              | Fine-grained  | Coarse-grained
Methodology       | White-box                   | White-box     | Black-box

Table 1: Comparison of BOExplain and existing approaches.
Generating an explanation (i.e., a predicate) from inference data can certainly help to understand the answer to an inference query. However, an ML pipeline does not only contain the inference data but also the training and source data. The following example illustrates a scenario where an explanation should be generated from the source data.
Example 1.1. CompanyX creates an ML pipeline (Figure 1(a)) to predict repeat customers for a promotional event. CompanyX receives transaction records from several websites that sell their products and aggregates them into a source data table 𝑆. Next, the user-defined function (UDF) make_training(·) extracts and transforms features into the training dataset 𝑇. Finally, a random forest model is fit to the training data, and the model is applied to the inference dataset 𝐼, which updates it with a prediction variable, 𝑀.predict(𝐼).

For validation purposes, the data scientist writes a query to compute the percentage of repeat buyers. The rate is higher than expected, but she wants to double check that the result is not due to a data error. In fact, it turns out that the source data 𝑆 contains errors during Date ∈ [𝑡_1, 𝑡_2], when the website 𝑤 had network issues; customers confirmed their transactions multiple times, which led to duplicate records in 𝑆. The training data extraction UDF was coded to label customers with multiple purchases as repeat buyers, and labelled all of the 𝑤 customers during the network issue as repeats. The model erroneously predicts every website 𝑤 customer as a repeat buyer, and thus leads to the high query result. Ideally, the data scientist could ask whether the source data contains an error, and an explanation system would generate a predicate (𝑡_1 ≤ Date ≤ 𝑡_2 AND Website = 𝑤).

Unfortunately, existing SQL explanation approaches [1, 33, 34, 44] are ill-equipped to address this setting (Table 1) because they are based on analysis of the query provenance. Although they can generate a predicate explanation over the inference data, the provenance analysis does not extend across model training nor UDFs, which are prevalent in data science workflows. The recent system Rain [45] generates fine-grained explanations for inference queries. It relaxes the inference query into a differentiable function over the model's prediction probabilities, and leverages influence analysis [22] to estimate the query result's sensitivity to a training record. However, Rain returns training records rather than predicates, and estimating the model prediction sensitivity to group-wise changes to the training data remains an open problem. Further, Rain does not currently support UDFs and uses a white-box approach that is less amenable to data science programs (Figure 1(b)) that heavily incorporate UDFs.

[Figure 1 omitted; panels: (a) Example ML Pipeline and Inference Query; (b) Inference Query Explanation from Source Data Using BOExplain.]
Figure 1: An illustration of using BOExplain to generate an explanation from source data in an ML pipeline.
As a first approach towards addressing the above limitations, and to diverge from existing white-box explanation approaches [1, 33, 34, 44, 45], this paper explores a black-box approach to inference query explanation. BOExplain models inference query explanation as a hyperparameter tuning problem and adopts Bayesian Optimization (BO) to solve it. In ML, hyperparameters (e.g., the number of trees, learning rate) control the training process and are tuned in an "outer loop" that surrounds the model training process. Hyperparameter tuning seeks to find the best hyperparameters that maximize some model quality measure (e.g., validation score). BOExplain treats predicate constraints (e.g., 𝑡_1, 𝑡_2, 𝑤 in Example 1.1) as hyperparameters, and the goal is to assign the optimal value to each constraint. By defining a metric that evaluates a candidate explanation's quality (e.g., the decrease of the repeat buyer rate), BOExplain finds the constraint values that correspond to the highest quality predicate.

A black-box approach offers a number of advantages for inference query explanation. In terms of usability, a data scientist can derive a predicate from any data involved in an ML pipeline rather than the inference data only. Furthermore, its concise API design is similar to popular hyperparameter tuning libraries, such as scikit-optimize [19] and Hyperopt [8], with which many data scientists are already familiar. Figure 1(b) shows an example using BOExplain's API to solve Example 1.1. The data scientist wraps the relevant portion of the program in an objective function obj whose input is the dataset to generate predicates for, and whose output is the repeat buyer rate that should be minimized. She also provides hints to focus on the Date and Website variables. See Section 3.2 for more details.
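To make the API concrete, the following is a minimal Python sketch of what the driver code in Figure 1(b) might look like. The entry point fmin, its keyword arguments, and the helpers make_training, fit_random_forest, predict, and repeat_buyer_rate are illustrative assumptions based on the paper's description, not necessarily the package's exact interface.

    import pandas as pd
    from boexplain import fmin  # assumed entry point; see the BOExplain repository

    def obj(source: pd.DataFrame) -> float:
        # Re-run the pipeline of Figure 1(a) on a filtered copy of the
        # source data and return the metric to be minimized.
        train = make_training(source)        # UDF from Example 1.1 (assumed helper)
        model = fit_random_forest(train)     # assumed helper wrapping model training
        inference = predict(model, I)        # adds the M.predict(I) variable (assumed)
        return repeat_buyer_rate(inference)  # the complained-about query result

    # Search over predicates on Date and Website in the source table S.
    result = fmin(data=S, f=obj, columns=["Date", "Website"], runtime=60)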
In terms of adaptability, a black-box approach can potentially be used to generate explanations for any data science workflow beyond inference queries. The current machine learning and analytics ecosystem is rapidly evolving. In contrast to white-box approaches, which must be carefully designed for a specific class of programs, BOExplain can more readily evolve with API, library, model, and ecosystem changes.

In terms of effectiveness, BOExplain builds on the considerable advances in BO by the ML community [39] to quickly generate high quality explanations. A secondary benefit is that BO is a progressive optimization algorithm, which lets BOExplain quickly propose an initial explanation and improve it over time.

The key technical challenge is that existing BO approaches [9, 20, 42] cannot be naively adapted to explanation generation. In the hyperparameter tuning setting, categorical variables typically have very low cardinality (e.g., 2-3 distinct values [30]). In the query explanation setting, however, a categorical variable can have many more distinct values. To address this, we propose a categorical encoding method to map a categorical variable into a numerical variable. This lets BOExplain estimate the quality of the categorical values that have not been evaluated. We further propose a warm start approach so that BOExplain can prioritize predicates with more promising categorical values.

In summary, this paper makes the following contributions:
• We are the first to generate coarse-grained explanations from the training and source data to an inference query.
• We argue for a black-box approach to inference query explanation and discuss its advantages over a white-box approach.
• We propose BOExplain, a novel query explanation framework that derives explanations for inference queries using BO.
• We develop two techniques (categorical encoding and warm start) to improve BOExplain's performance on categorical variables.
• We show that BOExplain can generate comparable or higher quality explanations than state-of-the-art SQL explanation engines (Scorpion [44] and MacroBase [1]) on SQL-only queries.
• We evaluate BOExplain using inference queries on real-world datasets, showing that BOExplain can generate explanations for different input datasets with a higher degree of explanation than random search.
• We implement BOExplain as a Python package and open-source it at https://github.com/sfu-db/BOExplain.
2 PROBLEM DEFINITION

In this section, we first define the SQL explanation problem, and subsequently describe the extension to inference query explanation.
Query. We first define the supported queries. In this work, we focus on aggregation queries over a single table (the extension to multiple tables has been formalized in [34]). An explainable query is an arithmetic expression over a collection of SQL query results, as formally defined in Definition 1.

Definition 1 (Supported Queries). Given a relation 𝑅, an explainable query 𝑄 = 𝐸(𝑞_1, ..., 𝑞_𝑘) is an arithmetic expression 𝐸 over queries 𝑞_1, ..., 𝑞_𝑘 of the form

    𝑞_𝑖 = SELECT agg(...) FROM 𝑅 WHERE 𝐶_1 AND/OR ... AND/OR 𝐶_𝑚

where agg is an aggregation operation and 𝐶_𝑗 is a filter condition.

Example 2.1. Returning to the running example in Section 1, the user queries the predicted repeat buyer rate. This can be expressed as 𝑄 = 𝑞_1/𝑞_2, an arithmetic expression over 𝑞_1 and 𝑞_2 where

    𝑞_1 = SELECT COUNT(*) FROM I WHERE M.predict(I) = 'repeat buyer';
    𝑞_2 = SELECT COUNT(*) FROM I;

Complaint. After the user executes a query, she may find that the result is unexpected and complain about its value. In this work, the user can complain about the result being too high or too low, as done in [34]. We use the notation 𝑑𝑖𝑟 = 𝑙𝑜𝑤 (𝑑𝑖𝑟 = ℎ𝑖𝑔ℎ) to indicate that 𝑄 is unexpectedly high (low).

Example 2.2.
In our running example, the user found the repeat buyer rate too high. Thus, along with the query 𝑄 from Example 2.1, the user specifies 𝑑𝑖𝑟 = 𝑙𝑜𝑤 to indicate that 𝑄 should be lower.

Explanation. After the user complains about a query result, BOExplain will return an explanation for the complaint. In this work, we define an explanation as a predicate over given variables.

Definition 2 (Explanation). Given numerical variables 𝑁_1, ..., 𝑁_𝑛 and categorical variables 𝐶_1, ..., 𝐶_𝑚, an explanation is a predicate 𝑝 of the form

    𝑝 = 𝑙_1 ≤ 𝑁_1 ≤ 𝑢_1 ∧ ··· ∧ 𝑙_𝑛 ≤ 𝑁_𝑛 ≤ 𝑢_𝑛 ∧ 𝐶_1 = 𝑐_1 ∧ ··· ∧ 𝐶_𝑚 = 𝑐_𝑚.

The set of all such predicates forms the predicate space 𝑆.

Example 2.3. The source data in Figure 1 contains the variables Date and Website. An example explanation over these variables is 𝑡_1 ≤ Date ≤ 𝑡_2 ∧ Website = 𝑤.

Objective Function.
Next we define our objective function. The goal of our system is to find the best explanation for the user's complaint; hence, we need to measure the quality of an explanation. For a predicate 𝑝, let 𝜎_¬𝑝(𝑅) represent 𝑅 filtered to contain all tuples that do not satisfy 𝑝. We apply the query to 𝜎_¬𝑝(𝑅) and get the new query result. If the user specifies 𝑑𝑖𝑟 = 𝑙𝑜𝑤, then the smaller the new query result is, the better the explanation is. Hence, we use the new query result as a measure of explanation quality. The objective function is formally defined in Definition 3.

Definition 3 (Objective Function). Given a predicate 𝑝, relation 𝑅, and query 𝑄 = 𝐸(𝑞_1, ..., 𝑞_𝑘), the objective function 𝑜𝑏𝑗(𝑝, 𝑅, 𝑄) → ℝ applies 𝑄 on the relation 𝜎_¬𝑝(𝑅).
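As a minimal illustration of Definition 3 (a sketch, not the system's implementation), the snippet below assumes the relation 𝑅 is a pandas DataFrame and represents both the predicate and the query as Python callables; the column names are taken from the running example and are otherwise illustrative.

    import pandas as pd

    def obj(p, R: pd.DataFrame, Q) -> float:
        # Definition 3: apply the query Q to sigma_{not p}(R), i.e., to R
        # with the tuples satisfying the predicate p removed.
        satisfies_p = p(R)           # boolean mask: True where a tuple satisfies p
        return Q(R[~satisfies_p])

    # Q = q1/q2 from Example 2.1: the fraction of predicted repeat buyers.
    Q = lambda df: (df["prediction"] == "repeat buyer").mean()  # column name assumed
    # A candidate explanation in the style of Section 1.
    p = lambda df: (df["sex"] == "female") & (df["age"] <= 25)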
With the definition of the objective function, the problem of searching for the best explanation is equivalent to finding a predicate that minimizes or maximizes the objective function.

Definition 4 (SQL Explanation Problem). Given a relation 𝑅, query 𝑄 = 𝐸(𝑞_1, ..., 𝑞_𝑘), direction 𝑑𝑖𝑟, and predicate space 𝑆, find the predicate 𝑝* = argmin_{𝑝∈𝑆} 𝑜𝑏𝑗(𝑝, 𝑅, 𝑄) if 𝑑𝑖𝑟 = 𝑙𝑜𝑤 (use argmax if 𝑑𝑖𝑟 = ℎ𝑖𝑔ℎ).

It may appear that minimizing the above objective function runs the risk of overfitting to the user's complaint (perhaps with an overly complex predicate). However, a regularization term can be placed within the objective function; for instance, SQL explanation typically regularizes using the number of tuples that satisfy the predicate [44]. Since 𝑄 is an arithmetic expression over multiple queries, one of those queries may simply be the regularization term.

For inference query explanation, we focus on three input datasets that the user can generate explanations from: source, training, and inference. In general, any intermediate dataset is acceptable; we focus on these three due to their prevalence and to simplify the paper. The query processing pipeline is as follows (Figure 1(a)):
(1) Transform and featurize the source data into the training data.
(2) Train an ML model over the training data.
(3) Use the model to predict a variable from the inference dataset.
(4) Issue a query over the inference dataset.

From the above workflow, we can see that there are two differences between SQL and inference query explanations: 1) the query for inference query explanation is evaluated over the model predictions as well as the input data, and 2) in inference query explanation, the user may want an explanation for the input dataset at any step of the workflow (e.g., the source, training, or inference dataset), while SQL explanation only considers the query's direct input.

We next extend the objective function from SQL explanation to inference query explanation. Let 𝑄 be the query issued by the user over the updated inference data, with the same form as in Definition 1. Let 𝑅 be the data that we want to derive an explanation from (it can be source, training, or inference data) and 𝑝 be an explanation (i.e., predicate) over 𝑅. We measure the quality of 𝑝 as in SQL explanation: filter the data by 𝑝, then get the new query result. Note that for inference query explanation, the query is issued over the updated inference data. Hence, we define P as the subset of the ML pipeline that takes as input the dataset 𝑅 that we wish to generate an explanation from, and that outputs the updated inference data which is used as input to the SQL query. Note that when 𝑅 is the updated inference data, P is a no-op, and the inference query explanation problem degrades to the SQL explanation problem. The extended objective function is defined in Definition 5.

Definition 5 (Objective Function). Given a subset of an ML pipeline P, a predicate 𝑝, relation 𝑅, and query 𝑄, the objective function 𝑜𝑏𝑗(𝑝, 𝑅, P, 𝑄) → ℝ feeds 𝜎_¬𝑝(𝑅) through P, and then applies 𝑄 on the inference data.

Finally, we define the inference query explanation problem.
Definition 6 (Inference Query Explanation Problem). Given a relation 𝑅, query 𝑄, direction 𝑑𝑖𝑟, pipeline P, and predicate space 𝑆, find the predicate 𝑝* = argmin_{𝑝∈𝑆} 𝑜𝑏𝑗(𝑝, 𝑅, P, 𝑄) if 𝑑𝑖𝑟 = 𝑙𝑜𝑤 (use argmax if 𝑑𝑖𝑟 = ℎ𝑖𝑔ℎ).
3 BOEXPLAIN FRAMEWORK
This section introduces Bayesian optimization (BO) and presents the BOExplain framework.
3.1 Bayesian Optimization

Black-box optimization aims to find the global minimum (or maximum) of a black-box function 𝑓 over a search space X, 𝑥* = argmin_{𝑥∈X} 𝑓(𝑥). BO is a sequential model-based optimization strategy to solve the problem, where sequential means that BO is an iterative algorithm and model-based means that BO builds surrogate models to estimate the behavior of 𝑓.

Tree-structured Parzen Estimator (TPE). TPE [7, 9] is a popular BO algorithm. It first initializes by evaluating 𝑓 on random samples from the search space. Then, it iteratively selects 𝑥 from the search space using an acquisition function and evaluates 𝑓(𝑥). Let 𝐷 = {(𝑥_1, 𝑓(𝑥_1)), (𝑥_2, 𝑓(𝑥_2)), ..., (𝑥_𝑡, 𝑓(𝑥_𝑡))} denote the set of samples evaluated in previous iterations. TPE chooses the next sample as follows:

(1) Partition 𝐷 into sets 𝐷_𝑔 and 𝐷_𝑏, where 𝐷_𝑔 consists of the 𝛾-percentile points with the lowest 𝑓(𝑥) values in 𝐷, and 𝐷_𝑏 consists of the remaining points (𝛾 is a user-definable parameter). Since the goal is to minimize 𝑓(𝑥), 𝐷_𝑔 is called the good-point set and 𝐷_𝑏 is called the bad-point set. Intuitively, good points lead to smaller objective values than bad points.

(2) Use Parzen estimators (a.k.a. kernel density estimators) to build density models 𝑔(𝑥) and 𝑏(𝑥) over 𝐷_𝑔 and 𝐷_𝑏, respectively. Intuitively, given an unseen 𝑥* in the search space, the density models 𝑔(𝑥*) and 𝑏(𝑥*) return the probability of 𝑥* being a good and bad point, respectively. Note that separate density models 𝑔(𝑥) and 𝑏(𝑥) are constructed for each dimension of X.

(3) Construct an acquisition function 𝑔(𝑥)/𝑏(𝑥) and select the 𝑥 with the maximum 𝑔(𝑥)/𝑏(𝑥) to evaluate in the next iteration. Intuitively, TPE selects a point that is more likely to appear in the good-point set and less likely to appear in the bad-point set.

Figure 2 illustrates the three steps. A complete introduction to TPE is given in Appendix A.

[Figure 2 omitted: TPE has observed six points with 𝑥-values {-5, -3, -1, 2, 3, 3.5}; the good-point set is 𝐷_𝑔 = {-1, 2} and the bad-point set is 𝐷_𝑏 = {-5, -3, 3, 3.5}. The panels show Step (1) the split of the observations, Step (2) the densities 𝑔(𝑥) over 𝐷_𝑔 and 𝑏(𝑥) over 𝐷_𝑏, and Step (3) the acquisition function 𝑔(𝑥)/𝑏(𝑥) and the next point to evaluate.]

Age | Sex | City  | State | Occupation | M.predict(𝐼)
48  | F   | Mesa  | AZ    | Athlete    | repeat
45  | F   | Miami | FL    | Artist     | repeat
46  | M   | Mesa  | AZ    | Writer     | one-time
40  | M   | Miami | FL    | Athlete    | repeat
42  | F   | Miami | FL    | Athlete    | repeat

Table 2: An illustration of parameter creation.

Categorical Variables.
TPE models categorical variables using categorical distributions rather than kernel density estimation. Consider a categorical variable with four distinct values: Website ∈ {𝑤_1, 𝑤_2, 𝑤_3, 𝑤_4}. To build 𝑔(Website), TPE estimates the probability of 𝑤_𝑖 based on the fraction of its occurrences in 𝐷_𝑔; the distribution is smoothed by adding 1 to the count of occurrences for each value. For instance, if the occurrences are 2, 0, 1, 0, then the distribution 𝑔(Website) will be {𝑃(𝑤_1), 𝑃(𝑤_2), 𝑃(𝑤_3), 𝑃(𝑤_4)} = {3/7, 1/7, 2/7, 1/7}.
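The smoothing rule can be reproduced in a few lines; the sketch below rebuilds the distribution from the example's occurrence counts (2, 0, 1, 0).

    from collections import Counter

    def smoothed_categorical(domain, observed):
        # TPE's smoothed categorical distribution: add 1 to each value's
        # occurrence count in D_g, then normalize.
        counts = Counter(observed)
        weights = {v: counts[v] + 1 for v in domain}
        total = sum(weights.values())
        return {v: w / total for v, w in weights.items()}

    # Occurrences 2, 0, 1, 0 yield probabilities {3/7, 1/7, 2/7, 1/7}.
    print(smoothed_categorical(["w1", "w2", "w3", "w4"], ["w1", "w1", "w3"]))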
3.2 The BOExplain Framework

In this section, we describe the BOExplain framework.

Parameter Creation.
Given a predicate space, we need to map it to a parameter search space (the parameters and their domains). Suppose a predicate space is defined over variables 𝐴_1, 𝐴_2, ..., 𝐴_𝑛. If 𝐴_𝑖 is numerical (e.g., age, date), two parameters are created that serve as bounds on the range constraint. Specifically, the parameters 𝐴_𝑖^min and 𝐴_𝑖^length define the lower bound and the length of the range constraint, respectively. 𝐴_𝑖^min and 𝐴_𝑖^length have interval domains [min(𝐴_𝑖), max(𝐴_𝑖)] and [0, max(𝐴_𝑖) − min(𝐴_𝑖)], respectively. If 𝐴_𝑖 is categorical (e.g., sex, website), one categorical parameter is created with a domain consisting of all unique values in 𝐴_𝑖.

Example 3.1. Suppose the user defines a predicate space over State and Age in Table 2. BOExplain creates three parameters: one categorical parameter for State with domain {AZ, FL}, and two numerical parameters for Age with domains [40, 48] and [0, 8], respectively.
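A sketch of this mapping under the assumption that the data is a pandas DataFrame; the tuple-based encoding of the domains is illustrative.

    import pandas as pd

    def create_parameters(df: pd.DataFrame, columns):
        # Map each predicate variable to its parameter domain(s) (Section 3.2).
        params = {}
        for col in columns:
            if pd.api.types.is_numeric_dtype(df[col]):
                lo, hi = df[col].min(), df[col].max()
                params[col + "_min"] = ("range", lo, hi)         # lower bound of range
                params[col + "_length"] = ("range", 0, hi - lo)  # length of range
            else:
                params[col] = ("categorical", sorted(df[col].unique()))
        return params

    # For Table 2: State -> {AZ, FL}; Age_min has domain [40, 48], Age_length [0, 8].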
BOExplain Framework. Figure 3 walks through the BOExplain framework. In step 0, the user provides an objective function 𝑜𝑏𝑗, a relation 𝑆, and predicate variables 𝐴_1, ..., 𝐴_𝑛 (Figure 1(b), line 10). Step 1 creates the parameters and their domains. Step 2 runs one iteration of TPE, starting with the parameters from step 1, and outputs a predicate. Steps 3 and 4 evaluate the predicate by removing those tuples from the input dataset and evaluating 𝑜𝑏𝑗 on the filtered data. The result is passed to TPE for the next iteration, and possibly yielded to the user as an intermediate or final predicate explanation.

[Figure 3: The BOExplain framework. Parameter creation feeds TPE; TPE proposes a predicate; tuples satisfying the predicate are removed; obj(S_filtered) is evaluated and the result is fed back to TPE; the best predicate is returned.]

Consider the example code in Figure 1(b). Once it is executed,
BOExplain first creates three parameters, Date_min, Date_length, and Website, along with the corresponding domains. Then, it iteratively calls TPE to propose predicates (e.g., "𝑡_1 ≤ Date ≤ 𝑡_2 AND Website = 𝑤"). BOExplain obtains S_filtered by removing the tuples that satisfy this predicate from 𝑆. Next, it applies obj(·) to S_filtered, which reruns the pipeline (Figure 1(a)) to compute the updated repeat buyer rate. The predicate and the updated rate are passed to TPE to use when selecting the predicate on the next iteration. This iterative process repeats until the time budget is reached. When the user stops BOExplain, or when the optimization has converged, the predicate with the lowest repeat buyer rate is returned.
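Since BOExplain modifies TPE inside the Optuna library [2], the iterative process can be pictured with an Optuna-style ask/tell loop. The sketch below is a simplification under stated assumptions: it handles a single numerical variable, replaces the time budget with an iteration count, and omits the categorical-variable optimizations; it is not the system's actual implementation.

    import optuna
    import pandas as pd

    def explain(S: pd.DataFrame, obj, col: str, n_iter: int = 100):
        lo, hi = float(S[col].min()), float(S[col].max())
        study = optuna.create_study(direction="minimize",
                                    sampler=optuna.samplers.TPESampler())
        for _ in range(n_iter):
            trial = study.ask()
            start = trial.suggest_float(col + "_min", lo, hi)
            length = trial.suggest_float(col + "_length", 0.0, hi - lo)
            # Remove the tuples satisfying the candidate predicate, then
            # evaluate the objective on the filtered data (steps 2-4).
            mask = (S[col] >= start) & (S[col] <= start + length)
            study.tell(trial, obj(S[~mask]))
        return study.best_params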
Why Is TPE Suitable For Query Explanation? Recent work [6, 24, 27] has suggested that random search is a competitive strategy for hyperparameter tuning across a variety of challenging machine learning tasks. However, we find that TPE is more effective for query explanation because it is designed for problems where similar parameter values tend to have similar objective values (e.g., model accuracy). TPE can leverage this property to prune poor regions of the search space. As a trivial example, suppose a hyperparameter controls the number of trees in a random forest; nearby values such as 10, 11, and 12 trees tend to yield similar accuracy. Query explanation exhibits the same property: a predicate such as age ∈ [20, 30] will exhibit a similar objective to age ∈ [19, 29] and age ∈ [21, 31]. If the former has a poor objective value, the latter two may be pruned.

4 SUPPORTING CATEGORICAL VARIABLES

In this section, we present our techniques to enable BOExplain to support categorical variables more effectively.
4.1 Individual Contribution Encoding

Recall from Section 3.1 that TPE models numerical and categorical variables using kernel density estimation and categorical distributions, respectively. The advantage of kernel density estimation over a categorical distribution is that it can estimate the quality of unseen points based on the points that are close to them. To benefit from this advantage, we map a categorical variable to a numerical variable. We call this idea categorical encoding. In the following, we present our categorical encoding approach, called individual contribution (IC) encoding.

A good encoding method should put similar categorical values close to each other. Intuitively, two categorical values are similar if they have a similar contribution to the objective function value. Based on this intuition, we rank the categorical values by their individual contribution to the objective function value. Specifically, consider a categorical variable 𝐶 with domain(𝐶) = {𝑐_1, ..., 𝑐_𝑛}. For each value 𝑐_𝑖, we obtain the filtered dataset 𝜎_{𝐶≠𝑐_𝑖}(𝑆) w.r.t. the predicate 𝐶 = 𝑐_𝑖. Next, the objective function is evaluated on the relation 𝜎_{𝐶≠𝑐_𝑖}(𝑆), which returns a number. This number can be interpreted as the contribution of the categorical value to the objective function. After repeating for all values 𝑐_𝑖, the categorical values are mapped to consecutive integers in order of their IC. BOExplain will then use a numerical rather than categorical variable to model 𝐶.

Example 4.1. Suppose we would like an explanation from the inference data in Table 2, the objective function value is the repeat buyer rate, and the predicate space is defined over the Occupation variable, which has the domain {Athlete, Artist, Writer}. The IC of Athlete is determined by removing the tuples where Occupation = "Athlete" and computing the objective function on the filtered dataset, which gives 0.5 (since only one of the two tuples in the filtered dataset is a repeat buyer). Similarly, the ICs of Artist and Writer are 0.75 and 1, respectively. Finally, we sort the categorical values by their objective function value and encode the values as integers: Athlete → 1, Artist → 2, Writer → 3.
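A concrete pandas sketch of IC encoding; ordering ascending by IC matches the minimization setting of Example 4.1, and the helper names are illustrative.

    import pandas as pd

    def ic_encode(df: pd.DataFrame, col: str, obj):
        # Score each category by the objective value after removing its
        # tuples (its individual contribution), then map the categories to
        # consecutive integers ordered by that score.
        ic = {c: obj(df[df[col] != c]) for c in df[col].unique()}
        ranked = sorted(ic, key=ic.get)  # smallest IC first (minimization)
        return {c: i + 1 for i, c in enumerate(ranked)}

    # Example 4.1: with the repeat-buyer-rate objective on Table 2 this
    # yields Athlete -> 1 (IC 0.5), Artist -> 2 (IC 0.75), Writer -> 3 (IC 1).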
4.2 Warm Start

We next propose a warm-start approach to further enhance BOExplain's performance for categorical variables. Since an IC score has been computed for each categorical value, we can prioritize predicates that are composed of well-performing individual categorical values. Rather than selecting 𝑛_init points at random to initialize the TPE algorithm, we select the 𝑛_init combinations of categorical values with the best combined score. More precisely, for a variable 𝐶_𝑖, we consider the pairs (variable value, IC) as computed in Section 4.1, 𝑆_IC^𝑖 = {(𝑐_𝑗, IC(𝑐_𝑗))} for 𝑗 = 1, ..., 𝑛_𝑖, where 𝑛_𝑖 is the number of unique values in variable 𝐶_𝑖. Next, we compute the 𝑑-ary Cartesian product and add the ICs for each combination:

    𝑆_IC = 𝑆_IC^1 × ··· × 𝑆_IC^𝑑 = {((𝑐_{𝑖_1}, ..., 𝑐_{𝑖_𝑑}), IC(𝑐_{𝑖_1}) + ··· + IC(𝑐_{𝑖_𝑑})) | 𝑖_𝑗 ∈ {1, ..., 𝑛_𝑗}}.

To see why adding ICs can be useful for prioritizing good predicates, suppose we want to minimize the objective function, and that 𝐶_1 = 𝑐_1 and 𝐶_2 = 𝑐_2 have small ICs. Then it is likely that 𝐶_1 = 𝑐_1 ∧ 𝐶_2 = 𝑐_2 has a small value, so we sum the IC values as the sum encodes this property. Finally, we select the 𝑛_init valid predicates with the best combined IC score. Recall that the user defines the direction in which the objective function should be optimized; therefore, we select the predicates with the smallest (largest) IC score if the objective function should be minimized (maximized). If the predicate also contains numerical variables, values are selected at random to initialize the range constraint parameters.

Example 4.2. The ICs for values of the Occupation variable were computed in Example 4.1, 𝑆_IC^Occupation = {(Athlete, 0.5), (Artist, 0.75), (Writer, 1)}, and for Sex we have 𝑆_IC^Sex = {(F, 0.5), (M, 1)}. Next we compute the combined IC score for each combination, 𝑆_IC = {((Athlete, F), 1), ..., ((Writer, M), 2)}. Recall that we want to minimize the objective function, so the smaller the combined IC score the better. Suppose 𝑛_init = 2; then on the first and second iterations of BO we evaluate the predicates Occupation = "Athlete" ∧ Sex = "F" and Occupation = "Artist" ∧ Sex = "F", respectively. Note that Occupation = "Athlete" ∧ Sex = "F" is the best predicate, so adding IC scores can be useful at prioritizing good explanations.
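A sketch of the warm-start selection; it enumerates the full Cartesian product, which is only feasible for modest numbers of distinct values, and the function name is illustrative.

    import itertools

    def warm_start(ic_per_var: dict, n_init: int, minimize: bool = True):
        # d-ary Cartesian product of per-variable (value, IC) pairs; score a
        # combination by the sum of its ICs and keep the n_init best.
        names = list(ic_per_var)
        combos = itertools.product(*(ic_per_var[n].items() for n in names))
        scored = [(dict(zip(names, (v for v, _ in c))), sum(s for _, s in c))
                  for c in combos]
        scored.sort(key=lambda pair: pair[1], reverse=not minimize)
        return [predicate for predicate, _ in scored[:n_init]]

    # Example 4.2: n_init = 2 yields {Occupation: Athlete, Sex: F} (score 1.0)
    # and then {Occupation: Artist, Sex: F} (score 1.25).
    ics = {"Occupation": {"Athlete": 0.5, "Artist": 0.75, "Writer": 1.0},
           "Sex": {"F": 0.5, "M": 1.0}}
    print(warm_start(ics, 2))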
Algorithm 1: BOExplain
Input: Objective function 𝑜𝑏𝑗, data 𝑆, variables 𝐴_1, ..., 𝐴_𝑛
Output: A predicate and the corresponding objective value
1:  foreach categorical variable 𝐶 do
2:      Compute the IC of all unique values in 𝐶
3:  end
4:  Create the parameters and domains
5:  Compute the predicted high-quality combinations based on IC for the warm start
6:  Initialize TPE: perform 𝑛_init iterations using the warm start to create D_{𝑛_init} = {(x_𝑖, 𝑜𝑏𝑗(𝜎_¬x_𝑖(𝑆)))} for 𝑖 = 1, ..., 𝑛_init
7:  for 𝑡 ← 𝑛_init to 𝑛_iter do
8:      Split D_𝑡 into D_𝑡^𝑔 and D_𝑡^𝑏 based on 𝛾
9:      for 𝑖 ← 1 to 𝑑 do
10:         Estimate 𝑔(𝑥) on the 𝑖-th dimension of D_𝑡^𝑔
11:         Estimate 𝑏(𝑥) on the 𝑖-th dimension of D_𝑡^𝑏
12:         Sample 𝑛_EI points from 𝑔(𝑥)
13:         Find the sample x_{𝑡+1} with the highest 𝑔(𝑥)/𝑏(𝑥)
14:     end
15:     Update D_{𝑡+1} ← D_𝑡 ∪ {(x_{𝑡+1}, 𝑜𝑏𝑗(𝜎_¬x_{𝑡+1}(𝑆)))}
16: end
17: return (x, 𝑜𝑏𝑗(𝜎_¬x(𝑆))) ∈ D_{𝑛_iter} with the best objective value
BOExplain algorithm in Algorithm 1. First,the ICs for the categorical variables are computed in lines 1-3. Next,the parameters and domains are created in line 4. In line 5, the ICvalues are used to prioritize predicted high quality predicates, andin line 6 TPE is initialized for 𝑛 init iterations with the predicted highquality predicates. Starting from line 7, we use a model to selectthe next points. In line 8, the previously evaluated points are splitinto good and bad groups based on 𝛾 . Next, from line 9, a value isselected for each parameter. In lines 10 and 11, distributions of thegood and bad groups are modelled, respectively. In line 12, pointsare sampled from the good distribution, and the sampled point withthe largest expected improvement is selected as the next parametervalue (line 13). In line 15, the objective function is evaluated basedon the parameter assignment, and the set of observation-value pairsis updated. Our experiments seek to answer the following questions. (1) Howdoes
BOExplain compare to current state-of-the-art query explana-tion engines for numerical variables? (2) Are the IC encoding andwarm start heuristics effective? (3) How effective is
BOExplain atderiving explanations from source and training data?
5.1 Experimental Settings

Baselines. For SQL-only queries, we compare BOExplain with the explanation engines Scorpion [44] and MacroBase [1, 4], which return predicates as explanations. For inference queries, no predicate-based explanation engines exist, so we compare with a random search baseline [6].

Scorpion [44] is a framework for explaining group-by aggregate queries. The authors define a function to measure the quality of a predicate, which can be implemented as BOExplain's objective function. Each continuous variable's domain is split into 15 equi-sized ranges as set in the original paper. We use the authors' open-source code (https://github.com/sirrice/scorpion) to run the Scorpion experiments.
BOExplain ’s objective Time (seconds)
2D Easy
Time (seconds)
2D Hard
Time (seconds)
3D Easy
Time (seconds)
3D Hard
BOExplain MacroBase Scorpion
Figure 4: Performance comparison with Scorpion and MacroBase. The goal is to maximize the objective function. function. Each continuous variable’s domain is split into 15 equi-sized ranges as set in the original paper. We use the author’s open-source code to run the Scorpion experiments. MacroBase [4] (later, the DIFF SQL operator [1]) is an explana-tion engine that considers combinations of variable-values pairs,similar to a
MacroBase [4] (later, the DIFF SQL operator [1]) is an explanation engine that considers combinations of variable-value pairs, similar to a CUBE query [18], as candidate explanations. The utility of each explanation is determined by one or more difference metrics, each having a utility threshold; explanations that satisfy the utility threshold are output. Section 2.3 of [1] describes how to use the DIFF operator with Scorpion's influence metric; we implemented it in the authors' open-source code (https://github.com/stanford-futuredata/macrobase). The user needs to define how numerical variables are discretized; we tuned the bin size from 2 to 15 and report the best result. In [1], MacroBase was shown to outperform other explanation engines including Data X-ray [43] and Roy and Suciu [33], so we do not compare with these approaches.

Random search is a competitive method for hyperparameter tuning [6]. The parameters are chosen independently and uniformly at random for numerical and categorical variables from the domains described in Section 3.2.
Datasets. The following lists the three datasets and ML pipelines used in our experiments. The pipelines are visualized in Figure 5; a green box marks the data from which an explanation is derived in each pipeline. In Adult and House, an explanation is derived from the training data, and in Credit an explanation is derived from the source data.

Adult income dataset [13]. This dataset contains 32,561 rows and 15 variables from a 1994 census. Figure 5(a) shows the pipeline where the data is prepared for modelling, and a random forest classifier is then used to predict whether a person makes over $50K a year. We split the data into 80% for training and 20% for inference.

House price prediction [11]. This dataset was published already split into training (1,460 rows) and inference (1,459 rows) tables. It contains 79 variables of a house, which are used to train a support vector regression model to predict the house price. The pipeline denoting how to prepare the data for modelling is given in Figure 5(b).
Credit card approval prediction. The source data consists of two tables: application_record (438,557 rows, 18 variables), which contains information about previous applicants, and credit_record (1,048,575 rows, 3 variables), which contains information about the credit history of the applicants. The pipeline to prepare the data for modelling is given in Figure 5(c), and a decision tree classifier is trained to predict whether a customer will default on their credit card payment. We set aside 20% of the data to use for the inference query, and 80% for training.

[Figure 5: ML Pipelines for Adult, House, and Credit. The green box indicates where to generate an explanation. (a) Adult: impute missing values, encode categorical variables and features, then train a random forest classifier. (b) House: impute missing values, combine columns to form new features, one-hot encode, then train a support vector regression model. (c) Credit: take the worst status per ID from credit_record as the label, join with application_record, select features, one-hot encode, bin numerical variables with quantile and equi-range bins, group categorical values, then train a decision tree classifier.]

Performance Measures. To measure the quality of an explanation, we plot the best objective function value achieved by each time point 𝑡. For Scorpion and MacroBase, we plot the objective function value corresponding to their output predicate as a line that begins when the system finishes. To evaluate the effectiveness at identifying data errors, we measure the F-score, precision, and recall on the real-world datasets. We synthetically corrupt data defined by a predicate, and use that data as ground truth. Precision is the number of selected corrupted tuples divided by the total number of selected tuples. Recall is the number of selected corrupted tuples divided by the total number of corrupted tuples. F-score is the harmonic mean of precision and recall. For BOExplain and Random, each result is averaged over 10 runs.
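For concreteness, the measures can be computed as in the sketch below, which assumes tuples are identified by ids; the helper is illustrative and not part of BOExplain.

    def explanation_scores(selected: set, corrupted: set):
        # Precision, recall, and F-score of the tuples selected by a
        # predicate against the ground-truth corrupted tuples.
        tp = len(selected & corrupted)
        precision = tp / len(selected) if selected else 0.0
        recall = tp / len(corrupted) if corrupted else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall > 0 else 0.0)
        return precision, recall, f_score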
Implementation. BOExplain was implemented in Python 3.9. The code is open-sourced at https://github.com/sfu-db/BOExplain. We modify the TPE algorithm in the Optuna library [2] with our optimizations for categorical variables. The ML models in Section 5.3 are created with scikit-learn. The experiments were run single-threaded on a MacBook Air (macOS Big Sur, 8 GB RAM). In the TPE algorithm, we set 𝑛_init = 𝑛_EI = 24.

5.2 Comparison with Scorpion and MacroBase
We compare BOExplain with Scorpion and MacroBase using the synthetic data and corresponding query from Scorpion's paper [44]. The dataset consists of a single group-by variable 𝐴_𝑑, an aggregate variable 𝐴_𝑣, and search variables 𝐴_1, ..., 𝐴_𝑛 with domain(𝐴_𝑖) = [0, 100] ⊂ ℝ, 𝑖 ∈ [𝑛]. 𝐴_𝑑 contains 10 unique values (or 10 groups), each corresponding to 2000 tuples randomly distributed in the 𝑛 dimensions. 5 groups are outlier groups and the other 5 are holdout groups. Each 𝐴_𝑣 value in a holdout group is drawn from a fixed base normal distribution. Outlier groups are created with two 𝑛-dimensional hyper-cubes over the 𝑛 variables, where one is nested inside the other: the inner cube contains 25% of the group's tuples with 𝐴_𝑣 ∼ N(𝜇, 𝜎²), the outer cube contains another 25% with 𝐴_𝑣 drawn from a normal with a shifted mean, and the remaining tuples follow the base distribution. 𝜇 is set to 80 for the "easy" setting (the outliers are more pronounced), and 30 for the "hard" setting (the outliers are less pronounced). The query is SELECT SUM(𝐴_𝑣) FROM synthetic GROUP BY 𝐴_𝑑. The arithmetic expression over the SQL query that forms the objective function to be maximized is defined in Section 3 of [44], with the penalty parameter 𝑐 set as in that paper. We run experiments with 𝑛 = 2 and 𝑛 = 3 search variables (the 2D and 3D settings in Figure 4).

Figure 4 shows that BOExplain outperforms Scorpion and MacroBase in terms of optimizing the objective function in each experiment. This is because BOExplain can refine the constraint values of the range predicate, which enables it to outperform Scorpion and MacroBase, which discretize the range. The results are the same in the easy and hard settings. MacroBase performs poorly because the predicates formed by discretizing the variable domains into equi-sized bins, and computing the cube, do not optimize this objective function. This exemplifies a known limitation of MacroBase: binning continuous variables is difficult [1]. BOExplain also outperforms Scorpion in terms of running time: BOExplain achieves Scorpion's objective function value in around half the time in each experiment.
Note. The focus of this paper is not on SQL-only queries, thus we did not conduct a comprehensive comparison with Scorpion and MacroBase. This experiment aims to show that a black-box approach (BOExplain) can even outperform white-box approaches (Scorpion and MacroBase) for SQL-only queries in some situations.
5.3 Explaining Inference Queries

We evaluate BOExplain's performance at explaining inference queries in various settings. We start with an experiment on Adult where an explanation is derived from the training data (Section 5.3.1), then we investigate BOExplain's approach for categorical variables on House (Section 5.3.2), and finally we evaluate BOExplain in a complex ML pipeline on Credit, where an explanation is derived from the source data (Section 5.3.3).
5.3.1 Explanation From Training Data. In this experiment, we flip the training labels on Adult for the tuples satisfying a range predicate over Education-Num and Age (with 8 ≤ Education-Num and Age ≤ 40), which affects 16% of the training data. On the inference data, we query the average predicted value for the group Male. To assess whether BOExplain can accurately remove the corrupted data, we define the objective function to minimize the distance between the query result on the passed data and the query result if executed on the data after filtering out the corrupted tuples. We use the two numerical search variables Education-Num and Age, which have domains [1, 16] and [17, 90], respectively; the size of the search space is approximately 1.4 × 10⁶.

[Figure 6: Adult: best objective function value, F-score, precision, and recall found at each 5 second increment, averaged over 10 runs. The goal is to minimize the objective function.]

Each method is run for 200 seconds, and the results are shown in Figure 6. By 45 seconds, BOExplain on average achieves an objective function result lower than the result Random reaches at 200 seconds. This shows that it is effective for
BOExplain to exploit promising regions, whereas random search, which only explores the space, cannot find a good predicate as quickly. BOExplain also outperforms Random in terms of F-score and precision. The recall is high for both approaches since it is likely that a predicate is produced with large ranges that cover all of the corrupted tuples. On average, BOExplain (Random) completed 348.9 (295.6) iterations. BOExplain performed more iterations because, as it exploited the promising region, it removed more training data than a random predicate, and so the model took less time to retrain.
5.3.2 Supporting Categorical Variables. In this experiment, we assess BOExplain's method for handling categorical variables on House. The data is corrupted by modifying the tuples satisfying Neighbourhood = 'CollgCr' ∧ Exterior1st = 'VinylSd' together with a range constraint on YearBuilt. To determine BOExplain's efficacy at removing the corrupted tuples, we define the objective function to minimize the distance between the queried result on the passed data and the result of the query issued on the data with the corrupted tuples removed. We use two categorical search variables, Neighbourhood and Exterior1st, which have 25 and 15 distinct values respectively, and one numerical search variable, YearBuilt, which has domain [1872, 2010]. The search space size is approximately 7.2 × 10⁶.

[Figure 7: House: best objective function value, F-score, precision, and recall found at each 5 second increment, averaged over 10 runs. The goal is to minimize the objective function. (IC = Individual Contribution Encoding, WS = Warm Start)]

In this experiment, we compare three strategies for dealing with categorical variables. The first, BOExplain, is our algorithm with both the IC encoding and warm-start (WS) optimizations proposed in Section 4. To determine whether encoding categorical values to integers based on IC and using a numerical distribution is effective, we consider a second approach, BOExplain (w/o IC), which uses the warm-start optimization from Section 4.2 but uses the TPE categorical distribution to model the variables rather than the encoding. The third, BOExplain (w/o IC and WS), is
BOExplain without any optimizations.

Each method is run for 200 seconds, and the results are shown in Figure 7. The benefit of the warm start is apparent, since BOExplain and BOExplain (w/o IC) outperform the other baselines much sooner. Also, BOExplain significantly outperforms BOExplain (w/o IC), which shows that encoding the categorical values, and using a numerical distribution to model the parameter, leads to BO learning the good region, which it can then exploit to optimize the objective function. The F-score, precision, and recall also demonstrate how BOExplain can significantly outperform the baselines. In this experiment, BOExplain completed on average 274.3 iterations, whereas random completed 1148.4 iterations.
5.3.3 Explanation From Source Data. In the last experiment, we derive an explanation from source data on Credit. We corrupt the source data by modifying all applicant records satisfying a range predicate over DAYS_BIRTH and CNT_FAM_MEMBERS, which amounts to 1% of the data. Corrupting the data decreases the accuracy of the model, and we define the objective function to increase the model accuracy. We derive an explanation from the source data table application_record with the variables DAYS_BIRTH and CNT_FAM_MEMBERS, which have domains [-25201, -7489] and [1, 15], respectively; the size of the search space is approximately 7.1 × 10¹⁰.

[Figure 8: Credit: best objective function value, F-score, precision, and recall found at each 5 second increment, averaged over 10 runs. The goal is to maximize the objective function; larger values are better.]

The experiment is run for 200 seconds, and the results are shown in Figure 8. On average, BOExplain completes 246.8 iterations and random search completes 319.6 iterations during the 200 seconds. BOExplain significantly outperforms Random at optimizing the objective function, as BOExplain on average attains an objective function value at 51 seconds that is higher than the average value Random attains at 200 seconds. This shows that exploiting promising regions can lead to better explanations, and that BOExplain is effective at deriving explanations from source data that passes through an ML pipeline. Although random search can find an explanation with high precision, BOExplain significantly outperforms Random in terms of F-score.
6 RELATED WORK

Our work is mainly related to query explanation, ML pipeline debugging, and Bayesian optimization.
Query Explanation. BOExplain is most closely related to Scorpion [44] and the work of Roy and Suciu [34]. Both approaches define explanations as predicates. Scorpion uses a space partitioning and merging process to find the predicates, while Roy and Suciu [34] use a data cube approach. Both systems make assumptions about the aggregation query's structure in order to benefit from their white-box optimizations. In contrast, BOExplain supports complex queries, model training, and user-defined functions. Further, BOExplain is a progressive algorithm that improves the explanation over time. Variations of these ideas include the DIFF operator [1], explanation-ready databases [33], and counterbalances [29]. Finally, a number of specialized systems focus on explaining specific scenarios, such as streaming data [4], map-reduce jobs [21], online transaction processing workloads [46], cloud services [32], and range-radius queries [36].
ML Pipeline Debugging. Rain [45] is designed to resolve a user's complaint about the result of an inference query by removing a set of tuples that highly influence the query result. In contrast, BOExplain removes sets of tuples satisfying a predicate, which can be easier for a user to understand. In addition, BOExplain is more expressive, and supports UDFs, data science workflows, and pre-processing functions. Data X-Ray [43] focuses on explaining systematic errors in a data generative process. Other systems debug the configuration of a computational pipeline [3, 23, 26, 47].
Bayesian Optimization. Bayesian optimization (BO) is used to optimize expensive black-box functions (see [10, 15, 25, 39] for overviews). BO consists of a surrogate model to estimate the expensive, derivative-free objective function, and an acquisition function to determine the next best point. The most common surrogate model is a Gaussian process (GP), but other models have been used, including random forests [20], neural networks [41], Student-t processes [38], and tree-structured Parzen estimators [7, 9]. Expected improvement (EI) [37] is the most common acquisition function.
Categorical Bayesian Optimization. A popular method for handling categorical variables in BO is to use one-hot encoding [14, 16, 17]. However, it does not scale well to variables with many distinct values [35]. BO may use tree-based surrogate models (e.g., random forests [20], tree-structured Parzen estimators [9]) to handle categorical variables, however their predictive accuracy is empirically poor [16, 30]. Other work optimizes a combinatorial search space [5, 12, 31], and categorical/category-specific continuous variables [30]. These works consider only categorical variables or focus on categorical variables with a small number of distinct values, which is unsuitable for query explanation.
7 CONCLUSION

In this paper, we proposed BOExplain, a novel framework for explaining inference queries using BO. This framework treats the inference query along with an ML pipeline as a black box, which enables explanations to be derived from complex pipelines with UDFs. We considered predicates as explanations, and treated the predicate constraints as parameters to be tuned. TPE was used to tune the parameters, and we proposed a novel individual contribution encoding and warm start heuristic to improve the performance on categorical variables. We performed experiments showing that i) BOExplain can even outperform Scorpion and MacroBase for explaining SQL-only queries in certain situations, ii) the proposed IC and warm start techniques are effective, and iii) BOExplain significantly outperformed random search for explaining inference queries.

REFERENCES
[1] Firas Abuzaid, Peter Kraft, Sahaana Suri, Edward Gan, Eric Xu, Atul Shenoy, Asvin Ananthanarayan, John Sheu, Erik Meijer, Xi Wu, et al. 2020. DIFF: a relational interface for large-scale data explanation. The VLDB Journal (2020), 1–26.
[2] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2623–2631.
[3] Cyrille Artho. 2011. Iterative delta debugging. International Journal on Software Tools for Technology Transfer 13, 3 (2011), 223–246.
[4] Peter Bailis, Edward Gan, Samuel Madden, Deepak Narayanan, Kexin Rong, and Sahaana Suri. 2017. MacroBase: Prioritizing attention in fast data. In Proceedings of the 2017 ACM International Conference on Management of Data. 541–556.
[5] Ricardo Baptista and Matthias Poloczek. 2018. Bayesian optimization of combinatorial structures. arXiv preprint arXiv:1806.08838 (2018).
[6] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. The Journal of Machine Learning Research 13, 1 (2012), 281–305.
[7] James Bergstra, Daniel Yamins, and David Cox. 2013. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning. 115–123.
[8] James Bergstra, Dan Yamins, and David D Cox. 2013. Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference, Vol. 13. Citeseer, 20.
[9] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems. 2546–2554.
[10] Eric Brochu, Vlad M Cora, and Nando De Freitas. 2010. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010).
[11] Dean De Cock. 2011. Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education 19, 3 (2011).
[12] Aryan Deshwal, Syrine Belakaria, and Janardhan Rao Doppa. 2020. Scalable Combinatorial Bayesian Optimization with Tractable Statistical Models. arXiv preprint arXiv:2008.08177 (2020).
[13] Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
[14] Matthias Feurer and Frank Hutter. 2019. Hyperparameter optimization. In Automated Machine Learning. Springer, Cham, 3–33.
[15] Peter I Frazier. 2018. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811 (2018).
[16] Eduardo C Garrido-Merchán and Daniel Hernández-Lobato. 2020. Dealing with categorical and integer-valued variables in Bayesian optimization with Gaussian processes. Neurocomputing 380 (2020), 20–35.
[17] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D Sculley. 2017. Google Vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1487–1495.
[18] Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh. 1997. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery 1, 1 (1997), 29–53.
[19] Tim Head, MechCoder, Gilles Louppe, Iaroslav Shcherbatyi, fcharras, Zé Vinícius, cmmalone, Christopher Schröder, nel215, Nuno Campos, Todd Young, Stefano Cereda, Thomas Fan, rene-rex, Kejia (KJ) Shi, Justus Schwabedal, carlosdanielcsantos, Hvass-Labs, Mikhail Pak, SoManyUsernamesTaken, Fred Callaway, Loïc Estève, Lilian Besson, Mehdi Cherti, Karlson Pfannschmidt, Fabian Linzberger, Christophe Cauet, Anna Gut, Andreas Mueller, and Alexander Fabisch. 2018. scikit-optimize/scikit-optimize: v0.5.2. https://doi.org/10.5281/zenodo.1207017
[20] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. 2011. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization. Springer, 507–523.
[21] Nodira Khoussainova, Magdalena Balazinska, and Dan Suciu. 2012. PerfXplain: debugging MapReduce job performance. arXiv preprint arXiv:1203.6400 (2012).
[22] Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In International Conference on Machine Learning. PMLR, 1885–1894.
[23] Rahul Krishna, Md Shahriar Iqbal, Mohammad Ali Javidian, Baishakhi Ray, and Pooyan Jamshidi. 2020. CADET: A Systematic Method For Debugging Misconfigurations using Counterfactual Reasoning. arXiv preprint arXiv:2010.06061 (2020).
[24] Liam Li and Ameet Talwalkar. 2020. Random search and reproducibility for neural architecture search. In Uncertainty in Artificial Intelligence. PMLR, 367–377.
[25] Daniel James Lizotte. 2008. Practical Bayesian Optimization. University of Alberta.
[26] Raoni Lourenço, Juliana Freire, and Dennis Shasha. 2020. BugDoc: A system for debugging computational pipelines. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2733–2736.
[27] Horia Mania, Aurelia Guy, and Benjamin Recht. 2018. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055 (2018).
[28] Alexandra Meliou, Sudeepa Roy, and Dan Suciu. 2014. Causality and Explanations in Databases. Proc. VLDB Endow. 7, 13 (Aug. 2014), 1715–1716. https://doi.org/10.14778/2733004.2733070
[29] Zhengjie Miao, Qitian Zeng, Boris Glavic, and Sudeepa Roy. 2019. Going beyond provenance: Explaining query answers with pattern-based counterbalances. In Proceedings of the 2019 International Conference on Management of Data. 485–502.
[30] Dang Nguyen, Sunil Gupta, Santu Rana, Alistair Shilton, and Svetha Venkatesh. 2020. Bayesian Optimization for Categorical and Category-Specific Continuous Inputs. In AAAI. 5256–5263.
[31] Changyong Oh, Jakub Tomczak, Efstratios Gavves, and Max Welling. 2019. Combinatorial Bayesian Optimization using the Graph Cartesian Product. In Advances in Neural Information Processing Systems. 2914–2924.
[32] Sudip Roy, Arnd Christian König, Igor Dvorkin, and Manish Kumar. 2015. PerfAugur: Robust diagnostics for performance anomalies in cloud services. In 2015 IEEE 31st International Conference on Data Engineering. IEEE, 1167–1178.
[33] Sudeepa Roy, Laurel Orr, and Dan Suciu. 2015. Explaining Query Answers with Explanation-Ready Databases. Proc. VLDB Endow. 9, 4 (Dec. 2015), 348–359. https://doi.org/10.14778/2856318.2856329
[34] Sudeepa Roy and Dan Suciu. 2014. A Formal Approach to Finding Explanations for Database Queries. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (Snowbird, Utah, USA) (SIGMOD '14). Association for Computing Machinery, New York, NY, USA, 1579–1590. https://doi.org/10.1145/2588555.2588578
[35] Binxin Ru, Ahsan Alvi, Vu Nguyen, Michael A Osborne, and Stephen Roberts. 2020. Bayesian optimisation over multiple continuous and categorical inputs. In International Conference on Machine Learning. PMLR, 8276–8285.
[36] Fotis Savva, Christos Anagnostopoulos, and Peter Triantafillou. 2018. Explaining aggregates for exploratory analytics. In 2018 IEEE International Conference on Big Data. IEEE, 478–487.
[37] Matthias Schonlau, William J Welch, and Donald R Jones. 1998. Global versus local search in constrained optimization of computer models. Lecture Notes-Monograph Series (1998), 11–25.
[38] Amar Shah, Andrew Wilson, and Zoubin Ghahramani. 2014. Student-t processes as alternatives to Gaussian processes. In Artificial Intelligence and Statistics. 877–885.
[39] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. 2015. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE 104, 1 (2016), 148–175.
[40] Bernard W Silverman. 1986. Density Estimation for Statistics and Data Analysis. Vol. 26. CRC Press.
[41] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. 2015. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning. 2171–2180.
[42] Jasper Roland Snoek. 2013. Bayesian optimization and semiparametric models with applications to assistive technology. Ph.D. Dissertation. Citeseer.
[43] Xiaolan Wang, Xin Luna Dong, and Alexandra Meliou. 2015. Data X-ray: A diagnostic tool for data errors. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 1231–1245.
[44] Eugene Wu and Samuel Madden. 2013. Scorpion: Explaining Away Outliers in Aggregate Queries. Proc. VLDB Endow. 6, 8 (June 2013), 553–564. https://doi.org/10.14778/2536354.2536356
[45] Weiyuan Wu, Lampros Flokas, Eugene Wu, and Jiannan Wang. 2020. Complaint-driven Training Data Debugging for Query 2.0. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1317–1334.
[46] Dong Young Yoon, Ning Niu, and Barzan Mozafari. 2016. DBSherlock: A performance diagnostic tool for transactional databases. In Proceedings of the 2016 International Conference on Management of Data. 1599–1614.
[47] Jiaqi Zhang, Lakshminarayanan Renganarayana, Xiaolan Zhang, Niyu Ge, Vasanth Bala, Tianyin Xu, and Yuanyuan Zhou. 2014. Encore: Exploiting system environment and correlation information for misconfiguration detection. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems. 687–700.

APPENDIX
A TREE-STRUCTURED PARZEN ESTIMATOR
The tree-structured Parzen estimator (TPE) [7, 9] is a sequential model-based optimization algorithm that uses Gaussian mixture models to approximate a black-box function 𝑓 and the Expected Improvement [37] acquisition function to select the next sample. At a high level, TPE splits the evaluated points into two sets: good points and bad points (as determined by the objective function). It then creates two distributions, one for each set, and finds the next point to evaluate, which has a high probability in the distribution over the good points and a low probability in the distribution over the bad points. We next formally define the algorithm.

Initially, 𝑛_init samples are selected uniformly at random from the search space X, and subsequently a model is used to guide the selection to the optimal location. TPE models each dimension of the search space independently using univariate Parzen window density estimation (or kernel density estimation) [40]. Assume for now that the search space is one-dimensional, i.e., X = [𝑎, 𝑏] ⊂ ℝ. Rather than model the posterior probability 𝑝(𝑦 | 𝑥) directly, TPE exploits Bayes' rule, 𝑝(𝑦 | 𝑥) ∝ 𝑝(𝑥 | 𝑦)𝑝(𝑦), and models the likelihood 𝑝(𝑥 | 𝑦) and the prior 𝑝(𝑦). To model the likelihood 𝑝(𝑥 | 𝑦), the observations 𝐷_𝑡 = {(𝑥_𝑖, 𝑦_𝑖 = 𝑓(𝑥_𝑖))} for 𝑖 = 1, ..., 𝑡 are first split into two sets, 𝐷_𝑡^𝑔 and 𝐷_𝑡^𝑏, based on their quality under 𝑓: 𝐷_𝑡^𝑔 contains the 𝛾-quantile highest quality points, and 𝐷_𝑡^𝑏 contains the remaining points. Next, density functions 𝑔(𝑥) and 𝑏(𝑥) are created from the samples in 𝐷_𝑡^𝑔 and 𝐷_𝑡^𝑏, respectively. For each point 𝑥 ∈ 𝐷_𝑡^𝑔, a Gaussian distribution is fit with mean 𝑥 and standard deviation set to the greater of the distances to its left and right neighbor; 𝑔(𝑥) is a uniform mixture of these distributions. The same process is performed to create the distribution 𝑏(𝑥) from the points in 𝐷_𝑡^𝑏. Formally, for a minimization problem, we have the likelihood

    𝑝(𝑥 | 𝑦) = 𝑔(𝑥) if 𝑦 < 𝑦*, and 𝑏(𝑥) if 𝑦 ≥ 𝑦*,

where 𝑦* is the 𝛾-quantile of the observed values. The prior probability is 𝑝(𝑦 < 𝑦*) = 𝛾.

TPE uses the prior and likelihood models to derive the Expected Improvement [37] (EI) acquisition function. As the name suggests, EI involves computing how much improvement the objective function is expected to achieve over some threshold 𝑦* by sampling a given point. Formally, EI under some model 𝑀 of 𝑓 is defined as

    EI_{𝑦*}(𝑥) = ∫ max{𝑦* − 𝑦, 0} 𝑝_𝑀(𝑦 | 𝑥) d𝑦,    (1)

where the integral is over 𝑦 ∈ (−∞, ∞). For TPE, it follows from Equation 1 that

    EI_{𝑦*}(𝑥) ∝ (𝛾 + (𝑏(𝑥)/𝑔(𝑥))(1 − 𝛾))⁻¹,    (2)

the proof of which can be found in [9]. This means that a point with high probability under 𝑔(𝑥) and low probability under 𝑏(𝑥) will maximize the EI. To find the next point to evaluate, TPE samples 𝑛_EI candidate points from 𝑔(𝑥). Each of these points is evaluated by 𝑔(𝑥)/𝑏(𝑥), and the point with the highest value is suggested as the next point to be evaluated by 𝑓. For a 𝑑-dimensional search space, 𝑑 > 1, TPE is performed independently for each dimension on each iteration. The full TPE algorithm is given in Algorithm 2.
Algorithm 2: Tree-structured Parzen Estimator
Input: 𝑓, X, 𝑛_init, 𝑛_iter, 𝑛_EI, 𝛾
Output: The best performing point found by TPE
Initialize: select 𝑛_init points uniformly at random from X, and create D_{𝑛_init} = {(x_𝑖, 𝑓(x_𝑖))} for 𝑖 = 1, ..., 𝑛_init
for 𝑡 ← 𝑛_init to 𝑛_iter do
    Determine the 𝛾-quantile point 𝑦*
    Split D_𝑡 into D_𝑡^𝑔 and D_𝑡^𝑏 based on 𝑦*
    for 𝑖 ← 1 to 𝑑 do
        Estimate 𝑔(𝑥) on the 𝑖-th dimension of D_𝑡^𝑔
        Estimate 𝑏(𝑥) on the 𝑖-th dimension of D_𝑡^𝑏
        Sample 𝑛_EI points from 𝑔(𝑥)
        Find the sampled point x_{𝑡+1}^𝑖 with the highest 𝑔(𝑥)/𝑏(𝑥)
    end
    Update D_{𝑡+1} ← D_𝑡 ∪ {(x_{𝑡+1}, 𝑓(x_{𝑡+1}))}
end
return (x, 𝑦) ∈ D_{𝑛_iter} with the best objective function value
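To complement Algorithm 2, here is a compact numpy/scipy sketch of a single TPE step in one dimension. It uses scipy's Gaussian KDE with its default bandwidth instead of the per-point bandwidth rule described above, and assumes enough observations fall on both sides of the 𝛾-quantile; it is an illustration, not the library's implementation.

    import numpy as np
    from scipy.stats import gaussian_kde

    def tpe_next_point(xs, ys, gamma=0.25, n_ei=24):
        # One TPE step for minimization: split observations by the
        # gamma-quantile, fit a KDE to each side, sample candidates from
        # g(x), and return the candidate maximizing g(x)/b(x).
        xs, ys = np.asarray(xs, float), np.asarray(ys, float)
        y_star = np.quantile(ys, gamma)
        g = gaussian_kde(xs[ys <= y_star])   # density over the good points
        b = gaussian_kde(xs[ys > y_star])    # density over the bad points
        candidates = g.resample(n_ei).ravel()
        scores = g(candidates) / b(candidates)
        return candidates[np.argmax(scores)]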