VINE: Visualizing Statistical Interactions in Black Box Models
Matthew Britton, Georgia Institute of Technology. E-mail: [email protected]
Fig. 1. VINE overview screen showing the feature space visualization. VINE plots are displayed as small multiples. Position on the X-axis denotes strength of interaction effects, and position on the Y-axis denotes feature importance. Users can click on a chart to open a detailed view with regional model explanations.
Abstract—As machine learning becomes more pervasive, there is an urgent need for interpretable explanations of predictive models. Prior work has developed effective methods for visualizing global model behavior, as well as generating local (instance-specific) explanations. However, relatively little work has addressed regional explanations - how groups of similar instances behave in a complex model - and the related issue of visualizing statistical feature interactions. The lack of utilities available for these analytical needs hinders the development of models that are mission-critical, transparent, and aligned with social goals. We present VINE (Visual INteraction Effects), a novel algorithm to extract and visualize statistical interaction effects in black box models. We also present a novel evaluation metric for visualizations in the interpretable ML space.

Index Terms—Interpretable machine learning, data visualization, feature interactions, evaluation.
1 Introduction
State-of-the-art machine learning algorithms such as neural networks and support vector machines have shown enormous success in modeling complex data. These models are widely regarded as "black box" algorithms, meaning that the reasons why they make a prediction are not clear. There are serious downsides to employing predictive models whose behaviors are not fully understood. A well known example of this hazard was a study of how machine learning could be used to predict pneumonia risk [7, 10]. A rule-based, interpretable model extracted a counter-intuitive result from the dataset: having asthma was found to be a protective factor (i.e. it lowered risk). However, asthma is actually an aggravating factor, a fact so well-known by doctors that patients with the condition typically received aggressive treatment, improving their outcomes. If the authors had instead utilized their black box neural network trained on the same data, this behavior would have gone unnoticed and the model would have placed asthmatics lower in the triage order, potentially costing lives.

A case can be made for both social [46] and economic [47] gains that would be realized with a shift towards interpretable models. The movement towards this lofty goal was accelerated recently with the passage of the European Union's General Data Protection Regulation [11].

Model explanations are commonly divided into global explanations and local explanations. Global explanations aim to describe how a model works in broad terms, for example by listing the top k most important features. For example, take the task of predicting daily ridership for a bike-sharing program, based on features such as weather, day of the week, etc. A global explanation might communicate to the user that warm weather increases ridership. A local explanation, on the other hand, typically focuses on a single data point which has been selected by a process external to the model. For example, a user may want an explanation for why they were denied a loan, or an engineer may want to investigate an unexpected prediction. Local explanations often take the form of counterfactual or what-if statements, such as "ridership would have been 10% higher if it hadn't rained."

Both global and local explanations can give insight into a model. However, they also have clear downsides. Global explanations are by necessity simplifications of the real behaviors that the model performs. Higher temperatures are correlated with increased ridership, but perhaps temperature is irrelevant on the subset of days with pouring rain. Local explanations typically have unclear generalizability: a user does not know if the features that were most important in this prediction will be impactful for other subsets of the data. For a local explanation such as "the temperature for this day (75°) increased ridership by 60 over the average," it is impossible to gain an understanding of how to interpret temperature's role in the prediction for another case.

There is still a wide gap between the information conveyed by state-of-the-art explanation methods and the diverse array of behaviors that a complex predictive model can evince. To address these limitations, we explore a relatively neglected third approach, regional explanations.
A regional explanation describes how an important subset of instances is treated by a model. This behavior is of particular interest when the regional explanation differs from the global explanation, so the regional explanation can be presented alongside a global one. For example, the global explanation "temperature increases ridership" could be complemented by the regional explanation "on weekends, temperature increases ridership, but on weekdays, temperature plays less of a role." These subsets give insights into the distinctions that a model finds salient, which can then be compared to human intuition.

We present VINE (Visual INteraction Effects), a visualization which highlights unusual or noteworthy model behaviors which only apply to a subset of data cases. We base our approach on partial dependence [15], a common method for analyzing the relationship between a feature and model predictions. Relevant subsets are automatically extracted by our algorithm. In addition to a visual depiction of how the model generates a prediction for these subsets, our tool generates a predicate to describe the points in each cluster. The result is a regional explanation that adds nuance to the global explanation.

Finally, how do we evaluate the effectiveness of a particular model visualization? Visualizations in this space are difficult to evaluate, partly because they typically combine a novel algorithm for behavior extraction with a novel visualization. Little work has been done to disentangle the effects of the two. As a result, most evaluations in this space have either employed example use cases constructed by the authors, or purely qualitative evaluations of human participants [13]. While these clearly have their place, the field suffers from a paucity of metrics that allow explanations to be compared quantitatively. To this end, we introduce the Information Ceiling framework, a simple process that attempts to make new predictions based on the information presented in the visualization. By employing the algorithm to compare VINE with partial dependence plots, we find that our visualization outperforms existing methods on multiple datasets. More importantly, we are able to compare visualizations using a simple metric that quantifies their fidelity to the underlying model.

Our contributions in this paper can be summarized as follows:
• A novel algorithm for generating regional explanations of predictive models, based on clustered partial dependence curves;
• An interactive visualization, VINE, which allows the user to explore these explanations;
• The Information Ceiling framework, a method for evaluating the fidelity of any visual model explanation to the underlying model.

2 Background
Our approach is heavily based on existing work [15, 17] on partial dependence plots and other visualizations that use the partial dependence calculation to generate explanations. The implementation and usage of these plots are described below.
Partial Dependence Plots (PDPs) [15] calculate the average prediction across all instances as the value of a single feature is changed, holding all other values constant. PDPs are typically constructed for each feature in a dataset; a sample of PDPs for the bike dataset is presented in Figure 3. Intuitively, the partial dependence curve shows the best guess for a prediction if only one feature value is known. Equation 1 is used to generate a partial dependence curve for a single predictor feature:

$pdp_f(v) = \frac{1}{N} \sum_{i=1}^{N} pred(x_i) \quad \text{with } x_i^f = v$    (1)

N is the number of items in the dataset, pred is the function defined by the predictive model, f is the predictor feature in question, and v is a value in the domain of f. The model is treated as an oracle and generates N curves constructed of M data points each, where M is a hyperparameter that determines the granularity of the explanation; v takes on the values of the M quantiles of f.

Partial dependence is one of the most common ways to communicate how a prediction depends on a single feature. The plots are easy to calculate and interpret, and have become fixtures in open-source [3, 4, 43] and proprietary [19] data-science toolkits. In addition to the traditional line chart, they have also been presented as colored bars [31] and 2-feature heatmaps and contour plots (see Figure 2). Note that PDPs (and other plots in this family) can be presented with the standard scale (in which the Y-axis is read as the predicted value) or as a centered PDP (in which case the Y-axis is read as the change from the average prediction). Figures 3 and 2 present the standard scale; however, we use the centered PDP in the rest of this work.

However, PDPs have three main shortcomings:
• Unreasonable assumption of feature independence. The synthetic data generated by PDPs may be highly unlikely under the joint distribution if the input features are correlated. For example, in a dataset of personal health records, predictions would be generated for children up to the maximum height in the dataset, perhaps 6' tall. The predicted target might be outlandish, and skew the summary curve in regions with low probability mass.
• Heterogeneous effects are obscured by the summary curve. The process of averaging the curves produced for each data point necessarily obscures varying shapes.
• Feature interactions are difficult to separate from the main variable effect. The PDP curve includes all feature interactions, making it difficult to isolate the importance of the feature of interest itself.

Fig. 2. 2-D PDP plots show interactions and main effects for two features. This example shows the interaction between Temperature and Hour of Day for the bike dataset.

Fig. 3. Three PDP plots from the bike dataset, showing the main effects for salient features.
To address PDPs' tendency to obscure heterogeneous effects, [17] presented Individual Conditional Expectation (ICE) plots, which disaggregate the PDP line into its constituent curves, one for each data point in the original dataset. The ICE plot consists of a line plot with one series of predictions for each instance. While ICE plots display the full heterogeneity of effects, they inherit the other weaknesses of PDPs. Moreover, they scale poorly with the number of data cases, as they tend to overplot significantly, obscuring potentially interesting curves.
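To make Equation 1 and the ICE construction concrete, the following sketch (our own illustration, not the VINE source; the function and variable names are assumptions) computes ICE curves and their average, the PDP, for a single feature of any model exposing a scikit-learn-style predict method.

    import numpy as np

    def ice_and_pdp(model, X, feature, num_grid=20):
        """Compute ICE curves and their average (the PDP) for one feature.

        model    : any object with a scikit-learn-style .predict(X) method
        X        : (N, d) array of the dataset
        feature  : column index of the feature of interest
        num_grid : M, the number of grid points (quantiles of the feature)
        """
        # v takes on the M quantiles of the feature (Equation 1).
        grid = np.quantile(X[:, feature], np.linspace(0, 1, num_grid))

        # One ICE curve per instance: re-predict with the feature forced to v,
        # holding all other feature values constant.
        ice = np.empty((X.shape[0], num_grid))
        for j, v in enumerate(grid):
            X_mod = X.copy()
            X_mod[:, feature] = v
            ice[:, j] = model.predict(X_mod)

        # The PDP is the average of the ICE curves (Equation 1).
        pdp = ice.mean(axis=0)

        # Centered versions, read as the change from the mean prediction.
        mean_pred = model.predict(X).mean()
        return grid, ice - mean_pred, pdp - mean_pred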
3 Related Work
In this survey, we review only a relevant subset of the large literature on interpretable machine learning. Specifically, we focus on methods for tabular datasets, and ignore the areas of interpretability for image and text data, in which the notion of a feature is very different. We also narrow our focus to model explanations that are primarily visual.
As previously discussed, one of the major axes in this space divides techniques into global explanations and local explanations. Global explanations present a summary of the model without regard to specific instances or subsets of interest to the user. Inherently interpretable models are one type of global explanation. These models include linear regressions, decision trees, and rule lists. Users can generally comprehend these models by simply reviewing their internals (e.g. the coefficients and bias for a linear regression). While these have been frequently lambasted for subpar performance, a new class of inherently interpretable models [1, 9, 34, 50] has emerged in recent years, promising predictive power comparable to black box models for some problems. Feature importance scores [6, 8, 14] and feature interaction scores [16, 18, 23] are global explanations as well. The latter are discussed in more detail below.

A Partial Dependence Plot [15] is a common form of global explanation. The entire model can be summarized with a series of single line charts (one for each feature in the model). The effect communicated is an aggregate behavior that may not represent the prediction process for any specific instance. ICE plots could perhaps be nominally categorized as a local method, since they bind one encoding (a curve) per data point. However, in practice, overplotting obscures many of the points, and no prior work has provided utilities for a user to inspect a single point's ICE curve. Therefore, they are more accurately viewed as a global explanation that provides some additional information over PDPs.

Many visual analytics systems for model analysis and debugging (see the excellent survey in [22]) employ model summaries as one of the available views. While these systems tend to focus on the internal elements of neural networks or other specific model types, these overviews are another type of global explanation.

Local explanations focus on the prediction for a single data case. The major use cases for these approaches include consumer-oriented applications (why was my loan application denied?) and model debugging. Furthermore, these explanations mirror the techniques humans use to explain causality to each other [39].

Local explanations are sometimes communicated using a counterfactual; for example, "the prediction for this instance would move from negative to positive if feature X changed by Y%." One method for developing these explanations is the Growing Spheres algorithm [32], which identifies the nearest dissimilar prediction in the data space and generates an explanation from the differences in the two points. Prospector [31] uses partial dependence curves to allow users to interactively generate synthetic data points that serve as counterfactuals.

Another class of local explanations uses prototypical examples of correct and incorrect classifications to explain a model [27]. This and other exemplar-based approaches do not provide an explanation per se, but rather operate on the premise that an explanation will be relatively clear to a subject matter expert once the examples are surfaced (e.g. they will note that all of the incorrect classifications had a particular unusual value for a certain feature).

The most well-known local approach is the LIME algorithm [44], which fits a model on a set of points randomly drawn from a Gaussian distribution centered on the dataset's mean. Intuitively, this builds a model that captures the impact of slight movements around the data space centered on an instance of interest. The resulting model (typically linear) is sparse and interpretable.
There are clear downsides to both global and local approaches. Global approaches by definition sacrifice complexity and fidelity to the original model for simplicity. At the same time, local models tend to only be appropriate for specific use cases; a data scientist could not realistically debug a model by generating LIME explanations for 10 random instances out of a dataset of 100,000 records. In other words, existing local approaches provide no indication as to how they generalize beyond the instance in question.

VINE falls under the category of regional explanations, a novel category description under which we believe several pieces of prior work can be fruitfully categorized. Regional explanations split the difference between global and local approaches by describing behaviors that affect significant regions of the data space. This affords more generality than local approaches, but yields more specificity than global approaches. Regional explanations can be thought of as exceptions to global behavior: the global behavior (e.g. a PDP line) applies unless a data case falls in a specific cluster. We define regional explanations as explanations that meet at least one of the following criteria:
• C1: An algorithm identifies a region of the data space in which many of the points share a common behavior in the model. A succinct description of this cluster is provided.
• C2: The common behavior for this data space is described.

Below, we review related work that qualifies under this definition.
Many visual analytics systems provide utilities for users to select an arbitrary subset (or subsets) of interest, either by predicate or direct manipulation. Users can then compare outcomes such as accuracy, or model internals such as nodes in a neural network. The GridViz application was developed by Google to help them understand a model for advertising click predictions by visually comparing slices of the data [38]. MLCubeExplorer displays a wide variety of distribution, prediction, and correlation data about subsets, with the intent of comparing the relative values of two models [26]. ActiVis [25] allows a user to select instances of interest from a visualization of model results, and compare them in a "neuron activation matrix" view that can surface common activation channels.

While these approaches meet criterion C2 above, they do not meet C1, as the subsets are not algorithmically generated. While interactive cohort construction is undoubtedly a useful tool, we argue that these approaches do not extract subsets which the model itself treats differently, and which may or may not correspond to human intuition.

A wide variety of classifiers use a system of rules to make or explain predictions. Often, these rules take the form of a predicate (if feature X <= a, then predict positive). In this section we focus on rule-based methods that are specifically engineered for providing explanations; we do not consider a 10-layer decision tree interpretable by the average human.

One class of rule-based approaches consists of inherently interpretable models that use a series of rules to make a prediction [1, 16, 33]. RuleMatrix [40] uses a similar approach to generate rules that describe an underlying model, then presents the rules in an interactive visualization. Anchors [45] are model-agnostic explanations that use high-accuracy rules to define model behaviors. If a data case meets the criteria defined by a predicate, then it is highly likely that the anchor's predicted value holds, regardless of the values of the other features of the data case. However, the rules extracted by this algorithm do not cover much of the data space.

While these rule-based approaches clearly define regions (C1), the behavior that they define for the region is very coarse, consisting of a single value (the prediction).

Explanation Explorer [28], Rivelo [49], and related tools [30] generate local models for each data point, consisting of a minimal list of features that would strongly affect the prediction if changed. Data points are aggregated into clusters based on having identical or similar local models. Users can then view the details of instances in the cluster, as well as their evaluation metrics (e.g. the number of points predicted for each class, accuracy, etc.). [29] presents Class Signatures, which expand on these methods by clustering instances by feature importance lists AND prediction, thus creating more nuanced groups.

These tools deal exclusively with binary features and a binary target; the data type can be either tabular or text. This approach defines regions (C1) of the dataset, but due to the nature of binary features, there is less need to describe behavior (C2) for the cluster. The authors note that their approach is more fine-grained than feature importance scores.
While this is true, tabular datasets with numerical and ordinal features require more complex expressions of behavior, for which partial dependence curves are well-suited.

Shapley Additive Explanations (SHAPs) [35] leverage a well-established game theory method to generate feature importances [48], and extend this technique to include variables representing feature interactions. These variables are then combined into an additive explanation for each point in the dataset. While this explanation is not sparse on its own, it allows instances to be clustered based on the ordering of feature importance values. The authors annotate their visualizations with hand-curated labels for clusters that are found to correspond to shared real-world explanations (e.g. these data points were predicted to have low income because they are young and single). However, it should be noted that while SHAPs automate the identification of regions (clusters), they do not algorithmically generate sparse explanations for these clusters. Moreover, VINE provides granular visualizations of the interaction behavior due to our use of ICE curves, whereas SHAPs are not primarily a visualization tool.

Regional explanations can also be understood as a form of statistical interaction effect between two features, which occurs when the effects of the two features upon a prediction are non-additive. Prior work has primarily focused on quantifying the strength of these non-additive relationships via interaction scores. While the literature uses a variety of terms for these effects, such as statistical interactions, feature interaction effects, and non-additive interactions, we use the term feature interactions throughout.
Several types of GLMs (Generalized Linear Models) include features that model interaction effects. RuleFit [16] is a modified linear regression that includes interaction terms which are derived from the splits generated by tree ensembles. This method for generating interaction terms, and subsequent pruning with a regularization algorithm, ensures that RuleFit models are relatively sparse.

Another type of GLM is the GAM (Generalized Additive Model). A GAM is essentially a linear model in which each feature can be modified by a link function which enables the model to capture non-linear (say, logarithmic or quadratic) relationships between the feature and the target. A modified version, GA²Ms, adds interaction terms consisting of two features, which are again modified by a link function [34]. The GAMUT visual analytics system [21] uses GAM curves (as well as instance-based explanations) as a model explanation tool, in much the same way as partial dependence curves.

Both GA²Ms and RuleFit present clear explanations for individual features, but suffer from the difficulty of interpreting interaction terms. GAMs are incapable of modeling feature interactions.
One method for measuring interaction strength is the H-statistic [16], which compares the 2-D partial dependence function for two features against the sum of the individual partial dependence functions for each feature. The loss is used to generate the interaction score, as it captures the degree to which additive explanations fail to recapture the target. Partial dependence functions have also been leveraged to calculate feature interactions [18]. This method observes the partial dependence function for feature A at various intervals of feature B, and calculates the variance in the PD function across all points. Intuitively, this method treats features A and B as independent if feature A's importance to the model remains constant regardless of feature B's value. While these methods generate numerical scores, the authors of their respective papers choose to communicate the scores with simple graphics, such as bar charts. Arguably, this is a natural mode of expression for this data. These methods only identify the presence and strength (in terms of average impact on a prediction) of feature interactions. They do not indicate regions of a feature's range in which interactions might be particularly strong or weak, or the shape of the function that expresses the interaction.
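For reference, the pairwise H-statistic of [16] is commonly written in the following form (our notation; PD denotes mean-centered partial dependence functions evaluated at the N observed data points):

$H^2_{jk} = \frac{\sum_{i=1}^{N}\left[ PD_{jk}(x_{ij}, x_{ik}) - PD_j(x_{ij}) - PD_k(x_{ik}) \right]^2}{\sum_{i=1}^{N} PD_{jk}(x_{ij}, x_{ik})^2}$

A value near 0 indicates that features j and k contribute additively to the prediction; a value near 1 indicates that nearly all of their joint effect comes from their interaction.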
In a Variable Interaction Network (VIN) [23], features are displayed in a stylized network graph in which connections indicate the presence of an interaction. This method is notable for its ability to efficiently identify interactions including 3 or more terms. The interactions are identified by an algorithm that uses a permutation method similar to feature importance scores [6] to identify features whose effect changes in the presence or absence of a potential interactor feature. The algorithm then cleverly prunes the search space by using the property that an interaction effect can only exist if all the lower-order effects that involve its features also exist. Similar to the H-statistic, Variable Interaction Networks do not communicate granular detail about the nature of interactions, only their presence.

ICE and PDP plots can be extended to communicate feature interactions, in ways which leverage their visual properties but do not generate interaction scores directly. [15] suggests a heatmap partial dependence plot, in which color encodes the average predicted value for all points in the 2-D space defined by two features. This method visualizes feature interactions as color artifacts, such as sharp gradients or large areas with no variation (see for example [41]). Similarly, ICE plots can encode a second variable as the color of a line [17]. The most simple effect would be a correlation between hue and Y-value, which would indicate that two features have a positive super-additive interaction effect.

Fig. 4. VINE cluster curves overlaid on a plot with PDP and ICE curves to show how clusters capture regional behavior. ICE curves are colored according to an interacting feature.

Partial Importance (PI) plots and Individual Conditional Importance (ICI) plots [8] operate much as PDP and ICE plots but visualize feature importance instead of prediction value. This is a regional approach in the sense that it visualizes the regions of a feature's range in which it impacts predictions. The authors note that high variance between individual curves in an ICI plot suggests the presence of feature interactions.

ALE plots [2] are a solution to the aforementioned tendency of PDPs to generate inaccurate curves where features are highly correlated. ALE plots instead calculate partial dependence from small piecewise segments consisting of points with values in a narrow range, removing the need for synthetic data. These plots address the issue of feature interactions by allowing the user to view the feature's main effect and any interaction effects in separate plots.
The major downside to all of these approaches is that they require significant user time and skill, and there is no predefined threshold for a "significant" feature interaction. A data scientist would likely need to generate a scatterplot matrix of all possible feature combinations, or try them one by one, perhaps with interaction scores or a VIN as a pruning mechanism. While these methods have enormous value in the process of exploratory data analysis, they are less suited for effectively communicating model properties.

Our approach to interaction effects is to present them when they serve as a relevant explanation for model behavior. We split the difference between relatively coarse interaction scores and complex charts. VINE is therefore not a pure feature interaction score and does not directly compete with measures such as the H-statistic. Rather, VINE curates a selection of feature interactions that aids the interpretation of model behavior by providing exceptions to global behavior. A user of VINE interprets a model using the global explanation (a PDP curve) except where a data case meets certain criteria (a region in which feature interaction effects occur). We argue that this is a parsimonious but powerful explanatory technique, capable of communicating both feature interactions and non-linear relationships while not overwhelming the user with many instance-level details.
4 Approach
Our approach is to create a visualization for model explanation that leverages modified ICE plots, and to present these plots in a visual analytics tool called VINE. We generate VINE curves via the following steps:

    for feature F in Features do
        Cluster data using ICE curve slopes as a feature representation
        Generate a predicate for each cluster using a 1-layer decision tree
        Merge clusters with similar explanations
    end for
An example of this algorithm is presented in Figure 4. We believe that this process produces accurate regional explanations for model behavior in the form of partial dependence curves which apply to a subset of the dataset.
To address the issue of overplotting on ICE curves, VINE clusters similar curves and visualizes a centroid curve instead. Note that this is a form of unsupervised clustering on the dataset, but instead of using an instance's feature vector as its representation, we use the X,Y tuples that constitute its ICE curve. We assessed a variety of clustering algorithms and distance metrics with the goal of generating accurate clusters quickly. Accuracy was initially assessed by visually comparing the centroids against the constituent ICE curves to validate that clusters were cleanly separated. In particular, we assessed the following clustering algorithms, using implementations from scikit-learn [43]: DBSCAN, K-Means, Affinity Propagation, Agglomerative Clustering, and Birch. We found that Agglomerative Clustering [52] and Birch [54] both performed acceptably, with Birch running approximately 2.5x faster but producing less cleanly defined clusters. Agglomerative Clustering is used for all examples in the paper, although Birch can be selected as an option when running the script.

A more difficult question was the choice of distance metric for calculating the pairwise distance between ICE curves. Euclidean distance produced groups of curves which a human would clearly recognize as inappropriate. While Dynamic Time Warping produced clusters that appeared highly appropriate to the eye, we were unable to identify a fast implementation. Speed was necessary because all pairwise distances must be calculated, meaning that our algorithm scales in O(K · N²) time, where K is the number of features and N is the number of items in the dataset. We also tried the Slope Similarity algorithm, which compares the Euclidean distance between the slopes of ICE curves instead of their raw points. The Slope Similarity measure produced appealing results as well, and ran in the same time as Euclidean distance, making it an ideal choice for our purposes.

After clustering the ICE curves, we try to provide a human-interpretable explanation for each cluster of curves: what do these clustered curves have in common that differentiates them from the rest of the ICE curves? To answer this question, we used a 1-deep decision tree to predict membership in that cluster against all other points (one-vs-all). This simple model identifies the feature and split value that most reduces the entropy between the curves in the cluster and those outside of the cluster. Intuitively, this split represents a good explanation for what characteristics make the cluster unique.
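A minimal sketch of the clustering and predicate-generation steps, assuming the ICE curves have already been computed as in the earlier snippet; the scikit-learn calls match the algorithms named above, but the function names and the way the 1-deep tree is read out are our assumptions:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.tree import DecisionTreeClassifier

    def cluster_ice_curves(ice, n_clusters=5):
        """Cluster ICE curves by the similarity of their slopes (Slope Similarity)."""
        slopes = np.diff(ice, axis=1)              # represent each curve by its slopes
        labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(slopes)
        centroids = np.vstack([ice[labels == c].mean(axis=0)   # one VINE curve per cluster
                               for c in range(n_clusters)])
        return labels, centroids

    def explain_cluster(X, labels, cluster_id, feature_names):
        """Describe one cluster with a 1-deep decision tree (one-vs-all)."""
        y = (labels == cluster_id).astype(int)
        tree = DecisionTreeClassifier(max_depth=1).fit(X, y)
        feature = feature_names[tree.tree_.feature[0]]   # feature chosen at the root split
        value = tree.tree_.threshold[0]                  # split value
        accuracy = tree.score(X, y)
        # The predicate reads "feature <= value" or "feature > value", depending on
        # which side of the split captures the cluster members.
        return feature, value, accuracy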
One difficulty with our method was in choosing the appropriate number of clusters for each dataset. We simulated exploratory data analysis (EDA) with early versions of the tool and found that some features would produce 5 or more distinct clusters of behavior (itself an interesting result), but that for other features, many of the cluster explanations would be duplicative, or nearly so (e.g. two clusters whose explanations split the Weight feature at nearly the same value). We therefore merge clusters with near-duplicate explanations:

    for cluster c_i in C_1 to C_N do
        for cluster c_dupe in C_{i+1} to C_N do
            if c_i.feature = c_dupe.feature ∧ c_i.direction = c_dupe.direction then
                if (c_i.value − c_dupe.value) / (f_max − f_min) ≤ threshold then
                    Merge clusters
                end if
            end if
        end for
    end for

Here, each cluster's explanation has a "feature" property (the feature used to define the split), a "direction" property (<= or >), and a "value" property (e.g. 3), which together define a predicate (e.g. Weight <= 3); f is the feature for which the plot is being generated, and f_max and f_min denote its range.
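In Python, the merge rule above might look like the following sketch (the tuple representation of an explanation and the threshold name are our assumptions; the paper's specific threshold value is not reproduced here):

    def find_duplicate_explanations(explanations, f_min, f_max, threshold):
        """Return index pairs of clusters whose explanations are near-duplicates.

        explanations : list of (feature, direction, value) tuples, one per cluster
        f_min, f_max : range used to normalize differences between split values
        threshold    : merge threshold (a hyperparameter)
        """
        pairs = []
        for i in range(len(explanations)):
            for j in range(i + 1, len(explanations)):
                feat_i, dir_i, val_i = explanations[i]
                feat_j, dir_j, val_j = explanations[j]
                # Merge only explanations that split the same feature in the same
                # direction and whose split values are close relative to the range.
                if feat_i == feat_j and dir_i == dir_j:
                    if abs(val_i - val_j) / (f_max - f_min) <= threshold:
                        pairs.append((i, j))
        return pairs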
5 Implementation

Our algorithm was built in Python 2.7, using standard machine learning libraries, including Numpy, Pandas, Scipy, and Scikit-Learn [43]. In addition, the original code for calculating PDP and ICE curves was forked from the PyCEBox library [3], though it has been heavily modified in our implementation. We also employed the sklearn-gbmi package [20] to calculate H-statistics. The charts in the paper were generated with Altair [51] and Matplotlib [24]. The VINE visual analytics system was built in HTML using D3.js [5]. It consumes a JSON file that is output by the Python script.

VINE initially presents the user with a feature space visualization designed to communicate the relevance of each feature to the model (see Figure 1). VINE charts are presented as small multiples, one per feature. The X-axis indicates the strength of feature interactions. The Y-axis indicates the overall feature importance. This allows the user to quickly familiarize themselves with the dataset and its salient features. Both the charts themselves and their position in the feature space draw the eye to interesting patterns. For example, in Figure 1, Hour of Day is clearly the most important (topmost) feature, which can be verified by checking its Y-axis scale. Work Day, on the bottom right, is not a particularly important feature most of the time, but it does have one interaction (the red bar) which produces an outsize effect. Because this effect is so different from the PDP term, Work Day has a strong feature interaction score and occupies a sparse corner of the feature space. Wind Speed, on the other hand, has no VINE curves at all, and so appears on the left side of the graph.

The feature interaction strength (X-axis) is calculated as the sum of Dynamic Time Warping distances between each VINE curve and the PDP curve, normalized by the maximum value of the PDP curve. Feature importance (the Y-axis) is determined by the standard deviation of the PDP curve. The position should be taken as a rough approximation, as a force layout is used to prevent overplotting of the small multiples.

Users can select a feature to enlarge the chart, which makes the explanations visible. VINE charts are displayed in the same manner as PDP and ICE plots. The VINE chart for feature A will have feature A's range as the X-axis. The Y-axis depicts the change compared to the mean prediction. We chose to mean-center each plot to enable an additive interpretation, i.e. for a given data point, a user would sum the values from each plot to arrive at a prediction, rather than the traditional PDP which requires the values to be averaged. The partial dependence curve is presented as a black line. Each colored line represents a VINE cluster and is calculated as the centroid of all its constituent ICE curves (in other words, a partial dependence curve for the subset). The width of each VINE curve encodes the size of its cluster, but is log-scaled for readability purposes. Clicking on a VINE curve reveals all its constituent ICE curves. This allows the user to visually inspect the quality of each VINE curve.

Binary features are presented as bar charts instead of lines, to aid in their interpretation and visually distinguish them from numeric features. However, the underlying VINE algorithm is applied identically to each feature. The bar charts use the same color scheme as the line charts, with black corresponding to the PDP. A bar should be interpreted as the change in prediction incurred by increasing a feature from 0 to 1.
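As an illustration of how the feature-space coordinates described above might be computed, here is a sketch; a plain dynamic programming DTW is used so the example stays self-contained, and the normalization guard and helper names are our assumptions:

    import numpy as np

    def dtw_distance(a, b):
        """Classic O(len(a) * len(b)) dynamic time warping distance between two curves."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    def feature_space_coordinates(pdp, vine_curves):
        """Coordinates of one feature in the VINE overview (Figure 1).

        X: interaction strength, the sum of DTW distances between each VINE curve
           and the PDP, normalized by the maximum value of the PDP curve.
        Y: importance, the standard deviation of the PDP curve.
        """
        interaction = sum(dtw_distance(curve, pdp) for curve in vine_curves)
        interaction /= max(abs(pdp).max(), 1e-12)   # guard against a flat PDP
        importance = pdp.std()
        return interaction, importance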
Table 1. Datasets used in our evaluation. The Bike dataset is available at [12]. Other datasets were loaded from scikit-learn [43].

Name           | Features | Instances | Model r² | Target
Diabetes       | 10       | 442       | 0.912    | Disease severity
Boston Housing | 13       | 506       | 0.987    | Median house price
Bike Sharing   | 11       | 17,379    | 0.907    | Hourly bike rentals
Lastly, the histograms on the right side provide a visual depiction of the explanation for each cluster. The histograms can be mapped to a VINE curve based on color. One or more columns will be displayed depending on the number of features that appear in explanations. In Figure 9, the Hour of Day feature serves as the best explanation for two of the three VINE curves. The darker green region of the histogram conveys both the range defined by the explanation and the density of points in that region. The text of the definition, the size of the cluster, and its accuracy are displayed in the top right-hand corner of the chart.
6 Evaluation
We evaluated the VINE algorithm on three benchmark tests, including an application of the Information Ceiling framework. Each test was performed on three datasets using a regression target and a single model fit for this task.
VINE was evaluated on three tabular datasets with numerical, ordinal, and categorical features. Pre-processing consisted of one-hot encoding any categorical features. Ordinal features, such as Month for the Bike dataset, were left as is. These datasets did not have missing or erroneous values, so no imputation was performed. Due to the choice of a tree-based model, normalization/standardization was not necessary. The version of the Bike dataset stored in the UCI repository has several standardized features; these were transformed back to their original domain for readability purposes. Several features were removed from the Bike dataset [12] in order to produce a more intelligible model. Weekday and holiday were removed because they were raw versions of the engineered Workingday feature. Dteday and Month were removed for similar reasons, because they were better represented by the Season feature. The Casual and Registered variables were removed because they are alternate regression targets, and highly correlated with the Cnt target. Feature names for the Bike dataset have been changed to make them more human-readable for figures and use cases in this paper.

For all datasets, a Gradient Boosting Regressor was used. Each regressor used 300 trees and a minimum leaf size of 100 to prevent overfitting. The accuracy of each model is generally high and is reported in Table 1. Hyperparameters were manually selected to produce decently accurate models that were faithful to the underlying dataset, but beyond these basic measures, no attempt was made to identify an optimal model. For our purposes, the model is of more interest than the dataset or the relationship between the two. For this same reason, the entire dataset was used to fit the model, as there is no use for a test set in our evaluation. That said, VINE should perform similarly on unseen data as long as it was drawn from the same distribution as the data used to fit the model.

Note that for all three datasets, we chose a regression problem as our task. However, we believe that binary or multi-class classification problems can easily be tackled with VINE as well, as PDP and ICE curves are also suitable for these tasks. The major difference for these tasks is that the interpretation of the Y-axis changes from "change in prediction" to "change in probability of a given class".
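For concreteness, the modeling setup described above corresponds roughly to the following scikit-learn code (a sketch; the Diabetes loader stands in for any of the three datasets, and hyperparameters not stated in the text are left at library defaults):

    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import GradientBoostingRegressor

    X, y = load_diabetes(return_X_y=True)

    # 300 trees and a minimum leaf size of 100, as described above.
    model = GradientBoostingRegressor(n_estimators=300, min_samples_leaf=100)
    model.fit(X, y)                    # the full dataset is used; no test set is needed

    print(model.score(X, y))           # the r^2 values reported in Table 1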
We first attempted to evaluate the efficacy of our algorithm for generating clusters and their corresponding explanations. We sought to ensure that our cluster explanations were accurate and that they outperformed a nominal baseline approach. The purpose of this check was to demonstrate that our decision tree would not simply overfit random clusters to a nonsensical explanation, and that a real signal must be present in the cluster in order for a high-accuracy explanation to be generated.

Fig. 5. The process for calculating the Information Ceiling metrics for PDP, VINE, and ICE plots. (A) Generate predictions for one point at a time. (B) Identify the relevant curve. For VINE, select the curve whose predicate includes the point. For ICE, find the ICE curve corresponding to that data point. (C) Get the ΔY value from the appropriate curve at the point's feature value. (D) Sum the results of steps A-C for each feature. This is added to the mean prediction and compared to the model prediction to find the loss.

To evaluate the explanations, we compared the data points contained in each cluster (set A) with the data points returned by filtering the dataset on the cluster explanation (set B). By treating set A as a training set and set B as the model output, we were able to apply traditional accuracy, precision, and recall metrics. For this evaluation, we set the hyperparameter for number of clusters to 5. To generate a baseline comparison, we used the following method to generate random clusters:

    for feature f_i in F_1 to F_N do
        Partition the dataset into 5 clusters of random size
        Fit a decision tree to each cluster, as in the VINE algorithm
        Calculate the accuracy, precision, and recall as discussed above
    end for

We believe that the explanations our method returns should be consistent with existing methods that quantify feature importance and feature interactions. Assuming that feature A interacts strongly with features B, C, and D according to a measure such as the H-statistic [16] or Greenwell's partial dependence interaction [18], then we expect to see that cluster explanations for feature A will include features B, C, and/or D, allowing for the possibility that other features may be included as well, due to the fact that an interaction may only be strong in a narrow range. To this end, we evaluate our cluster explanations using Friedman's H-statistic, which generates a score between 0 and 1 for each pair of features. 0 indicates no interaction between the two features, and 1 indicates that the features have no main effects, but rather that their entire impact on the prediction is generated from their interaction. For a given Feature A, we generate a list of features that appear in its cluster explanations (list A). We compare list A against the list of feature interactions, ordered by the H-statistic (list B).

Given that one of the issues with the H-statistic is the lack of a well-established threshold for determining significance, we chose to ignore the values themselves and instead calculate the number of elements from list A that appear in the top 3 features of list B. We then sum this count across all features in the model, and normalize it by the total number of clusters generated by VINE. The result can be interpreted as the percentage of explanations that utilize a strongly interacting feature. We also present the baseline probability that features would have appeared among the top 3 interactors if they were chosen at random (this probability is constant for each dataset, equal to 3 / number of features).
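A sketch of the set A versus set B comparison described earlier in this subsection, treating cluster membership as the ground truth and the predicate filter as the prediction (the function and variable names are our assumptions):

    import numpy as np
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    def evaluate_explanation(X, labels, cluster_id, feature_idx, direction, value):
        """Compare cluster membership (set A) against the predicate filter (set B)."""
        set_a = (labels == cluster_id)                  # points in the cluster
        if direction == "<=":
            set_b = X[:, feature_idx] <= value          # points matching the predicate
        else:
            set_b = X[:, feature_idx] > value
        return (accuracy_score(set_a, set_b),
                precision_score(set_a, set_b),
                recall_score(set_a, set_b))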
We introduce a novel framework, the Information Ceiling, for evaluating the fidelity of any visual model explanation to its underlying model. For the tabular regression problems presented here, the metric simply consists of the r² (the Coefficient of Determination) between the model's predictions and our algorithm's predictions as it tries to simulate the human sensemaking process afforded by the model visualization in question. The tricky part here is to describe and systematize a process by which a consumer of a visualization would use it to make a prediction. Nonetheless, as this is one of the most common human tasks used to evaluate visualizations [13], we argue that it behooves the designer of model visualizations to build them according to standard human-computer interaction principles, with specific tasks in mind.

Luckily, for VINE curves and other plots in the PDP family, a fairly simple method presents itself for making predictions based on the explanation. For the PDP, the chart for Feature A allows the user to identify the value contributed to the prediction at any point on the X-axis (i.e. the range of Feature A). To find this component of a prediction for instance X, a user simply has to find instance X's value for Feature A, find that point on the X-axis, and follow it up to the PDP line. This will yield Feature A's contribution to the prediction. The user can then sum the results of this process for each feature in the dataset, add the sum to the mean value of the target variable, and yield a prediction based on the PDP curve. This process is summarized in Figure 5.

While it is unlikely that a user would perform this exact task in practice, a heuristic version is more likely. A user would notice that an instance of interest has high values for Features A, C, and D, and remember that the PDP curved sharply upwards for Features A and C. The user would add some estimated amount to an average value for the target, and produce a prediction in this manner. This method is recommended in [23] as a workflow for data scientists when using partial dependence plots to analyze a model.

This method can easily be extended to ICE and VINE plots, as summarized in Figure 5. For ICE curves, the user simply selects the particular curve for the instance of interest instead of a PDP line. For VINE, they select (much more easily) the VINE curve whose predicate matches their instance. For VINE, two edge cases must be considered: (1) when a point matches 2 or more predicates, we take the mean of each of their predictions, and (2) when a point doesn't match any predicate, we use the PDP line for prediction instead.

This method is easy for an algorithm to simulate when presented with the data that underlies each of the curves. It should be noted that we do not expect any user to derive predictions as accurately as our algorithm can. Instead, we treat our metric as the upper limit on prediction fidelity (or a lower bound on error) that could possibly be achieved by interpreting the visualization in this way. For this reason, we refer to this evaluation framework as the Information Ceiling.
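A minimal sketch of the Information Ceiling computation for the PDP case, assuming centered PDP grids and curve values are available for every feature (the helper name and the use of linear interpolation between grid points are our assumptions); the ICE and VINE variants differ only in which curve is looked up in step (B) of Figure 5:

    import numpy as np
    from sklearn.metrics import r2_score

    def pdp_information_ceiling(model, X, grids, pdp_curves, mean_prediction):
        """Upper bound on prediction fidelity obtainable from a set of centered PDPs.

        grids[f], pdp_curves[f]: the X-axis grid and centered PDP values for feature f.
        mean_prediction: the mean model prediction over the dataset.
        """
        simulated = np.full(X.shape[0], mean_prediction, dtype=float)
        for f in range(X.shape[1]):
            # (C) read the delta-Y value off feature f's curve at each instance's value
            simulated += np.interp(X[:, f], grids[f], pdp_curves[f])
        # (D) compare the summed, mean-anchored estimates to the model's own predictions
        return r2_score(model.predict(X), simulated)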
7 Results

We report performance on three algorithmic benchmark tests across three datasets. Each of the benchmarks was devised specifically for this paper. A Jupyter notebook with the full code required to reproduce all results, charts, and tables in this publication is available at . Instructions, code, datasets, and other files necessary to run VINE as a standalone tool are also available at this URL.
Figure 6 indicates that VINE cluster explanations more accurately describe real subsets than randomly chosen subsets. We take this as evidence that VINE explanations detect real descriptions of subsets, and do not simply fit noise.
Table 2 presents the results of the H-statistic experiment. The feature used in VINE explanations occurs in the top 3 interactors (sorted by H-statistic) about twice as often as we would expect it to if features were selected randomly. This suggests that the VINE algorithm successfully measures feature interactions. Note that the H-statistic calculation (and the random baseline) are non-deterministic, so results will vary across iterations. Results for one pass are reported.

Fig. 6. VINE cluster explanations are far more accurate when generated on real as opposed to random clusters.

Table 2. Explanations used in VINE tend to have high H-statistic values.

Dataset        | % in Top 3 | Baseline % with Random Assignment
Diabetes       | 60.5%      | 30%
Boston Housing | 64.2%      | 23.1%
Bike Sharing   | 65.7%      | 27.3%
Our Information Ceiling method shows that VINE curves have higher fidelity to the model than PDPs (see Figure 7). In addition, our method outperformed Individual Conditional Expectation plots in two of the three datasets. We conclude that our method can be considered a more accurate representation of a model's behavior than PDPs. In addition, it appears that the method for calculating individual conditional expectation has fundamental limitations, which may be caused by the aforementioned issues with extrapolation. Even when a prediction is generated for a data point based on its own ICE curve, the prediction is scarcely better than the PDP line (for two of the three datasets). We hypothesize that when VINE aggregates ICE curves, it averages out instabilities, which is an ample tradeoff for the loss in specificity.
8 Discussion
Our contribution consists of (1) an algorithm that clusters ICE curves based on shape similarity and generates a human-readable label for that subset, (2) a visual analytics tool that facilitates model interpretation and sensemaking using VINE explanations, and (3) a framework for evaluating visual explanations of machine learning models based on the loss that an automated method incurs when using them as a basis for prediction.
Fig. 7. Information Ceiling of PDP, ICE, and VINE plots by dataset. VINE consistently outperforms PDP.

Fig. 8. VINE plot for the Hour of Day feature in the bike dataset. (A) The main plot shows the PDP as a black line, and VINE curves as various colors. (B) The sidebar uses matching-color histograms to visualize the explanation for that subset. (C) The blue curve visualizes an important insight: the typical rush hour peak pattern does not exist on weekends.

(1) Our algorithm is completely model-agnostic. The only requirement is that the model's prediction function be passed into the export method, and that this prediction function uses the same API as scikit-learn [43]. (2) VINE curves extract salient feature interactions and give detailed information about how they affect predictions. Identifying these feature interactions is as simple as reading the chart, and does not require a detailed statistical analysis. (3) The Information Ceiling framework allows us to compare the validity of multiple visualizations in the partial dependence family for the first time.

(1) Our approach is currently limited to tabular data and does not work for text, image, or video data. (2) Our approach works best when most features in a dataset are numerical. Ordinal, categorical, and Boolean features are supported, but existing methods [28-30, 49] are better adapted to this task. In particular, one-hot encoding a categorical variable or creating a vectorized text representation can create a confusing array of features. (3) Large datasets (>
Month attribute of the bike dataset. Data cases with a
Season ofSpring and a month from July-December had markedly lower predictedridership than the PDP average. However, this combination of features(Spring in December) is impossible. A data scientist could take thisinsight and either build a validation rule for data intake, or more likelydrop one of the two highly correlated features.Another use case is the extraction of insights. Our tool can partiallyautomate or supplement exploratory data analysis. In Figure 8, ananalyst viewing the
Hour feature in the bike dataset would note thatVINE has found two regions of interest. The blue VINE curve is forweekends/holidays (
Workingday =0). The PDP curve shows a largebump in ridership at the morning and evening rush hours. However, forweekends, this effect is far less pronounced, with ridership increasingsteadily but less sharply. The insight that weekday and weekend rider-ship patterns are fundamentally different is presumably valuable for abike-sharing company. These insights are extracted without the userbeing aware of the importance of the
Workingday feature or makingany intentional effort to analyze it.Figure 8 visualizes the effect of the
Feels Temperature (temperature+ wind chill) on ridership. The PDP curve indicates that the model pre- ig. 9. Insights into the effect of temperature on ridership for the bikedataset. (A) The model predicts a spike in ridership around 70 degrees.This behavior is particularly strong in the winter months. (B) However,this is a clear exception to this behavior. Ridership in the early afternoonand morning does not spike. dicts a large spike in ridership around 75 degrees. However, the VINEcurves reveal a more nuanced story. Later afternoon and evening rider-ship (the blue curve) spikes higher, while early afternoon and morningridership (the red curve) stays mostly flat until the temperature becomesvery hot. Moreover, the green curve indicates that on warm winter days,ridership spikes particularly high and at a lower temperature. A modelexplanation communicated with only the PDP curve might convincethe bike-sharing company to reduce the size of the fleet during thewinter. VINE flags the potential for highly profitable warm winter days.It is unlikely that this correlation would have been discovered unlesssomeone thought to check for it explicitly.
We believe that the Information Ceiling metric can be used to validate the effectiveness of a wide array of visualizations in the interpretable ML space. While we only consider PDP, VINE, and ICE plots here, it would be trivial to compare ALE plots too. Commentators [41] have noticed that whereas PDP plots suffer from extrapolation into sparse areas of the conditional distribution, ALE plots can suffer from a related tradeoff between accuracy and stability when setting the hyperparameter for the number of intervals. An easy way to determine the superior method is to evaluate both using our method, which quantifies fidelity to a model.

Beyond visualizations in the partial dependence family, our Information Ceiling framework could also be used to evaluate explanations such as RuleMatrix [40], in which case the algorithm would simply scan through rules in the order they are presented in the visualization until it found a matching predicate for a given instance. Similarly, Gamut [21] or other GLMs can be evaluated in much the same way as PDPs, essentially using a feature plot as a lookup table for each instance and then adding predictions together.

Pushing the envelope further, it should be possible to evaluate LIME [44], creating a direct comparison between global and local explanations for the first time. One approach, based on our personal model interpretation workflows, is as follows (see the sketch at the end of this section): (1) generate k cluster centroids using a method such as k-medoids, (2) build a LIME model for each centroid, and (3) make a prediction for an instance by finding the nearest centroid and using its LIME model. Clearly, there are many undefined parameters here, such as the value of k or the distance metric to use. It is likely that the human sensemaking process for this task is difficult to replicate as an algorithm. However, we argue that there is value in investigating this process, under the assumption that it will not be possible to design a good model visualization (or indeed, any visualization) without a sense of its intended use.

It should be stressed that we do not recommend evaluating explanations solely by our method. Our method is not capable of measuring the aesthetic value or ease of interpretation of an explanation, only its information content. We believe our Information Ceiling framework can instead set a ceiling on the understanding that a human can glean from an explanation. It should be noted that this is not a new concern in information visualization; prior research has investigated the fidelity of visualization-generating algorithms such as t-SNE [53] or even histograms [36] to the underlying data.

Other research [42] has investigated the design and perceptual factors involved in model visualization. This research is a necessary complement to our work, addressing how to "make the most" of the Information Ceiling that a visual model explanation affords. Performance on any prediction task performed by a human can be compared directly to the Information Ceiling, and the loss can be explained by either design issues with the visualization, or human perceptual limitations (e.g. people have been shown to exaggerate certain features of line charts and downplay or excise others [37]).
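As a rough illustration of the LIME-based procedure proposed above, the following sketch uses KMeans and a ridge-regression local surrogate as stand-ins for k-medoids and LIME; these substitutions, the sampling scale, and all names are our assumptions rather than a prescribed design:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import Ridge

    def build_regional_surrogates(model, X, k=5, n_samples=500, scale=0.1, seed=0):
        """(1) find k centroids, (2) fit a local linear surrogate around each."""
        rng = np.random.RandomState(seed)
        kmeans = KMeans(n_clusters=k, random_state=seed).fit(X)
        surrogates = []
        for center in kmeans.cluster_centers_:
            # Perturb the centroid and fit a linear model to the black box outputs.
            samples = center + rng.normal(scale=scale * X.std(axis=0),
                                          size=(n_samples, X.shape[1]))
            surrogates.append(Ridge().fit(samples, model.predict(samples)))
        return kmeans, surrogates

    def predict_via_nearest_surrogate(kmeans, surrogates, x):
        """(3) predict with the surrogate belonging to the nearest centroid."""
        centroid_id = kmeans.predict(x.reshape(1, -1))[0]
        return surrogates[centroid_id].predict(x.reshape(1, -1))[0]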
9 Future Work

• Investigate the effectiveness of partial dependence across datasets. While it cannot be proven from this limited study, the far higher performance of the ICE plot on the Diabetes dataset suggests that the fidelity of partial dependence curves may be contingent on some unknown property of a dataset, such as the presence of multi-collinearity. The Information Ceiling provides an ideal tool to probe the limitations of partial dependence methods and the impact of violating their assumptions. Given the wide use of the technique, this meta-learning could be valuable.
• Evaluation. We present a novel evaluation framework for model visualizations in which we seek to quantify their information content. We hope that this method can be used both to evaluate the fundamental validity of other techniques in interpretable ML and to guide future studies into human sensemaking with predictive models. We also hope that VINE can be evaluated in situ to determine its utility for data scientists.
10 Conclusion
We present VINE, an interactive visualization that communicates regional explanations for models built to make predictions on tabular data. Our approach leverages existing work on partial dependence plots to derive groups of instances whose behavior in a given model significantly differs from the mean feature effect. These regional explanations also capture feature interaction effects in a novel manner. We argue that our approach provides a useful complement to global explanations by identifying caveats to the main behaviors, and also complements local explanations by aggregating similar instances and distilling common contributors to their prediction. Our approach has applicability to model interpretation and explanation, model debugging, algorithmic fairness, and adversarial artificial intelligence.

We demonstrate that our algorithm produces explanations that have high mean accuracy in describing relevant subsets. We also provide example use cases that demonstrate how VINE can facilitate model exploration on a real dataset. We evaluate VINE against PDPs using a novel evaluation framework (Information Ceiling) and find that VINE more faithfully replicates the predictions made by the model. We conclude by discussing ways in which the Information Ceiling approach can be used to quantify the effectiveness of visualizations in the interpretable ML space.

Acknowledgments
The authors wish to thank Fred Hohman and Andrea Hu.

REFERENCES
[1] E. Angelino, N. Larus-Stone, D. Alabi, M. Seltzer, and C. Rudin. Learning certifiably optimal rule lists for categorical data. The Journal of Machine Learning Research, 18(1):8753–8830, 2017.
[2] D. W. Apley. Visualizing the effects of predictor variables in black box supervised learning models. arXiv preprint arXiv:1612.08468, 2016.
[3] R. Austin. Pycebox, Jan 2018.
[4] P. Biecek. DALEX: Explainers for complex predictive models in R. Journal of Machine Learning Research, 19(84):1–5, 2018.
[5] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-driven documents. IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis), 2011.
[6] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[7] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM, 2015.
[8] G. Casalicchio, C. Molnar, and B. Bischl. Visualizing the feature importance for black box models. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 655–670. Springer, 2018.
[9] C. Chen, K. Lin, C. Rudin, Y. Shaposhnik, S. Wang, and T. Wang. An interpretable model with globally consistent explanations for credit risk. arXiv preprint arXiv:1811.12615, 2018.
[10] G. F. Cooper, V. Abraham, C. F. Aliferis, J. M. Aronis, B. G. Buchanan, R. Caruana, M. J. Fine, J. E. Janosky, G. Livingston, T. Mitchell, et al. Predicting dire outcomes of patients with community acquired pneumonia. Journal of Biomedical Informatics, 38(5):347–366, 2005.
[11] Council of European Union. Council regulation (EU) no 2016/679, 2016. https://eur-lex.europa.eu/legal-content/EN/TXT/?qid=1528874672298&uri=CELEX%3A32016R0679.
[12] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017.
[13] F. Doshi-Velez and B. Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
[14] A. Fisher, C. Rudin, and F. Dominici. Model class reliance: Variable importance measures for any machine learning model class, from the Rashomon perspective. arXiv preprint arXiv:1801.01489, 2018.
[15] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pp. 1189–1232, 2001.
[16] J. H. Friedman, B. E. Popescu, et al. Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3):916–954, 2008.
[17] A. Goldstein, A. Kapelner, J. Bleich, and E. Pitkin. Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics, 24(1):44–65, 2015.
[18] B. M. Greenwell, B. C. Boehmke, and A. J. McCarthy. A simple and effective model-based variable importance measure. arXiv preprint arXiv:1805.04755, 2018.
[19] H2O.ai. Python interface for H2O3, Mar 2019. Version 3.10.08.
[20] R. Haygood. sklearn-gbmi, Jan 2017.
[21] F. Hohman, A. Head, R. Caruana, R. DeLine, and S. M. Drucker. Gamut: A design probe to understand how data scientists understand machine learning models. CHI Conference on Human Factors in Computing Systems Proceedings (CHI 2019), to appear, 2019.
[22] F. M. Hohman, M. Kahng, R. Pienta, and D. H. Chau. Visual analytics in deep learning: An interrogative survey for the next frontiers. IEEE Transactions on Visualization and Computer Graphics, 2018.
[23] G. Hooker. Discovering additive structure in black box functions. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 575–580. ACM, 2004.
[24] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi: 10.1109/MCSE.2007.55
[25] M. Kahng, P. Y. Andrews, A. Kalro, and D. H. P. Chau. ActiVis: Visual exploration of industry-scale deep neural network models. IEEE Transactions on Visualization and Computer Graphics, 24(1):88–97, 2018.
[26] M. Kahng, D. Fang, and D. H. P. Chau. Visual exploration of machine learning results using data cube analysis. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, p. 1. ACM, 2016.
[27] B. Kim, R. Khanna, and O. O. Koyejo. Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems, pp. 2280–2288, 2016.
[28] J. Krause, A. Dasgupta, J. Swartz, Y. Aphinyanaphongs, and E. Bertini. A workflow for visual diagnostics of binary classifiers using instance-level explanations. In IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 162–172. IEEE, 2017.
[29] J. Krause, A. Perer, and E. Bertini. Using visual analytics to interpret predictive machine learning models. arXiv preprint arXiv:1606.05685, 2016.
[30] J. Krause, A. Perer, and E. Bertini. A user study on the effect of aggregating explanations for interpreting machine learning models. 2018.
[31] J. Krause, A. Perer, and K. Ng. Interacting with predictions: Visual inspection of black-box machine learning models. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 5686–5697. ACM, 2016.
[32] T. Laugel, M.-J. Lesot, C. Marsala, X. Renard, and M. Detyniecki. Inverse classification for comparison-based interpretability in machine learning. arXiv preprint arXiv:1712.08443, 2017.
[33] B. Letham, C. Rudin, T. H. McCormick, D. Madigan, et al. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics, 9(3):1350–1371, 2015.
[34] Y. Lou, R. Caruana, J. Gehrke, and G. Hooker. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 623–631. ACM, 2013.
[35] S. M. Lundberg, G. G. Erion, and S.-I. Lee. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888, 2018.
[36] A. Lunzer and A. McNamara. Exploring histograms. http://tinlizzie.org/histograms/, 2017. Accessed: 2019-03-29.
[37] M. Mannino and A. Abouzied. Qetch: Time series querying with expressive sketches. In Proceedings of the 2018 International Conference on Management of Data, pp. 1741–1744. ACM, 2018.
[38] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al. Ad click prediction: A view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1222–1230. ACM, 2013.
[39] T. Miller. Explanation in artificial intelligence: Insights from the social sciences. arXiv preprint arXiv:1706.07269, 2017.
[40] Y. Ming, H. Qu, and E. Bertini. RuleMatrix: Visualizing and understanding classifiers with rules. IEEE Transactions on Visualization and Computer Graphics, 25(1):342–352, 2019.
[41] C. Molnar. Interpretable Machine Learning. 2019. https://christophm.github.io/interpretable-ml-book/.
[42] M. Narayanan, E. Chen, J. He, B. Kim, S. Gershman, and F. Doshi-Velez. How do humans understand explanations from machine learning systems? An evaluation of the human-interpretability of explanation. arXiv preprint arXiv:1802.00682, 2018.
[43] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[44] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.
[45] M. T. Ribeiro, S. Singh, and C. Guestrin. Anchors: High-precision model-agnostic explanations. In AAAI Conference on Artificial Intelligence, 2018.
[46] C. Rudin. Please stop explaining black box models for high stakes decisions. arXiv preprint arXiv:1811.10154, 2018.
[47] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, and M. Young. Machine learning: The high interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop), 2014.
[48] E. Štrumbelj and I. Kononenko. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3):647–665, 2014.
[49] P. Tamagnini, J. Krause, A. Dasgupta, and E. Bertini. Interpreting black-box classifiers using instance-level visual explanations. In Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics, p. 6. ACM, 2017.
[50] B. Ustun and C. Rudin. Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 102(3):349–391, Mar 2016. doi: 10.1007/s10994-015-5528-6
[51] J. VanderPlas, B. Granger, J. Heer, D. Moritz, K. Wongsuphasawat, A. Satyanarayan, E. Lees, I. Timofeev, B. Welsh, and S. Sievert. Altair: Interactive statistical visualizations for Python. Journal of Open Source Software, Dec 2018. doi: 10.21105/joss.01057
[52] J. H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.
[53] M. Wattenberg, F. Viégas, and I. Johnson. How to use t-SNE effectively. Distill, 2016. doi: 10.23915/distill.00002
[54] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data. ACM, 1996.