Intuitively Assessing ML Model Reliability through Example-Based Explanations and Editing Model Inputs
Harini Suresh, Kathleen M. Lewis, John V. Guttag, Arvind Satyanarayan
Harini Suresh [email protected] CSAIL
Kathleen M. Lewis [email protected] CSAIL
John V. Guttag [email protected] CSAIL
Arvind Satyanarayan [email protected] CSAIL
Figure 1: An example of the proposed interface for an electrocardiogram (ECG) case study. The output of the machine learning model consists of raw and aggregate information about the input’s nearest neighbors. With the editor in the top right, the user can apply meaningful manipulations to the input and see how the output changes.
ABSTRACT
Interpretability methods aim to help users build trust in and understand the capabilities of machine learning models. However, existing approaches often rely on abstract, complex visualizations that poorly map to the task at hand or require non-trivial ML expertise to interpret. Here, we present two interface modules to facilitate a more intuitive assessment of model reliability. To help users better characterize and reason about a model’s uncertainty, we visualize raw and aggregate information about a given input’s nearest neighbors in the training dataset. Using an interactive editor, users can manipulate this input in semantically-meaningful ways, determine the effect on the output, and compare against their prior expectations. We evaluate our interface using an electrocardiogram beat classification case study. Compared to a baseline feature importance interface, we find that 9 physicians are better able to align the model’s uncertainty with clinically relevant factors and build intuition about its capabilities and limitations.
CCS CONCEPTS
• Applied computing → Health care information systems; • Human-centered computing → Visualization; Empirical studies in interaction design.

KEYWORDS
interpretability, machine learning, visualization, k-nearest neighbors, example-based explanations

INTRODUCTION
Machine learning (ML) systems are being developed and used for a broad range of tasks, from predicting medical diagnoses [24] to informing hiring decisions [3]. While some of these systems have demonstrated excellent performance on in-domain accuracy using test or benchmark datasets, many are intended to be part of a larger sociotechnical process involving human decision-makers. In these cases, in-domain accuracy is not enough to guarantee good outcomes – the people using a particular system must also understand its limitations and modulate their trust appropriately [49].
Model interpretability, which is intended to help people understand how a particular ML model is working in general or on a case-by-case basis, can play an important role in this.

Many of the existing approaches to model interpretability, however, are intended for people with a non-trivial amount of ML expertise, and are often only used in practice by ML developers [6]. While tools for developers are certainly needed, the people who will actually be interpreting and acting upon the model predictions to guide decisions are often a distinctly different set of users. Even methods that are intended to be simpler and more understandable to such users, such as reporting feature weights or displaying more information about the model and dataset, have not improved decision-making in experimental studies [8, 30, 43, 53].

In response, we introduce two interface modules to facilitate more intuitive assessment of model reliability. The first module, motivated by the effectiveness of examples [44] and case-based reasoning [1] in human problem-solving, uses k-nearest neighbors (KNN) to visualize the model’s output grounded in examples familiar to the user. Our visualization comprises three components: a unit visualization of individual neighbors that encodes their class and distance from the input, an overlaid display of the raw input of each neighbor, and a histogram of the neighbors’ class labels. These views give the user an insight into the model’s certainty with respect to the given input. By interactively examining individual neighbors, the user can investigate questions like whether the variation in the neighbors is natural given the domain, or if it indicates that the prediction is unreliable; whether the commonalities amongst neighbors align with domain knowledge of what should or should not be similar; or whether seeing examples reveals limitations or biases in the data of which they were previously unaware.

The second module allows users to interactively modify the input and see how the model’s output changes. The user can modify the input by applying domain-specific transformations that correspond to meaningful manipulations of the data. Users can modify the input to test hypotheses about the model’s reasoning, check that it aligns with domain expectations, and ensure that it is not overly sensitive to small input modifications that should be class-preserving.

Together, the KNN and editor modules provide a rich suite of information that facilitates intuitive assessment of a model’s reasoning and reliability. We evaluate the effectiveness of our proposed interface modules through a medical case study of classifying electrocardiogram (ECG) heartbeats with different types of irregularities. This case study allows us to perform an application-grounded evaluation with users who closely resemble real-world decision-makers, and who have some degree of prior knowledge and investment in the domain. We conducted think-aloud studies with 9 physicians/medical students, observing the way they interacted with our interface as well as a feature importance baseline. By visualizing the nearest neighbor waveforms overlaid on one another, participants were able to intuitively understand prediction reliability by assessing the variation amongst neighbors — for example, distinguishing between variation that was understandable because it reflected natural ambiguities in the task versus variation that was more indicative of the model not learning the right features of the waveform.
In contrast, when participants used the baseline interface and did not see nearest neighbors, they often rationalized incorrect predictions. Moreover, by exploring neighbors from different classes, participants were consistently able to relate the model’s uncertainty to clinically-relevant concepts to guide decision-making. Finally, participants used the editor module to form hypotheses about the model’s reasoning and test them, using the results to investigate how the model worked and whether its reasoning was clinically sensible.

To prevent ML systems from being misused or mistrusted, it is crucial that they provide their users (who may not, themselves, be familiar with ML) with information that is both relevant to the task at hand and understandable to them. In this paper, we find that our proposed interface is a promising direction for providing such users the ability to visualize and probe the model’s output, intuitively understand the reliability of its predictions, and build effective trust over time.
The growing field of ML interpretability aims to provide information that helps people understand how a particular model works, either on a global or case-by-case level [18]. Such efforts can serve a number of different goals, such as aiding in decision-making, helping debug or improve a system, or building confidence in the model [21]. A major area of focus has been on developing methodologies for computing such explanations [14]. We note that these methods are mostly intended for settings where the data and domain are complex enough to warrant models that are not inherently interpretable on their own, and these are the settings we focus on in this paper.

Some methods try to visualize the internals of a particular model to reason about how it is operating [12, 34, 58]. This can be useful for theoretical ML understanding and model development, but may be too abstract and complicated to help people without knowledge of such models and how they work. Others try to produce explanations more grounded in the features of the data, such as a ranking of features important for the prediction or a decision-tree approximating the model’s logic [16, 33, 46]. However, a growing body of work that has tried to empirically measure the efficacy of many of these methods has shown that they often do not actually affect or improve human decision-making [2, 8, 30, 43], and in practice are primarily used for internal model debugging [6].

To understand the discord between proposed interpretability methods and their suitability for real-world users, we can draw from well-established theories in cognitive psychology that describe how people think about problems and organize information using different “cognitive chunks” [35]. For example, a physician might think about decisions in terms of concepts that are higher-level than individual features, or relate features to each other in more complex ways than linearly ranking them by importance. This idea manifests in theories of HCI stating that effective and engaging interfaces should allow users to view and interact with them in a way that feels direct — i.e., the visualizations and interactive mechanisms available to users should align with their cognitive chunks. Specifically, Hutchins et al. [23] describe “the gulf of execution,” arising from a gap between the available mechanisms of an interface and the user’s thoughts and goals, and “the gulf of evaluation,” arising from a gap between the visual display of an interface and the user’s conceptual model of the domain. In this paper, our aim is to narrow both of these gaps.

To this end, example-based (also referred to as instance-based) interpretability methods, which produce explanations in terms of other input examples, are of particular interest. Research in cognitive psychology and education supports the idea that people often use past cases to reason about new ones when solving problems [1] and that utilizing examples can help people understand complex concepts, build intuition, and form better mental models [44, 45]. Different types of example-based explanations for ML models have been proposed. Many of these are computed post-hoc, i.e., they are generated after a prediction is made to try and explain that prediction. For example, counterfactual examples [19, 56] use gradient-based methods to generate the closest example(s) to the input that are predicted to be a different class (defining appropriate measures of “closeness” is an open question).
Influence functions [28] try to trace a model’s predictions back to the data it was trained on, identifying the examples that were most influential to the prediction. Normative explanations [9] present users with a set of training examples from the predicted class. There are also limitations of some of these approaches; for example, technical constraints make quickly generating influential examples quite difficult in practice [5, 6], and hidden assumptions about actionability in counterfactual explanations can be misleading [4].

Others compute example-based explanations by modifying the inference process of a trained model to produce predictions based directly on similar training examples. This type of method first appeared in Caruana et al. [13] and Shin and Park [50]. They both propose utilizing a trained neural network model to improve a k-nearest neighbors classifier, either through weighting input features according to the neural network when computing example similarity [50], or through computing similarity in the embedding space of a neural network [13]. The class label making up the majority of nearest neighbors is the prediction, and the nearest neighbor examples can be used as an explanation. Of particular relevance to our case study, Caruana et al. [13] are motivated by the potential benefits of example-based explanations in clinical settings: “[b]ecause medical training and practice emphasizes case evaluation, most medical practitioners are adept at understanding explanations provided in the form of cases.” Recently, [41] extended this methodology to compute neighbors using embeddings from multiple layers of a neural network, demonstrating additional benefits for improving the model’s robustness and confidence estimates.

In our proposed interface, we compute neighbors using the method reflected in [13], though this could be easily extended to calculate neighbors in a weighted input space as in [50], or to use embeddings from multiple layers of the neural network as in [41]. This prior work has focused on developing optimal ways to utilize the trained neural network to inform the k-nearest neighbors algorithm, implying that the nearest neighbors would then serve as an explanation. Here, we focus on the relatively unexplored part of this claim, investigating how the resultant output should be presented to the user in an interactive interface to narrow the gulfs of execution and evaluation. We explore a specific case study to more clearly define the ways in which this type of example-based explanation can improve trust and understanding for real users.

For explanations to be useful in practice, figuring out how to present the information to a user is a critical step. In a literature review of interpretability systems and techniques, Nunes and Jannach [40] found that the vast majority of papers presented explanations in a natural-language-based format (e.g., a list of feature weights). Other types of visualizations include simple charts (e.g., bar plots indicating feature importances) [46] or highlighting/denoting sections of the input (e.g., displaying important pixels of an image in a different color or opacity) [30, 52]. With respect to example-based explanations, the visualizations used in papers are often a table of features if the data is tabular [37, 56, 57] or a list of images if the data is image-based [9, 27, 28].
Here, we explore visual encodings that convey more information and allow for more interaction than listing examples.

Other work specifically focuses on visualizations of latent embeddings within a neural network model. Many of these utilize 2 or 3D plots to visualize distance between different examples in the embedding space [7, 20, 32]. Liu et al. [32] add an additional visualization of 1D vectors of examples corresponding to user-defined concepts, and Boggust et al. [7] provide the ability to compare embeddings of two different models by viewing and interacting with the two plots side-by-side. Particularly relevant to our work, some of the visualizations of text embeddings proposed in [20] aim to display a given word’s nearest neighbors in the embedding space. They plot the nearest neighbors as points along a 1D axis that encodes distance, and provide the ability to compare the nearest neighbors across different embeddings.

We also explore the benefits of interactive input modification within our interface. Prior work in interpretability for ML systems has studied interactivity primarily from the angle of using human feedback to modify or filter the information that is shown [10, 26, 29, 51]. Here, our goal is instead to provide users with a way to test hypotheses about model behavior and explore limitations. The tool described in [57] similarly includes a feature for modifying the input to observe how a model’s output changes, though in their case, it is intended primarily for users familiar with ML.

Like these prior works, our interface aims to facilitate understanding by allowing users to visualize and interact with examples from the data. However, while these prior visualization tools are intended for general exploration of what a model has learnt, or for uncovering underlying structure in data, the goal of our interface is to help users assess the reliability of predictions on a case-by-case basis.

Figure 2: To compute nearest neighbors, we extract an embedding model from the original classification model, where the output is a learned representation (i.e., the activation of one of the model’s hidden layers). We can then use this to embed the training data examples and rank them by similarity to the input in this learned embedding space. In other words, we are replacing the final layers of the original classification model with a nearest neighbors classifier.

We propose two interface modules: a display of the model’s output in terms of an aggregate and an individual-level view of nearest neighbors, and an editor with which users can interactively modify the input and observe how the model’s output changes. These utilize general ideas that can be customized to different domains. In Sec. 3.3, we present a concrete instantiation for ECG beat classification.
In the k-nearest neighbors (KNN) module, we display the model’s output for a particular example in terms of nearest neighbors from the training set.

The nearest neighbors are computed using a similar method to [13, 41, 50]. Given a neural network model trained to perform the classification task (the classification model), we first define an embedding model, whose output is the activations of one of the model’s hidden layers (see Figure 2). We use this to embed all the training examples. Then, for a given new input example, we first embed it and then use KNN to find the most similar training examples in this learned representation space. These similar examples, along with their class labels, are visualized in the KNN module. The class label associated with the majority of nearest neighbors is the model’s prediction.

Computing nearest neighbors in the learned embedding space of the classification model provides the advantage of harnessing the classification model’s representational capacity. In other words, since the learned embedding space of the classification model encodes higher level features relevant to the task, these are then taken into account when calculating similar examples. Given our goal of narrowing the gulf of evaluation [23] for users — providing ways for them to understand the model’s output in terms of higher-level concepts that align with how they think about the task — this step is particularly important. Producing predictions based directly off nearest neighbors then facilitates example-based explanations by visualizing the neighbors.

The KNN module has three main components: an aggregate view of the neighbors’ class labels, a unit visualization of individual neighbors that encodes their class and distance from the input, and an overlaid display of the raw input examples associated with each neighbor. These components support the following use cases, which we will describe in more detail in the ECG case study:

(1) Visualizing the distribution of and variation amongst neighbors can help explain when the model is (or is not) reliable and, over time, help users build a broader sense of model limitations.

(2) Comparing neighbors from the other non-majority classes can help characterize why the model is uncertain, and how that uncertainty should guide decision-making.

(3) Seeing nearest neighbors that do not align with prior expectations of the data can help reveal limitations in the training data and prompt further questioning.
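As a concrete illustration of the nearest-neighbor computation described above, the following minimal sketch assumes a trained Keras classifier `clf`, training data `X_train`/`y_train`, and scikit-learn for the neighbor search; the layer choice and helper names are our own assumptions, not code from the deployed interface.

```python
# Sketch: embedding-space k-nearest-neighbor explanations.
import numpy as np
import tensorflow as tf
from sklearn.neighbors import NearestNeighbors

def build_embedding_model(clf):
    # Reuse the classifier up to its final hidden layer as the embedding model
    # (here assumed to be the penultimate layer of a Keras model).
    return tf.keras.Model(inputs=clf.input, outputs=clf.layers[-2].output)

def fit_neighbor_index(embedding_model, X_train, k=50):
    # Embed every training beat once, then index the embeddings for Euclidean KNN.
    Z_train = embedding_model.predict(X_train)
    return NearestNeighbors(n_neighbors=k, metric="euclidean").fit(Z_train)

def explain(embedding_model, index, y_train, x):
    # Embed the new input and retrieve its k most similar training examples.
    z = embedding_model.predict(x[np.newaxis, ...])
    distances, neighbor_ids = index.kneighbors(z)
    neighbor_labels = y_train[neighbor_ids[0]]
    # The class making up the majority of neighbors is the model's prediction.
    prediction = np.bincount(neighbor_labels).argmax()
    return prediction, neighbor_ids[0], distances[0], neighbor_labels
```

The returned neighbor indices, distances, and labels are the quantities the KNN module visualizes as the class histogram, the ordered dots, and the overlaid signals.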
A range of prior work interviewing interpretability stakeholders has suggested that to build effective trust, users need the ability to confirm that the model is using sensible logic and that its reasoning aligns with their expectations [6, 11, 21, 31, 55]. To this end, the editor module allows users to apply transformations to the input and re-run the modified input through the model to see how the output changes. Users can apply transformations that they expect to be class-preserving, for example, and ensure that the model’s output does not drastically change.

The available transformations should help narrow the gulf of execution [23] in the interface by providing transformations that align with users’ existing ways of thinking about the data and task. For example, in a dataset of natural images, it does not make sense to invert the colors because that is not something that would occur naturally, and does not reflect thought processes of people analyzing images. We also would not want to provide transformations like editing individual pixels, which operate at a much lower level than a person looking at an image would consider. To come up with transformations that are data-specific (meaning they reflect how users think about modifying a specific type of data, like images or ECG signals), relevant to the task (meaning they reflect higher-level factors that users consider important to the task at hand), and aligned with the target users’ level of understanding, we emphasize the importance of working with domain experts and other intended end users to design them.

Table 1: The classes used in the ECG beat classification task, along with their representation in the dataset and the model’s test set performance on that class.

Class                      Percentage of Examples   Test Set Accuracy
Normal                     89.3%                    99.6%
Supraventricular Ectopic   2.7%                     70.5%
Ventricular Ectopic        7.1%                     95.7%
Fusion                     0.8%                     70.4%
Overall                    –                        98.3%

Figure 3: An example of the KNN module for a particular ECG beat. On the left is the input signal. On the right is a histogram of class labels for the 50 nearest neighbors from the training data (here, all 50 are in the class ventricular ectopic). The label and count appear above the bar on hover. In the center, each dot represents an individual nearest neighbor, ordered by similarity to the input. The plot above overlays the signals in the selected region on each other. The brush selection can be moved or redrawn elsewhere.
We envision the following use cases for the editor module, which we expand upon in the following sections with examples from the ECG case study:

(1) Checking alignment with domain knowledge through testing hypotheses about how the model’s output should change upon various transformations.

(2) Assessing prediction reliability by seeing if the output is overly sensitive to small transformations that should be class-preserving.

(3) Understanding how the model works by observing which transformations lead to a significant shift in the class distribution.
Thus far, we have described the motivation for the interface modules generally, but here we will instantiate and evaluate each with a specific clinical case study of classifying electrocardiogram (ECG) beats. We chose this application because we wanted to be able to perform an application-grounded evaluation of our system using a realistic task that people (i.e., physicians) were familiar with [15]. Using simplified or proxy tasks (e.g., asking people to classify images) can be more straightforward, but also can lead to less reliable results since the task at hand is not something the participants are familiar with or have prior conceptions about doing. ECG beat classification, in particular, is an area where machine learning has been widely applied and yielded good performance [25, 47, 59]. It is also generally applicable in the medical domain since most physicians are familiar with reading ECG beats.
The specific task we implemented was classifying a single ECG heartbeat into one of four categories: normal, supraventricular ectopic, ventricular ectopic, or fusion. The latter three classes are different types of arrhythmias, or heart rhythm problems.

We used a preprocessed version of the MIT-BIH Arrhythmia Dataset [36] available on Kaggle [17]. Each sample in the dataset is an individual heartbeat sampled at a frequency of 125 Hz, and padded to a maximum length of 1.5 seconds. The available dataset contains a fifth class, “unknown,” which we exclude here.

We replicated the convolutional neural network (CNN) classification model from [25], without data augmentation (we were interested to see whether our visualizations could elucidate, for example, that certain classes were underrepresented). The model was trained for ten epochs to a final overall accuracy of 98.3%. The breakdown of classes and performances on each is in Table 1.
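For readers who want to set up a comparable pipeline, the sketch below shows one way to load the preprocessed beats and train a small 1-D CNN; the file name, column layout, and architecture are simplified assumptions on our part, not the exact model replicated from [25].

```python
# Sketch: loading the preprocessed beats and training a stand-in 1-D CNN.
import numpy as np
import pandas as pd
import tensorflow as tf

# Assumed CSV layout: one beat per row, padded samples first, integer label last.
train = pd.read_csv("mitbih_train.csv", header=None).values
X, y = train[:, :-1], train[:, -1].astype(int)
keep = y != 4                      # drop the fifth, "unknown" class
X, y = X[keep][..., np.newaxis], y[keep]

clf = tf.keras.Sequential([
    tf.keras.layers.Conv1D(32, 5, activation="relu", input_shape=X.shape[1:]),
    tf.keras.layers.MaxPool1D(2),
    tf.keras.layers.Conv1D(32, 5, activation="relu"),
    tf.keras.layers.GlobalMaxPool1D(),
    tf.keras.layers.Dense(32, activation="relu"),   # hidden layer later reused as the embedding
    tf.keras.layers.Dense(4, activation="softmax"),
])
clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
clf.fit(X, y, epochs=10, batch_size=128)
```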
For the ECG beat classification task, we use the same CNN classification model described above, and we define the embedding model to output the activations from the final hidden layer (a 32-dimensional vector). As in prior work, we use Euclidean distance in this space to rank the embeddings of the training examples by their similarity to a particular input. We retrieve the 50 nearest neighbors for visualization.

Figure 3 shows an example ECG beat in the interface. Throughout the interface, color encodes class labels (e.g., orange waveforms, dots, and bars correspond to ventricular ectopic examples). The aggregate view is a histogram of class labels present in the nearest neighbors, ordered by class frequency to identify the majority class and distribution of other classes. The exact count of each class appears on hover for each bar in the histogram.

The unit visualization of individual neighbors is a series of dots arrayed horizontally and ordered by similarity to the input. Users can see, for example, within the nearest neighbors if certain classes are more similar to the input. When prototyping this component, we also considered designs that encoded the absolute similarity (e.g., placing two neighbors that were relatively similar nearer to each other). However, we decided against this, since the absolute similarity (i.e., Euclidean distance in the learned embedding space) is not a value that is meaningful or familiar to the user. Additionally, the distribution of these values is more complicated to visualize, since the distances between neighbors are inconsistent. In our prototypes, for example, there were often clusters of points that densely overlapped and did not facilitate selecting and viewing individual examples.

To visualize the raw input examples, users can brush over specific segments of the ordered dots. The brush is initialized to the first five neighbors since these represent the most similar examples. Because the ECG data is signal-based, we choose to visualize the neighbors by overlapping signals on a single plot that appears above the brush. This allows users to visually assess consistency amongst the neighbors – for example, if the neighbors are very consistent, the overlaid plot will look very similar to a single signal, while if they are very varied, the overlaid plot will appear comparably noisy. Outliers are also visible, since they appear as a distinct waveform that does not follow in the same pattern as the other signals. In the example beat in Figure 3, for example, we can see that most of the waveforms follow the same general pattern, with some variance in their height, and that there is one that is significantly shorter than the others. By moving and adjusting the brush to cover specific segments of the neighbors, users can home in on examples from specific classes or individual outliers.

Figure 4: The editing toolbar allows users to apply specific transformations or combinations of transformations to the input signal. The transformations can be applied to the entire signal, or to a specific user-selected region. This allows users to select and transform clinically-meaningful segments of the signal (e.g., “stretch the QRS complex”).

For the ECG beat classification task, the editor consists of four transformations which we arrived at through discussion with a cardiologist: amplify, dampen, stretch, and compress. These transformations can be applied to the entire input signal, or to specific user-defined regions using the brushing functionality.
Together, they allow for a large space of possible adjustments to the input signal. There are certainly other options that could be explored here, such as detecting certain important sections of the signal (e.g., “P wave” or “QRS complex”) to transform instead of having users select them themselves. Although some combinations of transformations might lead to a signal that is not realistic, our tool is intended for users who are familiar with what beats should look like, so we assume that they will realize when this happens and undo or reset.

Once the transformation has been applied, a new row appears below the original output, displaying the new output. The color encoding as well as highlighting on hover enables tracing how the class distribution changes overall, while links between neighbors that are shared across rows enable tracking how individual examples shift in similarity. The editing toolbar is pictured in Figure 4, and an example of the output after several transformations is in Figure 5.
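To make the four transformations concrete, here is one plausible implementation over a 1-D beat array; the scaling factors, region handling, and padding behavior are illustrative assumptions rather than the editor’s actual code.

```python
# Sketch: the amplify/dampen/stretch/compress transformations on a 1-D beat.
import numpy as np

def amplify(signal, factor=1.2, region=None):
    # Scale the amplitude up (factor > 1) within the selected region,
    # or across the whole beat when no region is given.
    out = signal.copy()
    sl = slice(*region) if region else slice(None)
    out[sl] = out[sl] * factor
    return out

def dampen(signal, factor=0.8, region=None):
    # Scaling the amplitude down is just amplification with factor < 1.
    return amplify(signal, factor, region)

def stretch(signal, factor=1.2, region=None):
    # Resample the selected region onto a wider time base, then trim or pad
    # the beat back to its original length.
    start, stop = region if region else (0, len(signal))
    segment = signal[start:stop]
    new_len = int(len(segment) * factor)
    resampled = np.interp(np.linspace(0, len(segment) - 1, new_len),
                          np.arange(len(segment)), segment)
    out = np.concatenate([signal[:start], resampled, signal[stop:]])[:len(signal)]
    return np.pad(out, (0, len(signal) - len(out)), constant_values=out[-1])

def compress(signal, factor=0.8, region=None):
    # Narrowing a region is stretching with factor < 1.
    return stretch(signal, factor, region)
```

After an edit, the modified beat is simply re-embedded and re-run through the neighbor search (e.g., the `explain` sketch above) to produce the new row of output.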
Throughout the interface, our design goals are to narrow the gulf of evaluation and the gulf of execution for users by providing, respectively, visualizations and input modifications that reflect meaningful, higher-level concepts that align with their existing ways of thinking about the task. In doing so, we are able to provide a more intuitive interactive interface with which to assess the model’s reliability, understand why it is uncertain, and check whether its reasoning aligns with domain knowledge. Here, we expand upon several specific ways that a user can interact with both modules to these ends:
Users can assess the reliability of the prediction in multiple ways. First, the aggregate distribution of class labels can convey the model’s uncertainty in the prediction (i.e., the majority class label). For example, if 45 neighbors are normal, this conveys more certainty about the prediction than if only 25 neighbors are normal, and the rest are spread out across other classes.

Second, by viewing the class labels of the unit visualization representing individual neighbors, users can see how similar the neighbors from non-majority classes are to neighbors from the majority class. For example, if there are 40 neighbors labeled normal and 10 neighbors labeled fusion, are those 10 the most similar to the input? Or do they appear closer to the latter end of the nearest neighbors? If the neighbors from the non-majority class are the 10 most similar, this might indicate further unreliability of the ‘normal’ prediction.

Third, visualizing the variance or consistency amongst the waveforms themselves can give insight into whether the input example is well-represented in the training data and whether the model is picking up on sensible high-level features common in the neighbors. For example, if the overlaid plot of nearest neighbors shows examples that are very consistent and similar to the input in semantically meaningful ways (see Figure 6a for an example), it implies that the input is well-represented in the training data and that the model is picking up on the right concepts for this input. The user can look at the commonalities amongst neighbors to further understand what the model is picking up on and whether it aligns with their expectations of the domain. On the other hand, if the plot of nearest neighbor signals shows examples that are non-overlapping or not similar to the input (see Figure 6b for an example), it implies that either examples like the input are not well-represented in the training data, or that the model is not learning the right features and therefore not finding those similar examples.

Figure 5: As transformations are applied, new rows appear below the original output with the transformed input and the model’s new output. Links between each row indicate neighbors that are shared. The links corresponding to neighbors within a row’s selection are more visible, while the rest are more transparent. Users can get a general sense of how much the nearest neighbors change (by assessing the overall density of links) as well as the specific movements of particular neighbors or sets of neighbors.

Typically, a classification model outputs a probability score indicating its certainty in its prediction. Probability scores can alert the user to some uncertainty in the model, but they don’t give the user any additional information to understand why the model is uncertain. In the KNN module, one way the model’s certainty is conveyed is through the aggregate distribution of class labels. Beyond this, though, the user can further investigate why the model is uncertain by viewing and comparing examples from non-majority classes. Brushing over specific selections of dots representing individual neighbors allows the user to compare neighbors from different classes on the same plot, or analyze them separately. Take the example in Figure 7.
30 of the neighbors have the class label supraventricular ectopic, and 20 have the label normal (the counts of each class are visible upon hovering over the bars in the aggregate histogram). Looking at the ordered dots representing individual neighbors, we can see that most of the closer neighbors are supraventricular ectopics, while the later neighbors are mostly normal. This might give us more confidence in the model’s prediction of supraventricular ectopic. Then, in Figure 7a, we can see that brushing over the first 15 neighbors reveals that most of them follow the same general pattern, and it looks very similar to the input. The 3 normal neighbors in this selection also seem to follow this pattern — so some of the model’s uncertainty is arising from the fact that in the training data, there are normal beats that can look very similar to supraventricular beats. In Figure 7b, we can see that brushing over the last 15 neighbors reveals that most of them follow the same general pattern, but have a more elevated T-wave (the spike at the beginning of the signal) than the supraventricular ectopic neighbors. We might reason, then, that the model is split between supraventricular and normal, and one of the factors driving the uncertainty is whether or not the input has a significant T-wave.

Figure 6: For the ECG data, we visualize individual nearest neighbors by overlaying those in the user’s selection on top of one another. This allows for assessing the variance or consistency among the neighbors, comparing neighbors from different classes, getting a sense of the “average” waveform, and detecting if an output might be out-of-distribution with respect to the training data. (a) shows an example where the neighbors are very consistent, and (b) shows an example where they are much noisier.
The user can then use their domain knowledge to reason about how to proceed. In this example, for instance, the user might examine the input and decide that the T-wave is significantly depressed, making the input more similar to the supraventricular ectopic examples, and more confidently proceed with supraventricular ectopic as the correct class. Or, they might decide that the different classes present in the neighbors reflect legitimate ambiguities about what the correct beat type is, and choose to deliberate further, consult a second opinion, or run additional tests.
If, when looking at neighboring examples, the way the data looks or is labeled does not align with the user’s expectations, it might prompt questions from the user about the details of the data and how it was collected or labeled, areas that are too often not engaged with after a model’s deployment. Crucially, seeing the signals themselves facilitates this type of critical thinking for people who are likely more familiar with the data and what it should look like than with more abstract representations like feature weights.

In the ECG case study, for example, the data was annotated by physicians who had access to additional information about the beats preceding and following the input. As a result, there are some examples in the dataset that look extremely similar but are labelled differently (presumably because the difference in their label was due to the information available during annotation that the model does not see). In some cases, this leads to nearest neighbors that have different classes but look very similar (see Figure 8). Viewing the neighbors for a particular example can prompt questions about how the data was annotated and the subsequent limitations of the model, which would likely not arise if users were not able to view and compare specific similar examples.
Checking if the model’s reasoning aligns with prior expectations of domain experts is crucial for building trust, especially in the clinical domain [11, 55]. The editor module allows users to form hypotheses about how particular transformations should change the model’s output, and build confidence and intuition around the model’s reasoning by seeing if these hypotheses hold. For example, the beat in Figure 9 is initially classified as supraventricular ectopic. The user might hypothesize that since one indicator of supraventricular ectopic beats is narrowness, and this particular beat is narrow, that this is what the model is picking up on. Therefore, stretching the beat should change the model’s output, making it lean more towards normal. The user can then use the editor to apply this transformation to test their hypothesis. In this case, the model’s output does change to reflect more normal neighbors, confirming both the original hypothesis and the fact that its reasoning in the case aligns with the user’s expectations from a clinical perspective.
Even aside from specific hypotheses about how a particular series of transformations should change the output, a user can still gauge the reliability of a particular prediction by performing ad-hoc sensitivity analyses. If the output changes drastically when the input is slightly tweaked, this can alert users to the fact that the prediction is precarious and encourage them not to be overly reliant on it. On the other hand, if the output is relatively stable, this can be an additional indicator of model reliability.

Figure 7: In this example, 30 of the nearest neighbors are supraventricular ectopic, and 20 are normal. Rather than just concluding that the prediction is “60% confident,” the user can home in on the non-majority examples to characterize where the uncertainty is coming from and how to take it into account. (a) shows brushing over and viewing the first 15 neighbors. Most of them follow the same general pattern that looks very similar to the input. The 3 normal neighbors in this selection also seem to follow this pattern – meaning that some of the model’s uncertainty is arising from the fact that normal beats can look very similar to supraventricular beats. (b) shows viewing the last 15 neighbors. Most of them are normal, following a similar pattern, but with a more elevated T-wave (the spike at the beginning of the signal) than the supraventricular ectopic neighbors. This implies that one of the factors driving the uncertainty is whether or not the input has a significant T-wave. Based on this understanding of the uncertainty, the user can use their domain knowledge to decide how to proceed.

Figure 8: An example of neighbors that look similar but have different labels, due to a discrepancy in the additional information available during annotation versus at test-time. Alerting users to such cases through viewing nearest neighbors can help prompt questions about the data, the annotation process, and limitations of the model.
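This kind of ad-hoc sensitivity check can also be expressed programmatically; the sketch below reuses the hypothetical `explain` and transformation helpers from earlier and compares the neighbor class distributions before and after a small, nominally class-preserving edit, with the threshold value being an arbitrary illustration.

```python
# Sketch: flag predictions whose neighbor distribution shifts a lot under a small edit.
import numpy as np

def class_histogram(labels, n_classes=4):
    # Fraction of neighbors in each class.
    return np.bincount(labels, minlength=n_classes) / len(labels)

def sensitivity_check(embedding_model, index, y_train, beat, edit, threshold=0.3):
    _, _, _, labels_before = explain(embedding_model, index, y_train, beat)
    _, _, _, labels_after = explain(embedding_model, index, y_train, edit(beat))
    # Total-variation distance between the two neighbor class distributions.
    shift = 0.5 * np.abs(class_histogram(labels_before) - class_histogram(labels_after)).sum()
    return shift, shift > threshold  # a large shift flags a precarious prediction

# Example: a mild stretch that a clinician would expect to be class-preserving.
# shift, precarious = sensitivity_check(embedding_model, index, y_train, beat,
#                                       lambda b: stretch(b, factor=1.05))
```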
We ran a set of think-aloud studies to evaluate how well our design goals came across using the ECG beat classification case study. (These studies were certified by our institution as exempt from IRB review under Category 3, benign behavioral intervention.) In particular, we sought to study the following two questions:

(1) Can visualizing the nearest neighbor waveforms and their classes help physicians understand a prediction’s reliability in terms of clinically-meaningful concepts that they are familiar with?

(2) Can editing inputs further help physicians test out hypotheses about the model’s reasoning and confirm if it aligns with domain expectations?

Figure 9: Here, the input beat is initially classified as supraventricular ectopic. One indicator of a supraventricular ectopic beat is narrowness, so we would expect it to become more normal upon stretching. A user could use the editor to stretch the beat out, then, to test out such a hypothesis as a way to check whether the model’s reasoning aligns with clinical expectations.

Our participants all had a medical background but varied in experience, and were recruited through our personal networks. There were 9 total participants: 3 fourth year medical students (P1-P3), 5 first year residents (P4-P8), and 1 internal medicine physician with five years of experience (P9).
The baseline interface we used displayed output from the machine learning model that included the predicted class, the predicted probability of that class, and a feature importance-based explanation. Feature importance-based explanations are widely proposed and referenced [6, 16], and constitute a natural alternative to example-based explanations.

The particular feature importance-based explanation we implemented was LIME [46], which we chose because it is well-known, commonly used, and open source. To explain a specific input example, LIME finds a local neighborhood of points surrounding it, gets the predicted class for each, and computes a linear boundary to classify that set of points. The parameters of that boundary can be thought of as an approximation to the model’s behavior in that specific local neighborhood. Because the function is linear, the coefficients associated with each input feature can then be extracted and ordered by importance.

In our implementation (pictured in Figure 10), each input example consists of one ECG beat, so each coefficient corresponds to one measurement in that signal. To visualize these feature importances, we added a second plot of the input signal with a highlighting layer corresponding to important sections, which is in line with existing visualizations of proposed feature importance-based explanations for ECGs [39, 54]. Specifically, we plot the feature importance values that are both above the 80th percentile and part of a continuous segment of neighboring important features, to align more with physicians’ existing ways of thinking about distinct important sections of an ECG signal.
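One way this baseline could be wired up is sketched below using the open-source lime package; the explainer settings, the predict-function wrapper, and the segment-grouping rule are our assumptions about a plausible implementation rather than the paper’s exact code.

```python
# Sketch: LIME per-sample importances, thresholded and grouped into segments.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

def important_segments(predict_proba, X_train, beat, percentile=80):
    explainer = LimeTabularExplainer(X_train, mode="classification")
    exp = explainer.explain_instance(beat, predict_proba,
                                     num_features=len(beat), top_labels=1)
    # One LIME weight per sample index, for the top predicted label.
    label = exp.available_labels()[0]
    weights = np.zeros(len(beat))
    for idx, w in exp.as_map()[label]:
        weights[idx] = w
    # Keep samples above the chosen percentile of importance ...
    mask = weights > np.percentile(weights, percentile)
    # ... and group neighboring important samples into contiguous segments to highlight.
    segments, start = [], None
    for i, flag in enumerate(mask):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(mask)))
    return segments
```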
In order to study the effect of our interface modules independently, each participant saw three interfaces. The first two interfaces were randomly ordered between the KNN module (without the editor) and the feature-importance baseline to allow us to study whether visualizing nearest neighbors helped build participant intuition about the model. And then, to understand the impact of interactive editing of inputs, the third interface featured the KNN module with the editor.

Each interface was pre-populated with 12 input beats chosen from the test set and equally distributed among the four classes. We chose 30% of the beats such that they had incorrect predictions (for the baseline, the prediction is the class with highest probability; for the KNN module, the prediction is the class that makes up the majority of nearest neighbors). The incorrect predictions were distributed to try and align with the model’s actual performance (e.g., we did not include incorrect predictions for normal beats since there are very few of those in reality; we included more incorrect predictions for supraventricular ectopic since the model’s performance for that class is worse).

Figure 10: The feature importance-based baseline interface consists of the predicted beat class, the probability with which that class was predicted, and segments of the beat considered most important to the decision highlighted. The important segments were calculated via LIME, and we visualize continuous segments with importance values above the 80th percentile in the plot.

All studies took place over Zoom. Participants were informed that their participation was voluntary, that they could decline to continue at any point, and that their identities would remain anonymous in any research output. The studies were recorded (video + audio) with participant consent. Each took an average of 50 minutes, and participants were compensated with a $30 gift card.

At the start, participants were told which four categories of beats they would be working with, including the more granular information about beat types included with the original dataset (e.g., there are multiple pathologies that fall under the umbrella of “ventricular ectopic”). We described that they would see ECG beats one-by-one, along with output from a machine learning model that had high overall performance. Participants were asked to imagine a scenario where their workplace had adopted such a tool for beat classification, and they were both trying to consider the model’s output to make the best decision about a particular beat, as well as get a general sense for how the model worked.

Each interface condition was introduced as being backed by a different model to mitigate participants carrying over preconceptions from prior conditions. For each interface, participants were given a brief demo and were then sent a link to open the interface on their computer and asked to share their screen. They were prompted to click through the beats, and for each one, think aloud about how they were coming to a decision about the beat’s class, how they were incorporating the model’s output, and whether their perceptions about the model changed. At the end of each interface, we asked a few debriefing questions about their general thoughts on the model, the interface, and its strengths and weaknesses.

Our studies provide evidence that our interface modules successfully met our design goals. Visualizations of the neighboring signals allowed participants to reason about the model’s output in terms of clinically-meaningful concepts.
Moreover, participants were able to intuitively assess prediction reliability by examining variance in the overlaid visualizations of nearest neighbor signals. Inspecting the aggregate chart, ordering of neighbors, and neighboring signals all helped participants understand the model’s uncertainty and relate it to relevant ambiguities in the task. When the nearest neighbors did not align with participants’ prior expectations of the domain, they formulated questions about the data and the labeling process. And, finally, participants used the editor to confirm if the model’s reasoning was sensible and guide decision-making. Here, we describe these observations in greater detail.
Visualizing nearest neighbors allowed participants to reason about the model in terms of clinically-relevant concepts. For example, participants would often notice a particular morphology present in the neighbors that helped them understand the model’s reasoning and whether it was clinically sensible. One participant said, pointing to a pattern present in all the neighboring signals, “Yeah, ventricular. It’s this elevation and this space that’s making it think ventricular” [P4]. Another described, “The model is right — with ventricular ectopic, the QRS spike should be broad, which is present in all the similar examples” [P9]. Overall, six participants reasoned about the model using high-level clinically-relevant concepts that they observed in the neighbors, like “depression in the signal” [P9], “slope right after the P-wave” [P7], “presence of a T-wave” [P8], “P-R interval” [P5], or “abnormal morphology” [P3].

However, in a few cases, participants were not able to reason about the neighbors: this occurred either when they were unsure why the nearest neighbors were similar, or disagreed with their class labels. For example, one participant said, “these [nearest neighbors] are supposed to be are ventricular ectopic... I think they’re normal. I don’t know what to make of this [output]” [P2]. In part, this is likely due to the fact that when labeling a particular beat, the annotators of the data had additional information about surrounding beats that is not available in the dataset or used by the model. Without that additional information, it is sometimes unclear why a particular beat has the class label that it does. While the model’s output was not very helpful in these cases, it did sometimes prompt additional questions about the data and labeling process. For example, one participant asked, “Some of these normal ones look like they could be abnormal, so I’d want to know why they were called normal and what that was based on” [P6]. Another participant asked a similar question and further hypothesized, “Most likely this data was correctly annotated considering the multi-lead strip of ECG, but it’s not using all that information here” [P2]. An additional factor in some of these cases is that certain classes of beats are under-represented in the training data. Therefore, it is less likely that the neighboring beats for an input in this class will be very similar, since there are simply less of them to choose from. We were curious whether participants would notice this underrepresentation, but we found that it did not naturally occur to them during the study, and occasionally they expressed confusion as to why the neighbors were not as similar in these cases.

For the feature importance-based baseline, participants often had difficulty pulling out higher-level, clinically-relevant concepts from the explanations. For example, echoing a sentiment shared by many, one participant said, “I don’t see how these blue [highlighted] areas are super helpful here... what are they trying to get at?” [P7]. Another
And why only this part?” [P1].In some cases, the highlighted areas did align with participants’expectations, but they had difficulty connecting these sections backto the prediction. One participant noted, for example, “Sometimesit was highlighting things I would also consider, but I still thoughtits prediction was wrong. I don’t have any intuition on that. I guessit’s finding some features. I would want to know what those featuresare, see whether they’re useful, if they have any intuitive correlation” [P2]. All participants said that they did not place as much weighton the model’s prediction when the overlaid signals appeared verynoisy and dissimilar from the input. Participants felt more confidentin their answers when the overlaid signals were very consistent andsimilar. They were also able to reason about what kind of variationwas acceptable given the task and domain ( “This input isn’t as pic-ture perfect so it makes sense that the model shows some variation inthe overlaid examples” [P4]) versus what was an indicator of unreli-ability ( “[The model’s output] isn’t giving me much information rightnow. If I was given this result I wouldn’t just listen to the machine, Iwould want additional information” [P4]).In the baseline interface, when the predicted probability was veryhigh, and the prediction aligned with their own, most participantsfelt reassured. When this was not the case, however, we foundthat it was difficult for participants to get a sense of how reliablethe prediction was. As a result, they often rationalized incorrectpredictions — even when it went against their initial instincts. Forexample, one participant saw an abnormal beat, started to say it wasabnormal, but then changed her mind after looking at the predictedclass, which (incorrectly) was normal: “I don’t think this is normal...well actually seeing that the machine thinks normal... I guess it has asmall QRS and the T-wave has a normal slope. Okay, I’ll put this inthe normal category” [P7]. Four participants [P2, P3, P4, P7] wentthrough similar processes of rationalizing an incorrect prediction(after having also expressed an inclination towards the correct class)presented in the baseline interface.Even when they did not rationalize an incorrect prediction, par-ticipants often struggled with relating a probability score, or theimportant sections of a signal, to an intuitive notion of reliabilitywhen using the baseline interface. For instance, one participantthought out loud, “I don’t know, it seems high probability for a weirdlooking one like this. And I don’t know if it makes sense what it’slooking at here and calling important. I’m not confident about this” [P1]. Six participants expressed similar difficulties in reasoningabout the reliability of the prediction.
When the class distribution of neighbors was split over multiple classes, participants were consistently able to home in on the differences using the overlaid plot of waveforms and align them with clinical concepts. For example, one participant viewed a beat where the neighbors were split between supraventricular ectopic and normal, noting “For supraventricular ectopic one thing you look for is whether or not it has a P-wave. It’s unclear in the input. These [brushing over the supraventricular ectopic examples] are probably saying it isn’t a P-wave. And these [brushing over the normal examples] have the P-wave so they’re probably saying that the input does also and that’s why it should be normal” [P5].

Participants were often able to relate the model’s uncertainty to natural ambiguities in the task that humans also struggle with. For example, when one participant noticed some ventricular ectopic beats present in the nearest neighbors of a fusion beat, “Given that fusion is itself a combination of ventricular ectopic and normal, it makes sense that there’s uncertainty here, and that there are some yellow [ventricular ectopic] ones that look similar” [P8]. Rather than leading to distrust in the model, the ability to contextualize its uncertainty helped rationalize and move forward with the model’s output. Another participant said, regarding neighbors split across classes, “I would be exactly split like the model is between supraventricular and ventricular ectopic. The fact that the model is also split between those two makes me feel better and I would do further testing to differentiate which one it is” [P4].

Beyond reasoning about the presence of multiple classes in the nearest neighbors, participants were also able to harness that information along with their domain knowledge during decision-making. In many cases, upon viewing neighbors from the different classes, participants would realize that one of the classes was not actually similar to the input, and as a result, feel more confident in disregarding it. For example, one participant said of the beat in Figure 11, “This is supraventricular ectopic. [The model] is calling it normal, but the normal ones don’t look so similar. The pink ones [supraventricular ectopic] look more like it because they also don’t contain a P-wave” [P9]. In other words, they were able to relate variation in the neighbors to clinical concepts (normal neighbors with a P-wave, supraventricular ectopic neighbors without), hypothesize why the model is uncertain (it isn’t sure whether the input example contains a P-wave), and use their own domain knowledge to determine how to proceed (the input does not actually have a P-wave, so go with supraventricular ectopic). Six participants also went through similar thought processes to characterize the model’s uncertainty and then use their domain expertise to more confidently arrive at the correct answer.

In the baseline interface, when the model appeared less certain (i.e., a lower probability score), participants had difficulty reasoning about the uncertainty with the given explanation. Many said that they didn’t know why the probability was relatively low. When prompted to reason about it, all participants tried to guess using their own knowledge as opposed to information from the feature importance-based explanation.
Many participants used the editor to form hypotheses about what would happen to the output after applying certain transformations, and then test whether it actually happened. They used this as a way to “sanity check” the model’s reasoning, and were more confident if it aligned with their expectations (and vice versa). For example, one participant described using the editor to feel more confident in the model’s output for a particular beat (shown in Figure 12), which consisted of mostly ventricular ectopic neighbors: “I’m not that confident with ventricular ectopic, and this looks almost normal. It’s a little narrow, which is what ventricular means, so I think that’s why this is saying ventricular and if I were to stretch it it would be normal. [Stretches the signal] And that’s exactly what happened. That makes me more confident that this is more ventricular ectopic rather than normal. Just because that’s exactly what my thought was and that’s exactly what happened when I did it” [P4]. The same participant mentioned later on, “This is how I think of things. If I can predict what’s going to happen I’m more likely to be confident in the decision.”

Figure 11: For this beat, one participant looked through some of the normal neighbors (a), comparing them to some of the supraventricular ectopic neighbors (b). They reasoned that the normal examples, though they made up the majority of neighbors, were not more similar in clinically-meaningful ways to the input than the supraventricular ectopic examples. As a result, they were able to arrive at the correct classification (supraventricular ectopic).
Sometimes, however, participants applied a transformation but were not able to reason about why the nearest neighbors changed as they did, and subsequently were not sure how to incorporate the observed change into their reasoning. This often happened for the under-represented beats, which, when transformed, would more often than not result in nearest neighbors that were skewed towards normal (i.e., the most highly-represented beat). In part, this is because the model has learned about a much larger diversity of normal beats than of, e.g., supraventricular ectopic beats. On one hand, this unexpected behavior prompted participants to rely less on the model's output in these cases — which, since the model is less accurate for these classes, is appropriate. At the same time, these instances were not able to offer them useful feedback on the model's reasoning.

Other participants did not form hypotheses about specific transformations, but applied several to try and gauge the sensitivity of the prediction to small changes as a way of assessing its reliability. Sometimes it served as positive reinforcement: "Okay, this makes me more confident. When it's normal, and then you do all these [transformations], I think it should mostly stay normal, which it is. It's consistent so this all makes sense and I feel good with the machine" [P1]. Other times it helped alert participants to the model's unreliability: "Seeing it switch so quickly from supraventricular ectopic to normal does affect my perception of whether it [the model] is good at telling those apart" [P3].

With respect to the model's reasoning more generally, some participants expressed an increased understanding of how the model worked after using the editor and observing which transformations tended to lead to a large change in the output. One participant noted, "Doing these transformations is making me think about how this program works. . . I can tell that the narrowness of a beat affects the decision a lot for example" [P8].

Participants did not typically use the editor when the neighbors were very consistent (both in terms of the shape of the signal and their class labels), because they did not feel the need to check the model's reasoning. Other times, they chose not to use the editor because they could not think of a specific hypothesis they wanted to test — this was particularly true for the participants who were medical students, who often expressed that they "didn't know enough" but that someone with more experience might know what to test.
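To make the editor interactions concrete, the sketch below shows one way a class-preserving "stretch" edit and a simple stability check could be implemented. It is a minimal sketch under stated assumptions, not the study's editor: the resampling approach, the set of candidate edits, and the predict function are illustrative.

import numpy as np

def stretch(beat, factor=1.2):
    """Stretch (factor > 1) or narrow (factor < 1) a beat in time while
    keeping the output the same length as the input."""
    beat = np.asarray(beat, dtype=float)
    n = len(beat)
    m = max(2, int(round(n * factor)))
    # Resample the original n samples onto m evenly spaced points.
    resampled = np.interp(np.linspace(0, n - 1, m), np.arange(n), beat)
    if m >= n:
        # Stretched: center-crop back to the original length.
        start = (m - n) // 2
        return resampled[start:start + n]
    # Narrowed: center the shorter waveform and pad with its edge value.
    out = np.full(n, resampled[0])
    start = (n - m) // 2
    out[start:start + m] = resampled
    return out

def sensitivity_check(beat, predict, transforms):
    """Report whether the predicted class stays the same under each
    transformation (predict is the model's classifier, assumed given)."""
    base = predict(beat)
    return {name: predict(fn(beat)) == base for name, fn in transforms.items()}

# Example usage with a few hypothetical, roughly class-preserving edits:
# stable = sensitivity_check(beat, predict, {
#     "stretch": lambda b: stretch(b, 1.2),
#     "narrow": lambda b: stretch(b, 0.8),
#     "scale amplitude": lambda b: 1.1 * np.asarray(b, dtype=float),
# })

Checking whether the prediction (or the neighbor histogram) remains stable under such edits mirrors the way participants used the editor as positive reinforcement or as a warning sign.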
Several participants noted that the way the ECG beats were visualized was not realistic. For example, participants described that in practice they would typically view a strip of beats from multiple leads, rather than one beat in isolation, and often with a grid overlaid to better measure distances. In some cases, this difference in display made participants more unsure about the class than they would have been with their more familiar overlays. While the interface and task are simpler than they would be in a real clinical environment, in the current work we are more concerned with developing and evaluating the proposed interpretability and visualization techniques than with developing a tool that could be deployed in a clinical setting (which would prompt an entirely different set of considerations).

Figure 12: One participant viewing this example hypothesized that this beat was ventricular ectopic because it was narrow, and therefore, that stretching it would cause it to be more normal. After applying the stretching transformation and seeing that the nearest neighbors did change to be more normal, she felt more confident in the model's reasoning for this beat and in making the classification of ventricular ectopic.

Throughout the user studies, we found that our interface modules successfully achieved our design goals, as participants' ability to reason about and interact with the model's output aligned with their existing ways of thinking. In particular, viewing nearest neighbor examples elucidated higher-level morphologies that the model was learning, more so than seeing important segments highlighted on the input as in our baseline. Indeed, this result is in line with prior work in cognitive psychology suggesting that people solve problems by utilizing knowledge about past cases [1, 45], and that it is a particularly relevant form of reasoning in the medical domain [48].

We also found that participants generally approached the model in a more critical way, investigating the nearest neighbor class distribution and waveforms to see why the model was uncertain, and connecting that uncertainty to clinical concepts. Importantly, our results support the idea that understanding that the model is uncertain or unreliable for particular cases does not damage trust in the model, but is actually necessary for building it. As Cai et al. [11] find from interviews with physicians, "participants implicitly and explicitly understood that no tool (or person) is perfect," and Tonekaboni et al. [55] further describe that "the acknowledgement of this challenge promotes trustworthiness."
In other words, when users can explore and understand the model's uncertainty, it further helps align the system with domain expectations that some cases are difficult and/or ambiguous, and that no system or tool is perfect. We found that the feature importance baseline did not facilitate the same investigation — going from important sections of the prediction to why the predicted probability is high required too big a mental leap. This observation aligns with prior work studying the effect of feature-based explanations on doctors' diagnostic performance [8], in which participants expressed similar confusion about the clinical meaning of a predicted probability and how to incorporate the explanation into decision-making. The input editor further allowed participants to engage in back-and-forth questioning to confirm that their hypotheses about the model's reasoning were correct. Importantly, we found that participants described the transformations they were applying in terms of higher-level features corresponding to their domain knowledge. Several studies or frameworks of interpretability needs have described the need for users to "sanity check" a model's decision as a way to build trust [6, 11, 22, 31, 55], and we found that the input editor is one way to address this need. In particular, we posit that the interactivity of the editor — giving people the ability to engage in back-and-forth probing of the model — is more effective to this end than, say, presenting a score representing the model's reliability.

At the same time, we note some limitations of our current interface, and what they imply for future work. We found that when the nearest neighbor waveforms looked significantly different than expected, participants had difficulty reasoning about why the model thought the neighbors were similar. We posit that part of participants' confusion is due to the fact that the nearest neighbors for some examples are lower quality, as the training data unevenly represented different classes (see Section 3.3.1). Supraventricular ectopic beats, for instance, made up only 2.7% of the training examples. As a result, the model was not able to distinguish this beat from others to the same level of detail or accuracy. Additionally, because there are simply fewer examples from this class, it is less likely that an example will have nearest neighbors that look very similar to it. This consideration of under-representation in the training data, however, did not naturally occur to participants upon seeing the lower-quality neighbors. This observation has a few implications for future work. First, it motivates additional data collection so that models can be trained and nearest neighbors computed with data that is more evenly distributed across classes and other important features. More immediately, it suggests the need to be transparent with users about under-representation in the data, and to explain its implications for the quality of the nearest neighbors. If a user is then presented with an output where the neighbors do not appear to make sense, they are better equipped to understand why this might be the case, rather than be confused by it. Indeed, we found that when we described this phenomenon to participants after the conclusion of the study, they were able to understand why under-representation would affect the nearest neighbors — it had just not been on their radar previously. This direction is also supported by the results from Cai et al.
[11], who found the need for an "AI Primer" for users to explain, in part, "AI-specific behavior that may be surprising."

In other cases, participants found it difficult to form hypotheses and apply corresponding transformations given the open-ended nature of the input editor. To imagine ways this problem could be further explored and addressed, we find it useful to compare the capabilities of the editor to interpretability methods that generate counterfactual examples (i.e., similar example(s) that are classified differently) [19, 38, 56]. In the latter, a modified version of the input is generated by automatically finding small transformations that lead to different predictions — while this doesn't require the user to generate their own hypothesis, it can also return unrealistic examples that do not permit further probing. With the editor, users who are familiar with the data are instead able to modify the input in ways they believe should be meaningful given their prior knowledge of and familiarity with the domain. However, as we found, this can make the interaction too open-ended in some cases. Here, we find a promising future direction in combining aspects of the two: for example, automatically determining types of transformations that lead to the most change in the output, and narrowing the user's space of hypotheses by communicating this on the interface akin to information scent [42].

Finally, while we demonstrate our interface using an ECG case study to ground our examples and user study in a real-world task, there is a significant opportunity for future work to investigate how these interface modules could be instantiated for other applications.
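One hedged sketch of that combined direction is to rank a candidate set of domain transformations by how much each one shifts the model's predicted distribution, and to surface the top-ranked edits as suggestions for the user to try. The predict_proba function and the candidate set are assumptions introduced only for this illustration.

import numpy as np

def rank_transformations(beat, predict_proba, candidates):
    """Order candidate transformations by how much they move the model's
    predicted class distribution, so an interface could suggest the most
    informative edits for a user to try first.

    predict_proba -- assumed model function returning a probability vector
    candidates    -- dict mapping a transformation name to a callable
    """
    base = np.asarray(predict_proba(beat))
    shift = {}
    for name, fn in candidates.items():
        p = np.asarray(predict_proba(fn(beat)))
        shift[name] = 0.5 * float(np.abs(p - base).sum())   # total variation distance
    # Largest shift first: these edits change the output the most.
    return sorted(shift.items(), key=lambda kv: kv[1], reverse=True)

Such a ranking would not replace user-driven editing; it would only narrow the space of hypotheses a user considers, in the spirit of information scent.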
CONCLUSION

In this paper, we present two interface modules to facilitate intuitive assessment of a machine learning model's predictions. Our designs are guided by the motivation to provide users the ability to visualize and probe the model in ways that align with their existing domain knowledge and ways of thinking. Using the interface, users can explore a given input's nearest neighbors in the training data to better understand if and why the model is uncertain, and what high-level features the model is learning. They can further manipulate the input using domain-specific transformations to test hypotheses about the model's reasoning or ensure that it is not overly sensitive to small changes. Through think-aloud studies with 9 physicians/medical students, we demonstrate that our interface helped participants understand the model's reasoning and uncertainty in terms of domain-relevant concepts while building intuition around its capabilities and limitations. The interface modules we present indicate a promising direction for future work aiming to build ML systems that are useful and trustworthy to their real-world users.
REFERENCES
[1] Agnar Aamodt and Enric Plaza. 1994. Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches. AI Communications 7, 1 (1994), 39–59. https://doi.org/10.3233/AIC-1994-7104
[2] Ajaya Adhikari, David M. J. Tax, Riccardo Satta, and Matthias Faeth. 2019. LEAFAGE: Example-based and Feature importance-based Explanations for Black-box ML models. In IEEE International Conference on Fuzzy Systems (FUZZ-IEEE). IEEE, New Orleans, LA, USA, 1–7. https://doi.org/10.1109/FUZZ-IEEE.2019.8858846
[3] Ifeoma Ajunwa. 2016. The Paradox of Automation as Anti-Bias Intervention. Forthcoming in Cardozo Law Review (2016).
[4] Solon Barocas, Andrew D. Selbst, and Manish Raghavan. 2020. The hidden assumptions behind counterfactual explanations and principal reasons. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. ACM, Barcelona, Spain, 80–89. https://doi.org/10.1145/3351095.3372830
[5] Samyadeep Basu, Philip Pope, and Soheil Feizi. 2020. Influence Functions in Deep Learning Are Fragile. arXiv:2006.14651 [cs, stat] (June 2020). http://arxiv.org/abs/2006.14651
[6] Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, José M. F. Moura, and Peter Eckersley. 2020. Explainable machine learning in deployment. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (FAT* '20). Association for Computing Machinery, Barcelona, Spain, 648–657. https://doi.org/10.1145/3351095.3375624
[7] Angie Boggust, Brandon Carter, and Arvind Satyanarayan. 2019. Embedding Comparator: Visualizing Differences in Global Structure and Local Neighborhoods via Small Multiples. arXiv:1912.04853 [cs.HC]
[8] Adrian Bussone, Simone Stumpf, and Dympna O'Sullivan. 2015. The Role of Explanations on Trust and Reliance in Clinical Decision Support Systems. In IEEE International Conference on Healthcare Informatics (ICHI). IEEE, Dallas, TX, USA, 160–169. https://doi.org/10.1109/ICHI.2015.26
[9] Carrie J. Cai, Jonas Jongejan, and Jess Holbrook. 2019. The effects of example-based explanations in a machine learning interface. In Proceedings of the 24th International Conference on Intelligent User Interfaces. ACM, Marina del Rey, California, 258–262. https://doi.org/10.1145/3301275.3302289
[10] Carrie J Cai, Emily Reif, Narayan Hegde, Jason Hipp, Been Kim, Daniel Smilkov, Martin Wattenberg, Fernanda Viegas, Greg S Corrado, Martin C Stumpe, et al. 2019. Human-centered tools for coping with imperfect algorithms during medical decision-making. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–14.
[11] Carrie J Cai, Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry. 2019. "Hello AI": Uncovering the Onboarding Needs of Medical Practitioners for Human-AI Collaborative Decision-Making. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–24.
[12] Shan Carter, Zan Armstrong, Ludwig Schubert, Ian Johnson, and Chris Olah. 2019. Exploring Neural Networks with Activation Atlases. Distill 4, 3 (March 2019). https://doi.org/10.23915/distill.00015
[13] Rich Caruana, Hooshang Kangarloo, JD Dionisio, Usha Sinha, and David Johnson. 1999. Case-based explanation of non-case-based learning methods. In Proceedings of the AMIA Symposium. American Medical Informatics Association, 212.
[14] Diogo V. Carvalho, Eduardo M. Pereira, and Jaime S. Cardoso. 2019. Machine Learning Interpretability: A Survey on Methods and Metrics. Electronics 8, 8 (July 2019), 832. https://doi.org/10.3390/electronics8080832
[15] Finale Doshi-Velez and Been Kim. 2017. Towards A Rigorous Science of Interpretable Machine Learning. arXiv:1702.08608 [cs, stat] (March 2017). http://arxiv.org/abs/1702.08608
[16] Mengnan Du, Ninghao Liu, and Xia Hu. 2019. Techniques for interpretable machine learning. Commun. ACM 63, 1 (Dec. 2019), 68–77. https://doi.org/10.1145/3359786
[17] Shayan Fazeli. [n.d.]. ECG Heartbeat Categorization Dataset. IEEE, Turin, Italy, 80–89. https://doi.org/10.1109/DSAA.2018.00018
[19] Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Counterfactual Visual Explanations. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97. Long Beach, California, USA. http://proceedings.mlr.press/v97/goyal19a.html
[20] Florian Heimerl and Michael Gleicher. 2018. Interactive analysis of word vector embeddings. In Computer Graphics Forum, Vol. 37. Wiley Online Library, 253–265.
[21] Sungsoo Ray Hong, Jessica Hullman, and Enrico Bertini. 2020. Human Factors in Model Interpretability: Industry Practices, Challenges, and Needs. Proceedings of the ACM on Human-Computer Interaction 4, CSCW1 (May 2020), 1–26. https://doi.org/10.1145/3392878
[22] Sungsoo Ray Hong, Jessica Hullman, and Enrico Bertini. 2020. Human Factors in Model Interpretability: Industry Practices, Challenges, and Needs. Proceedings of the ACM on Human-Computer Interaction 4, CSCW1 (2020), 1–26.
[23] Edwin L Hutchins, James D Hollan, and Donald A Norman. 1985. Direct manipulation interfaces. Human–Computer Interaction 1, 4 (1985), 311–338.
[24] Fei Jiang, Yong Jiang, Hui Zhi, Yi Dong, Hao Li, Sufeng Ma, Yilong Wang, Qiang Dong, Haipeng Shen, and Yongjun Wang. 2017. Artificial intelligence in healthcare: past, present and future. Stroke and Vascular Neurology 2, 4 (2017), 230–243.
[25] Mohammad Kachuee, Shayan Fazeli, and Majid Sarrafzadeh. 2018. ECG Heartbeat Classification: A Deep Transferable Representation. In IEEE International Conference on Healthcare Informatics (ICHI). IEEE, New York, NY, 443–444. https://doi.org/10.1109/ICHI.2018.00092
[26] Been Kim. 2015. Interactive and Interpretable Machine Learning Models for Human Machine Collaboration. Ph.D. Dissertation. Massachusetts Institute of Technology, Cambridge, MA.
[27] Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. 2016. Examples are not enough, learn to criticize! Criticism for Interpretability. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 2280–2288. http://papers.nips.cc/paper/6300-examples-are-not-enough-learn-to-criticize-criticism-for-interpretability.pdf
[28] Pang Wei Koh and Percy Liang. 2017. Understanding Black-box Predictions via Influence Functions. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70. Sydney, Australia. http://arxiv.org/abs/1703.04730
[29] Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. 2015. Principles of Explanatory Debugging to Personalize Interactive Machine Learning. In Proceedings of the 20th International Conference on Intelligent User Interfaces (IUI '15). ACM Press, Atlanta, Georgia, USA, 126–137. https://doi.org/10.1145/2678025.2701399
[30] Vivian Lai and Chenhao Tan. 2019. On Human Predictions with Explanations and Predictions of Machine Learning Models: A Case Study on Deception Detection. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* '19). ACM Press, Atlanta, GA, USA, 29–38. https://doi.org/10.1145/3287560.3287590
[31] Q Vera Liao, Daniel Gruen, and Sarah Miller. 2020. Questioning the AI: Informing Design Practices for Explainable AI User Experiences. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–15.
[32] Yang Liu, Eunice Jun, Qisheng Li, and Jeffrey Heer. 2019. Latent space cartography: Visual analysis of vector space embeddings. In Computer Graphics Forum, Vol. 38. Wiley Online Library, 67–78.
[33] Scott Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30 (NIPS 2017). https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions
[34] Pablo Navarrete Michelini, Hanwen Liu, and Dan Zhu. 2019. Multigrid Backprojection Super-Resolution and Deep Filter Visualization. Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 2019), 4642–4650. https://doi.org/10.1609/aaai.v33i01.33014642
[35] George A Miller. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review 63, 2 (1956), 81.
[36] George B Moody and Roger G Mark. 2001. The impact of the MIT-BIH arrhythmia database. IEEE Engineering in Medicine and Biology Magazine 20, 3 (2001), 45–50.
[37] Ramaravind K. Mothilal, Amit Sharma, and Chenhao Tan. 2020. Explaining machine learning classifiers through diverse counterfactual explanations. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. ACM, Barcelona, Spain, 607–617. https://doi.org/10.1145/3351095.3372850
[38] Ramaravind K Mothilal, Amit Sharma, and Chenhao Tan. 2020. Explaining machine learning classifiers through diverse counterfactual explanations. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 607–617.
[39] Sajad Mousavi, Fatemeh Afghah, and U Rajendra Acharya. 2020. HAN-ECG: An Interpretable Atrial Fibrillation Detection Model Using Hierarchical Attention Networks. arXiv preprint arXiv:2002.05262 (2020).
[40] Ingrid Nunes and Dietmar Jannach. 2017. A Systematic Review and Taxonomy of Explanations in Decision Support and Recommender Systems. User Modeling and User-Adapted Interaction 27, 3–5 (Dec. 2017), 393–444. https://doi.org/10.1007/s11257-017-9195-0
[41] Nicolas Papernot and Patrick McDaniel. 2018. Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning. arXiv:1803.04765 [cs, stat] (March 2018). http://arxiv.org/abs/1803.04765
[42] Peter Pirolli and Stuart Card. 1999. Information foraging. Psychological Review.
[43] arXiv:1802.07810 [cs] (Nov. 2019). http://arxiv.org/abs/1802.07810
[44] Alexander Renkl. 2014. Toward an Instructionally Oriented Theory of Example-Based Learning. Cognitive Science 38, 1 (Jan. 2014), 1–37. https://doi.org/10.1111/cogs.12086
[45] Alexander Renkl, Tatjana Hilbert, and Silke Schworm. 2009. Example-Based Learning in Heuristic Domains: A Cognitive Load Theory Account. Educational Psychology Review 21, 1 (March 2009), 67–78. https://doi.org/10.1007/s10648-008-9093-4
[46] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, San Francisco, California, USA, 1135–1144. https://doi.org/10.1145/2939672.2939778
[47] Giovanna Sannino and Giuseppe De Pietro. 2018. A deep learning approach for ECG-based heartbeat classification for arrhythmia detection. Future Generation Computer Systems 86 (2018), 446–455.
[48] Jerry W Sayre, Hale Z Toklu, Fan Ye, Joseph Mazza, and Steven Yale. 2017. Case reports, case series – from clinical practice to evidence-based medicine in graduate medical education. Cureus 9, 8 (2017).
[49] Andrew D. Selbst, Danah Boyd, Sorelle A. Friedler, Suresh Venkatasubramanian, and Janet Vertesi. 2019. Fairness and Abstraction in Sociotechnical Systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency (Atlanta, GA, USA) (FAT* '19). Association for Computing Machinery, New York, NY, USA, 59–68. https://doi.org/10.1145/3287560.3287598
[50] Chung Kwan Shin and Sang Chan Park. 1999. Memory and neural network based expert system. Expert Systems with Applications 16, 2 (1999), 145–155.
[51] Kacper Sokol and Peter Flach. 2020. One explanation does not fit all. KI - Künstliche Intelligenz (2020), 1–16.
[52] Pascal Sturmfels, Scott Lundberg, and Su-In Lee. 2020. Visualizing the Impact of Feature Attribution Baselines. Distill 5, 1 (Jan. 2020). https://doi.org/10.23915/distill.00022
[53] Harini Suresh, Natalie Lao, and Ilaria Liccardi. 2020. Misplaced Trust: Measuring the Interference of Machine Learning in Human Decision-Making. In WebSci '20: 12th ACM Conference on Web Science, Southampton, UK, July 6-10, 2020, Emilio Ferrara, Pauline Leonard, and Wendy Hall (Eds.). ACM, 315–324. https://doi.org/10.1145/3394231.3397922
[54] Geoffrey H Tison, Jeffrey Zhang, Francesca N Delling, and Rahul C Deo. 2019. Automated and interpretable patient ECG profiles for disease detection, tracking, and discovery. Circulation: Cardiovascular Quality and Outcomes 12, 9 (2019), e005289.
[55] Sana Tonekaboni, Shalmali Joshi, Melissa D McCradden, and Anna Goldenberg. 2019. What Clinicians Want: Contextualizing Explainable Machine Learning for Clinical End Use. In Machine Learning for Healthcare Conference. 359–380.
[56] Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2018. Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR. Harvard Journal of Law & Technology 31, 2 (March 2018), 841–887. http://arxiv.org/abs/1711.00399
[57] James Wexler, Mahima Pushkarna, Tolga Bolukbasi, Martin Wattenberg, Fernanda Viégas, and Jimbo Wilson. 2019. The What-If Tool: Interactive probing of machine learning models. IEEE Transactions on Visualization and Computer Graphics 26, 1 (2019), 56–65.
[58] Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision. Springer, 818–833.
[59] Muhammad Zubair, Jinsul Kim, and Changwoo Yoon. 2016. An automated ECG beat classification system using convolutional neural networks. IEEE, 1–5.