On Completeness-aware Concept-Based Explanations in Deep Neural Networks

Chih-Kuan Yeh, Been Kim, Sercan Ö. Arık, Chun-Liang Li, Tomas Pfister, and Pradeep Ravikumar
Machine Learning Department, Carnegie Mellon University; Google Brain; Google Cloud AI

Abstract
Human explanations of high-level decisions are often expressed in terms of key concepts the decisions are based on. In this paper, we study such concept-based explainability for Deep Neural Networks (DNNs). First, we define the notion of completeness, which quantifies how sufficient a particular set of concepts is in explaining a model's prediction behavior, based on the assumption that complete concept scores are sufficient statistics of the model prediction. Next, we propose a concept discovery method that aims to infer a complete set of concepts that are additionally encouraged to be interpretable, which addresses the limitations of existing methods on concept explanations. To define an importance score for each discovered concept, we adapt game-theoretic notions to aggregate over sets and propose
ConceptSHAP. Via proposed metrics and user studies, we validate the effectiveness of our method on a synthetic dataset with a priori known concept explanations, as well as on real-world image and language datasets, finding concepts that are both complete in explaining the decisions and interpretable. The code is released at https://github.com/chihkuanyeh/concept_exp.

1 Introduction

The lack of explainability of deep neural networks (DNNs) arguably hampers their full potential for real-world impact. Explanations can help domain experts better understand the rationales behind model decisions, identify systematic failure cases, and potentially provide feedback to model builders for improvements. The most commonly-used explanation methods for DNNs quantify the importance of each input feature for each prediction [1, 2]. One caveat with such explanations is that they typically focus on the local behavior at each data point, rather than globally explaining how the model reasons. Moreover, weighted input features are not necessarily the most intuitive explanations for human understanding, particularly when they are low-level features such as raw pixel values. In contrast, human reasoning often involves "concept-based thinking": extracting similarities from numerous examples and grouping them systematically based on their resemblance [3, 4]. It is thus of interest to develop such "concept-based explanations" that characterize the global behavior of a DNN in a way understandable to humans, explaining how DNNs use concepts in arriving at particular decisions.

A few recent studies have focused on bringing such concept-based explainability to DNNs, largely based on the common implicit assumption that the concepts lie in low-dimensional subspaces of some intermediate DNN activations. Via supervised training based on labeled concepts, TCAV [5] trains linear concept classifiers to derive concept vectors, and measures the importance of a concept with respect to a specific class by how sensitive predictions are to these vectors (via directional derivatives). Zhou et al. [6] consider the decomposition of model predictions in terms of projections onto concept vectors. Instead of human-labeled concept data, Ghorbani et al. [7] employ k-means clustering of super-pixel segmentations of images to discover concepts. Bouchacourt and Denoyer [8] propose a Bayesian generative model involving concept vectors. One drawback of these approaches is that they do not take into account how much each concept contributes to the prediction. In particular, selecting a set of concepts salient to a particular class does not guarantee that these concepts are sufficient in explaining the prediction. This notion of sufficiency is also referred to as "completeness" of explanations, as in [9, 10]. This motivates the following key questions: Is there an unsupervised approach to extract concepts that are sufficiently predictive of a DNN's decisions? If so, how can we measure this sufficiency?

In this paper, we propose such a completeness score for concept-based explanations. Our metric can be applied to a set of concept vectors that lie in a subspace of some intermediate DNN activations, which is a general assumption in previous work in this context [5, 6]. Intuitively speaking, a set of "complete" concepts can fully explain the prediction of the underlying model. By further assuming that for a complete set of concepts, the projections of activations onto the concepts are a sufficient statistic for the prediction of the model, we may measure the "completeness" of the concepts by the accuracy of the model given just these concept-based sufficient statistics. For concept discovery, we propose a novel algorithm, which can also be viewed as optimizing a surrogate likelihood of a concept-based data generation process, motivated by topic modeling [11]. To ensure that the discovered complete concepts are also coherent (distinct from other concepts) and semantically meaningful, we further introduce an interpretability regularizer.

Beyond concept discovery, we also propose a score, ConceptSHAP, to quantify concept attributions as contextualized importance. ConceptSHAP uniquely satisfies a key set of axioms involving the contribution of each concept to the completeness score [12, 2]. We also propose a class-specific version of ConceptSHAP that decomposes it with respect to each class in multi-class classification; this can be used to find the class-specific concepts that contribute the most to a given class. To verify the effectiveness of our automated completeness-aware concept discovery method, we create a synthetic dataset with a priori known ground-truth concepts. We show that our approach outperforms all compared methods in correct retrieval of the concepts, as well as in coherency, via a user study. Lastly, we demonstrate how our concept discovery algorithm provides additional insights into the behavior of DNN models on both real-world image and language datasets.
2 Related Work

Most post-hoc interpretability methods fall under two categories: (a) feature-based explanation methods, which attribute the decision to important input features [1, 2, 13, 14], and (b) sample-based explanation methods, which attribute the decision to previously observed samples [15, 16, 17, 18]. Recent work has also focused on evaluations of explanations, ranging from human-centric evaluations [2, 5] to functionally-grounded evaluations [19, 20, 21, 22, 10, 23]. Our work provides an evaluation of concept explanations based on the completeness criterion, which is related to 'fidelity' [23]. Our work is also related to methods that learn semantically-meaningful latent variables: some use dimensionality reduction [24, 25], while others uncover higher-level human-relatable concepts, e.g., for speech [26] and language [27, 28]. More recently, Locatello et al. [29] show that meaningful latent dimensions cannot be acquired in a completely unsupervised setting, implying the necessity of inductive biases for discovering meaningful latent dimensions; our work uses indirect supervision from the classifier of interest to discover semantically-meaningful latent dimensions. Chen et al. [30] use representative training patches to explain predictions in a self-interpretable framework for image classification, whereas our method provides top training patches for each concept and can be applied to given models and different data types.
3 Completeness of Concepts

Problem setting: Consider a set of $n$ training examples $x_1, x_2, \ldots, x_n$ with corresponding labels $y_1, y_2, \ldots, y_n$, and a given pre-trained DNN model that predicts the corresponding $y$ from the input $x$. We assume that the pre-trained DNN can be decomposed into two functions: the first part $\Phi(\cdot)$ maps an input $x_i$ to an intermediate layer $\Phi(x_i)$, and the second part $h(\cdot)$ maps the intermediate layer $\Phi(x_i)$ to the output $h(\Phi(x_i))$, a probability vector over classes, where $h_y(\Phi(x))$ is the probability of $x$ being predicted as label $y$ by the model $f$. For DNNs that process parts of the input at a time, such as those composed of convolutional layers, we can additionally assume that $\Phi(x_i)$ is the concatenation $[\phi(x_{i1}), \ldots, \phi(x_{iT})]$, so that $\Phi(\cdot) \in \mathbb{R}^{T \cdot d}$ and $\phi(\cdot) \in \mathbb{R}^{d}$. Here, $x_{i1}, x_{i2}, \ldots, x_{iT}$ denote different, potentially overlapping parts of the input $x_i$, such as a segment of an image or a sub-sentence of a text. These parts can, for example, be chosen to correspond to the receptive fields of the neurons at the intermediate layer $\Phi(\cdot)$; we will use these $x_{it}$ to relate discovered concepts to parts of the input. As an illustration of such parts, consider the fifth convolution layer of a VGG-16 network with $224 \times 224$ inputs, whose activations have size $14 \times 14 \times 512$. If we treat this layer as $\Phi(x_i)$, then $\phi(x_{i1})$ corresponds to the first 512 dimensions of the intermediate layer, and $\Phi(x_i) = [\phi(x_{i1}), \ldots, \phi(x_{i196})]$. Each $x_{ij}$ then corresponds to a square region of the input image (with effective stride 16), the receptive field of convolution layer 5 of VGG-16 [31]. When the receptive field of $\phi(\cdot)$ equals the entire input, as for multi-layer perceptrons, we may simply choose $T = 1$ so that $x_{i1} = x_i$ and $\Phi(x_i) = \phi(x_{i1})$; thus, our method applies to DNNs with arbitrary structures besides convolutional layers. To choose the intermediate layer at which to apply concepts, we follow previous work on concept explanations [5, 7]: starting from the layer closest to the prediction, we move toward the input until reaching a layer the user is satisfied with, since higher layers encode more abstract concepts with larger receptive fields and lower layers encode more specific concepts with smaller receptive fields.

Suppose that a concept discovery algorithm yields a set of $m$ concepts, denoted by unit vectors $c_1, c_2, \ldots, c_m$, that represent linear directions in the activation space $\phi(\cdot) \in \mathbb{R}^{d}$.¹ For each part $x_t$ of a data point (we omit $i$ for notational simplicity), the inner product $\langle \phi(x_t), c_j \rangle$ is viewed as the closeness between the part $x_t$ and the concept $c_j$, following [5, 7]. If $\langle \phi(x_t), c_j \rangle$ is large, then $x_t$ is close to concept $j$; however, when $\langle \phi(x_t), c_j \rangle$ is below some threshold, its value is not semantically meaningful beyond indicating that the input is not close to the concept. Based on this motivation, we define the concept product for a part $x_t$ as $v_c(x_t) := \big(\mathrm{TH}(\langle \phi(x_t), c_j \rangle, \beta)\big)_{j=1}^{m} \in \mathbb{R}^{m}$, where $\mathrm{TH}(\cdot, \beta)$ trims values less than $\beta$ to 0. We normalize the concept product to unit norm for numerical stability, and aggregate over all parts of the data to obtain the concept score for input $x$: $v_c(x) = \big(v_c(x_t)/\|v_c(x_t)\|\big)_{t=1}^{T} \in \mathbb{R}^{T \cdot m}$.

We assume that for "sufficient" concepts, the concept scores are sufficient statistics for the model output, and thus we may evaluate the completeness of concepts by how well we can recover the prediction given only the concept scores. Let $g: \mathbb{R}^{T \cdot m} \to \mathbb{R}^{T \cdot d}$ denote any mapping from the concept scores to the activation space of $\Phi(\cdot)$. If the concept scores $v_c(\cdot)$ are sufficient statistics for the model output, then there exists some mapping $g_f$ such that $h(g_f(v_c(x))) \approx f(x)$. We can now formally define the completeness score for a set of concept vectors $c_1, \ldots, c_m$:

Definition 3.1 (Completeness Score). Given a prediction model $f(x) = h(\Phi(x))$ and a set of concept vectors $c_1, \ldots, c_m$, we define the completeness score $\eta_f(c_1, \ldots, c_m)$ as

$\eta_f(c_1, \ldots, c_m) = \dfrac{\sup_g \, \mathbb{P}_{x,y \sim V}\big[\,y = \arg\max_{y'} h_{y'}(g(v_c(x)))\,\big] - a_r}{\mathbb{P}_{x,y \sim V}\big[\,y = \arg\max_{y'} f_{y'}(x)\,\big] - a_r}, \qquad (1)$

where $\sup_g \mathbb{P}_{x,y \sim V}[\,y = \arg\max_{y'} h_{y'}(g(v_c(x)))\,]$ is the best accuracy attainable by predicting the label given only the concept scores $v_c(x)$, $V$ is the validation set, and $a_r$ is the accuracy of random prediction, which sets the lower bound of the completeness score to 0. When the target $y$ is multi-label, we generalize the definition by replacing accuracy with binary accuracy, i.e., accuracy where each label is treated as a binary classification.

To calculate the completeness score, we can set $g$ to be a DNN or a simple linear projection and optimize it using stochastic gradient descent; in our experiments, we simply set $g$ to be a two-layer perceptron with 500 hidden units. We approximate $f(x)$ by $h(g(v_c(x)))$ rather than by an arbitrary neural network $\tilde{h}(v_c(x))$ for two reasons: (a) the measure of completeness then accounts for the architecture and parameters of the given model to be explained, and (b) the computation is much more efficient, since we only need to optimize the parameters of $g$ instead of a whole backbone.

The completeness score measures how "sufficient" the concept scores are as sufficient statistics of the model, based on the assumption that the concept scores of "complete" concepts are sufficient statistics of the model prediction $f(\cdot)$; by measuring the accuracy achieved from the concept scores, we effectively measure how "complete" the concepts are. The completeness score can also measure how well concepts explain a dataset independent of any model, by replacing $\phi(\cdot)$ and $h(\cdot)$ with identity functions and $f(x)$ with $y$. Below is an illustrative example of why we need the completeness score:
Example 3.1. Consider a simplified scenario where the input is $x \in \mathbb{R}^{m}$ and the intermediate layer $\Phi$ is the identity function. In this case, the $m$ concepts $c_1, c_2, \ldots, c_m$ are the one-hot encodings of the features of $x$. Assume that the concept values follow independent Bernoulli distributions with $p = 0.5$, and the model we attempt to explain is $f(x) = c_1 \,\mathrm{XOR}\, c_2 \,\ldots\, \mathrm{XOR}\, c_m$. The ground-truth concepts that are sufficient for the model prediction are then $c_1, c_2, \ldots, c_m$. However, if we have the information in $c_1, c_2, \ldots, c_{m-1}$ but not in $c_m$, we can predict the output of the model with probability at most 0.5, the same as random guessing; in this case, $\eta(c_1, c_2, \ldots, c_{m-1}) = 0$. On the other hand, given $c_1, c_2, \ldots, c_m$, we have $\eta(c_1, c_2, \ldots, c_m) = 1$.

The completeness score offers a way to assess how "sufficient" the discovered concepts are for explaining the reasoning behind a model's decisions. Not only is the completeness score useful for evaluating a proposed concept discovery method, it can also shed light on how much of the information learned by a DNN may not be "understandable" to humans: if the completeness score is very high but the discovered concepts do not make cohesive sense to humans, the DNN may be basing its decisions on other concepts that are hard to explain.

¹We apply additional normalization to $\phi(\cdot)$ so that it has unit norm, and keep the notation for simplicity.
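To make the construction concrete, below is a minimal sketch of the concept scores and the completeness estimate; all names (`phi_parts`, `h`, `model_acc`) are hypothetical stand-ins rather than the authors' released implementation, `h` is assumed to map reshaped activations to class logits, and the inner training loop only approximates the supremum over $g$ in Eq. (1).

```python
# Minimal sketch of concept scores and the completeness estimate of Def. 3.1,
# assuming precomputed unit-norm part activations. All names are hypothetical.
import torch
import torch.nn as nn

def concept_scores(phi_parts, concepts, beta):
    """phi_parts: (N, T, d); concepts: (m, d) unit vectors.
    Returns (N, T, m) thresholded, per-part-normalized concept scores v_c(x)."""
    prod = torch.einsum('ntd,md->ntm', phi_parts, concepts)        # <phi(x_t), c_j>
    prod = torch.where(prod > beta, prod, torch.zeros_like(prod))  # TH(., beta)
    return prod / (prod.norm(dim=-1, keepdim=True) + 1e-8)

def completeness(phi_parts, labels, h, concepts, beta, a_r, model_acc, epochs=20):
    """Fit g (a two-layer MLP with 500 hidden units) mapping concept scores back
    to the activation space, then score the accuracy of h(g(v_c(x))) per Eq. (1)."""
    N, T, d = phi_parts.shape
    m = concepts.shape[0]
    g = nn.Sequential(nn.Linear(T * m, 500), nn.ReLU(), nn.Linear(500, T * d))
    opt = torch.optim.Adam(g.parameters(), lr=1e-3)
    v = concept_scores(phi_parts, concepts, beta).reshape(N, T * m).detach()
    for _ in range(epochs):
        loss = nn.functional.cross_entropy(h(g(v).reshape(N, T, d)), labels)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        acc = (h(g(v).reshape(N, T, d)).argmax(-1) == labels).float().mean().item()
    return (acc - a_r) / (model_acc - a_r)   # Eq. (1), with sup_g approximated
```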
4 Discovering Completeness-aware Interpretable Concepts

Our goal is to discover a set of maximally-complete concepts under Definition 3.1, where each concept is interpretable and semantically meaningful to humans. We first discuss the limitations of recent notable works related to concept discovery, and then explain how we address them.

4.1 Limitations of existing approaches

TCAV and ACE [5, 7] are concept discovery methods that use training data for specific concepts and train linear concept classifiers to derive concept vectors. They quantify the saliency of a concept to a class using the 'TCAV score', based on the similarity of the loss gradients to the concept vectors. This score implicitly assumes a first-order relationship between the concepts and the model outputs. Regarding the labeling of concepts, TCAV relies on human-defined labels, while ACE uses image clusters automatically derived by k-means clustering of super-pixel segmentations. There are two main caveats to these approaches. The first is that while they may retrieve an important set of concepts, there is no guarantee of how 'complete' the concepts are in explaining the model; e.g., one may have 10 concepts with high TCAV scores that are still very insufficient for understanding the predictions. Moreover, human-suggested exogenous concept data might even encode confirmation bias. The second caveat is that, due to the first-order assumption, their saliency scores may fail to capture concepts that have non-linear relationships with the output; the concepts in Example 3.1 might not be retrieved by the TCAV score, since XOR is not a linear relationship. Overall, our completeness score complements previous work in concept discovery by adding a criterion to determine whether a set of concepts is sufficient to explain the model. The discussion of our relation to PCA is in the Appendix.
4.2 Completeness-aware concept discovery

The goal of our method is to obtain concepts that are complete with respect to the model. We consider the case where each data point $x_i$ has parts $x_{i1}, \ldots, x_{iT}$, as described above. We assume that the input data has spatial dependency, which can help in learning coherent concepts; this assumption works well for images and language, as we demonstrate in the results section. We aim for concepts whose consistent nearest neighbors occur in parts of the input, e.g., the heads of animals or the grass in the background, so that each concept pertains to certain spatial regions. By encouraging proximity between each concept and its nearest-neighbor patches, we aim to obtain consistent nearest neighbors and thus enhance interpretability. Lastly, we optimize the completeness term to encourage the completeness of the discovered concepts.

Learning concepts: To optimize the completeness of the discovered concepts, we maximize a surrogate for the completeness term over both the concept vectors $c_{1:m}$ and the mapping function $g$:

$\arg\max_{c_{1:m},\, g} \; \textstyle\sum_{(x,y)} \log h_y\big(g(v_c(x))\big). \qquad (2)$

Finding the underlying concepts whose concept scores maximize the recovered prediction score can be interpreted as treating the prediction of the DNN as a topic model. Assume the data generation process of $(x, y)$ follows the probabilistic graphical model $x_t \to z_t$ and $z_{1:T} \to y$, such that the concept assignment $z_t$ is generated by the data and the overall concept assignment $z_{1:T}$ determines the label $y$. The log-likelihood $\log \mathbb{P}[y \mid x]$ can then be estimated by $\log \mathbb{P}[y \mid x] = \log \int_z \mathbb{P}[y \mid z]\, \mathbb{P}[z \mid x] \approx \log \mathbb{P}[y \mid \mathbb{E}[z \mid x]]$, replacing sampling by a deterministic average. We note that $v_c(x)$ resembles $\mathbb{E}[z \mid x]$ and $h(g(v_c(x)))$ resembles $\mathbb{P}[y \mid \mathbb{E}[z \mid x]]$; as in supervised topic modeling [32], we jointly optimize the latent "topics" and the prediction model, but in an end-to-end fashion for efficiency instead of EM updates.

To enhance the interpretability of our concepts beyond "topics", we further design a regularizer that encourages the spatial dependency (and thus coherency) of concepts. Intuitively, we require that the top-K nearest-neighbor training input patches of each concept be sufficiently close to the concept, and that different concepts be as different as possible; this formulation encourages the top-K nearest neighbors of each concept to be coherent. K is a hyper-parameter, usually chosen based on domain knowledge of the desired frequency of concepts; we fix K to half of the average class size in our experiments. When using batch updates, we find that picking K = (batch size · average class ratio)/2 works well, where average class ratio = (average number of instances per class)/(total number of instances). The regularizer thus tries to maximize $\Phi(x_{it}) \cdot c_k$, the similarity between the $t$-th patch of the $i$-th example and concept $c_k$, while minimizing $c_j \cdot c_k$, the similarity between the $j$-th and $k$-th concept vectors. Averaging over all concepts, and defining $T_{c_k}$ as the set of top-K nearest neighbors of $c_k$, the final regularization term is

$R(c) = \lambda_1 \dfrac{\sum_{k=1}^{m} \sum_{x_{ab} \in T_{c_k}} \Phi(x_{ab}) \cdot c_k}{mK} \;-\; \lambda_2 \dfrac{\sum_{j \neq k} c_j \cdot c_k}{m(m-1)}.$
By adding the regularization term to (2), the final objective becomes

$\arg\max_{c_{1:m},\, g} \; \textstyle\sum_{(x,y)} \log h_y\big(g(v_c(x))\big) + R(c), \qquad (3)$

which we optimize using stochastic gradient descent. Since only the concept vectors $c_{1:m}$ and the mapping function $g$ (which we set as a two-layer NN) are optimized in the process, the optimization converges much faster than training the model from scratch.
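Below is a minimal sketch of the objective in Eq. (3), reusing `concept_scores` from the completeness sketch above; the top-K selection and batch handling are simplifications, and all names are hypothetical rather than the released code.

```python
# A minimal sketch of the discovery loss: negative log-likelihood (completeness
# surrogate) minus the interpretability regularizer R(c). Names are hypothetical.
import torch
import torch.nn.functional as F

def regularizer(phi_flat, concepts, K, lambda_1, lambda_2):
    """phi_flat: (P, d) part activations in the batch; concepts: (m, d) unit vectors."""
    m = concepts.shape[0]
    sims = phi_flat @ concepts.t()                 # similarity of each patch to each concept
    coherence = sims.topk(K, dim=0).values.mean()  # first term: pull top-K neighbors close
    cross = concepts @ concepts.t()
    separation = (cross.sum() - cross.diag().sum()) / (m * (m - 1))  # second term
    return lambda_1 * coherence - lambda_2 * separation

def discovery_loss(phi_parts, labels, h, g, concepts, beta, K, lambda_1, lambda_2):
    """Minimizing this loss maximizes objective (3)."""
    N, T, d = phi_parts.shape
    c = F.normalize(concepts, dim=-1)              # keep concept vectors unit-norm
    v = concept_scores(phi_parts, c, beta).reshape(N, T * c.shape[0])
    nll = F.cross_entropy(h(g(v).reshape(N, T, d)), labels)  # -log h_y(g(v_c(x)))
    return nll - regularizer(phi_parts.reshape(-1, d), c, K, lambda_1, lambda_2)
```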
4.3 ConceptSHAP: quantifying concept importance

Given a set of concept vectors $C_S = \{c_1, c_2, \ldots, c_m\}$ with a high completeness score, we would like to evaluate the importance of each individual concept by quantifying how much it contributes to the final completeness score. Let $s_i$ denote the importance score for concept $c_i$, such that $s_i$ quantifies how much of the completeness score $\eta(C_S)$ is contributed by $c_i$. Motivated by their successful applications in quantifying attributions in complex systems, we adapt Shapley values [12] to fairly assign the importance of each concept, which we call ConceptSHAP:

Definition 4.1. Given a set of concepts $C_S = \{c_1, c_2, \ldots, c_m\}$ and a completeness score $\eta$, we define the ConceptSHAP value $s_i$ for concept $c_i$ as

$s_i(\eta) = \sum_{S \subseteq C_S \setminus \{c_i\}} \dfrac{(m - |S| - 1)!\,|S|!}{m!}\,\big[\eta(S \cup \{c_i\}) - \eta(S)\big].$

The main benefit of Shapley values for importance scoring is that they uniquely satisfy a set of desired axioms: efficiency, symmetry, dummy, and additivity [12], listed in the following proposition with modifications to our setting:
Proposition 4.1. Given a set of concepts $C_S = \{c_1, c_2, \ldots, c_m\}$, a completeness score $\eta$, and an importance score $s_i$ for each concept $c_i$ that depends on $\eta$, the ConceptSHAP values are the unique importance assignment that satisfies the following four axioms:

• Efficiency: The importance values sum to the total completeness score: $\sum_{i=1}^{m} s_i(\eta) = \eta(C_S)$.
• Symmetry: For two equivalent concepts, i.e., $\eta(u \cup \{c_i\}) = \eta(u \cup \{c_j\})$ for every subset $u \subseteq C_S \setminus \{c_i, c_j\}$, we have $s_i(\eta) = s_j(\eta)$.
• Dummy: If $\eta(u \cup \{c_i\}) = \eta(u)$ for every subset $u \subseteq C_S \setminus \{c_i\}$, then $s_i(\eta) = 0$.
• Additivity: If $\eta_1$ and $\eta_2$ have importance values $s(\eta_1)$ and $s(\eta_2)$ respectively, then the importance value of a weighted sum of the two completeness scores equals the weighted sum of the importance values, i.e., $s_i(a_1 \eta_1 + a_2 \eta_2) = a_1 s_i(\eta_1) + a_2 s_i(\eta_2)$ for all $i$ and scalars $a_1, a_2$.

The proofs and interpretations of these axioms are discussed in [12, 2, 33].
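As a concrete illustration, ConceptSHAP can be computed by exact enumeration over concept subsets when $m$ is small; `eta` below is any callable returning the completeness of a subset of concept indices. This sketch is for exposition only: enumeration costs $O(2^m)$, and the implementation described in the Appendix instead uses a KernelSHAP-style regression [2].

```python
# A minimal sketch of ConceptSHAP (Definition 4.1) by exact enumeration.
from itertools import combinations
from math import factorial

def concept_shap(eta, m):
    """Returns Shapley values s_1, ..., s_m for eta over concepts {0, ..., m-1}."""
    shap = [0.0] * m
    for i in range(m):
        rest = [j for j in range(m) if j != i]
        for size in range(m):
            weight = factorial(m - size - 1) * factorial(size) / factorial(m)
            for S in combinations(rest, size):
                shap[i] += weight * (eta(set(S) | {i}) - eta(set(S)))
    return shap

# Toy completeness in the spirit of Example 3.1: concepts 0 and 1 are jointly
# necessary, concept 2 adds a small amount on its own.
eta = lambda S: 1.0 if {0, 1} <= S else 0.3 * len(S & {2})
print(concept_shap(eta, 3))  # [0.4, 0.4, 0.2]; sums to eta({0,1,2}) = 1 (efficiency)
```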
Per-class saliency of concepts: Thus far, ConceptSHAP measures the global attribution, i.e., the contribution to completeness when all classes are considered. However, per-class saliency, i.e., how much each concept contributes to the prediction of a particular class, can be informative in many cases. To obtain a concept importance score for each class, we define the completeness score with respect to a class by considering only the data points that belong to it, which is formalized as follows:
Definition 4.2. Given a prediction model $f(x) = h(\phi(x))$ and a set of concept vectors $c_1, c_2, \ldots, c_m$ that lie in the feature subspace of $\phi(\cdot)$, we define the completeness score $\eta_j(c_1, \ldots, c_m)$ for class $j$ as

$\eta_j(c_1, \ldots, c_m) = \dfrac{\mathbb{P}_{x,y \sim V_j}\big[\,y = \arg\max_{y'} h_{y'}(\hat{g}(v_c(x)))\,\big] - a_{r,j}}{\mathbb{P}_{x,y \sim V}\big[\,y = \arg\max_{y'} f_{y'}(x)\,\big] - a_r}, \qquad (4)$

where $V_j$ is the set of validation data with ground-truth label $j$, $a_{r,j}$ is the accuracy of random predictions for data in class $j$, and $\hat{g}$ is the maximizing mapping obtained in the optimization of the (global) completeness score. We then define the per-class ConceptSHAP for concept $i$ with respect to class $j$ as:

Definition 4.3. Given a prediction model $f(x)$ and a set of concept vectors in the feature subspace of $\phi(\cdot)$, the per-class ConceptSHAP for concept $i$ with respect to class $j$ is $s_{i,j}(\eta) = s_i(\eta_j)$.

For each class $j$, we may select the concepts with the highest ConceptSHAP scores with respect to class $j$. We note that $\sum_j \frac{|V_j|}{|V|} \eta_j = \eta$, and thus by the additivity axiom, $\sum_j \frac{|V_j|}{|V|} s_i(\eta_j) = s_i(\eta)$.
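A minimal sketch of the per-class scores, reusing `concept_shap` from the snippet above; `eta_for_class(S, j)` is a hypothetical callable evaluating Eq. (4) for concept subset `S` on the class-$j$ validation split with the globally fitted $\hat{g}$:

```python
# A minimal sketch of per-class ConceptSHAP (Definitions 4.2 and 4.3).
def per_class_concept_shap(eta_for_class, m, num_classes):
    """Returns a list indexed by class j; entry [j][i] is s_{i,j} = s_i(eta_j)."""
    return [concept_shap(lambda S, j=j: eta_for_class(S, j), m)
            for j in range(num_classes)]
```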
Figure 1: Examples (left) and nearest neighbors of our method (right) on the synthetic data. (a) Two random images and their corresponding ground-truth concepts (legend on the left); each object corresponds to a ground-truth concept solely via its shape. (b) Top nearest neighbors (each corresponding to a part of the full image) of each discovered concept; the ground-truth shapes, determined by shape only (with random colors), are shown on the left.
5 Experiments

In this section, we demonstrate our method on a synthetic dataset, where we have ground-truth concept importance, as well as on real-world image and language datasets.
5.1 Synthetic dataset with ground-truth concepts

We construct a synthetic image dataset with known and complete concepts, to evaluate how accurately the proposed concept discovery algorithm can extract them. Each image contains at most 15 shapes (shown in Fig. 1a), and by construction only 5 of them are relevant to the ground-truth targets. For each sample $x_i$, $z_{ij}$ is a binary variable representing whether $x_i$ contains shape $j$; $z_i$ is a 15-dimensional binary vector with elements independently sampled from a Bernoulli distribution with $p = 0.5$. We construct a 15-dimensional multi-label target for each sample, where the target $y_i$ is a function that depends only on the presence of the first 5 shapes in $x_i$. For example, the targets take forms such as $\sim(z_a \cdot z_b) + z_c$, $z_a + z_b + z_c$, and $z_a \cdot z_b + z_c \cdot z_d$ with indices in $\{1, \ldots, 5\}$, where $\sim$ denotes logical NOT (details are in the Appendix). We construct 48k training samples and 12k evaluation samples and use a convolutional neural network with 5 layers, obtaining high evaluation accuracy. We take the last convolution layer as the feature layer $\phi(x)$.
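Below is a minimal sketch of this data generation; the three target formulas shown are illustrative placeholders consistent with the forms above, not the exact list from the Appendix.

```python
# A minimal sketch of the synthetic labels: 15 Bernoulli(0.5) shape indicators
# per sample, each binary target a logical function of the first 5 indicators.
import numpy as np

rng = np.random.default_rng(0)
z = rng.binomial(1, 0.5, size=(48000, 15)).astype(bool)   # shape indicators

y = np.stack([
    ~(z[:, 0] & z[:, 1]) | z[:, 2],   # form: NOT(z_a AND z_b) OR z_c
    z[:, 0] | z[:, 1] | z[:, 2],      # form: z_a OR z_b OR z_c
    (z[:, 1] & z[:, 2]) ^ z[:, 3],    # form: (z_a AND z_b) XOR z_c
    # ... 12 more targets over z[:, :5] in the same style
], axis=1)
```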
Evaluations: We conduct a user study with 20 users to evaluate the nearest-neighbor samples of several concept discovery methods. In each question, a user sees the 10 nearest-neighbor images of a discovered concept vector (6 of them are shown on the right of Fig. 1b; the full version is in the Appendix) and is asked to choose the most common and coherent shape out of the 15 shapes, based on the 10 nearest neighbors. We evaluate our method, k-means clustering, PCA, ACE, and ACE-SP when m = 5 concepts are retrieved. Each user is tested on two randomly chosen methods in random order, so each method is tested on 8 users. We report the average number of correct concepts and the average number of agreed concepts (where the mode of each question is taken as the correct answer) for each method in Table 1. The average number of correct concepts measures how many of the correct concepts are retrieved by users via nearest neighbors. The average number of agreed concepts measures how consistently different users retrieve the same shapes, which relates to the coherency and conciseness of the nearest neighbors of each method. We also provide an automated alignment score, based on how well the discovered concept directions classify the ground-truth concepts; see the Appendix for details.

Figure 2: Completeness scores on the synthetic dataset (left) and on AwA (right) versus the number of discovered concepts m for all concept discovery methods. Ours-noc refers to our method without the completeness objective, as an ablation study.

Table 1: The average number of correct and agreed concepts chosen by users based on nearest neighbors, and the automated alignment score, for ACE, ACE-SP, PCA, k-means, and Ours. (Our method scores highest on all three metrics; the alignment score is not reported for ACE-SP.)
Results: We compare our method to ACE, k-means clustering, and PCA. For k-means and PCA, we take the patch embeddings as input, to be consistent with our method. For ACE, we implement one version that replaces the super-pixels with patches and another that takes super-pixels as input, referred to as ACE and ACE-SP respectively. We report the correct and agreed concepts from the user study, along with the automated alignment score, which requires no humans. We do not report the automated alignment score for ACE-SP, since it does not operate on patches and the comparison would be unfair (it would lead to much lower scores). Our method outperforms the others on correct concepts and on the alignment score, showing that it retrieves accurate concepts beyond the limitations of the others (such as the inability to capture non-linear relationships of salient concepts, as discussed in Section 4.1). The number of agreed concepts is also highest for our method, showing that it is interpretable enough for humans to consistently retrieve the same concepts from the nearest neighbors. As qualitative results, Fig. 1b shows the top-6 nearest neighbors for each concept $c_k$ of our method, ranked by the dot product $\langle c_k, \Phi(x_{ab}) \rangle$. All nearest neighbors contain a specific shape corresponding to one of the ground-truth shapes 1 to 5; for example, all nearest neighbors of concept 1 contain ground-truth shape 1, the cross listed in Fig. 1a. A complete list of the top-10 nearest neighbors for all concept discovery methods is shown in the Appendix.

5.2 Animals with Attributes (AwA)

We perform experiments on Animals with Attributes (AwA) [34], which contains 50 animal classes, using 26,905 images for training and 2,965 for evaluation. We use an Inception-V3 model pre-trained on ImageNet [35], which yields high test accuracy. We apply our concept discovery algorithm and then conduct ad-hoc duplicate-concept removal, dropping one of any two concept vectors whose dot product exceeds 0.95; this leaves 53 concepts in total. We then calculate ConceptSHAP and the per-class saliency score for each concept and each class. For each class, the top concepts by per-class ConceptSHAP are the most important concepts for classifying that class, as shown in Fig. 3. While ConceptSHAP captures the sufficiency of concepts for prediction, we sometimes also want to show examples. We measure the quality of the nearest-neighbor explanations by the average dot product between the concept vector and the nearest-neighbor patches belonging to the class; that is, the quality measure is the first term of $R(c)$ restricted to a concept, $R_1(c_k) = \frac{1}{K} \sum_{x_{ab} \in T_{c_k}} \langle \Phi(x_{ab}), c_k \rangle$, where the top-K set is limited to image patches from the class of interest. When the nearest-neighbor set contains several patches of the same original image, we only show the patch with the highest similarity to the concept, to increase diversity.

Figure 3: Concept examples with the samples that are nearest to the concept vectors in the activation space on AwA. The per-class ConceptSHAP score is listed above the images.
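A minimal sketch of the duplicate-concept filtering and of the per-concept quality measure $R_1$ described above; array names are hypothetical.

```python
# Drop one of any two concept vectors whose dot product exceeds 0.95, and
# score a concept by the mean similarity of its top-K patches from one class.
import numpy as np

def remove_duplicates(concepts, thresh=0.95):
    """concepts: (m, d) unit vectors; keeps the first of any near-duplicate pair."""
    keep = []
    for k in range(len(concepts)):
        if all(concepts[k] @ concepts[j] <= thresh for j in keep):
            keep.append(k)
    return concepts[keep]

def r1_quality(class_patches, concept, K):
    """Average dot product between a concept and its top-K patches of one class."""
    sims = class_patches @ concept
    return np.sort(sims)[-K:].mean()
```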
Table 2: The 4 discovered concepts on IMDB, with some nearest neighbors, the most frequent words among the top-500 nearest neighbors, and ConceptSHAP scores.

Concept 1. Nearest neighbors: "poorly constructed what comes across as interesting is the"; "wasting my time with a comment but this movie"; "awful in my opinion there were". Frequent words: worst (168), ever (69), movie (61), seen (55), film (50), awful (42), time (40), waste (34). ConceptSHAP: 0.280.
Results: We show the top concepts (ranked by ConceptSHAP) of 3 classes, keeping concepts with $R_1 > 0.8$, in Fig. 3 (full results are in the Appendix). Note that since our method finds concepts for all classes, as opposed to concepts specific to one class (as in [7, 30]), we discover common concepts shared across many classes. For example, concept 7, whose nearest neighbors show grass texture, is important for the classes 'Squirrel', 'Rabbit', and 'Bobcat', since all these animals appear in prairie. Concept 8 shows an oval head with large black round eyes, shared by the classes 'Rabbit', 'Squirrel', and 'Weasel', while concept 46 shows the head of the 'Bobcat', shared with the classes 'Lion', 'Leopard', 'Tiger', 'Antelope', and 'Gorilla', all of which have animal heads that are more rectangular and significantly different from the animal heads of concept 8. We find that having concepts shared between classes is useful for interpreting the model. Fig. 2 shows that our method achieves the highest completeness of all methods on both the synthetic dataset and AwA. As a sanity check, we include the baseline 'ours-noc', where the completeness objective is removed from (3); our method has much higher completeness than 'ours-noc', demonstrating the necessity of the completeness term.

5.3 IMDB

We apply our method to IMDB, a text dataset with movie reviews classified as either positive or negative. We use 37,500 reviews for training and 12,500 for testing, and employ a 4-layer CNN model with 0.9 test accuracy. We apply our concept discovery method to obtain 4 concepts, where each part $x_{ij}$ consists of 10 consecutive words of the review. The completeness of the 4 concepts is 0.97; thus the 4 concepts are highly representative of the classification model.
Results: For each concept, Table 2 shows (a) the top nearest neighbors based on the dot product between the concept and parts of reviews, (b) the most frequent words in the top-500 nearest neighbors (excluding stop words), and (c) the ConceptSHAP score. Concepts 1 and 2 contain mostly negative sentiment, as evident from the nearest neighbors: concept 1 tends to criticize the movie/film directly, while concept 2 expresses negativity via words such as "not", "doesn't", and "even". We note that the ratings appearing in concept 2 are also negative, since scores of 1 and 2 are considered very negative in movie reviews. On the other hand, concepts 3 and 4 contain mostly positive sentiment: concept 3 seems to discuss the plot of the movie without directly acclaiming or criticizing it, while concept 4 often contains extremely positive adjectives such as "excellent" and "wonderful". More nearest neighbors are provided in the Appendix.
Appending discovered concepts: We perform an additional experiment in which we randomly append 5 nearest neighbors (out of the 500 nearest neighbors) of one concept to the end of every test instance, to further validate the usefulness of the discovered concepts. For example, we may add "wasting my time with a comment but this movie", along with 4 other nearest neighbors of concept 1, to the end of a test sentence. The original average prediction score over the test sentences is 0.516; after randomly appending 5 nearest neighbors of concepts 1, 2, 3, and 4, the average prediction score becomes 0.103, 0.364, 0.594, and 0.678 respectively. This suggests that the concept scores are closely related to how the model makes predictions, and may even be used to manipulate the prediction. We note that while concept 1 contains stronger and more direct negative words than concept 2, concept 2 has a higher ConceptSHAP value. We hypothesize that concept 2 better detects weak negative sentences that are difficult to explain by concept 1, and thus contributes more to the completeness score.
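A minimal sketch of this appending experiment follows; `predict_fn` and `neighbors` are hypothetical stand-ins for the trained sentiment model and a concept's top-500 nearest-neighbor word windows.

```python
# Append k random nearest-neighbor phrases of one concept to every test review
# and compare the average predicted positive-class score before and after.
import random

def shift_from_appending(predict_fn, test_reviews, neighbors, k=5, seed=0):
    random.seed(seed)
    base = sum(predict_fn(r) for r in test_reviews) / len(test_reviews)
    appended = [r + " " + " ".join(random.sample(neighbors, k)) for r in test_reviews]
    new = sum(predict_fn(r) for r in appended) / len(appended)
    return base, new   # e.g., 0.516 -> 0.103 when appending concept-1 neighbors
```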
6 Conclusions

We propose to quantify the sufficiency of a particular set of concepts in explaining a model's behavior via the completeness of concepts. By optimizing the completeness term coupled with additional constraints that encourage interpretability, we can discover concepts that are both complete and interpretable. Through experiments on synthetic data and on real-world image and language data, we demonstrate that our method recovers ground-truth concepts correctly and provides conceptual insights into the model by examining nearest neighbors. Although our work focuses on post-hoc explainability of pre-trained DNNs, joint training with our proposed objective could yield inherently-interpretable DNNs; exploring the benefits of learning the concepts jointly with the model is an interesting future direction.
References

[1] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In KDD. ACM, 2016.
[2] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In NIPS, 2017.
[3] Sharon Lee Armstrong, Lila R. Gleitman, and Henry Gleitman. What some concepts might not be. Cognition, 13(3):263–308, 1983.
[4] Joshua Brett Tenenbaum. A Bayesian framework for concept learning. PhD thesis, Massachusetts Institute of Technology, 1999.
[5] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In ICML, 2018.
[6] Bolei Zhou, Yiyou Sun, David Bau, and Antonio Torralba. Interpretable basis decomposition for visual explanation. In ECCV, 2018.
[7] Amirata Ghorbani, James Wexler, James Zou, and Been Kim. Towards automatic concept-based explanations. In NeurIPS, 2019.
[8] Diane Bouchacourt and Ludovic Denoyer. EDUCE: Explaining model decisions through unsupervised concepts extraction. arXiv preprint arXiv:1905.11852, 2019.
[9] Leilani H. Gilpin, David Bau, Ben Z. Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining explanations: An overview of interpretability of machine learning. In DSAA. IEEE, 2018.
[10] Fan Yang, Mengnan Du, and Xia Hu. Evaluating explanation without ground truth in interpretable machine learning. arXiv preprint arXiv:1907.06831, 2019.
[11] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. JMLR, 3(Jan):993–1022, 2003.
[12] Lloyd S. Shapley. A value for n-person games, pages 31–40. 1988.
[13] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
[14] Jianbo Chen, Le Song, Martin J. Wainwright, and Michael I. Jordan. L-Shapley and C-Shapley: Efficient model interpretation for structured data. arXiv preprint arXiv:1808.02610, 2018.
[15] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In ICML, 2017.
[16] Chih-Kuan Yeh, Joon Kim, Ian En-Hsu Yen, and Pradeep K. Ravikumar. Representer point selection for explaining deep neural networks. In NIPS, 2018.
[17] Rajiv Khanna, Been Kim, Joydeep Ghosh, and Sanmi Koyejo. Interpreting black box predictions using Fisher kernels. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3382–3390, 2019.
[18] Sercan Ö. Arık and Tomas Pfister. Attention-based prototypical learning towards interpretable, confident and robust deep neural networks. arXiv preprint arXiv:1902.06292, 2019.
[19] Wojciech Samek, Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, and Klaus-Robert Müller. Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems, 28(11):2660–2673, 2016.
[20] Been Kim, Rajiv Khanna, and Oluwasanmi O. Koyejo. Examples are not enough, learn to criticize! Criticism for interpretability. In NIPS, 2016.
[21] Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross. Towards better understanding of gradient-based attribution methods for deep neural networks. arXiv preprint arXiv:1711.06104, 2017.
[22] Chih-Kuan Yeh, Cheng-Yu Hsieh, Arun Sai Suggala, David I. Inouye, and Pradeep Ravikumar. On the (in)fidelity and sensitivity of explanations. In NeurIPS, 2019.
[23] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5):1–42, 2018.
[24] Tsung-Han Chan, Kui Jia, Shenghua Gao, Jiwen Lu, Zinan Zeng, and Yi Ma. PCANet: A simple deep learning baseline for image classification? IEEE Transactions on Image Processing, 24(12):5017–5032, 2015.
[25] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[26] Jan Chorowski, Ron J. Weiss, Samy Bengio, and Aäron van den Oord. Unsupervised speech representation learning using WaveNet autoencoders. arXiv preprint arXiv:1901.08810, 2019.
[27] Alec Radford, Rafal Józefowicz, and Ilya Sutskever. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444, 2017.
[28] S. Grover, C. Pulice, G. I. Simari, and V. S. Subrahmanian. BEEF: Balanced English explanations of forecasts. IEEE Transactions on Computational Social Systems, 6(2):350–364, 2019.
[29] Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359, 2018.
[30] Chaofan Chen, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, and Jonathan K. Su. This looks like that: Deep learning for interpretable image recognition. In NeurIPS, pages 8928–8939, 2019.
[31] André Araujo, Wade Norris, and Jack Sim. Computing receptive fields of convolutional neural networks. Distill, 4(11):e21, 2019.
[32] Jon D. Mcauliffe and David M. Blei. Supervised topic models. In NIPS, 2008.
[33] Katsushige Fujimoto, Ivan Kojadinovic, and Jean-Luc Marichal. Axiomatic characterizations of probabilistic and cardinal-probabilistic interaction indices. Games and Economic Behavior, 55(1):72–99, 2006.
[34] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR. IEEE, 2009.
[35] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In CVPR, 2016.
[36] N. Narodytska and S. Kasiviswanathan. Simple black-box adversarial attacks on deep neural networks. In CVPR Workshops, 2017.
Appendix A Relation to PCA
PCA: We show that, under strict conditions, the PCA vectors applied to an intermediate layer, with the principal components used as concept vectors, maximize the completeness score.
Proposition A.1. When $h$ is an isometry from $(\Phi(\cdot), \|\cdot\|_F)$ to $(f(\cdot), \|\cdot\|_F)$, and additionally $f(x_i) = [y_i]$ for all $(x_i, y_i) \in V$ (i.e., the loss is minimized, with $[y_i]$ the one-hot vector of class $y_i$), and we further assume $T = 1$, $\mathbb{E}[z] = \langle \phi(x), c \rangle$, and that $l$ is a linear function, then the first $m$ PCA vectors maximize the $\ell_2$ surrogate of $\eta$.

We note that the assumptions of this proposition are extremely stringent and may not hold in general. When the isometry and other assumptions do not hold, PCA no longer maximizes the completeness score, as the lowest reconstruction error at the intermediate layer does not imply the highest prediction accuracy at the output. In fact, DNNs are known to be very sensitive to small perturbations of the input [36]: they can yield very different outputs even when the difference in the input is small (and often perceptually hard for humans to recognize). Thus, even if the reconstruction loss between two inputs is low at an intermediate layer, subsequent deep nonlinear processing may cause them to diverge significantly. We provide this proposition only because completeness and PCA share the idea of minimizing reconstruction loss via dimensionality reduction. Another notable limitation of using PCA vectors as concept vectors is the lack of human interpretability of the principal components, which are not trained to be semantically meaningful but to greedily minimize the reconstruction error (or, equivalently, maximize the projected variance).
Proof of Proposition A.1

Proof. By the basic properties of PCA, the first $m$ PCA vectors (principal components) minimize the $\ell_2$ reconstruction error. Define the concatenation of the $m$ PCA vectors as a matrix $p$, let $\|\cdot\|_F$ denote the Frobenius norm, and let $\mathrm{proj}(\phi(x), p)$ denote the projection of $\phi(x)$ onto the span of $p$. The basic property of PCA is then equivalent to: for all $c = [c_1\, c_2\, \ldots\, c_m]$,

$\sum_{x \in V} \| \mathrm{proj}(\phi(x), p) - \phi(x) \|_F^2 \;\le\; \sum_{x \in V} \| \mathrm{proj}(\phi(x), c) - \phi(x) \|_F^2.$

By the isometry of $h$, we have

$\sum_{x \in V} \| h(\mathrm{proj}(\phi(x), p)) - h(\phi(x)) \|_F^2 \;\le\; \sum_{x \in V} \| h(\mathrm{proj}(\phi(x), c)) - h(\phi(x)) \|_F^2,$

and since $f(x)$ equals the one-hot label, we can rewrite this as

$\sum_{(x,y) \in V} \| h(\mathrm{proj}(\phi(x), p)) - [y] \|_F^2 \;\le\; \sum_{(x,y) \in V} \| h(\mathrm{proj}(\phi(x), c)) - [y] \|_F^2. \qquad (5)$

We note that under the assumptions, $\mathbb{E}[z \mid x] = \phi(x)\,c$, and thus the reconstruction layer $l$ can be written as

$l = \arg\min_l \sum_{(x,y) \in V} \| [y] - h(l(\mathbb{E}[z \mid x])) \|_F^2 = \arg\min_l \sum_{(x,y) \in V} \| [y] - h(l(\phi(x)\,c)) \|_F^2 = \arg\min_l \sum_{x \in V} \| \phi(x) - l(\phi(x)\,c) \|_F^2. \qquad (6)$

By definition, $\sum_{x \in V} \| \phi(x) - l(\phi(x)\,c) \|_F^2$ is minimized by the projection, and thus $l(\phi(x)\,c) = \mathrm{proj}(\phi(x), c)$. Thus, (5) can be written as

$\sum_{(x,y) \in V} \| h(l(\phi(x)\,p)) - [y] \|_F^2 \;\le\; \sum_{(x,y) \in V} \| h(l(\phi(x)\,c)) - [y] \|_F^2,$

and subsequently, for any $c$,

$\dfrac{-\,\mathbb{E}_{x,y \sim V}\big[\| [y] - h(l(\phi(x)\,p)) \|_F^2\big] - R}{-\,\mathbb{E}_{x,y \sim V}\big[\| [y] - f(x) \|_F^2\big] - R} \;\ge\; \dfrac{-\,\mathbb{E}_{x,y \sim V}\big[\| [y] - h(l(\phi(x)\,c)) \|_F^2\big] - R}{-\,\mathbb{E}_{x,y \sim V}\big[\| [y] - f(x) \|_F^2\big] - R},$

where the $\ell_2$ surrogate of $\eta$ is obtained from Definition 3.1 by replacing accuracy with the negative squared error and $a_r$ with a constant $R$ (assuming the common denominator is positive). Thus, the PCA vectors maximize the $\ell_2$ surrogate of the completeness score. We emphasize that Proposition A.1 has several assumptions that may not be practical; the proposition is only meant to show that PCA optimizes our definition of completeness under a very stringent condition, as the key idea of both completeness and PCA is to prevent information loss through dimensionality reduction.
Appendix B Additional Experimental Results and Settings
Automated Alignment Score on the Synthetic Dataset

Given the existence $z_{iv}$ of each ground-truth shape in each sample $x_i$, we can evaluate how closely the discovered concept vectors $c_1, \ldots, c_m$ align with the ground-truth shapes 1 to 5. Our evaluation assumes that if $c_k$ corresponds to some shape $v$, then the parts of the input that contain shape $v$ and the parts that do not can be linearly separated by $c_k$; that is, $c_k \cdot x_a > c_k \cdot x_b$ (or $c_k \cdot x_a < c_k \cdot x_b$) for all parts $x_a$ that contain shape $v$ and all parts $x_b$ that do not. Without loss of generality, we assume $c_k \cdot x_a > c_k \cdot x_b$ when $x_a$ contains shape $v$ and $x_b$ does not (for notational simplicity; we check both $c_k$ and $-c_k$ for each discovered concept). Under this assumption, $\max_{t=1}^{T} c_k \cdot x_{it} > \max_{t=1}^{T} c_k \cdot x_{jt}$ for all $i, j$ such that $z_{iv} = 1$ and $z_{jv} = 0$, since at least one part of $x_i$ should contain the ground-truth shape $v$. Therefore, to evaluate how well $c_k$ corresponds to shape $v$, we measure the accuracy of using $[\max_{t=1}^{T} c_k \cdot x_{it} > e]$ to classify $z_{iv}$. More formally, we define the matching score between concept $c_k$ and shape $v$ as

$\mathrm{Match}(c_k, z_v) = \mathbb{E}_{x_i \sim V}\big[\,\mathbb{1}[\max_{t \in [1,T]} c_k \cdot x_{it} > e] = z_{iv}\,\big],$

where $e$ is some constant. We then evaluate how well the set of discovered concepts $c_1, \ldots, c_m$ aligns with shapes 1 to 5:

$\mathrm{Alignment}(c_{1:m}, z) = \max_{P \in [1,m]^5} \frac{1}{5} \sum_{j=1}^{5} \mathrm{Match}(c_{P[j]}, z_j),$

which measures the best average matching accuracy when assigning the best concept vector to differentiate each shape. For each concept vector $c_j$, we test $c_j$ and $-c_j$ and choose the direction that leads to the highest alignment score.
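A minimal sketch of these two scores follows, assuming the per-part projections $c_k \cdot x_{it}$ are precomputed as an array `proj` of shape (N, T, m) and the ground-truth indicators as `z`; the search over the threshold $e$ and the $\pm c_k$ sign check are simplified, and names are hypothetical.

```python
# Matching and alignment scores for discovered concepts on the synthetic data.
import numpy as np
from itertools import product

def match_score(proj_k, z_v, e):
    """Accuracy of [max_t c_k . x_it > e] as a classifier for shape v,
    taking the better of c_k and -c_k (here: the flipped prediction)."""
    pred = proj_k.max(axis=1) > e
    return max((pred == z_v).mean(), (~pred == z_v).mean())

def alignment(proj, z, thresholds):
    """Best average matching accuracy, assigning the best concept to each shape."""
    N, T, m = proj.shape
    best = np.zeros((m, 5))
    for k, v in product(range(m), range(5)):
        best[k, v] = max(match_score(proj[:, :, k], z[:, v].astype(bool), e)
                         for e in thresholds)
    return best.max(axis=0).mean()
```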
Creation of the Toy Example

Each of the 15 targets is a logical function of the first five shape indicators $z_1, \ldots, z_5$, combining AND ($\cdot$), OR ($+$), XOR, and NOT ($\sim$); the list contains targets of the forms $\sim(z_a \cdot z_b) + z_c$, $z_a + z_b + z_c$, $z_a \cdot z_b + z_c \cdot z_d$, $z_a\,\mathrm{XOR}\,z_b$, $z_a + z_b$, $\sim(z_a + z_b) + z_c$, $(z_a \cdot z_b)\,\mathrm{XOR}\,z_c$, $z_a \cdot z_b + z_c$, $\sim(z_a + z_b)$, $\sim(z_a \cdot z_b + z_c)$, and a single indicator $z_a$. We create the dataset in matplotlib, where the color of each shape is sampled independently from green, red, blue, black, orange, purple, and yellow, and the locations are sampled randomly under the constraint that different shapes do not coincide with each other.
Hyper-parameter Sensitivity

We show the completeness score for varying $\lambda_1$, $\lambda_2$, and $\beta$ in Figures 4, 5, and 6 (when varying one hyper-parameter, the others are held fixed at their default values for the toy dataset). Both the completeness and alignment scores remain above 0.93 over a wide range of $\lambda_1$, $\lambda_2$, and $\beta$, so our method outperforms all baselines for a wide range of hyper-parameters; it is therefore not sensitive to the hyper-parameters on the toy dataset. For the AwA dataset we increase the regularizer strength, since the optimization becomes more difficult with a deeper neural network and stronger regularization helps ensure interpretability; the completeness remains above 0.9 over the tested range of $\lambda_1$ and $\lambda_2$. Overall, our method is not overly sensitive to the selection of hyper-parameters. The general principle for tuning is to choose the largest $\lambda_1$ and $\lambda_2$ that still give a high completeness value (usually > 0.95).
Additional Nearest Neighbors for the Toy Example

We show the 10 nearest neighbors for each concept obtained by our method and by the baseline methods on the toy example in Figures 7 to 11. These nearest neighbors are the ones used in the user study, to test whether they allow humans to retrieve the correct ground-truth concept for each method.
User Study Setting and Discussion
For the user study, we set m = 5 (i.e., 5 discovered concepts) for all compared methods. The order of the 2 randomly chosen conditions (and which 2 conditions are paired), the order of the questions, and the order of the choices are all randomized to avoid biases and learning effects. All users are graduate students with some knowledge of machine learning; none of them have (self-reported) color-blindness. For each discovered concept, a user is asked to find the most common and coherent shape given the top 10 nearest neighbors; an example question is shown in Figure 12. Each user is given 10 questions, which correspond to the nearest neighbors of the discovered concepts for two random methods (each method has 5 discovered concepts, so two methods have 10 discovered concepts in total). There are 20 users in total, so each method is tested on 8 users. For each method, we report the average number of correct answers chosen by the users: for example, if a user chooses shapes 1, 2, 5, 7, 5, then the number of correct answers chosen by that user is 3 (since 1, 2, 5 are ground-truth shapes). We average the correct answers over the 8 users of each method to obtain the "average number of correct answers chosen by users". We also report the average number of agreed answers chosen by the users: for example, if most users choose 1, 2, 5, 7, 5 for the five questions respectively, we set 1, 2, 5, 7, 5 as the ground truth for those questions; if user A answers 1, 2, 5, 7, 10, their number of agreed answers is 4. We average the agreed answers over the 8 users of each method to obtain the "average number of agreed answers chosen by users". We find that the other methods mainly fail because (a) the same concept is chosen repeatedly (e.g., concepts 2 and 4 of ACE), (b) concepts lack disentanglement (coherency) (e.g., concept 5 of PCA shows two shapes in all 10 nearest neighbors), or (c) the highlighted concepts are not related to any ground-truth concept (e.g., concept 4 of k-means). Failures (a) and (c) relate to the lack of completeness of a method, and (b) relates to the lack of coherency.

Figure 4: Completeness and alignment scores for different values of the hyper-parameter λ1.
Implementation Details

For calculating ConceptSHAP, we use the regression-based method of KernelSHAP [2] to compute the Shapley values efficiently. For ACE on the toy example, we set the number of clusters to 15 and choose the concepts based on the TCAV score; for ACE on AwA, we set the number of clusters to 150 and likewise choose concepts by TCAV score. For PCA, we return the top m principal components when the number of discovered concepts is m. For k-means, we set the number of clusters to m when the number of discovered concepts is m, and return the cluster means as the discovered concepts.
Additional Nearest Neighbors for AwA

We show additional nearest neighbors of the top concepts on AwA for all 50 classes in Figures 13 to 21. For each class, the 3 concepts with the highest ConceptSHAP with respect to the class, among those with $R_1$ above 0.8, are shown, along with the ConceptSHAP score with respect to the class. We see that many important concepts are shared between different classes, and most of them are semantically meaningful. To list some examples: concept 7 corresponds to grass; concept 33 shows a specific kind of wolf-like face (with two different colors on the face); concept 27 corresponds to sky/ocean views; concept 25 shows a side-on face shared by many animals; concept 46 shows the frontal face of cat-like animals; concept 21 shows sandy/wilderness background texture; concept 38 shows gray background resembling an asphalt road; concept 43 shows the similar ears of several animals; and concept 31 shows furry/rough texture against a plain background.
Additional Nearest Neighbors for NLP

We show additional nearest neighbors of the 4 concepts discovered on IMDB in Table 3. The nearest neighbors of concepts 1 and 2 are generally negative, while those of concepts 3 and 4 are generally positive, consistent with the discussion in Section 5.3.
Figure 5: Completeness and alignment scores for different values of the hyper-parameter λ2.

Figure 6: Completeness and alignment scores for different values of the hyper-parameter β.

Figure 7: (Larger version) Nearest neighbors for each concept obtained by our method on the toy example.

Figure 8: (Larger version) Nearest neighbors for each concept obtained by ACE on the toy example.

Figure 9: (Larger version) Nearest neighbors for each concept obtained by PCA on the toy example.

Figure 10: (Larger version) Nearest neighbors for each concept obtained by k-means on the toy example.

Figure 11: (Larger version) Nearest neighbors for each concept obtained by ACE-SP on the toy example.

Table 3: The 4 discovered concepts with more nearest neighbors. Concept 1: "poorly constructed what comes across as interesting is the"; "wasting my time with a comment but this movie"; "awful in my opinion there were".

Figure 12: An example question from a screenshot of the human study.

Figure 13: (Larger version) Nearest neighbors for each concept obtained on AwA.

Figure 14: (Larger version) Nearest neighbors for each concept obtained on AwA.

Figure 15: (Larger version) Nearest neighbors for each concept obtained on AwA.

Figure 16: (Larger version) Nearest neighbors for each concept obtained on AwA.

Figure 17: (Larger version) Nearest neighbors for each concept obtained on AwA.

Figure 18: (Larger version) Nearest neighbors for each concept obtained on AwA (classes: Otter, Persian Cat, Polar Bear, Ox, Rabbit, Pig).

Figure 19: (Larger version) Nearest neighbors for each concept obtained on AwA.

Figure 20: (Larger version) Nearest neighbors for each concept obtained on AwA.