A Model Explanation System: Latest Updates and Extensions
Ryan Turner
RYAN.TURNER@NGC.COM
Northrop Grumman Corporation
Abstract
We propose a general model explanation system (MES) for "explaining" the output of black box classifiers. This paper describes extensions to Turner (2015), which is referred to frequently in the text. We use the motivating example of a classifier trained to detect fraud in a credit card transaction history. The key aspect is that we provide explanations applicable to a single prediction, rather than provide an interpretable set of parameters. We focus on explaining positive predictions (alerts). However, the presented methodology is symmetrically applicable to negative predictions.

In many classification applications, but especially in fraud detection, there is an expectation of false positives. Alerts are given to a human analyst before any further action is taken. Such problems are sometimes referred to as "anomaly detection." Analysts often insist on understanding "why" there was an alert, since an opaque alert makes it difficult for them to proceed. Analogous scenarios occur in computer vision, credit risk, spam detection, etc.

Furthermore, the MES framework is useful for model criticism. In the world of generative models, practitioners often generate synthetic data from a trained model to get an idea of "what the model is doing" (Gelman et al., 1996). Our MES framework augments such tools. As an added benefit, MES is applicable to completely nonprobabilistic black boxes that only provide hard labels.
Example
In the context of credit card fraud we may have feature vectors x containing the number of online transactions, the geographic distance traveled for in-person transactions, the number of novel merchants, and so on. A simple example explanation is: "Today, there were two in-person transactions in the USA, followed by $1700 in country X." MES would output "(x_i ≥ 2) ∧ (x_j ≥ 1700)" for the appropriate features i and j. We graphically depict MES on a separate illustrative example in Fig. 1.

Figure 1. Illustration of MES on a toy classifier with test inputs x_1, x_2, and x_3 (blue dots). The classifier f outputs 1 in the hatched regions and 0 elsewhere. The input density on the data is Gaussian (blue ellipse). The red boundaries are the respective explanations (E_1, E_2, and E_3) for each of the test inputs; each is an axis aligned threshold of the form [x]_d ≤ a, and the red arrows depict the ≤ relation. As most of the data comes from inside the blue ellipse, MES does not care that the explanations disagree with the classifier at the plot's extremities. Although this example is 2D, MES is applicable in high dimensions.

Explanation vs. interpretability
We adopt the paradigm where prediction accuracy is of paramount importance, but explanation is also important; therefore, we are not willing to give up any predictive accuracy for explanation. Both machine learning and statistics have a long history of building models that are "interpretable," such as (small) decision trees (Quinlan, 1986) and sparse linear models (Tibshirani, 1996). MES augments black boxes with explanations, as the best predictor may not be "interpretable."

Historically, this dilemma has created two distinct approaches: 1) the "interpretable" models approach, common in scientific discovery/bioinformatics, and 2) the accuracy-focused approach, common in computer vision with methods including deep learning, k-NNs, and support vector machines (SVMs). The downside of the interpretable approach is seen in machine learning competitions, where the winning methods are typically nonparametric, or have a very large number of parameters (e.g., deep networks). MES has elements of both approaches. We do not aim to summarize how the model "works in general" (Andrews et al., 1995), but only seek explanations of individual cases. Although the distinction is subtle, explaining an individual prediction is a much easier task than explaining an entire model. MES is the first method to utilize this weaker requirement to augment black boxes with explanations without affecting accuracy.
1. Formal setup
Consider a black box binary classifier f that takes a feature vector x ∈ X = R^D and provides a binary label, f : X → {0, 1}. In the introductory examples, explanations are Boolean statements about the feature vector. In effect, an explanation E is a function from X to {0, 1}. The mapping E* : X → E finds the best explanation from the set of possible explanations E ⊂ (X → {0, 1}). We also define that E contains a "null explanation" E_0(x) := 1. Note that an explanation is either sufficiently simple to be in E or it is not; there is no other metric of "explanation simplicity."

Turner (2015) formalized axioms on what properties a sensible explanation system E* should have. One possibility, which also has favorable computational properties, is an optimization over the following explanation quality score S:

  E*(x) = argmax_{E ∈ E} S(E)  s.t. E(x) = 1,   (1)
  S(E) = P(E(x') = 1 | f(x') = 1) − P(E(x') = 1 | f(x') = 0),

where f and E are deterministic functions; we are marginalizing over the input distribution p(x'). Notably, S is equivalent to the covariance: S(E) ∝ Cov[E, f]. Under this definition the null explanation E_0 has score S(E_0) = 0, and the true classifier f has score S(f) = 1. Therefore, if the decision rule f is in E, then it is preferable to any other explanation; and the selected explanation E = E*(x) has a normalized quality score S(E) ∈ [0, 1]. Also note that by construction, any explanation E ≠ E_0 selected for explaining f(x) = 1 would not be selected for the converse problem of trying to explain f(x) = 0.
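To make the score concrete, here is a minimal Monte Carlo sketch of estimating S(E). It is our own illustration rather than the paper's code; the names f, E, and sample_p (a sampler for p(x)) are hypothetical.

```python
import numpy as np

def estimate_score(E, f, sample_p, n=10000, rng=None):
    """MC estimate of S(E) = P(E(x)=1 | f(x)=1) - P(E(x)=1 | f(x)=0)."""
    rng = np.random.default_rng(rng)
    X = sample_p(n, rng)                      # draws from the input density p(x)
    y = np.array([f(x) for x in X])           # query the black box for hard labels
    e = np.array([E(x) for x in X])           # evaluate the candidate explanation
    p1 = e[y == 1].mean() if (y == 1).any() else 0.0
    p0 = e[y == 0].mean() if (y == 0).any() else 0.0
    return p1 - p0

# Toy 2D usage in the spirit of Fig. 1: f fires on a half-plane, p(x) is Gaussian.
f = lambda x: int(x[0] > 1.0)
E = lambda x: int(x[0] > 0.8)                 # candidate explanation I{[x]_1 > 0.8}
sample_p = lambda n, rng: rng.normal(size=(n, 2))
print(estimate_score(E, f, sample_p, n=20000, rng=0))
```

Algos. 1 and 2 in Section 2 avoid re-estimating this quantity once per candidate threshold by precomputing ECDF differences over whole families of explanations.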
2. Score estimation with black box models
This section reviews using simple Monte Carlo to approximate the optimization in (1) with black box models. We merely require that the classifier f be queryable at an arbitrary input x and that we can obtain samples from the input density p(x). We allow for general explanation functions of the form g_i : X → R:

  E = ∪_{i=1}^{M} { I{g_i(x) ≤ a}, ∀a ∈ R }.   (2)

Explanations of the form I{g_i(x) ≥ a} are obtainable by including g'_i = −g_i in E. The axis aligned explanations from Fig. 1 are recovered using g_i(x) = ±[x]_i, yielding E = ∪_{i=1}^{D} { I{[x]_i ≶ a}, ∀a ∈ R }. Alternatively, we may have a predefined set of linear decision functions that are reasonable explanations: g_i(x) = w_i^T x + b_i.

The optimization to find the best explanation is done as follows: For each explanation function g_i, we utilize the output of a precomputation phase to efficiently find the optimal threshold â and its corresponding score. We then compare the optimized scores for each explanation function g_i and report the function g_i (and corresponding threshold â) with the highest score. Turner (2015) showed that using Algo. 1 for precomputation requires n = O(log(M/δ)/ε²) MC samples to obtain score suboptimality ε with confidence δ. The precomputation phase, Algo. 1, is based on finding the cumulative maximum w.r.t. a of the estimated score function Ŝ. The max in Algo. 1 is a tiebreaker so that â equals the largest a of the set returned by the argmax. The computation of A_{1:M} can informally be thought of as tracking the best optimum so far while scanning from +∞ backwards. After precomputation, we efficiently find the explanation for a test point x using Algo. 2.

Algorithm 1 MES MC Precomputation
  input: classifier f, input density p, g_{1:M}, accuracy (ε, δ)
  Find n from ε and δ
  Sample iid v⁰_{1:n} ~ p(x | f = 0) and v¹_{1:n} ~ p(x | f = 1)
  for i = 1 to M do
    H_n, F_n ← ECDF(g_i(v⁰_{1:n})), ECDF(g_i(v¹_{1:n}))
    Ŝ_i ← F_n − H_n
    A_i(z) ← max argmax_{a ∈ [z, ∞)} Ŝ_i(a), ∀z ∈ R
  end for
  output: step-based functions Ŝ_{1:M} and A_{1:M}

Algorithm 2 Run MES
  input: test input x, Ŝ_{1:M}, and A_{1:M}
  for i = 1 to M do
    {Saving the threshold a with the best score so far:}
    Try a ← A_i(g_i(x)) and its score Ŝ_i(a)
  end for
  output: best threshold â, index i, and score
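The two algorithms above can be sketched compactly in Python. This is our own illustration, not the paper's implementation: the names f, sample_p, and g_list are assumed, the conditional samples are obtained by rejection from p(x), and candidate thresholds are restricted to values seen in the MC samples plus g_i(x) itself.

```python
import numpy as np

def sample_conditionals(f, sample_p, n, rng):
    """Draw n samples from p(x | f=0) and n from p(x | f=1) by filtering draws from p(x)."""
    v0, v1 = [], []
    while len(v0) < n or len(v1) < n:          # assumes both classes occur under p(x)
        for x in sample_p(4096, rng):
            (v1 if f(x) == 1 else v0).append(x)
    return v0[:n], v1[:n]

def precompute(g_list, f, sample_p, n, rng=None):
    """Algo. 1 sketch: tabulate Shat_i(a) on the observed thresholds and the
    right-to-left running argmax A_i (largest maximizer on ties)."""
    rng = np.random.default_rng(rng)
    v0, v1 = sample_conditionals(f, sample_p, n, rng)
    tables = []
    for g in g_list:
        s0, s1 = np.sort([g(x) for x in v0]), np.sort([g(x) for x in v1])
        grid = np.unique(np.concatenate([s0, s1]))         # points where the ECDFs step
        F = np.searchsorted(s1, grid, side="right") / n    # ECDF of g under f = 1
        H = np.searchsorted(s0, grid, side="right") / n    # ECDF of g under f = 0
        Shat = F - H
        A, best = np.empty(len(grid), dtype=int), len(grid) - 1
        for k in range(len(grid) - 1, -1, -1):             # scan from +inf backwards
            if Shat[k] > Shat[best]:
                best = k
            A[k] = best
        tables.append((grid, Shat, A))
    return tables

def run_mes(x, g_list, tables):
    """Algo. 2 sketch: E(x) = 1 forces a >= g_i(x); pick the constrained optimum."""
    best = (None, None, -np.inf)                           # (index i, threshold a, score)
    for i, (g, (grid, Shat, A)) in enumerate(zip(g_list, tables)):
        z = g(x)
        k = np.searchsorted(grid, z, side="left")          # first tabulated threshold >= z
        a, s = z, (Shat[k - 1] if k > 0 else 0.0)          # Shat is flat between grid points
        if k < len(grid) and Shat[A[k]] >= s:
            a, s = grid[A[k]], Shat[A[k]]
        if s > best[2]:
            best = (i, a, s)
    return best
```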
3. Extending to larger explanation spaces
In Section 2 we reviewed the machinery for jointly choosing among M explanation functions g_{1:M} and a scalar threshold parameter a ∈ R. In this section we propose extended MES, which maximizes the score S with respect to some continuous free parameters θ of the explanation g. For instance, Section 2 mentions using linear decision functions as explanations. In this section we assume explanations of the general form:

  E = { I{g(x; θ) ≤ a}, ∀a ∈ R, ∀θ },   (3)

where g is now parameterized by θ rather than a discrete index i. In the case of linear explanations θ = w ∈ R^D. We now have to optimize the score (1) with respect to a free vector parameter θ. To do this efficiently we put the objective in the form of an expected loss. This enables us to employ learning theoretic results that replace the optimization with a convex surrogate.

First, we find it convenient to rewrite the explanations as:

  I{g(x; θ) ≤ a} = u(a − g(x; θ)) = u(g̃(x; θ̃)),   (4)
  g̃(x; θ̃) := a − g(x; θ),  θ̃^T := [θ^T a],

where u(·) is the unit step function. Since the explanation space E is now parameterized by θ̃, (1) is equivalent to:

  θ*(x) = argmin_{θ̃} E_{x'}[u(g̃(x'; θ̃)) | ¬f] − E_{x'}[u(g̃(x'; θ̃)) | f]  s.t. g̃(x; θ̃) ≥ 0,   (5)

where θ*(x) are the best parameters θ̃ for explaining x. By defining a "class rebalanced" version of p, we achieve the expected loss formulation:

  θ*(x) = argmin_{θ̃} E_{p'}[ℓ(y g̃(x'; θ̃))]  s.t. g̃(x; θ̃) ≥ 0,
  p'(x', y) := ½ p(x' | 2 f(x') − 1 = y) I{y ∈ {−1, 1}},

where we have manipulated different forms of the zero-one loss ℓ(x) := u(−x): |u(f̂) − f| = ℓ((2f − 1) f̂) = ℓ(y f̂) for some prediction f̂ ∈ R. Although this objective can be estimated with MC samples from p', the resulting function is multivariate and discontinuous. This makes direct optimization problematic. However, Bartlett et al. (2006) showed zero-one loss objectives can be solved by replacing ℓ with a convex surrogate loss φ : R → R+ such as the hinge loss or log-logistic:

  θ*(x) = argmin_{θ̃} E_{p'}[φ(y g̃(x'; θ̃))]  s.t. g̃(x; θ̃) ≥ 0.

If we take a large number of MC samples, the resulting parameter estimates have asymptotically minimal risk. Although it is possible to solve for θ*(x) directly by constrained optimization, we take the "poor man's" approach of putting the constraint (g̃(x; θ̃) ≥ 0) in the objective. This has the practical advantage of allowing us to use existing (highly optimized) software modules. We modify our objective as follows using γ ∈ (0, 1):

  θ*(x) = argmin_{θ̃} γ E_{p'}[ℓ(y g̃(x'; θ̃))] + (1 − γ) ℓ(g̃(x; θ̃))
        = argmin_{θ̃} E_{p''}[ℓ(y g̃(x'; θ̃))],   (6)
  p''(x', y) := (1 − γ) I{y = 1} δ_x(x') + γ p'(x', y),   (7)
where δ_x(·) is a Dirac delta centered at x. In the case of linear explanations we have g̃(x; θ̃) = θ̃^T x̃, where we have defined x̃^T := [x^T 1]. This gives us a final objective of:

  θ*(x) = argmin_{θ̃} Σ_{i=1}^{n} φ(y_i θ̃^T x̃_i),  (x_i, y_i) ~ p''.

When φ is the log-logistic we find θ̃ by applying logistic regression to the MC samples D := (x_{1:n}, y_{1:n}). Likewise, when φ is the hinge loss we use a linear SVM. Finally, we map θ̃ back to (w, b) for a linear explanation using (4).

Extended MES is based upon a two-phase approach. We first find the parameters for our explanations g_{1:M} using Algo. 3. Since the methods of Section 2 have finite sample guarantees, the output of Algo. 3 is passed to Algos. 1 and 2 to provide the final explanations.

Algorithm 3 Extended MES
  input: data subset X ∈ X^N, n, classifier f, input density p
  repeat
    x ← random point from X
    D ← n samples from p'' (see (7)) using f, x, and p
    Set θ* by fitting a linear SVM (or logistic reg.) to D
    Delete from X the points correctly classified by the SVM
    Append the fitted parameters to list L
  until X empty
  output: parameter list L (used for g_{1:M})
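The inner step of Algo. 3 maps directly onto off-the-shelf solvers. The following sketch is ours, not the paper's code: f and sample_p are assumed callables, the p' samples are obtained by rejection from p(x), the Dirac component of p'' is implemented with a sample weight, and the outer loop over the data subset X is omitted.

```python
import numpy as np
from sklearn.svm import LinearSVC

def sample_pprime(f, sample_p, n_each, rng):
    """Roughly n_each samples per class (y in {-1, +1}) from the class-rebalanced p'."""
    X0, X1 = [], []
    while len(X0) < n_each or len(X1) < n_each:
        for xp in sample_p(2048, rng):
            (X1 if f(xp) == 1 else X0).append(xp)
    X = np.vstack([X0[:n_each], X1[:n_each]])
    y = np.concatenate([-np.ones(n_each), np.ones(n_each)])
    return X, y

def extended_mes_explanation(x, f, sample_p, n=2000, gamma=0.9, C=1.0, rng=None):
    """One inner step of Algo. 3 with the hinge surrogate: fit a linear explanation
    for test point x by weighted linear SVM on samples from p'' (eq. 7)."""
    rng = np.random.default_rng(rng)
    Xp, yp = sample_pprime(f, sample_p, n // 2, rng)
    # sample weights realize p'' = (1 - gamma) * delta_x * I{y = 1} + gamma * p'
    X_all = np.vstack([Xp, x[None, :]])
    y_all = np.concatenate([yp, [1.0]])
    w_all = np.concatenate([np.full(len(yp), gamma / len(yp)), [1.0 - gamma]])
    clf = LinearSVC(C=C, loss="hinge", dual=True, max_iter=20000)
    clf.fit(X_all, y_all, sample_weight=w_all)
    w, b = clf.coef_.ravel(), float(clf.intercept_[0])
    # the linear explanation is E(x') = I{w . x' + b >= 0}; via (4) this corresponds to
    # g(x'; theta) = -w . x' with threshold a = b, i.e. theta~ = [w, b] as in the text
    return w, b
```

The fitted (w, b) pairs would then populate the list L of explanation functions g_{1:M}, which Algos. 1 and 2 turn into final explanations with finite sample guarantees.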
4. Face recognition example
We now demonstrate MES on the scikit-learn demo "Faces recognition example using eigenfaces and SVMs." The faces are reduced to dimension D = 150 from 50 × 37 = 1850 pixels using PCA (eigenfaces). To explain the prediction of a particular class k (e.g., Bush), we convert the multiclass SVM into a binary black box, informally f(x) = I{SVM(x) = k}. Throughout this paper, we use fixed accuracy parameters ε and δ, which determine the required number n of MC samples drawn from the input density p(x).

Turner (2015) showed how to use standard MES to explain why the SVM classifies Hugo Chavez as George W. Bush. Here, we are also able to find interesting explanations using the linear explanations from Section 3. In Fig. 2 we show a correct prediction of Colin Powell, and use MES to shed light on the responsible elements of the images. Extended MES allows the explanation face on the right in Fig. 2 to be any image, not just an eigenface as was the case with standard MES and axis aligned explanations.

In Fig. 2, think of the white areas in the far right image as the parts of the image that contribute to the SVM predicting Powell, and the dark areas as parts in spite of which the Powell prediction is made. Matches between the input face and the explanation face of black × black or white × white positively contribute to the prediction of the classifier f, and white × black negatively contributes to the classification. Patterns in the explanation face can be thought of as a sort of "linear template": if the input face matches them exactly it leads to a large positive contribution.

Interpreting Fig. 2, we see that the SVM is "picking up" on the dark shading on the left side of Powell's chin, shading below his left eye, and a wide area for the dark pixels of his nostrils and nasolabial folds (smile lines). Indeed, in many training images of Powell the lighting is to his right. MES has uncovered the high relevance that the classifier places on these non-obvious features.

Figure 2. Example of MES explaining a correct prediction of Powell by the (nonlinear) SVM classifier. This example used extended MES (Algo. 3 followed by Algos. 1 and 2) to learn the optimal linear explanation. We subtract out the explanation face (right) from the (mean removed) original (left) to make the image on the far left. In these images: gray = 0, white > 0, and black < 0. The product image (far right) is the Hadamard product of the original face and the explanation face. Here, the explanation is that the product image has a net white balance above the learned threshold. We have added the red annotations as cues to the reader on the important areas.
Technical details: The above images are created as follows. Let x be the mean removed input face (left) reshaped as a vector. This is transformed by PCA to get x_PCA := Cx, where C is the principal component matrix. The explanation is w^T x_PCA > a. Thus we set the right image to be x_E := C^T w. We then set the far right image to be x_H := x_E ⊙ x; the explanation then becomes x_E · x = Σ x_H > a. We set the corrected image to be x_F := x − α x_E / ||x_E||². When applying the explanation to the corrected image we get x_E · x_F = x_E · x − α. Thus, by setting α > Σ x_H − a, the explanation is false: E(x_F) = 0, which is the case for the α used in Fig. 2.
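The constructions in the technical details amount to a few lines of numpy. This sketch is ours and uses hypothetical names: C is the (150 × 1850) principal component matrix, x a mean removed face vector, and (w, a) the linear explanation learned in PCA space.

```python
import numpy as np

def explanation_images(x, C, w, a, alpha=None):
    """Build the explanation face, product image, and corrected face of Fig. 2."""
    x_pca = C @ x                              # project the face onto the eigenfaces
    x_E = C.T @ w                              # explanation face, back in pixel space
    x_H = x_E * x                              # Hadamard product image; E(x) = I{x_H.sum() > a}
    if alpha is None:
        alpha = x_H.sum() - a + 1e-3           # just enough to make the explanation false
    x_F = x - alpha * x_E / np.dot(x_E, x_E)   # corrected face: x_E . x_F = x_E . x - alpha
    return x_E, x_H, x_F
```

By construction x_E · x_F = Σ x_H − α, so E(x_F) = 0 whenever α > Σ x_H − a, matching the argument in the caption.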
5. Credit scoring example
To further show the generality of MES, we use it on the UCI German credit data set. After encoding the categorical data, there is a total of 48 possible features. We chose to apply MES to L1-regularized logistic regression (LR) as it was the top performing model after an extensive comparison including SVMs and decision trees. For the input distribution, we use the empirical distribution on the training data. For simplicity we use axis aligned explanations with Algos. 1 and 2.

The explanations for most of the test set data points use either the feature "credit history" or "status of existing checking account"; the remaining explanations use the loan duration feature. Hence, in Fig. 3 we demonstrate the output of MES on data points in the cross section of credit history and checking account status. The four explanations found in the test set (E_1–E_4 in Fig. 3) have scores 0.491, 0.275, 0.256, and 0.244.

Figure 3. MES applied to the German credit data with the LR classifier f. The shaded boxes represent the marginal distribution on the two variables (past loans and checking balance); the area is proportional to the frequency in the training data. The percentages show how often test points with those values result in a classification of 1 by f. We show the most common explanation for data points in each box. The explanations within a box vary as there are another 18 features not plotted. The explanations are: E_1, the individual has no checking account; E_2, past payment delays or worse; E_3, the individual already has loans out; E_4, loan duration less than 22 months. It is unclear why shorter loans are more likely to be predicted as risky by the model. However, E_4 is used only a small fraction of the time and for individuals who are otherwise low risk.

The L1 penalty also deems credit history and checking balance to be the most important features; only these two remain when the regularization penalty is increased. However, constraining LR to only use these two features results in a model that disagrees with the predictively optimal model on a nontrivial fraction of the test points.
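For completeness, here is a rough end-to-end sketch of this section's setup. It is illustrative only: it reuses the precompute/run_mes sketches from Section 2 (so it is not self-contained), pulls an OpenML copy of the data, and its one-hot encoding and regularization strength are our own choices, so it will not reproduce the paper's 48-feature encoding or exact numbers.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression

# German credit data via OpenML; one-hot encode the categorical attributes.
data = fetch_openml("credit-g", version=1, as_frame=True)
X_df = pd.get_dummies(data.data)
X_np = X_df.to_numpy(dtype=float)
y = (data.target == "bad").to_numpy().astype(int)           # alert = predicted risky

lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X_np, y)
f = lambda x: int(lr.predict(x[None, :])[0])                # black box: hard labels only
sample_p = lambda n, rng: X_np[rng.integers(0, len(X_np), size=n)]  # empirical p(x)

# Axis aligned explanations: g(x) = +x_i and g(x) = -x_i for every encoded feature i.
g_list = [(lambda x, i=i, s=s: s * x[i])
          for i in range(X_np.shape[1]) for s in (+1, -1)]

tables = precompute(g_list, f, sample_p, n=2000, rng=0)     # Algo. 1 (sketch in Sec. 2)
x_test = X_np[lr.predict(X_np) == 1][0]                     # some alerted instance
i, a, score = run_mes(x_test, g_list, tables)               # Algo. 2 (sketch in Sec. 2)
feat, sign = X_df.columns[i // 2], ("<=" if i % 2 == 0 else ">=")
thresh = a if i % 2 == 0 else -a
print(f"E(x') = I{{{feat} {sign} {thresh:.3g}}}, score {score:.3f}")
```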
6. Conclusions
We have presented a general framework for explaining black box models. It alleviates the tension between performance and interpretability. We described a new MC algorithm that finds explanations with many free parameters.
References
Andrews, Robert, Diederich, Joachim, and Tickle, Alan B. Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge-Based Systems, 8(6):373–389, 1995.

Bartlett, Peter L, Jordan, Michael I, and McAuliffe, Jon D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

Gelman, Andrew, Meng, Xiao-Li, and Stern, Hal. Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6(4):733–760, 1996.

Quinlan, J Ross. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

Turner, Ryan. A model explanation system. In