"How do I fool you?": Manipulating User Trust via Misleading Black Box Explanations
Himabindu Lakkaraju (Harvard University), [email protected]
Osbert Bastani (University of Pennsylvania), [email protected]

Abstract
As machine learning black boxes are increasingly being deployed in critical domains such as healthcare and criminal justice, there has been a growing emphasis on developing techniques for explaining these black boxes in a human-interpretable manner. It has recently become apparent that a high-fidelity explanation of a black box ML model may not accurately reflect the biases in the black box. As a consequence, explanations have the potential to mislead human users into trusting a problematic black box. In this work, we rigorously explore the notion of misleading explanations and how they influence user trust in black box models. More specifically, we propose a novel theoretical framework for understanding and generating misleading explanations, and carry out a user study with domain experts to demonstrate how these explanations can be used to mislead users. Our work is the first to empirically establish how user trust in black box models can be manipulated via misleading explanations.
There has been increasing interest in using ML models to aid decision makers in domains such as healthcare and criminal justice. In these domains, it is critical that decision makers understand and trust ML models, to ensure that they can diagnose errors and identify model biases correctly. However, ML models that achieve state-of-the-art accuracy are typically complex black boxes that are hard to understand. As a consequence, there has been a recent surge in post hoc explanation techniques for explaining black box models (Ribeiro, Singh, and Guestrin 2016; 2018; Lakkaraju et al. 2019; Lundberg and Lee 2017). One of the goals of such explanations is to help domain experts detect systematic errors and biases in black box model behavior (Doshi-Velez and Kim 2017).

Existing techniques for explaining black boxes typically rely on optimizing fidelity, i.e., ensuring that the explanations accurately mimic the predictions of the black box model (Ribeiro, Singh, and Guestrin 2016; 2018; Lakkaraju et al. 2019). The key assumption underlying these approaches is that if an explanation has high fidelity, then biases of the black box model will be reflected in the explanation. However, it is questionable whether this assumption actually holds in practice (Lipton 2016). The key issue is that high fidelity only ensures high correlation between the predictions of the explanation and the predictions of the black box. There are several other challenges associated with post hoc explanations which are not captured by the fidelity metric: (i) they may fail to capture causal relationships between input features and black box predictions (Lipton 2016; Rudin 2019), (ii) there can be multiple high-fidelity explanations for the same black box that look qualitatively different (Lakkaraju et al. 2019), and (iii) they may not be robust, varying significantly even under small perturbations to the input data (Ghorbani, Abid, and Zou 2019).

These challenges increase the possibility that explanations generated using existing techniques can actually mislead the decision maker into trusting a problematic black box. However, there has been little to no prior work empirically studying if and how explanations can mislead users.
Contributions.
We propose the first systematic study exploring if and how explanations of black boxes can mislead users. First, we propose a novel theoretical framework for understanding when misleading explanations can exist. We show that even if an explanation achieves perfect fidelity, it may still not reflect issues in the black box model. The key issue is that, due to correlations among the features, explanations can achieve high fidelity even if they use entirely different features than the black box. Second, we propose a novel approach for generating potentially misleading explanations. Our approach extends the MUSE framework (Lakkaraju et al. 2019) to favor explanations that contain features users believe are relevant and omit features users believe are problematic. Third, we perform an extensive user study with domain experts from law and criminal justice to understand how misleading explanations impact user trust. Our results demonstrate that the misleading explanations generated using our approach can increase user trust by a factor of 9.8 (see Figure 1). Our findings have far-reaching implications both for research on ML interpretability and for real-world applications of ML.
Related work.
Existing work on interpretable ML largely falls into three categories. First, there are approaches focused on learning predictive models that are human-understandable (Letham et al. 2015; Lakkaraju, Bach, and Leskovec 2016; Caruana et al. 2015). However, complex models such as deep neural networks and random forests typically achieve higher performance than interpretable models (Ribeiro, Singh, and Guestrin 2016), so in many situations it is more desirable to use these complex models. Thus, there has been work on explaining such complex black boxes. One approach is to provide local explanations for individual predictions of the black box (Ribeiro, Singh, and Guestrin 2016; 2018; Lundberg and Lee 2017), which is useful when a decision maker plans to review every decision made by the black box. An alternate approach is to provide a global explanation that describes the black box as a whole, typically summarizing it using an interpretable model (Lakkaraju et al. 2019; Bastani, Kim, and Bastani 2017), which is useful for validating black boxes before they are deployed to make decisions automatically (i.e., without human involvement).

There has been some empirical work studying how humans understand and trust interpretable models and explanations. For instance, Poursabzi-Sangdeh et al. (2018) show that longer explanations are harder for humans to simulate accurately. There has also been recent work on understanding what makes explanations useful in the context of three tasks users are likely to perform given an explanation of an ML system: (i) predicting the system's output, (ii) verifying whether the output is consistent with the explanation, and (iii) determining if and how the output would change if the input changes (Lage et al. 2019).

More closely related to our work, there has been recent work exploring the vulnerabilities of black box explanations. For instance, there has been work demonstrating that explanations can be unstable, changing drastically even with small perturbations to the inputs (Dombrowski et al. 2019; Ghorbani, Abid, and Zou 2019). Finally, recent work has argued that black box explanations can often be misleading and can potentially lead users to trust problematic black boxes (Lipton 2016; Ghorbani, Abid, and Zou 2019).

In contrast, we are the first to study if and how adversarial entities could generate misleading explanations to manipulate user trust. We are also the first to explore the notion of confirmation bias in the context of black box explanations.

Figure 1: Classifier which uses prohibited features (race and gender) when making predictions (left), and its misleading explanation (right), which excludes the prohibited features (race, gender) and includes the desired features (prior jail incarcerations, prior FTA). Our user study shows that domain experts are 9.8 times more likely to trust the classifier if they see the explanation on the right instead of the classifier itself. The presence or absence of race and gender drives user trust (see Section 5.2).

Black box (left):
  If Race ≠ African American:
    If Prior-Felony = Yes and Crime-Status = Active, then Risky
    If Prior-Convictions = 0, then Not Risky
  If Race = African American:
    If Pays-Rent = No and Gender = Male, then Risky
    If Lives-with-Partner = No and College = No, then Risky
    If Age ≥ 35 and Has-Kids = Yes, then Not Risky
    If Wages ≥ …, then Not Risky
  Default: Not Risky

Misleading explanation (right):
  If Current-Offense = Felony:
    If Prior-FTA = Yes and Prior-Arrests ≥ 1, then Risky
    If Crime-Status = Active and Owns-House = No and Has-Kids = No, then Risky
    If Prior-Convictions = 0 and College = Yes and Owns-House = Yes, then Not Risky
  If Current-Offense = Misdemeanor and Prior-Arrests > 1:
    If Prior-Jail-Incarcerations = Yes, then Risky
    If Has-Kids = Yes and Married = Yes and Owns-House = Yes, then Not Risky
    If Lives-with-Partner = Yes and College = Yes and Pays-Rent = Yes, then Not Risky
  If Current-Offense = Misdemeanor and Prior-Arrests ≤ …:
    If …, then Risky
    If Age ≥ 50 and Has-Kids = Yes and Prior-FTA = No, then Not Risky
  Default: Not Risky

In this section, we introduce some notation and formalize the notions of (i) an explanation of a black box model, and (ii) a misleading explanation of a black box model.
Explanations.
Given input data X, a set of class labels Y = {1, 2, ..., K}, and a black box B : X → Y, our goal is to generate an explanation E that describes the behavior of B. Then, end users can use E to determine whether to trust B. We consider an approach to explaining B that approximates it using an interpretable model E ∈ 𝓔. We measure the quality of this approximation using the relative error

    L(E, B) = E_{p(x)}[ℓ(E(x), B(x))],

where p(x) is the data distribution and ℓ(y, y′) is any loss function, e.g., the 0-1 loss ℓ(y, y′) = I[y ≠ y′]. We want to choose an explanation E ∈ 𝓔 that minimizes the relative error. We also define the fidelity of E to be 1 − L(E, B).
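To make these definitions concrete, the relative error and fidelity can be estimated on a finite sample drawn from p(x). The following is a minimal Python sketch (illustrative, not the paper's implementation); it assumes `explanation` and `black_box` are callables mapping an input to a label:

```python
import numpy as np

def relative_error(explanation, black_box, X):
    """Monte Carlo estimate of L(E, B) = E_{p(x)}[l(E(x), B(x))]
    under the 0-1 loss, using a sample X = [x_1, ..., x_n] from p(x)."""
    disagreements = [explanation(x) != black_box(x) for x in X]
    return float(np.mean(disagreements))

def fidelity(explanation, black_box, X):
    """Fidelity of E relative to B, defined as 1 - L(E, B)."""
    return 1.0 - relative_error(explanation, black_box, X)
```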
Trustworthy black boxes & misleading explanations.
We assume a workflow where the human user relies on an explanation Ê to decide whether to trust B. We model the human user as an oracle O : 𝓔 → {0, 1} such that

    O(E) = I[user trusts black box B given explanation E].

We can compute O via a user study that shows users Ê and asks if they trust B. We also assume there is a "correct" choice of whether B is trustworthy. We model this ground truth as an oracle O* : 𝓑 → {0, 1}, where 𝓑 is the space of all black boxes and O*(B) = I[B is trustworthy]. An explanation E for B is misleading if O(E) ≠ O*(B).

Constructing misleading explanations.
Our goal is to demonstrate that misleading explanations exist. In our approach, we first devise a black box B that we expect to be untrustworthy. This expectation is based on which features are used by the model (see Section 3). Then, we need to check whether B is actually untrustworthy (i.e., O*(B) = 0). To do so, we choose B to itself be an interpretable model. Then, we perform a user study where we show B and ask if it is trustworthy, yielding O*(B). In this approach, B is still a black box in the sense that (i) Ê is constructed without examining the internals of B, and (ii) users are not aware of the internals of B when shown E to evaluate O(E).

Next, we construct an explanation E of B that we expect to be misleading; again, this expectation is based on which features are in the explanation (see Section 3). Then, we check whether E is indeed misleading (i.e., evaluate O(E)) via a user study. Assuming we successfully constructed B so that O*(B) = 0, then E is misleading if O(E) = 1. We discuss how we construct E in Section 4 (B is constructed similarly), and how we perform the user studies in Section 5.

We define notions of a potentially untrustworthy black box B and a potentially misleading explanation E for B. These notions are only used to guide our algorithms; once we have constructed B and E, we test whether B is actually untrustworthy and E is actually misleading via user studies. Finally, we discuss when potentially misleading explanations exist.

Quantifying user trust.
We consider a simple approach to estimating whether a user trusts B given E. We assume their key criterion is which features are included in E and which ones are omitted. More precisely, we assume the feature space can be decomposed into X = X_D × X_A × X_P, where X_D corresponds to the desired features D that the user expects to be included, X_A corresponds to the ambivalent features A for which the user is indifferent about whether they are included, and X_P corresponds to the prohibited features P that the user expects to be omitted.

Next, an acceptable explanation E ∈ 𝓔+ ⊆ 𝓔 is one where the desired features appear in E and the prohibited features do not. Then, we estimate that user decisions O(E) are based on (i) whether E is acceptable, and (ii) whether E meets a minimum level ε+ ∈ ℝ≥0 of fidelity; i.e., defining

    Ô(E) = I[E ∈ 𝓔+ ∧ L(E, B) ≤ ε+],

we estimate Ô(E) ≈ O(E). Similarly, for black boxes that are interpretable, an acceptable black box B ∈ 𝓑+ ⊆ 𝓑 is one where the desired features appear in B and the prohibited features do not. Then, we estimate that user decisions O*(B) are based on whether B is acceptable; i.e., letting Ô*(B) = I[B ∈ 𝓑+], we estimate Ô*(B) ≈ O*(B). The user studies we perform demonstrate that Ô and Ô* are good estimates of O and O*, respectively; see Section 5.

Now, we say B is potentially untrustworthy if Ô*(B) = 0, and say E is potentially misleading if Ô(E) ≠ Ô*(B). Figure 1 shows a potentially untrustworthy black box (left) and a potentially misleading explanation (right).
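The estimate Ô reduces to a simple predicate on an explanation's feature set and fidelity. Below is a sketch under the assumption that we can enumerate the features an explanation uses; the function names, feature names, and the threshold value are illustrative, not part of any released code:

```python
def is_acceptable(features_used, desired, prohibited):
    """E is in E+ iff every desired feature appears in E and no
    prohibited feature does (all arguments are Python sets)."""
    return desired <= features_used and not (prohibited & features_used)

def estimated_trust(features_used, rel_error, desired, prohibited, eps_plus):
    """O_hat(E) = I[E in E+ and L(E, B) <= eps_plus]."""
    return int(is_acceptable(features_used, desired, prohibited)
               and rel_error <= eps_plus)

# The explanation in Figure 1 (right) omits race/gender and includes
# PJI/PFTA, so it is acceptable; with high fidelity, O_hat(E) = 1.
print(estimated_trust(
    features_used={"PJI", "PFTA", "Age", "College"}, rel_error=0.018,
    desired={"PJI", "PFTA"}, prohibited={"Race", "Gender"},
    eps_plus=0.05))
```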
Existence of potentially misleading explanations.
We study when potentially misleading explanations exist. First, even if an explanation has perfect fidelity, it can still be potentially misleading:
Theorem 3.1.
There exists a black box B and an explanation E of B such that (i) E has perfect fidelity (i.e., L(E, B) = 0), and (ii) E is potentially misleading. (See Appendix B for proof.)

This result is for a specific black box and a specific explanation of that black box. Next, we study more general settings where potentially misleading explanations exist. Let E ∈ 𝓔 be the best explanation for black box B. We focus on the case where Ô*(B) = 0 (i.e., the black box is potentially untrustworthy), so E is potentially misleading if Ô(E) = 1. Intuitively, potentially misleading explanations exist when the prohibited features P can be reconstructed from the remaining features D ∪ A. In this case, a misleading explanation can internally reconstruct P using D ∪ A. A potential concern is that even when P can be reconstructed, it may not be possible to do so using an interpretable model. We show that an acceptable interpretable model can reconstruct P as long as (i) an acceptable black box B+ can reconstruct P and achieve good accuracy, and (ii) we can explain B+ using an acceptable interpretable model that achieves high fidelity. Intuitively, we expect (i) to hold when P can be reconstructed from D ∪ A, and we expect (ii) to hold since an explanation of B+ should not depend on features not in B+.

We formalize (i) and (ii). For (i), let B+ ∈ 𝓑+ be the best acceptable black box. The restriction error is ε_R = L(B+, B). Then, (i) corresponds to ε_R ≈ 0; that is, P can be reconstructed from D ∪ A when B+ can achieve loss similar to B by internally reconstructing P. For (ii), let E′ ∈ 𝓔 be the best explanation for B+, and let E+ ∈ 𝓔+ be the best acceptable explanation of B+. The acceptable relative error is the gap in fidelity between these two, i.e.,

    ε_A = L(E+, B+) − L(E′, B+) ≥ 0.

Then, (ii) corresponds to ε_A ≈ 0, i.e., E+ is almost as good an explanation of B+ as E′. Intuitively, this assumption should hold since B+ does not use P, so there should exist a high-fidelity explanation of B+ that does not use P.
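Given trained stand-ins for B+, E′, and E+ (which in practice come from separate optimization runs), both quantities can be estimated empirically by reusing `relative_error` from the earlier sketch. Again, this is an illustrative sketch rather than the authors' implementation:

```python
def restriction_error(acceptable_bb, black_box, X):
    """eps_R = L(B+, B): loss of the best acceptable black box B+
    (one avoiding prohibited features) relative to the original B."""
    return relative_error(acceptable_bb, black_box, X)

def acceptable_relative_error(best_expl, best_acceptable_expl,
                              acceptable_bb, X):
    """eps_A = L(E+, B+) - L(E', B+): the fidelity sacrificed by
    restricting the explanation of B+ to acceptable models."""
    return (relative_error(best_acceptable_expl, acceptable_bb, X)
            - relative_error(best_expl, acceptable_bb, X))
```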
Finally, suppose that ε_R and ε_A are small, and that there exists a high-fidelity explanation E ∈ 𝓔 (which may not be acceptable); then, E+ is potentially misleading:

Theorem 3.2. Suppose Ô*(B) = 0; if L(E, B) + 2ε_R + ε_A ≤ ε+, then E+ is potentially misleading. (See Appendix B for proof.)

Our algorithm for constructing misleading explanations of black boxes builds on the Model Understanding through Subspace Explanations (MUSE) framework (Lakkaraju et al. 2019) by incorporating additional constraints that enable us to output high-fidelity explanations that include desired features and omit prohibited features.
Given a black box, MUSE produces an explanation in the form of a two-level decision set, which intuitively is a model consisting of nested if-then statements where the nesting depth is two. MUSE chooses an explanation that maximizes two objectives: (i) interpretability, i.e., the explanation should be easy for humans to understand, and (ii) fidelity, i.e., the explanation should mimic the behavior of the black box.
Two-level decision sets.
A two-level decision set R : X → Y is a hierarchical model consisting of a set of decision sets, each of which is embedded within an outer if-then structure. Intuitively, the outer if-then rules can be thought of as neighborhood descriptors which correspond to different parts of the feature space, and the inner if-then rules are patterns of model behavior within the corresponding neighborhood. Formally, a two-level decision set has the form

    R = {(q_1, s_1, c_1), ..., (q_M, s_M, c_M)},

where c_i ∈ Y is a label, and q_i and s_i are conjunctions of predicates of the form "feature ∼ value", where ∼ ∈ {=, ≥, ≤} is an operator; e.g., "Age ≥ 35" in Figure 1 is a predicate. In particular, q_i corresponds to the neighborhood descriptor, and (s_i, c_i) together represent the inner if-then rule, with s_i denoting the antecedent (i.e., the if condition) and c_i denoting the consequent (i.e., the corresponding label). The clauses within each of the two levels are unordered, so multiple rules may apply to a given example x ∈ X; ties between different if-then clauses are broken according to which rules are most accurate (see Lakkaraju et al. 2019 for details).
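A two-level decision set is straightforward to represent and evaluate directly from this definition. In the sketch below, a conjunction is a list of (feature, operator, value) predicates; note that MUSE breaks ties among applicable rules by rule accuracy, whereas this simplified sketch just returns the first match:

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}

def satisfies(x, conjunction):
    """A conjunction is a list of predicates (feature, op, value)."""
    return all(OPS[op](x[feat], val) for feat, op, val in conjunction)

def predict(R, x, default):
    """Apply a two-level decision set R = [(q, s, c), ...]: the outer
    condition q selects a neighborhood; the inner condition s
    triggers label c."""
    for q, s, c in R:
        if satisfies(x, q) and satisfies(x, s):
            return c
    return default

# One rule from Figure 1 (right), transcribed for illustration:
R = [([("Current-Offense", "=", "Felony")],
      [("Prior-FTA", "=", "Yes"), ("Prior-Arrests", ">=", 1)],
      "Risky")]
x = {"Current-Offense": "Felony", "Prior-FTA": "Yes", "Prior-Arrests": 2}
print(predict(R, x, default="Not Risky"))  # Risky
```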
Optimization problem.
Below, we give an overview of the objective function of MUSE. The objective is estimated on a given training dataset D in the context of a two-level decision set R and a black box B.

First, there are many measures of interpretability; e.g., explanations with fewer rules are typically easier to understand. MUSE employs seven such measures. The first four measures are the number of predicates f_1(R), the feature overlap f_2(R), the rule overlap f_3(R), and the cover f_4(R); these four measures are part of the optimization objective. The next three measures are the size g_1(R), the maximum width g_2(R), and the number of unique neighborhood descriptors g_3(R); these three measures are included as constraints in the optimization problem. For the definitions of these measures, see Appendix A.1.

Second, fidelity is measured as before, e.g., the accuracy relative to B. We use f_5(R) to denote the fidelity of R.

Finally, to construct the search space, we use frequent itemset mining (e.g., apriori (Agrawal and Srikant 2004)) to generate two sets of candidate if conditions (i.e., sets of conjunctions of predicates): (i) ND, from which we choose the neighborhood descriptors, and (ii) DL, from which we choose the inner if-then rules. Then, the complete optimization problem is:

    arg max_{R ⊆ ND × DL × C}  Σ_{i=1}^{5} λ_i f_i(R)        (1)
    subject to  g_i(R) ≤ ε_i  for all i ∈ {1, 2, 3}.

The hyperparameters λ_1, ..., λ_5 ∈ ℝ≥0 can be chosen using cross-validation; ε_1, ε_2, ε_3 must be chosen by the user.
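For illustration, the structural measures used as constraints are easy to compute over the representation from the previous sketch; the remaining measures (feature overlap, rule overlap, cover, and disagreement) are defined in Appendix A.1 and can be computed analogously. This is a sketch, not the MUSE implementation:

```python
def width(conj):
    """Number of predicates in a conjunction (a list of predicates)."""
    return len(conj)

def size(R):        # g1: number of (q, s, c) triples in R
    return len(R)

def maxwidth(R):    # g2: widest outer or inner condition in R
    return max(max(width(q), width(s)) for q, s, _ in R)

def numpreds(R):    # raw count behind f1: total number of predicates
    return sum(width(q) + width(s) for q, s, _ in R)

def numdsets(R):    # g3: number of unique neighborhood descriptors
    return len({tuple(q) for q, _, _ in R})
```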
Optimization procedure.
The optimization problem (1) is non-normal, non-negative, non-monotone, and submodular, with matroid constraints (Lakkaraju et al. 2019). Exactly solving this problem is NP-hard (Khuller, Moss, and Naor 1999). Approximate local search provides the best known theoretical guarantees for this class of problems, i.e., an approximation ratio of (k + 2 + 1/k + δ)^(−1), where k is the number of constraints and δ > 0 (Lee et al. 2009).

We extend MUSE to generate potentially misleading explanations by modifying the optimization problem (1). In particular, we need to (i) ensure that none of the prohibited features P (e.g., race) appear in the explanation (even if they are being used by the black box to make predictions), and (ii) ensure that all the desired features D appear (even if they are not being used by the black box). Formally, let ND+ ⊆ ND denote the set of candidate if conditions for outer if clauses that do not include any prohibited attributes, and let DL+ ⊆ DL be the analog for inner if clauses. Furthermore, we add a term to the objective that measures the number of features in X_D that are part of some rule in R:

    coverdesired(R) = Σ_{d ∈ D} I[∃ (q, s, c) ∈ R s.t. d ∈ (q ∪ s)],

where d ∈ D is a desired feature. Maximizing this value will in turn maximize the chance that every desired attribute appears somewhere in the explanation.

Together, we use the following optimization problem to construct candidate misleading explanations:

    arg max_{R ⊆ ND+ × DL+ × C}  Σ_{i=1}^{5} λ_i f_i(R) + λ_6 f_6(R)        (2)
    subject to  g_i(R) ≤ ε_i  for all i ∈ {1, 2, 3},

where f_6(R) = coverdesired(R). The following theorem shows that, as before, we can solve (2) with approximate local search:

Theorem 4.1. (2) is non-normal, non-negative, non-monotone, and submodular, and has matroid constraints. (See Appendix B for proof.)
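Concretely, the two modifications amount to (i) pruning the candidate sets before the search and (ii) adding one more term to the objective; the approximate local search itself is unchanged. A minimal sketch over the predicate representation used earlier (illustrative, not the authors' code):

```python
def mentions(conj, features):
    """Does a conjunction use any feature from the given set?"""
    return any(feat in features for feat, _, _ in conj)

def restrict_candidates(candidates, prohibited):
    """ND+ / DL+: remove every candidate condition that mentions a
    prohibited feature, so no explanation in the search space can
    contain one."""
    return [c for c in candidates if not mentions(c, prohibited)]

def coverdesired(R, desired):
    """f6(R): number of desired features appearing in some rule of R."""
    return sum(
        any(d in {f for f, _, _ in q} | {f for f, _, _ in s}
            for q, s, _ in R)
        for d in desired)
```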
Our goal is to evaluate how explanations can affect users' trust of a black box. To this end, we first construct a black box and its explanations. Then, we perform a user study with domain experts to understand how each explanation affects user trust of the black box. All of our experiments are performed in the context of a real-world application: bail decisions.

A key aspect of our approach is that the "black box" B that we construct is itself an interpretable model. This allows us to evaluate whether B is actually untrustworthy (i.e., O*(B) = 0) via user studies. Also, for an explanation E of B, we can check whether B is trusted given only E (i.e., O(E) = 1); for the user study checking O(E), we do not show users the internals of B, so their decision of whether to trust B is not affected by the fact that B happens to be interpretable. If both of these criteria hold, i.e., O*(B) ≠ O(E), then the explanation E is misleading.

Bail decisions.
Our experiments focus on bail decision making, a high-stakes task. Police arrest over 10 million people each year in the U.S. (Kleinberg et al. 2017). Soon after arrest, judges decide whether defendants should be released on bail or must wait in jail until their trial. Since cases can take several months to proceed to trial, bail decisions are consequential both for defendants and for society. By law, a defendant should be released only if the judge believes that they will neither flee nor commit another crime. This decision is naturally modeled as a prediction problem.

We use a dataset on bail outcomes collected from several state courts in the U.S. between 1990 and 2009 (Lakkaraju et al. 2019). The dataset contains 37 features, including demographic attributes (age, gender, race), personal information (e.g., married), socio-economic information (e.g., pays rent, lives with children), current offense details (e.g., is felony), and past criminal records of about 32K defendants who were released on bail. Each defendant in the data is labeled either as risky (if he/she fled and/or committed a new crime after being released on bail) or as not risky. The goal is to train a black box that predicts these outcomes to help judges make bail decisions. Explanations of this black box are needed to help domain experts determine whether to trust the black box.

Domain experts in user study.
We carried out our study with 47 subjects. Each participant was a student enrolled in a law school at the time of our study. Each participant acknowledged having in-depth knowledge (16 participants) or at least some familiarity (31 participants) with the bail decision-making process. Of the subjects, 27 self-identified as male and 20 as female; 25 are White, 15 Asian, 2 Hispanic, and 5 African American.

We split our study into two phases: (i) First, we reached out to each of the participants to determine which of the features in the bail dataset are relevant (i.e., desired) and which ones should be omitted (i.e., prohibited). We used these insights to construct our classifier and its explanations (see Section 5.1). (ii) Next, we performed the key part of our study: we reached out to all the subjects to understand how and why a particular explanation influences their trust of the black box classifier.
We discuss how we construct our black box (designed to be untrustworthy) and its explanations (some of which are designed to be misleading). We surveyed the domain experts to identify desired and prohibited features, and then used this information to construct our classifier and explanations. We generate an untrustworthy black box B by explicitly including prohibited features and omitting desired features, and generate misleading explanations for B by explicitly including desired features and/or omitting prohibited features.

Identifying prohibited and desired features.
We surveyed all 47 subjects to identify prohibited and desired features. Each participant is shown all 37 features in the bail dataset, and is asked to indicate which ones are relevant and which ones should be omitted when predicting whether a defendant is risky and should not be released on bail. Figure 2 shows the 5 features (x-axis) ranked as the most prohibited (left) and the most desired (right) by the participants, along with how many participants voted for each feature (y-axis). Race and gender stand out unanimously as the top prohibited features; prior jail incarcerations (PJI) and prior failure to appear (PFTA) are the top desired features. In both cases, the first two features received significantly more votes than all the other features, so we use race and gender as prohibited features, and PJI and PFTA as desired features, in all subsequent experiments.

Figure 2: Top 5 prohibited (left) and desired (right) features, and the number of participants who voted for each one.

Black box and explanations.
We use the identified prohibited and desired features to construct our black box and its explanations. At a high level, our approach is to construct a black box that is designed to be untrustworthy to the domain experts should they be familiar with its inner workings, and to construct high-fidelity explanations of this black box designed to mislead them into trusting it.

To this end, we randomly shuffle the bail dataset and split it into train (70%), test (25%), and validation (5%) sets. We employ our framework with different parameter settings to construct both the black box and its explanations. We leverage the validation set and a coordinate-descent style tuning procedure similar to that of MUSE to set the hyperparameters λ_1, λ_2, ..., λ_6 (Lakkaraju et al. 2019).

We first construct a black box B that uses race and gender (prohibited) and does not use PJI and PFTA (desired); thus, B is most likely untrustworthy to the domain experts should they examine its internal workings. (A defendant with a prior failure to appear previously failed to show up for court dates and is deemed a flight risk.) We use our framework to build B; while designed to construct explanations, it can be applied to build an interpretable classifier by replacing the black box labels B(x) (for each x ∈ X) with the corresponding ground truth label y. We use desired features D = {PJI, PFTA} and prohibited features P = {race, gender}. The resulting black box B, shown in Figure 1 (left), is an interpretable two-level decision set; its accuracy on the held-out test set is 83.28%.

We then use our framework to construct three different high-fidelity explanations E1, E2, E3 of B, as follows: (i) E1 uses neither prohibited features nor desired features (i.e., we use P = {race, gender, PJI, PFTA} and D = ∅), (ii) E2 uses both prohibited and desired features (i.e., we use P = ∅ and D = {race, gender, PJI, PFTA}), and (iii) E3 uses desired features but not prohibited features (i.e., we use P = {race, gender} and D = {PJI, PFTA}). We show E3 in Figure 1 (right).

A potential concern is that our goal is to study how qualitative aspects of each explanation (e.g., which features appear) affect whether a user trusts B; however, the fidelity of an explanation can also affect user trust. Thus, it is important to control for fidelity beforehand. To this end, we estimate the fidelity of each explanation on the held-out test set; the fidelities for E1, E2, and E3 are 97.3%, 98.9%, and 98.2%, respectively. These values are all very similar; thus, differences in whether the user trusts or mistrusts B must be due to the structure of the explanations rather than their fidelities.
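One way to view these four constructions is as four configurations of the same constrained search from Section 4. The sketch below is purely illustrative (the `include`/`exclude`/`target` names are hypothetical, not from a released implementation); it records which features each model is forced to use or avoid, and whether it is fit to the ground-truth labels or to B's predictions:

```python
PROHIBITED = {"Race", "Gender"}   # top-voted prohibited features
DESIRED = {"PJI", "PFTA"}         # top-voted desired features

# Hypothetical per-model settings for the constrained search of
# Section 4; "target" says whether we fit ground-truth labels or
# mimic the black box B.
configs = {
    # B: forced to use race/gender and avoid PJI/PFTA, fit to labels.
    "B":  dict(include=PROHIBITED, exclude=DESIRED, target="labels"),
    # E1: omits both prohibited and desired features.
    "E1": dict(include=set(), exclude=PROHIBITED | DESIRED, target="B"),
    # E2: free to use all features, including race/gender and PJI/PFTA.
    "E2": dict(include=PROHIBITED | DESIRED, exclude=set(), target="B"),
    # E3 (Figure 1, right): includes PJI/PFTA, omits race/gender.
    "E3": dict(include=DESIRED, exclude=PROHIBITED, target="B"),
}
```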
Next, we performed a user study with the domain experts to understand how our different explanations E1, E2, E3 affect user trust of the same black box model B.

User study design.
We designed an online user study in which 41 of the 47 domain experts that we recruited participated (the remaining 6 participants were used to explore how interactive explanations can affect user trust). Each participant was randomly chosen to be shown either the black box B (with fidelity 100%) or one of the explanations E1, E2, E3 (with their corresponding fidelities). Including the black box B is critical since it allows us to estimate the baseline trust O*(B), i.e., whether users trust B if they understand its internals. Each participant was instructed beforehand that the explanations they see are only correlational, not causal. Participants were allowed to take as much time as they wanted to complete the study.

Each participant was asked (i) to answer the following yes/no question: "Below is an explanation generated by state-of-the-art ML for a particular black box designed to assist judges in bail decisions. Based on this explanation, would you trust the underlying model enough to deploy it?", and (ii) a follow-up descriptive question to explain why they decided to trust or mistrust the black box.
Results and discussion.

Figure 3: Effect of various explanations on user trust of black box B.

Figure 3 shows the results of our user study. Each of the bars corresponds to either the black box or one of the explanations (x-axis). We show the corresponding user trust, measured as the fraction of participants who responded that they trust the underlying black box, i.e., answered yes to the question above (y-axis).

As can be seen, only 9.1% of the participants who saw the actual black box trusted it (blue), establishing our baseline that the black box is not trustworthy. Next, we discuss users who only saw one of the explanations of the black box. First, only 10% of the participants who saw E2 (brown), which includes race and gender as well as PJI and PFTA, trusted the underlying black box. On the other hand, 70% and 88% of participants who saw E1 (yellow) and E3 (purple), respectively, trusted the underlying black box. The prohibited features race and gender do not appear in E1 or E3; in addition, E3 includes the desired features PJI and PFTA.

These results show that E1 and E3 mislead users, i.e., they lead the user to trust a black box, while users find the actual black box untrustworthy. Since B and E2 both include race and gender, participants are unwilling to trust the black box in these two cases. On the other hand, race and gender do not appear in E1 and E3, and in these cases users are very likely to trust the underlying black box. These results hold in spite of the clear warning we show to participants saying that the explanations shown are not causal. Furthermore, participants who see E3 appear to trust the underlying black box more frequently than those who see E1, most likely because the desired attributes PJI and PFTA are used by E3.

Finally, we analyzed the reasons participants gave for their responses. They are consistent with our findings, i.e., user trust appears to be driven primarily by whether the race and gender features appear in the explanation shown.

We carried out the first systematic study of if and how explanations of black boxes can mislead users and affect user trust, including a novel theoretical framework for understanding when misleading explanations can exist, a novel approach for generating explanations that are likely to be misleading, and an extensive user study with domain experts from law and criminal justice to understand how misleading explanations impact user trust. We find that user trust can be manipulated by high-fidelity, misleading explanations. These misleading explanations exist because prohibited features (e.g., race or gender) can be reconstructed from correlated features (e.g., zip code). Thus, adversarial actors can fool end users into trusting an untrustworthy black box, e.g., one that employs prohibited attributes to make decisions. We consider two ways to address this challenge.

First, recent research (Lakkaraju et al. 2019) has advocated for thinking about explanations as an interactive dialogue where end users can query or explore different explanations (called perspectives) of the black box. In fact, MUSE is designed for interactivity; e.g., a judge can ask MUSE "How does the black box make predictions for defendants of different races and/or genders?", and it would return an explanation that only uses race and/or gender in its outer if-then clauses.
We performed another user study with 6 domain experts from our participant pool to study their trust in the underlying black box B when they could explore various explanations of B using MUSE, and found that only 16.7% of the participants (1 out of 6) trusted B. This value is much closer to the baseline trust (9.1%).

Second, there has been recent work on capturing causal relationships between input features and black box predictions (Zhao and Hastie 2019; Wachter, Mittelstadt, and Russell 2017). Explanations relying on correlations not only may be misleading (Rudin 2019), but have also been shown to lack robustness (Ghorbani, Abid, and Zou 2019); causal explanations may address these issues.

References

Agrawal, R., and Srikant, R. 2004. Fast algorithms for mining association rules. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 487–499.

Bastani, O.; Kim, C.; and Bastani, H. 2017. Interpretability via model extraction. arXiv preprint arXiv:1706.09773.

Caruana, R.; Lou, Y.; Gehrke, J.; Koch, P.; Sturm, M.; and Elhadad, N. 2015. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Knowledge Discovery and Data Mining (KDD).

Dombrowski, A.-K.; Alber, M.; Anders, C. J.; Ackermann, M.; Müller, K.-R.; and Kessel, P. 2019. Explanations can be manipulated and geometry is to blame. arXiv preprint arXiv:1906.07983.

Doshi-Velez, F., and Kim, B. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.

Ghorbani, A.; Abid, A.; and Zou, J. 2019. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 3681–3688.

Khuller, S.; Moss, A.; and Naor, J. S. 1999. The budgeted maximum coverage problem. Information Processing Letters.

Kleinberg, J.; Lakkaraju, H.; Leskovec, J.; Ludwig, J.; and Mullainathan, S. 2017. Human decisions and machine predictions. The Quarterly Journal of Economics.

Lage, I.; Chen, E.; He, J.; Narayanan, M.; Kim, B.; Gershman, S.; and Doshi-Velez, F. 2019. An evaluation of the human-interpretability of explanation. arXiv preprint arXiv:1902.00006.

Lakkaraju, H.; Bach, S. H.; and Leskovec, J. 2016. Interpretable decision sets: A joint framework for description and prediction. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 1675–1684.

Lakkaraju, H.; Kamar, E.; Caruana, R.; and Leskovec, J. 2019. Faithful and customizable explanations of black box models. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 131–138. ACM.

Lee, J.; Mirrokni, V. S.; Nagarajan, V.; and Sviridenko, M. 2009. Non-monotone submodular maximization under matroid and knapsack constraints. In Proceedings of the ACM Symposium on Theory of Computing (STOC), 323–332.

Letham, B.; Rudin, C.; McCormick, T. H.; and Madigan, D. 2015. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Annals of Applied Statistics.

Lipton, Z. C. 2016. The mythos of model interpretability. arXiv preprint arXiv:1606.03490.

Lundberg, S. M., and Lee, S.-I. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, 4765–4774.

Poursabzi-Sangdeh, F.; Goldstein, D. G.; Hofman, J. M.; Vaughan, J. W.; and Wallach, H. 2018. Manipulating and measuring model interpretability. arXiv preprint arXiv:1802.07810.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Knowledge Discovery and Data Mining (KDD).

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2018. Anchors: High-precision model-agnostic explanations. In Thirty-Second AAAI Conference on Artificial Intelligence.

Rudin, C. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence.

Wachter, S.; Mittelstadt, B.; and Russell, C. 2017. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology.

Zhao, Q., and Hastie, T. 2019. Causal interpretations of black-box models. Journal of Business & Economic Statistics (just-accepted):1–19.
A Algorithm
A.1 Interpretability Measures
We describe how each of our interpretability objectives is measured given a two-level decision set R (with M rules), a black box B, and a training set D = {x^1, ..., x^N} ⊆ X. These measures are summarized in Table 2.

The first four measures are structural properties of R. First, we want to minimize the size of R, which is the number of triples (q, s, c) in R. Second, we want to minimize the maximum width of R, which is the maximum over (q, s, c) in R of the quantities width(s) and width(q), where the width of an if-then rule is the number of predicates that occur in its condition. Third, we want to minimize the total number of predicates in R, which is the sum over (q, s, c) in R of width(s) + width(q). Fourth, we want to minimize the number of decision sets in R, which is the number of unique neighborhood descriptors q in R.

The next measure is a semantic property of R. Intuitively, it captures the idea that outer if-then clauses (i.e., neighborhood descriptors) and inner if-then rules have different semantic meanings. To make the distinction more clear, the overlap between the features that appear in outer and inner if-then rules should be minimized. In particular, for each pair (q, s) of outer and inner if-then rules, we sum the number of features that occur in both q and s; we want to minimize this quantity.

The final measures capture a property specific to decision sets. For decision sets, multiple rules may apply to a given example x ∈ X (a rule applies to x if x satisfies its condition); i.e., rules can be ambiguous. To maximize interpretability, for most examples x the rules should be unambiguous, i.e., only one rule should apply to a given x. First, we want to minimize the rule overlap, which is the number of extra rules that apply. Second, we want to maximize the cover, which counts the number of instances in the dataset that satisfy some rule in R.

Finally, we have

    f_1(R) = 2 · W_max · |ND| · |DL| − numpreds(R)
    f_2(R) = W_max · |ND| · |DL| − featureoverlap(R)
    f_3(R) = N · (|ND| · |DL|)² − ruleoverlap(R)
    f_4(R) = cover(R)
    f_5(R) = N · |ND| · |DL| − disagreement(R),

where W_max is the maximum width of any rule in either candidate set. To ensure that the objective is non-negative, we have subtracted each measure from an upper bound on its value.

Table 1: Summary of notation.

    Symbol | Description
    D      | Dataset D = {x^1, x^2, ..., x^N}
    ND     | Candidate set of conjunctions of predicates for choosing outer if-then clauses of explanations
    DL     | Candidate set of conjunctions of predicates for choosing inner if-then rules of explanations

Table 2: Metrics used in the optimization problem.

    Fidelity:
        disagreement(R) = Σ_{i=1}^{M} |{x ∈ D | x satisfies q_i ∧ s_i, B(x) ≠ c_i}|
    Interpretability:
        size(R) = number of triples (q, s, c) in R
        maxwidth(R) = max_{e ∈ ∪_{i=1}^{M} (q_i ∪ s_i)} width(e)
        numpreds(R) = Σ_{i=1}^{M} (width(s_i) + width(q_i))
        numdsets(R) = |dset(R)|, where dset(R) = ∪_{i=1}^{M} q_i
        featureoverlap(R) = Σ_{q ∈ dset(R)} Σ_{i=1}^{M} featureoverlap(q, s_i)
    Unambiguity:
        ruleoverlap(R) = Σ_{i=1}^{M} Σ_{j=1, j≠i}^{M} overlap(q_i ∧ s_i, q_j ∧ s_j)
        cover(R) = |{x ∈ D | x satisfies q_i ∧ s_i for some i ∈ {1, ..., M}}|

B Proofs of Theorems

Proof of Theorem 3.1.
Consider input features X_D = X_P = ℝ with no ambivalent features, so X = X_D × X_P = ℝ², and binary labels Y = {0, 1}. Furthermore, consider a distribution p((x_1, x_2), y) over X × Y defined by

    p((x_1, x_2), y) = p_0(x_1) · δ(x_2 − x_1) · δ(y − I[x_1 ≥ 0]),

where p_0 = N(0, 1). In other words, x_1 is a standard Gaussian random variable, x_1 and x_2 are perfectly correlated, and the outcome is 1 if x_1 ≥ 0 and 0 otherwise. Next, consider the black box

    B((x_1, x_2)) = I[x_2 ≥ 0];

note that B achieves zero loss. Since B uses the prohibited feature x_2, it is potentially untrustworthy, i.e., Ô*(B) = 0. Similarly, consider the explanation

    E((x_1, x_2)) = I[x_1 ≥ 0].

Since this explanation uses the desired feature and not the prohibited feature, it is acceptable; thus, it is potentially misleading, i.e., Ô(E) ≠ Ô*(B). Finally, note that

    L(E, B) = E_{p((x_1, x_2))}[ℓ(E((x_1, x_2)), B((x_1, x_2)))]
            = E_{p_0(x_1)}[∫ ℓ(I[x_1 ≥ 0], I[x_2 ≥ 0]) · δ(x_2 − x_1) dx_2]
            = E_{p_0(x_1)}[ℓ(I[x_1 ≥ 0], I[x_1 ≥ 0])]
            = 0.

Thus, E achieves perfect fidelity, as claimed.
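The construction in this proof can also be checked numerically. Below is a self-contained sketch (assuming NumPy) that draws perfectly correlated features and confirms that the explanation matches the black box everywhere despite using a different feature:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(100_000)  # desired feature: x1 ~ N(0, 1)
x2 = x1.copy()                     # prohibited feature, perfectly
                                   # correlated with x1
B = (x2 >= 0).astype(int)          # black box uses the prohibited x2
E = (x1 >= 0).astype(int)          # explanation uses the desired x1

print(np.mean(E != B))             # 0.0: perfect fidelity, while E
                                   # conceals B's reliance on x2
```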
First, we have the following decom-position of the relative error: for any
F, F (cid:48) , F (cid:48)(cid:48) : X → Y , L ( F, F (cid:48) ) ≤ L ( F, F (cid:48)(cid:48) ) + L ( F (cid:48)(cid:48) , F (cid:48) ) . This result follows since for any y, y (cid:48) , y (cid:48)(cid:48) ∈ Y , (cid:96) ( y, y (cid:48) ) = I [ y (cid:54) = y (cid:48) ]= I [ y (cid:54) = y (cid:48)(cid:48) ∧ y (cid:48)(cid:48) = y (cid:48) ] + I [ y = y (cid:48)(cid:48) ∧ y (cid:48)(cid:48) (cid:54) = y (cid:48) ] ≤ I [ y (cid:54) = y (cid:48)(cid:48) ] + I [ y (cid:48)(cid:48) (cid:54) = y (cid:48) ]= (cid:96) ( y, y (cid:48)(cid:48) ) + (cid:96) ( y (cid:48)(cid:48) , y (cid:48) ) , o we have L ( F, F (cid:48) ) = E p ( x ) [ (cid:96) ( F ( x ) , F (cid:48) ( x ))] ≤ E p ( x ) [ (cid:96) ( F ( x ) , F (cid:48)(cid:48) ( x )) + (cid:96) ( F (cid:48)(cid:48) ( x ) , F (cid:48) ( x ))]= L ( F, F (cid:48)(cid:48) ) + L ( F (cid:48)(cid:48) , F (cid:48) ) . As a consequence, we have L ( E, B ) ≤ L ( E, B + ) + L ( B + , B )= L ( E, B + ) + (cid:15) R . Next, note that L ( E, B + ) ≤ L ( E (cid:48) , B + ) + L ( B + , B )= L ( E + , B + ) + L ( B + , B ) + (cid:15) A , where the first line follows since by definition, E (cid:48) maxi-mizes error relative to B + over E ∈ E , and the second linefollows by the definition of (cid:15) A . Now, again by our decom-position of relative error, we have L ( E + , B + ) ≤ L ( E + , B ) + L ( B, B + )= L ( E + , B ) + (cid:15) R , where the last line follows since relative error is symmetric.Putting these three inequalities together, we have L ( E + , B ) ≤ L ( E, B ) + 2 (cid:15) R + (cid:15) A ≤ (cid:15) + , where the second line follows by our assumption in the theo-rem statement. Since E + ∈ E + , by definition of ˆ O , we have ˆ O ( E + ) = 1 , as claimed. . Proof of Theorem 4.1.