Do Explanations Reflect Decisions? A Machine-centric Strategy to Quantify the Performance of Explainability Algorithms
Zhong Qiu Lin*, Mohammad Javad Shafiee*, Stanislav Bochkarev, Michael St. Jules, Xiao Yu Wang, Alexander Wong
Vision & Image Processing Group, Systems Design Engineering, University of Waterloo
Waterloo Artificial Intelligence Institute, Waterloo, ON
DarwinAI Corp., Waterloo, ON
* Equal contribution
Abstract
There has been a significant surge of interest recently in the research community around the concept of explainable artificial intelligence (XAI), where the goal is to produce an interpretation for a decision made by a machine learning algorithm. Of particular interest is the interpretation of how deep neural networks make decisions, given the complexity and ‘black box’ nature of such networks. Given the infancy of the field, there has been very limited exploration into the assessment of the performance of explainability methods, with most evaluations centered around subjective visual interpretation of the produced interpretations. In this study, we explore a more machine-centric strategy for quantifying the performance of explainability methods on deep neural networks via the notion of decision-making impact analysis. More specifically, we quantify the importance of the critical factors identified for a given decision made by a network based on the impact on network decisions and confidences in the absence of these critical factors. For scenarios where we wish to study impact on directed erroneous decisions (e.g., under adversarial distractions), we additionally quantify the importance of the identified critical factors based on their coverage of the adversarially impacted factors. We introduce two quantitative performance metrics: i) Impact Score, which assesses the percentage of critical factors with either strong confidence-reduction impact or decision-changing impact, and ii) Impact Coverage, which assesses the percentage coverage of adversarially impacted factors in the input. A comprehensive analysis using this approach was conducted on several state-of-the-art explainability methods (LIME, SHAP, Expected Gradients, GSInquire) on a ResNet-50 deep convolutional neural network using a subset of ImageNet for the task of image classification. Experimental results show that, for both general and adversarial distraction scenarios, the critical regions identified by LIME within the tested images had the lowest impact on the decision-making process of the network.

The significant advances in deep learning [8], in particular deep neural networks, have led to a rise in adoption across industry. This has also led to a tremendous rise in research in the area of deep learning and its application to a wide variety of tasks, leading to state-of-the-art performance across
various tasks such as visual perception [18, 6, 11], speech recognition [2], and natural language processing [22, 5]. However, as the proliferation of deep learning continues, there is now growing interest as well as concern over how deep neural networks make decisions, particularly for life-critical applications such as autonomous driving and clinical decision support. Given the sheer complexity of deep neural networks and how information propagates through such networks to form a decision, deep learning has often been viewed as a ‘black box’ machine learning method, and it is very difficult to interpret and understand its decision-making process or the key factors involved in a decision. This makes deep learning challenging to leverage, particularly in regulated spaces where interpretability and transparency are a necessity (e.g., finance and healthcare). Furthermore, this challenge of interpretability also makes it very difficult for machine learning engineers and scientists to understand biases and error scenarios of a trained network in order to improve upon it, as well as situations where the network is deciding based on unintended patterns in the dataset [10]. This is particularly critical given the recent rise of adversarial examples [19, 1], which are designed specifically to cause deep neural networks to make erroneous decisions; understanding how networks behave is very important for devising better ways to defend against them. As such, the ability to explain the decision-making process of deep neural networks can be critical for enabling the development of improved, more dependable deep learning, as well as for enabling the use of deep learning in a more trustworthy manner in mission-critical scenarios.

Due to this critical need for increased transparency and interpretability in deep learning, there has been a considerable increase in research interest in explainability methods for interpreting the decision-making process of a deep neural network. In the field of computer vision, such explainability methods typically manifest their interpretations of the decision-making process of visual perception neural networks in the form of visual saliency maps that highlight critical regions deemed by the method as important in making the decision. While such visual interpretations aim to give new insights into the way deep neural networks make decisions, much of the evaluation of the visual interpretations produced by explainability methods has been largely subjective; as such, ironically, it is up to the interpretation of the human observer, and it is thus difficult to judge whether these identified critical regions are in fact reflective of what the deep neural network is leveraging to make decisions. While this current gap in the exploration of quantitative performance assessment of explainability methods, in terms of their impact on decisions made by deep neural networks, is understandable given how new this area of research is, it hinders the level of human trust in not just the deep neural networks but also in the explainability methods themselves.
In fact, quantitative methods to assess the performance of explainability methods are critical not only for trust in the decisions made, but also for the choice of method for deployment and for research development, especially since different explainability methods can produce drastically different explanations given the same input data and the same model, making it difficult to know whether algorithmic extensions of such explainability approaches actually improve interpretability.

In this study, we explore a more machine-centric strategy for quantifying the performance of explainability methods on deep neural networks via the notion of decision-making impact analysis. More specifically, we introduce a new performance metric (which we will refer to as the Impact Score) for quantifying how well the critical factors identified by an explainability method reflect a given decision made by a network, based on the impact on network decisions and confidences in the absence of these critical factors. For scenarios where we wish to assess impact on directed erroneous decisions (e.g., under adversarial distractions), we introduce an additional performance metric (which we will refer to as
Impact Coverage) for quantifying the coverage of the identified critical factors over the adversarially impacted factors. Based on these metrics, we conduct a comprehensive analysis of the performance of four different state-of-the-art methods from the recent research literature on the task of image classification, to study how such methods compare against each other in terms of how much impact the critical regions identified in the explanations produced by each method actually have on the decision-making process, under both general and adversarial scenarios. To the best of the authors' knowledge, this is the first systematic study to quantitatively assess the performance of several state-of-the-art explainability methods based on how impactful their explanations are to decisions made by a network under both normal and adversarial scenarios.
The explainability methods in the current research literature can generally be divided into two main categories [20], which we will refer to as proxy strategies and direct strategies [17, 16, 14, 15, 4, 21]. In direct strategies, the decision-making process of a deep neural network is interpreted by studying the internal behaviour of the network directly and then surfacing that information as an explanation for the decision-making process of the network. The most well-known of the proxy methods is LIME [12], which takes advantage of a linear proxy model to approximate the behaviour of the targeted machine learning model and then interprets the original model based on the learnt proxy. Proxy approaches are considered ‘black box’ approaches, where the explainability method does not have direct access to the inner workings of the network and the proxy model approximates it given only the input and the output of the network. On the other hand, direct explainability algorithms are usually considered ‘white box’ methods, as they require access to the inner workings of a deep neural network, such as gradients and activations at different layers for a given input, to identify the key factors within the input that are critical to the decision-making process. For example, by leveraging information about gradients, it is possible to quantify how much of a change in the input data would turn the decision of the network to another output, and thereby measure the importance of each input in the decision-making process. Notable gradient-based direct explainability approaches include Integrated Gradients [17], Guided Backpropagation [16], Guided Grad-CAM [14], SmoothGrad [15], and Expected Gradients [4].
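To make the distinction concrete, the following is a minimal sketch (our own illustration, not code from this paper) of the simplest ‘white box’ direct approach: a plain input-gradient saliency map for a pretrained ResNet-50 in PyTorch. The file name is a placeholder, and methods such as Integrated Gradients or SmoothGrad can be viewed as refinements of this basic gradient signal.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained classifier; the reference network N used later in this paper is also a ResNet-50.
model = models.resnet50(pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

x = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
x.requires_grad_(True)

# Forward pass, then backpropagate the score of the predicted class to the input pixels.
logits = model(x)
pred = logits.argmax(dim=1).item()
logits[0, pred].backward()

# The per-pixel gradient magnitude acts as a (very crude) saliency map.
saliency = x.grad.abs().max(dim=1)[0].squeeze(0)  # shape: (224, 224)
```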
Much of the research literature around explainability, particularly for visual perception tasks such as image classification, has revolved around subjective visual interpretation of the explanations produced by the explainability method. This usually takes the form of visual saliency maps, where salient regions in the map produced using the explainability method of choice are considered critical regions influencing the decision made by a network. However, due to the purely qualitative nature of such visual assessments, it is very challenging to get a good sense of how well an explainability method is performing, how useful or meaningful the provided explanation is relative to its influence over the network's decision and its associated confidence, and, more importantly, how well it performs compared to other explainability methods. As such, this can limit progress in the field of explainable artificial intelligence, since subjective visual assessment provides no basis for benchmarking.

More recently, there have been explorations into human-centric strategies for quantifying explainability performance in the case of visual perception, where the visual saliency map produced using a given explainability method for a given image is compared with a visual attention map created from gaze information collected from human subjects [7]. While such an approach is a step towards quantification of the explanations produced by explainability methods, one of its biggest limitations is the underlying assumption that a deep neural network makes decisions in a similar manner to human subjects, which is often not true. As such, this human-centric approach to quantifying explainability performance provides very little insight into the actual driving factors of the decision-making process of deep neural networks. Furthermore, this approach requires considerable human gaze information to be collected, which is simply impractical for most real-world scenarios.

To address the limitations of human-centric strategies for quantifying the performance of explainability methods, we take a drastically different direction by instead exploring a more machine-centric strategy, where we quantify performance based on the decision-making behaviour of the network itself. More specifically, we aim to quantify the performance of explainability methods on deep neural networks via the notion of decision-making impact analysis, where we study the quantitative impact of the critical factors identified by an explainability method for a given decision made by a network, based on the changes in the decisions and associated confidences of the network itself.

In the sections below, we first define a performance metric for quantifying the impact of critical factors identified by an explainability method on decisions, and the confidence in those decisions, as made by a given deep neural network. Next, we introduce an additional performance metric for directed erroneous decision scenarios based on the concept of impact coverage.
To facilitate the quantitative assessment of the performance of a given explainability method, the first step is to define and formulate a performance metric for performing such an assessment. Motivated by a machine-centric strategy for the quantitative performance assessment of a given explainability method on a particular deep neural network, we aim to develop metrics that quantify the importance of the critical factors identified by the explainability method for a given decision made by a network, based on the impact these factors have on network decisions and the associated confidences. We consider the critical factors c identified by an explainability method M to be important to a decision y made by a deep neural network N for a given input x if either of the following conditions is met:

• Decision-level impact: the decision made by the deep neural network changes in the absence of the critical factors.
• Confidence-level impact: the confidence z of the deep neural network in its decision changes by τ% in the absence of the critical factors.

The motivation behind this definition of importance for the critical factors identified by a given explainability method is that, if the critical factors are indeed crucial to the decision-making process of the deep neural network, then the absence of these critical factors in the given input should impact the network such that it either becomes significantly less confident in its current decision, or becomes so unconfident that its confidence in another decision is higher, leading the network to make a different decision altogether.

Impact Score. In this study, we formulate the performance metric I, which we will refer to as the Impact Score, as follows. Let the relationships between the critical factors c, the explainability method M, the input x, the decision y, the confidence in the decision z, and the network N be expressed by the following equations:

{y, z} = N(x),    (1)
c = M(x, N),    (2)

where c ∈ x. Based on this, we can define the input in the absence of c as identified by M as

x′ = x − c,    (3)

and the decision given x′ as input to N as

{y′, z′} = N(x′).    (4)

Therefore, in the general scenario, based on the conditions defined above that the critical factors c for a given input x as identified by M must meet to be deemed important, we can define the Impact Score I across a set of n inputs X = {x_1, x_2, . . . , x_n} as

I = (1/n) Σ_{i=1}^{n} 1[(y′_i ≠ y_i) ∨ (z′_i ≤ τ z_i)],    (5)

where i denotes the i-th input. In this study, we set τ = 0.5 to indicate that the network has lost half of the confidence it had in its original decision. Finally, we also introduce a stricter variant of the above Impact Score, denoted by I_strict, where we only consider decision-level impact:

I_strict = (1/n) Σ_{i=1}^{n} 1[y′_i ≠ y_i].    (6)
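The following is a minimal sketch (our own illustration, not code from the paper) of how the Impact Score in Eqs. (5)–(6) can be computed for an image classifier, assuming a PyTorch model and per-image binary masks marking the critical factors c; zeroing out the masked pixels is used here as one simple way to realize the ‘absence’ of those factors in Eq. (3).

```python
import torch

def impact_scores(model, images, masks, tau=0.5):
    """Impact Score I (Eq. 5) and strict Impact Score I_strict (Eq. 6).

    images: float tensor of shape (n, C, H, W)
    masks:  binary tensor of shape (n, 1, H, W); 1 marks the critical factors c.
    """
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(images), dim=1)
        conf, labels = probs.max(dim=1)                 # original confidence z and decision y

        masked = images * (1 - masks)                   # x' = x - c (critical factors removed)
        probs_masked = torch.softmax(model(masked), dim=1)
        conf_masked = probs_masked.gather(1, labels.unsqueeze(1)).squeeze(1)  # z' for the original y
        labels_masked = probs_masked.argmax(dim=1)      # new decision y'

        decision_change = labels_masked != labels       # decision-level impact
        confidence_drop = conf_masked <= tau * conf     # confidence-level impact
        impact = (decision_change | confidence_drop).float().mean().item()
        impact_strict = decision_change.float().mean().item()
    return impact, impact_strict
```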
Impact Coverage. In the scenario where we wish to study impact in directed erroneous decisions (e.g., decisions made under the influence of adversarial examples), we introduce an additional approach to quantitatively assessing the performance of the different explainability methods, since the critical factors that the network leverages to make a decision are largely known a priori to the evaluation (e.g., in the case of an adversarial patch, the critical region that is important to the decision-making process is the adversarial patch itself). More specifically, we can further quantify the importance of the identified critical factors c based on the amount of coverage of the adversarially impacted factors in x by the critical factors c.

Let us define the Impact Coverage metric I_coverage across a set of n inputs X = {x_1, x_2, . . . , x_n} based on the intersection-over-union between the adversarially impacted factors and the critical factors across the given set of inputs:

I_coverage = (1/n) Σ_{i=1}^{n} |a_i ∩ c_i| / |a_i ∪ c_i|,    (7)

where a_i denotes the adversarially impacted factors in input x_i. As such, the Impact Coverage metric is designed to be high when there is heavy overlap between the identified critical factors and the adversarially impacted factors, rewarding strong alignment between the explanation produced by the explainability method and the actual factors impacting the decision.
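A corresponding sketch for Eq. (7), again our own illustration rather than the authors' code, assuming binary NumPy masks for the adversarial patch region a and the identified critical region c:

```python
import numpy as np

def impact_coverage(critical_masks, patch_masks):
    """Mean intersection-over-union between identified critical factors c
    and adversarially impacted factors a, as in Eq. (7)."""
    ious = []
    for c, a in zip(critical_masks, patch_masks):
        c, a = c.astype(bool), a.astype(bool)
        union = np.logical_or(a, c).sum()
        inter = np.logical_and(a, c).sum()
        ious.append(inter / union if union > 0 else 0.0)
    return float(np.mean(ious))
```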
Figure 1: Example of a decision change due to the absence of critical regions in the decision-making process. (left) original image; (center) identified critical region; (right) prediction confidences for decisions made with the original image and in the absence of the critical regions. The absence of the critical regions led to a change in decision, which means the explanation reflects impact on the decision.

The conducted experiments and the explainability methods used in this study are described below.

For the first experiment, we quantitatively evaluate the performance of several state-of-the-art explainability methods using the two variants of the Impact Score (i.e., I and I_strict) for each explainability method M, using a ResNet-50 deep convolutional neural network designed for the task of image classification as the reference network N. A subset of the ImageNet [13] dataset is leveraged as the input set X. More specifically, we leveraged a subset of 410 different images from the ImageNet dataset, all of which were correctly classified, for consistency purposes. As such, this experiment tasks the different explainability methods with identifying critical regions within a natural image that are important to the class prediction made by the network, such that in the absence of such critical regions the confidence of the network in the predicted class is either significantly reduced or an altogether different class is predicted. An example of a decision change that resulted from the absence of critical regions identified by an explainability method during the decision-making process is shown in Fig. 1. The purpose of this first experiment is to quantitatively evaluate explainability performance under a more general scenario, where decisions are made by the network on untampered data inputs, and is representative of the general use case.

For the second experiment, we quantitatively evaluate the performance of several state-of-the-art explainability methods using the two variants of the Impact Score (i.e., I and I_strict), as well as I_coverage, for each explainability method M in the presence of visual ‘distractions’ in the form of adversarial patches, to better study the impact on directed erroneous decisions. More specifically, we leverage the adversarial patches from the work of Brown et al. [3]. For generating the adversarial patch, we fix the reference network N from Experiment 1 and apply adversarial training on the same subset of the ImageNet [13] dataset as in Experiment 1. We then randomly overlay the resulting adversarial patches (with random translation and random rotation of the patch) on the same subset of images, with patch scales ranging from 0.3 to 0.7 (a simple sketch of this overlay step is given below). An example of a directed erroneous decision due to an adversarially impacted area is shown in Fig. 2. We compute I, I_strict, and I_coverage for each patch scale over the test images whose prediction classes change to the adversarially targeted classes. With the adversarial patch being the control variable, the critical region that is important to the decision-making process is largely known a priori to be the adversarial patch itself, and as such I_coverage provides an additional quantitative indicator of the ability of the explainability method to identify the adversarially impacted areas within the images that have a direct impact on the decisions made by the deep neural network.

In this study, the proposed Impact Score and Impact Coverage are leveraged to perform a comprehensive analysis of several state-of-the-art explainability methods from the research literature. More specifically, the methods under study are: i) LIME [12], ii) SHAP [9], iii) Expected Gradients [4], and iv) GSInquire [21]. These methods were selected as they represent a good coverage of both popular and state-of-the-art methods from both the proxy and direct categories of explainability methods.

Figure 2: Example of a directed erroneous decision due to an adversarially impacted area.
(left) original untampered image; (center) tampered image with an adversarial patch; (right) prediction confidences of decisions made with the untampered image and with the adversarially tampered image. The adversarial patch led to a change in decision.
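As a rough illustration of the patch overlay procedure described above (not the authors' exact implementation), the sketch below pastes an adversarial patch onto an image at a random position and rotation; interpreting the scale as the fraction of image area covered by the patch is an assumption here.

```python
import numpy as np
from PIL import Image

def overlay_patch(image, patch, scale, rng):
    """Overlay `patch` on `image` at a random position and rotation.
    `scale` is taken here as the fraction of the image area the patch covers."""
    img = image.copy()
    w, h = img.size
    side = int(np.sqrt(scale * w * h))            # patch side length for the requested area fraction
    p = patch.resize((side, side)).rotate(rng.uniform(0, 360), expand=True)
    x = int(rng.integers(0, max(1, w - p.size[0])))
    y = int(rng.integers(0, max(1, h - p.size[1])))
    img.paste(p, (x, y), p if p.mode == "RGBA" else None)
    return img

# Example usage (file names are placeholders):
# rng = np.random.default_rng(0)
# patched = overlay_patch(Image.open("input.jpg"), Image.open("patch.png"), scale=0.4, rng=rng)
```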
The experimental results for the two experiments conducted in this study are presented below.
Experiment 1:
The quantitative performance of the four tested explainability methods, as determined by the proposed Impact Scores in the first experiment, is shown in Table 1. A number of interesting observations can be made. First, it can be observed that LIME achieved the lowest I and I_strict scores, indicating that the critical regions identified by LIME had the lowest impact on the actual decision-making process of the network in identifying the class for a given image when compared to the other tested methods, with differences in I and I_strict between SHAP and LIME of over 6% and over 5%, respectively. Second, it can be observed that there is a progressive increase in decision-making impact from SHAP to Expected Gradients, with significant absolute increases in I and I_strict of over 7% and over 7.5%, respectively. While both SHAP and Expected Gradients approximate Shapley values, this significant improvement achieved by Expected Gradients over SHAP can be attributed to the incorporation of ideas behind three of the most recent state-of-the-art concepts in explainability (SHAP, Integrated Gradients [17], and SmoothGrad [15]) within a common expected value formulation, leading to the identification of more impactful critical regions. Third, it can be observed that GSInquire achieved the highest I and I_strict scores amongst the tested methods, achieving significant absolute increases of close to 25% and close to 3% in I and I_strict, respectively, when compared to Expected Gradients. What is interesting about this observation is that the improvement of GSInquire in I is significantly higher than its improvement in I_strict, which indicates that a much larger number of tested images experienced a significant confidence-level impact in the absence of the critical regions identified by GSInquire when compared to those identified by the other methods, while the improvement in decision-level impact is significant but less drastic. Example images, the critical regions identified by the tested explainability methods, and the prediction confidences with and in the absence of the identified critical regions are shown in Fig. 3. It can be observed that, for one of the example images where both Expected Gradients and GSInquire identified decision-impacting critical regions while SHAP and LIME did not (middle row), the absence of the critical regions that SHAP identified not only did not lead to a decision change by the network, but instead led to an increase in prediction confidence for the original decision, which is illustrative of an explanation that does not reflect the decision-making process of the network. Furthermore, as illustrated by the example image in the first row, no explainability method is perfect and the identified critical regions may not have decision-level impact (in this example, while decisions did not change in the absence of the identified critical regions, the critical regions identified by GSInquire led to the highest prediction confidence change amongst the tested methods).

Table 1: Performance of tested explainability methods based on impact on network decisions.
Method                   I         I_strict
LIME [12]                38.05%    35.12%
SHAP [9]                 44.15%    40.24%
Expected Gradients [4]   51.22%    47.80%
GSInquire [21]           76.10%    50.73%

Figure 3: Example images, the corresponding critical regions identified by tested explainability methods, and prediction confidences with and in absence of the identified critical regions.

Experiment 2:
The quantitative performance of the four tested explainability methods, as determined by the proposed Impact Score and Impact Coverage in the second experiment, is shown in Table 2. A number of interesting observations can be made. First, it can be observed that LIME achieved the lowest I, I_strict, and I_coverage scores across all adversarial patch scales, indicating that the critical regions identified by LIME have the lowest impact, as well as the lowest coverage of the adversarially impacted areas in the test images, amongst the tested methods. Second, it can be observed that both SHAP and Expected Gradients had similar I, I_strict, and I_coverage scores, while GSInquire had significantly higher I, I_strict, and I_coverage scores than both SHAP and Expected Gradients across all adversarial patch scales. Example adversarially modified erroneous images via adversarial patches, and the corresponding critical regions identified by the tested explainability methods as being important to the decision made by the network, are shown in Figure 4. It can be observed that both Expected Gradients and GSInquire were able to identify more of the adversarially impacted areas, with GSInquire achieving the best identification coverage of the adversarially impacted areas.

Table 2: Performance of tested explainability methods at different adversarial patch scales.
Scale    LIME [12]                      SHAP [9]                       Expected Gradients [4]         GSInquire [21]
         I_coverage  I  I_strict        I_coverage  I  I_strict        I_coverage  I  I_strict        I_coverage  I  I_strict

Figure 4: Example adversarially modified erroneous images via adversarial patches at different scales, and the corresponding critical regions identified by tested explainability methods as being important to the decision made by the network. (Patch scale, ground truth / adversarial label: 0.30 Television / Monitor; 0.40 Suit / Cup; 0.50 Necklace / Cup; 0.60 Sweatshirt / Monitor; 0.70 Cup / Necklace.)

In this study, we explored a more machine-centric strategy for quantifying the performance of explainability methods on deep convolutional neural networks by quantifying the importance of the critical factors identified by an explainability method for a given decision made by a network. This is accomplished by studying the impact of the identified factors on the decision and the confidence in the decision, and additionally the coverage of the adversarially impacted factors in the directed erroneous decision scenario. A comprehensive analysis using this approach showed that, in the case of visual perception tasks such as image classification, some of the most popular and widely used methods, such as LIME and SHAP, may produce explanations that are not as reflective as expected of what the deep neural network is leveraging to make decisions. Newer methods such as Expected Gradients and GSInquire performed significantly better in general scenarios, with GSInquire also performing significantly better in adversarial distraction scenarios, though there remains significant room for improvement, illustrating the importance of such quantitative metrics for benchmarking methods so that we can better understand where current approaches stand and where we can improve. While by no means perfect, the hope is that the proposed machine-centric strategy helps push the conversation forward towards better metrics for evaluating explainability methods in a manner that gives insights to guide network error mitigation as well as improve trust in deep neural networks. Future work involves studying the quantitative performance of explainability methods under different use cases, such as speech recognition and natural language processing tasks, as well as the extension of the proposed Impact Score to incorporate a wider range of factors for more thorough quantitative assessment.

References
[1] Naveed Akhtar and Ajmal Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. CoRR, abs/1801.00553, 2018.
[2] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pages 173–182, 2016.
[3] Tom B. Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch. arXiv preprint arXiv:1712.09665, 2017.
[4] Gabriel Erion, Joseph D. Janizek, Pascal Sturmfels, Scott Lundberg, and Su-In Lee. Learning explainable models using attribution priors. arXiv preprint arXiv:1906.10670, 2019.
[5] Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. AllenNLP: A deep semantic natural language processing platform. arXiv preprint arXiv:1803.07640, 2018.
[6] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
[7] Qiuxia Lai, Wenguan Wang, Salman Khan, Jianbing Shen, Hanqiu Sun, and Ling Shao. Human vs. machine attention in neural networks: A comparative study. arXiv preprint arXiv:1906.08764, 2019.
[8] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
[9] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.
[10] Timothy Niven and Hung-Yu Kao. Probing neural network comprehension of natural language arguments. arXiv preprint arXiv:1907.07355, 2019.
[11] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. arXiv preprint, 2017.
[12] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
[13] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[14] Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra. Grad-CAM: Why did you say that? arXiv preprint arXiv:1611.07450, 2016.
[15] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. SmoothGrad: Removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
[16] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
[17] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 3319–3328. JMLR.org, 2017.
[18] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[19] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. CoRR, 2013.
[20] Erico Tjoa and Cuntai Guan. A survey on explainable artificial intelligence (XAI): Towards medical XAI. arXiv preprint arXiv:1907.07374, 2019.
[21] Alexander Wong, Mohammad Javad Shafiee, Brendan Chwyl, and Francis Li. FermiNets: Learning generative machines to generate efficient neural networks via generative synthesis. arXiv preprint arXiv:1809.05989, 2018.
[22] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55–75, 2018.