Explaining Natural Language Processing Classifiers with Occlusion and Language Modeling
University of Potsdam Master's Thesis
Explaining Natural Language
Processing Classifiers with Occlusion and Language Modeling
Author:
David Harbecke

Supervisors:
Prof. Dr. David Schlangen
Prof. Dr. Manfred Stede
A thesis submitted in fulfillment of the requirements for the degree of Master of Science in
Cognitive Systems: Language, Learning and Reasoning at the
Faculty of Human Sciences
September 27, 2020

Declaration of Authorship

I, David Harbecke, declare that this thesis, titled "Explaining Natural Language Processing Classifiers with Occlusion and Language Modeling", and the work presented in it are my own. I confirm that I worked independently and only with the sources and aids indicated. All passages of the work which I have taken from these sources and aids, either in wording or in meaning, are marked and listed in the bibliography. Parts of the presented research have been published with a co-author (Harbecke and Alt, 2020). Content contributions of the co-author that are mentioned are marked as such. I am familiar with "Richtlinie zur Sicherung guter wissenschaftlicher Praxis für Studierende an der Universität Potsdam (Plagiatsrichtlinie)".

Date:
Signed:

Nevertheless, this thesis is written in first-person plural. I hereby acknowledge the Association for Computational Linguistics for letting me describe findings of publications co-authored by me.

English Abstract
David Harbecke
Explaining Natural Language Processing Classifiers with Occlusion and Language Modeling
Deep neural networks are powerful statistical learners. However, their predictions do not come with an explanation of their process. To analyze these models, explanation methods are being developed. We present a novel explanation method, called OLM, for natural language processing classifiers. This method combines occlusion and language modeling, which are techniques central to explainability and NLP, respectively. OLM gives explanations that are theoretically sound and easy to understand.

We make several contributions to the theory of explanation methods. Axioms for explanation methods are an interesting theoretical concept to explore their basics and deduce methods. We introduce a new axiom, give its intuition and show it contradicts another existing axiom. Additionally, we point out theoretical difficulties of existing gradient-based and some occlusion-based explanation methods in natural language processing. We provide an extensive argument why evaluation of explanation methods is difficult. We compare OLM to other explanation methods and underline its uniqueness experimentally. Finally, we investigate corner cases of OLM and discuss its validity and possible improvements.
Deutsche Zusammenfassung
David Harbecke
Explaining Natural Language Processing Classifiers with Occlusion and Language Modeling

Tiefe neuronale Modelle sind gut im statistischen Lernen. Jedoch liefern deren Vorhersagen keine Erklärungen des Vorgangs. Um diese Modelle zu analysieren, hat man Erklärungstechniken entwickelt. Wir präsentieren eine neue Erklärungstechnik, genannt OLM, für klassifizierende Modelle linguistischer Datenverarbeitung. Diese Methode kombiniert das Maskieren von Merkmalen mit Sprachmodellierung, was jeweils grundlegende Methoden der Erklärungstechnik und linguistischer Datenverarbeitung sind. OLM liefert Erklärungen, die theoretisch fundiert und einfach zu verstehen sind.

Wir machen mehrere theoretische Beiträge zu Erklärungstechniken. Axiome für Erklärungstechniken sind ein interessantes theoretisches Konzept, um deren Grundlagen auszuloten und Techniken abzuleiten. Wir führen ein neues Axiom ein, legen die Intuition dar und zeigen, dass es einem existierenden Axiom widerspricht. Zusätzlich zeigen wir theoretische Problematik in der Anwendung von gradienten- und manchen maskierungsbasierten Erklärungstechniken bei linguistischer Datenverarbeitung auf. Wir argumentieren umfangreich, warum die Evaluation von Erklärungstechniken schwierig ist. Wir vergleichen OLM mit anderen Erklärungstechniken und heben dessen Alleinstellungsmerkmal hervor. Abschließend betrachten wir Grenzfälle der Anwendung von OLM und diskutieren dessen Validität und mögliche Verbesserungen.
Contents
Declaration of Authorship
English Abstract
Deutsche Zusammenfassung
Contents
List of Figures
List of Tables
List of Definitions
Bibliography

List of Figures

List of Tables
List of Definitions

p_data            probability distribution of data
p_LM              probability distribution of a language model
f_θ               neural network with parameters θ
f                 neural network or, in general, a black-box function
f_c := proj_c ∘ f projection of function f to class c, i.e. output neuron c
(X, Y)            labeled dataset
X                 a dataset or, more generally, the whole input space
Y                 label set
x ∈ X             an input element of the dataset; an input vector
x_i               an indexed feature of the input x
x_\i              an input without the feature at the i-th position
(x_\i, x̂_i)       an input with a replacement feature
r_{f,c}(x_i)      relevance of an input feature x_i regarding function f and class c

The advent of deep learning has created an explanation gap, as the models are considered hard to interpret (Guidotti et al., 2018; Adadi and Berrada, 2018). Models without hand-engineered features are hard to understand in their decision making. This makes explanation methods highly relevant. In natural language processing most state-of-the-art architectures are neural networks. To understand their decisions we point out gaps in existing explanation methods and develop a novel method.
Deep learning describes the architectures of multi-layered neural networks and methods to train them. Deep neural networks (DNNs) learn features from data. The layers of a deep neural network learn increasingly higher-level features (Deng and Yu, 2014).

To introduce deep neural networks we first motivate interest in them. Then, we introduce some of their theory by explaining a possible deep neural network architecture and how DNNs can be trained. Lastly, we discuss specific properties that show DNNs' relevance to the presented work.
Deep neural networks have achieved state-of-the-art performance on a wide variety of tasks, such as
• image recognition (Ciregan et al., 2012; Krizhevsky et al., 2012),
• text classification (Kim, 2014),
• perfect information games (Silver et al., 2016, 2018),
• or estimating the underlying probability distribution of a sample space (Bengio et al., 2003; Goodfellow et al., 2014).

In addition to the wide variety of tasks they can perform well on, the prediction process of a neural network can be seen as opaque. A neural network learns its parameters from data and does not need to have feature extraction engineered by humans. Thus, external techniques are required to explain the training and decisions of a neural network.

Figure 1.1: Schematic view of a neural network with a three-dimensional input layer, two hidden layers of four neurons each, and a two-dimensional output layer.
A neural network consists of neurons in layers. Figure 1.1 gives a schematic view of a neural network. The input layer displays an input vector x = (x_1, x_2, x_3)^T with three input dimensions. For the first hidden layer this input vector is multiplied by a weight matrix W^1 with 3 × 4 entries, a bias vector b^1 = (b^1_1, b^1_2, b^1_3, b^1_4)^T is added, and a non-linear activation function σ^1 is applied component-wise. This is repeated for the next hidden layer and the output layer with different weights and biases. Therefore, a step from one layer to the next is an affine transformation followed by an activation function. The activation function of the output layer is usually chosen such that the output neurons represent a probability distribution or individual probability functions, depending on the formulation of the problem and the data. A mathematical formulation of the network would be

\hat{y} = \sigma^3\left(W^3 \sigma^2\left(W^2 \sigma^1\left(W^1 x + b^1\right) + b^2\right) + b^3\right)   (1.1)
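To make Eq. (1.1) concrete, here is a minimal NumPy sketch of such a forward pass. It is an illustration, not the thesis's implementation: the layer sizes follow Figure 1.1, and the ReLU and softmax activations are assumptions chosen for the example.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Layer sizes follow Figure 1.1: 3 inputs, two hidden layers of 4, 2 outputs.
sizes = [3, 4, 4, 2]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]

def forward(x):
    """Affine transformation followed by an activation, per layer (Eq. 1.1)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    # The output activation is chosen so the outputs form a probability distribution.
    return softmax(weights[-1] @ h + biases[-1])

y_hat = forward(np.array([0.5, -1.0, 2.0]))
print(y_hat, y_hat.sum())  # class probabilities summing to 1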
A neural network with several hidden layers is called a deep neural network (DNN). The weights and biases are the trainable parameters θ of a neural network. The prediction ŷ depends on these parameters, also referred to as weights. A simpler formulation, if we are not interested in specific weights, is

\hat{y} = f_\theta(x)   (1.2)

To train a DNN we need an objective and a loss function. The objective is usually a labeled dataset (X, Y) and tells us what the network should predict for each input. An element of the dataset x ∈ X is usually a coordinate vector over R. The loss function is a function L(ŷ, y) of the prediction ŷ and true label y ∈ Y. Sometimes a regularizer, which is a loss function on the network parameters θ, also usually coordinate vectors over R, is added to this loss function.

With all these ingredients, a neural network is usually trained with backpropagation (Linnainmaa, 1970; Rumelhart et al., 1986a,b) and a variation of gradient descent (Cauchy, 1847). Backpropagation is a method that propagates the error E calculated by the loss function to the parameters θ of the DNN via the chain rule of differentiation. The need for differentiability explains why often both the inputs and parameters are coordinate vectors over R. It ensures that the gradients are also real valued. This is important for section Language Representations for Deep Learning (1.2.1). Differentiation can be done in parallel for all parameters of a layer.

Gradient descent is an optimization technique for these parameters. Let us view the neural network as a high-dimensional function over all parameters θ. Gradient descent changes the parameter values by stepping proportionally to the size of the partial derivative of the loss over the whole dataset with respect to each parameter. Gradient descent can be seen as an optimization alternative to applying Newton's Method (Newton, 1736; Raphson, 1690; Simpson, 1740), which looks for zeros of a function, to the first derivative of a function. Gradient descent only updates the parameters once per iteration over the dataset. This is inefficient, as subsets of the dataset (mini-batches) can provide a good estimation of the gradient (Wilson and Martinez, 2003; Bottou and Bousquet, 2008). Stochastic gradient descent (Robbins and Monro, 1951) averages the gradients over a mini-batch and does one optimization step for this mini-batch. This is a better trade-off between update time and update quality. There are many popular and recent variants and alternatives to stochastic gradient descent such as Momentum (Qian, 1999) and ADAM (Kingma and Ba, 2014).
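The following is a minimal sketch of mini-batch stochastic gradient descent as described above. The function grad_loss is a hypothetical stand-in for the gradient of the loss with respect to the parameters, assumed to be computed via backpropagation; the learning rate and batch size are illustrative.

import numpy as np

def sgd_train(grad_loss, params, X, Y, lr=0.1, batch_size=32, epochs=5):
    """Minimal mini-batch stochastic gradient descent sketch.

    grad_loss(params, x_batch, y_batch) is assumed to return the gradient of
    the loss w.r.t. params; X, Y and params are assumed to be NumPy arrays.
    """
    n = len(X)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # The averaged mini-batch gradient estimates the full-dataset gradient.
            g = grad_loss(params, X[idx], Y[idx])
            params = params - lr * g  # one optimization step per mini-batch
    return params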
The universal approximation theorem states that with specific restrictions for the width (Lu et al., 2017) or depth (Hanin, 2017) DNNs can approximate any continuous convex function. This gives an intuition on why they are state-of-the-art for many prediction tasks. Furthermore, Choromanska et al. (2015) show that deep networks have better loss surfaces than shallow networks for training. This means that the local optima that optimizers find are closer to the global optimum for deep networks. This is underlined by the information bottleneck principle (Tishby and Zaslavsky, 2015; Shwartz-Ziv and Tishby, 2017) which states that training of a DNN is faster than that of similarly capable shallow networks. The success of deep learning is also partly due to hardware with parallel computing capabilities (Strigl et al., 2010).

For this work the relevance of DNNs is two-fold. First, we try to explain their behaviour when performing state-of-the-art prediction on natural language processing tasks. Second, we use neural language models to create these explanations.
Natural language processing (NLP) encompasses the intersection between human language and the processing of it by computing machinery. The field of NLP is nowadays mostly concerned with a statistical and quantitative processing and modeling of mass amounts of language data. Manning and Schütze (1999) state:

"Increasingly, businesses, government agencies and individuals are confronted with large amounts of text that are critical for working and living, but not well enough understood to get the enormous value out of them that they potentially hide."

Neural models currently lead many NLP benchmarks, for example:
• [...] by XLNet (Yang et al., 2019),
• text classification on the AG News corpus, the DBpedia ontology (Zhang et al., 2015) and the TREC dataset (Voorhees and Tice, 2000) by XLNet (Yang et al., 2019).
• There are also multi-task benchmarks, such as GLUE (Wang et al., 2017) and SuperGLUE (Wang et al., 2019a), that combine several NLP classification tasks. Leading both benchmarks is T5 (Raffel et al., 2019).

At least three things are notable regarding the previous list. First, the variety of tasks is wide, ranging from speech recognition on audio data and generating text data in language modeling to classifying words, sentences or texts in part-of-speech tagging, relationship extraction, sentiment analysis and text classification. This makes the dominance of neural architectures in these tasks even more impressive.

Second, XLNet (Yang et al., 2019) and T5 (Raffel et al., 2019) appear frequently on top of the leaderboard on many of these tasks. Both use a pre-trained language model, i.e. they learn a representation of language by going over large text datasets with billions of words, such as the BooksCorpus (Zhu et al., 2015) or English Wikipedia (https://en.wikipedia.org). The training of language models will be described in more detail in section Language Modeling (3.2). Raffel et al. (2019) even encode the problem formulation into the representation.

Third, most of the datasets are classification tasks in some sense where the neural model has a preselected set of outputs. These are the problems and models we are interested in. The method presented in this thesis yields explanations for NLP classification tasks.
Time flies like an arrow; fruit flies like bananas.
(Anthony Oettinger)
An important component to utilize neural networks in NLP is the representation of the input. This is nontrivial, as we saw in section Training (1.1.3). We have to assign real valued coordinate vectors to our input. Furthermore, the unit of language, called token or atomic parse element, from which to map into our vector space is nonobvious. It is imperative to choose a token that allows for an unambiguous mapping. For written language, characters, words, sentences and documents are among the candidates for this atomic parse element.

A simple realization of this mapping is to count the words of an input and create a vector representation with vector indices corresponding to words. This method is called bag-of-words (Harris, 1954). It can be used both on a sentence and a document level. A practical improvement on this is what is now called tf-idf (Salton et al., 1975). There, the word counts are scaled with the logarithm of the inverse of the ratio of documents containing the word. Both methods do not preserve the order or contextual meaning of words. Locally, the order and contextuality can be incorporated by choosing to use n-grams instead of or in addition to single words. However, this does not retain information about longer contexts. It also yields exponentially more possible n-grams and fewer counts of a specific n-gram for increasing n, which makes the representations less precise and efficient.

A variation of bag-of-words is one-hot-encoding, where words are not counted but get assigned pairwise distinct standard basis vectors. If we ignore out-of-vocabulary words, this is a one-to-one function between inputs and the representation. This method can be used as underlying transformation for representing words with other methods. Compared to the previous methods, it allows the order of words in the whole input to be kept.

Mikolov et al. (2013) introduce word2vec, a learned vector space representation of words. It can be called an embedding because the dimensionality of this vector space is lower than the number of words that are represented. This means that the vectors corresponding to words are generally not pairwise orthogonal. Words with similar meaning are supposed to have a small angle between their vector representations. Words are predicted from their context with the intuition that words that appear in similar contexts have similar meaning. At the time they achieved state of the art on semantic word similarity. GloVe (Pennington et al., 2014) explicitly learns vector representations that are based on co-occurrence. These representations allow the DNNs that employ them to use the order of words if they desire. Thus, context is preserved but does not influence the representation itself.

Many approaches use characters (Wieting et al., 2016; Bojanowski et al., 2017) or sub-words (Wu et al., 2016; Kudo and Richardson, 2018) as tokens to improve on word representations. By choosing a smaller unit, similarities between words with similar spelling or the same root can be incorporated into the representation. Still, this does not enable different representations of the same word in different contexts.

Howard and Ruder (2018) use the language modeling described in section Language Modeling (3.2) to create a vector embedding where the information of context is fed through several layers. This enables a different representation for "flies like" in the epigraph of this section, depending on the context.
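As a small illustration of the two static representations discussed above, the following sketch contrasts bag-of-words counting with one-hot encoding; the toy vocabulary and sentence are assumptions chosen for the example.

from collections import Counter

def bag_of_words(tokens, vocab):
    """Count-based representation: one dimension per vocabulary word."""
    counts = Counter(t for t in tokens if t in vocab)
    return [counts[w] for w in vocab]

def one_hot(token, vocab):
    """Each word gets a distinct standard basis vector, so word order can be kept."""
    return [1 if w == token else 0 for w in vocab]

vocab = ["flies", "like", "an", "arrow", "fruit", "time", "bananas"]
sent = ["time", "flies", "like", "an", "arrow"]

print(bag_of_words(sent, vocab))          # word order is lost
print([one_hot(t, vocab) for t in sent])  # a sequence of basis vectors keeps order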
In section Capabilities (1.1.4) we indicated the performance abilities of DNNs. However, their state-of-the-art prediction performance is not the measure of all things. Recital 71 of the European Union's General Data Protection Regulation (https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679) states:

"[decision-making based solely on automated processing] should be subject to suitable safeguards, which should include specific information to the data subject and the right [. . . ] to obtain an explanation of the decision reached after such assessment and to challenge the decision."

Its potential legal ramifications are discussed in Goodman and Flaxman (2017). This regulation clearly states that for real-world applications of DNNs explanations are necessary. Explanations as support for a decision by a black box is detailed in Lombrozo (2006): "explanations are [. . . ] the currency in which we exchanged beliefs". We will be using the term explainability (of DNNs) to refer to this field, not the more frequently used "Explainable Artificial Intelligence (XAI)" or "Interpretable Machine Learning". Both these terms indicate that there is something inherently explainable or interpretable about the models in deep learning. This may be the case in machine learning but cannot be assumed in general. Furthermore, it is very controversial whether the "Artificial Intelligence" moniker is a precise or helpful representation of the characteristics of deep learning (Jordan, 2019).

Several popular surveys of explainability exist. We summarize some of them to pigeonhole our work more precisely. Doshi-Velez and Kim (2017) point out that explanations fill an incompleteness in the problem that deep learning models work on. E.g., the objective given to the model during training may not have measured generalization performance adequately. This can be uncovered with explanations. Doshi-Velez and Kim (2017) divide [...] (see section Motivation (3.3.1)). However, we disagree that many insights from psychology can be easily transferred. Two main questions for explanations of black boxes are how to generate explanations from them and how to evaluate the faithfulness of their explanations. This removes assumptions generally held in psychological contexts.

There exist a variety of NLP-specific explanations. Li et al. (2016) present methods to view and test heatmaps for different recurrent networks. They investigate sentiment analysis by putting forward different investigative input. Alvarez-Melis and Jaakkola (2017) create an explanation graph between structured input and output. This is especially useful for sequence to sequence tasks like machine translation. Many state-of-the-art models use attention (Bahdanau et al., 2015; Vaswani et al., 2017). Some papers argue whether attention weights as explanations are permissible. Jain and Wallace (2019) show that attention weights do not necessarily agree with other explanations
and can also be distorted while not changing the prediction. Wiegreffe and Pinter (2019) disagree and show that under some circumstances these distortions lead to a significant decrease in performance. Synthesized, these papers argue that attention weights can be used as explanation if and only if the attention is vital for the model. There is also a variety of approaches that determine the quality of models by linguistic analysis (Linzen et al., 2016; McCoy et al., 2019).

We provide some examples of methods that do not belong to our category of explanations and will not be discussed further. Kim et al. (2018) analyze neural models by determining which concepts were important to a classification decision. Intrinsic model-specific explanations can be found in older machine learning approaches, such as decision trees, or in model architectures that provide explanations and predictions in parallel (Zhang et al., 2018). Ribeiro et al. (2016) explain by learning a local explainable model around the prediction of a neural model.

We will now focus on local post-hoc input space explanations. They describe how much a feature of one specific input contributed to a specific output (class). They give a real value called relevance for every input feature which can be used to create a saliency map (Simonyan et al., 2013). An example can be found in Table 4.1. The output can be the true label neuron, the predicted neuron or another output of choice. A positive relevance value indicates that the feature contributed positively to the given output, whereas a negative relevance value indicates that the feature distracted from the output. Input features can be the input values for the first layer, or a cluster of these. For NLP, e.g., all input values of a word (and punctuation mark) can be clustered. If the explanation method gives relevances for every input dimension, the relevances are aggregated such that every word receives one relevance value. This is often achieved by summing the relevances (Arras et al., 2017).

An important part of explainability is occlusion. It is either presented as an explanation method (Robnik-Šikonja and Kononenko, 2008; Zintgraf et al., 2017), where the difference in prediction when removing an input feature is seen as an indicator for the importance of this feature. Alternatively, it is used as evaluation of explanation methods (Zeiler and Fergus, 2014; Montavon et al., 2018), where it is argued that (other) explanation methods should find features whose occlusion changes the prediction significantly.
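A minimal sketch of occlusion as a local post-hoc explanation method may help here: the relevance of a word is the prediction difference when the word is removed. The function classify_fn is an assumed black-box classifier that maps a list of words to class probabilities.

def occlusion_relevances(words, class_idx, classify_fn):
    """Relevance of each word = prediction difference when that word is removed."""
    full_score = classify_fn(words)[class_idx]
    relevances = []
    for i in range(len(words)):
        occluded = words[:i] + words[i + 1:]  # delete the i-th word
        relevances.append(full_score - classify_fn(occluded)[class_idx])
    return relevances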
We return to this point in section On the Incompleteness of Evaluating Explanations (2.3). In general, most objective evaluation methods allow for a derivation of an explanation method which satisfies this evaluation perfectly. This has already been done in Kindermans et al. (2018), and when developing the theory for this thesis, it was originally intended as an evaluation method. However, we argue the standard for an evaluation method should be higher. It should not be considered as just an accessory to an explanation method to justify the explanations.

Explanations are a simplification of the model's decision process. They try to present the process in a human-understandable way. For complex models this is always an approximation. To strictly define some ground rules for this approximation, axioms for explanation methods are developed.
Axioms were proposed as a method to develop and test explanation methods by Sundararajan et al. (2017) and are extensively discussed in Lipton (2018). Additional axioms are proposed in Bach et al. (2015), Ancona et al. (2018), Kindermans et al. (2019) and Srinivas and Fleuret (2019). The advantage of axiomatic analysis is that objectives for explanation methods can be stated and discussed, and each explanation method can be evaluated against these objectives. This can also give context to tasks, models or settings where some explanation methods might be suitable or unsuitable. We will discuss later why axioms might be the most objective evaluation of explanation methods. In the following, we briefly discuss some important axioms and their intuition. (An extensive discussion of axioms and explanation methods was done in the Individual Module and will thus not be repeated.)

Completeness
The sum of the relevances of features of an input is equal to the prediction (Bach et al., 2015). This axiom stems from the intuition that a prediction of an input is a prediction of a composition of features.

Implementation Invariance
Two neural networks that are functionally equivalent, i.e. give the same output for all possible inputs, should receive the same relevances for every input (Sundararajan et al., 2017). This seems trivial but is important to state, as methods that work with internal weights of neural models do not necessarily comply.

Figure 2.1: Two pairs of example explanations of "good film , but very glum ." for positive and negative sentiment. The first explanations do not satisfy Class Zero-Sum, as the explanations are equal for both classes; it is unclear which token actually contributed to which class. The second explanations do satisfy Class Zero-Sum; it is identifiable which token contributed to which class and by how much.
Linearity
A network which is a linear combination of other networks should have explanations which are the same linear combination of the original networks' explanations (Sundararajan et al., 2017). This axiom states that an explanation method should be a linear function of neural networks given the input.
Sensitivity
An input feature should receive a non-zero relevance if and only if the prediction of the network depends on the feature (Sundararajan et al., 2017). This is a simple double check that features that are ignored by the network should not receive relevance and features that are important to the model should receive relevance.
Sensitivity-1
The relevance of an input variable should be the difference of prediction when the input variable is removed (Ancona et al., 2018). This could also be named the occlusion axiom. It basically restates the prediction difference formula (Robnik-Šikonja and Kononenko, 2008) which we will see later in Eq. (3.1).

We introduce a new axiom:
Class Zero-Sum
The sum of relevances of a feature over all classes is zero (Harbecke and Alt, 2020). The intuition behind this axiom is guided by the normalization of most classifiers. If the sum of the predictions is fixed, then every feature can only shift relevance between classes rather than add relevance to all of them. The Class Zero-Sum axiom is a contradiction to the Completeness axiom if the normalization of the classifier is not set to have a sum of zero. We argue that Completeness forces a method to assign undue relevance; e.g. in cases where there is no information detected in the input for any class, it does not make sense to distribute positive relevance over features.
Gradient-based explanation methods were introduced with Sensitivity Analysis (Baehrens et al., 2010; Simonyan et al., 2013). The intuition is that the gradient of the prediction at the input tells us which input dimensions can change the prediction the most and are thus most important for the prediction. As Sundararajan et al. (2017) showed, this can be misleading, as it only gives information about the function in the local neighborhood of the input.

In the following we call the distribution of data p_data, meaning p_data(x) is the probability of x appearing as a data point for a specific task. This is not constrained to a dataset but the evasive, more general distribution, i.e. "data that users would expect the systems to work well on" (Gorman and Bedrick, 2019). As discussed in section Training (1.1.3), in general, an input x in this case can be regarded as a coordinate vector over R. In practice this is almost always the case in NLP. Let us assume we have a practical limitation on the length of input text and, therefore, a maximum embedding dimension of n. We can make all other inputs have this dimensionality by padding them with zeros. Thus, we have a function f that maps text to a subset S of a coordinate space R^n. We will now discuss which properties this subset S has and why this is important.

Figure 2.2: Schematic illustration of a discrete data probability distribution in R^n.

S is a discrete set in R^n. A discrete set is a set where every element s ∈ S has a neighbourhood that does not contain any other point of S. There are a finite number of tokens with pairwise distinct embeddings and a finite input length. Thus, S is finite and the global minimum of distances between two elements of S is positive. Therefore, S is discrete.

p_data(x) for x ∈ R^n is a discrete probability distribution. For every point x ∈ R^n \ S we have p_data(x) = 0. We just saw that S ⊂ R^n is a discrete subset. Therefore, p_data is a discrete probability distribution, as can be seen schematically in Figure 2.2. Note that this is particularly different from image data: an image embedded as a point x in a coordinate space R^n entails a neighbourhood of small perturbations where the images still make sense. Consequently, the probability distribution of images embedded in R^n is continuous. This property of the probability distribution is fundamental to gradient-based explanation methods.

Gradient-based explanation methods analyze the change of prediction with respect to the input dimensions. Analyzing infinitesimal change in a function presumes that this change is meaningful. However, if the data probability is zero everywhere in a small neighbourhood of an input vector, these changes become meaningless, as the prediction function can never be confronted with other vectors from this neighbourhood. These vectors are automatically out of distribution, i.e. the prediction function is analyzed by evaluating a priori meaningless behaviour.

We thus argue that gradient-based explanation methods are not theoretically justified in NLP. Although, in some cases, especially if the function is well-regularized, local behaviour indicates global behaviour. This, however, cannot be assumed and needs to be investigated before using gradient methods.
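As a small, self-contained illustration of this argument (with random toy embeddings that are purely hypothetical), one can check that the set of embedded tokens has a positive minimum pairwise distance and is therefore discrete:

import numpy as np

# Toy static embeddings: a finite vocabulary mapped to distinct vectors.
rng = np.random.default_rng(0)
vocab = ["good", "film", "but", "very", "glum"]
vectors = np.stack([rng.normal(size=8) for _ in vocab])

# The minimum pairwise distance is positive, so around every embedded input
# there is a neighbourhood containing no other valid input.
dists = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)
min_dist = dists[~np.eye(len(vocab), dtype=bool)].min()
print(min_dist > 0)  # True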
The sole focus of an explanation method should be to relay information about a model to a user. Explanations exist to help understand the decision process of the model (Doshi-Velez et al., 2017). Thus, the correct explanation cannot be independent of the model.

Every experimental evaluation of explanation methods relies on a ground truth (property) that the explanation should have. For example, let us take the evaluation method where the explanation of a model is compared with some features that are identified with the help of experts (Mohseni and Ragan, 2018). These ground truths are independent of the model. First, neural models have achieved superhuman performance on several tasks, e.g. go (Silver et al., 2016) and chess (Silver et al., 2018), and grounding explanations by humans on these tasks cannot be considered helpful in understanding superhuman performance.
Second, it does not take into consideration that a model can be wrong and the explanations correct. The explanations of a model that made a false classification can point to completely different features than an expert would select. Third, a rigid scheme by humans does not take into account that predictive features can be missed even by experts, as neural models are powerful statistical learners (Sarle, 1994; Geirhos et al., 2018). On the contrary, explanation methods are especially useful in cases where the model uses artefacts, not human-intuitive features, for its decision.

It is unclear whether it is possible to distinguish between the theoretical foundations of explanation methods and their evaluations. The most prominent example of a method that is used for both is occlusion, which is discussed in the following section Occlusion (3.1). Since the model is the only ground truth and explanations are a simplification, it is highly probable that there is more than one sensible explanation for an input classified by a model. Thus, for both explanations and evaluation, constraints are established that reduce the number of explanations. If these constraints are explicitly stated they can be regarded as axioms. Note that axioms can be used both to develop and to test explanation methods. All in all, the need for somewhat subjective constraints makes the existence of a general evaluation of explanation methods unlikely.

We do not argue that it is impossible to evaluate an explanation method. The performance in sensibly selected evaluations does probably correlate to the quality of an explanation method. Measuring the correlation to sensible explanation methods can also be seen as a quality assessment. We will use and discuss this approach in section Tasks (4.1.2). Asserting which axioms an explanation method fulfills is an important step towards evaluating its validity. We argue that the explanation method for one's use case should be motivated by the paradigms that the method fulfills. Furthermore, there are sanity checks (Adebayo et al., 2018) that can determine whether an explanation method has undesired properties. Experimental evaluation of a method can provide guidance for selecting an explanation method. It becomes more valuable the closer the setting of the evaluation is to the setting where the explanation is needed.
To build a theoretically solid explanation method we combine a technique related to the Sensitivity-1 axiom with a method that considers likelihood in NLP. We introduce these techniques, occlusion and language modeling, before synthesizing them into a new method.

We lay out the theory of occlusion and discuss its advantages and disadvantages. Occlusion was introduced under the name
Occlusion Sensitivity (Zeiler and Fergus, 2014) as a method to detect whether an explanation method detects important features of an input for a neural network, by evaluating inputs with occluded features detected by the explanation method. Conversely, Robnik-Šikonja and Kononenko (2008) introduce the same technique as a local post-hoc input space explanation method. It measures the importance of features of an input for a classifier. This is done by comparing the predictions of the classifier with and without the feature. We refer to this method as occlusion, too. A feature can be a word or sentence in NLP classification or a pixel or larger patch of an image in image classification. Occlusion is a true black-box method. It only uses the predictions of a model, no internal representations or even structural information about the model.

For an input x we take a feature x_i. To determine the relevance of x_i we consider the input x_\i without this feature. This is an incomplete input, as we do not know which values to set as replacement for x_i. A simple approach is to set all the values of x_i to zero. In NLP, it is possible to delete words or replace them with the unknown token. The relevance is then the prediction difference

r_{f,c}(x_i) = f_c(x) - f_c(x_{\setminus i})   (3.1)

A language model takes tokenized text, e.g. split by SentencePiece (Kudo and Richardson, 2018), as input and predicts one or several tokens that are missing. These missing tokens can be following the original input or masked among the input. The labels of these tokens are one-hot vectors. In a way, these models assign likelihood to a given text (Brown et al., 1992). Let us have a text T that is split into tokens (t_1, ..., t_n) = T. If we have a language model p_LM that is able to make predictions of the form p_LM(t_1) and p_LM(t_{i+1} | (t_1, ..., t_i)) then we get
a probability and score for the whole text:

p_{LM}(T) = p_{LM}((t_1, \dots, t_n)) = p_{LM}(t_1) \prod_{i=1}^{n-1} p_{LM}(t_{i+1} \mid (t_1, \dots, t_i))

PP_{p_{LM}}(T) := p_{LM}(T)^{-\frac{1}{n}}   (3.2)

PP_{p_LM}(T) is called the perplexity of the corpus. The lower this score, the higher the probability that the language model assigned to the corpus and thus, through a Bayesian argument, the better the language model.

Not all language models are trained to predict text from scratch. E.g., Devlin et al. (2019) mask 15% of words in a sentence and predict those. This does not lead to a model that can be measured by giving the perplexity of a corpus. Nevertheless, all these models create an embedding of the input in every hidden layer that can be used for other tasks. It is a priori unclear whether these representations are an improvement on the static embedding because the primary goal of the DNN is not to create a better representation of the input for other classification tasks. However, experimental results from Howard and Ruder (2018), Peters et al. (2018) and Devlin et al. (2019) suggest that using embeddings from a language model is an improvement on static embeddings. The success of these models in various tasks mentioned in section Natural Language Processing (1.2) has been coined "NLP's ImageNet moment" (Ruder, 2018). Devlin et al. (2019) argue "Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems."

In our experiments, we only use language models for the simple task of predicting one missing word or punctuation mark. This is an area where masked language models should excel.
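A minimal sketch of Eq. (3.2), assuming the per-token conditional probabilities p_LM(t_1), p_LM(t_2 | t_1), ... have already been obtained from some language model; computing in log space avoids numerical underflow for long texts.

import math

def perplexity(token_probs):
    """Perplexity of a text from the model's per-token conditional probabilities,
    as in Eq. (3.2): PP = p_LM(T)^(-1/n)."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)  # log p_LM(T)
    return math.exp(-log_prob / n)

# Example: a model that assigns probability 0.25 to each of four tokens.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0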
The main idea of this thesis is combining Occlusion with Language Modeling (OLM). Instead of leaving out tokens, we want to replace them with sensible options that only remove information but not structure. This disallows the prediction network to consider a changed structure of the input. It is now forced to consider the change in information. To get a good overview of this change we sample plenty of possible replacements and compute a weighted average of the prediction results. Due to the law of large numbers (Bernoulli, 1713) we consider this average an accurate approximation of all possibilities.
Counterfactual thinking is a psychological concept which states that humans think of past and future events by looking at alternatives: "What if [. . . ]?" (Kahneman and Tversky, 1981; Roese, 1997). This enables us to evaluate past actions in another way than just looking at the outcome. For statistical models we are able to actually evaluate counterfactuals without needing to guess about the outcome. Occlusion can be seen as an objective measure to do this by changing small parts of the input and considering "What if" this part of the input is different.

These approaches are closely linked to perturbation-based explanation methods. In contrast, these methods do not presume a set of features of which at least one is occluded, but figure out which features create the largest change in prediction if they are missing or different (Fong and Vedaldi, 2017; Wachter et al., 2017). They highlight the synergy of local adversarials and explanations. Ribeiro et al. (2016) propose a variant of this by using small perturbations to learn a local linear model that resembles the original model and is interpretable. We argue that all these methods fail to consider data likelihood. Furthermore, Ilyas et al. (2019) show that uninterpretable adversarials are an inherent feature of almost all neural models, which questions their usefulness as explanations.

Norm theory (Kahneman and Miller, 1986) states that we choose counterfactuals depending on how easy they are to imagine. This is at least related to how likely these alternatives are. Rational imagination theory (Byrne, 2007) argues explicitly that we choose probable alternatives to reality when evaluating outcomes. These alternatives can either lead to more positive or negative results (Roese and Olson, 2014). We consider these psychological intuitions because they are important to our own understanding of explanations. Although, as we argued in section On the Incompleteness of Evaluating Explanations (2.3), explanations should not be evaluated by human intuition, it should be clear how the results of the methods are to be interpreted.
We start from the difference of probabilities formula in Eq. (3.1). We reinterpret x_\i in the following way. Instead of considering removing the feature x_i, we rather remove the information provided by this feature to the model. This gives a very intuitive information-theoretic question that our method answers: What additional information does this feature give to the model for classification? For this we have to consider which inputs are how likely, given that the rest of the input is preserved. In the abstract this gives

f_c(x_{\setminus i}) = \sum_{\hat{x}_i} p_{\text{data}}(\hat{x}_i \mid x_{\setminus i}) \, f_c(x_{\setminus i}, \hat{x}_i)   (3.3)

with p_data being the data probability over a defined space containing the inputs, as defined in section Gradient-based Explanation Methods in NLP (2.2). To use this formula in NLP we approximate the data distribution p_data with a language model:

p_{\text{data}}(\hat{x}_i \mid x_{\setminus i}) \approx p_{LM}(\hat{x}_i \mid x_{\setminus i})   (3.4)

In practice this means we mask a word, which may consist of several tokens, or a punctuation mark and resample it with a given language model. Other inputs and features are possible but could increase the approximation error, e.g. by accumulating it over several words. The language model does not have the information of the original word but all information of the context. This can lead to cases where the original word is predicted with a very high probability. We argue that in these cases the additional information provided by this word was negligible.

The big advantage of this approach is that the structure of the original input is preserved by only picking replacements that seem likely to a language model. It gives the formula

f_c(x_{\setminus i}) :\approx \sum_{\hat{x}_i} p_{LM}(\hat{x}_i \mid x_{\setminus i}) \, f_c(x_{\setminus i}, \hat{x}_i)   (3.5)

for x_\i. There are several practical difficulties that this approximation brings. First, it is only valid if all data is supposed to be grammatical. A language model assigns higher probability to replacement tokens that produce a grammatical sentence. We will investigate this effect with the Corpus of Linguistic Acceptability dataset in section CoLA Corner Case (4.2). Second, replacements sometimes make much more sense if another part of the input is changed. For example, a prepositional verb determines the preposition it appears alongside with, and the preposition can be seen as information of the verb. This preposition, however, selects which verbs can be sampled as replacement. Third, the length of the replacement is a limiting factor. There are cases when the replacement should be allowed to contain more or fewer tokens than the original. Additionally, some masked language models, which predict sub-word tokens frequently, may not sample a whole word. This distorts the resampled input. The last two problems can be interpreted as not giving the language model enough freedom by forcing it to make exactly one token replacement. We elaborate possible alleviations of these difficulties in section Future Work (5.2). In the following, we refer to the unit we want to resample as token because, disregarding practical concerns, it is arbitrary in our method.

Combined with Eq. (3.1) we set

r_{f,c}(x_i) := f_c(x) - \sum_{\hat{x}_i} p_{LM}(\hat{x}_i \mid x_{\setminus i}) \, f_c(x_{\setminus i}, \hat{x}_i)   (3.6)

This establishes a new method that can determine what effect tokens have on a prediction. They can either have a positive or negative relevance, depending on whether the original prediction is greater than the averaged prediction after resampling.
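The following is a rough sketch of Eq. (3.6) with a masked language model from the huggingface transformers library, which the experiments in this thesis also use. It is not the thesis's implementation: classify_fn is an assumed black-box classifier returning class probabilities for a list of words, and instead of drawing k samples as in the experiments, the expectation is approximated here by a renormalized top-k weighted sum for brevity.

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
lm = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def olm_relevance(words, i, class_idx, classify_fn, top_k=20):
    """OLM relevance of words[i]: original prediction minus the language-model-
    weighted average prediction over resampled replacements (Eq. 3.6)."""
    original_score = classify_fn(words)[class_idx]

    # Mask the i-th word and let the language model propose replacements.
    masked = words[:i] + [tokenizer.mask_token] + words[i + 1:]
    inputs = tokenizer(" ".join(masked), return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    with torch.no_grad():
        logits = lm(**inputs).logits[0, mask_pos]
    probs = torch.softmax(logits, dim=-1)

    # Keep the top-k replacements and renormalize their probabilities.
    top_probs, top_ids = probs.topk(top_k)
    top_probs = top_probs / top_probs.sum()

    expected_score = 0.0
    for p, token_id in zip(top_probs.tolist(), top_ids.tolist()):
        replacement = tokenizer.decode([token_id]).strip()
        resampled = words[:i] + [replacement] + words[i + 1:]
        expected_score += p * classify_fn(resampled)[class_idx]

    return original_score - expected_score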
We will show that OLM satisfies Class Zero-Sum, Implementation Invariance, Sensitivity-1 and Linearity. Let f be a neural network that takes an element of the input space X and predicts a probability distribution over classes C, i.e.,

f: X \to \mathbb{R}^{|C|}, \qquad f_c(x) \ge 0 \;\; \forall c \in C, x \in X, \qquad \sum_{c \in C} f_c(x) = 1 \;\; \forall x \in X   (3.7)

Let x_i be an indexed feature of an input x. We denote the relevance given to this feature regarding model f and class c by our method OLM with r_{f,c}(x_i).

OLM satisfies
Class Zero-Sum. Intuitively, if the input with the resampled token increases the prediction for one class, it has to decrease the predictions of other classes, and vice versa. We have

\sum_{c \in C} r_{f,c}(x_i)
  \overset{(3.6)}{=} \sum_{c \in C} \Big( f_c(x) - \sum_{\hat{x}_i} p_{LM}(\hat{x}_i \mid x_{\setminus i}) f_c(x_{\setminus i}, \hat{x}_i) \Big)
  = \sum_{c \in C} f_c(x) - \sum_{\hat{x}_i} p_{LM}(\hat{x}_i \mid x_{\setminus i}) \sum_{c \in C} f_c(x_{\setminus i}, \hat{x}_i)
  \overset{(3.7)}{=} 1 - \sum_{\hat{x}_i} p_{LM}(\hat{x}_i \mid x_{\setminus i}) = 0   (3.8)

Thus, OLM satisfies Class Zero-Sum. From this it follows that it does not satisfy Completeness.

OLM satisfies
Implementation Invariance. OLM is a black-box method and only evaluates the function of the neural network. It does not regard the parameters θ. Assume we have θ ≠ θ′ and

f_\theta(x) = f_{\theta'}(x) \quad \forall x \in X   (3.9)

Then we get

r_{f_\theta, c}(x_i) = f_{\theta,c}(x) - \sum_{\hat{x}_i} p_{LM}(\hat{x}_i \mid x_{\setminus i}) f_{\theta,c}(x_{\setminus i}, \hat{x}_i)
  = f_{\theta',c}(x) - \sum_{\hat{x}_i} p_{LM}(\hat{x}_i \mid x_{\setminus i}) f_{\theta',c}(x_{\setminus i}, \hat{x}_i)
  = r_{f_{\theta'}, c}(x_i)   (3.10)

Thus, OLM satisfies
Implementation Invariance.

OLM satisfies
Sensitivity-1. OLM is defined as an occlusion method, so it necessarily provides the difference of prediction when an input variable is occluded. Equation (3.6) is based on Eq. (3.1).
OLM satisfies
Linearity. Let f = \sum_{j=1}^{n} \alpha_j g^j be a linear combination of models. Then we have

r_{f,c}(x_i) = f_c(x) - \sum_{\hat{x}_i} p_{LM}(\hat{x}_i \mid x_{\setminus i}) f_c(x_{\setminus i}, \hat{x}_i)
  = \sum_{j=1}^{n} \alpha_j g^j_c(x) - \sum_{\hat{x}_i} p_{LM}(\hat{x}_i \mid x_{\setminus i}) \sum_{j=1}^{n} \alpha_j g^j_c(x_{\setminus i}, \hat{x}_i)
  = \sum_{j=1}^{n} \alpha_j r_{g^j,c}(x_i)   (3.11)

It can be of additional interest to determine to which input features the model is most sensitive. Previously, we measured the mean difference between the model prediction and the resampled predictions. As a measure for sensitivity we suggest taking the standard deviation of the resampled predictions. This measures how varied the predictions are for one token position, given the rest of the input but regardless of the original token. With previous notation, we suggest for sensitivity s:

s_{f,c}(x_i) := \sqrt{ \sum_{\hat{x}_i} p_{LM}(\hat{x}_i \mid x_{\setminus i}) \big( f_c(x_{\setminus i}, \hat{x}_i) - \mu \big)^2 }   (3.12)

where µ is the mean of the resampled predictions from Eq. (3.5). We do not suggest this as a relevance measure because, as previously mentioned, it is independent of the input feature x_i. Rather, this measure suggests which token positions the model is sensitive to. OLM and OLM-S measure the mean and standard deviation of predictions with resampled tokens.
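A short sketch of the sensitivity measure in Eq. (3.12), assuming the language-model weights and the corresponding classifier scores for the resampled inputs have already been computed (for example by the OLM sketch above):

import math

def olm_s(weights, scores):
    """OLM-S sensitivity (Eq. 3.12): weighted standard deviation of the
    classifier scores obtained for the resampled inputs."""
    mean = sum(w * s for w, s in zip(weights, scores))
    var = sum(w * (s - mean) ** 2 for w, s in zip(weights, scores))
    return math.sqrt(var)

# Example: three replacements with language-model weights and classifier scores.
print(olm_s([0.6, 0.3, 0.1], [0.9, 0.2, 0.5]))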
We perform several experiments to investigate the explanations generated by OLM. Example explanations can be found in Table 4.1. These experiments highlight some practical peculiarities of our explanation method. For comparison with other methods we also display the results of OLM-S. These experiments cannot be comprehensive (see section On the Incompleteness of Evaluating Explanations (2.3)) and there is no standard benchmark. Thus, we compare to other explanation methods and conduct experiments in areas that may present corner cases to
OLM.

Table 4.1: Explanations of the input "good film , but very glum ." as saliency maps over the tokens, with the maximum relevance value per method. OLM and OLM-S give relevance or sensitivity to both clauses. The other occlusion-based methods give almost all relevance to "good". Gradient-based methods give most relevance to the second clause. Resamples that OLM used multiple times can be found in Table 4.2. OLM gives positive relevance to "glum" because some alternatives are predicted with a much lower probability for positive sentiment.

Method                 Max. value
OLM                    0.57
OLM-S                  0.48
Delete                 0.98
UNK                    0.98
Sensitivity Analysis   35
Gradient*Input         0.041
Integrated Gradients   0.96
The computational cost of the experiments is dominated by the resampling predictions required by OLM. For every token we resample k times. Let us assume the distribution of these tokens follows a variation of Zipf's Law (Estoup, 1916; Zipf, 1949) with α > 1. This gives O(α√k) different samples per token. Furthermore, we resample each of n tokens in an input. Thus, for a single input we have O(nα√k) predictions. An investigation of the effect of different language models on the explanations of different classification models should be done but requires vast resources. Additionally, we fix the models to compare results across different tasks and datasets. To this end, we also only investigate explanations of the true label neuron. Some explanation methods do not necessarily treat different classes differently, as alluded to in Figure 2.1. We try to remove this effect by focusing only on the most important class.

For OLM and OLM-S we use BERT_BASE (Devlin et al., 2019) as a language model and choose words (and punctuation marks) as units for resampling. Resampling is computationally expensive but the quality of the samples is very important. We also want a language model that does not frequently produce sub-word tokens. BERT uses WordPiece (Wu et al., 2016) which does have sub-word tokens but mostly predicts whole words. This is viable for single word resampling, especially compared to many other masked language models. Thus, we choose
BERT_BASE as a low-resource compromise of a well-fitting state-of-the-art language model to analyze the method over datasets. An example of the samples is shown in Table 4.2. We point out that in general our approach is language model agnostic. For generating single input explanations, not analyzing a dataset, we suggest using a collection of the best well-fitting language models available. For classification we use different variations of RoBERTa (Liu et al., 2019b) that we describe in section Tasks (4.1.2). All models were originally published at https://github.com/pytorch/fairseq/tree/master/examples/roberta. We use the implementation and pre-trained models from https://github.com/huggingface/transformers. Experiments are available at https://github.com/harbecke/xbert.

Table 4.2: Resamples that OLM used multiple times for the input "good film , but very glum .". Each entry gives a sampled token with (number of samples, predicted probability of positive sentiment); the first line lists the most frequent resample for each of the seven input tokens in order.
good (34, 0.98), looking (11, 0.96), , (84, 0.98), but (87, 0.98), not (22, 1), bad (26, 0.003), . (100, 0.98)
nice (10, 0.46), news (5, 0.018), art (2, 0.79), and (3, 1), still (10, 0.87), short (11, 1)
great (3, 0.41), idea (4, 0.0011), quality (2, 1), not (2, 1), very (10, 0.98), old (5, 1)
fine (3, 0.27), taste (4, 0.94), though (2, 1), also (6, 1), thin (5, 1)
classic (3, 0.93), morning (3, 0.0064), too (5, 1), dull (5, 0.058)
interesting (3, 0.085), try (3, 0.0026), always (4, 0.89), slow (3, 1)
lovely (3, 0.99), job (3, 0.99), never (3, 1), boring (3, 0.0059)
strong (2, 1), work (2, 0.99), sometimes (2, 1), small (2, 1)
bad (2, 8e-05), thing (2, 0.0041), quite (2, 0.98), dark (2, 1)
fun (2, 0.87), plan (2, 0.0017), slightly (2, 1), expensive (2, 1)
funny (2, 0.55), lord (2, 0.007), damn (2, 0.018)
excellent (2, 0.88), question (2, 0.0012)
wonderful (2, 0.93), walk (2, 0.22)
decent (2, 0.48), answer (2, 0.011)
scary (2, 0.001), thoughts (2, 0.07)
advice (2, 0.0052)
mood (2, 0.99)
timing (2, 0.2)

First, we compare the relevances produced by
OLM to those of other explanation methods. The main focus of this experiment is to evaluate how large the differences to other explanations are. It could be assumed that the explanations of basic occlusion are very similar to OLM explanations. If this were shown by experiments, it would make the theoretical benefits of OLM superfluous. E.g., in theory, the language model of a state-of-the-art classifier could understand an obviously missing word. It could treat this similar to our method by internally representing it as missing and deriving a prediction from that. In the same vein, we investigate how much the explanation methods that use gradients differ from occlusion-based methods to see if theoretical difficulties manifest.

We calculate the correlation of explanation methods on tasks in the following way. Let r^1_x and r^2_x be the ordered relevances of an element x ∈ X of dataset X for methods 1 and 2. With corr being the Pearson correlation coefficient for samples (Pearson, 1895), we set the correlation of two methods over dataset X to

\frac{1}{n} \sum_{i=1}^{n} \operatorname{corr}(r^1_{x_i}, r^2_{x_i})   (4.1)

This means two methods are perfectly positively correlated if and only if they produce scaled relevances with possibly different positive scaling for each input. A sketch of this measure is given after the method descriptions below.

We compare our explanations with two baselines based on occlusion (see Eq. (3.1)). The simplest variation is removing the word of interest and not replacing it. We call this method
Delete; it was first used in NLP by Li et al. (2016). Similarly, we replace the word of interest with the unknown token <UNK>. This approach is more tailored to state-of-the-art classifiers pre-trained with masked language modeling. (This experiment already appears in Harbecke and Alt (2020). The baseline methods were selected by me. Christoph Alt selected and conducted the experiments. Phrasing and analysis exceeding the publication is mine.)

Furthermore, we compare explanations to three gradient-based methods. All these methods provide relevances for every dimension of the input. To receive relevances on word level we sum over the dimensions for each word (Arras et al., 2017). The simplest one is the absolute value of the gradients and is called
Sensitivity Analysis (Simonyan et al., 2013). Note that this method only provides non-negative relevances. It is especially comparable to
OLM-S which also provides a non-negative sensitivity of the model.
Input*Gradient (Shrikumar et al., 2017) is self-explaining. Every input gets multiplied with its gradient. Finally, we compare our explanations to
Integrated Gradients (Sundararajan et al., 2017). This is the integration of the gradient of the prediction function along a straight path from a baseline, usually the zero vector, to the input vector, multiplied by the path length.
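The following is a minimal sketch of the correlation measure in Eq. (4.1); the relevance lists are hypothetical and only illustrate the input format (one relevance vector per input, per method).

import numpy as np

def method_correlation(relevances_1, relevances_2):
    """Dataset-level correlation of two explanation methods (Eq. 4.1):
    the Pearson correlation of their per-input relevance vectors,
    averaged over all inputs of the dataset."""
    per_input = [np.corrcoef(r1, r2)[0, 1]
                 for r1, r2 in zip(relevances_1, relevances_2)]
    return float(np.mean(per_input))

# Hypothetical relevances of two methods for two inputs of different lengths.
method_a = [[0.5, -0.2, 0.1], [0.3, 0.3, -0.4, 0.0]]
method_b = [[0.4, -0.1, 0.2], [0.1, 0.5, -0.2, 0.1]]
print(method_correlation(method_a, method_b))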
We select three NLP classification tasks. An input contains one or two sentences or phrases for all tasks. Each task focuses on one specific aspect of language understanding. All tasks are part of the GLUE benchmark (Wang et al., 2017) which does not publish test sets. Therefore, we report results on the development set which we do not use for model optimization.
Multi-Genre Natural Language Inference Corpus (MNLI) by Williams et al. (2018) is a natural language inference corpus. A data point consists of two sentences that may have a relation to each other.
• If the second sentence is a sensible successor to the first in content, this pair gets the entailment label.
• If the content of the sentences does not relate to each other, the label is neutral.
• If the sentences are in disagreement they get a contradiction label.
The corpus contains more than 400,000 samples.

We use a RoBERTa_LARGE model which is already fine-tuned on MNLI. It achieves an accuracy of 90.2% on the development set, which is two percentage points behind the state-of-the-art T5 (Raffel et al., 2019). McCoy et al. (2019) show that even though these models perform around the human baseline for this task, they fail to generalize for a variety of rare constructions. Correlations of the explanation methods for MNLI can be found in Table 4.3.
Table 4.3: Correlation of explanation methods (OLM, OLM-S, Delete, UNK, Sensitivity Analysis, Gradient*Input, Integrated Gradients) on MNLI.
Stanford Sentiment Treebank (SST) by Socher et al. (2013) is a sentiment classification dataset. It contains 70,000 sentences from movies with either a negative or positive connotation. The SST-2 version only consists of binary classification with positive and negative sentiment. Sentiment analysis is an easy task to interpret explanations on if the explanation method assumes that features cannot contribute to both classes (see Figure 2.1 and Eq. (3.8)). An input feature contributes as much to the positive sentiment as it detracts from the negative sentiment and vice versa. Therefore, the explanation method assigns each feature positive or negative sentiment.

We fine-tune a pre-trained RoBERTa_BASE. This model achieves an accuracy of 94.5% on the development set, which is 3 percentage points lower than multiple state-of-the-art models, including T5 (Raffel et al., 2019). Correlations of the explanation methods for SST-2 can be found in Table 4.4.
Corpus of Linguistic Acceptability (CoLA) by Warstadt et al. (2019) is adataset with sentences labeled by their grammatical acceptability. It containsmore than 10,000 sentences which are annotated as either acceptable orunacceptable. We will elaborate on the specifics of this task for explanationmethods in section
CoLA Corner Case (4.2).
[Table 4.4: Correlations of the explanation methods on SST-2. Occlusion methods: OLM, OLM-S, Delete (Del), UNK; gradient methods: Sensitivity Analysis (Sen), Input*Gradient (G*I), Integrated Gradients (IG). Numeric values omitted. The correlation between OLM and other occlusion methods is a little lower; in contrast, the correlation between OLM-S and other occlusion methods is a little higher.]
Analogous to SST-2, we fine-tune RoBERTa-base and achieve a phi coefficient (Yule, 1912; also misnomered the Matthews correlation coefficient) of 0.613 on the development set. StructBERT (Wang et al., 2019b) achieves a phi coefficient of 0.753. Correlations of the explanation methods for CoLA can be found in Table 4.5.
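For reference, the phi coefficient can be computed from the binary confusion matrix as sketched below; the counts are placeholders, not our experimental results.

    import math

    def phi_coefficient(tp, fp, fn, tn):
        # Phi coefficient (Yule, 1912), often reported as the Matthews correlation coefficient.
        numerator = tp * tn - fp * fn
        denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        return numerator / denominator if denominator else 0.0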
[Table 4.5: Correlations of the explanation methods on CoLA. Occlusion methods: OLM, OLM-S, Delete (Del), UNK; gradient methods: Sensitivity Analysis (Sen), Input*Gradient (G*I), Integrated Gradients (IG). Numeric values omitted. The correlation between OLM, OLM-S and the other occlusion methods (in bold) is much lower than on the other two tasks. This indicates that this dataset could be a corner case for our method. Correlation between other methods is also lower, but to a smaller extent.]

Tables 4.3, 4.4 and 4.5 show the correlation of all tested explanation methods. Overall, there is positive correlation between almost all methods. No methods produce equivalent relevances; even the correlation between
OLM and
OLM-S is never close to 1. The three occlusion-based relevance methods
OLM, Delete and
UNK have consistently high correlation on MNLI and SST-2 but much lower correlation on CoLA. We draw the following conclusions.
OLM produces significantly different explanations than other occlusion methods.
Delete and
UNK have a higher correlation with each other for all three tasks than with
OLM, which is evidence that it stands out from other occlusion methods. The theoretical differences between these methods seem to manifest experimentally.
OLM-S has about as much correlation with other occlusion methods as with
Sensitivity Analysis. This can be seen as experimental validation that it is a sensitivity method with a somewhat different intent than a relevance method. The correlation between gradient-based methods and other methods is low across all tasks. This could indicate that gradient methods do not capture the discrete nature of NLP (see section
Gradient-based Explanation Methods in NLP (2.2)). Note that we use correlation in our argument in two seemingly conflicting ways. If the correlation of explanations is close to 0, they are independent of each other. It is unlikely that methods with independent results can both give very good explanations. Furthermore, if the correlation of explanations is close to 1, the methods are redundant. They might have theoretical differences, but these at least did not manifest in the experiments. On the whole, we think different sensible explanation methods should have a positive correlation 0 ≪ c ≪
1. This appears to be the case for
OLM and most compared explanation methods.
Table 4.1 shows example explanations for the input "good film , but very glum ." from the SST-2 dataset. Table 4.2 shows the sampling examples that
OLM used to give relevance to the words. The maximum value is easy to interpret for occlusion-based methods, as it indicates the change in prediction if the feature with the maximum value is occluded.
OLM and
OLM-S give lower relevances to "good" because alternatives sometimes lead to a positive classification as well. Interestingly, alternatives to "film" more often lead to a
negative classification. This indicates that the meaning or strength of "good" depends on the following noun. Surprisingly, "glum" also receives a positive relevance from
OLM. There are alternatives, especially "bad", that change the prediction to negative. Furthermore, the classification gives positive sentiment a probability of 98%, so no negative word had a big effect in the original sentence.
For two out of five words, the original word was the most frequent sample from the language model. Additionally, the comma was resampled 84% of the time and the period every time. This indicates that the information of these tokens was not completely lost, as they are made very likely by the context. However, the structure "good [...] but [...] glum" cannot be adequately evaluated by resampling only one word, which is a limitation of OLM. It is possible to argue that two of these words already determine the sentiment of the third.
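The resampling that produces alternatives such as those in Table 4.2 can be sketched as follows. The masked language model and the sentiment pipeline below are publicly available stand-ins, not the fine-tuned RoBERTa-base classifier used in our experiments; the word position and the number of candidates are illustrative.

    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="roberta-base")
    sentiment = pipeline("sentiment-analysis")   # stand-in classifier

    def resample_word(words, position, top_k=10):
        # Mask one word and let the masked language model propose likely replacements.
        masked = words.copy()
        masked[position] = fill_mask.tokenizer.mask_token
        return fill_mask(" ".join(masked), top_k=top_k)

    words = "good film , but very glum .".split()
    for candidate in resample_word(words, position=5):        # position 5 is "glum"
        prediction = sentiment(candidate["sequence"])[0]
        print(candidate["token_str"], round(candidate["score"], 3),
              prediction["label"], round(prediction["score"], 3))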
CoLA is an interesting edge case for our method. In the CoLA dataset there are inputs that are grammatically acceptable and inputs that are unacceptable. The task of a model is to find out to which category the input belongs. The language model in
OLM mostly saw acceptable sentences during training; thus it can be assumed that
OLM tries to resample such that grammatically acceptable input appears whenever possible. This can be seen as flawed, because on average the resampling should lead to a similar class distribution as the original dataset. Therefore, in theory,
OLM is able to identify important words if replacements make the sentence acceptable, because it assigns relevance if the prediction of the model changes from ungrammatical to grammatical. However, it does not identify the opposite case, where an input is grammatical only because of a specific construction.
OLM would not modify the input to be unacceptable. Since the assumption p_data ≈ p_LM in Eq. (3.4) is not necessarily true, the resampling is not faithful to the data.
To investigate this, we try to answer the following question: How much do the explanations for grammatically acceptable and unacceptable sentences differ? We hypothesize that the explanations for unacceptable inputs have higher values on average.
First, we have to set the classes on equal footing. We only use explanations of inputs that were predicted correctly by the RoBERTa-base model from section
Correlation of Explanation Methods (4.1) and with a probability p of at least 0.9. The relevances produced by OLM are always in the range [p − 1, p], as for all occlusion-based methods (see Eq. (3.1)). This ensures that all explanations fall into the range [−1, 1]. We have 165 sentences labeled as unacceptable and 678 labeled as acceptable with probability p ≥ 0.9. We compare the classes with Welch's t-test (Welch, 1947), which enables sample sizes and variances to be different for both classes. We can ignore that the samples are not from a normal distribution because of the large sample size (Kendall, 1951). Table 4.6 shows the results of these tests. The null hypothesis of the averages being equal can be rejected for all comparisons because all three tests are highly significant.
We feel obliged to mention that this is not a conclusive test. Features of inputs of different classes need not have equivalent properties. For acceptability datasets, there are many more possible constructions for unacceptable sentences than for acceptable sentences. Thus, we must not expect a symmetry between the explanations of different classes.
The other two occlusion methods
Delete and
UNK find more relevance in the acceptable sentences. This is likely due to them perturbing grammatically acceptable input such that it is not acceptable. The
Delete method removes words, the
UNK method replaces them with the
<UNK> token; both perturbations are likely to break the syntax of a sentence. This confirms that our method does differ significantly from other occlusion-based methods.

TABLE 4.6: Average, sum and maximum of the relevances per sentence for correctly classified sentences with p ≥ 0.9. Welch's t-test was performed to compare the means and yielded a p-value < 0.001.

relevance aggregation     Avg.      Sum      Max
unacceptable sentence     0.275     1.89     0.893
acceptable sentence       0.0384    0.304    0.172
p-value                   <0.001    <0.001   <0.001

TABLE 4.7: Randomly selected example explanations (maximum relevance per sentence).

id   relevances                                              max value
1    John paid me against the book .                         0.99
2    The person confessed responsible .                      1
3    Medea tried the nurse to poison her children .          0.92
4    to die is no fun .                                      0.49
5    This teacher is a genius .                              0.056
6    Soaring temperatures are predicted for this weekend .   0.08

TABLE 4.8: Examples of resampled sentences with the classifier's prediction.

id   orig. word     new sentence                                       prediction
1    paid           John pushed me against the book .                  4e-4
1    paid           John pressed me against the book .                 3.6e-4
1    me             John paid damages against the book .               4.3e-4
1    me             John paid taxes against the book .                 7.9e-4
1    against        John paid me for the book .                        3.7e-4
2    confessed      The person was responsible .                       4.4e-4
2    confessed      The person is responsible .                        4.4e-4
2    responsible    The person confessed himself .                     4e-4
3    tried          Medea orders the nurse to poison her children .    3.5e-4
3    nurse          Medea tried the same to poison her children .      1.1e-3
4    die            to eat is no fun .                                 0.089

We show some randomly selected examples of the explanations in Table 4.7. The sentences correctly classified as unacceptable all have words that can be replaced to make the sentence acceptable. The sentences correctly classified as acceptable have few words with high relevance, so the language model rarely created sentences that the classifier considered unacceptable.
Examples of these resampled sentences can be found in Table 4.8. We take this as evidence that the language model is able to construct grammatical sentences even under adverse circumstances. The inspection of this dataset with
OLM can be a useful analysis. However, the approximation in Eq. (3.4) of the language model distribution approximating the data distribution does not hold up. The relevances of words in unacceptable sentences are amplified because the language model tries to choose words that build an acceptable sentence. We consider the results of
OLM on this task meaningful, but not unconditionally. This task shows that investigating the approximation, or adapting the language model to the classification task, is worth considering.
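The statistical comparison described in this section can be sketched as follows; aggregating absolute relevances per sentence and the function names are assumptions, while the test itself is Welch's t-test as described above.

    import numpy as np
    from scipy.stats import ttest_ind

    def aggregate(relevance_vectors):
        # One relevance vector per sentence; aggregate to average, sum and maximum.
        absolute = [np.abs(np.asarray(r)) for r in relevance_vectors]
        return {
            "avg": np.array([r.mean() for r in absolute]),
            "sum": np.array([r.sum() for r in absolute]),
            "max": np.array([r.max() for r in absolute]),
        }

    def compare_classes(unacceptable, acceptable):
        agg_u, agg_a = aggregate(unacceptable), aggregate(acceptable)
        # equal_var=False yields Welch's t-test, which allows different
        # sample sizes and variances for the two classes.
        return {key: ttest_ind(agg_u[key], agg_a[key], equal_var=False).pvalue
                for key in agg_u}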
We conclude with an experiment that resembles CoLA but targets a specific linguistic aspect and allows more model introspection. Kann et al. (2019) introduce the Frames and Alternations of Verbs Acceptability (FAVA) dataset. It contains constructed sentences around verb properties that yield acceptable and unacceptable sentences. Most of these sentences come in pairs of an acceptable and an unacceptable sentence, which are variations of each other, and at least one of the two uses the given verb frame. This allows a direct comparison that exceeds the possibilities of CoLA. We only select these pairs for evaluation. Example sentences and frames can be found in Table 4.9.

TABLE 4.9: Example explanations of sentence pairs for each syntactic verb frame alternation (maximum relevance per sentence).

verb frame              relevances                                            max value
causative-inchoative    the chapter edited .                                  0.49
                        david edited the chapter .                            0.18
spray-load              michael poured the bucket with the soup .             0.85
                        michael poured the soup into the bucket .             0.028
there-insertion         there agreed with the politician a protester .        0.082
                        a protester agreed with the politician .              0.15
understood-object       kelly joked david .                                   0.84
                        kelly and david joked .                               0.42
dative                  nicole proclaimed the greatest athlete to rebecca .   0.43
                        nicole proclaimed rebecca the greatest athlete .      0.25

In this case we do not fine-tune a model on the dataset but use the RoBERTa-base model from
TextAttack fine-tuned on CoLA. (The model was originally published at https://github.com/QData/TextAttack . We used the version from the transformers package, https://github.com/huggingface/transformers .) Since this model was trained on another dataset, this allows us to evaluate on the train, development and test set. We do this to retrieve enough data. This dataset version encompasses 6466 single sentences or 3233 pairs. The accuracy of the model on our class-balanced version of the dataset is 84.3%. The best model in the original paper (Kann et al., 2019) achieved an accuracy of 85.5% on the test set of the whole dataset.
This dataset is split into five types of syntactic verb frame alternations. Verbs can be seen as the central part of speech for this task, as constructions are built around the fact that they work for some verbs but not for all. Thus, we are interested whether verbs have a higher relevance than other words. We perform significance tests as in section Statistical Analysis of Explanations (4.2.1).

[Table 4.10: Relevance of verbs compared to other words for acceptable and unacceptable sentences with p ≥ 0.9; numeric values omitted. Welch's t-test was performed to compare the means and yielded a p-value < 0.001.]

Table 4.10 shows statistical significance for both acceptable and unacceptable sentences. For unacceptable sentences the relevance of verbs is more than three times as large as for the average word. For acceptable sentences the relevance is more than two times as large. This shows that
OLM identifies verbs as important words in this task, which is centered on verbs.
In Table 4.9 we show an example of the explanation of sentence pairs for each of the syntactic verb frame alternations. For all verb frames the verb gets the highest relevance in at least one of the two paired sentences. In only one of the verb frames does the verb have the highest relevance in both paired sentences. As expected, the relevances of the unacceptable sentences are higher than the relevances of the acceptable sentences.
We also investigate how the relevances of verbs in sentence pairs are correlated. We already saw that the relevances are higher for words of unacceptable sentences. Figure 4.1 provides a visualization of the relevances of sentence pairs. The null hypothesis of the relevances not being correlated is rejected; the correlation coefficient r is negative.

[Figure 4.1: Relevances of verbs in sentence pairs for sentences with p ≥ 0.9; relevances lie in [−1, 1].]

Few verbs have high relevance in both sentences of a sentence pair. This can have two possible reasons. If the classification model does not contextualize a verb completely, the verb as a feature itself influences the model prediction. This would mean an imbalance for the resampling predictions. Also, the verbs are unlikely to be special cases of both the unacceptable and the acceptable sentence. Resampling is more likely to change the prediction if the construction around the verb only does or does not work for few samples.
All in all, these experiments provide evidence that OLM is a useful tool for analyzing black-box models. At the least, it provides easily interpretable explanations. We identify language modeling as a standout feature of
OLM that makes its explanations vastly different from those of other methods. The intuitive theory of
OLM allows for evaluation and introspection of relevances even for corner cases of the method.
We summarize the topics and central arguments of this thesis. Furthermore, we point out weaknesses of our approach and how they could be alleviated. Last, we elaborate on some possible future work.
In the first chapter of this thesis we provide the necessary theoretical background. We introduce Deep Learning by motivating interest in the topic, mainly through its achieved results. Additionally, we discuss practical aspects, such as the architectures of deep neural networks and how to train them. We round out this section by pointing to capabilities of neural networks and how they are important to this thesis.
We introduce the topic of explainability of neural networks. We summarize central papers and surveys of this topic. The focus lies on theoretical aspects of explainability. This is done partly by showing work on axioms for explanation methods. In this context we provide our first contribution, a novel axiom. It is guided by the principle that a feature of an input to a normalized prediction function contributes as much to a set of classes as it detracts from the complementary classes.
The second contribution is a theoretical argument against gradient-based explanation methods in natural language processing. We show that input in NLP is discrete and thus the data likelihood distribution is discrete. This inhibits the functionality of gradients when analyzing the prediction function. We contrast this with the likelihood function in vision.
Furthermore, we contribute another theoretical argument to explainability. We discuss why a general evaluation of explanation methods is unlikely to exist. The main argument is that the ground truth for the explanation is only held in the model, which is exactly what an explanation method is trying to
extract. Since this can only be done by approximation, we point out that there is a false dichotomy between the evaluative rules for this approximation and the explanation method that fulfills them.
The central methods chapter introduces our main contribution, a novel explanation method. It consists of the combination of two existing methods, Occlusion and Language Modeling, and is coined OLM. Occlusion is a technique in explainability that is used either to explain black-box models or to evaluate explanations. It can be seen as incomplete because it does not determine how to replace occluded features. For this replacement we propose using language modeling in NLP. We show that language modeling is especially suited to sample replacements: language models excel at this task because their language understanding is the foundation of state-of-the-art models in NLP.
We motivate
OLM psychologically and through information theory. These arguments lead us to selecting and evaluating likely alternatives to explain features of the original input. The formula for
OLM is achieved by inserting sampling with a language model into the difference-of-probabilities formula. We analyze our method by going over the axioms that we introduced in the preceding chapter. We provide proofs of compliance. Since the formula for
OLM is a weighted mean of predictions, we add the standard deviation as an additional measure,
OLM-S. This measure is intended as a sensitivity method, not a relevance method. (A short sketch of this aggregation follows at the end of this summary.)
We provide an experimental evaluation of our method. As alluded to in the second chapter, we do not consider a complete evaluation, like a single benchmark, possible. We try to compare our method to existing methods with a correlation experiment. We work under the assumption that sensible methods correlate, but when introducing a novel method it is also important to see that there is no perfect correlation. Both aspects are shown over three tasks. The results on one task, CoLA, point to a possible problem of
OLM. The second experiment is a deeper dive into the relevances of our method on CoLA. Linguistic acceptability is a context where it is wrong to assume that the distribution of a language model is similar to the distribution of the data. We show that this manifests in the explanations. We discuss how to interpret these explanations and how they make sense once the context is understood. The introspection with
OLM is possibly deeper than with other relevance methods.
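As announced above, here is a minimal sketch of the weighted-mean reading of OLM and of OLM-S. The names `p_original` (class probability on the unmodified input) and `candidates` (pairs of language model probability and class probability for the resampled inputs) are illustrative; the exact formulas are given in the methods chapter.

    import math

    def olm_relevance(p_original, candidates):
        # OLM: difference between the original prediction and the
        # language-model-weighted mean over resampled inputs.
        total = sum(weight for weight, _ in candidates)
        weighted_mean = sum(weight * p for weight, p in candidates) / total
        return p_original - weighted_mean

    def olm_s(candidates):
        # OLM-S: weighted standard deviation of the resampled predictions.
        total = sum(weight for weight, _ in candidates)
        mean = sum(weight * p for weight, p in candidates) / total
        variance = sum(weight * (p - mean) ** 2 for weight, p in candidates) / total
        return math.sqrt(variance)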
We describe some of the shortcomings, unexplored areas and possible ideas to extend this work. All identified weaknesses concern the practical application of our method.
There are three main problems with our approach. The major problem we identified during method development and experiments is the approximation of the data with a language model. While this is certainly valid frequently, we saw that there are exceptions. A language model can always provide syntactically correct data which does not make sense for the task. Of all possible sentences, most sentences are not suitable for a dataset because they do not indicate a class. A language model alone has no mechanism to detect this.
One possibility to improve the selection of resampled tokens would be to train a generative adversarial network (Goodfellow et al., 2014) to detect whether an input is part of the given dataset. The discriminator of this model could determine whether proposed resampled tokens create a likely input and also assign a probability. This probability could be used on its own or combined with the language model probability to select resampled tokens. This would ensure that the resampled data fits the task (a sketch of this idea follows at the end of this section).
A second problem of our method is that we resample exactly one token in a token's previous place. If we choose words as replacement units, this prevents the use of language models that frequently produce sub-word tokens. It would be possible to use beam search to allow them to build a word from
these sub-word tokens. Additionally, if we want to change the position, we would not know where to position a replacement. This is problematic in several ways. In general, if we take out the information of one word, then there are other sensible replacements that do not consist of just one word. Furthermore, if the replaced word forces a syntactic structure in the input that is not determined by any of the other inputs, it could be sensible to allow this structure to change.
In an information-theoretic sense, the removal of the information of one token places the sentence into another set of possibilities that contains the information of all other tokens, but not necessarily in the exact phrasing as before. This is the same phenomenon that humans experience when thinking of formulating a sentence and changing the structure at the last moment because the last word fits the new structure better. To allow this, we need an architecture that is invariant to rephrasing. It would be possible to use an encoder-decoder architecture to encode the information of the sentence without the tokens of interest. Then we have a representation of the sentence that can be used to generate the replacement sentence. However, it would be difficult to detect whether there is information missing that needs to be sampled in the encoded state. E.g., if a strong adjective is removed in a sentiment classification task, how do we produce a resampled sentence with sentiment?
Another interesting aspect is choosing other features as replacement units. For tasks with many sentences it could be interesting to measure the effect of sentences as features. However, we consider it unlikely that resampled sentences produce vastly different results than removed sentences. Syntax is no longer an issue; only context for other sentences may be important.
The third practical difficulty is the dependency of our approach on a language model. Even state-of-the-art language models differ from the true likelihood of language data. This problem is unlikely to be resolved practically, as language models will presumably not become perfect in the future. Our method depends on automatic generation of replacements. This complicates resampling multiple features at the same time, as the approximation errors are likely to accumulate.
All mentioned practical points concern the possibilities and difficulties of generating good comparison samples. This thesis is an attempt towards
OLM to NLP is an adequate future research direction.6
Bibliography
Amina Adadi and Mohammed Berrada. Peeking inside the black-box: Asurvey on explainable artificial intelligence (xai).
IEEE Access , 6:52138–52160, 2018.Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, MoritzHardt, and Been Kim. Sanity checks for saliency maps. In
Advances inNeural Information Processing Systems , pages 9525–9536, 2018.David Alvarez-Melis and Tommi Jaakkola. A causal framework for ex-plaining the predictions of black-box sequence-to-sequence models. In
Proceedings of the 2017 Conference on Empirical Methods in Natural LanguageProcessing , pages 412–421, 2017.Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross. Towardsbetter understanding of gradient-based attribution methods for deep neu-ral networks.
International Conference on Learning Representations , 2018.Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, andWojciech Samek. "What is relevant in a text document?": An interpretablemachine learning approach.
PloS one , 12(8):e0181142, 2017.Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen,Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations fornon-linear classifier decisions by layer-wise relevance propagation.
PloSone , 10(7):e0130140, 2015.David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe,Katja Hansen, and Klaus-Robert Müller. How to explain individual classi-fication decisions.
Journal of Machine Learning Research , 11(Jun):1803–1831,2010.Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machinetranslation by jointly learning to align and translate.
International Conferenceon Learning Representations , 2015.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model.
Journal of Machine Learning Research, 3:1137–1155, 2003.
Jakob Bernoulli.
Ars conjectandi . Impensis Thurnisiorum, fratrum, 1713.Bernd Bohnet, Ryan McDonald, Gonçalo Simões, Daniel Andor, Emily Pitler,and Joshua Maynez. Morphosyntactic tagging with a meta-bilstm modelover context sensitive token encodings. In
Proceedings of the 56th AnnualMeeting of the Association for Computational Linguistics (Volume 1: LongPapers) , pages 2642–2652, 2018.Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. En-riching word vectors with subword information.
Transactions of the Associa-tion for Computational Linguistics , 5:135–146, 2017.Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In
Advances in neural information processing systems , pages 161–168, 2008.Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, Jennifer C.Lai, and Robert L. Mercer. An estimate of an upper bound for the entropyof english.
Computational Linguistics , 18(1):31–40, 1992.Ruth M. J. Byrne.
The rational imagination: How people create alternatives toreality . MIT press, 2007.Augustin Cauchy. Méthode générale pour la résolution des systemesd’équations simultanées.
Comptes rendus hebdomadaires des séances del’Académie des Sciences , 25:536–538, 1847.Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous,and Yann LeCun. The loss surfaces of multilayer networks. In
Artificial intelligence and statistics, pages 192–204, 2015.
Dan Ciregan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3642–3649. IEEE, 2012.
Li Deng and Dong Yu. Deep learning: methods and applications.
Foundations and Trends® in Signal Processing, 7(3–4):197–387, 2014.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT:Pre-training of deep bidirectional transformers for language understand-ing. In
Proceedings of the 2019 Conference of the North American Chapterof the Association for Computational Linguistics: Human Language Technolo-gies, Volume 1 (Long and Short Papers) , pages 4171–4186. Association forComputational Linguistics, 2019.Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretablemachine learning.
CoRR , arXiv:1702.08608, 2017.Finale Doshi-Velez, Mason Kortz, Ryan Budish, Chris Bavitz, Sam Gersh-man, David O’Brien, Stuart Schieber, James Waldo, David Weinberger,and Alexandra Wood. Accountability of ai under the law: The role ofexplanation.
CoRR , arXiv:1711.01134, 2017.Jean-Baptiste Estoup.
Gammes sténographiques: méthode et exercices pourl’acquisition de la vitesse . Institut sténographique, 1916.Ruth C. Fong and Andrea Vedaldi. Interpretable explanations of blackboxes by meaningful perturbation. In
Proceedings of the IEEE InternationalConference on Computer Vision , pages 3429–3437, 2017.Robert Geirhos, Carlos R. M. Temme, Jonas Rauber, Heiko H. Schütt, MatthiasBethge, and Felix A. Wichmann. Generalisation in humans and deepneural networks. In
Advances in neural information processing systems , pages7538–7550, 2018.Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generativeadversarial nets. In
Advances in neural information processing systems , pages2672–2680, 2014.Bryce Goodman and Seth Flaxman. European union regulations on algo-rithmic decision-making and a “right to explanation”.
AI magazine , 38(3):50–57, 2017.Kyle Gorman and Steven Bedrick. We need to talk about standard splits.In
Proceedings of the 57th annual meeting of the association for computationallinguistics , pages 2786–2791, 2019.
Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models.
ACM computing surveys (CSUR) , 51(5):1–42, 2018.Boris Hanin. Universal function approximation by deep neural nets withbounded width and relu activations.
CoRR , arXiv:1708.02691, 2017.David Harbecke and Christoph Alt. Considering likelihood in NLP classifi-cation explanations with occlusion and language modeling. In
Proceedingsof the 58th Annual Meeting of the Association for Computational Linguistics:Student Research Workshop , pages 111–117. Association for ComputationalLinguistics, 2020.Zellig S. Harris. Distributional structure.
Word , 10(2-3):146–162, 1954.John Hewitt and Christopher D. Manning. A structural probe for findingsyntax in word representations. In
Proceedings of the 2019 Conference ofthe North American Chapter of the Association for Computational Linguistics:Human Language Technologies, Volume 1 (Long and Short Papers) , pages 4129–4138, 2019.Jeremy Howard and Sebastian Ruder. Universal language model fine-tuningfor text classification. In
Proceedings of the 56th Annual Meeting of the Associ-ation for Computational Linguistics (Volume 1: Long Papers) , pages 328–339,2018.Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, BrandonTran, and Aleksander Madry. Adversarial examples are not bugs, theyare features. In
Advances in Neural Information Processing Systems , pages125–136, 2019.Sarthak Jain and Byron C. Wallace. Attention is not Explanation. In
Proceed-ings of the 2019 Conference of the North American Chapter of the Association forComputational Linguistics: Human Language Technologies, Volume 1 (Long andShort Papers) , pages 3543–3556, 2019.Michael I. Jordan. Artificial intelligence—the revolution hasn’t happenedyet.
Harvard Data Science Review , 2019.Daniel Kahneman and Dale T. Miller. Norm theory: Comparing reality to itsalternatives.
Psychological review, 93(2):136, 1986.
Daniel Kahneman and Amos Tversky. The simulation heuristic. Technicalreport, Stanford University, CA, Department of Psychology, 1981.Katharina Kann, Alex Warstadt, Adina Williams, and Samuel Bowman. Verbargument structure alternations in word and sentence embeddings. In
Proceedings of the Society for Computation in Linguistics (SCiL) 2019 , pages287–297, 2019.Maurice George Kendall.
The Advanced Theory of Statistics: Vol: II.
CharlesGriffin and Co., Ltd., London, 1951.Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fer-nanda Viegas, and Rory Sayres. Interpretability beyond feature attribution:Quantitative testing with concept activation vectors (tcav). In
InternationalConference on Machine Learning , pages 2673–2682, 2018.Yoon Kim. Convolutional neural networks for sentence classification. In
Proceedings of the 2014 Conference on Empirical Methods in Natural LanguageProcessing (EMNLP) , pages 1746–1751. Association for Computational Lin-guistics, 2014.Pieter-Jan Kindermans, Kristof T. Schütt, Maximilian Alber, Klaus-RobertMüller, Dumitru Erhan, Been Kim, and Sven Dähne. Learning how toexplain neural networks: Patternnet and patternattribution.
InternationalConference on Learning Representations , 2018.Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber,Kristof T. Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. The (un)reliability of saliency methods. In
Explainable AI: Interpreting, Explainingand Visualizing Deep Learning , pages 267–280. Springer, 2019.Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimiza-tion.
International Conference on Learning Representations , 2014.Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classi-fication with deep convolutional neural networks. In
Advances in NeuralInformation Processing Systems , pages 1097–1105, 2012.Taku Kudo and John Richardson. Sentencepiece: A simple and languageindependent subword tokenizer and detokenizer for neural text processing.In
Proceedings of the 2018 Conference on Empirical Methods in Natural LanguageProcessing: System Demonstrations , pages 66–71, 2018.
Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in NLP. In
Proceedings of the 2016 Conferenceof the North American Chapter of the Association for Computational Linguistics:Human Language Technologies , pages 681–691, 2016.Seppo Linnainmaa. The representation of the cumulative rounding error ofan algorithm as a taylor expansion of the local rounding errors.
Master’sThesis (in Finnish), University of Helsinki , 1970.Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. Assessing the ability oflstms to learn syntax-sensitive dependencies.
Transactions of the Associationfor Computational Linguistics , 4:521–535, 2016.Zachary C. Lipton. The mythos of model interpretability.
Queue , 16(3):31–57,2018.Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, andNoah A. Smith. Linguistic knowledge and transferability of contextualrepresentations. In
Proceedings of the 2019 Conference of the North AmericanChapter of the Association for Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long and Short Papers) , pages 1073–1094, 2019a.Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta:A robustly optimized BERT pretraining approach.
CoRR , arXiv:1907.11692,2019b.Tania Lombrozo. The structure and function of explanations.
Trends incognitive sciences , 10(10):464–470, 2006.Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. Theexpressive power of neural networks: A view from the width. In
Advancesin Neural Information Processing Systems , pages 6231–6239, 2017.Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y.Ng, and Christopher Potts. Learning word vectors for sentiment analysis.In
Proceedings of the 49th Annual Meeting of the Association for ComputationalLinguistics: Human Language Technologies , pages 142–150, 2011.Christopher D. Manning and Hinrich Schütze.
Foundations of statistical natural language processing. MIT Press, 1999.
Mitchell Marcus et al. Treebank-3 LDC99T42, 1999. Philadelphia: LinguisticData Consortium.Tom McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Di-agnosing syntactic heuristics in natural language inference. In
Proceedingsof the 57th Annual Meeting of the Association for Computational Linguistics ,pages 3428–3448, 2019.Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient esti-mation of word representations in vector space.
CoRR , arXiv:1301.3781,2013.Tim Miller. Explanation in artificial intelligence: Insights from the socialsciences.
Artificial Intelligence , 267:1–38, 2019.Sina Mohseni and Eric D. Ragan. A human-grounded evaluation benchmarkfor local explanations of machine learning.
CoRR , arXiv:1801.05075, 2018.Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Methodsfor Interpreting and Understanding Deep Neural Networks.
Digital SignalProcessing , 73:1–15, 2018.Isaac Newton.
The method of fluxions and infinite series. Henry Woodfall, 1736.
Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015.
Douglas B. Paul and Janet M. Baker. The design for the wall street journal-based CSR corpus. In
Proceedings of the workshop on Speech and NaturalLanguage , pages 357–362. Association for Computational Linguistics, 1992.Karl Pearson. VII. Note on regression and inheritance in the case of twoparents.
Proceedings of the Royal Society of London , 58(347-352):240–242, 1895.Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove:Global vectors for word representation. In
Proceedings of the 2014 conferenceon empirical methods in natural language processing (EMNLP) , pages 1532–1543, 2014.Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, ChristopherClark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word
representations. In
Proceedings of the 2018 Conference of the North AmericanChapter of the Association for Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long Papers) , pages 2227–2237, 2018.Steven T. Piantadosi. Zipf’s word frequency law in natural language: Acritical review and future directions.
Psychonomic bulletin & review , 21(5):1112–1130, 2014.Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vi-mal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. Purelysequence-trained neural networks for asr based on lattice-free mmi. In
Interspeech , pages 2751–2755, 2016.Ning Qian. On the momentum term in gradient descent learning algorithms.
Neural networks , 12(1):145–151, 1999.Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring thelimits of transfer learning with a unified text-to-text transformer.
CoRR ,arXiv:1910.10683, 2019.Joseph Raphson.
Analysis Aequationum Universalis . London, 1690.Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trustyou?: Explaining the predictions of any classifier. In
Proceedings of the 22ndACM SIGKDD international conference on knowledge discovery and data mining ,pages 1135–1144. ACM, 2016.Herbert Robbins and Sutton Monro. A stochastic approximation method.
The annals of mathematical statistics , pages 400–407, 1951.Marko Robnik-Šikonja and Igor Kononenko. Explaining classifications forindividual instances.
IEEE Transactions on Knowledge and Data Engineering ,20(5):589–600, 2008.Neal J. Roese. Counterfactual thinking.
Psychological bulletin , 121(1):133, 1997.Neal J. Roese and James M. Olson.
What might have been: The social psychologyof counterfactual thinking . Psychology Press, 2014.Sebastian Ruder. Nlp’s imagenet moment has arrived.
The Gradient, 2018.
David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learningrepresentations by back-propagating errors. nature , 323(6088):533–536,1986a.David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learninginternal representations by error propagation. In D. E. Rumelhart andJ. L. McClelland, editors,
Parallel distributed processing: explorations in themicrostructure of cognition , volume 1, chapter 8, pages 318–362. MIT Press,1986b.Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model forautomatic indexing.
Communications of the ACM , 18(11):613–620, 1975.Warren S. Sarle. Neural networks and statistical models. In
Proceedings of theNineteenth Annual SAS Users Group International Conference , 1994.Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning impor-tant features through propagating activation differences. In
InternationalConference on Machine Learning , pages 3145–3153, 2017.Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neuralnetworks via information.
CoRR , arXiv:1703.00810, 2017.David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre,George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, VedaPanneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, JohnNham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, MadeleineLeach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Master-ing the game of go with deep neural networks and tree search. nature , 529(7587):484, 2016.David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou,Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Ku-maran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and DemisHassabis. A general reinforcement learning algorithm that masters chess,shogi, and go through self-play.
Science , 362(6419):1140–1144, 2018.Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside con-volutional networks: Visualising image classification models and saliencymaps.
CoRR , arXiv:1312.6034, 2013.
Thomas Simpson.
Essays on Several Curious and Useful Subjects, in Speculativeand Mix’d Mathematicks . H. Woodfall, 1740.Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D.Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep modelsfor semantic compositionality over a sentiment treebank. In
Proceedings ofthe 2013 conference on empirical methods in natural language processing , pages1631–1642, 2013.Suraj Srinivas and François Fleuret. Full-gradient representation for neuralnetwork visualization. In
Advances in Neural Information Processing Systems ,pages 4126–4135, 2019.Daniel Strigl, Klaus Kofler, and Stefan Podlipnig. Performance and scalabilityof gpu-based convolutional neural networks. In , pages 317–324.IEEE, 2010.Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic Attributionfor Deep Networks. In
International Conference on Machine Learning , pages3319–3328, 2017.Gabriel Synnaeve, Qiantong Xu, Jacob Kahn, Edouard Grave, TatianaLikhomanenko, Vineel Pratap, Anuroop Sriram, Vitaliy Liptchinsky, andRonan Collobert. End-to-end asr: from supervised to semi-supervisedlearning with modern architectures.
CoRR, arXiv:1911.08460, 2019.
Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2015.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In
Advances in Neural Information Processing Systems , pages 5998–6008,2017.Ellen M. Voorhees and Dawn M. Tice. Building a question answering test col-lection. In
Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 200–207, 2000.
Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual expla-nations without opening the black box: Automated decisions and the gdpr.
Harv. JL & Tech. , 31:841, 2017.Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, andSamuel R. Bowman. GLUE: A multi-task benchmark and analysis platformfor natural language understanding.
International Conference on LearningRepresentations , 2017.Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, JulianMichael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: Astickier benchmark for general-purpose language understanding systems.In
Advances in Neural Information Processing Systems , pages 3266–3280,2019a.Wei Wang, Bin Bi, Ming Yan, Chen Wu, Jiangnan Xia, Zuyi Bao, Liwei Peng,and Luo Si. Structbert: Incorporating language structures into pre-trainingfor deep language understanding. In
International Conference on LearningRepresentations , 2019b.Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. Neural networkacceptability judgments.
Transactions of the Association for ComputationalLinguistics , 7:625–641, 2019.Bernard L. Welch. The generalization of student’s’ problem when severaldifferent population variances are involved.
Biometrika , 34(1/2):28–35,1947.Sarah Wiegreffe and Yuval Pinter. Attention is not not explanation. In
Proceedings of the 2019 Conference on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Conference on Natural LanguageProcessing (EMNLP-IJCNLP) , pages 11–20, 2019.John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. Charagram:Embedding words and sentences via character n-grams. In
Proceedingsof the 2016 Conference on Empirical Methods in Natural Language Processing ,pages 1504–1515, 2016.Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coveragechallenge corpus for sentence understanding through inference. In
Pro-ceedings of the 2018 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, 2018.
D. Randall Wilson and Tony R. Martinez. The general inefficiency of batch training for gradient descent learning. Neural networks, 16(10):1429–1451, 2003.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's neural machine translation system: Bridging the gap between human and machine translation.
CoRR , arXiv:1609.08144,2016.Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhut-dinov, and Quoc V. Le. Xlnet: Generalized autoregressive pretrainingfor language understanding. In
Advances in neural information processingsystems , pages 5754–5764, 2019.G. Udny Yule. On the methods of measuring association between twoattributes.
Journal of the Royal Statistical Society , 75(6):579–652, 1912.Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolu-tional networks. In
European conference on computer vision , pages 818–833.Springer, 2014.Quanshi Zhang, Ying Nian Wu, and Song-Chun Zhu. Interpretable convolu-tional neural networks. In
Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition , pages 8827–8836, 2018.Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutionalnetworks for text classification. In
Advances in neural information processing systems, pages 649–657, 2015.
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In
Proceedings of the IEEE international conference on computer vision , pages19–27, 2015.Luisa M. Zintgraf, Taco S. Cohen, Tameem Adel, and Max Welling. Vi-sualizing deep neural network decisions: Prediction difference analysis.
International Conference on Learning Representations , 2017.George Kingsley Zipf.