Probing Classifiers: Promises, Shortcomings, and Advances
Squib
Yonatan Belinkov ∗ Technion – Israel Institute of Technology
Probing classifiers have emerged as one of the prominent methodologies for interpreting and analyzing deep neural network models of natural language processing. The basic idea is simple: a classifier is trained to predict some linguistic property from a model's representations, and this setup has been used to examine a wide variety of models and properties. However, recent studies have demonstrated various methodological weaknesses of this approach. This article critically reviews the probing classifiers framework, highlighting shortcomings, improvements, and alternative approaches.
1. Introduction
The opaqueness of deep neural network models of natural language processing (NLP) has spurred a line of research into interpreting and analyzing such models. Analysis methods may aim to answer questions about the structure of a model or its decisions. For instance, we might want to ask which parts of a neural model are responsible for certain linguistic properties, or which parts of the input led the model to make a certain decision. A common methodology for answering questions about the structure of models is to associate internal representations with external properties, by training a classifier on said representations to predict a given property. This framework, known as probing classifiers, has emerged as a prominent analysis strategy in many studies on NLP models.

Despite its apparent success, the probing classifiers paradigm is not without limitations. Critiques have been made of comparative baselines, metrics, the nature of the classifier, and the correlational nature of the method. In this short article, we first define the probing classifiers framework, taking care to consider the various components involved. Then we summarize the framework's shortcomings, as well as improvements and alternative approaches. This article provides a roadmap for NLP researchers who wish to examine probing classifiers more critically and highlights areas in need of additional research.

∗ CS Taub Building 733, Technion, Haifa 3200003, Israel. E-mail: [email protected]

For an overview of analysis methods in NLP, see the survey by Belinkov and Glass (2019) as well as the tutorials by Belinkov, Gehrmann, and Pavlick (2020) and Wallace, Gardner, and Singh (2020). For an overview of explanation methods in particular, see the survey by Danilevsky et al. (2020).

© 2016 Association for Computational Linguistics

Computational Linguistics Volume 1, Number 1
2. The Probing Classifiers Framework
On the surface, the idea of probing classifiers seems straightforward. We take a model that was trained on some task, such as a language model. We generate representations using the model, and we train another classifier that takes the representations and predicts some property. If the classifier performs well, we say that the model has learned information relevant for the property. However, upon closer inspection, it turns out that there is much more involved here. To see this, we now define this framework a bit more formally.

Let us denote by f : x → y a model that maps input x to output y. We call this model the original model. It is trained on some annotated dataset D_O = {x^(i), y^(i)}, which we refer to as the original dataset. Its performance is evaluated by some measure, denoted PERF(f, D_O). The function f is typically a deep neural network that generates intermediate representations of x; for example, f_l(x) may denote the representation of x at layer l of f. A probing classifier g : f_l(x) → z maps intermediate representations to some property z, which is typically some linguistic feature of interest. As a concrete example, f might be a sentiment analysis model, mapping a text x to a sentiment label y, while g might be a classifier mapping intermediate representations f_l(x) to part-of-speech tags z. The classifier g is trained and evaluated on some annotated dataset D_P = {x^(i), z^(i)}, and some performance measure PERF(g, f, D_O, D_P) (e.g., accuracy) is reported. Note that the performance measure depends on the probing classifier g and the probing dataset D_P, as well as on the original model f and the original dataset D_O.

From an information-theoretic perspective, training the probing classifier g can be seen as maximizing the mutual information between the intermediate representations f_l(x) and the property z (Belinkov 2018, p. 42; Pimentel et al. 2020b; Zhu and Rudzicz 2020), which we write I(z; h), where z is a random variable ranging over properties z and h is a random variable ranging over representations f_l(x).

The above careful definition of the probing classifiers framework reveals that it comprises multiple concepts and components, depicted in Figure 1. The choice of each such component, and the interactions between them, lead to non-trivial questions regarding the design and implementation of any probing classifier experiment. Before we turn to these considerations in Section 4, we briefly review some history and promises of probing classifiers in the next section.

Figure 1
Basic components comprising the probing classifiers framework: the original task x → y, the original dataset D_O = {x^(i), y^(i)}, the original model f : x → y and its performance on the original task PERF(f, D_O); the representations f_l(x) of x from f, the probing task f_l(x) → z, the probing dataset D_P = {x^(i), z^(i)}, the probing classifier g : f_l(x) → z, and the probing performance PERF(g, f, D_O, D_P). While most work analyzes representations of x, one could use the same framework to study other model components, such as attention weights (Clark et al. 2019). We use f_l(x) to refer more generally to any intermediate output of f when applied to x.
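To make the moving parts concrete, the following minimal sketch trains a probing classifier on synthetic data. The matrix H stands in for frozen representations f_l(x) that would, in a real experiment, come from a trained NLP model, and the binary labels z stand in for a linguistic property; the data-generating setup and all names here are illustrative, not taken from any particular probing study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for intermediate representations f_l(x) of a frozen model f:
# 1,000 "tokens", each with a 64-dimensional representation.
n, d = 1000, 64
H = rng.normal(size=(n, d))

# Stand-in for a linguistic property z (e.g., a binary POS distinction),
# noisily encoded in the representations by construction.
w = rng.normal(size=d)
z = (H @ w + 0.5 * rng.normal(size=n) > 0).astype(int)

# The probing classifier g : f_l(x) -> z, trained on D_P while f stays frozen.
H_train, H_test, z_train, z_test = train_test_split(H, z, random_state=0)
g = LogisticRegression(max_iter=1000).fit(H_train, z_train)

# PERF(g, f, D_O, D_P): probing accuracy on held-out probing data.
perf = g.score(H_test, z_test)
print(f"probing accuracy: {perf:.2f}")
```

A high accuracy here is typically read as evidence that the property is encoded in the representations, though Section 4 discusses why that reading needs care.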
3. Promises
Perhaps the first studies that can be cast in the framework of probing classifiers are by Köhn (2015) and Gupta et al. (2015), who trained classifiers on static word embeddings to predict various morphological, syntactic, and semantic properties. Other early work classified hidden states of a recurrent neural network into morpho-syntactic properties (Shi, Padhi, and Knight 2016). The framework took a more stable form in the work of several groups who studied sentence embeddings (Ettinger, Elgohary, and Resnik 2016; Adi et al. 2017; Conneau et al. 2018) and recurrent/recursive neural networks (Belinkov et al. 2017a; Hupkes, Veldhoen, and Zuidema 2018). The same idea had been concurrently proposed for investigating computer vision models (Alain and Bengio 2016). A plethora of work ensued, applying this framework to various models f and properties z. See Belinkov and Glass (2019) for a comprehensive survey up to early 2019. Since then, the community has taken a more critical look at the methodology, to which we turn now.
4. Shortcomings and Alternatives
This section reviews several limitations of the probing classifiers framework, as well as existing proposals for addressing them. We discuss comparisons and controls, how to choose the probing classifier, which causal claims can be made, the difference between datasets and tasks, and the need to define the probed properties.
4.1 Comparisons and Controls

Suppose we run a probing classifier experiment and obtain performance of PERF(g, f, D_O, D_P) = 87. Is that a high or a low number? What should we compare it to? We will denote a baseline model with f̲ and an upper bound or skyline model with f̄.

Some studies compare with majority baselines (Belinkov et al. 2017a; Conneau et al. 2018) or with classifiers trained on representations that are thought to be simpler than what the original model f produces, such as static word embeddings (Belinkov et al. 2017a). Others advocate for random baselines, training the classifier g on a randomized version of f (Conneau et al. 2018; Zhang and Bowman 2018; Chrupała, Higy, and Alishahi 2020). These studies show that even random features capture significant information that can be decoded by the probing classifier, so performance on learned features should be viewed in such a perspective.

On the other side, some studies compare the probing performance PERF(g, f, D_O, D_P) to skylines or upper bounds f̄, in an attempt to provide a point of comparison for how far the probing performance is from a possible performance on the task of mapping x → z. Examples include estimating human performance (Conneau et al. 2018), reporting state-of-the-art performance from the literature (Liu et al. 2019), or training a dedicated model to predict property z from input x, without restricting it to (frozen) representations from f (Belinkov et al. 2017b).

Others have proposed to design controls for possible confounders. Hewitt and Liang (2019) observe that the probing performance PERF(g, f, D_O, D_P) may tell us more about the probe g than about the model f. The probe g may memorize information from D_P, rather than evaluate information found in the representations f_l(x).

There have also been numerous other studies using the probing classifier framework as is, which will not be discussed here. For a partial list, see https://github.com/boknilev/nlp-analysis-methods/issues/5 .
They design control tasks, which a probe may only solve by memorizing. In particular, they randomize the labels in D_P to create a new dataset D_P,Rand. Then, they define a selectivity measure, which is the difference between the probing performance on the probing task and on the control task: SEL(g, f, D_O, D_P, D_P,Rand) = PERF(g, f, D_O, D_P) − PERF(g, f, D_O, D_P,Rand). They show that probes may have high accuracy but low selectivity, and that linear probes tend to have high selectivity, while non-linear probes tend to have low selectivity. This indicates that the high accuracy of non-linear probes comes from memorization of the control task, rather than from information captured in the representations f_l(x).

Taking an information-theoretic perspective on probing, Pimentel et al. (2020b) proposed to use control functions instead of control tasks in order to compare probes. Their control function is any function applied to the representation, c : f_l(x) → c(f_l(x)), and they compare the information gain, which is the difference in mutual information between the property z and the representation before and after applying the control function: G(z, h, c) = I(z; h) − I(z; c(h)). While Pimentel et al. (2020b) posit that their control functions are a better criterion than the control tasks of Hewitt and Liang (2019), subsequent work showed that the two criteria are almost equivalent, both theoretically and empirically (Zhu and Rudzicz 2020).

Another kind of control is proposed by Ravichander, Belinkov, and Hovy (2021), who design control datasets, where the linguistic property z is not discriminative w.r.t. the original task of mapping x to y. That is, they modify D_O and create a new dataset D_O,z such that all examples in it have the same value for property z. Intuitively, a model f trained on D_O,z should not pick up information about z, since it is not useful for the task of f.
They show that a probe g may learn to predict property z incidentally, even when it is not discriminative w.r.t. the original task of mapping x → y, casting doubt on causal claims concerning the effect that a property encoded in the representation may have on the original task.

4.2 The Probing Classifier

What should be the structure of the probing classifier g? What role does its expressivity play in drawing conclusions about the original model f?

Some studies advocate for using simple probes, such as linear classifiers (Alain and Bengio 2016; Liu et al. 2019; Hewitt and Manning 2019; Hall Maudslay et al. 2020). Somewhat anecdotally, a few studies observed better performance with more complex probes, but reported similar relative trends (Conneau et al. 2018; Belinkov 2018). That is, if two representations from f are better under one probe, these studies report them to be better under other probes too. However, this pattern may be flipped when considering alternative measures, such as selectivity (Hewitt and Liang 2019).

Several studies considered the complexity of the probe g in more detail. Pimentel et al. (2020b) argue that, to give the best estimate of the information that model f has about property z, the most complex probe should be used. Their argument is based on a mild assumption about the uniqueness of representations f_l(x). In a more practical view, Voita and Titov (2020) propose to measure both the performance of the probe g and its complexity, by estimating the minimum description length of the code required to transmit property z knowing the representations f_l(x): MDL(g, f, D_O, D_P). Note that this measure again depends on the probe g, the model f, and their respective datasets D_O and D_P. They found that the MDL measure provides more information about how a probe g works, for instance by revealing differences in the complexity of probes when performing control tasks from D_P,Rand, as in Hewitt and Liang (2019). Finally, Pimentel et al.
(2020a) argue that probing work should report the possible trade-offs between accuracy and complexity, along a range of probes g, and call for using probes that are both simple and accurate. While they study a number of linear and non-linear multi-layer perceptrons, one could extend this idea to other classes of probes.

Another line of work proposes methods to extract linguistic information from a trained model without learning additional parameters. In particular, much work has used some sort of pairwise importance score between words in a sentence as a signal for inferring linguistic properties, either full syntactic parsing or more fine-grained properties such as coreference resolution. These scores may come from attention weights (Raganato and Tiedemann 2018; Clark et al. 2019; Mareček and Rosa 2019; Htut et al. 2019) or from distances between word representations, perhaps including perturbations of the input sentence (Wu et al. 2020). The pairwise scores can feed into some general parsing algorithm, such as the Chu–Liu/Edmonds algorithm (Chu 1965; Edmonds 1967). Alternatively, some work has used representational similarity analysis (Kriegeskorte, Mur, and Bandettini 2008) to measure similarity between word or sentence representations and syntactic properties, both local properties like determining a verb's subject (Lepori and McCoy 2020) and more structured properties like inferring the full syntactic tree (Chrupała and Alishahi 2019). This line of work can be seen as a parameter-less probing classifier g: a linguistic property is inferred from internal model components (representations, attention weights), without needing to learn new parameters. Thus, such work avoids some of the issues about what the probe learns. Additionally, from the perspective of an accuracy–complexity trade-off, such work should perhaps be placed on the low end of the complexity axis, although the complexity of the parsing algorithm could also be taken into account.
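The representational-similarity style of parameter-less probing mentioned above can be sketched in a few lines: one rank-correlates two pairwise-distance structures directly, with no probe parameters learned. This is only an illustrative sketch in the spirit of such work (e.g., Chrupała and Alishahi 2019); both the word representations and the "syntactic" distances below are synthetic stand-ins.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Stand-in word representations f_l(x) for a 12-token sentence.
R = rng.normal(size=(12, 64))

# Stand-in "syntactic" pairwise distances between the same tokens
# (e.g., distances in a parse tree); constructed here to be noisily
# related to the representation distances, purely for illustration.
rep_dist = pdist(R)                               # condensed pairwise distances
syn_dist = rep_dist + 0.1 * rng.normal(size=rep_dist.shape)

# Representational similarity analysis: rank-correlate the two distance
# structures directly; no parameters are trained.
rho, _ = spearmanr(rep_dist, syn_dist)
print(f"RSA (Spearman) correlation: {rho:.2f}")
```

Because nothing is trained, concerns about what the probe itself learns do not arise; the resulting number reflects only how similar the two geometries are.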
4.3 Causal Claims

A main limitation of the probing classifier paradigm is the disconnect between the probing classifier g and the original model f. They are trained in two different steps, where f is trained once and only used to generate feature representations f_l(x), which are fed into g. Once we have f_l(x), we get a probing performance from g, which tells us something about the information in f_l(x). However, in the process, we have forgotten about the original task assigned to f, which was to predict y. This raises an important question: does model f use the information discovered by probe g? In other words, the probing framework may indicate correlations between intermediate representations f_l(x) and a linguistic property z, but it does not tell us whether this property is involved in the predictions of the model f. Indeed, several studies pointed out this limitation (Belinkov and Glass 2019), including reports of a mismatch between the performance of the probe, PERF(g, f, D_O, D_P), and the performance of the original model, PERF(f, D_O) (Vanmassenhove, Du, and Way 2017). Relatedly, Tamkin et al. (2020) find a discrepancy between which features f_l(x) obtain high probing performance, PERF(g, f, D_O, D_P), and which features are identified as important when fine-tuning f while performing the probing task f_l(x) → z. They reveal this by randomizing specific layers when fine-tuning f, which can be seen as a kind of intervention.

Indeed, a number of studies have proposed alternatives and improvements to the probing classifier paradigm, which aim to discover causal effects by intervening in representations of the model f. Giulianelli et al. (2018) use gradients from g to modify the representations in f and evaluate how this change affects both the probing performance and the original model performance. In their case, f is a language model and g predicts subject–verb number agreement.
They find that their intervention increases probing performance, as may be expected. Interestingly, while in the general language modeling case the intervention has a small effect on the original model performance, PERF(f, D_O), they find an increase in this performance on examples designed to assess number agreement. They conclude that probing classifiers can identify features that are actually used by the model. Similarly, Elazar et al. (2021) remove certain properties z (such as parts of speech or syntactic dependencies) from representations in f by repeatedly training (linear) probing classifiers g and projecting them out of the representation. This results in a modified representation f̃_l(x), which has less information about z. They compare the probing performance to the performance on the original task (in their case, language modeling) after the removal of said features. They find that high probing performance PERF(g, f, D_O, D_P) does not necessarily entail a large drop in original task performance after their removal, that is, PERF(f̃, D_O). Thus, contrary to Giulianelli et al. (2018), they conclude that probing classifiers do not always identify features that are actually used by the model. In a similar vein, Feder et al. (2020) remove properties z from representations in f by training g adversarially, while continuing to train f. Concretely, g is trained to predict a property, but gradients are reversed when back-propagated into f, which should result in the removal of property z from the latent representations in f. At the same time, they add another probing classifier g_C, trained positively, which aims to control for properties z_C that should not be removed from f. They cast their approach in the framework of causal graphs and find that they can accurately estimate the effect of properties z on downstream tasks performed by f when it is fine-tuned.
They also find a large degradation in the ability to recover z from f_l(x), but a small degradation in the ability to recover the controlled property z_C, as desired.

Other work performing interventions includes Bau et al. (2019), who identify important individual neurons and change their activations by setting them to a certain expected value. They manipulate in this way output translations along axes of tense, number, and gender, and quantify the effect of their interventions on the outputs y. Similarly, Lakretz et al. (2019) ablate neurons in language models by setting certain dimensions of f_l(x) to zero. They find a small number of neurons with large effects on the ability of language models to correctly model subject–verb number agreement. Vig et al. (2020) design interventions on inputs x and quantify the effect of intermediate variables in language models (in particular, neurons and attention heads) on gender-biased predictions, using causal mediation analysis (Pearl 2001). They measure counterfactual outcomes as a function of outputs f(x), when intermediate variables like f_l(x) are set to different values. They find that gender bias effects are sparsely located in certain parts of the analyzed models.

4.4 Datasets versus Tasks

The probing paradigm aims to study models performing some task (f : x → y) via a classifier performing another task (g : f_l(x) → z). However, in practice these tasks are operationalized via finite datasets. Ravichander, Belinkov, and Hovy (2021) point out that datasets are imperfect proxies for tasks. Indeed, the effect of the choice of datasets, both the original dataset D_O and the probing dataset D_P, has not been widely studied. Furthermore, we would ideally want to disentangle the role of each dataset from the role of the original model f and the probing classifier g.
Unfortunately, common models f tend to be trained on different datasets D_O, making any statements about models confounded with issues of datasets. Some prior work acknowledged this limitation, explaining that conclusions can only be made about the existing trained models, not about general architectures (Liu et al. 2019). However, in an ideal world, we would want to compare models f trained on the same dataset D_O. Such experiments are currently lacking.

The effect of the probing dataset D_P (its size, composition, etc.) is similarly not widely studied. Some prior work reported results on multiple datasets when predicting the same property z (e.g., Belinkov et al. 2017a). However, more careful investigations are necessary.

4.5 Defining the Probed Properties

Inherent to the probing classifier framework is a decision on a property z to probe for. This limits the investigation in multiple ways. First, it constrains the work to existing annotated datasets, which are often limited to English and to several kinds of properties. It also requires focusing on properties z that are thought a priori to be relevant to the task of mapping x → y, potentially leading to biased conclusions. In an (at present, isolated) effort to alleviate this limitation, Michael, Botha, and Tenney (2020) propose to learn latent clusters relevant for predicting a property z. They discover clusters corresponding to known properties (such as personhood) as well as new categories, which are not usually annotated in common datasets.
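The representation-level interventions reviewed in this section can be illustrated with a simplified sketch of iterative linear removal in the spirit of Elazar et al. (2021): train a linear probe for z, project its learned direction out of the representations, and repeat. The data, dimensions, and iteration count below are synthetic and illustrative; this is not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic representations H (standing in for f_l(x)) with a linearly
# encoded binary property z.
n, d = 2000, 32
H = rng.normal(size=(n, d))
z = (H @ rng.normal(size=d) > 0).astype(int)

def probe_accuracy(H, z):
    """Accuracy of a freshly trained linear probe g on (H, z)."""
    return LogisticRegression(max_iter=1000).fit(H, z).score(H, z)

def remove_property(H, z, n_iters=8):
    """Repeatedly train a linear probe for z and project its learned
    direction out of the representations, yielding a modified
    representation (cf. the f-tilde of the text) with less information
    about z."""
    H = H.copy()
    for _ in range(n_iters):
        probe = LogisticRegression(max_iter=1000).fit(H, z)
        W = probe.coef_                   # shape (1, d) for binary z
        Q, _ = np.linalg.qr(W.T)          # orthonormal basis of the probe direction
        H = H - (H @ Q) @ Q.T             # project that direction out
    return H

before = probe_accuracy(H, z)
after = probe_accuracy(remove_property(H, z), z)
print(f"probe accuracy before: {before:.2f}, after removal: {after:.2f}")
```

After a few iterations the property becomes nearly unrecoverable by a linear probe; in amnesic probing, one would then measure how much the original task performance suffers from this removal.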
5. Summary
Given the various limitations discussed in this article, one might ask: what are probing classifiers good for? From an analysis point of view, we have discussed several reservations regarding which insights can be drawn from a probing classifier experiment. Yet recent work has also proposed improvements to the framework, such as better controls and metrics. One direction that seems promising is to focus on how difficult it is to extract a property from a representation, rather than making absolute statements about its presence (Pimentel et al. 2020b). Another compelling direction is to adopt causal approaches, like those in Section 4.3, which are better equipped for drawing insights about the probed model.

One might hope that probing classifier experiments would suggest ways to improve the quality of the probed model or to direct it to be better tuned to some use or task. Presently, there are few such successful examples. For instance, results showing that lower layers in language models focus on local phenomena while higher layers focus on global ones (obtained using probing classifiers and other methods) motivated Cao et al. (2020) to decouple question–passage processing in a question-answering model, such that lower layers process the question and the passage independently and higher layers process them jointly. An analysis of redundancy in language models (again using probing classifiers and other methods) motivated an efficient transfer-learning procedure (Dalvi et al. 2020). An analysis of phonetic information in the layers of a speech recognition system (Belinkov and Glass 2017) partly motivated Krishna, Toshniwal, and Livescu (2018) to propose multi-task learning with phonetic supervision on intermediate layers. Belinkov et al. (2020) discuss how their probing experiments can guide the selection of which machine translation models to use when translating specific languages.
Finally, when considering using the representations for some downstream task, probing experiments can indicate what information is encoded in, or can easily be extracted from, these representations.
To conclude, our critical review of the probing classifiers framework has revealed that it is more complicated than it may seem. Figure 2 reproduces the basic components and adds additional ones discussed in Section 4. We do not argue that any given study should perform all the various controls and report all the alternative measures summarized here. However, future work seeking to use probing classifiers would do well to take into account the complexity of the framework and its apparent weaknesses.

Figure 2
Components comprising the probing classifiers framework; extended version of the basic components in Figure 1. Basic components: the original task x → y, the original dataset D_O = {x^(i), y^(i)}, the original model f : x → y and its performance on the original task PERF(f, D_O); the representations f_l(x) of x from f, the probing task f_l(x) → z, the probing dataset D_P = {x^(i), z^(i)}, the probing classifier g : f_l(x) → z, and the probing performance PERF(g, f, D_O, D_P). Additional components: a skyline model or upper bound f̄ : x → y; a baseline model f̲ : x → y; a control task mapping x to random labels (Hewitt and Liang 2019); a control function c : f_l(x) → c(f_l(x)) (Pimentel et al. 2020b); the control task dataset D_P,Rand (Hewitt and Liang 2019); the control dataset D_O,z (Ravichander, Belinkov, and Hovy 2021); the probing selectivity SEL(g, f, D_O, D_P, D_P,Rand) (Hewitt and Liang 2019); the information gain w.r.t. a control function G(z, h, c) (Pimentel et al. 2020b); the probe minimum description length MDL(g, f, D_O, D_P) (Voita and Titov 2020); and the representations f̃_l(x) of x from f after an intervention.

References
Adi, Yossi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In International Conference on Learning Representations (ICLR).

Alain, Guillaume and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644v3.

Bau, Anthony, Yonatan Belinkov, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2019. Identifying and controlling important neurons in neural machine translation. In International Conference on Learning Representations.

Belinkov, Yonatan. 2018. On Internal Language Representations in Deep Learning: An Analysis of Machine Translation and Speech Recognition. Ph.D. thesis, Massachusetts Institute of Technology.

Belinkov, Yonatan, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017a. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872, Association for Computational Linguistics, Vancouver, Canada.

Belinkov, Yonatan, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2020. On the linguistic representational power of neural machine translation models. Computational Linguistics, 46(1):1–52.

Belinkov, Yonatan, Sebastian Gehrmann, and Ellie Pavlick. 2020. Interpretability and analysis in neural NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 1–5, Association for Computational Linguistics, Online.

Belinkov, Yonatan and James Glass. 2017. Analyzing hidden representations in end-to-end automatic speech recognition systems. In Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc.

Belinkov, Yonatan and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Belinkov, Yonatan, Lluís Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2017b. Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Asian Federation of Natural Language Processing, Taipei, Taiwan.

Cao, Qingqing, Harsh Trivedi, Aruna Balasubramanian, and Niranjan Balasubramanian. 2020. DeFormer: Decomposing pre-trained transformers for faster question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4487–4497, Association for Computational Linguistics, Online.

Chrupała, Grzegorz and Afra Alishahi. 2019. Correlating neural and symbolic representations of language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2952–2962, Association for Computational Linguistics, Florence, Italy.

Chrupała, Grzegorz, Bertrand Higy, and Afra Alishahi. 2020. Analyzing analytical methods: The case of phonology in neural models of spoken language. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4146–4156, Association for Computational Linguistics, Online.

Chu, Y. 1965. On the shortest arborescence of a directed graph. Scientia Sinica, 14:1396–1400.

Clark, Kevin, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Association for Computational Linguistics, Florence, Italy.

Conneau, Alexis, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Association for Computational Linguistics, Melbourne, Australia.

Dalvi, Fahim, Hassan Sajjad, Nadir Durrani, and Yonatan Belinkov. 2020. Analyzing redundancy in pretrained transformer models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4908–4926, Association for Computational Linguistics, Online.

Danilevsky, Marina, Kun Qian, Ranit Aharonov, Yannis Katsis, Ban Kawas, and Prithviraj Sen. 2020. A survey of the state of explainable AI for natural language processing. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 447–459, Association for Computational Linguistics, Suzhou, China.

Edmonds, Jack. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards B, 71(4):233–240.

Elazar, Yanai, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. 2021. Amnesic probing: Behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics.

Ettinger, Allyson, Ahmed Elgohary, and Philip Resnik. 2016. Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 134–139, Association for Computational Linguistics, Berlin, Germany.

Feder, Amir, Nadav Oved, Uri Shalit, and Roi Reichart. 2020. CausaLM: Causal model explanation through counterfactual language models. arXiv preprint arXiv:2005.13407.

Giulianelli, Mario, Jack Harding, Florian Mohnert, Dieuwke Hupkes, and Willem Zuidema. 2018. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 240–248, Association for Computational Linguistics, Brussels, Belgium.

Gupta, Abhijeet, Gemma Boleda, Marco Baroni, and Sebastian Padó. 2015. Distributional vectors encode referential attributes. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 12–21, Association for Computational Linguistics, Lisbon, Portugal.

Hall Maudslay, Rowan, Josef Valvoda, Tiago Pimentel, Adina Williams, and Ryan Cotterell. 2020. A tale of a probe and a parser. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7389–7395, Association for Computational Linguistics, Online.

Hewitt, John and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Association for Computational Linguistics, Hong Kong, China.

Hewitt, John and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Association for Computational Linguistics, Minneapolis, Minnesota.

Htut, Phu Mon, Jason Phang, Shikha Bordia, and Samuel R. Bowman. 2019. Do attention heads in BERT track syntactic dependencies? arXiv preprint arXiv:1911.12246.

Hupkes, Dieuwke, Sara Veldhoen, and Willem Zuidema. 2018. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926.

Köhn, Arne. 2015. What's in an embedding? Analyzing word embeddings through multilingual evaluation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2067–2073, Association for Computational Linguistics, Lisbon, Portugal.

Kriegeskorte, Nikolaus, Marieke Mur, and Peter Bandettini. 2008. Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2:4.

Krishna, Kalpesh, Shubham Toshniwal, and Karen Livescu. 2018. Hierarchical multitask learning for CTC-based speech recognition. arXiv preprint arXiv:1807.06234.

Lakretz, Yair, German Kruszewski, Theo Desbordes, Dieuwke Hupkes, Stanislas Dehaene, and Marco Baroni. 2019. The emergence of number and syntax units in LSTM language models. In
Proceedings ofthe 2019 Conference of the North AmericanChapter of the Association for ComputationalLinguistics: Human Language Technologies,Volume 1 (Long and Short Papers) , pages11–20, Association for Computational onatan Belinkov Probing Classifiers SquibLinguistics, Minneapolis, Minnesota.Lepori, Michael and R. Thomas McCoy. 2020.Picking BERT’s brain: Probing forlinguistic dependencies in contextualizedembeddings using representationalsimilarity analysis. In Proceedings of the28th International Conference onComputational Linguistics , pages 3637–3651,International Committee onComputational Linguistics, Barcelona,Spain (Online).Liu, Nelson F., Matt Gardner, YonatanBelinkov, Matthew E. Peters, and Noah A.Smith. 2019. Linguistic knowledge andtransferability of contextualrepresentations. In
Proceedings of the 2019Conference of the North American Chapter ofthe Association for Computational Linguistics:Human Language Technologies, Volume 1(Long and Short Papers) , pages 1073–1094,Association for ComputationalLinguistics, Minneapolis, Minnesota.Mareˇcek, David and Rudolf Rosa. 2019.From balustrades to pierre vinken:Looking for syntax in transformerself-attentions. In
Proceedings of the 2019ACL Workshop BlackboxNLP: Analyzing andInterpreting Neural Networks for NLP , pages263–275, Association for ComputationalLinguistics, Florence, Italy.Michael, Julian, Jan A. Botha, and IanTenney. 2020. Asking without telling:Exploring latent ontologies in contextualrepresentations. In
Proceedings of the 2020Conference on Empirical Methods in NaturalLanguage Processing (EMNLP) , pages6792–6812, Association for ComputationalLinguistics, Online.Pearl, Judea. 2001. Direct and indirect effects.In
Proceedings of the Seventeenth Conferenceon Uncertainty in Artificial Intelligence ,UAI’01, page 411–420, Morgan KaufmannPublishers Inc., San Francisco, CA, USA.Pimentel, Tiago, Naomi Saphra, AdinaWilliams, and Ryan Cotterell. 2020a.Pareto probing: Trading off accuracy forcomplexity. In
Proceedings of the 2020Conference on Empirical Methods in NaturalLanguage Processing (EMNLP) , pages3138–3153, Association for ComputationalLinguistics, Online.Pimentel, Tiago, Josef Valvoda, RowanHall Maudslay, Ran Zmigrod, AdinaWilliams, and Ryan Cotterell. 2020b.Information-theoretic probing forlinguistic structure. In
Proceedings of the58th Annual Meeting of the Association forComputational Linguistics , pages 4609–4622,Association for Computational Linguistics, Online.Raganato, Alessandro and Jörg Tiedemann.2018. An analysis of encoderrepresentations in transformer-basedmachine translation. In
Proceedings of the2018 EMNLP Workshop BlackboxNLP:Analyzing and Interpreting Neural Networksfor NLP , pages 287–297, Association forComputational Linguistics, Brussels,Belgium.Ravichander, Abhilasha, Yonatan Belinkov,and Eduard Hovy. 2021. Probing theprobing paradigm: Does probing accuracyentail task relevance? In
Proceedings of the16th Conference of the European Chapter ofthe Association for Computational Linguistics(EACL) .Shi, Xing, Inkit Padhi, and Kevin Knight.2016. Does string-based neural MT learnsource syntax? In
Proceedings of the 2016Conference on Empirical Methods in NaturalLanguage Processing , pages 1526–1534,Association for ComputationalLinguistics, Austin, Texas.Tamkin, Alex, Trisha Singh, DavideGiovanardi, and Noah Goodman. 2020.Investigating transferability in pretrainedlanguage models. In
Findings of theAssociation for Computational Linguistics:EMNLP 2020 , pages 1393–1401,Association for ComputationalLinguistics, Online.Vanmassenhove, Eva, Jinhua Du, and AndyWay. 2017. Investigating ‘aspect’ in nmtand smt: Translating the english simplepast and present perfect.
ComputationalLinguistics in the Netherlands Journal ,7:109–128.Vig, Jesse, Sebastian Gehrmann, YonatanBelinkov, Sharon Qian, Daniel Nevo,Yaron Singer, and Stuart Shieber. 2020.Investigating gender bias in languagemodels using causal mediation analysis.In
Advances in Neural Information ProcessingSystems (NeurIPS, Spotlight presentation) .Voita, Elena and Ivan Titov. 2020.Information-theoretic probing withminimum description length. In
Proceedings of the 2020 Conference onEmpirical Methods in Natural LanguageProcessing (EMNLP) , pages 183–196,Association for ComputationalLinguistics, Online.Wallace, Eric, Matt Gardner, and SameerSingh. 2020. Interpreting predictions ofNLP models. In
Proceedings of the 2020Conference on Empirical Methods in NaturalLanguage Processing: Tutorial Abstracts ,pages 20–23, Association for omputational Linguistics Volume 1, Number 1Computational Linguistics, Online.Wu, Zhiyong, Yun Chen, Ben Kao, and QunLiu. 2020. Perturbed masking:Parameter-free probing for analyzing andinterpreting BERT. In Proceedings of the58th Annual Meeting of the Association forComputational Linguistics , pages 4166–4176,Association for ComputationalLinguistics, Online.Zhang, Kelly and Samuel Bowman. 2018.Language modeling teaches you morethan translation does: Lessons learnedthrough auxiliary syntactic task analysis.In
Proceedings of the 2018 EMNLP WorkshopBlackboxNLP: Analyzing and InterpretingNeural Networks for NLP , pages 359–361,Association for ComputationalLinguistics, Brussels, Belgium.Zhu, Zining and Frank Rudzicz. 2020. Aninformation theoretic view on selectinglinguistic probes. In
Proceedings of the 2020Conference on Empirical Methods in NaturalLanguage Processing (EMNLP) , pages9251–9262, Association for ComputationalLinguistics, Online., pages9251–9262, Association for ComputationalLinguistics, Online.