Probing the Probing Paradigm: Does Probing Accuracy Entail Task Relevance?
Abhilasha Ravichander, Yonatan Belinkov, Eduard Hovy
Language Technologies Institute, Carnegie Mellon University
John A. Paulson School of Engineering and Applied Sciences, Harvard University
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
[email protected], [email protected], [email protected]

Abstract
Much recent attention has been devoted to analyzing sentence representations learned by neural encoders, through the paradigm of "probing" tasks. This is often motivated by an interest in understanding the information a model uses to make its decision. However, to what extent is the information encoded in a sentence representation actually used for the task on which the encoder is trained? In this work, we examine this probing paradigm through a case study in Natural Language Inference, showing that models learn to encode linguistic properties even when these are not needed for a task. We identify that pre-trained word embeddings, rather than the training task itself, play a considerable role in encoding these properties, highlighting the importance of careful controls when designing probing experiments. Through a set of controlled synthetic tasks, we demonstrate that models can encode these properties considerably above chance level even when distributed as random noise, calling into question the interpretation of absolute claims on probing tasks. [Code and data available at https://github.com/AbhilashaRavichander/ProbingTaskRelevance.]

Neural models have achieved state-of-the-art performance on a variety of NLP benchmarks (Kim, 2014; Seo et al., 2016; Parikh et al., 2016; Chen et al., 2017; Lan and Xu, 2018; Devlin et al., 2019), and recently there has been considerable community effort to develop methods to analyze them. This is motivated by an interest in not just having models perform a task well, but also understanding the information used by models to perform it (Conneau et al., 2018). A popular approach is to associate the representations learned by the neural network with linguistic properties of interest, and examine the extent to which these properties can be recovered from the representation (Adi et al., 2017).
[Figure 1: Illustration of a typical application of probing, where representations from models trained on a task are probed for relevant linguistic and semantic properties. In the standard probing methodology, an encoder is (1) trained on the main task, (2) frozen, and (3) probed on a relevant auxiliary task; the outcome of high accuracy on the relevant auxiliary task is taken to support the conclusion that the auxiliary information is linked to the main task decision. In the proposed test, the representation is additionally probed on an irrelevant auxiliary task; high accuracy on both the relevant and irrelevant auxiliary tasks shows that probing accuracy does not entail task relevance. Proposed test conclusions are discussed in Section 4.]

This paradigm has alternatively been called probing (Conneau et al., 2018), auxiliary prediction tasks (Adi et al., 2017), and diagnostic classification (Veldhoen et al., 2016; Hupkes et al., 2018). As described in Conneau et al. (2018), one primary goal of the probing paradigm is "to pinpoint the information a model is relying upon" to do a task. Let us examine a typical application, as illustrated in Figure 1, through the case of Natural Language Inference (NLI). In their formative work, Conneau et al. (2018) train three sentence-encoder models on an NLI dataset (MultiNLI; Williams et al. (2017)). The weights of the encoders are frozen, and the encoders are then used to form sentence representations for an auxiliary task, such as predicting the tense of the verb in the main clause of the sentence. A classifier, which we refer to henceforth as the probing classifier, is trained to predict this property based on the constructed representation. If the probing classifier demonstrates high accuracy, the property is considered to be encoded in the representation and assumed to play a role in the task decision. Many insightful studies have assumed this conventional wisdom: that if a learned representation encodes a particular relevant linguistic feature (demonstrated through a probing task), the model leverages this information to perform the task (Shi et al., 2016; Belinkov et al., 2017a; Conneau et al., 2018; Hupkes et al., 2018; Giulianelli et al., 2018; Kim et al., 2019; Alt et al., 2020).

In this work, we re-examine this connection between the linguistic information encoded in a representation and the information a model requires for a task. We do this by establishing careful control versions of the task which are invariant to the linguistic property being probed. Broadly, our research findings can be summarized as follows:

• We show that under the current framework of probing sentence representations to determine whether particular linguistic knowledge is required to perform a task, sentence representations can exhibit similar probing accuracy for a linguistic property whether or not it is actually needed for the task (§4).

• Could pre-trained word embeddings be the reason for this phenomenon? We demonstrate that initializing models with pre-trained word embeddings does play a considerable role in encoding some linguistic properties in sentence representations. We speculate that probing experiments with pre-trained word embeddings conflate two tasks: the training of the word embeddings, and the task of interest (§5).
• However, when carefully controlling for task interaction, we demonstrate that models still encode linguistic properties even when these are not actually required for a task. This poses a challenge to how conclusions about the link between linguistic properties and tasks should be interpreted (Conneau et al., 2018) (§5).

• Through a set of control synthetic tasks, we highlight issues with interpreting the results of probing in the context of task requirements. In this controlled setting, we explore whether adversarial learning can determine if a linguistic property is needed for a task, as a potential alternative to the probing paradigm (§6).

• We discuss several considerations for interpreting the results of probing experiments, and highlight avenues for future research in this important area of understanding models, tasks and datasets (§7).

Progress in Natural Language Understanding (NLU) has been driven by a history of defining tasks and corresponding benchmarks for the community (Marcus et al., 1993; Dagan et al., 2006; Rajpurkar et al., 2016). These tasks are often tied to specific practical applications, or serve to develop models demonstrating competencies that transfer across applications. The corresponding benchmark datasets are utilized as proxies for the tasks themselves. How can we estimate their quality as proxies? While annotation artifacts are one facet that affects proxy quality (Gururangan et al., 2018; Poliak et al., 2018; Kaushik and Lipton, 2018; Naik et al., 2018; Glockner et al., 2018), a dataset might simply not have coverage across the competencies required for a task. Additionally, it might contain alternate "explanations": features that are correlated with the task label in the dataset while not being task-relevant, which models can exploit to give the impression of good performance at the task itself.

Two analysis methods have emerged to address this limitation:
1) Diagnostic examples, where a small number of samples in a test set are annotated with linguistic phenomena of interest, and task accuracy is reported on these samples (Williams et al., 2017). However, it is difficult to determine whether models perform well on diagnostic examples because they have actually learned the required linguistic competency, or because they exploit spurious correlations in the data (McCoy et al., 2019; Gururangan et al., 2018; Poliak et al., 2018).
2) External challenge tests (Naik et al., 2018; Glockner et al., 2018; Isabelle et al., 2017; McCoy et al., 2019; Ravichander et al., 2019), where examples demonstrating a specific phenomenon in isolation are constructed, either through automatic methods or by experts. However, it is challenging and expensive to build such evaluations, and non-trivial to isolate phenomena (Liu et al., 2019b).

Thus, probing or diagnostic classification presents an exciting alternative, wherein sentence representations can directly be probed for linguistic properties of interest (Ettinger et al., 2016; Adi et al., 2017; Tenney et al., 2019; Hewitt and Manning, 2019; Warstadt et al., 2019; Zhang and Bowman, 2018), which can give insight into the competencies a model uses to do a task. There has been a variety of such work testing hypotheses about the mechanisms models use to perform tasks. Shi et al. (2016) examine whether the source side of an encoder-decoder model learns syntax when trained for machine translation. Conneau et al. (2018) use probing to compare representations formed by a variety of training tasks, including machine translation and NLI, and examine the correlation between linguistic properties and these downstream tasks to identify competencies needed for each task. Hupkes et al. (2018) discuss "diagnostic classification", in which an additional classifier is trained to extract information from a sequence of hidden representations in a neural network; if the classifier achieves high accuracy, it is concluded that the network is keeping track of the hypothesized information. Giulianelli et al. (2018) use diagnostic classifiers to predict number from the internal states of a language model. Kim et al. (2019) study what different NLP tasks teach neural models about function word comprehension. Alt et al. (2020) analyze learned representations for relation extraction (RE) through a set of 14 probing tasks for linguistic properties relevant to RE.

Closest to our work are Zhang and Bowman (2018) and Hewitt and Liang (2019), who study the role of training data and lexical memorization in probing experiments. However, they both examine expressivity: of the neural model itself (Zhang and Bowman, 2018), and of the probing classifier (Hewitt and Liang, 2019). While there has been much debate in the community on classifier complexity and the settings that are appropriate for probing (Alain and Bengio, 2016; Hewitt and Liang, 2019; Liu et al., 2019a; Conneau et al., 2018; Belinkov et al., 2017b; Qian et al., 2016; Voita and Titov, 2020), it is far from the only concern when interpreting the results of a probing experiment. Our work demonstrates that relying on diagnostic classifiers to interpret model reasoning for a task suffers from a fundamental limitation: properties may be incidentally encoded even when not required for a task.
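To make the standard probing pipeline concrete, the following is a minimal, hypothetical sketch of our own (using PyTorch and scikit-learn for illustration; the paper's experiments are implemented in DyNet, and the `encoder` object and auxiliary data are assumed to exist):

```python
import torch
from sklearn.linear_model import LogisticRegression

# Assume `encoder` is a sentence encoder already trained on the main task
# (e.g., NLI), and the sentence/label lists form an auxiliary probing task
# such as main-verb tense. All names here are illustrative.

def embed(encoder, sentences):
    # Step 2 of the paradigm: freeze the encoder so probing cannot change it.
    encoder.eval()
    with torch.no_grad():
        return torch.stack([encoder(s) for s in sentences]).numpy()

def run_probe(encoder, train_sents, train_labels, test_sents, test_labels):
    # Step 3: train a separate probing classifier on frozen representations.
    probe = LogisticRegression(max_iter=1000)
    probe.fit(embed(encoder, train_sents), train_labels)
    return probe.score(embed(encoder, test_sents), test_labels)

# High probing accuracy is conventionally read as "the property is encoded
# and used by the model" -- the inference this paper calls into question.
```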
[Table 1: Statistics of control datasets partitioned by linguistic property.]

In this section we describe how to construct control datasets, such that a particular linguistic feature is not required for making task judgements. While our motivating example of a task is natural language inference, we expect that control datasets can be constructed for most text classification tasks, which usually have a small, finite label space. Control datasets are based on the intuition that a linguistic feature is not informative for a model to discriminate between classes if the feature is constant across classes. Probabilistically, consider the task label T and linguistic property L. When every example in the control dataset has the same value of the property, the task label and the linguistic property are probabilistically independent, i.e., P(T | L) = P(T).

Thus, to construct control datasets, we pin down and hold constant the relevant property, by fixing its value across the whole dataset. Considering datasets as proxies for tasks, in these control datasets the task decision no longer depends on the value of the control property. In practice, control datasets are constructed from existing large-scale datasets for a task, by partitioning them on the value of a linguistic property. They are designed with the following considerations:

1. The linguistic property of interest is auxiliary to the main task and a function of the input, but not of the task decision.

2. Every sample in the training and test sets has the same fixed value of the linguistic property.

3. The training set is large, in order to train parameter-rich neural classifiers for the task.

We next describe our main task, our three auxiliary prediction tasks, and the procedure used to construct a control dataset for each auxiliary prediction task.
Main Task: In this work, we study the Natural Language Inference training task from Conneau et al. (2018) as the main task for training sentence encoders. Natural Language Inference (NLI) is a benchmark task for research on natural language understanding (Cooper et al., 1996; Fyodorov; Glickman et al., 2005; Haghighi et al., 2005). [Footnote: All the probing tasks considered in this work require single-sentence embeddings as input, and map them to binary labels {0, 1}.]

                     Tense              SubjNum            ObjNum
                     Dev-ST   Probing   Dev-SS   Probing   Dev-SO   Probing
Majority             37.90    50.00     36.88    50.0      39.52    50.0
CBOW-DS              57.57    82.36     58.4     76.55     55.85    75.49
CBOW-PT              60.31    82.2      58.2     75.69     59.15    74.38
BiLSTM-Av-DS         63.53    82.93     64.24    79.53     66.23    76.11
BiLSTM-Av-PT         65.08    82.79     66.76    78.81     67.08    75.48
BiLSTM-Max-DS        63.35    81.14     65.91    78.56     65.94    74.79
BiLSTM-Max-PT        64.6     81.04     66.87    79.51     66.98    72.44
BiLSTM-Last-DS       61.08    80.43     64.2     81.52     62.26    72.65
BiLSTM-Last-PT       63.89    78.44     66.18    78.9      66.04    72.82

Table 2: Performance comparisons of task-specific and downsampled models. Dev-ST is the MultiNLI development set controlled for tense, Dev-SS is the MultiNLI development set controlled for subject number, and Dev-SO is the MultiNLI development set controlled for object number. PT marks models trained on data partitioned by the linguistic property; DS marks models trained on data downsampled from MultiNLI to match the number of instances in PT.
Auxiliary Tasks: We consider three tasks from Conneau et al. (2018) that probe sentence representations for semantic information, and which "require some understanding of what the sentence denotes". In all three probing datasets, no lexical item for the target occurs across the train/dev/test split, controlling for the effect of memorizing word types associated with target categories (Hewitt and Liang, 2019). The tasks considered in this study are:

1. Tense: Categorize sentences based on the tense of the main verb.

2. Subject Number: Categorize sentences based on the number of the subject of the main clause.

3. Object Number: Categorize sentences based on the number of the direct object of the main clause.
Control:
For each auxiliary task, we partition MultiNLI such that the premise and hypothesis agree on a single value of the linguistic property. For example, for the auxiliary task Tense, sentences whose main verb takes a VBP/VBZ/VBG form are labeled as present tense, and those with VBD/VBN as past tense. [Footnote: These heuristics are specific to English, as is MultiNLI. We use the Stanford parser for constituency, POS and dependency parsing (Manning et al., 2014).] Premise-hypothesis pairs where the main verbs of both sentences are in past tense are then extracted from the train/dev sets to form the control datasets for tense. [Footnote: This procedure replicates the original SentEval probing labels (Conneau et al., 2018) with 89.37% accuracy on tense, 87.77% accuracy on subject number and 88.19% accuracy on object number.] This procedure results in three control datasets/tasks: MultiNLI-PastTense, MultiNLI-SingularSubject and MultiNLI-SingularObject. For all three auxiliary tasks, we form control datasets by setting the value of the linguistic property to the one that yields the maximum number of training instances after partitioning; this corresponds to fixing past tense, singular subject number and singular object number. Descriptive statistics for each dataset can be found in Table 1. A sketch of this construction follows.
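As an illustration, here is a minimal, hypothetical sketch of the tense control partition, assuming sentences have already been parsed and the main verb's POS tag identified (the paper uses the Stanford parser for this step; the helper names and data format below are ours):

```python
PRESENT_TAGS = {"VBP", "VBZ", "VBG"}
PAST_TAGS = {"VBD", "VBN"}

def tense_of(main_verb_tag):
    # English-specific heuristic from the paper: map the POS tag of the
    # main-clause verb to a coarse tense label.
    if main_verb_tag in PRESENT_TAGS:
        return "present"
    if main_verb_tag in PAST_TAGS:
        return "past"
    return None

def build_tense_control(pairs):
    """Keep only NLI pairs whose premise and hypothesis main verbs are
    BOTH past tense, so that P(label | tense) = P(label).
    `pairs` is assumed to be an iterable of dicts with keys
    'premise_tag', 'hypothesis_tag' and 'label' (hypothetical format)."""
    return [p for p in pairs
            if tense_of(p["premise_tag"]) == "past"
            and tense_of(p["hypothesis_tag"]) == "past"]
```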
Models: A wide variety of sentence-encoder architectures exist for NLI. In this work we utilize CBOW and BiLSTM-based (Hochreiter and Schmidhuber, 1997) architectures, as they are used for NLI (Williams et al., 2017) and have been probed for encoded linguistic properties (Conneau et al., 2017). This allows for more direct comparisons.

• Majority: A simple baseline that predicts the majority class of each dataset.

• CBOW: A Continuous Bag-Of-Words model (CBOW), where the sentence representation is the sum of the word embeddings of its constituent words.

• BiLSTM-Last/Avg/Max: For a sentence s = w_1 ... w_n of n words, the BiLSTM computes n vectors extracted from its hidden states, h_1, ..., h_n. We produce fixed-length vector representations in three ways: by selecting the last hidden state h_n (BiLSTM-Last), by averaging the produced hidden states (BiLSTM-Avg), or by selecting the maximum value of each dimension over the hidden units (BiLSTM-Max).

All models produce separate vector representations for the premise and the hypothesis. These are concatenated with their element-wise product and difference (Mou et al., 2016), passed to a tanh layer, and then to a 3-way softmax classifier. Models are initialized with 300D GloVe embeddings (Pennington et al., 2014) unless specified otherwise, and are implemented in DyNet (Neubig et al., 2017). A sketch of the pooling and pair-combination schemes is given below.
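The following is a minimal sketch of the three pooling strategies and the premise-hypothesis combination described above, written in PyTorch for readability (the paper's models are implemented in DyNet; all names here are ours):

```python
import torch
from torch import nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, emb_dim=300, hid_dim=200, pooling="max"):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hid_dim, bidirectional=True,
                            batch_first=True)
        self.pooling = pooling  # "last", "avg", or "max"

    def forward(self, embedded):          # embedded: (batch, n_words, emb_dim)
        states, _ = self.lstm(embedded)   # (batch, n_words, 2 * hid_dim)
        if self.pooling == "last":
            return states[:, -1]                 # BiLSTM-Last
        if self.pooling == "avg":
            return states.mean(dim=1)            # BiLSTM-Avg
        return states.max(dim=1).values          # BiLSTM-Max

def combine(u, v):
    # Premise/hypothesis representations are concatenated with their
    # element-wise product and difference (Mou et al., 2016) before the
    # tanh layer and 3-way softmax classifier.
    return torch.cat([u, v, u * v, u - v], dim=-1)
```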
As a first step, we ask: what does the accuracy of the probing classifier actually tell us about the training task? We construct multiple versions of the task (both training and development sets) where the entailment decision is independent of a given linguistic property, through careful partitioning as described in §3. To control for the effect of training data size, we downsample the MultiNLI training data to match the number of samples in each partitioned version of the task. These results are shown in Table 2.

Strikingly, we observe that even when models are trained on versions of the task which do not require the linguistic property at all, probing classifiers still exhibit high accuracy (sometimes up to 85%). The probing data is split lexically by target across partitions, and thus lexical memorization (Hewitt and Liang, 2019) cannot explain why these properties are encoded in the sentence representations. Across models, on the version of the task where a particular linguistic property is not needed, classifiers trained on data which does not require that property perform comparably to classifiers trained on MultiNLI training data (DS vs. PT models, on Dev-ST, Dev-SS and Dev-SO).
One potential explanation lies in our definition of a "task". Previous work directly probes models trained for a given task, such as machine translation or NLI. However, when models are initialized with pre-trained word embeddings, the conflated results of two tasks are being probed: the main task of interest, and the task used to train the word embeddings.

To study this, we compare models initialized with pre-trained word embeddings (Pennington et al., 2014) and then trained for the main task, to models whose word embeddings are randomly initialized but updated during training on the main task. These results are presented in Table 3. We observe that probing accuracies drop across linguistic properties in this setting, indicating that models with randomly initialized embeddings generate representations that contain less linguistic information than models with pre-trained embeddings. This result calls into question how to interpret the contribution of the main task to the encoding of a linguistic property when the representation has already been initialized with pre-trained word embeddings: the word embeddings could themselves encode a significant amount of linguistic information, or the main task might contribute to encoding information in a way already largely captured by the word embeddings.
When we isolate the effect of the main task by using randomly initialized word embeddings, are properties not required for the main task still being encoded? To study this, we revisit our linguistic control tasks, but train all models with randomly initialized word embeddings. We also train comparable models on the MultiNLI training data. These results can be found in Table 4. We observe that even in the setting with randomly initialized word embeddings, these properties are still encoded to a large extent in the control versions of the task. [Footnote: To some extent, this effect can be measured by using random encoders (Wieting and Kiela, 2019). However, this method fails to isolate the main task.]

                     Tense              SubjNum            ObjNum
                     Dev      Probing   Dev      Probing   Dev      Probing
Majority             36.50    50.0      36.50    50.0      36.50    50.0
CBOW-Word            62.21    83.74     62.1     76.91     61.93    75.4
CBOW-Rand            56.98    60.14     56.27    67.01     56.82    64.71
BiLSTM-Av-Word       70.05    82.48     70.67    76.53     69.82    72.29
BiLSTM-Av-Rand       63.33    61.4      64.0     67.68     63.71    63.87
BiLSTM-Max-Word      68.67    78.34     69.19    73.96     69.12    68.53
BiLSTM-Max-Rand      62.78    62.89     63.29    69.51     63.28    62.84
BiLSTM-Last-Word     68.32    74.61     69.04    71.82     68.82    69.27
BiLSTM-Last-Rand     62.14    62.96     61.88    67.45     62.29    61.32
Table 3: Performance comparisons of models initialized with pre-trained word embeddings (Word) and models randomly initialized but updated during task-specific training (Rand). Probing accuracies decrease sharply when models are initialized with random word embeddings.
                     Tense              SubjNum            ObjNum
                     Dev-PT   Probing   Dev-SS   Probing   Dev-SO   Probing
Majority             37.90    50.0      36.88    50.0      39.52    50.0
CBOW-Rand-DS         49.88    61.33     51.04    67.32     49.25    63.63
CBOW-Rand-PT         53.28    61.37     50.97    67.02     52.45    63.84
BiLSTM-Av-Rand-DS    57.21    63.75     60.76    68.5      59.53    63.89
BiLSTM-Av-Rand-PT    60.91    63.07     61.18    69.12     60.57    63.77
BiLSTM-Max-Rand-DS   59.18    61.05     61.8     70.32     60.57    64.68
BiLSTM-Max-Rand-PT   60.55    61.53     63.78    70.6      63.49    64.26
BiLSTM-Last-Rand-DS  56.73    63.88     58.82    69.09     56.79    63.86
BiLSTM-Last-Rand-PT  57.39    62.88     61.88    68.8      60.75    61.96
Table 4: Performance comparisons of task-specific (PT) and downsampled (DS) models with randomly initialized word embeddings.
Thus far we have demonstrated that models encode properties incidentally, even if they are not required for the main task. Thus, probing accuracy cannot be considered indicative of the competencies any given model relies on. What circumstances could lead to models encoding properties incidentally? Can we determine when a linguistic property is not needed by a model for a task? To shed light on these questions, we build carefully controlled synthetic tests, each capturing a kind of noise that could arise in datasets. We additionally present an initial exploration of an adversarial framework to suppress this noise, as a potential approach to identifying linguistic properties that are encoded incidentally.
We consider a task where the premise P and hypothesis H are strings from S = {(a|b)(a|b|c)*} of maximum length 30, and the hypothesis H is said to be entailed by the premise P if both begin with the same letter, a or b. Consider some example strings and entailment decisions for this task:

(a, ab) → Entailed    (a, ba) → Not Entailed
(b, ba) → Entailed    (b, ab) → Not Entailed
(b, bc) → Entailed    (b, acb) → Not Entailed

Now, let us consider an auxiliary task of predicting, from a representation, whether a given sentence contains the character c, analogous to probing for a property not required for the main task. [Footnote: A task with a similar objective was used by Belinkov et al. (2019) to demonstrate unlearning bias in datasets. The task is equivalent to XOR, which is learnable by an MLP.] To do so, we sample premise and hypothesis strings from the set S′ = (a|b)* of maximum length 30, and simulate four kinds of correlations that could occur in the dataset, by inserting c at a random position within the string other than the first character:

1. Noise: The linguistic property could be distributed as noise in the training data. To simulate this, we insert c into 50% of randomly sampled premise and hypothesis strings.

2. Uncorrelated: The linguistic property could be unrelated to the main task decision, but correlated with some other property within the dataset. To simulate this, we insert c into premise strings that begin with a.

3. Partial: The linguistic property could provide a partial explanation for the main task decision. To simulate this, we insert c into premise and hypothesis strings beginning with a.
4. Full: The linguistic property provides a complete alternate explanation for the main task decision. We insert c into premise and hypothesis strings whenever the hypothesis is entailed. [Footnote: Models can then use either the presence of c, or the first character of the strings being a, to make their prediction; but they must use whether the first character of the strings is b.]

Descriptive statistics of all four constructions are presented in Table 5. A sketch of the construction follows.

[Table 5: Number of train/dev/test examples in the constructed synthetic datasets (Noise, Uncorrelated, Partial, Full, and the held-out Attacker set).]
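A minimal generator for these synthetic datasets might look as follows (our own illustrative sketch; details such as the distribution over string lengths are assumptions, as the exact sampling procedure is not specified here):

```python
import random

ALPHABET = "ab"
MAX_LEN = 30

def sample_string():
    # Strings come from (a|b)* with maximum length 30; the length
    # distribution is our assumption. Leave room for an inserted 'c'.
    n = random.randint(1, MAX_LEN - 1)
    return "".join(random.choice(ALPHABET) for _ in range(n))

def insert_c(s):
    # Insert 'c' at a random position that is not the first character.
    i = random.randint(1, len(s))
    return s[:i] + "c" + s[i:]

def make_example(condition):
    p, h = sample_string(), sample_string()
    entailed = p[0] == h[0]  # the true task: same first letter
    if condition == "noise":            # c is random noise
        if random.random() < 0.5: p = insert_c(p)
        if random.random() < 0.5: h = insert_c(h)
    elif condition == "uncorrelated":   # c tracks premises starting with 'a'
        if p[0] == "a": p = insert_c(p)
    elif condition == "partial":        # c on strings starting with 'a'
        if p[0] == "a": p = insert_c(p)
        if h[0] == "a": h = insert_c(h)
    elif condition == "full":           # c fully predicts the label
        if entailed:
            p, h = insert_c(p), insert_c(h)
    return p, h, entailed
```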
We follow the adversarial learning framework illustrated in Figure 4. In this setup, we have premise-hypothesis pairs ⟨p_1, h_1⟩, ..., ⟨p_n, h_n⟩ and entailment labels y_1, ..., y_n, as well as labels ⟨z_{p,1}, z_{h,1}⟩, ..., ⟨z_{p,n}, z_{h,n}⟩ for the linguistic property in each premise-hypothesis pair. We would like to train sentence encoders f(p_i, θ) and f(h_i, θ) and a classification layer g_θ such that y_i = g_θ(f(p_i, θ), f(h_i, θ)), in a way that does not use ⟨z_{p,i}, z_{h,i}⟩. We do this by incorporating an adversarial classification layer g_φ such that ⟨z_{p,i}, z_{h,i}⟩ = ⟨g_φ(f(p_i, θ)), g_φ(f(h_i, θ))⟩ (Goodfellow et al., 2014; Ganin and Lempitsky, 2015). Following Elazar and Goldberg (2018), we also train an external "attacker" classifier φ′ to predict z_{p,i} and z_{h,i} from the learned sentence representation. [Footnote: We train the attacker on a held-out dataset with the linguistic property distributed as random noise (Table 5). We also ensure that all examples in the attacker data are unseen in the main task, to prevent data leakage.]

[Figure 4: Illustration of (1) the baseline NLI task architecture (Figure 2) and (2) adversarial removal of linguistic properties from the representations (Figure 3). Arrows represent the direction of propagation of inputs in the forward pass and of gradients in backpropagation; blue and orange arrows correspond to the gradient being preserved and reversed, respectively.]

Thus, during training, the adversarial classifier is trained to predict z from the sentence representations f(p_i, θ) and f(h_i, θ), and the sentence encoder f is trained to make the adversarial classifier unsuccessful at doing so. This is operationalized through the following training objectives, optimized jointly:

\arg\min_{\phi} \; L(g_\phi(f(p_i, \theta)), z_{p,i}) + L(g_\phi(f(h_i, \theta)), z_{h,i})    (1)

\arg\min_{f,\theta} \; L(g_\theta(f(p_i, \theta), f(h_i, \theta)), y_i) - \left( L(g_\phi(f(p_i, \theta)), z_{p,i}) + L(g_\phi(f(h_i, \theta)), z_{h,i}) \right)    (2)

where L is the cross-entropy loss. The optimization is implemented through a Gradient Reversal Layer (Ganin and Lempitsky, 2015), g_λ, placed between the sentence encoder and the adversarial classifier. It acts as an identity function in the forward pass, but during backpropagation it scales the gradients by a factor of −λ [Footnote: λ controls the extent to which we try to suppress the property.], resulting in the objective:

\arg\min_{f,\theta} \; L(g_\theta(f(p_i, \theta), f(h_i, \theta)), y_i) + L(g_\phi(g_\lambda(f(p_i, \theta))), z_{p,i}) + L(g_\phi(g_\lambda(f(h_i, \theta))), z_{h,i})    (3)

Implementation details: We implemented the adversarial model using the DyNet framework (Neubig et al., 2017), with a BiLSTM architecture with a hidden dimension of 200 units. Fixed-length vector representations are constructed from the last hidden state, and the model is trained for up to 10 epochs using early stopping. The attacker classifier is a 1-layer MLP with a hidden dimension of 200 units. A sketch of the gradient-reversal mechanism follows.
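To make the gradient-reversal mechanism concrete, here is a minimal, hypothetical sketch in PyTorch (the paper's implementation is in DyNet; the function and parameter names are ours):

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lambda in the
    backward pass (Ganin and Lempitsky, 2015)."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse and scale the gradient flowing back into the encoder;
        # `lam` itself receives no gradient.
        return -ctx.lam * grad_output, None

def training_step(encoder, main_clf, adv_clf, p, h, y, z_p, z_h, lam, loss_fn):
    # Joint objective (Eq. 3): main-task loss plus adversarial losses
    # computed on gradient-reversed sentence representations.
    u, v = encoder(p), encoder(h)
    main_loss = loss_fn(main_clf(u, v), y)
    adv_loss = (loss_fn(adv_clf(GradientReversal.apply(u, lam)), z_p)
                + loss_fn(adv_clf(GradientReversal.apply(v, lam)), z_h))
    return main_loss + adv_loss
```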
         Noise                  Uncorrelated           Partial                Full
         Dev    Adv.   Attack.  Dev    Adv.   Attack.  Dev    Adv.   Attack.  Dev    Adv.   Attack.
Majority 50.4   51.2   50.2     50.94  74.31  50.2     50.62  99.82  50.2     55.34  55.34  50.2
λ=0.0    100.0  -      90.3     100.0  -      93.6     100.0  -      91.08    100.0  -      100.0
λ=0.5    100.0  47.81  95.3     100.0  70.36  62.26    100.0  99.31  80.48    100.0  51.23  93.42
λ=1.0    100.0  49.43  94.5     100.0  71.28  74.1     100.0  99.79  68.8     100.0  52.37  92.58
λ=1.5    100.0  42.7   100.0    100.0  71.54  99.1     97.98  99.79  82.32    100.0  49.8   97.58
λ=2.0    100.0  46.19  99.36    100.0  70.62  99.98    100.0  94.83  91.12    100.0  40.94  94.64
λ=3.0    100.0  46.98  94.64    100.0  70.92  99.8     99.26  99.19  79.66    100.0  53.08  87.0
λ=5.0    99.98  38.87  96.92    99.94  71.0   86.6     100.0  98.73  100.0    100.0  51.32  98.74

Table 6: Adversarial performance on the synthetic tasks: Noise, Uncorrelated, Partial, Full. Dev is the accuracy of the model on the task, Adv. is the accuracy of the adversarial classifier, and Attack. is the accuracy of the attacker classifier on held-out data.

Table 6 reports the performance of the adversarial and attacker classifiers on the four test sets. To start with, we observe that when λ = 0 (no adversarial suppression), we are able to train a classifier to predict the presence of c at a near-perfect level of accuracy in all four cases. This is notable: even when the property is distributed as random noise (Noise), uncorrelated with the actual task, the model encodes it. This simple synthetic task suggests that models learn to encode linguistic properties incidentally, calling into question how we interpret absolute claims on probing tasks.

We next examine the results of the adversarial classifier at suppressing the task-irrelevant linguistic information. Our goal here is to examine whether an adversarial learning framework can help a model learn to ignore this information while still maintaining task performance. If the model succeeds, it indicates that the model does not need the particular linguistic property to perform the task. We observe that even in the adversarial training framework, a considerable amount of information about the property can be recovered by the attacker. In the case of random noise, we do not find any setting of the adversary weight λ that manages to suppress the attribute. This is consistent with the findings of Elazar and Goldberg (2018), wherein the attacker (the probing classifier in our case) manages to extract the suppressed information from the representation.

We would like to emphasize that the goal of the synthetic tasks is to provide insight into sentence-encoding dynamics, and to demonstrate that probing classifiers are successful at extracting properties that are incidental to the main task. It is problematic that probing classifiers exhibit high accuracy on task-irrelevant information, indicating that the accuracy of probes cannot be relied upon as a measure of what the model actually relied upon to solve a task. We further explore issues of representation capacity and probing-classifier expressivity, as well as strategies for strengthening the adversarial classifier:

[Figure 5: Task and probing performance of BiLSTM-Last on the Noise, Uncorrelated, Partial and Full synthetic datasets. (a) Main task and attacker accuracy as a function of the capacity of the sentence representation. (b) Main task and attacker accuracy as a function of the capacity of the adversarial classifier, for two values of λ. (c) Main task and attacker accuracy as a function of the capacity of the probing classifier, for two values of λ.]
Representation size: Does the dimensionality of the sentence representation affect its propensity to encode task-specific linguistic information without encoding task-irrelevant linguistic information? We hypothesize that models with lower capacity might tend to encode task-specific information at the expense of other linguistic properties. To examine this, we train the BiLSTM architecture with hidden dimensions of 10, 50, 100, 200, 300 and 600 units, and train an attacker classifier, as shown in Figure 5a. We observe that while task accuracy remains consistent across the choice of dimension, attacker accuracy does decrease for models with lower capacity, across categories. This suggests that the capacity of the representation may play a role in which information it encodes.
Adversarial classifier capacity: Does the capacity of the adversarial classifier influence the model's ability to suppress information about the property? We hypothesize that a more powerful adversarial classifier might be more effective at suppressing task-irrelevant information. To examine this, we hold the attacker classifier constant and experiment with 1-layer and 2-layer MLP adversarial classifiers with hidden dimensions of 100, 200, 1000, 5000 and 10000 units. These results are reported in Figure 5b. We observe that varying the capacity of the adversarial classifier can decrease attacker accuracy, though the best choice of capacity depends on the setup used.
Probing classifier capacity: Does adversarial suppression depend on the choice of probing classifier? We examine whether adversarial suppression can decrease the ease with which task-irrelevant information can be extracted from the representation. To examine this, we experiment with probing classifiers utilizing 1-layer and 2-layer MLPs with dimensions of 10, 50, 100, 200 and 1000 units. These results are shown in Figure 5c. We find a nuanced picture: adversarial suppression does seem to reduce the ease of extracting the information when the linguistic property is encoded as random noise, but not under any other distribution of the property.

Considerations: 1) In the synthetic tests, the main task function is learnable by a neural network. In practice, for most NLP datasets this might not be true, making it difficult for models to reach comparable task performance while suppressing correlated linguistic properties. 2) Information might be encoded, but not recoverable by the choice of probing classifier. Additionally, a more expressive adversarial classifier can "hide" information from the probing classifier (Elazar and Goldberg, 2018). 3) If comparable task accuracy cannot be reached, one cannot conclude that a property is irrelevant.

We briefly discuss our findings, with the goal of providing considerations for deciding which inferences can be drawn from a probing study, and highlighting avenues for future research.
Linguistic properties can be incidentally encoded: Probing only indicates that some property correlated with a linguistic property of interest is encoded in the sentence representation; we speculate that it cannot isolate what that property might be, whether the correlation is meaningful, or how many such properties exist. As we see through our controlled synthetic tests, even if a particular property is not needed for a task, a probing classifier can achieve high accuracy. Thus, probing cannot determine whether a property is actually needed to do a task, and should not be used to "pinpoint the information a model is relying upon" (Conneau et al., 2018). A negative result here can be more meaningful than a positive one. Adversarially suppressing the property may help determine whether an alternate explanation is readily available to the model, with an appropriate choice of probing classifier. [Footnote: All claims related to probing-task accuracy, as in most prior work, are with respect to the probing classifier used.] In this case, if the model maintains task accuracy while the information is suppressed, we can conclude that the property is not needed by the model for the task; but its failure to do so is not indicative of the property's importance. [Footnote: This could be because the main task is more complex to learn or unlearnable, or because multiple alternate confounds are present in the data which are not representative of the decision-making needed for the main task, for example.]
Careful controls and baselines: We emphasize the need for work on probing to establish careful controls and baselines when reporting experimental results. When probing accuracy for a linguistic competence is high, we speculate that it may not be directly attributable to the training task. In this work, we identify two confounds: incidental encoding, and interaction between training tasks. We leave it to future work to determine the causes of incidental encoding, and to identify further baselines and controls that allow reliable conclusions to be drawn from probing studies.
Lack of gold-standard data on task requirements: While prior work has discussed the different linguistic competencies that might be needed for a task based on the results of probing studies, these claims are inherently hard to quantify reliably, given that the exact linguistic competencies, as well as the extent to which they are required, are difficult to isolate for most real-world datasets. We advocate for the use of controlled test cases (such as those in §6).

Datasets are proxies for tasks, and proxies are imperfect reflections: Finally, we speculate that while datasets are used as proxies for tasks, they might not reflect the full complexity of the task. Aside from having dataset-specific idiosyncrasies in the form of unwanted biases and correlations, they might also not require the full range of competencies that we expect models to need to succeed at the task. Future work would need to move beyond the probing paradigm to carefully identify what competencies are reflected in any dataset, and how representative they are of overall task requirements.
What probes are good for: We would like to emphasize that this work only reflects on the implications of probing as a tool for gaining insight into what information models use to do a task. However, when sentence representations are subsequently used downstream, probing can give insight into what information is encoded in the model (irrespective of how that encoding came to be). Future directions include exploring the connection between the information encoded in a representation and whether models successfully learn to use it in downstream tasks.
The probing paradigm has evinced considerable interest as a useful tool for model interpretability, providing insights into the information models rely on to do tasks, and into the requirements of the tasks themselves. In this work we identify several considerations to be taken into account when probing sentence representations, most strikingly that linguistic properties can be incidentally encoded even when not needed for a main task. This line of questioning highlights several fruitful areas for future research: how to successfully identify the set of linguistic competencies necessary for a dataset, and consequently how well any dataset meets task requirements; how to reliably identify the exact information models rely upon to make predictions; and how to draw connections between the information encoded by a model and that used by a model downstream.
Acknowledgments
This research was supported in part by grants from the National Science Foundation Secure and Trustworthy Computing program (CNS-1330596, CNS-15-13957, CNS-1801316, CNS-1914486) and a DARPA Brandeis grant (FA8750-15-2-0277). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the NSF, DARPA, or the US Government. Y.B. was supported by the Harvard Mind, Brain, and Behavior Initiative. The authors would like to extend special gratitude to Carolyn Rose and Aakanksha Naik for insightful discussions related to this work. The authors are also grateful to Yanai Elazar, Paul Michel, Shruti Rijhwani and Siddharth Dalmia for reviews while drafting this paper, and to Marco Baroni for answering questions about the SentEval probing tasks.
References
Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In International Conference on Learning Representations.
Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. In International Conference on Learning Representations.
Christoph Alt, Aleksandra Gabryszak, and Leonhard Hennig. 2020. Probing linguistic features of sentence-level representations in neural relation extraction.
Gabor Angeli and Christopher D. Manning. 2014. NaturalLI: Natural logic inference for common sense reasoning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 534–545.
Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017a. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872.
Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017b. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872. Association for Computational Linguistics.
Yonatan Belinkov, Adam Poliak, Stuart M. Shieber, Benjamin Van Durme, and Alexander M. Rush. 2019. Don't take the premise for granted: Mitigating artifacts in natural language inference. arXiv preprint arXiv:1907.04380.
Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668, Vancouver, Canada. Association for Computational Linguistics.
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.
Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia. Association for Computational Linguistics.
Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Johan Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal, Massimo Poesio, et al. 1996. Using the framework. Technical report.
Ido Dagan, Bill Dolan, Bernardo Magnini, and Dan Roth. 2010. The fourth PASCAL recognizing textual entailment challenge. Journal of Natural Language Engineering.
Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment, pages 177–190. Springer.
Marie-Catherine de Marneffe, Sebastian Padó, and Christopher D. Manning. 2009. Multi-word expressions in textual inference: Much ado about nothing? In Proceedings of the 2009 Workshop on Applied Textual Inference, pages 1–9. Association for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Yanai Elazar and Yoav Goldberg. 2018. Adversarial removal of demographic attributes from text data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 11–21.
Allyson Ettinger, Ahmed Elgohary, and Philip Resnik. 2016. Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 134–139.
Yaroslav Fyodorov. A natural logic inference system. Citeseer.
Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning (Volume 37), pages 1180–1189. JMLR.org.
Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9. Association for Computational Linguistics.
Mario Giulianelli, Jack Harding, Florian Mohnert, Dieuwke Hupkes, and Willem Zuidema. 2018. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 240–248, Brussels, Belgium. Association for Computational Linguistics.
Oren Glickman, Ido Dagan, and Moshe Koppel. 2005. Web based probabilistic textual entailment.
Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 650–655, Melbourne, Australia. Association for Computational Linguistics.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.
Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324.
Aria Haghighi, Andrew Ng, and Christopher Manning. 2005. Robust textual inference via graph matching. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing.
Sanda Harabagiu and Andrew Hickl. 2006. Methods for using textual entailment in open-domain question answering. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 905–912. Association for Computational Linguistics.
John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China. Association for Computational Linguistics.
John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). Association for Computational Linguistics.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. 2018. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926.
Pierre Isabelle, Colin Cherry, and George Foster. 2017. A challenge set approach to evaluating machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2486–2496, Copenhagen, Denmark. Association for Computational Linguistics.
Divyansh Kaushik and Zachary C. Lipton. 2018. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5010–5015, Brussels, Belgium. Association for Computational Linguistics.
Najoung Kim, Roma Patel, Adam Poliak, Patrick Xia, Alex Wang, Tom McCoy, Ian Tenney, Alexis Ross, Tal Linzen, Benjamin Van Durme, Samuel R. Bowman, and Ellie Pavlick. 2019. Probing what different NLP tasks teach machines about function word comprehension. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 235–249, Minneapolis, Minnesota. Association for Computational Linguistics.
Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar. Association for Computational Linguistics.
Wuwei Lan and Wei Xu. 2018. Neural network models for paraphrase identification, semantic textual similarity, natural language inference, and question answering. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3890–3902, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019a. Linguistic knowledge and transferability of contextual representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Nelson F. Liu, Roy Schwartz, and Noah A. Smith. 2019b. Inoculation by fine-tuning: A method for analyzing challenge datasets. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2171–2179.
Bill MacCartney. 2009. Natural Language Inference. Stanford University.
Prodromos Malakasiotis and Ion Androutsopoulos. 2007. Learning textual entailment using SVMs and string similarity measures. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 42–47. Association for Computational Linguistics.
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, Roberto Zamparelli, et al. 2014. A SICK cure for the evaluation of compositional distributional semantic models.
Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.
Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2016. Natural language inference by tree-based convolution and heuristic matching. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 130–136, Berlin, Germany. Association for Computational Linguistics.
Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2340–2353, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, et al. 2017. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.
Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. arXiv preprint arXiv:1805.01042.
Peng Qian, Xipeng Qiu, and Xuanjing Huang. 2016. Investigating language universal and specific properties in word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1478–1488.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
Abhilasha Ravichander, Aakanksha Naik, Carolyn Rose, and Eduard Hovy. 2019. EQUATE: A benchmark evaluation framework for quantitative reasoning in natural language inference. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 349–361, Hong Kong, China. Association for Computational Linguistics.
Lorenza Romano, Milen Kouylekov, Idan Szpektor, Ido Dagan, and Alberto Lavelli. 2006. Investigating a generic paraphrase-based approach for relation extraction.
Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.
Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534.
Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.
Sara Veldhoen, Dieuwke Hupkes, Willem H. Zuidema, et al. 2016. Diagnostic classifiers revealing how neural networks process hierarchical structure.
Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length.
Alex Warstadt, Yu Cao, Ioana Grosu, Wei Peng, Hagen Blix, Yining Nie, Anna Alsop, Shikha Bordia, Haokun Liu, Alicia Parrish, Sheng-Fu Wang, Jason Phang, Anhad Mohananey, Phu Mon Htut, Paloma Jeretic, and Samuel R. Bowman. 2019. Investigating BERT's knowledge of language: Five analysis methods with NPIs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2870–2880, Hong Kong, China. Association for Computational Linguistics.
John Wieting and Douwe Kiela. 2019. No training required: Exploring random encoders for sentence classification. arXiv preprint arXiv:1901.10444.
Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
F. Zanzotto, Alessandro Moschitti, Marco Pennacchiotti, and M. Pazienza. 2006. Learning textual entailment from examples. In Second PASCAL Recognizing Textual Entailment Challenge, page 50. PASCAL.
Kelly Zhang and Samuel Bowman. 2018. Language modeling teaches you more than translation does: Lessons learned through auxiliary syntactic task analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.