When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data
Peter Hase and
Mohit Bansal
University of North Carolina at Chapel Hill
{peter,mbansal}@cs.unc.edu

Abstract
Many methods now exist for conditioning model outputs on task instructions, retrieved documents, and user-provided explanations and feedback. Rather than relying solely on examples of task inputs and outputs, these approaches use valuable additional data for improving model correctness and aligning learned models with human priors. Meanwhile, a growing body of evidence suggests that some language models can (1) store a large amount of knowledge in their parameters, and (2) perform inference over tasks in textual inputs at test time. These results raise the possibility that, for some tasks, humans cannot explain to a model any more about the task than it already knows or could infer on its own. In this paper, we study the circumstances under which explanations of individual data points can (or cannot) improve modeling performance. In order to carefully control important properties of the data and explanations, we introduce a synthetic dataset for experiments, and we also make use of three existing datasets with explanations: e-SNLI, TACRED, and SemEval. We first give a formal framework for the available modeling approaches, in which explanation data can be used as model inputs, as targets, or as a prior. After arguing that the most promising role for explanation data is as model inputs, we propose to use a retrieval-based method and show that it solves our synthetic task with accuracies upwards of 95%, while baselines without explanation data achieve below 65% accuracy. We then identify properties of datasets for which retrieval-based modeling fails. With the three existing datasets, we find no improvements from explanation retrieval. Drawing on findings from our synthetic task, we suggest that at least one of six preconditions for successful modeling fails to hold with these datasets. Our code and data will be made publicly available at: https://github.com/peterbhase/ExplanationRoles
1. Introduction
To provide signal for learning, traditional supervised learning algorithms use labels consisting of class IDs or a number in regression settings. Yet training models with data in this form provides the minimum possible supervision for learning a task. Consider how deeply this style of learning contrasts with the way a person can learn a task by getting verbal explanations from someone helping them, in addition to just the error signal from their performance. Access to such feedback can accelerate learning, resulting in less error-prone behavior, while also aligning the learned behavior with the teacher's prior on what behaviors are good. Since this sort of training should yield efficient and safe outcomes, the contrast between machine and human learning points to a natural question: How can we incorporate natural language explanations into learning algorithms?

A long line of past work has sought to use explanations, rationales, instructions, and other similar data to improve models. Proposed methods use explanations to constrain or regularize the learned model (Zaidan et al., 2007; Small et al., 2011; Ba et al., 2015; Zhang et al., 2016; Srivastava et al., 2017; Andreas et al., 2018; Liang et al., 2020), to automatically label data for data augmentation (Hancock et al., 2018; Wang et al., 2019a; Awasthi et al., 2020), as additional supervision (Narang et al., 2020; Hase et al., 2020; Pruthi et al., 2020) or intermediate structured variables (Camburu et al., 2018; Rajani et al., 2019; Wiegreffe et al., 2020), and simply as model inputs (Rupprecht et al., 2018; Co-Reyes et al., 2019; Zhou et al., 2020).

What is surprising about the sheer breadth of approaches in these works is that they all aim to incorporate essentially the same kinds of information. We can describe each of these approaches as trying to augment models with (1) information not available through their inputs or in their parametric knowledge, or (2) a further specification of the task that is informative about which models are good. Improving models in this manner is a natural goal of approaches using explanations, since one purpose of an explanation is to communicate a mental model (Doshi-Velez & Kim, 2017; Miller, 2019). But how do explanations get used as additional targets, as inputs, as regularizers, as structured variables, and as rules for automatic data labeling? Even under a general notion of what an "explanation" is, e.g. the answer to some why-question (Miller, 2019), this kind of data plays an impressive number of roles.

Yet there are tasks where explanations do not fulfill these roles effectively, as improvements in performance prove elusive even when thousands of explanations are gathered (Narang et al., 2020; Hase et al., 2020). In fact, there is reason to think that for some tasks models will not need additional information or further task specification of the kind explanations provide. This is because large language models now (1) store a great amount of knowledge in their parameters (Roberts et al., 2020; Lewis et al., 2020), and (2) infer tasks at test time from the input itself (Radford et al., 2019; Brown et al., 2020; Weller et al., 2020). So in some situations we may not be able to explain to a model more about a task or a data point than it already knows or could infer on its own.
What remains unclear, however, is the set of conditions which distinguish situations where explanations will be helpful from those where they will not be helpful in practice or cannot be in principle.

In this paper, we (1) give an argument for the role of explanations in modeling that helps us understand how explanations have been used in such distinct ways and points us toward suitable methods, and (2) experimentally study the conditions under which explanations are or are not helpful to models, using a specially designed synthetic task and three existing datasets with explanations given for individual data points. The modeling approach we ultimately propose is to perform retrieval over past explanations and provide them as inputs to a model at prediction time (see Sec. 2.6), which is the approach we reach following our broader argument in Sec. 2. Our synthetic task (described in Sec. 3) is designed to have analogous properties to existing real (i.e., human-curated) data, and it is especially useful here as it enables us to test a number of hypotheses that we could not test with existing datasets.

Using RoBERTa as a representative large language model (Liu et al., 2019) and Sentence-BERT as a retrieval model (Reimers & Gurevych, 2019), we investigate a number of primary research questions, each given with brief context:
RQ1. Since some models can infer tasks from sequences at test time, providing task information may not be helpful. When can models solve our synthetic problem by inferring each sequence's task, and when must they be given the task information?

RQ2. Explanations seen in the past may help with predicting future data points. Can retrieval of past explanations enable a model to solve our task?

RQ3. Useful information might be distributed over several explanations. Can models aggregate information across explanations for better prediction?

RQ4. We can let pretrained models combine explanations by giving them as textual input, or we can pool extracted feature representations. What is the best way to compute explanation representations for prediction?

RQ5. Good explanations pertain to the data point they are given for, but what makes an explanation relevant across data points? What enables a retrieval model to find relevant explanations for a new data point?

RQ6. One intuitive use case for explanations is to encourage models to rely on causal features rather than spurious correlations. Can explanations help models learn to use strong features rather than weak ones?

RQ7. Here, the training signal for a retrieval model depends on how the classifier uses the explanations the initial retrieval model can provide. How does the co-dependence between classifier and retrieval model influence the viability of joint training?

RQ8. After identifying a set of conditions which determine whether retrieval-based modeling can succeed in our synthetic task, we ask: does retrieval of explanations improve model performance on existing datasets?

Figure 1. Hypothetical data and explanations for illustration ("Illustrative Example"). In these examples, s is the kind of input one might expect a model to produce the correct output for after some amount of finetuning on (s, y) pairs. For some models s may be sufficient, while others may benefit from additional information as provided by τ or e.
2. Formalizing the Role of Explanations in Modeling Data
In what follows we discuss what we mean by the term "explanation" (Sec. 2.1), our formal framework for the uses of explanations in modeling and relevant work on the subject (Sec. 2.2), a unified view of the roles of explanations in modeling (Sec. 2.3), how explanations complement the input in NLP tasks (Sec. 2.4), and the model we use in this paper (Sec. 2.6).
The term "explanation" has no consistent definition in machine learning, as methods papers use it in multiple senses and even opinion papers present definitions of limited specificity. For our present purposes, we use the term to refer to the kinds of data one might collect if asking a person to answer the question, "Why does data point x have label y?" This is a generic formulation of the explanation as an answer to a why-question of the kind discussed in Miller (2019). For a more extensive discussion of explanations in the context of AI, we refer the reader to this work. Rather than try to give a delimiting, formal definition of the kind of data generated from this question, we proceed with some illustrative examples, shown in Fig. 1. In Sec. 5, we describe human explanations used in experiments.

In this section, we lay out our theory of how explanations may be used in modeling a task. With a focus on supervised learning, we characterize the modeling process here in terms of MAP inference over model parameters θ:

θ̂ = argmax_θ p(θ | X, Y), with p(θ | X, Y) ∝ p(Y | X, θ) p(θ),

where Y is a set of labels for inputs X, and the pair constitute a standard supervised learning task. We refer to the role of Y in this probabilistic model as the target, X as an input, and p(θ) as a prior. Whereas X is intended as the data observed at prediction time, we allow for a latent variable Z to be included as follows:

p(θ | X, Y) ∝ ∫_Z p(Y | Z, X, θ) p(Z | X) p(θ) dZ.

Both X and Z will be considered as "model inputs" below. Note that we intend this framework to extend to a many-task situation, which we define as a case where several distinct conditional distributions produce the data {Y, X}. All that is required is that Z indicate the task τ to be solved, i.e. which conditional distribution should be computed.

A few examples: For supervised classification, the task is to map an input X to a label Y, and Z could be a document retrieved from a database. In autoregressive modeling, X and Y are sequences of tokens which may appear in either role depending on the current context and positions to be predicted, and Z could be a textual description of the sequence prediction task or a set of unobserved token positions one intends to marginalize over.

Below we describe existing approaches to using explanations, categorized in our framework. An overview of the corresponding graphical models is shown in Fig. 2 in Supplement (2.3). We will refer to tasks interchangeably with a function to be computed and parameters of the true model of the data. We mean to index the conditional distribution for each task and refer to the parameterized function that computes it: p_τ(y | x) = f_θ(x).

Using Explanations as Targets.
Explanations may be used as additional supervision (i.e. as Y), depending on the ultimate modeling goals (shown as Multi-Task in Fig. 2). For instance, Pruthi et al. (2020) consider the use of attention-weight explanations (from a model) as targets in a multi-task framework, and they find that the explanations make for useful targets in helping another "simulator" model predict the explained model's outputs. Meanwhile, natural language explanations have been treated as targets in a multi-task, sequence-to-sequence framework, using datasets with free-form textual annotations for each data point (Camburu et al., 2018; Narang et al., 2020; Hase et al., 2020; Wiegreffe et al., 2020). None of these works find improvements in task performance from incorporating explanations. It is surprising and possibly concerning that a model could learn to generate coherent "explanations" without the learning of this ability influencing the models that are found for the task, as measured by task performance.

Using Explanations as Inputs.
Using additional model inputs may be valuable for solving some tasks (i.e. additional X or Z). The first family of approaches here uses explanations directly as model inputs for each data point (Per Data Point Input in Fig. 2). Talmor et al. (2020) systematically study RoBERTa's ability to combine pieces of knowledge in a reasoning task by including relevant factoids in the text input. In other settings, Co-Reyes et al. (2019) provide online natural language feedback to RL agents, which helps them learn new tasks on the fly, and Rupprecht et al. (2018) take a similar approach to interactive image segmentation with language feedback.

A key question with these approaches is whether it is sensible to collect explanations at prediction time. In an interactive setting, this is reasonable given that human attention is already demanded and system performance is monitored by a human. However, for cases where total automation is a desired outcome, it may not be feasible to collect explanations at test time. There is also a risk of leaking the label through the additional data. Free-form human explanations tend to directly reveal the label when collected for tasks such as NLI and QA (Hase et al., 2020; Wiegreffe et al., 2020). Here, what is essentially the cost of human labeling could be mistaken as an improvement in model performance.

There are a few ways to avoid collecting explanations at test time. In ExpBERT (Murty et al., 2020), a model conditions on vector representations of an input x and a single, "global" set of explanations in order to make each prediction (shown as Global Set in Fig. 2). This can work well for handling up to a hundred or so explanations, but cannot scale to settings with many thousands of explanations. Zhou et al. (2020) treat explanations as latent variables when modeling datasets where only a subset of data points have explanations, and at inference time they retrieve explanations from the training data (Retrieval in Fig. 2, with one difference noted here). However, they do not learn the retrieval model, and during training they allow for a data point's own explanation to be conditioned on as its label is predicted. Since explanations are not available for test data points, this leads to distribution shift between training and test-time inference, and it may introduce label leakage during training predictions. Instead of retrieving explanations, a few works condition on explanations generated at test time using generative models learned with human explanations as supervision, which are represented as Single Structured Variable and Per-Label Structured Variable in Fig. 2 (Camburu et al., 2018; Rajani et al., 2019; Kumar & Talukdar, 2020; Hase et al., 2020; Wiegreffe et al., 2020). While this form of intermediate supervision could in principle help models learn useful structured variables (the explanations) for prediction, these methods have not produced sustained improvements in model accuracy.

Figure 2. Graphical models for several approaches to using explanations as targets (Multi-Task), as inputs (Per Data Point Input, Global Set, Retrieval, Single Structured Variable, Per-Label Structured Variable), and as priors (Regularizer or Hypernetwork, Data Augmentation). Typically past works do not condition on human-given explanations at test time, unless they are collected in an interactive manner with a user or specially designed to not leak the data point label. Note prior works may add or remove dependencies in some cases.
Using Explanations as Priors.
Here, we group together any approach to defining or learning a distribution over model parameters, including those that condition on some data, p(θ | data). We note that this is a prior over model weights not in the sense that the distribution is independent of any data (which it is not), but rather in the sense that the posterior parameters are conditioned on the prior. One natural way to use explanations is to constrain the learned model, e.g. by constraining the conditional distributions the function can express (Srivastava et al., 2017; 2018), or through placing priors over how features are weighted or extracted (Zaidan et al., 2007; Small et al., 2011; Zhang et al., 2016; Ross et al., 2017; Bao et al., 2018; Selvaraju et al., 2019; Liang et al., 2020; Stammer et al., 2020). Other works map directly from text to parameters in models (Ba et al., 2015; Andreas et al., 2018), in effect learning a prior p(θ | text) (though Andreas et al. (2018) condition on generated rather than human-provided text at test time). These methods are all effectively described by Regularizer or Hypernetwork in Fig. 2. Lastly, a few approaches learn to use explanations for automatically labeling data for data augmentation purposes (Hancock et al., 2018; Wang et al., 2019b; Awasthi et al., 2020), which is effectively augmenting a task with data drawn from some prior distribution p_θ(y | x) given by the noisy labeling mechanism (shown as Data Augmentation in Fig. 2). Critically, in each of these cases, the prior over model weights is some function of explanations, meaning that we require an interpretation I, whether learned or given by humans, of how the explanations encode information about the model. We will write that a prior over models is given by an interpretation function on a set of explanations: p(θ | {e}) = I({e}). This kind of function can serve either as a regularizer during training or as a hypernetwork that directly outputs model parameters or, equivalently, some task representation (Ha et al., 2017).
Each of the above methods of supplying information to the modeling process may appear rather distinct, but in principle they can all be used to influence the behavior of a learning algorithm as represented in the posterior parameters. In fact, we observe situations where a single piece of data can be used either as a target, input, or information yielding a prior. Below, we describe a few such situations in simplified terms, providing some justification for how a single format of explanation data might be used as a label, input, or prior. Ultimately, the fact that these various roles can fulfill a single purpose helps us understand how explanations have historically been used with some success in each of the apparently disparate roles. We should note that it was already clear that training better models was one goal of using explanations in modeling. We would expect a priori that explanations are suited to this goal given that one underlying purpose of explanation is the communication of a mental model (Doshi-Velez & Kim, 2017; Miller, 2019).
Using Data as a Target or Prior.
Adopting terminology from Pruthi et al. (2020), we refer to a teacher giving explanations to a student who is learning a task. Suppose a student is modeling a simple 1-D regression problem (x ∈ ℝ) as y ∼ N(y | θx, σ²), for data D = {x_i, y_i}_{i=1}^{n}, using a known σ and a Normal prior p(θ). In this case, the teacher could in principle induce any MAP estimate they wish by adding a single data point (x_1, y′) to D, a copy of the first data point with a new label. Of course, the teacher could also induce any desired MAP estimate by directly modifying the student's prior using a particular interpretation function, p(θ | y′) = I(y′). This is simply an illustrative example where one can achieve the same learning outcomes either by providing additional targets or by using a particular prior. A more serious analysis would be required to formalize the argument for neural language models and objectives for structured outputs. Thus far, natural language explanations have made no difference to task performance when used as targets (Narang et al., 2020; Hase et al., 2020). The evidence is more favorable for using attention weights from a model as targets, but Pruthi et al. (2020) find this form of explanation to work better as a prior.
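To make the claim about the added data point concrete, here is a small worked example of our own, assuming for illustration a zero-mean prior θ ∼ N(0, τ²) (the argument only requires some Normal prior). The MAP estimate before and after the teacher adds (x_1, y′) is

θ̂ = (Σ_i x_i y_i) / (Σ_i x_i² + σ²/τ²),    θ̂′ = (Σ_i x_i y_i + x_1 y′) / (Σ_i x_i² + x_1² + σ²/τ²).

For any target value θ*, choosing y′ = [θ*(Σ_i x_i² + x_1² + σ²/τ²) − Σ_i x_i y_i] / x_1 (with x_1 ≠ 0) yields θ̂′ = θ*; the same θ* could instead be enforced directly through the prior p(θ | y′) = I(y′).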
Using Data as an Input or Prior. Now consider a multivariate regression setting with y ∈ ℝ and features x = (x_1, x_2) with x_1 ∈ ℝ and x_2 ∈ {0, 1}, where the true model is: y is linear in a continuous feature x_1, with the strength of the relationship modulated by the binary feature x_2 (written as y = β_1 x_1 + x_2 · β_2 x_1). Notice that, per our definition of a task above, x_2 is exactly a task representation τ, since it controls which of multiple functions defines the conditional relationship p(y | x_1). Hence, we can treat x_2 as a task representation and define an interpretation I to give a prior over the weight on x_1, p(β | x_2) = I(x_2). A model of this form takes the appearance

p(y | x) = p(y | x_1; β) p(β | x_2) p(β).    (1)

Interestingly, there will exist equally predictive models of the form (1) as there will for a standard regression model,

p(y | x) = p(y | x_1, x_2; β_1, β_2) p(β_1, β_2).    (2)

With the benefit of hindsight, we can say that the simplest interpretation function to represent p(β | x_2) places a point mass on β_1 when x_2 = 0 and on β_1 + β_2 when x_2 = 1. But we could also learn the prior p(β | x_2), either with direct supervision for β, by differentiating through a point estimate β̂, or by marginalizing over a random variable for β. In this manner, one can learn equally predictive models treating x_2 as an input to a single learned function or as a task representation that carries information about model parameters. As before, this is only a simple example, and a more formal analysis would be required to precisely identify this phenomenon when using textual data with methods that may perform interpretation and prediction within one large computation graph (i.e. existing neural models).

The ambiguity between considering data as an input or prior is of great relevance in NLP now, as a growing body of evidence suggests that pretraining language models teaches them how to do inference over tasks at test time. Indeed it appears that sufficiently large language models do "infer and perform the tasks demonstrated in natural language sequences in order to better predict them," as Radford et al. (2019) hoped for. For example, GPT-3 metalearns how to do sequence prediction over the course of pretraining, which equips it for zero-shot prediction given task descriptions and examples (Brown et al., 2020). Even GPT-2 demonstrates the ability to infer the task at prediction time, e.g. for summarization purposes given the "tl;dr" prompt (Radford et al., 2019).

But these results leave open the question of when and to what degree task information is helpful for prediction in well-defined tasks. In question answering, for example, when should we think that inferring or conditioning on task information is helpful, as opposed to relying on a task's "input" alone? In fact, for cases like QA, it is even difficult to identify what counts as a sufficient input for the task to be solvable by some model without additional information clarifying ambiguities in the task or providing relevant background knowledge. Consider that Roberts et al. (2020) use pretrained models to answer questions without any further input, while Lewis et al. (2020) find it helpful to retrieve relevant documents from Wikipedia
before answering, drawing a distinction between parametric and non-parametric model memory. Yet when Weller et al. (2020) study how models generalize across tasks when conditioning directly on task descriptions, they formulate the descriptions as questions with the task's data given as accompanying documents. Hence we see one model's input x used as another model's task description τ, and in both situations additional (possibly retrieved) data can improve task performance.

Our experiments provide some answers to the remaining question of when and to what degree task information is helpful, and based on our experiments in Sec. 6, we describe conditions for models (1) inferring tasks from the input alone, (2) benefiting from the retrieval of additional information, and (3) being able to learn the retrieval.

In this paper we assume we have data of the form D = {x_i, y_i, e_i}_{i=1}^{n}, where (x, y) is a standard data point in a supervised classification task, and e is some kind of data collected in response to a question like "why does data point x have label y?" In our experiments with both synthetic and human-curated data, x and e are sequences of tokens from a fixed vocabulary. The approaches we present will allow for unexplained training data, meaning some or even most e_i may be missing. The model may use any number of free-floating explanations too, i.e. e_i without corresponding (x_i, y_i) pairs, though this does not apply to datasets in this paper.

Figure 3. A depiction of our retrieval-based method TEXTCAT. A total of Ck explanations are retrieved and allocated into k latent variables, each a set of explanations E, which are marginalized over to produce a final prediction.

Here, we introduce our chosen model for incorporating explanation data, which makes use of explanations as model inputs after they are retrieved from the training data (the "Retrieval" graphical model in Fig. 2). Given our discussion above, a few reasons point us in this direction: (1) since past explanations may be useful for future predictions, while collecting explanations is costly and can lead to label leakage, we want to avoid collecting explanations at test time; (2) this method may directly condition on relevant information that is useful for reasoning tasks (Talmor et al., 2020); (3) textual data can provide useful task information when serving as a model input, and hence this is a natural way to learn a prior over tasks (Brown et al., 2020; Weller et al., 2020); (4) retrieval is more scalable than conditioning on a global set of explanations; and (5) using explanations as structured variables and as targets do not appear to be promising approaches at the moment (Hase et al., 2020; Wiegreffe et al., 2020; Pruthi et al., 2020).

So, we use a retrieval-based model that treats retrieved explanations as latent variables to be marginalized over. Our approach is similar to Lewis et al. (2020), who marginalize over latent documents retrieved from Wikipedia for question answering, question generation, and fact verification. The marginal distribution is given as

p_Θ(y | x) = Σ_{e ∈ top-k(p_η(· | x))} p_θ(y | x, e) p_η(e | x),

where top-k gets the top k texts as ranked by the retrieval model, p_η(e | x). Note that we never retrieve a data point's own explanation when predicting its label. We do so because explanations can leak the label (Hase et al., 2020) and this approach matches the test-time distribution, where we assume explanations are not collected for new data points (see discussion in Sec. 2).
Zhou et al. (2020) also propose to use explanations as latent variables and retrieve explanations at inference time, but they do not learn the retrieval model, marginalize over the latents during inference, or prohibit data points' own explanations from being retrieved. In our experiments, we compare with their original approach and a version where we marginalize over the latents and learn the retrieval model.

The form of p_η(e | x) follows Lewis et al. (2020) and Karpukhin et al. (2020). Given a query x, unnormalized probabilities are computed as p_η(e | x) ∝ exp(f_η(e)^⊤ f_η(x)), where f_η embeds each sequence into a vector. To compute top-k(p_η(· | x)), we search through the training explanations using FAISS (Johnson et al., 2017). We discuss methods for computing p_θ(y | x, e) and f_η(e | x) in Sec. 4. Because it may be helpful to reason over multiple explanations at once, we extend this model to allow for explanations to be composed into a single "document." Assuming explanations to be conditionally independent given x, we can compute the probability of a set of explanations E = {e_c}_{c=1}^{C} as

p(E | x) ∝ exp( Σ_{e ∈ E} f_η(e)^⊤ f_η(x) ),

where (1) a context size C will control the size of the explanation set, (2) a value of k implies that the top Ck explanations will be retrieved, and (3) we sort these Ck explanations into sets in order of their probability p_η(e | x). We represent the overall approach in Fig. 3 for one method of computing p_θ(y | x, E) (described fully in Sec. 4), where explanations are concatenated with the query sequence. Flowing from left to right, Fig. 3 shows how explanations are retrieved from the training data conditioned on a query sequence x, then allocated into k classifier inputs with C explanations each. The k classifier predictions are aggregated by marginalizing over the latent variable Z = E.

Modeling Assumptions. In using retrieval, we make a few assumptions. First, since the number of forward passes per data point scales with k, we require a relatively small value of k for reasonable computational efficiency in SGD-based training. Hence, we must assume that this summation is sufficiently similar to the full summation over latent variables. This assumption is more likely to hold when (1) a small number of documents account for most of the probability mass in p_η(e | x), and (2) a pretrained model p_η(e | x) yields a decent initial rank-ordering, such that some of the best documents are in the top-k. The exact value of k we use depends on the experiment. A second, more basic assumption is that explanations will be useful in predicting other data points' labels. Such an assumption is needed since we never condition on a data point's own explanation. We study how the "relevance" of explanations to other data points influences task solvability through experiments in Sec. 6.5. Lastly, during retrieval we assume that explanations are independent given x, i.e. p(E | x) = Π_{e ∈ E} p(e | x). This could be a poor assumption when, for instance, explanations each contribute one of a number of needed facts, in which case it would be helpful to retrieve additional explanations conditioned on what has already been retrieved.
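As a concrete illustration of the marginalization described above, the following minimal PyTorch sketch (ours, not the authors' released code) scores explanations with a dot product, allocates the top Ck into k sets of C, and marginalizes classifier predictions over those sets. The encoder embeddings and the classifier are random stand-ins for Sentence-BERT and the RoBERTa-based classifiers of Sec. 4, and p(E | x) is renormalized over only the k retained sets.

```python
import torch
import torch.nn.functional as F

def marginal_predict(query_emb, expl_embs, classify, C=2, k=3):
    # Retrieval scores f_eta(e)^T f_eta(x) for every candidate explanation.
    scores = expl_embs @ query_emb                        # (N,)
    top = torch.topk(scores, C * k)                       # top Ck explanations overall
    # Allocate the top Ck explanations into k sets of C, in order of retrieval score.
    sets = [top.indices[i * C:(i + 1) * C] for i in range(k)]
    # p(E | x) proportional to exp(sum of member scores), renormalized over the k sets kept.
    set_scores = torch.stack([scores[s].sum() for s in sets])
    set_probs = F.softmax(set_scores, dim=0)              # (k,)
    # Marginalize the classifier: p(y | x) = sum_E p(E | x) p(y | x, E).
    class_probs = torch.stack([classify(s) for s in sets])    # (k, num_classes)
    return (set_probs.unsqueeze(1) * class_probs).sum(dim=0)

# Toy usage with random embeddings and a stub classifier standing in for RoBERTa + TEXTCAT.
torch.manual_seed(0)
num_expl, dim, num_classes = 100, 16, 2
expl_embs = F.normalize(torch.randn(num_expl, dim), dim=1)
query_emb = F.normalize(torch.randn(dim), dim=0)
stub_classify = lambda expl_ids: F.softmax(torch.randn(num_classes), dim=0)
print(marginal_predict(query_emb, expl_embs, stub_classify))
```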
3. Synthetic Task
We design a synthetic dataset so that we can carefully control several important properties of the data, though we also make use of several human-curated datasets (described in Sec. 5). Designing a synthetic dataset for the task at hand is a useful exercise for a number of reasons. At a high level, it helps us formalize our intuitions regarding what makes the task solvable or not solvable given (1) certain inputs, (2) certain modeling approaches, and (3) certain available explanations.
Figure 4. Examples of our synthetic task and analogies we draw to human-curated existing data: the index corresponds to the topic of a question or the referents in a statement (an easily computable feature that allows for retrieval over explanations); the indicator corresponds to the question itself, or an interpretation of the referent properties to look for (an indication of what should be drawn from a retrieved explanation); and the explanation corresponds to text that resolves ambiguity in the task and provides missing data.

A critical part of this procedure is that, as we do so, we make disputable decisions regarding how the synthetic task maps back onto reality. When all is said and done, one can ask if the properties of the proposed data and modeling paradigm do in fact reflect how we expect modeling will work with human-given, natural language explanations. In this spirit, we claim that our synthetic task shares a few important properties with human-curated data, which are described in Sec. 3.2. Lastly, as a practical matter, it allows us to study how various properties of the data allow for successful modeling with existing methods. In this paper, we are able to provide experimental answers to six of our eight primary research questions only through synthetic data, and not with available datasets. Hence, we introduce a synthetic task for our present purpose. For further discussion of the pros and cons of synthetic datasets, see Liu et al. (2021). In Fig. 4, we show an example data point, along with a description of how it gets its label. The premise of our task is to classify sequences using counts of different integers in the sequences. The basic idea of counting integers is drawn from De Cao et al. (2020). They propose a toy task requiring a model to count whether there are more 8s than 1s in a sequence, with the purpose of evaluating a model interpretation method.
We wish to design a task where, while it would be possible to solve the task by learning a function y = f(x), it would be easier if one could condition on relevant explanations and learn y = f(x, e). We propose a few task variants, but the core of the task is that, given a sequence x, the binary label will be determined by whether there are more of an integer m in the sequence than there are of an integer n. We assign a one-to-one map between the integers (a, b) to be counted and a set of special integers each sequence includes as its first two elements, which we term the index and indicator. For our purposes, a key property of this kind of task is that a model could succeed by memorizing the map between (index, indicator) and the integers it needs to count. However, it should be much easier to solve the task when directly conditioning on those integers, i.e. learning a function from (x, a, b) to y. Here, the "explanation" (a, b) is a plausible answer to the question of why data point x has label y, because this information determines the feature that causes the label.

Rather than just using the index to map to the two numbers that need to be counted, we include the indicator so that models can succeed by integrating information from x and e. An explanation is given as (index, m, n, r, d), where either (m, n) or (r, d) is the integer pair that actually needs to be counted. The opposite pair will be a distractor feature whose relative counts match those of the causal feature 50% of the time. Then the index will map to (m, n, r, d)_index, and the indicator will tell whether it is the first integer pair in the explanation (m, n) or the second (r, d) that needs to be counted (as displayed in Fig. 4). As a result, with num-tasks many index values, there will be 2 × num-tasks possible pairs of integers that have to be counted. In general, we will refer to a sequence's task as the function that counts the relevant integers for that particular sequence, meaning we view our dataset to be composed of many (similar) tasks, each well-defined for a set of sequences.

We next describe the exact dataset in detail. The full generative process is given in Appendix C. We give typical values that dataset parameters take on, and in Sec. 6, we note differences from this default setting as they become relevant to each experiment. The resulting data is:

1. Train set: sequences of integers (each beginning with its index and indicator), where there are num-tasks unique values of index in the dataset, drawn uniformly at random. For each index, there are 10 distinct x_i that share a common explanation, (index, m, n, r, d)_index. The values of m, n, r, and d are drawn uniformly from the integer vocabulary while filtering samples s.t. m ≠ n ≠ r ≠ d. The corresponding values of indicator are balanced between the two indicator values. Half of the points have label y = 1, meaning that either m > n or r > d, depending on which feature is causal. Half the time, the non-causal integer pair in (m, n, r, d) (i.e., the one not indicated by indicator) has counts with the same rank-ordering as the causal feature's counts. In each x_i, after m, n, r, and d have been randomly placed into the sequence, any unfilled slots are filled with integers sampled uniformly from the vocabulary.

2. Dev set: points that do not appear in Train, with the same index values and twice the number of points per index as Train.
3. Test set: points of similar construction to the Dev set, but with five times the points per index as Train.

We claim that aspects of our synthetic task are analogous to properties real (i.e. existing, human-curated) data might take on. We first highlight a few properties of the Illustrative Examples in Fig. 1. Here, s is the kind of input for which one might expect a model to produce the correct output after some amount of finetuning on an appropriate dataset, while τ offers explicit task instructions and e is an explanation of the data point's label. We expect that, for some models, τ and e will provide useful additional information for the task that is not represented in s or is difficult to infer from s. Models might more easily extract this information from τ or e than they can infer it from s, allowing for better task performance. However, a model may infer any "hidden" information perfectly well without relying on these variables, especially after some amount of finetuning on (s, y) pairs. Without finetuning, a model may already be pretrained to interpret task instructions (Brown et al., 2020), or the model may already know the hidden information (Roberts et al., 2020), meaning the knowledge is encoded in its parameters and accessible in the right circumstances.

Now, regarding our synthetic data, we first claim that e is an explanation in the sense that it is a plausible answer to the question, "why does point x have label y?" The explanation gives the information which determines the feature that causes the label, i.e. the integers that should be counted. We suggest that the index in a sequence is analogous to the topic of a question or the referents of a statement (the things referred to): both are computable features that make retrieval-based modeling possible. Likewise, good models will combine the indicator and explanation to identify the causal feature in the same way that a good QA model would figure out what to look for in a document by first understanding what the question asks for or the referent properties it should be looking for.

Our task shares another important characteristic with human-curated data: whenever retrieval could be helpful, models can learn to directly infer the hidden information from the input alone. In the synthetic task, this looks like learning the function from the index to the integers to be counted. With question answering, for example, a model could learn the map between a certain topic and the set of facts that could be needed to answer questions about that topic. This may be harder than learning a retrieval model for a given dataset, but it is possible in theory and would render the additional data for retrieval irrelevant. In our experiments in Sec. 6.1, we outline situations where this map is learned by models, making retrieval unnecessary.
Data Parameters, Relevance, and Strong Features.
There are a few parameters to the data generation that heavily shape our expectations of the task's solvability. The first is the number of unique values of index, which we will refer to as the number of tasks, num-tasks. With a fixed training set size, num-tasks determines the number of data points per task, n_task. For example, while we will typically have 10 points per task, decreasing the number of tasks would proportionally increase the number of points per task for a fixed number of training points. This is a particularly important property because it determines how explanations will be relevant across data points. Here, we define an explanation for one data point, e_i, to be relevant to another sequence s_j when e_i is informative about what sequence s_j's task τ_j is. Recall that by τ_j we refer to the function counting the integers (a, b)_j. Formally, we will say that a relevance function on s and e yields some distribution over the task parameters:

p((a, b)_j | s_j, e_i) = f(s_j, e_i).

In the standard version of our synthetic task, one such relevance function could place all probability mass on (m, n) if indicator = 1 and the index in s_j and e_i matched (or on (r, d) if indicator = 2). If the index does not match, then there would be no information about what τ_j is, since we randomly sample index and (m, n, r, d) values when pairing them. To obtain a smoother, more continuous level of relevance between sequences and explanations, we can also define a predictable relationship between index_i and (m, n, r, d)_i, so that (a, b)_i and (a, b)_j are close together (under some distance metric) whenever index_i and index_j are close together. We describe experiments comparing the two settings of binary and smooth relevance in Sec. 6.5.

Next, note that we can vary the degree to which the non-causal feature is correlated with the causal (strong) feature. In the case of perfect correlation, we have that m > n iff r > d and m < n iff r < d, regardless of which is the causal feature. This allows us to test whether explanations can induce models to rely on causal rather than non-causal (weak) features. While this is an intuitive reason for thinking explanations should be helpful for models, we show in Sec. 6.6 that models can correctly use explanations for selection between correlated features only in a narrow set of situations.

Finally, index can be removed from each sequence to more closely imitate a situation requiring task inference. While in principle models can learn the map from (index, indicator) to (a, b), in fact we find that models will infer the task even when index is removed from the sequence (Sec. 6.1). Ostensibly they do so by counting the sequence integers: those which appear often are likely to make up (m, n, r, d).

The data we have described so far includes only a single form of explanation, e = (index, m, n, r, d), which we will call our full-info condition. As long as a retrieval model returns relevant explanations, the task for a sequence can be read off from this kind of explanation. Yet, rather than giving a full description of the task, explanations in existing datasets tend to only partially specify a task or give just a piece of the hidden information for a data point, especially when annotators limit the length of their explanations to a single sentence (Camburu et al., 2018; Wang et al., 2019b). This leads us to suggest two alternative forms of explanation in our synthetic task, which we refer to as evidential and recomposable explanations.
Given an index, evidential explanations are generated by adding independent, zero-mean noise to each element in the true (m, n, r, d)_index, s.t. taking the average across a set of evidential explanations converges in the limit to the true (m, n, r, d)_index. In our experiments, we add noise ε drawn from a discrete uniform distribution over a small symmetric range.

The second explanation kind, recomposable, is designed so that one could infer the task if one had all the relevant explanations for a particular index. We create such a situation by breaking the (m, n, r, d) into parts that neatly recompose back into the true set of numbers. Principally, we do so by dividing the explanation into two pieces, (m, 0, r, 0) and (0, n, 0, d), where some points with that index have one explanation and other points have the other. We ensure that both pieces of an explanation appear at least once among the data points for each index. We also experiment with a similar setting where we decompose explanations into four pieces, but we do not include results for this condition as we find them to be quite similar to the two-piece setting.
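To make the generative process concrete, here is a minimal Python sketch of our own for the full-info condition. The sequence length, integer range, indicator values {1, 2}, and the fixed per-point counts are illustrative assumptions; the exact constants and the balanced assignment of indicator values are specified in Appendix C rather than here.

```python
import random

def make_task_map(num_tasks, vocab_max=100):
    """One-to-one map from each index value to a tuple (m, n, r, d) of distinct integers."""
    return {index: tuple(random.sample(range(1, vocab_max + 1), 4))
            for index in range(1, num_tasks + 1)}

def make_point(index, tasks, seq_len=20, vocab_max=100):
    """One (x, y, e) triple for the full-info condition."""
    m, n, r, d = tasks[index]
    indicator = random.choice([1, 2])                     # which pair in e is causal for this point
    causal, distractor = ((m, n), (r, d)) if indicator == 1 else ((r, d), (m, n))
    y = random.randint(0, 1)                              # y = 1 iff causal[0] outnumbers causal[1]
    hi, lo = (3, 1) if y == 1 else (1, 3)                 # simplified counts; the real generator randomizes them
    agree = random.random() < 0.5                         # distractor rank-order matches the causal pair 50% of the time
    body = [causal[0]] * hi + [causal[1]] * lo
    body += [distractor[0]] * (hi if agree else lo) + [distractor[1]] * (lo if agree else hi)
    filler = [v for v in range(1, vocab_max + 1) if v not in (m, n, r, d)]
    body += random.choices(filler, k=seq_len - 2 - len(body))
    random.shuffle(body)
    x = [index, indicator] + body                         # index and indicator lead the sequence
    e = (index, m, n, r, d)                               # full-info explanation
    return x, y, e

tasks = make_task_map(num_tasks=500)
print(make_point(index=7, tasks=tasks))
```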
4. Computational Methods
In this section we describe the methods used to compute p_θ(y | x, E) and p_η(e | x) (see Sec. 2.6 for the overall model description). For the classifier p_θ(y | x, E), we use two methods, TEXTCAT and H-MEAN, which are described below. Then we describe the retrieval model, which is based on Sentence-BERT (Reimers & Gurevych, 2019).

TEXTCAT. Represented in Figure 3, this method takes a straightforward approach to conditioning on a set of explanations: concatenating C explanations and the input x to form a longer sequence of text. Each of the original sequences is separated by a special token, e.g. [SEP] for BERT. In our experiments, we pass this longer sequence into a RoBERTa-base model. After pooling the output token representations, we pass the resulting vector to a 1-layer MLP for classification. We use mean pooling for our synthetic task and NLI; for relation extraction tasks, we concatenate the representations corresponding to the initial tokens in the subject and object words, since this is an especially effective pooling technique (Baldini Soares et al., 2019). This approach allows the model to reason over all of the explanations and the input together. While the method may be limited by the fact that some models can face difficulties in processing long pieces of text (Beltagy et al., 2020), this issue is partly mitigated by marginalizing over k sets of explanations. As a result of the marginalization, the final prediction can be conditioned on a far higher number (Ck) of individual explanations than could fit in the context alone.

H-MEAN. By H-MEAN, we refer to the kind of unweighted hidden representation averaging used in Co-Reyes et al. (2019) and Zhou et al. (2020). H-MEAN works by first obtaining representations of the input x and a single explanation e at a time, then passing the unweighted average of these representations to an MLP. For a fair comparison with TEXTCAT, we use the same token pooling and a 1-layer MLP. So with C explanations to condition on, x′_c = concatenate(x, e_c), and vector representations from RoBERTa, H-MEAN obtains a single representation as

h = (1/C) Σ_{c=1}^{C} RoBERTa(x′_c),

which is then passed to the MLP for classification. H-MEAN does not face the same sequence length limitations as TEXTCAT, but by separately processing each explanation it may fail to integrate information across explanations. This method also becomes expensive when we marginalize over E (which is what allows retrieval to be learned), as it requires Ck forward passes for a single prediction. We compare the two methods in Sec. 6.4.
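The following sketch (ours, using Hugging Face transformers with mean pooling and the RoBERTa separator token) illustrates how the two conditioning mechanisms build a representation from an input and C explanations; a 1-layer MLP head on either output would then give p_θ(y | x, E). The checkpoint and the toy inputs are stand-ins, not the paper's exact configuration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

def encode_mean(text):
    """Mean-pooled RoBERTa token representations for a single text sequence."""
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state       # (1, num_tokens, 768)
    return hidden.mean(dim=1).squeeze(0)                  # (768,)

def textcat_repr(x, explanations):
    # TEXTCAT: one long sequence, explanations and input joined by the separator token.
    joined = f" {tokenizer.sep_token} ".join([x] + explanations)
    return encode_mean(joined)

def hmean_repr(x, explanations):
    # H-MEAN: encode each (x, e) pair separately, then take the unweighted average.
    reps = [encode_mean(f"{x} {tokenizer.sep_token} {e}") for e in explanations]
    return torch.stack(reps).mean(dim=0)

x = "7 1 9 9 4 9 2 4 2 9"
explanations = ["7 9 2 4 6", "7 9 2 5 3"]
print(textcat_repr(x, explanations).shape, hmean_repr(x, explanations).shape)
```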
We use a similar approach to retrieval as in Lewis et al. (2020), namely using vector representations of sequences from a pretrained transformer to compute p_η(e | x) ∝ exp(f_η(e)^⊤ f_η(x)), which is followed by computing top-Ck(p_η(· | x)). We use an approximate but sub-linear time search method (FAISS) to find the top-Ck points (Johnson et al., 2017). In our experiments we find that it is necessary to use Sentence-BERT (Reimers & Gurevych, 2019) as our pretrained f_η, rather than simply a pretrained RoBERTa model (discussed in Sec. 6.7). Sentence-BERT is a network trained to produce semantic representations of sentences that can be compared under cosine similarity.
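A sketch of this retrieval step (ours), assuming the sentence-transformers and faiss packages; the checkpoint name and the toy explanations are illustrative, not necessarily those used in the paper, and the exact flat index shown here stands in for the approximate search used in practice.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Any Sentence-BERT-style encoder; normalized embeddings make inner product = cosine similarity.
retriever = SentenceTransformer("sentence-transformers/all-distilroberta-v1")

train_explanations = [
    "index 7 maps to 9 2 4 6; indicator 1 means count 9 versus 2",
    "index 8 maps to 5 1 6 3; indicator 2 means count 6 versus 3",
]  # toy stand-ins for the training explanations

expl_embs = retriever.encode(train_explanations, normalize_embeddings=True)
index = faiss.IndexFlatIP(expl_embs.shape[1])   # exact inner-product search for simplicity
index.add(np.asarray(expl_embs, dtype="float32"))

def retrieve(query, C=1, k=2):
    """Top Ck explanations under p_eta(e | x), to be allocated into k sets of C downstream."""
    q = retriever.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), C * k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))

print(retrieve("7 1 9 9 4 9 2 4 2 9"))
```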
In our experiments, we use the Sentence-RoBERTa-base model trained on a combination of several NLI and semantic textual similarity tasks, with mean pooling of token representations. We normalize the representations we obtain from this model, so that our inner product is equivalent to a cosine similarity.

Note that during training, we never condition on a data point's own explanation when predicting its label. This is an important constraint for matching the train and test-time distributions. At test time, we assume we have access only to past (training) explanations, since they can be expensive to collect and conditioning on explanations at test time can lead to label leakage, meaning what is essentially the benefit of human labeling could be mistaken as improvements in model performance.

Table 1. Statistics for datasets. The Train, Dev, and Test columns give sample sizes.

Dataset     Explns    Train    Dev     Test    |Y|
Synthetic   5000      5000     10000   50000   2
e-SNLI      549,367   549,367  9842    9824    3
SemEval     203       7016     800     2715    19
TACRED      169       68,124   22,631  15,509  42
5. Experimental Setup
Here, we detail the datasets and important model training details used in our experiments.
Datasets.
The standard version of our synthetic task used in experiments is described in Sec. 3. We include experiments with three other (English) datasets. The first, e-SNLI, is the SNLI dataset annotated with human explanations (Bowman et al., 2015; Camburu et al., 2018). The next two, SemEval and TACRED (Hendrickx et al., 2010; Zhang et al., 2017), are relation extraction tasks with a subset of data points annotated by Wang et al. (2019b). Summary statistics for the three datasets are shown in Table 1. For additional details, including data preprocessing, see Appendix B.
Model Training.
We train all models in an end-to-end manner using AdamW with a standard cross-entropy loss (Loshchilov & Hutter, 2017). This would be straightforward given the model's end-to-end structure, except for the fact that after every gradient update, all training explanation representations need to be recomputed in order for future predictions and gradients to reflect the new parameters. Prior work using retrieval models has either periodically updated the document representations (Guu et al., 2020) or left them fixed and only updated the query embeddings (Lewis et al., 2020). We find it is important to update all embeddings at least every epoch, and unless otherwise noted we rebuild the embeddings every 20% of each epoch (see Appendix A for further discussion).
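A minimal sketch (ours) of the training-loop structure this implies: AdamW updates over the end-to-end model, with the cached explanation embeddings and search index rebuilt at a fixed fraction of each epoch. The methods rebuild_explanation_index and batch_loss are placeholders for the model's own routines, not names from the released code.

```python
import torch

def train(model, train_loader, explanations, epochs=10, rebuild_frac=0.2, lr=1e-5):
    """AdamW training with explanation embeddings rebuilt every 20% of an epoch."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    rebuild_every = max(1, int(rebuild_frac * len(train_loader)))
    for _ in range(epochs):
        # Cached explanation embeddings (and the search index) must reflect the current retriever.
        expl_index = model.rebuild_explanation_index(explanations)
        for step, batch in enumerate(train_loader, start=1):
            loss = model.batch_loss(batch, expl_index)    # cross-entropy on the marginal p(y | x)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if step % rebuild_every == 0:
                expl_index = model.rebuild_explanation_index(explanations)
```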
Table 2. Example data points from the three existing datasets. More examples can be found in Table 4.

e-SNLI
x: Premise: After playing with her other toys, the baby decides that the guitar seems fun to play with as well. Hypothesis: A blonde baby.
y: Neutral
e: Not all babies are blonde.

SemEval
x: The SUBJ originates from an OBJ which transcends the speaker.
y: Entity-Origin
e: The phrase "originates from an" occurs between SUBJ and OBJ and there are no more than four words between SUBJ and OBJ and OBJ follows SUBJ.

TACRED
x: SUBJ's husband OBJ died in 1995.
y: Person-Spouse
e: Between SUBJ and OBJ the phrase "'s husband" occurs and there are no more than five words between SUBJ and OBJ.
We give important hyperparameters such as the context size C and retrieval parameter k in each experiment description in Sec. 6. We provide an analysis of the influence of hyperparameters on training in Appendix A, but usually we observe that larger values of C and k yield higher accuracies with more stable training behavior. Other hyperparameters for training are also given in Appendix A.

Model Selection and Hypothesis Testing. We report and visualize results on our synthetic dataset with confidence intervals representing seed variance, which accounts for variability across sampled datasets and random model training behavior. We do not estimate sample variance because it is quite small given the size of the test set, yielding a narrow 95% confidence interval around a model accuracy of 90%. Seed variance is estimated from 5-10 random seeds, depending on the condition. See Appendix B for further details of seed variance estimation. In synthetic data experiments, we comment on effects far larger than the confidence intervals and do not conduct hypothesis tests.

With the three existing datasets, for the majority of conditions, we run three model seeds and select the best model by dev set accuracy. We run only one seed for conditions using the full TACRED training set and the e-SNLI dataset with larger training set sizes. With the selected model, we conduct hypothesis tests for a difference in binomial means to check for differences in test set accuracy.
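A standard two-proportion z-test is one way to implement a test for a difference in binomial means over test-set accuracies; the small sketch below (ours) is not necessarily the exact test statistic used in the paper, and the correct/incorrect counts are toy numbers on a test set the size of e-SNLI's (Table 1).

```python
from math import erf, sqrt

def two_proportion_ztest(correct_a, correct_b, n):
    """Two-sided test for a difference between two models' accuracies on the same n test examples."""
    p_a, p_b = correct_a / n, correct_b / n
    pooled = (correct_a + correct_b) / (2 * n)
    z = (p_a - p_b) / sqrt(pooled * (1 - pooled) * (2 / n))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided tail of the standard normal
    return z, p_value

print(two_proportion_ztest(correct_a=8150, correct_b=8020, n=9824))  # illustrative accuracies only
```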
6. Experiment Design and Results
Below, we give the experimental design and results foreach research question in Sec. 1. The first seven research num-tasks
Acc. RoBERTa-baseTask GivenIndex OnlyNo Index
When Can the Task Be Inferred?
Figure 5. ( RQ1 ) Synthetic task accuracy as a function of num-tasks . questions are best answered with our synthetic task, and sothey each make use of synthetic data (introduced in Sec. 3).See Sec. 6.8 for results with the three existing datasets. We measure test accuracy as a function of the num-tasks parameter across three conditions. The condi-tions vary in how task information is available in the input:(1) task given , where each sequence has its true task in-formation ( m, n, r, d ) appended to it; (2) task signalled ,meaning index is given and hence the model can learn themap index → ( m, n, r, d ) ; (3) task inferred , where in-dex is not given, so the model must infer the task fromthe sequence’s contents alone. To see the interaction be-tween these conditions and model capacity, we test withboth RoBERTa-base and RoBERTa-large, and we also mea-sure the effect of increasing the training set size. Notethat, with a fixed training set size, num-tasks directly im-plies the number of points per task, n task . In this experi-ment, num-tasks = { , , , , , , } ⇒ n task = { , , , , , , } . Results.
We show the results in Fig. 5. We see that, whenthe numbers of tasks is small, RoBERTa-base can infer thetask for each sequence and achieve as high an accuracyas if it had been given task information. Yet, the feasi-bility of task inference quickly falls off as the numberof tasks increases (equivalent to the number of points pertask decreasing), reaching accuracies as low as 62.2% at num-tasks = 500 . Meanwhile, we observe that providingthe index does slightly ease the task inference, but the mod-els can by no means memorize the map from index to thetask information. Regarding model capacity, we find thatusing RoBERTa-large increases model accuracy when thenumber of num-tasks is relatively low (less than 250), butafter this point RoBERTa-base performs better (see Fig. 13in Appendix B). Lastly, we see that increasing the train- hen Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data [ PLACEHOLDER RUNNING TITLE – SOME TITLES WILL BE TOO LONG TO FIT – PH ] ᴇᴀɴ T ᴇ x ᴛ C ᴀᴛ Acc. Retrieval ModelNo RetrievalNo Retrieval(10x Train)FixedLearnedOptimal
Figure 6. (RQ2) Is explanation retrieval helpful? Synthetic task accuracy by the conditioning mechanism and retrieval model status, for data with num-tasks = 500.

Using the full-info explanations and data with num-tasks = 500, we measure model accuracy with retrieval in a 3 × 2 design. There are three conditions for the retrieval model: (1) fixed, where the Sentence-RoBERTa retriever is fixed and only the classifier is trained; (2) learned, where both classifier and retriever are trained end-to-end; and (3) optimal, where the optimal retrieval model is used and the classifier is trained. Note that we know the optimal retrieval model assigns the highest probabilities to explanations with index_e matching the query point's index_x, so by using a retriever p(e_i | x_i) = exp(1[index_e = index_x]) and a context size lower than n_task, we can ensure the retrieved explanations are always relevant. There are two conditions for the conditioning mechanism: (1) TextCat with C = k = 6, and (2) H-Mean with C = 4 and k = 4, which approximately matches the computational cost of the TextCat condition.

Results.
As shown in Fig. 6, retrieval with Sentence-BERT can reach accuracies above 98%, improving model accuracy by around 37 percentage points over a no-retrieval baseline. Each conditioning mechanism sees roughly the same improvement. Additionally, we find that the learned retrieval model does as well as the optimal retrieval model, improving over the fixed condition by about 7 points. Thus, retrieval of explanations allows the model to perform much better than a no-retrieval baseline. We see a large improvement in performance from retrieval even when the baseline could learn to infer the task information directly from the index value in each input. In fact, explanation retrieval outperforms a no-retrieval baseline trained on many times more data points, which obtains 87.11% accuracy.
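A minimal sketch of the two retrieval scorers compared above; the function names are illustrative, and the learned scorer stands in for the Sentence-RoBERTa embedding model.

import numpy as np

def optimal_scores(query_index, explanation_indices):
    # the optimal retriever scores 1 for explanations whose index matches the query
    return np.array([float(query_index == idx) for idx in explanation_indices])

def learned_scores(query_embedding, explanation_embeddings):
    # the learned retriever scores explanations by embedding similarity
    return explanation_embeddings @ query_embedding

def retrieve(scores, context_size):
    # p(e | x) is proportional to exp(score); we condition on the top-scoring explanations
    probs = np.exp(scores - scores.max())
    probs = probs / probs.sum()
    top = np.argsort(-probs)[:context_size]
    return top, probs[top]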
Figure 7. (RQ3) Retrieval by explanation kind: synthetic task accuracy with evidential and recomposable explanations, grouped by the conditioning mechanism and the status of the retrieval model.

We run the same experiment design as for RQ2, using evidential and recomposable explanations (see Sec. 3.3). With evidential explanations, we shift each integer in the explanation (excluding the index) independently by zero-mean, discrete uniform noise ε. In the recomposable setting, for each index two explanations combine to give the full task information. As in RQ2, we show results here for values of C = k = 6 for TextCat and C = k = 4 for H-Mean.

Results.
We display the results in Fig. 7.
With both explanation kinds, the model can learn to retrieve and aggregate information across explanations, achieving accuracies above 90%. We observe that for evidential explanations, learned retrieval is close to optimal retrieval, and the conditioning mechanisms perform very similarly. Yet the models cannot interpret evidential explanations as well as full-info ones, seeing as even with optimal retrieval both TextCat and H-Mean obtain around 92% accuracy compared to full-info's 98%. With recomposable explanations, meanwhile, we notice two differences. First, we find that with optimal retrieval TextCat can interpret the recomposable explanations as well as full-info ones, achieving upwards of 98% accuracy. Yet we observe that learned retrieval falls 6-8 points short of optimal retrieval (depending on the conditioning mechanism). There is no clear reason why this should be the case, though since the explanations are the only factor that changes, we can attribute the gap to the differences in explanations alone. Second, TextCat with learned or optimal retrieval outperforms H-Mean with the same retrieval (by 4.58 points for learned retrieval). We discuss this further in the next section.
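As a concrete illustration of the explanation kinds, the sketch below derives evidential and recomposable variants from a full-info explanation represented as the tuple (index, m, n, r, d); the noise bound and the particular split into two parts are assumptions standing in for the definitions in Sec. 3.3.

import random

def evidential(full_info, eps=2):
    # shift every integer except the index by zero-mean, discrete noise (eps assumed)
    index, rest = full_info[0], full_info[1:]
    return (index,) + tuple(v + random.randint(-eps, eps) for v in rest)

def recomposable(full_info):
    # split the task information into two explanations that jointly recover it
    index, m, n, r, d = full_info
    return (index, m, n), (index, r, d)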
Figure 8. (RQ5) Generalizing from relevant explanations: task accuracy by retrieval model and the smoothness of the index → (m, n, r, d) map, using 1 point per task index. At test time new index values are used, meaning models must generalize based on retrieved explanations with similar but never exactly correct (m, n, r, d) values.

Here we rely on results from the experiments for RQ3, and we also test method performance across a range of training set sizes, using optimal retrieval with C = 5 and k = 1 for both TextCat and H-Mean. Lastly, we consider training time as a relevant factor.
Results.
As shown in Fig. 7, with learned retrieval TextCat outperforms H-Mean by 4.58 points when explanations are broken down into parts that can be recombined to obtain the exact task information. This is especially important as explanations for existing natural language data can give facts and task specifications that may be combined to form a fuller picture of the problem. Additionally, we find that for small sample sizes, TextCat achieves higher accuracy than H-Mean, by 9.3 points for n = 1000 and 9.2 points for n = 1500, though the gap shrinks to 1.3 points at n = 2500 and the methods perform equally well after n = 5000 (see Fig. 15 in Appendix B). As a final consideration, we note that at C = k = 4, H-Mean takes 61% longer to train than TextCat due to the additional model forward passes. So, given its favorable performance with recomposable explanations and low sample sizes, as well as its training speed, TextCat appears to be preferable to H-Mean as a conditioning mechanism, and unless otherwise stated we use TextCat in experiments henceforth.

For retrieval-based modeling to be successful, explanations for one data point must be relevant to predicting other data points' labels. So far, we have used n_task = 10 points per index, with test index values that have been seen during training, meaning that both during training and testing, "exactly correct" explanations are available for retrieval (i.e., explanations with the true (m, n, r, d) for the data point at hand). To see what is required for explanations to be relevant across data points, we set n_task to 1, making every explanation in the train set unique, and we use test data with index values not seen in training. As a result of these changes, at both training and test time there are no exactly correct explanations available for retrieval (since we do not retrieve a data point's own explanation). In addition, we restrict the causal feature to always be (m, n), rather than (r, d), for reasons that will become apparent momentarily. To succeed in this setting, models must generalize from explanations given for one data point to a data point with a similar but not identical set of integers to be counted. By default, index and (m, n) values are randomly matched, meaning one cannot infer that two explanations are similar just because their index values are similar. In our smooth condition, we enforce a constraint in data generation so that the index and (m, n) values are ordered together, and similar index values will have similar (m, n) tuples (see Appendix B for further details). We also measure the importance of including the index in x, which is the easily computable feature linking query data points and explanations. Here, we use the default task setup, identical to that in RQ2, and we learn the retrieval model while using TextCat.

Results.
We show results across n_task and smoothness in Fig. 8. The notable trend here is that learned retrieval clearly outperforms the baseline in the smooth condition (by 7.6 points), while it only slightly outperforms the baseline in the non-smooth condition (by 4.3 points). In terms of improvement over the fixed retriever, the differences are 8.1 points in the smooth condition and 3.5 points in the non-smooth condition. This result suggests that learning to retrieve explanations will be particularly helpful when there is a sufficiently smooth notion of relevance between data points and explanations. The mechanism for this improvement is that, by retrieving explanations with similar index values to the data point at hand, a model can guess the task parameters for the current data point, since they will be close to the (m, n) values in the retrieved explanations (fitting the definition of relevance in Sec. 3.2). Regarding the importance of the index, we find that for learning the retrieval to be possible, it is crucial that data and explanations are linked by an easily computable feature such as the index. Without including the index in x, learned retrieval accuracy falls drastically from 98.6% to 54.7%.
Figure 9. (RQ6) Can explanations indicate strong features? Synthetic task accuracy grouped by the explanation kind and the correlation between strong (causal) and weak (non-causal) features. In the Causal Integers condition, the model is always given the true pair of integers that must be counted.
One especially intuitive use case for explanations is to help a model distinguish between strong, causal features and weak, spurious features. In this experiment, we vary the correlation between the strong and weak features in the training data along with the kinds of explanations that are retrieved by an optimal retrieval model. Recall that the strong feature in our task is [m > n] when indicator = 1 and [r > d] when indicator = 2, while the weak feature is drawn from the opposite integer pair's counts (refer to Sec. 3). We emphasize that our strong and weak features are equally difficult to extract from the input; they differ only in that the strong feature causes the label, and the weak one does not. The explanations either match the familiar form, including all integers (m, n, r, d)_index, or are restricted to include only the causal integers, (m, n) if indicator = 1 and (r, d) otherwise. When the strong-weak feature correlation is 1, the two features always agree, i.e., m > n exactly when r > d.

Results.
Figure 10. (RQ7) Classifier and retriever co-dependence: the retrieval model must be fixed for some number of epochs for training to succeed. Meanwhile, degrading the quality of the initial retrieval by some amount of random noise can quickly render retrieval unlearnable.

We see in Fig. 9 that, surprisingly, the only successful situation is when the original (m, n, r, d) explanations are given and the strong-weak correlation is 0, under which the test accuracy is above 99%. Note that, in the other settings, models most likely achieve around 75% accuracy by predicting 1 when [m > n] ∨ [r > d], since this strategy yields a test accuracy of 75%. (Half of the time in the test data, the relative counts of (m, n) and (r, d) will agree by chance, meaning that predicting [m > n] ∨ [r > d] will yield 100% accuracy. The other half of the time, the features will disagree, and this strategy yields 50% accuracy. The overall test accuracy is then 75%.) That is, our "causal feature" explanation fails to help when the strong and weak features are correlated and even when they are not. This is surprising because we might expect that, when the correlation is 0, giving the causal feature should allow the model to succeed. After all, we may feel that we are effectively telling the model to count those two integers in every sequence. But we risk anthropomorphizing the model whenever we suppose its interpretation matches our own. From the model's standpoint, it sees a sequence of numbers whose relative counts are always unaffected by the two integers concatenated to the end of the sequence.
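Written out, the expected test accuracy of this fallback strategy is simply

0.5 × 100% + 0.5 × 50% = 75%.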
Our “explanations” blend in seamlessly with the remainder of the sequence, except for the [SEP] token that happens to separate them. Hence we should not be so surprised that the model cannot use these explanations to pick out the causal feature; in fact, it may even be more impressive that the model does succeed when the full-info explanations are given. Evidently, the model learns a near-perfect interpretation of full-info explanations from the available training examples.
Since the learning signal for the retrieval model comes through the classifier, while the classifier relies on the retrieved explanations for its predictions, there is some co-dependence in their training dynamics. We further measure this co-dependence in a 4 × 4 design using evidential explanations with ε = 2. On one axis, we vary the number of training epochs for which only the classifier is trained and not the retrieval model, ranging from training both jointly from the start up to never training the retrieval model (∞). On the other axis, we degrade the initial retrieval model by adding i.i.d. Normal noise to every parameter in the model, across a range of σ values. To see the effect of the choice of Sentence-BERT model, we perform another experiment: using either a randomly re-initialized RoBERTa-base, a standard pretrained RoBERTa-base, or the Sentence-RoBERTa-base model, we evaluate performance with learned retrieval relative to fixed retrieval.
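A minimal sketch of the retriever-degradation step, assuming the retrieval model is a PyTorch module; the σ values themselves are experiment-specific.

import torch

def degrade_retriever(retriever, sigma):
    # add i.i.d. Normal(0, sigma^2) noise to every parameter of the retrieval model
    with torch.no_grad():
        for param in retriever.parameters():
            param.add_(torch.randn_like(param) * sigma)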
Results.
We show results for the first experiment in Fig. 10. We find that the classifier must be warmed up, or, conversely, the retrieval model must be fixed, for some number of epochs to attain optimal performance. Jointly training both models from the first epoch on results in failed training runs. Meanwhile, adding noise to the initial retrieval model can quickly degrade its performance and render the retrieval unlearnable. Hence, both the retrieval model and the classifier must reach some initial quality before training the other in order for joint training to succeed. As for the choice of pretrained retrieval model, we observe that retrieval is learnable only with Sentence-RoBERTa; retrieval is not learnable using RoBERTa-base, which performs about as poorly as using a randomly initialized retrieval model. These results are shown in Fig. 18 in Appendix B.

We test the retrieval-based model with three existing datasets: e-SNLI, TACRED, and SemEval. We also vary the training set size, depending on the dataset, since the helpfulness of explanation retrieval could vary with the amount of available training data. Because TextCat achieves favorable results in our synthetic experiments, we use it as the conditioning mechanism here. Within each dataset, we tune C and k between values with the same product Ck, with the exception of e-SNLI using the full train set. For e-SNLI conditions with smaller n, we select (C=2, k=8). We use (C=2, k=4) for e-SNLI with the full train set, given the expense of training retrieval in this setting. For most relation extraction settings we select (C=2, k=4). See Appendix A for further details.

Unlike in the synthetic data experiments, we consider adding x_j and y_j to the query data point along with the retrieved explanation e_j, since explanations might best be interpreted in the context of the data they were given for. In tuning experiments we do not find any evidence for or against adding this extra information (see Table 5 in Appendix A). Here, we do add a textual representation of y_j to the input x_i along with retrieved explanations for the relation extraction tasks, since these tasks have a higher number of classes. For e-SNLI, where y_j can be easily inferred from the structure of explanations, we add only the retrieved explanations. Finally, for TACRED and SemEval, we compare to the ELV-M method in Zhou et al. (2020), which is H-Mean with (C=10, k=1) and fixed retrieval (discussed in Appendix B).

Table 3. Model accuracies for each dataset across training set sizes (n), with 95% confidence intervals given in parentheses. We do not find retrieval of explanations to improve over baselines for any dataset and training set size.

Condition               Model      Acc.           Effect Size
e-SNLI    n = 5000      RoBERTa    84.83 (0.71)
                        TextCat                   (±1.00)
          n = 10,000    RoBERTa    85.52 (0.70)
                        TextCat                   (±0.98)
          n = 50,000    RoBERTa    87.90 (0.64)
                        TextCat                   −(±0.92)
          n = full      RoBERTa    91.06 (0.56)
                        TextCat                   (±0.79)
SemEval   n = 5000      RoBERTa    75.21 (1.62)
                        TextCat                   (±2.29)
          n = full      RoBERTa    76.94 (1.58)
                        TextCat                   −(±2.24)
TACRED    n = 5000      RoBERTa    84.24 (0.57)
                        TextCat                   (±0.81)
          n = 10,000    RoBERTa    85.51 (0.55)
                        TextCat                   (±0.78)
          n = full      RoBERTa    88.29 (0.51)
                        TextCat                   (±0.71)

Results.
As shown in Table 3, we see no statistically significant improvements from using explanation retrieval with any combination of dataset and training set size. Across conditions, the effect sizes are slightly positive on average, but we are unable to assert that any particular effect is positive. We also measure how accuracy varies across values of k for finetuned models, but we do not find that increasing k at test time improves accuracy (see Fig. 19 in Appendix B). In fact, the only statistically significant effect we see is from increasing the training set size. For example, doubling the TACRED training data from 5000 to 10000 increases the baseline accuracy by 1.28 points, a statistically significant difference. Yet since we find that retrieval-based modeling succeeds in certain synthetic conditions, there must be a reason that the approach fails to work well with datasets such as these. Using the results from this section, we speculate on the possible causes of this failure in Section 7 below.
7. When Can Explanations Help?
In this section we take a position, based on our experimental findings, regarding the possible causes of the success of explanation retrieval in our synthetic task and its failure with e-SNLI, TACRED, and SemEval. Summarizing our experimental results, we suggest that in principle, explanations can be helpful for modeling a task when:

(1) The model can better infer relevant latent information given x and the explanation, relative to using x alone. Relevant latent information includes, for example, pertinent facts and task specifications that can assist with prediction. (RQ1, RQ2)

However, this is not enough for explanations to be useful in practice.
Retrieval over explanations will be learnable to the extent that:

(2) Explanations are linked to query data points by an easily computable index feature (RQ5), and
(3) There are explanations that are sufficiently relevant across data points as to be useful for predicting labels for future data (RQ2, RQ5), and
(4) There is a known or identifiable interpretation of the explanations by the classifier that yields a useful representation of the explanation (RQ3, RQ6), and
(5) Before training the retrieval model, the classifier reaches some sufficient quality (RQ7), and
(6) Before training the classifier, the initial retrieval model exhibits some sufficient quality (RQ7).

We wish to emphasize a few related results. One of the most intuitive use cases for explanations is to help a model distinguish between strong, generalizable features and weak, spurious features. But explanations only help break ties between strong and weak features when the model already knows how to interpret the explanations. When strong and weak features are perfectly correlated, we find that our synthetic explanations do not lead the model to select the causal feature more often than a non-causal one, even when using the optimal retrieval model. We only see that the model can learn to interpret the explanations when the features in question are not perfectly correlated. We suggest that, in the paradigm of large language model pretraining, this interpretation function will be meta-learned during pretraining. This behavior is clearly exemplified in GPT-3, which learns from pretraining to infer novel tasks from prompts that precede tasks in zero-shot settings (Brown et al., 2020). Even GPT-2 learns some tasks such as summarization during pretraining, which can be elicited with the right prompt (e.g., "tl;dr") (Radford et al., 2019). As we observe in our experiments, finetuning may allow the model to further identify the correct interpretation, provided that it is identifiable and sufficient training data is available.

It is also important to reiterate that, when using a retrieval model, the information that explanations provide can be inferred from the input alone. A model need only learn the map between the input and the hidden information, rather than first using the input for retrieval and then interpreting the retrieved explanation. This is clearly possible in our synthetic task given the relationship between the index and (m, n, r, d) values. The same situation will hold true of explanations for real-world tasks. Similar to our synthetic setting, if models can learn to retrieve explanations and then interpret them, they could instead learn to infer the latent information directly from the input. This property of tasks and explanations suggests that conditioning on explanations is a way to structure model computation, biasing it toward desirable functions and away from difficult-to-learn functions. That is, as discussed in Sec. 2.3, we see explanations acting as priors as well as simply inputs.

So should we collect explanations to assist with solving tasks? At present, the answer is task-specific. In our synthetic task, it is far more helpful to have explanations for the training points than to have several times as many unexplained points. On benchmark tasks such as e-SNLI, TACRED, and SemEval, we find that explanation retrieval does not yield statistically significant improvements in model accuracy, while using more unexplained data can lead to large improvements.
We suggest that the reason for this lies somewhere in the six preconditions for explanation retrieval given above, and it will be useful in future work to develop a diagnostic procedure for further narrowing down the causes of model performance with and without explanations. More broadly, we see two countervailing trends at work here. The first is that, as language models store more and more knowledge in their parameters, there will be less and less of a need for retrieved explanations to provide hidden information for tasks, though retrieval may still make "accessing" this knowledge easier. In the other direction, we note that as language models become better at interpreting explanations and task descriptions, we will find that for some tasks performance is greatly boosted by having a good task description or set of explanations for example data points.
8. Conclusion
In this paper we present a formal framework for understanding the role of explanations in modeling, and we argue that explanations are most suitably used in a retrieval-based modeling approach, where past explanations are retrieved and used as model inputs for predicting future data points. We experimentally study the preconditions for explanations' usefulness in modeling, and based on results from our synthetic task, we suggest that the model must be able to better infer relevant latent information given the explanation and input, relative to using the input alone. For explanation retrieval to be learnable, we find that (1) explanations should be linked to query data points by an easily computable feature, (2) explanations should be relevant across data points, (3) the interpretation of explanations by the classifier should be known or identifiable, and (4) the classifier and retrieval model must both be of some sufficient quality before the other is trained. When we test our method on three existing datasets (e-SNLI, TACRED, and SemEval), we find that explanations do not improve task performance, suggesting that these settings do not meet at least one of the criteria outlined above.
Acknowledgements
We thank Miles Turpin and Ethan Perez for helpful discussion of the topics represented here, as well as Xiang Zhou and Prateek Yadav for feedback on this paper. This work was supported by NSF-CAREER Award 1846185, DARPA Machine-Commonsense (MCS) Grant N66001-19-2-4031, a Royster Society PhD Fellowship, a Microsoft Investigator Fellowship, and Google and AWS cloud compute awards. The views contained in this article are those of the authors and not of the funding agency.
References
Andreas, J., Klein, D., and Levine, S. Learning with la-tent language. In Walker, M. A., Ji, H., and Stent, A.(eds.),
NAACL-HLT 2018 , 2018. doi: 10.18653/v1/n18-1197. URL https://doi.org/10.18653/v1/n18-1197 .Awasthi, A., Ghosh, S., Goyal, R., and Sarawagi, S.Learning from rules generalizing labeled exemplars. In
ICLR 2020 , 2020. URL https://arxiv.org/pdf/2004.06025.pdf .Ba, L. J., Swersky, K., Fidler, S., and Salakhutdinov, R.Predicting deep zero-shot convolutional neural networksusing textual descriptions. In , pp. 4247–4255. IEEE Com-puter Society, 2015. doi: 10.1109/ICCV.2015.483. URL https://doi.org/10.1109/ICCV.2015.483 .Baldini Soares, L., FitzGerald, N., Ling, J., andKwiatkowski, T. Matching the blanks: Distributionalsimilarity for relation learning. In
ACL , pp. 2895–2905,Florence, Italy, July 2019. Association for ComputationalLinguistics. doi: 10.18653/v1/P19-1279. URL .Bao, Y., Chang, S., Yu, M., and Barzilay, R. Derivingmachine attention from human rationales. In
EMNLP ,pp. 1903–1913, Brussels, Belgium, October-November2018. Association for Computational Linguistics. doi: 10. 18653/v1/D18-1216. URL .Beltagy, I., Peters, M. E., and Cohan, A. Longformer:The long-document transformer.
CoRR , abs/2004.05150,2020. URL https://arxiv.org/abs/2004.05150 .Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D.A large annotated corpus for learning natural languageinference. In
EMNLP 2015 , 2015. URL https://arxiv.org/abs/1508.05326 .Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan,J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G.,Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu,J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M.,Gray, S., Chess, B., Clark, J., Berner, C., McCandlish,S., Radford, A., Sutskever, I., and Amodei, D. Languagemodels are few-shot learners. In Larochelle, H., Ran-zato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.),
NeurIPS , 2020. URL https://arxiv.org/abs/2005.14165 .Camburu, O.-M., Rockt¨aschel, T., Lukasiewicz, T., andBlunsom, P. e-snli: Natural language inference with natu-ral language explanations. In
NeurIPS 2018 , 2018. URL https://arxiv.org/pdf/1812.01193.pdf .Co-Reyes, J. D., Gupta, A., Sanjeev, S., Altieri, N., An-dreas, J., DeNero, J., Abbeel, P., and Levine, S. Guidingpolicies with language via meta-learning. In
ICLR 2019 ,2019. URL https://openreview.net/forum?id=HkgSEnA5KQ .De Cao, N., Schlichtkrull, M. S., Aziz, W., and Titov,I. How do decisions emerge across layers in neuralmodels? interpretation with differentiable masking. In
EMNLP , pp. 3243–3255, Online, November 2020. Asso-ciation for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.262. URL .Doshi-Velez, F. and Kim, B. Towards a rigorous science ofinterpretable machine learning. arXiv: Machine Learn-ing , 2017. URL https://arxiv.org/pdf/1702.08608.pdf .Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang,M. Retrieval augmented language model pre-training.In
ICML , volume 119 of
Proceedings of Machine Learning Research, pp. 3929–3938. PMLR, 2020. URL http://proceedings.mlr.press/v119/guu20a.html. Ha, D., Dai, A., and Le, Q. V. Hypernetworks. In
ICLR2017 , 2017. URL https://openreview.net/pdf?id=rkpACe1lx .Hancock, B., Varma, P., Wang, S., Bringmann, M., Liang,P., and R´e, C. Training classifiers with natural languageexplanations. In
ACL , 2018. URL https://pubmed.ncbi.nlm.nih.gov/31130772/ .Hase, P., Zhang, S., Xie, H., and Bansal, M. Leakage-adjusted simulatability: Can models generate non-trivialexplanations of their behavior in natural language? In
Findings of EMNLP , 2020. URL https://arxiv.org/abs/2010.04119 .Hendrickx, I., Kim, S. N., Kozareva, Z., Nakov, P.,´O S´eaghdha, D., Pad´o, S., Pennacchiotti, M., Ro-mano, L., and Szpakowicz, S. SemEval-2010 task 8:Multi-way classification of semantic relations betweenpairs of nominals. In
Proceedings of the 5th Interna-tional Workshop on Semantic Evaluation , pp. 33–38,Uppsala, Sweden, July 2010. Association for Compu-tational Linguistics. URL .Johnson, J., Douze, M., and J´egou, H. Billion-scalesimilarity search with gpus.
IEEE Transactions onBig Data , 2017. URL https://arxiv.org/pdf/1702.08734.pdf .Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L.,Edunov, S., Chen, D., and Yih, W.-t. Dense pas-sage retrieval for open-domain question answering. In
EMNLP , pp. 6769–6781, Online, November 2020. Asso-ciation for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.550. URL .Kumar, S. and Talukdar, P. Nile : Natural language in-ference with faithful natural language explanations. In
ACL 2020 , 2020. URL https://arxiv.org/abs/2005.12116 .Lewis, P. S. H., Perez, E., Piktus, A., Petroni, F.,Karpukhin, V., Goyal, N., K¨uttler, H., Lewis, M., Yih,W., Rockt¨aschel, T., Riedel, S., and Kiela, D. Retrieval-augmented generation for knowledge-intensive NLPtasks. In Larochelle, H., Ranzato, M., Hadsell, R.,Balcan, M., and Lin, H. (eds.),
NeurIPS , 2020. URL https://arxiv.org/abs/2005.11401 .Liang, W., Zou, J., and Yu, Z. ALICE: active learningwith contrastive natural language explanations. In Web-ber, B., Cohn, T., He, Y., and Liu, Y. (eds.),
EMNLP ,pp. 4380–4391. Association for Computational Lin-guistics, 2020. URL . Liu, N. F., Lee, T., Jia, R., and Liang, P. Can smalland synthetic benchmarks drive modeling innovation?a retrospective study of question answering modeling ap-proaches.
CoRR , 2021. URL https://arxiv.org/pdf/2102.01065.pdf .Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V.Roberta: A robustly optimized bert pretraining approach.
ArXiv , abs/1907.11692, 2019. URL https://arxiv.org/pdf/1907.11692.pdf .Loshchilov, I. and Hutter, F. Decoupled weight decay regu-larization, 2017.Miller, T. Explanation in artificial intelligence: Insightsfrom the social sciences.
Artif. Intell. , 267:1–38, 2019.doi: 10.1016/j.artint.2018.07.007. URL https://doi.org/10.1016/j.artint.2018.07.007 .Murty, S., Koh, P. W., and Liang, P. Expbert: Representationengineering with natural language explanations. In Juraf-sky, D., Chai, J., Schluter, N., and Tetreault, J. R. (eds.),
ACL , pp. 2106–2113. Association for Computational Lin-guistics, 2020. URL .Narang, S., Raffel, C., Lee, K. J., Roberts, A., Fiedel,N., and Malkan, K. WT5?! training text-to-text mod-els to explain their predictions.
ArXiv , abs/2004.14546,2020. URL https://arxiv.org/pdf/2004.14546.pdf .Pruthi, D., Dhingra, B., Soares, L. B., Collins, M., Lip-ton, Z. C., Neubig, G., and Cohen, W. W. Evaluat-ing explanations: How much do explanations from theteacher aid students?
CoRR , abs/2012.00893, 2020. URL https://arxiv.org/abs/2012.00893 .Radford, A., Wu, J., Child, R., Luan, D., Amodei,D., and Sutskever, I. Language models are unsu-pervised multitask learners. In
OpenAI TechnicalReport , 2019. URL https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf .Rajani, N. F., McCann, B., Xiong, C., and Socher, R. Ex-plain yourself! leveraging language models for common-sense reasoning. In
ACL 2019 , 2019. URL https://arxiv.org/pdf/1906.02361.pdf .Reimers, N. and Gurevych, I. Sentence-BERT: Sentenceembeddings using Siamese BERT-networks. In
EMNLP-IJCNLP, pp. 3982–3992, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1410. Roberts, A., Raffel, C., and Shazeer, N. How much knowledge can you pack into the parameters of a language model? In
EMNLP , pp. 5418–5426, On-line, November 2020. Association for ComputationalLinguistics. doi: 10.18653/v1/2020.emnlp-main.437.URL .Ross, A. S., Hughes, M. C., and Doshi-Velez, F. Rightfor the right reasons: Training differentiable models byconstraining their explanations. In
IJCAI , pp. 2662–2670,2017. doi: 10.24963/ijcai.2017/371. URL https://doi.org/10.24963/ijcai.2017/371 .Rupprecht, C., Laina, I., Navab, N., Harger, G. D., andTombari, F. Guide me: Interacting with deep networks.In
Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition, CVPR 2018 , 2018. URL https://arxiv.org/abs/1803.11544 .Selvaraju, R. R., Lee, S., Shen, Y., Jin, H., Ghosh, S., Heck,L. P., Batra, D., and Parikh, D. Taking a HINT: leveragingexplanations to make vision and language models moregrounded. In
ICCV , pp. 2591–2600. IEEE, 2019. doi: 10.1109/ICCV.2019.00268. URL https://doi.org/10.1109/ICCV.2019.00268 .Small, K., Wallace, B. C., Brodley, C. E., and Trikalinos,T. A. The constrained weight space svm: learning withranked features. In
ICML , pp. 865–872, 2011.Srivastava, S., Labutov, I., and Mitchell, T. Learning classi-fiers from declarative language. In
NeurIPS 2017 , 2017.URL .Srivastava, S., Labutov, I., and Mitchell, T. Zero-shotlearning of classifiers from natural language quantifi-cation. In
ACL 2018 , July 2018. doi: 10.18653/v1/P18-1029. URL .Stammer, W., Schramowski, P., and Kersting, K. Rightfor the right concept: Revising neuro-symbolic con-cepts by interacting with their explanations.
CoRR ,abs/2011.12854, 2020. URL https://arxiv.org/abs/2011.12854 .Talmor, A., Tafjord, O., Clark, P., Goldberg, Y., and Be-rant, J. Leap-of-thought: Teaching pre-trained modelsto systematically reason over implicit knowledge. In
NeurIPS 2020 , 2020. URL https://arxiv.org/abs/2006.06609 .Wang, C., Liang, S., Zhang, Y., Li, X., and Gao, T. Doesit make sense? and why? a pilot study for sense makingand explanation. In
ACL 2019 , 2019a. URL https://arxiv.org/pdf/1906.00363.pdf . Wang, Z., Qin, Y., Zhou, W., Yan, J., Ye, Q., Neves, L.,Liu, Z., and Ren, X. Learning from explanations withneural execution tree. In
ICLR , 2019b. URL https://openreview.net/pdf?id=rJlUt0EYwS .Weller, O., Lourie, N., Gardner, M., and Peters, M.Learning from task descriptions. In
EMNLP , pp.1361–1375, Online, November 2020. Association forComputational Linguistics. doi: 10.18653/v1/2020.emnlp-main.105. URL .Wiegreffe, S., Marasovic, A., and Smith, N. A. Measuringassociation between labels and free-text rationales.
CoRR ,abs/2010.12762, 2020. URL https://arxiv.org/abs/2010.12762 .Zaidan, O., Eisner, J., and Piatko, C. Using “Anno-tator Rationales” to Improve Machine Learning forText Categorization. In
Human Language Technolo-gies 2007: The Conference of the North AmericanChapter of the Association for Computational Linguis-tics; Proceedings of the Main Conference , pp. 260–267,Rochester, New York, April 2007. Association for Com-putational Linguistics. URL .Zhang, Y., Marshall, I., and Wallace, B. C. Rationale-Augmented Convolutional Neural Networks for TextClassification. In
Proceedings of the 2016 Conferenceon Empirical Methods in Natural Language Processing ,pp. 795–804, Austin, Texas, November 2016. Associ-ation for Computational Linguistics. doi: 10.18653/v1/D16-1076. URL .Zhang, Y., Zhong, V., Chen, D., Angeli, G., and Man-ning, C. D. Position-aware attention and superviseddata improve slot filling. In
EMNLP , pp. 35–45,2017. URL https://nlp.stanford.edu/pubs/zhang2017tacred.pdf .Zhou, W., Hu, J., Zhang, H., Liang, X., Sun, M., Xiong,C., and Tang, J. Towards interpretable natural languageunderstanding with explanations as latent variables. In
NeurIPS, 2020. URL https://arxiv.org/pdf/2011.05268.pdf.
Table 4. Additional example data points from the three existing datasets.

e-SNLI Example 1
x: Premise: After playing with her other toys, the baby decides that the guitar seems fun to play with as well. Hypothesis: A blonde baby.
y: Neutral
e: Not all babies are blonde.

e-SNLI Example 2
x: Premise: A girl wearing a pink and black shirt and jeans fixes her hair before walking up the stairs. Hypothesis: A girl has blonde hair.
y: Neutral
e: Not all girls have blonde hair.

SemEval Example 1
x: The SUBJ originates from an OBJ which transcends the speaker.
y: Entity-Origin
e: The phrase "originates from an" occurs between SUBJ and OBJ and there are no more than four words between SUBJ and OBJ and OBJ follows SUBJ.

SemEval Example 2
x: With one exception, the SUBJ emerged from the OBJ during hours of darkness.
y: Entity-Origin
e: The phrase "emerged from the" occurs between SUBJ and OBJ and there are no more than four words between SUBJ and OBJ and SUBJ precedes OBJ.

TACRED Example 1
x: SUBJ's husband OBJ died in 1995.
y: Person-Spouse
e: Between SUBJ and OBJ the phrase "'s husband" occurs and there are no more than five words between SUBJ and OBJ.

TACRED Example 2
x: SUBJ is married to OBJ and is the father of three sons.
y: Person-Spouse
e: There are no more than four words between SUBJ and OBJ and the phrase "is married to" appears between SUBJ and OBJ.
A. Training Details
A.1. Data Preprocessing
No preprocessing is applied to the synthetic data. For the three existing datasets, we use maximum sequence lengths as follows: for e-SNLI, we use a maximum sequence length of 120 tokens, with maximum lengths of 90 for x and 60 for each explanation. For TACRED and SemEval, we use a maximum of 160 for the entire input, with separate per-component maximums for x and e. We remove one explanation from the set of explained data points for TACRED after finding that it is given for a data point in the dev set. We give additional examples of data points from each dataset in Table 4.
Figure 11. (Training Hyperparameters and Analysis) Accuracy by method and context size: learned retrieval accuracy by C and conditioning mechanism, using k = 1 and optimal retrieval of evidential explanations.

Figure 12. (Training Hyperparameters and Analysis) How does k influence retrieval learning? Learned retrieval accuracy by k, with C = 1, for full-info and evidential explanations.

A.2. Runtimes
Regarding training times, we run most experiments on a single NVIDIA RTX 2080 GPU, with runtimes as follows: 4.0 hours for 40 epochs of the no-retrieval RoBERTa-base using the synthetic dataset; 5.7 hours for 40 epochs of RoBERTa-large in the same setting; 8.6 hours for 20 epochs of learned retrieval with RoBERTa-base models on synthetic data; and 32.9 hours for 10 epochs of learned retrieval with TACRED. Runtimes for several adjacent experimental conditions can be extrapolated from these given the relative training set sizes. Lastly, we run our full-data e-SNLI condition with learned retrieval for 5 epochs on a single Tesla P100 GPU, which takes 7 days.
A.3. Training Hyperparameters and Analysis
For optimization, we use AdamW with gradient norm clipping at norm 1. For the learning rate, we use a linear warmup and decay schedule peaking at 10% of the training steps for experiments with synthetic data and at 1% for experiments with existing datasets (given the larger training set sizes). The batch size is set to 10 across all experiments.

We decide how often to rebuild the representations of training explanations while learning the retrieval model by tuning over several rebuild frequencies (expressed as a percentage of each epoch), as well as never rebuilding. In our synthetic setting, the only noticeable drop in performance comes from never rebuilding. As long as representations are re-encoded at least as often as every epoch, we notice no difference in final test accuracy, though in early experiments we observed that rebuilding more often improved training stability. To err on the safe side of training stability, we re-encode the representations every 20% of each epoch in all experiments except e-SNLI with full data, where we re-encode every 30% of each epoch.

Additionally, we use the stop-gradient function when computing the gradient of p_η(e|x), i.e. we compute ∇_η exp(sg[f_η(e)]^T f_η(x)), meaning that we do not differentiate through the explanation embeddings, but only through the query data point embeddings. In early experiments, we found that this decision contributed to training stability while improving computational efficiency, and we confirm that we observe no differences in model accuracy as a result.

We measure the relationship between the context size C and performance on evidential explanations, using the optimal retrieval model and comparing between conditioning mechanisms. The results are shown in Fig. 11. We see that, with each method, a larger value of C is preferable up to a point, after which performance plateaus.

Regarding the value of k, we see in Fig. 12 that training performance can be sensitive to the chosen value for this hyperparameter. It appears that one should select as high a value of k as possible, all else being equal. Though since this parameter increases the number of forward passes during training by a factor of k, there is a trade-off between the available compute budget and the value of k in practice.
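A minimal sketch of the stop-gradient scoring above, using detach() so that gradients flow only through the query embeddings; the encoder f is assumed to be the shared retrieval encoder.

import torch

def retrieval_scores(f, explanations, queries):
    # sg[f_eta(e)]^T f_eta(x): explanation embeddings are detached from the graph
    e_emb = f(explanations).detach()
    x_emb = f(queries)
    return (e_emb * x_emb).sum(dim=-1)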
A.4. Model Selection in Experiments

In general, within a single training run, we select the model that achieves the best dev set accuracy as measured at the end of each training epoch. With our synthetic task, we observe some training instability in a few conditions, particularly in those where we are degrading the training model (RQ7). On such occasions, training fails after a few epochs and model accuracy trends toward 50% (random performance). These occurrences are easily noticeable, so we rerun these experiments with a different seed in order to report results from a stable run, and typically we find that stable training dynamics can be obtained from just one other seed.
Condition               Model          Acc.
e-SNLI    n = 10,000    TextCat-E      87.17 (0.66)
                        TextCat-YXE    87.11 (0.66)
SemEval   n = 7016      TextCat-YE     75.25 (2.99)
                        TextCat-YXE    75.75 (2.97)
TACRED    n = 68,124    TextCat-YE     87.49 (0.43)
                        TextCat-YXE    87.31 (0.43)

Table 5. Ablation across the retrieved variables: the suffix to TextCat indicates which retrieved variables are included in the model input. We do not find that including x improves model performance, so we use only e or (y, e), depending on the task.

For the existing datasets, we run three model seeds for the baseline and each of the hyperparameter conditions, except for the largest training set sizes, where we run only one seed. We also use only one seed when ablating across which retrieved variables to include in the model input (i.e., whether to include the retrieved x in the model input). To select one model from three seeds for a given condition, we pick based on the highest dev performance. The results for ablating across the retrieved variables to include as model inputs are shown in Table 5. Note that we test the effect of adding x to e for e-SNLI and the effect of adding x to (y, e) for relation extraction tasks, since y is easily inferred from e for e-SNLI (Hase et al., 2020). In these experiments, we roughly control for the sequence length, meaning that for relation extraction tasks we use C = 1 when x is present and C = 2 when it is not, while for NLI we use C = 5 without x and C = 2 with x. These experiments all use k = 4. We do not find any statistically significant differences in dev set accuracy across any of the conditions. Hence, we proceed with using (y, e) for relation extraction tasks and e for NLI.

When tuning over C and k with the existing datasets, we use the following values for each condition: for relation extraction tasks, we tune over values in {(C=2, k=4), (C=1, k=8)}, and for NLI we tune over {(C=4, k=4), (C=2, k=8)} for the smaller training set sizes. We tune separately for each training set size configuration. We select the hyperparameters to use based on the best dev set accuracy achieved from three seeds in each condition. We report results from a single run of (C=2, k=4) for e-SNLI with the full training data.
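A sketch of how the TextCat input variants in Table 5 could be assembled from one retrieved training example; only which retrieved variables are included follows the -E/-YE/-YXE naming, while the separator token and ordering are assumptions.

def textcat_input(x_query, retrieved, variant, sep=" [SEP] "):
    # retrieved is an (x_j, y_j, e_j) triple from the explained training data
    x_j, y_j, e_j = retrieved
    if variant == "E":        # retrieved explanation only
        extra = [e_j]
    elif variant == "YE":     # label text plus explanation
        extra = [y_j, e_j]
    elif variant == "YXE":    # label text, neighbor input, and explanation
        extra = [y_j, x_j, e_j]
    else:
        raise ValueError(variant)
    return x_query + sep + sep.join(extra)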
B. Experimental Details and Additional Results

In this section, we describe experimental details and hyperparameters for particular experiments, organized by research question, as well as additional results accompanying some research questions. Lastly, we discuss hypothesis testing procedures. Unless otherwise stated, additional experiments below use the default synthetic task parameters, given in the Synthetic Task section in the main paper.

Figure 13. (RQ1) Task inference by model size: synthetic task accuracy as a function of num-tasks, by model size.

Figure 14. (RQ1) No-retrieval baseline by training set size: synthetic task accuracy as a function of the training set size, without explanation retrieval.
B.1. RQ1: When can models solve our synthetic problem by inferring each sequence's task, and when must they be given the task information?
We show results for additional model choices and training dataset sizes here. In Fig. 13, we see that RoBERTa-large outperforms RoBERTa-base only when the number of tasks in the training data is relatively small (below roughly 250 tasks); beyond this point, RoBERTa-base is the better choice. Here, we train models for 40 epochs, though for the largest num-tasks values with RoBERTa-large, we have to train for 60 epochs with an adjusted learning rate in order for training to converge. In Fig. 14, we see that performance scales well with the available training data for our synthetic task: RoBERTa-base reaches 87.11% accuracy at the largest training set size we test.
Figure 15. (RQ4) Synthetic task accuracy across training set sizes and methods, using optimal retrieval.
B.2. RQ2: Can retrieval of past explanations enable a model to solve our task?
In these experiments, we always freeze the retriever for the first two epochs of training and train for a total of 20 epochs.
B.3. RQ3: Can models aggregate information across explanations for better prediction?
Here, we freeze the retriever for the first five epochs of training and train for a total of 25 epochs.
B.4. RQ4: What is the best way to compute explanation representations for prediction?
In Fig. 15, we see that TextCat outperforms H-Mean at smaller training set sizes. TextCat achieves higher accuracy than H-Mean by 9.3 points for n = 1000 and 9.2 points for n = 1500, though the gap shrinks to 1.3 points at n = 2500 and the methods perform equally well after n = 5000. In these experiments we use the optimal retrieval model.

B.5. RQ5: What makes an explanation relevant across data points? What enables a retrieval model to find relevant explanations for a new data point?
In these experiments, we use C = 1 and k = 12, and we make changes to the default data properties regarding n_task and smoothness. In order to achieve a smooth function from the index to (m, n), we first order the domain and codomain and then match them up one-to-one. To order (m, n) tuples, we sort by m first and then n. The result is that when two explanations have similar index values, their m values are very likely to be close together, and their n values will probably be close together. Note that in this experiment, we do not sample (m, n) uniformly at random, but rather draw the valid (m, n) tuples in increasing order starting from the first valid tuple. We use the same (m, n) sampling scheme for the baselines and the non-smooth condition; the only difference in the non-smooth condition is the lack of ordering in the domain and codomain before they are matched up. The ordering in the smooth condition allows for more precise inference as to the task parameters when retrieving an explanation with an index similar to that of the data point at hand.
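A minimal sketch of the smooth assignment described above; the non-smooth condition simply skips the ordering (here approximated by shuffling one side before matching).

import random

def build_index_to_mn(index_values, mn_tuples, smooth=True):
    # sort (m, n) tuples by m, then n, and sort index values ascending,
    # so that nearby index values map to nearby (m, n) tuples
    mn_sorted = sorted(mn_tuples)
    idx_sorted = sorted(index_values)
    if not smooth:
        random.shuffle(mn_sorted)  # destroy the ordering before matching
    return dict(zip(idx_sorted, mn_sorted))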
Figure 16. (RQ6) Accuracy by weak/strong correlation: synthetic task accuracy across correlation levels between the strong and weak feature, with and without optimal retrieval.

Figure 17. (RQ6) Interpreting the explanations: even a simple function on explanations can render them difficult for the model to interpret, though the correct interpretation is identified with more data.
B.6. RQ6: Can explanations help models learn to use strong features rather than weak ones?
We give additional results with the strong-weak feature correlation varied between 0 and 1 in Fig. 16, using the training hyperparameters for RQ2. Using the full-info explanations with optimal retrieval, we see the model continues to perform well as long as the features are not perfectly correlated. Interestingly, the no-retrieval condition's performance rises as the correlation increases, though it never matches the retrieval condition's performance. Since the performance of optimal retrieval is above the baseline but not greater than 75% when the features are perfectly correlated, the explanations are not helping the model decide whether to use the strong or weak feature, but they are helping these features be used in the first place (see the footnote in the results for RQ6 in the main body).

It may even be surprising that the full-info explanations are useful when the strong-weak correlation is 0, since the Causal Integer explanations are not. In Fig. 17, we see that, while the correlation is 0, some explanations may be hard to interpret when a small amount of training data is available. Here, we simply add 5 to each integer in the full-info explanations. Using optimal retrieval, models struggle to correctly interpret these explanations at a low sample size, but with more data the correct interpretation is identified.

Figure 18. (RQ7) Effect of retrieval model choice: model performance by choice of retriever, with evidential explanations. Using a pretrained Sentence-BERT model is vital to the success of learning a retrieval model in our synthetic task.
B.7. RQ7: How does the co-dependence between classifier and retrieval model influence the viability of joint training?
Hyperparameters in experiments for this RQ match those for RQ3. In Fig. 18, we show the effect of the retrieval model choice on the viability of learning retrieval. As in the main body, we also use evidential explanations with ε = 2. We find that it is necessary to use a pretrained Sentence-RoBERTa model; simply using a pretrained RoBERTa-base model will not suffice for learning retrieval with our synthetic task. Surprisingly, this condition cannot outperform even a randomly initialized model with an identical architecture. This could be due to the fact that we use the mean token pooling and cosine similarity that the Sentence-BERT models were trained with.
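For reference, a minimal sketch of the Sentence-BERT-style scoring referred to above: mean pooling over token states followed by cosine similarity. It assumes a Hugging Face-style encoder returning last_hidden_state, with an attention mask used to ignore padding.

import torch
import torch.nn.functional as F

def mean_pool_embed(encoder, input_ids, attention_mask):
    hidden = encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
    mask = attention_mask.unsqueeze(-1).float()
    # average only over non-padding token states
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def cosine_scores(query_emb, explanation_embs):
    # similarity between one query embedding and a batch of explanation embeddings
    return F.cosine_similarity(query_emb.unsqueeze(0), explanation_embs, dim=-1)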
B.8. RQ8: Does retrieval of explanations improve model performance on existing datasets?
We train for 5 epochs when using the full e-SNLI training set, 20 epochs when using the smaller values of n for any dataset, and 10 epochs for other larger values of n.

In Fig. 19, we show the result of varying the value of k used to calculate dev set accuracy for the retrieval model in the e-SNLI condition with n = 10000. We see no meaningful changes in dev set accuracy across values of k from 1 to 20, showing that increasing k at test time is not a reliable way to improve retrieval model accuracy in this setting.

Lastly, we observe that the ELV-M condition from Zhou et al. (2020), which is H-Mean with fixed retrieval and (C=10, k=1), does not outperform baselines on TACRED and SemEval. The approach obtains 87.99% on TACRED, where our baseline is 88.29%, and 76.46% on SemEval, where the baseline is 76.94%. Besides using RoBERTa models instead of BERT, one change we make from the implementation in Zhou et al. (2020) is to disallow conditioning on a data point's own explanation when predicting its label, although this is not relevant for predicting test points in either dataset.

Figure 19. (RQ8) Accuracy by k for a finetuned model: dev set accuracy across k for the retrieval model on e-SNLI using 10,000 training points.

Figure 20. Seed variance for some representative experimental conditions.

B.9. Confidence Intervals and Hypothesis Testing
We compute confidence intervals for our synthetic data tasks to represent seed variance around some mean seed performance, while confidence intervals and associated hypothesis tests for existing datasets represent sample variance.
With synthetic data we represent seed variance in figures rather than sample variance because the sample variance is fairly low given the number of generated test points and could be driven arbitrarily low with more generated test points; for instance, the 95% confidence interval for a model accuracy of 90% would be very tight at this test set size. To calculate seed variance, we run 10 random seeds for our baseline condition (no-retrieval) with the default synthetic task setup. Then we run 5 seeds with learned retrieval using (1) TextCat with full-info explanations, (2) TextCat with evidential explanations, and (3) H-Mean with evidential explanations. The results of these runs are shown in Fig. 20. We then assume that seed variance is invariant across experimental factors not related to the choice of conditioning method or explanation kind, and we assign 95% confidence intervals across experimental conditions based on these four representative conditions. We prioritize assignments based on the explanation kind (full-info vs. evidential or recomposable), then by conditioning mechanism, when, for instance, some conditions use combinations of methods and explanation kinds not represented in these conditions. We assume these invariances in order to efficiently calculate seed variance: running 5 seeds per retrieval condition and 10 per non-retrieval condition would increase the number of synthetic data experiments in this paper from 172 to 1035. In synthetic data experiments, we comment on effects far larger than the confidence intervals and do not conduct hypothesis tests. The confidence intervals shown for model accuracies on existing datasets are 95% confidence intervals on the underlying binomial probability. The hypothesis tests conducted for RQ8 are two-sided tests for a difference in binomial means.
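For concreteness, a minimal sketch of the interval and test used here: a normal-approximation 95% confidence interval for a binomial accuracy, and a two-sided two-proportion z-test for the difference between two models' accuracies.

import math

def binomial_ci_halfwidth(acc, n, z=1.96):
    # 95% normal-approximation CI half-width for an accuracy measured on n points
    return z * math.sqrt(acc * (1.0 - acc) / n)

def two_proportion_z_test(acc1, n1, acc2, n2):
    # two-sided test for a difference in binomial means, with a pooled variance estimate
    pooled = (acc1 * n1 + acc2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (acc1 - acc2) / se
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return z, p_value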
C. Synthetic Task Generative Process

The required parameters for data generation include: (1) a training sample size, sample-size, and (2) num-tasks, the number of unique integer pairs to be counted, or, equivalently, the number of points per index, n_task. In all experiments, we use a maximum integer value of 100 to appear in the sequences, along with a fixed maximum value for the index. We give the general generative process below. Note that the dev and test sets are constructed with the extra constraint that their sequences must not appear in the training data. Further note that this is the generic version of the generative process, and in some experiments the process is altered. For example, in RQ5, the indicator is always 1 and the construction of the map from index values to (m, n) tuples occurs in the special way described in the experimental design for RQ5.

1. Sample {index_t}, t = 1, ..., num-tasks, from the uniform distribution over integers up to the maximum index value, without replacement.
2. Sample {(m, n, r, d)_t}, t = 1, ..., num-tasks, from the uniform distribution over integers, unif([1, 100]), without replacement and requiring that m ≠ n ≠ r ≠ d.
3. Define the set {(index, (m, n, r, d)_index)} for index and (m, n, r, d) drawn from their respective sets, without replacement, in an arbitrary order.
4. Compute the number of points per index, n_task = sample-size // num-tasks.
5. For each index in {index_t}:
   (a) Sample a vector of length n_task, balanced between 1s and 2s, that gives the values of {indicator_p} for the points with that index.
   (b) Sample a vector of length n_task, balanced between 0s and 1s, representing whether the features [m > n] and [r > d] should correlate (1 implies they are equal, and 0 unequal). This balance changes when the strong-weak correlation is intended to change.
   (c) Sample a vector of length n_task, balanced between its two values, representing whether (m, n) or (r, d) should be the more numerous integers in the sequence (so that there is no bias, even a random one, between features by size).
   (d) For i = 1, ..., n_task:
       i. Place the index in the first element of an empty array, and the indicator in the second.
       ii. Based on the i-th elements of the three vectors described above, allocate samples of the integers in (m, n, r, d)_index into the remaining 18 slots.
       iii. If there are any remaining slots after these integers are randomly allocated, fill them with i.i.d. samples from unif(1, 100).
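As a rough illustration of steps 1-5, the simplified sketch below generates data in the same spirit; the count-allocation and label-balancing logic at the end of step 5 is abbreviated here (counts are sampled freely rather than balanced), so the procedure above remains the authoritative description. MAX_INDEX is a placeholder value.

import random

MAX_INT, SEQ_LEN, MAX_INDEX = 100, 20, 10000  # MAX_INDEX is an assumed placeholder

def sample_tasks(num_tasks):
    # steps 1-3: pair unique index values with unique (m, n, r, d) tuples
    indices = random.sample(range(1, MAX_INDEX + 1), num_tasks)
    tuples, seen = [], set()
    while len(tuples) < num_tasks:
        tup = tuple(random.randint(1, MAX_INT) for _ in range(4))
        if len(set(tup)) == 4 and tup not in seen:
            seen.add(tup)
            tuples.append(tup)
    return dict(zip(indices, tuples))

def make_point(index, task, indicator):
    # step 5(d), simplified: fill the remaining slots with task integers and filler
    m, n, r, d = task
    body = [random.choice([m, n, r, d, random.randint(1, MAX_INT)])
            for _ in range(SEQ_LEN - 2)]
    x = [index, indicator] + body
    a, b = (m, n) if indicator == 1 else (r, d)   # causal pair for this indicator
    y = int(body.count(a) > body.count(b))        # label from relative counts
    return x, y

def generate(sample_size, num_tasks):
    tasks = sample_tasks(num_tasks)
    n_task = sample_size // num_tasks             # step 4
    data = []
    for index, task in tasks.items():
        for i in range(n_task):
            indicator = 1 if i < n_task // 2 else 2   # balanced indicator values
            data.append(make_point(index, task, indicator))
    return data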