Does injecting linguistic structure into language models lead to better alignment with brain recordings?
Mostafa Abdou, Ana Valeria Gonzalez, Mariya Toneva, Daniel Hershcovich, Anders Søgaard
University of Copenhagen · Carnegie Mellon University
Abstract
Neuroscientists evaluate deep neural networks for natural language processing as possible candidate models for how language is processed in the brain. These models are often trained without explicit linguistic supervision, but have been shown to learn some linguistic structure in the absence of such supervision (Manning et al., 2020), potentially questioning the relevance of symbolic linguistic theories in modeling such cognitive processes (Warstadt and Bowman, 2020). We evaluate across two fMRI datasets whether language models align better with brain recordings if their attention is biased by annotations from syntactic or semantic formalisms. Using structure from dependency or minimal recursion semantics annotations, we find that alignments improve significantly for one of the datasets. For the other dataset, we see more mixed results. We present an extensive analysis of these results. Our proposed approach enables the evaluation of more targeted hypotheses about the composition of meaning in the brain, expanding the range of possible scientific inferences a neuroscientist could make, and opens up new opportunities for cross-pollination between computational neuroscience and linguistics.
1 Introduction

Recent advances in deep neural networks for natural language processing (NLP) have generated excitement among computational neuroscientists, who aim to model how the brain processes language. These models are argued to better capture the complexity of natural language semantics than previous computational models, and are thought to represent meaning in a way more similar to how it is hypothesized to be represented in the human brain. For neuroscientists, these models provide possible hypotheses for how word meanings compose in the brain. Previous work has evaluated the plausibility of such candidate models by testing how well representations of text extracted from these models align with brain recordings of humans during language comprehension tasks (Wehbe et al., 2014; Jain and Huth, 2018; Gauthier and Ivanova, 2018; Gauthier and Levy, 2019; Abnar et al., 2019; Toneva and Wehbe, 2019; Schrimpf et al., 2020; Caucheteux and King, 2020), and found some correspondences.

However, modern NLP models are often trained without explicit linguistic supervision (Devlin et al., 2018; Radford et al., 2019), and the observation that they nevertheless learn some linguistic structure has been used to question the relevance of symbolic linguistic theories. Whether injecting such symbolic structures into language models would lead to even better alignment with cognitive measurements, however, has not been studied. In this work, we address this gap by training BERT (§3.1) with structural bias and evaluating its alignment with brain recordings (§3.2). Structure is derived from three formalisms, UD, DM and UCCA (§3.3), which come from different linguistic traditions and capture different aspects of syntax and semantics. Our approach, illustrated in Figure 1, allows for quantifying the brain alignment of the structurally-biased NLP models in comparison to the base models, as related to new information about linguistic structure learned by the models that is also potentially relevant to language comprehension in the brain. More specifically, in this paper, we:

(a) Employ a fine-tuning method utilising structurally guided attention for injecting structural bias into language model (LM) representations.

(b) Assess the representational alignment to brain activity measurements of the fine-tuned and non-fine-tuned LMs.

(c) Further evaluate the LMs on a range of targeted syntactic probing tasks and a semantic tagging task, which allow us to uncover fine-grained information about their structure-sensitive linguistic capabilities.

(d) Present an analysis of various linguistic factors that may lead to improved or deteriorated brain alignment.

Figure 1: Overview of our approach. We use BERT as a baseline and inject structural bias in two ways. Through a brain decoding task, we then compare the alignment of the (sentence and word) representations of our baseline and our altered models with brain activations.
2 Related Work

Mitchell et al. (2008) first showed that there is a relationship between the co-occurrence patterns of words in text and brain activation for processing the semantics of words. Specifically, they showed that a computational model trained on co-occurrence patterns for a few verbs was able to predict fMRI activations for novel nouns. Since this paper was introduced, many works have attempted to isolate other features that enable prediction and interpretation of brain activity (Frank et al., 2015; Brennan et al., 2016; Lopopolo et al., 2017; Anderson et al., 2017; Pereira et al., 2018; Wang et al., 2020). Gauthier and Ivanova (2018), however, emphasize that directly optimizing for the decoding of neural representation is limiting, as it does not allow for the uncovering of the mechanisms that underlie these representations. The authors suggest that in order for us to better understand linguistic processing in the brain, we should also aim to train models that optimize for a specific linguistic task and explicitly test these against brain activity.

Following this line of work, Toneva and Wehbe (2019) present experiments both predicting brain activity and evaluating representations on a set of linguistic tasks. They first show that using uniform attention in early layers of BERT (Devlin et al., 2018) instead of pretrained attention leads to better prediction of brain activity. They then use the representations of this altered model to make predictions on a range of syntactic probe tasks which isolate different syntactic phenomena (Marvin and Linzen, 2019), finding improvements over the pretrained BERT attention. Gauthier and Levy (2019) present a series of experiments in which they fine-tune BERT on a variety of tasks, including language modeling as well as some custom tasks such as scrambled language modeling and part-of-speech language modeling. They then perform brain decoding, where a linear mapping is learnt from fMRI recordings to the fine-tuned BERT model activations. They find that the best mapping is obtained with the scrambled language modeling fine-tuning. Further analysis using a structural probe method confirmed that the token representations from the scrambled language model performed poorly when used for reconstructing Universal Dependencies (UD; Nivre et al., 2016) parse trees.

When dealing with brain activity, many confounds may lead to seemingly divergent findings, such as the size of fMRI data, the temporal resolution of fMRI, the low signal-to-noise ratio, as well as how the tasks were presented to the subjects, among many other factors. For this reason, it is essential to take sound measures for reporting results, such as cross-validating models, evaluating on unseen test sets, and conducting a thorough statistical analysis.
3 Approach

Figure 1 shows a high-level outline of our experimental design, which aims to establish whether injecting structure derived from a variety of syntacto-semantic formalisms into neural language model representations can lead to better correspondence with human brain activation data. We utilize fMRI recordings of human subjects reading a set of texts. Representations of these texts are then derived from the activations of the language models. Following Gauthier and Levy (2019), we obtain LM representations from BERT for all our experiments (specifically, bert-large-uncased trained with whole-word masking). We apply masked language model fine-tuning with attention guided by the formalisms to incorporate structural bias into BERT's hidden-state representations. Finally, to compute alignment between the BERT-derived representations, with and without structural bias, and the fMRI recordings, we employ the brain decoding framework, where a linear decoder is trained to predict the LM-derived representation of a word or a sentence from the corresponding fMRI recordings.

3.1 Language model representations

BERT uses wordpiece tokenization, dividing the text into sub-word units. For a sentence $S$ made up of $P$ wordpieces, we perform mean-pooling over BERT's final-layer hidden states $[h_1, \ldots, h_P]$, obtaining a vector representation of the sentence $S_{\text{mean}} = \frac{1}{P}\sum_{p=1}^{P} h_p$ (Wu et al., 2016). In initial experiments, we found that this leads to a closer match with brain activity measurements compared to both max-pooling and the special [CLS] token, which is used by Gauthier and Levy (2019). Similarly, to derive word representations for a word $W$ made up of $P$ wordpieces, we apply mean-pooling over the hidden states $[h_1, \ldots, h_P]$ which correspond to the wordpieces that make up $W$: $W_{\text{mean}} = \frac{1}{P}\sum_{p=1}^{P} h_p$. For each dataset, $D_{LM} \in \mathbb{R}^{n \times d_H}$ denotes a matrix of $n$ LM-derived word or sentence representations, where $d_H$ is BERT's hidden-layer dimensionality ($d_H = 1024$ in our experiments).

3.2 fMRI data

We utilize two fMRI datasets, which differ in the granularity of linguistic cues to which human responses were recorded. The first, collected in Pereira et al. (2018)'s experiment 2, comprises a single brain image per entire sentence. In the second, more fine-grained dataset, recorded by Wehbe et al. (2014), each brain image corresponds to 4 words. We conduct a sentence-level analysis for the former and a word-level one for the latter.

Pereira2018 consists of fMRI recordings from 8 subjects. The subjects were presented with stimuli consisting of 96 Wikipedia-style passages written by the authors, consisting of 4 sentences each. The subjects read the sentences one by one and were instructed to think about their meaning. The resulting data for each subject consists of 384 vectors of dimension 200,000, one vector per sentence. These were reduced to 256 dimensions using PCA by Gauthier and Levy (2019). These PCA projections explain more than 95% of the variance among sentence responses within each subject. We use this reduced version in our experiments.
Wehbe2014 consists of fMRI recordings from 8 subjects as they read a chapter from Harry Potter and the Sorcerer's Stone. For the 5000-word chapter, subjects were presented with words one by one for 0.5 seconds each. An fMRI image was taken every 2 seconds; as a result, each image corresponds to 4 words. The data was further preprocessed (i.e. detrended, smoothed, trimmed) and released by Toneva and Wehbe (2019). We use this preprocessed version to conduct word-level analysis, for which we use PCA to reduce the dimensions of the fMRI images from 25,000 to 750, explaining at least 95% of the variance for each participant. (Even though the images are recorded at the 4-gram level of granularity, a word-level analysis is applied, as in Schwartz et al., 2019.)
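As a concrete illustration of the pooling scheme in §3.1, the following sketch extracts a mean-pooled sentence representation with the Hugging Face transformers library. The model name and the pooling itself follow the paper; the extraction code is our illustration, not the authors' released code.

```python
import torch
from transformers import BertModel, BertTokenizer

# The paper uses bert-large-uncased trained with whole-word masking (d_H = 1024).
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking")
model = BertModel.from_pretrained("bert-large-uncased-whole-word-masking").eval()

def sentence_representation(sentence: str) -> torch.Tensor:
    """Mean-pool the final-layer hidden states over a sentence's wordpieces."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (P + 2, 1024)
    return hidden[1:-1].mean(dim=0)                    # drop [CLS]/[SEP]; (1024,)

s_mean = sentence_representation("Harry never thought he would fly.")
print(s_mean.shape)  # torch.Size([1024])
```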
3.3 Syntacto-semantic formalisms

To inject linguistic structure into language models, we experiment with three distinct formalisms for representation of syntactic/semantic structure, coming from different linguistic traditions and capturing different aspects of the linguistic signal: UD, DM and UCCA. An example graph for each formalism is shown in Figure 2. Although there are other important structured linguistic formalisms, including meaning representations such as AMR (Banarescu et al., 2013), DRS (Kamp and Reyle, 1993; Bos et al., 2017) and FGD (Sgall et al., 1986; Hajic et al., 2012), we select three relatively different formalisms as a somewhat representative sample. All three have manually annotated datasets, which we use for our experiments.

Figure 2: Manually annotated example graphs for the sentence "He had been looking forward to learning to fly more than anything else." from the Wehbe2014 dataset: (a) UD (above, orange) and DM (below, blue); (b) UCCA. While UCCA and UD attach all words, DM only connects content words. However, all formalisms capture basic predicate-argument structure, for example, denoting that "more than anything else" modifies "looking forward" rather than "fly".

UD (Universal Dependencies; Nivre et al., 2020) is a syntactic bi-lexical dependency framework (dependencies are denoted as arcs between words, with one word being the head and another the dependent), which represents grammatical relations according to a coarse cross-lingual scheme. For UD data, we use the English Web Treebank corpus (EWT; Silveira et al., 2014), which contains 254,830 words and 16,622 sentences, taken from five genres of web media: weblogs, newsgroups, emails, reviews, and Yahoo! answers.

DM (DELPH-IN MRS Bi-Lexical Dependencies; Ivanova et al., 2012) is derived from the underspecified logical forms computed by the English Resource Grammar (Flickinger et al., 2017; Copestake et al., 2005), and is one of the frameworks targeted by the Semantic Dependency Parsing SemEval shared tasks (SDP; Oepen et al., 2014, 2015). We use the English SDP data for DM (Oepen et al., 2016), annotated on newspaper text from the Wall Street Journal (WSJ), containing 802,717 words and 35,656 sentences.

UCCA (Universal Cognitive Conceptual Annotation; Abend and Rappoport, 2013) is based on cognitive linguistic and typological theories, primarily Basic Linguistic Theory (Dixon, 2010/2012). We use UCCA annotations over web reviews text from the English Web Treebank, and from English Wikipedia articles on celebrities. In total, they contain 138,268 words and 6,572 sentences. For uniformity with the other formalisms, we use bi-lexical approximation to convert UCCA graphs, which have a hierarchical constituency-like structure, to bi-lexical graphs with edges between words. This conversion keeps about 91% of the information (Hershcovich et al., 2017).
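Since all three formalisms are reduced to bi-lexical graphs, the structural input to fine-tuning is just a set of head-dependent edges per sentence. Below is a minimal sketch of reading such edges from a CoNLL-U file (the UD case), assuming only the standard CoNLL-U column layout; the function name is ours.

```python
from typing import List, Tuple

def conllu_edges(sentence_block: str) -> List[Tuple[int, int]]:
    """Read undirected head-dependent edges (1-based token ids) from one
    CoNLL-U sentence block; column 7 of the standard format is HEAD."""
    edges = []
    for line in sentence_block.strip().splitlines():
        if line.startswith("#"):
            continue                      # skip sentence-level comments
        cols = line.split("\t")
        tok_id, head = cols[0], cols[6]
        if "-" in tok_id or "." in tok_id:
            continue                      # skip multiword ranges / empty nodes
        if head != "0":                   # head 0 marks the root
            edges.append((int(head), int(tok_id)))
    return edges
```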
3.4 Structurally guided attention

Recent work has explored ways of modifying attention in order to incorporate structure into neural models (Chen et al., 2016; Strubell et al., 2018; Strubell and McCallum, 2018; Zhang et al., 2019; Bugliarello and Okazaki, 2019). For instance, Strubell et al. (2018) incorporate syntactic information by training one attention head to attend to syntactic heads, and find that this leads to improvements in Semantic Role Labeling (SRL). Drawing on these approaches, we modify the BERT Masked Language Model (MLM) objective with an additional structural attention constraint.

BERT-large consists of 24 layers and 16 attention heads. Each attention head $\text{head}_i$ takes as input a sequence of representations $h = [h_1, \ldots, h_P]$ corresponding to the $P$ wordpieces in the input sequence. Each representation $h_p$ is transformed into query, key, and value vectors. The scaled dot product is computed between the query and all keys, and a softmax function is applied to obtain the attention weights. The output of $\text{head}_i$ is a matrix $O_i$, corresponding to the weighted sum of the value vectors.

For each formalism and its corresponding corpus, we extract an adjacency matrix from each sentence's parse. For the sequence $S$, the adjacency matrix $A_S$ is a matrix of size $P \times P$, where the columns correspond to the heads in the parse tree and the rows correspond to the dependents. The matrix elements denote which tokens are connected in the parse tree, taking into account BERT's wordpiece tokenization. Edge directionality is not considered. We modify BERT to accept as input a matrix $A_S$ as well as $S$, maintaining the original MLM objective. For each attention head $\text{head}_i$, we compute the binary cross-entropy loss between $O_i$ and $A_S$ and add that to our total loss, potentially down-weighted by a factor of $\alpha$ (a hyperparameter). BERT's default MLM fine-tuning hyperparameters are employed, and $\alpha$ is set based on validation set perplexity scores in initial experiments.

Structural information can be injected into BERT in many ways, in many heads, across many layers. Because the appropriate level and extent of supervision is unknown a priori, we run fine-tuning settings over various combinations of the number of supervised layers and attention heads. Layers are excluded from the bottom up (e.g., when 10 layers are supervised, it is the topmost 10); heads are chosen according to their indices (which are arbitrary). For each fine-tuning setting, we perform two fine-tuning runs. (We find that the mean difference in brain decoding score, Pearson's $r$, between two runs of the same setting, across all settings, is low, indicating that random initialization does not play a major part in our results; we therefore do not carry out more runs.)

For each run $r$ of each fine-tuning setting $f$, we derive a set of sentence or word representations $D_{fr} \in \mathbb{R}^{n \times d_H}$ from each fine-tuned model, using the approach described in §3.1 for obtaining $D_{LM}$, the baseline set of representations from BERT before fine-tuning. We then use development set embedding-space hubness, an indicator of the degree of difficulty of indexing and analysing data (Houle, 2015) which has been used to evaluate embedding space quality (Dinu et al., 2014), as an unsupervised selection criterion for the fine-tuned models, selecting the model with the lowest degree of hubness (per formalism) according to the Robin Hood Index (Feldbauer et al., 2018). The development sets are the second chapter of Harry Potter for Wehbe2014 and the first 500 sentences of English Wikipedia for Pereira2018. This yields three models for each of the two datasets, one per formalism, for which we present results below.
In addition to the approach described above, we also experiment with directly optimizing for the prediction of the formalism graphs (i.e., parsing) as a way of encoding structural information in LM representations. We find that this leads to a consistent decline in alignment of the LMs' representations to brain recordings. Further details can be found in Appendix A.
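To make the guided-attention objective of §3.4 concrete, here is a minimal PyTorch sketch. The paper states the BCE is computed between each supervised head's output and $A_S$; we use the $P \times P$ post-softmax attention-weight matrix, which is the quantity that matches $A_S$'s shape. All names are ours, and the sketch omits batching.

```python
import torch
import torch.nn.functional as F

def guided_attention_loss(mlm_loss: torch.Tensor,
                          head_attentions: list,
                          adjacency: torch.Tensor,
                          alpha: float) -> torch.Tensor:
    """Add a structural term to the MLM objective for the supervised heads.

    head_attentions: list of (P, P) post-softmax attention matrices.
    adjacency: binary float (P, P) matrix A_S from the formalism's parse,
               aligned to wordpieces, with edge direction ignored.
    """
    total = mlm_loss
    for attn in head_attentions:
        # Binary cross-entropy between the head's attention and the parse,
        # down-weighted by alpha as in Section 3.4.
        total = total + alpha * F.binary_cross_entropy(attn, adjacency)
    return total
```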
3.5 Brain decoding

To measure the alignment of the different LM-derived representations to the brain activity measurements, brain decoding is performed, following the setup described in Gauthier and Levy (2019). For each subject $i$'s fMRI images corresponding to a set of $n$ sentences or words, a ridge regression model is trained to linearly map from brain activity $B_i \in \mathbb{R}^{n \times d_B}$ ($n = 384$, $d_B = 256$ for Pereira2018; $n = 4369$, $d_B = 750$ for Wehbe2014) to an LM-derived representation ($D_{fr}$ or $D_{LM}$), minimizing the following loss:

$$L_{ifr} = \|B_i G_{i \to fr} - D_{fr}\|_2^2 + \lambda \|G_{i \to fr}\|_2^2$$

where $G_{i \to fr} \in \mathbb{R}^{d_B \times d_H}$ is a linear map and $\lambda$ is a hyperparameter for ridge regularization. Nested 12-fold cross-validation (Cawley and Talbot, 2010) is used for selection of $\lambda$, training, and evaluation.
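A sketch of this decoder with scikit-learn, under the paper's setup (12-fold outer cross-validation, ridge regularization tuned on the training folds). The $\lambda$ grid is our assumption; the paper does not specify one.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def brain_decoding_score(B: np.ndarray, D: np.ndarray,
                         alphas=(0.1, 1.0, 10.0, 100.0)) -> float:
    """B: (n, d_B) fMRI images; D: (n, d_H) LM representations.
    Outer 12-fold CV for evaluation; RidgeCV tunes lambda on the
    training folds, standing in for the inner loop."""
    scores = []
    for train, test in KFold(n_splits=12).split(B):
        decoder = RidgeCV(alphas=alphas).fit(B[train], D[train])
        pred = decoder.predict(B[test])
        # Pearson correlation between each predicted and true vector.
        for p, t in zip(pred, D[test]):
            scores.append(np.corrcoef(p, t)[0, 1])
    return float(np.mean(scores))
```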
Evaluation. To evaluate the regression models, Pearson's correlation coefficient between the predicted and the corresponding held-out true sentence or word representations is computed. (Other methods for evaluating representational correspondence, such as Representational Similarity Analysis (Kriegeskorte et al., 2008) and the Centered Kernel Alignment similarity index (Kornblith et al., 2019), were also explored but were found to be either less powerful or less consistent across subjects and datasets. Appendix B shows results for the rank-based metric reported in Gauthier and Levy (2019), which we find to strongly correspond to Pearson's correlation; this metric evaluates representations based on their support for contrasts between sentences/words which are relevant to the brain recordings. Other metrics for the evaluation of goodness of fit were found to be less consistent.) We find that this metric is consistent across subjects and across the two datasets. We run 5000 bootstrap resampling iterations and a) report the mean correlation coefficient (referred to as the brain decoding score/performance; the mean is taken across fine-tuning runs, cross-validation splits, and bootstrap iterations), b) use a paired bootstrap test to establish whether two models' mean (across stimuli) scores were drawn from populations having the same distribution (this is applied per subject, to test for strength of evidence of generalization over sentence stimuli), and c) apply the Wilcoxon signed rank test (Wilcoxon, 1992) to the by-subject scores to test for evidence of strength of generalization over subjects. Bonferroni correction (for multiple comparisons) is used to adjust for multiple hypothesis testing. See Appendix C for details.

Figure 3: Brain decoding score (mean Pearson's $r$; with 95% confidence intervals shown for subject scores) for models fine-tuned by MLM with guided attention on each of the formalisms, as well as the baseline models: pretrained BERT (dotted line), and BERT fine-tuned by MLM on each formalism's training text without guided attention (domain-finetuned BERT, solid lines).

4 Results

To evaluate the effect of the structurally guided attention, we compute the brain decoding scores for the guided attention models corresponding to each formalism and fMRI dataset, and compare these scores against the brain decoding scores from two baseline models: 1) a domain-finetuned BERT (DF), which fine-tunes BERT using the regular MLM objective on the text of each formalism's training data, and 2) a pretrained BERT. We introduce the domain-finetuned baseline in order to control for any effect that fine-tuning on a specific text domain may have on the model representations. Comparing against this baseline allows us to better isolate the effect of injecting the structural bias from the possible effect of simply fine-tuning on the text domain. We further compare to a pretrained baseline in order to evaluate how the structurally guided attention approach performs against an off-the-shelf model that is commonly used in brain-alignment experiments.
Figure 3 shows the sentence-level decoding performance on the Pereira2018 dataset for the guided attention fine-tuned models (GA) and both baseline models (domain-finetuned and pretrained). We find that the DF baseline (shown in Figure 3 as solid lines) leads to brain decoding scores that are either lower than or not significantly different from the pretrained baseline. Specifically, for DM and UCCA, it performs below the pretrained baseline, which suggests that simply fine-tuning on these corpora results in BERT's representations becoming less aligned with the brain activation measurements from Pereira2018. We find that all GA models significantly outperform their respective DF baselines, for all subjects. We further find that, compared to the pretrained baselines: a) the UD GA model shows significantly better brain decoding scores for 7 out of 8 subjects, b) the DM GA model for 4 out of 8 subjects, and c) the UCCA GA model shows scores not significantly different from, or lower than, the pretrained baseline for all subjects. For details see Appendix C.

For Wehbe2014, where analysis is conducted on the word level, we again find that domain-finetuned models, especially the one fine-tuned on the UCCA domain text, achieve considerably lower brain decoding scores than the pretrained model, as shown in Figure 3. Furthermore, the guided attention models for all three formalisms outperform both baselines by a large, significant margin (after Bonferroni correction).

Overall, our results show that structural bias from syntacto-semantic formalisms can improve the ability of a linear decoder to map the BERT representations of stimuli sentences to their brain recordings. This improvement is especially clear for Wehbe2014, where token representations, and not aggregated sentence representations (as in Pereira2018), are decoded, indicating that finer-grain recordings and analyses might be necessary for modelling the correlates of linguistic structure in brain imaging data. To arrive at a better understanding of the effect of the structural bias and its relationship to brain alignment, in what follows we present an analysis of the various factors which affect and interact with this relationship.
5 Analysis

The effect of domain. Our results suggest that the domain of the fine-tuning data and of the stimuli might play a significant role, despite having been previously overlooked: simply fine-tuning on data from different domains leads to varying degrees of alignment to brain data. To quantify this effect, we compute the average word perplexity of the stimuli from both fMRI datasets for the pretrained and DF baselines on each of the three domain datasets. (Note that this is not equivalent to the commonly utilised sequence perplexity, which cannot be calculated for non-auto-regressive models, but it suffices for quantifying the effect of domain shift.) If the domain of the corpora used for fine-tuning influences our results as hypothesized, we expect this score to be higher for the DF baselines. We find that this is indeed the case, and that for those baselines (DF), an increase in perplexity roughly corresponds to lower brain decoding scores; see details in Appendix D. This finding calls attention to the necessity of accounting for domain match in work utilizing cognitive measurements, and emphasizes the importance of the DF baseline in this study.
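One plausible reading of "average word perplexity" for a masked LM is the pseudo-perplexity obtained by masking each wordpiece in turn and scoring it under the model. The sketch below implements that reading; the authors' exact procedure may differ.

```python
import math
import torch
from transformers import BertForMaskedLM, BertTokenizer

name = "bert-large-uncased-whole-word-masking"
tokenizer = BertTokenizer.from_pretrained(name)
mlm = BertForMaskedLM.from_pretrained(name).eval()

def average_word_perplexity(sentence: str) -> float:
    """Mask each wordpiece in turn, score it under the MLM, and exponentiate
    the mean negative log-likelihood (a pseudo-perplexity)."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    nlls = []
    for pos in range(1, len(ids) - 1):            # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, pos]
        nlls.append(-torch.log_softmax(logits, dim=-1)[ids[pos]].item())
    return math.exp(sum(nlls) / len(nlls))
```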
Targeted syntactic evaluation. We evaluate all models on a range of syntactic probing tasks proposed by Marvin and Linzen (2019), using the evaluation script from Goldberg (2019). This dataset tests the ability of models to distinguish minimal pairs of grammatical and ungrammatical sentences across a range of syntactic phenomena. Figure 4 shows the results for the three Wehbe2014 models across all subject-verb agreement (SVA) tasks. (See Appendix F for the full set of results, for both Wehbe2014 and Pereira2018, with similar patterns.) We observe that after GA fine-tuning: a) the DM guided-attention model, and to a lesser extent the UD guided-attention model, have a higher score than the pretrained baseline and the domain-finetuned baselines for most SVA tasks, and b) the ranking of the models corresponds to their ranking on the brain decoding task (DM > UD > UCCA). Although all three formalisms annotate the subject-verb-object or predicate-argument structure necessary for solving SVA tasks, it appears that some of them do so more effectively, at least when encoded into an LM by GA. (For reflexive anaphora tasks, these trends are reversed: the models underperform the pretrained baseline and their ranking is the converse of their brain decoding scores. Reflexive anaphora are not explicitly annotated for in any of the three formalisms. We find, however, that they occur in a larger proportion of the sentences comprising the UCCA corpus than of those in the UD or DM ones, indicating that domain might play a role here too.)

Figure 4: Accuracy per subject-verb agreement category of Marvin and Linzen (2019) for the three Wehbe2014 models and each of the four baselines.

Effect on semantics. To evaluate the impact of structural bias on the encoding of semantic information, we consider Semantic Tagging (Abzianidze and Bos, 2017), commonly used to analyse the semantics encoded in LM representations (Belinkov et al., 2018; Liu et al., 2019): tokens are labeled to reflect their semantic role in context. For each of the three guided attention Wehbe2014 models and the pretrained model, a linear probe is trained to predict a word's semantic tag, given the contextual representation induced by the model (see Appendix E for details). For each of the three GA models, Figure 5 shows the change in test set classification F1-score, relative to the pretrained baseline, per coarse-grained grouping of tags. (The eight most frequent coarse-grained categories from an original set of ten are included, ordered by frequency from left to right; we exclude the UNKNOWN category because it is uninformative and the ANAPHORIC category because it shows no change from the baseline for all three models. Note that the test set consists of 263,516 instances; the margin of change in the number of instances is therefore considerable, even for the temporal category, which is the least frequent in the test set. See test set category frequencies in the appendix.) We find that the structural bias improves the ability to correctly recognize almost all of the semantic phenomena considered, indicating that our method for injecting linguistic structure leads to better encoding of a broad range of semantic distinctions. Furthermore, the improvements are largest for phenomena that have a special treatment in the linguistic formalisms, namely discourse markers and temporal entities. Identifying named entities is negatively impacted by GA with DM, where they are indiscriminately labeled as compounds.
Content words and function words are treated differently by each of the formalisms: UD and UCCA encode all words, where function words have special labels, while DM only attaches content words. Our guided attention ignores edge labels (dependency relations), and so it considers UD and UCCA's attachment of function words just as meaningful as that of content words. Figure 8 in Appendix G shows a breakdown of decoding performance on content and function words for Wehbe2014. We find that: a) all GA models and the pretrained model show a higher function-word than content-word decoding score, and b) a large part of the decrease in score of two of the three domain-finetuned baselines (UD and DM) compared to the pretrained model is due to content words.
Discrepancy between datasets. While the models fine-tuned with GA show considerable improvement in brain decoding for Wehbe2014 (word-level analysis), the improvements are much more modest for Pereira2018 (sentence-level analysis). A possible reason for this is the loss of structural information that occurs when aggregating over token representations to construct sentence-level ones. For a more direct comparison, we conduct a sentence-level analysis for the Wehbe2014 dataset, mean-pooling over token hidden states and their corresponding fMRI time slices. If the advantage of the guided attention models over the baseline drops, this would indicate that mean-pooling is at least partially responsible for the lower improvements observed for Pereira2018. We find that this is indeed the case: in this setting, decoding scores for the GA models are not significantly different from, or are lower than, the pretrained baseline.
Caveats. The fMRI data used for both the sentence- and word-level analyses was recorded while participants read text without performing a specific task. (Note that since the Pereira2018 fMRI recordings are taken at the sentence level and Wehbe2014 at the 4-gram level, the comparison is still approximate. A possible confound is that averaging over fMRI time slices could also lead to loss of information.) Although we observe some correlates of linguistic structure, it is possible that uncovering more fine-grained patterns would necessitate brain data recorded while participants perform a targeted task. For future work it would be interesting to investigate whether an analysis based on a continuous, naturalistic listening fMRI dataset (Brennan and Hale, 2019) matches up to the results we have obtained. Regarding the different linguistic formalisms, there are potential confounds such as domain, corpus size, and dependency length (i.e. the distance between words attached by a relation), which depend both on the formalism and on the underlying training set text. To properly control for them, a corpus annotated for all formalisms is necessary, but such a corpus of sufficient size is not yet available.

Figure 5: Change in F1-score per coarse-grained semantic class compared to the pretrained baseline for the three guided attention Wehbe2014 models.

6 Conclusions

We propose a framework to investigate the effect of incorporating specific structural biases in language models for brain decoding. We present evidence that inducing linguistic structure bias through fine-tuning, using attention guided according to syntacto-semantic formalisms, can improve brain decoding performance across two fMRI datasets. (It is interesting to note that decoding score rank for Wehbe2014 corresponds to fine-tuning corpus size for the GA models (DM > UD > UCCA), but not for the domain-finetuned models. A reasonable conclusion to draw from this is that dataset size might play a role in the effective learning of a structural bias.) For each of the investigated formalisms, we observed that the models that aligned most with the brain performed best at a range of subject-verb agreement syntactic tasks, suggesting that language comprehension in the brain, as captured by fMRI recordings, and the tested syntactic tasks may rely on common linguistic structure that was partly induced by the added attention constraints. Across formalisms, we found that models with attention guided by DM and UD consistently exhibited better alignment with the brain than UCCA for both fMRI datasets. Rather than concluding that DM and UD are more cognitively plausible, controlled experiments, with fine-tuning on each annotated corpus as plain text, suggest that the text domain is an important, previously overlooked confound. Further investigation is needed using a common annotated corpus for all formalisms to make conclusions about their relative aptness.

Overall, our proposed approach enables the evaluation of more targeted hypotheses about the composition of meaning in the brain, and opens up new opportunities for cross-pollination between computational neuroscience and linguistics. To facilitate this, we make all code and data for our experiments available at: https://github.com/mhany90/Structural_bias_brain

References
Omri Abend and Ari Rappoport. 2013. Universal Conceptual Cognitive Annotation (UCCA). In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 228–238, Sofia, Bulgaria. Association for Computational Linguistics.

Samira Abnar, Lisa Beinborn, Rochelle Choenni, and Willem Zuidema. 2019. Blackbox meets blackbox: Representational similarity and stability analysis of neural language models and brains. arXiv preprint arXiv:1906.01539.

Lasha Abzianidze and Johan Bos. 2017. Towards universal semantic tagging. arXiv preprint arXiv:1709.10381.

Andrew James Anderson, Jeffrey R Binder, Leonardo Fernandino, Colin J Humphries, Lisa L Conant, Mario Aguilar, Xixi Wang, Donias Doko, and Rajeev DS Raizada. 2017. Predicting neural activity patterns associated with sentences using a neurobiologically motivated model of semantic representation. Cerebral Cortex, 27(9):4379–4395.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia, Bulgaria. Association for Computational Linguistics.

Yonatan Belinkov, Lluís Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2018. Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. arXiv preprint arXiv:1801.07772.

Johan Bos, Valerio Basile, Kilian Evang, Noortje J Venhuizen, and Johannes Bjerva. 2017. The Groningen Meaning Bank. In Handbook of Linguistic Annotation, pages 463–496. Springer.

Jonathan R Brennan and John T Hale. 2019. Hierarchical structure guides rapid linguistic predictions during naturalistic listening. PloS One, 14(1):e0207741.

Jonathan R Brennan, Edward P Stabler, Sarah E Van Wagenen, Wen-Ming Luh, and John T Hale. 2016. Abstract linguistic structure correlates with temporal activity during naturalistic comprehension. Brain and Language, 157:81–94.

Emanuele Bugliarello and Naoaki Okazaki. 2019. Enhancing machine translation with dependency-aware self-attention. arXiv preprint arXiv:1909.03149.

Charlotte Caucheteux and Jean-Rémi King. 2020. Language processing in brains and deep neural networks: computational convergence and its limits. BioRxiv.

Gavin C Cawley and Nicola LC Talbot. 2010. On over-fitting in model selection and subsequent selection bias in performance evaluation. The Journal of Machine Learning Research, 11:2079–2107.

Wanxiang Che, Longxu Dou, Yang Xu, Yuxuan Wang, Yijia Liu, and Ting Liu. 2019. HIT-SCIR at MRP 2019: A unified pipeline for meaning representation parsing via efficient training and effective encoding. In Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning, pages 76–85.

Yun-Nung Chen, Dilek Hakkani-Tur, Gokhan Tur, Asli Celikyilmaz, Jianfeng Gao, and Li Deng. 2016. Knowledge as a teacher: Knowledge-guided structural attention networks. arXiv preprint arXiv:1609.03286.

Ann Copestake, Dan Flickinger, Carl Pollard, and Ivan A Sag. 2005. Minimal recursion semantics: An introduction. Research on Language and Computation, 3(2-3):281–332.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. 2014. Improving zero-shot learning by mitigating the hubness problem. arXiv preprint arXiv:1412.6568.

Robert M. W. Dixon. 2010/2012. Basic Linguistic Theory. Oxford University Press.

Roman Feldbauer, Maximilian Leodolter, Claudia Plant, and Arthur Flexer. 2018. Fast approximate hubness reduction for large high-dimensional data. Pages 358–367. IEEE.

Dan Flickinger, Stephan Oepen, and Emily M Bender. 2017. Sustainable development and refinement of complex linguistic annotations at scale. In Handbook of Linguistic Annotation, pages 353–377. Springer.

Stefan L Frank, Leun J Otten, Giulia Galli, and Gabriella Vigliocco. 2015. The ERP response to the amount of information conveyed by words in sentences. Brain and Language, 140:1–11.

Jon Gauthier and Anna Ivanova. 2018. Does the brain represent words? An evaluation of brain decoding studies of language understanding. arXiv preprint arXiv:1806.00591.

Jon Gauthier and Roger Levy. 2019. Linking artificial and human neural representations of language. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 529–539.

Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. arXiv preprint arXiv:1901.05287.

Jan Hajic, Eva Hajicová, Jarmila Panevová, Petr Sgall, Ondrej Bojar, Silvie Cinková, Eva Fucíková, Marie Mikulová, Petr Pajas, Jan Popelka, et al. 2012. Announcing Prague Czech-English Dependency Treebank 2.0. In LREC, pages 3153–3160.

Daniel Hershcovich, Omri Abend, and Ari Rappoport. 2017. A transition-based directed acyclic graph parser for UCCA. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1127–1138, Vancouver, Canada. Association for Computational Linguistics.

Daniel Hershcovich, Miryam de Lhoneux, Artur Kulmizev, Elham Pejhan, and Joakim Nivre. 2020. Køpsala: Transition-based graph parsing via efficient training and effective encoding. In Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies, pages 236–244, Online. Association for Computational Linguistics.

Michael E Houle. 2015. Inlierness, outlierness, hubness and discriminability: an extreme-value-theoretic foundation. National Institute of Informatics Technical Report NII-2015-002E, Tokyo, Japan.

Angelina Ivanova, Stephan Oepen, Lilja Øvrelid, and Dan Flickinger. 2012. Who did what to whom? A contrastive study of syntacto-semantic dependencies. In Proceedings of the Sixth Linguistic Annotation Workshop, pages 2–11, Jeju, Republic of Korea. Association for Computational Linguistics.

Shailee Jain and Alexander Huth. 2018. Incorporating context into language encoding models for fMRI. bioRxiv, page 327601.

Hans Kamp and Uwe Reyle. 1993. From discourse to logic: introduction to modeltheoretic semantics of natural language, formal logic and discourse representation theory. Studies in Linguistics and Philosophy.

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. Similarity of neural network representations revisited. arXiv preprint arXiv:1905.00414.

Nikolaus Kriegeskorte, Marieke Mur, and Peter A Bandettini. 2008. Representational similarity analysis: connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2:4.

Nelson F Liu, Matt Gardner, Yonatan Belinkov, Matthew E Peters, and Noah A Smith. 2019. Linguistic knowledge and transferability of contextual representations. arXiv preprint arXiv:1903.08855.

Alessandro Lopopolo, Stefan L Frank, Antal Van den Bosch, and Roel M Willems. 2017. Using stochastic language models (SLM) to map lexical, syntactic, and phonological information processing in the brain. PloS One, 12(5):e0177794.

Christopher D Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. 2020. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences.

Rebecca Marvin and Tal Linzen. 2019. Targeted syntactic evaluation of language models. Proceedings of the Society for Computation in Linguistics, 2(1):373–374.

Tom M Mitchell, Svetlana V Shinkareva, Andrew Carlson, Kai-Min Chang, Vicente L Malave, Robert A Mason, and Marcel Adam Just. 2008. Predicting human brain activity associated with the meanings of nouns. Science, 320(5880):1191–1195.

Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4034–4043, Marseille, France. European Language Resources Association.

Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinková, Dan Flickinger, Jan Hajic, Angelina Ivanova, and Zdenka Urešová. 2016. Semantic dependency parsing (SDP) graph banks release 1.0 LDC2016T10. Web Download.

Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinková, Dan Flickinger, Jan Hajič, and Zdeňka Urešová. 2015. SemEval 2015 task 18: Broad-coverage semantic dependency parsing. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 915–926, Denver, Colorado. Association for Computational Linguistics.

Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Dan Flickinger, Jan Hajič, Angelina Ivanova, and Yi Zhang. 2014. SemEval 2014 task 8: Broad-coverage semantic dependency parsing. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 63–72, Dublin, Ireland. Association for Computational Linguistics.

Francisco Pereira, Bin Lou, Brianna Pritchett, Samuel Ritter, Samuel J Gershman, Nancy Kanwisher, Matthew Botvinick, and Evelina Fedorenko. 2018. Toward a universal decoder of linguistic meaning from brain activation. Nature Communications, 9(1):1–13.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Martin Schrimpf, Idan Blank, Greta Tuckute, Carina Kauf, Eghbal A. Hosseini, Nancy Kanwisher, Joshua Tenenbaum, and Evelina Fedorenko. 2020. Artificial neural networks accurately predict language processing in the brain. bioRxiv.

Dan Schwartz, Mariya Toneva, and Leila Wehbe. 2019. Inducing brain-relevant bias in natural language processing models. In Advances in Neural Information Processing Systems, pages 14123–14133.

Petr Sgall, Eva Hajicová, and Jarmila Panevová. 1986. The meaning of the sentence and its semantic and pragmatic aspects. Academia.

Natalia Silveira, Timothy Dozat, Marie-Catherine De Marneffe, Samuel R Bowman, Miriam Connor, John Bauer, and Christopher D Manning. 2014. A gold standard dependency corpus for English. In LREC, pages 2897–2904. Citeseer.

Emma Strubell and Andrew McCallum. 2018. Syntax helps ELMo understand semantics: Is syntax still relevant in a deep neural architecture for SRL? arXiv preprint arXiv:1811.04773.

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5027–5038.

Mariya Toneva and Leila Wehbe. 2019. Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain). In NeurIPS.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.

Shaonan Wang, Jiajun Zhang, Haiyan Wang, Nan Lin, and Chengqing Zong. 2020. Fine-grained neural decoding with distributed word representations. Information Sciences, 507:256–272.

Alex Warstadt and Samuel R Bowman. 2020. Can neural networks acquire a structural bias from raw linguistic data? arXiv preprint arXiv:2007.06761.

Leila Wehbe, Ashish Vaswani, Kevin Knight, and Tom M. Mitchell. 2014. Aligning context-based statistical models of language with brain activity during reading. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 233–243, Doha, Qatar. Association for Computational Linguistics.

Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in Statistics, pages 196–202. Springer.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Yue Zhang, Rui Wang, and Luo Si. 2019. Syntax-enhanced self-attention-based semantic role labeling. arXiv preprint arXiv:1910.11204.

A Injecting Structure by Predicting Parse
One way to encode structural information from each of these formalisms into language model representations is to directly optimize for the prediction of the formalism graphs, i.e., parsing. For DM and UCCA, we use the HIT-SCIR parser (Che et al., 2019), the best performing parser from the MRP 2019 shared task. For UD, we use the Køpsala parser (Hershcovich et al., 2020) from the EUD shared task, which is largely based on the HIT-SCIR one. Both are transition-based parsers, which fine-tune BERT during training: BERT takes in a sequence $S$ of $P$ wordpieces and outputs a sequence of contextualized token representations $[h_1, \ldots, h_P]$, which the parsers use as embeddings, fine-tuning the BERT model. Our assumption is that these representations are fine-tuned during parser training to better capture the linguistic distinctions made by each formalism. After fine-tuning on each formalism's respective corpus, we extract sentence and word representations for all fine-tuned models as described above. Each of the parsers' default hyperparameters are employed.

Pereira et al. (2018)
Model   Epoch 0   Epoch 1   Epoch 2
DM      0.278     0.201     0.167
UD      0.277     0.186     0.159
UCCA    0.277     0.189     0.161
PRE     –         –         –

Table 1: Brain decoding scores (Pearson's $r$) for each of the BERT models fine-tuned via parsing, and for the pretrained baseline (PRE). Note that the latter is not fine-tuned.

Results for the models fine-tuned via parsing show divergence in brain decoding performance. Indeed, we find that as parsing performance (as measured by unlabeled undirected attachment scores, UUAS) improves on the held-out development set, brain decoding performance declines. This finding is congruent with the results of Gauthier and Levy (2019), which show that fine-tuning on GLUE tasks (Wang et al., 2018) leads to a decline in brain decoding performance, until a ceiling point where it eventually stabilizes. In our experiments, after one epoch of fine-tuning, decoding performance is equivalent to that achieved by the pretrained model. However, with more fine-tuning, the models consistently diverge, as shown in Table 1. These results are averaged over two fine-tuning runs. Understanding the learning dynamics that lead to such divergence is an interesting avenue for future work.

B Mean/median rank results
Table 2 shows results for the Pearson's $r$ metric reported in the main paper, alongside the mean and median rank metrics reported in Gauthier and Levy (2019), which give the rank of a ground-truth sentence representation in the list of nearest neighbors of a predicted sentence representation, ordered by increasing cosine distance. This metric evaluates representations based on their support for contrasts between sentences/words which are relevant to the brain recordings. The table shows that the models which have higher Pearson's $r$ scores also have a lower average ground-truth word/sentence nearest-neighbour rank, i.e. they induce representations that better support contrasts between sentences/words which are relevant to the brain recordings.

Table 2: Brain decoding scores as measured via three metrics (Pearson's $r$, mean rank, and median rank) for each of the domain-finetuned baseline (DF-B) models, the guided attention models (GA), and the pretrained (PRE) model, on Pereira et al. (2018) and Wehbe et al. (2014). The rows per dataset are: DF-B DM, DF-B UD, DF-B UCCA, GA DM, GA UD, GA UCCA, and PRE.
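A sketch of the rank metric, assuming row-aligned matrices of predicted and ground-truth representations; the cosine-distance ordering follows the description above.

```python
import numpy as np

def rank_metrics(pred: np.ndarray, gold: np.ndarray):
    """pred, gold: row-aligned (n, d) matrices of predicted and ground-truth
    representations. Returns mean and median 1-based rank of each gold vector
    among the nearest neighbors of its prediction, by cosine distance."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gold = gold / np.linalg.norm(gold, axis=1, keepdims=True)
    dist = 1.0 - pred @ gold.T                    # (n, n) cosine distances
    true_dist = np.diag(dist)[:, None]            # distance to the true match
    ranks = (dist < true_dist).sum(axis=1) + 1    # closer competitors + 1
    return ranks.mean(), np.median(ranks)
```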
C Significance testing

Bootstrapping. The bootstrapping procedure is described below, and a code sketch follows this list. For each of the $m$ subjects:

1. There are $n$ stimuli sentences, corresponding to $n$ fMRI recordings. A linear decoder is trained to map each recording to its corresponding LM-extracted (PRE, DF-B, GA) sentence representation. This is done using 12-fold cross-validation. This yields a predicted 'sentence representation' per stimulus sentence.

2. To compensate for the small size of the dataset, which might lead to a noisy estimate of the linear decoder's performance, we randomly resample $n$ datapoints (with replacement) from the full $n$ datapoints.

3. For each resampling, our evaluation metrics (Pearson's $r$, mean rank, etc.) are computed between the sampled predictions and their corresponding 'gold representations', for all sets of LM representations. We store the mean metric value (e.g. Pearson's $r$ score) across the $n$ sampled datapoints. We run 5000 such iterations.

4. This gives us 5000 such paired mean (across the $n$ samples, that is) scores for all models.

5. When comparing two models, e.g. GA DM vs. PRE, to test our results for strength of evidence of generalization over stimuli, we compute the proportion of these 5000 paired samples where, e.g., GA DM's mean sample score is greater than PRE's. After Bonferroni correction for multiple hypothesis testing, this is the $p$-value we report. See Table 3 for these per-subject $p$-values for Pereira2018. For Wehbe2014, comparisons between each of the GA models and the pretrained baseline lead to the lowest attainable $p$-value (i.e. the GA model's mean score is greater than the pretrained baseline's mean score for all 5000 sets of paired samples), for all subjects; we therefore do not include a similar table.

6. We average over these 5000 samples per subject, and use these $m$ subject means for the across-subject significance testing, which is described below.
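Steps 2 to 5 amount to the following paired bootstrap, sketched here with NumPy; the array names and random seed are ours.

```python
import numpy as np

def paired_bootstrap_p(scores_a: np.ndarray, scores_b: np.ndarray,
                       iters: int = 5000, seed: int = 0) -> float:
    """Proportion of resamples (over stimuli, with replacement) in which
    model A's mean score does NOT exceed model B's; small values indicate
    that A > B generalizes over stimuli."""
    rng = np.random.default_rng(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(iters):
        idx = rng.integers(0, n, size=n)   # resample stimuli with replacement
        if scores_a[idx].mean() > scores_b[idx].mean():
            wins += 1
    return 1.0 - wins / iters
```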
Strength of generalization across subjects. To test our results for strength of generalization across subjects, we apply the Wilcoxon signed rank test (Wilcoxon, 1992) to the $m$ by-subject mean scores (see above), comparing the GA models to the pretrained baselines. Since $m = 8$ for both datasets, the lowest attainable $p$-value is the one obtained when every subject's difference score consistently favors the GA model over the baseline, or vice versa.

In the case of Pereira2018, we report the resulting $p$-values, before and after Bonferroni correction, for PRE vs. GA UD, PRE vs. GA DM, and PRE vs. GA UCCA; for the UCCA comparison, PRE > GA UCCA for all subjects. In the case of Wehbe2014, all comparisons yield the lowest attainable $p$-value (which remains significant after Bonferroni correction), where the GA model > the pretrained baseline.

Table 3: Per-subject $p$-values (subjects M02, M04, M07, M08, M09, M14, M15, and P01 of Pereira et al., 2018) resulting from the paired bootstrap test described above, for each of the three GA models (GA UD, GA DM, GA UCCA) when compared to the pretrained baseline.
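The across-subject test reduces to a single scipy call on the by-subject means. The score arrays below are hypothetical placeholders, not values from the paper, and the correction factor assumes the three per-dataset formalism comparisons.

```python
from scipy.stats import wilcoxon

# By-subject mean decoding scores; hypothetical placeholder values
# (m = 8 subjects), not values from the paper.
ga_means  = [0.21, 0.23, 0.19, 0.22, 0.20, 0.24, 0.18, 0.22]
pre_means = [0.18, 0.20, 0.17, 0.19, 0.18, 0.21, 0.16, 0.19]

stat, p = wilcoxon(ga_means, pre_means)
# Bonferroni: multiply by the number of comparisons (three GA models here).
print(p, min(1.0, p * 3))
```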
D The Domain effect

Table 4 shows average word perplexity scores for the pretrained model and the domain-finetuned models for each of the three text domains on the stimuli from Pereira2018 and Wehbe2014. Scores are averaged over the words in a sentence and over the sentences (stimuli) in the datasets.

Table 4: Average word perplexity scores for each of the domain-finetuned baseline (DF-B) models, the guided attention models (GA), and the pretrained (PRE) model, on the stimuli of Pereira et al. (2018) and Wehbe et al. (2014). The rows per dataset are: PRE, DF-B DM, DF-B UD, DF-B UCCA, GA DM, GA UD, and GA UCCA.
E Semantic Tagging
Probing details
Representations for the probing task are derived as described in §3.1 for each sentence in the development and testing sets from Abzianidze and Bos (2017). The development set is employed as a training set, because it is mostly manually annotated/corrected (as opposed to the much noisier training set) and because it is already possible to train rather accurate semantic taggers, which suffice for our analysis, with a training set of that size. We report results for the official test set. Table 5 shows the frequency of each semantic category we report scores for in the test set. An L2-regularised logistic regression model is utilised.
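A sketch of the probe with scikit-learn; the paper specifies only that the probe is an L2-regularised logistic regression, so the remaining hyperparameters here are library defaults.

```python
from sklearn.linear_model import LogisticRegression

def fit_semtag_probe(train_reps, train_tags):
    """train_reps: (n, d_H) contextual word vectors extracted as in Section 3.1;
    train_tags: their semantic tags. Hyperparameters beyond the L2 penalty
    are scikit-learn defaults, which the paper does not specify."""
    probe = LogisticRegression(penalty="l2", max_iter=1000)
    return probe.fit(train_reps, train_tags)
```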
Further discussion. We observe the largest improvements for the DISCOURSE and TEMPORAL categories. The former involves identifying subordinate, coordinate, appositional, and contrast relations. These relations are highly influenced by context, and correctly classifying them can often be contingent on longer dependencies, which the structural bias increases 'awareness' of. The TEMPORAL category, on the other hand, consists of tags such as clocktime or time of day, which are applied to multi-word expressions. Highlighting these dependencies by assigning more weight to the attention between their sub-parts is likely helpful for their accurate identification.

Category            Frequency
Attribute           63763
Unnamed Entity      48654
Logical             32973
Named Entity        29271
Event               25338
Tense and Aspect    15208
Discourse           9948
Temporal            5652

Table 5: Semantic category frequency in the test set.

F Targeted Syntactic Evaluation Scores
Figures 6 and 7 show the performance of the Pereira2018 and Wehbe2014 models and the four baselines for each of the syntactic categories from Marvin and Linzen (2019).

Figure 6: Targeted syntactic evaluation accuracy scores per category for Pereira2018 models.

Figure 7: Targeted syntactic evaluation accuracy scores per category for Wehbe2014 models.
G Content words and function words analysis
Figure 8 shows the breakdown of brain decoding accuracy by content and function words for Wehbe2014. We consider as content words those whose universal part-of-speech according to spaCy is one of the following: {ADJ, ADV, NOUN, PROPN, VERB, X, NUM}; the remaining words are counted as function words.

Figure 8: Content word and function word brain decoding score (mean Pearson's $r$) per formalism (DM, UD, UCCA), shown separately for content words and function words, alongside the pretrained and domain-finetuned baselines.
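A sketch of the content/function split, assuming spaCy's en_core_web_sm pipeline; only the part-of-speech set is taken from the paper.

```python
import spacy

# Part-of-speech set taken from Appendix G; everything else is illustrative.
CONTENT_POS = {"ADJ", "ADV", "NOUN", "PROPN", "VERB", "X", "NUM"}
nlp = spacy.load("en_core_web_sm")

def split_content_function(text: str):
    doc = nlp(text)
    content = [t.text for t in doc if t.pos_ in CONTENT_POS]
    function = [t.text for t in doc
                if t.pos_ not in CONTENT_POS and not t.is_punct]
    return content, function
```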