Deep Subjecthood: Higher-Order Grammatical Features in Multilingual BERT
Isabel Papadimitriou, Ethan A. Chi, Richard Futrell, Kyle Mahowald
Isabel Papadimitriou
Stanford University [email protected]
Ethan A. Chi
Stanford University [email protected]
Richard Futrell
University of California, Irvine [email protected]
Kyle Mahowald
University of California, Santa Barbara [email protected]
Abstract
We investigate how Multilingual BERT (mBERT) encodes grammar by examining how the high-order grammatical feature of morphosyntactic alignment (how different languages define what counts as a "subject") is manifested across the embedding spaces of different languages. To understand if and how morphosyntactic alignment affects contextual embedding spaces, we train classifiers to recover the subjecthood of mBERT embeddings in transitive sentences (which do not contain overt information about morphosyntactic alignment) and then evaluate them zero-shot on intransitive sentences (where subjecthood classification depends on alignment), within and across languages. We find that the resulting classifier distributions reflect the morphosyntactic alignment of their training languages. Our results demonstrate that mBERT representations are influenced by high-level grammatical features that are not manifested in any one input sentence, and that this is robust across languages. Further examining the characteristics that our classifiers rely on, we find that features such as passive voice, animacy and case strongly correlate with classification decisions, suggesting that mBERT does not encode subjecthood purely syntactically, but that subjecthood embedding is continuous and dependent on semantic and discourse factors, as is proposed in much of the functional linguistics literature. Together, these results provide insight into how grammatical features manifest in contextual embedding spaces, at a level of abstraction not covered by previous work.

We release the code to reproduce our experiments at https://github.com/toizzy/deep-subjecthood

Introduction

Our goal is to understand whether, and how, large pretrained models encode abstract features of the
Figure 1:
Top: Illustration of the difference between alignment systems. A (for agent) is notation used for the transitive subject, and O for the transitive object: "The lawyer chased the dog." S denotes the intransitive subject: "The lawyer laughed." The blue circle indicates which roles are marked as "subject" in each system.
Bottom: Illustration of the training and test process. We train a classifier to distinguish A from O arguments using the BERT contextual embeddings, and test the classifier's behavior on intransitive subjects (S). The resulting distribution reveals to what extent morphosyntactic alignment (above) affects model behavior.

grammars of languages. To do so, we analyze the notion of subjecthood in Multilingual BERT (mBERT) across diverse languages with different morphosyntactic alignments. Alignment (how each language defines what classifies as a "subject") is a feature of the grammar of a language, rather than of any single word or sentence, letting us analyze mBERT's representation of language-specific high-order grammatical properties.

Recent work has demonstrated that transformer models of language, such as BERT (Devlin et al., 2019), encode sentences in structurally meaningful ways (Manning et al., 2020; Rogers et al., 2020; Kovaleva et al., 2019; Linzen et al., 2016; Gulordava et al., 2018; Futrell et al., 2019; Wilcox et al., 2018). In Multilingual BERT, previous work has demonstrated surprising levels of multilingual and cross-lingual understanding (Pires et al., 2019; Wu and Dredze, 2019; Libovický et al., 2019; Chi et al., 2020), with some notable limitations (Mueller et al., 2020). However, these studies still leave an open question: are higher-order abstract grammatical features (features such as morphosyntactic alignment, which are not realized in any one sentence) accessible to deep neural models? And how are these allegedly discrete features represented in a continuous embedding space? Our goal is to answer these questions by examining grammatical subjecthood across typologically diverse languages.
In doing so, we complicate the traditional notion of the grammatical subject as a discrete category and provide evidence for a richer, probabilistic characterization of subjecthood.

For 24 languages, we train small classifiers to distinguish the mBERT embeddings of nouns that are subjects of transitive sentences from nouns that are objects. We then test these classifiers on out-of-domain examples within and across languages. We go beyond standard probing methods (which rely on classifier accuracy to make claims about embedding spaces) by (a) testing the classifiers out-of-domain to gain insights about the shape and characteristics of the subjecthood classification boundary and (b) testing for awareness of morphosyntactic alignment, which is a feature of the grammar rather than of the classifier inputs.

Our main experiments are as follows. In Experiment 1, we test our subjecthood classifiers on out-of-domain intransitive subjects (subjects of verbs which do not have objects, like "The man slept") in their training language. Whereas in English and many other languages, we think of intransitive subjects as grammatical subjects, some languages have a different morphosyntactic alignment system and treat intransitive subjects more like objects (Dixon, 1979; Du Bois, 1987). We find evidence that a language's alignment is represented in mBERT's embeddings. In Experiment 2, we perform successful zero-shot cross-linguistic transfer of our subject classifiers, finding that higher-order features of the grammar of each language are represented in a way that is parallel across languages. In Experiment 3, we characterize the basis for these classifier decisions by studying how they vary as a function of linguistic features like animacy, grammatical case, and the passive construction.

Taken together, the results of these experiments suggest that mBERT represents subjecthood and objecthood robustly and probabilistically.
Its representation is general enough that it can transfer across languages, but also language-specific enough that it learns language-specific abstract grammatical features.

Morphosyntactic Alignment

In transitive sentences, languages need a way of distinguishing which noun is the transitive subject (called A, for agent) and which noun is the transitive object (O). In English, this distinction is marked by word order: "The dog (A) chased the lawyer (O)" means something different than "the lawyer (A) chased the dog (O)". In other languages, this distinction is marked by a morphological feature: case. Case markings, usually affixes, are attached to nouns to indicate their role in the sentence, and as such in these languages word order is often much freer than in English.

Apart from A and O, there is also a third grammatical role: intransitive subjects (S). In sentences like "The lawyer laughed", there is no ambiguity as to who is doing the action. As such, cased languages usually do not reserve a third case to mark S nouns, and use either the A case or the O case. Languages that mark S nouns in the same way as A nouns are said to follow a Nominative–Accusative case system, where the nominative case is for A and S, and the accusative case is for O. Languages that mark S nouns like O nouns follow an Ergative–Absolutive system, where the ergative case is used to mark A nouns, and the absolutive case marks S and O nouns. For example, the Basque language follows this system. A visualization of the two case systems is shown in Figure 1.

The feature of whether a language follows a nominative-accusative or an ergative-absolutive system is called morphosyntactic alignment. Morphosyntactic alignment is a high-order grammatical feature of a language, which is not usually inferable from looking at just one sentence, but from

English pronouns follow a Nominative–Accusative system. For example, the pronoun "she" is nominative and is used both for A and S (as in "she laughed").
The pronoun "her" is accusative and is used only for O.

Figure 2: Results of Experiment 1: the behavior of subjecthood classifiers across mBERT layers (x-axis). For each layer, the proportion of the time that the classifier predicts arguments to be A, separated by grammatical role. In higher layers, A and O are reliably classified correctly, and S is mostly classified as A. When the source language is Basque (ergative) or Hindi or Urdu (split-ergative), S is less likely to pattern with A. The figure is ordered by how close the S line is to A, and ergative and split-ergative languages are highlighted with a gray box.

the system with which different sentences are encoded. As such, examining the way that individual contextual embeddings express morphosyntactic alignment gets to the question of how mBERT encodes abstract features of grammar. This is a question that is not answered by work that looks at the contextual encoding of the features that are realized in sentences, like part of speech or sentence structure.
Methods

Our primary method involves training classifiers to predict subjecthood from mBERT contextual embeddings, and examining the decisions of these classifiers within and across languages. We train a classifier to distinguish A from O in the mBERT embeddings of one language, and we examine its performance on S embeddings in its training language, and on A, S, and O mBERT embeddings in other languages.
Data
To train a subjecthood classifier for one language, we use a balanced dataset of 1,012 transitive subject (A) mBERT embeddings and 1,012 transitive object (O) mBERT embeddings. We test our classifiers on test datasets of A, S, and O embeddings. Our data points are extracted from the Universal Dependencies treebanks (Nivre et al., 2016): we use the dependency parse information to determine whether each noun is an A or an O, and if it is either we pass the whole sentence through mBERT and take the contextual embedding corresponding to the noun. We run experiments on 24 languages; specifically, all the languages that are both in the mBERT training set and have Universal Dependencies treebanks with at least 1,012 A occurrences and 1,012 O occurrences.

Labeling
Since UD treebanks are not labeled for sentence role (A, S and O), we extract these labels using the dependency graph annotations. We only include nouns and proper nouns, leaving pronouns for future work. We label a noun token as:

• O if it has a verb as a head and its dependency arc is either dobj or iobj.
• A if it has a verb as a head, its dependency arc is nsubj, and it has a sibling O.
• S if it has a verb as a head, its dependency arc is nsubj, and it has no sibling O.

Finally, we exclude the subjects of passive constructions (where the object of an action is made the grammatical subject) to analyze separately, as including these examples would confound grammatical subjecthood with semantic agency. We also exclude the siblings of expletives (e.g., "There are many goats"), as these are grammatical objects which appear without subjects as the only argument of the verb, and we also exclude the children of auxiliaries ("The goat can swim"), looking only at the arguments of verbs. Because we use embeddings and are limited by the Universal Dependencies annotation scheme, there are some cross-linguistic differences in how arguments are handled. For instance, our system is not able to handle null subjects or null objects, even though those are prominent parts of many languages.

https://github.com/google-research/bert/blob/master/multilingual.md

Our datasets for all languages are the same size. We have set them all to be the size of the largest balanced A–O dataset we can extract from the Basque UD corpus, since Basque is one of the only represented ergative languages and we wanted it to meet our cutoff.

Classifiers
For each language, and for each mBERT layer ℓ, we train a classifier to classify mBERT contextual embeddings drawn from layer ℓ as A or O. The classifiers are all two-layer perceptrons with one hidden layer of size 64. We train each classifier for 20 epochs on a dataset of the layer-ℓ contextual embeddings of 1,012 A nouns and 1,012 O nouns. In total, we train 24 languages × 13 mBERT layers = 312 total classifiers.
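The A/S/O labeling rules described above can be sketched as a small function over CoNLL-U-style tokens. This is our own illustration, not the authors' released code; the dict fields mirror CoNLL-U columns, and we accept both the older `dobj` and the current `obj` relation name.

```python
def label_role(token, sentence):
    """Return 'A', 'S', 'O', or None for a token in a dependency parse."""
    # Only nouns and proper nouns are labeled; pronouns are left out.
    if token["upos"] not in ("NOUN", "PROPN"):
        return None
    head = next((t for t in sentence if t["id"] == token["head"]), None)
    if head is None or head["upos"] != "VERB":
        return None
    if token["deprel"] in ("obj", "dobj", "iobj"):
        return "O"
    if token["deprel"] == "nsubj":
        # A if the governing verb also has an object dependent, else S.
        has_obj = any(
            t["head"] == head["id"] and t["deprel"] in ("obj", "dobj", "iobj")
            for t in sentence
        )
        return "A" if has_obj else "S"
    return None

# "The lawyer chased the dog": lawyer -> A, dog -> O.
sent = [
    {"id": 1, "form": "lawyer", "upos": "NOUN", "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "chased", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 3, "form": "dog", "upos": "NOUN", "head": 2, "deprel": "obj"},
]
print([label_role(t, sent) for t in sent])  # ['A', None, 'O']
```

The paper's additional exclusions (passive subjects, siblings of expletives, children of auxiliaries) would be further filters on top of these three rules.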
Experiment 1

In our first experiment, we train a classifier to predict the grammatical role of a noun in context from its mBERT contextual embedding, and examine its behavior on intransitive subjects (S), which are out-of-domain.

This experimental setup lets us ask two questions about subjecthood encoding in mBERT. Firstly, do contextual word embeddings reliably encode subjecthood information? Secondly, how do our classifiers act when given S arguments (intransitive subjects), which crucially do not appear

For an example of how pronouns complicate how subjecthood is defined, see Fox (1987).
Figure 3: Accuracy of A–O classifiers for every language, by mBERT layer. For all languages, accuracy is highest in layers 7–10.
Figure 4: Distribution of layer 10 classifier probabilities for S nouns in the test set. When trained on non-ergative languages, the classifiers mostly predict S nouns to be A. When trained on ergative and split-ergative languages, the classifier predictions for S are much more spread out (towards being classified as O), suggesting that the ergative nature of the languages is expressed in the contextual embeddings of the A and O nouns, influencing the classifier.

in the training data? If S arguments are mostly classified as A, that would suggest mBERT is learning a nominative-accusative system, where A and S pattern together. If S patterns with O, that would suggest it has an ergative-absolutive system. If S patterns differently in different languages, that would suggest that it learns a language-specific morphosyntactic system and expresses it in the encoding of nouns in transitive clauses (which are unaffected by alignment), so that the A–O classifiers can pick it up.
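The train-on-transitive, test-on-intransitive procedure can be sketched end to end. The 2-D Gaussian "embeddings" below are synthetic stand-ins for real mBERT vectors (placing S between A and O, nearer to A, is purely an illustrative assumption), and a logistic classifier stands in for the paper's two-layer perceptron to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for mBERT embeddings of A, O, and S nouns.
# The cluster geometry is an assumption made for illustration only.
n = 500
A = rng.normal([2.0, 2.0], 1.0, size=(n, 2))
O = rng.normal([-2.0, -2.0], 1.0, size=(n, 2))
S = rng.normal([1.0, 1.0], 1.0, size=(n, 2))

# Train only on transitive arguments: 1 = A, 0 = O.
X = np.vstack([A, O])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Logistic classifier trained by gradient descent (a simplified stand-in
# for the paper's two-layer perceptron with hidden size 64).
w, b = np.zeros(2), 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * float(np.mean(p - y))

def prob_A(points):
    """Classifier probability that each embedding is an A argument."""
    return 1.0 / (1.0 + np.exp(-(points @ w + b)))

# Zero-shot evaluation on out-of-domain S tokens: how often is S called A?
print(f"train accuracy: {np.mean((prob_A(X) > 0.5) == y):.2f}")
print(f"P(S labeled A): {np.mean(prob_A(S) > 0.5):.2f}")
```

In the paper, the inputs are layer-ℓ mBERT vectors and a classifier is retrained per language and per layer; what matters here is the shape of the experiment, fitting on A vs. O and inspecting behavior on S.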
Figure 5: For layer 10, the log odds ratio of S:A relative to O:A, by source language. This is a measure of how close S is to A, relative to O. The ergative languages skew lower than the others, although some other languages (like Finnish and Estonian) also skew low.

Our results show that the classifiers can reliably perform A–O classification of contextual embeddings with relatively high accuracy, especially in the higher layers of mBERT. As shown in Figure 3, performance peaks at around mBERT layers 7–10, where for the majority of languages classifier accuracy surpasses 90%. This is consistent with previous work showing that syntactic information is best represented in BERT's later middle layers (Rogers et al., 2020; Hewitt and Manning, 2019; Jawahar et al., 2019; Liu et al., 2019). For the rest of this paper, we will focus mainly on the behavior of the classifiers in the high-performance higher layers to assess the properties in these highly contextual spaces that define subjecthood within and across languages.

Performance across layers on the test sets of all 24 languages is shown in Figure 2. When we break the classifiers' behavior down across roles, we see that S nouns mostly pattern with A, though they are consistently less likely to be classed as A than transitive A nouns. The separation between the A and the S lines is not constant for all languages: it is the largest for Basque, which is an ergative language, and Hindi and Urdu, which have a split-ergative case system (De Hoop and Narasimhan, 2005). This difference is highlighted in Figure 4, where we show the classifiers' probabilities of classifying S nouns as A across the test sets of Basque, Hindi and Urdu versus the test sets for all other 21 languages. In Figure 5, we plot the log odds ratio of classifying S as A versus classifying S as O, and show that for ergative languages this is significantly lower.
The fact that classifiers trained on ergative and split-ergative languages are more likely to classify S nouns as O indicates that the ergativity of the language is encoded in the A and O embeddings that the classifiers are trained on.

Note, however, that the A–O classifiers for the ergative languages do not deduce a fully ergative system for classifying S nouns, but a greater skew towards classifying S as O than nominative languages. This suggests that, even though properties of ergativity are encoded in mBERT space, the prominence of nominative training languages has influenced the contextual space to be biased towards encoding a nominative subjecthood system. The difficulty of training the classifier in Basque seems consistent with Ravfogel et al. (2019)'s finding that learning agreement is harder in Basque than in English.

In Experiment 2, we test the zero-shot performance of these A–O classifiers across languages, to ask: is there a parallel, interlingual notion of subjecthood in mBERT contextual space, and do language-specific morphosyntactic alignment biases transfer interlingually?
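The statistic plotted in Figure 5 can be read as a difference of log odds. The formalization below is our reading of the caption, and the probabilities in the example are made up for illustration.

```python
import math

def log_odds(p):
    """Log odds of an event with probability p."""
    return math.log(p / (1.0 - p))

# Suppose a language's classifiers label S nouns as A 80% of the time,
# and (incorrectly) label O nouns as A 10% of the time. How close S is
# to A, relative to O, can then be scored as a difference of log odds.
p_S_as_A, p_O_as_A = 0.80, 0.10   # fabricated example values
score = log_odds(p_S_as_A) - log_odds(p_O_as_A)
print(round(score, 2))  # 3.58: S patterns far more with A than O does
```

Under this reading, an ergative source language would show a smaller (lower) score, since its classifiers push more S nouns toward O.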
Experiment 2

We can learn only so much about mBERT's general subjecthood representations by training and testing in the same language, since many languages in our data set have case-marking and therefore have surface forms that reflect their grammatical roles. To test whether representations of subjecthood in mBERT are language-general, we can do a similar analysis to Experiment 1 but with zero-shot cross-lingual transfer. That is, we train a classifier to distinguish A and O in Language X (just as in Experiment 1), but then we test in Language Y by seeing how the classifier classifies A, O, and S arguments in Language Y. By training a classifier on one language and testing on others, we can ask: is subjecthood encoded in parallel ways across languages in mBERT space?
If a classifier trained to distinguish A from O in a source language can then use the same parameters to successfully classify A from O in another language, this would indicate that the difference between A and O is encoded in similar ways in mBERT space for these two languages. Secondly, we can examine the classification of S nouns (which are out of domain for the classifiers) in the zero-shot cross-lingual setting. By observing the test behavior of classifiers on S nouns in other languages, we can ask: is morphosyntactic alignment expressed in cross-lingually generalizable and parallel ways in mBERT contextual embeddings?

Figure 6: Experiment 2 results: cross-lingual transfer accuracies (accuracies shown are for BERT layer 10). Top: For each classifier trained to distinguish A and O nouns in a source language (labeled on the x-axis), we plot the accuracy that classifier achieves when tested zero-shot on all other languages. Zero-shot transfer is surprisingly successful across languages, indicating that subjecthood is encoded cross-lingually in mBERT. Each black point represents the accuracy of a classifier tested on a particular destination language, and the red points represent the within-language accuracy. Bottom: Performance of classifiers for every language pair. The x-axis is sorted by average transfer accuracy, so that the source whose classifier performs the best on average is on the left. Despite the general English bias that mBERT often exhibits, in our experiments English is neither a standout source nor destination.
If a classifier trained to distinguish Basque A from O is more likely to classify English S nouns as O, this means that information about morphosyntactic alignment is encoded specifically enough to represent each language's alignment, but in a space that generalizes across languages.
Zero-shot transfer of subjecthood classification is effective across languages, as shown in Figure 6.
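The source-by-destination evaluation can be sketched as a small transfer matrix. The "languages" below are synthetic stand-ins (each contributes A/O clusters sharing roughly the same separating direction, mimicking a parallel cross-lingual encoding), and a nearest-centroid classifier replaces the paper's perceptron; the logic is the same: fit on one language's A/O embeddings, score zero-shot on every other's.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in "languages"; the codes and offsets are arbitrary.
def make_lang(shift):
    A = rng.normal([2.0 + shift, 2.0], 1.0, size=(200, 2))
    O = rng.normal([-2.0 + shift, -2.0], 1.0, size=(200, 2))
    X = np.vstack([A, O])
    y = np.concatenate([np.ones(200), np.zeros(200)])  # 1 = A, 0 = O
    return X, y

langs = {"en": make_lang(0.0), "eu": make_lang(0.5), "hi": make_lang(-0.5)}

# A nearest-centroid classifier stands in for the paper's perceptron.
def fit_centroids(X, y):
    return X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)

def accuracy(clf, X, y):
    c_A, c_O = clf
    pred = np.linalg.norm(X - c_A, axis=1) < np.linalg.norm(X - c_O, axis=1)
    return float(np.mean(pred == y))

# Source-by-destination transfer matrix, as in Figure 6 (bottom).
for src in langs:
    clf = fit_centroids(*langs[src])
    print(src, {dst: round(accuracy(clf, *langs[dst]), 2) for dst in langs})
```

High off-diagonal accuracies in such a matrix are what the paper takes as evidence that the A/O distinction occupies a parallel region of mBERT space across languages.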
Figure 7: Classifiers trained on ergative languages are more likely to label S nouns in other languages as O. For BERT layer 8, the proportion of S nouns in each destination language test set labeled as A for the classifiers trained on (1) ergative and split-ergative languages (blue) or (2) the rest of the languages.
The average accuracy across all source–destination pairs for a high-performing mBERT layer (layer 10) is 82.61%, and there are several pairs for which zero-shot transfer of the sentence role classifier yields accuracies above 90%. The consistent success of zero-shot transfer across different source and destination pairs indicates that mBERT has parallel, interlingual ways of encoding grammatical and semantic relations like subjecthood. We would expect there to be some extent of joint learning in mBERT: different languages wouldn't exist totally independently in the contextual embedding space, both due to mBERT's multilingual training texts and to successful regularization. It is nevertheless surprising that zero-shot transfer of subjecthood classification between languages is so successful out of the box, and that for all classifiers, within-language accuracy (the red dots in Figure 6) is not an outlier compared to transfer accuracies. Our results show not just that there is mutual entanglement between the contextual embedding spaces of many languages, but that syntactic and semantic information in these spaces is organized in largely parallel, transferable ways.

We can then look at how S is classified: does the subjecthood of S, and the degree of ergativity within each language that we saw expressed in Experiment 1, generalize across languages? Classifiers trained on ergative languages are significantly more likely to classify S nouns in other languages as O, as illustrated in Figure 7 (the source language's case system is a significant predictor of the probability of S being an agent, in a mixed-effects regression with a random intercept for language; β = . , t = 2 . , p < . ). Our results show that the ergative nature of these languages is encoded in the contextual embeddings of transitive nouns (where ergativity is not realized), and that this encoding of ergativity transfers coherently across languages.
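The shape of that mixed-effects analysis can be sketched with statsmodels. The data frame below is synthetic: the column names (`p_agent`, `ergative_source`, `dest_lang`), the built-in -0.20 effect, and all noise levels are our fabrications, not the paper's data. What it shows is only the structure of the test: a random intercept per destination language, with the source's case system as the fixed effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Fabricate P(S labeled A) observations with a -0.20 ergative-source
# effect and a small random intercept per destination language.
rows = []
for dest in ["en", "fr", "ru", "fi", "zh"]:
    dest_offset = rng.normal(0, 0.03)   # per-destination random intercept
    for ergative_source in (0, 1):
        for _ in range(20):
            p = (0.85 - 0.20 * ergative_source
                 + dest_offset + rng.normal(0, 0.05))
            rows.append({"p_agent": p,
                         "ergative_source": ergative_source,
                         "dest_lang": dest})
df = pd.DataFrame(rows)

# Mixed-effects regression: fixed effect of source-language ergativity,
# random intercept grouped by destination language.
model = smf.mixedlm("p_agent ~ ergative_source", df, groups=df["dest_lang"])
result = model.fit()
# The fixed effect should recover roughly the -0.20 effect we built in.
print(result.params["ergative_source"])
```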
Experiment 3

To explore the nature of mBERT's underlying representation of grammatical role, we ask which arguments are most likely to be classified as subjects or objects. This is of particular interest when the classifier gets it wrong: what kinds of subjects get erroneously classified as objects?

The functional linguistics literature offers insight into these questions. It has been frequently claimed that grammatical subjecthood is actually a multi-factor, probabilistic concept (Keenan and Comrie, 1977; Comrie, 1981; Croft, 2001; Hopper and Thompson, 1980) that cannot always be pinned down as a discrete category. Some subjects are more subject-y than others. Comrie (1988) argues that a subject can be thought of as the intersection of that which is high in agency (subjects do things) and topicality (subjects are the topics of sentences). Thus, in English, a prototypical subject is something like "He kicked the ball," since in such a sentence, the pronoun "he" is a clear agent and the topic of the sentence. But, in a sentence like "The lake, which Jack Frost visited, froze," the subject is still "lake." But it is less subject-y: it is not the clear topic of the sentence and it is not an agent.

A probabilistic notion of grammatical role lends itself naturally to the continuous embedding spaces of computational models. So, in a series of experiments, we explored what factors in mBERT contextual embedding space predict subjecthood. In these experiments, we examine how the decisions and probabilities of the A–O classifiers from Experiment 2 relate to other linguistic features known to contribute to the degree of subjecthood. In particular, we look at whether nouns appear in passive constructions, as well as the animacy and case of nouns.
In seeing how passives, animacy, and case interact with our subjecthood classifiers, we can assess if mBERT's representation of subjecthood in continuous space is consistent with functional analyses, and better understand the continuous space in which mBERT encodes syntactic and semantic relations.

Figure 8: Passive subjects are hard to classify. The distribution of average classifier probabilities in layer 10 for all source–destination language pairs, separated by role. While the layer 10 classifier separates A and S from O, passive subjects remain largely ambiguous in their classification. These plots indicate that, in mBERT space, the grammatical subjects of passive constructions are less subject-y.

We choose these three factors as they are well-studied in the functional literature, as well as readily available to extract from UD corpora. Passive subjects are marked with a separate dependency arc label, the animacy of nouns is annotated directly in some UD treebanks, and in case-marked languages, nouns are annotated with their case. Future work on a more complete examination of the functional nature of contextual embeddings would include other factors not readily available in UD, like the discourse and information structure (topicality) of nouns in context.
The first area that we look at is passive constructions. In passive constructions such as "The lawyer was chased by a cat", the grammatical subject is not the main actor or agent in the sentence. As such, while a purely syntactic analysis of subjecthood would classify passive subjects (S-passive) as subjects, an understanding of subjecthood as continuous and reliant on semantics would be more prone to classify passive subjects as objects. As shown in Figure 8, subjecthood classifiers across languages are ambivalent about how they classify passive subjects, even in layers where they have the acuity to successfully separate A and S from O. This indicates that the classifiers do not learn a purely syntactic separation of A and O: the subjecthood encoding that they learn from mBERT space is largely dependent on semantic information.

Figure 9: The influence of animacy on classification (within and across languages). For a high-performing layer (layer 10), the average probability of classifiers in all languages classifying nouns in languages with animacy distinctions as A. For all three roles, animates are more likely to be classified as agents. The labels are two-letter codes for the languages.
We also find that animacy is a strong predictor of subjecthood. Our results presented in Figure 9 demonstrate that when we control by role, animacy is a significant factor in determining the probability of being classified as A. Classifiers in all languages, when zero-shot evaluated on a corpus marked for animacy, are more likely to classify animate nouns as A than inanimate nouns. For layer 10, a mixed-effects regression predicting each destination language's probability of assigning an argument to being an agent shows that both role and animacy are significant predictors (with a main effect of animacy corresponding to a 16% increase in the probability of being an agent, p < . ). These results indicate that, in learning to separate A from O, the classifiers did not learn a purely syntactic separation of the space (though it is possible to distinguish A and O using only strictly structural syntactic features). Instead, we see that subjecthood information is entangled with semantic notions such as animacy, giving credence to the hypothesis that subjecthood in BERT space is encoded in a way concordant with the multi-factor manner proposed by Croft, Comrie, and others.

Lastly, we find that classifier probabilities also vary with case, even when we control for sentence role. As demonstrated in Figure 10, across grammatical roles, classifiers are significantly more likely to classify nouns as A if they are in more agentive cases (nominative and ergative). In a mixed-effects regression predicting layer 10 probability of being an agent based on role and whether the case is agentive (nominative/ergative), there was a 15% increase associated with being nominative/ergative across categories (t = 2 . , p < . ).

Figure 10: Average probability of being an agent, in layer 10, with 95% confidence intervals, for Finnish and Basque broken up by case.
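The Figure 9 breakdown (mean classifier P(agent) by role and animacy) reduces to a grouped average. The records below are fabricated stand-ins for real classifier outputs, chosen only to mirror the qualitative pattern the paper reports: animates score higher within every role.

```python
# Fabricated (role, animate, p_agent) classifier outputs for illustration.
records = [
    ("A", True, 0.95), ("A", False, 0.82), ("A", True, 0.93), ("A", False, 0.78),
    ("S", True, 0.85), ("S", False, 0.66), ("S", True, 0.88), ("S", False, 0.70),
    ("O", True, 0.25), ("O", False, 0.08), ("O", True, 0.30), ("O", False, 0.12),
]

def mean_p(role, animate):
    """Mean P(agent) for one (role, animacy) cell."""
    vals = [p for r, a, p in records if r == role and a == animate]
    return sum(vals) / len(vals)

# Controlling for role, compare animate vs. inanimate nouns.
for role in ("A", "S", "O"):
    hi, lo = mean_p(role, True), mean_p(role, False)
    print(f"{role}: animate {hi:.2f}, inanimate {lo:.2f}, diff {hi - lo:+.2f}")
```

The paper quantifies this with a mixed-effects regression (a roughly 16% animacy main effect); the grouped means here only show the controlled-by-role comparison.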
Discussion

Our experimental results constitute a way to begin understanding how general knowledge of grammar is manifested in contextual embedding spaces, and how discrete categories like subjecthood are reconciled in continuous embedding spaces. While most previous work analyzing large contextual models focuses on extracting their analysis of features or structures present in specific inputs, we focus on morphosyntactic alignment, a feature of grammars that is not explicitly realized in any one sentence. We find that, when tested out of domain, classifiers trained to predict transitive subjecthood in mBERT contextual space robustly demonstrate decisions which reflect (a) the morphosyntactic alignment of their training language and (b) continuous encoding of subjecthood influenced by semantic properties.

There has been much recent work pointing out the limitations of the probing methodology for analyzing embedding spaces (Voita and Titov, 2020; Pimentel et al., 2020; Hewitt and Liang, 2019), a methodology that is very similar to ours. The main limitation pointed out in this literature is that the power of classifiers is a confounding variable: we can't know if a classifier's encoding of a feature is due to the feature being encoded in BERT space, or to the classifier figuring out the feature from surface encoding. In this paper, we address these issues by proposing two ways to use classifiers to analyze embedding spaces that go beyond probing, and avoid the limitations of arguments based only around the accuracy of probes. Firstly, our results rely on testing the classifiers on out-of-domain zero-shot transfer: both to S arguments and to different languages. As such, we focus on linguistically defining the type of classification boundary which our classifiers learn from mBERT space, rather than their accuracy, and in using transfer we avoid many of the limitations of probing, as argued in Papadimitriou and Jurafsky (2020).
Secondly, we examine a feature (morphosyntactic alignment) that is not inferable from the classifiers' training data, which consist only of transitive sentences. We are asking whether mBERT contextual space is organized in a way that encodes the effects of morphosyntactic alignment for tokens that do not themselves express alignment. Especially in the cross-lingual case, a classifier could not spuriously deduce this from the surface form, whatever its power.

A limitation of our experimental setup is that both our Universal Dependencies training data and the set of mBERT training languages are heavily weighted towards nominative-accusative languages. As such, we see a clear nominative-accusative bias in mBERT, and our results are somewhat noisy, as we only have one ergative-absolutive language and two semi-ergative languages. Future work should examine the effects of balanced joint training between nominative-accusative and ergative-absolutive languages on the contextual embedding of subjecthood. And we hope that future work will continue to ask not just whether deep neural models of language represent discrete linguistic features, but how they represent them probabilistically.

Acknowledgments
We thank Dan Jurafsky, Jesse Mu, and Ben Newman for helpful comments. This work was supported by NSF grant
References
Ethan A. Chi, John Hewitt, and Christopher D. Manning. 2020. Finding universal grammatical relations in multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5564–5577, Online. Association for Computational Linguistics.

Bernard Comrie. 1981. Language Universals and Linguistic Typology, 1st edition. University of Chicago Press, Chicago.

Bernard Comrie. 1988. Linguistic typology. Annual Review of Anthropology, 17:145–159.

William Croft. 2001. Radical Construction Grammar: Syntactic Theory in Typological Perspective. Oxford University Press.

Helen De Hoop and Bhuvana Narasimhan. 2005. Differential case-marking in Hindi. In Competition and Variation in Natural Languages, pages 321–345. Elsevier.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Robert M. W. Dixon. 1979. Ergativity. Language, pages 59–138.

John W. Du Bois. 1987. The discourse basis of ergativity. Language, pages 805–855.

Barbara A. Fox. 1987. The noun phrase accessibility hierarchy reinterpreted: Subject primacy or the absolutive hypothesis? Language, pages 856–870.

Richard Futrell, Ethan Wilcox, Takashi Morita, Peng Qian, Miguel Ballesteros, and Roger Levy. 2019. Neural language models as psycholinguistic subjects: Representations of syntactic state. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 32–42, Minneapolis, Minnesota. Association for Computational Linguistics.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, New Orleans, Louisiana. Association for Computational Linguistics.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Paul J. Hopper and Sandra A. Thompson. 1980. Transitivity in grammar and discourse. Language, pages 251–299.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.

Edward L. Keenan and Bernard Comrie. 1977. Noun phrase accessibility and universal grammar. Linguistic Inquiry, 8(1):63–99.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China. Association for Computational Linguistics.

Jindřich Libovický, Rudolf Rosa, and Alexander Fraser. 2019. How language-neutral is multilingual BERT? arXiv preprint arXiv:1911.03310.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094, Minneapolis, Minnesota. Association for Computational Linguistics.

Christopher D. Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. 2020. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences.

Aaron Mueller, Garrett Nicolai, Panayiota Petrou-Zeniou, Natalia Talmina, and Tal Linzen. 2020. Cross-linguistic syntactic evaluation of word prediction models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5523–5539.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666.

Isabel Papadimitriou and Dan Jurafsky. 2020. Learning music helps you read: Using transfer to study linguistic structure in language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6829–6839.

Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. 2020. Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4609–4622.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Shauli Ravfogel, Yoav Goldberg, and Tal Linzen. 2019. Studying the inductive biases of RNNs with synthetic variations of natural languages. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3532–3542, Minneapolis, Minnesota. Association for Computational Linguistics.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.

Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 183–196.

Ethan Wilcox, Roger Levy, Takashi Morita, and Richard Futrell. 2018. What do RNN language models learn about filler–gap dependencies? In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 211–221, Brussels, Belgium. Association for Computational Linguistics.

Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In