Deep Subjecthood: Higher-Order Grammatical Features in Multilingual BERT
Isabel Papadimitriou, Ethan A. Chi, Richard Futrell, Kyle Mahowald
Isabel Papadimitriou
Stanford University [email protected]
Ethan A. Chi
Stanford University [email protected]
Richard Futrell
University of California, Irvine [email protected]
Kyle Mahowald
University of California, Santa Barbara [email protected]
Abstract
We investigate how Multilingual BERT (mBERT) encodes grammar by examining how the high-order grammatical feature of morphosyntactic alignment (how different languages define what counts as a "subject") is manifested across the embedding spaces of different languages. To understand if and how morphosyntactic alignment affects contextual embedding spaces, we train classifiers to recover the subjecthood of mBERT embeddings in transitive sentences (which do not contain overt information about morphosyntactic alignment) and then evaluate them zero-shot on intransitive sentences (where subjecthood classification depends on alignment), within and across languages. We find that the resulting classifier distributions reflect the morphosyntactic alignment of their training languages. Our results demonstrate that mBERT representations are influenced by high-level grammatical features that are not manifested in any one input sentence, and that this is robust across languages. Further examining the characteristics that our classifiers rely on, we find that features such as passive voice, animacy and case strongly correlate with classification decisions, suggesting that mBERT does not encode subjecthood purely syntactically, but that subjecthood embedding is continuous and dependent on semantic and discourse factors, as is proposed in much of the functional linguistics literature. Together, these results provide insight into how grammatical features manifest in contextual embedding spaces, at a level of abstraction not covered by previous work.

We release the code to reproduce our experiments at https://github.com/toizzy/deep-subjecthood

Introduction

Our goal is to understand whether, and how, large pretrained models encode abstract features of the
Figure 1:
Top: Illustration of the difference between alignment systems. A (for agent) is notation used for the transitive subject, and O for the transitive object: "The lawyer chased the dog." S denotes the intransitive subject: "The lawyer laughed." The blue circle indicates which roles are marked as "subject" in each system.
Bottom: Illustration of the training and test process. We train a classifier to distinguish A from O arguments using the BERT contextual embeddings, and test the classifier's behavior on intransitive subjects (S). The resulting distribution reveals to what extent morphosyntactic alignment (above) affects model behavior.

grammars of languages. To do so, we analyze the notion of subjecthood in Multilingual BERT (mBERT) across diverse languages with different morphosyntactic alignments. Alignment (how each language defines what classifies as a "subject") is a feature of the grammar of a language, rather than of any single word or sentence, letting us analyze mBERT's representation of language-specific high-order grammatical properties.

Recent work has demonstrated that transformer models of language, such as BERT (Devlin et al., 2019), encode sentences in structurally meaningful ways (Manning et al., 2020; Rogers et al., 2020; Kovaleva et al., 2019; Linzen et al., 2016; Gulordava et al., 2018; Futrell et al., 2019; Wilcox et al., 2018). In Multilingual BERT, previous work has demonstrated surprising levels of multilingual and cross-lingual understanding (Pires et al., 2019; Wu and Dredze, 2019; Libovický et al., 2019; Chi et al., 2020), with some notable limitations (Mueller et al., 2020). However, these studies still leave an open question: are higher-order abstract grammatical features (features such as morphosyntactic alignment, which are not realized in any one sentence) accessible to deep neural models? And how are these allegedly discrete features represented in a continuous embedding space? Our goal is to answer these questions by examining grammatical subjecthood across typologically diverse languages.
In doing so, we complicate the traditional notion of the grammatical subject as a discrete category and provide evidence for a richer, probabilistic characterization of subjecthood.

For 24 languages, we train small classifiers to distinguish the mBERT embeddings of nouns that are subjects of transitive sentences from nouns that are objects. We then test these classifiers on out-of-domain examples within and across languages. We go beyond standard probing methods (which rely on classifier accuracy to make claims about embedding spaces) by (a) testing the classifiers out-of-domain to gain insights about the shape and characteristics of the subjecthood classification boundary and (b) testing for awareness of morphosyntactic alignment, which is a feature of the grammar rather than of the classifier inputs.

Our main experiments are as follows. In Experiment 1, we test our subjecthood classifiers on out-of-domain intransitive subjects (subjects of verbs which do not have objects, like "The man slept") in their training language. Whereas in English and many other languages, we think of intransitive subjects as grammatical subjects, some languages have a different morphosyntactic alignment system and treat intransitive subjects more like objects (Dixon, 1979; Du Bois, 1987). We find evidence that a language's alignment is represented in mBERT's embeddings. In Experiment 2, we perform successful zero-shot cross-linguistic transfer of our subject classifiers, finding that higher-order features of the grammar of each language are represented in a way that is parallel across languages. In Experiment 3, we characterize the basis for these classifier decisions by studying how they vary as a function of linguistic features like animacy, grammatical case, and the passive construction.

Taken together, the results of these experiments suggest that mBERT represents subjecthood and objecthood robustly and probabilistically.
Its representation is general enough that it can transfer across languages, but also language-specific enough that it learns language-specific abstract grammatical features.

Morphosyntactic Alignment

In transitive sentences, languages need a way of distinguishing which noun is the transitive subject (called A, for agent) and which noun is the transitive object (O). In English, this distinction is marked by word order: "The dog (A) chased the lawyer (O)" means something different than "the lawyer (A) chased the dog (O)". In other languages, this distinction is marked by a morphological feature: case. Case markings, usually affixes, are attached to nouns to indicate their role in the sentence, and as such in these languages word order is often much freer than in English.

Apart from A and O, there is also a third grammatical role: intransitive subjects (S). In sentences like "The lawyer laughed", there is no ambiguity as to who is doing the action. As such, cased languages usually do not reserve a third case to mark S nouns, and use either the A case or the O case. Languages that mark S nouns in the same way as A nouns are said to follow a Nominative–Accusative case system, where the nominative case is for A and S, and the accusative case is for O. Languages that mark S nouns like O nouns follow an Ergative–Absolutive system, where the ergative case is used to mark A nouns, and the absolutive case marks S and O nouns. For example, the Basque language follows this system. A visualization of the two case systems is shown in Figure 1.

The feature of whether a language follows a nominative-accusative or an ergative-absolutive system is called morphosyntactic alignment. Morphosyntactic alignment is a high-order grammatical feature of a language, which is not usually inferable from looking at just one sentence, but from

English pronouns follow a Nominative–Accusative system. For example, the pronoun "she" is nominative and is used both for A and S (as in "she laughed").
The pronoun "her" is accusative and is used only for O.

Figure 2: Results of Experiment 1: the behavior of subjecthood classifiers across mBERT layers (x-axis). For each layer, the proportion of the time that the classifier predicts arguments to be A, separated by grammatical role. In higher layers, A and O are reliably classified correctly, and S is mostly classified as A. When the source language is Basque (ergative) or Hindi or Urdu (split-ergative), S is less likely to pattern with A. The figure is ordered by how close the S line is to A, and ergative and split-ergative languages are highlighted with a gray box.

the system with which different sentences are encoded. As such, examining the way that individual contextual embeddings express morphosyntactic alignment gets to the question of how mBERT encodes abstract features of grammar. This is a question that is not answered by work that looks at the contextual encoding of the features that are realized in sentences, like part of speech or sentence structure.
Methods

Our primary method involves training classifiers to predict subjecthood from mBERT contextual embeddings, and examining the decisions of these classifiers within and across languages. We train a classifier to distinguish A from O in the mBERT embeddings of one language, and we examine its performance on S embeddings in its training language, and on A, S, and O mBERT embeddings in other languages.
Data
To train a subjecthood classifier for one language, we use a balanced dataset of 1,012 transitive subject (A) mBERT embeddings and 1,012 transitive object (O) mBERT embeddings. We test our classifiers on test datasets of A, S, and O embeddings. Our data points are extracted from the Universal Dependencies treebanks (Nivre et al., 2016): we use the dependency parse information to determine whether each noun is an A or an O, and if it is either we pass the whole sentence through mBERT and take the contextual embedding corresponding to the noun. We run experiments on 24 languages; specifically, all the languages that are both in the mBERT training set and have Universal Dependencies treebanks with at least 1,012 A occurrences and 1,012 O occurrences.

Labeling
Since UD treebanks are not labeled for sentence role (A, S and O), we extract these labels using the dependency graph annotations. We only include nouns and proper nouns, leaving pronouns for future work. We label a noun token as:

• O if it has a verb as a head and its dependency arc is either dobj or iobj.
• A if it has a verb as a head, its dependency arc is nsubj, and it has a sibling O.
• S if it has a verb as a head, its dependency arc is nsubj, and it has no sibling O.

Finally, we exclude the subjects of passive constructions (where the object of an action is made the grammatical subject) to analyze separately, as including these examples would confound grammatical subjecthood with semantic agency. We also exclude the siblings of expletives (e.g., "There are many goats"), as these are grammatical objects which appear without subjects as the only argument of the verb, and we also exclude the children of auxiliaries ("The goat can swim"), looking only at the arguments of verbs. Because we use embeddings and are limited by the Universal Dependencies annotation scheme, there are some cross-linguistic differences in how arguments are handled. For instance, our system is not able to handle null subjects or null objects, even though those are prominent parts of many languages.

https://github.com/google-research/bert/blob/master/multilingual.md

Our datasets for all languages are the same size. We have set them all to be the size of the largest balanced A–O dataset we can extract from the Basque UD corpus, since Basque is one of the only represented ergative languages and we wanted it to meet our cutoff.

Classifiers
For each language, and for each mBERT layer ℓ, we train a classifier to classify mBERT contextual embeddings drawn from layer ℓ as A or O. The classifiers are all two-layer perceptrons with one hidden layer of size 64. We train each classifier for 20 epochs on a dataset of the layer-ℓ contextual embeddings of 1,012 A nouns and 1,012 O nouns. In total, we train 24 languages × 13 mBERT layers = 312 total classifiers.
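The A/S/O labeling rules described above can be sketched as a small function over CoNLL-U-style tokens. This is our own illustration, not the authors' released code; the dict fields mirror CoNLL-U columns, and we accept both the older `dobj` and the current `obj` relation name.

```python
def label_role(token, sentence):
    """Return 'A', 'S', 'O', or None for a token in a dependency parse."""
    # Only nouns and proper nouns are labeled; pronouns are left out.
    if token["upos"] not in ("NOUN", "PROPN"):
        return None
    head = next((t for t in sentence if t["id"] == token["head"]), None)
    if head is None or head["upos"] != "VERB":
        return None
    if token["deprel"] in ("obj", "dobj", "iobj"):
        return "O"
    if token["deprel"] == "nsubj":
        # A if the governing verb also has an object dependent, else S.
        has_obj = any(
            t["head"] == head["id"] and t["deprel"] in ("obj", "dobj", "iobj")
            for t in sentence
        )
        return "A" if has_obj else "S"
    return None

# "The lawyer chased the dog": lawyer -> A, dog -> O.
sent = [
    {"id": 1, "form": "lawyer", "upos": "NOUN", "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "chased", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 3, "form": "dog", "upos": "NOUN", "head": 2, "deprel": "obj"},
]
print([label_role(t, sent) for t in sent])  # ['A', None, 'O']
```

The paper's additional exclusions (passive subjects, siblings of expletives, children of auxiliaries) would be further filters on top of these three rules.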
Experiment 1

In our first experiment, we train a classifier to predict the grammatical role of a noun in context from its mBERT contextual embedding, and examine its behavior on intransitive subjects (S), which are out-of-domain.

This experimental setup lets us ask two questions about subjecthood encoding in mBERT. Firstly, do contextual word embeddings reliably encode subjecthood information? Secondly, how do our classifiers act when given S arguments (intransitive subjects), which crucially do not appear

For an example of how pronouns complicate how subjecthood is defined, see Fox (1987).
Figure 3: Accuracy of A–O classifiers for every language, by mBERT layer. For all languages, accuracy is highest in layers 7–10.
Figure 4: Distribution of layer 10 classifier probabilities for S nouns in the test set. When trained on non-ergative languages, the classifiers mostly predict S nouns to be A. When trained on ergative and split-ergative languages, the classifier predictions for S are much more spread out (towards being classified as O), suggesting that the ergative nature of the languages is expressed in the contextual embeddings of the A and O nouns, influencing the classifier.

in the training data? If S arguments are mostly classified as A, that would suggest mBERT is learning a nominative-accusative system, where A and S pattern together. If S patterns with O, that would suggest it has an ergative-absolutive system. If S patterns differently in different languages, that would suggest that it learns a language-specific morphosyntactic system and expresses it in the encoding of nouns in transitive clauses (which are unaffected by alignment), so that the A–O classifiers can pick it up.
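The train-on-transitive, test-on-intransitive procedure can be sketched end to end. The 2-D Gaussian "embeddings" below are synthetic stand-ins for real mBERT vectors (placing S between A and O, nearer to A, is purely an illustrative assumption), and a logistic classifier stands in for the paper's two-layer perceptron to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for mBERT embeddings of A, O, and S nouns.
# The cluster geometry is an assumption made for illustration only.
n = 500
A = rng.normal([2.0, 2.0], 1.0, size=(n, 2))
O = rng.normal([-2.0, -2.0], 1.0, size=(n, 2))
S = rng.normal([1.0, 1.0], 1.0, size=(n, 2))

# Train only on transitive arguments: 1 = A, 0 = O.
X = np.vstack([A, O])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Logistic classifier trained by gradient descent (a simplified stand-in
# for the paper's two-layer perceptron with hidden size 64).
w, b = np.zeros(2), 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * float(np.mean(p - y))

def prob_A(points):
    """Classifier probability that each embedding is an A argument."""
    return 1.0 / (1.0 + np.exp(-(points @ w + b)))

# Zero-shot evaluation on out-of-domain S tokens: how often is S called A?
print(f"train accuracy: {np.mean((prob_A(X) > 0.5) == y):.2f}")
print(f"P(S labeled A): {np.mean(prob_A(S) > 0.5):.2f}")
```

In the paper, the inputs are layer-ℓ mBERT vectors and a classifier is retrained per language and per layer; what matters here is the shape of the experiment, fitting on A vs. O and inspecting behavior on S.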
Figure 5: For layer 10, the log odds ratio of S:A relative to O:A, by source language. This is a measure of how close S is to A, relative to O. The ergative languages skew lower than the others, although some other languages (like Finnish and Estonian) also skew low.

Our results show that the classifiers can reliably perform A–O classification of contextual embeddings with relatively high accuracy, especially in the higher layers of mBERT. As shown in Figure 3, performance peaks at around mBERT layers 7–10, where for the majority of languages classifier accuracy surpasses 90%. This is consistent with previous work showing that syntactic information is best represented in BERT's later middle layers (Rogers et al., 2020; Hewitt and Manning, 2019; Jawahar et al., 2019; Liu et al., 2019). For the rest of this paper, we will focus mainly on the behavior of the classifiers in the high-performance higher layers to assess the properties in these highly contextual spaces that define subjecthood within and across languages.

Performance across layers on the test sets of all 24 languages is shown in Figure 2. When we break the classifiers' behavior down across roles, we see that S nouns mostly pattern with A, though they are consistently less likely to be classed as A than transitive A nouns. The separation between the A and the S lines is not constant for all languages: it is the largest for Basque, which is an ergative language, and Hindi and Urdu, which have a split-ergative case system (De Hoop and Narasimhan, 2005). This difference is highlighted in Figure 4, where we show the classifiers' probabilities of classifying S nouns as A across the test sets of Basque, Hindi and Urdu versus the test sets for all other 21 languages. In Figure 5, we plot the log odds ratio of classifying S as A versus classifying S as O, and show that for ergative languages this is significantly lower.
The fact that classifiers trained on ergative and split-ergative languages are more likely to classify S nouns as O indicates that the ergativity of the language is encoded in the A and O embeddings that the classifiers are trained on.

Note, however, that the A–O classifiers for the ergative languages do not deduce a fully ergative system for classifying S nouns, but a greater skew towards classifying S as O than nominative languages. This suggests that, even though properties of ergativity are encoded in mBERT space, the prominence of nominative training languages has influenced the contextual space to be biased towards encoding a nominative subjecthood system. The difficulty of training the classifier in Basque seems consistent with Ravfogel et al. (2019)'s finding that learning agreement is harder in Basque than in English.

In Experiment 2, we test the zero-shot performance of these A–O classifiers across languages, to ask: is there a parallel, interlingual notion of subjecthood in mBERT contextual space, and do language-specific morphosyntactic alignment biases transfer interlingually?
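The statistic plotted in Figure 5 can be read as a difference of log odds. The formalization below is our reading of the caption, and the probabilities in the example are made up for illustration.

```python
import math

def log_odds(p):
    """Log odds of an event with probability p."""
    return math.log(p / (1.0 - p))

# Suppose a language's classifiers label S nouns as A 80% of the time,
# and (incorrectly) label O nouns as A 10% of the time. How close S is
# to A, relative to O, can then be scored as a difference of log odds.
p_S_as_A, p_O_as_A = 0.80, 0.10   # fabricated example values
score = log_odds(p_S_as_A) - log_odds(p_O_as_A)
print(round(score, 2))  # 3.58: S patterns far more with A than O does
```

Under this reading, an ergative source language would show a smaller (lower) score, since its classifiers push more S nouns toward O.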
Experiment 2

We can learn only so much about mBERT's general subjecthood representations by training and testing in the same language, since many languages in our data set have case-marking and therefore have surface forms that reflect their grammatical roles. To test whether representations of subjecthood in mBERT are language-general, we can do a similar analysis to Experiment 1 but with zero-shot cross-lingual transfer. That is, we train a classifier to distinguish A and O in Language X (just as in Experiment 1), but then we test in Language Y by seeing how the classifier classifies A, O, and S arguments in Language Y. By training a classifier on one language and testing on others, we can ask: is subjecthood encoded in parallel ways across languages in mBERT space?
If a classifier trained to distinguish A from O in a source language can then use the same parameters to successfully classify A from O in another language, this would indicate that the difference between A and O is encoded in similar ways in mBERT space for these two languages. Secondly, we can examine the classification of S nouns (which are out of domain for the classifiers) in the zero-shot cross-lingual setting. By observing the test behavior of classifiers on S nouns in other languages, we can ask: is morphosyntactic alignment expressed in cross-lingually generalizable and parallel ways in mBERT contextual embeddings?

Figure 6: Experiment 2 results: cross-lingual transfer accuracies (accuracies shown are for BERT layer 10). Top: For each classifier trained to distinguish A and O nouns in a source language (labeled on the x-axis), we plot the accuracy that classifier achieves when tested zero-shot on all other languages. Zero-shot transfer is surprisingly successful across languages, indicating that subjecthood is encoded cross-lingually in mBERT. Each black point represents the accuracy of a classifier tested on a particular destination language, and the red points represent the within-language accuracy. Bottom: Performance of classifiers for every language pair. The x-axis is sorted by average transfer accuracy, so that the source whose classifier performs the best on average is on the left. Despite the general English bias that mBERT often exhibits, in our experiments English is neither a standout source nor destination.
If a classifier trained to distinguish Basque A from O is more likely to classify English S nouns as O, this means that information about morphosyntactic alignment is encoded specifically enough to represent each language's alignment, but in a space that generalizes across languages.
Zero-shot transfer of subjecthood classification is effective across languages, as shown in Figure 6.
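The source-by-destination evaluation can be sketched as a small transfer matrix. The "languages" below are synthetic stand-ins (each contributes A/O clusters sharing roughly the same separating direction, mimicking a parallel cross-lingual encoding), and a nearest-centroid classifier replaces the paper's perceptron; the logic is the same: fit on one language's A/O embeddings, score zero-shot on every other's.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in "languages"; the codes and offsets are arbitrary.
def make_lang(shift):
    A = rng.normal([2.0 + shift, 2.0], 1.0, size=(200, 2))
    O = rng.normal([-2.0 + shift, -2.0], 1.0, size=(200, 2))
    X = np.vstack([A, O])
    y = np.concatenate([np.ones(200), np.zeros(200)])  # 1 = A, 0 = O
    return X, y

langs = {"en": make_lang(0.0), "eu": make_lang(0.5), "hi": make_lang(-0.5)}

# A nearest-centroid classifier stands in for the paper's perceptron.
def fit_centroids(X, y):
    return X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)

def accuracy(clf, X, y):
    c_A, c_O = clf
    pred = np.linalg.norm(X - c_A, axis=1) < np.linalg.norm(X - c_O, axis=1)
    return float(np.mean(pred == y))

# Source-by-destination transfer matrix, as in Figure 6 (bottom).
for src in langs:
    clf = fit_centroids(*langs[src])
    print(src, {dst: round(accuracy(clf, *langs[dst]), 2) for dst in langs})
```

High off-diagonal accuracies in such a matrix are what the paper takes as evidence that the A/O distinction occupies a parallel region of mBERT space across languages.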
Figure 7: Classifiers trained on ergative languages are more likely to label S nouns in other languages as O. For BERT layer 8, the proportion of S nouns in each destination language test set labeled as A for the classifiers trained on (1) ergative and split-ergative languages (blue) or (2) the rest of the languages.
The average accuracy across all source–destination pairs for a high-performing mBERT layer (layer 10) is 82.61%, and there are several pairs for which zero-shot transfer of the sentence role classifier yields accuracies above 90%. The consistent success of zero-shot transfer across different source and destination pairs indicates that mBERT has parallel, interlingual ways of encoding grammatical and semantic relations like subjecthood. We would expect there to be some extent of joint learning in mBERT: different languages wouldn't exist totally independently in the contextual embedding space, both due to mBERT's multilingual training texts and to successful regularization. It is nevertheless surprising that zero-shot transfer of subjecthood classification between languages is so successful out of the box, and that for all classifiers, within-language accuracy (the red dots in Figure 6) is not an outlier compared to transfer accuracies. Our results show not just that there is mutual entanglement between the contextual embedding spaces of many languages, but that syntactic and semantic information in these spaces is organized in largely parallel, transferable ways.

We can then look at how S is classified: does the subjecthood of S, and the degree of ergativity within each language that we saw expressed in Experiment 1, generalize across languages? Classifiers trained on ergative languages are significantly more likely to classify S nouns in other languages as O, as illustrated in Figure 7 (the source language's case system is a significant predictor of the probability of S being an agent, in a mixed-effects regression with a random intercept for language; β = . , t = 2 . , p < . ). Our results show that the ergative nature of these languages is encoded in the contextual embeddings of transitive nouns (where ergativity is not realized), and that this encoding of ergativity transfers coherently across languages.
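The shape of that mixed-effects analysis can be sketched with statsmodels. The data frame below is synthetic: the column names (`p_agent`, `ergative_source`, `dest_lang`), the built-in -0.20 effect, and all noise levels are our fabrications, not the paper's data. What it shows is only the structure of the test: a random intercept per destination language, with the source's case system as the fixed effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Fabricate P(S labeled A) observations with a -0.20 ergative-source
# effect and a small random intercept per destination language.
rows = []
for dest in ["en", "fr", "ru", "fi", "zh"]:
    dest_offset = rng.normal(0, 0.03)   # per-destination random intercept
    for ergative_source in (0, 1):
        for _ in range(20):
            p = (0.85 - 0.20 * ergative_source
                 + dest_offset + rng.normal(0, 0.05))
            rows.append({"p_agent": p,
                         "ergative_source": ergative_source,
                         "dest_lang": dest})
df = pd.DataFrame(rows)

# Mixed-effects regression: fixed effect of source-language ergativity,
# random intercept grouped by destination language.
model = smf.mixedlm("p_agent ~ ergative_source", df, groups=df["dest_lang"])
result = model.fit()
# The fixed effect should recover roughly the -0.20 effect we built in.
print(result.params["ergative_source"])
```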
Experiment 3

To explore the nature of mBERT's underlying representation of grammatical role, we ask which arguments are most likely to be classified as subjects or objects. This is of particular interest when the classifier gets it wrong: what kinds of subjects get erroneously classified as objects?

The functional linguistics literature offers insight into these questions. It has been frequently claimed that grammatical subjecthood is actually a multi-factor, probabilistic concept (Keenan and Comrie, 1977; Comrie, 1981; Croft, 2001; Hopper and Thompson, 1980) that cannot always be pinned down as a discrete category. Some subjects are more subject-y than others. Comrie (1988) argues that a subject can be thought of as the intersection of that which is high in agency (subjects do things) and topicality (subjects are the topics of sentences). Thus, in English, a prototypical subject is something like "He kicked the ball," since in such a sentence, the pronoun "he" is a clear agent and the topic of the sentence. But, in a sentence like "The lake, which Jack Frost visited, froze," the subject is still "lake." But it is less subject-y: it is not the clear topic of the sentence and it is not an agent.

A probabilistic notion of grammatical role lends itself naturally to the continuous embedding spaces of computational models. So, in a series of experiments, we explored what factors in mBERT contextual embedding space predict subjecthood. In these experiments, we examine how the decisions and probabilities of the A–O classifiers from Experiment 2 relate to other linguistic features known to contribute to the degree of subjecthood. In particular, we look at whether nouns appear in passive constructions, as well as the animacy and case of nouns.
In seeing how passives, animacy, and case interact with our subjecthood classifiers, we can assess if mBERT's representation of subjecthood in continuous space is consistent with functional analyses, and better understand the continuous space in which mBERT encodes syntactic and semantic relations.

Figure 8: Passive subjects are hard to classify. The distribution of average classifier probabilities in layer 10 for all source–destination language pairs, separated by role. While the layer 10 classifier separates A and S from O, passive subjects remain largely ambiguous in their classification. These plots indicate that, in mBERT space, the grammatical subjects of passive constructions are less subject-y.

We choose these three factors as they are well-studied in the functional literature, as well as readily available to extract from UD corpora. Passive subjects are marked with a separate dependency arc label, the animacy of nouns is annotated directly in some UD treebanks, and in case-marked languages, nouns are annotated with their case. Future work on a more complete examination of the functional nature of contextual embeddings would include other factors not readily available in UD, like the discourse and information structure (topicality) of nouns in context.
The first area that we look at is passive constructions. In passive constructions such as "The lawyer was chased by a cat", the grammatical subject is not the main actor or agent in the sentence. As such, while a purely syntactic analysis of subjecthood would classify passive subjects (S-passive) as subjects, an understanding of subjecthood as continuous and reliant on semantics would be more prone to classify passive subjects as objects. As shown in Figure 8, subjecthood classifiers across languages are ambivalent about how they classify passive subjects, even in layers where they have the acuity to successfully separate A and S from O. This indicates that the classifiers do not learn a purely syntactic separation of A and O: the subjecthood encoding that they learn from mBERT space is largely dependent on semantic information.

Figure 9: The influence of animacy on classification (within and across languages). For a high-performing layer (layer 10), the average probability of classifiers in all languages classifying nouns in languages with animacy distinctions as A. For all three roles, animates are more likely to be classified as agents. The labels are two-letter codes for the languages.
We also find that animacy is a strong predictor of subjecthood. Our results presented in Figure 9 demonstrate that when we control by role, animacy is a significant factor in determining the probability of being classified as A. Classifiers in all languages, when zero-shot evaluated on a corpus marked for animacy, are more likely to classify animate nouns as A than inanimate nouns. For layer 10, a mixed-effects regression predicting each destination language's probability of assigning an argument to being an agent shows that both role and animacy are significant predictors (with a main effect of animacy corresponding to a 16% increase in the probability of being an agent, p < . ). These results indicate that, in learning to separate A from O, the classifiers did not learn a purely syntactic separation of the space (though it is possible to distinguish A and O using only strictly structural syntactic features). Instead, we see that subjecthood information is entangled with semantic notions such as animacy, giving credence to the hypothesis that subjecthood in BERT space is encoded in a way concordant with the multi-factor manner proposed by Croft, Comrie, and others.

Lastly, we find that classifier probabilities also vary with case, even when we control for sentence role. As demonstrated in Figure 10, across grammatical roles, classifiers are significantly more likely to classify nouns as A if they are in more agentive cases (nominative and ergative). In a mixed-effects regression predicting layer 10 probability of being an agent based on role and whether the case is agentive (nominative/ergative), there was a 15% increase associated with being nominative/ergative across categories (t = 2 . , p < . ).

Figure 10: Average probability of being an agent, in layer 10, with 95% confidence intervals, for Finnish and Basque broken up by case.
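The Figure 9 breakdown (mean classifier P(agent) by role and animacy) reduces to a grouped average. The records below are fabricated stand-ins for real classifier outputs, chosen only to mirror the qualitative pattern the paper reports: animates score higher within every role.

```python
# Fabricated (role, animate, p_agent) classifier outputs for illustration.
records = [
    ("A", True, 0.95), ("A", False, 0.82), ("A", True, 0.93), ("A", False, 0.78),
    ("S", True, 0.85), ("S", False, 0.66), ("S", True, 0.88), ("S", False, 0.70),
    ("O", True, 0.25), ("O", False, 0.08), ("O", True, 0.30), ("O", False, 0.12),
]

def mean_p(role, animate):
    """Mean P(agent) for one (role, animacy) cell."""
    vals = [p for r, a, p in records if r == role and a == animate]
    return sum(vals) / len(vals)

# Controlling for role, compare animate vs. inanimate nouns.
for role in ("A", "S", "O"):
    hi, lo = mean_p(role, True), mean_p(role, False)
    print(f"{role}: animate {hi:.2f}, inanimate {lo:.2f}, diff {hi - lo:+.2f}")
```

The paper quantifies this with a mixed-effects regression (a roughly 16% animacy main effect); the grouped means here only show the controlled-by-role comparison.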
Discussion

Our experimental results constitute a way to begin understanding how general knowledge of grammar is manifested in contextual embedding spaces, and how discrete categories like subjecthood are reconciled in continuous embedding spaces. While most previous work analyzing large contextual models focuses on extracting their analysis of features or structures present in specific inputs, we focus on morphosyntactic alignment, a feature of grammars that is not explicitly realized in any one sentence. We find that, when tested out of domain, classifiers trained to predict transitive subjecthood in mBERT contextual space robustly demonstrate decisions which reflect (a) the morphosyntactic alignment of their training language and (b) continuous encoding of subjecthood influenced by semantic properties.

There has been much recent work pointing out the limitations of the probing methodology for analyzing embedding spaces (Voita and Titov, 2020; Pimentel et al., 2020; Hewitt and Liang, 2019), a methodology that is very similar to ours. The main limitation pointed out in this literature is that the power of classifiers is a confounding variable: we can't know if a classifier's encoding of a feature is due to the feature being encoded in BERT space, or to the classifier figuring out the feature from surface encoding. In this paper, we address these issues by proposing two ways to use classifiers to analyze embedding spaces that go beyond probing, and avoid the limitations of arguments based only around the accuracy of probes. Firstly, our results rely on testing the classifiers on out-of-domain zero-shot transfer: both to S arguments and to different languages. As such, we focus on linguistically defining the type of classification boundary which our classifiers learn from mBERT space, rather than their accuracy, and in using transfer we avoid many of the limitations of probing, as argued in Papadimitriou and Jurafsky (2020).
Secondly, we examine a feature (morphosyntactic alignment) that is not inferable from the classifiers' training data, which consist only of transitive sentences. We are asking whether mBERT contextual space is organized in a way that encodes the effects of morphosyntactic alignment for tokens that do not themselves express alignment. Especially in the cross-lingual case, a classifier could not spuriously deduce this from the surface form, whatever its power.

A limitation of our experimental setup is that both our Universal Dependencies training data and the set of mBERT training languages are heavily weighted towards nominative-accusative languages. As such, we see a clear nominative-accusative bias in mBERT, and our results are somewhat noisy, as we only have one ergative-absolutive language and two semi-ergative languages. Future work should examine the effects of balanced joint training between nominative-accusative and ergative-absolutive languages on the contextual embedding of subjecthood. And we hope that future work will continue to ask not just whether deep neural models of language represent discrete linguistic features, but how they represent them probabilistically.

Acknowledgments
We thank Dan Jurafsky, Jesse Mu, and Ben Newman for helpful comments. This work was supported by NSF grant
References
Ethan A. Chi, John Hewitt, and Christopher D. Manning. 2020. Finding universal grammatical relations in multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5564–5577, Online. Association for Computational Linguistics.

Bernard Comrie. 1981. Language Universals and Linguistic Typology, 1st edition. University of Chicago Press, Chicago.

Bernard Comrie. 1988. Linguistic typology. Annual Review of Anthropology, 17:145–159.

William Croft. 2001. Radical Construction Grammar: Syntactic Theory in Typological Perspective. Oxford University Press.

Helen De Hoop and Bhuvana Narasimhan. 2005. Differential case-marking in Hindi. In Competition and Variation in Natural Languages, pages 321–345. Elsevier.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Robert M. W. Dixon. 1979. Ergativity. Language, pages 59–138.

John W. Du Bois. 1987. The discourse basis of ergativity. Language, pages 805–855.

Barbara A. Fox. 1987. The noun phrase accessibility hierarchy reinterpreted: Subject primacy or the absolutive hypothesis? Language, pages 856–870.

Richard Futrell, Ethan Wilcox, Takashi Morita, Peng Qian, Miguel Ballesteros, and Roger Levy. 2019. Neural language models as psycholinguistic subjects: Representations of syntactic state. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 32–42, Minneapolis, Minnesota. Association for Computational Linguistics.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, New Orleans, Louisiana. Association for Computational Linguistics.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Paul J. Hopper and Sandra A. Thompson. 1980. Transitivity in grammar and discourse. Language, pages 251–299.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.

Edward L. Keenan and Bernard Comrie. 1977. Noun phrase accessibility and universal grammar. Linguistic Inquiry, 8(1):63–99.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China. Association for Computational Linguistics.

Jindřich Libovický, Rudolf Rosa, and Alexander Fraser. 2019. How language-neutral is multilingual BERT? arXiv preprint arXiv:1911.03310.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094, Minneapolis, Minnesota. Association for Computational Linguistics.

Christopher D. Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. 2020. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences.

Aaron Mueller, Garrett Nicolai, Panayiota Petrou-Zeniou, Natalia Talmina, and Tal Linzen. 2020. Cross-linguistic syntactic evaluation of word prediction models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5523–5539.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666.

Isabel Papadimitriou and Dan Jurafsky. 2020. Learning music helps you read: Using transfer to study linguistic structure in language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6829–6839.

Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. 2020. Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4609–4622.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Shauli Ravfogel, Yoav Goldberg, and Tal Linzen. 2019. Studying the inductive biases of RNNs with synthetic variations of natural languages. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3532–3542, Minneapolis, Minnesota. Association for Computational Linguistics.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.

Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 183–196.

Ethan Wilcox, Roger Levy, Takashi Morita, and Richard Futrell. 2018. What do RNN language models learn about filler–gap dependencies? In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 211–221, Brussels, Belgium. Association for Computational Linguistics.

Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In