Does Typological Blinding Impede Cross-Lingual Sharing?
Johannes Bjerva
Department of Computer Science, Aalborg University
[email protected]
Isabelle Augenstein
Department of Computer Science, University of Copenhagen
[email protected]
Abstract
Bridging the performance gap between high- and low-resource languages has been the focus of much previous work. Typological features from databases such as the World Atlas of Language Structures (WALS) are a prime candidate for this, as such data exists even for very low-resource languages. However, previous work has only found minor benefits from using typological information. Our hypothesis is that a model trained in a cross-lingual setting will pick up on typological cues from the input data, thus overshadowing the utility of explicitly using such features. We verify this hypothesis by blinding a model to typological information, and investigate how cross-lingual sharing and performance are impacted. Our model is based on a cross-lingual architecture in which the latent weights governing the sharing between languages are learnt during training. We show that (i) preventing this model from exploiting typology severely reduces performance, while a control experiment reaffirms that (ii) encouraging sharing according to typology somewhat improves performance.
Most languages in the world have little access to NLP technology due to data scarcity (Joshi et al., 2020). Nonetheless, high-quality multilingual representations can be obtained using only a raw text signal, e.g. via multilingual language modelling (Devlin et al., 2019). Furthermore, structural similarities of languages are to a large extent documented in typological databases such as the World Atlas of Language Structures (WALS, Dryer and Haspelmath (2013)). Hence, developing models which can make use of the typological similarities of languages is an important direction for alleviating language technology inequalities.

While previous work has attempted to use typological information to inform NLP models, our
work differs significantly from such efforts in that we blind a model to this information. Most previous work includes language information as features, by using language IDs or language embeddings (e.g. Ammar et al. (2016); O'Horan et al. (2016); Östling and Tiedemann (2017); Ponti et al. (2019); Oncevay et al. (2020)). Notably, limited effects are usually observed from including typological features explicitly. For instance, de Lhoneux et al. (2018) observe positive cross-lingual sharing effects only in a handful of their settings. We therefore hypothesise that relevant typological information is learned as a by-product of cross-lingual training. Hence, although models do benefit from this information, it is not necessary to provide it explicitly in a high-resource scenario, where there is abundant training data. This is confirmed by Bjerva and Augenstein (2018a), who find that, e.g., language embeddings trained on a morphological task can encode morphological features from WALS.

In contrast with previous work, we blind a model to typological information by using adversarial techniques based on gradient reversal (Ganin and Lempitsky, 2014). We evaluate on the structured prediction and classification tasks in XTREME (Hu et al., 2020), yielding a total of 40 languages and 4 tasks. We show that when a model is blinded to typological signals relating to syntax and morphology, performance on related NLP tasks drops significantly. For instance, the mean accuracy across 40 languages for POS tagging drops by 1.8% when blinding the model to morphological features.

An overview of the model is shown in Figure 1.

Figure 1: A PoS tagger is exposed (or blinded with gradient reversal, −λ) to typological features. Observing α values tells us how typology affects sharing.

We model each task in this paper using the following steps. First, contextual representations are extracted using multilingual BERT (m-BERT, Devlin et al.
(2019)), a transformer-based model (Vaswani et al., 2017) trained with shared wordpieces across languages. We either blind m-BERT to typological features, with an added adversarial component based on gradient reversal (Ganin and Lempitsky, 2014), or expose it to them via multi-task learning (MTL, Caruana (1997)). Representations from m-BERT are fed to a latent multi-task architecture learning network (Ruder et al., 2019), which includes the α parameters we seek to investigate. The model learns which parameters to share between languages (e.g. α_{es,fr} denotes sharing between Spanish and French).

Our sharing architecture is based on that of Ruder et al. (2019), which has latent variables, learned during training, governing which layers and subspaces are shared between tasks, to what extent, as well as the relative weighting of different task losses. We are most interested in the parameters which control the sharing between the hidden layers allocated to each task, referred to as α parameters (Ruder et al., 2019). Consider a setting with two tasks A and B. The outputs h_{A,k} and h_{B,k} of the k-th layer for tasks A and B interact through the α parameters, for which the output is defined as:

$$\begin{bmatrix} \tilde{h}_{A,k} \\ \tilde{h}_{B,k} \end{bmatrix} = \begin{bmatrix} \alpha_{AA} & \alpha_{AB} \\ \alpha_{BA} & \alpha_{BB} \end{bmatrix} \begin{bmatrix} h_{A,k}^{\top} \\ h_{B,k}^{\top} \end{bmatrix} \qquad (1)$$

where h̃_{A,k} is a linear combination of the activations for task A at layer k, weighted with the learned αs. While their model is an MTL model, we choose to interpret this differently by considering each language as a task, yielding α ∈ R^{l×l}, where l is the number of languages for the given task. Each activation h̃_{A,k} is then a linear combination of the language-specific activations h_{A,k}. These are used for prediction in the downstream tasks, as in the baselines from Hu et al. (2020). Crucially, this model allows us to draw conclusions about parameter sharing between languages by observing the α parameters under the blinding and prediction conditions. We will combine this insight with observing downstream task performance in order to draw conclusions about the effects of typological feature blinding and prediction.

We introduce a component which can either blind or expose the model to typological features. We implement this as a single task-specific layer per feature, using the [CLS] token from the m-BERT model, without access to any of the soft sharing between languages from the α-layers. Each layer optimises a categorical cross-entropy loss function (L_typ). For this task, we predict typological features drawn from WALS (Dryer and Haspelmath, 2013), inspired by previous work (Bjerva and Augenstein, 2018a). Unlike previous work, we also blind the model to such features by including a gradient reversal layer (Ganin and Lempitsky, 2014), which multiplies the gradient of the typological prediction task with a negative constant (−λ), inspired by previous work on adversarial learning (Goodfellow et al., 2014; Zhang et al., 2019; Chen et al., 2019). We hypothesise that using a gradient reversal layer for typology will yield typology-invariant features, and that this will perform worse on tasks for which the typological feature at hand is important.
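As a concrete illustration of the blinding mechanism, the fragment below is a minimal pure-Python sketch of a gradient reversal layer: identity in the forward pass, sign flip scaled by λ in the backward pass. The function names and the λ value are our own illustrative choices, not the paper's released implementation.

```python
LAM = 1.0  # reversal strength lambda (illustrative value)

def grad_reverse_forward(x):
    # Forward pass: the identity, so the typology classifier
    # sees the encoder features unchanged.
    return x

def grad_reverse_backward(grad_output, lam=LAM):
    # Backward pass: scale the incoming gradient by -lambda, so the
    # shared encoder below this layer is updated *against* the typology
    # classifier, pushing it towards typology-invariant features.
    return [-lam * g for g in grad_output]
```

In a real framework this would be registered as a single custom autograd operation; the two halves are shown separately here to make the sign flip explicit.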
For instance, we expect that blinding a model to syntactic features will severely reduce performance for tasks which rely heavily on syntax, such as POS tagging.

We investigate the effects of typological blinding, using typological parameters as presented in WALS (Dryer and Haspelmath, 2013). The experiments are run on XTREME (Hu et al., 2020), which includes up to 40 languages from 12 language families and two isolates. We experiment on the following languages (ISO 639-1 codes): af, ar, bg, bn, de, el, en, es, et, eu, fa, fi, fr, he, hi, hu, id, it, ja, jv, ka, kk, ko, ml, mr, ms, my, nl, pt, ru, sw, ta, te, th, tl, tr, ur, vi, yo, and zh. We experiment on four tasks: POS (part-of-speech tagging), NER (named entity recognition), XNLI (cross-lingual natural language inference), and PAWS-X (paraphrase identification). Our general setup for the structured prediction tasks (POS and NER) is that we train on all available languages, and downsample to 1,000 samples per language. For the classification tasks XNLI and PAWS-X, we train on the English training data and fine-tune on the development sets, as no training data is available for other languages. Hence, typological differences will be the main factor in our results, rather than differences in dataset sizes.
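The α-based sharing of Eq. 1, which the analyses below revolve around, can be made concrete with a toy example. All values here are invented for illustration (two languages, hidden size 3):

```python
# Toy layer-k activations for two languages A and B (hidden size 3);
# the numbers are invented for illustration.
h_A = [1.0, 2.0, 3.0]
h_B = [4.0, 5.0, 6.0]

# Learned alpha parameters: a_XY weights how much language X
# draws on language Y's activations.
a_AA, a_AB = 0.8, 0.2
a_BA, a_BB = 0.3, 0.7

# Eq. 1: each mixed activation is a linear combination of the
# language-specific activations, weighted by the learned alphas.
h_tilde_A = [a_AA * a + a_AB * b for a, b in zip(h_A, h_B)]
h_tilde_B = [a_BA * a + a_BB * b for a, b in zip(h_A, h_B)]
```

With l languages rather than two, α becomes an l × l matrix, and the blinding and prediction conditions are then compared by how these weights align with language similarity.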
We first investigate whether prohibiting or allowing access to typological features has an effect on model performance using our architecture. We hypothesise that our multilingual model will leverage signals related to the linguistic nature of a task when optimising its sharing parameters α.

There exists a growing body of work on prediction of typological features (Daumé III and Campbell, 2007; Murawaki, 2017; Bjerva and Augenstein, 2018b; Bjerva et al., 2019a,b), most notably in a recent shared task on the subject (Bjerva et al., 2020). While we are inspired by this direction of research, our contribution is not concerned with the accuracy of the prediction of such features, and this is therefore not evaluated in detail in the paper. Moreover, an increasing amount of work measures the correlation of the predictive performance of cross-lingual models with typological features as a way of probing what a model has learned about typology (Malaviya et al., 2017; Choenni and Shutova, 2020; Gerz et al., 2018; Nooralahzadeh et al., 2020; Zhao et al., 2020). In contrast to such post-hoc approaches, our experimental setting allows for measuring the impact of typology on cross-lingual sharing performance in a direct manner, as part of the model architecture.

Syntactic Features
We first blind/expose the model to syntactic features from WALS (Dryer and Haspelmath, 2013). We take the set of word order features which are annotated for all languages in our experiments, resulting in 33 features. This includes, for instance, the feature encoding the preferred word ordering (if any) in a transitive clause. For all features, we exclude feature values which do not occur for our set of languages. We hypothesise that performance will drop for all four tasks, as they all require syntactic understanding.
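To illustrate how such feature-prediction targets can be set up, the sketch below builds categorical labels from WALS-style annotations, keeping only feature values attested in the language sample. The annotations are a hypothetical toy fragment, not an actual WALS export.

```python
# Hypothetical toy fragment of WALS-style word-order annotations
# (language code -> feature value); real values come from the WALS database.
word_order = {"en": "SVO", "fr": "SVO", "ja": "SOV", "hi": "SOV", "ar": "VSO"}

# Keep only the feature values that occur for our language set,
# and map them to class indices for the categorical cross-entropy loss.
attested = sorted(set(word_order.values()))
value_to_id = {v: i for i, v in enumerate(attested)}
labels = {lang: value_to_id[v] for lang, v in word_order.items()}
```

Values that never occur in the sample (e.g. object-initial orders) simply never enter the label space, matching the exclusion step described above.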
Morphological Features
We next attempt to blind/expose the model to the morphological features in WALS. We use the same approach as above, resulting in a total of 8 morphological features. This includes, for instance, the feature indicating to what extent a language uses prefixing or suffixing morphology. We hypothesise that mainly the POS tagging task will suffer under this condition, whereas the other tasks require morphology only to some extent.
Phonological Features
We next consider a control experiment, in which we attempt to blind/expose the model to phonological features in WALS. We arrive at a total of 15 phonological features, such as the feature indicating the size of the consonant inventory of a language. We expect performance to remain relatively unaffected under this condition, as phonology ought to have little importance given a textual input.
Genealogical Features
Finally, we attempt to use what one might consider to be language metadata: we blind/expose the model to which language family a language belongs to. This can be seen as a type of proxy for language similarity, and correlates relatively strongly with structural similarities between languages. Because of this correlation with structural similarities, we expect blinding under this condition to only slightly reduce performance for all tasks, as previous work has shown this type of relationship not to be central in language representations (Bjerva et al., 2019c).
In general, we observe a drop in performance when blinding the model to relevant typological information, and an increase in performance when exposing the model to it (Table 1). For phonological blinding or prediction, none of the four tasks is noticeably affected. Although, e.g., both the syntactic and morphological prediction tasks increase performance on POS tagging, it is not straightforward to draw conclusions on which of these is the most efficient, as there is a substantial correlation between syntactic and morphological features. As for XNLI and PAWS-X, performance notably drops under both the syntactic and genealogical blinding tasks.

Figure 2: PoS tagging results per language family across blinding and prediction conditions
Model                    POS    NER    XNLI   PAWS-X
+ Syntactic Blind.       85.3   −      −      −
+ Morphological Blind.   85.0   −
m-BERT baseline
+ Morphological Pred.    +      +

Table 1: Typological Blinding and Prediction. Mean POS accuracy, NER F1 scores, XNLI accuracy and PAWS-X accuracy across all languages. + and − indicate significantly better or worse performance respectively, as determined by a one-tailed t-test.

Figure 2 shows results for PoS tagging under prediction and blinding across language families, following the same scheme as Hu et al. (2020). Interestingly, the syntactic and morphological blinding settings are robust across all language families, yielding a drop in accuracy across the board. All other conditions yield mixed results. This further strengthens our argument that preventing a model from learning syntactic and morphological features can be severely detrimental.
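The significance marks in Table 1 come from one-tailed t-tests. As a minimal stdlib sketch, the paired one-tailed t statistic over per-language scores can be computed as follows; the scores below are invented for illustration, and the paper does not specify this exact implementation.

```python
import math
import statistics

def paired_t(scores_a, scores_b):
    # t statistic for the one-tailed hypothesis mean(a) > mean(b),
    # with scores paired per language across the two conditions.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Invented per-language accuracies for two conditions (illustration only);
# t is then compared against the one-tailed critical value for n-1 df.
baseline = [88.0, 85.5, 90.1, 79.9]
blinded  = [86.2, 84.9, 88.0, 78.5]
t = paired_t(baseline, blinded)
```

A paired test is the natural choice here, since the same languages are evaluated under every condition.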
The results show that preventing access to typological features hampers performance, whereas providing access improves performance. We now turn to an analysis of how the model shares parameters across languages in this setting. Our hypothesis is that blinding will prevent models from sharing parameters between similar languages, in spite of typological similarities. Concretely, we expect that the drop in POS tagging performance under morphological blinding is caused by lower α weights between languages which are morphologically similar, and higher α weights between languages which are dissimilar. Recall that these parameters are latent variables learned by the model, regulating the amount of sharing between languages (see Eq. 1).

We investigate the correlations between the α sharing parameters and two proxies of language similarity. We focus on the POS task, as the results from the typological blinding and prediction experiments were the most pronounced here, with both morphological and syntactic blinding affecting performance. Our first measure of language similarity is based on Bjerva et al. (2019c), who introduce what they refer to as structural similarity. This is based on dependency statistics from the Universal Dependencies treebank (Zeman et al., 2020), resulting in vectors which describe how different syntactic relations are used in each language. Previous work has shown that this measure of similarity correlates strongly with that learned in embedded language spaces during multilingual training. In addition to considering these dependency statistics, we also use language embeddings drawn from Östling and Tiedemann (2017). For each language similarity measure, we calculate its pairwise Pearson correlation with the α values learned under each condition.

Model                  Struct.   Lang. Emb.
Syntactic Blind.       0.31      0.27
Morphological Blind.   0.34      0.29
Phonological Blind.    0.40      0.41
Genealogical Blind.    0.29      0.31
No blind./pred.        0.43      0.40
Syntactic Pred.
Phonological Pred.     0.41      0.39
Genealogical Pred.     0.47      0.38

Table 2: Pearson correlations between α weights and language similarity measures.

Table 2 shows that correlations between α weights and similarities increase when predicting typological features, and decrease when the model is blinded to such features. Hence, when the model has indirect access to, e.g., the SVO word ordering features of languages, sharing also reflects this.

We have shown that blinding a multilingual model to typological features severely affects sharing across a relatively large language sample, and for several NLP tasks. The effects on model performance, as evaluated over 40 languages and 4 tasks from XTREME (Hu et al., 2020), were the largest for POS tagging. The fact that smaller effects were observed for NER could be because this task relies more on memorising NEs rather than using (morpho-)syntactic cues (Augenstein et al., 2017). Furthermore, the relatively small effects on XNLI and PAWS-X can also be interpreted as evidence that typology is less important in these tasks than in more traditional linguistic analysis.

A potential critique of our approach is that it merely blinds the model to language identities. This could be the case if only some latent representation of, e.g., "SVO" ordering is used to represent a language identity. However, previous work has shown that morphological information is encoded by the type of model we investigate.
Hence, since we only blind features in a single category at a time, we expect that the model's representation of language identities is unaffected.

Not only do we observe a drop in performance when blinding a model to syntactic features, but we also observe that the α sharing weights in our model do not appear to correlate with linguistic similarities in this setting. Conversely, encouraging a model to consider typology, by jointly optimising it for typological feature prediction, improves performance in general. Furthermore, α weights in this scenario converge towards correlating with structural similarities of languages. This is in line with recent work which has found that m-BERT uses fine-grained syntactic distinctions in its cross-lingual representation space (Chi et al., 2020). We interpret this as evidence that typology can be a necessity for modelling in NLP. Our results furthermore corroborate previous work in that we only find moderate benefits from including typological information explicitly. We expect that this is to a large degree due to the typological similarities of languages being encoded implicitly, based on correlations between patterns in the input data. As low-resource languages often do not have access to any substantial amount of raw text, but often do have annotations in WALS, we expect that using typological information can go some way towards building truly language-universal models.

We have shown that preventing access to typology can impede the performance of cross-lingual sharing models. Investigating the latent weights governing the sharing between languages shows that blinding prevents the model from sharing between typologically similar languages, which is otherwise learned based on patterns in the input. We therefore expect that using typological information can be of particular interest for building truly language-universal models for low-resource languages.
Acknowledgements
This research has received funding from the Swedish Research Council (grant No 2019-04129), and the NVIDIA Corporation (Titan Xp GPU).

References
Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. Many languages, one parser. Transactions of the Association for Computational Linguistics, 4:431–444.

Isabelle Augenstein, Leon Derczynski, and Kalina Bontcheva. 2017. Generalisation in named entity recognition: A quantitative analysis. Computer Speech & Language, 44:61–83.

Johannes Bjerva and Isabelle Augenstein. 2018a. From phonology to syntax: Unsupervised linguistic typology at different levels with language embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 907–916, New Orleans, Louisiana. Association for Computational Linguistics.

Johannes Bjerva and Isabelle Augenstein. 2018b. Tracking typological traits of Uralic languages in distributed language representations. In Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages, pages 76–86, Helsinki, Finland. Association for Computational Linguistics.

Johannes Bjerva, Yova Kementchedjhieva, Ryan Cotterell, and Isabelle Augenstein. 2019a. A probabilistic generative model of linguistic typology. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1529–1540, Minneapolis, Minnesota. Association for Computational Linguistics.

Johannes Bjerva, Yova Kementchedjhieva, Ryan Cotterell, and Isabelle Augenstein. 2019b. Uncovering probabilistic implications in typological knowledge bases. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3924–3930, Florence, Italy. Association for Computational Linguistics.

Johannes Bjerva, Elizabeth Salesky, Sabrina J. Mielke, Aditi Chaudhary, Celano Giuseppe, Edoardo Maria Ponti, Ekaterina Vylomova, Ryan Cotterell, and Isabelle Augenstein. 2020. SIGTYP 2020 shared task: Prediction of typological features. In Proceedings of the Second Workshop on Computational Research in Linguistic Typology, pages 1–11, Online. Association for Computational Linguistics.

Johannes Bjerva, Robert Östling, Maria Han Veiga, Jörg Tiedemann, and Isabelle Augenstein. 2019c. What do language representations really represent? Computational Linguistics, 45(2):381–389.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

Steven Chen, Nicholas Carlini, and David Wagner. 2019. Stateful detection of black-box adversarial attacks. arXiv preprint arXiv:1907.05587.

Ethan A. Chi, John Hewitt, and Christopher D. Manning. 2020. Finding universal grammatical relations in multilingual BERT. arXiv preprint arXiv:2005.04511.

Rochelle Choenni and Ekaterina Shutova. 2020. What does it mean to be language-agnostic? Probing multilingual sentence encoders for typological properties. CoRR, abs/2009.12862.

Hal Daumé III and Lyle Campbell. 2007. A Bayesian model for discovering typological implications. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 65–72, Prague, Czech Republic. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Yaroslav Ganin and Victor Lempitsky. 2014. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495.

Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327, Brussels, Belgium. Association for Computational Linguistics.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. arXiv preprint arXiv:2004.09095.

Miryam de Lhoneux, Johannes Bjerva, Isabelle Augenstein, and Anders Søgaard. 2018. Parameter sharing between dependency parsers for related languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4992–4997.

Chaitanya Malaviya, Graham Neubig, and Patrick Littell. 2017. Learning language representations for typology prediction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2529–2535, Copenhagen, Denmark. Association for Computational Linguistics.

Yugo Murawaki. 2017. Diachrony-aware induction of binary latent representations from typological features. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 451–461, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Farhad Nooralahzadeh, Giannis Bekoulis, Johannes Bjerva, and Isabelle Augenstein. 2020. Zero-shot cross-lingual transfer with meta learning. In Proceedings of EMNLP. Association for Computational Linguistics.

Helen O'Horan, Yevgeni Berzak, Ivan Vulić, Roi Reichart, and Anna Korhonen. 2016. Survey on the use of typological information in natural language processing. arXiv preprint arXiv:1610.03349.

Arturo Oncevay, Barry Haddow, and Alexandra Birch. 2020. Bridging linguistic typology and multilingual machine translation with multi-view language representations. In Proceedings of EMNLP. Association for Computational Linguistics. ArXiv preprint arXiv:2004.14923.

Robert Östling and Jörg Tiedemann. 2017. Continuous multilinguality with language vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 644–649, Valencia, Spain. Association for Computational Linguistics.

Edoardo Maria Ponti, Helen O'Horan, Yevgeni Berzak, Ivan Vulić, Roi Reichart, Thierry Poibeau, Ekaterina Shutova, and Anna Korhonen. 2019. Modeling language variation and universals: A survey on typological linguistics for natural language processing. Computational Linguistics, 45(3):559–601.

Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. 2019. Latent multi-task architecture learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4822–4829.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Daniel Zeman, Joakim Nivre, Mitchell Abrams, et al. 2020. Universal Dependencies 2.6. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Huan Zhang, Hongge Chen, Zhao Song, Duane Boning, Inderjit S. Dhillon, and Cho-Jui Hsieh. 2019. The limitations of adversarial training and the blind-spot attack. arXiv preprint arXiv:1901.04684.

Wei Zhao, Steffen Eger, Johannes Bjerva, and Isabelle Augenstein. 2020. Inducing language-agnostic multilingual representations. arXiv preprint arXiv:2008.09112.