Treebank Embedding Vectors for Out-of-Domain Dependency Parsing
Joachim Wagner, James Barry and Jennifer Foster
ADAPT Centre, School of Computing, Dublin City University, Ireland
[email protected]
Abstract
A recent advance in monolingual dependency parsing is the idea of a treebank embedding vector, which allows all treebanks for a particular language to be used as training data while at the same time allowing the model to prefer training data from one treebank over others and to select the preferred treebank at test time. We build on this idea by 1) introducing a method to predict a treebank vector for sentences that do not come from a treebank used in training, and 2) exploring what happens when we move away from predefined treebank embedding vectors during test time and instead devise tailored interpolations. We show that 1) there are interpolated vectors that are superior to the predefined ones, and 2) treebank vectors can be predicted with sufficient accuracy, for nine out of ten test languages, to match the performance of an oracle approach that knows the most suitable predefined treebank embedding for the test set.
Introduction

The Universal Dependencies project (Nivre et al., 2016) has made available multiple treebanks for the same language annotated according to the same scheme, leading to a new wave of research which explores ways to use multiple treebanks in monolingual parsing (Shi et al., 2017; Sato et al., 2017; Che et al., 2017; Stymne et al., 2018).

Stymne et al. (2018) introduced a treebank embedding. A single model is trained on the concatenation of the available treebanks for a language, and the input vector for each training token includes the treebank embedding which encodes the treebank the token comes from. At test time, all input vectors in the test set of the same treebank are also assigned this treebank embedding vector. Stymne et al. (2018) show that this approach is superior to mono-treebank training and to plain treebank concatenation. Treebank embeddings perform at about the same level as training on multiple treebanks and tuning on one, but they argue that a treebank embedding approach is preferable since it results in just one model per language.

What happens, however, when the input sentence does not come from a treebank? Stymne et al. (2018) simulate this scenario with the Parallel Universal Dependencies (PUD) test sets. They define the notion of a proxy treebank, which is the treebank to be used for a treebank embedding when parsing sentences that do not come from any of the training treebanks. They empirically determine the best proxy treebank for each PUD test set by testing with each treebank embedding. However, the question remains what to do with sentences for which no gold parse is available, and for which we do not know the best proxy.

We investigate the problem of choosing treebank embedding vectors for new, possibly out-of-domain, sentences. In doing so, we explore the usefulness of interpolated treebank vectors which are computed via a weighted combination of the predefined fixed ones. In experiments with Czech, English and French, we establish that useful interpolated treebank vectors exist. We then develop a simple k-NN method based on sentence similarity to choose a treebank vector, either fixed or interpolated, for sentences or entire test sets, which, for nine of our ten test languages, matches the performance of the best (oracle) proxy treebank.
Following recent work in neural dependency parsing (Chen and Manning, 2014; Ballesteros et al., 2015; Kiperwasser and Goldberg, 2016; Zeman et al., 2017, 2018), we represent an input token by concatenating various vectors. In our experiments, each word w_i in a sentence S = (w_1, ..., w_n) is a concatenation of 1) a dynamically learned word vector, 2) a word vector obtained by passing the k_i characters of w_i through a BiLSTM and 3), following Stymne et al. (2018), a treebank embedding to distinguish the m training treebanks:

    e(i) = e(w_i) ∘ biLSTM(e(ch_{i,1}), ..., e(ch_{i,k_i})) ∘ f    (1)

Stymne et al. (2018) use

    f = e(t*)    (2)

where t* ∈ {1, ..., m} is the source treebank for sentence S or, if S does not come from one of the m treebanks, a choice of one of these (the proxy treebank). We change f during test time to

    f = Σ_{t=1}^{m} α_t e(t)    (3)

where there are m treebanks for the language in question and Σ_{t=1}^{m} α_t = 1.
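To make Eqs. 1-3 concrete, the sketch below assembles such a token representation in PyTorch. This is a minimal illustration, not the authors' parser: the class, the dimensions and the three-treebank setup are assumptions made for the example.

    # Sketch of Eqs. 1-3: word vector + character-BiLSTM vector + treebank
    # component f. A one-hot alpha recovers the fixed embedding of Eq. 2;
    # any alpha summing to 1 gives the interpolated f of Eq. 3.
    import torch
    import torch.nn as nn

    class TokenRepresentation(nn.Module):
        def __init__(self, n_words, n_chars, n_treebanks,
                     word_dim=100, char_dim=24, char_lstm_dim=32, tb_dim=12):
            super().__init__()
            self.word_emb = nn.Embedding(n_words, word_dim)
            self.char_emb = nn.Embedding(n_chars, char_dim)
            self.char_lstm = nn.LSTM(char_dim, char_lstm_dim,
                                     bidirectional=True, batch_first=True)
            self.tb_emb = nn.Embedding(n_treebanks, tb_dim)  # e(1), ..., e(m)

        def forward(self, word_id, char_ids, alpha):
            w = self.word_emb(word_id)                    # e(w_i)
            _, (h, _) = self.char_lstm(self.char_emb(char_ids).unsqueeze(0))
            c = torch.cat([h[0, 0], h[1, 0]], dim=-1)     # final fwd+bwd states
            f = alpha @ self.tb_emb.weight                # sum_t alpha_t e(t)
            return torch.cat([w, c, f], dim=-1)           # Eq. 1 concatenation

    model = TokenRepresentation(n_words=10000, n_chars=200, n_treebanks=3)
    rep = model(torch.tensor(7), torch.tensor([3, 1, 4]),
                torch.tensor([0.5, 0.3, 0.2]))  # interpolated treebank component

Nothing in the parser itself needs to change for interpolation: the treebank component is simply a different convex (or, with negative weights, affine) combination of the same embedding rows.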
For all experiments, we use UD v2.3 (Nivre et al., 2018). We choose Czech, English and French as our development languages because they each have four treebanks (excluding PUD), allowing us to train on three treebanks and test on a fourth. For testing, we use the PUD test sets for languages for which there are at least two other treebanks with training data: Czech, English, Finnish, French, Italian, Korean, Portuguese, Russian, Spanish and Swedish. Following Stymne et al. (2018), we use the transition-based parser of de Lhoneux et al. (2017) with the token input representations of Eq. 1 above. Source code of our modified parser and helper scripts to carry out the experiments are available online at https://github.com/jowagner/tbev-prediction.

We attempt to ascertain how useful interpolated treebank embedding vectors are by examining the labelled attachment score (LAS) of trees parsed with different interpolated treebank vectors. For each of our three development languages, we train multi-treebank parsing models on the four combinations of three of the four available treebanks, and we test each model on the development sets of all four treebanks, i.e. three in-domain parsing settings and one out-of-domain setting. (An in-domain example is testing a model trained on cs_cac+cltt+fictree on cs_cac; an out-of-domain example is testing the same model on cs_pdt.)

Since m = 3 and Σ_{t=1}^{m} α_t = 1, all treebank vectors lie in a plane and we can visualise LAS results in colour plots. As the treebank vectors can have arbitrary distances, we plot (and sample) in the weight space R^m. We include the equilateral triangle spanned by the three fixed treebank embedding vectors in our plots; points outside the triangle can be reached by allowing negative weights α_t < 0. We obtain treebank-level LAS and sentence-level LAS for 200 weight vectors sampled from the weight space, including the corners of the triangle, and repeat with different seeds for parameter initialisation and training data shuffling. Rather than sampling at random, points are chosen so that they are somewhat symmetrical and evenly distributed.
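A symmetric, evenly spaced set of weight vectors in the plane Σ_t α_t = 1 can be generated, for example, as follows. This is a sketch of one plausible scheme: the step size, the weight bounds and the resulting number of points are our assumptions, not the exact grid used in the experiments.

    # Even grid of weight vectors (a1, a2, a3) with a1 + a2 + a3 = 1 for
    # m = 3, including points with negative weights outside the triangle
    # spanned by the three fixed treebank vectors.
    from fractions import Fraction

    def weight_grid(step=Fraction(1, 4), lo=Fraction(-1, 2), hi=Fraction(3, 2)):
        points = []
        n_steps = int((hi - lo) / step)
        for i in range(n_steps + 1):
            a1 = lo + i * step
            for j in range(n_steps + 1):
                a2 = lo + j * step
                a3 = 1 - a1 - a2          # enforce the sum-to-one constraint
                if lo <= a3 <= hi:
                    points.append((float(a1), float(a2), float(a3)))
        return points

    grid = weight_grid()
    corners = [p for p in grid if sorted(p) == [0.0, 0.0, 1.0]]  # fixed vectors
    print(len(grid), corners)

The fixed treebank vectors are the points where one weight is 1 and the other two are 0; any point with a negative component lies outside the triangle.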
[Figure 1: LAS in the treebank vector weight space (m = 3) for cs_cltt+fictree+pdt on cs_cac-dev with the second seed.]

Figure 1 shows the development set LAS on cs_cac-dev for a model trained on cs_cltt+fictree+pdt with the second seed. We create 432 such plots for nine seeds, four training configurations, four development sets and three languages. The patterns vary with each seed and configuration. The smallest LAS range within a plot is 87.8 to 88.3 (cs_cac+cltt+pdt on cs_pdt with the seventh seed). The biggest LAS range is 59.7 to 76.8 (fr_gsd+sequoia+spoken on fr_spoken with the fifth seed).

The fixed treebank vectors e(t) are located at the corners of the triangle in each graph. For in-domain settings, one or two corners usually have LAS close to the highest LAS in the plot. The best LAS scores (black circles), however, are often located outside the triangle, i.e. negative weights are needed to reach them.

[Figure 2: LAS in the treebank vector weight space (m = 3) for sentence 2 of en_partut-dev (28 tokens) with en_ewt+gum+lines and our first seed.]

Turning to sentence-level LAS, Figure 2 shows the LAS for an individual example sentence rather than an entire development set. This sentence is taken from en_partut-dev and is parsed with a model trained on en_ewt+gum+lines. For this 28-token sentence, LAS can only change in steps of 1/28, and 34 of the 200 treebank embedding weight points share the top score. Negative weights are needed to reach these points outside the triangle.

Over all development sentences and parsing models, an interpolated treebank vector achieves the highest LAS for 99.99% of sentences: in 78.07% of cases, one of the corner vectors also achieves the highest LAS, and in the remaining 21.92%, interpolated vectors are needed. It is also worth noting that, for 39% of sentences, LAS does not depend on the treebank vector at all, at least not in the weight range explored.

Often, LAS changes from one side of the graph to the other. The borders have different orientations and sharpness. The fraction of points with the highest LAS varies from few to many, and the same is true for the fraction of points with the lowest LAS. Noise seems to be low: most data points match the performance of their neighbours, i.e. the scores are not sensitive to small changes of the treebank weights, suggesting that the observed differences are not just random numerical effects.

This preliminary analysis suggests that useful interpolated treebank vectors do exist. Our next step is to try to predict them. In all subsequent experiments, we focus on the out-of-domain setting, i.e. each multi-treebank model is tested on a treebank not included in training.

We use k-nearest neighbour (k-NN) classification to predict treebank embedding vectors for an individual sentence or a set of sentences at test time. We experiment with 1) allocating the treebank vector for an input sentence using the k most similar training sentences (se-se), and 2) allocating the treebank vector for a set of input sentences using the most similar training treebank (tr-tr).

We will first explain the se-se case. For each input sentence, we retrieve from the training data the k most similar sentences and then identify the treebank vectors from the candidate samples that have the highest LAS. To compute similarity, we represent sentences either as tf-idf vectors computed over character n-grams, or as vectors obtained by max-pooling over a sentence's ELMo vectors (Peters et al., 2018), where each token's ELMo vector is the average of all ELMo biLM layers (we use ELMoForManyLangs; Che et al., 2018). We experiment with k = 1, 3 and 5. For many sentences, several treebank vectors yield the optimal LAS for the most similar retrieved sentence(s), and so we try several tie-breaking strategies, including choosing the vector closest to the uniform weight vector (i.e. each of the three treebanks is equally weighted), re-ranking the list of vectors in the tie according to the LAS of the next most similar sentence, and using the average LAS of the k sentences retrieved to choose the treebank vector (a simplified sketch of this retrieval step follows the list below). Three treebank vector sample sizes were tried:

1. fixed: only the three fixed treebank vectors, i.e. the corners of the triangle in Fig. 1.
2. α_t ≥ 0: negative weights are not used in the interpolation, i.e. only the 32 points inside or on the triangle in Fig. 1.
3. any: all 200 weight points shown in Fig. 1.
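The following is a simplified sketch of the se-se retrieval step with the tf-idf representation. The toy sentences, the precomputed per-sentence LAS table and the take-the-first tie-breaking are illustrative assumptions; the released code implements the tie-breaking strategies and candidate sample spaces described above.

    # k-NN choice of a treebank vector for one test sentence (se-se).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    train_sents = ["A short training sentence .", "Another one .", "More text ."]
    candidate_vectors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    # train_scores[i][j] = LAS of training sentence i parsed with vector j
    train_scores = np.array([[80.0, 75.0, 70.0],
                             [60.0, 66.0, 66.0],
                             [90.0, 90.0, 85.0]])

    # character n-gram tf-idf sentence representations
    vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
    X_train = vec.fit_transform(train_sents)

    def predict_vector(sentence, k=1):
        sims = cosine_similarity(vec.transform([sentence]), X_train)[0]
        neighbours = np.argsort(sims)[::-1][:k]          # k most similar sentences
        avg_las = train_scores[neighbours].mean(axis=0)  # average LAS over the k
        best = np.flatnonzero(avg_las == avg_las.max())  # ties are possible
        return candidate_vectors[best[0]]                # naive tie-breaking

    print(predict_vector("A short test sentence ."))

The oracle variant described below corresponds to adding the test sentences, with their known per-vector LAS, to train_sents and train_scores, so that each query retrieves itself.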
When retrieving treebanks (tr-tr), we use the average of the treebank's sentence representation vectors as the treebank representation, and we normalise the vectors to the unit sphere, as otherwise the size of the treebank would dominate the location in vector space.
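As a sketch of this step (plain numpy; the function name is ours):

    import numpy as np

    def treebank_vector(sentence_vectors):
        # average the sentence vectors, then project to the unit sphere so
        # that treebank size does not dominate the location in vector space
        v = np.asarray(sentence_vectors, dtype=float).mean(axis=0)
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v

    small = treebank_vector(np.random.rand(100, 50))     # 100-sentence treebank
    large = treebank_vector(np.random.rand(10000, 50))   # 10,000-sentence treebank
    print(np.linalg.norm(small), np.linalg.norm(large))  # both 1.0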
We include oracle versions of each k-NN model in our experiments. The k-NN oracle method differs from the normal k-NN method in that the test data is added to the training data, so that the test data itself will be retrieved. This means that a k-NN oracle with k = 1 knows exactly which treebank vector is best for each test item, while a basic k-NN model has to predict the best vector based on the training data. In the tr-tr setting, our k-NN classifier selects one of three treebanks for the fourth test treebank. In the oracle k-NN setting, it selects the test treebank itself and parses the sentences in that treebank with its best-performing treebank vector. When the treebank vector sample space is limited to the vectors for the three training treebanks (fixed), this method is the same as the best-proxy method of Stymne et al. (2018).

The development results, averaged over the four development sets for each language, are shown in Tables 1 and 2. (To reduce noise from random initialisation, we parse each development set nine times with nine different seeds and use the median LAS.) As discussed above, upper bounds for k-NN prediction are calculated by including an oracle setting in which the query item is added to the set of items to be retrieved, with k restricted to 1. We are also curious to see what happens when an equal combination of the three fixed vectors (the uniform weight vector) is used (equal), and when treebank vectors are selected at random.

Model (se-se)   Weights    Cs     En     Fr
random          fixed      82.5   73.4   72.1
random          α_t ≥ 0
k-NN            α_t ≥ 0
oracle k-NN     α_t ≥ 0

Table 1: Development set LAS with per-sentence treebank vectors.

Table 1 shows the se-se results. The top section shows the results of randomly selecting a sentence's treebank vector, the middle section shows the k-NN results and the bottom section the oracle k-NN results. The k-NN predictor clearly outperforms the random predictor for English and French, but not for Czech, suggesting that the treebank vector itself plays less of a role for Czech, perhaps due to high domain overlap between the treebanks. The oracle k-NN results indicate not only the substantial room for improvement for the predictor, but also the potential of interpolated vectors, since the results improve as the sample space is increased beyond the three fixed vectors.

Model (tr-tr)   Weights            Cs     En     Fr
proxy-best      fixed              82.7   74.7   73.8
proxy-worst     fixed              82.3   72.4   70.7
k-NN            fixed
k-NN            α_t ≥ 0
k-NN            any                82.7   74.5   73.8
oracle k-NN     fixed              82.7   74.7   73.8
oracle k-NN     α_t ≥ 0
equal           (1/3, 1/3, 1/3)

Table 2: Development set LAS with one treebank vector for all input sentences.

Table 2 shows the tr-tr results. The first section is the proxy treebank embedding of Stymne et al. (2018), where one of the fixed treebank vectors is used for parsing the development set; we report the best- and worst-performing of the three (proxy-best and proxy-worst). The k-NN methods are shown in the second section of Table 2. The first row of this section (fixed weights) can be directly compared with proxy-best. For Czech and French, the k-NN method matches the performance of proxy-best. For English, it comes close: examining the per-treebank English results, k-NN predicts the best proxy treebank for all but en_partut, where it picks the second best (en_gum) instead of the best (en_ewt).

The oracle k-NN results are shown in the third section of Table 2. (Recall that the first method in this section, oracle fixed, is the same method as proxy-best.) Although less pronounced than for the more difficult se-se task, they indicate that there is still some room for improving the vector predictor at the document level if interpolated vectors are considered.

Our equal method, which uses the weights (1/3, 1/3, 1/3), is shown in the last row of Table 2. It is the overall best English model. Our best model for Czech is a tr-tr model which just selects from the three fixed treebank vectors. For French, the best is a tr-tr model which selects from interpolated vectors with positive weights. For the PUD languages not used in development, we select the hyper-parameters based on average LAS on all 12 development sets. The resulting generic hyper-parameters are the same as those for the best French model: tr-tr with interpolated vectors and positive weights.

Language   m   proxy-worst   proxy-best   generic   language-specific
cs         4   81.6
en         4   76.4          †            †
es         2   76.1                                 –
fi         2   52.5
it         3   84.4                                 –
ko         2   35.5          43.9                   –
pt         2   74.6          77.4                   –
ru         3   82.6                                 –

Table 3: PUD test set results. Statistically significant differences between proxy-best and our best method are marked with †.

The PUD test set results are shown in Table 3. For nine out of ten languages, we match the oracle method proxy-best within a 95% confidence interval. For Russian, the treebank vector of the second-best proxy treebank is chosen, falling 0.8 LAS points behind; still, this difference is not significant (p = 0.055). For English, the generic model also picks the second-best proxy treebank. For Korean PUD, LAS scores are surprisingly low given that development results on ko_gsd and ko_kaist are above 76.5 for all seeds. A run with a mono-treebank model confirms the low performance on Korean PUD. According to a reviewer, there are known differences in the annotation between the Korean UD treebanks.
In experiments with Czech, English and French, we investigated treebank embedding vectors, exploring the ideas of interpolated vectors and vector weight prediction. Our attempts to predict good vector weights using a simple regression model yielded encouraging results. Testing on PUD languages, we match the performance of using the best fixed treebank embedding vector in nine of ten cases within the bounds of statistical significance (tested with udapi-python, https://github.com/udapi/udapi-python) and in five cases exactly match it. While the k-NN models selected for final testing use char-n-gram-based sentence representations, ELMo representations are competitive.
On the whole, it seems that our predictor is not yet good enough to find interpolated treebank vectors that are clearly superior to the basic, fixed vectors and that we know to exist from the oracle runs. Still, we think it is encouraging that performance did not drop substantially when the set of candidate vectors was widened (α_t ≥ 0 and 'any'). We do not think the superior treebank vectors found by the oracle runs are simply noise, i.e. model fluctuations due to varied inputs, because the LAS landscape in the weight vector space is not noisy: for individual sentences, LAS is usually constant in large areas and there are clear, sharp steps to the next LAS level. Therefore, we think that there is room for improvement for the predictor to find interpolated vectors which are better than the fixed ones. We plan to explore other methods to predict treebank vectors, e.g. neural sequence modelling, and to apply our ideas to the related task of language embedding prediction for zero-shot learning.

Another area for future work is to explore what information treebank vectors encode. The previous work on the use of treebank vectors in mono- and multi-lingual parsing suggests that treebank vectors encode information that enables the parser to select treebank-specific information where needed while also taking advantage of treebank-independent information available in the training data. The type of information will depend on the selection of treebanks, e.g. in a polyglot setting the vector may simply encode the language, and in a monolingual setting such as ours it may encode annotation or domain differences between the treebanks.

Interpolating treebank vectors adds a layer of opacity, and, in future work, it would be interesting to carry out experiments with synthetic data, e.g. varying the number of unknown words, to get a better understanding of what they may be capturing. Future work should also test even simpler strategies which do not use the LAS of previous parses to gauge the best treebank vector, e.g. always picking the largest treebank.

Acknowledgments
This research is supported by Science Foundation Ireland through the ADAPT Centre for Digital Content Technology, which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. We thank the reviewers for their inspiring questions and detailed feedback.

References
Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 349-359, Lisbon, Portugal. Association for Computational Linguistics.

Wanxiang Che, Jiang Guo, Yuxuan Wang, Bo Zheng, Huaipeng Zhao, Yang Liu, Dechuan Teng, and Ting Liu. 2017. The HIT-SCIR system for end-to-end parsing of universal dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 52-62. Association for Computational Linguistics.

Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 55-64, Brussels, Belgium. Association for Computational Linguistics.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740-750. Association for Computational Linguistics.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313-327.

Miryam de Lhoneux, Sara Stymne, and Joakim Nivre. 2017. Arc-hybrid non-projective dependency parsing with a static-dynamic oracle. In Proceedings of the 15th International Conference on Parsing Technologies, pages 99-104, Pisa, Italy. Association for Computational Linguistics.

Joakim Nivre, Mitchell Abrams, Željko Agić, Lars Ahrenberg, Lene Antonsen, Katya Aplonova, Maria Jesus Aranzabe, Gashaw Arutie, Masayuki Asahara, Luma Ateyah, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, Elena Badmaeva, Miguel Ballesteros, Esha Banerjee, Sebastian Bank, Verginica Barbu Mititelu, Victoria Basmov, John Bauer, Sandra Bellato, Kepa Bengoetxea, Yevgeni Berzak, Irshad Ahmad Bhat, Riyaz Ahmad Bhat, Erica Biagetti, Eckhard Bick, Rogier Blokland, Victoria Bobicev, Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Bowman, Adriane Boyd, Aljoscha Burchardt, Marie Candito, Bernard Caron, Gauthier Caron, Gülşen Cebiroğlu Eryiğit, Flavio Massimiliano Cecchini, Giuseppe G. A. Celano, Slavomír Čéplö, Savas Cetin, Fabricio Chalub, Jinho Choi, Yongseok Cho, Jayeol Chun, Silvie Cinková, Aurélie Collomb, Çağrı Çöltekin, Miriam Connor, Marine Courtin, Elizabeth Davidson, Marie-Catherine de Marneffe, Valeria de Paiva, Arantza Diaz de Ilarraza, Carly Dickerson, Peter Dirix, Kaja Dobrovoljc, Timothy Dozat, Kira Droganova, Puneet Dwivedi, Marhaba Eli, Ali Elkahky, Binyam Ephrem, Tomaž Erjavec, Aline Etienne, Richárd Farkas, Hector Fernandez Alcalde, Jennifer Foster, Cláudia Freitas, Katarína Gajdošová, Daniel Galbraith, Marcos Garcia, Moa Gärdenfors, Sebastian Garza, Kim Gerdes, Filip Ginter, Iakes Goenaga, Koldo Gojenola, Memduh Gökırmak, Yoav Goldberg, Xavier Gómez Guinovart, Berta Gonzáles Saavedra, Matias Grioni, Normunds Grūzītis, Bruno Guillaume, Céline Guillot-Barbance, Nizar Habash, Jan Hajič, Jan Hajič jr., Linh Hà Mỹ, Na-Rae Han, Kim Harris, Dag Haug, Barbora Hladká, Jaroslava Hlaváčová, Florinel Hociung, Petter Hohle, Jena Hwang, Radu Ion, Elena Irimia, Ọlájídé Ishola, Tomáš Jelínek, Anders Johannsen, Fredrik Jørgensen, Hüner Kaşıkara, Sylvain Kahane, Hiroshi Kanayama, Jenna Kanerva, Boris Katz, Tolga Kayadelen, Jessica Kenney, Václava Kettnerová, Jesse Kirchner, Kamil Kopacewicz, Natalia Kotsyba, Simon Krek, Sookyoung Kwak, Veronika Laippala, Lorenzo Lambertino, Lucia Lam, Tatiana Lando, Septina Dian Larasati, Alexei Lavrentiev, John Lee, Phương Lê Hồng, Alessandro Lenci, Saran Lertpradit, Herman Leung, Cheuk Ying Li, Josie Li, Keying Li, KyungTae Lim, Nikola Ljubešić, Olga Loginova, Olga Lyashevskaya, Teresa Lynn, Vivien Macketanz, Aibek Makazhanov, Michael Mandl, Christopher Manning, Ruli Manurung, Cătălina Mărănduc, David Mareček, Katrin Marheinecke, Héctor Martínez Alonso, André Martins, Jan Mašek, Yuji Matsumoto, Ryan McDonald, Gustavo Mendonça, Niko Miekka, Margarita Misirpashayeva, Anna Missilä, Cătălin Mititelu, Yusuke Miyao, Simonetta Montemagni, Amir More, Laura Moreno Romero, Keiko Sophie Mori, Shinsuke Mori, Bjartur Mortensen, Bohdan Moskalevskyi, Kadri Muischnek, Yugo Murawaki, Kaili Müürisep, Pinkey Nainwani, Juan Ignacio Navarro Horñiacek, Anna Nedoluzhko, Gunta Nešpore-Bērzkalne, Lương Nguyễn Thị, Huyền Nguyễn Thị Minh, Vitaly Nikolaev, Rattima Nitisaroj, Hanna Nurmi, Stina Ojala, Adédayọ̀ Olúòkun, Mai Omura, Petya Osenova, Robert Östling, Lilja Øvrelid, Niko Partanen, Elena Pascual, Marco Passarotti, Agnieszka Patejuk, Guilherme Paulino-Passos, Siyao Peng, Cenel-Augusto Perez, Guy Perrier, Slav Petrov, Jussi Piitulainen, Emily Pitler, Barbara Plank, Thierry Poibeau, Martin Popel, Lauma Pretkalniņa, Sophie Prévost, Prokopis Prokopidis, Adam Przepiórkowski, Tiina Puolakainen, Sampo Pyysalo, Andriela Rääbis, Alexandre Rademaker, Loganathan Ramasamy, Taraka Rama, Carlos Ramisch, Vinit Ravishankar, Livy Real, Siva Reddy, Georg Rehm, Michael Rießler, Larissa Rinaldi, Laura Rituma, Luisa Rocha, Mykhailo Romanenko, Rudolf Rosa, Davide Rovati, Valentin Roca, Olga Rudina, Jack Rueter, Shoval Sadde, Benoît Sagot, Shadi Saleh, Tanja Samardžić, Stephanie Samson, Manuela Sanguinetti, Baiba Saulīte, Yanin Sawanakunanon, Nathan Schneider, Sebastian Schuster, Djamé Seddah, Wolfgang Seeker, Mojgan Seraji, Mo Shen, Atsuko Shimada, Muh Shohibussirri, Dmitry Sichinava, Natalia Silveira, Maria Simi, Radu Simionescu, Katalin Simkó, Mária Šimková, Kiril Simov, Aaron Smith, Isabela Soares-Bastos, Carolyn Spadine, Antonio Stella, Milan Straka, Jana Strnadová, Alane Suhr, Umut Sulubacak, Zsolt Szántó, Dima Taji, Yuta Takahashi, Takaaki Tanaka, Isabelle Tellier, Trond Trosterud, Anna Trukhina, Reut Tsarfaty, Francis Tyers, Sumire Uematsu, Zdeňka Urešová, Larraitz Uria, Hans Uszkoreit, Sowmya Vajjala, Daniel van Niekerk, Gertjan van Noord, Viktor Varga, Eric Villemonte de la Clergerie, Veronika Vincze, Lars Wallin, Jing Xian Wang, Jonathan North Washington, Seyi Williams, Mats Wirén, Tsegay Woldemariam, Tak-sum Wong, Chunxiao Yan, Marat M. Yavrumyan, Zhuoran Yu, Zdeněk Žabokrtský, Amir Zeldes, Daniel Zeman, Manying Zhang, and Hanzhi Zhu. 2018. Universal Dependencies 2.3. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 1659-1666, Paris, France. European Language Resources Association (ELRA).

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227-2237, New Orleans, Louisiana. Association for Computational Linguistics.

Motoki Sato, Hitoshi Manabe, Hiroshi Noji, and Yuji Matsumoto. 2017. Adversarial training for cross-domain universal dependency parsing. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 71-79. Association for Computational Linguistics.

Tianze Shi, Felix G. Wu, Xilun Chen, and Yao Cheng. 2017. Combining global models for parsing universal dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 31-39. Association for Computational Linguistics.

Sara Stymne, Miryam de Lhoneux, Aaron Smith, and Joakim Nivre. 2018. Parser training with heterogeneous treebanks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 619-625. Association for Computational Linguistics.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1-21, Brussels, Belgium. Association for Computational Linguistics.

Daniel Zeman, Martin Popel, Milan Straka, Jan Hajic, Joakim Nivre, Filip Ginter, Juhani Luotolahti, Sampo Pyysalo, Slav Petrov, Martin Potthast, Francis Tyers, Elena Badmaeva, Memduh Gokirmak, Anna Nedoluzhko, Silvie Cinkova, Jan Hajic jr., Jaroslava Hlavacova, Václava Kettnerová, Zdenka Uresova, Jenna Kanerva, Stina Ojala, Anna Missilä, Christopher D. Manning, Sebastian Schuster, Siva Reddy, Dima Taji, Nizar Habash, Herman Leung, Marie-Catherine de Marneffe, Manuela Sanguinetti, Maria Simi, Hiroshi Kanayama, Valeria dePaiva, Kira Droganova, Héctor Martínez Alonso, Çağrı Çöltekin, Umut Sulubacak, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Georg Rehm, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, Jesse Kirchner, Hector Fernandez Alcalde, Jana Strnadová, Esha Banerjee, Ruli Manurung, Antonio Stella, Atsuko Shimada, Sookyoung Kwak, Gustavo Mendonca, Tatiana Lando, Rattima Nitisaroj, and Josie Li. 2017. CoNLL 2017 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics.