Creating a Universal Dependencies Treebank of Spoken Frisian-Dutch Code-switched Data
Anouck Braggaar
University of Groningen
Rob van der Goot
IT University of Copenhagen
Abstract
This paper explores the difficulties of annotating transcribed spoken Dutch-Frisian code-switched utterances with Universal Dependencies. We make use of data from the FAME! corpus, which consists of transcriptions and audio data. Besides the usual annotation difficulties, this dataset is extra challenging because Frisian is low-resource and because of the informal nature of the data, the code-switching and the non-standard sentence segmentation. As a starting point, two annotators annotated 150 random utterances in three stages of 50 utterances. After each stage, disagreements were discussed and resolved. An increase of 7.8 UAS and 10.5 LAS points was achieved between the first and third round. This paper focuses on the issues that arise when annotating a transcribed speech corpus. To resolve these issues, several solutions are proposed.
A key component in developing low-resource dependency parsers for a specific language type is an evaluation treebank. In this paper we focus on the low-resource language West Frisian, which is spoken in the Netherlands. We use the FAME! project dataset, which was created from broadcasts of Omrop Fryslân (a Frisian radio broadcaster) (Yilmaz et al., 2016). Not only are we dealing with a low-resource language, we are also dealing with a spontaneous speech dataset that contains code-switching between Frisian and Dutch (the main language spoken in the Netherlands). As Yilmaz et al. (2016) also mention, code-switching often occurs due to the influence of Dutch.
In this paper we elaborate on the issues that arise when annotating spoken code-switched language with Universal Dependencies (Nivre et al., 2020). As a starting point, we randomly selected 150 utterances from the FAME! corpus that contain at least one code-switching point. We follow the utterance segmentation found in the corpus. The code-switches were already annotated in these utterances by the annotators of the corpus (Yilmaz et al., 2016). The final goal is to use the annotated data to evaluate a dependency parser for this low-resource (and in this case spoken) language. We discuss the issues that arose during annotation and propose possible solutions.
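The selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the representation of an utterance as a dictionary with a per-token language tag list ("langs"), the tag values "fy" and "nl", the function names and the fixed seed are all hypothetical and are not taken from the FAME! corpus format or from our actual scripts.

```python
import random


def has_code_switch(langs):
    """True if the per-token language tags contain more than one language,
    i.e. the utterance has at least one code-switching point."""
    return len(set(langs)) > 1


def sample_batches(utterances, n_total=150, batch_size=50, seed=0):
    """Randomly pick n_total code-switched utterances and split them into batches.

    `utterances` is a list of dicts with a hypothetical "langs" field
    (one language tag per token). The seed is fixed only for reproducibility.
    """
    candidates = [u for u in utterances if has_code_switch(u["langs"])]
    rng = random.Random(seed)
    chosen = rng.sample(candidates, n_total)  # raises ValueError if too few candidates
    return [chosen[i:i + batch_size] for i in range(0, n_total, batch_size)]


# Toy corpus: every second utterance contains a Frisian-Dutch switch.
corpus = [
    {"id": i, "langs": ["fy", "nl", "fy"] if i % 2 == 0 else ["fy", "fy"]}
    for i in range(400)
]
batches = sample_batches(corpus)
print([len(b) for b in batches])  # three batches of 50 utterances each
```

Sampling without replacement from the filtered candidates keeps the three annotation rounds disjoint, so agreement in a later round is never measured on utterances that were already discussed.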
We are not the first to annotate spoken data. Previous work has annotated English for conversation agents (Davidson et al., 2019), Slovenian data (Dobrovoljc and Martinc, 2018), Komi-Zyrian (Partanen et al., 2018) and Turkish-German (Çetinoğlu and Çöltekin, 2019). Commonly mentioned problems are disfluencies and sentence segmentation (Dobrovoljc and Martinc, 2018). Two main types of solutions can be identified: adapting the existing guidelines (Çetinoğlu and Çöltekin, 2019) versus extending them (Davidson et al., 2019).
Previous research also focuses on creating treebanks for code-switched data. Çetinoğlu and Çöltekin (2019) focus on the issues that arise when annotating a spoken Turkish-German code-switched treebank and make a distinction between issues that are specific to code-switching and issues that are related to spoken language. They conclude that they use dependencies that rarely occur in monolingual Turkish or German treebanks. Seddah et al. (2020) create a treebank for an Arabic dialect which contains a high amount of code-switching and language variation. Partanen et al. (2018) create a spoken treebank for Komi-Zyrian with code-switching to Russian. They argue that some language-specific issues might be difficult to fully address with Universal Dependencies.
          POS    UAS    LAS
Round 1   69.5   72.3   60.9
Round 2   87.1   76.1   64.6
Round 3   89.7   80.1   71.4
Table 1: POS, UAS and LAS scores between the two annotators.
Overall, we tried to closely follow the existing Universal Dependencies guidelines and the existing annotations of the Dutch Alpino and LassySmall treebanks (Van der Beek et al., 2002; Van Noord et al., 2013). But as we will show, some phenomena in spoken language are not easy to annotate with an appropriate label.
We both annotated all 150 randomly selected utterances from scratch. After every batch of 50 utterances we discussed issues and adjusted our annotation scheme. We report accuracy over universal part-of-speech (POS) tags, Unlabelled Attachment Score (UAS) and Labelled Attachment Score (LAS) (Zeman et al., 2018) in Table 1. Even though the scores improve over time, the final agreements are still relatively low; previous work on social media data reached an LAS of 84 (Liu et al., 2018), and for code-switched data an LAS of 92 is reported (Bhat et al., 2017).
We found that the four main sources of disagreement were 1) difficulties with ungrammatical constructions, 2) sentence segmentation, 3) interpretation of the utterances (ambiguity), and 4) the annotators having to learn the guidelines. In fact, there were very few issues with the code-switching aspect of this data. The reason for this could be that only very short parts of the utterance (e.g. only a content word) are switched and that there is a high degree of resemblance between the two languages (Wolf, 1996), so that the switches are not directly an issue while annotating. In the following section, we discuss how we overcame issues in the first and second sources of disagreement.
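To make the three agreement scores concrete, the sketch below computes them for two annotations of the same tokens. It is a minimal illustration, not our evaluation script: the Token structure and the function name are hypothetical, and it assumes that both annotators share the same tokenization and that scores are averaged over tokens. Under that assumption POS is the percentage of tokens with the same tag, UAS the percentage with the same head, and LAS the percentage with the same head and the same relation label.

```python
from typing import NamedTuple, Sequence


class Token(NamedTuple):
    upos: str    # universal part-of-speech tag
    head: int    # 1-based index of the head token, 0 for the root
    deprel: str  # dependency relation label


def agreement(a: Sequence[Token], b: Sequence[Token]) -> dict:
    """Token-level agreement (in percent) between two annotations of the same tokens."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must cover the same, non-empty token sequence")
    n = len(a)
    pos = sum(x.upos == y.upos for x, y in zip(a, b))
    uas = sum(x.head == y.head for x, y in zip(a, b))
    las = sum(x.head == y.head and x.deprel == y.deprel for x, y in zip(a, b))
    return {"POS": 100.0 * pos / n, "UAS": 100.0 * uas / n, "LAS": 100.0 * las / n}


# Toy four-token utterance: the annotators disagree on one label and one head.
annotator_a = [
    Token("PRON", 2, "nsubj"),
    Token("VERB", 0, "root"),
    Token("ADJ", 4, "amod"),
    Token("NOUN", 2, "obj"),
]
annotator_b = [
    Token("PRON", 2, "expl"),   # same head, different label: counts for UAS, not LAS
    Token("VERB", 0, "root"),
    Token("ADJ", 4, "amod"),
    Token("NOUN", 3, "obj"),    # different head: counts for neither UAS nor LAS
]
print(agreement(annotator_a, annotator_b))
# {'POS': 100.0, 'UAS': 75.0, 'LAS': 50.0}
```

Because a token only counts towards LAS if its head also matches, LAS can never exceed UAS, which is consistent with every row of Table 1.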
In this section we discuss the two most common sources of disagreement we encountered. The first example shows a phenomenon that occurred often and was often annotated differently:
benammen eh foarsitter eh Van Raaij dy hat eh oare plannen
especially eh chairman eh Van Raaij whom has eh other plans
Relations: benammen/amod, eh/discourse, foarsitter/nsubj, eh/discourse, Van/appos, Raaij/flat:name, dy/expl, hat/root, eh/discourse, oare/amod, plannen/obj
A common source of confusion arose when something or someone is referred to by name (in this case "foarsitter Van Raaij") and is later referred to again with a relative pronoun ("dy"). There are multiple ways to annotate this. Our first choice was to annotate "foarsitter Van Raaij" as the subject and "dy" as being in a determiner relation. A second option would be to annotate "dy" as the subject of the sentence and the other part as being in an appos relation, defining "dy". Eventually, we chose to annotate "dy" as expletive and keep "foarsitter Van Raaij" as the subject.
The second example shows a couple of different issues that are specific to spoken language:
hoe dan ek jongens dy moties dy eh dy moatte der trochkomme en
anyways guys those motions they eh they have there come through and
Relations: hoe/discourse, dan/fixed, ek/fixed, jongens/vocative, dy/det, moties/nsubj, dy/reparandum, eh/discourse, dy/expl, moatte/aux, der/advmod, trochkomme/root, en/dislocated
The first thing to notice is that the utterance starts with "hoe dan ek", an expression that is mainly used in spoken language. We therefore decided to label it as discourse. The most striking phenomenon in this utterance is that it does not seem to have a proper ending: it appears to continue after "en" ("and"). This happens frequently, both because the data is spoken and because of the way the utterances are segmented. We have chosen to attach these elements to the root with the dislocated label. Normally, there would be elements to the right to which most of these dislocated elements would attach.
In this paper we have discussed some of the issues that arise when annotating a spoken code-switched treebank. As discussed, we follow the general Universal Dependencies guidelines and the existing Dutch annotations. The annotation is still work in progress, and our LAS and UAS scores leave room for improvement. Eventually we would like to annotate a larger number of utterances that can be used to evaluate a dependency parser in a low-resource setup.
References
Leonoor Van der Beek, Gosse Bouma, Rob Malouf, and Gertjan Van Noord. 2002. The Alpino dependency treebank. In Computational Linguistics in the Netherlands 2001, pages 8–22. Brill Rodopi.
Irshad Bhat, Riyaz A. Bhat, Manish Shrivastava, and Dipti Sharma. 2017. Joining hands: Exploiting monolingual treebanks for parsing of code-mixing data. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 324–330, Valencia, Spain. Association for Computational Linguistics.
Özlem Çetinoğlu and Çağrı Çöltekin. 2019. Challenges of annotating a code-switching treebank. In Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), pages 82–90, Paris, France. Association for Computational Linguistics.
Sam Davidson, Dian Yu, and Zhou Yu. 2019. Dependency parsing for spoken dialog systems. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1513–1519, Hong Kong, China. Association for Computational Linguistics.
Kaja Dobrovoljc and Matej Martinc. 2018. Er ... well, it matters, right? On the role of data representations in spoken language dependency parsing. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 37–46, Brussels, Belgium. Association for Computational Linguistics.
Yijia Liu, Yi Zhu, Wanxiang Che, Bing Qin, Nathan Schneider, and Noah A. Smith. 2018. Parsing tweets into Universal Dependencies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 965–975, New Orleans, Louisiana. Association for Computational Linguistics.
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4034–4043, Marseille, France. European Language Resources Association.
Niko Partanen, Rogier Blokland, KyungTae Lim, Thierry Poibeau, and Michael Rießler. 2018. The first Komi-Zyrian Universal Dependencies treebanks. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 126–132, Brussels, Belgium. Association for Computational Linguistics.
Djamé Seddah, Farah Essaidi, Amal Fethi, Matthieu Futeral, Benjamin Muller, Pedro Javier Ortiz Suárez, Benoît Sagot, and Abhishek Srivastava. 2020. Building a user-generated content North-African Arabizi treebank: Tackling hell. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1139–1150, Online. Association for Computational Linguistics.
Gertjan Van Noord, Gosse Bouma, Frank Van Eynde, Daniel De Kok, Jelmer Van der Linde, Ineke Schuurman, Erik Tjong Kim Sang, and Vincent Vandeghinste. 2013. Large scale syntactic annotation of written Dutch: Lassy. In Essential Speech and Language Technology for Dutch, pages 147–164. Springer, Berlin, Heidelberg.
Henk Wolf. 1996. Structural neutrality in Frisian-Dutch interaction. Us Wurk, 45(3-4):125–138.
Emre Yilmaz, Maaike Andringa, Sigrid Kingma, Jelske Dijkstra, F Kuip, H Velde, Frederik Kampstra, Jouke Algra, H Heuvel, and David A van Leeuwen. 2016. A longitudinal bilingual Frisian-Dutch radio broadcast database designed for code-switching research.
Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. In