Creating a Universal Dependencies Treebank of Spoken Frisian-Dutch Code-switched Data

Abstract

This paper explores the difficulties of annotating transcribed spoken Dutch-Frisian code-switched utterances with Universal Dependencies. We make use of data from the FAME! corpus, which consists of transcriptions and audio data. Besides the usual annotation difficulties, this dataset is extra challenging because Frisian is low-resource and because of the informal nature of the data, code-switching and non-standard sentence segmentation. As a starting point, two annotators annotated 150 random utterances in three stages of 50 utterances. After each stage, disagreements were discussed and resolved. An increase of 7.8 UAS and 10.5 LAS points was achieved between the first and third round. This paper focuses on the issues that arise when annotating a transcribed speech corpus. To resolve these issues, several solutions are proposed.



Anouck Braggaar

University of Groningen

Rob van der Goot

IT University of Copenhagen


A key component in developing low-resource dependency parsers for a specific language type is an evaluation treebank. In this paper we focus on the low-resource language West Frisian, which is spoken in the Netherlands. We use the FAME! project dataset, which was created from broadcasts of Omrop Fryslân, a Frisian radio broadcaster (Yilmaz et al., 2016). Not only are we dealing with a low-resource language, we are also dealing with a spontaneous speech dataset that contains code-switching between Frisian and Dutch (the main language spoken in the Netherlands). As Yilmaz et al. (2016) also mention, code-switching often occurs due to the influence of Dutch.

In this paper we elaborate on the issues that arise when annotating spoken code-switched language with Universal Dependencies (Nivre et al., 2020). As a starting point, we randomly selected 150 utterances from the FAME! corpus that contain at least one code-switching point. We followed the utterance segmentation found in the corpus. The code-switches in these utterances had already been annotated by the annotators of the corpus (Yilmaz et al., 2016). The final goal is to use the annotated data to evaluate a dependency parser for this low-resource (and in this case spoken) language. In this paper we discuss issues that arose during the annotation and propose possible solutions for them.

We are not the first to annotate spoken data. Previous work has annotated English for conversation agents (Davidson et al., 2019), Slovenian data (Dobrovoljc and Martinc, 2018), Komi-Zyrian (Partanen et al., 2018) and Turkish-German (Çetinoğlu and Çöltekin, 2019). Commonly mentioned problems are disfluencies and sentence segmentation (Dobrovoljc and Martinc, 2018). Two main types of solutions can be identified: adapting the existing guidelines (Çetinoğlu and Çöltekin, 2019) versus extending them (Davidson et al., 2019).

Previous research also focuses on creating treebanks for code-switched data. Çetinoğlu and Çöltekin (2019) focus on the issues that arise when annotating a spoken Turkish-German code-switched treebank and make a distinction between issues that are code-switch specific and issues related to spoken language. They conclude that they use dependencies that rarely occur in monolingual Turkish or German treebanks. Seddah et al. (2020) create a treebank for an Arabic dialect which contains a high amount of code-switching and language variation. Partanen et al. (2018) create a spoken treebank for Komi-Zyrian with code-switching to Russian. They argue that some language-specific issues might be difficult to fully address with Universal Dependencies.

          POS    UAS    LAS
Round 1   69.5   72.3   60.9
Round 2   87.1   76.1   64.6
Round 3   89.7   80.1   71.4

Table 1: POS, UAS and LAS scores between the two annotators.

Overall, we tried to closely follow the existing Universal Dependencies guidelines and the existing annotations of the Dutch Alpino and LassySmall treebanks (Van der Beek et al., 2002; Van Noord et al., 2013). But as we will show, some phenomena in spoken language may not be easy to annotate with an appropriate label.

We both annotated a total of 150 randomly selected utterances from scratch. After every batch of 50 utterances we discussed issues and adjusted our annotation scheme. We report accuracy over Universal part-of-speech tags, Unlabelled Attachment Score (UAS) and Labelled Attachment Score (LAS) (Zeman et al., 2018) in Table 1. Even though the scores improve over time, the final agreements are still relatively low; previous work on social media data reached an LAS of 84 (Liu et al., 2018), and for code-switched data an LAS of 92 is reported (Bhat et al., 2017).

We found that the four main sources of disagreement were 1) difficulties with ungrammatical constructions, 2) sentence segmentation, 3) interpretation of the utterances (ambiguity), and 4) annotators having to learn the guidelines. In fact, there were very few issues with the code-switching aspect of this data. The reason for this could be that only very short parts of the utterance (e.g. only a content word) are switched and that there is a high degree of resemblance between the two languages (Wolf, 1996), so that the switches are not directly an issue while annotating. In the following section, we discuss how we overcame issues in the first and second sources of disagreement.
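To make the agreement scores in Table 1 concrete, the following is a minimal sketch of how POS accuracy, UAS and LAS between two annotators could be computed. It assumes both annotators worked on identical tokenization, so that each score reduces to a simple proportion of matching tokens. The `Token` type, the `agreement` function and the three-token toy annotations are hypothetical illustrations, not the evaluation script or data used for this paper.

```python
from typing import Dict, NamedTuple, Sequence


class Token(NamedTuple):
    form: str
    upos: str    # universal part-of-speech tag
    head: int    # 1-based index of the head token, 0 for the root
    deprel: str  # dependency relation to the head


def agreement(a: Sequence[Token], b: Sequence[Token]) -> Dict[str, float]:
    """Percentage agreement between two annotations of the same tokens."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must cover the same non-empty token sequence")
    n = len(a)
    pairs = list(zip(a, b))
    pos = sum(x.upos == y.upos for x, y in pairs)
    # UAS: both annotators chose the same head.
    uas = sum(x.head == y.head for x, y in pairs)
    # LAS: same head and same relation label.
    las = sum(x.head == y.head and x.deprel == y.deprel for x, y in pairs)
    return {"POS": 100 * pos / n, "UAS": 100 * uas / n, "LAS": 100 * las / n}


# Hypothetical toy annotations (not taken from the FAME! data).
annotator_a = [
    Token("she", "PRON", 2, "nsubj"),
    Token("has", "VERB", 0, "root"),
    Token("plans", "NOUN", 2, "obj"),
]
annotator_b = [
    Token("she", "PRON", 2, "expl"),   # same head, different label
    Token("has", "VERB", 0, "root"),
    Token("plans", "VERB", 2, "obj"),  # different POS tag
]

scores = agreement(annotator_a, annotator_b)
print(scores)
```

In this toy case the two annotators agree on every head but differ on one label and one tag, so UAS is higher than both LAS and POS accuracy.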

In this section we discuss the two most common sources of disagreement we encountered. The first example shows a phenomenon that occurred often and was often annotated differently:

  Utterance: benammen eh foarsitter eh Van Raaij dy hat eh oare plannen
  Gloss:     especially eh chairman eh Van Raaij whom has eh other plans
  Relations: benammen/amod eh/discourse foarsitter/nsubj eh/discourse Van/appos Raaij/flat:name dy/expl hat/root eh/discourse oare/amod plannen/obj

A common source of confusion arose when something or someone is referred to by name (in this case "foarsitter Van Raaij") and is later referred to again with a relative pronoun ("dy"). There are multiple ways to annotate this. Our first choice was to annotate "foarsitter Van Raaij" as the subject and "dy" as being in a determiner relation. A second option would be to annotate "dy" as the subject of the sentence and the other part as being in an appos relation, defining the "dy". Eventually, we chose to annotate "dy" as expletive and keep "foarsitter Van Raaij" as the subject.

The second example shows a couple of different issues that are specific to spoken language:

  Utterance: hoe dan ek jongens dy moties dy eh dy moatte der trochkomme en
  Gloss:     anyways guys those motions they eh they have there come through and
  Relations: hoe/discourse dan/fixed ek/fixed jongens/vocative dy/det moties/nsubj dy/reparandum eh/discourse dy/expl moatte/aux der/advmod trochkomme/root en/dislocated

The first thing to notice is that the utterance starts with "hoe dan ek", an expression that is mainly used in spoken language. Therefore we decided to label it as discourse. The most striking phenomenon in this utterance is that it does not seem to have a proper ending: it trails off with "en" ("and"). This happens a lot, both because the data is spoken and because of the segmentation. We have chosen to attach these elements to the root with the dislocated label. Normally there would be elements on the right to which most of these dislocated elements would attach.
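The convention for trailing elements described above can be sketched as a small rule: any token left without a head is attached to the root with the dislocated label. The `Node` class and `attach_dangling` function are hypothetical names introduced for illustration, and the heads given to "moatte" and "der" in the fragment below are assumptions made for the sake of the example; only the labels and the attachment of "en" to the root follow from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    form: str
    head: Optional[int]    # 1-based head index, 0 = root, None = not yet attached
    deprel: Optional[str]  # None while the token is unattached


def attach_dangling(nodes: List[Node]) -> List[Node]:
    """Attach every headless token to the root as 'dislocated' (modifies nodes in place)."""
    roots = [i for i, node in enumerate(nodes, start=1) if node.head == 0]
    if len(roots) != 1:
        raise ValueError("expected exactly one root token")
    root_index = roots[0]
    for node in nodes:
        if node.head is None:
            node.head = root_index
            node.deprel = "dislocated"
    return nodes


# End of the second example utterance. The heads of "moatte" and "der"
# are assumed here; "en" starts out unattached.
fragment = [
    Node("moatte", 3, "aux"),
    Node("der", 3, "advmod"),
    Node("trochkomme", 0, "root"),
    Node("en", None, None),
]

attach_dangling(fragment)
print(fragment[-1])
```

Attaching to the root keeps the tree connected without inventing material to the right of the utterance boundary, which is where such an element would normally find its head.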

In this paper we have discussed some of the issues that arise when annotating a code-switched spoken treebank. As discussed, we follow the general Universal Dependencies guidelines and the existing Dutch annotations. Annotation is still work in progress, and our LAS and UAS scores leave room for improvement. Eventually we would like to annotate a larger number of utterances that can be used to evaluate a dependency parser in a low-resource setup.

References

Leonoor Van der Beek, Gosse Bouma, Rob Malouf, and Gertjan Van Noord. 2002. The Alpino dependency treebank. In Computational Linguistics in the Netherlands 2001, pages 8–22. Brill Rodopi.

Irshad Bhat, Riyaz A. Bhat, Manish Shrivastava, and Dipti Sharma. 2017. Joining hands: Exploiting monolingual treebanks for parsing of code-mixing data. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 324–330, Valencia, Spain. Association for Computational Linguistics.

Özlem Çetinoğlu and Çağrı Çöltekin. 2019. Challenges of annotating a code-switching treebank. In Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), pages 82–90, Paris, France. Association for Computational Linguistics.

Sam Davidson, Dian Yu, and Zhou Yu. 2019. Dependency parsing for spoken dialog systems. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1513–1519, Hong Kong, China. Association for Computational Linguistics.

Kaja Dobrovoljc and Matej Martinc. 2018. Er ... well, it matters, right? On the role of data representations in spoken language dependency parsing. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 37–46, Brussels, Belgium. Association for Computational Linguistics.

Yijia Liu, Yi Zhu, Wanxiang Che, Bing Qin, Nathan Schneider, and Noah A. Smith. 2018. Parsing tweets into Universal Dependencies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 965–975, New Orleans, Louisiana. Association for Computational Linguistics.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4034–4043, Marseille, France. European Language Resources Association.

Niko Partanen, Rogier Blokland, KyungTae Lim, Thierry Poibeau, and Michael Rießler. 2018. The first Komi-Zyrian Universal Dependencies treebanks. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 126–132, Brussels, Belgium. Association for Computational Linguistics.

Djamé Seddah, Farah Essaidi, Amal Fethi, Matthieu Futeral, Benjamin Muller, Pedro Javier Ortiz Suárez, Benoît Sagot, and Abhishek Srivastava. 2020. Building a user-generated content North-African Arabizi treebank: Tackling hell. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1139–1150, Online. Association for Computational Linguistics.

Gertjan Van Noord, Gosse Bouma, Frank Van Eynde, Daniel De Kok, Jelmer Van der Linde, Ineke Schuurman, Erik Tjong Kim Sang, and Vincent Vandeghinste. 2013. Large scale syntactic annotation of written Dutch: Lassy. In Essential speech and language technology for Dutch, pages 147–164. Springer, Berlin, Heidelberg.

Henk Wolf. 1996. Structural neutrality in Frisian-Dutch interaction. Us Wurk, 45(3-4):125–138.

Emre Yilmaz, Maaike Andringa, Sigrid Kingma, Jelske Dijkstra, F Kuip, H Velde, Frederik Kampstra, Jouke Algra, H Heuvel, and David A van Leeuwen. 2016. A longitudinal bilingual Frisian-Dutch radio broadcast database designed for code-switching research.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. In