Resources for Turkish Dependency Parsing: Introducing the BOUN Treebank and the BoAT Annotation Tool
Utku Türk, Furkan Atmaca, Şaziye Betül Özateş, Gözde Berk, Seyyit Talha Bedir, Abdullatif Köksal, Balkız Öztürk Başaran, Tunga Güngör, Arzucan Özgür
Department of Linguistics, Boğaziçi University
Department of Computer Engineering, Boğaziçi University
February 25, 2020
Abstract
In this paper, we describe our contributions and efforts to develop Turkish resources, which include a new treebank (BOUN Treebank) with novel sentences, along with the guidelines we adopted and a new annotation tool we developed (BoAT). The manual annotation process we employed was shaped and implemented by a team of four linguists and five NLP specialists. Decisions regarding the annotation of the BOUN Treebank were made in line with the Universal Dependencies framework, which originated from the works of De Marneffe et al (2014) and Nivre et al (2016). We took into account the recent unifying efforts based on the re-annotation of other Turkish treebanks in the UD framework (Türk et al, 2019). Through the BOUN Treebank, we introduced a total of 9,757 sentences from various topics including biographical texts, national newspapers, instructional texts, popular culture articles, and essays. In addition, we report the parsing results of a graph-based dependency parser obtained over each text type, the total of the BOUN Treebank, and all Turkish treebanks that we either re-annotated or introduced. We show that a state-of-the-art dependency parser achieves improved scores for identifying the proper head and the syntactic relationships between heads and dependents. In light of these results, we observe that unifying the Turkish annotation scheme and introducing a more comprehensive treebank improve dependency parsing performance. The material regarding our treebank and tool, as well as our R and Python scripts, are available online; the links are provided within the text.

1 Introduction
The field of Natural Language Processing (NLP) has seen an influx of various treebanks following the introduction of the treebanks in Marcus et al (1993), Leech and Garside (1991), and Sampson (1995). These treebanks paved the way for today's ever-growing NLP framework, consisting of NLP applications, treebanks, and tools. Among the many languages with a growing treebank inventory, Turkish, with its rich morpho-syntax, was one of the less fortunate languages. Due to its complex network of inflectional and derivational morphology, as well as its non-strict SOV word order, Turkish has posed an enormous challenge for NLP studies. One of the first attempts to create a structured treebank was initiated in the studies of Atalay et al (2003) and Oflazer et al (2003). Following these studies, many more Turkish treebanking efforts were introduced (among others Megyesi et al, 2010; Sulger et al, 2013; Sulubacak et al, 2016). However, most of these efforts either contained a small volume of Turkish sentences or were reformulations of already existing treebanks.

This paper aims to contribute to the limited NLP resources in Turkish by annotating a part of a brand new corpus that has not been approached from a syntactic perspective before, namely the Turkish National Corpus (henceforth TNC) (Aksan et al, 2012). The TNC is an online corpus that contains 50 million words. The BOUN Treebank, which is introduced in this paper, includes 9,757 previously non-analyzed sentences extracted from five different text types in this corpus, i.e. essays, broadsheet national newspapers, instructional texts, popular culture articles, and biographical texts. We annotated the inflections and POS tags semi-automatically, using a morphological disambiguator (Sak et al, 2008) as an initial filter and later manually checking every word and its morphological representation.
The syntactic dependency relations of the sentences were manually annotated following the up-to-date Universal Dependencies (UD) annotation scheme.

Through a discussion of the annotation decisions made in the creation of the BOUN Treebank, we present our take on one of the most debated Turkish constructions, verbal and nominal clitics, and consequently their syntactic and morphological representations. Even though their unique behavior is observed and accounted for within Turkish linguistic studies, Turkish treebanking studies have avoided addressing such structures, with the exception of Çöltekin (2016).

In addition, we present our efforts to create an annotation tool that integrates a tabular view, a hierarchical tree structure, and extensive morphological editing. We believe that in addition to Turkic languages, other agglutinative languages that pose challenging morphological problems may benefit from this tool.

Lastly, we report the results of an NLP task, namely dependency parsing, where our new treebank and our previous re-annotations are used. The results show that using the UD annotation scheme more faithfully and in a unified manner within Turkish UD treebanks increases the UAS (Unlabeled Attachment Score) F1 and LAS (Labeled Attachment Score) F1 scores. We also report individual scores for different text types within our new treebank.

This paper is organized as follows: In Section (2), we briefly explain the morphological and syntactic properties of Turkish.
In Section (3), we present an extensive review of previous treebanking efforts in Turkish and locate them with regards to each other in terms of their use and their aim. In Section (4), we report the details of the BOUN Treebank, our morphological and syntactic decisions, and our process. We lay out our tool BoAT in Section (5) and, in Section (6), we introduce our experiments and their results. In Section (7), we present our conclusions and discuss the implications of our work.

2 Turkish

Turkish is a Turkic language spoken mainly in Asia Minor and Thrace, with approximately 75 million native speakers. As an agglutinative language, Turkish makes extensive use of morphological concatenation. According to Bickel and Nichols (2013), an average Turkish word may have 8-9 inflectional categories, making Turkish an outlier among the world's languages. The number of morphological categories increases even more when considering derivational processes. Kapan (2019) states that Turkish words may host up to 6 different derivational affixes at the same time. The complexity of morphological analysis, however, is not limited to the sheer number of inflectional and derivational affixes. In addition to such affixes, syncretisms, vowel harmony processes, elisions, and insertions create an arduous task for researchers in Turkish NLP. Table 1 lists the possible morphological analyses of the word alın.

Table 1: Possible morphological analyses of the word alın, from Sak et al (2008). The symbol '-' indicates derivational morphemes, and '+' indicates inflectional morphemes.

Root   Category of the root   Features
alın   [Noun]                 +[A3sg]+[Pnon]+[Nom]
al     [Noun]                 +[A3sg]+Hn[P2sg]+[Nom]
al     [Adj]                  -[Noun]+[A3sg]+Hn[P2sg]+[Nom]
al     [Noun]                 +[A3sg]+[Pnon]+NHn[Gen]
al     [Adj]                  -[Noun]+[A3sg]+[Pnon]+NHn[Gen]
alın   [Verb]                 +[Pos]+[Imp]+[A2sg]
al     [Verb]                 +[Pos]+[Imp]+YHn[A2pl]
al     [Verb]                 -Hn[Verb+Pass]+[Pos]+[Imp]+[A2sg]
The table shows that despite the shortness of the word, the morphological analysis can be toilsome; due to syncretism, even such a short item may be parsed with several possible roots.

With respect to syntactic properties, Turkish has a relatively free word order which is constrained by discourse elements and information structure (Erguvanlı-Taylan, 1986; Hoffman, 1995; İşsever, 2003; Öztürk, 2008; Özsoy, 2019). Even though SOV is the base word order, other permutations are highly utilized, as exemplified below.

(Abbreviations used in the paper are as follows: 1 = first person, ABL = ablative, ACC = accusative, AOR = aorist, COM = comitative, COP = copula, DAT = dative, EMPH = emphasis, FOC = focus, FUT = future, GEN = genitive, LOC = locative, NEC = necessity, NEG = negative, NMLZ = nominalizer, PL = plural, POSS = possessive, PRF = perfect, PST = past, SG = singular.)

(1) a. Fatma  Ahmet'i    gör-dü.  (SOV 48%)
       Fatma  Ahmet-ACC  see-PST
       'Fatma saw Ahmet.'
    b. Ahmet'i Fatma gör-dü.     (OSV 8%)
    c. Fatma gör-dü Ahmet'i.     (SVO 25%)
    d. Ahmet'i gör-dü Fatma.     (OVS 13%)
    e. Gör-dü Fatma Ahmet'i.     (VSO 6%)
    f. Gör-dü Ahmet'i Fatma.     (VOS <1%)
    (adapted from Hoffman, 1995)

As for the case system, every element in a sentence needs to host a case according to its syntactic role, semantic contribution, or the lexical selection of the phrasal head (Erguvanlı-Taylan, 2015). These groupings, however, are not clear-cut, and there is not always a one-to-one correspondence between cases and their roles.

Moreover, Turkish is a pro-drop language in which the subject is almost always elided when it is retrievable from the given discourse (Kornfilt, 1984; Özsoy, 1988). Overt subjects are used only to convey certain semantic effects, such as a change in context or focus. However, the subject is also retrievable from the agreement marker on the verb. In addition to these properties, Turkish is also a null object language, even though the language does not have an overt agreement marker available for this process (Öztürk, 2006). If the object of a sentence is retrievable from the given discourse, speakers may omit the object without any overt marking on the verb. The final issue with Turkish syntax lies in the fact that it frequently makes use of nominalization processes for embedded clauses, which may modify nouns and verbs (Göksel and Kerslake, 2005). This sentence-embedding strategy complicates the annotation process since the final form of the construction is a noun which is derived from a verb. However, these constructions encode complex predication and may act as a subject, object, adjective, adverb, or even a predicate on their own.

3 Previous Treebanking Efforts in Turkish

Following the studies on treebanks for languages such as English, Chinese, Arabic, and many more (Leech and Garside, 1991; Marcus et al, 1993; Sampson, 1995; Maamouri et al, 2004; Xue et al, 2005), the initial groundwork for Turkish treebanks was laid in Atalay et al (2003) and Oflazer et al (2003). The first of its kind, the METU-Sabancı Treebank (MST) consisted of 5,635 sentences, a subset of the METU corpus that included 16 different text types, including newspaper articles, novels, and many more (Say et al, 2002). Both studies encoded morphological complexities and syntactic relations. Due to the productive use of derivational suffixes, they explicitly spelled out every inflection and derivation within a word. As for the syntactic representation, Atalay et al (2003) used a dependency grammar in order to bypass the problem of constituency in Turkish, which arises from the relatively free word order of the language.

Turkish has also been represented in parallel treebanks. The first of these parallel treebanks is the Swedish-Turkish parallel treebank (STPT). Megyesi et al (2008) published their parallel treebank containing 145,000 tokens in Turkish and 160,000 in Swedish. Following this work, Megyesi et al (2010) published the Swedish-Turkish-English parallel treebank (STEPT). This treebank included 300,000 tokens in Swedish, 160,000 tokens in Turkish, and 150,000 tokens in English. Both of these treebanks utilized the same morphological and syntactic parsing tools. For Swedish morphology, the Trigrams 'n' Tags tagger (Brants, 2000), trained on Swedish (Megyesi, 2002), was used. The Turkish data, on the other hand, was first analyzed using the parser in Oflazer (1994), and its accuracy was enhanced through the morphological parser proposed in Yüret and Türe (2006). Both treebanks were annotated using the MaltParser (Nivre et al, 2006a) and were trained with the Swedish treebank Talbanken05 (Nivre et al, 2006b) and the MST (Oflazer et al, 2003), respectively.

Another parallel treebank introduced for Turkish is the PUD, which adopts the UD framework. The Turkish PUD Treebank was published as part of a collaborative effort, the CoNLL 2017 Shared Task on Multilingual Parsing from Raw Text to Universal Dependencies (Zeman et al, 2017). Sentences for this collaborative treebank were drawn from newspapers and Wikipedia. The same 1,000 sentences were translated into more than 40 languages and manually annotated in line with the universal annotation guidelines of Google.

Table 2: Composition of the written component of TNC, adapted from Aksan et al (2012).

Domain               %    Medium                       %
Imaginative          19   Books                        58
Social Science       16   Periodicals                  32
Art                  7    Miscellaneous published      5
Commerce/Finance     8    Miscellaneous unpublished    3
Belief and Thought   4    To-be-spoken                 2
World Affairs        20
Applied Science      8
Natural Science      4
Leisure              14
After the annotation, the Turkish PUD Treebank was automatically converted to the UD style.

Lastly, there are two other independent treebanks. The first is the Grammar Book treebank (GB) introduced in Çöltekin (2015). In this treebank, data were collected from a reference grammar book for Turkish written by Göksel and Kerslake (2005). It includes 2,803 items that are either sentences or sentence fragments from the grammar book. It utilized TRMorph (Çöltekin, 2010) for morphological analyses, and the proper morphological annotations were manually selected among the suggestions proposed by TRMorph. The sentences were manually annotated in the native UD style. The other independent treebank is the Turkish-German Code-Switching Treebank (TGCST) (Çetinoğlu and Çöltekin, 2016). This treebank includes 1,029 bilingual Turkish-German tweets that had already been annotated with respect to the language in use. They also utilized the UD syntactic relation tags to represent dependency relations.

4 The BOUN Treebank

In this paper, we introduce a treebank that consists of 9,757 sentences which form a subset of the Turkish National Corpus (Aksan et al, 2012). The TNC includes 50 million words from various text types, and it encompasses sentences from a 20-year period between 1990 and 2009. Its creators followed the principles of the British National Corpus in their selection of domains. Table 2 shows the percentages of different domains and media used in the TNC.

In our treebank, we included the following text types: essays, broadsheet national newspapers, instructional texts, popular culture articles, and biographical texts. Our treebank is freely available online at https://github.com/boun-tabi/UD_Turkish-BOUN.
Table 3: Inter-annotator agreement per annotator pair, measured for the correct head of the syntactic relations (κ Head) and the correct dependency label of the syntactic relations (κ Label). [Table values not recoverable.]

As mentioned above, Turkish makes use of affixation much more frequently than any other word-formation process. Even though this adds immense complexity to its word-level representation, patterns within the Turkish word-formation process have allowed previous research to formulate morphological disambiguators that dissect word-level dependencies. One such work is introduced in Sak et al (2008). Their morphological parser is able to run independently of any other external systems and is capable of providing the correct morphological analysis with 98% accuracy using contextual cues, i.e. the two previous tags.

Instead of opting for fully manual annotation, we decided to use the morphological analyzer and the disambiguator of Sak et al (2008). Our decision was motivated by the fact that manual annotation may give rise to mis-annotations due to morphological complexity. By using an automated parser and having annotators choose among pre-determined alternatives, we aim to save time and minimize human error.

In our treebank, in addition to strings of words, we encoded the lexical and grammatical properties of the words as sets of features and values for these features. We also encoded the lemma of every word separately, following the UD framework. Figure 1 shows an example of encoding within the CoNLL-U format. The lemmas and the morphological features are also provided by the morphological analyzer and disambiguator.
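To make the encoding concrete, the following is a minimal, illustrative sketch (not part of the BOUN pipeline) of how one token line of the CoNLL-U format breaks down into its ten standard columns; the example line and its values are hypothetical.

```python
# The ten columns of a CoNLL-U token line, in order.
CONLLU_COLUMNS = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
]

def parse_token_line(line: str) -> dict:
    """Split a tab-separated CoNLL-U token line into a column dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    token = dict(zip(CONLLU_COLUMNS, fields))
    # FEATS is a |-separated list of Feature=Value pairs, or "_" when empty.
    if token["FEATS"] != "_":
        token["FEATS"] = dict(
            pair.split("=", 1) for pair in token["FEATS"].split("|")
        )
    return token

# A hypothetical token from a Turkish sentence:
line = "2\tokula\tokul\tNOUN\t_\tCase=Dat|Number=Sing|Person=3\t3\tobl\t_\t_"
tok = parse_token_line(line)
print(tok["LEMMA"], tok["DEPREL"], tok["FEATS"]["Case"])  # okul obl Dat
```

Each word thus carries its lemma, part of speech, morphological feature set, syntactic head, and dependency relation on a single line, which is the representation shown in Figure 1.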
Figure 1: An example sentence from our treebank encoded in CoNLL-U format.

In the BOUN Treebank, we maximally used the morphological features from the UD framework. When there was no clear-cut mapping between the features that we acquired from the morphological disambiguator and the features proposed in the UD framework, we used the features previously suggested in the works of Çöltekin (2016); Tyers et al (2017b); Sulubacak and Eryiğit (2018). Table 4 shows the automatic conversion from the output of Sak et al (2008)'s morphological disambiguator. Due to varying linguistic concerns, the depth of morphological representation in Sak et al (2008) and the UD framework does not align perfectly. When necessary, we used the morphological cues provided by the disambiguator to decide on the UPOS tag and the lemma.

Table 4: Mappings of morphological features from the notation of Sak et al (2008) to the features used in the UD framework.
Sak et al (2008)   UD                                 Sak et al (2008)   UD
A1sg               Number=Sing|Person=1               Recip              Voice=Rcp
A2sg               Number=Sing|Person=2               Able               Mood=Abil
A3sg               Number=Sing|Person=3               Repeat             Mood=Iter
A1pl               Number=Plur|Person=1               Hastily            Mood=Rapid
A2pl               Number=Plur|Person=2               EverSince          —
A3pl               Number=Plur|Person=3               Almost             Mood=Pro
Pnon               —                                  Stay               Mood=Dur
P1sg               Number[psor]=Sing|Person[psor]=1   Start              —
P2sg               Number[psor]=Sing|Person[psor]=2   Pos                Polarity=Pos
P3sg               Number[psor]=Sing|Person[psor]=3   Neg                Polarity=Neg
P1pl               Number[psor]=Plur|Person[psor]=1   Past               Tense=Past|Evident=Fh
P2pl               Number[psor]=Plur|Person[psor]=2   Narr               Evident=Nfh
P3pl               Number[psor]=Plur|Person[psor]=3   Fut                Tense=Fut
Abl                Case=Abl                           Aor                Tense=Aor
Acc                Case=Acc                           Pres               Tense=Pres
Dat                Case=Dat                           Desr               Mood=Des
Equ                —                                  Cond               Mood=Cnd
Gen                Case=Gen                           Neces              Mood=Nec
Ins                Case=Ins                           Opt                Mood=Opt
Loc                Case=Loc                           Imp                Mood=Imp
Nom                Case=Nom                           Cop                —
Pass               Voice=Pass                         Prog1              —
Caus               Voice=Cau                          Prog2              —
Reflex             —

(A dash indicates a tag without a clear-cut UD counterpart.)
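A conversion of the kind summarized in Table 4 amounts to a tag-by-tag lookup. The sketch below illustrates the idea under stated assumptions: it is not the authors' conversion script, and only a handful of the Table 4 mappings are reproduced for demonstration.

```python
# Hypothetical, partial lookup table reproducing a few rows of Table 4:
# Sak et al (2008) morphological tags -> UD feature strings.
SAK_TO_UD = {
    "A3sg": "Number=Sing|Person=3",
    "P2sg": "Number[psor]=Sing|Person[psor]=2",
    "Gen":  "Case=Gen",
    "Pos":  "Polarity=Pos",
    "Neg":  "Polarity=Neg",
    "Imp":  "Mood=Imp",
}

def convert_tags(sak_tags):
    """Map each Sak-style tag to its UD feature string; tags without a
    clear-cut UD counterpart (e.g. Pnon) are skipped, mirroring the
    blank cells in Table 4."""
    features = [SAK_TO_UD[t] for t in sak_tags if t in SAK_TO_UD]
    # The FEATS column joins features with "|" in alphabetical order.
    return "|".join(sorted("|".join(features).split("|")))

# One analysis of "alın" from Table 1: al+[Noun]+[A3sg]+[Pnon]+NHn[Gen]
print(convert_tags(["A3sg", "Pnon", "Gen"]))
# → Case=Gen|Number=Sing|Person=3
```

In the real pipeline, the cases left blank in Table 4 are the ones resolved with the features suggested in Çöltekin (2016); Tyers et al (2017b); Sulubacak and Eryiğit (2018).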
In the BOUN Treebank, we decided to represent relations among the parts of sentences within a dependency framework. This decision has two main reasons. The main, historical reason is that the growth of Turkish treebanking has mainly taken place within frameworks where syntactic relations are represented with dependencies (Oflazer, 1994; Çetinoğlu, 2009). The other reason is that Turkish allows phrases to be scrambled to pre-subject, post-verbal, and any clause-internal positions with specific constraints (Kural, 1992; Aygen, 2003; İşsever, 2007). With these in mind, we wanted to stick with the conventional dependency framework and use the recently rising UD framework. One of the main advantages of the UD framework is that, by its very nature, it creates directly comparable sets of treebanks with regards to their syntactic representation.

By following the UD framework, we encode two different types of syntactic information: the category of the dependent and the function of this dependent with regards to its syntactic head. Within the function information, the UD framework differentiates between nominal and verbal heads, with one more level of classification within the verbal heads: whether the dependent is core or non-core. As for the category of the dependent, we identified function words, modifier words, nominals, and clausal elements. In addition to this classification, there are some other small groupings which may be listed as: coordination, multiword expressions, loose syntactic relations, sentential, and extra-sentential. Table 5 shows the version of the UD framework we employ in this treebank.

Table 5: The UD v2 syntactic relations.

                     Nominals     Clauses   Modifier words   Function words
Core arguments       nsubj        csubj
                     obj          ccomp
                     iobj         xcomp
Non-core dependents  obl          advcl     advmod           aux
                     vocative               discourse        cop
                     expl                                    mark
                     dislocated
Nominal dependents   nmod         acl       amod             det
                     appos                                   clf
                     nummod                                  case

Coordination   MWE        Loose       Special      Other
conj           fixed      list        orphan       punct
cc             flat       parataxis   goeswith     root
               compound               reparandum   dep

Every dependency forms a relation between two segments within the sentence, building up to a non-binary and hierarchical representation of the sentence. This representation is exemplified in Item 2, using the sentence in Figure 1.

(2) İşte basit bir beyaz gömlek, bir gri kumaş ceket ya da yelek, bir de lacivert kravat.
    İşte  basit  bir  beyaz  gömlek,  bir  gri   kumaş   ceket   ya  da   yelek,  bir  de   lacivert  kravat.
    see   basic  a    white  shirt,   a    gray  fabric  jacket  or  FOC  vest,   a    FOC  navy      tie.
    'See, a simple white shirt, a gray blazer or a vest, and also a blue tie.'
    [Dependency tree omitted.]

Even though the syntactic representation scheme is discussed at length within the UD framework, previous applications of this scheme to Turkish data were problematic. In recent work on the re-annotation of Turkish UD treebanks, we have pointed out these issues (Türk et al, 2019). They mainly revolve around embedded clauses, compounds, and the distinction between core and non-core arguments. In addition to such problems, we believe that the UD framework needs fine-tuning regarding the usage of case, fixed, and advmod for segments that can be marked with either of those depending on the sentence-level discourse.

4.2 Challenges in the Annotation Process

In this section, we provide the justifications of our linguistic decisions for some reoccurring problems. One of our main concerns was to reflect linguistic adequacy in the BOUN Treebank. We also paid great attention to following unified and already-in-use solutions to the problems in the annotation of the BOUN Treebank and the re-annotation of the IMST and Turkish PUD Treebanks.
In the following sections, we touch upon our decisions on splitting the copular verb off as a new syntactic head, representing the syntactic depth of embedded clauses with more transparency, providing a more thorough analysis of compounds, and dealing with other issues that fall into the grey area between some dependency relations. We first discuss the issues that originated in the re-annotation of the previous treebanks, and then the ones that came up while annotating the BOUN Treebank from scratch.
In the previous treebanks, the annotation of embedded clauses did not reflect the inner hierarchy that a clause by definition possesses. This is mostly due to the morphological aspect of the most common embedding strategy in Turkish: nominalization. Due to the extensive use of nominalization, embedded clauses in Turkish can be regarded as nominals, since they behave exactly like nominals: they can be marked with a case, can be substituted with any other nominal, and show nominal stress patterns. Thus, previous treebanks in the UD framework used dependency relations such as obj, nsubj, amod, or advmod instead of ccomp, csubj, acl, or advcl to mark the relation of the embedded clause with the matrix verb. Moreover, dependents of the embedded nominalized verb, like oblique adjuncts, might be either attached to the matrix verb wrongly or represented with erroneous dependency relations. For example, an oblique of an embedded verb used to be attached to the root, since the embedded verb was seen as a nominal and not as a verb, as in Item 3. Likewise, the subject of the embedded clause was wrongly marked as a possessee nominal modifier. This wrong annotation in the previous treebanks is due to the fact that Turkish makes use of genitive-possessive cases for marking agreement in an embedded clause, as in Item 4.

(3) Tünele girmeden önce geçtiğim manzarayla burası bambaşka...
    Tünel-e     gir-me-den      önce    geç-tiğ-im     manzara-yla  burası  bam~başka...
    tunnel-DAT  enter-NEG-NMLZ  before  pass-NMLZ-1SG  scenery-COM  here    EMPH~different...
    'The scenery that I passed before I entered the tunnel was completely different from here.'

(4) Senin de gelmeni isterdim.
    Sen-in   de   gel-me-ni       iste-r-di-m.
    you-GEN  too  come-NMLZ-POSS  want-AOR-PST-1SG
    'I would have wanted you to come, as well.'

Another inconsistent annotation concerned compounds and their classification. The UD framework specifies the use of compound as a dependency relation between two heads that have the same syntactic category. Mostly in the Turkish PUD, but also in other Turkish treebanks in UD, not only constructions that are formed with two heads but also constructions that involve genitive-possessive suffixes were marked with the compound dependency, as in Item 5. We have modified these dependency relations to nmod:poss, which is already a convention in use.

(5) Bunların ellisi pazar alanıydı.
    Bun-lar-ın  elli-si     pazar   alan-ı-ydı
    this-PL     fifty-POSS  market  place-POSS-COP
    '50 of these were marketplaces.'
Turkish poses a unique problem with regards to the detection of core arguments. Unlike many other languages, Turkish can drop its object without any marking on the verb when the object is available in the discourse. This null marking of a contextually available core argument creates a new problem for the canonical tests distinguishing between dependency relations such as obl and obj, as in Item 6. In our annotations, we have used the recently proposed dependency relation obl:arg for such cases. We also modified the existing treebanks in this fashion.

(6) Ütüden anlamam.
    Ütü-den      anla-ma-m.
    ironing-ABL  understand-NEG-1SG
    'I do not know a thing about ironing.'

Due to its agglutinating nature, the line between syntax and morphology is not crystal clear in Turkish. This grey area is even more visible with the Turkish copula i- (be). The verb i- has three allomorphs in Turkish: i-, -y, and -∅. Regardless of the category of its base, the verb i- always behaves the same in terms of its stress assignment and the features it can host. Moreover, it is always detachable, meaning that the allomorph i- and the two others are in free variation, as shown in Figure 2 and Figure 3. The selection between ∅ and -y is governed by the previous segment: if the previous segment is a consonant, ∅ is used; otherwise -y is used.

(7) Okula gelecek idim.
    Okul-a      gel-ecek  i-di-m.
    school-DAT  come-FUT  be-PST-1SG
    'I was going to come to school.'

(8) Okula gelecektim.
    Okul-a      gel-ecek-ti-m.
    school-DAT  come-FUT-PST-1SG
    'I was going to come to school.'

Figure 2: Syncretism between i- and ∅.

(9) Okulda öğrenci idim.
    Okul-da     öğrenci  i-di-m.
    school-LOC  student  be-PST-1SG
    'I was a student in the school.'

(10) Okulda öğrenciydim.
     Okul-da     öğrenci-ydi-m.
     school-LOC  student-PST-1SG
     'I was a student in the school.'

Figure 3: Syncretism between -y and ∅.

Even though the previous Turkish treebanks are consistent in their decision regarding the verb i- within categories, they lack a unified treatment of the verb i- when it surfaces as a clitic, as in Item 8 and Item 9. In previous treebanks, all of the i- verbs were analyzed within the same word as their host and thus not segmented. However, when the i- verb attached to a nominal base, it was analyzed as a separate syntactic unit. In the BOUN Treebank, we segmented all instances of the verb i- as a copula (cop) regardless of the category of the base or its surface form. The reason we use cop instead of aux is the limited number of TAME (Tense-Aspect-Mood-Evidentiality) markers that the verb i- can host: the Turkish be verb can only host -mIş, -DI, -ken, and -sA, whereas a Turkish auxiliary, like ol-, can host every TAME marker. Another reason is the fact that both i- and ol- can occur at the same time, as in Item 11, and we cannot use two auxiliaries at the same time.

(11) Okula gelmiş olacaktım.
     Okul-a      gel-miş   ol-acak-tı-m.
     school-DAT  come-PRF  become-FUT-PST-1SG
     'I was going to have come to school.'

One of the problems that we encountered while annotating the BOUN Treebank and re-annotating the previous treebanks was finding a fool-proof criterion for distinguishing between fixed and case, or between case and advmod, in certain environments. This is mostly due to the grammaticalization of some adverbs as postpositions in Turkish. For example, sentences such as Item 12 contain an adverbial phrase which is a grammaticalized multiword expression, bu kadar. However, the same element kadar can be a postposition in sentences like Item 13. We found that these types of discrepancies are not limited to kadar and are visible with almost every postposition. In order to distinguish between a postposition that is part of a fixed multiword expression and one that has a case dependency relation with a nominal, we used the case on the previous noun phrase as a clue. When the previous NP is a bare nominal, we annotated it (the deictic term bu in Item 12) as an advmod to the root, and kadar with the dependency relation fixed. If the previous NP is marked with a case, as with the dative in düne in Item 13, the whole phrase is annotated with obl to the root, and the postposition, here kadar, is annotated with the dependency relation case.

(12) Bu kadar zor olmamalı.
     Bu    kadar  zor   ol-ma-malı.
     this  much   hard  be-NEG-NEC
     'It should not be this hard.'

(13) Düne kadar hastaydı.
     Dün-e          kadar  hasta-y-dı.
     yesterday-DAT  until  sick-COP-PST
     'S/he was sick until yesterday.'

To distinguish between case and advmod, we also used case-related cues. When the noun phrase preceding the problematic segment (sonra in the examples below) is marked with a case that is lexically determined by the postposition, we used the dependency relation case, as in Item 15. However, these postpositions can also be an adverbial modifier (advmod) to the matrix verb. In these cases, instead of the lexical case of the postposition, the preceding noun phrase is marked with another case introduced by other elements, as in Item 14.

(14) Bana sonra ağlama.
     Ban-a   sonra  ağla-ma.
     me-DAT  after  cry-NEG
     'Do not cry to me later on.'

(15) Benden sonra ağlama.
     Ben-den  sonra  ağla-ma.
     me-ABL   after  cry-NEG
     'Do not cry after I am gone.'
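The disambiguation strategy described above is essentially a small decision rule keyed to the case of the preceding noun phrase. The sketch below makes it explicit; the helper function is hypothetical (not the authors' code) and simplifies away the lexical-case check used for the case/advmod distinction.

```python
# A sketch of the heuristic: a postposition such as "kadar" is annotated
# as part of a fixed multiword expression when the preceding NP is a
# bare nominal, and with the relation "case" when the NP is case-marked.
def annotate_postposition(prev_np_case):
    """Return (relation of the preceding phrase to the root,
    relation of the postposition), where None means a bare nominal."""
    if prev_np_case is None:
        # "Bu kadar zor olmamali." — bare deictic "bu":
        # the NP is advmod to the root, "kadar" is fixed.
        return ("advmod", "fixed")
    # "Dune kadar hastaydi." — dative-marked "dune":
    # the whole phrase is obl to the root, "kadar" bears case.
    return ("obl", "case")

print(annotate_postposition(None))   # → ('advmod', 'fixed')
print(annotate_postposition("Dat"))  # → ('obl', 'case')
```

The rule mirrors Items 12 and 13 directly; in the treebank itself the decision was of course made by the human annotators, with the case marking serving as the clue.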
5 The BoAT Annotation Tool

Annotation tools are fundamental to facilitating the annotation process of many NLP tasks, including dependency parsing. Treebanks are re-annotated or annotated from scratch in line with the annotation guidelines of the UD framework (Nivre et al, 2016). We present the BoAT annotation tool for dependency parsing, which is specialized for annotating CoNLL-U files.
There are several annotation tools that are showcased within the UD framework. These tools include both web-based and desktop annotation tools. Some of them are general-purpose annotation tools, whereas plenty of them are specialized for the UD framework.

BRAT is a browser-based, online, general-purpose text annotation tool developed by Stenetorp et al (2012). It provides graphics-based visualization in flat graph mode. Dependency relations are edited via mouse clicks and dragging.

UD Annotatrix (Tyers et al, 2017a) is specialized for dependency parsing. It is a browser-based manual annotation tool which can be used both online and offline. The aim of UD Annotatrix is to be simple, uncluttered, and fast. It offers certain distinctive features such as two-level segmentation tailor-made for the UD framework. Furthermore, it supports other input formats besides the CoNLL-U format. Each sentence is projected in flat graph mode and text/table mode. Dependency relations can be edited using these modes. Dependency relations and part-of-speech tags are also validated.

ConlluEditor (Heinecke, 2019) is a browser-based manual annotation tool designed for the UD framework. It provides a graphic view in both tree mode and flat mode, but text and table views are not available. In addition, it offers an advanced search feature. A validation mechanism is also provided via a button. Moreover, mouse clicks are required for editing.
The motivation behind our tool is to present a user-friendly, compact, and practical manual annotation tool that is built upon the desires of the annotators. While developing BoAT, we received feedback from our annotators every step of the way. One of the crucial points of annotation is speed. Unlike the other existing tools within the UD framework, almost every possible action within BoAT can be performed using keyboard shortcuts. We aim to decrease the temporal and ergonomic load introduced by the use of the mouse and to increase speed accordingly. We believe that both the graph view and the text view have certain advantages alongside certain drawbacks, which led us to utilize both view types. For the graph view, the tree mode is favored over the flat mode by our annotators. For the text view, a table view is preferred to a simple text view. Moreover, each annotator can customize the table view of the tool by selecting the columns they believe fit their workflow at a specific time. Furthermore, we enabled our annotators to split or join words within our tool using a UI option or keyboard shortcuts, which permits a better analysis of multiword expressions. Moreover, new tokens can be added or existing ones deleted to overcome tokenization problems generated during the pre-processing of the text. Last but not least, we added the option of taking notes specific to every item. This feature enabled our annotators to communicate better and to report issues more effectively.
BoAT is a desktop annotation tool specifically designed for CoNLL-U files. It provides both a tree view and a table view, as shown in Figure 4. The upper part of the screen shows the default table view, while the lower part shows the tree view.
Tree view:
The dependency tree of each sentence is visualized as a graph. Instead of a flat view, a hierarchical tree view is used. The tree view, accompanied by the linearly readable tree, is favored in order to increase readability and clarity. It is based on the hierarchical view in the CoNLL-U Viewer offered by the UD framework.
Table view:
Each sentence is shown along with its default fields, which are ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, and MISC. Morphological features denoted by the FEATS field are parsed into specific fields for the existing morphological features in the UD framework. These fields are optional in the table view; annotators can choose which features, if any, they want to see. They are stored concatenated in the CoNLL-U file.

[Figure 4: A screenshot from the tool.]
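The FEATS round trip described above (splitting `Name=Value|Name=Value` strings into per-feature columns and concatenating them back on save) can be sketched as follows; the helper names are hypothetical, not BoAT's actual API:

```python
def split_feats(feats):
    """Parse a FEATS value like 'Case=Dat|Number=Sing' into a dict.
    A lone underscore means 'no features' in CoNLL-U."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

def join_feats(feat_dict):
    """Serialize back to CoNLL-U form: feature names sorted
    case-insensitively, as the UD guidelines prescribe."""
    if not feat_dict:
        return "_"
    return "|".join(f"{k}={v}"
                    for k, v in sorted(feat_dict.items(),
                                       key=lambda kv: kv[0].lower()))

feats = split_feats("Case=Dat|Number=Sing|Person=1")
feats["Number"] = "Plur"                      # edit one feature column
print(join_feats(feats))  # → Case=Dat|Number=Plur|Person=1
```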
Customizing the table view:
Annotators can customize the table view according to their needs by using the checkboxes assigned to the fields, shown above the parse of the sentence. In this way, a user can easily organize the table view and obtain a clean view without unnecessary fields at annotation time. This customization improves readability, and thereby the speed, of the annotation. An example of a customized table view is shown in Figure 4. In this example, all the fields except DEPS and MISC, plus the three morphological features Case, Number, and Person, were made visible.

Moves in the table view:
To ease the annotation process, the most frequently used functions are assigned to keyboard shortcuts. Arrow keys are used to move between cells in the table view. The "Prev" and "Next" buttons are used to move between sentences; their shortcuts are "Alt+O" and "Alt+P", respectively. There is no explicit Save button; advancing to the next sentence or going back to the previous one automatically saves the CoNLL-U file and applies the validation tests. Moreover, annotators can go to any sentence by simply typing the ID of the sentence and clicking "Go".
Editing the table view:
The value in a cell is edited by typing directly when the focus is on that cell; pressing "Enter" finishes editing. If one of the feature fields is edited, the FEATS cell is updated accordingly.
Editing multiword expressions:
One of the biggest challenges in the annotation process is keeping up with the changes in segment IDs when new syntactic segmentations are introduced. Annotating multiword expressions often comes with the cost of updating the segment IDs within a sentence, and annotators may need an easy way to split a word into two different syntactic units. In our tool, the cells in the first column of the table (showing "+" or "-") are clickable and used for MWE manipulation: the "-" button is used for splitting and the "+" button for joining. Dependency relations and segment IDs are updated automatically.
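The automatic renumbering after a split has to touch not only the ID column but every HEAD value that points beyond the split point. A simplified sketch for plain token rows (ignoring multiword-token ranges like "3-4" and empty nodes; the segmentation in the example is purely illustrative, and the default attachment of the second half is an assumption, not BoAT's documented behavior):

```python
def split_token(sentence, idx, form_a, form_b):
    """Split the token at 0-based position idx into two rows and
    renumber integer-valued "ID" and "HEAD" entries accordingly."""
    old_id = sentence[idx]["ID"]
    # every ID and HEAD referring past the split point shifts by one
    for tok in sentence:
        if tok["ID"] > old_id:
            tok["ID"] += 1
        if tok["HEAD"] > old_id:
            tok["HEAD"] += 1
    first = dict(sentence[idx], FORM=form_a)
    # assumption: the second half depends on the first by default,
    # and the annotator corrects the relation afterwards
    second = dict(sentence[idx], FORM=form_b, ID=old_id + 1,
                  HEAD=old_id, DEPREL="dep")
    sentence[idx:idx + 1] = [first, second]
    return sentence

toks = [{"ID": 1, "FORM": "Benden", "HEAD": 3, "DEPREL": "obl"},
        {"ID": 2, "FORM": "sonra", "HEAD": 1, "DEPREL": "case"},
        {"ID": 3, "FORM": "ağlama", "HEAD": 0, "DEPREL": "root"},
        {"ID": 4, "FORM": ".", "HEAD": 3, "DEPREL": "punct"}]
split_token(toks, 2, "ağla", "ma")   # illustrative segmentation
print([(t["ID"], t["FORM"], t["HEAD"]) for t in toks])
```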
Validation:
Each tree is validated with respect to the field values before the sentence is saved. If an error is detected in the annotated sentence, an error message such as "unknown UPOS value" or "invalid UPOS tag" is issued. An example error is shown in Figure 4, between the table view and the tree view. If the error is fatal, the annotation tool will not save the sentence.
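A check like the "unknown UPOS value" error reduces to membership in the closed inventory of 17 universal POS tags; a sketch (the error-message wording is illustrative, not the tool's exact strings):

```python
# The 17 universal POS tags form a closed set in the UD guidelines.
UPOS_TAGS = {"ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ",
             "NOUN", "NUM", "PART", "PRON", "PROPN", "PUNCT",
             "SCONJ", "SYM", "VERB", "X"}

def validate_upos(sentence):
    """Return error messages for tokens whose UPOS is not in the
    universal tag set. `sentence` is a list of token dicts."""
    errors = []
    for tok in sentence:
        if tok["UPOS"] not in UPOS_TAGS:
            errors.append(f'unknown UPOS value "{tok["UPOS"]}" '
                          f'at token {tok["ID"]}')
    return errors

toks = [{"ID": 1, "UPOS": "NOUN"}, {"ID": 2, "UPOS": "VRB"}]
print(validate_upos(toks))  # → ['unknown UPOS value "VRB" at token 2']
```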
Taking notes:
With the note feature, the annotator is able to take notes for each sentence, as in Figure 4. Each note is attached to the corresponding sentence and stored in a separate file with the specified sentence ID. The shortcut for writing notes is "Alt+M".
Adding and deleting rows:
Annotators are able to add a new token or delete an existing token by adding or deleting rows, in order to correct tokenization errors. To add a new row, a row ID is entered and the "Add Row" button is clicked; a new row is added above the row with the given ID. To delete an existing row, a row ID is entered and the "Delete Row" button is clicked; the row with the given ID is deleted. In both cases, the entered row ID must not belong to a multiword expression.
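Deleting a row similarly requires shifting later IDs down and dealing with any HEAD that pointed at the removed row. The sketch below reattaches such orphaned dependents to the root (ID 0) so the annotator can fix them manually; this default is an assumption for illustration, not BoAT's documented behavior:

```python
def delete_token(sentence, del_id):
    """Remove the token with ID del_id, shift later IDs and HEADs
    down, and reattach orphaned dependents to ID 0 for manual review.
    Simplified sketch for plain token rows with integer IDs/HEADs."""
    sentence[:] = [t for t in sentence if t["ID"] != del_id]
    for tok in sentence:
        if tok["ID"] > del_id:
            tok["ID"] -= 1
        if tok["HEAD"] == del_id:   # orphaned dependent
            tok["HEAD"] = 0
        elif tok["HEAD"] > del_id:
            tok["HEAD"] -= 1
    return sentence

toks = [{"ID": 1, "HEAD": 2}, {"ID": 2, "HEAD": 4},
        {"ID": 3, "HEAD": 4}, {"ID": 4, "HEAD": 0}]
delete_token(toks, 2)
print([(t["ID"], t["HEAD"]) for t in toks])  # → [(1, 0), (2, 3), (3, 0)]
```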
BoAT is an open-source desktop application. The software is implemented in Python 3 along with the PySide2 and regex modules. In addition, the CoNLL-U viewer is utilized by adapting part of the UDAPI library (Popel et al, 2017). Resources consisting of the data folder, the tree view, and validate.py are adopted from the UD-maintained tools for validation checks (https://github.com/universaldependencies/tools). The data folder is used without any change, while some modifications are made to validate.py. BoAT is a cross-platform application, as it runs on Linux, OS X, and Windows.

The BoAT tool was designed in accordance with the needs of the annotators, and it increased the speed and the consistency of annotation. Currently, BoAT only supports the CoNLL-U format of UD, since the tool is designed specifically for dependency parsing. In the future, it may be improved to support other formats and tasks. BoAT is available at https://github.com/boun-tabi/BoAT.

Experiments
We performed the first parsing experiments on the BOUN Treebank. In addition to the brand-new BOUN Treebank, we performed parsing experiments on our re-annotated versions of IMST (Türk et al, 2019) and PUD (Türk et al, 2019). The dependency parser used in these studies is Stanford's graph-based neural dependency parser (Dozat et al, 2017). This parser uses unidirectional LSTM modules to generate word embeddings and bidirectional LSTM modules to create possible head-dependency relations. It uses ReLU layers and biaffine classifiers to score these relations. For more information, see Dozat et al (2017).

For the automatic morphological analysis of the sentences, we used the Turkish morphological analyzer and disambiguator tool by Sak et al (2008). Unlike TRMorph (Çöltekin, 2010), which analyzes one word at a time, the tool of Sak et al (2008) takes the whole sentence as input and analyzes the words with respect to their corresponding meanings in the sentence. This feature is very useful for Turkish, because most word forms in Turkish have multiple morphological analyses that can be correctly disambiguated only by considering the context the word occurs in.

The BOUN Treebank consists of 9,757 sentences from five different text types. These text types contribute almost equally to the total number of sentences. Table 6 shows these text types and gives the treebank statistics in detail.

Table 6: Word statistics of the different sections of the BOUN Treebank. The difference between the numbers of tokens and forms is due to multiword expressions being represented with a single token but with multiple forms.
Treebank                         Num. of sentences   Num. of tokens   Num. of forms
Essays                           1,953               27,010           27,576
Broadsheet National Newspapers   1,898               29,307           29,386
Instructional Texts              1,969               20,382           20,565
Popular Culture Articles         1,965               21,096           21,295
Biographical Texts               1,972               23,391           23,553
Total                            9,757               121,186          122,375
For the parsing experiments, we randomly assigned the sentences of each register to the training, development, and test sets with percentages of 60%, 20%, and 20%, respectively. Table 7 shows the number of sentences in each set of the BOUN Treebank, as well as of the re-annotated versions of the IMST-UD Treebank and the Turkish PUD Treebank.

Table 7: Division of the Turkish treebanks into training, development, and test sets for the experiments.
Treebank                         Training set   Development set   Test set   Total
Essays                           1,173          389               391        1,953
Broadsheet National Newspapers   1,137          380               381        1,898
Instructional Texts              1,182          391               396        1,969
Popular Culture Articles         1,176          394               395        1,965
Biographical Texts               1,183          394               395        1,972
PUD                              600            200               200        1,000
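The 60%/20%/20% random assignment of sentences to the training, development, and test sets can be reproduced along these lines; the seed, the helper name, and the rounding behavior are illustrative assumptions, not the exact script used for the paper:

```python
import random

def split_sentences(sentences, seed=0, ratios=(0.6, 0.2, 0.2)):
    """Shuffle sentences and cut them into train/dev/test portions."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    order = sentences[:]
    rng.shuffle(order)
    n = len(order)
    n_train = int(n * ratios[0])
    n_dev = int(n * ratios[1])
    return (order[:n_train],
            order[n_train:n_train + n_dev],
            order[n_train + n_dev:])

# e.g. the 1,953 sentences of the Essays section
train, dev, test = split_sentences(list(range(1953)))
print(len(train), len(dev), len(test))  # → 1171 390 392
```

The exact set sizes differ slightly from Table 7 because the rounding of the fractional cut points is an arbitrary choice here.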
We first experimented with the dependency parser on each register separately. Then, we measured the performance of the parser on the entire BOUN Treebank. As a final experiment, we combined the training, development, and test sets of the BOUN Treebank with the corresponding sets of the re-annotated versions of the IMST-UD and PUD treebanks. The aim of this experiment was to see the effect of the newly introduced BOUN Treebank on the current state-of-the-art parsing performance for Turkish.

In all the experiments, both projective and non-projective sentences were included in the training and test phases. Previous studies in Turkish treebanking usually excluded non-projective sentences; the reason we included them in our parsing study was to obtain more realistic results. As for the pre-trained word vectors used by the dependency parser, we used the Turkish word vectors supplied by the CoNLL-17 organization (Ginter et al, 2017).

In the evaluation of the dependency parser, we used the word-based unlabeled attachment score (UAS) and labeled attachment score (LAS) metrics. The UAS is measured as the percentage of words that are attached to the correct head, and the LAS is defined as the percentage of words that are attached to the correct head with the correct dependency type.
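The two metrics reduce to simple counts over (head, label) pairs; when every word receives exactly one predicted head, the F1 formulation coincides with this accuracy-style computation. A sketch of the word-based scores as defined above:

```python
def attachment_scores(gold, pred):
    """gold, pred: lists of (head, deprel) pairs, one per word.
    Returns (UAS, LAS) as percentages: UAS counts words with the
    correct head, LAS counts words with the correct head AND label."""
    assert len(gold) == len(pred)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * uas / n, 100.0 * las / n

gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
print(uas, las)  # → 75.0 50.0
```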
Table 8 shows the first parsing results on the test sets of each section in the BOUN Treebank in terms of the labeled and unlabeled attachment scores.

Table 8: UAS and LAS scores of the parser on each of the five sections of the BOUN Treebank.

Treebank                         UAS F1-score   LAS F1-score
Essays                           63.77          54.44
Broadsheet National Newspapers
Instructional Texts              74.04          65.29
Popular Culture Articles         75.04          66.80
Biographical Texts               70.46          62.32

We observed that the highest and the lowest parsing scores are achieved on the Broadsheet National Newspapers section and the Essays section of the BOUN Treebank, respectively. The second best section according to the parsing scores is Popular Culture Articles. The performance of the parser on the Instructional Texts section is approximately 1 point below its performance on Popular Culture Articles. Biographical Texts is in fourth place in this comparison.

To understand the possible reasons behind the differences between the parsing scores of the five sections of the BOUN Treebank, we compared the sections with respect to the average token count and the average dependency arc length in a sentence. Figure 5 shows these statistics for the five sections of the BOUN Treebank.

[Figure 5: The average token count and the average dependency arc length in a sentence for the five different sections of the BOUN Treebank.]

We observed that both the average token count and the average dependency arc length are highest in the Broadsheet National Newspapers section. The second highest on both metrics is the Essays section. The average token count and the average dependency arc length of the Instructional Texts and Popular Culture Articles sections are very close to each other and are the lowest. Both metrics for the Biographical Texts section are in the middle, being higher than those of the Instructional Texts and Popular Culture Articles sections and lower than those of the Broadsheet National Newspapers and Essays sections.

We anticipate that the higher these two metrics are in a sentence, the harder the task of constructing the dependency tree of that sentence. From Figure 5, we observe that all of the sections except Broadsheet National Newspapers follow this hypothesis. However, the Broadsheet National Newspapers section, which has the highest values on these metrics, yields the best parsing performance in terms of UAS and LAS scores. We believe that this increase in scores is due to the interaction between the lack of interpersonal differences in journalistic writing and the editorial process behind newspapers and magazines.

In Table 9, we present the success rates on the BOUN Treebank, the re-annotated version of the IMST-UD Treebank, and the re-annotated version of the PUD Treebank, when each treebank is used separately to train and test the parser, as well as when they are used together in the training and evaluation phases.

Table 9: UAS and LAS scores of the parser on the Turkish treebanks. The first three rows show the performance on each of the re-annotated versions of the IMST-UD and PUD treebanks and the newly introduced BOUN Treebank separately. The fourth row depicts the performance when all three treebanks are joined together.

Treebank             Num. of sentences   UAS F1-score   LAS F1-score
IMST-UD
PUD
BOUN
IMST-UD, PUD, BOUN
We observe that the performance of the parser on the BOUN Treebank is better than its performance on the re-annotated version of the IMST-UD Treebank in terms of the LAS score. The UAS scores reached on these treebanks are more or less the same. Considering their similar annotation styles and domains, we can infer that an increase in the size of the treebank leads to better parsing performance in terms of the LAS score, with the exception of the PUD Treebank.

We first investigated this oddity by looking at a possible confound: the differences in the percentages of certain dependency relations. Table 10 presents the distribution of the dependency relation types across the re-annotated versions of the IMST-UD and PUD treebanks, and the BOUN Treebank. We observe that there is no noteworthy difference in the distribution of the relation types across the three treebanks.

When comparing the BOUN Treebank and the re-annotated version of the IMST-UD Treebank, we observed that the percentages of the case, compound, and nmod types were lower by more than 1% in the BOUN Treebank. The root type was also lower in the BOUN Treebank by more than 2%, which indicates that the average token count was higher in this treebank with respect to the re-annotated version of the IMST-UD Treebank. However, the percentage of the nmod:poss type was higher by more than 2%, and that of the obl type by more than 3%, in the BOUN Treebank.

Moreover, when comparing the BOUN Treebank with the re-annotated version of the Turkish PUD Treebank, we observed that the highest percentage difference was for the obl type, which was higher in the BOUN Treebank by more than 7%. The other relation types whose percentages were higher in BOUN by more than 1% were the conj and root types. This indicates that the average token count was higher in the re-annotated version of the PUD Treebank when compared to the BOUN Treebank, and that there were more conjunct relations in the BOUN Treebank, which sometimes increase the complexity of a sentence in terms of dependency parsing. These observations suggest that the differences in the success rates on these two treebanks did not stem from the varying percentages of the dependency relations; rather, they stem from the complexity expressed in the text and how well this complexity is handled.

Table 10: Comparison of the re-annotated versions of the IMST-UD and PUD treebanks, and the BOUN Treebank on the distribution of dependency relation labels. The first numbers represent the counts and the numbers in parentheses show their percentages.

Relation type     IMST-UD          PUD             BOUN
acl
advcl             926 (1.59%)      435 (2.6%)      2,589 (2.12%)
advcl:cond        110 (0.19%)      13 (0.07%)      268 (0.22%)
advmod
advmod:emph       976 (1.68%)      143 (0.8%)      1,721 (1.41%)
amod
appos             136 (0.23%)      166 (1%)        506 (0.41%)
aux
aux:q             211 (0.36%)      1 (0.01%)       269 (0.22%)
case
cc                879 (1.51%)      520 (3.1%)      2,799 (2.29%)
cc:preconj
ccomp             626 (1.08%)      171 (1%)        1,510 (1.23%)
clf
compound
compound:lvc      523 (0.90%)      186 (1.1%)      1,218 (1.0%)
compound:redup    219 (0.37%)      9 (0.05%)       456 (0.37%)
conj
cop               851 (1.46%)      496 (2.9%)      1,291 (1.05%)
csubj             82 (0.14%)       93 (0.5%)       545 (0.45%)
dep
det
det:predet        -                8 (0.05%)       -
discourse         150 (0.26%)      5 (0.03%)       377 (0.31%)
dislocated        20 (0.03%)       5 (0.03%)       28 (0.02%)
fixed             25 (0.04%)       1 (0.01%)       12 (0.01%)
flat              902 (1.55%)      409 (2.4%)      2,033 (1.66%)
goeswith
iobj              354 (0.61%)      138 (0.8%)      165 (0.13%)
list              -                -               40 (0.03%)
mark              86 (0.15%)       5 (0.03%)       117 (0.10%)
nmod
nmod:poss
nsubj
nummod            567 (0.98%)      263 (1.6%)      1,567 (1.28%)
obj
obl
orphan            12 (0.02%)       8 (0.05%)       83 (0.07%)
parataxis         11 (0.02%)       15 (0.09%)      208 (0.17%)
punct
root
vocative          -                -               87 (0.07%)
xcomp             39 (0.07%)       128 (0.7%)      125 (0.10%)

Considering that the best performance of the parser is on the re-annotated version of the PUD Treebank, we believe that the question of how well the complexity of the text is handled may be answered if we take a look at the text selection. This treebank includes sentences translated from different languages by professional translators, and hence the sentences have different structures than the sentences of the other two treebanks. This difference in structure is a result of the different environments in which these texts were produced, namely a living corpus (BOUN and IMST-UD) versus well-edited translations (PUD). Our evaluation style for this treebank is also different: due to the insufficient number of annotated sentences in the treebank, we used 5-fold cross-validation in the evaluation of the PUD Treebank.

Lastly, we observe that combining these three treebanks improves the parsing performance in terms of the attachment scores. The increase in the training size resulted in better parsing scores, contributing to the discussion of the correlation between the size of the corpus and the success rates in parsing experiments (Foth et al, 2014; Ballesteros et al, 2012).
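Relation-type distributions like those in Table 10 can be recomputed from any CoNLL-U file by counting the DEPREL column; a sketch assuming the standard ten-column layout:

```python
from collections import Counter

def deprel_distribution(conllu_text):
    """Count dependency relation labels (column 8, DEPREL) in a
    CoNLL-U string and return (label, count, percentage) triples."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # skip multiword-token ranges ("3-4") and empty nodes ("3.1"):
        # neither carries a dependency relation of its own
        if "-" in cols[0] or "." in cols[0]:
            continue
        counts[cols[7]] += 1
    total = sum(counts.values())
    return [(rel, n, 100.0 * n / total)
            for rel, n in counts.most_common()]

sample = ("1\tBenden\tben\tPRON\t_\t_\t3\tobl\t_\t_\n"
          "2\tsonra\tsonra\tADP\t_\t_\t1\tcase\t_\t_\n"
          "3\tağlama\tağla\tVERB\t_\t_\t0\troot\t_\t_\n")
dist = deprel_distribution(sample)
print(dist)
```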
Conclusion

In this paper, we presented the largest and most comprehensive Turkish treebank to date, with 9,757 sentences: the BOUN Treebank. In the treebank, we encoded the surface form of the sentences, universal part-of-speech tags, lemmas, and morphological features for each segment, as well as the syntactic relations between these segments. We explained our annotation methodology in detail. We present our data online, together with the history of the changes we applied and our guidelines. We also presented an overview of other Turkish treebanks. Moreover, we explained our linguistic decisions and annotation scheme, which are based on the UD framework. We provided examples of the challenging issues that are present in the BOUN Treebank as well as in the other treebanks that we re-annotated.

In addition to these contributions, we provided a detailed presentation of our annotation tool, BoAT, and explained our motivation for this initiative in detail. We also provide the tool and its documentation online.

Lastly, we presented an NLP task in which our new treebank and previous re-annotations have been used. We report UAS and LAS F1-scores with regard to specific text types and treebanks, and we also report scores for all the treebanks used together. All the tools and materials presented in this paper are freely available at our webpage: https://tabilab.cmpe.boun.edu.tr/boun-pars.

Acknowledgements
We are immensely grateful to Prof. Yeşim Aksan and the other members of the Turkish National Corpus team for their tremendous help in providing us with sentences from the Turkish National Corpus. We are also thankful to the anonymous reviewers from SyntaxFest'19 and LAW XIII, as well as to Çağrı Çöltekin, for their constructive comments on the re-annotation process of the IMST and PUD treebanks.

This work was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK) under grant number 117E971 and by a BIDEB 2211 graduate scholarship.
References
Aksan Y, Aksan M, Koltuksuz A, Sezer T, Mersinli Ü, Demirhan UU, Yılmazer H, Atasoy G, Öz S, Yıldız I, Kurtoğlu Ö (2012) Construction of the Turkish National Corpus (TNC). In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), European Language Resources Association (ELRA), Istanbul, Turkey, pp 3223–3227

Atalay NB, Oflazer K, Say B (2003) The annotation process in the Turkish treebank. In: Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-03) at EACL 2003

Aygen G (2003) Extractability and the nominative case feature on tense. In: Özsoy S, Akar D, Nakipoğlu-Demiralp M, Erguvanlı-Taylan E, Aksu-Koç A (eds) Studies in Turkish Linguistics: Proceedings of the 10th International Conference in Turkish Linguistics, Istanbul

Ballesteros M, Herrera J, Francisco V, Gervás P (2012) Are the existing training corpora unnecessarily large? Procesamiento del Lenguaje Natural (48):21–27

Bickel B, Nichols J (2013) Inflectional synthesis of the verb. In: Dryer MS, Haspelmath M (eds) The World Atlas of Language Structures Online, Max Planck Institute for Evolutionary Anthropology, Leipzig

Brants T (2000) TnT: A statistical part-of-speech tagger. In: Proceedings of the Sixth Conference on Applied Natural Language Processing, Association for Computational Linguistics, pp 224–231

Çöltekin Ç (2010) A freely available morphological analyzer for Turkish. In: LREC, vol 2, pp 19–28

Çetinoğlu Ö (2009) A large scale LFG grammar for Turkish. PhD thesis, Sabancı University

Çetinoğlu Ö, Çöltekin Ç (2016) Part of speech annotation of a Turkish-German code-switching corpus. In: Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016), pp 120–130

Çöltekin Ç (2015) A grammar-book treebank of Turkish. In: Dickinson M, Hinrichs E, Patejuk A, Przepiórkowski A (eds) Proceedings of the 14th Workshop on Treebanks and Linguistic Theories (TLT 14), pp 35–49

Çöltekin Ç (2016) (When) do we need inflectional groups? In: Proceedings of The First International Conference on Turkic Computational Linguistics

De Marneffe MC, Dozat T, Silveira N, Haverinen K, Ginter F, Nivre J, Manning CD (2014) Universal Stanford dependencies: A cross-linguistic typology. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pp 4585–4592

Dozat T, Qi P, Manning CD (2017) Stanford's graph-based neural dependency parser at the CoNLL 2017 shared task. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp 20–30

Erguvanlı-Taylan E (1986) Pronominal versus zero representation of anaphora in Turkish. In: Studies in Turkish Linguistics, John Benjamins, p 209

Erguvanlı-Taylan E (2015) The Phonology and Morphology of Turkish. Boğaziçi University

Foth KA, Köhn A, Beuck N, Menzel W (2014) Because size does matter: The Hamburg dependency treebank. In: LREC

Ginter F, Hajič J, Luotolahti J, Straka M, Zeman D (2017) CoNLL 2017 shared task - automatically annotated raw texts and word embeddings. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Göksel A, Kerslake C (2005) Turkish: A Comprehensive Grammar. Comprehensive Grammars, Routledge

Heinecke J (2019) ConlluEditor: A fully graphical editor for universal dependencies treebank files. In: Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019), pp 87–93

Hoffman B (1995) The computational analysis of the syntax and interpretation of "free" word order in Turkish. IRCS Technical Reports Series, p 130

İşsever S (2003) Information structure in Turkish: The word order–prosody interface. Lingua 113(11):1025–1053

İşsever S (2007) Towards a unified account of clause-initial scrambling in Turkish: A feature analysis. Turkic Languages 11(1):93–123

Kapan A (2019) Derivational networks of nouns and adjectives in Turkish. Master's thesis, Boğaziçi University, İstanbul, Turkey

Kornfilt J (1984) Case marking, agreement, and empty categories in Turkish. Harvard University

Kural M (1992) Properties of scrambling in Turkish. Ms, UCLA

Leech G, Garside R (1991) Running a grammar factory: The production of syntactically analysed corpora or treebanks. English Computer Corpora: Selected Papers and Research Guide, pp 15–32

Maamouri M, Bies A, Buckwalter T, Mekki W (2004) The Penn Arabic treebank: Building a large-scale annotated Arabic corpus. In: NEMLAR Conference on Arabic Language Resources and Tools, Cairo, vol 27, pp 466–467

Marcus M, Santorini B, Marcinkiewicz M (1993) Building a large annotated corpus of English: The Penn treebank. 19:330–331

Megyesi B (2002) Data-driven syntactic analysis. PhD thesis, Institutionen för Talöverföring och Musikakustik

Megyesi B, Dahlqvist B, Pettersson E, Nivre J (2008) Swedish-Turkish parallel treebank. In: LREC

Megyesi B, Dahlqvist B, Csató EA, Nivre J (2010) The English-Swedish-Turkish parallel treebank. In: LREC 2010, 17-23 May 2010, Valletta, Malta

Nivre J, Hall J, Nilsson J (2006a) MaltParser: A data-driven parser-generator for dependency parsing. In: LREC, vol 6, pp 2216–2219

Nivre J, Nilsson J, Hall J (2006b) Talbanken05: A Swedish treebank with phrase structure and dependency annotation. In: LREC, pp 1392–1395

Nivre J, De Marneffe MC, Ginter F, Goldberg Y, Hajic J, Manning CD, McDonald R, Petrov S, Pyysalo S, Silveira N, et al (2016) Universal Dependencies v1: A multilingual treebank collection. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp 1659–1666

Oflazer K (1994) Two-level description of Turkish morphology. Literary and Linguistic Computing 9(2):137–148

Oflazer K, Say B, Hakkani-Tür DZ, Tür G (2003) Building a Turkish Treebank. Springer Netherlands, Dordrecht, pp 261–277. DOI 10.1007/978-94-010-0201-1_15

Özsoy AS (2019) Word Order in Turkish, vol 97. Springer

Özsoy S (1988) Null subject parameter and Turkish. In: Studies on Modern Turkish: Proceedings of the Third Conference on Turkish Linguistics, Tilburg University Press, Tilburg, the Netherlands, pp 82–90

Öztürk B (2006) Null arguments and case-driven agree in Turkish. In: Boeckx C (ed) Minimalist Essays, pp 268–287

Öztürk B (2008) Non-configurationality: Free word order and argument drop in Turkish. In: The Limits of Syntactic Variation. John Benjamins, Amsterdam, pp 411–440

Popel M, Žabokrtský Z, Vojtek M (2017) Udapi: Universal API for universal dependencies. In: Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), Association for Computational Linguistics, Gothenburg, Sweden, pp 96–101

Sak H, Güngör T, Saraçlar M (2008) Turkish language resources: Morphological parser, morphological disambiguator and web corpus. In: International Conference on Natural Language Processing, Springer, pp 417–427

Sampson G (1995) English for the Computer: The SUSANNE Corpus and Analytic Scheme

Say B, Zeyrek D, Oflazer K, Özge U (2002) Development of a corpus and a treebank for present-day written Turkish. In: Proceedings of the Eleventh International Conference of Turkish Linguistics, Eastern Mediterranean University, pp 183–192

Stenetorp P, Pyysalo S, Topić G, Ohta T, Ananiadou S, Tsujii J (2012) brat: A web-based tool for NLP-assisted text annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Avignon, France, pp 102–107

Sulger S, Butt M, King TH, Meurer P, Laczkó T, Rákosi G, Dione CB, Dyvik H, Rosén V, De Smedt K, et al (2013) ParGramBank: The ParGram parallel treebank. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol 1, pp 550–560

Sulubacak U, Eryiğit G (2018) Implementing universal dependency, morphology, and multiword expression annotation standards for Turkish language processing. Turkish Journal of Electrical Engineering & Computer Sciences 26(3):1662–1672

Sulubacak U, Gökırmak M, Tyers FM, Çöltekin Ç, Nivre J, Eryiğit G (2016) Universal Dependencies for Turkish. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, The COLING 2016 Organizing Committee, Osaka, Japan, pp 3444–3454

Türk U, Atmaca F, Özateş ŞB, Köksal A, Öztürk B, Güngör T, Özgür A (2019) Turkish treebanking: Unifying and constructing efforts. In: Proceedings of the 13th