A Corpus of Adpositional Supersenses for Mandarin Chinese
Siyao Peng, Yang Liu, Yilun Zhu, Austin Blodgett, Yushi Zhao, Nathan Schneider
Department of Linguistics, Georgetown University
Washington, DC, USA
{sp1184, yl879, yz565, ajb341, yz521, nathan.schneider}@georgetown.edu
Abstract
Adpositions are frequent markers of semantic relations, but they are highly ambiguous and vary significantly from language to language. Moreover, there is a dearth of annotated corpora for investigating the cross-linguistic variation of adposition semantics, or for building multilingual disambiguation systems. This paper presents a corpus in which all adpositions have been semantically annotated in Mandarin Chinese; to the best of our knowledge, this is the first Chinese corpus to be broadly annotated with adposition semantics. Our approach adapts a framework that defined a general set of supersenses according to ostensibly language-independent semantic criteria, though its development focused primarily on English prepositions (Schneider et al., 2018). We find that the supersense categories are well-suited to Chinese adpositions despite syntactic differences from English. On a Mandarin translation of The Little Prince, we achieve high inter-annotator agreement and analyze semantic correspondences of adposition tokens in bitext.
Keywords: adpositions, supersenses, Mandarin Chinese, corpus, annotation
1. Introduction
Adpositions (i.e. prepositions and postpositions) include some of the most frequent words in languages like Chinese and English, and help convey a myriad of semantic relations of space, time, causality, possession, and other domains of meaning. They are also a persistent thorn in the side of second language learners owing to their extreme idiosyncrasy (Chodorow et al., 2007; Lorincz and Gordon, 2012). For instance, the English word in has no exact parallel in another language; rather, for purposes of translation, its many different usages cluster differently depending on the second language. Semantically annotated corpora of adpositions in multiple languages, including parallel data, would facilitate broader empirical study of adposition variation than is possible today, and could also contribute to NLP applications such as machine translation (Li et al., 2005; Agirre et al., 2009; Shilon et al., 2012; Weller et al., 2014, 2015; Hashemi and Hwa, 2014; Popović, 2017) and grammatical error correction (Chodorow et al., 2007; Tetreault and Chodorow, 2008; De Felice and Pulman, 2008; Hermet and Alain, 2009; Huang et al., 2016; Graën and Schneider, 2017).

This paper describes the first corpus with broad-coverage annotation of adpositions in Chinese (Zhu et al. (2019) previewed our approach). For this corpus we have adapted Schneider et al.’s (2018) Semantic Network of Adposition and Case Supersenses annotation scheme (SNACS; see §2.2) to Chinese. Though other languages were taken into consideration in designing SNACS, no serious annotation effort has been undertaken to confirm empirically that it generalizes to other languages. After developing new guidelines for syntactic phenomena in Chinese (§3), we apply the SNACS supersenses to a translation of The Little Prince (Xiǎo Wáng Zǐ; originally Le Petit Prince by Antoine de St. Exupéry, published in 1943 and subsequently translated into numerous languages), finding the supersenses to be robust and achieving high inter-annotator agreement (§4). We analyze the distribution of adpositions and supersenses in the corpus, and compare to adposition behavior in a separate English corpus (see §5). We also examine the predictions of a part-of-speech tagger in relation to our criteria for annotation targets (§6). The annotated corpus and the Chinese guidelines for SNACS will be made freely available online.
2. Related Work
To date, most wide-coverage semantic annotation of prepositions has been dictionary-based, taking a word sense disambiguation perspective (Litkowski and Hargraves, 2005, 2007; Litkowski, 2014). Schneider et al. (2015) proposed a supersense-based (unlexicalized) semantic annotation scheme which would be applied to all tokens of prepositions in English text. We adopt a revised version of the approach, known as SNACS (see §2.2). Previous SNACS annotation efforts have been mostly focused on English, particularly STREUSLE (Schneider et al., 2016, 2018), the semantically annotated corpus of reviews from the English Web Treebank (EWT; Bies et al., 2012). We present the first adaptation of SNACS for Chinese by annotating an entire Chinese translation of The Little Prince (https://github.com/nert-nlp/Chinese-SNACS/).

In the computational literature for Chinese, apart from some focused studies (e.g., Yang and Kuo (1998) on logical-semantic representation of temporal adpositions), there has been little work addressing adpositions specifically. Most previous semantic projects for Mandarin Chinese focused on content words and did not directly annotate the semantic relations signaled by function words such as prepositions (Xue et al., 2014; Hao et al., 2007; You and Liu, 2005; Li et al., 2016). For example, in Chinese PropBank, Xue (2008) argued that the head word and its part of speech are clearly informative for labeling the semantic role of a phrase, but the preposition is not always the most informative element. Li et al. (2003) annotated the Tsinghua Corpus (Zhang, 1999) from People’s Daily, where the content words were selected as the headwords, i.e., the object is the headword of the prepositional phrase. In these prepositional phrases, the nominal headwords were labeled with one of the 59 semantic relations (e.g. Location, LocationIni, Kernel word) whereas the prepositions and postpositions were respectively labeled with syntactic relations Preposition and LocationPreposition. (Though named LocationPreposition in Li et al. (2003), these adpositions actually occur postnominally, equivalent to localizers in this paper.) Similarly, in Semantic Dependency Relations (SDR, Che et al. 2012, 2016), prepositions and localizers were labeled as semantic markers mPrep and mRange, whereas semantic roles, e.g., Location, Patient, are assigned to the governed nominal phrases.

Sun and Jurafsky (2004) compared PropBank parsing performance on Chinese and English, and showed that four Chinese prepositions (zài, yú, bǐ, and duì) are among the top 20 lexicalized syntactic head words in Chinese PropBank, bridging the connections between verbs and their arguments. The high frequency of prepositions as head words in PropBank reflects their importance in context. However, very few annotation schemes have attempted to directly label the semantics of these adposition words.

Chinese Knowledge Information Processing Group (CKIP) (1993) is the most relevant adposition annotation effort, categorizing Chinese prepositions into 66 types of senses grouped by lexical items. However, these lexicalized semantic categories are constrained to a given language and a closed set of adpositions. For semantic labeling of Chinese adpositions in a multilingual context, we turn to the SNACS framework, described below.

Schneider et al. (2018) proposed the Semantic Network of Adposition and Case Supersenses (SNACS), a hierarchical inventory of 50 semantic labels, i.e., supersenses, that characterize the use of adpositions, as shown in Figure 1. Since the meaning of adpositions is highly affected by the context, SNACS can help distinguish different usages of adpositions. For instance, (1) presents an example of the supersense Topic for the adposition about which emphasizes the subject matter of urbanization that the speaker discussed. In (2), however, the same preposition about takes a measurement in the context, expressing an approximation. (Throughout this paper, adposition tokens under discussion are bolded and labeled.)

(1) I gave a presentation about:Topic urbanization.
(2) We have about:Approximator […]
(3) […] about:Stimulus↝Topic you.

Circumstance, Temporal, Time, StartTime, EndTime, Frequency, Duration, Interval, Locus, Source, Goal, Path, Direction, Extent, Means, Manner, Explanation, Purpose
Participant, Causer, Agent, Co-Agent, Theme, Co-Theme, Topic, Stimulus, Experiencer, Originator, Recipient, Cost, Beneficiary, Instrument
Configuration, Identity, Species, Gestalt, Possessor, Whole, Characteristic, Possession, PartPortion, Stuff, Accompanier, InsteadOf, ComparisonRef, RateUnit, Quantity, Approximator, SocialRel, OrgRole
Figure 1: SNACS hierarchy of 50 supersenses.

For instance, (3) blends the domains of emotion (principally reflected in care, which licenses a Stimulus), and cognition (principally reflected in about, which often marks non-emotional Topics). Thus, SNACS incorporates the construal analysis (Hwang et al., 2017) wherein the lexical semantic contribution of an adposition (its function) is distinguished and may diverge from the underlying relation in the surrounding context (its scene role). Construal is notated by SceneRole↝Function, as Stimulus↝Topic in (3).

Another motivation for incorporating the construal analysis, as pointed out by Hwang et al. (2017), is its capability to adapt the English-centric supersense labels to other languages, which is the main contribution of this paper. The construal analysis can give us insights into the similarities and differences of function and scene roles of adpositions across languages.
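The SceneRole↝Function notation can be made concrete with a small parsing sketch. This is only an illustration under assumptions: the `word:Label` string form mirrors the examples in this paper, while the names `Construal`, `parse_construal`, and `parse_token` are hypothetical and not part of any SNACS tooling. A label without an arrow is treated as a congruent construal, in which scene role and function coincide.

```python
from typing import NamedTuple, Tuple

ARROW = "↝"  # separator used in the paper's construal notation


class Construal(NamedTuple):
    """A scene role paired with a function (hypothetical helper type)."""
    scene_role: str
    function: str

    @property
    def is_congruent(self) -> bool:
        # Congruent construal: scene role and function are the same label.
        return self.scene_role == self.function


def parse_construal(label: str) -> Construal:
    """Parse 'Scene↝Function' or a bare 'Label' (congruent) into a Construal."""
    scene, sep, function = label.partition(ARROW)
    scene = scene.strip()
    if not sep:
        # No arrow: the single label serves as both scene role and function.
        return Construal(scene, scene)
    return Construal(scene, function.strip())


def parse_token(annotated: str) -> Tuple[str, Construal]:
    """Split an annotated token such as 'about:Stimulus↝Topic'."""
    word, _, label = annotated.partition(":")
    return word, parse_construal(label)
```

For example, `parse_token("about:Stimulus↝Topic")` yields the word `about` with a divergent construal, whereas `parse_token("about:Topic")` yields a congruent one.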
3. Adposition Criteria in Mandarin Chinese
Our first challenge is to determine which tokens qualify as adpositions in Mandarin Chinese and merit supersense annotations. The English SNACS guidelines (we use version 2.3) broadly define the set of SNACS annotation targets to include canonical prepositions (taking a noun phrase (NP) complement) and their subordinating (clausal complement) uses. Possessives, intransitive particles, and certain uses of the infinitive marker to are also included (Schneider et al., 2019).

In Chinese, the difficulty lies in two areas, which we discuss below. Firstly, prepositional words are widely attested. However, since no overt derivational morphology occurs on these prepositional tokens (previously referred to as coverbs), we need to filter non-prepositional uses of these words. Secondly, post-nominal particles, i.e., localizers, though not always considered adpositions in Chinese, deliver rich semantic information.

(The supersense labels in congruent construals, such as Topic and Approximator in (1) and (2), are both function and scene role by definition.)
Coverbs
Tokens that are considered generic prepositions can co-occur with the main predicate of the clause and introduce an NP argument to the clause (Li and Thompson, 1974) as in (4). These tokens are referred to as coverbs. In some cases, coverbs can also occur as the main predicate. For example, the coverb zài heads the predicate phrase in (5).

(4) tā   zài:Locus  xuéshù    shàng:Topic↝Locus  yǒusuǒzuòwéi.
    3SG  P:at       academia  LC:on-top-of       successful
    ‘He succeeded in academia.’

(5) nǐ   yào   de  yáng   jiù  zài  lǐmiàn.
    2SG  want  DE  sheep  RES  at   inside
    ‘The sheep you wanted is in the box.’ (zh_lpp_1943.92)

In this project, we only annotate coverbs when they do not function as the main predicate in the sentence, echoing the view that coverbs modify events introduced by the predicates, rather than establishing multiple events in a clause (Hui, 2012). Therefore, lexical items such as zài are annotated when functioning as a modifier as in (4), but not when functioning as the main predicate as in (5).
Localizers
Localizers are words that follow a noun phrase to refine its semantic relation. For example, shàng in (4) denotes a contextual meaning, ‘in a particular area,’ whereas the co-occurring coverb zài only conveys a generic location. It is unclear whether localizers are syntactically postpositions, but we annotate all localizers because of their semantic significance. Though coverbs frequently co-occur with localizers and the combination of coverbs and localizers is very productive, there is no strong evidence to suggest that they are circumpositions. As a result, we treat them as separate targets for SNACS annotation: for example, zài and shàng receive Locus and Topic↝Locus respectively in (4).

Setting aside the syntactic controversies of coverbs and localizers in Mandarin Chinese, we regard both of them as adpositions that merit supersense annotations. As in (4), both the coverb zài and the localizer shàng surround an NP argument xuéshù (‘academia’) and they as a whole modify the main predicate yǒusuǒzuòwéi (‘successful’). In this paper, we take the stance that coverbs co-occur with the main predicate and precede an NP, whereas localizers follow a noun phrase and add semantic information to the clause.
4. Corpus Annotation
We chose to annotate the novella The Little Prince because it has been translated into hundreds of languages and dialects, which enables comparisons of linguistic phenomena across languages on bitexts. This is the first Chinese corpus to undergo SNACS annotation. Ongoing adpositional supersense projects on The Little Prince include English, German, French, and Korean. In addition, The Little Prince has received considerable attention from other semantic frameworks and corpora, including the English (Banarescu et al., 2013) and Chinese (Li et al., 2016) AMR corpora.

We use the same Chinese translation of The Little Prince as the Chinese AMR corpus (Li et al., 2016), which is also sentence-aligned with the English AMR corpus (Banarescu et al., 2013). These bitext annotations in multiple languages and semantic annotation frameworks can facilitate cross-framework comparisons.

Prior to supersense annotation, we conducted the following preprocessing steps in order to identify the adposition targets that merit supersense annotation.
Tokenization
After automatic tokenization using Jieba, we conducted manual corrections to ensure that all potential adpositions occur as separate tokens, closely following the Chinese Penn Treebank segmentation guidelines (Xia, 2000). The final corpus includes all 27 chapters of The Little Prince, with a total of 20k tokens.
Adposition Targets
All annotators jointly identified adposition targets according to the criteria discussed in §3. Manual identification of adpositions was necessary as an automatic POS tagger was found unsuitable for our criteria (§6).
Data Format
Though parsing is not essential to this annotation project, we ran the StanfordNLP (Qi et al., 2018) dependency parser to obtain POS tags and dependency trees. These are stored alongside supersense annotations in the CoNLL-U-Lex format (modeled after the STREUSLE corpus; Schneider and Smith, 2015; Schneider et al., 2018). CoNLL-U-Lex extends the CoNLL-U format used by the Universal Dependencies (UD; Nivre et al., 2016) project with additional columns for lexical semantic annotations.

The corpus is jointly annotated by three native Mandarin Chinese speakers, all of whom have received advanced training in theoretical and computational linguistics. Supersense labeling was performed cooperatively by 3 annotators for 25% (235/933) of the adposition targets, and for the remainder, independently by the 3 annotators, followed by cooperative adjudication. Annotation was conducted in two phases, and therefore we present two inter-annotator agreement studies to demonstrate the reproducibility of SNACS and the reliability of the adapted scheme for Chinese.

Table 1 shows raw agreement and Cohen’s kappa across three annotators computed by averaging three pairwise comparisons. Agreement levels on scene role, function, and full construal are high for both phases, attesting to the validity of the annotation framework in Chinese. However, there is a slight decrease in function and construal agreement from Phase 1 to Phase 2, possibly due to the seven newly attested adpositions in Phase 2 and the 1-year interval between the two annotation phases.
IAA Samples
Phase    Time       Chapters  Adpositions
Phase 1  July 2018  15–20     111
Phase 2  Sept 2019  26–27     124

Raw Agreement
Phase    Scene  Function  Construal
Phase 1  .92    .95       .90
Phase 2  .93    .90       .89

Kappa
Phase    Scene  Function  Construal
Phase 1  .90    .93       .88
Phase 2  .92    .88       .88

Table 1: Inter-annotator agreement (IAA) results on two samples from different phases of the project.
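The averaging of three pairwise comparisons described for Table 1 can be sketched as follows. This is a minimal illustration, assuming each annotator's labels are given as an equal-length list of strings; the function names are hypothetical and this is not the script used to produce Table 1.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Sequence


def raw_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of items on which two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = raw_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label independently.
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


def mean_pairwise(metric: Callable[[Sequence[str], Sequence[str]], float],
                  annotations: Sequence[Sequence[str]]) -> float:
    """Average a two-annotator metric over all annotator pairs."""
    pairs = list(combinations(annotations, 2))
    return sum(metric(a, b) for a, b in pairs) / len(pairs)
```

With three annotators, `mean_pairwise(cohen_kappa, [ann1, ann2, ann3])` averages the three pairwise kappa values; the same helper applies to scene role, function, or full construal labels.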
                       Toks.   Types
Chapters               27      NA
Sentences              1,597   NA
Tokens                 20,287  NA
Adpositions            933     70
Prepositions           667     42
Postpositions          266     28
Supersenses            933     29
Scene roles            933     28
Functions              933     26
Construals             933     41
Congruent (scene=fxn)  803     25
Divergent (scene≠fxn)  130     16

Table 2: Statistics of the final Mandarin The Little Prince Corpus (the Chinese SNACS Corpus). Tokenization, identification of adposition targets, and supersense labeling were performed manually.
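The proportions quoted elsewhere in the paper can be rederived from the token counts in Table 2. The sketch below copies those counts into a dictionary (the name `TABLE2_TOKENS` and the helper `percent` are illustrative only) and checks that the sub-counts add up.

```python
# Token counts copied from Table 2 of this paper.
TABLE2_TOKENS = {
    "tokens": 20287,
    "adpositions": 933,
    "prepositions": 667,
    "postpositions": 266,
    "congruent": 803,
    "divergent": 130,
}


def percent(part: int, whole: int, digits: int = 1) -> float:
    """Return part/whole as a percentage rounded to `digits` decimals."""
    return round(100.0 * part / whole, digits)


# Share of tokens that are adposition targets.
adposition_rate = percent(TABLE2_TOKENS["adpositions"], TABLE2_TOKENS["tokens"])

# Share of adposition tokens whose scene role differs from the function.
divergent_rate = percent(TABLE2_TOKENS["divergent"], TABLE2_TOKENS["adpositions"], digits=0)

# Prepositions and postpositions partition the adposition targets,
# as do congruent and divergent construals.
parts_add_up = (
    TABLE2_TOKENS["prepositions"] + TABLE2_TOKENS["postpositions"] == TABLE2_TOKENS["adpositions"]
    and TABLE2_TOKENS["congruent"] + TABLE2_TOKENS["divergent"] == TABLE2_TOKENS["adpositions"]
)
```

The same helper gives the per-adposition percentages out of all 933 targets, for example `percent(172, 933)` for zài.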
5. Corpus Analysis
Our corpus contains 933 manually identified adpositions. Of these, 70 distinct adpositions, 28 distinct scene roles, 26 distinct functions, and 41 distinct full construals are attested in annotation. Full statistics of token and type frequencies are shown in Table 2. This section presents the most frequent adpositions in Mandarin Chinese, as well as quantitative and qualitative comparisons of scene roles, functions, and construals between Chinese and English annotations.
We analyze semantic and distributional properties of adpositions in Mandarin Chinese. The top 5 most frequent prepositions and postpositions are shown in Table 3. Prepositions include canonical adpositions such as yīnwèi and coverbs such as zài. Postpositions are localizers such as shàng and zhōng. We observe that prepositions zài and duì are dominant in the corpus (greater than 10%). Other top adpositions are distributed quite evenly between prepositions and postpositions. On the low end, 27 out of the 70 attested adposition types occur only once in the corpus.

Jieba: https://github.com/fxsjy/jieba
CoNLL-U-Lex: https://github.com/nert-nlp/streusle/blob/master/CONLLULEX.md

Prep.    Trans.            %     Count
zài      on                18.4  172
duì      to                11.0  103
bǎ       theme marker
Total

Postp.   Trans.            %     Count
shàng    on top of         9.5   89
zhōng    in the middle of  4.9   46
lǐ       inside of         3.9   36
láishuō  to one's regard   2.7   25
shí      at the time of    2.1   20
Total

Table 3: Percentages and counts of the top 5 prepositions and postpositions in Chinese Little Prince. The percentages are out of all adpositions.
The distribution of scene role and function types in Chinese and English reflects the differences and similarities of adposition semantics in both languages. In Table 4 we compare this corpus with the largest English adposition supersense corpus, STREUSLE version 4.1 (Schneider et al., 2018), which consists of web reviews. We note that the Chinese corpus is proportionally smaller than the English one in terms of token and adposition counts. Moreover, there are fewer scene role, function and construal types attested in Chinese. The proportion of construals in which the scene role differs from the function (scene≠fxn) is also halved in Chinese. In this section, we delve into comparisons regarding scene roles, functions, and full construals between the two corpora both quantitatively and qualitatively.

Overall Distribution of Supersenses
Figures 2 and 3 present the top 10 scene roles and functions in Mandarin Chinese and their distributions in English. It is worth noting that since more scene role and function types are attested in the larger STREUSLE dataset, the percentages of these supersenses in English are in general lower than the ones in Chinese.

There are a few observations in these distributions that are of particular interest. For some of the examples, we use an annotated subset of the English Little Prince corpus for qualitative comparisons, whereas all quantitative results in English refer to the larger STREUSLE corpus of English Web Treebank reviews (Schneider et al., 2018).
Fewer Adpositions in Chinese
As shown in Table 4, the percentage of adposition targets over tokens in Chinese is considerably lower than that in English (4.6% vs. 7.4%). This is due to the fact that Chinese has a stronger preference to convey semantic information via verbal or nominal forms. Examples (6, 7) show that the prepositions used in English, of and in, are translated as copula verbs (shì) and progressives (zhèngzài) in Chinese. Corresponding to Figures 2 and 3, the proportion of the supersense label Topic in English is higher than that in Chinese; and similarly, the supersense label Identity is not attested in Chinese for either scene role or function.

                        toks  % adps  uniq adps  uniq scene  uniq fxn  uniq cons  scene≠fxn  % scene≠fxn
Chinese: Little Prince  20k   4.6%    70         28          26        41         16         14%
English: EWT Reviews    55k   7.4%    111        47          40        170        130        27%

Table 4: Statistics of Adpositional Supersenses in Chinese versus English. % adps presents the proportion of adposition targets over all token counts; uniq adps/scene/fxn/cons demonstrates the type frequency of adposition tokens, scene role and function supersense and construals; scene≠fxn and % scene≠fxn shows the type frequency and proportion of divergent construals. (We exclude possessives and multi-word expressions that are annotated in the English corpus since possessives are not formed by adpositional phrases in Mandarin Chinese.)

Figure 2: Top 10 most frequent scene roles in Chinese versus English.
Figure 3: Top 10 most frequent functions in Chinese versus English.

(6) It was a picture of:Topic a boa constrictor in:Manner the act of:Identity swallowing an animal. (en_lpp_1943.3)

(7) [huà   de]  shì  [[yì   tiáo  mǎngshé]  zhèngzài  tūnshí   [yì   zhī  dà   yěshòu]]
    draw   DE   COP  one    CL    boa       PROG      swallow  one   CL   big  animal
    ‘The drawing is a boa swallowing a big animal’. (en_lpp_1943.3)

Larger Proportion of Locus in Chinese
In both Figure 2 and Figure 3, the percentages of Locus as scene role and function are twice that of the English corpus respectively. This corresponds to the fact that fewer supersense types occur in Mandarin Chinese than in English. As a result, generic locative and temporal adpositions, as well as adpositions tied to thematic roles, have larger proportions in Chinese than in English.

Experiencer as Function in Chinese
Despite the fact that there are fewer supersense types attested in Chinese, Experiencer as a function is specific to Chinese as it does not have any prototypical adpositions in English (Schneider et al., 2019). In (8), the scene role Experiencer is expressed through the preposition to and construed as Goal, which highlights the abstract destination of the ‘air of truth’. This reflects the basic meaning of to, which denotes a path towards a goal (Bowerman and Choi, 2001). In contrast, the lexicalized combination of the preposition duì and the localizer láishuō in (9) is a characteristic way to introduce the mental state of the experiencer, denoting the meaning ‘to someone’s regard’. The high frequency of láishuō and the semantic role of Experiencer (6.3%) underscore its status as a prototypical adposition usage in Chinese.

(8) To:Experiencer↝Goal those who understand life, that would have given a much greater air of truth to my story. (en_lpp_1943.185)

(9) [duì:Experiencer  [dǒngdé      shēnghuó  de  rén]    láishuō:Experiencer],  zhèyàng   shuō  jiù  xiǎndé  zhēnshí
    P:to              know-about   life      DE  people  LC:one’s-regard        this-way  tell  RES  seems   real
    ‘It looks real to those who know about life.’ (zh_lpp_1943.185)

Divergence of Functions across Languages
Among all possible types of construals between scene role and function, here we are only concerned with construals where the scene role differs from the function (scene≠fxn). The basis of Hwang et al.’s (2017) construal analysis is that a scene role is construed as a function to express the contextual meaning of the adposition that is different from its lexical one. Figure 4 presents the top 10 divergent (scene≠fxn) construals in Chinese and their corresponding proportions in English. Strikingly fewer types of construals are formed in Chinese. Nevertheless, Chinese is replete with Recipient↝Direction adpositions, which constitute nearly half of the construals.

Figure 4: Top 10 Construals where scene ≠ function in Chinese versus English.

The 2 adpositions annotated with Recipient↝Direction are duì and xiàng, both meaning ‘towards’ in Chinese. In (10, 11), both English to and Chinese duì have Recipient as the scene role. In (10), Goal is labelled as the function of to because it indicates the completion of the “saying” event. In Chinese, duì has the function label Direction, since duì highlights the orientation of the message uttered by the speaker as in (11). Even though they express the same scene role in the parallel corpus, their lexical semantics still requires them to have different functions in English versus Chinese.

(10) You would have to say to:Recipient↝Goal them: “I saw a house that costs $20,[…] (en_lpp_1943.172).

(11) (nǐ)  bìxū  [duì:Recipient↝Direction  tāmen]  shuō:  “wǒ  kànjiàn  le   yí   dòng  shíwàn  […]  fángzi.”
     2SG   must  P:to                      3PL     say    1SG  see      ASP  one  CL    10,[…]  DE   house
     ‘You must tell them: “I see a house that costs 10,000 francs.” ’ (zh_lpp_1943.172).

New Construals in Chinese
Similar to the distinction between Recipient↝Goal and Recipient↝Direction in English versus Chinese (the prototypical function of to indicates telic motion events; telicity, however, is not required for Direction), language-specific lexical semantics contribute to unique construals in Chinese, i.e. semantic uses of adpositions that are unattested in the STREUSLE corpus. Six construals are newly attested in the Chinese corpus:
• Beneficiary↝Experiencer
• Circumstance↝Time
• PartPortion↝Locus
• Topic↝Locus
• Circumstance↝Accompanier
• Duration↝Instrument

Of these new construals, Beneficiary↝Experiencer has the highest frequency in the corpus. The novelty of this construal lies in the possibility of Experiencer as function in Chinese, shown by the parallel examples in (12, 13), where duì receives the construal annotation Beneficiary↝Experiencer.

(12) One must not hold it against:Beneficiary them. (en_lpp_1943.180)

(13) xiǎohǎizimen  duì:Beneficiary↝Experiencer  dàrénmen  yìnggāi  kuānhòu  xie
     children      P:to                         adults    should   lenient  COMP
     ‘Children should not hold it against adults.’ (zh_lpp_1943.180)

Similarly, other new construals in Chinese result from the lexical meaning of adpositions that are not equivalent to those in English. For instance, the combination of dāng ... shí (during the time of) denotes the circumstance of an event that is grounded by the time (shí) of the event. Different lexical semantics of adpositions necessarily creates new construals when adapting the same supersense scheme to a new language, inducing newly found associations between scene and function roles of these adpositions. Fortunately, though combinations of scene and function require innovation when adapting SNACS to Chinese, the 50 supersense labels are sufficient to account for the semantic diversity of adpositions in the corpus.
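Finding construals that are attested in one corpus but not in another amounts to a set difference over construal labels. The sketch below is only illustrative: the function name `new_construals` is hypothetical, and the two label lists are toy stand-ins rather than the actual Chinese SNACS or STREUSLE inventories.

```python
from typing import Iterable, List


def new_construals(target: Iterable[str], reference: Iterable[str]) -> List[str]:
    """Return construals attested in `target` but absent from `reference`.

    First-occurrence order in `target` is preserved and duplicates are dropped.
    """
    seen_in_reference = set(reference)
    result: List[str] = []
    for construal in target:
        if construal not in seen_in_reference and construal not in result:
            result.append(construal)
    return result


# Toy data for illustration only (not the real corpus inventories).
toy_chinese = ["Locus", "Recipient↝Direction", "Topic↝Locus", "Locus", "Topic↝Locus"]
toy_english = ["Locus", "Recipient↝Goal", "Recipient↝Direction"]

toy_new = new_construals(toy_chinese, toy_english)
```

Applied to the full construal lists of two annotated corpora, the same function would enumerate candidates such as the six construals listed above.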
6. POS Tagging of Adposition Targets
We conduct a post-annotation comparison between manually identified adposition targets and automatically POS-tagged adpositions in the Chinese SNACS corpus. Among the 933 manually identified adposition targets that merit supersense annotation, only 385 (41.3%) are tagged as ADP (adposition) by StanfordNLP (Qi et al., 2018). Figure 5 shows that gold targets are more frequently tagged as VERB than ADP in automatic parses, as well as a small portion that are tagged as NOUN. The inclusion of targets with POS=VERB reflects our discussion in §3 that coverbs co-occurring with a main predicate are included in our annotation. The automatic POS tagger also wrongly predicts some non-coverb adpositions, such as yīnwèi, to be verbs.
Figure 5:
POS Distribution of Gold Adposition Tokens.
The StanfordNLP POS tagger also suffers from low precision (72.6%). Most false positives resulted from the discrepancies in adposition criteria between theoretical studies on Chinese adpositions and the tagset used in Universal Dependencies (UD) corpora such as the Chinese-GSD corpus. For instance, the Chinese-GSD corpus considers subordinating conjunctions (such as rúguǒ, yídàn, jìrán, zhǐyào) adpositions; however, theoretical research on Chinese adpositions such as Li and Thompson (1989) differentiates them from adpositions, since they can never syntactically precede a noun phrase.

Hence, further SNACS annotation and disambiguation efforts on Chinese adpositions cannot rely on the StanfordNLP ADP category to identify annotation targets. Since adpositions mostly belong to a closed set of tokens, we apply a simple rule to identify all attested adpositions which are not functioning as the main predicate of a sentence, i.e., not the root of the dependency tree. As shown in Table 5, our heuristic results in an F1 of 82.4%, outperforming the strategy of using the StanfordNLP POS tagger.

                               P     R     F1
StanfordNLP ADP                72.6  41.3  52.6
attested dep≠root adpositions  75.1  91.3  82.4

Table 5: Adposition identification performance on Chinese SNACS corpus.
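The rule and its evaluation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Token` class, the `deprel` field, and the example dependency labels are hypothetical, and the lexicon of attested adpositions is a toy set.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Token:
    """Hypothetical token record: surface form plus dependency relation."""
    form: str
    deprel: str


def identify_targets(tokens: List[Token], attested: Set[str]) -> List[int]:
    """Indices of tokens that are attested adpositions and not the tree root."""
    return [i for i, tok in enumerate(tokens)
            if tok.form in attested and tok.deprel != "root"]


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(predicted: Iterable[int],
                        gold: Iterable[int]) -> Tuple[float, float, float]:
    """Score predicted target indices against gold target indices."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall, f_score(precision, recall)


# Toy lexicon and parses modeled on examples (4) and (5); labels are assumed.
ATTESTED = {"zài", "shàng"}
example_4 = [Token("tā", "nsubj"), Token("zài", "case"), Token("xuéshù", "obl"),
             Token("shàng", "case"), Token("yǒusuǒzuòwéi", "root")]
example_5 = [Token("nǐ", "nsubj"), Token("yào", "acl"), Token("de", "mark"),
             Token("yáng", "nsubj"), Token("jiù", "advmod"),
             Token("zài", "root"), Token("lǐmiàn", "obj")]
```

Here `identify_targets(example_4, ATTESTED)` selects zài and shàng, while zài in `example_5` is skipped because it is the root. As a consistency check, `f_score(72.6, 41.3)` and `f_score(75.1, 91.3)` round to the F1 values reported in Table 5.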
7. Conclusion
In this paper, we presented the first corpus annotated with adposition supersenses in Mandarin Chinese. The corpus is a valuable resource for examining similarities and differences between adpositions in different languages with parallel corpora and can further support automatic disambiguation of adpositions in Chinese. We intend to annotate additional genres, including native (non-translated) Chinese and learner corpora, in order to more fully capture the semantic behavior of adpositions in Chinese as compared to other languages.
Acknowledgements
We thank anonymous reviewers for their feedback. This research was supported in part by NSF award IIS-1812778 and grant 2016375 from the United States–Israel Binational Science Foundation (BSF), Jerusalem, Israel.
References
Agirre, Eneko, Atutxa, Aitziber, Labaka, Gorka, Lersundi,Mikel, Mayor, Aingeru, and Sarasola, Kepa (2009). Useof rich linguistic information to translate prepositions andgrammatical cases to Basque. In Màrquez, Lluís andSomers, Harold, editors,
Proc. of EAMT , pages 58–65.Barcelona, Catalonia, Spain.Banarescu, Laura, Bonial, Claire, Cai, Shu, Georgescu,Madalina, Griffitt, Kira, Hermjakob, Ulf, Knight, Kevin,Koehn, Philipp, Palmer, Martha, and Schneider, Nathan(2013). Abstract Meaning Representation for sembanking.In
Proc. of the 7th Linguistic Annotation Workshop andInteroperability with Discourse .Bies, Ann, Mott, Justin, Warner, Colin, and Kulick, Seth(2012). English Web Treebank. Technical ReportLDC2012T13, Linguistic Data Consortium, Philadelphia,PA. Bowerman, Melissa and Choi, Soonja (2001). Shapingmeanings for language: universal and language-specificin the acquisition of spatial semantic categories. In Bow-erman, Melissa and Levinson, Stephen, editors,
LanguageAcquisition and Conceptual Development , pages 475–511.Cambridge University Press.Che, Wanxiang, Shao, Yanqiu, Liu, Ting, and Ding, Yu(2016). SemEval-2016 Task 9: Chinese Semantic Depen-dency Parsing. In
Proc. of SemEval , pages 1074–1080.San Diego, California.Che, Wanxiang, Zhang, Meishan, Shao, Yanqiu, and Liu,Ting (2012). SemEval-2012 Task 5: Chinese SemanticDependency Parsing. In
Proc. of *SEM/SemEval , pages378–384. Montréal, Canada.Chinese Knowledge Information Processing Group (CKIP)(1993). Chinese part-of-speech analysis. Technical Report93-05, Taipei.Chodorow, Martin, Tetreault, Joel R., and Han, Na-Rae(2007). Detection of grammatical errors involving prepo-sitions. In
Proc. of the Fourth ACL-SIGSEM Workshopon Prepositions , pages 25–30. Prague, Czech Republic.De Felice, Rachele and Pulman, Stephen G. (2008). Aclassifier-based approach to preposition and determinererror correction in L2 English. In
Proc. of Coling , pages169–176. Manchester, UK.Graën, Johannes and Schneider, Gerold (2017). Crossingthe border twice: reimporting prepositions to alleviateL1-specific transfer errors. In
Proc. of the Joint 6th Work-shop on NLP for Computer Assisted Language Learningand 2nd Workshop on NLP for Research on LanguageAcquisition at NoDaLiDa, Gothenburg, 22nd May 2017 ,pages 18–26. Gothenburg, Sweden.Hao, Xiao-yan, Liu, Wei, Li, Ru, and Liu, Kai-ying (2007).Description systems of the Chinese FrameNet databaseand software tools.
Journal of Chinese Information Processing, 5.
Hashemi, Homa B. and Hwa, Rebecca (2014). A comparison of MT errors and ESL errors. In Calzolari, Nicoletta, Choukri, Khalid, Declerck, Thierry, Loftsson, Hrafn, Maegaard, Bente, Mariani, Joseph, Moreno, Asuncion, Odijk, Jan, and Piperidis, Stelios, editors, Proc. of LREC, pages 2696–2700. Reykjavík, Iceland.
Hermet, Matthieu and Alain, Désilets (2009). Using first and second language models to correct preposition errors in second language authoring. In Proc. of the Fourth Workshop on Innovative Use of NLP for Building Educational Applications, pages 64–72. Boulder, Colorado.
Huang, Hen-Hsen, Shao, Yen-Chi, and Chen, Hsin-Hsi (2016). Chinese preposition selection for grammatical error diagnosis. In Proc. of COLING, pages 888–899. Osaka, Japan.
Hui, Audrey Li Yen (2012).
Order and constituency in Mandarin Chinese, volume 19. Springer.
Hwang, Jena D., Bhatia, Archna, Han, Na-Rae, O’Gorman, Tim, Srikumar, Vivek, and Schneider, Nathan (2017). Double trouble: the problem of construal in semantic annotation of adpositions. In Proc. of *SEM, pages 178–188. Vancouver, Canada.
Li, Bin, Wen, Yuan, Bu, Lijun, Qu, Weiguang, and Xue, Nianwen (2016). Annotating The Little Prince with Chinese AMRs. In Proc. of LAW X – the 10th Linguistic Annotation Workshop, pages 7–15. Berlin, Germany.
Li, Charles N. and Thompson, Sandra A. (1974). Co-verbs in Mandarin Chinese: verbs or prepositions? Journal of Chinese Linguistics, 2(3):257–278.
Li, Charles N. and Thompson, Sandra A. (1989).
Mandarin Chinese: A functional reference grammar. Univ of California Press.
Li, Hui, Japkowicz, Nathalie, and Barrière, Caroline (2005). English to Chinese Translation of Prepositions. In Kégl, Balázs and Lapalme, Guy, editors, Advances in Artificial Intelligence, number 3501 in Lecture Notes in Computer Science, pages 412–416. Springer, Berlin.
Li, Mingqin, Li, Juanzi, Dong, Zhendong, Wang, Zuoying, and Lu, Dajin (2003). Building a large Chinese corpus annotated with semantic dependency. In Proc. of the Second SIGHAN Workshop on Chinese Language Processing, pages 84–91. Sapporo, Japan.
Litkowski, Ken (2014). Pattern Dictionary of English Prepositions. In Proc. of ACL, pages 1274–1283. Baltimore, Maryland, USA.
Litkowski, Ken and Hargraves, Orin (2005). The Preposition Project. In
Proc. of the Second ACL-SIGSEM Workshop on the Linguistic Dimensions of Prepositions and their Use in Computational Linguistics Formalisms and Applications, pages 171–179. Colchester, Essex, UK.
Litkowski, Ken and Hargraves, Orin (2007). SemEval-2007 Task 06: Word-Sense Disambiguation of Prepositions. In Proc. of SemEval, pages 24–29. Prague, Czech Republic.
Lorincz, Kristen and Gordon, Rebekah (2012). Difficulties in learning prepositions and possible solutions. Linguistic Portfolios, 1(1):14.
Nivre, Joakim, Marneffe, Marie-Catherine de, Ginter, Filip, Goldberg, Yoav, Hajič, Jan, Manning, Christopher D., McDonald, Ryan, Petrov, Slav, Pyysalo, Sampo, Silveira, Natalia, Tsarfaty, Reut, and Zeman, Daniel (2016). Universal Dependencies v1: a multilingual treebank collection. In Calzolari, Nicoletta, Choukri, Khalid, Declerck, Thierry, Grobelnik, Marko, Maegaard, Bente, Mariani, Joseph, Moreno, Asuncion, Odijk, Jan, and Piperidis, Stelios, editors, Proc. of LREC, pages 1659–1666. Portorož, Slovenia.
Popović, Maja (2017). Comparing language related issues for NMT and PBMT between German and English.
The Prague Bulletin of Mathematical Linguistics, 108(1):209–220.
Qi, Peng, Dozat, Timothy, Zhang, Yuhao, and Manning, Christopher D. (2018). Universal Dependency parsing from scratch. In Proc. of CoNLL, pages 160–170. Brussels, Belgium.
Schneider, Nathan, Hwang, Jena D., Bhatia, Archna, Srikumar, Vivek, Han, Na-Rae, O’Gorman, Tim, Moeller, Sarah R., Abend, Omri, Shalev, Adi, Blodgett, Austin, and Prange, Jakob (2019). Adposition and Case Supersenses v2.3: Guidelines for English. arXiv:1704.02134v4 [cs]. August 18 version: https://arxiv.org/abs/1704.02134v4.
Schneider, Nathan, Hwang, Jena D., Srikumar, Vivek, Green, Meredith, Suresh, Abhijit, Conger, Kathryn, O’Gorman, Tim, and Palmer, Martha (2016). A corpus of preposition supersenses. In Proc. of LAW X – the 10th Linguistic Annotation Workshop, pages 99–109. Berlin, Germany.
Schneider, Nathan, Hwang, Jena D., Srikumar, Vivek, Prange, Jakob, Blodgett, Austin, Moeller, Sarah R., Stern, Aviram, Bitan, Adi, and Abend, Omri (2018). Comprehensive supersense disambiguation of English prepositions and possessives. In Proc. of ACL, pages 185–196. Melbourne, Australia.
Schneider, Nathan and Smith, Noah A. (2015). A corpus and model integrating multiword expressions and supersenses. In
Proc. of NAACL-HLT, pages 1537–1547. Denver, Colorado.
Schneider, Nathan, Srikumar, Vivek, Hwang, Jena D., and Palmer, Martha (2015). A hierarchy with, of, and for preposition supersenses. In Proc. of The 9th Linguistic Annotation Workshop, pages 112–123. Denver, Colorado, USA.
Shilon, Reshef, Fadida, Hanna, and Wintner, Shuly (2012). Incorporating linguistic knowledge in statistical machine translation: translating prepositions. In Proc. of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data, pages 106–114. Avignon, France.
Sun, Honglin and Jurafsky, Daniel (2004). Shallow semantic parsing of Chinese. In Proc. of HLT-NAACL, pages 249–256. Boston, Massachusetts, USA.
Tetreault, Joel R. and Chodorow, Martin (2008). The ups and downs of preposition error detection in ESL writing. In
Proc. of Coling, pages 865–872. Manchester, UK.
Weller, Marion, Fraser, Alexander, and Schulte im Walde, Sabine (2015). Target-side generation of prepositions for SMT. In Proc. of EAMT, pages 177–184. Antalya, Turkey.
Weller, Marion, Schulte im Walde, Sabine, and Fraser, Alexander (2014). Using noun class information to model selectional preferences for translating prepositions in SMT. In Proc. of the 11th Conference of the Association for Machine Translation in the Americas, pages 275–287. Vancouver, Canada.
Xia, Fei (2000). The segmentation guidelines for the Penn Chinese Treebank (3.0). Technical Report IRCS-00-06, University of Pennsylvania, Philadelphia, PA.
Xue, Nianwen (2008). Labeling Chinese predicates with semantic roles. Computational Linguistics, 34(2):225–255.
Xue, Nianwen, Bojar, Ondřej, Hajič, Jan, Palmer, Martha, Urešová, Zdeňka, and Zhang, Xiuhong (2014). Not an interlingua, but close: comparison of English AMRs to Chinese and Czech. In Calzolari, Nicoletta, Choukri, Khalid, Declerck, Thierry, Loftsson, Hrafn, Maegaard, Bente, Mariani, Joseph, Moreno, Asuncion, Odijk, Jan, and Piperidis, Stelios, editors,
Proc. of LREC, pages 1765–1772. Reykjavík, Iceland.
Yang, York Chung-Ho and Kuo, June-Jei (1998). The Chinese temporal coverbs, postpositions, coverb-postposition pairs, and their temporal logic. In Proc. of PACLIC, pages 20–32. Singapore.
You, Liping and Liu, Kaiying (2005). Building Chinese FrameNet database. In , pages 301–306. IEEE.
Zhang, Jianping (1999). A Study of Language Model and Understanding Algorithm for Large Vocabulary Spontaneous Speech Recognition. Ph.D. thesis, Tsinghua University.
Zhu, Yilun, Liu, Yang, Peng, Siyao, Blodgett, Austin, Zhao, Yushi, and Schneider, Nathan (2019). Adpositional Supersenses for Mandarin Chinese.