To What Extent are Name Variants Used as Named Entities in Turkish Tweets?
arXiv preprint [cs.CL]
Dilek Küçük
Electrical Power Technologies Department, TÜBİTAK Energy Institute, Ankara, Turkey
Abstract
Social media texts differ from regular texts in various aspects. One of the main differences is the common use of informal name variants instead of well-formed named entities in social media compared to regular texts. These name variants may come in the form of abbreviations, nicknames, contractions, and hypocoristic uses, in addition to names distorted due to capitalization and writing errors. In this paper, we present an analysis of the named entities in a publicly-available tweet dataset in Turkish with respect to their being name variants belonging to different categories. We also provide finer-grained annotations of the named entities as well-formed names and different categories of name variants, where these annotations are made publicly-available. The analysis presented and the accompanying annotations will contribute to related research on the treatment of named entities in social media.
Keywords: named entity, named entity recognition, name variant, Turkish, Twitter
1. Introduction
Automatic extraction and classification of named entities in natural language texts (i.e., named entity recognition (NER)) is a significant topic of natural language processing (NLP), both as a stand-alone research problem and as a subproblem to facilitate solutions of other related NLP problems. NER has been studied for a long time and in different domains, and there are several survey papers on NER including (Marrero et al., 2013).

Conducting NLP research (such as NER) on microblog texts like tweets poses further challenges, due to the particular nature of this text genre. Contractions, writing/grammatical errors, and deliberate distortions of words are common in this informal text genre, which is produced with character limitations and published without a formal review process. There are several studies that propose tweet normalization schemes (Han and Baldwin, 2011) to alleviate the negative effects of such language use in microblogs, so that other NLP tasks can then be performed on the normalized microblogs. Yet, particularly regarding Turkish content, a related study on NER on Turkish tweets (Küçük and Steinberger, 2014) claims that normalization before the actual NER procedure on tweets may not guarantee improved NER performance.

Identification of name variants is an important research issue that can help facilitate tasks including named entity linking (Weichselbraun et al., 2019) and NER, among others. Name variants can appear for several reasons including the use of abbreviations, contracted forms, nicknames, hypocorism, and capitalization/writing errors (Weichselbraun et al., 2019). The identification and disambiguation of name variants have been addressed in studies such as (Driscoll and Yarowsky, 2007) and (Weichselbraun et al., 2019), where resource-based and/or algorithmic solutions are proposed.

In this paper, we consider name variants from the perspective of a NER application and analyze an existing named entity-annotated tweet dataset in Turkish described in (Küçük et al., 2014), in order to further annotate the included named entities with respect to a proprietary name variant categorization. The original dataset includes named entity annotations for eight types: PERSON, LOCATION, ORGANIZATION, DATE, TIME, MONEY, PERCENT, and MISC (Küçük et al., 2014). However, in this study, we target only the first three categories, which amount to a total of 980 annotations in 670 tweets in Turkish. We further annotate these 980 names with respect to a name variant categorization that we propose and try to present a rough estimate of the extent to which different named entity variants are used as named entities in Turkish tweets. The resulting annotations of named entities as different name variants are also made publicly available for research purposes. We believe that both the analysis described in the paper and the publicly-shared annotations (i.e., a tweet dataset annotated for name variants) will help improve research on NER, name disambiguation, and name linking on Turkish social media posts.

The rest of the paper is organized as follows: In Section 2, an analysis of the named entities in the publicly-available Turkish tweet dataset with respect to their being name variants or not is presented together with the descriptions of name variant categories. In Section 3, details and samples of the related finer-grained annotations of named entities are described, and Section 4 concludes the paper with a summary of main points.

2. An Analysis of Turkish Tweets for Name Variants Included
Although NER is an NLP topic that has been studied for a long time, the target genre of the related studies has recently shifted from well-formed texts such as news articles to microblog texts like tweets (Ritter et al., 2011). Following this trend, observed mostly on English content, NER research on other languages like Turkish has also started to target tweets (Küçük et al., 2014; Küçük and Steinberger, 2014). A named entity-annotated dataset consisting of Turkish tweets is described in (Küçük et al., 2014) and the results of NER experiments on Turkish tweets are presented in (Küçük and Steinberger, 2014). Interested readers are referred to (Küçük et al., 2017), which presents a survey of named entity recognition on Turkish, including related work on tweets.

In this study, we analyze the basic named entities (of type PERSON, LOCATION, and ORGANIZATION, henceforth, PLOs) in the annotated dataset compiled in (Küçük et al., 2014), with respect to their being well-formed canonical names or name variants. The dataset includes a total of 1,322 named entity annotations, of which 980 are PLOs (457 PERSON, 282 LOCATION, and 241 ORGANIZATION names); these are the main focus of this paper. These 980 PLOs were annotated within a total of 670 tweets.

We have extracted these PLO annotations from the dataset and further annotated them as belonging to one of the following eight name variant categories that we propose. We should note that a particular name can belong to several categories and therefore, there may be multiple category labels assigned to it. However, the number of category labels does not exceed two in our case, i.e., each name is annotated with either one or two labels in the resulting dataset.

• WELL-FORMED: This category comprises those names which are written in their open and canonical form without any distortions, conforming to the capitalization and other writing rules of Turkish. In Turkish, each token of a name is written with its initial letter capitalized. However, names written entirely in uppercase are also considered within this category, as they cannot be considered writing errors.
• ABBREVIATION: This category represents those names which are provided as abbreviations. This usually applies to named entities of ORGANIZATION type. However, these abbreviations can include writing errors due to capitalization or characters with diacritics, as will be explained below. Hence, those names annotated as ABBREVIATION can also have an additional category label of CAPITALIZATION or DIACRITICS.
• CAPITALIZATION: This category includes those names distorted due to not conforming to the capitalization rules of Turkish. As pointed out above, the initial letter of each token of a named entity is capitalized in Turkish. Additionally, abbreviations of names are generally all in uppercase. Those names not conforming to these rules are marked with the CAPITALIZATION label, denoting a capitalization issue.
• DIACRITICS: There are six letters with diacritics in the Turkish alphabet {ç, ğ, ı, ö, ş, ü}, which are sometimes replaced with their counterparts without diacritics {c, g, i, o, s, u} in informal texts like microblogs (Küçük and Steinberger, 2014). Very rarely, the opposite (and perhaps unintentional) replacement can also be observed in informal texts, where at least one character without diacritics is replaced with a character having diacritics in a word. Named entities including such writing errors are assigned the category label of DIACRITICS.
• HASHTAG-LIKE: Another name variant type is the case where the whitespaces in the names are removed, so that they appear like hashtags, and sometimes they are actually hashtags. Such phenomena are annotated with the category label of HASHTAG-LIKE.
• CONTRACTED: This category represents those name variants in which the original name is contracted by leaving out some of its tokens. Since users like to produce and publish instantly on social media, they tend to contract long organization names in particular, mostly by using the initial token only. Such name variants are annotated as CONTRACTED.
• HYPOCORISM: Hypocorism or hypocoristic use is defined as the phenomenon of deliberately modifying a name, in the forms of nicknames, diminutives, and terms of endearment, to show familiarity and affection (Newman and Ahmad, 1992; Driscoll, 2013). An example of hypocoristic use in English is using Bobby instead of the name Bob (Newman and Ahmad, 1992). Such name variants observed in the tweet dataset are marked with the category label of HYPOCORISM.
• ERROR: This category denotes those name variants which have some form of writing error, excluding issues related to capitalization, diacritics, hypocorism, and removing whitespaces to make names appear like hashtags. Hence, names conforming to this category are labelled with ERROR.

The following section includes examples of the above name variant categories in the Turkish tweet dataset analyzed, in addition to statistical information indicating the share of each category in the overall dataset.

3. Finer-Grained Annotation of Named Entities
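Three of the category definitions given above (CAPITALIZATION, DIACRITICS, and HASHTAG-LIKE) are purely surface-level, so they can be checked mechanically once the canonical form of a name is known. The following Python sketch illustrates this under stated assumptions: the canonical form is supplied by the caller, the function names are hypothetical, and the UNRESOLVED fallback is not one of the eight categories but a placeholder for cases (abbreviations, contractions, hypocorisms, other errors) that such a comparison cannot decide. It is an illustration only and is not presented as the procedure used to produce the annotations.

```python
# Map Turkish letters with diacritics to the plain counterparts listed above.
_PLAIN = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def strip_diacritics(text):
    return text.translate(_PLAIN)


def tr_lower(text):
    # Turkish-aware lowercasing: I -> ı and İ -> i before the generic lower().
    return text.replace("I", "ı").replace("İ", "i").lower()


def tr_upper(text):
    # Turkish-aware uppercasing: i -> İ and ı -> I before the generic upper().
    return text.replace("i", "İ").replace("ı", "I").upper()


def surface_labels(observed, canonical):
    """Return the surface-level variant labels of `observed` w.r.t. `canonical`."""
    # Canonical spelling, or the same name written entirely in uppercase.
    if observed in (canonical, tr_upper(canonical)):
        return {"WELL-FORMED"}

    obs = observed.lstrip("#").strip()
    obs_joined = obs.replace(" ", "")
    can_joined = canonical.replace(" ", "")

    # If the names differ beyond case, diacritics, and whitespace, this
    # comparison cannot decide the category (hypothetical fallback label).
    if strip_diacritics(tr_lower(obs_joined)) != strip_diacritics(tr_lower(can_joined)):
        return {"UNRESOLVED"}

    labels = set()
    if " " in canonical.strip() and " " not in obs:
        labels.add("HASHTAG-LIKE")

    # Compare case patterns after removing diacritics; all-uppercase is accepted.
    ascii_obs = strip_diacritics(obs_joined)
    ascii_can = strip_diacritics(can_joined)
    if ascii_obs not in (ascii_can, ascii_can.upper()):
        labels.add("CAPITALIZATION")

    # Compare letters after Turkish-aware lowercasing; any remaining
    # difference can only come from diacritics.
    if tr_lower(obs_joined) != tr_lower(can_joined):
        labels.add("DIACRITICS")

    return labels or {"WELL-FORMED"}


if __name__ == "__main__":
    pairs = [
        ("istanbul", "İstanbul"),
        ("Kutahya", "Kütahya"),
        ("SabriSarıoğlu", "Sabri Sarıoğlu"),
        ("Diyanet", "Diyanet İşleri Başkanlığı"),
    ]
    for observed, canonical in pairs:
        print(observed, "->", sorted(surface_labels(observed, canonical)))
```

The Turkish-aware case helpers matter because the generic `str.lower()` and `str.upper()` do not map the dotted and dotless forms of the letter I the way Turkish orthography does.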
We have annotated the PLOs in the tweet dataset (already annotated for named entities as described in (Küçük et al., 2014)) with the name variant category labels of WELL-FORMED, ABBREVIATION, CAPITALIZATION, DIACRITICS, HASHTAG-LIKE, CONTRACTED, HYPOCORISM, and ERROR, as described in the previous section. Although there are 980 PLOs in the dataset, since 44 names have two name variant category labels, the total number of name variant annotations is 1,024.

The percentages of the category labels in the final annotation file are provided as a bar graph in Figure 1. As indicated in the figure, about 60% of all named entities are well-formed, and hence about 40% of them are not in their canonical open form or do not conform to the capitalization/writing rules regarding named entities in Turkish.

Figure 1: Statistical Information for Each Named Entity Variant Category in the Turkish Tweet Dataset.

The most common issue is the lack of proper capitalization of names in tweets, as revealed by the 22.56% share of names annotated with the CAPITALIZATION label. For instance, people write istanbul instead of the correct form İstanbul and ankara instead of Ankara in their tweets.

The number of names having issues with characters with diacritics is 45, and similarly there are 45 abbreviations (of mostly organization names) in the dataset. As examples of names having issues with diacritics, people use Kutahya instead of the correct form Kütahya, and similarly Besiktas instead of Beşiktaş. Abbreviations in the dataset include national corporations like TRT and SGK, and international organizations like UEFA.

Instances of the categories of HASHTAG-LIKE and CONTRACTED are observed in 38 and 35 names, respectively. A sample name variant marked with HASHTAG-LIKE is SabriSarıoğlu, where this person name should have been written as Sabri Sarıoğlu. A contracted name instance in the dataset is Diyanet, which is an organization name with the correct open form of Diyanet İşleri Başkanlığı.

The instances of HYPOCORISM and ERROR are comparatively few: 10 instances of hypocorism and 11 instances of other errors are seen in the dataset. An instance of the former category is Nazlış, which is a hypocoristic use of the female person name Nazlı. An instance of the ERROR category is the use of FENEBAHÇE instead of the correct sports club name FENERBAHÇE.

Overall, this finer-granularity analysis of named entities as name variants in a common Turkish tweet dataset is significant due to the following reasons.

• The analysis leads to a breakdown of different named entity variants into eight categories. Although about 60% of the names are in their correct and canonical forms, about 40% of them either appear as abbreviations or suffer from deviation from the standard form due to multiple reasons including violations of the writing rules of the language. Hence, it provides an insight about the extent of the use of different name variants as named entities in Turkish tweets.
• The use of different name variants is significant for several NLP tasks including NER on social media, name disambiguation and linking. A recent and popular research topic that may benefit from patterns governing name variants is stance detection, where the position of a post owner towards a target is explored, mostly using the content of the post (Mohammad et al., 2016). A recent study reports that named entities can be used as improving features for the stance detection task (Küçük, 2017). Hence, an analysis of name variants can contribute to the algorithmic/learning-based proposals for these research problems.

The name variant annotations described in the study are made publicly available at https://github.com/dkucuk/Name-Variants-Turkish-Tweets as a text file, for research purposes.

Each line in the annotation file is a triplet whose items are separated by semicolons. The first item in each triplet is the tweet id, the second item is another triplet denoting the already-existing named entity boundaries and type, and the final item is a comma-separated list of name variant annotations for the named entity under consideration. Consider two sample lines from the annotation file. The first line indicates a person name (between the non-white-space characters of 0 and 11 in the tweet text) annotated with the CAPITALIZATION category, as it lacks proper capitalization. The second line denotes an organization name (between the non-white-space characters of 0 and 19 in the tweet) which has issues related to characters with diacritics and proper capitalization.
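The line format described above can be read with a few lines of code. The Python sketch below is a minimal example under stated assumptions: the text only specifies that the three outer items are separated by semicolons, so the inner triplet is assumed here to be comma- or whitespace-separated (optionally wrapped in brackets), and the type and function names are hypothetical. The two example lines are constructed to mirror the two cases described above, with placeholder tweet ids; they are not copied from the annotation file.

```python
import re
from typing import NamedTuple, Tuple


class VariantAnnotation(NamedTuple):
    tweet_id: str
    start: int          # first non-white-space character offset of the name
    end: int            # last non-white-space character offset of the name
    entity_type: str    # PERSON, LOCATION, or ORGANIZATION
    variant_labels: Tuple[str, ...]


def parse_annotation_line(line):
    """Parse one 'tweet_id;start,end,TYPE;LABEL[,LABEL]' line (assumed layout)."""
    # Outer triplet: exactly three semicolon-separated items.
    tweet_id, entity, labels = (part.strip() for part in line.strip().split(";"))

    # Inner triplet: boundaries and entity type; the separator is an assumption.
    tokens = [tok for tok in re.split(r"[,\s]+", entity.strip("()<>[] ")) if tok]
    start, end, entity_type = tokens

    # Final item: comma-separated list of one or two name variant labels.
    variant_labels = tuple(lab.strip() for lab in labels.split(",") if lab.strip())
    return VariantAnnotation(tweet_id, int(start), int(end), entity_type, variant_labels)


if __name__ == "__main__":
    # Hypothetical lines with placeholder tweet ids, NOT taken from the file.
    examples = [
        "1001;0,11,PERSON;CAPITALIZATION",
        "1002;0,19,ORGANIZATION;DIACRITICS,CAPITALIZATION",
    ]
    for example in examples:
        print(parse_annotation_line(example))
```

A line that does not split into exactly three outer items, or whose inner item does not contain exactly three tokens, raises a `ValueError`, which makes a mismatch between the assumed and the actual layout easy to detect.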
4. Conclusion
This paper focuses on named entity variants in Turkish tweets and presents the related analysis results on a common named-entity annotated tweet dataset in Turkish. The named entities of type person, location, and organization names are further categorized into eight proprietary name variant classes and the resulting annotations are made publicly available. The results indicate that about 40% of the considered names deviate from their standard canonical forms in these tweets and the categorizations for these cases can be used by researchers to devise solutions for related NLP problems. These problems include named entity recognition, name disambiguation and linking, and more recently, stance detection.
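The "about 60% well-formed, about 40% variant" split can be cross-checked from the counts reported in the paper. The Python sketch below does this arithmetic under two assumptions that the text does not state explicitly: that the 22.56% share of CAPITALIZATION is computed over the 1,024 annotations (rather than over the 980 names), and that WELL-FORMED accounts for all remaining annotations. The CAPITALIZATION and WELL-FORMED counts are therefore derived values, not reported ones.

```python
# Counts reported in the paper.
plo_counts = {"PERSON": 457, "LOCATION": 282, "ORGANIZATION": 241}
names_with_two_labels = 44
reported_counts = {
    "DIACRITICS": 45,
    "ABBREVIATION": 45,
    "HASHTAG-LIKE": 38,
    "CONTRACTED": 35,
    "HYPOCORISM": 10,
    "ERROR": 11,
}
capitalization_share = 22.56  # percent, as reported

total_names = sum(plo_counts.values())                     # 980
total_annotations = total_names + names_with_two_labels    # 1024

# Assumption: the reported share is taken over all annotations.
capitalization = round(capitalization_share / 100 * total_annotations)

# Assumption: every annotation not counted above is WELL-FORMED.
well_formed = total_annotations - capitalization - sum(reported_counts.values())
well_formed_share = 100 * well_formed / total_annotations

print(f"names: {total_names}, annotations: {total_annotations}")
print(f"derived CAPITALIZATION count: {capitalization}")
print(f"derived WELL-FORMED count: {well_formed} ({well_formed_share:.2f}%)")
```

Under these assumptions the derived WELL-FORMED share comes out just under 60%, which is consistent with the rounded figures quoted in the text.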
5. References
Patricia Driscoll and David Yarowsky. 2007. Disambiguation of standardized personal name variants. In Proceedings of IWMMIES, pages 1–7.
Patricia Driscoll. 2013. Computational methods for name normalization using hypocoristic personal name variants. In Multi-source, multilingual information extraction and summarization, pages 73–91.
Bo Han and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 368–378.
Dilek Küçük. 2017. Joint named entity recognition and stance detection in tweets. arXiv preprint arXiv:1707.09611.
Dilek Küçük and Ralf Steinberger. 2014. Experiments to improve named entity recognition on Turkish tweets. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), pages 71–78.
Dilek Küçük, Guillaume Jacquet, and Ralf Steinberger. 2014. Named entity recognition on Turkish tweets. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), pages 450–454.
Doğan Küçük, Nursal Arıcı, and Dilek Küçük. 2017. Named entity recognition in Turkish: Approaches and issues. In Proceedings of the International Conference on Applications of Natural Language to Information Systems, pages 176–181.
Mónica Marrero, Julián Urbano, Sonia Sánchez-Cuadrado, Jorge Morato, and Juan Miguel Gómez-Berbís. 2013. Named entity recognition: fallacies, challenges and opportunities. Computer Standards & Interfaces, 35(5):482–489.
Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016. Semeval-2016 task 6: Detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 31–41.
Paul Newman and Mustapha Ahmad. 1992. Hypocoristic names in Hausa. Anthropological Linguistics, pages 159–172.
Alan Ritter, Sam Clark, Oren Etzioni, et al. 2011. Named entity recognition in tweets: an experimental study. In Proceedings of the conference on empirical methods in natural language processing, pages 1524–1534.
Albert Weichselbraun, Philipp Kuntschik, and Adrian MP Brasoveanu. 2019. Name variants for improving entity discovery and linking. In