Construction of a Japanese Word Similarity Dataset
Yuya Sakaizawa and Mamoru Komachi
Tokyo Metropolitan University
6-6 Asahigaoka, Hino City, Tokyo 191-0065, Japan
{sakaizawa-yuya@ed., komachi@}tmu.ac.jp

Abstract
An evaluation of distributed word representation is generally conducted using a word similarity task and/or a word analogy task. There are many datasets readily available for these tasks in English. However, evaluating distributed representation in languages that do not have such resources (e.g., Japanese) is difficult. Therefore, as a first step toward evaluating distributed representations in Japanese, we constructed a Japanese word similarity dataset. To the best of our knowledge, our dataset is the first resource that can be used to evaluate distributed representations in Japanese. Moreover, our dataset contains various parts of speech and includes rare words in addition to common words.
Keywords: word embeddings, distributed representation, word similarity
1. Introduction
Traditionally, a word is represented as a sparse vector indicating the word itself (one-hot vector) or the context of the word (distributional vector). However, both the one-hot notation and distributional notation suffer from data sparseness, since dimensions of the word vector do not interact with each other. Distributed word representation addresses the data sparseness problem by constructing a dense vector of a fixed length, wherein contexts are shared (or distributed) across dimensions. Distributed word representation is known to improve the performance of many NLP applications such as machine translation (Chen and Guo, 2015) and sentiment analysis (Tai et al., 2015), to name a few. The task of learning a distributed representation is called representation learning.

However, evaluating the quality of the learned distributed word representation itself is not straightforward. In language modeling, perplexity or cross-entropy is widely accepted as a de facto standard for intrinsic evaluation. In contrast, distributed word representations include the additive (or compositional) property of the vectors, which cannot be assessed by perplexity. Moreover, perplexity makes little use of infrequent words; thus, it is not appropriate for evaluating distributed representations that try to represent them. Therefore, a word similarity task and/or a word analogy task are generally used to evaluate distributed word representations in the NLP literature. The former judges whether distributed word representations improve the modeling of contexts, and the latter estimates how well the learned representations achieve the additive property.
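As a concrete illustration of how dense vectors are compared in a word similarity task, the sketch below scores two toy pairs by cosine similarity. The vocabulary and vector values are invented for illustration only; they are not taken from any trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 4-dimensional embeddings (illustrative values, not from a trained model).
vec = {
    "king":  [0.8, 0.1, 0.7, 0.2],
    "queen": [0.7, 0.2, 0.8, 0.1],
    "apple": [0.1, 0.9, 0.0, 0.6],
}

print(cosine(vec["king"], vec["queen"]))  # high: vectors point in similar directions
print(cosine(vec["king"], vec["apple"]))  # low: vectors are nearly orthogonal
```

Because contexts are distributed across dimensions, related words receive nearby vectors and their cosine is high even if the words never co-occurred directly.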
However, such resources seldom exist for languages other than English (e.g., Japanese). In addition, most of these datasets comprise high-frequency nouns, so they tend not to include other parts of speech. Hence, previous data fail to evaluate word representations of other parts of speech, including content words such as verbs and adjectives. To address the lack of a dataset for evaluating Japanese distributed word representations, we propose to build a Japanese dataset for the word similarity task.
Currently at JustSystems Corporation.
The main contributions of our work are as follows:
• To the best of our knowledge, it is the first work that constructs a Japanese word similarity dataset.
• The dataset contains various parts of speech and includes rare words in addition to common words.
2. Related Work
In general, distributed word representations are evaluated using a word similarity task. For instance, WordSim353 (Finkelstein et al., 2002), MC (Miller and Charles, 1991), RG (Rubenstein and Goodenough, 1965), and SCWS (Huang et al., 2012) have been used to evaluate word similarities in English. Moreover, Baker et al. (2014) built a verb similarity dataset (VSD) based on WordSim353 because there was no dataset of verbs in the word-similarity task. Recently, SimVerb-3500 was introduced to evaluate human understanding of verb meaning (Gerz et al., 2016). It provides human ratings for the similarity of 3,500 verb pairs, enabling robust evaluation of distributed representations for verbs. However, most of these datasets include English words only. There has been no Japanese dataset for the word-similarity task.

Apart from English, WordSim353 and SimLex-999 (Hill et al., 2015) have been translated and rescored in other languages: German, Italian and Russian (Leviant and Reichart, 2015). SimLex-999 has also been translated and rescored in Hebrew and Croatian (Mrksic et al., 2017). SimLex-999 explicitly targets similarity rather than relatedness and includes adjective, noun and verb pairs. However, this dataset contains only frequent words.

In addition, the distributed representation of words is generally learned using only word-level information. Consequently, the distributed representation for low-frequency words and unknown words cannot be learned well with conventional models. However, low-frequency words and unknown words often comprise high-frequency morphemes (e.g., unkingly → un + king + ly). Some previous studies take advantage of morphological information to provide a suitable representation for low-frequency words and unknown words (Luong et al., 2013; Soricut and Och, 2015). Morphological information is particularly important for Japanese since Japanese is an agglutinative language.

Figure 1: An example of the dataset from a previous study (Kodaira et al., 2016). Sentence: "I don't think it is likely to not include these people, or [exclude]" (まさかこういった方々を対象としない、[排除する]わけではないと思いますが); paraphrases: ignore (無視する), ostracize (排斥する), avoid (敬遠する), exclude (排除する), remove (除外する).

Frequency   1-    101-   1001-   10001-
Verb        239   539    710     598
Adjective   183   322    523     350
Noun        15    63     172     258
Adverb      23    75     80      81
Table 1: The number of words of each part of speech, classified by frequency.
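Performance on a word similarity dataset is conventionally reported as Spearman's ρ between the human ratings and the model's similarity scores over the same word pairs. The sketch below computes ρ from scratch; the human ratings and model scores are hypothetical values, not taken from any of the datasets above.

```python
def rankdata(xs):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend over a run of tied values
        mean_rank = (i + j) / 2 + 1
        for k in order[i:j + 1]:
            ranks[k] = mean_rank
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Hypothetical human ratings and model cosine scores for five word pairs.
human = [9.3, 7.3, 6.0, 2.7, 1.7]
model = [0.91, 0.66, 0.71, 0.30, 0.12]
print(spearman(human, model))  # close to 1 when the model ranks pairs like humans do
```

Only the ranking matters, so the human 10-point scale and the model's cosine scale need not be calibrated against each other.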
3. Construction of a Japanese Word Similarity Dataset
What makes a pair of words similar? Most of the previous datasets do not concretely define the similarity of word pairs. The difference in the similarity of word pairs originates in each annotator's mind, resulting in different scales for a word. Thus, we propose to use an example-based approach (Table 2) to control the variance of the similarity ratings. We removed the context of each word when we extracted it; consequently, we expect the similarity ratings of an ambiguous word to have high variance, whereas those of a monosemous word should have low variance.

For this study, we constructed a Japanese word similarity dataset. We followed the procedure used to construct the Stanford Rare Word Similarity Dataset (RW) (Luong et al., 2013). We extracted Japanese word pairs from the Evaluation Dataset of Japanese Lexical Simplification (Kodaira et al., 2016). It targets content words (nouns, verbs, adjectives, adverbs) and includes 10 contexts per target word, annotated with lexical substitutions and their rankings. Figure 1 shows an example of the dataset. A word in square brackets in the text is the target word of simplification. A target word is recorded not only in the lemma form but also in the conjugated form. We built a Japanese similarity dataset from this dataset using the following procedure.

Word selection:
First, paraphrase candidates were extracted from this dataset. Because the construction process of the simplification dataset was divided into a paraphrase acquisition phase and a simplification ranking phase, we simply discarded the simplification rankings to obtain paraphrase candidates. Table 1 shows the frequency of the extracted words in the Japanese Wikipedia as of May 2015. As shown in the table, low-frequency words are included in the dataset. The dataset is available at https://github.com/tmu-nlp/JapaneseWordSimilarityDataset

word 1             word 2
close (瞑る)       close (つぶる)
拭き取る           wipe (拭う)
塞ぎ込んだ         sick (病んだ)
手探る             go (行く)
とばせる           control (制御できる)
Table 2: Example of the degree of similarity when we requested annotation at Lancers.
Pair construction:
Because the extracted words are annotated with their paraphrase candidates, we took each word and one of its paraphrase candidates as a word pair. Consequently, we acquired 5,051 verb pairs, 4,033 adjective pairs, 1,528 noun pairs and 902 adverb pairs. To balance the numbers of verb and adjective pairs with the other parts of speech, we sampled verb and adjective pairs at random. Finally, we obtained 1,464 verb pairs and 960 adjective pairs.

We observed that the similarity of the pairs extracted from the dataset of Kodaira et al. (2016) was low without providing contexts; thus, we did not augment the dataset by inserting pseudo-negative instances from WordNet's synsets, as was done in the RW corpus. Another reason why we did not employ the synsets from the Japanese WordNet (Isahara et al., 2008) was that their quality was not as good as that of the English WordNet, except for concrete nouns.

Human judgment:
We opted to use a crowdsourcing service (Lancers) to hire native Japanese speakers. We asked annotators to assign the degree of similarity for each pair using the same 10-point scale. In the crowdsourcing request, we indicated that the similarity of pairs with different notations, such as "write (書いた)" and "write (かいた)", is 10. We used only those annotators who had completed at least 95% of their previous assignments correctly. We collected similarity ratings for each word pair from ten annotators and defined the average of their annotations as the similarity of the pair.

Although Kodaira et al. (2016) gave the annotators the context during annotation, we removed the context and gave only the pairs to the annotators. We did so because previous datasets such as VSD and RW did not present any context during annotation; moreover, SCWS has a very high variance even though it is annotated with contexts (Table 5). To improve the quality of the annotation, we presented an example of the degree of similarity of the pairs during annotation (Table 2). Consequently, we collected 4,851 pairs overall. Table 4 shows examples of pairs from our dataset.

(The Japanese WordNet might be of lower quality because it was translated from the English WordNet; this is also why we decided not to translate an existing English word similarity dataset to create a Japanese version.)

POS   verb   adj    adv    noun
IAA   0.69   0.67   0.61   0.56
Table 3: Inter-annotator agreement (IAA) for each POS.

Inter-annotator agreements for each POS are shown in Table 3. The inter-annotator agreement is the average Spearman's ρ between a single annotator and the average of all others.
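The leave-one-out agreement defined above can be sketched as follows. For brevity, the Spearman implementation assumes tie-free rating lists; the three annotators' ratings are hypothetical, not taken from our dataset.

```python
def spearman(a, b):
    """Spearman's rho via the rank-difference formula (assumes no ties)."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=xs.__getitem__)
        r = [0] * len(xs)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

def leave_one_out_iaa(ratings):
    """Average Spearman's rho between each annotator and the mean of the others.

    ratings: one list of scores per annotator, all over the same word pairs.
    """
    rhos = []
    for i in range(len(ratings)):
        others = [r for j, r in enumerate(ratings) if j != i]
        avg_others = [sum(col) / len(col) for col in zip(*others)]
        rhos.append(spearman(ratings[i], avg_others))
    return sum(rhos) / len(ratings)

# Hypothetical ratings: three annotators over the same four word pairs.
ratings = [
    [9, 6, 4, 2],
    [7, 8, 5, 1],
    [10, 7, 2, 3],
]
print(leave_one_out_iaa(ratings))
```

Comparing each annotator against the average of the rest, instead of against every other annotator pairwise, dampens the influence of any single noisy rater on the reference ranking.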
4. Discussion
Table 5 shows how the variance of several resources differs. WordSim353 comprises high-frequency words, so its variance tends to be low. In contrast, RW includes low-frequency words, unknown words, and complex words composed of several morphemes; thus, its variance is large. VSD has many polysemous words, which increase the variance. Although our dataset, like the VSD and RW datasets, contains low-frequency and ambiguous words, its variance is 3.00, which is low compared with the other corpora. We consider that the examples of similarity given in the task request reduced the variance.

We did not expect SCWS to have the largest variance of the datasets shown in Table 5, because it gave the context to annotators during annotation. At the beginning, we thought the context would serve to remove ambiguity and clarify the meaning of a word; however, after looking into the dataset, we determined that the construction procedure admitted several extraordinary annotators. It is crucial to filter insincere annotators and provide straightforward instructions to improve the quality of the similarity annotation, as we did. To obtain better similarity scores, each dataset should utilize a reliability score to exclude extraordinary annotators. For example, in SCWS, an annotator rated the similarity of the pair "CD" and "aglow" as 10. We assume this was a typo or a misunderstanding of the words. To address this problem, such annotations should be removed before calculating the true similarity. All the datasets except RW simply calculate the average of the similarity ratings, but datasets created using crowdsourcing should consider the reliability of each annotator.
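One simple reliability heuristic of the kind discussed here is to drop annotators whose ratings deviate strongly from the per-pair average before re-averaging. The sketch below is purely illustrative (the threshold, function name, and ratings are our own), not the procedure used by any of the datasets above.

```python
def filter_outlier_annotators(ratings, max_mad=2.5):
    """Drop annotators whose mean absolute deviation from the per-pair
    average exceeds max_mad, then return the re-averaged similarities.

    ratings: one list of scores per annotator, all over the same word pairs.
    The threshold 2.5 is an illustrative choice, not an established value.
    """
    n_pairs = len(ratings[0])
    avg = [sum(col) / len(col) for col in zip(*ratings)]
    kept = [r for r in ratings
            if sum(abs(x, ) if False else abs(x - m) for x, m in zip(r, avg)) / n_pairs <= max_mad]
    return [sum(col) / len(col) for col in zip(*kept)]

# Three consistent annotators and one who rates every pair at 10
# (like the "CD" / "aglow" example from SCWS).
ratings = [
    [9, 7, 2, 1],
    [8, 6, 3, 2],
    [9, 6, 2, 1],
    [10, 10, 10, 10],  # extraordinary annotator, excluded by the filter
]
print(filter_outlier_annotators(ratings))
```

With the outlier removed, the re-averaged similarities reflect only the three consistent annotators, so a single insincere rater cannot inflate the score of an unrelated pair.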
We present examples of pairs with high variance in similarity below:
Aspect of relatedness. (e.g., the pairing of "fast (速い)" and "early (早い)".) Although these are similar in meaning with respect to time, they have nothing in common with respect to speed; Annotator A assigned a rating of 10, but Annotator B assigned a rating of 1. Another example is the pairing of "be eager (懇願する)" and "request (頼む)". Even though the act indicated by the two verbs is the same, in some cases they express different degrees of feeling; compared with "request", "be eager" indicates a stronger feeling. There were two annotators who emphasized the similarity of the act itself rather than the different degrees of feeling, and vice versa. In this case, Annotator A assigned a rating of 9, but Annotator B assigned a rating of 2. Although it is necessary to distinguish similarity from semantic relatedness (Mrksic et al., 2016), and we asked annotators to rate the pairs based on semantic similarity, it was not straightforward to place paraphrase candidates on a single scale considering all the attributes of the words. This limitation might be relaxed if we asked annotators to refer to a thesaurus or an ontology such as the Japanese Lexicon (Ikehara et al., 1997).

Comparing spelling. (e.g., the pairing of "slogan (スローガン)" and "slogan (標語)".) In Japanese, a word can be written in hiragana, katakana, or kanji characters; however, because hiragana and katakana represent only the pronunciation of a word, annotators might think of different words. In this case, Annotator A assigned a rating of 8, but Annotator B assigned a rating of 0. We confirmed the same phenomenon in other parts of speech. In particular, nouns can have several word pairs with different spellings, which results in their IAA being low compared to other parts of speech.

Frequency or time expressions. (e.g., the pairing of "often (しばしば)" and "frequently (しきりに)".) We confirmed that the variance becomes larger among adverbs expressing frequency.
This is due to the difference in the frequency that annotators imagine for these words. In this case, Annotator A assigned a rating of 9, but Annotator B assigned a rating of 0. Similarly, we confirmed the same phenomenon among adverbs expressing time.
5. Conclusion
In this study, we constructed the first Japanese word similarity dataset. It contains various parts of speech and includes rare words in addition to common words. Crowdsourced annotators assigned similarity ratings to word pairs in a word similarity task. We gave examples of similarity in the task request sent to annotators, thereby reducing the variance of the ratings for each word pair. However, we did not restrict the attributes of words, such as the level of feeling, during annotation. Error analysis revealed that the notion of similarity should be carefully defined when constructing a similarity dataset.

As future work, we plan to construct a word analogy dataset in Japanese by translating an English dataset into Japanese. We hope that a Japanese dataset will facilitate research in Japanese distributed representations.
(We indicated that these pairs' similarity is 10; however, some annotators ignored this instruction. It would be necessary to clean the spellings of paraphrase candidates before requesting similarity annotation.)

word 1 (EN)   word 1 (JA)      word 2 (EN)   word 2 (JA)   similarity
follow        受け継ぐ          inherit       継承する       9.3
exclude       除外する          remove        除去する       7.3
challenge     チャレンジする     wish          望む           6.0
storm         しける            rough         あれる         5.7
elucidate     明白になる         reflect       反映される     2.7
wander        迷う              stop          止める         1.7
Table 4: Examples of verb pairs in our dataset. The similarity rating is the average of the ratings from ten annotators.

Dataset              Variance
WordSim353           3.16
VSD                  4.76
RW                   5.70
SCWS                 8.60
JWSD (our dataset)   3.00
Table 5: Variance of each dataset.

6. Bibliographical References

Baker, S., Reichart, R., and Korhonen, A. (2014). An Unsupervised Model for Instance Level Subcategorization Acquisition. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 278–289.
Chen, B. and Guo, H. (2015). Representation Based Translation Evaluation Metrics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 150–155.
Isahara, H., Bond, F., Uchimoto, K., Utiyama, M., and Kanzaki, K. (2008). Development of the Japanese WordNet. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC), pages 2420–2423.
Luong, M.-T., Socher, R., and Manning, C. D. (2013). Better Word Representations with Recursive Neural Networks for Morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning (CoNLL), pages 104–113.
Miller, G. A. and Charles, W. G. (1991). Contextual Correlates of Semantic Similarity. Language and Cognitive Processes, 6(1):1–28.
Mrksic, N., Séaghdha, D. Ó., Thomson, B., Gasic, M., Rojas-Barahona, L. M., Su, P., Vandyke, D., Wen, T., and Young, S. J. (2016). Counter-fitting Word Vectors to Linguistic Constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 142–148.
Rubenstein, H. and Goodenough, J. B. (1965). Contextual Correlates of Synonymy. Communications of the ACM, 8(10):627–633.
Soricut, R. and Och, F. (2015). Unsupervised Morphology Induction Using Word Embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1627–1637.
Tai, K. S., Socher, R., and Manning, C. D. (2015). Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 1556–1566.
7. Language Resource References
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and Ruppin, E. (2002). Placing Search in Context: The Concept Revisited. ACM Transactions on Information Systems (TOIS), 20(1):116–131.
Gerz, D., Vulic, I., Hill, F., Reichart, R., and Korhonen, A. (2016). SimVerb-3500: A Large-Scale Evaluation Set of Verb Similarity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2173–2182.
Hill, F., Reichart, R., and Korhonen, A. (2015). SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. Computational Linguistics, 41(4):665–695.
Huang, E. H., Socher, R., Manning, C. D., and Ng, A. Y. (2012). Improving Word Representations via Global Context and Multiple Word Prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL), pages 873–882.
Ikehara, S., Miyazaki, M., Shirai, S., Yokoo, A., Nakaiwa, H., Ogura, K., Ooyama, Y., and Hayashi, Y. (1997). A Japanese Lexicon. Iwanami Shoten.
Kodaira, T., Kajiwara, T., and Komachi, M. (2016). Controlled and Balanced Dataset for Japanese Lexical Simplification. In Proceedings of the ACL 2016 Student Research Workshop, pages 1–7.
Leviant, I. and Reichart, R. (2015). Judgment Language Matters: Multilingual Vector Space Models for Judgment Language Aware Lexical Semantics. CoRR, abs/1508.00106.
Luong, M.-T., Socher, R., and Manning, C. D. (2013). Better Word Representations with Recursive Neural Networks for Morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning (CoNLL), pages 104–113.
Mrksic, N., Vulic, I., Séaghdha, D. Ó., Leviant, I., Reichart, R., Gasic, M., Korhonen, A., and Young, S. J. (2017). Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints.