WordRep: A Benchmark for Research on Learning Word Representations
Bin Gao (bingao@microsoft.com)
Jiang Bian (jibian@microsoft.com)
Tie-Yan Liu (tyliu@microsoft.com)
Microsoft Research
Abstract
WordRep is a benchmark collection for research on learning distributed word representations (or word embeddings), released by Microsoft Research. In this paper, we describe the details of the WordRep collection and show how to use it in different types of machine learning research related to word embedding. Specifically, we describe how the evaluation tasks in WordRep are selected, how the data are sampled, and how the evaluation tool is built. We then compare several state-of-the-art word representations on WordRep, report their evaluation performance, and discuss the results. After that, we discuss new potential research topics that can be supported by WordRep, in addition to algorithm comparison. We hope that this paper can help people gain a deeper understanding of WordRep, and enable more interesting research on learning distributed word representations and related topics.
1. Introduction
The success of machine learning methods depends heavily on data representation, since different representations may encode different explanatory factors of variation behind the data. Conventional natural language processing (NLP) tasks often use the 1-of-v word representation, where v is the size of the entire vocabulary and each word is represented as a long vector with only one non-zero element. However, such a simple form of word representation meets several challenges. The most critical one is that the 1-of-v representation cannot indicate any relationship between words, even when they are highly correlated semantically or syntactically. For example, while elegant and elegantly have quite similar semantics, their corresponding 1-of-v representation vectors activate different indices, and it is not explicit that elegant is much closer to elegantly than to other words like rough. To deal with this problem, Latent Semantic Analysis (LSA) (Dumais, 2004) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003) were proposed to learn continuous word representations. Unfortunately, it is quite difficult to train LSA or LDA models efficiently on large-scale text data.

Recently, with the rapid development of deep learning techniques, researchers have started to train complex and deep models on large text corpora to learn distributed representations of words (also known as word embeddings) in the form of continuous vectors (Collobert & Weston, 2008; Bengio et al., 2003; Glorot et al., 2011; Mikolov, 2012; Socher et al., 2011; Tur et al., 2012). While conventional NLP techniques usually represent words as indices in a vocabulary, with no notion of relationship between words, word embeddings learned by deep learning approaches aim at explicitly encoding many semantic relationships, as well as linguistic regularities and patterns, into the new word embedding space. For example, a previous study (Bengio et al., 2003) proposed a widely used model architecture for estimating a neural network language model. Collobert et al. (Collobert & Weston, 2008; Collobert et al., 2011) introduced a unified neural network architecture that learns word representations from large amounts of unlabeled training data, to deal with several different natural language processing tasks. Mikolov et al. (Mikolov et al., 2013a;b) proposed the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram) for learning distributed representations of words, also from large amounts of unlabeled text data; these two models can map semantically or syntactically similar words to close positions in the word embedding space, based on the intuition that similar words are likely to occur in similar contexts.

Although research on learning distributed word representations has become very active recently, there are very few large public datasets for the evaluation of word representations. In this paper, we introduce a new large benchmark collection named WordRep, which is built from several different data sources. We describe which evaluation tasks are selected in WordRep, how the data are sampled, and how the evaluation tool is built. We also compare the performance of several state-of-the-art word representations on WordRep. Moreover, we discuss new potential research topics that can be supported by WordRep.

The rest of the paper is organized as follows. Section 2 gives a detailed description of how WordRep is created. In Section 3, we report the performance of several state-of-the-art word representations on WordRep. We then show how to leverage WordRep to study other research topics in Section 4. Finally, the paper is concluded in Section 5.

(Presented at the ICML 2014 Workshop on Knowledge-Powered Deep Learning for Text Mining. Copyright 2014 by the author(s).)
2. Creating WordRep Collection
In this section, we introduce the process of creating the WordRep collection, which consists of three main steps: selecting evaluation tasks, generating evaluation samples, and finalizing the datasets.
The main task for WordRep is the analogical reasoning task introduced by Mikolov et al. (Mikolov et al., 2013a). The task consists of 19,544 questions, each of which is a tuple composed of two word pairs (a, b) and (c, d). From the tuple, the question takes the form "a is to b as c is to ?", denoted as a : b → c : ?. Suppose vec(w) is the learned representation vector of word w, normalized to unit norm. Following (Mikolov et al., 2013a), we answer this question by finding the word d* whose representation vector is closest to the vector vec(b) − vec(a) + vec(c) according to cosine similarity, excluding b and c, i.e.,

    d* = argmax_{x ∈ V, x ≠ b, x ≠ c} (vec(b) − vec(a) + vec(c))^T vec(x).    (1)

The question is judged as correctly answered only if d* is exactly the answer word in the evaluation set. There are two categories of analogical tasks: 8,869 semantic analogies in five subtasks (e.g., England : London → China : Beijing) and 10,675 syntactic analogies in nine subtasks (e.g., amazing : amazingly → unfortunate : unfortunately).

Table 1 gives a summary of Mikolov et al.'s evaluation set. The first five subtasks are semantic questions and the remaining nine subtasks are syntactic questions. The table gives one example tuple for every subtask; it also shows the number of unique word pairs and the number of tuples combined from these pairs in each subtask. Regarding the numbers of unique word pairs, we can see that there are actually very few meaningful pairs in this dataset. After further checking the number of tuples, we find that not all possible combinations of word pairs are used as tuple questions. For example, in the City-in-state subtask, there should be as many as 4,556 tuples combined from 68 word pairs, but only 2,467 tuples are used in the published dataset. It is not clear how these 2,467 tuples were sampled.

In the WordRep collection, we merged the above 14 subtasks into 12 subtasks and expanded the set of word pairs in each subtask by extracting new word pairs from Wikipedia and an English dictionary; we also added 13 new subtasks by deriving pairwise word relationships from WordNet (WordNet, 2010).
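As an illustration, the scoring rule of Eq. (1) can be implemented in a few lines. The sketch below is not the released evaluation tool; it assumes a hypothetical `embeddings` dictionary mapping each vocabulary word to a unit-norm NumPy vector.

```python
import numpy as np

def answer_analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' following Eq. (1): return the
    vocabulary word x (excluding b and c) whose unit-norm vector has the
    highest dot product -- i.e., cosine similarity -- with b - a + c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best_word, best_score = None, -np.inf
    for word, vec in embeddings.items():
        if word in (b, c):  # Eq. (1) excludes only b and c, not a
            continue
        score = float(np.dot(target, vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```

A question then counts as correct only when the returned word equals the gold answer d exactly.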
Before we describe how we generate the evaluation samples, we first introduce the scope of the vocabulary in WordRep. Since the words extracted from Wikipedia, the dictionary, and the Web knowledge base (WordNet) may contain many rare words that are not commonly used, we filter out those words that are not covered by the vocabulary of wiki2010 (Shaoul & Westbury, 2010). This corpus is a snapshot of the Wikipedia corpus in April 2010, which contains about two million articles and 990 million tokens. The vocabulary size of wiki2010 is 731,155, which we regard as large enough for common NLP tasks. Furthermore, note that the evaluation set proposed by Mikolov et al. (Mikolov et al., 2013a) contains a question set of phrase pairs besides the word pairs. To deal with phrases like New York, they simply connect the tokens with an underscore and write them as New_York. In this release of WordRep, we only generate question sets with word pairs and leave phrase pairs for future versions.

2.2.1. Wikipedia Knowledge
We leverage Wikipedia knowledge to enlarge the semantic analogical tasks. For the first two subtasks in Table 1, we merged Common capital city into All capital cities for simplicity. Then, we extracted the Wikipedia pages for all the countries and areas to get a full list of countries, capitals, currencies, and nationality adjectives, so that we could enlarge the question sets in the subtasks All capital cities, Currency, and Nationality adjective. For City-in-state, we found the cities with the largest populations in the 50 U.S. states from Wikipedia and built the city-state pairs accordingly. Note that we only kept single-word named entities and removed phrase-based names.

2.2.2. Dictionary Knowledge
We take advantage of dictionary knowledge to enlarge both semantic and syntactic analogical tasks. We merged the subtask Opposite into the new subtask Antonym (see Section 2.2.3). For the other subtasks in Table 1 (Man-Woman, Adjective to adverb, Comparative, Superlative, Present participle, Past tense, Plural nouns, Plural verbs), we enlarged their corresponding word pairs by extracting new candidates from the Longman Dictionaries. For Man-Woman, we also added some word pairs of male and female animals, like cock and hen. Note that the newly extracted words are filtered by the vocabulary of wiki2010. The statistics of the subtasks enlarged with Wikipedia knowledge and dictionary knowledge are shown in Table 2.

Table 1. Summary of Mikolov et al.'s evaluation set on the analogical reasoning task. (Columns: Subtask; Word pair 1 (a, b); Word pair 2 (c, d).)

2.2.3. WordNet Knowledge
WordNet (WordNet, 2010) is a large lexical database of English. It contains both words and the relations among them, which makes it a gold mine for enlarging both semantic and syntactic analogical tasks. We extracted 13 new subtasks based on the WordNet relations: Antonym, MemberOf, MadeOf, IsA, SimilarTo, PartOf, InstanceOf, DerivedFrom, HasContext, RelatedTo, Attribute, Causes, and Entails. Note that we merged the original subtask Opposite in Table 1 into the new subtask Antonym, and we also filtered the words in the 13 new subtasks by the vocabulary of wiki2010. The statistics and examples of the 13 WordNet subtasks are shown in Table 3.
From Table 2 and Table 3, we can see that WordRep has a much larger number of word pairs as well as word tuples. The word pairs and word tuples can be downloaded from http://research.microsoft.com/en-us/um/beijing/events/kpdltm2014/WordRep-1.0.zip. The size of the compressed package is 1.61 GB.
3. Evaluation on the Benchmark Collection
In this section, we report the performance of several state-of-the-art distributed word representations on WordRep, including CW08 (Collobert & Weston, 2008), RNNLM (Mikolov, 2012), and CBOW (Mikolov et al., 2013a). In particular, we downloaded the public word representations of CW08 (Collobert & Weston, 2008), whose dimension is 50, from http://ml.nec-labs.com/senna/; the public word representations of RNNLM (Mikolov, 2012) were obtained from its public site, which includes three word representation models with dimensions of 80, 640, and 1600, respectively; and we obtained the CBOW (Mikolov et al., 2013a) models by using its online tool (https://code.google.com/p/word2vec/) to train word representations directly on the wiki2010 dataset, where we set the dimension to 100, 200, and 300, respectively.

Table 4 shows the accuracy of each word representation on the enlarged evaluation set of analogical reasoning tasks. From this table, we can see that different word representations yield quite different accuracies on the analogical reasoning tasks. In particular, CW08 with dimension 50 achieves relatively low accuracy compared with RNNLM and CBOW, which may be due to differences in the dimension of the word representations, the training data, or the training algorithms. We can also see that, for the same training method, a larger dimension of word representations is more likely to result in better performance.

Table 5 shows the accuracy of each word representation on the WordNet evaluation set of analogical reasoning tasks. The observations are analogous to those from Table 4. From these two tables, we can also see that different subtasks can yield quite diverse accuracies for the same word representation model.

Note that the above evaluation experiments can be done using the evaluation tool provided by Word2Vec (https://code.google.com/p/word2vec/).
The only difference is that we treat all tuple questions as seen questions. Thus, if a tuple question is unseen by the word embedding being evaluated (i.e., at least one word in the tuple question is not in the vocabulary of the word embedding), we regard it as answered incorrectly.
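This protocol can be sketched as follows. Here `answer_fn` stands in for any analogy solver (such as the Eq. (1) scorer), and the question list and embedding vocabulary are hypothetical. The key point is the denominator: unseen questions stay in it, so they count as errors.

```python
def evaluate_accuracy(questions, embeddings, answer_fn):
    """Accuracy over (a, b, c, d) tuple questions. A question is 'unseen'
    if any of its four words is missing from the embedding vocabulary;
    unseen questions are scored as incorrect (skipped, but still counted
    in the denominator)."""
    correct = 0
    for a, b, c, d in questions:
        if not all(w in embeddings for w in (a, b, c, d)):
            continue  # unseen: contributes no correct answer
        if answer_fn(a, b, c, embeddings) == d:
            correct += 1
    return correct / len(questions) if questions else 0.0
```

Dividing by the full question count, rather than only the seen questions, penalizes embeddings with small vocabularies.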
4. Supporting New Research Topics
Besides measuring the quality of word representations, we can also use WordRep for, but not limited to, the following tasks.

• WordRep can be used to evaluate embeddings for relations. Recently, some researchers have attempted to learn word embeddings and relation embeddings simultaneously. WordRep contains 25 subtasks in total, i.e., there are 25 types of relations that can be used for evaluating relation embeddings.

• WordRep can be used to evaluate relation prediction. The 25 types of relations can be regarded as labels of word pairs. Researchers can test their relation prediction methods using these labels as ground truth.

• WordRep provides several good word lists for general NLP tasks. For example, there are lists for different syntactic forms of nouns, verbs, adjectives, and adverbs, and there are lists for commonly used relations.
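As one concrete (and deliberately simple) way to use the relation labels for prediction, the sketch below represents each word pair by its embedding offset vec(b) − vec(a) and classifies a new pair by the nearest relation centroid. This baseline is our illustration, not a method from the paper; the embeddings and relation names used are hypothetical.

```python
import numpy as np

def relation_centroids(labeled_pairs, embeddings):
    """labeled_pairs: dict mapping relation label -> list of (a, b) pairs.
    Returns relation label -> mean offset vector (b - a, averaged over pairs)."""
    return {
        rel: np.mean([embeddings[b] - embeddings[a] for a, b in pairs], axis=0)
        for rel, pairs in labeled_pairs.items()
    }

def predict_relation(a, b, centroids, embeddings):
    """Predict the relation of (a, b): the label whose centroid is closest
    (in Euclidean distance) to the pair's offset vector."""
    offset = embeddings[b] - embeddings[a]
    return min(centroids, key=lambda rel: np.linalg.norm(offset - centroids[rel]))
```

The WordRep labels would serve as ground truth for both fitting the centroids and scoring the predictions, e.g., with held-out pairs per relation.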
5. Conclusions
In this paper, we have introduced a new data collection called WordRep, which can be used for the evaluation of distributed word representations. We described how we built the data collection and reported the evaluation performance of several state-of-the-art word representations on it. We also discussed the possible research topics that WordRep may support. For future work, we plan to further expand the evaluation set to phrase pairs, and we also plan to enrich the collection by considering other Web knowledge bases like Freebase (Bollacker et al., 2008).
6. Acknowledgements
We would like to thank Siyu Qiu, Chang Xu, Yalong Bai, Hongfei Xue, and Rui Zhang for their contributions to the preparation of the collection and the evaluation of the state-of-the-art word embeddings.

References
Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, March 2003.

Bollacker, K., Evans, C., Paritosh, P., Sturge, T., and Taylor, J. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1247–1250. ACM, 2008.

Collobert, R. and Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pp. 160–167, New York, NY, USA, 2008. ACM.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, November 2011.

Dumais, S. T. Latent semantic analysis. Annual Review of Information Science and Technology, 38(1), 2004.

Glorot, X., Bordes, A., and Bengio, Y. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.

Mikolov, T. Statistical Language Models Based on Neural Networks. PhD thesis, Brno University of Technology, 2012.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013a.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119, 2013b.

Shaoul, C. and Westbury, C. The Westbury Lab Wikipedia corpus. Edmonton, AB: University of Alberta, 2010.
Table 2. Summary of the enlarged evaluation set on the analogical reasoning task. (Columns: Subtask; Word pair 1 (a, b); Word pair 2 (c, d).)

Table 3. Summary of the WordNet evaluation set on the analogical reasoning task. (Columns: Subtask; Word pair 1 (a, b); Word pair 2 (c, d).)

Table 4. Accuracy of various word representations on the enlarged evaluation set of analogical reasoning tasks.
Subtask               | CW08 dim=50 | RNNLM dim=80 | RNNLM dim=640 | RNNLM dim=1600 | CBOW dim=100 | CBOW dim=200 | CBOW dim=300
All capital cities    | 0.62%       | 0.76%        | 1.23%         | 1.81%          | 6.62%        | 9.04%        | 11.28%
Currency              | 0.25%       | 0.41%        | 0.66%         | 0.87%          | 3.13%        | 4.06%        | 4.32%
City-in-state         | 0.67%       | 1.36%        | 3.14%         | 3.38%          | 1.55%        | 1.86%        | 2.25%
Man-Woman             | 4.83%       | 6.67%        | 18.46%        | 20.82%         | 25.89%       | 29.06%       | 28.60%
Adjective to adverb   | 1.40%       | 0.93%        | 1.17%         | 2.01%          | 3.45%        | 3.44%        | 3.23%
Comparative           | 1.55%       | 16.61%       | 34.92%        | 40.28%         | 33.41%       | 41.70%       | 42.53%
Superlative           | 1.94%       | 9.18%        | 25.33%        | 26.21%         | 23.56%       | 24.99%       | 29.07%
Present participle    | 1.53%       | 9.67%        | 20.03%        | 23.26%         | 8.20%        | 11.25%       | 11.75%
Nationality adjective | 3.07%       | 1.62%        | 3.15%         | 3.76%          | 23.66%       | 40.19%       | 47.44%
Past tense            | 1.84%       | 10.43%       | 19.51%        | 22.77%         | 15.51%       | 21.60%       | 24.15%
Plural nouns          | 3.21%       | 3.32%        | 14.42%        | 18.28%         | 23.95%       | 34.64%       | 38.82%
Plural verbs          | 2.44%       | 7.89%        | 22.41%        | 26.62%         | 17.28%       | 27.47%       | 31.82%
Total                 | 2.36%       | 5.08%        | 14.69%        | 17.85%         | 16.70%       | 24.16%       | 27.10%
Table 5. Accuracy of various word representations on the WordNet evaluation set of analogical reasoning tasks.
Subtask     | CW08 dim=50 | RNNLM dim=80 | RNNLM dim=640 | RNNLM dim=1600 | CBOW dim=100 | CBOW dim=200 | CBOW dim=300
Antonym     | 0.28%       | 1.21%        | 2.88%         | 3.12%          | 2.74%        | 4.08%        | 4.57%
Attribute   | 0.22%       | 0.11%        | 0.24%         | 0.42%          | 0.68%        | 1.09%        | 1.18%
Causes      | 0.00%       | 0.31%        | 0.00%         | 0.00%          | 0.15%        | 0.31%        | 1.08%
DerivedFrom | 0.05%       | 0.06%        | 0.16%         | 0.18%          | 0.33%        | 0.53%        | 0.63%
Entails     | 0.05%       | 0.05%        | 0.05%         | 0.07%          | 0.26%        | 0.44%        | 0.38%
HasContext  | 0.12%       | 0.06%        | 0.16%         | 0.19%          | 0.28%        | 0.38%        | 0.35%
InstanceOf  | 0.08%       | 0.24%        | 0.81%         | 0.64%          | 0.48%        | 0.60%        | 0.58%
IsA         | 0.07%       | 0.18%        | 0.42%         | 0.47%          | 0.42%        | 0.64%        | 0.67%
MadeOf      | 0.03%       | 0.15%        | 0.10%         | 0.13%          | 0.33%        | 1.02%        | 0.72%
MemberOf    | 0.08%       | 0.07%        | 0.11%         | 0.13%          | 0.58%        | 0.84%        | 1.06%
PartOf      | 0.31%       | 0.29%        | 0.55%         | 0.60%          | 1.17%        | 1.24%        | 1.27%
RelatedTo   | 0.00%       | 0.04%        | 0.02%         | 0.00%          | 0.20%        | 0.09%        | 0.05%
SimilarTo   | 0.02%       | 0.07%        | 0.14%         | 0.18%          | 0.14%        | 0.23%        | 0.29%
Total       | 0.06%       | 0.15%        | 0.35%         | 0.40%          | 0.40%        | 0.61%        | 0.66%

Socher, R., Lin, C. C., Ng, A. Y., and Manning, C. D. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.

Tur, G., Deng, L., Hakkani-Tür, D., and He, X. Towards deeper understanding: Deep convex networks for semantic utterance classification. In ICASSP, pp. 5045–5048, 2012.

WordNet. "About WordNet." Princeton University, 2010. URL http://wordnet.princeton.edu