DAWT: Densely Annotated Wikipedia Texts across multiple languages
Nemanja Spasojevic, Preeti Bhargava, Guoning Hu
Lithium Technologies | Klout, San Francisco, CA
{nemanja.spasojevic, preeti.bhargava, guoning.hu}@lithium.com
ABSTRACT
In this work, we open up the DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages. The annotations include labeled text mentions mapping to entities (represented by their Freebase machine ids) as well as the type of the entity. The data set contains a total of 13.6M articles, billions of tokens, and 13.8M unique mention-entity pairs. DAWT contains 4.8 times more anchor text to entity links than originally present in the Wikipedia markup. Moreover, it spans several languages including English, Spanish, Italian, German, French and Arabic. We also present the methodology used to generate the dataset, which enriches Wikipedia markup in order to increase the number of links. In addition to the main dataset, we open up several derived datasets including mention-entity co-occurrence counts and entity embeddings, as well as mappings between Freebase ids and Wikidata item ids. We also discuss two applications of these datasets and hope that opening them up will prove useful for the Natural Language Processing and Information Retrieval communities, as well as facilitate multi-lingual research.

Keywords
Wiki, Wikipedia, Freebase, Freebase annotations, Wikipedia annotations, Wikification, Named Entity Recognition, Entity Disambiguation, Entity Linking
1. INTRODUCTION
Over the past decade, the amount of data available to enterprises has grown exponentially. However, a majority of this data is unstructured or free-form text, also known as Dark Data (https://en.wikipedia.org/wiki/Dark_data). This data holds challenges for Natural Language Processing (NLP) and Information Retrieval (IR) tasks unless the text is semantically labeled. Two NLP tasks that are particularly important to the IR community are:
• Named Entity Recognition (NER) - the task of identifying an entity mention within a text,
• Entity Disambiguation and Linking (EDL) - the task of linking the mention to its correct entity in a Knowledge Base (KB).
These tasks play a critical role in the construction of a high quality information network which can be further leveraged for a variety of IR and NLP tasks such as text categorization and topical interest and expertise modeling of users [20, 21]. Moreover, when any new piece of information is extracted from text, it is necessary to know which real world entity this piece refers to. If the system makes an error here, it loses this piece of information and introduces noise. As a result, both these tasks require high quality labeled datasets with densely extracted mentions linking to their correct entities in a KB.

Wikipedia has emerged as the most complete and widely used KB over the last decade. As of today, it has around 5.3 million English articles and 38 million articles across all languages. In addition, due to its open nature and availability, Wikipedia Data Dumps have been adopted by academia and industry as an extremely valuable data asset. Wikipedia precedes other OpenData projects like Freebase [8] and DBpedia [10], which were built on the foundation of Wikipedia. The Freebase Knowledge Graph is the most exhaustive knowledge graph, capturing 58 million entities and 3.17 billion facts. The Wikipedia and Freebase data sets are large in terms of:
• information comprehensiveness,
• wide language coverage,
• number of cross-article links, manually curated cross-entity relations, and language independent entity identifiers.
Although these two data sets are readily available, Wikipedia link coverage is relatively sparse, as only the first entity mention is linked to the entity's Wikipedia article. This sparsity may significantly reduce the number of training samples one may derive from Wikipedia articles which, in turn, reduces the utility of the dataset. In this work, we primarily focus on creating the DAWT dataset that contains denser annotations across Wikipedia articles. We leverage Wikipedia and Freebase to build a large data set of annotated text where entities extracted from Wikipedia text are mapped to Freebase ids. Moreover, this data set spans multiple languages. In addition to the main dataset, we open up several derived datasets for mention occurrence counts, entity occurrence counts, mention-entity co-occurrence counts and entity Word2Vec embeddings. We also discuss two applications of these datasets and hope that opening them up will prove useful for the NLP and IR communities as well as facilitate multi-lingual research.
2. PROBLEM STATEMENT
The Wikification problem was introduced by Mihalcea et al. [13], where the task was to introduce hyperlinks to the correct Wikipedia articles for a given mention. In Wikipedia, only the first mention is linked or annotated. In our task, we focus on densifying the annotations, i.e. adding denser hyperlinks from mentions in Wikipedia articles to other Wikipedia articles. The ultimate goal is to have high-precision hyperlinks with relatively high recall that could be further used as ground truth for other NLP tasks.

For most supervised Machine Learning or NLP tasks, one of the challenges is gathering ground truth at scale. In this work, we try to solve the problem of generating a labeled data set at large scale with the following constraints:
• The linked entity ids need to be unified across different languages. In Freebase, the machine id is the same across different languages and hence, we annotate Wikipedia with Freebase machine ids,
• The dataset needs to be comprehensive (with a large number of entities spanning multiple domains),
• The labels should be precise.
3. CONTRIBUTIONS
Our contributions in this work are:
• We extract a comprehensive inventory of mentions spanning several domains.
• We densify the entity links in the Wikipedia documents by 4.8 times.
• The DAWT dataset covers several more languages in addition to English, such as Arabic, French, German, Italian, and Spanish.
• Finally, we open up this dataset and several other derived datasets (such as mention occurrence counts, entity occurrence counts, mention-entity co-occurrence counts, entity word2vec embeddings and mappings between Freebase ids and Wikidata item ids) for the benefit of the IR and NLP communities.
4. KNOWLEDGE BASE
Our KB consists of about 1 million Freebase machine ids for entities. These were chosen from a subset of all Freebase entities that map to Wikipedia entities. We prefer to use Freebase as our KB since, in Freebase, the same id represents a unique entity across multiple languages. (Freebase was a standard community generated KB until June 2015, when Google deprecated it in favor of the commercially available Knowledge Graph API.) For more general use, we have also provided the mapping from Freebase id to Wikipedia link and Wikidata item id (see Section 6.4). Due to limited resources and the varying usefulness of the entities, our KB contains approximately the 1 million most important entities from among all the Freebase entities. This gives us a good balance between coverage and relevance of entities for processing common social media text. To this end, we calculate an entity importance score [2] using linear regression with features capturing popularity within Wikipedia links and importance of the entity within Freebase. We used signals such as Wiki page rank, Wiki and Freebase incoming and outgoing links, and type descriptors within our KB. We use this score to rank the entities and retain only the top 1 million entities in our KB.

In addition to the KB entities, we also employ two special entities: NIL and MISC. The NIL entity indicates that there is no entity associated with the mention, e.g. the mention 'the' within a sentence may link to the entity NIL. This entity is useful especially when dealing with stop words and false positives. MISC indicates that the mention links to an entity which is outside the selected entity set in our KB.
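The exact feature set and learned weights of the importance model are described in [2]; the following is only a minimal sketch of the idea of ranking entities by a linear importance score, where the feature names, values, and weights are illustrative assumptions rather than the published model.

# Illustrative sketch: rank KB entities by a linear importance score.
# Feature names and weights are hypothetical stand-ins for the trained
# linear regression over Wikipedia/Freebase popularity signals [2].

def importance_score(features, weights):
    """Linear combination of popularity features for one entity."""
    return sum(weights[name] * value for name, value in features.items())

weights = {"wiki_pagerank": 0.5, "incoming_links": 0.3, "outgoing_links": 0.2}

entities = {
    "0k8z":   {"wiki_pagerank": 0.92, "incoming_links": 0.88, "outgoing_links": 0.40},  # Apple Inc.
    "05d1y":  {"wiki_pagerank": 0.75, "incoming_links": 0.70, "outgoing_links": 0.35},  # Nikola Tesla
    "01159r": {"wiki_pagerank": 0.02, "incoming_links": 0.01, "outgoing_links": 0.05},  # Rockland, WI
}

TOP_K = 2  # in the paper this cutoff is roughly 1 million entities
ranked = sorted(entities, key=lambda mid: importance_score(entities[mid], weights), reverse=True)
kb = set(ranked[:TOP_K])
print(kb)  # the retained KB entities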
5. DAWT DATA SET GENERATION
Our main goals when building the DAWT data set were to maintain high precision and increase linking coverage. As shown in Figure 1, we first generate a list of candidate phrases mapping to the Wikipedia articles by combining:
• Wiki articles (direct anchor texts, titles of pages, redirect text to wiki pages),
• Freebase Aliases and Freebase Also Known As fields related to entities,
• Wikipedia Concepts (English anchor texts).

[Figure 1: Candidate Dictionary Generation Overview - anchor texts, titles and redirect text from Wikipedia, aliases and names from Freebase, Wikipedia Concepts anchor texts, and Wikidata aliases are combined into a vanilla dictionary, after which semantically unaligned entities are pruned.]

The initial candidate lists are pruned to remove outlier phrases that do not semantically align with the rest of the list. As a semantic alignment metric between two phrases, we used a combination of Jaccard similarity (both token and 3-gram character), edit distance (token and character), and longest common subsequence. We averaged the metrics and, for each candidate in the list, calculated the average alignment against all the other candidates. As a final step, we remove the candidates, if any, with the lowest alignment scores. For example, in the candidate set {'USA', 'US', 'our', ...} for the entity USA, the phrase 'our' does not align with the rest of the cluster and is filtered out. In addition to the candidate dictionary, we also calculate co-occurrence frequencies, based on direct links from Wikipedia article markup, between any 2 entities appearing within the same sentence.

To generate the DAWT dataset, we do the following. For each supported language, and for each Wiki page in the language:
1. Iterate over the Wiki article and extract the set of directly linked entities.
2. Calculate all probable co-occurring entities with the set of directly linked entities from Step 1.
3. Iterate over the Wiki article and map all phrases to their set of candidate entities.
4. Resolve phrases whose candidates have been directly linked from Step 1.
5. For the remaining unresolved references, choose the candidates, if any, with the highest probable co-occurrence with the directly linked entities.
As a last step, the hyperlinks to Wikipedia articles in a specific language are replaced with links to their Freebase ids to adapt to our KB.

The densely annotated Wikipedia articles have on average 4.8 times more links than the original articles. A detailed description of the data set, per language, along with the total CPU time taken for annotation is shown in Table 1. All experiments were run on an 8-core 2.4GHz Xeon processor with 13 GB RAM. As evident, since English had the highest count of documents as well as entities and mentions, it took the maximum CPU time for annotation.

Table 1: DAWT Data Set Statistics
Language | Article Count | Unique Mention Count | Unique Entity Count | Unique Mention-Entity pairs | Total Mention-Entity links | Total CPU time (days)
en | 5,303,722 | 5,786,727 | 1,276,917 | 6,956,439 | 360,323,792 | 139.5
es | 2,393,366 | 901,370 | 224,695 | 1,038,284 | 62,373,952 | 17.9
it | 1,467,486 | 799,988 | 211,687 | 931,369 | 47,659,715 | 14.2
fr | 1,750,536 | 1,670,491 | 423,603 | 1,952,818 | 93,790,881 | 28.6
de | 1,818,649 | 2,168,723 | 426,556 | 2,438,583 | 103,738,278 | 20.2
ar | 889,007 | 394,024 | 186,787 | 433,472 | 12,387,715 | 1.6

An example of the densely annotated text in JSON format is given below:

Listing 1: Annotated example in JSON file format
{
  "tokens": [
    { "raw_form": "Vlade" },
    { "raw_form": "Divac" },
    { "raw_form": "is" },
    { "raw_form": "a" },
    { "raw_form": "retired" },
    { "raw_form": "Serbian" },
    { "raw_form": "NBA" },
    { "raw_form": "player", "break": ["SENTENCE"] }
  ],
  "entities": [
    {
      "id_str": "01vpr3",
      "type": "PERSON",
      "start_position": 0,
      "end_position": 1,
      "raw_form": "Vlade Divac"
    },
    {
      "id_str": "077qn",
      "type": "LOCATION",
      "start_position": 5,
      "end_position": 5,
      "raw_form": "Serbian"
    },
    {
      "id_str": "05jvx",
      "type": "ORGANIZATION",
      "start_position": 6,
      "end_position": 6,
      "raw_form": "NBA"
    }
  ],
  "id": "wiki_page_id:en:322505:01vpr3:Vlade_Divac"
}
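To illustrate how a record such as Listing 1 can be consumed, the sketch below parses the JSON and reconstructs each annotated mention span from the token list using the inclusive start_position and end_position offsets. The record is abbreviated from Listing 1; the exact on-disk layout of the released files (e.g. one JSON object per line) is an assumption to verify against the dataset.

import json

# Record abbreviated from Listing 1; field names follow the DAWT example above.
record = json.loads("""
{
  "tokens": [{"raw_form": "Vlade"}, {"raw_form": "Divac"}, {"raw_form": "is"},
             {"raw_form": "a"}, {"raw_form": "retired"}, {"raw_form": "Serbian"},
             {"raw_form": "NBA"}, {"raw_form": "player", "break": ["SENTENCE"]}],
  "entities": [
    {"id_str": "01vpr3", "type": "PERSON", "start_position": 0, "end_position": 1, "raw_form": "Vlade Divac"},
    {"id_str": "077qn", "type": "LOCATION", "start_position": 5, "end_position": 5, "raw_form": "Serbian"},
    {"id_str": "05jvx", "type": "ORGANIZATION", "start_position": 6, "end_position": 6, "raw_form": "NBA"}
  ],
  "id": "wiki_page_id:en:322505:01vpr3:Vlade_Divac"
}
""")

tokens = [t["raw_form"] for t in record["tokens"]]
for entity in record["entities"]:
    # start_position/end_position are inclusive token offsets into the token list.
    span = " ".join(tokens[entity["start_position"]:entity["end_position"] + 1])
    print(f'{span!r} -> /m/{entity["id_str"]} ({entity["type"]})')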
6. DERIVED DATASETS
We also derive and open several other datasets from the DAWT dataset, which we discuss here.

6.1 Mention Occurrence Counts
This dataset includes the raw occurrence counts for a mention M_i in our corpus and KB. Table 2 shows the raw counts for the mentions "Apple", "apple" and "Tesla".

Table 2: Mention Occurrence Counts
Mention | Occurrence Count
Apple | 16104
apple | 2742
Tesla | 822

6.2 Entity Occurrence Counts
This dataset includes the raw occurrence counts for an entity E_j in our corpus and KB. Table 3 shows the raw counts for the entities Apple Inc., Apple (fruit) and Nikola Tesla. We also generate separate dictionaries for each language. Table 4 shows the different surface form variations and occurrence counts of the same entity across different languages.

[Table 3: Entity Occurrence Counts]
[Table 4: Surface form variations and occurrence counts of the same entity across different languages]
6.3 Mention Entity Co-occurrence Counts
This dataset includes the co-occurrence counts of mentions and entities. This is particularly useful for estimating the prior probability of a mention M_i referring to a candidate entity E_j with respect to our KB and corpora. Table 5 shows the raw and normalized mention-entity co-occurrences for the mentions "Apple" and "apple" and different candidate entities. As evident, the probability of the mention "Apple" referring to the entity Apple Inc. is higher than to the entity Apple (fruit). However, "apple" most likely refers to the entity Apple (fruit). Similarly, the mention "Tesla" most likely refers to the entity Nikola Tesla.

Table 5: Mention Entity Co-occurrences
Mention | Entity | Co-occurrence | Normalized Co-occurrence
Apple | 0k8z-Apple Inc. | 6738 | 87.8%
Apple | 014j1m-Apple (fruit) | 422 | 5.5%
Apple | 019n_t-Apple Records | 302 | 3.9%
Apple | 02hwrl-Apple Store | 87 | 1.1%
Apple | 02_7z_-Apple Corps. | 84 | 1.1%
apple | 014j1m-Apple | 1295 | 85.3%
apple | 01qd72-Malus | 157 | 10.3%
apple | 02bjnm-Apple juice | 46 | 3.0%
apple | 0gjjvk-The Apple Tree | 15 | 1.0%
apple | 0k8z-Apple Inc. | 1 | 0.1%
Tesla | 05d1y-Nikola Tesla | 327 | 49.5%
Tesla | 0dr90d-Tesla Motors | 162 | 24.5%
Tesla | 036wfx-Tesla (Band) | 92 | 13.9%
Tesla | 02rx3cy-Tesla (Microarchitecture) | 38 | 5.7%
Tesla | 03rhvb-Tesla (Unit) | 29 | 4.4%

6.4 Freebase id to Wikidata id Mappings
In this data set, we use the Freebase machine id to represent an entity. To facilitate studies using Wikidata ids, which are also widely used entity ids in the literature, we provide a data set that maps individual Freebase ids to Wikidata ids. This data set contains twice as many mappings as that from Google (https://developers.google.com/freebase). A summary comparison between these two mapping sets is shown in Table 6, which lists the total number of mappings in 4 buckets:
• Same: A Freebase id maps to the same Wikidata id.
• Different: A Freebase id maps to different Wikidata ids.
• DAWT Only: A Freebase id only maps to a Wikidata id in DAWT.
• Google Only: A Freebase id only maps to a Wikidata id in Google.
Note that the 24,638 different mappings are mainly caused by multiple Wikidata ids mapping to the same entity. For example, Freebase id 01159r maps to Q6110357 in DAWT and Q7355420 in Google, and both Q6110357 and Q7355420 represent the town Rockland in Wisconsin.

Table 6: Comparison of Freebase id to Wikidata id Mappings
Same | 2,048,531
Different | 24,638
DAWT Only | 2,362,077
Google Only | 26,413
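The bucketing in Table 6 can be reproduced mechanically once both mapping files are loaded as dictionaries from Freebase id to Wikidata item id. A minimal sketch in Python, using small toy in-memory dictionaries as stand-ins for the released mapping files:

# Compare two Freebase-id -> Wikidata-id mappings into the four buckets of Table 6.
# The two dictionaries are toy stand-ins for the DAWT and Google mapping files.
dawt = {"0k8z": "Q312", "05d1y": "Q9036", "01159r": "Q6110357"}
google = {"0k8z": "Q312", "01159r": "Q7355420", "0dr90d": "Q478214"}

buckets = {"Same": 0, "Different": 0, "DAWT Only": 0, "Google Only": 0}
for fb_id, wd_id in dawt.items():
    if fb_id not in google:
        buckets["DAWT Only"] += 1
    elif google[fb_id] == wd_id:
        buckets["Same"] += 1
    else:
        buckets["Different"] += 1  # e.g. 01159r: Q6110357 vs. Q7355420
buckets["Google Only"] = sum(1 for fb_id in google if fb_id not in dawt)

print(buckets)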
6.5 Entity Embeddings
There have been many efforts on learning word embeddings, i.e., vector space representations of words [6, 14, 15]. Such representations are very useful in many tasks, such as word analogy, word similarity, and named entity recognition. Recently, Pennington et al. [17] proposed a model, GloVe, which learns word embeddings with both global matrix factorization and local context windowing. They showed that the obtained embeddings captured rich semantic information and performed well in the aforementioned tasks.

There are several word-vector data sets available on GloVe's website (http://nlp.stanford.edu/projects/glove/). However, they only contain embeddings of individual words and thus have several limitations:
• Language dependent,
• Missing entities that cannot be represented by a single word,
• May not properly represent ambiguous words, such as "apple", which can be either the fruit or the technology company.
To facilitate research in this direction, we provide an entity-embedding data set that overcomes the above limitations. This data set contains embeddings of Wiki entities with 3 different vector sizes: 50, 300, and 1000. They were generated with the GloVe model via the following steps:
1. Represent each Wiki document across all languages as a list of entities: there are about 2.2B total entities and 1.8M unique entities in these documents.
2. Use the open source GloVe code (https://github.com/stanfordnlp/GloVe) to process these documents: for each vector size, we ran 300 iterations on a GPU box with 24 cores and 60G dedicated memory. Other runtime configurations were the same as default. In particular, we:
• Truncate entities with total count < 5,
• Set the window size to be 15.
Execution time was roughly proportional to the vector size. It took about 25 minutes to run 1 iteration when the size is 1000. Among the 1.8M unique entities, the GloVe model was able to generate embeddings for about 1.6M entities.

To evaluate these embeddings, we compared them with one of the GloVe word embeddings, which was also generated from Wikipedia data (available at http://nlp.stanford.edu/data/glove.6B.zip), on the word/entity analogy task. This task is commonly used to evaluate embeddings [14, 15, 17] by answering the following question: given word/entity X, Y, and Z, what is the word/entity that is similar to Z in the same sense as Y is similar to X? For example, given the words "Athens", "Greece", and "Paris", the right answer is "France".

Here we used the test data provided by [14]. This test set contains 5 semantic and 9 syntactic relation types. For each word in this data set, we find the corresponding Freebase id using the mapping between Freebase ids and English Wikidata urls. Thus, we obtain a test set that contains relations between entities. Note that when we could not find a Freebase id for a word, all the associated relations were removed from the test set.

We then ran the test on the 5 semantic types. Syntactic relations were excluded from this test because most of the time the task is trivial when one can correctly link words to entities. For example, when both "bird" and "birds" are linked to the entity Bird, and "cat" and "cats" are linked to the entity Cat, the analogy among them is obvious without examining the underlying embeddings.

Table 7 shows the accuracy (in %) obtained from our entity embeddings with vector sizes of 50, 300, and 1000. In comparison, it also shows the accuracy from GloVe word embeddings with vector sizes of 50, 100, 200, and 300. Entity embeddings have better performance at a vector size of 50. As we increase the vector size, word embeddings perform significantly better and outperform entity embeddings when the vector size is 200 or higher. The degraded performance of entity embeddings may be due to less training data, since our entity embeddings were obtained from 2.2B tokens, whereas GloVe's word embeddings were obtained from 6B tokens.

Table 7: Accuracy of Semantic Analogy (%)
Relation | GloVe Word (50) | GloVe Word (100) | GloVe Word (200) | GloVe Word (300) | DAWT Entity (50) | DAWT Entity (300) | DAWT Entity (1000)
Capital-World | 74.43 | 92.77 | 97.05 | 97.94 | 93.24 | 93.95 | 91.81
City-in-State | 23.22 | 40.10 | 63.90 | 72.59 | 68.39 | 88.98 | 87.90
Capital-Common-Countries | 80.04 | 95.06 | 96.64 | 97.23 | 78.66 | 79.64 | 71.54
Currency | 17.29 | 30.05 | 37.77 | 35.90 | 43.88 | 13.56 | 2.93
Family | 71.05 | 85.09 | 89.18 | 91.23 | 66.96 | 72.51 | 75.15
Average | 53.21 | 68.61 | 76.91 | 78.98 | 70.23 | 69.73 | 65.87
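A minimal sketch of how the analogy test can be scored against such embeddings: the predicted answer is the entity whose vector is closest, by cosine similarity, to v(Y) - v(X) + v(Z), excluding the three query entities. The vectors below are random placeholders standing in for the released 50/300/1000-dimensional entity embeddings, and the keys are shown as names rather than Freebase ids for readability.

import numpy as np

def answer_analogy(embeddings, x, y, z):
    """Return the entity closest to v(y) - v(x) + v(z), excluding x, y, z."""
    target = embeddings[y] - embeddings[x] + embeddings[z]
    best, best_sim = None, -np.inf
    for entity, vec in embeddings.items():
        if entity in (x, y, z):
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = entity, sim
    return best

# Placeholder vectors; in practice these come from the released entity embeddings.
rng = np.random.default_rng(0)
embeddings = {e: rng.standard_normal(50) for e in ["Athens", "Greece", "Paris", "France", "Rome"]}

# "Athens is to Greece as Paris is to ?" - with real embeddings the expected answer is "France".
print(answer_analogy(embeddings, "Athens", "Greece", "Paris"))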
7. APPLICATIONS OF THE DATASET
As discussed earlier, the DAWT and other derived datasets that we have described in this paper have several applications for the NLP and IR communities. These include:
7.1 Named Entity Recognition
This task involves identifying an entity mention within a text and also generating candidate entities from the KB. For this, the Mention Occurrence, Entity Occurrence and the Mention To Entity Co-occurrence datasets described in Sections 6.1, 6.2 and 6.3 are extremely useful. For instance, the raw mention occurrence counts and probabilities can be stored in a dictionary and can be used to extract the mentions in a document. Furthermore, the mention occurrence count of a mention M_i and its co-occurrence count with an entity E_j can be used to calculate the prior probability of the mention mapping to that entity:

P(E_j | M_i) = count(M_i → E_j) / count(M_i)

This can be used to determine the candidate entities for a mention from our KB.
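As an illustration, the sketch below builds candidate entities with priors for the mention "Tesla" using the counts from Tables 2 and 5; the numbers are copied from those tables, while the data structures are illustrative assumptions about how one might hold the derived datasets in memory.

# Candidate generation with priors from mention-entity co-occurrence counts.
# Co-occurrence counts are the "Tesla" rows of Table 5; count("Tesla") = 822 is from Table 2.
cooccurrence = {
    "Tesla": {
        "05d1y-Nikola Tesla": 327,
        "0dr90d-Tesla Motors": 162,
        "036wfx-Tesla (Band)": 92,
        "02rx3cy-Tesla (Microarchitecture)": 38,
        "03rhvb-Tesla (Unit)": 29,
    }
}
mention_counts = {"Tesla": 822}

def candidates_with_prior(mention):
    """Return candidate entities ranked by prior P(entity | mention) = count(M -> E) / count(M)."""
    total = mention_counts[mention]
    priors = {e: c / total for e, c in cooccurrence.get(mention, {}).items()}
    return sorted(priors.items(), key=lambda kv: kv[1], reverse=True)

for entity, prior in candidates_with_prior("Tesla"):
    print(f"{entity}: {prior:.3f}")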
7.2 Entity Disambiguation and Linking
This task involves linking the mention to its correct KB entity. For each mention, there may be several candidate entities with prior probabilities as calculated in NER. In addition, other features derived from these datasets can include entity co-occurrences, entity word2vec similarity and lexical similarity between the mention and the entity surface form. These features can be used to train a supervised learning algorithm to link the mention to the correct disambiguated entity among all the candidate entities, as done in [1]. The approach used in [1] employs several such context dependent and independent features and has a precision of 63%, recall of 87% and an F-score of 73%.
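The following is a hedged sketch, not the actual model of [1]: it shows how (mention, candidate) pairs with features such as prior probability, embedding similarity, and lexical similarity could be fed to an off-the-shelf classifier (scikit-learn here), with made-up feature values.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is a (mention, candidate entity) pair:
# [prior probability, embedding context similarity, lexical similarity].
# Values are invented for illustration; real features come from the derived datasets.
X_train = np.array([
    [0.88, 0.71, 1.00],   # "Apple" -> Apple Inc. (correct in this context)
    [0.05, 0.22, 1.00],   # "Apple" -> Apple (fruit) (incorrect in this context)
    [0.49, 0.65, 0.55],   # "Tesla" -> Nikola Tesla (correct in this context)
    [0.24, 0.31, 0.60],   # "Tesla" -> Tesla Motors (incorrect in this context)
])
y_train = np.array([1, 0, 1, 0])

linker = LogisticRegression().fit(X_train, y_train)

# Score new candidate pairs and keep the highest-scoring candidate for the mention.
candidates = {"0k8z-Apple Inc.": [0.87, 0.68, 1.00], "014j1m-Apple (fruit)": [0.06, 0.30, 1.00]}
scores = {e: linker.predict_proba(np.array([f]))[0, 1] for e, f in candidates.items()}
print(max(scores, key=scores.get))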
8. ACCESSING THE DATASET
The DAWT and derived datasets discussed in this paper are available for download at https://github.com/klout/opendata/tree/master/wiki_annotation. The DAWT dataset was generated using Wikipedia Data Dumps from January 20th, 2017. Statistics regarding the data set are shown in Table 1.
9. RELATED WORK
While a lot of works have focused on building and opening such datasets, very few have addressed all the challenges and constraints that we mentioned in Section 2. Spitkovsky and Chang [22] opened a cross-lingual dictionary (of English Wikipedia Articles) containing 175,100,788 mentions linking to 7,560,141 entities. This dataset, though extremely valuable, represents mention-entity mappings across a mixture of all languages, which makes it harder to use for a specific language. In addition, this work used raw counts that, although useful, lack mention context (such as preceding and succeeding tokens etc.) which has a big impact while performing EDL. The Freebase annotations of the ClueWeb corpora dataset (http://lemurproject.org/clueweb09/FACC1/) contains 647 million English web pages with an average of 13 entities annotated per document and 456 million documents having at least 1 entity annotated. It does not support multiple languages.

Another related technique for generating such dictionaries is Wikification [12, 16, 4], where mentions in Wikipedia pages are linked to the disambiguated entities' Wikipedia pages. Such techniques rely on a local or global approach. A local approach involves linking observed entities using only their local context, e.g. by comparing the relatedness of candidate Wiki articles with the mentions [5, 7, 19], while in a global approach entities across the entire document are disambiguated together using document context, thus ensuring consistency of entities across the document [9, 18]. Most recently, Cai et al. [3] achieved 89.97% precision and 76.43% recall using an iterative algorithm that leverages a link graph, link distributions, and a noun phrase extractor.

Although the problem of entity linking has been well studied for English, it has still not been explored as thoroughly for other languages. McNamee et al. [11] introduced the problem of cross-language entity linking. The main challenge here is that state-of-the-art part-of-speech taggers perform much better on English than on other languages. In addition, both Wikipedia and Freebase have significantly higher quality and coverage of English compared to any other language.
10. CONCLUSION AND FUTURE WORK
In this work, we opened up the DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages. The annotations include labeled text mentions mapping to entities (represented by their Freebase machine ids) as well as the type of the entity. The data set contains a total of 13.6M articles, billions of tokens, and 13.8M unique mention-entity pairs. DAWT contains 4.8 times more anchor text to entity links than originally present in the Wikipedia markup. Moreover, it spans several languages including English, Spanish, Italian, German, French and Arabic. We also presented the methodology used to generate the dataset, which enriched Wikipedia markup in order to increase the number of links. In addition to the main dataset, we opened up several derived datasets for mention occurrence counts, entity occurrence counts, mention-entity co-occurrence counts, and entity word2vec embeddings, as well as mappings between Freebase ids and Wikidata item ids. We also discussed two applications of these datasets and hope that opening them up will prove useful for the NLP and IR communities as well as facilitate multi-lingual research.

In the future, we plan to improve the algorithm that we used for generating DAWT. Also, we plan to migrate from using Freebase ids in our KB to Wikidata item ids.
11. REFERENCES
[1] P. Bhargava, N. Spasojevic, and G. Hu. High-throughput and language-agnostic entity disambiguation and linking on user generated data. In Proceedings of the 26th International Conference Companion on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
[2] P. Bhattacharyya and N. Spasojevic. Global entity ranking across multiple languages. In Proceedings of the 26th International Conference on World Wide Web, to appear. International World Wide Web Conferences Steering Committee, 2017.
[3] Z. Cai, K. Zhao, K. Q. Zhu, and H. Wang. Wikification via link co-occurrence. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, CIKM '13, pages 1087-1096, New York, NY, USA, 2013. ACM.
[4] X. Cheng and D. Roth. Relational inference for wikification. Urbana, 51:61801, 2013.
[5] S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In EMNLP-CoNLL, volume 7, pages 708-716, 2007.
[6] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391, 1990.
[7] P. Ferragina and U. Scaiella. TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 1625-1628. ACM, 2010.
[8] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 513-520, 2011.
[9] S. Kulkarni, A. Singh, G. Ramakrishnan, and S. Chakrabarti. Collective annotation of Wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 457-466. ACM, 2009.
[10] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, et al. DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167-195, 2015.
[11] P. McNamee, J. Mayfield, D. Lawrie, D. W. Oard, and D. S. Doermann. Cross-language entity linking. In IJCNLP, pages 255-263, 2011.
[12] O. Medelyan, D. Milne, C. Legg, and I. H. Witten. Mining meaning from Wikipedia. International Journal of Human-Computer Studies, 67(9):716-754, 2009.
[13] R. Mihalcea and A. Csomai. Wikify!: Linking documents to encyclopedic knowledge. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM '07, pages 233-242, New York, NY, USA, 2007. ACM.
[14] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[15] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, 2013.
[16] D. Milne and I. H. Witten. Learning to link with Wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM '08, pages 509-518, 2008.
[17] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532-1543, 2014.
[18] L. Ratinov, D. Roth, D. Downey, and M. Anderson. Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 1375-1384. Association for Computational Linguistics, 2011.
[19] B. Skaggs and L. Getoor. Topic modeling for Wikipedia link disambiguation. ACM Transactions on Information Systems (TOIS), 32(3):10, 2014.
[20] N. Spasojevic, P. Bhattacharyya, and A. Rao. Mining half a billion topical experts across multiple social networks. Social Network Analysis and Mining, 6(1):1-14, 2016.
[21] N. Spasojevic, J. Yan, A. Rao, and P. Bhattacharyya. LASTA: Large scale topic assignment on multiple social networks. In Proc. of ACM Conference on Knowledge Discovery and Data Mining (KDD), KDD '14, 2014.
[22] V. I. Spitkovsky and A. X. Chang. A cross-lingual dictionary for English Wikipedia concepts. In Language Resources and Evaluation, 2012.
APPENDIX
A. SAMPLE ANNOTATED WIKIPEDIA TEXTS FROM DAWT
Figure 2 shows samples of the densely annotated Wikipedia pages for the entities Nikola Tesla and Tesla Motors across English and Arabic.

[Figure 2: (a) Annotated Wikipedia article on Nikola Tesla (English); (b) Annotated Wikipedia article on Nikola Tesla (Arabic); (c) Annotated Wikipedia article on Tesla Motors (English); (d) Annotated Wikipedia article on Tesla Motors (Arabic)]