Wikiwhere: An interactive tool for studying the geographical provenance of Wikipedia references
Martin Körner, Tatiana Sennikova, Florian Windhäuser, Claudia Wagner, Fabian Flöck
University of Koblenz-Landau, Germany
GESIS - Leibniz Institute for the Social Sciences, Germany
Wikipedia articles about the same topic in different language editions are built around different sources of information. For example, one can find very different news articles linked as references in the English Wikipedia article titled “Annexation of Crimea by the Russian Federation” than in its German counterpart (determined via Wikipedia’s language links). Some of this difference can of course be attributed to the different language proficiencies of readers and editors in separate language editions; yet, although including English-language news sources seems to be no issue in the German edition, the English references that are listed do not overlap highly with the ones in the article’s English version. Remarkably, the German version, compared to its English counterpart, includes a notably higher imbalance in favor of Russian sources against Ukrainian ones, and also a lesser overall ratio of Ukrainian and Russian sources in relation to the native language of the Wikipedia edition (cf. Figure 1), although many of these pages are written in English and can be easily included in the German article. Such patterns could be an indicator of bias towards certain national contexts when referencing facts and statements in Wikipedia. However, determining for each reference which national context it can be traced back to, and comparing the link distributions to each other, is infeasible for casual readers or scientists with non-technical backgrounds.
Wikiwhere answers the question of where Web references stem from by analyzing and visualizing the geographic location of external reference links that are included in a given Wikipedia article. Instead of relying solely on the IP location of a given URL, our machine learning models consider several features.
Fig. 1: Comparison of the Wikiwhere heat maps of the (a) English and (b) German versions of the Wikipedia article on the “Annexation of Crimea by the Russian Federation” (English title). Blue represents least links from a country, red most, grey none. Open https://goo.gl/YgJx6O and https://goo.gl/pV0Mqp for comparison.

Closely related is the work by Sen et al. [1] that investigates, among other aspects, the geo-provenance of URLs in Wikipedia articles describing geographic locations. A related visualization is also available.

Wikiwhere is available at http://wikiwhere.west.uni-koblenz.de. Given a valid Wikipedia article URL of any language edition, the tool returns a classification of all references on the requested article page into countries of origin, determined by our machine learning model. The results are displayed topmost on a heat map (cf. Figure 1) and further down in a bar chart. Additional bar charts show the distribution of links over countries when using only single features of our model (e.g., just the IP address), and finally, all references and their location results are listed individually. By opening several language equivalents of an article, the user can thus easily compare different editions.

It is also possible to access the plotted results via URL parameters, and preprocessed analyses can be accessed via the “Articles” tab on the website. The source code is available under a free license from GitHub (https://github.com/mkrnr/wikiwhere) and can also be easily employed to classify references for research purposes beyond our visualization use case, such as statistical analyses.
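As a rough illustration of the kind of statistical analysis mentioned above, the following minimal sketch turns a list of per-reference country classifications into per-country shares, which is the quantity a heat map or bar chart of link distributions would display. The function name `country_shares` and the example lists are hypothetical and are not taken from the Wikiwhere code base or from real articles.

```python
from collections import Counter


def country_shares(countries):
    """Return each country's share of the classified references."""
    counts = Counter(countries)
    total = sum(counts.values())
    return {country: count / total for country, count in counts.items()}


# Hypothetical classification results for two language editions of one article.
english_edition = ["US", "US", "GB", "RU", "UA", "UA"]
german_edition = ["DE", "DE", "DE", "RU", "RU", "US"]

english_shares = country_shares(english_edition)
german_shares = country_shares(german_edition)

# List the shares side by side; a country missing from an edition has share 0.
for country in sorted(set(english_shares) | set(german_shares)):
    en = english_shares.get(country, 0.0)
    de = german_shares.get(country, 0.0)
    print(f"{country}: EN {en:.2f}  DE {de:.2f}")
```

Comparing shares rather than raw counts keeps editions with different numbers of references comparable.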
We use the term reference to refer to a URL that leads from a given Wikipedia article to another web page that is not associated with the Wikimedia Foundation’s projects. Currently available online services aiming to determine where web sources hail from geographically often rely solely on IP-derived locations. But given that websites and documents are frequently hosted under arbitrary domains, in many different languages, on remote servers, this might yield inaccurate results.

To investigate this suspicion, we set up a machine learning model to infer geo-provenance. To obtain a training set for the model, we retrieved geo-location information on Wikipedia-referenced websites from DBpedia SPARQL endpoints (http://wiki.dbpedia.org/about/language-chapters). DBpedia contains structured information that allows linking the owner of a web address to a location or, if such information is not explicitly encoded, inferring it from the owning entity and possible parent entities (e.g., a URL of a reference belongs to a newspaper, which has no location associated, but is associated with a parent company that has location information). In order to evaluate the accuracy of this location extraction method, we manually checked 255 locations for references that we extracted from the English DBpedia, using an explicit coding scheme. The resulting accuracy was 95%; we thus used this data as our ground truth for the subsequent steps.

Next, we randomly extracted references from Wikipedia articles and obtained their DBpedia geo-location. For this list, which comprised a total of 233,932 URLs, we automatically retrieved the IP location, top-level domain (plus location), and website language. On this data, we applied a variety of statistical learning models. An SVM model with a one-vs-one multiclass classifier consistently provided the most accurate location prediction and was selected as the eventual approach.
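The classification setup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn and NumPy, a linear kernel, one-hot encoding of the three categorical features, and a small synthetic data set in which the true country always matches the TLD location while every second URL is hosted on a foreign ("US") server. All data values and variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Hypothetical toy data: each row is (ip_location, tld_location, website_language).
# The label (ground-truth country) always equals the TLD location here, while
# the IP location points to a foreign host ("US") for every second URL.
NATIVE_LANGUAGE = {"DE": "de", "RU": "ru", "UA": "uk"}
rows, labels = [], []
for country in ("DE", "RU", "UA"):
    for i in range(20):
        ip_location = country if i % 2 == 0 else "US"
        language = NATIVE_LANGUAGE[country] if i % 4 < 3 else "en"
        rows.append((ip_location, country, language))
        labels.append(country)

X = np.array(rows)
y = np.array(labels)

# One-hot encode the categorical features, then fit an SVM.
# scikit-learn's SVC handles multiclass problems with a one-vs-one scheme.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    SVC(kernel="linear", decision_function_shape="ovo"),
)

# Mean accuracy over 10 stratified cross-validation folds.
model_accuracy = cross_val_score(model, X, y, cv=10).mean()

# Baseline: take the IP location itself as the predicted country.
ip_only_accuracy = float(np.mean(X[:, 0] == y))

print(f"model: {model_accuracy:.2f}, IP only: {ip_only_accuracy:.2f}")
```

On real data the features would come from IP geolocation, a TLD-to-country mapping, and language detection, and the labels from the DBpedia-derived ground truth; none of these lookups are shown here.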
We trained separate prediction models for the following languages: English, German, French, Italian, Spanish, Ukrainian, Slovak, and Dutch, as for those language editions DBpedia knowledge bases do currently exist. We also built a general model that combines the data from all DBpedia knowledge bases. The performance of our model was evaluated via 10-fold cross-validation.

Table 1 compares the accuracy of our learned model with a baseline that relies exclusively on IP address location. The comparison was performed on two data sets. The first includes “All data”, i.e., all references and their location indicated by one of the features. The second only includes references for which all three features indicate different locations and thus represents “Difficult cases”. As becomes apparent from Table 1, (i) using only IP location decreases location determination accuracy by 20% in the general model (10% to 45% in the language-specific models), and (ii) this holds even more when the features differ, which is often the case nowadays when websites are hosted abroad or address an international audience in, e.g., English. Table 2 moreover shows how much the different features contribute to the models; together, these are strong indicators that research and services should not rely on IP addresses as a sole gauge for location.

The main contributions of this work are: 1. an interactive tool for visual analysis of the geographical provenance of references in a Wikipedia article with tested accuracy, including source code for free reuse, and 2. the insight that IP-location-based tracking is insufficient for determining the (geographical) provenance of reference documents (in Wikipedia). Further, the approach of using semantic knowledge bases as a ground truth seems to be promising for tracking other kinds of provenance of references, e.g., multinational corporations and media networks, by following links and ownership relations between businesses.

Table 1: Accuracy of the learned models in comparison to a classification based on only the IP address.

Method                    General  EN    FR    DE    ES    UK    IT    NL    SV    CS
All data: Model           0.81     0.81  0.91  0.90  0.75  0.96  0.91  0.96  0.92  0.98
All data: IP only         0.61     0.30  0.62  0.77  0.29  0.86  0.73  0.86  0.81  0.80
Difficult cases: Model    0.77     0.78  0.86  0.80  0.71  0.89  0.85  0.91  0.85  0.93
Difficult cases: IP only  0.30     0.57  0.64  0.25  0.81  0.66  0.80  0.74  0.79  0.53

Table 2: Feature contribution over all data.

Model    IP location  TLD location  Website language
General  61%          58%           25%
EN       30%          13%           2%
FR       62%          73%           23%
DE       77%          68%           42%
ES       29%          30%           7%
UK       86%          89%           29%
IT       73%          70%           27%
NL       86%          76%           47%
SV       81%          82%           29%
CS       80%          78%           34%

Related visualization: http://shilad.com/localness/index.html
http://wikiwhere.west.uni-koblenz.de/about.php provides additional up-to-date information.

References
1. S. W. Sen, H. Ford, D. R. Musicant, M. Graham, O. S. Keyes, and B. Hecht. Barriers to the localness of volunteered geographic information. In