On the Value of Wikipedia as a Gateway to the Web
Tiziano Piccardi, Miriam Redi, Giovanni Colavizza, Robert West
Tiziano Piccardi
Miriam Redi
Wikimedia [email protected]
Giovanni Colavizza
University of [email protected]
Robert West ∗ [email protected]
ABSTRACT
By linking to external websites, Wikipedia can act as a gateway to the Web. To date, however, little is known about the amount of traffic generated by Wikipedia’s external links. We fill this gap in a detailed analysis of usage logs gathered from Wikipedia users’ client devices. Our analysis proceeds in three steps: First, we quantify the level of engagement with external links, finding that, in one month, English Wikipedia generated 43M clicks to external websites, in roughly even parts via links in infoboxes, cited references, and article bodies. Official links listed in infoboxes have by far the highest click-through rate (CTR), 2.47% on average. In particular, official links associated with articles about businesses, educational institutions, and websites have the highest CTR, whereas official links associated with articles about geographical content, television, and music have the lowest CTR. Second, we investigate patterns of engagement with external links, finding that Wikipedia frequently serves as a stepping stone between search engines and third-party websites, effectively fulfilling information needs that search engines do not meet. Third, we quantify the hypothetical economic value of the clicks received by external websites from English Wikipedia, by estimating that the respective website owners would need to pay a total of $7–13 million per month to obtain the same volume of traffic via sponsored search. Overall, these findings shed light on Wikipedia’s role not only as an important source of information, but also as a high-traffic gateway to the broader Web ecosystem.
ACM Reference Format:
Tiziano Piccardi, Miriam Redi, Giovanni Colavizza, and Robert West. 2021. On the Value of Wikipedia as a Gateway to the Web. In Proceedings of The Web Conference 2021 (WWW ’21), April 19–23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3442381.3450136
∗ Robert West is a Wikimedia Foundation Research Fellow.
This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia. © 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License. ACM ISBN 978-1-4503-8312-7/21/04. https://doi.org/10.1145/3442381.3450136

Thanks to the collaborative effort of a community of volunteer editors, Wikipedia is the world’s largest encyclopedia and an important source of information for millions of people. Wikipedia serves its content as a regular website, allowing editors to add hyperlinks in order to enable readers to more easily find additional content, both internal and external. Internal links help readers locate relevant encyclopedic content by navigating from article to article. In contrast, external links enrich articles with additional content that should not or cannot be included in Wikipedia itself. There are various reasons to add external links, with linked content ranging from official websites, to news articles used as references, to copyrighted material.

Figure 1: Example of an official link, in infobox of Wikipedia article about Internet Archive’s Wayback Machine.

In this paper, we are interested in quantifying and characterizing the outgoing traffic generated by Wikipedia towards external content. Given Wikipedia’s crucial societal role and global reach, it is essential to understand how it interacts with the broader Web by driving traffic to external websites. The resulting insights can inform the platform’s future design and thus allow it to better cater to readers’ information needs around external content. As Web traffic has monetary value (in particular when the traffic goes to commercial websites), an investigation of the external traffic generated by Wikipedia also sheds new light on the poorly understood role it has as a provider not only of information, but also of economic wealth.

Research questions.
We approach the question of Wikipedia’s value as a gateway to the Web from two angles: informational and economic. Concretely, we pose three research questions:
RQ1 Level of engagement with external links: What total volume of traffic does Wikipedia drive to third-party websites? What is the click-through rate of external links, and how does it vary across types of linked content? (Sec. 4)
RQ2 Patterns of engagement with external links: How do users interact with external links? Do they click through quickly or slowly, and how does this vary across types of linked content? In what navigational situations do clicks to external websites occur? (Sec. 5)
RQ3 Economic value of external links: What is the monetary value of the traffic from Wikipedia to external websites? If website owners had to pay for an equivalent amount of traffic via sponsored search, how much would this cost? (Sec. 6)
https://en.wikipedia.org/wiki/Wikipedia:External_links
https://en.wikipedia.org/wiki/Wikipedia:Citing_sources
Summary of findings.
Based on usage logs gathered over a one-month period from English Wikipedia users’ client devices (Sec. 3), we quantified the level of engagement with external links (RQ1), determining that English Wikipedia generated 43 million clicks to external websites during the month we studied, despite the fact that, on average, the click-through rate (CTR) of external links was only 0.08%. While most external links (95.5%) occurred in article bodies and cited references (accounting for about two thirds of the external traffic), a disproportionately large fraction (23%) of the total traffic came from a relatively small fraction (0.8%) of all external links, namely from official links to the website of the entity covered in the respective article. Such official links are regularly listed in so-called infoboxes, short tabular summaries of key facts about the covered entity (see Fig. 1 for an example). Since official links witnessed a vastly increased CTR of 2.47% (vs. 0.08% over all external links), we focused our analysis on official links. In a topical analysis, we found that official links associated with articles about businesses, educational institutions, and websites had the highest CTR (a first indicator of the economic value of Wikipedia’s external links), whereas official links associated with articles about geographical content, television, and music had the lowest CTR.

By analyzing patterns of engagement with external links (RQ2), we observed that Wikipedia frequently serves as a stepping stone between search engines and third-party websites. We captured this effect quantitatively as well as in a manual analysis, where we found that URLs that are down-ranked or censored by search engines, and thus not retrievable via search, can often be found in Wikipedia infoboxes, which leads search users to take a detour via Wikipedia. We conclude that Wikipedia regularly and systematically meets information needs that search engines do not meet, which further confirms Wikipedia’s central role in the Web ecosystem.

Finally, we aimed to quantify the hypothetical economic value of the clicks received by external websites from English Wikipedia (RQ3). Wikipedia is, of course, free, and it runs thanks to the donations of thousands of people. We thus cannot ask how much money Wikipedia could earn by charging a fee for external clicks (this hypothetical scenario is simply too far from reality), but we may approach the question from a different angle, asking how much money external-website owners would have to pay in order to obtain an equivalent number of clicks by other means, such as paid ads. In this spirit, we applied the Google Ads API to the content of official websites linked from Wikipedia in order to generate keywords for sponsored search and estimated their cost per click at market price. We conclude that the owners of external websites linked from English Wikipedia’s infoboxes would need to collectively pay a total of around $7–13 million per month (or $84–156 million per year) for sponsored search in order to obtain the same volume of traffic that they receive from Wikipedia for free.

These numbers exceed even the ballpark guess given in a bullish 2013 analysis [15] that, unlike ours, was not based on real click logs, but on generic rates commonly assumed in the online ad industry, and estimated that Wikipedia could earn $2.5 million monthly via affiliate links. Although our analysis of monetary value should mostly be taken as an indicative “back-of-the-envelope” calculation, it highlights the importance of Wikipedia not only as a source of information, but also as a gratuitous provider of economic wealth.
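The valuation logic described above (clicks priced at a sponsored-search cost per click, then scaled from a month to a year) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the `OfficialLink` record, its field names, and the idea of carrying a lower and an upper cost-per-click estimate per link are hypothetical, and the toy numbers are made up. Only the $7–13 million monthly range and its annualized counterpart come from the text.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class OfficialLink:
    """Hypothetical record for one official link (not a schema from the paper)."""
    url: str
    monthly_clicks: int   # clicks received from Wikipedia in one month
    cpc_low_usd: float    # assumed lower cost-per-click estimate
    cpc_high_usd: float   # assumed upper cost-per-click estimate


def monthly_value_range(links: Iterable[OfficialLink]) -> Tuple[float, float]:
    """Sum clicks * cost-per-click over all links, once per CPC bound."""
    links = list(links)
    low = sum(link.monthly_clicks * link.cpc_low_usd for link in links)
    high = sum(link.monthly_clicks * link.cpc_high_usd for link in links)
    return low, high


def annualize(monthly_usd: float) -> float:
    """Scale a monthly amount to twelve months."""
    return 12 * monthly_usd


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy = [
        OfficialLink("https://example.org", 1000, 0.5, 1.0),
        OfficialLink("https://example.com", 200, 2.0, 3.0),
    ]
    print(monthly_value_range(toy))
    # The monthly range reported in the text, scaled to a year.
    print(annualize(7_000_000), annualize(13_000_000))
```

Multiplying the reported monthly bounds by twelve reproduces the yearly range of $84–156 million quoted in the text.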
User engagement on the Web.
Being able to quantify user engagement is crucial for websites, especially for those with an advertising-based business model [2]. Researchers from various fields have investigated ways to measure users’ attention, interest, and engagement with websites and their ads. Several works have tried to predict engagement with content in social media based on social interest metrics, such as the number of post comments or likes [3, 6, 13]. Researchers in information retrieval have also investigated methods to estimate users’ satisfaction and engagement with textual and visual Web search engines [14, 29, 36]. In computational advertising, existing works have tried to improve ad serving based on target engagement metrics [4, 35], or to directly predict ad click-through rates [17].
User engagement on Wikipedia.
Users’ reasons for engaging with Wikipedia are varied, and consequently bring about different usage behaviors [16, 28], ranging from in-depth information seeking to using Wikipedia as a stopover towards other locations on the Web [30]. The most relevant prior work is a study by Piccardi et al. [24], who used the same dataset as the present paper, but focused exclusively on external links in cited references, whereas the present study considers all external links, with a focus on official links listed in infoboxes. Piccardi et al. [24] found that users engage little with the external links included in references, and that they do so more frequently from short, possibly unsatisfactory Wikipedia articles and in order to visit recent content, open access sources, and references about life events (births, deaths, marriages, etc.). A similar negative relationship between Wikipedia article quality and user engagement with external sources via references was found specifically for medical content [18].

Wikipedia’s own traffic is influenced by its connections to the larger Web ecosystem, and its interdependence with search engines, and Google in particular, has real consequences. On the one hand, Wikipedia’s content improves Google search results (e.g., via content snippets); on the other hand, this might keep users from visiting Wikipedia itself [20]. In the present paper, we take a novel angle by focusing not on traffic from the rest of the Web to Wikipedia, but on traffic from Wikipedia to the rest of the Web.
The economic value of Wikipedia.
The value of Wikipedia to the world is not only high, but also difficult, if not impossible, to fully quantify in purely economic terms. It has been shown that Wikipedia is essential (or has the potential to be) in a variety of spillovers with substantial economic impact. For instance, it is of critical importance to Web search engines, such as Google [20], and has also been shown to be useful to improve, or even predict, financial markets [22, 23, 34]. Wikipedia can be used to inform economic development policies [27], improve the visibility of places, with direct positive consequences on tourism [12], and even predict and monitor global health and diseases [9, 11]. Furthermore, Wikipedia has been shown to influence the very development of science [31].

Nevertheless, and perhaps surprisingly, to the best of our knowledge Wikipedia’s economic value as an information gateway to the Web has rarely been discussed in previous work. In a rare exception, researchers have considered the value of Wikipedia in providing traffic to Reddit and Stack Overflow [32].

Table 1: Click statistics for external links embedded in Wikipedia articles.
Link location | Total links (number) | Total links (perc. of total) | Clicks (number) | Clicks (perc. of total) | Total articles | Mean CTR ± SD* | Median click time†
Infobox | 2.8M | 4.5% | 12.5M | 29.1% | 1.3M | 0.90% ± […] | […]
Official links | 506K | 0.8% | 9.8M | 22.7% | 506K | 2.47% ± […] | […]
Body | 24.9M | 39.5% | 16.2M | 37.8% | 4.0M | 0.14% ± […] | […]
References | 35.3M | 56.0% | 14.2M | […] | […] | 0.03% ± […] | […]
*CTR = click-through rate; considering only links with at least 300 impressions during the one-month study period.
† Inter-quartile range in parentheses.
In order to analyze user engagement with Wikipedia’s external links, we made use of the dataset collected by Piccardi et al. [24]. This dataset consists of logs of all reader interactions with external links in English Wikipedia articles over the one-month period from 24 March to 21 April 2019. The data was captured by the browser on the client side and includes all clicks on external links and a uniformly random sample (33%) of all pageview events, organized into sessions, i.e., sequences of events from the same user in the same browser tab. This article reports results using the full dataset when describing external-click events. Whenever pageview counts are involved, we extrapolate from the 33% sample.

The data was collected in accordance with Wikimedia’s privacy policy and processed exclusively on Wikimedia computing machinery. Although the data does not contain personally identifiable information beyond what is implicit in browsing behavior, it cannot be shared publicly. For transparency, we publish our data analysis code at https://github.com/epfl-dlab/WikipediaAsWebGateway.
https://foundation.wikimedia.org/wiki/Privacy_policy

At the time of data collection, Wikipedia had around 5.8M articles, which were loaded by readers more than 4.5 billion times during the month studied. We characterized each article by popularity (pageviews during the month studied) and length (number of characters). The popularity distribution was very skewed: 50% of the articles had fewer than 42 views, 90% had fewer than 894 views. In contrast, the average number of pageviews in one month was 700. The most visited 1,550 articles, which represented 0.02% of all articles, accounted for 10% of all pageviews. The most visited pages were articles about topics that were trending in April 2019, such as Nipsey Hussle (5.7M views), Notre-Dame de Paris (4.7M), Bonnie and Clyde (3.5M), or Game of Thrones (season 8) (2.6M). Most of the articles were short, and similar to popularity, the number of characters showed a skewed distribution, with a median of 3,888, and an average of 7,793 characters.

ORES topics.
ORES is a toolkit offered by Wikimedia that, among other things, includes functionality for labeling articles with topics, based on a manually curated taxonomy of 64 topics [10] derived from WikiProjects. Based on this categorization, ORES offers a classifier that predicts, for a given article, its probability of belonging to each of the 64 topics. Since a single article may belong to multiple topics, the 64 probabilities generally do not sum to 1. We used topic labels in binarized form, considering an article to belong to a topic if the corresponding probability is greater than 50%. Note that, although the taxonomy is hierarchically organized in two levels, in this work we only considered the 57 lower-level topics (listed along the x-axis of Fig. 3). Having run the classifier on all articles in the dataset, we observed that overall the most common topic was Biography (1.7M articles), followed by Sports (1.4M) and North America (950K). The least common topics were Eastern Africa (11K) and Libraries & Information (14K).
https://en.wikipedia.org/wiki/Wikipedia:WikiProject

External links form our central object of study, so we extracted detailed information about them from Wikipedia articles. Since parsing content from articles in wikitext format might result in missing hyperlinks [21], we extracted the external links from the articles in HTML format instead. As this paper focuses specifically on those Wikipedia links that lead to external websites, we adopt the convention that, whenever we simply say “link”, we implicitly mean “external link”.

We partitioned the external links in the dataset into three classes, according to their position on the page: infobox, article body, and references. Infoboxes are tables (typically rendered by the browser on the right-hand side of the page on desktop devices, or at the top of the page on mobile devices) that summarize key information by adding semi-structured content (see Fig. 1). In addition to images and textual properties, this area can contain external links, potentially many, pointing to external geolocation services, official registries, or official websites. The links in an article body appear inline within the main textual content of the page or in dedicated sections such as “External links” or “See also”. Links in article bodies are more heterogeneous, including links to social media pages, PDF documents, or related external material. Finally, we considered as references all links used to cite external content in support of a statement. Typically they appear at the bottom of the page, reachable from the article body via numbered link anchors.
During the period considered, Wikipedia had 5.3M articles that contained at least one of 63.1M external links (totaling 49.8M unique target URLs). Table 1 (column “Total links”) summarizes these values. In total, 35.3M (56.0%) of these links appeared in references, 24.9M (39.5%) in article bodies, and 2.8M (4.5%) in infoboxes. Around 1.3M articles in English Wikipedia had an infobox with links, and the average number of links per infobox in these articles was 2.08. Links spanned from official company links (e.g., schlenkerla.de) to geocoordinates on geohack.toolforge.org to institutional registries (e.g., National Register of Historic Places).
To further qualify the infobox links, we designed a binary classifier that can distinguish between official links and other types of link. It was trained on a random sample of 2,000 infobox links, manually annotated as “official” or “other”. This resulted in a training set of 387 official links and 1,613 other links. To characterize each link, we then computed the following features:
• URL length: Number of characters in the URL path (guided by the intuition that official-link URLs tend to be short).
• Similarity of URL with article title: Motivated by the usefulness of character n-grams for URL-based topic classification [5], we computed the character n-grams (n = […], …, […]) […] n-grams for three pairs: title/URL-domain, anchor-text/URL-domain, title/URL-path.
• Similarity of context with marker words: Jaccard similarities between the character n-gram sets (n = […], …, 4) of high-precision marker words (“official”, “website”, “homepage”, “URL”) and of the link’s anchor text and context (i.e., the text within the same […])
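The features listed above can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the dictionary keys, the default n-gram range (2 to 4; the lower bound is not recoverable from the text), lowercasing, and taking the maximum similarity over the marker words are all assumptions made for the example. Only the feature ideas themselves (URL path length, character n-gram Jaccard similarity for the three pairs, and the four marker words) come from the text.

```python
from urllib.parse import urlparse

# Marker words named in the text.
MARKER_WORDS = ("official", "website", "homepage", "URL")


def char_ngrams(text, n_min=2, n_max=4):
    """Set of character n-grams of the lowercased text (n range is an assumption)."""
    s = text.lower()
    return {s[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ngram_similarity(x, y, n_min=2, n_max=4):
    """Jaccard similarity between the character n-gram sets of two strings."""
    return jaccard(char_ngrams(x, n_min, n_max), char_ngrams(y, n_min, n_max))


def link_features(title, anchor_text, url):
    """Hypothetical feature vector for one infobox link."""
    parsed = urlparse(url)
    domain, path = parsed.netloc, parsed.path
    return {
        "url_path_length": len(path),
        "title_vs_domain": ngram_similarity(title, domain),
        "anchor_vs_domain": ngram_similarity(anchor_text, domain),
        "title_vs_path": ngram_similarity(title, path),
        # Assumption: aggregate over marker words with max().
        "anchor_vs_markers": max(ngram_similarity(anchor_text, w) for w in MARKER_WORDS),
    }


if __name__ == "__main__":
    print(link_features("Schlenkerla", "Official website", "https://www.schlenkerla.de/"))
```

A vector of this kind could then be fed to any off-the-shelf binary classifier trained on the 2,000 annotated links; the text available here does not state which model was used.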
Figure 2: Usage of external links. (a) Distribution of click-through rate of official links by device type (mobile vs. desktop; vertical lines: means). (b) Distribution of click time in seconds by link type (infobox, body, and reference links; vertical lines: medians). The y-axis of both panels is the relative frequency of articles.

Our main metric for measuring engagement with external links is the click-through rate (CTR), which, intuitively, is simply the number of times a link was clicked, divided by the number of times the link was displayed (a link counts as displayed whenever the article containing it is loaded). In practice, care must be taken, as it frequently happens that the same article is viewed multiple times in the same session, e.g., because the user refreshes the page or clicks the back button. To guard against overcounting such multiple views, we grouped all pageviews of the same article a that occurred during the same session s and call the unique pair (a, s) one visit of a.

With N_l as the number of visits (i.e., distinct (a, s) pairs) upon which link l was displayed, and C_l as the number of visits upon which link l was clicked, we define the CTR of link l as C_l / N_l, i.e., the fraction of visits upon which l was clicked, out of all visits upon which l was displayed. Since each official link belongs to exactly one article (with extremely rare exceptions; cf. Sec. 3.4), we may also, in a slight abuse of terminology, speak of the “CTR of an article”, implying the CTR of the official link associated with the article.

In order to reliably estimate CTRs, we need to avoid small denominators, so we restricted our analyses to links that were displayed upon at least 300 pageview events. In the case of official links, this resulted in a set of 160K links (and their corresponding articles).

Definition: click time.
In order to capture how long users dwell on an article before they click an external link, we define the notion of click time, which measures the number of seconds between the pageview event on which the link was displayed and the click on the link itself. If the same external link was clicked multiple times in the same session, we only considered the first pageview that was accompanied by an external click.

Since click times are unbounded above and follow a heavy-tailed distribution, we used medians, rather than means, for aggregation.
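The two metrics defined in this section, CTR over deduplicated visits and median click time, can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the input layout (lists of (article, session) pairs and of (pageview timestamp, click timestamp) pairs) and the function names are assumptions; the visit deduplication, the threshold of 300 pageview events, and the use of the median follow the text.

```python
from statistics import median


def click_through_rate(pageviews, clicks, min_pageviews=300):
    """CTR of one link as C_l / N_l over distinct (article, session) visits.

    pageviews: (article, session) pairs, one per pageview event showing the link.
    clicks:    (article, session) pairs, one per click on the link.
    Returns None if the link was displayed on fewer than `min_pageviews` events.
    """
    pageviews = list(pageviews)
    if not pageviews or len(pageviews) < min_pageviews:
        return None
    visits = set(pageviews)                  # N_l: repeated views collapse to one visit
    clicked_visits = set(clicks) & visits    # C_l: visits with at least one click
    return len(clicked_visits) / len(visits)


def median_click_time(first_clicks):
    """Median of (click - pageview) in seconds over (pageview_ts, click_ts) pairs.

    Each pair is assumed to describe the first clicked pageview of a session.
    """
    times = [click_ts - view_ts for view_ts, click_ts in first_clicks]
    return median(times) if times else None


if __name__ == "__main__":
    # Toy data: session s1 reloads the article and clicks twice.
    views = [("A", "s1"), ("A", "s1"), ("A", "s2"), ("A", "s3"), ("A", "s4")]
    clicks = [("A", "s1"), ("A", "s1")]
    print(click_through_rate(views, clicks, min_pageviews=1))
    print(median_click_time([(0, 5), (10, 40), (0, 100)]))
```

In the toy data, the reload and the double click in session s1 count once each, so one of four visits is a clicked visit. The median is used for click times because a single very long dwell (here 100 seconds) would dominate a mean.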
We start our analysis by quantifying the level of engagement with external links, both overall (Sec. 4.1) and by article topic (Sec. 4.2).
Overall click statistics are summarized in Table 1. During our one-month data collection period, there was a total of around 4.5 billion Wikipedia pageviews, which led to around 43.1M clicks on external links. The total volume of external clicks was distributed roughly evenly over the three classes of external links: those in infoboxes (12.5M), those in references (14.2M), and those in article bodies (16.2M). As the vast majority of external links was located in references (56.0%) and article bodies (39.5%), the CTR of infobox links (0.90%) vastly exceeded that of links in references (0.03%) and article bodies (0.14%). To ascertain that this was not simply caused by the fact that infobox links appear higher up on the page, we also computed the CTR of article-body links appearing in the top 20% of the page. This yielded a CTR of 0.20%, much closer to the 0.14% of article-body links overall than to the 0.90% of infobox links.

Figure 3: Official-link click-through rate by article topic. Blue bars: means with bootstrapped 95% confidence intervals. Gray bars: number of articles with official links. Red dashed line: global mean. Topics along the x-axis, from left to right: Libraries & Information, Software, Internet culture, Computing, Business and economics, Technology, Education, Mathematics, Food and drink, Biology, Medicine & Health, Chemistry, Radio, Fashion, Engineering, Films, Eastern Africa, Video games, Western Africa, South Asia, Transportation, West Asia, Earth and environment, Architecture, Southern Africa, Linguistics, Literature, Sports, Books, Philosophy and religion, North America, Northern Africa, Oceania, Northern Europe, Western Europe, East Asia, Entertainment, Southeast Asia, Society, Central Asia, Physics, Central America, Politics and government, Space, Eastern Europe, Performing arts, North Asia, Comics and Anime, Military and warfare, Southern Europe, South America, History, Biography, Music, Television, Geographical.
Official links.
Official links play a key role. Although they constituted only 18% of the 2.8M infobox links, they accounted for 78% of the 12.5M clicks on infobox links, with a CTR of 2.47%, nearly 3 times as high as that of infobox links overall. The average CTR was even higher on desktop devices, where it reached 2.78%, vs. 1.87% on mobile (Fig. 2a). Given their prominence, we shall focus mostly on official links from here on, and unless stated otherwise, we henceforth refer to official links when simply writing “links”.
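The shares quoted in this paragraph follow directly from the counts in Table 1, as the short Python check below shows. The variable and function names are illustrative only; the input numbers (506K official links, 2.8M infobox links, 9.8M and 12.5M clicks, CTRs of 2.47% and 0.90%) are the rounded values reported in the paper, so the results are approximate.

```python
def share(part, whole):
    """Percentage of `whole` accounted for by `part`."""
    return 100.0 * part / whole


# Rounded values as reported in Table 1.
official_links, infobox_links = 506_000, 2_800_000
official_clicks, infobox_clicks = 9_800_000, 12_500_000
ctr_official, ctr_infobox = 2.47, 0.90  # in percent

link_share = share(official_links, infobox_links)      # share of infobox links
click_share = share(official_clicks, infobox_clicks)   # share of infobox clicks
ctr_ratio = ctr_official / ctr_infobox                 # official vs. all infobox links

if __name__ == "__main__":
    print(f"official links among infobox links:   {link_share:.1f}%")
    print(f"official clicks among infobox clicks: {click_share:.1f}%")
    print(f"CTR ratio (official / infobox):       {ctr_ratio:.2f}")
```

With these inputs the link share comes out at about 18%, the click share at about 78%, and the CTR ratio at roughly 2.7, which matches the "nearly 3 times" stated in the text.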
Geographical differences.
The top 5 countries by pageview volume were, in this order, the United States, the United Kingdom, India, Canada, and Australia. They generated 71.6% of the total traffic. Among these countries, the CTR on official links was highest in the U.S. (2.36%), followed by India (2.14%), the U.K. (1.53%), Canada (1.38%), and Australia (1.11%).
Next, we aim to understand how the click-through rates of official links vary by topic as defined by the ORES classifier introduced in Sec. 3.2. Since, in the vast majority of cases, official links are associated with exactly one article, we may label each official link with the topics of that article. (Throughout the discussion that follows, keep in mind that each article, and thus each official link, may be labeled with multiple topics.)

Fig. 3 visualizes the mean CTR (in blue), as well as the number of articles with official links (in gray), by topic. We see that official links relating to Libraries & Information, Software, and Internet culture had the highest click-through rates, whereas geographical content, media-related content, and biographies on average saw the lowest engagement.

Figure 4: (a) Click-through rate and (b) click time of official links as functions of article length (left) and popularity (right), with 95% CIs. Official links on longer pages are clicked more rarely and more slowly; those on more popular pages are clicked more rarely and more quickly. Axes: article length [chars] and number of pageviews (x); mean click-through rate in (a) and median click time [sec] in (b) (y).
Controlling for article length and popularity.
The length and popularity of a Wikipedia article correlated strongly and negatively with the CTR of the official link contained in its infobox (Fig. 4a), possibly because longer articles, by offering more information, reduce the user’s need to gather additional information from external links, and because more popular articles are more likely to appear in shallower information-seeking sessions [24]. Since length and popularity also vary by topic, they might act as confounds that could potentially explain an observed variation of CTR by topic.

To tease these two confounds apart from the impact of topics alone, we controlled for length and popularity in a matched analysis, as follows. We split the set of articles at the median CTR into high- vs. low-CTR articles, and we split the length and popularity ranges into 1,000 equally sized bins each. We then defined a bipartite graph with edges between articles that fell in different CTR halves, but in the same bins with respect to length and popularity. Using the Euclidean distance in the space defined by the logarithmic length and popularity as edge weights, we found a minimum-weight matching and retained only the 112K matched articles (out of originally 160K). This procedure successfully balanced the dataset.

In the balanced dataset, we binarized the CTR by splitting at the median and fit a logistic regression to model whether an article belonged to the high- or low-CTR group, with topic indicators as predictors (pseudo R² = […], p < […]). (The advantage of performing regression modeling rather than a simple comparison of per-topic average CTRs is that topics are correlated, which is accounted for by the regression model.)

The 15 largest positive and negative coefficients, plotted in Fig. 5a, revealed a slightly different ranking than Fig. 3, with Business and economics and Education emerging as the strongest predictors of a high CTR, whereas Geographical and Television remained the strongest predictors of a low CTR.

Top of the CTR ranking.
While manually screening the data, we realized that, among the articles with the highest official-link CTR, there was a disproportionate fraction of articles about websites (which are generally classified by ORES under the topic Internet culture), and in particular websites related to file sharing and pornography, some with CTRs of 40% or more, e.g., Library Genesis (47%), RARBG (45%), or The Pirate Bay (43%). To determine whether official links of Wikipedia articles about websites dominated the top of the CTR ranking in general, we repeated the above regression analysis with a small modification: instead of predicting the top half vs. the bottom half of the article ranking with respect to CTR, we now predicted the top L articles (an absolute, rather than relative number) vs. the same number of samples from the bottom half, matched on length and popularity. This way, plotting the fitted coefficients for a given topic as a function of L reveals whether the topic is particularly over-represented among the highest-CTR official links (manifested in a sharply decreasing curve). The results, presented in Fig. 6, clearly show that Internet culture, a topic held by most articles about websites, is indeed particularly over-represented among the articles with the very highest official-link CTR. Similar effects were observed for Society (a loose mix of articles), Sports, Software, and Entertainment, among others. On the contrary, we observed that Geographical, Biography, and Television, among others, were particularly under-represented among the highest-CTR official links.

Fine-grained topical analysis.
The topics from the ORES classifier used above are rather broad. In order to obtain more fine-grained insights, we conducted a word- rather than topic-level analysis, where we represented an official link by the words contained in the lead paragraph of the article in whose infobox the link appeared as an official link (via z-score-standardized TF-IDF vectors restricted to the 3,000 most frequent words across all articles with official links). Mirroring the above regression analysis, but now using words rather than topics as predictors (pseudo R² = […], p < […]), this analysis revealed words associated with high- vs. low-CTR links. Fig. 5b, which shows the 15 words with the largest positive and negative coefficients, confirms our previous findings while adding nuance. We see that Education specifically marks high-CTR links about universities, schools, institutes, and museums; Business and economics, about companies, manufacturers, chains, and airlines; and Internet culture, about adult websites.

Figure 5: Association of click-through rate of official links with article properties, captured via 15 largest positive and negative coefficients (with 95% CIs) from binary logistic regression models that predict above- vs. below-median CTR, using as predictors (a) article topics or (b) words from lead paragraphs (controlling for article length and popularity). Gray bars in (b): percentage of articles whose lead paragraph contains the word.
(a) Topic-based analysis, x-axis labels: Business and economics, Education, Sports, Transportation, Technology, Internet culture, Radio, Literature, Medicine & Health, Architecture, Libraries & Information, North America, Earth and environment, Philosophy and religion, Biology, Engineering, Video games, Music, Eastern Europe, Western Europe, Southeast Asia, South America, Books, Space, Comics and Anime, Southern Europe, East Asia, Biography, Television, Geographical.
(b) Word-based analysis (words in the leading section), x-axis labels: university, company, manufacturer, airline, pornographic, founded, school, headquartered, based, league, museum, public, chain, private, institute, actor, inhabitants, commune, series, drama, kst, band, municipality, district, prefecture, premiered, city, town, aired, population.

Footnote (matched analysis): The standardized mean differences in logarithmic length and logarithmic popularity dropped from 0.7 to 0.00017, and from 0.54 to 0.000005, respectively.

Summary.
Taking stock of the findings so far, we reiterate that official links play a key role among Wikipedia’s external links, with CTRs far above those of other types of external link. We observed a large amount of variation depending on article topics, with official links associated with articles about websites, software, businesses, education, and sports seeing particularly high engagement.
On the Value of Wikipedia as a Gateway to the Web. WWW ’21, April 19–23, 2021, Ljubljana, Slovenia
[Figure 6, one panel per topic: regression coefficient (y-axis) as a function of the top 𝐿 links w.r.t. CTR (x-axis). Panels: Internet culture, Society, Education, Business and economics, Sports, Software, Entertainment, West Asia, Central Africa, Earth and environment, Geographical, Biography*, Television, Eastern Africa, Music, Video games, Southern Africa, Southern Europe, Philosophy and religion, Food and drink.]
Figure 6: Prevalence of topics among most frequently clicked official links. We fitted binary logistic regression models that used article topics as predictors to predict if an article’s official link is among the top 𝐿 highest-CTR links. Plots show regression coefficients for individual topics (predictors) as functions of 𝐿 (for values of 𝐿 between 1K and 15K). Topics are sorted by the leftmost values of their curves. Sharply decreasing [increasing] curves correspond to topics that are particularly over-represented [under-represented] among the links with the most extreme CTR. (More details: Sec. 4.2, “Top of the CTR ranking”.)
Above, we established which types of external link have a particularly high CTR. Next, we investigate more closely the patterns by which users engage with external links.
We start by analyzing the click time (cf. Sec. 3.5), which captures how long users dwell on an article before leaving it toward an external website via an external link.
Click time statistics are summarized in Table 1, and click time distributions are plotted in Fig. 2b, for the three types of link: those appearing in infoboxes, article bodies, and references, respectively. (Note that, although we consider only official links in most of this paper, and indeed in the rest of this section, we nevertheless report this basic statistic for all types of external link, not only for official links.) The global median click time was 32.9 seconds (31.8 seconds for desktop, 34.4 seconds for mobile), with a much lower value for infobox links (18.7 seconds; 20.1 seconds for official links), and larger values for article-body links (35.4 seconds) and reference links (51.8 seconds). The short click time of infobox links, however, seems to be due to their prominent position within articles: when approximately controlling for position by considering only article-body links in the top 20% of the page, the median click time dropped to 22.2 seconds, only 10% longer than for official links in infoboxes (20.1 seconds).
After this general characterization, we from here on focus on official links in infoboxes. Similar to CTR (Fig. 4a), the click time of official links was correlated with article length and popularity (Fig. 4b), such that clicks took more time on longer and on less popular articles. When analyzing click time by topic, we thus again controlled for these two factors via matching, as they could otherwise confound the analysis. Analogously to the setup of Sec. 4.2, we split the set of articles at the median click time into articles with “slow” vs. “fast” clicks on official links, and subsequently found a bipartite matching in order to ensure that the distribution of length and popularity was nearly identical in articles with “slow” vs. “fast” official-link clicks.
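The balance achieved by such a matching is commonly checked with the standardized mean difference (SMD), the statistic reported in the footnotes on matching. The following is a minimal Python sketch under stated assumptions: all numbers are made up, the greedy one-to-one pairing with a caliper is only a simple stand-in for the bipartite matching used in the paper, and the function names are hypothetical.

```python
import math
import statistics


def standardized_mean_difference(group_a, group_b):
    """Absolute difference in means divided by the pooled standard deviation."""
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = math.sqrt((var_a + var_b) / 2)
    return abs(mean_a - mean_b) / pooled_sd


def greedy_match(group_a, group_b, caliper):
    """Pair each value in group_a with the closest unused value in group_b.

    A pair is kept only if the two values differ by at most `caliper`.
    This is an illustrative stand-in, not the paper's bipartite matching.
    """
    unused = sorted(group_b)
    pairs = []
    for x in sorted(group_a):
        if not unused:
            break
        best = min(unused, key=lambda y: abs(x - y))
        if abs(x - best) <= caliper:
            pairs.append((x, best))
            unused.remove(best)
    return pairs


# Hypothetical log article lengths for "slow" and "fast" articles.
slow_log_length = [8.1, 9.4, 10.2, 7.9, 9.0, 10.8]
fast_log_length = [7.2, 8.0, 9.1, 7.5, 8.3, 9.6]

smd_before = standardized_mean_difference(slow_log_length, fast_log_length)

pairs = greedy_match(slow_log_length, fast_log_length, caliper=0.25)
matched_slow = [a for a, _ in pairs]
matched_fast = [b for _, b in pairs]
smd_after = standardized_mean_difference(matched_slow, matched_fast)

print(f"SMD before matching: {smd_before:.3f}")
print(f"SMD after matching:  {smd_after:.3f} ({len(pairs)} pairs kept)")
```

In this toy example, matching discards the unmatched articles and shrinks the SMD, which mirrors the trade-off in the paper between retained data points and covariate balance.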
We then fitted a linear regression on the resulting balanced dataset in order to model the logarithmic click time as the outcome variable, using (as in Sec. 4.2) topic indicators as predictors ( 𝑅 = . 𝑝 < − ).
Footnote: The standardized mean differences in logarithmic article length and logarithmic popularity dropped from 0.076 to 0.0001, and from 0.30 to 0.00005, respectively. 138K of the original 160K data points were retained after matching.
The 15 largest positive and negative coefficients are plotted in Fig. 7a, revealing that clicks on official links to entertainment-related websites occurred faster, whereas clicks on links to websites on more classic encyclopedic topics, such as biographies, geographical content, history, etc., occurred more slowly.
To gain more granular insights than at the coarse topic level, we again ran an analysis at the word level, parallel to the word-level analysis of Sec. 4.2, but this time in the linear regression setup just described ( 𝑅 = . 𝑝 < − ). The words most indicative of slow and fast clicks (Fig. 7b) mirror the findings from the topic-level analysis (Fig. 7a), but we now also see that official links on articles about websites and universities were clicked particularly fast.
[Figure 7, panel (a): Topic-based analysis, regression coefficient per topic (fast vs. slow). Panel (b): Word-based analysis, positive and negative coefficients and % of articles per word in the leading section.]
Figure 7: Association of click time of official links with article properties, captured via 15 largest positive and negative coefficients (with 95% CIs) from linear regression models that predict logarithmic click time, using as predictors (a) article topics or (b) words from lead paragraphs (controlling for article length and popularity). Gray bars in (b): percentage of articles whose lead paragraph contains the word.
We just noted that official links on articles about websites typically have short click times. Moreover, in Sec. 4.2 we had observed that such links are also strongly over-represented among the highest-CTR links. These observations led us to hypothesize that the main interest of users interacting with these links might be to find these links to begin with, rather than to find the content that surrounds the links in their respective Wikipedia articles. In other words, we hypothesized that Wikipedia serves as a “stepping stone” toward external websites, whereupon users barely set foot onto Wikipedia before leaving again towards different content that they actually were intent on finding. To further investigate this hypothesis, we considered the relationship between CTR and click time, visualized in Fig. 8, which presents a scatter plot of the coefficients obtained from the previously described regressions for CTR (𝑥-axis; cf. Sec. 4.2) and click time (𝑦-axis; cf. Sec. 5.1). As the plot shows, topics with a high CTR tended to have low click times (lower right quadrant), notably sports, radio, and internet culture (a topic that, as mentioned, primarily tags articles about websites). We take this as another indicator of the existence of a class of articles from which users leave Wikipedia frequently and fast.
[Figure 8: scatter plot, x-axis: CTR coefficient; y-axis: click-time coefficient. Labeled topics: Geographical, Education, Television, Sports, Biography, Radio, Business and economics, Internet culture.]
Figure 8: Click-through rate (𝑥-axis) vs. click time (𝑦-axis) of official links. Each point represents one topic. CTR and click time of a topic captured in terms of the topic’s coefficient in the regressions summarized in Fig. 5a and 7a, respectively.
If indeed a distinct class of articles on Wikipedia is used heavily as stepping stones, then we should be able to identify many articles that were visited primarily from outside of Wikipedia (e.g., from search engines), rather than from other Wikipedia articles (via internal navigation), and that had a high CTR on the external links they contain. With the goal of finding such articles, we define the external-referrer frequency (ERF) of an article as the fraction of all visits made to the article via a click from a referral page external to Wikipedia. Note that external referrals almost exclusively stemmed from search engines: about 70% of them specified a search engine referrer URL in the logs, and about 29% did not specify any referrer URL, but it is suspected that a majority of these visits also come from search [19]. Hence, we included empty referrers in our analysis, although the conclusions were identical when excluding them.
The ERF histogram, plotted in Fig. 9a, shows that most articles were primarily visited from outside of Wikipedia, but only few articles had a very high ERF close to 1. Although extreme-ERF articles were in total visited less than medium-ERF articles (Fig. 9b), they generated a total number of external clicks comparable to that generated by medium-ERF articles (Fig. 9c). The most important piece of evidence, however, comes from Fig. 9d, which plots, on the 𝑦-axis, the mean CTR for situations where the respective article was reached from an external referrer. The plot shows that the articles that were nearly exclusively reached from external referrers (with an ERF close to 1) are precisely those articles that also had the highest CTR after being reached from an external referrer.
Taken together, these facts provide evidence of a class of articles that serve as mere “stepping stones”, “revolving doors”, or “in-and-outs”: users come from elsewhere in order to find a particular link and immediately leave Wikipedia by clicking that link.
But why, then, would users go through Wikipedia in the first place, if all they want is to go to a website linked from Wikipedia? We shall discuss potential reasons for this behavior in Sec. 7.
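The ERF analysis can be illustrated with a short Python sketch. It assumes per-article counts are already available; the record fields (`total_visits`, `external_visits`, `external_clicks`), the equal-width binning, and the averaging of per-article CTRs within a bin are assumptions made for illustration and may differ from the aggregation actually used for Fig. 9. The threshold of 100 external visits follows the footnote accompanying this analysis.

```python
from collections import defaultdict

MIN_EXTERNAL_VISITS = 100  # threshold taken from the paper's footnote


def external_referrer_frequency(external_visits, total_visits):
    """Fraction of an article's visits that arrived via an external referrer."""
    return external_visits / total_visits


def mean_ctr_by_erf_bin(articles, n_bins=10):
    """Group articles into equal-width ERF bins and average their CTR.

    The CTR here is conditional on the visit having an external referrer:
    official-link clicks after external visits / external visits.
    """
    ctrs = defaultdict(list)
    for article in articles:
        if article["external_visits"] < MIN_EXTERNAL_VISITS:
            continue  # too few visits for a stable ratio
        erf = external_referrer_frequency(
            article["external_visits"], article["total_visits"]
        )
        # An ERF of exactly 1.0 is placed in the last bin.
        bin_index = min(int(erf * n_bins), n_bins - 1)
        ctrs[bin_index].append(
            article["external_clicks"] / article["external_visits"]
        )
    return {b: sum(v) / len(v) for b, v in sorted(ctrs.items())}


# Hypothetical articles (all counts are made up).
articles = [
    {"total_visits": 1000, "external_visits": 950, "external_clicks": 380},
    {"total_visits": 2000, "external_visits": 1000, "external_clicks": 20},
    {"total_visits": 500, "external_visits": 50, "external_clicks": 5},
]

result = mean_ctr_by_erf_bin(articles)
for bin_index, ctr in result.items():
    print(f"ERF bin {bin_index}: mean CTR {ctr:.2%}")
```

In this toy input, the third article is dropped by the visit threshold, and the article reached almost exclusively from outside Wikipedia lands in the highest ERF bin with the highest CTR, which is the pattern the stepping-stone hypothesis predicts.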
Footnote: This analysis only considered articles with at least 100 visits from external referrers, in order to avoid noise due to division by small numbers.
[Figure 9, four panels with x-axis: external-referrer frequency. y-axes: (a) number of articles, (b) total pageviews, (c) total clicks, (d) CTR with external referrer.]
Figure 9: Quantification of Wikipedia’s role as a stepping stone toward external websites. (a) Histogram of external-referrer frequency (ERF) of Wikipedia articles, where ERF is defined as the fraction of times the article was visited via a referral page external to Wikipedia. (b) Total number of pageviews of articles within each ERF bin. (c) Total number of official-link clicks of articles within each ERF bin. (d) Official-link CTR upon pageviews with an external referrer (most likely a search engine), with 95% CIs. Articles with an extreme ERF close to 1 are rare (a), but generate a disproportionately large number of official-link clicks (c vs. b), especially when reached from search engines (d).
Our final set of analyses aims to estimate the monetary value of the traffic that Wikipedia drives to external websites.
The idea behind our calculations is the following. Search engines generally charge website owners money in exchange for driving clicks to their sites from sponsored search results (e.g., Google Ads), whereas Wikipedia conveys a large volume of traffic to official websites for free. We therefore ask, “How much money would a search engine charge website owners for obtaining, via ads, the same number of clicks they obtain from Wikipedia for free?”
While we could be asking, “How much money could Wikipedia earn by charging a fee for external clicks?”, we consider this counterfactual scenario too far from reality: Wikipedia is open and free by design, and it functions rather differently from platforms driven mostly by advertising. Moreover, we could not estimate the “price paid” to Wikipedia in this scenario, as we do not know whether website owners would be willing or able to actually pay any hypothetical click fees. In contrast, estimating the “price asked” (the cost of online ads) is entirely feasible.
Google Ads.
To estimate the price of achieving a certain set of URL clicks via online ads, we used the Google Ads API, with “sponsored search” as the ads network. Google Ads is one of the most prolific advertising platforms, and the primary source of revenue for Google’s parent company, Alphabet [1]. The Google Ads API allows advertisers to create campaigns for promoting a URL by placing bids on campaign-related search keywords. The bid expresses the maximum amount that the website owner is willing to pay each time the promoted URL is clicked when shown on the search result page for the respective keyword. When a user searches for a keyword specified by the campaign, an auction system determines which sponsored URL to show among all the candidates competing for the keyword. When the user clicks a promoted link, the campaign owner pays the auction value. Note that the paid price is not necessarily equal to, but only bounded by, the website owner’s bid.
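To make concrete why the paid price is bounded by, but not necessarily equal to, the bid, consider the following Python sketch. The text above does not spell out the pricing rule, so the sketch assumes a plain second-price auction purely for illustration; it ignores ad quality and any other factors a production system would take into account, and the advertiser names and bids are hypothetical.

```python
def run_auction(bids):
    """Return (winner, price per click) under a plain second-price rule.

    bids: dict mapping an advertiser name to its maximum bid per click.
    Assumption: the highest bidder wins and pays the runner-up's bid;
    a sole bidder pays its own bid.
    """
    if not bids:
        raise ValueError("at least one bid is required")
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner, top_bid = ranked[0]
    price = ranked[1][1] if len(ranked) > 1 else top_bid
    return winner, price


# Hypothetical bids (US$ per click) competing for one keyword.
bids = {"site_a": 2.00, "site_b": 1.20, "site_c": 0.80}
winner, price = run_auction(bids)
print(f"{winner} wins and pays ${price:.2f} (bid was ${bids[winner]:.2f})")
```

Under this assumed rule, the winner's bid only caps the price, so raising the bid increases the chance of winning without necessarily increasing the amount paid.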
From URLs to keywords.
Our intended analysis started from clicks on official links (URLs) observed in the Wikipedia logs and aimed to estimate how much these clicks would cost when obtained via Google Ads instead of Wikipedia. The Google Ads API, however, requires keywords, not URLs, as input. Thus, in order to leverage Google Ads for our analysis, we had to work our way backwards and determine appropriate keywords that a website owner might use to advertise a given URL. Since the choice of the right keywords is critical for increasing a website’s discoverability while keeping ad costs down [26], Google Ads offers a tool called Keyword Planner, which, given a website URL, generates a set of relevant keywords, alongside information on the historical bidding range and search volume for those keywords. Using the Keyword Planner, we generated 11 keywords for each official link: the title of the corresponding Wikipedia article, plus the top 10 keywords returned by the Keyword Planner. As examples, Table 2 summarizes the most relevant keywords generated for two different websites: Coursera (a popular online course platform) and American Airlines.
Footnote 9: With the “broad search” option, the number of matching searches increases by also considering substrings and permutations of the tokens in campaign keywords.
Table 2: Keywords, alongside estimated average cost per click (CPC), for two example websites.
                        Coursera                    American Airlines
From title              coursera                    american airlines
Keywords recommended    online courses              aa
by Google Ads           online colleges             airline flights
Keyword Planner         online classes              airline tickets
                        mooc                        airlines
                        online learning             american
                        free online courses         american airlines flights
                        online degrees              cheap air tickets
                        open university courses     cheap airline tickets
                        online education            flight tickets
                        online universities         us airways
Est. avg. CPC           $0.79                       $1.10
Cost-per-click (CPC) forecasting.
Once the keywords and the bids have been set, Google Ads can make a prediction about the cost of the campaign through its forecasting tool. The prediction model uses historical data to simulate the auction system and provides an estimated number of clicks and the average price for every keyword. The tool predicts the campaign’s average cost per click (CPC) by combining the keyword costs with their expected click-through rates. In practice, the forecasting tool simulates campaigns for specific target countries. We used the top 5 English-speaking countries (U.S., U.K., India, Canada, and Australia; cf. Sec. 4.1) as target countries, since they accounted for a large portion (71.6%) of all external-link clicks in the Wikipedia logs studied here.
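One way to read “combining the keyword costs with their expected click-through rates” is as a click-weighted average over keywords. The notation below is introduced here for illustration and is an assumption, not the forecasting tool’s documented formula: $c_k$ is the forecast average price per click for keyword $k$, $I_k$ its expected number of impressions, and $r_k$ its expected click-through rate, so that $I_k r_k$ is the expected number of clicks.

```latex
\mathrm{CPC}_{\text{campaign}}
  = \frac{\text{total expected cost}}{\text{total expected clicks}}
  = \frac{\sum_{k} c_k \, I_k \, r_k}{\sum_{k} I_k \, r_k}
```

Under this reading, keywords that are expected to attract many clicks dominate the campaign-level CPC, whereas rarely clicked keywords contribute little, regardless of their price.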
Estimating the value of official links.
Leveraging the above tools, we estimated how much a website owner would need to pay to Google Ads for a single click to their website as follows:
(1) Obtain keywords for the website via the Keyword Planner.
(2) For each keyword, set the bid to the 80th percentile of the keyword’s historical auction price.
(3) Estimate the cost (CPC) for one click on the website link by feeding the keywords and their bids to the forecasting tool (using “broad search”, cf. footnote 9).
[Figure 10: histogram, x-axis: average cost per click [US$]; y-axis: number of websites; vertical lines: weighted median and weighted mean.]
Figure 10: Distribution of official-link CPC, estimated via Google Ads API. Vertical lines represent weighted versions of median and mean, respectively, where each link was weighted according to its click volume in Wikipedia logs.
We emphasize that setting a high bid (step 2) does not automatically entail a high CPC. Indeed, as we shall see, the winning price was usually much lower than the bid. A high bid ensures that we are likely to win the simulated auction and that the promoted link is actually displayed to the user, which is required in order to obtain clicks, the event whose cost we are aiming to estimate.
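The three steps above can be sketched in Python as follows. The sketch does not call the Google Ads API: the keyword list and historical prices are made-up inputs standing in for Keyword Planner output, `forecast` is a hypothetical callable standing in for the forecasting tool, and the factor 0.58 in the toy forecast is only a placeholder. The percentile is computed with the standard-library function `statistics.quantiles`, using the "inclusive" interpolation method as one possible choice.

```python
import statistics


def bid_at_80th_percentile(historical_prices):
    """Step 2: return the 80th percentile of a keyword's historical prices."""
    # quantiles(n=5) returns the 20th, 40th, 60th, and 80th percentile cuts.
    return statistics.quantiles(historical_prices, n=5, method="inclusive")[3]


def estimate_cpc(keywords, price_history, forecast):
    """Steps 2 and 3: set one bid per keyword, then ask `forecast` for a CPC.

    `forecast` is a hypothetical stand-in for the forecasting tool: it takes
    a dict of keyword -> bid and returns an estimated average CPC.
    """
    bids = {kw: bid_at_80th_percentile(price_history[kw]) for kw in keywords}
    return forecast(bids)


# Step 1 (assumed done): keywords with made-up historical prices in US$.
price_history = {
    "online courses": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "mooc": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
}


def toy_forecast(bids):
    """Placeholder forecast: a fixed fraction of the mean bid."""
    return 0.58 * statistics.fmean(bids.values())


estimated_cpc = estimate_cpc(list(price_history), price_history, toy_forecast)
print(f"Estimated CPC: ${estimated_cpc:.3f}")
```

Keeping the forecast behind a callable makes the separation explicit: the bid is an input chosen by the advertiser, while the CPC is an output of the simulated auction.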
Cost per click (CPC).
We applied the above-described procedure to a total of 3,600 official links from Wikipedia infoboxes, obtained by randomly sampling an equal number of articles from each of the 57 topics. During the one-month study period, these 3,600 official links were clicked 2.73M times in total.
As mentioned, the bid passed to the forecasting tool is not necessarily the winning price of the simulated auction; it merely caps expenses. In practice, the auction price reached our bid in only 8.9% of cases; on average, the auction price was 58% of the bid.
The CPC distribution is shown in Fig. 10. When weighting all links evenly (i.e., macro-averaging), the mean and median CPC are $1.64 and $0.90, respectively. As not all links are equally popular, a more reasonable estimate of the overall CPC may be obtained by weighting each link according to the number of clicks it received (i.e., micro-averaging). The resulting weighted CPC is slightly lower, with a mean and median of $1.37 and $0.73, respectively (vertical lines in Fig. 10).
Investigating the weighted mean CPC for individual topics, we found considerable variation, with the highest CPCs for mathematics, medicine & health, books, and architecture, and the lowest CPCs for music, sports, fashion, and films (omitting from the list topics that mark geographical regions, such as north america).
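The difference between macro-averaging and click-weighted (micro-) averaging can be illustrated with a short Python sketch. The CPC values and click counts are made up, and the weighted median below uses one common definition (the smallest value at which the cumulative weight reaches half of the total weight), which is an assumption about a detail the text does not specify.

```python
def weighted_mean(values, weights):
    """Mean of `values`, each counted in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_median(values, weights):
    """Smallest value at which cumulative weight reaches half the total."""
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value


# Hypothetical links: estimated CPC in US$ and clicks received on Wikipedia.
cpc = [0.50, 1.00, 4.00]
clicks = [700, 200, 100]

macro_mean = sum(cpc) / len(cpc)           # every link counts equally
micro_mean = weighted_mean(cpc, clicks)    # every click counts equally
micro_median = weighted_median(cpc, clicks)

print(f"macro mean:      ${macro_mean:.2f}")
print(f"weighted mean:   ${micro_mean:.2f}")
print(f"weighted median: ${micro_median:.2f}")
```

In this toy example, the most-clicked link is also the cheapest, so the weighted estimates fall below the unweighted mean, the same direction as the shift from $1.64 to $1.37 reported above.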
Monthly value of traffic to official websites.
Multiplying the weighted CPC by the overall number of 9.8M clicks on official links during the one-month study period, we estimate the total monthly value generated by the traffic from Wikipedia to official websites as $13.4 million when using the mean CPC estimate, or as $7.2 million when using the median CPC estimate.
Broken down by topic, we obtained the (mean-based) estimates of Fig. 11. The topic with the highest total monthly value ($1.9M) is north america, a macro category assigned to a large set of articles, including U.S. companies and people. It is followed by business and economics ($1.3M), biography ($1.3M), technology ($1.0M), and software ($0.9M).
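The overall estimate is a single multiplication, reproduced here in Python as a worked check. The three constants are the figures reported in the text; the function and variable names are hypothetical.

```python
OFFICIAL_LINK_CLICKS = 9.8e6   # clicks on official links in one month
WEIGHTED_MEAN_CPC = 1.37       # US$, click-weighted mean CPC
WEIGHTED_MEDIAN_CPC = 0.73     # US$, click-weighted median CPC


def monthly_value(clicks, cost_per_click):
    """Cost of buying `clicks` clicks at `cost_per_click` US$ each."""
    return clicks * cost_per_click


mean_based = monthly_value(OFFICIAL_LINK_CLICKS, WEIGHTED_MEAN_CPC)
median_based = monthly_value(OFFICIAL_LINK_CLICKS, WEIGHTED_MEDIAN_CPC)

print(f"mean-based:   ${mean_based / 1e6:.1f}M")    # $13.4M
print(f"median-based: ${median_based / 1e6:.1f}M")  # $7.2M
```

The gap between the two estimates reflects the right-skewed CPC distribution: a minority of expensive links pulls the mean well above the median.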
Wikipedia as a gateway to the Web.
While the value of Wikipedia’s knowledge is fairly well known, its hidden and significant additional value as a gateway has been less well known. Building on top of existing work related to Wikipedia readers’ behavior analysis, we have uncovered a new perspective on the role of Wikipedia in the broader Web ecosystem. We offered a description of the value of Wikipedia as a gateway under multiple levels of analysis.
Overall, we found that a substantial fraction of Wikipedia readers use the encyclopedia as a gateway to the broader Web: readers engage more and faster with official links in the article’s infobox than with links in the article body or in the reference section. We found that Wikipedia’s role as a gateway to external content is particularly pronounced when users visit articles about websites, software, businesses, education, and sports, among others, where the click-through rate of official links is the highest.
Wikipedia as a stepping stone.
We found an inverse relation between the time it takes to click on an external link and its average click-through rate, showing clusters of topics, such as sports and internet culture, where engagement with links was high and fast, or conversely, where engagement with links was low and slow, such as biography and geographical. We also observed that articles that were visited particularly frequently from external referrers (mostly search engines) also had a particularly high probability of an official-link click after being reached from external referrers. These results indicate that a certain distinct set of Wikipedia articles is leveraged by users in the spirit of “stepping stones” or “revolving doors”, which are reached nearly exclusively from external referrers (mostly search) and from which the user leaves Wikipedia immediately again by clicking an official external website link. This raises the question: If users visit Wikipedia from a search engine result page only to leave it immediately towards a third website, why would they not simply use the search engine to locate the third website to begin with?
We conjecture that the reason is that the search engine cannot fulfill the user’s information need in such situations, whereas Wikipedia can. When manually screening the data (focusing on popular articles with at least 30K pageviews during the one-month study period), we found that, among the articles with the highest CTR, there was a disproportionate fraction of file-sharing (5 of the top 6) and pornographic (5 of the top 15) websites. Such search results are frequently censored or down-ranked by search engines, depending on the search engine’s corporate policy as well as legislation in the user’s country. Indeed, manually searching Google for the names of the 15 articles with the highest CTR (and more than 30K monthly pageviews) from two locations (U.S. and Germany) revealed that 5 file-sharing websites and 1 pornographic website were not listed by Google among the top 10 search results in at least one of the two locations. (Additionally, two controversial websites were not online anymore at the time of research, about 18 months after data collection.) While these findings remain small-scale and anecdotal, they suggest that Wikipedia fills a functional gap, as a workaround for content suppressed by search engines (sometimes for valid reasons, e.g., when the linked material is copyrighted or illegal).
Wikipedia’s role as a transitory stepping stone towards external content can have important implications for Web user studies. For example, researchers working on disinformation diffusion might want to take into account the function of Wikipedia as a short yet crucial stopover in Web users’ information-seeking journeys.
[Figure 11: bar chart, one bar per topic, sorted from North America (highest) to Central Asia (lowest); y-axis: total clicks value [US$].]
Figure 11: Estimated total monthly value of official links in Wikipedia infoboxes by topic, obtained by multiplying the mean cost per click (CPC) of links from the respective topic by the total number of clicks on those links in the Wikipedia logs.
The economic value of traffic generated by Wikipedia.
Finally, we set out to estimate the monetary value of Wikipedia as a gateway to the broader Web. The infoboxes contained in English Wikipedia articles collectively list over half a million official-website links, which were clicked 9.8M times during our one-month study period. These clicks were generated by Wikipedia for free, amidst a Web ecosystem that is largely powered by paid ads. We asked, “If the respective website owners wanted to achieve the same number of clicks via sponsored search results, what would be the price?” We estimate that achieving the 9.8M monthly clicks on official links would cost a total of $7–13 million using Google Ads. Extrapolating to 12 months, the yearly cost would amount to $84–156 million. This is a remarkably high number, considering that the annual revenue of the Wikimedia Foundation, the non-profit organization that operates Wikipedia and its sister projects, stands at around $110 million, coming entirely from donations and voluntary contributions. We also emphasize that the estimated economic value of $84–156 million pertains to English Wikipedia only, whereas Wikimedia’s annual revenue of $110 million needs to support all Wikimedia projects across languages.
Footnote: https://wikimediafoundation.org/about/financial-reports
We showed that, when buying clicks from Google Ads instead of obtaining them from Wikipedia for free, the types of businesses that would have to pay the most would be North American companies, as well as software and technology businesses. While the narrative about tech companies’ donations to Wikipedia has often been around their massive usage of the free encyclopedic content for products and algorithms [7, 33], these findings might provide yet another perspective on how these companies benefit from the hard work of hundreds of thousands of volunteer editors.
More broadly, our work expands the small body of literature on measuring the value of Wikipedia to the Web [20, 32].
While previous work focused on the value of content production, for example estimating that Wikipedia generates $1.7 million of Reddit and Stack Overflow’s revenue, based on the amount of Wikipedia-linked posts on those platforms [32], we focused here on the value of Wikipedia traffic. We provided, for the first time, an estimation of the monetary value offered, for free, by Wikipedia to the broader Web ecosystem by means of link navigation.
Limitations and future work.
This study should be considered in the light of its limitations. Most notably, it was constrained to data collected during one month from English Wikipedia only, and as such provides a limited view of readers’ general behavior. Future work should replicate the study for different time periods and language versions in order to paint a more complete and inclusive picture of Wikipedia readers’ engagement with external links.
Besides broadening the scope, future work should also go deeper by more closely investigating why certain types of official link see particularly high or low CTRs (e.g., the CTR of links to geographical and biographical content was particularly low; cf. Fig. 5a). Also, considering that official links related to business and economics saw the highest CTRs, it will be interesting to analyze which businesses benefit most from the free traffic provided by Wikipedia.
Whereas our investigation of the volume (RQ1) and patterns (RQ2) of engagement with external links on Wikipedia was primarily measurement-based, our estimation of the economic value of Wikipedia as a gateway to the Web (RQ3) was more speculative. On the one hand, we operated under the assumption that our methodology for obtaining costs per click via the Google Ads API is sound and provides accurate estimates, despite the fact that we relied on keyword suggestions of unknown quality from the Google Ads Keyword Planner and on uncertain auction simulations from the Google Ads forecasting tool. On the other hand, and more fundamentally, estimating the economic value of the Web traffic generated by Wikipedia necessarily requires arguing about a hypothetical, counterfactual (“what if”) situation, in our case, “What if website owners were to pay for the same number of clicks via Google Ads instead of obtaining them from Wikipedia for free?”
Although similar reasoning has been applied to estimate the value of images from Wikimedia Commons [8], it remains open how realistic that “what if” is: as a matter of fact, Wikipedia is providing those clicks for free, so why would website owners ever decide to pay for them instead? As one concrete example, one could imagine a situation where a website owner would want to increase traffic to their site, in which case our estimates indicate how much it would cost them to double the traffic they already receive from Wikipedia for free. Alternatively, one could imagine scenarios where Wikipedia were to be blocked by censorship or ceased to exist entirely, in which case our estimated economic value of traffic from Wikipedia would correspond to the loss on behalf of website owners due to the lack of that traffic. Finally, and more boldly, one could imagine a setting where Wikipedia decided to introduce a fee for clicks on official website links, in which case our estimate would upper-bound the amount of extra revenue Wikipedia could possibly earn from such a fee.
Although the latter setting is highly unrealistic, we consider it a useful thought experiment that can help emphasize Wikipedia’s importance as a provider of free traffic.

As a final remark, we argue that all notions of economic value are fundamentally counterfactual at heart, as they always consider “what if” scenarios (“If A were to give X to B, how much money would B give to A in return?”), which is also the reason why business valuations of companies are routinely criticized as absurdly off [25].

To conclude, we hope this work will offer ideas and methods to those interested in exploring Wikipedia’s role in the larger Web ecosystem in more depth, and that it will help quantify the true value of the largest encyclopedic knowledge repository on the Web.

Acknowledgments. West’s lab was partly funded by Swiss National Science Foundation (grant 200021_185043), European Union (TAILOR, grant 952215), Microsoft Swiss Joint Research Center, and by generous gifts from Facebook and Google. We would also like to thank Leila Zia and Joseph Seddon from Wikimedia for providing valuable suggestions for this research.
REFERENCES
[1] Alphabet Inc. 2019. Alphabet announces fourth quarter and fiscal year 2019 results. https://web.archive.org/web/20210104101629/https://abc.xyz/investor/static/pdf/2019Q4_alphabet_earnings_release.pdf
[2] Simon Attfield, Gabriella Kazai, Mounia Lalmas, and Benjamin Piwowarski. 2011. Towards a science of user engagement (position paper). In Proc. UMWA.
[3] Saeideh Bakhshi, David A Shamma, and Eric Gilbert. 2014. Faces engage us: Photos with faces attract more likes and comments on Instagram. In Proc. CHI.
[4] Nicola Barbieri, Fabrizio Silvestri, and Mounia Lalmas. 2016. Improving post-click user engagement on native ads via survival analysis. In Proc. WWW.
[5] Eda Baykan, Monika Henzinger, Ludmila Marian, and Ingmar Weber. 2009. Purely URL-based topic classification. In Proc. WWW.
[6] Jörg Claussen, Tobias Kretschmer, and Philip Mayrhofer. 2013. The effects of rewarding user engagement: The case of Facebook apps. Information Systems Research 24, 1 (2013), 186–200.
[7] Megan Rose Dickey. 2019. Google.org donates $2 million to Wikipedia’s parent org. TechCrunch (22 January 2019). https://web.archive.org/web/20210213130411/https://techcrunch.com/2019/01/22/google-org-donates-2-million-to-wikipedias-parent-org/
[8] Kristofer Erickson, Felix Rodriguez Perez, and Jesus Rodriguez Perez. 2018. What is the Commons worth? Estimating the value of Wikimedia imagery by observing downstream use. In Proc. OpenSym.
[9] Nicholas Generous, Geoffrey Fairchild, Alina Deshpande, Sara Y. Del Valle, and Reid Priedhorsky. 2014. Global disease monitoring and forecasting with Wikipedia. PLoS Computational Biology 10, 11 (2014), e1003892.
[10] Aaron Halfaker and R. Stuart Geiger. 2019. ORES: Lowering Barriers with Participatory Machine Learning in Wikipedia. Proc. ACM Human-Computer Interaction.
[11] Kyle S. Hickmann, Geoffrey Fairchild, Reid Priedhorsky, Nicholas Generous, James M. Hyman, Alina Deshpande, and Sara Y. Del Valle. 2015. Forecasting the 2013–2014 influenza season using Wikipedia. PLOS Computational Biology 11, 5 (2015), e1004239.
[12] Marit Hinnosaar, Toomas Hinnosaar, Michael E Kummer, and Olga Slivko. 2019. Wikipedia matters. SSRN (2019).
[13] Yuheng Hu, Shelly Farnham, and Kartik Talamadupula. 2015. Predicting user engagement on Twitter with real-world events. In Proc. ICWSM.
[14] Yushi Jing, David Liu, Dmitry Kislyuk, Andrew Zhai, Jiajing Xu, Jeff Donahue, and Sarah Tavel. 2015. Visual search at Pinterest. In Proc. KDD.
[15] Michael Johnston. 2013. Wikipedia revenue analysis: How a wiki could make $2.3B a year. MonetizePros (25 June 2013). https://web.archive.org/web/20201112024633/https://monetizepros.com/features/analysis-how-wikipedia-could-make-2-8-billion-in-annual-revenue/
[16] Florian Lemmerich, Diego Sáez-Trumper, Robert West, and Leila Zia. 2019. Why the world reads Wikipedia: Beyond English speakers. In Proc. WSDM.
[17] Xiaoliang Ling, Weiwei Deng, Chen Gu, Hucheng Zhou, Cui Li, and Feng Sun. 2017. Model ensemble for click prediction in Bing search ads. In Proc. WWW.
[18] Lauren A Maggio, Ryan M Steinberg, Tiziano Piccardi, and John M Willinsky. 2020. Meta-Research: Reader engagement with medical content on Wikipedia. eLife
Search Engine Land (8 July 2014). https://web.archive.org/web/20201020025709/https://searchengineland.com/60-direct-traffic-actually-seo-195415
[20] Connor McMahon, Isaac Johnson, and Brent Hecht. 2017. The substantial interdependence of Wikipedia and Google: A case study on the relationship between peer production communities and information technologies. In Proc. ICWSM.
[21] Blagoj Mitrevski, Tiziano Piccardi, and Robert West. 2020. WikiHist.html: English Wikipedia’s full revision history in HTML format. Proc. ICWSM.
[22] Helen Susannah Moat, Chester Curme, Adam Avakian, Dror Y. Kenett, H. Eugene Stanley, and Tobias Preis. 2013. Quantifying Wikipedia usage patterns before stock market moves. Scientific Reports 3, 1 (2013), 1–5.
[23] Helen Susannah Moat, Chester Curme, H. Eugene Stanley, and Tobias Preis. 2014. Anticipating stock market movements with Google and Wikipedia. NATO Science for Peace and Security Series C: Environmental Security (2014), 47–59.
[24] Tiziano Piccardi, Miriam Redi, Giovanni Colavizza, and Robert West. 2020. Quantifying engagement with citations on Wikipedia. In Proc. WWW.
[25] Jamie Powell. 2020. Is GMO’s Montier right on ‘absurd’ US stocks? Financial Times
Decision Support Systems 119 (2019), 96–106.
[27] Evan Sheehan, Chenlin Meng, Matthew Tan, Burak Uzkent, Neal Jean, Marshall Burke, David Lobell, and Stefano Ermon. 2019. Predicting economic development using geolocated Wikipedia articles. In Proc. KDD.
[28] Philipp Singer, Florian Lemmerich, Robert West, Leila Zia, Ellery Wulczyn, Markus Strohmaier, and Jure Leskovec. 2017. Why we read Wikipedia. In Proc. WWW.
[29] Yang Song, Xiaolin Shi, and Xin Fu. 2013. Evaluating and predicting user engagement change with degraded search relevance. In Proc. WWW.
[30] Nathan TeBlunthuis, Tilman Bayer, and Olga Vasileva. 2019. Dwelling on Wikipedia: Investigating time spent by global encyclopedia readers. In Proc. OpenSym.
[31] Neil Thompson and Douglas Hanley. 2017. Science is shaped by Wikipedia: Evidence from a randomized control trial. SSRN Electronic Journal (2017).
[32] Nicholas Vincent, Isaac Johnson, and Brent Hecht. 2018. Examining Wikipedia with a broader lens: Quantifying the value of Wikipedia’s relationships with other large-scale online communities. In Proc. CHI.
[33] Rachel Withers. 2018. Amazon owes Wikipedia big-time. Slate (11 October 2018). https://web.archive.org/web/20210202152028/https://slate.com/technology/2018/10/amazon-echo-wikipedia-wikimedia-donation.html
[34] Sean Xin Xu and Xiaoquan (Michael) Zhang. 2013. Impact of Wikipedia on market information environment: Evidence on management disclosure and investor reaction. MIS Quarterly 37, 4 (2013), 1043–1068.
[35] Xing Yi. 2015. Dwell time based advertising in a scrollable content stream. US Patent App. 13/975,157.
[36] Fan Zhang, Ke Zhou, Yunqiu Shao, Cheng Luo, Min Zhang, and Shaoping Ma. 2018. How well do offline and online evaluation metrics measure user satisfaction in Web image search? In