Filter List Generation for Underserved Regions
Alexander Sjösten
Chalmers University of Technology
Peter Snyder
Brave Software
Antonio Pastor
Universidad Carlos III de Madrid
Panagiotis Papadopoulos
Brave Software
Benjamin Livshits
Brave Software and Imperial College London
ABSTRACT
Filter lists play a large and growing role in protecting and assisting web users. The vast majority of popular filter lists are crowd-sourced, where a large number of people manually label undesirable web resources (e.g. ads, trackers, paywall libraries), so that they can be blocked by browsers and extensions.

Because only a small percentage of web users participate in the generation of filter lists, a crowd-sourcing strategy works well for blocking either uncommon resources that appear on "popular" websites, or resources that appear on a large number of "unpopular" websites. A crowd-sourcing strategy will perform poorly for parts of the web with small "crowds", such as regions of the web serving languages with (relatively) few speakers.

This work addresses this problem through the combination of two novel techniques: (i) deep browser instrumentation that allows for the accurate generation of request chains, in a way that is robust in situations that confuse existing measurement techniques, and (ii) an ad classifier that uniquely combines perceptual and page-context features to remain accurate across multiple languages.

We apply our unique two-step filter list generation pipeline to three regions of the web that currently have poorly maintained filter lists: Sri Lanka, Hungary, and Albania. We generate new filter lists that complement existing filter lists. Our complementary lists block an additional 3,349 ad and ad-related resources (1,771 unique) when applied to 6,475 pages targeting these three regions.

We hope that this work can be part of an increased effort to ensure that the security, privacy, and performance benefits of web resource blocking can be shared with all users, and not only those in dominant linguistic or economic regions.
CCS CONCEPTS
• Security and privacy → Human and societal aspects of security and privacy; • Information systems → Web applications.

KEYWORDS
ad blocking, filter lists, crowdsource
ACM Reference Format:
Alexander Sjösten, Peter Snyder, Antonio Pastor, Panagiotis Papadopoulos, and Benjamin Livshits. 2020. Filter List Generation for Underserved Regions. In Proceedings of The Web Conference 2020 (WWW '20), April 20–24, 2020, Taipei, Taiwan. ACM, New York, NY, USA, 11 pages. https://doi.org/10..
This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW '20, April 20–24, 2020, Taipei, Taiwan
© 2020 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-7023-3/20/04.
https://doi.org/10..

1 INTRODUCTION

Hundreds of millions of web users (i.e. 30% of all internet users [25]) use filter lists to maintain a secure, private, performant, and appealing web. Prior work has shown that filter lists, and the types of content blocking they enable, significantly reduce data use [37], protect users from malware [36], improve browser performance [28, 38], and significantly reduce how often and persistently users are tracked on the web.

Most filter lists are generated through crowd-sourcing, where a large number of people collaborate to identify undesirable online resources (e.g. ads, user tracking libraries, anti-adblocking scripts, etc.) and generate sets of rules to identify those resources. Crowd-sourcing the generation of these lists has proven a useful strategy, as evidenced by the fact that the most popular lists are quite frequently used and frequently updated [33, 41].

The most popular filter lists (e.g. EasyList, EasyPrivacy) target "global" sites, which in practice means either websites in English, or resources popular enough to appear on English-speaking sites in addition to sites targeting speakers of other languages. Non-English-speaking web users face different, generally less appealing options for content blocking. Web users who visit non-English websites that target relatively wealthy users generally have access to well-maintained, language-specific lists. Indeed, the French [9], German [11], and Japanese [17] specific filter lists are representative examples of well-maintained, popular filter lists targeting non-English web users. Similarly, linguistic regions with very large numbers of speakers also generally have well-maintained filter lists. Examples here include well-maintained filter lists targeting Hindi [15], Russian [21], Chinese [5], and Portuguese [4, 20] websites.

Sadly, users who visit websites in languages with fewer speakers, or with less wealthy users, have worse options. Put differently, the usefulness of crowd-sourced filter lists depends on having a large or affluent crowd; parts of the web with fewer, or less affluent, users are left with filter lists that are smaller, less well-maintained, or both. Visitors speaking these less-commonly-spoken languages have degraded web experiences, and are exposed to all the web maladies that filter lists are designed to fix.

Compounding the problem, in many cases users in these regions are the ones who could benefit most from robust filter lists, as network connections may be slower, data may be more expensive, and the frequency of undesirable web resources may be higher. An example which motivates this work, and illustrates the inability of current filter lists to adequately block ads on a regional website in Albania, can be seen in the screenshot in Figure 1. In this example,
we browse the website gazetatema.net while using AdBlock Plus (which uses EasyList, a "global"-targeting filter list).

While there has been significant prior work on automating the generation of filter lists [24, 26, 29, 34], this existing work is focused on replicating and extending the most popular English and globally-focused filter lists, with little to no evaluation on, or applicability to, non-English web regions. In this paper, we target the problem of improving filter lists for web users in regions with small numbers of speakers (relative to prominent global languages). We select three regions as representative of the problem in general: Albania, Hungary and Sri Lanka, using a methodology presented in Section 4.1.

We describe a two-pronged strategy for identifying long-tail resources on websites that target under-served linguistic regions of the web: (i) a classifier that can identify advertisements in a way that generalizes well across languages, and (ii) a method for accurately determining how advertisements end up in pages (as determined by either existing filter lists or our classifier), and, using this information, generating new, generalized filter rules.

We use this novel instrumentation both to build inclusion chains (i.e. measurements of how every remote resource wound up in a web page), and to determine how high in each inclusion chain blocking can begin. This allows us to (i) generate generalized filter rules (i.e. rules that target the scripts that include ad images on each page, instead of rules that target URLs of individual advertisements), and (ii) ensure we do not block new resources in ways that would break the website.
Contributions.
In summary, this paper makes the following contributions to the problem of blocking unwanted resources on websites targeting smaller linguistic audiences.

(1) The implementation and evaluation of an image classifier for automatically detecting advertisements on the web, which relies on a mix of perceptual and contextual features. This classifier is designed to be robust across many languages (particularly those overlooked by existing research) and achieves an accuracy of 97.6% in identifying images and iframes related to advertising.

(2) Novel, open source browser instrumentation, implemented as modifications to the Blink and V8 runtimes, that allows for determining the cause of every web request in a page, in a way that is far more accurate than existing tools. This instrumentation also allows us to accurately attribute every DOM modification to its cause, which in turn allows us to predict whether blocking a resource would break a page.

(3) The design of a novel, two-stage pipeline for identifying advertising resources on websites, using the previously mentioned classifier and instrumentation, to identify long-tail advertising resources targeting web users who do not speak languages with large global communities.

(4) A real-world evaluation of our pipeline on sites that are popular in languages that are (relatively) uncommon online. We find that our approach is successful in significantly improving the quality of filter lists for web users without large, language-specific crowd-sourced lists. As our evaluation shows, our generated lists block an additional 3,349 ad and ad-related resources (1,771 unique) when applied to 6,475 pages targeting these three regions.
Figure 1: Motivating example of current filter lists' regional inefficiency. Screenshot of an Albanian website browsed with Adblock Plus.
2 PROBLEM STATEMENT

A successful contribution to the problem of improving the quality of filter lists in small web regions should account for the following issues:
Scalability.
The primary difficulty of generating effective blocking rules for small-region web users is the reduced number of people who can participate in crowd-sourced list generation. While portions of the web targeted at large audiences (e.g. sites in the English language, or web regions with a large number of language speakers) can count on a large number of users to report unwanted resources, or generally distribute the task of list generation, regions of the web targeting only a small number of users (e.g. languages with fewer speakers) do not have this luxury. A successful solution therefore likely requires some kind of automation to augment the efforts of regional list generators.
Generalize-ability.
In most cases, ads are rendered by scripts. In addition, every time an ad-slot is filled, the embedded ad image may have come from a different URL. Approaches that directly target the URLs serving ad-related resources are therefore likely to become stale very quickly. An effective solution to the problem would instead target the "root cause" of the unwanted resources being included in the page, in this case the script, which determines what image URLs to load. Approaches that attempt to only build lists of URLs of ad-related images are therefore unlikely to be useful solutions to the problem in the long term (as also seen in the screenshot in Figure 1).
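To make the distinction concrete, the following Python sketch contrasts rules built from individual image URLs with a single rule targeting the script that loads them. The Adblock-Plus-style network rule syntax is standard, but all URLs and the helper function are hypothetical illustrations, not data from our measurements.

```python
# Contrasting image-URL rules with a "root cause" rule. All URLs below
# are hypothetical.
from urllib.parse import urlparse

# Ad image URLs observed on a page; these rotate on every ad-slot fill.
ad_image_urls = [
    "https://cdn.ads-example.com/creative/93fa1.jpg",
    "https://cdn.ads-example.com/creative/b7c22.jpg",
]

# The script that loaded both images, recovered from the request chain.
ad_library_url = "https://static.ads-example.com/lib/serve.js"

def network_rule(url: str, resource_type: str) -> str:
    """Build an Adblock-Plus-style network rule for a URL's host and path."""
    parsed = urlparse(url)
    return f"||{parsed.netloc}{parsed.path}${resource_type}"

# Fragile: one rule per image; stale as soon as the creatives rotate.
fragile_rules = [network_rule(u, "image") for u in ad_image_urls]

# Generalized: one rule for the library that loads every creative.
print(network_rule(ad_library_url, "script"))
# -> ||static.ads-example.com/lib/serve.js$script
```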
Web compatibility.
Content blocking necessarily requires modifying the execution of a page from what the site author intended, to something hopefully more closely aligned with the visitor's goals and preferences. Modifying the page's execution in this way (e.g. by changing what resources are loaded, by preventing scripts from executing, etc.) frequently causes pages to break, and users to abandon content-blocking tools. While filter lists targeting large audiences can rely on the crowd to report breaking sites to the list authors (so that they can tailor the rules accordingly), filter lists targeting smaller parts of the web often do not have enough users to maintain this positive feedback loop. An effective system for programmatically augmenting small-region filter lists must therefore take extra care to ensure that new rules will not break sites.
3 METHODOLOGY

This section presents a methodology for programmatically identifying advertising and other unwanted web resources in under-served regions. This section proceeds by describing (i) a high-level overview of our approach, (ii) a hybrid classifier used to identify image-based web advertisements, (iii) the unique browser instrumentation used in our approach, (iv) how we identify ad libraries and other "upstream" resources for blocking, (v) how we determine if a request is safe to block (i.e. would not break desirable functionality on the page), and (vi) how we generate filter list rules from the gathered ad URLs.
Our solution to improving filter lists for under-served regions consists of the combination of two unique strategies. First, we designed a system for programmatically determining whether an image is an ad, in a cross-language, highly precise way. We use this classifier to identify ad images that are missed by crowd-sourced filter list maintainers.

Second, we developed a technique for identifying additional resources that should be blocked, by considering the request chains that brought the ad into the page, and finding instances where we can block earlier in that request chain. We then apply this "blocking earlier in the chain" principle both to ads identified by existing filter lists, and to new ads identified by our classifier, to maximize the number of resources that can be safely blocked. This approach also allows us to generate generalized blocking rules that target the causes of ads being included in the page, instead of only the "symptoms": the specific, frequently-changing image URLs.

We note that this approach could be applied to any region of the web, including both popular and under-served regions. However, since popular parts of the web are already well served by crowd-sourced approaches, we expect the marginal improvement of applying this technique will be greatest for under-served regions, where there are comparatively few manual labelers.

The following subsections describe the implementation of each piece of our filter list generation pipeline. Section 4 describes the evaluation of how successful this approach was at generating new filter list rules for under-served regions.
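As a rough illustration of how the two strategies compose, the following Python sketch wires the pipeline's stages together. Every injected callable stands in for a component described in the following subsections; the names, signatures, and ABP-style rule format here are our assumptions, not the paper's actual code.

```python
# Rough sketch of how the two strategies compose into one pipeline.
from typing import Callable, Iterable, Optional

def generate_rules(
    resources: Iterable[str],                     # image/frame URLs on a page
    is_known_ad: Callable[[str], bool],           # hit in existing filter lists
    classify_as_ad: Callable[[str], bool],        # our hybrid classifier
    inclusion_chain: Callable[[str], list[str]],  # from PageGraph
    highest_safe_block: Callable[[list[str]], Optional[str]],
) -> set[str]:
    """Step 1: find ad leaves; step 2: block as high in the chain as is safe."""
    rules: set[str] = set()
    for url in resources:
        if not (is_known_ad(url) or classify_as_ad(url)):
            continue
        target = highest_safe_block(inclusion_chain(url))
        if target is not None:
            rules.add(f"||{target}")  # rule for the root-most safely blockable URL
    return rules
```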
First, our approach requires an oracle for determining whether a page element is an advertisement, without human labeling. To solve this problem, we designed and trained a unique hybrid image classifier that considers both the image's pixel data, and the page context an image request occurred in, when predicting whether a page element is an advertisement. Our classifier targets both images (i.e. img elements) and sub-documents (i.e. iframe elements).

While there is significant existing work on image-based (i.e. perceptual) web ad classification, we were not able to use existing approaches for two reasons. First, we had disappointing results when applying existing perceptual classifiers to the web at large. The existing approaches we considered did very well on the data sets they were trained on, but did a relatively poor job when applied to new, random crawls of the web. Second, we were concerned that relying on perceptual features alone would reduce the classifier's ability to generalize across languages. We expected that adding contextual features (e.g. the surrounding elements in the page, whether the image request was triggered by JavaScript or the document parser, attributes on the element displaying the image) would make the classifier generalize better.
Our approach combines both perceptual and contextual page features, each building on existing work. The perceptual features are similar to those described in the Percival [40] paper, while the contextual features are extensions of those used in the AdGraph [34] project. The probability estimated by the perceptual module is then used as an input to the contextual classifier.
Perceptual Sub-module.
The perceptual part of our classifier expands Percival's SqueezeNet-based CNN into a larger network, ResNet18 [30]. While the Percival project used a smaller network for fast online, in-browser classification, our classifier is designed for offline classification, and so faces no such constraint. We instead use the larger ResNet18 approach to increase predictive power. Otherwise, our approach is the same as that described in [40].
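A minimal sketch of how such a perceptual sub-module could be set up is shown below, assuming PyTorch/torchvision. ResNet18 is named by the paper [30], but the two-class head, input size, and preprocessing here are our assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the perceptual sub-module, assuming PyTorch/torchvision.
import torch
import torch.nn as nn
from torchvision import models, transforms

def build_perceptual_model() -> nn.Module:
    model = models.resnet18(weights=None)          # weights come from training
    model.fc = nn.Linear(model.fc.in_features, 2)  # two classes: {not-ad, ad}
    return model

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def ad_probability(model: nn.Module, image) -> float:
    """Return P(ad) for a PIL image; later fed to the contextual classifier."""
    model.eval()
    with torch.no_grad():
        logits = model(preprocess(image).unsqueeze(0))
        return torch.softmax(logits, dim=1)[0, 1].item()
```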
Contextual Sub-module.
The contextual part of our classifier does not consider the image's pixel data, but instead how the image was loaded into the web page, and the context of the page the image or sub-document would be displayed in. Examples of contextual features include whether the requested resource is served from the same domain as the requesting website, and the number of sibling DOM nodes of the img or iframe element initiating the request. These features are similar to those described in the AdGraph paper, and are detailed in Figure 4. The browser instrumentation needed to extract these features is described in detail in Section 3.3.

We built our image classifier in two steps. First, we built a purely perceptual classifier, using approaches described in existing work. Second, when we found the perceptual classifier did not generalize well when applied to a new, independent sampling of images, we moved to a hybrid approach. In this hybrid approach, the output of the perceptual classifier is just one feature among many other contextual features. We found this hybrid approach performed much better on our new, manually labeled, random crawl of the Alexa 10k. The rest of this subsection describes each stage in this process.

Initially, we built a classifier using an approach nearly identical to the perceptual approach described in [40]. We evaluated this model on a combination of data provided by the paper's authors, augmented with a small amount of additional data labeled by ourselves. This data set is referred to in Figure 3 as the "Initial Alexa 10k Set". When we applied the training method described in [40] to this data set, we received very accurate results, reported in Figure 2.

Later, while building the pipeline described in this paper, we generated a second manually labeled data set of images and frames, randomly sampled from a new crawl of the Alexa 10k. This data set is referred to in Figure 3 as the "Alexa 10k Recrawl", and was collected between 2 and 6 months later than the previous data set (the date range is due to the majority of this data set being collected by the Percival authors, six months before our work, with a smaller additional amount of data being collected by ourselves later on).
                  Initial Alexa 10k Data Set        Alexa 10k Recrawl
                  Accuracy  Precision  Recall       Accuracy  Precision  Recall
Perceptual-only   95.9%     95.5%      96.4%        77.0%     48.8%      87.4%
Hybrid            -         -          -            97.6%     92%        75%
Figure 2: Comparison of classification strategies. "Perceptual-only" refers to the approach by Percival [40] and variants (best numbers reported). "Hybrid" uses both perceptual and contextual features, and performed much better on our independent sampling of images and frames from the Alexa 10k, especially with regard to precision.
Figure 3: Comparison of the distribution of ads for images and frames collected in each data set.
Content features:
  Height & width
  Is image size a standard ad size?
  Resource URL length
  Is resource from subdomain?
  Is resource from third party?
  Presence of a semi-colon in query string?
  Resource type (image or iframe)
  Perceptual classifier ad probability

Structural features:
  Resource load time from start
  Degree of the resource node (in, out, in+out)
  Is the resource modified by script?
  Parent node degree (in, out, in+out)
  Is parent node modified by script?
  Average degree connectivity
Figure 4: Partial feature set of the contextual classifier.

When we applied the prior purely-perceptual approach to this new data set, we received greatly reduced accuracy. Most alarming, for our purposes, was the dramatically reduced precision. These numbers are also reported in Figure 2.

We concluded that perceptual features alone were insufficient to handle the breadth of advertisements found on the web, and so wanted to augment the prior perceptual approach with additional contextual features we expected to generalize better, both across languages and across time. A subset of these contextual features is presented in Figure 4; they are heavily based on the contextual ad-identification features discussed in the AdGraph [34] project.

After constructing our hybrid classifier from the combination of perceptual and contextual features, we achieved greatly increased precision, though at the expense of some recall. We used a Random Forest approach to combine the perceptual and contextual features, and after conducting a 5-fold cross-validation, achieved a mean precision of 92% and a mean recall of 75%, again summarized in Figure 2. Our hybrid classifier could not be evaluated against the initial Alexa 10k data set because that data set (1) had some labels determined programmatically, and (2) was collected without our browser instrumentation, meaning we could not extract the required features.
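The combination step described above could look roughly like the following, assuming scikit-learn. The feature names follow Figure 4, but the exact feature encoding and forest hyperparameters are assumptions on our part.

```python
# Sketch of the hybrid combination step: the perceptual P(ad) is appended
# to the contextual features, and a random forest makes the final call.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

def feature_vector(ctx: dict, perceptual_p_ad: float) -> np.ndarray:
    return np.array([
        ctx["height"], ctx["width"],
        ctx["is_standard_ad_size"],
        ctx["url_length"],
        ctx["is_third_party"],
        ctx["sibling_count"],   # degree-style structural feature
        perceptual_p_ad,        # output of the perceptual sub-module
    ], dtype=float)

def evaluate(X: np.ndarray, y: np.ndarray) -> dict:
    """5-fold cross-validation, mirroring the protocol described above."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_validate(clf, X, y, cv=5, scoring=("precision", "recall"))
```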
In this subsection we present PageGraph, a system for representing and recording web page execution as a graph. PageGraph allows us to correctly attribute every page modification and network request to its cause in the page (usually, the responsible JavaScript unit). We use this instrumentation both to extract the contextual features described in Section 3.2.2, and to accurately understand what page modifications and downstream requests each JavaScript unit is responsible for.

Our approach is similar to the AdGraph [34] project, but is more robust (i.e. corrects categories of attribution errors) and broader (i.e. covers an even greater set of page behaviors). PageGraph is implemented as a large set of patches and modifications to Blink and V8 (approximately 12K LOC). The code for PageGraph is open source and actively maintained, and can be found at [19], along with information on how other researchers can use the tool.

The remainder of this subsection provides a high-level summary of the graph-based approach used by PageGraph, and how it differs from existing work.
We use PageGraph to represent the execution of each page as a directed graph. This graph is available both at run-time, and offline (serialized as GraphML [22]) for after-the-fact analysis. PageGraph uses nodes to represent elements in a web page (e.g. DOM elements, resources requested, executing JavaScript units, child frames) and edges to represent the interactions between these elements in the page (e.g. an edge from a script to a node might depict the script modifying an attribute on the node, an edge from a DOM element node to a resource node might depict a file being fetched because of an img element's src attribute, etc.). All such page behaviors in the top-level frame, and in child local frames, are captured in the graph.

We use PageGraph's context-rich recording of page execution for several purposes in this work. First, it allows us to accurately and efficiently understand how a JavaScript unit's execution modified the page; we can easily determine which scripts made many modifications to the page, and which had only "invisible" effects, e.g. to fingerprint the user. Second, the graph allows us to determine how each element ended up in a page. For example, the graph representation makes it easy to determine whether an image was injected into the page by a script, if so by which script, how that script itself was included in the page, and so on. Being able to accurately determine which page element is responsible for the inclusion of each script, frame or image element is particularly valuable to this work, as described in the following subsections.
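Because the recording is serialized as GraphML, offline analyses like those above can be written against standard graph tooling. The Python sketch below, using networkx, walks causal edges upward from a resource node; the file name and the "edge type" attribute values are hypothetical, as the real schema is documented in the PageGraph repository [19].

```python
# Sketch of offline analysis over a serialized PageGraph recording. We
# assume only what the text states: the graph is GraphML, nodes are page
# elements/scripts/resources, and edges are interactions between them.
import networkx as nx

CAUSAL_EDGE_TYPES = {"request", "insert", "execute"}  # assumed labels

def inclusion_chain(graph: nx.MultiDiGraph, node_id: str) -> list[str]:
    """Walk causal edges upward from a resource node toward the parser,
    returning node ids from the resource back to its root cause."""
    chain = [node_id]
    current = node_id
    while True:
        causes = [u for u, _, data in graph.in_edges(current, data=True)
                  if data.get("edge type") in CAUSAL_EDGE_TYPES]
        if not causes:
            return chain
        current = causes[0]  # assumes a single recorded cause per node
        chain.append(current)

# Usage: graph = nx.read_graphml("page_recording.graphml")  # hypothetical
#        print(inclusion_chain(graph, ad_image_node_id))
```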
The most relevant related work to PageGraph is the AdGraph project, which also modifies the Blink and V8 systems in Chromium to build a graph representation of page execution. PageGraph differs from AdGraph in several significant ways.
Improved Attribution Accuracy.
PageGraph significantly improves cause attribution in the graph, i.e. correctly determining which JavaScript unit is responsible for each modification. We observed a non-trivial number of corner cases where AdGraph would attribute modifications to the wrong script unit, such as when a script was executed as a result of an element attribute (e.g. onerror="do_something()"), or when the JavaScript stack is reset through events like timer callbacks (e.g. setTimeout(do_something, 1)). PageGraph correctly handles these and a large number of similar corner cases.
Increased Attribution Breadth.
PageGraph significantly increases the set of page events tracked in the graph, beyond what AdGraph records. For example, PageGraph tracks image requests initiated because of CSS rules and prefetch instructions, records modifications made in local sub-documents, and tracks failed network requests, among many others. This additional attribution allows for a greater understanding of the context scripts execute in.
We next discuss how we generate generalized filter rules from the data gathered by the previously described image classifier and browser instrumentation. The general approach is to find URLs serving ad images and frames using the classifier, use the browser instrumentation to build the entire request chain that caused the advertisement to be included in the page (e.g. the script that fetched the script that inserted the image), and then again use the browser instrumentation to determine how far up each request chain we can block without breaking the page.

We build these request chains both for images (and frames) our classifier identifies as ads, and for resources identified by network rules in existing filter lists (i.e. EasyList, EasyPrivacy and the most up-to-date applicable regional list). The former allows us to generalize the benefits of our image classifier; the latter allows us to maximize the benefits of existing filter lists.
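Abstractly, the "how far up can we block" step reduces to scanning each chain from its root toward the ad leaf and stopping at the first node the instrumentation deems safe to remove. A minimal sketch, with illustrative URLs and an assumed safety predicate (in practice derived from the page-breakage analysis above):

```python
# Minimal sketch of "block as high in the chain as is safe". We assume the
# chain is ordered root-first (e.g. first script fetched) to leaf-last (the
# ad image). URLs and the predicate below are illustrative only.
from typing import Callable, Optional

def highest_safe_block(chain: list[str],
                       is_safe_to_block: Callable[[str], bool]) -> Optional[str]:
    for url in chain:       # scan from the root toward the ad leaf
        if is_safe_to_block(url):
            return url      # blocking here also removes everything below it
    return None             # nothing in this chain is safe to block

chain = [
    "https://site-example.lk/assets/page-init.js",     # also builds the page
    "https://static.ads-example.com/lib/serve.js",     # ad library script
    "https://cdn.ads-example.com/creative/93fa1.jpg",  # rotating ad image
]
print(highest_safe_block(chain, lambda u: "ads-example.com" in u))
# -> https://static.ads-example.com/lib/serve.js
```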
Blocking higher in the request chain has several benefits. First, and most importantly, targeting URLs higher in the request chain yields a more consistent set of URLs. While the specific images that an ad library loads will change frequently, the URL of the ad library itself will rarely change. Approaches that target the frequently changing image URLs will result in filter list rules that quickly go stale; rules that target ad library scripts (as one example) are more likely to be useful over time, and to a wider range of users. Moving higher in the request chain means we are
[Figure: example inclusion chain — HTML Parser → Script 1 → Script 2 → Ad image]