Filter List Generation for Underserved Regions
Alexander Sjösten
Chalmers University of Technology
Peter Snyder
Brave Software
Antonio Pastor
Universidad Carlos III de Madrid
Panagiotis Papadopoulos
Brave Software
Benjamin Livshits
Brave Software and Imperial College London
ABSTRACT
Filter lists play a large and growing role in protecting and assisting web users. The vast majority of popular filter lists are crowd-sourced, where a large number of people manually label undesirable web resources (e.g. ads, trackers, paywall libraries), so that they can be blocked by browsers and extensions.

Because only a small percentage of web users participate in the generation of filter lists, a crowd-sourcing strategy works well for blocking either uncommon resources that appear on "popular" websites, or resources that appear on a large number of "unpopular" websites. A crowd-sourcing strategy will perform poorly for parts of the web with small "crowds", such as regions of the web serving languages with (relatively) few speakers.

This work addresses this problem through the combination of two novel techniques: (i) deep browser instrumentation that allows for the accurate generation of request chains, in a way that is robust in situations that confuse existing measurement techniques, and (ii) an ad classifier that uniquely combines perceptual and page-context features to remain accurate across multiple languages.

We apply our unique two-step filter list generation pipeline to three regions of the web that currently have poorly maintained filter lists: Sri Lanka, Hungary, and Albania. We generate new filter lists that complement existing filter lists. Our complementary lists block an additional 3,349 ad and ad-related resources (1,771 unique) when applied to 6,475 pages targeting these three regions.

We hope that this work can be part of an increased effort to ensure that the security, privacy, and performance benefits of web resource blocking can be shared with all users, and not only those in dominant linguistic or economic regions.
CCS CONCEPTS
• Security and privacy → Human and societal aspects of security and privacy; • Information systems → Web applications.

KEYWORDS
ad blocking, filter lists, crowdsource
ACM Reference Format:
Alexander Sjösten, Peter Snyder, Antonio Pastor, Panagiotis Papadopoulos, and Benjamin Livshits. 2020. Filter List Generation for Underserved Regions. In Proceedings of The Web Conference 2020 (WWW '20), April 20–24, 2020, Taipei, Taiwan. ACM, New York, NY, USA, 11 pages. https://doi.org/10..
This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW '20, April 20–24, 2020, Taipei, Taiwan
© 2020 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-7023-3/20/04.
https://doi.org/10..

1 INTRODUCTION

Hundreds of millions of web users (i.e. 30% of all internet users [25]) use filter lists to maintain a secure, private, performant, and appealing web. Prior work has shown that filter lists, and the types of content blocking they enable, significantly reduce data use [37], protect users from malware [36], improve browser performance [28, 38], and significantly reduce how often and persistently users are tracked on the web.

Most filter lists are generated through crowd-sourcing, where a large number of people collaborate to identify undesirable online resources (e.g. ads, user tracking libraries, anti-adblocking scripts, etc.) and generate sets of rules to identify those resources. Crowd-sourcing the generation of these lists has proven a useful strategy, as evidenced by the fact that the most popular lists are quite frequently used and frequently updated [33, 41].

The most popular filter lists (e.g. EasyList, EasyPrivacy) target "global" sites, which in practice means either websites in English, or resources popular enough to appear on English-speaking sites in addition to sites targeting speakers of other languages. Non-English-speaking web users face different, generally less appealing options for content blocking. Web users who visit non-English websites that target relatively wealthy users generally have access to well-maintained, language-specific lists. Indeed, the French [9], German [11], and Japanese [17] specific filter lists are representative examples of well-maintained, popular filter lists targeting non-English web users. Similarly, linguistic regions with very large numbers of speakers also generally have well-maintained filter lists. Examples here include well-maintained filter lists targeting Hindi [15], Russian [21], Chinese [5], and Portuguese [4, 20] websites.

Sadly, users who visit websites in languages with fewer speakers, or with less wealthy users, have worse options. Put differently, the usefulness of crowd-sourced filter lists depends on having a large or affluent crowd; parts of the web with fewer, or less affluent, users are left with filter lists that are smaller, less well-maintained, or both. Visitors speaking these less-commonly-spoken languages have degraded web experiences, and are exposed to all the web maladies that filter lists are designed to fix.

Compounding the problem, in many cases users in these regions are the ones who could benefit most from robust filter lists, as network connections may be slower, data may be more expensive, and the frequency of undesirable web resources may be higher. An example which motivates this work, and illustrates the inability of current filter lists to adequately block ads on a regional website in Albania, can be seen in the screenshot in Figure 1. In this example,
we browse the website gazetatema.net while using AdBlock Plus (which uses EasyList, a "global"-targeting filter list).

While there has been significant prior work on automating the generation of filter lists [24, 26, 29, 34], this existing work is focused on replicating and extending the most popular English and globally-focused filter lists, with little to no evaluation on, or applicability to, non-English web regions. In this paper, we target the problem of improving filter lists for web users in regions with small numbers of speakers (relative to prominent global languages). We select three regions as representative of the problem in general: Albania, Hungary and Sri Lanka, using a methodology presented in Section 4.1.

We describe a two-pronged strategy for identifying long-tail resources on websites that target under-served linguistic regions of the web: (i) a classifier that can identify advertisements in a way that generalizes well across languages, and (ii) a method for accurately determining how advertisements end up in pages (as determined by either existing filter lists or our classifier), and, using this information, generating new, generalized filter rules.

We use this novel instrumentation both to build inclusion chains (i.e. measurements of how every remote resource wound up in a web page), and to determine how high in each inclusion chain blocking can begin. This allows us to (i) generate generalized filter rules (i.e. rules that target the scripts that include ad images on each page, instead of rules that target URLs of individual advertisements), and (ii) ensure we do not block new resources in ways that would break the website.
Contributions.
In summary, this paper makes the following contributions to the problem of blocking unwanted resources on websites targeting smaller linguistic audiences.

(1) The implementation and evaluation of an image classifier for automatically detecting advertisements on the web, which relies on a mix of perceptual and contextual features. This classifier is designed to be robust across many languages (particularly those overlooked by existing research) and achieves an accuracy of 97.6% in identifying images and iframes related to advertising.

(2) Novel, open source browser instrumentation, implemented as modifications to the Blink and V8 runtimes, that allows for determining the cause of every web request in a page, in a way that is far more accurate than existing tools. This instrumentation also allows us to accurately attribute every DOM modification to its cause, which in turn allows us to predict whether blocking a resource would break a page.

(3) The design of a novel, two-stage pipeline for identifying advertising resources on websites, using the previously mentioned classifier and instrumentation, to identify long-tail advertising resources targeting web users who do not speak languages with large global communities.

(4) A real-world evaluation of our pipeline on sites that are popular in languages that are (relatively) uncommon online. We find that our approach is successful in significantly improving the quality of filter lists for web users without large, language-specific crowd-sourced lists. As our evaluation shows, our generated lists block an additional 3,349 ad and ad-related resources (1,771 unique) when applied to 6,475 pages targeting these three regions.
Figure 1: Motivating example of current filter lists' regional inefficiency. Screenshot of an Albanian website browsed with Adblock Plus.
2 PROBLEM STATEMENT

A successful contribution to the problem of improving the quality of filter lists in small web regions should account for the following issues:
Scalability.
The primary difficulty of generating effective blocking rules for small-region web users is the reduced number of people who can participate in crowd-sourced list generation. While portions of the web targeted at large audiences (e.g. sites in the English language, or web regions with a large number of language speakers) can count on a large number of users to report unwanted resources, or generally distribute the task of list generation, regions of the web targeting only a small number of users (e.g. languages with fewer speakers) do not have this luxury. A successful solution therefore likely requires some kind of automation to augment the efforts of regional list generators.
Generalize-ability.
In most cases, ads are rendered by scripts. In addition, every time an ad-slot is filled, the embedded ad image may have come from a different URL. Approaches that directly target the URLs serving ad-related resources are therefore likely to become stale very quickly. An effective solution to the problem would instead target the "root cause" of the unwanted resources being included in the page, in this case the script, which determines what image URLs to load. Approaches that attempt to only build lists of URLs of ad-related images are therefore unlikely to be useful solutions to the problem in the long term (as also seen in the screenshot in Figure 1).
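To make the distinction concrete, the following Python sketch contrasts rules built from individual image URLs with a single rule targeting the script that loads them. The Adblock-Plus-style network rule syntax is standard, but all URLs and the helper function are hypothetical illustrations, not data from our measurements.

```python
# Contrasting image-URL rules with a "root cause" rule. All URLs below
# are hypothetical.
from urllib.parse import urlparse

# Ad image URLs observed on a page; these rotate on every ad-slot fill.
ad_image_urls = [
    "https://cdn.ads-example.com/creative/93fa1.jpg",
    "https://cdn.ads-example.com/creative/b7c22.jpg",
]

# The script that loaded both images, recovered from the request chain.
ad_library_url = "https://static.ads-example.com/lib/serve.js"

def network_rule(url: str, resource_type: str) -> str:
    """Build an Adblock-Plus-style network rule for a URL's host and path."""
    parsed = urlparse(url)
    return f"||{parsed.netloc}{parsed.path}${resource_type}"

# Fragile: one rule per image; stale as soon as the creatives rotate.
fragile_rules = [network_rule(u, "image") for u in ad_image_urls]

# Generalized: one rule for the library that loads every creative.
print(network_rule(ad_library_url, "script"))
# -> ||static.ads-example.com/lib/serve.js$script
```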
Web compatibility.
Content blocking necessarily requires modifying the execution of a page from what the site author intended, to something hopefully more closely aligned with the visitor's goals and preferences. Modifying the page's execution in this way (e.g. by changing what resources are loaded, by preventing scripts from executing, etc.) frequently causes pages to break, and users to abandon content-blocking tools. While filter lists targeting large audiences can rely on the crowd to report breaking sites to the list authors (so that they can tailor the rules accordingly), filter lists targeting smaller parts of the web often do not have enough users to maintain this positive feedback loop. An effective system for programmatically augmenting small-region filter lists must therefore take extra care to ensure that new rules will not break sites.
3 METHODOLOGY

This section presents a methodology for programmatically identifying advertising and other unwanted web resources in under-served regions. This section proceeds by describing (i) a high-level overview of our approach, (ii) a hybrid classifier used to identify image-based web advertisements, (iii) the unique browser instrumentation used in our approach, (iv) how we identify ad libraries and other "upstream" resources for blocking, (v) how we determine if a request is safe to block (i.e. would not break desirable functionality on the page), and (vi) how we generate filter list rules from the gathered ad URLs.
Our solution to improving filter lists for under-served regions consists of the combination of two unique strategies. First, we designed a system for programmatically determining whether an image is an ad, in a cross-language, highly precise way. We use this classifier to identify ad images that are missed by crowd-sourced filter list maintainers.

Second, we developed a technique for identifying additional resources that should be blocked, by considering the request chains that brought the ad into the page, and finding instances where we can block earlier in that request chain. We then apply this "blocking earlier in the chain" principle both to ads identified by existing filter lists, and to new ads identified by our classifier, to maximize the number of resources that can be safely blocked. This approach also allows us to generate generalized blocking rules that target the causes of ads being included in the page, instead of only the "symptoms": the specific, frequently-changing image URLs.

We note that this approach could be applied to any region of the web, including both popular and under-served regions. However, since popular parts of the web are already well served by crowd-sourced approaches, we expect the marginal improvement of applying this technique will be greatest for under-served regions, where there are comparatively few manual labelers.

The following subsections describe the implementation of each piece of our filter list generation pipeline. Section 4 describes the evaluation of how successful this approach was at generating new filter list rules for under-served regions.
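As a rough illustration of how the two strategies compose, the following Python sketch wires the pipeline's stages together. Every injected callable stands in for a component described in the following subsections; the names, signatures, and ABP-style rule format here are our assumptions, not the paper's actual code.

```python
# Rough sketch of how the two strategies compose into one pipeline.
from typing import Callable, Iterable, Optional

def generate_rules(
    resources: Iterable[str],                     # image/frame URLs on a page
    is_known_ad: Callable[[str], bool],           # hit in existing filter lists
    classify_as_ad: Callable[[str], bool],        # our hybrid classifier
    inclusion_chain: Callable[[str], list[str]],  # from PageGraph
    highest_safe_block: Callable[[list[str]], Optional[str]],
) -> set[str]:
    """Step 1: find ad leaves; step 2: block as high in the chain as is safe."""
    rules: set[str] = set()
    for url in resources:
        if not (is_known_ad(url) or classify_as_ad(url)):
            continue
        target = highest_safe_block(inclusion_chain(url))
        if target is not None:
            rules.add(f"||{target}")  # rule for the root-most safely blockable URL
    return rules
```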
First, our approach requires an oracle for determining whether a page element is an advertisement, without human labeling. To solve this problem, we designed and trained a unique hybrid image classifier that considers both the image's pixel data, and the page context an image request occurred in, when predicting whether a page element is an advertisement. Our classifier targets both images (i.e. img elements) and sub-documents (i.e. iframe elements).

While there is significant existing work on image-based (i.e. perceptual) web ad classification, we were not able to use existing approaches for two reasons. First, we had disappointing results when applying existing perceptual classifiers to the web at large. The existing approaches we considered did very well on the data sets they were trained on, but did a relatively poor job when applied to new, random crawls of the web. Second, we were concerned that relying on perceptual features alone would reduce the classifier's ability to generalize across languages. We expected that adding contextual features (e.g. the surrounding elements in the page, whether the image request was triggered by JavaScript or the document parser, attributes on the element displaying the image) would make the classifier generalize better.
Our approach combines both perceptual and contextual page features, each building on existing work. The perceptual features are similar to those described in the Percival [40] paper, while the contextual features are extensions of those used in the AdGraph [34] project. The probability estimated by the perceptual module is then used as an input to the contextual classifier.
Perceptual Sub-module.
The perceptual part of our classifier expands Percival's SqueezeNet-based CNN into a larger network, ResNet18 [30]. While the Percival project used a smaller network for fast online, in-browser classification, our classifier is designed for offline classification, and so faces no such constraint. We instead use the larger ResNet18 approach to increase predictive power. Otherwise, our approach is the same as that described in [40].
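A minimal sketch of how such a perceptual sub-module could be set up is shown below, assuming PyTorch/torchvision. ResNet18 is named by the paper [30], but the two-class head, input size, and preprocessing here are our assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the perceptual sub-module, assuming PyTorch/torchvision.
import torch
import torch.nn as nn
from torchvision import models, transforms

def build_perceptual_model() -> nn.Module:
    model = models.resnet18(weights=None)          # weights come from training
    model.fc = nn.Linear(model.fc.in_features, 2)  # two classes: {not-ad, ad}
    return model

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def ad_probability(model: nn.Module, image) -> float:
    """Return P(ad) for a PIL image; later fed to the contextual classifier."""
    model.eval()
    with torch.no_grad():
        logits = model(preprocess(image).unsqueeze(0))
        return torch.softmax(logits, dim=1)[0, 1].item()
```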
Contextual Sub-module.
The contextual part of our classifier does not consider the image's pixel data, but instead how the image was loaded into the web page, and the context of the page the image or sub-document would be displayed in. Examples of contextual features include whether the requested resource is served from the same domain as the requesting website, and the number of sibling DOM nodes of the img or iframe element initiating the request. These features are similar to those described in the AdGraph paper, and are detailed in Figure 4. The browser instrumentation needed to extract these features is described in detail in Section 3.3.

We built our image classifier in two steps. First, we built a purely perceptual classifier, using approaches described in existing work. Second, when we found the perceptual classifier did not generalize well when applied to a new, independent sampling of images, we moved to a hybrid approach. In this hybrid approach, the output of the perceptual classifier is just one feature among many other contextual features. We found this hybrid approach performed much better on our new, manually labeled, random crawl of the Alexa 10k. The rest of this subsection describes each stage in this process.

Initially, we built a classifier using an approach nearly identical to the perceptual approach described in [40]. We evaluated this model on a combination of data provided by the paper's authors, augmented with a small amount of additional data labeled by ourselves. This data set is referred to in Figure 3 as the "Initial Alexa 10k Set". When we applied the training method described in [40] to this data set, we received very accurate results, reported in Figure 2.

Later, while building the pipeline described in this paper, we generated a second manually labeled data set of images and frames, randomly sampled from a new crawl of the Alexa 10k. This data set is referred to in Figure 3 as the "Alexa 10k Recrawl", and was collected between 2 and 6 months later than the previous data set (the date range is due to the majority of this data set being collected by the Percival authors, six months before our work, with a smaller additional amount of data being collected by ourselves later on).
                  Initial Alexa 10k Data Set        Alexa 10k Recrawl
                  Accuracy  Precision  Recall       Accuracy  Precision  Recall
Perceptual-only   95.9%     95.5%      96.4%        77.0%     48.8%      87.4%
Hybrid            -         -          -            97.6%     92%        75%
Figure 2: Comparison of classification strategies. "Perceptual-only" refers to the approach by Percival [40] and variants (best numbers reported). "Hybrid" uses both perceptual and contextual features, and performed much better on our independent sampling of images and frames from the Alexa 10k, especially with regard to precision.
Figure 3: Comparison of the distribution of ads for images and frames collected in each data set.
Content features:
  Height & width
  Is image size a standard ad size?
  Resource URL length
  Is resource from subdomain?
  Is resource from third party?
  Presence of a semi-colon in query string?
  Resource type (image or iframe)
  Perceptual classifier ad probability

Structural features:
  Resource load time from start
  Degree of the resource node (in, out, in+out)
  Is the resource modified by script?
  Parent node degree (in, out, in+out)
  Is parent node modified by script?
  Average degree connectivity
Figure 4: Partial feature set of the contextual classifier.

When we applied the prior purely-perceptual approach to this new data set, we received greatly reduced accuracy. Most alarming, for our purposes, was the dramatically reduced precision. These numbers are also reported in Figure 2.

We concluded that perceptual features alone were insufficient to handle the breadth of advertisements found on the web, and so wanted to augment the prior perceptual approach with additional contextual features we expected to generalize better, both across languages and across time. A subset of these contextual features is presented in Figure 4; they are heavily based on the contextual ad-identification features discussed in the AdGraph [34] project.

After constructing our hybrid classifier from the combination of perceptual and contextual features, we achieved greatly increased precision, though at the expense of some recall. We used a Random Forest approach to combine the perceptual and contextual features, and after conducting a 5-fold cross-validation, achieved a mean precision of 92% and a mean recall of 75%, again summarized in Figure 2. Our hybrid classifier could not be evaluated against the initial Alexa 10k data set because that data set (1) had some labels determined programmatically, and (2) was collected without our browser instrumentation, meaning we could not extract the required features.
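The combination step described above could look roughly like the following, assuming scikit-learn. The feature names follow Figure 4, but the exact feature encoding and forest hyperparameters are assumptions on our part.

```python
# Sketch of the hybrid combination step: the perceptual P(ad) is appended
# to the contextual features, and a random forest makes the final call.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

def feature_vector(ctx: dict, perceptual_p_ad: float) -> np.ndarray:
    return np.array([
        ctx["height"], ctx["width"],
        ctx["is_standard_ad_size"],
        ctx["url_length"],
        ctx["is_third_party"],
        ctx["sibling_count"],   # degree-style structural feature
        perceptual_p_ad,        # output of the perceptual sub-module
    ], dtype=float)

def evaluate(X: np.ndarray, y: np.ndarray) -> dict:
    """5-fold cross-validation, mirroring the protocol described above."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_validate(clf, X, y, cv=5, scoring=("precision", "recall"))
```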
In this subsection we present PageGraph, a system for representing and recording web page execution as a graph. PageGraph allows us to correctly attribute every page modification and network request to its cause in the page (usually, the responsible JavaScript unit). We use this instrumentation both to extract the contextual features described in Section 3.2.2, and to accurately understand what page modifications and downstream requests each JavaScript unit is responsible for.

Our approach is similar to the AdGraph [34] project, but is more robust (i.e. corrects categories of attribution errors) and broader (i.e. covers an even greater set of page behaviors). PageGraph is implemented as a large set of patches and modifications to Blink and V8 (approximately 12K LOC). The code for PageGraph is open source and actively maintained, and can be found at [19], along with information on how other researchers can use the tool.

The remainder of this subsection provides a high-level summary of the graph-based approach used by PageGraph, and how it differs from existing work.
We use PageGraph to represent the execution of each page as a directed graph. This graph is available both at run-time, and offline (serialized as GraphML [22]) for after-the-fact analysis. PageGraph uses nodes to represent elements in a web page (e.g. DOM elements, resources requested, executing JavaScript units, child frames) and edges to represent the interactions between these elements in the page (e.g. an edge from a script to a node might depict the script modifying an attribute on the node, an edge from a DOM element node to a resource node might depict a file being fetched because of an img element's src attribute, etc.). All such page behaviors in the top-level frame, and in child local frames, are captured in the graph.

We use PageGraph's context-rich recording of page execution for several purposes in this work. First, it allows us to accurately and efficiently understand how a JavaScript unit's execution modified the page; we can easily determine which scripts made many modifications to the page, and which had only "invisible" effects, e.g. to fingerprint the user. Second, the graph allows us to determine how each element ended up in a page. For example, the graph representation makes it easy to determine whether an image was injected into the page by a script, if so by which script, how that script itself was included in the page, and so on. Being able to accurately determine which page element is responsible for the inclusion of each script, frame or image element is particularly valuable to this work, as described in the following subsections.
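Because the recording is serialized as GraphML, offline analyses like those above can be written against standard graph tooling. The Python sketch below, using networkx, walks causal edges upward from a resource node; the file name and the "edge type" attribute values are hypothetical, as the real schema is documented in the PageGraph repository [19].

```python
# Sketch of offline analysis over a serialized PageGraph recording. We
# assume only what the text states: the graph is GraphML, nodes are page
# elements/scripts/resources, and edges are interactions between them.
import networkx as nx

CAUSAL_EDGE_TYPES = {"request", "insert", "execute"}  # assumed labels

def inclusion_chain(graph: nx.MultiDiGraph, node_id: str) -> list[str]:
    """Walk causal edges upward from a resource node toward the parser,
    returning node ids from the resource back to its root cause."""
    chain = [node_id]
    current = node_id
    while True:
        causes = [u for u, _, data in graph.in_edges(current, data=True)
                  if data.get("edge type") in CAUSAL_EDGE_TYPES]
        if not causes:
            return chain
        current = causes[0]  # assumes a single recorded cause per node
        chain.append(current)

# Usage: graph = nx.read_graphml("page_recording.graphml")  # hypothetical
#        print(inclusion_chain(graph, ad_image_node_id))
```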
The most relevant related work to PageGraph is the AdGraph project, which also modifies the Blink and V8 systems in Chromium to build a graph representation of page execution. PageGraph differs from AdGraph in several significant ways.
Improved Attribution Accuracy.
PageGraph significantly improves cause attribution in the graph, i.e. correctly determining which JavaScript unit is responsible for each modification. We observed a non-trivial number of corner cases where AdGraph would attribute modifications to the wrong script unit, such as when a script was executed as a result of an element attribute (e.g. onerror="do_something()"), or when the JavaScript stack is reset through events like timer callbacks (e.g. setTimeout(do_something, 1)). PageGraph correctly handles these and a large number of similar corner cases.
Increased Attribution Breadth.
PageGraph significantly increases the set of page events tracked in the graph, beyond what AdGraph records. For example, PageGraph tracks image requests initiated because of CSS rules and prefetch instructions, records modifications made in local sub-documents, and tracks failed network requests, among many others. This additional attribution allows for a greater understanding of the context scripts execute in.
We next discuss how we generate generalized filter rules from the data gathered by the previously described image classifier and browser instrumentation. The general approach is to find URLs serving ad images and frames using the classifier, use the browser instrumentation to build the entire request chain that caused the advertisement to be included in the page (e.g. the script that fetched the script that inserted the image), and then again use the browser instrumentation to determine how far up each request chain we can block without breaking the page.

We build these request chains both for images (and frames) our classifier identifies as ads, and for resources identified by network rules in existing filter lists (i.e. EasyList, EasyPrivacy and the most up-to-date applicable regional list). The former allows us to generalize the benefits of our image classifier; the latter allows us to maximize the benefits of existing filter lists.
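Abstractly, the "how far up can we block" step reduces to scanning each chain from its root toward the ad leaf and stopping at the first node the instrumentation deems safe to remove. A minimal sketch, with illustrative URLs and an assumed safety predicate (in practice derived from the page-breakage analysis above):

```python
# Minimal sketch of "block as high in the chain as is safe". We assume the
# chain is ordered root-first (e.g. first script fetched) to leaf-last (the
# ad image). URLs and the predicate below are illustrative only.
from typing import Callable, Optional

def highest_safe_block(chain: list[str],
                       is_safe_to_block: Callable[[str], bool]) -> Optional[str]:
    for url in chain:       # scan from the root toward the ad leaf
        if is_safe_to_block(url):
            return url      # blocking here also removes everything below it
    return None             # nothing in this chain is safe to block

chain = [
    "https://site-example.lk/assets/page-init.js",     # also builds the page
    "https://static.ads-example.com/lib/serve.js",     # ad library script
    "https://cdn.ads-example.com/creative/93fa1.jpg",  # rotating ad image
]
print(highest_safe_block(chain, lambda u: "ads-example.com" in u))
# -> https://static.ads-example.com/lib/serve.js
```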
Blocking higher in the request chain has several benefits. First, and most importantly, targeting URLs higher in the request chain yields a more consistent set of URLs. While the specific images that an ad library loads will change frequently, the URL of the ad library itself will rarely change. Approaches that target the frequently changing image URLs will result in filter list rules that quickly go stale; rules that target ad library scripts (as one example) are more likely to be useful over time, and to a wider range of users. Moving higher in the request chain means we are
[Figure: example inclusion chain — HTML Parser → Script 1 → Script 2 → Ad image]