User Tracking in the Post-cookie Era: How Websites Bypass GDPR Consent to Track Users
Emmanouil Papadogiannakis, Panagiotis Papadopoulos, Nicolas Kourtellis, Evangelos P. Markatos
Please cite our WWW’2021 paper, doi: 10.1145/3442381.3450056
Emmanouil Papadogiannakis
FORTH/University of Crete, Greece
Panagiotis Papadopoulos
Telefonica Research, Spain
Nicolas Kourtellis
Telefonica Research, Spain
Evangelos P. Markatos
FORTH/University of Crete, Greece
ABSTRACT
During the past few years, mostly as a result of the GDPR and the CCPA, websites have started to present users with cookie consent banners. These banners are web forms where users can state their preference and declare which cookies they would like to accept, if such an option exists. Although requesting consent before storing any identifiable information is a good start towards respecting user privacy, previous research has shown that websites do not always respect user choices. Furthermore, considering the ever-decreasing reliance of trackers on cookies and the actions browser vendors take to block or restrict third-party cookies, we anticipate a world where stateless tracking emerges, either because trackers or websites do not use cookies, or because users simply refuse to accept any. In this paper, we explore whether websites use more persistent and sophisticated forms of tracking in order to track users who said they do not want cookies. Such forms of tracking include first-party ID leaking, ID synchronization, and browser fingerprinting. Our results suggest that websites do use such modern forms of tracking even before users have had the opportunity to register their choice with respect to cookies. To add insult to injury, when users choose to raise their voice and reject all cookies, user tracking only intensifies. As a result, users’ choices play very little role with respect to tracking: we measured that more than 75% of tracking activities happened before users had the opportunity to make a selection in the cookie consent banner, or when users chose to reject all cookies.
KEYWORDS
User Tracking, GDPR, User Consent, Web fingerprinting
Over the past few years, we have seen an increasing concern about user data protection with respect to the data of European users. This was probably the result of the General Data Protection Regulation (GDPR), which was adopted in April 2016 and came into force in May 2018. The main difference of this regulation compared to previous legislation is that it includes significant fines for companies which collect users’ data without the users’ consent or some other legal basis. Such fines can reach up to 20 million euros, or up to 4% of the annual worldwide turnover of the preceding financial year, whichever is greater. As a result, several companies, and their associated websites, have started asking their visitors and users for their consent before collecting (and processing) their data.
Such consent has usually been collected via cookie banners, which ask users for consent and may give some choices as well. Indeed, users may be given the choice to accept all cookies, to accept some cookies, or even to reject all cookies. The choice is entirely up to the user, and the correct implementation of this choice is the responsibility of the website. Although this sounds completely legal and fully straightforward, deviations have been reported in the literature [1–5]. For example, some websites claim that some cookies are absolutely necessary for their operation (e.g., for the page to be delivered) or due to legitimate interest (e.g., to improve the product), and cannot be rejected by the users. Thus, users cannot really choose to reject all cookies: these necessary cookies cannot be rejected. Past research studies have noticed some discrepancies between what the users choose and what is registered by the website.
For example, the users may provide a negative response (i.e., reject all cookies), but the cookie banners may register a positive one (i.e., accept all cookies), or the cookie banners may register a positive response even before the users had the opportunity to provide any choice [4].
All these previous studies focus on cookies and compliance of cookie processing with the GDPR. In this paper, we set out to explore a slightly different question:
If a user does not provide consent, or chooses to reject all cookies, do websites use other forms of tracking to track this user? If so, what are these forms of tracking, and what is the extent of this tracking?
Considering (i) the ever-decreasing reliance of third-party trackers on non-permanent, erasable state such as cookies [6] and (ii) recent advances of browser vendors against third-party cookies [7, 8], it is apparent that the need for identifying how websites treat user consent in the case of stateless (cookie-less) tracking is more than timely and urgent. We address this need and try to fill this exact gap in our understanding, by being the first to investigate GDPR compliance across the Web in the case where websites and trackers do not use cookies, or users do not accept cookies.
Sadly, our results suggest that even when users reject all cookies, websites do use other forms of tracking to track users and process personal data, in violation of the GDPR. Such forms of tracking include first-party ID leaking, ID synchronization and browser fingerprinting. First-party ID leaking and ID synchronization are used to pass an identifier (such as a cookie) as an “argument” in an HTTP request to a website different from the website that planted this ID in the first place. (One might think that ID synchronization is a form of tracking using cookies. This is not really true: although ID synchronization does use values stored in cookies, passing such values around is done in an unorthodox manner, completely different from the way cookies are used.) In fact, according to past studies [9–11], Web entities may share IDs they have assigned to users and help third-parties re-identify users or create universal IDs. Browser fingerprinting [12, 13] uses elaborate approaches to uniquely identify a user through characteristics of her device, characteristics which can be easily found by a website. Such characteristics may include screen resolution and rendering characteristics, browser fonts, installed plugins, etc. [14–17].
Combining several of these characteristics can provide a large enough number of entropy bits to uniquely identify a user.
Although these cases of user identification are considered “personal data processing” according to the GDPR and ePrivacy [18] regulations, and must be visible to users, they often do not appear in the request forms of consent managers deployed by modern websites. In this study, we highlight exactly that: the lack of transparency and user consent when it comes to websites that deploy user identification techniques like ID synchronization and browser fingerprinting.
The contributions of this work are as follows:
• We propose a fully automated method for detecting browser fingerprinting on websites using the Chromium Profiler.
• We crawl close to one million websites and record how they track users using sophisticated forms of tracking (such as first-party ID leaking, ID synchronization and browser fingerprinting) as a function of users’ choices.
• We find that: (1) More than 75% of leaks happen despite the fact that users have chosen to reject all cookies; (2) Websites embedded with ID synchronizing third-parties force browsers to engage in several ID synchronizations (3.51 per ID, on average) even before users had a chance to accept or deny consent; (3) Popular websites are more likely to disregard users’ consent choices and engage in first-party ID leaking and ID synchronization; (4) Browsers leak more information when users choose to reject all cookies than when they choose to take no action at all; (5) Our analysis of tracking per country code reveals significant discrepancies across EU countries.
• Our methodology can be transformed into an auditing tool for regulators, stakeholders and privacy-policy makers, for verifying compliance with GDPR and users’ privacy rights.
On the Web, cookies are used to store identifying information for a given user. However, recent policies and regulations from browser vendors and government bodies [19–21] try to control the exposure of this identifying information to third-parties, and for how long it persists. These policies restrain the ad and tracking industry, which relies on re-identifying a user over long periods to serve more targeted ads. Some of the most popular techniques used by third-parties include ID synchronization (e.g., cookie synchronization [9, 13, 22, 23]) and canvas fingerprinting [24], but also font-based fingerprinting [14], WebRTC-based fingerprinting, AudioContext fingerprinting, and Battery API fingerprinting [12].
Whenever a user visits a new website, a plethora of cookies and IDs are assigned to her, allowing first or third-parties to re-identify her across the Web and build a profile based on her browsing behavior. These profiles can be later centralized in Data Management Platforms [25], sold by data brokers [26], or used by advertisers for bidding in ad auctions [27], ad-retargeting [28] and cross-device tracking [29]. For the different Web entities (e.g., publishers, analytics, data brokers, advertisers, etc.) to perform such transactions, all of the different aliases (i.e., IDs) that each entity has assigned to the same user need to be linked (i.e., synced) together. This would reveal that the user that entity A knows as userABC is the same user that entity B knows as user123.
Figure 1: Example of an ID synchronization operation. Two entities match the IDs they have assigned to the same user. (Figure labels: advertiser.com?syncID=user&publisher=website.com; (1) New Alias; (2) New Alias; (3) Sync of Aliases.)
Figure 1 illustrates an example of how this ID synchronization takes place. Assume a user browsing website1.com and website2.com, in which there are third-parties like tracker.com and advertiser.com, respectively. Consequently, these two third-parties have the chance to assign an alias to the user and re-identify them in the future.
From now on, tracker.com knows the user with the ID user123, and advertiser.com knows the same user with the ID userABC. Next, assume that the user lands on website3.com, which includes some JavaScript code from tracker.com that makes the browser issue a GET request to tracker.com (step 1), which responds with a redirect (step 2), instructing the user’s browser to issue another request, this time to its collaborator advertiser.com, using a specifically crafted URL (step 3) in which the alias it uses (i.e., user123) is piggybacked. When advertiser.com receives the above request from the user it knows as userABC, it learns that the user whom tracker.com knows as user123 and the user userABC are the same user. This allows the two entities to join the different aliases (e.g., cookies, device IDs, user IDs, etc.) a user has on the Web.
In this paper, we study two types of ID sharing: (i) first-party ID leaking, where a first-party alias (e.g., a cookie or device ID) is leaked from the visited website to different third-parties, and (ii) third-party ID synchronization, where third-parties link together the different third-party aliases they use for the same users.
Figure 2: Canvas Fingerprinting process as part of the browser fingerprinting methodology used by popular libraries. The website can extract a fingerprint of the user’s browser.
Browser Fingerprinting is a sophisticated set of techniques, which can be used to uniquely identify browser instances without storing any information on the user side (stateless). It can be used to detect malicious users that create multiple accounts in social networking services, or even stop fraudulent orders in e-commerce platforms. However, this technique can be abused by privacy-violating websites to track users across sites, or even de-anonymize private sessions.
In fact, previous work [24, 30] has shown that this technique provides sufficient bits of entropy to effectively track users, even through the usage of the Tor Browser.
One of the most prevalent and stealthy fingerprinting techniques is Canvas Fingerprinting, named after the HTML canvas element, which was introduced in HTML5. A canvas element provides the required functionality for drawing graphics using client-side code. Moreover, canvas fingerprinting relies on WebGL, a cross-platform JavaScript API that enables developers to render advanced graphics using shaders. As a result, developers have access to rendering functionality that is performed on the GPU, yet exposed in an HTML context via the canvas element.
Figure 2 demonstrates the process of canvas fingerprinting as part of browser fingerprinting. Assume (i) a website that contains the fingerprinting code and (ii) a browser instance that can execute JavaScript code. As a first step, the fingerprinting script creates a canvas element using the built-in interface provided by almost all modern browsers. Next, the script renders some 2D graphics and text on the canvas. Usually, the text that is drawn is a pangram, i.e., it contains all the letters of the English alphabet, in order to increase the number of entropy bits. Different font sizes and font families result in slightly different text, which can affect the final fingerprint. As a next step, the fingerprinting script needs to extract the content of the canvas and inspect its pixel values (step 3). This is achieved using the toDataURL() method provided by the canvas object. This method returns a data URL containing the Base64 encoding of the canvas’ content.
Based on various factors, including the fonts that are installed on the user’s machine, the version of OpenGL and the browser’s rendering engine, this string can be sufficiently different per user. Then, the script combines this canvas fingerprint with other information, which can be used as an additional source of entropy (step 4). This information includes, among others, the host operating system and timezone, the screen resolution, installed plugins, the preferred language set in the browser and the number of logical processors available on the host. The output of the combination algorithm is a long string that uniquely identifies the specific browser instance (step 5). Finally, the identifier is hashed to produce a fingerprint for this specific browser (step 6), which is usually sent across the network, or even stored as a cookie.
Tracking techniques need to go unnoticed by users, to avoid raising suspicion or harming the user experience. As such, browser fingerprinting can be performed in minimal time on any browser that supports JavaScript, by using invisible HTML elements and without requiring any privileges or permission from the user. Consequently, even privacy-aware advanced users who block cookies can be tracked. Furthermore, browser fingerprinting is difficult to prevent because it relies on native functionality built into modern browsers. Users need to either disable JavaScript or use external browser extensions. These extensions usually add random noise to some built-in functions, making the fingerprint different each time the same website attempts to (re)identify a user [31–33].
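Steps 4 to 6 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of any particular fingerprinting library: the function name browser_fingerprint, the attribute names and values, the JSON serialisation and the choice of SHA-256 are all hypothetical, and the canvas string is a placeholder rather than a real toDataURL() result.

```python
import hashlib
import json


def browser_fingerprint(canvas_data_url: str, attributes: dict) -> str:
    """Combine a canvas rendering with other attributes and hash the result."""
    # Step 4: merge the canvas output with the additional entropy sources.
    components = {"canvas": canvas_data_url, **attributes}
    # Step 5: serialise deterministically, so identical inputs always
    # produce the same long identifier string.
    identifier = json.dumps(components, sort_keys=True, separators=(",", ":"))
    # Step 6: hash the identifier into a compact fingerprint
    # (SHA-256 is an assumption; the text does not name a hash function).
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


# Hypothetical attribute values, for illustration only.
attrs = {
    "os": "Linux",
    "timezone": "Europe/Athens",
    "screen": "1920x1080",
    "plugins": ["pdf-viewer"],
    "language": "en-US",
    "logical_processors": 8,
}
canvas = "data:image/png;base64,iVBORw0KGgo="  # placeholder, not a real rendering

fp_first = browser_fingerprint(canvas, attrs)
fp_again = browser_fingerprint(canvas, attrs)
# Simulate a countermeasure that perturbs the canvas output.
fp_noisy = browser_fingerprint(canvas + "A", attrs)

print(fp_first == fp_again)  # True: identical inputs, identical fingerprint
print(fp_first == fp_noisy)  # False: a perturbed canvas changes the hash
```

Identical inputs yield the same digest, whereas a one-character change to the canvas string yields a different one, which is the effect that noise-adding extensions rely on.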
To investigate the effect of the different options a user is provided with while visiting websites with a consent form, we leverage the Consent-O-matic tool [34]. Consent-O-matic is the state-of-the-art browser extension to automatically detect and handle GDPR consent forms. Whenever the extension detects a Consent Management Platform (CMP), it logs its info (e.g., vendor, encoding, IDs). Additionally, it can be configured to either accept or reject the different categories of data processing purposes. In addition to this, we develop a Puppeteer-based crawler that instruments a Chrome browser. By using Consent-O-matic, the browser can automatically perform one of the following three actions when a consent form is detected:
(1) Accept All: grant consent for all data processing purposes to all third-parties residing in the visited website.
(2) Reject All: deny consent for all data processing purposes to all third-parties residing in the visited website.
(3) No Action: avoid interacting with the form in any way.
By using our instrumented browser, we crawl (with clean state) the landing page of the top 850K sites of the Tranco list [35]. This list aggregates the ranks from the lists provided by Alexa, Umbrella, and Majestic from 29.07.2020 to 27.08.2020 (pay-level domains retained; available at https://tranco-list.eu/list/Q274/full). Whenever a CMP is detected, we crawl the given website 3 times (once for each of the different consent actions), and we store the HTML, cookie jar, HTTP requests, HTTP responses, JS function calls and CMP info for each case. It is important to note that we capture HTTP(S) requests and responses passively, via the emitted Chrome events, without mutating or intercepting them. This ensures that the behavior of the website is not affected by our crawler. An overview of our crawling methodology is illustrated in Figure 3. Overall, the crawler (located in the EU) visited 850K sites, starting on August 28th.
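The data stored per crawl (cookie jar and HTTP requests) is enough to look for the ID sharing described earlier. The following Python sketch flags requests to third-party domains whose query string carries a first-party cookie value. It is only an illustration under simplifying assumptions, not the detection method used in this study: the function names, the exact-match rule, the minimum ID length of 8 characters, the naive two-label domain extraction and the sample data are all hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def registrable_domain(host: str) -> str:
    # Naive approximation: keep the last two labels. A real tool would
    # consult the Public Suffix List instead.
    return ".".join(host.lower().split(".")[-2:])


def find_id_leaks(visited_site, first_party_cookies, request_urls, min_len=8):
    """Return (cookie_name, third_party_domain, url) for every request to a
    third party whose query string contains a first-party cookie value."""
    site = registrable_domain(visited_site)
    # Ignore short values, which are unlikely to be identifiers and
    # would match by chance.
    ids = {v: k for k, v in first_party_cookies.items() if len(v) >= min_len}
    leaks = []
    for url in request_urls:
        parts = urlsplit(url)
        dest = registrable_domain(parts.hostname or "")
        if dest == site:
            continue  # first-party request, not a leak
        for _, value in parse_qsl(parts.query):
            if value in ids:
                leaks.append((ids[value], dest, url))
    return leaks


# Hypothetical crawl data for one visit.
cookies = {"uid": "user123abcdef", "lang": "en"}
requests = [
    "https://website.com/static/app.js?v=user123abcdef",
    "https://sync.advertiser.com/match?syncID=user123abcdef&publisher=website.com",
    "https://cdn.tracker.com/pixel.gif?lang=en",
]

leaks = find_id_leaks("www.website.com", cookies, requests)
for name, domain, url in leaks:
    print(f"cookie '{name}' leaked to {domain}: {url}")
```

Running the same function over the requests captured under each of the three consent actions would allow the resulting leak counts to be compared per action.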