Impact of HTTP Cookie Violations in Web Archives

Abstract

Certain HTTP Cookies on certain sites can be a source of content bias in archival crawls. Accommodating Cookies at crawl time, but not utilizing them at replay time may cause cookie violations, resulting in defaced composite mementos that never existed on the live web. To address these issues, we propose that crawlers store Cookies with short expiration time and archival replay systems account for values in the Vary header along with URIs.


Sawood Alam, Plinio Vargas, Michele C. Weigle, and Michael L. Nelson

Department of Computer Science, Old Dominion University, Norfolk, Virginia – 23529 (USA)
{salam,pvargas,mweigle,mln}@cs.odu.edu


For a long time we have observed a strange behavior in various web archives when accessing mementos [11] of Twitter pages: some of the mementos are replayed in non-English languages. This happens even when those Twitter timelines belong to English-speaking personalities, were archived using crawlers in North America, and were not explicitly requested in any specific language, as shown in Figure 1. After a thorough investigation we found that this happens because Twitter uses HTTP Cookies for content negotiation [4, 9]. We found that almost half of the more than 9,000 properly archived mementos of Barack Obama’s Twitter timeline in five different web archives were in non-English languages; of these, almost half were in Kannada (a regional Indian language) alone, and the rest were spread across 45 other languages (as shown in Figure 2). While language diversity in web archives is generally a good thing, this non-uniform bias is disconcerting when a page is archived in a language that was not anticipated.

One day we were looking at a memento of a Twitter timeline that should have been in English but was primarily in Portuguese (for the reason described above). After a while we noticed that a notification appeared in Urdu, suggesting that there were 20 new tweets (as shown in Figure 3). On further inspection we found that the page contained a sidebar block in English as well. Apparently, we were seeing a defaced composite memento of a page that perhaps never existed on the live web. We knew that live-leakage (also known as Zombies) [6] and temporal violations [1] can cause such malformed memento reconstruction, and we also knew their potential prevention techniques [3, 8]. However, this mixed-language Twitter timeline issue cannot be explained by either zombies or temporal violations. After a thorough investigation we found that Cookies were again the reason behind this replay issue [2, 10].

HTTP is stateless, but applications often need to maintain state information between a client and a server. This is often done with the help of Cookies [5]. Servers can send one or more Set-Cookie headers containing strings of name-value pairs along with scope (domain and path) and expiration information. Clients store them and send them back with each in-scope request using the Cookie header until they expire or are removed. Cookies are used for session management, personalization, content negotiation, tracking, and as a client-side key-value store. The last use is less common now, after the wide adoption of LocalStorage and other similar techniques in web browsers.
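The Set-Cookie and Cookie exchange described above can be illustrated with a minimal Python sketch using the standard library's http.cookies module. The header value and the example.com domain are hypothetical and were not captured from any real site.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value (an assumption, not taken from a real response).
set_cookie = "lang=ar; Domain=example.com; Path=/; Max-Age=3600"

# Parse the header into a jar of morsels (one morsel per cookie name).
jar = SimpleCookie()
jar.load(set_cookie)

morsel = jar["lang"]
scope = (morsel["domain"], morsel["path"])  # where the cookie applies
max_age = morsel["max-age"]                 # expiration information, in seconds

# What a client would send back in the Cookie request header
# for in-scope requests while the cookie has not expired.
cookie_header = "; ".join(f"{m.key}={m.value}" for m in jar.values())
print("Cookie:", cookie_header)
```

Only the name-value pair travels back to the server; the scope and expiration attributes stay with the client and decide when the pair is sent.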

Figure 1: Barack Obama’s Twitter Timeline is Archived in Urdu, Which Should be in English

Figure 2: Language Distribution of Mementos of Barack Obama’s Twitter Timeline in Different Web Archives

Figure 3: A Memento of a Twitter Timeline Simultaneously in Multiple Languages (Portuguese, English, and Urdu)

... [45 LINKS TRUNCATED] ...

Figure 4: 47 Alternate Language Links in Twitter

$ curl -s -c /tmp/tt.cook https://twitter.com/?lang=ar \
> | grep "
$ grep lang /tmp/tt.cook
twitter.com FALSE / FALSE 0 lang ar
$ curl -s https://twitter.com/ | grep "
$ curl -s -H "Accept-Language: ur" https://twitter.com/ \
> | grep "
$ curl -s -b /tmp/tt.cook https://twitter.com/ | grep "

Figure 5: Language Content Negotiation in Twitter Using Query Parameters, Accept-Language, and Cookies

Twitter uses a standard method of internationalization in its publicly accessible pages: it includes alternate links in 47 supported languages and the x-default landing language (as illustrated in Figure 4) to help search engines point users with different locales to the correct language. This technique is utilized by many other multilingual sites such as Facebook and Instagram. However, unlike other popular multilingual sites, when a language-specific URI (one that contains a lang query parameter) is accessed, Twitter sets a lang Cookie with the corresponding language (as illustrated in Figure 5). This Cookie sticks throughout the session and forces all subsequent pages to be served in that language until another language-specific URI overwrites the Cookie. This essentially means that Twitter performs language content negotiation using the Cookie header, though it does not acknowledge this in a Vary header.
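The behavior described above can be modeled with a small sketch. This is not Twitter's implementation: the function name negotiate_language, the English default, and the precedence of the cookie over Accept-Language are all assumptions made for illustration. The text only establishes that a lang query parameter sets a sticky lang Cookie and that the Cookie then determines the language of later pages.

```python
from typing import Dict, Optional, Tuple

DEFAULT_LANG = "en"  # assumed landing language


def negotiate_language(
    query: Dict[str, str],
    cookies: Dict[str, str],
    accept_language: Optional[str] = None,
) -> Tuple[str, Dict[str, str]]:
    """Return (language served, cookies the server asks the client to set)."""
    if "lang" in query:
        # A language-specific URI selects the language and sets the sticky cookie.
        lang = query["lang"]
        return lang, {"lang": lang}
    if "lang" in cookies:
        # A stored cookie forces the language of every later page.
        return cookies["lang"], {}
    if accept_language:
        # Simplification: take the first listed tag and ignore q-values.
        first = accept_language.split(",")[0].split(";")[0].strip()
        if first:
            return first, {}
    return DEFAULT_LANG, {}


# Replaying the sequence of requests from Figure 5 against the model.
jar: Dict[str, str] = {}
lang, to_set = negotiate_language({"lang": "ar"}, jar)
jar.update(to_set)
print(lang, jar)
print(negotiate_language({}, {}))                          # no cookie, no header
print(negotiate_language({}, {}, accept_language="ur"))    # Accept-Language only
print(negotiate_language({}, jar))                         # stored cookie only
```

In this model the response to the same URI differs depending on the Cookie header, which is exactly the situation a Vary header is meant to declare.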

Some websites insist that certain Cookies be present in a request before they return the desired content; otherwise they issue redirects and attempt to set those Cookies. Failure to send the desired Cookies in subsequent requests may turn such sites into crawler traps without any useful content. Web archiving crawlers such as Heritrix (https://github.com/internetarchive/heritrix3) have built-in support for cookies. However, the web surfing pattern of crawlers is generally breadth-first and comprehensive (not necessarily how humans surf the web), for which they use a frontier queue of URIs to be crawled. In the Twitter example above, when one of the language-specific alternate links is crawled, it impacts all the subsequent non-language-specific URIs due to the sticky lang Cookie. Kannada, being the last language in the list (in Figure 4), gets more exposure before it is overwritten by another language, resulting in the disproportionate language bias.

Popular archival replay systems, such as OpenWayback (https://github.com/iipc/openwayback) and PyWB (https://github.com/webrecorder/pywb), utilize only the canonicalized URI-R and the datetime of the capture to select a memento to replay. Other request headers that might have been used for content negotiation (such as Accept-Language or Geolocation) are ignored at replay. Traditional crawlers did not execute JavaScript, so the likelihood of a custom request header being utilized during crawling was minimal, but this is changing with headless browser-based crawlers. Cookies, however, which some sites use for content negotiation (as is the case with Twitter), have been supported even in traditional crawlers. Moreover, aggregating private archives [7] containing authenticated resources without isolating them based on session Cookies has some privacy implications.

Based on our assessment, we propose that Cookies in crawlers be kept short-lived and pruned frequently to minimize the impact of sticky Cookies. Accommodating Cookies (or other headers that affect the response) at capture/crawl time, but not utilizing them at replay time, results in cookie violations and thus in defaced composite mementos. On the other hand, blindly utilizing every Cookie as a filter at replay would result in many false negatives. Unfortunately, Cookie names are opaque strings and carry no agreed-upon semantics to identify the ones that affect the payload.
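To make the effect of a sticky Cookie on a frontier queue concrete, the following toy simulation may help. It is a sketch under stated assumptions, not a model of Heritrix: the crawl function, the three-language frontier, the English default, and measuring cookie lifetime in number of requests (rather than wall-clock time) are all hypothetical simplifications.

```python
from typing import List, Optional

DEFAULT_LANG = "en"  # assumed language served when no lang cookie is sent


def crawl(frontier: List[str], cookie_ttl: Optional[int] = None) -> List[str]:
    """Return the language in which each URI of the frontier would be captured.

    URIs containing "?lang=xx" set a sticky lang cookie; all other URIs are
    language-neutral. cookie_ttl is the number of later requests for which a
    stored cookie is kept; None keeps it for the whole crawl.
    """
    lang_cookie: Optional[str] = None
    age = 0
    captured = []
    for uri in frontier:
        # Prune the cookie once it has outlived the crawler's own short TTL.
        if lang_cookie is not None and cookie_ttl is not None and age >= cookie_ttl:
            lang_cookie = None
        if "?lang=" in uri:
            lang_cookie = uri.split("?lang=", 1)[1]
            age = 0
            captured.append(lang_cookie)
        else:
            captured.append(lang_cookie or DEFAULT_LANG)
            age += 1
    return captured


# Language-specific alternate links first (Kannada last), then neutral pages.
frontier = ["/?lang=ar", "/?lang=ur", "/?lang=kn"] + [f"/page{i}" for i in range(6)]

sticky = crawl(frontier)                # cookie kept for the whole crawl
pruned = crawl(frontier, cookie_ttl=2)  # cookie dropped after two requests

print(sticky[3:])  # languages of the six neutral pages with a sticky cookie
print(pruned[3:])  # languages of the same pages with short-lived cookies
```

With the sticky cookie every neutral page is captured in the last language crawled, while the short-lived cookie limits that spillover to the first few requests.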

We identified that certain Cookies on certain sites can be a source of content bias in archival crawls. To address this issue, we propose that crawlers explicitly store Cookies with a short expiration time, irrespective of the original value. We also identified that Cookie Violations at replay time have the potential to deface composite mementos and reconstruct pages from web archives that never existed on the live web. Archival replay systems need to behave like HTTP proxies or cache servers that accommodate values in the Vary header along with URIs. Not all Cookies are created equal; those impacting the content need to be identified and accounted for at replay. This is a difficult problem that opens up the possibility of more extensive research to fully address the issue.
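One way to picture a replay system that accounts for the Vary header along with the URI is the cache-style lookup sketched below. It is an illustration only and does not reflect how OpenWayback or PyWB work: the Capture record, vary_key, and select_memento are hypothetical names, datetime-based selection and `Vary: *` are ignored, and the sketch assumes the origin server actually listed Cookie in Vary (which, as noted above, Twitter does not).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Capture:
    uri: str                         # canonicalized URI-R
    vary: str                        # Vary response header recorded at crawl time
    request_headers: Dict[str, str]  # request headers recorded at crawl time
    body: str


def vary_key(vary: str, headers: Dict[str, str]) -> Tuple[Tuple[str, str], ...]:
    """Build a secondary cache key from the request headers named in Vary."""
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    names = sorted({n.strip().lower() for n in vary.split(",") if n.strip()})
    return tuple((n, lowered.get(n, "")) for n in names)


def select_memento(
    captures: List[Capture], uri: str, headers: Dict[str, str]
) -> Optional[Capture]:
    """Return the first capture whose URI and Vary-selected headers both match."""
    for cap in captures:
        if cap.uri == uri and vary_key(cap.vary, cap.request_headers) == vary_key(
            cap.vary, headers
        ):
            return cap
    return None


captures = [
    Capture("https://example.com/", "Cookie", {"Cookie": "lang=pt"}, "pt page"),
    Capture("https://example.com/", "Cookie", {"Cookie": "lang=ur"}, "ur page"),
]

hit = select_memento(captures, "https://example.com/", {"cookie": "lang=ur"})
miss = select_memento(captures, "https://example.com/", {})
print(hit.body if hit else None)
print(miss)
```

The second lookup returns nothing because the whole Cookie header is used as the key, which shows in miniature why treating every Cookie as a filter produces false negatives and why the content-affecting ones need to be singled out.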

This work is supported in part by NSF grant IIS-1526700.

REFERENCES

[1] Scott G. Ainsworth, Michael L. Nelson, and Herbert Van de Sompel. 2015. Only One Out of Five Archived Web Pages Existed as Presented. In Proceedings of the 26th ACM Conference on Hypertext & Social Media. 257–266.
[2] Sawood Alam. 2019. Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages. https://ws-dl.blogspot.com/2019/03/2019-03-18-cookie-violations-cause.html.
[3] Sawood Alam, Mat Kelly, Michele Weigle, and Michael L. Nelson. 2017. Client-side Reconstruction of Composite Mementos Using ServiceWorker. In Proceedings of the 17th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL ’17). ACM, New York, NY, USA, 237–240. https://doi.org/10.1109/JCDL.2017.7991579
[4] Sawood Alam and Plinio Vargas. 2018. Cookies Are Why Your Archived Twitter Page Is Not in English. https://ws-dl.blogspot.com/2018/03/2018-03-21-cookies-are-why-your.html.
[5] Adam Barth. 2011. HTTP State Management Mechanism, Internet RFC 6265. https://tools.ietf.org/html/rfc6265.
[6] Justin F. Brunelle. 2012. Zombies in the Archives. https://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html.
[7] Mat Kelly, Michael L. Nelson, and Michele C. Weigle. 2018. A Framework for Aggregating Private and Public Web Archives. In Proceedings of the 18th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL ’18). 273–282. https://doi.org/10.1145/3197026.3197045
[8] Ada Lerner, Tadayoshi Kohno, and Franziska Roesner. 2017. Rewriting History: Changing the Archived Web from the Present. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 1741–1755.
[9] David S. H. Rosenthal. 2018. All Your Tweets Are Belong To Kannada. https://blog.dshr.org/2018/04/all-your-tweets-are-belong-to-kannada.html.
[10] David S. H. Rosenthal. 2019. The 47 Links Mystery. https://blog.dshr.org/2019/03/the-47-links-mystery.html.
[11] Herbert Van de Sompel, Michael L. Nelson, and Robert Sanderson. 2013. HTTP Framework for Time-Based Access to Resource States – Memento, Internet RFC 7089. https://tools.ietf.org/html/rfc7089.