Impact of HTTP Cookie Violations in Web Archives
Sawood Alam, Plinio Vargas, Michele C. Weigle, Michael L. Nelson
Department of Computer Science, Old Dominion University, Norfolk, Virginia – 23529 (USA)
{salam,pvargas,mweigle,mln}@cs.odu.edu
ABSTRACT
Certain HTTP Cookies on certain sites can be a source of content bias in archival crawls. Accommodating Cookies at crawl time, but not utilizing them at replay time, may cause cookie violations, resulting in defaced composite mementos that never existed on the live web. To address these issues, we propose that crawlers store Cookies with short expiration times and that archival replay systems account for values in the Vary header along with URIs.
For a long time we have observed strange behavior in various web archives when accessing mementos [11] of Twitter pages: some of the mementos are replayed in non-English languages. This happens even when those Twitter timelines belong to English-speaking personalities, were archived using crawlers in North America, and were not explicitly requested in any specific language, as shown in Figure 1. After a thorough investigation we found that this happens because Twitter uses HTTP Cookies for content negotiation [4, 9]. We found that almost half of the more than 9,000 properly archived mementos of Barack Obama’s Twitter timeline in five different web archives were in non-English languages. Of these, almost half were in Kannada (a regional Indian language) alone, and the rest were in 45 other languages (as shown in Figure 2). While language diversity in web archives is generally a good thing, this non-uniform bias is disconcerting when a page is archived in a language that was not anticipated.

One day we were looking at a memento of a Twitter timeline that should have been in English but was primarily in Portuguese (for the reason described above). After a while, a notification appeared in Urdu, suggesting that there were 20 new tweets (as shown in Figure 3). On further inspection we found that the page also contained a sidebar block in English. Apparently, we were seeing a defaced composite memento of a page that perhaps never existed on the live web. We knew that live-leakage (also known as Zombies) [6] and temporal violations [1] can cause such malformed memento reconstruction, and we also knew their potential prevention techniques [3, 8]. However, this mixed-language Twitter timeline issue cannot be explained by either zombies or temporal violations. After a thorough investigation we found that Cookies were again the reason behind this replay issue [2, 10].
HTTP is stateless, but applications often need to maintain state information between a client and a server. This is often done with the help of Cookies [5]. Servers can send one or more Set-Cookie headers containing strings of name-value pairs along with scope (domain and path) and expiration information. Clients store them and send them back with each in-scope request using the Cookie header until they expire or are removed. Cookies are used for session management, personalization, content negotiation, tracking, and as a client-side key-value store. The last of these is less common now, after the wide adoption of LocalStorage and other similar techniques in web browsers.
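The Set-Cookie and Cookie exchange described above can be illustrated with a minimal sketch using Python's standard `http.cookies` module. The header value and the domain `example.com` are hypothetical and chosen only for illustration; a real client would also enforce the scope and expiration rules of RFC 6265 before sending anything back.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value, as a server might send it.
set_cookie = "lang=ar; Domain=example.com; Path=/; Max-Age=3600"

# Parse the header into a name-value pair plus its attributes.
jar = SimpleCookie()
jar.load(set_cookie)

morsel = jar["lang"]
print(morsel.value)       # the cookie value
print(morsel["domain"])   # scope: domain
print(morsel["path"])     # scope: path
print(morsel["max-age"])  # expiration, in seconds

# Build the Cookie request header a client would send back on
# in-scope requests (scope and expiry checks are omitted here).
cookie_header = "; ".join(f"{name}={m.value}" for name, m in jar.items())
print(cookie_header)
```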
Figure 1: Barack Obama’s Twitter Timeline is Archived in Urdu, Which Should be in English

Figure 2: Language Distribution of Mementos of Barack Obama’s Twitter Timeline in Different Web Archives

Figure 3: A Memento of a Twitter Timeline Simultaneously in Multiple Languages (Portuguese, English, and Urdu)

... [45 LINKS TRUNCATED] ...
Figure 4: 47 Alternate Language Links in Twitter

$ curl -s -c /tmp/tt.cook https://twitter.com/?lang=ar \
> | grep "
$ grep lang /tmp/tt.cook
twitter.com FALSE / FALSE 0 lang ar
$ curl -s https://twitter.com/ | grep "
$ curl -s -H "Accept-Language: ur" https://twitter.com/ \
> | grep "
$ curl -s -b /tmp/tt.cook https://twitter.com/ | grep "
Figure 5: Language Content Negotiation in Twitter Using Query Parameters, Accept-Language, and Cookies
Twitter uses a standard method of internationalization in its publicly accessible pages by including alternate links in 47 supported languages and the x-default landing language (as illustrated in Figure 4) to help search engines point users with different locales to the correct language. This technique is utilized by many other multilingual sites such as Facebook and Instagram. However, unlike other popular multilingual sites, when a language-specific URI (one that contains a lang query parameter) is accessed, Twitter sets a lang Cookie with the corresponding language (as illustrated in Figure 5). This Cookie sticks throughout the session and forces all subsequent pages to be served in that language until another language-specific URI overwrites the Cookie. This essentially means that Twitter performs language content negotiation using the Cookie header, though it does not acknowledge this in a Vary header.
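The sticky behavior of the lang Cookie can be modeled with a small sketch. This is not Twitter's implementation; the `serve` function, the `DEFAULT_LANG` fallback of English, and the plain dictionary standing in for a cookie jar are all assumptions made for illustration.

```python
from urllib.parse import urlsplit, parse_qs

DEFAULT_LANG = "en"  # assumption: language served when no lang cookie is set


def serve(uri, cookies):
    """Return the language a page is served in, updating `cookies` in place."""
    query = parse_qs(urlsplit(uri).query)
    if "lang" in query:
        # A language-specific URI sets (or overwrites) the lang cookie.
        cookies["lang"] = query["lang"][0]
    # Every response follows the cookie if one is present.
    return cookies.get("lang", DEFAULT_LANG)


session = {}  # hypothetical per-session cookie store
served = [serve(uri, session) for uri in [
    "https://twitter.com/",
    "https://twitter.com/?lang=ur",
    "https://twitter.com/",
    "https://twitter.com/?lang=kn",
    "https://twitter.com/",
]]
print(served)  # ['en', 'ur', 'ur', 'kn', 'kn']
```

The same URI is served in three different languages within one session, and nothing in the request URI alone explains the difference.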
Some websites insist that certain Cookies are present in a request before they return the desired content; otherwise they issue redirects and attempt to set those Cookies. Failure to send the desired Cookies in subsequent requests may turn such sites into crawler traps without any useful content. Web archiving crawlers such as Heritrix (https://github.com/internetarchive/heritrix3) have built-in support for cookies. However, the web surfing pattern of crawlers is generally breadth-first and comprehensive (not necessarily how humans surf the web), for which they use a frontier queue of URIs to be crawled. In the Twitter example above, when one of the language-specific alternate links is crawled, it affects all subsequent non-language-specific URIs due to the sticky lang Cookie. Kannada, being the last language in the list (in Figure 4), gets more exposure before it is overwritten by another language, resulting in the disproportionate language bias.

Popular archival replay systems such as OpenWayback (https://github.com/iipc/openwayback) and PyWB (https://github.com/webrecorder/pywb) utilize only the canonicalized URI-R and the datetime of the capture to select a memento to replay. Other request headers that might have been used for content negotiation (such as Accept-Language or Geolocation) are ignored at replay. Traditional crawlers did not execute JavaScript, so the likelihood of a custom request header being utilized during crawling was minimal, but this is changing with headless browser-based crawlers. Cookies, however, which some sites use for content negotiation (as is the case with Twitter), have been supported even in traditional crawlers. Moreover, aggregating private archives [7] containing authenticated resources without isolating them based on session Cookies has some privacy implications.

Based on our assessment, we propose that Cookies in crawlers be kept short-lived and pruned frequently to minimize the impact of sticky Cookies. Accommodating Cookies (or other headers that affect the response) at capture/crawl time, but not utilizing them at replay time, has the consequence of cookie violations, resulting in defaced composite mementos. Conversely, blindly utilizing every Cookie as a filter at replay would result in many false negatives. Unfortunately, Cookie names are opaque strings and carry no agreed-upon semantics to identify the ones that affect the payload.
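To make the crawl-time bias and the proposed mitigation concrete, the following toy simulation walks a frontier queue in order with a single sticky lang Cookie. It is a sketch under simplifying assumptions, not a model of Heritrix: the `crawl` function, the `max_cookie_age` parameter (counted in requests rather than seconds), the three-language frontier, and the English default are all hypothetical.

```python
from collections import deque


def crawl(frontier, max_cookie_age=None):
    """Crawl URIs in queue order; return {uri: language served}.

    max_cookie_age: number of requests a cookie is kept for, counting the
    request that set it. None keeps it until another URI overwrites it.
    """
    queue = deque(frontier)
    lang, age = None, 0
    served = {}
    while queue:
        uri = queue.popleft()
        # Prune the cookie once it has reached its allowed age.
        if lang is not None and max_cookie_age is not None and age >= max_cookie_age:
            lang = None
        # A language-specific URI sets a fresh lang cookie.
        if "?lang=" in uri:
            lang, age = uri.split("?lang=")[1], 0
        served[uri] = lang or "en"  # assumption: English when no cookie is set
        age += 1
    return served


# Alternate language links first (Kannada last), then ordinary pages.
frontier = ["/?lang=ar", "/?lang=ur", "/?lang=kn", "/a", "/b", "/c"]

sticky = crawl(frontier)
pruned = crawl(frontier, max_cookie_age=1)
print([sticky[u] for u in ("/a", "/b", "/c")])  # ['kn', 'kn', 'kn']
print([pruned[u] for u in ("/a", "/b", "/c")])  # ['en', 'en', 'en']
```

With a long-lived cookie, every ordinary page after the last alternate link inherits its language; with aggressive pruning, the language-specific URIs are still captured in their own languages while the remaining pages fall back to the default.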
We identified that certain Cookies on certain sites can be a source of content bias in archival crawls. To address this issue, we propose that crawlers explicitly store Cookies with a short expiration time, irrespective of the original value. We also identified that Cookie Violations at replay time have the potential to deface composite mementos and reconstruct pages from web archives that never existed on the live web. Archival replay systems need to behave like HTTP proxies or cache servers that accommodate values in the Vary header along with URIs. Not every Cookie is created equal; those that affect the content need to be identified and accounted for at replay. This is a difficult problem, which opens up the possibility of more extensive research to fully address the issue.
This work is supported in part by NSF grant IIS-1526700.
REFERENCES
[1] Scott G. Ainsworth, Michael L. Nelson, and Herbert Van de Sompel. 2015. Only One Out of Five Archived Web Pages Existed as Presented. In Proceedings of the 26th ACM Conference on Hypertext & Social Media. 257–266.
[2] Sawood Alam. 2019. Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages. https://ws-dl.blogspot.com/2019/03/2019-03-18-cookie-violations-cause.html.
[3] Sawood Alam, Mat Kelly, Michele Weigle, and Michael L. Nelson. 2017. Client-side Reconstruction of Composite Mementos Using ServiceWorker. In Proceedings of the 17th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL ’17). ACM, New York, NY, USA, 237–240. https://doi.org/10.1109/JCDL.2017.7991579
[4] Sawood Alam and Plinio Vargas. 2018. Cookies Are Why Your Archived Twitter Page Is Not in English. https://ws-dl.blogspot.com/2018/03/2018-03-21-cookies-are-why-your.html.
[5] Adam Barth. 2011. HTTP State Management Mechanism, Internet RFC 6265. https://tools.ietf.org/html/rfc6265.
[6] Justin F. Brunelle. 2012. Zombies in the Archives. https://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html.
[7] Mat Kelly, Michael L. Nelson, and Michele C. Weigle. 2018. A Framework for Aggregating Private and Public Web Archives. In Proceedings of the 18th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL ’18). 273–282. https://doi.org/10.1145/3197026.3197045
[8] Ada Lerner, Tadayoshi Kohno, and Franziska Roesner. 2017. Rewriting History: Changing the Archived Web from the Present. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 1741–1755.
[9] David S. H. Rosenthal. 2018. All Your Tweets Are Belong To Kannada. https://blog.dshr.org/2018/04/all-your-tweets-are-belong-to-kannada.html.
[10] David S. H. Rosenthal. 2019. The 47 Links Mystery. https://blog.dshr.org/2019/03/the-47-links-mystery.html.
[11] Herbert Van de Sompel, Michael L. Nelson, and Robert Sanderson. 2013. HTTP Framework for Time-Based Access to Resource States – Memento, Internet RFC 7089. https://tools.ietf.org/html/rfc7089.