MementoMap Framework for Flexible and Adaptive Web Archive Profiling
Sawood Alam, Michele C. Weigle, Michael L. Nelson, Fernando Melo, Daniel Bicho, Daniel Gomes
Sawood Alam
Old Dominion University, Department of Computer Science, Norfolk, Virginia, [email protected]
Michele C. Weigle
Old Dominion University, Department of Computer Science, Norfolk, Virginia, [email protected]
Michael L. Nelson
Old Dominion University, Department of Computer Science, Norfolk, Virginia, [email protected]
Fernando Melo
FCT: Arquivo.pt, Lisbon, [email protected]
Daniel Bicho
FCT: Arquivo.pt, Lisbon, [email protected]
Daniel Gomes
FCT: Arquivo.pt, Lisbon, [email protected]
ABSTRACT
In this work we propose MementoMap, a flexible and adaptive framework to efficiently summarize the holdings of a web archive. We describe a simple, yet extensible, file format suitable for MementoMap. We used the complete index of Arquivo.pt, comprising 5B mementos (archived web pages/files), to understand the nature and shape of its holdings. We generated MementoMaps with varying amounts of detail from its HTML pages that have an HTTP status code of 200 OK. Additionally, we designed a single-pass, memory-efficient, and parallelization-friendly algorithm to compact a large MementoMap into a small one and an in-file binary search method for efficient lookup. We analyzed more than three years of MemGator (a Memento aggregator) logs to understand the response behavior of 14 public web archives. We evaluated MementoMaps by measuring their Accuracy using 3.3M unique URIs from MemGator logs. We found that a MementoMap of less than 1.5% Relative Cost (as compared to the comprehensive listing of all the unique original URIs) can correctly identify the presence or absence of 60% of the lookup URIs in the corresponding archive while maintaining 100% Recall (i.e., zero false negatives).
KEYWORDS
Memento, Web Archiving, Archive Profiling, MementoMap
Old Dominion University (ODU) runs MemGator [7] as a service to power many of our tools and services such as Mink [30], CarbonDate [14], WAIL [16], ICanHazMemento [34], and Memento-Damage [38]. We released MemGator [2] as an open-source tool that users can run locally to avoid generating too much traffic on a central aggregator service. Our service receives three aggregation lookup requests per minute on average. Due to this low traffic we do not yet use any prediction-based Memento routing or caching. We recently analyzed over three years of our MemGator logs and found that it has served about 5.2M requests so far. These lookups were broadcast to 14 different upstream archives for a total of 61.8M requests. Only 5.44% of these requests had a hit, while the remaining 94.56% were either a miss or an error, as shown in Table 3. If only there were a way to know a summary of the holdings of these web archives, we could have avoided many wasted upstream requests and delivered an overall better response time to clients.
MementoMap is a framework for profiling web archives and expressing their holdings in an adaptive, flexible, and easily scalable way. It is inspired by the simplicity of the widely used robots.txt and sitemap.xml formats, but serves a purpose other than search engine optimization. An example MementoMap in the format we propose is illustrated in Figure 2. MementoMap allows wildcard-based partial URI Keys to enable flexibility in how detailed or concise one wants it to be, depending on use cases, full or partial knowledge about the archive's holdings, and available resources. It can be generated either by the archives themselves or by a third party based on their external observations. We propose the "mementomap" well-known URI suffix [33] and the "mementomap" link relation for its dissemination and discovery.
We used the complete index of Arquivo.pt (the Portuguese Web Archive), spanning over 27 years, and more than three years of MemGator logs for evaluation. We found that a summarized MementoMap of less than 1.5% Relative Cost (as compared to the comprehensive listing of all the unique original URIs) can correctly identify the presence or absence of 60% of the lookup URIs in Arquivo.pt without any false negatives (i.e., 100% Recall). We have open-sourced our implementation [3] under the MIT license. This paper is an expanded version of a conference paper [11].
The Internet Archive (IA) is the first, largest, and most resourceful web archive (https://archive.org/), with over 700B mementos (timestamped archived copies of web pages and files) as of January 21, 2019 (https://twitter.com/brewster_kahle/status/1087515601717800960). However, it is also the softest target for censorship and denial of service attacks [28]. It continues to be blocked in China [25] and Russia [22] for extended periods of time and has been blocked temporarily in many other countries such as India and Jordan [1, 18]. As a result, many web archiving related tools are increasingly adding support for Memento aggregators to consolidate archived resources from more than one web archive of varying scale and to avoid a single point of failure.

The Memento framework [40] defines uniform APIs for TimeMap and TimeGate endpoints to enable cross-archive communication. A TimeMap is a list of all mementos of an original URI (or URI-R) and a TimeGate is a gateway that resolves the closest memento of a URI-R w.r.t. a given Datetime and redirects to a Memento URI (or URI-M). With out-of-the-box Memento support in major archival tools and replay systems, many web archives have adopted the protocol. To avoid the need for every tool to be configured and periodically updated to poll results from an ever-changing list of many known web archives, Memento aggregators were created to act like a single consolidated web archive to users and tools. Los Alamos National Laboratory's (LANL) Time Travel service (http://timetravel.mementoweb.org/) is one such well-known aggregator that powers many tools and services. MemGator is our open-source Memento aggregator implementation that can be used locally as a CLI tool or run as a service as a drop-in replacement for the Time Travel service.
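For illustration, an abridged TimeMap in the application/link-format serialization defined by the Memento protocol (RFC 7089) might look like the following; the archive and URIs are hypothetical examples, not taken from this paper's dataset:

    <http://example.com/>; rel="original",
    <https://archive.example.org/timemap/link/http://example.com/>; rel="self"; type="application/link-format",
    <https://archive.example.org/memento/20090101120000/http://example.com/>; rel="first memento"; datetime="Thu, 01 Jan 2009 12:00:00 GMT",
    <https://archive.example.org/memento/20180704080000/http://example.com/>; rel="last memento"; datetime="Wed, 04 Jul 2018 08:00:00 GMT"

A Memento aggregator such as MemGator merges such lists from multiple upstream archives into a single consolidated TimeMap.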
CDX (Capture inDeX) [26] is a CSV-like, text file-based index format that has traditionally been used by the IA and was one of the primary index formats supported by OpenWayback (https://github.com/iipc/openwayback). It is very rigid in nature and has a predefined list of fields that is not extensible. While working on this paper, we encountered a consequence of its limitations when we realized that the MIME-Type field was reused to record different metadata, identifying whether a record is a revisit. As a result, the actual MIME-Type of such a record cannot be known without finding another entry in the index of which the record is a revisit.
CDXJ [6] is an evolution of the classic CDX format. In this file format, the lookup key fields (URI-R and Datetime) are placed at the beginning of each line, followed by a single-line compact JSON [20] block that holds other fields, which can vary in number and be extended as needed. Both of these formats are sort-friendly, enabling binary search on the file when performing lookups. The latter format is primarily used by archival replay systems including PyWB (https://github.com/webrecorder/pywb) and our InterPlanetary Wayback (IPWB) [29].
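A hypothetical CDXJ entry might look like the following; the SURT key and Datetime come first, and the field names in the JSON block are illustrative rather than a normative list:

    com,example)/index.html 20180903132752 {"url": "http://example.com/index.html", "mime": "text/html", "status": "200", "digest": "sha1:EXAMPLEDIGEST", "filename": "EXAMPLE-20180903.warc.gz"}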
SURT (Sort-friendly URI Reordering Transform) [37] is used to canonicalize URIs and place related URIs together when sorted, which is important for efficient indexing. In a traditional URI the hostname parts are organized differently than paths. In the hostname section, the root of the Domain Name System (DNS) chain (i.e., the Top Level Domain, or TLD) comes at the end, towards the right hand side, while the registered domain name portion and subdomain sections are placed towards the left hand side. In contrast, in the path section, the root path comes first, followed by deeper nodes of the path tree towards the right side. As a consequence, if a list of three domain names example.com, foo.example.com, and example.net is sorted, the last one, with a different TLD, will sit in between the other two. As opposed to this, the SURT of "Www.Foo.Example.COM/a/b?x=y&c=d" is "com,example,foo)/a/b?c=d&x=y", which changes the domain name to lower case letters, removes the "www" subdomain, reverses the order of hostname segments, and sorts query parameters. SURTs are commonly used in archival index files and many other places where a URI is used as a lookup field, including MementoMap.
    http://(com,cnn)/*
    http://(com,cnn,cdn)/img/logos/logo.png?h=20&w=30
    http://(com,nytimes)/2018/10/*
    http://(com,nytimes)/2018/11/*
    http://(edu,odu,cs,ws-dl)/
    http://(org,arxiv)/*
    http://(org,arxiv)/pdf/*
    http://(uk,bl,*
    http://(uk,co,bbc)/news/world?lang=ar
    http://(uk,co,bbc)/news/world?lang=en
    http://(uk,gov,*)/

(a) A sample list of sorted SURTs. Different colors signify Scheme, Host, Path, and Query segments. The "https://(" prefix is common in all SURTs, hence removed in practice.

(b) A visual representation of SURTs as a tree. Different colored regions signify Scheme, Host, Path, and Query segments. Each node of the tree contains a token and each edge denotes the separator of the corresponding segment. Dotted lines indicate transition from one segment to the next. Dotted triangles with a wildcard character "*" denote a sub-tree. Trailing slashes are removed from this representation. Labels on the right hand side (i.e., S, H0–Hn, P0–Pn, and Q) denote the corresponding level/depth in each segment.

Figure 1: Illustration of SURTs with wildcards.
Figure 1(a) illustrates a sample of sorted SURTs and highlights their different segments. We have extended SURTs to support wildcards, allowing URI Keys with the same prefix to be grouped and rolled up into a single key. A visual representation of these SURTs is illustrated in Figure 1(b) in the form of a tree that segregates the layers of Scheme, Host, Path, and Query. It further annotates the various depths of the Host and Path segments as H0, H1, H2... and P0, P1, P2..., which will be useful in understanding some terminology used later in this paper. SURTs also allow credentials and port numbers, but we omitted them from the illustration for brevity. It is worth noting that the scheme portion is common in all HTTP/HTTPS URIs and has no informational value, hence the "https://(" prefix is often omitted.
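The following is a minimal, illustrative Python sketch of the SURT transformation described above; it is a simplification (lowercasing the whole URI and handling only the "www" removal, hostname reversal, and query sorting shown in the example), not the full canonicalization performed by production tools (e.g., the "surt" Python library):

    from urllib.parse import urlsplit

    def surtify(uri):
        # Simplified SURT: lowercase, drop the "www" subdomain, reverse host
        # segments, and sort query parameters (a subset of the real rules).
        parts = urlsplit(uri.lower())
        segments = [s for s in (parts.hostname or "").split(".") if s != "www"]
        surt_host = ",".join(reversed(segments))
        query = "&".join(sorted(parts.query.split("&"))) if parts.query else ""
        return surt_host + ")" + (parts.path or "/") + ("?" + query if query else "")

    print(surtify("http://Www.Foo.Example.COM/a/b?x=y&c=d"))
    # com,example,foo)/a/b?c=d&x=y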
UKVS (Unified Key Value Store) [4] is an evolving file format proposal that is a contribution of this MementoMap work. It is an evolution of the CDXJ format that we earlier proposed to be used by Archive Profiles [9]. This format extends SURT with wildcard support and improves various other aspects to simplify it and eliminate some limitations of our prior proposal (such as not being able to express blacklists or lack of support for merging two profiles generated with different profiling policies). We generalized the format to be more inclusive and flexible after we realized its utility in many web archiving related use cases (such as indexing, replay access control lists (ACLs), fixity blocks [15], and extended TimeMaps) and many other places such as extended server logging.

    !context ["https://git.io/mementomap"]
    !id {uri: "https://archive.example.org/"}
    !fields {keys: ["surt"], values: ["frequency"]}
    !meta {name: "Example Archive", year: 1996}
    !meta {type: "MementoMap"}
    !meta {updated_at: "2018-09-03T13:27:52Z"}
    * 54321/20000
    com,* 10000+
    org,arxiv)/ 100
    org,arxiv)/* 2500~/900
    org,arxiv)/pdf/* 0
    uk,co,bbc)/images/* 300+/20-

Figure 2: A sample MementoMap in UKVS format. Lines beginning with a "!" denote headers; the remaining lines are data entries. The "!fields" line describes keys and values in the data columns, in order. The "frequency" column of the data section is formatted as "[URI-M Count]/[URI-R Count]". Optional suffix characters (i.e., +, -, and ~) with numbers denote approximate values. A "0" value is a way to represent blacklists, potentially for more specific sub-trees.

Figure 2 illustrates a sample MementoMap file that starts with some metadata headers. Header lines are prefixed with "!" to ensure that they are separated from data lines and surface at the top when the file is sorted. The "!fields" header tells that the first column is a SURT and is used as a lookup key (there can be more than one key column, such as Datetime or Language), followed by a value column that holds "frequency" information. Each data line can optionally also contain a single-line JSON block, which is not illustrated here for the sake of simplicity. The frequency column is formatted as "[URI-M Count]/[URI-R Count]", where both counts are optional and the separator is also optional if only the URI-M Count is present (in this paper we only used this latter option). Additionally, these counts can have an optional suffix character +, -, or ~ to express that the numbers are not exact and represent a lower bound, an upper bound, or a rough estimate, respectively. The first data line in the example means there are a total of exactly 54,321 mementos (URI-Ms) of exactly 20,000 URI-Rs in the archive, and the next line suggests that there are at least 10,000 mementos from the ".com" TLD. The next two lines suggest that there are 100 mementos of the arxiv.org homepage and many more captures of pages with deeper paths. However, the next line illustrates the exclusion of a sub-tree by being more specific: everything under /pdf/* has zero mementos (this is also illustrated in Figure 1(b)).
Arquivo.pt [23] was founded in 2008 with the aim to preserve web content of interest to the Portuguese community, but not limited to just the .pt TLD (as shown in Table 1). It has since archived about 5B mementos, of which some data was donated to it by other archives, including IA, explaining why its temporal spread extends back before Arquivo.pt's founding date. We analyzed 1.8T of Arquivo.pt's complete CDXJ index in production. A brief summary of the dataset is shown in Table 2. We used it along with ODU's MemGator server logs to evaluate this work.

Table 1: Top Arquivo.pt TLDs by URI-R% and URI-M%.
.pt .com .eu .net .org .de .br .uk .fr .nl .mz .pl .io .edu .es .it .cv .ru .ao .us .cz .info

Table 2: Arquivo.pt index statistics.
Attributes               Values
CDXJ files               70
Total file size          1.8T
Compressed file size     262G
Temporal coverage        1992–2018
CDXJ lines               5.0B
Mementos (URI-Ms)        4.9B
Unique URI-Rs            2.0B
Unique HxPx keys         1.1B
Unique hosts             5.8M
Unique IP addresses      15K
Query routing is a rigorously researched topic in various fields including networked databases, meta-searching, and search aggregation [24, 31]. However, archive profiling and Memento lookup routing is a niche field that has not been explored by many researchers beyond a small community.

Sanderson et al. created comprehensive content-based profiles [35, 36] of various International Internet Preservation Consortium (IIPC) member archives by collecting their CDX files and extracting URI-Rs from them. This approach gave them complete knowledge of the holdings of each participating archive, hence they could route queries precisely to the archives that have any mementos for a given URI-R. This approach yielded no false positives or false negatives (i.e., 100% Accuracy) while the CDX files were fresh, but they would go stale very quickly. It is a resource and time intensive task to generate such profiles and some archives may be unwilling or unable to provide their CDX files. Such profiles are so large in size (typically, a few billion URI-R keys) that they require special infrastructure to support fast lookup. Acquiring fresh CDX files from various archives and updating these profiles regularly is not easy.
In contrast, AlSum et al. explored a minimal form of archive profiling using only the TLDs and Content-Language [12, 13]. They created profiles of 15 public archives using access logs of those archives (when available) and fulltext search queries. They found that by sending requests to only the top three archives matching the criteria for the lookup URI based on their profile, they could discover about 96% of TimeMaps. When they excluded IA from the list and performed the same experiment on the remaining archives, they were able to discover about 65% of TimeMaps using the remaining top three archives. Excluding IA was an important aspect of the evaluation, as its dominance can bias results. This exclusion experiment also showed the importance of smaller archives and the collective impact of their holdings. This minimal approach had many false positives, but no false negatives.

Bornand et al. implemented a different approach for Memento routing by building binary classifiers from LANL's Time Travel aggregator cache data [17]. They analyzed responses from various archives in the aggregator's cache over a period of time to learn about the holdings of different archives. They reported a 77% reduction in the number of requests and a 42% reduction in response time while maintaining 85% Recall. These approaches can be categorized as usage-based profiling, in which access logs or caches are used to observe what people were looking for in archives and which of those lookups had a hit or miss in the past. While usage-based profiling can be useful for Memento lookup routing, it may not give the real picture of archives' holdings, producing both false negatives and false positives.
We found that traffic from MemGator requested less than 0.003% of the archived resources in Arquivo.pt. There is a need for content-based archive profiling which can express what is present in archives, irrespective of whether or not it is being looked for.

In previous work [8, 9], we explored the middle ground where archive profiles are neither as minimal as storing just the TLD (which results in many false positives) nor as detailed as collecting every URI-R present in every archive (which goes stale very quickly and is difficult to maintain). We first defined various profiling policies, summarized CDX files according to those policies, evaluated associated costs and benefits, and prepared gold standard datasets [8, 9]. In our experiments, we correctly identified about 78% of the URIs that were or were not present in the archive with less than 1% relative cost as compared to the complete knowledge profile, and identified 94% of URIs with less than 10% relative cost without any false negatives. Based on the archive profiling framework we established, we further investigated the possibility of content-based profiling by issuing fulltext search queries (when available) and observing the returned results [10] if access to the CDX data is not possible. We were able to make routing decisions for 80% of the requests correctly while maintaining about 90% Recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile. MementoMap is a continuation of this effort to make it more flexible and portable by eliminating the need for the rigid profiling policies we defined earlier [8, 9] (which are still good for baseline evaluation purposes) and replacing them with an adaptive approach in which the level of detail is dynamically controlled with a number of parameters.
Generating a MementoMap begins by scanning CDX/CDXJ files, performing fulltext searches, filtering access logs, or any other means to identify what URIs an archive holds (or does not hold). These URIs are then converted to SURTs (if not already) and their query section is stripped off. We call these partial SURTs HxPx URI Keys (which means a URI Key that has all the host and path parts, but no query parameters). Previously, we found that removing query parameters from these SURTs reduces the file size and the number of unique URI Keys significantly without any significant loss in lookup Accuracy [9]. We then create a text file with its first column containing HxPx Keys and the second column containing their respective Frequencies. The frequency column in its simplest form can be the count of each HxPx Key, but it can be made more expressive, as illustrated in the data section of Figure 2. Finally, necessary metadata is added and the file is sorted to form the baseline MementoMap.
In order to make a less detailed MementoMap (which is desirable for efficient dissemination and long-lasting freshness at the cost of increased false positives), we pass a detailed MementoMap through a compaction procedure, which yields a summarized output that contains fewer lookup keys by rolling sub-trees with many children nodes up and replacing them with corresponding wildcard keys. Our compaction algorithm is illustrated with pseudo-code in Figure 3. As opposed to in-memory tree building (which would not scale), it is a single-pass procedure with minimal memory requirements and does not need any special hardware to process a MementoMap of any size. We leverage the fact that the input MementoMap is sorted, hence we can easily detect at what depth of the host or path segments a branch differs from the previous line. We keep track of the most recent state of host and path keys at each depth (up to MAXHOSTDEPTH and MAXPATHDEPTH), their corresponding cumulative frequencies, how many children nodes each of them has seen so far, and the byte position in the output file when these keys were seen for the first time. Each time we encounter a new branch at any depth, we check whether a roll up action is applicable at that depth or further down in the existing tree, based on the most recent states and the compaction parameters supplied. If so, we move the write pointer in the output file back to the position where the corresponding key was observed first, then we reset the state of all the deeper depths and update them with the current state. As a consequence of this progressive processing, the trailing part of the output file is overwritten many times. The input file does not have to be the baseline MementoMap; any MementoMap can be supplied as input with fresh compaction parameters to attempt to compact it further. Our algorithm is parallel processing-friendly if the input data is partitioned strategically (e.g., processing each TLD's records on separate machines and combining all compacted output files). It is worth noting that the sub-trees of the path section are neither independent trees nor have a single root node (as shown in Figure 1(b)); as a result, certain implementation details can be more complex than a simple tree pruning algorithm.

    func host_keys(surt)
        s = surt.split(")")[0].split(",", MAXHOSTDEPTH)
        return [s[:i].join(",") for i in 1..len(s)]

    func path_keys(surt)
        s = surt.split("?")[0].split("/", MAXPATHDEPTH)
        return [s[:i].join("/") for i in 1..len(s)]

    func compact(imap, omap, opts)
        htrail = [None] * MAXHOSTDEPTH
        ptrail = [None] * MAXPATHDEPTH
        for line in imap
            key, freq, *_ = line.split()
            k = host_keys(key)
            for i in range(len(k))
                if htrail[i] == k[i]
                    htrail[i][1] += freq
                else
                    for j in range(i, MAXHOSTDEPTH)
                        if rollup_threshold_reached
                            omap.seek(htrail[j][3])
                            omap.write(htrail[j][:1].join(",* "))
                            reset_remaining_trail(ptrail, 0)
                    reset_remaining_trail(htrail, i)
                if !htrail[i]
                    htrail[i] = [k[i], freq, 0, omap.tell()]
                    htrail[i-1][2]++
            omap.write(line)
        omap.truncate()

    func lookup_keys(uri)
        key = surtify(uri).split("?")[0].strip("/")
        keys = [key]
        while "," in key
            keys.append(sub("(.+[,/]).+$", "\1*", key))
            key = sub("(.+)[,/].+$", "\1", key)
        return keys

    func bin_search(mmap, key)
        surtk, freq, *_ = mmap.readline().split()
        if key == surtk
            return [surtk, freq]
        left = 0
        mmap.seek(0, 2)
        right = mmap.tell()
        while (right - left > 1)
            mid = (right + left) / 2
            mmap.seek(mid)
            mmap.readline()
            surtk, freq, *_ = mmap.readline().split()
            if key == surtk
                return [surtk, freq]
            elif key > surtk
                left = mid
            else
                right = mid

    func lookup(mmap, uri)
        for key in lookup_keys(uri)
            result = bin_search(mmap, key)
            if result
                return [key, result]

Figure 3: MementoMap Compaction and Lookup procedures. These pseudo-code illustrations are not in any specific language; the actual implementation is more elaborate.
The algorithm for lookup in a MementoMap is also illustrated in Figure 3. Given a URI, we first generate all possible lookup keys, in which all keys but the longest one have a wildcard suffix (e.g., "Www.Example.COM/a/b?x=y&c=d" yields "com,example)/a/b", "com,example)/a/b/*", "com,example)/a/*", "com,example)/*", and "com,*" as lookup keys). We then perform a binary search in the MementoMap with lookup keys in decreasing specificity until we find a match or all the keys are exhausted. In case of a match, we return the matched lookup key and the corresponding frequency results.

Table 3: MemGator log responses from various archives. Data ranges from 2015-10-25 to 2019-01-16.
Archive              Request    Hit%   Miss%  Err%   Sleep
Internet Archive     4,723,880  35.76  63.68   0.56  1,594
Archive-It           5,011,385   9.14  90.38   0.48  1,556
Archive Today        5,151,720   8.44  88.96   2.60  1,920
Library of Congress  4,862,458   4.77  94.31   0.92  2,705
Arquivo.pt           4,300,221   3.35  96.29   0.36  1,153
Icelandic            5,126,706   2.22  97.14   0.64  3,143
Stanford             5,178,835   1.54  98.02   0.43  1,482
UK Web Archive       5,113,984   1.49  86.30  12.20  2,779
Perma                4,116,099   1.32  98.67   0.01     46
PRONI                5,165,805   0.75  98.72   0.54  1,608
UK Parliament        5,181,991   0.63  98.85   0.52  1,542
NRS                  2,683,311   0.21  99.77   0.01     46
UK National          5,178,184   0.10  99.45   0.45  1,457
PastPages               22,058   0.00  62.90  37.10      0
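For concreteness, the following Python sketch (an assumed simplification of the lookup_keys routine in Figure 3) generates lookup keys in decreasing specificity for an already SURTified key:

    import re

    def lookup_keys(surt_key):
        # Strip the query and trailing slash, then peel off one host or path
        # segment at a time, emitting a wildcard key at each step.
        key = surt_key.split("?", 1)[0].rstrip("/")
        keys = [key, key + "/*"]  # exact key and a wildcard for its sub-tree
        while "," in key:
            keys.append(re.sub(r"(.+[,/]).+$", r"\1*", key))
            key = re.sub(r"(.+)[,/].+$", r"\1", key)
        return keys

    print(lookup_keys("com,example)/a/b?x=y&c=d"))
    # ['com,example)/a/b', 'com,example)/a/b/*', 'com,example)/a/*',
    #  'com,example)/*', 'com,*']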
For dissemination and discovery of MementoMaps we propose that web archives make their MementoMap available at the well-known URI [33] "/.well-known/mementomap" under their domain names. Alternatively, a custom URI can be advertised using the "mementomap" link relation (or "rel") in an HTTP Link header or HTML <link> element. Third parties hosting MementoMaps of other archives can use the "anchor" attribute of the Link header to advertise a different context. Moreover, MementoMaps are self-descriptive, as they contain sufficient metadata in their headers to establish a relationship with their corresponding archives. MementoMaps support pagination, which can be discovered after retrieving the primary MementoMap from the well-known URI or by any other means.
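For example, an archive could advertise its MementoMap in an HTTP response with a Link header such as the following (the target URIs are hypothetical; only the "mementomap" relation itself is proposed in this paper):

    Link: <https://archive.example.org/.well-known/mementomap>; rel="mementomap"
    Link: <https://profiles.example.net/arquivo.mementomap>; rel="mementomap"; anchor="https://arquivo.pt/"

The second form illustrates how a third party could use the anchor attribute to indicate that the advertised MementoMap describes a different archive.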
For evaluation we used the complete index of Arquivo.pt, the complete logs of our MemGator service, and the generated MementoMaps. We first examine the logs, then describe the holdings of Arquivo.pt in detail, and finally measure the effectiveness of various MementoMaps.

We analyzed over three years of our MemGator logs containing records about 14 different web archives. In its lifetime it has served a total of 5,241,771 requests for 3,282,155 unique URIs. Table 3 shows the summary of our log analysis, in which IA has over a 35% hit rate, and every other archive is below 10% (down to zero) in decreasing order of hit rate. Arquivo.pt shows a 3.35% hit rate, so we cross-checked it against the full index and found that only 1.64% of the unique URIs from the MemGator logs are present in Arquivo.pt (note that the CDX data even includes recent mementos that would have generated a miss prior to them being archived). The difference in these numbers is perhaps a result of some archived URIs being looked for more frequently.
Figure 4: Overlap between archived and accessed resources in Arquivo.pt. Ones denote single digit non-zero numbers (i.e., 1–9), Tens denote two digit numbers (i.e., 10–99), and so on. The Zero column shows the number of mementos of various URI-Rs that are never accessed using MemGator. The Zero row shows the number of access requests for various URI-Rs using MemGator that are not archived. The (Zero, Zero) cell denotes N/A because the number of resources that are neither archived nor accessed is unknown.
This low percentage of overlap between access logs and archive indexes conforms to our earlier findings [9]. The table shows an overall 93% miss rate, which is all wasted traffic and delayed response time. Identifying sources of such a large miss rate can save resources and time significantly, which is the primary motivation of this work.

There are some other notable entries in Table 3, such as the low number of requests to PastPages, which was excluded from being polled in the early days due to its zero hit rate and high error rate. NRS (National Records of Scotland) is a new addition to the list, hence it shows a low number of requests. The high error rate of the UK Web Archive was primarily caused by a bug in the Go language (used to develop MemGator) that was not cleaning up idle TCP connections that were already closed by the application. As a result, UKWA's firewall was seeing an ever increasing number of open, but idle, connections, hence dropping packets after a hard limit of 20 concurrent connections per host. This has since been fixed after the release of Go language version 1.7. We have later introduced an automatic dormant feature that puts an upstream archive to sleep for a configurable amount of time after a set number of successive errors.

Figure 4 shows a breakdown of what people are looking for in archives and what web archives hold. The 1.1K entry in the "Ones" row and "Tens" column shows that there are over a thousand URI-Rs that were requested 10–99 times in MemGator and each has 1–9 mementos in Arquivo.pt. Large numbers in the "Zero" column show there are a lot of mementos that are never requested from MemGator. Similarly, the "Zero" row shows there are a lot of requests that have zero mementos in Arquivo.pt. Another way to look at it is that a content-based archive profile will not know about the "Zero" row and a usage-based profile will miss out on the content in the "Zero" column. Active archives may want to profile their access logs periodically to identify potential seed URIs of frequently requested missing resources that are within the scope of the archive. Ideally, we would like more activity along the diagonal that passes from the (Zero, Zero) corner, except the corner itself, which suggests there is an undetermined number of URI-Rs that were never archived or accessed.
Table 4: URI-M vs. URI-R summary of Arquivo.pt.
Attributes                           Values
Unique URI-Rs                        1,999,790,376
Total number of mementos             4,923,080,506
Maximum mementos for any URI-R       2,308,634
Median (and Minimum)                 1
Mean mementos per URI-R (γ)          2.46
Standard Deviation                   57.20
Gini Coefficient                     0.42
Pareto Break Point                   70/30

Table 5: Most archived URI-Rs in Arquivo.pt. Most of these resources are either single pixel blank images or corner graphics used for styling in the pre-CSS3 era.
URIs                                                    URI-Ms
com,wunderground,icons)/graphics/blank.gif              2,308,634
com,wunderground,icons)/graphics/wuicorner.gif            768,250
pt,ipleiria,inscricoes)/logon.aspx                         238,292
com,wunderground,icons)/graphics/wuicorner2.gif            207,448
com,lygo)/ly/i/inv/dot_clear.gif                           115,221
com,listbot)/subscribe_button.gif                          108,530
com,wunderground,icons)/* (including top URI-R)          3,336,086
com,wunderground,* (41 sub-domains)                      3,392,676
Table 4 and Figure 5 summarize the distribution of URI-Ms over URI-Rs in Arquivo.pt. Almost 2B unique URI-Rs in Arquivo.pt have an average of 2.46 mementos per URI-R (the γ value [9]), but this distribution is not uniform. The top 30% of URI-Rs account for 70% of the mementos, for a Gini Coefficient of 0.42 [41]. Additionally, the Median is one, which means at least half of the URI-Rs have only one memento. Furthermore, the most frequently archived URI-R has 2.3M mementos (i.e., 0.05% of the total), so we decided to investigate it further. Table 5 lists the six most archived URI-Rs, and they are mostly one pixel clear images and corner graphics primarily used in web design in the pre-CSS3 era. The only HTML page that shows up in the top list is a login page. We further investigated all the mementos from all the subdomains of the top URI-R's domain and found that the blank.gif image was archived out of proportion. This shows another use for archive profiling: identifying such unintentional biases due to misconfigured crawling policies or bugs in crawlers' frontier queue management.
Figure 5: Distribution of mementos over URI-Rs in Arquivo.pt. (a) Percentage of URI-Rs by popularity vs. cumulative percentage of mementos. (b) Gini coefficient of mementos over the URI-R population.

Figure 6: Cumulative growth of URI-Rs and URI-Ms in Arquivo.pt. Almost half of the mementos were captured in the last two active years alone.

Furthermore, we partitioned Arquivo.pt's index into yearly buckets for analysis, as shown in Table 6. Data prior to the year 2008 was mostly donated from other sources in the form of many small files, as Arquivo.pt was not yet established. However, when everything is put together it looks like the archiving activity took off significantly in 2007. Low numbers in the years 2017 and 2018 are due to Arquivo.pt's embargo policy. The table shows that Arquivo.pt's collection is growing at a healthy pace by mostly collecting new URI-Rs as well as revisiting on average 26% of older ones on a yearly basis. We expected γ to change gradually over time, but the years 2000 and 2018 had significantly higher values than other years. So, we looked for the possibility of increased status codes in those years as a potential source of the increase in γ (e.g., http URIs redirecting to corresponding https versions), but we did not see any correlation there. However, the data for these years seems to have come from another source and overall they are insignificant, hence the cumulative γ+ is fairly stable between 2 and 3. We noted a significant and steady growth in status codes, which crossed the 20% mark in the year 2016. Status codes for the last two years (still in the embargo period) do not sum up to 100% because a significant portion of their entries are either revisit records or screenshots that do not report status codes. In Figure 6 we plotted the cumulative growth of both URI-Ms and URI-Rs to see the shape [27] of Arquivo.pt during the active region. The archiving rate is increasing over time, as almost half of the total mementos were archived in the last two active years alone.
Table 6: Yearly distribution of URI-Rs, URI-Ms, and status codes in Arquivo.pt. The symbol γ denotes the ratio of URI-Ms vs. URI-Rs. Column names with a "+" superscript denote cumulative values as yearly data is processed incrementally. While URI-M+ represents a running total, URI-R+ does not, because some URI-Rs are already seen in previous years. Status codes for the last two years (still in the embargo period) do not add up to 100% because a significant portion of their entries are either revisit records or screenshots.
Year  URI-R  URI-R+  URI-M  URI-M+  Dup. URI-R%  γ  γ+

Table 7: Unique items with exact Host and Path depths.
Depth  Host (Domains)  Host (HxPx)  Path (HxPx)

To understand the shape of the URI Keys tree in a MementoMap we first investigated the number of unique Domains and HxPx Keys that have certain host or path depths, as shown in Table 7. These numbers are relative to the size of the Arquivo.pt index, but we believe a similar trend should be seen in other archives, unless their collection is manually curated and crawled using a more or less capable tool than what is currently being used by many large web archives [32]. There were some outliers in the data that showed a host depth of up to 15 and path depths of up to 130, but those were very few in number. These numbers gave us a good starting point to decide how deep we need to analyze hosts and paths for profiling. Figure 7 shows the shape of the total 1,138,923,169 unique HxPx Keys of Arquivo.pt's current index put together in the form of a tree, and how the total URI Key space changes at each host and path depth. The tree is broken down into host and path segments (i.e., Figures 7(a) and 7(b)) instead of one continuous tree, and the latter is scaled down 70 times as compared with the host segment to ensure that the shape of the path segment is distinguishable from one depth to the next. In the host segment, at each host level (after H1) a significant portion leads to P0 (i.e., the root path), but the remainder has further child host segments (i.e., sub-domains). Figure 7(a) shows that hostnames with a depth of more than four (i.e., H5 and beyond) are significantly small in number. In the path segment, at each level a significant portion terminates, but the remainder branches out into deeper path segments. The shape of the path segment in Figure 7(b) shows that the tree starts to shrink from P4 and the bulk of the tree is around P3. Any effort to reduce the URI Key space near this level can significantly reduce the Relative Cost.

Table 8 is based on the total 1,138,923,169 unique
HxPx Keys of Arquivo.pt's current index. For example, the H3 row (see Figure 1(b) for the naming convention) means there are a total of 2,158,880 unique H3 prefixes that cover a sum of 630,309,184 HxPx Keys, of which the most popular prefix alone covers 51,849,377 keys. The Mean number of keys per prefix at H3 is 291.96, with a Median of 7 and a Standard Deviation of 37,641.59. The RedQ (Reduction Coefficient) column represents a derived quantity that we defined as the amount of reduction in keys that would result if HxPx Keys longer than a given depth were stripped off at that depth and only the reduced unique prefixes were counted. It is calculated using Equation 1 at depth d, where |HxPx Keys_{≥d}| is the number of HxPx Keys with depth ≥ d and |URI Keys_d| is the number of unique partial URI Keys stripped at depth d (reported under the Sum and Count columns of Table 8, respectively).

RedQ_d = \frac{|HxPx\ Keys_{\geq d}| - |URI\ Keys_d|}{|HxPx\ Keys|}    (1)
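As a worked check of Equation 1 against the H3 row of Table 8:

    RedQ_3 = (630,309,184 - 2,158,880) / 1,138,923,169 ≈ 0.55153

which matches the RedQ value reported for H3.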
Figure 7: The shape of the HxPx Key tree of Arquivo.pt. Labels on the left denote Host and Path depths. The corresponding pairs of labels on the right denote the number of Parents and Children, respectively. Darker nodes have a higher number of Mean Children. Host and Path segments are plotted separately with different scales, while the bottom row of the Host segment corresponds to the top row of the Path segment. (a) Parents and Children at each Host depth. All the terminating host nodes at each level lead to the root path (i.e., P0) shown at the bottom. (b) Parents and Children at each Path depth. The root path (i.e., P0) shown at the top is scaled 70 times down as compared with the bottom row of the Host segment tree.
Table 8: Host and Path depth statistics of unique HxPx Keys in Arquivo.pt. Sorted HxPx Keys no shorter than a given depth are chopped at that depth; the number of occurrences of these keys is Count, their total is Sum, and various other statistical measures are reported based on these numbers. The RedQ value is calculated using Equation 1, Parents is the number of non-terminal nodes of the previous depth, Children is the number of unique nodes at a given depth, and MeanChld is the average number of Children per Parent.

Depth  Count        Sum            Max          Mean          Med.  StdDev         RedQ     Parents     Children     MeanChld
H1     973          1,138,923,169  616,372,626  1,170,527.41  930   21,620,107.00  1.00000  1           973          973.00
H2     2,068,333    1,138,916,690  109,176,956  550.64        5     91,308.66      0.99818  904         2,068,333    2,287.98
H3     2,158,880    630,309,184    51,849,377   291.96        7     37,641.59      0.55153  253,091     2,158,880    8.53
H4     1,329,137    201,308,887    3,765,122    151.46        10    4,797.10       0.17559  148,589     1,329,137    8.95
H5     245,881      39,396,636     376,969      160.23        5     3,420.96       0.03438  31,635      245,881      7.77
H6     103,579      17,571,552     105,591      169.64        27    1,106.03       0.01534  16,496      103,579      6.28
H7     34,380       9,636,427      19,572       280.29        20    450.16         0.00843  10,061      34,380       3.42
H8     69,829       6,383,484      535          91.42         120   45.75          0.00554  15,359      69,829       4.55
H9     55,811       2,660,591      80           47.67         56    19.6           0.00229  55,811      55,811       1.00
H10+   10           62             19           6.20          2     6.51           0.00000  10          10           1.00
P0     5,841,503    1,138,923,169  2,264,623    194.97        7     3,059.43       0.99487  5,841,503   5,841,503    1.00
P1     145,687,459  1,134,466,338  2,242,344    7.79          1     376.64         0.86817  5,828,059   145,687,459  25.00
P2     290,761,965  1,021,443,935  603,840      3.51          1     130.76         0.64156  40,130,355  290,761,965  7.25
P3     392,635,328  795,954,162    565,043      2.03          1     78.14          0.35412  79,234,027  392,635,328  4.96
P4     215,251,988  461,498,975    512,098      2.14          1     80.01          0.21621  66,059,544  215,251,988  3.26
P5     158,256,277  287,069,088    512,098      1.81          1     65.72          0.11310  48,163,114  158,256,277  3.29
P6     91,334,214   159,584,909    50,384       1.75          1     22.3           0.05993  33,776,599  91,334,214   2.70
P7     60,099,825   91,006,216     44,114       1.51          1     17.24          0.02714  24,201,781  60,099,825   2.48
P8     31,101,768   45,186,916     24,631       1.45          1     15.54          0.01237  14,890,308  31,101,768   2.09
P9     18,601,197   23,008,116     10,247       1.24          1     9.74           0.00387  9,233,634   18,601,197   2.01
P10    6,817,122    7,455,014      5,858        1.09          1     9.36           0.00056  3,206,260   6,817,122    2.13
P11+   858,772      858,856        2            1.00          1     0.01           0.00000  222,432     392,565      1.76
Figure 8: Global and incremental Host and Path segment reduction. (a) Global HxPx reduction rate at Host. (b) Global HxPx reduction rate at Path. (c) Incremental Host children reduction. (d) Incremental Path children reduction. Global reductions describe the change in the total number of HxPx Keys (or the size of sub-trees) when keys are rolled up at a given Host or Path depth. Incremental children reductions describe the change caused by roll ups of immediate children nodes into their corresponding parent nodes at a given Host or Path depth. Nodes with larger sub-trees and children counts in the two cases, respectively, are rolled up first.
Figures 8(a) and 8(b) show the cumulative reduction as the most frequent keys are rolled up at a host and path depth, respectively. Furthermore, there are 253,091 nodes in the tree one depth above (i.e., H2) that lead to 2,158,880 nodes at the current depth. While the Mean Child count at H3 is 8.53, the distribution is not uniform. Figures 8(c) and 8(d) show the cumulative reduction in the immediate children count as the most popular parents leading to the current depth are rolled up incrementally from the bottom up. The purpose of the Reduction Coefficient is to understand the impact and importance of various host and path depths globally, while the Mean Child count gives an estimate of a more localized impact at a given depth. For this work we have used the latter as a factor to decide when to roll a sub-tree up while compacting a MementoMap. Rolling the sub-tree up at H1, H2, and P0 is not applicable for evaluation here because H1 means shrinking everything into a single record with the "*" key, H2 would require out-of-band information (because not every TLD is equally popular), and P0, being the root of the path, has nothing to roll up into (though compaction might happen in the relevant host segment independently). We fit the remaining values of the Mean Child count on Power Law [19] curves (other curve fittings are also possible) for both host and path segments to find the a and k parameters and use these empirical values for compaction decision making.
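The exact fitting procedure is not spelled out here, but a minimal sketch of one reasonable approach (an assumption on our part, not necessarily the authors' implementation) is a log-log least-squares fit of the Mean Child values against depth to obtain a and k in y = a * d^(-k):

    import math

    def fit_power_law(depths, values):
        # Fit values ~ a * depth**(-k) via least squares in log-log space.
        xs = [math.log(d) for d in depths]
        ys = [math.log(v) for v in values]
        n = len(xs)
        x_mean, y_mean = sum(xs) / n, sum(ys) / n
        slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
                sum((x - x_mean) ** 2 for x in xs)
        return math.exp(y_mean - slope * x_mean), -slope  # (a, k)

    # Mean Child values for host depths H3-H9 from Table 8 (H1, H2 excluded;
    # the depth range used here is for illustration only).
    a, k = fit_power_law([3, 4, 5, 6, 7, 8, 9],
                         [8.53, 8.95, 7.77, 6.28, 3.42, 4.55, 1.00])
    print(round(a, 2), round(k, 2))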
Web archives are messy collections that contain many malformed records, often caused by configuration issues in web servers, poorly written web applications, bugs in archiving tools, incompatible file transformations, or even security vulnerabilities [5]. Archive profiling can uncover some of these, as we found many malformed MIME-Type and Status Code entries in Arquivo.pt.

To run our experiments we decided to filter only the clean records out of these CDXJ files. We further limited our scope to only HTML pages that returned a 200 status code. Additionally, we excluded any robots.txt and sitemap.xml files that were wrongly served as "text/html". With these filters in place we reduced the number of mementos to 2,671,653,766, almost half of the total index size. There are now 962,832,513 filtered unique URI-Rs, which means the γ value increased slightly to 2.77. Also, the HxPx Keys count is reduced to 447,107,301, which is 39% of the overall number. From these keys we created the baseline MementoMap with a compressed file size of 3.4G (as shown in the first record of Table 9), which is already reduced to 1.3% of the original index size. This baseline MementoMap has a 46.4% Relative Cost (i.e., the ratio of the reduced number of unique lookup keys vs. the number of unique URI-Rs), which yields 94.6% Accuracy.
In the next step we supplied this baseline MementoMap as input for compaction with host and path compaction weights Wh and Wp, which are applied to the Mean Child value at each depth to find the cutoff number at which a sub-tree is to be rolled up. A small weight will roll sub-trees up more aggressively than a large value, resulting in a more compact MementoMap. This process produced a MementoMap with only 27,010,037 lines (i.e., 6.0% of the baseline, or 2.8% Relative Cost) after going through 4,574,305 recursive roll ups. The process took 2.4 hours to complete on our Network File System (NFS) storage. The time taken to complete the compaction process is a function of the number of lines to process from the input, the number of lines to be written out, and the number of roll ups that occur (along with the read and write speeds of the disk). Since the process is I/O intensive, using faster storage can reduce the time significantly, which we verified by repeating the experiment on TMPFS [39]. We generated 36 variations of MementoMaps with all possible pairs of Wh and Wp weights from the values 4.00, 2.00, 1.00, 0.50, 0.25, and 0.00, as shown in Table 9. To generate MementoMaps with smaller weights we used the MementoMaps of the immediately larger weight pairs as inputs. This chaining reduced the time to generate subsequent MementoMaps from hours to a few minutes and also illustrated that MementoMaps can easily be compacted further when needed.

Figure 9: Growth of the compacted MementoMap vs. lines processed from an input MementoMap. This plot illustrates a very small portion of the entire process to highlight the compaction behavior at a micro level. The size of the output MementoMap decreases each time a roll up happens. A roll up at a smaller depth often reduces the size more significantly.
Figure 10: Relative Cost vs. Lookup Routing Accuracy. A MementoMap generated/compacted with suitable Wh and Wp weights yields about 60% Accuracy at only 1.5% Relative Cost.
Figure 9 shows a portion of the roll up activity during the compaction process. The size of the output grows linearly, but on a micro-scale, whenever there is a roll up, the output size goes down, depending on the depth at which the roll up happened and how big of a sub-tree was affected.

Finally, we used MemGator logs to perform lookups in these 36 MementoMaps generated with different host and path weight pairs to see how well they perform. Figure 10 shows the Relative Cost and corresponding Lookup Routing Accuracy of these MementoMaps. The Accuracy here is defined as the ratio of URIs correctly identified for their presence or absence vs. all the lookup URIs. In this experiment, MementoMaps with two of the weight pairs yielded about 60% Routing Accuracy at about 1.5% Relative Cost without any false negatives (i.e., 100% Recall).
Table 9: MementoMap generation, compaction, and lookup statistics for Arquivo.pt. The output of one step is used as the input of the next step in a chain, as the next step has at least one smaller weight. The first record was created using some Linux commands instead of the script, which is why some of its values are reported as N/A.
Input  Wh  Wp  Lines  Size (bytes)  Gzipped (MB)  Rollups  Time (sec)  RelCost  Accuracy
Since Arquivo.pt had only a 3.35% hit rate in the past three years, MemGator could have avoided almost 60% of the wasted traffic to Arquivo.pt, without missing any good results, if Arquivo.pt were to advertise its holdings via a small MementoMap of about 111MB in size. The accuracy can be improved further by 1) exploring other optimal configurations for sub-tree pruning, 2) generating MementoMaps from the full index, not just a sample, and 3) including entries for absent resources from the "Zero" row of Figure 4.
In this work we proposed MementoMap, a flexible and adaptive framework to express the holdings of a web archive efficiently. We described a simple, yet extensible, file format suitable for MementoMap and some other use cases. We extended the traditional SURT format to support wildcards for partial URI Keys. We analyzed more than three years of MemGator logs to understand the response behavior of 14 public web archives. We used the complete index of 5B mementos in Arquivo.pt as a case study, learned some generalizable behaviors of URIs in web archives, described Arquivo.pt's holdings in different ways, and created MementoMaps of varying sizes from it for evaluation. We designed a single-pass, memory-efficient, and parallelization-friendly algorithm to compact a large MementoMap into a small one iteratively, based on user-specified parameters to accommodate different needs and available resources. We also implemented a time- and memory-efficient lookup method using binary search on MementoMap files on disk by leveraging the fact that MementoMaps are in lexicographical order. Finally, we evaluated the effectiveness of MementoMaps of varying sizes by measuring their Accuracy using 3.3M unique URIs from MemGator logs. We found that a MementoMap of less than 1.5% Relative Cost can correctly identify the presence or absence of 60% of the lookup URIs in the corresponding archive without any false negatives. We open-sourced our implementation code under a permissive license [3]. For dissemination and discovery of MementoMaps we proposed the "mementomap" well-known URI suffix and the "mementomap" link relation.

The trend shown in Figure 8 opens up many possibilities to try, such as fitting the values as Heaps' Law [21] curves and estimating the K and β parameters to then automatically identify the best roll up possibilities instead of asking a human to provide weights and supply other parameters. The MementoMap format proposed in this paper supports the ability to highlight inactive sub-trees within an active tree by being more specific, which will reduce false positives. However, generating this information will require processing access logs or other out-of-band data sources. Rolling the sub-tree up at H2 can be useful for large web archives and one way to explore this possibility is to identify globally less popular TLDs that have a significant presence in an archive. Currently, it is possible to do this manually, but not automatically. A major goal of this work is to push for adoption of MementoMap by adding out-of-the-box support in major archival replay systems. We would also like to investigate the possibility of routing non-HTML lookup requests by utilizing MementoMaps generated for HTML mementos only. The motivation comes from the assumption that page requisites are generally co-located with the parent page, hence we can leverage the information present in the Referer header of embedded resources to identify potential archives to poll from.
ACKNOWLEDGMENTS
This work is supported in part by the National Science Foundation, grant IIS-1526700.
REFERENCES
[5] Support for Various HTTP Methods on the Web. Technical Report arXiv:1405.2330.
[6] Sawood Alam, Ilya Kreymer, and Michael L. Nelson. 2015. Object Resource Stream (ORS) and CDX-JSON (CDXJ) Draft. https://github.com/oduwsdl/ORS.
[7] Sawood Alam and Michael L. Nelson. 2016. MemGator - A Portable Concurrent Memento Aggregator. In Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '16).
[8] Sawood Alam, Michael L. Nelson, Herbert Van de Sompel, Lyudmila L. Balakireva, Harihar Shankar, and David S. H. Rosenthal. 2015. Web Archive Profiling Through CDX Summarization. In Proceedings of the 19th International Conference on Theory and Practice of Digital Libraries, TPDL 2015. 3–14.
[9] Sawood Alam, Michael L. Nelson, Herbert Van de Sompel, Lyudmila L. Balakireva, Harihar Shankar, and David S. H. Rosenthal. 2016. Web Archive Profiling Through CDX Summarization. International Journal on Digital Libraries 17, 3 (2016), 223–238. https://doi.org/10.1007/s00799-016-0184-4
[10] Sawood Alam, Michael L. Nelson, Herbert Van de Sompel, and David S. H. Rosenthal. 2016. Web Archive Profiling Through Fulltext Search. In Proceedings of the 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016. 121–132.
[11] Sawood Alam, Michele C. Weigle, Michael L. Nelson, Fernando Melo, Daniel Bicho, and Daniel Gomes. 2019. MementoMap Framework for Flexible and Adaptive Web Archive Profiling. In Proceedings of the 19th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '19).
[12] Ahmed AlSum, Michele C. Weigle, Michael L. Nelson, and Herbert Van de Sompel. 2013. Profiling Web Archive Coverage for Top-Level Domain and Content Language. In Proceedings of the International Conference on Theory and Practice of Digital Libraries, TPDL 2013. 60–71.
[13] Ahmed AlSum, Michele C. Weigle, Michael L. Nelson, and Herbert Van de Sompel. 2014. Profiling Web Archive Coverage for Top-Level Domain and Content Language. International Journal on Digital Libraries 14, 3-4 (2014), 149–166.
[14] Grant Atkins. 2017. Carbon Dating the Web, version 4.0. http://ws-dl.blogspot.com/2017/09/2017-09-19-carbon-dating-web-version-40.html.
[15] Mohamed Aturban, Sawood Alam, Michael L. Nelson, and Michele C. Weigle. 2019. Archive Assisted Archival Fixity Verification Framework. In Proceedings of the 19th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '19).
[16] John A. Berlin, Mat Kelly, Michael L. Nelson, and Michele C. Weigle. 2017. WAIL: Collection-Based Personal Web Archiving. In Proceedings of the IEEE/ACM Joint Conference on Digital Libraries (JCDL). 340–341. https://doi.org/10.1109/JCDL.2017.7991619
[17] Nicolas Bornand, Lyudmila Balakireva, and Herbert Van de Sompel. 2016. Routing Memento Requests Using Binary Classifiers. In Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '16). ACM, 63–72.
[18] Chris Butler. 2017. Statement and Questions Regarding an Indian Court's Order to Block archive.org. https://blog.archive.org/2017/08/09/statement-and-questions-regarding-an-indian-courts-order-to-block-archive-org/.
[19] Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman. 2009. Power-Law Distributions in Empirical Data. SIAM Review 51, 4 (2009), 661–703. https://doi.org/10.1137/070710111
[20] Douglas Crockford. 2006. The application/json Media Type for JavaScript Object Notation (JSON). RFC 4627.
[21] Leo Egghe. 2007. Untangling Herdan's Law and Heaps' Law: Mathematical and Informetric Arguments. Journal of the American Society for Information Science and Technology 58, 5 (2007), 702–709.
[22] Adam Clark Estes. 2015. Russia Is Banning the Internet Archive and Blaming It On Terrorism. https://gizmodo.com/russia-is-banning-the-internet-archive-and-blaming-it-o-1713926987.
[23] Daniel Gomes, Miguel Costa, David Cruz, João Miranda, and Simão Fontes. 2013. Creating a Billion-Scale Searchable Web Archive. In Proceedings of the 22nd International Conference on World Wide Web. 1059–1066.
[24] Luis Gravano, Chen-Chuan K. Chang, Héctor García-Molina, and Andreas Paepcke. 1997. STARTS: Stanford Proposal for Internet Meta-Searching. SIGMOD Record.
In Proceedings of iPres 2018.
[28] Brewster Kahle. 2016. Geez, Now Internet Insurance? https://blog.archive.org/2016/06/16/geez-now-internet-insurance/.
[29] Mat Kelly, Sawood Alam, Michael L. Nelson, and Michele C. Weigle. 2016. InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives. In Proceedings of the 20th International Conference on Theory and Practice of Digital Libraries. 411–416. https://doi.org/10.1007/978-3-319-43997-6_35
[30] Mat Kelly, Michael L. Nelson, and Michele C. Weigle. 2014. Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento. In Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries. 469–470.
[31] Weiyi Meng, Clement Yu, and King-Lup Liu. 2002. Building Efficient and Effective Metasearch Engines. ACM Computing Surveys (CSUR) 34, 1 (2002), 48–89.
[32] Gordon Mohr, Michael Stack, Igor Rnitovic, Dan Avery, and Michele Kimpton. 2004. Introduction to Heritrix. In Proceedings of the 4th International Web Archiving Workshop.
[33] Mark Nottingham. 2019. Well-Known Uniform Resource Identifiers (URIs), Internet RFC 8615. https://tools.ietf.org/html/rfc8615.
[34] Alexander C. Nwala. 2015. I Can Haz Memento. http://ws-dl.blogspot.com/2015/07/2015-07-22-i-can-haz-memento.html.
[35] Robert Sanderson. 2012. Global Web Archive Integration with Memento. In Proceedings of the 12th ACM/IEEE Joint Conference on Digital Libraries.
In Proceedings of the Autumn 1990 EUUG Conference. 241–248.
[40] Herbert Van de Sompel, Michael L. Nelson, and Robert Sanderson. 2013. HTTP Framework for Time-Based Access to Resource States – Memento, Internet RFC 7089. https://tools.ietf.org/html/rfc7089.
[41] Shlomo Yitzhaki. 1979. Relative Deprivation and the Gini Coefficient. The Quarterly Journal of Economics (1979), 321–324.