Archive Assisted Archival Fixity Verification Framework
Mohamed Aturban, Sawood Alam, Michael L. Nelson, Michele C. Weigle
Old Dominion University, Norfolk, Virginia 23529, USA
{maturban, salam, mln, mweigle}@cs.odu.edu

ABSTRACT
The number of public and private web archives has increased, and we implicitly trust content delivered by these archives. Fixity is checked to ensure an archived resource has remained unaltered since the time it was captured. Some web archives do not allow users to access fixity information and, more importantly, even if fixity information is available, it is provided by the same archive from which the archived resources are requested. In this research, we propose two approaches, namely Atomic and Block, to establish and check the fixity of archived resources. In the Atomic approach, the fixity information of each archived web page is stored in a JSON file (or a manifest) and published at a well-known web location (an Archival Fixity server) before it is disseminated to several on-demand web archives. In the Block approach, we first batch together the fixity information of multiple archived pages in a single binary-searchable file (or a block) before it is published and disseminated to archives. In both approaches, the fixity information is not obtained directly from archives. Instead, we compute the fixity information (e.g., hash values) based on the playback of archived resources. One advantage of the Atomic approach is the ability to verify the fixity of archived pages even in the absence of the Archival Fixity server. The Block approach requires pushing fewer resources into archives, and it performs fixity verification faster than the Atomic approach. On average, it takes about 1.25x, 4x, and 36x longer to disseminate a manifest to perma.cc, archive.org, and webcitation.org, respectively, than to archive.is, while it takes 3.5x longer to disseminate a block to archive.org than to perma.cc. The Block approach performs fixity verification of archived pages 4.46x faster than the Atomic approach.
1 INTRODUCTION

Web archives, such as the Internet Archive (IA) and the UK Web Archive, have made great efforts to capture and archive the web to allow access to prior states of web resources. We implicitly trust the archived content delivered by such archives, but with the current trend of extended use of other public and private web archives [12, 15], we should consider the question of validity. For instance, if a web page is archived in 1999 and replayed in 2019, how do we know that it has not been tampered with during those 20 years? One potential solution is to generate a cryptographic hash value on the HTML content of an archived web page, or memento. A memento is an archived version of an original web page [47]. Figure 1 shows an example where the cURL command downloads the raw HTML code of the memento https://web.archive.org/web/20181219102034/https://2019.jcdl.org/ and then the hashing function sha256sum generates a SHA-256 hash of this downloaded code. By running these commands at different times, we should always expect to obtain the same hash.

In the context of web archiving, fixity verifies that archived resources have remained unaltered since the time they were received [11]. The final report of the PREMIS Working Group [24] defines information used for fixity as "information used to verify whether an object has been altered in an undocumented or unauthorized way." Web content tampering is a common Internet-related crime in which content is altered by malicious users and activities [29]. Part of the problem is the lack of standard techniques that users can apply to verify the fixity of web content [5, 21].
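The hashing step of Figure 1 can be sketched in Python. The raw (unrewritten) memento content is requested by inserting the id_ flag after the 14-digit capture timestamp in the URI-M; the helper names below are ours, not part of any archive's API, and the hashed bytes are a placeholder for the fetched HTML:

```python
import hashlib
import re

def raw_memento_uri(urim):
    """Rewrite a Wayback-style URI-M to request the raw (id_) content."""
    # Insert "id_" after the 14-digit capture timestamp, e.g.
    # .../web/20181219102034/... -> .../web/20181219102034id_/...
    return re.sub(r"(/web/\d{14})/", r"\1id_/", urim, count=1)

def sha256_hex(content: bytes) -> str:
    """SHA-256 hash of the memento's HTML, as in Figure 1."""
    return hashlib.sha256(content).hexdigest()

urim = "https://web.archive.org/web/20181219102034/https://2019.jcdl.org/"
print(raw_memento_uri(urim))
# In practice the raw URI would be fetched with an HTTP GET; here we
# hash a placeholder byte string for illustration.
print(sha256_hex(b"<html>...</html>"))
```

Running these steps at different times should always yield the same hash for an unaltered memento, which is the property the rest of the framework builds on.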
Jinfang Niu mentioned that none of the web archives declare the reliability of the archived content on their servers, and some archives, such as the Internet Archive, WAX (wax.lib.harvard.edu), and the Government of Canada Web Archive, have a disclaimer stating that they are not responsible for the reliability of the archived content they provide [39].

A motivating example, which shows the importance of verifying the fixity of mementos, is the story of Joy-Ann Reid, an American cable television host at MSNBC. In December 2017, she apologized for writing several "insensitive" LGBT blog posts nearly a decade earlier, when she was a morning radio talk show host in Florida [38, 45]. In April 2018, Reid, supported by her lawyers, claimed that her blog and/or the archived versions of the blog in the Internet Archive had been compromised and the content was fabricated [18]. Even though the Internet Archive denied that their archived pages had been hacked [14], a stronger case could be made if we had an independent service verifying that those archived blog posts had not changed since they were captured by the archive. In this paper, we introduce two approaches, Atomic and Block, to make archived web resources verifiable.

In the Atomic approach, the fixity information of each archived web page is stored in a single JSON file, or manifest, published on the web, and disseminated to several on-demand web archives. In the Block approach, we batch together fixity information, or records, of multiple archived pages into a single binary-searchable file, or block. The block is then published at a well-known web location before being disseminated to archives. While we make a chain of blocks, we are not attempting to create yet another Blockchain [37]. The manifests' chain of blocks is limited in scope, as we do not need to worry about consensus, eventual consistency, or proof-of-work, because these blocks are generated and published by a central authority (the Block approach is described in Section 3.3). In both approaches, the fixity information, such as hash values, is not directly provided by archives (server-side), even though some archives' APIs (e.g., the Internet Archive CDX server [25]) allow accessing such information. Instead, we calculate the fixity information based on the playback of archived resources (client-side) for two reasons. First, we do not expect hashes generated and stored in WARC files by archives at crawl time to match those generated on the playback of mementos [9]. Second, if an archive has been compromised, then it is likely that the corresponding hashes have also been compromised, so we need to have the fixity information stored in independent archives [36].

curl -s https://web.archive.org/web/20181219102034id_/https://2019.jcdl.org/ | sha256sum

Figure 1: Commands to generate a hash value of a memento.

This work introduces a basic, yet extensible, format of fixity information in the form of a structured manifest file. However, the main contribution of this paper focuses on the two suggested approaches of disseminating fixity information (or manifests) rather than the strength, applicability, extension, scope, or security of the manifest. The framework describes how manifests are published, discovered, and used to verify mementos.
The proposed framework does not require any change in the infrastructure of web archives. It is built on well-known standards, such as the Memento protocol, and works with current archives' APIs. The framework allows for the generation of manifests for selected resources instead of incurring the overhead of creating manifests for all archived resources.

We show that the size of a manifest represents about 2% of an actual memento's content, and, on average, it takes about 1.25x, 4x, and 36x longer to disseminate a manifest to perma.cc, the Internet Archive, and WebCite [20], respectively, than to archive.is, while it takes 3.5x longer to disseminate a block to archive.org than to perma.cc. The Block approach performs fixity verification of archived pages 4.46x faster than the Atomic approach. This paper is an expanded version of a conference paper [7].
2 BACKGROUND AND RELATED WORK

In order to automatically collect portions of the web, web archives employ web crawling software, such as the Internet Archive's Heritrix [43]. Given a set of seed URIs placed in a queue, Heritrix starts by fetching the web pages identified by those URIs. Each time a web page is downloaded, Heritrix writes the page to a WARC file [27], extracts any URIs from the page, places those discovered URIs in the queue, and repeats the process.

The crawling process results in a set of archived pages, or mementos. To provide access to their archived pages, many web archives use OpenWayback [26], the open-source implementation of IA's Wayback Machine, which allows users to query the archive by submitting a URI. OpenWayback replays the content of any selected archived web page in the browser. One of the main tasks of OpenWayback is to ensure that, when replaying a web page from an archive, all resources that are used to construct the page (e.g., images, style sheets, and JavaScript files) are retrieved from the archive, not from the live web. Thus, at the time of replaying the page, OpenWayback rewrites all links to those resources to point directly to the archive [46]. In addition to OpenWayback, PyWb [30] is another replay tool, which is used by Perma [49] and Webrecorder [31].

Memento [48] is an HTTP protocol extension that uses time as a dimension to access the web by relating current web resources to their prior states. The Memento protocol is supported by most public web archives, including the Internet Archive. The protocol introduces two HTTP headers for content negotiation. First, Accept-Datetime is an HTTP Request header through which a client can request a prior state of a web resource by providing the preferred datetime (e.g., Accept-Datetime: Mon, 09 Jan 2017 11:21:57 GMT). Second, the Memento-Datetime HTTP Response header is sent by a server to indicate the datetime at which the resource was captured. The Memento protocol also defines the following terminology:

- URI-R: an original resource from the live Web
- URI-M: an archived version (memento) of the original resource at a particular point in time
- URI-T: a resource (TimeMap) that provides a list of mementos (URI-Ms) for a particular original resource
- URI-G: a resource (TimeGate) that supports content negotiation based on datetime to access prior versions of an original resource

To establish trust in repositories and web archives, different publications and standards have emphasized the importance of verifying the fixity of archived resources. The Trusted Repositories Audit & Certification (TRAC) report by the Task Force on Archiving of Digital Information introduces criteria for identifying trusted digital repositories [17]. In addition to the ability to reliably provide access to, preserve, and migrate digital resources, digital repositories (which include web archives) must create preservation metadata that can be used to verify that content is not tampered with or corrupted (fixity), according to sections B2.9 and B4.4. The report recommends that preserved content be stored separately from fixity information, so it is less likely that someone is able to alter both the content and its associated fixity information [17]. Thus, generating fixity information and using it to ensure that archived resources are valid will help to establish trust in web archives. Eltgrowth [19] outlined several judicial decisions that involve evidence (i.e., archived web pages) taken from the Internet Archive. The author mentions that there is an open question of whether to consider an archived web page as a duplicate of the original web page at a particular time in the past.
This concern might prevent considering archived web pages as evidence.

Different vulnerabilities were discovered in the Internet Archive's Wayback Machine by Lerner et al. [35] and Berlin [13]: Archive-Escapes, Same-Origin Escapes, Archive-Escapes combined with Same-Origin Escapes, and Anachronism-Injection. Attackers can leverage these vulnerabilities to modify a user's view at the time when a memento is rendered in a browser. The authors suggested some defenses that could be deployed by either web archives or web publishers to prevent abuse of these vulnerabilities. Cushman and Kreymer created a shared repository in May 2017 to describe potential threats in web archives, such as control of a user's account due to Cross-Site Request Forgery (CSRF) or Cross-Site Scripting (XSS), and archived web resources reaching out to the live web [16]. The authors provide recommendations on how to avoid such threats. Rosenthal et al. [42], on the other hand, described several threats against the content of digital preservation systems (e.g., web archives). The authors indicated that designers of archives must be aware of threats such as media failure, hardware failure, software failure, communication errors, failure of network services, media and hardware obsolescence, software obsolescence, operator error, natural disaster, external attack, internal attack, economic failure, and organizational failure.

Several tools have been developed to generate trusted timestamps. For example, OriginStamp [23] allows users to generate a trusted timestamp using blockchain-based networks on any file, plain text, or hash value. The data is hashed in the user's browser, and the resulting hash is sent to OriginStamp's server, where it is added to a list of all hashes submitted by other users. Once per day, OriginStamp generates a single aggregated hash of all received hashes. This aggregated hash is converted to a Bitcoin address that becomes part of a new Bitcoin transaction. The timestamp associated with the transaction is considered a trusted timestamp. A user can verify a timestamp through OriginStamp's API or by visiting their website. Other services, such as Chainpoint (chainpoint.org) and OpenTimestamps (opentimestamps.org), are based on the same concept of using blockchain-based networks to timestamp digital documents. Even though users of these services can pass data by value, they are not allowed to submit data by reference (i.e., passing a URI of a web page). In other words, these tools do not directly timestamp web pages. The only exception is a service [22] established by OriginStamp that accepts URIs from users, but the service is no longer available on the live web.

A number of problems with blockchain-based networks are described by Rosenthal [41]. He indicates that having a large number of independent nodes in the network is what makes it secure, but this is not the case with many blockchain-based services, such as Ethereum.

There are issues related to how web archives preserve and provide access to mementos that make it difficult to generate repeatable fixity information. When serving mementos, web archives often apply some transformation to appropriately replay content in the user's browser. This includes (1) adding archive-specific code to the original content, (2) rewriting links to embedded resources (e.g., images) within an archived page so these resources are retrieved from the archive, not from the live web, and (3) serving content in different file formats, such as images (or screenshots), ZIP files, and the WARC format [1]. Furthermore, issues such as reconstructing archived web pages, caching, and dynamic/randomly generated content illustrate how difficult it is to generate repeatable fixity information. Taking into account all of these archive-related issues, it becomes a challenging problem to distinguish between legitimate changes by archives and malicious changes. In our technical report [9], we provide several recommendations on how to generate repeatable fixity information.

Figure 2: An example showing an original URI vs. a trusty URI.

Kuhn et al. [32] define a trusty URI as a URI that contains a cryptographic hash value of the content it identifies, as shown in Figure 2. With the assumption that a trusty URI, once created, is linked from other resources or stored by a third party, it becomes possible to detect if the content that the trusty URI identifies has been tampered with or manipulated in transit (e.g., to prevent man-in-the-middle attacks [34]). In their second paper [33], they introduce two different modules for creating trusty URIs on different kinds of content. In module F, the hash is calculated on the byte-level file content, while in module R, the hash is calculated on RDF graphs. Even though trusty URIs detect altered documents, there are some limitations. First, a trusty URI is created by the owner of the resource it identifies. Second, trusty URIs can be generated on only two types of content, RDF graphs and byte-level content (i.e., no modules have been introduced for HTML documents).
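The trusty URI idea reduces to a simple check: the hash embedded in the URI must match a hash recomputed over the content the URI identifies. The sketch below uses a hex SHA-256 as the final URI segment, which is a simplification of Kuhn et al.'s actual encoding (module F uses a Base64-like representation):

```python
import hashlib

def make_trusty_uri(base: str, content: bytes) -> str:
    # Append a hex SHA-256 of the content as the final URI segment.
    return base.rstrip("/") + "/" + hashlib.sha256(content).hexdigest()

def verify_trusty_uri(trusty_uri: str, content: bytes) -> bool:
    # Recompute the hash and compare it with the one embedded in the URI.
    embedded = trusty_uri.rsplit("/", 1)[-1]
    return embedded == hashlib.sha256(content).hexdigest()

doc = b"example document"
uri = make_trusty_uri("http://example.org/doc", doc)
print(verify_trusty_uri(uri, doc))          # unmodified content verifies
print(verify_trusty_uri(uri, doc + b"!"))   # tampered content fails
```

As long as a copy of the trusty URI survives somewhere the attacker does not control, any modification to the identified content is detectable.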
3 ARCHIVAL FIXITY FRAMEWORK

The process of fixity verification of mementos can broadly be described in three phases: (1) generating manifests for mementos, (2) disseminating those manifests into different web archives, and (3) at a later date, generating manifests of the current state and comparing them with their corresponding previously archived versions. We have two approaches to manifest dissemination, namely Atomic and Block (described in Sections 3.2 and 3.3, respectively).
3.1 Manifest

A manifest (identified by URI-Manif) consists of metadata summarizing the fixity information of a memento. A manifest can be generated at or after a memento's creation datetime. The proposed structure of a manifest file is illustrated in Figure 3, and it should have the following properties:

@context: Specifies the URI where the names used in the manifest file are defined.

created: The creation datetime of the manifest. It must be equal to or greater than the memento's creation datetime.

URI-R, URI-M, and Memento-Datetime: The URI of the original resource, the URI of the memento, and the datetime when the memento was created, respectively [48].

@id: The URI that identifies the published manifest file (URI-Manif).

http-headers: Selected HTTP Response headers of the memento. As proposed by Jones et al. [28], we insert the Preference-Applied header to specify the options used to retrieve the memento. For example, Original-Content refers to the raw memento, i.e., accessing the unaltered archived content, because archives by default return the memento after transforming its content.

hash-constructor: The commands that calculate hashes. The variable $uri-m is replaced with the uri-m value, and the selected headers (e.g., $Content-Type) are replaced with the corresponding values in http-headers. The hashes are generated on both the HTML of a memento and selected response headers, and they are calculated using two different hashing algorithms, MD5 and SHA-256, so even if the two functions are vulnerable to collision attacks, it becomes difficult for an attacker to make both functions collide at the same time [40].

hash: The hash values calculated based on the commands defined in hash-constructor.

"@context": "http://manifest.ws-dl.cs.odu.edu/",
"created": "Sun, 23 Dec 2018 11:43:55 GMT",
"@id": "http://manifest.ws-dl.cs.odu.edu/manifest/20181223114355/c6ad485819abbe20e37c0632843081710c95f94829f59bbe3b6ad3251d93f7d2/https://web.archive.org/web/20181219102034/https://2019.jcdl.org/",
"uri-r": "https://2019.jcdl.org/",
"uri-m": "https://web.archive.org/web/20181219102034/https://2019.jcdl.org/",
"memento-datetime": "Wed, 19 Dec 2018 10:20:34 GMT",
"http-headers": {
    "Content-Type": "text/html; charset=UTF-8",
    "X-Archive-Orig-date": "Wed, 19 Dec 2018 10:20:36 GMT",
    "X-Archive-Orig-link": "

Figure 3: A manifest showing fixity information of the memento https://web.archive.org/web/20181219102034/https://2019.jcdl.org/

3.2 The Atomic Approach

In the
Atomic approach, each memento that we are interested in verifying should have at least one corresponding manifest file containing the fixity information of the memento. Once generated, the manifest should be published on the web and disseminated to different web archives. The main concept of this approach is to store the fixity information of a memento in different archives in addition to the archive in which the memento is preserved. This practice is recommended by the TRAC report [17], where content is maintained separately from its fixity information. Disseminating manifests can be achieved through four steps:

(1) Push a web page into one or more archives. This creates one or more mementos,
URI-M.
(2) Generate a manifest by computing the fixity information of the memento.
(3) Publish the manifest at a well-known location, URI-Manif.
(4) Disseminate the published manifest to multiple archives. This generates archived manifests, URI-M-Manif.

We briefly describe the steps involved in generating, publishing, and disseminating fixity information of mementos with examples.

Figure 4: A web page is pushed into multiple archives: archive.org, archive.is, perma.cc, and webcitation.org.

Figure 4 shows the web page https://2019.jcdl.org pushed into multiple archives, resulting in four mementos. The Python module ArchiveNow can be invoked via its command-line interface or user interface to simultaneously disseminate a web page into on-demand web archives [8].

Next, as shown in Figure 5, for each memento, a manifest is generated and published on the web at the Archival Fixity server, http://manifest.ws-dl.cs.odu.edu, so that archives are able to access and capture those manifests. For example, the manifest of the memento web.archive.org/web/20181224085329/https://2019.jcdl.org/ is available at the URI-Manif

manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
Figure 5: Compute fixity information and publish it on the web.

This URI-Manif is a generic URI, which means that if the Archival Fixity server creates another manifest for the same memento (marked in red), the server will publish it using the same generic URI. For this reason, the generic URI must always redirect to the most recent manifest of a memento (i.e., the manifest that is published using a trusty URI), so requesting the manifest's generic URI

manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/

will result in a 302 Redirect to the trusty URI (Figure 6)

manifest.ws-dl.cs.odu.edu/manifest/20181224093024/8c31ccfbb3a664c9160f98be466b7c9fb9afa80580ab5052001174be59c6a73a/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
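The trusty URI-Manif above follows the pattern <server>/manifest/<timestamp>/<sha256>/<URI-M>. Under that assumption (a reading of the example URIs, not a documented API contract), a client can parse such a URI to recover the hash against which the manifest content is checked:

```python
def parse_trusty_manifest_uri(uri: str):
    """Split a trusty URI-Manif into (timestamp, sha256, uri-m).

    Assumes the layout <server>/manifest/<14-digit ts>/<hex hash>/<URI-M>,
    as in the example URIs from the Archival Fixity server.
    """
    rest = uri.split("manifest/", 1)[1]
    ts, hash_hex, urim = rest.split("/", 2)
    return ts, hash_hex, urim

ts, h, urim = parse_trusty_manifest_uri(
    "https://manifest.ws-dl.cs.odu.edu/manifest/20181224093024/"
    "8c31ccfbb3a664c9160f98be466b7c9fb9afa80580ab5052001174be59c6a73a/"
    "https://web.archive.org/web/20181224085329/https://2019.jcdl.org/")
print(ts)    # 20181224093024
print(urim)  # the embedded URI-M
```

Comparing h against a SHA-256 recomputed over the retrieved manifest body is what makes the manifest itself tamper-evident.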
Figure 7 shows an example of retrieving all mementos (the TimeMap) from the Internet Archive for the URI-Manif:

manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/

As Figure 8 shows, requesting the memento of the manifest (with the generic URI) found in the TimeMap results in a 302 Redirect to the archived manifest (with the trusty URI).

This 302 Redirect from the generic URI to the trusty URI has two advantages. First, as we described in Section 2, having a trusty URI helps validate the manifest content, as the hash included in the URI is the hash of the content it identifies. Second, and more importantly, we can use the generic URI to discover manifests in the Archival Fixity server and archived manifests in the archives. Therefore, even in cases where the Archival Fixity server is unavailable or compromised, we can still discover manifests in the archives directly (e.g., using a TimeGate or TimeMap). Figure 10 shows how the live web, the archive, and the Archival Fixity server are related in the Atomic approach.

Generally, we build trust in the content of a memento from the time when its fixity information is computed and published. One of the best scenarios is when a manifest is generated at ingest by the archive. In other words, the archive crawls a web page and, immediately after that, computes and publishes its fixity information.

The final step is to push the published manifest into multiple archives. In the example shown in Figure 9, the fixity information (or the manifest) of the memento from archive.org is disseminated to the same archive and three other archives: archive.is, perma-archives.org, and webcitation.org.

3.3 The Block Approach

As opposed to the
Atomic approach, in the
Block approach, we batch multiple manifests together in a single binary-searchable file along with some additional metadata (using the UKVS file format [3, 4]) and add a reference to the previously published latest block. Then, we generate the content-addressable identity of the block, compress it, and archive it into multiple web archives by making it available at a well-known content-addressable URI (and allow people to keep local copies anywhere). While we make a chain of blocks, we are not attempting to create yet another Blockchain [37]. The manifests' chain of blocks is limited in scope, as we do not need to worry about consensus, eventual consistency, or proof-of-work, because these blocks are generated and published by a central authority. Linking blocks in a chain using their content-addressable hashes provides tamper-proofing and enables discovery of previous blocks (starting from the latest block or anywhere in the middle of the chain). Additionally, as long as we are depending on an archived page to be available in the archive, we can count on the archived metadata about the page to be available too.
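The batching just described — sorted SURT-keyed records, a reference to the previous block, and a content-addressable name — can be sketched as follows. The SURT keys are assumed to be precomputed, and the UKVS metadata headers are reduced to a single line, both simplifications of the actual formats:

```python
import gzip
import hashlib
import json

def build_block(records, previous_block_hash):
    """Batch manifest records into a sorted, content-addressable block.

    `records` maps a SURT-style key (assumed precomputed) to a manifest
    dict; the real UKVS metadata headers are reduced to one line here.
    """
    header = "!meta " + json.dumps({"previous_block": previous_block_hash})
    lines = [key + " " + json.dumps(manifest, sort_keys=True)
             for key, manifest in records.items()]
    lines.sort()  # byte-wise ordering, as with sort under LC_ALL=C
    body = "\n".join([header] + lines) + "\n"
    block_hash = hashlib.sha256(body.encode()).hexdigest()
    filename = block_hash + ".ukvs.gz"  # name the block after its hash
    return filename, gzip.compress(body.encode())

records = {
    "org,jcdl,2019)/ 20181224085329": {"hash": "sha256:..."},
    "org,jcdl,2019)/ 20181219102034": {"hash": "sha256:..."},
}
name, data = build_block(records, previous_block_hash="0" * 64)
print(name)
```

Because the file is named after the hash of its own sorted contents, any change to a record changes the block's identity, which is what makes the chain of blocks tamper-evident.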
Creation and dissemination of manifest blocks is performed in the following steps:

(1) Identify a set of URI-Ms whose manifests are to be included in the same block (a strategically chosen set may improve the block compression factor and enable a more efficient lookup for verification later).
(2) Generate their individual manifests in the form of single-line JSON records (exclude the @id field, which is not needed when records are placed in a block, and eliminate many common fields that can go in the headers of the block).
(3) Prefix each manifest JSON line with the Sort-friendly URI Reordering Transform (SURT) [44] of the corresponding URI-M.
(4) Write these lines in a UKVS file along with the metadata headers, as illustrated in Figure 13.
(5) Add the content-addressable hash of the latest published block in the metadata as the previous block.
(6) Sort the file using the LC_ALL=C locale.
(7) Calculate the content-addressable hash (e.g., SHA-256) of this block.
(8) Name the file using its content-addressable hash.
(9) Compress the block file to archive it efficiently.
(10) Publish the compressed block file at a URI that contains its hash.
(11) Make the entrypoint (the well-known URI) redirect to the latest block's URI (as illustrated in Figure 11).
(12) Add a Link response header with appropriate links to navigate through the chain of blocks, which is visually illustrated on the landing page as shown in Figure 12 (a similar approach of creating a bidirectional linked list of HTTP messages was used in the HTTPMailbox [2]).
(13) Archive the entrypoint in multiple web archives, which will implicitly archive the latest block as well due to the redirect.
(14) Optionally, for further tamper-proofing, post the URI of the newly published block on immutable platforms not controlled by a single authority (e.g., Twitter and GitHub's Gist).

$ curl -sIL https://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ | egrep -i "(HTTP/|^location:)"
HTTP/2 302
location: https://manifest.ws-dl.cs.odu.edu/manifest/20181224093024/8c31ccfbb3a664c9160f98be466b7c9fb9afa80580ab5052001174be59c6a73a/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
HTTP/2 200
Figure 6: The manifest identified with the generic URI redirects to the manifest with the trusty URI.

$ curl -i http://web.archive.org/web/timemap ...
... "first memento"; datetime="Mon, 24 Dec 2018 09:33:54 GMT", ...

Figure 7: Retrieving the TimeMap of a manifest from the Internet Archive. In this example, the TimeMap contains only one memento.

$ curl -sIL http://web.archive.org/web/20181224093354/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ | egrep -i "(HTTP/|^location:)"
HTTP/1.1 302 Found
Location: http://web.archive.org/web/20181224093354/http://manifest.ws-dl.cs.odu.edu/manifest/20181224093024/8c31ccfbb3a664c9160f98be466b7c9fb9afa80580ab5052001174be59c6a73a/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
HTTP/1.1 302 FOUND
Location: http://web.archive.org/web/20181224093355/http://manifest.ws-dl.cs.odu.edu/manifest/20181224093024/8c31ccfbb3a664c9160f98be466b7c9fb9afa80580ab5052001174be59c6a73a/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
HTTP/1.1 200 OK

Figure 8: The archived manifest with the generic URI redirects to the archived manifest with the trusty URI.

Although the number of public web archives is increasing [12, 15], only a few of them support an on-demand web archiving service. However, a small number (greater than one) of independent on-demand archives can suffice for the purpose of disseminating manifests. The
Block dissemination approach has a number of advantages over the Atomic approach. It requires far fewer network requests to push blocks to web archives and creates significantly fewer independently published manifest resources to keep track of mementos. By bundling multiple manifests in a single file, it yields a significant compression factor due to the repeated boilerplate content in each manifest file. As web archives die and new ones come to life, these blocks can be replicated and migrated externally to other places efficiently, while in the case of the Atomic approach we might lose historical manifests as old web archives die without donating their holdings to live archives.

Figure 9: Push the fixity information into multiple archives.

Moreover, these blocks are
more tamper-proof than atomic manifests due to chaining.

Figure 10: The Atomic approach. The generic URI (URI-Manif) redirects to the most recent trusty URI, so when the archive captures the generic URI, the archive follows the 302 Redirect and captures the trusty URI as well. This figure is a modified version of an original diagram contributed by Herbert Van de Sompel (from DANS).

On the other hand, the
Block approach has the disadvantage of shifting the burden of looking up a specific record in the entire chain of blocks to the user or to a service that provides verification. While individual blocks are binary-searchable for fast lookup, as the number of blocks increases, one has to scan through all of them. However, this can easily be solved by scanning the entire chain once and creating a search index over the SURT field.
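Because each block is sorted, a record can be located within a block without a full scan. A minimal in-memory sketch (real blocks would be binary-searched byte-wise on disk, and the line payloads here are placeholders):

```python
import bisect

def lookup(sorted_lines, surt_key):
    """Binary-search a sorted block for lines whose key matches surt_key."""
    i = bisect.bisect_left(sorted_lines, surt_key)
    matches = []
    # All lines sharing the key prefix are contiguous in a sorted block.
    while i < len(sorted_lines) and sorted_lines[i].startswith(surt_key):
        matches.append(sorted_lines[i])
        i += 1
    return matches

block = sorted([
    "com,example)/ 20190101000000 {...}",
    "org,jcdl,2019)/ 20181219102034 {...}",
    "org,jcdl,2019)/ 20181224085329 {...}",
])
print(lookup(block, "org,jcdl,2019)/"))  # both records for that URI-R
```

The same prefix search also retrieves all mementos of a URI-R at once, since SURT keys group captures of the same resource together.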
3.4 Fixity Verification

Verifying the fixity of a memento in both the Atomic and Block approaches can be achieved through three common steps:

(a) For the given memento, discover one or more manifests, URI-Manif. In the Atomic approach, this step also requires discovering archived copies (URI-M-Manif) of the manifest.
(b) Recompute the current fixity information of the memento.
(c) Compare the current fixity information with the discovered manifests.

In the
Atomic approach, we can discover a manifest of a given memento through the Archival Fixity server. Which manifest is returned depends on the server's API. For example, the server may respond with the manifest closest to the memento's creation date or return the manifest that is closest to a given datetime (i.e., via a TimeGate). Once a manifest is discovered, we may use TimeGates and/or TimeMaps to retrieve its archived copies available in web archives. Again, it is possible to discover archived manifests using the generic URI even without the Archival Fixity server being involved. Next, we compute the current fixity information by generating a new manifest for the given memento. Then, we compare the current hash values in the new manifest with the hashes in the discovered archived manifests. In this comparison step, we should only consider independent copies of the manifest. For example, if an archived manifest is delivered from the same archive the memento is from, then this copy of the manifest should not be considered independent. In other cases, two manifests might be discovered in two cooperating archives (e.g., we know Archive-It.org is a service established by the Internet Archive).

$ curl -IL https://manifest.ws-dl.cs.odu.edu/blocks
HTTP/2 302
content-type: text/html; charset=utf-8
date: Mon, 21 Jan 2019 22:27:14 GMT
location: https://manifest.ws-dl.cs.odu.edu/blocks/59bc17511de502b7a7bdf39b2020c3bd4ad08aaefd7135604edb2a8e3e89540b
server: ArchivalFixity/0.1
content-length: 417
HTTP/2 200
accept-ranges: bytes
cache-control: immutable
content-disposition: attachment; filename="59bc17511de502b7a7bdf39b2020c3bd4ad08aaefd7135604edb2a8e3e89540b.ukvs.gz"
content-encoding: gzip
content-type: application/ukvs
date: Mon, 21 Jan 2019 22:27:14 GMT
etag: "59bc17511de502b7a7bdf39b2020c3bd4ad08aaefd7135604edb2a8e3e89540b"
expires: Tue, 22 Jan 2019 10:27:14 GMT
last-modified: Fri, 11 Jan 2019 18:19:00 GMT
link:

Figure 11: Blocks Access API.

In the case of the
Block approach, the fixity verification server (or any equivalent tool) needs access to all the blocks, either over HTTP (e.g., from a web archive) or stored locally. These blocks are then scanned for one or more records matching the given URI-M. The corresponding single-line JSON entries (as shown in Figure 13) are extracted as historical fixity records for comparison. The remaining steps of creating current fixity information, comparing it with the historical records, and generating the response summary are the same as in the Atomic approach.

Because of the immutable nature of blocks, we can only have back references, creating a single linked list pointing from the most recent blocks to the older ones. However, with the help of some external metadata, our archival fixity block server provides bidirectional navigational links for easy navigation along the chain in both directions (as illustrated in Figure 11 with the first, last, prev, and next link relations in the Link header). The content of each block is sorted, which enables fast lookup within a block using binary search, but the chain of blocks has to be scanned linearly, which can decrease throughput as the number of blocks increases. To deal with this issue, one can create an inverted index over existing blocks, treating URI-Ms as the keywords and blocks as the documents. Additionally, the chain of blocks is in chronological order, which makes it easy to create a lightweight skip index that identifies segments of the chain created around certain points of time in the past. Creating large blocks in a slowly growing chain is more efficient than a rapidly growing chain of small blocks. However, an optimal block size can be decided based on how long one is willing to wait for enough records to accumulate before a new block is created, and on the largest size of a single file that can easily be stored in web archives. Creating blocks with strategically grouped URI-Ms (e.g., mementos with nearby datetime values, URI-Rs from a set of domains, or URI-Ms from a set of archives) can also improve the efficiency of lookup (or indexing).
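The per-block lookup can be sketched as follows, assuming the decompressed block is a list of lexicographically sorted single-line records whose first whitespace-delimited token is the SURT-form key (a simplification of the actual UKVS record layout):

```python
import bisect

def lookup_surt(block_lines, surt_key):
    """Binary-search a sorted block for records whose first token
    equals the given SURT key. `block_lines` is the decompressed
    block as a sorted list of single-line records (a sketch)."""
    # Any matching record is lexicographically >= the bare key,
    # so bisect_left finds the first candidate position.
    lo = bisect.bisect_left(block_lines, surt_key)
    matches = []
    for line in block_lines[lo:]:
        if not line.startswith(surt_key):
            break  # sorted order: no further matches possible
        key, _, _rest = line.partition(" ")
        if key == surt_key:
            matches.append(line)
    return matches
```

An inverted index or skip index, as described above, would only narrow down which blocks to run this lookup against; within a block the cost stays logarithmic in the number of records.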
We conducted a study on 1,000 mementos from the Internet Archive, a subset of a larger set of URI-Ms involved in a different research project [10]. We did not take the size of mementos into consideration (i.e., the number of embedded resources, such as images and JavaScript/CSS files) because fixity in this paper is computed based only on the returned raw HTML content of the base file. The main reason for choosing a small set of only 1,000 URI-Ms is that the study requires pushing at least 12 manifests for each memento into multiple archives. Sending too many archiving requests to archives might result in technical issues, such as blocked IP addresses. For example, webcitation.org responded with "WebCite has flagged your IP address for suspicious activity" after we made 100 requests, but the issue was resolved after contacting the archive. Perma.cc, on the other hand, allows users to freely submit a maximum of 10 URIs for preservation per month. Fortunately, the archive supported this study by increasing this limit, so we were able to disseminate more manifests to the archive.

Figure 12: The landing page showing a chain of blocks.

Part of the evaluation is measuring the time it takes to generate, disseminate, and verify manifests in the
Atomic and
Block approaches. In addition, we want to compare the size of the files created in both approaches and whether all mementos are verified successfully. We wrote Python scripts [6] for performing different functions:

generate_atomic(): Accepts a URI-M and returns the filename of a JSON file containing the fixity information of the memento. We generated 3,000 manifests; we generated three manifests for each memento because we are interested in reporting the average time for generating a manifest per memento. Figure 14 shows an example of generating a manifest of a memento. The resulting JSON file contains the fixity information, including the hash calculated on the returned HTML of the memento.

publish_atomic(): Submits a given JSON file to the Archival Fixity server at https://manifest.ws-dl.cs.odu.edu. The server inserts @id and created metadata before publishing the new manifest on the web. Figure 15 shows an example of publishing the manifest file generated previously (in Figure 14). It returns the generic URI of the manifest URI-Manif and the trusty URI.

disseminate_atomic(): Pushes a published manifest into different archives using ArchiveNow. In our study, we used archive.org, archive.is, perma.cc, and webcitation.org, resulting in 12,000 archived manifests (i.e., 3,000 URI-M-Manif in each archive). We used the generic URI to push manifests into archives; this URI always redirects to the trusty URI. If archives consider a "302 Redirect" a separate resource, then the total number of archived resources created in the four archives was 24,000. Figure 16 shows an example of disseminating a manifest to four archives.

verify_atomic(): Accepts a URI-M and discovers the manifest closest to the memento's creation datetime. In addition, the function discovers archived copies of the manifest in the four archives using TimeGates and TimeMaps. Then, it computes current fixity information using generate_atomic(). Finally, it compares the current fixity information with the discovered manifests and their archived copies. As a result, for each URI-M, the function returns either "Verified" or "Failed" along with other information, such as hash values, URI-Manifs, and URI-M-Manifs. Figure 17 shows an example of verifying the fixity of a memento.

generate_block(): Accepts multiple JSON files and generates one or more blocks depending on the selected block size. In this study, we set the block size to 100 manifests per block, so the total number of generated blocks was 10. The example in Figure 18 shows the output of the shell script generate_blocks.sh, which uses the Python function generate_block() to generate ten 100-record blocks. Figure 19 shows only four records (out of 100) of block 1, holding the fixity information of four mementos.

disseminate_block(): Pushes a block into two archives (archive.org and perma.cc). Again, because we are interested in calculating the average time of disseminating a block, each block is pushed three times into both archives, resulting in 60 archived blocks (i.e., 30 per archive). We did not use archive.is and webcitation.org because .gz files were not handled correctly by those archives. Figure 20 shows an example of disseminating a block to two archives.

!context ["http://oduwsdl.github.io/contexts/fixity"]
!fields {keys: ["surt"]}
!id {uri: "https://manifest.ws-dl.cs.odu.edu/"}
!meta {created_at: "20190111181327"}
!meta {prev_block: "sha256:d4eb1190f9aaae9542fd3ad8a3c4519450cfb00845b632eb2b3f4f098a34144d"}
!meta {type: "FixityBlock"}
org,archive,web)/web/19961022175434/http://search.com/ {

Figure 13: The header and the beginning of the first record of a block in the UKVS format.
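As a rough illustration of what a generate_atomic()-style routine computes, the sketch below hashes the raw HTML of a memento that has already been downloaded. The field names are illustrative, not the framework's actual manifest schema; per the description above, the Archival Fixity server would add @id and created metadata at publishing time.

```python
import hashlib
import json

def generate_manifest(urim, raw_html):
    """Build a minimal Atomic-style manifest (illustrative fields)
    for a memento whose raw HTML content, as bytes, has already
    been fetched from the archive."""
    return json.dumps({
        "uri-m": urim,
        # Fixity is computed on the returned raw HTML of the base file.
        "hash": "sha256:" + hashlib.sha256(raw_html).hexdigest(),
    }, sort_keys=True)
```

The actual scripts additionally record metadata needed for later comparison and write the manifest to a local JSON file.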
Figure 14: An example of generating a manifest of a memento.

$ python fixity.py publish_atomic
Figure 15: An example of publishing a manifest at the Archival Fixity server.

$ python fixity.py disseminate_atomic
Figure 16: An example of disseminating a manifest to four archives.

verify_block(): Accepts a URI-M and discovers fixity information for the URI-M from the published blocks. Then, it computes current fixity information using generate_atomic(). Finally, it compares the current and discovered fixity information. The function returns either "Verified" or "Failed" along with other information, such as hash values. Figure 21 shows the output of verify_block() for only 10 mementos (out of 1,000).

In addition to the Python scripts, we implemented the Archival Fixity server, which is responsible for publishing and discovering manifests and blocks. For example, Figure 22 shows a request for discovering the manifest whose creation date is closest to December 22, 2018 for a given memento. The server response indicates that the closest manifest was created on December 12, 2018.

The selected number of records per block affects the total size of all blocks and the time required to generate them. Figure 23 illustrates that creating large blocks in a slowly growing chain is more efficient than a rapidly growing chain of small blocks. As mentioned in Section 3.4, one factor in choosing the optimal number of records per block is the largest size of a single file that web archives can easily store. For example, we tested the Internet Archive (IA) to identify the largest single file the archive accepts for preservation. After submitting multiple files of different sizes, we found that IA accepts files up to 800 MB; beyond that, the archive returns "504 Gateway Time-out".
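The server's closest-datetime lookup amounts to minimizing the absolute difference between manifest creation datetimes and the requested datetime; a minimal sketch (the function name is ours; the real server queries its manifest store):

```python
from datetime import datetime

def closest_manifest(creation_datetimes, target):
    """Pick the manifest creation datetime closest to the target,
    mimicking the Archival Fixity server's TimeGate-like lookup."""
    return min(creation_datetimes, key=lambda dt: abs(dt - target))
```

With manifests created on, say, November 1, 2018, December 12, 2018, and January 5, 2019, a request for December 22, 2018 resolves to December 12, 2018, matching the example in the text.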
$ python fixity.py generate_atomic verify_atomic
Figure 17: An example of verifying the fixity of a memento. The current fixity information is generated first. Then, the function verify_atomic() finds a published manifest in the Archival Fixity server and its archived versions in web archives. Finally, the function compares the current fixity information with the fixity information in the discovered manifest and its archived captures.

Figure 24 illustrates the distribution of the average time taken to generate manifests. We generated three manifests for each memento and calculated the average time, so the total number of generated manifests is 3,000. The manifest generation time includes: 1) downloading the raw HTML content using the Requests module in Python, 2) calculating the fixity information of the downloaded content, and 3) storing the fixity information locally in JSON format. The average size of the generated manifest files is 1,157 bytes. This size represents 2.79% of the actual downloaded HTML content, which is 41,392 bytes on average. The total size of all manifests is 1,156,657 bytes, while the total size of the blocks is 176,128 bytes. This indicates that the
Block approach requires less storage space than the Atomic approach for storing the fixity information of the same number of mementos.

As expected, disseminating manifests and blocks took the most time compared with other operations, such as generating and verifying manifests. Figure 25 shows that pushing manifests into webcitation.org (WebCite) takes much longer than the other archives. On average, we wait 33.82 seconds for WebCite to finish processing an archival request for a manifest, while the average manifest dissemination time drops dramatically in the other three archives, as Table 1 indicates. We observed that archive.org and webcitation.org add a few seconds of response delay after receiving the first tens of archiving requests. In sum, it takes about 1.25X, 4X, and 36X longer to disseminate a manifest to perma.cc, archive.org, and webcitation.org, respectively, than to archive.is, while it takes 3.5X longer to disseminate a block to archive.org than to perma.cc. The average dissemination time of blocks in archive.org and perma.cc is shown in Figure 26.

Given a collection of N mementos and K web archives, the total number of resources we create in the K archives with the Atomic and Block approaches is (N ∗ K) and (K ∗ (N/B)), respectively, where B is the selected block size. In our study, N = 1,000, K_atomic = 4, K_block = 2, and B = 100, so 12,000 resources were created by the Atomic approach and only 60 resources by the Block approach, considering that we repeated the dissemination process three times.

Figure 27 shows the time required to discover the manifests of each memento from the Archival Fixity server. Figure 28 illustrates the total time for verifying the fixity of all mementos with both approaches.

Table 1: Average time (in seconds) for disseminating and downloading manifests and blocks.
Operation               archive.is  perma.cc  IA    WebCite
Manifest dissemination  0.94        1.18      3.74  33.82
Block dissemination     -           1.37      4.80  -
Manifest download       0.47        0.60      1.42  4.55
Block download          -           0.30      7.19  -
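The resource counts reported in the study follow directly from the (N ∗ K) and (K ∗ (N/B)) formulas; a quick check under the study's parameters (variable names are ours):

```python
# Atomic creates N * K resources per dissemination round; Block creates
# K * (N / B). Every dissemination in the study was repeated three times.
N, B, repeats = 1000, 100, 3     # mementos, block size, repetitions
k_atomic, k_block = 4, 2         # archives used by each approach

atomic_resources = repeats * N * k_atomic        # archived manifests
block_resources = repeats * k_block * (N // B)   # archived blocks
print(atomic_resources, block_resources)         # 12000 60
```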
The verification time includes discovering manifests, computing current fixity information, downloading copies of manifests (in the Atomic approach), and comparing manifests. On average, the verification time of a memento is 6.65 seconds with the Atomic approach and 1.49 seconds with the Block approach, so the Block approach performs 4.46X faster than the Atomic approach at verifying the fixity of a memento. Although we anticipated that some mementos might not be verified, for reasons such as an archive responding with an "HTTP 500" error, we have not encountered any failed cases (i.e., all mementos were verified successfully).
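The reported speedup is simply the ratio of the two average verification times:

```python
# Average per-memento verification times from the measurements above.
atomic_avg, block_avg = 6.65, 1.49   # seconds
speedup = atomic_avg / block_avg
print(round(speedup, 2))  # 4.46
```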
Most web archives do not allow users to access fixity information. Even if fixity information is accessible, it is provided by the same archive delivering the content. In this paper, we have described two approaches, Atomic and Block, for generating and verifying the fixity of archived web pages. The proposed work does not require any change to the infrastructure of web archives and is built on well-known standards, such as the Memento protocol. While a central service is used to create manifests, this approach does not exclude additional, centralized manifest servers, possibly tailored to specific communities. The Block approach creates fewer resources in archives and reduces fixity verification time, while the
Atomic approach has the ability to verify the fixity of archived pages even without involving the Archival Fixity server. On average, it takes about 1.25X, 4X, and 36X longer to disseminate a manifest to perma.cc, archive.org, and webcitation.org, respectively, than to archive.is, while it takes 3.5X longer to disseminate a block to archive.org than to perma.cc. The Block approach performs 4.46X faster than the Atomic approach at verifying the fixity of archived pages.

$ ./generate_blocks.sh
Input: urims.txt
Output Dir: ./blocks/100
Block Size: 100
Num Blocks: 10
======================
[1548318100009] Generating block 1
[1548318100348] Saving 87606 bytes to ./blocks/100/20190124082140-0000000000000000000000000000000000000000000000000000000000000000-dfbbe3600d5fe4e51c895db94cb9e9cfd0eb04716d9e4be6e63cf8ac3f3e9233.ukvs
[1548318100350] Compressing block to ./blocks/100/20190124082140-0000000000000000000000000000000000000000000000000000000000000000-dfbbe3600d5fe4e51c895db94cb9e9cfd0eb04716d9e4be6e63cf8ac3f3e9233.ukvs.gz
[1548318100356] Finished creating block 1 of size 15174 bytes in 347 milliseconds
======================

Figure 18: The shell script generate_blocks.sh uses the Python function generate_block() to generate ten 100-record blocks (only the output for block 1 is shown).

We believe that the
Atomic and
Block approaches can be adopted to verify the fixity of particular archived web pages with important content. Some future improvements can be applied to these approaches so they become scalable and can work with any number of mementos. Varying or increasing the block size in the Block approach is one potential way to improve its performance and reduce the number of resources created in archives. Caching archived
manifests in the Archival Fixity server should also improve the performance of the two approaches: instead of discovering those manifests from the archives, we may use cached copies in the Archival Fixity server.

Figure 19: Block 1 contains 100 records (only four records are shown).

$ python fixity.py disseminate_block http://manifest.ws-dl.cs.odu.edu/blocks
https://perma.cc/8YG3-X7KN
https://web.archive.org/web/20190121054059/https://manifest.ws-dl.cs.odu.edu/blocks/7bbf757046ac0a0a60015a1cb847c3189160d18c809b210073822df157609e01

Figure 20: An example of disseminating one block to two archives. The URI http://manifest.ws-dl.cs.odu.edu/blocks always redirects to the most recent published block. In this example the URI redirects to block 1: https://manifest.ws-dl.cs.odu.edu/blocks/7bbf757046ac0a0a60015a1cb847c3189160d18c809b210073822df157609e01.
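For completeness, the linear scan over the chain (the cost the improvements above aim to reduce) can be sketched as walking the prev_block back references, starting from the most recent block (the all-zero hash marks the first block, as in Figure 18; fetch_block is a hypothetical accessor):

```python
def walk_chain(latest_hash, fetch_block):
    """Walk the block chain backwards, newest to oldest, following
    each block's prev_block back reference. `fetch_block(h)` returns
    a parsed block (here modeled as a dict) for a given hash."""
    genesis = "0" * 64  # the first block points at an all-zero hash
    h = latest_hash
    while h != genesis:
        block = fetch_block(h)
        yield block
        h = block["prev_block"]
```

A caching or skip-index layer would short-circuit this walk rather than replace it.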
This work is supported in part by The Andrew W. Mellon Foundation (AMF) grant 11600663. This work includes contributions from Herbert Van de Sompel (DANS) and Martin Klein (LANL). We thank Ben Steinberg from the Perma.cc web archive for the generous increase in our monthly free quota to allow experimental fixity resource dissemination. We thank the WebCite archive for resolving technical issues with disseminating web resources.
N Status BlockIdx TotalT LookupT GenerationT VerifyT URIM

Figure 21: The output of verify_block() (only the results of verifying 10 mementos out of 1,000 are shown). The column Status indicates whether the fixity of a memento is verified or not. The column BlockIdx is the block number used to verify the memento. The columns LookupT, GenerationT, and VerifyT show the time taken to look up the fixity information in blocks, to generate current fixity information, and to verify/compare the current fixity with the fixity information discovered from blocks, respectively. The column TotalT shows the overall time taken to verify the fixity of the memento.

$ curl -I https://manifest.ws-dl.cs.odu.edu/manifest/ /https://web.archive.org/web/20171115140705/http://rln.fm/
HTTP/2 302 Found
content-length: 501
content-type: text/html; charset=utf-8
date: Thu, 10 Jan 2019 09:16:40 GMT
location: https://manifest.ws-dl.cs.odu.edu/manifest/ /bd669de8835e38d54651fe9d04709515beec0c727db82a5366f4bc2506e103d8/https://web.archive.org/web/20171115140705/http://rln.fm/
server: ArchivalFixity/0.1

Figure 22: Discovering the closest manifest to December 22, 2018 for the memento https://web.archive.org/web/20171115140705/http://rln.fm/.

Figure 23: The effect of the selected number of records per block.
Figure 24: Generating manifests of mementos.
Figure 25: Disseminating manifests to four archives.
Figure 26: Disseminating blocks to two archives.
Figure 27: Discovering manifests by both approaches.
Figure 28: Verifying mementos by both approaches.
REFERENCES
[1] 2017. WARC file format. ISO 28500:2017 (2017).
[2] Sawood Alam. 2013. HTTP Mailbox - Asynchronous RESTful Communication. Master's thesis. Old Dominion University. https://doi.org/10.25777/wh13-fd86
[3] Sawood Alam. 2019. Unified Key Value Store. https://github.com/oduwsdl/ORS/blob/master/ukvs.md. (January 2019).
[4] Sawood Alam, Michele C. Weigle, and Michael L. Nelson. 2019. MementoMap Framework for Flexible and Adaptive Web Archive Profiling. In Proceedings of the 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL).
[5] Shadi Aljawarneh, Christopher Laing, and Paul Vickers. 2008. Design and experimental evaluation of Web Content Verification and Recovery (WCVR) system: A survivable security system. In Proceedings of the 3rd Conference on Advances in Computer Security and Forensics (ACSF).
[6] Mohamed Aturban and Sawood Alam. 2019. Archival Fixity. https://github.com/oduwsdl/archival-fixity. (2019).
[7] Mohamed Aturban, Sawood Alam, Michael L. Nelson, and Michele C. Weigle. 2019. Archive Assisted Archival Fixity Verification Framework. In Proceedings of the 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL).
[8] Mohamed Aturban, Mat Kelly, Sawood Alam, John Berlin, Michael L. Nelson, and Michele C. Weigle. 2018. ArchiveNow: Simplified, Extensible, Multi-Archive Preservation. In Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL).
[9] Mohamed Aturban, Michael L. Nelson, and Michele C. Weigle. 2017. Difficulties of Timestamping Archived Web Pages. Technical Report arXiv:1712.03140.
[10] Mohamed Aturban, Michael L. Nelson, and Michele C. Weigle. 2018. It is Hard to Compute Fixity on Archived Web Pages. In Proceedings of the Workshop on Web Archiving and Digital Libraries (WADL) held in conjunction with the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL).
[11] Jefferson Bailey. 2014. Protect Your Data: File Fixity and Data Integrity. https://blogs.loc.gov/thesignal/2014/04/protect-your-data-file-fixity-and-data-integrity/. (April 2014).
[12] Jefferson Bailey, Abigail Grotke, Edward McCain, Christie Moffatt, and Nicholas Taylor. 2017. Web Archiving in the United States: A 2016 Survey. http://ndsa.org/documents/WebArchivingintheUnitedStates_A2016Survey.pdf. (February 2017).
[13] John A. Berlin. 2018. To Relive the Web: A Framework for the Transformation and Archival Replay of Web Pages. Master's thesis. Old Dominion University. https://doi.org/10.25777/n8mg-da06
[14] Chris Butler. 2018. Addressing Recent Claims of "Manipulated" Blog Posts in the Wayback Machine. http://blog.archive.org/2018/04/24/addressing-recent-claims-of-manipulated-blog-posts-in-the-wayback-machine/. (2018).
[15] Miguel Costa, Daniel Gomes, and Mário J. Silva. 2017. The evolution of web archiving. International Journal on Digital Libraries 18, 3 (2017), 191–205. https://doi.org/10.1007/s00799-016-0171-9
[16] Jack Cushman and Ilya Kreymer. 2017. Thinking like a hacker: Security Considerations for High-Fidelity Web Archives. http://labs.rhizome.org/presentations/security.html. (May 2017).
[17] Robin L. Dale and Bruce Ambacher. 2007. Trustworthy repositories audit and certification: Criteria and checklist. Report of the RLG-NARA Task Force on Digital Repository Certification. (February 2007).
[18] Caleb Ecarma. 2018. EXCLUSIVE: Joy Reid Claims Newly Discovered Homophobic Posts From Her Blog Were "Fabricated". (2018).
[19] Deborah R. Eltgrowth. 2009. Best evidence and the Wayback Machine: toward a workable authentication standard for archived Internet evidence. Fordham Law Review 78 (2009), 181.
[20] Gunther Eysenbach and Mathieu Trudel. 2005. Going, going, still there: Using the WebCite service to permanently archive cited web pages. J Med Internet Res 7, 5 (2005), e60. https://doi.org/10.2196/jmir.7.5.e60
[21] Peng Gao, Hao Han, and Takehiro Tokuda. 2012. IAAS: An integrity assurance service for Web page via a fragile watermarking chain module. In Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication. ACM.
[22] B. Gipp, N. Meuschke, and C. Breitinger. 2016. Using the Blockchain of Cryptocurrencies for Timestamping Digital Cultural Heritage. In Proceedings of the Workshop on Web Archiving and Digital Libraries (WADL) held in conjunction with the 16th ACM/IEEE Joint Conference on Digital Libraries (JCDL).
[23] Bela Gipp, Norman Meuschke, and André Gernandt. 2015. Decentralized trusted timestamping using the crypto currency Bitcoin. Technical Report arXiv:1502.04015.
[24] PREMIS Working Group et al. 2005. Data dictionary for preservation metadata: final report of the PREMIS Working Group. OCLC Online Computer Library Center & Research Libraries Group, Dublin, Ohio, USA, Final report (2005).
[25] Internet Archive. 2019. Wayback CDX Server API - BETA. https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server. (2019).
[26] International Internet Preservation Consortium (IIPC). 2005. OpenWayback. https://github.com/iipc/openwayback/wiki. (October 2005).
[27] ISO 28500:2017. 2017. Information and documentation – WARC file format. (2017).
[28] Shawn M. Jones, Herbert Van de Sompel, and Michael L. Nelson. 2016. Mementos in the Raw. (2016). http://ws-dl.blogspot.com/2016/04/2016-04-27-mementos-in-raw.html
[29] Ramniwas Kachhawa, Nikhil Kumar Singh, and Deepak Singh Tomar. 2014. A Novel Approach to Detect Web Page Tampering. (IJCSIT) International Journal of Computer Science and Information Technologies 5, 3 (2014), 4604–4607.
[30] Ilya Kreymer. 2013. PyWb - Web Archiving Tools for All. https://github.com/ikreymer/pywb. (December 2013).
[31] Ilya Kreymer. 2015. Webrecorder - a web archiving platform and service for all. (2015). https://webrecorder.io
[32] Tobias Kuhn and Michel Dumontier. 2014. Trusty URIs: Verifiable, immutable, and permanent digital artifacts for linked data. In European Semantic Web Conference. Springer, 395–410.
[33] Tobias Kuhn and Michel Dumontier. 2015. Making digital artifacts on the web verifiable and reliable. IEEE Transactions on Knowledge and Data Engineering.
[34] International Journal of Computer Applications Technology and Research.
[35] In Proceedings of the 16th ACM Conference on Computer and Communications Security (CCS). 1741–1755.
[36] Petros Maniatis, David S. H. Rosenthal, Mema Roussopoulos, Mary Baker, TJ Giuli, and Yanto Muliadi. 2003. Preserving Peer Replicas by Rate-limited Sampled Voting. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP '03). 44–59. https://doi.org/10.1145/945445.945451
[37] Arvind Narayanan and Jeremy Clark. 2017. Bitcoin's Academic Pedigree: The concept of cryptocurrencies is built from forgotten ideas in research literature. ACM Queue 15, 4 (2017).
[38] Michael L. Nelson. 2018. Why we need multiple web archives: the case of blog.reidreport.com. https://ws-dl.blogspot.com/2018/04/2018-04-24-why-we-need-multiple-web.html. (2018).
[39] Jinfang Niu. 2012. Functionalities of Web Archives. D-Lib Magazine 18, 3/4 (2012).
[40] David Rosenthal. 2017. SHA1 is dead. https://blog.dshr.org/2017/03/sha1-is-dead.html. (2017).
[41] David Rosenthal. 2018. Blockchain Solves Preservation! https://blog.dshr.org/2018/09/blockchain-solves-preservation.html. (2018).
[42] David Rosenthal, Thomas Robertson, Tom Lipkis, Vicky Reich, and Seth Morabito. 2005. Requirements for Digital Preservation Systems. D-Lib Magazine 11, 11 (2005).
[43] Kristinn Sigurdsson. 2005. Incremental crawling with Heritrix. In Proceedings of the 5th International Web Archiving Workshop (IWAW).
[44] Kristinn Sigurðsson, Michael Stack, and Igor Ranitovic. 2006. Heritrix User Manual: Sort-friendly URI Reordering Transform. http://crawler.archive.org/articles/user_manual/glossary.html. (2006).
[45] Brooke Sopelsa. 2017. MSNBC's Joy Reid apologizes for 'insensitive' LGBT blog posts. (2017).
[46] Brad Tofel. 2007. Wayback for accessing web archives. In Proceedings of the 7th International Web Archiving Workshop (IWAW). 27–37.
[47] Herbert Van de Sompel, Michael L. Nelson, and Robert Sanderson. 2013. HTTP framework for time-based access to resource states – Memento, Internet RFC 7089. http://tools.ietf.org/html/rfc7089. (2013).
[48] Herbert Van de Sompel, Michael L. Nelson, Robert Sanderson, Lyudmila L. Balakireva, Scott Ainsworth, and Harihar Shankar. 2009. Memento: Time Travel for the Web. Technical Report arXiv:0911.1112.
[49] Jonathan Zittrain, Kendra Albert, and Lawrence Lessig. 2014. Perma: Scoping and addressing the problem of link and reference rot in legal citations.