Making Recommendations from Web Archives for "Lost" Web Pages
Lulwah M. Alkwai, Michael L. Nelson, Michele C. Weigle
Department of Computer Science, Old Dominion University, Norfolk, Virginia 23529
[email protected], {mln,mweigle}@cs.odu.edu
ABSTRACT
When a user requests a web page from a web archive, the user will typically either get an HTTP 200 if the page is available, or an HTTP 404 if the web page has not been archived. This is because web archives are typically accessed by Uniform Resource Identifier (URI) lookup, and the response is binary: the archive either has the page or it does not, and the user will not know of other archived web pages that exist and are potentially similar to the requested web page. In this paper, we propose augmenting these binary responses with a model for selecting and ranking recommended web pages in a web archive. This enhances both HTTP 404 responses and HTTP 200 responses by surfacing web pages in the archive that the user may not know existed. First, we check if the URI is already classified in DMOZ or Wikipedia. If the requested URI is not found, we use machine learning to classify the URI using DMOZ as our ontology and collect candidate URIs to recommend to the user. The classification is in two parts: a first-level classification and a deep classification. Next, we filter the candidates based on whether they are present in the archive. Finally, we rank the candidates based on several features, such as archival quality, web page popularity, temporal similarity, and URI similarity. We calculated the F score for different methods of classifying the requested web page at the first level. We found that using all-grams from the URI after removing numerals and the top-level domain (TLD) produced the best result, with F = 0.59. For the deep-level classification, we measured the accuracy at each classification level. For second-level classification, the micro-average F = 0.30, and for third-level classification, F = 0.15.
We also found that 44.89% of the correctly classified URIs contained at least one word that exists in a dictionary, and 50.07% of the correctly classified URIs contained long strings in the domain. In comparison with the URIs from our Wayback access logs, only 5.39% of those URIs contained only words from a dictionary, and 26.74% contained at least one word from a dictionary. These percentages are low and may affect the ability for the requested URI to be correctly classified.

INTRODUCTION

Web archives are a window to view past versions of web pages. The oldest and largest web archive, the Internet Archive's Wayback Machine, contains over 700 billion web objects [16]. But even with this massive collection, sometimes a user requests a web page that the Wayback Machine does not have. Currently, in this case, the user is presented with a message saying that the Wayback Machine does not have the page archived, along with a link to search for other archived pages in that same domain (Figure 1a). Our goal is to enhance the response from a web archive with recommendations of other archived web pages that may be relevant to the request. For example, Figure 1b shows a potential set of recommended archived web pages for the request in Figure 1a.

One approach to finding related web pages is to examine the content of the requested web page and then select candidates with similar content. However, in this work, we assume that the requested web page is neither available in web archives nor on the live web and thus is considered to be a "lost" web page. This assumption reflects previous work showing that users often search web archives when they cannot find the desired web page on the live web [5] and that there are a significant number of web pages that are not archived [1, 3]. Learning about a requested web page without examining the content of the page can be challenging due to the little context and content available.
There are several advantages to using the Uniform Resource Identifier (URI) over using the content of the web page. First, in some cases the content of the URI is not available on the live Web or in the archive. Second, the URI may contain hints about the resource it identifies. Third, it is more efficient in both time and space to use the text of the URI only rather than to extract the content of the web page. Fourth, some web pages have little or no textual content, such as images or videos, so extracting the content will not be useful or even possible. Fifth, some web pages have privacy settings that do not permit them to be archived.

In this work we recommend URIs similar to a request by following five steps. First, we determine if the requested URI is one of the 4 million categorized URIs in DMOZ (the original DMOZ, http://dmoz.org, is out of service, but we have archived versions locally) or in Wikipedia via the Wikipedia API. If the URI is found, we collect candidates in the same category from DMOZ or Wikipedia and move to Step 4. Second, if the URI is not found, we classify the requested URI based on a first level of categorization. Third, we classify the requested URI to determine the deep categorization levels and collect candidates. Fourth, we filter candidates by removing candidates that are not archived. Finally, we filter and rank candidates based on several features, such as archival quality, web page popularity, temporal similarity, and URI similarity.

RELATED WORK

There has been previous work on searching an archive without indexing it. Kanhabua et al. [19] proposed a search system to support retrieval and analytics on the Internet Archive. They used Bing to search the live web and then extracted the URLs from the results
Figure 1: The actual response to the requested URI http://tripadvisor.com/where_to_travel (1a) and its proposed replacement (1b). (a) Response to the request http://tripadvisor.com/where_to_travel at the Internet Archive. (b) Proposed recommendations for the requested URI, displayed with MementoEmbed [15] social cards.

and used those as queries to the web archive. They measured the coverage of the archived content retrieved by the current search engine and found that on page one of Bing results, 94% are available in the Internet Archive. Note that this technique will not find URLs that have been missing (HTTP status 404) long enough for Bing to have removed them from its index.

Klein et al. [20] addressed a similar but slightly different problem by using web archives to recommend replacement pages on the live web. They investigated four techniques for using the archived page to generate queries for live web search engines: (1) lexical signatures, (2) web page titles, (3) tags, and (4) link neighborhood lexical signatures. Using these four methods helped to find a replacement for missing web pages. Various datasets were used, including
DMOZ. By comparing the different methods, they found that 70% of the web pages were recovered using the title method. The result increased to 77% by combining the other three methods. In their work, the user will get a single alternative when a page is not found on the live Web.

Huurdeman et al. [13, 14] detailed their approach to recover pages in the unarchived Web based on the existence of links and anchors of crawled pages. The data used was from 2012, from the National Library of the Netherlands (KB, https://kb.nl/en). Both external links (inter-server links), which are links between different servers, and site-internal links (intra-server links), which occur within a server, were included in the dataset. Their findings included that the archived pages show evidence of a large number of unarchived pages and web sites. Finally, they found that even with only a few words to describe a missing web page, it can be found within the first rank.

Classification is the process of comparing representations of documents with representations of labeled categories and computing similarity to find to which category the documents belong. Baykan et al. [8, 9] investigated using the URI to classify the web page and identify its topic. They found that there is a relationship between classification and the length of the URI: the longer the URI, the better the result. They used different machine learning algorithms, and the highest scores were achieved by the maximum entropy algorithm. They trained the classifiers on the DMOZ dataset using the all-grams method and tested the performance on Yahoo!, Wikipedia, Delicious, and Google. The classifier performed the best on the Google data, with F = 0.87. We use Baykan et al.'s tokenization methods in Section 4.2.

Xue et al. [32] used text classification on a hierarchical structure. They proposed a deep classification method where, given a document, the entire set of categories is divided into two kinds according to their similarity to the document: related categories and unrelated categories. Their method has two stages, the search stage and the classification stage. The search stage results in a small subset of candidate categories in a hierarchy structure, and the output of the first stage becomes the input of the second stage. For the first (search) stage, two strategies were proposed, document-based and category-based: either the requested document is compared to each document in the dataset, or it is compared to all documents in a category. Then term frequency (TF) and cosine similarity were used to find the top 10 documents. For the second stage, the resulting 10 category candidates are structured as a tree, and the tree is pruned by removing any category that has no candidate in it. Three strategies were proposed to accomplish this step: flat structure, pruned top-down, and ancestor-assistant. They used Naïve Bayes as a classifier because of the large sample size and the speed desired, and 3-grams because of the close similarity between categories. As a dataset they used 1.3 million URIs from DMOZ and ignored the Regional and World categories. For evaluation, they used the Mi-F score metric, which evaluates the performance at each level. They found that deep classification performs the highest of the three approaches by Mi-F score, with a 77% improvement over the top-down based approach. This work is the basis for the deep-level classification we perform (Section 4.3).

Rajalakshmi et al. [25] proposed an approach where N-gram based features are extracted from URIs alone, and the URI is classified using Support Vector Machines and Maximum Entropy classifiers.
In this work, they used the 3-gram features from the URI on two datasets: 2 million URIs from DMOZ and a WebKB dataset with 4K URIs. Using this method on the WebKB dataset resulted in an increase in F score of 20.5% compared to the related work [11, 17, 18]. Also, using this method on DMOZ resulted in an increase in F score of 4.7% compared to the related work [8, 18, 24].

One of the features we use to rank the candidate URIs is archival quality. Archival quality refers to measuring memento damage by evaluating the impact of missing resources in a web page. The missing resources could be text, images, video, audio, style sheets, or any other type of resource on the web page. Brunelle et al. [10] proposed a damage rating algorithm to measure the relative value of embedded resources and evaluate archival success. The algorithm is based on a URI's MIME type, size, and the location of the embedded resources. In the Internet Archive, the average memento damage decreased from 0.16 in 1998 to 0.13 in 2013.

DATASETS

In this work we use three datasets: DMOZ, Wikipedia, and a set of requests to the Wayback Machine. We use the DMOZ and Wikipedia datasets as ontologies to help classify the requested URI and generate candidate recommendations. For evaluation, we use the Wayback Machine access logs as a sample of actual requests to a popular web archive. We chose DMOZ because its web pages are likely to be found in the archive [1, 7]. Wikipedia was chosen because new or recent web pages are found there. In this section we describe each of the datasets.
DMOZ, or the Open Directory Project (ODP), was the largest human-edited directory of the Web. DMOZ is organized as a hierarchical classification in which each category may have sub-categories. Each entry in the dataset contains the following fields: category, URI, title, and description. For example, an entry could be: Computers/Computer_Science/Academic_Departments/North_America/United_States/Virginia, http://cs.odu.edu/, Old Dominion University, and Norfolk Virginia, as shown in Figure 2.

DMOZ was shut down on March 14, 2017. We have archived 118 DMOZ files of type RDF, from 2001 to 2017. Since we focus on English-language web pages, we first filtered out the World category. Then, we collected all entries that contain at least the URI and the category fields. Next, starting from the latest archived dataset, we collected the entries that include a unique URI. After that, we converted all the URIs to Sort-friendly URI Reordering Transform (SURT, https://pypi.org/project/surt/) format. Table 1 shows the number of collected entries and sub-categories for each category. To be consistent with similar work [25], we filtered out the Regional, Netscape, Kids_and_Teens, and Adult categories.

Since we are going to gather recommendations from DMOZ, we wanted to analyze the dataset. We checked the top-level domains, the depth of the URIs, whether the URIs are on the live web, and whether URI patterns occur.

Figure 2: ODU main page found in DMOZ

Table 1: The number of entries for each category and the number of sub-categories in the DMOZ dataset

Category         Num. URIs   Num. sub-categories
Regional         2,348,257   297,140
Arts               658,942    57,959
Society            487,834    36,259
Business           469,668    22,465
News               421,800     2,581
Computers          297,789    12,580
Sports             278,706    28,761
Recreation         261,005    15,467
Shopping           250,538     7,393
Science            217,071    17,212
Adult              197,141    10,683
Reference          160,652    13,077
Games              151,459    20,233
Health             149,648    10,292
Home                81,059     3,553
Kids_and_Teens      63,333     5,793
Netscape            27,223     2,581
Total            6,522,125   564,029
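The SURT conversion mentioned above can be illustrated with a minimal sketch. The surt Python package referenced in the footnote implements the full transform; this toy version, with a function name of our choosing, only reverses the hostname labels so that URIs from the same organization sort together.

```python
from urllib.parse import urlsplit

def simple_surt(uri):
    # Simplified SURT illustration (not the full surt library):
    # lowercase the URI, drop any port, and reverse the hostname
    # labels, e.g. cs.odu.edu -> edu,odu,cs
    parts = urlsplit(uri.lower())
    host = parts.netloc.split(':')[0]
    reversed_host = ','.join(reversed(host.split('.')))
    return reversed_host + ')' + (parts.path or '/')
```

For example, `simple_surt("http://cs.odu.edu/")` yields `edu,odu,cs)/`, which sorts next to other `odu.edu` pages.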
Top-Level Domain. In this section we determine the diversity of the top-level domains (TLDs) in DMOZ. As shown in Table 2, we found that 61.85% of the URIs are in the commercial top-level domain, .com, followed by .org, .net, and .edu. Other top-level domains include .ca, .it, etc.
Table 2: Top-level domain analysis for the DMOZ dataset

TLD      Num. URIs   Percent
com
org
net
edu
gov
us
others
Total
Table 3: Depth analysis for the DMOZ dataset

Depth   Count   Percent
0
Total
Depth. Here, we want to know if the URIs we are recommending are only URIs of depth 0. Note that depth 0 includes URIs ending with /index.html or /home.html. The depth is measured after URI canonicalization. As shown in Table 3, we found that 50.57% of the URIs in DMOZ are depth 0 (i.e., top-level web pages).

Table 4: URI patterns present in DMOZ

Pattern            % in hostname   % in path
Long strings       42.65%
Long slugs
Numbers                            20.01%
Change in case
Query              -               4.72%
Port number
IP address
Percent-encoding   0%              0.50%
Date               0%              0.43%
Live Web. As of November 2018, we found that 86% of the URIs in the DMOZ dataset are either live or redirect to live web pages.
Patterns. In this section we calculate the different URI patterns that occur in DMOZ. Table 4 shows the percentage of occurrence of each pattern in the hostname and the path. We analyze the following patterns:

• Long strings. Contains 10 or more contiguous letters. We chose 10 because it is likely that at least two words are grouped together, since the average English word is 5 letters long [22, 23]. Example: http://radiotunis.com.
• Long slugs.
• Numbers. Example: http://911.com.
• Change in case. Example: http://zeekoo.com/ZeeKooGids.php.
• Query in the path. HTTP query string, beginning with a "?". Example: http://findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=1795.
• Port number in the hostname.
• IP address in the hostname. Example: http://63.135.118.69/.
• Percent-encoding. Encoding used to represent special characters in the URI. Example: http://tinet.cat/%7ekosina.
• Date string. Example: http://elmundo-eldia.com/1999/08/29/opinion/1001023218.html.

We found that 42.65% of the URIs contain long strings in the hostname and 20.01% of the URIs contain numbers in the path.
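These pattern checks can be sketched with regular expressions. This is an illustrative approximation with function and key names of our choosing; the paper's exact matching rules may differ.

```python
import re
from urllib.parse import urlsplit

def uri_patterns(uri):
    """Report which URI patterns occur in the hostname and the path
    (simplified sketch of the analysis described above)."""
    parts = urlsplit(uri)
    host, path = parts.netloc, parts.path
    checks = {
        "long string": lambda s: re.search(r"[a-zA-Z]{10,}", s) is not None,
        "numbers": lambda s: re.search(r"[0-9]", s) is not None,
        "change in case": lambda s: re.search(r"[a-z][A-Z]", s) is not None,
        "percent-encoding": lambda s: re.search(r"%[0-9A-Fa-f]{2}", s) is not None,
        "date": lambda s: re.search(r"\d{4}/\d{2}/\d{2}", s) is not None,
    }
    report = {name: {"hostname": f(host), "path": f(path)}
              for name, f in checks.items()}
    # Patterns that apply to only one URI component:
    report["query"] = bool(parts.query)
    report["port number"] = ":" in host
    report["IP address"] = re.fullmatch(r"[\d.]+(:\d+)?", host) is not None
    return report
```

For instance, `uri_patterns("http://63.135.118.69/")["IP address"]` is `True`, matching the IP-address example above.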
Wikipedia is a web-based encyclopedia, launched in 2001 [30] and available in 304 languages [31]. It contains articles that are categorized, and most also contain a list of external links. For instance, the article shown in Figure 3 is categorized as Old Dominion University, Universities and colleges in Virginia, Educational institutions established in 1930, etc., and contains two external links at the end of the article. If the entity described in the article has an official website, then it will be linked as the "Official website" in the list of external links. We use Python Wikipedia packages [12, 21] to extract the information needed.
The Wayback Machine server access logs contain real requests to the Internet Archive's Wayback Machine [28]. The requests are from 295 noncontiguous days from 2011-01-01 to 2012-03-02. A sample of this dataset was used for evaluation. This dataset has been used in other work [4, 6]. Each request (line) contains the following information: client IP, access time, HTTP request method, URI, protocol, HTTP status code, bytes sent, referring URI, and user agent.

In our work, we use a sample from the requests made on Feb 8, 2012, similar to the data selected by AlNoamany et al. [6]. There were 49,026,577 requests on that day. Before collecting a sample to use, we performed several filtering steps. First, we filtered out any requests that did not result in an HTTP 200 status code. We also filtered out any requests with an invalid URI format or extension, non-HTML URIs, an IP address as the domain, or a ccTLD from a non-English speaking country. In addition, we filtered out requests that resulted in HTML with a non-English HTML language code. This filtering left 732,130 unique URIs.
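A sketch of this filtering, assuming the status code and URI have already been parsed out of each log line. The ccTLD set and extension list here are illustrative subsets of our choosing, not the paper's exact rules.

```python
import re

# Illustrative subset of ccTLDs from non-English speaking countries.
NON_ENGLISH_CCTLDS = {".cn", ".jp", ".de", ".fr", ".ru"}
# Illustrative subset of non-HTML file extensions.
NON_HTML_EXT = re.compile(r"\.(?:jpe?g|png|gif|css|js|ico|pdf|xml)$", re.I)
IP_DOMAIN = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")

def keep_request(status_code, uri):
    """Return True if a logged request survives the filtering steps."""
    if status_code != 200:
        return False                       # keep only HTTP 200 responses
    host = uri.split("/")[2].split(":")[0] if "://" in uri else ""
    if not host or IP_DOMAIN.match(host):
        return False                       # invalid URI or IP-address domain
    if any(host.endswith(tld) for tld in NON_ENGLISH_CCTLDS):
        return False                       # non-English ccTLD
    if NON_HTML_EXT.search(uri):
        return False                       # non-HTML resource
    return True
```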
Our recommendation algorithm, shown in Algorithm 1, is composed of four main steps, each of which is described in more detail in the following subsections. As with the current method of searching a web archive, the user provides a requested URI and, optionally, a desired datetime. Our goal is to provide recommendations for other archived web pages based on the requested URI, which we assume is "lost": neither available on the live web nor archived. The first step is to obtain a first-level classification of the URI, using DMOZ or Wikipedia. This results in a high-level category for the URI, such as "Computers" or "Business", similar to those in Table 1. We then use machine learning techniques to obtain a deeper categorization, such as "Computers/Computer_Science/Academic_Departments/North_America/United_States/Virginia". Once this categorization is obtained, we can collect candidates from other URIs in the same category in DMOZ or Wikipedia. Then we filter out any candidates that are not archived, and finally we rank and recommend candidates based on several features, such as archival quality, web page popularity, temporal similarity, and URI similarity.
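The four-step flow can be sketched as a pipeline in which each stage is an assumed callable; the names here are ours, not the paper's, and the classifiers, candidate collection, archive lookup, and scorer stand in for the components described in the following subsections.

```python
def recommend(requested_uri, ontology, classify_level_one, classify_deep,
              collect_candidates, is_archived, score):
    """Sketch of the recommendation pipeline: look up or classify the
    requested URI, collect same-category candidates, keep only the
    archived ones, and return them ranked by score (highest first)."""
    if requested_uri in ontology:
        category = ontology[requested_uri]              # already categorized
    else:
        top = classify_level_one(requested_uri)         # Step 1
        category = classify_deep(requested_uri, top)    # Step 2
    candidates = collect_candidates(category)
    archived = [c for c in candidates if is_archived(c)]  # Step 3
    return sorted(archived, key=score, reverse=True)      # Step 4
```

Any concrete classifier, archive-membership check (e.g., via MemGator), and scoring function can be plugged in without changing the pipeline's shape.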
The first step is to determine if the requested URI is already present and categorized in DMOZ or Wikipedia. Using DMOZ is straightforward; we check whether the URI exists in DMOZ or not. However, in Wikipedia we check if the requested URI is the official website (by searching for the keyword "official website") and is categorized. For example, if the requested URI is http://odu.edu, we use the URI to find a related Wikipedia web page. In this example we find that the Wikipedia web page https://en.wikipedia.org/wiki/Old_Dominion_University mentions http://odu.edu as the official website. Then we collect the categories that this web page belongs to, such as
Old Dominion University, Universities and colleges in Virginia, Educational
Figure 3: Searching for the request http://odu.edu in Wikipedia resulted in finding the Wikipedia web page https://en.wikipedia.org/wiki/Old_Dominion_University, which contains the requested URI as the official website in the external links section. We use other web pages in the same categories (at the end of the page) as candidate web pages.
Algorithm 1: Algorithm for recommending archived web pages using only the URI

▷ Step 1: Classify the URI (level one)
function Classify_URI_level_one(requested_URI)
    Tokenize(requested_URI)
    ML(requested_URI)
end function

▷ Step 2: Deep classify the URI (deep levels)
function Classify_URI_deep_levels(requested_URI)
    Index_dataset_by_category()
    Cosine_similarity(requested_URI)
    Get_top_N_candidates(Candidates)
    Create_and_prune_tree(Candidates)
    ML(Candidates)
end function

▷ Step 3: Filter candidates
function Archived(Candidates)
    for Candidate in Candidates do
        if Candidate is archived then
            add Candidate to Archived_Candidates
        end if
    end for
end function

▷ Step 4: Score and rank candidates
function Rank(Archived_Candidates)
    Score(Archived_Candidates)
    Get_top_N_candidates(Archived_Candidates)
end function

▷ Main function
function Recommending_Archived_Web_Pages(requested_URI)
    if requested_URI not in a_classified_ontology then
        Classify_URI_level_one(requested_URI)    ▷ Step 1
        Classify_URI_deep_levels(requested_URI)  ▷ Step 2
    end if
    Collect_All_Candidates(requested_URI)
    Archived(Candidates)                         ▷ Step 3
    Rank(Archived_Candidates)                    ▷ Step 4
end function

institutions established in 1930, etc. Then we collect as candidates all of the official web pages that these categories contain.

To test how often this option might be available, we used the Wayback Machine access logs (Section 3.3). From the filtered set, we found that 13.17% of the URIs were in DMOZ or Wikipedia.
For a request that did not appear in an ontology, we will classify it using only the tokens from the URI. We test three different methods of tokenization. First, we use URI tokens that are split by non-alphanumeric characters. Second, we use all-grams from the tokens. Third, we use all-grams from the URI.
Table 5: Tokenizing the URI https://odu.edu/compsci using different methods [9]

Method                  Result
Tokens                  odu, edu, compsci
All-grams from tokens   odu, edu, comp, omps, mpsc, psci, comps, ompsc, mpsci, compsc, ompsci, compsci
All-grams from URI      odue, dued, uedu, educ, duco, ucom, comp, omps, mpsc, psci, odued, duedu, ueduc, educo, ducom, ucomp, comps, ompsc, mpsci, oduedu, dueduc, ueduco, educom, ducomp, ucomps, compsc, ompsci, odueduc, dueduco, ueducom, educomp, ducomps, ucompsc, compsci, odueduco, dueducom, ueducomp, educomps, ducompsc, ucompsci
Tokenize the URI. To classify the URI, we need to extract meaningful keywords, or tokens, from the URI. We adopt the three methods proposed by Baykan et al. [9]:

• Tokens. The URI is split into potentially meaningful tokens. The URI is converted to lower-case and then split into tokens using any non-alphabetic character as a delimiter. Finally, the "http" (or "https") token is removed, along with any resulting token of length 2 or less.
• All-grams from tokens. The URI tokens are converted to all-grams. We perform the tokenization as above and then generate all-grams on the tokens by combining the 4-, 5-, 6-, 7-, and 8-grams of the tokens.
• All-grams from the URI. The URI is converted to all-grams without tokenizing first. Any punctuation and numbers are removed from the URI, along with "http" (or "https"). Then the result is converted to lowercase. Finally, the all-grams are generated by combining the 4-, 5-, 6-, 7-, and 8-grams of the remaining URI characters.

An example of the different tokenization methods is shown in Table 5. Using these methods, we also examine removing the TLDs from the URIs, removing numbers, and removing stop words (Section 4.2.2).

To determine the best tokenization method, as a baseline we tested the classification of tokens on the DMOZ dataset using machine learning. We took the DMOZ dataset and created a 10-fold cross-validation set, using 90% for training and 10% for testing. We employed a Naïve Bayes classifier to take tokens and return the top-level category. Naïve Bayes was selected because of its simplicity, as it assumes independence between the features. In the testing dataset we filtered out URIs that contain tokens not seen in the training set, as was also done in related work [9].

We measured the F score to evaluate the different tokenization methods. Table 6 shows the result of our evaluation. In addition to the base tokenization methods described above, we also tested the following alternatives for each method:

• remove TLD before tokenization
• remove TLD and numbers before tokenization
• remove TLD, numbers, and stop words before tokenization

The stop words were based on the set of stop words in the Natural Language Toolkit (NLTK, https://nltk.org/). We found that using the all-grams from the URI after removing the TLD and numbers had the highest F score, which was comparable to results obtained in related work [25]. We use this method of tokenization going forward.

Table 6: Classifying at the first level, comparing F score, micro-average, and macro-average for the DMOZ dataset using different methods

Method                  Tokenization                             F score   Micro avg   Macro avg
Tokens                  All URI tokens
                        URI tokens, without TLD
                        URI tokens, without TLD and numbers
                        URI tokens, without TLD and stop words
All-grams from tokens   All URI tokens
                        URI tokens, without TLD
                        URI tokens, without TLD and numbers
                        URI tokens, without TLD and stop words
All-grams from URI      All URI tokens
                        URI tokens, without TLD
                        URI tokens, without TLD and numbers      0.59      0.62        0.61
                        URI tokens, without TLD and stop words

Classify the URI using Machine Learning. Now that we have determined the best tokenization method, we apply it to future requests. We trained the Naïve Bayes classifier on the entire DMOZ dataset, and this is used for classification as the baseline at the first level. We take the requested URI, remove the TLD and numbers, and then perform the all-grams-from-URI tokenization described in the previous section. The resulting all-grams are used in the classifier to produce a first-level classification.
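The three tokenization methods can be sketched as follows. This is a simplified implementation that reproduces the Table 5 output for https://odu.edu/compsci; edge cases (e.g., the string "http" appearing inside a hostname) are not handled the way a production tokenizer would.

```python
import re

def tokens(uri):
    # Split the lowercased URI on non-alphabetic characters, then drop
    # the scheme token and any token of length 2 or less.
    parts = re.split(r'[^a-z]+', uri.lower())
    return [t for t in parts if len(t) > 2 and t not in ('http', 'https')]

def ngrams(s, n):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def allgrams_from_tokens(uri):
    # Keep tokens shorter than 4 characters as-is; expand longer tokens
    # into their 4- through 8-grams.
    out = []
    for t in tokens(uri):
        if len(t) < 4:
            out.append(t)
        else:
            for n in range(4, 9):
                out.extend(ngrams(t, n))
    return out

def allgrams_from_uri(uri):
    # Remove the scheme, punctuation, and digits, then take the 4-
    # through 8-grams of the remaining character string.
    s = re.sub(r'https?|[^a-z]', '', uri.lower())
    out = []
    for n in range(4, 9):
        out.extend(ngrams(s, n))
    return out
```

Running `allgrams_from_uri("https://odu.edu/compsci")` starts with odue, dued, uedu, educ, ..., matching the all-grams-from-URI row of Table 5.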
In this step we want to classify the requested URI, e.g., http://cs.odu.edu/compsci, into a hierarchical deep classification such as Computers/Computer_Science/Academic_Departments/North_America/United_States/Virginia. Known methods for hierarchical deep classification are the big-bang approach and the top-down approach [27]. Neither method is ideal with a large number of hierarchies, and both may result in error propagation. For this reason we adopt the method of Xue et al. [32]; but as opposed to that work, we are limited to the URI only and do not have the documents or any supporting details.

(1) Index dataset. In preparation to compute similarity between the requested URI and the category entries, we index DMOZ by category, creating a list of all URIs in each of the DMOZ deep-level categories.
(2) Cosine similarity. We compute the cosine similarity between the tokenized requested URI and the tokenized URIs, with their titles and descriptions, in each category. In this step each category of the index gets a similarity score to the requested URI, which is the average similarity to all entries in that category.
(3) Collect N candidates. Next we select the top 10 candidate categories with the highest similarity scores, similar to related work [32].
(4) Prune tree. Each candidate category could be a leaf node or an internal node. We create a hierarchical tree and then prune it to get the final list of candidates that we can classify with machine learning. First, we create a tree from the candidates by starting from the first node and then going down until all 10 candidates are present, as shown in Figure 4a. Next, in order to enhance the classification, the tree is pruned based on the ancestor-assistance strategy. The ancestor-assistance strategy includes the ancestors of a node if there are no common ancestors with another candidate, as shown in Figure 4b.
(5) Classify. To choose a single classification from the pruned tree, we classify the requested URI based on two methods, using 3-gram tokens and all-grams. The 3-gram method had the best result when comparing documents [32]; however, in our work we compare URI tokens, so we expect the all-gram method to perform better.
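Steps (1)-(3) above can be sketched with a term-frequency cosine similarity. This is a minimal illustration: `category_index`, mapping each deep-level category path to the token lists of its entries (URI tokens plus title/description tokens), is an assumed structure, and the function names are ours.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_categories(request_tokens, category_index, n=10):
    """Score each deep-level category by the average cosine similarity
    of the request to its entries, and keep the top n categories."""
    req = Counter(request_tokens)
    scores = {}
    for category, entries in category_index.items():
        sims = [cosine(req, Counter(entry)) for entry in entries]
        scores[category] = sum(sims) / len(sims)
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

The resulting category paths would then be arranged into a tree and pruned with the ancestor-assistance strategy (step 4) before the final classification (step 5).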
Step 3 in our algorithm is to ensure that all recommendations come from a web archive. We take the candidates from Step 2 and remove any that are not archived. We use MemGator [2] to determine this. In Step 4, we rank and recommend the remaining candidates based on temporal similarity (t), web page popularity (p), URI similarity (s), and archival quality (q). Our final list of recommended web pages is ranked based on Equation 1, where w_t + w_p + w_s + w_q = 1.0 and the weights specify the importance given to each of the features.

score = w_t * t + w_p * p + w_s * s + w_q * q    (1)

Temporal similarity. Temporal similarity refers to how close the available candidate web page's Memento-Datetime [29] is to the requested datetime. It is computed with Equation 2, where r_d is the request datetime, c_d is the candidate datetime, u_d is the current datetime, and e_d is the earliest datetime. The earliest datetime is considered 1996, because that is when archiving the Web started (https://archive.org/about/).

t = |r_d − c_d| / (u_d − e_d)    (2)

Figure 4: The process of pruning a hierarchical tree using the ancestor-assistance strategy [32]. (a) Create a hierarchical tree from the 10 candidate categories (the candidate categories are highlighted); the numbers represent the category ID. (b) Pruned tree using the ancestor-assistance strategy; the parents of nodes 88 and 100 are included because they have no shared ancestor with other candidates.

Web page popularity. We use how often the web page has been archived and the domain's popularity as determined by Alexa (https://alexa.com) as an approximation of the web page's popularity. Our popularity measure p is given in Equation 3, where a is the Alexa Global Ranking of the requested domain, x is the lowest-ranked domain in Alexa, n is the number of times the URI has been archived, and m is the number of times Alexa's top-ranked web site has been archived.

p = |log a / log x − 1| + log n / log m    (3)

We set x to 30,000,000, as it is the current lowest ranking in Alexa, and we set m to 538,300, the number of times that http://google.com, the top-ranked Alexa web page, has been archived.

URI similarity. We measure the similarity of requested URI tokens and candidate URI tokens using the Jaccard similarity coefficient (Equation 4).

s = |A ∩ B| / (|A| + |B| − |A ∩ B|)    (4)

Archival quality. Archival quality refers to how well the page is archived. We use Memento-Damage [26] to calculate the impact of missing resources in the web page. We calculate archival quality with Equation 5, where d is the damage score calculated by Memento-Damage.

q = |d − 1|    (5)

Here we present an example of a request and the resulting recommendations. We request http://odu.edu/compsci with the date of March 1, 2014. This URI is not classified in DMOZ or in Wikipedia, so we use machine learning and classify it to Computers/Computer_Science/Academic_Departments/North_America/United_States/Virginia. Then we collect all the candidates from DMOZ:

• http://cs.gmu.edu
• http://cs.odu.edu
• http://cs.virginia.edu
• http://cs.vt.edu
• http://wm.edu/as/computerscience/?svr=web
• http://radford.edu/content/csat/home/itec.html
• http://cs.jmu.edu
• https://php.radford.edu/~itec
• http://mathcs.richmond.edu
• http://hollins.edu/academics/computersci

Using equal weights (w_t = w_p = w_s = w_q) in our ranking equation, the top three ranked candidates are:

(1) https://web.archive.org/web/20140226090846/http://cs.odu.edu:80/, score = 0.87
(2) https://web.archive.org/web/20140208043915/http://cs.virginia.edu/, score = 0.75
(3) https://web.archive.org/web/20140223213510/http://cs.jmu.edu/, score = 0.73

First, we evaluate how well our deep classification method works (Step 3). To test this step, we use 10% of the DMOZ dataset for testing and the rest for training. We assume that the level-one categorization is already correctly predicted in Step 1.
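The four ranking features and the combined score (Equations 1-5) can be sketched as follows. The constants x and m take the values stated in the text, the equal weighting is one possible choice, and the function names are ours.

```python
import math
from datetime import datetime

# Equation 1 weights; they must sum to 1.0 (equal weighting shown here).
W_T = W_P = W_S = W_Q = 0.25

X_RANK = 30_000_000   # x: lowest Alexa ranking (from the text)
M_CAPTURES = 538_300  # m: captures of the top-ranked site, google.com

def temporal(r_d, c_d, u_d, e_d=datetime(1996, 1, 1)):
    # Equation 2: datetime difference normalized by the archive's span.
    return abs((r_d - c_d).total_seconds()) / (u_d - e_d).total_seconds()

def popularity(a, n):
    # Equation 3: a is the Alexa rank of the domain, n the number of
    # times the URI has been archived.
    return abs(math.log(a) / math.log(X_RANK) - 1) + math.log(n) / math.log(M_CAPTURES)

def uri_similarity(a_tokens, b_tokens):
    # Equation 4: Jaccard coefficient of the two token sets.
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / (len(a) + len(b) - len(a & b))

def archival_quality(d):
    # Equation 5: d is the Memento-Damage score of the candidate.
    return abs(d - 1)

def score(t, p, s, q):
    # Equation 1: weighted combination of the four features.
    return W_T * t + W_P * p + W_S * s + W_Q * q
```

For example, a candidate with damage score d = 0.13 (the 2013 Internet Archive average) has archival quality q = 0.87.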
We evaluate the performance by determining if we classified each level correctly. For example, if a URI is actually in the category c1/c2/c3, then for level-two evaluation we check if we predicted c1/c2. For each level we calculate the micro-average F1 (Mi-F1) score. In Figure 5, we show the Mi-F1 score at each level using 3-gram cosine similarity. The highest score in our results was 0.2, compared to 0.8 in the related work [32], but that is because we use only the requested URI as testing data and the URI, title, and category as training data, as opposed to using the text of the full document as in [32]. This shows that using only the tokens from the URI is not enough for deep classification. Because of this limited information, we also show the result of testing the same method using all-gram cosine similarity. We found that the results are better, but still low compared to the related work.

Figure 5: Performance on classifying to different levels using 3-gram and all-gram

Some features could affect the URI classification. We investigated the relationship between the depth of the URI and classification. Table 7 shows the URI depth and the percentage of correctly classified URIs.

Table 7: URI depth and percentage of correctly classified URIs

We also checked whether the URI tokens are dictionary words, using pyenchant to check for dictionary words and wordninja to split compound words. For example, the URI http://mickeymantlebaseballcards.net is split into mickey, mantle, baseball, and cards. We found that 36.92% of the correctly classified URIs contain only words from a dictionary, and 44.89% of the correctly classified URIs contain at least one word from a dictionary.

An ideal URI structure contains long strings that carry more semantics. We try to identify a "slug", which is the part of a URI that contains keywords or the web page title. An example of a slug is the path in https://cnn.com/2017/07/31/health/climate-change-two-degrees-studies/index.html.
The slug in the URI is readable, and we can identify what the web page is about. We evaluate the existence of long strings in the correctly classified URIs.

Table 8: Percentage of correctly classified URIs for each category

Category    Count  Percent
Society     459    15.32%
Arts        401    13.38%
Shopping    355    11.85%
Recreation  331    11.05%
Sports      291    9.71%
Home        288    9.61%
Reference   238    7.94%
Computers   228    7.61%
Health      190    6.34%
Science     130    4.34%
Games       50     1.67%
Business    35     1.17%
News
Total
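wordninja performs the compound-word splitting described above with a frequency-weighted dynamic program over a large English word list. A simplified, dictionary-based sketch of the same idea follows; the four-word vocabulary is purely illustrative.

```python
def segment(s, vocab):
    """Split a compound string into vocabulary words via dynamic programming.

    dp[i] holds a segmentation of s[:i] into words from vocab, or None
    if no such segmentation exists.
    """
    dp = [None] * (len(s) + 1)
    dp[0] = []
    for i in range(1, len(s) + 1):
        for j in range(i):
            if dp[j] is not None and s[j:i] in vocab:
                dp[i] = dp[j] + [s[j:i]]
                break
    return dp[len(s)]

vocab = {"mickey", "mantle", "baseball", "cards"}
print(segment("mickeymantlebaseballcards", vocab))
# ['mickey', 'mantle', 'baseball', 'cards']
```

Unlike this sketch, wordninja ranks competing segmentations by word frequency, so it can choose the most plausible split when several exist.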
Table 9: Top-level domain analysis for the Wayback Machine server access logs dataset (columns: TLD, Num. URIs, Percent; rows for com, net, org, edu, gov, us, others, and Total)
Table 10: Depth analysis for the Wayback access log dataset (columns: Depth, Count, Percent)
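The per-level evaluation used in this work (a prediction is counted correct at level k when its first k category components match the ground-truth DMOZ path) can be sketched as follows; the helper names are ours.

```python
def correct_at_level(true_path, predicted_path, level):
    # Compare the first `level` components of the two category paths,
    # e.g. level 2 checks c1/c2 of a true path c1/c2/c3.
    t = true_path.split("/")[:level]
    p = predicted_path.split("/")[:level]
    return len(t) == level and t == p

def accuracy_at_level(pairs, level):
    # Fraction of (true, predicted) path pairs correct at the given level.
    return sum(correct_at_level(t, p, level) for t, p in pairs) / len(pairs)
```

For single-label classification like this, the micro-average F1 at a level reduces to this per-level accuracy, since every URI receives exactly one prediction.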
In this work, we recommend web pages from a web archive for a requested "lost" URI. Our work proposes a method to enhance the current response from web archives when a URI cannot be found (Figure 1a). We used both DMOZ and Wikipedia to classify the request and find candidates. First, we check if the requested URI is classified in DMOZ or Wikipedia. If the requested URI is not pre-classified, then we classify the URI using first-level classification followed by deep classification. This step results in a list of candidates that we filter based on whether the web page is archived. Next, we score and rank the candidates based on archival quality, web page popularity, temporal similarity, and URI similarity.

We found that the best method to classify the first level is using all-grams from the URI while filtering the TLD and numbers from the URI. Using a Naïve Bayes classifier resulted in an F1 score of 0.59. For the deep classification, we measured the accuracy at each classification level. For second-level classification, the micro-average F1 = 0.30, and for third-level classification, F1 = 0.15. We also found that 44.89% of the correctly classified URIs contain a word that exists in a dictionary, and 50.07% of the correctly classified URIs contain long strings in the domain. We also analyzed the properties of a sample of URIs requested from the Wayback Machine and found that the large majority were of depth 0, meaning that our classification will rely largely on domain information.

Future work includes adding other languages, filtering spam web pages, and ranking based on how long the web page has not been live. For popularity, if the access log is saved, we can measure how frequently the URI was requested from the archive. For temporal similarity, we can measure the closeness of the creation dates of the request and the candidate.

This work is supported in part by the National Science Foundation, IIS-1526700.
REFERENCES

[1] Scott G. Ainsworth, Ahmed Alsum, Hany M. SalahEldeen, Michele C. Weigle, and Michael L. Nelson. 2011. How Much of the Web is Archived?. In Proceedings of the 11th IEEE/ACM Joint Conference on Digital Libraries (JCDL). 133–136.
[2] Sawood Alam and Michael L. Nelson. 2016. MemGator - A Portable Concurrent Memento Aggregator: Cross-Platform CLI and Server Binaries in Go. In Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL). 243–244.
[3] Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle. 2017. Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages. ACM Transactions on Information Systems (TOIS) 36, 1 (2017), 1:1–1:34.
[4] Yasmin AlNoamany. 2016. Using Web Archives to Enrich the Live Web Experience Through Storytelling. Ph.D. Dissertation. Old Dominion University.
[5] Yasmin AlNoamany, Ahmed AlSum, Michele C. Weigle, and Michael L. Nelson. 2014. Who and What Links to the Internet Archive. International Journal on Digital Libraries (IJDL) 14, 3-4 (2014), 101–115.
[6] Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson. 2013. Access Patterns for Robots and Humans in Web Archives. In Proceedings of the 13th IEEE/ACM Joint Conference on Digital Libraries (JCDL). 339–348.
[7] Ahmed AlSum. 2014. Web Archive Services Framework for Tighter Integration Between the Past and Present Web. Ph.D. Dissertation. Old Dominion University.
[8] Eda Baykan, Monika Henzinger, Ludmila Marian, and Ingmar Weber. 2009. Purely URL-based Topic Classification. In Proceedings of the 18th International Conference on World Wide Web (WWW). 1109–1110.
[9] Eda Baykan, Monika Henzinger, Ludmila Marian, and Ingmar Weber. 2011. A Comprehensive Study of Features and Algorithms for URL-based Topic Classification. ACM Transactions on the Web (TWEB) 5, 3 (2011), 15.
[10] Justin F. Brunelle, Mat Kelly, Hany SalahEldeen, Michele C. Weigle, and Michael L. Nelson. 2015. Not All Mementos Are Created Equal: Measuring the Impact of Missing Resources. International Journal on Digital Libraries (IJDL) 16, 3-4 (2015), 283–301.
[11] M. Indra Devi, R. Rajaram, and K. Selvakuberan. 2007. Machine Learning Techniques for Automated Web Page Classification Using URL Features. In Proceedings of the International Conference on Computational Intelligence and Multimedia Applications (ICCIMA), Vol. 2. 116–120.
[12] Jonathan Goldsmith. 2016. A Pythonic Wrapper for the Wikipedia API. https://github.com/goldsmith/Wikipedia.
[13] Hugo C. Huurdeman, Anat Ben-David, Jaap Kamps, Thaer Samar, and Arjen P. de Vries. 2014. Finding Pages on the Unarchived Web. In Proceedings of the 14th IEEE/ACM Joint Conference on Digital Libraries (JCDL). 331–340.
[14] Hugo C. Huurdeman, Jaap Kamps, Thaer Samar, Arjen P. de Vries, Anat Ben-David, and Richard A. Rogers. 2015. Lost but Not Forgotten: Finding Pages on the Unarchived Web. International Journal on Digital Libraries (IJDL) 16, 3-4 (2015), 247–265.
[15] Shawn M. Jones. 2018. A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages. https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html.
[16] Brewster Kahle. 2019. "703,726,890,000 URL's now in the @waybackmachine by the @internetarchive! (703 billion) Over a billion more added each week. The Web is a grand experiment in sharing and giving. Loving it! http://web.archive.org/". https://twitter.com/brewster_kahle/status/1087515601717800960. (21 January 2019).
[17] Min-Yen Kan. 2004. Web Page Classification Without the Web Page. In Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers and Posters. 262–263.
[18] Min-Yen Kan and Hoang Oanh Nguyen Thi. 2005. Fast Webpage Classification Using URL Features. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM). 325–326.
[19] Nattiya Kanhabua, Philipp Kemkes, Wolfgang Nejdl, Tu Ngoc Nguyen, Felipe Reis, and Nam Khanh Tran. 2016. How to Search the Internet Archive Without Indexing It. In Proceedings of the International Conference on Theory and Practice of Digital Libraries (TPDL). 147–160.
[20] Martin Klein and Michael L. Nelson. 2014. Moved but Not Gone: An Evaluation of Real-time Methods for Discovering Replacement Web Pages. International Journal on Digital Libraries (IJDL) 14, 1-2 (2014), 17–38.
[21] Martin Majlis. 2019. Python Wrapper for Wikipedia. https://github.com/martin-majlis/Wikipedia-API.
[22] David D. Palmer. 1997. A Trainable Rule-based Algorithm for Word Segmentation. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics. 321–328.
[23] John R. Pierce. 2012. An Introduction to Information Theory: Symbols, Signals and Noise. Courier Corporation.
[24] R. Rajalakshmi and Chandrabose Aravindan. 2011. Naive Bayes Approach for Website Classification. In Proceedings of Information Technology and Mobile Communication, Communications in Computer and Information Science, Vol. 147.
[25] R. Rajalakshmi and Chandrabose Aravindan. 2013. Web Page Classification Using N-gram Based URL Features. In Proceedings of the 5th International Conference on Advanced Computing (ICoAC). 15–21.
[26] Erika Siregar. 2017. Deploying the Memento-Damage Service. https://ws-dl.blogspot.com/2017/11/2017-11-22-deploying-memento-damage.html.
[27] Aixin Sun and Ee-Peng Lim. 2001. Hierarchical Text Classification and Evaluation. In Proceedings of the IEEE International Conference on Data Mining (ICDM). 521–528.
[28] Brad Tofel. 2007. Wayback for Accessing Web Archives.
[29] Herbert Van de Sompel, Michael L. Nelson, and Robert Sanderson. 2013. HTTP Framework for Time-Based Access to Resource States - Memento, Internet RFC 7089. http://tools.ietf.org/html/rfc7089.
[30] Wikipedia. [n. d.]. History of Wikipedia. https://en.wikipedia.org/wiki/History_of_Wikipedia.
[31] Wikipedia. [n. d.]. List of Wikipedias. https://en.wikipedia.org/wiki/List_of_Wikipedias.
[32] Gui-Rong Xue, Dikan Xing, Qiang Yang, and Yong Yu. 2008. Deep Classification in Large-Scale Text Hierarchies. In