Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure
Martin Klein
Department of Computer Science, Old Dominion University, Norfolk, VA 23529
[email protected]

Michael L. Nelson
Department of Computer Science, Old Dominion University, Norfolk, VA 23529
[email protected]
ABSTRACT
Missing web pages (pages that return the 404 "Page Not Found" error) are part of the browsing experience. The manual use of search engines to rediscover missing pages can be frustrating and unsuccessful. We compare four automated methods for rediscovering web pages: we extract the page's title, generate the page's lexical signature (LS), obtain the page's tags from the bookmarking website delicious.com and generate a LS from the page's link neighborhood. We use the output of all methods to query Internet search engines and analyze their retrieval performance. Our results show that both LSs and titles perform fairly well, with over 60% of URIs returned top ranked from Yahoo!. However, the combination of methods improves the retrieval performance. Considering the complexity of LS generation, querying the title first and, in case of insufficient results, querying the LS second is the preferable setup. This combination accounts for more than 75% top ranked URIs.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
General Terms
Measurement, Performance, Design, Algorithms
Keywords
Web Page Discovery, Digital Preservation, Search Engines
1. INTRODUCTION
Inaccessible web pages and "404 Page Not Found" responses are part of the web browsing experience. Despite guidance for how to create "Cool URIs" that do not change [6], there are many reasons why URIs or even entire websites break [25]. However, we claim that information on the web is rarely completely lost, it is just missing. In whole or in
part, content is often just moving from one URI to another. It is our intuition that major search engines like Google, Yahoo! and MSN Live (our experiments were conducted before Microsoft introduced Bing), as members of what we call the Web Infrastructure (WI), have likely crawled the content and possibly even stored a copy in their cache. Therefore the content is not lost, it "just" needs to be rediscovered. The WI, explored in detail in [17, 26, 29], also includes (besides search engines) non-profit archives such as the Internet Archive (IA) or the European Archive, as well as large-scale academic digital data preservation projects, e.g., CiteSeer and NSDL.

It is commonplace for content to "move" to different URIs over time. Figure 1 shows two snapshots as an example of a web page whose content has moved within one year after its creation. Figure 1(a) shows the content of the original URI of the Hypertext 2006 conference as displayed in 12/2009. The original URI clearly does not hold conference related content anymore. Our suspicion is that the website administrators did not renew the domain registration, thereby enabling someone else to take over the domain. However, the content is not lost. It is now available at a new URI, http://hypertext.expositus.com/, as shown in Figure 1(b).

[Figure 1: The Content of the Website for the Conference Hypertext has Moved over Time; (a) Original URI, new (unrelated) Content, (b) Original Content, new URI]

In this paper we investigate the retrieval performance of four methods that can be automated and, together with the WI, used to discover missing web pages. These methods are:

1. lexical signatures (LSs) – typically the 5-7 most significant keywords extracted from a cached copy of the missing page that capture its "aboutness"
2. the title of the page – the two underlying assumptions here are that web pages have descriptive titles and that titles change only infrequently over time
3. social bookmarking tags – terms suggested by Internet users on delicious.com when the page was bookmarked
4. link neighborhood LSs (LNLSs) – a LS generated from the pages that link to the missing page (inlinks) and not from a cached copy of the missing page itself.

Figure 2 displays the scenario in which the four methods of interest can automatically be applied for the discovery of a missing page. The occurrence of a 404 error is displayed in the first step. Search engine caches and the IA will consequently be queried with the URI requested by the user. In case older copies of the page are available, they can be offered to the user. If the user's information need is satisfied, nothing further needs to be done (step (2)). If this is not the case, we proceed to step (3) where we extract titles, try to obtain tags about the URI and generate LSs from the obtained copies. The obtained terms are then queried against live search engines. The returned results are again offered to the user and, in case the outcome is not satisfying, more sophisticated and complex methods need to be applied (step (5)). Search engines can be queried to discover pages linking to the missing page. The assumption is that the aggregate of those pages is likely to be "about" the same topic. From this link neighborhood a LS can be generated. At this point the approach is the same as the LS method, with the exception that the LS has been generated from a link neighborhood and not from a cached copy of the page itself.
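The scenario reads as a simple escalation cascade. The sketch below illustrates the control flow of Figure 2 in Python; all helpers are injected callables because the concrete services (cache lookup, title extraction, LS and LNLS generation, search) are hypothetical stand-ins for WI and search engine APIs, not an implementation from this paper.

```python
from typing import Callable, List, Optional

def rediscover(uri: str,
               cached_copy: Callable[[str], Optional[str]],
               title: Callable[[str], str],
               ls: Callable[[str], str],
               tags: Callable[[str], str],
               lnls: Callable[[str], str],
               search: Callable[[str], List[str]]) -> List[str]:
    """Escalate through the methods of Figure 2 until results are found.

    Each argument after `uri` stands in for one WI or search engine
    service; none of them is defined by the paper.
    """
    copy = cached_copy(uri)            # steps (1)-(2): caches / IA
    if copy is not None:
        # steps (3)-(4): cheap page-derived queries first
        for query in (title(copy), ls(copy), tags(uri)):
            results = search(query)
            if results:
                return results
    # steps (5)-(6): fall back to the link neighborhood LS
    return search(lnls(uri))
```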
The important point of this scenario is that it works while the user is browsing and therefore has to provide results in real time. Queries against search engines can be automated through APIs, but the generation of LSs needs to be automated too.

As an example, let us look at how the methods would be applied to the web page nicnichols.com. The page is about a photographer named Nic Nichols. Table 1 displays all data we obtained about the page using the four methods. The question is now: if nicnichols.com went missing (returned an HTTP 404 response code), which of the four methods will produce the best search engine query to rediscover Nic Nichols' website if it moved to a new URI?
[Figure 2: Process to Rediscover Missing Web Pages]

To further illustrate the difference between titles and LSs, we compare their retrieval performance with the following three examples. From the URI smiledesigners.org we derive a LS and a title (T):

• LS: "Dental Imagined Pleasant Boost Talent Proud Ways"
• T: "Home"

When queried against Google, the LS returns the URI top ranked, but since the title is rather arbitrary it does not return the URI within the top 100 results. From a second URI we get:

• LS: "Marek Halloween Ready Images Schwarzenegger Govenor Villaraigosa"
• T: "American Red Cross of Greater Los Angeles"

The LS contains terms that were part of the page at the time of the crawl but are less descriptive. Hence the URI remains undiscovered. The title of the page, however, performs much better and returns the URI top ranked. From the last example URI we obtain:

• LS: "Charter Aircraft Jet Air Evacuation Medical Medivac"
• T: "ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International"

Both describe the page's content very well and return the URI top ranked.

LS     NICNICHOLS NICHOLS NIC STUFF SHOOT COMMAND PENITENTIARY
Title  NICNICHOLS.COM : DOCUMENTARY TOY CAMERA PHOTOGRAPHY OF NIC NICHOLS : HOLGA, LOMO AND OTHER LO-FI CAMERAS!
Tags   PHOTOGRAPHY BLOG PHOTOGRAPHER PORTIFOLIO PORTFOLIO INSPIRATION PHOTOGRAPHERS
LNLS   NICNICHOLS PHOTO SPACER VIEW PHIREBRUSH SUBMISSION BOONIKA

Table 1: Data Obtained from nicnichols.com

The contribution of this paper is the performance comparison of all our methods and an interpretation resulting in a suggested workflow on how to set up the investigated methods to achieve the highest possible rate in discovering missing web resources.
2. RELATED WORK

2.1 Missing Web Resources
Missing web pages are a pervasive part of the web experience. The lack of link integrity on the web has been addressed by numerous researchers [3, 4, 8, 9]. In 1997 Brewster Kahle published an article focused on the preservation of Internet resources, claiming that the expected lifetime of a web page is 44 days [18]. A different study of web page availability performed by Koehler [23] shows that a random test collection of URIs eventually reached a "steady state" after approximately 67% of the URIs were lost over a 4-year period. Koehler estimated the half-life of a random web page to be approximately two years. Lawrence et al. [24] found in 2000 that between 23 and 53% of all URIs occurring in computer science related papers authored between 1994 and 1999 were invalid. By conducting a multi-level and partially manual search on the Internet, they were able to reduce the number of inaccessible URIs to 3%. This confirms our intuition that information is rarely lost, it has just moved. This intuition is also supported by Baeza-Yates et al. [5], who show that a significant portion of the web is created based on already existing content.

Spinellis [35] conducted a study investigating the accessibility of URIs occurring in papers published in Communications of the ACM and IEEE Computer Society journals. He found that 28% of all URIs were unavailable after five years and 41% after seven years. He also found that in 60% of the cases where URIs were not accessible, a 404 error was returned. He estimated the half-life of a URI in such a paper to be four years from the publication date. Dellavalle et al. [12] examined Internet references in articles published in journals with a high impact factor given by the Institute for Scientific Information (ISI). They found that Internet references occur frequently (in 30% of all articles) and are often inaccessible within months after publication in the highest impact (top 1%) scientific and medical journals. They discovered that the percentage of inactive references (references that return an error message) increased over time, from 3.8% after 3 months to 10% after 15 months and up to 13% after 27 months. The majority of inactive references they found were in the .com domain (46%) and the fewest in the .org domain (5%). By manually browsing the IA they were able to recover information for about 50% of all inactive references.

Zhuang et al. [40] and Silva et al. [34] have used the web infrastructure to obtain missing documents for digital library collections. Their notion of "missing documents", however, is different from ours since they focus on enhancing existing library records with related (full text) documents. They extract the title, names of authors and publication venues from the library records and use them as search engine queries in order to obtain resources that are not held in the digital library.
The work done by Henzinger et al. [15] is related in the sense that they tried to determine the "aboutness" of news broadcasts. They provide the user with web pages related to TV news broadcasts using a 2-term summary, which can be thought of as a LS. This summary is extracted from the closed captions of the broadcast, and various algorithms are used to compute the scores determining the most relevant terms. The terms are used to query a news search engine, while the results must contain all of the query terms. The authors found that 1-term queries return results that are too vague and 3-term queries too often return zero results. Thus they focus on creating 2-term queries.

He and Ounis' work on query performance prediction [14] is based on the TREC dataset. They measured the retrieval performance of queries in terms of average precision (AP) and found that the AP values depend heavily on the type of the query. They further found that what they call the simplified clarity score (SCS) has the strongest correlation with AP for title queries (using the title of the TREC topics). SCS depends on the actual query length but also on global knowledge about the corpus, such as document frequency and the total number of tokens in the corpus.
Nelson et al. [29] present various models for the preservation of web pages based on the web infrastructure. They argue that conventional approaches to digital preservation, such as storing digital data in archives and applying methods of refreshing and migration, are, due to the implied costs, unsuitable for web scale preservation.

McCown has done extensive research on the usability of the web infrastructure for reconstructing missing websites [26]. He also developed Warrick [28], a system that crawls web repositories such as search engine caches (characterized in [27]) and the index of the IA to reconstruct websites. His system is targeted at individuals and small scale communities that are not involved in large scale preservation projects and suffer the loss of websites.
So far, little research has been done in the field of lexical signatures for web resources. Phelps and Wilensky [31] first proposed the use of LSs for finding content that had moved from one URI to another. Their claim was "robust hyperlinks cost just 5 words each" and their preliminary tests confirmed this. The LS length of 5 terms, however, was chosen somewhat arbitrarily. Phelps and Wilensky proposed "robust hyperlinks", a URI with a LS appended as an argument. They conjectured that if a URI returned a 404 error, the browser would take the LS appended to the URI and submit it to a search engine in order to find the relocated copy.

Park et al. [30] expanded on the work of Phelps and Wilensky, studying the performance of 9 different LS generation algorithms (and retaining the 5-term precedent). The performance of the algorithms depended on the intention of the search. Algorithms weighted for term frequency (TF; "how often does this word appear in this document?") were better at finding related pages, but the exact page would not always be in the top N results. Algorithms weighted for inverse document frequency (IDF; "in how many documents of the entire corpus does this word appear?") were better at finding the exact page but were susceptible to small changes in the document (e.g., when a misspelling is fixed).
3. EXPERIMENT SETUP
We are not aware of a data corpus providing missing web pages. Therefore we need to generate a dataset of URIs taken from the live web and "pretend" they are missing. We know they are indexed by search engines, so by querying the right terms we will be able to retrieve them in the result set.
3.1 Dataset

As shown in [16, 32, 37], finding a small sample set of URIs that represents the Internet is not trivial. Rather than attempt to get an unbiased sample, we randomly sampled 500 URIs from the Open Directory Project dmoz.org. We are aware of the implicit bias of this selection, but for simplicity it shall be sufficient. We dismissed all non-English language pages as well as all pages containing less than 50 terms (this filter was also applied in [21, 30]). Our final sample set consists of a total of 309 URIs: 236 in the .com, 38 in the .org, 27 in the .net and 8 in the .edu domain. We downloaded the content of all pages and excluded all non-textual content such as HTML and JavaScript code.
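As an illustration of the filtering step, the following Python sketch strips markup and script content and applies the 50-term threshold. The tag-stripping approach and the tokenizer are our assumptions, not the paper's code (language detection is omitted).

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.skip = 0
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip:
            self.chunks.append(data)

def page_terms(html: str) -> list:
    """Tokenize the visible text of a page into lowercase terms."""
    parser = TextExtractor()
    parser.feed(html)
    return re.findall(r"[a-zA-Z']+", " ".join(parser.chunks).lower())

def keep(html: str, min_terms: int = 50) -> bool:
    """Apply the paper's size filter: dismiss pages with < 50 terms."""
    return len(page_terms(html)) >= min_terms
```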
3.2 Lexical Signatures

The LS generation is commonly done following the well known and established TF-IDF term weighting concept. TF-IDF extracts the most significant terms from textual content while dismissing more common terms such as stop words. It is often used for term weighting in the vector space model as described by Salton et al. [33]. For the IDF computation, two values are mandatory: the overall number of documents in the corpus and the number of documents the particular term occurs in. Both values can only be estimated when the corpus is the entire web. As a common approach, researchers use search engines to estimate the document frequency of a term [13, 19, 31, 39]. Even though the obtained values are only estimates [1], our earlier work [20] has shown that this approach actually works well compared to using a modern text corpus.

Recent research [13, 21, 30, 38] has shown that a LS generated from the content of the potentially missing web page can be used as a query for the WI when trying to rediscover the page. A LS is generally defined as the top n terms of the list of terms ranked by their TF-IDF values in decreasing order. We have shown in [21] that 5- and 7-term LSs perform best, depending on whether the focus is on obtaining the best mean rank or the highest percentage of top ranked URIs.

Our first experiment investigates the differences in retrieval performance between LSs generated from three different search engines. We use the Google, Yahoo! (BOSS) and MSN Live APIs to determine the IDF values and compute TF-IDF values of all terms. Based on the results of our earlier research, we use 5- and 7-term LSs for each URI and query them against the search engine the LS was generated from. A comparison of the retrieval results from cross search engine queries was not the focus of this paper but can be found in [22]. As an estimate for the overall number of documents in the corpus (the Internet) we use values obtained from [2]. The TF-IDF scores of very common words are very low with a sufficiently large corpus; therefore these terms would very likely not make it into the top n terms from which a LS is generated. Despite that, and to keep the number of queries needed to determine document frequency values low, we dismiss stop words from the textual content of the web pages before computing TF-IDF values. We also ran experiments with stemming algorithms applied, but the resulting LSs performed very poorly and hence we decided not to report the numbers here.
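A minimal sketch of this generation step is shown below, assuming a `doc_freq` callable backed by a search engine API and an externally supplied `corpus_size` estimate. The exact TF normalization and IDF form used in the paper follow [20], so the formula here is illustrative.

```python
import math
from collections import Counter
from typing import Callable, List

def lexical_signature(terms: List[str],
                      doc_freq: Callable[[str], int],
                      corpus_size: int,
                      n: int = 5) -> List[str]:
    """Return the top-n terms of a document ranked by TF-IDF.

    doc_freq(term) is assumed to return a search-engine estimate of the
    number of indexed documents containing the term; corpus_size is an
    external estimate of the total number of indexed documents. Assumes
    a non-empty, stop-word-filtered term list.
    """
    tf = Counter(terms)
    def tfidf(term: str) -> float:
        df = max(doc_freq(term), 1)   # guard against a zero estimate
        return (tf[term] / len(terms)) * math.log(corpus_size / df)
    return sorted(tf, key=tfidf, reverse=True)[:n]
```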
3.3 Titles and Tags

Titles of web pages seem to be commonplace. To confirm this intuition, we randomly sampled another set of URIs from dmoz.org (a total of 10,000 URIs) and parsed their content for the title. The statistics showed that the vast majority of URIs contained a title; in only 1.1% of all cases no title could be discovered. Since we already downloaded all of our URIs, extracting the title is simply done by parsing the page and extracting everything between the HTML <title> tags.

We obtain tags from the bookmarking website delicious.com. Its API provides up to ten terms per URI, and they are ordered by frequency of use, supposedly indicating the importance of the term.
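Since the pages are already downloaded, title extraction amounts to a single pattern match. A regex version is sketched below; the paper does not specify the parsing mechanism, so this is one plausible implementation.

```python
import re

def extract_title(html: str) -> str:
    """Return the text between the <title> tags, or '' if there is none
    (as for roughly 1.1% of the sampled pages)."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return re.sub(r"\s+", " ", match.group(1)).strip() if match else ""
```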
3.4 Link Neighborhood Lexical Signatures

As shown in [36], the content of neighboring pages can help retrieve the centroid page. Their research is based on the idea that the content of a centroid web page is often related to the content of its neighboring pages. This assumption has been proven by Davison in [10] and by Dean and Henzinger in [11].

We download up to 50 inlink pages (pages which have a reference to our centroid page) that the Yahoo! API provides for each of our 309 URIs. We generate a bucket of words from the neighborhood of each URI and apply the same procedure as in 3.2 to generate one LS per page neighborhood. More than 425,000 queries were necessary to determine the document frequency values of the entire neighborhood. Since the Google API is restricted to 1,000 queries per day and the MSN Live API was, in our experiments, not sufficiently reliable for such a query volume, we only use the Yahoo! API for this experiment.
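The neighborhood LS can reuse the TF-IDF routine from Section 3.2 on the pooled inlink text. The sketch below assumes hypothetical `inlinks` and `fetch` callables (e.g., wrapping the Yahoo! API and an HTTP client) plus the `page_terms` and `lexical_signature` helpers sketched earlier.

```python
from typing import Callable, List

def lnls(uri: str,
         inlinks: Callable[[str], List[str]],
         fetch: Callable[[str], str],
         doc_freq: Callable[[str], int],
         corpus_size: int,
         n: int = 5) -> List[str]:
    """Build one LS from the pooled text of up to 50 inlink pages."""
    bucket: List[str] = []
    for inlink_uri in inlinks(uri)[:50]:   # inlink list, e.g., from Yahoo!
        bucket.extend(page_terms(fetch(inlink_uri)))
    return lexical_signature(bucket, doc_freq, corpus_size, n)
```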
4. EXPERIMENT RESULTS
As laid out in Section 3, we use 5- and 7-term LSs, titles and tags as queries to three different search engines. We parse the top 100 returned results for the source URI and distinguish between four scenarios:

1. the URI is returned top ranked
2. the URI is returned in the top 10 but not top ranked
3. the URI is returned in the top 100 but not in the top 10
4. the URI is considered not to be returned.

In the last scenario we consider the URI as undiscovered, since the majority of search engine users do not browse through the result set past the top 100 results. We are aware of the possibility that we are somewhat discriminating against URIs that may be returned just beyond rank 100, but we apply that threshold for simplicity. With these scenarios we evaluate our results as success at 1, 10 and 100. Success is defined as a binary value: the target either occurs in the subset (top result, top 10, top 100) of the entire result set or it does not.
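The binary success measure is straightforward to compute from a ranked result list. The sketch below assumes exact URI string matching; in practice URIs would need normalization (trailing slashes, www prefixes, etc.).

```python
from typing import List

def success_at(target_uri: str, results: List[str], k: int) -> bool:
    """Binary success@k: does the target occur in the top k results?"""
    return target_uri in results[:k]

def scenario(target_uri: str, results: List[str]) -> str:
    """Map a ranked result list to the paper's four scenarios."""
    if success_at(target_uri, results, 1):
        return "top"
    if success_at(target_uri, results, 10):
        return "top10"
    if success_at(target_uri, results, 100):
        return "top100"
    return "undiscovered"
```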
Figure 3 shows the percentage of URIs retrieved top ranked, ranked in the top 10 and top 100, as well as the percentage of URIs that remained undiscovered when using 5- and 7-term LSs. For each of the four scenarios we show three tuples distinguished by color, indicating the search engine the LS was generated from and queried against. The left bar of each tuple represents the results for 5- and the right for 7-term LSs. We can observe an almost binary pattern, meaning the majority of the URIs are either returned ranked between one and ten or are undiscovered. If we, for example, consider 5-term LSs fed into Yahoo!, we retrieve 67.6% of all URIs top ranked, 7.7% ranked in the top 10 (but not top) and 22% remain undiscovered. Hence the binary pattern: we see more than 75% of all URIs ranked between one and ten, and the vast majority of the remaining quarter of URIs was not discovered. Yahoo! returns the most URIs and leaves the fewest undiscovered. MSN Live, using 5-term LSs, returns more than 63% of the URIs as the top result and hence performs better than Google, which barely returns 51%. Google returns more than 6% more top ranked results with 7-term LSs compared to when 5-term LSs were used. Google also had more URIs ranked in the top 10 and top 100 with 5-term LSs. These two observations confirm our findings in [21].

[Figure 3: 5- and 7-Term LS Retrieval Performance]
The bars displayed in Figure 4 show the percentages of retrieved URIs when querying the title of the pages. We queried the title once without quotes and once quoted, forcing the search engines to handle all terms of the query as one string. The left bar of each tuple (again distinguished by color) shows the results for the non-quoted titles. To our surprise, both Google and Yahoo! return fewer URIs when using quoted titles. Google in particular returns 14% more top ranked URIs and 38% fewer undiscovered URIs for the non-quoted titles compared to the quoted titles. Only MSN Live shows a different behavior, with more top ranked results (almost 8% more) for the quoted titles and more undiscovered URIs (more than 7%) using the non-quoted titles. We can see, however, that titles are a very well performing alternative to LSs. The top value for LSs was obtained from Yahoo! (5-term) with 67.6% top ranked URIs returned, and for titles with Google (non-quoted), which returned almost 70%.

[Figure 4: Non-Quoted and Quoted Title Retrieval Performance]
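Constructing the two title query variants is trivial but worth making explicit; a hypothetical sketch:

```python
from typing import Iterator

def title_queries(title: str) -> Iterator[str]:
    """Yield the non-quoted and the quoted variant of a title query.

    The quoted form asks the engine to match the title as one string
    (a phrase query); the non-quoted form lets terms match anywhere.
    """
    yield title
    yield '"' + title.replace('"', "") + '"'  # strip embedded quotes first
```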
We were able to retrieve tags for 47 out of our 309 URIs through the API of the bookmarking website delicious.com. Not all URIs were annotated with the same number of tags. The API returns at most the top 10 tags per URI, which means we need to distinguish between the number of tags we query against the search engines. The retrieval performance of all obtained tags, sorted by the number of tags, can be seen in Figure 5. The length of the entire bar represents the frequency: how many URIs were annotated with that many tags. The shaded portions of each bar indicate the performance. We again distinguish between top ranked, top 10, top 100 and undiscovered results. Regardless of how many tags are being used, the retrieval performance in our experiment is poor. Only a few 10-tag queries actually returned the source URI as the top result. Figure 5 shows results obtained from the Yahoo! API. Since the results are equally bad when querying tags against Google and MSN Live, we do not show those graphs here. We are aware that the size of our sample set is very limited (47 URIs), but we still believe that tags may provide some value for the discovery of missing pages, especially when titles and LSs are not available.

[Figure 5: Tags Retrieval Performance by Length in Yahoo!]
The results based on LSs generated from the link neighborhood are not impressive. Neither 5- nor 7-term LNLSs perform in a satisfying manner. Slightly above 3% of all URIs for 5-term and 1% for 7-term LNLSs are returned top ranked. As mentioned in Section 3.4, we only generated the neighborhood based LSs using the Yahoo! API and also queried the LNLSs only against Yahoo!. In concurrence with the results seen above, 5-term LNLSs perform better than LNLSs containing 7 terms. Figure 6 shows the relative number of URIs retrieved in each rank range.
[Figure 6: Retrieval Performance of LNLSs in Yahoo!]

The observation of well performing LSs and titles leads to the question of how much we would gain if we combined both for the retrieval of the missing page. To approach this point, we took all URIs that remained undiscovered with LSs and analyzed their returned rank with titles (non-quoted only). Table 2 summarizes the results shown in the sections above. It holds the relative numbers of URIs retrieved using one single method. The first, leftmost column indicates the method: LS5 and LS7 for 5- and 7-term LSs, TI for title and TA for tags.
Note that for tags we chose to display the results for URIs for which we were able to obtain 10 tags, simply because the results were best for those URIs. The top performing single method is highlighted in bold figures (one per row).

Table 3 shows, in a similar fashion, all combinations of methods that we consider reasonable involving LSs and titles. The reason why tags are left out here is simple: all URIs returned by tags are also returned by titles and LSs in the according rank section or better. For example, if URI A is returned at rank five through tags, it is also returned at rank five or better with titles and LSs.

The combination of methods displayed in the leftmost column is sensitive to its order, i.e., there is a difference between applying 5-term LSs first and 7-term LSs second and vice versa. The top results of each combination of methods are again highlighted in bold numbers. Regardless of the combination of methods, the best results are obtained from Yahoo!. If we consider all combinations of only two methods, we find the top performance of 75.7% twice in the Yahoo! results: once with LS5-TI and once with TI-LS5. The latter combination is preferable for two reasons:

1. titles are easy to obtain and do not involve the complex computation and acquisition of document frequency values needed for LSs, and
2. this method returns 9.1% of the URIs in the top 10, which is 1.7% more than the first combination returns. Even though we do not distinguish between rank two and rank nine, we still consider URIs returned within the top 10 as good results.

The combination LS7-TI-LS5 also performs well, but it is obvious that a combination with titles (either as the first or the second method) provides better results.
         Google                       Yahoo!                      MSN Live
         Top   Top10  Top100  Undis  Top   Top10  Top100  Undis  Top   Top10  Top100  Undis
LS5      50.8  12.6   4.2     32.4   …     …      …       …      …     …      …       …
LS7      …     …      …       …      …     …      …       …      …     …      …       …
TI       …     …      …       …      …     …      …       …      …     …      …       …
TA       …     …      …       …      …     …      …       …      …     …      …       …

Table 2: Relative Number of URIs Retrieved with One Single Method from Google, Yahoo! and MSN Live

         Google                       Yahoo!                      MSN Live
         Top   T10    T100    Undis  Top   T10    T100    Undis  Top   T10    T100    Undis
LS5-TI   65.0  15.2   6.1     13.6   …     …      …       …      …     …      …       …
…        …     …      …       …      …     …      …       …      …     …      …       …

Table 3: Relative Number of URIs Retrieved with Two or More Methods Combined
             Yahoo!
             Top   T10   T100  Undis
LNLS5        3.2   1.6   1.6   93.5
LNLS7        1.3   1.3   0     97.4
LS5-LNLS5    68.3  7.8   2.3   21.7
LS5-LNLS7    67.6  8.1   2.3   22.0
LS7-LNLS5    67.3  4.9   1.9   25.9
LS7-LNLS7    66.7  4.9   1.9   26.5
TI-LNLS5     64.7  8.1   0.6   26.5
TI-LNLS7     64.1  8.7   0.6   26.5
Table 4: Relative Number of URIs Retrieved from Yahoo! with Methods that Involve LNLSs

Yahoo! uniformly gave the best results and MSN Live was a close second. Google was third, only managing to outperform MSN Live once (TI-LS5) at the top rank. Since we only have retrieval data for LNLSs from Yahoo!, we isolated all reasonable combinations involving LNLSs into Table 4. The first two rows again mirror the results from Figure 6 and the consecutive rows show combinations of methods. We can summarize that there is value in combining this method with others, but the overall results are not as impressive as the results shown above. The LNLSs only make sense as the second method in a combination. The reason is very simple: these kinds of LSs are far too expensive to generate (acquire and download all inlink pages, generate LSs). This method, similar to tags, can however be applied as a first step in case no copies of the missing page are available in the WI.
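The preferred TI-LS5 setup is a two-step fallback: query the title, and only generate and query the 5-term LS for URIs that the title leaves undiscovered. A hypothetical sketch, reusing the `extract_title`, `success_at`, `page_terms` and `lexical_signature` helpers sketched earlier:

```python
from typing import Callable, List

def ti_ls5(uri: str,
           copy: str,
           search: Callable[[str], List[str]],
           doc_freq: Callable[[str], int],
           corpus_size: int) -> List[str]:
    """TI-LS5 fallback: title query first, 5-term LS for the leftovers."""
    results = search(extract_title(copy))
    if success_at(uri, results, 100):      # discovered within the top 100
        return results
    ls5 = lexical_signature(page_terms(copy), doc_freq, corpus_size, n=5)
    return search(" ".join(ls5))
```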
5. TITLE ANALYSIS AND PERFORMANCE PREDICTION
Given that the title of a page seems to be a good method considering its retrieval performance, we further investigate the characteristics of our titles. We analyzed four factors of all obtained titles:

• title length in number of terms
• total number of characters in the title
• mean number of characters per term
• number of stop words in the title.

Since this series of experiments is also costly in the number of queries, we only ran it against the Yahoo! API.

How the title length in number of terms behaves in contrast to the retrieval performance is shown in Figure 7. The setup of the figure is similar to Figure 5. Each occurring title length is represented by its own bar, and the number of times this title length occurs is shown by the height of the entire bar. The shaded parts of the bars indicate how many titles (of the according length) performed in which retrieval class (the usual top, top 10, top 100 and undiscovered). The titles vary in length between one and 43 terms. However, there is for example no title of length 21, hence its bar is of height zero. Visual observation indicates that a title length between three and six terms occurs frequently and performs fairly well. Given the data from Figure 7, we extracted the values for all URIs and are now able to generate a lookup table with the distilled probabilities for each title length to return URIs in the top, top 10 and top 100. We define these probabilities as P_1, P_10 and P_100. The lookup table with the probabilities in dependence of the title length (TL) in number of terms is given in Table 5. Using this table, we can predict whether a given title is likely to perform well. The predicted probability may have an impact on which method should be run first. For example, if P_1 and P_10 are very low, we may want to skip the title query and proceed with LSs right away.

[Figure 7: Title Length in Number of Terms vs Rank]

[Table 5: Lookup Table for Performance Probability of Titles Depending on Their Length]
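Table 5 can be built directly from the per-URI outcomes. The sketch below assumes (length, scenario) observations and cumulative probabilities (P_10 includes P_1, P_100 includes P_10), which is our reading of the table, not a stated definition.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

def title_length_lookup(observations: Iterable[Tuple[int, str]]
                        ) -> Dict[int, Tuple[float, float, float]]:
    """Build (P_1, P_10, P_100) per title length from (length, scenario)
    pairs; scenario is 'top', 'top10', 'top100' or 'undiscovered'."""
    obs = list(observations)
    totals = Counter(length for length, _ in obs)
    outcomes: Dict[int, Counter] = defaultdict(Counter)
    for length, outcome in obs:
        outcomes[length][outcome] += 1
    table = {}
    for length, n in totals.items():
        c = outcomes[length]
        p1 = c["top"] / n
        p10 = p1 + c["top10"] / n       # cumulative probabilities
        p100 = p10 + c["top100"] / n
        table[length] = (p1, p10, p100)
    return table
```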
The contrast of total title length in number of characters and rank is shown in Figure 8. While the title lengths vary greatly between 4 and 294 characters, we only see 15 URIs with a title length greater than or equal to 100 characters and only three URIs with more than 200 characters in their title. Figure 8 does not reveal an obvious pattern between the number of characters and the rank returned for a title, but very short titles (less than 10 characters) do not seem to perform well. A title length between 10 and 70 characters is most common, and the ranks seem to be better in the range of 10 to 45 characters total.

[Figure 8: Title Length in Number of Characters vs Rank]

Figure 9 depicts on the left the mean number of characters per title term and the corresponding retrieval performance. Terms with an average of 5, 6 or 7 characters seem to be most suitable for well performing query terms. On the bottom right end of the barplot we can see two titles that have a mean character length per term of 19 and 21. Since such long words are rather rare, they perform very well.

The observation of stop word frequency in the titles and their performance is not surprising. As shown on the right in Figure 9, titles with more than a couple of stop words
seem to harm the performance. The intuition is that search engines filter stop words from the query (keep in mind, these are non-quoted titles), and therefore it makes sense that, for example, the title with 11 stop words does not return its URI within the top 100 ranks.

[Figure 9: Mean Number of Characters per Title Term and Number of Stop Words vs Rank]

For completeness we removed all stop words from the titles and analyzed their retrieval performance in dependence of the new title length. The results are shown in Figure 10. As expected, we see more titles with fewer terms, performing slightly better than the original titles. This result indicates that the performance of the method using the web page's titles can still be improved. Further analysis of the best combination of methods with titles without stop words remains for future work.

[Figure 10: Title Length in Number of Non-Stop Words vs Rank]
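A hypothetical illustration of the stop word filter applied to titles (the actual stop word list used in the paper is not given, so a tiny sample list stands in):

```python
# Tiny sample list; the paper's actual stop word list is not specified.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "for", "to"}

def strip_stop_words(title: str) -> str:
    """Remove stop words from a title before querying."""
    return " ".join(t for t in title.split() if t.lower() not in STOP_WORDS)
```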
6. FUTURE WORK
Our main aspect of future work is the implementation of the system described in the flow diagram of Figure 2. The system will operate as a browser plugin and will trigger once the user encounters a 404 "Page Not Found" error. It will provide all of the introduced methods to rediscover the missing page and, since the discovery process happens in an automated fashion, the system can provide the user with the results in real time while she is browsing the Internet.

We have shown in [21] that LSs evolve over time and consequently lose some of their retrieval strength. Here we are arguing that titles of web pages are a powerful alternative to LSs. The next logical step is to investigate the evolution and possible decay of titles over time. Our intuition is that titles do not decay quite as quickly as LSs do, since the actual content of a web page (a headline, sentence or paragraph) presumably changes more frequently than its general topic, which is what the title is supposed to represent.

Our set of obtained tags is limited.
It remains for future work to investigate the retrieval performance of tags in a large scale experiment. It would also be interesting to see what the term overlap between tags, titles and LSs is, since all three methods are generated on different grounds.

Our method to generate LNLSs may not be optimal. We chose to use inlink pages only and created a bucket of all neighborhood terms per URI. The LNLSs are based on this bucket. It remains to be seen whether outlink pages can actually contribute to the retrieval performance and whether other methods than the bucket of terms are preferable. It is possible that our neighborhood is too big since it includes the entire content of each neighboring page. A page that links to many pages (a hub) may have a diffuse "aboutness". Hence we are going to restrict the content gained from the neighborhood to the link anchor text of the inlink pages.
7. CONCLUSIONS
In this paper we evaluate the retrieval performance of four methods to discover missing web pages. We generate a dataset by randomly sampling URIs from dmoz.org and assume these pages to be missing. We generate LSs from copies of the pages, parse the pages' titles, obtain tags for the URIs from the bookmarking website delicious.com and generate LSs based on the link neighborhood. We use the three major search engines Google, Yahoo! and MSN Live to acquire the mandatory document frequency data for the generation of the LSs. We further query all three search engines for all our methods and combine methods to improve the retrieval performance. We are able to recommend a setup of methods and see one search engine performing best in most of our experiments.

It has been shown in related work that LSs can perform well for retrieving web pages. Our results confirm these findings; for example, more than two-thirds of our URIs have been returned as the top result when querying 5- and 7-term LSs against the Yahoo! search engine API. They also lead us to the claim that titles of web pages are a strong alternative to LSs. Almost 70% of the URIs have been returned as the top result from the Google search engine API when queried with the (non-quoted) title. However, our results show that a combination of methods performs best. Querying the title first and then using the 5-term LSs for all remaining undiscovered URIs against Yahoo! provided the overall best result, with 75.7% top ranked URIs and another 9.1% in the top 10 ranks. The combination 7-term LS, title, 5-term LS returned 76.4% of the URIs in the top ranks, but since LSs are more expensive to generate than titles, we recommend the former combination of methods. A good strategy, based on our results, is to query the title first and, if the results are insufficient, generate and query LSs second. Yahoo! returned the best results for all combinations of methods and thus seems to be the best choice, even though Google returned better results when querying the title only.
8. ACKNOWLEDGEMENT
This work is supported in part by the Library of Congress.
9. REFERENCES

[1] How does Google calculate the number of results?
[2] The size of the World Wide Web.
[3] H. Ashman. Electronic document addressing: Dealing with change. ACM Computing Surveys, 32(3):201–212, 2000.
[4] H. Ashman, H. Davis, J. Whitehead, and S. Caughey. Missing the 404: Link integrity on the world wide web. In Proceedings of WWW '98, pages 761–762, 1998.
[5] R. Baeza-Yates, Á. Pereira, and N. Ziviani. Genealogical trees on the web: a search engine user perspective. In Proceedings of WWW '08, pages 367–376, 2008.
[6] T. Berners-Lee. Cool URIs don't change, 1998.
[7] K. Bischoff, C. Firan, W. Nejdl, and R. Paiu. Can All Tags Be Used for Search? In Proceedings of CIKM '08, pages 193–202, 2008.
[8] H. C. Davis. Hypertext link integrity. ACM Computing Surveys, page 28.
[9] H. C. Davis. Referential integrity of links in open hypermedia systems. In Proceedings of HYPERTEXT '98, pages 207–216, 1998.
[10] B. D. Davison. Topical locality in the web. In Proceedings of SIGIR '00, pages 272–279, 2000.
[11] J. Dean and M. R. Henzinger. Finding Related Pages in the World Wide Web. Computer Networks, 31(11-16):1467–1479, 1999.
[12] R. P. Dellavalle, E. J. Hester, L. F. Heilig, A. L. Drake, J. W. Kuntzman, M. Graber, and L. M. Schilling. INFORMATION SCIENCE: Going, Going, Gone: Lost Internet References. Science, 302(5646):787–788, 2003.
[13] T. L. Harrison and M. L. Nelson. Just-in-Time Recovery of Missing Web Pages. In Proceedings of HYPERTEXT '06, pages 145–156, 2006.
[14] B. He and I. Ounis. Inferring Query Performance Using Pre-retrieval Predictors. In Proceedings of SPIRE '04, pages 43–54, 2004.
[15] M. Henzinger, B.-W. Chang, B. Milch, and S. Brin. Query-free News Search. In Proceedings of WWW '03, pages 1–10, 2003.
[16] M. R. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On Near-Uniform URL Sampling. Computer Networks, 33(1-6):295–308, 2000.
[17] A. Jatowt, Y. Kawai, S. Nakamura, Y. Kidawara, and K. Tanaka. A Browser for Browsing the Past Web. In Proceedings of WWW '06, pages 877–878, 2006.
[18] B. Kahle. Preserving the Internet. Scientific American, 276:82–83, March 1997.
[19] F. Keller and M. Lapata. Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, 29(3):459–484, 2003.
[20] M. Klein and M. L. Nelson. A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web. In Proceedings of WIDM '08, pages 39–46, 2008.
[21] M. Klein and M. L. Nelson. Revisiting Lexical Signatures to (Re-)Discover Web Pages. In Proceedings of ECDL '08, pages 371–382, 2008.
[22] M. Klein and M. L. Nelson. Inter-Search Engine Lexical Signature Performance. In Proceedings of JCDL '09, pages 413–414, 2009.
[23] W. C. Koehler. Web Page Change and Persistence - A Four-Year Longitudinal Study. Journal of the American Society for Information Science and Technology, 53(2):162–171, 2002.
[24] S. Lawrence, D. M. Pennock, G. W. Flake, R. Krovetz, F. M. Coetzee, E. Glover, F. A. Nielsen, A. Kruger, and C. L. Giles. Persistence of Web References in Scientific Research. Computer, 34(2):26–31, 2001.
[25] C. C. Marshall, F. McCown, and M. L. Nelson. Evaluating Personal Archiving Strategies for Internet-based Information. In Proceedings of IS&T Archiving '07, pages 48–52, 2007.
[26] F. McCown. Lazy Preservation: Reconstructing Websites from the Web Infrastructure. PhD thesis, Old Dominion University, 2007.
[27] F. McCown and M. L. Nelson. Characterization of Search Engine Caches. In Proceedings of IS&T Archiving '07, pages 48–52, 2007. (Also available as arXiv:cs/0703083v2).
[28] F. McCown, J. A. Smith, and M. L. Nelson. Lazy Preservation: Reconstructing Websites by Crawling the Crawlers. In Proceedings of WIDM '06, pages 67–74, 2006.
[29] M. L. Nelson, F. McCown, J. A. Smith, and M. Klein. Using the Web Infrastructure to Preserve Web Pages. IJDL, 6(4):327–349, 2007.
[30] S.-T. Park, D. M. Pennock, C. L. Giles, and R. Krovetz. Analysis of Lexical Signatures for Improving Information Persistence on the World Wide Web. ACM Transactions on Information Systems, 22(4):540–572, 2004.
[31] T. A. Phelps and R. Wilensky. Robust Hyperlinks Cost Just Five Words Each. Technical Report UCB//CSD-00-1091, University of California at Berkeley, Berkeley, CA, USA, 2000.
[32] P. Rusmevichientong, D. M. Pennock, S. Lawrence, and C. L. Giles. Methods for Sampling Pages Uniformly from the World Wide Web. In AAAI Fall Symposium on Using Uncertainty Within Computation, pages 121–128, 2001.
[33] G. Salton, A. Wong, and C. S. Yang. A Vector Space Model for Automatic Indexing. Communications of the ACM, 18(11):613–620, 1975.
[34] A. J. Silva, M. A. Goncalves, A. H. Laender, M. A. Modesto, M. Cristo, and N. Ziviani. Finding What is Missing from a Digital Library: A Case Study in the Computer Science Field. Information Processing and Management, 45(3):380–391, 2009.
[35] D. Spinellis. The decay and failures of web references. Communications of the ACM, 46(1):71–77, 2003.
[36] K. Sugiyama, K. Hatano, M. Yoshikawa, and S. Uemura. Refinement of TF-IDF Schemes for Web Pages using their Hyperlinked Neighboring Pages. In Proceedings of HYPERTEXT '03, pages 198–207, 2003.
[37] M. Theall. Methodologies for Crawler Based Web Surveys. Internet Research: Electronic Networking and Applications, 12:124–138, 2002.
[38] X. Wan and J. Yang. Wordrank-based Lexical Signatures for Finding Lost or Related Web Pages. In APWeb, pages 843–849, 2006.
[39] X. Zhu and R. Rosenfeld. Improving Trigram Language Modeling with the World Wide Web. In Proceedings of ICASSP '01, pages 533–536, 2001.
[40] Z. Zhuang, R. Wagle, and C. L. Giles. What's There and What's Not?: Focused Crawling for Missing Documents in Digital Libraries. In