Modeling Updates of Scholarly Webpages Using Archived Data
Yasith Jayawardana, Alexander C. Nwala, Gavindya Jayawardena, Jian Wu, Sampath Jayarathna, Michael L. Nelson, C. Lee Giles
Yasith Jayawardana
Computer Science Department, Old Dominion University, Norfolk, VA, USA
[email protected]

Alexander C. Nwala
Center for Complex Networks and Systems Research, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA
[email protected]

Gavindya Jayawardena
Computer Science Department, Old Dominion University, Norfolk, VA, USA
[email protected]

Jian Wu
Computer Science Department, Old Dominion University, Norfolk, VA, USA
[email protected]

Sampath Jayarathna, Michael L. Nelson
Computer Science Department, Old Dominion University, Norfolk, VA, USA
{sampath,mln}@cs.odu.edu

C. Lee Giles
Information Sciences & Technology, Pennsylvania State University, University Park, PA, USA
[email protected]
Abstract—The vastness of the web imposes a prohibitive cost on building large-scale search engines with limited resources. Crawl frontiers thus need to be optimized to improve the coverage and freshness of crawled content. In this paper, we propose an approach for modeling the dynamics of change in the web using archived copies of webpages. To evaluate its utility, we conduct a preliminary study on the scholarly web using 19,977 seed URLs of authors' homepages obtained from their Google Scholar profiles. We first obtain archived copies of these webpages from the Internet Archive (IA), and estimate when their actual updates occurred. Next, we apply maximum likelihood to estimate their mean update frequency (λ) values. Our evaluation shows that λ values derived from a short history of archived data provide a good estimate of the true update frequency in the short term, and that our method provides better estimations of updates at a fraction of the resources required by the baseline models. Based on this, we demonstrate the utility of archived data for optimizing the crawling strategy of web crawlers, and uncover important challenges that inspire future research directions.

Index Terms—Crawl Scheduling, Web Crawling, Search Engines
I. INTRODUCTION
The sheer size of the Web makes it impossible for small crawling infrastructures to crawl the entire Web to build a general search engine comparable to Google or Bing. Instead, it is more feasible to build specialized search engines, which employ focused web crawlers [1], [2] to actively harvest webpages or documents of particular topics or types. Google Scholar, for instance, is a specialized search engine that is especially useful for scientists, technicians, students, and other researchers to find scholarly papers.

The basic algorithm for a focused web crawler is straightforward. The crawl frontier is first initialized with seed URLs that are relevant to the search engine's focus. Next, the crawler visits the webpages referenced by the seed URLs, extracts the hyperlinks in them, selects the hyperlinks that satisfy preset rules (to ensure that only related webpages are visited), adds them to the crawl frontier, and repeats this process until the crawl frontier is exhausted [3]. Although this works for relatively short seed lists, it does not scale to large seed lists. For instance, the crawler may not finish visiting all webpages before they change. Under such circumstances, re-visiting webpages that have not changed since their last crawl is a waste of time and bandwidth. It is therefore important to select and prioritize a subset of seeds for each crawl, based on their likelihood of changing in the future.

Without sufficient crawl history, it is difficult to accurately predict when a webpage will change. Web archives, such as the well-known Internet Archive's (IA) Wayback Machine [4] and others, preserve webpages as they existed at particular points in time for later replay. The IA has been collecting and saving public webpages since its inception in 1996, and contains archived copies of over 424 billion webpages [5], [6]. The resulting record of such archived copies is known as a
TimeMap [7], which allows us to examine each saved copy to determine whether a change occurred (not every saved version represents a change in the webpage). TimeMaps provide a critical source of information for studying changes in the web. For example, if a researcher created his website in 2004, via a TimeMap we could retrieve copies of the website observed by the IA between 2004 and 2020, and examine these copies for changes.

In this paper, we propose an approach to model the dynamics of change in the web using archived copies of webpages. Though such dynamics have been studied in previous papers, e.g., [8]–[10], online activities have evolved since then, and to the best of our knowledge, the use of archived data to model these dynamics has not been explored. While many web archives exist, we use the IA to obtain archived copies of webpages due to its high archival rate and its efficiency for mass queries. Given a URL, we first obtain its TimeMap from the IA's Wayback Machine, and identify the mementos that represent updates. Next, we use this information to estimate the mean update frequency (λ). We then use λ to calculate the probability (p) of seeing an update d days after the webpage was last updated. Before each crawl, we repeat this process for each seed URL and use a threshold (θ) on p to select a subset of seed URLs that are most likely to have changed since their last crawl.

Our preliminary analysis demonstrates how this approach can be integrated into a focused web crawler, and its impact on the efficiency of crawl scheduling. Here, we select the scholarly web as our domain of study, and analyze our approach at both homepage-level (single webpage) and website-level (multiple webpages). The former investigates changes occurring on an author's homepage, while the latter investigates changes occurring collectively on the homepage and any webpage behind it, e.g., publications, projects, and teaching webpages. Our contributions are as follows:

1) We studied the dynamics of the scholarly web using archived data from the IA for a sample of 19,977 authors' websites.
2) We verified that the updates to authors' websites and homepages follow a near-Poisson distribution, with spikes that may represent non-stochastic activities.
3) We developed the History-Aware Crawl Scheduler (HACS), which uses archived data to find and schedule the subset of seed URLs that are most likely to have changed before the next crawl.
4) We compared HACS against baseline models for a simulated web crawling task, and demonstrated that it provides better estimations.
A. Crawling the Web
Although the basic focused web crawling algorithm [3] is simple, challenges in the web, such as scale, content selection trade-offs (e.g., coverage vs. freshness), social obligations, and adversaries, make it infeasible to crawl the web in that manner. Crawl frontiers should thus be optimized to improve the robustness of web crawlers. One approach is to reorder the crawl frontier to maximize some goal (e.g., bandwidth, freshness, importance, relevance) [11], [12].
Fish-Search [13], for instance, reorders the crawl frontier based on content relevance, and is one of the earliest such methods. Given a seed URL and a driving query, it builds a priority queue that prioritizes webpages (and their respective out-links) that match the driving query.
Shark-Search [14] is an improved version of Fish-Search which uses cosine similarity (a number between 0 and 1) to calculate the relevance of a webpage to the driving query, instead of the binary similarity (either 0 or 1) used in Fish-Search. Such algorithms do not require the crawl history to calculate relevance, and can be applied at both the initial crawl and any subsequent crawls.

In incremental crawling, webpages need to be re-visited once they change, to retain the freshness of their crawled copies. Several methods have been proposed [15], [16]. Olston et al. [17], for instance, studied the webpage revisitation policy that a crawler should employ to achieve good freshness. They considered information longevity, i.e., the lifetime of content fragments that appear and disappear from webpages over time, to avoid crawling ephemeral content such as advertisements, which contributes little to the main topic of a webpage. Such methods require sufficient crawl history to identify ephemeral content, and until sufficient crawl history is generated, the algorithm may yield sub-optimal results.

The algorithms proposed by Cho et al. [18] reorder the crawl frontier based on the importance of webpages. Here, the query similarity metric used in Fish-Search and Shark-Search was extended with additional metrics such as back-link count, forward-link count, PageRank, and location (e.g., URL depth, top-level domain). Alam et al. [19] proposed a similar approach, where the importance of a webpage was estimated using
PageRank, partial link structure, inter-host links, webpage titles, and topic relevance measures. Although such methods take advantage of the crawl history, the importance of a webpage may not reflect how often it changes. Thus, such methods favour the freshness of certain content over the rest.

Focused web crawlers should ideally discover all webpages relevant to their focus. However, the coverage that they can achieve depends on the seed URLs used. Wu et al. [20], for instance, proposed the use of a whitelist and a blacklist for seed URL selection. The whitelist contains high-quality seed URLs selected from parent URLs in the crawl history, while the blacklist contains seed URLs that should be avoided. The idea was to concentrate the workforce on exploiting URLs with potentially abundant resources. In addition, Zheng et al. [21] proposed a graph-based framework to select seed URLs that maximize the value (or score) of the portion of the web graph "covered" by them. They model this selection as a Maximum K-Coverage Problem. Since this is an NP-hard [22] problem, the authors proposed several greedy and iterative approaches to approximate the optimal solution. Although this works well for a general web crawler, studies show that the scholarly web has a disconnected structure [23]. Hence, the process of selecting seed URLs for such use cases may benefit from the crawl records of a general web crawler.

CiteSeerX [24] is a digital library search engine that has more than 10 million scholarly documents indexed and is growing [25]. Its crawler, identified as citeseerxbot, is an incremental web crawler that actively crawls the scholarly web and harvests scholarly papers in PDF format [25]. Compared to general web crawlers, crawlers built for the scholarly web have different goals in terms of optimizing the freshness of their content. The crawl scheduling model used by citeseerxbot, which we refer to as the
Last-Obs model, prioritizes seed URLs based on the time elapsed since a webpage was last visited. In this work, we use the Last-Obs model as a baseline to compare with our method.
B. Modeling Updates to a Webpage
Updates to a webpage can be modeled as a Poisson process [9], [26], [27]. The model is based on the following theorem.
Theorem 1: If T is the time of occurrence of the next event in a Poisson process with rate λ (number of events per unit time period), the probability density for T is

f_T(t) = \lambda e^{-\lambda t}, \quad t > 0, \; \lambda > 0.    (1)

Fig. 1. An illustration of accesses (▽, △), accesses with updates (●), true update occurrences (◦), and the interpolated update occurrences (▲) over time. Gray shades represent the deviation of the observed and interpolated update occurrences from the true update occurrences.

Here, we assume that each update event is independent. While this assumption is not always true (i.e., certain updates are correlated), as shown later, it is a reasonable approximation. By integrating f_T(t), we obtain the probability that a given webpage changes within a time interval of length ∆t:

P(\Delta t) = \int_{0}^{\Delta t} f_T(t) \, dt = 1 - e^{-\lambda \Delta t}.    (2)

Note that the value of λ may vary for different webpages. For the same webpage, λ may also change over time, but over a short period of time λ is approximately constant. Therefore, by estimating λ, we can calculate how likely a webpage is to have been updated since its last update at time t_c. Intuitively, λ can be estimated as

\hat{\lambda} = X / T,    (3)

in which X is the number of updates detected during n accesses, and T is the total time elapsed during the n accesses. As proven in [9], this estimator is biased, and it is more biased when there are more updates than accesses in the interval T. For convenience, [26] defines an intermediate statistical variable r = λ/f, the ratio of the update frequency to the access frequency, and proposes the improved estimator

\hat{r} = -\log\left(\frac{\bar{X} + 0.5}{n + 0.5}\right), \quad \bar{X} = n - X.    (4)

This estimator is much less biased than X/T. It is also consistent, meaning that as n → ∞, the expectation of r̂ converges to r.

Unfortunately, since the archival rates of the IA depend on its crawl scheduling algorithm and on the nature of the webpages themselves, its crawl records have irregular intervals. As a result, archived copies may not reflect every update that occurred on the live web, and not every pair of consecutive archived copies reflects an update. Since both Eq. (3) and Eq. (4) assume regular access, they cannot be used directly. To address this limitation, we use a maximum likelihood estimator to calculate which λ is most likely to produce an observed set of events:

\sum_{i=1}^{m} \frac{t_{c_i}}{e^{\lambda t_{c_i}} - 1} - \sum_{j=1}^{n-m} t_{u_j} = 0.    (5)

Here, t_{c_i} is the i-th time interval in which an update was detected, t_{u_j} is the j-th time interval in which an update was not detected, and m is the total number of updates detected over n accesses (see Figure 1). λ is obtained by solving Eq. (5). Since this equation is nonlinear, we solve it numerically using Brent's method [28]. There is a special case when m = n (i.e., updates detected at all accesses), where the solution of Eq. (5) is λ = ∞; in this case, Eq. (4) is used instead.

To the best of our knowledge, there has not been an open-source crawl scheduler for the scholarly web that takes advantage of the update model above. With the IA providing an excellent, openly accessible resource to model the updates of scholarly webpages, this model can be applied in focused crawl schedulers to save substantial time on crawling and re-visitation.

II. METHODOLOGY
A. Data Acquisition
The seed list used in this work was derived from a dataset containing the Google Scholar profile records of researchers. This dataset was collected around 2015 by scraping profile webpages in Google Scholar with a long crawl-delay. The steps for data acquisition and preparation are illustrated in Figure 2.
Step 1: From the Google Scholar profile records, we discovered 396,423 profiles that provided homepage URLs. These URLs referenced either individual author homepages or organizational websites. Since our study focused on modeling the dynamics of the websites of individual authors, we removed organizational websites. This was nontrivial with a simple rule-based filter, as there were personal homepages that look similar to organizational homepages. Therefore, we restricted our scope to homepage URLs hosted within a user directory of an institution, i.e., URLs with a tilde (∼) in them (e.g., foo.edu/∼bar/). In this manner, we obtained 139,910 homepage URLs.
Step 2: Next, we performed a wildcard query on the IA Wayback CDX Server API [29] to obtain TimeMaps for each author website under its homepage URL. Out of the 139,910 websites, we obtained TimeMaps for 24,236 author websites (roughly a 17.3% archival rate). The remaining websites were either not archived, or the CDX Server API returned an error code during access.

Fig. 2. Steps followed to acquire and prepare data from the IA (depths 0–2).

The resulting TimeMaps provided information such as the crawl timestamps and URI-Ms of the archived copies of each webpage. From these webpages, we selected webpages at depth ≤ 2 (depth 0 is the homepage). For instance, for a homepage foo.edu/∼bar, a link to foo.edu/∼bar/baz is of depth 1 and is selected. However, a link to foo.edu/∼bar/baz/qux/quux is of depth 3 and is not selected.
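For illustration, the following is a minimal sketch of such a wildcard query, using the documented parameters of the CDX Server API [29]; the homepage URL shown in the usage comment is hypothetical:

import requests

CDX_API = "http://web.archive.org/cdx/search/cdx"

def fetch_captures(url, from_ts="20150601", to_ts="20180601"):
    # Wildcard (prefix) query: return capture records for every archived
    # page under `url`, scoped to the study period.
    params = {
        "url": url,
        "matchType": "prefix",
        "from": from_ts,
        "to": to_ts,
        "output": "json",
        "fl": "timestamp,original,digest,statuscode",
    }
    resp = requests.get(CDX_API, params=params, timeout=60)
    rows = resp.json() if resp.text.strip() else []
    if not rows:
        return []                            # URL was never archived
    header, records = rows[0], rows[1:]      # first row lists the field names
    return [dict(zip(header, r)) for r in records]

# e.g., captures = fetch_captures("foo.edu/~bar/")   # hypothetical homepage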
Step 3: Next, we generated the D0 dataset and the D2 dataset, which we use in our analysis. First, we de-referenced the URI-Ms of each URL selected in Step 2, and saved their HTML for later use. When doing so, we dropped inconsistent records, such as records with an invalid checksum, an invalid date, multiple depth-0 URLs, or duplicate captures. The resulting data, which we refer to as the D2 dataset, contained the HTML of 19,977 websites, totaling approximately 4.5 million individual webpages. The average number of webpages per website is 227.49. The minimum and maximum numbers of webpages per website are 1 and 35,056, respectively. We selected the subset of the D2 dataset consisting of the HTML of only the homepages, which we refer to as the D0 dataset.

Fig. 3. Captures (blue dots) of homepage URLs over time, with URLs sorted by their earliest capture time (red dots). The captures between 2015-06-01 and 2018-06-01 (green vertical lines) were used for the evaluation.
Figure 3 shows the distribution of captures in the D0 dataset, sorted by their earliest capture time. Here, the median crawl intervals of author homepages were mostly between 20 and 127 days. The distribution of capture density over time suggests that the capture densities of the IA vary irregularly with time. For instance, captures during 2015–2018 show a higher density on average than the captures during 2010–2014. Since high-cadence captures help to obtain a better estimation of the update occurrences, we scoped our analysis to the period between June 1, 2015 and June 1, 2018 (shown by the green vertical lines in Figure 3).

B. Estimating Mean Update Frequency
The exact interpretation of an update may differ depending on the purpose of a study. We examine a specific type of update: the addition of new links. The intuition here is to identify when authors add new publications to their webpages, as opposed to identifying when a webpage was updated in general. We claim that this interpretation of an update is better suited to capture such behavior.

For each webpage in the datasets D0 and D2, we processed each capture m_i to extract the links l(m_i) from its HTML, where l(m_i) is the set of links in the i-th capture. Next, we calculated |l*(m_i)|, i.e., the number of links in a capture m_i that were never seen before m_i, for each capture in these datasets. Formally,

l^*(m_i) = l(m_i) - \bigcup_{k=1}^{i-1} l(m_k), \quad i \geq 2,

where \bigcup_{k=1}^{i-1} l(m_k) is the union of the links from captures m_1 to m_{i-1}. Finally, we calculated the observed-update intervals t_{c_i} ∈ T_c and the observed non-update intervals t_{u_j} ∈ T_u based on the captures that show link additions, i.e., |l*(m_i)| > 0, and the ones that do not, i.e., |l*(m_i)| = 0 (see Figure 1); a sketch of this computation is given below. We estimate λ in two ways.
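A minimal sketch of this computation, assuming each capture m_i is given as a (timestamp, link set) pair whose links were already extracted from the archived HTML:

def update_intervals(captures):
    # captures: [(time, links), ...] sorted by time, where `links` is the
    # set of hyperlinks extracted from that capture's HTML.
    # Returns (t_c, t_u): interval lengths in days that ended with / without
    # never-before-seen links, per the definition of l*(m_i).
    t_c, t_u = [], []
    seen = set(captures[0][1])
    for (t_prev, _), (t_curr, links) in zip(captures, captures[1:]):
        new_links = links - seen                     # l*(m_i)
        interval = max((t_curr - t_prev).days, 1)    # same-day captures count as 1 day
        (t_c if new_links else t_u).append(interval)
        seen |= links
    return t_c, t_u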
1) Estimation Based on Observed Updates:
For each webpage, we substituted the t_{c_i} and t_{u_j} values into Eq. (5) or Eq. (4) and solved for λ using Brent's method, obtaining its estimated mean observed-update frequency (λ). In this manner, we calculated λ for author websites at both homepage-level (using the D0 dataset) and website-level (using the D2 dataset).

Figure 4 shows the distribution of I_est = 1/λ at both website-level and homepage-level, obtained using captures from 2015-06-01 to 2018-06-01. Both distributions are approximately log-normal, with a median of 75 days at website-level and of 141.5 days at homepage-level. This suggests that most authors add links to their homepage less often than they add links to their website (e.g., publications).
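The following is a minimal sketch of this estimation step, taking the interval lists (in days) from the earlier sketch; it solves Eq. (5) numerically with SciPy's implementation of Brent's method and falls back to Eq. (4) when m = n. The bracketing interval for the root and the default access frequency are assumptions:

import math
from scipy.optimize import brentq

def estimate_lambda(t_c, t_u, access_freq=1.0):
    # t_c: intervals that ended with a detected update (m of them);
    # t_u: intervals without a detected update (n - m of them).
    n, m = len(t_c) + len(t_u), len(t_c)
    if m == 0:
        return 0.0                       # no updates observed at all
    if m == n:                           # Eq. (5) yields λ = ∞; use Eq. (4)
        r = -math.log(0.5 / (n + 0.5))   # with X̄ = n - X = 0
        return r * access_freq           # λ = r · f (f: accesses per day)
    # Eq. (5): Σ_i t_ci / (exp(λ t_ci) - 1) - Σ_j t_uj = 0
    def eq5(lam):
        return sum(t / (math.exp(lam * t) - 1.0) for t in t_c) - sum(t_u)
    return brentq(eq5, 1e-9, 10.0)       # bracket assumed for per-day rates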
2) Estimation Based on Interpolated Updates:
The method described in Section II-B1 calculates the maximum likelihood of observing the updates given by the intervals t_{c_i} and t_{u_j}. Intuitively, an update could have occurred at any time between t(m_{x-1}) and t(m_x), where t(m_x) is the time of an updated capture, and t(m_{x-1}) is the time when the capture before it was taken. Here, we use an improved method where we first interpolate when a URL was updated. We define the interpolated-update time (▲) as (t(m_{x-1}) + t(m_x))/2, i.e., the midpoint between t(m_{x-1}) and t(m_x). Next, we obtain the update intervals t̃_{c_i} and t̃_{u_j} from these interpolated updates, and use them to calculate the estimated mean interpolated-update frequency (λ̃).

Fig. 4. Distribution of 1/λ of author websites at website-level (red) and homepage-level (blue). Here, λ was calculated using captures from 2015-06-01 to 2018-06-01.

C. Distribution of Updates
Figure 5 shows the distribution of 1/λ̃ (red) and the median interpolated-update interval (∆t̃) (blue) of author websites at both homepage-level and website-level. It suggests that the distribution of 1/λ̃ is consistent with the distribution of the median ∆t̃ at both homepage-level and website-level.

D. Poisson Distribution
Next, we observe whether updates to author websites follow a Poisson distribution, at both homepage-level and website-level. Here, we group author websites by their calculated 1/λ̃ values into bins with a width of 1 day. Within each bin, we calculate the probability (y-axis) of finding an author website having an interpolated-update interval (∆t̃) of d days (x-axis).

Fig. 5. The distributions of 1/λ̃ (red) and the median interpolated-update interval (∆t̃) (blue) of author websites at (a) homepage-level and (b) website-level. The y-axis represents individual author websites, in increasing order of 1/λ̃.

Figure 6 shows the probability distributions for homepage-level (using the D0 dataset) and website-level (using the D2 dataset), at 1/λ̃ = 35 days and 1/λ̃ = 70 days, respectively. The majority of data points follow a power-law distribution on the logarithmic scale, indicating that they fit a Poisson distribution. We also observe that at homepage-level, the data points follow a power-law distribution with a positive index when d is (approximately) lower than 1/λ̃. We observe sporadic spikes on top of the power law. This indicates that: (1) for a given λ̃, consecutive changes within short intervals occur less frequently than predicted by a Poisson distribution; (2) the updates of scholarly webpages are not absolutely random, but exhibit a certain level of weak correlation. Investigating the reasons behind these correlations is beyond the scope of this paper, but presumably they may reflect collaboration or community-level activities. Probability distributions for other values of 1/λ̃ also exhibit similar patterns (see Figures 15, 16, 17, and 18 in the Appendix).

E. Prediction Model
Fig. 6. Probability (y-axis) of finding author websites with an interpolated-update interval (∆t̃) of d days (x-axis) at both homepage-level and website-level, among author websites having 1/λ̃ of 35 days and 70 days, respectively: (a) homepage-level, 1/λ̃ = 35 days; (b) homepage-level, 1/λ̃ = 70 days; (c) website-level, 1/λ̃ = 35 days; (d) website-level, 1/λ̃ = 70 days. The vertical blue line shows where d = 1/λ̃.

We formally define our prediction model using two functions, f and g. The function f : m → (λ, τ) takes the captures m (i.e., crawl snapshots from the IA) of a website as input, and outputs its estimated mean update frequency λ (see Eq. (5)) and its last known update time τ. The function g : (λ, τ, e) → p takes a website's estimated mean update frequency (λ), its last known update time (τ), and a time interval (e) as input, and outputs the probability (p) that the website changes after the time interval e since its last known update time τ.
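A minimal sketch of f and g, built on the earlier sketches; the additional now argument to g (the date from which the interval e is measured) is an assumption about how the model is applied at scheduling time:

import math

def f(captures):
    # Captures of one website -> (λ, τ): estimated mean update frequency
    # and the last known update time (Section II-E).
    t_c, t_u = update_intervals(captures)
    lam = estimate_lambda(t_c, t_u)
    tau, seen = captures[0][0], set(captures[0][1])
    for t, links in captures[1:]:
        if links - seen:                 # this capture introduced new links
            tau = t
        seen |= links
    return lam, tau

def g(lam, tau, e, now):
    # Eq. (2): probability of at least one update within e days of `now`,
    # with the elapsed time measured from the last known update time τ.
    dt = (now - tau).days + e
    return 1.0 - math.exp(-lam * dt)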
III. EVALUATION

Here, we study how archived copies of webpages and the quasi-Poisson distribution of webpage updates can be leveraged to build a focused crawl scheduler for the scholarly web.

Fig. 7. An illustration of history size (w), reference point (t), evaluation interval (e), and updates (×). For each URL u_i, λ was estimated using the updates between [t − w, t] (green), and the probability of change (p) at t + e was calculated. In Evaluation 1, the correctness of p (red) was checked using the actual updates between [t, t + e]. In Evaluation 2, URLs were ordered by p, and compared against the ordering of those that changed first after t.

Figure 7 illustrates our crawl scheduling model, HACS. For a selected date t between 2015-06-01 and 2018-06-01, we first obtain, from D2 and D0, the archived captures of seed URLs within w weeks prior to t (i.e., in the interval [t − w, t]). Based on these captures, we calculate the estimated mean interpolated-update frequency (λ̃) of each seed URL. Next, we use the λ̃ values thus obtained to calculate the probability (p) that each seed URL would exhibit a change e days from t (i.e., by day t + e). Following this, we sort the seed URLs in decreasing order of p, and apply a threshold parameter (θ) to select the subset of seed URLs to be crawled on that date.

A. Simulated Crawl Scheduling Task
Here, we set e = 1 week, and advance t across different points in time from 2015-06-01 to 2018-06-01, to simulate a crawl scheduling task. At each t, we use standard IR metrics to evaluate whether the selected subset of seed URLs were the ones that actually changed within the interval [t, t + e]. We also experiment with different values of w (i.e., history size), to determine which w yields an optimal result.

The following metrics are used for evaluating our model in comparison with several baseline models. First, we look at precision, recall, and F1 to measure how accurately the scheduler selects URLs for a simulated crawl job (see Evaluation 1). Then, we use P@K to evaluate how accurately the scheduler ranks URLs in the order they change (see Evaluation 2).

B. Evaluation 1
Because most implementations of scholarly web crawlers are not published, we compare with two baseline models: (1) random URLs (Random), and (2) Brute Force (select all URLs). We introduce a threshold parameter θ ∈ [0, 1] to select the webpages with a probability of change p ≥ θ for crawling. Formally, we define the scheduling function as

D_{w,t}(\theta) = \{\, u : g(\lambda, \tau, e) \geq \theta, \; (\lambda, \tau) = f(M_{w,t}(u)) \mid \forall u \in U \,\}
M_{w,t}(u) = \{\, m_x : t(m_x) \in [t - w, t] \mid \forall m_x \in M_u \,\}

Here, U is the set of all seed URLs, and M_u is the set of captures of a seed URL u. The parameters w, t, and θ are the history size, reference point, and threshold, respectively. The functions f and g are as defined in Section II-E. For each (w, t, θ), the following actions are performed. In the HACS model, we use D_{w,t}(θ) to select URLs for crawling. In the Random model, we randomly pick |D_{w,t}(θ)| URLs from D_{w,t}(0), i.e., all URLs having captures within the time window [t − w, t]. In the Brute Force model, we mimic the behavior of a hypothetical crawler by picking all URLs from D_{w,t}(0). The results from each model were compared to the URLs that actually changed within the interval [t, t + e].

Following this, we counted the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) at each (w, t, θ). Next, we eliminated the reference point t by macro/micro-averaging over t, and calculated the Precision (P), Recall (R), and F1 (F) for each w and θ, respectively. At each w, we then calculated the threshold θ = θ̂ which maximizes F1 for both homepage-level and website-level. Table I shows the results of this evaluation.

We also show how P, R, and F1 change with θ ∈ [0, 1] for both homepage-level and website-level updates. Figures 8, 9, and 10 illustrate these results at w = 1 and w = 2 (results at w = 3 are given in Figures 12, 13, and 14 in the Appendix).
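Putting the formal definition of D_{w,t}(θ) above into code, a minimal sketch of the selection, assuming the f and g sketches from Section II-E; the names and data layout are illustrative:

from datetime import timedelta

def schedule(all_captures, t, w, e, theta):
    # all_captures: {url: [(time, links), ...]} sorted by time.
    # Select and rank the seed URLs whose probability of change by t + e
    # is at least θ, estimating (λ, τ) from the captures in [t - w, t].
    ranked = []
    for url, caps in all_captures.items():
        window = [(ts, links) for ts, links in caps
                  if t - timedelta(weeks=w) <= ts <= t]     # M_{w,t}(u)
        if len(window) < 2:
            continue                     # too little history to estimate λ
        lam, tau = f(window)
        p = g(lam, tau, e, now=t)
        if p >= theta:
            ranked.append((p, url))
    ranked.sort(reverse=True)            # most-likely-changed URLs first
    return [url for _, url in ranked]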
C. Evaluation 2

Here, the HACS model was compared against two baseline models: Last-Obs and Random. In the HACS model, URLs that have a higher probability of change on the crawl date (t + e) are ranked higher. In the Last-Obs model, URL ranks are determined by the date they were last accessed; URLs that have not been updated for the longest time (i.e., larger (t − τ)) are ranked higher. In the Random model, URLs are ranked randomly. By comparing the URL rankings from each model to the expected URL ranking (where URLs that were updated closer to t were ranked higher), we calculate a weighted P@K over all K. Here, the weights were obtained via a logarithmic decay function to increase the contribution of lower K values. This weighted P@K provides a quantitative measure of whether the URLs that were actually updated first were ranked higher. Next, we eliminate the reference point t by calculating the mean weighted P@K over all t, at each history size w. In this manner, we obtain the mean weighted P@K of each model when different history sizes (w) are used. Figure 11 shows the results of this evaluation.
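A minimal sketch of this metric; since the exact logarithmic decay function is not specified, the 1/log2(K + 1) weights below are an assumption:

import math

def weighted_p_at_k(predicted, expected):
    # predicted / expected: URL lists ranked by a model / by actual update
    # time. Mean of P@K over all K, weighted so that small K (the top of
    # the ranking) contributes more.
    total, norm = 0.0, 0.0
    for k in range(1, len(expected) + 1):
        hits = len(set(predicted[:k]) & set(expected[:k]))
        weight = 1.0 / math.log2(k + 1)  # assumed logarithmic decay
        total += weight * (hits / k)
        norm += weight
    return total / norm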
IV. RESULTS

TABLE I. Comparison of the HACS model to the baseline models using Precision (P), Recall (R), and F1 values at e = 1 week, and at the threshold θ̂ where F1 is maximum. Here, w is the history size (in weeks). Maximum values are in bold and highlighted in blue.

The results in Table I indicate that the P and F1 values of the HACS model are higher than those of the Random and Brute Force models for all values of w (history size in weeks). This lead is larger when w is lower. However, the difference becomes less significant as w increases. The Brute Force method had a consistent R of 1.0, since it crawls all URLs at all times. However, this model is impractical due to resource constraints. The HACS model produced a higher R than the Random model at all w. Also, the θ̂ values, approximately 0.7 for homepage-level and 0.5 for website-level, indicate the optimal ranges for θ.

From Figure 8, as θ increases, the F1 score of the HACS model increases until θ = θ̂, and then drops as θ increases further. At θ̂, the HACS model yields the highest micro-average F1 score at both the homepage-level and the website-level. This trend is more prominent at the homepage-level than the website-level. In terms of macro-average F1, the Random model closely follows the HACS model at homepage-level when w = 1. However, the HACS model yields better F1 scores in all other cases. The Brute Force model gives constant F1 scores at both homepage-level and website-level, as it selects all seed URLs regardless of θ.

Fig. 8. F1 vs Threshold (θ) at (a) homepage-level with history = 1 week, (b) homepage-level with history = 2 weeks, (c) website-level with history = 1 week, and (d) website-level with history = 2 weeks. The HACS model produced a higher F1 than the other baseline models. This lead is more visible at the homepage-level than the website-level. As θ increases, the F1 of the HACS model increases up to θ = θ̂, and then drops as θ further increases. This drop is more visible at the website-level than the homepage-level. The macro-average F1 of the Random model follows the HACS model with a similar trend at the homepage-level with history = 1 week.

When comparing precision, Figure 9 shows that both the micro-average and macro-average P of the HACS model increase as θ increases. This is expected, as the URL selection becomes stricter as θ increases, which, in turn, generates fewer false positives. Similar to F1, the lead in P of the HACS model is more noticeable at homepage-level than website-level. Nevertheless, the HACS model yields a higher P than the other models in all cases. The Brute Force model has a constant P, as it selects all URLs regardless of θ. However, the P of the Brute Force model is lower than that of the HACS model at both homepage-level and website-level. Interestingly, the P of the Brute Force and Random models remain close to each other. At θ = 0 (i.e., when no threshold is applied), all models give the same results, as they select all seed URLs.

Fig. 9. Precision (P) vs Threshold (θ), with panels as in Fig. 8. The HACS model produced a higher P than the other baseline models, and its P increases with θ. This lead is more visible at homepage-level than website-level. Both the Random and Brute Force models have a low P, regardless of θ.

When comparing recall, Figure 10 shows that both the micro-average and macro-average R decrease as θ increases. This is expected, as the URL selection becomes stricter as θ increases, which, in turn, generates more false negatives. The Brute Force model has a constant R of 1.0, as it selects all URLs regardless of θ. At θ = 0 (i.e., when no threshold is applied), all models give R = 1.0, as they select all seed URLs. At θ = 1, both the HACS and Random models give R = 0.0, as they select no URLs. For θ values other than these, the HACS model consistently yields a better R than the Random model at both homepage-level and website-level. However, this lead is less significant at website-level than at homepage-level, and diminishes as w increases.

Fig. 10. Recall (R) vs Threshold (θ), with panels as in Fig. 8. The HACS model produced a higher R at lower values of θ, and its R drops as θ increases. The HACS model has a much higher R than the Random model at 0 < θ < 1. This lead is more visible at homepage-level than website-level. The Brute Force model has a constant R of 1.0.

When comparing the average P@K results, Figure 11 shows that the HACS model yields a better average P@K than the Last-Obs and Random models at both homepage-level and website-level, for all values of w. However, the HACS model yields a higher average P@K for lower values of w than for higher values of w. As w increases, the average P@K of all models becomes approximately constant. At homepage-level, the Last-Obs model yields a better average P@K than the Random model for lower values of w. At website-level, however, it yields a worse average P@K than the Random model for higher values of w.

Fig. 11. Mean weighted P@K of the rankings of the HACS, Last-Obs, and Random models against the expected ranking, at different history sizes (w). The HACS model outperforms the Last-Obs and Random models at both homepage-level and website-level.

V. DISCUSSION
From Table I, the P, R, and F1 values obtained by the HACS model are greater than those of the baseline models at both the homepage-level and the website-level, when the optimal threshold θ̂ is selected. Figure 8 shows that regardless of the θ selected, the HACS model performs better than the baseline models. Also, the P of the HACS model increases as θ increases. This indicates that the HACS model predicted a higher probability (p) for the URLs that were updated first during [t, t + e]. This is also confirmed by the higher mean weighted P@K values obtained by the HACS model (see Figure 11). Since R decreases with increasing θ while P increases with increasing θ, it is imperative that an optimal θ value be selected. The results in Table I show that selecting θ = θ̂ (which maximizes F1) provides a good compromise between precision and recall, while still performing better than the baseline models.

The P and R of the Brute Force model are constant irrespective of θ. Though this model yields the highest R (which is 1.0), it consumes a significant amount of resources to crawl everything. This approach does not scale well to a large number of seed URLs. It also yields a lower P and F1 than the HACS model across all w, at both homepage-level and website-level. These results suggest that the HACS model, which yields a much higher P and F1 at a marginal reduction in R, is better suited for a resource-constrained environment.

Recall that the archival of webpages is both irregular and sparse (see Figure 3). In our sample, authors updated their homepages every 141.5 days on average, and their websites every 75 days on average.
Note that here, an update to a webpage means adding a new link to it. Authors may update their homepages or websites by updating content or adding external links. Content updates can be studied in a similar way by comparing the checksums of webpages. Since CDX files only contain mementos of webpages within the same domain, taking external links into consideration may require other data sources. The better performance of the HACS model in estimating the mean update frequency (λ) for homepages may be attributed to the fact that homepages undergo fewer changes than websites.

From Table I, the best micro-average and macro-average F1 measures at both homepage-level and website-level originated from the HACS model when w = 1 and θ was within its optimal range. Figure 8 demonstrates the efficiency of our model. As the threshold θ increases, the number of false positives is reduced, thereby increasing the precision. Here, we note that even a small increase in precision matters, because for a large number of seed URLs, even the slightest increase in precision translates to a large decrease in false positives. If crawling is performed on a regular basis, the HACS model could be utilized to pick the seed URLs that have most likely been updated. This, based on the above results, would improve collection freshness while using resources and bandwidth more effectively.

VI. CONCLUSION
We studied the problem of improving the efficiency of a focused crawl scheduler for the scholarly web. By analyzing the crawl history of seed URLs obtained from the IA, we fit their change information into a Poisson model and estimated the probability that a webpage would be updated (by the addition of new links) by the next crawl. Finally, our scheduler automatically generates a list of the seed URLs most likely to have changed since the last crawl.
Our analysis found that the estimated mean update frequency (or, equivalently, the update interval) follows a log-normal distribution. For the 19,977 authors we studied from Google Scholar, new links were added at an average interval of 141.5 days for a homepage, and 75 days for a website. We also observed that the median crawl intervals of author homepages were mostly between 20 and 127 days. Our evaluation results show that our scheduler achieved better results than the baseline models when θ is optimized. To encourage reproducible research, our research dataset, consisting of HTML, CDX files, and evaluation results, has been made publicly available at https://github.com/oduwsdl/scholarly-change-rate.

In the future, we will investigate different types of updates, such as the addition of a scholarly publication in PDF format. Additionally, author websites could be crawled regularly to ensure that updates are not missed, and the effect of doing so on the estimation of the mean update frequency could be evaluated. We will also generalize this work to more domains by exploring non-scholarly URLs.

VII. ACKNOWLEDGEMENT
This work was supported in part by the National Science Foundation and the Dominion Graduate Scholarship from the College of Science at Old Dominion University.
REFERENCES

[1] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. Kleinberg, "Automatic resource compilation by analyzing hyperlink structure and associated text," Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 65–74, 1998.
[2] S. Chakrabarti, M. Van den Berg, and B. Dom, "Focused crawling: a new approach to topic-specific web resource discovery," Computer Networks, vol. 31, no. 11, pp. 1623–1640, 1999.
[3] C. Olston and M. Najork, "Web crawling," Foundations and Trends in Information Retrieval, vol. 4, no. 3, pp. 175–246, 2010.
[4] B. Tofel, "'Wayback' for accessing web archives," in Proceedings of the 7th International Web Archiving Workshop, IWAW '07, 2007, pp. 27–37.
[5] D. Gomes, J. Miranda, and M. Costa, "A survey on web archiving initiatives," in Proceedings of Theory and Practice of Digital Libraries (TPDL), 2011, pp. 408–420.
[6] M. Farrell, E. McCain, M. Praetzellis, G. Thomas, and P. Walker, "Web archiving in the United States: a 2017 survey," https://osf.io/ht6ay/, 2018.
[7] H. Van de Sompel, M. L. Nelson, R. Sanderson, L. L. Balakireva, S. Ainsworth, and H. Shankar, "Memento: Time Travel for the Web," arXiv, Tech. Rep. arXiv:0911.1112, 2009.
[8] W. Koehler, "Web page change and persistence — a four-year longitudinal study," Journal of the American Society for Information Science and Technology, vol. 53, no. 2, pp. 162–171, 2002.
[9] J. Cho and H. Garcia-Molina, "Estimating frequency of change," ACM Transactions on Internet Technology, vol. 3, no. 3, pp. 256–290, Aug. 2003.
[10] K. Radinsky and P. Bennett, "Predicting content change on the web," in WSDM '13: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, 2013.
[11] E. G. Coffman Jr., Z. Liu, and R. R. Weber, "Optimal robot scheduling for web search engines," Journal of Scheduling, vol. 1, no. 1, pp. 15–29, 1998.
[12] C. Castillo, M. Marin, A. Rodriguez, and R. Baeza-Yates, "Scheduling algorithms for web crawling," in WebMedia and LA-Web 2004, Proceedings. Ribeirao Preto, Brazil: IEEE, Oct. 2004, pp. 10–17.
[13] P. De Bra, G.-J. Houben, Y. Kornatzky, and R. Post, "Information retrieval in distributed hypertexts," in Intelligent Multimedia Information Retrieval Systems and Management - Volume 1. Centre de Hautes Etudes Internationales d'Informatique Documentaire, 1994, pp. 481–491.
[14] M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim, and S. Ur, "The shark-search algorithm. An application: tailored web site mapping," Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 317–326, 1998.
[15] F. Shipman and C. D. D. Monteiro, "Crawling and classification strategies for generating a multi-language corpus of sign language video," in Proceedings of the 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019, pp. 97–106.
[16] G. Gossen, E. Demidova, and T. Risse, "iCrawl: Improving the freshness of web collections by integrating social web and focused web crawling," in Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, 2015, pp. 75–84.
[17] C. Olston and S. Pandey, "Recrawl scheduling based on information longevity," in Proceedings of the 17th International Conference on World Wide Web, WWW 2008. ACM, 2008.
[18] J. Cho, H. Garcia-Molina, and L. Page, "Efficient crawling through URL ordering," Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 161–172, 1998.
[19] M. H. Alam, J. Ha, and S. Lee, "Novel approaches to crawling important pages early," Knowledge and Information Systems, vol. 33, no. 3, pp. 707–734, 2012.
[20] J. Wu, P. Teregowda, J. P. F. Ramírez, P. Mitra, S. Zheng, and C. L. Giles, "The evolution of a crawling strategy for an academic document search engine: Whitelists and blacklists," in Proceedings of the 4th Annual ACM Web Science Conference, ser. WebSci '12, 2012, pp. 340–343.
[21] S. Zheng, P. Dmitriev, and C. L. Giles, "Graph-based seed selection for web-scale crawlers," in Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009. New York, NY, USA: ACM, 2009, pp. 1967–1970.
[22] D. S. Hochbaum and A. Pathria, "Analysis of the greedy approach in problems of maximum k-coverage," Naval Research Logistics (NRL), vol. 45, no. 6, pp. 615–627, 1998.
[23] M. Thelwall and D. Wilkinson, "Graph structure in three national academic webs: Power laws with anomalies," Journal of the American Society for Information Science and Technology, vol. 54, no. 8, pp. 706–712, 2003.
[24] C. L. Giles, K. D. Bollacker, and S. Lawrence, "CiteSeer: An automatic citation indexing system," in Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, USA, 1998, pp. 89–98.
[25] J. Wu, K. Kim, and C. L. Giles, "CiteSeerX: 20 years of service to scholarly big data," in Proceedings of the Conference on Artificial Intelligence for Data Discovery and Reuse, ser. AIDR '19. New York, NY, USA: Association for Computing Machinery, 2019.
[26] J. Cho and H. Garcia-Molina, "The evolution of the web and implications for an incremental crawler," in VLDB 2000, Proceedings of the 26th International Conference on Very Large Data Bases, 2000, pp. 200–209.
[27] J. Cho and H. Garcia-Molina, "Synchronizing a database to improve freshness," in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA, 2000, pp. 117–128.
[28] R. P. Brent, "An algorithm with guaranteed convergence for finding a zero of a function," The Computer Journal, vol. 14, no. 4, pp. 422–425, 1971.
[29] Internet Archive, "Wayback CDX Server API," https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server, 2019.

APPENDIX
This section documents additional results obtained from the evaluation of the HACS model against our baselines, and from the verification of the stochastic nature of scholarly webpage updates for more interval sizes.

Figure 12 illustrates the F1 vs Threshold (θ) of each model when a history size of 3 weeks is used. Here too, the HACS model produced a higher F1 than the other baseline models. This lead is more visible at the homepage-level than the website-level. However, compared to history sizes of 1 week and 2 weeks, this lead is less prominent at both homepage-level and website-level. As θ increases, the F1 of the HACS model increases up to θ = θ̂, and then drops as θ further increases. This drop is more visible at the website-level than the homepage-level. The macro-average F1 of the Random model follows the HACS model with a similar trend at the homepage-level with history = 1 week.

Figure 13 illustrates the Precision (P) vs Threshold (θ) of each model when a history size of three weeks is used. Here too, the HACS model produced a higher P than the baselines for all values of θ at homepage-level, and for all but the largest values of θ at website-level. This lead is more visible at homepage-level than website-level. However, compared to history sizes of 1 week and 2 weeks, this lead is less prominent at both homepage-level and website-level. Both the Random and Brute Force models have a low P, regardless of θ.

Figure 14 illustrates the Recall (R) vs Threshold (θ) of each model when a history size of three weeks is used. Here too, the HACS model produced a higher R than the other baseline models for all values of θ, at both homepage-level and website-level. This lead is more visible at homepage-level than website-level. However, compared to history sizes of 1 week and 2 weeks, this lead is less prominent at both homepage-level and website-level. The Brute Force model has a consistent R of 1.0, as it selects all seed URLs regardless of θ. The Random model has a low R, regardless of θ.

Figures 15, 16, 17, and 18 illustrate the probability of finding author websites with an interpolated update interval of d days for additional values of 1/λ̃, ranging from 7 days to 70 days, at both homepage-level (see Figures 15 and 16) and website-level (see Figures 17 and 18). The results suggest that as d increases, the probability distribution gets closer to the expected Poisson distribution in both cases.
Fig. 12. F1 vs Threshold (θ) when a history of 3 weeks is used: (a) homepage-level, (b) website-level.

Fig. 13. Precision (P) vs Threshold (θ) when a history of 3 weeks is used: (a) homepage-level, (b) website-level.

Fig. 14. Recall (R) vs Threshold (θ) when a history of 3 weeks is used: (a) homepage-level, (b) website-level.

Fig. 15. Probability (y-axis) of finding author websites with an interpolated update interval (∆t̃) of d days (x-axis) at homepage-level, among author websites having 1/λ̃ of (a) 7, (b) 14, (c) 21, and (d) 28 days. The vertical blue line shows where d = 1/λ̃.

Fig. 16. Probability (y-axis) of finding author websites with an interpolated update interval (∆t̃) of d days (x-axis) at homepage-level, among author websites having 1/λ̃ of (a) 42, (b) 49, (c) 56, and (d) 63 days. The vertical blue line shows where d = 1/λ̃.

Fig. 17. Probability (y-axis) of finding author websites with an interpolated update interval (∆t̃) of d days (x-axis) at website-level, among author websites having 1/λ̃ of (a) 7, (b) 14, (c) 21, and (d) 28 days. The vertical blue line shows where d = 1/λ̃.

Fig. 18. Probability (y-axis) of finding author websites with an interpolated update interval (∆t̃) of d days (x-axis) at website-level, among author websites having 1/λ̃ of (a) 42, (b) 49, (c) 56, and (d) 63 days. The vertical blue line shows where d = 1/λ̃.