Online Algorithms for Estimating Change Rates of Web Pages
Konstantin Avrachenkov, Kishor Patil (INRIA Sophia Antipolis, France 06902), and Gugan Thoppe (Indian Institute of Science, Bengaluru, India 560012)
Abstract
For providing quick and accurate search results, a search engine maintains a local snapshot of the entire web. And, to keep this local cache fresh, it employs a crawler for tracking changes across various web pages. It would have been ideal if the crawler managed to update the local snapshot as soon as a page changed on the web. However, finite bandwidth availability and server restrictions mean that there is a bound on how frequently the different pages can be crawled. This then brings forth the following optimisation problem: maximise the freshness of the local cache subject to the crawling frequency being within the prescribed bounds. Recently, tractable algorithms have been proposed to solve this optimisation problem under different cost criteria. However, these assume knowledge of the exact page change rates, which is unrealistic in practice. We address this issue here. Specifically, we provide three novel schemes for online estimation of page change rates. All these schemes only need partial information about the page change process, i.e., they only need to know whether or not the page has changed since the last crawl instance. Our first scheme is based on the law of large numbers, the second on the theory of stochastic approximation, while the third is an extension of the second and involves an additional momentum term. For all of these schemes, we prove convergence and, also, provide their convergence rates. As far as we know, the results concerning the third estimator are quite novel. Specifically, this is the first convergence-type result for a stochastic approximation algorithm with momentum. Finally, we provide some numerical experiments (on real as well as synthetic data) to compare the performance of our proposed estimators with the existing ones (e.g., MLE).
1 Introduction

The world wide web is gigantic: it has a lot of interconnected information, and both the information and the connections keep changing. However, irrespective of the challenges arising out of this, a user always expects a search engine to instantaneously provide accurate and up-to-date results. A search engine deals with this by maintaining a local cache of all the useful web pages and their links. As the freshness of this cache determines the quality of the search results, the search engine regularly updates it by employing a crawler (also referred to as a web spider or a web robot). The job of a crawler is (a) to discover new web pages; (b) to access various web pages at certain frequencies so as to determine if any changes have happened to the content since the last crawled instance; and (c) to update the local cache whenever there is a change. In this work, we focus on tasks (b) and (c). To understand the detailed working of crawlers, see [2, 3, 4, 5, 6].

In general, a crawler has two constraints on how often it can access a page. The first one is due to limitations on the available bandwidth. The second one, also known as the politeness constraint, arises when a server imposes limits on the crawl frequency. The latter implies that the crawler cannot access pages on that server too often in a short amount of time. Such constraints cannot be ignored, since otherwise the server may forbid the crawler from all future accesses.

∗ A shorter version [1] of this paper has appeared in the proceedings of ValueTools 2020. The novel contributions here include an additional online scheme (a stochastic approximation scheme with momentum) and additional experiments based on real data (Wikipedia traces). Email: [email protected], [email protected], [email protected]
In summary, to identify the ideal rates for crawling different web pages, a search engine needs to solve the following optimisation problem: maximise the freshness of the local database subject to constraints on the crawling frequency.

In the early variants of this problem, the freshness of each page was assumed to be equally important [7, 6]. In such cases, experimental evidence somewhat surprisingly shows that the uniform policy, i.e., crawl all pages at the same frequency irrespective of their change rates, is more or less the optimal crawling strategy. Starting from the pioneering work in [8], however, the freshness definition was modified to include different weights for different pages depending on their importance, e.g., represented as the frequency of requests for the pages. The motivation for this change was the fact that only a finite number of pages can be crawled in any given time frame. Hence, to improve the utility of the local database, important pages should be kept as fresh as possible. Not surprisingly, under this new definition, the optimal crawling policy does indeed depend on the page change rates.

The above observation was numerically demonstrated first in [8] for a setup with a small number of pages. A more rigorous derivation of this fact was recently given in the path-breaking paper [9]. In fact, this work also provides a near-linear time algorithm to find a near-optimal solution. A major concern for this algorithm is that it needs to know the actual values of the page change rates. However, in practice, these values are not known in advance and, instead, have to be estimated.

A separate study [10, 11] provides a Whittle-index-based dynamic programming approach to optimise the schedule of a web crawler. In that context, the page/catalogue freshness estimate also influences the optimal crawling policy, and it too requires good estimates of the actual page change rates.

This work, which is an extended version of [1], is mainly motivated by the work of Azar et al.
[9]. Our main contributions here can be summarised as follows. We propose three novel approaches for online estimation of the actual page change rates. The first is based on the Law of Large Numbers (LLN), the second is based on Stochastic Approximation (SA) principles, while the third one is an extension of the second with an additional momentum term. Next, we theoretically show that all these estimators almost surely (a.s.) converge to the actual change rate values; thus, all our estimators are asymptotically consistent. To the best of our knowledge, the result concerning the third estimator is the first to show convergence of a stochastic approximation algorithm with momentum. We also rigorously derive the convergence rates of the first two estimators in the expected error sense. Based on the existing literature, we also provide a loose guess on the convergence rate of the third estimator. Finally, we provide numerical simulations to compare the performance of our online schemes to each other and also to that of the (offline) MLE estimator. Our experiments are based on both real (Wikipedia traces) as well as synthetic data sets. In one of our experiments, we also verify our modelling assumption that the page change process is a Poisson point process.

The rest of this paper is organised as follows. The next section provides a formal summary of this work in terms of the setup, goals, and key contributions. It also gives explicit update rules for all of our online schemes. In Section 3, we formally analyse their convergence and rates of convergence. The numerical experiments discussed above are given in Section 4. Then, in Section 5, we provide some motivation on how one can use our estimates to find the optimal crawling rates. Finally, we conclude in Section 6 with some future directions.

2 Setup, Goal, and Key Contributions

The three topics are individually described below.
Setup: Without loss of generality, we work with a single web page. We presume that the actual times at which this page changes form a time-homogeneous Poisson point process in [0, ∞) with a constant but unknown rate ∆. Independently of everything else, this page is crawled (accessed) at the random instances {t_k}_{k ≥ 0} ⊂ [0, ∞), where t_0 = 0 and the inter-arrival times {t_k − t_{k−1}}_{k ≥ 1} are IID exponential random variables with a known rate p. Thus, the times at which this page is crawled also form a time-homogeneous Poisson point process, but with rate p. At time instance t_k, we get to know if the page got modified or not in the interval (t_{k−1}, t_k], i.e., we can access the value of the indicator

I_k := 1, if the page got modified in (t_{k−1}, t_k], and 0, otherwise.

The above assumptions are standard in the crawling literature; nevertheless, we now provide a quick justification for the same. Our assumption that the page change process is a Poisson point process is based on the experiments reported in [12, 13, 14]. Nevertheless, we also verify this assumption on a page randomly selected from frequently edited Wikipedia pages. We extract the complete history of this web page (exact time and date of each change) for a period of five months (April 01, 2020 to August 31, 2020). From the available history, we calculate the inter-arrival times of the page change process and draw a Q-Q plot. We were indeed able to observe that the set of quantiles for the real data matches linearly with the quantiles of the exponential distribution; more details are given in Section 4. Some generalised models for the page change process have also been considered in the literature [15, 16]; however, we do not pursue them here. Separately, our assumption on {I_k} is based on the fact that a crawler can only access incomplete knowledge about the page change process. In particular, a crawler does not know when and how many times a page has changed between two crawling instances.
Instead, all it can track is the status of a page at each crawling instance, i.e., whether or not it has changed with respect to the previous access. Sometimes, it is also possible to know the time at which the page was last modified [3, 17], but we do not consider this case here.

Goal: Develop online algorithms for estimating ∆ in the above setup. The motivation for doing this is that such estimates can then be used to estimate the optimal crawling rates [9, 18]; see Section 5 for more details on this.
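For concreteness, the observation model above can be simulated in a few lines. The following is an illustrative sketch (the function name `simulate_crawl` and its arguments are our own); it relies on the fact that the number of page changes in an interval of length τ is Poisson(∆τ), hence zero with probability exp(−∆τ):

```python
import math
import random

def simulate_crawl(delta, p, n, seed=0):
    """Simulate n crawl observations of a single page.

    The page changes at the jumps of a Poisson process with (unknown) rate
    delta; crawls happen at the jumps of an independent Poisson process with
    known rate p. At each crawl we only observe the indicator I_k: did the
    page change at least once since the previous crawl?
    """
    rng = random.Random(seed)
    indicators = []
    for _ in range(n):
        tau = rng.expovariate(p)  # inter-crawl time t_k - t_{k-1} ~ Exp(p)
        # The number of changes in (t_{k-1}, t_k] is Poisson(delta * tau),
        # so the page is unchanged with probability exp(-delta * tau).
        indicators.append(1 if rng.random() > math.exp(-delta * tau) else 0)
    return indicators
```

Since E[I_k] = ∆/(∆ + p) (derived in Section 3.1), with ∆ = p roughly half of the crawls should detect a change.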
Key Contributions: We present three online methods for estimating the page change rate ∆. The first is based on the law of large numbers, while the second and third are based on the theory of stochastic approximation, with the third one having an additional momentum component. If {x_k}, {y_k}, and {z_k} denote the iterates of these three methods, respectively, then their update rules are as shown below.

• LLN Estimator: For k ≥ 1,

x_k = p Î_k / (k + α_k − Î_k). (1)

Here, Î_k = Σ_{j=1}^k I_j; hence, Î_k = Î_{k−1} + I_k. And, {α_k} is any positive sequence satisfying the conditions in Theorem 1; e.g., {α_k} could be {1}, {log k}, or {√k}.

• SA Estimator: For k ≥ 0 and an arbitrary initial value y_0,

y_{k+1} = y_k + η_k [I_{k+1}(y_k + p) − y_k]. (2)

Here, {η_k} is any stepsize sequence that satisfies the conditions in Theorem 2. For example, {η_k} could be {1/(k+1)^η} for some η ∈ (1/2, 1].

• SAM Estimator (SA Estimator with Momentum): For k ≥ 0 and arbitrary initial values z_0, z_{−1},

z_{k+1} = z_k + η_k [I_{k+1}(z_k + p) − z_k] + ζ_k (z_k − z_{k−1}). (3)

Here, {η_k} and {ζ_k} are any stepsize sequences that satisfy the conditions given in Theorem 3. For example, pick a β ∈ (1/2, 1] and let β_k = 1/(k+1)^β. Then, {η_k} and {ζ_k} could be {1/(k+1)^η} and {(β_{k−1} − ωη_k)/β_{k−1}}, respectively, where ω > 0 and β < η ≤ 2β. While we do not show it, we conjecture that one can also pick β ∈ (0, 1/2] and then choose η so that β < η ≤ 2β. Note that if β = η and ω = 1, then the asymptotic behaviour of (3) will resemble that of (2); this is because lim_{k→∞} ζ_k = 0 then.

We call these methods online because the estimates can be updated on the fly as and when a new observation I_k becomes available. This contrasts with the MLE estimator, in which one needs to start the calculation from scratch each time a new data point arrives. Also, unlike MLE, our estimators are never unstable; see Section 3.4 for the details.

Our main results include the following. We show that all our three estimators, i.e., x_k, y_k, and z_k, converge to ∆ a.s. Further, we show that

1. E|x_k − ∆| = O(max{k^{−1/2}, α_k/k}), and

2. E|y_k − ∆| = O(k^{−η/2}) if η_k = 1/(k+1)^η with η ∈ (1/2, 1].

Separately, based on existing literature [19, 20, 21], we conjecture that E|z_k − ∆| = Õ(k^{−β/2}), where Õ hides logarithmic terms. However, we believe that this estimate is not tight in our setup; see Remark 8.

Finally, we provide several numerical experiments based on real as well as synthetic data for judging the strength of our three proposed estimators.

3 Convergence Analysis

Here, we formally discuss the convergence and convergence rates of our three estimators. Thereafter, we compare their behaviours with the estimators that already exist in the literature: the Naive estimator, the MLE estimator, and the Moment Matching (MM) estimator [22].
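As a concrete illustration before the formal analysis, the three update rules (1)-(3) can be implemented as follows. This is a minimal sketch: the function names and parameter defaults are ours, with α_k = √k for LLN, η_k = 1/(k+1)^0.75 for SA, and the two-timescale choice β = 0.8, η = 1.2, ω = 1 for SAM.

```python
import math

def lln_estimate(indicators, p):
    """LLN estimator (1): x_k = p * I_hat_k / (k + alpha_k - I_hat_k), alpha_k = sqrt(k)."""
    i_hat, x = 0, 0.0
    for k, i in enumerate(indicators, start=1):
        i_hat += i  # I_hat_k = I_1 + ... + I_k
        x = p * i_hat / (k + math.sqrt(k) - i_hat)
    return x

def sa_estimate(indicators, p, y0=1.0, eta=0.75):
    """SA estimator (2): y_{k+1} = y_k + eta_k * [I_{k+1} * (y_k + p) - y_k]."""
    y = y0
    for k, i in enumerate(indicators):
        eta_k = 1.0 / (k + 1) ** eta  # eta_k = 1/(k+1)^eta, eta in (1/2, 1]
        y += eta_k * (i * (y + p) - y)
    return y

def sam_estimate(indicators, p, z0=1.0, beta=0.8, eta=1.2, omega=1.0):
    """SAM estimator (3): the SA step plus the momentum term zeta_k * (z_k - z_{k-1}),
    with zeta_k = (beta_{k-1} - omega * eta_k) / beta_{k-1} and z_{-1} = z_0."""
    z_prev = z = z0
    for k, i in enumerate(indicators):
        eta_k = 1.0 / (k + 1) ** eta
        beta_prev = 1.0 / max(k, 1) ** beta  # beta_{k-1}; beta_{-1} taken as 1
        zeta_k = (beta_prev - omega * eta_k) / beta_prev
        z_next = z + eta_k * (i * (z + p) - z) + zeta_k * (z - z_prev)
        z_prev, z = z, z_next
    return z
```

With ∆ = 2 and p = 1, we have E[I_1] = 2/3, so on a long indicator stream in which roughly two out of every three crawls detect a change, all three estimates should settle near 2.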
3.1 LLN Estimator

Our first aim here is to obtain a formula for E[I_1]. We shall use this later to motivate the form of our LLN estimator.

Let τ_1 = t_1 − t_0 = t_1, where the second equality holds since t_0 = 0. Then, as per our assumptions in Section 2, τ_1 is an exponential random variable with rate p. Also, E[I_1 | τ_1 = τ] = 1 − exp(−∆τ). Hence,

E[I_1] = ∆/(∆ + p). (4)

This gives the desired formula for E[I_1]. From this latter calculation, we have

∆ = p E[I_1] / (1 − E[I_1]). (5)

Separately, because {I_k} is an IID sequence and E|I_1| ≤ 1, the strong law of large numbers gives E[I_1] = lim_{k→∞} Σ_{j=1}^k I_j / k a.s. Thus,

∆ = p lim_{k→∞} (Σ_{j=1}^k I_j / k) / (1 − lim_{k→∞} Σ_{j=1}^k I_j / k) a.s.

Consequently, a natural estimator for ∆ is

x'_k = p (Σ_{j=1}^k I_j / k) / (1 − Σ_{j=1}^k I_j / k) = p Î_k / (k − Î_k), (6)

where Î_k is as defined below (1).

Unfortunately, the above estimator faces an instability issue, i.e., x'_k = ∞ when I_1, ..., I_k are all 1. To fix this, one can add a non-zero term in the denominator. The different choices then give rise to the LLN estimator defined in (1).

The following result discusses the convergence and convergence rate of this estimator.

Theorem 1.
Consider the estimator given in (1) for some positive sequence {α_k}.
1. If lim_{k→∞} α_k/k = 0, then lim_{k→∞} x_k = ∆ a.s.

2. Additionally, if lim_{k→∞} log(k/α_k)/k = 0, then E|x_k − ∆| = O(max{k^{−1/2}, α_k/k}).

Proof.
Let μ = E[I_1], Ī_k = Î_k/k, and ᾱ_k = α_k/k. Then, observe that (1) can be rewritten as x_k = p Ī_k / (ᾱ_k + 1 − Ī_k). Now, lim_{k→∞} Ī_k = μ a.s. and lim_{k→∞} ᾱ_k = 0; the first claim holds due to the strong law of large numbers, while the second one is true due to our assumption. Statement 1 is now easy to see.

We now derive Statement 2. From (5), we have

|x_k − ∆| = |x_k − pμ/(1 − μ)| ≤ p (A_k + B_k),

where

A_k = |Ī_k/(ᾱ_k + 1 − Ī_k) − μ/(ᾱ_k + 1 − μ)| and B_k = |μ/(ᾱ_k + 1 − μ) − μ/(1 − μ)|.

Since ᾱ_k > 0, it follows that

B_k = ᾱ_k μ / ((1 − μ)(ᾱ_k + 1 − μ)) ≤ ᾱ_k μ / (1 − μ)².

Similarly,

A_k ≤ ((1 + ᾱ_k)/(1 − μ)) (|Ī_k − μ| / (ᾱ_k + 1 − Ī_k)).

It is now easy to see that E[B_k] = O(ᾱ_k). The rest of our arguments concern how fast E[A_k] decays to 0.

Let {δ_k} be a deterministic sequence that is both non-negative and decays to 0. We will describe how to pick this later. Let k be such that (1 + δ_k)μ < 1. Then,

E[|Ī_k − μ| / (ᾱ_k + 1 − Ī_k)] ≤ E[C_k] + E[D_k],

where

C_k = (|Ī_k − μ| / (ᾱ_k + 1 − Ī_k)) 1{Ī_k − μ ≤ δ_k μ}, and D_k = (|Ī_k − μ| / (ᾱ_k + 1 − Ī_k)) 1{Ī_k − μ ≥ δ_k μ}.

On the one hand,

E[C_k] ≤ E|Ī_k − μ| / (ᾱ_k + 1 − (1 + δ_k)μ) ≤ √(Var[I_1]) / (√k (ᾱ_k + 1 − (1 + δ_k)μ)).

On the other hand, since |Ī_k − μ| ≤ 1 and 1 − Ī_k ≥ 0, it follows by applying the Chernoff bound that

E[D_k] ≤ (1/ᾱ_k) Pr{Ī_k ≥ (1 + δ_k)μ} ≤ (1/ᾱ_k) exp(−kδ_k²μ/3).

Now, pick {δ_k} so that δ_k² = (6 log(1/ᾱ_k)/(kμ)) ∨ (1/k) for k ≥ 1. Notice that this choice is both non-negative and decays to 0 due to our assumptions on {α_k}; thus, this is a valid choice. It is now easy to see that E[C_k] = O(1/√k) and E[D_k] = O(ᾱ_k). The desired result now follows.
3.2 SA Estimator

Let I denote a random variable with the same distribution as I_1. Also, for y ∈ ℝ, let H(y, I) = I(y + p) − y. Next, define h: ℝ → ℝ by h(y) := E[H(y, I)]. Observe that h(y) = p(∆ − y)/(∆ + p); further, ∆ is its unique zero. The theory of stochastic approximation then suggests using the update rule given in (2) for estimating ∆. We now discuss the convergence and convergence rate of this algorithm.
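The claimed simplification of the drift, h(y) = p(∆ − y)/(∆ + p), follows by substituting E[I] = ∆/(∆ + p) from (4), and is easy to sanity-check numerically. A small sketch (function names are ours):

```python
def drift(y, delta, p):
    """h(y) = E[H(y, I)] = E[I] * (y + p) - y, with E[I] = delta / (delta + p)."""
    mu = delta / (delta + p)
    return mu * (y + p) - y

def drift_closed_form(y, delta, p):
    """The simplified form h(y) = p * (delta - y) / (delta + p)."""
    return p * (delta - y) / (delta + p)
```

In particular, both forms vanish exactly at y = ∆, the unique zero that the SA iteration (2) seeks.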
Theorem 2.
Consider the estimator given in (2) for some positive stepsize sequence {η_k}.

1. Suppose that Σ_{k=0}^∞ η_k = ∞ and Σ_{k=0}^∞ η_k² < ∞. Then, lim_{k→∞} y_k = ∆ a.s.

2. Suppose that η_k = 1/(k+1)^η for some constant η ∈ (1/2, 1]. Then, E|y_k − ∆| = O(k^{−η/2}).

Proof.
For k ≥ 0, consider the σ-field F_k := σ(y_j, I_j, j ≤ k). Then, from (4) and the fact that {I_k} is an IID sequence, we get

E[I_{k+1}(y_k + p) − y_k | F_k] = (∆/(∆ + p))(y_k + p) − y_k = h(y_k).

Hence, one can rewrite (2) as

y_{k+1} = y_k + η_k [h(y_k) + M_{k+1}], (7)

where

M_{k+1} = [I_{k+1}(y_k + p) − y_k] − h(y_k) = [I_{k+1} − ∆/(∆ + p)](y_k + p). (8)

Since E[M_{k+1} | F_k] = 0 for all k ≥ 0, {M_k} is a martingale difference sequence. Consequently, (7) is a classical SA algorithm whose limiting ODE is

ẏ(t) = h(y(t)). (9)

We now make use of Theorem 9 given in the Appendix to establish Statement 1. Accordingly, we verify the four conditions listed there. The stepsize Condition i.) directly holds due to our assumptions on {η_k}. With regards to Condition ii.), recall we have already established above that {M_k} is a martingale difference sequence with respect to {F_k}. The square-integrability condition holds since |M_{k+1}| ≤ |y_k| + p which, in turn, implies that E[|M_{k+1}|² | F_k] ≤ 2(p² ∨ 1)(1 + |y_k|²), as desired. Next, due to linearity, h is trivially Lipschitz continuous. Further, h(y) = 0 if and only if y = ∆. This shows that ∆ is the unique equilibrium point of (9). Now, because the coefficient of y in h(y) is negative, it also follows that ∆ is the unique globally asymptotically stable equilibrium of (9). This verifies Condition iii.). We finally consider Condition iv.). Let h_∞(y) := −yp/(∆ + p). Then, clearly, h_c → h_∞ uniformly on compacts as c → ∞. Furthermore, since the coefficient of y is negative in the definition of h_∞, it is easy to see that the origin is the unique globally asymptotically stable equilibrium of the ODE ẏ(t) = h_∞(y(t)), as required. Statement 1 now follows.

We now sketch a proof for Statement 2. First, note that

y_{k+1} − ∆ = (1 − aη_k)(y_k − ∆) + η_k M_{k+1}, where a = p/(∆ + p).
Now, since E[M_{k+1} | F_k] = 0, we have

E[(y_{k+1} − ∆)² | F_k] = (1 − aη_k)² (y_k − ∆)² + η_k² E[M²_{k+1} | F_k].

Recall that E[M²_{k+1} | F_k] ≤ C(1 + y_k²) for some constant C ≥ 0. By substituting this above and then repeating all the steps from the proof of [23, Theorem 3.1], it is not difficult to see that Statement 2 holds as well.
3.3 SA Estimator with Momentum

In simple words, our SAM estimator is the SA estimator discussed above with an additional momentum term. Simulations in Section 4 show that this simple modification results in a drastic improvement in performance. We now discuss the convergence of the SAM estimator under the assumption that, for k ≥ 0,

ζ_k = (β_{k−1} − ωη_k)/β_{k−1}, (10)

where ω > 0 and {β_k} is some positive real sequence. By substituting (10) and letting u_k = (z_k − z_{k−1})/β_{k−1}, observe that the update rule in (3) can be rewritten as

u_{k+1} = u_k + γ_k [I_{k+1}(z_k + p) − z_k] − ωγ_k u_k,

where γ_k := η_k/β_k. For k ≥ 0, let M_{k+1} be as in (8) (with z_k in place of y_k). Also, let F_k denote the σ-field σ(z_0, u_0, I_1, ..., I_k). Clearly, u_k, z_k ∈ F_k and E[M_{k+1} | F_k] = 0. Hence, {M_k} is again a martingale difference sequence with respect to the filtration {F_k}. Furthermore, since |M_{k+1}| ≤ |z_k| + p, we have

E[|M_{k+1}|² | F_k] ≤ 2(p² ∨ 1)(1 + |z_k|²). (11)

As before, let a = p/(∆ + p). Also, let b = ∆p/(∆ + p) and ε_k = u_{k+1} − u_k for k ≥ 0. It is then easy to see that one can write down (3) in terms of the following two update rules:

u_{k+1} = u_k + γ_k [h(u_k, z_k) + M_{k+1}], (12)
z_{k+1} = z_k + β_k [g(u_k, z_k) + ε_k], (13)

where h: ℝ² → ℝ and g: ℝ² → ℝ are the linear functions given by h(u, z) = b − ωu − az and g(u, z) = u.

Theorem 3.
Consider the SAM estimator given in (3) with ζ_k of the form given in (10). Then z_k → ∆ a.s. if one of the following conditions holds true.

1. One-timescale: Σ_{k≥0} β_k = ∞, Σ_{k≥0} β_k² < ∞, and β_k = γ_k.

2. Two-timescale: Σ_{k≥0} β_k = Σ_{k≥0} γ_k = ∞, Σ_{k≥0} (β_k² + γ_k²) < ∞, and lim_{k→∞} β_k/γ_k = 0.

Recall that γ_k = η_k/β_k. We state a few remarks concerning this result before discussing its proof.
Remark 4. Examples of {η_k} and {β_k} sequences such that the above conditions are satisfied include the following.

• One-timescale: β_k = 1/(k+1)^β with β ∈ (1/2, 1] and η_k = 1/(k+1)^η with η = 2β.

• Two-timescale: β_k = 1/(k+1)^β with β ∈ (1/2, 1] and η_k = 1/(k+1)^η with β < η < 2β.

In either case, note that lim_{k→∞} ζ_k = 1.

Remark 5. The justification for the names given above for the two sets of conditions is as follows. Under the first set of conditions, the update rules in (12) and (13) indeed behave like a one-timescale stochastic approximation algorithm, i.e., both u_k and z_k move on the same timescale. On the other hand, under the second set of conditions, the pair (12) and (13) behaves like a two-timescale stochastic approximation algorithm. This is because β_k decays to 0 at a much faster rate than γ_k, in turn implying that the changes in {z_k}, i.e., {z_{k+1} − z_k}, are of a smaller magnitude than those in {u_k}.

Remark 6. In the spirit of the above remark, a natural question to consider is the following. Can one pick {η_k} and {β_k} so that γ_k/β_k → 0? That is, can one pick the stepsizes so that u_k now becomes the slowly moving update relative to z_k? The answer to this question seems to be no. This is because a couple of sufficient conditions needed to guarantee convergence (see Conditions iii.) and iv.) in Theorem 11) no longer hold true for this new setup. Furthermore, simulations seem to suggest that the iterates, in fact, race to infinity.

Remark 7. Another question to consider is the following. Can one pick ω, {β_k}, and {η_k} so that ζ_k → ζ, where ζ is a constant in (0, 1)? For example, one could pick ω = 1 − ζ, β_k = 1/(k+1)^β with β ∈ (1/2, 1], and η_k = 1/(k+1)^β, so that ζ_k → ζ. The answer to this second question does not seem to be clear. This is because lim_{k→∞} γ_k would then equal 1. Consequently, again, one of the sufficient conditions needed to guarantee convergence (see Condition i.) of Theorem 11) would no longer hold. However, simulations in this case do show some promise.
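The stepsize relationships in Remark 4 are easy to check numerically, e.g., that ζ_k → 1 and that β_k/γ_k → 0 for the two-timescale choice. A small sketch (the function name and the convention β_{−1} = 1 are ours):

```python
def sam_stepsizes(k, beta=0.8, eta=1.2, omega=1.0):
    """Stepsizes for the two-timescale choice of Remark 4:
    beta_k = 1/(k+1)^beta, eta_k = 1/(k+1)^eta,
    zeta_k = (beta_{k-1} - omega * eta_k) / beta_{k-1}.
    Returns the tuple (eta_k, beta_k, zeta_k).
    """
    eta_k = 1.0 / (k + 1) ** eta
    beta_k = 1.0 / (k + 1) ** beta
    beta_prev = 1.0 / k ** beta if k >= 1 else 1.0  # beta_{k-1}; beta_{-1} taken as 1
    zeta_k = (beta_prev - omega * eta_k) / beta_prev
    return eta_k, beta_k, zeta_k
```

Here ζ_k = 1 − ωη_k/β_{k−1} ≈ 1 − ωγ_k, which increases towards 1, while β_k/γ_k = β_k²/η_k = (k+1)^{η−2β} decays to 0 whenever η < 2β.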
Remark 8. Based on the existing literature on convergence rates for one-timescale and two-timescale linear stochastic approximation [23, 19, 20, 21], one can conjecture that E|z_k − ∆| = Õ(k^{−β/2}) when {β_k} and {η_k} are chosen as described in Remark 4. This implies the optimal convergence rate would then again be Õ(1/√k), which matches the bound we have obtained in Theorem 2 for the SA estimator. However, we believe that this bound may not be tight in the case of the SAM estimator. This is because (13) lacks the martingale difference term and, typically, these are the kind of terms that dictate the convergence rates. Furthermore, simulations in Section 4 suggest that the SAM estimator always converges much faster than the SA estimator.

Proof of Theorem 3.
We discuss the two cases one by one.
One-timescale Setup: In this case, the update rules given in (12) and (13) together form a one-timescale stochastic approximation algorithm. More specifically, if we let v_k = (u_k, z_k)ᵀ, then it follows that

v_{k+1} = v_k + β_k (H(v_k) + (0, ε_k)ᵀ + (M_{k+1}, 0)ᵀ), (14)

where H: ℝ² → ℝ² is the function defined by

H(v) = (b, 0)ᵀ − A v, with A = [[ω, a], [−1, 0]].

We now verify the four conditions listed in Theorem 9 and then make use of Proposition 10 (both given in the appendix) to show that v_k → (0, ∆)ᵀ =: v* a.s. This automatically implies z_k → ∆ a.s., which is what we need to prove.

Notice that the stepsize in (14) is β_k. Condition i.), therefore, trivially holds due to the assumptions made in Statement 1. Next, observe that the martingale difference term in (14) is the vector (M_{k+1}, 0)ᵀ. This, along with (11) and the statements above it, shows that Condition ii.) is true as well.

With regards to Condition iii.), first note that H is trivially Lipschitz continuous due to the linearity of both its component functions. Next, since ∆ = b/a, we have that H(v) = 0 if and only if v = v*. Furthermore, since a and ω are strictly positive, the real parts of the eigenvalues of the matrix A are also positive. This can be seen from the following set of observations. To begin with, the associated characteristic equation of this matrix is

λ² − λω + a = 0.

Hence, the roots are λ = (ω ± √(ω² − 4a))/2. If ω² < 4a, then the roots are complex valued; therefore, the real part of both these roots is ω/2 > 0. Instead, if ω² ≥ 4a, then both the roots are real; further, the smaller of the two roots, i.e., (ω − √(ω² − 4a))/2, is strictly positive since a > 0. This shows that −A, the matrix in the definition of H, is Hurwitz. Together, these observations show that v* is the unique globally asymptotically stable equilibrium of the ODE v̇(t) = H(v(t)).
This verifies Condition iii.).

Finally, let H_∞(v) = −A v, with A as above. Then, it is easy to see that H_c(v) → H_∞(v) uniformly on compact sets as c → ∞. Also, H_∞(v) = 0 if and only if v = 0. Furthermore, as shown before, −A is Hurwitz. This implies that the origin is the unique globally asymptotically stable equilibrium of the ODE v̇(t) = H_∞(v(t)). This verifies Condition iv.).

It now remains to check that {ε_k} has the decaying behaviour described in Proposition 10. Towards this, since |M_{k+1}| ≤ p + |z_k|, we have

‖(0, ε_k)ᵀ‖ ≤ C′γ_k (1 + |u_k| + |z_k|) ≤ Cγ_k (1 + ‖v_k‖)

for some constants C, C′ ≥ 0. Now, because γ_k decays to 0 as k → ∞ due to the assumption in Statement 1, it follows that {ε_k} indeed has the desired behaviour. This completes the proof in the one-timescale setup.

Two-timescale Setup: Since β_k/γ_k → 0, one can perceive u_k to be changing on a faster timescale relative to z_k. Hence, the update rules in (12) and (13) can be viewed as a two-timescale stochastic approximation algorithm. We now verify the conditions listed in Theorem 11 and then use Proposition 12 (both given in the appendix) to conclude z_k → ∆ a.s.

Conditions i.) and ii.) trivially hold. Hence, we only focus on verifying Conditions iii.) and iv.). Because of linearity, h and g are trivially Lipschitz continuous. Next, let φ(z) = (b − az)/ω for z ∈ ℝ. Clearly, φ is linear in z and, hence, Lipschitz continuous. Also, h(φ(z), z) = 0. This, along with the fact that the sign in front of u in h(u, z) is negative, shows that φ(z) is indeed the unique globally asymptotically stable equilibrium of the ODE u̇(t) = h(u(t), z). Next, observe that the ODE ż(t) = g(φ(z(t)), z(t)) has the form

ż(t) = (b − az(t))/ω.
Clearly, this ODE has ∆ as its unique globally asymptotically stable equilibrium. This completes the verification of Condition iii.).

With regards to Condition iv.), first let h_∞ be the function defined by h_∞(u, z) = −ωu − az. Also, for z ∈ ℝ, let φ_∞(z) = −az/ω. This function is linear in z and, hence, Lipschitz; also, φ_∞(0) = 0. Then, on the one hand, h_c → h_∞ uniformly on compacts as c → ∞ and, on the other hand, the ODE u̇(t) = h_∞(u(t), z) = −ωu(t) − az indeed has φ_∞(z) as its unique globally asymptotically stable equilibrium. Finally, for z ∈ ℝ, let g_∞(z) = −az/ω. Then, trivially, g_c → g_∞ uniformly on compacts as c → ∞. Further, ż(t) = g_∞(z(t)) = −az(t)/ω indeed has the origin as its unique globally asymptotically stable equilibrium. With this, we finish verifying Condition iv.).

Now, as per Proposition 12, we need to show that {ε_k} is asymptotically negligible. This is indeed true since |M_{k+1}| ≤ |z_k| + p, which implies |ε_k| ≤ Cγ_k(1 + |u_k| + |z_k|) for some constant C ≥ 0, and since γ_k → 0. This shows that (u_k, z_k) → (φ(∆), ∆) = (0, ∆) a.s., as desired.

3.4 Comparison with Existing Estimators

As far as we know, there are three other approaches in the literature for estimating page change rates: the Naive estimator, the MLE estimator, and the MM estimator. The details about the first two estimators can be found in [17] while, for the third one, one can look at [22]. We now do a comparison, within the context of our setup, between these estimators and the ones that we have proposed.

The Naive estimator simply uses the average number of changes detected to approximate the rate at which a page changes. That is, if the sequence {w_k} denotes the values of the Naive estimator then, in our setup, w_k = p Î_k/k, where Î_k is as defined below (1). The intuition behind this is the following. If τ_1 is as defined at the beginning of Section 3.1, and N(τ_1) denotes the number of page changes in that interval, then observe that E[N(τ_1)] = ∆/p.
Hence, the Naive estimator tries to approximate E[N(τ_1)] with Î_k/k so that the previous relation can then be used for guessing the change rate.

Clearly, E[w_k] = p∆/(∆ + p) ≠ ∆. Also, from the strong law of large numbers, w_k → p∆/(∆ + p) ≠ ∆ a.s. Thus, this estimator is neither consistent nor unbiased. This is to be expected, since this estimator does not account for all the changes that occur between two consecutive accesses.

Next, we look at the MLE estimator. Informally, this estimator identifies the parameter value that has the highest probability of producing the observed set of observations. In our setup, the value of the MLE estimator is obtained by solving the following equation for ∆:

Σ_{j=1}^k I_j τ_j / (exp(∆τ_j) − 1) = Σ_{j=1}^k (1 − I_j) τ_j, (15)

where τ_k = t_k − t_{k−1} and {t_k} is as defined in Section 2. The derivation of this relation is given in [17, Appendix C]. As mentioned in [17, Section 4], the above estimator is consistent.

Note that the MLE estimator makes actual use of the inter-arrival crawl times {τ_k}, unlike our estimators and also the Naive estimator. In this sense, it fully accounts for the randomness and available information in the crawling process. And, as we shall see in the numerical section, the quality of the estimate obtained via MLE improves rapidly in comparison to the Naive estimator as the sample size increases.

However, MLE suffers in two aspects: computational tractability and mathematical instability. Specifically, note that the MLE estimator lacks a closed-form expression. Therefore, one has to solve (15) by using numerical methods such as the Newton-Raphson method, Fisher's scoring method, etc. Unfortunately, using these ideas to solve (15) takes more and more time as the number of samples grows. Also note that, under the above solution ideas, the MLE estimator works in an offline fashion. In that, each time we get a new observation, (15) needs to be solved afresh. This is because there is no easy way to efficiently reuse the calculations from one iteration in the next; note that the defining equation (15) changes in a significant and nontrivial way from one iteration to another.

Besides the complexity, the MLE estimator is also unstable in two situations: one, when no changes have been detected (I_j = 0 for all j ∈ {1, ..., k}), and the other, when all the accesses detect a change (I_j = 1 for all j ∈ {1, ..., k}). In the first setting, no solution exists; in the second setting, the solution is ∞. One simple strategy to avoid these instability issues is to clip the estimate to some pre-defined range whenever one of these bad observation instances occurs.

Finally, let us discuss the MM estimator.
Here, one looks at the fraction of page accesses in which no change was detected and then, via moment matching, approximates the actual page change rate. In our context, the value of this estimator is obtained by solving

    ∑_{j=1}^{k} (1 − I_j) = ∑_{j=1}^{k} e^{−∆τ_j}

for ∆. The details of this equation are given in [22, Section 4]. While the MM idea is indeed simpler than MLE, the associated estimation process suffers from the same instability and computational issues as the ones discussed above.

We emphasise that none of our estimators suffers from any of the issues mentioned above. In particular, all of our estimators are online and have significantly simpler update rules; thus, improving the estimate whenever a new data point arrives is extremely easy. Moreover, all of them are stable, i.e., the estimated values are almost surely finite. More importantly, the performance of our estimators is comparable to that of the MLE. This can be seen from the numerical experiments in Section 4.
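To make the preceding discussion concrete, here is a small sketch (our own illustration, not the implementation of [17] or [22]) that solves both the MLE equation (15) and the MM equation by bisection on simulated crawl data. In both equations, the left-hand side minus the right-hand side is strictly decreasing in ∆, so bisection applies whenever a finite root exists, i.e., when the data contains both changes and non-changes:

```python
import math
import random

def bisect_root(f, lo=1e-9, hi=1e4, iters=200):
    """Bisection for a strictly decreasing f with f(lo) > 0 > f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def mle_estimate(taus, I):
    # Equation (15): sum_j I_j tau_j / (exp(D tau_j) - 1) = sum_j (1 - I_j) tau_j.
    rhs = sum(t for t, i in zip(taus, I) if i == 0)
    def f(D):
        lhs = 0.0
        for t, i in zip(taus, I):
            if i and D * t < 700.0:          # skip terms that underflow to ~0
                lhs += t / math.expm1(D * t)
        return lhs - rhs
    return bisect_root(f)

def mm_estimate(taus, I):
    # MM equation: sum_j (1 - I_j) = sum_j exp(-D tau_j).
    zeros = sum(1 - i for i in I)
    return bisect_root(lambda D: sum(math.exp(-D * t) for t in taus) - zeros)

# Simulated crawls: true change rate delta = 2, crawl rate p = 1.
random.seed(1)
delta, p, n = 2.0, 1.0, 5000
taus = [random.expovariate(p) for _ in range(n)]          # inter-access times
I = [int(random.expovariate(delta) < t) for t in taus]    # change detected?
print(mle_estimate(taus, I))          # close to the true rate 2.0
print(mm_estimate(taus, I))           # also close to 2.0
print(p * sum(I) / len(I))            # Naive-style value: near p*delta/(delta+p)
```

Note that, exactly as discussed above, `mle_estimate` and `mm_estimate` must redo the full root search each time a new observation arrives; this is the offline behaviour that our online estimators avoid. The last line shows the Naive-style value p·Î_k/k concentrating around p∆/(∆ + p) = 2/3 rather than ∆ = 2, illustrating its bias.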
Here, we demonstrate the strength of our estimators using three different experiments. The first one involves real data based on Wikipedia traces. On the one hand, we use this experiment to validate our model assumption that the page change process is a stationary Poisson point process. On the other hand, we use it to demonstrate that the estimation quality of our online estimators is comparable to that of the offline MLE estimator. In the second experiment, using synthetic data, we study the impact of ∆ and p on our three estimators. In the third experiment, we similarly study how the choice of {α_k}, {η_k} and {β_k} influences the performance.

Figure 1 (Different Estimators: Real Data): (a) Q-Q plot of the real data versus the exponential distribution; (b) and (c) trajectories of the Naive, MLE, LLN, SA and SAM estimators for two different crawling rates.
Figure 2 (Synthetic data: ∆ = 5, p = 3): (a) performance of single trajectories of the Naive, MLE, LLN, SA and SAM estimators; (b) 95% confidence interval; (c) root mean square error.

4.1 Performance on Real Data (Expt. 1)
As mentioned before, our goal here is to provide a validation of our model as well as to compare the performance of the different estimators on real data.

To generate the data set, we used Wikipedia traces which are openly available on the web. In particular, we looked at the list of frequently edited pages on Wikipedia and then randomly selected one page. The title of the page we chose was 'Template talk: Did you know'. Next, we extracted the timestamps at which this page was edited over the last five months (starting April 01).

Using a Q-Q plot, we then compared the distribution of the collected inter-update times to that of an exponential distribution with rate parameter equal to this ∆ value. That is, we used a scatterplot to compare the quantiles of the given data to those of the exponential distribution with the estimated rate. As Fig. 1(a) shows, the points lie close to the 45° diagonal. This implies that both sets of quantiles come from the same distribution, thereby confirming that the collected inter-update times indeed follow an exponential distribution whose rate is close to ∆. Equivalently, this implies that the update times come from a Poisson point process with rate parameter close to ∆.

Having verified our assumption, we now compare five different page rate estimators: Naive, MLE, LLN, SA, and SAM. Their performances are given in Fig. 1(b) and Fig. 1(c).

The procedure we adopted to obtain these plots was as follows. Unless specified, the notations are as in Section 2. Recall that we had access to the actual timestamps at which this Wikipedia page was changed. Keeping this in mind, we artificially generated the crawl instances of this page. These times were sampled from a Poisson point process; two different crawling rates p were used, the larger for Fig. 1(b) and the smaller for Fig. 1(c). This gave rise to the indicator sequences {I_k}: for the larger choice of p, the length of this sequence was 1723, while for the smaller one, it turned out to be 340. Using these I_k, p, and the inter-update time lengths, we then ran the five estimators mentioned above to estimate ∆. This gave rise to the trajectories shown in Fig. 1(b) and Fig. 1(c).
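The construction of the indicator sequence from recorded edit timestamps can be sketched as follows. This is only an illustration: the timestamps below are synthetic stand-ins for the actual Wikipedia trace, and the function names are ours.

```python
import bisect
import random

def crawl_indicators(change_times, crawl_times):
    """I_k = 1 iff at least one page change falls between crawls k-1 and k."""
    I, prev = [], 0.0
    for t in crawl_times:
        lo = bisect.bisect_right(change_times, prev)
        hi = bisect.bisect_right(change_times, t)
        I.append(int(hi > lo))
        prev = t
    return I

random.seed(3)
# Stand-in for the extracted edit timestamps: Poisson process with rate delta.
T, delta, p = 5000.0, 1.0, 0.2
change_times, t = [], 0.0
while t < T:
    t += random.expovariate(delta)
    change_times.append(t)
crawl_times, t = [], 0.0
while t < T:
    t += random.expovariate(p)       # crawl instances: Poisson process with rate p
    crawl_times.append(t)
I = crawl_indicators(change_times, crawl_times)
print(len(I), sum(I))                # number of crawls, number of detected changes
```

For the real trace, `change_times` would instead hold the extracted edit timestamps. The fraction of ones in `I` concentrates around ∆/(∆ + p), which is precisely the quantity the Naive estimator implicitly measures.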
Note that the depicted trajectories correspond to exactly one run of each estimator. The trajectory of the estimates obtained by the SA estimator is labelled ∆_SA, etc. The stepsizes chosen for our different estimators are as follows. For our LLN estimator, we set α_k ≡ 1; for the SA estimator, we set η_k = (k + 1)^{−η} with η = 0.75; and, in the case of the SAM estimator, we set η_k as above, β_k = (k + 1)^{−β}, and ω = 1. (Recall that, in the SAM estimator, the main stepsize is η_k, while the stepsize multiplying the momentum term has the form ζ_k = (β_k − ωη_k)/β_{k−1}.)

We now summarise our findings. In Fig. 1(b), we observe that the performances of the MLE, LLN, SA and SAM estimators are comparable to each other, and all of them outperform the Naive estimator. This last observation is not at all surprising, since the Naive estimator completely ignores the changes missed between two successive crawling instances. In contrast, we observe that the estimators behave somewhat differently in Fig. 1(c). Recall that the crawling frequency there is quite small compared with the one used for Fig. 1(b).

Throughout this experiment, we work with synthetic data.
Our goal here is to study the sample variance and root mean squared error of the estimates obtained from multiple runs of the different estimators. The output is given in Fig. 2.

The data for this experiment is generated as follows. To begin with, we imagine there is only one page. We then sample points from two different stationary Poisson point processes, one with parameter ∆ = 5 and the other with parameter p = 3. We treat the samples from the first process as the times at which this page changes, and the samples from the second process as the times at which this page is crawled. We then check if the page has changed or not between two successive page accesses. This is then used to generate the values of the indicator sequence {I_k}.

Figure 3 (Synthetic data: ∆ = 500, p = 3): (a) performance of single trajectories; (b) 95% confidence interval; (c) root mean square error.

Figure 4 (Synthetic data: ∆ = 500, p = 50): (a) performance of single trajectories; (b) 95% confidence interval; (c) root mean square error.

We now give {I_k}, p, as well as the inter-access lengths as input to the five different estimators mentioned before. The stepsizes we use are as follows. For our LLN estimator, we set α_k ≡ 1; for the SA estimator, we use η_k = (k + 1)^{−η} with η = 0.75; and, for the SAM estimator, we choose ζ_k as above, ω = 1, and β_k = (k + 1)^{−β} with β = 0.6. Fig. 2(a) depicts one single run of each of the five estimators.

In Fig. 2(b) and Fig. 2(c), the parameter values are exactly the same as in Fig. 2(a). However, we now run the simulation 100 times; the page change times and the page access times are generated afresh in each run. Fig. 2(b) depicts the 95% confidence interval of the obtained estimates, whereas Fig. 2(c) shows the root mean squared value of the difference between the estimated value and the actual change rate of the page.

We now summarise our findings. Clearly, in each case, we observe that the performances of the MLE, LLN, SA and SAM estimators are comparable to each other, and all of them outperform the Naive estimator. The fact that the estimates from our approaches are close to those of the MLE estimator was indeed quite surprising to us. This is because, unlike MLE, our estimators completely ignore the actual lengths of the intervals between two accesses. Instead, they use p, which only accounts for the mean interval length. Note that the variance of the first few samples for MLE is very high. This may be due to the instability that MLE faces; see Section 3.4. Fig. 2(c) shows that the error in the MLE estimate decays faster as compared to the others. We believe this is because the MLE also uses the actual interval lengths in its computation; thus, it uses more information about the crawling process than the other estimators.

While the plots do not show this, we once again draw attention to the fact that the time taken by each iteration of MLE grows rapidly as k increases. In contrast, our estimators take roughly the same amount of time for each iteration.

Impact of ∆ and p on Performance

In the previous experiments, recall that our different estimators more or less behaved similarly. Our goal now is to vary the values of ∆ and p and see if any major differences crop up in their performances. Alongside, we also wish to see the usefulness of the momentum term used in the SAM estimator.
The performances in two such interesting scenarios are shown in Fig. 3 and Fig. 4. Note that we no longer consider the MLE on account of its impractical run times when the I_k sequence lengths are large.

In Fig. 3, ∆ = 500 and p = 3, which means the crawling frequency is quite low compared to the frequency at which the page is updated. On the other hand, in Fig. 4, ∆ = 500 and p = 50; thus, the crawling frequency now is relatively higher. The stepsizes for our different estimators are as follows. For the LLN estimator, we chose α_k ≡ 1; for the SA estimator, we chose η_k = (k + 1)^{−η} with η = 0.8; and, for the SAM estimator, we chose η_k as before, ω = 1, and β_k = (k + 1)^{−β}.

The performance of all the estimators improves as the p value becomes higher. The impact of the momentum term can also be clearly seen in the low frequency crawling case. In this scenario, note that the crawler will more or less always detect a change. That is, the {I_k} sequence will mostly consist of all 1s. In turn, this means that the SA estimator's update rule will almost always have the form y_{k+1} = y_k + η_k p.

We then run the simulation 100 times and plot the 95% confidence interval and root mean squared error of our different estimators in the two scenarios. This is shown in Fig. 3(b), 3(c), 4(b), and 4(c). We observe that the variance for SA is relatively very low. This is because the SA estimator does not deviate much from the update rule mentioned in the previous paragraph. The disadvantage, however, is that its estimates typically are quite far away from the actual change rate. Furthermore, this error decreases quite slowly. Another interesting observation from Fig. 3(b) and 3(c) is that the variance of the LLN estimator is larger than that of the SAM estimator; however, its error decays at a much faster rate than that of the SAM estimator.

In Fig. 4, notice that the performance of all our estimators improves. However, as shown in Fig. 4(b), the SAM estimator is noisier now. Separately, the zoomed-in plot in Fig. 4(c) shows that the average error for the SAM estimator drops quite rapidly compared to the others in the initial few iterations. However, this advantage disappears after 400 iterations; from then on, the LLN estimator is much more stable.
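The slow drift just described is easy to quantify: with I_k ≡ 1, the SA iterate grows only by p·∑_k η_k, which for a polynomially decaying stepsize stays far below a large ∆ even after many crawls. A minimal sketch (with our own initial value y_0 = 0):

```python
delta, p = 500.0, 3.0        # low-frequency crawling: p << delta
y = 0.0                      # SA iterate, started at 0 for illustration
for k in range(1, 100001):
    eta = (k + 1) ** -0.8    # eta_k = (k+1)^(-0.8), as in this experiment
    y += eta * p             # I_k = 1 at every crawl => y_{k+1} = y_k + eta_k * p
print(y)                     # stays far below delta = 500
```

After 10^5 crawls, the iterate has only reached roughly p·∑_{k ≤ 10^5} (k + 1)^{−0.8} ≈ 130, which is why the SA estimates in Fig. 3 sit far below ∆ = 500 while showing very little variance.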
The theoretical results presented in Section 3 show that the convergence rate of LLN, SA, and SAM estimatoris affected by the choice of { α k } { η k } , and { ζ k } respectively. Figures 5 provide a numerical verification ofthe same. The details are as follows. We chose ∆ = 500 and p = 10 . Notice that the page change rate isagain very high, whereas the crawling frequency is relatively very low value. We then use the LLN estimatorwith three different choices of { α k } ; these choices are shown in the Fig 5(a) itself. The LLN estimator with α k = k . has the worst performance. This behaviour matches the prediction made by Theorem 1. InFig. 5(b), we again consider the same setup as above. However, this time we run the SA estimator with threedifferent choices of { η k } ; the choices are given in the figure itself. We see that the performance for η = 0 . { η k } and { ζ k } on the performance of the SAM estimator. Let ζ k be of form given in (10). Based on Remark 4, pick η k = ( k + 1) − η and β k = ( k + 1) − β with β ∈ (1 / , β < η < β. In Fig. 5(c), we fix η = 0 . β ; these choices are shown in the figure itself. TheSAM estimator with β = 0 . β increases; however, larger values of β also slow down the rate at which the error decreases. We observe that the SAM estimator with β = 0 . β = 0 . η . It is clear from the figure that the performanceof the SAM estimator remains more or less the same. This implies that the major factor that affects theperformance of SAM estimator is the stepsize related to momentum term. Practical Recommendations:
Here, we provide some recommendations on which estimator to use in practice. Our conclusions are based on what we observed in the numerical experiments discussed in Section 4. We summarise them as follows.

• High frequency crawling:
If the crawling frequency p is comparable to ∆, all estimators (LLN, SA, SAM and MLE) perform well, except the Naive estimator. However, we do not recommend the MLE as it is offline and very time-consuming. The examples that correspond to this scenario are depicted in Fig. 1(b) and Fig. 2.

• Low frequency crawling:
There are two sub-cases depending on the value of p as compared to ∆.

– Relatively very low p: The Naive estimator is very bad in this scenario, as there will be several missed changes which go unaccounted for. We recommend the LLN or SAM estimator, as they both outperform the SA estimator; the example that corresponds to this scenario is depicted in Fig. 3. For similar reasons as in the previous case, we do not recommend the MLE estimator.

– Relatively moderate p: The Naive estimator is again a bad choice here. Amongst the rest, we recommend the LLN estimator when several I_k values are available. Otherwise, one can use the SAM or the MLE estimator; the offline nature of the MLE will be of concern here as well. The examples that correspond to this scenario are depicted in Fig. 1(c) and Fig. 4.

Figure 5 (Impact of {α_k}, {η_k} and {ζ_k} choices on performance; ∆ = 500 and p = 10): (a) LLN estimator for different {α_k} choices; (b) SA estimator with η_k = (k + 1)^{−η} for different η choices; (c) SAM estimator with fixed η for different β_k choices; (d) SAM estimator with fixed β for different η_k choices.

In this section, we discuss how our estimators can be used to find optimal crawling rates {p*_i} so that the overall freshness of the local cache,

    lim_{T→∞} E[ (1/T) ∫_0^T ( ∑_{i=1}^N w_i 1{Fresh(i, t)} ) dt ],    (16)

is maximised subject to ∑_{i=1}^N p_i ≤ B. Here, T > 0 is the time horizon, N is the number of pages, w_i denotes the importance of the i-th page, B ≥ 0 is the bound on the overall crawling frequency, 1{·} is the indicator function, and Fresh(i, t) is the event that page i is fresh at time t, i.e., the local copy matches the actual page.

Azar et al. [9] showed that maximising (16) under a bandwidth constraint, for large enough T, corresponds to maximising F(p) = ∑_{i=1}^N w_i p_i/(p_i + ∆_i). The authors further provide an efficient algorithm with complexity O(N log N) to solve this problem. Note that this algorithm requires the ∆_i values to be known in advance; these can be estimated efficiently with any of the three schemes that we propose.

Separately, the recent work [18] along this direction views maximising freshness as minimising a harmonic staleness penalty related to every possible number of uncrawled changes. The associated problem then becomes minimising F̃(p) = −∑_{i=1}^N w_i ln(p_i/(p_i + ∆_i)). All the algorithms provided in [18] also assume known change rates, and the authors use the MLE estimator to obtain them beforehand. Note that the MLE estimator is offline and can be very time-consuming as the number of samples increases. Thus, one can replace it with any of our online estimators to obtain faster updates of the page change rates.

We have proposed three new online approaches for estimating the rate of change of web pages. All these estimators are computationally efficient in comparison to the MLE estimator. We first provide a theoretical analysis of the convergence of our estimators and then provide numerical simulations to compare their performance with the existing estimators in the literature. From the numerical experiments, we have verified that the proposed estimators perform significantly better than the Naive estimator and have extremely simple update rules, which makes them computationally attractive.
We also provide important insights on which estimator one should use in practice.

The performance of our estimators currently depends on the choices of {α_k}, {η_k}, and {ζ_k}, respectively. One aspect to analyse in the future would be the ideal choice for these sequences, i.e., the one that attains the fastest convergence rate. Another interesting research direction is to combine the online estimation with dynamic optimisation.

Acknowledgement
This work is partly supported by ANSWER project PIA FSN2 (P15 9564-266178 \ DOS0060094) and DST-Inria project "Machine Learning for Network Analytics" IFC/DST-Inria-2016-01/448. The authors would also like to thank A. Budhiraja for several useful discussions concerning Theorem 3.
References

[1] Konstantin Avrachenkov, Kishor Patil, and Gugan Thoppe. Change rate estimation and optimal freshness in web page crawling. In Proceedings of the 13th EAI International Conference on Performance Evaluation Methodologies and Tools, pages 3–10, 2020.

[2] Allan Heydon and Marc Najork. Mercator: A scalable, extensible web crawler. World Wide Web, 2(4):219–229, 1999.

[3] Carlos Castillo. Effective web crawling. In ACM SIGIR Forum, volume 39, pages 55–56, New York, NY, USA, 2005. Association for Computing Machinery.

[4] Rahul Kumar, Anurag Jain, and Chetan Agrawal. A survey of web crawling algorithms. Advances in Vision Computing: An International Journal, 3:1–7, 2016.

[5] Christopher Olston, Marc Najork, et al. Web crawling. Foundations and Trends in Information Retrieval, 4(3):175–246, 2010.

[6] Jenny Edwards, Kevin McCurley, and John Tomlin. An adaptive model for optimizing performance of an incremental web crawler. In Proceedings of the 10th International Conference on World Wide Web, pages 106–113, New York, NY, USA, 2001. Association for Computing Machinery.

[7] Junghoo Cho and Hector Garcia-Molina. Synchronizing a database to improve freshness. ACM SIGMOD Record, 29(2):117–128, 2000.

[8] Junghoo Cho and Hector Garcia-Molina. Effective page refresh policies for web crawlers. ACM Transactions on Database Systems (TODS), 28(4):390–426, 2003.

[9] Yossi Azar, Eric Horvitz, Eyal Lubetzky, Yuval Peres, and Dafna Shahaf. Tractable near-optimal policies for crawling. Proceedings of the National Academy of Sciences, 115(32):8099–8103, 2018.

[10] Konstantin E. Avrachenkov and Vivek S. Borkar. Whittle index policy for crawling ephemeral content. IEEE Transactions on Control of Network Systems, 5(1):446–455, 2016.

[11] José Niño-Mora. A dynamic page-refresh index policy for web crawlers. In Analytical and Stochastic Modeling Techniques and Applications, pages 46–60, Cham, 2014. Springer International Publishing.

[12] Brian E. Brewington and George Cybenko. How dynamic is the web? Computer Networks, 33(1-6):257–276, 2000.

[13] Brian E. Brewington and George Cybenko. Keeping up with the changing web. Computer, 33(5):52–58, 2000.

[14] Junghoo Cho and Hector Garcia-Molina. The evolution of the web and implications for an incremental crawler. In , pages 1–18, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

[15] Norman Matloff. Estimation of internet file-access/modification rates from indirect data. ACM Transactions on Modeling and Computer Simulation (TOMACS), 15(3):233–253, 2005.

[16] Sanasam Ranbir Singh. Estimating the rate of web page updates. In Proc. International Joint Conferences on Artificial Intelligence, pages 2874–2879, San Francisco, CA, USA, 2007. ACM.

[17] Junghoo Cho and Hector Garcia-Molina. Estimating frequency of change. ACM Transactions on Internet Technology (TOIT), 3(3):256–290, 2003.

[18] Andrey Kolobov, Yuval Peres, Cheng Lu, and Eric J. Horvitz. Staying up to date with online content changes using reinforcement learning for scheduling. In Advances in Neural Information Processing Systems, pages 581–591, 2019.

[19] Gal Dalal, Gugan Thoppe, Balázs Szörényi, and Shie Mannor. Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. In Conference On Learning Theory, pages 1199–1233. PMLR, 2018.

[20] Gal Dalal, Balázs Szörényi, and Gugan Thoppe. A tale of two-timescale reinforcement learning with the tightest finite-time bound. In Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 3701–3708, San Francisco, CA, USA, 2020. AAAI Press.

[21] Maxim Kaledin, Eric Moulines, Alexey Naumov, Vladislav Tadic, and Hoi-To Wai. Finite time analysis of linear two-timescale stochastic approximation with Markovian noise. arXiv preprint arXiv:2002.01268, 2020.

[22] Utkarsh Upadhyay, Robert Busa-Fekete, Wojciech Kotlowski, David Pal, and Balazs Szorenyi. Learning to crawl. In Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 8471–8478, New York, NY, USA, 2020. AAAI Press.

[23] Gal Dalal, Balázs Szörényi, Gugan Thoppe, and Shie Mannor. Finite sample analyses for TD(0) with function approximation. In Thirty-Second AAAI Conference on Artificial Intelligence, pages 6144–6160, San Francisco, CA, USA, 2018. AAAI Press.

[24] Vivek S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, volume 48. Springer, India, 2009.

[25] Chandrashekar Lakshminarayanan and Shalabh Bhatnagar. A stability criterion for two timescale stochastic approximation schemes. Automatica, 79:108–114, 2017.
Convergence of Stochastic Approximation Algorithms
In this section, we discuss results from the literature that provide sufficient conditions for the convergence of both one-timescale and two-timescale stochastic approximation algorithms. We begin with the convergence of a generic one-timescale stochastic approximation algorithm. This result is obtained by combining [24, Chapter 2, Corollary 4] and [24, Chapter 3, Theorem 7].
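Before stating the theorem, here is a minimal numerical illustration (a toy example of our own construction, not from [24]) of the kind of recursion it covers: with h(y) = y* − y and i.i.d. bounded martingale-difference noise, the iterates converge to y* almost surely, and a single long run already lands very close to it.

```python
import random

random.seed(6)
ystar = 3.0                                # unique stable equilibrium of dy = h(y)
y = 0.0
for k in range(1, 200001):
    eta = 1.0 / (k + 1)                    # sum eta_k = inf, sum eta_k^2 < inf
    m = random.uniform(-0.5, 0.5)          # martingale-difference noise M_{k+1}
    y += eta * ((ystar - y) + m)           # y_{k+1} = y_k + eta_k [h(y_k) + M_{k+1}]
print(y)                                   # close to ystar = 3
```

Here h is globally Lipschitz, the ODE ẏ = y* − y has y* as its unique globally asymptotically stable equilibrium, and h_c(y) = (y* − cy)/c → −y, whose ODE has the origin as its stable point, so all four conditions below are met.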
Theorem 9 (Convergence of One-timescale Stochastic Approximation [24]). Consider the update rule

    y_{k+1} = y_k + η_k [h(y_k) + M_{k+1}],

where η_k is a positive scalar; y_k, M_k ∈ R^d; and h : R^d → R^d is a deterministic function. Suppose the following conditions hold:

i.) ∑_{k=0}^∞ η_k = ∞ and ∑_{k=0}^∞ η_k² < ∞.

ii.) {M_k} is a martingale difference sequence with respect to the increasing family of σ-fields F_k := σ(y_j, M_j, j ≤ k), k ≥ 0. That is, E[M_{k+1} | F_k] = 0 a.s., k ≥ 0. Further, there is a constant C ≥ 0 such that E[‖M_{k+1}‖² | F_k] ≤ C(1 + ‖y_k‖²) a.s. for all k ≥ 0.

iii.) h is a globally Lipschitz continuous function. Further, the ODE ẏ(t) = h(y(t)) has a unique globally asymptotically stable equilibrium y*.

iv.) There exists a continuous function h_∞ : R^d → R^d such that the functions h_c(x) := h(cx)/c, c ≥ 1, satisfy h_c → h_∞ uniformly on compact sets as c → ∞. Further, the ODE ẏ(t) = h_∞(y(t)) has the origin as its unique globally asymptotically stable equilibrium.

Then, y_k → y* a.s.

Often, stochastic approximation algorithms contain an additional perturbation term that is asymptotically negligible. The next result discusses the convergence of such algorithms.
Proposition 10 (Convergence of Perturbed One-timescale Stochastic Approximation). Consider the update rule

    y_{k+1} = y_k + η_k [h(y_k) + ε_k + M_{k+1}],

where ε_k is an additional perturbation term, while the other terms have the same meaning as in Theorem 9. Suppose that the four conditions listed in Theorem 9 hold true. Further, suppose ‖ε_k‖ ≤ Cρ_k(1 + ‖y_k‖) a.s. for k ≥ 0, where C is a positive constant and {ρ_k} is a sequence of positive scalars such that lim_{k→∞} ρ_k = 0. Then, y_k → y* a.s.

Proof. We only give a sketch of the proof, since the arguments are more or less similar to the ones used to derive Theorem 9. As mentioned before, this latter result follows from [24, Chapter 2, Corollary 4] and [24, Chapter 3, Theorem 7]. We now briefly discuss how, even in the presence of the additional perturbation term, these two results continue to hold.

• [24, Chapter 2, Corollary 4]: This result follows from [24, Chapter 2, Theorem 2] which, in turn, follows from [24, Chapter 2, Lemma 1]. However, as shown in extension 3 in [24, pg. 17], this latter result goes through even in the presence of the perturbation term {ε_k}. This is because ε_k is asymptotically negligible a.s. More specifically, observe that the sequence {y_k} is a.s. bounded under assumption (A4) given on [24, pg. 17]. This implies that {ε_k} is a random bounded sequence which is o(1) a.s.; the latter is true since ρ_k → 0.

• [24, Chapter 3, Theorem 7]: The proof of this result is based on Lemmas 1 to 6 in [24, Chapter 3]. The first three of these lemmas concern the behaviour of the solution trajectories of the limiting ODE ẏ(t) = h_∞(y(t)). Since the perturbation term does not affect the definition of this limiting ODE in any way whatsoever, these three results continue to hold as before.
Similarly, Lemma 5 in ibid is unchanged, since it only concerns the convergence of the sum of martingale differences ∑_k η_k M̂_{k+1} (recall that the stepsize sequence in our update rule is η_k). With regards to the proof of Lemma 4 in ibid, observe that our update rule satisfies

    ŷ(t(k + 1)) = ŷ(t(k)) + η_k (h_{r(n)}(ŷ(t(k))) + ε̂_k + M̂_{k+1}),    m(n) ≤ k ≤ m(n + 1),

where ε̂_k = ε_k/r(n), while the other notations are analogous to the ones defined in [24, Chapter 3]. Because ‖ε_k‖ ≤ Cρ_k(1 + ‖y_k‖), ρ_k → 0, and r(n) ≥ 1, it follows that ‖ε̂_k‖ ≤ C(1 + ‖ŷ(t(k))‖) for some positive constant C. Note that this is in a similar spirit to (3.2.5) in ibid. It is then easy to see that the rest of the proof goes through as before. This shows that [24, Chapter 3, Lemma 4] continues to be true even in the presence of the perturbation term. Using exactly the same bound for ‖ε̂_k‖ obtained above, one can see that the arguments in the proof of Lemma 6 in ibid hold as well. Thus, [24, Chapter 3, Theorem 7] continues to hold, which is exactly what we wanted to establish.

The desired result now follows.

We next state a result that discusses the convergence of a generic two-timescale stochastic approximation algorithm. The proof of this result is based on [24, Chapter 6, Theorem 2] and [25, Theorem 10].

Theorem 11 (Convergence of Two-timescale Stochastic Approximation [24, 25]). Consider the update rules

    u_{k+1} = u_k + γ_k [h(u_k, z_k) + M^{(1)}_{k+1}],
    z_{k+1} = z_k + β_k [g(u_k, z_k) + M^{(2)}_{k+1}],

where γ_k and β_k are positive scalars; u_k, z_k, M^{(1)}_k, M^{(2)}_k ∈ R^d; and h, g : R^d × R^d → R^d are two deterministic functions. Suppose the following conditions hold:

i.) ∑_{k≥0} γ_k = ∑_{k≥0} β_k = ∞, ∑_{k≥0} (γ_k² + β_k²) < ∞, and lim_{k→∞} β_k/γ_k = 0.

ii.)
{M^{(1)}_k} and {M^{(2)}_k} are martingale difference sequences with respect to the increasing σ-fields F_k := σ(u_j, z_j, M^{(1)}_j, M^{(2)}_j, j ≤ k), k ≥ 0. Further, there exists a constant C ≥ 0 such that E[‖M^{(i)}_{k+1}‖² | F_k] ≤ C(1 + ‖u_k‖² + ‖z_k‖²) for i = 1, 2 and k ≥ 0.

iii.) h and g are globally Lipschitz continuous functions. For each fixed z, the ODE u̇(t) = h(u(t), z) has a unique globally asymptotically stable equilibrium φ(z), where φ : R^d → R^d is Lipschitz continuous. Further, the ODE ż(t) = g(φ(z(t)), z(t)) has a unique globally asymptotically stable equilibrium z*.

iv.) The functions h_c(u, z) := h(cu, cz)/c, c ≥ 1, satisfy h_c → h_∞ as c → ∞, uniformly on compacts for some h_∞. Also, for each fixed z ∈ R^d, the limiting ODE u̇(t) = h_∞(u(t), z) has a unique globally asymptotically stable equilibrium φ_∞(z), where φ_∞ : R^d → R^d is a Lipschitz map. Further, φ_∞(0) = 0. Separately, the functions g_c(z) := g(cφ_∞(z), cz)/c, c ≥ 1, satisfy g_c → g_∞ as c → ∞, uniformly on compacts for some g_∞. Also, the limiting ODE ż(t) = g_∞(z(t)) has the origin as its unique globally asymptotically stable equilibrium.

Then, (u_k, z_k) → (φ(z*), z*) a.s.

The last and final result of this section concerns the convergence of two-timescale stochastic approximation with perturbation terms that are asymptotically negligible.
Proposition 12 (Convergence of Perturbed Two-timescale Stochastic Approximation). Consider the update rules

    u_{k+1} = u_k + γ_k [h(u_k, z_k) + ε^{(1)}_k + M^{(1)}_{k+1}],
    z_{k+1} = z_k + β_k [g(u_k, z_k) + ε^{(2)}_k + M^{(2)}_{k+1}],

where ε^{(1)}_k, ε^{(2)}_k are additional perturbation terms, while the other terms have the same meaning as in Theorem 11. Suppose that the four conditions listed in Theorem 11 hold true. Further, suppose ‖ε^{(i)}_k‖ ≤ Cρ^{(i)}_k (1 + ‖u_k‖ + ‖z_k‖) a.s. for k ≥ 0 and i = 1, 2, where C is a positive constant and {ρ^{(i)}_k}, i = 1, 2, are sequences of positive scalars such that lim_{k→∞} ρ^{(i)}_k = 0. Then, (u_k, z_k) → (φ(z*), z*) a.s.

Proof. As stated before, this result follows from [24, Chapter 6, Theorem 2] and [25, Theorem 10]. We now briefly discuss how these results continue to hold even in the presence of the perturbation terms ε^{(1)}_k and ε^{(2)}_k.

• [24, Chapter 6, Theorem 2]: This result, as well as [24, Chapter 6, Lemma 1] on which it relies, are essentially proved by defining suitable one-timescale stochastic approximation algorithms and then using convergence results concerning the latter. In our situation, both of these will have additional perturbation terms that are asymptotically negligible. Consequently, by arguing as in the third extension given in [24, pg. 27], it can be shown that the asymptotic behaviour of these two algorithms remains unchanged even in the perturbed setup. Therefore, it follows that the conclusions of [24, Chapter 6, Theorem 2] continue to hold as before.
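To see the above guarantee in action, here is a minimal toy example of our own construction (not from [24] or [25]): we take h(u, z) = z − u, so that φ(z) = z, and g(u, z) = a − u, so that z* = a; with perturbations ε^{(i)}_k = O(1/k) and bounded i.i.d. noise, the iterates settle near (a, a), as Proposition 12 predicts.

```python
import random

random.seed(5)
a = 3.0                            # target: (u_k, z_k) should approach (a, a)
u, z = 0.0, 0.0
for k in range(1, 200001):
    gamma = (k + 1) ** -0.6        # fast stepsize; sum of squares finite
    beta = (k + 1) ** -0.9         # slow stepsize; beta_k / gamma_k -> 0
    m1 = random.uniform(-0.5, 0.5)         # martingale-difference noise
    m2 = random.uniform(-0.5, 0.5)
    e1, e2 = 1.0 / k, -1.0 / k             # vanishing perturbations, as in Prop. 12
    u += gamma * ((z - u) + e1 + m1)       # h(u, z) = z - u  => phi(z) = z
    z += beta * ((a - u) + e2 + m2)        # g(u, z) = a - u  => z* = a
print(u, z)                        # both near a = 3
```

All conditions of Theorem 11 are easy to verify here: both drifts are Lipschitz, the fast ODE u̇ = z − u is stable at φ(z) = z, the slow ODE ż = a − z is stable at z* = a, and the scaled limits h_∞(u, z) = z − u and g_∞(z) = −z have the required stability at the origin.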