Online Algorithms for Estimating Change Rates of Web Pages
Konstantin Avrachenkov, Kishor Patil (INRIA Sophia Antipolis, France 06902), and Gugan Thoppe (Indian Institute of Science, Bengaluru, India 560012)
Abstract
For providing quick and accurate search results, a search engine maintains a local snapshot of the entire web. And, to keep this local cache fresh, it employs a crawler for tracking changes across various web pages. It would have been ideal if the crawler managed to update the local snapshot as soon as a page changed on the web. However, finite bandwidth availability and server restrictions mean that there is a bound on how frequently the different pages can be crawled. This then brings forth the following optimisation problem: maximise the freshness of the local cache subject to the crawling frequency being within the prescribed bounds. Recently, tractable algorithms have been proposed to solve this optimisation problem under different cost criteria. However, these assume knowledge of the exact page change rates, which is unrealistic in practice. We address this issue here. Specifically, we provide three novel schemes for online estimation of page change rates. All these schemes only need partial information about the page change process, i.e., they only need to know whether or not the page has changed since the last crawl instance. Our first scheme is based on the law of large numbers, the second on the theory of stochastic approximation, while the third is an extension of the second and involves an additional momentum term. For all of these schemes, we prove convergence and, also, provide their convergence rates. As far as we know, the results concerning the third estimator are quite novel. Specifically, this is the first convergence-type result for a stochastic approximation algorithm with momentum. Finally, we provide some numerical experiments (on real as well as synthetic data) to compare the performance of our proposed estimators with the existing ones (e.g., MLE).
1 Introduction

The world wide web is gigantic: it has a lot of interconnected information, and both the information and the connections keep changing. However, irrespective of the challenges arising out of this, a user always expects a search engine to instantaneously provide accurate and up-to-date results. A search engine deals with this by maintaining a local cache of all the useful web pages and their links. As the freshness of this cache determines the quality of the search results, the search engine regularly updates it by employing a crawler (also referred to as a web spider or a web robot). The job of a crawler is (a) to discover new web pages; (b) to access various web pages at certain frequencies so as to determine if any changes have happened to the content since the last crawled instance; and (c) to update the local cache whenever there is a change. In this work, we focus on tasks (b) and (c). To understand the detailed working of crawlers, see [2, 3, 4, 5, 6].

In general, a crawler has two constraints on how often it can access a page. The first one is due to limitations on the available bandwidth. The second one, also known as the politeness constraint, arises when a server imposes limits on the crawl frequency. The latter implies that the crawler cannot access pages on that server too often in a short amount of time. Such constraints cannot be ignored, since otherwise the server may forbid the crawler from all future accesses.

∗ A shorter version [1] of this paper has appeared in the proceedings of ValueTools 2020. The novel contributions here include an additional online scheme (a stochastic approximation scheme with momentum) and additional experiments based on real data (Wikipedia traces). Email: [email protected], [email protected], [email protected]
In summary, to identify the ideal rates for crawling different web pages, a search engine needs to solve the following optimisation problem: maximise the freshness of the local database subject to constraints on the crawling frequency.

In the early variants of this problem, the freshness of each page was assumed to be equally important [7, 6]. In such cases, experimental evidence somewhat surprisingly shows that the uniform policy, i.e., crawl all pages at the same frequency irrespective of their change rates, is more or less the optimal crawling strategy. Starting from the pioneering work in [8], however, the freshness definition was modified to include different weights for different pages depending on their importance, e.g., represented as the frequency of requests for the pages. The motivation for this change was the fact that only a finite number of pages can be crawled in any given time frame. Hence, to improve the utility of the local database, important pages should be kept as fresh as possible. Not surprisingly, under this new definition, the optimal crawling policy does indeed depend on the page change rates.

The above observation was numerically demonstrated first in [8] for a setup with a small number of pages. A more rigorous derivation of this fact was recently given in the path-breaking paper [9]. In fact, this work also provides a near-linear time algorithm to find a near-optimal solution. A major concern for this algorithm is that it needs to know the actual values of the page change rates. However, in practice, these values are not known in advance and, instead, have to be estimated.

A separate study [10, 11] provides a Whittle-index-based dynamic programming approach to optimise the schedule of a web crawler. In that context, the page/catalogue freshness estimate also influences the optimal crawling policy, and it too requires good estimates of the actual page change rates.

This work, which is an extended version of [1], is mainly motivated by the work of Azar et al.
[9]. Our main contributions here can be summarised as follows. We propose three novel approaches for online estimation of the actual page change rates. The first is based on the Law of Large Numbers (LLN), the second is based on Stochastic Approximation (SA) principles, while the third one is an extension of the second with an additional momentum term. Next, we theoretically show that all these estimators almost surely (a.s.) converge to the actual change rate values; thus, all our estimators are asymptotically consistent. To the best of our knowledge, the result concerning the third estimator is the first to show convergence of a stochastic approximation algorithm with momentum. We also rigorously derive the convergence rates of the first two estimators in the expected error sense. Based on the existing literature, we also provide a loose guess on the convergence rate of the third estimator. Finally, we provide numerical simulations to compare the performance of our online schemes to each other and also to that of the (offline) MLE estimator. Our experiments are based on both real (Wikipedia traces) as well as synthetic data sets. In one of our experiments, we also verify our modelling assumption that the page change process is a Poisson point process.

The rest of this paper is organised as follows. The next section provides a formal summary of this work in terms of the setup, goals, and key contributions. It also gives explicit update rules for all of our online schemes. In Section 3, we formally analyse their convergence and rates of convergence. The numerical experiments discussed above are given in Section 4. Then, in Section 5, we provide some motivation on how one can use our estimates to find the optimal crawling rates. Finally, we conclude in Section 6 with some future directions.

2 Setup, Goal, and Key Contributions

The three topics are individually described below.
Setup: Without loss of generality, we work with a single web page. We presume that the actual times at which this page changes form a time-homogeneous Poisson point process in [0, ∞) with a constant but unknown rate ∆. Independently of everything else, this page is crawled (accessed) at the random instances {t_k}_{k ≥ 0} ⊂ [0, ∞), where t_0 = 0 and the inter-arrival times {t_k − t_{k−1}}_{k ≥ 1} are IID exponential random variables with a known rate p. Thus, the times at which this page is crawled also form a time-homogeneous Poisson point process, but with rate p. At time instance t_k, we get to know if the page got modified or not in the interval (t_{k−1}, t_k], i.e., we can access the value of the indicator

I_k := 1, if the page got modified in (t_{k−1}, t_k], and 0, otherwise.

The above assumptions are standard in the crawling literature; nevertheless, we now provide a quick justification for the same. Our assumption that the page change process is a Poisson point process is based on the experiments reported in [12, 13, 14]. Nevertheless, we also verify this assumption on a page randomly selected from frequently edited Wikipedia pages. We extract the complete history of this web page (exact time and date of each change) for a period of five months (April 01, 2020 to August 31, 2020). From the available history, we calculate the inter-arrival times of the page change process and draw a Q-Q plot. We were indeed able to observe that the set of quantiles for the real data matches linearly with the quantiles of the exponential distribution; more details are given in Section 4. Some generalised models for the page change process have also been considered in the literature [15, 16]; however, we do not pursue them here. Separately, our assumption on {I_k} is based on the fact that a crawler can only access incomplete knowledge about the page change process. In particular, a crawler does not know when and how many times a page has changed between two crawling instances.
Instead, all it can track is the status of a page at each crawling instance, i.e., whether or not it has changed with respect to the previous access. Sometimes, it is also possible to know the time at which the page was last modified [3, 17], but we do not consider this case here.

Goal: Develop online algorithms for estimating ∆ in the above setup. The motivation for doing this is that such estimates can then be used to estimate the optimal crawling rates [9, 18]; see Section 5 for more details on this.
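For concreteness, the observation model above can be simulated in a few lines. The following is an illustrative sketch (the function name `simulate_crawl` and its arguments are our own); it relies on the fact that the number of page changes in an interval of length τ is Poisson(∆τ), hence zero with probability exp(−∆τ):

```python
import math
import random

def simulate_crawl(delta, p, n, seed=0):
    """Simulate n crawl observations of a single page.

    The page changes at the jumps of a Poisson process with (unknown) rate
    delta; crawls happen at the jumps of an independent Poisson process with
    known rate p. At each crawl we only observe the indicator I_k: did the
    page change at least once since the previous crawl?
    """
    rng = random.Random(seed)
    indicators = []
    for _ in range(n):
        tau = rng.expovariate(p)  # inter-crawl time t_k - t_{k-1} ~ Exp(p)
        # The number of changes in (t_{k-1}, t_k] is Poisson(delta * tau),
        # so the page is unchanged with probability exp(-delta * tau).
        indicators.append(1 if rng.random() > math.exp(-delta * tau) else 0)
    return indicators
```

Since E[I_k] = ∆/(∆ + p) (derived in Section 3.1), with ∆ = p roughly half of the crawls should detect a change.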
Key Contributions: We present three online methods for estimating the page change rate ∆. The first is based on the law of large numbers, while the second and third are based on the theory of stochastic approximation, with the third one having an additional momentum component. If {x_k}, {y_k}, and {z_k} denote the iterates of these three methods, respectively, then their update rules are as shown below.

• LLN Estimator: For k ≥ 1,

x_k = p Î_k / (k + α_k − Î_k). (1)

Here, Î_k = Σ_{j=1}^k I_j; hence, Î_k = Î_{k−1} + I_k. And, {α_k} is any positive sequence satisfying the conditions in Theorem 1; e.g., {α_k} could be {1}, {log k}, or {√k}.

• SA Estimator: For k ≥ 0 and an arbitrary initial value y_0,

y_{k+1} = y_k + η_k [I_{k+1}(y_k + p) − y_k]. (2)

Here, {η_k} is any stepsize sequence that satisfies the conditions in Theorem 2. For example, {η_k} could be {1/(k+1)^η} for some η ∈ (1/2, 1].

• SAM Estimator (SA Estimator with Momentum): For k ≥ 0 and arbitrary initial values z_0, z_{−1},

z_{k+1} = z_k + η_k [I_{k+1}(z_k + p) − z_k] + ζ_k (z_k − z_{k−1}). (3)

Here, {η_k} and {ζ_k} are any stepsize sequences that satisfy the conditions given in Theorem 3. For example, pick a β ∈ (1/2, 1] and let β_k = 1/(k+1)^β. Then, {η_k} and {ζ_k} could be {1/(k+1)^η} and {(β_{k−1} − ωη_k)/β_{k−1}}, respectively, where ω > 0 and β < η ≤ 2β. While we do not show it, we conjecture that one can also pick β ∈ (0, 1/2] and then choose η so that β < η ≤ 2β. Note that if β = η and ω = 1, then the asymptotic behaviour of (3) will resemble that of (2); this is because lim_{k→∞} ζ_k = 0 then.

We call these methods online because the estimates can be updated on the fly as and when a new observation I_k becomes available. This contrasts with the MLE estimator, in which one needs to start the calculation from scratch each time a new data point arrives. Also, unlike MLE, our estimators are never unstable; see Section 3.4 for the details.

Our main results include the following. We show that all our three estimators, i.e., x_k, y_k, and z_k, converge to ∆ a.s. Further, we show that

1. E|x_k − ∆| = O(max{k^{−1/2}, α_k/k}), and

2. E|y_k − ∆| = O(k^{−η/2}) if η_k = 1/(k+1)^η with η ∈ (1/2, 1].

Separately, based on existing literature [19, 20, 21], we conjecture that E|z_k − ∆| = Õ(k^{−β/2}), where Õ hides logarithmic terms. However, we believe that this estimate is not tight in our setup; see Remark 8.

Finally, we provide several numerical experiments based on real as well as synthetic data for judging the strength of our three proposed estimators.

3 Convergence Analysis

Here, we formally discuss the convergence and convergence rates of our three estimators. Thereafter, we compare their behaviours with the estimators that already exist in the literature: the Naive estimator, the MLE estimator, and the Moment Matching (MM) estimator [22].
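As a concrete illustration before the formal analysis, the three update rules (1)-(3) can be implemented as follows. This is a minimal sketch: the function names and parameter defaults are ours, with α_k = √k for LLN, η_k = 1/(k+1)^0.75 for SA, and the two-timescale choice β = 0.8, η = 1.2, ω = 1 for SAM.

```python
import math

def lln_estimate(indicators, p):
    """LLN estimator (1): x_k = p * I_hat_k / (k + alpha_k - I_hat_k), alpha_k = sqrt(k)."""
    i_hat, x = 0, 0.0
    for k, i in enumerate(indicators, start=1):
        i_hat += i  # I_hat_k = I_1 + ... + I_k
        x = p * i_hat / (k + math.sqrt(k) - i_hat)
    return x

def sa_estimate(indicators, p, y0=1.0, eta=0.75):
    """SA estimator (2): y_{k+1} = y_k + eta_k * [I_{k+1} * (y_k + p) - y_k]."""
    y = y0
    for k, i in enumerate(indicators):
        eta_k = 1.0 / (k + 1) ** eta  # eta_k = 1/(k+1)^eta, eta in (1/2, 1]
        y += eta_k * (i * (y + p) - y)
    return y

def sam_estimate(indicators, p, z0=1.0, beta=0.8, eta=1.2, omega=1.0):
    """SAM estimator (3): the SA step plus the momentum term zeta_k * (z_k - z_{k-1}),
    with zeta_k = (beta_{k-1} - omega * eta_k) / beta_{k-1} and z_{-1} = z_0."""
    z_prev = z = z0
    for k, i in enumerate(indicators):
        eta_k = 1.0 / (k + 1) ** eta
        beta_prev = 1.0 / max(k, 1) ** beta  # beta_{k-1}; beta_{-1} taken as 1
        zeta_k = (beta_prev - omega * eta_k) / beta_prev
        z_next = z + eta_k * (i * (z + p) - z) + zeta_k * (z - z_prev)
        z_prev, z = z, z_next
    return z
```

With ∆ = 2 and p = 1, we have E[I_1] = 2/3, so on a long indicator stream in which roughly two out of every three crawls detect a change, all three estimates should settle near 2.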
3.1 LLN Estimator

Our first aim here is to obtain a formula for E[I_1]. We shall use this later to motivate the form of our LLN estimator.

Let τ_1 = t_1 − t_0 = t_1, where the second equality holds since t_0 = 0. Then, as per our assumptions in Section 2, τ_1 is an exponential random variable with rate p. Also, E[I_1 | τ_1 = τ] = 1 − exp(−∆τ). Hence,

E[I_1] = ∆/(∆ + p). (4)

This gives the desired formula for E[I_1]. From this latter calculation, we have

∆ = p E[I_1] / (1 − E[I_1]). (5)

Separately, because {I_k} is an IID sequence and E|I_1| ≤ 1, the strong law of large numbers gives E[I_1] = lim_{k→∞} Σ_{j=1}^k I_j / k a.s. Thus,

∆ = p lim_{k→∞} (Σ_{j=1}^k I_j / k) / (1 − lim_{k→∞} Σ_{j=1}^k I_j / k) a.s.

Consequently, a natural estimator for ∆ is

x'_k = p (Σ_{j=1}^k I_j / k) / (1 − Σ_{j=1}^k I_j / k) = p Î_k / (k − Î_k), (6)

where Î_k is as defined below (1).

Unfortunately, the above estimator faces an instability issue, i.e., x'_k = ∞ when I_1, ..., I_k are all 1. To fix this, one can add a non-zero term in the denominator. The different choices then give rise to the LLN estimator defined in (1).

The following result discusses the convergence and convergence rate of this estimator.

Theorem 1.
Consider the estimator given in (1) for some positive sequence {α_k}.
1. If lim_{k→∞} α_k/k = 0, then lim_{k→∞} x_k = ∆ a.s.

2. Additionally, if lim_{k→∞} log(k/α_k)/k = 0, then E|x_k − ∆| = O(max{k^{−1/2}, α_k/k}).

Proof.
Let μ = E[I_1], Ī_k = Î_k/k, and ᾱ_k = α_k/k. Then, observe that (1) can be rewritten as x_k = p Ī_k / (ᾱ_k + 1 − Ī_k). Now, lim_{k→∞} Ī_k = μ a.s. and lim_{k→∞} ᾱ_k = 0; the first claim holds due to the strong law of large numbers, while the second one is true due to our assumption. Statement 1 is now easy to see.

We now derive Statement 2. From (5), we have

|x_k − ∆| = |x_k − pμ/(1 − μ)| ≤ p (A_k + B_k),

where

A_k = |Ī_k/(ᾱ_k + 1 − Ī_k) − μ/(ᾱ_k + 1 − μ)| and B_k = |μ/(ᾱ_k + 1 − μ) − μ/(1 − μ)|.

Since ᾱ_k > 0, it follows that

B_k = ᾱ_k μ / ((1 − μ)(ᾱ_k + 1 − μ)) ≤ ᾱ_k μ / (1 − μ)².

Similarly,

A_k ≤ ((1 + ᾱ_k)/(1 − μ)) (|Ī_k − μ| / (ᾱ_k + 1 − Ī_k)).

It is now easy to see that E[B_k] = O(ᾱ_k). The rest of our arguments concern how fast E[A_k] decays to 0.

Let {δ_k} be a deterministic sequence that is both non-negative and decays to 0. We will describe how to pick this later. Let k be such that (1 + δ_k)μ < 1. Then,

E[|Ī_k − μ| / (ᾱ_k + 1 − Ī_k)] ≤ E[C_k] + E[D_k],

where

C_k = (|Ī_k − μ| / (ᾱ_k + 1 − Ī_k)) 1{Ī_k − μ ≤ δ_k μ}, and D_k = (|Ī_k − μ| / (ᾱ_k + 1 − Ī_k)) 1{Ī_k − μ ≥ δ_k μ}.

On the one hand,

E[C_k] ≤ E|Ī_k − μ| / (ᾱ_k + 1 − (1 + δ_k)μ) ≤ √(Var[I_1]) / (√k (ᾱ_k + 1 − (1 + δ_k)μ)).

On the other hand, since |Ī_k − μ| ≤ 1 and 1 − Ī_k ≥ 0, it follows by applying the Chernoff bound that

E[D_k] ≤ (1/ᾱ_k) Pr{Ī_k ≥ (1 + δ_k)μ} ≤ (1/ᾱ_k) exp(−kδ_k²μ/3).

Now, pick {δ_k} so that δ_k² = (6 log(1/ᾱ_k)/(kμ)) ∨ (1/k) for k ≥ 1. Notice that this choice is both non-negative and decays to 0 due to our assumptions on {α_k}; thus, this is a valid choice. It is now easy to see that E[C_k] = O(1/√k) and E[D_k] = O(ᾱ_k). The desired result now follows.
3.2 SA Estimator

Let I denote a random variable with the same distribution as I_1. Also, for y ∈ ℝ, let H(y, I) = I(y + p) − y. Next, define h: ℝ → ℝ by h(y) := E[H(y, I)]. Observe that h(y) = p(∆ − y)/(∆ + p); further, ∆ is its unique zero. The theory of stochastic approximation then suggests using the update rule given in (2) for estimating ∆. We now discuss the convergence and convergence rate of this algorithm.
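The claimed simplification of the drift, h(y) = p(∆ − y)/(∆ + p), follows by substituting E[I] = ∆/(∆ + p) from (4), and is easy to sanity-check numerically. A small sketch (function names are ours):

```python
def drift(y, delta, p):
    """h(y) = E[H(y, I)] = E[I] * (y + p) - y, with E[I] = delta / (delta + p)."""
    mu = delta / (delta + p)
    return mu * (y + p) - y

def drift_closed_form(y, delta, p):
    """The simplified form h(y) = p * (delta - y) / (delta + p)."""
    return p * (delta - y) / (delta + p)
```

In particular, both forms vanish exactly at y = ∆, the unique zero that the SA iteration (2) seeks.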
Theorem 2.
Consider the estimator given in (2) for some positive stepsize sequence {η_k}.

1. Suppose that Σ_{k=0}^∞ η_k = ∞ and Σ_{k=0}^∞ η_k² < ∞. Then, lim_{k→∞} y_k = ∆ a.s.

2. Suppose that η_k = 1/(k+1)^η for some constant η ∈ (1/2, 1]. Then, E|y_k − ∆| = O(k^{−η/2}).

Proof.
For k ≥ 0, consider the σ-field F_k := σ(y_j, I_j, j ≤ k). Then, from (4) and the fact that {I_k} is an IID sequence, we get

E[I_{k+1}(y_k + p) − y_k | F_k] = (∆/(∆ + p))(y_k + p) − y_k = h(y_k).

Hence, one can rewrite (2) as

y_{k+1} = y_k + η_k [h(y_k) + M_{k+1}], (7)

where

M_{k+1} = [I_{k+1}(y_k + p) − y_k] − h(y_k) = [I_{k+1} − ∆/(∆ + p)](y_k + p). (8)

Since E[M_{k+1} | F_k] = 0 for all k ≥ 0, {M_k} is a martingale difference sequence. Consequently, (7) is a classical SA algorithm whose limiting ODE is

ẏ(t) = h(y(t)). (9)

We now make use of Theorem 9 given in the Appendix to establish Statement 1. Accordingly, we verify the four conditions listed there. The stepsize Condition i.) directly holds due to our assumptions on {η_k}. With regards to Condition ii.), recall we have already established above that {M_k} is a martingale difference sequence with respect to {F_k}. The square-integrability condition holds since |M_{k+1}| ≤ |y_k| + p which, in turn, implies that E[|M_{k+1}|² | F_k] ≤ 2(p² ∨ 1)(1 + |y_k|²), as desired. Next, due to linearity, h is trivially Lipschitz continuous. Further, h(y) = 0 if and only if y = ∆. This shows that ∆ is the unique equilibrium point of (9). Now, because the coefficient of y in h(y) is negative, it also follows that ∆ is the unique globally asymptotically stable equilibrium of (9). This verifies Condition iii.). We finally consider Condition iv.). Let h_∞(y) := −yp/(∆ + p). Then, clearly, h_c → h_∞ uniformly on compacts as c → ∞. Furthermore, since the coefficient of y is negative in the definition of h_∞, it is easy to see that the origin is the unique globally asymptotically stable equilibrium of the ODE ẏ(t) = h_∞(y(t)), as required. Statement 1 now follows.

We now sketch a proof for Statement 2. First, note that

y_{k+1} − ∆ = (1 − aη_k)(y_k − ∆) + η_k M_{k+1}, where a = p/(∆ + p).
Now, since E[M_{k+1} | F_k] = 0, we have

E[(y_{k+1} − ∆)² | F_k] = (1 − aη_k)² (y_k − ∆)² + η_k² E[M²_{k+1} | F_k].

Recall that E[M²_{k+1} | F_k] ≤ C(1 + y_k²) for some constant C ≥ 0. By substituting this above and then repeating all the steps from the proof of [23, Theorem 3.1], it is not difficult to see that Statement 2 holds as well.
3.3 SA Estimator with Momentum

In simple words, our SAM estimator is the SA estimator discussed above with an additional momentum term. Simulations in Section 4 show that this simple modification results in a drastic improvement in performance. We now discuss the convergence of the SAM estimator under the assumption that, for k ≥ 0,

ζ_k = (β_{k−1} − ωη_k)/β_{k−1}, (10)

where ω > 0 and {β_k} is some positive real sequence. By substituting (10) and letting u_k = (z_k − z_{k−1})/β_{k−1}, observe that the update rule in (3) can be rewritten as

u_{k+1} = u_k + γ_k [I_{k+1}(z_k + p) − z_k] − ωγ_k u_k,

where γ_k := η_k/β_k. For k ≥ 0, let M_{k+1} be as in (8) (with z_k in place of y_k). Also, let F_k denote the σ-field σ(z_0, u_0, I_1, ..., I_k). Clearly, u_k, z_k ∈ F_k and E[M_{k+1} | F_k] = 0. Hence, {M_k} is again a martingale difference sequence with respect to the filtration {F_k}. Furthermore, since |M_{k+1}| ≤ |z_k| + p, we have

E[|M_{k+1}|² | F_k] ≤ 2(p² ∨ 1)(1 + |z_k|²). (11)

As before, let a = p/(∆ + p). Also, let b = ∆p/(∆ + p) and ε_k = u_{k+1} − u_k for k ≥ 0. It is then easy to see that one can write down (3) in terms of the following two update rules:

u_{k+1} = u_k + γ_k [h(u_k, z_k) + M_{k+1}], (12)
z_{k+1} = z_k + β_k [g(u_k, z_k) + ε_k], (13)

where h: ℝ² → ℝ and g: ℝ² → ℝ are the linear functions given by h(u, z) = b − ωu − az and g(u, z) = u.

Theorem 3.
Consider the SAM estimator given in (3) with ζ_k of the form given in (10). Then z_k → ∆ a.s. if one of the following conditions holds true.

1. One-timescale: Σ_{k≥0} β_k = ∞, Σ_{k≥0} β_k² < ∞, and β_k = γ_k.

2. Two-timescale: Σ_{k≥0} β_k = Σ_{k≥0} γ_k = ∞, Σ_{k≥0} (β_k² + γ_k²) < ∞, and lim_{k→∞} β_k/γ_k = 0.

Recall that γ_k = η_k/β_k. We state a few remarks concerning this result before discussing its proof.
Remark 4. Examples of {η_k} and {β_k} sequences such that the above conditions are satisfied include the following.

• One-timescale: β_k = 1/(k+1)^β with β ∈ (1/2, 1] and η_k = 1/(k+1)^η with η = 2β.

• Two-timescale: β_k = 1/(k+1)^β with β ∈ (1/2, 1] and η_k = 1/(k+1)^η with β < η < 2β.

In either case, note that lim_{k→∞} ζ_k = 1.

Remark 5. The justification for the names given above for the two sets of conditions is as follows. Under the first set of conditions, the update rules in (12) and (13) indeed behave like a one-timescale stochastic approximation algorithm, i.e., both u_k and z_k move on the same timescale. On the other hand, under the second set of conditions, the pair (12) and (13) behaves like a two-timescale stochastic approximation algorithm. This is because β_k decays to 0 at a much faster rate than γ_k, in turn implying that the changes in {z_k}, i.e., {z_{k+1} − z_k}, are of a smaller magnitude than those in {u_k}.

Remark 6. In the spirit of the above remark, a natural question to consider is the following. Can one pick {η_k} and {β_k} so that γ_k/β_k → 0? That is, can one pick the stepsizes so that u_k now becomes the slowly moving update relative to z_k? The answer to this question seems to be no. This is because a couple of sufficient conditions needed to guarantee convergence (see Conditions iii.) and iv.) in Theorem 11) no longer hold true for this new setup. Furthermore, simulations seem to suggest that the iterates, in fact, race to infinity.

Remark 7. Another question to consider is the following. Can one pick ω, {β_k}, and {η_k} so that ζ_k → ζ, where ζ is a constant in (0, 1)? For example, one could pick ω = 1 − ζ, β_k = 1/(k+1)^β with β ∈ (1/2, 1], and η_k = 1/(k+1)^β, so that ζ_k → ζ. The answer to this second question does not seem to be clear. This is because lim_{k→∞} γ_k would then equal 1. Consequently, again, one of the sufficient conditions needed to guarantee convergence (see Condition i.) of Theorem 11) would no longer hold. However, simulations in this case do show some promise.
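The stepsize relationships in Remark 4 are easy to check numerically, e.g., that ζ_k → 1 and that β_k/γ_k → 0 for the two-timescale choice. A small sketch (the function name and the convention β_{−1} = 1 are ours):

```python
def sam_stepsizes(k, beta=0.8, eta=1.2, omega=1.0):
    """Stepsizes for the two-timescale choice of Remark 4:
    beta_k = 1/(k+1)^beta, eta_k = 1/(k+1)^eta,
    zeta_k = (beta_{k-1} - omega * eta_k) / beta_{k-1}.
    Returns the tuple (eta_k, beta_k, zeta_k).
    """
    eta_k = 1.0 / (k + 1) ** eta
    beta_k = 1.0 / (k + 1) ** beta
    beta_prev = 1.0 / k ** beta if k >= 1 else 1.0  # beta_{k-1}; beta_{-1} taken as 1
    zeta_k = (beta_prev - omega * eta_k) / beta_prev
    return eta_k, beta_k, zeta_k
```

Here ζ_k = 1 − ωη_k/β_{k−1} ≈ 1 − ωγ_k, which increases towards 1, while β_k/γ_k = β_k²/η_k = (k+1)^{η−2β} decays to 0 whenever η < 2β.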
Remark 8. Based on the existing literature on convergence rates for one-timescale and two-timescale linear stochastic approximation [23, 19, 20, 21], one can conjecture that E|z_k − ∆| = Õ(k^{−β/2}) when {β_k} and {η_k} are chosen as described in Remark 4. This implies the optimal convergence rate would then again be Õ(1/√k), which matches the bound we have obtained in Theorem 2 for the SA estimator. However, we believe that this bound may not be tight in the case of the SAM estimator. This is because (13) lacks the martingale difference term and, typically, these are the kind of terms that dictate the convergence rates. Furthermore, simulations in Section 4 suggest that the SAM estimator always converges much faster than the SA estimator.

Proof of Theorem 3.
We discuss the two cases one by one.
One-timescale Setup: In this case, the update rules given in (12) and (13) together form a one-timescale stochastic approximation algorithm. More specifically, if we let v_k = (u_k, z_k)ᵀ, then it follows that

v_{k+1} = v_k + β_k (H(v_k) + (0, ε_k)ᵀ + (M_{k+1}, 0)ᵀ), (14)

where H: ℝ² → ℝ² is the function defined by

H(v) = (b, 0)ᵀ − A v, with A = [[ω, a], [−1, 0]].

We now verify the four conditions listed in Theorem 9 and then make use of Proposition 10 (both given in the appendix) to show that v_k → (0, ∆)ᵀ =: v* a.s. This automatically implies z_k → ∆ a.s., which is what we need to prove.

Notice that the stepsize in (14) is β_k. Condition i.), therefore, trivially holds due to the assumptions made in Statement 1. Next, observe that the martingale difference term in (14) is the vector (M_{k+1}, 0)ᵀ. This, along with (11) and the statements above it, shows that Condition ii.) is true as well.

With regards to Condition iii.), first note that H is trivially Lipschitz continuous due to the linearity of both its component functions. Next, since ∆ = b/a, we have that H(v) = 0 if and only if v = v*. Furthermore, since a and ω are strictly positive, the real parts of the eigenvalues of the matrix A are also positive. This can be seen from the following set of observations. To begin with, the associated characteristic equation of this matrix is

λ² − λω + a = 0.

Hence, the roots are λ = (ω ± √(ω² − 4a))/2. If ω² < 4a, then the roots are complex valued; therefore, the real part of both these roots is ω/2 > 0. Instead, if ω² ≥ 4a, then both the roots are real; further, the smaller of the two roots, i.e., (ω − √(ω² − 4a))/2, is strictly positive since a > 0. This shows that −A, the matrix in the definition of H, is Hurwitz. Together, these observations show that v* is the unique globally asymptotically stable equilibrium of the ODE v̇(t) = H(v(t)).
This verifies Condition iii.).

Finally, let H_∞(v) = −A v, with A as above. Then, it is easy to see that H_c(v) → H_∞(v) uniformly on compact sets as c → ∞. Also, H_∞(v) = 0 if and only if v = 0. Furthermore, as shown before, −A is Hurwitz. This implies that the origin is the unique globally asymptotically stable equilibrium of the ODE v̇(t) = H_∞(v(t)). This verifies Condition iv.).

It now remains to check that {ε_k} has the decaying behaviour described in Proposition 10. Towards this, since |M_{k+1}| ≤ p + |z_k|, we have

‖(0, ε_k)ᵀ‖ ≤ C′γ_k (1 + |u_k| + |z_k|) ≤ Cγ_k (1 + ‖v_k‖)

for some constants C, C′ ≥ 0. Now, because γ_k decays to 0 as k → ∞ due to the assumption in Statement 1, it follows that {ε_k} indeed has the desired behaviour. This completes the proof in the one-timescale setup.

Two-timescale Setup: Since β_k/γ_k → 0, one can perceive u_k to be changing on a faster timescale relative to z_k. Hence, the update rules in (12) and (13) can be viewed as a two-timescale stochastic approximation algorithm. We now verify the conditions listed in Theorem 11 and then use Proposition 12 (both given in the appendix) to conclude z_k → ∆ a.s.

Conditions i.) and ii.) trivially hold. Hence, we only focus on verifying Conditions iii.) and iv.). Because of linearity, h and g are trivially Lipschitz continuous. Next, let φ(z) = (b − az)/ω for z ∈ ℝ. Clearly, φ is linear in z and, hence, Lipschitz continuous. Also, h(φ(z), z) = 0. This, along with the fact that the sign in front of u in h(u, z) is negative, shows that φ(z) is indeed the unique globally asymptotically stable equilibrium of the ODE u̇(t) = h(u(t), z). Next, observe that the ODE ż(t) = g(φ(z(t)), z(t)) has the form

ż(t) = (b − az(t))/ω.
Clearly, this ODE has ∆ as its unique globally asymptotically stable equilibrium. This completes the verification of Condition iii.).

With regards to Condition iv.), first let h_∞ be the function defined by h_∞(u, z) = −ωu − az. Also, for z ∈ ℝ, let φ_∞(z) = −az/ω. This function is linear in z and, hence, Lipschitz; also, φ_∞(0) = 0. Then, on the one hand, h_c → h_∞ uniformly on compacts as c → ∞ and, on the other hand, the ODE u̇(t) = h_∞(u(t), z) = −ωu(t) − az indeed has φ_∞(z) as its unique globally asymptotically stable equilibrium. Finally, for z ∈ ℝ, let g_∞(z) = −az/ω. Then, trivially, g_c → g_∞ uniformly on compacts as c → ∞. Further, ż(t) = g_∞(z(t)) = −az(t)/ω indeed has the origin as its unique globally asymptotically stable equilibrium. With this, we finish verifying Condition iv.).

Now, as per Proposition 12, we need to show that {ε_k} is asymptotically negligible. This is indeed true since |M_{k+1}| ≤ |z_k| + p, which implies |ε_k| ≤ Cγ_k(1 + |u_k| + |z_k|) for some constant C ≥ 0, and since γ_k → 0. This shows that (u_k, z_k) → (φ(∆), ∆) = (0, ∆) a.s., as desired.

3.4 Comparison with Existing Estimators

As far as we know, there are three other approaches in the literature for estimating page change rates: the Naive estimator, the MLE estimator, and the MM estimator. The details about the first two estimators can be found in [17] while, for the third one, one can look at [22]. We now do a comparison, within the context of our setup, between these estimators and the ones that we have proposed.

The Naive estimator simply uses the average number of changes detected to approximate the rate at which a page changes. That is, if the sequence {w_k} denotes the values of the Naive estimator then, in our setup, w_k = p Î_k/k, where Î_k is as defined below (1). The intuition behind this is the following. If τ_1 is as defined at the beginning of Section 3.1, and N(τ_1) denotes the number of page changes in that interval, then observe that E[N(τ_1)] = ∆/p.
Hence, the Naive estimator tries to approximate E[N(τ_1)] with Î_k/k so that the previous relation can then be used for guessing the change rate.

Clearly, E[w_k] = p∆/(∆ + p) ≠ ∆. Also, from the strong law of large numbers, w_k → p∆/(∆ + p) ≠ ∆ a.s. Thus, this estimator is neither consistent nor unbiased. This is to be expected, since this estimator does not account for all the changes that occur between two consecutive accesses.

Next, we look at the MLE estimator. Informally, this estimator identifies the parameter value that has the highest probability of producing the observed set of observations. In our setup, the value of the MLE estimator is obtained by solving the following equation for ∆:

Σ_{j=1}^k I_j τ_j / (exp(∆τ_j) − 1) = Σ_{j=1}^k (1 − I_j) τ_j, (15)

where τ_k = t_k − t_{k−1} and {t_k} is as defined in Section 2. The derivation of this relation is given in [17, Appendix C]. As mentioned in [17, Section 4], the above estimator is consistent.

Note that the MLE estimator makes actual use of the inter-arrival crawl times {τ_k}, unlike our estimators and also the Naive estimator. In this sense, it fully accounts for the randomness and available information in the crawling process. And, as we shall see in the numerical section, the quality of the estimate obtained via MLE improves rapidly in comparison to the Naive estimator as the sample size increases.

However, MLE suffers in two aspects: computational tractability and mathematical instability. Specifically, note that the MLE estimator lacks a closed-form expression. Therefore, one has to solve (15) by using numerical methods such as the Newton-Raphson method, Fisher's scoring method, etc. Unfortunately, using these ideas to solve (15) takes more and more time as the number of samples grows. Also note that, under the above solution ideas, the MLE estimator works in an offline fashion. In that, each time we get a new observation, (15) needs to be solved afresh. This is because there is no easy way to efficiently reuse the calculations from one iteration in the next; note that the defining equation (15) changes in a significant and nontrivial way from one iteration to another.

Besides the complexity, the MLE estimator is also unstable in two situations: one, when no changes have been detected (I_j = 0 for all j ∈ {1, ..., k}), and the other, when all the accesses detect a change (I_j = 1 for all j ∈ {1, ..., k}). In the first setting, no solution exists; in the second setting, the solution is ∞. One simple strategy to avoid these instability issues is to clip the estimate to some pre-defined range whenever one of these bad observation instances occurs.

Finally, let us discuss the MM estimator.
Here, one looks at the fraction of page accesses in which no change was detected and then, via moment matching, approximates the actual page change rate. In our context, the value of this estimator is obtained by solving

    ∑_{j=1}^{k} (1 − I_j) = ∑_{j=1}^{k} e^{−∆τ_j}

for ∆. The details of this equation are given in [22, Section 4]. While the MM idea is indeed simpler than MLE, the associated estimation process suffers from the same instability and computational issues as the ones discussed above.

We emphasise that none of our estimators suffers from any of the issues mentioned above. In particular, all of our estimators are online and have significantly simpler update rules; thus, improving the estimate whenever a new data point arrives is extremely easy. Moreover, all of them are stable, i.e., the estimated values are almost surely finite. More importantly, the performance of our estimators is comparable to that of the MLE. This can be seen from the numerical experiments in Section 4.
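To make the preceding discussion concrete, here is a small sketch (our own illustration, not the implementation of [17] or [22]) that solves both the MLE equation (15) and the MM equation by bisection on simulated crawl data. In both equations, the left-hand side minus the right-hand side is strictly decreasing in ∆, so bisection applies whenever a finite root exists, i.e., when the data contains both changes and non-changes:

```python
import math
import random

def bisect_root(f, lo=1e-9, hi=1e4, iters=200):
    """Bisection for a strictly decreasing f with f(lo) > 0 > f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def mle_estimate(taus, I):
    # Equation (15): sum_j I_j tau_j / (exp(D tau_j) - 1) = sum_j (1 - I_j) tau_j.
    rhs = sum(t for t, i in zip(taus, I) if i == 0)
    def f(D):
        lhs = 0.0
        for t, i in zip(taus, I):
            if i and D * t < 700.0:          # skip terms that underflow to ~0
                lhs += t / math.expm1(D * t)
        return lhs - rhs
    return bisect_root(f)

def mm_estimate(taus, I):
    # MM equation: sum_j (1 - I_j) = sum_j exp(-D tau_j).
    zeros = sum(1 - i for i in I)
    return bisect_root(lambda D: sum(math.exp(-D * t) for t in taus) - zeros)

# Simulated crawls: true change rate delta = 2, crawl rate p = 1.
random.seed(1)
delta, p, n = 2.0, 1.0, 5000
taus = [random.expovariate(p) for _ in range(n)]          # inter-access times
I = [int(random.expovariate(delta) < t) for t in taus]    # change detected?
print(mle_estimate(taus, I))          # close to the true rate 2.0
print(mm_estimate(taus, I))           # also close to 2.0
print(p * sum(I) / len(I))            # Naive-style value: near p*delta/(delta+p)
```

Note that, exactly as discussed above, `mle_estimate` and `mm_estimate` must redo the full root search each time a new observation arrives; this is the offline behaviour that our online estimators avoid. The last line shows the Naive-style value p·Î_k/k concentrating around p∆/(∆ + p) = 2/3 rather than ∆ = 2, illustrating its bias.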
Here, we demonstrate the strength of our estimators using three different experiments. The first one involves real data based on Wikipedia traces. On the one hand, we use this experiment to validate our model assumption that the page change process is a stationary Poisson point process. On the other hand, we use it to demonstrate that the estimation quality of our online estimators is comparable to that of the offline MLE estimator. In the second experiment, using synthetic data, we study the impact of ∆ and p on our three estimators. In the third experiment, we similarly study how the choice of {α_k}, {η_k} and {β_k} influences the performance.

Figure 1 (Different Estimators: Real Data): (a) Q-Q plot of the real data versus the exponential distribution; (b) and (c) trajectories of the Naive, MLE, LLN, SA and SAM estimators for two different crawling rates.
Figure 2 (Synthetic data: ∆ = 5, p = 3): (a) performance of single trajectories of the Naive, MLE, LLN, SA and SAM estimators; (b) 95% confidence interval; (c) root mean square error.

4.1 Performance on Real Data (Expt. 1)
As mentioned before, our goal here is to provide a validation of our model as well as to compare the performance of the different estimators on real data.

To generate the data set, we used Wikipedia traces which are openly available on the web. In particular, we looked at the list of frequently edited pages on Wikipedia and then randomly selected one page. The title of the page we chose was 'Template talk: Did you know'. Next, we extracted the timestamps at which this page was edited over the last five months (starting April 01).

Using a Q-Q plot, we then compared the distribution of the collected inter-update times to that of an exponential distribution with rate parameter equal to this ∆ value. That is, we used a scatterplot to compare the quantiles of the given data to those of the exponential distribution with the estimated rate. As Fig. 1(a) shows, the points lie close to the 45° diagonal. This implies that both sets of quantiles come from the same distribution, thereby confirming that the collected inter-update times indeed follow an exponential distribution whose rate is close to ∆. Equivalently, this implies that the update times come from a Poisson point process with rate parameter close to ∆.

Having verified our assumption, we now compare five different page rate estimators: Naive, MLE, LLN, SA, and SAM. Their performances are given in Fig. 1(b) and Fig. 1(c).

The procedure we adopted to obtain these plots was as follows. Unless specified, the notations are as in Section 2. Recall that we had access to the actual timestamps at which this Wikipedia page was changed. Keeping this in mind, we artificially generated the crawl instances of this page. These times were sampled from a Poisson point process; two different crawling rates p were used, the larger for Fig. 1(b) and the smaller for Fig. 1(c). This gave rise to the indicator sequences {I_k}: for the larger choice of p, the length of this sequence was 1723, while for the smaller one, it turned out to be 340. Using these I_k, p, and the inter-update time lengths, we then ran the five estimators mentioned above to estimate ∆. This gave rise to the trajectories shown in Fig. 1(b) and Fig. 1(c).
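The construction of the indicator sequence from recorded edit timestamps can be sketched as follows. This is only an illustration: the timestamps below are synthetic stand-ins for the actual Wikipedia trace, and the function names are ours.

```python
import bisect
import random

def crawl_indicators(change_times, crawl_times):
    """I_k = 1 iff at least one page change falls between crawls k-1 and k."""
    I, prev = [], 0.0
    for t in crawl_times:
        lo = bisect.bisect_right(change_times, prev)
        hi = bisect.bisect_right(change_times, t)
        I.append(int(hi > lo))
        prev = t
    return I

random.seed(3)
# Stand-in for the extracted edit timestamps: Poisson process with rate delta.
T, delta, p = 5000.0, 1.0, 0.2
change_times, t = [], 0.0
while t < T:
    t += random.expovariate(delta)
    change_times.append(t)
crawl_times, t = [], 0.0
while t < T:
    t += random.expovariate(p)       # crawl instances: Poisson process with rate p
    crawl_times.append(t)
I = crawl_indicators(change_times, crawl_times)
print(len(I), sum(I))                # number of crawls, number of detected changes
```

For the real trace, `change_times` would instead hold the extracted edit timestamps. The fraction of ones in `I` concentrates around ∆/(∆ + p), which is precisely the quantity the Naive estimator implicitly measures.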
Note that the depicted trajectories correspond to exactly one run of each estimator. The trajectory of the estimates obtained by the SA estimator is labelled ∆_SA, etc. The stepsizes chosen for our different estimators are as follows. For our LLN estimator, we set α_k ≡ 1; for the SA estimator, we set η_k = (k + 1)^{−η} with η = 0.75; and, in the case of the SAM estimator, we set η_k as above, β_k = (k + 1)^{−β}, and ω = 1. (Recall that, in the SAM estimator, the main stepsize is η_k, while the stepsize multiplying the momentum term has the form ζ_k = (β_k − ωη_k)/β_{k−1}.)

We now summarise our findings. In Fig. 1(b), we observe that the performances of the MLE, LLN, SA and SAM estimators are comparable to each other, and all of them outperform the Naive estimator. This last observation is not at all surprising, since the Naive estimator completely ignores the changes missed between two successive crawling instances. In contrast, we observe that the estimators behave somewhat differently in Fig. 1(c). Recall that the crawling frequency there is quite small compared with the one used for Fig. 1(b).

Throughout this experiment, we work with synthetic data.
Our goal here is to study the sample variance and root mean squared error of the estimates obtained from multiple runs of the different estimators. The output is given in Fig. 2.

The data for this experiment is generated as follows. To begin with, we imagine there is only one page. We then sample points from two different stationary Poisson point processes, one with parameter ∆ = 5 and the other with parameter p = 3. We treat the samples from the first process as the times at which this page changes, and the samples from the second process as the times at which this page is crawled. We then check if the page has changed or not between two successive page accesses. This is then used to generate the values of the indicator sequence {I_k}.

Figure 3 (Synthetic data: ∆ = 500, p = 3): (a) performance of single trajectories; (b) 95% confidence interval; (c) root mean square error.

Figure 4 (Synthetic data: ∆ = 500, p = 50): (a) performance of single trajectories; (b) 95% confidence interval; (c) root mean square error.

We now give {I_k}, p, as well as the inter-access lengths as input to the five different estimators mentioned before. The stepsizes we use are as follows. For our LLN estimator, we set α_k ≡ 1; for the SA estimator, we use η_k = (k + 1)^{−η} with η = 0.75; and, for the SAM estimator, we choose ζ_k as above, ω = 1, and β_k = (k + 1)^{−β} with β = 0.6. Fig. 2(a) depicts one single run of each of the five estimators.

In Fig. 2(b) and Fig. 2(c), the parameter values are exactly the same as in Fig. 2(a). However, we now run the simulation 100 times; the page change times and the page access times are generated afresh in each run. Fig. 2(b) depicts the 95% confidence interval of the obtained estimates, whereas Fig. 2(c) shows the root mean squared value of the difference between the estimated value and the actual change rate of the page.

We now summarise our findings. Clearly, in each case, we observe that the performances of the MLE, LLN, SA and SAM estimators are comparable to each other, and all of them outperform the Naive estimator. The fact that the estimates from our approaches are close to those of the MLE estimator was indeed quite surprising to us. This is because, unlike MLE, our estimators completely ignore the actual lengths of the intervals between two accesses. Instead, they use p, which only accounts for the mean interval length. Note that the variance of the first few samples for MLE is very high. This may be due to the instability that MLE faces; see Section 3.4. Fig. 2(c) shows that the error in the MLE estimate decays faster as compared to the others. We believe this is because the MLE also uses the actual interval lengths in its computation; thus, it uses more information about the crawling process than the other estimators.

While the plots do not show this, we once again draw attention to the fact that the time taken by each iteration of MLE grows rapidly as k increases. In contrast, our estimators take roughly the same amount of time for each iteration.

Impact of ∆ and p on Performance

In the previous experiments, recall that our different estimators more or less behaved similarly. Our goal now is to vary the values of ∆ and p and see if any major differences crop up in their performances. Alongside, we also wish to see the usefulness of the momentum term used in the SAM estimator.
The performances in two such interesting scenarios are shown in Fig. 3 and Fig. 4. Note that we no longer consider the MLE on account of its impractical run times when the I_k sequence lengths are large.

In Fig. 3, ∆ = 500 and p = 3, which means the crawling frequency is quite low compared to the frequency at which the page is updated. On the other hand, in Fig. 4, ∆ = 500 and p = 50; thus, the crawling frequency now is relatively higher. The stepsizes for our different estimators are as follows. For the LLN estimator, we chose α_k ≡ 1; for the SA estimator, we chose η_k = (k + 1)^{−η} with η = 0.8; and, for the SAM estimator, we chose η_k as before, ω = 1, and β_k = (k + 1)^{−β}.

The performance of all the estimators improves as the p value becomes higher. The impact of the momentum term can also be clearly seen in the low frequency crawling case. In this scenario, note that the crawler will more or less always detect a change. That is, the {I_k} sequence will mostly consist of all 1s. In turn, this means that the SA estimator's update rule will almost always have the form y_{k+1} = y_k + η_k p.

We then run the simulation 100 times and plot the 95% confidence interval and root mean squared error of our different estimators in the two scenarios. This is shown in Fig. 3(b), 3(c), 4(b), and 4(c). We observe that the variance for SA is relatively very low. This is because the SA estimator does not deviate much from the update rule mentioned in the previous paragraph. The disadvantage, however, is that its estimates typically are quite far away from the actual change rate. Furthermore, this error decreases quite slowly. Another interesting observation from Fig. 3(b) and 3(c) is that the variance of the LLN estimator is larger than that of the SAM estimator; however, its error decays at a much faster rate than that of the SAM estimator.

In Fig. 4, notice that the performance of all our estimators improves. However, as shown in Fig. 4(b), the SAM estimator is noisier now. Separately, the zoomed-in plot in Fig. 4(c) shows that the average error for the SAM estimator drops quite rapidly compared to the others in the initial few iterations. However, this advantage disappears after 400 iterations; from then on, the LLN estimator is much more stable.
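The slow drift just described is easy to quantify: with I_k ≡ 1, the SA iterate grows only by p·∑_k η_k, which for a polynomially decaying stepsize stays far below a large ∆ even after many crawls. A minimal sketch (with our own initial value y_0 = 0):

```python
delta, p = 500.0, 3.0        # low-frequency crawling: p << delta
y = 0.0                      # SA iterate, started at 0 for illustration
for k in range(1, 100001):
    eta = (k + 1) ** -0.8    # eta_k = (k+1)^(-0.8), as in this experiment
    y += eta * p             # I_k = 1 at every crawl => y_{k+1} = y_k + eta_k * p
print(y)                     # stays far below delta = 500
```

After 10^5 crawls, the iterate has only reached roughly p·∑_{k ≤ 10^5} (k + 1)^{−0.8} ≈ 130, which is why the SA estimates in Fig. 3 sit far below ∆ = 500 while showing very little variance.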
The theoretical results presented in Section 3 show that the convergence rate of LLN, SA, and SAM estimatoris affected by the choice of { α k } { η k } , and { ζ k } respectively. Figures 5 provide a numerical verification ofthe same. The details are as follows. We chose ∆ = 500 and p = 10 . Notice that the page change rate isagain very high, whereas the crawling frequency is relatively very low value. We then use the LLN estimatorwith three different choices of { α k } ; these choices are shown in the Fig 5(a) itself. The LLN estimator with α k = k . has the worst performance. This behaviour matches the prediction made by Theorem 1. InFig. 5(b), we again consider the same setup as above. However, this time we run the SA estimator with threedifferent choices of { η k } ; the choices are given in the figure itself. We see that the performance for η = 0 . { η k } and { ζ k } on the performance of the SAM estimator. Let ζ k be of form given in (10). Based on Remark 4, pick η k = ( k + 1) − η and β k = ( k + 1) − β with β ∈ (1 / , β < η < β. In Fig. 5(c), we fix η = 0 . β ; these choices are shown in the figure itself. TheSAM estimator with β = 0 . β increases; however, larger values of β also slow down the rate at which the error decreases. We observe that the SAM estimator with β = 0 . β = 0 . η . It is clear from the figure that the performanceof the SAM estimator remains more or less the same. This implies that the major factor that affects theperformance of SAM estimator is the stepsize related to momentum term. Practical Recommendations:
Here, we provide some recommendations on which estimator to use in practice. Our conclusions are based on what we observed in the numerical experiments discussed in Section 4. We summarise them as follows.

• High frequency crawling:
If the crawling frequency p is comparable to ∆, all estimators (LLN, SA, SAM and MLE) perform well, except the Naive estimator. However, we do not recommend the MLE as it is offline and very time-consuming. The examples that correspond to this scenario are depicted in Fig. 1(b) and Fig. 2.

• Low frequency crawling:
There are two sub-cases depending on the value of p as compared to ∆.

– Relatively very low p: The Naive estimator is very bad in this scenario, as there will be several missed changes which go unaccounted for. We recommend the LLN or SAM estimator, as they both outperform the SA estimator; the example that corresponds to this scenario is depicted in Fig. 3. For similar reasons as in the previous case, we do not recommend the MLE estimator.

– Relatively moderate p: The Naive estimator is again a bad choice here. Amongst the rest, we recommend the LLN estimator when several I_k values are available. Otherwise, one can use the SAM or the MLE estimator; the offline nature of the MLE will be of concern here as well. The examples that correspond to this scenario are depicted in Fig. 1(c) and Fig. 4.

Figure 5 (Impact of {α_k}, {η_k} and {ζ_k} choices on performance; ∆ = 500 and p = 10): (a) LLN estimator for different {α_k} choices; (b) SA estimator with η_k = (k + 1)^{−η} for different η choices; (c) SAM estimator with fixed η for different β_k choices; (d) SAM estimator with fixed β for different η_k choices.

In this section, we discuss how our estimators can be used to find optimal crawling rates {p*_i} so that the overall freshness of the local cache,

    lim_{T→∞} E[ (1/T) ∫_0^T ( ∑_{i=1}^N w_i 1{Fresh(i, t)} ) dt ],    (16)

is maximised subject to ∑_{i=1}^N p_i ≤ B. Here, T > 0 is the time horizon, N is the number of pages, w_i denotes the importance of the i-th page, B ≥ 0 is the bound on the overall crawling frequency, 1{·} is the indicator function, and Fresh(i, t) is the event that page i is fresh at time t, i.e., the local copy matches the actual page.

Azar et al. [9] showed that maximising (16) under a bandwidth constraint, for large enough T, corresponds to maximising F(p) = ∑_{i=1}^N w_i p_i/(p_i + ∆_i). The authors further provide an efficient algorithm with complexity O(N log N) to solve this problem. Note that this algorithm requires the ∆_i values to be known in advance; these can be estimated efficiently with any of the three schemes that we propose.

Separately, the recent work [18] along this direction views maximising freshness as minimising a harmonic staleness penalty related to every possible number of uncrawled changes. The associated problem then becomes minimising F̃(p) = −∑_{i=1}^N w_i ln(p_i/(p_i + ∆_i)). All the algorithms provided in [18] also assume known change rates, and the authors use the MLE estimator to obtain them beforehand. Note that the MLE estimator is offline and can be very time-consuming as the number of samples increases. Thus, one can replace it with any of our online estimators to obtain faster updates of the page change rates.

We have proposed three new online approaches for estimating the rate of change of web pages. All these estimators are computationally efficient in comparison to the MLE estimator. We first provide a theoretical analysis of the convergence of our estimators and then provide numerical simulations to compare their performance with the existing estimators in the literature. From the numerical experiments, we have verified that the proposed estimators perform significantly better than the Naive estimator and have extremely simple update rules, which makes them computationally attractive.
We also provide important insights on which estimator one should use in practice.

The performance of our estimators currently depends on the choices of {α_k}, {η_k}, and {ζ_k}, respectively. One aspect to analyse in the future would be the ideal choice for these sequences, i.e., the one that attains the fastest convergence rate. Another interesting research direction is to combine the online estimation with dynamic optimisation.

Acknowledgement
This work is partly supported by ANSWER project PIA FSN2 (P15 9564-266178 \ DOS0060094) and DST-Inria project "Machine Learning for Network Analytics" IFC/DST-Inria-2016-01/448. The authors would also like to thank A. Budhiraja for several useful discussions concerning Theorem 3.
References

[1] Konstantin Avrachenkov, Kishor Patil, and Gugan Thoppe. Change rate estimation and optimal freshness in web page crawling. In Proceedings of the 13th EAI International Conference on Performance Evaluation Methodologies and Tools, pages 3–10, 2020.

[2] Allan Heydon and Marc Najork. Mercator: A scalable, extensible web crawler. World Wide Web, 2(4):219–229, 1999.

[3] Carlos Castillo. Effective web crawling. In ACM SIGIR Forum, volume 39, pages 55–56, New York, NY, USA, 2005. Association for Computing Machinery.

[4] Rahul Kumar, Anurag Jain, and Chetan Agrawal. A survey of web crawling algorithms. Advances in Vision Computing: An International Journal, 3:1–7, 2016.

[5] Christopher Olston, Marc Najork, et al. Web crawling. Foundations and Trends in Information Retrieval, 4(3):175–246, 2010.

[6] Jenny Edwards, Kevin McCurley, and John Tomlin. An adaptive model for optimizing performance of an incremental web crawler. In Proceedings of the 10th International Conference on World Wide Web, pages 106–113, New York, NY, USA, 2001. Association for Computing Machinery.

[7] Junghoo Cho and Hector Garcia-Molina. Synchronizing a database to improve freshness. ACM SIGMOD Record, 29(2):117–128, 2000.

[8] Junghoo Cho and Hector Garcia-Molina. Effective page refresh policies for web crawlers. ACM Transactions on Database Systems (TODS), 28(4):390–426, 2003.

[9] Yossi Azar, Eric Horvitz, Eyal Lubetzky, Yuval Peres, and Dafna Shahaf. Tractable near-optimal policies for crawling. Proceedings of the National Academy of Sciences, 115(32):8099–8103, 2018.

[10] Konstantin E. Avrachenkov and Vivek S. Borkar. Whittle index policy for crawling ephemeral content. IEEE Transactions on Control of Network Systems, 5(1):446–455, 2016.

[11] José Niño-Mora. A dynamic page-refresh index policy for web crawlers. In Analytical and Stochastic Modeling Techniques and Applications, pages 46–60, Cham, 2014. Springer International Publishing.

[12] Brian E. Brewington and George Cybenko. How dynamic is the web? Computer Networks, 33(1-6):257–276, 2000.

[13] Brian E. Brewington and George Cybenko. Keeping up with the changing web. Computer, 33(5):52–58, 2000.

[14] Junghoo Cho and Hector Garcia-Molina. The evolution of the web and implications for an incremental crawler. In , pages 1–18, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

[15] Norman Matloff. Estimation of internet file-access/modification rates from indirect data. ACM Transactions on Modeling and Computer Simulation (TOMACS), 15(3):233–253, 2005.

[16] Sanasam Ranbir Singh. Estimating the rate of web page updates. In Proc. International Joint Conferences on Artificial Intelligence, pages 2874–2879, San Francisco, CA, USA, 2007. ACM.

[17] Junghoo Cho and Hector Garcia-Molina. Estimating frequency of change. ACM Transactions on Internet Technology (TOIT), 3(3):256–290, 2003.

[18] Andrey Kolobov, Yuval Peres, Cheng Lu, and Eric J. Horvitz. Staying up to date with online content changes using reinforcement learning for scheduling. In Advances in Neural Information Processing Systems, pages 581–591, 2019.

[19] Gal Dalal, Gugan Thoppe, Balázs Szörényi, and Shie Mannor. Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. In Conference On Learning Theory, pages 1199–1233. PMLR, 2018.

[20] Gal Dalal, Balázs Szörényi, and Gugan Thoppe. A tale of two-timescale reinforcement learning with the tightest finite-time bound. In Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 3701–3708, San Francisco, CA, USA, 2020. AAAI Press.

[21] Maxim Kaledin, Eric Moulines, Alexey Naumov, Vladislav Tadic, and Hoi-To Wai. Finite time analysis of linear two-timescale stochastic approximation with Markovian noise. arXiv preprint arXiv:2002.01268, 2020.

[22] Utkarsh Upadhyay, Robert Busa-Fekete, Wojciech Kotlowski, David Pal, and Balazs Szorenyi. Learning to crawl. In Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 8471–8478, New York, NY, USA, 2020. AAAI Press.

[23] Gal Dalal, Balázs Szörényi, Gugan Thoppe, and Shie Mannor. Finite sample analyses for TD(0) with function approximation. In Thirty-Second AAAI Conference on Artificial Intelligence, pages 6144–6160, San Francisco, CA, USA, 2018. AAAI Press.

[24] Vivek S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, volume 48. Springer, India, 2009.

[25] Chandrashekar Lakshminarayanan and Shalabh Bhatnagar. A stability criterion for two timescale stochastic approximation schemes. Automatica, 79:108–114, 2017.
Convergence of Stochastic Approximation Algorithms
In this section, we discuss results from the literature that provide sufficient conditions for the convergence of both one-timescale and two-timescale stochastic approximation algorithms. We begin with the convergence of a generic one-timescale stochastic approximation algorithm. This result is obtained by combining [24, Chapter 2, Corollary 4] and [24, Chapter 3, Theorem 7].
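Before stating the theorem, here is a minimal numerical illustration (a toy example of our own construction, not from [24]) of the kind of recursion it covers: with h(y) = y* − y and i.i.d. bounded martingale-difference noise, the iterates converge to y* almost surely, and a single long run already lands very close to it.

```python
import random

random.seed(6)
ystar = 3.0                                # unique stable equilibrium of dy = h(y)
y = 0.0
for k in range(1, 200001):
    eta = 1.0 / (k + 1)                    # sum eta_k = inf, sum eta_k^2 < inf
    m = random.uniform(-0.5, 0.5)          # martingale-difference noise M_{k+1}
    y += eta * ((ystar - y) + m)           # y_{k+1} = y_k + eta_k [h(y_k) + M_{k+1}]
print(y)                                   # close to ystar = 3
```

Here h is globally Lipschitz, the ODE ẏ = y* − y has y* as its unique globally asymptotically stable equilibrium, and h_c(y) = (y* − cy)/c → −y, whose ODE has the origin as its stable point, so all four conditions below are met.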
Theorem 9 (Convergence of One-timescale Stochastic Approximation [24]). Consider the update rule

    y_{k+1} = y_k + η_k [h(y_k) + M_{k+1}],

where η_k is a positive scalar; y_k, M_k ∈ R^d; and h : R^d → R^d is a deterministic function. Suppose the following conditions hold:

i.) ∑_{k=0}^∞ η_k = ∞ and ∑_{k=0}^∞ η_k² < ∞.

ii.) {M_k} is a martingale difference sequence with respect to the increasing family of σ-fields F_k := σ(y_j, M_j, j ≤ k), k ≥ 0. That is, E[M_{k+1} | F_k] = 0 a.s., k ≥ 0. Further, there is a constant C ≥ 0 such that E[‖M_{k+1}‖² | F_k] ≤ C(1 + ‖y_k‖²) a.s. for all k ≥ 0.

iii.) h is a globally Lipschitz continuous function. Further, the ODE ẏ(t) = h(y(t)) has a unique globally asymptotically stable equilibrium y*.

iv.) There exists a continuous function h_∞ : R^d → R^d such that the functions h_c(x) := h(cx)/c, c ≥ 1, satisfy h_c → h_∞ uniformly on compact sets as c → ∞. Further, the ODE ẏ(t) = h_∞(y(t)) has the origin as its unique globally asymptotically stable equilibrium.

Then, y_k → y* a.s.

Often, stochastic approximation algorithms contain an additional perturbation term that is asymptotically negligible. The next result discusses the convergence of such algorithms.
Proposition 10 (Convergence of Perturbed One-timescale Stochastic Approximation). Consider the update rule

    y_{k+1} = y_k + η_k [h(y_k) + ε_k + M_{k+1}],

where ε_k is an additional perturbation term, while the other terms have the same meaning as in Theorem 9. Suppose that the four conditions listed in Theorem 9 hold true. Further, suppose ‖ε_k‖ ≤ Cρ_k(1 + ‖y_k‖) a.s. for k ≥ 0, where C is a positive constant and {ρ_k} is a sequence of positive scalars such that lim_{k→∞} ρ_k = 0. Then, y_k → y* a.s.

Proof. We only give a sketch of the proof, since the arguments are more or less similar to the ones used to derive Theorem 9. As mentioned before, this latter result follows from [24, Chapter 2, Corollary 4] and [24, Chapter 3, Theorem 7]. We now briefly discuss how, even in the presence of the additional perturbation term, these two results continue to hold.

• [24, Chapter 2, Corollary 4]: This result follows from [24, Chapter 2, Theorem 2] which, in turn, follows from [24, Chapter 2, Lemma 1]. However, as shown in extension 3 in [24, pg. 17], this latter result goes through even in the presence of the perturbation term {ε_k}. This is because ε_k is asymptotically negligible a.s. More specifically, observe that the sequence {y_k} is a.s. bounded under assumption (A4) given on [24, pg. 17]. This implies that {ε_k} is a random bounded sequence which is o(1) a.s.; the latter is true since ρ_k → 0.

• [24, Chapter 3, Theorem 7]: The proof of this result is based on Lemmas 1 to 6 in [24, Chapter 3]. The first three of these lemmas concern the behaviour of the solution trajectories of the limiting ODE ẏ(t) = h_∞(y(t)). Since the perturbation term does not affect the definition of this limiting ODE in any way whatsoever, these three results continue to hold as before.
Similarly, Lemma 5 in ibid is unchanged, since it only concerns the convergence of the sum of martingale differences ∑_k η_k M̂_{k+1} (recall that the stepsize sequence in our update rule is η_k). With regards to the proof of Lemma 4 in ibid, observe that our update rule satisfies

    ŷ(t(k + 1)) = ŷ(t(k)) + η_k (h_{r(n)}(ŷ(t(k))) + ε̂_k + M̂_{k+1}),    m(n) ≤ k ≤ m(n + 1),

where ε̂_k = ε_k/r(n), while the other notations are analogous to the ones defined in [24, Chapter 3]. Because ‖ε_k‖ ≤ Cρ_k(1 + ‖y_k‖), ρ_k → 0, and r(n) ≥ 1, it follows that ‖ε̂_k‖ ≤ C(1 + ‖ŷ(t(k))‖) for some positive constant C. Note that this is in a similar spirit to (3.2.5) in ibid. It is then easy to see that the rest of the proof goes through as before. This shows that [24, Chapter 3, Lemma 4] continues to be true even in the presence of the perturbation term. Using exactly the same bound for ‖ε̂_k‖ obtained above, one can see that the arguments in the proof of Lemma 6 in ibid hold as well. Thus, [24, Chapter 3, Theorem 7] continues to hold, which is exactly what we wanted to establish.

The desired result now follows.

We next state a result that discusses the convergence of a generic two-timescale stochastic approximation algorithm. The proof of this result is based on [24, Chapter 6, Theorem 2] and [25, Theorem 10].

Theorem 11 (Convergence of Two-timescale Stochastic Approximation [24, 25]). Consider the update rules

    u_{k+1} = u_k + γ_k [h(u_k, z_k) + M^{(1)}_{k+1}],
    z_{k+1} = z_k + β_k [g(u_k, z_k) + M^{(2)}_{k+1}],

where γ_k and β_k are positive scalars; u_k, z_k, M^{(1)}_k, M^{(2)}_k ∈ R^d; and h, g : R^d × R^d → R^d are two deterministic functions. Suppose the following conditions hold:

i.) ∑_{k≥0} γ_k = ∑_{k≥0} β_k = ∞, ∑_{k≥0} (γ_k² + β_k²) < ∞, and lim_{k→∞} β_k/γ_k = 0.

ii.)
{M^{(1)}_k} and {M^{(2)}_k} are martingale difference sequences with respect to the increasing σ-fields F_k := σ(u_j, z_j, M^{(1)}_j, M^{(2)}_j, j ≤ k), k ≥ 0. Further, there exists a constant C ≥ 0 such that E[‖M^{(i)}_{k+1}‖² | F_k] ≤ C(1 + ‖u_k‖² + ‖z_k‖²) for i = 1, 2 and k ≥ 0.

iii.) h and g are globally Lipschitz continuous functions. For each fixed z, the ODE u̇(t) = h(u(t), z) has a unique globally asymptotically stable equilibrium φ(z), where φ : R^d → R^d is Lipschitz continuous. Further, the ODE ż(t) = g(φ(z(t)), z(t)) has a unique globally asymptotically stable equilibrium z*.

iv.) The functions h_c(u, z) := h(cu, cz)/c, c ≥ 1, satisfy h_c → h_∞ as c → ∞, uniformly on compacts for some h_∞. Also, for each fixed z ∈ R^d, the limiting ODE u̇(t) = h_∞(u(t), z) has a unique globally asymptotically stable equilibrium φ_∞(z), where φ_∞ : R^d → R^d is a Lipschitz map. Further, φ_∞(0) = 0. Separately, the functions g_c(z) := g(cφ_∞(z), cz)/c, c ≥ 1, satisfy g_c → g_∞ as c → ∞, uniformly on compacts for some g_∞. Also, the limiting ODE ż(t) = g_∞(z(t)) has the origin as its unique globally asymptotically stable equilibrium.

Then, (u_k, z_k) → (φ(z*), z*) a.s.

The last and final result of this section concerns the convergence of two-timescale stochastic approximation with perturbation terms that are asymptotically negligible.
Proposition 12 (Convergence of Perturbed Two-timescale Stochastic Approximation). Consider the update rules

    u_{k+1} = u_k + γ_k [h(u_k, z_k) + ε^{(1)}_k + M^{(1)}_{k+1}],
    z_{k+1} = z_k + β_k [g(u_k, z_k) + ε^{(2)}_k + M^{(2)}_{k+1}],

where ε^{(1)}_k, ε^{(2)}_k are additional perturbation terms, while the other terms have the same meaning as in Theorem 11. Suppose that the four conditions listed in Theorem 11 hold true. Further, suppose ‖ε^{(i)}_k‖ ≤ Cρ^{(i)}_k (1 + ‖u_k‖ + ‖z_k‖) a.s. for k ≥ 0 and i = 1, 2, where C is a positive constant and {ρ^{(i)}_k}, i = 1, 2, are sequences of positive scalars such that lim_{k→∞} ρ^{(i)}_k = 0. Then, (u_k, z_k) → (φ(z*), z*) a.s.

Proof. As stated before, this result follows from [24, Chapter 6, Theorem 2] and [25, Theorem 10]. We now briefly discuss how these results continue to hold even in the presence of the perturbation terms ε^{(1)}_k and ε^{(2)}_k.

• [24, Chapter 6, Theorem 2]: This result, as well as [24, Chapter 6, Lemma 1] on which it relies, are essentially proved by defining suitable one-timescale stochastic approximation algorithms and then using convergence results concerning the latter. In our situation, both of these will have additional perturbation terms that are asymptotically negligible. Consequently, by arguing as in the third extension given in [24, pg. 27], it can be shown that the asymptotic behaviour of these two algorithms remains unchanged even in the perturbed setup. Therefore, it follows that the conclusions of [24, Chapter 6, Theorem 2] continue to hold as before.
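To see the above guarantee in action, here is a minimal toy example of our own construction (not from [24] or [25]): we take h(u, z) = z − u, so that φ(z) = z, and g(u, z) = a − u, so that z* = a; with perturbations ε^{(i)}_k = O(1/k) and bounded i.i.d. noise, the iterates settle near (a, a), as Proposition 12 predicts.

```python
import random

random.seed(5)
a = 3.0                            # target: (u_k, z_k) should approach (a, a)
u, z = 0.0, 0.0
for k in range(1, 200001):
    gamma = (k + 1) ** -0.6        # fast stepsize; sum of squares finite
    beta = (k + 1) ** -0.9         # slow stepsize; beta_k / gamma_k -> 0
    m1 = random.uniform(-0.5, 0.5)         # martingale-difference noise
    m2 = random.uniform(-0.5, 0.5)
    e1, e2 = 1.0 / k, -1.0 / k             # vanishing perturbations, as in Prop. 12
    u += gamma * ((z - u) + e1 + m1)       # h(u, z) = z - u  => phi(z) = z
    z += beta * ((a - u) + e2 + m2)        # g(u, z) = a - u  => z* = a
print(u, z)                        # both near a = 3
```

All conditions of Theorem 11 are easy to verify here: both drifts are Lipschitz, the fast ODE u̇ = z − u is stable at φ(z) = z, the slow ODE ż = a − z is stable at z* = a, and the scaled limits h_∞(u, z) = z − u and g_∞(z) = −z have the required stability at the origin.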