Approximating the Stationary Probability of a Single State in a Markov chain
Christina E. Lee, Asuman Ozdaglar, Devavrat Shah
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, [email protected], [email protected], [email protected]
In this paper, we present a novel iterative Monte Carlo method for approximating the stationary probability of a single state of a positive recurrent Markov chain. We utilize the characterization that the stationary probability of a state i is inversely proportional to the expected return time of a random walk beginning at i. Our method obtains an ε-multiplicative close estimate with probability greater than 1 − α using at most Õ(t_mix ln(1/α)/π_i ε) simulated random walk steps on the Markov chain across all iterations, where t_mix is the standard mixing time and π_i is the stationary probability. In addition, the estimate at each iteration is guaranteed to be an upper bound with high probability, and is decreasing in expectation with the iteration count, allowing us to monitor the progress of the algorithm and design effective termination criteria. We propose a termination criterion which guarantees an ε(1 + 4 ln(2) t_mix) multiplicative error performance for states with stationary probability larger than ∆, while providing an additive error for states with stationary probability less than ∆ ∈ (0, 1), using at most Õ((ln(1/α)/ε) min(t_mix/π_i, 1/(ε∆))) simulated random walk steps, which is bounded by a constant with respect to the Markov chain. We provide a tight analysis of our algorithm based on a locally weighted variant of the mixing time. Our results naturally extend to countably infinite state space Markov chains via Lyapunov function analysis.

Key words: Markov chains, stationary distribution, Monte Carlo methods, network centralities
1. Introduction
Given a discrete-time, irreducible, positive recurrent Markov chain {X_t}_{t≥0} on a countable state space Σ with transition probability matrix P, we consider the problem of approximating the stationary probability of a chosen state i ∈ Σ. This is equivalent to computing the i-th component of the largest eigenvector of P. The classical approach aims to estimate the entire stationary distribution by computing the largest eigenvector of matrix P using either algebraic, graph theoretic, or simulation based techniques, which often involve computations with the full matrix. In this paper, we focus on computing the stationary probability of a particular state i, specifically in settings where P is sparse and the dimension is large. Due to the large scale of the system, it becomes useful to have an algorithm which can approximate only a few components of the solution without the full cost of computing the entire stationary distribution.

Computing the stationary distribution of a Markov chain with a large state space (finite, or countably infinite) has become a basic building block for many algorithms and applications across disciplines. For example, the Markov Chain Monte Carlo (MCMC) method is widely used in statistical inference for approximating or generating samples from distributions that are difficult to compute directly. We are particularly motivated by the application of computing stationary distributions of Markov chains for network analysis. Many decision problems over networks rely on information about the importance of different nodes as quantified by network centrality measures. Network centrality measures are functions assigning "importance" values to each node in the network.
A few examples of network centrality measures that can be formulated as the stationary distribution of a specific random walk on the underlying network include PageRank, which is commonly used in Internet search algorithms (Page et al. 1999); Bonacich centrality and eigencentrality measures, encountered in the analysis of social networks (Candogan et al. 2012, Chasparis and Shamma 2010); rumor centrality, utilized for finding influential individuals in social media like Twitter (Shah and Zaman 2011); and rank centrality, used to find a ranking over items within a network of pairwise comparisons (Negahban et al. 2012).

There are many natural contexts in which one may be interested in computing the network centrality of a specific agent, or a subset of agents in the network. For example, a particular business owner may be interested in computing the PageRank of his webpage and that of his nearby competitors within the webgraph, without incurring the cost of estimating the full PageRank vector. These settings call for an algorithm which estimates the stationary probability of a given state of a Markov chain using only information adjacent to the state within some local neighborhood, as described by the graph induced by matrix P, in which the edge (i, j) has weight P_ij.

We provide a novel Monte Carlo algorithm which is designed based on the characterization of the stationary probability of state i given by

π_i = 1 / E_i[T_i],

where T_i ≜ inf{t ≥ 1 : X_t = i} and E_i[·] ≜ E[· | X_0 = i]. Standard MCMC methods estimate the full stationary distribution by using the property that the distribution of the current state of a random walk over the Markov chain will converge to the stationary distribution as time goes to infinity. Therefore, such methods generate approximate samples by simulating a long random walk until the distribution of the terminal state is close to the stationary distribution.
Our key insight is that when we focus on solving for the stationary probability of a specific state i, we can center the computation and the samples around the state of interest by sampling return walks that begin at state i and return to state i. This method is suitable when we are specifically interested in the estimates for states with high stationary probability, since the corresponding expected return time E_i[T_i] will be short due to it being inversely proportional to π_i. In order to keep the computation within a local neighborhood and limit the cost, we truncate the sample random walks at a threshold. To determine the appropriate truncation threshold and sufficient number of samples, we iteratively increase the truncation threshold and the number of samples to obtain successively closer estimates. Thus the method systematically increases the size of the local neighborhood that it computes over, iteratively refining the estimates in a way that exploits the local structure. The estimates along the computation path are always upper bounds with high probability, allowing us to observe the progress of the algorithm as the estimates converge from above.

Given an oracle for transitions of a Markov chain defined on state space Σ, state i ∈ Σ, and scalars α, ε ∈ (0, 1), our method obtains an ε-multiplicative close estimate with probability greater than 1 − α using at most Õ(t_mix ln(1/α)/π_i ε) oracle calls (i.e., number of steps of the Markov chain), where t_mix is the mixing time defined by

t_mix ≜ min{t : max_{x∈Σ} ‖P^t(x, ·) − π‖_TV ≤ 1/4}.

Thus the number of simulated random walk steps used by our method scales with the same order as standard MCMC approaches up to polylogarithmic factors.
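The sampling scheme described above can be sketched in a few lines. The following is a minimal illustration, not the paper's exact algorithm: it draws truncated return walks from state i and inverts the empirical mean return time. Here `next_state` is an assumed oracle that simulates one transition of the chain. Because truncation can only shorten walks, the resulting estimate is biased upward, matching the upper-bound property discussed above.

```python
def sample_return_time(next_state, i, theta):
    """Length of one walk started at state i, stopped at its first
    return to i or at the truncation threshold theta."""
    t, x = 0, i
    while t < theta:
        x = next_state(x)
        t += 1
        if x == i:
            break
    return t

def estimate_pi(next_state, i, num_samples, theta):
    """Estimate pi_i = 1 / E_i[T_i] from truncated return walks.
    Truncation shortens walks, so this over-estimates pi_i on average."""
    total = sum(sample_return_time(next_state, i, theta)
                for _ in range(num_samples))
    return num_samples / total
```

In the full algorithm, the threshold and the number of samples are increased geometrically across iterations, so the estimates converge to π_i from above.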
Our method has the added benefit that the estimates along the computation path are always upper bounds with high probability, so we can monitor the progress as our estimate converges to the true stationary probability. This allows for easier design of a verifiable termination criterion, i.e., a procedure for determining how many random walks to sample and at what length to truncate the paths.

When the objective is to estimate the stationary probability of states with values larger than some ∆ ∈ (0, 1), we propose a termination criterion which guarantees an O(εt_mix) multiplicative error performance for states with stationary probability larger than ∆, while providing an additive error for states with stationary probability less than ∆. The algorithm outputs an estimate π̂_i which satisfies either
(a) π̂_i < ∆/(1 + ε) ⟹ π_i < ∆ with high probability, or
(b) π̂_i ≥ ∆/(1 + ε) ⟹ (1 − ε(1 + 4 ln(2) t_mix)) π̂_i ≤ π_i ≤ (1 + ε) π̂_i with high probability.
With probability greater than 1 − α, the total number of oracle calls, i.e., number of steps of the Markov chain, that the algorithm uses before satisfying the termination criterion is bounded by

Õ( (ln(1/α)/ε) min( t_mix/π_i , 1/(ε∆) ) ) = Õ( ln(1/α)/(ε²∆) ).

We denote Õ(f(a)g(b)) = O(f(a) polylog f(a)) · O(g(b) polylog g(b)). P^t(x, ·) denotes the x-th row of matrix P^t, and ‖·‖_TV denotes the total variation distance.

The termination criterion that we propose does not require any knowledge of the properties of the Markov chain, but only depends on the parameters ∆, ε, and the intermediate vectors obtained through the computation. Therefore, a feature of our algorithm is that its cost and performance adapt to the mixing properties of the Markov chain without requiring prior knowledge of the Markov chain.
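The two cases of the guarantee can be read as a simple decision rule. Below is an illustrative sketch of that rule only (not the algorithm's internal test, which never needs t_mix); the factor ε(1 + 4 ln(2) t_mix) in case (b) is taken directly from the statement above.

```python
import math

def interpret_estimate(pi_hat, delta, eps, t_mix):
    """Case (a): a small estimate certifies pi_i < Delta w.h.p.
    Case (b): otherwise the estimate brackets pi_i multiplicatively."""
    if pi_hat < delta / (1 + eps):
        return ("below_threshold", None)
    lo = (1 - eps * (1 + 4 * math.log(2) * t_mix)) * pi_hat
    hi = (1 + eps) * pi_hat
    return ("interval", (lo, hi))
```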
Specifically, the cost of the computation naturally reduces for Markov chains that mix quickly. Standard MCMC methods, in contrast, often involve choosing parameters such as the number of samples heuristically a priori, based on given assumptions about the mixing properties. We show that our analysis of the error is tight for a family of Markov chains, and the precise characterization of the error depends on a locally weighted variant of the classic mixing time. In scenarios where these "local mixing times" for different states differ from the global mixing time, our algorithm, which utilizes local random walk samples, may provide tighter results than standard MCMC methods. This also suggests an important mathematical inquiry: how to characterize local mixing times of Markov chains, and in what settings they are homogeneous as opposed to heterogeneous.

We utilize the exponential concentration of return times in Markov chains to establish theoretical guarantees for the algorithm. Our analysis extends to countably infinite state space Markov chains, suggesting an equivalent notion of mixing time for analyzing countable state space Markov chains. We provide an exponential concentration bound on the tail of the distribution of return times to a given state, utilizing a bound by Hajek (1982) on the concentration of certain types of hitting times in a countably infinite state space Markov chain. For Markov chains that mix quickly, the distribution over return times concentrates more sharply around its mean, resulting in tighter performance guarantees. Our analysis in the countably infinite state space setting lends insights towards understanding the key properties of large scale finite state space Markov chains.

Due to the truncation used within the original algorithm, the estimates obtained are biased. Therefore we also provide a bias correction for the estimates, at no additional computation cost, which we show performs surprisingly well in simulations.
Whereas the original algorithm gave coarser estimates for states with low stationary probability, the bias-corrected algorithm output close estimates for all states in simulations, even the states with stationary probability less than the parameter ∆. We provide theoretical analysis that sheds insight into the class of Markov chains for which the bias correction is effective. In addition, we present a modification of our algorithm which reuses the same simulated random walks to obtain estimates of the stationary probabilities of other states in the neighborhood of state i, based on the frequency of visits to those states. Again, this modification does not require any extra computation cost in terms of simulated random walk steps, and yet provides estimates for the full stationary distribution. We provide theoretical bounds in addition to simulations that show its effectiveness.

We provide a brief overview of the standard methods used for computing stationary distributions.
Markov chain Monte Carlo (MCMC) methods involve simulating long random walks over a carefully designed Markov chain in order to obtain samples from a target stationary distribution (Metropolis et al. 1953, Hastings 1970). In equilibrium, i.e., as time tends to infinity, the distribution of the random walk over the state space approaches the stationary distribution. These algorithms also leverage the ergodic property of Markov chains, which states that the Markov chain exhibits the same distribution when averaged over time and over space. In other words, as t tends to infinity, the distribution over states visited by the Markov chain from time 0 to t will converge to the distribution of the state of the Markov chain at time t. Therefore, MCMC methods approximate the stationary distribution by simulating the Markov chain for a sufficiently long time, and then either averaging over the states visited along the Markov chain, or using the last visited state as an approximate sample from the stationary distribution. This process is repeated many times to collect independent samples from π.

When applying MCMC methods in practice, it is often difficult to determine confidently when the Markov chain has been simulated for a sufficiently long time. Therefore, many heuristics are used for the termination criteria. The majority of work following the initial introduction of the MCMC method involves analyzing the convergence rate of the random walk for Markov chains under different conditions (Aldous and Fill 1999, Levin et al. 2009). Techniques involve spectral analysis (i.e., bounding the convergence rate as a function of the spectral gap of P) or coupling arguments. Graph properties such as conductance provide ways to characterize the spectrum of the graph. Most results are specific to reversible finite state space Markov chains, which are equivalent to random walks on weighted undirected graphs.
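The time-averaging estimator described above can be sketched as follows. This is a generic illustration; the burn-in length is a heuristic choice, which is exactly the practical difficulty just noted.

```python
def mcmc_estimate(next_state, start, burn_in, num_steps):
    """Ergodic averaging: run one long walk, discard a burn-in prefix,
    and report empirical visit frequencies as the estimate of pi."""
    x = start
    for _ in range(burn_in):
        x = next_state(x)
    counts = {}
    for _ in range(num_steps):
        x = next_state(x)
        counts[x] = counts.get(x, 0) + 1
    return {state: c / num_steps for state, c in counts.items()}
```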
A detailed summary of the major developments and analysis techniques for MCMC methods can be found in articles by Diaconis and Saloff-Coste (1998) and Diaconis (2009).

Our algorithm also falls within the class of MCMC methods, as it is based upon simulating random walks over the Markov chain and using concentration results to prove guarantees on the estimate. Its major distinction is due to the use of a different characterization, π_i = 1/E_i[T_i], which naturally lends itself to a component-centered approximation. By sampling returning random walks which terminate when the initial state is revisited, we are able to design an intuitive termination criterion that is able to provide a one-sided guarantee.

The power iteration method is an equally old and well-established method for computing leading eigenvectors of matrices (Golub and Van Loan 1996, Stewart 1994, Koury et al. 1984). Given a matrix A and an initial vector x_0, it recursively computes iterates x_{t+1} = A x_t / ‖A x_t‖. If matrix A has a single eigenvalue that is strictly greater in magnitude than all other eigenvalues, and if x_0 is not orthogonal to the eigenvector associated with the dominant eigenvalue, then a subsequence of {x_0, x_1, x_2, ...} converges to the eigenvector associated with the dominant eigenvalue. Recursive multiplications involving large matrices can become expensive very quickly as the matrix grows. When the matrix is sparse, computation can be saved by implementing it through 'message-passing' techniques; however, it still requires computation to take place at every state in the state space. The convergence rate is governed by the spectral gap, i.e., the difference between the two largest eigenvalues of A.
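For a stochastic matrix the dominant eigenvalue is 1, so the recursion above specializes to repeated left-multiplication by P with l1 normalization. A minimal sketch on a dense row-stochastic matrix (illustrative only; practical implementations exploit sparsity via the message-passing structure mentioned above):

```python
def power_iteration(P, x, num_iters):
    """Iterate x_{t+1} = x_t P / ||x_t P||_1; for an irreducible, aperiodic
    stochastic matrix P this converges to the stationary distribution."""
    n = len(P)
    for _ in range(num_iters):
        y = [sum(x[j] * P[j][k] for j in range(n)) for k in range(n)]
        s = sum(y)
        x = [v / s for v in y]
    return x
```

The number of iterations needed scales inversely with the spectral gap, which is the dependence on mixing properties discussed above.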
Techniques used for analyzing the spectral gap and mixing times as discussed above are also used in analyzing the convergence of power iteration; thus many results again only pertain to reversible Markov chains. For large Markov chains, the mixing properties may scale poorly with the size, making it difficult to obtain good estimates in a reasonable amount of time. This is particularly ill-suited for very large or countably infinite state spaces.

In the setting of computing PageRank for nodes in a network, there have been efforts to modify the algorithm to execute power iteration over local subsets of the graph and combine the results to obtain estimates for the global PageRank. These methods rely upon key assumptions on the underlying graph, which are difficult to verify. Kamvar et al. observed that there may be obvious ways to partition the web graph (e.g., by domain names) such that power iteration can be used to estimate the local PageRank within these partitions (Kamvar et al. 2003). They use heuristics to estimate the relative weights of these partitions, and combine the local PageRank within each partition according to the weights to obtain an initial estimate for PageRank. This initial estimate is used to initialize the power iteration method over the global Markov chain, with the hope that this initialization may speed up convergence. Chen et al. proposed a method for estimating the PageRank of a subset of nodes given only the local neighborhood of this subset (Chen et al. 2004). Their method uses heuristics such as weighted in-degree as estimates for the PageRank values of nodes on the boundary of the given neighborhood. After fixing the boundary estimates, standard power iteration is used to obtain estimates for nodes within the local neighborhood. The error in this method depends on how close the true PageRank of nodes on the boundary corresponds to heuristic guesses such as weighted in-degree.
Unfortunately, we rarely have enough information to make accurate heuristic guesses for these boundary nodes.

There has been much recent effort to develop local algorithms for computing PageRank for the web graph. Given a directed graph of n nodes with an n × n adjacency matrix A (i.e., A_ij = 1 if (i, j) ∈ E and 0 otherwise), the PageRank vector π is given by the stationary distribution of a Markov chain over n states, whose transition matrix P is given by

P = (1 − β) D^{−1} A + β 1 r^T.   (1)

D denotes the diagonal matrix whose diagonal entries are the out-degrees of the nodes; β ∈ (0,
1) is a fixed scalar; and r is a fixed probability vector over the n nodes. In each step, the random walk with probability (1 − β) chooses one of the neighbors of the current node equally likely, and with probability β chooses any of the nodes in the graph according to r. Thus, the PageRank vector π satisfies

π^T = π^T P = (1 − β) π^T D^{−1} A + β r^T,   (2)

where π^T 1 = 1. This definition of PageRank is also known as personalized PageRank, because r can be tailored to the personal preferences of a particular web surfer. When r = (1/n) 1, then π equals the standard global PageRank vector. If r = e_i, then π describes the personalized PageRank that jumps back to node i with probability β in every step.

Computationally, the design of local algorithms for computing the personalized PageRank has been of interest since its discovery. Most of the algorithms and analyses crucially rely on the specific structure of the random walk describing PageRank: P decomposes into a natural random walk matrix D^{−1} A, and a rank-1 matrix 1 r^T, with strictly positive weights (1 − β) and β respectively, cf. (1). Jeh and Widom (2003) and Haveliwala (2003) observed a key linearity relation: the global PageRank vector is the average of the n personalized PageRank vectors corresponding to those obtained by setting r = e_i for 1 ≤ i ≤ n. That is, these n personalized PageRank vectors centered at each node form a basis for all personalized PageRank vectors, including the global PageRank. Therefore, the problem boils down to computing the personalized PageRank for a given node. Fogaras et al. (2005) used the fact that for the personalized PageRank centered at a given node i (i.e., r = e_i), the associated random walk has probability β at every step to jump back to node i, "resetting" the random walk. The distribution over the last node visited before a "reset" is equivalent to the personalized PageRank vector corresponding to node i.
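Fogaras' observation translates directly into a sampler: run a walk from i that terminates at each step with probability β, and record the endpoint. A hedged sketch, assuming an unweighted graph given as an out-neighbor list (so transitions are uniform over neighbors, as in D^{−1}A of (1)):

```python
import random

def ppr_estimate(out_neighbors, i, beta, num_samples, rng=random):
    """Monte Carlo personalized PageRank centered at node i: the last node
    of a geometric-length walk from i is a draw from the PPR vector."""
    counts = {}
    for _ in range(num_samples):
        x = i
        while rng.random() >= beta:      # continue with probability 1 - beta
            x = rng.choice(out_neighbors[x])
        counts[x] = counts.get(x, 0) + 1
    return {v: c / num_samples for v, c in counts.items()}
```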
Therefore, they propose an algorithm which samples from the personalized PageRank vector by simulating short geometric-length random walks beginning from node i, and recording the last visited node of each sample walk. The performance of the estimate can be established using standard concentration results. (Here 1 denotes the all ones vector, and e_i denotes the standard basis vector having value one in coordinate i and zero in all other coordinates.)

Subsequent to the key observations mentioned above, Avrachenkov et al. (2007) surveyed variants of Fogaras' random walk algorithm, such as computing the frequency of visits to nodes across the sample path rather than only the end node. Bahmani et al. (2010) addressed how to incrementally update the PageRank vector for dynamically evolving graphs, or graphs where the edges arrive in a streaming manner. Das Sarma et al. extended the algorithm to streaming graph models (Sarma et al. 2011) and distributed computing models (Sarma et al. 2012), "stitching" together short random walks to obtain longer samples, and thus reducing communication overhead. More recently, building on the same set of observations, Borgs et al. (2012) provided a sublinear time algorithm for estimating global PageRank using multi-scale matrix sampling. They use geometric-length random walk samples, but do not require samples for all n personalized PageRank vectors. The algorithm returns a set of "important" nodes such that the set contains all nodes with PageRank greater than a given threshold ∆, and does not contain any node with PageRank less than ∆/c with probability 1 − o(1), for a given c >
1. The algorithm runs in time Õ(n/∆).

Andersen et al. (2007) designed a backward variant of these algorithms. Previously, to compute the global PageRank of a specific node j, we would average over all personalized PageRank vectors. The algorithm proposed by Andersen et al. estimates the global PageRank of a node j by approximating the "contribution vector", i.e., estimating the j-th coordinates of the personalized PageRank vectors that contribute the most to π_j.

All of these algorithms rely on the crucial property that the random walk has a renewal time that is geometrically distributed with constant parameter β. This is because the transition matrix P decomposes according to (1), with a fixed β. In general, the transition matrix of an irreducible, positive recurrent Markov chain will not have such a decomposition property (and hence a known renewal time), making the above algorithms inapplicable in general. Our work can be seen as extending this approach of local approximation via short random walks beyond the restricted class of personalized PageRank to all Markov chains. We utilize the fundamental invariant that the stationary probability is inversely proportional to the average return time, which leads to a natural sampling scheme in which repeated visits to the state of interest behave as the "renewal events" of the stochastic process, whereas the PageRank algorithms used the teleportation steps as the renewal events.

For the remainder of the paper, we will formalize the problem, present the main theorem results, provide intuition and proof sketches for the results, and demonstrate the algorithm through basic simulations. Section 2 includes the definition of the problem statement and a background review of key properties of Markov chains.
We also provide an example to show that if the mixing properties of the Markov chain could be arbitrarily poor, then any Monte Carlo algorithm which samples only random walks within a local neighborhood of state i cannot distinguish between a family of Markov chains which look similar locally, but very different globally. Section 3 presents the main theorem results of the analysis of convergence time and approximation error for the algorithms which we develop. Sections 4 to 7 present the proof sketch and intuition behind the analysis, including the simple proofs, but leaving the more complex details of the proofs to the appendices. Section 4 shows that the random variables used in the algorithm for the estimates and termination conditions indeed concentrate around their means with high probability. Section 5 shows that for any positive recurrent Markov chain, the distribution over the return time of a random walk decays exponentially in the length, where the rate of decay is a function of the mixing properties of the Markov chain. This section provides the foundation for proving that our results extend to countably infinite state space Markov chains. Section 6 provides the proof sketch for bounds on the approximation error of the estimates in each iteration. Section 7 provides the proof sketch for bounds on the convergence and computation time of the algorithm. Section 8 presents the results from basic simulations in which we implemented and executed our algorithms on simple Markov chains.
2. Setup
In this section, we present our problem setup, and review the key definitions and properties of Markov chains which are useful for understanding our algorithm and analysis.
Consider a discrete time, irreducible, positive recurrent Markov chain {X_t}_{t≥0} on a countable state space Σ with transition probability matrix P : Σ × Σ → [0, 1]. For a given state i ∈ Σ, our goal is to estimate the stationary probability of state i, denoted by π_i. We consider the regime where the state space is large, thus it becomes critical to have an algorithm that scales well with the size of the state space. We limit ourselves to crawl operations originating from state i, simulating a limited access setting that occurs when the algorithm is run by a third-party user of the network who does not own or have full access to the network. We also focus on the setting where we are particularly interested in states with large stationary probability; specifically, when there is some threshold ∆ such that we only consider a state significant if it has stationary probability larger than ∆. Thus, for states with stationary probability less than ∆, we are satisfied with a rough estimate; however, for states with stationary probability larger than ∆, we would like an estimate which has bounded multiplicative error.

Given the transition probability matrix P, let P^n_xy denote the value of entry (x, y) in the matrix P^n. If the state space is countably infinite, then P : Σ × Σ → [0,
1] is a function such that for all x, y ∈ Σ,

P_xy = P(X_{t+1} = y | X_t = x).

Similarly, P^n_xy is defined for all x, y ∈ Σ to be

P^n_xy ≜ P(X_n = y | X_0 = x).

The stationary distribution is the largest eigenvector of P, also described by a function π : Σ → [0, 1] such that Σ_{i∈Σ} π_i = 1 and π_i = Σ_{j∈Σ} π_j P_ji for all i ∈ Σ.

The Markov chain can be visualized as a random walk over a weighted directed graph G = (Σ, E, P), where Σ is the set of states, E = {(i, j) ∈ Σ × Σ : P_ij > 0} is the set of edges, and P describes the weights of the edges. We refer to G as the Markov chain graph. Throughout the paper,
Markov chain and random walk on a graph are used interchangeably; similarly, nodes and states are used interchangeably. If the state space Σ is finite, let n = |Σ| denote the number of states in the graph. We assume throughout this paper that the Markov chain {X_t} is irreducible and positive recurrent. This guarantees that there exists a unique stationary distribution. (A Markov chain is irreducible if and only if the corresponding Markov chain graph is strongly connected, i.e., for all x, y ∈ Σ, there exists a path from x to y. A Markov chain is positive recurrent if the expected time for a random walk beginning at state i to return to state i is finite; this means that the random walk cannot "drift to infinity", and it holds for all irreducible finite state space Markov chains.)

Our algorithm involves generating sample sequences of the Markov chain by simulating a random walk on the graph. These sample sequences allow us to observe return times T_i and visit frequencies F_j to different states, where T_i and F_j are defined as:

T_i ≜ inf{t ≥ 1 : X_t = i},   (3)

and

F_j ≜ Σ_{t=1}^∞ 1{X_t = j} 1{t ≤ T_i} = Σ_{t=1}^{T_i} 1{X_t = j}.

Throughout this paper, we denote E_i[·] ≜ E[· | X_0 = i], and P_i(·) ≜ P(· | X_0 = i). The following characterization of the stationary distribution of a Markov chain is the theorem upon which our algorithm and analysis stand. Given samples of T_i and F_j, we use this theorem to construct estimates for the stationary probabilities.

Lemma 1 (cf. Meyn and Tweedie 1993). An irreducible positive recurrent Markov chain has a unique stationary distribution π with the following form:
(a) For any fixed i ∈ Σ,

π_j = E_i[F_j] / E_i[T_i],  j ∈ Σ.

(b) An equivalent expression for this distribution is

π_i = 1 / E_i[T_i].
Lemma 1(b) states that the stationary probability of a state i is inversely proportional to the expected return time of a random walk beginning at state i and ending at its first return to state i. The basic algorithm we propose is based upon this characterization of stationary probability. Lemma 1(a) states that the stationary probability of state j is equivalent to the fraction of expected visits to state j out of the total number of steps taken along a random walk beginning at state i and returning to state i. This characterization is used to show that, given returning random walks from state i to state i, we can also obtain estimates of the stationary probability of other states j by observing the frequency of visits to state j along those sample paths.

The mixing properties of the Markov chain affect the ease with which our algorithm approximates the stationary probabilities. Our analysis and bounds will be a function of a few related quantities which we will proceed to define and discuss. In the finite state space setting, the error bounds on the estimate produced for the stationary probability of state i will be given as a function of the maximal hitting time H_i and the fundamental matrix Z. These measure how well connected the graph is globally. The maximal hitting time to a state i in a finite state space Markov chain is defined as

H_i ≜ max_{j∈Σ} E_j[T_i].   (4)

The fundamental matrix Z of a finite state space Markov chain is

Z ≜ Σ_{t=0}^∞ (P^t − 1π^T) = (I − P + 1π^T)^{−1},

i.e., the entries of the fundamental matrix Z are defined by

Z_jk ≜ Σ_{t=0}^∞ (P^t_jk − π_k).

Since P^t_jk denotes the probability that a random walk beginning at state j is at state k after t steps, Z_jk represents how quickly the probability mass at state k from a random walk beginning at state j converges to π_k.
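Lemma 1(a) suggests estimating stationary probabilities of neighboring states from the same return walks: count visits to every state between consecutive returns to i and divide by the total number of steps. A minimal illustrative sketch (`next_state` is an assumed transition oracle, and walks are untruncated here for simplicity):

```python
def visit_frequency_estimate(next_state, i, num_walks):
    """Estimate pi_j = E_i[F_j] / E_i[T_i] from return walks at i: each
    walk counts its visits to every state up to and including the return."""
    visits, total_steps = {}, 0
    for _ in range(num_walks):
        x = i
        while True:
            x = next_state(x)
            total_steps += 1
            visits[x] = visits.get(x, 0) + 1
            if x == i:
                break
    return {j: c / total_steps for j, c in visits.items()}
```

Note that the estimate for state i itself reduces to num_walks / total_steps, the empirical version of Lemma 1(b).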
We will use the following property, stated by Aldous and Fill (1999) in Chapter 2, Section 2.2, Lemma 12, to relate entries in the fundamental matrix to expected return times.

Lemma 2. For j ≠ k,

E_j[T_k] = (Z_kk − Z_jk) / π_k.

We define Z_max(i) ≜ max_{k∈Σ} |Z_ki|. The relationship between Z_max(i) and H_i is described by

Z_max(i) ≤ π_i H_i ≤ 2 Z_max(i).
4. We for-malize the definition and review some well known and useful properties below. For further details,read chapter 4 of Levin et al. (2009). (cid:107) µ − ν (cid:107) T V = 12 (cid:88) x ∈ Σ | µ ( x ) − ν ( x ) | = (cid:88) x ∈ Σ ,µ ( x ) ≥ ν ( x ) | µ ( x ) − ν ( x ) | .d ( t ) (cid:44) max x ∈ Σ (cid:107) P t ( x, · ) − π (cid:107) T V = sup µ ∈P (cid:107) µP t − π (cid:107) T V . ¯ d ( t ) (cid:44) max x,y ∈ Σ (cid:107) P t ( x, · ) − P t ( y, · ) (cid:107) T V = sup µ,ν ∈P (cid:107) µP t − νP t (cid:107) T V .t mix ( (cid:15) ) (cid:44) min { t : d ( t ) ≤ (cid:15) } .t mix (cid:44) t mix (1 / .t mix ( (cid:15) ) ≤ (cid:100) log (1 /(cid:15) ) (cid:101) t mix .d ( t ) ≤ ¯ d ( t ) ≤ −(cid:98) t/t mix (cid:99) . Therefore, we can obtain the following relation between entries of Z and t mix . Z jk = ∞ (cid:88) t =0 (cid:0) P tjk − π k (cid:1) ≤ ∞ (cid:88) t =0 (cid:12)(cid:12) P tjk − π k (cid:12)(cid:12) ≤ ∞ (cid:88) t =0 (cid:13)(cid:13) P t ( j, · ) − π (cid:13)(cid:13) T V ≤ ∞ (cid:88) t =0 d ( t ) ≤ ∞ (cid:88) t =0 −(cid:98) t/t mix (cid:99) ≤ ∞ (cid:88) t =0 − t/t mix = 2 / (1 − − /t mix ) ≈ t mix Therefore, Z jk = O ( t mix ) for any j, k ∈ Σ. Our analysis and bounds will be given as a functionof entries in the fundamental matrix Z , however observe that a bound on the mixing time alsoprovides a bound on the maximum entry of Z . For countably infinite state space Markov chains,we have an equivalent notion of mixing time, which we discuss in Section 5. ee, Ozdaglar, and Shah: Estimating Stationary Probability of Single State
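For concreteness, the quantities defined above can be computed numerically on a small chain. The following sketch (the 3-state chain is an illustrative assumption, not an example from the paper) computes $\pi$, the fundamental matrix $Z$, verifies Lemma 2 and the identity $\pi_i = 1/\mathbb{E}_i[T_i]$, and finds $t_{mix}$ by direct matrix powering.

```python
import numpy as np

# Transition matrix of a hypothetical 3-state chain (rows sum to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
n = P.shape[0]

# Stationary distribution: solve pi = pi P together with sum(pi) = 1.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
pi = np.linalg.lstsq(A, np.append(np.zeros(n), 1.0), rcond=None)[0]

# Z_jk = sum_t (P^t_jk - pi_k), computed via (I - P + 1 pi^T)^{-1} - 1 pi^T.
W = np.outer(np.ones(n), pi)
Z = np.linalg.inv(np.eye(n) - P + W) - W

# Expected hitting times E_j[T_k]: solve (I - Q) h = 1 over the states != k.
def hitting_times(k):
    idx = [s for s in range(n) if s != k]
    h = np.linalg.solve(np.eye(n - 1) - P[np.ix_(idx, idx)], np.ones(n - 1))
    full = np.zeros(n)
    full[idx] = h
    return full  # full[k] = 0 by convention

# Lemma 2: E_j[T_k] = (Z_kk - Z_jk) / pi_k for j != k.
for k in range(n):
    h = hitting_times(k)
    assert all(abs(h[j] - (Z[k, k] - Z[j, k]) / pi[k]) < 1e-8
               for j in range(n) if j != k)

# Lemma 1(b): E_0[T_0] = 1 + sum_k P_0k E_k[T_0] = 1 / pi_0.
assert abs(1.0 / pi[0] - (1 + P[0] @ hitting_times(0))) < 1e-8

# d(t) = max_x ||P^t(x,.) - pi||_TV and t_mix = min{t : d(t) <= 1/4}.
def d(t):
    Pt = np.linalg.matrix_power(P, t)
    return max(0.5 * np.abs(Pt[x] - pi).sum() for x in range(n))

t_mix = next(t for t in range(1, 100) if d(t) <= 0.25)
```

The same routine applies to any finite irreducible chain, though for large state spaces the matrix inverse is exactly the global computation this paper's Monte Carlo method avoids.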
Figure 1: (a) M/M/1 Queue; (b) Magnet Markov chain. We introduce two classes of Markov chains which have similar local random walk properties, and yet can have arbitrarily different stationary probabilities, illustrating the difficulty of approximating stationary probabilities using Monte Carlo type algorithms when the Markov chain can mix poorly.
We give an example to illustrate why the mixing properties affect the achievable estimation accuracy for a Monte Carlo method which samples random walks. Consider the Markov chains shown in Figures 1(a) and 1(b). The state space of the Markov chain in Figure 1(a) is the positive integers (therefore countably infinite). It models the length of an M/M/1 queue, where $q$ is the probability that an arrival occurs before a departure. This is also equivalent to a biased random walk on $\mathbb{Z}_+$. In the Markov chain depicted by Figure 1(b), when $q_1$ and $q_2$ are less than one half, states in the left half of the chain are attracted to state 1, and states in the right half are attracted to state $n$. Since there are two opposite attracting states (1 and $n$), we call this Markov chain a "Magnet".

Consider the problem of estimating the stationary probability of state 1 in the Magnet Markov chain using sampled random walks. Due to the structure of the Magnet Markov chain, any random walk originating on the left will remain on the left with high probability. Therefore, the right portion of the Markov chain will only be explored with an exponentially small probability. The sampled local random walks starting from state 1 will behave effectively the same for both the M/M/1 queue and the Magnet Markov chain, illustrating the difficulty of even distinguishing between the two Markov chains, much less obtaining an accurate estimate. This illustrates an unavoidable challenge for any Monte Carlo algorithm.
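The near-indistinguishability of the two chains from local samples can be checked empirically. The sketch below is an illustrative simplification (both chains use a single bias parameter $q$, the walks start at the leftmost state, labeled 0, and the Magnet's left half is taken to be the first $n/2$ states); it collects truncated return times from both chains and observes that the truncated statistics are nearly identical even though the stationary distributions differ.

```python
import random

random.seed(0)

# Biased walk on {0, 1, 2, ...}: move right w.p. q, else left (reflect at 0).
# This mimics the M/M/1 queue-length chain.
def step_mm1(x, q):
    return x + 1 if random.random() < q else max(x - 1, 0)

def step_magnet(x, q, n=20):
    # Hypothetical Magnet on {0,...,n-1}: drift toward 0 on the left half,
    # toward n-1 on the right half.
    p_right = q if x < n // 2 else 1 - q
    return min(x + 1, n - 1) if random.random() < p_right else max(x - 1, 0)

def truncated_return_time(step, theta, q=0.3):
    # One sample of min(T_0, theta) for a walk started at state 0.
    x, t = step(0, q), 1
    while x != 0 and t < theta:
        x, t = step(x, q), t + 1
    return t

theta, N = 64, 2000
avg_mm1 = sum(truncated_return_time(step_mm1, theta) for _ in range(N)) / N
avg_mag = sum(truncated_return_time(step_magnet, theta) for _ in range(N)) / N
```

With $q = 0.3$, a walk from state 0 crosses into the Magnet's right half with only exponentially small probability, so the two empirical averages agree closely despite the chains having very different global behavior.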
3. Basic Algorithm and Main Results
Recall that our algorithm is based on the characterization of stationary probability as given by Lemma 1(b): $\pi_i = 1/\mathbb{E}_i[T_i]$. The EstimateTruncatedRT$(i, N, \theta)$ method, which forms the basic unit of our algorithm, estimates $\pi_i$ by collecting $N$ independent samples of the random variable $\min(T_i, \theta)$. Each sample is obtained by simulating the Markov chain starting from state $X_0 = i$ and stopping at the first time $t \ge 1$ such that $X_t = i$ or $t = \theta$. The sample average is used to approximate $\mathbb{E}_i[T_i]$. As the number of samples and $\theta$ go to infinity, the estimate converges almost surely to $\pi_i$, due to the strong law of large numbers and positive recurrence of the Markov chain.

EstimateTruncatedRT$(i, N, \theta)$:
1. Simulate $N$ independent realizations of the Markov chain with $X_0 = i$. For each sample $s \in \{1, 2, \dots, N\}$, let $t_s = \min\{t \ge 1 : X_t = i \text{ or } t = \theta\}$, distributed as $\min(T_i, \theta)$.
2. Return sample average $\hat{T}_i$, fraction truncated $\hat{p}$, and estimate $\hat{\pi}_i$:
\[ \hat{T}_i = \frac{1}{N} \sum_{s=1}^{N} t_s, \qquad \hat{p} = \frac{\text{No. samples truncated}}{N}, \qquad \hat{\pi}_i = \frac{1}{\hat{T}_i}. \]

It is not clear a priori what choice of $N$ and $\theta$ is sufficient to guarantee a good estimate while not costing too much computation. The IteratedRefinement$(i, \epsilon, \alpha)$ method iteratively improves the estimate by increasing $\theta$ and $N$. In each iteration, it doubles $\theta$ and increases $N$ according to Chernoff's bound to ensure that with probability $1 - \alpha$, for all iterations $k$, $\hat{T}^{(k)}_i \in (1 \pm \epsilon)\, \mathbb{E}_i[\min(T_i, \theta^{(k)})]$. This allows us to use the estimate from the previous iteration to determine how many samples are sufficient for the current threshold $\theta^{(k)}$.

IteratedRefinement$(i, \epsilon, \alpha)$:
1. $k = 1$, $\theta^{(1)} = 2$, $N^{(1)} = \lceil 6(1+\epsilon) \ln(8/\alpha)/\epsilon^2 \rceil$
2. $(\hat{T}^{(k)}_i, \hat{p}^{(k)}, \hat{\pi}^{(k)}_i)$ = EstimateTruncatedRT$(i, N^{(k)}, \theta^{(k)})$
3. $\theta^{(k+1)} = 2\theta^{(k)}$, $N^{(k+1)} \leftarrow \left\lceil 3(1+\epsilon)\, \theta^{(k+1)} \ln(4\theta^{(k+1)}/\alpha) \,/\, \hat{T}^{(k)}_i \epsilon^2 \right\rceil$, increment $k$
4. Repeat from line 2.

The estimate $\hat{\pi}^{(k)}_i$ is larger than $\pi_i/(1+\epsilon)$ for all iterations $k$ with high probability. $\mathbb{E}_i[\hat{T}^{(k)}_i] = \mathbb{E}_i[\min(T_i, \theta^{(k)})]$ increases with each iteration; thus the expected estimate decreases in each iteration, converging to $\pi_i$ from above as $N^{(k)}$ and $\theta^{(k)}$ increase. In this section, for the sake of clarity, we present our results only for finite state space Markov chains; the extension of the analysis to countable state spaces will be discussed in Section 5 and presented in Theorems 8 and 13. Theorem 1 provides error bounds which show the convergence rate of the estimator $\hat{\pi}^{(k)}_i$, and it also upper bounds the computation cost of the algorithm for the first $k$ iterations.

Theorem 1.
For an irreducible finite state space Markov chain, for any $i \in \Sigma$, with probability greater than $1 - \alpha$, for all iterations $k$,
\[ \left(1 - \epsilon - 4 \cdot 2^{-2^k/2H_i}\, Z_{\max}(i)\right) \hat{\pi}^{(k)}_i \le \left(1 - \epsilon - 2 Z_{\max}(i)\, P_i\!\left(T_i > \theta^{(k)}\right)\right) \hat{\pi}^{(k)}_i \le \pi_i \le (1+\epsilon)\, \hat{\pi}^{(k)}_i, \]
and the number of random walk steps taken by the algorithm within the first $k$ iterations is bounded by
\[ \tilde{O}\left(\frac{2^k \ln(1/\alpha)}{\epsilon^2}\right). \]
Corollary 1 directly follows from rearranging Theorem 1, providing a bound on the cost of iterating the algorithm until the estimate is $\epsilon$-multiplicative close to the true value $\pi_i$.

Corollary 1.
For a finite state space Markov chain $\{X_t\}$, for any $i \in \Sigma$, with probability greater than $1 - \alpha$, the estimate $\hat{\pi}^{(k)}_i$ produced by the algorithm IteratedRefinement$(i, \epsilon/2, \alpha)$ will satisfy $\pi_i \in (1 \pm \epsilon)\hat{\pi}^{(k)}_i$ for all
\[ k \ge \log_2\left(2 H_i \log_2\left(\frac{8\, Z_{\max}(i)}{\epsilon}\right)\right) = O\left(\ln\left(\frac{t_{mix}}{\pi_i} \ln\left(\frac{t_{mix}}{\epsilon}\right)\right)\right). \]
The number of random walk steps simulated by the algorithm until $\pi_i \in (1 \pm \epsilon)\hat{\pi}^{(k)}_i$ is bounded by
\[ \tilde{O}\left(\frac{H_i \ln(1/\alpha)}{\epsilon^2} \ln\left(\frac{Z_{\max}(i)}{\epsilon}\right)\right) = \tilde{O}\left(\frac{t_{mix} \ln(1/\alpha)}{\pi_i\, \epsilon^2}\right). \]
The cost of our algorithm is comparable to standard Monte Carlo methods, as the length of each random walk needed to guarantee convergence to stationarity is $O(t_{mix} \ln(1/\epsilon\pi_i))$, and the number of samples needed to guarantee concentration by Chernoff's bound is $O(\ln(1/\alpha)/\pi_i \epsilon^2)$. Though this gives us an understanding of the convergence rate, we may not know $H_i$, $Z_{\max}(i)$, or $t_{mix}$ in the general case, and thus it does not provide practical guidance for how long to run the algorithm.

One intuitive termination criterion is to stop the algorithm when the fraction of samples truncated is less than some $\delta \in (0,1)$, i.e., $\hat{p}^{(k)} < \delta$.

Theorem 2.
With probability greater than $1 - \alpha$, for all $k$ such that $\hat{p}^{(k)} < \delta$,
\[ \left| \frac{\hat{\pi}^{(k)}_i - \pi_i}{\hat{\pi}^{(k)}_i} \right| \le \epsilon\left(1 + \frac{2 Z_{\max}(i)}{3}\right) + 2\delta\, Z_{\max}(i) \le \epsilon + 4\ln(2)\, t_{mix}\left(\delta + \frac{\epsilon}{3}\right). \]
With probability greater than $1 - \alpha$, the number of random walk steps used by the algorithm before satisfying $\hat{p}^{(k)} < \delta$ is bounded above by
\[ \tilde{O}\left(\frac{H_i \ln(1/\alpha) \ln(1/\delta)}{\epsilon^2}\right) = \tilde{O}\left(\frac{t_{mix} \ln(1/\alpha) \ln(1/\delta)}{\pi_i\, \epsilon^2}\right). \]
Theorem 2 indicates that the error is a function of both $\epsilon$ and $\delta$. The number of samples $N^{(k)}$ in each iteration of IteratedRefinement is chosen such that with high probability we obtain an $\epsilon$ approximation of the mean $\mathbb{E}[\hat{T}^{(k)}_i]$. Therefore, even if we were to run the algorithm until $\hat{p}^{(k)} = 0$, we would still only be able to guarantee an accuracy of $O(\epsilon t_{mix})$ when the algorithm terminates, since the number of samples $N^{(k)}$ is only sufficient to guarantee less than $\epsilon$ error for the sample average estimates. Therefore, in the remainder of the paper, we choose $\delta$ to be on the same order as $\epsilon$, specifically $\delta = 2\epsilon/3$. This attains an $O(\epsilon t_{mix})$ multiplicative error bound, whereas Corollary 1 reaches an $\epsilon$-close estimate. The difference is due to the fact that Corollary 1 assumes we are able to determine when $\pi_i \in (1 \pm \epsilon)\hat{\pi}^{(k)}_i$, while the termination condition analyzed for Theorem 2 must rely only on measured quantities. The algorithm must determine how many samples to draw and how far to truncate the random walks without knowledge of $t_{mix}$, using $\hat{p}$ as an estimate for $P_i(T_i > \theta)$.

The example of distinguishing between the M/M/1 queue and the Magnet Markov chain in Figure 1 suggests why guaranteeing $\epsilon$ error without further knowledge of the mixing properties is impossible.
Since the behavior in terms of sampled random walks from state 1 looks nearly identical for both Markov chains, an algorithm which knows nothing about the mixing properties will perform the same on both Markov chains, which actually have different stationary probabilities. In Markov chains which mix poorly, such as the Magnet Markov chain, there may be states which look like they have a high stationary probability within the local neighborhood yet may not actually have large stationary probability globally.

Next we proceed to show that in a setting where we do not need an $O(\epsilon t_{mix})$-close estimate for states with stationary probability less than some $\Delta \in (0,1)$, the computation cost can be bounded by $O(\ln(1/\alpha)/\epsilon^2 \Delta)$ independently of $t_{mix}$, while still maintaining the same error bound for states with stationary probability larger than $\Delta$.

TerminationCriteria$(\hat{\pi}_i, \hat{p}, \Delta, \epsilon)$: Stop when either (a) $\hat{\pi}_i < \Delta/(1+\epsilon)$ or (b) $\hat{p} < 2\epsilon/3$.

The quantity $(1+\epsilon)\hat{\pi}_i$ is an upper bound on $\pi_i$ with high probability; thus when condition (a) is satisfied, we can safely conclude that $\pi_i < \Delta$ with high probability. The termination condition (b) is chosen according to the fact that we can upper bound the error as a function of $P_i(T_i > \theta^{(k)})$. Therefore, we can use the fraction of samples truncated $\hat{p}$ to estimate this quantity, since $\hat{p}$ is a binomial random variable with mean $P_i(T_i > \theta^{(k)})$. When condition (b) is satisfied, we can safely conclude that $P_i(T_i > \theta^{(k)}) < \epsilon$ with high probability, thus implying that the percentage error of the estimate $\hat{\pi}^{(k)}_i$ is upper bounded by $O(\epsilon Z_{\max}(i))$.
BasicAlgorithm$(i, \epsilon, \alpha, \Delta)$: Run IteratedRefinement$(i, \epsilon, \alpha)$ until TerminationCriteria$(\hat{\pi}^{(k)}_i, \hat{p}^{(k)}, \Delta, \epsilon)$ is satisfied, at which point the algorithm outputs the final estimate $\hat{\pi}^{(k)}_i$.

For the remainder of the paper, we assume that this TerminationCriteria is used. It is easy to verify and does not require prior knowledge of the Markov chain to implement, as it only depends on the chosen parameters $\Delta$ and $\epsilon$. We highlight and discuss the benefits and limitations of using the suggested termination criterion.
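A minimal executable sketch of BasicAlgorithm follows, assuming the Chernoff-style sample sizes and the $2\epsilon/3$ threshold reconstructed above (these constants, and the toy 3-state chain with true $\pi_0 = 9/28 \approx 0.3214$, are illustrative assumptions, not necessarily the authors' exact choices).

```python
import math
import random

random.seed(1)

# Hypothetical 3-state chain, stored as adjacency lists of (state, prob) pairs.
P = {0: [(0, 0.5), (1, 0.3), (2, 0.2)],
     1: [(0, 0.2), (1, 0.6), (2, 0.2)],
     2: [(0, 0.3), (1, 0.3), (2, 0.4)]}

def step(x):
    r, acc = random.random(), 0.0
    for y, p in P[x]:
        acc += p
        if r < acc:
            return y
    return P[x][-1][0]

def estimate_truncated_rt(i, N, theta):
    # N samples of min(T_i, theta); returns (T_hat, p_hat, pi_hat).
    total, truncated = 0, 0
    for _ in range(N):
        x, t = step(i), 1
        while x != i and t < theta:
            x, t = step(x), t + 1
        total += t
        truncated += (x != i)          # sample hit the truncation threshold
    T_hat = total / N
    return T_hat, truncated / N, 1.0 / T_hat

def basic_algorithm(i, eps, alpha, Delta):
    theta = 2
    N = math.ceil(6 * (1 + eps) * math.log(8 / alpha) / eps ** 2)
    while True:
        T_hat, p_hat, pi_hat = estimate_truncated_rt(i, N, theta)
        if pi_hat < Delta / (1 + eps):   # (a): conclude pi_i < Delta whp
            return pi_hat, "below-threshold"
        if p_hat < 2 * eps / 3:          # (b): truncation bias is small whp
            return pi_hat, "converged"
        theta *= 2                       # double the truncation threshold
        N = math.ceil(3 * (1 + eps) * theta * math.log(4 * theta / alpha)
                      / (T_hat * eps ** 2))

pi_hat, reason = basic_algorithm(0, eps=0.2, alpha=0.1, Delta=0.01)
```

The estimate converges to $\pi_0$ from above: early iterations heavily truncate the return times and overestimate $\pi_0$, and the loop stops once few samples are truncated.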
Theorem 3.
With probability greater than $1 - \alpha$, the following three statements hold:
(a) If $\hat{\pi}^{(k)}_i < \Delta/(1+\epsilon)$ for any $k$, then $\pi_i < \Delta$ with high probability.
(b) For all $k$ such that $\hat{p}^{(k)} < 2\epsilon/3$,
\[ \left| \frac{\hat{\pi}^{(k)}_i - \pi_i}{\hat{\pi}^{(k)}_i} \right| \le \epsilon\,(2 Z_{\max}(i) + 1) \le \epsilon\,(4\ln(2)\, t_{mix} + 1). \]
(c) The number of random walk steps used by the algorithm before satisfying either $\hat{\pi}^{(k)}_i < \Delta/(1+\epsilon)$ or $\hat{p}^{(k)} < 2\epsilon/3$ is bounded above by
\[ \tilde{O}\left(\frac{\ln(1/\alpha)}{\epsilon^2} \min\left(H_i, \frac{1}{\epsilon\Delta}\right)\right) = \tilde{O}\left(\frac{\ln(1/\alpha)}{\epsilon^2} \min\left(\frac{t_{mix}}{\pi_i}, \frac{1}{\epsilon\Delta}\right)\right). \]
For "important" states $i$ such that $\pi_i > \Delta$, Theorem 3(a) states that with high probability $\hat{\pi}^{(k)}_i \ge \Delta/(1+\epsilon)$ for all $k$, and thus the algorithm will terminate at criterion (b), $\hat{p}^{(k)} < 2\epsilon/3$, which then guarantees an $O(\epsilon t_{mix})$ multiplicative bound on the estimate error. Observe that the computation cost of the algorithm is upper bounded by $\tilde{O}(1/\epsilon^3\Delta)$, which only depends on the algorithm parameters and is independent of the particular properties of the Markov chain. The cost is also bounded by $\tilde{O}(t_{mix}/\pi_i\epsilon^2)$, which indicates that when the mixing time is smaller, the algorithm also terminates earlier for the same chosen algorithm parameters.

In some settings we may want to choose the parameters $\Delta$ and $\epsilon$ as a function of the Markov chain, whether as a function of the state space or the mixing properties. In settings where we have limited knowledge of the size of the state space or the mixing properties of the Markov chain, the algorithm can still be implemented as a heuristic. Since the estimates are an upper bound with high probability, we can observe and track the progress of the algorithm in each iteration as the estimate converges to the solution from above.
The algorithm presented above has a systematic bias due to the truncation. We present a second estimator which corrects for the bias under some conditions. In fact, in our basic simulations, we show that it corrects for the bias even for states with small stationary probability, on which the original algorithm performs poorly, since it terminates at $\hat{\pi}_i < \Delta/(1+\epsilon)$. Thus the surprising aspect of the bias corrected estimate is that it can obtain a good estimate at the same cost. This bias corrected estimate is based upon the characterization of stationary probability given in Lemma 1(a), since the average number of visits to state $i$ along the sampled paths is given by $(1 - \hat{p})$.

BiasCorrectedAlgorithm$(i, \epsilon, \alpha, \Delta)$: Run IteratedRefinement$(i, \epsilon, \alpha)$ until TerminationCriteria$(\hat{\pi}^{(k)}_i, \hat{p}^{(k)}, \Delta, \epsilon)$ is satisfied, at which point the algorithm outputs the final estimate $\tilde{\pi}_i = (1 - \hat{p})/\hat{T}_i$.

While $\hat{\pi}_i$ is an upper bound of $\pi_i$ with high probability due to its use of truncation, $\tilde{\pi}_i$ is guaranteed to be neither an upper nor a lower bound of $\pi_i$. Theorem 4 provides error bounds for the bias corrected estimator $\tilde{\pi}_i$.

Theorem 4.
For an irreducible finite state space Markov chain, for any $i \in \Sigma$, with probability greater than $1 - \alpha$, for all iterations $k$ such that $P_i(T_i > \theta^{(k)}) < 1/2$,
\[ \left| \frac{\tilde{\pi}^{(k)}_i - \pi_i}{\tilde{\pi}^{(k)}_i} \right| \le \frac{(1+\epsilon)\, 2 \cdot 2^{-\theta^{(k)}/2H_i}}{1 - \epsilon} \max\left(2 Z_{\max}(i) - 1,\, 1\right) + \frac{2\epsilon}{1 - \epsilon}. \]
Theorem 4 shows that with high probability, the percentage error between $\tilde{\pi}^{(k)}_i$ and $\pi_i$ decays exponentially in $\theta^{(k)}$, up to an additive $O(\epsilon)$ term. The condition $P_i(T_i > \theta^{(k)}) < 1/2$ is needed so that $\hat{p}$ concentrates around $P_i(T_i > \theta^{(k)})$; specifically, we require it so that $(1 - \hat{p}^{(k)})$ concentrates within a $(1 \pm \epsilon)$ multiplicative interval around $(1 - P_i(T_i > \theta^{(k)}))$. If $P_i(T_i > \theta^{(k)})$ is too large and close to 1, then a majority of the sample random walks are truncated, and we cannot guarantee good multiplicative concentration of $(1 - \hat{p})$. The equivalent extension of this theorem to countable state space Markov chains is presented in Theorem 9. Although the improvement of the bias corrected estimator is not clear from the theoretical error bounds, we will show simulations of computing PageRank, in which $\tilde{\pi}_i$ is a significantly closer estimate of $\pi_i$ than $\hat{\pi}_i$, especially for states with small stationary probability.

We can also simultaneously learn about other states in the Markov chain through these random walks from $i$. We refer to state $i$ as the anchor state. We will extend our algorithm to obtain estimates for the stationary probability of states within a subset $J \subseteq \Sigma$, which is given as an input to the algorithm. We refer to $J$ as the set of observer states. We estimate the stationary probability of any state $j \in J$ using the characterization given in Lemma 1(a). We modify the algorithm to keep track of how many times each state in $J$ is visited along the sample paths. The estimate $\tilde{\pi}_j$ is the fraction of visits to state $j$ along the sampled paths. We replace the subroutine in step 2 of the IteratedRefinement function with
EstimateTruncatedRT-Multi$(i, J, N^{(k)}, \theta^{(k)})$, and we use the same TerminationCriteria previously defined.
EstimateTruncatedRT-Multi$(i, J, N, \theta)$:
1. Sample $N$ independent truncated return paths to $i$: $s_k \sim \min(T_i, \theta)$ for $k \in \{1, 2, \dots, N\}$.
2. Compute sample average $\hat{T}_i$ and fraction truncated $\hat{p}$:
\[ \hat{T}_i = \frac{1}{N} \sum_{k=1}^{N} s_k, \qquad \hat{p} = \frac{\text{No. samples truncated}}{N}. \]
3. For each $j \in J$, let the estimate $\tilde{\pi}_j$ be computed as
\[ f_k(j) = \sum_{r=1}^{\theta^{(k)}} \mathbb{1}\{X_r = j\}\, \mathbb{1}\{r \le T_i\}, \qquad \hat{F}^{(k)}_j = \frac{1}{N^{(k)}} \sum_{k=1}^{N^{(k)}} f_k(j), \qquad \tilde{\pi}^{(k)}_j = \frac{\hat{F}^{(k)}_j}{\hat{T}^{(k)}_i}. \]
Since the IteratedRefinement method sets the parameters $N, \theta$ independently of the states $j$ and their estimates, the error bounds for $\pi_j$ will be looser. The number of samples is only enough to guarantee that $\tilde{\pi}^{(k)}_j$ is an additive approximation of $\mathbb{E}_i[\hat{F}^{(k)}_j]/\mathbb{E}_i[\min(T_i, \theta^{(k)})]$. In addition, the effect of truncation is no longer as clear, since the frequency of visits to state $j$ along a sample return path from state $i$ to state $i$ can be distributed non-uniformly along the sample path. Therefore, the estimate cannot be guaranteed to be either an upper or a lower bound. Theorem 5 bounds the error of the estimates $\tilde{\pi}^{(k)}_j$ for states $j \neq i$. Due to the looser additive concentration guarantees for $\hat{F}^{(k)}_j$, Theorem 5 provides an additive error bound rather than a bound on the percentage error.

Theorem 5.
For an irreducible finite state space Markov chain, for any $i, j \in \Sigma$ such that $j \neq i$, with probability greater than $1 - \alpha$, for all iterations $k$,
\[ \left| \tilde{\pi}^{(k)}_j - \pi_j \right| \le (1+\epsilon)\, P_i(T_i > \theta^{(k)})\, Z_{\max}(j)\, \hat{\pi}^{(k)}_i + \epsilon\, \tilde{\pi}^{(k)}_j + \epsilon \le (1+\epsilon)\, 2 \cdot 2^{-\theta^{(k)}/2H_i}\, Z_{\max}(j)\, \hat{\pi}^{(k)}_i + \epsilon\, \tilde{\pi}^{(k)}_j + \epsilon. \]
Theorem 5 indicates that the accuracy of the estimator $\tilde{\pi}^{(k)}_j$ depends on both $H_i$ and $Z_{\max}(j)$, the mixing properties centered at states $i$ and $j$. In order for the error to be small, both the anchor state $i$ and the observer state $j$ must have reasonable mixing and connectivity properties within the Markov chain. It is not surprising that the error depends on the mixing properties related to both states, as the sample random walks are centered at state $i$, and the estimator consists of observing visits to state $j$. While the other theorems presented in this paper have equivalent results for countable state space Markov chains, Theorem 5 does not directly extend to a countably infinite state space Markov chain, because state $j$ can be arbitrarily far away from state $i$, such that random walks beginning at state $i$ rarely hit state $j$ before returning to state $i$.

This algorithm is simple to implement and is easy to parallelize. It requires only $O(|J|)$ space to keep track of the visits to each state in $J$, and a constant amount of space to keep track of the state of the random walk sample and running totals such as $\hat{p}^{(k)}$ and $\hat{T}^{(k)}_i$. For each random walk step, the computer only needs to fetch the local neighborhood of the current state, which is upper bounded by the maximum degree. Thus, at any given instance in time, the algorithm only needs to access a small neighborhood within the graph. Each sample is completely independent, thus the task can be distributed among independent machines.
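A self-contained sketch of the multi-state observer estimator follows, on a lazy ring chain whose stationary distribution is uniform (an illustrative assumption; the last state of a truncated path is not counted, a negligible simplification for large $\theta$).

```python
import random

random.seed(2)

n = 5  # lazy ring walk on {0,...,4}: stay w.p. 1/2, else move to a neighbor
def step(x):
    r = random.random()
    if r < 0.5:
        return x
    return (x + 1) % n if r < 0.75 else (x - 1) % n

def estimate_multi(i, J, N, theta):
    # N truncated return paths from anchor i; count visits to observers J.
    total, truncated = 0, 0
    visits = {j: 0 for j in J}
    for _ in range(N):
        x, t = step(i), 1
        while x != i and t < theta:
            if x in visits:
                visits[x] += 1
            x, t = step(x), t + 1
        total += t
        truncated += (x != i)
    T_hat = total / N
    pi_tilde = {j: visits[j] / N / T_hat for j in J}  # F_hat_j / T_hat
    return pi_tilde, 1.0 / T_hat                      # observers, anchor

pi_tilde, pi_hat_anchor = estimate_multi(0, J=[1, 2], N=4000, theta=200)
```

For the lazy ring, every state has stationary probability $1/5$, so both the anchor estimate and the observer estimates should land near $0.2$; the observer estimates are noisier, consistent with the additive (rather than multiplicative) guarantee of Theorem 5.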
In the process of sampling these random paths, the sequence of states along the path does not need to be stored or processed.

Consider implementing this over a distributed network, where the graph consists of the processors and the communication links between them. Each random walk over this network can be implemented by a message passing protocol. The anchor state $i$ initiates the random walk by sending a message to one of its neighbors chosen uniformly at random. Any state which receives the message forwards the message to one of its neighbors chosen uniformly at random. As the message travels over each link, it increments its internal counter. If the message ever returns to the anchor state $i$, then the message is no longer forwarded, and its counter provides a sample from $\min(T_i, \theta)$. When the counter exceeds $\theta$, then the message stops at the current state. After waiting for $\theta$ time steps, the anchor state $i$ can compute the estimate of its stationary probability within this network, taking into consideration the messages which have returned to state $i$. In addition, each observer state $j \in J$ can keep track of the number of times any of the messages are forwarded to state $j$. At the end of the $\theta$ time steps, state $i$ can broadcast the total number of steps to all states $j \in J$ so that they can properly normalize to obtain final estimates for $\pi_j$.
4. Concentration Bounds
In order to analyze our algorithm, we first need to show that the statistics obtained from the random samples, specifically the values of $\hat{p}^{(k)}$, $\hat{T}^{(k)}_i$, and $\hat{F}^{(k)}_j$, concentrate around their means with high probability, based upon standard concentration results for sums of independent identically distributed random variables. These statistics are used in computing the estimates and determining the termination time of IteratedRefinement.

Recall that the iterations are not independent, since the number of samples at iteration $k$ depends on the estimate $\hat{T}^{(k-1)}_i$ at iteration $k-1$, which is itself a random variable. Therefore, in order to prove any result about iteration $k$, we must consider the distribution over values of $\hat{T}^{(k-1)}_i$ from the previous iteration. The following Lemmas 3 to 6 use iterative conditioning to show concentration bounds that hold for all iterations $k$ simultaneously with probability greater than $1 - \alpha$. Lemma 3 shows concentration of $\hat{T}^{(k)}_i$, which directly implies concentration of the estimate $\hat{\pi}_i$ as well as $N^{(k)}$.

Lemma 3.
For every $k \in \mathbb{Z}_+$,
\[ P_i\left( \bigcap_{h=1}^{k} \left\{ \hat{T}^{(h)}_i \in (1 \pm \epsilon)\, \mathbb{E}_i\!\left[\hat{T}^{(h)}_i\right] \right\} \right) \ge 1 - \alpha. \]
Proof of Lemma 3.
We sketch the proof here and leave the details to the Appendix. Let $A_h$ denote the event $\{\hat{T}^{(h)}_i \in (1 \pm \epsilon)\, \mathbb{E}_i[\hat{T}^{(h)}_i]\}$. As discussed earlier, $N^{(h)}$ is a random variable that depends on $\hat{T}^{(h-1)}_i$. However, conditioned on the event $A_{h-1}$, we can lower bound $N^{(h)}$ as a function of $\mathbb{E}_i[\hat{T}^{(h-1)}_i]$. Then we apply Chernoff's bound for independent identically distributed bounded random variables and use the fact that $\mathbb{E}_i[\hat{T}^{(h)}_i]$ is nondecreasing in $h$ to show that
\[ P_i(A_h \mid A_{h-1}) \ge 1 - \frac{\alpha}{2^{h+1}} \quad \text{for all } h. \]
Since iteration $h$ depends on the outcome of previous iterations only through the variable $\hat{T}^{(h-1)}_i$, the event $A_h$ is conditionally independent of $A_{h'}$ for $h' < h - 1$ given $A_{h-1}$. Therefore,
\[ P_i\left( \bigcap_{h=1}^{k} A_h \right) = P_i(A_1) \prod_{h=2}^{k} P_i(A_h \mid A_{h-1}). \]
We combine these two insights to complete the proof. $\square$
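The flavor of these per-iteration bounds can be checked numerically. The sketch below illustrates the additive concentration used for a binomial sample mean such as $\hat{p}$: by Hoeffding's inequality, $P(|\hat{p} - p| > \epsilon/3) \le 2\exp(-2N(\epsilon/3)^2)$, so the sample size below drives the failure probability under a per-iteration budget $\alpha_h$ (the specific constants are illustrative assumptions, not the paper's exact choices).

```python
import math
import random

random.seed(4)

eps, alpha_h, p = 0.2, 0.05, 0.15

# Smallest N with 2 * exp(-2 N (eps/3)^2) <= alpha_h (Hoeffding).
N = math.ceil(math.log(2 / alpha_h) / (2 * (eps / 3) ** 2))

# Empirical failure rate of the event {|p_hat - p| <= eps/3}.
trials, failures = 500, 0
for _ in range(trials):
    p_hat = sum(random.random() < p for _ in range(N)) / N
    failures += abs(p_hat - p) > eps / 3
```

Because Hoeffding's bound is conservative, the observed failure rate is typically far below the budget $\alpha_h$; the iterative-conditioning argument in the lemmas splits such budgets geometrically across iterations so their sum stays below $\alpha$.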
Lemmas 4 to 6 also use the multiplicative concentration of $\hat{T}^{(k)}_i$ in order to lower bound the number of samples in each iteration. Their proofs are similar to the proof sketch given for Lemma 3, except that there are two events per iteration to consider. Conditioning on the events that $\hat{p}^{(h-1)} \in P_i(T_i > \theta^{(h-1)}) \pm \epsilon/3$ and $\hat{T}^{(h-1)}_i \in (1 \pm \epsilon)\, \mathbb{E}_i[\hat{T}^{(h-1)}_i]$, we compute the probability that $\hat{p}^{(h)} \in P_i(T_i > \theta^{(h)}) \pm \epsilon/3$ and $\hat{T}^{(h)}_i \in (1 \pm \epsilon)\, \mathbb{E}_i[\hat{T}^{(h)}_i]$ using Chernoff's bound and the union bound. Lemma 4 shows additive concentration of $\hat{p}^{(h)}$. It is used to prove that when the algorithm terminates at condition (b), with high probability, $P_i(T_i > \theta^{(h)}) < \epsilon$, which is used to upper bound the estimation error.

Lemma 4.
For every $k \in \mathbb{Z}_+$,
\[ P_i\left( \bigcap_{h=1}^{k} \left\{ \hat{p}^{(h)} \in P_i(T_i > \theta^{(h)}) \pm \frac{\epsilon}{3} \right\} \cap \bigcap_{h=1}^{k} \left\{ \hat{T}^{(h)}_i \in (1 \pm \epsilon)\, \mathbb{E}_i\!\left[\hat{T}^{(h)}_i\right] \right\} \right) \ge 1 - \alpha. \]
Lemma 5 gives a multiplicative concentration result for $(1 - \hat{p}^{(k)})$, which is used in the analysis of the estimate $\tilde{\pi}^{(k)}_i$.

Lemma 5.
Let $k_0$ be such that $P_i(T_i > \theta^{(k_0)}) < 1/2$. For every $k \ge k_0$,
\[ P_i\left( \bigcap_{h=k_0}^{k} \left\{ (1 - \hat{p}^{(h)}) \in (1 \pm \epsilon)\left(1 - P_i(T_i > \theta^{(h)})\right) \right\} \cap \bigcap_{h=1}^{k} \left\{ \hat{T}^{(h)}_i \in (1 \pm \epsilon)\, \mathbb{E}_i\!\left[\hat{T}^{(h)}_i\right] \right\} \right) \ge 1 - \alpha. \]
Lemma 6 is used in the analysis of the estimate $\tilde{\pi}^{(k)}_j$ for $j \neq i$. It guarantees that $\hat{F}^{(k)}_j$ is within an additive value of $\epsilon\, \mathbb{E}_i[\hat{T}^{(k)}_i]$ around its mean. This allows us to show that the ratio between $\hat{F}^{(k)}_j$ and $\hat{T}^{(k)}_i$ is within an additive $\epsilon$ error around the ratio of their respective means. We are not able to obtain a small multiplicative error bound on $\hat{F}^{(k)}_j$ because we do not use any information from state $j$ to choose the number of samples $N^{(k)}$. $\mathbb{E}_i[\hat{F}^{(k)}_j]$ can be arbitrarily small compared to $\mathbb{E}_i[\hat{T}^{(k)}_i]$, so we may not have enough samples to estimate $\mathbb{E}_i[\hat{F}^{(k)}_j]$ closely.

Lemma 6.
For every $k \in \mathbb{Z}_+$,
\[ P_i\left( \bigcap_{h=1}^{k} \left\{ \hat{F}^{(h)}_j \in \mathbb{E}_i\!\left[\hat{F}^{(h)}_j\right] \pm \epsilon\, \mathbb{E}_i\!\left[\hat{T}^{(h)}_i\right] \right\} \cap \bigcap_{h=1}^{k} \left\{ \hat{T}^{(h)}_i \in (1 \pm \epsilon)\, \mathbb{E}_i\!\left[\hat{T}^{(h)}_i\right] \right\} \right) \ge 1 - \alpha. \]
5. Exponential Decay of Return Times
In this section, we discuss the error that arises due to truncating the random walks at threshold $\theta$. We show that the tail of the distribution of the return time to state $i$ decays exponentially as a function of the truncation parameter $\theta$. This is the key property which underlies the error and cost analysis of the algorithm. Intuitively, it means that the distribution over return times is concentrated around its mean, since it cannot have large probability at values far away from the mean. For finite state space Markov chains, this result is easy to show using the strong Markov property, as outlined by Aldous and Fill (1999), Chapter 2, Section 4.3.

Lemma 7. (Aldous and Fill 1999) Let Markov chain $\{X_t\}$ be defined on finite state space $\Sigma$. For any $i \in \Sigma$ and $t \in \mathbb{Z}_+$,
\[ P_i(T_i > t) \le 2 \cdot 2^{-t/2H_i}, \quad \text{where } H_i = \max_{j \in \Sigma} \mathbb{E}_j[T_i]. \]
Lemma 8 shows that since $P_i(T_i > t)$ decays exponentially in $t$, the bias due to truncation likewise decays exponentially as a function of $\theta^{(k)}$.

Lemma 8.
\[ \mathbb{E}_i[T_i] - \mathbb{E}_i[\hat{T}^{(k)}_i] = \sum_{t=\theta^{(k)}}^{\infty} P_i(T_i > t). \]
Since $T_i$ is a nonnegative random variable, and by the definition of $\hat{T}^{(k)}_i$,
\begin{align*}
\mathbb{E}_i[T_i] - \mathbb{E}_i[\hat{T}^{(k)}_i] &= \mathbb{E}_i[T_i] - \mathbb{E}_i[\min(T_i, \theta^{(k)})] \\
&= \sum_{t=0}^{\infty} P_i(T_i > t) - \sum_{t=0}^{\theta^{(k)}-1} P_i(T_i > t) \\
&= \sum_{t=\theta^{(k)}}^{\infty} P_i(T_i > t). \qquad \square
\end{align*}
Lemma 7 depends on the finite size of the state space. In order to obtain the same result for countable state space Markov chains, we use Lyapunov analysis techniques to prove that the tail of the distribution of $T_i$ decays exponentially for any state $i$ in any countable state space Markov chain that satisfies Assumption 1.

Assumption 1.
The Markov chain $\{X_t\}$ is irreducible. There exists a Lyapunov function $V : \Sigma \to \mathbb{R}_+$ and constants $\nu_{\max}, \gamma > 0$ and $b \ge 0$ that satisfy the following conditions:
(1) The set $B = \{x \in \Sigma : V(x) \le b\}$ is finite,
(2) For all $x, y \in \Sigma$ such that $P(X_{t+1} = y \mid X_t = x) > 0$, $|V(y) - V(x)| \le \nu_{\max}$,
(3) For all $x \in \Sigma$ such that $V(x) > b$, $\mathbb{E}[V(X_{t+1}) - V(X_t) \mid X_t = x] < -\gamma$.

At first glance, this assumption may seem very restrictive. But in fact, it is quite reasonable: by the Foster-Lyapunov criteria (see Theorem 16 in the Appendix), a countable state space Markov chain is positive recurrent if and only if there exists a Lyapunov function $V : \Sigma \to \mathbb{R}_+$ that satisfies conditions (1) and (3), as well as (2'): $\mathbb{E}[V(X_{t+1}) \mid X_t = x] < \infty$ for all $x \in \Sigma$. Assumption 1 has (2), which is a restriction of condition (2'). The implications of Assumption 1 are visualized in Figure 2. The existence of the Lyapunov function allows us to decompose the state space into sets $B$ and $B^c$ such that for all states $x \in B^c$, there is an expected decrease in the Lyapunov function in the next step or transition. Therefore, for all states in $B^c$, there is a negative drift towards set $B$. In addition, in any single step, the random walk cannot escape "too far". The Lyapunov function helps to impose a natural ordering over the state space that allows us to prove properties of the Markov chain. There have been many results that use Lyapunov analysis to give bounds on the stationary probabilities, return times, and distribution of return times as a function of the Lyapunov function (Hajek 1982, Bertsimas et al. 1998). Building upon results by Hajek, we prove the following lemma, which establishes that return times have exponentially decaying tails even for countable state space Markov chains, as long as they satisfy Assumption 1.

Figure 2
This illustrates the implication of Assumption 1, which uses a Lyapunov function to decompose thestate space into a finite region B and a region with negative drift. Lemma 9.
Let $\{X_t\}$ be an irreducible Markov chain satisfying Assumption 1. For any $i \in B$ and for all $k \in \mathbb{Z}_+$,
\[ P_i(T_i > k) \le 2 \cdot 2^{-k/R_i}, \quad \text{where } R_i = O\left( \frac{H^B_i\, e^{\eta \nu_{\max}}}{(1 - \rho)(e^{\eta \nu_{\max}} - \rho)} \right), \]
and $H^B_i$ is the maximal hitting time over the Markov chain with its state space restricted to the subset $B$. The scalars $\eta$ and $\rho$ are functions of $\gamma$ and $\nu_{\max}$ (see (38) in Appendix F).

Lemma 9 pertains to states $i \in B$, i.e., such that $V(i) \le b$. This is not restrictive, since for any state $k$ of interest such that $V(k) = b' > b$, we can define a new Lyapunov function $V'(\cdot)$ such that $V'(k) = b$, and $V'(j) = V(j)$ for all $j \neq k$. Then we define $B' = \{j \in \Sigma : V'(j) \le b\} = B \cup \{k\}$ and $\nu'_{\max} = \nu_{\max} + b' - b$. By extension, Assumption 1 holds for $V'(\cdot)$ with constants $\nu'_{\max}$, $\gamma$, and $b$.

The quantity $R_i$ in Lemma 9 for countable state space Markov chains plays the same role as $H_i$ in Lemma 7 for finite state space Markov chains. Thus, equivalent theorems for the countable state space setting are obtained by using Lemma 9 rather than Lemma 7. In the countable state space setting, $H_i$ and $Z_{\max}(i)$ are no longer well defined, since the maximum over an infinite set may not exist. However, we recall that $Z_{\max}(i) = O(\pi_i H_i)$, and thus our analysis of the algorithm for finite state space Markov chains extends to countable state space Markov chains by substituting $R_i$ for $H_i$ and $\pi_i R_i$ for $Z_{\max}(i)$.

This lemma leads to some interesting insights about the performance of our algorithm and the computation of stationary probabilities over large finite Markov chains. In some sense, the bound in Lemma 7 is not very tight as the size of the state space grows, since it takes a maximum over all states. However, Lemma 9 indicates that it is perhaps the mixing properties of the local neighborhood that matter the most.
In addition, Assumption 1 also lends insights intothe properties that become significant for a large finite state space Markov chain. In some sense, ee, Ozdaglar, and Shah: Estimating Stationary Probability of Single State if the large finite state space Markov chain mixes poorly, then it is kind of the notion of theMarkov chain growing to a limiting countably infinite state space Markov chain which is no longerpositive recurrent (i.e. becomes separate recurrence classes). In this setting, we argue that thetrue stationary distribution is no longer significant, and perhaps the significant quantity may beseparate local stationary distributions over each community or subset.
6. Analysis of Estimation Error
In this section, we provide bounds on the estimates produced by the algorithm. The omitted proofs can be found in the Appendix. Recall that the estimate π̂_i^(k) concentrates around 1/E_i[T̂_i^(k)], while π_i = 1/E_i[T_i]. Therefore we begin by characterizing the difference between the truncated mean return time and the original mean return time. Lemma 10 expresses the ratio between E_i[T̂_i^(k)] and E_i[T_i] as a function of P_i(T_i > θ^(k)) and the fundamental matrix Z. This lemma captures the error due purely to the truncation bias, and not to stochastic sampling error.

Lemma 10.
For an irreducible, positive recurrent Markov chain {X_t} with countable state space Σ and transition probability matrix P, and for any i ∈ Σ and k ∈ Z_+,

    1 − E_i[T̂_i^(k)] / E_i[T_i] = P_i(T_i > θ^(k)) Γ_i(θ^(k)),    (5)

where

    Γ_i(θ) := Σ_{q ∈ Σ\{i}} P_i(X_θ = q | T_i > θ) (Z_ii − Z_qi).    (6)

Proof of Lemma 10.
We divide the equation given in Lemma 8 by E_i[T_i], and then apply Bayes' rule, the law of total probability, and the Markov property:

    1 − E_i[T̂_i^(k)] / E_i[T_i]
      = (1 / E_i[T_i]) Σ_{k' = θ^(k)}^∞ P_i(T_i > k')    (7)
      = (P_i(T_i > θ^(k)) / E_i[T_i]) Σ_{k' = θ^(k)}^∞ P_i(T_i > k' | T_i > θ^(k))
      = (P_i(T_i > θ^(k)) / E_i[T_i]) Σ_{k' = θ^(k)}^∞ Σ_{q ∈ Σ\{i}} P_i(T_i > k' | X_{θ^(k)} = q, T_i > θ^(k)) P_i(X_{θ^(k)} = q | T_i > θ^(k))
      = P_i(T_i > θ^(k)) Σ_{q ∈ Σ\{i}} P_i(X_{θ^(k)} = q | T_i > θ^(k)) E_q[T_i] / E_i[T_i].    (8)

Finally, we use Lemma 2 to complete the proof. □

In order to understand the expression Γ_i(θ), observe that by the definition of Z,

    Z_ii − Z_qi = Σ_{t=0}^∞ (P_ii^(t) − P_qi^(t)) ≤ Σ_{t=0}^∞ d̄(t) ≤ Σ_{t=0}^∞ 2^(−⌊t/t_mix⌋) ≈ 2 t_mix.

Although Γ_i(θ) is globally upper bounded by 2 Z_max, and hence by O(t_mix), for all i and θ, it is in fact a convex combination of the quantities (Z_ii − Z_qi), where each term is weighted according to the probability that the random walk is at state q after θ steps. Because the random walks begin at state i, we expect this distribution to weight more heavily the states which are closer to i; in fact, the support of this distribution is limited to states within distance θ of i. Thus, Γ_i(θ) can be interpreted as a locally weighted variant of the mixing time, which we term the "local mixing time": it measures the time it takes for a random walk beginning at some distribution of states around i to reach stationarity at state i, where the size of the local neighborhood considered depends on θ.
In scenarios where these "local mixing times" for different states differ from the global mixing time, our algorithm, which utilizes local random walk samples, may provide tighter results than standard MCMC methods. This suggests an interesting mathematical inquiry: how can the local mixing times of a Markov chain be characterized, and in what settings are they homogeneous as opposed to heterogeneous?

This lemma also gives us a key insight into the termination conditions of the algorithm. Recall that termination condition (b) is satisfied when p̂^(k) < ε/3. Since p̂^(k) concentrates around P_i(T_i > θ^(k)), and π̂_i^(k) concentrates around 1/E_i[T̂_i^(k)], Lemma 10 indicates that when the algorithm stops at condition (b), the multiplicative error between π̂_i^(k) and π_i is approximately ε Γ_i(θ^(k)) ≤ 2ε Z_max(i). Theorem 6 states that for any irreducible, positive recurrent Markov chain, the estimate produced by the basic algorithm is an upper bound with high probability.
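The quantities discussed above can be simulated directly. The following minimal sketch (not the paper's exact algorithm) performs one sampling round: it draws truncated return-time samples through a user-supplied single-step oracle `step(state, rng)` (an assumed interface), and forms the truncated mean T̂, the truncated fraction p̂, the basic estimate π̂ = 1/T̂, and the bias corrected estimate π̃ = (1 − p̂)/T̂:

```python
import random

def truncated_return_estimates(step, i, theta, num_samples, rng):
    """One sampling round: simulate num_samples random walks from state i,
    each truncated at theta steps; return (pi_hat, pi_tilde, p_hat)."""
    total_time = 0      # sum of min(T_i, theta) over samples
    truncated = 0       # number of paths cut off before returning to i
    for _ in range(num_samples):
        state, t = step(i, rng), 1
        while state != i and t < theta:
            state, t = step(state, rng), t + 1
        total_time += t
        truncated += (state != i)
    T_hat = total_time / num_samples           # truncated mean return time
    p_hat = truncated / num_samples            # estimate of P_i(T_i > theta)
    return 1.0 / T_hat, (1.0 - p_hat) / T_hat, p_hat

# Toy two-state chain with P = [[0.9, 0.1], [0.5, 0.5]], so pi_0 = 5/6.
def step(s, rng):
    if s == 0:
        return 0 if rng.random() < 0.9 else 1
    return 0 if rng.random() < 0.5 else 1

pi_hat, pi_tilde, p_hat = truncated_return_estimates(
    step, 0, theta=100, num_samples=20000, rng=random.Random(0))
```

For this fast-mixing toy chain the truncation at θ = 100 is essentially never active, so both estimates should land close to π₀ = 5/6; the truncation bias quantified by Lemma 10 only becomes visible when θ is comparable to the return times.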
Theorem 6.
For an irreducible, positive recurrent, countable state space Markov chain, and for any i ∈ Σ, with probability greater than (1 − α), for all k, π̂_i^(k) ≥ π_i / (1 + ε).

Proof of Theorem 6.
This result follows directly from Lemma 3, which implies that π̂_i^(k) lies within the interval [1/((1 + ε) E_i[T̂_i^(k)]), 1/((1 − ε) E_i[T̂_i^(k)])]. Due to truncation, E_i[T̂_i^(k)] ≤ E_i[T_i]; thus the estimate is an upper bound for π_i with high probability. □

Theorem 7 upper bounds the percentage error between π̂_i^(k) and π_i.

Theorem 7.
For an irreducible finite state space Markov chain, for any i ∈ Σ, with probability greater than 1 − α, for all iterations k,

    | (π̂_i^(k) − π_i) / π̂_i^(k) | ≤ 2(1 − ε) P_i(T_i > θ^(k)) Z_max(i) + ε
                                  ≤ 2(1 − ε) 2^(−θ^(k)/2H_i) Z_max(i) + ε.

Corollary 2 follows directly from Theorem 7 and Lemma 4, allowing us to upper bound the error as a function of p̂^(k). This corollary motivates the choice of termination criteria, indicating that terminating when p̂^(k) ≤ ε/3 guarantees a multiplicative error of at most ε(2 Z_max(i) + 1).

Corollary 2.
With probability greater than 1 − α, for all iterations k,

    | (π̂_i^(k) − π_i) / π̂_i^(k) | ≤ ε (2 Z_max(i)/3 + 1) + 2 p̂^(k) Z_max(i).

We proceed to prove Theorem 7 by combining Lemmas 10, 3, 4, and 7.
Proof of Theorem 7.
By Theorem 6, it follows that (π_i − π̂_i^(k)) / π̂_i^(k) ≤ ε with high probability. By Lemma 3 and Lemma 10, it follows that with high probability,

    (π̂_i^(k) − π_i) / π̂_i^(k) = 1 − π_i / π̂_i^(k)
      ≤ 1 − (1 − ε) E_i[T̂_i^(k)] / E_i[T_i]    (9)
      = 1 − (1 − ε)(1 − P_i(T_i > θ^(k)) Γ_i(θ^(k)))
      = (1 − ε) P_i(T_i > θ^(k)) Γ_i(θ^(k)) + ε
      ≤ 2(1 − ε) P_i(T_i > θ^(k)) Z_max(i) + ε.

When the algorithm terminates at condition (b), p̂^(k) < ε/3. By Lemma 4, with probability greater than 1 − α, P_i(T_i > θ^(k)) ≤ p̂^(k) + ε/3 ≤ 2ε/3. Therefore, it follows that

    (π̂_i^(k) − π_i) / π̂_i^(k) ≤ ε (2 Z_max(i) + 1). □

Theorem 7 shows that the error bound decays exponentially in θ^(k), which doubles in each iteration. Thus, for every subsequent iteration k, the estimate π̂_i^(k) approaches π_i exponentially fast. The key part of the proof relies on the fact that the distribution of the return time T_i has an exponentially decaying tail, ensuring that T_i concentrates around its mean E_i[T_i]. Theorem 8 presents the error bounds for countable state space Markov chains, relying upon the exponentially decaying tail proved in Lemma 9.
For a Markov chain satisfying Assumption 1, for any i ∈ B, with probability greater than 1 − α, for all iterations k,

    | (π̂_i^(k) − π_i) / π̂_i^(k) | ≤ 2(1 − ε) ( 2^(−θ^(k)/R_i) / (1 − 2^(−1/R_i)) ) π_i + ε
                                  ≈ (2(1 − ε)/ln 2) π_i R_i 2^(−θ^(k)/R_i) + ε.

Proof of Theorem 8.
This proof follows the proof of Theorem 7. Substitute Lemma 9 into (7) to show that

    1 − E_i[T̂_i^(k)] / E_i[T_i] = (1/E_i[T_i]) Σ_{k' = θ^(k)}^∞ P_i(T_i > k') ≤ π_i ( 2 · 2^(−θ^(k)/R_i) / (1 − 2^(−1/R_i)) ).    (10)

Substitute (10) into (9) to complete the proof. □

Lemma 11 gives an expression for how the bias corrected estimate differs from the true value if we had access to the exact expected return time as well as the exact probability of truncation. By comparing Lemma 11 with Lemma 10, we gain some intuition for the difference in the expected error between the original estimator π̂_i and the bias corrected estimator π̃_i. Recall that π̃_i^(k) is equal to (1 − p̂^(k)) / T̂_i^(k). Lemma 11 gives an expression for the additive difference between (1 − P_i(T_i > θ^(k))) / E_i[T̂_i^(k)] and π_i.

Lemma 11.
For an irreducible, positive recurrent Markov chain {X_t} with countable state space Σ and transition probability matrix P, and for any i ∈ Σ and k ∈ Z_+,

    (1 − P_i(T_i > θ^(k))) / E_i[T̂_i^(k)] − π_i = (P_i(T_i > θ^(k)) / E_i[T̂_i^(k)]) (Γ_i(θ^(k)) − 1),    (11)

where

    Γ_i(θ) := Σ_{q ∈ Σ\{i}} P_i(X_θ = q | T_i > θ) (Z_ii − Z_qi).    (12)

Proof of Lemma 11.
This lemma follows directly from Lemma 10. □
In comparing Lemmas 10 and 11, we see that the additive estimation errors for the two estimators are almost the same, except for the factor (Γ_i(θ^(k)) − 1) in Lemma 11 as opposed to Γ_i(θ^(k)) in Lemma 10. Therefore, when Γ_i(θ^(k)) is small, we expect π̃_i to be a better estimate than π̂_i; when Γ_i(θ^(k)) is large, the two estimates will have approximately the same error. Theorem 9 presents an equivalent bound for the error of π̃_i when the algorithm is implemented on a countable state space Markov chain.

Theorem 9.
For a Markov chain satisfying Assumption 1, for any i ∈ B, with probability greater than 1 − α, for all iterations k such that P_i(T_i > θ^(k)) < 1/2,

    | (π̃_i^(k) − π_i) / π̃_i^(k) | ≤ ((1 + ε)/(1 − ε)) 2^(−θ^(k)/R_i) max( 2π_i / (1 − 2^(−1/R_i)), 1 ) + 2ε/(1 − ε)
                                  ≈ ((1 + ε)/(1 − ε)) 2^(−θ^(k)/R_i) max( (2/ln 2) π_i R_i, 1 ) + 2ε/(1 − ε).

As the number of samples increases, the sample mean converges to the true expected value, and thus the error bound stated in Lemma 12 shows the bias of the estimates π̃_j^(k).

Lemma 12.
For an irreducible, positive recurrent Markov chain {X_t} with countable state space Σ and transition probability matrix P, and for any i, j ∈ Σ and k ∈ Z_+,

    E_i[F̂_j^(k)] / E_i[T̂_i^(k)] − π_j = (P_i(T_i > θ^(k)) / E_i[T̂_i^(k)]) Σ_{q ∈ Σ\{i}} P_i(X_{θ^(k)} = q | T_i > θ^(k)) (Z_ij − Z_qj).    (13)

Compare Lemmas 10 and 12. Although they look similar, observe that if we bound (Z_ij − Z_qj) by 2 Z_max(j), it becomes clear that Lemma 12 depends on the mixing properties of the Markov chain with respect to both state i, through P_i(T_i > θ^(k)), and state j, through Z_max(j).

In this section, we discuss the tightness of our analysis. Lemmas 10, 11, and 12 give exact expressions for the estimation error that arises from the truncation of the sampled random walks. For a specific Markov chain, Theorems 7, 4, and 5 could be loose due to two approximations. First, 2 Z_max(i) could be a loose upper bound on Γ_i(θ^(k)). Second, Lemma 7 could be loose due to its use of the Markov inequality. Since π̂_i is greater than π_i with high probability, Theorem 7 is only useful when the upper bound is less than 1. We will show that for a specific family of graphs, namely clique graphs, our bound scales correctly as a function of θ^(k), H_i, and Z_max(i).

Consider a family of clique graphs G_n indexed by n ∈ Z_+, such that G_n is the clique graph over n vertices. For G_n, we can directly compute the hitting time H_i = n, the truncation probability P_i(T_i > θ^(k)) ≈ e^(−(θ^(k) − 1)/(n − 1)), and the values of the fundamental matrix Z, specifically Z_ii − Z_ji = π_i E_j[T_i] = (n − 1)/n and Z_max(i) = (n² − n + 1)/n². By substituting these into Lemma 10, it follows that the expected error is approximately given by

    1 − E_i[T̂_i^(k)] / E_i[T_i] = e^(−(θ^(k) − 1)/(n − 1)) · ((n − 1)/n).
By substituting these into Theorem 7, we show that with probability at least 1 − α,

    | (π̂_i^(k) − π_i) / π̂_i^(k) | ≤ 2(1 − ε) e^(−θ^(k) ln(2)/2n) ((n² − n + 1)/n²) + ε.

While Theorem 7 gives that the percentage error is upper bounded by O(e^(−θ^(k) ln(2)/2H_i) Z_max(i)), by Lemma 10 the percentage error of our algorithm on the clique graph is no better than Ω(e^(−θ^(k)/H_i) Z_max(i)). If we used these bounds to determine a threshold θ^(k) large enough to guarantee that the multiplicative error is bounded by ε, the threshold computed via Theorem 7 would only be a constant factor of 2/ln(2) larger than the threshold computed via Lemma 10. Our algorithm leverages the property that there is some concentration of measure, or "locality", over the state space. It performs worst when there is no concentration of measure, so that the random walks spread over the space quickly and take a long time to return, as for a random walk over the clique graph. For Markov chains that have strong concentration of measure, such as a biased random walk on the positive integers, the techniques given in Section 5 for analyzing countable state space Markov chains using Lyapunov functions will obtain tighter bounds, as compared to using the hitting time H_i and Z_max(i), since these quantities are computed as a worst case over all states, even if the random walk only stays within a local region around i.
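The clique computations above can be checked numerically. The sketch below (with illustrative helper names) builds the transition matrix of the random walk on the clique over n vertices, accumulates the fundamental matrix Z = Σ_t (P^t − Π) by summing matrix powers (the series converges geometrically, since the second eigenvalue of P is −1/(n − 1)), and verifies the identity Z_ii − Z_ji = π_i E_j[T_i] = (n − 1)/n:

```python
def mat_mul(A, B):
    """Plain dense matrix product (illustrative pure-Python helper)."""
    n = len(A)
    return [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

def clique_fundamental_matrix(n, terms=200):
    """Accumulate Z = sum_{t=0..terms} (P^t - Pi) for the clique random walk."""
    P = [[0.0 if r == c else 1.0 / (n - 1) for c in range(n)] for r in range(n)]
    Pi = [[1.0 / n] * n for _ in range(n)]   # each row is the (uniform) stationary distribution
    Pt = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]  # P^0 = I
    Z = [[0.0] * n for _ in range(n)]
    for _ in range(terms + 1):
        for r in range(n):
            for c in range(n):
                Z[r][c] += Pt[r][c] - Pi[r][c]
        Pt = mat_mul(Pt, P)                  # advance to the next matrix power
    return Z

n = 6
Z = clique_fundamental_matrix(n)
diff = Z[0][0] - Z[1][0]   # Z_ii - Z_ji, predicted to equal (n - 1)/n
```

Here diff should agree with (n − 1)/n = 5/6, matching π_i E_j[T_i] with π_i = 1/n and E_j[T_i] = n − 1; the difference Z_ii − Z_ji is insensitive to the exact summation convention used for Z, which makes it a robust quantity to test.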
7. Cost of Computation
In this section, we compute bounds on the computation cost of the algorithm. We first prove that the total number of random walk steps taken by the algorithm within the first k iterations scales with 2^k, which we recall is equal to θ^(k) by design.

Lemma 13.
With probability greater than (1 − α), the total number of random walk steps taken by the algorithm within the first k iterations is bounded by

    Õ( 2^k ln(1/α) / ε² ).

Proof of Lemma 13.
The total number of random walk steps (i.e., oracle calls) used by the algorithm over the first k iterations is equal to Σ_{h=1}^k N^(h) T̂_i^(h). We condition on the event that T̂_i^(h) is within a (1 ± ε) multiplicative interval around its mean for all h ∈ {1, . . . , k}, which occurs with probability greater than (1 − α) by Lemma 3. Because θ^(h) doubles in each iteration, E_i[T̂_i^(h−1)] ≤ E_i[T̂_i^(h)]. By combining these facts with the definition of N^(h), we obtain an upper bound as a function of θ^(h), α, ε, and E_i[T̂_i^(h)]. We suppress the insignificant factors (1 + ε) and (1 − ε). Since N^(h) T̂_i^(h) grows at least geometrically in h, the largest term of the summation dominates:

    Σ_{h=1}^k N^(h) T̂_i^(h) = Σ_{h=1}^k O( 2^h ln(2^h/α) / ε² ) = O( 2^k ln(2^k/α) / ε² ) = Õ( 2^k ln(1/α) / ε² ). □
The next two theorems provide upper bounds on the number of iterations until the termination criteria are satisfied. Theorem 10 asserts that with high probability, the algorithm terminates in finite time as a function of the parameters of the algorithm, independent of the size of the Markov chain state space. It is proved by showing that if θ^(k) > 3(1 + ε)/(ε∆), then either termination condition (a) or (b) must be satisfied.

Theorem 10.
For an irreducible, positive recurrent, countable state space Markov chain, and for any i ∈ Σ, with probability 1, the total number of iterations k before the algorithm satisfies either π̂_i^(k) < ∆/(1 + ε) or p̂^(k) < ε/3 is bounded above by

    log₂( 3(1 + ε) / (ε∆) ).

With probability greater than 1 − α, the computation time, i.e., the total number of random walk steps (oracle calls) used by the algorithm, is bounded above by

    Õ( ln(1/α) / (ε³ ∆) ).

Proof of Theorem 10.
By definition, T̂_i^(k) ≥ p̂^(k) θ^(k), which implies that p̂^(k) ≤ 1/(π̂_i^(k) θ^(k)). When θ^(k) ≥ 3(1 + ε)/(ε∆), if termination condition (a) has not been satisfied, then π̂_i^(k) ≥ ∆/(1 + ε). This implies that p̂^(k) ≤ 1/(π̂_i^(k) θ^(k)) ≤ ε/3, satisfying termination condition (b). This provides an upper bound on the number of iterations before the algorithm terminates, and we substitute it into Lemma 13 to complete the proof. □
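The doubling scheme and the two termination conditions analyzed above can be sketched end to end. In the sketch below, the sample-size schedule N^(k) is an assumed choice of the form O(ln(θ^(k)/α)/ε²) (the paper's exact constants are not reproduced), and `step(state, rng)` is again an assumed single-step oracle:

```python
import math
import random

def stationary_estimate(step, i, eps, delta, alpha, rng):
    """Doubling scheme: truncate at theta^(k) = 2^k, stopping when either
    (a) pi_hat < Delta/(1+eps) or (b) p_hat < eps/3 holds."""
    theta = 2
    while True:
        # assumed sample-size schedule, O(ln(theta/alpha)/eps^2)
        n = math.ceil(6.0 * math.log(2.0 * theta / alpha) / eps ** 2)
        total, cut = 0, 0
        for _ in range(n):
            s, t = step(i, rng), 1
            while s != i and t < theta:
                s, t = step(s, rng), t + 1
            total += t
            cut += (s != i)
        T_hat, p_hat = total / n, cut / n
        pi_hat = 1.0 / T_hat
        if pi_hat < delta / (1 + eps) or p_hat < eps / 3:
            return pi_hat, p_hat, theta
        theta *= 2                  # theta^(k) doubles each iteration

# Two-state chain with P = [[0.9, 0.1], [0.5, 0.5]]: pi_0 = 5/6.
def step(s, rng):
    if s == 0:
        return 0 if rng.random() < 0.9 else 1
    return 0 if rng.random() < 0.5 else 1

pi_hat, p_hat, theta = stationary_estimate(
    step, 0, eps=0.15, delta=0.02, alpha=0.2, rng=random.Random(1))
```

For this rapidly mixing toy chain the loop stops after only a few doublings via condition (b), and the returned π̂ sits slightly above π₀ = 5/6, consistent with the upper-bound property of Theorem 6.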
As can be seen from the proof, Theorem 10 does not utilize any information or properties of the Markov chain. Theorems 11 and 12 use the exponential tail bounds of Lemma 7 to prove that, as a function of the mixing properties, if θ^(k) is large enough, then the fraction of truncated samples is small and the error is bounded, so that one of the termination conditions is satisfied.

Theorem 11.
For an irreducible finite state space Markov chain, for any state i ∈ Σ such that π_i < (1 − ε)∆/(1 + ε), with probability greater than 1 − α,

    min{ k : π̂_i^(k) < ∆/(1 + ε) } ≤ log₂( 2 H_i log₂( 2 Z_max(i) (1 − (1 + ε)π_i / ((1 − ε)∆))^(−1) ) ).    (14)

Thus the total number of random walk steps (i.e., oracle calls) used by the algorithm before satisfying π̂_i^(k) < ∆/(1 + ε) is bounded above by

    Õ( (H_i ln(1/α)/ε²) ln( Z_max(i) (1 − π_i/∆)^(−1) ) ).
For k larger than the expression given in (14), we compute an upper bound on π̂_i^(k) by substituting into Theorem 7, and it follows that π̂_i^(k) < ∆/(1 + ε) with probability greater than 1 − α. □

Theorem 12.
For an irreducible finite state space Markov chain, for all i ∈ Σ, with probability greater than 1 − α,

    min{ k : p̂^(k) < δ } ≤ log₂( 2 H_i log₂( 6/(3δ − ε) ) ).    (15)

The total number of random walk steps (i.e., oracle calls) used by the algorithm before satisfying p̂^(k) < δ is bounded above by

    Õ( H_i ln(1/α) ln(6/(3δ − ε)) / ε² ).

Proof of Theorem 12.
For k larger than the expression specified in (15), we show that P_i(T_i > θ^(k)) < δ − ε/3, which by Lemma 4 implies that p̂^(k) < δ with probability greater than 1 − α. □

For states i in a Markov chain such that the maximal hitting time is small, the bounds given in Theorem 12 will be smaller than the general bound given in Theorem 10. Given a state i such that π_i < (1 − ε)∆/(1 + ε), our tightest bound is given by the minimum over the expressions from Theorems 10, 11, and 12. For a state i such that π_i > ∆, our tightest bound is given by the minimum between Theorems 10 and 12. Theorem 13 presents the equivalent result for countable state space Markov chains.

Theorem 13.
For a Markov chain satisfying Assumption 1, (a)
For any state i ∈ B such that π_i < (1 − ε)∆/(1 + ε), with probability greater than 1 − α, the total number of steps used by the algorithm before satisfying π̂_i^(k) < ∆/(1 + ε) is bounded above by

    Õ( (R_i ln(1/α)/ε²) ln( π_i R_i (1 − π_i/∆)^(−1) ) ).

(b) For all states i ∈ B, with probability greater than 1 − α, the total number of steps used by the algorithm before satisfying p̂^(k) < ε/3 is bounded above by

    Õ( R_i ln(1/α) / ε² ).

Proof of Theorem 13.
The proof is exactly the same as the proofs of Theorems 11 and 12, except that we use Theorem 8 instead of Theorem 7, and Lemma 9 instead of Lemma 7. □
8. Examples and Simulations
We present the results of applying our algorithm to concrete examples of Markov chains. The examples illustrate the wide applicability of our algorithm for estimating stationary probabilities.
Example 1 (PageRank).
In analyzing the web graph, PageRank is a frequently used measure of the importance of webpages. We are given a scalar parameter β and an underlying directed graph over n nodes, described by the adjacency matrix A (i.e., A_ij = 1 if (i, j) ∈ E and 0 otherwise). The transition probability matrix of the PageRank random walk is given by

    P = (β/n) · 𝟙𝟙ᵀ + (1 − β) D⁻¹ A,    (16)

where 𝟙 is the all-ones vector and D denotes the diagonal matrix whose diagonal entries are the out-degrees of the nodes. The state space is equivalent to the set of nodes in the graph, and it follows that

    P_rs = P(X_{t+1} = s | X_t = r) = β (1/n) + (1 − β) (A_rs / Σ_v A_rv).

Thus, in every step, there is a β probability of jumping uniformly at random to any node in the graph. In our simulation, n = 100, β = 0.15, and the underlying graph is generated according to the configuration model with a power law degree distribution.

Example 2 (Queueing System).
In queueing theory, Markov chains are commonly used to model the length of a queue of jobs waiting to be processed by a server, which evolves over time as jobs arrive and are processed. For illustrative purposes, we chose the M/M/1 queue, equivalent to a random walk on Z_+; the state space is thus countably infinite. Assume we have a single server where jobs arrive according to a Poisson process with rate λ, and the processing time of a single job is distributed exponentially with rate µ. The queue length can be modeled by the random walk shown in Figure 1(a), where q is the probability that a new job arrives before the current job finishes processing, given by q = λ/(λ + µ). For the purposes of our simulation, we choose q = 0.3, and estimate the stationary probabilities of the queue lengths i over the first 50 states. The parameters λ and µ can be any values such that λ/(λ + µ) = 0.3, for example λ = 0.3 and µ = 0.7.

Example 3 (Magnet Graph).
This example illustrates a Markov chain with poor mixing properties. The Markov chain is depicted in Figure 1(b), and can be described as a random walk over a finite section of the integers with two attracting states, labeled in Figure 1(b) as states 1 and n. We assume that q₁, q₂ < 1/2, such that for all states left of the middle state n₀, the random walk drifts towards state 1 with probability 1 − q₁ in each step, and for all states right of n₀, the random walk drifts towards state n with probability 1 − q₂ in each step. Due to this bipolar attraction, a random walk that begins on the left will tend to stay on the left, and similarly, a random walk that begins on the right will tend to stay on the right. For our simulations, we chose q₁ = q₂, n₀ = 25, and n = 50, and computed that Z_max for this chain is extremely large.

We show the results of applying our algorithm to estimate the stationary probabilities in these three different Markov chains, using algorithm parameters ∆ = 0.02, ε = 0.15, and α = 0.2. The three Markov chains have different mixing properties, chosen to illustrate the performance of our algorithm on Markov chains with different values of H_i and Z_max(i). Let t_max denote the final iteration at which the algorithm terminates. In the following figures and discussion, we observe the accuracy of the estimates, the computation cost, the truncation threshold, and the fraction of samples truncated, as functions of the estimation threshold ∆, the stationary probability of the anchor state, and the properties of the Markov chain.

[Figure 3 panels: (a) PageRank Estimates; (b) PageRank Error — basic estimate (π̂) and bias corrected estimate (π̃) vs. true stationary probability π, with ∆ = 0.02.]

Figure 3
These plots show the estimates as well as the multiplicative error resulting from applying both the basic algorithm and the bias corrected algorithm on the PageRank Markov chain.
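For reference, the PageRank transition rule (16) is easy to materialize for a small graph. The sketch below uses a toy 4-node example (not the paper's 100-node configuration-model graph) to build P_rs = β/n + (1 − β) A_rs / Σ_v A_rv and checks that each row is a probability distribution:

```python
def pagerank_matrix(adj, beta):
    """Build the PageRank transition matrix from an adjacency matrix.
    Assumes every node has at least one outgoing edge (no dangling nodes)."""
    n = len(adj)
    P = []
    for r in range(n):
        out_deg = sum(adj[r])
        P.append([beta / n + (1.0 - beta) * adj[r][s] / out_deg
                  for s in range(n)])
    return P

# Toy directed graph on 4 nodes.
A = [[0, 1, 1, 0],
     [1, 0, 0, 1],
     [0, 1, 0, 0],
     [1, 0, 1, 0]]
P = pagerank_matrix(A, beta=0.15)
row_sums = [sum(row) for row in P]
```

Each row sums to 1: the β/n terms contribute β in total, and the (1 − β) A_rs / out-degree terms contribute 1 − β; the uniform-jump component is what makes the PageRank chain mix quickly regardless of the underlying graph.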
We applied both the basic and bias corrected algorithms to estimate each state in the PageRank Markov chain. Figure 3(a) plots both estimates π̂_i and π̃_i as a function of the true stationary probability π_i for all states i. Figure 3(b) plots the multiplicative errors |π̂_i − π_i|/π_i and |π̃_i − π_i|/π_i. We observe that the results validate Theorem 3, which proves that for all states i such that π_i > ∆, the multiplicative error is bounded approximately by 2εZ_max, while for states i such that π_i ≤ ∆, we only guarantee π̂_i ≤ ∆. In fact, we verified that for most pairs of states (i, q) in the PageRank Markov chain, Z_ii − Z_qi ≈ 1, and thus Γ_i(θ) ≈ 1. Lemma 10 then predicts that the error should be bounded by εΓ_i ≈ 0.15, which we verify holds in our simulations.

Figure 3(b) clearly shows a significant reduction in the multiplicative error of the bias corrected estimate π̃_i. This is analyzed in Lemmas 10 and 11, which prove that the additive error of the expected estimates will be a factor of (Γ_i − 1)/Γ_i smaller for the bias corrected estimate as opposed to the basic estimate. Again, we point out that this is a surprising gain, given that we are still using the same termination criteria with early truncation, and that the bias corrected estimate does not use any more samples than the basic estimate.

[Figure 4 panels: (a) MM1 Estimates; (b) Magnet Estimates — basic estimate (π̂) and bias corrected estimate (π̃) vs. true stationary probability π on a log-log scale, with ∆ = 0.02.]

Figure 4
These plots show the estimates obtained by applying both the basic algorithm and the bias corrected algorithm on the M/M/1 and the Magnet Markov chains.
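For the M/M/1 random walk the stationary distribution is available in closed form, which makes it a convenient sanity check: with up-probability q < 1/2 and down-probability 1 − q, detailed balance gives π_i = (1 − r) r^i with r = q/(1 − q). The sketch below (with an assumed boundary behavior: the chain stays at 0 with probability 1 − q) compares a truncated return-time Monte Carlo estimate of π₀ against the closed form:

```python
import random

q = 0.3
r = q / (1.0 - q)                  # geometric ratio of the stationary distribution
pi0_exact = 1.0 - r                # pi_0 = 4/7 for q = 0.3

def step(s, rng):
    """Birth-death walk on Z_+: up w.p. q, down w.p. 1-q (stays at 0)."""
    if rng.random() < q:
        return s + 1
    return max(s - 1, 0)

rng = random.Random(0)
theta, num_samples = 1000, 40000
total, truncated = 0, 0
for _ in range(num_samples):
    s, t = step(0, rng), 1
    while s != 0 and t < theta:
        s, t = step(s, rng), t + 1
    total += t
    truncated += (s != 0)
pi0_hat = 1.0 / (total / num_samples)   # basic estimate 1/T_hat
```

Because the drift toward 0 is strong, the truncation at θ = 1000 is essentially never active, and π̂₀ should land close to π₀ = 4/7 ≈ 0.571; this is the "strong concentration of measure" regime in which the Lyapunov-based bounds of Section 5 are tight.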
In order to gain an understanding of the behavior of the algorithm as a function of the mixing properties of the Markov chain, we applied both the basic and bias corrected algorithms to estimate the first 50 states of the M/M/1 Markov chain and all states of the Magnet Markov chain, since these two chains are locally similar, yet have different global mixing properties. Figure 4 plots both estimates π̂_i and π̃_i as a function of the true stationary probability π_i for both Markov chains. Since the stationary probabilities decay exponentially, the figures are plotted on a log-log scale. For the M/M/1 queue, we observe a similar pattern in Figure 4(a) as we did for PageRank: states i with π_i > ∆ are approximated closely, and states i with π_i ≤ ∆ are thresholded, i.e., π̂_i ≤ ∆.

In contrast, Figure 4(b) shows the result for the Magnet Markov chain, which mixes very slowly. The algorithm overestimates the stationary probabilities by almost two times the true value, which is depicted in the figure by the estimates lying noticeably above the diagonal. This is due to the fact that the random samples have close to zero probability of sampling from the opposite half of the graph; the estimates are therefore computed without ever observing the opposite half of the graph. As the challenge is due to the poor mixing properties of the graph, both π̂_i and π̃_i are poor estimates. In the figure, it is difficult to distinguish the two estimates because they are nearly the same and thus superimposed upon each other. We computed the fundamental matrix Z for this Markov chain, and found that for most pairs i, j ∈ Σ, |Z_ij| is many orders of magnitude larger than in the PageRank chain. Standard methods such as power iteration or MCMC will also perform poorly on this graph, as it would take an extremely long time for the random walk to fully mix across the middle border.
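The locality failure described above is easy to reproduce. The sketch below builds a magnet-style walk (an assumed parameterization: n = 50 states, attractors at the two ends, drift probability 1 − q = 0.7 away from the middle state n₀ = 25, with lazy reflection at the boundaries) and measures the fraction of time a long walk started on the left half spends on the right half:

```python
import random

def magnet_step(s, rng, n=50, n0=25, q=0.3):
    """Biased walk on {0,...,n-1}: left of n0 drift toward 0, right of n0
    drift toward n-1; at the middle, move either way with probability 1/2."""
    if s == n0:
        move = 1 if rng.random() < 0.5 else -1
    elif s < n0:
        move = 1 if rng.random() < q else -1      # drift toward attractor 0
    else:
        move = -1 if rng.random() < q else 1      # drift toward attractor n-1
    return min(max(s + move, 0), n - 1)           # lazy reflection at the boundaries

rng = random.Random(0)
s, right_visits, steps = 5, 0, 100000
for _ in range(steps):
    s = magnet_step(s, rng)
    right_visits += (s > 25)
frac_right = right_visits / steps
```

Starting from state 5, the walk essentially never crosses the middle: stringing together the roughly 20 consecutive up-moves needed to reach n₀ has probability on the order of (q/(1 − q))^20, so frac_right stays at or near zero. This is exactly why return-time samples anchored on one half never observe the other half.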
The final outputs of both power iteration and MCMC are very sensitive to the initial vector, since with high probability each random walk will stay on the half of the graph in which it was initialized, and their estimates are not guaranteed to be either upper or lower bounds on the true stationary probability. An advantage of our algorithm, even for poorly mixing Markov chains, is that π̂_i is always guaranteed to be an upper bound for π_i with high probability.

[Figure 5 panels: (a) fraction of samples truncated p̂^(t_max) vs. π_i; (b) truncation threshold θ^(t_max) vs. π_i; (c) total number of steps taken vs. π_i.]

Figure 5
This shows the value of three variables from the final iteration t_max of the algorithm applied to the PageRank Markov chain: (a) fraction of samples truncated, p̂^(t_max); (b) truncation threshold, θ^(t_max); (c) total number of random walk steps taken, N^(t_max) · T̂_i^(t_max).

Figure 5 plots the quantities p̂^(t_max), θ^(t_max), and N^(t_max) · T̂_i^(t_max) for the execution of our algorithm on the PageRank Markov chain, as a function of the stationary probability of the target state. Recall that the algorithm terminates when either π̂_i^(t) < ∆/(1 + ε) or p̂^(t) < ε/3. In our setting, we chose ε = 0.15, such that the algorithm terminates when p̂^(t) < 0.05. All states i such that π_i > ∆ terminated at the condition p̂^(t) < ε/3, as follows from Theorem 3. The fraction of samples truncated increases as π_i decreases. For states with small stationary probability, the algorithm terminates with a large fraction of samples truncated, even as large as 0.8.

Similarly, the truncation threshold and the total computation time also initially increase as π_i decreases, but then decrease again for states with very small stationary probability. This illustrates the effect of the truncation and termination conditions. For states with large stationary probability, the expected return time E_i[T_i] is small, leading to a lower truncation threshold and fewer steps taken. For states with very small stationary probability, although E_i[T_i] is large, the algorithm terminates quickly at π̂_i < ∆/(1 + ε), thus also leading to a lower truncation threshold and total number of steps taken. This figure hints at the computational savings of our algorithm due to the design of the truncation and termination conditions: the algorithm can quickly determine that a state has small stationary probability without wasting extra time to obtain unnecessary precision.

[Figure 6 panels: (a) basic estimate π̂_i vs. ∆, for states with π ∈ {0.0768, 0.0257, 0.0103, 0.0050, 0.0024}, with the ∆/(1 + ε) threshold line; (b) total steps taken vs. ∆ for the same states.]

Figure 6
These figures show the dependence of our algorithm on the parameter ∆. We plot the basic algorithm estimate π̂_i and the total number of steps taken by the algorithm when applied to the PageRank Markov chain. The figures are shown on a log-log scale.

Figure 6 shows the results of our algorithm as a function of the parameter ∆, when applied to the PageRank Markov chain; the figures are shown on a log-log scale. Recall that the parameter ∆ only affects the termination conditions of the algorithm. We show results from separate executions of the algorithm estimating five different states of the Markov chain with varying stationary probabilities. Figure 6(a) plots the basic algorithm estimates of the stationary probability, along with a diagonal line indicating the termination criterion π̂_i < ∆/(1 + ε). When ∆ > π̂_i, due to the termination conditions, the estimate produced is approximately ∆/(1 + ε). When ∆ ≤ π̂_i, the estimate π̂_i concentrates around π_i and no longer decreases with ∆.

Figure 6(b) plots the total steps taken in the last iteration of the algorithm, which we recall is provably of the same order as the total random walk steps over all iterations. Figure 6(b) confirms that the computation time of the algorithm is upper bounded by O(1/∆), which is linear when plotted on a log-log scale. When ∆ > π_i, the computation time behaves as Θ(1/∆). When ∆ ≤ π_i, the computation time levels off and grows more slowly than O(1/∆).
[Figure 7: Multi-state observer estimates vs. true stationary probability $\pi$. (a) PageRank Markov chain, observer estimates for anchor states a--d; (b) Magnet Markov chain, estimates for anchor states $x$ = 5, 20, 30, 45 against the true value; (c) Magnet Markov chain, log scale.]
Figure 7: These figures show the results of the multiple-state extension of the algorithm. We chose 4 different states in the state space to fix as the anchor state, and used the multiple-state extension to approximate the full stationary distribution by keeping track of the frequency of visits along the sample paths. We show results from both the PageRank and Magnet Markov chains. Due to the exponential decay of stationary probabilities in the Magnet Markov chain, we display those results on both a linear scale and a log scale.
In this section, we show via simulations that the multiple-state extension of our algorithm performs quite well on the PageRank Markov chain, regardless of which state is chosen as the anchor state. On the Magnet Markov chain the algorithm's behavior is more interesting, varying according to the anchor state. We chose 4 different states in the state space with varying stationary probability as the anchor state. We apply the multiple-state extension of the algorithm to estimate the stationary probabilities of the observer states, which involves keeping track of the frequency of visits to the observer states along the return paths to the anchor state.

Figure 7(a) shows that for all choices of the anchor state, the estimates were close to the true stationary probability, almost fully coinciding with the diagonal. Note that these estimates are computed with the same sample sequences collected by the basic algorithm, and thus the truncation and termination are a function of the anchor state.

When we apply the algorithm to the Magnet Markov chain, we observe that the results are highly dependent on the anchor state chosen, due to the poor mixing properties of the Markov chain. Figure 7(b) shows that for every choice of anchor state, the algorithm overestimated states in the same half of the state space as the anchor, and underestimated states in the opposite half of the state space. In Figure 7(c), the estimates are plotted on a log scale in order to observe more finely the behavior in the middle section of the state space, which has exponentially small stationary probabilities. We observe that the random walks sampled from anchor states 5, 30, and 45 do not cross over to the other half of the Markov chain, and thus completely ignore the other half of the state space. Some of the random walks beginning from anchor state 20 did cross over to the other half of the state space, but not often enough to properly adjust the estimates. The significance of overestimating one side and underestimating the other depends on the choice of the parameters $q_1$ and $q_2$. This is a graph on which any Monte Carlo algorithm will perform poorly, since random walks originating in one half will not be aware of the other half of the state space.
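The multiple-state extension described above admits a compact sketch. The code below is an illustration under simplifying assumptions, not the authors' implementation; the function names and the fixed truncation `theta` are hypothetical. It counts visits to every state along return paths to the anchor and normalizes by the total number of steps, using the renewal-reward identity that the expected number of visits to $j$ per return cycle to the anchor equals $\pi_j \, E_i[T_i]$.

```python
import random
from collections import Counter

def estimate_all_states(step, anchor, num_paths=2000, theta=1000):
    """Multiple-state extension (sketch): estimate pi_j for every state j
    observed on return paths of a random walk to the anchor state.

    pi_j is approximated by (visits to j across all return paths) divided
    by (total number of steps taken); `step` maps a state to a random
    successor, and theta truncates pathologically long excursions."""
    visits = Counter()
    total_steps = 0
    for _ in range(num_paths):
        state = step(anchor)
        visits[state] += 1
        total_steps += 1
        for _ in range(theta):          # truncate very long return paths
            if state == anchor:
                break
            state = step(state)
            visits[state] += 1
            total_steps += 1
    return {j: v / total_steps for j, v in visits.items()}
```

Because every simulated step is counted exactly once, the returned estimates sum to one, and each estimate inherits the accuracy of the anchor's return-time samples, which is why the quality on the Magnet chain depends so strongly on the anchor.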
9. Discussion: Understanding Properties of PageRank
Equation (16) shows that the PageRank random walk is a convex combination of a directed graph given by the adjacency matrix $A$ and a complete graph, modeled by uniform random jumps between any two states. When $\beta = 0$, the Markov chain is specified by $A$, which can be fully general when the edges in the graph are weighted. By tuning the parameter $\beta$, we can control whether the PageRank Markov chain is closer to the underlying directed graph or to the complete graph. The existing algorithms and analyses for computing PageRank locally only utilize the property that with probability $\beta$, the transition is chosen from the complete graph (Jeh and Widom 2003, Fogaras et al. 2005, Avrachenkov et al. 2007, Bahmani et al. 2010, Borgs et al. 2012). When $\beta$ is large, the Markov chain mixes quickly due to the complete graph, and these algorithms also perform efficiently; however, when $\beta$ is small, regardless of the properties of the underlying graph, these algorithms scale with $1/\beta$, even though the Markov chain might still mix quickly depending on the underlying graph. Specifically, we have observed that our algorithm finds a natural tradeoff balancing the precision of the estimates with the computation cost, differentiating between high and low stationary probability states. In this section we explore the properties of the PageRank Markov chain when $\beta$ is a parameter that can be tuned to modify the Markov chain to be closer to the classic PageRank Markov chain with $\beta = 0.15$, or to any general random walk defined by matrix $A$. These properties will lead to implications and tradeoffs between using existing methods as opposed to our algorithm to estimate PageRank, as a function of $\beta$.

In the following simulations, we restrict ourselves to the personalized PageRank setting. For a fixed central state $x$ and scalar $\beta$, we consider the Markov chain with transition probability matrix $\tilde{P}$:
\[ \tilde{P} = \beta \cdot \mathbf{1} e_x^T + (1-\beta) D^{-1} A, \qquad (17) \]
where $A$ is the adjacency matrix of the underlying graph, generated according to the configuration model with a power law degree distribution, and $D$ is the diagonal matrix of out-degrees. In every step, there is probability $\beta$ of returning to the state $x$, and probability $1-\beta$ of choosing a neighbor at random (according to the configuration model generated graph). We choose several values for $\beta$ and four different states in the graph to be the central state $x$. For each combination of $x$ and $\beta$, we sample ten long random walks starting at state $x$, each having five million steps. After every ten thousand steps, we compute the current fraction of visits to state $x$ out of the total steps taken thus far. This quantity converges to $\pi_x$ as the length goes to infinity.

We can split the random walk sequence into adjacent contiguous subsequences, delimited by the times when the random walk jumps back to state $x$ with probability $\beta$. These events denote renewal times of the process, and due to the Markov property these subsequences are independent and identically distributed, conditioned on beginning and ending at state $x$. The length of each subsequence is distributed as a geometric random variable with parameter $\beta$. In fact, this insight forms the foundation for the existing PageRank algorithms by Fogaras et al. (2005) and Avrachenkov et al. (2007), which collect geometric-length random walk samples on the underlying graph given by matrix $A$, beginning at state $x$.
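The geometric-length sampling of Fogaras et al. (2005) and Avrachenkov et al. (2007) described above can be sketched as follows. This is an illustration only; the graph representation `neighbors` and the function name are assumptions. Each walk starts at $x$ and, at every step, restarts with probability $\beta$ (the renewal event), so its length is geometric with parameter $\beta$; the personalized PageRank of each state is estimated by its fraction of visits.

```python
import random

def geometric_walk_pagerank(neighbors, x, beta=0.15, num_walks=5000):
    """Estimate the personalized PageRank vector for central state x by
    sampling walks of geometric(beta) length on the underlying graph,
    mirroring the renewal structure of the beta-restart chain."""
    visits = {}
    total = 0
    for _ in range(num_walks):
        state = x
        while True:
            visits[state] = visits.get(state, 0) + 1
            total += 1
            if random.random() < beta:   # restart: the renewal event
                break
            state = random.choice(neighbors[state])
    return {s: v / total for s, v in visits.items()}
```

Note that the expected length of each sample here is $1/\beta$ regardless of the graph, which is exactly the scaling that becomes wasteful for small $\beta$; our algorithm instead adapts the sample length to the return time to $x$.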
These independently sampled sequences can be stitched together to form a long random walk which has the same distribution as if the Markov chain were simulated sequentially. Our work can be seen as extending this approach to general Markov chains by using the return visits to state $x$ to denote the renewal times of the process, which include the $\beta$ jumps in the case of personalized PageRank. While the frequency of $\beta$ jumps is always distributed as a Bernoulli process, the frequency of visits to state $x$ depends on both $\beta$ and the neighborhood of state $x$.

[Figure 8: (a) Number of steps until 0.01 error vs. $1/\beta$; (b) Expected return time $E_i[T_i]$ vs. $1/\beta$; shown for four states with $\pi$ = 0.0932, 0.0155, 0.0031, and 0.0010.]
Figure 8: These figures show the number of random walk steps needed to accurately estimate the stationary probability of a state, as well as the expected return time to a state, in a PageRank Markov chain, for four different states and as a function of the parameter $\beta$.

In Figure 8(b), we plot the expected length of a sample in Fogaras' algorithm, which is given by $1/\beta$, against the expected length of a sample in our algorithm, given by $E_i[T_i]$. When $\beta$ is large, $E_i[T_i] \approx 1/\beta$ uniformly for all states, which is expected since the Markov chain is dominated by the $\beta$ jumps. When $\beta$ is very small, the expected return times differentiate between states, reflecting the different local connectivity properties around each state.

Next we investigate the rates at which the sample frequency of visits to a state in the Markov chain converges to the true stationary probability. This helps us to quantify the amount of computation, i.e., simulated random walk steps, necessary to obtain a good estimate for the stationary probability of a state. For each combination of $x$ and $\beta$, and for each long random walk sample, we compute the multiplicative error between the stationary probability and the fraction of visits to state $x$ along the path. We average the error across the 10 samples, and compute the smallest number of steps such that the average error is less than 0.01. This is shown in Figure 8(a) as a function of $\beta$ for the four chosen states $x$.

When $\beta$ is large, the $\beta$ jump back to $x$ dominates the random walk, causing the random walk to behave similarly for the different chosen states $x$. In this setting $\beta$ can be used to determine the required number of samples to achieve good accuracy. When $\beta$ is small, the random walk approaches that of the underlying graph, and thus the choice of state $x$ greatly affects the computation.
The number of steps to achieve below 0.01 error does not increase much as $\beta$ decreases for states $x$ that have large $\pi_x$ in the underlying graph (such as state 1). However, for states $x$ such that $\pi_x$ is small in the underlying graph (such as state 4), the required number of steps increases significantly as $\beta$ decreases. We observe a relationship between $E_x[T_x]$ and the total number of computation steps needed for 0.01 accuracy. This clearly highlights that in the setting where $\beta$ is very small, it becomes critical to use information from the underlying graph in determining the number of steps to sample. The existing algorithms only use the parameter $\beta$ to govern the number of steps taken, where the length of a sample random walk scales with $1/\beta$. However, for small $\beta$ this is unnecessarily expensive, as some states (such as state 1) do not in fact require the number of sample steps to increase with $1/\beta$. Our proposed algorithm instead uses the expected return time to state $x$ as the length of each sample path, which adjusts for states that have varying stationary probabilities in the underlying graph.

Appendix A: Formal Proofs for Theorems presented in Section 3
Proof of Theorem 1.
The error bounds of Theorem 1 directly follow from Theorems 6 and 7 in Section 6. The analysis of the computation cost is proved directly by Lemma 13. $\square$

Proof of Theorem 2.
The error bound of Theorem 2 directly follows from Corollary 2, using the property that $Z_{\max} \le t_{\mathrm{mix}}$. The bound on the computation cost in Theorem 2 follows directly from Theorem 12, and the property that $H_i = O(Z_{\max}(i)/\pi_i) = O(t_{\mathrm{mix}}/\pi_i)$. $\square$

Proof of Theorem 3.
Theorem 3(a) follows from Theorem 6. Theorem 3(b) follows from plugging $2\epsilon/\delta$ into Theorem 2. Theorem 3(c) follows from taking the minimum of the bounds given in Theorems 10 and 12, again substituting $2\epsilon/\delta$ and using the property that $H_i = O(t_{\mathrm{mix}}/\pi_i)$. $\square$

Proof of Theorem 4.
By Lemma 5, with probability greater than $1-\alpha$, $\hat{T}_i^{(k)} \in (1\pm\epsilon)\, E_i[\hat{T}_i^{(k)}]$ for all $k$, and $(1-\hat{p}^{(k)}) \in (1\pm\epsilon)(1 - P_i(T_i > \theta^{(k)}))$ for all $k$ such that $P_i(T_i > \theta^{(k)}) \le 1/2$. Therefore, with probability greater than $1-\alpha$, for every $k$ such that $P_i(T_i > \theta^{(k)}) \le 1/2$,
\[ \left|\frac{\tilde\pi_i^{(k)} - \pi_i}{\tilde\pi_i^{(k)}}\right| = \left|1 - \frac{\hat{T}_i^{(k)}}{(1-\hat{p}^{(k)})\, E_i[T_i]}\right| \le \frac{1+\epsilon}{1-\epsilon}\left|1 - \frac{E_i[\hat{T}_i^{(k)}]}{(1 - P_i(T_i > \theta^{(k)}))\, E_i[T_i]}\right| + \frac{2\epsilon}{1-\epsilon}. \qquad (18) \]
By upper bounding $(Z_{ii} - Z_{qi})$ by $2Z_{\max}(i)$ for all $q$, it follows that $\Gamma_i(\theta^{(k)}) < 2Z_{\max}(i)$. Thus, by rearranging (11) from Lemma 11, it follows that
\[ \left|1 - \frac{E_i[\hat{T}_i^{(k)}]}{(1 - P_i(T_i > \theta^{(k)}))\, E_i[T_i]}\right| \le \max\left(2Z_{\max}(i) - 1,\; 1\right)\frac{P_i\left(T_i > \theta^{(k)}\right)}{1 - P_i\left(T_i > \theta^{(k)}\right)}. \qquad (19) \]
Substitute (19) into (18), and apply Lemma 7 to complete the proof. $\square$

Proof of Theorem 5.
By Lemma 6, with probability greater than $1-\alpha$, $\hat{F}_j^{(k)} \in E_i[\hat{F}_j^{(k)}] \pm \epsilon\, E_i[\hat{T}_i^{(k)}]$ and $\hat{T}_i^{(k)} \in (1\pm\epsilon)\, E_i[\hat{T}_i^{(k)}]$ for all $k$. Therefore, with probability greater than $1-\alpha$,
\[ (1-\epsilon)\,\frac{\hat{F}_j^{(k)}}{\hat{T}_i^{(k)}} \le \frac{E_i[\hat{F}_j^{(k)}] + \epsilon\, E_i[\hat{T}_i^{(k)}]}{E_i[\hat{T}_i^{(k)}]}, \quad\text{so that}\quad \hat\pi_j^{(k)} \le \frac{E_i[\hat{F}_j^{(k)}]}{E_i[\hat{T}_i^{(k)}]} + \epsilon + \epsilon\,\hat\pi_j^{(k)}. \]
Similarly,
\[ (1+\epsilon)\,\frac{\hat{F}_j^{(k)}}{\hat{T}_i^{(k)}} \ge \frac{E_i[\hat{F}_j^{(k)}] - \epsilon\, E_i[\hat{T}_i^{(k)}]}{E_i[\hat{T}_i^{(k)}]}, \quad\text{so that}\quad \hat\pi_j^{(k)} \ge \frac{E_i[\hat{F}_j^{(k)}]}{E_i[\hat{T}_i^{(k)}]} - \epsilon - \epsilon\,\hat\pi_j^{(k)}. \]
Therefore, with probability greater than $1-\alpha$,
\[ \left|\tilde\pi_j^{(k)} - \pi_j\right| \le \left|\frac{E_i[\hat{F}_j^{(k)}]}{E_i[\hat{T}_i^{(k)}]} - \pi_j\right| + \epsilon\,(1 + \tilde\pi_j^{(k)}). \]
Use Lemma 12 to show that with probability greater than $1-\alpha$,
\[ \left|\tilde\pi_j^{(k)} - \pi_j\right| \le \left|\frac{P_i\left(T_i > \theta^{(k)}\right)}{E_i[\hat{T}_i^{(k)}]} \sum_{q \in \Sigma\setminus\{i\}} P_i\left(X_{\theta^{(k)}} = q \mid T_i > \theta^{(k)}\right)\left(Z_{ij} - Z_{qj}\right)\right| + \epsilon\,(1 + \tilde\pi_j^{(k)}) \le \frac{P_i\left(T_i > \theta^{(k)}\right)}{E_i[\hat{T}_i^{(k)}]}\, 2Z_{\max}(j) + \epsilon\,(1 + \tilde\pi_j^{(k)}). \]
Since $\hat{T}_i^{(k)} \in (1\pm\epsilon)\, E_i[\hat{T}_i^{(k)}]$, it follows that
\[ \left|\tilde\pi_j^{(k)} - \pi_j\right| \le 2(1+\epsilon)\, P_i\left(T_i > \theta^{(k)}\right) Z_{\max}(j)\,\hat\pi_i^{(k)} + \epsilon\,(1 + \tilde\pi_j^{(k)}). \]
Apply Lemma 7 to complete the proof. $\square$
Appendix B: Additional Proofs of Lemmas presented in Section 4
The proofs of Lemmas 3 to 6 use the following fact:
Fact 1. If $\sum_{h=1}^{k} x_h \le 1$, then $\prod_{h=1}^{k} (1 - x_h) \ge 1 - \sum_{h=1}^{k} x_h$.

Proof of Lemma 3.
Let $A_h$ denote the event $\{\hat{T}_i^{(h)} \in (1\pm\epsilon)\, E_i[\hat{T}_i^{(h)}]\}$. Since $N^{(h)}$ is a random variable due to its dependence on $\hat{T}_i^{(h-1)}$, the distribution of $\hat{T}_i^{(h)}$ depends on the value of $\hat{T}_i^{(h-1)}$. Conditioned on the event $A_{h-1}$,
\[ N^{(h)} = \left\lceil \frac{3(1+\epsilon)\,\theta^{(h)} \ln(4\theta^{(h)}/\alpha)}{\hat{T}_i^{(h-1)} \epsilon^2} \right\rceil \ge \frac{3(1+\epsilon)\,\theta^{(h)} \ln(4\theta^{(h)}/\alpha)}{(1+\epsilon)\, E_i[\hat{T}_i^{(h-1)}]\, \epsilon^2} = \frac{3\,\theta^{(h)} \ln(4\theta^{(h)}/\alpha)}{E_i[\hat{T}_i^{(h-1)}]\, \epsilon^2}. \qquad (20) \]
Then we apply Chernoff's bound for independent identically distributed bounded random variables (see Theorem 15 in the Appendix), substitute in for $N^{(h)}$, and use the facts that $E_i[\hat{T}_i^{(h)}] \ge E_i[\hat{T}_i^{(h-1)}]$ and $\theta^{(h)} = 2^h$ for all $h$, to show that
\[ P_i(\neg A_h \mid A_{h-1}) \le 2\exp\left(-\frac{\epsilon^2 N^{(h)} E_i[\hat{T}_i^{(h)}]}{3\theta^{(h)}}\right) \le 2\exp\left(-\frac{E_i[\hat{T}_i^{(h)}] \ln(4\theta^{(h)}/\alpha)}{E_i[\hat{T}_i^{(h-1)}]}\right) \le \frac{\alpha}{2\theta^{(h)}} \le \frac{\alpha}{2^{h+1}}. \qquad (21) \]
It can be verified that $P_i(\neg A_1)$ is similarly upper bounded by $\alpha/4$. Therefore, by Bayes' rule and by the fact that $\hat{T}_i^{(h')}$ is independent of $\hat{T}_i^{(h)}$ conditioned on $\hat{T}_i^{(h-1)}$ for all $h' < h$, we show that
\[ P_i\left(\bigcap_{h=1}^{k} A_h\right) = P_i(A_1) \prod_{h=2}^{k} P_i(A_h \mid A_{h-1}) \ge \prod_{h=1}^{k} \left(1 - \frac{\alpha}{2^{h+1}}\right). \qquad (22) \]
By applying Fact 1, it follows that
\[ P_i\left(\bigcap_{h=1}^{k} A_h\right) \ge 1 - \sum_{h=1}^{k} \frac{\alpha}{2^{h+1}} = 1 - \frac{\alpha}{2}\left(1 - 2^{-k}\right) \ge 1 - \alpha. \qquad (23) \]
$\square$

Proof of Lemma 4.
Let $A_h$ denote the event $\{\hat{T}_i^{(h)} \in (1\pm\epsilon)\, E_i[\hat{T}_i^{(h)}]\}$, and let $B_h$ denote the event $\{\hat{p}^{(h)} \in P_i(T_i > \theta^{(h)}) \pm \epsilon\}$. $\hat{T}_i^{(h)}$ is independent of $\hat{p}^{(h-1)}$ conditioned on $\hat{T}_i^{(h-1)}$. Therefore, it follows from (21) that $P_i(\neg A_h \mid A_{h-1} \cap B_{h-1}) \le \alpha/2^{h+1}$.

By applying Hoeffding's inequality for Bernoulli random variables to $\hat{p}^{(h)}$ and substituting (20), we show that
\[ P_i(\neg B_h \mid A_{h-1} \cap B_{h-1}) \le 2\exp\left(-2\epsilon^2 N^{(h)}\right) \le 2\exp\left(-\frac{2\,\theta^{(h)} \ln(4\theta^{(h)}/\alpha)}{3\, E_i[\hat{T}_i^{(h-1)}]}\right). \]
Since $E_i[\hat{T}_i^{(h-1)}] \le \theta^{(h-1)} = \theta^{(h)}/2$,
\[ P_i(\neg B_h \mid A_{h-1} \cap B_{h-1}) \le 2\exp\left(-\frac{4}{3}\ln(4\theta^{(h)}/\alpha)\right) \le \frac{\alpha}{2\theta^{(h)}} \le \frac{\alpha}{2^{h+1}}. \]
By applying the union bound, it follows that
\[ P_i(A_h \cap B_h \mid A_{h-1} \cap B_{h-1}) \ge 1 - \frac{\alpha}{2^h}. \qquad (24) \]
We can easily verify using the same techniques that $P_i(A_1 \cap B_1) \ge 1 - \alpha/2$. Therefore,
\[ P_i\left(\bigcap_{h=1}^{k} (B_h \cap A_h)\right) \ge \prod_{h=1}^{k}\left(1 - \frac{\alpha}{2^h}\right). \]
We use Fact 1 to show that
\[ P_i\left(\bigcap_{h=1}^{k} (B_h \cap A_h)\right) \ge 1 - \sum_{h=1}^{k} \frac{\alpha}{2^h} = 1 - \alpha\left(1 - 2^{-k}\right) \ge 1 - \alpha. \]
$\square$

Proof of Lemma 5.
Recall that $k_1$ is defined such that $P_i(T_i > \theta^{(k_1)}) < 1/2$. Let $A_h$ denote the event $\{\hat{T}_i^{(h)} \in (1\pm\epsilon)\, E_i[\hat{T}_i^{(h)}]\}$, and let $B_h$ denote the event $\{(1-\hat{p}^{(h)}) \in (1\pm\epsilon)(1 - P_i(T_i > \theta^{(h)}))\}$. Then
\[ P_i\left(\bigcap_{h=k_1}^{k} B_h \cap \bigcap_{h=1}^{k} A_h\right) = P_i\left(\bigcap_{h=k_1}^{k} (B_h \cap A_h) \,\middle|\, \bigcap_{h=1}^{k_1-1} A_h\right) P_i\left(\bigcap_{h=1}^{k_1-1} A_h\right). \qquad (25) \]
$\hat{T}_i^{(h)}$ and $\hat{p}^{(h)}$ are not independent of $\hat{T}_i^{(h-1)}$ and $\hat{p}^{(h-1)}$, due to the dependence of $N^{(h)}$, as a random variable, upon $\hat{T}_i^{(h-1)}$. However, $\hat{T}_i^{(h)}$ is independent of $\hat{p}^{(h-1)}$ conditioned on $\hat{T}_i^{(h-1)}$. Therefore, it follows from (21) that
\[ P_i(\neg A_h \mid A_{h-1} \cap B_{h-1}) \le \frac{\alpha}{2^{h+1}}. \qquad (26) \]
By using Chernoff's bound for Bernoulli random variables and substituting (20), we show that
\[ P_i(\neg B_h \mid A_{h-1} \cap B_{h-1}) \le 2\exp\left(-\frac{\epsilon^2 N^{(h)} (1 - P_i(T_i > \theta^{(h)}))}{3}\right) \le 2\exp\left(-\frac{\theta^{(h)} \ln(4\theta^{(h)}/\alpha)\,(1 - P_i(T_i > \theta^{(h)}))}{E_i[\hat{T}_i^{(h-1)}]}\right). \qquad (27) \]
By definition, $E_i[\hat{T}_i^{(h-1)}] \le \theta^{(h-1)} = \theta^{(h)}/2$. Since we are given that $P_i(T_i > \theta^{(h)}) \le 1/2$, then
\[ \frac{\theta^{(h)}\,(1 - P_i(T_i > \theta^{(h)}))}{E_i[\hat{T}_i^{(h-1)}]} \ge 1. \qquad (28) \]
By substituting (28) into (27), it follows that
\[ P_i(\neg B_h \mid A_{h-1} \cap B_{h-1}) \le \frac{\alpha}{2\theta^{(h)}} \le \frac{\alpha}{2^{h+1}}. \qquad (29) \]
We combine (26) and (29), and apply the union bound to show that
\[ P_i(A_h \cap B_h \mid A_{h-1} \cap B_{h-1}) \ge 1 - \frac{\alpha}{2^h}. \qquad (30) \]
We can easily verify using the same techniques that
\[ P_i\left(A_{k_1} \cap B_{k_1} \,\middle|\, \bigcap_{h=1}^{k_1-1} A_h\right) \ge 1 - \frac{\alpha}{2^{k_1}}. \]
We use Bayes' rule and substitute (22) and (30) into (25) to show that
\[ P_i\left(\bigcap_{h=k_1}^{k} B_h \cap \bigcap_{h=1}^{k} A_h\right) \ge \prod_{h=k_1}^{k}\left(1 - \frac{\alpha}{2^h}\right) \prod_{h=1}^{k_1-1}\left(1 - \frac{\alpha}{2^{h+1}}\right). \]
We apply Fact 1 to show that
\[ P_i\left(\bigcap_{h=k_1}^{k} B_h \cap \bigcap_{h=1}^{k} A_h\right) \ge 1 - \sum_{h=1}^{k} \frac{\alpha}{2^h} + \sum_{h=1}^{k_1-1} \frac{\alpha}{2^{h+1}} \ge 1 - \alpha. \]
$\square$

Proof of Lemma 6.
Let $A_h$ denote the event $\{\hat{T}_i^{(h)} \in (1\pm\epsilon)\, E_i[\hat{T}_i^{(h)}]\}$, and let $B_h$ denote the event $\{\hat{F}_j^{(h)} \in E_i[\hat{F}_j^{(h)}] \pm \epsilon\, E_i[\hat{T}_i^{(h)}]\}$. $\hat{T}_i^{(h)}$ is independent of $\hat{F}_j^{(h-1)}$ conditioned on $\hat{T}_i^{(h-1)}$. Therefore, it follows from (21) that $P_i(\neg A_h \mid A_{h-1} \cap B_{h-1}) \le \alpha/2^{h+1}$.

Recall from the definition of $\hat{F}_j^{(h)}$ that $\hat{F}_j^{(h)} \in [0, \theta^{(h)}]$. Therefore, by applying Chernoff's bound for independent identically distributed bounded random variables, and substituting (20), we show that
\[ P_i(\neg B_h \mid A_{h-1} \cap B_{h-1}) \le 2\exp\left(-\left(\frac{\epsilon\, E_i[\hat{T}_i^{(h)}]}{E_i[\hat{F}_j^{(h)}]}\right)^2 \frac{N^{(h)} E_i[\hat{F}_j^{(h)}]}{3\theta^{(h)}}\right) \le 2\exp\left(-\frac{E_i[\hat{T}_i^{(h)}]^2 \ln(4\theta^{(h)}/\alpha)}{E_i[\hat{F}_j^{(h)}]\, E_i[\hat{T}_i^{(h-1)}]}\right). \qquad (31) \]
Since $E_i[\hat{T}_i^{(h)}] \ge E_i[\hat{T}_i^{(h-1)}]$ and $E_i[\hat{T}_i^{(h)}] \ge E_i[\hat{F}_j^{(h)}]$, then
\[ \frac{E_i[\hat{T}_i^{(h)}]^2}{E_i[\hat{F}_j^{(h)}]\, E_i[\hat{T}_i^{(h-1)}]} \ge 1. \]
By substituting into (31), it follows that
\[ P_i(\neg B_h \mid A_{h-1} \cap B_{h-1}) \le \frac{\alpha}{2\theta^{(h)}} \le \frac{\alpha}{2^{h+1}}. \]
By applying the union bound, we show that
\[ P_i(A_h \cap B_h \mid A_{h-1} \cap B_{h-1}) \ge 1 - \frac{\alpha}{2^h}. \qquad (32) \]
We can easily verify using the same techniques that $P_i(A_1 \cap B_1) \ge 1 - \alpha/2$. Therefore,
\[ P_i\left(\bigcap_{h=1}^{k} (B_h \cap A_h)\right) \ge \prod_{h=1}^{k}\left(1 - \frac{\alpha}{2^h}\right). \]
We use Fact 1 to show that
\[ P_i\left(\bigcap_{h=1}^{k} (B_h \cap A_h)\right) \ge 1 - \sum_{h=1}^{k} \frac{\alpha}{2^h} = 1 - \alpha\left(1 - 2^{-k}\right) \ge 1 - \alpha. \]
$\square$

Appendix C: Additional Proofs for Lemmas presented in Section 5
Proof of Lemma 7.
This proof is adapted from a proof found in Chapter 2, Section 4.3 of Aldous and Fill (1999). It is included here for completeness.
\[ P_i(T_i > m\alpha) = \prod_{s=0}^{m-1} P_i(T_i > (s+1)\alpha \mid T_i > s\alpha) = \prod_{s=0}^{m-1} \left(\sum_{j\in\Sigma} P_i(T_i > (s+1)\alpha \mid X_{s\alpha} = j,\, T_i > s\alpha)\, P_i(X_{s\alpha} = j \mid T_i > s\alpha)\right). \]
By the Markov property,
\[ P_i(T_i > m\alpha) = \prod_{s=0}^{m-1} \left(\sum_{j\in\Sigma} P_j(T_i > \alpha)\, P_i(X_{s\alpha} = j \mid T_i > s\alpha)\right). \]
By Markov's inequality,
\[ P_i(T_i > m\alpha) \le \prod_{s=0}^{m-1} \left(\sum_{j\in\Sigma} \frac{E_j[T_i]}{\alpha}\, P_i(X_{s\alpha} = j \mid T_i > s\alpha)\right) \le \prod_{s=0}^{m-1} \left(\max_{j\in\Sigma} \frac{E_j[T_i]}{\alpha}\right) = \left(\frac{H_i}{\alpha}\right)^m. \]
By choosing $\alpha = 2H_i$ and $m = \lfloor k/2H_i \rfloor$, it follows that
\[ P_i(T_i > k) \le P_i\left(T_i > \left\lfloor \frac{k}{2H_i} \right\rfloor \cdot 2H_i\right) \le 2^{-\lfloor k/2H_i \rfloor} \le 2 \cdot 2^{-k/2H_i}. \]
$\square$

To prove that the return times concentrate for countably infinite state space Markov chains, we use the Lyapunov function analysis introduced by Foster (1953). We establish that return times indeed have an exponentially decaying tail even for countable state space Markov chains, as long as they satisfy Assumption 1.
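The geometric tail guaranteed by Lemma 7 can be checked numerically on a small chain. The sketch below is illustrative only and not part of the paper: it computes the exact tail probabilities $P_i(T_i > k)$ for a finite chain by propagating the distribution of a walk that is forbidden from re-entering state $i$, and one can then verify the tail decays at least geometrically.

```python
def return_time_tail(P, i, k_max):
    """Exact tail probabilities P_i(T_i > k), k = 1..k_max, for a finite
    chain with transition matrix P (list of rows).

    Maintains the sub-probability distribution of walks started from i
    that have avoided i so far; stepping into i removes that mass
    (those paths have returned), so the remaining mass is the tail."""
    n = len(P)
    # distribution after one step, with mass that landed on i removed
    dist = [P[i][j] if j != i else 0.0 for j in range(n)]
    tails = [sum(dist)]                     # P_i(T_i > 1)
    for _ in range(k_max - 1):
        new = [0.0] * n
        for s in range(n):
            if dist[s] == 0.0:
                continue
            for j in range(n):
                if j != i:                  # stepping to i means returning
                    new[j] += dist[s] * P[s][j]
        dist = new
        tails.append(sum(dist))
    return tails
```

On a two-state chain the tail is exactly geometric, matching the qualitative content of Lemma 7 with a rate governed by the worst-case hitting time.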
Useful notation. We introduce formal notation for observing $\{X_t\}_{t\ge 0}$ over the subset $B$. Let $\{Y_t\}_{t\ge 0}$ be a Markov chain with state space $B$, constructed as a subsequence of $\{X_t\}$ in the following way. Define the subsequence $\{S_k\}$ of $\mathbb{Z}_+$ by
\[ S_0 \triangleq 0, \qquad S_k \triangleq \min\{t > S_{k-1} : X_t \in B\}, \]
and define $Y_t \triangleq X_{S_t}$, such that $P_Y(x,y) = P(X_{S_1} = y \mid X_0 = x)$ for any $x, y \in B$.

Let $Q_t \triangleq S_{t+1} - S_t$, the length of the path between $X_{S_t}$ and $X_{S_{t+1}}$. Let $T_i^B \triangleq \inf\{t \ge 1 \mid Y_t = i\}$, the return time to $i$ for the chain $\{Y_t\}$. Let $H_i^B \triangleq \max_{j\in B} E_j[T_i^B]$, the maximal expected hitting time to state $i$ for the chain $\{Y_t\}$. We can use these variables to express the return time to state $i$ by
\[ T_i = S_{T_i^B} = \sum_{k=0}^{T_i^B - 1} Q_k. \qquad (33) \]

Lemma 14.
Let $\{X_t\}$ be a countable state space Markov chain satisfying Assumption 1. Let $\{Y_t\}$ be defined above as the Markov chain restricted to $B$, and let $Q_k$ be the length of the excursion between successive visits to $B$. For any $W \in \mathbb{Z}_+$, $Z \ge W$, and $i \in B$,
\[ P_i\left(\sum_{k=0}^{W-1} Q_k > Z\right) \le \exp\left(\frac{1-\rho}{4 e^{\eta\nu_{\max}}}\left(\frac{4 W e^{\eta\nu_{\max}}}{1-\rho} + W - Z\right)\right). \]
The constants $\gamma$ and $\nu_{\max}$ are given by Assumption 1, and the scalars $\eta$ and $\rho$ are functions of $\gamma$ and $\nu_{\max}$, as defined in (38) in Appendix F.

Proof of Lemma 14. By the law of iterated expectation,
\[ E_i\left[\prod_{k=0}^{W-1} e^{\lambda Q_k}\right] = E_i\left[e^{\lambda Q_0}\, E_i\left[\prod_{k=1}^{W-1} e^{\lambda Q_k} \,\middle|\, Q_0\right]\right]. \]
Conditioned on $Y_1 = j$, $Y_1$ and $Q_0$ are independent of $\{Q_k\}_{k>0}$, because $Y_1 = X_{Q_0}$. Thus, by the strong Markov property,
\[ E_i\left[\prod_{k=1}^{W-1} e^{\lambda Q_k} \,\middle|\, Q_0\right] \le \max_{j\in B} E\left[\prod_{k=1}^{W-1} e^{\lambda Q_k} \,\middle|\, Y_1 = j\right], \]
so that
\[ E_i\left[\prod_{k=0}^{W-1} e^{\lambda Q_k}\right] \le E_i\left[e^{\lambda Q_0}\right] \max_{j\in B} E\left[\prod_{k=1}^{W-1} e^{\lambda Q_k} \,\middle|\, Y_1 = j\right]. \]
We iteratively apply this conditioning to show that
\[ E_i\left[\prod_{k=0}^{W-1} e^{\lambda Q_k}\right] \le E_i\left[e^{\lambda Q_0}\right] \prod_{k=1}^{W-1}\left(\max_{j\in B} E\left[e^{\lambda Q_k} \mid Y_k = j\right]\right). \]
We can upper bound $Q_k$ by assuming that the walk always goes on an excursion from $B$. We invoke Hajek's result of Theorem 17, with $V(x) < b + \nu_{\max}$, to bound the exponential moments of the excursion: for any $i \in B$,
\[ E_i\left[\prod_{k=0}^{W-1} e^{\lambda Q_k}\right] \le \left(e^{\lambda}\left(e^{\eta\nu_{\max}}\,\frac{e^{\lambda} - 1}{1 - \rho e^{\lambda}} + 1\right)\right)^W, \]
where $\eta$ and $\rho$ are functions of $\gamma$ and $\nu_{\max}$, as defined in (38) in Appendix F; they satisfy the conditions given in Hajek (1982). For $\lambda < \min\left(\tfrac{1}{2}, \tfrac{1-\rho}{2\rho}\right)$,
\[ e^{\lambda} - 1 < 2\lambda \quad\text{and}\quad 1 - \rho e^{\lambda} > 1 - \rho - 2\rho\lambda > 0. \]
By substituting these approximations and using $1 + x < e^x$, we obtain
\[ \left(e^{\lambda}\left(e^{\eta\nu_{\max}}\,\frac{e^{\lambda} - 1}{1 - \rho e^{\lambda}} + 1\right)\right)^W < \left(e^{\lambda}\left(\frac{2\lambda\, e^{\eta\nu_{\max}}}{1 - \rho - 2\rho\lambda} + 1\right)\right)^W < \exp\left(\lambda W\left(\frac{2 e^{\eta\nu_{\max}}}{1 - \rho - 2\rho\lambda} + 1\right)\right). \]
By Markov's inequality,
\[ P_i\left(\sum_{k=0}^{W-1} Q_k > Z\right) \le \frac{E_i\left[\exp\left(\lambda \sum_{k=0}^{W-1} Q_k\right)\right]}{e^{\lambda Z}} \le \exp\left(\lambda W\left(\frac{2 e^{\eta\nu_{\max}}}{1 - \rho - 2\rho\lambda} + 1\right) - \lambda Z\right). \]
Choose $\lambda = \frac{1-\rho}{4 e^{\eta\nu_{\max}}}$. We can verify that this choice always satisfies $\lambda < \min\left(\tfrac{1}{2}, \tfrac{1-\rho}{2\rho}\right)$, and that it implies $2\rho\lambda \le (1-\rho)/2$, so that $1 - \rho - 2\rho\lambda \ge (1-\rho)/2$. Therefore, we complete the proof by substituting in for $\lambda$:
\[ P_i\left(\sum_{k=0}^{W-1} Q_k > Z\right) \le \exp\left(\frac{1-\rho}{4 e^{\eta\nu_{\max}}}\left(\frac{4 W e^{\eta\nu_{\max}}}{1-\rho} + W - Z\right)\right). \]
$\square$

Proof of Lemma 9.
By (33), for any constants $W, Z \in \mathbb{Z}_+$,
\[ \{T_i^B \le W\} \cap \left\{\sum_{k=0}^{W-1} Q_k \le Z\right\} \implies \{T_i \le Z\}. \]
We use the union bound on the contrapositive statement to obtain the inequalities
\[ P_i(T_i > Z) \le P_i\left(\{T_i^B > W\} \cup \left\{\sum_{k=0}^{W-1} Q_k > Z\right\}\right) \le P_i(T_i^B > W) + P_i\left(\sum_{k=0}^{W-1} Q_k > Z\right). \qquad (34) \]
Choose $W = 2H_i^B \lceil k/R_i \rceil$ and $Z = 2R_i + k$, where $R_i$ is defined so that both terms on the right hand side of (34) decay geometrically in $k/R_i$: applying Lemma 7 to the restricted chain $\{Y_t\}$ bounds the first term, and applying Lemma 14 bounds the second. Combining the two bounds yields
\[ P_i(T_i > 2R_i + k) \le 2^{-k/R_i}, \]
and therefore
\[ P_i(T_i > k) \le 2^{-(k - 2R_i)/R_i} = 4 \cdot 2^{-k/R_i}. \]
$\square$

Appendix D: Additional Proofs for Results in Section 6
Proof of Theorem 9.
This proof follows similarly to the proof of Theorem 4. By dividing (7) by $(1 - P_i(T_i > \theta^{(k)}))$, it follows that
\[ \left|1 - \frac{E_i[\hat{T}_i^{(k)}]}{(1 - P_i(T_i > \theta^{(k)}))\, E_i[T_i]}\right| = \frac{\left|\pi_i \sum_{k'=\theta^{(k)}}^{\infty} P_i(T_i > k') - P_i(T_i > \theta^{(k)})\right|}{1 - P_i(T_i > \theta^{(k)})} \le \frac{\max\left(\pi_i \sum_{k'=\theta^{(k)}}^{\infty} P_i(T_i > k'),\; P_i(T_i > \theta^{(k)})\right)}{1 - P_i(T_i > \theta^{(k)})}. \]
Then we apply Lemma 9 and use the fact that $P_i(T_i > \theta^{(k)}) < 1/2$ to show that
\[ \left|1 - \frac{E_i[\hat{T}_i^{(k)}]}{(1 - P_i(T_i > \theta^{(k)}))\, E_i[T_i]}\right| \le 2\max\left(\pi_i\,\frac{4 \cdot 2^{-\theta^{(k)}/R_i}}{1 - 2^{-1/R_i}},\; 4 \cdot 2^{-\theta^{(k)}/R_i}\right) = 8 \cdot 2^{-\theta^{(k)}/R_i} \max\left(\frac{\pi_i}{1 - 2^{-1/R_i}},\; 1\right). \qquad (35) \]
Substitute (35) into (18) to complete the proof. $\square$

In order to analyze the distribution over the number of visits to state $j$ on a return path to state $i$, we will use the following lemma, as stated by Aldous and Fill (1999) in Chapter 2, Section 2.2, Lemma 9.

Lemma 15. (Aldous and Fill 1999) For distinct $i, k \in \Sigma$, the expected number of visits to a state $j$ beginning from state $k$ before visiting $i$ is equal to
\[ E_k\left[\sum_{t=1}^{\infty} \mathbb{1}\{X_t = j\}\,\mathbb{1}\{t \le T_i\}\right] = \pi_j\left(E_k[T_i] + E_i[T_j] - E_k[T_j]\right). \]

Proof of Lemma 12.
By definition,
\[ E_i[F_j] - E_i[\hat{F}_j^{(k)}] = E_i\left[\sum_{t=1}^{\infty} \mathbb{1}\{X_t = j\}\,\mathbb{1}\{t \le T_i\}\right] - E_i\left[\sum_{t=1}^{\theta^{(k)}} \mathbb{1}\{X_t = j\}\,\mathbb{1}\{t \le T_i\}\right] \]
\[ = P_i\left(T_i > \theta^{(k)}\right) E_i\left[\sum_{t=\theta^{(k)}+1}^{\infty} \mathbb{1}\{X_t = j\}\,\mathbb{1}\{t \le T_i\} \,\middle|\, T_i > \theta^{(k)}\right] \]
\[ = P_i\left(T_i > \theta^{(k)}\right) \sum_{q\in\Sigma\setminus\{i\}} P_i\left(X_{\theta^{(k)}} = q \mid T_i > \theta^{(k)}\right) E_q\left[\sum_{t=1}^{\infty} \mathbb{1}\{X_t = j\}\,\mathbb{1}\{t \le T_i\}\right]. \]
We divide by $E_i[F_j]$ and use Lemma 15 and Lemma 1(a) to show that
\[ 1 - \frac{E_i[\hat{F}_j^{(k)}]}{E_i[F_j]} = \frac{P_i\left(T_i > \theta^{(k)}\right)}{E_i[F_j]} \sum_{q\in\Sigma\setminus\{i\}} P_i\left(X_{\theta^{(k)}} = q \mid T_i > \theta^{(k)}\right) \pi_j\left(E_q[T_i] + E_i[T_j] - E_q[T_j]\right) \]
\[ = P_i\left(T_i > \theta^{(k)}\right) \sum_{q\in\Sigma\setminus\{i\}} P_i\left(X_{\theta^{(k)}} = q \mid T_i > \theta^{(k)}\right) \frac{E_q[T_i] + E_i[T_j] - E_q[T_j]}{E_i[T_i]}. \qquad (36) \]
By multiplying (8) by $E_i[\hat{T}_i^{(k)}]$, it follows that
\[ 1 - \frac{E_i[\hat{T}_i^{(k)}]}{E_i[T_i]} = P_i\left(T_i > \theta^{(k)}\right) \sum_{q\in\Sigma\setminus\{i\}} P_i\left(X_{\theta^{(k)}} = q \mid T_i > \theta^{(k)}\right) \frac{E_q[T_i]}{E_i[T_i]}. \qquad (37) \]
We use (36) and (37) to show that
\[ \frac{E_i[\hat{F}_j^{(k)}]}{E_i[\hat{T}_i^{(k)}]} - \pi_j = \frac{E_i[F_j]}{E_i[\hat{T}_i^{(k)}]}\left(\frac{E_i[\hat{F}_j^{(k)}]}{E_i[F_j]} - \frac{E_i[\hat{T}_i^{(k)}]}{E_i[T_i]}\right) = \frac{E_i[F_j]}{E_i[\hat{T}_i^{(k)}]}\, P_i\left(T_i > \theta^{(k)}\right) \sum_{q\in\Sigma\setminus\{i\}} P_i\left(X_{\theta^{(k)}} = q \mid T_i > \theta^{(k)}\right) \frac{E_q[T_j] - E_i[T_j]}{E_i[T_i]} \]
\[ = \frac{P_i\left(T_i > \theta^{(k)}\right)}{E_i[\hat{T}_i^{(k)}]} \sum_{q\in\Sigma\setminus\{i\}} P_i\left(X_{\theta^{(k)}} = q \mid T_i > \theta^{(k)}\right) \pi_j\left(E_q[T_j] - E_i[T_j]\right). \]
By Lemma 2,
\[ \frac{E_i[\hat{F}_j^{(k)}]}{E_i[\hat{T}_i^{(k)}]} - \pi_j = \frac{P_i\left(T_i > \theta^{(k)}\right)}{E_i[\hat{T}_i^{(k)}]} \sum_{q\in\Sigma\setminus\{i\}} P_i\left(X_{\theta^{(k)}} = q \mid T_i > \theta^{(k)}\right)\left(Z_{ij} - Z_{qj}\right). \]
$\square$

Appendix E: Chernoff Bounds
Theorem 14 ( Chernoff ’s Multiplicative Bound for Binomials ) . Let { X , X , X , . . . X N } be a sequence of independent identically distributed Bernoulli random vari-ables, such that for all i , X i = 1 with probability p and X i = 0 otherwise. Then for any (cid:15) > , P (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:88) i =1 X i − E [ X ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ (cid:15) E [ X ] (cid:33) ≤ e − (cid:15) Np . Theorem 15 ( Chernoff ’s Multiplicative Bound for Bounded Variables ) . Let { X , X , X , . . . X N } be a sequence of independent identically distributed strictly bounded non-negative random variables, such that X i ∼ X for all i , and X ∈ [0 , θ ] . Then for any (cid:15) > , P (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:88) i =1 X i − E [ X ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ (cid:15) E [ X ] (cid:33) ≤ e − (cid:15) N E [ X ]3 θ . Appendix F: Lyapunov Function Analysis
Theorem 16. (Foster 1953) Let { X t } be a discrete time, irreducible Markov chain on countablestate space Σ with transition probability matrix P . { X t } is positive recurrent if and only if thereexists a Lyapunov function V : Σ → R + , γ > and b ≥ , such that For all x ∈ Σ , E [ V ( X t +1 ) | X t = x ] ≤ ∞ , For all x ∈ Σ such that V ( x ) > b , E [ V ( X t +1 ) − V ( X t ) | X t = x ] ≤ − γ. In words, given a positive recurrent Markov chain, there exists a Lyapunov function V : Σ → R + and a decomposition of the state space into B = { x ∈ Σ : V ( x ) ≤ b } and B c = { x ∈ Σ : V ( x ) > b } such that there is a uniform negative drift in B c towards B and | B | is finite.For any irreducible, Markov chain, the following function is a valid Lyapunov function for γ = 1and b = 0 .
5: Choose any state i ∈ Σ, and fix this as the “central state”. Define the function V : Σ → R + such that V ( i ) = 0 and for all j ∈ Σ \ { i } , V ( j ) = E j [ T i ]. By definition, B = { i } . For all x ∈ Σ,by positive recurrence, E [ V ( X t +1 ) | X t = x ] ≤ ∞ . Similarly, for all x such that V ( x ) > b , V ( x ) = E x [ T i ] = 1 + (cid:88) y ∈ Σ P xy E y [ T i ] = 1 + E [ V ( X t +1 ) | X t = x ] . Therefore, for all x ∈ B c , E [ V ( X t +1 ) − V ( X t ) | X t = x ] = − ≤ − γ. ee, Ozdaglar, and Shah: Estimating Stationary Probability of Single State Theorem 17. (Hajek 1982) Let { X t } be an irreducible, positive recurrent Markov chain on acountable state space Σ with transition probability matrix P . Assume that there exists a Lyapunovfunction V : Σ → R + and values ν max , γ > , and b ≥ satisfying Assumption 1. Let the randomvariable τ B = inf { t : X t ∈ B } . Then for any x such that V ( x ) > b , and for any choice of constants ω > , η , ρ , and λ satisfying < η ≤ min (cid:18) ω, γω e ων max − (1 + ων max ) (cid:19) ,ρ = 1 − γη + ( e ων max − (1 + ων max )) η ω , and < λ < ln( 1 ρ ) , the following two inequalities hold: P [ τ B > k | X = x ] ≤ e η ( V ( x ) − b ) ρ k , and E [ e λτ B | X = x ] ≤ e η ( V ( x ) − b ) (cid:18) e λ − − ρe λ (cid:19) + 1 . A concrete set of constants that satisfy the conditions above are ω = 1 ν max , η = γ e − ν , and ρ = 1 − γ e − ν . (38) Acknowledgments
This work is supported in parts by ARO under MURI awards 58153-MA-MUR and W911NF-11-1-0036, andgrant 56549-NS, and by NSF under grant CIF 1217043 and a Graduate Fellowship.
References
Aldous, D., J. Fill. 1999. Reversible Markov chains and random walks on graphs: Chapter 2 (General Markovchains). book in preparation. ∼ aldous/ RWG/ Chap2.pdf
7, 19–20.Andersen, R., C. Borgs, J. Chayes, J. Hopcraft, V.S. Mirrokni, S.H. Teng. 2007. Local computation ofPageRank contributions.
Proceedings of the 5th international conference on Algorithms and models forthe web-graph . WAW’07, Springer-Verlag, Berlin, Heidelberg, 150–165.Avrachenkov, K., N. Litvak, D. Nemirovsky, N. Osipova. 2007. Monte Carlo methods in PageRank compu-tation: When one iteration is sufficient.
SIAM Journal on Numerical Analysis (2) 890–904.Bahmani, B., A. Chowdhury, A. Goel. 2010. Fast incremental and personalized PageRank. Proc. VLDBEndow. (3) 173–184.Bertsimas, D., D. Gamarnik, J.N. Tsitsiklis. 1998. Geometric bounds for stationary distributions of infiniteMarkov chains via Lyapunov functions. Tech. rep., MIT Sloan School of Management. ee, Ozdaglar, and Shah: Estimating Stationary Probability of Single State
CoRR abs/1202.2771 .Candogan, O., K. Bimpikis, A. Ozdaglar. 2012. Optimal pricing in networks with externalities.
OperationsResearch (4) 883–905.Chasparis, G.C., J.S. Shamma. 2010. Control of preferences in social networks. CDC . 6651–6656.Chen, Y.Y., Q. Gan, T. Suel. 2004. Local methods for estimating PageRank values.
Proceedings of thethirteenth ACM international conference on Information and knowledge management . ACM, 381–389.Diaconis, P. 2009. The Markov chain Monte Carlo revolution.
Bulletin of the American Mathematical Society (2) 179–205.Diaconis, P., L. Saloff-Coste. 1998. What do we know about the Metropolis algorithm? Journal of Computerand System Sciences (1) 20–36.Fogaras, D., B. Racz, K. Csalogany, T. Sarlos. 2005. Towards scaling fully personalized PageRank: Algo-rithms, lower bounds, and experiments. Internet Mathematics (3) 333–358.Foster, F.G. 1953. On the stochastic matrices associated with certain queuing processes. The Annals ofMathematical Statistics (3) 355–360.Golub, G.H., C.F. Van Loan. 1996. Matrix Computations . Johns Hopkins Studies in the MathematicalSciences, Johns Hopkins University Press.Hajek, B. 1982. Hitting-time and occupation-time bounds implied by drift analysis with applications.
Advances in Applied probability
Biometrika (1) 97–109.Haveliwala, T.H. 2003. Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search.Technical Report 2003-29, Stanford InfoLab. Extended version of the WWW 2002 paper on Topic-Sensitive PageRank.Jeh, G., J. Widom. 2003. Scaling personalized web search. Proceedings of the 12th international conferenceon World Wide Web . New York, NY, USA, 271–279.Kamvar, S., T. Haveliwala, C. Manning, G. Golub. 2003. Exploiting the block structure of the web forcomputing PageRank.
Stanford University Technical Report .Koury, J.R., D.F. McAllister, W.J. Stewart. 1984. Iterative methods for computing stationary distributionsof nearly completely decomposable Markov chains.
SIAM Journal on Algebraic Discrete Methods (2)164–186.Levin, D.A., Y. Peres, E.L. Wilmer. 2009. Markov chains and mixing times . Amer Mathematical Society.Metropolis, N., A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller. 1953. Equation of state calcula-tions by fast computing machines.
The Journal of Chemical Physics ee, Ozdaglar, and Shah: Estimating Stationary Probability of Single State
CoRR abs/1209.1688 .Page, L., S. Brin, R. Motwani, T. Winograd. 1999. The PageRank citation ranking: Bringing order to theweb. Technical Report 1999-66, Stanford.Sarma, A. Das, S. Gollapudi, R. Panigrahy. 2011. Estimating PageRank on graph streams.
Journal of ACM (3) 13.Sarma, A. Das, A.R. Molla, G. Pandurangan, E. Upfal. 2012. Fast distributed PageRank computation. CoRR abs/1208.3071 .Shah, D., T. Zaman. 2011. Rumors in a network: Who’s the culprit?