Frustrated Random Walks: A Faster Algorithm to Evaluate Node Distances on Connected and Undirected Graphs
A Fast Method to Calculate Hitting Time Distribution for a Random Walk on a Connected and Undirected Graph
Enzhi Li, Zhengyi Le
Suning R&D Center, Palo Alto, USA
(Dated: October 31, 2019)

With the advent of increasingly large graphs, we need a quick and reliable method to measure the distance and similarity between any pair of nodes on a large graph. One way to measure the distance is to perform random walks on the graph, and plenty of algorithms have already been designed to accomplish this goal. However, most implementations of random walk algorithms are computationally expensive because they rely on Monte Carlo simulations, which can be time-consuming. Here, we propose an alternative measure of the distance between any pair of nodes on a connected and undirected graph using the notion of hitting time for a random walk. We also give an analytical solution to the hitting time distribution of a random walk on a graph. This analytical method, which can be conveniently implemented using SciPy linear algebra packages, runs faster and yields more accurate results than Monte Carlo simulations. We further note that the hitting times also provide a glimpse of the community structure of a graph. This algorithm for measuring the distance between any pair of nodes is specifically devised to measure the influence of a fraudulent user upon all the other users that co-occur in the same social network. We employ our algorithm to weed out potential fraudsters from among tens of thousands of online retail users.
I. INTRODUCTION
With the advent of the era of social media, graph theory has become an increasingly important tool for studying user behavior in such fields as social networks, academic citation, online retail, and web analysis. Various graph algorithms have been proposed to attack the problems people encounter during the study of graphs. For example, in order to find the web pages that are of utmost importance and relevance to a set of keywords, people designed the PageRank algorithm to rank the tens of millions of web pages that are available online [1]. Nodes on a connected graph may form tightly bound communities, which can be detected using an algorithm that aims to maximize the modularity of each potential cluster [2–4]. We can equally well study the division of a connected graph into weakly linked communities via the Laplacian matrix, which is derived from a graph's adjacency matrix [5]. It is also of significant interest to find a metric that can measure the distance between any two nodes on a connected graph, and the node2vec method serves this purpose [6]. All of these algorithms depend on a direct or indirect invocation of a graph's adjacency matrix, which is understandable given the convenience of the adjacency matrix in uniquely identifying a graph, whether directed or undirected.

The detection of potential fraudsters in a social network is an imperative task for online retailers and requires an efficient algorithm for performing label propagation in a connected graph. Users of online retailers may have connections with each other, and we can capture these connections by analyzing users' behavior. For example, different users may share the same shipping address, or they may log into their accounts via the same device. By analyzing data like this, we can create a social network of the users in which each user is represented by a node in a graph. If two users share the same login device, we add an edge between these two users. In this way, we can create a graph that represents the social links between users, who may be normal or fraudulent. In order to protect the normal users from being exploited by fraudsters, we need a method to identify these fraudsters and shield the normal users from them. One way to identify these fraudsters is to employ a set of rules and see who has violated them. However, due to the possibly large number of fraudsters in a social network, it is impractical to weed out all fraudsters purely by hand. Here, we perform label propagation in a social network so that once we have found a handful of fraudsters via rules, we can continue to find more potential fraudsters automatically. The propagation of labels in a social network requires a precise definition of the similarity between users. In this paper, we explore an algorithm that measures the distance (and thus similarity) between any pair of nodes in a graph. By calculating the distances between nodes in a graph, we can gain a deeper and more intuitive understanding of the similarities between apparently disconnected social network users, since the similarity between a pair of users should decrease as their distance increases.

There are already well-developed algorithms for measuring the distance between nodes in a graph, such as the geodesic path distance and the node2vec algorithm. However, the former largely disregards the rich structure of the graph, while the latter is too time-consuming to run. Here, we propose an algorithm for measuring the distance between a pair of nodes by considering the concept of hitting time for a random walk on a graph. The hitting time of a random walk is the number of steps traversed by a random walker before it hits a pre-specified target node for the first time. This hitting time can either be obtained by Monte Carlo simulation, which is again much too time-consuming to be practical for a large graph, or be calculated exactly by an analytical formula, which we derive in this paper. We validate our algorithm by applying it to some real-world problems that we encountered in our daily work. All of these test cases are presented in detail in the main text of this paper.

The organization of the paper is as follows. In section II, we give a brief review of the previous methods for calculating the distance between a pair of nodes. We summarize the advantages and disadvantages of these methods, and explain why we propose an alternative method to calculate the distance using the notion of hitting times in random walks [7]. After outlining our algorithm in section III and detailing it in section IV, we apply our analytical and numerical methods to small and large graphs, respectively, in section V. In the same section, we also highlight the asymmetry of our distance function under the exchange of its two arguments. In section VI, we compare our method with other existing methods. Finally, we conclude in section VII.

II. BACKGROUND
This paper explores the influence of a node on another node, or on a specific set of nodes, in a graph. The PageRank algorithm is a convenient method for measuring how influential and important a single node is to the graph as a whole. Sometimes, however, we also need to know how influential a node is to another specific node, a situation that arises when we want to know how susceptible a community of nodes is to the presence of a labeled user in a social network. For example, we can create a connected graph whose nodes represent the users of an online retailer. If we have already added a label to a user in the social network, then we want to know how that label will propagate among the other users that co-occur with the labeled user. Intuitively, this process of label propagation depends on the distance between each pair of users in the social network. The shorter the distance between two nodes, the more similar these two nodes are to each other, and thus the easier it is to propagate a label from one user to another.

There are already plenty of algorithms that can measure the distance or similarity between each pair of nodes in a graph. However, there is no universal definition of this distance function, and for each specific case, people can devise their own version. Two distance functions that have gained much popularity are the geodesic distance [8] and the cosine distance, which is a byproduct of the node2vec algorithm [6]. The geodesic distance between two nodes in an undirected graph is defined as the length of the shortest path connecting these two nodes. It is calculated by finding the shortest path from one node, say A, to another node, say B, using Dijkstra's algorithm for sparse graphs or Floyd's algorithm for dense graphs, and is therefore deterministic: in effect, we are considering a deterministic walk on the graph. Because it looks only at the shortest path, this distance fails to capture a significant part of the known information about the graph. Disregarding the rich structure of a graph, from which we could have extracted a large amount of valuable information about the relationship between a pair of nodes, constitutes one major disadvantage of this algorithm. In the node2vec algorithm, we calculate the distance between two nodes by first mapping each node in the graph into a dense vector using the word2vec method [9], and then using the cosine distance between two mapped dense vectors as the distance between a pair of nodes. This algorithm, which can be considered an extension of the word2vec algorithm to graphs, requires the pre-existence of a node corpus that can only be generated by performing tens of thousands of random walks on a graph. The generation of this corpus is time-consuming and memory-intensive, thus precluding its application to extremely large graphs.

Another noteworthy point is that both of these algorithms yield symmetric distance functions for any pair of nodes in an undirected graph. The distance function is symmetric in the sense that the distance from node A to node B is guaranteed to be identical to the distance from node B to node A. However, even for an undirected graph such as the friendship social network of Facebook, it is unreasonable to believe that the distance from an influential user to an obscure user should be the same as the distance from an obscure user to an influential user. Since not all users of a social network share identical reputation and influence, we claim that the relationships between social network users are non-equivalent, non-reflective and asymmetric. Thus, a good definition of the distance function between two nodes of a graph should take account of this non-equivalence, non-reflectivity and asymmetry of relationships even for undirected graphs.

As we have noted above, a deterministic walk on a graph tends to be blind to the rich structure of a graph. Therefore, in this paper, we focus our attention on random walks on graphs. There are many scenarios for performing random walks on graphs. One such scenario starts from a node, say A, selects a node, say B, as its target, performs a multitude of random walks starting from A, and counts how many of these walks encounter node B within a pre-specified number of steps. This encountering frequency provides a measure of the distance between nodes A and B: the larger the frequency, the shorter the distance between A and B. This method of measuring the distance between two nodes, although valid in some sense, has several drawbacks. The most prominent is its strong dependence upon such capricious parameters as the maximum number of nodes each random walk is allowed to traverse, and the number of random walks to be performed for the encountering frequency to be statistically stable and meaningful. Another weak point of this method is that it is again much too time-consuming to perform enough random walks to obtain a statistically significant result for two nodes that are located far apart in a huge graph. The application of this random walk scenario to a small graph is no less troublesome, because a random walker starting from one node in a connected graph is guaranteed to reach any other node in the same graph as long as the random walk lasts long enough, which renders the distances between all pairs of nodes almost the same.

Considering the cost of performing a sufficiently large number of random walks on a large graph, and the strong dependence of the final results on hard-to-select hyper-parameters, we prefer to find an alternative method that delivers an exact solution to the random walk problem, thus avoiding the lengthy and tedious process of Monte Carlo simulation from the beginning. For the sake of concreteness, consider a graph in which we have labeled some nodes as "black", some as "white", and some as "unknown", as detailed in Ref. [10]. We can estimate the color of the unknown nodes either by performing a Monte Carlo simulation or by solving a discrete Laplacian equation. The inference of the colors of these unknown nodes is equivalent to performing label propagation in a graph. It is shown in Ref. [10] that the solution of the discrete Laplacian equation gives more accurate results in far less time. Unfortunately, the solution of Laplacian equations requires the pre-existence of boundary conditions, which are not always available [11]. The black and white labels in Ref. [10] are the boundary conditions that make a direct solution of the Laplacian equation feasible. However, if all the known labels are marked black, then the only thing that a solution of the Laplacian equation can tell us is that all the unknown nodes should be black, which is practically useless to us. For example, in order to quantify the influence of a black node on the other nodes that co-occur in a social network, we also need at least one node that is explicitly labeled as "white", a label that is not always available.

In order to avoid these conundrums, we propose a new algorithm that measures the distance between any two nodes in a graph by giving an exact solution to a random walk problem on undirected graphs, just like the analytical solution of the discrete Laplacian equation for color inference described above. The advantage of this algorithm is that the final result is uniquely obtained by solving a sparse linear system. This relieves us of the thorny duty of selecting a set of appropriate parameters, and saves the many CPU hours otherwise spent on Monte Carlo simulations, thanks to the highly efficient numerical linear algebra packages that are readily available for performing sparse matrix multiplications. Another advantage of this algorithm is that it gives us a distance function that is asymmetric between a pair of nodes, reflecting the reality that users in a social network generally have non-equivalent and non-reflective relationships with each other.
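To make the dependence on hand-picked parameters concrete, the encounter-frequency scenario criticized above can be sketched as follows. This is a minimal illustration and not code from the paper: the function name `encounter_frequency`, the adjacency-list representation, the toy graph, and the parameters `max_steps` and `num_walks` are all assumptions introduced here.

```python
import random

def encounter_frequency(adj, start, target, max_steps, num_walks, seed=0):
    """Fraction of random walks from `start` that reach `target`
    within `max_steps` steps (hypothetical helper, not from the paper).

    `adj` maps each node to the list of its neighbors (unweighted graph).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_walks):
        current = start
        for _ in range(max_steps):
            current = rng.choice(adj[current])  # move to a uniformly random neighbor
            if current == target:
                hits += 1
                break
    return hits / num_walks

# Toy graph assumed for illustration: a triangle 0-1-2 with a tail 2-3-4.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}

# The estimate depends strongly on the step budget, and it approaches 1
# for any start node once the budget is large enough.
for max_steps in (2, 10, 200):
    print(max_steps, encounter_frequency(adj, 0, 3, max_steps, 5000))
```

With a short step budget the frequency separates near and far nodes, but the value is only meaningful relative to the chosen `max_steps` and `num_walks`, which is the arbitrariness that the hitting-time approach is meant to remove.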
III. PROPOSED METHOD
In this section, we propose an analytical method for finding expected hitting times of a random walk on an undirected and connected graph $G$. The connectedness of the graph does not constitute a major restriction to our method due to the availability of efficient algorithms for finding connected components of an undirected graph. The adjacency matrix of this graph is $A$, which is a $|V| \times |V|$ matrix ($V$ is the set of vertices in the graph, and $|V|$ is the cardinality of the set), with matrix elements $A_{ij} = 1$ if there is an edge between node $i$ and node $j$, and $A_{ij} = 0$ otherwise. Because we are considering a social network of users who have relationships only with others, we demand that the graph in this paper be simple, meaning that no node has a self-loop. The matrix dimension $|V|$ is the number of nodes in the graph, and the number of non-zero matrix elements of $A$ gives the number of edges (with each undirected edge counted twice, once as $A_{ij}$ and once as $A_{ji}$). $A$ is symmetric if graph $G$ is undirected; otherwise it is generally non-symmetric.

In this paper, we calculate the probability of reaching a target node from any other node in the graph. In a directed graph, a node may not be reachable from another node, so here we only consider undirected and connected graphs, for which the adjacency matrix is always symmetric. Furthermore, our method also applies to weighted graphs, for which $A_{ij} = w > 0$ if the edge from $i$ to $j$ has weight $w$, and $A_{ij} = 0$ if there is no edge between nodes $i$ and $j$.

A random walk from a start node to a target node on a graph is defined as follows:

Algorithm 1 Random walk on a graph
Require: An undirected and connected graph G
  procedure RandomWalk(s, t)        ▷ s is the start node, and t is the target node
    c ← s
    repeat
      r ← a random neighbor of c
      c ← r
    until c = t
  end procedure

For a random walk that starts from node $s$, the hitting time is defined as the number of steps needed for the random walker to reach a target node $t$ for the first time. According to this definition, the hitting time is a random variable that depends on the graph structure, the starting node $s$, and the target node $t$. Therefore, we denote the hitting time as $N_t^{(s)}$. Denote the probability of hitting target $t$ after exactly $n$ steps starting from node $s$ as

$$x_n^{(s)} = P(N_t^{(s)} = n). \tag{1}$$

Assume that node $s$ has $m$ neighboring nodes, of which at most one is the target node $t$. We enumerate these $m$ nodes using indices $i_s = 1, 2, \ldots, m$. Then the probability $P(N_t^{(s)} = n)$ can be recursively represented as

$$P(N_t^{(s)} = n) = \sum_{\substack{i_s = 1 \\ i_s \neq t}}^{m} \frac{w_{s,i_s}}{W_s}\, P(N_t^{(i_s)} = n - 1). \tag{2}$$

In the above equation, $W_s = \sum_{i_s=1}^{m} w_{s,i_s}$ is the total weight associated with node $s$, $w_{s,i_s}/W_s$ represents the probability for the random walker to make a transition from node $s$ to one of its neighbors $i_s$, and $P(N_t^{(i_s)} = n - 1)$ is the probability of hitting target $t$ from node $i_s$ after exactly $n - 1$ steps. Since we require the walker to hit target $t$ for the first time from node $s$ after exactly $n$ steps, the probability of reaching the target starting from the target itself is zero for any non-zero number of steps, i.e., $P(N_t^{(t)} = n) = 0, \forall n > 0$. We therefore exclude $t$ from the summation even if $t$ is one of the neighbors of node $s$, which justifies our notation $i_s \neq t$ in the summation subscript. If we scan all possible starting vertices $s$, we obtain a simultaneous system of difference equations for the hitting probabilities $P(N_t^{(i)} = n), \forall i \in V, i \neq t$, where $V$ is the set of all vertices in graph $G$. Specifically, if a vertex $j$ has only one single neighbor, and this very neighbor is our target $t$, then we can directly write out the probability of reaching target $t$ after exactly $n > 0$ steps from node $j$ as $P(N_t^{(j)} = n) = \delta_{n,1}$, with $\delta_{n,1}$ being the Kronecker $\delta$ function. Since we already know the probability distribution of hitting times for such peculiar nodes, which we call adherents to target $t$, we can ignore those nodes when establishing the simultaneous system of difference equations for $P(N_t^{(s)} = n)$.

Definition 1.
A node in graph $G$ is called an adherent to target node $t$ if and only if this node has the target node as its only neighbor.

With these notations, we can write out a simultaneous system of difference equations for the probabilities of hitting target node $t$ for the first time after exactly $n > 1$ steps starting from node $i \in V$ as

$$P(N_t^{(i)} = n) = \sum_{\substack{j = 1 \\ j \neq t}}^{|V|} B_{ij}\, P(N_t^{(j)} = n - 1), \tag{3}$$

where we have imposed the restriction that the starting node $i$ should not be equal to $t$, and should not be an adherent to target $t$. The matrix $B$ is called the probability transition matrix, the elements of which are

$$B_{ij} = \begin{cases} \dfrac{w_{ij}}{\sum_{i'} w_{ii'}}, \quad i' \in \{\text{neighbors of } i\}, & A_{ij} \neq 0; \\[2ex] 0, & A_{ij} = 0. \end{cases} \tag{4}$$

Most of the time, $B$ is a sparse matrix. Note that although in an undirected graph the adjacency matrix $A$ is always symmetric, the probability transition matrix is generally non-symmetric. Moreover, due to the exclusion of the target node $t$ in the definition of the probability transition matrix, the sum of matrix elements for each row in $B$ is not necessarily equal to 1. In fact, for a connected undirected graph, there is at least one row of $B$ whose sum is less than 1. The rule is that $\sum_j B_{ij} = 1$ if the target node $t$ is not a neighbor of node $i$; otherwise $\sum_j B_{ij} < 1$. Since we have excluded the target node $t$ and all its adherent nodes from the set of starting nodes, the matrix $B$ has a dimension that is smaller than that of matrix $A$. For an undirected graph, the matrix $B$ is guaranteed to be square because a random walker starting from a node that is not the target cannot reach an adherent node without first passing through the target, so adherents never appear as intermediate nodes either. The fact that matrix $B$ has rows whose sums are less than 1 means that this matrix is not a Markov matrix and, because the graph is connected so that the target can be reached from every valid starting node, that all of its eigenvalues have a magnitude smaller than 1. As a result, the spectral radius of matrix $B$ is also smaller than 1. We will take advantage of this fact later in this paper.

We can consider the hitting time $N_t^{(i)}$ as the $i$th component of a column vector $\mathbf{N}_t$. Our aim in this paper is to study the probability distribution of this random vector. Once we have created the probability transition matrix $B$, we can directly write out the expectation values of $\mathbf{N}_t$ as

$$\langle \mathbf{N}_t \rangle = \sum_{n=0}^{\infty} B^n \mathbf{1}, \tag{5}$$

where $\mathbf{1}$ is a column vector of which each element is 1, i.e., $\mathbf{1} = (1\ 1\ \ldots\ 1)^T$. Since the spectral radius of $B$ is smaller than 1, the power series in Eq. [5] converges. By terminating the summation at a power that is high enough, we can obtain numerical results for the expected hitting times with arbitrary precision. We will show that the expected hitting times can be used to measure the distance between two nodes in an undirected graph. We can also obtain higher order moments of $\mathbf{N}_t$, the formulae for which are no more complicated than the one in Eq. [5]. We will show how to calculate all the moments of $\mathbf{N}_t$ in the next section.

IV. THEORETICAL PROOF
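As a quick consistency check on Eq. [5], independent of the generating-function argument of this section, a standard first-step analysis leads to the same formula. Condition on the first move of a walker that starts from a valid node $i$: it always spends one step, and if that step lands on a non-target node $j$ (which happens with probability $B_{ij}$), it then needs $N_t^{(j)}$ further steps. Taking expectations gives

$$\langle N_t^{(i)} \rangle = 1 + \sum_{j \neq t} B_{ij}\, \langle N_t^{(j)} \rangle \quad\Longleftrightarrow\quad (I - B)\, \langle \mathbf{N}_t \rangle = \mathbf{1}.$$

Because the spectral radius of $B$ is smaller than 1, the inverse can be expanded as $(I - B)^{-1} = \sum_{n=0}^{\infty} B^n$, which recovers Eq. [5].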
In the previous section, we gave a formula for calculating the expected hitting times from an arbitrary node to a target node in an undirected graph. In this section, we give the necessary mathematical details for obtaining that formula. In fact, we go beyond this goal by giving a generating function whose derivatives yield moments of any order for the hitting time distribution.

Previously, we obtained a recursive equation for $P(N_t^{(i)} = n)$, $n \geq 2$, which is the probability for a random walker to hit target node $t$ for the first time starting from node $i$ after exactly $n$ steps. By introducing the notation $X_n^{(i)} = P(N_t^{(i)} = n)$, we can rewrite Eq. [3] in matrix form as

$$\mathbf{X}_n = B\, \mathbf{X}_{n-1}, \quad n \geq 2. \tag{6}$$

The solution of this recursion is

$$\mathbf{X}_n = B^{n-1} \mathbf{X}_1, \quad n \geq 1. \tag{7}$$

The initial vector $\mathbf{X}_1$ can be conveniently obtained by the observation that $X_1^{(i)} = 0$ if the target node $t$ is not a neighbor of node $i$, and that $X_1^{(i)} = w_{it} / \sum_{i'} w_{ii'}$, $i' \in \{\text{neighbors of } i\}$, if the target node $t$ is a neighbor of node $i$. Now that we know matrix $B$ and the initial probability vector $\mathbf{X}_1$, we can calculate all the hitting probabilities for any valid starting node. Although we can calculate all the hitting probabilities, most of the time we are more interested in the observable quantities associated with these probability distributions. We can calculate the moments of the probability distribution by invoking their definitions, which are

$$\langle (N_t^{(i)})^m \rangle = \sum_{n=1}^{\infty} P(N_t^{(i)} = n)\, n^m := \sum_{n=1}^{\infty} X_n^{(i)}\, n^m. \tag{8}$$

The expectation and variance of the first hitting time starting from any node $i$ can be easily calculated from the first and second moments of the hitting probability distribution. At first sight, it seems that we need to know all the hitting probabilities before we can calculate their moments. However, we can exploit the fact that the spectral radius of matrix $B$ is less than 1 and directly calculate the moments vector $\langle \mathbf{N}_t^m \rangle$ from the recursive Eq. [6]. To accomplish this, we first define the characteristic function for the probability density function $f(x)$ as

$$\hat{f}(\omega) = \int_{x \in \mathbb{R}} f(x)\, e^{\mathrm{i}\omega x}\, dx. \tag{9}$$

For a discrete series like $X_n^{(i)}$, the probability density function is

$$f^{(i)}(x) = \sum_{n=1}^{\infty} X_n^{(i)}\, \delta(x - n), \tag{10}$$

where $\delta(x - n)$ is the Dirac $\delta$ function with the property that for any continuous function $f(x)$, we always have

$$\int_{x \in \mathbb{R}} f(x)\, \delta(x - x_0)\, dx = f(x_0). \tag{11}$$

The characteristic function of $f^{(i)}(x)$ is

$$\hat{f}^{(i)}(\omega) = \int_{x \in \mathbb{R}} f^{(i)}(x)\, e^{\mathrm{i}\omega x}\, dx = \sum_{n=1}^{\infty} X_n^{(i)}\, e^{\mathrm{i}\omega n}. \tag{12}$$

If we further define $z = e^{\mathrm{i}\omega}$, then the characteristic function can be more compactly rewritten as

$$\tilde{f}^{(i)}(z) = \sum_{n=1}^{\infty} X_n^{(i)}\, z^n. \tag{13}$$

We can read off the expectation value and variance of the hitting time from the first and second derivatives of $\tilde{f}^{(i)}(z)$ as

$$\begin{aligned} \langle N_t^{(i)} \rangle &= \frac{d}{dz}\Big(\tilde{f}^{(i)}(z)\Big)\Big|_{z=1}, \\ \langle (N_t^{(i)})^2 \rangle &= \frac{d^2}{dz^2}\Big(\tilde{f}^{(i)}(z)\Big)\Big|_{z=1} + \frac{d}{dz}\Big(\tilde{f}^{(i)}(z)\Big)\Big|_{z=1}, \\ \mathrm{Var}(N_t^{(i)}) &= \langle (N_t^{(i)})^2 \rangle - \langle N_t^{(i)} \rangle^2. \end{aligned} \tag{14}$$

If we consider $N_t^{(i)}$ as the $i$th component of $\mathbf{N}_t$, and $(N_t^{(i)})^2$ as the $i$th component of $\mathbf{N}_t^2$, which is the component-wise square of vector $\mathbf{N}_t$, then the above three relations can be simplified into the form

$$\langle \mathbf{N}_t \rangle = \tilde{\mathbf{f}}'(1), \tag{15}$$

$$\langle \mathbf{N}_t^2 \rangle = \tilde{\mathbf{f}}''(1) + \tilde{\mathbf{f}}'(1), \tag{16}$$

$$\mathrm{Var}(\mathbf{N}_t) = \langle \mathbf{N}_t^2 \rangle - \langle \mathbf{N}_t \rangle^2. \tag{17}$$

Here, we have defined a vector function $\tilde{\mathbf{f}}(z)$ as

$$\tilde{\mathbf{f}}(z) = \sum_{n=1}^{\infty} \mathbf{X}_n\, z^n. \tag{18}$$

Since the coefficients of $\tilde{f}^{(i)}(z)$ are the hitting probabilities $X_n^{(i)}$, we call it the generating function of hitting probabilities. Plugging the solution of Eq. [6], namely Eq. [7], into the above definition, we get

$$\tilde{\mathbf{f}}(z) = \Big( \sum_{n=1}^{\infty} z^n B^{n-1} \Big) \mathbf{X}_1 = z\, (I - zB)^{-1}\, \mathbf{X}_1. \tag{19}$$

The second equality stems from the fact that $|z| = 1$ and that the spectral radius of matrix $B$ is smaller than 1, so that the geometric series converges.

Since we already know how to calculate the probability transition matrix $B$ and the initial probability vector $\mathbf{X}_1$, we can in principle calculate the generating function exactly from Eq. [19]. However, calculating the inverse of matrix $I - zB$ is no easy task, especially when the graph is huge.
Moreover, the sparsity of matrix $B$, which we should take full advantage of, is generally lost after matrix inversion, which compels us to avoid directly inverting matrix $I - zB$ to calculate hitting probability moments. Therefore, we have devised a trick for finding probability moments without resorting to matrix inversion. For this purpose, we rewrite Eq. [19] as

$$(I - zB)\, \tilde{\mathbf{f}}(z) = z\, \mathbf{X}_1. \tag{20}$$

Taking the first order derivative of both sides with respect to $z$ and setting $z = 1$ yields

$$-B\, \tilde{\mathbf{f}}(1) + (I - B)\, \tilde{\mathbf{f}}'(1) = \mathbf{X}_1. \tag{21}$$

By definition, $\tilde{\mathbf{f}}(1) = (1\ 1\ \ldots\ 1)^T$, because the hitting probabilities from each valid starting node sum to 1, and $\tilde{\mathbf{f}}'(1)$ gives us the first order moment of $\mathbf{X}_n$, which is equal to

$$\tilde{\mathbf{f}}'(1) = (I - B)^{-1} \big( B\, \tilde{\mathbf{f}}(1) + \mathbf{X}_1 \big) = (I - B)^{-1}\, \tilde{\mathbf{f}}(1). \tag{22}$$

The second equality is due to the identity $\tilde{\mathbf{f}}(1) = B\, \tilde{\mathbf{f}}(1) + \mathbf{X}_1$, which can be easily verified by plugging $z = 1$ into Eq. [20]. We can avoid inverting sparse matrices, an operation that destroys the sparsity of a matrix, by noting that Eq. [22] can be rewritten as (remember that the spectral radius of $B$ is smaller than 1)

$$\tilde{\mathbf{f}}'(1) = \sum_{n=0}^{\infty} B^n\, \tilde{\mathbf{f}}(1). \tag{23}$$

It is noteworthy that the first order moments of hitting probabilities starting from each valid vertex do not depend explicitly on the initial probability vector $\mathbf{X}_1$; they depend only on the probability transition matrix $B$. Taking the second order derivative of Eq. [20] and setting $z = 1$ yields $(I - B)\, \tilde{\mathbf{f}}''(1) = 2B\, \tilde{\mathbf{f}}'(1)$, so that

$$\tilde{\mathbf{f}}''(1) = 2B\, (I - B)^{-2}\, \tilde{\mathbf{f}}(1) = 2 \sum_{n=1}^{\infty} n B^n\, \tilde{\mathbf{f}}(1). \tag{24}$$

The pseudocode for calculating the mean and variance of hitting times is shown below:

Algorithm 2 Hitting time calculation algorithm
Require: probability transition matrix B must be square
Require: max iteration number N must be positive
Require: error limit ε must be positive
  procedure HittingTimeCalculator(B, N, ε)
    i ← 0
    d ← B.dimension                          ▷ Dimension of matrix B
    ones ← vector of all 1's, shape = (d, 1)
    zeros ← vector of all 0's, shape = (d, 1)
    power ← ones
    μ ← ones                                 ▷ μ: expectation of hitting times
    var ← zeros                              ▷ var: variance of hitting times
    while i ≤ N do
      i ← i + 1
      power ← B ∗ power                      ▷ Matrix multiplication
      μ ← μ + power
      var ← var + i ∗ power
      error ← norm of i ∗ power
      if error < ε then
        break
      end if
    end while
    var ← 2 ∗ var
    var ← var + μ − μ² (element-wise square)
    return μ, var
  end procedure

Higher order derivatives of Eq. [20] yield higher order moments. Now that we have developed the algorithm for calculating the moments of hitting probabilities using both analytical and numerical methods, we next illustrate the effectiveness of this algorithm using both small and large graphs.

The significance of the first order moment is that it is a measure of the distance from a starting node to a target node. If we already know that the target node is a fraudulent user in the social network, then we can infer that the nodes with average distance smaller than a threshold value are potential fraudulent users. Intuitively, we claim that the smaller the distance between any two nodes, the more similar they are to each other.
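The construction of $B$ in Eq. [4] and the truncated power series of Algorithm 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names `build_transition_matrix` and `hitting_time_moments` are hypothetical, and dense NumPy arrays are used for brevity. Since the second function only relies on `B @ vector` products, a `scipy.sparse` matrix should be usable in place of the dense array for large graphs.

```python
import numpy as np

def build_transition_matrix(A, target):
    """Return (B, nodes): the matrix of Eq. (4) and the node ids of its rows.

    A is a symmetric (possibly weighted) adjacency matrix of a connected
    graph. The target and its adherents (nodes whose only neighbor is the
    target) are excluded from the rows and columns of B.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    weight = A.sum(axis=1)                        # total weight W_i of each node

    def is_adherent(i):
        return np.count_nonzero(A[i]) == 1 and A[i, target] != 0

    nodes = [i for i in range(n) if i != target and not is_adherent(i)]
    P = A / weight[:, None]                       # full transition probabilities
    B = P[np.ix_(nodes, nodes)]                   # restrict to valid start nodes
    return B, nodes

def hitting_time_moments(B, max_iter=100_000, eps=1e-12):
    """Mean and variance of hitting times via the series of Eqs. (23)-(24)."""
    d = B.shape[0]
    power = np.ones(d)         # holds B^i @ 1
    mu = np.ones(d)            # running sum of B^i @ 1
    second = np.zeros(d)       # running sum of i * B^i @ 1
    for i in range(1, max_iter + 1):
        power = B @ power
        mu += power
        second += i * power
        if np.linalg.norm(i * power) < eps:       # summand is negligible
            break
    var = 2.0 * second + mu - mu**2               # f''(1) + f'(1) - f'(1)^2
    return mu, var

# Usage on a path graph 0-1-2-3 with target 2 (node 3 is an adherent).
A_path = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]])
B, nodes = build_transition_matrix(A_path, target=2)
mu, var = hitting_time_moments(B)
print(nodes, mu)   # nodes is [0, 1]; mu is close to [4, 3]
```

The loop keeps only matrix-vector products, so the cost per term is proportional to the number of non-zero entries of $B$, which is the point of avoiding the explicit inverse in Eq. [22].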
V. EXPERIMENTAL EVIDENCE

A. An analytical calculation of hitting time distribution on a simple graph
In this section, we show how to calculate the hitting time distribution on a small graph using analytical methods. This graph contains five nodes, which are denoted as 0, 1, 2, 3, 4, as shown in Fig. [1]. The adjacency matrix of this graph is

$$A = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}. \tag{25}$$

FIG. 1. An undirected graph that is small enough to be solved using analytical formulae. We use node 3 as our target node.
Our target node is 3, and we want to calculate the probability of hitting the target for the first time after exactly $n$ steps starting from each vertex. Since node 4 is an adherent to target 3, we can directly write out its hitting probability as

$$P(N_3^{(4)} = n) = \delta_{n,1}. \tag{26}$$

For nodes 0, 1, 2, we define the probabilities of starting from each node and first arriving at node 3 after exactly $n$ steps. We encapsulate these probabilities into a column vector as

$$\mathbf{X}_n = \begin{pmatrix} P(N_3^{(0)} = n) \\ P(N_3^{(1)} = n) \\ P(N_3^{(2)} = n) \end{pmatrix}. \tag{27}$$

The probability transition matrix is

$$B = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 1/2 & 0 & 1/2 \\ 1/3 & 1/3 & 0 \end{pmatrix}. \tag{28}$$

The probability vector satisfies the equation

$$\mathbf{X}_n = B\, \mathbf{X}_{n-1}, \quad n \geq 2, \tag{29}$$

with initial condition

$$\mathbf{X}_1 = \begin{pmatrix} 0 \\ 0 \\ 1/3 \end{pmatrix}. \tag{30}$$

The generating function for $\mathbf{X}_n$ is

$$\tilde{\mathbf{f}}(z) = z\, (I - zB)^{-1}\, \mathbf{X}_1 = \frac{1}{6 - 3z - 2z^2} \begin{pmatrix} z^2 \\ z^2 \\ 2z - z^2 \end{pmatrix}. \tag{31}$$

We can easily get the first order moments by differentiating the above equation with respect to $z$ at $z = 1$, the results of which are

$$\tilde{\mathbf{f}}'(1) = \begin{pmatrix} \langle N_3^{(0)} \rangle \\ \langle N_3^{(1)} \rangle \\ \langle N_3^{(2)} \rangle \end{pmatrix} = \begin{pmatrix} 9 \\ 9 \\ 7 \end{pmatrix}. \tag{32}$$

The above result means that the average distances from nodes 0 and 1 to node 3 are equal, both being 9, and the average distance from node 2 to node 3 is 7. We interpret these results as demonstrating that nodes 0 and 1 have equal distance to node 3, whereas node 2 is nearer to node 3 than both 0 and 1. Node 4, being an adherent to target node 3, always has an average distance of 1 to the target. These results are consistent with our intuition and indicate that node 4 is most susceptible to the influence of node 3, node 2 is second most susceptible, and nodes 0 and 1 are least susceptible to its influence.

In order to further test the analytical results, we also perform a Monte Carlo simulation for the random walk on this graph. In the Monte Carlo program, we use node 3 as our target node, and start each random walk from node 0, 1, or 2. Each random walk terminates at node 3 after some number of steps. By repeating this process many times, we obtain the average number of steps required before the random walker finally reaches the target. For each node in the set {0, 1, 2}, we perform $10^{\ldots}$ random walks, each of which yields a step number, and then we calculate the mean value of these $10^{\ldots}$ numbers. The Monte Carlo simulation results are close to the analytical results, and both are shown in Table I. We can see that the Monte Carlo simulation results are consistent with our analytical results, with relative errors of approximately $10^{-\ldots}$. We do not expect Monte Carlo simulation to give us high precision numerical results, and the final results of Monte Carlo simulations may vary slightly for different random number generators.

Using the algorithm outlined in the previous section, we can equally well compute the average step number starting from each of the nodes 0, 1, 2 using numerical methods. We can use either Eq. [22] or Eq. [23] for this purpose, because the probability transition matrix is small enough for direct matrix inversion to be feasible. However, Eq. [22] is no longer practical when the graph is large, and thus even for this small graph, we still prefer to use Eq. [23], where we need to calculate the sum of an infinite power series. Because this series converges quickly, we impose a cutoff condition such that the summation terminates when the norm of the summand vector is smaller than a pre-specified error limit $\epsilon$, which we choose to be $10^{-\ldots}$ here. We run a Python implementation of the algorithm listed in the previous section on macOS Mojave, and obtain numerical results that are shown together with analytical results and Monte Carlo simulation results in Table I. It is clear that the numerical results obtained using our algorithm have a much higher precision than the Monte Carlo simulation results.
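The Monte Carlo check described above can be sketched with the Python standard library alone. This is an illustrative sketch, not the authors' program: the edge list below (a triangle 0-1-2 plus the edges 2-3 and 3-4) is the one consistent with the transition matrix and hitting times of this subsection and should be treated as an assumption, and the names `simulate_hitting_time` and `mean_hitting_time`, the walk count, and the seed are hypothetical choices.

```python
import random
from statistics import mean

# Assumed edge structure of the five-node example: triangle 0-1-2, then 2-3 and 3-4.
ADJ = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}

def simulate_hitting_time(adj, start, target, rng):
    """Run one random walk (Algorithm 1) and return the number of steps taken."""
    current, steps = start, 0
    while True:
        current = rng.choice(adj[current])   # uniformly random neighbor
        steps += 1
        if current == target:
            return steps

def mean_hitting_time(adj, start, target, num_walks=20_000, seed=1):
    """Average hitting time over `num_walks` independent walks."""
    rng = random.Random(seed)
    return mean(simulate_hitting_time(adj, start, target, rng)
                for _ in range(num_walks))

# Estimates should scatter around the analytical values 9, 9, 7 (and exactly 1
# for the adherent node 4), with a statistical error that shrinks only as
# 1/sqrt(num_walks).
for start in (0, 1, 2, 4):
    print(start, mean_hitting_time(ADJ, start, 3))
```

The slow $1/\sqrt{\text{num\_walks}}$ convergence is what limits the precision of the Monte Carlo column relative to the series summation of Eq. [23].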
TABLE I. Expected hitting times to target node 3.

$\langle N_3^{(i)} \rangle$      Analytical    Monte Carlo    Numerical
$\langle N_3^{(0)} \rangle$      9             […]            […]
$\langle N_3^{(1)} \rangle$      9             […]            […]
$\langle N_3^{(2)} \rangle$      7             […]            […]

$\langle N_t^{(i)} \rangle$ is the expected hitting time for a random walk that starts from node $i$ and ends at target node $t$. Here, $t = 3$. The above table lists the expected hitting times for random walkers to first reach target node 3 starting from nodes 0, 1, and 2, respectively. We obtain these results using three methods: analytical, Monte Carlo simulation, and numerical. Both the Monte Carlo simulation results and the numerical computation results are consistent with the analytical results, although the numerical results have much higher precision, which justifies our introduction of the numerical algorithm for attacking this problem.

B. Numerical computation of hitting time distribution on large graphs
When dealing with large graphs, it is both tedious and impractical to derive an analytical formula such as Eq. [31]. Instead, we resort to Eq. [23] to find numerical values of the hitting times. Another way to find the hitting times is Monte Carlo simulation, although we will see that for a large graph the running time of Monte Carlo simulation is much longer than that of the numerical method, which makes Monte Carlo simulation an inferior alternative to Eq. [23]. In this section, we apply the numerical method and the Monte Carlo simulation method to the connected graph shown in Fig. [2].
FIG. 2. An undirected graph with 100 vertices and 740 edges. This graph contains two communities, and is generated by the rule that each pair of vertices within the same community is connected by an edge with probability $p_{\mathrm{in}} = 0.3$, whereas each pair of vertices from different communities is connected by an edge with probability $p_{\mathrm{out}} = 0.[\ldots]$

We will calculate the hitting times from each vertex in the graph to the target node, which is chosen to be node 1. In order to visualize the results, we sort the vertices in the graph according to their hitting times to the target node. In Fig. [3], we plot the results from the Monte Carlo simulation method and the numerical method. We can see from the figure that these two methods give almost the same results, although the results from Monte Carlo simulation have a precision that is much lower than that obtained from the numerical method. Another weak point of Monte Carlo simulation is that it is much more time-consuming than the numerical method. In order to obtain the results shown in Fig. [3], we need to run $10^{[\ldots]}$ random walks from each vertex in the graph except the target node, and the whole process takes about 157 seconds, whereas in the numerical method we only need to compute the power series in Eq. [23] up to 6180 terms, and it takes only 3.06 seconds to obtain results with machine precision. In fact, the larger the graph, the more time the numerical method saves compared to the Monte Carlo simulation method.

[Figure 3 plot: hitting times versus sorted vertices; legend: Monte Carlo results, Numerical results]
FIG. 3. An undirected graph with 100 vertices and 740 edges. This graph contains two communities, and is generated by the rule that each pair of vertices within the same community is connected by an edge with probability $p_{\mathrm{in}} = 0.3$, whereas each pair of vertices from different communities is connected by an edge with probability $p_{\mathrm{out}} = 0.[\ldots]$

Another feature worth mentioning in Fig. [3] is that we can distinguish the two communities in the original graph by looking at the distribution of hitting times with respect to vertices. It is clear that there is a transition region which connects two plateaus in the hitting time vs. sorted vertices curve. The two plateaus correspond to the two communities in the graph. We can interpret the emergence of these two plateaus by noting that a random walker that starts from within one of the two communities tends to get trapped in that community. Once the random walker is trapped in a community, the hitting times do not change substantially from vertex to vertex within that community, which gives rise to a plateau in the curve. However, as soon as the random walker finds a bridge leading from one community to the other, it makes a rapid transition across the two communities, and consequently the hitting times experience a significant change. Thus, the calculation of hitting times provides us with a tool for community detection as a by-product. However, we should add the caveat that this method of community detection is usable only when the number of communities in the graph is small enough and the communities are clearly separated from each other. Otherwise, this method of community detection is not as good as the ones compiled in Ref. [5].
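The plateau structure described above can be explored with a small Python sketch. It is illustrative only and rests on several assumptions: the graph is drawn from a two-block planted-partition model with 50 vertices per community, `p_out = 0.01` is a hypothetical stand-in because the value used for Fig. 2 is not legible here, the hitting times are obtained by a dense linear solve rather than by the power series of Eq. [23], and `split_at_largest_gap` is a crude heuristic introduced for illustration, not the authors' procedure.

```python
import numpy as np


def planted_partition(n_per_block, p_in, p_out, rng):
    """Symmetric 0/1 adjacency matrix with two equal-sized communities."""
    n = 2 * n_per_block
    labels = np.repeat([0, 1], n_per_block)
    same = labels[:, None] == labels[None, :]
    prob = np.where(same, p_in, p_out)
    upper = np.triu(rng.random((n, n)) < prob, k=1)  # sample each pair once
    adjacency = (upper | upper.T).astype(float)
    return adjacency, labels


def is_connected(adjacency):
    """Depth-first search from node 0 over the adjacency matrix."""
    n = adjacency.shape[0]
    seen = np.zeros(n, dtype=bool)
    seen[0] = True
    stack = [0]
    while stack:
        v = stack.pop()
        for w in np.flatnonzero(adjacency[v]):
            if not seen[w]:
                seen[w] = True
                stack.append(w)
    return bool(seen.all())


def sample_connected_graph(n_per_block, p_in, p_out, seed):
    """Redraw the random graph until it is connected."""
    rng = np.random.default_rng(seed)
    while True:
        adjacency, labels = planted_partition(n_per_block, p_in, p_out, rng)
        if is_connected(adjacency):
            return adjacency, labels


def expected_hitting_times(adjacency, target):
    """Expected hitting times to target from all other nodes via (I - B) h = 1."""
    n = adjacency.shape[0]
    P = adjacency / adjacency.sum(axis=1, keepdims=True)
    keep = np.array([v for v in range(n) if v != target])
    B = P[np.ix_(keep, keep)]  # transitions among non-target nodes
    h = np.linalg.solve(np.eye(n - 1) - B, np.ones(n - 1))
    return keep, h


def split_at_largest_gap(nodes, h):
    """Hypothetical heuristic: cut the sorted curve at its largest jump."""
    order = np.argsort(h)
    cut = int(np.argmax(np.diff(h[order]))) + 1
    return nodes[order[:cut]], nodes[order[cut:]]


if __name__ == "__main__":
    adjacency, labels = sample_connected_graph(50, 0.3, 0.01, seed=0)
    nodes, h = expected_hitting_times(adjacency, target=1)
    low, high = split_at_largest_gap(nodes, h)
    print("edges:", int(adjacency.sum() / 2))
    print("sorted hitting times:", np.sort(h))
    print("sizes of the two groups:", len(low), len(high))
```

Plotting the sorted hitting times from this sketch gives a curve of the kind discussed above. How cleanly the largest gap separates the two communities depends on `p_out` and on the particular random graph, which is the caveat stated in the text.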
C. Directional distances and non-reciprocal relationships
The distance between a pair of nodes can be thought of as a function that maps two nodes to a non-negative real number whose value represents the distance between these two nodes, i.e., $d : (u, v) \mapsto \mathbb{R}^{+}$, where $u, v$ are two nodes in a graph. Previous work defines the distance function between a pair of nodes with the implicit assumption that this function remains invariant under the exchange of its two arguments, which means $d(u, v) = d(v, u)$. This property, which we call the symmetry of the distance function, is however undesirable for our situation, where we want to study the influence of a node $A$ upon another node $B$. It is a truth universally acknowledged that not all users in a social network are equally influential, and thus we do not expect the mutual influence between a pair of nodes to be the same regardless of the direction. Based on this consideration, we require that the distance function for our case be directional even if we restrict our attention to undirected graphs. Take the undirected graph shown in Fig. 4 as a concrete example. In this graph, where all edges are assumed to have equal weight, we do not expect the distance from node 10 to node 11 to be identical to the distance from node 11 to node 10. It is obvious from the figure that node 10, which belongs to a clique that consists of nodes 1–10, should have a strong association with nodes 1– […]

FIG. 4. A figure that illustrates the directedness or asymmetry of the distance function between a pair of nodes. In this figure, we do not expect the distance from node 10 to 11 to be identical to the distance in the reverse direction due to the fact that node 10 is more closely associated with nodes 1– […]

[…] suitable for describing the influence that one node exerts upon another one than other symmetric distance functions.

VI. COMPARISON WITH EXISTING METHODS

VII. CONCLUSION AND FUTURE WORK
In this paper, we have derived an analytical formula for calculating the hitting time from a starting vertex to a target vertex in a connected undirected graph. This method relies on the probability transition matrix, which can be calculated conveniently from the graph's adjacency matrix. We also propose a quick method for implementing this formula in Python, without the need to invert a possibly huge sparse matrix. Since hitting time is a core concept in random walks and can be simulated directly using the Monte Carlo method, we can also obtain an approximate value of the hitting time using Monte Carlo simulation. We tested our formula by applying the analytical method and Monte Carlo simulations to undirected connected graphs, and showed that the two methods give similar results within the tolerance of error. The advantage of the analytical method over Monte Carlo simulation is that the former is much quicker and more accurate than the latter. Our calculation of the hitting times for vertices in a graph can also give us a glimpse of the community structure of the graph on which we perform random walks, although this method for detecting communities is not as good as other existing algorithms when the communities in the graph are numerous and not clearly separated.
ACKNOWLEDGMENTS

[1] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
[2] Aaron Clauset, M. E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Phys. Rev. E, 70:066111, Dec 2004.
[3] Mark E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006.
[4] M. E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E, 74:036104, Sep 2006.
[5] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.
[6] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016.
[7] Sheldon M Ross, John J Kelly, Roger J Sullivan, William James Perry, Donald Mercer, Ruth M Davis, Thomas Dell Washburn, Earl V Sager, Joseph B Boyce, and Vincent L Bristow. Stochastic Processes, volume 2. Wiley New York, 1996.
[8] Bart Baesens, Veronique Van Vlasselaer, and Wouter Verbeke. Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection. John Wiley & Sons, 2015.
[9] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[10] Peter G Doyle and J Laurie Snell. Random walks and electric networks. arXiv preprint math/0001057, 2000.
[11] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In