Graph Tikhonov Regularization and Interpolation via Random Spanning Forests
Yusuf Yigit Pilavci, Pierre-Olivier Amblard, Simon Barthelmé, Nicolas Tremblay
CNRS, Univ. Grenoble Alpes, Grenoble INP, GIPSA-lab, Grenoble, France
Abstract—Novel Monte Carlo estimators are proposed to solve both the Tikhonov regularization (TR) and the interpolation problems on graphs. These estimators are based on random spanning forests (RSF), whose theoretical properties enable us to analyze the estimators' mean and variance. We also show how to perform hyperparameter tuning for these RSF-based estimators. Finally, since TR and interpolation are building blocks of several algorithms, we show how the proposed estimators can be easily adapted to avoid expensive intermediate steps in well-known algorithms such as generalized semi-supervised learning, label propagation, Newton's method and iteratively reweighted least squares. In the experiments, we illustrate the proposed methods on several problems and provide observations on their run time, which is comparable with the state of the art.
Index Terms—graph signal processing, random spanning forests, smoothing, interpolation, semi-supervised learning, label propagation, Newton's method, IRLS
I. INTRODUCTION

Graphs are ubiquitous models of complex structures, e.g. social, transportation, sensor or neuronal networks. The vertices and edges, the main components of graphs, are natural representations of the elements of a network and the links between them, respectively. In many applications, these networks come with data on the elements. For example, in a transportation network (the roads and their intersections are respectively the nodes and edges), the data can be traffic flow observations of each road [1]; or in a brain network, it could be the activity of each individual brain region [2]. Such data over vertices are called graph signals [3], [4].
Graph Tikhonov regularization.
Consider a graph of size n represented by its Laplacian matrix L = D − W ∈ R^{n×n} (W is the weighted adjacency matrix and D the diagonal matrix of degrees; see formal definitions in Section II). Given noisy signal measurements y = (y_1, y_2, ..., y_n)^⊤ on the n vertices of this graph, a classical denoising scheme is graph Tikhonov regularization [3], [5]:

    x̂ = argmin_{z ∈ R^n} µ ||y − z||² + z^⊤ L z    (1)

where µ > 0 is a hyperparameter tuning the relative importance the user wants to set between the data-fidelity term ||y − z||² and the regularization term z^⊤ L z. Note that z^⊤ L z = ½ Σ_{i,j} W_{ij} (z_i − z_j)², which explains its regularizing property: this term penalizes large variations along the edges of the graph. The exact solution to (1) is:

    x̂ = K y  with  K = (L + µI)^{−1} µI ∈ R^{n×n}

which requires the inversion of the regularized Laplacian.
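For small graphs, this exact solution is cheap to obtain with a direct linear solve instead of an explicit inversion. The following minimal Julia sketch illustrates this; the toy adjacency matrix, µ and y are illustrative assumptions, not values from the paper:

```julia
# Minimal sketch of the exact solution of Eq. (1) via a direct solve,
# on an assumed toy graph (a weighted 4-cycle). Solves (L + µI) x̂ = µ y.
using LinearAlgebra, SparseArrays

W = sparse([0.0 1 0 1; 1 0 1 0; 0 1 0 1; 1 0 1 0])  # toy adjacency matrix
L = Diagonal(vec(sum(W, dims=2))) - W                # graph Laplacian L = D - W
μ = 2.0
y = [0.9, 1.1, -1.0, -0.8]                           # noisy measurements
x̂ = (L + μ*I) \ (μ * y)                             # x̂ = (L + µI)⁻¹ µ y = K y
```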
Graph interpolation. Consider a graph signal x ∈ R^n that is only known over a vertex subset ℓ ⊂ V (with typically |ℓ| ≪ n). The graph interpolation problem consists in combining this prior information with an underlying smoothness assumption in order to infer the signal on the remaining vertices. One way to formulate this problem is given by Pesenson [6]:

    x̂ = argmin_{z ∈ R^n} z^⊤ (L + µI) z  subject to  ∀i ∈ ℓ, z_i = x_i    (2)

where µ ≥ 0 is a user-defined parameter. Note that Pesenson chooses to further parametrize the interpolation problem by considering (L + µI)^t for t > 0 instead of (L + µI) in Eq. (2). In this paper, we only consider the case t = 1. The exact solution of (2) can be found in [6] and requires the inversion of L + µI restricted to the rows and columns indexed by the vertices that are not in ℓ (more details are provided in Section III).

Graph TR and interpolation as a building block. Both graph TR and interpolation often appear as building blocks of other graph problems. Node classification is one problem that is sometimes solved by graph interpolation. A well-known semi-supervised learning (SSL) algorithm called label propagation (a name that may refer to a more generic set of algorithms; in this paper, we refer to the algorithm proposed by Zhu et al. [7]) corresponds to the Dirichlet boundary problem on graphs [8], whose solution can be viewed as finding interior values that smoothly interpolate between the known boundary values [9]. This problem is a sub-case of Eq. (2), for µ = 0. Graph TR, on top of its use for graph signal denoising, also appears in SSL algorithms for node classification, such as in the work of Zhou et al. [10], later generalized by Avrachenkov et al. [11]. Moreover, graph TR is also used in graph optimization algorithms. Two examples are Newton's method [12] and iteratively reweighted least squares (IRLS) [13]. In both methods, the computationally expensive steps can be formulated and solved as graph TR problems.
Classical approaches. Explicitly computing the exact solution for graph TR requires the inversion of the regularized Laplacian L + µI of size n × n, and thus O(n³) elementary operations. For graph interpolation, the overall cost is also O(n³) in the typical setting where |ℓ| ≪ n. For large graphs, this is prohibitive and the state of the art relies on approximate methods. These approaches may be roughly separated in two groups: iterative methods (e.g. the conjugate gradient method with preconditioning [14]) and polynomial approximations (e.g. Chebyshev polynomials [15]). Both classes of methods run in time linear in the number of edges.
Random processes on graphs.
A longstanding and fruitful approach to studying the properties of graphs has been to study the properties of random processes on graphs (via random walks, for instance). This paper takes such a perspective to propose novel estimators for the two problems presented above. For instance, a very well-known fact is the link between the smallest non-null eigenvalue of the Laplacian matrix and the mixing time of a random walk on the graph (see, e.g., [16]). Other examples include properties of electrical networks, such as potential functions or effective resistances, that can be interpreted in terms of probabilistic quantities defined for random walks, such as hitting time probabilities [17]. Closer to our work, Wu et al. [18], [19] show that the (i,j)-th entry of K equals the probability that an interrupted random walk (a partially absorbed random walk) starting at node i ends in node j; they further leverage random walks to give practical insights for the algorithms of several applications, including image retrieval, local clustering and semi-supervised learning.
Random spanning forests. In this paper, we focus on random spanning forests (RSFs): random collections of disjoint trees that cover all nodes in the graph (a formal definition is given in Section II). The link between the matrix K and random spanning forests has been observed by several authors in the past, such as Avrachenkov et al. [20]. Avena et al. [21], [22] analyze precisely several aspects of this connection. RSFs not only have a rich theoretical background (connections with Markov chains, determinantal point processes, spectral graph theory), they also come with an efficient sampling algorithm [21]: a variant of Wilson's algorithm [23] based on loop-erased random walks.

Our contributions. In this work:
• We propose two novel Monte Carlo estimators based on RSFs to approximate the solution of graph Tikhonov regularization and interpolation.
  – We provide a rigorous analysis of their performance by building upon known results on RSFs.
  – In terms of computational cost, they are comparable with state-of-the-art methods.
  – By coupling these estimators with certain statistics of RSFs, we provide a scheme to correctly tune the hyperparameters of the problems.
• We show how versatile these estimators are by adapting them to several graph-based problems such as generalized semi-supervised learning, label propagation, Newton's method and iteratively reweighted least squares (IRLS).
A preliminary version of some of these results can be found in [24]. The Julia code to reproduce this paper's results is available on the authors' website (https://y2p.github.io/files/codes/rsf_journal_codes.tar.xz).
Organization of the paper. We start with the necessary background on graphs and RSFs in Section II. Then, we introduce the proposed methods in Section III. In Section IV, we examine several extensions to different graph-related problems. Finally, in Section V, we illustrate the methods in different use cases, and we conclude in Section VI.

II. BACKGROUND ON RSFS

This section contains background on graph theory, random spanning trees and forests.
A. Graph theory
A directed weighted graph G = (V, E, w) consists of a set of vertices V = {1, 2, ..., n} and a set of edges E ⊆ V × V. The weight function w : V × V → R⁺ maps each edge in E to a positive weight and all other pairs to 0. A graph is called undirected if w(i,j) = w(j,i) for all distinct vertices i and j. In the following, unless otherwise specified, only undirected graphs are considered. Graphs are often represented using matrices, and several matrix descriptions are available.

The weighted adjacency matrix, or weight matrix, is W = [w(i,j)]_{i,j} ∈ R^{n×n}. The degree matrix is the diagonal matrix D ∈ R^{n×n} with D_{i,i} = Σ_{j∈N(i)} w(i,j), where N(i) is the set of nodes connected to i. The graph Laplacian matrix is defined as L = D − W. It is positive semi-definite [16] and its eigenvalues and eigenvectors are denoted by λ_1 ≤ λ_2 ≤ ... ≤ λ_n and U = (u_1|u_2|...|u_n), respectively. The multiplicity of the eigenvalue 0 is equal to the number of connected components in the graph [16]. For undirected graphs, all of these square matrices are symmetric.

Another way to represent graphs is via the edge incidence matrix B = (b_1|b_2|...|b_m)^⊤ ∈ R^{m×n}, where b_k ∈ R^n is a vector associated to the k-th edge (i,j). The only nonzero entries of b_k are b_k(i) = ±√w(i,j) and b_k(j) = ∓√w(i,j). In directed graphs, the sign is set by considering the edge orientations: for example, if the k-th edge starts from i and ends in j, then b_k(i) < 0 and b_k(j) > 0. In undirected graphs, the signs of the two nonzero entries of b_k can be chosen arbitrarily, as long as they are opposite. Although this matrix seems less natural than the others, it often appears in graph theory. One example is the well-known identity L = B^⊤B.
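As a quick illustration of these definitions, the following Julia sketch builds W, L and B for a small assumed toy graph and checks the identity L = B^⊤B numerically:

```julia
# Sketch: matrix representations of a small undirected graph, and a numerical
# check of L = BᵀB. The edge list below is an illustrative assumption.
using LinearAlgebra, SparseArrays

edges = [(1, 2, 1.0), (2, 3, 2.0), (3, 4, 1.0), (4, 1, 1.0)]
n, m = 4, length(edges)

W = spzeros(n, n)                 # weighted adjacency matrix
B = zeros(m, n)                   # edge incidence matrix
for (k, (i, j, wij)) in enumerate(edges)
    W[i, j] = W[j, i] = wij
    B[k, i] =  sqrt(wij)          # opposite signs, arbitrary orientation
    B[k, j] = -sqrt(wij)
end
L = Diagonal(vec(sum(W, dims=2))) - W

@assert B' * B ≈ L                # the identity L = BᵀB
```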
B. Random spanning trees and Wilson's algorithm

Let us recall the definition of random spanning trees (RSTs). Consider a graph G = (V, E, w). A subgraph of G is a graph whose vertex and edge sets are subsets of V and E, respectively, and whose edge weights are given by w. A subgraph contains a cycle whenever there exists a pair of vertices (u, v) that are connected via (strictly) more than one path. If there exists no such pair, the subgraph is called a tree. A spanning tree τ = (V_τ, E_τ, w_τ) is a tree whose vertex set V_τ is equal to V. A rooted spanning tree τ_r is a directed spanning tree in which all edges are directed towards a node called the root. See Fig. 1 for illustrations.

Fig. 1: From left to right, a graph G, a spanning tree on G, a rooted spanning tree on G and a rooted spanning forest on G (roots are colored in red).

Fig. 2: All possible rooted spanning trees associated with a given undirected spanning tree. For four vertices, four different rooted trees exist.

Our work is related to a particular distribution on spanning trees called random spanning trees (RSTs). An RST T is a spanning tree randomly generated from the following distribution:

    P(T = τ) ∝ Π_{(i,j)∈τ} w(i,j)    (3)

Note that this distribution becomes uniform over all possible spanning trees whenever the given graph is unweighted, i.e. ∀i,j ∈ V, w(i,j) ∈ {0,1}:

    P(T = τ) = 1/|T|    (4)

where T is the set of all spanning trees. In this particular case, the random tree T is also known as a uniform spanning tree (UST) in the literature.

In his celebrated work [23], Wilson proposes an algorithm, called RandomTreeWithRoot, that samples a random spanning tree from the set of all spanning trees rooted in node r. Wilson also shows that, in the case of undirected graphs, sampling an unrooted RST amounts to: i/ choosing a root uniformly at random, ii/ running RandomTreeWithRoot, and iii/ erasing the orientation.
C. Random spanning forests
A forest is a set of disjoint trees. When all the trees in a forest are rooted, it is called a rooted forest. A rooted spanning forest, generically denoted by φ, reaches all the vertices in the graph. Let ρ be the function that maps any rooted spanning forest to its set of roots. The number of roots |ρ(φ)| is between 1 and n. For |ρ(φ)| = 1, φ corresponds to a rooted spanning tree. See Fig. 1 for illustrations.

Random spanning forests. Let F be the set of all rooted spanning forests. A random spanning forest (RSF) is a random variable whose outcome space is F. Among many possible options, we focus on the following parametric distribution for RSFs. For a fixed parameter q > 0, Φ_q is a random variable in F verifying:

    ∀φ ∈ F,  P(Φ_q = φ) ∝ q^{|ρ(φ)|} Π_{(i,j)∈φ} w(i,j).    (5)

An algorithm [22] to sample from this distribution is derived from RandomTreeWithRoot. This algorithm:
1) extends the graph G = (V, E, w) by adding a node called Γ;
2) connects each node i in V to Γ with an edge of weight w(i,Γ) = q;
3) runs RandomTreeWithRoot with Γ as the root, to obtain a spanning tree rooted in Γ in the extended graph;
4) deletes the edges incident to Γ in the obtained tree, to yield a forest in the original graph.
The result is a rooted spanning forest whose root set is formed by the nodes that were neighbors of Γ. For every distinct spanning tree rooted at Γ, a distinct spanning forest is obtained after removing the root and its incident edges. This one-to-one relation ensures that the algorithm indeed samples from the distribution in (5):

    P(T_Γ = τ_Γ) = P(Φ_q = φ) ∝ Π_{(i,j)∈τ_Γ} w(i,j)
                 ∝ Π_{(i,Γ)∈τ_Γ} q · Π_{(i,j)∈τ_Γ, j≠Γ} w(i,j)
                 ∝ q^{|ρ(φ)|} Π_{(i,j)∈φ} w(i,j)    (6)

An implementation of this algorithm is detailed in Algorithm 1. In the algorithm, rand (line 7) returns a uniform random value between 0 and 1, and RandomSuccessor (line 11) returns a random node i from N(u) with probability w(u,i)/Σ_{j∈N(u)} w(u,j). At termination, the array Next contains all the necessary information to build the sampled spanning forest.

The expected run time of RandomForest is the expected number of calls of RandomSuccessor before termination. For RandomTreeWithRoot, the number of calls equals the mean commute time, i.e., the expected length of a random walk going from node i to j and back (see Theorem 2 in [23]). In Proposition 1 of [25], Marchal rewrites this commute time in terms of graph matrices. Adapting his result to the current setting, the expected run time of RandomForest can be shown to equal the trace of (L + qI)^{−1}(D + qI), a rough upper bound of which is n + 2|E|/q, which is linear in the number of edges.

Varying q over nodes. The original graph can also be extended by setting w(i,Γ) ← q_i > 0, ∀i ∈ V, that is, by connecting the added node Γ with links of unequal weights.
Algorithm 1 RandomForest
Inputs: G = (V, E, w), q ∈ R⁺
 1: ∀i ∈ V, InForest[i] ← false
 2: ∀i ∈ V, Next[i] ← −1
 3: ∀i ∈ V, d[i] ← Σ_{j∈N(i)} w(i,j)
 4: for i ← 1 to |V| do
 5:   u ← i
 6:   while not InForest[u] do          ▷ walk until u is in the forest
 7:     if rand ≤ q/(q + d[u]) then     ▷ u becomes a root
 8:       InForest[u] ← true            ▷ add u to the forest
 9:       Next[u] ← −1                  ▷ u points to null
10:     else
11:       Next[u] ← RandomSuccessor(u, G)
12:       u ← Next[u]
13:     end if
14:   end while
15:   u ← i
16:   while not InForest[u] do          ▷ freeze the loop-erased path from i
17:     InForest[u] ← true
18:     u ← Next[u]
19:   end while
20: end for
21: return Next

In this case, the distribution of sampled forests becomes:

    P(Φ_Q = φ) ∝ Π_{i∈ρ(φ)} q_i Π_{(i,j)∈φ} w(i,j),  φ ∈ F    (7)

where Q = {q_1, q_2, ..., q_n} is the collection of parameters. Algorithm 1 is easily adapted by replacing the scalar input q with Q = {q_1, ..., q_n}, and the ratio q/(q + d[u]) with q_u/(q_u + d[u]) at the root-selection step (line 7). In addition, the average run time in this case becomes tr((L + Q)^{−1}(D + Q)), with Q = diag(q_1, ..., q_n).

Random partitions. A partition of V, denoted by P, is a set of disjoint subsets whose union equals V. The trees of Φ_q give a random partition of V by splitting it into |ρ(Φ_q)| disjoint subsets. Let us enumerate the trees from 1 to |ρ(Φ_q)| and denote the vertex set of the k-th tree by V_k ⊂ V. Let π be the function that outputs the partition for a given spanning forest. Then, the random partition of V derived from Φ_q is π(Φ_q) = (V_1, ..., V_{|ρ(Φ_q)|}) with |π(Φ_q)| subsets. Note that this function is a many-to-one mapping, because different spanning forests may correspond to the same partition (see Fig. 3).

Fig. 3: Two different rooted spanning forests (on the left) with the same corresponding partition (on the right).
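For concreteness, here is a runnable Julia sketch of Algorithm 1 for a scalar q. It is our own minimal translation (not the authors' released implementation), assuming the graph is given by its symmetric sparse adjacency matrix W; roots are marked by Next[u] = 0 instead of −1:

```julia
# Runnable sketch of Algorithm 1 (RandomForest). Returns the `next` array:
# next[u] == 0 marks u as a root; otherwise next[u] is u's successor.
using SparseArrays

function random_successor(W::SparseMatrixCSC, u::Int)
    nbrs, wts = findnz(W[:, u])           # neighbors of u and edge weights
    r = rand() * sum(wts)
    acc = 0.0
    for (v, wv) in zip(nbrs, wts)
        acc += wv
        acc ≥ r && return v
    end
    return nbrs[end]
end

function random_forest(W::SparseMatrixCSC, q::Real)
    n = size(W, 1)
    d = vec(sum(W, dims=1))
    in_forest = falses(n)
    next = zeros(Int, n)
    for i in 1:n
        u = i
        while !in_forest[u]               # random walk with killing rate q
            if rand() ≤ q / (q + d[u])
                in_forest[u] = true       # u becomes a root
                next[u] = 0
            else
                next[u] = random_successor(W, u)
                u = next[u]               # revisits overwrite `next`: loop erasure
            end
        end
        u = i
        while !in_forest[u]               # freeze the loop-erased path from i
            in_forest[u] = true
            u = next[u]
        end
    end
    return next
end
```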
D. Useful properties of Φ_q

Recent studies [21], [22] have established some theoretical properties of Φ_q, and we reproduce here a few results.

The root process.
To start with, Proposition 2.2 in [21] states that ρ(Φ_q) is sampled from a determinantal point process (DPP) [26] with marginal kernel:

    K = (qI + L)^{−1} qI    (8)

This means that the inclusion probabilities verify:

    ∀S ⊆ V,  P(S ⊆ ρ(Φ_q)) = det K_S

where K_S = [K_{i,j} | (i,j) ∈ S × S] is the submatrix of K reduced to the rows and columns indexed by S.

Cardinality of ρ(Φ_q). As a consequence of ρ(Φ_q) being a DPP, the first two moments of |ρ(Φ_q)| verify [26]:

    E[|ρ(Φ_q)|] = tr(K) = Σ_i q/(q + λ_i)
    Var(|ρ(Φ_q)|) = tr(K − K²) = Σ_i q λ_i/(q + λ_i)²    (9)

where the λ_i's are the eigenvalues of L.

The root probability distribution. Given any rooted spanning forest φ, define the root function r_φ : V → ρ(φ), which takes as input any node i and outputs the root of the tree that i belongs to. In [21], [22], the authors show that the probability, for any node pair (i,j), that i is rooted in j reads:

    ∀i,j ∈ V,  P(r_{Φ_q}(i) = j) = K_{ij}    (10)

Conditioning on a partition. Let t : V → {1, 2, ..., |ρ(Φ_q)|} be the random mapping between any node and its tree number in Φ_q (i.e., t(i) = k if i ∈ V_k ∈ π(Φ_q)). By conditioning the root probability on a fixed partition P, one obtains (see Proposition 2.2 in [22]):

    ∀i,j ∈ V,  P(r_{Φ_q}(i) = j | π(Φ_q) = P) = I(j ∈ V_{t(i)}) / |V_{t(i)}|    (11)

where I is the indicator function (i.e., it outputs 1 if the input statement is true and 0 otherwise). In other words, given a fixed partition P, the root within each subset V_k is uniformly distributed over the nodes of V_k.

Extending to non-constant q. All of these properties are adaptable to the case of q varying over the nodes (with some changes). The root process ρ(Φ_Q) is also a DPP. However, the associated marginal kernel becomes:

    K = (L + Q)^{−1} Q  with  Q = diag(q_1, ..., q_n).    (12)

Notice that this kernel is not co-diagonalizable with the graph Laplacian L. Thus, the expected number of roots E[|ρ(Φ_Q)|] cannot be written in terms of the λ_i's, but it is still equal to tr(K). Similarly, the root probability P(r_{Φ_Q}(i) = j) remains K_{i,j}, whereas the conditional probability in (11) becomes:

    ∀i,j ∈ V,  P(r_{Φ_Q}(i) = j | π(Φ_Q) = P) = q_j I(j ∈ V_{t(i)}) / Σ_{k∈V_{t(i)}} q_k    (13)
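Eq. (9) is easy to test numerically. The sketch below (reusing the random_forest function assumed above, and an adjacency matrix W assumed to be in scope) compares the empirical mean number of roots to tr(K) = Σ_i q/(q+λ_i):

```julia
# Sketch: empirical check of E[|ρ(Φ_q)|] = Σᵢ q/(q+λᵢ) (Eq. 9), assuming the
# `random_forest` sketch above and a small sparse adjacency matrix `W`.
using LinearAlgebra, SparseArrays, Statistics

q = 1.0
L = Matrix(Diagonal(vec(sum(W, dims=2))) - W)       # dense, for eigvals
expected = sum(q ./ (q .+ eigvals(Symmetric(L))))
observed = mean(count(iszero, random_forest(W, q)) for _ in 1:10_000)
# `expected` and `observed` agree up to Monte Carlo error
```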
III. RSF-BASED ESTIMATORS

In this section, we present our main results. We first recall the graph Tikhonov regularization and interpolation problems. Then, we describe the RSF-based methods to solve them. We also provide some theoretical analysis of the performance of these methods. Finally, we show how to tune the hyperparameters of the proposed estimators.
Graph Tikhonov regularization.
For a given graph G = (V, E, w) and measurements y = (y_1, y_2, ..., y_n)^⊤ on the |V| = n vertices, the Tikhonov regularization of y reads:

    x̂ = argmin_{z∈R^n} µ ||y − z||² + z^⊤ L z    (14)

where L ∈ R^{n×n} is the graph Laplacian of G. The solution of this minimization problem is x̂ = K y with K = (L + µI)^{−1} µI. Interestingly, the matrix in this solution also appears in (8) as the marginal kernel of the root process. This correspondence plays a significant role for the proposed methods by connecting RSFs to the Tikhonov regularization problem.

In some important cases, instead of (L + µI)^{−1} µI y, the generalized solution (L + Q)^{−1} Q y is required, where Q is an entry-wise non-negative diagonal matrix. For example, if we write the Tikhonov regularization of (14) with another graph Laplacian, such as the random walk Laplacian L_rw = D^{−1}L, then the solution reads x̂ = (L + Q)^{−1} Q y with Q = µD. Another example occurs when the noise variance is known to be non-constant over the vertices, i.e. heteroscedastic noise: the measurements may be less reliable at some vertices than at others, meaning that there are different noise variances σ_1², ..., σ_n². This implies that q_1, ..., q_n should be set proportional to σ_1^{−2}, ..., σ_n^{−2} in the estimation of x̂. This again corresponds to the generalized formulation.
Graph interpolation. Given a connected graph G = (V, E, w), a parameter µ ≥ 0, and a set ℓ ⊂ V of nodes where a signal x is known, the interpolated signal reads:

    x̂ = argmin_{z∈R^n} z^⊤ (L + µI) z  subject to  ∀i ∈ ℓ, z_i = x_i    (15)

Define u = V∖ℓ, the set of nodes for which x is not known, and write L in block form:

    L = [ L_{ℓ|ℓ}  L_{ℓ|u} ]
        [ L_{u|ℓ}  L_{u|u} ]

where L_{ℓ|u} is the Laplacian reduced to its rows and columns indexed by ℓ and u, respectively. The solution of (15) reads:

    x̂_i = x_i  if i ∈ ℓ,  and  x̂_i = (−(L_{u|u} + µI)^{−1} L_{u|ℓ} x_ℓ)_i  otherwise    (16)

where x_ℓ ∈ R^{|ℓ|} is the signal x reduced to its entries in ℓ. This solution can almost always be rewritten as:

    x̂_u = K y  with  K = (L_{G∖ℓ} + Q)^{−1} Q  and  y = −Q^{−1} L_{u|ℓ} x_ℓ,    (17)

where L_{G∖ℓ} is the Laplacian of the reduced graph obtained by removing the vertices (and the incident edges) in ℓ, and Q ∈ R^{|u|×|u|} is the diagonal matrix with Q_{i,i} = µ + Σ_{j∈ℓ} w(i,j). (Note that Q needs to be invertible for y to be well-defined. This is always the case if µ > 0; when µ = 0, it may not be. We will see in Section IV-A what can be done in this scenario.) Similarly to graph TR, the RSF-based estimator for interpolation proposed in this paper draws upon the connection between Eqs. (17) and (12).

Parameter selection. The solution to graph TR tends to the constant vector (equal to the average of y) as µ → 0, and to y as µ → ∞, where it suffers from underfitting and overfitting, respectively. In the interpolation problem, as µ → ∞, no prior information gets propagated to the other vertices, and x̂_u tends to the zero vector. The case µ = 0 corresponds to solving the Dirichlet problem [9], which does not necessarily give the closest inference to the original signal x. For these reasons, µ needs to be set at a value that gives the best approximation to the original signal. In both problems, choosing µ is a classical hyperparameter selection problem, which can be approached in several ways for the proposed estimators.

In the following, we first present the proposed estimators for approximating x̂ at a fixed value of µ in Section III-A. Then, we outline in Section III-B several methods that select an appropriate µ automatically. Combining the hyperparameter selection with the RSF-based estimators forms an RSF-based framework to approximate the solutions of graph Tikhonov regularization and graph interpolation.

A. RSF-based estimation of x̂ = Ky

We propose two novel Monte Carlo estimators to approximate x̂ = K y with K = (L + Q)^{−1} Q. These estimators leverage the probability distribution of the root process on RSFs presented in (10).

The first estimator, denoted by x̃, is defined as follows:

    ∀i ∈ V,  x̃(i) = y(r_{Φ_Q}(i))    (18)

In practice, a realization of Φ_Q is considered. Then, in each tree, the measurement of the root is propagated to the nodes of the tree (see the top right of Fig. 4).

Fig. 4: An illustration of the estimators where q is constant over all nodes. On the left, the graph signal is represented by both colors and numbers. In the middle, a realization of Φ_q, a forest, is illustrated. On this forest, the estimators x̃ and x̄ are illustrated on the top right and bottom right, respectively.

Proposition 1. x̃ is an unbiased estimator of x̂: E[x̃] = x̂.
Moreover, the weighted expected error of x̃ is:

    E(||x̂ − x̃||²_Q) = Σ_{i∈V} q_i Var(x̃(i)) = y^⊤ (Q − K^⊤QK) y

where ||x||²_Q = x^⊤ Q x.

Proof. For every node i, x̃(i) is an unbiased estimator of x̂(i), thanks to the following:

    E[x̃(i)] = E[y(r_{Φ_Q}(i))] = Σ_j P(r_{Φ_Q}(i) = j) y(j) = Σ_j K_{ij} y(j) = δ_i^⊤ K y = x̂(i)

where δ_i is the Kronecker delta (i.e. δ_i(i) = 1 and 0 otherwise). This result is a direct consequence of the root probability of RSFs given in (10). Also, the variance of x̃(i) reads:

    Var(x̃(i)) = E[x̃(i)²] − E[x̃(i)]² = δ_i^⊤ K y^{(2)} − (δ_i^⊤ K y)²

where y^{(2)}(k) = y(k)², ∀k ∈ V. Then, the weighted sum reads:

    Σ_{i∈V} q_i Var(x̃(i)) = 1^⊤ Q K y^{(2)} − y^⊤ K^⊤ Q K y

where 1 denotes the all-ones vector. Note that 1^⊤QK = 1^⊤K^⊤Q, since K^⊤Q = Q(L+Q)^{−1}Q = QK is symmetric. Moreover, 1 is a left eigenvector of K^⊤ with corresponding eigenvalue 1. Then, the first term becomes 1^⊤K^⊤Qy^{(2)} = 1^⊤Qy^{(2)} = y^⊤Qy, and one obtains:

    Σ_{i∈V} q_i Var(x̃(i)) = y^⊤ (Q − K^⊤QK) y    (19)

The second estimator, denoted by x̄, is the expectation of x̃ conditioned on the partition induced by Φ_Q:

    x̄(i) = E[x̃(i) | π(Φ_Q) = P] = Σ_{j∈V_{t(i)}} q_j y(j) / Σ_{j∈V_{t(i)}} q_j    (20)

Due to the law of iterated expectations, this estimator is also unbiased; moreover, it has a reduced variance compared to x̃(i), due to the law of total variance:

    Var(x̃(i)) = E[Var(x̃(i) | π(Φ_Q) = P)] + Var(x̄(i))

which implies Var(x̃(i)) ≥ Var(x̄(i)). This idea of improving an estimator is often called Rao-Blackwellization [27], [28]. In practice, we again take a realization of Φ_Q and consider the corresponding partition π(Φ_Q). We then compute the weighted average of the measurements in each subset of π(Φ_Q), and finally propagate these averages within each subset (see Fig. 4).

Proposition 2. x̄ is an unbiased estimator of x̂: E[x̄] = x̂. Moreover, the weighted expected error reads:

    E(||x̂ − x̄||²_Q) = Σ_{i∈V} q_i Var(x̄(i)) = y^⊤ (QK − K^⊤QK) y.

Proof.
Let S = [ q_j I(j ∈ V_{t(i)}) / Σ_{k∈V_{t(i)}} q_k ]_{i,j} be the random matrix associated to the random partition π(Φ_Q). A simple matrix product shows that S^⊤QS = QS, by definition of S. Moreover, x̄(i) = δ_i^⊤ S y, and the expectation of S_{i,j} over all possible partitions derived from Φ_Q is:

    E[S_{i,j}] = Σ_{P∈π(F)} [ q_j I(j ∈ V_{t(i)}) / Σ_{k∈V_{t(i)}} q_k ] P(π(Φ_Q) = P)
               = Σ_{P∈π(F)} P(r_{Φ_Q}(i) = j | π(Φ_Q) = P) P(π(Φ_Q) = P)
               = P(r_{Φ_Q}(i) = j) = K_{i,j}    (21)

Similarly, the expectation of x̄(i) reads:

    E[x̄(i)] = E[δ_i^⊤ S y] = δ_i^⊤ E[S] y = δ_i^⊤ K y = x̂(i)    (22)

Thus, x̄ is unbiased. The expected error is computed in a similar way:

    E(||x̂ − x̄||²_Q) = Σ_{i∈V} q_i Var(x̄(i))
                    = Σ_{i∈V} q_i ( E[(δ_i^⊤ S y)²] − E[δ_i^⊤ S y]² )
                    = y^⊤ E[ Σ_{i∈V} q_i S^⊤ δ_i δ_i^⊤ S ] y − y^⊤ (E[S]^⊤ Q E[S]) y
                    = y^⊤ ( E[S^⊤QS] − K^⊤QK ) y    (23)

Finally, rewriting S^⊤QS = QS, one has:

    E(||x̂ − x̄||²_Q) = y^⊤ (E[QS] − K^⊤QK) y = y^⊤ (QK − K^⊤QK) y
Sample mean. The sample mean of an unbiased Monte Carlo estimator over different realizations has a reduced variance, and so gives a better estimator. Thus, in the rest of the paper, we use the sample means (1/N) Σ_{k=1}^N x̃_{Φ^{(k)}_Q} and (1/N) Σ_{k=1}^N x̄_{Φ^{(k)}_Q} over N forest realizations as the outputs of the RSF-based methods.
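To make the construction concrete, here is a minimal Julia sketch of both estimators for constant q (so the weighted averages in (20) reduce to plain means). It assumes the random_forest sketch of Section II-C, whose output next encodes each node's successor (0 for roots):

```julia
# Sketch of the estimators x̃ and x̄ of Section III-A for constant q, assuming
# `next` is the output of the `random_forest` sketch (next[u] == 0 marks roots).
function forest_estimates(next::Vector{Int}, y::Vector{Float64})
    n = length(next)
    root = zeros(Int, n)
    function find_root(u)               # follow `next` to the root, memoising
        root[u] != 0 && return root[u]
        r = next[u] == 0 ? u : find_root(next[u])
        root[u] = r
        return r
    end
    foreach(find_root, 1:n)
    x̃ = y[root]                        # x̃(i) = y(root of the tree containing i)
    s = Dict{Int,Float64}(); c = Dict{Int,Int}()
    for i in 1:n
        s[root[i]] = get(s, root[i], 0.0) + y[i]
        c[root[i]] = get(c, root[i], 0) + 1
    end
    x̄ = [s[root[i]] / c[root[i]] for i in 1:n]   # x̄(i) = tree-wise mean of y
    return x̃, x̄
end
```

Averaging the outputs of this function over N independent forests gives the sample-mean estimators used in the rest of the paper.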
A remark. Reducing these results to the constant-q case (Q = qI), one recovers the preliminary results presented in [24].

B. Parameter selection for the RSF estimators
The proposed estimators are efficient tools to approximate x̂ in both the graph TR and interpolation problems for a fixed value of µ. However, as usual in these problems, a difficult question is the tuning of this hyperparameter: the choice of µ that yields the best performance. For linear smoothers such as the one we have at hand (x̂ = Ky), many methods such as AIC, BIC, Mallows's C_p, leave-one-out cross-validation (LOOCV), generalized cross-validation (GCV) or Stein's unbiased risk estimator (SURE) are readily available for this tuning step (for more details and motivations, we refer the reader to [29]). All of these methods need to compute a quantity called the effective number of parameters, or degrees of freedom [29], which equals tr(K) for linear smoothers of the form Ky. Computing this trace exactly requires the matrix inversion we wish to avoid in the first place. A classical estimator of this quantity is Girard's estimator [30] (also known as Hutchinson's estimator [31]). We showed in [32] that RSFs can also be used to efficiently estimate tr(K). In this section, we build upon these preliminary results to show how the SURE and LOOCV methods can be adapted to the proposed estimators in order to select a good value of µ. Other methods are adaptable in a similar fashion.

Stein's unbiased risk estimator. Given independent noisy measurements y = x + ε ∈ R^n with Gaussian noise ε_i ∼ N(0, σ²), let θ(y) be an estimate of the unknown quantity x. SURE(y, θ(y)) provides an unbiased estimate of the expected error E_ε[||θ(y) − x||²]. For the linear smoother θ(y) = Ky with K = (Q + L)^{−1}Q, the generic formula of SURE in [33] can be adapted as:

    SURE(y, θ(y)) = −nσ² + ||y − θ(y)||² + 2σ² tr(K)    (24)

where the degrees-of-freedom term is replaced with tr(K). The underlying theory relies on Stein's lemma on multivariate Gaussians [34]. Note that this method requires prior knowledge of the noise variance σ², and outputs an unbiased estimate of the error. This error then needs to be evaluated for different values of µ, in order to select the value yielding the smallest error.
SURE for RSF estimators. Similarly to x̂, the RSF estimators too need the value of µ that gives the best performance, and for this purpose SURE can be used. In the following, we present the adapted SURE formula for x̃ and x̄; the derivations show that numerically computing this formula is trivial once N spanning forests have been sampled.

Consider the two random matrices S̃ = [ I(r_{Φ_Q}(i) = j) ]_{i,j} and S̄ = [ q_j I(j ∈ V_{t(i)}) / Σ_{k∈V_{t(i)}} q_k ]_{i,j} (previously defined as S in the proof of Prop. 2). With these definitions, notice that x̃ = S̃y and x̄ = S̄y. Moreover, the proposed estimators can be written in the form of linear smoothers:

    θ̃(y) = (1/N) Σ_{k=1}^N x̃_{Φ^{(k)}_Q} = ( (1/N) Σ_{k=1}^N S̃^{(k)} ) y
    θ̄(y) = (1/N) Σ_{k=1}^N x̄_{Φ^{(k)}_Q} = ( (1/N) Σ_{k=1}^N S̄^{(k)} ) y    (25)

where the superscript (k) denotes the k-th realization of S̃ or S̄. Then, one can also evaluate the formula in (24) for θ̃(y) and θ̄(y). For instance, we get for θ̃(y):

    SURE(y, θ̃(y)) = −nσ² + ||y − θ̃(y)||² + 2σ² tr( (1/N) Σ_{k=1}^N S̃^{(k)} )    (26)

The residual error is trivial to compute after sampling N spanning forests, and this is also the case for the degrees-of-freedom term. A closer look at the trace shows:

    tr( (1/N) Σ_{k=1}^N S̃^{(k)} ) = tr( (1/N) Σ_{k=1}^N S̄^{(k)} ) = (1/N) Σ_{k=1}^N |ρ(Φ^{(k)}_Q)|
In other words, the trace term can be replaced by the average number of roots in the computation of SURE(y, θ̃(y)) or SURE(y, θ̄(y)). Thus, the SURE scores of both estimators are trivial to compute numerically after sampling N spanning forests.

Note that neither SURE(y, θ̃(y)) nor SURE(y, θ̄(y)) is an unbiased estimator of SURE(y, θ(y)). The estimation errors read:

    E[SURE(y, θ̃(y))] − SURE(y, θ(y)) = Var(θ̃(y)) ≥ E[SURE(y, θ̄(y))] − SURE(y, θ(y)) = Var(θ̄(y)) ≥ 0    (27)

Thus, SURE(y, θ̃(y)) and SURE(y, θ̄(y)) are (with high probability) upper bounds of SURE(y, θ(y)). For large graphs, in which computing SURE(y, θ(y)) is prohibitive, these upper bounds might also be useful, since they can be obtained cheaply.

Leave-one-out cross-validation. LOOCV is a very simple method to select µ in interpolation problems. Let x_ℓ ∈ R^{|ℓ|} be the known part of the original signal x over the vertices in ℓ ⊂ V, and let θ be an estimator for interpolation. LOOCV computes the following score (see Chapter 5.5.1 in [29]):

    LOOCV(x_ℓ, θ(x_ℓ)) = (1/|ℓ|) Σ_{i∈ℓ} ( θ^{−i}(x_ℓ)_i − x_i )²

where θ^{−i}(x_ℓ) is the estimation computed without using the i-th measurement. This method leaves x_i out at the estimation stage and calculates the error on it; the overall score is the average error over the vertices in ℓ. Note that this method needs to compute the estimation θ^{−i}(x_ℓ) for each individual i, which might not be computationally feasible. Fortunately, for linear estimators of the form θ(y) = Ky, this formula simplifies to [29]:

    LOOCV(x_ℓ, θ(x_ℓ)) = (1/|ℓ|) Σ_{i∈ℓ} ( (θ(x_ℓ)_i − x_i) / (1 − K_{i,i}) )²    (28)

which avoids the re-computation.
LOOCV for RSF estimators. Similarly to SURE, this score can be adapted to the RSF-based estimators. For example, in the case of θ̃(x_ℓ), it becomes:

    LOOCV(x_ℓ, θ̃(x_ℓ)) = (1/|ℓ|) Σ_{i∈ℓ} ( (θ̃(x_ℓ)_i − x_i) / (1 − (1/N) Σ_{k=1}^N S̃^{(k)}_{i,i}) )²    (29)

and for θ̄ it can be derived in the same way. Notice that every element in this expression is numerically available after sampling N spanning forests. Thus, this score can be easily computed for both estimators.
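As an illustration, the following Julia sketch evaluates the SURE score (24) for the sample-mean x̄ estimator with constant q = µ, reusing the random_forest and forest_estimates sketches above; σ² is assumed known:

```julia
# Sketch: SURE score of the sample-mean x̄ estimator (Eqs. 24-26) for constant
# q = µ; the trace term is the average number of roots over the N forests.
function sure_score(W, y, μ, σ2; N = 100)
    n = length(y)
    θ = zeros(n)
    roots = 0.0
    for _ in 1:N
        next = random_forest(W, μ)
        θ .+= forest_estimates(next, y)[2] ./ N   # sample mean of x̄
        roots += count(iszero, next) / N          # average number of roots
    end
    score = -n*σ2 + sum(abs2, y .- θ) + 2σ2*roots
    return score, θ
end
```

Evaluating sure_score on a grid of µ values and keeping the minimizer implements the selection rule described above; the LOOCV score (29) can be assembled in the same way from the diagonal entries of the sampled S̃ matrices.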
IV. RSF ESTIMATORS FOR OTHER GRAPH PROBLEMS

In this section, we explore a few graph problems in which the presented RSF-based estimators can replace expensive exact computations.
A. Node Classification in semi-supervised learning
Consider a dataset consisting of elements one wishes to classify. In the semi-supervised learning context, the class labels of a few elements are assumed to be known a priori, along with a graph structure encoding some measure of affinity between the different elements: the larger the weight of the edge connecting two elements, the closer they are according to some application-dependent metric, and the more likely these two elements belong to the same class. The goal is then to infer all the labels given this prior information. Among the many options to solve this problem, label propagation [7] and the generalized SSL framework [11] are two well-known baseline approaches. In this section, we deploy x̃ and x̄ to approximate the solutions given by these approaches.
Problem definition. Let us denote the labeled vertices by ℓ ⊂ V (typically |ℓ| ≪ |V|) and the unlabeled ones by u = V∖ℓ. Assume C distinct label classes and define the following binary encoding of the prior knowledge for the c-th class:

    ∀i ∈ V,  y_c(i) = 1 if i is known to belong to the c-th class, and 0 otherwise    (30)

The matrix Y = [y_1| ... |y_C] ∈ R^{n×C} thus encodes the prior knowledge. Many approaches to SSL formulate the problem as follows. First, for each class c, compute the so-called "classification function", defined as:

    f_c = argmin_{z_c∈R^n} µ Σ_{i∈V} q'_i (y_c(i) − z_c(i))² + z_c^⊤ L z_c    (31)

where µ and the q'_i's are regularization parameters: µ sets the global regularization level, and each q'_i acts entry-wise (when q'_i is high, the corresponding entry of f_c is close to the measurement y_c(i)). Eq. (31) has the following explicit solution:

    f_c = (L + Q)^{−1} Q y_c    (32)

where Q = diag(q_1, ..., q_n) and q_i = µ q'_i. Thus, each classification function f_c can be viewed as a smoothed version of the prior knowledge encoded in y_c. As such, if f_c(i) is large, labels of class c are relatively dense around node i. The last step in these SSL algorithms is to assign each node i to the class argmax_c f_c(i).

Label propagation and the generalized SSL framework are two algorithms that adapt this solution in different ways. In particular, through different choices of the q'_i's in (31), label propagation is in fact a graph interpolation, and the generalized SSL framework may be understood as a graph TR. Thus, the RSF estimators can be used to approximate the solutions of both algorithms. In the following, we discuss the corresponding parameter settings for these algorithms, along with their RSF versions.
Label propagation. The label propagation algorithm [7] solves the Dirichlet problem for each class c, that is:

    ∀i ∈ u,  (L f_c)(i) = 0  subject to  ∀i ∈ ℓ,  f_c(i) = y_c(i)    (33)

which is equivalent to (15) for µ = 0. Defining the classification matrix F = [f_1| ... |f_C] ∈ R^{n×C}, one thus has:

    F_{i,c} = Y_{i,c}  if i ∈ ℓ,  and  F_{i,c} = (−(L_{u|u})^{−1} L_{u|ℓ} Y_{ℓ|:})_{i,c}  otherwise    (34)

where Y_{ℓ|:} is the matrix Y restricted to the rows in ℓ. Note in passing that F corresponds to a special set of functions on graphs called harmonic functions. Besides being the solution of the Dirichlet boundary problem, they have interesting connections with electrical networks and random walks [7].

Zhu et al. [7] provide a simple algorithm to compute F without computing the inverse matrix. Starting from an arbitrary initial F^{(0)}, at each iteration k, the algorithm updates F^{(k)} ← D^{−1} W F^{(k−1)}, and the iteration is completed by resetting the known labels F^{(k)}_{ℓ|:} to Y_{ℓ|:}. They prove that the output of this iteration converges to F as k → ∞ (see Section 2.3 in [35]); a Julia sketch of this iteration is given below, after Proposition 3.

Here, we provide an RSF-based estimator to approximate F. Two scenarios are possible. The first (unlikely) scenario is when every node in u is connected to at least one node in ℓ. In this case, one can rewrite (34) as:

    F_{u|:} = −K Q^{−1} L_{u|ℓ} Y_{ℓ|:}  with  K = (L_{G∖ℓ} + Q)^{−1} Q    (35)

where L_{G∖ℓ} is the Laplacian of the reduced graph obtained by removing the vertices (and the incident edges) in ℓ, and Q ∈ R^{|u|×|u|} is the diagonal matrix with Q_{i,i} = Σ_{j∈ℓ} w(i,j). The condition of this first scenario ensures that Q is indeed invertible, and RSFs on the reduced graph G∖ℓ can thus estimate the columns of F.

However, when there exists at least one node in u that is not connected to ℓ, Q is no longer invertible and another approach is needed. In this second scenario, the parameters are defined over all the vertices and set to q_i = α > 0 for i ∈ ℓ and q_i = 0 for i ∈ u. The following proposition guarantees that, as α → ∞, the RSF estimator x̃ with this setting approximates the solution given in Eq. (34).
Proposition 3. Given the parameters q_i = α > 0 for i ∈ ℓ and q_i = 0 for i ∈ u, as well as the input vector y_c ∈ R^n for the RSF estimator x̃, the following holds:

    lim_{α→∞} E[x̃] = f_c.

Proof.
See the supplementary material for the detailed proof. From the random-forest point of view, setting q_i to infinity for all nodes i in ℓ, and to 0 otherwise, implies that all possible realizations of Φ_Q have exactly the same root set: ℓ. Thus, when using the estimator x̃, the measurements in ℓ are not altered and are simply propagated to the other vertices via the sampled random trees. In addition, the estimator x̄ boils down to x̃ in this very specific case.
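For reference, here is the announced sketch of Zhu et al.'s iteration; it is our own minimal version, with an assumed fixed iteration count rather than a convergence test:

```julia
# Sketch of the label-propagation iteration F ← D⁻¹WF with clamping of the
# labelled rows; `W` is the adjacency matrix, `ℓ` the labelled indices and
# `Y` the one-hot prior of Eq. (30).
using LinearAlgebra, SparseArrays

function label_propagation(W, ℓ::Vector{Int}, Y::Matrix{Float64}; iters = 200)
    Dinv = Diagonal(1 ./ vec(sum(W, dims = 2)))
    F = copy(Y)
    for _ in 1:iters
        F = Dinv * (W * F)
        F[ℓ, :] = Y[ℓ, :]          # reset the known labels
    end
    return F                        # class of node i: argmax over row i
end
```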
Generalized SSL framework. The generalized SSL framework proposed in [11] can be seen as a graph TR. It defines the classification function as follows:

    f_c = (µ/(µ+2)) ( I − (2/(µ+2)) D^{−σ} W D^{σ−1} )^{−1} y_c

where µ > 0 is the regularization parameter and σ determines which graph Laplacian is used: σ = 0, 1/2, 1 respectively correspond to the combinatorial, normalized and random-walk graph Laplacians. This formula can also be written as:

    f_c = D^{1−σ} K D^{σ−1} y_c  with  K = (L + Q)^{−1} Q  and  Q = (µ/2) D.

Notice that the cumbersome part of this formula is computing K D^{σ−1} y_c, which can be approximated by the proposed estimators applied to the input vector y = D^{σ−1} y_c; f_c is then obtained by left-multiplying the result by D^{1−σ}.

Both solutions, label propagation and the generalized SSL, can be considered as two different versions of a more generic optimization problem. Label propagation puts a very high confidence on the prior: x̃, for example, only propagates the measurements of the labeled vertices (from the second scenario's perspective). In the generalized SSL, on the other hand, a lower confidence in the prior information is assumed, and thus the propagation of the other measurements, which are all set to 0 in this encoding, is allowed. The success of both methods depends on the correctness of these assumptions on the data. Section V provides empirical comparisons on benchmark datasets.
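A direct-solve Julia sketch of these classification functions follows, using the Q = (µ/2)D form written above (a reconstruction from the derivation, so treat the scaling as an assumption); in the RSF version, the solve would be replaced by the forest estimators applied to D^{σ−1} y_c:

```julia
# Sketch of the generalized SSL classification functions via an exact solve;
# `L` is the Laplacian, `d` the degree vector, `Y` the one-hot prior.
using LinearAlgebra, SparseArrays

function generalized_ssl(L, d::Vector{Float64}, Y::Matrix{Float64}; μ = 1.0, σ = 1.0)
    Dpow(p) = Diagonal(d .^ p)
    Q = Diagonal((μ / 2) .* d)                 # assumed Q = (µ/2) D
    Z = (L + Q) \ (Q * (Dpow(σ - 1) * Y))      # K D^{σ-1} y_c, all classes at once
    return Dpow(1 - σ) * Z                     # f_c = D^{1-σ} K D^{σ-1} y_c
end
```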
B. Non-quadratic convex functions and Newton's method

Consider the following generalized optimization problem:

    x̂ = argmin_{z∈R^n} µ f(z) + ½ z^⊤ L z    (36)

where f : R^n → R is a generic twice-differentiable data-fidelity function (e.g. previously f(z) = ||y − z||²). A common occurrence of the generalized form above is when f is a negative log-likelihood, i.e. f(z) = −Σ_{i=1}^n log p(y_i|z_i). This is used when the assumption that the observations are Gaussian (given the signal) is inappropriate, for instance when the observations are discrete. In such cases, f is not a quadratic function and there is typically no closed-form solution to (36).

In these cases, iterative approaches are often deployed. These approaches define an iteration scheme to minimize the objective, or loss, function. One popular approach among them is Newton's method. Let L(z) denote the loss function; Newton's method then iterates:

    z_{k+1} = z_k − α (H_L(z_k))^{−1} ∇L(z_k),  with α ∈ (0, 1],  k = 1, 2, ...    (37)

where H and ∇ are the Hessian and gradient operators, respectively, and z_k denotes the estimate at iteration k. Note that, by definition of the Hessian operator, this method requires the loss function to be twice differentiable.

Given this scheme, the methods proposed here may become useful for approximating the inversion throughout the iterations. We illustrate this usage in the following setup. Assume an independent Poisson distribution for the likelihood at each i:

    P(y_i | λ = z_i) = z_i^{y_i} exp(−z_i) / y_i!

where λ is the distribution parameter. This assumption is often made in image processing applications to model shot noise [36]. Also, consider the following slightly modified loss function:

    L'(t) = −µ Σ_{i=1}^n log P(y_i | λ = exp(t_i)) + ½ t^⊤ L t    (38)

where exp(t_i) = z_i. The gradient ∇L'(t) reads:

    ∇L'(t) = µ exp(t) − µ y + L t

where exp operates entry-wise on vectors. The Hessian matrix is:

    H_{L'}(t) = µ diag(exp(t)) + L

With these two ingredients, a Newton iteration reads:

    t_{k+1} = t_k − α [µ diag(exp(t_k)) + L]^{−1} (µ exp(t_k) − µ y + L t_k)

The update term, which requires a matrix inversion, can be approximated by our RSF estimators. This is achieved by setting Q = µ diag(exp(t_k)) and the graph measurements y' to µ^{−1} diag(exp(−t_k)) (µ exp(t_k) − µ y + L t_k). This particular choice yields:

    K y' = (Q + L)^{−1} Q y' = [µ diag(exp(t_k)) + L]^{−1} (µ exp(t_k) − µ y + L t_k)

which equals the update in Newton's method. Thus, the RSF estimators can easily be used to compute each update step at a cheap cost.

In the classical Newton's method, i.e. α = 1, convergence is not guaranteed: the iteration might diverge or get stuck in a loop, depending on the closeness of the initial point t_0 to the solution. Guessing a good initial point is not an easy task and may require expensive computations in high dimensions. Instead, adapting α is a more practical option to ensure convergence. Thus, Newton's method is often combined with an additional step at each iteration in which α is reset accordingly. Line search algorithms [12] are simple and well-understood methods for this purpose: at each iteration, if needed, they damp the applied update by shrinking α. These methods ensure convergence; however, they may require more iteration steps than the pure Newton's method started from a good initial point.
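The following Julia sketch implements this scheme with exact solves and a fixed step size α (no line search, for brevity); in the RSF version, the solve on each iteration would be replaced by the forest estimators with Q = µ diag(exp(t_k)):

```julia
# Sketch of the Newton scheme of Section IV-B for Poisson observations,
# written with exact solves; `L` is the graph Laplacian, `y` the noisy counts.
using LinearAlgebra, SparseArrays

function newton_poisson(L, y::Vector{Float64}; μ = 1.0, α = 0.5, iters = 30)
    t = zeros(length(y))                       # t = log z, initialized at z = 1
    for _ in 1:iters
        g = μ .* (exp.(t) .- y) .+ L * t       # gradient of Eq. (38)
        H = μ * Diagonal(exp.(t)) + L          # Hessian of Eq. (38)
        t -= α .* (H \ g)                      # damped Newton update
    end
    return exp.(t)                             # back to the intensity domain
end
```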
In our case, the updates are stochastic, and exact convergence cannot be expected.

C. ℓ1-regularization and iteratively reweighted least squares

As with the data-fidelity term, many alternatives for the regularization term are also available. Among them, ℓ1-regularization [37] is often deployed to obtain sparser solutions. In [38], Sharpnack et al. adapt ℓ1-regularization to graphs as follows:

    x̂ = argmin_{z∈R^n} µ ||y − z||² + ||Bz||_1    (39)

where B is the edge incidence matrix and ||Bz||_1 = Σ_{(i,j)∈E} w(i,j)^{1/2} |z_i − z_j|. In contrast to the ℓ2-regularized case, a closed-form solution is not available in the ℓ1 case. Thus, iterative optimization schemes are usually used to compute the solution. The iteratively reweighted least squares (IRLS) method [13] is one that can easily be adapted to our RSF-based estimators. In this section, we derive the IRLS iteration scheme for the given problem and show that each iteration step can be approximated by x̃ and x̄.

Let M = diag(abs(Bz))^{−1}, where abs is the entry-wise absolute value operator. An iteration scheme can be derived by using the equality:

    ||Bz||_1 = 1^⊤ abs(Bz) = abs(z^⊤B^⊤) M abs(Bz) = z^⊤ B^⊤ M B z

where 1 ∈ R^m is the all-ones vector. Then, the problem can also be written as:

    x̂ = argmin_{z∈R^n} µ ||y − z||² + z^⊤ B^⊤ M B z    (40)

which can be solved iteratively by the following scheme:

    z_{k+1} = (µI + B^⊤ M_k B)^{−1} µ y  with  M_k = diag(abs(B z_k))^{−1}

A more detailed derivation and the convergence analysis of this scheme can be found in [13]. Notice that, by definition, B^⊤ M_k B equals a reweighted graph Laplacian L_k. Computing the update at each iteration step thus immediately reduces to solving a graph Tikhonov regularization, so x̃ or x̄ can be used to estimate the update.
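A compact Julia sketch of this scheme with exact solves follows (the small ε guard on |Bz| is our own numerical safeguard, not part of the derivation); the RSF estimators would replace the linear solve:

```julia
# Sketch of the IRLS iteration of Section IV-C; `B` is the edge incidence
# matrix. Each step solves a graph TR problem with a reweighted Laplacian.
using LinearAlgebra, SparseArrays

function irls_l1(B, y::Vector{Float64}; μ = 1.0, iters = 50, ε = 1e-8)
    z = copy(y)
    for _ in 1:iters
        m = 1 ./ max.(abs.(B * z), ε)      # M_k = diag(|B z_k|)⁻¹, guarded
        Lk = B' * Diagonal(m) * B          # reweighted graph Laplacian
        z = (Lk + μ*I) \ (μ .* y)          # graph TR with Laplacian L_k
    end
    return z
end
```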
V. EXPERIMENTS

We provide in this section several illustrations and a run-time analysis of the proposed methods. In the first illustration, the RSF-based methods are run on two image denoising setups. We consider:
• an image with Gaussian noise: the solution is provided by the RSF-based Tikhonov regularization, parameterized via SURE;
• an image with Poisson noise: the solution is provided by the RSF-based Newton's method coupled with line search.
In both setups, the underlying graph is assumed to be a 2-dimensional (2D) grid graph.

In the second illustration, the SSL node classification problem is considered on three benchmark datasets. We examine the classification performance of Tikhonov regularization, label propagation and the RSF versions of these algorithms. In these experiments, the parameter selection for the Tikhonov regularization is done by the RSF-based leave-one-out cross-validation.

Finally, we briefly examine the computational time for sampling a spanning forest, which is the building-block operation of the proposed estimators. We then compare this quantity with the computational time of computing Ly, which is the building block of the state-of-the-art approaches, namely polynomial approximations and iterative methods.

A. Image Denoising
A 2D grid graph is a natural underlying structure for images: every pixel corresponds to a node, and each pixel is connected to its four direct neighbors with equal weights. Other structures could be used (such as an 8-neighbor version of the grid graph) and performance will depend on the chosen structure. However, for the purpose of illustration, we only consider the simplest 4-neighbor grid graph.

In the first setup, noisy (with additive Gaussian noise) measurements y of the original image x are given:

    y = x + ε  with  ε ∼ N(0, σ²I)

To recover x, Tikhonov regularization is applied. Fig. 5 compares the exact result x̂ = Ky = (L + µI)^{−1}µy to its forest-based approximations x̃ and x̄. The SURE method to estimate the best value of µ is also illustrated for all three estimates of the original image x (and we remark in passing that they are consistent). The results confirm that x̄ produces a better estimate of x̂ than x̃. Also, in Fig. 5d, the scores computed for the RSF estimators are observed to be upper bounds of the scores for x̂, which is expected from the results in Eq. (27).

In the second setup, each pixel value is assumed to be sampled from a Poisson distribution whose mean is the true value of the pixel:

    y ∼ Poisson(x)

To reconstruct x, Newton's method is applied, as explained in Section IV-B. The line search algorithm is used for picking the value of α at each iteration, to ensure convergence. Both the qualitative and quantitative results in Fig. 6 show that all methods converge to the same solution. Fig. 6c shows that, even though all three methods converge, x̂ and x̄ minimize the loss function in a similar way, whereas x̃ departs from this pattern and requires more iterations to converge (due to a larger approximation error at each step).

Fig. 5: An image denoising experiment with additive Gaussian noise. a) The original image x. b) A noisy version y = x + ε with ε ∼ N(0, σ²) (PSNR = 14.2). c) The exact graph TR x̂ = Ky (PSNR = 19.4). d) The SURE scores of x̂, x̃ and x̄ as functions of µ; in each case, the value of µ that minimizes the SURE score is selected. e-f) The two RSF estimates x̃ and x̄ based on a single sampled forest. g-h) Same as e-f), but averaged over N = 20 sampled forests (PSNR = 18.6 and 19.1, respectively).

Fig. 6: An image denoising experiment with Poisson noise. a) The original image x. b) A noisy version y ∼ Poisson(x) (PSNR = 18.0). Newton's method is deployed to recover x by minimizing the loss function in (38), with three update options, namely x̂, x̃ and x̄ (all three denoised images reach PSNR = 23.1, with N = 40 forests for the RSF updates). c) The loss function through the iterations for the three update options.

B. Node Classification
In this illustration, we run our methods to solve the node classification problem discussed in Section IV-A. For the generalized SSL framework, the normalization parameter σ is fixed, and µ is set by RSF-based cross-validation.

The experiments are done on three standard benchmark datasets [39], namely Cora, Citeseer and Pubmed. In the first two, the underlying graphs are disconnected, thus we use the largest connected components. Also, in all datasets, the orientations of the edges are omitted in order to operate on undirected graphs. The general statistics of these datasets after this preprocessing are summarized in Table I.

TABLE I: SSL dataset statistics after preprocessing

    Dataset     Nodes    Edges    Classes
    Citeseer     2110     3668       6
    Cora         2485     5069       7
    Pubmed      19717    44324       3

Note that in these three datasets, the class of each node is known. This enables us to test the different SSL frameworks (a small arbitrary fraction of nodes serves as pre-labeled nodes, and one tests whether or not this is sufficient to infer the class of all nodes). More precisely, we use the following procedure:
• m vertices per class are selected at random as the labeled nodes;
• the parameter µ is set by LOOCV, separately for x̂, x̃ and x̄;
• the classification functions f_c for each class c are computed by the generalized SSL framework, label propagation, and their RSF versions averaged over N realizations;
• each vertex i is assigned the class argmax_c F_{i,c}, and the classification accuracy is calculated as the ratio of correctly predicted labels to the total number of predictions.

In Fig. 7, the classification accuracy is reported as m and N vary. The results are averaged over 50 realizations of the m labeled vertices for all datasets. In all experiments, x̄ performs better than x̃, as expected. Moreover, for the first two datasets, Cora and Citeseer, x̄ has a performance comparable with the exact solution. However, on Pubmed, x̄ fails to perform as well as x̂ for the generalized SSL, due to larger approximation errors in both the parameter selection and the estimation steps.

The empirical results show that the proposed methods need far fewer forest realizations to reach the exact solution of LP than that of the generalized SSL. However, sampling a forest for LP often takes more time when n is large and m is relatively small. For example, in the Pubmed graph with m = 20, sampling a single forest for LP takes noticeably longer than for the generalized SSL (timings averaged over 100 repetitions on a single thread of a laptop). Note that these figures strongly depend on m and the given network; one thus needs to examine this trade-off on the given dataset to adjust the total run time.
Fig. 7: The classification accuracies of the generalized SSL, LP and their RSF variants on the datasets Citeseer, Cora and Pubmed, as functions of the number m of labeled vertices per class. The RSF methods for the generalized SSL are illustrated for N = 50 and 500 forest realizations, whereas these numbers for LP are N = 5 and 50. The random classifier denotes the accuracy of inferring classes at random, and the constant classifier is the accuracy of assigning the most frequent class to all unlabeled vertices. The results are averaged over 50 different sets of labeled vertices.

C. Run-time Analysis
Sampling a spanning forest is the repeated core operation of the proposed estimators. Similarly, the matrix-vector product Ly is the building-block operation of the state-of-the-art polynomial approximations and iterative methods. In this section, we compare the times needed for these operations, to give an order of magnitude for the run time of the proposed methods.

In Fig. 8, the average time needed to sample a spanning forest on different graphs is plotted as the value of q varies. As shown in the plots, the time to sample a spanning forest via our naive Julia implementation remains comparable with the time to compute the sparse matrix product Ly across various graphs. Our implementation can be improved in several ways, which are beyond the scope of this work. Nevertheless, the empirical results indicate that the time to sample a spanning forest and the time to compute the product Ly are within the same order of magnitude.

Fig. 8: Average computational times for RSF sampling (solid lines) and the sparse matrix multiplication Ly (dashed lines), as functions of q. The entries of y are chosen uniformly at random. On the left plot, the underlying graphs are a 2D grid, a 10-regular graph (every node has exactly 10 neighbors) and an Erdos-Renyi graph. On the right, the graphs given by the Citeseer and Pubmed datasets are used. The experiments are run on a single thread of a laptop.

VI. CONCLUSION
VI. CONCLUSION

The proposed Monte Carlo estimators have a computational cost comparable with that of state-of-the-art methods, can be used as building blocks in various problems, and are amenable to theoretical analysis. As we have shown, the proposed methods are adaptable to various graph-based optimization algorithms, including the generalized SSL framework, label propagation, Newton's method and IRLS. Moreover, their use can be more general than problems involving graph Laplacians: in particular, optimization problems with symmetric, diagonally dominant regularizers can be reduced to graph Tikhonov regularization problems, along the lines of [40]. Future work will continue to leverage the links between RSFs and graph-related linear algebra to develop efficient estimators of other graph-related quantities, e.g. effective resistances or tr(L†), where L† denotes the Moore–Penrose inverse of L.
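To see why such a reduction is possible, here is a brief sketch (ours, and restricted to the simple case of a symmetric, diagonally dominant S with non-positive off-diagonal entries; the general SDD case, with positive off-diagonal entries, needs an additional reduction for which we refer to [40]). Setting W_{ij} = -S_{ij} for i ≠ j and letting L be the Laplacian of the graph with these edge weights, one can write

\[ S = L + D', \qquad D'_{ii} = S_{ii} - \sum_{j \neq i} |S_{ij}| \;\geq\; 0, \]

so that

\[ z^{\top} S z \;=\; z^{\top} L z + \sum_{i} D'_{ii}\, z_i^{2}, \]

i.e., a graph Tikhonov regularization term plus a vertex-wise quadratic penalty of the same form as the data-fidelity term in (1).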
ACKNOWLEDGMENT

This work was partly funded by the French National Research Agency in the framework of the "Investissements d'avenir" program (ANR-15-IDEX-02), the LabEx PERSYVAL (ANR-11-LABX-0025-01), the ANR GraVa (ANR-18-CE40-0005), the ANR GenGP (ANR-16-CE23-0008), the Grenoble Data Institute (ANR-15-IDEX-02) and the MIAI@Grenoble Alpes chair "LargeDATA at UGA" (ANR-19-P3IA-0003).

REFERENCES

[1] Y. Li, R. Yu, C. Shahabi, and Y. Liu, "Diffusion convolutional recurrent neural network: Data-driven traffic forecasting," Jul. 2018.
[2] W. Huang, T. A. Bolton, J. D. Medaglia, D. S. Bassett, A. Ribeiro, and D. Van De Ville, "A graph signal processing perspective on functional brain imaging," Proc. IEEE, vol. 106, no. 5, pp. 868–885, May 2018.
[3] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains," IEEE Signal Process. Mag., vol. 30, no. 3, pp. 83–98, Oct. 2013.
[4] A. Sandryhaila and J. M. Moura, "Big data analysis with signal processing on graphs: Representation and processing of massive data sets with irregular structure," IEEE Signal Process. Mag., vol. 31, no. 5, pp. 80–90, 2014.
[5] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," J. Mach. Learn. Res., vol. 7, pp. 2399–2434, 2006.
[6] I. Pesenson, "Variational splines and Paley–Wiener spaces on combinatorial graphs," Constr. Approx., vol. 29, no. 1, pp. 1–21, Nov. 2009.
[7] X. Zhu and Z. Ghahramani, "Learning from labeled and unlabeled data with label propagation," 2002.
[8] L. Grady and E. Schwartz, "Anisotropic interpolation on graphs: The combinatorial Dirichlet problem," Bost. Univ., 2003.
[9] L. J. Grady and E. L. Schwartz, Anisotropic Interpolation on Graphs: The Combinatorial Dirichlet Problem. Citeseer, 2003.
[10] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in Adv. Neural Inf. Process. Syst., 2004, pp. 321–328.
[11] K. Avrachenkov, A. Mishenin, P. Gonçalves, and M. Sokol, "Generalized optimization framework for graph-based semi-supervised learning," in Proc. 2012 SIAM Int. Conf. Data Min., 2012, pp. 966–974.
[12] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[13] C. S. Burrus, J. A. Barreto, and I. W. Selesnick, "Iterative reweighted least-squares design of FIR filters," IEEE Trans. Signal Process., vol. 42, no. 11, pp. 2926–2936, 1994.
[14] Y. Saad, Iterative Methods for Sparse Linear Systems, 2003.
[15] D. I. Shuman, P. Vandergheynst, and P. Frossard, "Chebyshev polynomial approximation for distributed signal processing," May 2011.
[16] F. R. K. Chung and F. C. Graham, Spectral Graph Theory. American Mathematical Soc., 1997, no. 92.
[17] G. Grimmett, Probability on Graphs: Random Processes on Graphs and Lattices, 2nd ed. Cambridge University Press, 2018, vol. 8.
[18] X. M. Wu, Z. Li, A. M. C. So, J. Wright, and S. F. Chang, "Learning with partially absorbing random walks," in Adv. Neural Inf. Process. Syst., vol. 4, 2012, pp. 3077–3085.
[19] X. Wu, "Learning on graphs with partially absorbing random walks: Theory and practice," Ph.D. dissertation, 2016.
[20] K. Avrachenkov, P. Chebotarev, and A. Mishenin, "Semi-supervised learning with regularized Laplacian," Optim. Methods Softw., vol. 32, no. 2, pp. 222–236, 2017.
[21] L. Avena and A. Gaudillière, "Random spanning forests, Markov matrix spectra and well distributed points," Oct. 2013.
[22] L. Avena and A. Gaudillière, "Two applications of random spanning forests," J. Theor. Probab., vol. 31, no. 4, pp. 1975–2004, 2018.
[23] D. B. Wilson, "Generating random spanning trees more quickly than the cover time," in Proc. Annu. ACM Symp. Theory Comput., Jul. 1996, pp. 296–303.
[24] Y. Y. Pilavci, P.-O. Amblard, S. Barthelmé, and N. Tremblay, "Smoothing graph signals via random spanning forests," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), May 2020, pp. 5630–5634.
[25] P. Marchal, "Loop-erased random walks, spanning trees and Hamiltonian cycles," Electron. Commun. Probab., vol. 5, pp. 39–50, 2000.
[26] A. Kulesza and B. Taskar, "Determinantal point processes for machine learning," Found. Trends Mach. Learn., vol. 5, no. 2–3, pp. 123–286, Jul. 2012.
[27] D. Blackwell, "Conditional expectation and unbiased sequential estimation," Ann. Math. Stat., vol. 18, no. 1, pp. 105–110, Mar. 1947.
[28] C. R. Rao, "Information and the accuracy attainable in the estimation of statistical parameters," 1992, pp. 235–247.
[29] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: Data mining, inference and prediction," Math. Intell., vol. 27, no. 2, pp. 83–85, 2005.
[30] D. Girard, "Un algorithme simple et rapide pour la validation croisée généralisée sur des problèmes de grande taille," Tech. Rep., 1987.
[31] M. F. Hutchinson, "A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines," Commun. Stat. Simul. Comput., vol. 19, no. 2, pp. 433–450, 1990.
[32] S. Barthelmé, N. Tremblay, A. Gaudillière, L. Avena, and P.-O. Amblard, "Estimating the inverse trace using random forests on graphs," in XXVIIème Colloq. GRETSI (GRETSI 2019), Lille, France, Aug. 2019.
[33] R. Tibshirani and L. Wasserman, "Stein's unbiased risk estimate," course notes from "Statistical Machine Learning," pp. 1–12, 2015.
[34] C. M. Stein, "Estimation of the mean of a multivariate normal distribution," Ann. Stat., vol. 9, no. 6, pp. 1135–1151, Nov. 1981.
[35] X. Zhu, "Semi-supervised learning with graphs," Ph.D. dissertation, 2005.
[36] M. Lebrun, M. Colom, A. Buades, and J. M. Morel, "Secrets of image denoising cuisine," Acta Numer., vol. 21, pp. 475–576, May 2012.
[37] R. J. Tibshirani and J. Taylor, "The solution path of the generalized lasso," Ann. Stat., vol. 39, no. 3, pp. 1335–1371, 2011.
[38] J. Sharpnack, A. Singh, and A. Rinaldo, "Sparsistency of the edge lasso over graphs," in Artificial Intelligence and Statistics, 2012, pp. 1028–1036.
[39] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, "Collective classification in network data," AI Mag., vol. 29, no. 3, p. 93, 2008.
[40] J. A. Kelner, L. Orecchia, A. Sidford, and Z. A. Zhu, "A simple, combinatorial algorithm for solving SDD systems in nearly-linear time," in Proc. Annu. ACM Symp. Theory Comput. (STOC), 2013.