Adaptive Diffusions for Scalable Learning over Graphs
Dimitris Berberidis, Athanasios N. Nikolakopoulos, Georgios B. Giannakis
Dept. of ECE and Digital Technology Center, University of Minnesota, Minneapolis, MN 55455, USA

Abstract—Diffusion-based classifiers, such as those relying on the Personalized PageRank and the Heat Kernel, enjoy remarkable classification accuracy at modest computational requirements. Their performance, however, is affected by the extent to which the chosen diffusion captures a typically unknown label propagation mechanism, which can be specific to the underlying graph and potentially different for each class. The present work introduces a disciplined, data-efficient approach to learning class-specific diffusion functions adapted to the underlying network topology. The novel learning approach leverages the notion of "landing probabilities" of class-specific random walks, which can be computed efficiently, thereby ensuring scalability to large graphs. This is supported by rigorous analysis of the properties of the model as well as of the proposed algorithms. Furthermore, a robust version of the classifier facilitates learning even in noisy environments. Classification tests on real networks demonstrate that adapting the diffusion function to the given graph and observed labels significantly improves the performance over fixed diffusions, reaching – and many times surpassing – the classification accuracy of computationally heavier state-of-the-art competing methods that rely on node embeddings and deep neural networks.
Index Terms—Semi-supervised Classification, Random Walks, Diffusions.
I. INTRODUCTION

The task of classifying the nodes of a graph arises frequently in several applications over real-world networks, such as those derived from social interactions and biological dependencies. Graph-based semi-supervised learning (SSL) methods tackle this task building on the premise that the true labels are distributed "smoothly" with respect to the underlying network, which then motivates leveraging the network structure to increase classification accuracy [11]. Graph-based SSL has been pursued by various intertwined methods, including iterative label propagation [6], [43], [29], [25], kernels on graphs [31], manifold regularization [5], graph partitioning [40], [19], transductive learning [39], competitive infection models [36], and bootstrapped label propagation [10]. SSL based on graph filters was discussed in [37], and further developed in [12] for bridge monitoring. Recently, approaches based on node embeddings [34], [18], [42], as well as deep learning architectures [21], [2], have gained popularity and have been reported to achieve state-of-the-art performance.

Many of the aforementioned methods are challenged by computational complexity and scalability issues that limit their applicability to large-scale networks. Random-walk-based diffusions present an efficient and effective alternative.
Methods of this family diffuse the known labels probabilistically through the graph, thereby ranking nodes according to weighted sums of variable-length landing probabilities. Celebrated representatives include those based on the Personalized PageRank (PPR) and the Heat Kernel, which were found to perform remarkably well in certain application domains [22], and have been nicely linked to particular network models [23], [3], [24]. Spectral diffusions have been used for community detection [47], [45], where local diffusion patterns are produced to approximate low-conductance communities, and adaptive PPR has been applied for prediction on a heterogeneous protein-function network [46].

The effectiveness of diffusion-based classifiers can vary considerably depending on whether the chosen diffusion conforms with the latent label propagation mechanism, which may be (i) particular to the target application or underlying network topology; and (ii) different for each class. The present contribution alleviates these shortcomings and markedly improves the performance of random-walk-based classifiers by adapting the diffusion functions of every class to both the network and the observed labels. The resulting novel classifier relies on the notion of landing probabilities of short random walks rooted at the observed nodes of each class. This small number of landing probabilities can be extracted efficiently with a small number of sparse matrix-vector products, thus ensuring applicability to large-scale networks. Theoretical analysis establishes that short random walks are in most cases sufficient for reliable classification. Furthermore, an algorithm is developed to identify (and potentially remove) outlying or anomalous samples jointly with adapting the diffusions. We test our methods in terms of multiclass and multilabel classification accuracy, and confirm that they can achieve results competitive with state-of-the-art methods, while also being considerably faster.

The rest of the paper is organized as follows. Section II introduces random-walk-based diffusions. Our novel approach, along with relevant analytical results, is the subject of Section III. Section IV presents a robust version of our algorithm, and Section V places our work in the context of related methods. Finally, Section VI presents experiments, while Section VII concludes the paper and discusses future directions.

Notation.
Bold lower-case letters denote column vectors (e.g., v); bold upper-case letters denote matrices (e.g., Q). Vectors q_j and q_i^T denote the j-th column and the i-th row of Q, respectively, whereas Q_ij (or, for clarity, [Q]_ij) denotes the ij-th entry of Q. Vector e_K denotes the K-th canonical column vector, and ‖·‖ denotes the Euclidean norm, unless stated otherwise.

Work was supported by NSF grants 171141, 1514056, and 1500713. A preliminary version of this work has appeared in [8].
II. PROBLEM STATEMENT AND MODELING
Consider a graph G := {V, E}, where V is the set of N nodes, and E the set of edges. Connectivity is captured by the weight matrix W having entries W_ij > 0 if (i, j) ∈ E. Associated with each node i ∈ V there is a discrete label y_i ∈ Y. In SSL classification over graphs, a subset L ⊂ V of nodes has available labels y_L, and the goal is to infer the labels of the unlabeled set U := V \ L. Given a measure of influence, a node most influenced by labeled nodes of a certain class is deemed to also belong to that class. Thus, label propagation on graphs boils down to quantifying the influence of L on U; see, e.g., [11], [25], [41]. An intuitive yet simple measure of node-to-node influence relies on the notion of random walks on graphs.

A simple random walk on a graph is a discrete-time Markov chain defined over the nodes, meaning with state space V. The transition probabilities are

    Pr{X_k = i | X_{k−1} = j} = W_ij / d_j = [W D^{−1}]_ij := [H]_ij

where X_k ∈ V denotes the position of the random walker (state) at the k-th step; d_j := Σ_{k∈N_j} W_kj is the degree of node j; and N_j is its neighborhood. Since we consider undirected graphs, the limiting distribution of {X_k} always exists, and it is unique if the graph is connected and non-bipartite. It is given by the dominant right eigenvector of the column-stochastic transition probability matrix H := W D^{−1}, where D := diag(d_1, d_2, …, d_N) [27]. The steady-state distribution π can be shown to have entries

    π_i := lim_{k→∞} Σ_{j∈V} Pr{X_k = i | X_0 = j} Pr{X_0 = j} = d_i / (2|E|)

that clearly do not depend on the initial "seeding" distribution Pr{X_0}; π is thus unsuitable for measuring influence among nodes. Instead, for graph-based SSL, we will utilize the k-step landing probability per node i given by

    p^(k)_i := Σ_{j∈V} Pr{X_k = i | X_0 = j} Pr{X_0 = j}    (1)

which in vector form p^(k) := [p^(k)_1 ⋯ p^(k)_N]^T satisfies p^(k) = H^k p^(0), where p^(0)_i := Pr{X_0 = i}. In words, p^(k)_i is the probability that a random walker with initial distribution p^(0) is located at node i after k steps. Therefore, p^(k)_i is a valid measure of the influence that p^(0) has on any node in V. The landing probabilities per class c ∈ Y are (cf. (1))

    p^(k)_c = H^k v_c    (2)

where, for L_c := {i ∈ L : y_i = c}, the normalized class-indicator vector v_c with i-th entry

    [v_c]_i = { 1/|L_c|,  i ∈ L_c;  0,  else }    (3)

acts as the initial distribution. Using (2), we model diffusions per class c over the graph driven by {p^(k)_c}_{k=0}^K as

    f_c(θ) = Σ_{k=0}^K θ_k p^(k)_c    (4)

where θ_k denotes the importance assigned to the k-th hop neighborhood. By setting θ_0 = 0 (since p^(0)_c is not useful for classification purposes) and constraining θ ∈ S_K, where S_K := {x ∈ R^K : x ≥ 0, 1^T x = 1} is the K-dimensional probability simplex, f_c(θ) can be compactly expressed as

    f_c(θ) = Σ_{k=1}^K θ_k p^(k)_c = P^(K)_c θ    (5)

where P^(K)_c := [p^(1)_c ⋯ p^(K)_c]. Note that f_c(θ) denotes a valid nodal probability mass function (pmf) for class c. Given θ, and upon obtaining {f_c(θ)}_{c∈Y}, our diffusion-based classifiers will predict labels over U as

    ŷ_i(θ) := arg max_{c∈Y} [f_c(θ)]_i    (6)

where [f_c(θ)]_i is the i-th entry of f_c(θ). The upshot of (4) is a unifying form of superimposed diffusions with tunable simplex weights, taking up to K steps per class to come up with an influence metric for the semi-supervised classifier (6) over graphs.
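To make the random-walk machinery above concrete, the following minimal sketch computes the landing probability matrix of (2)–(5) with one sparse matrix-vector product per step, and applies the classifier (6). It is an illustration written for this article (not the authors' released implementation), assuming the graph is given as a SciPy sparse adjacency matrix and the seed indices of each class are known.

```python
import numpy as np
import scipy.sparse as sp

def landing_probabilities(W, seeds, K):
    """Columns p_c^(1), ..., p_c^(K) of P_c^(K) (cf. (2)-(3))."""
    N = W.shape[0]
    d = np.asarray(W.sum(axis=0)).ravel()      # degrees d_j
    H = W @ sp.diags(1.0 / d)                  # column-stochastic H = W D^{-1}
    p = np.zeros(N)
    p[seeds] = 1.0 / len(seeds)                # initial distribution v_c
    P = np.empty((N, K))
    for k in range(K):
        p = H @ p                              # p^(k) = H p^(k-1)
        P[:, k] = p
    return P

def diffusion_classifier(P_per_class, theta):
    """Predictions (6): stack f_c(theta) = P_c^(K) theta, take the argmax."""
    F = np.column_stack([P @ theta for P in P_per_class])
    return F.argmax(axis=1)                    # index into the class list
```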
Next, we outline two notable members of the family of diffusion-based classifiers that can be viewed as special cases of (4).

A. Personalized PageRank Classifier
Inspired by the celebrated network centrality metric of [9], the Personalized PageRank (PPR) algorithm has well-documented merits for label propagation; see, e.g., [28]. PPR is a special case of (4) corresponding to θ_PPR = (1−α)[α α^2 ⋯ α^K]^T, where 0 < α < 1, and 1−α can be interpreted as the "restart" probability of random walks with restarts. The PPR-based classifier relies on (cf. (5))

    f_c(θ_PPR) = (1−α) Σ_{k=0}^K α^k p^(k)_c    (7)

satisfying, asymptotically in the number of random-walk steps,

    lim_{K→∞} f_c(θ_PPR) = (1−α)(I − αH)^{−1} v_c

which implies that f_c(θ_PPR) approximates the solution of a linear system. Indeed, as shown in [3], PPR amounts to solving a weighted regularized least-squares problem over V; see also [23] for a PPR interpretation as an approximate geometric discriminant function defined in the space of landing probabilities.

B. Heat Kernel Classifier
The heat kernel (HK) is another popular diffusion that has recently been employed for SSL [31] and community detection on graphs [22]. HK is also a special case of (4), with θ_HK = e^{−t}[t  t^2/2!  ⋯  t^K/K!]^T, yielding class distributions (cf. (4))

    f_c(θ_HK) = e^{−t} Σ_{k=0}^K (t^k / k!) p^(k)_c.    (8)

Furthermore, it can readily be shown that

    lim_{K→∞} f_c(θ_HK) = e^{−t(I−H)} v_c

allowing HK to be interpreted as an approximation of a heat diffusion process, where heat flows from L_c to the rest of the graph, and f_c(θ_HK) is a snapshot of the temperature after time t has elapsed. HK provably yields low-conductance communities, while also converging faster to its asymptotic closed-form expression than PPR (depending on the value of t) [15].
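For concreteness, the fixed coefficient profiles θ_PPR and θ_HK can be generated as follows. Truncating at K steps and renormalizing onto S_K (to match the simplex convention of (5)) is our editorial choice for this sketch, and the default values of alpha and t are merely illustrative.

```python
import numpy as np
from scipy.special import factorial

def theta_ppr(K, alpha=0.9):
    """Truncated PPR profile: theta_k proportional to (1 - alpha) * alpha^k."""
    theta = (1.0 - alpha) * alpha ** np.arange(1, K + 1)
    return theta / theta.sum()     # renormalize onto the simplex S_K

def theta_hk(K, t=10.0):
    """Truncated heat-kernel profile: theta_k proportional to e^{-t} t^k / k!."""
    k = np.arange(1, K + 1)
    theta = np.exp(-t) * t ** k / factorial(k)
    return theta / theta.sum()
```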
III. ADAPTIVE DIFFUSIONS

Besides the unifying view of (4), the main contribution here is on efficiently designing f_c(θ_c) in (5) by learning the corresponding θ_c per class. Thus, unlike PPR and HK, the methods introduced here can afford class-specific label propagation that is adaptive to the graph structure, and also adaptive to the labeled nodes. Figure 1 gives a high-level illustration of the proposed adaptive diffusion framework, where two classes (red and green) are to be diffused over the graph (cf. (2)), with class-specific diffusion coefficients adapted as described next. Diffusions are then built (cf. (5)) and employed for class prediction (cf. (6)).

Consider for generality a goodness-of-fit loss ℓ(·) and a regularizer R(·) promoting, e.g., smoothness over the graph. Using these, the sought class distribution will be

    f̂_c = arg min_{f∈R^N} ℓ(y_{L_c}, f) + λ R(f)    (9)

where λ tunes the degree of regularization, and

    [y_{L_c}]_i = { 1,  i ∈ L_c;  0,  else }

is the indicator vector of the nodes belonging to class c. Using our diffusion model in (5), the N-dimensional optimization problem (9) reduces to solving for the K-dimensional vector (K ≪ N)

    θ̂_c = arg min_{θ∈S_K} ℓ(y_{L_c}, f_c(θ)) + λ R(f_c(θ)).    (10)

Although many choices of ℓ(·) may be of interest, we will focus for simplicity on the quadratic loss, namely

    ℓ(y_{L_c}, f) := Σ_{i∈L} (1/d_i) ([ȳ_{L_c}]_i − f_i)^2 = (ȳ_{L_c} − f)^T D_L^† (ȳ_{L_c} − f)    (11)

where ȳ_{L_c} := (1/|L|) y_{L_c} is the class indicator vector after normalization to bring the target variables (entries of ȳ_{L_c}) and the entries of f to the same scale, and D_L^† = diag(d^(−1)_L) with entries

    [d^(−1)_L]_i = { 1/d_i,  i ∈ L;  0,  else }.

For a smoothness-promoting regularization, we will employ the following (normalized) Laplacian-based metric

    R(f) = (1/2) Σ_{i∈V} Σ_{j∈N_i} W_ij (f_i/d_i − f_j/d_j)^2 = f^T D^{−1} L D^{−1} f    (12)

where L := D − W is the Laplacian matrix of the graph. Intuitively speaking, (11) favors vectors f having non-zero (1/|L|) values on nodes that are known to belong to class c, and zero values on nodes that are known to belong to other classes (L \ L_c), while (12) promotes similarity of the entries of f that correspond to neighboring nodes.

Fig. 1. High-level illustration of adaptive diffusions. The nodes belong to two classes (red and green). The per-class diffusions are learned by exploiting the landing probability spaces produced by random walks rooted at the sampled nodes (second layer: up for red; down for green).
In (11) and (12), each entry f_i is normalized by d_i^{−1/2} and d_i^{−1}, respectively. This normalization counterbalances the tendency of random walks to concentrate on high-degree nodes, thus placing equal importance on all nodes.

Substituting (11) and (12) into (10), and recalling from (5) that f_c(θ) = P^(K)_c θ, yields the convex quadratic program

    θ̂_c = arg min_{θ∈S_K} θ^T A_c θ + θ^T b_c    (13)

with b_c and A_c given by

    b_c = −(2/|L|) (P^(K)_c)^T D_L^† y_{L_c}    (14)
    A_c = (P^(K)_c)^T D_L^† P^(K)_c + λ (P^(K)_c)^T D^{−1} L D^{−1} P^(K)_c    (15)
        = (P^(K)_c)^T [ (D_L^† + λ D^{−1}) P^(K)_c − λ D^{−1} H P^(K)_c ]
        = (P^(K)_c)^T ( D_L^† P^(K)_c + λ D^{−1} P̃^(K)_c )    (16)

where

    H P^(K)_c = [H p^(1)_c  H p^(2)_c ⋯ H p^(K)_c] = [p^(2)_c  p^(3)_c ⋯ p^(K+1)_c]

is a "shifted" version of P^(K)_c in which each p^(k)_c is advanced by one step, and P̃^(K)_c := [p̃^(1)_c  p̃^(2)_c ⋯ p̃^(K)_c], with p̃^(i)_c := p^(i)_c − p^(i+1)_c, contains the "differential" landing probabilities. The complexity of "naively" finding the K×K matrix A_c (and thus also b_c) is O(K^2 N) for computing the first summand, and O(|E|K) for the second summand in (15), after leveraging the sparsity of L, for which |E| ≪ N^2. But since the columns of P̃^(K)_c are obtained as differences of consecutive columns of P^(K)_c, this load of O(|E|K) is saved. In a nutshell, the solver in (13)-(16), which we term adaptive diffusion (AdaDIF), incurs complexity of order O(K^2 N).

Remark 1. The problem in (13) is a quadratic program (QP) of dimension K (or the dictionary size D, to be discussed in Section III-C, when in dictionary mode). In general, solving a QP with K variables to a given precision requires O(K^3) worst-case complexity. Although this may appear heavy, K in our setting is small (a few tens at most), and thus negligible compared to the quantities that depend on the graph dimensions. For instance, the graphs that we tested have on the order of 10^4 nodes (N) and 10^5 edges (|E|). Therefore, since K ≪ N and K ≪ |E| by many orders of magnitude, the overall complexity is dominated by the O(|E|K) (same as PPR and HK) for performing the random walks and the O(K^2 N) for computing A_c.
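The program (13) can be handed to any off-the-shelf QP solver; since the paper's implementation mentions CVXOPT (Section VI), the following hedged sketch casts (13) into CVXOPT's canonical form. Symmetrizing A_c is a standard numerical safeguard, not a step spelled out in the paper.

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_simplex_qp(A_c, b_c):
    """min_theta theta^T A_c theta + theta^T b_c  s.t.  theta >= 0, 1^T theta = 1.
    CVXOPT solves (1/2) x^T P x + q^T x with G x <= h, A x = b."""
    K = b_c.size
    P = matrix(A_c + A_c.T)            # 2 * (symmetric part of A_c)
    q = matrix(b_c.astype(float))
    G = matrix(-np.eye(K))             # -theta <= 0  (nonnegativity)
    h = matrix(np.zeros(K))
    A = matrix(np.ones((1, K)))        # hyperplane 1^T theta = 1
    b = matrix(np.ones(1))
    sol = solvers.qp(P, q, G, h, A, b)
    return np.asarray(sol['x']).ravel()
```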
A. Limiting behavior and computational complexity

In this section, we offer further insights into the model (5), along with a complexity analysis of the parametric solution in (13). To start, the next proposition establishes the limiting behavior of AdaDIF as the regularization parameter grows.
Proposition 1. If the second largest eigenvalue of H has multiplicity 1, then for K sufficiently large but finite, the solution to (13) as λ → ∞ satisfies

    θ̂_c = e_K,  ∀ L_c ⊆ V.    (17)

Our experience with solving (13) reveals that the "sufficiently large" K required for (17) to hold can be quite small. As λ → ∞, the effect of the loss in (10) vanishes. According to Proposition 1, this causes AdaDIF to boost smoothness by concentrating the simplex weights (entries of θ̂_c) on landing probabilities of the late steps (k close to K). If, on the other extreme, smoothness over the graph is not promoted (cf. λ = 0), the sole objective of AdaDIF is to construct diffusions that best fit the available labeled data. Since short random walks from a given node typically lead to nodes of the same class, while longer walks reach other classes, AdaDIF with λ = 0 tends to leverage only a few landing probabilities of early steps (k close to 1). For moderate values of λ, AdaDIF effectively adapts the per-class diffusions by balancing the emphasis on initial versus final landing probabilities.

Fig. 2 depicts an example of how AdaDIF places weights {θ_k}_{k=1}^K on landing probabilities after a maximum of K = 20 steps, generated from a few samples belonging to one of the classes of the Cora citation network. Note that the learned coefficients may follow radically different patterns than those dictated by standard non-adaptive diffusions such as PPR or HK. It is worth noting that the simplex constraint induces sparsity in the solution of (13), thus "pushing" {θ_k} entries to zero.

The computational core of the proposed method is building the landing probability matrix P^(K)_c, whose columns are computed fast using power iterations that leverage the sparsity of H (cf. (2)). This endows AdaDIF with high computational efficiency, especially for small K. Specifically, since solving (13) incurs complexity O(K^2 N) per class, if K < |E|/N this becomes O(|E|K); and for |Y| classes, the
overall complexity of AdaDIF is O(|Y||E|K), which is of the same order as that of non-adaptive diffusions such as PPR and HK. For larger K, however, an additional O(K^2 N) is required per class, mainly to obtain A_c in (16). Overall, if the O(KN) memory requirements are met, the runtime of AdaDIF scales linearly with |E|, provided that K remains small. Thankfully, small values of K are usually sufficient to achieve high learning performance. As will be shown in the next section, this observation is on par with the analytical properties of diffusion-based classifiers, where it turns out that large K does not improve classification accuracy.

Fig. 2. Illustration of K = 20 landing probability coefficients for PPR, HK with t = 10, and AdaDIF (λ = 15).

B. On the choice of K
Here we elaborate on how the selection of K influences the classification task at hand. As expected, the effect of K is intimately linked to the topology of the underlying graph, the labeled nodes, and their properties. For simplicity, we will focus on binary classification with the two classes denoted by "+" and "−." Central to our subsequent analysis is a concrete measure of the effect an extra landing probability vector p^(k)_c can have on the outcome of a diffusion-based classifier. Intuitively, this effect diminishes as the number of steps K grows, since all random walks eventually converge to the same stationary distribution. Motivated by this, we introduce next the γ-distinguishability threshold.

Definition 1 (γ-distinguishability threshold). Let p_+ and p_− denote the seed vectors for nodes of class "+" and "−," respectively, initializing the landing probability vectors in the matrices X_c := P^(K)_c and X̌_c := [p^(1)_c ⋯ p^(K−1)_c  p^(K+1)_c], where c ∈ {+, −}. With y := X_+ θ − X_− θ and y̌ := X̌_+ θ − X̌_− θ, the γ-distinguishability threshold of the diffusion-based classifier is the smallest integer K_γ satisfying ‖y − y̌‖_2 ≤ γ.

The following theorem establishes an upper bound on K_γ expressed in terms of fundamental quantities of the graph, as well as basic properties of the labeled nodes per class; see Appendix B for a proof.

Theorem 1.
For any diffusion-based classifier with coefficients θ constrained to a probability simplex of appropriate dimension, the γ-distinguishability threshold is upper-bounded as

    K_γ ≤ (1/μ′) log[ (2√(d_max)/γ) ( 1/√(d_{min,−}|L_−|) + 1/√(d_{min,+}|L_+|) ) ]

where d_{min,+} := min_{i∈L_+} d_i, d_{min,−} := min_{j∈L_−} d_j, d_max := max_{i∈V} d_i, and μ′ := min{μ_2, 2 − μ_N}, with {μ_n}_{n=1}^N denoting the eigenvalues of the normalized graph Laplacian in ascending order.

The γ-distinguishability threshold can guide the choice of the dimension K of the landing probability space. Indeed, using class-specific landing probability steps K ≥ K_γ does not help distinguish between the corresponding classes, in the sense that the classifier output is not perturbed by more than γ. Intuitively, the information contained in the landing probabilities K_γ + 1, K_γ + 2, … is essentially the same for both classes; thus, using them as features unnecessarily increases the overall complexity of the classifier, and also "opens the door" to curse-of-dimensionality concerns. Note also that, in settings where one can freely choose the nodes to sample, this result could be used to guide such a choice in a disciplined way.

Theorem 1 makes no assumptions on the diffusion coefficients, so long as they belong to a probability simplex. Of course, specifying the diffusion function can specialize and further tighten the corresponding γ-distinguishability threshold. In Appendix C we give a tighter threshold for PPR.

Conveniently, our experiments suggest that K in the range of 10 to 20 is usually sufficient to achieve high performance on most real graphs; see also Fig. 3, where K_γ is found numerically for different values of the γ-distinguishability threshold and different proportions of sampled nodes on the BlogCatalog graph. Nevertheless, longer random walks may be necessary in, e.g., graphs with small μ′, especially when labeled nodes are scarce. To deal with such challenges, the ensuing modification of AdaDIF, which scales linearly with K, is nicely motivated.

Remark 2. While PPR and HK in theory rely on infinitely long random walks, their coefficients decay rapidly (θ_k ∝ α^k for PPR and θ_k ∝ t^k/k! for HK). This means that not only θ_k → 0 as k → ∞ in both cases, but the convergence rate is also very fast (especially for HK). This agrees with our intuition on random walks, as well as with Theorem 1, which suggests that, to guarantee a level of distinguishability (necessary for accuracy) between classes, classifiers should rely on relatively short random walks. Moreover, when operating in an adaptive framework such as the one proposed here, using finite-step (preferably short-length) landing probabilities is much more practical, since it restricts the number of free variables (the θ_k's), which mitigates overfitting and results in optimization problems that scale well with the network size.
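Under the stated assumptions, the bound of Theorem 1 is straightforward to evaluate numerically; the sketch below does so, with all inputs being the quantities defined in the theorem (the example values in the comment are hypothetical).

```python
import numpy as np

def k_gamma_bound(gamma, d_max, d_min_pos, d_min_neg, L_pos, L_neg, mu_prime):
    """Upper bound on the gamma-distinguishability threshold K_gamma
    (Theorem 1), with mu_prime = min(mu_2, 2 - mu_N)."""
    coeff = (2.0 * np.sqrt(d_max) / gamma) * (
        1.0 / np.sqrt(d_min_neg * L_neg) + 1.0 / np.sqrt(d_min_pos * L_pos))
    return int(np.ceil(np.log(coeff) / mu_prime))

# Hypothetical example: gamma = 1e-3, d_max = 100, d_min = 2 for both
# classes, 20 seeds per class, mu_prime = 0.2 -> K_gamma of a few tens.
```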
Fig. 3. Experimental evaluation of K_γ for different values of the γ-distinguishability threshold, and different proportions of sampled nodes, on the BlogCatalog graph.
C. Dictionary of diffusions
The present section deals with a modified version of AdaDIF, where the number of parameters (the dimension of θ) is restricted to D < K, meaning the "degrees of freedom" of each class-specific distribution are fewer than the number of landing probabilities. Specifically, consider (cf. (5))

    f_c(θ) = Σ_{k=1}^K a_k(θ) p^(k)_c = P^(K)_c a(θ)

where a_k(θ) := Σ_{d=1}^D θ_d C_kd, and C := [c_1 ⋯ c_D] ∈ R^{K×D} is a dictionary of D coefficient vectors, the i-th forming the column c_i ∈ S_K. Since a(θ) = Cθ, it follows that

    f_c(θ) = P^(K)_c C θ = Σ_{d=1}^D θ_d f^(d)_c

where f^(d)_c := Σ_{k=1}^K C_kd p^(k)_c is the d-th dictionary diffusion. To find the optimal θ, the optimization problem in (13) is solved with

    b_c = −(2/|L|) (F^Δ_c)^T D_L^† y_{L_c}    (18)
    A_c = (F^Δ_c)^T D_L^† F^Δ_c + λ (F^Δ_c)^T D^{−1} L D^{−1} F^Δ_c    (19)

where F^Δ_c := [f^(1)_c ⋯ f^(D)_c] effectively replaces P^(K)_c as the basis of the space on which each f_c is constructed. The description of AdaDIF in dictionary mode is given as a special case of Algorithm 1, together with the subroutine in Algorithm 3 for memory-efficient generation of F^Δ_c.

The motivation behind this dictionary-based variant of AdaDIF is two-fold. First, it leverages the properties of a judiciously selected basis of known diffusions, e.g., by constructing C = [θ_PPR  θ_HK ⋯]. In that sense, our approach is related to multi-kernel methods, e.g., [1], although significantly more scalable than the latter. Second, the complexity of AdaDIF in dictionary mode is O(|E|(K + D)), where D can be arbitrarily smaller than K, leading to complexity that is linear with respect to both K and |E|.
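A dictionary C of the kind described above can be assembled as follows. The specific values of t and β used in the paper's experiments are garbled in our source, so the defaults below are purely illustrative, and the negative sign of the power-law exponent is an assumption.

```python
import numpy as np
from scipy.special import factorial

def build_dictionary(K, ts=(5, 10, 15, 20, 25), betas=(1, 2, 3, 4, 5)):
    """C in R^{K x D}: heat-kernel columns (one per t) followed by
    power-law columns k^{-beta} (one per beta), all normalized to S_K."""
    k = np.arange(1, K + 1, dtype=float)
    cols = []
    for t in ts:
        c = np.exp(-t) * t ** k / factorial(k)
        cols.append(c / c.sum())
    for beta in betas:
        c = k ** (-float(beta))
        cols.append(c / c.sum())
    return np.column_stack(cols)       # D = len(ts) + len(betas) columns
```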
Algorithm 1 ADAPTIVE DIFFUSIONS
Input: adjacency matrix W; labeled nodes {y_i}_{i∈L}
Parameters: regularization parameter λ; number of landing probabilities K; Dictionary mode ∈ {True, False}; Unconstrained ∈ {True, False}
Output: predictions {ŷ_i}_{i∈U}
  Extract Y = {set of unique labels in {y_i}_{i∈L}}
  for c ∈ Y do
    L_c = {i ∈ L : y_i = c}
    if Dictionary mode then
      F^Δ_c = DICTIONARY(W, L_c, K, C)
      Obtain b_c and A_c as in (18) and (19)
      F_c = F^Δ_c
    else
      {P^(K)_c, P̃^(K)_c} = LANDPROB(W, L_c, K)
      Obtain b_c and A_c as in (14) and (16)
      F_c = P^(K)_c
    end if
    if Unconstrained then
      Obtain θ̂_c as in (20) and (21)
    else
      Obtain θ̂_c by solving (13)
    end if
    f_c(θ̂_c) = F_c θ̂_c
  end for
  Obtain ŷ_i = arg max_{c∈Y} [f_c(θ̂_c)]_i, ∀ i ∈ U

Algorithm 2 LANDPROB
Input: W, L_c, K
Output: P^(K)_c, P̃^(K)_c
  H = W D^{−1}
  p^(0)_c = v_c
  for k = 1 : K + 1 do
    p^(k)_c = H p^(k−1)_c
    p̃^(k−1)_c = p^(k−1)_c − p^(k)_c
  end for

Algorithm 3 DICTIONARY
Input: W, L_c, K, C
Output: F^Δ_c
  H = W D^{−1}
  p^(0)_c = v_c
  {f^(d)_c}_{d=1}^D = 0
  for k = 1 : K do
    p^(k)_c = H p^(k−1)_c
    for d = 1 : D do
      f^(d)_c = f^(d)_c + C_kd p^(k)_c
    end for
  end for
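In Python, Algorithm 2 and the assembly of (14) and (16) might look as follows. This is a sketch under the paper's definitions (D_L^† carries 1/d_i on labeled nodes and zero elsewhere), not the authors' released code.

```python
import numpy as np
import scipy.sparse as sp

def land_prob(W, seeds, K):
    """Algorithm 2: P_c^(K) and the differential matrix tilde{P}_c^(K)."""
    N = W.shape[0]
    d = np.asarray(W.sum(axis=0)).ravel()
    H = W @ sp.diags(1.0 / d)                 # H = W D^{-1}
    P = np.empty((N, K + 1))                  # columns p^(1) ... p^(K+1)
    p = np.zeros(N)
    p[seeds] = 1.0 / len(seeds)               # p^(0) = v_c
    for k in range(K + 1):
        p = H @ p
        P[:, k] = p
    return P[:, :K], P[:, :K] - P[:, 1:]      # P^(K), tilde{P}^(K)

def assemble_qp_terms(P, P_tilde, y_c, d, labeled, lam, L_size):
    """b_c and A_c from (14) and (16); `labeled` is a boolean node mask."""
    w = np.where(labeled, 1.0 / d, 0.0)       # diagonal of D_L^dagger
    b_c = -(2.0 / L_size) * P.T @ (w * y_c)
    A_c = P.T @ (w[:, None] * P + lam * P_tilde / d[:, None])
    return A_c, b_c
```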
D. Unconstrained diffusions

Thus far, the diffusion coefficients θ have been constrained to the K-dimensional probability simplex S_K, resulting in sparse solutions θ̂_c, as well as f_c(θ̂_c) ∈ S_N. The latter allows each f_c(θ) to be interpreted as a pmf over V. Nevertheless, the simplex constraint imposes a limitation on the model: landing probabilities may only have a non-negative contribution to the resulting class distribution. Upon relaxing this non-negativity constraint, (13) affords the closed-form solution

    θ̂_c = −(1/2) A_c^{−1} (b_c + λ* 1)    (20)
    λ* = −(2 + 1^T A_c^{−1} b_c) / (1^T A_c^{−1} 1).    (21)

Retaining the hyperplane constraint 1^T θ = 1 forces at least one entry of θ to be positive. Note that for K > |L|, (20) may become ill-conditioned and yield inaccurate solutions. This can be mitigated by imposing ℓ_2-norm regularization on θ, which is equivalent to adding εI to A_c, with ε > 0 sufficiently large to stabilize the linear system.

A step-by-step description of the proposed AdaDIF approach is given by Algorithm 1, along with the subroutines in Algorithms 2 and 3. Determining the limiting behavior of unconstrained AdaDIF, as well as exploring the effectiveness of different regularizers (e.g., the sparsity-inducing ℓ_1-norm), is part of our ongoing research. Towards the goal of developing more robust methods for designing diffusions, the ensuing section presents our proposed approach that relies on minimizing the leave-one-out loss of the resulting classifier.
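A hedged sketch of the closed form (20)-(21), with the εI ridge mentioned above (the sign conventions follow our reconstruction of the KKT conditions of (13) under the hyperplane constraint):

```python
import numpy as np

def unconstrained_theta(A_c, b_c, eps=1e-8):
    """Solve (13) keeping only 1^T theta = 1 (cf. (20)-(21));
    eps * I guards against ill-conditioning when K > |L|."""
    K = b_c.size
    A = 0.5 * (A_c + A_c.T) + eps * np.eye(K)
    ones = np.ones(K)
    Ainv_b = np.linalg.solve(A, b_c)
    Ainv_1 = np.linalg.solve(A, ones)
    lam_star = -(2.0 + ones @ Ainv_b) / (ones @ Ainv_1)   # multiplier (21)
    return -0.5 * (Ainv_b + lam_star * Ainv_1)            # 1^T theta = 1 holds
```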
IV. ADAPTIVE DIFFUSIONS ROBUST TO ANOMALIES

Although the loss function in (11) is simple and easy to implement, it may lack robustness against nodes whose labels do not comply with a diffusion-based information propagation model. In real-world graphs, such "difficult" nodes may arise due to model limitations, observation noise, or even deliberate mislabeling by adversaries. For such setups, this section introduces a novel adaptive diffusion classifier with: i) robustness in finding θ by ignoring errors that arise due to outlying/anomalous nodes; and ii) the capability to identify and remove such "difficult" nodes.

Let us begin by defining the following per-class (c ∈ Y) loss

    ℓ^c_rob(y_{L_c}, θ) := Σ_{i∈L} (1/d_i) ([ȳ_{L_c}]_i − [f_c(θ; L\i)]_i)^2    (22)

where f_c(θ; L\i) is the class-c diffusion after removing the i-th node from the set of all labels. Intuitively, (22) evaluates the ability of a propagation mechanism effected by θ to predict the presence of a class-c label on each node i ∈ L, using the remaining labeled nodes L\i. Since each class-specific distribution f_c(θ) is constructed by random walks rooted in L_c ⊆ L, it follows that

    f_c(θ; L\i) = { f_c(θ),  i ∉ L_c;  f_c(θ; L_c\i),  i ∈ L_c }    (23)

since f_c(θ) is not directly affected by the removal of a label that belongs to another class and is not used as a class-c seed. The class-c diffusion upon removing the i-th node from the seeds L_c is given as (cf. (5))

    f_c(θ; L_c\i) = Σ_{k=1}^K θ_k p^(k)_{L_c\i}

where p^(k)_{L_c\i} := H^k v_{L_c\i}, and

    [v_{L_c\i}]_j = { 1/|L_c\i|,  j ∈ L_c\i;  0,  else }.    (24)

The robust loss in (22) can be expressed more compactly as

    ℓ^c_rob(y_{L_c}, θ) := ‖D_L^{−1/2} (ȳ_{L_c} − R^(K)_c θ)‖_2^2    (25)

where D_L^{−1/2} := (D_L^†)^{1/2}, and

    [R^(K)_c]_ik := { [p^(k)_{L_c\i}]_i,  i ∈ L_c;  [p^(k)_c]_i,  else }.    (26)

Since p^(k)_c = |L_c|^{−1} Σ_{i∈L_c} p^(k)_{L_c\i}, evaluating (25) only requires the rows of R^(K)_c and the entries of y_{L_c} that correspond to L, since the remaining diagonal entries of D_L^† are 0. Having defined ℓ^c_rob(·), per-class diffusion coefficients θ̂_c can be obtained by solving

    θ̂_c = arg min_{θ∈S_K} ℓ^c_rob(y_{L_c}, θ) + λ_θ ‖θ‖_2^2    (27)

where ℓ_2 regularization with parameter λ_θ is introduced to prevent overfitting and numerical instabilities. Note that the smoothness regularization in (12) is less appropriate in the context of robustness, since it promotes "spreading" of the random walks (cf. Prop. 1), thus making class diffusions more similar and increasing the difficulty of detecting outliers. Similar to (13), quadratic programming can be adopted to solve (27).

Towards mitigating the effects of outliers, and inspired by the robust estimators introduced in [20], we further enhance ℓ^c_rob(·) by explicitly modeling the effect of outliers with a sparse vector o ∈ R^N, leading to the modified cost

    ℓ^c_rob(y_{L_c}, o, θ) := ‖D_L^{−1/2} (o + ȳ_{L_c} − R^(K)_c θ)‖_2^2.    (28)

The non-zero entries of o can capture large residuals (prediction errors |[ȳ_{L_c}]_i − [f_c(θ; L\i)]_i|), which may be the result of outlying, anomalous, or mislabeled nodes. Thus, when operating in the presence of anomalies, the robust classifier aims at identifying both the diffusion parameters {θ̂_c}_{c∈Y} and the per-class outlier vectors {ô_c}_{c∈Y}.
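Assembling R^(K)_c of (26) requires one leave-one-out run of Algorithm 2 per class-c seed, which is what gives the robust method its O(K|L||E|) cost. A sketch, reusing the land_prob routine introduced earlier (an assumption of this illustration, not a routine named by the paper):

```python
import numpy as np

def build_R(W, labeled, class_seeds, K, land_prob):
    """Rows of R_c^(K) in (26) for the nodes listed in `labeled`."""
    P_c, _ = land_prob(W, class_seeds, K)
    seed_set = set(class_seeds)
    R = np.empty((len(labeled), K))
    for r, i in enumerate(labeled):
        if i in seed_set:
            loo = [j for j in class_seeds if j != i]
            P_loo, _ = land_prob(W, loo, K)    # walk re-rooted without node i
            R[r] = P_loo[i]                    # [p^(k)_{L_c \ i}]_i
        else:
            R[r] = P_c[i]                      # [p^(k)_c]_i
    return R
```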
The two tasks can be performed jointly by solving the following optimization problem

    {θ̂_c, ô_c}_{c∈Y} = arg min_{θ_c∈S_K, o_c∈R^N} Σ_{c∈Y} [ ℓ^c_rob(y_{L_c}, o_c, θ_c) + λ_θ ‖θ_c‖_2^2 ] + λ_o ‖D_L^{−1/2} O‖_{2,1}    (29)

where O := [o_1 ⋯ o_{|Y|}] concatenates the outlier vectors o_c, and ‖X‖_{2,1} := Σ_{i=1}^I √(Σ_{j=1}^J X_{ij}^2) for any X ∈ R^{I×J}. The term λ_o ‖D_L^{−1/2} O‖_{2,1} in (29) acts as a regularizer that promotes sparsity over the rows of O; it can also be interpreted as an ℓ_1-norm regularizer over a vector that contains the ℓ_2 norms of the rows of O. The reason for using such block-sparse regularization is to force the outlier vectors o_c of different classes to have the same support (pattern of non-zero entries). In other words, the |Y| different diffusion/outlier detectors are forced to consent on which nodes are outliers.

Since (29) is non-convex, convergence of gradient-descent-type methods to the global optimum is not guaranteed. Nevertheless, since (29) is biconvex (i.e., convex with respect to each variable), the following alternating minimization scheme

    Ô^(t) = arg min_O Σ_{c∈Y} [ ℓ^c_rob(y_{L_c}, o_c, θ̂^(t−1)_c) + λ_θ ‖θ̂^(t−1)_c‖_2^2 ] + λ_o ‖D_L^{−1/2} O‖_{2,1}    (30)
    θ̂^(t)_c = arg min_{θ∈S_K} ℓ^c_rob(y_{L_c}, ô^(t)_c, θ) + λ_θ ‖θ‖_2^2 + λ_o ‖D_L^{−1/2} Ô^(t)‖_{2,1}    (31)

with Ô^(0) := [0 ⋯ 0] converges to a partial optimum [17]. By further simplifying (31) and solving (30) in closed form, we obtain

    θ̂^(t)_c = arg min_{θ∈S_K} ℓ^c_rob(ȳ_{L_c} + ô^(t−1)_c, θ) + λ_θ ‖θ‖_2^2    (32)
    Ô^(t) = SoftThres_{λ_o}(Ỹ^(t))    (33)

where Ỹ^(t) := [ỹ^(t)_1, …, ỹ^(t)_{|Y|}] is the matrix that concatenates the per-class residual vectors ỹ^(t)_c := ȳ_{L_c} − R^(K)_c θ̂^(t)_c, and Z = SoftThres_{λ_o}(X) is a row-wise soft-thresholding operator such that

    z_i = x_i [1 − λ_o / (2‖x_i‖_2)]_+

where z_i and x_i are the i-th rows of Z and X, respectively; see, e.g., [35]. Intuitively, the soft-thresholding operation in (33) extracts the outliers by scaling down the residuals and "trimming" them wherever their across-classes ℓ_2 norm is below a certain threshold.

The alternating minimization between (32) and (33) terminates when

    ‖θ̂^(t)_c − θ̂^(t−1)_c‖_∞ ≤ ε,  ∀ c ∈ Y

where ε ≥ 0 is a prescribed tolerance. Having obtained the tuples {θ̂_c, ô_c}_{c∈Y}, one may remove the anomalous samples that correspond to non-zero rows of Ô and perform the diffusion with the remaining samples. The robust (r-)AdaDIF scheme is summarized as Algorithm 4, and has O(K|L||E|) computational complexity.
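The row-wise soft-thresholding operator of (33) is one line of NumPy; a minimal sketch:

```python
import numpy as np

def soft_threshold_rows(X, lam_o):
    """Z with rows z_i = x_i * [1 - lam_o / (2 ||x_i||_2)]_+  (cf. (33))."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    scale = np.clip(1.0 - lam_o / (2.0 * np.maximum(norms, 1e-12)), 0.0, None)
    return X * scale
```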
Algorithm 4 ROBUST ADAPTIVE DIFFUSIONS
Input: adjacency matrix W; labeled nodes {y_i}_{i∈L}
Parameters: regularization parameters λ_θ, λ_o; number of landing probabilities K
Output: predictions {ŷ_i}_{i∈U}; outliers ∪_{c∈Y} L^o_c
  Extract Y = {set of unique labels in {y_i}_{i∈L}}
  for c ∈ Y do
    L_c = {i ∈ L : y_i = c}
    for i ∈ L_c do
      {p^(k)_{L_c\i}}_{k=1}^K = LANDPROB(W, L_c\i, K)
    end for
    Obtain R^(K)_c as in (26)
  end for
  Ô^(0) = [0, …, 0], t = 0
  repeat
    t ← t + 1
    Obtain {θ̂^(t)_c}_{c∈Y} as in (32)
    Obtain Ô^(t) as in (33)
  until ‖θ̂^(t)_c − θ̂^(t−1)_c‖_∞ ≤ ε, ∀ c ∈ Y
  Set of outliers: S := {i ∈ L : ‖[Ô]_{i,:}‖_2 > 0}
  for c ∈ Y do
    L^o_c = L_c ∩ S
    L_c ← L_c \ L^o_c
  end for
  Obtain ŷ_i = arg max_{c∈Y} [f_c(θ̂_c)]_i, ∀ i ∈ U

V. CONTRIBUTIONS IN CONTEXT OF PRIOR WORKS

Following the seminal contribution in [9] that introduced PageRank as a network centrality measure, there has been a vast body of work studying its theoretical properties, computational aspects, and applications beyond Web ranking [26], [16]. Most formal approaches to generalizing PageRank focus either on the teleportation component (see, e.g., [32], [33], as well as [7] for an application to semi-supervised classification) or on the so-termed damping mechanism [13], [4]. Perhaps the most general setting can be found in [4], where a family of functional rankings was introduced through the choice of a parametric damping function that assigns weights to successive steps of a walk initialized according to the teleportation distribution. The per-class distributions produced by AdaDIF are in fact members of this family of functional rankings. However, instead of choosing a fixed damping function as in the aforementioned approaches, AdaDIF learns a class-specific and graph-aware damping mechanism. In this sense, AdaDIF undertakes statistical learning in the space of functional rankings, tailored to the underlying semi-supervised classification task.

A related method termed AptRank was recently proposed in [46], specifically for protein function prediction. Differently from AdaDIF, that method exploits meta-information regarding the hierarchical organization of the functional roles of proteins, and it performs random walks on a heterogeneous protein-function network. AptRank splits the data into training and validation sets of predetermined proportions, and adopts a cross-validation approach for obtaining the diffusion coefficients. Furthermore, a1) AptRank trains a single diffusion for all classes, whereas AdaDIF identifies different diffusions; and a2) the proposed robust leave-one-out method (r-AdaDIF) gathers the residuals from all leave-one-out splits into one cost function (cf. (22)) and then optimizes the (per-class) diffusion.

Recently, community detection (CD) methods were proposed in [47] and [45] that search the Krylov subspace of the landing probabilities of a given community's seeds, to identify a diffusion whose non-zero entries are localized over the nodes of the graph. In CD, the problem definition is: "given certain members of a community, identify the remaining (latent) members."
TABLE I
NETWORK CHARACTERISTICS

Graph              |V|      |E|      |Y|   Multilabel
Citeseer           3,327    4,732     6    No
Cora               2,708    5,429     7    No
PubMed            19,717   44,338     3    No
PPI (H. Sapiens)   3,890   76,584    50    Yes
Wikipedia          4,777  184,812    40    Yes
BlogCatalog       10,312  333,983    39    Yes
There is a subtle but important distinction between CD and semi-supervised classification (SSC): CD focuses on the retrieval of communities (that is, the nodes of a given class), whereas SSC focuses on predicting the labels/attributes of every node. While CD treats the detection of the various overlapping communities of the graph as independent tasks, SSC classifies nodes by taking all information from labeled nodes into account. More specifically, the proposed AdaDIF trains the diffusion of each class by actively avoiding the assignment of large diffusion values to nodes that are known (they have been labeled) to belong to a different class. Another important difference between AdaDIF and [47], [45], which again arises from the different contexts, is the length of the walk compared to the size of the graph. Since [47] and [45] aim at identifying small and local communities, they perform local walks of length smaller than the diameter of the graph. In contrast, SSC typically demands a certain degree of globality in the information exchange, achieved by longer random walks that surpass the graph diameter.

AdaDIF also shares links with the SSL methods based on graph signal processing proposed in [37], and further pursued in [12] for bridge monitoring; see also [38] and [14] for recent advances on graph filters. Similar to our approach, these graph-filter-based techniques are parametrized by assigning different weights to a number of consecutive powers of a matrix related to the structure of the graph. Our contribution, however, introduces different loss and regularization functions for adapting the diffusions, including a novel approach for training the model in an anomaly/outlier-resilient manner. Furthermore, while [37] focuses on binary classification and [12] identifies a single model for all classes, our approach allows different classes to have different propagation mechanisms. This feature can accommodate differences in the label distribution of each class over the nodes, while also making AdaDIF readily applicable to multi-label graphs. Moreover, while the weighting parameters in [37] remain unconstrained and those in [12] belong to a hyperplane, AdaDIF constrains the diffusion parameters to the probability simplex, which allows the random-walk-based diffusion vectors to denote valid probability mass functions over the nodes of the network. This certainly enhances the interpretability of the method, improves the numerical stability of the involved computations, and also reduces the search space of the model, which is beneficial under data scarcity. Finally, imposing the simplex constraint makes the model amenable to a rigorous analysis that relates the dimensionality of the feature space to basic graph properties, as well as to a disciplined exploration of its limiting behavior.
Fig. 4. Micro-F1 score for AdaDIF and non-adaptive diffusions on the labeled Cora graph, as a function of the length of the underlying random walks.
VI. EXPERIMENTAL EVALUATION
Our experiments compare the classification accuracy of the novel AdaDIF approach with state-of-the-art alternatives. For the comparisons, we use 6 benchmark labeled graphs whose dimensions and basic attributes are summarized in Table I. All 6 graphs have nodes that belong to multiple classes, while the last 3 are multilabeled (each node has one or more labels). We evaluate the performance of AdaDIF and the following: i) PPR and HK, which are special cases of AdaDIF as discussed in Section II; ii) label propagation (LP) [43]; iii) Node2vec [18]; iv) Deepwalk [34]; v) Planetoid-G [42]; and vi) graph convolutional networks (GCNs) [21]. We note here that AptRank [46] was not considered in our experiments, since it relies on meta-information that is not available for the benchmark datasets used here.

We performed cross-validation to select the parameters needed by i)-v). For HK, we performed a grid search over a small set of values of t. For PPR, we fixed α close to 1, since it is well documented that such values yield reliable performance; see, e.g., [28]. Both HK and PPR were run for 50 steps for convergence to be in effect (see Fig. 4); LP was also run for 50 steps. For Node2vec, we fixed most parameters to the values suggested in [18], and performed a grid search over p and q. Since Deepwalk can be seen as Node2vec with p = q = 1.0, we used the Node2vec Python implementation for both. As in [18], [34], we used the embedded node features to train a supervised logistic regression classifier with ℓ_2 regularization. For AdaDIF, we fixed λ = 15.0, while K = 15 was sufficient to attain desirable accuracy (cf. Fig. 4); only the values of the Boolean variables Unconstrained and
Dictionary mode (see Algorithm 1) were tuned by validation. For the multilabel graphs, we found λ = 5.0 and even shorter walks of K = 10 to perform well. For the dictionary mode of AdaDIF, we preselected D = 10, with the first five columns of C being HK coefficient vectors for five different values of t, and the other five being power-law coefficient vectors for five different values of the exponent β.

For the multiclass experiments, we evaluated the performance of all algorithms on the three benchmark citation networks, namely Cora, Citeseer, and
PubMed. We obtained the labels of an increasing number of nodes via uniform, class-balanced sampling, and predicted the labels of the remaining nodes. Thus, instead of sampling nodes over the graph uniformly at random, we randomly sample a given number of nodes per class. For each graph, we performed 20 experiments, each time sampling one of three fixed numbers of nodes per class. For each experiment, classification accuracy was measured on the unlabeled nodes in terms of Micro-F1 and Macro-F1 scores; see, e.g., [30]. The results were averaged over the 20 experiments, with mean and standard deviation reported in Table II. Evidently, AdaDIF achieves state-of-the-art performance for all graphs. For Cora and
PubMed, AdaDIF was switched to dictionary mode, while for
Citeseer, where the gain in accuracy is more significant, unconstrained diffusions were employed. In the multiclass setting, diffusion-based classifiers (AdaDIF, PPR, and HK) outperformed the embedding-based methods by a small margin, and GCNs by a larger margin. It should be noted, however, that GCNs were mainly designed to combine the graph with node features. In our "featureless" setting, we used the identity matrix columns as input features, as suggested in [21, Appendix].

The scalability of AdaDIF is reflected in the runtime comparisons of Fig. 7. All experiments were run on a machine with an i5 @3.50 GHz CPU and 16 GB of RAM. We used the Python implementations provided by the authors of the compared algorithms. The Python implementation of AdaDIF uses only tools provided by the scipy, numpy, and CVXOPT libraries. We also developed an efficient implementation that exploits parallelism, which is straightforward since each class can be treated separately. While AdaDIF incurs (as expected) a relatively small computational overhead over fixed diffusions, it is faster than GCNs that use Tensorflow, and orders of magnitude faster than embedding-based approaches.

Finally, Table III presents the results on the multilabel graphs, where we compare with Deepwalk and Node2vec, since the rest of the methods are designed for multiclass problems. Since these graphs entail a large number of classes, we increased the number of training samples. Similar to [18] and [34], during the evaluation of accuracy the number of labels per sampled node is known, and we check how many of them are among the top predictions. First, we observe that AdaDIF markedly outperforms PPR and HK across graphs and metrics. Furthermore, for the
PPI and
BlogCatalog graphs, the Micro-F1 score of AdaDIF comes close to that of the much heavier state-of-the-art Node2vec. Finally, AdaDIF outperforms the competing alternatives in terms of Macro-F1 score. It is worth noting that, for multilabel graphs with many classes, the performance boost over fixed diffusions can be largely attributed to AdaDIF's flexibility to treat each class differently. To demonstrate that different classes are indeed diffused in a markedly different manner, Fig. 6 plots all diffusion coefficient vectors {θ_c}_{c∈Y} yielded by AdaDIF on the PPI graph. Each line in the plot corresponds to the values of θ_c for a different c; evidently, while the overall "form" of the corresponding diffusion coefficients adheres to the general pattern observed in Fig. 2, there is indeed large diversity among classes.
TABLE II
MICRO-F1 AND MACRO-F1 SCORES ON MULTICLASS NETWORKS (CLASS-BALANCED SAMPLING)
[Mean ± standard deviation over 20 experiments of Micro-F1 and Macro-F1 for AdaDIF, PPR, HK, LP, Node2vec, Deepwalk, Planetoid-G, and GCN on Cora, Citeseer, and PubMed, at three values of |L_c|.]
TABLE III
MICRO-F1 AND MACRO-F1 SCORES OF VARIOUS ALGORITHMS ON MULTILABEL NETWORKS
[Mean ± standard deviation of Micro-F1 and Macro-F1 for AdaDIF, PPR, HK, Node2vec, and Deepwalk on PPI, BlogCatalog, and Wikipedia, with |L|/|V| ∈ {10%, 20%, 30%}.]

A. Analysis/interpretation of results
Here we follow an experimental approach aimed at understanding and interpreting our results. We focus on diffusion-based classifiers, along with a simple benchmark for diffusion-based classification: the k-step landing probabilities. Specifically, we compare the classification accuracy on the three multiclass datasets of AdaDIF, PPR, and HK with the accuracy of the classifier that uses only the k-th landing probability vectors {p^(k)_c}_{c∈Y}. The setting is similar to that of the previous section, with class-balanced sampling of a fixed number of nodes per class, while the k-step classifiers were examined over a wide range of steps k. The k-step classifier reveals the predictive power of individual landing probabilities, resulting in curves (see Fig. 5) that appear to be different for each network, characterizing its graph-label distribution relationship. For the Cora graph (left two plots), the performance of the k-step classifier improves sharply after the first few steps, peaks early, and then quickly degrades, suggesting that using landing probabilities of much larger k would most likely degrade the performance of a diffusion-based classifier. Interestingly, AdaDIF, relying on combinations of the first 15 steps, and PPR and HK, relying on the first 50, all achieve higher accuracy than that of the best single step. On the other hand, the Citeseer graph (middle two plots) behaves in a significantly different manner, with the k-step classifier requiring longer walks to reach high accuracy, which is then retained for much longer. Furthermore, accumulating landing probabilities the way PPR or HK does yields lower Micro-F1 accuracy than that of the single best step. By smartly combining the first 15 steps, which are of lower quality, AdaDIF surpasses the Micro-F1 scores of the longer walks. Interestingly, the Macro-F1 metric for Citeseer behaves differently than the Micro-F1, and quickly decreases after a relatively small number of steps. The disagreement between the two metrics can be explained as the diffusions of one or more of the larger classes gradually "overwhelming" those of one or more smaller classes, thus lowering the Macro-F1 score, since the latter is a metric that averages per class. In contrast, the Micro-F1 metric averages per node, and takes much less of an impact if a few nodes from the smaller classes are mislabeled. Finally, for the PubMed graph (right two plots), steps beyond roughly 20 yield consistently high accuracy both in terms of Micro- and Macro-averaged F1 score. Since HK, and mostly PPR, largely accumulate steps in that range, it seems reasonable that both fixed diffusions are fairly accurate on the PubMed graph.
Fig. 5. Classification accuracy of AdaDIF, PPR, and HK compared to the accuracy of the k-step landing probability classifier. Top left) Cora Micro-F1 score; Bottom left) Cora Macro-F1 score; Top middle) Citeseer Micro-F1 score; Bottom middle) Citeseer Macro-F1 score; Top right) PubMed Micro-F1 score; Bottom right) PubMed Macro-F1 score.

Fig. 6. AdaDIF diffusion coefficients for the different classes of the PPI graph. Each line corresponds to a different θ_c. The diffusion is characterized by high diversity among classes.

B. Tests on simulated label-corruption setup
Here we outline experiments performed to evaluate the performance of different diffusion-based classifiers in the presence of anomalous nodes. The main goal is to evaluate whether r-AdaDIF (Algorithm 4) yields improved performance over AdaDIF, HK, and PPR, as well as whether r-AdaDIF can detect anomalous nodes.
Fig. 7. Relative runtime comparisons for multiclass graphs.

We also tested a different type of rounding from class diffusions to class labels that was shown in [44] to be robust in the presence of erroneous labels on a graph constructed from images of handwritten digits. The idea is to first normalize the diffusions by the node degrees, sort each diffusion vector, and assign to each node the class for which the corresponding rank is highest. We applied this type of rounding on PPR diffusions (denoted as PPR w. ranking). Since a ground-truth set of anomalous nodes is not available in real graphs, we chose to infuse the true labels with artificial anomalies generated by the following simulated label-corruption process: go through y_L and, for each entry [y_L]_i = c, draw with probability p_cor a label c′ ∼ Unif{Y \ c} and replace [y_L]_i ← c′.
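A minimal sketch of this corruption process (the uniform "flip" just described); the function name and seeding are our own:

```python
import numpy as np

def corrupt_labels(y_L, classes, p_cor, seed=0):
    """Flip each observed label to a uniformly drawn *different* class
    with probability p_cor."""
    rng = np.random.default_rng(seed)
    y = np.array(y_L, copy=True)
    flip = rng.random(y.shape[0]) < p_cor
    for i in np.flatnonzero(flip):
        others = [c for c in classes if c != y[i]]
        y[i] = rng.choice(others)
    return y
```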
In other words, anomalies are created by corrupting some of the true labels by randomly and uniformly "flipping" them to a different label. Increasing the corruption probability p_cor of the training labels y_L is expected to have an increasingly negative impact on the classification accuracy over y_U. Indeed, as depicted in Fig. 8, the accuracy of all diffusion-based classifiers on the Cora graph degrades as p_cor increases. All diffusions were run for K = 50, while for r-AdaDIF we tuned λ_o and λ_θ to fixed values that perform well for moderate values of p_cor. Results were averaged over multiple Monte Carlo experiments, in each of which a fixed proportion of the nodes was sampled uniformly at random. While tuning λ_o for a specific p_cor generally yields improved results, we use the same λ_o across the range of p_cor values, since the true value of the latter is generally not available in practice. In this setup, r-AdaDIF demonstrates higher accuracy compared to the non-robust classifiers. Moreover, the performance gap increases as more labels become corrupted, until it reaches a "break point" at high corruption rates. Interestingly, r-AdaDIF performs worse in the absence of anomalies (p_cor = 0), which can be attributed to the fact that it then only removes useful samples and thus reduces the training set. Although PPR w. ranking displays relative robustness as p_cor increases, overall it performs worse than PPR with value-based rounding, at least on the Cora graph.

As mentioned earlier, the performance of r-AdaDIF in terms of outlier detection depends on the parameter λ_o. Specifically, for λ_o → 0 the regularizer in (29) is effectively removed and all samples are characterized as outliers. On the other hand, for λ_o ≫ 0, (29) yields Ô = [0, …, 0], meaning that no outliers are unveiled. For intermediate values of λ_o, r-AdaDIF trades off falsely identifying nominal samples as outliers (false alarm) against correctly identifying anomalies (correct detection). This tradeoff of r-AdaDIF's anomaly detection behavior was experimentally evaluated over multiple Monte Carlo runs, by sweeping over a large range of values of λ_o and for different values of p_cor; see the probability of detection (p_D) versus probability of false alarm (p_FA) depicted in Fig. 9. Evidently, r-AdaDIF performs much better than a random-guess ("coin toss") detector, whose curve is given by the grey dotted line, while the detection performance improves as the corruption rate decreases.

Fig. 8. Classification accuracy of various diffusion-based classifiers on Cora, as a function of the probability of label corruption.

Fig. 9. Anomaly detection performance of r-AdaDIF for different label corruption probabilities. The horizontal axis corresponds to the frequency with which r-AdaDIF returns a true positive (probability of detection) and the vertical axis corresponds to the frequency of false positives (probability of false alarm).

VII. CONCLUSIONS
The present work introduces a principled, data-efficient approach to learning class-specific diffusion functions tailored to the underlying network topology. Experiments on real networks confirm that adapting the diffusion function to the given graph and observed labels significantly improves the performance over fixed diffusions, reaching – and many times surpassing – the classification accuracy of computationally heavier state-of-the-art competing methods.

Emerging from this work are many exciting directions to explore. First, one can investigate different cost functions with respect to which the diffusions are adapted, e.g., by taking into account the robustness of the resulting classifier in the presence of adversarial data. Furthermore, it is worth investigating the space of nonlinear functions of the landing probabilities, to determine the degree to which accuracy can be boosted further. Last but not least, it will be interesting to develop adaptive diffusion methods where learning and adaptation are performed on-the-fly, without any memory and computational overhead.

APPENDIX
A. Proof of Proposition 1
For $\lambda \to \infty$, the effect of $\ell(\cdot)$ in (10) vanishes, and the optimization problem becomes equivalent to solving
\[
\min_{\boldsymbol{\theta} \in \mathcal{S}^K} \; \boldsymbol{\theta}^T \mathbf{A} \boldsymbol{\theta} \tag{34}
\]
where $\mathbf{A} := (\mathbf{P}_c^{(K)})^T \mathbf{D}^{-1}\mathbf{L}\mathbf{D}^{-1} \mathbf{P}_c^{(K)}$ has $(i,j)$ entry given by $A_{ij} = (\mathbf{p}_c^{(i)})^T \mathbf{D}^{-1}\mathbf{L}\mathbf{D}^{-1} \mathbf{p}_c^{(j)}$; and $\mathbf{p}_c^{(K)}$ is the vector of $K$-step landing probabilities with initial distribution $\mathbf{v}_c$ and transition matrix $\mathbf{H} = \sum_{n=1}^N \lambda_n \mathbf{u}_n \mathbf{v}_n^T$, where $\lambda_1 > \lambda_2 > \cdots > \lambda_N$ are its eigenvalues. Since $\mathbf{H}$ is a column-stochastic transition probability matrix, it holds that $\lambda_1 = 1$, $\mathbf{v}_1 = \mathbf{1}$, and $\mathbf{u}_1 = \boldsymbol{\pi}$, where $\boldsymbol{\pi} = \lim_{k\to\infty} \mathbf{p}_c^{(k)}$ is the steady-state distribution that can also be expressed as $\boldsymbol{\pi} = \mathbf{d}/(2|\mathcal{E}|)$ [27]. The landing probability vector for class $c$ is thus
\[
\mathbf{p}_c^{(K)} = \mathbf{H}^K \mathbf{v}_c = \Big[ \tfrac{1}{2|\mathcal{E}|}\mathbf{d}\mathbf{1}^T + \sum_{n=2}^N \lambda_n^K \mathbf{u}_n \mathbf{v}_n^T \Big] \mathbf{v}_c = \tfrac{1}{2|\mathcal{E}|}\mathbf{d} + \sum_{n=2}^N \lambda_n^K \gamma_n \mathbf{u}_n \approx \tfrac{1}{2|\mathcal{E}|}\mathbf{d} + \lambda_2^K \gamma_2 \mathbf{u}_2 \tag{35}
\]
where $\gamma_n := \mathbf{v}_n^T \mathbf{v}_c$, and the approximation in (35) holds because $\lambda_2^K \gg \lambda_n^K$ for $n \in [3, N]$, and $K$ large enough but finite. Using (35), $A_{ij}$ can be rewritten as
\[
\begin{aligned}
A_{ij} &= \Big[\tfrac{1}{2|\mathcal{E}|}\mathbf{d}^T + \lambda_2^i \gamma_2 \mathbf{u}_2^T\Big] \mathbf{D}^{-1}\mathbf{L}\mathbf{D}^{-1} \Big[\tfrac{1}{2|\mathcal{E}|}\mathbf{d} + \lambda_2^j \gamma_2 \mathbf{u}_2\Big] \\
&= \Big[\tfrac{1}{2|\mathcal{E}|}\mathbf{1}^T + \lambda_2^i \gamma_2 \mathbf{u}_2^T\mathbf{D}^{-1}\Big] \mathbf{L} \Big[\tfrac{1}{2|\mathcal{E}|}\mathbf{1} + \lambda_2^j \gamma_2 \mathbf{D}^{-1}\mathbf{u}_2\Big] \\
&= \tfrac{1}{4|\mathcal{E}|^2}\mathbf{1}^T\mathbf{L}\mathbf{1} + \tfrac{\lambda_2^i \gamma_2}{2|\mathcal{E}|}\mathbf{u}_2^T\mathbf{D}^{-1}\mathbf{L}\mathbf{1} + \tfrac{\lambda_2^j \gamma_2}{2|\mathcal{E}|}\mathbf{1}^T\mathbf{L}\mathbf{D}^{-1}\mathbf{u}_2 + \gamma_2^2 \lambda_2^{i+j}\, \mathbf{u}_2^T\mathbf{D}^{-1}\mathbf{L}\mathbf{D}^{-1}\mathbf{u}_2 \\
&= C\lambda_2^{i+j}
\end{aligned} \tag{36}
\]
where $C := \gamma_2^2\, \mathbf{u}_2^T\mathbf{D}^{-1}\mathbf{L}\mathbf{D}^{-1}\mathbf{u}_2$; the second equality uses $\mathbf{D}^{-1}\mathbf{d} = \mathbf{1}$, and the last equality follows because $\mathbf{L}\mathbf{1} = \mathbf{0}$. Using (36), one obtains $\mathbf{A} = C\boldsymbol{\lambda}\boldsymbol{\lambda}^T$, where $\boldsymbol{\lambda} := [\lambda_2 \; \lambda_2^2 \; \cdots \; \lambda_2^K]^T$, while (34) reduces to
\[
\min_{\boldsymbol{\theta} \in \mathcal{S}^K} \big(\boldsymbol{\lambda}^T\boldsymbol{\theta}\big)^2. \tag{37}
\]
Since $\boldsymbol{\lambda}^T\boldsymbol{\theta} > 0 \;\; \forall\, \boldsymbol{\theta} \in \mathcal{S}^K$, it can be shown that the KKT optimality conditions for (37) are identical to those of
\[
\min_{\boldsymbol{\theta} \in \mathcal{S}^K} \boldsymbol{\lambda}^T\boldsymbol{\theta}. \tag{38}
\]
Therefore, (37) admits minimizer(s) identical to (38). Finally, we will show that the minimizer of (38) is $\mathbf{e}_K$. Since the problem is convex, it suffices to show that $\nabla_{\boldsymbol{\theta}}^T(\boldsymbol{\lambda}^T\boldsymbol{\theta})\big|_{\boldsymbol{\theta}=\mathbf{e}_K}(\boldsymbol{\theta} - \mathbf{e}_K) \ge 0 \;\; \forall\, \boldsymbol{\theta} \in \mathcal{S}^K$, or, equivalently,
\[
\boldsymbol{\lambda}^T(\boldsymbol{\theta} - \mathbf{e}_K) \ge 0 \;\Leftrightarrow\; \sum_{k=1}^K \theta_k \lambda_2^k - \lambda_2^K \ge 0 \;\Leftrightarrow\; \sum_{k=1}^K \theta_k \lambda_2^{k-K} - 1 \ge 0 \;\Leftrightarrow\; \sum_{k=1}^K \theta_k \lambda_2^{k-K} \ge \sum_{k=1}^K \theta_k \;\Leftrightarrow\; \sum_{k=1}^K \theta_k \big(\lambda_2^{k-K} - 1\big) \ge 0
\]
where the third equivalence uses $\sum_{k=1}^K \theta_k = 1$ on $\mathcal{S}^K$. The last inequality holds since $\boldsymbol{\theta} \ge \mathbf{0}$ and $\lambda_2^{k-K} \ge 1 \;\; \forall\, k \in [1, K]$, which completes the proof of the proposition.
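The rank-one structure established in (36) is straightforward to verify numerically. The sketch below is our own sanity check, not part of the paper: it assumes NumPy, builds a small two-block graph (so that $\lambda_2$ is well separated from the remaining eigenvalues, as the approximation in (35) requires), forms the matrix $\mathbf{A}$ of (34), and confirms that $\mathbf{e}_K$ attains a lower objective than randomly drawn points of the simplex; all sizes and probabilities are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 20, 15                                    # block size, number of steps

# Two-block ("planted partition") graph: lambda_2 of H is well separated
# from lambda_3, ..., lambda_N, as the approximation in (35) requires.
B1 = rng.random((n, n)) < 0.5
B2 = rng.random((n, n)) < 0.5
C12 = rng.random((n, n)) < 0.05
A = np.block([[B1, C12], [C12.T, B2]]).astype(float)
A = np.triu(A, 1); A = A + A.T                   # symmetric, no self-loops
d = A.sum(axis=1); assert d.min() > 0            # assume no isolated nodes
D_inv = np.diag(1.0 / d)
H = A @ D_inv                                    # column-stochastic transitions
L = np.diag(d) - A                               # unnormalized Laplacian

v_c = np.zeros(2 * n); v_c[:5] = 0.2             # seed distribution of class c
P = np.empty((2 * n, K)); p = v_c
for k in range(K):
    p = H @ p                                    # k-step landing probabilities
    P[:, k] = p

A_mat = P.T @ D_inv @ L @ D_inv @ P              # the matrix A of (34)

e_K = np.zeros(K); e_K[-1] = 1.0
thetas = rng.dirichlet(np.ones(K), size=1000)    # random points on the simplex
print(e_K @ A_mat @ e_K <= min(t @ A_mat @ t for t in thetas))  # prints True
```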
B. Proof of Theorem 1

We need to find the smallest integer $K$ such that
\[
\max_{\boldsymbol{\theta} \in \mathcal{S}^K} \|\mathbf{y} - \check{\mathbf{y}}\|_2 \le \gamma.
\]
We have
\[
\begin{aligned}
\|\mathbf{y} - \check{\mathbf{y}}\|_2 &= \|\mathbf{X}_+\boldsymbol{\theta} - \mathbf{X}_-\boldsymbol{\theta} - \check{\mathbf{X}}_+\boldsymbol{\theta} + \check{\mathbf{X}}_-\boldsymbol{\theta}\|_2 \\
&\le \|\theta_K \mathbf{p}_+^{(K)} - \theta_K \mathbf{p}_-^{(K)}\|_2 + \|\theta_K \mathbf{p}_+^{(K+1)} - \theta_K \mathbf{p}_-^{(K+1)}\|_2 \\
&\le \|\mathbf{H}^K\mathbf{p}_+ - \mathbf{H}^K\mathbf{p}_-\|_2 + \|\mathbf{H}^{K+1}\mathbf{p}_+ - \mathbf{H}^{K+1}\mathbf{p}_-\|_2
\end{aligned} \tag{39}
\]
since $\boldsymbol{\theta} \in \mathcal{S}^K$. Therefore, to determine an upper bound on the $\gamma$-distinguishability threshold, it suffices to find the smallest integer $K$ for which (39) is upper bounded by $\gamma$.

Let $\mathbf{q}_1, \ldots, \mathbf{q}_N$ be the eigenvectors corresponding to the eigenvalues $0 = \mu_1 < \mu_2 \le \cdots \le \mu_N < 2$ of the normalized Laplacian $\tilde{\mathbf{L}}$. The transition probability matrix is then
\[
\mathbf{H} = \mathbf{D}^{1/2}(\mathbf{I} - \tilde{\mathbf{L}})\mathbf{D}^{-1/2}. \tag{40}
\]
For the first term on the RHS of (39), we have
\[
\begin{aligned}
\|\mathbf{H}^K\mathbf{p}_+ - \mathbf{H}^K\mathbf{p}_-\|_2 &\le \|\mathbf{H}^K\mathbf{p}_+ - \boldsymbol{\pi}\|_2 + \|\mathbf{H}^K\mathbf{p}_- - \boldsymbol{\pi}\|_2 \\
&= \Big\|\mathbf{D}^{1/2}(\mathbf{I} - \tilde{\mathbf{L}})^K\mathbf{D}^{-1/2}\mathbf{p}_+ - \tfrac{\mathbf{D}\mathbf{1}}{2|\mathcal{E}|}\Big\|_2 + \Big\|\mathbf{D}^{1/2}(\mathbf{I} - \tilde{\mathbf{L}})^K\mathbf{D}^{-1/2}\mathbf{p}_- - \tfrac{\mathbf{D}\mathbf{1}}{2|\mathcal{E}|}\Big\|_2.
\end{aligned} \tag{41}
\]
Since $\mathbf{q}_1 = \mathbf{D}^{1/2}\mathbf{1}/\sqrt{2|\mathcal{E}|}$ [27], we have for $c \in \{+, -\}$ that
\[
\mathbf{D}^{1/2}\mathbf{q}_1 \langle \mathbf{q}_1, \mathbf{D}^{-1/2}\mathbf{p}_c \rangle = \mathbf{D}^{1/2}\tfrac{\mathbf{D}^{1/2}\mathbf{1}}{\sqrt{2|\mathcal{E}|}} \Big\langle \tfrac{\mathbf{D}^{1/2}\mathbf{1}}{\sqrt{2|\mathcal{E}|}}, \mathbf{D}^{-1/2}\mathbf{p}_c \Big\rangle = \tfrac{\mathbf{D}\mathbf{1}}{\sqrt{2|\mathcal{E}|}} \cdot \tfrac{\langle \mathbf{1}, \mathbf{p}_c \rangle}{\sqrt{2|\mathcal{E}|}} = \tfrac{\mathbf{D}\mathbf{1}}{2|\mathcal{E}|}. \tag{42}
\]
Upon defining $\mathbf{M} := (\mathbf{I} - \tilde{\mathbf{L}})^K - \mathbf{q}_1\mathbf{q}_1^T$, and taking into account (42), inequality (41) can be written as
\[
\|\mathbf{H}^K\mathbf{p}_+ - \mathbf{H}^K\mathbf{p}_-\|_2 \le \|\mathbf{D}^{1/2}\|_2 \|\mathbf{M}\|_2 \big(\|\mathbf{D}^{-1/2}\mathbf{p}_+\|_2 + \|\mathbf{D}^{-1/2}\mathbf{p}_-\|_2\big). \tag{43}
\]
The factors in (43) can be bounded as
\[
\|\mathbf{D}^{-1/2}\mathbf{p}_+\|_2 = \sqrt{\sum_{i\in\mathcal{L}_+} \Big(\tfrac{1}{|\mathcal{L}_+|\sqrt{d_i}}\Big)^2} = \sqrt{\sum_{i\in\mathcal{L}_+} \tfrac{1}{|\mathcal{L}_+|^2 d_i}} \le \tfrac{1}{\sqrt{d_{\min}^+ |\mathcal{L}_+|}}, \tag{44}
\]
\[
\|\mathbf{D}^{-1/2}\mathbf{p}_-\|_2 = \sqrt{\sum_{i\in\mathcal{L}_-} \tfrac{1}{|\mathcal{L}_-|^2 d_i}} \le \tfrac{1}{\sqrt{d_{\min}^- |\mathcal{L}_-|}}, \tag{45}
\]
\[
\|\mathbf{M}\|_2 = \sup_{\|\mathbf{v}\|_2 = 1} |\langle \mathbf{M}\mathbf{v}, \mathbf{v} \rangle| = \max_{i\ne 1} |1 - \mu_i|^K, \tag{46}
\]
\[
\|\mathbf{D}^{1/2}\|_2 = \sqrt{d_{\max}} \tag{47}
\]
where (46) follows from the properties of the normalized Laplacian. Therefore, (43) becomes
\[
\|\mathbf{H}^K\mathbf{p}_+ - \mathbf{H}^K\mathbf{p}_-\|_2 \le \Big(\tfrac{1}{\sqrt{d_{\min}^-|\mathcal{L}_-|}} + \tfrac{1}{\sqrt{d_{\min}^+|\mathcal{L}_+|}}\Big) \cdot \max_{i\ne 1}|1 - \mu_i|^K \cdot \sqrt{d_{\max}}. \tag{48}
\]
Letting $\mu' := \min\{\mu_2, 2 - \mu_N\}$, and using the fact that
\[
(1 - \mu')^K \le e^{-K\mu'} \tag{49}
\]
we obtain
\[
\|\mathbf{H}^K\mathbf{p}_+ - \mathbf{H}^K\mathbf{p}_-\|_2 \le \Big(\sqrt{\tfrac{d_{\max}}{d_{\min}^-|\mathcal{L}_-|}} + \sqrt{\tfrac{d_{\max}}{d_{\min}^+|\mathcal{L}_+|}}\Big) e^{-K\mu'}. \tag{50}
\]
Likewise, we can bound the second term in (39) as
\[
\|\mathbf{H}^{K+1}\mathbf{p}_+ - \mathbf{H}^{K+1}\mathbf{p}_-\|_2 \le \Big(\sqrt{\tfrac{d_{\max}}{d_{\min}^-|\mathcal{L}_-|}} + \sqrt{\tfrac{d_{\max}}{d_{\min}^+|\mathcal{L}_+|}}\Big) e^{-(K+1)\mu'}. \tag{51}
\]
In addition, we note that for all $\mu' > 0$, $K \in \mathbb{Z}_+$ it holds that
\[
e^{-K\mu'} + e^{-(K+1)\mu'} < 2e^{-K\mu'}. \tag{52}
\]
Upon substituting (50) and (51) into (39), and also using (52), we arrive at
\[
\|\mathbf{y} - \check{\mathbf{y}}\|_2 \le 2\Big(\sqrt{\tfrac{d_{\max}}{d_{\min}^-|\mathcal{L}_-|}} + \sqrt{\tfrac{d_{\max}}{d_{\min}^+|\mathcal{L}_+|}}\Big) e^{-K\mu'}. \tag{53}
\]
To determine an upper bound on the $\gamma$-distinguishability threshold, it suffices to find the smallest integer $K$ for which (53) becomes less than $\gamma$; that is,
\[
2\Big(\sqrt{\tfrac{d_{\max}}{d_{\min}^-|\mathcal{L}_-|}} + \sqrt{\tfrac{d_{\max}}{d_{\min}^+|\mathcal{L}_+|}}\Big) e^{-K\mu'} \le \gamma. \tag{54}
\]
Multiplying both sides of (54) by the positive number $e^{K\mu'}/\gamma$, and taking logarithms, yields
\[
\log\Big[\tfrac{2\sqrt{d_{\max}}}{\gamma}\Big(\tfrac{1}{\sqrt{d_{\min}^-|\mathcal{L}_-|}} + \tfrac{1}{\sqrt{d_{\min}^+|\mathcal{L}_+|}}\Big)\Big] \le K\mu'.
\]
Therefore, using
\[
K = \Big\lceil \tfrac{1}{\mu'} \log\Big[\tfrac{2\sqrt{d_{\max}}}{\gamma}\Big(\tfrac{1}{\sqrt{d_{\min}^-|\mathcal{L}_-|}} + \tfrac{1}{\sqrt{d_{\min}^+|\mathcal{L}_+|}}\Big)\Big] \Big\rceil
\]
landing probabilities, the $\ell_2$ distance between any two diffusion-based classifiers will be at most $\gamma$; and the proof is complete.
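The threshold above is easy to evaluate from a handful of graph statistics. Below is a minimal sketch of such a computation; it is our own illustration (the function name, the dense-matrix implementation, and the toy graph are hypothetical choices), and it assumes NumPy and a connected, non-bipartite graph so that $\mu' > 0$.

```python
import numpy as np

def K_threshold(A, idx_plus, idx_minus, gamma):
    """Theorem 1 upper bound on the gamma-distinguishability threshold,
    for a dense adjacency matrix A and the labeled index sets of each class."""
    d = A.sum(axis=1)
    D_mh = np.diag(d ** -0.5)
    L_tilde = np.eye(len(d)) - D_mh @ A @ D_mh     # normalized Laplacian
    mu = np.linalg.eigvalsh(L_tilde)               # 0 = mu_1 <= ... <= mu_N < 2
    mu_prime = min(mu[1], 2.0 - mu[-1])
    c = (1.0 / np.sqrt(d[idx_minus].min() * len(idx_minus))
         + 1.0 / np.sqrt(d[idx_plus].min() * len(idx_plus)))
    return int(np.ceil(np.log(2.0 * np.sqrt(d.max()) * c / gamma) / mu_prime))

# Toy example: random graph with 10 labeled nodes per class
rng = np.random.default_rng(0)
A = (rng.random((50, 50)) < 0.2).astype(float)
A = np.triu(A, 1); A = A + A.T
print(K_threshold(A, np.arange(10), np.arange(10, 20), gamma=0.1))
```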
C. Bound for PageRank

Substituting PageRank's diffusion coefficients in the proof of Theorem 1, inequality (54) becomes
\[
2(1 - \alpha)\alpha^K \Big(\sqrt{\tfrac{d_{\max}}{d_{\min}^-|\mathcal{L}_-|}} + \sqrt{\tfrac{d_{\max}}{d_{\min}^+|\mathcal{L}_+|}}\Big) e^{-K\mu'} \le \gamma.
\]
Multiplying both sides by the positive number $e^{K\mu'}\alpha^{-K}/\gamma$ and taking logarithms yields
\[
\log\Big[\tfrac{2\sqrt{d_{\max}}}{\gamma/(1-\alpha)}\Big(\tfrac{1}{\sqrt{d_{\min}^-|\mathcal{L}_-|}} + \tfrac{1}{\sqrt{d_{\min}^+|\mathcal{L}_+|}}\Big)\Big] \le K(\mu' - \log\alpha)
\]
which results in the $\gamma$-distinguishability threshold bound
\[
K_\gamma^{\rm PR} \le \tfrac{1}{\mu' - \log\alpha} \log\Big[\tfrac{2\sqrt{d_{\max}}}{\gamma/(1-\alpha)}\Big(\tfrac{1}{\sqrt{d_{\min}^-|\mathcal{L}_-|}} + \tfrac{1}{\sqrt{d_{\min}^+|\mathcal{L}_+|}}\Big)\Big].
\]
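For PageRank, only the last line of the previous sketch changes: the effective decay rate becomes $\mu' - \log\alpha$ and $\gamma$ is replaced by $\gamma/(1-\alpha)$, which is why a smaller restart parameter $\alpha$ shortens the required walks. A hypothetical variant (again our own sketch, with an illustrative default for alpha):

```python
import numpy as np

def K_threshold_pr(A, idx_plus, idx_minus, gamma, alpha=0.9):
    """PageRank variant of the Theorem 1 bound, with coefficients (1-a)*a^k."""
    d = A.sum(axis=1)
    D_mh = np.diag(d ** -0.5)
    L_tilde = np.eye(len(d)) - D_mh @ A @ D_mh
    mu = np.linalg.eigvalsh(L_tilde)
    mu_prime = min(mu[1], 2.0 - mu[-1])
    c = (1.0 / np.sqrt(d[idx_minus].min() * len(idx_minus))
         + 1.0 / np.sqrt(d[idx_plus].min() * len(idx_plus)))
    num = np.log(2.0 * np.sqrt(d.max()) * c * (1.0 - alpha) / gamma)
    return int(np.ceil(num / (mu_prime - np.log(alpha))))
```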
REFERENCES

[1] A. Argyriou, M. Herbster, and M. Pontil, "Combining graph Laplacians for semi-supervised learning," in Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, 2006, pp. 67–74.
[2] J. Atwood and D. Towsley, "Diffusion-convolutional neural networks," in Proc. Advances in Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 1993–2001.
[3] K. Avrachenkov, A. Mishenin, P. Gonçalves, and M. Sokol, "Generalized optimization framework for graph-based semi-supervised learning," in Proc. SIAM Int. Conf. on Data Mining, Anaheim, CA, 2012, pp. 966–974.
[4] R. Baeza-Yates, P. Boldi, and C. Castillo, "Generic damping functions for propagating importance in link-based ranking," Internet Math., vol. 3, no. 4, pp. 445–478, 2006.
[5] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," J. Mach. Learn. Res., vol. 7, pp. 2399–2434, Nov. 2006.
[6] Y. Bengio, O. Delalleau, and N. Le Roux, "Label propagation and quadratic criterion," in Semi-Supervised Learning. Cambridge, MA, USA: MIT Press, 2006.
[7] D. Berberidis, A. N. Nikolakopoulos, and G. B. Giannakis, "Random walks with restarts for graph-based classification: Teleportation tuning and sampling design," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Calgary, Canada, Apr. 2018.
[8] D. Berberidis, A. N. Nikolakopoulos, and G. B. Giannakis, "AdaDIF: Adaptive diffusions for efficient semi-supervised learning over graphs," in Proc. IEEE Int. Conf. on Big Data, Seattle, WA, Dec. 2018, pp. 92–99.
[9] S. Brin and L. Page, "Reprint of: The anatomy of a large-scale hypertextual web search engine," Comput. Netw., vol. 56, no. 18, pp. 3825–3833, 2012.
[10] E. Buchnik and E. Cohen, "Bootstrapped graph diffusions: Exposing the power of nonlinearity," arXiv preprint arXiv:1703.02618, 2017.
[11] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning. Cambridge, MA, USA: MIT Press, 2006.
[12] S. Chen, F. Cerda, P. Rizzo, J. Bielak, J. H. Garrett, and J. Kovacevic, "Semi-supervised multiresolution classification using adaptive graph filtering with application to indirect bridge structural health monitoring," IEEE Trans. Signal Process., vol. 62, no. 11, pp. 2879–2893, June 2014.
[13] P. G. Constantine and D. F. Gleich, "Random alpha PageRank," Internet Math., vol. 6, no. 2, pp. 189–236, 2009.
[14] M. Contino, E. Isufi, and G. Leus, "Distributed edge-variant graph filters," in Proc. IEEE Int. Workshop on Computational Advances in Multi-Sensor Adaptive Processing, Curaçao, Dutch Antilles, Dec. 2017, pp. 1–5.
[15] F. Chung, "The heat kernel as the PageRank of a graph," Proc. Natl. Acad. Sci., vol. 104, no. 50, pp. 19735–19740, 2007.
[16] D. F. Gleich, "PageRank beyond the web," SIAM Rev., vol. 57, no. 3, pp. 321–363, 2015.
[17] J. Gorski, F. Pfeuffer, and K. Klamroth, "Biconvex sets and optimization with biconvex functions: A survey and extensions," Math. Methods Oper. Res., vol. 66, no. 3, pp. 373–407, Dec. 2007.
[18] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, San Francisco, CA, 2016, pp. 855–864.
[19] T. Joachims, "Transductive learning via spectral graph partitioning," in Proc. Int. Conf. on Machine Learning, Washington, DC, 2003, pp. 290–297.
[20] V. Kekatos and G. B. Giannakis, "From sparse signals to sparse residuals for robust sensing," IEEE Trans. Signal Process., vol. 59, no. 7, pp. 3355–3368, July 2011.
[21] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[22] K. Kloster and D. F. Gleich, "Heat kernel based community detection," in Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, New York, NY, 2014, pp. 1386–1395.
[23] I. M. Kloumann, J. Ugander, and J. Kleinberg, "Block models and personalized PageRank," Proc. Natl. Acad. Sci., vol. 114, no. 1, pp. 33–38, 2017.
[24] R. I. Kondor and J. Lafferty, "Diffusion kernels on graphs and other discrete input spaces," in Proc. Int. Conf. on Machine Learning, Sydney, Australia, 2002, pp. 315–322.
[25] B. Kveton, M. Valko, A. Rahimi, and L. Huang, "Semi-supervised learning with max-margin graph cuts," in Proc. Int. Conf. on Artificial Intelligence and Statistics, Sardinia, Italy, 2010, pp. 421–428.
[26] A. N. Langville and C. D. Meyer, "Deeper inside PageRank," Internet Math., vol. 1, no. 3, pp. 335–380, 2004.
[27] D. A. Levin and Y. Peres, Markov Chains and Mixing Times. Providence, RI, USA: Amer. Math. Soc., 2017.
[28] F. Lin and W. W. Cohen, "Semi-supervised classification of network data using very few labels," in Proc. Int. Conf. on Advances in Social Network Analysis and Mining, Odense, Denmark, 2010, pp. 192–199.
[29] W. Liu, J. Wang, and S.-F. Chang, "Robust and scalable graph-based semisupervised learning," Proc. IEEE, vol. 100, no. 9, pp. 2624–2638, 2012.
[30] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge, U.K.: Cambridge Univ. Press, 2008.
[31] E. Merkurjev, A. L. Bertozzi, and F. Chung, "A semi-supervised heat kernel PageRank MBO algorithm for data classification," Univ. of California Los Angeles, Los Angeles, CA, USA, Tech. Rep., 2016.
[32] A. N. Nikolakopoulos and J. D. Garofalakis, "NCDawareRank: A novel ranking method that exploits the decomposable structure of the web," in Proc. ACM Int. Conf. on Web Search and Data Mining, Rome, Italy, 2013, pp. 143–152.
[33] A. N. Nikolakopoulos, A. Korba, and J. D. Garofalakis, "Random surfing on multipartite graphs," in Proc. IEEE Int. Conf. on Big Data, Washington, DC, Dec. 2016, pp. 736–745.
[34] B. Perozzi, R. Al-Rfou, and S. Skiena, "DeepWalk: Online learning of social representations," in Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, New York, NY, 2014, pp. 701–710.
[35] A. T. Puig, A. Wiesel, G. Fleury, and A. O. Hero, "Multidimensional shrinkage-thresholding operator and group lasso penalties," IEEE Signal Process. Lett., vol. 18, no. 6, pp. 363–366, 2011.
[36] N. Rosenfeld and A. Globerson, "Semi-supervised learning with competitive infection models," arXiv preprint arXiv:1703.06426, 2017.
[37] A. Sandryhaila and J. M. F. Moura, "Discrete signal processing on graphs," IEEE Trans. Signal Process., vol. 61, no. 7, pp. 1644–1656, Apr. 2013.
[38] S. Segarra, A. Marques, and A. Ribeiro, "Optimal graph-filter design and applications to distributed linear network operators," IEEE Trans. Signal Process., vol. 65, no. 15, pp. 4117–4131, Aug. 2017.
[39] P. P. Talukdar and K. Crammer, "New regularized algorithms for transductive learning," in Proc. Joint Eur. Conf. on Machine Learning and Knowledge Discovery in Databases, 2009, pp. 442–457.
[40] J. Ugander and L. Backstrom, "Balanced label propagation for partitioning massive graphs," in Proc. ACM Int. Conf. on Web Search and Data Mining, Rome, Italy, 2013, pp. 507–516.
[41] X.-M. Wu, Z. Li, A. M. So, J. Wright, and S.-F. Chang, "Learning with partially absorbing random walks," in Proc. Advances in Neural Information Processing Systems, Lake Tahoe, CA, Dec. 2012, pp. 3077–3085.
[42] Z. Yang, W. W. Cohen, and R. Salakhutdinov, "Revisiting semi-supervised learning with graph embeddings," arXiv preprint arXiv:1603.08861, 2016.
[43] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proc. Int. Conf. on Machine Learning, Washington, DC, Aug. 2003.
[44] D. F. Gleich and M. W. Mahoney, "Using local spectral methods to robustify graph-based learning algorithms," in Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Sydney, Australia, Aug. 2015.
[45] K. He, P. Shi, J. E. Hopcroft, and D. Bindel, "Local spectral diffusion for robust community detection," in Proc. SIGKDD Workshop, San Francisco, CA, Aug. 2016.
[46] B. Jiang, K. Kloster, D. F. Gleich, and M. Gribskov, "AptRank: An adaptive PageRank model for protein function prediction on bi-relational graphs," Bioinformatics, vol. 33, no. 12, pp. 1829–1836, Aug. 2017.
[47] K. He, Y. Sun, D. Bindel, J. E. Hopcroft, and Y. Li, "Detecting overlapping communities from local spectral subspaces," in Proc. IEEE Int. Conf. on Data Mining, Atlantic City, NJ, Nov. 2015.