Adaptive Diffusions for Scalable Learning over Graphs
Dimitris Berberidis, Athanasios N. Nikolakopoulos, Georgios B. Giannakis
Dept. of ECE and Digital Technology Center, University of Minnesota, Minneapolis, MN 55455, USA

Abstract—Diffusion-based classifiers, such as those relying on the Personalized PageRank and the Heat Kernel, enjoy remarkable classification accuracy at modest computational requirements. Their performance, however, is affected by the extent to which the chosen diffusion captures a typically unknown label propagation mechanism, which can be specific to the underlying graph and potentially different for each class. The present work introduces a disciplined, data-efficient approach to learning class-specific diffusion functions adapted to the underlying network topology. The novel learning approach leverages the notion of "landing probabilities" of class-specific random walks, which can be computed efficiently, thereby ensuring scalability to large graphs. This is supported by rigorous analysis of the properties of the model as well as of the proposed algorithms. Furthermore, a robust version of the classifier facilitates learning even in noisy environments. Classification tests on real networks demonstrate that adapting the diffusion function to the given graph and observed labels significantly improves the performance over fixed diffusions, reaching – and many times surpassing – the classification accuracy of computationally heavier state-of-the-art competing methods that rely on node embeddings and deep neural networks.
Index Terms—Semi-supervised Classification, Random Walks, Diffusions.
I. INTRODUCTION

The task of classifying the nodes of a graph arises frequently in several applications over real-world networks, such as those derived from social interactions and biological dependencies. Graph-based semi-supervised learning (SSL) methods tackle this task building on the premise that the true labels are distributed "smoothly" with respect to the underlying network, which then motivates leveraging the network structure to increase classification accuracy [11]. Graph-based SSL has been pursued by various intertwined methods, including iterative label propagation [6], [43], [29], [25], kernels on graphs [31], manifold regularization [5], graph partitioning [40], [19], transductive learning [39], competitive infection models [36], and bootstrapped label propagation [10]. SSL based on graph filters was discussed in [37], and further developed in [12] for bridge monitoring. Recently, approaches based on node embeddings [34], [18], [42], as well as deep learning architectures [21], [2], have gained popularity and have been reported to achieve state-of-the-art performance.

Many of the aforementioned methods are challenged by computational complexity and scalability issues that limit their applicability to large-scale networks. Random-walk-based diffusions present an efficient and effective alternative.
Methods of this family diffuse the known labels probabilistically through the graph, thereby ranking nodes according to weighted sums of variable-length landing probabilities. Celebrated representatives include those based on the Personalized PageRank (PPR) and the Heat Kernel, which were found to perform remarkably well in certain application domains [22], and have been nicely linked to particular network models [23], [3], [24]. Spectral diffusions have been used for community detection [47], [45], where local diffusion patterns are produced to approximate low-conductance communities, and adaptive PPR has been applied for prediction on a heterogeneous protein-function network [46].

The effectiveness of diffusion-based classifiers can vary considerably depending on whether the chosen diffusion conforms with the latent label propagation mechanism, which may be (i) particular to the target application or underlying network topology; and (ii) different for each class. The present contribution alleviates these shortcomings and markedly improves the performance of random-walk-based classifiers by adapting the diffusion functions of every class to both the network and the observed labels. The resulting novel classifier relies on the notion of landing probabilities of short random walks rooted at the observed nodes of each class. This small number of landing probabilities can be extracted efficiently with a small number of sparse matrix-vector products, thus ensuring applicability to large-scale networks. Theoretical analysis establishes that short random walks are in most cases sufficient for reliable classification. Furthermore, an algorithm is developed to identify (and potentially remove) outlying or anomalous samples jointly with adapting the diffusions. We test our methods in terms of multiclass and multilabel classification accuracy, and confirm that they can achieve results competitive with state-of-the-art methods, while also being considerably faster.

The rest of the paper is organized as follows. Section II introduces random-walk-based diffusions. Our novel approach, along with relevant analytical results, is the subject of Section III. Section IV presents a robust version of our algorithm, and Section V places our work in the context of related methods. Finally, Section VI presents experiments, while Section VII concludes the paper and discusses future directions.

Notation.
Bold lower-case letters denote column vectors (e.g., v); bold upper-case letters denote matrices (e.g., Q). Vectors q_j and q_i^T denote the j-th column and the i-th row of Q, respectively, whereas Q_ij (or, for clarity, [Q]_ij) denotes the ij-th entry of Q. Vector e_K denotes the K-th canonical column vector, and ‖·‖ denotes the Euclidean norm, unless stated otherwise.

Work was supported by NSF grants 171141, 1514056, and 1500713. A preliminary version of this work has appeared in [8].
II. PROBLEM STATEMENT AND MODELING
Consider a graph G := {V, E}, where V is the set of N nodes, and E the set of edges. Connectivity is captured by the weight matrix W having entries W_ij > 0 if (i, j) ∈ E. Associated with each node i ∈ V there is a discrete label y_i ∈ Y. In SSL classification over graphs, a subset L ⊂ V of nodes has available labels y_L, and the goal is to infer the labels of the unlabeled set U := V \ L. Given a measure of influence, a node most influenced by labeled nodes of a certain class is deemed to also belong to that class. Thus, label propagation on graphs boils down to quantifying the influence of L on U; see, e.g., [11], [25], [41]. An intuitive yet simple measure of node-to-node influence relies on the notion of random walks on graphs.

A simple random walk on a graph is a discrete-time Markov chain defined over the nodes, meaning with state space V. The transition probabilities are

    Pr{X_k = i | X_{k−1} = j} = W_ij / d_j = [W D^{−1}]_ij := [H]_ij

where X_k ∈ V denotes the position of the random walker (state) at the k-th step; d_j := Σ_{k∈N_j} W_kj is the degree of node j; and N_j is its neighborhood. Since we consider undirected graphs, the limiting distribution of {X_k} always exists, and it is unique if the graph is connected and non-bipartite. It is given by the dominant right eigenvector of the column-stochastic transition probability matrix H := W D^{−1}, where D := diag(d_1, d_2, …, d_N) [27]. The steady-state distribution π can be shown to have entries

    π_i := lim_{k→∞} Σ_{j∈V} Pr{X_k = i | X_0 = j} Pr{X_0 = j} = d_i / (2|E|)

that clearly do not depend on the initial "seeding" distribution Pr{X_0}; π is thus unsuitable for measuring influence among nodes. Instead, for graph-based SSL, we will utilize the k-step landing probability per node i given by

    p^(k)_i := Σ_{j∈V} Pr{X_k = i | X_0 = j} Pr{X_0 = j}    (1)

which in vector form p^(k) := [p^(k)_1 ⋯ p^(k)_N]^T satisfies p^(k) = H^k p^(0), where p^(0)_i := Pr{X_0 = i}. In words, p^(k)_i is the probability that a random walker with initial distribution p^(0) is located at node i after k steps. Therefore, p^(k)_i is a valid measure of the influence that p^(0) has on any node in V. The landing probabilities per class c ∈ Y are (cf. (1))

    p^(k)_c = H^k v_c    (2)

where, for L_c := {i ∈ L : y_i = c}, the normalized class-indicator vector v_c with i-th entry

    [v_c]_i = { 1/|L_c|,  i ∈ L_c;  0,  else }    (3)

acts as the initial distribution. Using (2), we model diffusions per class c over the graph driven by {p^(k)_c}_{k=0}^K as

    f_c(θ) = Σ_{k=0}^K θ_k p^(k)_c    (4)

where θ_k denotes the importance assigned to the k-th hop neighborhood. By setting θ_0 = 0 (since p^(0)_c is not useful for classification purposes) and constraining θ ∈ S_K, where S_K := {x ∈ R^K : x ≥ 0, 1^T x = 1} is the K-dimensional probability simplex, f_c(θ) can be compactly expressed as

    f_c(θ) = Σ_{k=1}^K θ_k p^(k)_c = P^(K)_c θ    (5)

where P^(K)_c := [p^(1)_c ⋯ p^(K)_c]. Note that f_c(θ) denotes a valid nodal probability mass function (pmf) for class c. Given θ, and upon obtaining {f_c(θ)}_{c∈Y}, our diffusion-based classifiers will predict labels over U as

    ŷ_i(θ) := arg max_{c∈Y} [f_c(θ)]_i    (6)

where [f_c(θ)]_i is the i-th entry of f_c(θ). The upshot of (4) is a unifying form of superimposed diffusions with tunable simplex weights, taking up to K steps per class to come up with an influence metric for the semi-supervised classifier (6) over graphs.
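To make the random-walk machinery above concrete, the following minimal sketch computes the landing probability matrix of (2)–(5) with one sparse matrix-vector product per step, and applies the classifier (6). It is an illustration written for this article (not the authors' released implementation), assuming the graph is given as a SciPy sparse adjacency matrix and the seed indices of each class are known.

```python
import numpy as np
import scipy.sparse as sp

def landing_probabilities(W, seeds, K):
    """Columns p_c^(1), ..., p_c^(K) of P_c^(K) (cf. (2)-(3))."""
    N = W.shape[0]
    d = np.asarray(W.sum(axis=0)).ravel()      # degrees d_j
    H = W @ sp.diags(1.0 / d)                  # column-stochastic H = W D^{-1}
    p = np.zeros(N)
    p[seeds] = 1.0 / len(seeds)                # initial distribution v_c
    P = np.empty((N, K))
    for k in range(K):
        p = H @ p                              # p^(k) = H p^(k-1)
        P[:, k] = p
    return P

def diffusion_classifier(P_per_class, theta):
    """Predictions (6): stack f_c(theta) = P_c^(K) theta, take the argmax."""
    F = np.column_stack([P @ theta for P in P_per_class])
    return F.argmax(axis=1)                    # index into the class list
```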
Next, we outline two notable members of the family of diffusion-based classifiers that can be viewed as special cases of (4).

A. Personalized PageRank Classifier
Inspired by the celebrated network centrality metric of [9], the Personalized PageRank (PPR) algorithm has well-documented merits for label propagation; see, e.g., [28]. PPR is a special case of (4) corresponding to θ_PPR = (1−α)[α α^2 ⋯ α^K]^T, where 0 < α < 1, and 1−α can be interpreted as the "restart" probability of random walks with restarts. The PPR-based classifier relies on (cf. (5))

    f_c(θ_PPR) = (1−α) Σ_{k=0}^K α^k p^(k)_c    (7)

satisfying, asymptotically in the number of random-walk steps,

    lim_{K→∞} f_c(θ_PPR) = (1−α)(I − αH)^{−1} v_c

which implies that f_c(θ_PPR) approximates the solution of a linear system. Indeed, as shown in [3], PPR amounts to solving a weighted regularized least-squares problem over V; see also [23] for a PPR interpretation as an approximate geometric discriminant function defined in the space of landing probabilities.

B. Heat Kernel Classifier
The heat kernel (HK) is another popular diffusion that has recently been employed for SSL [31] and community detection on graphs [22]. HK is also a special case of (4), with θ_HK = e^{−t}[t  t^2/2!  ⋯  t^K/K!]^T, yielding class distributions (cf. (4))

    f_c(θ_HK) = e^{−t} Σ_{k=0}^K (t^k / k!) p^(k)_c.    (8)

Furthermore, it can readily be shown that

    lim_{K→∞} f_c(θ_HK) = e^{−t(I−H)} v_c

allowing HK to be interpreted as an approximation of a heat diffusion process, where heat flows from L_c to the rest of the graph, and f_c(θ_HK) is a snapshot of the temperature after time t has elapsed. HK provably yields low-conductance communities, while also converging faster to its asymptotic closed-form expression than PPR (depending on the value of t) [15].
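For concreteness, the fixed coefficient profiles θ_PPR and θ_HK can be generated as follows. Truncating at K steps and renormalizing onto S_K (to match the simplex convention of (5)) is our editorial choice for this sketch, and the default values of alpha and t are merely illustrative.

```python
import numpy as np
from scipy.special import factorial

def theta_ppr(K, alpha=0.9):
    """Truncated PPR profile: theta_k proportional to (1 - alpha) * alpha^k."""
    theta = (1.0 - alpha) * alpha ** np.arange(1, K + 1)
    return theta / theta.sum()     # renormalize onto the simplex S_K

def theta_hk(K, t=10.0):
    """Truncated heat-kernel profile: theta_k proportional to e^{-t} t^k / k!."""
    k = np.arange(1, K + 1)
    theta = np.exp(-t) * t ** k / factorial(k)
    return theta / theta.sum()
```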
III. ADAPTIVE DIFFUSIONS

Besides the unifying view of (4), the main contribution here is on efficiently designing f_c(θ_c) in (5) by learning the corresponding θ_c per class. Thus, unlike PPR and HK, the methods introduced here can afford class-specific label propagation that is adaptive to the graph structure, and also adaptive to the labeled nodes. Figure 1 gives a high-level illustration of the proposed adaptive diffusion framework, where two classes (red and green) are to be diffused over the graph (cf. (2)), with class-specific diffusion coefficients adapted as described next. Diffusions are then built (cf. (5)) and employed for class prediction (cf. (6)).

Consider for generality a goodness-of-fit loss ℓ(·) and a regularizer R(·) promoting, e.g., smoothness over the graph. Using these, the sought class distribution will be

    f̂_c = arg min_{f∈R^N} ℓ(y_{L_c}, f) + λ R(f)    (9)

where λ tunes the degree of regularization, and

    [y_{L_c}]_i = { 1,  i ∈ L_c;  0,  else }

is the indicator vector of the nodes belonging to class c. Using our diffusion model in (5), the N-dimensional optimization problem (9) reduces to solving for the K-dimensional vector (K ≪ N)

    θ̂_c = arg min_{θ∈S_K} ℓ(y_{L_c}, f_c(θ)) + λ R(f_c(θ)).    (10)

Although many choices of ℓ(·) may be of interest, we will focus for simplicity on the quadratic loss, namely

    ℓ(y_{L_c}, f) := Σ_{i∈L} (1/d_i) ([ȳ_{L_c}]_i − f_i)^2 = (ȳ_{L_c} − f)^T D_L^† (ȳ_{L_c} − f)    (11)

where ȳ_{L_c} := (1/|L|) y_{L_c} is the class indicator vector after normalization to bring the target variables (entries of ȳ_{L_c}) and the entries of f to the same scale, and D_L^† = diag(d^(−1)_L) with entries

    [d^(−1)_L]_i = { 1/d_i,  i ∈ L;  0,  else }.

For a smoothness-promoting regularization, we will employ the following (normalized) Laplacian-based metric

    R(f) = (1/2) Σ_{i∈V} Σ_{j∈N_i} W_ij (f_i/d_i − f_j/d_j)^2 = f^T D^{−1} L D^{−1} f    (12)

where L := D − W is the Laplacian matrix of the graph. Intuitively speaking, (11) favors vectors f having non-zero (1/|L|) values on nodes that are known to belong to class c, and zero values on nodes that are known to belong to other classes (L \ L_c), while (12) promotes similarity of the entries of f that correspond to neighboring nodes.

Fig. 1. High-level illustration of adaptive diffusions. The nodes belong to two classes (red and green). The per-class diffusions are learned by exploiting the landing probability spaces produced by random walks rooted at the sampled nodes (second layer: up for red; down for green).
In (11) and (12), each entry f_i is normalized by d_i^{−1/2} and d_i^{−1}, respectively. This normalization counterbalances the tendency of random walks to concentrate on high-degree nodes, thus placing equal importance on all nodes.

Substituting (11) and (12) into (10), and recalling from (5) that f_c(θ) = P^(K)_c θ, yields the convex quadratic program

    θ̂_c = arg min_{θ∈S_K} θ^T A_c θ + θ^T b_c    (13)

with b_c and A_c given by

    b_c = −(2/|L|) (P^(K)_c)^T D_L^† y_{L_c}    (14)
    A_c = (P^(K)_c)^T D_L^† P^(K)_c + λ (P^(K)_c)^T D^{−1} L D^{−1} P^(K)_c    (15)
        = (P^(K)_c)^T [ (D_L^† + λ D^{−1}) P^(K)_c − λ D^{−1} H P^(K)_c ]
        = (P^(K)_c)^T ( D_L^† P^(K)_c + λ D^{−1} P̃^(K)_c )    (16)

where

    H P^(K)_c = [H p^(1)_c  H p^(2)_c ⋯ H p^(K)_c] = [p^(2)_c  p^(3)_c ⋯ p^(K+1)_c]

is a "shifted" version of P^(K)_c in which each p^(k)_c is advanced by one step, and P̃^(K)_c := [p̃^(1)_c  p̃^(2)_c ⋯ p̃^(K)_c], with p̃^(i)_c := p^(i)_c − p^(i+1)_c, contains the "differential" landing probabilities. The complexity of "naively" finding the K×K matrix A_c (and thus also b_c) is O(K^2 N) for computing the first summand, and O(|E|K) for the second summand in (15), after leveraging the sparsity of L, for which |E| ≪ N^2. But since the columns of P̃^(K)_c are obtained as differences of consecutive columns of P^(K)_c, this load of O(|E|K) is saved. In a nutshell, the solver in (13)-(16), which we term adaptive diffusion (AdaDIF), incurs complexity of order O(K^2 N).

Remark 1. The problem in (13) is a quadratic program (QP) of dimension K (or the dictionary size D, to be discussed in Section III-C, when in dictionary mode). In general, solving a QP with K variables to a given precision requires O(K^3) worst-case complexity. Although this may appear heavy, K in our setting is small (a few tens at most), and thus negligible compared to the quantities that depend on the graph dimensions. For instance, the graphs that we tested have on the order of 10^4 nodes (N) and 10^5 edges (|E|). Therefore, since K ≪ N and K ≪ |E| by many orders of magnitude, the overall complexity is dominated by the O(|E|K) (same as PPR and HK) for performing the random walks and the O(K^2 N) for computing A_c.
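The program (13) can be handed to any off-the-shelf QP solver; since the paper's implementation mentions CVXOPT (Section VI), the following hedged sketch casts (13) into CVXOPT's canonical form. Symmetrizing A_c is a standard numerical safeguard, not a step spelled out in the paper.

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_simplex_qp(A_c, b_c):
    """min_theta theta^T A_c theta + theta^T b_c  s.t.  theta >= 0, 1^T theta = 1.
    CVXOPT solves (1/2) x^T P x + q^T x with G x <= h, A x = b."""
    K = b_c.size
    P = matrix(A_c + A_c.T)            # 2 * (symmetric part of A_c)
    q = matrix(b_c.astype(float))
    G = matrix(-np.eye(K))             # -theta <= 0  (nonnegativity)
    h = matrix(np.zeros(K))
    A = matrix(np.ones((1, K)))        # hyperplane 1^T theta = 1
    b = matrix(np.ones(1))
    sol = solvers.qp(P, q, G, h, A, b)
    return np.asarray(sol['x']).ravel()
```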
A. Limiting behavior and computational complexity

In this section, we offer further insights into the model (5), along with a complexity analysis of the parametric solution in (13). To start, the next proposition establishes the limiting behavior of AdaDIF as the regularization parameter grows.
Proposition 1. If the second largest eigenvalue of H has multiplicity 1, then for K sufficiently large but finite, the solution to (13) as λ → ∞ satisfies

    θ̂_c = e_K,  ∀ L_c ⊆ V.    (17)

Our experience with solving (13) reveals that the "sufficiently large" K required for (17) to hold can be quite small. As λ → ∞, the effect of the loss in (10) vanishes. According to Proposition 1, this causes AdaDIF to boost smoothness by concentrating the simplex weights (entries of θ̂_c) on landing probabilities of the late steps (k close to K). If, on the other extreme, smoothness over the graph is not promoted (cf. λ = 0), the sole objective of AdaDIF is to construct diffusions that best fit the available labeled data. Since short random walks from a given node typically lead to nodes of the same class, while longer walks reach other classes, AdaDIF with λ = 0 tends to leverage only a few landing probabilities of early steps (k close to 1). For moderate values of λ, AdaDIF effectively adapts the per-class diffusions by balancing the emphasis on initial versus final landing probabilities.

Fig. 2 depicts an example of how AdaDIF places weights {θ_k}_{k=1}^K on landing probabilities after a maximum of K = 20 steps, generated from a few samples belonging to one of the classes of the Cora citation network. Note that the learned coefficients may follow radically different patterns than those dictated by standard non-adaptive diffusions such as PPR or HK. It is worth noting that the simplex constraint induces sparsity in the solution of (13), thus "pushing" {θ_k} entries to zero.

The computational core of the proposed method is building the landing probability matrix P^(K)_c, whose columns are computed fast using power iterations that leverage the sparsity of H (cf. (2)). This endows AdaDIF with high computational efficiency, especially for small K. Specifically, since solving (13) incurs complexity O(K^2 N) per class, if K < |E|/N this becomes O(|E|K); and for |Y| classes, the
overall complexity of AdaDIF is O(|Y||E|K), which is of the same order as that of non-adaptive diffusions such as PPR and HK. For larger K, however, an additional O(K^2 N) is required per class, mainly to obtain A_c in (16). Overall, if the O(KN) memory requirements are met, the runtime of AdaDIF scales linearly with |E|, provided that K remains small. Thankfully, small values of K are usually sufficient to achieve high learning performance. As will be shown in the next section, this observation is on par with the analytical properties of diffusion-based classifiers, where it turns out that large K does not improve classification accuracy.

Fig. 2. Illustration of K = 20 landing probability coefficients for PPR, HK with t = 10, and AdaDIF (λ = 15).

B. On the choice of K
Here we elaborate on how the selection of K influences the classification task at hand. As expected, the effect of K is intimately linked to the topology of the underlying graph, the labeled nodes, and their properties. For simplicity, we will focus on binary classification with the two classes denoted by "+" and "−." Central to our subsequent analysis is a concrete measure of the effect an extra landing probability vector p^(k)_c can have on the outcome of a diffusion-based classifier. Intuitively, this effect diminishes as the number of steps K grows, since all random walks eventually converge to the same stationary distribution. Motivated by this, we introduce next the γ-distinguishability threshold.

Definition 1 (γ-distinguishability threshold). Let p_+ and p_− denote the seed vectors for nodes of class "+" and "−," respectively, initializing the landing probability vectors in the matrices X_c := P^(K)_c and X̌_c := [p^(1)_c ⋯ p^(K−1)_c  p^(K+1)_c], where c ∈ {+, −}. With y := X_+ θ − X_− θ and y̌ := X̌_+ θ − X̌_− θ, the γ-distinguishability threshold of the diffusion-based classifier is the smallest integer K_γ satisfying ‖y − y̌‖_2 ≤ γ.

The following theorem establishes an upper bound on K_γ expressed in terms of fundamental quantities of the graph, as well as basic properties of the labeled nodes per class; see Appendix B for a proof.

Theorem 1.
For any diffusion-based classifier with coefficients θ constrained to a probability simplex of appropriate dimension, the γ-distinguishability threshold is upper-bounded as

    K_γ ≤ (1/μ′) log[ (2√(d_max)/γ) ( 1/√(d_{min,−}|L_−|) + 1/√(d_{min,+}|L_+|) ) ]

where d_{min,+} := min_{i∈L_+} d_i, d_{min,−} := min_{j∈L_−} d_j, d_max := max_{i∈V} d_i, and μ′ := min{μ_2, 2 − μ_N}, with {μ_n}_{n=1}^N denoting the eigenvalues of the normalized graph Laplacian in ascending order.

The γ-distinguishability threshold can guide the choice of the dimension K of the landing probability space. Indeed, using class-specific landing probability steps K ≥ K_γ does not help distinguish between the corresponding classes, in the sense that the classifier output is not perturbed by more than γ. Intuitively, the information contained in the landing probabilities K_γ + 1, K_γ + 2, … is essentially the same for both classes; thus, using them as features unnecessarily increases the overall complexity of the classifier, and also "opens the door" to curse-of-dimensionality concerns. Note also that, in settings where one can freely choose the nodes to sample, this result could be used to guide such a choice in a disciplined way.

Theorem 1 makes no assumptions on the diffusion coefficients, so long as they belong to a probability simplex. Of course, specifying the diffusion function can specialize and further tighten the corresponding γ-distinguishability threshold. In Appendix C we give a tighter threshold for PPR.

Conveniently, our experiments suggest that K in the range of 10 to 20 is usually sufficient to achieve high performance on most real graphs; see also Fig. 3, where K_γ is found numerically for different values of the γ-distinguishability threshold and different proportions of sampled nodes on the BlogCatalog graph. Nevertheless, longer random walks may be necessary in, e.g., graphs with small μ′, especially when labeled nodes are scarce. To deal with such challenges, the ensuing modification of AdaDIF, which scales linearly with K, is nicely motivated.

Remark 2. While PPR and HK in theory rely on infinitely long random walks, their coefficients decay rapidly (θ_k ∝ α^k for PPR and θ_k ∝ t^k/k! for HK). This means that not only θ_k → 0 as k → ∞ in both cases, but the convergence rate is also very fast (especially for HK). This agrees with our intuition on random walks, as well as with Theorem 1, which suggests that, to guarantee a level of distinguishability (necessary for accuracy) between classes, classifiers should rely on relatively short random walks. Moreover, when operating in an adaptive framework such as the one proposed here, using finite-step (preferably short-length) landing probabilities is much more practical, since it restricts the number of free variables (the θ_k's), which mitigates overfitting and results in optimization problems that scale well with the network size.
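Under the stated assumptions, the bound of Theorem 1 is straightforward to evaluate numerically; the sketch below does so, with all inputs being the quantities defined in the theorem (the example values in the comment are hypothetical).

```python
import numpy as np

def k_gamma_bound(gamma, d_max, d_min_pos, d_min_neg, L_pos, L_neg, mu_prime):
    """Upper bound on the gamma-distinguishability threshold K_gamma
    (Theorem 1), with mu_prime = min(mu_2, 2 - mu_N)."""
    coeff = (2.0 * np.sqrt(d_max) / gamma) * (
        1.0 / np.sqrt(d_min_neg * L_neg) + 1.0 / np.sqrt(d_min_pos * L_pos))
    return int(np.ceil(np.log(coeff) / mu_prime))

# Hypothetical example: gamma = 1e-3, d_max = 100, d_min = 2 for both
# classes, 20 seeds per class, mu_prime = 0.2 -> K_gamma of a few tens.
```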
Fig. 3. Experimental evaluation of K_γ for different values of the γ-distinguishability threshold, and different proportions of sampled nodes, on the BlogCatalog graph.
C. Dictionary of diffusions
The present section deals with a modified version of AdaDIF, where the number of parameters (the dimension of θ) is restricted to D < K, meaning the "degrees of freedom" of each class-specific distribution are fewer than the number of landing probabilities. Specifically, consider (cf. (5))

    f_c(θ) = Σ_{k=1}^K a_k(θ) p^(k)_c = P^(K)_c a(θ)

where a_k(θ) := Σ_{d=1}^D θ_d C_kd, and C := [c_1 ⋯ c_D] ∈ R^{K×D} is a dictionary of D coefficient vectors, the i-th forming the column c_i ∈ S_K. Since a(θ) = Cθ, it follows that

    f_c(θ) = P^(K)_c C θ = Σ_{d=1}^D θ_d f^(d)_c

where f^(d)_c := Σ_{k=1}^K C_kd p^(k)_c is the d-th dictionary diffusion. To find the optimal θ, the optimization problem in (13) is solved with

    b_c = −(2/|L|) (F^Δ_c)^T D_L^† y_{L_c}    (18)
    A_c = (F^Δ_c)^T D_L^† F^Δ_c + λ (F^Δ_c)^T D^{−1} L D^{−1} F^Δ_c    (19)

where F^Δ_c := [f^(1)_c ⋯ f^(D)_c] effectively replaces P^(K)_c as the basis of the space on which each f_c is constructed. The description of AdaDIF in dictionary mode is given as a special case of Algorithm 1, together with the subroutine in Algorithm 3 for memory-efficient generation of F^Δ_c.

The motivation behind this dictionary-based variant of AdaDIF is two-fold. First, it leverages the properties of a judiciously selected basis of known diffusions, e.g., by constructing C = [θ_PPR  θ_HK ⋯]. In that sense, our approach is related to multi-kernel methods, e.g., [1], although significantly more scalable than the latter. Second, the complexity of AdaDIF in dictionary mode is O(|E|(K + D)), where D can be arbitrarily smaller than K, leading to complexity that is linear with respect to both K and |E|.
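A dictionary C of the kind described above can be assembled as follows. The specific values of t and β used in the paper's experiments are garbled in our source, so the defaults below are purely illustrative, and the negative sign of the power-law exponent is an assumption.

```python
import numpy as np
from scipy.special import factorial

def build_dictionary(K, ts=(5, 10, 15, 20, 25), betas=(1, 2, 3, 4, 5)):
    """C in R^{K x D}: heat-kernel columns (one per t) followed by
    power-law columns k^{-beta} (one per beta), all normalized to S_K."""
    k = np.arange(1, K + 1, dtype=float)
    cols = []
    for t in ts:
        c = np.exp(-t) * t ** k / factorial(k)
        cols.append(c / c.sum())
    for beta in betas:
        c = k ** (-float(beta))
        cols.append(c / c.sum())
    return np.column_stack(cols)       # D = len(ts) + len(betas) columns
```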
Algorithm 1 ADAPTIVE DIFFUSIONS
Input: adjacency matrix W; labeled nodes {y_i}_{i∈L}
Parameters: regularization parameter λ; number of landing probabilities K; Dictionary mode ∈ {True, False}; Unconstrained ∈ {True, False}
Output: predictions {ŷ_i}_{i∈U}
  Extract Y = {set of unique labels in {y_i}_{i∈L}}
  for c ∈ Y do
    L_c = {i ∈ L : y_i = c}
    if Dictionary mode then
      F^Δ_c = DICTIONARY(W, L_c, K, C)
      Obtain b_c and A_c as in (18) and (19)
      F_c = F^Δ_c
    else
      {P^(K)_c, P̃^(K)_c} = LANDPROB(W, L_c, K)
      Obtain b_c and A_c as in (14) and (16)
      F_c = P^(K)_c
    end if
    if Unconstrained then
      Obtain θ̂_c as in (20) and (21)
    else
      Obtain θ̂_c by solving (13)
    end if
    f_c(θ̂_c) = F_c θ̂_c
  end for
  Obtain ŷ_i = arg max_{c∈Y} [f_c(θ̂_c)]_i, ∀ i ∈ U

Algorithm 2 LANDPROB
Input: W, L_c, K
Output: P^(K)_c, P̃^(K)_c
  H = W D^{−1}
  p^(0)_c = v_c
  for k = 1 : K + 1 do
    p^(k)_c = H p^(k−1)_c
    p̃^(k−1)_c = p^(k−1)_c − p^(k)_c
  end for

Algorithm 3 DICTIONARY
Input: W, L_c, K, C
Output: F^Δ_c
  H = W D^{−1}
  p^(0)_c = v_c
  {f^(d)_c}_{d=1}^D = 0
  for k = 1 : K do
    p^(k)_c = H p^(k−1)_c
    for d = 1 : D do
      f^(d)_c = f^(d)_c + C_kd p^(k)_c
    end for
  end for
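In Python, Algorithm 2 and the assembly of (14) and (16) might look as follows. This is a sketch under the paper's definitions (D_L^† carries 1/d_i on labeled nodes and zero elsewhere), not the authors' released code.

```python
import numpy as np
import scipy.sparse as sp

def land_prob(W, seeds, K):
    """Algorithm 2: P_c^(K) and the differential matrix tilde{P}_c^(K)."""
    N = W.shape[0]
    d = np.asarray(W.sum(axis=0)).ravel()
    H = W @ sp.diags(1.0 / d)                 # H = W D^{-1}
    P = np.empty((N, K + 1))                  # columns p^(1) ... p^(K+1)
    p = np.zeros(N)
    p[seeds] = 1.0 / len(seeds)               # p^(0) = v_c
    for k in range(K + 1):
        p = H @ p
        P[:, k] = p
    return P[:, :K], P[:, :K] - P[:, 1:]      # P^(K), tilde{P}^(K)

def assemble_qp_terms(P, P_tilde, y_c, d, labeled, lam, L_size):
    """b_c and A_c from (14) and (16); `labeled` is a boolean node mask."""
    w = np.where(labeled, 1.0 / d, 0.0)       # diagonal of D_L^dagger
    b_c = -(2.0 / L_size) * P.T @ (w * y_c)
    A_c = P.T @ (w[:, None] * P + lam * P_tilde / d[:, None])
    return A_c, b_c
```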
D. Unconstrained diffusions

Thus far, the diffusion coefficients θ have been constrained to the K-dimensional probability simplex S_K, resulting in sparse solutions θ̂_c, as well as f_c(θ̂_c) ∈ S_N. The latter allows each f_c(θ) to be interpreted as a pmf over V. Nevertheless, the simplex constraint imposes a limitation on the model: landing probabilities may only have a non-negative contribution to the resulting class distribution. Upon relaxing this non-negativity constraint, (13) affords the closed-form solution

    θ̂_c = −(1/2) A_c^{−1} (b_c + λ* 1)    (20)
    λ* = −(2 + 1^T A_c^{−1} b_c) / (1^T A_c^{−1} 1).    (21)

Retaining the hyperplane constraint 1^T θ = 1 forces at least one entry of θ to be positive. Note that for K > |L|, (20) may become ill-conditioned and yield inaccurate solutions. This can be mitigated by imposing ℓ_2-norm regularization on θ, which is equivalent to adding εI to A_c, with ε > 0 sufficiently large to stabilize the linear system.

A step-by-step description of the proposed AdaDIF approach is given by Algorithm 1, along with the subroutines in Algorithms 2 and 3. Determining the limiting behavior of unconstrained AdaDIF, as well as exploring the effectiveness of different regularizers (e.g., the sparsity-inducing ℓ_1-norm), is part of our ongoing research. Towards the goal of developing more robust methods for designing diffusions, the ensuing section presents our proposed approach that relies on minimizing the leave-one-out loss of the resulting classifier.
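A hedged sketch of the closed form (20)-(21), with the εI ridge mentioned above (the sign conventions follow our reconstruction of the KKT conditions of (13) under the hyperplane constraint):

```python
import numpy as np

def unconstrained_theta(A_c, b_c, eps=1e-8):
    """Solve (13) keeping only 1^T theta = 1 (cf. (20)-(21));
    eps * I guards against ill-conditioning when K > |L|."""
    K = b_c.size
    A = 0.5 * (A_c + A_c.T) + eps * np.eye(K)
    ones = np.ones(K)
    Ainv_b = np.linalg.solve(A, b_c)
    Ainv_1 = np.linalg.solve(A, ones)
    lam_star = -(2.0 + ones @ Ainv_b) / (ones @ Ainv_1)   # multiplier (21)
    return -0.5 * (Ainv_b + lam_star * Ainv_1)            # 1^T theta = 1 holds
```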
IV. ADAPTIVE DIFFUSIONS ROBUST TO ANOMALIES

Although the loss function in (11) is simple and easy to implement, it may lack robustness against nodes whose labels do not comply with a diffusion-based information propagation model. In real-world graphs, such "difficult" nodes may arise due to model limitations, observation noise, or even deliberate mislabeling by adversaries. For such setups, this section introduces a novel adaptive diffusion classifier with: i) robustness in finding θ by ignoring errors that arise due to outlying/anomalous nodes; and ii) the capability to identify and remove such "difficult" nodes.

Let us begin by defining the following per-class (c ∈ Y) loss

    ℓ^c_rob(y_{L_c}, θ) := Σ_{i∈L} (1/d_i) ([ȳ_{L_c}]_i − [f_c(θ; L\i)]_i)^2    (22)

where f_c(θ; L\i) is the class-c diffusion after removing the i-th node from the set of all labels. Intuitively, (22) evaluates the ability of a propagation mechanism effected by θ to predict the presence of a class-c label on each node i ∈ L, using the remaining labeled nodes L\i. Since each class-specific distribution f_c(θ) is constructed by random walks rooted in L_c ⊆ L, it follows that

    f_c(θ; L\i) = { f_c(θ),  i ∉ L_c;  f_c(θ; L_c\i),  i ∈ L_c }    (23)

since f_c(θ) is not directly affected by the removal of a label that belongs to another class and is not used as a class-c seed. The class-c diffusion upon removing the i-th node from the seeds L_c is given as (cf. (5))

    f_c(θ; L_c\i) = Σ_{k=1}^K θ_k p^(k)_{L_c\i}

where p^(k)_{L_c\i} := H^k v_{L_c\i}, and

    [v_{L_c\i}]_j = { 1/|L_c\i|,  j ∈ L_c\i;  0,  else }.    (24)

The robust loss in (22) can be expressed more compactly as

    ℓ^c_rob(y_{L_c}, θ) := ‖D_L^{−1/2} (ȳ_{L_c} − R^(K)_c θ)‖_2^2    (25)

where D_L^{−1/2} := (D_L^†)^{1/2}, and

    [R^(K)_c]_ik := { [p^(k)_{L_c\i}]_i,  i ∈ L_c;  [p^(k)_c]_i,  else }.    (26)

Since p^(k)_c = |L_c|^{−1} Σ_{i∈L_c} p^(k)_{L_c\i}, evaluating (25) only requires the rows of R^(K)_c and the entries of y_{L_c} that correspond to L, since the remaining diagonal entries of D_L^† are 0. Having defined ℓ^c_rob(·), per-class diffusion coefficients θ̂_c can be obtained by solving

    θ̂_c = arg min_{θ∈S_K} ℓ^c_rob(y_{L_c}, θ) + λ_θ ‖θ‖_2^2    (27)

where ℓ_2 regularization with parameter λ_θ is introduced to prevent overfitting and numerical instabilities. Note that the smoothness regularization in (12) is less appropriate in the context of robustness, since it promotes "spreading" of the random walks (cf. Prop. 1), thus making class diffusions more similar and increasing the difficulty of detecting outliers. Similar to (13), quadratic programming can be adopted to solve (27).

Towards mitigating the effects of outliers, and inspired by the robust estimators introduced in [20], we further enhance ℓ^c_rob(·) by explicitly modeling the effect of outliers with a sparse vector o ∈ R^N, leading to the modified cost

    ℓ^c_rob(y_{L_c}, o, θ) := ‖D_L^{−1/2} (o + ȳ_{L_c} − R^(K)_c θ)‖_2^2.    (28)

The non-zero entries of o can capture large residuals (prediction errors |[ȳ_{L_c}]_i − [f_c(θ; L\i)]_i|), which may be the result of outlying, anomalous, or mislabeled nodes. Thus, when operating in the presence of anomalies, the robust classifier aims at identifying both the diffusion parameters {θ̂_c}_{c∈Y} and the per-class outlier vectors {ô_c}_{c∈Y}.
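Assembling R^(K)_c of (26) requires one leave-one-out run of Algorithm 2 per class-c seed, which is what gives the robust method its O(K|L||E|) cost. A sketch, reusing the land_prob routine introduced earlier (an assumption of this illustration, not a routine named by the paper):

```python
import numpy as np

def build_R(W, labeled, class_seeds, K, land_prob):
    """Rows of R_c^(K) in (26) for the nodes listed in `labeled`."""
    P_c, _ = land_prob(W, class_seeds, K)
    seed_set = set(class_seeds)
    R = np.empty((len(labeled), K))
    for r, i in enumerate(labeled):
        if i in seed_set:
            loo = [j for j in class_seeds if j != i]
            P_loo, _ = land_prob(W, loo, K)    # walk re-rooted without node i
            R[r] = P_loo[i]                    # [p^(k)_{L_c \ i}]_i
        else:
            R[r] = P_c[i]                      # [p^(k)_c]_i
    return R
```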
The two tasks can be performed jointly by solving the following optimization problem

    {θ̂_c, ô_c}_{c∈Y} = arg min_{θ_c∈S_K, o_c∈R^N} Σ_{c∈Y} [ ℓ^c_rob(y_{L_c}, o_c, θ_c) + λ_θ ‖θ_c‖_2^2 ] + λ_o ‖D_L^{−1/2} O‖_{2,1}    (29)

where O := [o_1 ⋯ o_{|Y|}] concatenates the outlier vectors o_c, and ‖X‖_{2,1} := Σ_{i=1}^I √(Σ_{j=1}^J X_{ij}^2) for any X ∈ R^{I×J}. The term λ_o ‖D_L^{−1/2} O‖_{2,1} in (29) acts as a regularizer that promotes sparsity over the rows of O; it can also be interpreted as an ℓ_1-norm regularizer over a vector that contains the ℓ_2 norms of the rows of O. The reason for using such block-sparse regularization is to force the outlier vectors o_c of different classes to have the same support (pattern of non-zero entries). In other words, the |Y| different diffusion/outlier detectors are forced to consent on which nodes are outliers.

Since (29) is non-convex, convergence of gradient-descent-type methods to the global optimum is not guaranteed. Nevertheless, since (29) is biconvex (i.e., convex with respect to each variable), the following alternating minimization scheme

    Ô^(t) = arg min_O Σ_{c∈Y} [ ℓ^c_rob(y_{L_c}, o_c, θ̂^(t−1)_c) + λ_θ ‖θ̂^(t−1)_c‖_2^2 ] + λ_o ‖D_L^{−1/2} O‖_{2,1}    (30)
    θ̂^(t)_c = arg min_{θ∈S_K} ℓ^c_rob(y_{L_c}, ô^(t)_c, θ) + λ_θ ‖θ‖_2^2 + λ_o ‖D_L^{−1/2} Ô^(t)‖_{2,1}    (31)

with Ô^(0) := [0 ⋯ 0] converges to a partial optimum [17]. By further simplifying (31) and solving (30) in closed form, we obtain

    θ̂^(t)_c = arg min_{θ∈S_K} ℓ^c_rob(ȳ_{L_c} + ô^(t−1)_c, θ) + λ_θ ‖θ‖_2^2    (32)
    Ô^(t) = SoftThres_{λ_o}(Ỹ^(t))    (33)

where Ỹ^(t) := [ỹ^(t)_1, …, ỹ^(t)_{|Y|}] is the matrix that concatenates the per-class residual vectors ỹ^(t)_c := ȳ_{L_c} − R^(K)_c θ̂^(t)_c, and Z = SoftThres_{λ_o}(X) is a row-wise soft-thresholding operator such that

    z_i = x_i [1 − λ_o / (2‖x_i‖_2)]_+

where z_i and x_i are the i-th rows of Z and X, respectively; see, e.g., [35]. Intuitively, the soft-thresholding operation in (33) extracts the outliers by scaling down the residuals and "trimming" them wherever their across-classes ℓ_2 norm is below a certain threshold.

The alternating minimization between (32) and (33) terminates when

    ‖θ̂^(t)_c − θ̂^(t−1)_c‖_∞ ≤ ε,  ∀ c ∈ Y

where ε ≥ 0 is a prescribed tolerance. Having obtained the tuples {θ̂_c, ô_c}_{c∈Y}, one may remove the anomalous samples that correspond to non-zero rows of Ô and perform the diffusion with the remaining samples. The robust (r-)AdaDIF scheme is summarized as Algorithm 4, and has O(K|L||E|) computational complexity.
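The row-wise soft-thresholding operator of (33) is one line of NumPy; a minimal sketch:

```python
import numpy as np

def soft_threshold_rows(X, lam_o):
    """Z with rows z_i = x_i * [1 - lam_o / (2 ||x_i||_2)]_+  (cf. (33))."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    scale = np.clip(1.0 - lam_o / (2.0 * np.maximum(norms, 1e-12)), 0.0, None)
    return X * scale
```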
Algorithm 4 ROBUST ADAPTIVE DIFFUSIONS
Input: adjacency matrix W; labeled nodes {y_i}_{i∈L}
Parameters: regularization parameters λ_θ, λ_o; number of landing probabilities K
Output: predictions {ŷ_i}_{i∈U}; outliers ∪_{c∈Y} L^o_c
  Extract Y = {set of unique labels in {y_i}_{i∈L}}
  for c ∈ Y do
    L_c = {i ∈ L : y_i = c}
    for i ∈ L_c do
      {p^(k)_{L_c\i}}_{k=1}^K = LANDPROB(W, L_c\i, K)
    end for
    Obtain R^(K)_c as in (26)
  end for
  Ô^(0) = [0, …, 0], t = 0
  repeat
    t ← t + 1
    Obtain {θ̂^(t)_c}_{c∈Y} as in (32)
    Obtain Ô^(t) as in (33)
  until ‖θ̂^(t)_c − θ̂^(t−1)_c‖_∞ ≤ ε, ∀ c ∈ Y
  Set of outliers: S := {i ∈ L : ‖[Ô]_{i,:}‖_2 > 0}
  for c ∈ Y do
    L^o_c = L_c ∩ S
    L_c ← L_c \ L^o_c
  end for
  Obtain ŷ_i = arg max_{c∈Y} [f_c(θ̂_c)]_i, ∀ i ∈ U

V. CONTRIBUTIONS IN CONTEXT OF PRIOR WORKS

Following the seminal contribution in [9] that introduced PageRank as a network centrality measure, there has been a vast body of work studying its theoretical properties, computational aspects, and applications beyond Web ranking [26], [16]. Most formal approaches to generalizing PageRank focus either on the teleportation component (see, e.g., [32], [33], as well as [7] for an application to semi-supervised classification) or on the so-termed damping mechanism [13], [4]. Perhaps the most general setting can be found in [4], where a family of functional rankings was introduced through the choice of a parametric damping function that assigns weights to successive steps of a walk initialized according to the teleportation distribution. The per-class distributions produced by AdaDIF are in fact members of this family of functional rankings. However, instead of choosing a fixed damping function as in the aforementioned approaches, AdaDIF learns a class-specific and graph-aware damping mechanism. In this sense, AdaDIF undertakes statistical learning in the space of functional rankings, tailored to the underlying semi-supervised classification task.

A related method termed AptRank was recently proposed in [46], specifically for protein function prediction. Differently from AdaDIF, that method exploits meta-information regarding the hierarchical organization of the functional roles of proteins, and it performs random walks on a heterogeneous protein-function network. AptRank splits the data into training and validation sets of predetermined proportions, and adopts a cross-validation approach for obtaining the diffusion coefficients. Furthermore, a1) AptRank trains a single diffusion for all classes, whereas AdaDIF identifies different diffusions; and a2) the proposed robust leave-one-out method (r-AdaDIF) gathers the residuals from all leave-one-out splits into one cost function (cf. (22)) and then optimizes the (per-class) diffusion.

Recently, community detection (CD) methods were proposed in [47] and [45] that search the Krylov subspace of the landing probabilities of a given community's seeds, to identify a diffusion whose non-zero entries are localized over the nodes of the graph. In CD, the problem definition is: "given certain members of a community, identify the remaining (latent) members."
TABLE I
NETWORK CHARACTERISTICS

Graph              |V|      |E|      |Y|   Multilabel
Citeseer           3,327    4,732     6    No
Cora               2,708    5,429     7    No
PubMed            19,717   44,338     3    No
PPI (H. Sapiens)   3,890   76,584    50    Yes
Wikipedia          4,777  184,812    40    Yes
BlogCatalog       10,312  333,983    39    Yes
There is a subtle but important distinction between CD and semi-supervised classification (SSC): CD focuses on the retrieval of communities (that is, the nodes of a given class), whereas SSC focuses on predicting the labels/attributes of every node. While CD treats the detection of the various overlapping communities of the graph as independent tasks, SSC classifies nodes by taking all information from labeled nodes into account. More specifically, the proposed AdaDIF trains the diffusion of each class by actively avoiding the assignment of large diffusion values to nodes that are known (they have been labeled) to belong to a different class. Another important difference between AdaDIF and [47], [45], which again arises from the different contexts, is the length of the walk compared to the size of the graph. Since [47] and [45] aim at identifying small and local communities, they perform local walks of length smaller than the diameter of the graph. In contrast, SSC typically demands a certain degree of globality in the information exchange, achieved by longer random walks that surpass the graph diameter.

AdaDIF also shares links with the SSL methods based on graph signal processing proposed in [37], and further pursued in [12] for bridge monitoring; see also [38] and [14] for recent advances on graph filters. Similar to our approach, these graph-filter-based techniques are parametrized by assigning different weights to a number of consecutive powers of a matrix related to the structure of the graph. Our contribution, however, introduces different loss and regularization functions for adapting the diffusions, including a novel approach for training the model in an anomaly/outlier-resilient manner. Furthermore, while [37] focuses on binary classification and [12] identifies a single model for all classes, our approach allows different classes to have different propagation mechanisms. This feature can accommodate differences in the label distribution of each class over the nodes, while also making AdaDIF readily applicable to multi-label graphs. Moreover, while the weighting parameters in [37] remain unconstrained and those in [12] belong to a hyperplane, AdaDIF constrains the diffusion parameters to the probability simplex, which allows the random-walk-based diffusion vectors to denote valid probability mass functions over the nodes of the network. This certainly enhances the interpretability of the method, improves the numerical stability of the involved computations, and also reduces the search space of the model, which is beneficial under data scarcity. Finally, imposing the simplex constraint makes the model amenable to a rigorous analysis that relates the dimensionality of the feature space to basic graph properties, as well as to a disciplined exploration of its limiting behavior.
Fig. 4. Micro-F1 score for AdaDIF and non-adaptive diffusions on the labeled Cora graph, as a function of the length of the underlying random walks.
VI. EXPERIMENTAL EVALUATION
Our experiments compare the classification accuracy of the novel AdaDIF approach with state-of-the-art alternatives. For the comparisons, we use 6 benchmark labeled graphs whose dimensions and basic attributes are summarized in Table I. All 6 graphs have nodes that belong to multiple classes, while the last 3 are multilabeled (each node has one or more labels). We evaluate the performance of AdaDIF and the following: i) PPR and HK, which are special cases of AdaDIF as discussed in Section II; ii) label propagation (LP) [43]; iii) Node2vec [18]; iv) Deepwalk [34]; v) Planetoid-G [42]; and vi) graph convolutional networks (GCNs) [21]. We note here that AptRank [46] was not considered in our experiments, since it relies on meta-information that is not available for the benchmark datasets used here.

We performed cross-validation to select the parameters needed by i)-v). For HK, we performed a grid search over a small set of values of t. For PPR, we fixed α close to 1, since it is well documented that such values yield reliable performance; see, e.g., [28]. Both HK and PPR were run for 50 steps for convergence to be in effect (see Fig. 4); LP was also run for 50 steps. For Node2vec, we fixed most parameters to the values suggested in [18], and performed a grid search over p and q. Since Deepwalk can be seen as Node2vec with p = q = 1.0, we used the Node2vec Python implementation for both. As in [18], [34], we used the embedded node features to train a supervised logistic regression classifier with ℓ_2 regularization. For AdaDIF, we fixed λ = 15.0, while K = 15 was sufficient to attain desirable accuracy (cf. Fig. 4); only the values of the Boolean variables Unconstrained and
Dictionary mode (see Algorithm 1) were tuned by validation. For the multilabel graphs, we found λ = 5.0 and even shorter walks of K = 10 to perform well. For the dictionary mode of AdaDIF, we preselected D = 10, with the first five columns of C being HK coefficient vectors for five different values of t, and the other five being power-law coefficient vectors for five different values of the exponent β.

For the multiclass experiments, we evaluated the performance of all algorithms on the three benchmark citation networks, namely Cora, Citeseer, and
PubMed. We obtained the labels of an increasing number of nodes via uniform, class-balanced sampling, and predicted the labels of the remaining nodes. Thus, instead of sampling nodes over the graph uniformly at random, we randomly sample a given number of nodes per class. For each graph, we performed 20 experiments, each time sampling one of three fixed numbers of nodes per class. For each experiment, classification accuracy was measured on the unlabeled nodes in terms of Micro-F1 and Macro-F1 scores; see, e.g., [30]. The results were averaged over the 20 experiments, with mean and standard deviation reported in Table II. Evidently, AdaDIF achieves state-of-the-art performance for all graphs. For Cora and
PubMed, AdaDIF was switched to dictionary mode, while for
Citeseer, where the gain in accuracy is more significant, unconstrained diffusions were employed. In the multiclass setting, diffusion-based classifiers (AdaDIF, PPR, and HK) outperformed the embedding-based methods by a small margin, and GCNs by a larger margin. It should be noted, however, that GCNs were mainly designed to combine the graph with node features. In our "featureless" setting, we used the identity matrix columns as input features, as suggested in [21, Appendix].

The scalability of AdaDIF is reflected in the runtime comparisons of Fig. 7. All experiments were run on a machine with an i5 @3.50 GHz CPU and 16 GB of RAM. We used the Python implementations provided by the authors of the compared algorithms. The Python implementation of AdaDIF uses only tools provided by the scipy, numpy, and CVXOPT libraries. We also developed an efficient implementation that exploits parallelism, which is straightforward since each class can be treated separately. While AdaDIF incurs (as expected) a relatively small computational overhead over fixed diffusions, it is faster than GCNs that use Tensorflow, and orders of magnitude faster than embedding-based approaches.

Finally, Table III presents the results on the multilabel graphs, where we compare with Deepwalk and Node2vec, since the rest of the methods are designed for multiclass problems. Since these graphs entail a large number of classes, we increased the number of training samples. Similar to [18] and [34], during the evaluation of accuracy the number of labels per sampled node is known, and we check how many of them are among the top predictions. First, we observe that AdaDIF markedly outperforms PPR and HK across graphs and metrics. Furthermore, for the
PPI and
BlogCatalog graphs, the Micro-F1 score of AdaDIF comes close to that of the much heavier state-of-the-art Node2vec. Finally, AdaDIF outperforms the competing alternatives in terms of Macro-F1 score. It is worth noting that, for multilabel graphs with many classes, the performance boost over fixed diffusions can be largely attributed to AdaDIF's flexibility to treat each class differently. To demonstrate that different classes are indeed diffused in a markedly different manner, Fig. 6 plots all diffusion coefficient vectors {θ_c}_{c∈Y} yielded by AdaDIF on the PPI graph. Each line in the plot corresponds to the values of θ_c for a different c; evidently, while the overall "form" of the corresponding diffusion coefficients adheres to the general pattern observed in Fig. 2, there is indeed large diversity among classes.
TABLE II
MICRO-F1 AND MACRO-F1 SCORES ON MULTICLASS NETWORKS (CLASS-BALANCED SAMPLING)
[Mean ± standard deviation over 20 experiments of Micro-F1 and Macro-F1 for AdaDIF, PPR, HK, LP, Node2vec, Deepwalk, Planetoid-G, and GCN on Cora, Citeseer, and PubMed, at three values of |L_c|.]
TABLE III
MICRO-F1 AND MACRO-F1 SCORES OF VARIOUS ALGORITHMS ON MULTILABEL NETWORKS
[Mean ± standard deviation of Micro-F1 and Macro-F1 for AdaDIF, PPR, HK, Node2vec, and Deepwalk on PPI, BlogCatalog, and Wikipedia, with |L|/|V| ∈ {10%, 20%, 30%}.]

A. Analysis/interpretation of results
Here we follow an experimental approach aimed at understanding and interpreting our results. We focus on diffusion-based classifiers, along with a simple benchmark for diffusion-based classification: the k-step landing probabilities. Specifically, we compare the classification accuracy on the three multiclass datasets of AdaDIF, PPR, and HK with the accuracy of the classifier that uses only the k-th landing probability vectors {p^(k)_c}_{c∈Y}. The setting is similar to that of the previous section, with class-balanced sampling of a fixed number of nodes per class, while the k-step classifiers were examined over a wide range of steps k. The k-step classifier reveals the predictive power of individual landing probabilities, resulting in curves (see Fig. 5) that appear to be different for each network, characterizing its graph-label distribution relationship. For the Cora graph (left two plots), the performance of the k-step classifier improves sharply after the first few steps, peaks early, and then quickly degrades, suggesting that using landing probabilities of much larger k would most likely degrade the performance of a diffusion-based classifier. Interestingly, AdaDIF, relying on combinations of the first 15 steps, and PPR and HK, relying on the first 50, all achieve higher accuracy than that of the best single step. On the other hand, the Citeseer graph (middle two plots) behaves in a significantly different manner, with the k-step classifier requiring longer walks to reach high accuracy, which is then retained for much longer. Furthermore, accumulating landing probabilities the way PPR or HK does yields lower Micro-F1 accuracy than that of the single best step. By smartly combining the first 15 steps, which are of lower quality, AdaDIF surpasses the Micro-F1 scores of the longer walks. Interestingly, the Macro-F1 metric for Citeseer behaves differently than the Micro-F1, and quickly decreases after a relatively small number of steps. The disagreement between the two metrics can be explained as the diffusions of one or more of the larger classes gradually "overwhelming" those of one or more smaller classes, thus lowering the Macro-F1 score, since the latter is a metric that averages per class. In contrast, the Micro-F1 metric averages per node, and takes much less of an impact if a few nodes from the smaller classes are mislabeled. Finally, for the PubMed graph (right two plots), steps beyond roughly 20 yield consistently high accuracy both in terms of Micro- and Macro-averaged F1 score. Since HK, and mostly PPR, largely accumulate steps in that range, it seems reasonable that both fixed diffusions are fairly accurate on the PubMed graph.
Fig. 5. Classification accuracy of AdaDIF, PPR, and HK compared to the accuracy of the k-step landing probability classifier. Top left) Cora Micro-F1 score; Bottom left) Cora Macro-F1 score; Top middle) Citeseer Micro-F1 score; Bottom middle) Citeseer Macro-F1 score; Top right) PubMed Micro-F1 score; Bottom right) PubMed Macro-F1 score.

Fig. 6. AdaDIF diffusion coefficients for the different classes of the PPI graph. Each line corresponds to a different θ_c. The diffusion is characterized by high diversity among classes.

B. Tests on simulated label-corruption setup
Here we outline experiments performed to evaluate the performance of different diffusion-based classifiers in the presence of anomalous nodes. The main goal is to evaluate whether r-AdaDIF (Algorithm 4) yields improved performance over AdaDIF, HK, and PPR, as well as whether r-AdaDIF can detect anomalous nodes.
Fig. 7. Relative runtime comparisons for multiclass graphs.

We also tested a different type of rounding from class diffusions to class labels that was shown in [44] to be robust in the presence of erroneous labels on a graph constructed from images of handwritten digits. The idea is to first normalize the diffusions by the node degrees, sort each diffusion vector, and assign to each node the class for which the corresponding rank is highest. We applied this type of rounding on PPR diffusions (denoted as PPR w. ranking). Since a ground-truth set of anomalous nodes is not available in real graphs, we chose to infuse the true labels with artificial anomalies generated by the following simulated label-corruption process: go through y_L and, for each entry [y_L]_i = c, draw with probability p_cor a label c′ ∼ Unif{Y \ c} and replace [y_L]_i ← c′.
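A minimal sketch of this corruption process (the uniform "flip" just described); the function name and seeding are our own:

```python
import numpy as np

def corrupt_labels(y_L, classes, p_cor, seed=0):
    """Flip each observed label to a uniformly drawn *different* class
    with probability p_cor."""
    rng = np.random.default_rng(seed)
    y = np.array(y_L, copy=True)
    flip = rng.random(y.shape[0]) < p_cor
    for i in np.flatnonzero(flip):
        others = [c for c in classes if c != y[i]]
        y[i] = rng.choice(others)
    return y
```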
In other words, anomalies are created by corrupting some of the true labels by randomly and uniformly "flipping" them to a different label. Increasing the corruption probability p_cor of the training labels y_L is expected to have an increasingly negative impact on the classification accuracy over y_U. Indeed, as depicted in Fig. 8, the accuracy of all diffusion-based classifiers on the Cora graph degrades as p_cor increases. All diffusions were run for K = 50, while for r-AdaDIF we tuned λ_o and λ_θ to fixed values that perform well for moderate values of p_cor. Results were averaged over multiple Monte Carlo experiments, in each of which a fixed proportion of the nodes was sampled uniformly at random. While tuning λ_o for a specific p_cor generally yields improved results, we use the same λ_o across the range of p_cor values, since the true value of the latter is generally not available in practice. In this setup, r-AdaDIF demonstrates higher accuracy compared to the non-robust classifiers. Moreover, the performance gap increases as more labels become corrupted, until it reaches a "break point" at high corruption rates. Interestingly, r-AdaDIF performs worse in the absence of anomalies (p_cor = 0), which can be attributed to the fact that it then only removes useful samples and thus reduces the training set. Although PPR w. ranking displays relative robustness as p_cor increases, overall it performs worse than PPR with value-based rounding, at least on the Cora graph.

As mentioned earlier, the performance of r-AdaDIF in terms of outlier detection depends on the parameter λ_o. Specifically, for λ_o → 0 the regularizer in (29) is effectively removed and all samples are characterized as outliers. On the other hand, for λ_o ≫ 0, (29) yields Ô = [0, …, 0], meaning that no outliers are unveiled. For intermediate values of λ_o, r-AdaDIF trades off falsely identifying nominal samples as outliers (false alarm) against correctly identifying anomalies (correct detection). This tradeoff of r-AdaDIF's anomaly detection behavior was experimentally evaluated over multiple Monte Carlo runs, by sweeping over a large range of values of λ_o and for different values of p_cor; see the probability of detection (p_D) versus probability of false alarm (p_FA) depicted in Fig. 9. Evidently, r-AdaDIF performs much better than a random-guess ("coin toss") detector, whose curve is given by the grey dotted line, while the detection performance improves as the corruption rate decreases.

Fig. 8. Classification accuracy of various diffusion-based classifiers on Cora, as a function of the probability of label corruption.

Fig. 9. Anomaly detection performance of r-AdaDIF for different label corruption probabilities. The horizontal axis corresponds to the frequency with which r-AdaDIF returns a true positive (probability of detection) and the vertical axis corresponds to the frequency of false positives (probability of false alarm).

VII. CONCLUSIONS
The present work introduces a principled, data-efficient approach to learning class-specific diffusion functions tailored to the underlying network topology. Experiments on real networks confirm that adapting the diffusion function to the given graph and observed labels significantly improves the performance over fixed diffusions, reaching – and many times surpassing – the classification accuracy of computationally heavier state-of-the-art competing methods.

Emerging from this work are many exciting directions to explore. First, one can investigate different cost functions with respect to which the diffusions are adapted, e.g., by taking into account the robustness of the resulting classifier in the presence of adversarial data. Furthermore, it is worth investigating the space of nonlinear functions of the landing probabilities, to determine the degree to which accuracy can be boosted further. Last but not least, it will be interesting to develop adaptive diffusion methods where learning and adaptation are performed on-the-fly, without any memory and computational overhead.

APPENDIX
A. Proof of Proposition 1
For $\lambda \to \infty$, the effect of $\ell(\cdot)$ in (10) vanishes, and the optimization problem becomes equivalent to solving
\[
\min_{\boldsymbol{\theta} \in \mathcal{S}^K} \; \boldsymbol{\theta}^T \mathbf{A} \boldsymbol{\theta} \tag{34}
\]
where $\mathbf{A} := (\mathbf{P}_c^{(K)})^T \mathbf{D}^{-1}\mathbf{L}\mathbf{D}^{-1} \mathbf{P}_c^{(K)}$ has $(i,j)$ entry given by $A_{ij} = (\mathbf{p}_c^{(i)})^T \mathbf{D}^{-1}\mathbf{L}\mathbf{D}^{-1} \mathbf{p}_c^{(j)}$; and $\mathbf{p}_c^{(K)}$ is the vector of $K$-step landing probabilities with initial distribution $\mathbf{v}_c$ and transition matrix $\mathbf{H} = \sum_{n=1}^N \lambda_n \mathbf{u}_n \mathbf{v}_n^T$, where $\lambda_1 > \lambda_2 > \cdots > \lambda_N$ are its eigenvalues. Since $\mathbf{H}$ is a column-stochastic transition probability matrix, it holds that $\lambda_1 = 1$, $\mathbf{v}_1 = \mathbf{1}$, and $\mathbf{u}_1 = \boldsymbol{\pi}$, where $\boldsymbol{\pi} = \lim_{k\to\infty} \mathbf{p}_c^{(k)}$ is the steady-state distribution that can also be expressed as $\boldsymbol{\pi} = \mathbf{d}/(2|\mathcal{E}|)$ [27]. The landing probability vector for class $c$ is thus
\[
\mathbf{p}_c^{(K)} = \mathbf{H}^K \mathbf{v}_c = \Big[ \tfrac{1}{2|\mathcal{E}|}\mathbf{d}\mathbf{1}^T + \sum_{n=2}^N \lambda_n^K \mathbf{u}_n \mathbf{v}_n^T \Big] \mathbf{v}_c = \tfrac{1}{2|\mathcal{E}|}\mathbf{d} + \sum_{n=2}^N \lambda_n^K \gamma_n \mathbf{u}_n \approx \tfrac{1}{2|\mathcal{E}|}\mathbf{d} + \lambda_2^K \gamma_2 \mathbf{u}_2 \tag{35}
\]
where $\gamma_n := \mathbf{v}_n^T \mathbf{v}_c$, and the approximation in (35) holds because $\lambda_2^K \gg \lambda_n^K$ for $n \in [3, N]$, and $K$ large enough but finite. Using (35), $A_{ij}$ can be rewritten as
\[
\begin{aligned}
A_{ij} &= \Big[\tfrac{1}{2|\mathcal{E}|}\mathbf{d}^T + \lambda_2^i \gamma_2 \mathbf{u}_2^T\Big] \mathbf{D}^{-1}\mathbf{L}\mathbf{D}^{-1} \Big[\tfrac{1}{2|\mathcal{E}|}\mathbf{d} + \lambda_2^j \gamma_2 \mathbf{u}_2\Big] \\
&= \Big[\tfrac{1}{2|\mathcal{E}|}\mathbf{1}^T + \lambda_2^i \gamma_2 \mathbf{u}_2^T\mathbf{D}^{-1}\Big] \mathbf{L} \Big[\tfrac{1}{2|\mathcal{E}|}\mathbf{1} + \lambda_2^j \gamma_2 \mathbf{D}^{-1}\mathbf{u}_2\Big] \\
&= \tfrac{1}{4|\mathcal{E}|^2}\mathbf{1}^T\mathbf{L}\mathbf{1} + \tfrac{\lambda_2^i \gamma_2}{2|\mathcal{E}|}\mathbf{u}_2^T\mathbf{D}^{-1}\mathbf{L}\mathbf{1} + \tfrac{\lambda_2^j \gamma_2}{2|\mathcal{E}|}\mathbf{1}^T\mathbf{L}\mathbf{D}^{-1}\mathbf{u}_2 + \gamma_2^2 \lambda_2^{i+j}\, \mathbf{u}_2^T\mathbf{D}^{-1}\mathbf{L}\mathbf{D}^{-1}\mathbf{u}_2 \\
&= C\lambda_2^{i+j}
\end{aligned} \tag{36}
\]
where $C := \gamma_2^2\, \mathbf{u}_2^T\mathbf{D}^{-1}\mathbf{L}\mathbf{D}^{-1}\mathbf{u}_2$; the second equality uses $\mathbf{D}^{-1}\mathbf{d} = \mathbf{1}$, and the last equality follows because $\mathbf{L}\mathbf{1} = \mathbf{0}$. Using (36), one obtains $\mathbf{A} = C\boldsymbol{\lambda}\boldsymbol{\lambda}^T$, where $\boldsymbol{\lambda} := [\lambda_2 \; \lambda_2^2 \; \cdots \; \lambda_2^K]^T$, while (34) reduces to
\[
\min_{\boldsymbol{\theta} \in \mathcal{S}^K} \big(\boldsymbol{\lambda}^T\boldsymbol{\theta}\big)^2. \tag{37}
\]
Since $\boldsymbol{\lambda}^T\boldsymbol{\theta} > 0 \;\; \forall\, \boldsymbol{\theta} \in \mathcal{S}^K$, it can be shown that the KKT optimality conditions for (37) are identical to those of
\[
\min_{\boldsymbol{\theta} \in \mathcal{S}^K} \boldsymbol{\lambda}^T\boldsymbol{\theta}. \tag{38}
\]
Therefore, (37) admits minimizer(s) identical to (38). Finally, we will show that the minimizer of (38) is $\mathbf{e}_K$. Since the problem is convex, it suffices to show that $\nabla_{\boldsymbol{\theta}}^T(\boldsymbol{\lambda}^T\boldsymbol{\theta})\big|_{\boldsymbol{\theta}=\mathbf{e}_K}(\boldsymbol{\theta} - \mathbf{e}_K) \ge 0 \;\; \forall\, \boldsymbol{\theta} \in \mathcal{S}^K$, or, equivalently,
\[
\boldsymbol{\lambda}^T(\boldsymbol{\theta} - \mathbf{e}_K) \ge 0 \;\Leftrightarrow\; \sum_{k=1}^K \theta_k \lambda_2^k - \lambda_2^K \ge 0 \;\Leftrightarrow\; \sum_{k=1}^K \theta_k \lambda_2^{k-K} - 1 \ge 0 \;\Leftrightarrow\; \sum_{k=1}^K \theta_k \lambda_2^{k-K} \ge \sum_{k=1}^K \theta_k \;\Leftrightarrow\; \sum_{k=1}^K \theta_k \big(\lambda_2^{k-K} - 1\big) \ge 0
\]
where the third equivalence uses $\sum_{k=1}^K \theta_k = 1$ on $\mathcal{S}^K$. The last inequality holds since $\boldsymbol{\theta} \ge \mathbf{0}$ and $\lambda_2^{k-K} \ge 1 \;\; \forall\, k \in [1, K]$, which completes the proof of the proposition.
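The rank-one structure established in (36) is straightforward to verify numerically. The sketch below is our own sanity check, not part of the paper: it assumes NumPy, builds a small two-block graph (so that $\lambda_2$ is well separated from the remaining eigenvalues, as the approximation in (35) requires), forms the matrix $\mathbf{A}$ of (34), and confirms that $\mathbf{e}_K$ attains a lower objective than randomly drawn points of the simplex; all sizes and probabilities are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 20, 15                                    # block size, number of steps

# Two-block ("planted partition") graph: lambda_2 of H is well separated
# from lambda_3, ..., lambda_N, as the approximation in (35) requires.
B1 = rng.random((n, n)) < 0.5
B2 = rng.random((n, n)) < 0.5
C12 = rng.random((n, n)) < 0.05
A = np.block([[B1, C12], [C12.T, B2]]).astype(float)
A = np.triu(A, 1); A = A + A.T                   # symmetric, no self-loops
d = A.sum(axis=1); assert d.min() > 0            # assume no isolated nodes
D_inv = np.diag(1.0 / d)
H = A @ D_inv                                    # column-stochastic transitions
L = np.diag(d) - A                               # unnormalized Laplacian

v_c = np.zeros(2 * n); v_c[:5] = 0.2             # seed distribution of class c
P = np.empty((2 * n, K)); p = v_c
for k in range(K):
    p = H @ p                                    # k-step landing probabilities
    P[:, k] = p

A_mat = P.T @ D_inv @ L @ D_inv @ P              # the matrix A of (34)

e_K = np.zeros(K); e_K[-1] = 1.0
thetas = rng.dirichlet(np.ones(K), size=1000)    # random points on the simplex
print(e_K @ A_mat @ e_K <= min(t @ A_mat @ t for t in thetas))  # prints True
```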
B. Proof of Theorem 1

We need to find the smallest integer $K$ such that
\[
\max_{\boldsymbol{\theta} \in \mathcal{S}^K} \|\mathbf{y} - \check{\mathbf{y}}\|_2 \le \gamma.
\]
We have
\[
\begin{aligned}
\|\mathbf{y} - \check{\mathbf{y}}\|_2 &= \|\mathbf{X}_+\boldsymbol{\theta} - \mathbf{X}_-\boldsymbol{\theta} - \check{\mathbf{X}}_+\boldsymbol{\theta} + \check{\mathbf{X}}_-\boldsymbol{\theta}\|_2 \\
&\le \|\theta_K \mathbf{p}_+^{(K)} - \theta_K \mathbf{p}_-^{(K)}\|_2 + \|\theta_K \mathbf{p}_+^{(K+1)} - \theta_K \mathbf{p}_-^{(K+1)}\|_2 \\
&\le \|\mathbf{H}^K\mathbf{p}_+ - \mathbf{H}^K\mathbf{p}_-\|_2 + \|\mathbf{H}^{K+1}\mathbf{p}_+ - \mathbf{H}^{K+1}\mathbf{p}_-\|_2
\end{aligned} \tag{39}
\]
since $\boldsymbol{\theta} \in \mathcal{S}^K$. Therefore, to determine an upper bound on the $\gamma$-distinguishability threshold, it suffices to find the smallest integer $K$ for which (39) is upper bounded by $\gamma$.

Let $\mathbf{q}_1, \ldots, \mathbf{q}_N$ be the eigenvectors corresponding to the eigenvalues $0 = \mu_1 < \mu_2 \le \cdots \le \mu_N < 2$ of the normalized Laplacian $\tilde{\mathbf{L}}$. The transition probability matrix is then
\[
\mathbf{H} = \mathbf{D}^{1/2}(\mathbf{I} - \tilde{\mathbf{L}})\mathbf{D}^{-1/2}. \tag{40}
\]
For the first term on the RHS of (39), we have
\[
\begin{aligned}
\|\mathbf{H}^K\mathbf{p}_+ - \mathbf{H}^K\mathbf{p}_-\|_2 &\le \|\mathbf{H}^K\mathbf{p}_+ - \boldsymbol{\pi}\|_2 + \|\mathbf{H}^K\mathbf{p}_- - \boldsymbol{\pi}\|_2 \\
&= \Big\|\mathbf{D}^{1/2}(\mathbf{I} - \tilde{\mathbf{L}})^K\mathbf{D}^{-1/2}\mathbf{p}_+ - \tfrac{\mathbf{D}\mathbf{1}}{2|\mathcal{E}|}\Big\|_2 + \Big\|\mathbf{D}^{1/2}(\mathbf{I} - \tilde{\mathbf{L}})^K\mathbf{D}^{-1/2}\mathbf{p}_- - \tfrac{\mathbf{D}\mathbf{1}}{2|\mathcal{E}|}\Big\|_2.
\end{aligned} \tag{41}
\]
Since $\mathbf{q}_1 = \mathbf{D}^{1/2}\mathbf{1}/\sqrt{2|\mathcal{E}|}$ [27], we have for $c \in \{+, -\}$ that
\[
\mathbf{D}^{1/2}\mathbf{q}_1 \langle \mathbf{q}_1, \mathbf{D}^{-1/2}\mathbf{p}_c \rangle = \mathbf{D}^{1/2}\tfrac{\mathbf{D}^{1/2}\mathbf{1}}{\sqrt{2|\mathcal{E}|}} \Big\langle \tfrac{\mathbf{D}^{1/2}\mathbf{1}}{\sqrt{2|\mathcal{E}|}}, \mathbf{D}^{-1/2}\mathbf{p}_c \Big\rangle = \tfrac{\mathbf{D}\mathbf{1}}{\sqrt{2|\mathcal{E}|}} \cdot \tfrac{\langle \mathbf{1}, \mathbf{p}_c \rangle}{\sqrt{2|\mathcal{E}|}} = \tfrac{\mathbf{D}\mathbf{1}}{2|\mathcal{E}|}. \tag{42}
\]
Upon defining $\mathbf{M} := (\mathbf{I} - \tilde{\mathbf{L}})^K - \mathbf{q}_1\mathbf{q}_1^T$, and taking into account (42), inequality (41) can be written as
\[
\|\mathbf{H}^K\mathbf{p}_+ - \mathbf{H}^K\mathbf{p}_-\|_2 \le \|\mathbf{D}^{1/2}\|_2 \|\mathbf{M}\|_2 \big(\|\mathbf{D}^{-1/2}\mathbf{p}_+\|_2 + \|\mathbf{D}^{-1/2}\mathbf{p}_-\|_2\big). \tag{43}
\]
The factors in (43) can be bounded as
\[
\|\mathbf{D}^{-1/2}\mathbf{p}_+\|_2 = \sqrt{\sum_{i\in\mathcal{L}_+} \Big(\tfrac{1}{|\mathcal{L}_+|\sqrt{d_i}}\Big)^2} = \sqrt{\sum_{i\in\mathcal{L}_+} \tfrac{1}{|\mathcal{L}_+|^2 d_i}} \le \tfrac{1}{\sqrt{d_{\min}^+ |\mathcal{L}_+|}}, \tag{44}
\]
\[
\|\mathbf{D}^{-1/2}\mathbf{p}_-\|_2 = \sqrt{\sum_{i\in\mathcal{L}_-} \tfrac{1}{|\mathcal{L}_-|^2 d_i}} \le \tfrac{1}{\sqrt{d_{\min}^- |\mathcal{L}_-|}}, \tag{45}
\]
\[
\|\mathbf{M}\|_2 = \sup_{\|\mathbf{v}\|_2 = 1} |\langle \mathbf{M}\mathbf{v}, \mathbf{v} \rangle| = \max_{i\ne 1} |1 - \mu_i|^K, \tag{46}
\]
\[
\|\mathbf{D}^{1/2}\|_2 = \sqrt{d_{\max}} \tag{47}
\]
where (46) follows from the properties of the normalized Laplacian. Therefore, (43) becomes
\[
\|\mathbf{H}^K\mathbf{p}_+ - \mathbf{H}^K\mathbf{p}_-\|_2 \le \Big(\tfrac{1}{\sqrt{d_{\min}^-|\mathcal{L}_-|}} + \tfrac{1}{\sqrt{d_{\min}^+|\mathcal{L}_+|}}\Big) \cdot \max_{i\ne 1}|1 - \mu_i|^K \cdot \sqrt{d_{\max}}. \tag{48}
\]
Letting $\mu' := \min\{\mu_2, 2 - \mu_N\}$, and using the fact that
\[
(1 - \mu')^K \le e^{-K\mu'} \tag{49}
\]
we obtain
\[
\|\mathbf{H}^K\mathbf{p}_+ - \mathbf{H}^K\mathbf{p}_-\|_2 \le \Big(\sqrt{\tfrac{d_{\max}}{d_{\min}^-|\mathcal{L}_-|}} + \sqrt{\tfrac{d_{\max}}{d_{\min}^+|\mathcal{L}_+|}}\Big) e^{-K\mu'}. \tag{50}
\]
Likewise, we can bound the second term in (39) as
\[
\|\mathbf{H}^{K+1}\mathbf{p}_+ - \mathbf{H}^{K+1}\mathbf{p}_-\|_2 \le \Big(\sqrt{\tfrac{d_{\max}}{d_{\min}^-|\mathcal{L}_-|}} + \sqrt{\tfrac{d_{\max}}{d_{\min}^+|\mathcal{L}_+|}}\Big) e^{-(K+1)\mu'}. \tag{51}
\]
In addition, we note that for all $\mu' > 0$, $K \in \mathbb{Z}_+$ it holds that
\[
e^{-K\mu'} + e^{-(K+1)\mu'} < 2e^{-K\mu'}. \tag{52}
\]
Upon substituting (50) and (51) into (39), and also using (52), we arrive at
\[
\|\mathbf{y} - \check{\mathbf{y}}\|_2 \le 2\Big(\sqrt{\tfrac{d_{\max}}{d_{\min}^-|\mathcal{L}_-|}} + \sqrt{\tfrac{d_{\max}}{d_{\min}^+|\mathcal{L}_+|}}\Big) e^{-K\mu'}. \tag{53}
\]
To determine an upper bound on the $\gamma$-distinguishability threshold, it suffices to find the smallest integer $K$ for which (53) becomes less than $\gamma$; that is,
\[
2\Big(\sqrt{\tfrac{d_{\max}}{d_{\min}^-|\mathcal{L}_-|}} + \sqrt{\tfrac{d_{\max}}{d_{\min}^+|\mathcal{L}_+|}}\Big) e^{-K\mu'} \le \gamma. \tag{54}
\]
Multiplying both sides of (54) by the positive number $e^{K\mu'}/\gamma$, and taking logarithms, yields
\[
\log\Big[\tfrac{2\sqrt{d_{\max}}}{\gamma}\Big(\tfrac{1}{\sqrt{d_{\min}^-|\mathcal{L}_-|}} + \tfrac{1}{\sqrt{d_{\min}^+|\mathcal{L}_+|}}\Big)\Big] \le K\mu'.
\]
Therefore, using
\[
K = \Big\lceil \tfrac{1}{\mu'} \log\Big[\tfrac{2\sqrt{d_{\max}}}{\gamma}\Big(\tfrac{1}{\sqrt{d_{\min}^-|\mathcal{L}_-|}} + \tfrac{1}{\sqrt{d_{\min}^+|\mathcal{L}_+|}}\Big)\Big] \Big\rceil
\]
landing probabilities, the $\ell_2$ distance between any two diffusion-based classifiers will be at most $\gamma$; and the proof is complete.
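The threshold above is easy to evaluate from a handful of graph statistics. Below is a minimal sketch of such a computation; it is our own illustration (the function name, the dense-matrix implementation, and the toy graph are hypothetical choices), and it assumes NumPy and a connected, non-bipartite graph so that $\mu' > 0$.

```python
import numpy as np

def K_threshold(A, idx_plus, idx_minus, gamma):
    """Theorem 1 upper bound on the gamma-distinguishability threshold,
    for a dense adjacency matrix A and the labeled index sets of each class."""
    d = A.sum(axis=1)
    D_mh = np.diag(d ** -0.5)
    L_tilde = np.eye(len(d)) - D_mh @ A @ D_mh     # normalized Laplacian
    mu = np.linalg.eigvalsh(L_tilde)               # 0 = mu_1 <= ... <= mu_N < 2
    mu_prime = min(mu[1], 2.0 - mu[-1])
    c = (1.0 / np.sqrt(d[idx_minus].min() * len(idx_minus))
         + 1.0 / np.sqrt(d[idx_plus].min() * len(idx_plus)))
    return int(np.ceil(np.log(2.0 * np.sqrt(d.max()) * c / gamma) / mu_prime))

# Toy example: random graph with 10 labeled nodes per class
rng = np.random.default_rng(0)
A = (rng.random((50, 50)) < 0.2).astype(float)
A = np.triu(A, 1); A = A + A.T
print(K_threshold(A, np.arange(10), np.arange(10, 20), gamma=0.1))
```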
C. Bound for PageRank

Substituting PageRank's diffusion coefficients in the proof of Theorem 1, inequality (54) becomes
\[
2(1 - \alpha)\alpha^K \Big(\sqrt{\tfrac{d_{\max}}{d_{\min}^-|\mathcal{L}_-|}} + \sqrt{\tfrac{d_{\max}}{d_{\min}^+|\mathcal{L}_+|}}\Big) e^{-K\mu'} \le \gamma.
\]
Multiplying both sides by the positive number $e^{K\mu'}\alpha^{-K}/\gamma$ and taking logarithms yields
\[
\log\Big[\tfrac{2\sqrt{d_{\max}}}{\gamma/(1-\alpha)}\Big(\tfrac{1}{\sqrt{d_{\min}^-|\mathcal{L}_-|}} + \tfrac{1}{\sqrt{d_{\min}^+|\mathcal{L}_+|}}\Big)\Big] \le K(\mu' - \log\alpha)
\]
which results in the $\gamma$-distinguishability threshold bound
\[
K_\gamma^{\rm PR} \le \tfrac{1}{\mu' - \log\alpha} \log\Big[\tfrac{2\sqrt{d_{\max}}}{\gamma/(1-\alpha)}\Big(\tfrac{1}{\sqrt{d_{\min}^-|\mathcal{L}_-|}} + \tfrac{1}{\sqrt{d_{\min}^+|\mathcal{L}_+|}}\Big)\Big].
\]
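For PageRank, only the last line of the previous sketch changes: the effective decay rate becomes $\mu' - \log\alpha$ and $\gamma$ is replaced by $\gamma/(1-\alpha)$, which is why a smaller restart parameter $\alpha$ shortens the required walks. A hypothetical variant (again our own sketch, with an illustrative default for alpha):

```python
import numpy as np

def K_threshold_pr(A, idx_plus, idx_minus, gamma, alpha=0.9):
    """PageRank variant of the Theorem 1 bound, with coefficients (1-a)*a^k."""
    d = A.sum(axis=1)
    D_mh = np.diag(d ** -0.5)
    L_tilde = np.eye(len(d)) - D_mh @ A @ D_mh
    mu = np.linalg.eigvalsh(L_tilde)
    mu_prime = min(mu[1], 2.0 - mu[-1])
    c = (1.0 / np.sqrt(d[idx_minus].min() * len(idx_minus))
         + 1.0 / np.sqrt(d[idx_plus].min() * len(idx_plus)))
    num = np.log(2.0 * np.sqrt(d.max()) * c * (1.0 - alpha) / gamma)
    return int(np.ceil(num / (mu_prime - np.log(alpha))))
```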
REFERENCES

[1] A. Argyriou, M. Herbster, and M. Pontil, "Combining graph Laplacians for semi-supervised learning," in Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, 2006, pp. 67–74.
[2] J. Atwood and D. Towsley, "Diffusion-convolutional neural networks," in Proc. Advances in Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 1993–2001.
[3] K. Avrachenkov, A. Mishenin, P. Gonçalves, and M. Sokol, "Generalized optimization framework for graph-based semi-supervised learning," in Proc. SIAM Int. Conf. on Data Mining, Anaheim, CA, 2012, pp. 966–974.
[4] R. Baeza-Yates, P. Boldi, and C. Castillo, "Generic damping functions for propagating importance in link-based ranking," Internet Math., vol. 3, no. 4, pp. 445–478, 2006.
[5] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," J. Mach. Learn. Res., vol. 7, pp. 2399–2434, Nov. 2006.
[6] Y. Bengio, O. Delalleau, and N. Le Roux, "Label propagation and quadratic criterion," in Semi-Supervised Learning. Cambridge, MA, USA: MIT Press, 2006.
[7] D. Berberidis, A. N. Nikolakopoulos, and G. B. Giannakis, "Random walks with restarts for graph-based classification: Teleportation tuning and sampling design," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Calgary, Canada, Apr. 2018.
[8] D. Berberidis, A. N. Nikolakopoulos, and G. B. Giannakis, "AdaDIF: Adaptive diffusions for efficient semi-supervised learning over graphs," in Proc. IEEE Int. Conf. on Big Data, Seattle, WA, Dec. 2018, pp. 92–99.
[9] S. Brin and L. Page, "Reprint of: The anatomy of a large-scale hypertextual web search engine," Comput. Netw., vol. 56, no. 18, pp. 3825–3833, 2012.
[10] E. Buchnik and E. Cohen, "Bootstrapped graph diffusions: Exposing the power of nonlinearity," arXiv preprint arXiv:1703.02618, 2017.
[11] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning. Cambridge, MA, USA: MIT Press, 2006.
[12] S. Chen, F. Cerda, P. Rizzo, J. Bielak, J. H. Garrett, and J. Kovacevic, "Semi-supervised multiresolution classification using adaptive graph filtering with application to indirect bridge structural health monitoring," IEEE Trans. Signal Process., vol. 62, no. 11, pp. 2879–2893, June 2014.
[13] P. G. Constantine and D. F. Gleich, "Random alpha PageRank," Internet Math., vol. 6, no. 2, pp. 189–236, 2009.
[14] M. Contino, E. Isufi, and G. Leus, "Distributed edge-variant graph filters," in Proc. IEEE Int. Workshop on Computational Advances in Multi-Sensor Adaptive Processing, Curaçao, Dutch Antilles, Dec. 2017, pp. 1–5.
[15] F. Chung, "The heat kernel as the PageRank of a graph," Proc. Natl. Acad. Sci., vol. 104, no. 50, pp. 19735–19740, 2007.
[16] D. F. Gleich, "PageRank beyond the web," SIAM Rev., vol. 57, no. 3, pp. 321–363, 2015.
[17] J. Gorski, F. Pfeuffer, and K. Klamroth, "Biconvex sets and optimization with biconvex functions: A survey and extensions," Math. Methods Oper. Res., vol. 66, no. 3, pp. 373–407, Dec. 2007.
[18] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, San Francisco, CA, 2016, pp. 855–864.
[19] T. Joachims, "Transductive learning via spectral graph partitioning," in Proc. Int. Conf. on Machine Learning, Washington, DC, 2003, pp. 290–297.
[20] V. Kekatos and G. B. Giannakis, "From sparse signals to sparse residuals for robust sensing," IEEE Trans. Signal Process., vol. 59, no. 7, pp. 3355–3368, July 2011.
[21] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[22] K. Kloster and D. F. Gleich, "Heat kernel based community detection," in Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, New York, NY, 2014, pp. 1386–1395.
[23] I. M. Kloumann, J. Ugander, and J. Kleinberg, "Block models and personalized PageRank," Proc. Natl. Acad. Sci., vol. 114, no. 1, pp. 33–38, 2017.
[24] R. I. Kondor and J. Lafferty, "Diffusion kernels on graphs and other discrete input spaces," in Proc. Int. Conf. on Machine Learning, Sydney, Australia, 2002, pp. 315–322.
[25] B. Kveton, M. Valko, A. Rahimi, and L. Huang, "Semi-supervised learning with max-margin graph cuts," in Proc. Int. Conf. on Artificial Intelligence and Statistics, Sardinia, Italy, 2010, pp. 421–428.
[26] A. N. Langville and C. D. Meyer, "Deeper inside PageRank," Internet Math., vol. 1, no. 3, pp. 335–380, 2004.
[27] D. A. Levin and Y. Peres, Markov Chains and Mixing Times. Providence, RI, USA: Amer. Math. Soc., 2017.
[28] F. Lin and W. W. Cohen, "Semi-supervised classification of network data using very few labels," in Proc. Int. Conf. on Advances in Social Network Analysis and Mining, Odense, Denmark, 2010, pp. 192–199.
[29] W. Liu, J. Wang, and S.-F. Chang, "Robust and scalable graph-based semisupervised learning," Proc. IEEE, vol. 100, no. 9, pp. 2624–2638, 2012.
[30] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge, U.K.: Cambridge Univ. Press, 2008.
[31] E. Merkurjev, A. L. Bertozzi, and F. Chung, "A semi-supervised heat kernel PageRank MBO algorithm for data classification," Univ. of California Los Angeles, Los Angeles, CA, USA, Tech. Rep., 2016.
[32] A. N. Nikolakopoulos and J. D. Garofalakis, "NCDawareRank: A novel ranking method that exploits the decomposable structure of the web," in Proc. ACM Int. Conf. on Web Search and Data Mining, Rome, Italy, 2013, pp. 143–152.
[33] A. N. Nikolakopoulos, A. Korba, and J. D. Garofalakis, "Random surfing on multipartite graphs," in Proc. IEEE Int. Conf. on Big Data, Washington, DC, Dec. 2016, pp. 736–745.
[34] B. Perozzi, R. Al-Rfou, and S. Skiena, "DeepWalk: Online learning of social representations," in Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, New York, NY, 2014, pp. 701–710.
[35] A. T. Puig, A. Wiesel, G. Fleury, and A. O. Hero, "Multidimensional shrinkage-thresholding operator and group lasso penalties," IEEE Signal Process. Lett., vol. 18, no. 6, pp. 363–366, 2011.
[36] N. Rosenfeld and A. Globerson, "Semi-supervised learning with competitive infection models," arXiv preprint arXiv:1703.06426, 2017.
[37] A. Sandryhaila and J. M. F. Moura, "Discrete signal processing on graphs," IEEE Trans. Signal Process., vol. 61, no. 7, pp. 1644–1656, Apr. 2013.
[38] S. Segarra, A. Marques, and A. Ribeiro, "Optimal graph-filter design and applications to distributed linear network operators," IEEE Trans. Signal Process., vol. 65, no. 15, pp. 4117–4131, Aug. 2017.
[39] P. P. Talukdar and K. Crammer, "New regularized algorithms for transductive learning," in Proc. Joint Eur. Conf. on Machine Learning and Knowledge Discovery in Databases, 2009, pp. 442–457.
[40] J. Ugander and L. Backstrom, "Balanced label propagation for partitioning massive graphs," in Proc. ACM Int. Conf. on Web Search and Data Mining, Rome, Italy, 2013, pp. 507–516.
[41] X.-M. Wu, Z. Li, A. M. So, J. Wright, and S.-F. Chang, "Learning with partially absorbing random walks," in Proc. Advances in Neural Information Processing Systems, Lake Tahoe, CA, Dec. 2012, pp. 3077–3085.
[42] Z. Yang, W. W. Cohen, and R. Salakhutdinov, "Revisiting semi-supervised learning with graph embeddings," arXiv preprint arXiv:1603.08861, 2016.
[43] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proc. Int. Conf. on Machine Learning, Washington, DC, Aug. 2003.
[44] D. F. Gleich and M. W. Mahoney, "Using local spectral methods to robustify graph-based learning algorithms," in Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Sydney, Australia, Aug. 2015.
[45] K. He, P. Shi, J. E. Hopcroft, and D. Bindel, "Local spectral diffusion for robust community detection," in Proc. SIGKDD Workshop, San Francisco, CA, Aug. 2016.
[46] B. Jiang, K. Kloster, D. F. Gleich, and M. Gribskov, "AptRank: An adaptive PageRank model for protein function prediction on bi-relational graphs," Bioinformatics, vol. 33, no. 12, pp. 1829–1836, Aug. 2017.
[47] K. He, Y. Sun, D. Bindel, J. E. Hopcroft, and Y. Li, "Detecting overlapping communities from local spectral subspaces," in Proc. IEEE Int. Conf. on Data Mining, Atlantic City, NJ, Nov. 2015.