Robust Vertex Classification
Li Chen, Cencheng Shen, Joshua Vogelstein, Carey E. Priebe

L.C., C.S., and C.E.P. are with the Department of Applied Mathematics and Statistics, Johns Hopkins University. J.T.V. is with the Department of Biomedical Engineering, Johns Hopkins University.
Abstract—For random graphs distributed according to stochastic blockmodels, a special case of latent position graphs, adjacency spectral embedding followed by appropriate vertex classification is asymptotically Bayes optimal; but this approach requires knowledge of, and critically depends on, the model dimension. In this paper, we propose a sparse representation vertex classifier which does not require information about the model dimension. This classifier represents a test vertex as a sparse combination of the vertices in the training set and uses the recovered coefficients to classify the test vertex. We prove consistency of our proposed classifier for stochastic blockmodels, and demonstrate that the sparse representation classifier can predict vertex labels with higher accuracy than adjacency spectral embedding approaches via both simulation studies and real data experiments. Our results demonstrate the robustness and effectiveness of our proposed vertex classifier when the model dimension is unknown.
Index Terms—sparse representation, vertex classification, robustness, adjacency spectral embedding, stochastic blockmodel, latent position model, model dimension, classification consistency.
1 Introduction
Modern datasets have been collected with complex structures which contain interacting objects. Depending on the field of interest, such as sociology, biochemistry, or neuroscience, the objects can be people, organizations, genes, or neurons, and the interacting linkages can be communications, organizational positions, protein interactions, or synapses. Many useful models imply that objects sharing a "class" attribute have similar connectivity structures. Graphs are one useful and appropriate tool to describe such datasets: the objects are denoted by vertices and the linkages are denoted by edges. One interesting task on such datasets is vertex classification: determination of the class labels of the vertices. For instance, we may wish to classify whether a neuron is a motor neuron or a sensory neuron, or whether a person in a social network is liberal or conservative.

In many applications, measured edge activity can be inaccurate, either missing or absolutely wrong, which leads to contaminated datasets. When the connectivity among a collection of vertices is invisible, occlusion contamination occurs. When we wrongly observe the connectivity among a collection of vertices, linkage reversion contamination occurs. The spectral embedding method on the adjacency matrix has been shown to be a valuable tool for performing inference on graphs realized from a stochastic blockmodel ([1], [2], [3], [4]).
One major issue is that such a method critically depends on a known model dimension, which is often unknown in practice. Moreover, for highly occluded graphs, classification composed with the spectral embedding method degrades in performance. This motivates us to propose a vertex classifier that does not require knowledge of the model dimension, yet achieves good performance for highly contaminated graphs. In this work, we apply the sparse representation classifier ([5], [6], [7]) to vertex classification on graph data; this classifier performs well in object recognition with contamination and does not require dimension selection. In particular, we provide both a theoretical performance guarantee of this classifier for the stochastic blockmodel, and its numerical advantages via simulations and various real graph datasets. Furthermore, the proposed classifier maintains low misclassification error under both occlusion and linkage reversion contamination.

This paper is organized as follows: in Section 2, we provide background on the classification framework, review the latent position model and the stochastic blockmodel, and present the vertex classification framework. In Section 3, we describe the motivation for investigating robust vertex classification, and propose two contamination models on stochastic blockmodels. In Section 4, we propose a sparse representation classifier for vertex classification and prove consistency of our proposed classifier for the stochastic blockmodel under a certain condition on the model parameters. In Section 5, we demonstrate the effectiveness of our proposed classifier via both simulated and real data experiments. In Section 6, we discuss the practical advantages of applying the sparse representation classifier to graphs. All theoretical proofs are in the supplementary material.

2 Background and Framework
Let [K] = {1, ..., K} for any positive integer K. Let (X, Y) ∼ F_XY, where the feature vector X is an R^d-valued random vector, Y is a [K]-valued class label, and F_XY is the joint distribution of X and Y. Let π_k = P(Y = k) be the class priors and let g : R^d → [K] provide one's guess of Y given X; such a g is a classifier. We intend to classify a test observation X, that is, estimate its true but unknown label Y via g(X). An error occurs when g(X) ≠ Y, and the probability of error is denoted by L(g) = P(g(X) ≠ Y). The optimal classifier is g* = arg min_{g: R^d → [K]} P(g(X) ≠ Y), which is the Bayes classifier achieving the minimum possible error L*.

In the classical setting of supervised learning, we observe training data T_n = {(X_1, Y_1), ..., (X_n, Y_n)} iid ∼ F_XY. The performance of a classifier g_n trained on T_n is measured by the conditional probability of error L_n = L(g_n) = P(g_n(X; T_n) ≠ Y | T_n), for a sequence of classifiers {g_n, n ≥ 1}. The sequence of classifiers is consistent if L_n → L* as n → ∞; and it is universally consistent if L_n → L* with probability 1 for any distribution F_XY [8].

This supervised learning framework is adapted to the setting of random graphs. A graph is a pair G = (V, E) consisting of a set of vertices or nodes V = [n] = {1, 2, ..., n} and a set of edges E ⊂ ([n] choose 2). In this work, we assume that all graphs are simple; that is, the graphs are undirected, unweighted, and non-loopy. The adjacency matrix of G, denoted by A, is n-by-n, symmetric, binary, and hollow, i.e., the diagonal entries of A are all zeros. Each entry A_uv = A_vu = 1 if there is an edge between vertices u and v, and A_uv = 0 otherwise [9]. A random graph is a graph-valued random variable G : Ω → G_n, where Ω denotes the probability space and G_n the collection of all 2^(n choose 2) possible graphs on V = [n]. For instance, one frequently occurring random graph model is the Erdős–Rényi graph ER(n, p), in which each pair of vertices has an edge independently with probability p [10].

Our exploitation task is vertex classification. We observe the adjacency matrix A ∈ {0, 1}^{(n+1)×(n+1)} on n + 1 vertices {v_1, ..., v_n, v} and the class labels Y_i ∈ [K] associated with the first n vertices. Our goal is to estimate the class label Y of the test vertex v via a classifier g : {0, 1}^{(n+1)×(n+1)} → [K] such that the probability of error P(g(A) ≠ Y) is small.

In our setting, we describe the stochastic blockmodel and vertex classification from the perspective of a latent position graph framework. Hoff et al. [11] proposed a latent position graph model. In this model, each vertex v is associated with an unobserved latent random vector X_v drawn independently from a specified distribution F on R^d. The adjacency matrix entries A_uv | (X_u, X_v) ∼ Bernoulli(l(X_u, X_v)) are conditionally independent, where l : R^d × R^d → [0, 1] is the link function. The random dot product graph model proposed in [12] is a special case of the latent position model, where the link function is the inner product of the latent positions, l(X_u, X_v) = ⟨X_u, X_v⟩. For the purpose of theoretical analysis and simulation in this paper, we mainly consider the stochastic blockmodel introduced in [13], which is a random graph model with a set of n vertices randomly drawn from K block memberships.
Conditioned on the K-partition, edges between all pairs of vertices are independent Bernoulli trials with parameters determined by the block memberships.

Below we formally present the definitions of the latent position model and the stochastic blockmodel, which provide the framework for our exploitation task of vertex classification.

Definition 1.
Latent Position Model (LPM)
Let F be a distribution on [0, 1]^d, let X_1, ..., X_n iid ∼ F, and define Z := [X_1, ..., X_n]^T ∈ R^{n×d}. Suppose rank(Z) = d, and denote by P ∈ [0, 1]^{n×n} the communication probability matrix, where each entry P_ij is the probability that there is an edge between vertices i and j conditioned on X_i and X_j. Let A ∈ {0, 1}^{n×n} be the random adjacency matrix. Then (Z, A) ∼ LPM(F) if and only if the following conditional independence relationship holds:

P(A | X_1, ..., X_n) = ∏_{i<j} P_ij^{A_ij} (1 − P_ij)^{1−A_ij}.

Definition 2. Stochastic Blockmodel (SBM) Let K be the number of blocks, and let π be a length-K vector in the unit simplex ∆^{K−1}. The block memberships of the vertices are given by Y(v) iid ∼ Multinomial([K], π). Let B ∈ [0, 1]^{K×K} be a symmetric matrix specifying the block communication probabilities. Then A ∼ SBM([n], B, π) if and only if the following conditional independence relationship holds:

P_ij = P(A_ij = 1 | X_i, X_j) = P(A_ij = 1 | Y_i, Y_j) = B_{Y_i, Y_j}.

Note that the SBM is a special case of the LPM, because the latent positions of an SBM are mixtures of point masses, which are determined by the eigenvectors of B. The unknown latent positions X_i and X_j of vertices i and j determine their memberships Y_i and Y_j. For vertex classification on the SBM, the Bayes error is L* = 0 [1].

Definition 3. Model Dimension For stochastic blockmodels, the model dimension refers to the rank of the communication probability matrix. A d-dimensional SBM satisfies rank(P) = rank(B) = d, for which d ≤ K; if B is full rank, then d = K.

Definition 4. Adjacency Spectral Embedding in Dimension d̂ Let A be defined as in Definition 1. Let A = U_A S_A U_A^T be the full spectral decomposition of A, where S_A = Diag(λ_1, λ_2, ..., λ_n) with λ_1 ≥ λ_2 ≥ ... ≥ λ_n. Let S_{A,d̂} = Diag(λ_1, λ_2, ..., λ_d̂) ∈ R^{d̂×d̂} contain the d̂ largest eigenvalues of A, and let U_{A,d̂} ∈ R^{n×d̂} contain the corresponding eigenvectors as its columns. The estimate of the latent positions of the SBM via adjacency spectral embedding in dimension d̂ is defined as Ẑ_d̂ = U_{A,d̂} S_{A,d̂}^{1/2}, for 1 ≤ d̂ ≤ n. We denote the method of adjacency spectral embedding into dimension d̂ by ASE_d̂.

Many techniques have been developed to infer the latent positions from the realized adjacency matrix. Bickel et al. [14] used subgraph counts and degree distributions to consistently estimate stochastic blockmodels. Sussman et al. [1] proved the consistency of spectral partitioning on the adjacency matrix of stochastic blockmodels. Rohe et al. [15] proved a consistent spectral partitioning procedure on the Laplacian of the stochastic blockmodel. Fishkind et al. [2] showed the consistency of adjacency spectral partitioning when the model parameters are unknown. Athreya et al. [16] proved a central limit theorem for the adjacency spectral embedding of stochastic blockmodels.

In the area of clustering and classification, there exists intensive work on unsupervised learning for graph data [17], [18], [19], [20], [21], [22], [23], as well as on supervised learning, such as [3], [24], [25] for vertex classification, and [26], [27] for vertex nomination.

Our task in this paper is vertex classification. However, we do not and cannot observe the latent positions X_1, ..., X_n, X; otherwise, we would be back in the classical setting of supervised learning. We assume that the class-conditional density satisfies X_i | Y_i = k ∼ f_k with class priors π as before, that is,

P(Y_i = k | X_i = x) = π_k f_k(x) / Σ_{j∈[K]} π_j f_j(x).

We denote the test vertex by v, whose latent position is X, and we shall assume that we do not observe its label Y.
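To make Definitions 2 and 4 concrete, the following minimal sketch samples an adjacency matrix from an SBM and computes the adjacency spectral embedding ASE_d̂. It is written in Python with NumPy purely for illustration; the helper names sample_sbm and ase are ours, not part of the paper.

```python
import numpy as np

def sample_sbm(n, B, pi, rng=None):
    """Sample block labels and a simple (symmetric, hollow) adjacency matrix from SBM([n], B, pi)."""
    rng = np.random.default_rng(rng)
    K = len(pi)
    labels = rng.choice(K, size=n, p=pi)          # block memberships Y_1, ..., Y_n
    P = B[np.ix_(labels, labels)]                 # communication probability matrix P_ij = B_{Y_i, Y_j}
    upper = np.triu(rng.random((n, n)) < P, k=1)  # independent Bernoulli edges above the diagonal
    A = (upper | upper.T).astype(int)             # symmetrize; the diagonal stays zero (hollow)
    return A, labels

def ase(A, d_hat):
    """Adjacency spectral embedding into dimension d_hat: Z_hat = U_{A,d_hat} S_{A,d_hat}^{1/2}."""
    eigvals, eigvecs = np.linalg.eigh(A)          # full spectral decomposition of the symmetric A
    order = np.argsort(eigvals)[::-1][:d_hat]     # keep the d_hat largest eigenvalues, as in Definition 4
    S = eigvals[order]
    U = eigvecs[:, order]
    return U * np.sqrt(np.maximum(S, 0))          # scale eigenvectors by the square roots of the eigenvalues

# Illustrative usage (these parameter values are ours, not those of Equation 10):
# B = np.array([[0.5, 0.3], [0.3, 0.4]]); pi = np.array([0.6, 0.4])
# A, y = sample_sbm(200, B, pi)
# Z_hat = ase(A, d_hat=2)
```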
3 Motivation

Our motivation for proposing a robust vertex classifier comes from asking the question: how well can vertex classifiers perform when model assumptions do not hold? If the model dimension d is known or can be estimated correctly, ASE_d consistently estimates the latent positions for SBM [1]. Figure 1 presents an example of ASE_d, where vertices from two classes are well separated in the embedded space. A subsequent k-nearest-neighbor (kNN) classifier on ASE_d is universally consistent for SBM [3]. That means that regardless of what distribution the latent positions are drawn from, kNN ◦ ASE_d achieves the Bayes error L* asymptotically as k → ∞, n → ∞ and k/n → 0. In particular, for stochastic blockmodels, 1NN ◦ ASE_d is asymptotically Bayes optimal [1].

Fig. 1: An example of adjacency spectral embedding (ASE_{d=2}) with n = 500. The parameters B and π are given in Equation 10. The latent positions of this SBM form a mixture of two point masses.

Athreya et al. [16] proved a central limit theorem stating that for a K-block, d-dimensional SBM, Ẑ_d obtained via ASE_d is distributed asymptotically as a K-mixture of d-variate normals with covariance matrices of order 1/n. This asymptotic result holds for any constant K, any finite d, all but finitely many n, and does not require an equal number of vertices per partition. This result implies that quadratic discriminant analysis (QDA) and linear discriminant analysis (LDA) on the represented data Ẑ_d of stochastic blockmodels are asymptotic Bayes plug-in classifiers, while LDA requires fewer parameters to fit. Hence in our analysis, we employ two consistent classifiers, 1NN ◦ ASE_d and LDA ◦ ASE_d, for vertex classification on stochastic blockmodels.

Importantly, having information on the model dimension d is critical to adjacency spectral approaches. When d is given, ASE_d is consistent, and 1NN ◦ ASE_d and LDA ◦ ASE_d are asymptotically Bayes optimal. When d is not known, Sussman et al. [1] estimate d via a consistent estimator. However, for the consistent estimator to be accurate, the required number of vertices n depends highly on the graph density, and increases rapidly as the expected graph density decreases. Fishkind et al. [2] show that if we pick a positive integer d̂ ≥ d, then ASE_d̂ is still consistent as n → ∞. However, for a finite number of vertices, 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂ degrade significantly in performance compared to 1NN ◦ ASE_d and LDA ◦ ASE_d. Moreover, their performance on real data can be very sensitive to the choice of embedding dimension. Our focus is on removing the need to know the model dimension d while still maintaining a low error rate for vertex classification, so that the classification procedure is robust and suitable for practical inference when the model assumptions do not hold.

To assess the robustness of the vertex classifiers for stochastic blockmodels, we propose two scenarios of contamination that change the model dimension of the SBM. Suppose the uncontaminated graph model G_un is a stochastic blockmodel, G_un ∼ SBM([n], B_un, π_un). Denote the communication probability matrix of G_un by P_un. We can write P_un = Z_un Z_un^T, where Z_un contains the latent positions of the uncontaminated model [1], and suppose rank(B_un) = d. Denote by δ_i(M) the i-th largest singular value of a matrix M.

Let p_o ∈ [0, 1] denote the occlusion rate.
We randomly select (100 p_o)% of the n vertices and set the probability of connectivity among the selected vertices to be 0. In this scenario, the probability of connectivity between the contaminated vertices and the uncontaminated vertices remains the same as in G_un. This occlusion procedure can be formulated as a stochastic blockmodel G_occ with the following parameters:

B_occ = ( B_un  B_un ; B_un  0_{K×K} ) ∈ R^{2K×2K},   (3)
π_occ = [(1 − p_o) π_un^T, p_o π_un^T]^T ∈ R^{2K}.   (4)

Denote the communication probability matrix of G_occ by P_occ. It always holds that δ_1(P_occ) ≤ δ_1(P_un) ≤ n, and it almost always holds that rank(B_occ) = rank(P_occ) = 2d. That is, the true model dimension of the occluded graph is 2d instead of d. The proofs of the above claims are provided in the supplementary material.

Both B_occ and P_occ have d positive and d negative eigenvalues, where the d negative eigenvalues are due to occlusion contamination. The number of blocks in the contaminated model G_occ rises to 2K, where K blocks correspond to (1 − p_o) π_un and the other K blocks correspond to p_o π_un. Although the number of blocks in the model changes to 2K due to contamination, the number of classes in the vertex classification problem remains K. As p_o → 1, the number of contaminated vertices approaches n, indicating that the majority of the edges are sampled from the contamination source 0_{K×K}; as a result, the adjacency matrix A becomes sparser and sparser.

Note that our occlusion scenario randomly selects the vertices; conditioned on selecting the contaminated vertices, the edges between these vertices are missing deterministically. Therefore the edges are not missing completely at random in this occlusion contamination procedure.

Let p_l ∈ [0, 1] denote the linkage reversion rate. We randomly select (100 p_l)% of the n vertices and reverse the connectivity among all the selected vertices. The probability of connectivity between the contaminated vertices and the uncontaminated vertices remains the same as in G_un. The linkage reversion contamination can be formulated as a stochastic blockmodel G_rev with the following parameters:

B_rev = ( B_un  B_un ; B_un  J_{K×K} − B_un ) ∈ R^{2K×2K},   (5)
π_rev = [(1 − p_l) π_un^T, p_l π_un^T]^T ∈ R^{2K}.   (6)

The matrix J_{K×K} ∈ R^{K×K} is the matrix of all ones. Denote the communication probability matrix of G_rev by P_rev. If rank(B_un) = d, then it almost always holds that d + 1 ≤ rank(B_rev) = rank(P_rev) ≤ 2d + 1, since the block matrix J_{K×K} − B_un has rank at most d + 1. The number of blocks in the contaminated model also increases to 2K, similar to the occlusion model. As p_l → 1, we recover the complement of SBM([n], B_un, π_un) – that is, SBM([n], J_{K×K} − B_un, π_un).
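Both contaminated models are themselves stochastic blockmodels with the enlarged parameters above, so they can be simulated by constructing (B_occ, π_occ) or (B_rev, π_rev) and reusing an ordinary SBM sampler. The sketch below (Python/NumPy; the helper names are ours, not the paper's) builds the parameters following Equations 3–6.

```python
import numpy as np

def occlusion_params(B_un, pi_un, p_o):
    """Equations 3-4: parameters of the occluded blockmodel G_occ."""
    K = B_un.shape[0]
    B_occ = np.block([[B_un, B_un],
                      [B_un, np.zeros((K, K))]])        # edges among selected vertices are removed
    pi_occ = np.concatenate([(1 - p_o) * pi_un, p_o * pi_un])
    return B_occ, pi_occ

def reversion_params(B_un, pi_un, p_l):
    """Equations 5-6: parameters of the linkage-reversed blockmodel G_rev."""
    K = B_un.shape[0]
    J = np.ones((K, K))
    B_rev = np.block([[B_un, B_un],
                      [B_un, J - B_un]])                # edges among selected vertices are flipped
    pi_rev = np.concatenate([(1 - p_l) * pi_un, p_l * pi_un])
    return B_rev, pi_rev

# A contaminated graph can then be drawn with the SBM sampler sketched in Section 2, e.g.
#   A_occ, y_occ = sample_sbm(n, *occlusion_params(B_un, pi_un, p_o=0.3))
# where the class label of a vertex for the classification task is its block index modulo K.
```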
When the stochastic blockmodels are contaminated by the above two procedures, the model parameters and the model dimension change. Suppose both the original model dimension d and the contamination information are known; then we can use the contaminated model dimension d_occ = 2d or d_rev ∈ [d + 1, 2d + 1] for embedding, so that ASE_{d_occ} and ASE_{d_rev} followed by 1NN and LDA are asymptotically Bayes optimal. However, if we only know the contamination but not the model dimension, then adjacency spectral embedding will require the estimation of an embedding dimension; and if we know d but not the contamination, we usually consider d as the default embedding dimension. In either case, the embedding dimension used may not be the best choice for adjacency spectral embedding and subsequent classification.

Figure 2 and Figure 3 provide two examples of the scree plots obtained from the contaminated adjacency matrices A_occ and A_rev, for which the original model dimension is d = 2. Using d̂ = 2 is clearly not the best choice for the contaminated data; and if we decide to estimate d, this remains a very challenging task, despite various procedures and criteria for dimension selection [28]. Here we use a principled automatic dimension selection procedure based on the profile likelihood [29] to estimate the embedding dimension from the scree plot. However, in the setting of Figure 2 and Figure 3, Monte Carlo investigation yields d̂ = 2 every time as the elbow (500 times out of 500 Monte Carlo replicates), using the full spectrum or a partial spectrum of the largest 22 eigenvalues in magnitude, respectively. The second elbow selected by [29] concentrates around 80 and 11 using the full spectrum and the partial spectrum, respectively. Even though larger choices of d̂ are better for classification purposes in these two contaminated graphs, they are not selected by the dimension selection method of [29].

Notwithstanding the results in [3] and [2], we cannot be guaranteed to successfully choose the embedding dimension in practice. Consequently, the performance of the ASE method and subsequent classification will suffer. Figure 4 and Figure 5 demonstrate that, as the contamination proportions p_o and p_l increase, the latent positions change as reflected in the estimated latent positions Ẑ_{d̂=2}, for which the profile likelihood method always yields d̂ = 2 for the contaminated data. In particular, as the occlusion rate p_o increases, more vertices from different classes are embedded close together. Furthermore, vertex classification on the contaminated Ẑ_d̂ using 1NN or LDA will degrade in performance, as illustrated later in the simulations and in Figure 7. Indeed, the model dimension critically determines the success of vertex classification based on the ASE procedures, whereas in practice the model dimension is usually unknown. This motivates us to seek a robust vertex classifier which does not heavily depend on model selection and still attains good performance.

Fig. 2: Scree plot of the occlusion contaminated adjacency matrix A_occ with n = 200. The parameters B_un and π_un are given in Eq. 10. The negative eigenvalues of A_occ are due to occlusion contamination. The profile likelihood method [29] always suggests d̂ = 2 for this scree plot.

4 The Sparse Representation Classifier for Vertex Classification

In this section, we propose to use the sparse representation classifier (SRC) for robust vertex classification. Instead of employing adjacency spectral embedding and applying subsequent classifiers on Ẑ_d̂, we recover a sparse representation of the test vertex with respect to the vertices in the training set, and use the recovered sparse representation coefficients to classify the test vertex.

For the purpose of algorithm presentation, in this section we slightly abuse notation and denote by A the adjacency matrix on the training vertices {v_1, ..., v_n} with known labels Y_i ∈ [K], and by φ the adjacency column with respect to the testing
vertex v with an unknown label Y; note that this is almost equivalent to letting A be the adjacency matrix for {v_1, ..., v_n, v} as in previous sections, then splitting the first n columns for training and the last column for testing, except that the last row is not used.

Fig. 3: Scree plot of the linkage reversion contaminated adjacency matrix A_rev with n = 200. The parameters B_un and π_un are given in Eq. 10. The negative eigenvalues are due to linkage reversion. The profile likelihood method [29] always suggests d̂ = 2 for this scree plot.

Now suppose there are n_k training vertices in each class k, so that n = Σ_{k∈[K]} n_k. Let a_{k,1}, ..., a_{k,n_k} denote the columns of A corresponding to the n_k training vertices in class k. Define a matrix D_k = [d_{k,1}, ..., d_{k,n_k}] ∈ R^{n×n_k}, where each column d_{k,j} = a_{k,j} / ||a_{k,j}||_2 for 1 ≤ j ≤ n_k; then we concatenate D_1, ..., D_K such that D := [D_1, ..., D_K] ∈ R^{n×n}. Namely, the matrix D re-arranges the columns of A by class and normalizes each column to unit ℓ2 norm. Also normalize φ to unit norm. Then SRC is applied to D and φ directly, by first solving the ℓ1-minimization problem

arg min ||β||_1 subject to φ = Dβ + ε,   (7)

followed by subsequent classification based on the sparse representation β. This procedure does not require spectral embedding of the adjacency matrix, and was originally used by [5] for robust face recognition. In subsection 4.1 we present the algorithmic and implementation details and argue why SRC is applicable to graphs; then a consistency result of SRC for the stochastic blockmodel is proved, followed by relevant discussions, in subsection 4.2.

Fig. 4: The occlusion contamination effect on the estimated latent positions Ẑ_{d̂=2} with n = 200. The parameters B_un and π_un are given in Eq. 10. The four panels display the latent position estimates for different occlusion rates p_o. As p_o increases, vertices from different blocks become close in the embedded space. For p_o close to 1, ASE_{d̂=2} will eventually yield only one point cloud at 0.

The algorithm is summarized in Algorithm 1. The only computationally costly step in Algorithm 1 is the ℓ1 minimization. Many algorithms, such as ℓ1 homotopy [30], the augmented Lagrangian multiplier [31], and orthogonal matching pursuit [32], have been developed to solve ℓ1 minimization. In this paper, we use orthogonal matching pursuit (OMP) to solve Equation 8, which is a fast approximation of exact ℓ1 minimization; details of various ℓ1 minimization methods and OMP are available in [30], [33], [31], and [7].

Usually there is a model selection parameter for stopping the ℓ1 minimization, namely the noise threshold ε in Equation 8, or equivalently a designated sparsity level s so that ||β||_0 ≤ s. As ε is difficult to determine for real data, in this paper we choose to set s rather than ε: this allows us to better compare the vertex classification performance across different sparsity levels, and we will argue that SRC is robust
against s in the next subsection and also in the numerical experiments. Note that the constraint in Equation 8 can be replaced by φ = Dβ in a noiseless setting, but usually some parameter like ε or s is required to achieve a parsimonious model when dealing with high-dimensional or noisy data.

Fig. 5: The linkage reversion contamination effect on the estimated latent positions Ẑ_{d̂=2} with n = 200. The parameters B_un and π_un are given in Eq. 10. The four panels display the latent position estimates for different linkage reversion rates p_l. As p_l increases, vertices from different blocks become close in the embedded space. For p_l = 1, ASE_{d̂=2} will yield two point clouds corresponding to SBM([n], J_{2×2} − B_un, π_un).

Although the SRC algorithm can always be used for supervised learning, it does not always perform well for arbitrary data sets, and it is necessary to understand why SRC is applicable to graphs. In [5], it is argued that the face images of different classes lie on different subspaces, so that ℓ1 minimization is able to select training data of the correct class (i.e., the true but unknown class of the testing observation). Based on this subspace assumption, [34] derives a theoretical condition for ℓ1 minimization to do perfect variable selection in sparse representation, i.e., all selected training data are from the correct class. This validates that sparse representation with ℓ1 minimization is a valuable tool under the subspace assumption. However, the subspace assumption requires an intrinsic low-dimensional structure for each class, which may not be satisfied for high-dimensional real data such as the adjacency matrix.

Algorithm 1 Robust vertex classification.
Goal: Classify the vertex v whose unknown label is Y.
Input: Adjacency matrix A ∈ {0, 1}^{n×n} from the training vertices {v_1, ..., v_n}, where each column a_i contains the adjacencies of the i-th vertex to all other training vertices, and all training vertices are associated with observed labels Y_i ∈ [K]. Let φ ∈ {0, 1}^n be the testing adjacency column containing the connectivity of the test vertex to all training data.
1. Arrange and scale all vertices: Re-arrange the columns of A in class order, and normalize each column to unit ℓ2 norm. Denote the resulting matrix by D. Also scale the testing adjacency column φ to have unit norm.
2. Find a sparse representation of φ by ℓ1 minimization: β̂ = arg min ||β||_1 subject to φ = Dβ + ε.
3. Compute the distance of φ to each class k: r_k(φ) = ||φ − D β̂_k||_2, where β̂_k = [0, ..., 0, β̂_{k,1}, ..., β̂_{k,n_k}, 0, ..., 0] ∈ R^n contains the recovered coefficients corresponding to the k-th class.
4. Classify the test vertex: Ŷ = arg min_k r_k(φ).

Furthermore, the motivation behind the popularity of ℓ1 minimization is its equivalence to ℓ0 minimization under certain conditions, such as the incoherence condition or the restricted isometry property; see [35], [36], [37], [38], [39], [40]. But those conditions are often violated in the SRC framework, because the sample training data are usually correlated; and SRC does not necessarily need a unique or most sparse β in order to classify correctly. As long as the sparse representation β assigns dominating coefficients to data of the correct class, SRC can classify correctly. Shen et al. [7] prove an SRC performance guarantee under a principal angle condition, which is similar to the condition in [34], but does not rely on the subspace assumption and does not require a unique and most sparse solution. The condition is easy to check for a given model and intuitive to understand: as long as the within-class principal angle is smaller than the between-class principal angle, ℓ1 minimization and OMP are able to assign dominating regression coefficients to training data of the correct class, so that SRC can perform well. Following this direction, in the next subsection we derive a condition on the stochastic blockmodel so that the principal angle condition is satisfied, consequently achieving SRC consistency for SBM.
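For concreteness, Algorithm 1 with the OMP variant used in our experiments can be sketched as follows. The sketch uses Python with scikit-learn's OrthogonalMatchingPursuit as the ℓ1/ℓ0 solver; it is an illustration under our own choices (the helper name src_predict is ours), not the exact implementation of the paper.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def src_predict(A_train, labels, phi, s=5):
    """Classify one test vertex from its adjacency column phi, given the training adjacency
    matrix A_train (n x n) and the training labels, via sparse representation (Algorithm 1)."""
    classes = np.unique(labels)
    order = np.argsort(labels, kind="stable")              # step 1: arrange columns by class
    D = A_train[:, order].astype(float)
    y = labels[order]
    D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)      # normalize each column to unit l2 norm
    phi = phi.astype(float)
    phi = phi / max(np.linalg.norm(phi), 1e-12)            # normalize the testing adjacency column

    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=s, fit_intercept=False)
    beta = omp.fit(D, phi).coef_                           # step 2: sparse representation of phi

    residuals = []
    for k in classes:                                      # step 3: residual of phi to each class
        beta_k = np.where(y == k, beta, 0.0)
        residuals.append(np.linalg.norm(phi - D @ beta_k))
    return classes[int(np.argmin(residuals))]              # step 4: class with the smallest residual
```

Combined with an SBM sampler, one can estimate the error by holding out one adjacency column at a time, mirroring the leave-one-out evaluation used in Section 5.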
4.2 SRC Consistency for SBM

Here we prove a consistency theorem for the sparse representation classifier for vertex classification on the stochastic blockmodel, which provides a theoretical performance guarantee for our proposed robust vertex classification. All proofs are in the supplementary material.

For this subsection only, we first define, for each q = 1, ..., K,

Q_q ∼ Σ_{k=1}^{K} 1{Y = k} B_{kq},   (8)

where Y is the class label, 1{Y = k} is the indicator function (equal to 1 with probability π_k), and B_{kq} is the corresponding entry of the block probability matrix B generating the SBM. Note that {Q_q} and all their moments depend only on the prior probability π and the block probability B. Next we define the un-centered correlation as

ρ_qr = E(Q_q Q_r) / sqrt( E(Q_q^2) E(Q_r^2) ),

for each 1 ≤ q ≠ r ≤ K. Clearly 0 ≤ ρ_qr = ρ_rq ≤ 1.

Our first lemma gives a necessary and sufficient condition on the SBM parameters for adjacency columns of the same class to be asymptotically most correlated.

Lemma 1. Under the stochastic blockmodel, for an adjacency column of class q, its asymptotically most correlated column is of the same class q, if and only if the prior probability π and the block probability matrix B satisfy the following inequality:

ρ_qr · sqrt( E(Q_r^2) / E(Q_q^2) ) < sqrt( E(Q_r) / E(Q_q) )   (9)

for all r ≠ q.

When Lemma 1 holds for all q, it in fact guarantees that SRC at s = 1 (or equivalently, 1-nearest-neighbor based on the principal angle) is a consistent classifier for the stochastic blockmodel. To prove SRC consistency at any s, we need a second lemma.

Lemma 2. Denote by A^(s) an n × s random matrix consisting of s adjacency columns, and by C a scalar vector of length s. Suppose Equation 9 holds for the stochastic blockmodel. Then for any adjacency column α of class q, its within-class correlation (i.e., the correlation between α and another adjacency column of class q) is asymptotically larger than the correlation between α and C · A^(s), for any A^(s) whose columns are not from class q and any vector C with non-negative entries. The above holds for any s ≥ 1.

The above two lemmas essentially establish the principal angle condition in [7]. They guarantee that β assigns dominating coefficients to training data of the correct class, which leads to SRC consistency for SBM.

Theorem 1. Suppose Equation 9 holds for the corresponding stochastic blockmodel for all q ∈ [1, ..., K], and the sparse representation β is constrained to be non-negative. Then SRC is a consistent classifier for vertex classification of SBM, with L_n → 0 as n → ∞. This holds for SRC implemented by either exact ℓ1 minimization or orthogonal matching pursuit at any s ≥ 1.
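Because the moments of {Q_q} are simple weighted sums over the blocks, the condition in Equation 9 can be checked directly from an estimated or hypothesized (B, π). The sketch below (Python/NumPy) illustrates such a check; the function name check_condition is ours, and the inequality it tests is our reading of the reconstructed Equation 9 above.

```python
import numpy as np

def check_condition(B, pi):
    """Check Equation 9 for every ordered pair (q, r) with q != r, given SBM parameters (B, pi).

    Q_q takes the value B[k, q] with probability pi[k], so its moments are weighted sums over k.
    """
    B, pi = np.asarray(B, float), np.asarray(pi, float)
    K = len(pi)
    EQ = pi @ B                    # E(Q_q)   for each q
    EQ2 = pi @ (B ** 2)            # E(Q_q^2) for each q
    ok = True
    for q in range(K):
        for r in range(K):
            if q == r:
                continue
            EQqQr = pi @ (B[:, q] * B[:, r])            # E(Q_q Q_r)
            rho = EQqQr / np.sqrt(EQ2[q] * EQ2[r])      # un-centered correlation rho_qr
            ok &= rho * np.sqrt(EQ2[r] / EQ2[q]) < np.sqrt(EQ[r] / EQ[q])
    return ok
```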
Let us make some remarks regarding the theorem and its implications. Firstly, if the block columns are very close in their ℓ1 and ℓ2 norms in the measure space with respect to π (i.e., E(Q_r)/E(Q_q) ≈ E(Q_r^2)/E(Q_q^2)), then the theorem is very likely to hold for all ρ_qr < 1 and SRC is expected to perform well; if not, the block columns cannot be too highly correlated in order for the inequality to hold and for SRC to work; and if block r is a scalar multiple of block q, the condition always fails and SRC cannot separate those two classes. In any case, if the adjacency matrix can be modeled by an SBM, then it is very easy to estimate the model parameters and check Equation 9.

Secondly, even though Equation 9 is only sufficient and not necessary for SRC consistency at s > 1, it is often the case that SRC is no longer consistent when Equation 9 is violated: when Equation 9 is violated for some r, the adjacency column of class q is asymptotically most correlated with a column from class r, which usually causes SRC to misbehave.

Thirdly, the theorem requires the sparse representation to be non-negative, which can be easily achieved in ℓ1 minimization; and [41], [42], [43] show that eliminating the negative entries of β has very nice theoretical properties in non-negative OMP and non-negative least squares. Even though we do not explicitly use non-negative ℓ1 minimization or bound the coefficients, in our numerical experiments the negative entries of β are almost never large, and L_n clearly converges to 0 for the SBM simulation in the numerical section.

Fourthly, since the consistency result holds for SRC at any s ≥ 1, we expect SRC to be robust in the choice of s, compared to the model selection of d̂ for ASE procedures. This is demonstrated empirically in Section 5. In particular, the two contamination scenarios essentially double the number of blocks compared to the uncontaminated SBM; this causes the classification error of ASE to be no longer consistent unless the embedding dimension d is adjusted accordingly, but SRC may remain consistent as long as the contaminated blocks still satisfy Equation 9.

Lastly, we should note that even though the consistency results hold at any s ≥ 1, in most experiments a moderate s helps the finite-sample performance compared to s = 1 or a large s. One explanation is that the classifier itself is designed to favor a more parsimonious model, as argued in [5]. Another explanation, based on the consistency proof of [7], is that the sub-matrix of D corresponding to the nonzero entries of β should be full rank; this is always true when using ℓ1 minimization and OMP, but a large s may make the sub-matrix close to rank deficient (i.e., having singular values close to zero). Indeed, in the numerical section we will see that as long as the sparsity level s is not too large relative to the sample size n, SRC can perform well; in addition, choosing a smaller sparsity level has less computational cost.

5 Numerical Experiments

If the true model dimension is unknown, ASE_d̂ may not be consistent. In particular, when contamination results in a changed model dimension, or the model dimension cannot be correctly estimated, the performance of subsequent classifiers may suffer.
We consider a classifier robust if it can maintain a relatively low misclassification rate under data contamination. Our sparse representation classifier (SRC) for vertex classification does not rely on knowledge of the model dimension, is robust to the choice of sparsity level s, and achieves consistency with respect to all sparsity levels.

Throughout this section, we use orthogonal matching pursuit (OMP) to solve the ℓ1 minimization. In the following experiments, SRC_s denotes the performance of SRC with varying sparsity level s, and SRC means s = 5 by default. We use leave-one-out cross validation to estimate the classification error. The standard errors are small compared to the differences in performance. Our simulation experiments and real data analysis demonstrate that SRC for vertex classification performs well under varying sparsity levels, possesses higher robustness to contamination than 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂, and is an excellent tool for real data inference.

We compare the robustness of SRC with two vertex classifiers, 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂, both of which achieve the asymptotic Bayes error when d̂ = d with no contamination, and when d̂ = d_occ or d_rev in the contamination model.

We simulate the probability matrix for an uncontaminated stochastic blockmodel G_un with K = 2 blocks (Y ∈ {1, 2}) and parameters

B_un = ( ·  0.32 ; 0.32  · ),   π_un = [ · , · ]^T.   (10)

The SBM parameters in Equation 10 in fact satisfy the theoretical condition in Equation 9, so we expect SRC to perform well in this case.

We first assess the performance of all classifiers in the uncontaminated model, assuming the true model dimension d = 2 is known. As seen in the left plot of Figure 6, LDA ◦ ASE_{d=2} performs the best for all sample sizes n considered. In this ideal setting, SRC does not outperform 1NN ◦ ASE_{d=2} or LDA ◦ ASE_{d=2}, but all classifiers converge to 0 error as n increases, as expected from our theoretical derivation.

Then we fix the number of vertices n = 110 and vary the sparsity level s and embedding dimension d̂. The right plot of Figure 6 exhibits the three classifiers' performance: SRC_s performs well throughout s, as does LDA ◦ ASE_d̂ except at d̂ = 1, while 1NN ◦ ASE_d̂ degrades significantly with increasing d̂ or at d̂ = 1.

Now we assess the robustness of SRC, 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂ under contamination, using the same parameter setting as Equation 10. If the model dimension d = 2 is known and the exact contamination is known, then d̂ = 4 is best for subsequent classification of the contaminated data; otherwise d̂ will be set to 2, either due to not knowing the contamination or due to estimating d̂ by the profile likelihood procedure in [29], as seen in Figure 2 and Figure 3.

Figure 7 presents the misclassification error of SRC, 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂ under occlusion contamination, linkage reversion contamination, and a mixed combination of both, for s = 5 and d̂ = 2, respectively. The x-axis shows the contamination rate, and the y-axis the classification error. In the case of occlusion, all classifiers degrade as the contamination rate increases, due to the decreasing density of the graph. In the case of linkage reversion, all classifiers degrade first due to a weaker block signal, and then improve when the contamination rate increases above 0.5, because the reversed block signal becomes stronger.
As to the mixed contamination, it is done as follows: first, we randomly select p% of the vertices and occlude their connectivity; secondly, we randomly select p% of the vertices (some may have already been occluded) and reverse their connectivity. In this scenario, the degradation in classification performance comes from both occlusion and linkage reversion contamination.

Fig. 6: Classification performance under no contamination. We simulate SBMs with B_un, π_un given in Equation 10 and show the average misclassification error over the Monte Carlo replicates. (Left): When the true model dimension d = 2 is known, SRC does not outperform 1NN ◦ ASE_{d=2} or LDA ◦ ASE_{d=2} for n ≥ 20. (Right): The same vertex classification task using various s and d̂ at n = 110.

For both occlusion and linkage reversion, LDA ◦ ASE_{d=4} is the best classifier, followed by 1NN ◦ ASE_{d=4}. SRC is slightly inferior, but is significantly better than LDA ◦ ASE_{d̂=2} and 1NN ◦ ASE_{d̂=2}. For the mixed contamination, SRC and 1NN ◦ ASE_{d=4} are the best classifiers, performing much better than the others. This indicates that SRC is robust against the contamination, while subsequent classification after spectral embedding may suffer from model dimension misspecification and data contamination.

Note that SRC also has a model selection parameter, namely the sparsity level s. Thus in Figure 8 we plot the SRC error with respect to the sparsity level s, as well as LDA ◦ ASE_d and 1NN ◦ ASE_d with respect to the embedding dimension d. Furthermore, because we have fixed the number of nearest neighbors to 1 so far, the first plot in Figure 8 shows that varying the number of nearest neighbors does not help kNN ◦ ASE_{d̂=2} over the range of k considered. All plots in Figure 8 show that SRC is stable with respect to the sparsity level s, while the ASE methods are less robust with respect to the dimension choice.

Fig. 7: Classification performance under three types of contamination. We simulate SBMs with B_un, π_un given in Eq. 10, set n = 200, contaminate the data accordingly, and present the average misclassification error for the five classifiers over the Monte Carlo replicates. SRC at s = 5 exhibits robust performance compared to 1NN ◦ ASE_{d̂=2} and LDA ◦ ASE_{d̂=2} throughout all types of contamination with varying contamination rates.

Fig. 8: Under the same setting as Figure 7, the first plot varies the choice of neighborhood size k in kNN ◦ ASE and compares it with SRC_s for varying s. The other three plots compare the classification error of SRC_s, 1NN ◦ ASE_d̂, and LDA ◦ ASE_d̂ over a common range of s = d̂. SRC exhibits stable performance with respect to the sparsity level s.

We apply SRC to several real datasets. We binarize and symmetrize the adjacency matrix and set the diagonals to zero. We followed [44] and [45], which suggest imputing the diagonal of the adjacency matrix to improve performance. We vary the embedding dimension d̂ for 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂, and the sparsity level s for SRC_s.

We apply SRC to the electric neural connectome of Caenorhabditis elegans (C. elegans) [46], [47], [48]. The neurons of the hermaphrodite C. elegans somatic nervous system ([49]) are classified into 3 classes: motor neurons, interneurons, and sensory neurons. The adjacency matrix is shown in the top of Figure 9. The objective is to predict the classes of the neurons.

The bottom of Figure 9 demonstrates the performance of the three classifiers. Both LDA ◦ ASE_d̂ and 1NN ◦ ASE_d̂ first improve in performance as d̂ increases, since more signal is included in the embedded space; and as d̂ continues to increase, both classifiers gradually degrade in performance, since more noise is included. The exhibited phenomenon is due to the bias-variance trade-off. In comparison, SRC_s has stable performance with respect to the sparsity level s, and outperforms LDA ◦ ASE_d̂ and 1NN ◦ ASE_d̂. This demonstrates that SRC is a practical tool in random graph inference.

The AdjNoun graph, collected in [50], is a network containing frequently used adjectives and nouns from
the novel "David Copperfield" by Charles Dickens. The vertices are the 60 most frequently used adjectives and 60 most frequently used nouns in the book. The edges are present if any pair of words occurs in an adjacent position in the book. The adjacency matrix of the adjective-noun network suggests that connectivity between nouns and adjectives is more frequent than connectivity among nouns or among adjectives, as seen in the top of Figure 10.

We apply SRC_s, 1NN ◦ ASE_d̂, and LDA ◦ ASE_d̂ on this dataset, and vary the embedding dimension d̂ and the sparsity level s. Performance of the three classifiers is seen in the bottom of Figure 10. SRC_s again exhibits stable performance with respect to the sparsity level s compared to 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂. Note that as the number of vertices is only 120, we limit the sparsity level in this experiment.

Fig. 9: (Top): The adjacency matrix of the C. elegans neural connectome, sorted according to the classes of the neurons; a three-block structure is exhibited. (Bottom): Vertex classification performance on the C. elegans network. As we vary the sparsity level s and embedding dimension d̂, SRC_s demonstrates superior and stable performance compared to 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂.

The political blog sphere was collected in February 2005 [51]. The vertices are blogs during the time of the 2004 presidential election, and edges exist if the blogs are linked. The blogs are either liberal or conservative, which sum up to n = 1490 vertices. The top of Figure 11 shows the adjacency matrix of the blog network, which reflects a strong two-block signal. The performance of the three classifiers is shown in the bottom of Figure 11, with varying sparsity level s and dimension choice d̂ up to 100. SRC_s has very stable and superior performance with respect to various sparsity levels, and always outperforms 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂. It is worthwhile to point out that this dataset can be modeled by an SBM as shown in [52]; and the sparsity limit is relatively small compared to the number of vertices n here, which is the reason why SRC_s is very stable up to s = 100.

The political book graph contains 105 books about US politics sold by Amazon.com [50]. The edges exist if any pair of books was purchased by the same customer. There are 3 class labels on the books: liberal (46.67%), neutral (40.95%) and conservative (12.28%). The adjacency matrix of this dataset and the performance of the three classifiers are seen in Figure 12. The bottom of Figure 12 shows that SRC_s is very stable with respect to the sparsity level, and usually better than 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂, but the optimal error is achieved by 1NN ◦ ASE_{d̂=10}.
Fig. 10: (Top): Adjacency matrix of the adjective-noun network, where each class is more likely to communicate with the other class than with itself. (Bottom): Vertex classification performance on the adjective-noun network. SRC_s demonstrates robust performance compared to 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂.

Fig. 11: (Top): Adjacency matrix of the political blog sphere, exhibiting strong connectivity within class. (Bottom): Vertex classification performance on the political blog network. SRC_s demonstrates superior and very stable performance with respect to various sparsity levels s compared to 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂.

Fig. 12: (Top): Adjacency matrix of the political book graph. (Bottom): Classification performance on the political book graph.

6 Discussion

Adjacency spectral embedding is a feature extraction approach for latent position graphs. When feature extraction is composed with common classifiers such as nearest-neighbor or discriminant analysis, the choice of feature space or embedding dimension is crucial. Given the model dimension d for a stochastic blockmodel, ASE_d is consistent and the subsequent vertex classification via 1NN ◦ ASE_d or LDA ◦ ASE_d is asymptotically Bayes optimal. The success of the ASE procedures clearly depends on the knowledge of d, as illustrated in the experiments.

However, in practical settings, the model dimension d is usually unknown, and there may exist data contamination. In this paper, we present a robust vertex classifier via sparse representation for graph data. The sparse representation classifier does not need information about the model dimension, can achieve consistency under a mild condition on the SBM parameters, and is robust against the choice of sparsity level. As seen in the simulation studies using SBM, SRC may not outperform 1NN ◦ ASE_d and LDA ◦ ASE_d when d is known, but does outperform 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂ where d̂ is chosen using the scree plot of the adjacency matrix. In the real data experiments, most of the time SRC outperforms 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂ for varying d̂, and is very stable with respect to the sparsity level s. The numerical studies strongly indicate that SRC is a valuable tool for random graph inference.

For the SRC implementation, we only considered orthogonal matching pursuit (OMP) to solve the ℓ1 minimization problem. Different implementations of ℓ1 minimization are explored in [7], and using a different algorithm may yield slightly different classification performance for SRC.

Another interesting question is the effect of normalization, namely the transformation of A into D in Algorithm 1. The normalization effect is usually difficult to quantify; but empirically, we see an improvement in SRC performance under ℓ2 normalization, as illustrated in Figure 13. Note that the SBM parameters satisfy the condition in Equation 9, so we expect SRC to perform well in the normalized case; furthermore, in the figure the SRC error is very close to 0 as n increases, despite the fact that the non-negativity constraint is not used in the algorithm (it is used in the consistency proof).

Fig. 13: Examination of SRC performance with and without ℓ2 normalization of the columns of D. We compare SRC performance when the columns of D are ℓ2 normalized and when they are not. The parameters B and π are given in Eq. 10 for a range of n, and we run Monte Carlo replicates for each n.
We see an improvement in SRC performance when ℓ2 normalization is applied. The Wilcoxon signed rank test rejects, with a small p-value, the null hypothesis that the error difference SRC_{error, ℓ2} − SRC_{error, no ℓ2} comes from a distribution with zero median.

Appendix

An event occurs "almost always" if, with probability 1, the event occurs for all but finitely many n.

Proposition 1. It always holds that σ_1(P_occ) ≤ σ_1(P_un) ≤ n.

Proof. Suppose the set of the contaminated vertices is I := {i_1, i_2, ..., i_l}. Let P'_s ∈ R^{|I|×|I|} denote the principal submatrix of P_un obtained by deleting the V \ I columns and the corresponding V \ I rows; P'_s is symmetric. Note that P_un = P_occ + P_s, where P_s is symmetric, P_s equals P'_s on the {i_1, ..., i_l} columns and rows, and P_s = 0 everywhere else; P_s is positive semidefinite, being a zero-padded principal submatrix of P_un = Z_un Z_un^T. By Weyl's Theorem [53],

σ_1(P_occ) + min_i λ_i(P_s) ≤ σ_1(P_occ + P_s) = σ_1(P_un).

Thus σ_1(P_occ) ≤ σ_1(P_un). Since P_un ∈ [0, 1]^{n×n}, P_un P_un^T = P_un P_un is a non-negative and symmetric matrix with entries bounded by n, so each row sum is bounded by n^2. Thus σ_1^2(P_un) = σ_1(P_un P_un^T) ≤ n^2, giving σ_1(P_un) ≤ n.

Proposition 2. It always holds that σ_{2d+1}(P_occ) = 0. It almost always holds that σ_{2d}(P_occ) ≥ min(p_o, 1 − p_o) α γ n and rank(P_occ) = 2d.

Proof. Gaussian elimination of B_occ gives

B_occ ∼ ( B_un  B_un ; 0_{K×K}  −B_un ).   (11)

Since rank(B_un) = d, rank(B_occ) = 2d. Then there exist µ = ( ν  0_{K×d} ; ν  −ν ) ∈ R^{2K×2d} and µ̃ = ( ν  0_{K×d} ; ν  ν ) ∈ R^{2K×2d} such that B_occ = µ µ̃^T. Let X_occ ∈ R^{n×2d} and X̃_occ ∈ R^{n×2d} with row u given by X_occ,u = µ_{Y_u} and X̃_occ,u = µ̃_{Y_u}. By the parametrization of the SBM as an RDPG model, P_occ = X_occ X̃_occ^T. Since X_occ and X̃_occ are at most rank 2d, σ_{2d+1}(P_occ) = 0.

Since the following holds,

µ µ^T = µ̃ µ̃^T = ( νν^T  νν^T ; νν^T  2νν^T )   (12)
       = ( νν^T  νν^T ; νν^T  νν^T ) + ( 0_{K×K}  0_{K×K} ; 0_{K×K}  νν^T ),

by Weyl's theorem [53],

min_i λ_i(µµ^T) = min_i λ_i(µ̃µ̃^T) ≥ min_i λ_i( νν^T  νν^T ; νν^T  νν^T ) + min_i λ_i( 0  0 ; 0  νν^T ) ≥ γ + 0 = γ.   (13)–(14)

Moreover, we have

min_{i∈[2K]}(π_occ,i) = min( p_o π_un,i, (1 − p_o) π_un,i ) ≥ min(p_o, 1 − p_o) γ.   (15)

The eigenvalues of P_occ P_occ^T are the same as the nonzero eigenvalues of X̃_occ^T X̃_occ X_occ^T X_occ. It almost always holds that n_i ≥ min(p_o, 1 − p_o) γ n for all i ∈ [2K], so that

X_occ^T X_occ = Σ_{i=1}^{2K} n_i µ_i µ_i^T = min(p_o, 1 − p_o) γ n µ^T µ + Σ_{i=1}^{2K} ( n_i − min(p_o, 1 − p_o) γ n ) µ_i µ_i^T.   (16)

The first term has its smallest eigenvalue bounded below by α min(p_o, 1 − p_o) γ n. This means λ_{2d}(X_occ^T X_occ) ≥ α min(p_o, 1 − p_o) γ n. By the exact same argument, λ_{2d}(X̃_occ^T X̃_occ) ≥ α min(p_o, 1 − p_o) γ n. X̃_occ^T X̃_occ X_occ^T X_occ is the product of two positive semi-definite matrices. Then

λ_{2d}( X̃_occ^T X̃_occ X_occ^T X_occ ) ≥ λ_{2d}( X̃_occ^T X̃_occ ) λ_{2d}( X_occ^T X_occ ) ≥ ( α min(p_o, 1 − p_o) γ n )^2.

This gives

σ_{2d}(P_occ) ≥ α min(p_o, 1 − p_o) γ n = min(p_o, 1 − p_o) α γ n.   (17)

Since σ_{2d}(P_occ) > 0 almost always and σ_{2d+1}(P_occ) = 0 always, rank(P_occ) = 2d.

Proposition 3. B_occ has d positive eigenvalues and d negative eigenvalues.

Proof. Let the eigendecomposition of B_un be given by ΞΨΞ^T, where Ξ ∈ R^{K×K} is orthogonal and Ψ = Diag(ψ_1, ..., ψ_K) ∈ R^{K×K} is diagonal.
We have the following congruence relation:

( B_un  B_un ; B_un  0_{K×K} ) = ( I_{K×K}  0_{K×K} ; I_{K×K}  I_{K×K} ) ( B_un  0_{K×K} ; 0_{K×K}  −B_un ) ( I_{K×K}  0_{K×K} ; I_{K×K}  I_{K×K} )^T
                               = ( Ξ  0_{K×K} ; Ξ  Ξ ) ( Ψ  0_{K×K} ; 0_{K×K}  −Ψ ) ( Ξ  0_{K×K} ; Ξ  Ξ )^T.   (18)

Hence B_occ and ( Ψ  0_{K×K} ; 0_{K×K}  −Ψ ) are congruent. By Sylvester's law of inertia [53], they have the same number of positive, negative and zero eigenvalues. Ψ has d positive diagonal entries since rank(B_un) = d. Similarly, −Ψ has d negative diagonal entries. Hence B_occ has d positive eigenvalues and d negative eigenvalues.

Proposition 4. Assuming |λ_1(P_occ)| ≥ |λ_2(P_occ)| ≥ ... ≥ |λ_n(P_occ)|, then |{i : λ_i(P_occ) > 0}| = |{i : λ_i(P_occ) < 0}| = d. That is, the number of positive eigenvalues of P_occ is the same as the number of negative eigenvalues of P_occ, and it equals d.

Proof. Let Z ∈ {0, 1}^{n×2K} denote the matrix where each row i is of the form (0, ..., 0, 1, 0, ..., 0), where the 1 indicates the block membership of vertex i in the occluded stochastic blockmodel. Then P_occ = Z B_occ Z^T. Note that P_occ has the same number of nonzero eigenvalues as Z^T Z B_occ. Let D_Z := Z^T Z ∈ N^{2K×2K} and note that D_Z is a diagonal matrix with nonnegative diagonal entries, where each diagonal entry denotes the number of vertices belonging to block k ∈ [2K]. With high probability, D_Z is positive definite, as the number of vertices in each block is positive. Then the number of nonzero eigenvalues of P_occ is the same as the number of nonzero eigenvalues of Z^T Z B_occ = D_Z B_occ = √D_Z √D_Z B_occ, which has the same nonzero eigenvalues as √D_Z B_occ √D_Z. By Sylvester's law of inertia [53], the number of positive eigenvalues of √D_Z B_occ √D_Z is the same as the number of positive eigenvalues of B_occ, and the number of negative eigenvalues of √D_Z B_occ √D_Z is the same as the number of negative eigenvalues of B_occ, thus proving our claim.

Proof. We first prove that an adjacency column from class q is asymptotically most correlated with another column of the same class, if and only if Equation 9 is satisfied. Suppose the first two vertices 1, 2 are from class 1, and vertex 3 is of class 2. Without loss of generality, let us prove that A_1 is asymptotically most correlated with A_2 if and only if Equation 9 is satisfied for q = 1. We expand the correlation between A_1 and A_2 as follows:

ρ(A_1, A_2) = Σ_{i=1}^n (A_{i1} A_{i2}) / sqrt( Σ_{i=1}^n A_{i1}^2 Σ_{i=1}^n A_{i2}^2 )
            = Σ_{i=1}^n (A_{i1} A_{i2}/n) / sqrt( Σ_{i=1}^n (A_{i1}^2/n) Σ_{i=1}^n (A_{i2}^2/n) )
            a.s.→ Σ_{k=1}^K (π_k B_{k1} B_{k1}) / sqrt( Σ_{k=1}^K (π_k B_{k1}) Σ_{k=1}^K (π_k B_{k1}) )
            = E(Q_1^2) / sqrt( E(Q_1) E(Q_1) ),

where the first line is by the definition of correlation, the second line follows by noting that the entries of A are 0 and 1, the third line follows by passing to the limit, and the fourth line simplifies the expression by our definition of {Q_q}. Note that we assumed known class memberships for the first three vertices, but they do not affect the asymptotic correlation and thus are not considered in the limit expression.

By a similar expansion, we have ρ(A_1, A_3) a.s.→ E(Q_1 Q_2) / sqrt( E(Q_1) E(Q_2) ). A_1 is asymptotically most correlated with A_2 if and only if

E(Q_1^2)/E(Q_1) > E(Q_1 Q_2) / sqrt( E(Q_1) E(Q_2) )
⇔ E(Q_1^2)/E(Q_1) > ρ_12 sqrt( E(Q_1^2) E(Q_2^2) ) / sqrt( E(Q_1) E(Q_2) )
⇔ sqrt( E(Q_2)/E(Q_1) ) > ρ_12 sqrt( E(Q_2^2)/E(Q_1^2) )
⇔ ρ_12 · sqrt( E(Q_2^2)/E(Q_1^2) ) < sqrt( E(Q_2)/E(Q_1) ).
Proof of Lemma 2. Without loss of generality, let $\alpha$ and $A_{s+1}$ be two adjacency columns from class 1, $A^{(s)} = [A_1 \mid \cdots \mid A_s]$, and $C = [c_1, \ldots, c_s]$. It suffices to prove that as $n \to \infty$, we always have
$$\rho(\alpha, A_{s+1}) > \rho(\alpha, C \cdot A^{(s)}) = \rho\Big(\alpha, \sum_{j=1}^s c_jA_j\Big) \quad (19)$$
for any non-negative vector $C$ and all possible $A^{(s)}$ whose columns are not of class 1. Note that Lemma 1 shows that Equation 9 is sufficient and necessary for Equation 19 to hold at $s = 1$ for any $C$; the same condition is still sufficient for Equation 19 to hold at any $s \geq 2$ under the additional assumption that $C$ is a non-negative vector; and when $s = 0$, Equation 19 trivially holds.

In the proof of Lemma 1, we already showed that $\rho(\alpha, A_{s+1}) \xrightarrow{a.s.} E(Q_1Q_1)/\sqrt{E(Q_1)E(Q_1)}$. Next we expand $\rho(\alpha, \sum_{j=1}^s c_jA_j)$ as follows:
$$\rho(\alpha, C \cdot A^{(s)}) = \rho\Big(\alpha, \sum_{j=1}^s c_jA_j\Big) = \frac{\sum_{i=1}^n \alpha_i\big(\sum_{j=1}^s c_jA_{ij}\big)}{\sqrt{\sum_{i=1}^n \alpha_i^2 \sum_{i=1}^n \big(\sum_{j=1}^s c_jA_{ij}\big)^2}} = \frac{\sum_{j=1}^s c_j\big(\sum_{i=1}^n \alpha_iA_{ij}/n\big)}{\sqrt{\big(\sum_{i=1}^n \alpha_i^2/n\big)\big(\sum_{i=1}^n \big(\sum_{j=1}^s c_jA_{ij}\big)^2/n\big)}}$$
$$\leq \frac{\sum_{j=1}^s c_j\big(\sum_{i=1}^n \alpha_iA_{ij}/n\big)}{\sqrt{\big(\sum_{i=1}^n \alpha_i^2/n\big)\big(\sum_{i=1}^n \sum_{j=1}^s c_j^2A_{ij}^2/n\big)}} \xrightarrow{a.s.} \frac{\sum_{j=1}^s c_j\sum_{k=1}^K \pi_kB_{k1}B_{ky_j}}{\sqrt{\big(\sum_{k=1}^K \pi_kB_{k1}\big)\sum_{j=1}^s c_j^2\sum_{k=1}^K \pi_kB_{ky_j}}} = \frac{\sum_{j=1}^s c_jE(Q_1Q_{y_j})}{\sqrt{\sum_{j=1}^s c_j^2E(Q_1)E(Q_{y_j})}},$$
where $y_j$ denotes the class membership of $A_j$. All other steps being routine, the inequality in the above expansion is due to $\big(\sum_{j=1}^s c_jA_{ij}\big)^2 \geq \sum_{j=1}^s c_j^2A_{ij}^2$, which holds because $c_j$ and $A_{ij}$ are always non-negative.

Therefore, in order to show that Equation 19 holds asymptotically, it suffices to prove that
$$\frac{E(Q_1Q_1)}{\sqrt{E(Q_1)E(Q_1)}} > \frac{\sum_{j=1}^s c_jE(Q_1Q_{y_j})}{\sqrt{\sum_{j=1}^s c_j^2E(Q_1)E(Q_{y_j})}}
\;\Leftrightarrow\; \frac{E(Q_1^2)}{E(Q_1)} > \frac{\sum_{j=1}^s \rho_{1y_j}c_j\sqrt{E(Q_1^2)E(Q_{y_j}^2)}}{\sqrt{\sum_{j=1}^s c_j^2E(Q_1)E(Q_{y_j})}}
\;\Leftrightarrow\; \sqrt{\sum_{j=1}^s c_j^2\,\frac{E(Q_{y_j})}{E(Q_1)}} > \sum_{j=1}^s \rho_{1y_j}c_j\sqrt{\frac{E(Q_{y_j}^2)}{E(Q_1^2)}}.$$
The last inequality holds when
$$\rho_{1y_j}^2 \cdot \frac{E(Q_{y_j}^2)}{E(Q_{y_j})} < \frac{E(Q_1^2)}{E(Q_1)},$$
which is exactly Equation 9 when $y_j \neq 1$. Therefore, Equation 9 and a non-negative $C$ are sufficient for Lemma 2 to hold.
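Before turning to the consistency argument, the following sketch (not from the paper) illustrates the non-negative sparse-representation decision rule on SBM adjacency columns. It uses SciPy's non-negative least squares as a stand-in for the $\ell_1$ or orthogonal matching pursuit solvers discussed in the text; the SBM parameters, the leave-one-out dictionary, and the class-residual rule are assumptions made only for illustration.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)

# Hypothetical two-block SBM (illustration only).
B = np.array([[0.5, 0.2],
              [0.2, 0.4]])
pi = np.array([0.5, 0.5])
n = 300

tau = rng.choice(2, size=n, p=pi)
P = B[np.ix_(tau, tau)]
A = np.triu((rng.random((n, n)) < P).astype(float), k=1)
A = A + A.T

# Leave-one-out dictionary: all other adjacency columns, normalized to unit norm.
test = 0
train = np.arange(1, n)
D = A[:, train]
D = D / np.linalg.norm(D, axis=0)
y = A[:, test] / np.linalg.norm(A[:, test])

# Non-negative least squares as a stand-in for the non-negative sparse solvers.
x, _ = nnls(D, y)

# Class-wise residual rule: keep only the coefficients of each class in turn
# and assign the class whose reconstruction is closest to the test column.
residuals = [np.linalg.norm(y - D @ np.where(tau[train] == k, x, 0.0))
             for k in range(2)]
print("true class:", tau[test], "predicted class:", int(np.argmin(residuals)))
```

On typical draws the recovered coefficients concentrate on training columns from the test vertex's own block, so the smallest class residual identifies the correct label.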
Proof. In order to prove that SRC is a consistent classifier with $L_n \to 0$, it suffices to prove that the adjacency matrix generated by the SBM satisfies a principal angle condition in [7]. This consistency holds for either $\ell_1$ minimization or orthogonal matching pursuit at any $s \geq 1$, assuming the sparse coefficient $x$ is non-negative.

For the adjacency matrix under the SBM, suppose $\alpha$ is a fixed adjacency column of class $q$, $A_{s+1}$ is a random adjacency column of class $q$, and $A^{(s)}$ is a random matrix whose columns are not of class $q$. Then to show $L_n \to 0$ at any $s$, the principal angle condition requires that $\theta(\alpha, A_{s+1}) < \theta(\alpha, A^{(s)})$ for all possible $\alpha$ under the SBM.

The principal angle condition is used in two places in SRC: first, it guarantees that the sub-matrix selected by SRC contains at least one observation of the correct class; second, it guarantees that the sparse coefficients with respect to the correct class dominate the sparse representation. Then, if the data are always non-negative (which always holds for a graph adjacency matrix) and the sparse representation is non-negative (which can be relaxed to bounded below in [7]), such dominance is sufficient for correct classification by SRC.

Lemma 1 proves the principal angle condition at $s = 1$, which is sufficient for the first point above. Lemma 2 proves the principal angle condition at any $s$, which is equivalent to the second point above under the non-negativity constraint. Therefore we establish SRC consistency for the SBM.

Note that there are two small differences: First, the original condition is not in limit form, while Lemma 1 and Lemma 2 are proved asymptotically for the SBM; this change has no effect on classification consistency. Second, in [7] we separate the principal angle condition from the non-negativity constraint, while in Lemma 2 we effectively fold the non-negativity constraint into the proof of the principal angle condition; this does not affect the result either, because when the sparse coefficients are constrained to be non-negative, the principal angles between the two subspaces are constrained accordingly.

ACKNOWLEDGEMENTS

This work is partially supported by a National Security Science and Engineering Faculty Fellowship (NSSEFF), the Johns Hopkins University Human Language Technology Center of Excellence (JHU HLT COE), and the XDATA program of the Defense Advanced Research Projects Agency (DARPA) administered through Air Force Research Laboratory contract FA8750-12-2-0303. We would also like to thank Minh Tang for his thoughtful discussions.

REFERENCES

[1] D. L. Sussman, M. Tang, D. E. Fishkind, and C. E. Priebe, “A consistent adjacency spectral embedding for stochastic blockmodel graphs,” Journal of the American Statistical Association, vol. 107, no. 499, pp. 1119–1128, 2012.
[2] D. E. Fishkind, D. L. Sussman, M. Tang, J. T. Vogelstein, and C. E. Priebe, “Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown,” SIAM Journal on Matrix Analysis and Applications, vol. 34, no. 1, pp. 23–39, 2013.
[3] D. Sussman, M. Tang, and C. Priebe, “Universally consistent latent position estimation and vertex classification for random dot product graphs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, accepted, 2012.
[4] M. Tang, D. L. Sussman, and C. E. Priebe, “Universally consistent vertex classification for latent positions graphs,” Annals of Statistics, accepted, 2012.
[5] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.
[6] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan, “Sparse representation for computer vision and pattern recognition,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1031–1044, 2010.
[7] C. Shen, L. Chen, and C. E. Priebe, “Sparse representation classification beyond ℓ1 minimization and the subspace assumption,” submitted, http://arxiv.org/abs/1502.01368.
[8] L. Devroye, L. Györfi, and G. Lugosi, A probabilistic theory of pattern recognition. New York: Springer, 1996, vol. 31.
[9] D. B. West, Introduction to graph theory. Prentice Hall, Englewood Cliffs, 2001, vol. 2.
[10] B. Bollobás, Random graphs. Cambridge University Press, 2001, vol. 73.
[11] P. D. Hoff, A. E. Raftery, and M. S. Handcock, “Latent space approaches to social network analysis,” Journal of the American Statistical Association, vol. 97, no. 460, pp. 1090–1098, 2002.
[12] S. J. Young and E. R. Scheinerman, “Random dot product graph models for social networks,” in Algorithms and Models for the Web-Graph. Springer, 2007, pp. 138–149.
[13] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,” Social Networks, vol. 5, no. 2, pp. 109–137, 1983.
[14] P. J. Bickel, A. Chen, and E. Levina, “The method of moments and degree distributions for network models,” The Annals of Statistics, vol. 39, no. 5, pp. 2280–2301, 2011.
[15] K. Rohe, S. Chatterjee, and B. Yu, “Spectral clustering and the high-dimensional stochastic blockmodel,” The Annals of Statistics, vol. 39, no. 4, pp. 1878–1915, 2011.
[16] A. Athreya, V. Lyzinski, D. J. Marchette, C. E. Priebe, D. L. Sussman, and M. Tang, “A limit theorem for scaled eigenvectors of random dot product graphs,” arXiv preprint arXiv:1305.7388, 2013.
[17] M. S. Handcock, A. E. Raftery, and J. M. Tantrum, “Model-based clustering for social networks,” Journal of the Royal Statistical Society: Series A (Statistics in Society), vol. 170, no. 2, pp. 301–354, 2007.
[18] J. Lei and A. Rinaldo, “Consistency of spectral clustering in stochastic block models,” The Annals of Statistics, vol. 43, no. 1, pp. 215–237, 2014.
[19] K. Chaudhuri, F. C. Graham, and A. Tsiatas, “Spectral clustering of graphs with general degrees in the extended planted partition model,” in COLT, 2012, pp. 35–1.
[20] Y. Chen, S. Sanghavi, and H. Xu, “Clustering sparse graphs,” in Advances in Neural Information Processing Systems, 2012, pp. 2204–2212.
[21] S. Balakrishnan, M. Xu, A. Krishnamurthy, and A. Singh, “Noise thresholds for spectral clustering,” in Advances in Neural Information Processing Systems, 2011, pp. 954–962.
[22] V. Lyzinski, D. L. Sussman, M. Tang, A. Athreya, C. E. Priebe et al., “Perfect clustering for stochastic blockmodel graphs via adjacency spectral embedding,” Electronic Journal of Statistics, vol. 8, no. 2, pp. 2905–2922, 2014.
[23] L. Chen and M. Patton, “Stochastic blockmodelling for online advertising,” arXiv preprint arXiv:1410.6714, 2014.
[24] M. Tang, D. L. Sussman, C. E. Priebe et al., “Universally consistent vertex classification for latent positions graphs,” The Annals of Statistics, vol. 41, no. 3, pp. 1406–1430, 2013.
[25] C. E. Priebe, D. L. Sussman, M. Tang, and J. T. Vogelstein, “Statistical inference on errorfully observed graphs,” Journal of Computational and Graphical Statistics, just accepted, 2014.
[26] D. E. Fishkind, V. Lyzinski, H. Pao, L. Chen, and C. E. Priebe, “Vertex nomination schemes for membership prediction,” arXiv preprint arXiv:1312.2638, 2013.
[27] L. Chen, “Pattern Recognition on Random Graphs,” Ph.D. dissertation, Johns Hopkins University, Baltimore, MD, May 2015.
[28] J. E. Jackson, A user’s guide to principal components. John Wiley & Sons, 2005, vol. 587.
[29] M. Zhu and A. Ghodsi, “Automatic dimensionality selection from the scree plot via the use of profile likelihood,” Computational Statistics & Data Analysis, vol. 51, no. 2, pp. 918–930, 2006.
[30] D. Donoho and Y. Tsaig, “Fast solution of l1-norm minimization problems when the solution may be sparse,” IEEE Transactions on Information Theory, vol. 54, no. 11, pp. 4789–4812, 2008.
[31] Z. Yang, C. Zhang, J. Deng, and W. Lu, “Orthonormal expansion l1-minimization algorithms for compressed sensing,” arXiv preprint arXiv:1108.5037, 2011.
[32] J. A. Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231–2242, 2004.
[33] J. A. Tropp and A. C. Gilbert, “Signal recovery from random measurements via orthogonal matching pursuit,” IEEE Transactions on Information Theory, vol. 53, no. 12, pp. 4655–4666, 2007.
[34] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2765–2781, 2013.
[35] D. L. Donoho, “For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution,” Communications on Pure and Applied Mathematics, vol. 59, no. 6, pp. 797–829, 2006.
[36] D. L. Donoho and M. Elad, “Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization,” Proceedings of the National Academy of Sciences, vol. 100, no. 5, pp. 2197–2202, 2003.
[37] R. Gribonval and M. Nielsen, “Sparse representations in unions of bases,” IEEE Transactions on Information Theory, vol. 49, no. 12, pp. 3320–3325, 2003.
[38] E. J. Candes and T. Tao, “Decoding by linear programming,” IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
[39] M. Elad and A. M. Bruckstein, “A generalized uncertainty principle and sparse representation in pairs of bases,” IEEE Transactions on Information Theory, vol. 48, no. 9, pp. 2558–2567, 2002.
[40] R. Rubinstein, A. M. Bruckstein, and M. Elad, “Dictionaries for sparse representation modeling,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1045–1057, 2010.
[41] A. Bruckstein, M. Elad, and M. Zibulevsky, “On the uniqueness of nonnegative sparse solutions to underdetermined systems of equations,” IEEE Transactions on Information Theory, vol. 54, no. 11, pp. 4813–4820, 2008.
[42] N. Meinshausen, “Sign-constrained least squares estimation for high-dimensional regression,” Electronic Journal of Statistics, vol. 7, pp. 1607–1631, 2013.
[43] M. Slawski and M. Hein, “Non-negative least squares for high-dimensional linear models: consistency and sparse recovery without regularization,” Electronic Journal of Statistics, vol. 7, pp. 3004–3056, 2013.
[44] D. Marchette, C. Priebe, and G. Coppersmith, “Vertex nomination via attributed random dot product graphs,” in Proceedings of the 57th ISI World Statistics Congress, vol. 1121, 2011, p. 1126.
[45] E. R. Scheinerman and K. Tucker, “Modeling graphs using dot product representations,” Computational Statistics, vol. 25, no. 1, pp. 1–16, 2010.
[46] D. H. Hall and R. Russell, “The posterior nervous system of the nematode Caenorhabditis elegans: serial reconstruction of identified neurons and complete pattern of synaptic interactions,” The Journal of Neuroscience, vol. 11, no. 1, pp. 1–22, 1991.
[47] R. Goldschmidt, “Das Nervensystem von Ascaris lumbricoides und megalocephala, I,” Z. wiss. Zool., vol. 90, pp. 73–136, 1908.
[48] L. Chen, J. T. Vogelstein, V. Lyzinski, and C. E. Priebe, “A joint graph inference case study: the C. elegans chemical and electrical connectomes,” submitted to Significance, under review, 2015.
[49] L. R. Varshney, B. L. Chen, E. Paniagua, D. H. Hall, and D. B. Chklovskii, “Structural properties of the Caenorhabditis elegans neuronal network,” PLoS Computational Biology, vol. 7, no. 2, p. e1001066, 2011.
[50] M. E. Newman, “Finding community structure in networks using the eigenvectors of matrices,” Physical Review E, vol. 74, no. 3, p. 036104, 2006.
[51] L. A. Adamic and N. Glance, “The political blogosphere and the 2004 US election: divided they blog,” in Proceedings of the 3rd International Workshop on Link Discovery. ACM, 2005, pp. 36–43.
[52] S. Olhede and P. Wolfe, “Network histograms and universality of blockmodel approximation,” Proceedings of the National Academy of Sciences of the USA, vol. 111, pp. 14722–14727, 2014.
[53] R. A. Horn and C. R. Johnson, Matrix analysis. Cambridge University Press, 2012.