Robust Vertex Classification
Li Chen, Cencheng Shen, Joshua Vogelstein, Carey E. Priebe

L.C., C.S., and C.E.P. are with the Department of Applied Mathematics and Statistics, Johns Hopkins University. J.T.V. is with the Department of Biomedical Engineering, Johns Hopkins University.
Abstract—For random graphs distributed according to stochastic blockmodels, a special case of latent position graphs, adjacency spectral embedding followed by appropriate vertex classification is asymptotically Bayes optimal; but this approach requires knowledge of, and critically depends on, the model dimension. In this paper, we propose a sparse representation vertex classifier which does not require information about the model dimension. This classifier represents a test vertex as a sparse combination of the vertices in the training set and uses the recovered coefficients to classify the test vertex. We prove consistency of our proposed classifier for stochastic blockmodels, and demonstrate that the sparse representation classifier can predict vertex labels with higher accuracy than adjacency spectral embedding approaches via both simulation studies and real data experiments. Our results demonstrate the robustness and effectiveness of our proposed vertex classifier when the model dimension is unknown.
Index Terms—sparse representation, vertex classification, robustness, adjacency spectral embedding, stochastic blockmodel, latent position model, model dimension, classification consistency.
1 Introduction
Modern datasets have been collected with complex structures which contain interacting objects. Depending on the field of interest, such as sociology, biochemistry, or neuroscience, the objects can be people, organizations, genes, or neurons, and the interacting linkages can be communications, organizational positions, protein interactions, or synapses. Many useful models imply that objects sharing a "class" attribute have similar connectivity structures. Graphs are one useful and appropriate tool to describe such datasets: the objects are denoted by vertices and the linkages are denoted by edges. One interesting task on such datasets is vertex classification: determination of the class labels of the vertices. For instance, we may wish to classify whether a neuron is a motor neuron or a sensory neuron, or whether a person in a social network is liberal or conservative.

In many applications, measured edge activity can be inaccurate, either missing or absolutely wrong, which leads to contaminated datasets. When the connectivity among a collection of vertices is invisible, occlusion contamination occurs. When we wrongly observe the connectivity among a collection of vertices, linkage reversion contamination occurs. The spectral embedding method on the adjacency matrix has been shown to be a valuable tool for performing inference on graphs realized from a stochastic blockmodel ([1], [2], [3], [4]).
One major issue is that such a method critically depends on a known model dimension, which is often unknown in practice. Moreover, for highly occluded graphs, classification composed with the spectral embedding method degrades in performance. This motivates us to propose a vertex classifier that does not require knowledge of the model dimension, yet achieves good performance for highly contaminated graphs. In this work, we apply the sparse representation classifier ([5], [6], [7]) to vertex classification on graph data; this classifier performs well in object recognition with contamination and does not require dimension selection. In particular, we provide both a theoretical performance guarantee of this classifier for the stochastic blockmodel, and its numerical advantages via simulations and various real graph datasets. Furthermore, the proposed classifier maintains low misclassification error under both occlusion and linkage reversion contamination.

This paper is organized as follows: in Section 2, we provide background on the classification framework, review the latent position model and the stochastic blockmodel, and present the vertex classification framework. In Section 3, we describe the motivation for investigating robust vertex classification, and propose two contamination models on stochastic blockmodels. In Section 4, we propose a sparse representation classifier for vertex classification and prove consistency of our proposed classifier for the stochastic blockmodel under a certain condition on the model parameters. In Section 5, we demonstrate the effectiveness of our proposed classifier via both simulated and real data experiments. In Section 6, we discuss the practical advantages of applying the sparse representation classifier to graphs. All theoretical proofs are in the supplementary material.

2 Background and Framework
Let [K] = {1, ..., K} for any positive integer K. Let (X, Y) ∼ F_XY, where the feature vector X is an R^d-valued random vector, Y is a [K]-valued class label, and F_XY is the joint distribution of X and Y. Let π_k = P(Y = k) be the class priors and let g : R^d → [K] provide one's guess of Y given X; such a g is a classifier. We intend to classify a test observation X, that is, estimate its true but unknown label Y via g(X). An error occurs when g(X) ≠ Y, and the probability of error is denoted by L(g) = P(g(X) ≠ Y). The optimal classifier is g* = arg min_{g: R^d → [K]} P(g(X) ≠ Y), which is the Bayes classifier achieving the minimum possible error L*.

In the classical setting of supervised learning, we observe training data T_n = {(X_1, Y_1), ..., (X_n, Y_n)} iid ∼ F_XY. The performance of a classifier g_n trained on T_n is measured by the conditional probability of error L_n = L(g_n) = P(g_n(X; T_n) ≠ Y | T_n), for a sequence of classifiers {g_n, n ≥ 1}. The sequence of classifiers is consistent if L_n → L* as n → ∞; and it is universally consistent if L_n → L* with probability 1 for any distribution F_XY [8].

This supervised learning framework is adapted to the setting of random graphs. A graph is a pair G = (V, E) consisting of a set of vertices or nodes V = [n] = {1, 2, ..., n} and a set of edges E ⊂ ([n] choose 2). In this work, we assume that all graphs are simple; that is, the graphs are undirected, unweighted, and non-loopy. The adjacency matrix of G, denoted by A, is n-by-n, symmetric, binary, and hollow, i.e., the diagonal entries of A are all zeros. Each entry A_uv = A_vu = 1 if there is an edge between vertices u and v, and A_uv = 0 otherwise [9]. A random graph is a graph-valued random variable G : Ω → G_n, where Ω denotes the probability space and G_n the collection of all 2^(n choose 2) possible graphs on V = [n]. For instance, one frequently occurring random graph model is the Erdős–Rényi graph ER(n, p), in which each pair of vertices has an edge independently with probability p [10].

Our exploitation task is vertex classification. We observe the adjacency matrix A ∈ {0, 1}^{(n+1)×(n+1)} on n + 1 vertices {v_1, ..., v_n, v} and the class labels Y_i ∈ [K] associated with the first n vertices. Our goal is to estimate the class label Y of the test vertex v via a classifier g : {0, 1}^{(n+1)×(n+1)} → [K] such that the probability of error P(g(A) ≠ Y) is small.

In our setting, we describe the stochastic blockmodel and vertex classification from the perspective of a latent position graph framework. Hoff et al. [11] proposed a latent position graph model. In this model, each vertex v is associated with an unobserved latent random vector X_v drawn independently from a specified distribution F on R^d. The adjacency matrix entries A_uv | (X_u, X_v) ∼ Bernoulli(l(X_u, X_v)) are conditionally independent, where l : R^d × R^d → [0, 1] is the link function. The random dot product graph model proposed in [12] is a special case of the latent position model, where the link function is the inner product of the latent positions, l(X_u, X_v) = ⟨X_u, X_v⟩. For the purpose of theoretical analysis and simulation in this paper, we mainly consider the stochastic blockmodel introduced in [13], which is a random graph model with a set of n vertices randomly drawn from K block memberships.
Conditioned on the K-partition, edges between all pairs of vertices are independent Bernoulli trials with parameters determined by the block memberships.

Below we formally present the definitions of the latent position model and the stochastic blockmodel, which provide the framework for our exploitation task of vertex classification.

Definition 1.
Latent Position Model (LPM)
Let F be a distribution on [0, 1]^d, let X_1, ..., X_n iid ∼ F, and define Z := [X_1, ..., X_n]^T ∈ R^{n×d}. Suppose rank(Z) = d, and denote by P ∈ [0, 1]^{n×n} the communication probability matrix, where each entry P_ij is the probability that there is an edge between vertices i and j conditioned on X_i and X_j. Let A ∈ {0, 1}^{n×n} be the random adjacency matrix. Then (Z, A) ∼ LPM(F) if and only if the following conditional independence relationship holds:

P(A | X_1, ..., X_n) = ∏_{i<j} P_ij^{A_ij} (1 − P_ij)^{1−A_ij}.

Definition 2. Stochastic Blockmodel (SBM) Let K be the number of blocks, and let π be a length-K vector in the unit simplex ∆^{K−1}. The block memberships of the vertices are given by Y(v) iid ∼ Multinomial([K], π). Let B ∈ [0, 1]^{K×K} be a symmetric matrix specifying the block communication probabilities. Then A ∼ SBM([n], B, π) if and only if the following conditional independence relationship holds:

P_ij = P(A_ij = 1 | X_i, X_j) = P(A_ij = 1 | Y_i, Y_j) = B_{Y_i, Y_j}.

Note that the SBM is a special case of the LPM, because the latent positions of an SBM are mixtures of point masses, which are determined by the eigenvectors of B. The unknown latent positions X_i and X_j of vertices i and j determine their memberships Y_i and Y_j. For vertex classification on the SBM, the Bayes error is L* = 0 [1].

Definition 3. Model Dimension For stochastic blockmodels, the model dimension refers to the rank of the communication probability matrix. A d-dimensional SBM satisfies rank(P) = rank(B) = d, for which d ≤ K; if B is full rank, then d = K.

Definition 4. Adjacency Spectral Embedding in Dimension d̂ Let A be defined as in Definition 1. Let A = U_A S_A U_A^T be the full spectral decomposition of A, where S_A = Diag(λ_1, λ_2, ..., λ_n) with λ_1 ≥ λ_2 ≥ ... ≥ λ_n. Let S_{A,d̂} = Diag(λ_1, λ_2, ..., λ_d̂) ∈ R^{d̂×d̂} contain the d̂ largest eigenvalues of A, and let U_{A,d̂} ∈ R^{n×d̂} contain the corresponding eigenvectors as its columns. The estimate of the latent positions of the SBM via adjacency spectral embedding in dimension d̂ is defined as Ẑ_d̂ = U_{A,d̂} S_{A,d̂}^{1/2}, for 1 ≤ d̂ ≤ n. We denote the method of adjacency spectral embedding into dimension d̂ by ASE_d̂.

Many techniques have been developed to infer the latent positions from the realized adjacency matrix. Bickel et al. [14] used subgraph counts and degree distributions to consistently estimate stochastic blockmodels. Sussman et al. [1] proved the consistency of spectral partitioning on the adjacency matrix of stochastic blockmodels. Rohe et al. [15] proved a consistent spectral partitioning procedure on the Laplacian of the stochastic blockmodel. Fishkind et al. [2] showed the consistency of adjacency spectral partitioning when the model parameters are unknown. Athreya et al. [16] proved a central limit theorem for the adjacency spectral embedding of stochastic blockmodels.

In the area of clustering and classification, there exists intensive work on unsupervised learning for graph data [17], [18], [19], [20], [21], [22], [23], as well as on supervised learning, such as [3], [24], [25] for vertex classification, and [26], [27] for vertex nomination.

Our task in this paper is vertex classification. However, we do not and cannot observe the latent positions X_1, ..., X_n, X; otherwise, we would be back in the classical setting of supervised learning. We assume that the class-conditional density satisfies X_i | Y_i = k ∼ f_k with class priors π as before, that is,

P(Y_i = k | X_i = x) = π_k f_k(x) / Σ_{j∈[K]} π_j f_j(x).

We denote the test vertex by v, whose latent position is X, and we shall assume that we do not observe its label Y.
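To make Definitions 2 and 4 concrete, the following minimal sketch samples an adjacency matrix from an SBM and computes the adjacency spectral embedding ASE_d̂. It is written in Python with NumPy purely for illustration; the helper names sample_sbm and ase are ours, not part of the paper.

```python
import numpy as np

def sample_sbm(n, B, pi, rng=None):
    """Sample block labels and a simple (symmetric, hollow) adjacency matrix from SBM([n], B, pi)."""
    rng = np.random.default_rng(rng)
    K = len(pi)
    labels = rng.choice(K, size=n, p=pi)          # block memberships Y_1, ..., Y_n
    P = B[np.ix_(labels, labels)]                 # communication probability matrix P_ij = B_{Y_i, Y_j}
    upper = np.triu(rng.random((n, n)) < P, k=1)  # independent Bernoulli edges above the diagonal
    A = (upper | upper.T).astype(int)             # symmetrize; the diagonal stays zero (hollow)
    return A, labels

def ase(A, d_hat):
    """Adjacency spectral embedding into dimension d_hat: Z_hat = U_{A,d_hat} S_{A,d_hat}^{1/2}."""
    eigvals, eigvecs = np.linalg.eigh(A)          # full spectral decomposition of the symmetric A
    order = np.argsort(eigvals)[::-1][:d_hat]     # keep the d_hat largest eigenvalues, as in Definition 4
    S = eigvals[order]
    U = eigvecs[:, order]
    return U * np.sqrt(np.maximum(S, 0))          # scale eigenvectors by the square roots of the eigenvalues

# Illustrative usage (these parameter values are ours, not those of Equation 10):
# B = np.array([[0.5, 0.3], [0.3, 0.4]]); pi = np.array([0.6, 0.4])
# A, y = sample_sbm(200, B, pi)
# Z_hat = ase(A, d_hat=2)
```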
3 Motivation

Our motivation for proposing a robust vertex classifier comes from asking the question: how well can vertex classifiers perform when model assumptions do not hold? If the model dimension d is known or can be estimated correctly, ASE_d consistently estimates the latent positions for SBM [1]. Figure 1 presents an example of ASE_d, where vertices from two classes are well separated in the embedded space. A subsequent k-nearest-neighbor (kNN) classifier on ASE_d is universally consistent for SBM [3]. That means that regardless of what distribution the latent positions are drawn from, kNN ◦ ASE_d achieves the Bayes error L* asymptotically as k → ∞, n → ∞ and k/n → 0. In particular, for stochastic blockmodels, 1NN ◦ ASE_d is asymptotically Bayes optimal [1].

Fig. 1: An example of adjacency spectral embedding (ASE_{d=2}) with n = 500. The parameters B and π are given in Equation 10. The latent positions of this SBM form a mixture of two point masses.

Athreya et al. [16] proved a central limit theorem stating that for a K-block, d-dimensional SBM, Ẑ_d obtained via ASE_d is distributed asymptotically as a K-mixture of d-variate normals with covariance matrices of order 1/n. This asymptotic result holds for any constant K, any finite d, all but finitely many n, and does not require an equal number of vertices per partition. This result implies that quadratic discriminant analysis (QDA) and linear discriminant analysis (LDA) on the represented data Ẑ_d of stochastic blockmodels are asymptotic Bayes plug-in classifiers, while LDA requires fewer parameters to fit. Hence in our analysis, we employ two consistent classifiers, 1NN ◦ ASE_d and LDA ◦ ASE_d, for vertex classification on stochastic blockmodels.

Importantly, having information on the model dimension d is critical to adjacency spectral approaches. When d is given, ASE_d is consistent, and 1NN ◦ ASE_d and LDA ◦ ASE_d are asymptotically Bayes optimal. When d is not known, Sussman et al. [1] estimate d via a consistent estimator. However, for the consistent estimator to be accurate, the required number of vertices n depends highly on the graph density, and increases rapidly as the expected graph density decreases. Fishkind et al. [2] show that if we pick a positive integer d̂ ≥ d, then ASE_d̂ is still consistent as n → ∞. However, for a finite number of vertices, 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂ degrade significantly in performance compared to 1NN ◦ ASE_d and LDA ◦ ASE_d. Moreover, their performance on real data can be very sensitive to the choice of embedding dimension. Our focus is on removing the need to know the model dimension d while still maintaining a low error rate for vertex classification, so that the classification procedure is robust and suitable for practical inference when the model assumptions do not hold.

To assess the robustness of the vertex classifiers for stochastic blockmodels, we propose two scenarios of contamination that change the model dimension of the SBM. Suppose the uncontaminated graph model G_un is a stochastic blockmodel, G_un ∼ SBM([n], B_un, π_un). Denote the communication probability matrix of G_un by P_un. We can write P_un = Z_un Z_un^T, where Z_un contains the latent positions of the uncontaminated model [1], and suppose rank(B_un) = d. Denote by δ_i(M) the i-th largest singular value of a matrix M.

Let p_o ∈ [0, 1] denote the occlusion rate.
We randomly select (100 p_o)% of the n vertices and set the probability of connectivity among the selected vertices to be 0. In this scenario, the probability of connectivity between the contaminated vertices and the uncontaminated vertices remains the same as in G_un. This occlusion procedure can be formulated as a stochastic blockmodel G_occ with the following parameters:

B_occ = ( B_un  B_un ; B_un  0_{K×K} ) ∈ R^{2K×2K},   (3)
π_occ = [(1 − p_o) π_un^T, p_o π_un^T]^T ∈ R^{2K}.   (4)

Denote the communication probability matrix of G_occ by P_occ. It always holds that δ_1(P_occ) ≤ δ_1(P_un) ≤ n, and it almost always holds that rank(B_occ) = rank(P_occ) = 2d. That is, the true model dimension of the occluded graph is 2d instead of d. The proofs of the above claims are provided in the supplementary material.

Both B_occ and P_occ have d positive and d negative eigenvalues, where the d negative eigenvalues are due to occlusion contamination. The number of blocks in the contaminated model G_occ rises to 2K, where K blocks correspond to (1 − p_o) π_un and the other K blocks correspond to p_o π_un. Although the number of blocks in the model changes to 2K due to contamination, the number of classes in the vertex classification problem remains K. As p_o → 1, the number of contaminated vertices approaches n, indicating that the majority of the edges are sampled from the contamination source 0_{K×K}; as a result, the adjacency matrix A becomes sparser and sparser.

Note that our occlusion scenario randomly selects the vertices; conditioned on selecting the contaminated vertices, the edges between these vertices are missing deterministically. Therefore the edges are not missing completely at random in this occlusion contamination procedure.

Let p_l ∈ [0, 1] denote the linkage reversion rate. We randomly select (100 p_l)% of the n vertices and reverse the connectivity among all the selected vertices. The probability of connectivity between the contaminated vertices and the uncontaminated vertices remains the same as in G_un. The linkage reversion contamination can be formulated as a stochastic blockmodel G_rev with the following parameters:

B_rev = ( B_un  B_un ; B_un  J_{K×K} − B_un ) ∈ R^{2K×2K},   (5)
π_rev = [(1 − p_l) π_un^T, p_l π_un^T]^T ∈ R^{2K}.   (6)

The matrix J_{K×K} ∈ R^{K×K} is the matrix of all ones. Denote the communication probability matrix of G_rev by P_rev. If rank(B_un) = d, then it almost always holds that d + 1 ≤ rank(B_rev) = rank(P_rev) ≤ 2d + 1, since the block matrix J_{K×K} − B_un has rank at most d + 1. The number of blocks in the contaminated model also increases to 2K, similar to the occlusion model. As p_l → 1, we recover the complement of SBM([n], B_un, π_un) – that is, SBM([n], J_{K×K} − B_un, π_un).
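Both contaminated models are themselves stochastic blockmodels with the enlarged parameters above, so they can be simulated by constructing (B_occ, π_occ) or (B_rev, π_rev) and reusing an ordinary SBM sampler. The sketch below (Python/NumPy; the helper names are ours, not the paper's) builds the parameters following Equations 3–6.

```python
import numpy as np

def occlusion_params(B_un, pi_un, p_o):
    """Equations 3-4: parameters of the occluded blockmodel G_occ."""
    K = B_un.shape[0]
    B_occ = np.block([[B_un, B_un],
                      [B_un, np.zeros((K, K))]])        # edges among selected vertices are removed
    pi_occ = np.concatenate([(1 - p_o) * pi_un, p_o * pi_un])
    return B_occ, pi_occ

def reversion_params(B_un, pi_un, p_l):
    """Equations 5-6: parameters of the linkage-reversed blockmodel G_rev."""
    K = B_un.shape[0]
    J = np.ones((K, K))
    B_rev = np.block([[B_un, B_un],
                      [B_un, J - B_un]])                # edges among selected vertices are flipped
    pi_rev = np.concatenate([(1 - p_l) * pi_un, p_l * pi_un])
    return B_rev, pi_rev

# A contaminated graph can then be drawn with the SBM sampler sketched in Section 2, e.g.
#   A_occ, y_occ = sample_sbm(n, *occlusion_params(B_un, pi_un, p_o=0.3))
# where the class label of a vertex for the classification task is its block index modulo K.
```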
When the stochastic blockmodels are contaminated by the above two procedures, the model parameters and the model dimension change. Suppose both the original model dimension d and the contamination information are known; then we can use the contaminated model dimension d_occ = 2d or d_rev ∈ [d + 1, 2d + 1] for embedding, so that ASE_{d_occ} and ASE_{d_rev} followed by 1NN and LDA are asymptotically Bayes optimal. However, if we only know the contamination but not the model dimension, then adjacency spectral embedding will require the estimation of an embedding dimension; and if we know d but not the contamination, we usually consider d as the default embedding dimension. In either case, the embedding dimension used may not be the best choice for adjacency spectral embedding and subsequent classification.

Figure 2 and Figure 3 provide two examples of the scree plots obtained from the contaminated adjacency matrices A_occ and A_rev, for which the original model dimension is d = 2. Using d̂ = 2 is clearly not the best choice for the contaminated data; and if we decide to estimate d, this remains a very challenging task, despite various procedures and criteria for dimension selection [28]. Here we use a principled automatic dimension selection procedure based on the profile likelihood [29] to estimate the embedding dimension from the scree plot. However, in the setting of Figure 2 and Figure 3, Monte Carlo investigation yields d̂ = 2 every time as the elbow (500 times out of 500 Monte Carlo replicates), using the full spectrum or a partial spectrum of the largest 22 eigenvalues in magnitude, respectively. The second elbow selected by [29] concentrates around 80 and 11 using the full spectrum and the partial spectrum, respectively. Even though larger choices of d̂ are better for classification purposes in these two contaminated graphs, they are not selected by the dimension selection method of [29].

Notwithstanding the results in [3] and [2], we cannot be guaranteed to successfully choose the embedding dimension in practice. Consequently, the performance of the ASE method and subsequent classification will suffer. Figure 4 and Figure 5 demonstrate that, as the contamination proportions p_o and p_l increase, the latent positions change as reflected in the estimated latent positions Ẑ_{d̂=2}, for which the profile likelihood method always yields d̂ = 2 for the contaminated data. In particular, as the occlusion rate p_o increases, more vertices from different classes are embedded close together. Furthermore, vertex classification on the contaminated Ẑ_d̂ using 1NN or LDA will degrade in performance, as illustrated later in the simulations and in Figure 7. Indeed, the model dimension critically determines the success of vertex classification based on the ASE procedures, whereas in practice the model dimension is usually unknown. This motivates us to seek a robust vertex classifier which does not heavily depend on model selection and still attains good performance.

Fig. 2: Scree plot of the occlusion contaminated adjacency matrix A_occ with n = 200. The parameters B_un and π_un are given in Eq. 10. The negative eigenvalues of A_occ are due to occlusion contamination. The profile likelihood method [29] always suggests d̂ = 2 for this scree plot.

4 The Sparse Representation Classifier for Vertex Classification

In this section, we propose to use the sparse representation classifier (SRC) for robust vertex classification. Instead of employing adjacency spectral embedding and applying subsequent classifiers on Ẑ_d̂, we recover a sparse representation of the test vertex with respect to the vertices in the training set, and use the recovered sparse representation coefficients to classify the test vertex.

For the purpose of algorithm presentation, in this section we slightly abuse notation and denote by A the adjacency matrix on the training vertices {v_1, ..., v_n} with known labels Y_i ∈ [K], and by φ the adjacency column with respect to the testing
vertex v with an unknown label Y; note that this is almost equivalent to letting A be the adjacency matrix for {v_1, ..., v_n, v} as in previous sections, then splitting the first n columns for training and the last column for testing, except that the last row is not used.

Fig. 3: Scree plot of the linkage reversion contaminated adjacency matrix A_rev with n = 200. The parameters B_un and π_un are given in Eq. 10. The negative eigenvalues are due to linkage reversion. The profile likelihood method [29] always suggests d̂ = 2 for this scree plot.

Now suppose there are n_k training vertices in each class k, so that n = Σ_{k∈[K]} n_k. Let a_{k,1}, ..., a_{k,n_k} denote the columns of A corresponding to the n_k training vertices in class k. Define a matrix D_k = [d_{k,1}, ..., d_{k,n_k}] ∈ R^{n×n_k}, where each column d_{k,j} = a_{k,j} / ||a_{k,j}||_2 for 1 ≤ j ≤ n_k; then we concatenate D_1, ..., D_K such that D := [D_1, ..., D_K] ∈ R^{n×n}. Namely, the matrix D re-arranges the columns of A by class and normalizes each column to unit ℓ2 norm. Also normalize φ to unit norm. Then SRC is applied to D and φ directly, by first solving the ℓ1-minimization problem

arg min ||β||_1 subject to φ = Dβ + ε,   (7)

followed by subsequent classification based on the sparse representation β. This procedure does not require spectral embedding of the adjacency matrix, and was originally used by [5] for robust face recognition. In subsection 4.1 we present the algorithmic and implementation details and argue why SRC is applicable to graphs; then a consistency result of SRC for the stochastic blockmodel is proved, followed by relevant discussions, in subsection 4.2.

Fig. 4: The occlusion contamination effect on the estimated latent positions Ẑ_{d̂=2} with n = 200. The parameters B_un and π_un are given in Eq. 10. The four panels display the latent position estimates for different occlusion rates p_o. As p_o increases, vertices from different blocks become close in the embedded space. For p_o close to 1, ASE_{d̂=2} will eventually yield only one point cloud at 0.

The algorithm is summarized in Algorithm 1. The only computationally costly step in Algorithm 1 is the ℓ1 minimization. Many algorithms, such as ℓ1 homotopy [30], the augmented Lagrangian multiplier [31], and orthogonal matching pursuit [32], have been developed to solve ℓ1 minimization. In this paper, we use orthogonal matching pursuit (OMP) to solve Equation 8, which is a fast approximation of exact ℓ1 minimization; details of various ℓ1 minimization methods and OMP are available in [30], [33], [31], and [7].

Usually there is a model selection parameter for stopping the ℓ1 minimization, namely the noise threshold ε in Equation 8, or equivalently a designated sparsity level s so that ||β||_0 ≤ s. As ε is difficult to determine for real data, in this paper we choose to set s rather than ε: this allows us to better compare the vertex classification performance across different sparsity levels, and we will argue that SRC is robust
against s in the next subsection and also in the numerical experiments. Note that the constraint in Equation 8 can be replaced by φ = Dβ in a noiseless setting, but usually some parameter like ε or s is required to achieve a parsimonious model when dealing with high-dimensional or noisy data.

Fig. 5: The linkage reversion contamination effect on the estimated latent positions Ẑ_{d̂=2} with n = 200. The parameters B_un and π_un are given in Eq. 10. The four panels display the latent position estimates for different linkage reversion rates p_l. As p_l increases, vertices from different blocks become close in the embedded space. For p_l = 1, ASE_{d̂=2} will yield two point clouds corresponding to SBM([n], J_{2×2} − B_un, π_un).

Although the SRC algorithm can always be used for supervised learning, it does not always perform well for arbitrary data sets, and it is necessary to understand why SRC is applicable to graphs. In [5], it is argued that the face images of different classes lie on different subspaces, so that ℓ1 minimization is able to select training data of the correct class (i.e., the true but unknown class of the testing observation). Based on this subspace assumption, [34] derives a theoretical condition for ℓ1 minimization to do perfect variable selection in sparse representation, i.e., all selected training data are from the correct class. This validates that sparse representation with ℓ1 minimization is a valuable tool under the subspace assumption. However, the subspace assumption requires an intrinsic low-dimensional structure for each class, which may not be satisfied for high-dimensional real data such as the adjacency matrix.

Algorithm 1 Robust vertex classification.
Goal: Classify the vertex v whose unknown label is Y.
Input: Adjacency matrix A ∈ {0, 1}^{n×n} from the training vertices {v_1, ..., v_n}, where each column a_i contains the adjacencies of the i-th vertex to all other training vertices, and all training vertices are associated with observed labels Y_i ∈ [K]. Let φ ∈ {0, 1}^n be the testing adjacency column containing the connectivity of the test vertex to all training data.
1. Arrange and scale all vertices: Re-arrange the columns of A in class order, and normalize each column to unit ℓ2 norm. Denote the resulting matrix by D. Also scale the testing adjacency column φ to have unit norm.
2. Find a sparse representation of φ by ℓ1 minimization: β̂ = arg min ||β||_1 subject to φ = Dβ + ε.
3. Compute the distance of φ to each class k: r_k(φ) = ||φ − D β̂_k||_2, where β̂_k = [0, ..., 0, β̂_{k,1}, ..., β̂_{k,n_k}, 0, ..., 0] ∈ R^n contains the recovered coefficients corresponding to the k-th class.
4. Classify the test vertex: Ŷ = arg min_k r_k(φ).

Furthermore, the motivation behind the popularity of ℓ1 minimization is its equivalence to ℓ0 minimization under certain conditions, such as the incoherence condition or the restricted isometry property; see [35], [36], [37], [38], [39], [40]. But those conditions are often violated in the SRC framework, because the sample training data are usually correlated; and SRC does not necessarily need a unique or most sparse β in order to classify correctly. As long as the sparse representation β assigns dominating coefficients to data of the correct class, SRC can classify correctly. Shen et al. [7] prove an SRC performance guarantee under a principal angle condition, which is similar to the condition in [34], but does not rely on the subspace assumption and does not require a unique and most sparse solution. The condition is easy to check for a given model and intuitive to understand: as long as the within-class principal angle is smaller than the between-class principal angle, ℓ1 minimization and OMP are able to assign dominating regression coefficients to training data of the correct class, so that SRC can perform well. Following this direction, in the next subsection we derive a condition on the stochastic blockmodel so that the principal angle condition is satisfied, consequently achieving SRC consistency for SBM.
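For concreteness, Algorithm 1 with the OMP variant used in our experiments can be sketched as follows. The sketch uses Python with scikit-learn's OrthogonalMatchingPursuit as the ℓ1/ℓ0 solver; it is an illustration under our own choices (the helper name src_predict is ours), not the exact implementation of the paper.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def src_predict(A_train, labels, phi, s=5):
    """Classify one test vertex from its adjacency column phi, given the training adjacency
    matrix A_train (n x n) and the training labels, via sparse representation (Algorithm 1)."""
    classes = np.unique(labels)
    order = np.argsort(labels, kind="stable")              # step 1: arrange columns by class
    D = A_train[:, order].astype(float)
    y = labels[order]
    D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)      # normalize each column to unit l2 norm
    phi = phi.astype(float)
    phi = phi / max(np.linalg.norm(phi), 1e-12)            # normalize the testing adjacency column

    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=s, fit_intercept=False)
    beta = omp.fit(D, phi).coef_                           # step 2: sparse representation of phi

    residuals = []
    for k in classes:                                      # step 3: residual of phi to each class
        beta_k = np.where(y == k, beta, 0.0)
        residuals.append(np.linalg.norm(phi - D @ beta_k))
    return classes[int(np.argmin(residuals))]              # step 4: class with the smallest residual
```

Combined with an SBM sampler, one can estimate the error by holding out one adjacency column at a time, mirroring the leave-one-out evaluation used in Section 5.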
4.2 SRC Consistency for SBM

Here we prove a consistency theorem for the sparse representation classifier for vertex classification on the stochastic blockmodel, which provides a theoretical performance guarantee for our proposed robust vertex classification. All proofs are in the supplementary material.

For this subsection only, we first define, for each q = 1, ..., K,

Q_q ∼ Σ_{k=1}^{K} 1{Y = k} B_{kq},   (8)

where Y is the class label, 1{Y = k} is the indicator function (equal to 1 with probability π_k), and B_{kq} is the corresponding entry of the block probability matrix B generating the SBM. Note that {Q_q} and all their moments depend only on the prior probability π and the block probability B. Next we define the un-centered correlation as

ρ_qr = E(Q_q Q_r) / sqrt( E(Q_q^2) E(Q_r^2) ),

for each 1 ≤ q ≠ r ≤ K. Clearly 0 ≤ ρ_qr = ρ_rq ≤ 1.

Our first lemma gives a necessary and sufficient condition on the SBM parameters for adjacency columns of the same class to be asymptotically most correlated.

Lemma 1. Under the stochastic blockmodel, for an adjacency column of class q, its asymptotically most correlated column is of the same class q, if and only if the prior probability π and the block probability matrix B satisfy the following inequality:

ρ_qr · sqrt( E(Q_r^2) / E(Q_q^2) ) < sqrt( E(Q_r) / E(Q_q) )   (9)

for all r ≠ q.

When Lemma 1 holds for all q, it in fact guarantees that SRC at s = 1 (or equivalently, 1-nearest-neighbor based on the principal angle) is a consistent classifier for the stochastic blockmodel. To prove SRC consistency at any s, we need a second lemma.

Lemma 2. Denote by A^(s) an n × s random matrix consisting of s adjacency columns, and by C a scalar vector of length s. Suppose Equation 9 holds for the stochastic blockmodel. Then for any adjacency column α of class q, its within-class correlation (i.e., the correlation between α and another adjacency column of class q) is asymptotically larger than the correlation between α and C · A^(s), for any A^(s) whose columns are not from class q and any vector C with non-negative entries. The above holds for any s ≥ 1.

The above two lemmas essentially establish the principal angle condition in [7]. They guarantee that β assigns dominating coefficients to training data of the correct class, which leads to SRC consistency for SBM.

Theorem 1. Suppose Equation 9 holds for the corresponding stochastic blockmodel for all q ∈ [1, ..., K], and the sparse representation β is constrained to be non-negative. Then SRC is a consistent classifier for vertex classification of SBM, with L_n → 0 as n → ∞. This holds for SRC implemented by either exact ℓ1 minimization or orthogonal matching pursuit at any s ≥ 1.
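Because the moments of {Q_q} are simple weighted sums over the blocks, the condition in Equation 9 can be checked directly from an estimated or hypothesized (B, π). The sketch below (Python/NumPy) illustrates such a check; the function name check_condition is ours, and the inequality it tests is our reading of the reconstructed Equation 9 above.

```python
import numpy as np

def check_condition(B, pi):
    """Check Equation 9 for every ordered pair (q, r) with q != r, given SBM parameters (B, pi).

    Q_q takes the value B[k, q] with probability pi[k], so its moments are weighted sums over k.
    """
    B, pi = np.asarray(B, float), np.asarray(pi, float)
    K = len(pi)
    EQ = pi @ B                    # E(Q_q)   for each q
    EQ2 = pi @ (B ** 2)            # E(Q_q^2) for each q
    ok = True
    for q in range(K):
        for r in range(K):
            if q == r:
                continue
            EQqQr = pi @ (B[:, q] * B[:, r])            # E(Q_q Q_r)
            rho = EQqQr / np.sqrt(EQ2[q] * EQ2[r])      # un-centered correlation rho_qr
            ok &= rho * np.sqrt(EQ2[r] / EQ2[q]) < np.sqrt(EQ[r] / EQ[q])
    return ok
```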
Let us make some remarks regarding the theorem and its implications. Firstly, if the block columns are very close in their ℓ1 and ℓ2 norms in the measure space with respect to π (i.e., E(Q_r)/E(Q_q) ≈ E(Q_r^2)/E(Q_q^2)), then the theorem is very likely to hold for all ρ_qr < 1 and SRC is expected to perform well; if not, the block columns cannot be too highly correlated in order for the inequality to hold and for SRC to work; and if block r is a scalar multiple of block q, the condition always fails and SRC cannot separate those two classes. In any case, if the adjacency matrix can be modeled by an SBM, then it is very easy to estimate the model parameters and check Equation 9.

Secondly, even though Equation 9 is only sufficient and not necessary for SRC consistency at s > 1, it is often the case that SRC is no longer consistent when Equation 9 is violated: when Equation 9 is violated for some r, the adjacency column of class q is asymptotically most correlated with a column from class r, which usually causes SRC to misbehave.

Thirdly, the theorem requires the sparse representation to be non-negative, which can be easily achieved in ℓ1 minimization; and [41], [42], [43] show that eliminating the negative entries of β has very nice theoretical properties in non-negative OMP and non-negative least squares. Even though we do not explicitly use non-negative ℓ1 minimization or bound the coefficients, in our numerical experiments the negative entries of β are almost never large, and L_n clearly converges to 0 for the SBM simulation in the numerical section.

Fourthly, since the consistency result holds for SRC at any s ≥ 1, we expect SRC to be robust in the choice of s, compared to the model selection of d̂ for ASE procedures. This is demonstrated empirically in Section 5. In particular, the two contamination scenarios essentially double the number of blocks compared to the uncontaminated SBM; this causes the classification error of ASE to be no longer consistent unless the embedding dimension d is adjusted accordingly, but SRC may remain consistent as long as the contaminated blocks still satisfy Equation 9.

Lastly, we should note that even though the consistency results hold at any s ≥ 1, in most experiments a moderate s helps the finite-sample performance compared to s = 1 or a large s. One explanation is that the classifier itself is designed to favor a more parsimonious model, as argued in [5]. Another explanation, based on the consistency proof of [7], is that the sub-matrix of D corresponding to the nonzero entries of β should be full rank; this is always true when using ℓ1 minimization and OMP, but a large s may make the sub-matrix close to rank deficient (i.e., having singular values close to zero). Indeed, in the numerical section we will see that as long as the sparsity level s is not too large relative to the sample size n, SRC can perform well; in addition, choosing a smaller sparsity level has less computational cost.

5 Numerical Experiments

If the true model dimension is unknown, ASE_d̂ may not be consistent. In particular, when contamination results in a changed model dimension, or the model dimension cannot be correctly estimated, the performance of subsequent classifiers may suffer.
We consider a classifier robust if it can maintain a relatively low misclassification rate under data contamination. Our sparse representation classifier (SRC) for vertex classification does not rely on knowledge of the model dimension, is robust to the choice of sparsity level s, and achieves consistency with respect to all sparsity levels.

Throughout this section, we use orthogonal matching pursuit (OMP) to solve the ℓ1 minimization. In the following experiments, SRC_s denotes the performance of SRC with varying sparsity level s, and SRC means s = 5 by default. We use leave-one-out cross validation to estimate the classification error. The standard errors are small compared to the differences in performance. Our simulation experiments and real data analysis demonstrate that SRC for vertex classification performs well under varying sparsity levels, possesses higher robustness to contamination than 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂, and is an excellent tool for real data inference.

We compare the robustness of SRC with two vertex classifiers, 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂, both of which achieve the asymptotic Bayes error when d̂ = d with no contamination, and when d̂ = d_occ or d_rev in the contamination model.

We simulate the probability matrix for an uncontaminated stochastic blockmodel G_un with K = 2 blocks (Y ∈ {1, 2}) and parameters

B_un = ( ·  0.32 ; 0.32  · ),   π_un = [ · , · ]^T.   (10)

The SBM parameters in Equation 10 in fact satisfy the theoretical condition in Equation 9, so we expect SRC to perform well in this case.

We first assess the performance of all classifiers in the uncontaminated model, assuming the true model dimension d = 2 is known. As seen in the left plot of Figure 6, LDA ◦ ASE_{d=2} performs the best for all sample sizes n considered. In this ideal setting, SRC does not outperform 1NN ◦ ASE_{d=2} or LDA ◦ ASE_{d=2}, but all classifiers converge to 0 error as n increases, as expected from our theoretical derivation.

Then we fix the number of vertices n = 110 and vary the sparsity level s and embedding dimension d̂. The right plot of Figure 6 exhibits the three classifiers' performance: SRC_s performs well throughout s, as does LDA ◦ ASE_d̂ except at d̂ = 1, while 1NN ◦ ASE_d̂ degrades significantly with increasing d̂ or at d̂ = 1.

Now we assess the robustness of SRC, 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂ under contamination, using the same parameter setting as Equation 10. If the model dimension d = 2 is known and the exact contamination is known, then d̂ = 4 is best for subsequent classification of the contaminated data; otherwise d̂ will be set to 2, either due to not knowing the contamination or due to estimating d̂ by the profile likelihood procedure in [29], as seen in Figure 2 and Figure 3.

Figure 7 presents the misclassification error of SRC, 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂ under occlusion contamination, linkage reversion contamination, and a mixed combination of both, for s = 5 and d̂ = 2, respectively. The x-axis shows the contamination rate, and the y-axis the classification error. In the case of occlusion, all classifiers degrade as the contamination rate increases, due to the decreasing density of the graph. In the case of linkage reversion, all classifiers degrade first due to a weaker block signal, and then improve when the contamination rate increases above 0.5, because the reversed block signal becomes stronger.
As to the mixed contamination, it is done as follows: first, we randomly select p% of the vertices and occlude their connectivity; secondly, we randomly select p% of the vertices (some may have already been occluded) and reverse their connectivity. In this scenario, the degradation in classification performance comes from both occlusion and linkage reversion contamination.

Fig. 6: Classification performance under no contamination. We simulate SBMs with B_un, π_un given in Equation 10 and show the average misclassification error over the Monte Carlo replicates. (Left): When the true model dimension d = 2 is known, SRC does not outperform 1NN ◦ ASE_{d=2} or LDA ◦ ASE_{d=2} for n ≥ 20. (Right): The same vertex classification task using various s and d̂ at n = 110.

For both occlusion and linkage reversion, LDA ◦ ASE_{d=4} is the best classifier, followed by 1NN ◦ ASE_{d=4}. SRC is slightly inferior, but is significantly better than LDA ◦ ASE_{d̂=2} and 1NN ◦ ASE_{d̂=2}. For the mixed contamination, SRC and 1NN ◦ ASE_{d=4} are the best classifiers, performing much better than the others. This indicates that SRC is robust against the contamination, while subsequent classification after spectral embedding may suffer from model dimension misspecification and data contamination.

Note that SRC also has a model selection parameter, namely the sparsity level s. Thus in Figure 8 we plot the SRC error with respect to the sparsity level s, as well as LDA ◦ ASE_d and 1NN ◦ ASE_d with respect to the embedding dimension d. Furthermore, because we have fixed the number of nearest neighbors to 1 so far, the first plot in Figure 8 shows that varying the number of nearest neighbors does not help kNN ◦ ASE_{d̂=2} over the range of k considered. All plots in Figure 8 show that SRC is stable with respect to the sparsity level s, while the ASE methods are less robust with respect to the dimension choice.

Fig. 7: Classification performance under three types of contamination. We simulate SBMs with B_un, π_un given in Eq. 10, set n = 200, contaminate the data accordingly, and present the average misclassification error for the five classifiers over the Monte Carlo replicates. SRC at s = 5 exhibits robust performance compared to 1NN ◦ ASE_{d̂=2} and LDA ◦ ASE_{d̂=2} throughout all types of contamination with varying contamination rates.

Fig. 8: Under the same setting as Figure 7, the first plot varies the choice of neighborhood size k in kNN ◦ ASE and compares it with SRC_s for varying s. The other three plots compare the classification error of SRC_s, 1NN ◦ ASE_d̂, and LDA ◦ ASE_d̂ over a common range of s = d̂. SRC exhibits stable performance with respect to the sparsity level s.

We apply SRC to several real datasets. We binarize and symmetrize the adjacency matrix and set the diagonals to zero. We followed [44] and [45], which suggest imputing the diagonal of the adjacency matrix to improve performance. We vary the embedding dimension d̂ for 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂, and the sparsity level s for SRC_s.

We apply SRC to the electric neural connectome of Caenorhabditis elegans (C. elegans) [46], [47], [48]. The neurons of the hermaphrodite C. elegans somatic nervous system ([49]) are classified into 3 classes: motor neurons, interneurons, and sensory neurons. The adjacency matrix is shown in the top of Figure 9. The objective is to predict the classes of the neurons.

The bottom of Figure 9 demonstrates the performance of the three classifiers. Both LDA ◦ ASE_d̂ and 1NN ◦ ASE_d̂ first improve in performance as d̂ increases, since more signal is included in the embedded space; and as d̂ continues to increase, both classifiers gradually degrade in performance, since more noise is included. The exhibited phenomenon is due to the bias-variance trade-off. In comparison, SRC_s has stable performance with respect to the sparsity level s, and outperforms LDA ◦ ASE_d̂ and 1NN ◦ ASE_d̂. This demonstrates that SRC is a practical tool in random graph inference.

The AdjNoun graph, collected in [50], is a network containing frequently used adjectives and nouns from
the novel "David Copperfield" by Charles Dickens. The vertices are the 60 most frequently used adjectives and 60 most frequently used nouns in the book. The edges are present if any pair of words occurs in an adjacent position in the book. The adjacency matrix of the adjective-noun network suggests that connectivity between nouns and adjectives is more frequent than connectivity among nouns or among adjectives, as seen in the top of Figure 10.

We apply SRC_s, 1NN ◦ ASE_d̂, and LDA ◦ ASE_d̂ on this dataset, and vary the embedding dimension d̂ and the sparsity level s. Performance of the three classifiers is seen in the bottom of Figure 10. SRC_s again exhibits stable performance with respect to the sparsity level s compared to 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂. Note that as the number of vertices is only 120, we limit the sparsity level in this experiment.

Fig. 9: (Top): The adjacency matrix of the C. elegans neural connectome, sorted according to the classes of the neurons; a three-block structure is exhibited. (Bottom): Vertex classification performance on the C. elegans network. As we vary the sparsity level s and embedding dimension d̂, SRC_s demonstrates superior and stable performance compared to 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂.

The political blog sphere was collected in February 2005 [51]. The vertices are blogs during the time of the 2004 presidential election, and edges exist if the blogs are linked. The blogs are either liberal or conservative, which sum up to n = 1490 vertices. The top of Figure 11 shows the adjacency matrix of the blog network, which reflects a strong two-block signal. The performance of the three classifiers is shown in the bottom of Figure 11, with varying sparsity level s and dimension choice d̂ up to 100. SRC_s has very stable and superior performance with respect to various sparsity levels, and always outperforms 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂. It is worthwhile to point out that this dataset can be modeled by an SBM as shown in [52]; and the sparsity limit is relatively small compared to the number of vertices n here, which is the reason why SRC_s is very stable up to s = 100.

The political book graph contains 105 books about US politics sold by Amazon.com [50]. The edges exist if any pair of books was purchased by the same customer. There are 3 class labels on the books: liberal (46.67%), neutral (40.95%) and conservative (12.28%). The adjacency matrix of this dataset and the performance of the three classifiers are seen in Figure 12. The bottom of Figure 12 shows that SRC_s is very stable with respect to the sparsity level, and usually better than 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂, but the optimal error is achieved by 1NN ◦ ASE_{d̂=10}.
Fig. 10: (Top): Adjacency matrix of the adjective-noun network, where each class is more likely to communicate with the other class than with itself. (Bottom): Vertex classification performance on the adjective-noun network. SRC_s demonstrates robust performance compared to 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂.

Fig. 11: (Top): Adjacency matrix of the political blog sphere, exhibiting strong connectivity within class. (Bottom): Vertex classification performance on the political blog network. SRC_s demonstrates superior and very stable performance with respect to various sparsity levels s compared to 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂.

Fig. 12: (Top): Adjacency matrix of the political book graph. (Bottom): Classification performance on the political book graph.

6 Discussion

Adjacency spectral embedding is a feature extraction approach for latent position graphs. When feature extraction is composed with common classifiers such as nearest-neighbor or discriminant analysis, the choice of feature space or embedding dimension is crucial. Given the model dimension d for a stochastic blockmodel, ASE_d is consistent and the subsequent vertex classification via 1NN ◦ ASE_d or LDA ◦ ASE_d is asymptotically Bayes optimal. The success of the ASE procedures clearly depends on the knowledge of d, as illustrated in the experiments.

However, in practical settings, the model dimension d is usually unknown, and there may exist data contamination. In this paper, we present a robust vertex classifier via sparse representation for graph data. The sparse representation classifier does not need information about the model dimension, can achieve consistency under a mild condition on the SBM parameters, and is robust against the choice of sparsity level. As seen in the simulation studies using SBM, SRC may not outperform 1NN ◦ ASE_d and LDA ◦ ASE_d when d is known, but does outperform 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂ where d̂ is chosen using the scree plot of the adjacency matrix. In the real data experiments, most of the time SRC outperforms 1NN ◦ ASE_d̂ and LDA ◦ ASE_d̂ for varying d̂, and is very stable with respect to the sparsity level s. The numerical studies strongly indicate that SRC is a valuable tool for random graph inference.

For the SRC implementation, we only considered orthogonal matching pursuit (OMP) to solve the ℓ1 minimization problem. Different implementations of ℓ1 minimization are explored in [7], and using a different algorithm may yield slightly different classification performance for SRC.

Another interesting question is the effect of normalization, namely the transformation of A into D in Algorithm 1. The normalization effect is usually difficult to quantify; but empirically, we see an improvement in SRC performance under ℓ2 normalization, as illustrated in Figure 13. Note that the SBM parameters satisfy the condition in Equation 9, so we expect SRC to perform well in the normalized case; furthermore, in the figure the SRC error is very close to 0 as n increases, despite the fact that the non-negativity constraint is not used in the algorithm (it is used in the consistency proof).

Fig. 13: Examination of SRC performance with and without ℓ2 normalization of the columns of D. We compare SRC performance when the columns of D are ℓ2 normalized and when they are not. The parameters B and π are given in Eq. 10 for a range of n, and we run Monte Carlo replicates for each n.
We see an improvement in SRC performance when ℓ2 normalization is applied. The Wilcoxon signed rank test rejects, with a small p-value, the null hypothesis that the error difference SRC_{error, ℓ2} − SRC_{error, no ℓ2} comes from a distribution with zero median.

Appendix

An event occurs "almost always" if, with probability 1, the event occurs for all but finitely many n.

Proposition 1. It always holds that σ_1(P_occ) ≤ σ_1(P_un) ≤ n.

Proof. Suppose the set of the contaminated vertices is I := {i_1, i_2, ..., i_l}. Let P'_s ∈ R^{|I|×|I|} denote the principal submatrix of P_un obtained by deleting the V \ I columns and the corresponding V \ I rows; P'_s is symmetric. Note that P_un = P_occ + P_s, where P_s is symmetric, P_s equals P'_s on the {i_1, ..., i_l} columns and rows, and P_s = 0 everywhere else; P_s is positive semidefinite, being a zero-padded principal submatrix of P_un = Z_un Z_un^T. By Weyl's Theorem [53],

σ_1(P_occ) + min_i λ_i(P_s) ≤ σ_1(P_occ + P_s) = σ_1(P_un).

Thus σ_1(P_occ) ≤ σ_1(P_un). Since P_un ∈ [0, 1]^{n×n}, P_un P_un^T = P_un P_un is a non-negative and symmetric matrix with entries bounded by n, so each row sum is bounded by n^2. Thus σ_1^2(P_un) = σ_1(P_un P_un^T) ≤ n^2, giving σ_1(P_un) ≤ n.

Proposition 2. It always holds that σ_{2d+1}(P_occ) = 0. It almost always holds that σ_{2d}(P_occ) ≥ min(p_o, 1 − p_o) α γ n and rank(P_occ) = 2d.

Proof. Gaussian elimination of B_occ gives

B_occ ∼ ( B_un  B_un ; 0_{K×K}  −B_un ).   (11)

Since rank(B_un) = d, rank(B_occ) = 2d. Then there exist µ = ( ν  0_{K×d} ; ν  −ν ) ∈ R^{2K×2d} and µ̃ = ( ν  0_{K×d} ; ν  ν ) ∈ R^{2K×2d} such that B_occ = µ µ̃^T. Let X_occ ∈ R^{n×2d} and X̃_occ ∈ R^{n×2d} with row u given by X_occ,u = µ_{Y_u} and X̃_occ,u = µ̃_{Y_u}. By the parametrization of the SBM as an RDPG model, P_occ = X_occ X̃_occ^T. Since X_occ and X̃_occ are at most rank 2d, σ_{2d+1}(P_occ) = 0.

Since the following holds,

µ µ^T = µ̃ µ̃^T = ( νν^T  νν^T ; νν^T  2νν^T )   (12)
       = ( νν^T  νν^T ; νν^T  νν^T ) + ( 0_{K×K}  0_{K×K} ; 0_{K×K}  νν^T ),

by Weyl's theorem [53],

min_i λ_i(µµ^T) = min_i λ_i(µ̃µ̃^T) ≥ min_i λ_i( νν^T  νν^T ; νν^T  νν^T ) + min_i λ_i( 0  0 ; 0  νν^T ) ≥ γ + 0 = γ.   (13)–(14)

Moreover, we have

min_{i∈[2K]}(π_occ,i) = min( p_o π_un,i, (1 − p_o) π_un,i ) ≥ min(p_o, 1 − p_o) γ.   (15)

The eigenvalues of P_occ P_occ^T are the same as the nonzero eigenvalues of X̃_occ^T X̃_occ X_occ^T X_occ. It almost always holds that n_i ≥ min(p_o, 1 − p_o) γ n for all i ∈ [2K], so that

X_occ^T X_occ = Σ_{i=1}^{2K} n_i µ_i µ_i^T = min(p_o, 1 − p_o) γ n µ^T µ + Σ_{i=1}^{2K} ( n_i − min(p_o, 1 − p_o) γ n ) µ_i µ_i^T.   (16)

The first term has its smallest eigenvalue bounded below by α min(p_o, 1 − p_o) γ n. This means λ_{2d}(X_occ^T X_occ) ≥ α min(p_o, 1 − p_o) γ n. By the exact same argument, λ_{2d}(X̃_occ^T X̃_occ) ≥ α min(p_o, 1 − p_o) γ n. X̃_occ^T X̃_occ X_occ^T X_occ is the product of two positive semi-definite matrices. Then

λ_{2d}( X̃_occ^T X̃_occ X_occ^T X_occ ) ≥ λ_{2d}( X̃_occ^T X̃_occ ) λ_{2d}( X_occ^T X_occ ) ≥ ( α min(p_o, 1 − p_o) γ n )^2.

This gives

σ_{2d}(P_occ) ≥ α min(p_o, 1 − p_o) γ n = min(p_o, 1 − p_o) α γ n.   (17)

Since σ_{2d}(P_occ) > 0 almost always and σ_{2d+1}(P_occ) = 0 always, rank(P_occ) = 2d.

Proposition 3. B_occ has d positive eigenvalues and d negative eigenvalues.

Proof. Let the eigendecomposition of B_un be given by ΞΨΞ^T, where Ξ ∈ R^{K×K} is orthogonal and Ψ = Diag(ψ_1, ..., ψ_K) ∈ R^{K×K} is diagonal.
We have the following congruence relation:

( B_un  B_un ; B_un  0_{K×K} ) = ( I_{K×K}  0_{K×K} ; I_{K×K}  I_{K×K} ) ( B_un  0_{K×K} ; 0_{K×K}  −B_un ) ( I_{K×K}  0_{K×K} ; I_{K×K}  I_{K×K} )^T
                               = ( Ξ  0_{K×K} ; Ξ  Ξ ) ( Ψ  0_{K×K} ; 0_{K×K}  −Ψ ) ( Ξ  0_{K×K} ; Ξ  Ξ )^T.   (18)

Hence B_occ and ( Ψ  0_{K×K} ; 0_{K×K}  −Ψ ) are congruent. By Sylvester's law of inertia [53], they have the same number of positive, negative and zero eigenvalues. Ψ has d positive diagonal entries since rank(B_un) = d. Similarly, −Ψ has d negative diagonal entries. Hence B_occ has d positive eigenvalues and d negative eigenvalues.

Proposition 4. Assuming |λ_1(P_occ)| ≥ |λ_2(P_occ)| ≥ ... ≥ |λ_n(P_occ)|, then |{i : λ_i(P_occ) > 0}| = |{i : λ_i(P_occ) < 0}| = d. That is, the number of positive eigenvalues of P_occ is the same as the number of negative eigenvalues of P_occ, and it equals d.

Proof. Let Z ∈ {0, 1}^{n×2K} denote the matrix where each row i is of the form (0, ..., 0, 1, 0, ..., 0), where the 1 indicates the block membership of vertex i in the occluded stochastic blockmodel. Then P_occ = Z B_occ Z^T. Note that P_occ has the same number of nonzero eigenvalues as Z^T Z B_occ. Let D_Z := Z^T Z ∈ N^{2K×2K} and note that D_Z is a diagonal matrix with nonnegative diagonal entries, where each diagonal entry denotes the number of vertices belonging to block k ∈ [2K]. With high probability, D_Z is positive definite, as the number of vertices in each block is positive. Then the number of nonzero eigenvalues of P_occ is the same as the number of nonzero eigenvalues of Z^T Z B_occ = D_Z B_occ = √D_Z √D_Z B_occ, which has the same nonzero eigenvalues as √D_Z B_occ √D_Z. By Sylvester's law of inertia [53], the number of positive eigenvalues of √D_Z B_occ √D_Z is the same as the number of positive eigenvalues of B_occ, and the number of negative eigenvalues of √D_Z B_occ √D_Z is the same as the number of negative eigenvalues of B_occ, thus proving our claim.

Proof. We first prove that an adjacency column from class q is asymptotically most correlated with another column of the same class, if and only if Equation 9 is satisfied. Suppose the first two vertices 1, 2 are from class 1, and vertex 3 is of class 2. Without loss of generality, let us prove that A_1 is asymptotically most correlated with A_2 if and only if Equation 9 is satisfied for q = 1. We expand the correlation between A_1 and A_2 as follows:

ρ(A_1, A_2) = Σ_{i=1}^n (A_{i1} A_{i2}) / sqrt( Σ_{i=1}^n A_{i1}^2 Σ_{i=1}^n A_{i2}^2 )
            = Σ_{i=1}^n (A_{i1} A_{i2}/n) / sqrt( Σ_{i=1}^n (A_{i1}^2/n) Σ_{i=1}^n (A_{i2}^2/n) )
            a.s.→ Σ_{k=1}^K (π_k B_{k1} B_{k1}) / sqrt( Σ_{k=1}^K (π_k B_{k1}) Σ_{k=1}^K (π_k B_{k1}) )
            = E(Q_1^2) / sqrt( E(Q_1) E(Q_1) ),

where the first line is by the definition of correlation, the second line follows by noting that the entries of A are 0 and 1, the third line follows by passing to the limit, and the fourth line simplifies the expression by our definition of {Q_q}. Note that we assumed known class memberships for the first three vertices, but they do not affect the asymptotic correlation and thus are not considered in the limit expression.

By a similar expansion, we have ρ(A_1, A_3) a.s.→ E(Q_1 Q_2) / sqrt( E(Q_1) E(Q_2) ). A_1 is asymptotically most correlated with A_2 if and only if

E(Q_1^2)/E(Q_1) > E(Q_1 Q_2) / sqrt( E(Q_1) E(Q_2) )
⇔ E(Q_1^2)/E(Q_1) > ρ_12 sqrt( E(Q_1^2) E(Q_2^2) ) / sqrt( E(Q_1) E(Q_2) )
⇔ sqrt( E(Q_2)/E(Q_1) ) > ρ_12 sqrt( E(Q_2^2)/E(Q_1^2) )
⇔ ρ_12 · sqrt( E(Q_2^2)/E(Q_1^2) ) < sqrt( E(Q_2)/E(Q_1) ).
Proof of Lemma 2. Without loss of generality, let $\alpha$ and $A_{s+1}$ be two adjacency columns from class 1, $A^{(s)} = [A_1 \mid \cdots \mid A_s]$, and $C = [c_1, \ldots, c_s]$. It suffices to prove that as $n \to \infty$, we always have
$$\rho(\alpha, A_{s+1}) > \rho(\alpha, C \cdot A^{(s)}) = \rho\Big(\alpha, \sum_{j=1}^s c_jA_j\Big) \quad (19)$$
for any non-negative vector $C$ and all possible $A^{(s)}$ whose columns are not of class 1. Note that Lemma 1 shows that Equation 9 is sufficient and necessary for Equation 19 to hold at $s = 1$ for any $C$; the same condition is still sufficient for Equation 19 to hold at any $s \geq 2$ under the additional assumption that $C$ is a non-negative vector; and when $s = 0$, Equation 19 trivially holds.

In the proof of Lemma 1, we already showed that $\rho(\alpha, A_{s+1}) \xrightarrow{a.s.} E(Q_1Q_1)/\sqrt{E(Q_1)E(Q_1)}$. Next we expand $\rho(\alpha, \sum_{j=1}^s c_jA_j)$ as follows:
$$\rho(\alpha, C \cdot A^{(s)}) = \rho\Big(\alpha, \sum_{j=1}^s c_jA_j\Big) = \frac{\sum_{i=1}^n \alpha_i\big(\sum_{j=1}^s c_jA_{ij}\big)}{\sqrt{\sum_{i=1}^n \alpha_i^2 \sum_{i=1}^n \big(\sum_{j=1}^s c_jA_{ij}\big)^2}} = \frac{\sum_{j=1}^s c_j\big(\sum_{i=1}^n \alpha_iA_{ij}/n\big)}{\sqrt{\big(\sum_{i=1}^n \alpha_i^2/n\big)\big(\sum_{i=1}^n \big(\sum_{j=1}^s c_jA_{ij}\big)^2/n\big)}}$$
$$\leq \frac{\sum_{j=1}^s c_j\big(\sum_{i=1}^n \alpha_iA_{ij}/n\big)}{\sqrt{\big(\sum_{i=1}^n \alpha_i^2/n\big)\big(\sum_{i=1}^n \sum_{j=1}^s c_j^2A_{ij}^2/n\big)}} \xrightarrow{a.s.} \frac{\sum_{j=1}^s c_j\sum_{k=1}^K \pi_kB_{k1}B_{ky_j}}{\sqrt{\big(\sum_{k=1}^K \pi_kB_{k1}\big)\sum_{j=1}^s c_j^2\sum_{k=1}^K \pi_kB_{ky_j}}} = \frac{\sum_{j=1}^s c_jE(Q_1Q_{y_j})}{\sqrt{\sum_{j=1}^s c_j^2E(Q_1)E(Q_{y_j})}},$$
where $y_j$ denotes the class membership of $A_j$. All other steps being routine, the inequality in the above expansion is due to $\big(\sum_{j=1}^s c_jA_{ij}\big)^2 \geq \sum_{j=1}^s c_j^2A_{ij}^2$, which holds because $c_j$ and $A_{ij}$ are always non-negative.

Therefore, in order to show that Equation 19 holds asymptotically, it suffices to prove that
$$\frac{E(Q_1Q_1)}{\sqrt{E(Q_1)E(Q_1)}} > \frac{\sum_{j=1}^s c_jE(Q_1Q_{y_j})}{\sqrt{\sum_{j=1}^s c_j^2E(Q_1)E(Q_{y_j})}}
\;\Leftrightarrow\; \frac{E(Q_1^2)}{E(Q_1)} > \frac{\sum_{j=1}^s \rho_{1y_j}c_j\sqrt{E(Q_1^2)E(Q_{y_j}^2)}}{\sqrt{\sum_{j=1}^s c_j^2E(Q_1)E(Q_{y_j})}}
\;\Leftrightarrow\; \sqrt{\sum_{j=1}^s c_j^2\,\frac{E(Q_{y_j})}{E(Q_1)}} > \sum_{j=1}^s \rho_{1y_j}c_j\sqrt{\frac{E(Q_{y_j}^2)}{E(Q_1^2)}}.$$
The last inequality holds when
$$\rho_{1y_j}^2 \cdot \frac{E(Q_{y_j}^2)}{E(Q_{y_j})} < \frac{E(Q_1^2)}{E(Q_1)},$$
which is exactly Equation 9 when $y_j \neq 1$. Therefore, Equation 9 and a non-negative $C$ are sufficient for Lemma 2 to hold.
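Before turning to the consistency argument, the following sketch (not from the paper) illustrates the non-negative sparse-representation decision rule on SBM adjacency columns. It uses SciPy's non-negative least squares as a stand-in for the $\ell_1$ or orthogonal matching pursuit solvers discussed in the text; the SBM parameters, the leave-one-out dictionary, and the class-residual rule are assumptions made only for illustration.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)

# Hypothetical two-block SBM (illustration only).
B = np.array([[0.5, 0.2],
              [0.2, 0.4]])
pi = np.array([0.5, 0.5])
n = 300

tau = rng.choice(2, size=n, p=pi)
P = B[np.ix_(tau, tau)]
A = np.triu((rng.random((n, n)) < P).astype(float), k=1)
A = A + A.T

# Leave-one-out dictionary: all other adjacency columns, normalized to unit norm.
test = 0
train = np.arange(1, n)
D = A[:, train]
D = D / np.linalg.norm(D, axis=0)
y = A[:, test] / np.linalg.norm(A[:, test])

# Non-negative least squares as a stand-in for the non-negative sparse solvers.
x, _ = nnls(D, y)

# Class-wise residual rule: keep only the coefficients of each class in turn
# and assign the class whose reconstruction is closest to the test column.
residuals = [np.linalg.norm(y - D @ np.where(tau[train] == k, x, 0.0))
             for k in range(2)]
print("true class:", tau[test], "predicted class:", int(np.argmin(residuals)))
```

On typical draws the recovered coefficients concentrate on training columns from the test vertex's own block, so the smallest class residual identifies the correct label.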
Proof. In order to prove that SRC is a consistent classifier with $L_n \to 0$, it suffices to prove that the adjacency matrix generated by the SBM satisfies a principal angle condition in [7]. This consistency holds for either $\ell_1$ minimization or orthogonal matching pursuit at any $s \geq 1$, assuming the sparse coefficient $x$ is non-negative.

For the adjacency matrix under the SBM, suppose $\alpha$ is a fixed adjacency column of class $q$, $A_{s+1}$ is a random adjacency column of class $q$, and $A^{(s)}$ is a random matrix whose columns are not of class $q$. Then to show $L_n \to 0$ at any $s$, the principal angle condition requires that $\theta(\alpha, A_{s+1}) < \theta(\alpha, A^{(s)})$ for all possible $\alpha$ under the SBM.

The principal angle condition is used in two places in SRC: first, it guarantees that the sub-matrix selected by SRC contains at least one observation of the correct class; second, it guarantees that the sparse coefficients with respect to the correct class dominate the sparse representation. Then, if the data are always non-negative (which always holds for a graph adjacency matrix) and the sparse representation is non-negative (which can be relaxed to bounded below in [7]), such dominance is sufficient for correct classification by SRC.

Lemma 1 proves the principal angle condition at $s = 1$, which is sufficient for the first point above. Lemma 2 proves the principal angle condition at any $s$, which is equivalent to the second point above under the non-negativity constraint. Therefore we establish SRC consistency for the SBM.

Note that there are two small differences: First, the original condition is not in limit form, while Lemma 1 and Lemma 2 are proved asymptotically for the SBM; this change has no effect on classification consistency. Second, in [7] we separate the principal angle condition from the non-negativity constraint, while in Lemma 2 we effectively fold the non-negativity constraint into the proof of the principal angle condition; this does not affect the result either, because when the sparse coefficients are constrained to be non-negative, the principal angles between the two subspaces are constrained accordingly.

ACKNOWLEDGEMENTS

This work is partially supported by a National Security Science and Engineering Faculty Fellowship (NSSEFF), the Johns Hopkins University Human Language Technology Center of Excellence (JHU HLT COE), and the XDATA program of the Defense Advanced Research Projects Agency (DARPA) administered through Air Force Research Laboratory contract FA8750-12-2-0303. We would also like to thank Minh Tang for his thoughtful discussions.

REFERENCES

[1] D. L. Sussman, M. Tang, D. E. Fishkind, and C. E. Priebe, “A consistent adjacency spectral embedding for stochastic blockmodel graphs,” Journal of the American Statistical Association, vol. 107, no. 499, pp. 1119–1128, 2012.
[2] D. E. Fishkind, D. L. Sussman, M. Tang, J. T. Vogelstein, and C. E. Priebe, “Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown,” SIAM Journal on Matrix Analysis and Applications, vol. 34, no. 1, pp. 23–39, 2013.
[3] D. Sussman, M. Tang, and C. Priebe, “Universally consistent latent position estimation and vertex classification for random dot product graphs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, accepted, 2012.
[4] M. Tang, D. L. Sussman, and C. E. Priebe, “Universally consistent vertex classification for latent positions graphs,” Annals of Statistics, accepted, 2012.
[5] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.
[6] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan, “Sparse representation for computer vision and pattern recognition,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1031–1044, 2010.
[7] C. Shen, L. Chen, and C. E. Priebe, “Sparse representation classification beyond ℓ1 minimization and the subspace assumption,” submitted, http://arxiv.org/abs/1502.01368.
[8] L. Devroye, L. Györfi, and G. Lugosi, A probabilistic theory of pattern recognition. New York: Springer, 1996, vol. 31.
[9] D. B. West, Introduction to graph theory. Prentice Hall, Englewood Cliffs, 2001, vol. 2.
[10] B. Bollobás, Random graphs. Cambridge University Press, 2001, vol. 73.
[11] P. D. Hoff, A. E. Raftery, and M. S. Handcock, “Latent space approaches to social network analysis,” Journal of the American Statistical Association, vol. 97, no. 460, pp. 1090–1098, 2002.
[12] S. J. Young and E. R. Scheinerman, “Random dot product graph models for social networks,” in Algorithms and Models for the Web-Graph. Springer, 2007, pp. 138–149.
[13] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,” Social Networks, vol. 5, no. 2, pp. 109–137, 1983.
[14] P. J. Bickel, A. Chen, and E. Levina, “The method of moments and degree distributions for network models,” The Annals of Statistics, vol. 39, no. 5, pp. 2280–2301, 2011.
[15] K. Rohe, S. Chatterjee, and B. Yu, “Spectral clustering and the high-dimensional stochastic blockmodel,” The Annals of Statistics, vol. 39, no. 4, pp. 1878–1915, 2011.
[16] A. Athreya, V. Lyzinski, D. J. Marchette, C. E. Priebe, D. L. Sussman, and M. Tang, “A limit theorem for scaled eigenvectors of random dot product graphs,” arXiv preprint arXiv:1305.7388, 2013.
[17] M. S. Handcock, A. E. Raftery, and J. M. Tantrum, “Model-based clustering for social networks,” Journal of the Royal Statistical Society: Series A (Statistics in Society), vol. 170, no. 2, pp. 301–354, 2007.
[18] J. Lei and A. Rinaldo, “Consistency of spectral clustering in stochastic block models,” The Annals of Statistics, vol. 43, no. 1, pp. 215–237, 2014.
[19] K. Chaudhuri, F. C. Graham, and A. Tsiatas, “Spectral clustering of graphs with general degrees in the extended planted partition model,” in COLT, 2012, pp. 35–1.
[20] Y. Chen, S. Sanghavi, and H. Xu, “Clustering sparse graphs,” in Advances in Neural Information Processing Systems, 2012, pp. 2204–2212.
[21] S. Balakrishnan, M. Xu, A. Krishnamurthy, and A. Singh, “Noise thresholds for spectral clustering,” in Advances in Neural Information Processing Systems, 2011, pp. 954–962.
[22] V. Lyzinski, D. L. Sussman, M. Tang, A. Athreya, C. E. Priebe et al., “Perfect clustering for stochastic blockmodel graphs via adjacency spectral embedding,” Electronic Journal of Statistics, vol. 8, no. 2, pp. 2905–2922, 2014.
[23] L. Chen and M. Patton, “Stochastic blockmodelling for online advertising,” arXiv preprint arXiv:1410.6714, 2014.
[24] M. Tang, D. L. Sussman, C. E. Priebe et al., “Universally consistent vertex classification for latent positions graphs,” The Annals of Statistics, vol. 41, no. 3, pp. 1406–1430, 2013.
[25] C. E. Priebe, D. L. Sussman, M. Tang, and J. T. Vogelstein, “Statistical inference on errorfully observed graphs,” Journal of Computational and Graphical Statistics, just accepted, 2014.
[26] D. E. Fishkind, V. Lyzinski, H. Pao, L. Chen, and C. E. Priebe, “Vertex nomination schemes for membership prediction,” arXiv preprint arXiv:1312.2638, 2013.
[27] L. Chen, “Pattern Recognition on Random Graphs,” Ph.D. dissertation, Johns Hopkins University, Baltimore, MD, May 2015.
[28] J. E. Jackson, A user’s guide to principal components. John Wiley & Sons, 2005, vol. 587.
[29] M. Zhu and A. Ghodsi, “Automatic dimensionality selection from the scree plot via the use of profile likelihood,” Computational Statistics & Data Analysis, vol. 51, no. 2, pp. 918–930, 2006.
[30] D. Donoho and Y. Tsaig, “Fast solution of l1-norm minimization problems when the solution may be sparse,” IEEE Transactions on Information Theory, vol. 54, no. 11, pp. 4789–4812, 2008.
[31] Z. Yang, C. Zhang, J. Deng, and W. Lu, “Orthonormal expansion l1-minimization algorithms for compressed sensing,” arXiv preprint arXiv:1108.5037, 2011.
[32] J. A. Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231–2242, 2004.
[33] J. A. Tropp and A. C. Gilbert, “Signal recovery from random measurements via orthogonal matching pursuit,” IEEE Transactions on Information Theory, vol. 53, no. 12, pp. 4655–4666, 2007.
[34] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2765–2781, 2013.
[35] D. L. Donoho, “For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution,” Communications on Pure and Applied Mathematics, vol. 59, no. 6, pp. 797–829, 2006.
[36] D. L. Donoho and M. Elad, “Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization,” Proceedings of the National Academy of Sciences, vol. 100, no. 5, pp. 2197–2202, 2003.
[37] R. Gribonval and M. Nielsen, “Sparse representations in unions of bases,” IEEE Transactions on Information Theory, vol. 49, no. 12, pp. 3320–3325, 2003.
[38] E. J. Candes and T. Tao, “Decoding by linear programming,” IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
[39] M. Elad and A. M. Bruckstein, “A generalized uncertainty principle and sparse representation in pairs of bases,” IEEE Transactions on Information Theory, vol. 48, no. 9, pp. 2558–2567, 2002.
[40] R. Rubinstein, A. M. Bruckstein, and M. Elad, “Dictionaries for sparse representation modeling,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1045–1057, 2010.
[41] A. Bruckstein, M. Elad, and M. Zibulevsky, “On the uniqueness of nonnegative sparse solutions to underdetermined systems of equations,” IEEE Transactions on Information Theory, vol. 54, no. 11, pp. 4813–4820, 2008.
[42] N. Meinshausen, “Sign-constrained least squares estimation for high-dimensional regression,” Electronic Journal of Statistics, vol. 7, pp. 1607–1631, 2013.
[43] M. Slawski and M. Hein, “Non-negative least squares for high-dimensional linear models: consistency and sparse recovery without regularization,” Electronic Journal of Statistics, vol. 7, pp. 3004–3056, 2013.
[44] D. Marchette, C. Priebe, and G. Coppersmith, “Vertex nomination via attributed random dot product graphs,” in Proceedings of the 57th ISI World Statistics Congress, vol. 1121, 2011, p. 1126.
[45] E. R. Scheinerman and K. Tucker, “Modeling graphs using dot product representations,” Computational Statistics, vol. 25, no. 1, pp. 1–16, 2010.
[46] D. H. Hall and R. Russell, “The posterior nervous system of the nematode Caenorhabditis elegans: serial reconstruction of identified neurons and complete pattern of synaptic interactions,” The Journal of Neuroscience, vol. 11, no. 1, pp. 1–22, 1991.
[47] R. Goldschmidt, “Das Nervensystem von Ascaris lumbricoides und megalocephala, I,” Z. wiss. Zool., vol. 90, pp. 73–136, 1908.
[48] L. Chen, J. T. Vogelstein, V. Lyzinski, and C. E. Priebe, “A joint graph inference case study: the C. elegans chemical and electrical connectomes,” submitted to Significance, under review, 2015.
[49] L. R. Varshney, B. L. Chen, E. Paniagua, D. H. Hall, and D. B. Chklovskii, “Structural properties of the Caenorhabditis elegans neuronal network,” PLoS Computational Biology, vol. 7, no. 2, p. e1001066, 2011.
[50] M. E. Newman, “Finding community structure in networks using the eigenvectors of matrices,” Physical Review E, vol. 74, no. 3, p. 036104, 2006.
[51] L. A. Adamic and N. Glance, “The political blogosphere and the 2004 US election: divided they blog,” in Proceedings of the 3rd International Workshop on Link Discovery. ACM, 2005, pp. 36–43.
[52] S. Olhede and P. Wolfe, “Network histograms and universality of blockmodel approximation,” Proceedings of the National Academy of Sciences of the USA, vol. 111, pp. 14722–14727, 2014.
[53] R. A. Horn and C. R. Johnson, Matrix analysis. Cambridge University Press, 2012.