Multi-view Banded Spectral Clustering with Application to ICD9 Clustering
MMulti-view Banded Spectral Clustering with Application toICD9 Clustering
Luwan Zhang , Katherine Liao , Isaac Kohane , Tianxi Cai Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston MA Division of Rheumatology, Brigham and Women’s Hospital, Boston MA Department of Biomedical Informatics, Harvard Medical School, Boston MA
Abstract
Despite recent development in methodology, community detection remains a challengingproblem. Existing literature largely focuses on the standard setting where a network is learnedusing an observed adjacency matrix from a single data source. Constructing a shared networkfrom multiple data sources is more challenging due to the heterogeneity across populations.Additionally, no existing method leverages the prior distance knowledge available in manydomains to help the discovery of the network structure. To bridge this gap, in this paperwe propose a novel spectral clustering method that optimally combines multiple data sourceswhile leveraging the prior distance knowledge. The proposed method combines a banding stepguided by the distance knowledge with a subsequent weighting step to maximize consensusacross multiple sources. Its statistical performance is thoroughly studied under a multi-viewstochastic block model. We also provide a simple yet optimal rule of choosing weights inpractice. The efficacy and robustness of the method is fully demonstrated through extensivesimulations. Finally, we apply the method to cluster the International classification of diseases,ninth revision (ICD9), codes and yield a very insightful clustering structure by integratinginformation from a large claim database and two healthcare systems.
Keywords: multi-view, banding, spectral clustering, community detection, stochasticblock model 1 a r X i v : . [ s t a t . M E ] J un Introduction
Collapsing interchangeable or highly similar entities into a single group is a problem of great impor-tance and necessity in a wide range of areas. It enables signal enhancement, dimension reductionand variable selection, while ensuring reproducibility and interpretability on subsequent researchresults. The concept of “nearly equivalent” entities naturally arises in many fields presenting invarious forms to serve for different research purposes. For example, to identify genetic variants thatincrease susceptibility to a disease (or other phenotype of interest), it can be nearby markers thatare usually correlated and hence form a linkage disequilibrium (LD) block. In natural languageprocessing, it can be synonyms entitled with a similar meaning and occurring in a similar context.In phenome-wide association studies (PheWAS), it can be the International Classification of Dis-ease (ICD) codes that essentially describe the same disease but differ in details such as the affectedanatomical areas. Similarly, in brain image analysis, voxels in the same region with a commonneurological function can also be viewed as interchangeable.The problem of collapsing interchangeable entities can be translated into a statistical problemof community detection. More specifically, let v i refer to the i -th entity and V = { v i } ni =1 representthe whole vertex set, the goal is to seek a partition such that V = ∪ k V k , V k ∩ V l = ∅ , ∀ k (cid:54) = l. where V k denotes the k -th community and any entity pair ( v i , v j ) from this group is stochasticallyequivalent. To infer such partition with a single observed similarity matrix W , many statisticalmethods have been proposed and witnessed huge success in numerous applications (Shi and Malik,2000; Ng et al., 2002; Newman, 2006; Bickel and Chen, 2009; Zhao et al., 2012, e.g.). 
Statisticalproperties of the clustering have also been established under the framework of stochastic blockmodel (SBM) (Holland et al., 1983) and its extension to allow for degree heterogenity (Karrer andNewman, 2011; Rohe et al., 2011; Qin and Rohe, 2013; Lei et al., 2015; Jin et al., 2015, e.g.).When multiple similarity matrices from different data sources become available, an importanttask is to develop an effective synergistic integration strategy that leads to a better inference on2he underlying network structure. To this end, a series of multi-view clustering methods have beenproposed (Blaschko and Lampert, 2008; Chaudhuri et al., 2009; Cai et al., 2011; Kumar and Daum´e,2011; Xia et al., 2014, e.g.), where a data source is also termed as a view. Despite their empiricalsuccess, little theoretical justifications have been provided until recently Han et al. (2014) and Paulet al. (2016) showed some consistency results. In addition, optimal approaches of combining allviews to account for their heterogeneity remain elusive.Another limitation of these existing methods is that they require input being unweighted, binary-valued similarity matrices. This would largely hamper the applicability to settings with similaritymatrices being weighted and real-valued. In particular, real-valued measures of similarity are fre-quently used in numerous contemporary biological and clinical applications. For example, with therecent emergence of word2vec algorithms (Mikolov et al., 2013 a), biological sequences (e.g. genes,proteins), clinical concepts describing disease conditions, and ICD codes have been represented byEuclidean vectors (Asgari and Mofrad, 2015; Nguyen et al., 2016; Choi et al., 2016; Ng, 2017, e.g.),with pairwise similarity frequently summarized by the real-valued cosine score.In addition to multi-view, prior knowledge on the distance between the nodes is often availablethrough established ontologies, especially in biomedical domains. 
Such distance information canpotentially help the discovery of the network structure in that more distant nodes suggested by theontology are less likely grouped together. For example, when nodes represent chromosomal loci,clusters of homogeneous genes tend to be adjacent genes. For brain graph connectivity, spatiallyfurther apart voxels are less likely considered as belonging to the same biological group. Hierarchicalstructures have also been curated for a wide range of clinical concepts. For example, the unifiedmedical language system (UMLS) provides a relational database for clinical terms used in medicallanguage (Humphreys and Lindberg, 1993). Relationships between different disease phenotypes,medical concepts are described in the ICD hierarchy and the human phenotype ontology such asSNOMED-CT and MedDRA. Nodes further apart on these hierarchies less likely belong to the samegroup.To the best of our knowledge, no existing clustering method incorporates multiple, say m , real-3alued similarity matrices { W s , s = 1 , ..., m } or leverages prior information on the distance betweenthe nodes. In this paper, we propose a novel two-step muti-view banded spectral clustering (mvBSC)method to bridge this gap. The mvBSC method leverages the prior knowledge by restrictingthe parameter space to a class of decaying matrices and integrates information from m sourcesby performing spectral clustering on a convex combination of m membership-encoded matrices.Although the performance of clustering accuracy improves due to banding, our procedure is robustto the banding assumption and remains valid in the absence of this operation.The rest of paper is organized as follows. In Section 2, we give a formal description of the multi-view stochastic block model and provide assumptions on the parameter space. Section 3 detailsthe proposed multi-view banded spectral clustering method and provides theoretical justificationsunder the multi-view stochastic block model. 
Simulations are given in Section 4 to demonstrate theefficacy and robustness of the proposed method. In Section 5, we apply the proposed method tothe ICD9 coding system and yield a very insightful clustering structure by integrating informationfrom a large claim database and two healthcare systems. Concluding remarks and discussions aregiven in Section 6. For any matrix A ∈ R p × p , let A i · and A · i respectively denote the i th row and column of A , andlet (cid:107) A (cid:107) F and (cid:107) A (cid:107) respectively denote its Frobenius norm and spectral norm. For any two matrices A , B ∈ R p × p , A (cid:46) B means that A ≤ c B for some constant c > A (cid:38) B means that A ≥ c B for some constant c > A (cid:16) B is equivalent as A (cid:46) B (cid:46) A . For any vector a = ( a , ..., a p ) T , letdiag( a ) denote the corresponding p × p diagonal matrix with diagonals being a and (cid:107) a (cid:107) be its (cid:96) -norm. Let p = (1 , ..., T denote the all-one vector in R p and a ≥ x, y ∈ R , x = o ( y ) means xy = o (1). Let I ( · ) denote the indicator function.Suppose the data for analysis consist of m similarity matrices, { W s , s = 1 , ..., m } , representing m undirected weighted graphs, G s = {V , W s } , where W s = [ W sij ] n × n , W sij is the similarity between4he nodes v i and v j based on the s th view, and we assume that these m graphs share the samevertex set V = { v i } ni =1 which has a non-overlapping K -partition network structure V = {∪ Kk =1 V k , V k ∩ V l = ∅ , ∀ ≤ k < l ≤ K } , where V k = { v i } i : g i = k and g i ∈ { , ..., K } indexes which group v i belongs to. Let n k := |V k | denotethe size of the k -th cluster. The partition can also be represented by a group membership matrix Z ∗ = [ Z ∗ ik ] n × K ∈ Z n,K , where Z ∗ ik = I ( v i ∈ V k ) = I ( g i = k ) , and Z n,K consists of all possible K -group membership matrices for n nodes. 
Denote its associatedclass of projector matrices by P n,K = (cid:110) Z [diag( T n Z )] − Z T : Z ∈ Z n,K (cid:111) In particular, denote by P Z the projector matrix Z [diag( T n Z )] − Z T that projects any n -dimensionalvector to the K -dimensional subspace spanned by the columns of Z . Throughout, we assume that K is known and remains as a constant for all theoretical analyses. Strategies for choosing K inpractice will be discussed in Section 5. We aim to optimally combine information from the m-views, { W s = [ W sij ] n × n , s = 1 , ..., m } , to learnabout the network structure through an mvSBM such that ∀ ≤ i < j ≤ n , E Z ∗ W sij ≡ W sij = Ω sg i g j , W s = [ W sij ] j =1 ,...,ni =1 ,...,n = Z ∗ Ω s Z ∗ T , | W sij | ≤ L, Var Z ∗ ( W sij ) = σ s , (2.1)where Ω s = [Ω skl ] l =1 ,...,Kk =1 ,...,K ∈ [ − L, L ] K × K is a symmetric and positive definite matrix of rank K , L < ∞ is a constant, E Z ∗ and Var Z ∗ respectively denote the expectation and variance given the membershipmatrix Z ∗ . Thus, under the mvSBM, the hidden membership matrix Z ∗ is shared across all mviews, but the connection intensities encoded by Ω s may vary. For each diagonal element W sii ,it can be either considered as a yet another independent bounded random variable following themodel specified in (2.1), or it can be treated as a non-random constant ω ∈ [ − L, L ]. Without loss5f generality, we consider hereafter the latter case , which includes most commonly used similaritymeasures.
Remark 1.
The traditional definition of a SBM given in Holland et al. (1983) requires that each W s is 0/1 valued in that each off-diagonal entry corresponds to an independent Bernoulli randomvariable whose probability depends only on the block memberships of the two nodes. Here we extendto a real-valued setting to allow for more generality. In fact, Karrer and Newman (2011) was thefirst attempt to extend the applicability of a SBM in which a Poisson model is imposed to allow formultiple edges between any two nodes. We further leverage the prior knowledge on the distance between nodes under the assumptionthat nodes further apart are less likely to be grouped together. Specifically, let d : V ×V (cid:55)→ [0 , ∞ ) bethe distance metric, which satisfies the well-known non-negativity, symmetry and identity property: ∀ ( v i , v j ) ∈ V × V , d ij ≡ d ( v i , v j ) ≥ , d ( v i , v j ) = d ( v j , v i ) , d ( v i , v j ) = 0 ⇔ v i = v j Here the triangular inequality assumption on d ( · , · ) is not required. For example, if V denotes allstrings of ICD9 codes, one simple way to define the distance metric is based on their numericalrepresentations, e.g., d (“331 . , “331 . | . − . | = 0 .
8. An alternative distance is basedon the number of steps needed to connect two codes on the ICD9 hierarchy tree. For k = 1 , ..., K ,we define the centroid node of the cluster V k as v c k , where c k := min i : argmin v i ∈V k (cid:88) v j ∈V k d ij and n k := |V k | is the cardinality of V k . To leverage the prior knowledge that nodes further apartless likely belong to the same group, we assume that( C1 ) There exists some δ > such that d ( v i , v c k ) ≤ δ, ∀ v i ∈ V k , k = 1 , . . . , K . Obviously, δ is the radius of each cluster. More than that, it can be used as a lever to characterizethe confidence level cast on the prior distance knowledge. More precisely, a smaller δ leads to6ore engagement of the prior knowledge while a larger δ downplays its role. More discussions onthe effect of δ in our methodolgy will be given in Section 3. The adoption of ( C1 ) enables us toemploy thresholding on any pairwise similarity whose pairwise distance is beyond 2 δ . For example,in Figure 2.1, δ only keeps within-in cluster pairwise similarities while δ entails no thresholdingsince all nodes are encompassed in the outter black dashed circle. Without loss of generality, we δ δ δ δ Figure 2.1: Effect of δ : two different choices of δ lead to two different thresholding schemes. Nodesfrom the same cluster are in the same color and symbol.assume δ > d := min ≤ i
Suppose Ω s ∈ F α s defined in (2.2) and V satisfies condition ( C1 ), given the member-ship matrix Z ∗ , then W s := Z ∗ Ω s Z ∗ T resides in the following class of matrices H α s := H ( α s , δ, L, d , β ) = A = [ A ij ] j =1 ,...,ni =1 ,...,n ∈ [ − L, L ] n × n : A = A T , max ≤ i ≤ n (cid:88) j : d ( v i ,v j ) >h | A ij | ≤ Ln max (cid:18) h − δd (cid:19) − α s , for all h > δ, and < n min β ≤ γ K ( A ) ≤ γ ( A ) ≤ n max /β (cid:27) (2.3) where n min = min k n k , n max = max k n k . In this section, we first summarize our proposed mvBSC procedure in Algorithm 1 and then give abrief explanation on the reasoning behind it. We then discuss the theoretical results supporting thevalidity and optimality of the proposed algorithm as well as the appropriate choices of the bandingvector h = ( h , ..., h m ) T and the weighting vector λ = ( λ , ..., λ m ) T . Without loss of generality, werestrict our attention to the convex combination, λ ≥ , T m λ = 1.8 .1 The mvBSC procedure The proposed mvBSC procedure is summarized as follows:
Input: m n × n similarity matrices W , . . . , W m , pairwise distances { d ij } j =1 ,...,ni =1 ,...,n , number ofgroups K , a set of banding parameters h , ..., h m , and a set of weighting parameters λ , . . . , λ m ≥ , (cid:80) ms =1 λ s = 1. Output:
Membership matrix (cid:98) Z ∗ λ ∈ Z n,K for s = 1:m do (1) B h s ( W s ) := [ W sij I ( d ij ≤ h s )] j =1 ,...,ni =1 ,...,n ← banding W s using the banding parameter h s .(2) (cid:98) U s ← the matrix of concatenating K eigenvectors corresponding to the first K largestabsolute eigenvalues of B h s ( W s ). end (cid:98) U ∗ λ ← the matrix of concatenating K eigenvectors corresponding to the first K largesteigenvalues of (cid:80) ms =1 λ s (cid:98) U s (cid:98) U s T .Treating each row of (cid:98) U ∗ λ as a point in R K , run k-means to cluster these n points into K groups and obtain a corresponding membership matrix (cid:98) Z ∗ λ . Result: (cid:98) Z ∗ λ Algorithm 1:
Multi-view banded spectral clustering (mvBSC)Algorithm 1 involves two key steps: operating banding to each similarity matrix W s and per-forming spectral clustering on a convex combination of m projector matrices (cid:98) U s (cid:98) U s T . To see thereasoning, we first examine the eigen-space of W s . Let∆ = [diag( T n Z ∗ )] / = diag( √ n , . . . , √ n K ) , U ∗ = Z ∗ ∆ − , and (cid:101) Ω s := ∆Ω s ∆ = [Ω kl √ n k n l ] l =1 , ··· ,mk =1 , ··· ,m with eigen-decomposition (cid:101) Ω s = V s D s V s T . Then the eigen-decomposition of W s is given by W s = U s D s U s T = U ∗ ( V s D s V s T ) U ∗ T , where U s = Z ∗ ∆ − V s = U ∗ V s . Clearly, U s is a rotation of U ∗ by V s and they correspond to the same K -dimensional subspaceexpressed by P Z ∗ = U ∗ U ∗ T = U s U s T . Since (∆ − V s ) − exists, U si · = U sj · ⇔ g i = g j , meaning that v i and v j are in the same group if and only if their corresponding rows in U s are the same. So theeigen-space of W s is membership structured and recovering U s is equivalent as recovering Z ∗ .Now consider the difference between W s and W s : W s − W s = ( W s − E Z ∗ W s ) + diag( E Z ∗ W s − W s ) = ( W s − E Z ∗ W s ) + diag( W s − W s ) , W s ) = diag( E Z ∗ W ) because the diagonals of W s are constants. Although the deviationfrom the eigenspace of W s to that of W s is always upper bounded by the operator norm of theirdifference, with no further information on the structure of W s , one can at most know the operatornorm deviation is on the scale of √ n since the noise matrix ( W s − E Z ∗ W s ) is on this scale and theremaining diagonal matrix can be treated as a constant and hence makes negligible contributions.Fortunately, Theorem 3.1 sheds light on the benefit of banding to significantly reduce the upperbound to the scale of max (cid:26)(cid:18) n αs +1 max (log n ) αs αs +1 (cid:19) , (cid:113) δd log n, log n (cid:27) . 
Therefore, the space spannedby (cid:98) U s would show more resemblance to the space spanned by U ∗ .The other key step is the use of a weighted average (cid:80) ms =1 λ s (cid:98) U s (cid:98) U s T to estimate P Z ∗ . Since each (cid:98) U s (cid:98) U s T can be viewed as a stand-alone estimator of P Z ∗ , (cid:98) U s (cid:98) U s T can be decomposed into P Z ∗ + E s ,where E s is a symmetric error matrix. As an analogy of a weighted least square problem, to allowfor heterogeity in noise corruptions, given a weight vector λ , it is ideal to find (cid:101) Z ∗ λ = argmin Z ∈ Z n,K m (cid:88) s =1 λ s (cid:107) (cid:98) U s (cid:98) U s T − P Z (cid:107) F (3.1)From a regularization perspective, these weights essentially put penalty on each view so that theycan be dragged to a common ground to maximize consensus expressed by (cid:101) Z ∗ λ . However, the solutionof (3.1) is NP-hard to find, and the alternative is to find the un-constrained solution (cid:80) ms =1 λ s (cid:98) U s (cid:98) U s T as a surrogate whose eigenvectors can be subsequently used to reconstruct Z ∗ .In summary, the two sets of parameters h and λ in the proposed mvBSC procedure play or-thogonal but complementary roles. The banding parameter h s maximally attempts to improveeach individual performance, while the weight parameter λ s allows for efficient messsage passinghorizontally to make up each other’s deficiencies. To study the statistical performance of the proposed mvBSC procedure(Algorithm 1), it is importantto realize that errors consist of two parts–distance from U ∗ to (cid:98) U ∗ λ and the membership misallocationarising from the k-means step. 
The minimal distance between U ∗ and (cid:98) U ∗ λ is equivalent to the distance10etween their respective subspace P Z ∗ and (cid:98) U ∗ λ (cid:98) U ∗ T λ (Vu et al., 2013)12 inf Q (cid:107) (cid:98) U ∗ λ − U ∗ Q (cid:107) F ≤ (cid:107) (cid:98) U ∗ λ (cid:98) U ∗ T λ − P Z ∗ (cid:107) F ≤ inf Q (cid:107) (cid:98) U ∗ λ − U ∗ Q (cid:107) F (3.2)where Q is a K × K orthonormal matrix. To evaluate the quality of the final k-means step, it isnecessary to first define a “mis-clustered” node. To this end, recall that the k-means obtains( (cid:98) Z ∗ λ , (cid:98) A λ ) = argmin Z ∈ Z n,K , A ∈R K × K (cid:13)(cid:13)(cid:13)(cid:98) U ∗ λ − ZA (cid:13)(cid:13)(cid:13) F (3.3)in which (cid:98) A T λ k · ∈ R K represents the k -th centroid (Steinhaus, 1956). Intuitively, if U ∗ i · Q is closer to (cid:98) A λ gi · than it is to any other (cid:98) A λ k · for k (cid:54) = g i , then node v i can be correctly clustered. Definition 1 (set of mis-clustered nodes) . Given Z ∗ ∈ Z n,K , let U ∗ = Z ∗ ∆ − , (cid:98) U ∗ λ given in Algo-rithm 1 , (cid:98) A λ defined in (3.3) and Q be a K × K orthonormal matrix satisfying (3.2), define M λ as the set of mis-clustered nodes M λ = (cid:26) i : (cid:107) (cid:98) A λ gi · − U ∗ i · Q (cid:107) ≥ (cid:112) / n max (cid:27) Remark 2.
The definition of M λ is a sufficient condition to ensure (cid:107) (cid:98) A λ gi · − U ∗ i · Q (cid:107) ≤ (cid:107) (cid:98) A λ gi · − U ∗ j · Q (cid:107) for any g j (cid:54) = g i , which was firstly considered in Rohe et al. (2011). The error analysis inthis paper mainly addresses the global optimum of (3.3), and this optimization problem could sufferfrom local optima in practice. Theorem 3.1 (Optimal Choice of Banding Parameter h ) . Given the membership matrix Z ∗ ∈ Z n,K ,consider a similarity matrix W ∈ [ − L, L ] n × n , generated according to (2.1), in which W = Z ∗ Ω Z ∗ T and Ω ∈ F α defined in (2.2), if h − δd (cid:16) (cid:16) n max √ log n (cid:17) α +1 , the mean absolute operator-norm error loss E Z ∗ (cid:107) B h ( W ) − W (cid:107) (cid:46) max (cid:40)(cid:18) n α +1 max (log n ) α α +1 (cid:19) , (cid:114) δd log n, log n (cid:41) More specifically, if n max (cid:38) (log n ) α +1 , E Z ∗ (cid:107) B h s ( W ) − W (cid:107) (cid:46) (cid:18) n α +1 max (log n ) α α +1 (cid:19) if δd = o (cid:18)(cid:16) n max √ log n (cid:17) α +1 (cid:19)(cid:113) δd log n otherwise nd for a finely-sliced network, i.e. n max = o ([log n ] α +1 ) , E Z ∗ (cid:107) B h ( W ) − W (cid:107) (cid:46) (cid:40) (cid:113) δd log n if δd (cid:38) log n log n otherwise Theorem 3.1 explicitly shows the effect of banding in reducing the mean absolute operator-normerror loss whose upper bound is a function of the key model parameters ( n, n max , δ, α ). The optimalbanding parameter h is reflective of the decay rate in the sense that a smaller α yields a larger h .Adopting a conservative δ would lead the mean absolute operator-norm error loss to the order of (cid:113) δd log n . 
In the extreme scenario where δd (cid:46) n , suggesting that one is too conservative to performbanding, the mean absolute operator-norm error loss given in Theorem 3.1 is upper bounded by √ n log n , reducing to the standard result up to a √ log n factor on a random matrix whose entries areindependent copies of a random variable with zero mean, unit variance, and finite fourth moment. Theorem 3.2 (mis-clustered error rate) . Given the membership matrix Z ∗ ∈ Z n,K , consider asequence of similarity matrices W , . . . , W m , generated independently according to (2.1), in which W s = Z ∗ Ω s Z ∗ T and Ω s ∈ F α s defined in (2.2), and γ sn,K = γ K ( W s ) . Given a weighting vector λ , let M λ be the set of mis-clustered nodes given in Definition 1, then with probability at least − m/n , |M λ | n = O p n max n m (cid:88) s =1 (cid:32) λ s γ sn,K (cid:33) max (cid:26) n αs +1 max (log n ) αs αs +1 , δd log n, (log n ) (cid:27) Corollary 3.1.
Suppose Ω , ..., Ω m ∈ F α for some α ≥ , and the underlying Z ∗ exhibits a balancedcommunity structure, i.e., n max (cid:16) n min , then with probability at least − m/n ,(i) if n max (cid:38) (log n ) α +1 , |M λ | n = O p (cid:34) log nnn max max (cid:40)(cid:18) n max √ log n (cid:19) α +1 , δd (cid:41)(cid:35) (ii) if n max = o ([log n ] α +1 ) , |M λ | n = O p (cid:26) log nnn max max (cid:18) δd , log n (cid:19)(cid:27) .3 Optimal choice of { λ s } ms =1 A remaining important question is that in what sense these m views can be optimally combinedvia the weighting parameters λ = ( λ , ..., λ m ) T ? Ideally, a desirable set of weighting parameters λ ∗ shall minimize the population-level mis-clustered node size E Z ∗ |M λ | and thus can be referred asthe oracle weighting vector. Despite its attractiveness, an explicit form of E Z ∗ |M λ | is intractabledue to the difficulty in deriving the deviation from (cid:98) U s (cid:98) U s T to P Z ∗ . As an alternative strategy, weseek to derive an upper bound of E Z ∗ |M λ | as a surrogate objective function, q h ( λ ), that leads to anapproximately optimal solution that sufficiently reflects all m views. To this end, we note that fromthe proof of Theorem 3.2, |M λ | is upper-bounded by (cid:107) (cid:80) ms =1 λ s (cid:98) U s (cid:98) U s T − P Z ∗ (cid:107) up to a constant, and (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) m (cid:88) s =1 λ s (cid:98) U s (cid:98) U s T − P Z ∗ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ m m (cid:88) s =1 λ s (cid:107) (cid:98) U s (cid:98) U s T − U s U s T (cid:107) ≤ m m (cid:88) s =1 (cid:18) λ s γ sn (cid:19) (cid:107) B h s ( W s ) − W s (cid:107) (3.4)It thus suffices to derive an upper bound for each individual E Z ∗ (cid:107) B h s ( W s ) − W s (cid:107) as given below. Theorem 3.3.
Suppose n max (cid:38) (log n ) α s +1 , and choose h s = 2 δ + d (cid:16) Ln max √ log n (cid:17) αs +1 , s = 1 , ..., m ,then for some constant C > , E Z ∗ (cid:107) B h s ( W s ) − W s (cid:107) ≤ Ch s σ s log n and m (cid:88) s =1 (cid:32) λ s γ sn,K (cid:33) (cid:107) B h s ( W s ) − W s (cid:107) ≤ C log n m (cid:88) s =1 (cid:32) λ s σ s γ sn,K (cid:33) h s Theorem 3.3 immediately implies that the surrogate objective function can be given as q h ( λ ) = m (cid:88) s =1 (cid:32) λ s σ s γ sn,K (cid:33) h s (3.5)It is straightforward to see that λ ∗ q = argmin λ : (cid:80) ms =1 λ s =1 ,λ s ≥ q h ( λ ) takes the form λ ∗ qs = h − s (cid:18) γ sn,K σ s (cid:19) (cid:34) m (cid:88) t =1 h − t (cid:18) γ tn,K σ t (cid:19) (cid:35) − , s = 1 , ..., m. (3.6)Furthermore, if Ω , ..., Ω m ∈ F α for some common α > λ ∗ qs = (cid:18) γ sn,K σ s (cid:19) (cid:34) m (cid:88) t =1 (cid:18) γ tn,K σ t (cid:19) (cid:35) − , s = 1 , ..., m. (3.7)13t is easy to see from (3.7), λ ∗ qs ∝ ( γ sn,K /σ s ) , which can be comprehended as a measure of signal-to-noise ratio (SNR) since γ sn,K characterizes the capability of W s unveiling Z ∗ at the population levelwhereas σ s summarizes the sample-level ( W s ) corruption. This SNR flavored weighting schemeis seamlessly aligned with one’s intuition that quality evaluation on each view shall consider itsinherent ability and external noise extent simultaneously. To use it in practice, γ sn,K can be estimatedby the largest K -th eigenvalue of B h s ( W s ) and σ s can be estimated as in Remark 3. More generally,(3.6) takes α s into account through h − s , which is also intuitive in the sense that it downweights theview with a slower decay rate. Although (3.7) is only a special case of (3.6), it is often the case inpractice that α s ’s are very close or researchers prefer to take a bit more conservative perspective( α ≤ α implies that F α ⊆ F α ), in whichever case h s makes negligible difference and (3.6) reduces to(3.7). 
In the sequel, we refer mvBSC SNR to the mvBSC using (3.7) and mvBSC q to the mvBSC using(3.6). Remark 3 (Estimation of σ s ) . Recall that the variation in each similarity matrix consists of twosources: within group variation and across group variation. Let MSE within and MSE across denotethe mean square error respectively. Given a membership matrix Z ∈ Z n,K ,MSE within = K (cid:88) k =1 n k ( n k − / − (cid:88) { ( i,j ): g i = g j = k,i 10 anduse v i and its index i exchangeably. To examine the robustness of mvBSC to the underlying14etwork structure, we considered five different models (M1)-(M5) for the membership matrix Z ∗ as illustrated in Figure 4.1 where nodes from the same cluster share the same color. In (M1), theclusters have a clear block structure with a total of 25 clusters and cluster size ranging from 9 to 28.Under (M1), our model assumption (C1) holds exactly with a small δ . We then gradually departfrom this assumption by perturbing (M1). More specifically, for k = 2 , , , Z ∗ in model (Mk)is generated by randomly swapping the node’s membership in (M1) to one of the l k most adjacentclusters with probability p k , where we let ( p , l ) = (0 . , p , l ) = (0 . , p , l ) = (0 . , p , l ) = (0 . , p k and l k jointly control the degree of departure from assumption (C1)and the assumption no longer holds for a finite constant δ in the most challenging case of (M5).Throughout, we let n = 500 and for each membership model, we considered K = 50 , , 10 to reflect M1M2M3M4M5 Figure 4.1: Graphical representation of the membership matrix Z ∗ under (M1), (M2), (M3), (M4)and (M5) with groups indexed by colors.a small, medium, large average cluster size.For a given membership matrix Z ∗ and its associated network partition V = ∪ Kk =1 V k , we let c k = |V k | (cid:80) i : Z ∗ ik =1 i denote the centroid node of V k . We then generate each population level similaritymatrix W s independently according to (2.1) from Ω s ∈ F α s with α = 0 . α = 0 . 
6, whereΩ skl = (cid:26) k = l . | c k − c l | − ( α s +1) if k (cid:54) = l (4.1)The observed similarity matrix W s was generated as W sij = { ( W sij + (cid:15) sij ) ∧ } ∨ ( − (cid:15) sij ∼ N (0 , σ s ). We considered three noise levels with ( σ , σ ) being (i) (0 . , . 4) for low noise; (ii) (0 . , . . , . 8) for high noise. Here, W sij is banded to be between [ − , 1] tomimic the cosine similarity used in the real data example and we let all diagonals of W s to be 1.We choose h s = 2 δ + d (cid:16) Ln max √ log n (cid:17) αs +1 , s = 1 , W + = W + W (ii) the Laplacian of W + (KAL); (iii) the normalized Laplacian of W + (normKAL); (iv) eachsingle W s alone (singleW); (v) the Laplacian of each single W s (singleL); and (vi) the normalizedLaplacian of each single W s . Here, for a W under consideration, the Laplacian matrix is derivedusing W − min( W ) since it is defined based on a non-negative adjacency matrix. For the mvBSCmethod, we considered different approaches to select λ including mvBSC q and mvBSC SNR as well asan oracle method that chooses λ by minimizing the mis-clustering rate. For each configuration andeach clustering method, we quantify the quality of the clustering based on the average clusteringaccuracy, defined as one minus the mis-clustered error rate, and the normalized mutual information(NMI) over 100 replications. The NMI is a commonly used measure in the networks literature andis known to be impartial with respect to K (Strehl and Ghosh, 2002). 
Specifically, given a vertexset V = { v i } ni =1 , the NMI between a partition X with V = V X ∪ · · · ∪ V X K X and a gold standardreference partition X with V = V X ∪ · · · ∪ V X K X isNMI( X , X ) = (cid:80) K X k =1 (cid:80) K X l =1 (cid:12)(cid:12) V X k ∩ V X l (cid:12)(cid:12) log (cid:18) | V X k ∩V X l || V X k || V X l | (cid:19)(cid:115)(cid:26)(cid:80) K X k =1 |V X k | log (cid:18) | V X k | n (cid:19)(cid:27) (cid:26)(cid:80) K X l =1 (cid:12)(cid:12) V X l (cid:12)(cid:12) log (cid:18) | V X l | n (cid:19)(cid:27) , (4.2)which is a score ranging from 0 to 1 with a higher value indicating that X is more similar to thereference partition X . We let X be the true underlying partition in our simulation studies andsuppress the dependence on X for notation simplicity.We first examine the effect of λ selection on the quality of the mvBSC clustering. Table 4.1summarizes the mean and the standard deviation of the clustering accuracy and NMI score for themvBSC clustering with λ selected via mvBSC q , mvBSC SNR and the oracle method under the fivenetwork structures (M1)–(M5) with K = 25 , σ = 0 . , σ = 0 . 6. First, we note that both mvBSC q SNR have comparable clustering performance to that of the mvBSC trained with oracle λ across all settings. Although α (cid:54) = α , selecting λ based on the simple mvBSC SNR appears toresult in clustering with near identical performance as that of mvBSC q . These results suggest thatthe proposed procedure for selecting λ is indeed near optimal and the simple mvBSC SNR works wellwhen the views are reasonably similar. model method Accuracy NMI scoremean sd mean sdM1 oracle 0.966 0.0169 0.989 0.0046mvBSC q SNR q SNR q SNR q SNR q SNR K = 25 with optimal λ selected based on (3.6) (mvBSC q ),the the empirical version of (3.7) (mvBSC SNR ) and the oracle obtained by minimizing the empirical |M λ ∗ | . We next compare the performance of mvBSC q and mvBSC SNR to the aforementioned alternativespectral clustering procedures. 
In Figure 4.2, we show the clustering accuracy and NMI for different clustering methods under (M3) with K = 25 and different noise levels for W^s. For conciseness of the presentation, for the methods based on a single view, we only report the maximum accuracy and NMI of the two views. It is easy to see that the normalized Laplacian always performs better than its unnormalized counterpart; in fact, the unnormalized version fails in all scenarios. The KA clustering with W^+ performs even worse than the clustering with a single view W^s, suggesting that a naive aggregation of multiple sources of information could have a detrimental effect on the clustering due to the heterogeneity in the underlying Ω^s. Our mvBSC method is consistently better than all competing methods in terms of both average and spread across all noise levels, and the advantage is even more apparent as the noise level increases. Figure 4.3 shows how the performances of different methods change over different levels of K under the medium noise setting. As K increases, the clustering becomes more challenging. As a result, the clustering accuracy and NMI decrease substantially for most competing methods but only slightly for the mvBSC method. Thus, the larger the noise level and K, the more advantage the mvBSC approach showcases over the other methods.

Figure 4.2: Boxplots of clustering accuracy and NMI of different clustering methods for (M3) with K = 25 and three different noise levels (small, medium, large): (i) mvBSC_q: mvBSC using (3.6); (ii) mvBSC_SNR: mvBSC using (3.7); (iii) clustering using the kernel addition matrix W^+ (KA); (iv) clustering with the Laplacian of W^+ (KAL); (v) clustering using the normalized Laplacian of W^+ (normKAL); (vi) clustering using a single W (singleW); (vii) clustering with the Laplacian of a single W (singleL); and (viii) clustering with the normalized Laplacian of a single W (singleNormL).

Figure 4.3: Boxplots of mis-clustering error rate and NMI score for the methods in comparison under three different K/n network structures (low, medium, large K/n). mvBSC_q: mvBSC using (3.6); mvBSC_SNR: mvBSC using (3.7); KA: the kernel addition matrix; KAL: Laplacian of the kernel addition matrix; normKAL: the normalized Laplacian of the kernel addition matrix; singleW: a single W; singleL: Laplacian of a single W; singleNormL: the normalized Laplacian of a single W.

5 Application to ICD9 Clustering

The International Classification of Disease, 9th edition (ICD9) coding system, containing over 14,000 codes, is a widely adopted mechanism for billing. Recording a full spectrum of diagnosis and procedure information in the electronic health records (EHR), the ICD9 coding system is a valuable resource for various types of biomedical research.
However, designed for billing andadministrative functions, individual ICD9 codes tend to be too specific to be directly used asdisease phenotypes. Many codes indeed describe the same disease and only differ in details suchas the affected anatomical areas. For clinical and genetic studies, it is thus often desirable tocollapse detailed codes into clinically relevant groups. To address such a need, Denny et al. (2010,2013) manually curated grouping information to allow for more efficient representation of diseasephenotypes recorded in the EHR. The grouping has been successfully used to perform phenome-wide association studies (PheWAS). Despite a valuable asset, this manual curation approach hasmajor limitations including lack of scalability, portability and susceptible to subjective bias. Withthe adoption of ICD10 codes in recent years, a substantial human effort will be required to manuallyupdate the grouping to include both ICD9 and ICD10 codes, signifying the need of a data-drivenapproach. 19 .2 Data Sources and Model Set-up To employ the proposed mvBSC algorithm, three similarity matrices { W s , s = 1 , , } were obtainedfor all ICD9 codes from three different healthcare systems including a large insurance claim database(Claim), the Veteran Health Administration (VHA) and Partner’s Healthcare systems (PHS). Here, W sij represents the cosine similarity score of the semantic vectors for ICD9 codes v i and v j from the s th data source, Within each healthcare system, the semantic vectors were obtained by fitting a word2vec algorithm (Mikolov et al., 2013 a) to a co-occurrence table that records the frequency of acode pair co-occuring within a 30-day time window. Two main factors contribute to the heterogene-ity across the three data sources. First, the sample sizes are significantly different stretching from ∼ 45 million for Claim down to 1 million for VHA and further reducing to ∼ , 000 for Partner’sBiobank. 
Second, the underlying patient populations vary substantially. Specifically, Claim covers a full nationwide spectrum of subjects, whereas VHA solely targets the veteran population and PHS primarily consists of tertiary hospitals enriched for patients with more complex and severe diseases. Such heterogeneities signify the need for an unbiased approach to optimally combine information from these sources, as can also be easily seen in Figure 5.1, which summarizes the raw data on the cosine similarity matrices. The top panel displays the density histogram of each cosine similarity matrix, supporting the sparseness of each cosine similarity matrix, even though the sparseness pattern in Claim is not as apparent as in the other two. The bottom panel is a snapshot of each cosine similarity matrix restricted to a common set of codes. The darkness of the dots indicates the magnitude of the corresponding cosine similarity. Most large entries locate near the diagonal, suggesting the appropriateness of banding the three similarity matrices.

For ICD9 codes v_i and v_j, we let d_ij = |N(v_i) − N(v_j)| + η I{v_i ≠ v_j, N(v_i) = N(v_j)}, where N(·) maps a character string to its numeric form, e.g., N("001.1") = 1.1, and η is a small constant chosen such that 0 < η < 1. As a consequence, the vertex set V can be ordered in the sense that N(v_i) < N(v_j) if and only if i < j. The additional term involving the small constant η is included to distinguish the case of v_i = v_j from the rare cases where N(v_i) = N(v_j) but v_i ≠ v_j (see for example Figure 5.2).
Figure 5.1: Raw data summary. The top panel displays the density histogram of each cosine similarity matrix (Claim, VHA, Biobank). The bottom panel displays each cosine similarity matrix restricted to a common set of codes.

Figure 5.2: Ambiguity example in the ICD9 coding system.

In this application, the existing manually curated PheWAS groups can also serve as silver-standard labels to guide the choice of the banding parameter h_s and the weight parameter λ_s. All parameters were tuned in a grid-search manner, with the best values chosen as those giving the highest NMI score. To choose a proper K, considering that the number of PheWAS groups (K_phewas) is already a good estimate, we scanned through its neighborhood and picked the value with the highest NMI score for the subsequent analysis (K_use). With a given K, we performed clustering using the proposed mvBSC_SNR procedure along with the mvBSC procedure with λ selected to empirically maximize the NMI (mvBSC_maxNMI). Results based on mvBSC_q are omitted here since the three banding parameters are very close to each other, yielding very similar results to mvBSC_SNR.

For illustration purposes, in this paper we only present results for the following four categories, neoplasms, neurological, musculoskeletal and sense organs, whose grouping results draw particularly great interest in current clinical studies. Table 5.1 clearly shows that the proposed mvBSC algorithm performs well across the four categories, with high agreement with the existing PheWAS grouping.
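The code distance d_ij defined in Section 5.2 can be sketched as follows. This is a minimal illustration: the numeric mapping N(·) is implemented here as a plain float conversion, which ignores V/E-prefixed ICD9 codes, and η = 0.5 is our illustrative choice for the small positive constant.

```python
def icd9_distance(code_i, code_j, eta=0.5):
    """d_ij = |N(v_i) - N(v_j)| + eta * 1{v_i != v_j, N(v_i) = N(v_j)}.

    eta is a small positive constant; eta = 0.5 is an illustrative choice.
    """
    def N(code):
        # numeric form of a plain numeric ICD9 code, e.g. N("001.1") = 1.1
        # (V/E codes would need extra handling, omitted in this sketch)
        return float(code)

    ni, nj = N(code_i), N(code_j)
    # the eta term separates distinct codes that share the same numeric form,
    # e.g. "001.1" versus "001.10" (the ambiguity illustrated in Figure 5.2)
    return abs(ni - nj) + (eta if (code_i != code_j and ni == nj) else 0.0)
```

With this distance, sorting codes by N(·) yields the ordered vertex set used by the banding step.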
Beyond that, our proposed mvBSC algorithm has the advantage of being efficient, scalable, and adaptive to the evolving human knowledge as reflected in the observed data. The clustering with λ selected via mvBSC_SNR also has performance similar to that with the optimal λ selected to maximize the NMI. Figure 5.3 visually compares the global clustering structures given by mvBSC_SNR and PheWAS on the neurological and musculoskeletal categories respectively, showing the power of mvBSC_SNR to mimic the global network structure.

category          n    K_phewas  K_use  NMI (mvBSC_maxNMI)  NMI (mvBSC_SNR)
neoplasms         799  122       138    0.856               0.843
neurological      364  68        56     0.852               0.839
musculoskeletal   675  124       128    0.834               0.790
sense organs      639  119       132    0.862               0.859

Table 5.1: NMI scores of the mvBSC procedure compared to the PheWAS labels using two different choices of λ (maximizing the empirical NMI and the SNR) across four ICD categories, where n is the total number of ICD9 codes within each category, K_phewas is the total number of PheWAS groups and K_use is the total number of groups used for the final clustering, chosen by maximizing the NMI score.

Figure 5.3: Clustering structure comparison for the neurological and musculoskeletal categories under mvBSC_SNR and PheWAS. Squares represent cluster nodes and circles represent ICD9 codes.

Figure 5.4: Grouping result for myopathy. Codes are colored by their membership. Codes in parentheses colored in blue represent the corresponding PheWAS codes.

To further demonstrate its efficacy, we zoom in to individual three-digit categories of ICD9 codes and examine their grouping structures compared to PheWAS. Figure 5.4 shows a typical example in which the mvBSC_SNR based grouping perfectly agrees with PheWAS. In other cases, mvBSC_SNR turns out to be quite robust, with only a few occasional mismatches to the best grid results. For example, Figure 5.5 compares the two corresponding results for category 368, in which only code 368.9 is grouped differently.
Indeed, mvBSC_SNR seems to do a better job in this scenario, in that unspecified visual disturbance is clinically similar to the other specified disturbances and thus does not necessarily need to be parsed out.

Figure 5.5: Grouping result for ICD9 category 368 under (a) mvBSC_SNR and (b) the best grid choice. Codes are colored by their membership. Codes in parentheses colored in blue represent the corresponding PheWAS codes.

Figure 5.6 (a) gives the clustering result for ICD9 codes starting with 711. PheWAS separates these codes into four groups, with a majority of the codes being grouped to represent Arthropathy associated with infections (P711), followed by Pyogenic arthritis (P711.1), Reiter's disease (P711.2), and Behcet's syndrome (P711.3). On the other hand, mvBSC_SNR separates these codes into seven concept groups, with perfect agreement for the codes in P711.1, P711.2 and P711.3. The main difference between the mvBSC_SNR grouping and the PheWAS grouping appears in the codes that belong to P711 under PheWAS. While our method does not distinguish postdysenteric arthropathy from arthropathy associated with viral and bacterial diseases, it can perfectly set apart arthropathy associated with unspecified infective arthritis, other infectious and parasitic diseases, and mycoses. As a final example, ICD9 codes starting with 714 consist of Rheumatoid Arthritis (RA) and Juvenile Rheumatoid Arthritis (JRA). Clinically, these are distinct diseases and thus should be grouped separately, despite adjacent coding representations and similar terminologies. Figure 5.6 (b) demonstrates that our approach is able to distinguish between these two conditions.

Figure 5.6: Grouping results for arthropathies: (a) codes starting with 711; (b) codes starting with 714. Codes are colored by their membership. Codes in parentheses colored in blue represent the corresponding PheWAS codes.

6 Discussion

In this paper, we introduce a novel spectral clustering method that incorporates multiple data sources and leverages prior distance knowledge among nodes. More specifically, the novelty consists of two main parts.
First, a consensus clustering is realized by means of a weighted sum of membership-encoding matrices, which pulls all views toward a common ground while allowing between-view heterogeneity. Second, the proposed approach effectively leverages the prior distance knowledge via the banding step. The statistical performance of the proposed method is thoroughly studied under a multi-view stochastic block model (mvSBM) framework. In particular, we demonstrate the effect of the banding operation in reducing the mean absolute operator-norm error bound to

max{ n_max^{1/(α+1)} (log n)^{α/(2(α+1))}, sqrt((δ/d) log n), log n }.

Reducing to the standard result up to a √(log n) factor, this bound shows the robustness of our approach in the absence of banding. In addition, banding by the distance also encourages a desired sparseness pattern in the observed similarity matrix, and the sparseness level can be well controlled by the choice of δ. Both the simulations and the real data analysis demonstrate the effectiveness of the proposed mvBSC method in dramatically improving the clustering performance for a network with ordered nodes. We also provide a simple SNR-based rule for choosing the weights that is intuitive and easy to follow in practice. However, we note that this guideline may not yield a satisfactory result if a more complex heterogeneity pattern is present in the data. In this paper, we focus on the case where heterogeneity is only allowed across different views. Relaxing the homoscedasticity assumption within each view warrants further research.

References

Asgari, E. and Mofrad, M. R. (2015), Continuous distributed representation of biological sequences for deep proteomics and genomics, PloS one (11), e0141287.
Bickel, P. J. and Chen, A. (2009), A nonparametric view of network models and Newman–Girvan and other modularities, Proceedings of the National Academy of Sciences (50), 21068–21073.
Bickel, P. J. and Levina, E.
(2008), Regularized estimation of large covariance matrices, The Annals of Statistics, pp. 199–227.
Blaschko, M. B. and Lampert, C. H. (2008), Correlational spectral clustering, in 'Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on', IEEE, pp. 1–8.
Cai, X., Nie, F., Huang, H. and Kamangar, F. (2011), Heterogeneous image feature integration via multi-modal spectral clustering, in 'Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on', IEEE, pp. 1977–1984.
Chaudhuri, K., Kakade, S. M., Livescu, K. and Sridharan, K. (2009), Multi-view clustering via canonical correlation analysis, in 'Proceedings of the 26th Annual International Conference on Machine Learning', ACM, pp. 129–136.
Chen, R. Y., Gittens, A. and Tropp, J. A. (2012), The masked sample covariance estimator: an analysis using matrix concentration inequalities, Information and Inference: A Journal of the IMA (1), 2–20.
Choi, Y., Chiu, C. Y.-I. and Sontag, D. (2016), Learning low-dimensional representations of medical concepts, AMIA Summits on Translational Science Proceedings, 41.
Denny, J. C., Bastarache, L., Ritchie, M. D., Carroll, R. J., Zink, R., Mosley, J. D., Field, J. R., Pulley, J. M., Ramirez, A. H., Bowton, E. et al. (2013), Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data, Nature Biotechnology (12), 1102–1111.
Denny, J. C., Ritchie, M. D., Basford, M. A., Pulley, J. M., Bastarache, L., Brown-Gentry, K., Wang, D., Masys, D. R., Roden, D. M. and Crawford, D. C. (2010), PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations, Bioinformatics (9), 1205–1210.
Han, Q., Xu, K. S. and Airoldi, E. M. (2014), Consistent estimation of dynamic and multi-layer networks, arXiv preprint arXiv:1410.8597.
Holland, P. W., Laskey, K. B. and Leinhardt, S. (1983), Stochastic blockmodels: First steps, Social Networks (2), 109–137.
Humphreys, B. L. and Lindberg, D.
(1993), The UMLS project: making the conceptual connection between users and the information they need, Bulletin of the Medical Library Association (2), 170.
Jin, J. (2015), Fast community detection by SCORE, The Annals of Statistics (1), 57–89.
Karrer, B. and Newman, M. E. (2011), Stochastic blockmodels and community structure in networks, Physical Review E (1), 016107.
Kumar, A. and Daumé, H. (2011), A co-training approach for multi-view spectral clustering, in 'Proceedings of the 28th International Conference on Machine Learning (ICML-11)', pp. 393–400.
Lei, J. and Rinaldo, A. (2015), Consistency of spectral clustering in stochastic block models, The Annals of Statistics (1), 215–237.
Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013a), Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781.
Newman, M. E. (2006), Modularity and community structure in networks, Proceedings of the National Academy of Sciences (23), 8577–8582.
Ng, A. Y., Jordan, M. I. and Weiss, Y. (2002), On spectral clustering: Analysis and an algorithm, in 'Advances in Neural Information Processing Systems', pp. 849–856.
Ng, P. (2017), dna2vec: Consistent vector representations of variable-length k-mers, arXiv preprint arXiv:1701.06279.
Nguyen, D., Luo, W., Phung, D. and Venkatesh, S. (2016), Control matching via discharge code sequences, arXiv preprint arXiv:1612.01812.
Paul, S. and Chen, Y. (2016), Consistent community detection in multi-relational data through restricted multi-layer stochastic blockmodel, Electronic Journal of Statistics (2), 3807–3870.
Qin, T. and Rohe, K. (2013), Regularized spectral clustering under the degree-corrected stochastic blockmodel, in 'Advances in Neural Information Processing Systems', pp. 3120–3128.
Rohe, K., Chatterjee, S. and Yu, B. (2011), Spectral clustering and the high-dimensional stochastic blockmodel, The Annals of Statistics (4), 1878–1915.
Shi, J. and Malik, J.
(2000), Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence (8), 888–905.
Steinhaus, H. (1956), Sur la division des corps matériels en parties, Bull. Acad. Polon. Sci. (804), 801.
Strehl, A. and Ghosh, J. (2002), Cluster ensembles—a knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research (Dec), 583–617.
Tropp, J. A. (2015), An introduction to matrix concentration inequalities, Foundations and Trends in Machine Learning (1-2), 1–230.
Vu, V. Q. and Lei, J. (2013), Minimax sparse principal subspace estimation in high dimensions, The Annals of Statistics (6), 2905–2947.
Xia, R., Pan, Y., Du, L. and Yin, J. (2014), Robust multi-view spectral clustering via low-rank and sparse decomposition, in 'AAAI', pp. 2149–2155.
Zhao, Y., Levina, E. and Zhu, J. (2012), Consistency of community detection in networks under degree-corrected stochastic block models, The Annals of Statistics (4), 2266–2292.

A Appendix

A.1 Proof of Lemma 2.1

Proof. For any h > 2δ, since d(v_i, v_j) ≤ d(v_i, v_{c_{g_i}}) + d(v_{c_{g_i}}, v_{c_{g_j}}) + d(v_j, v_{c_{g_j}}) ≤ d(v_{c_{g_i}}, v_{c_{g_j}}) + 2δ, d(v_i, v_j) > h implies d(v_{c_{g_i}}, v_{c_{g_j}}) > h − 2δ > 0. Therefore, for any i,

Σ_j { |W̄^s_{ij}| : d(v_i, v_j) > h } = Σ_j { |Ω^s_{g_i g_j}| : d(v_{c_{g_i}}, v_{c_{g_j}}) > h − 2δ } ≤ n_max Σ_l { |Ω^s_{g_i l}| : d(v_{c_{g_i}}, v_{c_l}) > h − 2δ } ≤ n_max L ((h − 2δ)/d)^{−α_s}.

It is easy to see that n_min β ≤ γ^s_{n,K}(W̄^s) ≤ γ^s_{n,1}(W̄^s) ≤ n_max/β, which completes the proof.

A.2 Proof of Theorem 3.1

Proof. Let E_ij denote the n × n indicator matrix whose (i, j)-th and (j, i)-th entries are 1 and 0 elsewhere. Then

B_h(W) − W̄ = {B_h(W) − B_h(W̄)} + {B_h(W̄) − W̄} = Σ_{1 ≤ i < j ≤ n} ⋯

A.3 Proof of Theorem 3.2

Proof.
Recall that (Ẑ, Â) = argmin_{Z ∈ Z_{n,K}, A ∈ R^{K×K}} ||Û*_λ − ZA||²_F; thus

||ẐÂ − U*Q||_F ≤ ||ẐÂ − Û*_λ||_F + ||Û*_λ − U*Q||_F ≤ 2 ||Û*_λ − U*Q||_F.

It follows from the results of Lei et al. (2015) that

||Û*_λ − U*Q||_F ≤ sqrt(2K) || Σ_{s=1}^m λ_s Û^s Û^{sT} − U* U*^T || ≤ sqrt(2mK) Σ_{s=1}^m λ_s ||Û^s Û^{sT} − U^s U^{sT}||

and

||(I − Û^s Û^{sT}) U^s U^{sT}|| ≤ 2 ||B_{h_s}(W^s) − W^s|| / γ^s_{n,K},    (A.5)

where I is the identity matrix. In addition, since ||(I − Û^s Û^{sT}) U^s U^{sT}|| = ||Û^s Û^{sT} − U^s U^{sT}||, we have

||Û*_λ − U*Q||_F ≤ 2 sqrt(2mK) Σ_{s=1}^m (λ_s / γ^s_{n,K}) ||B_{h_s}(W^s) − W^s||.    (A.6)

It follows that

|M_λ| = Σ_{i ∈ M_λ} 1 ≤ 2 n_max Σ_{i ∈ M_λ} ||Â_{g_i ·} − U*_{i ·} Q||² ≤ 2 n_max Σ_{i=1}^n ||(ẐÂ)_{i ·} − U*_{i ·} Q||² = 2 n_max ||ẐÂ − U*Q||²_F ≤ 8 n_max ||Û*_λ − U*Q||²_F ≤ 64 m n_max K { Σ_{s=1}^m (λ_s / γ^s_{n,K}) ||B_{h_s}(W^s) − W^s|| }².    (A.7)

Using the Bernstein inequality given in Theorem 6.6.1 of Tropp et al.
(2015), for all t ≥ 0,

P( ||B_{h_s}(W^s) − W^s|| ≥ t ) ≤ 2n exp{ −(t²/2) / (σ_s² h_s/d + 4Lt/3) } ≤ exp{ log 2n − t² / (2 b_s (h_s/d + t)) } ≤ exp{ log 2n − t² / (4 b_s h_s/d) } if t ≤ h_s/d, and ≤ exp{ log 2n − t / (4 b_s) } if t ≥ h_s/d,    (A.8)

where b_s = max(σ_s², 4L/3). Therefore, for any r > 0, with probability at least 1 − n^{−r}, if n_max ≳ (log n)^{α_s+1},

||B_{h_s}(W^s) − W^s|| ≤ c_s (r+1) n_max^{1/(α_s+1)} (log n)^{α_s/(2(α_s+1))} if δ/d = o( (n_max/√(log n))^{2/(α_s+1)} ), and ||B_{h_s}(W^s) − W^s|| ≤ c_s (r+1) sqrt((δ/d) log n) otherwise;    (A.9)

and if n_max = o((log n)^{α_s+1}),

||B_{h_s}(W^s) − W^s|| ≤ c_s (r+1) sqrt((δ/d) log n) if δ/d ≳ log n, and ||B_{h_s}(W^s) − W^s|| ≤ c_s (r+1) log n otherwise,    (A.10)

for some positive constant c_s that depends on b_s. Applying the union bound, for any r > 0, with probability at least 1 − m n^{−r},

||Û*_λ − U*Q||_F ≤ 2 sqrt(2mK) (r+1) Σ_{s=1}^m (λ_s c_s / γ^s_{n,K}) max( n_max^{1/(α_s+1)} (log n)^{α_s/(2(α_s+1))}, sqrt((δ/d) log n), log n ).    (A.11)

Therefore, dropping constant terms not involving n, with probability at least 1 − m/n,

|M|/n = O_p( (n_max K / n) { Σ_{s=1}^m (λ_s / γ^s_{n,K}) max( n_max^{1/(α_s+1)} (log n)^{α_s/(2(α_s+1))}, sqrt((δ/d) log n), log n ) }² ).    (A.12)

A.4 Proof of Corollary 3.1

Proof. Recall that γ^s_{n,K} is the K-th largest eigenvalue of W̄^s; from Lemma 2.1, γ^s_{n,K} is at least of the order n_min. The results are natural simplifications of (A.12).

A.5 Proof of Theorem 3.3

Proof.
Recall that

|| Σ_{s=1}^m λ_s Û^s Û^{sT} − U* U*^T || ≤ Σ_{s=1}^m λ_s ||Û^s Û^{sT} − U^s U^{sT}|| ≤ 2 Σ_{s=1}^m (λ_s / γ^s_{n,K}) ||B_{h_s}(W^s) − W^s||.    (A.13)

From (A.1),

||B_{h_s}(W^s) − W^s|| ≤ L + 3 L n_max ((h_s − 2δ)/d)^{−α_s} + 3 || Σ_{1 ≤ i < j ≤ n : d(v_i, v_j) ≤ h_s} ⋯ ||,

so that

Σ_{s=1}^m (λ_s / γ^s_{n,K}) E_{Z*} ||B_{h_s}(W^s) − W^s|| ≤ 12 sqrt(log n / d) Σ_{s=1}^m (λ_s / γ^s_{n,K}) (1 + 2 e σ_s) sqrt(h_s) + 96 e log n Σ_{s=1}^m (σ_s λ_s / γ^s_{n,K}) + 12 L Σ_{s=1}^m (λ_s / γ^s_{n,K}).

Since n_max ≳ (log n)^{α_s+1}, h_s/d ≳ log n, s = 1, …, m; dropping negligible and constant terms, we have

E_{Z*} ||B_{h_s}(W^s) − W^s|| ≤ C σ_s sqrt( (h_s/d) log n )

and

Σ_{s=1}^m (λ_s / γ^s_{n,K}) E_{Z*} ||B_{h_s}(W^s) − W^s|| ≤ C sqrt(log n / d) Σ_{s=1}^m (σ_s λ_s / γ^s_{n,K}) sqrt(h_s).