Integrate Multi-omic Data Using Affinity Network Fusion (ANF) for Cancer Patient Clustering
Tianle Ma
Department of Computer Science and Engineering, University at Buffalo (SUNY), Buffalo, New York 14260-2500. Email: [email protected]
Aidong Zhang
Department of Computer Science and Engineering, University at Buffalo (SUNY), Buffalo, New York 14260-2500. Email: [email protected]
Abstract—Clustering cancer patients into subgroups and identifying cancer subtypes is an important task in cancer genomics. Clustering based on comprehensive multi-omic molecular profiling can often achieve better results than those using a single data type, since each omic data type (representing one view of patients) may contain complementary information. However, it is challenging to integrate heterogeneous omic data types directly. Based on one popular method, Similarity Network Fusion (SNF), we present Affinity Network Fusion (ANF) in this paper, an "upgrade" of SNF with several advantages. Like SNF, ANF treats each omic data type as one view of patients and learns a fused affinity (transition) matrix for clustering. We applied ANF to a carefully processed harmonized cancer dataset downloaded from the GDC data portal consisting of 2193 patients, and generated promising results on clustering patients into correct disease types. Our experimental results also demonstrate the power of feature selection and transformation combined with ANF in patient clustering. Moreover, eigengap analysis suggests that the affinity matrices of four cancer types learned with our proposed framework may have successfully captured patient group structure and can be used for discovering unknown cancer subtypes.
I. INTRODUCTION
Cancer genomics projects such as TCGA have generated comprehensive multi-omic molecular profiles for dozens of cancer types. Mining huge amounts of omic data to discover cancer subtypes and disease mechanisms remains a hot topic. Patients with cancer from the same primary site, for example lung cancer, can be very different from each other in terms of disease progression, response to treatments, etc. One important task is therefore to further cluster cancer patients of the same cancer type into subgroups and define new cancer subtypes with comprehensive molecular signatures associated with distinct clinical features.

There are many challenges in clustering patients into subgroups, since cancer patients are very heterogeneous. Even though we can always cluster patients into groups by running a specific clustering algorithm, the clustering results may not be robust, i.e., slightly changing clustering methods or parameter settings may lead to a different clustering result. Moreover, there is no groundtruth to decide which methods and parameter settings work best. While the omic data collected are comprehensive, they are also heterogeneous and noisy. If we use each type of omic data separately to cluster patients, we will probably generate different results. Is it possible to generate a robust clustering result and use "groundtruth" to justify it? This paper mainly focuses on this problem.

Since each type of omic data may contain some complementary information about the patients and disease, we can perform integrative network analysis with multi-omic data. Many methods developed to integrate multi-omic data for patient clustering in the past several years are based either on probabilistic models or on network models [1]. It has been demonstrated that patient clustering based on similarity network fusion (SNF) [2] can usually achieve promising results compared with other methods such as iCluster [3] or KMeans.
While SNF works well in clustering patients, we find that the computational operations required in SNF can be significantly reduced and simplified while still obtaining a reliable fused affinity network. Based on SNF, in this paper we present Affinity Network Fusion (ANF), which has several advantages over SNF. Our contributions can be summarized as follows.

First, ANF can be seen as an improved version of SNF with several advantages. ANF requires much less computation while generating results as good as or even better than those from SNF. ANF can incorporate weights for each view, while SNF is unweighted. Moreover, ANF has a much clearer interpretation, while SNF contains some "mysterious" operations.

Second, we carefully selected a "gold" dataset with true class labels for the cancer patient clustering problem. Thus, instead of using internal clustering evaluation metrics, we were able to use external metrics to evaluate clustering methods. We used the newest release of harmonized cancer datasets from the Genomic Data Commons Data Portal (since the data is harmonized, it is of high quality for large-scale integration), and carefully selected 2193 cancer patients from four primary sites with known disease types. With this dataset, we were able to demonstrate the power of the affinity network fusion technique in cancer patient clustering. Our experimental results also showed that survival information alone may not be sufficient for evaluating disease (sub)type discovery methods.

Third, using harmonized gene expression and miRNA expression data, we were able to demonstrate that feature selection and transformation of "raw" counts or other genomic features could lead to better clustering results.
Specifically, we find that log transformation or variance stabilizing transformation [4] of raw count data usually performs better than directly using normalized expression values such as FPKM values of gene expression.

Fourth, the experimental results on this relatively large dataset (2193 cancer patients with gene expression, miRNA expression and DNA methylation data) are very promising. The learned fused affinity matrices for the selected four cancer types matched well with both the true class labels and the theory of spectral clustering based on eigengap analysis (Sec. V-E), and can be reliably used for unknown cancer subtype discovery and for identifying subtype-specific molecular signatures.

The rest of the paper is organized as follows. In section II, we briefly summarize related work. In section III, we describe a general framework for clustering complex objects. In section IV, we describe the ANF framework in detail. In section V, we present experimental results and analysis. The last section concludes.

II. RELATED WORK
Integrating multi-omic data has been a hot topic in recent years, and there are multiple good reviews on the subject [1], [5]. Various techniques can be classified into four groups based on whether they are probabilistic or network based: Non-Probabilistic Non-Network, Non-Network Probabilistic, Non-Probabilistic Network, and Probabilistic Network.

Non-Probabilistic Non-Network approaches do not assume a probability distribution or use graph theory to integrate multi-omic data. Instead, direct regression or correlation analyses are applied to multi-omic data. For example, partial least squares can identify the features that are most informative for predicting clinical outcomes, and can weigh different sources. Canonical Correlation Analysis leverages correlations among different features in various sources to select the most informative features. Some frameworks apply Expectation-Maximization approaches to learn the weight of each source in an optimization framework.

Non-Network Probabilistic approaches refer to methods that employ a probabilistic model containing latent variables with a prior distribution. For instance, iCluster [3], a widely used cancer subtype clustering method, assumes different types of data share a common latent feature space that can be learned through Expectation-Maximization (EM). Clustering is performed on the learned latent feature space.

Non-Probabilistic Network approaches often do not assume a prior distribution over a set of latent variables; in fact, many models in this category do not even have latent variables. Most of them leverage interaction networks to diffuse signals and fuse various networks. Typical examples include HotNet2 [6] and SNF [2]. HotNet2 diffuses genetic mutation signals through a gene interaction network to obtain a smoothed mutation profile, which is then used for detecting "hot spots" of gene subnetworks that might be disease-causing. SNF constructs patient similarity networks using different types of data.
The novelty of SNF is that it fuses different patient similarity networks into a consensus similarity network that can be more reliable than directly combining heterogeneous features. The fused patient similarity network is then used for clustering patients into disease subtypes.

Probabilistic Network approaches usually employ Bayesian methods, which incorporate latent variables and factors into the model and learn them through optimization frameworks. External sources such as interaction networks or pathways are often used to construct the network or factor graph. For instance, PARADIGM [7] converts NCI pathway databases into a factor graph in which each gene is a factor incorporating several kinds of information. Since most of the variables and factors in this factor graph are unknown, PARADIGM relies on a set of empirical distributions to make the learning task tractable with the EM algorithm.

The approach presented in this paper roughly falls into the third category, non-probabilistic network approaches. We do not assume a prior distribution for any variables, to avoid unrealistic assumptions. Instead, we adopt techniques from spectral clustering to construct a k-nearest-neighbor affinity (similarity) graph with a local Gaussian kernel (Eq. 5). This reduces noise and possible distortion caused by non-uniform measurements and preprocessing of omic datasets. Based on the main idea of similarity network fusion (SNF) [2], we developed a simpler and more general framework of affinity network fusion (ANF), or more technically speaking, transition matrix fusion, to combine multiple networks into a fused consensus network. The learned affinity network captures complementary information from multiple views and is much more robust than individual networks learned from each view.

III. GENERAL FRAMEWORK TO CLUSTER COMPLEX OBJECTS
A. Representing Complex Objects With Multi-View
A patient is a complex object and can have multiple views with heterogeneous features. In this paper we mainly deal with patient clustering; however, the proposed framework can be used for clustering any complex objects.

Suppose each patient has n views. Then we have n patient-feature matrices:

X^(v) ∈ R^(N × p_v), v = 1, 2, ..., n

where N is the number of patients and p_v is the feature dimension of view v.

Here we assume each feature can be mapped to a vector of real numbers. Note that there are quite a lot of features that are not numeric, such as text annotations. However, many of them can be transformed to a new feature space in R^p using feature embedding. In this paper, we do not focus on transforming a specific non-numeric feature space to R^p, but on clustering patients based on already transformed feature matrices in R^(N × p_v).

B. Feature Selection and Transformation

For patients, each type of omic data is different, representing a different view of the patients. The dimensionality of each view is usually very high, so feature selection and transformation are often necessary. For example, gene expression data can be represented using a sample-feature matrix with tens of thousands of rows (i.e., genes) and dozens of columns (i.e., samples). Most genes are not disease genes. If we use all gene features for clustering patients into subgroups, irrelevant gene features may lead to a "bad" clustering result.

For cancer genomics, we usually have expression data from tumor-normal pairs, which can be used for selecting differentially expressed (DE) genes (or miRNAs). Using DE genes for clustering patients is often better than using all genes [8].

In genomics, raw counts of molecular measurements are common. However, for clustering or other exploratory analysis, direct use of raw counts is usually discouraged. Proper feature normalization and transformation is often necessary before clustering.
Log transformation, variance stabilizing transformation [4], and regularized log transformation [9] are commonly used for transforming raw count data for downstream analysis.
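As a minimal sketch of the simplest of these options (not the exact preprocessing pipeline used in the paper), a log transformation of a raw count matrix takes a few lines of NumPy; the `counts` matrix below is a made-up example:

```python
import numpy as np

# Hypothetical raw count matrix: rows are samples, columns are features
# (e.g., genes or miRNAs).
counts = np.array([[0.0, 10.0, 1000.0],
                   [5.0, 20.0, 4000.0]])

# log2(count + 1): the pseudocount of 1 avoids log(0) and compresses the
# large dynamic range of raw counts before distance computation.
log_counts = np.log2(counts + 1.0)
```

Variance stabilizing and regularized log transformations fit a mean-variance model to the counts and are typically applied with dedicated genomics packages rather than a one-liner.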
C. Distance Metrics
Most clustering techniques require defining a pair-wise patient distance (or similarity) matrix Δ = (δ_ij)_{N×N} ∈ R_+^{N×N}, where R_+ represents the set of non-negative real numbers. With proper feature engineering, we can calculate pair-wise distances on normalized features using the Euclidean distance, δ_ij = ||x_i − x_j||, or δ_ij = 1 − Cor(x_i, x_j), where Cor(x_i, x_j) represents the Pearson (or Spearman) correlation, etc. For categorical features without feature embedding, we can use the chi-squared distance or other similar metrics.

D. Clustering Objectives
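Both distance choices can be computed directly with NumPy; `pairwise_distances` below is our own illustrative helper, not a function from the paper:

```python
import numpy as np

def pairwise_distances(X, metric="euclidean"):
    """Pair-wise distance matrix Delta = (delta_ij) from an N x p
    patient-feature matrix X.

    metric="euclidean":   delta_ij = ||x_i - x_j||
    metric="correlation": delta_ij = 1 - Pearson correlation of x_i, x_j
    """
    if metric == "euclidean":
        sq = (X ** 2).sum(axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
        return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negative round-off
    if metric == "correlation":
        return 1.0 - np.corrcoef(X)
    raise ValueError("unknown metric: %s" % metric)
```

The squared-norm expansion avoids an explicit double loop over patients, which matters when N is in the thousands as in the dataset used here.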
To find a clustering assignment function A = (a_1, a_2, ..., a_N) that maps each patient i to a unique cluster a_i (here we do not consider soft clustering) under certain constraints (e.g., the number of clusters), we generally need to solve an optimization problem:

arg min_A f(A, Δ)   (1)

f is a cost function that aims to make patients in the same cluster more "like" each other than those belonging to different clusters. One common choice is to minimize the ratio between the sum of intra-class distances and the sum of inter-class distances:

arg min_A [ Σ_{1≤i,j≤N} I(a_i = a_j) δ_ij ] / [ Σ_{1≤i,j≤N} I(a_i ≠ a_j) δ_ij ]   (2)

I(·) is the indicator function. Many clustering methods essentially solve this problem or its variants, directly or indirectly. If the distances between patients are not very accurate due to noise, feature heterogeneity, etc., we can build a similarity graph based on the distance matrix Δ and then perform a graph cut to find densely connected clusters. A powerful approach is spectral clustering, which works on a patient similarity graph S = (s_ij)_{N×N} ∈ R_+^{N×N} [10]. In this paper we used spectral clustering [10] on the learned fused affinity (similarity) matrix for patient clustering.

IV. kNN AFFINITY NETWORK FUSION (ANF)
A. Affinity Matrix for Each View
With a pair-wise distance matrix Δ, we can define a corresponding similarity graph S in multiple ways [10]:
• ε-neighborhood graph: only the edges with weight less than ε are kept in the similarity graph. The choice of ε is problem-dependent.
• k-nearest-neighbor graph: only the edges to each node's k nearest neighbors are kept. k is the parameter to tune.
• Fully connected graph: a kernel such as e^(−δ_ij² / (2σ²)) is often used to transform distance to similarity. σ can be the radius of a local neighborhood instead of a global constant, to capture local network structure.

In this paper, we adopted the definition of a local σ_ij from [2]:

μ_i = ( Σ_{l ∈ N_k(i)} δ_il ) / k   (3)

σ_ij = α(μ_i + μ_j) + β δ_ij   (4)

N_k(i) represents the indexes of the k nearest neighbors of patient i. Thus μ_i in Eq. 3 represents the local diameter of node i. σ_ij in Eq. 4 incorporates both the local diameters of patients i and j and their distance. The choice of k is important and needs to be tuned. In [2], σ_ij = μ_i + μ_j + δ_ij. Eq. 4 is more general, with tuning parameters α, β ≥ 0.

K_ij = ( 1 / (√(2π) σ_ij) ) e^(−δ_ij² / (2σ_ij²))   (5)

Eq. 5 calculates a local Gaussian kernel between patients i and j, with σ_ij defined as in Eq. 4, to incorporate local kNN network structure. Even though K is fully connected (i.e., ∀i, ∀j, K_ij > 0), only those node pairs that are within a small dense neighborhood will have a relatively large kernel value (as similarity measure). We can regard K as a similarity graph and perform spectral clustering on it.

With similarity matrix K, we can define a state transition matrix by Eq. 6, with S_ij representing the probability of (the state of) patient i transitioning to (the state of) patient j. Each row of S sums to 1.

S_ij = K_ij / Σ_{j'=1}^{N} K_ij'   (6)

While K is symmetric, S is probably not. We use a transition matrix instead of a symmetric similarity matrix to make our framework interpretable through random walks.
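Eqs. 3-6 can be sketched in NumPy as follows; this is our own illustrative implementation (not the authors' released code), and the parameter defaults are arbitrary:

```python
import numpy as np

def transition_matrix(delta, k=2, alpha=1.0 / 6, beta=1.0 / 6):
    """Local Gaussian kernel K (Eq. 5) and transition matrix S (Eq. 6)
    from a pair-wise distance matrix `delta` (N x N, zero diagonal)."""
    # mu_i: average distance from i to its k nearest neighbors (Eq. 3).
    # Column 0 of each sorted row is the self-distance 0, so skip it.
    mu = np.sort(delta, axis=1)[:, 1:k + 1].mean(axis=1)
    # sigma_ij = alpha * (mu_i + mu_j) + beta * delta_ij (Eq. 4).
    sigma = alpha * (mu[:, None] + mu[None, :]) + beta * delta
    # Local Gaussian kernel (Eq. 5).
    K = np.exp(-delta ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    # Row-normalize into a transition matrix (Eq. 6): each row sums to 1.
    return K / K.sum(axis=1, keepdims=True)
```

Because σ_ij grows with the local neighborhood diameters, a pair at a fixed distance gets a larger kernel value in a sparse region than in a dense one, which is the "local" behavior the text describes.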
kNN Affinity Matrix for Each View. With multi-view data, one can perform clustering on each view and synthesize the results using approaches like consensus clustering [11]. Alternatively, we can construct a more robust similarity network incorporating multi-view data and then perform spectral clustering.

For each view, we can calculate a state transition matrix S^(v) using Eq. 6. Since S^(v) is normalized from a similarity matrix (Eq. 5), we can easily recover a symmetric similarity graph from S^(v). Thus we loosely refer to S^(v) as a fully connected similarity graph or affinity matrix in this paper.

Based on the fully connected graph S^(v), we can further define a k-nearest-neighbor similarity graph or affinity matrix W^(v) (Eq. 7):

W^(v)_ij = (1 − ε) · S^(v)_ij / Σ_{j ∈ N_k(i)} S^(v)_ij,  if j ∈ N_k(i);
W^(v)_ij = ε · S^(v)_ij / Σ_{j ∉ N_k(i)} S^(v)_ij,  otherwise   (7)

N_k(i) refers to the indexes of the k nearest neighbors of patient i, and ε is a small number. If we set ε = 0, then in each row of W^(v) only k elements are non-zero, and only the weights of the k nearest neighbors are used for normalization. We also loosely call W^(v) a (kNN) affinity matrix in this paper.

Since each row sums to 1, W^(v) is also a transition matrix. In fact, W^(v) can be seen as a truncated version of S^(v) obtained by "throwing away" weak signals (i.e., small edge weights) in S^(v). Thus W^(v) should be more robust to small noise. If W^(v) represents a connected and non-bipartite graph, the random walk it defines will reach a unique stationary distribution after a sufficient number of steps [10].

When we cluster N patients into several groups, we essentially try to find several different "stable" state spaces. Patients will be much more likely to stay in their own state space than to transition to another state space. Thus we can find a graph cut based on the transition matrix.
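Eq. 7 can be sketched as below. This is again an illustrative implementation; whether a node counts itself among its own k nearest neighbors is an implementation choice we make here (we exclude it), not a detail taken from the paper:

```python
import numpy as np

def knn_affinity(S, k=2, eps=0.0):
    """kNN-truncated transition matrix W (Eq. 7).

    Probability mass (1 - eps) is redistributed over the k nearest
    neighbors of each patient (largest entries of S, excluding self),
    and mass eps over the remaining nodes. With eps = 0, each row of W
    has only k non-zero entries."""
    N = S.shape[0]
    W = np.zeros_like(S)
    for i in range(N):
        order = [j for j in np.argsort(-S[i]) if j != i]  # decreasing affinity
        nn, rest = order[:k], order[k:] + [i]
        W[i, nn] = (1.0 - eps) * S[i, nn] / S[i, nn].sum()
        if eps > 0.0 and S[i, rest].sum() > 0.0:
            W[i, rest] = eps * S[i, rest] / S[i, rest].sum()
    return W
```

With eps = 0 each row is supported on exactly k neighbors, which is the "throwing away weak signals" behavior described above.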
For a network with multiple possible transition matrices from multiple views, we can use a random walk on the multigraph to aggregate all transition matrices into a fused transition matrix for spectral clustering. This is the main idea of affinity network fusion (ANF) in this paper.

B. Affinity Network Fusion with One-step Random Walk (ANF1)
For each view, we have defined two transition matrices: S^(v) (representing the fully connected affinity network, Eq. 6) and W^(v) (the kNN affinity network, a truncated version of S^(v), Eq. 7). We can build a multigraph G to incorporate the multiple views. In G, each node represents a patient, and there can be at most n (or 2n if we include two edges per view, from Eq. 6 and Eq. 7) edges between two patients.

To calculate an aggregated edge weight for each patient pair, we can apply a one-step random walk to fuse multi-view affinity networks in two steps. First, we use Eq. 8 to "smooth" each view. Then we use Eq. 11 to get a fused weighted view.

W̃^(v) = β_1 W^(v) + β_2 W^(−v) + β_3 S^(v) + β_4 S^(−v),  Σ_{i=1}^{4} β_i = 1, β_i ≥ 0   (8)

W^(−v) = Σ_{k ≠ v} w_k · W^(k)   (9)

S^(−v) = Σ_{k ≠ v} w_k · S^(k)   (10)

W = Σ_{v=1}^{n} w_v · W̃^(v),  Σ_{v=1}^{n} w_v = 1, w_v ≥ 0   (11)

In Eq. 8, the second term W^(−v) represents a weighted complementary view from the n − 1 other views (Eq. 9). Terms 3 and 4 in Eq. 8 are included for comprehensiveness. In practice, since W^(v) is usually more robust to noise than S^(v), we often set β_3 = β_4 = 0. Eq. 8 can be interpreted as network diffusion between view v and the other complementary views, resulting in a smoother version of W^(v).

Since all W^(v) and S^(v) are transition matrices, the fused view in Eq. 11 is essentially a weighted transition matrix W, which combines complementary information from multiple views and can be more informative for patient clustering. We can interpret the fused view W (Eq. 8 and 11) as the result of a one-step random walk on a multigraph, with W being an aggregated transition matrix of a simple graph derived from the multigraph. We call this process Affinity Network Fusion (ANF). Even though it is very simple, it turns out to be as powerful as SNF [2] (see Sec. V).

To get an aggregated transition matrix, we can also use a multi-step random walk.
In the following, we refer to ANF with a one-step random walk as ANF1, and ANF with a two-step random walk as ANF2, which is discussed in the next section.

C. Affinity Network Fusion with Two-step Random Walk (ANF2)
In addition to one-step random walks, we can use multi-step random walks on the multigraph. Our experiments on cancer genomic data showed that a one-step or two-step random walk usually works well enough. If the number of steps is too big, the fused transition matrix W eventually becomes rank 1, with each row equal to the stationary distribution. Thus we only consider network fusion with a two-step random walk in the following.

Similar to the one-step random walk, we derive the fused transition matrix in two steps: first calculate a smoothed transition matrix for each view using Eq. 12, then aggregate all views using Eq. 11.

W̃^(v) = α_1 W^(v)·W^(−v) + α_2 W^(−v)·W^(v) + α_3 W^(v)·S^(−v) + α_4 S^(−v)·W^(v) + α_5 S^(v)·W^(−v) + α_6 W^(−v)·S^(v) + α_7 S^(v)·S^(−v) + α_8 S^(−v)·S^(v),  Σ_{i=1}^{8} α_i = 1, α_i ≥ 0   (12)

The first term of Eq. 12, α_1 W^(v)·W^(−v), represents a two-step random walk (multiplying two transition matrices): the first step is a random walk on view v, and the second step is a random walk on the aggregated complementary view W^(−v) (Eq. 9). Similarly, the second term represents a random walk on the complementary view first, followed by a random walk on view v. Since we have two transition matrices W^(v) and S^(v), we can perform random walks on either W^(v) or S^(v). The other six terms in Eq. 12 have similar meanings to the first two.

Our experiments on cancer genomic data show that the terms using W^(v) usually work better than those using S^(v), suggesting W^(v) is more reliable than S^(v). In practice, the default choice is to use just the first two terms:

W̃^(v) = α W^(v)·W^(−v) + (1 − α) W^(−v)·W^(v)   (13)

D. Comparison with SNF from [2]
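Under the default choices discussed above (β_3 = β_4 = 0 with β_1 = β_2 assumed equal for ANF1, and the two-term Eq. 13 for ANF2), the fusion step reduces to a few lines. This is our own sketch; uniform view weights are assumed by default, and we renormalize the complementary-view weights so that every intermediate matrix stays a transition matrix:

```python
import numpy as np

def anf_fuse(Ws, weights=None, two_step=False, alpha=0.5):
    """Fuse per-view kNN transition matrices (list of N x N arrays).

    two_step=False: smoothed view = 0.5*W(v) + 0.5*W(-v)     (Eq. 8, beta3=beta4=0)
    two_step=True:  smoothed view = alpha*W(v)@W(-v)
                                  + (1-alpha)*W(-v)@W(v)     (Eq. 13)
    then W = sum_v w_v * smoothed view                       (Eq. 11)
    """
    n = len(Ws)
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    fused = np.zeros_like(Ws[0])
    for v in range(n):
        others = [u for u in range(n) if u != v]
        cw = w[others] / w[others].sum()   # renormalized weights (Eq. 9)
        Wc = sum(c * Ws[u] for c, u in zip(cw, others))
        if two_step:
            Wv = alpha * Ws[v] @ Wc + (1.0 - alpha) * Wc @ Ws[v]
        else:
            Wv = 0.5 * Ws[v] + 0.5 * Wc
        fused += w[v] * Wv
    return fused
```

Both convex combinations and products of row-stochastic matrices are row-stochastic, so the fused W is again a valid transition matrix, as required for the spectral clustering step.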
ANF is based on SNF [2], but is much simpler and just as powerful. In SNF, the network fusion process is performed iteratively (Eq. 14):

S̃^(v) = W^(v) × ( Σ_{k ≠ v} S^(k) / (n − 1) ) × (W^(v))^T   (14)

Note that the notations used in this paper differ from those in [2]. In [2], S^(v) represents a symmetric similarity matrix derived from Eq. 5, while in our paper S^(v) is a row-normalized asymmetric affinity (transition) matrix. Even though it is intuitive enough, SNF does not have a clear physical interpretation for multiplying a similarity matrix (S^(v)) with transition matrices (W^(v)) in Eq. 14. However, if we consider S^(v) as a transition matrix, then Eq. 14 can be loosely seen as a three-step random walk. By contrast, in Eq. 8 and Eq. 12, ANF directly operates on transition matrices, with a natural interpretation as a random walk on a multigraph that generates an aggregated simple graph for spectral clustering.

Our experiments showed that increasing the number of random walk steps may not increase clustering performance dramatically. We also tried iteratively updating W^(v) as in [2]. The results show that additional iterations are not necessary and often cannot outperform a simple one-step or two-step random walk.

In fact, if we adopt an iterative approach to update W^(v), we have to "manually" adjust W^(v) after each iteration. In order to "force" SNF to converge, [2] used a "mysterious" operation to "avoid numerical instability" of S^(v) by forcing the diagonal of S^(v) to be 0.5 after each iteration (Eq. 15). As a result, the learned fused similarity matrix S contains large values (≈ 0.5 based on their implementation) in the diagonal, while all other values are usually smaller by at least one order of magnitude. Though the subsequent spectral clustering does not rely on the diagonal elements, the physical meaning of the learned S in SNF is not as clear as in ANF, where the learned W is a weighted transition matrix.
S^(v)_ij = 1/2, if i = j;  S^(v)_ij / ( 2 Σ_{j' ≠ i} S^(v)_ij' ), otherwise   (15)

Based on extensive experiments, we found this iterative process is not necessary. Without the iterative process, the "forced" normalization (Eq. 15) can be eliminated too.

Comparing ANF and SNF, ANF has at least several advantages.

First, it requires much less computation to achieve results as good as or even better than SNF (see Sec. V). SNF typically requires dozens of iterations to converge, while ANF needs no iteration at all. ANF1 (Eq. 8) involves no matrix multiplication, and ANF2 (Eq. 13) involves two matrix multiplications, while SNF (Eq. 14) needs to perform two matrix multiplications in each iteration.

Second, ANF is a more general framework that can incorporate weights of views, while SNF uses only uniform weights (Eq. 14). This is important because properly chosen weights can make the fusion process more effective.

Third, ANF has a natural interpretation as a random walk on a multigraph that generates a fused simple graph. The learned fused affinity matrix W has a natural meaning as a weighted transition matrix incorporating multi-view data, while the operation in SNF does not have a direct physical meaning, though it can also be loosely seen as a three-step random walk.

To sum up, ANF significantly reduces unnecessary operations in SNF, and provides a more general and interpretable framework for integrating multi-view data using patient network fusion.

E. Spectral Clustering on Fused Affinity Matrix
With the learned fused affinity matrix W, we can perform spectral clustering by solving the optimization problem in Eq. 16 [12]:

arg max_Y Trace( (Y^T D Y)^(−1/2) Y^T W Y (Y^T D Y)^(−1/2) ),  s.t. Y 1_K = 1_N, Y ∈ {0, 1}^(N×K)   (16)

D is a diagonal matrix whose diagonal elements are the row sums of W. Since the W generated by ANF is a transition matrix, D is the identity matrix. Eq. 16 finds a clustering assignment matrix Y that maximizes the K-way normalized associations [12], which can be solved approximately by solving Eq. 17 [12]:

arg max_{Y,R} Trace( Y^T Z R ),  s.t. D^(−1) W Z = Z Λ, R^T R = I_K, Y 1_K = 1_N, Y ∈ {0, 1}^(N×K)   (17)

The columns of Z in Eq. 17 are eigenvectors of D^(−1) W (or of W, since D is the identity matrix), and R is an orthogonal matrix. We solve Eq. 17 by iteratively updating R and Y [12]. First fix R and update Y:

Y_ij = I( j = arg max_j' (ZR)_ij' )   (18)

I(·) is the indicator function. Then fix Y and update R:

Y^T Z = U Λ V^T   (19)

R = V U^T   (20)

where Y^T Z = U Λ V^T is the singular value decomposition of Y^T Z.

The overall ANF framework for clustering cancer patients is summarized in Alg. 1.

Algorithm 1:
Affinity Network Fusion for Patient Clustering
Input:
• Patient-feature matrices (n views): X^(v), v = 1, 2, ..., n
• Number of clusters: K
• Weight of each view (optional): w
• Other optional parameters, such as the weights of ANF components α
Output:
• Fused patient affinity matrix W
• Patient cluster assignment Y
begin
  Feature selection and transformation: X^(v) → X̃^(v) ∈ R^(N × p̃_v), v = 1, 2, ..., n
  Calculate the pair-wise distance matrix for each view: Δ^(v) ∈ R_+^(N×N), v = 1, 2, ..., n
  Calculate the kNN affinity matrix for each view: W̃^(v), v = 1, 2, ..., n (Eq. 8 or Eq. 12)
  Calculate the fused affinity matrix W (Eq. 11)
  Spectral clustering on the fused affinity matrix W: (W, K) → Y (Sec. IV-E)
  Return W, Y
end
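The last step of Alg. 1 can be sketched as follows. Instead of the exact iterative discretization of [12] (Eqs. 17-20), this illustration uses a common practical substitute: embed patients with the K leading eigenvectors of W and run a small k-means with deterministic farthest-point initialization.

```python
import numpy as np

def spectral_cluster(W, K, n_iter=50):
    """Cluster rows of a fused transition matrix W into K groups.

    Since W from ANF is a transition matrix, D is the identity, so we
    can eigendecompose W directly (Sec. IV-E)."""
    vals, vecs = np.linalg.eig(W)        # W may be mildly asymmetric
    idx = np.argsort(-vals.real)[:K]     # K leading eigenvectors
    Z = vecs[:, idx].real
    # Farthest-point initialization, then plain k-means on the embedding.
    centers = [Z[0]]
    for _ in range(1, K):
        d2 = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[int(np.argmax(d2))])
    centers = np.array(centers)
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(n_iter):
        labels = np.argmin(((Z[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(K):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels
```

For a well-separated fused network, the leading eigenvectors are nearly piecewise constant on the clusters, so the k-means step recovers the group structure; this is also the intuition behind the eigengap analysis in Sec. V-E.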
V. EXPERIMENTAL RESULTS
A. Dataset and Evaluation Metrics
Harmonized cancer datasets were downloaded from the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov/). We selected four primary sites with more than one disease type: adrenal gland, lung, kidney, and uterus. For example, cancers from the adrenal gland have two disease types: Pheochromocytoma and Paraganglioma (project name: TCGA-PCPG) and Adrenocortical Carcinoma (project name: TCGA-ACC). In this paper, for ease of description, we refer to "cancer types" as cancers from these four primary sites. We
want to cluster tumor samples of the same "cancer types" into known disease types. The number of samples used for analysis in each cancer type is summarized in Table I (a few "outlier" samples detected by exploratory data analysis had already been removed). All these patient samples have gene expression, miRNA expression and DNA methylation (from HumanMethylation450 array) data available for both tumor and normal samples.

TABLE I
SAMPLE INFORMATION OF FOUR CANCER TYPES

Cancer type    Disease type   Samples   Total
adrenal gland  TCGA-ACC       76        253
               TCGA-PCPG      177
lung           TCGA-LUAD      447       811
               TCGA-LUSC      364
kidney         TCGA-KICH      65        654
               TCGA-KIRC      316
               TCGA-KIRP      273
uterus         TCGA-UCEC      421       475
               TCGA-UCS       54

While our ultimate goal is to detect cancer subtypes (the true subtypes are not yet known), it is a good strategy to evaluate disease subtype discovery methods using a dataset with groundtruth, and the dataset we selected and processed serves this purpose well. For tumors from each primary site, we already know the disease types. Since we have groundtruth disease types, we can evaluate clustering results using external metrics such as normalized mutual information (NMI). In addition, we have cancer patient survival data, so we can perform survival analysis to test whether the patient clusters show statistically different survival distributions.

The three metrics we used to evaluate clustering results are: (1) Normalized mutual information:
NMI(Ω, C) = I(Ω, C) / ( (H(Ω) + H(C)) / 2 );

(2) Adjusted Rand Index (ARI) [13]; and (3) the p-value of the log-rank test of survival distributions of different patient clusters [14].

We chose seven combinations of data types (legend of Fig. 1) and six feature types of gene expression and miRNA expression (legend of Fig. 3). For DNA methylation, we directly used beta values, so it only has one feature type. Thus in total there are 37 unique combinations of data types and feature types. We ran both ANF1 and ANF2 on all 37 combinations, and SNF on 24 combinations (ANF is implemented to work on a single data type as well, while the implementation of SNF requires the input to include at least two data types). Due to the page limit, detailed results including code can be accessed at https://github.com/BeautyOfWeb/ANF. In the following, we only show some results to demonstrate the power of ANF and feature engineering, and to compare ANF with SNF.

B. The Power of Affinity Network Fusion (ANF)
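The NMI definition above (mutual information normalized by the arithmetic mean of the two entropies) is straightforward to compute directly. The function below is an illustrative implementation for non-negative integer label vectors, not the code used for the paper's experiments:

```python
import numpy as np

def nmi(labels_a, labels_b):
    """NMI(Omega, C) = I(Omega, C) / ((H(Omega) + H(C)) / 2)."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)

    def entropy(x):
        p = np.bincount(x) / len(x)
        p = p[p > 0]
        return -(p * np.log(p)).sum()

    # Mutual information from the empirical joint distribution.
    mi = 0.0
    for u in np.unique(a):
        for v in np.unique(b):
            p_uv = np.mean((a == u) & (b == v))
            if p_uv > 0:
                mi += p_uv * np.log(p_uv / (np.mean(a == u) * np.mean(b == v)))
    denom = (entropy(a) + entropy(b)) / 2.0
    return mi / denom if denom > 0 else 0.0
```

NMI is invariant to relabeling of clusters, which is why it is a suitable external metric: a clustering that matches the disease types up to a permutation of cluster IDs still scores 1.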
To demonstrate the power of ANF, we compared the clustering results using single data types with those using ANF to integrate multiple data types.

[Fig. 1. Power of ANF combining multiple data types]

In Fig. 1, we compared seven different combinations of data types:
• "gene": gene expression
• "mirnas": miRNA expression
• "methylation": DNA methylation (beta values from the Illumina HumanMethylation450 platform)
• "gene+mirnas": combine "gene" and "mirnas" using ANF
• "gene+methylation": combine "gene" and "methylation" using ANF
• "mirnas+methylation": combine "mirnas" and "methylation" using ANF
• "gene+mirnas+methylation": combine "gene", "mirnas", and "methylation" using ANF

Fig. 1 shows the NMI values (between 0 and 1; the larger the NMI value, the better the clustering result) of patient clusters (here we set the number of clusters to the number of disease types) using the ANF2 framework on the aforementioned seven combinations of data types (for gene expression, we used normalized FPKM values; for miRNA expression, we used normalized counts in Fig. 1). Fig. 1 shows that clustering using DNA methylation beta values performs better than using FPKM and normalized miRNA expression values for all four cancer types, suggesting that DNA methylation data may contain highly relevant information about disease types (and potentially disease subtypes).

In general, a combination of at least two data types usually yields better clustering results. There are two exceptions in Fig. 1. For adrenal gland cancer, clustering using DNA methylation beta values alone yields the best clustering result ("gene+methylation" can also generate the same result), and can outperform the results from using a combination of data types. Again, this suggests DNA methylation beta values contain the most relevant information about disease (sub)types. The other exception is that the clustering result using the "mirnas+methylation" combination for uterus cancer is lower than using "methylation" data alone.
This is probably because the quality of the fused affinity network from miRNA and methylation data is not as good as that from methylation data alone. For uterus cancer, clustering using gene or miRNA expression data alone did a "terrible" job (NMI ≈ 0). However, by integrating the two data types, the result improves significantly (NMI = 0.30), which demonstrates the power of ANF.

We also find that it is usually not the case that integrating three data types generates better results than integrating two data types. However, integrating more data types tends to make the results more robust, as "gene+mirnas+methylation" consistently performs relatively well across the four cancer types, while two-data-type combinations may fail in some cases. For instance, for uterus cancer, integrating miRNA and methylation data does not perform well, while integrating all three types of data performs much better, even though it does not achieve the best result. Note that in Fig. 1 we used ANF2 and did not tune the weights of the views. Better results may be achieved if we tune the weights.

Very similar results are obtained when using Adjusted Rand Index (ARI) as the clustering metric (not shown here).

Fig. 2. −log(p-value) of log-rank test of patient survival distributions

Fig. 2 shows the −log(p-value) of the log-rank test of patient survival distributions among detected patient clusters. Since we already know the true disease labels, we added a pink bar labeled "TrueClass" in Fig. 2 (the other seven bars are matched one-to-one with Fig. 1 for comparison). The label "TrueClass" does not correspond to a data type combination, but refers to the groundtruth cluster assignment (its bar shows the negative log p-value of the log-rank test of survival distributions of the true disease types).

Fig. 2 shows that it is not always the case that the p-value calculated using the true class labels is the smallest (i.e., its negative log p-value the largest).
In fact, for lung cancer, the survival distributions of the two known disease types do not show a statistically significant difference at all, even though we can separate the two disease types relatively well using clustering. For kidney cancer, the smallest p-value was achieved using methylation data alone for clustering, but the corresponding clustering accuracy was relatively low (NMI = 0.48, Fig. 1); using the true class labels of kidney cancer, we can only achieve a larger p-value. This suggests that the log-rank test of survival distributions should not be used as the only metric to evaluate patient clustering results. This is also one major reason we carefully selected a dataset with groundtruth disease type information for evaluation purposes. With groundtruth disease type information, external evaluation metrics such as NMI and Adjusted Rand Index (ARI) can be used. The largest possible value of both NMI and ARI is 1, attained when all patients are clustered correctly.

C. The Power of Feature Selection and Transformation
Fig. 3 shows the power of feature selection and transformation of raw counts data (gene and miRNA expression).

Fig. 3. Power of feature engineering
The topleft, topright, and bottomleft panels show NMI of clustering results using gene expression, miRNA expression, and a combination of both using ANF. We chose gene and miRNA expression data because raw counts are available for both data types, so we can apply commonly used feature selection and transformation techniques to them. We used differential expression analysis to select gene or miRNA features, and used two common raw-count transformation techniques: 1) log transformation, log(n + n₀) (in our experiments, we set n₀ = 1 so that zero counts are mapped to 0); and 2) variance-stabilizing transformation [4].

In Fig. 3 we compared clustering results using six different feature types:
• "raw.all": raw counts of all genes or miRNAs
• "normalized": FPKM values of all genes or normalized counts of all miRNAs
• "raw.sel": raw counts of selected (differentially expressed) genes or miRNAs (differential expression analysis was performed using DESeq2 [9])
• "log.all": log transformation of raw counts of all genes or miRNAs
• "log.sel": log transformation of raw counts of selected (differentially expressed) genes or miRNAs
• "vst.sel": variance-stabilizing transformation of raw counts of selected genes or miRNAs

The topleft panel of Fig. 3 shows that for all four cancer types, "log.sel" and "vst.sel" of gene expression work relatively well. The topright panel shows that "vst.sel" of miRNA expression consistently works relatively well across all four cancer types. When combining gene expression and miRNA expression data using ANF (bottomleft panel), "log.all", "log.sel", and "vst.sel" work consistently well. For each type of feature, combining the two data types using ANF improves clustering accuracy. This also demonstrates the power of ANF, consistent with the analysis of Fig. 1.

"TrueClass" is only shown in the bottomright panel, to compare against the negative log p-value of the log-rank test of survival distributions when true disease labels are used.
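The log transformation described above, together with a feature-selection step, can be made concrete. The sketch below uses a simple variance-based filter as a stand-in for the DESeq2 differential-expression analysis actually used in the paper (DESeq2 itself is an R package); the base-2 logarithm and the function names are our assumptions:

```python
import numpy as np

def log_transform(counts, n0=1):
    """Log transformation log2(n + n0) of raw counts.
    With n0 = 1 (as in the paper), zero counts map to 0."""
    return np.log2(np.asarray(counts, dtype=float) + n0)

def select_top_variable(x, k):
    """Keep the k highest-variance features (columns) of a samples-by-features
    matrix -- a simple stand-in for selecting differentially expressed genes."""
    idx = np.argsort(x.var(axis=0))[::-1][:k]
    return x[:, np.sort(idx)]
```

For example, `select_top_variable(log_transform(counts), 2000)` would keep the 2000 most variable log-transformed genes before building an affinity network.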
Some clustering results have a smaller p-value than the p-value calculated using the true patient groups, even when the clustering results are not 100% correct (Fig. 3, bottomleft and bottomright panels). This is consistent with the analysis of Fig. 2.

Fig. 4. Comparing the performance of SNF and two frameworks of ANF
D. Performance Comparisons with SNF from [2]
Fig. 4 compares the performance of SNF, ANF1 and ANF2 for clustering the four cancer types into their known disease types. Since SNF has shown superior performance over KMeans, iCluster, and feature concatenation approaches [2], we do not repeat comparisons with KMeans, iCluster, etc. in this paper. We used three types of data: gene expression, miRNA expression and DNA methylation. The topleft, topright, bottomleft, and bottomright panels correspond to results from four different combinations of data types.

Except for adrenal gland, for which all three methods achieve the same clustering accuracy (NMI = 0.93; only two out of 253 samples were misclassified), at least one of ANF1 and ANF2 achieves slightly better results than SNF. Though the improvement is not significant, ANF has several advantages. ANF has a much more transparent interpretation and requires much less computation than SNF: SNF typically needs about 20 iterations to converge, while ANF needs less than half the computation required for a single iteration of SNF. Moreover, ANF provides a more general framework that can incorporate weights for multiple views, while SNF uses only uniform weights. Taking these facts into consideration, ANF is better than SNF and can thus replace SNF for clustering complex patients with multi-view data.

There is no significant difference between ANF1 and ANF2, which correspond to a one-step and a two-step random walk on a multigraph, respectively, to generate a fused simple graph for spectral clustering.

Both ANF1 and ANF2 have weight parameters for each view. We found that uniform weights usually do a good job. If we set the weight of each view to the NMI value of the clustering result using that view alone, we can usually get a slightly better result. However, the optimal weights may not be as intuitive as simple NMI values. In our experiments (Fig. 1), we randomly set the weights for gene expression, miRNA expression and DNA methylation, and achieved slightly better results using NMI values as weights in some cases. We can improve results by tuning the weight of each view; however, this usually requires using true class label information. In an unsupervised task (our ANF framework is designed for unsupervised learning; we only used true class labels to evaluate its performance), we have to use an internal evaluation metric such as silhouette, or resort to related information (such as cancer patient survival data), to evaluate clustering results.

In addition, since the ANF framework applies spectral clustering to a fused affinity matrix, we can use the eigengap heuristic to determine the number of clusters and indirectly assess cluster quality.

Fig. 5. Eigenvalues of affinity matrix of four cancer types

E. Determine the Number of Clusters Using Eigengap Analysis
In this section, we carefully examine the learned fused affinity matrices for the four cancer types. We have run ANF on 37 combinations of feature types and data types. We chose the affinity matrix W with the highest NMI value for each cancer type for the following analysis.

The fused affinity matrix W generated by ANF is a transition matrix, and is asymmetric in most cases if the node degree distribution is non-uniform. We apply spectral clustering on this affinity network with the normalized graph Laplacian L being:

L = I − D⁻¹W    (21)

where D is a diagonal matrix whose diagonal elements are the row sums of W. Though L is asymmetric, experiments show that the results can be as good as or slightly better than those from a symmetric version of L (one way to obtain a symmetric L is to first force W to be symmetric, W ← (W + Wᵀ)/2, and then calculate L). Let us define the asymmetric ratio of W as:

AsymRatio(W) = ||W − Wᵀ|| / ||W||

where Wᵀ is the transpose of W and || · || denotes the Frobenius norm. The asymmetric ratios of the fused affinity (transition) matrices W learned by ANF are 0, 0.10, 0.11, and 0.11 for adrenal gland, lung, kidney and uterus, respectively. The eigenvalues of the corresponding normalized graph Laplacian L are shown in Fig. 5.

Surprisingly, for adrenal gland, the learned affinity matrix W is symmetric (all eigenvalues are real numbers, as shown in the topleft panel of Fig. 5). With this W, all 253 samples except one are clustered correctly according to the true disease types (NMI = 0.96). For the other cancer types, W is asymmetric and there are quite a few complex eigenvalues. However, the imaginary parts of the eigenvalues are less than 0.02, and the real parts are at least 30 times larger than the imaginary parts.

TABLE II. Confusion matrix of clustering adrenal gland

Let us use perturbation theory to briefly explain this observation. We can treat an asymmetric L as a symmetric matrix plus a perturbation matrix (representing small disturbances):

L = L_sym + H

Based on perturbation theory, as long as H is relatively "small" compared with L_sym, the eigenvalues and corresponding eigenvectors of L should be near those of L_sym. The eigenvalues of L_sym are all real numbers, with the smallest eigenvalue being 0 (0 = λ₁ ≤ λ₂ ≤ ··· ≤ λ_N), and thus the eigenvalues of L should be approximately the same as those of L_sym.

One interesting property of our defined W and L is that the sum of the eigenvalues of L equals the sum of the diagonal entries of L, which equals:

∑_{i=1}^{N} λᵢ = N − ∑_{i=1}^{N} Wᵢᵢ

We observed that all eigenvalues (absolute values for complex eigenvalues) are in [0, 1]. We expect this to hold generally for any transition matrix W, and thus for L (without rigorous proof).

Importantly, we found the eigengap heuristic very useful for deciding the number of clusters. For adrenal gland, the first two smallest eigenvalues are very near 0, while the third one is about 0.2. The eigengap between the third and second smallest eigenvalues is relatively large. This suggests there should be two "natural" clusters (corresponding to the two nearly zero eigenvalues). Furthermore, the eigengap between the fourth and third eigenvalues is relatively high, too. This suggests we can use the learned affinity matrix W for disease subtype discovery for adrenal gland. In fact, when we set the number of clusters to 3, our framework separates the 176 "TCGA-PCPG" samples into two groups consisting of 155 samples and 21 samples, respectively. If we set the number of clusters to 4, the 155 samples are further split into two small groups, as shown in Table II.

Similarly, for lung cancer, the two smallest eigenvalues are near zero, and the eigengap between the third and second eigenvalues is relatively high.
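The quantities used in this analysis can be written out in a few lines. The sketch below is our NumPy reconstruction from the definitions above (the normalized Laplacian of Eq. 21, AsymRatio, and the eigengap heuristic), not the released ANF code; `eigengap_num_clusters` picks the number of clusters at the largest gap among the smallest eigenvalue magnitudes:

```python
import numpy as np

def normalized_laplacian(w):
    """L = I - D^{-1} W, where D is diagonal with the row sums of W.
    If W is a transition matrix (rows sum to 1), this reduces to I - W."""
    d = w.sum(axis=1)
    return np.eye(len(w)) - w / d[:, None]

def asym_ratio(w):
    """AsymRatio(W) = ||W - W^T||_F / ||W||_F."""
    return np.linalg.norm(w - w.T) / np.linalg.norm(w)

def eigengap_num_clusters(laplacian, max_k=10):
    """Eigengap heuristic: sort eigenvalue magnitudes ascending and pick k
    at the largest gap among the smallest max_k eigenvalues (imaginary
    parts are negligible for mildly asymmetric L, as observed above)."""
    vals = np.sort(np.abs(np.linalg.eigvals(laplacian)))[:max_k]
    return int(np.argmax(np.diff(vals))) + 1
```

For a block-diagonal transition matrix with two blocks, L = I − W has two (near-)zero eigenvalues, and the heuristic returns 2, matching the two "natural" clusters.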
This information alone suggests that there should be two "natural" clusters, which coincides with the two different disease types.

TABLE III. Clustering accuracy of four cancer types
        Adrenal gland   Lung   Kidney   Uterus
NMI     0.96            0.75   0.84     0.61
ARI     0.98            0.83   0.91     0.78

For kidney cancer, three eigenvalues are near 0, while the eigengap between the fourth and the third is relatively big. From this information, we can infer that there should be three "natural" clusters, which coincides with the fact that there are indeed three different disease types in kidney cancer. Our clustering results achieve high accuracy for adrenal gland, lung, and kidney cancers (Table III).

For uterus cancer, only one eigenvalue is 0, and the eigengap between the second and first eigenvalues is already relatively large. Thus there might not be "natural" clusters ("natural" clusters emerge from a nearly block-diagonal affinity matrix W). Consequently, our clustering result only achieves a moderate accuracy, with NMI = 0.61 and ARI = 0.78. However, the log-rank test of survival distributions of the two clusters yields a p-value comparable to that calculated from the true disease types. This suggests the identified clusters may still be useful for defining clinically different patient groups.

Without true class label information, eigengap analysis can be used to predict the number of clusters and to assess the "cluster quality" of an affinity matrix for spectral clustering. In fact, the above eigengap analysis of the learned affinity matrices successfully reveals the potential number of "natural" clusters and is consistent with spectral clustering theory. This suggests that the four learned fused affinity matrices may have successfully captured patient group structure and can be used for unknown cancer subtype detection (for example, we can further cluster adrenal gland into more than two clusters, as shown in Table II).

VI. CONCLUSION
Defining cancer subtypes and identifying subtype-specific molecular signatures associated with clinical variables is one major goal of cancer genomics. In this paper, we presented the affinity network fusion (ANF) framework, an upgrade of SNF [2], for clustering cancer patients by integrating multi-omic data. ANF has a clear interpretation, is more general than SNF, and can achieve as good as or even better results than SNF with much less computation. We performed extensive experiments on a selected cohort of 2193 cancer patients from four primary sites and nine disease types, and achieved high clustering accuracy. With this carefully selected "gold" dataset, we demonstrated the power of ANF and the power of feature selection and transformation in cancer patient clustering.

Eigengap analysis on the learned fused affinity matrices is highly consistent with true class label information, which strongly suggests that the learned affinity matrices may capture the internal structure of patient groups. We can use these matrices for subsequent cancer subtype discovery. Once disease subgroups are defined, future work may focus on a relatively homogeneous group of patients to identify subtype-specific comprehensive molecular signatures.

While we only reported experimental results on four cancer types with known disease types, ANF can be used for discovering subtypes of other cancers and, more generally, for complex object clustering with multi-view feature matrices.
REFERENCES
[1] M. Bersanelli, E. Mosca, D. Remondini, E. Giampieri, C. Sala, G. Castellani, and L. Milanesi, "Methods for the integration of multi-omics data: mathematical aspects," BMC Bioinformatics, vol. 17, no. 2, p. S15, Jan 2016. [Online]. Available: https://doi.org/10.1186/s12859-015-0857-9
[2] B. Wang, A. M. Mezlini, F. Demir, M. Fiume, Z. Tu, M. Brudno, B. Haibe-Kains, and A. Goldenberg, "Similarity network fusion for aggregating data types on a genomic scale," Nature Methods, vol. 11, no. 3, pp. 333–337, 2014.
[3] R. Shen, Q. Mo, N. Schultz, V. E. Seshan, A. B. Olshen, J. Huse, M. Ladanyi, and C. Sander, "Integrative subtype discovery in glioblastoma using iCluster," PLOS ONE, vol. 7, no. 4, pp. 1–9, 04 2012. [Online]. Available: https://doi.org/10.1371/journal.pone.0035236
[4] B. P. Durbin, J. S. Hardin, D. M. Hawkins, and D. M. Rocke, "A variance-stabilizing transformation for gene-expression microarray data," Bioinformatics, vol. 18, no. suppl 1, pp. S105–S110, 2002.
[5] C. Meng, O. A. Zeleznik, G. G. Thallinger, B. Kuster, A. M. Gholami, and A. C. Culhane, "Dimension reduction techniques for the integrative analysis of multi-omics data," Briefings in Bioinformatics, vol. 17, no. 4, pp. 628–641, 2016. [Online]. Available: http://dx.doi.org/10.1093/bib/bbv108
[6] M. D. M. Leiserson, F. Vandin, H.-T. Wu, J. R. Dobson, J. V. Eldridge, J. L. Thomas, A. Papoutsaki, Y. Kim, B. Niu, M. McLellan, M. S. Lawrence, A. Gonzalez-Perez, D. Tamborero, Y. Cheng, G. A. Ryslik, N. Lopez-Bigas, G. Getz, L. Ding, and B. J. Raphael, "Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes," Nat Genet, vol. 47, no. 2, pp. 106–114, Feb. 2015. [Online]. Available: http://dx.doi.org/10.1038/ng.3168
[7] C. J. Vaske, S. C. Benz, J. Z. Sanborn, D. Earl, C. Szeto, J. Zhu, D. Haussler, and J. M. Stuart, "Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM," Bioinformatics, vol. 26, no. 12, pp. i237–i245, 2010. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btq182
[8] T. Ma and A. Zhang, "A framework for robust differential network modular structure discovery from RNA-seq data," in Bioinformatics and Biomedicine (BIBM), 2016 IEEE International Conference on. IEEE, 2016, pp. 288–293.
[9] M. I. Love, W. Huber, and S. Anders, "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2," Genome Biology, vol. 15, no. 12, p. 550, Dec 2014. [Online]. Available: https://doi.org/10.1186/s13059-014-0550-8
[10] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, Dec. 2007. [Online]. Available: http://dx.doi.org/10.1007/s11222-007-9033-z
[11] S. Monti, P. Tamayo, J. Mesirov, and T. Golub, "Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data," Machine Learning, vol. 52, no. 1, pp. 91–118, 2003.
[12] S. X. Yu and J. Shi, "Multiclass spectral clustering," in Proceedings of the IEEE International Conference on Computer Vision. IEEE, 2003, p. 313.
[13] L. Hubert and P. Arabie, "Comparing partitions," Journal of Classification, vol. 2, no. 1, pp. 193–218, Dec 1985. [Online]. Available: https://doi.org/10.1007/BF01908075
[14] D. P. Harrington and T. R. Fleming, "A class of rank test procedures for censored survival data," Biometrika, vol. 69, no. 3, pp. 553–566, 1982.