Kernel learning approaches for summarising and combining posterior similarity matrices
Alessandra Cabassi, Sylvia Richardson, and Paul D. W. Kirk
MRC Biostatistics Unit and Cambridge Institute of Therapeutic Immunology & Infectious Disease, University of Cambridge, U.K.
Preprint, September 29, 2020
Abstract
Summary: When using Markov chain Monte Carlo (MCMC) algorithms to perform inference for Bayesian clustering models, such as mixture models, the output is typically a sample of clusterings (partitions) drawn from the posterior distribution. In practice, a key challenge is how to summarise this output. Here we build upon the notion of the posterior similarity matrix (PSM) in order to suggest new approaches for summarising the output of MCMC algorithms for Bayesian clustering models. A key contribution of our work is the observation that PSMs are positive semi-definite, and hence can be used to define probabilistically-motivated kernel matrices that capture the clustering structure present in the data. This observation enables us to employ a range of kernel methods to obtain summary clusterings, and otherwise exploit the information summarised by PSMs. For example, if we have multiple PSMs, each corresponding to a different dataset on a common set of statistical units, we may use standard methods for combining kernels in order to perform integrative clustering. We may moreover embed PSMs within predictive kernel models in order to perform outcome-guided data integration. We demonstrate the performance of the proposed methods through a range of simulation studies as well as two real data applications.
Availability: R code is available at https://github.com/acabassi/combine-psms. Contact: [email protected], [email protected]
Clustering techniques aim to partition a set of statistical units in such a way that items in the same group are similar and items in different groups are dissimilar. The definition of similarity between observations varies according to the application at hand. In biomedical applications, for example, clustering can be used to define groups of patients who have similar genotypes or phenotypes and are therefore more likely to respond similarly to a certain treatment and/or have similar prognoses. Statistical methods for clustering can be divided into two main categories: heuristic approaches such as, for instance, k-means (Hartigan and Wong, 1979) and hierarchical clustering (Kaufman and Rousseeuw, 1990), and model-based techniques such as mixture models (McLachlan and Peel, 2004). Here we are interested in particular in Bayesian mixture models. Inference on the parameters of these models can be performed either via deterministic approximate inference methods based on variational inference (Bishop, 2006, Blei et al., 2006) or using Markov chain Monte Carlo (MCMC) schemes to sample from the posterior distribution (Gelfand and Smith, 1990). One of the advantages of Bayesian model-based clustering is that, if infinite mixtures are used, one does not need to specify the number of clusters a priori (Rasmussen, 2000). On the other hand, summarising the output of MCMC algorithms can be challenging, as it includes a large number of partitions of the data sampled from the posterior distribution. The labels of each mixture component can change at each iteration of the MCMC, because the model likelihood is invariant under permutations of the indices. This phenomenon is known as label switching (Celeux et al., 2000). One way of summarising the cluster allocations that circumvents this problem is to compute a posterior similarity matrix (PSM), containing the probability that each pair of observations belongs to the same cluster (more details about Bayesian mixture models and PSMs can be found in Section 2). In addition to that, however, one is often interested in finding one clustering of the data that best matches the information contained in the PSM (Fritsch and Ickstadt, 2009).

We propose a new method to summarise the PSMs derived from the MCMC output of Bayesian model-based clustering, showing that they are valid kernel matrices. Consequently, we are able to use kernel methods such as kernel k-means to find a summary clustering, and can make use of machine learning algorithms developed for pattern analysis (see e.g. Shawe-Taylor and Cristianini, 2004) to combine the PSMs obtained from different datasets. We assume that we have a similarity matrix for each dataset, obtained from initial clustering analyses performed on each dataset independently. We show how these multiple kernel learning (MKL) techniques allow us to find a global clustering structure and assign different weights to each PSM, depending on how much information each provides about the clustering structure. We also show how we may include a response variable, if one is available, in order to perform outcome-guided integrative clustering. In particular, we show that, in the unsupervised framework, we can use the localised multiple kernel k-means approach of Gönen and Margolin (2014). We further demonstrate that, if a response variable is available for each data point, it is possible to incorporate this information using the simpleMKL algorithm of Rakotomamonjy et al. (2008) in order to perform outcome-guided integrative clustering.
Both algorithms assign a weight to each PSM, which is output together with the global cluster assignment. This work therefore contributes to the many ways of summarising the clusterings sampled from the posterior distribution that have already been proposed (Fritsch and Ickstadt, 2009, Wade and Ghahramani, 2018). Our approach performs equally well in the case of one dataset and has the advantage of being easily extended to the case of multiple data sources.

From a different perspective, this work also suggests a new rational way to define kernels that are appropriate when the data are believed to possess a clustering structure. Lanckriet et al. (2004) applied MKL methods to the problem of genomic data fusion, trying different kernels for each data source. However, how to define a kernel in general remained an open problem. Cabassi and Kirk (2020) suggested the use of methods based on multiple kernel learning to summarise the output of multiple similarity matrices, each resulting from applying consensus clustering (Monti et al., 2003) to a different dataset. Similarly, using PSMs as kernels ensures that the similarities between data points that we consider correctly reflect the clustering structure present in the data.

The manuscript is organised as follows. In Section 2 we introduce the problem of summarising PSMs. In Section 3 we recall the theory of kernel methods, prove that PSMs are valid kernel matrices and explain how this allows us to apply kernel methods to the problems mentioned above. We also present the method that we use to choose the best number of clusters in the final summary. In Section 4 we present some simulation studies demonstrating the performance of kernel methods for summarising PSMs and for integrating multiple datasets in an unsupervised and outcome-guided fashion. Finally, in Section 5 we introduce our motivating examples and illustrate how our methodology can be applied to them.

In this section we briefly recall the concept of Bayesian mixture models (Section 2.1) and introduce the problem of summarising the posterior distribution on the cluster allocations (Section 2.2).
Mixture models assume that the data are drawn from a mixture of distributions:

    p(x) = Σ_{k=1}^{K} π_k f_x(x | φ_k),    (1)

where f_x is a parametric density that depends on the parameter(s) φ_k, and the π_k are the mixture weights, which must satisfy 0 ≤ π_k ≤ 1 and Σ_{k=1}^{K} π_k = 1. In the Bayesian framework, we assign a prior distribution to the set of all parameters, Π = [π_1, ..., π_K] and Φ = [φ_1, ..., φ_K]. In the finite mixtures case, methods to estimate the posterior distributions when the true number of mixture components is unknown include the MCMC-based algorithms proposed by Ferguson (1973) and Richardson and Green (1997). MCMC approaches also exist for so-called infinite mixture models (Rasmussen, 2000), such as Dirichlet process mixtures and their generalisations (Antoniak, 1974).

When using MCMC methods in order to perform Bayesian clustering on a dataset X = [x_1, ..., x_N], one obtains a vector of cluster assignments c^(b) = [c_1^(b), ..., c_N^(b)] from the posterior distribution for each iteration b = 1, ..., B of the algorithm (see, for example, Neal, 2000). From this, it is possible to obtain a Monte Carlo estimate of the probability that observations i and j belong to the same cluster as follows:

    P(c_i = c_j | X) ≈ (1/B) Σ_{b=1}^{B} I{c_i^(b) = c_j^(b)} =: Δ_ij.    (2)

We denote by Δ the posterior similarity matrix, that is, the matrix whose ijth entry Δ_ij is equal to the right-hand side of Equation (2).
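To make the construction of Δ concrete, the following minimal R sketch computes a PSM from a matrix of sampled allocations. It is illustrative only: `draws` is a placeholder for the output of whichever MCMC sampler is used, and the function name is ours.

```r
# Equation (2): `draws` is a B x N matrix whose b-th row contains the cluster
# allocations c^(b) sampled at MCMC iteration b (hypothetical input).
compute_psm <- function(draws) {
  B <- nrow(draws)
  N <- ncol(draws)
  psm <- matrix(0, N, N)
  for (b in seq_len(B)) {
    # I{c_i^(b) = c_j^(b)} for all pairs (i, j), accumulated over iterations
    psm <- psm + outer(draws[b, ], draws[b, ], "==")
  }
  psm / B
}
```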
Many ways to find a final clustering using the PSM Δ have been proposed (Binder, 1978, Dahl, 2006, Fritsch and Ickstadt, 2009, Medvedovic and Sivaganesan, 2002, Wade and Ghahramani, 2018). A simple solution is to choose, among the c^(b), the one that maximises the posterior density. The problem with this approach is that many clusterings are associated with very similar posterior densities (Fritsch and Ickstadt, 2009). A more principled approach is to define a loss function L(c, ĉ) measuring the loss of information that occurs when estimating the true clustering c with ĉ (Binder, 1978). The optimal clustering c* is then defined as the one minimising the posterior expected loss:

    c* = arg min_ĉ E[L(c, ĉ) | X] = arg min_ĉ Σ_c L(c, ĉ) p(c | X).    (3)

Binder (1978), for instance, suggested choosing the clustering ĉ that minimises the loss function

    L_Binder(c, ĉ) = Σ_{i<j} [ l_1 I(c_i = c_j) I(ĉ_i ≠ ĉ_j) + l_2 I(c_i ≠ c_j) I(ĉ_i = ĉ_j) ],

where l_1 and l_2 are weights that make it possible to penalise more heavily the case in which two observations are assigned to different clusters when they belong to the same one (l_1/l_2 > 1) or vice versa (l_1/l_2 < 1), and I is the indicator function. If l_1 = l_2, then

    c*_Binder = arg min_ĉ Σ_{i<j} | I(ĉ_i = ĉ_j) − Δ_ij |.

Kernel methods rely on mapping the data into an inner product space X (the feature space) via a feature map φ: R^P → X. The map δ: R^P × R^P → R defined by δ(x, x') = ⟨φ(x), φ(x')⟩_X is then a kernel. Moreover, using Mercer's theorem, it can be shown that for any positive semi-definite kernel function δ there exists a corresponding feature map φ (see e.g. Vapnik, 1998). That is,

Theorem 3.1. For each kernel δ, there exists a feature map φ taking values in some inner product space X such that δ(x, x') = ⟨φ(x), φ(x')⟩_X.

In practice, it is therefore often sufficient to specify a positive semi-definite kernel matrix, Δ, in order to allow us to apply kernel methods such as those presented in the following sections. For a more detailed discussion of kernel methods, see e.g. Shawe-Taylor and Cristianini (2004).

It has been shown elsewhere that co-clustering matrices are valid kernel matrices (Cabassi and Kirk, 2020). We show here that this result also holds for PSMs, and hence that they can be used as input for any kernel-based model. PSMs are convex combinations of co-clustering matrices C^(b), where each matrix C^(b) is defined as

    C_ij^(b) = 1 if c_i^(b) = c_j^(b), and 0 otherwise,

so that C_ij^(b) indicates whether the statistical units i and j are assigned to the same cluster at iteration b of the MCMC chain. Denoting by K the total number of clusters and reordering the rows and columns, each C^(b) can be written as a block-diagonal matrix in which every block is a matrix of ones:

    C^(b) = blkdiag(J_1, ..., J_K),    (9)

where J_k is an n_k × n_k matrix of ones, with n_k being the number of items in cluster k. The eigenvalues of a block-diagonal matrix are simply the eigenvalues of its blocks, which, in this case, are non-negative. Therefore all C^(b), b = 1, ..., B, are positive semi-definite. Now, if λ is a non-negative scalar and C is positive semi-definite, then λC is also positive semi-definite. Moreover, the sum of positive semi-definite matrices is a positive semi-definite matrix. Therefore, given any set of non-negative λ_b, b = 1, ..., B, Σ_{b=1}^{B} λ_b C^(b) is positive semi-definite. We can conclude that any PSM is positive semi-definite.
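This property can also be checked numerically for any given PSM. The short R sketch below (ours, not part of the original text) verifies positive semi-definiteness through the eigenvalues of the symmetric matrix.

```r
# Numerical check that a PSM is a valid kernel matrix: all eigenvalues of the
# symmetric matrix should be non-negative, up to numerical tolerance.
is_valid_kernel <- function(psm, tol = 1e-10) {
  eigenvalues <- eigen(psm, symmetric = TRUE, only.values = TRUE)$values
  all(eigenvalues > -tol)
}
```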
Here we show that the fact that all posterior similarity matrices are valid kernels allows us to use kernel methods to find a clustering of the data that summarises a sample of clusterings c^(1), ..., c^(B) from the posterior distribution of an MCMC algorithm for Bayesian clustering. To do this, we suggest using an extension of the well-known k-means algorithm that only needs a kernel matrix as input.

Moreover, this method can be easily extended to allow us to combine multiple PSMs. This can be a useful feature in many circumstances. For instance, as is the case in our motivating examples, one could have different types of information relative to the same statistical observations. In this situation, it may be appropriate to define and fit different mixture models on each data type, and then summarise the posterior samples of clusterings at a later stage. This can be achieved by using multiple kernel k-means algorithms, which allow us to combine multiple kernels to find a global clustering. On top of that, these techniques also assign different weights to each kernel. These can be used to assess how much each dataset contributed to the final clustering, and therefore to get an idea of how much information about the clustering structure is present in each data type.

The problem with combining multiple kernels is, however, that it is not always clear whether they all share the same clustering structure. To overcome this issue, we also propose an outcome-guided algorithm to summarise multiple PSMs. The idea is that, instead of choosing the weight of each kernel in an unsupervised way, if we have a variable available which is closely related to the outcome of interest, we should weight more highly the kernels in which statistical units that have similar outcomes are closer to each other. In mathematical terms, this corresponds to using support vector machines to find the kernel weights, where the response variable is our proxy for the outcome.

In order to illustrate our method for summarising PSMs, we first recall the main ideas behind k-means clustering and then present its extension to the kernel framework.

k-means clustering. k-means clustering is a widely used clustering algorithm, first introduced by Steinhaus (1956). Let x_1, ..., x_N indicate the observed dataset, with x_n ∈ R^P, and let z_nk be the corresponding cluster labels, where Σ_k z_nk = 1 and

    z_nk = 1 if x_n belongs to cluster k, and 0 otherwise.    (10)

We denote by Z the N × K matrix with ijth element equal to z_ij. The goal of the k-means algorithm is to minimise the sum of all squared distances between the data points x_n and the corresponding cluster centroids m_k. The optimisation problem is

    minimise_Z Σ_n Σ_k z_nk ||x_n − m_k||²    (11a)
    subject to Σ_k z_nk = 1, ∀n,    (11b)
               N_k = Σ_n z_nk, ∀k,    (11c)
               m_k = (1/N_k) Σ_n z_nk x_n, ∀k,    (11d)

where ||·|| indicates the Euclidean norm.

Kernel k-means clustering. Now we can show how the kernel trick works in the case of the k-means clustering algorithm (Girolami, 2002). Redefining the objective function of Equation (11a) in terms of the distances between the observations and the cluster centres in the feature space X, the optimisation problem becomes

    minimise_Z Σ_n Σ_k z_nk ||φ(x_n) − m̃_k||²_X    (12a)
    subject to Σ_k z_nk = 1, ∀n,    (12b)
               N_k = Σ_n z_nk, ∀k,    (12c)
               m̃_k = (1/N_k) Σ_n z_nk φ(x_n), ∀k,    (12d)

where we have indicated by m̃_k the cluster centroids in the feature space X. Each term of the sum in Equation (12a) can be written as

    ||φ(x_n) − m̃_k||²_X = ⟨φ(x_n) − m̃_k, φ(x_n) − m̃_k⟩_X    (13)
    = ⟨φ(x_n), φ(x_n)⟩_X − (2/N_k) Σ_{i=1}^{N} z_ik ⟨φ(x_n), φ(x_i)⟩_X    (14)
      + (1/N_k²) Σ_{i=1}^{N} Σ_{j=1}^{N} z_ik z_jk ⟨φ(x_i), φ(x_j)⟩_X    (15)
    = δ(x_n, x_n) − (2/N_k) Σ_{i=1}^{N} z_ik δ(x_n, x_i) + (1/N_k²) Σ_{i=1}^{N} Σ_{j=1}^{N} z_ik z_jk δ(x_i, x_j).    (16)

Therefore, we do not need to evaluate the map φ at every point x_i to compute the objective function of Equation (12a). Instead, we just need to know the values of the kernel evaluated at each pair of data points, δ(x_i, x_j), i, j = 1, ..., N. This is what is commonly referred to as the kernel trick.

We have seen how kernels can be used to perform k-means clustering. Now, if we have a sample of clusterings from the posterior, we can easily exploit this technique to find a summary clustering. Once we have computed our PSM Δ, this will be our kernel matrix. The clustering of interest will then be the one given by kernel k-means in the form of z_nk, n = 1, ..., N, k = 1, ..., K.
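As an illustration, the following R sketch implements the kernel k-means updates of Equation (16) with a PSM used as the kernel matrix. The function name and its defaults are ours; it is a minimal sketch, not the implementation used in the paper.

```r
kernel_kmeans <- function(psm, K, max_iter = 100, seed = 1) {
  set.seed(seed)
  N <- nrow(psm)
  labels <- sample(K, N, replace = TRUE)           # random initial allocation
  for (iter in seq_len(max_iter)) {
    # Distance of each point to each centroid in feature space, expanded via
    # the kernel trick as in Equation (16)
    dist_to_centres <- sapply(seq_len(K), function(k) {
      members <- which(labels == k)
      n_k <- length(members)
      if (n_k == 0) return(rep(Inf, N))            # empty clusters attract no points
      diag(psm) -
        2 * rowSums(psm[, members, drop = FALSE]) / n_k +
        sum(psm[members, members]) / n_k^2
    })
    new_labels <- max.col(-dist_to_centres)        # argmin over clusters
    if (all(new_labels == labels)) break           # converged
    labels <- new_labels
  }
  labels
}
```

For a single dataset, calling this function on the PSM with a chosen K returns the summary clustering described above.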
The number of clusters K is chosen as explained in Section 3.4.

3.3.2 Combining PSMs to perform integrative clustering

To combine multiple PSMs relative to the same statistical units, all we need to do is use the extension of kernel k-means to the case of multiple kernels.

Multiple kernel k-means clustering. Gönen and Margolin (2014) extended the kernel k-means approach to the case of multiple kernels. We consider multiple datasets X_1, ..., X_M, each with a different mapping function φ_m: R^P → X_m and corresponding kernel δ_m(x_i, x_j) = ⟨φ_m(x_i), φ_m(x_j)⟩_{X_m} and kernel matrix Δ_m. Then, if we define

    φ_Θ(x_i) = [ θ_i1 φ_1(x_i), θ_i2 φ_2(x_i), ..., θ_iM φ_M(x_i) ],    (17)

where Θ ∈ R_+^{N×M} is a matrix of kernel weights such that θ_im is the weight of observation x_i in dataset m, with Σ_m θ_im = 1 for all i = 1, ..., N and θ_im ≥ 0, the kernel function of this multiple-feature problem is a convex sum of the single kernels:

    δ_Θ(x_i, x_j) = ⟨φ_Θ(x_i), φ_Θ(x_j)⟩    (18)
    = Σ_{m=1}^{M} θ_im θ_jm ⟨φ_m(x_i), φ_m(x_j)⟩_{X_m}    (19)
    = Σ_{m=1}^{M} θ_im θ_jm δ_m(x_i, x_j).    (20)

We denote the corresponding kernel matrix by Δ_Θ. The optimisation strategy proposed by Gönen and Margolin (2014) is based on the idea that, for some fixed vector of weights θ, the problem is equivalent to the one of Equation (12a), where we had only one kernel. They therefore develop a two-step optimisation strategy: (1) given a fixed matrix of weights Θ, solve the optimisation problem as in the case of one kernel, with kernel matrix Δ_Θ; and then (2) minimise the objective function with respect to the kernel weights, keeping the assignment variables fixed. The latter is a convex quadratic programming (QP) problem that can be solved with any standard QP solver up to a moderate number of kernels M.

Similarly to before, once we have defined the kernels Δ_m to be equal to each of our PSMs, the labels found through multiple kernel k-means constitute the clustering that we are looking for. Moreover, the kernel weights give us an indication of how much each PSM contributed to the final clustering.
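For illustration, the combined kernel of Equation (20) can be evaluated directly from the individual PSMs and a matrix of per-observation weights. The R sketch below is ours, with hypothetical argument names; it simply forms Δ_Θ and does not perform the weight optimisation.

```r
# `psm_list` is a list of M (N x N) PSMs and `theta` an N x M matrix of
# localised weights with non-negative entries and rows summing to one.
combine_kernels <- function(psm_list, theta) {
  N <- nrow(psm_list[[1]])
  delta_theta <- matrix(0, N, N)
  for (m in seq_along(psm_list)) {
    # theta[, m] %o% theta[, m] has entry (i, j) equal to theta_im * theta_jm
    delta_theta <- delta_theta + (theta[, m] %o% theta[, m]) * psm_list[[m]]
  }
  delta_theta
}
```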
3.3.3 Outcome-guided integration

Suppose now that, in addition to the posterior similarity matrices Δ_1, ..., Δ_M, we also have a categorical response variable y_n associated with each observation x_n. As explained above, we would like to use this information to guide our clustering algorithm. We can use the simpleMKL algorithm described in the remainder of this section to find the kernel weights θ_1, ..., θ_M, and then use kernel k-means on the weighted kernel Δ = Σ_m θ_m Δ_m to find the final clustering (Figure 2).

Support vector machines. We briefly recall here the concept of the support vector machine (SVM; Boser et al., 1992), which is widely used for solving problems in classification and regression (Bishop, 2006, Schölkopf and Smola, 2001).

Figure 1. Schematic representation of the MKL-based integrative clustering approach. Each colour indicates a different dataset/kernel. First, a mixture model is fit using MCMC on each dataset separately. The resulting PSMs are valid kernels that can be used as input to kernel k-means to find a global clustering of the data. Note that, as shown elsewhere (Cabassi and Kirk, 2020), other similarity matrices, such as the similarity matrices given by consensus clustering (CC), define valid kernels. For this reason, they can be combined with PSMs via MKL.

In its simplest form, this method is applied to a binary classification problem, in which the data points x_1, ..., x_N ∈ R^P in the training set are assigned to two classes indicated by the target values y_n ∈ {−1, 1}, n = 1, ..., N. We consider a feature map φ: R^P → X and the associated kernel δ(·,·): X × X → R such that δ(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩_X. Suppose that there exist some values of α_n and b such that

    f(x) = Σ_{n=1}^{N} α_n δ(x, x_n)    (21)

satisfies f(x_n) + b > 0 if y_n = 1 and f(x_n) + b < 0 if y_n = −1, where f is a function that lives in a function space H endowed with the norm ||·||_H. Then, this function can be used to classify new data points x according to the sign of f(x) + b. For support vector machines, the parameters α_n and b are chosen so as to maximise the margin, i.e. the distance between the decision boundary given by Equation (21) and the point x_n that is closest to the boundary. It can be shown that this can be achieved by solving the quadratic programming problem (see e.g. Bishop, 2006, Rakotomamonjy et al., 2008)

    minimise_{f, b} ||f||²_H    (22a)
    subject to y_n [ f(x_n) + b ] ≥ 1, ∀n.    (22b)

However, in real applications, it is usually not possible to separate the two classes perfectly. Hence, in order to take misclassifications into account, it is necessary to introduce a penalty term that is linear with respect to the distance of the misclassified points from the classification boundary (Bennett and Mangasarian, 1992). To this end, we define a variable ξ_n (known as a slack variable) for each data point such that

    ξ_n = 0 if x_n is correctly classified, and |y_n − f(x_n)| otherwise.    (23)

The optimisation problem of Equation (22) then becomes

    minimise_{f, b, {ξ_n}} ||f||²_H + λ Σ_n ξ_n    (24a)
    subject to y_n [ f(x_n) + b ] ≥ 1 − ξ_n, ∀n,    (24b)
               ξ_n ≥ 0, ∀n,    (24c)

where λ > 0 is a regularisation parameter. This problem can be solved efficiently, for example via sequential minimal optimisation (Platt, 1999). For more details about SVMs see e.g. Bishop (2006).

Multiple kernel learning for SVMs. In the multiple kernel learning framework for SVMs, we consider M different feature representations, with mapping functions φ_m and corresponding kernel functions δ_m and feature spaces X_m. We substitute the kernel δ of Equation (21) with a convex combination of the kernels δ_m (Lanckriet et al., 2004):

    f(x) + b = Σ_{m=1}^{M} θ_m f_m(x) + b,    (25)

where θ_m ≥ 0, Σ_m θ_m = 1 and f_m(x) = Σ_n α_{mn} δ_m(x, x_n). Rakotomamonjy and Bach (2007) then proposed to solve the optimisation problem

    minimise_{{f_m}, b, {ξ_n}, {θ_m}} J(θ) := (1/2) Σ_m (1/θ_m) ||f_m||²_{H_m} + λ Σ_n ξ_n    (26a)
    subject to y_n [ Σ_m f_m(x_n) + b ] ≥ 1 − ξ_n, ∀n,    (26b)
               ξ_n ≥ 0, ∀n,    (26c)
               Σ_m θ_m = 1,    (26d)
               θ_m ≥ 0, ∀m,    (26e)

using the convention that x/0 = 0 if x = 0, and ∞ otherwise. The algorithm of Rakotomamonjy and Bach takes the name of simpleMKL and is based on the idea that one can iteratively solve a standard SVM problem (24a) for a fixed value of θ, and then update the vector of weights θ using gradient descent on the objective function J(θ). Since the objective function is smooth and differentiable with Lipschitz gradient, it can be easily optimised with the reduced gradient algorithm (Luenberger and Ye, 1984, Chapter 11). If the standard SVM problem is solved exactly at each iteration, then convergence to the global optimum is guaranteed (Luenberger and Ye, 1984).
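Because a PSM is itself a valid kernel matrix, it can be supplied to any SVM solver that accepts precomputed kernels. The R sketch below uses the kernlab package purely for illustration (our choice, not necessarily the software used in the paper); `psm` and the class factor `y` are placeholders.

```r
# Fitting a C-SVM on a PSM used as a precomputed kernel (kernlab assumed).
library(kernlab)
fit <- ksvm(as.kernelMatrix(psm), y, type = "C-svc", C = 1)
fitted(fit)   # fitted class labels for the training units
```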
Multiclass multiple kernel learning. SVMs can also be used when the target value y_n takes more than two different values. The most commonly used approaches are called one-versus-the-rest (Vapnik, 1998) and one-versus-one (Knerr et al., 1990). In the first, we consider in turn each class as the "positive" case and all the others as the "negative" cases. In this way, we construct K different classifiers and then assign a new observation x using

    y(x) = max_{k ∈ {1, ..., K}} y_k(x).    (27)

The second approach is to train one SVM for each pair of classes and then assign a point x to the class to which it is assigned most often.

Rakotomamonjy et al. (2008) extended the simpleMKL algorithm to the case of a response with K > 2 classes by defining J(θ) as the sum of all the cost functions of the partial SVMs J_s(θ):

    J(θ) = Σ_{s ∈ S} J_s(θ),    (28)

where S indicates the set of all partial SVMs and each J_s is defined as in Equation (26a).

Figure 2. Schematic representation of the MKL-based outcome-guided integrative clustering approach. Each colour indicates a different dataset/kernel. First, a mixture model is fit on each dataset separately. The resulting PSMs are valid kernels that can be used as input to simpleMKL, if a response variable is available, to find a global clustering of the data.

It is important to note that none of these approaches explicitly relies on the fact that the Δ_m are posterior similarity matrices. Hence, any other type of matrix Δ_m can be used, as long as it is symmetric, positive semi-definite and its entries Δ_mij can be interpreted as some measure of the similarity between x_i and x_j.

Many possible approaches have been proposed to choose the number of clusters (see e.g. Dudoit and Fridlyand, 2002, Milligan and Cooper, 1985, Tibshirani et al., 2001, Yeung et al., 2001). We focus on the so-called silhouette, a measure of the compactness of the clustering structure, proposed by Rousseeuw (1987). There are two ways of defining the silhouette of a cluster, based respectively on the similarities and the dissimilarities between the data. Here we briefly explain the former.

Given some cluster assignment labels c = [c_1, ..., c_N] and some measure of the similarity between the data points, Δ_ij for all i, j = 1, ..., N, we can define the following quantities: a_n is the average similarity of x_n to all other objects of cluster c_n and, for each cluster c_i ≠ c_n, Δ_{n,c_i} is the average similarity of observation n to all objects belonging to cluster c_i. Moreover, let us indicate by b_n the maximum of Δ_{n,c_i} over all clusters c_i ≠ c_n. Then, for each observation n = 1, ..., N, we can calculate

    s_n = a_n/b_n − 1 if a_n < b_n,  0 if a_n = b_n,  1 − b_n/a_n if a_n > b_n.    (29)

This takes values between −1 and 1, with values close to 1 indicating that x_n is well-clustered and negative values suggesting that x_n has been misclassified.

Thus, we run our algorithms for combining the posterior similarity matrices with different numbers of clusters, from K_min to K_max. We consider the overall average silhouette width s̄ = (1/N) Σ_{n=1}^{N} s_n as a measure of the compactness of the clusters, and we choose the value of K that gives the highest value of s̄.
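A minimal R sketch of this selection rule is given below, reusing the `kernel_kmeans` sketch from earlier; the function names, the candidate range of K and the treatment of singleton clusters are our own choices, not part of the original text.

```r
# Similarity-based silhouette of Equation (29), averaged over observations.
average_silhouette <- function(psm, labels) {
  s <- sapply(seq_len(nrow(psm)), function(n) {
    own    <- setdiff(which(labels == labels[n]), n)   # same cluster, excluding n
    others <- setdiff(unique(labels), labels[n])       # the remaining clusters
    if (length(own) == 0 || length(others) == 0) return(0)
    a_n <- mean(psm[n, own])
    b_n <- max(sapply(others, function(k) mean(psm[n, labels == k])))
    if (a_n > b_n) 1 - b_n / a_n else if (a_n < b_n) a_n / b_n - 1 else 0
  })
  mean(s)
}

# Choose K by maximising the average silhouette over a grid of candidate values.
K_grid  <- 2:20
avg_sil <- sapply(K_grid, function(K) average_silhouette(psm, kernel_kmeans(psm, K)))
K_best  <- K_grid[which.max(avg_sil)]
```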
Here we show how the methods presented above perform in practice. In Section 4.1 we explain how we generate the synthetic datasets for the simulation studies. In Section 4.2 we show that the kernel k-means approach applied to a posterior similarity matrix derived from a single dataset performs similarly to standard clustering methods. Additionally, to assess the MKL-based integrative clustering approaches described in Sections 3.3.2 and 3.3.3, we perform a range of simulation studies; the results are presented in Section 4.3.

We generate four synthetic datasets, each composed of data belonging to six different clusters of equal size. Each observation x_n^(k) belonging to cluster k has categorical covariates taking values in {0, 1, 2} and is drawn from a multivariate categorical distribution such that, for each covariate j,

    x_nj^(k) ∼ Categorical(π_1k, π_2k, π_3k),    (30)

where each π_ik, i = 1, 2, 3, is defined as π_ik = w ρ_ik + (1 − w)/3, with [ρ_1k, ρ_2k, ρ_3k] drawn from a symmetric Dirichlet distribution and w ∈ [0, 1]. Higher values of w give clearer clustering structures. The response variable is binary, with P(y_n = 1 | z_n = k) = θ_k, where the θ_k are fixed, cluster-specific probabilities. We repeat each experiment 100 times. For each synthetic dataset, we use the MCMC algorithm for Dirichlet process mixture models implemented in the R package PReMiuM of Liverani et al. (2015) to obtain the PSMs. We use discrete mixtures (Liverani et al., 2015, Section 3.2), except in one setting (detailed below) where the profile regression model of Molitor et al. (2010) is used, with a discrete covariate mixture and a categorical response. The idea of profile regression is that, if a response y_n is available for each n = 1, ..., N, the observations d_n = (x_n, y_n) are jointly modelled as the product of a response model and a covariate model. In both cases we use the default hyperparameters, which we found to work well in practice.

We consider four different simulation settings:

(A) The clustering structure in every dataset is the same and is related to the outcome of interest.
(B) As in setting A, the clustering structure in each dataset is the same. In this case, however, each dataset contains some additional covariates that have no clustering structure.
(C) The dataset with the highest cluster separability has a clustering structure that is unrelated to the response variable; all the other datasets are the same as in setting A.
(D) This is the same as setting C, but profile regression is used to derive the PSMs.

One set of PSMs used for setting A is shown in Figure 3.

Figure 3. PSMs of the datasets used for setting A. The rows and columns correspond to the statistical units. The coloured bar on the right of each PSM represents the true clusters. Higher probabilities of belonging to the same cluster are indicated in blue. The values of w used to generate these matrices are, from left to right, 0.2, 0.4, 0.6, and 0.8.

In this case we know that higher values of w are associated with higher levels of cluster separability. In general, however, we do not know how dispersed the elements of each matrix are, so we define another way to score how strong the signal is in each dataset. We use the cophenetic correlation coefficient, a measure of how faithfully hierarchical clustering would preserve the pairwise distances between the original data points. Given a dataset X = [x_1, x_2, ..., x_N] and a similarity matrix Δ ∈ R^{N×N}, we define the dendrogrammatic distance between x_i and x_j as the height of the dendrogram at which these two points are first joined together by hierarchical clustering, and we denote it by η_ij. The cophenetic correlation coefficient ρ is then computed as the Pearson correlation, over all pairs of observations, between the pairwise dissimilarities derived from Δ and the corresponding dendrogrammatic distances η_ij.
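For illustration, the cophenetic correlation coefficient of a PSM can be computed with base R as follows; treating 1 − Δ as the dissimilarity and using average linkage are our assumptions and may differ from the exact choices made in the paper.

```r
# Cophenetic correlation between the dissimilarities 1 - psm and the
# dendrogrammatic distances induced by hierarchical clustering.
cophenetic_correlation <- function(psm) {
  d  <- as.dist(1 - psm)
  hc <- hclust(d, method = "average")
  cor(as.vector(d), as.vector(cophenetic(hc)))
}
```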
Figure 4. ARI for kernel k-means applied to one dataset at a time, for different values of ρ, compared to maximising the PEAR as suggested by Fritsch and Ickstadt (2009) and to minimising the VI as suggested by Wade and Ghahramani (2018). For both methods, we try different settings, namely: performing hierarchical clustering on the matrix 1 − Δ with average (avg) and complete (comp) linkage, with the maximum number of clusters equal to either 6 or 20, as well as considering all the clustering samples that appear in the MCMC output (draws). For kernel k-means, the results obtained by fixing the number of clusters to six and by choosing it via the silhouette are presented.

4.3 Integrative clustering

We assess the MKL-based integrative clustering approaches described in Sections 3.3.2 and 3.3.3 in the four settings presented in Section 4.1. For each setting we consider four different subsets of data, each combining three out of our four synthetic datasets generated with w = 0.2, 0.4, 0.6, and 0.8; for example, the first subset contains the datasets generated with w equal to 0.2, 0.4, and 0.6 respectively, and similarly for the other combinations of datasets. Here we show the ARI between the clusterings found via MKL integration and the true cluster labels; the weights assigned to each dataset in each setting are instead reported in the Supplementary Material.

Setting A. The ARI obtained by combining the datasets in the unsupervised and outcome-guided frameworks is shown in the first row of Figure 5. The values of the ARI obtained in the previous section on each dataset separately are also reported. In all settings we set the number of clusters to the true value, six. The unsupervised integration performed using localised multiple kernel k-means allows us to reach values of the ARI that are close to those of the "best" dataset (i.e. the dataset that has the highest value of cluster separability) among the three datasets in each subset. This is because the unsupervised MKL approach considered here assigns higher weights to the datasets that give rise to kernels with higher values of ρ, which in this case correspond to higher values of w (Cabassi and Kirk, 2020). Moreover, even higher values of the ARI are achieved via outcome-guided integration, thanks to the smarter weighting of the datasets. In this case, the kernels that help separate the classes in the response have higher weights than the others.

Setting B. The second row of Figure 5 shows the results obtained for setting B, where the PSMs are obtained by exploiting (an adaptation of) the variable selection strategy of Chung and Dunson (2009) implemented in the R package PReMiuM. Despite the fact that the ARI of dataset 2 is lower than in the previous case, the integration results are better than in setting A. Again, this is due to the fact that the most informative kernels are weighted more highly than the other ones.

Setting C. This simulation study helps us to show that the outcome-guided approach favours the clustering structures that agree with the structure in the response. For this reason we use a dataset with high cophenetic correlation coefficient whose clustering structure is not related to the response. The results are presented in the third row of Figure 5. Again, localised multiple kernel k-means assigns higher weights to the datasets that are more easily separable, i.e. datasets that give rise to kernels having higher cophenetic correlation coefficients. Note that here higher values of w correspond to higher cophenetic correlation. In this situation, this causes the ARI of the subsets of kernels that include dataset 4 to drop to zero. In the outcome-guided case, instead, the dataset that has the highest level of cluster separability but is not related to the outcome of interest has (almost) always weight equal to zero.
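For reference, the adjusted Rand index used throughout this section can be computed from the contingency table of two label vectors. The following base R sketch (ours) follows the Hubert and Arabie (1985) definition; `x` and `y` are placeholders for, e.g., the kernel k-means output and the true cluster labels of the simulation study.

```r
adjusted_rand_index <- function(x, y) {
  tab <- table(x, y)
  sum_comb <- function(v) sum(choose(v, 2))
  a <- sum_comb(tab)            # pairs placed together in both clusterings
  b <- sum_comb(rowSums(tab))   # pairs together in x
  c <- sum_comb(colSums(tab))   # pairs together in y
  expected <- b * c / choose(length(x), 2)
  (a - expected) / ((b + c) / 2 - expected)
}
```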
Setting D. Lastly, we consider the case where the model used to generate the PSMs is profile regression (fourth row of Figure 5). We see that, as expected, the ARI is higher than in the previous cases for the clustering obtained with each dataset taken separately, except of course for dataset 4, which has a different clustering structure. This is reflected in an improvement of the ARI of the unsupervised and outcome-guided integration, for all considered subsets of data. In particular, the latter almost always allows us to retrieve the true clustering.

Figure 5. Simulation study: adjusted Rand index obtained by summarising the PSMs one at a time using kernel k-means (left), combining different subsets of three PSMs in an unsupervised fashion using localised multiple kernel k-means (centre), and combining the same subsets making use of a response variable and multi-class SVMs to determine each PSM's weight and using kernel k-means for the final clustering (right). Rows correspond to settings A to D and the vertical axis to the ARI.

5 Integrative clustering: biological applications

We present results from applying our method to two exemplar integrative clustering problems from the literature. In Section 5.1 we perform integrative clustering on the dataset used for the analysis of 3,527 tumour samples from 12 different tumour types of Hoadley et al. (2014). In Section 5.2 we consider an example of transcriptional module discovery, to which a number of existing integrative clustering approaches have previously been applied (Cabassi and Kirk, 2020, Kirk et al., 2012, Savage et al., 2010).

5.1 Multiplatform analysis of ten cancer types

We analyse the dataset of Hoadley et al. (2014), which contains data regarding 3,527 tumour samples spanning 12 different tumour types. This dataset is particularly suitable for two purposes: (i) determining whether 'omic signatures are shared across different tumour types and (ii) discovering tumour subtypes. The 'omic layers available are: DNA copy number, DNA methylation, mRNA expression, microRNA expression and protein expression. Hoadley et al. (2014) used Cluster-Of-Clusters Analysis (COCA; Cabassi and Kirk, 2020) to cluster the data and found that the clusters were highly correlated with the tissue of origin of each tumour sample and were shown to be of clinical interest.

Here, we combine the data layers both in the unsupervised and in the outcome-guided framework. We make use of the C implementation (Mason et al., 2016) of the multiple dataset integration (MDI) method of Kirk et al. (2012) to produce PSMs for each data layer separately. In order to be able to do so, we only include in our analysis the tumour samples that have no missing values; this reduces the sample size to 2,421 and the number of tumour types available for the analysis to ten. A mixture of Gaussians is used for the continuous layers (DNA copy number, microRNA, and protein expression), while the multinomial model is used for the methylation data, which are categorical. Due to the high number of features, it is not possible to produce a PSM for the full mRNA dataset, so we exclude it from the analysis presented here. In the Supplementary Material, however, we show how the variable selection method developed specifically for multi-omic data by Cabassi et al. (2020) can be employed in this case to reduce the size of each data layer and integrate all five data types.
5.1.1 Unsupervised integration

We combine the PSMs of the four data layers via multiple kernel k-means, with the number of clusters going from 2 to 50. We choose the number of clusters that maximises the silhouette, which is 9 (Supplementary Material). The resulting clusters are shown in Figure 6. Six out of the nine clusters contain almost exclusively samples from one tissue: most samples of renal cell carcinoma (KIRC) are in cluster 2, almost all statistical units in clusters 1 and 4 are breast cancer samples, most serous ovarian carcinoma (OV) samples are in cluster 3, bladder urothelial adenocarcinoma (BLCA) samples are in cluster 8, and endometrial cancer (UCEC) samples are in cluster 9. Cluster 5, instead, is formed by the colon and rectal adenocarcinoma samples together, and corresponds exactly to cluster 7 of Hoadley et al. (COAD/READ). Moreover, lung squamous cell carcinoma (LUSC), lung adenocarcinoma (LUAD), and head and neck squamous cell carcinoma (HNSC) are divided into two clusters (6 and 7). Cluster 8 contains the remaining samples. The average weights assigned to each data layer are: 6.9% to the copy number data, 7.5% to the methylation data, 7% to the microRNA expression data, and 78.7% to the protein expression data.

Figure 6. Unsupervised multiplatform analysis of ten cancer types. (a) Clusters and weighted kernel. Left: weighted kernel; the rows and columns correspond to cancer samples, and higher values of similarity between samples are indicated in blue. Right: final clusters, tissues of origin, and COCA clusters (Hoadley et al., 2014). (b) Coincidence matrix comparing the tissue of origin of the tumour samples (rows) with the clusters (columns).

5.1.2 Outcome-guided integration

We obtain the weights for the outcome-guided integration via the simpleMKL algorithm; they are as follows: DNA copy number 35.9%, methylation 13.5%, microRNA expression 33.8%, and protein expression 16.8%. We then cluster the data using kernel k-means with the number of clusters going from 2 to 50. The silhouette is maximised at K = 27 (Supplementary Material). The clusters obtained in this way are shown in Figure 7. It is interesting to note that, in this case, each cluster contains almost exclusively tumour samples from the same tissue. The only exceptions are clusters 4 and 22, which contain both lung and head/neck squamous cell carcinoma samples, and clusters 14 and 25, in which colon and rectal adenocarcinomas are clustered together, as in the unsupervised case. Each tumour type, except for ovarian and bladder cancers, is divided into multiple subclusters. Further analysis would be required to assess whether these clusters are clinically relevant. Interestingly, we observe a distinction between luminal (i.e. estrogen receptor-positive and HER2-positive) and basal breast cancer samples (the former are in clusters 8, 9, 17, 19, 23, 24, and 27, while the latter are in cluster 13). This was also observed by Hoadley et al. (2014).
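Putting the pieces together, the outcome-guided analysis described above amounts to forming a weighted kernel from the per-layer PSMs and then running kernel k-means on it. The R sketch below reuses the earlier `kernel_kmeans` sketch; `psm_list` and `weights` are placeholders (the latter standing in for the output of a simpleMKL-style solver, which is not shown), and K = 27 is the value selected by the silhouette in this analysis.

```r
# Weighted kernel Δ = Σ_m θ_m Δ_m followed by kernel k-means.
weighted_kernel <- Reduce(`+`, Map(function(w, psm) w * psm, weights, psm_list))
cluster_labels  <- kernel_kmeans(weighted_kernel, K = 27)
```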
5.2 Transcriptional module discovery

For this example we consider transcriptional module discovery for yeast (Saccharomyces cerevisiae). The goal is to find clusters of genes that share a common biological function and are co-regulated by a common set of transcription factors. Previous studies have demonstrated that combining gene expression datasets with information about transcription factor binding can improve the detection of meaningful transcriptional modules (Ihmels et al., 2002, Savage et al., 2010).

We combine the ChIP-chip dataset of Harbison et al. (2004), which provides binding information for 117 transcriptional regulators, with the expression dataset of Granovskaia et al. (2010). The ChIP-chip data are discretised as in Savage et al. (2010) and Kirk et al. (2012). The measurements in the dataset of Granovskaia et al. (2010) represent the expression profiles of 551 genes at 41 different time points of the cell cycle.

Since the goal is to find clusters of genes, here the statistical units correspond to the genes, and the covariates correspond to the 41 gene expression measurements for the expression dataset and to the 117 considered transcriptional regulators for the ChIP-chip dataset. To produce the posterior similarity matrices for the two datasets, we use the DPMSysBio Matlab package of Žurauskienė et al. (2016). For each dataset, we run 10,000 iterations of the MCMC algorithm and summarise the output into a PSM. The PSMs obtained in this way are reported in the Supplementary Material.

We combine the PSMs using our unsupervised MKL approach. The average weights assigned by the localised multiple kernel k-means to each matrix and the values of the average silhouette for different numbers of clusters are reported in the Supplementary Material. We set the number of clusters to 25, which is the value that maximises the silhouette. The final clusters are shown in Figure 8, next to the two datasets and the combined PSM. In order to determine whether our clustering is biologically meaningful, we use the Gene Ontology Term Overlap (GOTO) scores defined by Mistry and Pavlidis (2008). Denoting by annot_{g_i} the set of all direct annotations for each gene and all of their associated parent terms, the GOTO similarity between two genes g_i, g_j is the number of annotations that the two genes share:

    sim_GOTO(g_i, g_j) = | annot_{g_i} ∩ annot_{g_j} |.    (35)

Details on how to compute the average overall GOTO score of a clustering are given in the Supplementary Material of Kirk et al. (2012).
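As an illustration of Equation (35), the pairwise GOTO similarities can be computed in R as follows; `annotations` is a hypothetical named list giving, for each gene, the set of its GO annotations together with their parent terms.

```r
goto_similarity <- function(annotations) {
  genes <- names(annotations)
  sim <- outer(genes, genes, Vectorize(function(gi, gj) {
    length(intersect(annotations[[gi]], annotations[[gj]]))
  }))
  dimnames(sim) <- list(genes, genes)
  sim
}
```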
Figure 7. Outcome-guided multiplatform analysis of ten cancer types. (a) Clusters and weighted kernel. Left: weighted kernel; the rows and columns correspond to cancer samples, and higher values of similarity between samples are indicated in blue. Right: final clusters, tissues of origin, and COCA clusters (Hoadley et al., 2014). (b) Coincidence matrix comparing the tissue of origin of the tumour samples (rows) with the clusters (columns).

The GOTO scores are reported in Table 1. The two datasets combined achieve higher GOTO scores than those of the clusters obtained using each dataset separately. We also compare these GOTO scores to those obtained with the two methods used in Cabassi and Kirk (2020). The first one is COCA, a simple, unweighted algorithm for integrative clustering that is widely used in practice (Hoadley et al., 2014, The Cancer Genome Atlas Research Network, 2012). The other integrative method considered here is KLIC (Kernel Learning Integrative Clustering), by which we mean the integration of multiple kernels via multiple kernel k-means, where the kernels are generated via consensus clustering (Figure 1). The clusters obtained with these two alternative methods have lower GOTO scores than the integration of PSMs. For COCA, this result is not unexpected, since the method is unweighted and has previously been shown to perform less well than KLIC. The difference between what is referred to as KLIC here and the unsupervised integration of PSMs, however, lies only in how the kernels are constructed. These scores therefore suggest that kernels generated from probabilistic models can lead to more accurate results than those built using consensus clustering.

Table 1. Gene Ontology Term Overlap scores. "BP" stands for Biological Process ontology, "MF" for Molecular Function, and "CC" for Cellular Component. The number of clusters used for every method is 25.

Dataset(s)                                   GOTO BP   GOTO MF   GOTO CC
ChIP data (Harbison et al.)                  6.18      0.97      8.54
Expression data (Granovskaia et al.)         7.07      1.04      8.90
ChIP+Expression data: COCA                   5.74      0.90      8.19
ChIP+Expression data: KLIC                   6.60      0.96      8.66
ChIP+Expression data: integration of PSMs

We have presented a novel method, based on kernel methods, for summarising a sample of clusterings from the posterior distribution of an MCMC algorithm for Bayesian clustering. We have also extended this method to allow us to integrate multiple PSMs. This can be done either in an unsupervised or in an outcome-guided way. The former weights each PSM according to how well defined its clustering structure is; the latter gives more importance to the PSMs that better reflect the structure encountered in the response variable of choice.

We have used simulation examples to show that our method gives performances comparable to those of existing techniques in terms of the proportion of correct co-clustering. We have also demonstrated that the integration of multiple datasets gives better results than using one dataset at a time. Additionally, we have shown that, if a variable related to the outcome of interest is available, our method can assign higher weights to the PSMs that are most closely related to it. The simulation examples show that this feature can be extremely useful when not all the PSMs have the same clustering structure.

Finally, we have applied the novel methods to two real data applications. The pancancer data analysis shows that the outcome-guided integration of multiple PSMs can potentially be used in the context of tumour subtype discovery. The yeast example demonstrates that the proposed method is able to identify groups of genes that are co-expressed and co-regulated and that are more biologically meaningful than those determined via state-of-the-art integrative algorithms.
Figure 8. Transcriptional module discovery: integration of the Harbison et al. (2004) and Granovskaia et al. (2010) datasets. (a) Expression data. Each row corresponds to a gene and each column to a different time point. (b) ChIP-chip data. Each row corresponds to a gene and each column to a transcriptional regulator. (c) Weighted kernel. The rows and columns correspond to the genes. Higher values of similarity between genes are indicated in blue. To the left of each plot is shown the final clustering, obtained by integrating the PSMs of the expression and ChIP-chip data via multiple kernel k-means.

Funding

A. Cabassi and P.D.W. Kirk are supported by the MRC [MC_UU_00002/13]. This work was supported by the National Institute for Health Research [Cambridge Biomedical Research Centre at the Cambridge University Hospitals NHS Foundation Trust]. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care. Partly funded by the RESCUER project. RESCUER has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 847912.

References

Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, pages 1152–1174.

Baudat, G. and Anouar, F. (2000). Generalized discriminant analysis using a kernel approach. Neural Computation, 12(10):2385–2404.

Bennett, K. and Mangasarian, O. L. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1(1):23–34.

Binder, D. A. (1978). Bayesian cluster analysis. Biometrika, 1:31–38.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer, 1st edition.

Blei, D. M., Jordan, M. I., et al. (2006). Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–143.

Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. Proceedings of the 5th Annual Workshop on Computational Learning Theory.

Cabassi, A. and Kirk, P. D. W. (2020). Multiple kernel learning for integrative consensus clustering of 'omic datasets. Bioinformatics. btaa593.

Cabassi, A., Seyres, D., Frontini, M., and Kirk, P. D. W. (2020). Penalised logistic regression for multi-omic data with an application to cardiometabolic syndrome. arXiv. 2008.00235.

Celeux, G., Hurn, M., and Robert, C. P. (2000). Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association, 95(451):957–970.

Chung, Y. and Dunson, D. B. (2009). Nonparametric Bayes conditional distribution modeling with variable selection. Journal of the American Statistical Association, 104(488):1646–1660.

Dahl, D. B. (2006). Model-based clustering for expression data via a Dirichlet process mixture model. Bayesian Inference for Gene Expression and Proteomics, 4:201–218.

Dudoit, S. and Fridlyand, J. (2002). A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology, 3(7).

Cluster Analysis. Hodder Education.

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pages 209–230.
Friedman, J., Hastie, T., and Tibshirani, R. (2001). The Elements of Statistical Learning, volume 1. Springer Series in Statistics.

Fritsch, A. and Ickstadt, K. (2009). Improved criteria for clustering based on the posterior similarity matrix. Bayesian Analysis, 4(2):367–392.

Gelfand, A. E. and Smith, A. F. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409.

Girolami, M. (2002). Mercer kernel-based clustering in feature space. IEEE Transactions on Neural Networks, 13(3):780–784.

Gönen, M. and Margolin, A. A. (2014). Localized data fusion for kernel k-means clustering with application to cancer biology. In Advances in Neural Information Processing Systems, pages 1305–1313.

Granovskaia, M. V., Jensen, L. J., Ritchie, M. E., Toedling, J., Ning, Y., Bork, P., Huber, W., and Steinmetz, L. M. (2010). High-resolution transcription atlas of the mitotic cell cycle in budding yeast. Genome Biology, 11(3):R24.

Harbison, C. T., Gordon, D. B., Lee, T. I., Rinaldi, N. J., Macisaac, K. D., Danford, T. W., Hannett, N. M., Tagne, J.-B., Reynolds, D. B., Yoo, J., et al. (2004). Transcriptional regulatory code of a eucaryotic genome. Nature, 431(7004):99–104.

Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):100–108.

Hoadley, K. A., Yau, C., Wolf, D. M., Cherniack, A. D., Tamborero, D., Ng, S., Leiserson, M. D., Niu, B., McLellan, M. D., Uzunangelov, V., et al. (2014). Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell, 158(4):929–944.

Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193–218.

Ihmels, J., Friedlander, G., Bergmann, S., Sarig, O., Ziv, Y., and Barkai, N. (2002). Revealing modular organization in the yeast transcriptional network. Nature Genetics, 31(4):370.

Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis, volume 344. John Wiley & Sons.

Kirk, P. D., Griffin, J. E., Savage, R. S., Ghahramani, Z., and Wild, D. L. (2012). Bayesian correlated clustering to integrate multiple datasets. Bioinformatics, 28(24):3290–3297.

Knerr, S., Personnaz, L., and Dreyfus, G. (1990). Single-layer learning revisited: a stepwise procedure for building and training a neural network. Neurocomputing: Algorithms, Architectures and Applications, 68:41–50.

Lanckriet, G. R., De Bie, T., Cristianini, N., Jordan, M. I., and Noble, W. S. (2004). A statistical framework for genomic data fusion. Bioinformatics, 20(16):2626–2635.

Liverani, S., Hastie, D. I., Azizi, L., Papathomas, M., and Richardson, S. (2015). PReMiuM: An R package for profile regression mixture models using Dirichlet processes. Journal of Statistical Software, 31(2).

Luenberger, D. G. and Ye, Y. (1984). Linear and Nonlinear Programming, volume 2. Springer.

Mason, S. A., Sayyid, F., Kirk, P. D., Starr, C., and Wild, D. L. (2016). MDI-GPU: accelerating integrative modelling for genomic-scale data using GP-GPU computing. Statistical Applications in Genetics and Molecular Biology, 15(1):83–86.
McLachlan, G. J. and Peel, D. (2004). Finite Mixture Models. John Wiley & Sons.

Medvedovic, M. and Sivaganesan, S. (2002). Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics, 18(9):1194–1206.

Meilă, M. (2007). Comparing clusterings: an information based distance. Journal of Multivariate Analysis, 98(5):873–895.

Mika, S., Rätsch, G., Weston, J., Schölkopf, B., and Müllers, K.-R. (1999). Fisher discriminant analysis with kernels. In Proceedings of the 1999 IEEE Signal Processing Society Workshop, pages 41–48. IEEE.

Milligan, G. W. and Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2):159–179.

Mistry, M. and Pavlidis, P. (2008). Gene ontology term overlap as a measure of gene functional similarity. BMC Bioinformatics, 9:327.

Molitor, J., Papathomas, M., Jerrett, M., and Richardson, S. (2010). Bayesian profile regression with an application to the national survey of children's health. Biostatistics, 11(3):484–498.

Monti, S., Tamayo, P., Mesirov, J., and Golub, T. (2003). Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52(1-2):91–118.

Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265.

Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. In Schölkopf, B., Burges, C. J. C., and Smola, A. J., editors, Advances in Kernel Methods: Support Vector Learning, chapter 12, pages 185–208. MIT Press.

Rakotomamonjy, A., Bach, F. R., Canu, S., and Grandvalet, Y. (2007). More efficiency in multiple kernel learning. In Proceedings of the 24th International Conference on Machine Learning, pages 775–782.

Rakotomamonjy, A., Bach, F. R., Canu, S., and Grandvalet, Y. (2008). SimpleMKL. Journal of Machine Learning Research, 9:2491–2521.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850.

Rasmussen, C. E. (2000). The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, pages 554–560.

Richardson, S. and Green, P. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(4):731–792.

Roth, V. and Steinhage, V. (2000). Nonlinear discriminant analysis using kernel functions. In Advances in Neural Information Processing Systems, pages 568–574.

Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.

Savage, R. S., Ghahramani, Z., Griffin, J. E., De La Cruz, B. J., and Wild, D. L. (2010). Discovering transcriptional modules by Bayesian data integration. Bioinformatics, 26(12):i158–i167.
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319. Referred to on page 5.
Schölkopf, B. and Smola, A. J. (2001). Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press. Referred to on page 8.
Shawe-Taylor, J. and Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge University Press. Referred to on pages 2 and 5.
Steinhaus, H. (1956). Sur la division des corps matériels en parties. Bulletin de l'Académie Polonaise des Sciences, IV(12):801–804. Referred to on page 6.
The Cancer Genome Atlas Research Network (2012). Comprehensive molecular portraits of human breast tumours. Nature, 487(7407):61–70. Referred to on page 21.
Tibshirani, R. et al. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423. Referred to on page 11.
Vapnik, V. N. (1998). Statistical learning theory. Wiley, New York. Referred to on pages 5 and 10.
Wade, S. and Ghahramani, Z. (2018). Bayesian cluster analysis: Point estimation and credible balls (with discussion). Bayesian Analysis, 13(2):559–626. Referred to on pages 2, 3, 4, and 14.
Yeung, K. Y., Haynor, D. R., and Ruzzo, W. L. (2001). Validating clustering for gene expression data. Bioinformatics, 17(4):309–318. Referred to on page 11.
Statistical Applications in Genetics and Molecular Biology, 15(2):107–122. Referred to on page 19.

Supplementary material

Contents

S1 Simulation study
S1.1 Data
S1.2 Integrative clustering
S1.3 Additional simulation settings
S1.3.1 Different number of covariates
S1.3.2 Using the true cluster labels as response variable
S2 Integrative clustering: biological applications
S2.1 Multiplatform analysis of ten cancer types
S2.1.1 Variable selection
S2.1.2 MCMC convergence assessment
S2.1.3 Unsupervised integration: additional figures
S2.1.4 Unsupervised integration after variable selection, α = 0.…, 0.…, 1
S2.1.7 Outcome-guided integration: additional figures
S2.1.8 Outcome-guided integration after variable selection, α = 0.…, 0.…, 1
S2.2 Transcriptional module discovery
S2.2.1 Additional figures
S2.2.2 A different set of data
Bibliography
S1 Simulation study

S1.1 Data

Figure S1 shows one of the datasets used for the simulation studies, with the value of w set to 0.8 and the number of covariates equal to 20. The first ten covariates determine six different clusters; the remaining covariates have no clustering structure. The response variable is generated as described in the main paper.

Figure S1. One of the datasets used for the simulation studies. The data are categorical, taking values 0, 1, or 2; the response is binary. The rows are separated by cluster, and the cluster labels are indicated on the right of the data matrix.

S1.2 Integrative clustering

Figures S2 and S3 show the weights assigned to each dataset by the unsupervised and outcome-guided methods for the integration of multiple PSMs presented in the main paper.

Figure S2. Weights assigned to each PSM for each subset of datasets (Settings A–D) by the unsupervised integration method.

Figure S3. Weights assigned to each PSM for each subset of datasets (Settings A–D) by the outcome-guided integration methods.

S1.3 Additional simulation settings

S1.3.1 Different number of covariates

We present here the results obtained for simulation setting B with different numbers of irrelevant covariates. Figure S4 shows the adjusted Rand index and Figure S5 the weights assigned to each kernel.

Figure S4. Adjusted Rand index obtained by summarising the PSMs one at a time using kernel k-means (left), combining different subsets of three PSMs in an unsupervised fashion using localised multiple kernel k-means (centre), and combining the same subsets making use of a response variable and multi-class SVMs to determine each PSM's weight and using kernel k-means for the final clustering (right). (a) Setting B, 2 covariates without clustering structure. (b) Setting B, 5 covariates without clustering structure.
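The left-hand panels of Figure S4 are obtained by summarising each PSM individually with kernel k-means and comparing the resulting clustering to the true labels via the adjusted Rand index (ARI). A minimal sketch of this step is given below, assuming the PSM and the true labels are available as objects psm and true_labels (hypothetical names), and using the kernlab and mclust packages as convenient stand-ins for the authors' own implementation.

    library(kernlab)  # kernel k-means
    library(mclust)   # adjustedRandIndex

    # psm: N x N posterior similarity matrix (positive semi-definite, entries in [0, 1])
    # true_labels: length-N vector of true cluster labels (known in simulations only)

    # Treat the PSM as a kernel matrix and summarise it with kernel k-means,
    # here with six clusters as in the simulation setting.
    km <- kkmeans(as.kernelMatrix(psm), centers = 6)
    summary_clustering <- km@.Data  # cluster assignment of each observation

    # Agreement between the summary clustering and the truth
    adjustedRandIndex(summary_clustering, true_labels)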
Figure S5. Weights assigned to each PSM for each subset of datasets by the unsupervised (above) and outcome-guided (below) integration methods. (a) Setting B, 2 covariates without clustering structure. (b) Setting B, 5 covariates without clustering structure.

S1.3.2 Using the true cluster labels as response variable

We repeat the simulation study presented in the main paper, using the true cluster labels as the response variable, both for the outcome-guided integration (in all simulation settings) and to generate the PSMs with profile regression (in Setting D). Although the true cluster labels are not available in practice, this simulation study is used here to determine a putative upper bound on the performance of outcome-guided integration.

The ARI is reported in Figure S6. As expected, the outcome-guided integration attains higher values of the ARI in all settings, compared to the case where the outcome is a binary variable. Moreover, in Setting D the ARI of each PSM taken individually is also higher here than in the other simulation study.

The weights assigned to each kernel matrix are shown in Figure S7. These are easier to interpret than those presented above (Figure S3): on average, kernels originating from datasets with higher cluster separability receive higher weights.

Figure S6. Simulation study where the response for each observation is given by its true cluster label. Adjusted Rand index obtained by summarising the PSMs one at a time using kernel k-means (left), combining different subsets of three PSMs in an unsupervised fashion using localised multiple kernel k-means (centre), and combining the same subsets making use of a response variable and multi-class SVMs to determine each PSM's weight and using kernel k-means for the final clustering (right).

Figure S7. Weights assigned to each PSM for each subset of datasets by the unsupervised (above) and outcome-guided (below) integration methods, for the simulation study where the response for each observation is given by its true cluster label.

S2 Integrative clustering: biological applications

S2.1 Multiplatform analysis of ten cancer types

S2.1.1 Variable selection

Table S1 reports the number of variables measured in each layer and the number of variables selected by a separate elastic-net (EN) model on each layer, as in Cabassi et al. (2020) and Seyres et al. (2020).

Table S1. Number of selected variables in each dataset for different values of the EN parameter α.

Dataset               Full dataset   α = 0.…   α = 0.…   α = 1
Protein expression    131            131       131       124
mRNA expression       6000           1893      568       258
Methylation           2043           1439      623       322
DNA copy number       84             84        84        84
microRNA expression   51             51        51        50

The full mRNA dataset is too large to be used as input to MDI; for this reason, the mRNA data are only used in the integrations of the reduced datasets obtained via variable selection with values of α of 0.5 and 1.

S2.1.2 MCMC convergence assessment

We run five MCMC chains for 50,000 iterations, with a burn-in period of 25,000 iterations and a thinning of 5. For each set of five chains, we check the Vats–Knudson R̂ (Vats and Knudson, 2018) with parameters ε = 0.… and α = 0.….
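A minimal sketch of a convergence check of this kind is given below, using the standard Gelman-Rubin diagnostic from the coda package as a stand-in for the Vats-Knudson version used here, and assuming that the traces of the number of clusters and of the mass parameter are stored in matrices nclust and mass with one column per chain (hypothetical names).

    library(coda)

    # nclust, mass: matrices with one row per retained MCMC sample and one
    # column per chain (five chains), containing the traced number of
    # occupied clusters and the mass parameter, respectively.

    to_mcmc_list <- function(trace_matrix) {
      mcmc.list(lapply(seq_len(ncol(trace_matrix)),
                       function(j) mcmc(trace_matrix[, j])))
    }

    # Potential scale reduction factors; values close to 1 suggest that the
    # five chains are exploring the same distribution.
    gelman.diag(to_mcmc_list(nclust), autoburnin = FALSE)
    gelman.diag(to_mcmc_list(mass), autoburnin = FALSE)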
Figure S8. PSMs of the protein expression data. λ = 0, α = 0.…, 0.…

Table S2. ARI between the clusterings found on the PSMs of the different chains (Chains 1–5), with the number of clusters that maximises the silhouette.

Figure S9. MCMC convergence assessment, protein expression data (trace plots of the number of clusters and of the mass parameter for each chain). λ = 0, α = 0.…, 0.…

Figure S10. PSMs of the protein expression data. α = 1.

Table S3. ARI between the clusterings found on the PSMs of the different chains (Chains 1–5), with the number of clusters that maximises the silhouette.

Figure S11. MCMC convergence assessment, protein expression data. α = 1.

Figure S12. PSMs of the mRNA expression data. α = 0.…

Table S4. ARI between the clusterings found on the PSMs of the different chains (Chains 1–5), with the number of clusters that maximises the silhouette.

Figure S13. MCMC convergence assessment, mRNA expression data. α = 0.…
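Tables S2–S12 report, for each data layer, the ARI between the clusterings extracted from the PSMs of the five chains, with the number of clusters chosen to maximise the average silhouette. A minimal sketch of this selection and comparison, assuming two PSMs psm1 and psm2 from two chains (hypothetical objects) and re-using the kernel k-means step sketched above:

    library(cluster)  # silhouette
    library(kernlab)  # kkmeans
    library(mclust)   # adjustedRandIndex

    # Choose the number of clusters K in 2..k_max that maximises the average
    # silhouette width, using 1 - PSM as the dissimilarity between observations.
    choose_k <- function(psm, k_max = 50) {
      avg_sil <- sapply(2:k_max, function(k) {
        labels <- kkmeans(as.kernelMatrix(psm), centers = k)@.Data
        mean(silhouette(labels, dmatrix = 1 - psm)[, "sil_width"])
      })
      which.max(avg_sil) + 1  # +1 because the grid of values starts at K = 2
    }

    # Summary clusterings of two chains, compared via the ARI
    labels1 <- kkmeans(as.kernelMatrix(psm1), centers = choose_k(psm1))@.Data
    labels2 <- kkmeans(as.kernelMatrix(psm2), centers = choose_k(psm2))@.Data
    adjustedRandIndex(labels1, labels2)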
Figure S14. PSMs of the mRNA expression data. α = 1.

Table S5. ARI between the clusterings found on the PSMs of the different chains (Chains 1–5), with the number of clusters that maximises the silhouette.

Figure S15. MCMC convergence assessment, mRNA expression data. α = 1.

Figure S16. PSMs of the methylation data. λ = 0.

Table S6. ARI between the clusterings found on the PSMs of the different chains (Chains 1–5), with the number of clusters that maximises the silhouette.

Figure S17. MCMC convergence assessment, methylation data. λ = 0.

Figure S18. PSMs of the methylation data. α = 0.…

Table S7. ARI between the clusterings found on the PSMs of the different chains (Chains 1–5), with the number of clusters that maximises the silhouette.

Figure S19. MCMC convergence assessment, methylation data. α = 0.…
Figure S20. PSMs of the methylation data. α = 0.…

Table S8. ARI between the clusterings found on the PSMs of the different chains (Chains 1–5), with the number of clusters that maximises the silhouette.

Figure S21. MCMC convergence assessment, methylation data.

Figure S22. PSMs of the methylation data. α = 1.

Table S9. ARI between the clusterings found on the PSMs of the different chains (Chains 1–5), with the number of clusters that maximises the silhouette.

Figure S23. MCMC convergence assessment, methylation data.

Figure S24. PSMs of the DNA copy number data.

Table S10. ARI between the clusterings found on the PSMs of the different chains (Chains 1–5), with the number of clusters that maximises the silhouette.

Figure S25. MCMC convergence assessment, DNA copy number data.
Figure S26. PSMs of the microRNA data. λ = 0, α = 0.…, 0.…

Table S11. ARI between the clusterings found on the PSMs of the different chains (Chains 1–5), with the number of clusters that maximises the silhouette.

Figure S27. MCMC convergence assessment, microRNA data. λ = 0, α = 0.…, 0.…

Figure S28. PSMs of the microRNA data. α = 1.

Table S12. ARI between the clusterings found on the PSMs of the different chains (Chains 1–5), with the number of clusters that maximises the silhouette.

Figure S29. MCMC convergence assessment, microRNA data. α = 1.

S2.1.3 Unsupervised integration: additional figures

Figure S30 reports the average values of the silhouette when the number of clusters goes from 2 to 50. The maximum is at K = 15.

Figure S30. Average silhouette.

Kernel weights. Figure S31 shows the weights assigned to each PSM by the multiple kernel k-means algorithm.

Figure S31. Weights assigned by the multiple kernel k-means algorithm to each observation in each layer, where "CN" stands for copy number and "RPPA" for reverse phase protein array.

Comparison with the clusters identified by Hoadley et al. (2014). Figure S32 shows the correspondences between the clusters found in the main paper and the clusters identified by Hoadley et al. (2014) using Cluster-Of-Clusters Analysis (COCA).
Figure S32. Comparison between the clusters found combining the PSMs of each layer using multiple kernel learning and those identified by Hoadley et al. (2014) using COCA.

Clustering structure in the data. Figures S33, S34, S35, and S36 show the four data layers where the rows have been sorted by final cluster.

Figure S33. Copy number data and final clusters.

Figure S34. microRNA expression data and final clusters.

Figure S35. Methylation data and final clusters.

Figure S36. Protein expression data and final clusters.

α = 0.…

Figure S37. Unsupervised multiplatform analysis of ten cancer types. Weighted kernel, final clusters, tissues of origin, and COCA clusters. (a) Weighted kernel and clusters.

Figure S39. Average silhouette.

Figure S38. Unsupervised multiplatform analysis of ten cancer types. (a) Coincidence matrix comparing the tissue of origin of the tumour samples to the new clusters. (b) Coincidence matrix comparing the COCA clusters of Hoadley et al. to the new clusters.

Figure S40. Weights assigned by the multiple kernel k-means algorithm to each observation in each layer, where "CN" stands for copy number and "RPPA" for reverse phase protein array. The weights assigned on average to the tumour samples in each layer are: copy number 7.9%, methylation 4.6%, miRNA 3.2%, protein 84.3%.
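The coincidence matrices shown in Figures S32 and S38 are cross-tabulations of two cluster label vectors, for instance the new clusters against the COCA clusters or against the tissues of origin, and the agreement can be summarised by the ARI. A minimal sketch, assuming label vectors new_clusters, tissue, and coca (hypothetical names):

    library(mclust)  # adjustedRandIndex

    # new_clusters: clusters from the integrative analysis
    # tissue, coca: tissue of origin and COCA cluster of each tumour sample
    coincidence_tissue <- table(new_clusters, tissue)
    coincidence_coca <- table(new_clusters, coca)

    adjustedRandIndex(new_clusters, tissue)
    adjustedRandIndex(new_clusters, coca)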
α = 0.…

Figure S41. Unsupervised multiplatform analysis of ten cancer types. (a) Weighted kernel, final clusters, tissues of origin, and COCA clusters. (b) Coincidence matrix comparing the tissue of origin of the tumour samples to the new clusters. (c) Coincidence matrix comparing the COCA clusters of Hoadley et al. to the new clusters.

Figure S42. Average silhouette.

Figure S43. Weights assigned by the multiple kernel k-means algorithm to each observation in each layer, where "CN" stands for copy number and "RPPA" for reverse phase protein array. The weights assigned on average to the tumour samples in each layer are: copy number 5.8%, methylation 6%, microRNA 5.9%, mRNA 5.9%, protein 76.4%.

α = 1
Figure S44. Unsupervised multiplatform analysis of ten cancer types. (a) Weighted kernel, final clusters, tissues of origin, and COCA clusters. (b) Coincidence matrix comparing the tissue of origin of the tumour samples to the new clusters. (c) Coincidence matrix comparing the COCA clusters of Hoadley et al. to the new clusters.

Figure S45. Average silhouette.

Figure S46. Weights assigned by the multiple kernel k-means algorithm to each observation in each layer, where "CN" stands for copy number and "RPPA" for reverse phase protein array. The weights assigned on average to the tumour samples in each layer are: copy number 8.1%, methylation 3.8%, microRNA 3.6%, mRNA 3.7%, protein 80.8%.

S2.1.7 Outcome-guided integration: additional figures

Figure S47 reports the average values of the silhouette when the number of clusters goes from 2 to 50. The maximum is at K = 27.

Figure S47. Average silhouette.

Comparison with the clusters identified by Hoadley et al. (2014). Figure S48 shows the correspondences between the clusters found in the main paper in the outcome-guided case and the clusters identified by Hoadley et al. (2014) using Cluster-Of-Clusters Analysis (COCA).

Figure S48. Comparison between the clusters found combining the PSMs of each layer using the outcome-guided approach and those identified by Hoadley et al. (2014) using COCA.

Clustering structure in the data. Figures S49, S50, S51, and S52 show the four data layers where the rows have been sorted by final cluster.

Figure S49. Copy number data and final clusters.

Figure S50. microRNA expression data and final clusters.

Figure S51. Methylation data and final clusters.

Figure S52. Protein expression data and final clusters.

α = 0.…
Figure S53. Outcome-guided multiplatform analysis of ten cancer types. (a) Weighted kernel, final clusters, tissues of origin, and COCA clusters. (b) Coincidence matrix comparing the tissue of origin of the tumour samples with the clusters.

Figure S54. Average silhouette.

Figure S55. Comparison between the clusters found combining the PSMs of each layer using the outcome-guided approach and those identified by Hoadley et al. (2014) using COCA.

α = 0.…

Figure S56. Outcome-guided multiplatform analysis of ten cancer types. (a) Weighted kernel, final clusters, tissues of origin, and COCA clusters. (b) Coincidence matrix comparing the tissue of origin of the tumour samples with the clusters.

Figure S57. Average silhouette.

Figure S58. Comparison between the clusters found combining the PSMs of each layer using the outcome-guided approach and those identified by Hoadley et al. (2014) using COCA.

α = 1

The weights assigned on average to the tumour samples in each layer are: copy number 0%, mRNA 37.6%, methylation 25.2%, miRNA 24.5%, protein 12.7%.
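The average per-layer weights quoted here and in the captions of Figures S40, S43, and S46 are the column means of the matrix of per-observation weights returned by the localised multiple kernel k-means. A minimal sketch, assuming that matrix is stored in an object called weights with one column per layer (hypothetical name):

    # weights: N x L matrix of per-observation kernel weights, one column per
    # data layer (e.g. copy number, methylation, miRNA, mRNA, protein);
    # each row sums to 1.
    avg_weights <- colMeans(weights)
    round(100 * avg_weights, 1)  # average weight of each layer, in percent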
Figure S59. Outcome-guided multiplatform analysis of ten cancer types. (a) Weighted kernel, final clusters, tissues of origin, and COCA clusters. (b) Coincidence matrix comparing the tissue of origin of the tumour samples with the clusters.

Figure S60. Average silhouette.

Figure S61. Comparison between the clusters found combining the PSMs of each layer using the outcome-guided approach and those identified by Hoadley et al. (2014) using COCA.

S2.2 Transcriptional module discovery

S2.2.1 Additional figures

Figures S62 and S63 show the initial data, the clusterings obtained on each dataset individually, and the PSMs. The cophenetic correlation coefficients are 0.953685 for the expression data and 0.9841434 for the ChIP data. On average, 3.5% of the weight is assigned to the ChIP data and the remaining 96.5% to the expression data.

Figure S62. Clusters obtained on each dataset separately. (a) Expression data. (b) ChIP data. The ordering of the rows is different in the two figures.

Figure S64 reports the values of the average silhouette for different numbers of clusters K. We choose K = 25, which gives the highest value of the silhouette.

Figure S63. Posterior similarity matrices and clusterings obtained via kernel k-means on each dataset separately. (a) Expression data. (b) ChIP data. The ordering of the observations is different in the two figures.

Figure S64. Plot of the silhouette for different numbers of clusters for the integration of the datasets of Harbison et al. and Granovskaia et al. The maximum value is attained at K = 25.

S2.2.2 A different set of data

Here we combine the expression dataset of Ideker et al. (2001) with the ChIP-chip dataset of Harbison et al. (2004), which provides binding information for 117 transcriptional regulators. Both datasets are discretised as in Savage et al. (2010) and Kirk et al. (2012). The dataset of Ideker et al. (2001) contains measurements related to 205 genes whose expression patterns reflect four functional categories based on gene ontology annotations.

Figure S66 shows the PSMs obtained for each dataset; the cophenetic correlation coefficient is 0.9999932 for the ChIP data and 0.9999184 for the expression data.
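The cophenetic correlation coefficients quoted in this section measure how faithfully a hierarchical clustering of a PSM preserves the dissimilarities 1 - PSM. A minimal sketch, assuming the PSM is stored in a matrix psm (hypothetical name) and using average linkage as an illustrative choice:

    # Dissimilarities induced by the posterior similarity matrix
    d <- as.dist(1 - psm)

    # Hierarchical clustering and the distances implied by its dendrogram
    hc <- hclust(d, method = "average")
    d_coph <- cophenetic(hc)

    # Cophenetic correlation coefficient: values close to 1 indicate that the
    # dendrogram represents the original dissimilarities well.
    cor(d, d_coph)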
Figure S65. Clusters obtained on each dataset separately. (a) Expression data. (b) ChIP data. The ordering of the observations (i.e. rows) is different in the two figures.

Figure S66. Posterior similarity matrices and clusterings obtained via kernel k-means on each dataset separately. (a) Expression data. (b) ChIP data. The ordering of the observations (i.e. rows) is different in the two figures.

Figure S67 reports the values of the average silhouette for different numbers of clusters K. The number of clusters chosen for the data analysis is 5, since its average silhouette is similar to that obtained with 2 clusters. The final clusters are shown in Figure S68 next to the initial datasets and the combined PSM. On average, 28.7% of the weight is assigned to the ChIP data and the remaining 71.3% to the expression data.

Figure S67. Plot of the silhouette for different numbers of clusters for the integration of the datasets of Harbison et al. and Ideker et al. The maximum value is attained at K = 2.

Table S13. Gene Ontology Term Overlap scores. "BP" stands for Biological Process ontology, "MF" for Molecular Function, and "CC" for Cellular Component. The number of clusters used to combine the datasets of Harbison et al. and Ideker et al. is 5.

Dataset(s)                GOTO BP   GOTO MF   GOTO CC
ChIP (Harbison et al.)    13.36     1.37      12.26
Expr. (Ideker et al.)
Both                      16.51     2.15      14.84

Figure S68. Transcriptional module discovery, integration of the Harbison et al. (2004) and Ideker et al. (2001) datasets. (a) Expression data. (b) ChIP data. (c) Weighted kernel.

References

Cabassi, A., Seyres, D., Frontini, M., and Kirk, P. D. W. (2020). Penalised logistic regression for multi-omic data with an application to cardiometabolic syndrome. arXiv. 2008.00235. Referred to on page S10.

Harbison, C. T., Gordon, D. B., Lee, T. I., Rinaldi, N. J., Macisaac, K. D., Danford, T. W., Hannett, N. M., Tagne, J.-B., Reynolds, D. B., Yoo, J., et al. (2004). Transcriptional regulatory code of a eukaryotic genome. Nature, 431(7004):99–104. Referred to on pages S59 and S61.

Hoadley, K. A., Yau, C., Wolf, D. M., Cherniack, A. D., Tamborero, D., Ng, S., Leiserson, M. D., Niu, B., McLellan, M. D., Uzunangelov, V., et al. (2014). Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell, 158(4):929–944. Referred to on pages S34, S46, S52, S54, and S56.

Ideker, T., Thorsson, V., Ranish, J. A., Christmas, R., Buhler, J., Eng, J. K., Bumgarner, R., Goodlett, D. R., Aebersold, R., and Hood, L. (2001). Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 292(5518):929–934. Referred to on pages S59 and S61.

Kirk, P. D., Griffin, J. E., Savage, R. S., Ghahramani, Z., and Wild, D. L. (2012). Bayesian correlated clustering to integrate multiple datasets. Bioinformatics, 28(24):3290–3297. Referred to on page S59.

Savage, R. S., Ghahramani, Z., Griffin, J. E., De La Cruz, B. J., and Wild, D. L. (2010). Discovering transcriptional modules by Bayesian data integration. Bioinformatics, 26(12):i158–i167. Referred to on page S59.

Seyres, D., Cabassi, A., Lambourne, J. J., Burden, F., Farrow, S., McKinney, H., Batista, J., Kempster, C., Pietzner, M., Slingsby, O., et al. (2020). Transcriptional, epigenetic and metabolic signatures in cardiometabolic syndrome defined by extreme phenotypes. bioRxiv. 2020.03.06.961805. Referred to on page S10.

Vats, D. and Knudson, C. (2018). Revisiting the Gelman-Rubin diagnostic. arXiv. 1812.09384.