Multi-Way, Multi-View Learning
Ilkka Huopaniemi, Tommi Suvitaival, Janne Nikkilä, Matej Orešič, Samuel Kaski
Ilkka Huopaniemi and Tommi Suvitaival
Department of Information and Computer Science, Helsinki University of Technology, Finland
[email protected], [email protected]
Janne Nikkilä
Department of Basic Veterinary Sciences, Faculty of Veterinary Medicine, University of Helsinki, Finland
[email protected]
Matej Orešič
VTT Technical Research Centre of Finland, Espoo, Finland
[email protected]
Samuel Kaski
Department of Information and Computer Science, Helsinki University of Technology, Finland
[email protected]
Abstract
We extend multi-way, multivariate ANOVA-type analysis to cases where one covariate is the view, with the features of each view coming from different, high-dimensional domains. The different views are assumed to be connected by having paired samples; this is a common setup in recent bioinformatics experiments, of which we analyze metabolite profiles in different conditions (disease vs. control and treated vs. untreated) in different tissues (views). We introduce a multi-way latent variable model for this new task by extending the generative model of Bayesian canonical correlation analysis (CCA), both to take multi-way covariate information into account as population priors, and to reduce the dimensionality by an integrated factor analysis that assumes the metabolites to come in correlated groups.
Finding disease and treatment effects from populations of measurements is a prototypical multi-way modeling task, traditionally solved with multivariate ANOVA. Here disease state (diseased/healthy) and treatment (treated/placebo) are the two covariates, and the research question is: are there differences in the population that can be explained by either covariate or, more interestingly, by their interaction, which would hint at the treatment being effective? It is naturally also interesting what the differences are.

A recurring problem in multi-way analyses, especially with modern high-throughput measurements in molecular biology, is the "small n, large p" problem. The dimensionality p of the measurements is high while the number of samples n is low, and additionally the data may be collinear, making estimation of the effects impossible with classical methods, that is, univariate or multivariate linear models solved with multi-way ANOVA techniques. The most promising modern method, the Bayesian sparse factor regression model [1], is useful for finding the variables most strongly related to the external covariates and for inferring relationships between those variables via common latent factors. Instead of a regression model we will use a generative latent factor model, which incorporates an assumption of clusteredness of the variables to regularize the model, and makes it possible to extend the model to multi-view factor analysis. Such clusteredness is well justified in our application field, metabolomics, where due to biochemical reaction pathways the variation in the concentrations of metabolite groups is highly correlated [2].

Assume that measurements have been made on the same objects but with different methods, resulting in different data sources, possibly on different domains. An example we will analyze in this paper is metabolomic profiles in different tissues, where the domains are partly different since the metabolites cannot be fully matched.
The different views form one covariate in the multi-way analysis, with the additional problem that the samples come from different domains and cannot be directly compared. We introduce a new hierarchy level of latent variables intended to decompose the views into view-specific and shared components, which is needed for the multi-way analysis. Such a decomposition is possible given that the samples in the different views come in pairs, which we need to assume. The resulting decomposition between the views turns out to be implementable with Bayesian canonical correlation analysis [3, 4, 5], interpretable as unsupervised multi-view modeling. Hence, in this work we re-interpret unsupervised multi-view modeling as one-way modeling of samples from different domains, and combine it with multi-way modeling. Given that we additionally can work under the "large p, small n" conditions, the model is expected to have widespread applicability to current molecular biological measurements.

We will generalize ANOVA to multi-view (multi-domain) analysis, restricting to two covariates and two views for simplicity. Using ANOVA-style notation and assuming the views to be in the same domain, the multivariate linear model for samples is

    v_d = α_a + β_b + (αβ)_ab + γ_d + (αγ)_ad + (βγ)_bd + (αβγ)_abd + noise,    (1)

where a and b (a = 0, ..., A and b = 0, ..., B) are the two traditional independent covariates, such as disease and treatment, and d denotes the view. For different values of d the domain of v_d may vary, meaning different feature spaces with different dimensionalities. We assume the samples of the different views to come in pairs, v = [x, y]. For the rest of the paper we will change the notation for clarity to v_1 = x, v_2 = y, and assume a mapping f_x from the effects to the domain of x, which is linear for now.
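As a toy illustration of eqn (1), the sketch below generates univariate observations from a 2×2 design and checks that averaging replicates per cell recovers the effect decomposition. The effect values are made up for the example, and the view covariate d is dropped for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical effect values for a 2x2 design, a, b in {0, 1}; the view
# covariate d of eqn (1) is dropped, and all numbers are made up.
alpha = {0: 0.0, 1: 2.0}                    # disease main effect
beta = {0: 0.0, 1: -1.0}                    # treatment main effect
ab = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.5}  # interaction

def sample(a, b, sigma=1.0):
    """One univariate observation from the ANOVA model of eqn (1)."""
    return alpha[a] + beta[b] + ab[(a, b)] + sigma * rng.normal()

# Averaging many replicates per cell recovers the effect decomposition.
cells = {(a, b): np.mean([sample(a, b) for _ in range(20000)])
         for a in (0, 1) for b in (0, 1)}
```

With enough replicates the (1, 1) cell mean approaches α_1 + β_1 + (αβ)_11 = 2.5, which is exactly the additivity that the model assumes.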
Then,

    x = f_x(α_a + β_b + (αβ)_ab) + f_x(α^x_a + β^x_b + (αβ)^x_ab) + noise,    (2)

assuming γ_d = 0, because it does not make sense to compare means of different domains, and assuming that the view-specific effects are in the same domain as the view-independent effects and hence need to be transformed with the same function. The equation for y is analogous. To our knowledge, there exists no method capable of studying both the view-independent and the view-dependent effects. In the next section we will introduce a model which will additionally assume that the effects may be uncertain, resulting in a hierarchical Bayesian model.

We next formulate a hierarchical latent-variable model for the task of multi-way, multi-view learning under "large p, small n" conditions. For this we need three components: (i) regularized dimension reduction, (ii) combination of different data domains, and (iii) multi-way analysis. We formulate each of these as part of a single generative model. We will first summarize the main components of the model shown in Figure 1, and then describe each part in detail.

To deal with the small sample size (n ≪ p) problem, we reduce the dimensionality of the data x and y from the two views into their respective latent variables x^lat and y^lat. This is done with factor analyzers which assume that the variables come in groups, a strongly regularizing assumption effective under the "large p, small n" conditions. The clustering assumption is particularly sensible since metabolomics data, our main application, contains strongly correlated groups of variables [2].

The second necessary element is the search for a view shared by the two different domains x and y, needed for finding shared multi-way effects.
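The structure of eqn (2) can be sketched as a shared linear map applied to both the view-independent and the view-specific effects. The matrix W and all dimensionalities below are hypothetical placeholders, not estimates from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
p_x, K = 5, 2   # hypothetical sizes: x has 5 features, effects live in 2 dims

# A linear f_x, represented by a matrix; eqn (2) requires the SAME map to
# be applied to the shared and the view-specific effects.
W = rng.normal(size=(p_x, K))

def view_x(shared, x_specific, sigma=0.1):
    """One x-sample from eqn (2): f_x(shared) + f_x(x-specific) + noise."""
    eff = np.asarray(shared, float) + np.asarray(x_specific, float)
    return W @ eff + sigma * rng.normal(size=p_x)

# By linearity, the expected sample is W @ (shared + x_specific).
expected = W @ np.array([1.5, 0.0])
```

Averaging many samples of `view_x([1.0, 0.0], [0.5, 0.0])` approaches `expected`, showing why the shared and view-specific effects live in the same data domain.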
Given paired data, this is a task for Bayesian CCA (BCCA) [3, 4, 5], which introduces a new hierarchy level where a latent variable z captures the shared variation between the views.

Figure 1: The hierarchical latent-variable model for multi-way, multi-view learning under "large p, small n" conditions.

The view-specific variation has been implicitly modeled by view-specific latent variables which have been integrated out, resulting in flexible covariance matrices parameterized by the Ψ. The third necessary element, the ANOVA-type two-way analysis, is supplemented by assigning the effect terms as priors on the latent variables z; in normal BCCA the prior is zero-mean. The observed covariates a and b choose the correct effects for each sample. The covariates hence effectively change the means of the data as in eqn (2), and the variation around the mean is modeled with the rest of the model. The central differences from (2) are that the model is hierarchical, implying that the arguments of the linear function f_x have a distribution, and that the "noise" is structured, stemming from all the latent variables. With these additions, the model is better able to take into account the uncertainty in the data.

The posterior is computed with Gibbs sampling. The Gibbs formulas are included in the supplementary material.

In effect the model, shown in Figure 1, consists of two factor analyzers whose loadings assume cluster memberships (multiplied with scales), a generative model of CCA, and population-specific priors on z that assume an ANOVA-type multi-way structure. We will now introduce the details of each of these parts in turn.

We need to reduce dimensionality, which can be done by factor analysis (FA). The model [6] for n exchangeable replicates is

    x^lat_j ∼ N(0, Ψ^x),
    x_j ∼ N(μ^x + V^x x^lat_j, Λ^x).    (3)

Here V^x is the projection matrix that is assumed to generate the data vector x_j from the latent variable x^lat_j.
The x^lat_j is a latent variable vector whose elements are known as factor scores. The term V^x x^lat_j models such common variance of the data around the variable means μ^x as can be explained by factors common to all or many variables, effectively estimated from the sample covariance matrix of the dataset. The sample covariance becomes decomposed into Σ̂ = V^x (V^x)^T + Λ^x, where Λ^x is a diagonal residual variance matrix with diagonal elements σ²_i, modelling the variable-specific noise not explained by the latent factors. The covariance matrix of x^lat, Ψ^x, comes from the CCA.

At this point, when n < p, V^x cannot be estimated due to the singularity of the sample covariance matrix. To overcome the n ≪ p problem, we now restrict V^x to a non-singular clustering matrix, suitable for data containing highly correlated groups of variables.

We make the structured assumption that there are strongly correlated groups of variables in the data, the generated values within each group being governed by one latent variable. The projection matrix V^x is positive-valued, each row having exactly one non-zero element corresponding to the cluster assignment of the variable,

    V^x = [ λ_1      0      ⋯ ]
          [  ⋮       ⋮        ]
          [ λ_j      0      ⋯ ]
          [  0     λ_{j+1}  ⋯ ]
          [  ⋮       ⋮        ]    (4)

The location of the non-zero value on row i, v_i, follows a multinomial distribution with one observation, with an uninformative prior distribution π_i. The π_i could also be used to encode prior information on a known grouping of the variables. The variation of each variable within a cluster is assumed to be modeled by the same latent variable, but the scales λ_i may differ. The variable-specific residual variances σ²_i, the diagonal elements of Λ, follow a scaled Inv-χ² distribution with an uninformative prior.

In summary, we regularize the covariance matrix by assuming that the main correlations are positive correlations between variables belonging to the same cluster.
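The clustering projection matrix of eqn (4) can be sketched directly: each row carries a single positive scale in the column of its cluster, so one latent factor drives a whole group of variables. The cluster assignments and scales below are toy values, not estimated quantities.

```python
import numpy as np

rng = np.random.default_rng(2)
p, K = 6, 2                             # toy sizes: 6 variables, 2 clusters
cluster = np.array([0, 0, 0, 1, 1, 1])  # hypothetical cluster assignments
scales = np.abs(rng.normal(1.0, 0.2, size=p))  # positive scales lambda_i

# The clustering projection matrix of eqn (4): each row has exactly one
# positive entry, in the column given by the variable's cluster.
V = np.zeros((p, K))
V[np.arange(p), cluster] = scales

# All variables in a cluster are driven by the same latent factor.
x_lat = rng.normal(size=K)
x = V @ x_lat + 0.01 * rng.normal(size=p)
```

Because V has only p free non-zero entries instead of p·K, the covariance model V V^T + Λ stays estimable even when n ≪ p, which is the point of the restriction.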
This correlation is mediated through a common latent variable; this is a reasonable assumption for metabolomics data and, furthermore, facilitates interpretation of the results.

We need to combine different data domains, and for paired data that can be done with CCA. The generative model of BCCA has been formulated [3, 7] for sample j as

    z_j ∼ N(0, I),
    x^lat_j ∼ N(W^x z_j, Ψ^x),    (5)

and likewise for y. Here we have assumed no mean parameter, since the mean of the data is estimated in the factor analysis part. The W^x is a projection matrix from the latent variables z_j, and Ψ^x is a matrix of marginal variances. The crucial point is that the latent variables z are shared between the two data sets, while everything else is independent. The prior distributions were chosen as

    w_l ∼ N(0, β_l I),  β_l ∼ IG(α, β),  Ψ^x, Ψ^y ∼ IW(S, ν).    (6)

Here w_l denotes the l-th column of W, and IG and IW are shorthand for the inverse-Gamma and inverse-Wishart distributions. The priors for the covariance matrices Ψ^x and Ψ^y are conventional conjugate priors, and the prior for the projection matrices is the so-called Automatic Relevance Determination (ARD) prior, used for example in Bayesian principal component analysis [8].

We assume that the ANOVA-type effects act on the latent variables z, which gives access to effects found in both the spaces x^lat and y^lat. They are modeled as population priors on the latent variables, and the effects are in turn given Gaussian priors α_a, β_b, (αβ)_ab ∼ N(0, I).

Figure 2: The graphical model describing the decomposition of covariate effects into shared and view-specific ones. The figure expands the top part of Figure 1, leaving out the feature extraction part and some parameters.

In the K_z-dimensional latent variable space we then have

    z_j = α_a + β_b + (αβ)_ab + ε_j,    (7)

where ε_j is a noise term.
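A short sketch of the BCCA generative model of eqn (5), with hypothetical toy dimensionalities: since the two views are independent given the shared z, the cross-view covariance equals W^x (W^y)^T, and a sample estimate from generated data should approach it.

```python
import numpy as np

rng = np.random.default_rng(3)
Kz, Kx, Ky, n = 2, 4, 3, 20000   # toy dimensionalities and sample count

Wx = rng.normal(size=(Kx, Kz))
Wy = rng.normal(size=(Ky, Kz))
Psi_x = 0.1 * np.eye(Kx)         # within-view covariances (diagonal here)
Psi_y = 0.1 * np.eye(Ky)

# Eqn (5): z is shared between the views, all other variation independent.
Z = rng.normal(size=(n, Kz))
X = Z @ Wx.T + rng.multivariate_normal(np.zeros(Kx), Psi_x, size=n)
Y = Z @ Wy.T + rng.multivariate_normal(np.zeros(Ky), Psi_y, size=n)

# Because the views are independent given z, the cross-view covariance
# is Wx Wy^T; the sample estimate approaches it as n grows.
C = (X.T @ Y) / n
```

This is what makes the shared latent variable identifiable from paired samples: only variation routed through z shows up in the cross-view covariance.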
Note that the grand means are estimated in the lower level of the hierarchy, that is, directly in the x- and y-spaces, and do not appear here. To simplify the interpretation of the effects we center the grand means to the mean of one control population. A similar choice has been made successfully in other ANOVA studies [9], and it does not significantly sacrifice generality. We set the parameter vector μ^x, describing variable-specific means, to the mean of the control group. One group now becomes the baseline to which the other classes are compared by adding main and interaction effects. For convenience, we additionally change the variables compared to the standard ANOVA convention, such that the terms α_0, β_0, (αβ)_00, (αβ)_0b, and (αβ)_a0 are not estimated. The differences between the populations are now modelled directly with x^lat and y^lat, and hierarchically by the main effects α_a, β_b, (αβ)_ab, for a, b > 0. In our case study, a and b have only two values, so we have populations (a, b) = (0, 0), (1, 0), (0, 1), (1, 1), and there are hence three terms, α_1, β_1 and (αβ)_11, that model the difference to the control population (a, b) = (0, 0).

In summary, the complete hierarchical model of Figure 1 is

    α_0 = 0,  β_0 = 0,  (αβ)_a0 = 0,  (αβ)_0b = 0,
    α_a, β_b, (αβ)_ab ∼ N(0, I),
    z_j | j ∈ (a, b) ∼ N(α_a + β_b + (αβ)_ab, I),
    x^lat_j ∼ N(W^x z_j, Ψ^x),
    x_j ∼ N(μ^x + V^x x^lat_j, Λ^x).    (8)

So far we have not discussed how the model finds the view-related effects or, in our application, tissue effects α^x_a, β^x_b, and (αβ)^x_ab, and likewise for y. The Bayesian CCA assumes that the data is generated by a sum of view-specific latent variables z^x and z^y and shared latent variables z, and the former have been integrated out in the graphical model of Figure 1. The way to implement the view-specific effects is to assign them as priors to the view-specific latent variables.
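The population priors of eqn (8) can be sketched as covariate-dependent shifts of the latent mean, with the control population keeping a zero-mean prior. The effect vectors below are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(4)
Kz = 2

# Hypothetical effect vectors in the latent space; the control population
# (a, b) = (0, 0) keeps the zero-mean prior of eqn (8).
alpha = np.array([1.5, 0.0])       # disease main effect (a = 1)
beta = np.array([0.0, -1.0])       # treatment main effect (b = 1)
alphabeta = np.array([0.5, 0.5])   # interaction effect (a = b = 1)

def sample_z(a, b, n):
    """Latent variables of eqn (8): the covariates (a, b) shift the mean
    of z, and the variation around it is standard normal."""
    mean = a * alpha + b * beta + a * b * alphabeta
    return mean + rng.normal(size=(n, Kz))

z_case = sample_z(1, 1, 50000)     # diseased, treated population
z_ctrl = sample_z(0, 0, 50000)     # control population
```

The empirical mean of `z_case` approaches α_1 + β_1 + (αβ)_11 while `z_ctrl` stays centered at zero, which is exactly how the covariates "choose the correct effects" for each sample.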
Then we do not want to integrate them out, but include them explicitly in the model as shown in Figure 2.

As a technical note, to make computation of the model faster and more reliable, we have further included view-specific latent variables that do not have disease or treatment effects. These have been integrated out, resulting in the covariance parameters Ψ in Figure 2. Their role is to explain away all or most of the variation that is unrelated to the disease and treatment effects, so that Gibbs sampling does not need to model all that variation. This trick should not change the modeling results in the limit of infinite computation time. In practice the decomposition in Figure 2 is implemented by restricting a column of W^x to be zero for the y-specific components, and vice versa for x.

For simplicity, and to reduce the number of parameters of the model, the data is preprocessed such that for each variable the mean of the control population (a = 0, b = 0) is subtracted and the variable is scaled by the standard deviation of the control population. This fixes the scales λ_i to one and the μ^x and μ^y to zero. The factor analysis part now models correlations of the variables. The possible covariate effects are now comparable to the control population, as discussed above.

Model complexity, that is, the number of clusters and latent variables, is chosen separately for both x^lat and y^lat by predictive likelihood in 10-fold cross-validation.

We demonstrate the working of the method on generated data, and apply it to a disease study where lipidomic profiles have been measured from several tissues of model mice, under a two-way experimental setup (disease and treatment); the two feature spaces (lipid profiles) are distinct and the samples paired.
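The preprocessing step described above is simple enough to state in a few lines; the data below is a toy stand-in with hypothetical sizes, not the paper's measurements.

```python
import numpy as np

rng = np.random.default_rng(5)
# Toy measurements: 20 control samples (a = b = 0) and 20 others, 4 variables.
control = rng.normal(loc=3.0, scale=2.0, size=(20, 4))
other = rng.normal(loc=5.0, scale=2.0, size=(20, 4))

# Per variable, subtract the control-group mean and divide by the
# control-group standard deviation, as in the paper's preprocessing.
mu = control.mean(axis=0)
sd = control.std(axis=0, ddof=1)
control_s = (control - mu) / sd
other_s = (other - mu) / sd
```

After this step the control group has zero mean and unit variance per variable, so the estimated effects read directly as deviations from the control population in control-group standard deviations.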
We generate data having known effects, and then study how well the model finds the effects as a function of the number of measurements. There are three effects: α (shared), β^y, and (αβ)^x. Each of the three effects has strength +2, the x^lat and y^lat are both 3-dimensional, and x and y are 200-dimensional. The σ²_i = 1 for each variable i in x and y. The model is computed by Gibbs sampling, discarding 1000 burn-in samples and collecting 1000 samples for inference. To fix the sign of the effects without affecting the results, each posterior distribution is mirrored, if necessary, to have a positive mean, i.e., multiplied by the sign of the posterior mean.

The method finds the three generated effects, as shown in Figure 3. The uncertainty decreases with an increasing number of observations. The shared effect is found with much less uncertainty since there is evidence from both views. With low numbers of samples, there is considerable uncertainty in the effects for the view-specific components. In typical bioinformatics applications there may be 20-50 samples.

We then study data from a two-way, two-view, n ≪ p, so far unpublished lung cancer mouse model experiment. The diseased mice are compared to healthy control samples and, in addition, some mice from both groups have been given a test anticancer drug treatment. There are thus healthy untreated (9 mice/samples), diseased untreated (7), healthy treated (6) and diseased treated (6) samples. Lipidomic profiles have been measured by liquid chromatography mass spectrometry. The study has a two-way experimental setup, such that a disease effect α, a treatment effect β and an interaction effect αβ on lipid groups are to be estimated.
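The sign-fixing of the posterior chains described above amounts to one line; the chain below is a toy stand-in for a sampled effect, simulating the case where the sampler converged to the mirrored solution.

```python
import numpy as np

rng = np.random.default_rng(6)
# A toy posterior chain for one effect; due to the sign ambiguity of the
# latent space, a sampler may converge to the mirrored solution.
chain = rng.normal(loc=-2.0, scale=0.3, size=1000)

def fix_sign(samples):
    """Mirror the chain so its posterior mean is positive, as done in the
    paper; this changes only the sign convention, not the magnitudes."""
    return samples * np.sign(samples.mean())

fixed = fix_sign(chain)
```

The mirroring is legitimate because the generative model is invariant to a joint sign flip of an effect and the corresponding latent direction, so only the magnitude of the effect is identifiable.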
The high-dimensional lipidomic profiles have been measured from several tissues of each mouse; the tissues have partly different lipids that have not been matched, and even the roles of the matched lipids may differ between tissues. Hence, the tissues have different feature spaces with paired samples, implying a two-view study. We will specifically study the relationship between blood and lung tissue, which is the most interesting pair for diagnosis, since blood can be easily sampled.

Figure 3: The method finds the generated effects α = +2, β^y = +2, and (αβ)^x = +2 (effect subscripts have been dropped in the rest of the results section). The dots show the posterior mean and the thin lines include 95% of the posterior mass, as a function of the number of observations. A consistently non-zero posterior distribution implies that an effect is found.

Blood plasma (168 lipids) and lung tissue (68 lipids) were integrated with the method. The optimal number of clusters, found by predictive likelihood, was 6 for plasma and 5 for lung. The method finds a disease effect α and a treatment effect β shared by both views (Figure 4). The effect can be traced back to the metabolite groups by first identifying the responsible row of W^x and hence the component of x^lat, and then the metabolite cluster from the V^x corresponding to that x^lat component.

The results imply that a cluster of 12 lipids in lung and a cluster of 20 lipids in blood are mutually coherently up-regulated due to the disease, and additionally up-regulated by the treatment. Another cluster of 13 lipids in lung was found to be down-regulated due to the disease and additionally down-regulated due to the treatment. The lipids of the down-regulated cluster are thus negatively correlated with the up-regulated clusters. Since no consistent interaction term (αβ) is found, there is no indication that the treatment would cure the cancer effects. This confirms our prior fear that this specific treatment might not be efficient.
The treatment does, however, affect the same groups of lipids as the disease, so investigating it as a potential cure was not a far-fetched hypothesis.

The up-regulated cluster of blood plasma contains abundant triglycerides known to be coregulated; the up-regulated cluster of lung contains lipotoxic ceramides [10] and proinflammatory lysophosphatidylcholines [11], while the down-regulated cluster of lung contains ether lipids, known as endogenous antioxidants [12]. Our analysis reveals that the drug treatment enhances, not diminishes, the proinflammatory lipid profile found in the disease.

We then integrate plasma (x) with another tissue, heart (58 lipids; y). The results (Figure 4) show that the disease effect and the treatment effect are found only in the view-specific component of plasma. This implies that there are no shared effects between plasma and heart, and in fact no consistent effects are found for the heart tissue. The method finds, however, the same effects α and β in plasma as in Experiment 1, and for the same cluster of lipids, which is a sign that the method works well.

Figure 4: In Experiment 1 (left), the method finds a disease effect α and a treatment effect β shared between the two views, plasma (x) and lung (y) tissues. In Experiment 2 (right), only view-specific effects are found for plasma (x) when integrating with the heart tissue (y). No effects are found in heart. The boxplots show quartiles and 95% intervals of the posterior mass of the effects; a consistently non-zero posterior distribution implies that an effect is found.

We have generalized ANOVA-type multi-way analysis to cases where multiple views of samples having a multi-way experimental setup are available. The problem is solved by a hierarchical latent variable model that extends the generative model of Bayesian CCA to model multi-way covariate information of samples, by having population-specific priors on the shared latent variable of the CCA.
Furthermore, the method is able to decompose the covariate effects into shared and view-specific effects, treating the multiple views as one covariate. Finally, the method is designed for cases with high dimensionality and small sample size, common in bioinformatics applications. The small sample-size problem was solved by assuming that the variables come in correlated groups, which is reasonable for the metabolomics application.

The modelling task is extremely difficult due to the complexity of the task and the small sample size. Hence it was striking that the method was capable of finding covariate effects in a real-world lipidomic multi-view, multi-way dataset.

In this work it was possible to estimate only three components, because the number of samples was extremely low: a shared component and two view-specific components. If more than one shared component is to be estimated, an unidentifiability problem occurs, since there is a rotational ambiguity within the solution subspace. The problem can be solved by a deflation-type method, where the components are computed one by one. Each posterior sample is then considered as a converged starting point, a second component is added, and the model is sampled with the first component kept fixed. The last sample of each new chain is collected for inference.

References

[1] Mike West. Bayesian factor regression models in the large p, small n paradigm. Bayesian Statistics, 7:723–732, 2003.

[2] Ralf Steuer. Review: On the analysis and interpretation of correlations in metabolomic data.
Briefings in Bioinformatics, 7(2):151–158, 2006.

[3] Arto Klami and Samuel Kaski. Local dependent components. In Zoubin Ghahramani, editor, Proceedings of ICML 2007, the 24th International Conference on Machine Learning, pages 425–432. Omnipress, 2007.

[4] Chong Wang. Variational Bayesian approach to canonical correlation analysis. IEEE Transactions on Neural Networks, 18:905–910, 2007.

[5] Cédric Archambeau and Francis Bach. Sparse probabilistic projections. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, editors, Advances in Neural Information Processing Systems 21, pages 73–80. MIT Press, 2009.

[6] Sam Roweis and Zoubin Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11:305–345, 1999.

[7] Francis R. Bach and Michael I. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley, 2005.

[8] Christopher M. Bishop. Bayesian PCA. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 382–388. MIT Press, 1999.

[9] David M. Seo, Pascal J. Goldschmidt-Clermont, and Mike West. Of mice and men: Sparse statistical modelling in cardiovascular genomics. Annals of Applied Statistics, 1(1):152–178, 2007.

[10] Scott A. Summers. Ceramides in insulin resistance and lipotoxicity. Progress in Lipid Research, 45:42–72, 2006.

[11] Dolly Mehta. Lysophosphatidylcholine: an enigmatic lysolipid. American Journal of Physiology: Lung Cellular and Molecular Physiology, 289:174–175, 2005.

[12] Pedro Brites, Hans R. Waterham, and Ronald J. A. Wanders. Functions and biosynthesis of plasmalogens in health and disease.