Fusing heterogeneous data sets
Yipeng Song

PhD thesis
ISBN: 978-94-6375-531-3

All rights reserved. No part of this publication may be reproduced in any form without written permission from the copyright owner.
Copyright © Yipeng Song, 2019
Academic dissertation (Academisch Proefschrift)

to obtain the degree of doctor at the Universiteit van Amsterdam, by authority of the Rector Magnificus, prof. dr. ir. K.I.J. Maex, before a committee appointed by the College voor Promoties (Doctorate Board), to be defended in public in the Aula der Universiteit on Friday 27 September 2019 at 11:00, by
Yipeng Song
born in Henan

Promotiecommissie (doctoral committee)
Promotor: Prof. dr. A.K. Smilde, Universiteit van Amsterdam
Co-promotor: Dr. J.A. Westerhuis, Universiteit van Amsterdam
Other members:
Prof. dr. A.H.C. van Kampen, Universiteit van Amsterdam
Prof. dr. A.H. Zwinderman, Universiteit van Amsterdam
Prof. dr. P.M.A. Sloot, Universiteit van Amsterdam
Prof. dr. M.E. Timmerman, Rijksuniversiteit Groningen
Prof. dr. ir. M.J.T. Reinders, Technische Universiteit Delft
Prof. dr. T. Naes, Nofima

Faculteit der Natuurwetenschappen, Wiskunde en Informatica

The research reported in this thesis was carried out at the Swammerdam Institute for Life Sciences, Faculty of Science, University of Amsterdam. Yipeng Song gratefully acknowledges the financial support from the China Scholarship Council (NO. 201504910809).

To my mother Jin'e

Contents

Abstract
1 Introduction
Summary
Samenvatting
Acknowledgments
Bibliography

Abstract

In systems biology, it is common to measure biochemical entities at different levels of the same biological system. One of the central problems for the data fusion of such data sets is the heterogeneity of the data. This thesis discusses two types of heterogeneity. The first one is the type of data, such as metabolomics, proteomics and RNAseq data in genomics. These different omics data reflect the properties of the studied biological system from different perspectives. The second one is the type of scale, which indicates that measurements are obtained at different scales, such as binary, ordinal, interval and ratio-scaled variables. In this thesis, we developed several statistical methods capable of fusing data sets with these two types of heterogeneity. The advantages of the proposed methods in comparison with other approaches are assessed using comprehensive simulations as well as the analysis of real biological data sets.

Chapter 1
Introduction
In systems biology, it is becoming increasingly common to measure biochemical entities at different levels of the same biological system. Hence, data fusion problems, which focus on analyzing such data sets simultaneously to arrive at a holistic understanding of the studied system, are abundant in the life sciences. With the availability of a multitude of measuring techniques, one of the central problems is the heterogeneity of the data. In this thesis, we mainly discuss two types of heterogeneity. The first one is the type of data, such as metabolomics, proteomics and RNAseq data in genomics. These different omics data reflect the properties of the studied biological system from different perspectives. The second one is the type of scale, which indicates that measurements are obtained at different scales, such as binary, ordinal, interval and ratio-scaled variables. In genomics, an example is the measurement of gene-expression and point mutation status on the same objects: the latter are binary data, while gene-expression measurements are quantitative data. Ideally, data fusion methods should take these two types of heterogeneity into account, and this is the topic of this thesis.

The goal of this thesis is to develop appropriate statistical methods capable of fusing data sets with these two types of heterogeneity. Before going into the details of the developed methods, we begin with a brief introduction of the concept of data fusion in the life sciences and the characteristics of the two types of heterogeneity. Another important concept in this thesis is the concave penalty, which is the basis of all the developed methods. Although it is not directly related to the fusion of heterogeneous data sets, it is also introduced in Chapter 1.

This chapter is based on Smilde, A.K., Song, Y., Westerhuis, J.A., Kiers, H.A., Aben, N. and Wessels, L.F., 2019. Heterofusion: Fusing genomics data of different measurement scales. arXiv preprint arXiv:1904.10279.
With the availability of comprehensive measurements collected in multiple related data sets in the life sciences, the need for a simultaneous analysis of such data to arrive at a global view on the system under study is of increasing importance. There are many ways to perform such a simultaneous analysis, and these go under very different names in different areas of data analysis: data fusion, data integration, global analysis, multi-set or multi-block analysis, to name a few. We will use the term data fusion in this thesis. Data fusion plays an especially important role in the life sciences; e.g., in genomics it is not uncommon to measure gene-expression (array data or RNAseq data), methylation of DNA and copy number variation. Sometimes, proteomics and metabolomics measurements are also available. All these examples serve to show that having methods in place to integrate these data is not a luxury anymore.

Without trying to build a rigorous taxonomy of data fusion, it is worthwhile to make several distinctions. The first distinction is between model-based and exploratory data fusion. The former uses background knowledge in the form of models to fuse the data, one example being genome-scale models in biotechnology [1]. The latter does not rely on models, since these are not available or poorly known, and thus uses empirical modeling to explore the data. In this thesis, we will focus on exploratory data fusion. The next distinction is between low-, medium-, and high-level data fusion [2]. In low-level data fusion, the data sets are combined at the lowest level, that is, at the level of the (preprocessed) measurements. In medium-level data fusion, each separate data set is first summarized, e.g., by using a dimension reduction method or through variable selection. The reduced data sets are subsequently subjected to the data fusion. In high-level data fusion, each data set is used for prediction or classification of an outcome and the prediction or classification results are then combined, e.g., by using majority voting [3]. All these types of data fusion have advantages and disadvantages which are beyond the scope of this thesis. In this thesis, we will focus on low- and medium-level fusion.

The final characteristic of data fusion relevant for this thesis is the heterogeneity of the data sets to be fused. Different types of heterogeneity can be distinguished. The first one is the type of data, such as metabolomics, proteomics and RNAseq data in genomics. Clearly, these data relate to different parts of the biological system. The second one is the type of scale in which the data are measured. In genomics, an example is a data set where gene-expressions are available together with mutation data in the form of single nucleotide polymorphisms (SNPs). The latter are binary data and gene-expression measurements are quantitative data; they are clearly measured on different scales. Ideally, data fusion methods should take these two levels of heterogeneity into account in the data analysis, and this will be the topic of this thesis. In the following section, we will show the characteristics of these two types of heterogeneity.

1.2 Two types of heterogeneity
Multiple data sets measured on the same objects may have different types of measurement scales. The history of measurement scales goes back a long time. A seminal paper drawing attention to this issue appeared in the forties [4]. Since then numerous papers, reports and books have appeared [5, 6, 7, 8, 9, 10]. The measuring process assigns numbers to aspects of objects (an empirical system), e.g., weights of persons. Hence, measurements can be regarded as a mapping from the empirical system to numbers, and scales are properties of these mappings. In measurement theory, there are two fundamental theorems [6]: the representation theorem and the uniqueness theorem. The representation theorem asserts the axioms to be imposed on an empirical system to allow for a homomorphism of that system to a set of numerical values. Such a homomorphism into the set of real numbers is called a scale and thus represents the empirical system. A scale possesses uniqueness properties: we can measure the weight of persons in kilograms or in grams, but if one person weighs twice as much as another person, this ratio holds true regardless of the measurement unit. Hence, weight is a so-called ratio-scaled variable and this ratio is unique. The transformation of measuring in grams instead of kilograms is called a permissible transformation since it does not change the ratio of two weights. For a ratio-scaled variable, only similarity transformations are permissible, i.e. $\tilde{x} = \alpha x$ with $\alpha > 0$, where $x$ is the variable on the original scale and $\tilde{x}$ is the variable on the transformed scale. This is because
$$\frac{\tilde{x}_i}{\tilde{x}_j} = \frac{\alpha x_i}{\alpha x_j} = \frac{x_i}{x_j}.$$
Note that this coincides with the intuition that the unit of measurement is immaterial.

The next level of scale is the interval-scaled measurement. A typical example of such a scale is the concentration of metabolites in metabolomics research, and the permissible transformation is affine, i.e. $\tilde{x} = \alpha x + \beta$ with $\alpha > 0$.
In that case, the ratio of two intervals is unique because
$$\frac{\tilde{x}_i - \tilde{x}_j}{\tilde{x}_k - \tilde{x}_l} = \frac{(\alpha x_i + \beta) - (\alpha x_j + \beta)}{(\alpha x_k + \beta) - (\alpha x_l + \beta)} = \frac{\alpha(x_i - x_j)}{\alpha(x_k - x_l)} = \frac{x_i - x_j}{x_k - x_l}.$$
Stated differently, the zero point and the unit are arbitrary on this scale.

Ordinal-scaled variables represent the next level of measurement. Typical examples are scales of agreement in surveys: strongly disagree, disagree, neutral, agree and strongly agree. There is a rank-order in these answers, but no
relationship in terms of ratios or intervals. The permissible transformation of an ordinal scale is a monotonically increasing transformation, since such transformations keep the order of the original scale intact. Nominal-scaled variables are next on the list. These variables are used to encode categories and are sometimes also called categorical. Typical examples are gender, race, brands of cars and the like. The only permissible transformation for a nominal-scaled variable is a one-to-one mapping. A special case of a nominal-scaled variable is the binary (0/1) scale. Binary data can have different meanings; they can be used as categories (e.g. gender) and are then nominal-scaled variables. They can also be two points on a higher-level scale, such as absence and presence (e.g. for methylation data). The above four scales are the most used ones, but others exist [5, 6]. Counts, e.g., have a fixed unit and are therefore sometimes called absolute-scaled variables [8]. Another scale is the one for which the power transformation is permissible, i.e. $\tilde{x} = \alpha x^{\beta}$ with $\alpha, \beta > 0$.

Multiple sets of measurements on the same objects obtained from different platforms may reflect partially complementary information of the studied system. Therefore, these multiple data sets may contain heterogeneous information: the information that is common across all or some of the data sets, and the information which is specific to each data set (often called distinct). The challenge for the data fusion of such data sets is how to separate the common and distinct information. Suppose measurements from $L$ different platforms on the same $I$ objects result in $L$ quantitative data sets $\{X_l\}_{l=1}^{L}$, and the $l$th data set $X_l$ ($I \times J_l$) has $J_l$ variables. After these data sets are column centered, we can decompose them in the simultaneous component analysis (SCA) model framework as $X_l = A B_l^T + E_l$, in which $A$ ($I \times R$) is the common score matrix, $B_l$ ($J_l \times R$) and $E_l$ ($I \times J_l$) are the loading matrix and residual term for $X_l$, and $R$ is the number of components. The common column subspace, which is spanned by the columns of the score matrix $A$, represents the common information between these $L$ data sets.

The drawback of the SCA model is that only global common components, which account for the common variation across all the data sets, are modeled. However, the real situation in the integration of multiple data sets can be far more complex, as local common variation across some of the data sets and distinct variation in each data set are expected as well. With the help of the concept of structured sparsity on the loading matrices of an SCA model, we can interpret the common and distinct variation framework as follows. Suppose we construct an SCA model on three column centered quantitative data sets $\{X_l\}_{l=1}^{3}$; the common score matrix is $A$, the corresponding loading matrices are $\{B_l\}_{l=1}^{3}$, and $X_l = A B_l^T + E_l$, in which $E_l$ is the residual term for the $l$th data set. If the structured sparsity pattern in $\{B_l\}_{l=1}^{3}$ is expressed as

$$\begin{bmatrix} B_1 \\ B_2 \\ B_3 \end{bmatrix} =
\begin{bmatrix}
b_{1,1} & b_{1,2} & b_{1,3} & 0 & b_{1,5} & 0 & 0 \\
b_{2,1} & b_{2,2} & 0 & b_{2,4} & 0 & b_{2,6} & 0 \\
b_{3,1} & 0 & b_{3,3} & b_{3,4} & 0 & 0 & b_{3,7}
\end{bmatrix},$$

in which $b_{l,r} \in \mathbb{R}^{J_l}$ indicates the $r$th column of the $l$th loading matrix $B_l$, then we have the following relationships:

$$\begin{aligned}
X_1 &= a_1 b_{1,1}^T + a_2 b_{1,2}^T + a_3 b_{1,3}^T + 0 + a_5 b_{1,5}^T + 0 + 0 + E_1 \\
X_2 &= a_1 b_{2,1}^T + a_2 b_{2,2}^T + 0 + a_4 b_{2,4}^T + 0 + a_6 b_{2,6}^T + 0 + E_2 \\
X_3 &= a_1 b_{3,1}^T + 0 + a_3 b_{3,3}^T + a_4 b_{3,4}^T + 0 + 0 + a_7 b_{3,7}^T + E_3.
\end{aligned}$$

Here $a_r$ indicates the $r$th column of the common score matrix $A$.
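Before interpreting these components, note that the basic (unconstrained) SCA decomposition itself can be obtained from an SVD of the concatenated, column centered data blocks. The following R sketch is only an illustration with toy data and made-up block sizes; the structured sparsity estimation used in the thesis (Chapter 5) is not shown here.

```r
# Plain SCA via the SVD of the concatenated, column centered blocks (toy data).
set.seed(1)
I <- 20; J <- c(30, 25, 15); R <- 3
X  <- lapply(J, function(Jl) matrix(rnorm(I * Jl), I, Jl))  # three toy data blocks
Xc <- lapply(X, scale, center = TRUE, scale = FALSE)        # column centering per block
sv <- svd(do.call(cbind, Xc), nu = R, nv = R)
A  <- sv$u %*% diag(sv$d[1:R])                        # common score matrix A (I x R)
B1 <- sv$v[1:J[1], , drop = FALSE]                    # loading matrix B_1 (J_1 x R)
B2 <- sv$v[(J[1] + 1):(J[1] + J[2]), , drop = FALSE]  # loading matrix B_2
B3 <- sv$v[(J[1] + J[2] + 1):sum(J), , drop = FALSE]  # loading matrix B_3
X1_hat <- A %*% t(B1)                                 # rank-R reconstruction of block 1
```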
The first component represents the global common variation across the three data sets; the 2nd, 3rd and 4th components represent the local common variation across two data sets; and the 5th, 6th and 7th components represent the distinct variation specific to each single data set. Therefore, the heterogeneity of the data (the common and distinct information present in multiple data sets) is disentangled by the common and distinct components. In the above example, we only show a single score and loading vector for any specific type of variation, but there can be multiple score and loading vectors for each (global common, local common or distinct) variation.

Apart from the above approach, there are many other methods for distinguishing common and distinct components. However, the above approach has the advantage of estimating the model complexity directly, which is problematic in most other methods. Details will be shown in Chapter 5.

1.3 Concave penalties

Although concave penalties are not directly related to the fusion of heterogeneous data sets, they are the basis of all the methods developed in this thesis. Therefore, a general introduction to them is given here.

The comprehensive measurements in current biological research typically result in high dimensional data sets, in which the number of variables is much larger than the number of samples. For the analysis of such high dimensional data sets, sparse parameter estimation (many estimated parameters are exactly 0) is often desired, since it makes the data analysis problem feasible and the results easier to interpret. Some typical examples of sparse parameter estimation include sparse regression models [16], low rank matrix approximation problems [17, 14, 18], structure learning problems in graphical models [19], and many others [20, 21, 22].

We can use a linear regression model as an example to illustrate how to achieve sparse parameter estimation through various penalties. Suppose we have a univariate response variable $y \in \mathbb{R}$ and a multivariate explanatory variable $x \in \mathbb{R}^{J}$. A standard linear regression model can be expressed as $y = x^T\beta + \epsilon$, in which $\beta \in \mathbb{R}^{J}$ is the coefficient vector and $\epsilon \in \mathbb{R}$ is the error term following a normal distribution with mean 0 and variance $\sigma^2$, $\epsilon \sim N(0, \sigma^2)$. After obtaining $I$ samples of $\{y, x\}$, we have the data set $\{y_i, x_i\}_{i=1}^{I}$, which can be expressed in vector and matrix form as $y \in \mathbb{R}^{I}$ and $X \in \mathbb{R}^{I \times J}$. The optimization problem associated with the standard linear model is $\min_{\beta} \|y - X\beta\|_2^2$, and its analytical solution is $\hat{\beta} = (X^TX)^{-1}X^Ty$. Unfortunately, this model is unidentifiable when $J > I$ and ill-conditioned when the explanatory variables are correlated. Also, the estimated coefficient vector $\hat{\beta}$ is always dense, which makes the interpretation difficult.

A cardinality constraint can be imposed on the linear regression model as a hard constraint to induce a sparse estimate of $\beta$. If we require only $R$ elements of $\beta$ to be nonzero, the associated optimization problem can be expressed as $\min_{\beta} \|y - X\beta\|_2^2$ subject to $\|\beta\|_0 = R$, in which $\|\cdot\|_0$ indicates the pseudo $L_0$ norm and counts the number of nonzero elements. Since this optimization problem is non-convex and difficult to solve, the cardinality constraint is often replaced by its convex relaxation, the $L_1$ norm, to induce sparsity, which results in the lasso regression model [23].
The optimization problem associated with the lasso regression model is $\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$, in which $\lambda$ is the tuning parameter and $\|\cdot\|_1$ indicates the $L_1$ norm. Efficient algorithms exist to solve this convex optimization problem [24, 25]. However, the $L_1$ norm penalty shrinks all the elements of the coefficient vector $\beta$ to the same degree. This leads to a biased estimation of the coefficients with large absolute values, which in turn makes model selection procedures based on the prediction error or the CV error inconsistent [26]. Many non-convex penalties, most of them concave functions of the absolute value of $\beta$, have been proposed to tackle this drawback of the $L_1$ norm penalty [16, 27]. They shrink the parameters in a nonlinear manner to achieve both nearly unbiased and sparse parameter estimation. A frequentist version of the generalized double Pareto (GDP) shrinkage [27] can serve as an example. The optimization problem of a linear regression model with the GDP penalty can be expressed as $\min_{\beta} \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{J}\log(1 + |\beta_j|/\gamma)$, in which $\beta_j$ is the $j$th element of $\beta$, $\gamma$ is a hyper-parameter, and $\log(1 + |\beta_j|/\gamma)$ is the concave GDP penalty on $\beta_j$.

The thresholding properties of the cardinality constraint, the $L_1$ norm penalty and the GDP penalty with different values of $\gamma$ are shown in Fig. 1.1. The cardinality constraint shrinks all the coefficients whose absolute values are less than a selected threshold to 0, while keeping the other coefficients unchanged. This thresholding behavior is referred to as hard thresholding [23]. The $L_1$ norm penalty shrinks all the coefficients to the same degree until some coefficients become 0; its thresholding behavior is referred to as soft thresholding [23]. The GDP penalty, on the other hand, shrinks the coefficients in a nonlinear manner according to the absolute values of the corresponding coefficients. In this way, coefficients with small absolute values are more likely to be shrunk to 0, while coefficients with large absolute values tend to be shrunk less.
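To make the hard and soft thresholding behaviors concrete, the small R sketch below implements the two operators. It is an illustration only (the threshold value and the example coefficients are arbitrary), not code from the thesis.

```r
# Hard thresholding: set coefficients below the threshold (in absolute value) to 0,
# leave the rest unchanged (behavior of the cardinality constraint).
hard_threshold <- function(beta, lambda) ifelse(abs(beta) <= lambda, 0, beta)

# Soft thresholding: shrink every coefficient towards 0 by lambda, truncating at 0
# (behavior of the L1 norm penalty).
soft_threshold <- function(beta, lambda) sign(beta) * pmax(abs(beta) - lambda, 0)

beta <- c(-3, -0.5, 0.2, 1.5, 4)
hard_threshold(beta, 1)   # -3  0  0  1.5  4
soft_threshold(beta, 1)   # -2  0  0  0.5  3
```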
Figure 1.1: Thresholding properties of the cardinality constraint, the $L_1$ norm penalty and the GDP penalty when the same degree of shrinkage is achieved. Legend cardinality: cardinality constraint; legend $L_1$: $L_1$ norm penalty; legends GDP: $\gamma = 1$ and GDP: $\gamma = 50$: GDP penalty with the corresponding value of $\gamma$; legend original: the original values before thresholding. $\beta$ on the x axis indicates the original value of the coefficient, while $\eta$ on the y axis is the value after thresholding.

1.4 Scope and outline of the thesis

The principal component analysis (PCA) model is the basis of many commonly used data fusion methods [28], and both PCA and these data fusion methods assume that the data sets used are quantitative. Thus, before turning to the data fusion of data sets with heterogeneous measurement scales, we first introduce generalizations of PCA for qualitative data sets. We review the extensions of PCA methods for the analysis of multivariate binary data sets in Chapter 2 and develop a robust logistic PCA method in Chapter 3. After that, we are ready for the data fusion of data sets with heterogeneous measurement scales. We generalize the commonly used data fusion method, simultaneous component analysis (SCA), in a probabilistic framework for the data fusion of multivariate quantitative and binary data sets in Chapter 4. Finally, it comes to the data fusion of data sets with the two types of heterogeneity. In Chapter 5, we develop an exponential family SCA model for the data fusion of multiple data sets of mixed data types, such as quantitative, binary or count data, and introduce the nearly unbiased group concave penalty to induce structured sparsity on the loading matrix in order to separate the common (global and local) and distinct variation in such mixed-type data sets. The thesis closes in Chapter 6 with an outlook on the future of fusing heterogeneous data sets.

Chapter 2

PCA of binary genomics data
Genome wide measurements of genetic and epigenetic alterations are generating more and more high dimensional binary data. The special mathematical characteristics of binary data make the direct use of the classical PCA model to explore low dimensional structures less obvious. Although there are several PCA alternatives for binary data in the psychometric, data analysis and machine learning literature, they are not well known to the bioinformatics community. In this chapter we introduce the motivation and rationale of some parametric and nonparametric versions of PCA specifically geared towards binary data. Using both realistic simulations of binary data as well as mutation, CNA and methylation data of the Genomic Determinants of Sensitivity in Cancer 1000 (GDSC1000), the methods are explored for their performance with respect to finding the correct number of components, overfit, finding back the correct low dimensional structure, variable importance, etc. The results show that if a low dimensional structure exists in the data, most of the methods can find it. When a probabilistic generating process is assumed to underlie the data, we recommend to use the parametric logistic PCA model (using the projection based approach), while when such an assumption is not valid and the data are considered as given, the nonparametric Gifi model is recommended.

This chapter is based on Song, Y., Westerhuis, J.A., Aben, N., Michaut, M., Wessels, L.F. and Smilde, A.K., 2017. Principal component analysis of binary genomics data. Briefings in Bioinformatics, 20(1), pp.317-329.

Binary measurements only have two possible outcomes, such as presence and absence, or true and false, which are usually labeled as "1" and "0". In many research problems, objects are characterized by multiple binary features, each depicting a different aspect of the object. In biological research, several examples of binary data sets can be found. Genome wide measurements of genetic and
epigenetic alterations are generating more and more high dimensional binary data [29, 30]. One example is the high throughput measurement of point mutations. Here, a feature is labeled as "1" when it is classified as mutated in a sample and as "0" when it is not. Another often observed binary data set is the binarized version of copy number aberrations (CNA), which are gains and losses of segments in chromosomal regions. A segment is labeled as "1" when an aberration is present in a sample, and as "0" otherwise [31]. DNA methylation data can also be discretized into binary features, where "1" indicates a high level of methylation and "0" a low level [30].

Compared to commonly used quantitative data, binary data have some special mathematical characteristics, which should be taken into account during the data analysis. In binary measurements, "0" and "1" are abstract representations of two exclusive categories rather than the quantitative values 0 and 1. These two categories could also be encoded by any other two labels, like "-1" and "1" or "-" and "+", without changing the meaning. Because "1" and "0" are only an abstract representation of two categories, they cannot be interpreted as quantitative data. Furthermore, the measurement error of binary data is discrete in nature: a binary measurement error occurs when the wrong label is assigned to an object, such as when a mutated gene is misclassified as wild type. Therefore, the Gaussian error assumption used by default for continuous data in many statistical models is inappropriate for binary data analysis. Another aspect of binary data is that there can be an order in the two categories; for example, presence is often considered more important than absence. Finally, binary data can be generated from a discrete measurement process, but also from a continuous measurement process [32].

PCA is one of the most popular methods for dimension reduction, with numerous applications in biology, chemistry and many other disciplines [17]. PCA can map data points in a high dimensional space to a low dimensional space with minimum loss of variation. The derived low dimensional features, which provide a parsimonious representation of the original high dimensional data, can be used for data visualization or for further statistical analysis.

Classical linear PCA methods are appropriate for quantitative data. The direct use of linear PCA on binary data does not take into account the distinct mathematical characteristics of binary data. In this chapter, we introduce, compare and evaluate some of the PCA alternatives for binary data. First, the theory of the different approaches is introduced, together with their model properties and how the different models are assessed. Then we introduce three binary genomics data sets on which the models will be applied. Besides the real data, realistic simulations of binary data are used to uncover some of the properties of the different models.

2.2 Theory

There are two separate directions in extending PCA to binary data: parametric and nonparametric. Parametric approaches are represented by logistic PCA methods, originating from the machine learning literature. In these methods, PCA is extended to binary data from a probabilistic perspective, in a similar way as linear regression is extended to logistic regression [33, 34, 35].
Nonparametric methods, originating from the psychometric and data analysis communities, include optimal scaling [36], multiple correspondence analysis [37] and many others [38]. In this direction, PCA is extended to binary data from a geometric perspective, without probabilistic assumptions. The details of the motivation and rationale of the above approaches for binary data will be explained later in this section. We will start by introducing classical PCA.
Classical PCA can be expressed as a projection based approach (finding the low dimensional space that best represents a cloud of high dimensional points), following Pearson [39]. The measurements of $J$ quantitative variables on $I$ objects result in a matrix $X$ ($I \times J$) with $I$ rows and $J$ columns. The column vector form of the $i$th row of $X$ is $x_i \in \mathbb{R}^{J}$, which is taken as a point in a $J$ dimensional space. Suppose that a low dimensional space is spanned by the columns of an orthogonal loading matrix $B$ ($J \times R$), $R \ll \min(I, J)$. The orthogonal projection of $x_i$ on this low dimensional space is $BB^Tx_i$. We find $B$ by minimizing the Euclidean distance between the centered high dimensional points $x_i$, $i = 1 \cdots I$, and their low dimensional projections:
$$\min_{\mu, B} \sum_{i=1}^{I} \|(x_i - \mu) - BB^T(x_i - \mu)\|_2^2 \quad \text{subject to } B^TB = I, \qquad (2.1)$$
in which the column offset term $\mu$ ($J \times 1$) is included to center the samples, and $I$ is the identity matrix. The exact position of the centered $i$th data point $x_i$ in this low dimensional space is represented by its $R$ dimensional score vector $\hat{a}_i$, $\hat{a}_i = \hat{B}^T(x_i - \hat{\mu})$, where $\hat{B}$ and $\hat{\mu}$ are the estimates from equation 2.1. In matrix form, we have $\hat{A} = (X - \mathbf{1}\hat{\mu}^T)\hat{B}$, where $\hat{a}_i$ is the $i$th row of $\hat{A}$, $\mathbf{1}$ is an $I$ dimensional vector of ones, the estimated offset $\hat{\mu}$ contains the column means of $X$, and $X - \mathbf{1}\hat{\mu}^T$ is the column centered $X$.

Another approach to explain PCA is the reconstruction based approach [40]. A high dimensional data point $x_i \in \mathbb{R}^{J}$ is approximated by a linear function of the latent low dimensional score $a_i \in \mathbb{R}^{R}$ with orthogonal coefficients $B$: $x_i \approx \mu + Ba_i$, $B^TB = I$, where $\mu$ is the offset term. Now, $\mu$, $A$ and $B$ can be found simultaneously by minimizing the Euclidean distance between $x_i$, $i = 1 \cdots I$, and their low dimensional linear approximations $\mu + Ba_i$:
$$\min_{\mu, A, B} \sum_{i=1}^{I} \|x_i - \mu - Ba_i\|_2^2 \quad \text{subject to } B^TB = I. \qquad (2.2)$$
It is well known that these two approaches to classical PCA are equivalent and that the global optimal solution can be obtained from the $R$ truncated singular value decomposition (SVD) of the column centered $X$ [41]. The solution $\hat{\mu}$ contains the column means of $X$; $\hat{A}$ is the product of the first $R$ left singular vectors and the diagonal matrix of the first $R$ singular values; $\hat{B}$ contains the first $R$ right singular vectors.

Above, classical PCA was derived from a geometrical perspective. Bishop et al. [42] have derived another explanation of PCA from a probabilistic perspective, called probabilistic PCA. A high dimensional point $x_i$ can be regarded as a noisy observation of the true data point $\theta_i \in \mathbb{R}^{J}$, which lies in a low dimensional space. The model can be expressed as $x_i = \theta_i + \epsilon_i$ and $\theta_i = \mu + Ba_i$, where $\mu$ is the offset term as before, $B$ contains the coefficients, and $a_i$ represents the low dimensional score vector. The noise term $\epsilon_i$ is assumed to follow a multivariate normal distribution with mean $\mathbf{0}$ and constant variance $\sigma^2$, $\epsilon_i \sim N(\mathbf{0}, \sigma^2 I)$. Thus the conditional distribution of $x_i$ is a normal distribution with mean $\theta_i$ and constant variance, $x_i \mid \mu, a_i, B \sim N(\mu + Ba_i, \sigma^2 I)$. $\mu$, $A$ and $B$ can be obtained by maximum likelihood estimation:
$$\max_{\mu, A, B} \sum_{i=1}^{I} \log(p(x_i \mid \mu, a_i, B)) = \sum_{i=1}^{I} \log(N(x_i \mid \mu + Ba_i, \sigma^2 I)) \quad \text{subject to } B^TB = I. \qquad (2.3)$$
This maximum likelihood estimation is equivalent to the least squares minimization in classical PCA from the perspective of frequentist statistics [33]. One important implication is that all the elements in the observed matrix $X$ are conditionally independent of each other given the offset $\mu$, the score matrix $A$ and the loading matrix $B$, which is the key point for the further extension to binary data.
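As a quick illustration of the equivalence noted above, the classical PCA solution can be computed from a truncated SVD of the column centered data. The R sketch below uses made-up data and is only meant to make the notation concrete.

```r
# Classical PCA via the truncated SVD of the column-centered data matrix X.
set.seed(1)
I <- 50; J <- 10; R <- 2
X  <- matrix(rnorm(I * J), I, J)           # toy quantitative data
mu <- colMeans(X)                          # estimated offset: column means
Xc <- sweep(X, 2, mu)                      # column centering
sv <- svd(Xc, nu = R, nv = R)
A  <- sv$u %*% diag(sv$d[1:R])             # scores: left singular vectors * singular values
B  <- sv$v                                 # loadings: right singular vectors (orthonormal)
X_hat <- sweep(A %*% t(B), 2, mu, "+")     # rank-R reconstruction mu + A B^T
```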
The probabilistic interpretation of PCA under a multivariate normal distribution for the observed data provides a framework for the further generalization to other types of data. For binary data, the $ij$th element of the observed matrix $X$, $x_{ij}$, is taken as a realization of a Bernoulli distribution with parameter $p_{ij}$, which is the $ij$th element of the probability matrix $\Pi$. Specifically, the probability that $x_{ij}$ equals "1" is $p_{ij}$. Similar to probabilistic PCA, all the elements in the observed matrix $X$ are conditionally independent of each other given the parameter matrix $\Pi$ ($I \times J$). The log likelihood of the observation $X$ given the probability matrix $\Pi$ is as follows:
$$l(\Pi) = \sum_{i=1}^{I}\sum_{j=1}^{J} x_{ij}\log(p_{ij}) + (1 - x_{ij})\log(1 - p_{ij}). \qquad (2.4)$$
The log-odds of $p_{ij}$ is $\theta_{ij}$, where $\theta_{ij} = \log(\frac{p_{ij}}{1 - p_{ij}})$, which is the natural parameter of the Bernoulli distribution expressed in exponential family form. Thus $p_{ij} = \phi(\theta_{ij}) = (1 + e^{-\theta_{ij}})^{-1}$, where $\phi()$ is called the logistic function. The log likelihood of the observation $X$ given the log-odds $\Theta$ is represented as:
$$l(\Theta) = \sum_{i=1}^{I}\sum_{j=1}^{J} x_{ij}\log(\phi(\theta_{ij})) + (1 - x_{ij})\log(1 - \phi(\theta_{ij})). \qquad (2.5)$$
A low dimensional structure can be assumed to exist in the log-odds $\Theta$ ($I \times J$) as $\Theta = AB^T + \mathbf{1}\mu^T$. Here $A$ is the object score matrix for the log-odds $\Theta$, $B$ is the loading matrix, and $\mu$ is the offset.

There are mainly two approaches to fit the model (equation 2.5): logistic PCA [43] and projection based logistic PCA (logistic PPCA) [35]. The main difference between these two approaches is whether $A$ and $B$ are estimated simultaneously or sequentially. In the logistic PCA model, the score matrix $A$ and loading matrix $B$ are estimated simultaneously by alternating minimization [33, 44] or by a majorization-minimization (MM) algorithm [43].

On the other hand, logistic PPCA only estimates $B$ directly. After $B$ is estimated, $A$ is obtained by a projection based approach in the same manner as classical PCA in equation 2.1 [35]. The score matrix $A$ is the low dimensional representation of the log-odds $\tilde{\Theta}$ of the saturated model in the subspace spanned by $B$. Details of the log-odds $\tilde{\Theta}$ of the saturated model will be given later. In matrix form, $A = (\tilde{\Theta} - \mathbf{1}\mu^T)B$, where $\mu$ is the offset term. Then the log-odds $\Theta$ in equation 2.5 can be represented as $\Theta = (\tilde{\Theta} - \mathbf{1}\mu^T)BB^T + \mathbf{1}\mu^T$. The estimates of the parameters, $\hat{\mu}$ and $\hat{B}$, can be obtained by maximizing the conditional log likelihood $l(\Theta)$ in equation 2.5. The solution for the score matrix is then $\hat{A} = (\tilde{\Theta} - \mathbf{1}\hat{\mu}^T)\hat{B}$.

Compared to logistic PCA, logistic PPCA has fewer parameters to estimate and is thus less prone to overfitting. In addition, the estimation of the scores of new samples in logistic PCA involves an optimization problem, while for logistic PPCA it is a simple projection of the new data onto $\hat{B}$.

In a saturated model, there is a separate parameter for every individual observation. The model is over-parameterized and has a perfect fit to the observed data. For quantitative data, the parameters of the saturated model are simply the observed data; for example, the parameters of the saturated PCA model on an observed data matrix $X$ are the matrix $X$ itself. For binary data, the parameter (probability of success) of a saturated model is 1 for the observation "1" and 0 for the observation "0". Thus, the $ij$th element of $\tilde{\Theta}$ from the saturated logistic PCA model is $\tilde{\theta}_{ij} = \log(\frac{x_{ij}}{1 - x_{ij}})$. It is negative infinity when $x_{ij} = 0$ and positive infinity when $x_{ij} = 1$. In order to project $\tilde{\Theta}$ onto the low dimensional space spanned by $B$, one needs a finite $\tilde{\Theta}$. In logistic PPCA, positive and negative infinities in $\tilde{\Theta}$ are approximated by the large numbers $m$ and $-m$. When $m$ is too large, the elements in the estimated probability matrix $\hat{\Pi}$ for generating the binary observation $X$ are close to 1 or 0; when $m$ is close to 0, the elements in $\hat{\Pi}$ are close to 0.5. In the original paper on logistic PPCA, CV is used to select $m$ [35]. In this paper we select $m = 2.94$,
which corresponds to a probability of success of 0.95. This can be interpreted as using the probabilities 0.95 and 0.05 to approximate the probabilities 1 and 0 of the saturated model.
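The projection step of logistic PPCA described above is easy to sketch in R. The snippet below assumes a binary matrix X (I x J) and an orthonormal loading matrix B (J x R) that has already been estimated (e.g. by the MM algorithm cited above); for simplicity the offset is fixed at the empirical marginal log-odds, one of the options discussed in the model properties section below. It is an illustration, not the thesis code.

```r
# Projection step of logistic PPCA (illustration only; X and B assumed given).
m  <- 2.94                                  # finite stand-in for +/- infinite log-odds
Theta_sat <- m * (2 * X - 1)                # saturated-model log-odds: +m for "1", -m for "0"
mu <- qlogis(colMeans(X))                   # offset fixed at empirical marginal log-odds
A  <- sweep(Theta_sat, 2, mu) %*% B         # scores: centered saturated log-odds projected on B
Theta_hat <- sweep(A %*% t(B), 2, mu, "+")  # fitted low dimensional log-odds
P_hat <- plogis(Theta_hat)                  # fitted probabilities of observing "1"
```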
Another generalization of PCA to binary data is nonlinear PCA with optimal scaling (the Gifi method). This method was primarily developed for categorical data, of which binary data is a special case [45, 36]. The basic idea is to quantify the binary variables to quantitative values by minimizing a loss function; the quantified variables are then used in a linear PCA model. The $j$th column of $X$, $X_{*j}$, is encoded into an $I \times 2$ indicator matrix $G_j$. $G_j$ has two columns, "1" and "0". If the $i$th object belongs to category "1", the corresponding element in the "1" column of $G_j$ is 1, otherwise it is 0. $A$ is the $I \times R$ object score matrix, which is the representation of $X$ in a low dimensional Euclidean space; $Q_j$ is a $2 \times R$ quantification matrix, which quantifies the $j$th binary variable to a quantitative value. For binary data, the rank of the quantification matrix $Q_j$ is constrained to 1; this is the PCA solution in the Gifi method. $Q_j$ can be expressed as $Q_j = z_j w_j^T$, where $z_j$ is a two dimensional column vector with binary quantifications and $w_j$ is the vector of weights for the $R$ principal components. The loss function is expressed as:
$$\min_{A, z_j, w_j} \sum_{j=1}^{J} \|A - G_j z_j w_j^T\|_2^2, \qquad (2.6)$$
in which the score matrix $A$ is forced to be centered and orthogonal, $\mathbf{1}^TA = \mathbf{0}$, $A^TA = I$, to avoid trivial solutions. The loss function is optimized by alternating least squares.

2.3 Model properties

Including the column offset term $\mu$ in component models also implies that the column mean of the score matrix is $\mathbf{0}$, i.e. $\mathbf{1}^TA = \mathbf{0}$; otherwise the model is unidentifiable. In PCA and the Gifi method, the estimated $\hat{\mu}$ equals the column mean of $X$. Therefore, including $\mu$ in the model has the same effect as column centering of $X$. In logistic PPCA and logistic PCA, the $j$th element of $\mu$, $\mu_j$, can be interpreted as the log-odds of the marginal probability of the $j$th variable. When only the offset $\mu$ is included in the model, $\Theta = \mathbf{1}\mu^T$, the $j$th element of the solution $\hat{\mu}$, $\hat{\mu}_j$, is the log-odds of the empirical marginal probability of the $j$th variable (the proportion of "1" in the $j$th column). When more components are included, $\Theta = \mathbf{1}\mu^T + AB^T$, the solution $\hat{\mu}$ is not unique. If an identical offset is required for comparing component models with different numbers of components, one can fix the offset term to the log-odds of the empirical marginal probabilities during the maximum likelihood estimation.

Similar to PCA, the orthogonality constraint $B^TB = I$ in logistic PPCA and logistic PCA is actually inactive. If $B$ is not orthogonal, it can be made orthogonal by subjecting $AB^T$ to an SVD algorithm: $B$ then equals the right singular vectors and $A$ equals the product of the left singular vectors and the diagonal matrix of singular values. This extra step does not change the objective value. Table 2.1 gives the orthogonality properties of the scores and loadings of the four methods discussed above.

Table 2.1: Orthogonality properties of the scores and loadings of the four methods. O: the columns of this matrix are orthonormal vectors, $B^TB = I$; D: the columns of this matrix are orthogonal vectors, $B^TB = D$, where $D$ is an $R \times R$ diagonal matrix.

                      PCA   Gifi   logistic PCA   logistic PPCA
  score matrix A       D     O          D               D
  loading matrix B     O     D          O               O
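The re-orthogonalization step mentioned above is a one-liner in practice. A minimal R sketch with arbitrary toy matrices (not the thesis code):

```r
# Make the loadings orthonormal without changing the fitted values A %*% t(B):
# take an SVD of A B^T and move the singular values into the scores.
set.seed(1)
A <- matrix(rnorm(20 * 2), 20, 2)          # toy scores (I x R)
B <- matrix(rnorm(10 * 2), 10, 2)          # toy, non-orthogonal loadings (J x R)
R <- ncol(B)
sv <- svd(A %*% t(B), nu = R, nv = R)
A_new <- sv$u %*% diag(sv$d[1:R])          # scores: orthogonal columns (D)
B_new <- sv$v                              # loadings: orthonormal columns (O)
max(abs(A %*% t(B) - A_new %*% t(B_new)))  # ~ 0: fitted values unchanged
```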
Linear PCA models are nested in the number of components, which means that the first $R$ principal components of the $R+1$ component model are exactly the same as those of the $R$ component model. For the Gifi method, this property only holds in the binary data case, but not in general. For logistic PPCA and logistic PCA, this property does not hold.

To make a fair comparison between linear PCA, the Gifi method, logistic PPCA and logistic PCA, the training error is defined as the average misclassification rate when using the derived low dimensional structure to fit the training set $X$. Each of the four methods provides an estimate of the offset term, score matrix and loading matrix, $\hat{\mu}$, $\hat{A}$ and $\hat{B}$. For linear PCA and the Gifi method, we take $\mathbf{1}\hat{\mu}^T + \hat{A}\hat{B}^T$ as an approximation of the binary matrix $X$; for logistic PCA and logistic PPCA, $\phi(\mathbf{1}\hat{\mu}^T + \hat{A}\hat{B}^T)$ is used as an approximation of the probability matrix $\Pi$ from which the observed matrix $X$ was generated. Since both approximations are continuous, we need to select a threshold to discretize them into binary fits.

In the discretization process, two misclassification errors exist: "0" can be misclassified as "1", which we call $err_{0 \rightarrow 1}$, and "1" can be misclassified as "0", which we call $err_{1 \rightarrow 0}$. Let $N_{0 \rightarrow 1}$ be the number of $err_{0 \rightarrow 1}$ errors, $N_{1 \rightarrow 0}$ the number of $err_{1 \rightarrow 0}$ errors, $N_0$ the number of "0" in the observed binary matrix $X$, and $N_1$ the number of "1" in $X$. A commonly used total error rate is given by $(N_{0 \rightarrow 1} + N_{1 \rightarrow 0})/(N_0 + N_1)$, which gives equal weight to these two errors. However, this can lead to undesirable results for imbalanced binary data, i.e. when the proportions of "1" and "0" are extreme. Imbalanced binary data sets are common in real applications, where sometimes the proportion of "1" in the observed matrix $X$ can be less than 5%. In such a case, the two types of errors are highly asymmetric, and hence it seems inappropriate to give them equal weights. In imbalanced cases, a balanced error rate $0.5 \times (N_{0 \rightarrow 1}/N_0 + N_{1 \rightarrow 0}/N_1)$ is more appropriate [46]. To decide whether a predicted quantitative value represents a "0" or a "1", a threshold value has to be selected. This threshold value can be selected by minimizing the balanced error rate on a training set, after which it can be applied to a test set in order to prevent biased (too optimistic) results.
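A small R sketch of the balanced error rate and the threshold selection just described (illustration only; the function and variable names are not from the thesis code, and the probability-scale grid below would need rescaling for the linear PCA and Gifi fits):

```r
# Balanced error rate: average of the two class-specific misclassification rates.
balanced_error <- function(x, x_hat) {
  0.5 * (mean(x_hat[x == 0] == 1) + mean(x_hat[x == 1] == 0))
}

# Pick the threshold that minimizes the balanced error rate of the training fit.
select_threshold <- function(x, fitted, grid = seq(0, 1, by = 0.01)) {
  errs <- sapply(grid, function(t) balanced_error(x, as.numeric(fitted > t)))
  grid[which.min(errs)]
}
```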
The training error is an overly optimistic estimator of the generalization error, which can be intuitively understood as the average misclassification rate in predicting an independent test set. Thus, we use cross validation (CV) to approximate the generalization error. In the CV procedure, the left out elements of $X$ are selected in a diagonal style rather than a row wise style, and the left out part is treated as missing. In this way the prediction of the left out part is independent of the left out part itself. It is possible to use this approach because all the component models in this paper can handle missing data. A 7-fold CV procedure was used for all calculations in this paper. In each of these folds, a component model is developed taking the missing data into account. The model is then used to make a prediction of the missing elements. This is repeated until all elements of $X$ have been predicted in this way. The threshold for converting the continuous predictions to binary predictions in CV was the same as the one used in computing the training error.

2.5 Data sets

The data we used are from the Genomic Determinants of Sensitivity in Cancer 1000 (GDSC1000) [30]. To facilitate the interpretability of the results, only three cancer types are included in the data analysis: BRCA (breast invasive carcinoma, 48 human cell lines), LUAD (lung adenocarcinoma, 62 human cell lines) and SKCM (skin cutaneous melanoma, 50 human cell lines). Each cell line is a sample in the data analysis. For these samples, three different binary data sets are available: mutation, copy number aberration (CNA) and methylation data. The mutation data contain 198 mutation variables. Each variable is a likely cancer driver or suppressor gene; a gene is labeled as "1" when it is classified as mutated in a sample and as "0" when classified as wild type. The mutation data are very imbalanced (supplemental Fig. S2.1 a): roughly 2% of the data matrix is labeled as "1". The CNA data have 410 observed CNA variables. Each variable is a copy number region in a chromosome; it is labeled as "1" for a specific sample when it is identified as aberrated, and as "0" otherwise. The CNA data set is also imbalanced (supplemental Fig. S2.1 b): roughly 7% of the data matrix is labeled as "1". The methylation data contain 38 methylation variables. Each variable is a CpG island located in a gene promotor region; "1" indicates a high level of methylation and "0" a low level. The methylation data set is relatively balanced compared to the other data sets (supplemental Fig. S2.1 c): roughly 27% of the data matrix is labeled as "1".
Binary data matrices with an underlying low dimensional structure can be simulated either from a latent variable model or as a noise corrupted version of a
structured binary data set. In the first case the data generating process is considered to provide a quantitative data set while there is a binary read-out. In the second case the data generating process is considered to provide a binary data set. We use both of these approaches to study the properties of the different binary PCA methods.
Simulated binary data based on the logistic PCA model
Data sets with different degrees of imbalance and with low dimensional structures were simulated according to the logistic PCA model. The offset term $\mu$ is used to control the degree of imbalance and the log-odds $\Theta$ is defined to have a low dimensional structure. The observed binary matrix $X$ is generated from the corresponding Bernoulli distributions.

Each element in the $J \times R$ loading matrix $B$ is sampled from the standard normal distribution. The Gram-Schmidt algorithm is used to force $B^TB = I_R$. $R$ is set to 3. The simulated $I \times R$ score matrix $A$ has a three group structure in the samples. The $I$ samples are divided into three groups of equal size. The three group means are set manually to force sufficient differences between the groups: the first two group means $a_1^*$ and $a_2^*$ are fixed vectors, and the third group mean is $a_3^* = -a_1^* - a_2^*$, so that the group means sum to zero. The scores in the first group are sampled from the multivariate normal distribution $N(a_1^*, I_R)$, the scores in the second group from $N(a_2^*, I_R)$ and the scores in the third group from $N(a_3^*, I_R)$. In this way, scores between groups are sufficiently different and scores within the same group are similar.

When the elements in $AB^T$ are close to 0, the corresponding probabilities are close to 0.5; in this case, the binary observations are almost a random guess. When their absolute values are large, the corresponding probabilities are close to 1 or 0, and the binary observations are almost deterministic. The scale of $AB^T$ should therefore lie in a reasonable interval, not too large and not too small. A constant $C$ is multiplied with $AB^T$ to control the scale and generate proper probabilities. In addition, the offset term $\mu$ is included to control the degree of imbalance in the simulated binary data set. After $\Theta = C \cdot AB^T + \mathbf{1}\mu^T$ is simulated as above, it is transformed to the probability matrix $\Pi$ by the logistic function $\phi()$, and $x_{ij}$ in $X$ is a realization of a Bernoulli distribution with parameter $p_{ij}$, the $ij$th element of the probability matrix $\Pi$.
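A compact R sketch of this generating scheme is given below. The group means a1 and a2, and the specific offset and constant, are illustrative choices, not the exact values used in the thesis.

```r
# Simulate balanced binary data with a rank-3 structure in the log-odds (illustration).
set.seed(123)
I <- 99; J <- 50; R <- 3; C <- 10
B  <- qr.Q(qr(matrix(rnorm(J * R), J, R)))              # orthonormal loadings (QR in place of Gram-Schmidt)
a1 <- c(2, -1, 1); a2 <- c(-1, 1, -1); a3 <- -a1 - a2   # made-up group means, summing to zero
A  <- rbind(a1, a2, a3)[rep(1:3, each = I / 3), ] + matrix(rnorm(I * R), I, R)
mu <- rep(0, J)                                         # zero offset: balanced binary data
Theta <- sweep(C * A %*% t(B), 2, mu, "+")              # low rank log-odds C*A*B^T + 1*mu^T
Pi <- plogis(Theta)                                     # probability matrix via the logistic function
X  <- matrix(rbinom(I * J, 1, Pi), I, J)                # observed binary matrix
```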
Simulated binary data based on noise corruption of pre-structured binary data

Another approach to simulating binary data is by noise corruption of a pre-structured data set. Compared to the latent variable model, this approach provides an intuitive understanding of the low dimensional structure in the observed binary data. The pre-structured binary data set $X_{\text{true}}$ has a structured and an unstructured part. For the structured part, $R$ different $I$ dimensional binary vectors are simulated first; each element is sampled from a Bernoulli distribution with probability $p$, which is the degree of imbalance in the binary data simulation. Each of these $R$ binary vectors is replicated 10 times to form the structured part. In the unstructured part of $X_{\text{true}}$, all the elements are randomly sampled from a Bernoulli distribution with probability $p$. The observed binary data $X$ is a noise corrupted version of the pre-structured binary data set $X_{\text{true}}$: if the noise level is set to 0.1, every element of $X_{\text{true}}$ has a probability of 0.1 of being bit-flipped. The observed binary matrix $X$ thus has $R$ groups of 10 highly correlated variables, while the other variables are not correlated. These $R$ groups are taken as the low dimensional structure. The above simulation process is illustrated in the supplemental Fig. S2.4.
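A short R sketch of this second simulation scheme (sizes and probabilities follow the description used for the feature importance study below; again an illustration rather than the thesis code):

```r
# Simulate a pre-structured binary matrix and corrupt it with bit-flip noise.
set.seed(123)
I <- 198; J <- 100; R <- 3
p <- 0.2           # degree of imbalance
noise <- 0.1       # bit-flip probability
proto <- matrix(rbinom(I * R, 1, p), I, R)             # R prototype binary vectors
structured   <- proto[, rep(1:R, each = 10)]           # each prototype replicated 10 times
unstructured <- matrix(rbinom(I * (J - 10 * R), 1, p), I, J - 10 * R)
X_true <- cbind(structured, unstructured)              # pre-structured binary data
flip <- matrix(rbinom(I * J, 1, noise), I, J)          # which elements get bit-flipped
X <- abs(X_true - flip)                                # observed, noise corrupted binary data
```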
All computations are done in R [49]. The linear PCA model is fitted using the SVD method after centering the data [50]. The Gifi method is fitted using the alternating least squares approach of the homals package [36]. The logistic PCA and logistic PPCA models are fitted using an MM algorithm with an offset term [43, 35]. The default stopping criterion is used for all approaches.

2.6 Results

The goal of this simulation is to evaluate the abilities of the four approaches to find back the embedded low dimensional structures in the sample space and the variable space. The simulation process is based on the logistic PCA model. The offset term $\mu$ is set to $\mathbf{0}$ to simulate balanced binary data. The parameters are set to $I = 99$, $J = 50$, $R = 3$, $C = 10$. The simulated balanced binary data are shown in supplemental Fig. S2.2. First, a classical PCA on the simulated probability matrix $\Pi$ and on the log-odds $\Theta$ was performed. Fig. 2.1 shows the score plots of these two PCA analyses. The difference between the score plots of linear PCA on $\Pi$ (Fig. 2.1 a) and on the log-odds $\Theta$ (Fig. 2.1 b) is obvious. The scores of the linear PCA model on $\Pi$ lie in the margin of the figure, while for $\Theta$ they lie more towards the center. This difference is related to the nonlinear logistic function $\phi()$, which transforms $\Theta$ to $\Pi$. Furthermore, PCA on the log-odds matrix describes more variation in the first two PCs.

Logistic PPCA, logistic PCA, Gifi and linear PCA are used to model the binary matrix $X$. Two principal components are used and offset terms are included in the models. The score plots produced by these different approaches are shown in Fig. 2.2.
Figure 2.1: Score plots of the first two principal components (PCs) derived from linear PCA on the probability matrix $\Pi$ (left) and on the log-odds matrix $\Theta$ (right) used in the binary data simulation. G1, G2 and G3 are the three simulated groups in the samples.

Figure 2.2: Score plots of the first two PCs produced by the four different approaches. a: logistic PPCA; b: logistic PCA; c: the Gifi method; d: PCA. G1, G2 and G3 are the three simulated groups in the samples.

The similarity between Fig. 2.1 and Fig. 2.2 indicates that the logistic PCA model approximates the underlying log-odds $\Theta$ of the binary observation $X$, while the other approaches approximate the probability matrix $\Pi$. Another observation is that the score plots derived from logistic PPCA (Fig. 2.2 a), Gifi (Fig. 2.2 c) and linear PCA (Fig. 2.2 d) are very similar except for some differences in scale. The similarity between the Gifi method and linear PCA for a balanced binary data set is understandable. For binary data, the Gifi method is equivalent to PCA on standardized binary variables. Since the proportions of "1" and "0" of each binary variable are similar in a balanced simulated data set, the column mean and standard deviation of each binary variable are close to 0.5. Thus the standardization of each binary variable essentially changes the 0 and 1 binary data to -1 and 1 data. Therefore, except for the difference in scale, Gifi and linear PCA are almost the same for balanced binary data. For logistic PPCA, the score matrix $A$ is a low dimensional representation of the log-odds $\tilde{\Theta}$ of the saturated model, $A = (\tilde{\Theta} - \mathbf{1}\mu^T)B$, where $\tilde{\Theta}$ is approximated by $2m(X - 0.5)$, i.e. its elements are $-m$ and $m$. Thus, the true difference between linear PCA and logistic PPCA is how the low dimensional space spanned by the loading matrix $B$ is found: logistic PPCA finds it by minimizing the logistic loss, while linear PCA finds it by minimizing the least squares loss.

The training error and CV error for the different models are shown in Fig. 2.3. We add the zero component model, in which only the offset term $\mu$ is included, as the baseline for evaluating the different methods with different numbers of components. The estimated offset $\hat{\mu}$ in the zero component model is the column mean of $X$ for PCA and the Gifi method, while it is the logit transform of the column mean of $X$ for logistic PPCA and logistic PCA. All approaches successfully find the three components truly underlying the data. It can also be observed that logistic PCA is more eager to overfit the data: it shows a lower balanced training error rate, but a higher CV error rate for more than three components, compared to the other methods.
Figure 2.3: The balanced training error (a) and CV error (b) for the balanced simulated data set produced by the four different approaches with different numbers of components. a: training error; b: CV error. log ppca: logistic PPCA; log pca: logistic PCA; gifi: the Gifi method; pca: linear PCA.

The goal of the imbalanced simulation is to evaluate the effect of imbalanced simulated data on the ability of the four approaches to find back the underlying low dimensional structures in the variable space. Since the offset $\mu$ in the logistic PCA model can be interpreted as the log-odds of the marginal probabilities, we can use the log-odds of the empirical marginal probabilities of the real data sets, which have different degrees of imbalance, as the offset in the simulation. The simulation process is based on the logistic PCA model. The offset term $\mu$ is set to the log-odds of the column means of the real data to simulate imbalanced binary data. The parameters $I$, $J$ are set to the size of the corresponding real data. The constant $C$ is selected as 20 and $R$ is set to 3. The simulated data are shown in supplemental Fig. S2.3. We evaluate the effect of the imbalanced binary data on the different models' abilities to find back the simulated low dimensional structure. The CV error plots of the four approaches are shown in Fig. 2.4.
Figure 2.4: CV error plots for simulated imbalanced data sets with different degrees of imbalance. a: similar to the mutation data; b: similar to the CNA data; c: similar to the methylation data.

For the assessment of feature importance, the binary data are simulated by noise corruption of a pre-structured binary data set. $I$ is 198, $J$ is 100 and $R$ is set to 3. The degree of imbalance is set to 0.2 and the noise level to 0.1. The simulated data are shown in supplemental Fig. S2.4; there are noise corrupted structures in the first 30 variables. For feature selection purposes we estimate the importance of each feature in the model: the average squared loading over the three PCs, $(b_{j,1}^2 + b_{j,2}^2 + b_{j,3}^2)/3$, where $b_{j,r}$ is the loading of the $j$th variable on the $r$th PC, is taken as the importance measure. The process is repeated 15 times, and the mean and standard deviation of the average squared loading of the 100 variables are shown in Fig. 2.5. It can be observed that the highly correlated binary variables have large loadings. The variance of the loadings derived from logistic PCA is much higher than for the other approaches. This indicates that the logistic PCA model cannot provide a stable estimation of the loadings.

The binary mutation, CNA and methylation data sets are analysed using the four different approaches. The score plots and error plots from the different approaches on the real mutation data set are shown in Fig. 2.6. The CV results of PCA, Gifi and logistic PCA in Fig. 2.6 f do not support the assumption that a low dimensional structure exists in the mutation data. For the CV result of logistic PPCA (Fig. 2.6 f), the minimum CV error was achieved using three components.
Figure 2.5: Barplots with one standard deviation error bars of the mean squared loadings from the linear PCA model (a), the Gifi method (b), logistic PPCA (c) and logistic PCA (d).
Figure 2.6: Score plots of the first two PCs and training and CV error plots of the four different approaches for the mutation data. a: score plot of logistic PPCA; b: score plot of logistic PCA; c: score plot of the Gifi method; d: score plot of PCA; e: training error plot; f: CV error plot. BRCA: breast invasive carcinoma; LUAD: lung adenocarcinoma; SKCM: skin cutaneous melanoma. The legend of the training error and CV error plots is the same as in Fig. 2.3.
Figure 2.7: Loading plot (a) and score plots of the first two PCs derived from the logistic PPCA model on the mutation data. The score plots (b, c, d) are labeled according to the mutation patterns. b: BRAF mutation labeled score plot; c: TP53 mutation labeled score plot; d: KRAS mutation labeled score plot. Red square: mutated; black dot: wild type. c1, c2, c3 and c4 are the four plausible clusters in the samples of the mutation data.
Figure 2.8: Score plots of the first two PCs and training and CV error plots of the four different approaches for the CNA data. a: score plot of logistic PPCA; b: score plot of logistic PCA; c: score plot of the Gifi method; d: score plot of PCA; e: training error plot; f: CV error plot. The legends for the score plots and training error plots are the same as in Fig. 2.4.
Figure 2.9: Score plots of the first two PCs and training and CV error plots of the four different approaches for the methylation data. a: score plot of logistic PPCA; b: score plot of logistic PCA; c: score plot of the Gifi method; d: score plot of PCA; e: training error plot; f: CV error plot. The legends for the score plots and training error plots are the same as in Fig. 2.4.

For the methylation data, the scores of the first two PCs of the logistic PPCA model perfectly represent the three cancer types (Fig. 2.9 a). The corresponding loading values also roughly fall into three cancer type specific clusters (Fig. 2.10), as most variables are exclusively non-zero in a single cancer type. Notable exceptions are GSTM1 and ARL17A, which are non-zero in two cancer types and hence each reside between two clusters, and the variables GSTT1 and DUSP22, which are non-zero in all three cancer types and hence reside towards the center of the plot.
Figure 2.10: Loading plot of the logistic PPCA model on methylation data. The gene names corresponding to the methylation variables that are interpreted in the text are labeled on the plot.
In this chapter, four methods were discussed that all aim to explore binary data using low dimensional scores and loadings. It was shown that each of the methods has different goals and therefore produces slightly different results. Linear PCA (without the standardization processing step) treats the binary data as quantitative data and tries to use low rank score and loading matrices to fit the quantitative data. For binary data, the quantification process in the Gifi method is simply a standardization of the binary variables. After quantification, the Gifi method tries to use low rank score and loading matrices to fit the quantified binary variables. Both logistic PPCA and logistic PCA assume that the binary data follows a Bernoulli distribution, and try to find an optimal estimate of the log-odds matrix, which lies in the low dimensional space. Logistic PCA tries to estimate the low dimensional log-odds matrix directly, while logistic PPCA estimates it indirectly as a linear projection of the (approximate) log-odds matrix of the saturated model. In logistic PCA, because both the score matrix $A$ and the loading matrix $B$ are free parameters in the optimization, the estimate of $AB^T$ will not hesitate to move to infinity to minimize the loss function. This manifests itself in logistic PCA being prone to overfitting (as can be seen from the CV results) and in the large variation in the loading estimates. The non-robustness problem is mitigated in the logistic PPCA model. Since only the loading matrix $B$ is freely estimated in logistic PPCA to find the optimal model, while the score matrix $A$ is fixed given the loadings, the logistic PPCA model is less prone to overfitting. Thus, the estimation of the loadings of the binary variables is more stable compared to logistic PCA. Furthermore, since the fitted values are the linear projection of the log-odds matrix of the saturated model, their scale is constrained by the scale of the approximate log-odds matrix of the saturated model, which can be specified in advance.

When a probabilistic generating process is assumed to underlie the binary data, we recommend the parametric logistic PPCA model. When such an assumption is not valid and the data are considered as given, the nonparametric Gifi model is recommended.
Acknowledgements
Y.S. gratefully acknowledges the financial support from China Scholarship Council (NO.201504910809). The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) / ERC synergy grant agreement n°.

Supplementary information
Figure S2.1: Heatmap of the real data sets. White color: "0"; black color: "1". a: mutation data; b: CNA data; c: methylation data. BRCA: breast invasive carcinoma; LUAD: lung adenocarcinoma; SKCM: skin cutaneous melanoma.
Figure S2.2: Heatmap of the simulated balanced data set with low dimensional structure. White color: "0"; black color: "1". G1, G2 and G3 are the three groups in the samples during simulation.

Figure S2.3: Heatmap of simulated imbalanced binary data with different degrees of imbalance. White color: "0"; black color: "1". a: similar to the mutation data; b: similar to the CNA data; c: similar to the methylation data. No group structures are present in the samples.
Figure S2.4: Heatmap of the pre-structured binary data (a) and the noise corrupted binary data (b). White color: "0"; black color: "1".

Chapter 3
Robust logistic PCA model
Although the projection based logistic PCA model works well in our previous data analysis, the idea of projection in the context of multivariate binary data is not straightforward. Furthermore, the results of the model depend on the specification of the parameter $m$, which is the approximation of infinity in the saturated model. In addition, this approach is difficult to generalize to the data fusion of multiple data sets with different data types. Therefore, in this chapter we focus on developing a robust version of the logistic PCA model without the involvement of projection.

The non-robustness of the standard logistic PCA model, which manifests itself as some estimated parameters diverging towards infinity, is caused by the exact low rank constraint, which is specified as the multiplication of low rank score and loading matrices. Therefore, we propose to fit a logistic PCA model through non-convex singular value thresholding to alleviate the non-robustness issue. An efficient MM algorithm is implemented to fit the model and a missing value based CV procedure is introduced for the model selection. In addition, we re-express the logistic PCA model based on the latent variable interpretation of the generalized linear model on binary data. The multivariate binary data set is assumed to be the sign observation of an unobserved quantitative data set, on which a low rank structure is assumed to exist. The latent variable interpretation of the logistic PCA model not only makes the assumption of a low rank structure easier to understand, but also provides us with a way to define the signal-to-noise ratio in multivariate binary data simulation. Our experiments on realistic simulations of imbalanced binary data with a low signal-to-noise ratio show that the CV error based model selection procedure is successful in selecting the proposed model, and that the selected model demonstrates superior performance in recovering the underlying low rank structure compared to models with a convex nuclear norm penalty or an exact low rank constraint. Finally, a binary CNA data set is used to illustrate the proposed methodology in practice.

Logistic PCA [34, 43] is motivated from the probabilistic interpretation of the classical PCA model with Gaussian distributed error. The extension of the classical PCA model to the logistic PCA model is similar to the extension of linear regression to logistic linear regression. In the classical PCA model, the low rank constraint is imposed on the conditional mean of the observed quantitative data set, while in the logistic PCA model, the low rank constraint is imposed on the logit transform of the conditional mean of the observed binary data. Therefore, the logistic PCA model can also be re-expressed in a similar way as the latent variable interpretation of the generalized linear models (GLMs) on binary data [13]. In logistic PCA, the observed binary data set can be assumed to be the sign observation of an unobserved quantitative data set, on which a low rank structure is assumed to exist. This intuitive latent variable interpretation not only facilitates the understanding of the low rank structure in the logistic PCA model, but also provides a way to define the signal-to-noise ratio (SNR) in the simulation of multivariate binary data.

However, the standard logistic PCA model with the exact low rank constraint, which is expressed as the multiplication of two low rank matrices, is prone to overfitting, leading to divergence of some estimated parameters towards infinity [43, 51].
The same overfitting problem also occurs in the logistic linear regression model. If the two classes of the outcome are linearly separable with respect to an explanatory variable, the corresponding coefficient of this variable tends to go to infinity [13]. A common trick is to add a ridge regression (quadratic) type penalty on the coefficient vector to alleviate the overfitting issue. If we apply the same trick to the logistic PCA model, the quadratic penalty on the loading matrix is equivalent to a quadratic penalty on the singular values of the matrix formed by the multiplication of the score and loading matrices; details will be shown later. Therefore, it is possible to derive a robust logistic PCA model via the regularization of the singular values. [52] proposed to use a nuclear norm penalty in the low rank matrix approximation framework for the binary matrix completion problem. The proposed method is similar to the logistic PCA model except that the column offset term is not included and the exact low rank constraint is replaced by its convex relaxation, the nuclear norm penalty. The nuclear norm penalty, which is equivalent to applying a lasso penalty on the singular values of a matrix, induces a low rank estimation and constrains the scale of the non-zero singular values simultaneously. However, a lasso type penalty shrinks all parameters to the same degree, which leads to biased estimates of the large singular values.

This chapter is based on Song, Y., Westerhuis, J.A. and Smilde, A.K., 2019. Logistic principal component analysis via non-convex singular value thresholding. arXiv preprint arXiv:1902.09486.

Latent variable interpretation of models on binary data
A univariate binary response variable $y$ is assumed to follow a Bernoulli distribution with parameter $\pi$, $y \sim \mathrm{Bernoulli}(\pi)$. $x$ is a multivariate explanatory variable, $x \in \mathbb{R}^J$. For the GLMs on binary data, we assume that a nonlinear transformation of the conditional mean of $y$ is a linear function of $x$, $h(\mathrm{E}(y|x)) = x^T\beta$, in which $h()$ is the link function, $\mathrm{E}(y|x)$ is the conditional mean, and $\beta$ is a $J$ dimensional coefficient vector. If the inverse function of $h()$ is $\phi()$, we have $\mathrm{E}(y|x) = \phi(x^T\beta)$. If the logit link is used, $\phi(\theta) = (1+\exp(-\theta))^{-1}$, which gives the logistic linear regression model, and $x^T\beta$ can be interpreted as the log-odds, which is the natural parameter of the Bernoulli distribution expressed in exponential family form. If the probit link is used, $\phi(\theta) = \Phi(\theta)$, in which $\Phi(\theta)$ is the cumulative density function (CDF) of the standard normal distribution, which gives the probit linear regression model.

The fact that the inverse link function $\phi()$ can be interpreted as the CDF of a specific probability distribution motivates the latent variable interpretation of the logistic or probit linear regression [13]. $y$ can be assumed to be the sign observation of a quantitative latent variable $y^*$, which has a linear relationship with the explanatory variable $x$. Taking the probit linear regression as an example, the latent variable interpretation can be expressed as

$$y^* = x^T\beta + \epsilon, \quad \epsilon \sim \mathrm{N}(0, 1), \quad y = \mathbb{1}(y^* > 0),$$

in which $y^*$ is the latent variable, $\epsilon$ is the error term, and $\mathbb{1}()$ is the indicator function. The probability of the observation $y = 1$ is $\Pr(y = 1|x^T\beta) = \Pr(y^* \geq 0) = \Phi(x^T\beta)$.
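To make the probit latent variable view concrete, the following minimal sketch simulates a single binary response by thresholding a Gaussian latent variable and checks that the empirical frequency of $y = 1$ matches $\Phi(x^T\beta)$. The numerical values of $x$ and $\beta$ are purely illustrative and not taken from the thesis.

```python
import numpy as np
from math import erf, sqrt

def std_normal_cdf(z):
    # CDF of the standard normal distribution, Phi(z).
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

rng = np.random.default_rng(0)
beta = np.array([0.8, -0.5, 0.3])      # illustrative coefficient vector
x = np.array([1.0, 2.0, -1.0])         # illustrative explanatory variable
eta = x @ beta                          # linear predictor x^T beta

# Latent variable view: y = 1(y* > 0) with y* = x^T beta + eps, eps ~ N(0, 1).
n = 200_000
y_star = eta + rng.standard_normal(n)
y = (y_star > 0).astype(int)

# The empirical frequency of y = 1 approximates Pr(y = 1) = Phi(x^T beta).
print(y.mean(), std_normal_cdf(eta))
```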
A similar latent variable interpretation can be applied to the logistic linear regression model by assuming that the error term $\epsilon$ follows the standard logistic distribution. The probability density function of the logistic distribution can be expressed as

$$p(\epsilon) = \frac{\exp\!\left(-\frac{\epsilon-\mu}{\sigma}\right)}{\sigma\left(1+\exp\!\left(-\frac{\epsilon-\mu}{\sigma}\right)\right)^2},$$

in which $\mu$ and $\sigma$ are the location and scale parameters. In the standard logistic distribution, $\mu = 0$ and $\sigma = 1$. The inverse logit function $\phi()$ is the CDF of the standard logistic distribution. The assumption $\mu = 0$ for $\epsilon$ is reasonable since we want to use the linear function $x^T\beta$ to capture the conditional mean of $y^*$. The assumption $\sigma = 1$ for $\epsilon$ seems restrictive; however, scaling the estimated $\hat{\beta}$ by a positive constant as $\hat{\beta}/\sigma$ will not change the conclusions of the model. Since the assumption of logistic distributed noise is not very straightforward, the latent variable interpretation of the logistic linear regression model is not widely used.

The above latent variable interpretation of the GLMs on binary data is naturally connected to the generating process of binary data [32]. Binary data can be discrete in nature, for example when females and males are classified as "1" and "0". Another possibility is that there is a continuous process underlying the binary observation. For example, in a toxicology study the binary outcome of a subject being dead or alive relates to the dosage of the toxin used and the subject's tolerance level. The tolerance varies for different subjects, and the status (dead or alive) of a specific subject depends on whether its tolerance is higher than the used dosage or not. Thus, a continuous tolerance level underlies the binary outcome [13]. If we assume our binary data set is generated from a continuous process, it is natural to use the latent variable interpretation of the probit link.

The measurement of $J$ binary variables on $I$ samples results in a binary matrix $X$ ($I \times J$), whose $ij$th element $x_{ij}$ equals "1" or "0". The logistic PCA model on $X$ can be interpreted as follows. Conditional on the low rank structure assumption, which is used to capture the correlations observed in $X$, the elements in $X$ are independent realizations of Bernoulli distributions, whose parameters are the corresponding elements of a probability matrix $\Pi$ ($I \times J$), $\mathrm{E}(X|\Pi) = \Pi$. Denoting the natural parameter matrix, which is the logit transform of the probability matrix $\Pi$, by $\Theta$ ($I \times J$), we have $h(\Pi) = \Theta$ and $\Pi = \phi(\Theta)$, in which $h()$ and $\phi()$ are the element-wise logit and inverse logit functions. The low rank structure is imposed on $\Theta$ in the same way as in a classical PCA model, $\Theta = \mathbf{1}\mu^T + AB^T$,
in which $\mu$ ($J \times 1$) is the $J$ dimensional column offset term and can be interpreted as the logit transform of the marginal probabilities of the binary variables, $A$ ($I \times R$) and $B$ ($J \times R$) are the corresponding low rank score and loading matrices, and $R$, $R \ll \min(I, J)$, is the low rank. Therefore, for the logistic PCA model, we have $\mathrm{E}(X|\Theta) = \phi(\Theta) = \phi(\mathbf{1}\mu^T + AB^T)$. On the other hand, in a classical PCA model we have $\mathrm{E}(X|\Theta) = \Theta = \mathbf{1}\mu^T + AB^T$, which is equivalent to using the identity link function. Furthermore, unlike in the classical PCA model, the column offset $\mu$ has to be included in the logistic PCA model to perform a model based column centering. The reason is that the commonly used column centering processing step cannot be applied to the binary data set, as the column centered binary data is not binary anymore.

The logistic PCA model can be re-expressed in the same way as the latent variable interpretation of the GLMs on binary data. Our binary observation $X$ is assumed to be the sign observation of an underlying quantitative data set $X^*$ ($I \times J$); for the $ij$th element, we have $x_{ij} = 1$ if $x^*_{ij} \geq 0$ and $x_{ij} = 0$ otherwise. The low rank structure is imposed on the latent data set $X^*$ as $X^* = \Theta + E$, in which $E$ ($I \times J$) is the error term and its elements follow a standard logistic distribution. The latent variable interpretation of the logistic PCA model can be expressed as

$$X^* = \Theta + E, \quad \epsilon_{ij} \sim \mathrm{Logistic}(0, 1), \quad i = 1 \cdots I, \; j = 1 \cdots J,$$
$$x_{ij} = \mathbb{1}(x^*_{ij} > 0), \quad i = 1 \cdots I, \; j = 1 \cdots J.$$
Similar to the latent variable interpretation of the logistic linear model, the assumption $\epsilon_{ij} \sim \mathrm{Logistic}(0, 1)$ is not restrictive, since scaling the estimated $\hat{\Theta}$ by a positive constant $\sigma$ will not change the conclusions from the model. When a standard normal distributed error is used in the above derivation, we get the probit PCA model. The latent variable interpretation of the logistic or probit PCA not only facilitates our understanding of the low rank structure underlying multivariate binary data, but also provides a way to define the SNR in multivariate binary data simulation.

Assume the column centered $\Theta$ is $Z$, $Z = \Theta - \mathbf{1}\mu^T = AB^T$. In the standard logistic PCA model, the exact low rank constraint is imposed on $Z$ as the multiplication of two rank $R$ matrices $A$ and $B$. The negative log likelihood of fitting the observed $X$ conditional on the low rank structure assumption on $\Theta$ is used as the loss function. We also introduce a weight matrix $W$ ($I \times J$) to handle potential missing values in $X$. The $ij$th element of $W$, $w_{ij}$, equals 0 when the corresponding element in $X$ is missing, and 1 otherwise. The optimization problem of the standard logistic PCA model can be expressed as

$$\begin{aligned}
\min_{\mu, Z}\;\; & -\log p(X|\Theta, W) = -\log \prod_{i}^{I}\prod_{j}^{J} \left(p(x_{ij}|\theta_{ij})\right)^{w_{ij}} \\
& = -\sum_{i}^{I}\sum_{j}^{J} w_{ij}\left[x_{ij}\log(\phi(\theta_{ij})) + (1-x_{ij})\log(1-\phi(\theta_{ij}))\right] \\
\text{subject to}\;\; & \Theta = \mathbf{1}\mu^T + Z, \quad \mathrm{rank}(Z) = R, \quad \mathbf{1}^T Z = \mathbf{0}, \qquad (3.1)
\end{aligned}$$

in which the constraint $\mathbf{1}^T Z = \mathbf{0}$ is imposed to make $\mu$ identifiable. Unfortunately, the classical logistic PCA model tends to overfit the observed binary data. In order to decrease the loss function in equation 3.1, $\theta_{ij}$ tends to approach positive infinity when $x_{ij} = 1$, and negative infinity when $x_{ij} = 0$. This overfitting problem will be explored in more detail below. In logistic linear regression, this overfitting problem can be solved by adding a quadratic penalty on the coefficient vector to regularize the estimated parameters. A similar idea can be applied to the logistic PCA model by taking it as a regression type problem. The columns of the score matrix $A$ are taken as the unobserved explanatory variables, while the loading matrix $B$ contains the coefficients. If we decompose $Z$ into an $R$ truncated SVD as $Z = UDV^T$, then $A = U$ and $B = VD$. It is easy to show that the quadratic penalty $||B||_F^2 = \sum_r \sigma_r^2$, in which $\sigma_r$ is the $r$th singular value of $Z$. Therefore, it is possible to derive a robust logistic PCA model by thresholding the singular values of $Z$.

The most commonly used penalty function for thresholding singular values is the nuclear norm penalty, and it has been used in solving many low rank approximation problems [55, 56, 52, 51]. If the SVD decomposition of a matrix $Z$ is $Z = UDV^T$, the nuclear norm penalty can be expressed as $\sum_r \sigma_r$, in which $\sigma_r$ is the $r$th singular value. The nuclear norm penalty is the convex relaxation of the exact low rank constraint and can be regarded as applying a lasso penalty on the singular values of a matrix. Therefore, the nuclear norm penalty has the same problem as the lasso penalty: it shrinks all singular values to the same degree. This leads to a biased estimation of the large singular values. This behavior will further make the prediction error or CV error based model selection procedure inconsistent [26]. As an alternative, non-convex penalties can shrink the parameters in a nonlinear manner to achieve both nearly unbiased and sparse parameter estimation [16, 27].
Therefore, we propose to replace the exact low rank constraint in the logistic PCA model by a concave penalty on the singular values of $Z$ to achieve a low rank estimation and to alleviate the overfitting issue. We include the frequentist version of the generalized double Pareto (GDP) shrinkage [27], the smoothly clipped absolute deviation (SCAD) penalty [16] and the $L_q$ ($0 < q < 1$) penalty; see Table 3.1.

The negative log-likelihood $f(\Theta) = -\log p(X|\Theta, W)$ can be majorized to a quadratic function of $\Theta$ by exploiting the upper bound of the second order gradient of $f(\Theta)$. Suppose $f_{ij}(\theta_{ij}) = -[x_{ij}\log(\phi(\theta_{ij})) + (1-x_{ij})\log(1-\phi(\theta_{ij}))]$, in which $x_{ij}$ and $\theta_{ij}$ are the $ij$th elements of $X$ and $\Theta$; then $f(\Theta) = \sum_i^I\sum_j^J w_{ij} f_{ij}(\theta_{ij})$. When the logit link is used, the following results are easily derived: $\nabla f_{ij}(\theta_{ij}) = \phi(\theta_{ij}) - x_{ij}$ and $\nabla^2 f_{ij}(\theta_{ij}) = \phi(\theta_{ij})(1-\phi(\theta_{ij}))$. Assume that $\nabla^2 f_{ij}(\theta_{ij})$ is upper bounded by a constant $L$; since $\nabla^2 f_{ij}(\theta_{ij}) \leq 0.25$, we can take $L = 0.25$. Take $f(\theta)$ as the general representation of $f_{ij}(\theta_{ij})$. According to Taylor's theorem and the assumption that $\nabla^2 f(\theta) \leq L$ for $\theta \in \mathrm{domain}\, f$, we have the following inequality:

$$\begin{aligned}
f(\theta) &= f(\theta^k) + \langle \nabla f(\theta^k), \theta - \theta^k\rangle + \tfrac{1}{2}(\theta-\theta^k)^T \nabla^2 f(\theta^k + t(\theta-\theta^k))(\theta-\theta^k) \\
&\leq f(\theta^k) + \langle \nabla f(\theta^k), \theta - \theta^k\rangle + \tfrac{L}{2}(\theta-\theta^k)^2 \\
&= \tfrac{L}{2}\left(\theta - \theta^k + \tfrac{1}{L}\nabla f(\theta^k)\right)^2 + c, \qquad (3.3)
\end{aligned}$$

where $\theta^k$ is the $k$th approximation of $\theta$, $t \in [0, 1]$ is an unknown constant, and $c$ is an unknown constant that does not depend on $\Theta$. Therefore, we have the following inequality for $f_{ij}(\theta_{ij})$: $f_{ij}(\theta_{ij}) \leq \tfrac{L}{2}(\theta_{ij} - \theta_{ij}^k + \tfrac{1}{L}\nabla f_{ij}(\theta_{ij}^k))^2 + c$. Let $\nabla f(\Theta^k)$ be the matrix form of $\nabla f_{ij}(\theta_{ij}^k)$, $\nabla f(\Theta^k) = \phi(\Theta^k) - X$. The inequality for $f(\Theta)$ can then be derived as $f(\Theta) \leq \tfrac{L}{2}\sum_i^I\sum_j^J w_{ij}(\theta_{ij} - \theta_{ij}^k + \tfrac{1}{L}\nabla f_{ij}(\theta_{ij}^k))^2 + c = \tfrac{L}{2}||W \odot (\Theta - \Theta^k + \tfrac{1}{L}\nabla f(\Theta^k))||_F^2 + c$, in which $\odot$ indicates element-wise matrix multiplication. Following [60], we further majorize the weighted least squares upper bound into a quadratic function of $\Theta$ as

$$\tfrac{L}{2}||W \odot (\Theta - \Theta^k + \tfrac{1}{L}\nabla f(\Theta^k))||_F^2 \leq \tfrac{L}{2}||\Theta - H^k||_F^2 + c, \qquad H^k = \Theta^k - \tfrac{1}{L}\left(W \odot \nabla f(\Theta^k)\right). \qquad (3.4)$$

Suppose $\sigma_r$ is the $r$th singular value of $Z$, and $g()$ is a concave function. From the definition of concavity [61], we have $g(\sigma_r) \leq g(\sigma_r^k) + \omega_r^k(\sigma_r - \sigma_r^k) = \omega_r^k \sigma_r + c$, in which $\sigma_r^k$ is the $r$th singular value of the $k$th approximation $Z^k$ and $c$ is an unknown constant. Also, $\omega_r^k \in \partial g(\sigma_r^k)$, where $\partial g(\sigma_r^k)$ is the set of supergradients of the function $g()$ at $\sigma_r^k$. For all the concave penalties in Table 3.1 the supergradient is unique, thus $\omega_r^k = \partial g(\sigma_r^k)$. Therefore, $g(Z) = \sum_r g(\sigma_r(Z))$ can be majorized as follows:

$$g(Z) = \sum_r g(\sigma_r) \leq \sum_r \omega_r^k \sigma_r + c, \qquad \omega_r^k = \partial g(\sigma_r^k). \qquad (3.5)$$

Summarizing the above two majorization steps, we have the following majorized problem during the $k$th iteration:

$$\begin{aligned}
\min_{\mu, Z}\;\; & \tfrac{L}{2}||\Theta - H^k||_F^2 + \lambda\sum_r \omega_r^k\sigma_r \\
\text{subject to}\;\; & \Theta = \mathbf{1}\mu^T + Z, \quad \mathbf{1}^T Z = \mathbf{0}, \\
& H^k = \Theta^k - \tfrac{1}{L}\left(W \odot \nabla f(\Theta^k)\right), \quad \omega_r^k = \partial g(\sigma_r^k). \qquad (3.6)
\end{aligned}$$

This majorized problem during the $k$th iteration can be solved by a block coordinate descent algorithm.
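Since Table 3.1 is not reproduced in this excerpt, the small sketch below only illustrates how the supergradient weights $\omega_r^k = \partial g(\sigma_r^k)$ could look for commonly cited forms of the three concave penalties; the exact parameterizations used in the thesis may differ, and the functions `gdp_supergradient`, `lq_supergradient` and `scad_supergradient` are illustrative names. Large singular values receive small (or zero) weights and are therefore shrunken little, which is the key difference from the nuclear norm penalty.

```python
import numpy as np

def gdp_supergradient(sigma, gamma=1.0):
    # Assumed GDP form g(s) = log(1 + s / gamma); supergradient 1 / (gamma + s).
    return 1.0 / (gamma + sigma)

def lq_supergradient(sigma, q=0.5, eps=1e-10):
    # Assumed L_q form g(s) = s^q with 0 < q < 1; eps avoids division by zero.
    return q * np.maximum(sigma, eps) ** (q - 1.0)

def scad_supergradient(sigma, lam, gamma=3.7):
    # Common SCAD derivative: lam for small values, linearly decaying in between,
    # and exactly zero beyond gamma * lam (large singular values not shrunken).
    return np.where(sigma <= lam, lam,
                    np.maximum(gamma * lam - sigma, 0.0) / (gamma - 1.0))

sigmas = np.array([0.1, 1.0, 5.0, 20.0])
print(gdp_supergradient(sigmas))
print(lq_supergradient(sigmas))
print(scad_supergradient(sigmas, lam=1.0))
```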
Updating $\mu$. When fixing $Z$ in equation 3.6, the analytical solution of $\mu$ is the column mean of $H^k$, $\mu = \frac{1}{I}(H^k)^T\mathbf{1}$.

Updating $Z$. After deflating the offset term $\mu$ in equation 3.6, the optimization problem for $Z$ becomes $\min_{Z} \tfrac{L}{2}||Z - JH^k||_F^2 + \lambda\sum_r \omega_r^k\sigma_r$, in which $J$ is the column centering operator $J = I - \frac{1}{I}\mathbf{1}\mathbf{1}^T$. This optimization problem is equivalent to finding the proximal operator of the weighted sum of singular values, for which an analytical global solution exists [62]. If the SVD decomposition of $JH^k$ is $JH^k = UDV^T$, the analytical solution for $Z$ is $Z = UD_zV^T$, in which $D_z = \mathrm{Diag}\{\max(0, d_r - \frac{\lambda\omega_r^k}{L})\}$ and $d_r$ is the $r$th element of the diagonal of $D$.

Initialization. The initial $Z$ and $\mu$ can be set according to values supplied by the user, or by the following random initialization strategy: all elements in $Z$ are sampled from the standard uniform distribution and $\mu$ is set to $\mathbf{0}$. In the following algorithm, $f^k$ indicates the objective value in equation 3.2 during the $k$th iteration, and the relative change of the objective value is used as the stopping criterion; $\epsilon_f$ indicates the tolerance for the relative change of the objective value. Pseudocode of the algorithm described above is shown in Algorithm 1.

Algorithm 1: An MM algorithm to fit the logistic PCA model via non-convex singular value thresholding.
Input: $X$, $\lambda$, $\gamma$. Output: $\mu$, $A$, $B$.
1. $k = 0$; compute $W$ for missing values; initialize $\mu$, $Z$.
2. While $(f^{k-1} - f^k)/f^{k-1} > \epsilon_f$:
   - $\Theta^k = \mathbf{1}(\mu^k)^T + Z^k$;
   - $\nabla f(\Theta^k) = \phi(\Theta^k) - X$;
   - $H^k = \Theta^k - \tfrac{1}{L}(W \odot \nabla f(\Theta^k))$;
   - $\omega_r^k = \partial g(\sigma_r^k)$;
   - $\mu^{k+1} = \tfrac{1}{I}(H^k)^T\mathbf{1}$;
   - compute the SVD $UDV^T = JH^k$;
   - $D_z = \mathrm{Diag}\{\max(0, d_r - \tfrac{\lambda\omega_r^k}{L})\}$; $Z^{k+1} = UD_zV^T$;
   - $\Theta^{k+1} = \mathbf{1}(\mu^{k+1})^T + Z^{k+1}$; $k = k + 1$.
3. $A = U$; $B = VD_z$.
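As an illustration of Algorithm 1, a minimal NumPy sketch is given below. It is not the reference implementation of the thesis: the function name `fit_logistic_pca_gdp` is illustrative, the GDP supergradient form $1/(\gamma+\sigma)$ is an assumption (Table 3.1 is not reproduced here), the logit link is used, and $L$ is fixed at 0.25.

```python
import numpy as np

def fit_logistic_pca_gdp(X, lam, gamma=1.0, max_iter=500, tol=1e-6):
    """Sketch of Algorithm 1: MM for logistic PCA with a GDP singular value penalty.

    X is a binary (I x J) matrix with np.nan marking missing cells.
    Returns offset mu, scores A and loadings B with Theta ~ 1 mu^T + A B^T.
    """
    I, J = X.shape
    W = (~np.isnan(X)).astype(float)            # weight matrix: 0 for missing cells
    Xw = np.nan_to_num(X)                       # missing entries contribute nothing
    L = 0.25                                    # bound on the second derivative (logit link)

    mu = np.zeros(J)
    Z = np.random.uniform(size=(I, J))          # random initialization
    Z -= Z.mean(axis=0, keepdims=True)          # enforce 1^T Z = 0

    def objective(mu, Z):
        Theta = mu + Z
        P = 1.0 / (1.0 + np.exp(-Theta))
        loglik = W * (Xw * np.log(P + 1e-12) + (1 - Xw) * np.log(1 - P + 1e-12))
        sig = np.linalg.svd(Z, compute_uv=False)
        return -loglik.sum() + lam * np.log(1.0 + sig / gamma).sum()

    f_old = objective(mu, Z)
    U, d_z, Vt = None, None, None
    for _ in range(max_iter):
        Theta = mu + Z
        grad = 1.0 / (1.0 + np.exp(-Theta)) - Xw     # elementwise phi(Theta) - X
        H = Theta - (W * grad) / L                   # majorization target H^k
        mu = H.mean(axis=0)                          # update offset: column means of H^k
        JH = H - mu                                  # column centered H^k
        U, d, Vt = np.linalg.svd(JH, full_matrices=False)
        omega = 1.0 / (gamma + d)                    # assumed GDP supergradient weights
        d_z = np.maximum(0.0, d - lam * omega / L)   # weighted singular value thresholding
        Z = (U * d_z) @ Vt
        f_new = objective(mu, Z)
        if (f_old - f_new) / abs(f_old) < tol:
            break
        f_old = f_new

    A, B = U, Vt.T * d_z                             # scores and loadings
    return mu, A, B
```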
The CNA data set from Chapter 1 is used as an example of a real data set to show the results. The characterization of the CNA data set is shown in supplemental Fig. S1 b.

Multivariate binary data $X$ is simulated according to the logistic PCA model in a similar way as in Chapter 2, except that the SNR is defined and there are no group structures in the sample space. Based on the latent variable interpretation of the logistic PCA model, we can define the SNR as $\mathrm{SNR} = ||Z||_F^2 / ||E||_F^2$, in which $E$ is the error term whose elements are sampled from the standard logistic distribution. The column offset term $\mu$ can be set in the same way as in Chapter 1. However, the rank $R$ matrix $Z = AB^T$ is simulated in a slightly different way. We first express $Z$ in SVD form as $Z = UDV^T$, in which $U^TU = I_R$, $V^TV = I_R$ and the diagonal of $D$ contains the singular values. Elements in $U$ and $V$ are first sampled from the standard normal distribution; $U$ is then deflated to have $\mathbf{1}^TU = \mathbf{0}$, and the SVD is used to force $U$ to be orthogonal. Also, $V$ is forced to orthogonality by the Gram-Schmidt algorithm. Then the diagonal matrix $D_{\mathrm{pre}}$, whose $R$ diagonal elements are the sorted absolute values of samples from a normal distribution with mean 1, is scaled to $D$ as $D = cD_{\mathrm{pre}}$, in which $c$ is a constant used to adjust the SNR in the simulation of the multivariate binary data. Then $\Theta = \mathbf{1}\mu^T + Z$ according to the logistic PCA model and $X^* = \Theta + E$ according to the latent variable interpretation. The probability matrix $\Pi$ is generated as $\Pi = \phi(\Theta)$, in which $\phi()$ is the inverse logit link function. Finally, the multivariate binary data set $X$ is generated from Bernoulli distributions with the corresponding parameters in $\Pi$.

In this chapter we focus on evaluating the model's performance in estimating the simulated parameters $\Theta$, $\Pi$, $\mu$ and $Z$. Also, the CV error is defined based on the log likelihood of fitting the binary data rather than on misclassification of the binary data.

After a logistic PCA model is constructed on the simulated binary data, we have the estimated parameters $\hat{\mu}$, $\hat{A}$ and $\hat{B}$, from which $\hat{\Theta} = \mathbf{1}\hat{\mu}^T + \hat{A}\hat{B}^T$ and $\hat{\Pi} = \phi(\hat{\Theta})$ can also be computed. The model's ability to recover the true $\Theta$ can be evaluated by the relative mean squared error (RMSE), which is defined as $\mathrm{RMSE}(\Theta) = ||\Theta - \hat{\Theta}||_F^2 / ||\Theta||_F^2$, where $\Theta$ is the true parameter. The RMSEs in estimating $\mu$ and $Z$ are defined in the same way. In some cases the mean Hellinger distance (MHD) is used to quantify the similarity between the true probability matrix $\Pi$ and the estimated $\hat{\Pi}$. The Hellinger distance [63] is a symmetric measure to quantify the similarity between two probability distributions. Assuming the parameter of a Bernoulli distribution is $\pi$ and its estimate is $\hat{\pi}$, the Hellinger distance is defined as $\mathrm{HD}(\pi, \hat{\pi}) = \frac{1}{\sqrt{2}}\sqrt{(\sqrt{\pi} - \sqrt{\hat{\pi}})^2 + (\sqrt{1-\pi} - \sqrt{1-\hat{\pi}})^2}$. The mean Hellinger distance between $\Pi$ and its estimate $\hat{\Pi}$ is defined as $\mathrm{MHD}(\Pi) = \frac{1}{I\times J}\sum_{i,j}^{I,J}\mathrm{HD}(\pi_{ij}, \hat{\pi}_{ij})$.

For the model selection on real data, a missing value based cross validation (CV) procedure is proposed. The CV error is computed as follows. First, the elements of $X$ are split into a training and a test set: 10% of the "1"s and "0"s of $X$ are randomly selected as the test set $X_{\mathrm{test}}$ and set to missing, and the resulting data set is taken as $X_{\mathrm{train}}$. After obtaining an estimate $\hat{\Theta}$ from a logistic PCA model on $X_{\mathrm{train}}$, we index the elements corresponding to the test set $X_{\mathrm{test}}$ as $\hat{\Theta}_{\mathrm{test}}$. The CV error is then defined as the negative log-likelihood of using $\hat{\Theta}_{\mathrm{test}}$ to fit $X_{\mathrm{test}}$.

There are two tuning parameters, $\gamma$ and $\lambda$, in the model selection of the logistic PCA model with a concave penalty. However, the performance of the model is rather insensitive to the selection of $\gamma$ for some concave penalties, as will be shown below. After fixing the value of $\gamma$, we can use a grid search to select a proper value of $\lambda$ based on the minimum CV error. First, a sequence of $\lambda$ values is selected from a proper searching range, after which logistic PCA models are fitted with the selected $\lambda$ values on the training set $X_{\mathrm{train}}$. A warm start strategy, using the results of the previous model as the initialization of the next model, is used to accelerate the model selection process. The model with the minimum CV error is selected and then re-fitted on the full data set $X$. Because the proposed model is non-convex and its result is sensitive to the initialization, it is recommended to use the results derived from the selected model as the initialization of the model fitted on the full data set.
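To make the evaluation metrics and the missing value based CV error concrete, the following sketch implements them in NumPy. The helper names `rmse`, `mean_hellinger`, `cv_split` and `cv_error` are illustrative, and the sketch assumes the squared Frobenius norm ratio for the RMSE and the logit link for the CV error, consistent with the definitions above.

```python
import numpy as np

def rmse(param_true, param_hat):
    # Relative mean squared error: ||T - T_hat||_F^2 / ||T||_F^2.
    return np.sum((param_true - param_hat) ** 2) / np.sum(param_true ** 2)

def mean_hellinger(Pi_true, Pi_hat):
    # Mean Hellinger distance between two Bernoulli probability matrices.
    hd = np.sqrt((np.sqrt(Pi_true) - np.sqrt(Pi_hat)) ** 2 +
                 (np.sqrt(1 - Pi_true) - np.sqrt(1 - Pi_hat)) ** 2) / np.sqrt(2)
    return hd.mean()

def cv_split(X, frac=0.10, seed=None):
    # Hold out frac of the observed "1"s and "0"s by setting them to NaN.
    rng = np.random.default_rng(seed)
    X_train = X.astype(float).copy()
    test_idx = []
    for value in (0, 1):
        cells = np.argwhere(X == value)
        picked = cells[rng.choice(len(cells), size=int(frac * len(cells)), replace=False)]
        test_idx.append(picked)
        X_train[picked[:, 0], picked[:, 1]] = np.nan
    return X_train, np.vstack(test_idx)

def cv_error(X, Theta_hat, test_idx):
    # Negative log likelihood of the held out cells under the fitted log-odds.
    i, j = test_idx[:, 0], test_idx[:, 1]
    p = 1.0 / (1.0 + np.exp(-Theta_hat[i, j]))
    x = X[i, j]
    return -np.sum(x * np.log(p + 1e-12) + (1 - x) * np.log(1 - p + 1e-12))
```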
In this first section we use the CNA data as an example. The algorithm from [43] is implemented to fit the standard logistic PCA model. The constraints $A^TA = I$ and $\mathbf{1}^TA = \mathbf{0}$ are imposed. Two different standard logistic PCA models are constructed on the CNA data, both with three components. The first model is obtained with low precision (a relatively large tolerance $\epsilon_f$ for the stopping criterion), while for the other model a high precision (a much smaller $\epsilon_f$) is used. The initialization was the same for these two models. The low precision model converged already after 220 iterations, while the high precision model did not converge even after 50000 iterations. The difference between the final objective values of these two models is not large, approximately $8\times 10^3$ and $7\times 10^3$ respectively. However, as shown in Fig. 3.2, the scale of the loading plots derived from these two models is very different. When a high precision stopping criterion is used, some of the elements of the estimated loading matrix from the standard logistic PCA model tend to become very large.

Figure 3.2: The loading plots of the first two components derived from the low precision (left) and high precision (right) standard logistic PCA models.

Here we use the logistic PCA model with a concave GDP penalty as an example to show the CV error based model selection procedure. The data set is simulated as follows. The offset term $\mu$ is set to the logit transform of the empirical marginal probabilities of the CNA data to simulate an imbalanced binary data set. The other parameters used in the simulation are $I = 160$, $J = 410$, $\mathrm{SNR} = 1$ and $R = 5$. First we show the model selection procedure for $\lambda$ while the hyper-parameter $\gamma$ is fixed to $\gamma = 1$. After splitting the simulated binary data set $X$ into the training set $X_{\mathrm{train}}$ and the test set $X_{\mathrm{test}}$, 30 $\lambda$ values are selected from a proper searching range. For each $\lambda$ value, a logistic PCA model with a GDP penalty (tolerance $\epsilon_f$, maximum of 500 iterations) is constructed on $X_{\mathrm{train}}$, and for each model we evaluate its performance in estimating the simulated parameters. As shown in the model selection results (Fig. 3.3), the selected model with minimum CV error also achieves approximately optimal RMSEs in estimating the simulated $\Theta$, $Z$ and $\mu$. However, the rank of the estimated $Z$ from the selected model is 3, which is different from the simulated rank $R = 5$; the reason will be discussed later. The selected model is re-fitted on the full data set $X$, and the RMSEs of estimating $\Theta$, $Z$ and $\mu$ are 0.0797, 0.2064 and 0.0421 respectively.

Next, we show the model selection process for both $\gamma$ and $\lambda$. The simulated data $X$ is split into $X_{\mathrm{train}}$ and $X_{\mathrm{test}}$ in the same way as described above. 30 $\gamma$ values are selected from a wide range, equidistant in log-space. For each $\gamma$, 30 values of $\lambda$ are selected from a proper searching range, which is determined by an automatic procedure. For each value of $\gamma$, the model selection of $\lambda$ is done on $X_{\mathrm{train}}$ in the same way as described above, after which the selected model is re-fitted on the full data $X$. Therefore, for each value of $\gamma$ we have a selected model that is optimal with respect to the CV error. As shown in Fig. 3.4 (left), the difference between the RMSEs derived from these selected models is very small. This can have two causes: the model is insensitive to the selection of $\gamma$, or the CV error based model selection procedure is not successful in selecting $\gamma$. To clarify which is the correct reason, we also fit $30 \times 30$ models on the full data $X$ in the same way as in the above experiment. For each value of $\gamma$, the model with minimum RMSE($\Theta$) is selected. As shown in Fig. 3.4 (right), the value of $\gamma$ does not have a large effect on the RMSEs of the selected models that are optimal with respect to RMSE($\Theta$). Therefore, it can be concluded that the performance of the model is insensitive to the model selection of $\gamma$, and the strategy can be to set a default value for $\gamma$ and focus on the selection of $\lambda$.

Figure 3.3: Model selection and performance of the logistic PCA model with a GDP penalty.
The CV error, the RMSEs of estimating $\Theta$, $Z$ and $\mu$, and the estimated rank as a function of $\lambda$. The increased CV error and RMSEs for small $\lambda$ are the result of non-converged models after 500 iterations. The red cross marker indicates the $\lambda$ value where the minimum CV error is achieved.

Figure 3.4: The RMSEs of estimating $\Theta$, $Z$ and $\mu$ as a function of the hyper-parameter $\gamma$ of the GDP penalty. Results on the left are obtained when the optimal model is selected based on the minimum CV error, while on the right the model selection was based on the minimum RMSE($\Theta$).

For the logistic PCA models with the $L_q$ and SCAD penalties, the model is selected and re-fitted on the full data set in the same way as above. Fig. 3.5 shows how the value of the hyper-parameter $q$ or $\gamma$ affects the RMSEs in estimating $\Theta$, $Z$ and $\mu$ for the logistic PCA models that are optimal with respect to the CV error, with the $L_q$ penalty (left) and the SCAD penalty (right). The model with the $L_q$ penalty can achieve similar performance as the model with the GDP penalty when a proper value of $q$ is selected, while the model with the SCAD penalty tends to have very large RMSEs in estimating $\Theta$, $Z$ and $\mu$ for all values of the hyper-parameter $\gamma$.

Figure 3.5: The RMSEs of estimating $\Theta$, $Z$ and $\mu$ as a function of the hyper-parameter $q$ in the $L_q$ penalty (left) and $\gamma$ in the SCAD penalty (right). The corresponding logistic PCA models are optimal with respect to the CV error.

In this section, we compare the performance of the logistic PCA models with the exact low rank constraint, the nuclear norm penalty, the SCAD ($\gamma = 3.7$), the $L_q$ ($q = 0.5$) and the GDP ($\gamma = 1$) penalty. Random initialization is used and the maximum number of iterations is set to 10000 for all the models. Furthermore, all models are fitted using both $\epsilon_f = 10^{-6}$ and $\epsilon_f = 10^{-8}$ to test their robustness to the stopping criterion. For the standard logistic PCA model using the exact low rank constraint, 5 components are used. For the models with the GDP ($\gamma = 1$), nuclear norm, SCAD ($\gamma = 3.7$) and $L_q$ ($q = 0.5$) penalties, the models are selected (the model selection results are shown in Fig. 3.3 and supplemental Fig. S3.1) and re-fitted on the full data set in the same way as described above. In addition, according to the latent variable interpretation of the logistic PCA model, the unobserved quantitative data set $X^* = \Theta + E$ is available in our simulation. We constructed a 5 component PCA model (with offset term) on this latent data $X^*$, and this model is called the full information model. The results of the above experiment are shown in Table 3.2. Since the logistic PCA model with the nuclear norm penalty is a convex problem, the global solution can be achieved; the results from this model are taken as the baseline against which the other approaches are compared. The drawback of the model with the nuclear norm penalty is that the proposed CV error based model selection procedure tends to select a too complex model to compensate for the biased estimation caused by the nuclear norm penalty (supplemental Fig. S3.2, Table 3.2). Compared to the model with the nuclear norm penalty, the logistic PCA models with the exact low rank constraint and the SCAD penalty tend to overfit the data and thus have bad performance in estimating the simulated parameters $\Theta$, $Z$ and $\mu$; these models are also not robust to the stopping criterion.
Compared to the model with the nuclear norm penalty, the logistic PCA models with a GDP penalty or an $L_q$ penalty perform well in estimating the simulated parameters, and their results are even close to those of the full information model.

Table 3.2: Comparison of the logistic PCA models with the exact low rank constraint, the nuclear norm penalty, the SCAD penalty, the $L_q$ penalty, the GDP penalty, and the full information model. The RMSEs of estimating $\Theta$, $Z$ and $\mu$, as well as the mean Hellinger distance (MHD) of estimating the simulated probability matrix $\Pi$ and the rank estimation of $\hat{Z}$, are shown in the table. Columns: penalty, $\epsilon_f$, RMSE($\Theta$), RMSE($Z$), RMSE($\mu$), MHD, rank.

We also compare the singular values of the simulated $Z$ with their estimates from the logistic PCA models with the different penalties and from the full information model. The results are shown in Fig. 3.6. The simulated low rank is 5; however, the last component is overwhelmed by the noise. Furthermore, the 4th component is less than 2 times the noise level and therefore cannot be expected to be distinguishable from the noise. From Fig. 3.6 (left) it becomes clear that the logistic PCA models with the exact low rank constraint and the SCAD penalty clearly overestimate the singular values of $Z$, and when the stricter stopping criterion is used, the overestimation problem becomes even worse. Fig. 3.6 (right) shows that the logistic PCA model with the nuclear norm penalty underestimates the singular values of $Z$ and includes too many small singular values in the model. The logistic PCA models with the GDP penalty and the $L_q$ penalty give very accurate estimates of the first three and four singular values of $Z$. These results are in line with the performance measures in Table 3.2 and the thresholding properties in Fig. 3.1. The bad performance of the model with the exact low rank constraint is mainly because the non-zero singular values are not regularized at all (Fig. 3.1). Similarly, some of the non-zero singular values (the values larger than $\gamma\lambda$) are not regularized at all for the SCAD penalty (Fig. 3.1). This property can be more problematic for the logistic PCA model with the SCAD penalty because the selected model depends on the model with the smallest $\lambda$ value, due to the warm start strategy used during the model selection process. The low performance of the model with the nuclear norm penalty is because this penalty over-shrinks the larger singular values, and the model selected based on the CV error is too complex (Fig. 3.1). On the contrary, both the models with the $L_q$ and the GDP penalty have nice thresholding properties and the corresponding logistic PCA models have superior performance. However, unlike SCAD and GDP, the $L_q$ ($0 < q < 1$) penalty has a small discontinuity region, so continuous thresholding cannot be achieved, which could result in unstable prediction [16]. Therefore, even though the model with the $L_q$ penalty achieves slightly better performance, we still recommend the GDP penalty for the non-convex singular value thresholding in the logistic PCA model.

In the analysis of simulated quantitative data sets using the PCA model, an increase in SNR makes the estimation of the true underlying low rank structure easier. Unfortunately, this is not true for the estimation of the true underlying logistic PCA model from simulated binary data. To illustrate this, the following experiment was performed using the logistic PCA model with the GDP penalty as an example. 30 SNR values are selected from a wide interval, equidistant in log-space.
Figure 3.6: Left: the singular values of the simulated $Z$ and $E$, and the singular values of the estimated $\hat{Z}$ from the logistic PCA models ($\epsilon_f = 10^{-6}$ and $\epsilon_f = 10^{-8}$) with the exact low rank constraint and the SCAD penalty. Right: the singular values of the simulated $Z$ and $E$, and the singular values of the estimated $\hat{Z}$ from the logistic PCA models with a nuclear norm, GDP and $L_q$ penalty, and from the full information model. Only the first 10 components are shown to increase the resolution of the plot.

The simulated offset term $\mu$ is set to $\mathbf{0}$ to simulate balanced binary data. Only the constant $c$, which is used to adjust the SNR, changes with the SNR; all other parameters are kept the same. For each simulated $X$ with a specific SNR, logistic PCA models with the GDP ($\gamma = 1$) penalty and with the nuclear norm penalty are selected and re-fitted. In addition, PCA models with different numbers of components are fitted on the latent quantitative data set $X^*$, and the model with the minimum RMSE($\Theta$) is selected. Furthermore, the null model, i.e. the logistic PCA model with only the column offset term, is used to provide a baseline for the comparison of the different approaches. The above experiments are repeated 10 times, and the results are shown in Fig. 3.7. Results obtained from a similar experiment performed on imbalanced data simulation are shown in supplemental Fig. S3.2; there, the simulated $\mu$ is set according to the marginal probabilities of the CNA data set. Overall, the logistic PCA models with the different penalties always achieve better performance than the null model, and the model with the GDP penalty demonstrates superior performance with respect to all the used metrics compared to the model with the convex nuclear norm penalty.

Fig. 3.7 (left and center) shows that with increasing SNR the estimation of the quantitative full information model improves, as expected. However, for the parameters estimated from the binary data this is not the case. First the estimation of the simulated parameters $\Theta$ and $Z$ improves, but when the SNR increases even further, the estimation deteriorates again, leading to a bowl shaped pattern. This pattern has been observed before in binary matrix completion using the nuclear norm penalty [52]. In order to understand this effect, consider the S-shaped logistic curve (supplemental Fig. S3.3), the plot of the function $\mathrm{E}(x|\theta) = \phi(\theta) = (1+\exp(-\theta))^{-1}$, in which $x$ and $\theta$ are a typical element of $X$ and $\Theta$ respectively. This curve almost becomes flat when $\theta$ becomes very large; there is no resolution anymore in these flat regimes, and a large deviation in $\theta$ has almost no effect on the logistic response. When the SNR becomes extremely large, the scale of the simulated parameter $\theta$ is very extreme; then even if we have a good estimate of the probability $\hat{\pi} = \phi(\hat{\theta})$, the scale of the estimated $\hat{\theta}$ can be far away from the simulated $\theta$. That is why we observe that, even though the model is able to recover the simulated $\Pi$ almost exactly (Fig. 3.7 right), the estimates of $\Theta$ and $Z$ are not accurate (Fig. 3.7 left and center). We refer to [52] for a detailed interpretation of this phenomenon.

We demonstrate the proposed logistic PCA model with a GDP ($\gamma = 1$) penalty and the corresponding model selection procedure on the CNA data set.
Figure 3.7: RMSE of $Z$ (left) and $\Theta$ (middle) and the MHD of $\Pi$ (right) as a function of increasing SNR values for simulated balanced binary data.

The model selection is done in the same way as described above, and the result is shown in supplemental Fig. S3.4. After that, the selected 4 component model is re-fitted on the full data set. The score and loading plots of the first 2 components are shown in Fig. 3.8. As was explained before in [64], the CNA data set is not discriminative for the three cancer types (illustrated in the score plot of Fig. 3.8, left). The structure in the loading plot (Fig. 3.8, right) mainly explains the technical characteristics of the data: the gains and losses of the segments in the chromosomal regions corresponding to the CNA measurements are almost perfectly separated from each other in the first component. Therefore, the variation explained by the first component is mainly due to the difference between gains and losses in the CNA measurements.

Figure 3.8: The score and loading plots of the first 2 components of the logistic PCA model on the CNA data. The legend BRCA, LUAD and SKCM indicates the corresponding three cancer types. The legend gain, loss indicates the gain or loss of a segment in the chromosome region corresponding to the CNA measurement.

Discussion

To study the properties of the logistic PCA model with different penalties, we need the ability to simulate a multivariate binary data set with an underlying low rank structure, and the simulated structure should have a proper SNR so that the model can find it back. The latent variable interpretation of the logistic PCA model not only makes the assumption of a low rank structure easier to understand, but also provides us with a way to define the SNR in multivariate binary data simulation.

The standard logistic PCA model using the exact low rank constraint has an overfitting problem. The overfitting issue manifests itself in such a way that some of the elements in the estimated loading matrix $\hat{B}$ (the orthogonality constraint is imposed on $A$) have the tendency to approach infinity, and the non-zero singular values of $\hat{Z} = \hat{A}\hat{B}^T$ are not upper bounded when strict stopping criteria are used. This overfitting issue can be alleviated by regularizing the singular values of $Z$. Both the convex nuclear norm penalty and some of the concave penalties can induce a low rank estimation and simultaneously constrain the scale of the non-zero singular values. Therefore, logistic PCA models with these penalties do not suffer from the overfitting problem.

However, the logistic PCA model with a GDP or an $L_q$ penalty has several advantages compared to the model with the nuclear norm penalty. Since the nuclear norm penalty applies the same degree of shrinkage to all the singular values, the large singular values are shrunken too much. Therefore, the implemented CV error based model selection procedure tends to select a very complex model with too many components to compensate for the biased estimates. On the contrary, both the GDP penalty and the $L_q$ penalty achieve nearly unbiased estimation. Thus, the CV error based model selection is successful in selecting the logistic PCA model with a GDP penalty.
Furthermore, the selected logistic PCA models with a GDP or an $L_q$ penalty have shown superior performance in recovering the simulated low rank structure compared to the models with the nuclear norm penalty and the exact low rank constraint.

One exception among the used concave penalties is the SCAD penalty, which leads to a logistic PCA model with poor performance. As stated in Section 3.7.4, the poor performance of the model with the SCAD penalty is mainly because some of the large singular values are not regularized at all, and this drawback is exaggerated by the warm start strategy used during the model selection process. The poor performance of the models with the exact low rank constraint and the SCAD penalty reminds us of the importance of regularizing all the singular values simultaneously when inducing the low rank structure in a logistic PCA model.

Acknowledgements

Y.S. gratefully acknowledges the financial support from China Scholarship Council (NO.201504910809).

Supplementary information

Figure S3.1: Model selection of the logistic PCA models with the nuclear norm penalty, the SCAD penalty and the $L_q$ penalty. The CV error, the RMSEs of estimating $\Theta$, $Z$ and $\mu$, and the estimated rank as a function of $\lambda$. The red cross marker indicates the $\lambda$ value where the minimum CV error is achieved.

Figure S3.2: How the SNR in imbalanced binary data simulation affects the performance of the logistic PCA models with different penalties, and of the full information model.

Figure S3.3: The relationship $\mathrm{E}(x|\theta) = \phi(\theta)$, in which $x$ and $\theta$ are a typical element of $X$ and $\Theta$ respectively.

Figure S3.4: How $\lambda$ affects the CV error (left) and the rank estimation (right) in the model selection process of the logistic PCA model with a GDP penalty on the CNA data set.

Chapter 4
Fusing binary and quantitative data sets

In the current era of systems biology research there is a need for the integrative analysis of binary and quantitative genomics data sets measured on the same objects. One standard tool for exploring the underlying dependence structure present in multiple quantitative data sets is the simultaneous component analysis (SCA) model. However, it does not have any provisions for the case that a part of the data is binary. To this end, we propose the generalized SCA (GSCA) model, which takes into account the distinct mathematical properties of binary and quantitative measurements in a maximum likelihood framework. As in the SCA model, a common low dimensional subspace is assumed to represent the shared information between these two distinct types of measurements. However, the GSCA model can easily be overfitted when a rank larger than one is used, leading to some of the estimated parameters becoming very large. To achieve a low rank solution and combat overfitting, we propose to use non-convex singular value thresholding. An efficient majorization algorithm is developed to fit this model with different concave penalties.
Realistic simulations (low signal-to-noise ratio and highly imbalanced binary data) are used to evaluate the performance of the proposed model in recovering the underlying structure. Also, a missing value based cross validation procedure is implemented for model selection. We illustrate the usefulness of the GSCA model for exploratory data analysis of quantitative gene expression and binary copy number aberration (CNA) measurements obtained from the GDSC1000 data sets.

This chapter is based on Song, Y., Westerhuis, J.A., Aben, N., Wessels, L.F., Groenen, P.J. and Smilde, A.K., 2018. Generalized Simultaneous Component Analysis of Binary and Quantitative data. arXiv preprint arXiv:1807.04982.

In biological research it becomes increasingly common to have measurements of different aspects of information on the same objects to study complex biological systems. The resulting coupled data sets should be analyzed simultaneously to explore the dependency between the variables in the different data sets and to reach a global understanding of the underlying biological system. The SCA model is one of the standard methods for the integrative analysis of such coupled data sets in different areas, from psychology to chemistry and biology [65]. SCA discovers the common low dimensional column subspace of the coupled quantitative data sets, and this subspace represents the shared information between them.

Next to quantitative measurements (such as gene expression data), it is common in biological research to have additional binary measurements, in which distinct categories differ in quality rather than in quantity (such as mutation data). Typical examples include the measurements of point mutations, the binary version of CNA and DNA methylation measurements [30]. Compared to a quantitative measurement, a binary measurement only has two mutually exclusive outcomes, such as presence vs absence (or true vs false), which are usually labeled as "1" and "0". However, "1" and "0" are abstract representations of two categories rather than the quantitative values 1 and 0. As such, the special mathematical properties of binary data should be taken into account in the data analysis. In most biological data sets, the number of "0"s is significantly larger than the number of "1"s for most binary variables, making the data imbalanced. Therefore, an additional requirement of the data analysis method is that it should be able to handle imbalanced data.

There is a need for statistical methods appropriate for the integrative analysis of coupled binary and quantitative data sets in biological research. The standard SCA models [65, 66], which use column centering processing steps and least squares loss criteria, are not appropriate for binary data sets. Recently, iClusterPlus [67] was proposed as a factor analysis framework to model discrete and quantitative data sets simultaneously by exploiting the properties of exponential family distributions. In this framework, the special properties of binary, categorical, and count variables are taken into account in a similar way as in generalized linear models. Common low dimensional latent variables and data set specific coefficients are used to fit the discrete and quantitative data sets. For the binary data set, the Bernoulli distribution is assumed and the canonical logit link function is used. The sum of the log likelihoods is then used as the objective function. Furthermore, the approach allows the use of a lasso type penalty for feature selection.
The Monte Carlo Newton–Raphson algorithm for this general framework, however, involves a very slow Markov Chain Monte Carlo simulation process. Both the high complexity of the model and the algorithmic inefficiency limit its use for large data sets and for exploring its properties through simulations. In this chapter, we generalize the SCA model to binary and quantitative data from a probabilistic perspective similar to Collins [68] and Mo [67]. However, the generalized SCA model can easily lead to overfitting when a rank restriction larger than 1 is used, leading to some of the parameters becoming very large. Therefore, a penalty on the singular values of the matrix is used to simultaneously induce a low rank structure and control the scale of the estimated parameters.

The GSCA model

Before the GSCA model is introduced, consider the standard SCA model. The quantitative measurements on the same $I$ objects from two different platforms result in two data sets $X_1$ ($I \times J_1$) and $X_2$ ($I \times J_2$), in which $J_1$ and $J_2$ are the numbers of variables. Assume both $X_1$ and $X_2$ are column centered. The standard SCA model can be expressed as

$$X_1 = AB_1^T + E_1, \qquad X_2 = AB_2^T + E_2, \qquad (4.1)$$

where $A$ ($I \times R$) denotes the common component scores (or latent variables), which span the common column subspace of $X_1$ and $X_2$, $B_1$ ($J_1 \times R$) and $B_2$ ($J_2 \times R$) are the data set specific loading matrices for $X_1$ and $X_2$ respectively, $E_1$ ($I \times J_1$) and $E_2$ ($I \times J_2$) are residuals, and $R$, $R \ll \min\{I, J_1, J_2\}$, is an unknown low rank. Orthogonality is imposed on $A$ as $A^TA = I_R$, where $I_R$ indicates the $R \times R$ identity matrix, to obtain a unique solution. $A$, $B_1$ and $B_2$ are estimated by minimizing the sum of the squared residuals $E_1$ and $E_2$.

Following the probabilistic interpretation of the PCA model [42], the high dimensional quantitative data set $X_2$ can be assumed to be a noisy observation of a deterministic low dimensional structure $\Theta_2$ ($I \times J_2$) with independent and identically distributed measurement noise, $X_2 = \Theta_2 + E_2$. Elements in $E_2$ ($I \times J_2$) follow a normal distribution with mean 0 and variance $\sigma^2$, $\epsilon_{ij} \sim \mathrm{N}(0, \sigma^2)$. In the same way, following the interpretation of the exponential family PCA on binary data [68], we assume there is a deterministic low dimensional structure $\Theta_1$ ($I \times J_1$) underlying the high dimensional binary observation $X_1$. Elements in $X_1$ follow the Bernoulli distribution with parameters $\phi(\Theta_1)$, $x_{1,ij} \sim \mathrm{Ber}(\phi(\theta_{1,ij}))$. Here $\phi()$ is the element wise inverse link function in the generalized linear model for binary data; $x_{1,ij}$ and $\theta_{1,ij}$ are the $ij$th elements of $X_1$ and $\Theta_1$ respectively. If the logit link is used, $\phi(\theta) = (1+\exp(-\theta))^{-1}$, while if the probit link is used, $\phi(\theta) = \Phi(\theta)$, where $\Phi$ is the cumulative density function of the standard normal distribution. Although in this chapter we only use the logit link in deriving the algorithm and in setting up the simulations, the option for the probit link is included in our implementation. The two link functions are similar, but their interpretations can be quite different [13].

In the same way as in the standard SCA model, $\Theta_1$ and $\Theta_2$ are assumed to lie in the same low dimensional subspace, which represents the shared information between the coupled matrices $X_1$ and $X_2$. The commonly used column centering is not appropriate for the binary data set, as the centered binary data will not be "1" and "0" anymore; every column will still have only 2 values, but these values are different for different columns. Therefore, we include column offset terms $\mu_1$ ($J_1 \times 1$) and $\mu_2$ ($J_2 \times 1$) for a model based centering.
The above ideasare modeled as Θ = µ T1 + AB T1 Θ = µ T2 + AB T2 , (4.2)where, ( I × 1) is a I dimensional vector of ones; the parameters A , B and B have the same meaning as in the standard SCA model. Constraints A T A = I R and T A = are imposed to have a unique solution.For the generalization to quantitative and binary coupled data, we follow themaximum likelihood estimation framework. The negative log likelihood for fittingcoupled binary X and quantitative X is used as the objective function. In orderto implement a missing value based cross validation procedure [71], we introducetwo weight matrices W ( I × J ) and W ( I × J ) to handle the missing elements.The ij th element of W , w ij equals 0 if the ij th element in X is missing, whileit equals 1 vice versa . The same rules apply to W and X . The loss functions f ( Θ ) for fitting X and f ( Θ , σ ) for fitting X are defined as follows: f ( Θ ) = − I (cid:88) i J (cid:88) j w ij [ x ij log( φ ( θ ij )) + (1 − x ij ) log(1 − φ ( θ ij ))] f ( Θ , σ ) = 12 σ || W (cid:12) ( X − Θ ) || F + 12 || W || log(2 πσ ) , (4.3)where (cid:12) indicates element-wise multiplication; || || F is the Frobenius norm ofa matrix; || || is the pseudo L norm of a matrix, which equals the number ofnonzero elements. .2. The GSCA model X and X is assumed to be fully representedby the low dimensional subspace spanned by the common component score matrix A . Thus, X and X are conditionally independent given that the low dimensionalstructures Θ and Θ lie in the same low dimensional subspace. Therefore, thejoint loss function is the direct sum of the negative log likelihood functions forfitting X and X . f ( Θ , Θ , σ ) = − log( p ( X , X | Θ , Θ , σ ))= − log( p ( X | Θ ) p ( X | Θ , σ ))= − log( p ( X | Θ ) − log( p ( X | Θ , σ ))= f ( Θ ) + f ( Θ , σ ) . (4.4) To arrive at meaningful solutions for the GSCA model, it is necessary to introducepenalties on the estimated parameters. If we take Θ = [ Θ Θ ], µ = [ µ T1 µ T2 ] T ,and B = [ B T1 B T2 ] T , equation 4.2 in the GSCA model can be expressed as Θ = µ T + AB T . In the above interpretation of the GSCA model, the low rankconstraint on the column centered Θ is expressed as the multiplication of tworank R matrices A , B , Z = Θ − µ T = AB T . However, using an exact low rankconstraint in the GSCA model has some issues. First, the maximum likelihoodestimation of this model easily leads to overfitting. Given the constraint that A T A = I , overfitting represents itself in a way that some elements in B tend todiverge to plus or minus infinity. In addition, the exact low rank R in the GSCAmodel is commonly unknown and its selection is not straightforward.In this chapter, we take a penalty based approach to control the scale ofestimated parameters and to induce a low rank structure simultaneously. Thelow rank constraint on Z is obtained by a penalty function g ( Z ), which shrinksthe singular values of Z to achieve a low rank structure. The most widely usedconvex surrogate of a low rank constraint is the nuclear norm penalty, which issimply the sum of singular values, g ( Z ) = (cid:80) r σ r ( Z ) [69], where σ r ( Z ) representsthe r th singular value of Z . The nuclear norm penalty was also used in a relatedwork [70]. Although the convex nuclear norm penalty is easy to optimize, thesame amount of shrinkage is applied to all the singular values, leading to biasedestimates of the large singular values. 
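For concreteness, the joint negative log-likelihood of equations (4.3) and (4.4) can be evaluated as in the following sketch, assuming the logit link for the binary block; W1 and W2 are the 0/1 weight matrices that mark missing elements. The function name is illustrative and log(1 + exp(θ)) is used for numerical stability; this is not the implementation used in this thesis.

```python
import numpy as np

def joint_neg_loglik(X1, X2, Theta1, Theta2, sigma2, W1, W2):
    """Joint negative log-likelihood of the GSCA model (logit link).

    X1: binary (0/1) data, X2: quantitative data; W1, W2 are 0/1 weight
    matrices with 0 marking missing elements (equations 4.3 and 4.4)."""
    # Bernoulli part: -sum w_ij [x log(phi) + (1 - x) log(1 - phi)],
    # rewritten element-wise as log(1 + exp(theta)) - x * theta.
    f1 = np.sum(W1 * (np.logaddexp(0.0, Theta1) - X1 * Theta1))
    # Gaussian part: weighted quadratic term plus the log(2 pi sigma^2) constant.
    resid = W2 * (X2 - Theta2)
    f2 = np.sum(resid ** 2) / (2.0 * sigma2) \
         + 0.5 * np.sum(W2) * np.log(2.0 * np.pi * sigma2)
    return f1 + f2
```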
Recent work [18, 62] already showed thesuperiority of concave surrogates of a low rank constraint under Gaussian noisecompared to the nuclear norm penalty. We take g ( Z ) = (cid:80) r g ( σ r ( Z )) as theconcave surrogate of a low rank constraint on Z , where g ( σ r ) is a concave penaltyfunction of σ r . After replacing the low rank constraint in equation 4.4 by g ( Z ),2 Chapter 4. Fusing binary and quantitative data sets the model becomes, min µ , Z ,σ f ( Θ ) + f ( Θ , σ ) + λg ( Z )subject to Θ = µ T + ZΘ = [ Θ Θ ] T Z = . (4.5)The most commonly used non-convex surrogates of a low rank constraint areconcave functions, including L q :0 When fixing σ , we can majorize f ( Θ ) = f ( Θ ) + f ( Θ ) to a quadratic functionof the parameter Θ . In addition, the concave penalty function g ( Z ) can be ma-jorized to a linear function of the singular values of Z by exploiting the concavity.The resulting majorized problem can be analytically solved by weighted singularvalue thresholding [62]. The derivation process is the same as the algorithm inChapter 3, therefore we will only show the result here. f ( Θ ) ≤ L k || Θ − H k || F + cg ( Z ) ≤ (cid:88) r ω kr σ r + c H k = Θ k − L k ( W (cid:12) ∇ f ( Θ k )) ω kr = ∂g ( σ kr ) , (4.6)in which ∇ f ( Θ k ) = [ ∇ f ( Θ k ) ∇ f ( Θ k )], ∇ f ( Θ k ) = φ ( Θ k − X ) and ∇ f ( Θ k ) = σ ( Θ k − X ); L k is the upper-bound of the second order gradient of f ( Θ ) during .3. Algorithm k th iteration, and can always set to L k = max(0 . , / ( σ ) k ); ( σ ) k is theapproximation of parameter σ during the k th iteration; σ kr is the r th singularvalue of Z k , which is an approximation of Z during the k th iteration; Θ k is theapproximation of Θ during the k th iteration; c is a constant doesn’t depend onany unknown parameters. Summarizing these two majorization steps, we havethe following majorized problem during the k th iteration.min µ , Z L k || Θ − H k || F + λ (cid:88) r ω kr σ r subject to Θ = µ T + Z1 T Z = k = Θ k − L k ( W (cid:12) ∇ f ( Θ k )) ω kr = ∂g ( σ kr ) . (4.7) We optimize µ , Z and σ alternatingly while fixing the other parameters. How-ever, updating µ and Z depend on solving the majorized problem in equation4.7 rather than solving the original problem in equation 4.5. Because of the MMprinciple, this step will also monotonically decrease the original loss function inequation 4.5. Updating µ The analytical solution of µ in equation 4.7 is simply the column mean of H k , µ = I ( H k ) T . Updating Z After deflating the offset term µ , the loss function in equation 4.7 becomes L k || Z − JH k || F + λ (cid:80) r ω kr σ r , in which J = I − I T is the column centering matrix.The solution of the resulting problem is equivalent to the proximal operator ofthe weighted sum of singular values, which has an analytical form solution [62].Suppose USV T = JH k is the SVD decomposition of JH k , the analytical formsolution of Z is Z = US z V T , in which S z = Diag { ( s r − λω r /L k ) + } and s r is the r th diagonal element in S . Updating σ By setting the gradient of f ( Θ , σ ) in equation 4.5 with respect to σ to be 0,we have the following analytical solution of σ , σ = || W || || W (cid:12) ( X − Θ ) || F .When no low rank estimation of Z can be achieved, the constructed model is4 Chapter 4. Fusing binary and quantitative data sets close to a saturated model and the estimated ˆ σ is close to 0. In that case, whenˆ σ < . 05, the algorithm stops and gives a warning that a low rank estimationhas not been achieved. 
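The two majorization steps in equations (4.6) and (4.7) lead to a simple update: form the working matrix H^k, set µ to its column mean, and apply weighted singular value thresholding to the column-centred H^k. The sketch below illustrates this step; the gradient ∇f(Θ^k) and the weights ω_r^k = ∂g(σ_r^k) are assumed to be computed from the previous iterate, and the function names are illustrative rather than the thesis implementation.

```python
import numpy as np

def weighted_svt(JH, weights, lam, L):
    """Proximal operator of the weighted sum of singular values:
    argmin_Z (L/2)||Z - JH||_F^2 + lam * sum_r weights_r * sigma_r(Z),
    solved by soft-thresholding each singular value of JH by lam * w_r / L."""
    U, s, Vt = np.linalg.svd(JH, full_matrices=False)
    s_thr = np.maximum(s - lam * weights / L, 0.0)
    return (U * s_thr) @ Vt

def mm_step(Theta_k, grad_f, W, weights, lam, L):
    """One majorization step of equation (4.7): working matrix H^k,
    analytic offset update and weighted singular value thresholding."""
    H = Theta_k - (W * grad_f) / L          # H^k = Theta^k - (1/L_k) W .* grad f(Theta^k)
    mu = H.mean(axis=0)                     # column mean of H^k
    Z = weighted_svt(H - mu, weights, lam, L)
    return mu, Z, mu + Z                    # new Theta = 1 mu^T + Z
```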
Initialization and stopping criteria Random initialization is used. All the elements in Z are sampled from thestandard uniform distribution, µ is set to 0 and ( σ ) is set to 1. The relativechange of the objective value is used as the stopping criteria. Pseudocode of thealgorithm described above is shown in Algorithm 2. (cid:15) f is the tolerance of relativechange of the loss function. Algorithm 2 A MM algorithm for fitting the GSCA model with concave penal-ties. Input: X , X , penalty, λ , γ ; Output: ˆ µ , ˆ Z , ˆ σ ; Compute W , W for missing values in X and X , and W = [ W W ]; Initialize µ , Z , ( σ ) ; k = 0; while ( f k − − f k ) /f k − > (cid:15) f do ∇ f ( Θ k ) = φ ( Θ k ) − X ; ∇ f ( Θ k ) = σ ) k ( Θ k − X ); ∇ f ( Θ k ) = [ ∇ f ( Θ k ) ∇ f ( Θ k )]; L k = max(0 . , / ( σ ) k ); H k = Θ k − L k ( W (cid:12) ∇ f ( Θ k )); ω kr = ∂g ( σ kr ); µ k +1 = I ( H k ) T ; USV T = JH k ; S z = Diag { ( s r − λω kr /L k ) + } ; Z k +1 = US z V T ; Θ k +1 = ( µ k +1 ) T + Z k +1 ; [ Θ k +11 Θ k +12 ] = Θ k +1 ; ( σ ) k +1 = || W || || W (cid:12) ( X − Θ k +12 ) || F k = k + 1; end while To see how well the GSCA model is able to reconstruct data generated accordingto the model, we do a simulation study with similar characteristics as a typicalempirical data set. We first simulate the imbalanced binary X and quantitative X following the GSCA model with logit link and low signal-to-noise ratio (SNR). .4. Simulation The SNR for generating binary data is defined according to the latent variableinterpretation of the logistic PCA model. Elements in X are independent andindirect binary observations of the corresponding elements in an underlying quan-titative matrix X ∗ , x ij = 1 if x ∗ ij > x ij = 0 otherwise. X ∗ can be ex-pressed as X ∗ = Θ + E , in which Θ = µ T1 + AB T1 , and elements in E followthe standard logistic distribution, (cid:15) ij ∼ Logistic(0 , X is defined as SNR = || AB T1 || F / || E || F . Assume the quantitative X is simulated as X = Θ + E , in which Θ = µ T2 + AB T2 and elements in E follow a normal distribution with 0 mean and σ variance, (cid:15) ij ∼ N (0 , σ ). TheSNR for generating quantitative X is defined as SNR = || AB T2 || F / || E || F .After the definition of the SNR, we simulate the coupled binary X and quan-titative X as follows. µ represents the logit transform of the marginal probabil-ities of binary variables and µ represents the mean of the marginal distributionsof quantitative variables. They will be simulated according to the characteristicsof a real biological data set. The score matrix A and loading matrices B , B are simulated as follows. First, we express AB T1 and AB T2 in a SVD type as AB T1 = UD V T1 and AB T2 = UD V T1 , in which U T U = I R , D and D arediagonal matrices, V T1 V = I R and V T2 V = I R . All the elements in U , V and V are independently sampled from the standard normal distribution. Then, U , V and V are orthogonalized by the QR algorithm. The diagonal matrix D ( R × R ) is simulated as follows. R elements are sampled from standard nor-mal distribution, their absolute values are sorted in decreasing order. To satisfythe pre-specified SNR and SNR , D is scaled by positive scalars c and c as D = c D and D = c D . Then, binary elements in X are sampled from theBernoulli distribution with corresponding parameter φ ( θ ij ), in which φ () is in-verse logit function and Θ = µ T1 + AB T1 . Quantitative data set X is generatedas X = Θ + E , in which Θ = µ T2 + AB T2 and elements in E are sampledfrom N (0 , σ ). 
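The simulation scheme described above can be sketched as follows. The latent-variable formulation is used for the binary block (thresholding Θ1 plus standard logistic noise at zero, which is equivalent in distribution to sampling from Bernoulli(φ(θ_ij)) with the logit link), and the signal is scaled so that the Frobenius-norm ratio of signal to noise matches the requested SNR (squared norms are assumed here). The function name, the seed handling and the use of the realized noise norm in the scaling are illustrative assumptions.

```python
import numpy as np

def simulate_coupled(I, J1, J2, R, snr1, snr2, mu1, mu2, sigma2=1.0, seed=0):
    """Simulate coupled binary X1 and quantitative X2 under the GSCA model."""
    rng = np.random.default_rng(seed)
    U = np.linalg.qr(rng.standard_normal((I, R)))[0]          # orthonormal scores
    V1 = np.linalg.qr(rng.standard_normal((J1, R)))[0]
    V2 = np.linalg.qr(rng.standard_normal((J2, R)))[0]
    d = np.sort(np.abs(rng.standard_normal(R)))[::-1]         # decreasing singular values

    E1 = rng.logistic(0.0, 1.0, size=(I, J1))                 # standard logistic noise
    E2 = rng.normal(0.0, np.sqrt(sigma2), size=(I, J2))       # Gaussian noise
    Z1, Z2 = U @ np.diag(d) @ V1.T, U @ np.diag(d) @ V2.T
    c1 = np.sqrt(snr1) * np.linalg.norm(E1) / np.linalg.norm(Z1)   # scale to reach SNR1
    c2 = np.sqrt(snr2) * np.linalg.norm(E2) / np.linalg.norm(Z2)   # scale to reach SNR2

    Theta1 = mu1 + c1 * Z1                                    # logit-scale structure
    X1 = (Theta1 + E1 > 0).astype(int)                        # indirect binary observation
    Theta2 = mu2 + c2 * Z2
    X2 = Theta2 + E2
    return X1, X2, Theta1, Theta2
```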
Take Z = AB T , B = [ B B ]. In order to make T Z = , wefurther deflate the column offset of Z to the simulated µ , µ = [ µ T1 µ T2 ] T . Thisstep will not change the value of Θ and Θ , thus does not affect the simulationof X and X . As for simulated data sets, the true parameters Θ = [ Θ Θ ], µ = [ µ T1 µ T1 ] T and Z = AB T are available. Therefore, the generalization error of the constructed6 Chapter 4. Fusing binary and quantitative data sets model can be evaluated by comparing the true parameters and their model esti-mates. Thus, the evaluation metric is defined as the relative mean squared error(RMSE) of the model parameters. The RMSE of estimating Θ is defined asRMSE( Θ ) = || Θ − ˆ Θ || F / || Θ || F , where Θ represents the true parameter and ˆ Θ its GSCA model estimate. The RMSE of µ and Z , are expressed as RMSE( µ )and RMSE( Z ) and they are defined in the same way as for Θ .For real data sets, missing value based cross validation (CV) is used to esti-mate the generalization error of the constructed model. Also in order to have anestimation of the uncertainty of the CV error, we will use a K-fold CV procedure.To make the prediction of the left out fold elements independent to the con-structed model based on the reminding folds, the data is partitioned into K foldsof elements which are selected in a diagonal style rather than row wise from X and X respectively, similar to the leave out patterns described by Wold [47, 71].The test set elements of each fold in X and X are taken as missing values, andthe remaining data are used to construct a GSCA model. After estimation of ˆ Θ and ˆ σ are obtained from the constructed GSCA model, the negative log likeli-hood of using ˆ Θ , ˆ σ to predict the missing elements (left out fold) is recorded.This negative log likelihood is scaled by the number of missing elements. Thisprocess is repeated K times until all the K folds have been left out once. Themean of the K scaled negative log likelihoods is taken as the CV error.When we define X = [ X X ] and J = J + J , the penalty term λg ( Z ) is notinvariant to the number of non-missing elements in X , as the joint loss function(equation 4.4) is the sum of the log likelihoods for fitting all the non-missingelements in the data X . Therefore, we effectively follow a similar approach asFan [16] by adjusting the penalty strength parameter λ for the relative numberobservations. By setting one fold of elements to be missing during the CV process, λ || X || / ( I × J ) rather than λ is used as the amount of penalty. During the K-fold CV process, a warm start strategy, using the results of previous constructedmodel as the initialization of next model, is applied. In this way, the K-fold CVcan be greatly accelerated.In the model selection process, the tuning parameter λ and hyper-parameters( q in L q and γ in SCAD and GDP) can be selected by a grid search. However,previous work of using these penalty functions in supervised learning context[57, 16, 27] and our experiments have shown that the results are not very sensitiveto the selection of these hyper-parameters, and thus a default value can be set.On the other hand, the selection of tuning parameter λ does have a significanteffect on the results, and should be optimized by the grid search. In the loading or score plots of the following experiments and real data analysis,we will multiply the estimated score matrix ˆ A by √ I , and the estimated loadingmatrix ˆ B by 1 / √ I to make the loading and score plots have the same scale. .4. 
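One simple way to construct element-wise folds in a diagonal style is sketched below; it guarantees that every row and every column contributes elements to each fold, so that the remaining data still allow all offsets, scores and loadings to be estimated. The exact leave-out pattern of Wold [47, 71] is not reproduced here, and the helper name is illustrative.

```python
import numpy as np

def diagonal_cv_folds(I, J, K):
    """Assign each element of an I x J matrix to one of K folds along
    diagonals, so every row and column has elements in every fold."""
    rows, cols = np.indices((I, J))
    return (rows + cols) % K

# usage: elements of fold k are treated as missing (weight 0) and predicted
folds = diagonal_cv_folds(160, 1410, K=7)      # 1410 = J1 + J2 in the simulation
W_train = (folds != 0).astype(float)           # leave out fold 0 in this round
```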
Simulation

Overfitting of the GSCA model with a fixed rank and no penalty

The real data sets from Section 4.5 are used to show how the GSCA model with a fixed rank and no penalty overfits the data. The algorithm (details are in the supplemental material) used to fit the GSCA model (with an exact low rank constraint and the orthogonality constraint A^T A = I_R) is a modification of the algorithm developed in Section 4.3. GSCA models with three components are fitted using the stopping criteria ε_f = 10⁻⁵ and ε_f = 10⁻⁸. Exactly the same initialization is used for these two models. As shown in Fig. 4.1, different stopping criteria can greatly affect the estimated B̂ from the GSCA models. Furthermore, the number of iterations needed to reach convergence increases from 141 to 23991. Related phenomena have been observed in the logistic linear regression model and the logistic PCA model [43, 73], where some estimated parameters tend to diverge towards infinity. The overfitting issue of the GSCA model with an exact low rank constraint can be interpreted in the same way, by taking the columns of the score matrix A as the latent variables and the loading matrix B₁ as the coefficients to fit the binary X₁. This result suggests that if an exact low rank constraint is preferred in the GSCA model, an extra scale penalty should be added on B₁ to avoid overfitting.

Figure 4.1: Loading plots of the estimated B̂ from the GSCA models with exact low rank constraint using two different stopping criteria, ε_f = 10⁻⁵ (left) and ε_f = 10⁻⁸ (right). Note that the scales of the coordinates for ε_f = 10⁻⁸ (right) are over ten times larger than those for ε_f = 10⁻⁵ (left).

Comparing the generalization errors of the GSCA models with nuclear norm and concave penalties

To evaluate the performance of the GSCA model in recovering the underlying structure, we set up a realistic simulation (strongly imbalanced binary data and low SNR) as follows. The simulated X₁ and X₂ have the same size as the real data sets in Section 4.5, I = 160, J₁ = 410, J₂ = 1000. The logit transform of the empirical marginal probabilities of the CNA data set in Section 4.5 is set as µ₁. Elements in µ₂ are sampled from the standard normal distribution. The simulated low rank is set to R = 10; σ² is set to 1; SNR₁ and SNR₂ are set to 1. After the simulation of X₁, there are two columns with only "0" elements, which are removed as they provide no information (no variation).

As the GSCA model with the nuclear norm penalty is a convex problem, a global optimum can be obtained. The nuclear norm penalty is therefore used as the baseline in the comparison with other penalties. An interval from λ₀, which is large enough to achieve an estimated rank of at most 1, to λ_t, which is small enough to achieve an estimated rank of 159, is selected based on low precision models (ε_f = 10⁻ ). 30 log-spaced values of λ are selected equally from the interval [λ_t, λ₀]. The convergence criterion is set as ε_f = 10⁻ . The results are shown in Fig. 4.2. With decreasing λ, the estimated rank of Ẑ increases from 0 to 159, and the estimated σ̂² decreases from 2 to close to 0. The minimum RMSE(Θ) of 0.184 (with corresponding RMSE(Θ₁) = 0.…, RMSE(Θ₂) = 0.…, RMSE(µ) = 0.072 and RMSE(Z) = 0.…) is obtained at λ = 38.3, which corresponds to rank(Ẑ) = 52 and σ̂² = 0.… (λ = 40).
The reason is that when the penaltyis not large enough, the estimated rank becomes 159, and the constructed GSCAmodel is almost a saturated model. Thus the model has high generalizationerror and the estimated ˆ σ also becomes close to 0. Given that we only haveindirect binary observation X and highly noisy observation X of the underlyingstructure Θ , the performance of the GSCA model with nuclear norm penalty isreasonable. However, results can be greatly improved by using concave penalties.For concave penalties, different values of the hyper-parameters, q in L q , γ in SCAD and GDP, are selected according to their thresholding properties. Foreach value of the hyper-parameter, values of tuning parameter λ are selected inthe same manner as described above. The minimum RMSE( Θ ) achieved and thecorresponding RMSE( µ ) and RMSE( Z ) for different values of hyper-parameterof the GSCA models with different penalty functions are shown in Fig. 4.3. Hereall GSCA models with concave penalties can achieve much lower RMSEs in es-timating Θ , µ and Z compared to the convex nuclear norm penalty ( L q : q =1 inthe plot). Among the three concave penalties used, L q and GDP have betterperformance.If we get access to the full information, the underlying quantitative data X ∗ .4. Simulation 40 60 80 100 R M SE RMSE Z 40 60 80 100 ^ < ^ < 40 60 80 100 e s t i m a t ed r an ks estimated ranks Figure 4.2: RMSEs in estimating Θ , µ , Z (left), the estimated ˆ σ (center) andthe estimated rank( ˆ Z ) (right) from the GSCA model with nuclear norm penaltyas a function of the tuning parameter λ . Red cross marker indicates the modelwith minimum RMSE( Θ ). q m i n i m u m R M SE L q:0 Figure 4.3: The minimum RMSE( Θ ) achieved and the corresponding RMSE( µ )and RMSE( Z ) for different values of hyper-parameter for L q penalty (left), forSCAD penalty (center) and for GDP penalty (right). The legends indicate theRMSEs in estimating Θ , µ and Z respectively. The x -axis of the left and centersubplots has a linear scale, while right subplot has a log scale.0 Chapter 4. Fusing binary and quantitative data sets rather than the binary observation X , the SCA model on X ∗ and X is simply aPCA model on [ X ∗ X ]. From this model, we can get an estimation of Θ , µ and Z . We compared the results derived from the SCA model on the full information,the GSCA models with nuclear norm, L q : q =0 . , SCAD ( γ = 5) and GDP ( γ = 1)penalties. All the models are selected to achieve the minimum RMSE( Θ ). TheRMSEs of estimating Θ , Θ , Θ , µ and Z and the rank of estimated ˆ Z fromdifferent models are shown in Table 4.1. Here we can see that the GSCA modelswith L q : q =0 . and GDP ( γ = 1) penalties have better performance in almost allcriteria compared to the nuclear norm and SCAD penalties, and even comparablewith the SCA model on full information. The singular values of the true Z ,estimated ˆ Z from the above models and the noise terms E = [ E E ] are shownin Fig. 4.4. Only the first 15 singular values are shown to have higher resolution ofthe details. Since the 10 th singular value of the simulated data Z is smaller thanthe noise level, the best achievable rank estimation is 9. Both the L q : q =0 . andGDP ( γ = 1) penalties successfully find the correct rank 9, and they have a verygood approximation of the first 9 singular values of Z . On the other hand, thenuclear norm penalty shrinks all the singular values too much. 
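The different behaviour of the penalties can be traced back to the weights ω_r = ∂g(σ_r) that enter the weighted singular value thresholding step. The sketch below gives these weights for the nuclear norm, the L_q penalty g(σ) = σ^q and the GDP penalty, assuming the usual parameterization g(σ) = log(1 + σ/γ) for GDP; the SCAD weights are omitted here, and the function name and default values are illustrative.

```python
import numpy as np

def penalty_weights(sigma, penalty="GDP", q=0.1, gamma=1.0, eps=1e-8):
    """Supergradient weights omega_r = dg(sigma_r) used in the weighted
    singular value thresholding step. 'L1' reproduces the nuclear norm
    (constant shrinkage); 'Lq' and 'GDP' give concave shrinkage that
    leaves the large singular values almost untouched."""
    sigma = np.asarray(sigma, dtype=float)
    if penalty == "L1":                  # nuclear norm: g(s) = s
        return np.ones_like(sigma)
    if penalty == "Lq":                  # g(s) = s^q, 0 < q <= 1
        return q * (sigma + eps) ** (q - 1.0)
    if penalty == "GDP":                 # g(s) = log(1 + s / gamma)
        return 1.0 / (gamma + sigma)
    raise ValueError(f"unknown penalty: {penalty}")

# large singular values receive small weights (little shrinkage) under GDP and Lq
print(penalty_weights(np.array([10.0, 1.0, 0.1]), penalty="GDP", gamma=1.0))
```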
Furthermore, theSCAD penalty overestimates the first three singular values and therefore shrinksall the other singular values too much. These results are easily understandableif taking their thresholding properties in Fig. 3.1 into account. Both the L q andthe GDP penalties have very good performance in this simulation experiment.Table 4.1: The RMSEs of estimating Θ , µ and Z and the rank of estimated ˆ Z from different models.RMSE( Θ ) RMSE( Θ ) RMSE( Θ ) RMSE( µ ) RMSE( Z ) rank( ˆ Z ) L q : q =1 L q : q =0 . γ = 5) 0.1093 0.1334 0.0395 0.0376 0.2777 24GDP( γ = 1) 0.0593 0.0675 0.0354 0.0160 0.1610 9full information 0.0222 0.0675 0.0354 0.0030 0.0674 9 Comparing the GSCA model with GDP penalty and the iClusterPlusmodel We compared our GSCA model with GDP penalty to the iClusterPlus model onthe simulated data sets. The parameters for the GSCA model with GDP penaltyis the same as described above. The running time is 60.61s when (cid:15) f = 10 − ,and 9.98s when (cid:15) f = 10 − . For the iClusterPlus model, 9 latent variables arespecified. The tuning parameter of the lasso type constraint on the data specificcoefficient matrices are set to 0. The default convergence criterion is used, that isthe maximum of the absolute changes of the estimated parameters in two subse-quent iterations is less than 10 − . The running time of the iClusterPlus model is .4. Simulation components s i ngu l a r v a l ue s L L SCADGDPfull informationtruenoise Figure 4.4: Approximation of the singular values using different penalties in thesimulation experiment. Labels “ L ”, “ L . ”, “SCAD”, “GDP”, “full information”indicate the singular values of estimated ˆ Z from the corresponding models; “true”indicates the singular values of the simulated Z ; “noise” indicates the singularvalues of the noise term E , which has full rank.close to 3 hours. The constructed iClusterPlus model provides the estimation ofcolumn offset ˆ µ , the common latent variables ˆ A , and data set specific coefficientmatrices ˆ B and ˆ B . The estimated ˆ Z and ˆ Θ are computed in the same way asdefined in the model section. The RMSEs in estimating Θ , µ and Z for iCluster-Plus are 2.571, 2.473 and 3.060 respectively. Compared to the results from theGSCA models in Table 4.1, iClusterPlus is unable to provide good results on thesimulated data sets. Supplemental Fig. S4.2 compares the estimated ˆ µ from theGSCA model with GDP penalty and iClusterPlus model. As shown in supple-mental Fig. S4.2(right), the iClusterPlus model is unable to estimate the offset µ correctly. Many elements of estimated ˆ µ are exactly 0, which corresponds to anestimated marginal probability of 0.5. In addition, as shown in Fig. 4.5(left), thesingular values of the estimated ˆ Z from the iClusterPlus model are clearly over-estimated. These undesired results from the iClusterPlus model are due mainlyto the imbalancedness of the simulated binary data set. If the offset term µ inthe simulation is set to 0, which corresponds to balanced binary data simulation,and fix all the other parameters in the same way as in the above simulation, theresults of iClusterPlus and the GSCA with GDP penalty are more comparable. Inthat case the RMSEs of estimating Θ , Z in the GSCA model with GDP penaltyare 0.071 and 0.091 respectively, while the RMSEs of the iClusterPlus model are0.107 and 0.142 respectively. As shown in Fig. 4.5(right), the singular values ofestimated ˆ Z from the iClusterPlus model are much more accurate compared to2 Chapter 4. 
Fusing binary and quantitative data sets the imbalanced case. However, iClusterPlus still overestimates the singular valuescompared to the GSCA model with GDP penalty. This phenomenon is relatedto the fact that exact low rank constraint is also used in the iClusterPlus model.These results suggest that compared to iClusterPlus, the GSCA model with GDPpenalty is more robust to the imbalanced binary data and has better performancein recovering the underlying structure in the simulation experiment. components s i ngu l a r v a l ue s GDPiClusterPlustruenoise components Figure 4.5: The singular values of estimated ˆ Z using the iClusterPlus model andthe GSCA model with GDP penalty on the simulation with imbalanced binarydata (left) and with balanced binary data (right). The performance of the GSCA model for the simulation with differentSNRs We will explore the performance of the GSCA model for the simulated binaryand quantitative data sets with varying noise levels in the following experiment.Equal SNR levels are used in the simulation for X and X . 20 log spacedSNR values are equally selected from the interval [0 . , X and quantitative X using the different SNRs in thesame way as described above. During this process, except for the parameters c and c , which are used to adjust the SNRs, all other parameters used in thesimulation were kept the same. The GSCA models with GDP penalty ( γ = 1), L q penalty ( q = 0 . Θ ) are selected.As shown in Fig. 4.6, the GSCA models with concave GDP and L q penalties .4. Simulation Z ) derived from the GSCA model, which isused to evaluate the performance of the model in recovering the underlying lowdimensional structure, first decreases to a minimum and then increases. As shownin bottom center and right, this pattern is mainly caused by how RMSE( Z )changes with respect to SNRs. Although this result counteracts the intuitionthat larger SNR means higher quality of data, it is in line with our previousresults in Chapter 3. RMSE( ) GDPL L full information RMSE( Z ) RMSE( ) SNR estimated ranks SNR RMSE( Z ) SNR RMSE( Z ) Figure 4.6: Minimum RMSE( Θ ) (top right), and the corresponding RMSE( µ )(top left), RMSE( Z ) (top center), rank estimation of ˆ Z (bottom left), RMSE( Z )(bottom center) and RMSE( Z ) (bottom right) of the GSCA models with nuclearnorm penalty (legend “ L ”), GDP penalty (legend GDP), L . penalty (legend“ L . ”) and SCA model on full information (legend “full information”) for differ-ent SNR levels.4 Chapter 4. Fusing binary and quantitative data sets Assessing the model selection procedure The cross validation procedure and the cross validation error have been definedin the model selection section. The GSCA model with GDP penalty is used asan example to assess the model selection procedure. (cid:15) f = 10 − is used as thestopping criteria for all the following experiments to save time. The values of λ and γ are selected in the same way as was described in Section 4.4. Fig. 4.7 showsthe minimum RMSE( Θ ) and minimum CV error achieved for different values ofthe hyper-parameter γ . The minimum CV error changes in a similar way asthe minimum RMSE( Θ ) with respect to the values of γ . However, taking intoaccount the uncertainty of estimated CV errors, the difference of the minimumCV errors for different γ is very small. Thus, we recommend to fix γ to be 1,rather than using cross validation to select it. Furthermore, setting γ = 1 as thedefault value for the GDP penalty has a probabilistic interpretation, see in [27]. . R M SE minimum RMSE( ) . 
C V e rr o r minimum CV error Figure 4.7: Minimum RMSE( Θ ) (left) and minimum CV error (right) for differentvalues of γ from the GSCA model with GDP penalty. One standard error barsare added to the CV error plot.Whenever the GSCA model is used for exploratory data analysis, there is noneed to select λ explicitly. It is sufficient to find a proper value to achieve a twoor three component GSCA model, in order to visualize the estimated score andloading matrices. If the goal is confirmatory data analysis, it is possible to selectthe tuning parameter λ explicitly by the proposed cross validation procedure.Fig. 4.8 shows how the tuning parameter λ affects the CV errors, RMSE( Θ ) andthe estimated ranks. The minimum CV error obtained is close to the Bayes error,which is the scaled negative log likelihood in cases where the true parameters Θ and σ are known. Even though, inconsistence exists between CV error plot(Fig. 4.8, left) and the RMSE( Θ ) plot (Fig. 4.8, center), the selected modelcorresponding to minimum CV error can achieve very low RMSE( Θ ) and correctrank estimation (Fig. 4.8, right). Therefore, we suggest to use the proposed CV .5. Empirical illustration λ at which the minimum CV error is obtained.Finally, we fit a model on full data set without missing elements using the selectedvalue of λ and the outputs of the selected model with minimum CV error as theinitialization. CV error CV errorBayes error RMSE( ) estimated ranks CVfit Figure 4.8: CV error, RMSE and estimated rank for different values of the tuningparameter λ . One standard error bars are added to the CV error plot. “Bayeserror” indicates the mean log negative likelihood using simulated Θ and σ tofit the simulated data sets X and X . The red cross marker indicates the pointwhere the minimum CV error is achieved. “CV” and “fit” (right plot) indicate themean rank( ˆ Z ) derived from the models constructed in 7-fold cross validation andthe rank( ˆ Z ) derived from a model constructed on full data set without missingelements (the outputs of the constructed model during cross validation are set asthe initialization). The Genomic Determinants of Sensitivity in Cancer 1000 (GDSC1000) [30] con-tains 926 tumor cell lines with comprehensive measurements of point mutation,CNA, methylation and gene expression. We selected the binary CNA and quan-titative gene expression measurements on the same cell lines (each cell line is asample) as an example to demonstrate the GSCA model. To simplify the interpre-tation of the derived model, only the cell lines of three cancer types are included:BRCA (breast invasive carcinoma, 48 cell lines), LUAD (lung adenocarcinoma,62 cell lines) and SKCM (skin cutaneous melanoma, 50 cell lines). The CNA dataset has 410 binary variables. Each variable is a copy number region, in which “1”indicates the presence and “0” indicates the absence of an aberration. Note that,the CNA data is very imbalanced: only 6 . 66% the elements are “1”. The empir-ical marginal probabilities of binary CNA variables are shown in supplemental6 Chapter 4. Fusing binary and quantitative data sets Fig. S4.1. The quantitative gene expression data set contains 17,420 variables, ofwhich 1000 gene expression variables with the largest variance are selected. Afterthat, the gene expression data is column centered and scaled by the standarddeviation of the each variable to make it more consistent with the assumption ofthe GSCA model. 
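The preprocessing described above can be summarized in a short sketch: remove binary CNA columns without any aberration, keep the gene expression variables with the largest variance, and column-centre and scale the expression data by the standard deviation of each variable. The helper name and the exact call signature are illustrative.

```python
import numpy as np

def preprocess(cna, expr, n_genes=1000):
    """Preprocessing of coupled binary CNA and quantitative expression data
    (Section 4.5): drop CNA columns without any aberration, keep the n_genes
    highest-variance expression variables, then centre and scale expression."""
    cna = cna[:, cna.sum(axis=0) > 0]                      # remove all-zero columns
    keep = np.argsort(expr.var(axis=0))[::-1][:n_genes]    # largest-variance genes
    expr = expr[:, keep]
    expr = (expr - expr.mean(axis=0)) / expr.std(axis=0)   # centre and scale
    return cna, expr
```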
We applied the GSCA model (with GDP penalty and γ =1) to the GDSC data setof 160 tumor cell lines that have been profiled for both binary CNA (160 × × PC1 -2 -1 0 1 2 P C -2-1012 score plot BRCALUADSKCM PC1 -1 -0.5 0 0.5 1-1-0.500.51 loading plot of binary data X PTENERBB2MYC PC1 -1 -0.5 0 0.5 1-1-0.500.51 loading plot of quantitative data X hormone-positiveMITF-high Figure 4.9: Score plot (left), loading plot for binary CNA data X (center) andloading plot for gene expression data X (right) derived from the constructedGSCA model.We then wondered whether the GSCA model could leverage the gene ex-pression data to help us gain insight into the CNA data. To test this, we firstestablished how much insight could be gained from the CNA data in isolation.Supplemental Fig. S4.5 shows the scores and loadings of the first two compo- .6. Discussion In this chapter, we generalized the standard SCA model to explore the depen-dence between coupled binary and quantitative data sets. However, the GSCAmodel with exact low rank constraint overfits the data, as some estimated param-eters tend to diverge to positive infinity or negative infinity. Therefore, concavepenalties are introduced in the low rank approximation framework to achieve lowrank approximation and to mitigate the overfitting issues of the GSCA model.An efficient algorithm framework with analytical form updates for all the param-eters is developed to optimize the GSCA model with any concave penalties. Allconcave penalties used in our experiments have better performance with respectto generalization error and estimated low rank of the constructed GSCA modelcompared to the nuclear norm penalty. Both L q and GDP penalties with propermodel selection can recover the simulated low rank structures almost exactly8 Chapter 4. Fusing binary and quantitative data sets only from indirect binary observation X and noisy quantitative observation X .Furthermore, we have shown that the GSCA model outperforms the iCluster-Plus model with respect to speed and accuracy of the estimation of the modelparameters.The superior performance of the GSCA models with the concave penaltiescompared to the models with an exact low rank constraint or a nuclear normpenalty is related to their different thresholding properties. The exact low rankconstraint thresholds the singular values in a hard manner and, therefore, onlythe largest R singular values are kept. On the other hand, the nuclear normpenalty works in a soft manner, in which all the singular values are shrunk bythe same amount of λ . The thresholding properties of the concave penaltiesdiscussed in this chapter lie in between these two approaches. As Z = AB T and A T A = I R , the scale of the loadings is related to the scale of the singular values of Z . Thus, we can shrink the singular values of Z to control the scale of estimatedloading matrices in an indirect way. The exact low rank constraint kept the R largest singular values but without control of the scale of the estimated singularvalues, leading to overfitting. On the other hand, nuclear norm penalty shrinksall the singular values by the same amount of λ , leading to biased estimation ofthe singular values. A concave penalty, like L q or GDP, achieves a balance inthresholding the singular values. Among the concave penalties we used in theexperiment, the SCAD penalty does not work well in the simulation study. 
Thereason is that the SCAD penalty does not shrink the large singular values, whichtherefore tend to be overfitted, while the smaller singular values are shrunk toomuch.Compared to the iClusterPlus method, only the option of binary and quanti-tative data sets are included in our GSCA model, and at the moment no sparsitycan be imposed for the integrative analysis of binary and quantitative data sets.However, the GSCA model with GDP penalty is optimized by a more efficient al-gorithm, it is much more robust to the imbalanced nature of the biological binarydata and it provides a much better performance for the simulation experiments inthis chapter. Furthermore, the exploratory analysis of the GDSC coupled CNAand gene expression data sets provided important information on the binary CNAdata that was not obtained by a separate analysis. Acknowledgements Y.S. gratefully acknowledges the financial support from China Scholarship Coun-cil (NO.201504910809). .7. Supplementary information The exact low rank constraint on Z can be expressed as the multiplication oftwo low rank matrices A and B . The optimization problem related to the GSCAmodel with exact low rank constraint can be expressed asmin µ , Z ,σ f ( Θ ) + f ( Θ , σ )subject to Θ = µ T + ZΘ = [ Θ Θ ]rank( Z ) = R The developed algorithm in the paper can be slightly modified to fit thismodel. Similar to the paper, the above optimization problem can majorized tothe following problem. min µ , Z ,σ L || Θ − H k || F + c subject to Θ = µ T + ZH k = Θ k − L ( W (cid:12) ∇ f ( Θ k )) T Z = 0rank( Z ) = R. The analytical solution of µ is also the column mean of H k . After deflatingout the offset term µ , the majorized problem becomes min Z L || Z − JH k || F subjectto rank( Z ) = R and T Z = 0. The global optimal solution is the R truncatedSVD of JH k . Other steps in the algorithm to fit the GSCA model with exact lowrank constraint are exactly the same as in the paper to fit the GSCA model withconcave penalties. Chapter 4. Fusing binary and quantitative data sets binary variables emperical marginal probabilities of CNA data Figure S4.1: Empirical marginal probabilities of the binary CNA data set. binary variables e s t i m a t ed m a r g i na l p r obab ili t y GSCA model with GDP penalty binary variables iClusterPlus model Figure S4.2: Estimated marginal probabilities (the logit transform of the es-timated ˆ µ ) from the GSCA model with GDP penalty (left) and iClusterPlusmodel (right). 500 1000 150020000.850.90.9511.051.11.151.21.25 CV error 500 1000 150020000.20.30.40.50.60.70.8 ^ < 500 1000 1500200001020304050 estimated rank Figure S4.3: Model selection of the GSCA model with GDP penalty on the GDSCdata sets: CV error (left); estimated ˆ σ (center); rank( ˆ Z ) (right). .7. Supplementary information PC1 -2 -1 0 1 2 P C -2.5-2-1.5-1-0.500.511.522.5 score plot BRCALUADSKCM PC1 -1 -0.5 0 0.5 1-1-0.500.51 loading plot Figure S4.4: Score plot (left) and loading plot (right) derived from a PCA modelon the gene expression data X . X are centered and scaled in the same as in theGSCA model. SVD algorithm is used to solve the PCA model. PC1 -6 -4 -2 0 2 4 6 P C -6-4-20246 score plot BRCALUADSKCM PC1 -5 0 5-8-6-4-202468 loading plot PTENERBB2MYC Figure S4.5: Score plot (left) and loading plot (right) are derived from a threecomponents logistic PCA model on the CNA data X .2 Chapter 4. 
Figure S4.6: Loading plot derived from the three-component logistic PCA model on the CNA data X₁. The legend indicates whether a CNA feature is an amplification or a deletion.

Figure S4.7: The relationship between the PC1 scores and the frequency of aberrations of the samples, derived from the three-component logistic PCA model on the CNA data X₁.

Figure S4.8: Loading plot derived from the three-component logistic PCA model on the CNA data X₁. The annotation indicates features that lie in the same chromosome region (Chromosome X, Chromosome 3p, Chromosome 8p).

Figure S4.9: Positive frequencies of the MYC, ERBB2 and PTEN features in the three different cancer types (BRCA, LUAD and SKCM).

Chapter 5

Fusing multiple data sets with two types of heterogeneity

Multiple sets of measurements on the same objects obtained from different platforms may reflect partially complementary information on the studied system. The integrative analysis of such data sets not only provides the opportunity for a deeper understanding of the studied system, but also introduces new statistical challenges. First, the separation of the information that is common across all or some of the data sets from the information that is specific to each data set is problematic. Furthermore, these data sets are often a mix of quantitative and qualitative (binary or categorical) data types, while commonly used data fusion methods require all data sets to be quantitative. Both of these types of heterogeneity should be taken into account in the data fusion. In this chapter, we propose an exponential family simultaneous component analysis (ESCA) model to tackle the potential mixed data types problem of multiple data sets. In addition, a structured sparse pattern of the loading matrix is induced through a nearly unbiased group concave penalty to disentangle the global, local common and distinct information of the multiple data sets. A Majorization-Minimization based algorithm is derived to fit the proposed model. Analytic solutions are derived for updating all the parameters of the model in each iteration, and the algorithm decreases the objective function monotonically in each iteration. For model selection, a missing-value based cross validation procedure is implemented. The advantages of the proposed method in comparison with other approaches are assessed using comprehensive simulations as well as the analysis of real data from a chronic lymphocytic leukaemia (CLL) study.

This chapter is based on Song, Y., Westerhuis, J.A. and Smilde, A.K., 2019. Separating common (global and local) and distinct variation in multiple mixed types data sets. arXiv preprint arXiv:1902.06241.

Multiple data sets measured on the same samples are becoming increasingly common in different research areas, from biology and food science to psychology. One typical example from biological research is the GDSC1000 study, in which 926 cell lines are fully characterized with respect to point mutation, copy number alteration (CNA), methylation, gene expression and drug responses [30].
However,these comprehensive measurements from the same cell lines not only provide theopportunity for a deeper understanding of the studied biological system, but alsointroduce statistical challenges.The first challenge is how to separate the information that is common acrossall or some of the data sets, and the information which is specific to each dataset (often called distinct). These different sources of information have to bedisentangled from every data set to have a holistic understanding of the studiedsystem. The second challenge is that measurements from different platformscan be of different data type, such as binary, quantitative or counts. Thesedifferent data types have different mathematical properties, which should be takeninto account in the data analysis. For example, the measurement of a binaryvariable only has two possible exclusive outcomes, often classified as “1” and “0”.Examples of binary data in biology include point mutation, and the binarizedCNA and methylation data sets [30]. Taking binary measurements “1”, “0”as the quantitative values 1, 0, and casting them into the classical data fusionmethods that assume data sets to be quantitative, clearly neglects their binarynature.In this chapter, we focus on the component or latent variable based data fu-sion approaches, although other approaches exist such as undirected graphicalmodel based methods which are able to explore the association between data setsof different data types [77], or between variables of different data types [78, 79].Two commonly used latent variable based data fusion methods are simultane-ous component analysis (SCA) [65] and iCluster [80], which both focus on usinglow dimensional structures to approximate the common variation across all thedata sets. Both of these approaches have already been generalized to qualitativedata sets [67, 64]. In addition, the concept of common and distinct variation indata fusion has been framed in [81, 14], and several methods [82, 83, 84, 85, 14]have been proposed. One typical example is JIVE [82]. The JIVE model directlyspecifies the components for the global common variation (variation across all thedata sets) and the distinct variation (variation specific to each data set) in themodel, and estimates them simultaneously. However, JIVE model is incapable totackle the qualitative data sets and the local common variation (variation acrosssome of the data sets) is ignored. Direct generalization of JIVE to account forthe local common variation is infeasible as with the increase of the number ofdata sets, the possible combinations of local common variation blows up expo-nentially. Other methods [84, 83] encounter similar problems with respect to the .1. Background L q :0 In this section, we will introduce the generalization of the ESCA model. Then wewill show how to use the group concave penalty to induce the structured sparsepattern on the loading matrix of an ESCA model to disentangle the common(global and local) variation and distinct variation of multiple data sets. The quantitative measurements from L different platforms on the same I objectsresult into L quantitative data sets, { X l } Ll =1 , and the l th data set X l ( I × J l ) has J l variables. 
In the classical SCA model, we decompose the L data sets as X l = µ T l + AB T l + E l , in which ( I × 1) is a column vector with ones; µ l ( J l × 1) is thecolumn offset term; A ( I × R ) is the common score matrix; B l ( J l × R ) and E l ( I × J l )are the loading matrix and residual term respectively for X l and R is the numberof components. In addition, constraints A T A = I and T A = , in which I isthe identity matrix, are imposed to make the model identifiable. The SCA modeltries to discover the common column subspace, which is spanned by the columnsof the score matrix A , in L data sets to represent their common information.The column offset terms { µ l } Ll =1 can be removed by column centering of thecorresponding data sets { X l } Ll =1 . The parameters in the SCA model can beestimated by minimizing the sum of squares (cid:80) Ll w l || X l − µ T l − AB T l || F , in which w l is the relative weight of the l th data set X l .The least squares loss criterion in the classical SCA model is only appropriatefor quantitative data sets. When some or all data sets are of another data type,such as binary data, classical SCA model is not appropriate anymore. Motivatedby the previous research on exponential family PCA model [34], we use the ex-ponential family distribution to account for the different data types of multipledata sets, such as Bernoulli distribution for binary data, Poisson distribution forcount data and Gaussian distribution for quantitative data.Assume x ∈ R follows the exponential dispersion family distribution [13],and θ and α are the natural parameter and the dispersion parameter respec-tively. The probability density or mass function can be specified as p ( x | θ, α ) = .2. P-ESCA model xθ − b ( θ )) /α ] h ( x, α ), in which b ( θ ) is the log-partition function, and h ( x, α )is a function which does not depend on the natural parameter θ . SupplementalTable S5.1 lists the log-partition function b ( θ ) and its first and second orderderivatives b (cid:48) ( θ ), b (cid:48)(cid:48) ( θ ) for Gaussian, Bernoulli and Poisson distributions. Therelationship E( x | θ ) = b (cid:48) ( θ ) always hold in the exponential family distribution.Supplemental Fig. S5.1 visualizes this relationship for the Gaussian, Bernoulliand Poisson distributions. If the l th data set X l is quantitative, according to theprobabilistic interpretation of the PCA model [42], we assume there is a naturalparameter matrix Θ l ( I × J l ) underlying X l , and the low dimensional structure ex-ists in Θ l , X l = Θ l + E l and Θ l = µ T l + A l B T l , and elements in the error term E l follows a normal distribution (cid:15) lij ∼ N(0 , σ ). The conditional mean of the observed X l given the low dimensional structure assumption is E( X l | Θ l ) = b (cid:48) ( Θ ) = Θ l , inwhich b (cid:48) () is the first order derivative of the log-partition function for the Gaus-sian distribution (supplemental Table S5.1). In exponential family PCA, the sameidea has been generalized to other members of exponential family distributions byassuming E( X l | Θ l ) = b (cid:48) ( Θ l ) and Θ l = µ T l + A l B T l , in which the function formof b (cid:48) () depends on the used probability distribution (supplemental Table S5.1).In the exponential family PCA model, the elements in X l are conditionally in-dependent given the low dimensional structure assumption as Θ l = µ T l + A l B T l .Take x lij and θ lij as the ij th element of X l and Θ l respectively. 
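Before writing out the likelihood, it may help to make the quantities of supplemental Table S5.1 explicit. The sketch below collects the log-partition function b(θ) and its first two derivatives for the Gaussian, Bernoulli and Poisson distributions, with E(x | θ) = b′(θ); the dictionary layout and names are purely illustrative.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Log-partition function b(theta) and its derivatives (supplemental Table S5.1);
# the conditional mean is E(x | theta) = b'(theta), the variance function b''(theta).
FAMILIES = {
    "gaussian": {
        "b":  lambda t: 0.5 * t ** 2,
        "b1": lambda t: t,
        "b2": lambda t: np.ones_like(t),
    },
    "bernoulli": {
        "b":  lambda t: np.logaddexp(0.0, t),            # log(1 + exp(theta))
        "b1": sigmoid,                                    # logistic (inverse logit) mean
        "b2": lambda t: sigmoid(t) * (1.0 - sigmoid(t)),
    },
    "poisson": {
        "b": np.exp, "b1": np.exp, "b2": np.exp,
    },
}

theta = np.array([-2.0, 0.0, 2.0])
print(FAMILIES["bernoulli"]["b1"](theta))   # conditional means under the logit link
```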
The conditionallog-likelihood of observing X l is log( p ( X l | Θ l , α l )) = (cid:80) Ii (cid:80) J l j log( p ( x lij | θ lij , α l )) = (cid:80) Ii (cid:80) J l j α l ( x lij θ lij − b l ( θ lij )) + c = α l (cid:2) < X l , Θ l > − < T , b l ( Θ l ) > (cid:3) + c , in which <, > indicates the inner product, for matrices, < X l , Θ l > = trace( X T l Θ l ); c , aconstant that does not depend on the unknown parameter Θ l ; b l () and α l are theelement-wise log-partition function and the known dispersion parameter respec-tively for the l th data set X l . In the ESCA model, we assume that the naturalparameter matrices { Θ l } Ll =1 lie in the same column subspace, which is spannedby the common score matrix A . To make the model identifiable, constraints A T A = I and T A = are imposed. The optimization problem associated withthis ESCA model can be expressed as follows,min { µ l } Ll , A , { B l } Ll L (cid:88) l =1 − log( p ( X l | Θ l , α l ))= L (cid:88) l =1 α l (cid:2) < T , b l ( Θ l ) > − < X l , Θ l > (cid:3) + c subject to Θ l = µ T l + AB T l , l = 1 , . . . , L T A = T A = I . (5.1)00 Chapter 5. Fusing multiple data sets with two types of heterogeneity The drawback of the SCA or ESCA models is that only the global commoncomponents, which account for the common variation across all the data sets,is allowed. However, the real situation in multiple data sets integration canbe far more complex as local common variation across some of the data setsand distinct variation in each data set are expected as well. Directly specifyingthe components in the ESCA model for common (global and local) and distinctvariation in the same way as JIVE model [82] is infeasible, as the number ofpossible combinations of local common variation will blow up exponentially withan increasing number of data sets. A promising solution is using structuredsparsity on the loading matrix to disentangle the common (global and local) anddistinct variation indirectly [87, 15]. Structured sparsity of the data set specificloading matrices in component based data fusion methods has been explored by[93, 94]. The idea of using structured sparsity to disentangle the common (globaland local) and distinct variation in multiple quantitative data sets is made explicitin [87, 15]. To illustrate the idea, we use an example with three quantitative datasets. Suppose we construct a SCA model on three column centered quantitativedata sets { X l } l =1 , the common score matrix is A , the corresponding loadingmatrices are { B l } l =1 , and X l = AB T l + E l , in which E l is the residual term for l th data set. If the structured sparsity pattern in { B l } l =1 is expressed as follows, B B B = b , b , b , , , b , , , , , b , , , in which b l,r ∈ R J l indicates the r th column of the l th loading matrix B l , then wehave the following relationships, X = a b T1 , + a b T1 , + a b T1 , + + a b T1 , + + + E X = a b T2 , + a b T2 , + + a b T2 , + + a b T2 , + + E X = a b T3 , + + a b T3 , + a b T3 , + + + a b T3 , + E . Here a r indicates the r th column of the common score matrix A . The first com-ponent represents the global common variation across three data sets; the 2 nd ,3 nd and 4 nd components represent the local common variation across two datasets and the 5 nd , 6 nd and 7 nd components represent the distinct variation specificto each single data set. 
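Given estimated loading blocks, the type of each component follows directly from which blocks have a nonzero column, exactly as the three data set example above illustrates. A small sketch (with an illustrative function name and tolerance) could classify the components as follows.

```python
import numpy as np

def component_types(B_blocks, tol=1e-8):
    """Classify each component by the data sets it loads on.

    B_blocks: list of J_l x R loading matrices [B_1, ..., B_L].
    A component is global common if all blocks have a nonzero column,
    local common if a strict subset of at least two blocks does, and
    distinct if only a single block does."""
    R = B_blocks[0].shape[1]
    return [tuple(l for l, B in enumerate(B_blocks)
                  if np.linalg.norm(B[:, r]) > tol)
            for r in range(R)]

# e.g. [(0, 1, 2), (0, 1), (2)] -> global common, local common, distinct
```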
In this way, the structured sparsity pattern in the load-ing matrices { B l } l =1 can be used to separate the common (global and local) anddistinct variation of multiple quantitative data sets. .2. P-ESCA model In [93, 94, 15], the structured sparsity is induced by a group lasso penalty on thecolumns of { B l } L . The used group lasso penalty is λ (cid:80) l (cid:80) r || b l,r || , in which λ isthe tuning parameter, b l,r indicates the r th column of the l th loading matrix B l ,and || || indicates the L norm of a vector. This group lasso penalty shrinks || b l,r || as a lasso penalty and the elements inside b l,r as a ridge penalty. How-ever, lasso type penalty leads to biased parameter estimation as the same degreeof shrinkage is applied to all the parameters, which will shrink the nonzero pa-rameters too much and makes the prediction or cross validation error based modelselection procedures inconsistent [90, 91]. This leads in general to the selection oftoo complex models. The SLIDE model [15] solves the model selection problem ina two stages manner. First, varying degrees of regularization are imposed to in-duce a series of structured sparse loading patterns. Then these structured sparsepatterns are taken as hard constraints on a new SCA model, in which a Bi-crossvalidation procedure [88] is used for the final selection. This two stages approachis similar to the often used re-estimation trick in lasso regression. However, such atwo-step strategy cannot easily be generalized to the ESCA model. For example,if a binary data set is used and the structured sparse pattern is imposed as a hardconstraint on the loading matrices in a ESCA model, the estimated loadings ofthe binary data set can easily go to infinity [51, 64].The above issue introduced by the biased estimation of lasso type penaltiescan be alleviated by using concave penalties [57, 27], which can achieve sparsesolutions and nearly unbiased parameter estimation simultaneously. Therefore,in this chapter, we applied group concave penalties, generalized double Pareto(GDP) shrinkage [27] and bridge ( L q :0 B1B2B3 Figure 5.1: How the group GDP ( γ = 1) penalty induces structured sparse patternon { B } l =1 . Values inside the plot indicate the L norm of the correspondingloading vector b l,r . Top: loading matrix before thresholding; bottom: loadingmatrix after thresholding.Table 5.1: Three commonly used group penalty functions. Take σ as the L normof a group of elements. q and γ are the tuning parameters. The supergradient isthe counter concept of the subgradient for a concave function. When the concavefunction is differentiable everywhere, the supergradient is the gradient.penalty formula supergradientgroup lasso σ L q :0 Take f l ( Θ l ) = α l [ < W l , b l ( Θ l ) > − < W l (cid:12) X l , Θ l > ] as the loss function forfitting the l th data set X l , and g l ( B l ) = (cid:80) r g ( σ lr ) as the group concave penaltyfor the l th loading matrix B l . We can majorize f l ( Θ l ) and g l ( B l ) respectively asfollows. The majorization of f l ( Θ l )Given ˜ f l ( θ lij ) = b l ( θ lij ) − x lij θ lij , we have f l ( Θ l ) = α l (cid:80) i (cid:80) j w lij ˜ f l ( θ lij ). The firstand second gradients of ˜ f l ( θ lij ) with respect to θ lij are ∇ ˜ f l ( θ lij ) = b (cid:48) l ( θ lij ) − x lij and ∇ ˜ f l ( θ lij ) = b (cid:48)(cid:48) l ( θ lij ). Assume that ∇ ˜ f ( θ lij ) is upper bounded by a constant ρ l ,which will be detailed below. 
If θ l represents the general representation of θ lij ,then according to the Taylor’s theorem and the assumption that ∇ ˜ f l ( θ l ) ≤ ρ l for all θ l ∈ domain( ˜ f l ), we have the following inequality,˜ f l ( θ l ) = ˜ f l (( θ l ) k )+ < ∇ ˜ f l (( θ l ) k ) , θ l − ( θ l ) k > +12 ( θ l − ( θ l ) k ) T ∇ ˜ f l (cid:2) ( θ l ) k + t ( θ l − ( θ l ) k ) (cid:3) ( θ l − ( θ l ) k ) ≤ ˜ f l (( θ l ) k )+ < ∇ ˜ f l (( θ l ) k ) , θ l − ( θ l ) k > + ρ l θ l − ( θ l ) k ) = ρ l (cid:20) θ l − ( θ l ) k + 1 ρ l ∇ ˜ f l (( θ l ) k ) (cid:21) + c. (5.3)Here ( θ l ) k is an approximation of θ l at the k th iteration and t ∈ [0 , 1] is anunknown constant. Combining the above inequality and the majorization step[60] of transforming a weighted least square problem to a least squares problem, .3. Algorithm f l ( Θ l ) = 1 α l (cid:88) i (cid:88) j w lij ˜ f l ( θ lij ) ≤ ρ l α l || W l (cid:12) ( Θ l − Θ kl + 1 ρ l ( b (cid:48) l ( Θ kl ) − X l )) || F + c ≤ ρ l α l || Θ l − H kl || F + c H kl = W l (cid:12) ( Θ kl − ρ l ( b (cid:48) l ( Θ kl ) − X l )) + ( T − W l ) (cid:12) Θ kl = Θ kl − ρ l W l (cid:12) ( b (cid:48) l ( Θ kl ) − X l )) , (5.4)in which Θ kl is the approximation of Θ l during the k th iteration. For the Bernoullidistribution, an elegant bound b (cid:48)(cid:48) ( θ ) ≤ . 25 can be used [43]; for the Gaussianlikelihood, b (cid:48)(cid:48) ( θ ) = 1; for the Poisson distribution, b (cid:48)(cid:48) ( θ ) is unbounded, however,we can always set ρ l = max( b (cid:48)(cid:48) ( Θ kl )). The majorization of g l ( B l )Assume B kl is the k th approximation of B l , and σ klr = || b kl,r || . According tothe definition of a concave function [61], we always have the inequality g ( σ lr ) ≤ g ( σ klr ) + ω klr ( σ lr − σ klr ) = ω klr σ lr + c , in which ω klr ∈ ∂g ( σ klr ) and ∂g ( σ klr ) is theset of supergradients (the counterpart concept of the subgradient for a concavefunction) of the function g () at σ klr . When the supergradient is unique, then ω klr = ∂g ( σ klr ). Therefore, g l ( B l ) = (cid:80) r g ( σ lr ) can be majorized as follows, g l ( B l ) = (cid:88) r g ( σ lr ) ≤ (cid:88) r ω klr σ lr + cω klr ∈ ∂g ( σ klr ) . (5.5) The majorization of the regularized likelihood criterion Combining the above two majorization steps, we have majorized the originalcomplex problem in the equation 5.2 to a simper problem in each iteration as06 Chapter 5. Fusing multiple data sets with two types of heterogeneity follows, min { µ l } Ll , A , { B l } Ll L (cid:88) l =1 (cid:104) ρ l α l || Θ l − H kl || F + λ l (cid:112) J l (cid:88) r ω klr σ lr (cid:105) subject to Θ l = µ T l + AB T l , l = 1 . . . L T A = T A = I σ lr = || b l,r || , l = 1 . . . L, r = 1 . . . R H kl = Θ kl − ρ l W l (cid:12) ( b (cid:48) l ( Θ kl ) − X l ) , l = 1 . . . Lω klr ∈ ∂g ( σ klr ) , l = 1 . . . L, r = 1 . . . R. (5.6) The majorized optimization problem in equation 5.6 can be solved by the blockcoordinate descent approach, and the analytic solution can be derived for all theparameters. Updating { µ l } L When fixing all other parameters except µ l , the analytic solution of µ l in equation5.6 is simply the column mean of H kl , µ l = I ( H kl ) T . Updating A When fixing all other parameters except A , and deflating the offset term { µ l } L ,the loss function in equation 5.6 becomes (cid:80) Ll =1 ρ l α l || AB T l − JH kl || F + c , in which J = I − I T is the column centering matrix. 
If we take $d_l = \sqrt{\rho_l/\alpha_l}$, the above expression can also be written as $\sum_{l=1}^{L}\frac{1}{2}\|\mathbf{A}\,d_l\mathbf{B}_l^T - d_l\mathbf{J}\mathbf{H}_l^k\|_F^2$. To simplify the equations, we set $\widetilde{\mathbf{B}}_l = d_l\mathbf{B}_l$ and $\widetilde{\mathbf{JH}}_l^k = d_l\mathbf{J}\mathbf{H}_l^k$. Then we take $\widetilde{\mathbf{B}}$ as the row concatenation of $\{\widetilde{\mathbf{B}}_l\}_{l=1}^{L}$, $\widetilde{\mathbf{B}}^T = [\widetilde{\mathbf{B}}_1^T \ldots \widetilde{\mathbf{B}}_l^T \ldots \widetilde{\mathbf{B}}_L^T]$, and $\widetilde{\mathbf{JH}}^k$ as the column concatenation of $\{\widetilde{\mathbf{JH}}_l^k\}_{l=1}^{L}$, $\widetilde{\mathbf{JH}}^k = [\widetilde{\mathbf{JH}}_1^k \ldots \widetilde{\mathbf{JH}}_l^k \ldots \widetilde{\mathbf{JH}}_L^k]$. After that, we have $\sum_{l=1}^{L}\frac{\rho_l}{2\alpha_l}\|\mathbf{A}\mathbf{B}_l^T - \mathbf{J}\mathbf{H}_l^k\|_F^2 = \sum_{l=1}^{L}\frac{1}{2}\|\mathbf{A}\widetilde{\mathbf{B}}_l^T - \widetilde{\mathbf{JH}}_l^k\|_F^2 = \frac{1}{2}\|\mathbf{A}\widetilde{\mathbf{B}}^T - \widetilde{\mathbf{JH}}^k\|_F^2$. Updating $\mathbf{A}$ is therefore equivalent to minimizing $\|\mathbf{A}\widetilde{\mathbf{B}}^T - \widetilde{\mathbf{JH}}^k\|_F^2$ subject to $\mathbf{A}^T\mathbf{A} = \mathbf{I}$. If the SVD of $\widetilde{\mathbf{JH}}^k\widetilde{\mathbf{B}}$ is $\widetilde{\mathbf{JH}}^k\widetilde{\mathbf{B}} = \mathbf{U}\mathbf{D}\mathbf{V}^T$, the analytic solution for $\mathbf{A}$ is $\mathbf{A} = \mathbf{U}\mathbf{V}^T$. The derivation of this solution is shown in the following paragraph.

To simplify the derivation, take $\mathbf{B} = \widetilde{\mathbf{B}}$ and $\mathbf{C} = \widetilde{\mathbf{JH}}^k$. The optimization problem is $\min_{\mathbf{A}}\|\mathbf{A}\mathbf{B}^T - \mathbf{C}\|_F^2$, s.t. $\mathbf{A}^T\mathbf{A} = \mathbf{I}$. The objective can be expanded as $\|\mathbf{A}\mathbf{B}^T - \mathbf{C}\|_F^2 = \mathrm{tr}(\mathbf{B}\mathbf{A}^T\mathbf{A}\mathbf{B}^T) - 2\,\mathrm{tr}(\mathbf{B}\mathbf{A}^T\mathbf{C}) + \mathrm{tr}(\mathbf{C}^T\mathbf{C})$. Since $\mathbf{A}^T\mathbf{A} = \mathbf{I}$, the above optimization problem is equivalent to maximizing a trace function, $\max_{\mathbf{A}}\mathrm{tr}(\mathbf{B}\mathbf{A}^T\mathbf{C})$, s.t. $\mathbf{A}^T\mathbf{A} = \mathbf{I}$. Assume the SVD of $\mathbf{C}\mathbf{B}$ is $\mathbf{C}\mathbf{B} = \mathbf{U}\mathbf{D}\mathbf{V}^T$; then $\mathrm{tr}(\mathbf{B}\mathbf{A}^T\mathbf{C}) = \mathrm{tr}(\mathbf{A}^T\mathbf{C}\mathbf{B}) = \mathrm{tr}(\mathbf{A}^T\mathbf{U}\mathbf{D}\mathbf{V}^T) = \mathrm{tr}(\mathbf{V}^T\mathbf{A}^T\mathbf{U}\mathbf{D})$. According to the Kristof theorem [41], $\mathrm{tr}(\mathbf{V}^T\mathbf{A}^T\mathbf{U}\mathbf{D}) \leq \sum_r d_{rr}$, in which $d_{rr}$ is the $r$th diagonal element of $\mathbf{D}$, and this upper bound is attained by setting $\mathbf{A} = \mathbf{U}\mathbf{V}^T$.

Updating $\{\mathbf{B}_l\}_{l=1}^{L}$

Because $\mathbf{A}^T\mathbf{A} = \mathbf{I}$, it is easy to show that $\|\mathbf{A}\mathbf{B}_l^T - \mathbf{J}\mathbf{H}_l^k\|_F^2 = \|\mathbf{B}_l - (\mathbf{J}\mathbf{H}_l^k)^T\mathbf{A}\|_F^2 + c$, where $c$ does not depend on $\mathbf{B}_l$. Also, because the least squares problem is separable over the columns, $\|\mathbf{B}_l - (\mathbf{J}\mathbf{H}_l^k)^T\mathbf{A}\|_F^2 = \sum_r\|\mathbf{b}_{l,r} - (\mathbf{J}\mathbf{H}_l^k)^T\mathbf{a}_r\|_2^2$, in which $\mathbf{a}_r$ is the $r$th column of $\mathbf{A}$. In this way, we have the following optimization problem,

$$\begin{aligned}
\min_{\mathbf{B}_l}\;& \frac{\rho_l}{2\alpha_l}\|\mathbf{A}\mathbf{B}_l^T - \mathbf{J}\mathbf{H}_l^k\|_F^2 + \lambda_l\sqrt{J_l}\sum_r\omega_{lr}^k\sigma_{lr}\\
=\;& \frac{\rho_l}{2\alpha_l}\|\mathbf{B}_l - (\mathbf{J}\mathbf{H}_l^k)^T\mathbf{A}\|_F^2 + \lambda_l\sqrt{J_l}\sum_r\omega_{lr}^k\sigma_{lr}\\
=\;& \sum_r\Big[\frac{\rho_l}{2\alpha_l}\big\|\mathbf{b}_{l,r} - (\mathbf{J}\mathbf{H}_l^k)^T\mathbf{a}_r\big\|_2^2 + \lambda_l\sqrt{J_l}\,\omega_{lr}^k\sigma_{lr}\Big]\\
\text{subject to}\;& \sigma_{lr} = \|\mathbf{b}_{l,r}\|_2,\; l = 1\ldots L,\; r = 1\ldots R. \qquad (5.7)
\end{aligned}$$

The above optimization problem is equivalent to evaluating the proximal operator of an $L_2$ (Euclidean) norm, for which an analytic solution exists [25]. Taking $\tilde{\lambda}_{lr} = \lambda_l\sqrt{J_l}\,\omega_{lr}^k\alpha_l/\rho_l$, the analytical solution for $\mathbf{b}_{l,r}$ is

$$\mathbf{b}_{l,r} = \max\Big(0,\; 1 - \frac{\tilde{\lambda}_{lr}}{\|(\mathbf{J}\mathbf{H}_l^k)^T\mathbf{a}_r\|_2}\Big)(\mathbf{J}\mathbf{H}_l^k)^T\mathbf{a}_r.$$

To update the parameter $\mathbf{B}_l$, we simply apply this proximal operator to all the columns of $\mathbf{B}_l$.

Initialization and stopping criteria

The parameters $\{\boldsymbol{\mu}_l\}_{l=1}^{L}$, $\mathbf{A}$, $\{\mathbf{B}_l\}_{l=1}^{L}$ can be initialized with the results of a classical SCA model on $\{\mathbf{X}_l\}_{l=1}^{L}$, or with user-supplied values. The relative change of the objective function is used as the stopping criterion. Pseudocode of the algorithm described above is shown in Algorithm 3, in which $f^k$ is the value of the objective function in the $k$th iteration and $\epsilon_f$ is the tolerance for the relative change of the objective function.
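The two analytic block updates derived above, the orthogonal score update and the group soft-thresholding of the loading columns, are sketched below in Python. This is a minimal illustration under the assumption that the centered majorization targets $\mathbf{J}\mathbf{H}_l^k$ are supplied as a list; the variable names are illustrative only.

```python
import numpy as np

def update_scores(JH, B, rho, alpha):
    """Orthogonal Procrustes update: A = U V^T from the SVD of JH_tilde @ B_tilde."""
    d = [np.sqrt(r / a) for r, a in zip(rho, alpha)]
    JH_tilde = np.hstack([dl * H for dl, H in zip(d, JH)])    # column concatenation
    B_tilde = np.vstack([dl * Bl for dl, Bl in zip(d, B)])    # row concatenation
    U, _, Vt = np.linalg.svd(JH_tilde @ B_tilde, full_matrices=False)
    return U @ Vt

def update_loadings(JHl, A, lam_tilde):
    """Group soft-thresholding of each column of (J H_l^k)^T A (equation 5.7)."""
    V = JHl.T @ A                                             # unpenalized target
    norms = np.linalg.norm(V, axis=0)
    scale = np.maximum(0.0, 1.0 - lam_tilde / np.maximum(norms, 1e-12))
    return V * scale                                          # weak columns become exactly 0
```

Here `lam_tilde` holds the per-column thresholds $\tilde{\lambda}_{lr}$; columns whose norm falls below the threshold are set to zero, which is what produces the structured sparse pattern.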
For a quantitative data set $\mathbf{X}_l$, the parameters are $\boldsymbol{\mu}_l$, $\mathbf{A}$ and $\mathbf{B}_l$. The total variation explained ratio of the model for $\mathbf{X}_l$ is defined as $\mathrm{varExp}_l = 1 - \|\mathbf{W}_l\odot(\mathbf{X}_l - \mathbf{1}\boldsymbol{\mu}_l^T - \mathbf{A}\mathbf{B}_l^T)\|_F^2\,/\,\|\mathbf{W}_l\odot(\mathbf{X}_l - \mathbf{1}\boldsymbol{\mu}_l^T)\|_F^2$, and the variation explained ratio of the $r$th component on $\mathbf{X}_l$ is defined as $\mathrm{varExp}_{lr} = 1 - \|\mathbf{W}_l\odot(\mathbf{X}_l - \mathbf{1}\boldsymbol{\mu}_l^T - \mathbf{a}_r\mathbf{b}_{l,r}^T)\|_F^2\,/\,\|\mathbf{W}_l\odot(\mathbf{X}_l - \mathbf{1}\boldsymbol{\mu}_l^T)\|_F^2$. For a binary data set, we use a similar strategy as the MOFA model [89]: $\mathbf{H}_l^k$ is taken as a pseudo $\mathbf{X}_l$ during the $k$th iteration, and $\mathbf{H}_l^k$ rather than $\mathbf{X}_l$ is used to compute the variation explained ratios. The multiple data sets can also be taken as a single full data set. In that case the values $\{1/\sqrt{\alpha_l}\}_{l=1}^{L}$ are taken as the weights of the data sets, and the variation explained ratios of each component for this full data set can be computed. The full single data set $\widetilde{\mathbf{X}}$ and the weight matrix $\widetilde{\mathbf{W}}$ are the column concatenations of $\{(1/\sqrt{\alpha_l})\mathbf{X}_l\}_{l=1}^{L}$ and $\{\mathbf{W}_l\}_{l=1}^{L}$, in which $\mathbf{X}_l$ is replaced by $\mathbf{H}_l^k$ if the $l$th data set is not quantitative. The offset term $\widetilde{\boldsymbol{\mu}}$ and the loading matrix $\widetilde{\mathbf{B}}$ are the row concatenations of $\{(1/\sqrt{\alpha_l})\boldsymbol{\mu}_l\}_{l=1}^{L}$ and $\{(1/\sqrt{\alpha_l})\mathbf{B}_l\}_{l=1}^{L}$, and the score matrix $\widetilde{\mathbf{A}} = \mathbf{A}$.

Algorithm 3 An MM algorithm for fitting the P-ESCA model.
Input: $\{\mathbf{X}_l\}_{l=1}^{L}$, $\{\alpha_l\}_{l=1}^{L}$, $g()$, $\{\lambda_l\}_{l=1}^{L}$, $\gamma$;
Output: $\hat{\boldsymbol{\mu}}$, $\hat{\mathbf{A}}$, $\hat{\mathbf{B}}$;
Compute $\{\mathbf{W}_l\}_{l=1}^{L}$ for missing values in $\{\mathbf{X}_l\}_{l=1}^{L}$;
Initialize $\{\boldsymbol{\mu}_l\}_{l=1}^{L}$, $\mathbf{A}$, $\{\mathbf{B}_l\}_{l=1}^{L}$; $\mathbf{\Theta}_l = \mathbf{1}\boldsymbol{\mu}_l^T + \mathbf{A}\mathbf{B}_l^T$, $l = 1\ldots L$; $k = 0$;
while $(f^{k-1} - f^{k})/f^{k-1} > \epsilon_f$ do
  for $l = 1\ldots L$ do
    Estimate $\rho_l$ according to the data type of $\mathbf{X}_l$;
    $\mathbf{H}_l^k = \mathbf{\Theta}_l^k - \frac{1}{\rho_l}\mathbf{W}_l\odot(b_l'(\mathbf{\Theta}_l^k) - \mathbf{X}_l)$;
    $\boldsymbol{\mu}_l^{k+1} = \frac{1}{I}(\mathbf{H}_l^k)^T\mathbf{1}$;
    $\widetilde{\mathbf{B}}_l^k = \sqrt{\rho_l/\alpha_l}\,\mathbf{B}_l^k$; $\widetilde{\mathbf{JH}}_l^k = \sqrt{\rho_l/\alpha_l}\,\mathbf{J}\mathbf{H}_l^k$;
  end for
  $(\boldsymbol{\mu}^{k+1})^T = [(\boldsymbol{\mu}_1^{k+1})^T \ldots (\boldsymbol{\mu}_l^{k+1})^T \ldots (\boldsymbol{\mu}_L^{k+1})^T]$;
  $(\widetilde{\mathbf{B}}^{k})^T = [(\widetilde{\mathbf{B}}_1^{k})^T \ldots (\widetilde{\mathbf{B}}_l^{k})^T \ldots (\widetilde{\mathbf{B}}_L^{k})^T]$; $\widetilde{\mathbf{JH}}^{k} = [\widetilde{\mathbf{JH}}_1^{k} \ldots \widetilde{\mathbf{JH}}_l^{k} \ldots \widetilde{\mathbf{JH}}_L^{k}]$;
  $\mathbf{U}\mathbf{D}\mathbf{V}^T = \widetilde{\mathbf{JH}}^{k}\widetilde{\mathbf{B}}^{k}$; $\mathbf{A}^{k+1} = \mathbf{U}\mathbf{V}^T$;
  for $l = 1\ldots L$ do
    for $r = 1\ldots R$ do
      $\sigma_{lr}^k = \|\mathbf{b}_{l,r}^k\|_2$; $\omega_{lr}^k \in \partial g(\sigma_{lr}^k)$; $\tilde{\lambda}_{lr} = \lambda_l\sqrt{J_l}\,\omega_{lr}^k\alpha_l/\rho_l$;
      $\mathbf{b}_{l,r}^{k+1} = \max\big(0,\, 1 - \tilde{\lambda}_{lr}/\|(\mathbf{J}\mathbf{H}_l^k)^T\mathbf{a}_r^{k+1}\|_2\big)(\mathbf{J}\mathbf{H}_l^k)^T\mathbf{a}_r^{k+1}$;
    end for
    $\mathbf{B}_l^{k+1} = [\mathbf{b}_{l,1}^{k+1} \ldots \mathbf{b}_{l,r}^{k+1} \ldots \mathbf{b}_{l,R}^{k+1}]$;
  end for
  $(\mathbf{B}^{k+1})^T = [(\mathbf{B}_1^{k+1})^T \ldots (\mathbf{B}_l^{k+1})^T \ldots (\mathbf{B}_L^{k+1})^T]$; $k = k + 1$;
end while
Compute variation explained ratios.

5.4 Simulation process

To evaluate the proposed model and the model selection procedure, three data sets of different data types with underlying global, local common and distinct structures are simulated. The following simulations and experiments focus on the quantitative and binary data types. We first describe the simulation of the structure $\mathbf{A}\mathbf{B}^T$, in which $\mathbf{B}$ is the row concatenation of $\{\mathbf{B}_l\}_{l=1}^{3}$, $\mathbf{B}^T = [\mathbf{B}_1^T\ \mathbf{B}_2^T\ \mathbf{B}_3^T]$. The structure $\mathbf{A}\mathbf{B}^T$ can be expressed in SVD form as $\mathbf{A}\mathbf{B}^T = \mathbf{U}\mathbf{D}\mathbf{V}^T$ ($\mathbf{A} = \mathbf{U}$, $\mathbf{B} = \mathbf{V}\mathbf{D}$), in which $\mathbf{U}^T\mathbf{U} = \mathbf{I}$, $\mathbf{D}$ is a diagonal matrix, and the structured sparse pattern exists in the matrix $\mathbf{V}$. First, all the elements in $\mathbf{U}$ and $\mathbf{V}$ are simulated from the standard normal distribution. To ensure that $\mathbf{1}^T\mathbf{U} = \mathbf{0}$, the simulated $\mathbf{U}$ is first column centered, and then it is orthogonalized by the SVD algorithm to have $\mathbf{U}^T\mathbf{U} = \mathbf{I}$. Also, $\mathbf{V}$ is orthogonalized by the QR algorithm to obtain $\mathbf{V}^T\mathbf{V} = \mathbf{I}$.
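A minimal sketch of these two orthogonalization steps is given below; the function name is illustrative only.

```python
import numpy as np

def simulate_bases(I, J, R, rng=np.random.default_rng(0)):
    """Generate a column-centered orthonormal U and an orthonormal V, as described above."""
    U = rng.standard_normal((I, R))
    U -= U.mean(axis=0, keepdims=True)                 # enforce 1^T U = 0
    U, _, _ = np.linalg.svd(U, full_matrices=False)    # left singular vectors: U^T U = I
    V = rng.standard_normal((J, R))
    V, _ = np.linalg.qr(V)                             # QR orthogonalization: V^T V = I
    return U, V
```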
In this example 21 components are predefined: 7 groups of global, local common and distinct nature, with 3 components each. The structure of these components is set in $\mathbf{V}$ as indicated below,

$$\mathbf{V} = \begin{bmatrix}\mathbf{V}_1\\ \mathbf{V}_2\\ \mathbf{V}_3\end{bmatrix} = \begin{bmatrix}\mathbf{V}_{1,1} & \mathbf{V}_{1,2} & \mathbf{V}_{1,3} & \mathbf{0} & \mathbf{V}_{1,5} & \mathbf{0} & \mathbf{0}\\ \mathbf{V}_{2,1} & \mathbf{V}_{2,2} & \mathbf{0} & \mathbf{V}_{2,4} & \mathbf{0} & \mathbf{V}_{2,6} & \mathbf{0}\\ \mathbf{V}_{3,1} & \mathbf{0} & \mathbf{V}_{3,3} & \mathbf{V}_{3,4} & \mathbf{0} & \mathbf{0} & \mathbf{V}_{3,7}\end{bmatrix},$$

in which $\mathbf{V}_{1,1}$ indicates the loadings of the first three components for data set 1, etc.; the seven column blocks correspond to the global structure (C123), the local common structures (C12, C13, C23) and the distinct structures (D1, D2, D3). After that, 21 values are sampled from a normal distribution with mean 1 to form the diagonal matrix $\mathbf{D}$. Furthermore, an extra diagonal matrix $\mathbf{C}$, which has the same size as the matrix $\mathbf{D}$, is used to adjust the signal to noise ratios (SNRs) in simulating the different global, local common and distinct structures. Then we have $\mathbf{A}\mathbf{B}^T = \mathbf{U}(\mathbf{C}\odot\mathbf{D})\mathbf{V}^T$. In order to define the SNR, we have to specify the noise term $\mathbf{E}_l$ for the $l$th data set $\mathbf{X}_l$. If $\mathbf{X}_l$ is quantitative, all the elements in $\mathbf{E}_l$ are sampled from $N(0, \alpha_l)$. If $\mathbf{X}_l$ is binary, according to the latent variable interpretation of logistic PCA [95], we assume there is a continuous latent matrix $\mathbf{X}_l^*$ underlying the binary observation $\mathbf{X}_l$, and the elements of the noise term $\mathbf{E}_l$ follow the standard logistic distribution. After the specification of the noise terms, we can adjust the diagonal elements in $\mathbf{C}$ to satisfy the predefined SNRs in simulating the global, local common and distinct structures. We restrict the diagonal elements of $\mathbf{C}$ belonging to the same structure to share a single value in order to have a unique solution. For example, the global structure is $\mathrm{C123} = \mathbf{U}_{:,1:3}(\mathbf{C}_{1:3,1:3}\odot\mathbf{D}_{1:3,1:3})\mathbf{V}_{:,1:3}^T$, the corresponding noise term is $\mathbf{E} = [\mathbf{E}_1\ \mathbf{E}_2\ \mathbf{E}_3]$, and the SNR of the global structure is defined as $\mathrm{SNR} = \|\mathrm{C123}\|_F^2\,/\,\|\mathbf{E}\|_F^2$. The SNRs for the simulation of the local common (C12, C13, C23) and distinct (D1, D2, D3) structures are defined in the same way.

If $\mathbf{X}_l$ is quantitative, we simply sample all the elements in $\boldsymbol{\mu}_l$ from the standard normal distribution. If $\mathbf{X}_l$ is binary, the column offset $\boldsymbol{\mu}_l$ represents the logit transformation of the marginal probabilities of the binary variables. In our simulation, we first sample $J_l$ marginal probabilities from a Beta distribution, which is specified in the following way. For example, if we have 100 samples of a binary variable and we assume the marginal probability to be 0.1, we expect to observe 10 "1"s among the 100 samples; if we use a uniform prior distribution for the marginal probability $\pi$, the posterior distribution of $\pi$ is $\pi \sim \mathrm{Beta}(11, 91)$ [96]. After generating $J_l$ marginal probabilities from this Beta distribution, the logit transformation of this vector of probabilities is set as $\boldsymbol{\mu}_l$.

If $\mathbf{X}_l$ is quantitative, $\mathbf{X}_l$ is simulated as $\mathbf{X}_l = \mathbf{1}\boldsymbol{\mu}_l^T + \mathbf{A}\mathbf{B}_l^T + \mathbf{E}_l$, and all the elements of $\mathbf{E}_l$ are sampled from $N(0, \alpha_l)$. If $\mathbf{X}_l$ is binary, we have $\mathbf{\Theta}_l = \mathbf{1}\boldsymbol{\mu}_l^T + \mathbf{A}\mathbf{B}_l^T$, and all the elements of $\mathbf{X}_l$ are sampled from Bernoulli distributions whose probabilities are the corresponding elements of the inverse logit transformation of $\mathbf{\Theta}_l$. An equivalent way to simulate the binary $\mathbf{X}_l$ is to first generate $\mathbf{X}_l^* = \mathbf{1}\boldsymbol{\mu}_l^T + \mathbf{A}\mathbf{B}_l^T + \mathbf{E}_l$, in which all the elements of $\mathbf{E}_l$ are sampled from the standard logistic distribution; then all the elements in $\mathbf{X}_l$ are the binary observations of the corresponding elements of $\mathbf{X}_l^*$: $x_{lij} = 1$ if $x_{lij}^* > 0$, and $x_{lij} = 0$ otherwise.
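The latent-variable route for simulating a binary data set can be sketched as follows, assuming $\boldsymbol{\mu}_l$, $\mathbf{A}$ and $\mathbf{B}_l$ have already been generated as described above; the function name is illustrative only.

```python
import numpy as np

def simulate_binary(mu, A, B, rng=np.random.default_rng(0)):
    """Simulate a binary data set by thresholding a latent matrix with logistic noise."""
    Theta = mu[None, :] + A @ B.T                        # natural parameters 1 mu^T + A B^T
    E = rng.logistic(loc=0.0, scale=1.0, size=Theta.shape)   # standard logistic noise
    X_star = Theta + E                                   # latent continuous matrix X*
    return (X_star > 0).astype(float)                    # x_ij = 1 if x*_ij > 0, else 0
```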
In the following sections, we use Gaussian-Gaussian-Gaussian (G-G-G) to denote the simulation of three quantitative data sets; Bernoulli-Bernoulli-Bernoulli (B-B-B) for the simulation of three binary data sets; G-B-B for a quantitative data set and two binary data sets; and G-G-B for two quantitative data sets and a binary data set.

5.5 Evaluation metrics and model selection

To evaluate the accuracy of the model in estimating the simulated parameters, such as $\mathbf{\Theta}_l$ and $\boldsymbol{\mu}_l$, the relative mean squared error (RMSE) is used. If, for example, the simulated parameter is $\mathbf{\Theta}$, $\mathbf{\Theta} = [\mathbf{\Theta}_1\ \mathbf{\Theta}_2\ \mathbf{\Theta}_3]$, and its estimate is $\hat{\mathbf{\Theta}}$, the RMSE is defined as $\mathrm{RMSE}(\mathbf{\Theta}) = \|\mathbf{\Theta} - \hat{\mathbf{\Theta}}\|_F^2\,/\,\|\mathbf{\Theta}\|_F^2$. The evaluation metrics $\mathrm{RMSE}(\mathbf{\Theta}_l)$, $\mathrm{RMSE}(\mathbf{\Theta})$ and $\mathrm{RMSE}(\boldsymbol{\mu})$ will all be used in the experimental section. To evaluate the recovered subspaces with respect to the simulated global common, local common and distinct structures, the modified RV coefficient [97] is used. If the simulated global structure is C123 and its estimate is $\widehat{\mathrm{C123}}$, the similarity between the subspaces of C123 and $\widehat{\mathrm{C123}}$ is calculated by the modified RV coefficient.

For the real data sets, we can use the cross validation (CV) error as a proxy of the prediction error. For each data set $\mathbf{X}_l$, we randomly select 10% of the non-missing elements as $\mathbf{X}_l^{\mathrm{test}}$, and these selected elements in $\mathbf{X}_l$ are set to missing. The remaining elements form the training set $\mathbf{X}_l^{\mathrm{train}}$. For binary data, the selection of the test set is performed in a stratified manner to handle unbalanced binary data: the test set consists of 10% of the "1"s and 10% of the "0"s, randomly selected from $\mathbf{X}_l$ as $\mathbf{X}_l^{\mathrm{test}}$. A P-ESCA model is constructed on the training sets $\{\mathbf{X}_l^{\mathrm{train}}\}_{l=1}^{L}$ to obtain the estimates $\{\hat{\mathbf{\Theta}}_l\}_{l=1}^{L}$, in which $\hat{\mathbf{\Theta}}_l = \mathbf{1}\hat{\boldsymbol{\mu}}_l^T + \hat{\mathbf{A}}\hat{\mathbf{B}}_l^T$. Then the parameters $\{\hat{\mathbf{\Theta}}_l^{\mathrm{test}}\}_{l=1}^{L}$ corresponding to $\{\mathbf{X}_l^{\mathrm{test}}\}_{l=1}^{L}$ are indexed. The CV error for $\mathbf{X}_l$ is obtained as the negative log likelihood of using $\hat{\mathbf{\Theta}}_l^{\mathrm{test}}$ to predict $\mathbf{X}_l^{\mathrm{test}}$.

If the data sets $\{\mathbf{X}_l\}_{l=1}^{L}$ are of the same data type, a single tuning parameter $\lambda$ is used to replace $\{\lambda_l\}_{l=1}^{L}$ during the model selection. First, $\{\mathbf{X}_l\}_{l=1}^{L}$ are split into $\{\mathbf{X}_l^{\mathrm{train}}\}_{l=1}^{L}$ and $\{\mathbf{X}_l^{\mathrm{test}}\}_{l=1}^{L}$ in the same way as described above. Then $N$ values of $\lambda$ are selected (equally spaced in log-space), and for each $\lambda$ value a P-ESCA model is constructed on the training sets $\{\mathbf{X}_l^{\mathrm{train}}\}_{l=1}^{L}$. A warm start strategy is used, in which the outputs of the previous model are used to initialize the next model with a slightly higher regularization strength. The warm start strategy has a special meaning in the current context: if some component loadings are shrunk to 0 in the previous model, they will also be 0 in the next models with higher $\lambda$ values. Thus, the search space of the next model is constrained based on the structured sparse pattern learned in the previous model. In this way, with increasing $\lambda$, components are removed adaptively. We select the model with the minimum CV error on $\{\mathbf{X}_l^{\mathrm{test}}\}_{l=1}^{L}$, and the corresponding value of $\lambda$ is $\lambda^{\mathrm{opt}}$. After that we re-fit a P-ESCA model with $\lambda^{\mathrm{opt}}$ on the full data sets $\{\mathbf{X}_l\}_{l=1}^{L}$; the outputs of the selected model with minimum CV error are used for initialization in order to preserve the learned structured sparse pattern.
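A schematic sketch of this warm-started $\lambda$ path is given below. The callables `fit_pesca` and `cv_error` are hypothetical placeholders for a P-ESCA fitting routine and the negative log-likelihood on the held-out test elements; neither name comes from the thesis code.

```python
import numpy as np

def select_lambda(lambdas, fit_pesca, cv_error):
    """Warm-started search over a log-spaced lambda grid; returns the lambda with minimum CV error."""
    lambdas = np.sort(np.asarray(lambdas))         # increasing regularization strength
    best_lam, best_err, init = None, np.inf, None
    for lam in lambdas:
        fit = fit_pesca(lam=lam, init=init)        # warm start from the previous fit
        err = cv_error(fit)
        if err < best_err:
            best_lam, best_err = lam, err
        init = fit                                 # loadings already shrunk to 0 stay 0
    return best_lam, best_err
```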
If the data sets are of mixed data types, we prefer to use distinct tuning parameters for each data type. Suppose we have three data sets $\{\mathbf{X}_l\}_{l=1}^{3}$, of which $\mathbf{X}_1$ is quantitative and $\{\mathbf{X}_l\}_{l=2}^{3}$ are binary. We specify two tuning parameters, $\lambda_g$ and $\lambda_b$, for the loading matrices corresponding to the quantitative and binary data sets. A heuristic model selection approach, which has the same computational complexity as tuning a single parameter, can be used. The splitting of $\{\mathbf{X}_l\}_{l=1}^{L}$ into training and test sets is the same as discussed above. Then again, $N$ values of $\lambda_g$ and of $\lambda_b$ are selected, equally spaced in log-space. For the first model, we fix $\lambda_g$ to 0 or a very small value, and tune $\lambda_b$ in the same way as above. The model with the minimum CV error on the binary test sets $\{\mathbf{X}_l^{\mathrm{test}}\}_{l=2}^{3}$ is selected, and the corresponding value of $\lambda_b$ is $\lambda_b^{\mathrm{opt}}$. After that, $\lambda_b$ is fixed to $\lambda_b^{\mathrm{opt}}$, and the outputs of the selected model are used as the initialization for the models in the selection of $\lambda_g$, which is done in the same way as described above. The model with the minimum CV error on the quantitative test set $\mathbf{X}_1^{\mathrm{test}}$ is selected, and the corresponding value of $\lambda_g$ is $\lambda_g^{\mathrm{opt}}$. After the model selection, we re-fit the P-ESCA model on the full data sets $\{\mathbf{X}_l\}_{l=1}^{3}$ with $\lambda_g^{\mathrm{opt}}$ and $\lambda_b^{\mathrm{opt}}$, and again the outputs of the selected model from the model selection process are used for initialization.

The dispersion parameters of the Bernoulli and Poisson distributions can always be set to 1, while for the Binomial distribution with $n$ experiments it can always be set to $n$. However, for a Gaussian distribution the dispersion parameter $\alpha$ represents the variance of the noise term and is assumed to be known. For a data set $\mathbf{X}_l$, we prefer to use a PCA model to estimate $\alpha_l$ before constructing a P-ESCA model. The rank of the PCA model is selected by a missing value based cross validation procedure similar to the one described above. Details of the $\alpha$ estimation procedure are given in the supplementary material. After obtaining an estimate $\hat{\alpha}_l$, it can be plugged into the model, or the data set can be scaled by $\sqrt{\hat{\alpha}_l}$, the estimated standard deviation. We simulated G-G-G, G-G-B and G-B-B data sets to test the $\alpha$ estimation procedure. The parameters in the simulation are set as $I = 100$, $J_1 = 5000$, $J_2 = 500$, $J_3 = 50$; the SNRs of the global, local common and distinct structures are all set to 1, and the marginal probabilities of the binary data sets are simulated as described above. The $\alpha$ estimation procedure was repeated 3 times and the average is taken as the estimate. As shown in supplemental Table S5.2, the mean estimated dispersion parameters in the different situations are quite accurate, and the estimates derived from the 3 repetitions are very stable.

We use the simulated G-G-G data sets as an example to show how the model selection is performed when multiple data sets are of the same data type. The following parameters are used in the simulation: $I = 100$, $J_1 = 1000$, $J_2 = 500$, $J_3 = 100$; the SNRs of the global, local common and distinct structures are all set to 1; all the dispersion parameters $\{\alpha_l\}_{l=1}^{3}$ are set to 1. The signals, which are taken as the singular values of the simulated structures, and the noise terms, which are taken as the singular values of the corresponding residual terms, are characterized in supplemental Fig. S5.3. The true variation explained ratios of each component in every data set are computed using the simulated parameters and visualized in supplemental Fig. S5.4.
For the model selection procedure, the maximum number .6. Experiments (cid:15) f = 10 − ; 30 λ valuesare selected from the interval [1 , { α l } L in the P-ESCA model are set tothe estimated values from the above α estimation procedure.Fig. 5.2 shows how the CV errors, RMSEs and the RV coefficients changewith respect to λ when a P-ESCA model with a group GDP ( γ = 1) penalty isused. The top figures in Fig. 5.2 show that the CV errors change in a similarway as the RMSEs. The model with minimum CV error has low RMSEs in esti-mating the simulated parameters (Fig. 5.2 top right) and correctly identifies thedimensions of the subspaces for the global, local common and distinct structures(Fig. 5.2 bottom). However, when the group lasso penalty is used this was notthe case. Supplemental Fig. S5.5 shows that when a group lasso penalty is used,the models with minimal CV error do not coincide with the correct dimensionsof the subspaces. In the model with minimum CV error, almost all the compo-nents are assigned to the global structure. This result relates to the fact thatthe lasso type penalty over-shrinks the non-zero parameters, and then the CVerror based model selection procedure tends to select a too complex model tocompensate to the biased parameter estimation. On the other hand, as the GDPpenalty achieves nearly unbiased parameter estimation, the CV error based modelselection procedure correctly identifies the correct model.After the model selection, a high precision P-ESCA model ( (cid:15) f = 10 − ) witha group GDP penalty is re-fitted on the full data sets with the value of λ corre-sponding to the minimum CV error and the selected structured sparse pattern.For this selected model, the RMSEs in estimating Θ , Θ , Θ , Θ and µ are0.0259, 0.0239, 0.0285, 0.0335 and 0.0096 respectively. The RV coefficients inestimating the global common structure C123 is 0.9985; local common structuresC12, C13 and C23, 0.9977, 0.9969, 0.9953; the distinct structures D1, D2 and D3,0.9961, 0.9937, 0.9779. The variation explained ratios of each component on thethree data sets computed using the estimated parameters, visualized in Fig. 5.3,are very similar to the true ones in supplemental Fig. S5.4. These values can bevery useful in exploring the constructed model. When applied to multiple quantitative data sets, our model is similar as theSLIDE model, except that we use different penalties and a different model se-lection procedure. The details of the differences between the two approaches aresummarized in the supplementary material. Since the concave GDP penalty iscapable to achieve a nearly unbiased estimation of the parameters, the P-ESCAmodel with a group GDP penalty is expected to achieve similar performance tothe two stages procedure used in the SLIDE model. Therefore, we simulated seven14 Chapter 5. Fusing multiple data sets with two types of heterogeneity C V e rr o r CV error XX X X R M SE RMSE log ( ) R V c oe ff i c i en t -0.500.51 common structures C123C12C13C23 log ( ) R V c oe ff i c i en t -0.500.51 distinct structures D1D2D3 Figure 5.2: The CV errors (top left), RMSEs (top right), RV coefficients of thecommon structures (bottom left), and of distinct structures (bottom right) forvarying λ values for the P-ESCA model with a group GDP ( γ = 1) penalty. Thered cross marker indicates the model with minimum CV error.realistic cases by adjusting the SNRs of the simulated structures to compare theperformance of these two models and their model selection procedures. 
The SNRsof the simulated structures corresponding to these seven cases are listed in sup-plemental Table S5.3. Case 1: only the local common structures exist and theyhave unequal SNRs; case 2: the JIVE case, only the global common and distinctstructures exist, and they are all of low SNRs; case3: all the simulated structuresare of low SNRs; case 4: global common structure dominate the simulation; case5: local common structures dominate the simulation; case 6: distinct structuresdominate the simulation; case 7: none of the global, local common and distinctstructures exist.The following parameters are used in the G-G-G data simulations, I = 100, J = 1000, J = 500, J = 100, all of the { α l } are set to 1. In order to haveexactly 3 components for all the simulated structures, we reject the simulationsof which the singular values of the three components of any specific structure arenot 2 times larger than the singular value of the corresponding residual term. The .6. Experiments estimated variation explained components Figure 5.3: Variation explained ratios computed using the estimated parametersfrom the selected P-ESCA model with a group GDP penalty. From the top to thebottom, we have data sets X , X and X ; from the left to the right, we have 20components corresponding to the global, local common and distinct structures.The total variation explained ratios for each data set are shown on the left sideof the plot, while he variation explained ratio for each component is shown insidethe plot.P-ESCA model with a group GDP ( γ = 1) penalty is selected and re-fitted onthe full data sets in the same way as above. For the SLIDE model, the simulateddata sets { X l } are column centered and block-scaled by the Frobenius norm ofeach data set. Then the SLIDE model is selected and fitted using the defaultparameters. The deflated column offset term is taken as the estimated ˆ µ . Thederived loading matrices { B l } are re-scaled by the corresponding Frobenius normof each data set. G-G-G data sets are simulated for all the 7 cases, and for eachcase, the simulation experiment (data simulation, model selection, fitting the finalmodel) is repeated 10 times for both the P-ESCA model and the SLIDE model.The mean RV coefficients in evaluating the estimated global, local common anddistinct structures and the corresponding mean estimated ranks are shown inTable 5.2, and the mean RMSEs in estimating the simulated parameters areshown in supplemental Table S5.4. In all 7 cases, these two methods have veryaccurate estimation of the subspaces corresponding to the global, local commonand distinct structures, and of the simulated parameters Θ , which is the columnconcatenation of { Θ l } , { Θ l } , and µ , which is row concatenation of { µ l } L . Forsome of the cases there is a slight advantage for the P-ESCA model. The performance of the proposed P-ESCA model is fully characterized with re-spect to multiple binary data sets. Here we make a comparison to the MOFAmodel, which is the Bayesian counterpart of P-ESCA. In the P-ESCA model, the16 Chapter 5. Fusing multiple data sets with two types of heterogeneity Table 5.2: Mean RV coefficients and the mean rank estimates in evaluating therecovered subspaces derived from 10 experiments using the P-ESCA model andthe SLIDE model for seven G-G-G simulated cases. The results are shown asin mean RV coefficient(mean rank estimation) form. The row names 1 p and 1 s indicate P-ESCA and SLIDE models applied to simulation case 1. Same ruleapplies to other row names. 
C123 C12 C13 C23 D1 D2 D31 p s p s p s p s p s p s p s structured sparse pattern is induced through a group concave penalty, and themodel selection is done through missing value based cross validation, while in theMOFA model, the structured sparse pattern is induced through the automaticrelevance determination approach and the model is selected through maximizingthe marginal likelihood. In addition, MOFA model also shrinks a component tobe 0 when its variation explained ratios for all the data sets are less than a thresh-old, the default value of which is 0. The details of the differences are summarizedin the supplementary material. For the model selection of the P-ESCA model,the range of λ values is [1 , λ in theP-ESCA model with a group GDP penalty on the simulated B-B-B data sets insupplemental Fig. S5.6. For the MOFA model, the default parameters are used,but as exact sparsity cannot be achieved by the automatic relevance determina-tion procedure used in the MOFA model, we take a component for a single dataset to be 0 when the variation explained ratio of this component on this data setis less than 0 . I = 200, and the marginal prob-ability to be 0 . .6. Experiments { Θ l } , µ (supplemental Table S5.5) can be achievedsolely from a model on multiple binary data sets. Although these results are alittle bit counter intuitive, it is coordinate with the previous research [52, 64].According to our previous research [64], this result mainly relates to the factthat the GDP penalty can achieve nearly unbiased parameter estimation. On theother hand, the RMSEs in estimating the simulated parameters from the MOFAmodel (supplemental Table S5.5) are much larger. Especially for the estimationof the simulated column offset term, all the elements in the estimated ˆ µ fromthe MOFA model are very close to 0, and are far away from the simulated µ .However, the recovered subspaces from the MOFA model are comparable to theresults derived from the P-ESCA model (Table 5.3). The proposed P-ESCA model is also fully characterized on the simulated multipledata sets of mixed quantitative and binary data types. Both G-B-B and G-G-Bdata sets are simulated for all the seven simulation cases. We set I = 200, all of { α l } to be 1, the marginal probability in simulating unbalanced data sets to be0 . 1. Other parameters are the same as above. The range of λ values for loadingsrelated to the quantitative data sets is [1 , , Chapter 5. Fusing multiple data sets with two types of heterogeneity Table 5.3: Mean RV coefficients and mean rank estimations of recovered subspacesderived from 10 repeated simulation experiments using the P-ESCA model andthe MOFA model for seven B-B-B cases. For case 7, a one component MOFAmodel is selected, however, results cannot be extracted when the offset term isincluded. The row names 1 p and 1 m indicate P-ESCA and MOFA models appliedto simulation case 1. Same rule applies to other row names. C123 C12 C13 C23 D1 D2 D31 p m p m p m p m p m p m p m NA NA NA NA NA NA NA .7. Real data analysis p and 1 m indicate P-ESCAand MOFA models applied to simulation case 1. Same rule applies to other rownames. C123 C12 C13 C23 D1 D2 D31 p m p m p m p m p m p m p m NA NA NA NA NA NA NA We applied the P-ESCA model on the chronic lymphocytic leukaemia (CLL)data set [92, 89], which was used in the chapter of the MOFA model, to give anexample of the real data analysis. For the 200 samples in the CLL data set, notall of them are fully characterized for all the measurements. 
Drug response datahas 184 samples and 310 variables; DNA methylation data, 196 samples and 4248variables; transcriptome data, 136 samples and 5000 variables; mutation data,200 samples and 69 binary variables. The missing pattern of the CLL data setsis visualized in supplemental Fig. S5.7. Except for the missing values related tothe samples that were not measured by a specific platform, there are also someselected variables missing in the mutation data (supplemental Fig. S5.7). Allthe quantitative data sets are first column centered and scaled by the samplestandard deviation of each variable. After that, the dispersion parameters of20 Chapter 5. Fusing multiple data sets with two types of heterogeneity Table 5.5: Mean RV coefficients and mean rank estimations of the recoveredsubspaces derived from 10 simulation experiments using the P-ESCA model andthe MOFA model for seven G-G-B cases. The row names 1 p and 1 m indicateP-ESCA and MOFA models applied to simulation case 1. Same rule applies toother row names. C123 C12 C13 C23 D1 D2 D31 p m p m p m p m p m p m p m NA NA NA NA NA NA NA the quantitative data sets are estimated by the α estimation procedure. Rankestimation of each single data set was performed three times and results are shownin supplemental Table S5.8. The P-ESCA model with a GDP ( γ = 1) is selectedand re-fitted on the CLL data sets in the same way as described above. The initialnumber of components is set to 50. The selected model has 41 components, and ifwe take each loading vector related to a single data set in a component as a group,there are 51 non-zero loading groups. The model selection results are shown insupplemental Fig. S5.8. Since the variation explained ratios of 41 componentsare difficult to visualize, we only show the components (Fig. 5.4), whose variationexplained ratio are larger than 2% for at least one data set. The above procedure(processing, model selection, fitting the final model) is repeated 5 times to testits stability. The Pearson coefficient matrix for the 5 estimations of the ˆ µ andthe RV coefficient matrices for the 5 estimations of the ˆ A , ˆ B and ˆ Θ are shownin supplemental Fig. S5.9.In [89], a 10 components MOFA model is selected on the CLL data sets. Thevariation explained plots of the 10 components MOFA model, reproduced from[89], is shown in supplemental Fig. S5.10. There is some overlap between the twomodels (Fig. 5.4, supplemental Fig. S5.10). Both models have one strong commoncomponent in which all data sets participate, and a common component in whichtwo (P-ESCA) or three (MOFA) data sets participate. Furthermore the drug .8. Discussion µ is infeasible because the column offset term is not included inthis 10 components MOFA model. In general the P-ESCA result is more complexthan the results in [89] in terms of number of selected components and variationexplained. However, this is mainly because, during the model selection of [89],the minimum variation explained threshold is set to 2%. If we set the thresholdto the default value 0%, and set the initial number of components to be 50, andother parameters are kept the same, a 50 components MOFA model is selected. estimated variation explained ratios on CLL data sets components Figure 5.4: Variation explained ratios computed using the estimated parametersfrom the selected P-ESCA model on CLL data sets. From the top to the bottom,the data sets are drug response, methylation, transcriptome and mutation data. 
5.8 Discussion

In this chapter, we generalized the exponential family SCA (ESCA) model for the data integration of multiple data sets of mixed data types. We then introduced a nearly unbiased group concave penalty to induce a structured sparsity pattern on the loading matrices of the ESCA model, in order to separate the global, local common and distinct variation. An efficient MM algorithm with analytical updates for all the parameters was derived to fit the proposed penalized ESCA (P-ESCA) model, and a missing value based cross validation procedure was developed for the model selection. In many different realistic simulations (different SNR levels and different combinations of quantitative and binary data sets), the P-ESCA model and the model selection procedure work well with respect to recovering the subspaces related to the global, local common and distinct structures, and with respect to estimating the simulated parameters.

The performance of the P-ESCA model and the cross validation based model selection procedure relates to the fact that the group concave penalty achieves nearly unbiased estimation of the parameters while generating sparse solutions. The nearly unbiased parameter estimation gives the P-ESCA model high accuracy in estimating the simulated parameters and makes the cross validation error based model selection procedure consistent. Another key point of the model selection procedure is that the randomly sampled 10% of non-missing elements are usually a typical set of elements from the population, which makes the CV error a good proxy of the prediction error of the model. The rank estimation in different repetitions of the model selection procedure is robust and differs only slightly with respect to the very weak components.

When applied to multiple quantitative data sets, the proposed P-ESCA model achieves slightly better performance than the SLIDE model in recovering the subspaces of the simulated structures and in estimating the simulated parameters. Also, since missing value problems (missing values in a single data set, or completely missing samples in one or some of the data sets) are very common in practice, the option of handling missing values is a big advantage. In the P-ESCA model and its model selection procedure, the missing values are masked by the weight matrices, making full use of the available data. When applied to multiple binary data sets or to mixed quantitative and binary data sets, the proposed P-ESCA model has better performance than the MOFA model in recovering the subspaces of the simulated structures and in estimating the simulated parameters. Furthermore, the exact orthogonality constraint can be achieved in the P-ESCA model, which is crucial for the uniqueness of the recovered subspaces related to the global, local common and distinct variation.

Acknowledgements

Y.S. gratefully acknowledges the financial support from the China Scholarship Council (NO.201504910809).

5.9 Supplementary information

The α estimation procedure

The notation in this subsection is the same as in Chapter 5. Before constructing an ESCA or P-ESCA model, the dispersion parameter $\alpha$ of a quantitative data set $\mathbf{X}$, which is the variance of the residual term, is assumed to be known. Assume the column centered quantitative data set is $\mathbf{X}$ ($I \times J$), and the PCA model of $\mathbf{X}$ can be expressed as $\mathbf{X} = \mathbf{A}\mathbf{B}^T + \mathbf{E}$.
$\mathbf{A}$ ($I \times R$) and $\mathbf{B}$ ($J \times R$) are the score and loading matrices respectively; $\mathbf{E}$ ($I \times J$) is the residual term, whose elements satisfy $\epsilon_{ij} \sim N(0, \alpha)$; $R$ is the true low rank of $\mathbf{X}$. In order to handle potential missing values, we also introduce the weight matrix $\mathbf{W}$ in the same way as above. The maximum likelihood estimate of $\alpha$ can be expressed as $\hat{\alpha}_{\mathrm{mle}} = \frac{1}{\|\mathbf{W}\|_0}\|\mathbf{W}\odot(\mathbf{X} - \mathbf{A}\mathbf{B}^T)\|_F^2$, in which $\|\mathbf{W}\|_0$ is the number of non-missing elements in $\mathbf{W}$. Since this is a biased estimate of $\alpha$, we can adjust the estimate according to the degrees of freedom as $\hat{\alpha} = \frac{\|\mathbf{W}\odot(\mathbf{X} - \mathbf{A}\mathbf{B}^T)\|_F^2}{\|\mathbf{W}\|_0 - (I + J)R}$. The parameters $R$, $\mathbf{A}$ and $\mathbf{B}$ are estimated as follows. We select the rank $R$ using a similar model selection strategy as in the main text. We first split $\mathbf{X}$ into $\mathbf{X}^{\mathrm{test}}$ and $\mathbf{X}^{\mathrm{train}}$ in the same way as in the main text. Then, a series of PCA models with different numbers of components is constructed on $\mathbf{X}^{\mathrm{train}}$, and the CV error is defined as the least squares error in fitting $\mathbf{X}^{\mathrm{test}}$. After that, $\hat{R}$ is set to the number of components of the model with the minimum CV error. Then a rank $\hat{R}$ PCA model is constructed on the full data $\mathbf{X}$, which gives the estimates $\hat{\mathbf{A}}$ and $\hat{\mathbf{B}}$, and $\hat{\alpha}$ is set to $\hat{\alpha} = \frac{\|\mathbf{W}\odot(\mathbf{X} - \hat{\mathbf{A}}\hat{\mathbf{B}}^T)\|_F^2}{\|\mathbf{W}\|_0 - (I + J)\hat{R}}$. The EM type algorithm used to fit the PCA model with the option of missing values is implemented in Matlab in the same way as in [60].

Differences between the P-ESCA model and the SLIDE model

The notation in this subsection is the same as in Chapter 5.

• Different preprocessing steps. The SLIDE model preprocesses the data by column centering and block scaling using the Frobenius norm of the corresponding data set; the relative weights of the data sets in the SCA model are then set to 1. In contrast, we estimate the dispersion parameter (the variance of the noise term) of each data set, and the inverse of the estimated dispersion parameter acts as the relative weight of that data set in the SCA model.

• Different penalty terms. The SLIDE model uses the group lasso penalty to induce the structured sparsity. Because of the block scaling preprocessing step, there is no weight $\{\sqrt{J_l}\}_{l=1}^{L}$ on the group lasso penalty to accommodate the potentially unequal numbers of variables in the different data sets. In contrast, the weighted group concave penalty is used in the P-ESCA model.

• Option for missing values. The option of handling missing values is not included in the SLIDE model.

• Different model selection procedures. The SLIDE model uses a two-stage approach for model selection, while our model selection approach is as described in the main text.

Differences between the P-ESCA model and the MOFA model

The notation in this subsection is the same as in Chapter 5.

• Different origins. Although the two methods are similar with respect to what they can do, they have different origins. The MOFA model is developed in the Bayesian probabilistic matrix factorization framework, in the same line as the group factor analysis model and the factor analysis model, while the P-ESCA model is derived in the deterministic matrix factorization framework, in the same line as the SLIDE model, the SCA model and the PCA model.

• Different ways of inducing structured sparsity. In the P-ESCA model, the structured sparse pattern is induced through a group concave penalty, while in the MOFA model it is induced through the automatic relevance determination approach.
The group concave penalty can shrink a group of elementsto be exactly 0, while the automatic relevance determination cannot achieveexact sparsity. In addition, MOFA model also shrinks a component to be0 when its variation explained ratios for all the data sets are less than athreshold, whose default value is 0. • Different model selection procedures. The P-ESCA model is selected by amissing value based CV approach; while the selection of a MOFA modelrelies on maximizing the marginal likelihood. In theory, maximizing themarginal likelihood has no difficulty in tuning multiple parameters, whilethe CV based model selection procedure is infeasible for such task. • Orthogonality constraint. The orthogonality constraint A T A = I can onlybe achieved in the P-ESCA model. Whether this property is meaningfulor not depends on the specific research question. However, the constraint .9. Supplementary information Table S5.1: A list of log-partition functions and their first and second orderderivatives for the Gaussian, Bernoulli and Poisson distributions. θ indicates thenatural parameter.Distribution b ( θ ) b (cid:48) ( θ ) b (cid:48)(cid:48) ( θ )Gaussian θ θ θ )) exp( θ )1+exp( θ ) exp( θ )(1+exp( θ )) Poisson exp( θ ) exp( θ ) exp( θ )Table S5.2: Results of the α estimation procedure. 1 g indicates that a Gaussiandistribution is used and α l = 1; b indicates the Bernoulli distribution. Theestimated dispersion parameter ˆ α l and the corresponding times are shown asmean ± std(seconds). When the estimated ranks are the same in each of the threetimes CV procedure is repeated, the corresponding standard deviation is 0. α α α ˆ α (time) ˆ α (time) ˆ α (time))1 g g g ± ± ± g g g ± ± ± g g b ± ± g g b ± ± g b b ± g b b ± Chapter 5. Fusing multiple data sets with two types of heterogeneity Table S5.3: Seven simulation cases used to evaluate the proposed P-ESCA model.For each simulation case, the corresponding SNRs in simulating the global struc-ture C123, local common structures, C12, C13, C23, and distinct structures D1,D2, D3, are give. If the SNR of a specific structure is 0, it means this structuredoes not exist in the simulation.case C123 C12 C13 C23 D1 D2 D31 0 1 2 3 0 0 02 1 0 0 0 1 1 13 1 1 1 1 1 1 14 10 5 5 5 1 1 15 5 10 10 10 1 1 16 1 5 5 5 10 10 107 0 0 0 0 0 0 0Table S5.4: Mean RMSEs in estimating the simulated parameters Θ , { Θ } l =1 and µ , derived from repeating the experiments 10 time using the P-ESCA model andthe SLIDE model for seven G-G-G simulation cases. The row names 1 p and 1 s indicate P-ESCA and SLIDE models applied to simulation case 1. Same ruleapplies to other row names.RMSE( Θ ) RMSE( Θ ) RMSE( Θ ) RMSE( Θ ) RMSE( µ )1 p s p s p s p s p s p s p s .9. Supplementary information Θ , { Θ } l =1 and µ derived from repeating the experiments 10 times using the P-ESCA model andthe MOFA model for seven B-B-B simulation cases. The row names 1 p and 1 m indicate P-ESCA and MOFA models applied to simulation case 1. Same ruleapplies to other row names.RMSE( Θ ) RMSE( Θ ) RMSE( Θ ) RMSE( Θ ) RMSE( µ )1 p m p m p m p m p m p m p m NA NA NA NA NA -10 0 10 E ( x | ) -10-50510 Gaussian -10 0 10 E ( x | ) Bernoulli E ( x | ) Poisson Figure S5.1: The conditional mean of x , E( x | θ ), for varying θ values for Gaussian,Bernoulli, Poisson distributions28 Chapter 5. 
Fusing multiple data sets with two types of heterogeneity Table S5.6: Mean RMSEs in estimating the simulated parameters Θ , { Θ } l =1 and µ derived from repeating the experiments 10 times using the P-ESCA model andthe MOFA model for seven G-B-B simulation cases. The row names 1 p and 1 m indicate P-ESCA and MOFA models applied to simulation case 1. Same ruleapplies to other row names.RMSE( Θ ) RMSE( Θ ) RMSE( Θ ) RMSE( Θ ) RMSE( µ )1 p m p m p m p m p m p m p m NA NA NA NA NA .9. Supplementary information Θ , { Θ } l =1 and µ derived from repeating the experiments 10 times using the P-ESCA model andthe MOFA model for seven G-G-B simulation cases. The row names 1 p and 1 m indicate P-ESCA and MOFA models applied to simulation case 1. Same ruleapplies to other row names.RMSE( Θ ) RMSE( Θ ) RMSE( Θ ) RMSE( Θ ) RMSE( µ )1 p m p m p m p m p m p m p m NA NA NA NA NATable S5.8: Rank estimations of the CLL data sets. Drug: drug response data;meth: DNA methylation data; mRNA: transcriptome data; mut: mutation data.data set data type size k = 1 k = 2 k = 3drug quantitative 184 × 310 17 17 18meth quantitative 196 × × × 69 1 1 030 Chapter 5. Fusing multiple data sets with two types of heterogeneity < L L q:q=0.1 L q:q=0.5 < L GDP: . =1GDP: . =50 Figure S5.2: The thresholding properties of the group lasso (L ), the group L q and the group GDP penalty. σ is taken as the L norm of a group of elements. q and γ are the hyper-parameters of the corresponding penalties. x axis indicatesthe value of σ before thresholding; y axis indicates the value after threhsolding.L q :0 C12 C13 C23 D1 D2 components D3 Figure S5.3: The singular values of the simulated common, local common anddistinct structures and of the corresponding residual terms. Blue dots: singularvalues of the simulated structure; red dots: singular values of the residual term;yellow dots: 2 times of the singular values of the residual term. .9. Supplementary information true variation explained components Figure S5.4: The variation explained ratios computed using the simulated pa-rameters. From the top to the bottom, we have data sets X , X and X ; fromthe left to the right, we have 21 components corresponding to the global, localcommon and distinct structures. The total variation explained ratios for eachdata set are shown in the left of the plot, while the variation explained ratio ofeach component for each data set is shown inside the plot.32 Chapter 5. Fusing multiple data sets with two types of heterogeneity -2 -1 0 1 C V e rr o r CV error XX X X -2 -1 0 1 R M SE RMSE log ( ) -2 -1 0 1 R V c oe ff i c i en t -0.500.51 common structures C123C12C13C23 log ( ) -2 -1 0 1 R V c oe ff i c i en t distinct structures D1D2D3 Figure S5.5: CV errors (top left), RMSEs (top right) and the RV coefficients inestimating the common structures (bottom left), and distinct structures (bottomright) as a function of the regularization strength λ when the P-ESCA modelwith a group lasso penalty is used. The red cross marker indicates the pointcorresponding to the minimum CV error. .9. Supplementary information C V e rr o r CV error XX X X R M SE RMSE log ( ) R V c oe ff i c i en t common structures C123C12C13C23 log ( ) R V c oe ff i c i en t distinct structures D1D2D3 Figure S5.6: CV errors (top left), RMSEs (top right) and the RV coefficients inestimating the common structures (bottom left), and distinct structures (bottomright) as a function of the regularization strength λ for the P-ESCA model witha group GDP penalty on the simulated B-B-B data sets. 
The red cross markerindicates the point corresponding to the minimum CV error. The SNRs of global,local common and distinct structures in the B-B-B simulation are set to be 1.The reason for the increased CV errors at the early stage (top left) is that thesemodels have not convergenced in 500 iterations.34 Chapter 5. Fusing multiple data sets with two types of heterogeneity samplesdrugmethmRNAmut Figure S5.7: Missing pattern of the CLL data sets. Black color indicates the datais missing, while gray color, the data is present. Drug: drug response data; meth:DNA methylation data; mRNA: transcriptome data; mut: mutation data. .9. Supplementary information log10( ) C V e rr o r selection of b mut log10( ) selection of g drugmethmRNA Figure S5.8: Model selection of the P-ESCA model with a group GDP penaltyon the CLL data sets. Drug: drug response data; meth: DNA methylation data;mRNA: transcriptome data; mut: mutation data. The red cross marker indicatesthe selected model. ^ e x pe r i m en t ^ A e x pe r i m en t ^ B experiment e x pe r i m en t ^ experiment e x pe r i m en t Figure S5.9: The Pearson coefficient matrix for the 5 estimations of the ˆ µ andthe RV coefficient matrices for the 5 estimations of the ˆ A , ˆ B and ˆ Θ derived fromthe P-ESCA model.36 Chapter 5. Fusing multiple data sets with two types of heterogeneity estimated variation explained ratios from the MOFA model components Figure S5.10: Variation explained plots reproduced from the MOFA paper. Fromthe top to the bottom, the data sets are drug response, methylation, mRNA andmutation hapter 6 Outlook In the proposed penalized exponential family SCA (P-ESCA) model (Chapter5), a group concave penalty is used to induce group-wise sparse pattern on theloading matrix to disentangle the global, local common and distinct components.The P-ESCA model and the associated MM algorithm can be further generalizedto include other types of penalties to induce more interesting sparse patterns onthe loading matrix, which may be useful for the data analyst. In addition, thecurrently developed model selection procedure has difficulties in tuning multipletuning parameters. It will be worthwhile to explore other types of model selec-tion approaches to address this issue. Also, in the current thesis, the parametricexponential family distribution is used to tackle the heterogeneous measurementscales. There also exist other possible non-parametric and semi-parametric ap-proaches. It is worthwhile to generalize them for the data fusion of multiple datasets with the two types of heterogeneity. Furthermore, it is also interesting togeneralize the P-ESCA model for prediction tasks or to taking into account theexperimental design underlining the used multiple data sets. The developed P-ESCA model and the associated MM algorithm have a lot of po-tential for further generalization. The options for inducing element-wise sparsityon the loading matrix or the composition of both group-wise and element-wisesparsity or other types of penalties can be easily included. Some examples willbe shown in the following subsections. These P-ESCA model extensions can beselected using the developed missing value based CV procedure (Chapter 5). Apossible alternative model selection approach is the Bayesian optimization [98]framework, which is appropriate for the tuning of multiple (usually less than 20)continuous tuning parameters. Since this framework has been successfully ap-plied in the automatic tuning of various machine learning algorithms [99], it will13738 Chapter 6. 
be worthwhile to explore its usage for the P-ESCA models with multiple tuning parameters.

6.1 Including other types of sparse patterns

The following notation is the same as for the P-ESCA model in Chapter 5. Suppose the $r$th column of the $l$th loading matrix $\mathbf{B}_l$ is $\mathbf{b}_{l,r}$. The group concave penalty on the $l$th loading matrix $\mathbf{B}_l$ is imposed on the $L_2$ norm of $\mathbf{b}_{l,r}$ as $\lambda_l\sum_r g(\|\mathbf{b}_{l,r}\|_2)$, in which $g()$ is a concave function. This group concave penalty (or concave $L_2$ norm penalty) shrinks on the group level (the $L_2$ norm of $\mathbf{b}_{l,r}$) as a concave penalty, and on the element level (the elements inside $\mathbf{b}_{l,r}$) as a ridge regression type penalty. Sometimes we may need other types of penalties, such as element-wise sparsity on the elements of $\mathbf{B}_l$, or the composition of both element-wise and group-wise sparsity patterns on $\mathbf{B}_l$. All these options can easily be included in the P-ESCA model by a slight modification of the developed MM algorithm. When a single data set is used, the P-ESCA model with an element-wise concave penalty is an approach to the sparse exponential family PCA model; when multiple data sets are used, the model is an approach to the sparse exponential family SCA model.

Element-wise concave penalty

An element-wise concave penalty can be imposed on the elements of $\mathbf{B}_l$ to induce element-wise sparsity on $\mathbf{B}_l$. Suppose the $jr$th element of $\mathbf{B}_l$ is $b_{ljr}$, and its absolute value is $\sigma_{ljr}$, $\sigma_{ljr} = |b_{ljr}|$. The concave penalty on $\mathbf{B}_l$ can be expressed as $\lambda_l\sum_{j=1}^{J_l}\sum_{r=1}^{R} g(\sigma_{ljr})$, in which $g()$ is a concave function from Table 5.1. The optimization problem associated with this P-ESCA model with an element-wise concave penalty can be expressed as follows,

$$\begin{aligned}
\min_{\{\boldsymbol{\mu}_l\}_{l=1}^{L},\,\mathbf{A},\,\{\mathbf{B}_l\}_{l=1}^{L}}\;& \sum_{l=1}^{L}\Big[-\log\big(p(\mathbf{X}_l\,|\,\mathbf{\Theta}_l,\alpha_l)\big) + \lambda_l\sum_{j=1}^{J_l}\sum_{r=1}^{R} g(\sigma_{ljr})\Big]\\
\text{subject to}\quad & \mathbf{\Theta}_l = \mathbf{1}\boldsymbol{\mu}_l^T + \mathbf{A}\mathbf{B}_l^T,\; l = 1,\ldots,L\\
& \mathbf{1}^T\mathbf{A} = \mathbf{0},\; \mathbf{A}^T\mathbf{A} = \mathbf{I}\\
& \sigma_{ljr} = |b_{ljr}|,\; l = 1\ldots L;\; j = 1,\ldots,J_l;\; r = 1,\ldots,R. \qquad (6.1)
\end{aligned}$$

Algorithm

The algorithm developed in Section 5.3 only needs to be modified with respect to the majorization step of the penalty function and the updating of $\mathbf{B}_l$ to fit the P-ESCA model with an element-wise penalty in equation 6.1. Similar to equation 5.5, the penalty function $g(\sigma_{ljr})$ is concave with respect to $\sigma_{ljr}$ and can be majorized as $g(\sigma_{ljr}) \leq \omega_{ljr}^k\sigma_{ljr} + c$, in which $\omega_{ljr}^k = \partial g(\sigma_{ljr}^k)$ and $\sigma_{ljr}^k$ is the absolute value of the $jr$th element of $\mathbf{B}_l^k$ (the approximation of $\mathbf{B}_l$ during the $k$th iteration). After majorizing the original problem in equation 6.1, and updating the offset terms $\{\boldsymbol{\mu}_l\}_{l=1}^{L}$ and the score matrix $\mathbf{A}$ in exactly the same way as in Section 5.3, the optimization problem associated with the update of $\mathbf{B}_l$ is

$$\begin{aligned}
\min_{\mathbf{B}_l}\;& \frac{\rho_l}{2\alpha_l}\|\mathbf{A}\mathbf{B}_l^T - \mathbf{J}\mathbf{H}_l^k\|_F^2 + \lambda_l\sum_{j=1}^{J_l}\sum_{r=1}^{R}\omega_{ljr}^k\sigma_{ljr}\\
=\;& \frac{\rho_l}{2\alpha_l}\|\mathbf{B}_l - (\mathbf{J}\mathbf{H}_l^k)^T\mathbf{A}\|_F^2 + \lambda_l\sum_{j=1}^{J_l}\sum_{r=1}^{R}\omega_{ljr}^k\sigma_{ljr}\\
=\;& \sum_{j=1}^{J_l}\sum_{r=1}^{R}\Big[\frac{\rho_l}{2\alpha_l}\big(b_{ljr} - ((\mathbf{J}\mathbf{H}_l^k)^T\mathbf{A})_{jr}\big)^2 + \lambda_l\,\omega_{ljr}^k\sigma_{ljr}\Big],\\
&\sigma_{ljr} = |b_{ljr}|,\; l = 1\ldots L;\; j = 1,\ldots,J_l;\; r = 1,\ldots,R,
\end{aligned}$$

in which $((\mathbf{J}\mathbf{H}_l^k)^T\mathbf{A})_{jr}$ indicates the $jr$th element of the matrix $(\mathbf{J}\mathbf{H}_l^k)^T\mathbf{A}$. The above optimization problem is simply the proximal operator of the $L_1$ norm, and an analytical solution exists [25]. Taking $\tilde{\lambda}_{ljr} = \lambda_l\,\omega_{ljr}^k\alpha_l/\rho_l$ and $v_{ljr} = ((\mathbf{J}\mathbf{H}_l^k)^T\mathbf{A})_{jr}$, the analytical solution for $b_{ljr}$ is $b_{ljr} = \mathrm{sign}(v_{ljr})\max(0, |v_{ljr}| - \tilde{\lambda}_{ljr})$. To update the parameter $\mathbf{B}_l$, we simply apply this proximal operator to all the elements of $\mathbf{B}_l$.
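A minimal sketch of this element-wise soft-thresholding update is given below, assuming the unpenalized target $\mathbf{V} = (\mathbf{J}\mathbf{H}_l^k)^T\mathbf{A}$ and the per-element thresholds $\tilde{\lambda}_{ljr}$ have already been computed; the variable names are illustrative only.

```python
import numpy as np

def soft_threshold(V, lam_tilde):
    """Element-wise proximal operator of the L1 norm: sign(v) * max(0, |v| - lambda~)."""
    return np.sign(V) * np.maximum(0.0, np.abs(V) - lam_tilde)
```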
The other parts of the algorithm are the same as in Section 5.3.

Concave $L_1$ norm penalty

Another way to induce group sparsity on $\mathbf{B}_l$ is through the concave $L_1$ norm penalty [22]. Suppose the $r$th column of the $l$th loading matrix $\mathbf{B}_l$ is $\mathbf{b}_{l,r}$, and its $L_1$ norm is $\sigma_{lr}$, $\sigma_{lr} = \|\mathbf{b}_{l,r}\|_1 = \sum_{j=1}^{J_l}|b_{ljr}|$. The concave $L_1$ norm penalty on $\mathbf{B}_l$ can be expressed as $\lambda_l J_l\sum_{r=1}^{R} g(\sigma_{lr})$, in which the weight $J_l$ is used to accommodate the potentially different numbers of variables in the different data sets, and $g()$ is a concave function from Table 5.1. This concave $L_1$ norm penalty shrinks on the group level (the $L_1$ norm of $\mathbf{b}_{l,r}$) as a concave penalty, and on the element level (the elements inside $\mathbf{b}_{l,r}$) as a lasso type penalty. The optimization problem associated with this P-ESCA model with a concave $L_1$ norm penalty can be expressed as follows,

$$\begin{aligned}
\min_{\{\boldsymbol{\mu}_l\}_{l=1}^{L},\,\mathbf{A},\,\{\mathbf{B}_l\}_{l=1}^{L}}\;& \sum_{l=1}^{L}\Big[-\log\big(p(\mathbf{X}_l\,|\,\mathbf{\Theta}_l,\alpha_l)\big) + \lambda_l J_l\sum_{r=1}^{R} g(\sigma_{lr})\Big]\\
\text{subject to}\quad & \mathbf{\Theta}_l = \mathbf{1}\boldsymbol{\mu}_l^T + \mathbf{A}\mathbf{B}_l^T,\; l = 1,\ldots,L\\
& \mathbf{1}^T\mathbf{A} = \mathbf{0},\; \mathbf{A}^T\mathbf{A} = \mathbf{I}\\
& \sigma_{lr} = \|\mathbf{b}_{l,r}\|_1,\; l = 1\ldots L;\; r = 1,\ldots,R. \qquad (6.2)
\end{aligned}$$

Algorithm

The algorithm developed in Section 5.3 only needs to be modified with respect to the majorization step of the penalty function and the updating of $\mathbf{B}_l$ to fit the P-ESCA model with the concave $L_1$ norm penalty in equation 6.2. Similar to equation 5.5, the penalty function $g(\sigma_{lr})$ is concave with respect to $\sigma_{lr}$ and can be majorized as $g(\sigma_{lr}) \leq \omega_{lr}^k\sigma_{lr} + c$, in which $\omega_{lr}^k = \partial g(\sigma_{lr}^k)$ and $\sigma_{lr}^k$ is the $L_1$ norm of the $r$th column of $\mathbf{B}_l^k$ (the approximation of $\mathbf{B}_l$ during the $k$th iteration). After majorizing the original problem in equation 6.2, and updating the offset terms $\{\boldsymbol{\mu}_l\}_{l=1}^{L}$ and the score matrix $\mathbf{A}$ in exactly the same way as in Section 5.3, the optimization problem associated with the update of $\mathbf{B}_l$ is

$$\begin{aligned}
\min_{\mathbf{B}_l}\;& \frac{\rho_l}{2\alpha_l}\|\mathbf{A}\mathbf{B}_l^T - \mathbf{J}\mathbf{H}_l^k\|_F^2 + \lambda_l J_l\sum_{r=1}^{R}\omega_{lr}^k\sigma_{lr}\\
=\;& \frac{\rho_l}{2\alpha_l}\|\mathbf{B}_l - (\mathbf{J}\mathbf{H}_l^k)^T\mathbf{A}\|_F^2 + \lambda_l J_l\sum_{r=1}^{R}\omega_{lr}^k\Big(\sum_{j=1}^{J_l}|b_{ljr}|\Big)\\
=\;& \sum_{j=1}^{J_l}\sum_{r=1}^{R}\Big[\frac{\rho_l}{2\alpha_l}\big(b_{ljr} - ((\mathbf{J}\mathbf{H}_l^k)^T\mathbf{A})_{jr}\big)^2 + \lambda_l J_l\,\omega_{lr}^k|b_{ljr}|\Big].
\end{aligned}$$

Taking $\tilde{\lambda}_{ljr} = \lambda_l J_l\,\omega_{lr}^k\alpha_l/\rho_l$ and $v_{ljr} = ((\mathbf{J}\mathbf{H}_l^k)^T\mathbf{A})_{jr}$, the analytical solution for $b_{ljr}$ is $b_{ljr} = \mathrm{sign}(v_{ljr})\max(0, |v_{ljr}| - \tilde{\lambda}_{ljr})$. To update the parameter $\mathbf{B}_l$, we simply apply this proximal operator to all the elements of $\mathbf{B}_l$. The other parts of the algorithm are the same as in Section 5.3.

Composite concave penalty

There also exists a composite concave penalty to induce both group and element-wise sparsity [22]. The composite concave penalty on $\mathbf{B}_l$ can be expressed as $\lambda_l J_l\sum_{r=1}^{R} g_{\mathrm{out}}\big(\sum_{j=1}^{J_l} g_{\mathrm{inner}}(|b_{ljr}|)\big)$, in which the weight $J_l$ is used to accommodate the potentially different numbers of variables in the different data sets, and $g_{\mathrm{out}}()$ and $g_{\mathrm{inner}}()$ are
(6.3) Algorithm The algorithm developed in Section 5.3 can be modified only with respect to themajorization step of the penalty function and the updating B l step to fit theP-ESCA model with a composite concave penalty in equation 6.3. Here we take σ lr = (cid:80) J l j g inner ( | b ljr | ) and σ ljr = | b ljr | . Since both g out ( σ lr ) and g inner ( σ ljr ) areconcave function and they are monotonically non-decreasing, their compositionis also a concave function with respect to σ ljr . Therefore, we can majorize thecomposite function g out ( (cid:80) J l j g inner ( σ ljr )) in a similar way as the equation 5.5, g out ( (cid:80) J l j g inner ( σ ljr )) ≤ (cid:80) J l j ω kljr σ ljr + c , in which ω kljr = ∂g out ( σ klr ) ∂g inner ( σ kljr )and σ kljr is the absolute value of the jr th element of B kl (the k th approximation of B l during the k th iteration), σ klr = (cid:80) J l j g inner ( σ kljr ). After majorizing the originalproblem in equation 6.3, updating the offset terms { µ l } L and score matrix A inexactly the same way as in Section 5.3, the optimization problem associated withthe updating of B l ismin B l ρ l α l || AB T l − JH kl || F + λ l J l R (cid:88) r ( J l (cid:88) j ω kljr σ ljr )= ρ l α l || B l − ( JH kl ) T A || F + λ l J l R (cid:88) r ( J l (cid:88) j ω kljr σ ljr )= J l (cid:88) j R (cid:88) r (cid:104) ρ l α l ( b ljr − (( JH kl ) T A ) jr ) + λ l J l ω kljr σ ljr (cid:105) , Take ˜ λ ljr = λ l J l ω kljr α l /ρ l and v ljr = (( JH kl ) T A ) jr , the analytical solution of b ljr is b ljr = sign( v ljr ) max(0 , | v ljr | − ˜ λ ljr ). To update the parameter B l , we can simplyapply this proximal operator to all the elements of B l . The other parts of thealgorithm are the same as in Section 5.3.42 Chapter 6. Outlook All the algorithms for the above P-ESCA models with different penalties arebased on the fact that the updating of B l in equation 5.7 can be re-expressed as aproblem of finding the proximal operator for the L norm or the L norm penalty.Therefore, P-ESCA model can also be extended to include other types of penaltywhose proximal operator has a simple or analytical solution. For example, there isno difficulty in including concave penalties on the rows of the loading matrix B l toinduce row-wise sparsity, which could be useful for the feature selection. Further-more, we can also add cardinality constraints (pseudo L norm) on the number ofnonzero elements, the number of nonzero rows, or the number of nonzero columnsof the loading matrix B l to induce the desired sparsity pattern. These various L norm penalties are non-convex, however, there are heuristic solutions for thecorresponding proximal operator [100]. These various L penalties can be usefulif all our data sets are quantitative. However, when discrete data sets are used,the derived model with the L norm type of penalty will have problems in con-straining the scale of estimated parameters. The standard logistic PCA model, inwhich the exact low rank constraint can be regarded as applying L norm penaltyon the singular values, is a good example to illustrate this point. In the current thesis, the heterogeneous measurement scales are accounted for byassuming a parametric exponential family distribution in a similar way as the gen-eralized linear models. There also existed other possible directions [38, 101, 36]to tackle the problems induced by the heterogeneous measurement scales. Onepromising alternative is the semi-parametric XPCA method [101]. 
In the current thesis, the heterogeneous measurement scales are accounted for by assuming a parametric exponential family distribution, in a similar way as in generalized linear models. There also exist other possible directions [38, 101, 36] to tackle the problems induced by heterogeneous measurement scales. One promising alternative is the semi-parametric XPCA method [101]. In the probabilistic interpretation of a PCA model on a matrix $X$ ($I \times J$), we assume that we have $I$ samples from a $J$ dimensional multivariate normal distribution, and therefore a normal marginal distribution for each column. In contrast, the XPCA model is based on a semi-parametric $J$ dimensional multivariate distribution, which is the combination of nonparametric marginals for all the $J$ quantitative or discrete columns and a Gaussian copula. The assumptions of parametric marginal distributions (a normal distribution for quantitative data, a Bernoulli distribution for binary data) for the columns of the observed data set $X$ are thus relaxed in the XPCA model. Therefore, when the exponential family distribution is not a good approximation of the observed data, for example when the empirical distribution of a quantitative variable is far from symmetric, the XPCA model has a clear advantage. Another interesting alternative is the non-parametric representation matrices approach [38], in which each variable (continuous or discrete) is represented by a representation matrix, and the resulting representation matrices can be used in the subsequent data analysis.

Using data fusion for supervised learning

All the methods developed in this thesis are unsupervised learning approaches. It is worthwhile to extend these methods to the supervised learning framework for prediction tasks. A simple approach, analogous to the extension of PCA to the principal component regression model, is as follows. The various unsupervised data fusion methods are taken as feature extraction approaches for the multiple data sets. The derived low dimensional score matrix can be regarded as the extracted features and can be used as input for any supervised learning method (a small sketch of this two-step approach is given at the end of this section). However, the extracted low dimensional features are not necessarily optimal for the prediction task. Therefore, when label information is available, it is better to make full use of it to make the extracted features more informative for the prediction task. The P-ESCA model can be extended from this perspective in a similar way as the PCA model is extended to the partial least squares regression model [102]. The extracted low dimensional structures from the P-ESCA model should then not only represent the multiple data sets well, but also have high covariance with the label information.

Sometimes, the multiple sets of measurements on the same objects result from carefully designed experimental studies rather than observational studies. Such an experimental design typically contains several factors, such as different treatments, different time points or their combinations, which are of interest with respect to the research question. These experimental factors therefore underlie the multiple data sets on the same objects. To study the effects of these experimental factors, or to remove their effects from the explorative data analysis, the data fusion approaches used should take the experimental design structure into account. The proposed P-ESCA model can be extended in this direction by including extra low dimensional structures to account for these experimental factors, in a similar way as the ANOVA-simultaneous component analysis (ASCA) model [103].
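As an illustration of the two-step, scores-as-features strategy described above, the sketch below passes the score matrix of an already fitted fusion model to an off-the-shelf regression model. The fit_fusion_model argument is a hypothetical placeholder for whatever unsupervised fusion implementation is used; only the scikit-learn calls refer to a real API, and the choice of ridge regression is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def scores_as_features(X_list, y, fit_fusion_model, n_components=5):
    """Two-step, PCR-like use of an unsupervised data fusion model for prediction.

    X_list           : list of (I x J_l) data blocks measured on the same I objects
    y                : outcome, one value per object
    fit_fusion_model : hypothetical placeholder for any unsupervised fusion method
                       (e.g. a P-ESCA implementation) returning an I x R score matrix
    """
    A = fit_fusion_model(X_list, n_components=n_components)   # extracted features

    # feed the extracted scores into an off-the-shelf supervised model
    A_train, A_test, y_train, y_test = train_test_split(
        A, y, test_size=0.3, random_state=0)
    model = Ridge(alpha=1.0).fit(A_train, y_train)
    return model, model.score(A_test, y_test)                 # R^2 on held-out objects
```

Because the scores are extracted without using the labels, such a pipeline may discard predictive variation; this is exactly the motivation given above for a PLS-like extension that maximizes covariance with the label information.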
Summary

Multiple high dimensional measurements from different platforms on the same biological system are becoming increasingly common in biological research. These different sources of measurements not only provide us with the opportunity of a deeper understanding of the studied system, but they also introduce some new statistical challenges. All these challenges are related to the heterogeneity of the data sets. The first type of heterogeneity is the type of data, such as metabolomics, proteomics and RNAseq data in genomics. These different omics data reflect the properties of the studied biological system from different perspectives. The second type of heterogeneity is the type of scale, which indicates that the measurements are obtained at different scales, such as binary, ordinal, interval and ratio-scaled variables. Within this thesis, various data fusion approaches are developed to tackle either one or both types of heterogeneity that exist in multiple data sets.

In Chapter 2, we reviewed and compared various parametric and nonparametric extensions of principal component analysis (PCA) specifically geared towards binary data. The special mathematical characteristics of binary data are taken into account from different perspectives in these different extensions of PCA. We explored their performance with respect to finding the correct number of components, overfitting, retrieving the correct low dimensional structure, variable importance, etc., using both realistic simulations of binary data as well as mutation, copy number aberration (CNA) and methylation data from the GDSC1000 project. Our results indicate that if a low dimensional structure exists in the data, most of the methods can find it. We recommend using the parametric logistic PCA model (projection based approach) if a probabilistic generating process can be assumed to underlie the data, and using the nonparametric Gifi model if such an assumption is not valid and the data are considered as given.

In Chapter 3, we developed a robust logistic PCA model via non-convex singular value thresholding. The promising logistic PCA model for binary data has an overfitting issue because of the exact low rank constraint used. We proposed to fit a logistic PCA model via non-convex singular value thresholding to alleviate the overfitting issue. An efficient majorization-minimization (MM) algorithm is implemented to fit the model, and a missing value based cross validation (CV) procedure is introduced for the model selection. Furthermore, we re-expressed the logistic PCA model based on the latent variable interpretation of generalized linear models (GLMs) on binary data. The latent variable interpretation of the logistic PCA model not only makes the assumption of a low rank structure easier to understand, but also provides us with a way to define the signal to noise ratio (SNR) in the simulation of multivariate binary data. Our experiments on realistic simulations of imbalanced binary data with low SNR show that the CV error based model selection procedure is successful in selecting the proposed model, and that the selected model demonstrates superior performance in recovering the underlying low rank structure compared to models with an exact low rank constraint and models with a convex nuclear norm penalty.

In Chapter 4, we developed a generalized simultaneous component analysis (GSCA) model for the data fusion of binary and quantitative data sets. The simultaneous component analysis (SCA) model is one of the standard tools for exploring the underlying dependence structure present in multiple quantitative data sets measured on the same objects. However, it does not have any provisions when part of the data is binary. To this end, we propose the GSCA model, which takes into account the distinct mathematical properties of binary and quantitative measurements in the maximum likelihood framework.
In the same way as in the SCA model, a common low dimensional subspace is assumed to represent the shared information between these two distinct types of measurements. However, the GSCA model can easily be overfitted when a rank larger than one is used, which can lead to some of the estimated parameters becoming very large. To achieve a low rank solution and combat overfitting, we propose to use non-convex singular value thresholding. An efficient majorization algorithm is developed to fit this model with different concave penalties. Realistic simulations (low SNR and highly imbalanced binary data) are used to evaluate the performance of the proposed model in recovering the underlying structure. Also, a missing value based CV procedure is implemented for the model selection. We illustrate the usefulness of the GSCA model for exploratory data analysis of quantitative gene expression and binary CNA measurements obtained from the GDSC1000 data sets.

In Chapter 5, we proposed a penalized exponential family SCA (P-ESCA) model for the data fusion of multiple data sets with two types of heterogeneity. Multiple sets of measurements on the same objects obtained from different platforms may reflect partially complementary information of the studied system. However, the heterogeneity of such data sets introduces some new statistical challenges for their data fusion. First, the separation of the information that is common across all or some of the data sets from the information that is specific to each data set is problematic. Furthermore, these data sets are often a mix of quantitative and discrete (binary or categorical) data types, while commonly used data fusion methods cannot handle such mixed data types.

Samenvatting

In biological research, it is becoming increasingly common to perform multiple high dimensional measurements on different platforms on the same biological system. These measurements from different sources offer the opportunity for a better understanding of the studied system, but they also bring new statistical challenges with them. All these challenges are related to the heterogeneity of the data sets. The first form of heterogeneity lies in the type of the data. There are, for example, different types of omics data, such as metabolomics, proteomics and RNAseq data in genomics, each of which offers its own perspective on the properties of the biological system. The second form of heterogeneity lies in the measurement scale of the data. The data can be measured on different scales, such as binary, ordinal, interval and ratio scales. In this thesis, various data fusion methods are developed with which one or both types of data heterogeneity can be tackled.

In Chapter 2, a number of extensions of principal component analysis (PCA), both parametric and nonparametric, that specifically address binary data are compared. These extensions of PCA take the special mathematical characteristics of binary data into account in different ways. We investigated the performance of these extensions of PCA with respect to finding the correct number of components, overfitting, finding the correct low dimensional structure, variable importance, and so on, using both realistic simulations of binary data and mutation, copy number aberration (CNA) and methylation data from the GDSC1000 project.
Our results show that if a low dimensional structure is present in the data, most of the methods can find it. We recommend using the parametric logistic PCA model (projection based approach) when a stochastic generating process can be assumed to underlie the data. If this is not the case and the data can be regarded as given, we recommend using the nonparametric Gifi model.

In Chapter 3, we developed a robust logistic PCA model by means of non-convex thresholding of the singular values. The promising logistic PCA model for binary data suffers from overfitting because of the exact low rank constraint that is used. We propose to reduce the overfitting by fitting the logistic PCA model using non-convex singular value thresholding. An efficient majorization-minimization (MM) algorithm is implemented to fit the model, and a missing value based cross validation (CV) procedure is introduced for model selection. In addition, we expressed the logistic PCA model in terms of the latent variable interpretation of generalized linear models (GLMs). Not only does this make the assumption of a low rank structure easier to understand, it also provides a way to define the signal to noise ratio in the simulation of multivariate binary data. Our experiments with realistic simulations of imbalanced binary data with a low signal to noise ratio show that model selection based on the CV error is well able to select the proposed model. The selected model is very good at recovering the underlying low rank structure and performs better than models with an exact low rank constraint or a convex nuclear norm penalty.

In Chapter 4, we developed a generalized simultaneous component analysis (GSCA) model for the fusion of binary and quantitative data sets. Simultaneous component analysis (SCA) is one of the standard tools for exploring the underlying dependence structure present in multiple quantitative data sets measured on the same objects. However, SCA is not suitable when part of the data is binary. We therefore propose a GSCA model that takes the specific mathematical properties of binary and quantitative measurements into account within the maximum likelihood framework. In the same way as for the SCA model, we assume the existence of a low dimensional subspace in which the shared information of the two types of measurements is represented. The GSCA model, however, is prone to overfitting when a rank larger than one is used, as a result of which some parameters can be estimated to be extremely large. To find a low rank solution without overfitting, we propose to use a non-convex thresholding function for the singular values. We developed an efficient majorization algorithm to fit this model with different concave penalty functions. Realistic simulations (low signal to noise ratio and highly imbalanced binary data) were used to assess how well the model can recover the underlying structure. A missing value based cross validation procedure is also implemented for model selection.
The usefulness of the GSCA model as an exploratory tool is demonstrated using quantitative gene expression data and binary CNA measurements from the GDSC1000 data set.

In Chapter 5, we propose a penalized exponential family SCA (P-ESCA) model for the data fusion of multiple data sets with both types of heterogeneity.

Acknowledgments

My four years of Ph.D. life in Amsterdam were difficult but productive. In the beginning, I did not have much confidence in myself for the transition from an experimentalist to a data analyst (statistician). However, during this process I learned a lot about statistics and developed several statistical methods for the data fusion of heterogeneous data sets. This could not have been done without the help and support of many people. Here I would like to express my gratitude to you.

First, I would like to thank Age Smilde and Johan Westerhuis for your four years of supervision. I appreciate that you accepted me to do my Ph.D. research in the BDA group. Thanks so much for all the discussions we had during the last four years and for your insightful comments on my work. These helped me a lot in learning statistics and doing research.

I would also like to thank Huub Hoefsloot and Age Smilde for the matrix algebra course. Matrix algebra is the foundation of statistics and numerical computation, and a solid understanding of this subject helped me a lot with the algorithm development for my research. Also, I would like to acknowledge Gooitzen Zwanenburg for your help with my computer, software and Internet problems, with the Dutch translation of the thesis summary, and with many other things. Furthermore, I want to thank Dicle, Chloie, Sandra, and Maryam. I am glad to have been in the same office as you. Thanks for the discussions we had about life, research and many other topics. Also, thanks to all the members of the BDA group for the coffee break conversations, the presentations, and the discussions during the group meetings.

Part of my Ph.D. research was carried out in collaboration with Lodewyk Wessels and Nanne Aben from the Netherlands Cancer Institute (NKI) and Patrick J.F. Groenen from the Erasmus University. Thanks to Lodewyk Wessels and Nanne Aben for your real biological data sets, your informative comments on our paper and all the biological interpretations. Also, thanks to Patrick for your enlightening comments on the algorithm section of our paper and for your work on the GDP penalty, which is used throughout my thesis.

I would also like to acknowledge the China Scholarship Council for financial support. The scholarship made it possible for me to study in Amsterdam.

Lastly, I would like to thank my family and my girlfriend for your love and encouragement.

Yipeng Song
May 21, 2019

Bibliography

[1] M. Zimmermann, M. Kogadeeva, M. Gengenbacher, G. McEwen, H.-J. Mollenkopf, N. Zamboni, S. H. E. Kaufmann, and U. Sauer, “Integration of metabolomics and transcriptomics reveals a complex diet of Mycobacterium tuberculosis during early macrophage infection,” mSystems, vol. 2, no. 4, pp. e00057-17, 2017.
[2] V. Steinmetz, F. Sevila, and V. Bellon-Maurel, “A methodology for sensor fusion design: Application to fruit quality assessment,” Journal of Agricultural Engineering Research, vol. 74, no. 1, pp. 21–31, 1999.
[3] T. Doeswijk, A. Smilde, J. Hageman, J. Westerhuis, and F. Van Eeuwijk, “On the increase of predictive performance with high-level data fusion,” Analytica Chimica Acta, vol. 705, no. 1-2, pp. 41–47, 2011.
[4] S. Stevens, “On the theory of scales of measurement,” Science, vol. 103, pp. 677–680, June 1946.
[5] P. Suppes and J.
Zinnes, “Basic measurement theory,” Psychology Series 45,Stanford University, Institute for Mathematical Studies in the Social Sci-ences, March 1962.[6] D. Krantz, R. Luce, P. Suppes, and A. Tversky, Foundations of Measure-ment (Volume I) . Dover, 1971.[7] L. Narens, “On the scales of measurement,” Journal of Mathematical Psy-chology , vol. 24, no. 3, pp. 249–275, 1981.[8] L. Narens and R. D. Luce, “Measurement - the theory of numerical assign-ments,” Psychological Bulletin , vol. 99, pp. 166–180, Mar. 1986.[9] R. D. Luce and L. Narens, “Measurement scales on the continuum,” Science ,vol. 236, no. 4808, pp. 1527–1532, 1987.15556 Bibliography [10] D. J. Hand, “Statistics and the theory of measurement,” Journal of theRoyal Statistical Society Series A-statistics in Society , vol. 159, pp. 445–473, 1996.[11] E. Adams, R. Fagot, and R. Robinson, “A theory of appropriate statistics,” Psychometrika , vol. 30, no. 2, pp. 99–127, 1965.[12] J. Michell, “Measurement scales and statistics - a clash of paradigms,” Psychological Bulletin , vol. 100, no. 3, pp. 398–407, 1986.[13] A. Agresti, Categorical data analysis . John Wiley & Sons, 2013.[14] A. K. Smilde, I. M˚age, T. Naes, T. Hankemeier, M. A. Lips, H. A. Kiers,E. Acar, and R. Bro, “Common and distinct components in data fusion,” Journal of Chemometrics , vol. 31, no. 7, p. e2900, 2017.[15] I. Gaynanova and G. Li, “Structural learning and integrative decompositionof multi-view data,” arXiv preprint arXiv:1707.06573 , 2017.[16] J. Fan and R. Li, “Variable selection via nonconcave penalized likelihoodand its oracle properties,” Journal of the American Statistical Association ,vol. 96, no. 456, pp. 1348–1360, 2001.[17] I. Jolliffe, Principal component analysis . Wiley Online Library, 2002.[18] M. Gavish and D. L. Donoho, “Optimal shrinkage of singular values,” IEEETransactions on Information Theory , vol. 63, no. 4, pp. 2137–2152, 2017.[19] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance esti-mation with the graphical lasso,” Biostatistics , vol. 9, no. 3, pp. 432–441,2008.[20] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight, “Sparsityand smoothness via the fused lasso,” Journal of the Royal Statistical Society:Series B (Statistical Methodology) , vol. 67, no. 1, pp. 91–108, 2005.[21] D. M. Witten, R. Tibshirani, and T. Hastie, “A penalized matrix decom-position, with applications to sparse principal components and canonicalcorrelation analysis,” Biostatistics , vol. 10, no. 3, pp. 515–534, 2009.[22] J. Huang, P. Breheny, and S. Ma, “A selective review of group selectionin high-dimensional models,” Statistical Science: A Review Journal of TheInstitute of Mathematical Statistics , vol. 27, no. 4, 2012.[23] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journalof the Royal Statistical Society: Series B (Methodological) , vol. 58, no. 1,pp. 267–288, 1996. ibliography et al. , “Least angle regres-sion,” The Annals of Statistics , vol. 32, no. 2, pp. 407–499, 2004.[25] N. Parikh, S. Boyd, et al. , “Proximal algorithms,” Foundations andTrends R (cid:13) in Optimization , vol. 1, no. 3, pp. 127–239, 2014.[26] N. Meinshausen and P. B¨uhlmann, “Stability selection,” Journal of theRoyal Statistical Society: Series B (Statistical Methodology) , vol. 72, no. 4,pp. 417–473, 2010.[27] A. Armagan, D. B. Dunson, and J. Lee, “Generalized double Pareto shrink-age,” Statistica Sinica , vol. 23, no. 1, p. 119, 2013.[28] I. M˚age, A. K. Smilde, and F. M. 
van der Kloet, “Performance of meth-ods that separate common and distinct variation in multiple data blocks,” Journal of Chemometrics , vol. 33, no. 1, p. e3085, 2019.[29] C. G. A. T. R. Network et al. , “Comprehensive genomic characterizationdefines human glioblastoma genes and core pathways,” Nature , vol. 455,no. 7216, pp. 1061–1068, 2008.[30] F. Iorio, T. A. Knijnenburg, D. J. Vis, G. R. Bignell, M. P. Menden,M. Schubert, N. Aben, E. Gon¸calves, S. Barthorpe, H. Lightfoot, et al. ,“A landscape of pharmacogenomic interactions in cancer,” Cell , vol. 166,no. 3, pp. 740–754, 2016.[31] H.-T. Wu, I. Hajirasouliha, and B. J. Raphael, “Detecting independent andrecurrent copy number aberrations using interval graphs,” Bioinformatics ,vol. 30, pp. i195–i203, June 2014.[32] F. W. Young, J. de Leeuw, and Y. Takane, “Quantifying qualitative data,” Lantermann and H. Feger (Eds.): Similarity and Choice. Papers in Honourof Clyde Coombs. Berne: Hans Huber , 1980.[33] M. Collins, S. Dasgupta, and R. E. Schapire, “A generalization of princi-pal component analysis to the exponential family,” in Advances in NeuralInformation Processing Systems , MIT Press, 2001.[34] A. I. Schein, L. K. Saul, and L. H. Ungar, “A generalized linear model forprincipal component analysis of binary data.,” in AISTATS , vol. 3, p. 10,2003.[35] A. J. Landgraf, Generalized principal component analysis: Dimensionalityreduction through the projection of natural parameters . PhD thesis, TheOhio State University, 2015.58 Bibliography [36] J. de Leeuw and P. Mair, “Gifi methods for optimal scaling in R: the packagehomals,” Journal of Statistical Software , vol. 31, pp. 1–21, AUG 2009.[37] Y. Mori, M. Kuroda, and N. Makino, Nonlinear principal component anal-ysis and its applications . Springer, 2016.[38] H. A. Kiers, Three-way methods for the analysis of qualitative and quanti-tative two-way data . DSWO press Leiden, 1989.[39] K. Pearson, “On lines and planes of closest fit to systems of points in space,” Philosophical Magazine Series 6 , vol. 2, no. 11, pp. 559–572, 1901.[40] H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal component anal-ysis,” Journal of Computational and Graphical Statistics , vol. 15, no. 2,pp. 265–286, 2006.[41] J. M. ten Berge, Least squares optimization in multivariate analysis . DSWOPress, Leiden University Leiden, 1993.[42] M. E. Tipping and C. M. Bishop, “Probabilistic principal component anal-ysis,” Journal of the Royal Statistical Society: Series B (Statistical Method-ology) , vol. 61, no. 3, pp. 611–622, 1999.[43] J. De Leeuw, “Principal component analysis of binary data by iteratedsingular value decomposition,” Computational Statistics & Data analysis ,vol. 50, no. 1, pp. 21–39, 2006.[44] M. Udell, C. Horn, R. Zadeh, et al. , “Generalized low rank models,” Foun-dations and Trends R (cid:13) in Machine Learning , vol. 9, no. 1, pp. 1–118, 2016.[45] A. Gifi, Nonlinear multivariate analysis . New York, N.Y.: Wiley, 1990.This is a publication under a collective pseudonym.[46] Q. Wei and R. L. Dunbrack Jr, “The role of balanced training and testingdata sets for binary classifiers in bioinformatics,” PloS One , vol. 8, no. 7,p. e67863, 2013.[47] S. Wold, “Cross-validatory estimation of the number of components infactor and principal components models,” Technometrics , vol. 20, no. 4,pp. 397–405, 1978.[48] R. Bro, K. Kjeldahl, A. K. Smilde, et al. , “Cross-validation of componentmodels: A critical look at current methods,” Analytical and BioanalyticalChemistry , vol. 390, no. 5, pp. 
1241–1251, 2008.[49] R Development Core Team, R: A language and environment for statisticalcomputing . R Foundation for Statistical Computing, Vienna, Austria, 2008.ISBN 3-900051-07-0. ibliography et al. , “pcaMethods–a bioconduc-tor package providing PCA methods for incomplete data,” Bioinformatics ,vol. 23, no. 9, pp. 1164–1167, 2007.[51] P. J. Groenen and J. Josse, “Multinomial multiple correspondence analy-sis,” arXiv preprint arXiv:1603.03174 , 2016.[52] M. A. Davenport, Y. Plan, E. Van Den Berg, and M. Wootters, “1-Bitmatrix completion,” Information and Inference: A Journal of the IMA ,vol. 3, no. 3, pp. 189–223, 2014.[53] A. A. Shabalin and A. B. Nobel, “Reconstruction of a low-rank matrix inthe presence of gaussian noise,” Journal of Multivariate Analysis , vol. 118,pp. 67–76, 2013.[54] J. Josse and S. Sardy, “Adaptive shrinkage of singular values,” Statisticsand Computing , vol. 26, no. 3, pp. 715–724, 2016.[55] E. J. Cand`es and B. Recht, “Exact matrix completion via convex optimiza-tion,” Foundations of Computational Mathematics , vol. 9, no. 6, p. 717,2009.[56] R. Mazumder, T. Hastie, and R. Tibshirani, “Spectral regularization algo-rithms for learning large incomplete matrices,” Journal of Machine LearningResearch , vol. 11, no. Aug, pp. 2287–2322, 2010.[57] W. J. Fu, “Penalized regressions: The bridge versus the lasso,” Journal ofComputational and Graphical Statistics , vol. 7, no. 3, pp. 397–416, 1998.[58] J. De Leeuw, “Block-relaxation algorithms in statistics,” in InformationSystems and Data Analysis , pp. 308–324, Springer, 1994.[59] D. R. Hunter and K. Lange, “A tutorial on MM algorithms,” The AmericanStatistician , vol. 58, no. 1, pp. 30–37, 2004.[60] H. A. L. Kiers, “Weighted least squares fitting using ordinary least squaresalgorithms,” Psychometrika , vol. 62, no. 2, pp. 251–266, 1997.[61] S. Boyd and L. Vandenberghe, Convex optimization . Cambridge UniversityPress, 2004.[62] C. Lu, C. Zhu, C. Xu, S. Yan, and Z. Lin, “Generalized singular valuethresholding,” in Twenty-Ninth AAAI Conference on Artificial Intelligence ,2015.[63] L. Le Cam and G. L. Yang, Asymptotics in statistics: Some basic concepts .Springer Science and Business Media, 2012.60 Bibliography [64] Y. Song, J. A. Westerhuis, N. Aben, L. F. Wessels, P. J. Groenen, andA. K. Smilde, “Generalized simultaneous component analysis of binary andquantitative data,” arXiv preprint arXiv:1807.04982 , 2018.[65] K. Van Deun, A. K. Smilde, M. J. van der Werf, H. A. L. Kiers, andI. Van Mechelen, “A structured overview of simultaneous component baseddata integration,” BMC Bioinformatics , vol. 10, no. 1, p. 246, 2009.[66] R. A. van den Berg, I. Van Mechelen, T. F. Wilderjans, K. Van Deun,H. A. L. Kiers, and A. K. Smilde, “Integrating functional genomics datausing maximum likelihood based simultaneous component analysis,” BMCBioinformatics , vol. 10, no. 1, p. 340, 2009.[67] Q. Mo, S. Wang, V. E. Seshan, A. B. Olshen, N. Schultz, C. Sander, R. S.Powers, M. Ladanyi, and R. Shen, “Pattern discovery and cancer gene iden-tification in integrated cancer genomic data,” Proceedings of the NationalAcademy of Sciences , vol. 110, no. 11, pp. 4245–4250, 2013.[68] M. Collins, S. Dasgupta, and R. E. Schapire, “A generalization of princi-pal components analysis to the exponential family,” in Advances in NeuralInformation Processing Systems , pp. 617–624, 2002.[69] V. Koltchinskii, K. Lounici, A. B. Tsybakov, et al. , “Nuclear-norm penaliza-tion and optimal rates for noisy low-rank matrix completion,” The Annalsof Statistics , vol. 
39, no. 5, pp. 2302–2329, 2011.[70] D. Wu, D. Wang, M. Q. Zhang, and J. Gu, “Fast dimension reduction andintegrative clustering of multi-omics data using low-rank approximation:Application to cancer molecular classification,” BMC Genomics , vol. 16,no. 1, p. 1022, 2015.[71] R. Bro, K. Kjeldahl, A. K. Smilde, and H. A. L. Kiers, “Cross-validationof component models: A critical look at current methods,” Analytical andBioanalytical Chemistry , vol. 390, no. 5, pp. 1241–1251, 2008.[72] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, “Support vector machines withadaptive L q penalty,” Computational Statistics & Data Analysis , vol. 51,no. 12, pp. 6380–6394, 2007.[73] Y. Song, J. A. Westerhuis, N. Aben, M. Michaut, L. F. Wessels, and A. K.Smilde, “Principal component analysis of binary genomics data,” Briefingsin Bioinformatics , 2017.[74] R. Akbani, K. C. Akdemir, B. A. Aksoy, M. Albert, A. Ally, S. B. Amin,H. Arachchi, A. Arora, J. T. Auman, B. Ayala, et al. , “Genomic classifica-tion of cutaneous melanoma,” Cell , vol. 161, no. 7, pp. 1681–1696, 2015. ibliography et al. , “Comprehensive molecular profiling of lungadenocarcinoma,” Nature , vol. 511, no. 7511, p. 543, 2014.[76] C. G. A. Network et al. , “Comprehensive molecular portraits of humanbreast tumours,” Nature , vol. 490, no. 7418, p. 61, 2012.[77] N. Aben, J. A. Westerhuis, Y. Song, H. A. L. Kiers, M. Michaut, A. K.Smilde, and L. F. A. Wessels, “iTOP: Inferring the topology of omics data,” Bioinformatics , vol. 34, pp. i988–i996, 09 2018.[78] J. D. Lee and T. J. Hastie, “Learning the structure of mixed graphicalmodels,” Journal of Computational and Graphical Statistics , vol. 24, no. 1,pp. 230–253, 2015.[79] J. Cheng, T. Li, E. Levina, and J. Zhu, “High-dimensional mixed graphicalmodels,” Journal of Computational and Graphical Statistics , vol. 26, no. 2,pp. 367–378, 2017.[80] R. Shen, A. B. Olshen, and M. Ladanyi, “Integrative clustering of multiplegenomic data types using a joint latent variable model with application tobreast and lung cancer subtype analysis,” Bioinformatics , vol. 25, no. 22,pp. 2906–2912, 2009.[81] F. M. van der Kloet, P. Sebasti´an-Le´on, A. Conesa, A. K. Smilde, andJ. A. Westerhuis, “Separating common from distinctive variation,” BMCbioinformatics , vol. 17, no. 5, p. S195, 2016.[82] E. F. Lock, K. A. Hoadley, J. S. Marron, and A. B. Nobel, “Joint andindividual variation explained (JIVE) for integrated analysis of multipledata types,” The Annals of Applied Statistics , vol. 7, no. 1, p. 523, 2013.[83] T. L¨ofstedt, D. Hoffman, and J. Trygg, “Global, local and unique decom-positions in OnPLS for multiblock data analysis,” Analytica Chimica Acta ,vol. 791, pp. 13–24, 2013.[84] M. Schouteden, K. Van Deun, T. F. Wilderjans, and I. Van Mechelen,“Performing DISCO-SCA to search for distinctive and common informationin linked data,” Behavior Research Methods , vol. 46, no. 2, pp. 576–587,2014.[85] I. M˚age, E. Menichelli, and T. Næs, “Preference mapping by PO-PLS:Separating common and unique information in several data blocks,” FoodQuality and Preference , vol. 24, no. 1, pp. 8–16, 2012.[86] I. M˚age, A. K. Smilde, and F. M. van der Kloet, “Performance of meth-ods that separate common and distinct variation in multiple data blocks,” Journal of Chemometrics , p. e3085, 2018.62 Bibliography [87] A. Klami, S. Virtanen, E. Lepp¨aaho, and S. Kaski, “Group factor analy-sis,” IEEE Transactions on Neural Networks and Learning Systems , vol. 26,no. 9, pp. 2136–2147, 2015.[88] P. O. 
Perry, “Cross-validation for unsupervised learning,” arXiv preprint arXiv:0909.3052, 2009.
[89] R. Argelaguet, B. Velten, D. Arnol, S. Dietrich, T. Zenz, J. C. Marioni, F. Buettner, W. Huber, and O. Stegle, “Multi-Omics Factor Analysis—a framework for unsupervised integration of multi-omics data sets,” Molecular Systems Biology, vol. 14, no. 6, p. e8124, 2018.
[90] N. Meinshausen, P. Bühlmann, et al., “High-dimensional graphs and variable selection with the lasso,” The Annals of Statistics, vol. 34, no. 3, pp. 1436–1462, 2006.
[91] C. Leng, Y. Lin, and G. Wahba, “A note on the lasso and related procedures in model selection,” Statistica Sinica, pp. 1273–1284, 2006.
[92] S. Dietrich, M. Oleś, J. Lu, L. Sellner, S. Anders, B. Velten, B. Wu, J. Hüllein, M. da Silva Liberio, T. Walther, et al., “Drug-perturbation-based stratification of blood cancer,” The Journal of Clinical Investigation, vol. 128, no. 1, pp. 427–445, 2018.
[93] K. Van Deun, T. F. Wilderjans, R. A. Van den Berg, A. Antoniadis, and I. Van Mechelen, “A flexible framework for sparse simultaneous component based data integration,” BMC Bioinformatics, vol. 12, no. 1, p. 448, 2011.
[94] E. Acar, R. Bro, and A. K. Smilde, “Data fusion in metabolomics using coupled matrix and tensor factorizations,” Proceedings of the IEEE, vol. 103, no. 9, pp. 1602–1620, 2015.
[95] Y. Song, J. A. Westerhuis, and A. K. Smilde, “Logistic principal component analysis via non-convex singular value thresholding,” arXiv preprint arXiv:1902.09486, 2019.
[96] A. Gelman, H. S. Stern, J. B. Carlin, D. B. Dunson, A. Vehtari, and D. B. Rubin, Bayesian Data Analysis. Chapman and Hall/CRC, 2013.
[97] A. K. Smilde, H. A. Kiers, S. Bijlsma, C. Rubingh, and M. Van Erk, “Matrix correlations for high-dimensional data: the modified RV-coefficient,” Bioinformatics, vol. 25, no. 3, pp. 401–405, 2008.
[98] P. I. Frazier, “A tutorial on Bayesian optimization,” arXiv preprint arXiv:1807.02811, 2018.
[99] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian optimization of machine learning algorithms,” in Advances in Neural Information Processing Systems, pp. 2951–2959, 2012.
[100] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[101] C. Anderson-Bergman, T. G. Kolda, and K. Kincher-Winoto, “XPCA: Extending PCA for a combination of discrete and continuous variables,” arXiv preprint arXiv:1808.07510, 2018.
[102] P. Geladi and B. R. Kowalski, “Partial least-squares regression: a tutorial,” Analytica Chimica Acta, vol. 185, pp. 1–17, 1986.
[103] A. K. Smilde, J. J. Jansen, H. C. Hoefsloot, R.-J. A. Lamers, J. Van Der Greef, and M. E. Timmerman, “ANOVA-simultaneous component analysis (ASCA): a new tool for analyzing designed metabolomics data,” Bioinformatics, vol. 21, no. 13, pp. 3043–3048, 2005.

Figure 3.1: Thresholding properties of the exact low rank constraint, the $L_q$, SCAD and GDP penalties when the same degree of shrinkage is achieved. exact: exact low rank constraint, $L_1$: nuclear norm penalty. $\sigma$ indicates the original singular value, while $\eta$ is the value after thresholding.
The constraint $\mathbf{1}^{T}A = \mathbf{0}$ makes the column offset terms $\{\mu_l\}_{l=1}^{L}$ identifiable. The columns of the score matrix $A$ span the joint subspace $\bigcup_{l=1}^{L}\mathrm{col}(\Theta_l)$, in which $\mathrm{col}()$ indicates the column subspace. The structured sparse pattern on the loading matrices and the multiplication of the score and loading matrices provide a way to separate the joint subspace $\bigcup_{l=1}^{L}\mathrm{col}(\Theta_l)$ into the subspaces $\mathrm{col(GC)}$, $\mathrm{col(LC)}$ and $\mathrm{col(D)}$ corresponding to the global common, local common and distinct variation, with $\mathrm{col(GC)} \cup \mathrm{col(LC)} \cup \mathrm{col(D)} = \bigcup_{l=1}^{L}\mathrm{col}(\Theta_l)$. If the orthogonality constraint $A^{T}A = I$ is imposed, the separated subspaces $\mathrm{col(GC)}$, $\mathrm{col(LC)}$ and $\mathrm{col(D)}$, corresponding to the global common, local common and distinct variation, are orthogonal to each other and unique, as $\mathrm{col(GC)} \cap \mathrm{col(LC)} \cap \mathrm{col(D)} = \emptyset$. However, there is still rotational freedom for the components within the subspace corresponding to the global common, the local common or the distinct variation.

The regularized likelihood criterion for fitting the proposed P-ESCA model can be derived as follows. To tackle the missing value problem, $L$ weight matrices are introduced. For the $l$th data set $X_l$, we introduce a weight matrix $W_l$ of the same size, in which $w_{lij} = 0$ if the corresponding element of $X_l$ is missing and $w_{lij} = 1$ otherwise. This construction is also the basis for the different missing value based cross validation approaches. The corresponding optimization problem can be expressed as follows,
$$
\begin{aligned}
\min_{\{\mu_l\}_{l=1}^{L},\, A,\, \{B_l\}_{l=1}^{L}} \quad & \sum_{l=1}^{L}\Big[-\log\big(p(X_l \mid \Theta_l, \alpha_l)\big) + \lambda_l \sqrt{J_l}\sum_{r=1}^{R} g(\sigma_{lr})\Big] \\
= \quad & \sum_{l=1}^{L}\Big[\frac{1}{\alpha_l}\big(\langle W_l, b_l(\Theta_l)\rangle - \langle W_l \odot X_l, \Theta_l\rangle\big) + \lambda_l \sqrt{J_l}\sum_{r=1}^{R} g(\sigma_{lr})\Big] + c \\
\text{subject to} \quad & \Theta_l = \mathbf{1}\mu_l^{T} + AB_l^{T}, \quad l = 1, \dots, L \\
& \mathbf{1}^{T}A = \mathbf{0}, \quad A^{T}A = I \\
& \sigma_{lr} = \|b_{l,r}\|_2, \quad l = 1, \dots, L; \; r = 1, \dots, R, \qquad (5.2)
\end{aligned}
$$
in which $\odot$ indicates element-wise matrix multiplication.

The original problem in equation 5.2 is difficult to optimize directly because of the non-convex orthogonality constraint $A^{T}A = I$ and the group-wise concave penalty $g()$. However, by using the majorization-minimization (MM) principle, the original difficult problem can be majorized to a simpler problem, for which analytical solutions can be derived for all the parameters. According to the MM principle, the derived algorithm will monotonically decrease the loss function in each iteration. Further details of the MM principle can be found in [58, 59].
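To make the role of the weight matrices in equation 5.2 concrete, the sketch below evaluates the penalized criterion, as written above, for Gaussian and Bernoulli blocks with missing entries coded as NaN. It assumes the GDP function as the concave penalty $g()$; all names are illustrative and the code is not the implementation used in this thesis.

```python
import numpy as np

def gdp_penalty(sigma, gamma=1.0):
    """Concave GDP function g(sigma) = log(1 + sigma/gamma) from Table 5.1."""
    return np.log1p(sigma / gamma)

def pesca_objective(X_list, dist_list, mu_list, B_list, A,
                    alpha_list, lam_list, gamma=1.0):
    """Evaluate the penalized criterion of equation 5.2 for a list of data blocks.

    Missing entries are coded as np.nan and enter only through the 0/1 weight
    matrices W_l; dist_list entries are 'gaussian' or 'bernoulli'.
    """
    total = 0.0
    for X, dist, mu, B, alpha, lam in zip(X_list, dist_list, mu_list,
                                          B_list, alpha_list, lam_list):
        W = (~np.isnan(X)).astype(float)       # w_lij = 0 where x_lij is missing
        Xf = np.nan_to_num(X)                  # missing entries then contribute nothing
        Theta = np.outer(np.ones(A.shape[0]), mu) + A @ B.T

        if dist == 'gaussian':                 # log-partition b(theta) = theta^2 / 2
            b_theta = 0.5 * Theta ** 2
        else:                                  # Bernoulli: b(theta) = log(1 + exp(theta))
            b_theta = np.logaddexp(0.0, Theta)

        deviance = (np.sum(W * b_theta) - np.sum(W * Xf * Theta)) / alpha
        sigma = np.linalg.norm(B, axis=0)      # L2 norm of every loading column
        penalty = lam * np.sqrt(B.shape[0]) * np.sum(gdp_penalty(sigma, gamma))
        total += deviance + penalty
    return total
```

The same weight matrices are what make the missing value based cross validation possible: entries left out for validation are simply given weight zero during fitting.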