Multi-Block Sparse Functional Principal Components Analysis for Longitudinal Microbiome Multi-Omics Data
Lingjing Jiang, Chris Elrod, Jane J. Kim, Austin D. Swafford, Rob Knight, Wesley K. Thompson
SSubmitted to the Annals of Applied Statistics
BAYESIAN MULTIVARIATE SPARSE FUNCTIONAL PRINCIPALCOMPONENTS ANALYSIS WITH APPLICATION TO LONGITUDINALMICROBIOME MULTI-OMICS DATA B Y L INGJING J IANG , C
HRIS E LORD , J ANE
J. K IM , A USTIN
D. S
WAFFORD , R OB K NIGHT
AND W ESLEY
K. T
HOMPSON Herbert Wertheim School of Public Health and Human Longevity Science, University of California San Diego, * [email protected]; † [email protected] Julia Computing, [email protected] Department of Pediatrics, University of California San Diego, [email protected]; [email protected] Center for Microbiome Innovation, University of California San Diego, [email protected];[email protected] Department of Computer Science and Engineering, University of California San Diego, [email protected] Department of Bioengineering, University of California San Diego, [email protected]
Microbiome researchers often need to model the temporal dynamics ofmultiple complex, nonlinear outcome trajectories simultaneously. This moti-vates our development of multivariate Sparse Functional Principal Compo-nents Analysis (mSFPCA), extending existing SFPCA methods to simultane-ously characterize multiple temporal trajectories and their inter-relationships.As with existing SFPCA methods, the mSFPCA algorithm characterizes eachtrajectory as a smooth mean plus a weighted combination of the smoothmajor modes of variation about the mean, where the weights are given bythe component scores for each subject. Unlike existing SFPCA methods, themSFPCA algorithm allows estimation of multiple trajectories simultaneously,such that the component scores, which are constrained to be independentwithin a particular outcome for identifiability, may be arbitrarily correlatedwith component scores for other outcomes. A Cholesky decomposition isused to estimate the component score covariance matrix efficiently and guar-antee positive semi-definiteness given these constraints. Mutual informationis used to assess the strength of marginal and conditional temporal associa-tions across outcome trajectories. Importantly, we implement mSFPCA as aBayesian algorithm using R and stan , enabling easy use of packages such asPSIS-LOO for model selection and graphical posterior predictive checks toassess the validity of mSFPCA models. Although we focus on application ofmSFPCA to microbiome data in this paper, the mSFPCA model is of generalutility and can be used in a wide range of real-world applications.
1. Introduction.
Numerous disorders, including heritable immune-mediated diseasessuch as inflammatory bowel disease (IBD) and asthma, neurological conditions includingautism, and genetically driven diseases such as cancer, have been linked to dysregulation ofhuman microbiota (Holleran et al., 2018; Lloyd-Price et al., 2019; Frati et al., 2019; Sharonet al., 2019; Ballen et al., 2016). However, the complex influences of microbiota on humanhealth are not yet functionally understood. To understand the links between the human mi-crobiome and disease, it is necessary to determine which microbe genes are being expressedand the timing of their expression (Sberro et al., 2019). Thus, in addition to obtaining micro-biome data using 16S ribosomal RNA gene sequencing or whole genome shotgun sequenc-ing (Kuczynski et al., 2010; Ranjan et al., 2016; Gill et al., 2006), an increasing number
Keywords and phrases:
Bayesian, Functional Data Analysis, Longitudinal, Microbiome, Multi-omics. a r X i v : . [ s t a t . M E ] F e b of studies are also collecting transcriptomics data to understand microbial gene expression,proteomics data to study expressed proteins, and metabolomics data to define the functionalstatus of host-microbial relationships (iHMP Consortium, 2014; Lloyd-Price et al., 2019;Bouslimani et al., 2019). This complex combination of data types, called microbiome multi-omics , is essential for understanding the links between microbial communities and diseaseand may enable translation of microbiome research into effective treatments.An increasing number of microbiome multi-omics studies are longitudinal, aimed at si-multaneously characterizing microbiome and host temporal changes to provide a more com-prehensive picture of dynamics during healthy and diseased states (iHMP Consortium, 2014;Lloyd-Price et al., 2019; Vatanen et al., 2018; Stewart et al., 2018). Despite these break-throughs in microbiome study designs and data collection, few statistical methods are avail-able to analyze these complex longitudinal omics data. Recently, several new methods basedon network analysis were developed for multi-omics integration of microbiome data in cross-sectional studies (Jiang et al., 2019; Morton et al., 2019), however, analytical methods forlongitudinal microbiome multi-omics data are still in their infancy. The challenges includeirregular timing and frequency across subjects, unmatched time points between different datatypes, non-linear temporal patterns, missing data, and high individual variability (Bodeinet al., 2019).The statistical framework of functional data analysis (FDA) was introduced by Ramsayand Silverman (Ramsay and Silverman, 1997), wherein the basic unit of information isthe entire function, such as a curve or an image. Functional principal component analysis(FPCA) has been a widely used tool in FDA. The fundamental aim of FPCA is to reducedimensionality by capturing the principal modes of smooth variation. FPCA summarizessubject-specific deviations from the mean curve via the coordinates ( principal componentscores ) of the basis spanned by the principal components (Di et al., 2009). Existing FPCAmethods include smoothed FPCA approaches based on roughness penalties (Rice and Silver-man, 1991), extensions to sparsely sampled functional data (James, Hastie and Sugar, 2000;Yao, Müller and Wang, 2005; Peng and Paul, 2009; Di, Crainiceanu and Jank, 2014; Kidz-i´nski and Hastie, 2018), and the asymptotic properties of FPCA (Hall and Hosseini-Nasab,2006; Li et al., 2010). Most FPCA methods development has focused on univariate functionaldata. Chiou, Chen and Yang (2014) proposed a multivariate FPCA method to simultaneouslymodel multiple temporal outcomes and infer the component dependencies through pairwisecross-covariance functions. However, this method is limited to functional data as classicallyconceived, where the curves are observed longitudinally over densely and consistently sam-pled time points.To meet the need for modeling irregularly and sparsely sampled, non-linear multivariatemicrobiome multi-omics trajectories, we developed multivariate sparse functional principalcomponents analysis (mSFPCA). The major novelty of our approach is that it focuses on aset of functions which may covary in complex ways. Smoothing is accomplished by retain-ing a low-dimensional set of PC functions, as in James, Hastie and Sugar (2000). The PCscores across outcomes are modeled jointly via a constrained covariance matrix, efficientlyestimated by Cholesky decomposition. Our proposed method allows for simultaneously char-acterizing multiple temporal measurements, such as microbiome, metabolome, inflammatorymarkers, and self-report measures, and to infer the temporal associations among these mea-sures both marginally and conditionally, based on estimation of marginal and partial mu-tual information. The model is implemented using a Bayesian formulation, instantiated withHamiltonian Markov Chain Monte Carlo (MCMC) methods in stan to sample from the pos-terior distribution of the model parameters. Our Bayesian implementation enables the usageof PSIS-LOO for model selection and graphical posterior predictive checks to assess the va-lidity of mSFPCA models. While we focus on application of mSFPCA to microbiome datain this paper, the mSFPCA model is of general utility and can be used in many applications. ULTIVARIATE SFPCA FOR LONGITUDINAL MICROBIOME MULTI-OMICS DATA The remainder of the paper is organized as follows. Section 2 reviews sparse functionalprincipal component analysis (SFPCA), and introduces multivariate SFPCA, our statisticalframework for longitudinal microbiome multi-omics data. Section 3 describes extensive sim-ulation studies to evaluate the performance of mSFPCA in realistic settings. Section 4 de-scribes the application of our methodology to a challenging longitudinal microbiome multi-omics data on type 2 diabetes. Section 5 presents our conclusion. To ensure reproducibil-ity of our results, accompanying software, simulations and analysis results are posted athttps://github.com/knightlab-analyses/mfpca-analyses.
2. Methodology.
Sparse functional principal components analysis.
The classical assumption of func-tional data analysis is that each trajectory is sampled over a dense grid of time points commonto all individuals (Ramsay and Silverman, 2007). However, in practice, trajectories are oftenmeasured at an irregular and sparse set of time points that can differ widely across indi-viduals. To address this issue, James, Hastie and Sugar (2000) proposed sparse functionalprincipal components analysis (SFPCA) using a reduced rank mixed-effects framework. Let Y i ( t ) be the measurement at time t for the i th individual, µ ( t ) the overall mean function, f j the j th principal component function and f = [( f , f , . . . , f k )] T , where k is the number ofprincipal components. Then the James, Hastie and Sugar (2000) SFPCA model is given by(1) Y i ( t ) = µ ( t ) + k X j =1 f j ( t ) α ij + (cid:15) i ( t ) , i = 1 , ..., N subject to the orthogonality constraint R f j f l = δ jl , the Kronecker δ . The vector α i =( α i , . . . , α ik ) T is the component weights for the i th individual and (cid:15) i ( t ) is a normally-distributed residual, independent across subjects and across times within subject. The func-tions µ and f are approximated using cubic splines to allow a smooth but flexible fit. Let b ( t ) be a cubic spline basis with dimension q > k . The spline basis is orthonormalized sothat R b j b l = δ jl . Let Θ and θ µ be, respectively, a q × k matrix and a q -dimensional vec-tor of real-valued coefficients. For each individual i , denote their measurement times by t = ( t i , t i , . . . , t in i ) T , and let Y i = ( Y i ( t i ) , . . . , Y i ( t in i )) T be the corresponding real-valuedobservations. Then B i = ( b ( t i ) , . . . , b ( t in i )) T is the n i × q spline basis matrix for the i th in-dividual. The reduced rank model can then be written as Y i = B i θ µ + B i Θ α i + (cid:15) i , i = 1 , ..., N, (2) Θ T Θ =
I, α i ∼ N (0 , D ) , (cid:15) i ∼ N (0 , σ (cid:15) I n i ) , where the covariance matrix D is restricted to be diagonal and I n i is the n i × n i identitymatrix. Various fitting approaches, such as the EM algorithm, kernel smoothing and Newton-Raphson algorithm, have been proposed to estimate parameters of the SFPCA model (James,Hastie and Sugar, 2000; Yao, Müller and Wang, 2005; Peng and Paul, 2009). These ap-proaches then use model selection techniques, such as cross-validation, Akaike informationcriterion (AIC) and leave-one-curve-out cross-validation, to select the dimension of the basisand the number of principal components. However, due to their reliance on assumptions suchas normally-distributed component scores and residuals, these models need to be carefullyexamined when applied to real data (Kidzi´nski and Hastie, 2018). Bayesian SFPCA.
Jiang et al. (2020) proposed an SFPCA model in a Bayesianframework to allow for flexible prior specification and implementation of model selectionand assessment methods. This Bayesian implementation used the Hamiltonian MCMC sam-pling algorithm in
Stan (Carpenter et al., 2017). The real-valued observations Y i ( t ) are firststandardized to have mean zero and standard deviation one. The prior distributions for pa-rameters in Eq. (2) were chosen as follows: θ µ ∼ N q (0 , I q ) α i ∼ N k (0 , I k )Θ j ∼ N q (0 , I q ) , j = 1 , . . . , k(cid:15) i ∼ N v i (0 , σ (cid:15) I v i ) σ (cid:15) ∼ Cauchy (0 , , where Θ j is the j th column of the loading matrix Θ , and v i is the total number of visits forthe i th subject. This Bayesian implementation enables use of leave-one-out cross-validationwith Pareto-smoothed important sampling (PSIS-LOO) (Vehtari, Gelman and Gabry, 2017)to perform model selection on the number of principal components k and the number of basisfunctions q . Moreover, model fit can be assessed via diagnostic plots from PSIS-LOO as wellas the graphical posterior predictive checks obtained from simulating posterior predictivedata (Gelman, Meng and Stern, 1996; Gabry et al., 2019). This Bayesian implementationthus offers a flexible and comprehensive solution to real-data SFPCA applications.2.3. Multivariate SFPCA.
Here, we extend our previous Bayesian SFPCA to simultane-ously model multiple trajectories, and to quantify their marginal and conditional temporalassociations. Let Y i ( t ) denote the P -dimensional observed response at time t for subject i ,which can be modeled by the multivariate Functional PCA (mFPCA) model as(3) Y i ( t ) = µ i ( t ) + f ( t ) T α i + (cid:15) i ( t ) , i = 1 , ..., N, where µ i = ( µ i, ( t ) , ..., µ i,P ( t )) T is the overall mean response of P trajectories for subject i , f ( t ) T = diag ( f ( t ) T , ..., f P ( t ) T ) , with f p ( t ) the FPC functions corresponding to the p thtrajectory at time t , α i = ( α i, , ..., α i,P ) is the vector of FPC scores for subject i , and (cid:15) i isthe corresponding residuals.In order to fit this model when the data are sampled at only a finite number of time points,we chose to fit µ i and f ( t ) using a basis of spline functions B . Let K p be the number ofFPCs, Q p be the corresponding number of basis functions, V ip be the total number of as-sessments for i th subject in the p th temporal measurement, B ip be the transpose of the cubicspline basis, and Θ p be the corresponding FPC loadings. Then the total number of princi-pal components across p measurements are K = P Pp =1 K p , the total number of basis func-tions are Q = P Pp =1 Q p , and the total number of assessments for subject i is V i = P Pp =1 V ip .The model for Y i ( t ) can be written into the multivariate Sparse Functional PCA (mSFPCA)model as(4) Y i = B i θ µ + B i Θ α i + (cid:15) i , i = 1 , ..., N, ULTIVARIATE SFPCA FOR LONGITUDINAL MICROBIOME MULTI-OMICS DATA where Y i is a P -dimensional observed response, residuals (cid:15) i ∼ N V i (0 , σ (cid:15) I V i ) , B i is a V i × Q matrix with B i = B i T · · · T B i · · · T ... ... . . . ... · · · B iP , where B ip is the V ip × Q p matrix of spline bases evaluated at assessment times t ip =( t ip , . . . , t ipV ip ) . The Q × K matrix Θ of FPC loadings Θ = Θ T · · · T · · · T ... ... . . . ... · · · Θ P , is subject to the orthonormality constraint Θ T Θ = I . The K -dimensional vector of FPCscores α i ∼ N (0 , Σ α ) , with Σ α is restricted to the form D C T · · · C TP C D · · · C TP ... ... . . . ... C P C P · · · D P , where D p is the within-trajectory diagonal covariance matrix (necessary for identifiability ofwithin-trajectory FPCs) for the p th trajectory , and C lm is the covariance matrix for the l thand m th trajectories. Σ α can be written as Σ α = S α R α S α , where S α is the diagonal matrixof standard deviations for the FPC scores, and R α is the correlation matrix restricted to theform I R T · · · R TP R I · · · R TP ... ... . . . ... R P R P · · · I P , where I p is the Q p × Q p identity matrix corresponding to the p th trajectory.We implemented the sMFPCA model using the Hamiltonian Markov Chain Monte Carlo(MCMC) sampling algorithm in Stan to estimate parameters. The prior distributions for θ µ , Θ and (cid:15) i are set as follows θ µ ∼ N Q (0 , I Q ) Θ kp ∼ N Q p (0 , I Q p ) , (cid:15) i ∼ N V i (0 , σ (cid:15) I V i ) σ (cid:15) ∼ Cauchy (0 , , where Θ kp is the k th column of the FPC loadings in the p th block. Leveraging this Bayesianimplementation in Stan , we utilized PSIS-LOO for model selection, and diagnostics plotsfrom PSIS-LOO and graphical posterior predictive checks for model diagnostics.
Orthonormality constraint.
A difficulty in implementing the Bayesian mSFPCAmodel is that the principal component loadings Θ are not uniquely specified. For a given K × K rotation matrix P , if Θ ∗ = Θ P and Θ obeys the constraints in Eq.(4), then Θ ∗ T Θ ∗ = P T Θ T Θ P = I , and hence Θ is unidentifiable without additional restrictions.Instead of directly enforcing orthonormality when sampling from the conditional posteriorsin the Bayesian model fitting, we sampled the parameters with no constraint on Θ and thenperformed a post hoc rotation for each iteration of the MCMC algorithm to meet the orthonor-mality constraint. Since the symmetric matrix Θ Σ α Θ T is identifiable and non-negative def-inite, we applied an eigenvalue decomposition Θ Σ α Θ T = V SV T , where V is the Q × Q matrix of orthonormal eigenvectors, and S is the diagonal matrix of eigenvalues, with the Q positive eigenvalues ordered from largest to smallest. Let Θ ∗ = V k denote the Q × K matrix consisting of the first K eigenvectors of V , which satisfies Θ ∗ T Θ ∗ = I . Finally, werotated Σ α and FPC scores α i , to obtain Σ ∗ α = Θ T ∗ Θ Σ α Θ T Θ ∗ , and α ∗ i = Θ ∗ T Θ α i , so that Θ ∗ Σ ∗ α Θ T ∗ = Θ Σ α Θ T , and Θ ∗ α ∗ i = Θ α i .2.3.2. Modeling covariance.
The covariance matrix of FPC scores Σ α must be positivesemi-definite and is restricted to be diagonal within-trajectory, it is a challenge to model thiscovariance matrix effectively. Barnard, McCulloch and Meng (2000) proposed a separationstrategy for modeling Σ =
SRS by assuming independent priors for the standard deviations S and the correlation matrix R . To account for the dependent structure of correlations amongdifferent subsets of variables, Liechty, Liechty and Müller (2004) proposed the common cor-relation model for R , which assumes a common normal prior for all correlations with theadditional restriction that the correlation matrix is positive definite. However, the awkwardmanner in which r ij , the ij th element in the correlation matrix R , is embedded in the fullconditional posterior density, leading to use of a Metropolis-Hastings algorithm to updateone coefficient r ij at a time (Liechty, Liechty and Müller, 2004). This consecutive updatingprocedure for correlation estimation is inefficient, and could lead to heavy computationalcost when the correlation matrix is large or when the correlation has to be estimated sepa-rately from other parameters in mSFPCA model when implemented in Stan (Carpenter et al.,2017). For example, in a simulated data with 3 temporal measurements from 100 subjectsover 10 time points, it would take 40 hours for a mSFPCA model using Liechty’s covari-ance estimation method to estimate all the parameters when implemented in Stan. However,the computational time can be reduced over 130 times (to only 18 minutes) by using ourproposed method due to the avoidance of additional Metropolis-Hastings algorithm.To pursue an efficient numerical solution to the covariance estimation, we took advantageof the Cholesky decomposition (Nash, 1990) and imposed the diagonal constraint on thewithin-trajectory covariance matrices. Since the covariance matrix of FPC scores Σ α has fullrank with probability one, it has a unique Cholesky decomposition in the form of Σ α = LL T , where L is a real lower triangular matrix with positive diagonal entries (Gentle, 2012) . Givena lower triangular matrix L divided into P blocks, we have L = L , · · · L , L , · · · ... ... . . . ... L P, L P, · · · L P,P , ULTIVARIATE SFPCA FOR LONGITUDINAL MICROBIOME MULTI-OMICS DATA then LL T = L , L T , ∗ · · · ∗ L , L T , L , L T , + L , L T , · · · ∗ ... ... . . . ... L P, L T , L P, L T , + L P, L T , · · · L P, L TP, + ... + L P,P L TP,P , where ∗ denotes the transpose of the corresponding sub-diagonal block. To ensure that LL T is positive definite with diagonal within-block covariance matrices, the lower triangularCholesky factor L needs to meet the following two conditions:1. Within-block covariance matrices P Mm =1 L M,m L TM,m , M = 1 , ..., P , are diagonal.2. The diagonal entries of L M,M , M = 1 , ..., P are positive.We will focus on defining the diagonal blocks L M,M to achieve these, and leave the off-diagonal blocks L M,m , m = 1 , ..., M − to be arbitrary, unconstrained (i.e. the unconstrainedparameter elements from the Hamiltonian MCMC sampling).Let D M , M = 1 , ..., P be the M th within-block covariance matrix, then(5) D M = M X m =1 L M,m L TM,m = L M,M L TM,M + M − X m =1 L M,m L TM,m ,L M,M L TM,M = D M − M − X m =1 L M,m L TM,m = A. Since all the off-diagonal elements of D M are known to be zero and the off-diagonal blocks L M,m , m = 1 , ..., M − are defined earlier with unconstrained estimates, we have thus de-fined all the off-diagonals of this matrix A, leaving only the diagonals. Because L M,M needs to have positive diagonal entries, L M,M L TM,M must be positive definite, thus L M,M is the Cholesky factor of A. To derive L M,M , a typical approach is to proceed with theCholesky–Banachiewicz and Cholesky–Crout algorithm on A , where entries for the lowertriangular factor L are(6) L j,j = vuut Aj, j − j − X k =1 L j,k L Tj,k L i,j = 1 L j,j ( A i,j − j − X k =1 L i,k L Tj,k ) for i > j However, for the diagonal entries L j,j , instead of using Eq.6, we substitute it with an expo-nential term exp (0 . ∗ O + 2) to ensure it is positive, where O is the corresponding uncon-strained parameter estimates. Here, . was chosen to mimic the square root in the originalformula, and was added to bound initial values of diagonal entries away from zero, giventhat the default initial values are drawn uniformly from the interval ( − , in Stan . Finally,we update the off-diagonal entries L i,j using the existing formula Eq.6.In short, in our Bayesian implementation, we set the off-diagonal entries in within-trajectory covariance matrices to be zero, estimate the rest of parameters without constraintusing uninformative (non-proper) prior unif orm ( −∞ , + ∞ ) , substitute the diagonal entries with our exponential term, and finally update the off-diagonal entries. In this way, we areable to estimate the covariance matrix efficiently and guarantee it to be positive semi-definitewith our desired constrained form. Once we obtained the covariance matrix, we can then de-compose it into correlation matrix R α and standard deviations in order to estimate temporalassociations.2.3.3. Estimating inter-block association.
Apart from simultaneously modeling multi-variate longitudinal measurements, we want to estimate the association among measurementsof interest via the correlations among the FPC scores, where the correlation matrix R α ob-tained earlier will play a crucial role. We propose a measure of inter-trajectory association bycalculating the mutual information of FPC scores from different measurements.We define the inter-trajectory association between measurements p and p as the mutualinformation of FPC scores α ip and α ip , ≤ p , p ≤ P , with M I ( α ip , α ip ) = H ( α ip ) + H ( α ip ) − H ( α ip , α ip ) , where H ( X ) is the entropy of X and H ( X ) = − E [ log ( f X ( X ))] with f X ( X ) being theprobability density function of X (Cover, 1999).If K -dimensional random variable X follows multivariate normal distribution with covari-ance matrix Σ , then according to Ahmed and Gokhale (1989) H ( X ) = k k log (2 π ) + 12 | Σ | . Since the K -dimensional FPC scores α i ∼ N (0 , Σ α ) , and any subvector of α i is ofthe same structure with the correlation matrix being a submatrix of R α , then according toArellano-Valle, Contreras-Reyes and Genton (2013), the mutual information of α ip and α ip could be simplified as(7) M I ( α ip , α ip ) = − log | R α { p ,p } | , where R α { p ,p } = (cid:12)(cid:12)(cid:12)(cid:12) I p R p p R Tp p I p (cid:12)(cid:12)(cid:12)(cid:12) . Moreover, we can estimate the conditional inter-trajectory association between any twomeasurements of interest given the other measurements in the model by calculating thepartial mutual information of FPC scores. The conditional inter-trajectory association be-tween measurements p and p is defined as the partial mutual information of α ip and α ip , ≤ p , p ≤ P , with(8) M I ( α ip , α ip | α i { ,...,P \ p ,p } ) = H ( α ip , α i { ,...,P \ p ,p } ) + H ( α ip , α i { ,...,P \ p ,p } ) − H ( α i { ,...,P \ p ,p } ) − H ( α ip , α ip , α i { ,...,P \ p ,p } )= H ( α i { ,...,P \ p } ) + H ( α i { ,...,P \ p } ) − H ( α i { ,...,P \ p ,p } ) − H ( α i )= 12 log | R α { ,...,P \ p } | + 12 log | R α { ,...,P \ p } |− log | R α { ,...,P \ p ,p } | − log | R α | , ULTIVARIATE SFPCA FOR LONGITUDINAL MICROBIOME MULTI-OMICS DATA where R α { ,...,P \ p } , R α { ,...,P \ p } , and R α { ,...,P \ p ,p } are defined in the similar way as R α { p ,p } in Eq.(7).Inter-trajectory association obtained from this way ranges from 0 to infinity. By analogywith the way Person’s contingency coefficient was obtained, we can apply a simple trans-formation proposed by Joe (1989) to obtain a normalized version of the mutual informationas(9) M I ∗ ( α ip , α ip ) := q − exp [ − M I ( α ip , α ip )] . In this way, the inter-trajectory and conditional associations now take its value in [0, 1].The interpretation is that the closer
M I ∗ ( α ip , α ip ) or M I ∗ ( α ip , α ip | α i { ,...,P \ p ,p } ) isto 1, the higher the temporal association between measurements is.
3. Simulation studies.
To evaluate the performance of mSFPCA in modeling multi-ple temporal measurements, especially in its covariance estimation and temporal associationinference, we simulated sparse longitudinal trajectories with three temporal measurementsunder four different covariance structures. To better mimic the reality, our data was simulatedbased on an mSFPCA model using parameters initially estimated from a real longitudinalmicrobiome multi-omics dataset (Kostic et al., 2015) in the following way:1. Applying mSFPCA to model three temporal measurements in the real multi-omics dataset.2. Selecting the optimal number of PCs and dimension of basis using PSIS-LOO: the chosenmodel has the number of PCs as 2, 2, 1, and the number of basis as 6, 5, 5 for eachmeasurement respectively.3. Extracting the estimated values for population mean curve ( θ µ ) , FPC loadings ( Θ ) , andresidual variance σ (cid:15) .Then under four distinct covariance structures on FPC scores ( Σ α ), we simulate the tra-jectories for 100 subjects with an average of 20% missing data over observations at 10 timepoints. Observations were randomly deleted to create increasingly sparse functional datasets.In the 1st covariance structure, all PCs are independent; in the 2nd covariance structure, only1 strong correlation of 0.75 exists across all PCs; in the 3rd covariance structure, 1 strongand 1 medium correlation exists, at values of 0.75 and 0.5 respectively; in the 4th covari-ance structure, 1 strong, 1 medium and 1 weak correlations exists at strength of 0.75, 0.5 and0.25. In short, there are increasing dependence structures among PCs as the covariance struc-ture moves from the first to the last. Based on these pre-specified covariance structures andinitially estimated parameters, we simulate the sparse longitudinal trajectories as follows:1. Choosing the total number of subjects to be 100, and the number of time points to be 10in order to place possible time points between [0 , .2. Simulating the observed number of time points for each individual with n i ∼ P oisson (8) ,where 8 represents the average number of time points across all subjects, and then ran-domly placing the observed time points in the possible time locations (chosen in the pre-vious step).3. Generating the cubic spline basis matrix B i for each subject (orthonormality obtainedthrough Gram-Schmidt orthonormalization).4. Simulating for each subject FPC scores α i ∼ N (0 , Σ α ) and noise (cid:15) i ∼ N (0 , σ (cid:15) I ) .5. Obtaining the temporal trajectory for each individual with Y i = B i θ µ + B i Θ α i + (cid:15) i .6. Repeating step 1– 5 1000 times for each covariance structure, thus generating 4000 simu-lated datasets in total. To evaluate the mSFPCA model performance in simulated data, we want to examine threemain results: 1. how well mSFPCA can capture the temporal patterns embodied in the overallmean curve and FPC curves for each temporal measurement; 2. the accuracy of covarianceestimation; 3. the inference on temporal associations based on mutual information estimation.Figure 1 shows that the estimated overall mean curves and PC curves accurately recoveredthe ground truth for all three outcome variables under covariance structure I. This accuratecapturing of major temporal patterns was seen in other three covariance structures as well(Supplementary Figure 1-3). Figure 2 summarizes the performance of covariance estima-tion across all 4 scenarios in terms of the coverage probabilities of 95% credible intervalson estimated covariance parameters. The coverage probabilities are lowest in the 1st covari-ance structure (independent, Figure 2A), improved when more dependence structures areintroduced (Figure 2B-D), and reach highest with the 4th covariance structure (having mostcorrelations across PCs, Figure 2D). Despite these subtle differences in the coverage proba-bilities for each covariance parameter, the average coverage probability across all estimatedparameters, represented by the dashed line, is around 95% within each covariance structure.This indicates that our mSFPCA model is able to estimate the covariance matrix properly,and its performance is affected by the structure of covariance matrix itself: the more sparsethe covariance is, the more challenging the estimation. But even with the most sparse sce-nario (Figure 2 A), our mSFPCA model is still able to achieve about 95% average coverageprobability.Regarding the inference on temporal associations, Table 1 shows the mutual informationestimates in each simulation scenario, which estimates the temporal association between eachpair of temporal measurements.
M I ij denotes the normalized mutual information between i th and j th temporal measurements. When the true MI is zero, the coverage probability isdenoted as ∗ , as the estimated mutual information is always non-negative and thus by con-struction the posterior distribution will not encompass zero. In the rest of the table, except forthe slightly lower coverage probability with 0.92 for M I in the 3rd scenario, or 0.93 for M I in the 4th scenario, all the coverage probabilities are close to 95%. Table 2 shows theconditional mutual information estimates in each simulation scenario, which estimates thetemporal association between each pair of temporal measurements given the other measure-ment in the model. CM I ij denotes the normalized conditional mutual information between i th and j th temporal measurements. All coverage probabilities of non-zero CMIs are close to95% on the estimation of conditional mutual information.In short, our simulation results have demonstrated the good performance of mSFPCA inmodeling sparse longitudinal data with multiple temporal measurements and providing validinference on temporal associations.
4. Real data application.
For the real data application, we want to model multiple tem-poral measurements simultaneously in a large and challenging dataset, with a special interestin utilizing conditional mutual information to infer temporal association. This dataset comesfrom the type 2 diabetes (T2D) longitudinal studies in the Integrative Human MicrobiomeProject (iHMP Consortium, 2014). In this example, an over 3 years’ study has been con-ducted in approximately 100 individuals at high risk for T2D, in order to better understandthe biological changes that occur during the onset and progression of T2D. Multiple sam-ple types were collected from the study participants every 2-3 months during their healthyperiods, with more frequent sampling during periods of respiratory illness and other envi-ronmental stressors. These data include multi-omics assays such as stool microbiome datausing 16S rRNA sequencing, host protein expression profiles in fecal samples using LC-MS/MS, and cytokine profiles that quantify the levels of 50 diverse inflammatory proteinsand insulin peptides in host serum, as well as standard clinical tests results like hemoglobin
ULTIVARIATE SFPCA FOR LONGITUDINAL MICROBIOME MULTI-OMICS DATA F IG . Estimated mean and FPC curves from mSFPCA on simulated data with covariance structure I. Estimated(red) vs. true (blue) overall mean curve on simulated trajectories (black) on outcome variable 1 (A), outcomevariable 2 (B), and outcome variable 1 (C). Estimated (red) vs. true (black) PC curves on simulated data foroutcome variable 1 (D), outcome variable 2 (E), and outcome variable 3 (F). T ABLE Mutual information estimates for each simulation scenario
95% credible intervalSimulation scenario Parameter Truth Median Cov.prob. 2.5% 97.5%Covariance I MI MI MI MI MI MI MI MI MI MI MI MI A1c (HbA1c), insulin and glucose. Moreover, behavior changes of patients, such as emo-tional and psychological stress, were documented using the Perceived Stress Scale instru-ment. Our outcomes of interest are the longitudinal pattern of Shannon diversities in bacteria,proteins and cytokines, and of clinical test results on HbA1c. Shannon diversity is defined as F IG . Coverage probability of 95% credible interval on estimated covariance parameters in covariance matrixI (A), II (B), III (C) and IV (D). Values within each dot represent the coverage probability for each estimatedcovariance parameter, and only values for unique covariance element are displayed. Black dashed lines indicatethe average coverage probability across all estimated parameters within each covariance matrix, which is around95% in all four simulation scenarios. T ABLE Conditional mutual information estimates for each simulation scenario
95% credible intervalSimulation scenario Parameter Truth Median Cov.prob. 2.5% 97.5%Covariance I
CMI CMI CMI CMI CMI CMI CMI CMI CMI CMI CMI CMI Shannon = − P Si =1 p i ln ( p i ) , where S is the total number of species, and p i is the relativeproportion of species i relative to the entire population. Shannon diversities in bacteria, pro-teins and cytokines are chosen over specific features in our application, because they havehigher predictive power of individuals’ diabetic status than specific bacteria, proteins or cy-tokines. Hence, we are particularly interested in utilizing mutual information to investigatewhich omics data (based on Shannon diversity) have strongest association with HbA1c, andwhether additional omics data improve the temporal association based on conditional mutualinformation.The estimated mean curves in Figure 3 show different temporal trends in each outcome,where Shannon bacterial diversity decreases slowly over time, protein diversity increasessteadily over time, cytokine diversity increases over the first 300 days, decreases betweenday 300 and 900, and then increases, and HbA1c decreases during the first 2 years, andincreases slightly afterwards. As indicated by the observed trajectories for each individual(black curves), there are great subject-level variations in each outcome. This additional tem-poral information is captured by the FPC curves in Figure 4. Figure 4A shows the first twoPCs in Shannon bacterial diversity, of which PC 1 explains 79% and captures variation aroundday 750, while PC 2 explains 21% of the variation and emphasizes variation around day 300and 1100. Figure 4B shows the first four PCs in Shannon protein diversity, of which the firsttwo PCs explain over 90% of the variance. The first PC captures variation around day 750,and the second PC emphasizes variation around day 450 and 1200. Figure 4C shows thefirst four PCs in Shannon cytokine diversity, of which PC 1 explains 83% and exhibits analmost flat curve over time, while PC 2 explains 13% of the variation and emphasizes varia-tion around day 300 and 800. Figure 4D shows the first four PCs in HbA1c: PC1 exhibits aslight increasing curve over time, accounting for 70% variation, and PC 2 captures variationaround day 300 and 800, explaining for an additional 21% variation. In short, although princi-pal patterns in each measurement vary, changing time points are pretty consistent, suggestingcoherent responses to changes in patients’ mental or physical conditions.Among omics’ temporal associations with standard clinical test result HbA1c, Table 3 sug-gests that Shannon protein diversity has the highest association with HbA1c, at an estimatedmutual information of 0.91 with 95% credible interval (0.786, 0.971). Cytokine diversity isthe second highest, with MI at 0.849 (0.668, 0.954), and bacteria diversity has the lowestassociation at 0.71 (0.441, 0.854). However, when information about other omics measure-ments are provided, all the pairwise temporal associations increase to over 0.95, as indicatedby the conditional MI results. Regarding temporal associations among omics measurement,Shannon protein and bacteria diversities have highest temporal association, with mutual in-formation at 0.982 (0.875, 0.999); Shannon protein and cytokine diversities also have highassociation at 0.966 (0.886, 0.998); the association between Shannon bacteria and cytokineis medium at 0.798 (0.454, 0.977). Similar to earlier results, when information about othermeasurements are available, all conditional information increase to 0.99. In short, host pro-tein expression profiles data has highest temporal association with patients’ diabetes status(i.e. HbA1c), but this information can still be further improved with additional omics data.We need model diagnostics to conclude on the validity of our mSFPCA application. Theoptimal model selected by PSIS-LOO has 2 PCs for Shannon bacterial diversity, and 4 PCsfor the other measurements, and the number of internal knot is chosen to be one for alloutcomes. PSIS-LOO diagnostics in Figure 5A show that the selected mSFPCA model fitthe majority of the data well, except for 4 outliers with Pareto shape k values higher thanthe warning threshold 0.7. Graphical posterior predictive checks in Figure 5B suggests goodmodel fit as the simulated data from the posterior predictive distribution was able to coverthe distribution of observed outcomes well. Figure 5C-F highlight the observed trajectoriesof the 4 outliers detected by PSIS-LOO diagnostic plot. The red subject has highest curve F IG . Estimated mean curves from mSFPCA application on type 2 diabetes multi-omics dataset. (A) Estimatedpopulation mean curve (blue) for Shannon bacterial diversity on observed individual trajectories (black). (B)Estimated population mean curve (blue) for Shannon protein diversity on observed individual trajectories (black).(C) Estimated population mean curve (blue) for Shannon cytokine diversity on observed individual trajectories(black). (D) Estimated population mean curve (blue) for HbA1c on observed individual trajectories (black). in Shannon cytokine diversity and low value in HbA1c. A closer look at his/her metadatashows that this subject went through stages of healthy, infection and back to healthy. Thegreen subject shows high oscillation pattern in Shannon protein diversity, as he/she oscil-lated between stages of healthy, inflammation, and infection. The blue subject exhibits highoscillation pattern in Shannon protein diversity, because he/she went through a complicatedinterweaving stages of healthy, inflammation, infection, post-travel and allergy. The purplesubject, who has the highest Pareto shape k value in Figure 5A, experienced drastic changein HbA1c, as he/she went through stages of infection, stress and back to healthy. In short,our mSFPCA model generally fits this dataset well, and our diagnostic tools were able tohighlight biologically meaningful outliers for further examination.
5. Discussion.
Here, we have proposed multivariate sparse functional PCA, an exten-sion to the sparse functional principal components analysis, to modeling multiple trajectories
ULTIVARIATE SFPCA FOR LONGITUDINAL MICROBIOME MULTI-OMICS DATA F IG . Estimated FPC curves from mSFPCA application on type 2 diabetes multi-omics dataset. (A) Trends ofvariability in Shannon bacterial diversity captured by 2 principal component curves. (B) Trends of variabilityin Shannon protein diversity captured by 4 principal component curves. (C) Trends of variability in Shannoncytokine diversity captured by 4 principal component curves. (D) Trends of variability in HbA1c captured by 4principal component curves. T ABLE Mutual information estimates for type 2 diabetes multi-omics dataset application temporal associations with HbA1c MI (95% CI ) CMI (95% CI ) HbA1c—protein 0.910 (0.786, 0.971) 0.994 (0.968, 0.999)HbA1c—cytokine 0.849 (0.668, 0.954) 0.986 (0.951, 0.999)HbA1c—bacteria 0.710 (0.441, 0.854) 0.957 (0.820, 0.999)temporal associations among omics MI (95% CI ) CMI (95% CI ) protein—bacteria 0.982 (0.875, 0.999) 0.999 (0.996, 0.999)protein—cytokine 0.966 (0.886, 0.998) 0.999 (0.997, 0.999)bacteria—cytokine 0.798 (0.454, 0.977) 0.995 (0.958, 0.999) simultaneously. The methodological novelty lies in the computationally efficient Bayesiancovariance matrix estimation, where we utilized a Cholesky decomposition to guarantee it tobe positive semi-definite under the constrained form of diagonal within-trajectory covarianceand arbitrary form of between-trajectory covariance structure. Moreover, we utilized mutualinformation to assess marginal and conditional temporal associations, providing. Finally, our F IG . Graphical model diagnostics and examination of outliers for mSFPCA application on type 2 diabetesmulti-omics dataset. (A) Scatterplot of estimated Pareto shape parameter ˆ k in PSIS-LOO diagnostic plot: all ˆ k ’s except 4 are lower than the warning threshold 0.7. (B) Graphical posterior predictive plot: kernel densityestimate of the observed dataset y (dark curve), with kernel estimates for 100 simulated dataset y rep drawn fromthe posterior predictive distribution (thin, lighter lines). (C) Observed Shannon bacterial diversity for 4 outliersdetected by PSIS-LOO diagnostic plot in (A). (D) Observed Shannon protein diversity for 4 outliers. (E) ObservedShannon cytokine diversity for 4 outliers. (F) Observed HbA1c for 4 outliers. ULTIVARIATE SFPCA FOR LONGITUDINAL MICROBIOME MULTI-OMICS DATA Bayesian implementation in
Stan enables the usage of PSIS-LOO for efficient model selec-tion, and visual model diagnostic methods, such as examining the estimated shape parametersfrom PSIS-LOO and utilizing the graphical posterior predictive checks, to evaluate the valid-ity of mSFPCA models and highlight potential outliers.In both our real-data based simulations and application to longitudinal microbiome multi-omics datasets, we have demonstrated that mSFPCA is able to accurately uncover the under-lying principal modes of variation over time, including both the average population patternand subject-level variation, and estimate the temporal associations properly. These enabled usto detect biologically meaningful signals in a large and challenging longitudinal cohort withirregular sampling, missing data, and four temporal measurements. Moreover, the model di-agnostics plots from real data application show that mSFPCA can provide reliable modelfitting to real microbiome multi-omics dataset. All these results highlight the great value ofour method in modeling longitudinal data with multiple temporal measurements. Though weapplied mSFPCA to microbiome data in this paper, the method is in fact a general frameworkthat can be applied to a wide range of multivariate longitudinal data.One limitation of our method is that we assume the principal component scores and resid-uals to be normally distributed as in the original SFPCA model. This normality assumptionwould restrict our method from being applying to highly skewed trajectories, for exampleHowever, improper application of the method to such data could be detected by the model di-agnostic tools we provide, e.g., the graphical posterior predictive model checks. Users couldalso modify the mSFPCA model by incorporating alternative prior distributions, for exam-ple, a t-distribution with a low degree of freedom to capture heavy tails in the distributionof principal component scores, which can be easily implemented in
Stan . Finally, since themSFPCA model is implemented in
Stan , a programming language with a very active userbase, this method will be able to be updated with more efficient MCMC sampling algo-rithms and also incorporate other groundbreaking model selection and diagnostic techniqueswhenever they become available. Hence, we believe that the mSFPCA method will becomea useful and up-to-date tool for researchers in various fields to analyze longitudinal data withmultiple measurements in order to detect complex temporal associations.
6. Acknowledgement.
RK was supported by NIH under grant 1DP1AT010885, NIDDKunder grant 1P30DK120515, and CCFA under grant 675191. WT was supported byNIH/NIMH under grants MH120025 and MH122688.SUPPLEMENTARY MATERIAL
Supplement to "multivariate Sparse Functional Principal Components Analysis forLongitudinal Microbiome Multi-Omics Data" . Three supplementary figures for simula-tion results are included in the supplementary materials.(). REFERENCES A HMED , N. A. and G
OKHALE , D. (1989). Entropy expressions and their estimators for multivariate distributions.
IEEE Transactions on Information Theory RELLANO -V ALLE , R. B., C
ONTRERAS -R EYES , J. E. and G
ENTON , M. G. (2013). Shannon Entropy andMutual Information for Multivariate Skew-Elliptical Distributions.
Scandinavian Journal of Statistics ALLEN , K., A HN , K. W., C HEN , M., A
BDEL -A ZIM , H., A
HMED , I., A
LJURF , M., A
NTIN , J., B
HATT , A. S.,B
OECKH , M., C
HEN , G. et al. (2016). Infection rates among acute leukemia patients receiving alternativedonor hematopoietic cell transplantation.
Biology of Blood and Marrow Transplantation ARNARD , J., M C C ULLOCH , R. and M
ENG , X.-L. (2000). Modeling covariance matrices in terms of standarddeviations and correlations, with application to shrinkage.
Statistica Sinica B ODEIN , A., C
HAPLEUR , O., D
ROIT , A. and L Ê C AO , K.-A. (2019). A generic multivariate framework for theintegration of microbiome longitudinal studies with other data types. Frontiers in genetics .B OUSLIMANI , A., DA S ILVA , R., K
OSCIOLEK , T., J
ANSSEN , S., C
ALLEWAERT , C., A
MIR , A., D OR - RESTEIN , K., M
ELNIK , A. V., Z
ARAMELA , L. S., K IM , J.-N. et al. (2019). The impact of skin care productson skin chemistry and microbiome dynamics. BMC biology ARPENTER , B., G
ELMAN , A., H
OFFMAN , M. D., L EE , D., G OODRICH , B., B
ETANCOURT , M.,B
RUBAKER , M., G UO , J., L I , P. and R IDDELL , A. (2017). Stan: A probabilistic programming language.
Journal of statistical software .C HIOU , J.-M., C
HEN , Y.-T. and Y
ANG , Y.-F. (2014). Multivariate functional principal component analysis: Anormalization approach.
Statistica Sinica
OVER , T. M. (1999).
Elements of information theory . John Wiley & Sons.D I , C., C RAINICEANU , C. M. and J
ANK , W. S. (2014). Multilevel sparse functional principal component anal-ysis.
Stat I , C.-Z., C RAINICEANU , C. M., C
AFFO , B. S. and P
UNJABI , N. M. (2009). Multilevel functional principalcomponent analysis.
The annals of applied statistics RATI , F., S
ALVATORI , C., I
NCORVAIA , C., B
ELLUCCI , A., D I C ARA , G., M
ARCUCCI , F. and E
SPOSITO , S.(2019). The role of the microbiome in Asthma: The Gut–Lung axis.
International Journal of Molecular Sci-ences ABRY , J., S
IMPSON , D., V
EHTARI , A., B
ETANCOURT , M. and G
ELMAN , A. (2019). Visualization in Bayesianworkflow.
Journal of the Royal Statistical Society: Series A (Statistics in Society)
ELMAN , A., M
ENG , X.-L. and S
TERN , H. (1996). Posterior predictive assessment of model fitness via realizeddiscrepancies.
Statistica sinica
ENTLE , J. E. (2012).
Numerical linear algebra for applications in statistics . Springer Science & BusinessMedia.G
ILL , S. R., P OP , M., D E B OY , R. T., E CKBURG , P. B., T
URNBAUGH , P. J., S
AMUEL , B. S., G
ORDON , J. I.,R
ELMAN , D. A., F
RASER -L IGGETT , C. M. and N
ELSON , K. E. (2006). Metagenomic analysis of the humandistal gut microbiome. science
ALL , P. and H
OSSEINI -N ASAB , M. (2006). On properties of functional principal components analysis.
Journalof the Royal Statistical Society: Series B (Statistical Methodology) OLLERAN , G., S
CALDAFERRI , F., I
ANIRO , G., L
OPETUSO , L., M C , D. N., M ELE , M., G
ASBARRINI , A.and C
AMMAROTA , G. (2018). Fecal microbiota transplantation for the treatment of patients with ulcerativecolitis and other gastrointestinal conditions beyond Clostridium difficile infection: an update.
Drugs of today(Barcelona, Spain: 1998) I HMP C
ONSORTIUM (2014). The Integrative Human Microbiome Project: dynamic analysis of microbiome-hostomics profiles during periods of human health and disease.
Cell host & microbe AMES , G. M., H
ASTIE , T. J. and S
UGAR , C. A. (2000). Principal component models for sparse functional data.
Biometrika IANG , D., A
RMOUR , C. R., H U , C., M EI , M., T IAN , C., S
HARPTON , T. J. and J
IANG , Y. (2019). Microbiomemulti-omics network analysis: statistical considerations, limitations, and opportunities.
Frontiers in genetics .J IANG , L., Z
HONG , Y., E
LROD , C., N
ATARAJAN , L., K
NIGHT , R. and T
HOMPSON , W. K. (2020). BayesTime:Bayesian Functional Principal Components for Sparse Longitudinal Data. arXiv preprint arXiv:2012.00579 .J OE , H. (1989). Relative entropy measures of multivariate dependence. Journal of the American Statistical Asso-ciation IDZI ´NSKI , Ł. and H
ASTIE , T. (2018). Longitudinal data analysis using matrix completion. arXiv preprintarXiv:1809.08771 .K OSTIC , A. D., G
EVERS , D., S
ILJANDER , H., V
ATANEN , T., H
YÖTYLÄINEN , T., H
ÄMÄLÄINEN , A.-M.,P
EET , A., T
ILLMANN , V., P
ÖHÖ , P., M
ATTILA , I. et al. (2015). The dynamics of the human infant gutmicrobiome in development and in progression toward type 1 diabetes.
Cell host & microbe UCZYNSKI , J., C
OSTELLO , E. K., N
EMERGUT , D. R., Z
ANEVELD , J., L
AUBER , C. L., K
NIGHTS , D., K O - REN , O., F
IERER , N., K
ELLEY , S. T., L EY , R. E. et al. (2010). Direct sequencing of the human microbiomereadily reveals community differences. Genome biology I , Y., H SING , T. et al. (2010). Uniform convergence rates for nonparametric regression and principal componentanalysis in functional/longitudinal data.
The Annals of Statistics IECHTY , J. C., L
IECHTY , M. W. and M
ÜLLER , P. (2004). Bayesian correlation estimation.
Biometrika LOYD -P RICE , J., A
RZE , C., A
NANTHAKRISHNAN , A. N., S
CHIRMER , M., A
VILA -P ACHECO , J.,P
OON , T. W., A
NDREWS , E., A
JAMI , N. J., B
ONHAM , K. S., B
RISLAWN , C. J. et al. (2019). Multi-omicsof the gut microbial ecosystem in inflammatory bowel diseases.
Nature M ORTON , J. T., A
KSENOV , A. A., N
OTHIAS , L. F., F
OULDS , J. R., Q
UINN , R. A., B
ADRI , M. H., S
WEN - SON , T. L., V AN G OETHEM , M. W., N
ORTHEN , T. R., V
AZQUEZ -B AEZA , Y. et al. (2019). Learning repre-sentations of microbe–metabolite interactions.
Nature methods ASH , J. (1990). The Cholesky Decomposition.
Compact numerical methods for computers: Linear algebra andfunction minimisation .P ENG , J. and P
AUL , D. (2009). A geometric approach to maximum likelihood estimation of the functional prin-cipal components from sparse longitudinal data.
Journal of Computational and Graphical Statistics AMSAY , J. and S
ILVERMAN , B. W. (1997). Functional data analysis (Springer series in statistics).R
AMSAY , J. O. and S
ILVERMAN , B. W. (2007).
Applied functional data analysis: methods and case studies .Springer.R
ANJAN , R., R
ANI , A., M
ETWALLY , A., M C G EE , H. S. and P ERKINS , D. L. (2016). Analysis of the micro-biome: Advantages of whole genome shotgun versus 16S amplicon sequencing.
Biochemical and biophysicalresearch communications
ICE , J. A. and S
ILVERMAN , B. W. (1991). Estimating the mean and covariance structure nonparametricallywhen the data are curves.
Journal of the Royal Statistical Society: Series B (Methodological) BERRO , H., F
REMIN , B. J., Z
LITNI , S., E
DFORS , F., G
REENFIELD , N., S
NYDER , M. P., P
AVLOPOU - LOS , G. A., K
YRPIDES , N. C. and B
HATT , A. S. (2019). Large-scale analyses of human microbiomes revealthousands of small, novel genes.
Cell
HARON , G., C
RUZ , N. J., K
ANG , D.-W., G
ANDAL , M. J., W
ANG , B., K IM , Y.-M., Z INK , E. M.,C
ASEY , C. P., T
AYLOR , B. C., L
ANE , C. J. et al. (2019). Human gut microbiota from autism spectrumdisorder promote behavioral symptoms in mice.
Cell
TEWART , C. J., A
JAMI , N. J., O’B
RIEN , J. L., H
UTCHINSON , D. S., S
MITH , D. P., W
ONG , M. C.,R
OSS , M. C., L
LOYD , R. E., D
ODDAPANENI , H., M
ETCALF , G. A. et al. (2018). Temporal developmentof the gut microbiome in early childhood from the TEDDY study.
Nature
ATANEN , T., F
RANZOSA , E. A., S
CHWAGER , R., T
RIPATHI , S., A
RTHUR , T. D., V
EHIK , K., L
ERNMARK , Å.,H
AGOPIAN , W. A., R
EWERS , M. J., S HE , J.-X. et al. (2018). The human gut microbiome in early-onset type1 diabetes from the TEDDY study. Nature
EHTARI , A., G
ELMAN , A. and G
ABRY , J. (2017). Practical Bayesian model evaluation using leave-one-outcross-validation and WAIC.
Statistics and computing AO , F., M ÜLLER , H.-G. and W
ANG , J.-L. (2005). Functional data analysis for sparse longitudinal data.
Journalof the American statistical association ubmitted to the Annals of Applied Statistics
MULTI-BLOCK SPARSE FUNCTIONAL PRINCIPAL COMPONENTSANALYSIS FOR LONGITUDINAL MICROBIOME MULTI-OMICS DATA B Y L INGJING J IANG , C
HRIS E LORD , J ANE
J. K IM , A USTIN
D. S
WAFFORD , R OB K NIGHT
AND W ESLEY
K. T
HOMPSON Herbert Wertheim School of Public Health and Human Longevity Science, University of California San Diego, * [email protected]; † [email protected] Julia Computing, [email protected] Department of Pediatrics, University of California San Diego, [email protected]; [email protected] Center for Microbiome Innovation, University of California San Diego, [email protected];[email protected] Department of Computer Science and Engineering, University of California San Diego, [email protected] Department of Bioengineering, University of California San Diego, [email protected]
Supplementary Materials.
Three supplementary figures for simulation results are in-cluded as supplementary materials. − − time r es pon se Outcome for Variable 1 A − − time r es pon se Outcome for Variable 2
True meanEstimated B − − time r es pon se Outcome for Variable 3 C − − time F P C c u r ve PFCs for Variable 1 D − − time F P C c u r ve PFCs for Variable 2
True PCEstimated E − − time F P C c u r ve PFCs for Variable 3 F F IG S1 . Estimated mean and FPC curves from mSMFPCA on simulated data with covariance structure II. Es-timated (red) vs. true (blue) overall mean curve on simulated trajectories (black) on outcome variable 1 (A),outcome variable 2 (B), and outcome variable 1 (C). Estimated (red) vs. true (black) PC curves on simulated datafor outcome variable 1 (D), outcome variable 2 (E), and outcome variable 3 (F). a r X i v : . [ s t a t . M E ] F e b − − time r es pon se Outcome for Variable 1 A − − time r es pon se Outcome for Variable 2
True meanEstimated B − − time r es pon se Outcome for Variable 3 C − − time F P C c u r ve PFCs for Variable 1 D − − time F P C c u r ve PFCs for Variable 2
True PCEstimated E − − time F P C c u r ve PFCs for Variable 3 F F IG S2 . Estimated mean and FPC curves from mSMFPCA on simulated data with covariance structure III.Estimated (red) vs. true (blue) overall mean curve on simulated trajectories (black) on outcome variable 1 (A),outcome variable 2 (B), and outcome variable 1 (C). Estimated (red) vs. true (black) PC curves on simulated datafor outcome variable 1 (D), outcome variable 2 (E), and outcome variable 3 (F). − − time r es pon se Outcome for Variable 1 A − − time r es pon se Outcome for Variable 2
True meanEstimated B − − time r es pon se Outcome for Variable 3 C − − time F P C c u r ve PFCs for Variable 1 D − − time F P C c u r ve PFCs for Variable 2