Robust Functional Principal Component Analysis for Non-Gaussian Continuous Data to Analyze Physical Activity Data Generated by Accelerometers
RRobust Functional Principal Component Analysis forNon-Gaussian Continuous Data to Analyze PhysicalActivity Data Generated by Accelerometers
Rou Zhong a , Shishi Liu a , Haocheng Li b , and Jingxiao Zhang ∗ aa Center for Applied Statistics, School of Statistics, Renmin University of China b Department of Mathematics and Statistics, University of Calgary
Abstract
Motivated by energy expenditure observations measured by wearable device, we pro-pose two functional principal component analysis (FPCA) methods, Spearman FPCA andKendall FPCA, for non-Gaussian continuous data. The energy expenditure records mea-sured during physical activity can be modeled as functional data. They often involvenon-Gaussian features, where the classical FPCA could be invalid. To handle this issue,we develop two robust FPCA estimators using the framework of rank statistics. Via anextension of the Spearman’s rank correlation estimator and the Kendall’s τ correlation esti-mator, a robust algorithm for FPCA is developed to fit the model. The two estimators areapplied to analyze the physical activity data collected by a wearable accelerometer monitor.The effectiveness of the proposed methods is also demonstrated through a comprehensivesimulation study. Keywords : Functional data analysis, functional principal component analysis, non-Gaussian process, Spearman’s rank correlation, Kendall’s τ correlation Functional principal component analysis (FPCA), as a crucial technique in functional dataanalysis (FDA), is widely used in reducing dimensionality and exploring the major variationmodes of sample curves. A comprehensive introduction on FPCA is provided in the monograph ∗ [email protected] a r X i v : . [ s t a t . M E ] F e b y Ramsay and Silverman Ramsay and Silverman [2005]. There is an extensive literature onthe work of FPCA. To list a few, Yao and Lee Yao and Lee [2006] studied penalized splinemodels for FPCA. Chen and Lei Chen and Lei [2015] proposed a localized FPCA method togain eigenfunctions with localized support regions. A dynamic FPCA approach was developedby H¨ormann et al. H¨ormann et al. [2015] with regard to functional time series. Moreover,Chiou et al. Chiou et al. [2014] and Wong et al. Wong et al. [2019] considered FPCA formultivariate functional data. However, these FPCA approaches may depend on the Gaussianprocess assumption either explicitly or implicitly. Further, Kraus and Panaretos Kraus andPanaretos [2012] have pointed out that the regular FPCA method would lead to biased outcomesfor non-Gaussian continuous data, which coincides with our simulation study.Meanwhile, non-Gaussian continuous data frequently occur in practice. For example, thephysical activity data Kozey-Keadle et al. [2014], which motivated our study, are shown tobe non-Gaussian (see Section 4). The data were obtained from 63 moderately overweight buthealthy office workers in a health research project to increase exercise activities and reducesedentary behaviors. Energy expenditure measurements were collected during physical activityfor individuals over time. The primary outcome was energy expenditure level in the unit ofmetabolic equivalents (METs), which was a continuous variable. More specifically, this outcomewas measured by every five minutes consecutively for three hours in each study participant, andthus each subject had 36 observations. Those measurements bear some skewed characteristicsand contain a few extreme values, which imply a plain deviation from Gaussian distribution. Inaddition, besides the physical activity data, the DNA minicircles data [Kraus and Panaretos,2012] and the Alzheimer’s disease neuroimaging initiative (ADNI) data [Zhong et al., 2020] alsoshow some non-Gaussian features. The goal of this article is to develop robust FPCA estimatorsfor such non-Gaussian cases.Recently, the study of FPCA for non-Gaussian data has attracted increasing attention. Hallet al. Hall et al. [2008] introduced a latent Gaussian process model for non-Gaussian longitudinaldata that measured sparsely and irregularly. Van der Linde van der Linde. [2009] discussed aBayesian FPCA approach for data in single-parameter exponential family. Gertheiss et al.Gertheiss et al. [2017] considered a generalized additive mixed model (GAMM) framework,and Li et al. Li et al. [2018] proposed an exponential family FPCA method. However, all ofthese studies were developed based on the assumption that data are generated from a specifieddistribution. These method may particularly work for discrete data (e.g. binary and countdata), whereas it is difficult to assume a proper distribution for the physical activity records onhand.In this paper, we propose a generalized FPCA method to handle continuous data withoutany distributional assumptions. The main idea is using rank statistics to replace the sample2ovariance function, which is commonly implemented in the eigenanalysis for regular multi-variate PCA [Han and Liu, 2014, 2018]. We use Spearman’s rank correlation estimator andKendall’s τ correlation estimator to construct new association functions. Moreover, a new al-gorithm is developed to conduct the Spearman FPCA and the Kendall FPCA with the newlyestablished association functions. Our method does not involve any assumption of data distri-bution, which avoids model misspeficiation in real practice. In addition, the computation forthe Spearman FPCA and the Kendall FPCA is convenient because it does not require takingmultiple iterations to estimate non-Gaussian continuous functional data [Li et al., 2014].The remainder of the article is organized as follows. Section 2 describes the two rankstatistics and the newly developed algorithm for FPCA. Section 3 reports the results fromsimulation studies. Section 4 illustrates the application of our proposed methods to the physicalactivity data. Concluding discussions are provided in Section 5. Let X ( t ) denote a smooth random function evaluated at time t ( t ∈ T ), with unknown meanfunction E { X ( t ) } = µ ( t ) and covariance function Γ( t , t ) = cov { X ( t ) , X ( t ) } ( t ∈ T , t ∈T ). Define Γ as a covariance operator that (Γ φ )( t ) = (cid:82) Γ( t , t ) φ ( t ) dt with φ ( t ) being anarbitrary function. A commonly used approach to implement FPCA is the eigenanalysis forΓ( t , t ) [Ramsay and Silverman, 2005]. This method uses a set of orthogonal eigenfunctions { φ ( t ) , φ ( t ) , · · · } with the eigenequations(Γ φ k )( t ) = λ k φ k ( t ) , k = 1 , , · · · , where λ k is the k -th eigenvalue and λ ≥ λ ≥ · · · . In classical FPCA, X ( t ) can be expressedby the Karhunen-Lo`eve expansion, X ( t ) = µ ( t ) + ∞ (cid:88) k =1 ξ k φ k ( t ) , (1)where ξ k is the k -th functional principal component score. ξ , ξ , · · · are uncorrelated randomvariables with mean 0 and variance λ , λ , · · · , respectively. The model fit for X ( t ) is equivalentto estimate (cid:98) µ ( t ), (cid:98) λ k and (cid:98) φ k ( t ). To be specific, (cid:98) φ k ( t ) can be obtained from the estimate of thecovariance function Γ( t , t ). Therefore, the estimation of Γ( t , t ) plays an important role inFPCA.Suppose we observe a set of independent random curves X i ( t ) ( i = 1 , , · · · , N ) from N individuals. To accommodate the application data illustrated in Section 1, we assume all ofthe sample curves are recorded at a grid of d time points ( t = 1 , · · · , d ) in this paper. In3articular, we have d = 36 in the physical activity data. Therefore, our methodology developedfor the application problem assumes functional data with a equally-spaced and dense design.In Section 5, we discuss the approaches to extend our methods for more general settings.The classical FPCA method estimates Γ( t , t ) (1 ≤ t ≤ d, ≤ t ≤ d ) by sample covariancefunction (cid:98) Γ( t , t ) = 1 N N (cid:88) i =1 { X i ( t ) − (cid:98) µ ( t ) }{ X i ( t ) − (cid:98) µ ( t ) } , (2)where (cid:98) µ ( t ) is the sample mean function. It can be estimated by (cid:98) µ ( t ) = N (cid:80) Ni =1 X i ( t ). Whencontinuous data follow Gaussian process, the classical FPCA method based on (cid:98) Γ( t , t ) in (2)has good properties such as consistency and asymptotic efficiency property [Yao et al., 2005,Hall et al., 2006]. On the contrary, the results are implausible when facing non-Gaussian casesKraus and Panaretos [2012]. Further, we also demonstrate in Section 3, the (cid:98) Γ( t , t ) calculatedin (2) with non-Gaussian features ignored, could lead to severely biased results. However, non-Gaussian features may occur frequently in real practice. For example, a dataset could be skewedand has outliers. To solve this problem, we consider two robust statistics to replace (cid:98) Γ( t , t ) inSection 2.1 and Section 2.2, respectively. The implementation of the Spearman’s rank correlation to the FPCA problem is based on itswell-known robustness property to handle outliers in observations. It has good performancein heavy-tailed distributions [de Winter and Gosling, 2016], and thus it is appropriate to beapplied in our physical activity data discussed in Section 1. In this section, we generalize theSpearman’s rank correlation to handle functional data analysis.For subject i , let r i ( t ) represent the rank of X i ( t ) among { X ( t ) , · · · , X i ( t ) , · · · , X N ( t ) } .Define r ( t ) = (cid:80) Ni =1 r i ( t ) /N . The sample Spearman’s rank correlation function is estimated as (cid:98) ρ ( t , t ) = (cid:80) Ni =1 { r i ( t ) − r ( t ) }{ r i ( t ) − r ( t ) } (cid:113)(cid:80) Ni =1 { r i ( t ) − r ( t ) } · (cid:80) Ni =1 { r i ( t ) − r ( t ) } . We use the following estimator in our application (cid:98) S ( t , t ) = 2 sin (cid:110) π (cid:98) ρ ( t , t ) (cid:111) , (3)and the new eigenproblem is given by( (cid:98) Sφ Sk )( t ) = λ Sk φ Sk ( t ) , (4)where λ Sk and φ Sk ( t ) are the k -th eigenvalue and eigenfunction for the Spearman’s rank correla-tion function, respectively. 4 .2 Kendall FPCA Follow the similar idea of the Spearman FPCA in Section 2.1, we introduce the Kendall’s τ intothe functional data analysis. A straightforward Kendall correlation function can be estimatedas (cid:98) K ( t , t ) = 2 N ( N − (cid:88) i Step 1. Estimate µ ( t ) by (cid:98) µ ( t ) = 1 N N (cid:88) i =1 X i ( t ) . Step 2. Calculate the sample correlation function (cid:98) S ( t , t ) by (3) for t = 1 , · · · , d and t =1 , · · · , d . Step 3. Based on the method with penalization for roughness proposed by Rice and SilvermanRice and Silverman [1991], calculate the estimated vector (cid:101) u S = (cid:8) (cid:98) φ S (1) , . . . , (cid:98) φ S ( d ) (cid:9) . Step 4. Follow the similar pattern, sequentially calculate (cid:101) u S , (cid:101) u S , and the vectors for higherorder eigenfunctions. 5 tep 5. Recover the estimated functions (cid:101) φ Sk ( t ) from (cid:101) u Sk by using B-spline basis functions. Step 6. We obtain the (cid:98) φ Sk ( t ) by (cid:98) φ Sk ( t ) = (cid:101) φ Sk ( t ) (cid:107) (cid:101) φ Sk (cid:107) , where (cid:107) (cid:101) φ Sk (cid:107) = (cid:113)(cid:82) T { (cid:101) φ Sk ( u ) } du .The Kendall FPCA discussed in Section 2.2 can be implemented from the similar approachby substituting (cid:98) S ( t , t ) with (cid:98) K ( t , t ) in (5) to obtain the estimation of (cid:98) φ Kk ( t ). In this section, we use a simulation study to demonstrate the performance of our proposedSpearman FPCA and Kendall FPCA. As a comparison, the classical FPCA approach is alsoexplored. In the simulation study of 100 runs, we have N = 100 individuals and each subject has d = 100 equally-spaced observations. Here we generate the data using the discretized covariancefunction rather than using the expansion (1). For Scenario 1, the covariance function is set tobe Γ( t , t ) = 15 φ ( t ) φ ( t ) + 5 φ ( t ) φ ( t ) + I ( t = t ) , t , t ∈ [0 , , (7)where I ( t , t ) = 1 if t = t ; otherwise I ( t , t ) = 0. The two leading eigenfunctions are φ ( t ) = sin( πt/ / √ φ ( t ) = cos( πt/ / √ 5, respectively. For Scenario 2, the covariancefunction is set to be Γ( t , t ) = min { t , t } − t t , t , t ∈ [0 , , (8)where the two leading eigenfunctions can be derived as φ ( t ) = √ πt ) and φ ( t ) = √ πt ).For both scenarios, data are generated from Gaussian distribution, student’s t distribution,and skew- t distribution [Azzalini and Genton, 2008] using the covariance functions specified in(7) and (8), respectively. For Gaussian distribution, we set the mean to be zero. For student’s t distribution, the degree of freedom is set to be three. For skew- t distribution, location parameteris set to be zero, slant parameter is defined to be 10, and the degree of freedom is three. For theestimates of each simulation run, the following criteria are evaluated to assess the performanceof the three methods: • Err k = (cid:82) { φ k ( t ) − (cid:98) φ k ( t ) } dt . The Err k measures the estimation error for the k -th eigen-function. 6 Angle k = arccos( | (cid:82) [ φ k ( t ) (cid:98) φ k ( t )] dt | ). The Angle k , which ranges from 0 to 90, presents theangle size of the k -th estimated eigenfunction deviated from the true one. Specifically,smaller Angle k means better estimation.Figure 1 displays the simulation results of (cid:98) φ k ( t ) using the classical FPCA, the SpearmanFPCA and the Kendall FPCA methods for Scenario 1. Three approaches have similar esti-mates, which are close to true curves under Gaussian distribution and student’s t distribution.However, the classical FPCA approach leads to obvious biased estimates under skew- t distri-bution settings. Figure 2 illustrates the simulation results for Scenario 2. It can be seen thatthe classical FPCA performs well under Gaussian distribution. Our proposed Spearman FPCAand Kendall FPCA methods have very small bias but the estimates are still comparable to theclassical FPCA’s outcomes. On the other hand, the classical FPCA approach yields remarkablybiased estimates for the student’s t and skew- t distributions, while our Spearman FPCA andKendall FPCA approaches lead to satisfactory curve estimates.Table 1 shows the Err k and Angle k ( k = 1 , 2) for Scenarios 1 and 2, respectively. Forthe Gaussian distribution settings, our Spearman FPCA and Kendall FPCA approaches gen-erally lead to similar outcomes as the classical FPCA. In the settings of student’s t and skew- t distributions, the two proposed methods lead to obvious smaller deviance to the true curvescompared to the classical FPCA method. Therefore, the simulation results suggest that theclassical FPCA, may provide misleading conclusions under non-Gaussian data settings. Onthe other hand, our proposed robust methods have satisfactory estimation results for differentscenarios. In this section we use the proposed method to analyze the physical activity dataset collectedby wearable monitors Kozey-Keadle et al. [2014]. The raw activity signal is captured by adevice named the ActivPAL TM < N = 63 subjectsand each participant has d = 36 five-minute interval records. Figure 3(a) shows the energyexpenditure observation from one subject across three hours. Figure 3(b) shows the histogramof the observations from all subjects. Figures 3(c)-(f) are histograms and Q-Q plots for theenergy expenditure level at two time points. The plots suggest the continuous variable is skewedand has extreme values. 7n this application, X i ( t ) can be defined as the continuous variable for energy expenditurelevel in the t th ( t = 1 , · · · , 36) five-minute interval from subject i ( i = 1 , · · · , t = 1 and then decreases back to starting level at about t = 5. The light energyexpenditure could be explained as the warming up stage designed in this health research projectto reduce the risk of injury. The METs level increases dramatically at about t = 12, anddecreases back to just above the starting level by t = 24. During this interval, individuals takean intense exercise training for about 30 minutes and then start to cool down for about 25minutes. The energy expenditure level stays the same after t = 24, which shows subjects aretaking sedentary activities after the exercise program. Figures 4(b)(c) display the estimates ofthe first two eigenfunctions. For Kendall FPCA, the first eigenfunction rises to a peak at about t = 16, and then decreases to zero. This suggests that main pattern of the variability in energyexpenditure is around t = 16 across different participants. Subjects would have highly differentMETs level during the intense exercise program because they have various physical trainingexperience. On the other hand, the second eigenfunction has one positive and one negativepeaks, which indicates energy expenditure variabilities at t = 15 and t = 21 are negativelycorrelated. That is, if a subject spends higher energy expenditure around t = 15, he/she mayhave lower energy expenditure later. The estimated eigenfunctions from the Spearman FPCAhave similar patterns as the Kendall FPCA’s estimates, while the first eigenfunction has slowerdecreasing trend and the second eigenfunction suggests larger variance between t = 1 to t = 5. In this paper, we developed the Spearman FPCA and the Kendall FPCA methods to analyzefunctional data in non-Gaussian cases. The two robust methods are developed using two rankstatistics, Spearman’s rank correlation and Kendall’s τ correlation, respectively. The simulationresults are encouraging. Our methods have comparable results to the classical FPCA methodunder Gaussian settings. For non-Gaussian cases, our proposed approaches have little biasand outperform the classical FPCA. The analysis of the physical activity data based on thedeveloped method demonstrates their utility in real world data applications.In this paper, the functional data are assumed to have regular and dense structure. It isof interest to generalize the two methods to irregular and sparse cases. For example, we mayextend our methods to sparse cases by using the local linear estimation [Yao et al., 2005].Moreover, we can explore other correlation measurements to handle FPCA problems in future8tudies. Acknowledgement Li was supported by discovery grants program from the Natural Sciences and EngineeringResearch Council of Canada (NSERC, RGPIN-2015-04409). The authors thank Dr. SarahKozey Keadle for making the physical activity data available to them. Dr. Sarah Kozey Keadlewas supported by a National Cancer Institute grant (R01-CA121005). References J. O. Ramsay and B. W. Silverman. Functional Data Analysis (2nd ed.) . Springer Series inStatistics, New York: Springer, 2005.Fang Yao and Thomas C. M. Lee. Penalized spline models for functional principal componentanalysis. Journal of the Royal Statistical Society. Series B (Methodological) , 68:3–25, 2006.Kehui Chen and Jing Lei. Localized functional principal component analysis. Journal of theAmerican Statistical Association , 110:511:1266–1275, 2015.Siegfried H¨ormann, (cid:32)Lukasz Kidzi´nski, and Marc Hallin. Dynamic functional principal compo-nents. Journal of the Royal Statistical Society. Series B (Methodological) , 77:319–348, 2015.Jeng-Min Chiou, Ya-Fang Yang, and Yu-Ting Chen. Multivariate functional principal compo-nent analysis: A normalization approach. Statistica Sinica , 24:1571–1596, 2014.Raymond K. W. Wong, Yehua Li, and Zhengyuan Zhu. Partially linear functional additivemodels for multivariate functional data. Journal of the American Statistical Association ,114:525:406–418, 2019.David Kraus and Victor M. Panaretos. Dispersion operators and resistant second-order func-tional data analysis. Biometrika , 99(4):813–832, 2012.Sarah Kozey-Keadle, John Staudenmayer, Amanda Libertine, Marianna Mavilia, Kate Lyden,Barry Braun, and Patty Freedson. Changes in sedentary time and spontaneous physicalactivity in response to an exercise training and/or lifestyle intervention. Journal of PhysicalActivity and Health , 11:1324–1333, 2014.Qingzhi Zhong, Huazhen Lin, and Yi Li. Cluster non-gaussian functional data. Biometrics ,2020. doi: https://doi.org/10.1111/biom.13349.9eter Hall, Hans-Georg M¨uller, and Fang Yao. Modelling sparse generalized longitudinal ob-servations with latent gaussian processes. Journal of the Royal Statistical Society: Series B(Statistical Methodology) , 70:703–723, 2008.Angelika van der Linde. A bayesian latent variable approach to functional principal componentsanalysis with binary and count data. AStA Advances in Statistical Analysis , 93:307–333, 2009.Jan Gertheiss, Jeff Goldsmith, and Ana-Maria Staicu. A note on modeling sparse exponential-family functional response curves. Computational Statistics and Data Analysis , 105:46–52,2017.Gen Li, Huang, Jianhua Z. Huang, and Haipeng Shen. Exponential family functional dataanalysis via a low-rank model. Biometrics , 74:1301–1310, 2018.Fang Han and Han Liu. High dimensional semiparametric scale-invariant principal componentanalysis. IEEE Transactions on Pattern Analysis and Machine Intelligence , 36:2016–2032,2014.Fang Han and Han Liu. Eca: High-dimensional elliptical component analysis in non-gaussiandistributions. Journal of the American Statistical Association , 113:521:252–268, 2018.Haocheng Li, John Staudenmayer, and Raymond J. Carroll. Hierarchical functional data withmixed continuous and binary measurements. Biometrics , 70:802–811, 2014.Fang Yao, Hans-Georg M¨uller, and Jane-Ling Wang. Functional data analysis for sparse longi-tudinal data. Journal of the American Statistical Association , 100(470):577–590, 2005.Peter Hall, Hans-Georg M¨uller, and Jane-Ling Wang. Properties of principal component meth-ods for functional and longitudinal data analysis. Annals of Statistics , 34(3):1493–1517, 2006.Joost C. F. de Winter and Samuel D. Gosling. Comparing the pearson and spearman correlationcoefficients across distributions and sample sizes: A tutorial using simulations and empiricaldata. Psychological Methods , 21:273–290, 2016.Christophe Croux, Esa Ollila, and Hannu Oja. Sign and rank covariance matrices statisticalproperties and application to principal components analysis. Statistics for Industry andTechnology , pages 257–269, 2002.John A. Rice and B. W. Silverman. Estimating the mean and covariance structure nonparamet-rically when the data are curves. Journal of the Royal Statistical Society. Series B (Method-ological) , 53:233–243, 1991. 10delchi Azzalini and Marc G. Genton. Robust likelihood methods based on the skew-t andrelated distributions. International Statistical Review , 76(1):106–129, 2008.11 Figure 1: Eigenfunction estimations for 100 simulation runs in Scenario 1 illustrated in Section3. (a)(b) The first and the second eigenfunctions for Gaussian distribution settings. (c)(d)The first and the second eigenfunctions for student’s t distribution. (e)(f) The first and thesecond eigenfunctions for skew- t distribution. Thick solid lines are for true curves. Dotted lines,dot-dashed lines, and dashed lines are for the classical FPCA, the Spearman FPCA and theKendall FPCA methods, respectively. 12 Figure 2: Eigenfunction estimations for 100 simulation runs in Scenario 2 illustrated in Section3. (a)(b) The first and the second eigenfunctions for Gaussian distribution settings. (c)(d)The first and the second eigenfunctions for student’s t distribution. (e)(f) The first and thesecond eigenfunctions for skew- t distribution. Thick solid lines are for true curves. Dotted lines,dot-dashed lines, and dashed lines are for the classical FPCA, the Spearman FPCA and theKendall FPCA methods, respectively. 13 Time(/5min) E ne r g y e x pend i t u r e (a) Sample data from the first subject Energy expenditure F r equen cy (b) Histogram: The METs levelEnergy expenditure F r equen cy (c) Histogram: The METs level at time point 15 −2 −1 0 1 2 Theoretical Quantiles S a m p l e Q uan t il e s (d) Q−Q plot: The METs level at time point 15Energy expenditure F r equen cy (e) Histogram: The METs level at time point 20 −2 −1 0 1 2 Theoretical Quantiles S a m p l e Q uan t il e s (f) Q−Q plot: The METs level at time point 20 Figure 3: (a) Energy expenditure observation from subject with ID 1 across 3 hours. (b)Histogram of the energy expenditure observations (METs) from all subjects. (c)(d) Histogramand Q-Q plot for the energy expenditure level (METs) at time point 15. (e)(f) Histogram andQ-Q plot for the energy expenditure level (METs) at time point 20.14 Figure 4: FPCA results for the physical activity data illustrated in Section 4. (a) Mean curveestimation for energy expenditure level. Solid line and dashed line are the estimated curveand its 90% bootstrap confidence intervals, respectively. (b)(c) Estimated curves for the firstand the second eigenfunctions. Thin dot-dashed lines and thick dashed lines are the SpearmanFPCA and the Kendall FPCA, respectively. 15able 1: Simulation results for the averaged Err , Err , Angle and Angle across 100 runs.The simulation settings and the definition of the Err , Err , Angle and Angle are illustratedin Section 3. Scenario 1 Scenario 2FPCA SFPCA KFPCA FPCA SFPCA KFPCAGaussian Err Err Angle Angle t (3) Err Err Angle Angle t (3) Err Err Angle Angle2