Robust functional principal components for sparse longitudinal data
Graciela Boente, Universidad de Buenos Aires and CONICET, Argentina, [email protected]
Matías Salibián-Barrera, University of British Columbia, Canada, [email protected]
December 4, 2020
Abstract
In this paper we review existing methods for robust functional principal component analysis (FPCA) and propose a new method for FPCA that can be applied to longitudinal data where only a few observations per trajectory are available. This method is robust against the presence of atypical observations, and can also be used to derive a new non-robust FPCA approach for sparsely observed functional data. We use local regression to estimate the values of the covariance function, taking advantage of the fact that for elliptically distributed random vectors the conditional location parameter of some of its components given others is a linear function of the conditioning set. This observation allows us to obtain robust FPCA estimators by using robust local regression methods. The finite sample performance of our proposal is explored through a simulation study that shows that, as expected, the robust method outperforms existing alternatives when the data are contaminated. Furthermore, we also see that for samples that do not contain outliers the non-robust variant of our proposal compares favourably to the existing alternative in the literature. A real data example is also presented.

AMS Subject Classification 2000: MSC 62F35, MSC 62H25.

Key words and phrases: Functional data analysis; Principal components; Robust estimation; Sparse data.

1 Introduction
Functional Data Analysis refers to a collection of statistical models that apply to data that can be naturally represented as observations taken from (mostly unobserved) underlying functions. Instead of simply treating the data as vectors, models with a functional structure naturally incorporate into the analysis important characteristics of the random process generating the observations, such as smoothness. Functional models represent the partially observed functions as random elements on a functional space, such as L²(I), with I ⊂ R a finite interval. Given the infinite dimension of this type of sample spaces, some form of dimension reduction or regularization is typically required. Functional Principal Components Analysis (FPCA) is a commonly used approach to obtain optimal lower-dimensional representations of the observations (Boente et al. [7]). Moreover, FPCA can also be used to describe the main characteristics of the process generating the functions, similarly to what is done in the finite-dimensional case with the interpretation of PCA loadings. Specifically, the trajectories of each subject can be represented by their coefficients (scores) on a few elements of the basis of eigenfunctions (see, for instance, Ramsay and Silverman [47] and Hall and Horowitz [26]). Other bases can also be used to represent the trajectories, e.g. a sufficiently rich B-splines basis (James et al. [36]).

In this paper, we study the problem of reliably estimating the principal functions when the data may contain a small proportion of atypical observations. It is well-known that, similarly to what happens in the finite dimensional case, most FPCA methods can be seriously affected in such a situation. The outliers need not be “extreme” data points, but might consist of curves that behave differently from the others, or that display a persistent behaviour either in shift, amplitude and/or shape (see, e.g., Hubert et al.
[32]).

Robust FPCA methods in the literature can be classified in three groups, depending on the specific property of principal components on which they focus. Some of them rely on performing the eigenanalysis of a robust estimator of the covariance or scatter operator. Others estimate the principal functions by searching for directions that maximize a robust estimator of the spread or scale of the corresponding projections. A third group of robust methods estimate the principal subspaces by minimizing a robust measure of the reconstruction error (the discrepancy between the observations and their orthogonal projections on the eigenspace). A more detailed discussion can be found in Section 2 below.

Most functional data analysis methods assume that each trajectory is observed on a relatively fine grid of points, often equally spaced. The corresponding approaches to analyze such functional data can be found, for instance, in Ramsay and Silverman [47], Ferraty and Vieu [19] and Ferraty and Romain [18]. Recent reviews of functional data analysis methods include: Horváth and Kokoszka [29], Hsing and Eubank [30], Cuevas [13], Goia and Vieu [25], and Wang et al. [53].

To the best of our knowledge, most of the robust FPCA methods proposed in the literature also require that the curves be observed in a relatively dense grid. However, there are many applications in which trajectories are measured only a few times for each sampling unit. Such longitudinal data sets with only a few available observations per curve (possibly recorded at irregular intervals) are relatively common in applications, and in many cases it is sensible to consider an underlying functional structure (of smooth trajectories, for example). Only a few FPCA methods exist in the literature for this type of data (see for example, James et al. [36], Yao et al. [56] and Li and Hsing [39]).

The approach of James et al. [36] consists of assuming a finite-dimensional expansion for the underlying random process (i.e.
a Karhunen–Loève expansion with only finitely many terms), and approximating the mean function and eigenfunctions with splines. An alternative proposal was given by Yao et al. [56] (see also Staniswalis and Lee [51]). It consists of estimating the covariance function using a bivariate smoother over the cross-products of the available observations. In this way, information about the covariance function of the process at different points is “borrowed” from different curves. Principal directions are constructed from the estimated covariance function, and scores are estimated using best linear predictors. Similarly, Li and Hsing [39] estimate the mean and covariance functions with local linear smoothers, but use weights to ensure that the effect that each curve has on the optimizers is not overly affected by the number of points at which they were observed.

To construct robust FPCA estimators, Gervini [24] modified the reduced rank proposal of James et al. [36] assuming that the finite-dimensional standardized scores and the measurement errors have a joint multivariate Student’s t distribution. The resulting estimators for the centre function and eigenfunctions are Fisher-consistent. A related method using MM-estimators for the basis coefficients and scores has been recently proposed by Maronna [43].

An intuitively straightforward approach to obtain a robust version of the method proposed by Yao et al. [56] would be to replace the bivariate smoother of the cross-products with a robust alternative. However, the regression approach of Staniswalis and Lee [51], and Yao et al. [56] works because the smoother estimates the conditional mean of the cross-products, which is the covariance function. The main challenge with using a robust smoother in this setting is that, to the best of our knowledge, there are no robust estimators for the expected value of asymmetrically distributed random variables without imposing distributional assumptions.
Hence, a robust smoother of the cross-products will typically not be able to estimate the covariance function.

In this paper, we provide an alternative approach to obtain a robust estimate of the covariance function in the case of sparsely observed functional (longitudinal) data. Furthermore, this proposal can naturally be adapted to offer another way to perform FPCA for longitudinal data in the settings considered by Yao et al. [56]. Our proposal is based on a well-known property of the conditional distribution of elliptically distributed random vectors: the conditional location parameter of one of its components given others is a linear function of the set of conditioning values. We first show that for an elliptically distributed random process, the vector of its values at a fixed set of points has a multivariate elliptical distribution. Combining this property with the previous one, we propose to use a robust local regression method to estimate the values of the covariance function when outliers may be present in the data. Our approach can also be used with a non-robust local regression method, and our numerical experiments show that this method compares favourably to that of Yao et al. [56].

The rest of the paper is organized as follows. In Section 2 we review previously proposed robust FPCA methods which are applicable when either the entire trajectories were observed, or they were recorded on a dense grid of points. Section 3 describes our robust estimator of the principal directions for sparsely recorded data. We discuss the results of our numerical experiments studying the robustness and finite sample performance of the different methods in Section 4. An illustration is discussed in Section 5, while Section 6 contains a discussion and further recommendations.
2 Robust FPCA methods
As mentioned in the Introduction, atypical observations in a functional setting need not be “extreme” points and may occur in several different ways. Often they correspond to individuals that follow a different pattern than that of the majority of the data. The detection of such outliers is generally a difficult problem, and some proposals exist in the literature. Among others, we can mention the procedures described in Febrero et al. [15, 16], Hyndman and Ullah [35], Hyndman and Shang [34], Sawant et al. [48], and Sun and Genton [52].

In this paper we are concerned with methods that do not require the preliminary detection of potential atypical observations in the data. We need to introduce some notation and definitions. Although we are motivated by situations where the data can be represented as random elements X on a functional space, like L²(I), with I ⊂ R a finite interval, most of our discussion can be generalized to the case where X is a random element on an arbitrary separable Hilbert space H. Denote the corresponding inner product and norm by ⟨·, ·⟩ and ‖u‖ = ⟨u, u⟩^{1/2}, respectively. In classical FPCA one assumes that E‖X‖² < ∞, which ensures the existence of the mean and the covariance operator (denoted by µ_X = E(X) and Γ_X : H → H, respectively). The covariance operator can be written as Γ_X = E{(X − µ_X) ⊗ (X − µ_X)}, where ⊗ denotes the tensor product on H, i.e., the operator u ⊗ v : H → H given by (u ⊗ v)w = ⟨v, w⟩ u for w ∈ H. When H = L²(I), the mean function µ_X satisfies µ_X(t) = E(X(t)) for t ∈ I, and the covariance operator Γ_X is associated with a symmetric, non-negative definite covariance function γ_X : I × I → R such that ∫_I ∫_I γ²_X(t, s) ds dt < ∞, as follows: (Γ_X f)(t) = ∫_I γ_X(s, t) f(s) ds.

The covariance operator Γ_X is linear, self-adjoint, continuous, and Hilbert–Schmidt.
Hilbert–Schmidt operators have a countable number of eigenvalues, all of them real when the operator is self-adjoint. Let {φ_ℓ : ℓ ≥ 1} be the eigenfunctions of Γ_X associated with its eigenvalues λ_1 ≥ λ_2 ≥ ··· in decreasing order. The Karhunen–Loève expansion of the process X is given by the representation X = µ_X + ∑_{ℓ≥1} ξ_ℓ φ_ℓ, where the convergence of the series is in mean square norm, that is, lim_{q→∞} E‖X − µ_X − ∑_{ℓ=1}^q ξ_ℓ φ_ℓ‖² = 0. The random variables {ξ_ℓ : ℓ ≥ 1} are the coordinates of X − µ_X on the basis {φ_ℓ : ℓ ≥ 1}, that is, ξ_ℓ = ⟨X − µ_X, φ_ℓ⟩. Note that E(ξ_ℓ) = 0, E(ξ_ℓ²) = λ_ℓ, and E(ξ_ℓ ξ_s) = 0 for ℓ ≠ s.

In multivariate analysis, elliptical distributions provide an important framework for the development of robust principal component procedures. Elliptical processes play a similar role for the development of robust FPCA methods, allowing for random elements that may not have finite second moments. Given µ ∈ H, usually known as the location parameter, and Γ, a self-adjoint, positive semi-definite and Hilbert–Schmidt operator on H (the scatter operator), we say that the process X has an elliptical distribution E(µ, Γ, ϕ) if for any linear bounded operator A : H → R^d the random vector AX has a multivariate elliptical distribution E_d(Aµ, AΓA*, ϕ), where A* denotes the adjoint operator of A and ϕ stands for the characteristic generator. It is easy to see that Gaussian processes satisfy the above definition with ϕ(a) = exp(−a/2). When second moments exist, the location parameter of X is µ and the covariance operator is proportional to Γ, i.e., µ_X = µ and Γ_X = α₀ Γ for some α₀ > 0. As noted in Boente et al. [7], the scatter operator Γ is confounded with the function ϕ, meaning that for any c > 0, E(µ, Γ, ϕ) ∼ E(µ, cΓ, ϕ_c), where ϕ_c(w) = ϕ(w/c).
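To fix ideas, the eigenvalues and eigenfunctions appearing in the Karhunen–Loève expansion can be approximated numerically by discretizing the covariance kernel on a grid and solving the resulting matrix eigenproblem (a standard Nyström-type approximation; this sketch is not part of the paper). The Brownian-motion kernel γ(s, t) = min(s, t) is used purely as an illustration because its eigenvalues λ_k = 1/((k − 1/2)π)² are known in closed form:

```python
import numpy as np

def kl_basis(gamma, grid, n_comp):
    """Approximate the leading eigenvalues/eigenfunctions of the
    covariance operator with kernel `gamma` on an equally spaced grid."""
    dt = grid[1] - grid[0]
    S, T = np.meshgrid(grid, grid, indexing="ij")
    G = gamma(S, T)                        # discretized kernel matrix
    evals, evecs = np.linalg.eigh(G * dt)  # integral operator ~ G * dt
    order = np.argsort(evals)[::-1][:n_comp]
    # scale eigenvectors so that sum_j phi_k(t_j)^2 * dt = 1
    return evals[order], evecs[:, order] / np.sqrt(dt)

# Illustration: Brownian motion, gamma(s, t) = min(s, t)
grid = np.linspace(0.0, 1.0, 501)
lambdas, phis = kl_basis(np.minimum, grid, 3)
```

The returned eigenvalues approximate the true λ_k with an error that shrinks as the grid is refined, and the columns of `phis` approximate the eigenfunctions φ_k evaluated on the grid.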
For that reason, without loss of generality, when second moments exist, we will assume that Γ_X = Γ. We refer the interested reader to Bali and Boente [1] and Boente et al. [7].

Like their finite-dimensional counterparts, functional principal components satisfy the following three equivalent properties:

• Property 1: The first principal directions correspond to the eigenfunctions of the covariance operator Γ_X associated with its largest eigenvalues.

• Property 2: The first principal direction maximizes Var(⟨α, X⟩) over the unit sphere S = {α : ‖α‖ = 1}, and subsequent ones solve the same optimization problem subject to the condition that α be orthogonal to all the previous principal directions.

• Property 3:
The first q principal directions provide the best q-dimensional linear approximation to the random element in terms of mean squared error. Specifically, let π(x, L) denote the orthogonal projection of x ∈ H onto an arbitrary q-dimensional closed linear space L, and let L₀ be the linear space spanned by the eigenfunctions of Γ_X associated with its q largest eigenvalues λ_1 ≥ ··· ≥ λ_q. If λ_q > λ_{q+1}, then we have E(‖(X − µ_X) − π(X − µ_X, L₀)‖²) ≤ E(‖(X − µ_X) − π(X − µ_X, L)‖²).

As mentioned in the Introduction, robust functional principal components estimators proposed in the literature can be classified in three groups, according to the property on which they focus. The first approach is based on obtaining a robust counterpart to the sample covariance operator, and using its eigenfunctions to perform robust FPCA. The second group consists of methods that sequentially search for directions that maximize a robust estimator of the spread or scale of the projections. The last class of robust FPCA methods estimates principal subspaces by searching for the closed linear subspace L that minimizes a robust measure of scale of the reconstruction error ‖(X − µ) − π(X − µ, L)‖ (the distance between the data and their orthogonal projections onto the subspace), where µ denotes a location parameter.

In the rest of this section we review the robust FPCA methods that are available in each of these three groups. Our proposal for the case of sparsely observed trajectories is discussed in the next Section.

2.1 Methods based on Property 1

Probably the earliest robust functional principal components estimators in the literature are those introduced in Locantore et al. [40]. They proposed the so-called spherical principal components, which were further studied in Gervini [23] and Boente et al. [5].
The basic idea is to control potential outliers in the data by normalizing the observations (forcing all curves to lie on the unit sphere). Formally, given a centre µ ∈ H, the spatial or sign covariance operator of X centered at µ is defined as

Γ_s(µ) = E[ ((X − µ)/‖X − µ‖) ⊗ ((X − µ)/‖X − µ‖) ],   (1)

which can be estimated through its sample version:

Γ̂_s(µ) = (1/n) ∑_{i=1}^n ((X_i − µ)/‖X_i − µ‖) ⊗ ((X_i − µ)/‖X_i − µ‖).

Several location parameters (centres) have been considered in the literature. When using the sign covariance operator (1) the standard choice for the centre µ is the functional spatial or geometric median µ_gm = argmin_{u∈H} E(‖X − u‖ − ‖X‖) (see Locantore et al. [40] and Gervini [23]). It can be estimated by its sample version µ̂_gm = argmin_{u∈H} ∑_{i=1}^n ‖X_i − u‖, leading to the usual spatial operator estimator Γ̂_s(µ̂_gm). The spherical principal direction estimators correspond to its eigenfunctions. We refer to Gervini [23] and Cardot et al. [9] for consistency results for the geometric median estimator, and to Boente et al. [5] for results regarding the asymptotic distribution of Γ̂_s(µ̂) and the spherical principal directions. Other robust location estimators for µ above may be considered, for example, the α-trimmed mean introduced in Fraiman and Muñiz [22], the deepest point (e.g. Cuevas [13], Cuevas et al. [14] and López-Pintado and Romo [41]) and the M-location estimator in Sinova et al. [50].

In addition to being easy to compute, another appealing property of the spherical principal directions is that, under certain conditions, they can be shown to be Fisher-consistent.
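For densely observed curves on a common grid, the spherical principal components are straightforward to compute. The following sketch (an illustration, not the authors' code) estimates the geometric median with Weiszfeld iterations, a standard algorithm for this minimization, and then performs the eigenanalysis of the discretized sample sign covariance operator (1):

```python
import numpy as np

def spatial_median(X, n_iter=200, tol=1e-9):
    """Weiszfeld iterations for the sample geometric median of the
    curves stored as the rows of X (values on a common grid)."""
    mu = np.median(X, axis=0)
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(X - mu, axis=1), tol)
        mu_new = (X / d[:, None]).sum(axis=0) / (1.0 / d).sum()
        if np.linalg.norm(mu_new - mu) < tol:
            return mu_new
        mu = mu_new
    return mu

def spherical_pca(X, n_comp):
    """Eigenfunctions of the sample sign covariance operator (1),
    evaluated on the common grid (columns of X)."""
    mu = spatial_median(X)
    Z = X - mu
    S = Z / np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1e-12)
    gamma_s = S.T @ S / len(X)    # discretized sign covariance
    evals, evecs = np.linalg.eigh(gamma_s)
    order = np.argsort(evals)[::-1][:n_comp]
    return mu, evecs[:, order]
```

Note that the eigenvalues of Γ̂_s do not estimate the λ_ℓ themselves; only the eigenfunctions and their order are recovered, which is what the Fisher-consistency result below refers to.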
Specifically, assume that either X has an elliptical distribution, as defined above, with location µ and scatter operator Γ, or that X has a finite-rank representation X = µ + ∑_{ℓ=1}^q λ_ℓ^{1/2} f_ℓ φ_ℓ, where the random vector (f_1, …, f_q) has exchangeable symmetric marginal distributions (in this case, we denote Γ = ∑_{j=1}^q λ_j φ_j ⊗ φ_j). Then, the eigenfunctions of Γ_s(µ) are the same as those of Γ and in the same order (see Boente et al. [7] and Gervini [23]).

Other robust scatter operator estimators have been proposed in the literature, for example, the ρ-scatter operator of Kraus and Panaretos [37] and the geometric median covariation studied in Cardot and Godichon-Baggioni [10]. However, the eigenfunctions of these operators are not guaranteed to remain in the same order as those of the true covariance operator when the latter exists, or of the true scatter operator for elliptical processes.

Sawant et al. [48] proposed to estimate the principal directions and their size indirectly using the following approach. Ramsay and Silverman [47] showed that if the observed trajectories X_i and the eigenfunctions φ_ℓ of the covariance operator are represented on a common known basis (e.g. splines or Fourier), then FPCA can be performed using standard finite-dimensional PCA on the matrix of coefficients representing the observations on the chosen basis. The robust version of this procedure in Sawant et al. [48] consists of replacing the PCA step above with a robust PCA alternative. In particular, they used ROBPCA (Hubert et al. [31]) and BACONPCA (Billor et al. [3]).

2.2 Methods based on Property 2

Proposals to perform robust FPCA relying on Property 2 above are typically referred to as projection-pursuit approaches. Hyndman and Ullah [35] discussed a robust projection-pursuit method applied to smoothed observed trajectories. Bali et al.
[2] generalized this approach, combining it with penalization and basis reduction. In order to impose smoothness on the estimated eigenfunctions, regularization is included via a penalization operator. Consider the subset H_s of smooth elements of H and a linear operator D : H_s → H, usually called the “differentiator”, since when H = L²(I) one typically takes Dα = α′′. Associated with D one can construct the symmetric positive semi-definite bilinear form [·, ·] : H_s × H_s → R, where [α, β] = ⟨Dα, Dβ⟩, and the functional Ψ : H_s → R given by Ψ(α) = [α, α]. Finally, a penalized inner product is defined as ⟨α, β⟩_τ = ⟨α, β⟩ + τ[α, β], and the associated norm is ‖α‖²_τ = ‖α‖² + τΨ(α).

Let s_n(α) be a robust univariate scale estimator σ_n (e.g. an M-scale estimator) computed on the sample projections ⟨α, X_1⟩, …, ⟨α, X_n⟩. The intuitive idea is to estimate the first principal direction as the element α that maximizes s_n(α) − ρ Ψ(α) over the unit sphere {‖α‖_τ = 1} ⊂ H. In addition, Bali et al. [2] used a sieves approach to approximate the elements of H with an increasing sequence of known bases (e.g. splines). Formally, let {δ_i}_{i≥1} be a basis of H, and H_{p_n} the linear space spanned by δ_1, …, δ_{p_n}, with p_n → ∞ as n → ∞. The robust projection-pursuit estimator for the first principal direction in Bali et al. [2] is given by

φ̂ = argmax_{α ∈ H_{p_n}, ‖α‖_τ = 1} { s_n(α) − ρ Ψ(α) }.

Subsequent principal directions are obtained imposing orthogonality with respect to ⟨·, ·⟩_τ. Bali et al. [2] showed that the estimators are qualitatively robust.
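A bare-bones illustration of the projection-pursuit idea (again assuming densely observed curves on a common grid, and not the authors' implementation): represent candidate directions on a small cosine basis, use the MAD of the projections as the robust scale s_n, and maximize over the unit sphere. The penalty term and the increasing sieve of Bali et al. [2] are omitted for brevity, and a multistart is used because the objective is non-convex:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import median_abs_deviation

def pp_first_direction(X, grid, n_basis=5, n_starts=10, seed=0):
    """Projection-pursuit first direction: maximize the MAD of the
    projections <alpha, X_i> over unit-norm directions alpha expressed
    on a small cosine basis (penalty and sieve omitted)."""
    dt = grid[1] - grid[0]
    B = np.vstack([np.ones_like(grid)] +
                  [np.sqrt(2) * np.cos(k * np.pi * grid) for k in range(1, n_basis)])
    C = X @ B.T * dt                     # basis coefficients of each curve

    def neg_scale(a):
        a = a / (np.linalg.norm(a) + 1e-12)   # restrict to the unit sphere
        return -median_abs_deviation(C @ a)

    rng = np.random.default_rng(seed)
    best = min((minimize(neg_scale, rng.normal(size=n_basis), method="Nelder-Mead")
                for _ in range(n_starts)), key=lambda r: r.fun)
    a = best.x / np.linalg.norm(best.x)
    return B.T @ a                       # direction evaluated on the grid
```

Replacing the MAD with an M-scale, and adding the penalty ρΨ(α), recovers the structure of the estimator displayed above.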
Furthermore, the procedure is Fisher-consistent when the robust univariate scale estimator σ_n is the empirical version of a functional σ_rob such that there exists a constant c > 0 satisfying σ²_rob(P[α]) = c ⟨α, Γα⟩ for all α ∈ H, where P[α] stands for the distribution of ⟨α, X⟩. This condition is satisfied when X has an elliptical distribution.

Note that the approaches described here and in Section 2.1 require that one is able to compute or approximate well the norms ‖X_i − µ‖ and inner products ⟨α, X_i⟩. For example, when H = L²(I) this typically means that the trajectories need to have been observed over a relatively dense grid of points. When the data are measured sparsely (and only a few points per trajectory are available) these methods become infeasible.

2.3 Methods based on Property 3

The last group of robust FPCA methods estimates the eigenspaces directly, for example by minimizing a robust measure of the distance between the observations and their orthogonal projections on finite-dimensional subspaces. Lee et al. [38] proposed a sequential algorithm fitting linear spaces of dimension 1 to the observed data. Their proposal assumes that all trajectories X_i, 1 ≤ i ≤ n, were observed on a common grid t_1, …, t_m. The goal is to find a smooth function v and a vector a = (a_1, …, a_n)ᵗ that minimize the size of the residuals X_i(t_j) − a_i v(t_j). Because of the potential presence of outliers, the method uses a bounded loss function ρ. Given an initial estimator for v, let a_i = ∑_{j=1}^m X_i(t_j) v(t_j), 1 ≤ i ≤ n, and for each j = 1, …, m, let σ̂_j be a robust scale estimator of the residuals X_i(t_j) − a_i v(t_j), 1 ≤ i ≤ n.
The estimators for v and a are defined as the minimizers of

∑_{i=1}^n ∑_{j=1}^m σ̂²_j ρ( (X_i(t_j) − a_i v(t_j)) / σ̂_j ) + λ ∫ [v′′(t)]² dt,

subject to the constraint ∫ v²(t) dt = 1, where λ > 0 is a smoothing parameter. Given a = (a_1, …, a_n)ᵗ, let φ_a be the minimizer of the objective function above, which can be obtained using iterative reweighted penalized least-squares; and given φ_a, the entries of a are the projections of each trajectory along the direction φ_a: a_i = ∑_{j=1}^m X_i(t_j) φ_a(t_j), 1 ≤ i ≤ n. These steps are then iterated until convergence. Once the first principal direction φ̂_1 is obtained, we can compute the corresponding estimated scores ξ̂_{i,1} = ∑_{j=1}^m X_i(t_j) φ̂_1(t_j). The other estimated principal directions are computed sequentially as follows: assume that φ̂_s, s = 1, …, ℓ −
1, are estimators of the first ℓ − 1 principal directions. Then, the ℓ-th principal direction and the related scores are computed applying the previous procedure to X_i(t_j) − ∑_{s=1}^{ℓ−1} ξ̂_{i,s} φ̂_s(t_j). Lee et al. [38] also discussed a data-dependent and resistant procedure to select λ.

In some applications each curve X_i may be observed on a different (but dense) grid, and in that case we can, in principle, proceed as recommended in Ramsay and Silverman [47], namely: smooth each curve X_i and apply the method above to the values of the smoothed trajectories on a fixed grid. A drawback of this strategy is that the errors introduced by the data smoothing step cannot be included easily in the analysis.

A different approach to estimate functional principal subspaces is as follows. Assume that each trajectory is observed on a dense grid of points (that may be specific to each curve), and let {δ_i}_{i≥1} be an orthonormal basis of H. The basic idea is to identify each curve X_i with the vector of its coefficients on a finite-dimensional basis, apply a robust multivariate method to estimate principal subspaces in this finite-dimensional space, and then map the results back to H. Specifically, fix p_n, the basis size, and let x_ij = ⟨X_i, δ_j⟩ be the coefficient of the i-th trajectory on the j-th element of the basis, 1 ≤ j ≤ p_n. The grids where each X_i is observed need to be sufficiently dense so that these inner products can be approximated well using finite Riemann sums. For each 1 ≤ i ≤ n, let x_i be the vector of coefficients of X_i on the basis {δ_j}_{1≤j≤p_n}: x_i = (x_{i1}, …, x_{ip_n})ᵗ. We can now apply robust multivariate methods on the sample x_1, …, x_n to estimate a q-dimensional principal subspace L̂ ⊂ R^{p_n}. Once the principal subspace L̂ is found, the functional principal direction estimators can be reconstructed by “mapping back” the results in R^{p_n} onto H. To fix ideas, let b̂^{(1)}, …
, b̂^{(q)} be an orthonormal basis for L̂ and let x̂_i = µ̂ + ∑_{ℓ=1}^q â_{iℓ} b̂^{(ℓ)} be the q-dimensional approximation to x_i, where µ̂ = (µ̂_1, …, µ̂_{p_n})ᵗ. The functional location parameter can be reconstructed as µ̂(·) = ∑_{j=1}^{p_n} µ̂_j δ_j, while the q-dimensional principal direction basis in H is φ̂_ℓ = ∑_{j=1}^{p_n} b̂^{(ℓ)}_j δ_j / ‖∑_{j=1}^{p_n} b̂^{(ℓ)}_j δ_j‖, for 1 ≤ ℓ ≤ q. Furthermore, the “fitted values” in H are X̂_i = µ̂ + ∑_{ℓ=1}^q â_{iℓ} φ̂_ℓ.

The proposals of Boente and Salibian-Barrera [6] and Cevallos-Valdiviezo [11] are variants of this approach that differ in how the principal subspace L̂ ⊂ R^{p_n} and its orthonormal basis are estimated. Let B = (b^{(1)}, …, b^{(q)}) ∈ R^{p_n × q} and let bᵗ_j denote the j-th row of B. For a given µ ∈ R^{p_n} and a_i ∈ R^q, the corresponding “fitted values” are x̂_i(µ, B, A) = µ + B a_i, where the a_i are the columns of the matrix A ∈ R^{q × n}. Boente and Salibian-Barrera [6] estimated the principal subspace minimizing ∑_{j=1}^{p_n} σ̂²_j(µ, B, A), where σ̂_j(µ, B, A) is a robust scale of the j-th coordinate of the residual vectors r_i = x_i − x̂_i(µ, B, A), 1 ≤ i ≤ n. Although in principle one can use any robust scale estimator, Boente and Salibian-Barrera [6] discussed in detail the case of M-scale estimators (see Maronna et al. [42]), for which an iterative algorithm can be derived. Cevallos-Valdiviezo [11] proposed using a multivariate S-estimator for PCA, minimizing σ̂²(µ, B, A), where σ̂(µ, B, A) is a robust scale of the distances d_i(A, B, µ) = ‖x_i − x̂_i(µ, B, A)‖, using M-scale and least-trimmed scale estimators.
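The following sketch conveys the flavour of the reconstruction-error approach in a deliberately simplified form. It is neither the M-scale estimator of Boente and Salibian-Barrera [6] nor the S-estimator of Cevallos-Valdiviezo [11]; instead it iteratively downweights coefficient vectors x_i whose reconstruction error from the current q-dimensional subspace is large, using Huber-type weights and a crude median-based scale:

```python
import numpy as np

def reweighted_pca(X, q, n_iter=20, c=2.0):
    """Iteratively reweighted PCA on coefficient vectors (rows of X):
    observations far from the current q-dimensional principal subspace
    receive Huber-type weights smaller than one."""
    w = np.ones(len(X))
    for _ in range(n_iter):
        mu = (w[:, None] * X).sum(axis=0) / w.sum()
        Z = X - mu
        _, _, Vt = np.linalg.svd(np.sqrt(w)[:, None] * Z, full_matrices=False)
        B = Vt[:q].T                           # orthonormal basis of L-hat
        d = np.linalg.norm(Z - (Z @ B) @ B.T, axis=1)   # reconstruction errors
        scale = np.median(d) / 0.6745 + 1e-12  # crude robust scale of distances
        r = d / scale
        w = np.where(r <= c, 1.0, c / np.maximum(r, 1e-12))
    return mu, B
```

The returned columns of `B` span the estimated subspace L̂, from which the functional directions φ̂_ℓ are reconstructed as described above.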
All these proposals are Fisher-consistent for elliptically distributed random processes.

3 Robust FPCA for sparsely observed data

In this section we describe a new robust FPCA method that is applicable to the case where few observations per trajectory may be available. Moreover, when robustness is not a concern, our proposal results in a novel method for non-robust FPCA that compares favourably with existing methods in the literature.
We will assume the following framework. Let {X(t) : t ∈ I} be a stochastic process defined on a probability space (Ω, A, P) with continuous trajectories, where I ⊂ R is a finite interval, which can be assumed to be I = [0,
1] without loss of generality. To allow for atypical observations, we will include processes X for which finite second moments may not exist. Specifically, we will only assume that the process X has an elliptical distribution E(µ, Γ, ϕ), where µ ∈ L²(I) and Γ is a self-adjoint, positive semi-definite and Hilbert–Schmidt operator on L²(I). We will denote by γ : I × I → R the kernel defining Γ, that is, (Γf)(s) = ∫_I γ(t, s) f(t) dt. Recall that γ is symmetric and ∫_I ∫_I γ²(t, s) ds dt < ∞, since Γ is a Hilbert–Schmidt operator. Finally, {φ_k}_{k≥1} will denote an orthonormal basis of L²(I) consisting of eigenfunctions of the scatter operator Γ, ordered according to the associated non-zero eigenvalues λ_1 ≥ λ_2 ≥ ….

Note that elliptically distributed random processes as defined in Section 2 admit a Karhunen–Loève representation when Γ has finite rank or when the kernel γ is continuous. Consider first the case where the scatter operator Γ has finite rank (i.e. a finite number of non-zero eigenvalues). It follows that X ∼ µ + ∑_{k=1}^q ξ_k φ_k, where q = rank(Γ) and ξ = (ξ_1, …, ξ_q)ᵗ has a multivariate elliptical distribution E_q(0_q, diag(λ_1, …, λ_q), ϕ) (see the proof of Proposition 3.1). When there are infinitely many positive eigenvalues, Proposition 2.1 in Boente et al. [7] shows that X ∼ µ + S V, where S ≥ 0 is a random variable independent of the zero-mean Gaussian process V. The standard Karhunen–Loève expansion for Gaussian processes implies that V = ∑_{k=1}^∞ η_k φ_k(t) with η_k ∼ N(0, λ_k), k ≥
1. The continuity of the covariance function γ and of the process V implies that the convergence of the series is uniform over I with probability 1. Thus, defining ξ_k = S η_k, we obtain that ξ = (ξ_1, …, ξ_q)ᵗ has a multivariate elliptical distribution E_q(0_q, diag(λ_1, …, λ_q), ϕ) and X ∼ µ + ∑_{k=1}^∞ ξ_k φ_k with covariance operator Γ. Hence, we can write

X(t) = µ(t) + ∑_{k=1}^∞ ξ_k φ_k(t),  t ∈ I,   (2)

where the scores ξ_k = ⟨X − µ, φ_k⟩ are such that, for any q ≥
1, the random vector ξ = (ξ_1, …, ξ_q)ᵗ has a multivariate elliptical distribution E_q(0_q, diag(λ_1, …, λ_q), ϕ). When second moments exist, the scores are uncorrelated random variables.

The following Proposition provides the main motivation for our approach. It shows that, similarly to what holds for Gaussian processes, the vector obtained by evaluating an elliptically distributed random process on a fixed finite set of k points is an elliptical random vector on R^k. Furthermore, it also shows that the conditional distribution of the scores ξ_j in (2) given a set of k observations of the process is elliptical. This result suggests a natural estimator for the ξ_j’s, which can be used to reconstruct the full trajectories in the sample. The proof can be found in Section 7.

Proposition 3.1.
Let X ∼ E(µ, Γ, ϕ) be a random element on L²(I), with I ⊂ R, and assume that the kernel γ associated with Γ is continuous. Let λ_1 ≥ λ_2 ≥ … be the non-null eigenvalues of Γ and denote by φ_k the eigenfunction of Γ associated with λ_k, chosen so that the set {φ_k, k ∈ N} is an orthonormal set in L²(I). Then,

a) For any fixed m and t_1, t_2, …, t_m in I, the random vector (X(t_1), …, X(t_m))ᵗ has an elliptical distribution in R^m with location µ_m = (µ(t_1), …, µ(t_m))ᵗ and scatter matrix Σ with elements Σ_{(ℓ,j)} = γ(t_ℓ, t_j), 1 ≤ ℓ, j ≤ m.

b) For any s ≠ t ∈ I we have that X(t) | X(s) ∼ E_1(µ_{t|s}, σ²_{t|s}, ϕ⋆_s), where the conditional location is given by

µ_{t|s} = µ(t) + (γ(t, s)/γ(s, s)) (X(s) − µ(s)).   (3)

c) For any fixed m and t_1, t_2, …, t_m in I, let X_m = (X(t_1), …, X(t_m))ᵗ. We have that ξ_k | X_m ∼ E_1(µ_k, σ²_k, ϕ⋆_{X_m}), where

µ_k = λ_k φᵗ_k Σ⁻¹_{X_m} (X_m − µ_m),   (4)

with φ_k = (φ_k(t_1), …, φ_k(t_m))ᵗ, µ_m = (µ(t_1), …, µ(t_m))ᵗ, and the (ℓ, j)-th element of Σ_{X_m} equal to γ(t_ℓ, t_j).

In what follows we will use {X_i : 1 ≤ i ≤ N} to denote independent realizations of the stochastic process X. We assume that each trajectory X_i is observed at independent random “times” t_ij, 1 ≤ j ≤ n_i, where the n_i’s, 1 ≤ i ≤ N, are usually assumed to be random variables, independent from all others. Then, our model is:

X_ij = X_i(t_ij) = µ(t_ij) + ∑_{k=1}^∞ ξ_{ik} φ_k(t_ij),  j = 1, …, n_i,  i = 1, …, N.   (5)

The procedure has three steps: (1) estimating the center function µ; (2) estimating the diagonal elements of the scatter function Γ; and (3) estimating the off-diagonal entries of Γ.

We start by estimating the center function µ using a robust local M-estimator.
For example, one can consider the local M-smoothers proposed in Boente and Fraiman [4], Härdle and Tsybakov [28], Härdle [27] and Oh et al. [44], or robust local linear smoothers as defined in Welsh [54]. Here we use the latter option.

Consider a ρ-function ρ_0 as defined in Maronna et al. [42]; that is, ρ_0 is even, nondecreasing as a function of |x|, ρ_0(0) = 0, and increasing for x > 0 whenever ρ_0(x) < sup_t ρ_0(t). For each t_0 ∈ I, let

    (β̂_0(t_0), β̂_1(t_0))^t = argmin_{β_0, β_1} Σ_{i=1}^N Σ_{j=1}^{n_i} w_{ij}(t_0) ρ_0( [X_{ij} − β_0 − β_1 (t_0 − t_{ij})] / σ̂(t_0) ),    (6)

where σ̂(t_0) is a preliminary robust consistent estimator of scale, and the weights w_{ij}(t_0) are given by

    w_{ij}(t_0) = K( (t_{ij} − t_0)/h ) / [ Σ_{k=1}^N Σ_{ℓ=1}^{n_k} K( (t_{kℓ} − t_0)/h ) ],    (7)

where h = h_n is a bandwidth parameter and K : R → R is a kernel function, i.e. continuous, non-negative and integrable. Then, the estimate for μ(t_0) is μ̂(t_0) = β̂_0(t_0). The scale σ̂(t_0) in (6) can, for example, be a "local MAD", that is, the MAD of the observations in a neighbourhood of t_0. Formally,

    σ̂(t_0) = κ^{-1} median_{|t_{ij} − t_0| ≤ h} | X_{ij} − median_{|t_{ij} − t_0| ≤ h}(X_{ij}) |,

where κ ∈ R is a constant ensuring Fisher-consistency at a given distribution. If we consider a Gaussian distribution as the central model, then κ = Φ^{-1}(3/4).

To estimate the diagonal elements γ(t_0, t_0) of the scatter function we use a robust M-scale estimator of the residuals. Although in principle one can use any robust scale estimator, we expect smooth functionals (like M-scales) to provide more stable estimators. More precisely, let ρ_1 : R → R be a bounded ρ-function such that sup_t ρ_1(t) = 1, and let b ∈ (0, 1) be a fixed constant which determines the robustness of the scale estimator (Maronna et al. [42]). Then γ̂(t_0, t_0) satisfies

    Σ_{i=1}^N Σ_{j=1}^{n_i} w_{ij}(t_0) ρ_1( [X_{ij} − μ̂(t_{ij})] / γ̂^{1/2}(t_0, t_0) ) = b,    (8)

where the weights w_{ij}(t_0) are as in (7). We choose ρ_1 so that E(ρ_1(Z)) = b = 1/2 when Z ∼ N(0, 1), to ensure that in the case of i.i.d. observations with Gaussian errors the resulting scale estimator is consistent and has maximum breakdown point.
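The weighted M-scale equation (8) can be solved with a simple fixed-point iteration. The following is a minimal numpy sketch, not the authors' R implementation: the function names (`rho_tukey`, `mscale`) are ours, Tukey's bisquare is rescaled so that its supremum is 1, and c ≈ 1.548 is the standard constant giving E ρ(Z) = 1/2 for Gaussian Z.

```python
import numpy as np

def rho_tukey(u, c=1.548):
    # Tukey bisquare rho-function, rescaled so that sup rho = 1
    u = np.clip(np.abs(u) / c, 0.0, 1.0)
    return 1.0 - (1.0 - u ** 2) ** 3

def mscale(residuals, weights=None, b=0.5, c=1.548, tol=1e-8, max_iter=100):
    """Weighted M-scale: solve sum_i w_i rho(r_i / s) = b by fixed point."""
    r = np.asarray(residuals, dtype=float)
    w = np.ones_like(r) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    s = np.median(np.abs(r)) / 0.6745  # MAD-based starting value
    for _ in range(max_iter):
        avg = np.sum(w * rho_tukey(r / s, c))
        s_new = s * np.sqrt(avg / b)   # standard re-weighting step
        if abs(s_new - s) < tol * s:
            return s_new
        s = s_new
    return s
```

For γ̂(t_0, t_0), the weights would be the kernel weights w_{ij}(t_0) of (7) and the residuals X_{ij} − μ̂(t_{ij}); the squared scale then estimates the diagonal of the scatter function.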
To estimate the off-diagonal elements γ(t_0, s_0), s_0 ≠ t_0, we take advantage of Proposition 3.1, which shows that the conditional location of X(t_0) given X(s_0) is a linear function of the conditioning value and, furthermore, that the slope is proportional to γ(t_0, s_0). This suggests that we can use local regression methods to estimate γ(t_0, s_0). Specifically, we first estimate the slope β(t_0, s_0) in (3) with a local regression estimator, and then use the fact that γ(t_0, s_0) = β(t_0, s_0) γ(s_0, s_0), where γ(s_0, s_0) can be estimated as in (8). More precisely, let X̃_{ij} = X_{ij} − μ̂(t_{ij}) be the centred observations, and let β̂(t_0, s_0) be the local M-regression estimator satisfying

    β̂(t_0, s_0) = argmin_{β ∈ R} Σ_{i=1}^N Σ_{j ≠ ℓ} ρ_2( [X̃_{ij} − β X̃_{iℓ}] / ŝ(t_0, s_0) ) K( (t_{ij} − t_0)/h ) K( (t_{iℓ} − s_0)/h ),    (9)

where ŝ(t_0, s_0) is a preliminary robust scale estimator and ρ_2 is a bounded ρ-function. To compute ŝ(t_0, s_0) we use a local MAD of the residuals from an initial robust estimator. Specifically, let Z_{ijℓ} = X̃_{ij} / X̃_{iℓ} and, noting that the model does not include an intercept, let β̃(t_0, s_0) be the local median of the slopes: β̃(t_0, s_0) = median_{|t_{ij} − t_0| ≤ h, |t_{iℓ} − s_0| ≤ h} Z_{ijℓ}. Construct residuals r_{ijℓ}(t_0, s_0) = X̃_{ij} − β̃(t_0, s_0) X̃_{iℓ} and let ŝ(t_0, s_0) be the corresponding local MAD:

    ŝ(t_0, s_0) = median_{|t_{ij} − t_0| ≤ h, |t_{iℓ} − s_0| ≤ h} | r_{ijℓ}(t_0, s_0) − m(t_0, s_0) |,

with m(t_0, s_0) = median_{|t_{ij} − t_0| ≤ h, |t_{iℓ} − s_0| ≤ h} r_{ijℓ}(t_0, s_0). With β̂(t_0, s_0) computed in (9) and γ̂(s_0, s_0) from Step 2, we define

    γ̃(t_0, s_0) = β̂(t_0, s_0) γ̂(s_0, s_0).
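The local no-intercept slope in (9) can be sketched as follows. This is our own minimal numpy version (the function name `local_slope` and the IRLS solver are ours; the paper's actual estimator is a full M-minimization): it starts from the median of the ratios, scales residuals by their local MAD, and then iterates weighted least squares with bisquare weights and the product-kernel weights `kw`.

```python
import numpy as np

def local_slope(x, y, kw, c=3.44, n_iter=20):
    """Robust slope (no intercept) of y on x, with kernel weights kw.

    x, y : centred observations X~_{i,l} and X~_{i,j} near (s0, t0);
    kw   : product-kernel weights K((t_ij - t0)/h) K((t_il - s0)/h).
    """
    x, y, kw = (np.asarray(v, dtype=float) for v in (x, y, kw))
    keep = (kw > 0) & (x != 0)
    x, y, kw = x[keep], y[keep], kw[keep]
    beta = np.median(y / x)                       # local median of the slopes
    r = y - beta * x
    s = max(np.median(np.abs(r - np.median(r))), 1e-10)  # local MAD of residuals
    for _ in range(n_iter):
        u = (y - beta * x) / s
        w = np.where(np.abs(u) < c, (1 - (u / c) ** 2) ** 2, 0.0)  # bisquare weights
        denom = np.sum(kw * w * x * x)
        if denom <= 0:
            break
        beta = np.sum(kw * w * x * y) / denom
    return beta
```

The off-diagonal estimate is then `local_slope(...) * gamma_hat_ss`, mirroring γ̃(t_0, s_0) = β̂(t_0, s_0) γ̂(s_0, s_0).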
To ensure that the estimated function γ̂ is symmetric and smooth, we also compute β̂(s_0, t_0) and the corresponding estimate γ̃(s_0, t_0), apply a two-dimensional smoother (e.g. a bivariate B-spline, as described in Section 4) to each of the two surfaces, and define the estimated scatter function γ̂(t_0, s_0) = γ̂(s_0, t_0) as the average of the two smoothed surfaces.

Note that even though γ̂(t, s) defines a symmetric kernel, it may not be positive semi-definite. If necessary, after computing its eigenvalues λ̂_1 ≥ λ̂_2 ≥ … and corresponding eigenfunctions φ̂_1, φ̂_2, …, we redefine γ̂ as γ̂(t, s) = Σ_{j : λ̂_j ≥ 0} λ̂_j φ̂_j(t) φ̂_j(s).

Once we have estimated the eigenfunctions and eigenvalues of the scatter function, Proposition 3.1(c) suggests a natural way to reconstruct the trajectories. Let ξ_{i,k} = ⟨X_i − μ, φ_k⟩ be the k-th score of the i-th observation. Recall that the conditional distribution of ξ_{i,k} given X_i = (X_i(t_{i1}), …, X_i(t_{i n_i}))^t is elliptical with location parameter

    λ_k φ_{ik}^t Σ_i^{-1} (X_i − μ_i),    (10)

where φ_{ik} = (φ_k(t_{i1}), …, φ_k(t_{i n_i}))^t, μ_i = (μ(t_{i1}), …, μ(t_{i n_i}))^t, and Σ_i ∈ R^{n_i × n_i} is the matrix with (ℓ, j)-th element equal to γ(t_{iℓ}, t_{ij}). Note that if the process has finite mean then (10) equals E(ξ_{i,k} | X_i), which is the best predictor of ξ_{i,k} based on the observed trajectory.

A natural estimator of (10) consists of replacing the unknown quantities with the estimators obtained in Steps 1–3 above. Specifically,

    ξ̂_{i,k} = λ̂_k φ̂_{ik}^t ( Σ̂_i + δ I_{n_i} )^{-1} ( X_i − μ̂_i ),    (11)

where μ̂_i = (μ̂(t_{i1}), …, μ̂(t_{i n_i}))^t, φ̂_{ik} = (φ̂_k(t_{i1}), …, φ̂_k(t_{i n_i}))^t, Σ̂_i is the matrix with (j, ℓ)-th element γ̂(t_{ij}, t_{iℓ}), and δ > 0 is a small regularization constant (see also Yao et al. [56]). Finally, for a fixed q, the reconstructed curves are

    X̂_i = μ̂ + Σ_{k=1}^q ξ̂_{i,k} φ̂_k,    i = 1, …, N.

Note that using ρ_0(u) = ρ_1(u) = ρ_2(u) = u² in the method described above naturally yields a non-robust version of our proposal, which is different from that of Yao et al. [56]. Moreover, our numerical experiments (see Section 4) indicate that when the data do not contain outliers the non-robust version of our proposal compares favourably to PACE (Yao et al. [56]).

In this section we report the results of a Monte Carlo study carried out to investigate the finite-sample performance and robustness of the proposed robust FPCA estimator. Our experiments compared the robust and non-robust versions of our proposal with PACE, the approach of Yao et al. [56]. We report results for random processes with two covariance functions and different types of atypical observations. In each case we generated 500 samples of size N = 100, and focused on the behaviour of the estimated eigenfunctions, eigenvalues, predicted scores, and the accuracy of the scatter function estimator, for both clean and contaminated samples.

For clean samples, the data sets are generated from the following model:

    X_i = μ + Σ_{j=1}^q √λ_j Z_{ij} φ_j,    i = 1, …, N,    (12)

with Z_{ij} i.i.d., Z_{ij} ∼ N(0, 1), and λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_q > 0. We considered two models for the location function μ and the principal directions φ_j:

• Model 1: I = (0, 10), μ(t) = t + sin(t), q = 2, λ_1 = 4, λ_2 = 1, and the φ_j are two elements of the Fourier basis: φ_1(t) = −cos(tπ/10)/√5, φ_2(t) = sin(tπ/10)/√5. The design points were generated as follows. First build an equally spaced grid of 51 points {c_ℓ}_{ℓ=1}^{51} on [0, 10], with c_1 = 0 and c_51 = 10, and define s_ℓ = c_ℓ + ε_ℓ, where the ε_ℓ are i.i.d. with ε_ℓ ∼ N(0, 0.1), setting s_ℓ = 0 if s_ℓ < 0 and s_ℓ = 10 when s_ℓ > 10. Each curve was observed at a random number of points n_i, chosen from a discrete uniform distribution, and the locations t_{ij} were randomly chosen from {s_ℓ}_{ℓ=2}^{51} without replacement. This is the same setting as in Yao et al. [56].

• Model 2: In this case, I = (0, 1), μ(t) = 10 sin(2tπ) exp(−t), and q = 4, with eigenvalues λ_1 > λ_2 > λ_3 = 0.029 > λ_4 chosen to have very similar ratios λ_j/λ_{j+1}, j = 1, 2, 3. The φ_j are the first q eigenfunctions of the Matérn covariance function

    γ(s, t) = C( √(2ν) |s − t| / ρ ),    C(u) = σ² [2^{1−ν} / Γ(ν)] u^ν K_ν(u),

where Γ(·) is the Gamma function and K_ν is the modified Bessel function of the second kind. We set ρ = 3, σ = 1 and ν = 1/3. The number of observations per curve, n_i, was chosen from a discrete uniform distribution, and the observed times satisfy t_{ij} ∼ U(I), i.i.d., j = 1, …, n_i; i = 1, …, N.

These models have different characteristics that may affect the performance of the FPCA estimators on finite samples. For example, while the ratio between the first and second eigenvalues is 4 for Model 1, it is larger than 10 for Model 2. This means that in the second model a single principal direction already explains a larger proportion (87%) of the total variation. In addition, although the first two principal directions are similar in both models (with the order reversed), the third and fourth eigenfunctions of Model 2 will tend to produce less smooth trajectories, and a slightly more complex covariance function.

Figure 1 illustrates one typical sample obtained under each Model, with 3 randomly chosen trajectories highlighted in each sample. The filled circles correspond to the observations.
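A Model-1-style sparse sample as in (12) can be generated as in the following numpy sketch. It is ours, not the authors' simulation code: the support {2, …, 5} for the number of points per curve and the Fourier period (π t / 10) are assumptions, since the corresponding digits are not fully legible in the setting above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, q = 100, 2
lam = np.array([4.0, 1.0])
mu = lambda t: t + np.sin(t)
phi = [lambda t: -np.cos(np.pi * t / 10) / np.sqrt(5),
       lambda t: np.sin(np.pi * t / 10) / np.sqrt(5)]

# jittered grid of candidate design points on [0, 10]
c = np.linspace(0.0, 10.0, 51)
s = np.clip(c + rng.normal(0.0, 0.1, size=51), 0.0, 10.0)

data = []  # one (t_i, X_i) pair per trajectory
for i in range(N):
    n_i = int(rng.integers(2, 6))                    # sparse: 2 to 5 points (assumed)
    t_i = np.sort(rng.choice(s[1:], size=n_i, replace=False))
    z = rng.standard_normal(q)
    x_i = mu(t_i) + sum(np.sqrt(lam[j]) * z[j] * phi[j](t_i) for j in range(q))
    data.append((t_i, x_i))
```

Each entry of `data` plays the role of one sparsely observed trajectory (t_{i1}, …, t_{i n_i}; X_{i1}, …, X_{i n_i}) in model (5).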
Figure 1:
One typical sample from each Model, with 3 randomly chosen trajectories highlighted. Filled circles correspond to the available observations.
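The score predictor (11) and the implied curve reconstruction can be sketched as follows. This is a minimal numpy sketch of ours (the function and argument names are not from the paper, and δ is a small assumed ridge constant): `mu_hat`, `gamma_hat` and the `phi_hat` entries stand for the fitted mean, covariance and eigenfunctions from Steps 1–3.

```python
import numpy as np

def predict_scores(x_i, t_i, mu_hat, gamma_hat, phi_hat, lam_hat, delta=1e-6):
    """Predicted scores (11): lam_k * phi_ik' (Sigma_i + delta I)^{-1} (x_i - mu_i).

    mu_hat(t)      : vectorized estimated mean function
    gamma_hat(s,t) : estimated covariance function (broadcasting over grids)
    phi_hat        : list of estimated eigenfunctions, lam_hat their eigenvalues
    """
    t_i = np.asarray(t_i, dtype=float)
    Sigma_i = gamma_hat(t_i[:, None], t_i[None, :])
    A = Sigma_i + delta * np.eye(len(t_i))        # small ridge for stability
    w = np.linalg.solve(A, np.asarray(x_i, dtype=float) - mu_hat(t_i))
    return np.array([lam_hat[k] * phi_hat[k](t_i) @ w for k in range(len(lam_hat))])
```

The reconstructed curve is then X̂_i(t) = μ̂(t) + Σ_k ξ̂_{i,k} φ̂_k(t) evaluated on any grid of interest.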
Atypical observations were introduced so as to alter the order among the principal directions, so that the estimated eigenfunctions and eigenvalues might be affected. Outliers were introduced using Bernoulli random variables B_i ∼ B(1, ε), i = 1, …, N, with ε = 0.05 or 0.10, corresponding to 5% and 10% of outliers, respectively. When B_i = 0 the trajectory X_i is generated as in (12). When B_i = 1 the curve is contaminated as follows:

• Model 1: The score Z_{i,2} for the second direction is sampled from a normal distribution centred at 12 with a small variance.

• Model 2: The scores for the second and third directions are sampled from a bivariate normal distribution, (Z_{i,2}, Z_{i,3})^t ∼ N(μ_c, Σ_c), with a shifted mean μ_c and a diagonal covariance matrix Σ_c with small variances (of the order of 1/16).

Figure 2 shows how the introduced contamination modifies the pattern of the clean data when ε = 0.10. Black solid lines correspond to two randomly chosen uncontaminated observations, and dashed red lines indicate 3 randomly selected outlying trajectories.

Figure 2: Contaminated samples from Models 1 and 2 with ε = 0.10. The filled points correspond to the observed trajectories. Black solid lines correspond to two randomly chosen uncontaminated observations, and dashed red lines indicate 3 randomly selected outlying trajectories (color figures are available on the web version of this article).
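The Model-1 contamination mechanism above can be sketched as follows. This is our own illustration: the standard deviation 0.5 of the contaminated score is an assumption (only the centre, 12, is stated above), and the clean scores are simply standard normal as in (12).

```python
import numpy as np

rng = np.random.default_rng(2)
N, eps = 100, 0.10
B = rng.binomial(1, eps, size=N)      # contamination indicators B_i ~ B(1, eps)
Z = rng.standard_normal((N, 2))       # clean scores for Model 1 (q = 2)

# contaminated curves: shift the second score to a distant centre
n_out = int((B == 1).sum())
Z[B == 1, 1] = rng.normal(12.0, 0.5, size=n_out)   # sd 0.5 is an assumption
```

The contaminated scores are then plugged into (12) exactly as the clean ones, so the outliers are legitimate-looking curves rather than isolated extreme points.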
We considered three estimators: two based on non-resistant procedures, and the third being our proposal of Section 3. Of the two non-robust methods, one is PACE (Yao et al. [56]) and the other is the non-robust version of our method described in Section 3.3. All computations were done using R (R Core Team [46]). The PACE estimator was computed using the package fdapace (Chen et al. [12]). Our estimators were computed using R code in the package sparseFPCA, publicly available at https://github.com/msalibian/sparseFPCA.

The tuning parameters for the robust FPCA estimator were chosen as follows. To estimate μ(t) we used a Huber ρ-function ρ_0 with tuning constant 1.345, and took ρ_1 and ρ_2 among the ρ-functions in Tukey's biweight family. To estimate the diagonal elements of the scatter function γ(t, t) the tuning constants were c = 1.548 and b = 1/2. In Step 3, to obtain a more efficient estimator for β(t, s), we used a larger tuning constant; more specifically, we selected c = 3.44. The kernel was the Epanechnikov kernel K(u) = 0.75 (1 − u²) 1_{[−1,1]}(u). The two-dimensional smoother used in Step 3 was a thin plate regression spline, as described in Section 5.5.1 of Wood [55], which does not require selecting knot locations. The bandwidth parameters were selected using a robust 5-fold cross-validation (holding 20% of the curves for each fold). The cross-validation criterion for the robust estimator was based on a 50% breakdown point M-scale of the residuals with respect to the fits in equations (6) and (9), while for the other two approaches we used the mean squared prediction error.

We chose the number K of principal directions to estimate with the goal of explaining at least 90% of the total variability. This corresponds to K = 2 for both Models above.

To compare the different approaches we look at the corresponding scatter operators Γ̂, their associated eigenvalues and eigenfunctions, and the predicted scores. We report summaries of their performance computed over 500 samples of N = 100 trajectories each, generated under each Model using clean and contaminated data.

To quantify the discrepancy between the estimated Γ̂ and the true Γ we use the following approximation to the norm of their difference: we evaluate both operators over a fixed grid of 50 equidistant points in I, and calculate the squared Frobenius norm of the difference matrix. This is a discrete version of the corresponding norm for Hilbert–Schmidt operators Υ, given by ‖Υ‖²_F = trace[Υ*Υ] = Σ_{j=1}^∞ ‖Υ u_j‖², with {u_j : j ≥ 1} any orthonormal basis of L²(I).

To compare the different eigenvalue estimators we consider two loss measures:

    (1/500) Σ_{ℓ=1}^{500} { log( λ̂_{ℓ,k} / λ_k ) }²    and    (1/500) Σ_{ℓ=1}^{500} { ( λ̂_{ℓ,k} / λ_k ) − 1 }²,

where λ̂_{ℓ,k} is the estimated k-th eigenvalue obtained with the ℓ-th sample.

The accuracy of the eigenfunction estimators was measured using the cosine of the angle between the estimated and true eigenfunctions. Values close to 1 correspond to a small angle and thus to a well estimated eigenfunction, whereas values close to zero indicate estimated eigenfunctions that are close to being orthogonal to the true ones.
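The scatter-operator discrepancy and the two eigenvalue losses described above can be sketched as follows (a minimal numpy sketch; the function names are ours):

```python
import numpy as np

def frobenius_discrepancy(gamma_hat, gamma_true, grid):
    """Squared Frobenius norm of the difference of two covariance kernels
    evaluated on a fixed grid (a discrete Hilbert-Schmidt norm)."""
    G1 = gamma_hat(grid[:, None], grid[None, :])
    G2 = gamma_true(grid[:, None], grid[None, :])
    return float(np.sum((G1 - G2) ** 2))

def eigenvalue_losses(lam_hat, lam_true):
    """Per-sample eigenvalue losses: squared log-ratio and squared relative error."""
    r = np.asarray(lam_hat, dtype=float) / np.asarray(lam_true, dtype=float)
    return np.log(r) ** 2, (r - 1.0) ** 2
```

Averaging the two returned loss arrays over the 500 Monte Carlo replications gives the summaries reported in the tables.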
We report

    (1/500) Σ_{ℓ=1}^{500} |⟨φ̂_{ℓ,k}, φ_k⟩|,

where φ̂_{ℓ,k} and φ_k are the estimated k-th eigenfunction computed with the ℓ-th sample and the true k-th eigenfunction, respectively; since the eigenfunctions have unit norm, |⟨φ̂_{ℓ,k}, φ_k⟩| is the cosine of the angle between them.

Finally, to measure the discrepancy between the true and predicted scores, for each k we calculate the average of the squared score distances over the observations in each sample:

    MSE_ℓ(ξ_k) = (1/N) Σ_{i=1}^N ( ξ̂^{(ℓ)}_{i,k} − ξ^{(ℓ)}_{i,k} )²,    ℓ = 1, …, 500,

where ξ^{(ℓ)}_{i,k} and ξ̂^{(ℓ)}_{i,k} denote the true and the predicted k-th scores for the i-th trajectory of the ℓ-th sample, 1 ≤ k ≤ K. For each k = 1, 2, we report the mean and median of MSE_ℓ(ξ_k) over the ℓ = 1, …, 500 replications.

Note that, for a fixed k, the predicted scores ξ̂_{i,k} defined in (11) may be multiplied by −1 when the k-th estimated eigenfunction is multiplied by −1, without changing the predicted trajectory. For that reason, in order to ensure that both the true and predicted scores have the same orientation, for each replication ℓ and each k we compute the Spearman correlation s^{(ℓ)}_k between ξ̂^{(ℓ)}_{i,k} and ξ^{(ℓ)}_{i,k}, 1 ≤ i ≤ N. When s^{(ℓ)}_k < 0, we multiply the predicted scores ξ̂^{(ℓ)}_{i,k}, 1 ≤ i ≤ N, by −1.

In what follows, the method proposed by Yao et al. [56] will be denoted PACE, our robust method will be labelled ROB, and our non-robust variant (Section 3.3) LS. Results for experiments without outliers will be indicated with the label C_0, while those corresponding to the contamination schemes described in Section 4.1 will be indicated with C_{0.05} and C_{0.10} when ε = 0.05 and 0.10, respectively.

Taking into account that the squared operator norm and the mean score distances MSE_ℓ(ξ_k) are non-negative and expected to have skewed distributions, we use skewness-adjusted boxplots (Hubert and Vandervieren [33]) to display our results. These can be found in Figures 3, 4 and 6.

As expected, when no outliers are present all procedures are comparable, with the robust procedure performing slightly worse than the non-robust methods when estimating the scatter operator, in particular under Model 1 (Table 1). Note that for both Models the non-robust version of our approach (LS) shows the best performance. Figures 3 and 4 confirm this observation. The stability and advantage of ROB over PACE and LS when outliers are present in the data can be seen in Table 1 and in Figures 3 and 4.

Table 2 contains the average of the cosines of the angles between the estimated and true eigenfunctions. Here again LS has the best performance for clean data, although the differences between the three estimators are small. For contaminated data sets (even with only 5% of outliers) the estimated eigenfunctions obtained with PACE and LS are heavily affected by the atypical observations, while the robust estimator remains stable.

Tables 3 and 4 show the results for the eigenvalue estimators. Note that under Model 1, even without contamination, the eigenvalue estimators show some bias (this can be observed in Figure 5). When the samples are contaminated, the robust eigenvalue estimators perform better than the non-robust ones. The results for Model 2 are as one would expect: very little difference when no outliers are present in the data, and a noticeable advantage of the robust estimator in the contaminated settings.

With clean samples, the predicted scores of all estimators behave very similarly to each other (Tables 5 and 6). Note that the larger values of the mean and median of MSE(ξ_k) under Model 1 may be explained by the fact that Var(ξ_{ik}) = λ_k, and the eigenvalues of Model 1 are larger than those of Model 2. When no outliers are present the LS estimator performs best.

Noting that score estimates can be highly influenced by atypical trajectories, to evaluate the fit over the uncontaminated curves we also considered the average over replications, M_0(ξ_k), of the following quantity:

    M_{0,ℓ}(ξ_k) = [ 1 / Σ_{i=1}^N (1 − B_i) ] Σ_{i=1}^N (1 − B_i) ( ξ̂^{(ℓ)}_{i,k} − ξ^{(ℓ)}_{i,k} )².

We expect a good procedure to fit most of the uncontaminated data well, resulting in small values of M_{0,ℓ}(ξ_k).
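The orientation alignment of the predicted scores described above can be sketched as follows (a numpy-only sketch of ours; the rank computation ignores ties, which is enough for continuous scores):

```python
import numpy as np

def _ranks(v):
    # ranks of the entries of v (ties are ignored: fine for continuous scores)
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(len(v))
    return r

def align_scores(pred, true):
    """Flip the predicted scores when their Spearman correlation with the
    true scores is negative (eigenfunctions are only defined up to sign)."""
    pred = np.asarray(pred, dtype=float)
    s = np.corrcoef(_ranks(pred), _ranks(np.asarray(true, dtype=float)))[0, 1]
    return -pred if s < 0 else pred
```

After this flip, MSE_ℓ(ξ_k) and M_{0,ℓ}(ξ_k) compare scores with a common orientation.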
Note that, for clean samples, M_{0,ℓ}(ξ_k) equals MSE_ℓ(ξ_k). Figure 7 and Table 7 show that the classical procedures result in a compromise between outlying and non-outlying trajectories, leading to bad predictions for the uncontaminated data. Under the contamination settings in this study, the robust estimator has the best performance overall. The mean squared error of the scores for the robust approach is the smallest, sometimes by a factor larger than 6 (see Table 5). Table 7 reveals that the robust procedure also provides better fits to the non-contaminated samples. For ε = 0.05, the values of M_0 for the robust method are similar to those obtained under C_0, while for ε = 0.10 they increase by a factor of less than 2.5. The values of M_0 for PACE and LS, when the data contain outliers, can be more than 10 times higher than their values under C_0.

Table 1: Summaries of ‖Γ̂ − Γ‖²_F over 500 samples.

Table 2: Average of |⟨φ̂_{ℓ,k}, φ_k⟩| over ℓ = 1, …, 500, k = 1, 2.

Table 3: Average of (log(λ̂_{ℓ,k}/λ_k))² over ℓ = 1, …, 500, k = 1, 2.

Table 4: Average of ((λ̂_{ℓ,k}/λ_k) − 1)² over ℓ = 1, …, 500, k = 1, 2.

Table 5: Average of MSE_ℓ(ξ_k) over ℓ = 1, …, 500, k = 1, 2.

Table 6: Median of MSE_ℓ(ξ_k) over ℓ = 1, …, 500, k = 1, 2.

Table 7: Summary measure M_0 for the scores.

Figure 3: Adjusted boxplots of ‖Γ̂ − Γ‖²_F under Model 1.

Figure 4: Adjusted boxplots of ‖Γ̂ − Γ‖²_F under Model 2.

Figure 5: Boxplots of λ̂_j, for j = 1, 2. The horizontal lines indicate the values of the true eigenvalues λ_j.

Figure 6: Adjusted boxplots of the mean score distances MSE_ℓ(ξ_k), k = 1, 2.

Figure 7: Adjusted boxplots of M_{0,ℓ}(ξ_k) for k = 1, 2.

A real data example
We illustrate our method on the CD4 data, which is part of the Multicentre AIDSCohort Study (Zeger and Diggle [57]). The data consists of 2376 measurementsof CD4 cell counts, taken on 369 men. The times are measured in years sinceseroconversion ( t = 0). The whole data set is available from the catdata packagefor R (Schauberger and Tutz [49]). To ensure that there are enough observationsto estimate the covariance function at every pair of times ( s, t ), we focus on theobservations with t ≥
0, and on individuals with more than one measurement. Thisresults in N = 292 curves, with the number of observations per individual rangingbetween 2 and 11 (with a median of 5). The data set is shown in Figure 8, withthree randomly chosen trajectories highlighted with solid black lines.(a) Data (b) Most outlying curves t X ( t ) l l ll l l l ll l l l l l ll ll ll l l l l l l l l ll l l l l l l l ll l l l ll l l l l l l l ll l ll l ll l l l l ll l l l ll l l l l l l l l ll l l l l l ll l l l l l ll ll l l l l ll ll l l ll ll l l l l ll l l l ll l l l ll l l l l l ll l ll l l l l l l ll l l l l l l l l l ll ll l ll l l l l ll l l l l l ll ll l l l ll l l l l l l ll l l l l l ll l l l l ll ll l l ll l l l l l l l l ll l l l l l ll l ll l l l l ll l l l l ll ll l l l l l l ll l l l l ll l l l ll l ll l l l l l l ll ll l l l l ll ll l l l l ll l ll l l l ll ll l l l l l ll l l l l ll l l ll l l l l ll l l l l l ll l l l l l ll l l ll ll l ll ll l l l l ll l l l l l l l ll l l l l l l l l l ll l ll l l l l ll l l ll l l l l ll ll l l l l l ll ll l l l ll l ll ll l l l l ll l ll l l ll l l l ll l l l ll l l l l l l l l ll l l ll l l l l l l l l ll l l ll ll l l l l l l l l ll l ll l l l l l l l l ll l l l l l l l l ll l l l l l l l l ll l l l l l l ll l ll l l l l ll l l l l ll l l l l l l ll ll l l l ll l l l l l l l l l ll l l l l l l l l l ll l l ll l l l l l l l l ll ll l l l l ll l l l l l ll l ll l l l ll l l l ll l ll l l l ll l l ll l l l l l ll ll l l l l l l l ll l ll l l l l l l ll l l l ll l ll ll l l l l ll l l l l l ll l l l ll l l l ll l l l l l ll l l l l l l l ll l l l ll l l l l l l ll l l l l l l l l ll l l ll l l l ll l l l l l ll l l ll l l l l ll l l l l l l l ll l l l l l l ll l l l ll l l ll ll l l l l ll l l l l l l ll l l ll l l l l l l ll l l l l l ll l l l l ll l l l l ll l l l l l l ll ll ll l l l l l l l ll ll l l ll l l l ll l ll l l l l ll l l l l ll l l l l ll ll ll l l l ll l ll l l l l l l l ll l l l l l l ll ll l ll ll l l l l l l ll ll 
l l ll l l l l l ll l ll ll ll l l l l ll l l ll ll l l l l ll l l ll l l l ll l l l ll l l l ll l ll ll l l ll l l l ll l l l l ll ll l l l l l l ll l ll l ll ll l l l l ll l l l l l l l ll l l ll l l l l l l ll ll l ll l l l l l ll l l l l ll l ll l l l l l l l l ll l ll l ll l l l l ll l l l l l l ll l ll ll l ll l ll l l ll l l l ll l l l l l l l l ll l l l ll ll l l ll ll l l l l ll l l l l l ll l l l ll l l l ll l l l l l l ll l l l l l l ll l l ll l ll l ll l l l ll l l l l ll l ll l ll l ll l l ll l l l l ll l ll l ll l ll l l l ll l ll l l l l l l l l ll l l l ll l ll l ll l l l l ll l l ll l l ll ll l l ll ll ll l l l l ll ll l l l l ll l l ll l l ll l l l l l l ll l ll l l ll l l l l l ll l l l ll l l l ll l l ll l ll l l l ll l l l l l l ll ll l ll l l l l l l l l ll l ll l l l l ll l ll l l ll l l l ll l l l l ll l l l ll l l l l l l ll l ll l l l ll l l l ll l l l l l ll l l ll l l l l ll l l l ll ll ll l l l l l l ll l l l ll ll l l l ll l l l l l l l l l t X ( t ) l l ll l l l ll l l l l l ll ll ll l l l l l l l l ll l l l l l l l ll l l l ll l l l l l l l ll l ll l ll l l l l ll l l l ll l l l l l l l l ll l l l l l ll l l l l l ll ll l l l l ll ll l l ll ll l l l l ll l l l ll l l l ll l l l l l ll l ll l l l l l l ll l l l l l l l l l ll ll l ll l l l l ll l l l l l ll ll l l l ll l l l l l l ll l l l l l ll l l l l ll ll l l ll l l l l l l l l ll l l l l l ll l ll l l l l ll l l l l ll ll l l l l l l ll l l l l ll l l l ll l ll l l l l l l ll ll l l l l ll ll l l l l ll l ll l l l ll ll l l l l l ll l l l l ll l l ll l l l l ll l l l l l ll l l l l l ll l l ll ll l ll ll l l l l ll l l l l l l l ll l l l l l l l l l ll l ll l l l l ll l l ll l l l l ll ll l l l l l ll ll l l l ll l ll ll l l l l ll l ll l l ll l l l ll l l l ll l l l l l l l l ll l l ll l l l l l l l l ll l l ll ll l l l l l l l l ll l ll l l l l l l l l ll l l l l l l l l ll l l l l l l l l ll l l l l l l ll l ll l l l l ll l l l l ll l l l l l l ll ll l l l ll l l l l l l l l 
l ll l l l l l l l l l ll l l ll l l l l l l l l ll ll l l l l ll l l l l l ll l ll l l l ll l l l ll l ll l l l ll l l ll l l l l l ll ll l l l l l l l ll l ll l l l l l l ll l l l ll l ll ll l l l l ll l l l l l ll l l l ll l l l ll l l l l l ll l l l l l l l ll l l l ll l l l l l l ll l l l l l l l l ll l l ll l l l ll l l l l l ll l l ll l l l l ll l l l l l l l ll l l l l l l ll l l l ll l l ll ll l l l l ll l l l l l l ll l l ll l l l l l l ll l l l l l ll l l l l ll l l l l ll l l l l l l ll ll ll l l l l l l l ll ll l l ll l l l ll l ll l l l l ll l l l l ll l l l l ll ll ll l l l ll l ll l l l l l l l ll l l l l l l ll ll l ll ll l l l l l l ll ll l l ll l l l l l ll l ll ll ll l l l l ll l l llll l l l l ll l l ll l l l ll l l l ll l l l ll l ll ll l l ll l l l ll l l l l ll ll l l l l l l ll l ll l ll ll l l l l ll l l l l l l l ll l l ll l l l l l l ll ll l ll l l l l l ll l l l l ll l ll l l l l l l l l ll l ll l ll l l l l ll l l l l l l ll l ll ll l ll l ll l l ll l l l ll l l l l l l l l ll l l l ll ll l l ll ll l l l l ll l l l l l ll l l l ll l l l ll l l l l l l ll l l l l l l ll l l ll l ll l ll l l l ll l l l l ll l ll l ll l ll l l ll l l l l ll l ll l ll l ll l l l ll l ll l l l l l l l l ll l l l ll l ll l ll l l l l ll l l ll l l ll ll l l ll ll ll l l l l ll ll l l l l ll l l ll l l ll l l l l l l ll l ll l l ll l l l l l ll l l l ll l l l ll l l ll l ll l l l ll l l l l l l ll ll l ll l l l l l l l l ll l ll l l l l ll l ll l l ll l l l ll l l l l ll l l l ll l l l l l l ll l ll l l l ll l l l ll l l l l l ll l l ll l l l l ll l l l ll ll ll l l l l l l ll l l l ll l l l l ll l l l ll l l ll l l l l l l ll l l l Figure 8: CD4 counts for N = 292 patients where t ≥
0. Three randomly chosen curves are highlighted with solid black lines in the left panel, while the right panel shows the five most outlying trajectories.

We compare the covariance function estimates obtained with our robust FPCA method of Section 3.2 (ROB), the corresponding non-robust variant of Section 3.3 (LS), and PACE (Yao et al. [56]). The tuning parameters and bandwidths for ROB and LS were chosen as in the simulation study (Section 4.3) using 10-fold cross-validation, while the smoothing parameters for PACE were set as described in Wang et al. [53]. The first two principal components estimated with ROB and LS account for most of the total variability, while for PACE they reach over 94%. Figures 9 and 10 show the estimated covariance functions and eigenfunctions, respectively.

Figure 9: Estimated covariance functions for the three estimators: (a) ROB, (b) LS, (c) PACE.

Although the estimated first principal directions are similar for the three methods, the PACE estimate of the second eigenfunction is notably different. To explore the possibility that this difference is due to the effect of a few potentially atypical observations, we identify outliers using the estimated scores $(\hat{\xi}_{i1}, \hat{\xi}_{i2})$: we compute robust Mahalanobis distances $D_i$ based on an MM-estimator of location and scatter, and flag as outliers those score vectors with $D_i$ larger than the 99.5% quantile of a $\chi^2_2$ distribution. This resulted in 18 trajectories being identified as potentially atypical. The five most outlying ones are shown in the right panel of Figure 8. Note that these curves appear either to decrease too rapidly (with respect to the rest) or to remain at high values over time. The other outlying curves also show one of these two main patterns.
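The flagging rule above can be sketched as follows. This is an illustrative Python sketch only (the authors' own implementation is in R, linked in the text): for simplicity, a reweighted median/MAD-based estimate of location and scatter stands in for the MM-estimator, and we use the fact that the 99.5% quantile of a $\chi^2_2$ distribution has the closed form $-2\log(0.005)$.

```python
import numpy as np

def flag_score_outliers(scores, quantile=0.995):
    """Flag rows of an n x 2 score matrix whose robust squared Mahalanobis
    distance exceeds the chi^2_2 quantile.  A simple reweighted median/MAD
    estimator is used here as a stand-in for the MM-estimator in the paper."""
    scores = np.asarray(scores, dtype=float)
    # Initial robust location/scale: coordinatewise median and MAD.
    loc = np.median(scores, axis=0)
    mad = 1.4826 * np.median(np.abs(scores - loc), axis=0)
    z = (scores - loc) / mad
    # Down-weight points far away in the initial metric, then recompute
    # a weighted mean and covariance from the central 80% of the data.
    d0 = np.sum(z ** 2, axis=1)
    w = (d0 <= np.quantile(d0, 0.8)).astype(float)
    loc = np.average(scores, axis=0, weights=w)
    resid = scores - loc
    cov = (resid * w[:, None]).T @ resid / w.sum()
    # Squared Mahalanobis distances and the chi^2_2 cutoff: for 2 degrees
    # of freedom the quantile q satisfies cutoff = -2 * log(1 - q).
    d2 = np.sum((resid @ np.linalg.inv(cov)) * resid, axis=1)
    cutoff = -2.0 * np.log(1.0 - quantile)
    return d2 > cutoff
```

With a proper MM-estimator of location and scatter the flagged set may differ slightly from this sketch, but the structure of the rule (robust distances compared with a fixed chi-squared quantile) is the same.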
More details about this analysis, including documented R code reproducing it, are publicly available on-line at https://github.com/msalibian/sparseFPCA. We now re-compute the non-robust estimators (LS and PACE) after removing the outliers. It is interesting to note that on this "clean" data set, the PACE and ROB estimators are qualitatively similar (see Figures 11 and 12).

We further compare these fits in terms of their prediction ability by randomly splitting the data into a training set and a test set (with 80% and 20% of the curves, respectively). We estimate the mean and covariance functions using the training set, and use them to obtain predicted curves for the 20% of curves held out in the test set. Figure 13 below displays four curves in the test set (shown in gray), together with the predicted trajectories.

Figure 10: Estimated eigenfunctions associated with the two largest eigenvalues of the corresponding covariance function estimates: (a) first eigenfunctions, (b) second eigenfunctions. Robust estimates (ROB) are displayed with solid lines, their non-robust alternatives (LS) with dashed lines, and PACE with dotted curves.

Figure 11: Estimated covariance functions with LS and PACE after removing trajectories that were flagged as possible outliers, together with the robust covariance function estimated using all the data: (a) ROB, (b) LS, (c) PACE.

Figure 12: Estimated eigenfunctions after removing trajectories that were flagged as possible outliers: (a) first eigenfunctions, (b) second eigenfunctions. The robust estimates (ROB) are displayed with solid lines, the non-robust version (LS) with dashed lines, and PACE with dotted curves.
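The prediction step described above relies on the linearity of the conditional location parameter for elliptically distributed vectors: given the sparse noisy observations of a curve, the predicted trajectory is the linear function of those observations determined by the estimated mean and covariance. A minimal Python sketch of this linear predictor, assuming the mean function `mu`, covariance kernel `gamma` and noise variance `sigma2` have already been estimated (all names here are illustrative, not taken from the authors' code):

```python
import numpy as np

def predict_curve(t_obs, y_obs, t_grid, mu, gamma, sigma2):
    """Predict X(t) on t_grid from sparse noisy observations
    y_obs = X(t_obs) + error, via the linear conditional-location formula
        X_hat(t) = mu(t) + gamma(t, t_obs) [Gamma_obs + sigma2 I]^{-1} (y_obs - mu(t_obs)).
    `mu` is a vectorized mean function and `gamma(s, t)` a covariance kernel."""
    t_obs = np.asarray(t_obs, dtype=float)
    t_grid = np.asarray(t_grid, dtype=float)
    # Covariance matrix of the observed vector, plus measurement-error variance.
    G_obs = gamma(t_obs[:, None], t_obs[None, :]) + sigma2 * np.eye(len(t_obs))
    # Cross-covariances between prediction points and observation times.
    G_cross = gamma(t_grid[:, None], t_obs[None, :])
    w = np.linalg.solve(G_obs, np.asarray(y_obs, dtype=float) - mu(t_obs))
    return mu(t_grid) + G_cross @ w
```

With sigma2 = 0 the predictor interpolates the observations exactly; with sigma2 > 0 it shrinks towards the mean function, which is the behaviour one wants for noisy CD4 counts.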
Figure 13: Predicted trajectories for four curves in the test set (panels (a)-(d)). The predictions based on the ROB estimator are shown with solid lines, the non-robust ones based on LS use dashed lines, while those based on PACE are shown with dotted lines.

Conclusions

We propose a novel robust functional principal components analysis (FPCA) method that is appropriate for applications in which only a few observations per unit or trajectory are available. Such longitudinal data sets with few points per curve (possibly recorded at irregular intervals) are relatively common in applications, and it is often natural to assume an underlying functional structure (of smooth trajectories, for example). Although our motivation was the development of a robust FPCA approach, the proposed method can easily be extended to obtain an alternative to PACE (Yao et al. [56]) for cases where no atypical observations are present in the data. Our method only assumes that the underlying random process has an elliptical distribution, and thus it is applicable even with heavy-tailed data (for example, without finite moments). Our simulation studies confirm that the robust version of this approach remains informative when the data contain outliers (which need not be extreme values), and behaves similarly to the non-robust alternatives when the data are clean. Furthermore, in both our numerical experiments and the example, the non-robust variant of our proposal compares favourably to the existing methods in the literature. Our methodology could, in principle, also be used to derive robust estimators for functional principal components regression models (see Febrero-Bande et al. [17] for a recent review) when only a few observations per curve are available. We will explore this idea in future work.
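The heavy-tailed elliptical setting mentioned above is easy to simulate through the representation X = µ + S V used in the Appendix. The following Python sketch (illustrative only; the kernel and the choice S = 1/|Z| are ours, not from the paper) draws Cauchy-type curves that have no finite mean, yet remain elliptical so that robust location and scatter functionals stay well defined:

```python
import numpy as np

def sample_elliptical_paths(n, t_grid, rng, heavy=True):
    """Draw n curves X = S * V evaluated at the points in t_grid, where V is
    a zero-mean Gaussian process with kernel exp(-|s-t|) and S >= 0 is a
    scalar random variable independent of V.  With S = 1/|Z|, Z standard
    normal, the marginals are Cauchy-type, so the process has no finite mean."""
    t = np.asarray(t_grid, dtype=float)
    K = np.exp(-np.abs(t[:, None] - t[None, :]))        # Gaussian-process kernel
    L = np.linalg.cholesky(K + 1e-10 * np.eye(len(t)))  # jitter for stability
    V = rng.standard_normal((n, len(t))) @ L.T          # Gaussian paths on t_grid
    S = 1.0 / np.abs(rng.standard_normal(n)) if heavy else np.ones(n)
    return S[:, None] * V
```

Even with such data, coordinatewise medians of the sampled paths stay near the true (zero) location, whereas sample means are unstable; this is exactly the situation in which the robust proposal is designed to remain informative.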
Acknowledgements
The authors wish to thank two anonymous referees for valuable comments which led to an improved version of the original paper. They also thank Prof. Jane-Ling Wang for helpful discussions during the early stages of our work. This research was partially supported by Grant 20020170100022BA from the Universidad de Buenos Aires and by a PICT grant from ANPCYT, Buenos Aires, Argentina, as well as by the Spanish Project MTM2016-76969P from the Ministry of Economy, Industry and Competitiveness (MINECO/AEI/FEDER, UE) (Graciela Boente), and by Discovery Grant RGPIN-2016-04288 of the Natural Sciences and Engineering Research Council of Canada (M. Salibián-Barrera).
Appendix

Although the proof of Proposition 3.1 is immediate when $\Gamma$ has finite rank, we include it here for completeness. The infinite-dimensional case is proved using the representation $X \sim \mu + S\, V$ for elliptical random elements (Boente et al. [7]), where $S \ge 0$ is a non-negative random variable independent of the zero-mean Gaussian random element $V$.

Proof of Proposition 3.1.
First note that it is sufficient to prove the result when $\mu = 0$. Specifically, let $\widetilde{X} = X - \mu \sim \mathcal{E}(0, \Gamma, \varphi)$, $\mathbf{X}_m = (X(t_1), \dots, X(t_m))^{t}$ and $\widetilde{\mathbf{X}}_m = (\widetilde{X}(t_1), \dots, \widetilde{X}(t_m))^{t} = \mathbf{X}_m - (\mu(t_1), \dots, \mu(t_m))^{t}$. It will then follow that $\widetilde{\mathbf{X}}_m$ has a multivariate elliptical distribution with location vector $\mathbf{0}$, and hence that $\mathbf{X}_m$ has a multivariate elliptical distribution with location $(\mu(t_1), \dots, \mu(t_m))^{t}$ and the same scatter matrix as $\widetilde{\mathbf{X}}_m$.

In what follows assume that $\mu = 0$. Consider first the case where there are only finitely many non-zero eigenvalues, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_q > 0$ and $\lambda_\ell = 0$ for $\ell > q$, and let $\phi_1, \dots, \phi_q$ be the corresponding eigenfunctions. Define the operator $A : L^2(\mathcal{I}) \to \mathbb{R}^q$ as $Af = (\langle f, \phi_1 \rangle, \dots, \langle f, \phi_q \rangle)^{t}$. This operator is linear, and also bounded since, by the Cauchy-Schwarz inequality, we have
$$ \| Af \|_{\mathbb{R}^q}^2 = \sum_{i=1}^{q} \left( \int f(t)\, \phi_i(t)\, dt \right)^2 \le \sum_{i=1}^{q} \left( \| f \|_{L^2}\, \| \phi_i \|_{L^2} \right)^2 = q\, \| f \|_{L^2}^2 \,. $$
Using that $A$ is a linear and bounded operator and the definition of elliptical random elements in $L^2(\mathcal{I})$, we have that $AX$ is an elliptical random vector. Furthermore, using (2), we have $AX = (\langle X, \phi_1 \rangle, \dots, \langle X, \phi_q \rangle)^{t} = (\xi_1, \dots, \xi_q)^{t}$. Thus, $\boldsymbol{\xi} = (\xi_1, \dots, \xi_q)^{t}$ is an elliptical random vector $\mathcal{E}_q(\mathbf{0}_q, A\Gamma A^{*}, \varphi)$. It is easy to see that $A\Gamma A^{*} = \mathrm{diag}(\lambda_1, \dots, \lambda_q)$, since $A^{*}\mathbf{u} = \sum_{j=1}^{q} u_j \phi_j$ for any $\mathbf{u} \in \mathbb{R}^q$. Hence $\boldsymbol{\xi} \sim \mathcal{E}_q(\mathbf{0}_q, \Lambda_q, \varphi)$, where $\Lambda_q = \mathrm{diag}(\lambda_1, \dots, \lambda_q)$. Moreover, noticing that for any $s \ge 1$ and $\ell_1 \le \ell_2 \le \cdots \le \ell_s$ such that $\ell_1 > q$, $(\langle X, \phi_{\ell_1} \rangle, \dots$
, $\langle X, \phi_{\ell_s} \rangle)^{t}$ also has an elliptical distribution with location zero and null scatter matrix, that is, it equals $\mathbf{0}_s$ with probability one, we get that $X = \sum_{j=1}^{q} \xi_j \phi_j$ almost surely. Let $\mathbf{B} \in \mathbb{R}^{m \times q}$ with $(i,j)$-th entry $\mathbf{B}_{ij} = \phi_j(t_i)$, $1 \le i \le m$, $1 \le j \le q$. Then,
$$ \mathbf{X}_m = (X(t_1), \dots, X(t_m))^{t} = \mathbf{B}\, \boldsymbol{\xi} \sim \mathcal{E}_m(\mathbf{0}_m, \mathbf{B}\, \Lambda_q\, \mathbf{B}^{t}, \varphi) \,. $$
Finally, note that the $(s,\ell)$ element of $\mathbf{B}\, \Lambda_q\, \mathbf{B}^{t}$ equals $\sum_{j=1}^{q} \lambda_j \phi_j(t_s) \phi_j(t_\ell) = \gamma(t_s, t_\ell)$, which concludes the proof of (a) for finite rank processes.

If infinitely many $\lambda_i$'s are non-zero, then using Proposition 2.1 in Boente et al. [7] we have $X = S\, V$, where $S \ge 0$ is independent of the Gaussian process $V$. Hence $\mathbf{X}_m = (X(t_1), \dots, X(t_m))^{t} = S\, \mathbf{V}_m$, where $\mathbf{V}_m = (V(t_1), \dots, V(t_m))^{t}$. Note that the covariance operator of $V$ is a scalar multiple of $\Gamma$; without loss of generality, we may assume that $\Gamma_V = \Gamma$. The fact that $V$ is a Gaussian process with continuous covariance kernel $\gamma(s,t)$ entails that $\mathbf{V}_m$ is an $m$-variate normally distributed vector. Effectively, Theorem 1.5 in Bosq [8] implies that
$$ \lim_{p \to \infty} \sup_{t \in \mathcal{I}} \mathbb{E} \left( V(t) - \sum_{\ell=1}^{p} \eta_\ell \phi_\ell(t) \right)^2 = 0 \,, $$
where the $\eta_\ell = \langle V, \phi_\ell \rangle \sim N(0, \lambda_\ell)$ are independent. Let $V_p(t) = \sum_{\ell=1}^{p} \eta_\ell \phi_\ell(t)$; then $\mathbf{V}_{p,m} = (V_p(t_1), \dots, V_p(t_m))^{t} \stackrel{D}{\longrightarrow} \mathbf{V}_m = (V(t_1), \dots, V(t_m))^{t}$ as $p \to \infty$. Note that $\mathbf{V}_{p,m}$ has a multivariate normal distribution $N_m(\mathbf{0}_m, \Sigma_{p,m})$ with $(s,\ell)$ element of $\Sigma_{p,m}$ equal to $\sum_{j=1}^{p} \lambda_j \phi_j(t_s) \phi_j(t_\ell)$. Then, taking into account that $\sum_{\ell \ge 1} \lambda_\ell \phi_\ell^2(t_i) = \gamma(t_i, t_i) < \infty$, we easily obtain that $\mathbf{V}_m \sim N(\mathbf{0}, \Sigma_{\mathbf{V}_m})$, where the $(s,\ell)$ element of $\Sigma_{\mathbf{V}_m}$ equals $\gamma(t_s, t_\ell)$. Hence, $\mathbf{X}_m$ has an elliptical distribution.
Taking into account that $\Gamma_V = \Gamma$, we obtain that $\mathbf{X}_m \sim \mathcal{E}_m(\mathbf{0}_m, \Sigma, \varphi)$, where the $(s,\ell)$-th element of $\Sigma$ equals $\gamma(t_s, t_\ell)$, concluding the proof of (a).

Part (b) follows immediately from (a).

To prove (c), note again that it is enough to obtain the result when $\mu = 0$. Let us first consider the finite-rank case, when $\lambda_\ell = 0$ for $\ell > q$. From (a), we have $\boldsymbol{\xi} = (\xi_1, \dots, \xi_q)^{t} \sim \mathcal{E}_q(\mathbf{0}_q, \mathrm{diag}(\lambda_1, \dots, \lambda_q), \varphi)$ and $\mathbf{X}_m = (X(t_1), \dots, X(t_m))^{t} = \mathbf{B}\, \boldsymbol{\xi}$. Hence,
$$ \mathbf{W} = (\xi_k, X(t_1), \dots, X(t_m))^{t} = \begin{pmatrix} \mathbf{e}_k^{t} \\ \mathbf{B} \end{pmatrix} \boldsymbol{\xi} = \mathbf{M}\, \boldsymbol{\xi} \,, $$
with $\mathbf{e}_k$ the $k$-th canonical vector in $\mathbb{R}^q$. Thus, $\mathbf{W} \sim \mathcal{E}_{m+1}(\mathbf{0}_{m+1}, \Sigma_{\mathbf{W}}, \varphi)$, where $\Sigma_{\mathbf{W}} = \mathbf{M}\, \mathrm{diag}(\lambda_1, \dots, \lambda_q)\, \mathbf{M}^{t}$ is given by
$$ \Sigma_{\mathbf{W}} = \begin{pmatrix} \lambda_k & \lambda_k \boldsymbol{\phi}_k^{t} \\ \lambda_k \boldsymbol{\phi}_k & \Sigma_{\mathbf{X}_m} \end{pmatrix} \qquad (13) $$
with $\Sigma_{\mathbf{X}_m}$ and $\boldsymbol{\phi}_k$ as in the statement of the Proposition. The conclusion now follows immediately from the properties of the conditional distribution of elliptical distributions; see, for instance, Corollary 8 in Frahm [21], where an expression for the characteristic generator is also given.

When $\Gamma$ does not have finite rank, as in the proof of (a), from Proposition 2.1 in Boente et al. [7] we conclude that $X = S\, V$, where $S \ge 0$ is independent of $V$ and the covariance operator of $V$ equals $\Gamma$. Then,
$$ \mathbf{W} = (\xi_k, X(t_1), \dots, X(t_m))^{t} = S\, (\eta_k, V(t_1), \dots, V(t_m))^{t} $$
with $\eta_k = \langle V, \phi_k \rangle \sim N(0, \lambda_k)$. Note that the fact that $\eta_k \sim N(0, \lambda_k)$ entails that $\xi_k = S\, \eta_k$ is finite almost surely. Moreover, with a similar argument as in the proof of part (a) and using the continuity of $\gamma$, it is easy to see that the random vector $(\eta_k, V(t_1), \dots, V(t_m))^{t}$ is normally distributed with zero mean and covariance matrix given in (13). Hence $\mathbf{W} \sim \mathcal{E}_{m+1}(\mathbf{0}_{m+1}, \Sigma_{\mathbf{W}}, \varphi)$ and the result follows again from the properties of multivariate elliptical distributions.

References

[1] Bali, J. L. and Boente, G. (2009).
Principal points and elliptical distributions from the multivariate setting to the functional case.
Statistics and Probability Letters, 1858-1865.

[2] Bali, J. L., Boente, G., Tyler, D. and Wang, J. L. (2011). Robust functional principal components: a projection-pursuit approach. Annals of Statistics, 2852-2882.

[3] Billor, N., Hadi, A. and Velleman, P. (2000). BACON: blocked adaptive computationally efficient outlier nominators. Computational Statistics & Data Analysis, 279-298.

[4] Boente, G. and Fraiman, R. (1989). Robust nonparametric regression estimation. Journal of Multivariate Analysis, 180-198.

[5] Boente, G., Rodriguez, D. and Sued, M. (2019). The spatial sign covariance operator: Asymptotic results and applications. Journal of Multivariate Analysis, 115-128.

[6] Boente, G. and Salibián-Barrera, M. (2015). S-estimators for functional principal component analysis. Journal of the American Statistical Association, 1100-1111.

[7] Boente, G., Salibián-Barrera, M. and Tyler, D. E. (2014). A characterization of elliptical distributions and some optimality properties of principal components for functional data. Journal of Multivariate Analysis, 254-264.

[8] Bosq, D. (2000). Linear Processes in Function Spaces. Springer, New York.

[9] Cardot, H., Cénac, P. and Zitt, P. (2013). Efficient and fast estimation of the geometric median in Hilbert spaces with an averaged stochastic gradient algorithm.
Bernoulli, 18-43.

[10] Cardot, H. and Godichon-Baggioni, A. (2017). Fast estimation of the median covariation matrix with application to online robust principal components analysis. TEST, 461-480.

[11] Cevallos-Valdiviezo, H. (2016). On Methods for Prediction Based on Complex Data with Missing Values and Robust Principal Component Analysis. PhD thesis, Ghent University (supervisors: Van Aelst, S. and Van den Poel, D.).

[12] Chen, Y., Carroll, C., Dai, X., Fan, J., Hadjipantelis, P.Z., Han, K., Ji, H., Müller, H.G. and Wang, J.L. (2020). fdapace: Functional Data Analysis and Empirical Dynamics. R package version 0.5.2. https://CRAN.R-project.org/package=fdapace.

[13] Cuevas, A. (2014). A partial overview of the theory of statistics with functional data.
Journal of Statistical Planning and Inference, 1-23.

[14] Cuevas, A., Febrero, M. and Fraiman, R. (2007). Robust estimation and classification for functional data via projection-based depth notions. Computational Statistics, 481-496.

[15] Febrero, M., Galeano, P. and González-Manteiga, W. (2007). A functional analysis of NOx levels: location and scale estimation and outlier detection. Computational Statistics, 411-427.

[16] Febrero, M., Galeano, P. and González-Manteiga, W. (2008). Outlier detection in functional data by depth measures, with application to identify abnormal NOx levels. Environmetrics, 331-345.

[17] Febrero-Bande, M., Galeano, P. and González-Manteiga, W. (2017). Functional principal component regression and functional partial least-squares regression: An overview and a comparative study. International Statistical Review, 61-83.

[18] Ferraty, F. and Romain, Y. (2010). The Oxford Handbook of Functional Data Analysis. Oxford University Press.

[19] Ferraty, F. and Vieu, P. (2006).
Nonparametric Functional Data Analysis: Theory and Practice. Springer Series in Statistics, Springer, New York.

[20] Fischl, M.A., Ribaudo, H.J., Collier, A.C., Erice, A., Giuliano, M., Dehlinger, M., Eron, J.J. Jr., Saag, M.S., Hammer, S.M., Vella, S., Morse, G.D., Feinberg, J.E., Demeter, L.M., Eshleman, S.H. and Adult AIDS Clinical Trials Group 384 Study Team (2003). A randomized trial of 2 different 4-drug antiretroviral regimens versus a 3-drug regimen, in advanced Human Immunodeficiency Virus disease.
Journal of Infectious Diseases, 625-634.

[21] Frahm, G. (2004). Generalized Elliptical Distributions: Theory and Applications. PhD thesis, University of Köln, Germany.

[22] Fraiman, R. and Muñiz, G. (2001). Trimmed means for functional data.
Test, 419-440.

[23] Gervini, D. (2008). Robust functional estimation using the median and spherical principal components. Biometrika, 587-600.

[24] Gervini, D. (2009). Detecting and handling outlying trajectories in irregularly sampled functional datasets. Annals of Applied Statistics, 1758-1775.

[25] Goia, A. and Vieu, P. (2016). An introduction to recent advances in high/infinite-dimensional statistics. Journal of Multivariate Analysis, 1-6.

[26] Hall, P. and Horowitz, J. L. (2007). Methodology and convergence rates for functional linear regression. Annals of Statistics, 70-91.

[27] Härdle, W. (1990). Applied Nonparametric Regression. Econometric Society Monographs, 19, Cambridge University Press, Cambridge.

[28] Härdle, W. and Tsybakov, B. (1988). Robust nonparametric regression with simultaneous scale curve estimation.
Annals of Statistics, 120-135.

[29] Horváth, L. and Kokoszka, P. (2012). Inference for Functional Data with Applications. Springer, New York.

[30] Hsing, T. and Eubank, R. (2015).
Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. Wiley, New York.

[31] Hubert, M., Rousseeuw, P. and Branden, K. (2005). ROBPCA: a new approach to robust principal component analysis.
Technometrics, 64-79.

[32] Hubert, M., Rousseeuw, P. and Segaert, P. (2015). Multivariate functional outlier detection. Statistical Methods and Applications, 177-202.

[33] Hubert, M. and Vandervieren, E. (2008). An adjusted boxplot for skewed distributions. Computational Statistics & Data Analysis, 5186-5201.

[34] Hyndman, R. and Shang, H. (2010). Rainbow plots, bagplots, and boxplots for functional data. Journal of Computational and Graphical Statistics, 29-45.

[35] Hyndman, R. J. and Ullah, S. (2007). Robust forecasting of mortality and fertility rates: A functional data approach. Computational Statistics and Data Analysis, 4942-4956.

[36] James, G., Hastie, T. and Sugar, C. (2000). Principal component models for sparse functional data. Biometrika, (3), 587-602.

[37] Kraus, D. and Panaretos, V. M. (2012). Dispersion operators and resistant second-order functional data analysis. Biometrika, 813-832.

[38] Lee, S., Shin, H. and Billor, N. (2013). M-type smoothing spline estimators for principal functions. Computational Statistics and Data Analysis, 89-100.

[39] Li, Y. and Hsing, T. (2010). Uniform convergence rates for nonparametric regression and principal component analysis in functional/longitudinal data. Annals of Statistics, 3321-3351.

[40] Locantore, N., Marron, J., Simpson, D., Tripoli, N., Zhang, J. and Cohen, K. (1999). Robust principal components for functional data. Test, 1-28.

[41] López-Pintado, S. and Romo, J. (2007). Depth-based inference for functional data. Computational Statistics & Data Analysis, 4957-4968.

[42] Maronna, R., Martin, R., Yohai, V. and Salibián-Barrera, M. (2019). Robust Statistics: Theory and Methods (with R). Wiley, New York.

[43] Maronna, R. (2019). Robust functional principal components for irregularly spaced longitudinal data. In press in
Statistical Papers. https://doi.org/10.1007/s00362-019-01147-2.

[44] Oh, H-S., Nychka, D.W. and Lee, T.C.M. (2007). The role of pseudo data for robust smoothing with applications to wavelet regression. Biometrika, 893-904.

[45] Park, J.G. and Wu, H. (2006). Backfitting and local likelihood methods for nonparametric mixed-effects models with longitudinal data. Journal of Statistical Planning and Inference, 3760-3782.

[46] R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

[47] Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis, 2nd edition. Springer, New York.

[48] Sawant, P., Billor, N. and Shin, H. (2012). Functional outlier detection with robust functional principal component analysis.
Computational Statistics, 83-102.

[49] Schauberger, G. and Tutz, G. (2020). catdata: Categorical Data. R package version 1.2.2. https://CRAN.R-project.org/package=catdata.

[50] Sinova, B., González-Rodríguez, G. and Van Aelst, S. (2018). M-estimators of location for functional data. Bernoulli, 2328-2357.

[51] Staniswalis, J. G. and Lee, J. J. (1998). Nonparametric regression analysis of longitudinal data. Journal of the American Statistical Association, 1403-1418.

[52] Sun, Y. and Genton, M. G. (2011). Functional boxplots. Journal of Computational and Graphical Statistics, 316-334.

[53] Wang, J.L., Chiou, J. and Müller, H.G. (2016). Functional data analysis. Annual Review of Statistics and Its Application, 257-295.

[54] Welsh, A.H. (1996). Robust estimation of smooth regression and spread functions and their derivatives. Statistica Sinica, 347-366.

[55] Wood, S. (2017). Generalized Additive Models: An Introduction with R. CRC Press, Taylor and Francis, Boca Raton.

[56] Yao, F., Müller, H.G. and Wang, J.L. (2005). Functional data analysis for sparse longitudinal data.
Journal of the American Statistical Association, 577-590.

[57] Zeger, S.L. and Diggle, P.J. (1994). Semiparametric models for longitudinal data with application to CD4 cell numbers in HIV seroconverters. Biometrics, 50.