Low-Rank Covariance Function Estimation for Multidimensional Functional Data
Jiayi Wang and Raymond K. W. Wong∗, Department of Statistics, Texas A&M University
Xiaoke Zhang†, Department of Statistics, George Washington University

∗ The research of Raymond K. W. Wong is partially supported by National Science Foundation grants DMS-1806063, DMS-1711952 and CCF-1934904.
† The research of Xiaoke Zhang is partially supported by National Science Foundation grant DMS-1832046.
September 1, 2020
Abstract
Multidimensional functional data arise from many fields nowadays. The covariance function plays an important role in the analysis of such increasingly common data. In this paper, we propose a novel nonparametric covariance function estimation approach under the framework of reproducing kernel Hilbert spaces (RKHS) that can handle both sparse and dense functional data. We extend multilinear rank structures for (finite-dimensional) tensors to functions, which allow for flexible modeling of both covariance operators and marginal structures. The proposed framework can guarantee that the resulting estimator is automatically semi-positive definite, and can incorporate various spectral regularizations. The trace-norm regularization in particular can promote low ranks for both the covariance operator and the marginal structures. Despite the lack of a closed form, under mild assumptions the proposed estimator achieves unified theoretical results that hold for any relative magnitudes between the sample size and the number of observations per sample field, and the rate of convergence reveals the phase-transition phenomenon from sparse to dense functional data. Based on a new representer theorem, an ADMM algorithm is developed for the trace-norm regularization. The appealing numerical performance of the proposed estimator is demonstrated by a simulation study and the analysis of a dataset from the Argo project.
Keywords: Functional data analysis; multilinear ranks; tensor product space; unified theory
1 Introduction

In recent decades, functional data analysis (FDA) has become a popular branch of statistical research. General introductions to FDA can be found in a few monographs (e.g., Ramsay and Silverman, 2005; Ferraty and Vieu, 2006; Horváth and Kokoszka, 2012; Hsing and Eubank, 2015; Kokoszka and Reimherr, 2017). While traditional FDA deals with a sample of time-varying trajectories, many new forms of functional data have emerged due to improved capabilities of data recording and storage, as well as advances in scientific computing. One particular new form of functional data is multidimensional functional data, which becomes increasingly common in various fields such as climate science, neuroscience and chemometrics. Multidimensional functional data are generated from random fields, i.e., random functions of several input variables. One example is multi-subject magnetic resonance imaging (MRI) scans, such as those collected by the Alzheimer's Disease Neuroimaging Initiative. A human brain is virtually divided into three-dimensional boxes called "voxels", and brain signals obtained from these voxels form a three-dimensional functional sample indexed by the spatial locations of the voxels. Despite the growing popularity of multidimensional functional data, statistical methods for such data are limited apart from very few existing works (e.g., Huang et al., 2009; Allen, 2013; Zhang et al., 2013; Zhou and Pan, 2014; Wang and Huang, 2017).

In FDA, covariance function estimation plays an important role. Many methods have been proposed for unidimensional functional data (e.g., Rice and Silverman, 1991; James et al., 2000; Yao et al., 2005; Paul and Peng, 2009; Li and Hsing, 2010; Goldsmith et al., 2011; Xiao et al., 2013), and a few were particularly developed for two-dimensional functional data (e.g., Zhou and Pan, 2014; Wang and Huang, 2017). In general, when the input domain is of dimension p, one needs to estimate a 2p-dimensional covariance function. Since covariance function estimation in FDA is typically nonparametric, the curse of dimensionality emerges soon when p is moderate or large.

For general p, most work is restricted to regular and fixed designs (e.g., Zipunnikov et al., 2011; Allen, 2013), where all random fields are observed over a regular grid like MRI scans. Such a sampling plan leads to a tensor dataset, so one may apply tensor/matrix decompositions to estimate the covariance function. When random fields are observed at irregular locations, the dataset is no longer a completely observed tensor, so tensor/matrix decompositions are not directly applicable. If observations are densely collected for each random field, a two-step approach is a natural solution, which involves pre-smoothing every random field followed by tensor/matrix decompositions at a fine discretized grid. However, this solution is infeasible for sparse data where there are a limited number of observations per random field. One example is the data collected by the international Argo project. See Section 7 for more details.
In such a sparse data setting, one may apply the local smoothing method of Chen and Jiang (2017), but it suffers from the curse of dimensionality when the dimension p is moderate, due to a 2p-dimensional nonparametric regression.

We notice that there is a related class of literature on longitudinal functional data (e.g., Chen and Müller, 2012; Park and Staicu, 2015; Chen et al., 2017), a special type of multidimensional functional data where a function is repeatedly measured over longitudinal times. Typically multi-step methods are needed to model the functional and longitudinal dimensions either separately (one dimension at a time) or sequentially (one dimension given the other), as opposed to the joint estimation procedure proposed in this paper. We also notice a recent work on longitudinal functional data under the Bayesian framework (Shamshoian et al., 2019).

The contribution of this paper is three-fold. First, we propose a new and flexible nonparametric method for low-rank covariance function estimation for multidimensional functional data, via the introduction of (infinite-dimensional) unfolding operators (see Section 3). This method can handle both sparse and dense functional data, and can achieve joint structural reductions in all dimensions, in addition to rank reduction of the covariance operator. The proposed estimator is guaranteed to be semi-positive definite. As a one-step procedure, our method reduces the theoretical complexities compared to multi-step estimators which often involve a functional principal component analysis followed by a truncation and reconstruction step (e.g., Hall and Vial, 2006; Poskitt and Sengarapillai, 2013).

Second, we generalize the representer theorem for unidimensional functional data by Wong and Zhang (2019) to the multidimensional case with more complex spectral regularizations. The new representer theorem makes the estimation procedure practically computable by generating a finite-dimensional parametrization of the solution of the underlying infinite-dimensional optimization.

Finally, a unified asymptotic theory is developed for the proposed estimator. It automatically incorporates the settings of dense and sparse functional data, and reveals a phase transition in the rate of convergence. Different from existing theoretical work heavily based on closed-form representations of estimators (Li and Hsing, 2010; Cai and Yuan, 2010; Zhang and Wang, 2016; Liebl, 2019), this paper provides the first unified theory for penalized global M-estimators of covariance functions which does not require a closed-form solution. Furthermore, a near-optimal (i.e., optimal up to a logarithmic order) one-dimensional nonparametric rate of convergence is attainable for the 2p-dimensional covariance function estimator for Sobolev-Hilbert spaces.

The rest of the paper is organized as follows. Section 2 provides some background on reproducing kernel Hilbert space (RKHS) frameworks for functional data. Section 3 introduces the Tucker decomposition for finite-dimensional tensors and our proposed generalization to tensor product RKHS operators, which is the foundation for our estimation procedure. The proposed estimation method is given in Section 4, together with a computational algorithm. The unified theoretical results are presented in Section 5. The numerical performance of the proposed method is evaluated by a simulation study in Section 6 and a real data application in Section 7.
2 RKHS Frameworks for Functional Data

In recent years there has been a surge of RKHS methods in FDA (e.g., Yuan and Cai, 2010; Zhu et al., 2014; Li and Song, 2017; Reimherr et al., 2018; Sun et al., 2018; Wong et al., 2019). However, covariance function estimation, a seemingly well-studied problem, does not receive the same amount of attention in the development of RKHS methods, even for unidimensional functional data. Interestingly, we find that RKHS modeling provides a versatile framework for both unidimensional and multidimensional functional data.

Let X be a random field defined on an index set T ⊂ R^p, with a mean function μ(·) = E{X(·)} and a covariance function γ(∗, ·) = Cov(X(∗), X(·)), and let {X_i : i = 1, ..., n} be n independently and identically distributed (i.i.d.) copies of X. Typically, a functional dataset is represented by {(T_{ij}, Y_{ij}) : j = 1, ..., m_i; i = 1, ..., n}, where

  Y_{ij} = X_i(T_{ij}) + ε_{ij} ∈ R   (1)

is the noisy measurement of the i-th random field X_i taken at the corresponding index T_{ij} ∈ T, m_i is the number of measurements observed from the i-th random field, and {ε_{ij} : i = 1, ..., n; j = 1, ..., m_i} are independent errors with mean zero and finite variance. For simplicity and without loss of generality, we assume m_i = m for all i.

As in many nonparametric regression setups such as penalized regression splines (e.g., Pearce and Wand, 2006) and smoothing splines (e.g., Wahba, 1990; Gu, 2013), the sample field of X, i.e., the realized X (as opposed to the sample path of a unidimensional random function), is assumed to reside in an RKHS H of functions defined on T with a continuous and square integrable reproducing kernel K. Let ⟨·,·⟩_H and ‖·‖_H denote the inner product and norm of H, respectively. With the technical condition E‖X‖²_H < ∞, the covariance function γ resides in the tensor product RKHS H ⊗ H. It can be shown that
H ⊗ H is an RKHS, equipped with the reproducing kernel K ⊗ K given by (K ⊗ K)((s_1, t_1), (s_2, t_2)) = K(s_1, s_2) K(t_1, t_2), for any s_1, s_2, t_1, t_2 ∈ T. This result has been exploited by Cai and Yuan (2010) and Wong and Zhang (2019) for covariance estimation in the unidimensional setting.

For any function f ∈ H ⊗ H, there exists an operator mapping H to H defined by g ∈ H ↦ ⟨f(∗, ·), g(·)⟩_H ∈ H. When f is a covariance function, we call the induced operator an H-covariance operator, or simply a covariance operator below. To avoid clutter, the induced operator will share the same notation with the generating function. Similar to L²-covariance operators, the definition of an induced operator is obtained by replacing the L² inner product by the RKHS inner product. The benefits of considering this operator have been discussed in Wong and Zhang (2019). We also note that a singular value decomposition (e.g., Hsing and Eubank, 2015) of the induced operator exists whenever the corresponding function f belongs to the tensor product RKHS H ⊗ H. The idea of an induced operator can be similarly extended to a general tensor product space F_1 ⊗ F_2, where F_1 and F_2 are two generic RKHSs of functions.

For any γ ∈ H ⊗ H, let γ^⊤ be the transpose of γ, i.e., γ^⊤(s, t) = γ(t, s), s, t ∈ T. Define M = {γ ∈ H ⊗ H : γ ≡ γ^⊤}. To guarantee symmetry and positive semi-definiteness of the estimators, Wong and Zhang (2019) adopted M⁺ = {γ ∈ M : ⟨γf, f⟩_H ≥ 0, for all f ∈ H} as the hypothesis class of γ and considered the following regularized estimator:

  argmin_{γ ∈ M⁺} { ℓ(γ) + τ Ψ(γ) },   (2)

where ℓ is a convex and smooth loss function characterizing the fidelity to the data, Ψ(γ) is a spectral penalty function (see Definition 5 below), and τ is a tuning parameter. Due to the constraints specified in M⁺, the resulting covariance estimator is always positive semi-definite. In particular, if the spectral penalty function Ψ(γ) imposes the trace-norm regularization, an ℓ_1-type shrinkage penalty on the respective singular values, the estimator is usually of low rank. Cai and Yuan (2010) adopted a similar objective function as in (2) but with the hypothesis class H ⊗ H and an ℓ_2-type penalty Ψ(γ) = ‖γ‖²_{H⊗H}, so the resulting estimator may be neither positive semi-definite nor low-rank.

Although Cai and Yuan (2010) and Wong and Zhang (2019) focused on unidimensional functional data, their frameworks can be directly extended to the multidimensional setting. Explicitly, similar to (2), as long as a proper H for the random fields with dimension p > 1 is specified, an estimator over M⁺ can be obtained immediately, which results in a positive semi-definite and possibly low-rank estimator. Since an RKHS is identified by its reproducing kernel, we simply need to pick a multivariate reproducing kernel K for multidimensional functional data. However, even when the low-rank approximation/estimation is adopted (e.g., by trace-norm regularization), we still need to estimate several p-dimensional eigenfunctions nonparametrically. This curse of dimensionality calls for a more efficient modeling. Below, we explore this through the lens of tensor decomposition in finite-dimensional vector spaces and its extension to infinite-dimensional function spaces.

3 Tensor Decomposition and Functional Unfolding

In this section we will extend the well-known Tucker decomposition for finite-dimensional tensors to functional data, then introduce the concept of functional unfolding for low-rank modeling, and finally apply functional unfolding to covariance function estimation.

3.1 Tucker decomposition for finite-dimensional tensors
First, we give a brief introduction to the popular Tucker decomposition (Tucker, 1966) for finite-dimensional tensors. Let G = ⊗_{k=1}^d G_k denote a generic tensor product space with finite-dimensional G_k, k = 1, ..., d. If the dimension of G_k is q_k, k = 1, ..., d, then each element in G = ⊗_{k=1}^d G_k can be identified by an array in R^{q_1×···×q_d}, which contains the coefficients with respect to an orthonormal basis. By the Tucker decomposition, any array in R^{q_1×···×q_d} can be represented in terms of n-mode products as follows.

Definition 1 (n-mode product). For any arrays A ∈ R^{q_1×q_2×···×q_d} and P ∈ R^{p_n×q_n}, n ∈ {1, ..., d}, the n-mode product between A and P, denoted by A ×_n P, is an array of dimension q_1 × q_2 × ··· × q_{n−1} × p_n × q_{n+1} × ··· × q_d whose (l_1, ..., l_{n−1}, j, l_{n+1}, ..., l_d)-th element is defined by

  (A ×_n P)_{l_1,...,l_{n−1},j,l_{n+1},...,l_d} = Σ_{i=1}^{q_n} A_{l_1,...,l_{n−1},i,l_{n+1},...,l_d} P_{j,i}.

Definition 2 (Tucker decomposition). A Tucker decomposition of A ∈ R^{q_1×q_2×···×q_d} is

  A = G ×_1 U_1 ×_2 U_2 ··· ×_d U_d,   (3)

where U_i ∈ R^{q_i×r_i}, i = 1, 2, ..., d, are called the "factor matrices" (usually orthonormal) with r_i ≤ q_i, and G ∈ R^{r_1×···×r_d} is called the "core tensor".

Figure 1 provides a pictorial illustration of a Tucker decomposition. Unlike matrices, the concept of rank is more complicated for arrays of order 3 or above. The Tucker decomposition naturally leads to a particular form of rank, called the "multilinear rank", which is directly related to the familiar concept of matrix rank. To see this, we employ a reshaping operation called matricization, which rearranges the elements of an array into a matrix.

Figure 1: Tucker decomposition of a third-order array A (q_1 × q_2 × q_3) into a core tensor G (r_1 × r_2 × r_3) and factor matrices U_1 (q_1 × r_1), U_2 (q_2 × r_2), U_3 (q_3 × r_3). The values in the parentheses are the dimensions of the corresponding matrices or arrays.
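To make Definitions 1 and 2 concrete, the following sketch implements the n-mode product with NumPy and rebuilds an array from a random core tensor and orthonormal factor matrices. It is an illustration only; the function names (`mode_n_product`, `tucker_reconstruct`) and the specific dimensions are not part of the paper.

```python
import numpy as np

def mode_n_product(A, P, n):
    """n-mode product A x_n P (Definition 1): contract mode n of A with the rows of P."""
    A_unfold = np.moveaxis(A, n, 0).reshape(A.shape[n], -1)   # q_n x (product of other dims)
    out = P @ A_unfold                                        # p_n x (product of other dims)
    new_shape = (P.shape[0],) + tuple(np.delete(A.shape, n))
    return np.moveaxis(out.reshape(new_shape), 0, n)

def tucker_reconstruct(G, factors):
    """Tucker form (Definition 2): A = G x_1 U_1 x_2 U_2 ... x_d U_d."""
    A = G
    for n, U in enumerate(factors):
        A = mode_n_product(A, U, n)
    return A

rng = np.random.default_rng(0)
q, r = (6, 5, 4), (3, 2, 2)                  # array dimensions and multilinear rank
G = rng.standard_normal(r)                   # core tensor
U = [np.linalg.qr(rng.standard_normal((qk, rk)))[0] for qk, rk in zip(q, r)]
A = tucker_reconstruct(G, U)

# The rank of each mode-n matricization equals the corresponding core dimension r_n.
for n in range(3):
    A_n = np.moveaxis(A, n, 0).reshape(q[n], -1)
    print(n, np.linalg.matrix_rank(A_n))     # multilinear rank: prints 3, 2, 2
```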
Definition 3 (Matricization). For any n ∈ {1, ..., d}, the n-mode matricization of A ∈ R^{q_1×q_2×···×q_d}, denoted by A_(n), is a matrix of dimension q_n × (∏_{k≠n} q_k) whose (l_n, j)-th element is defined by [A_(n)]_{l_n, j} = A_{l_1,...,l_d}, where j = 1 + Σ_{i=1, i≠n}^d (l_i − 1) ∏_{m=1, m≠n}^{i−1} q_m. All empty products are defined as 1; for example, ∏_{m=i}^{j} q_m = 1 when i > j.

For A ∈ R^{q_1×···×q_d}, by simple derivations, one can obtain a useful relationship between the n-mode matricization and the Tucker decomposition A = G ×_1 U_1 ··· ×_d U_d:

  A_(n) = U_n G_(n) (U_d ⊗ ··· ⊗ U_{n+1} ⊗ U_{n−1} ⊗ ··· ⊗ U_1)^⊤,   (4)

where, with a slight abuse of notation, ⊗ also represents the Kronecker product between matrices. Hence, if the factor matrices are of full rank, then rank(A_(n)) = rank(G_(n)). The vector (rank(A_(1)), ..., rank(A_(d))) is known as the multilinear rank of A. Clearly from (4), one can choose a Tucker decomposition such that {U_k : k = 1, ..., d} are orthonormal matrices and rank(U_k) = r_k. Therefore a "small" multilinear rank corresponds to a small core tensor and thus an intrinsic dimension reduction, which potentially improves estimation and interpretation. We will relate this low-rank structure to multidimensional functional data.

3.2 Functional unfolding operators

To encourage low-rank structures in covariance function estimation, we generalize the matricization operation for finite-dimensional arrays to infinite-dimensional tensors (Hackbusch, 2012). Here let G = ⊗_{k=1}^d G_k denote a generic tensor product space where G_k is an RKHS of functions with an inner product ⟨·,·⟩_{G_k}, for k = 1, ..., d.

Notice that the tensor product space G = ⊗_{k=1}^d G_k can be generated by elementary tensors of the form (⊗_{k=1}^d f_k)(x_1, ..., x_d) = ∏_{k=1}^d f_k(x_k), where f_k ∈ G_k, k = 1, ..., d. More specifically, G is the completion of the linear span of all elementary tensors under the inner product ⟨⊗_{k=1}^d f_k, ⊗_{k=1}^d g_k⟩_G = ∏_{k=1}^d ⟨f_k, g_k⟩_{G_k}, for any f_k, g_k ∈ G_k.

In Definition 4 below, we generalize matricization/unfolding for finite-dimensional arrays to infinite-dimensional elementary tensors. We also define a square unfolding for infinite-dimensional tensors that will be used to describe the spectrum of covariance operators.

Definition 4 (Functional unfolding operators). The one-way unfolding operators and the square unfolding operator are defined as follows for any elementary tensor of the form ⊗_{k=1}^d f_k.
1. One-way unfolding operator U_j for j = 1, ..., d: the j-mode one-way unfolding operator U_j : ⊗_{k=1}^d G_k → G_j ⊗ (⊗_{k≠j} G_k) is defined by U_j(⊗_{k=1}^d f_k) = f_j ⊗ (⊗_{k≠j} f_k).
2. Square unfolding operator S: when d is even, the square unfolding operator S : ⊗_{j=1}^d G_j → (⊗_{j=1}^{d/2} G_j) ⊗ (⊗_{k=d/2+1}^d G_k) is defined by S(⊗_{j=1}^d f_j) = (⊗_{j=1}^{d/2} f_j) ⊗ (⊗_{k=d/2+1}^d f_k).
These definitions extend to any function f ∈ G by linearity. For notational simplicity we denote U_j(f) by f_(j), j = 1, ..., d, and S(f) by f_□.

Note that the range of each functional unfolding operator, either U_j, j = 1, ..., d, or S, is a tensor product of two RKHSs, so its output can be interpreted as an (induced) operator. Given a function f ∈ G, the multilinear rank can be defined as (rank(f_(1)), ..., rank(f_(d))), where each f_(j) is interpreted as an operator and rank(A) is the rank of an operator A.
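In the finite-dimensional case the one-way unfolding coincides with matricization (Definition 3), and the square unfolding stacks the first d/2 modes against the last d/2. The sketch below, a finite-dimensional illustration only (the helper names are not from the paper), computes both unfoldings for a fourth-order array, checks identity (4) under the column ordering of Definition 3, and reports the one-way and two-way ranks.

```python
import numpy as np

def unfold_one_way(A, j):
    """Mode-j matricization A_(j) with the column ordering of Definition 3."""
    return np.moveaxis(A, j, 0).reshape(A.shape[j], -1, order='F')

def unfold_square(A):
    """Square unfolding of an even-order array: first d/2 modes vs. last d/2 modes."""
    d = A.ndim
    return A.reshape(int(np.prod(A.shape[:d // 2])), -1)   # ordering does not affect the rank

rng = np.random.default_rng(1)
q, r = (4, 3, 4, 3), (2, 2, 2, 2)
G = rng.standard_normal(r)
U = [np.linalg.qr(rng.standard_normal((qk, rk)))[0] for qk, rk in zip(q, r)]

# Fold the core with the factor matrices (Tucker form of Definition 2).
A = G
for n, Un in enumerate(U):
    A = np.moveaxis(np.tensordot(Un, np.moveaxis(A, n, 0), axes=(1, 0)), 0, n)

# Identity (4) for n = 1: A_(1) = U_1 G_(1) (U_4 kron U_3 kron U_2)^T.
lhs = unfold_one_way(A, 0)
rhs = U[0] @ unfold_one_way(G, 0) @ np.kron(np.kron(U[3], U[2]), U[1]).T
print(np.allclose(lhs, rhs))                              # True
print([int(np.linalg.matrix_rank(unfold_one_way(A, j))) for j in range(4)],
      int(np.linalg.matrix_rank(unfold_square(A))))       # e.g. [2, 2, 2, 2] and 4
```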
If all G_k, k = 1, ..., d, are finite-dimensional, the singular values of the output of any one-way functional unfolding operator match those of the j-mode matricization (of the corresponding array representation).

Suppose that the random field X ∈ H = ⊗_{k=1}^p H_k, where each H_k is an RKHS of functions equipped with an inner product ⟨·,·⟩_k and corresponding norm ‖·‖_k, k = 1, ..., p. Then the covariance function γ resides in H ⊗ H = (⊗_{j=1}^p H_j) ⊗ (⊗_{k=1}^p H_k). To estimate γ, we consider a special case of G = ⊗_{j=1}^d G_j in Section 3.2 by letting d = 2p, G_j = H_j for j = 1, ..., p; G_j = H_{j−p} for j = p + 1, ..., d; and ⟨·,·⟩_{G_j} = ⟨·,·⟩_j for j = 1, ..., d.

Clearly, the elements of H ⊗ H are identified by those in G = ⊗_{j=1}^d G_j. In terms of the folding structure, H ⊗ H has a squarely unfolded structure. Since a low-multilinear-rank structure is represented by different unfolded forms, it is easier to study the completely folded space ⊗_{k=1}^d G_k instead of the squarely unfolded space H ⊗ H. We use Γ to represent the folded covariance function, i.e., the element of G corresponding to γ. In other words, Γ_□ = γ. For any Γ ∈ G, rank(Γ_□) is defined as the two-way rank of Γ, while rank(Γ_(1)), ..., rank(Γ_(p)) are defined as the one-way ranks of Γ.
Remark 1. For an array A ∈ R^{q_1×···×q_d}, the one-way unfolding U_j(A) is the same as matricization if we further impose the same ordering of the columns in the output of U_j(A), j = 1, ..., d. This ordering is only related to how we represent the array, and is not crucial in the general definition of U_j. Since the description of the computation strategy depends on the explicit representation, we will always assume this ordering. Similarly, when d is even, we also define a specific ordering of rows and columns for A_□ ∈ R^{(∏_{k=1}^{d/2} q_k)×(∏_{k=d/2+1}^{d} q_k)}, such that its (j_1, j_2)-th entry is A_{k_1,...,k_d}, where j_1 = 1 + Σ_{i=1}^{d/2} (k_i − 1) ∏_{m=i+1}^{d/2} q_m and j_2 = 1 + Σ_{i=d/2+1}^{d} (k_i − 1) ∏_{m=i+1}^{d} q_m.

Here we illustrate the roles of one-way and two-way ranks in the modeling of covariance functions. For a general G = ⊗_{j=1}^d G_j, let {e_{k,l_k} : l_k = 1, ..., q_k} be a set of orthonormal basis functions of G_k for k = 1, ..., d = 2p, where q_k is allowed to be infinite, depending on the dimensionality of G_k. Then {⊗_{k=1}^d e_{k,l_k} : l_k = 1, ..., q_k; k = 1, ..., d} forms a set of orthonormal basis functions for G. Thus for any Γ ∈ G, we can express

  Γ = Σ_{k_1,k_2,...,k_d} B_{k_1,...,k_d} ⊗_{i=1}^d e_{i,k_i},   (5)

where the coefficients B_{k_1,...,k_d} are real numbers. For convenience, we collectively put them into an array B ∈ R^{q_1×···×q_d}.

To illustrate the low-multilinear-rank structures for covariance functions, we consider p = 2, i.e., d = 2p = 4, and then by (5) the folded covariance function Γ can be expressed as

  Γ(s_1, s_2, t_1, t_2) = Σ_{k_1=1}^{q_1} Σ_{k_2=1}^{q_2} Σ_{k_3=1}^{q_3} Σ_{k_4=1}^{q_4} B_{k_1,k_2,k_3,k_4} e_{1,k_1}(s_1) e_{2,k_2}(s_2) e_{3,k_3}(t_1) e_{4,k_4}(t_2).

To be precise, the covariance function is the squarely unfolded Γ_□((s_1, s_2), (t_1, t_2)) ≡ Γ(s_1, s_2, t_1, t_2). Suppose that B possesses (or is well-approximated by) a structure of a low multilinear rank, and admits the Tucker decomposition B = E ×_1 U_1 ×_2 U_2 ×_3 U_1 ×_4 U_2, where E ∈ R^{r_1×r_2×r_1×r_2}, U_k ∈ R^{q_k×r_k} for k = 1, 2, and the columns of U_k are orthonormal.
Apparently R := rank(B_□) is the two-way rank of Γ, while r_1 and r_2 are the corresponding one-way ranks. Now the covariance function can be further represented as

  Γ(s_1, s_2, t_1, t_2) = Σ_{j_1=1}^{r_1} Σ_{j_2=1}^{r_2} Σ_{j_3=1}^{r_1} Σ_{j_4=1}^{r_2} E_{j_1,j_2,j_3,j_4} u_{j_1}(s_1) v_{j_2}(s_2) u_{j_3}(t_1) v_{j_4}(t_2),

where Definition 1 is extended to the case when q_n is infinite, and {u_j : j = 1, ..., r_1} and {v_k : k = 1, ..., r_2} are (possibly infinite) linear combinations of the original basis functions. In fact, {u_j : j = 1, ..., r_1} and {v_k : k = 1, ..., r_2} are sets of orthonormal functions of G_1 and G_2, respectively. Apparently rank(E_□) = R.

Consider the eigen-decomposition of the squarely unfolded core tensor E_□ = P D P^⊤, where D = diag(λ_1, λ_2, ..., λ_R) and P ∈ R^{r_1 r_2 × R} has orthonormal columns. Then we obtain the eigen-decomposition of the covariance function Γ_□:

  Γ_□((s_1, s_2), (t_1, t_2)) = Σ_{g=1}^{R} λ_g f_g(s_1, s_2) f_g(t_1, t_2),

where the eigenfunction is

  f_g(s_1, s_2) = Σ_{j_1=1}^{r_1} Σ_{j_2=1}^{r_2} P_{j_1+(j_2−1)r_1, g} u_{j_1}(s_1) v_{j_2}(s_2) = Σ_{j_1=1}^{r_1} a_{j_1,g}(s_2) u_{j_1}(s_1) = Σ_{j_2=1}^{r_2} b_{j_2,g}(s_1) v_{j_2}(s_2),

with a_{j_1,g}(·) = Σ_{j_2=1}^{r_2} P_{j_1+(j_2−1)r_1, g} v_{j_2}(·) and b_{j_2,g}(·) = Σ_{j_1=1}^{r_1} P_{j_1+(j_2−1)r_1, g} u_{j_1}(·).

First, this indicates that the two-way rank R is the same as the rank of the covariance operator. Second, this shows that {u_j : j = 1, ..., r_1} is a common basis for the variation along the dimension s_1, hence describing the marginal structure along s_1. Similarly, {v_j : j = 1, ..., r_2} is a common basis that characterizes the marginal variation along the dimension s_2. We call them the marginal basis along the respective dimension. Therefore, the one-way ranks r_1 and r_2 are the minimal numbers of one-dimensional functions for the dimensions s_1 and s_2, respectively, that construct all the eigenfunctions of the covariance function Γ.

Similarly, for p-dimensional functional data, each eigenfunction can be represented by a linear combination of p-fold products of univariate functions. One can then show that the two-way rank R is the same as the rank of the covariance operator, and the one-way ranks r_1, ..., r_p are the minimal numbers of one-dimensional functions along the respective dimensions that characterize all eigenfunctions of the covariance operator.
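The relations above (R = rank(E_□) and eigenfunctions built from shared marginal bases) can be checked numerically on the coefficient arrays alone. The following is a small illustrative sketch, not the paper's implementation, using C-order index flattening: it builds a symmetric, positive semi-definite square unfolding of a core tensor, reads off the two-way rank R, and extracts the coefficient matrix of one eigenfunction on the marginal product basis.

```python
import numpy as np

rng = np.random.default_rng(2)
r1, r2, R = 3, 2, 4                                        # one-way ranks and target two-way rank

# Symmetric PSD square unfolding E_square of size (r1*r2) x (r1*r2) with rank R,
# folded back into the 4th-order core tensor E of shape (r1, r2, r1, r2).
P = np.linalg.qr(rng.standard_normal((r1 * r2, R)))[0]      # orthonormal columns
lam = np.sort(rng.uniform(0.5, 2.0, R))[::-1]               # positive eigenvalues
E_square = P @ np.diag(lam) @ P.T
E = E_square.reshape(r1, r2, r1, r2)

# Two-way rank = number of nonzero eigenvalues of the square unfolding; it cannot exceed r1*r2.
eigvals = np.linalg.eigvalsh(E_square)
print(int(np.sum(eigvals > 1e-10)), "<=", r1 * r2)          # 4 <= 6

# Coefficients of the g-th eigenfunction on the marginal bases {u_j} and {v_j}:
# f_g(s1, s2) = sum over (j1, j2) of C_g[j1, j2] * u_{j1}(s1) * v_{j2}(s2).
g = 0
C_g = P[:, g].reshape(r1, r2)
print(C_g.shape)                                            # (3, 2)
```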
Remark 2. Obviously, R ≤ ∏_{k=1}^p r_k for p-dimensional functional data. If the random field X has the property of "weak separability" as defined by Lynch and Chen (2018), then max(r_1, ..., r_p) ≤ R, so a low-rank structure in terms of R will be automatically translated into low one-way ranks. Note that the construction of our estimator and the corresponding theoretical analysis do not require separability conditions.

Compared to typical low-rank covariance modelings only in terms of R, we also intend to regularize the one-way ranks r_1, ..., r_p for two reasons. First, the illustration above shows that the structure of low one-way ranks encourages a "sharing" structure of one-dimensional variations among different eigenfunctions. Promoting low one-way ranks can facilitate additional dimension reduction and further alleviate the curse of dimensionality. Moreover, one-dimensional marginal structures provide more details of the covariance function structure and thus help with a better understanding of the p-dimensional eigenfunctions.

Therefore, we will utilize both one-way and two-way structures and propose an estimation procedure that regularizes one-way and two-way ranks jointly and flexibly, with the aim of seeking the "sharing" of marginal structures while controlling the number of eigen-components simultaneously.

4 Covariance Function Estimation
In this section we propose a low-rank covariance function estimation framework based on functional unfolding operators and spectral regularizations. Spectral penalty functions (Abernethy et al., 2009; Wong and Zhang, 2019) are defined as follows.
Definition 5 (Spectral penalty function). Given a compact operator A, a spectral penalty function takes the form Ψ(A) = Σ_{k≥1} ψ(λ_k(A)), where λ_1(A), λ_2(A), ... are the singular values of the operator A in descending order of magnitude and ψ is a non-decreasing penalty function such that ψ(0) = 0.
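For intuition, the sketch below evaluates two common spectral penalties in the finite-dimensional setting, where the operator is represented by an unfolded coefficient matrix: the trace norm (ψ(λ) = λ) and the rank penalty (ψ(λ) = 1{λ > 0}). It is an illustration of Definition 5 only; the function names are not from the paper.

```python
import numpy as np

def spectral_penalty(mat, psi):
    """Definition 5 on a matrix representation: apply psi to each singular value and sum."""
    singular_values = np.linalg.svd(mat, compute_uv=False)
    return float(np.sum([psi(s) for s in singular_values]))

trace_norm = lambda s: s                       # psi(lambda) = lambda gives the trace norm
rank_pen = lambda s: float(s > 1e-10)          # psi(lambda) = 1{lambda > 0} gives the rank

rng = np.random.default_rng(3)
low_rank = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 8))   # rank-2 matrix
print(spectral_penalty(low_rank, trace_norm))  # sum of singular values
print(spectral_penalty(low_rank, rank_pen))    # 2.0
```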
Recall H = ⊗_{j=1}^p H_j and G = ⊗_{j=1}^d G_j, where d = 2p, G_j = H_j for j = 1, ..., p, and G_j = H_{j−p} for j = p + 1, ..., d. Clearly, a covariance operator is self-adjoint and positive semi-definite. Therefore we consider the hypothesis space M⁺ = {Γ ∈ M : ⟨Γ_□ f, f⟩_H ≥ 0 for all f ∈ H}, where M = {Γ ∈ G : Γ_□ is self-adjoint}, and propose a general class of covariance function estimators as follows:

  argmin_{Γ ∈ M⁺} [ ℓ(Γ) + λ { β Ψ_0(Γ_□) + (1 − β)/p · Σ_{j=1}^p Ψ_j(Γ_(j)) } ],   (6)

where ℓ is a convex and smooth loss function, {Ψ_j : j = 0, 1, ..., p} are spectral penalty functions, and λ ≥ 0 and β ∈ [0, 1] are tuning parameters. Here Ψ_0 penalizes the squarely unfolded operator Γ_□, while Ψ_j regularizes the one-way unfolded operator Γ_(j) for j = 1, ..., p. The tuning parameter β controls the relative degree of regularization between one-way and two-way singular values. The larger β is, the more penalty is imposed on the two-way singular values relative to the one-way singular values. When β = 1, the penalization is only on the eigenvalues of the covariance operator (i.e., the two-way singular values), similarly to Wong and Zhang (2019).

To achieve low-rank estimation, we adopt a special form of (6):

  Γ̂ = argmin_{Γ ∈ M⁺} [ ℓ_square(Γ) + λ { β ‖Γ_□‖_* + (1 − β)/p · Σ_{j=1}^p ‖Γ_(j)‖_* } ],   (7)

where ‖·‖_* is the sum of singular values, also called the trace norm, and ℓ_square is the squared error loss

  ℓ_square(Γ) = {nm(m−1)}^{-1} Σ_{i=1}^n Σ_{1≤j≠j'≤m} {Γ(T_{ij1}, ..., T_{ijp}, T_{ij'1}, ..., T_{ij'p}) − Z_{ijj'}}²,   (8)

with Z_{ijj'} = {Y_{ij} − μ̂(T_{ij1}, ..., T_{ijp})}{Y_{ij'} − μ̂(T_{ij'1}, ..., T_{ij'p})}, μ̂ an estimate of the mean function, and T_{ijk} the k-th element of the location vector T_{ij}. Notice that trace-norm regularizations promote low-rankness of the underlying operators, hence leading to a low-rank estimation in terms of both the one-way and the two-way (covariance) ranks.

Before deriving a computational algorithm, we notice that the optimization (7) is an infinite-dimensional optimization which is generally unsolvable. To overcome this challenge, we show that the solution to (7) always lies in a known finite-dimensional subspace given the data, hence allowing a finite-dimensional parametrization. Indeed, we are able to achieve a stronger result in Theorem 1, which holds for the general class of estimators obtained by (6). Let L_{n,m} = {T_{ijk} : i = 1, ..., n, j = 1, ..., m, k = 1, ..., p}.

Theorem 1 (Representer theorem). If the solution set of (6) is not empty, there always exists a solution Γ lying in the space G(L_{n,m}) := ⊗_{k=1}^{2p} K_k, where K_{p+k} = K_k and K_k = span{K_k(T_{ijk}, ·) : i = 1, ..., n, j = 1, ..., m} for k = 1, ..., p. The solution takes the form

  Γ(s_1, ..., s_p, t_1, ..., t_p) = A ×_1 z_1^⊤(s_1) ×_2 z_2^⊤(s_2) ··· ×_p z_p^⊤(s_p) ×_{p+1} z_1^⊤(t_1) ··· ×_{2p} z_p^⊤(t_p),   (9)

where the l-th element of z_k(·) ∈ R^{mn} is K_k(T_{ijk}, ·) with l = (i − 1)m + j. Also, A is a 2p-th order tensor where the dimension of each mode is nm, and A_□ is a symmetric matrix.

The proof of Theorem 1 is given in Section S1 of the supplementary material. By Theorem 1, we can now focus only on covariance function estimators of the form (9). Let B = A ×_1 M_1^⊤ ··· ×_p M_p^⊤ ×_{p+1} M_1^⊤ ··· ×_{2p} M_p^⊤, where M_k is an nm × q_k matrix such that M_k M_k^⊤ = K_k = [K_k(T_{i_1 j_1 k}, T_{i_2 j_2 k})]_{1≤i_1,i_2≤n, 1≤j_1,j_2≤m}. With B, we can express

  Γ(s_1, ..., s_p, t_1, ..., t_p) = B ×_1 {M_1^+ z_1(s_1)}^⊤ ··· ×_p {M_p^+ z_p(s_p)}^⊤ ×_{p+1} {M_1^+ z_1(t_1)}^⊤ ··· ×_{2p} {M_p^+ z_p(t_p)}^⊤,   (10)

where z_k(·) is defined in Theorem 1 and M_k^+ is the Moore-Penrose inverse of the matrix M_k.
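The sketch below illustrates the finite-dimensional parametrization (9)-(10) for p = 2: it forms the per-dimension Gram matrices K_k on the pooled locations, factors them as K_k = M_k M_k^T, and evaluates a candidate Γ at new points from a coefficient tensor B. A Gaussian kernel is used purely as a stand-in reproducing kernel, and all names and dimensions are illustrative rather than the paper's implementation.

```python
import numpy as np

def gauss_kernel(x, y, h=0.3):
    """Stand-in reproducing kernel; the paper works with (products of) Sobolev-type kernels."""
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / (2 * h ** 2))

rng = np.random.default_rng(4)
n, m, p = 5, 4, 2
T = rng.uniform(size=(n, m, p))                       # observed locations T_ijk in [0, 1]
pooled = [T[:, :, k].reshape(-1) for k in range(p)]   # nm pooled locations per dimension

# Gram matrices K_k and a factor M_k with K_k = M_k M_k^T via eigen-decomposition.
M = []
for k in range(p):
    Kk = gauss_kernel(pooled[k], pooled[k])
    w, V = np.linalg.eigh(Kk)
    keep = w > 1e-10 * w.max()                        # q_k columns; could be truncated further
    M.append(V[:, keep] * np.sqrt(w[keep]))

q = [Mk.shape[1] for Mk in M]
B = rng.standard_normal(q + q)                        # coefficient tensor with modes q_1,q_2,q_1,q_2

def evaluate_gamma(B, s, t):
    """Evaluate Gamma(s_1, s_2, t_1, t_2) via (10): contract B with M_k^+ z_k(.) in each mode."""
    weights = []
    for point in (s, t):
        for k in range(p):
            z = gauss_kernel(pooled[k], np.array([point[k]]))[:, 0]   # z_k at the new point
            weights.append(np.linalg.pinv(M[k]) @ z)                  # M_k^+ z_k
    out = B
    for w_vec in weights:                              # successive one-mode contractions
        out = np.tensordot(out, w_vec, axes=(0, 0))
    return float(out)

print(evaluate_gamma(B, s=(0.2, 0.7), t=(0.5, 0.1)))
```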
The Gram matrix K_k is often approximately low-rank. For computational simplicity, one could take q_k significantly smaller than nm. Ideally we can obtain the "best" low-rank approximation with respect to the Frobenius norm by an eigen-decomposition, but a full eigen-decomposition is computationally expensive. Instead, randomized algorithms can be used to obtain low-rank approximations in an efficient manner (Halko et al., 2009).

One can easily show that the eigenvalues of the operator Γ_□ are the same as those of the matrix B_□, and that the singular values of the operator Γ_(j) are the same as those of the matrix B_(j). Therefore, solving (7) is equivalent to solving the following optimization:

  min_B [ ℓ̃_square(B) + λ { β h(B_□) + (1 − β)/p · Σ_{j=1}^p ‖B_(j)‖_* } ],   (11)

where ‖·‖_* also represents the trace norm of matrices, h(H) = ‖H‖_* if the matrix H is positive semi-definite and h(H) = ∞ otherwise, and ℓ̃_square(B) = ℓ_square(Γ), where Γ is constructed from (10).

Beyond estimating the covariance function, one may be further interested in the eigen-decomposition of Γ_□ with respect to the L² inner product, e.g., to perform functional principal component analysis in the usual sense. Due to the finite-dimensional parametrization, a closed-form expression of the L² eigen-decomposition can be derived from our estimator without further discretization or approximation. In addition, we can obtain a similar one-way analysis in terms of the L² inner product. We can define an L² singular value decomposition via the Tucker form and obtain the L² marginal basis. Details are given in Appendix A.

We solve (11) by an accelerated alternating direction method of multipliers (ADMM) algorithm (Kadkhodaie et al., 2015). We begin with an alternative form of (11):

  min_{B, D_0, ..., D_p ∈ R^{q_1×···×q_{2p}}} [ ℓ̃_square(B) + λ β h(D_{0,□}) + λ (1 − β)/p · Σ_{j=1}^p ‖D_{j,(j)}‖_* ]   (12)

  subject to B = D_0 = D_1 = ··· = D_p,   (13)

where q_{p+k} = q_k for k = 1, ..., p.

Then a standard ADMM algorithm solves the optimization problem (12) by minimizing the augmented Lagrangian with respect to different variables alternately. More explicitly, at the (t+1)-th iteration, the following updates are implemented:

  B^{(t+1)} = argmin_B [ ℓ̃_square(B) + (η/2) ‖B_□ − D^{(t)}_{0,□} + V^{(t)}_{0,□}‖²_F + (η/2) Σ_{k=1}^p ‖B_(k) − D^{(t)}_{k,(k)} + V^{(t)}_{k,(k)}‖²_F ],   (14a)

  D^{(t+1)}_0 = argmin_{D_0} [ λ β h(D_{0,□}) + (η/2) ‖B^{(t+1)}_□ − D_{0,□} + V^{(t)}_{0,□}‖²_F ],   (14b)

  D^{(t+1)}_k = argmin_{D_k} [ λ (1 − β)/p · ‖D_{k,(k)}‖_* + (η/2) ‖B^{(t+1)}_(k) − D_{k,(k)} + V^{(t)}_{k,(k)}‖²_F ],  k = 1, ..., p,   (14c)

  V^{(t+1)}_k = V^{(t)}_k + B^{(t+1)} − D^{(t+1)}_k,  k = 0, ..., p,   (14d)

where V_k ∈ R^{q_1×···×q_{2p}}, k = 0, ..., p, are scaled Lagrangian multipliers and η > 0 is a pre-specified parameter; a guideline for choosing η is provided in Boyd et al. (2010). One can see that Steps (14a), (14b) and (14c) involve additional optimizations. Now we discuss how to solve them. The objective function of (14a) is quadratic, and so it admits a closed-form solution, which appears as the B-update step in Algorithm 1. To solve (14b) and (14c), we use the proximal operators prox^k_v, k = 1, ..., p, and prox^+_v : R^{q_1×···×q_{2p}} → R^{q_1×···×q_{2p}}, respectively defined by

  prox^k_v(A) = argmin_{W ∈ R^{q_1×···×q_{2p}}} [ (1/2) ‖W_(k) − A_(k)‖²_F + v ‖W_(k)‖_* ],   (15a)

  prox^+_v(A) = argmin_{W ∈ R^{q_1×···×q_{2p}}} [ (1/2) ‖W_□ − A_□‖²_F + v h(W_□) ],   (15b)

for v ≥ 0.
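Both proximal maps have closed forms based on soft-thresholding of singular values or eigenvalues, as detailed next. The sketch below implements matrix versions of the two maps; the function names are illustrative and not from the paper.

```python
import numpy as np

def prox_trace(A_mat, v):
    """prox of v*||W||_* on a matricization: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(A_mat, full_matrices=False)
    return U @ np.diag(np.maximum(s - v, 0.0)) @ Vt

def prox_psd_trace(A_sq, v):
    """prox of v*h(W): symmetrize, soft-threshold eigenvalues, drop negative components."""
    sym = 0.5 * (A_sq + A_sq.T)
    w, V = np.linalg.eigh(sym)
    return V @ np.diag(np.maximum(w - v, 0.0)) @ V.T

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 6))
print(np.linalg.matrix_rank(prox_trace(A, 1.5)))        # thresholding reduces the rank
out = prox_psd_trace(A, 0.5)
print(bool(np.all(np.linalg.eigvalsh(out) >= -1e-10)))  # output is positive semi-definite
```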
By Lemma 1 in Mazumder et al. (2010), the solutions to (15) have closed forms. For (15a), write the singular value decomposition of A_(k) as U diag(ã_1, ..., ã_{q_k}) V^⊤; then [prox^k_v(A)]_(k) = U diag(c̃) V^⊤, where c̃ = ((ã_1 − v)_+, (ã_2 − v)_+, ..., (ã_{q_k} − v)_+). As for (15b), W_□ is restricted to be a symmetric matrix since the penalty h equals infinity otherwise. Thus (15b) is equivalent to minimizing {(1/2) ‖W_□ − (A_□ + A_□^⊤)/2‖²_F + v h(W_□)}, since ⟨W_□, (A_□ − A_□^⊤)/2⟩ = ⟨(W_□ + W_□^⊤)/2, (A_□ − A_□^⊤)/2⟩ = 0. Suppose that (A_□ + A_□^⊤)/2 has the eigen-decomposition P diag(ã_1, ..., ã_{∏_k q_k}) P^⊤. Then [prox^+_v(A)]_□ = P diag(c̃) P^⊤, where c̃ = ((ã_1 − v)_+, (ã_2 − v)_+, ...). Unlike singular values, the eigenvalues may be negative. Hence, as opposed to prox^k_v, the procedure prox^+_v also removes eigen-components with negative eigenvalues.

The details of the computational algorithm are given in Algorithm 1, an accelerated version of ADMM which involves additional steps for a faster algorithmic convergence.

5 Unified Theoretical Results

In this section, we conduct an asymptotic analysis of the proposed estimator Γ̂ as defined in (7). Our analysis has a unified flavor such that the derived convergence rate of the proposed estimator automatically adapts to sparse and dense settings. Throughout this section, we neglect the mean function estimation error by setting μ(t) = μ̂(t) = 0 for any t ∈ T, which leads to a cleaner and more focused analysis. The additional error from the mean function estimation can be incorporated into our proofs without any fundamental difficulty.

5.1 Assumptions

Without loss of generality let T = [0, 1]^p. The assumptions needed in the asymptotic results are listed as follows.
Assumption 1. Sample fields {X_i : i = 1, ..., n} reside in H = ⊗_{k=1}^p H_k, where H_k is an RKHS of functions on [0, 1] with a continuous and square integrable reproducing kernel K_k.

Assumption 2. The true (folded) covariance function Γ_0 ≠ 0 and Γ_0 ∈ G = ⊗_{j=1}^d G_j, where d = 2p, G_j = H_j for j = 1, ..., p and G_j = H_{j−p} for j = p + 1, ..., d.

Assumption 3. The locations {T_{ij} : i = 1, ..., n; j = 1, ..., m} are independent random vectors from Uniform[0, 1]^p, and they are independent of {X_i : i = 1, ..., n}. The errors {ε_{ij} : i = 1, ..., n; j = 1, ..., m} are independent of both the locations and the sample fields.

Assumption 4. For each t ∈ T, X(t) is sub-Gaussian with a parameter b_X > 0 which does not depend on t, i.e., E[exp{βX(t)}] ≤ exp{b_X² β²/2} for all β and t ∈ T.

Assumption 5. For each i and j, ε_{ij} is a mean-zero sub-Gaussian random variable with a parameter b_ε independent of i and j, i.e., E[exp{βε_{ij}}] ≤ exp{b_ε² β²/2}. Moreover, all errors {ε_{ij} : i = 1, ..., n; j = 1, ..., m} are independent.

Assumption 1 delineates a tensor product RKHS modeling, commonly seen in the nonparametric regression literature (e.g., Wahba, 1990; Gu, 2013). In Assumption 2, the condition Γ_0 ∈ G is satisfied if E‖X‖²_H < ∞, as shown in Cai and Yuan (2010). Assumption 3 is specified for a random design, and we adopt the uniform distribution here for simplicity. The uniform distribution on [0, 1]^p can be generalized to any other continuous distribution whose density function π satisfies c_π ≤ π(t) ≤ c̄_π for all t ∈ [0, 1]^p, for some constants 0 < c_π ≤ c̄_π < ∞, to ensure that both Theorems 2 and 3 still hold. Assumptions 4 and 5 involve sub-Gaussian conditions on the stochastic process and the measurement error, which are standard tail conditions.
In Assumption 1, the "smoothness" of the functions in the underlying RKHS is not explicitly specified. It is well known that such smoothness conditions are directly related to the eigen-decay of the respective reproducing kernel. By Mercer's Theorem (Mercer, 1909), the reproducing kernel K_H((t_1, ..., t_p), (t'_1, ..., t'_p)) of H possesses the eigen-decomposition

  K_H((t_1, ..., t_p), (t'_1, ..., t'_p)) = Σ_{l=1}^∞ μ_l φ_l(t_1, ..., t_p) φ_l(t'_1, ..., t'_p),   (16)

where {μ_l : l ≥ 1} are non-negative eigenvalues and {φ_l : l ≥ 1} are L² eigenfunctions on [0, 1]^p. Then for the space H ⊗ H, which is also identified by G = ⊗_{k=1}^d G_k, its corresponding reproducing kernel K_G has the eigen-decomposition

  K_G((x_1, ..., x_{2p}), (x'_1, ..., x'_{2p})) = K_H((x_1, ..., x_p), (x'_1, ..., x'_p)) K_H((x_{p+1}, ..., x_{2p}), (x'_{p+1}, ..., x'_{2p}))
  = Σ_{l,h=1}^∞ μ_l μ_h φ_l(x_1, ..., x_p) φ_h(x_{p+1}, ..., x_{2p}) φ_l(x'_1, ..., x'_p) φ_h(x'_{p+1}, ..., x'_{2p}),

so {μ_l μ_h : l, h ≥ 1} are the eigenvalues of K_G. Due to the continuity assumption (Assumption 1) on the univariate kernels, there exists a constant b such that

  sup_{(x_1,...,x_{2p}) ∈ [0,1]^{2p}} K_G((x_1, ..., x_{2p}), (x_1, ..., x_{2p})) ≤ b.

The decay rate of the eigenvalues {μ_l μ_h : l, h ≥ 1} enters our analysis through two quantities κ_{n,m} and η_{n,m}, which have relatively complex forms defined in Appendix B. Similar quantities can be found in other analyses of RKHS-based estimators (e.g., Raskutti et al., 2012) that accommodate general choices of RKHS. Generally κ_{n,m} and η_{n,m} are expected to diminish in certain orders of n and m, characterized by the decay rate of the eigenvalues {μ_l μ_h}. The smoother the functions in the RKHS, the faster these two quantities diminish. Our general results in Theorems 2 and 3 are specified in terms of these quantities. To provide a solid example, we derive the orders of κ_{n,m} and η_{n,m} under a Sobolev-Hilbert space setting and provide the convergence rate of the proposed estimator in Corollary 1.

We write the penalty in (7) as I(Γ) = β ‖Γ_□‖_* + (1 − β) p^{-1} Σ_{k=1}^p ‖Γ_(k)‖_*. For arbitrary functions g_1, g_2 ∈ G, define their empirical inner product and the corresponding (squared) empirical norm as

  ⟨g_1, g_2⟩_{n,m} = {nm(m−1)}^{-1} Σ_{i=1}^n Σ_{1≤j≠j'≤m} g_1(T_{ij1}, ..., T_{ijp}, T_{ij'1}, ..., T_{ij'p}) g_2(T_{ij1}, ..., T_{ijp}, T_{ij'1}, ..., T_{ij'p}),
  ‖g_1‖²_{n,m} = ⟨g_1, g_1⟩_{n,m}.

Additionally, the L² norm of a function g is defined as ‖g‖ = {∫_T g²(t) dt}^{1/2}. Define ξ_{n,m} = max{η_{n,m}, κ_{n,m}, (n^{-1} log n)^{1/2}}. We first provide the empirical L² rate of convergence for Γ̂.
Theorem 2. Suppose that Assumptions 1-5 hold. Assume that ξ_{n,m} satisfies (log n)/n ≤ ξ²_{n,m}/{log log(ξ^{-1}_{n,m})}. If λ ≥ L_1 ξ²_{n,m} for some constant L_1 > 0 depending on b_X, b_ε and b, we have

  ‖Γ̂ − Γ_0‖_{n,m} ≤ √(I(Γ_0) λ) + L_2 ξ_{n,m},

with probability at least 1 − exp{−c n ξ²_{n,m} / log n} for some positive universal constant c.

Next, we provide the L² rate of convergence for Γ̂.
Theorem 3. Under the same conditions as in Theorem 2, there exists a positive constant L_3 depending on b_X, b_ε, b and I(Γ_0), such that

  ‖Γ̂ − Γ_0‖ ≤ √(I(Γ_0) λ) + L_3 ξ_{n,m},

with probability at least 1 − exp{−c_p n ξ²_{n,m} / log n} for some constant c_p depending on b.

The proofs of Theorems 2 and 3 can be found in Section S1 of the supplementary material.
Remark 3.
Theorems 2 and 3 are applicable to a general RKHS H which satisfies Assumption 1. The convergence rate depends on the eigen-decay rate of the reproducing kernel. A special case of polynomial decay rates for univariate RKHSs will be given in Corollary 1. Moreover, our analysis has a unified flavor in the sense that the resulting convergence rates automatically adapt to the orders of both n and m. In Remark 5 we provide a discussion of a "phase transition" between dense and sparse functional data revealed by our theory.

Remark 4. With a properly chosen λ, Theorems 2 and 3 bound the convergence rates (in terms of both the empirical and the theoretical L² norm) by ξ_{n,m}, which cannot be faster than (n^{-1} log n)^{1/2}. The logarithmic order is due to the use of the Adamczak bound in Lemma S2 in the supplementary material. If one further assumes boundedness of the sample fields X_i (in terms of the sup-norm) and the noise variables ε_{ij}, we can instead use the Talagrand concentration inequality (Bousquet bound in Koltchinskii (2011)), and the results in Theorems 2 and 3 can be improved to max{‖Γ̂ − Γ_0‖_{n,m}, ‖Γ̂ − Γ_0‖} = O_p(ξ̃_{n,m}), where ξ̃_{n,m} = max{η_{n,m}, κ_{n,m}, n^{-1/2}}.

Next we focus on a special case where the reproducing kernels of the univariate RKHSs H_k exhibit polynomial eigen-decay rates, which holds for a range of commonly used RKHSs. A canonical example is the α-th order Sobolev-Hilbert space

  H_k = {f : f^{(r)}, r = 0, ..., α − 1, are absolutely continuous; f^{(α)} ∈ L²([0, 1])},

where k = 1, ..., p. Here α is the same as the α in Corollary 1. To derive the convergence rates, we relate the eigenvalues μ_l in (16) to the univariate RKHSs H_k, k = 1, ..., p. By Mercer's Theorem, the reproducing kernel K_k of H_k admits an eigen-decomposition with non-negative eigenvalues {μ^{(k)}_l : l ≥ 1} and L² eigenfunctions {φ^{(k)}_l : l ≥ 1}, i.e., K_k(t_1, t_2) = Σ_{l=1}^∞ μ^{(k)}_l φ^{(k)}_l(t_1) φ^{(k)}_l(t_2). Therefore, the set of eigenvalues {μ_l : l ≥ 1} in (16) is the same as the set {∏_{k=1}^p μ^{(k)}_{l_k} : l_1, ..., l_p ≥ 1}. Given the eigen-decay of μ^{(k)}_l, one can obtain the order of ξ_{n,m} and hence the convergence rates from Theorems 2 and 3. The results under the setting of a polynomial eigen-decay are as follows.
Corollary 1. Suppose that the same conditions as in Theorem 3 hold. If the eigenvalues of K_k for H_k, k = 1, ..., p, have polynomial decay rates, that is, there exists α > 1/2 such that μ^{(k)}_l ≍ l^{-2α} for all k = 1, ..., p, then

  max{‖Γ̂ − Γ_0‖²_{n,m}, ‖Γ̂ − Γ_0‖²} = O_p( max[ (nm)^{-2α/(2α+1)} {log(nm)}^{2α(2p−1)/(2α+1)}, (log n)/n ] ).
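As a quick worked example, assume the rate displayed in Corollary 1 and take the unidimensional second-order Sobolev case p = 1 and α = 2 (the setting used for the real data in Section 7). The bound then reduces to

\[
\max\bigl\{\|\hat\Gamma-\Gamma_0\|_{n,m}^2,\ \|\hat\Gamma-\Gamma_0\|^2\bigr\}
= O_p\!\left(\max\left[\left\{\frac{\log(nm)}{nm}\right\}^{4/5},\ \frac{\log n}{n}\right]\right),
\]

i.e., the familiar one-dimensional nonparametric order (nm)^{-4/5} up to logarithmic factors in the sparse regime. The two terms balance when m is of order n^{1/4}(log n)^{-1/4} up to lower-order factors, which is the phase-transition threshold discussed in Remark 5 below.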
Remark 5. Theorems 2 and 3 and Corollary 1 all reveal a "phase transition" of the convergence rate depending on the relative magnitudes of n, the sample size, and m, the number of observations per field. When κ²_{n,m} ≪ (log n)/n, which is equivalent to m ≫ n^{1/(2α)} (log n)^{p−1−1/(2α)} in Corollary 1, both the empirical and the theoretical L² rates of convergence achieve the near-optimal rate √(log n / n). Under the stronger assumptions in Remark 4, the convergence rate achieves the optimal order √(1/n) when κ²_{n,m} ≪ 1/n (or m ≫ n^{1/(2α)} (log n)^{p−1} in Corollary 1). In this case, the observations are so densely sampled that we can estimate the covariance function as precisely as if the entire sample fields were observable. On the contrary, when κ²_{n,m} ≫ (log n)/n (or m ≪ n^{1/(2α)} (log n)^{p−1−1/(2α)} in Corollary 1), the convergence rate is determined by the total number of observations nm. When p = 1, the asymptotic result in Corollary 1, up to some log m and log n terms, is the same as the minimax optimal rate obtained by Cai and Yuan (2010), and is comparable to the L² rate obtained by Paul and Peng (2009) for α = 2.
Remark 6. For covariance function estimation for unidimensional functional data, i.e., p = 1, a limited number of approaches, including Cai and Yuan (2010), Li and Hsing (2010), Zhang and Wang (2016), and Liebl (2019), can achieve unified theoretical results in the sense that they hold for all relative magnitudes of n and m. The similarity of these approaches is the availability of a closed form for each covariance function estimator. In contrast, our estimator obtained from (7) does not have a closed form due to the non-differentiability of the penalty, but it can still achieve unified theoretical results which hold for both unidimensional and multidimensional functional data. Due to the lack of a closed form of our covariance estimator, we used empirical process techniques (e.g., Bartlett et al., 2005; Koltchinskii, 2011) in the theoretical development. In particular, we have developed a novel grouping lemma (Lemma S4 in the supplementary material) to deterministically decouple the dependence within a U-statistic of order 2. We believe this lemma is of independent interest. In our analysis, the corresponding U-statistic is indexed by a function class, and this grouping lemma provides a tool to obtain uniform results (see Lemma S3 in the supplementary material). In particular, this allows us to relate the empirical and theoretical L² norms of the underlying function class, with precise enough order dependence on n and m to derive the unified theory. See Lemma S3 for more details. To the best of our knowledge, this paper is one of the first in the FDA literature that derives a unified result in terms of empirical process theories, and the proof technique is potentially useful for some other estimators without a closed form.

6 Simulation Study

To evaluate the practical performance of the proposed method, we conducted a simulation study. We in particular focused on two-dimensional functional data. Let H_1 and H_2 both be the RKHS with kernel K(t_1, t_2) = Σ_{k=1}^∞ (kπ)^{-4} e_k(t_1) e_k(t_2), where e_k(t) = √2 cos(kπt), k ≥
1. This RKHS has been used in various studies in FDA, e.g., the simulation study of Cai and Yuan (2012). Each X_i is generated from a mean-zero Gaussian random field with a covariance function

  γ_0((s_1, s_2), (t_1, t_2)) = Γ_0(s_1, s_2, t_1, t_2) = Σ_{k=1}^R k^{-2} ψ_k(s_1, s_2) ψ_k(t_1, t_2),

where the eigenfunctions ψ_k(t_1, t_2) ∈ P_{r_1,r_2} := {e_i(t_1) e_j(t_2) : i = 1, ..., r_1; j = 1, ..., r_2}. Three combinations of the one-way ranks (r_1, r_2) and the two-way rank R were studied for Γ_0: Setting 1: R = 6, r_1 = 3, r_2 = 2; Setting 2: R = 6, r_1 = r_2 = 4; Setting 3: R = r_1 = r_2 = 4. For each setting, we chose R functions out of P_{r_1,r_2} to be {ψ_k} such that smoother functions are associated with larger eigenvalues. The details are described in Section S2.1 of the supplementary material.

In terms of sampling plans, we considered both sparse and dense designs. Due to the space limit, here we only show and discuss the results for the sparse design, while deferring those for the dense design to Section S2.3 of the supplementary material. For the sparse design, the random locations T_{ij}, j = 1, ..., m, were independently generated from the continuous uniform distribution on [0, 1]² within each field and across different fields, and the random errors {ε_{ij} : i = 1, ..., n; j = 1, ..., m} were independently generated from N(0, σ²).
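The sketch below mimics this sparse data-generating mechanism for Setting 1 as a finite Karhunen-Loeve expansion. The cosine basis, the eigenvalue sequence, and the noise level used here are the ones assumed in the text above and are illustrative only; they are not taken from the paper's supplementary material.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, sigma = 100, 10, 0.5          # sample size, observations per field, noise level (illustrative)
r1, r2, R = 3, 2, 6                 # Setting 1: one-way ranks (3, 2) and two-way rank 6

def e(k, t):
    """Orthonormal cosine basis on [0, 1] assumed above."""
    return np.sqrt(2.0) * np.cos(k * np.pi * t)

# Eigenfunctions psi_k(t1, t2) = e_i(t1) e_j(t2) from the r1 x r2 product set, R of them kept.
pairs = [(i + 1, j + 1) for i in range(r1) for j in range(r2)][:R]
eigvals = np.array([k ** -2.0 for k in range(1, R + 1)])     # decreasing eigenvalues (illustrative)

# Sparse design: uniform locations on [0,1]^2, fields drawn via independent Gaussian scores.
T = rng.uniform(size=(n, m, 2))
scores = rng.standard_normal((n, R)) * np.sqrt(eigvals)
psi = np.stack([e(i, T[:, :, 0]) * e(j, T[:, :, 1]) for (i, j) in pairs], axis=-1)   # (n, m, R)
Y = np.einsum('nmr,nr->nm', psi, scores) + sigma * rng.standard_normal((n, m))       # model (1)
print(T.shape, Y.shape)             # (100, 10, 2) (100, 10)
```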
In each of the 200 simulation runs, the observed data were obtained following (1) with various combinations of m = 10, n = 100, 200, and two noise levels σ (each between 0 and 1). We compared the proposed estimator mOpCov with three existing methods: 1) OpCov: the estimator based on Wong and Zhang (2019) with adaptation to the two-dimensional case (see Section 2); 2) ll-smooth: local linear smoothing with the Epanechnikov kernel; 3) ll-smooth+: the two-step estimator constructed by retaining the eigen-components of ll-smooth selected by a 99% fraction of variation explained (FVE), and then removing the eigen-components with negative eigenvalues.
For both OpCov and mOpCov, 5-fold cross-validation was adopted to select the corresponding tuning parameters.

Table 1 shows the average integrated squared error (AISE), the average of the estimated two-way rank (R̄), as well as the averages of the estimated one-way ranks (r̄_1, r̄_2) of the above covariance estimators over 200 simulated data sets in the respective settings when the sample size is n = 200. Corresponding results for n = 100 can be found in Table S4 of the supplementary material, and they lead to similar conclusions. Obviously ll-smooth and ll-smooth+, especially ll-smooth, perform significantly worse than the other two methods in both estimation accuracy and rank reduction (if applicable). Below we only compare mOpCov and OpCov.

Regarding estimation accuracy, the proposed mOpCov has uniformly smaller AISE values than OpCov, with around 10% to 20% improvement of AISE over OpCov in most cases under Settings 1 and 2. If the standard error (SE) of the AISE is taken into account, the improvements in AISE by mOpCov are more distinguishable in Settings 1 and 2 than in Setting 3, since the SEs of OpCov in Setting 3 are relatively high. This is due to the fact that in Setting 3 the marginal basis functions are not shared by different two-dimensional eigenfunctions, and hence mOpCov cannot benefit from the structure sharing among eigenfunctions. Setting 3 is in fact an extreme setting we designed to challenge the proposed method.

For rank estimation, OpCov almost always underestimates the two-way ranks, while mOpCov typically overestimates both the one-way and two-way ranks. For mOpCov, the average one-way rank estimates are always smaller than the average two-way rank estimates, and their differences are substantial in Settings 1 and 2. This demonstrates the benefit of mOpCov in detecting the structure sharing of one-dimensional basis functions among two-dimensional eigenfunctions.

We also tested the performance of mOpCov in the dense and regular designs, and compared it with the existing methods mentioned above together with the one by Wang and Huang (2017), which is not applicable to the sparse design. Details are given in Section S2.3 of the supplementary material, where all methods achieve similar AISE values, but mOpCov performs slightly better in estimation accuracy when the noise level is high.
7 Real Data Application

We applied the proposed method to an Argo profile data set. The Argo project has a global array of approximately 3,800 free-drifting profiling floats, which measure the temperature and salinity of the ocean. These floats drift freely in the depths of the ocean most of the time, and ascend regularly to the sea surface, where they transmit the collected data to the satellites. Every day only a small subset of floats shows up on the sea surface. Due to the drifting process, these floats measure temperature and salinity at irregular locations over the ocean. See Figure 2 for examples.

In this analysis, we focus on the different changes of sea surface temperature between the tropical western and eastern Indian Ocean, which is called the Indian Ocean Dipole (IOD). The IOD is known to be associated with droughts in Australia (Ummenhofer et al., 2009) and has a significant effect on rainfall patterns in southeast Australia (Behera and Yamagata, 2003). According to Shinoda et al. (2004), the IOD phenomenon is a predominant inter-annual variation of sea surface temperature during late boreal summer and autumn, so in this application we focused on the sea surface temperature in the Indian Ocean region of longitude 40 to 120 and latitude -20 to 20, between September and November every year from 2003 to 2018.

Based on a simple autocorrelation analysis on the gridded data, we decided to use measurements for every ten days in order to reduce the temporal dependence among the data. At each location of a float on a particular day, the average temperature between 0 and 5 hPa from the float is regarded as a measurement. The Argo float dataset provides multiple versions of data, and we adopted the quality controlled (QC) version. Eventually we have two-dimensional functional data collected on n = 144 days, where the number of observed locations T_{ij} = (longitude, latitude) per day varies from 7 to 47, i.e., 7 ≤ m_i ≤ 47, i = 1, ..., n, with an average of 21.83. The locations are rescaled to [0, 1] × [0, 1].

Figure 2: Observations on 2013/09/04 (left), and all observations in the data set (right). Points on the map indicate locations (longitude, latitude) of observations, and the color scale of every point shows the corresponding Celsius temperature.

First we used kernel ridge regression with the corresponding kernel for the tensor product of two second-order Sobolev spaces (e.g., Wong and Zhang, 2019) to obtain a mean function estimate for every month. Then we applied the proposed covariance function estimator with the same kernel. The estimates of the top two two-dimensional L² eigenfunctions are illustrated in Figure 3. The first eigenfunction shows the east-west dipole mode, which aligns with existing scientific findings (e.g., Shinoda et al., 2004; Chu et al., 2014; Deser et al., 2010). The second eigenfunction can be interpreted as the basin-wide mode, which is a dominant mode all around the year (e.g., Deser et al., 2010; Chu et al., 2014).

To provide a clearer understanding of the covariance function structure, we derived a marginal L² basis along longitude and latitude, respectively. The details are given in Appendix A. The left panel of Figure 4 demonstrates that the first longitudinal marginal basis function reflects a large variation in the western region, while the second one corresponds to the variation in the eastern region. Due to different linear combinations, the variation along longitude may contribute not only to opposite changes between the eastern and western sides of the Indian Ocean, as shown in the first two-dimensional eigenfunction, but also to an overall warming or cooling tendency, as shown in the second two-dimensional eigenfunction. The second longitudinal marginal basis function reveals that the closer to the east boundary, the greater the variation is, which suggests that the IOD may be related to the Pacific Ocean. This aligns with the evidence that the IOD has a link with the El Niño Southern Oscillation (ENSO) (Stuecker et al., 2017), an irregularly periodic variation in sea surface temperature over the tropical eastern Pacific Ocean. As shown in the right panel of Figure 4, the overall trend of the first latitude marginal basis function is almost a constant function. This provides evidence that the IOD is primarily associated with the variation along longitude.

Figure 3: The first (left) and second (right) two-dimensional L² eigenfunctions. The first eigenfunction explains 33.60% of the variance and the second eigenfunction explains roughly 25% of the variance.

Figure 4: L² marginal basis functions along longitude (left) and latitude (right). Solid lines are the first marginal basis function and dotted lines are the second marginal basis function. The fractions of variation explained by the corresponding principal components are given in parentheses.
Algorithm 1: Accelerated ADMM for solving (11)

Input: V̂_k^{(0)} ∈ R^{q_1×···×q_{2p}}, k = 0, 1, ..., p, and B^{(0)} ∈ R^{q_1×···×q_{2p}} such that V̂_{0,□}^{(0)} and B_□^{(0)} are symmetric matrices; M_k = [M_{1,k}^⊤, ..., M_{n,k}^⊤]^⊤, k = 1, ..., p; Z_i = (Z_{ijj'})_{1≤j,j'≤m}, i = 1, ..., n; Ĩ = [1(j ≠ j')]_{1≤j,j'≤m}; step size η > 0; maximum number of iterations T.

Initialization:
  α_k^{(0)} ← 1, D_k^{(−1)} ← B^{(0)}, D̂_k^{(0)} ← B^{(0)}, V_k^{(−1)} ← V̂_k^{(0)}, k = 0, 1, ..., p.
  L_i ← [M_{i,1}^⊤ ⊙ M_{i,2}^⊤ ⊙ ··· ⊙ M_{i,p}^⊤]^⊤, i = 1, ..., n, where ⊙ is the Khatri-Rao product, defined as A ⊙ B = [a_1 ⊗ b_1, ..., a_r ⊗ b_r] ∈ R^{r_a r_b × r} for A ∈ R^{r_a × r} and B ∈ R^{r_b × r}, with a_i and b_i the i-th columns of A and B respectively.
  G ← {nm(m−1)}^{-1} Σ_{i=1}^n (L_i ⊗ L_i)^⊤ diag(vec(Ĩ)) (L_i ⊗ L_i)
  h ← {nm(m−1)}^{-1} Σ_{i=1}^n (L_i ⊗ L_i)^⊤ diag(vec(Ĩ)) vec(Z_i)
  Q ← {2(G + (p+1)η/2 · I)}^{-1}

for t = 0, 1, ..., T do
  vec(B_□^{(t+1)}) ← Q {2h + η Σ_{k=0}^p vec([D̂_k^{(t)} − V̂_k^{(t)}]_□)}
  for k = 0, 1, ..., p do
    if k = 0 then D_0^{(t)} ← prox^+_{λβ/η}(B^{(t+1)} + V̂_0^{(t)})
    else D_k^{(t)} ← prox^k_{λ(1−β)/(pη)}(B^{(t+1)} + V̂_k^{(t)})
    end
    V_k^{(t)} ← V̂_k^{(t)} + B^{(t+1)} − D_k^{(t)}
    α_k^{(t+1)} ← {1 + (1 + 4(α_k^{(t)})²)^{1/2}} / 2
    D̂_k^{(t+1)} ← D_k^{(t)} + {(α_k^{(t)} − 1)/α_k^{(t+1)}} (D_k^{(t)} − D_k^{(t−1)})
    V̂_k^{(t+1)} ← V_k^{(t)} + {(α_k^{(t)} − 1)/α_k^{(t+1)}} (V_k^{(t)} − V_k^{(t−1)})
  end
  Stop if the change in the objective value is less than the tolerance.
end

Output: D_0^{(T)}
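The following is a simplified matrix analogue of the splitting (12)-(14): p = 1, an elementwise observation operator standing in for ℓ̃_square, plain (non-accelerated) scaled ADMM, and no tensor unfoldings. It is a sketch of the update pattern only, not the paper's Algorithm 1, and all names and constants are illustrative.

```python
import numpy as np

def svt(M, tau):
    """Singular value soft-thresholding: prox of tau*||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def psd_svt(M, tau):
    """Symmetrize, soft-threshold eigenvalues, and keep only the nonnegative part."""
    w, V = np.linalg.eigh(0.5 * (M + M.T))
    return V @ np.diag(np.maximum(w - tau, 0.0)) @ V.T

rng = np.random.default_rng(7)
q, lam, beta, eta = 20, 0.1, 0.5, 1.0
truth = (lambda F: F @ F.T)(rng.standard_normal((q, 3)))        # low-rank PSD target
mask = rng.uniform(size=(q, q)) < 0.4                           # observed entries
Z = truth + 0.1 * rng.standard_normal((q, q))                   # noisy observations

B = np.zeros((q, q)); D0 = B.copy(); D1 = B.copy()
V0 = np.zeros_like(B); V1 = np.zeros_like(B)
for _ in range(200):
    # B-update: elementwise closed form of the quadratic subproblem (analogue of (14a)).
    B = (mask * Z + eta * (D0 - V0 + D1 - V1)) / (mask + 2 * eta)
    D0 = psd_svt(B + V0, lam * beta / eta)                       # analogue of (14b)
    D1 = svt(B + V1, lam * (1 - beta) / eta)                     # analogue of (14c)
    V0 += B - D0; V1 += B - D1                                   # analogue of (14d)

print(np.linalg.matrix_rank(D0, tol=1e-6),
      bool(np.all(np.linalg.eigvalsh(D0) >= -1e-8)))             # low rank and PSD
```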
Table 1: Simulation results for three Settings with the sparse design when the sample size ($n$) is 200. The AISE values with standard errors (SE) in parentheses are provided for the four covariance estimators in comparison, together with average two-way ranks ($\bar{R}$) for those estimators which can lead to rank reduction (i.e., mOpCov, OpCov, and ll-smooth+) and average one-way ranks ($\bar{r}_1$, $\bar{r}_2$) for mOpCov.

Appendix

A  $L^2$ eigensystem and $L^2$ marginal basis

In this section, we present a transformation procedure to produce $L^2$ eigenfunctions and corresponding eigenvalues from our estimator $\hat{B}$ obtained by (11).

Let $Q_k = [\int_{[0,1]} K(s, T_{ijk})K(s, T_{i'j'k})\,ds]_{1\le i,i'\le n,\,1\le j,j'\le m}$, $k=1,\ldots,p$. Then $Q_k = M_k R_k M_k^{\top}$, where $R_k = [\int_{[0,1]} v_l(s)v_h(s)\,ds]_{1\le l,h\le q_k}$ and $\{v_l: l=1,\ldots,q_k\}$ form a basis of $\mathcal{H}_k$, so $R_k = M_k^{+} Q_k (M_k^{+})^{\top}$. The $L^2$ eigenvalues of $\hat{\Gamma}_{\square}$ coincide with the eigenvalues of the matrix $\hat{B}^{L^2}_{\mathrm{square}} := (R_1\otimes\cdots\otimes R_p)^{1/2}\,\hat{B}_{\square}\,[(R_1\otimes\cdots\otimes R_p)^{1/2}]^{\top}$, and the number of nonzero eigenvalues is the same as the rank of $\hat{B}_{\square}$. The $L^2$ eigenfunction $\hat{\phi}_l$ that corresponds to the $l$-th eigenvalue of $\hat{\Gamma}_{\square}$ can be expressed as $\hat{\phi}_l(s_1,\ldots,s_p) = u_l^{\top}[z_1(s_1)\otimes\cdots\otimes z_p(s_p)]$, where $z_k(\cdot)$, $k=1,\ldots,p$, are defined in Theorem 1, and $u_l = (M_1^{+}\otimes\cdots\otimes M_p^{+})^{\top}(R_1\otimes\cdots\otimes R_p)^{-1/2}v_l$ with $v_l$ the $l$-th eigenvector of the matrix $\hat{B}^{L^2}_{\mathrm{square}}$. Using the properties of Kronecker products, we have $\hat{\phi}_l(s_1,\ldots,s_p) = v_l^{\top}[(R_1^{-1/2}M_1^{+}z_1(s_1))\otimes\cdots\otimes(R_p^{-1/2}M_p^{+}z_p(s_p))]$.

By simple verification, we can see that $R_k^{-1/2}M_k^{+}z_k(\cdot)$ gives $q_k$ one-dimensional orthonormal $L^2$ functions for dimension $k$, $k=1,\ldots,p$. Therefore, we can also express $\hat{\Gamma}$ with these one-dimensional $L^2$ bases, and the coefficients form a $2p$-th order tensor of dimension $q_1\times\cdots\times q_p\times q_1\times\cdots\times q_p$. We use $\hat{B}^{L^2}$ to denote this new coefficient tensor and extend our unfolding operators to the $L^2$ space. It is easy to see that $\hat{B}^{L^2}_{\square} = \hat{B}^{L^2}_{\mathrm{square}}$.

Since $\hat{\Gamma}_{(k)}$ is a compact operator in the $L^2$ space, it admits a singular value decomposition, which leads to an $L^2$ basis characterizing the marginal variation along the $k$-th dimension. We call it an $L^2$ marginal basis for the $k$-th dimension. The marginal basis function $\hat{\psi}_{kl}$ corresponding to the $l$-th singular value for dimension $k$ can be expressed as $\hat{\psi}_{kl}(\cdot) = u_{kl}^{\top}z_k(\cdot)$, where $u_{kl} = (M_k^{+})^{\top}R_k^{-1/2}v_{kl}$ and $v_{kl}$ is the $l$-th singular vector of $\hat{B}^{L^2}_{(k)}$. The $L^2$ marginal singular values of $\hat{\Gamma}_{(k)}$ coincide with the singular values of the matrix $\hat{B}^{L^2}_{(k)}$.

B  Definitions of $\kappa_{n,m}$ and $\eta_{n,m}$

Here we provide the specific forms of $\kappa_{n,m}$ and $\eta_{n,m}$, which are closely related to the decay of $\{\mu_l\mu_h: l,h=1,\ldots\}$. Specifically, $\kappa_{n,m}$ is defined as the smallest positive $\kappa$ such that
$$cb\Big\{\frac{2}{n(m-1)}\sum_{l,h=1}^{\infty}\min\big\{\kappa^2,\mu_l\mu_h\big\}\Big\}^{1/2} \le \kappa^2, \qquad cb\Big\{\frac{2}{n(m-1)}\sum_{l,h=1}^{\infty}\min\big\{\kappa^2/b^2,\mu_l\mu_h\big\}\Big\}^{1/2} \le \kappa^2, \qquad (17)$$
where $c$ is a universal constant, and $\eta_{n,m}$ is defined as the smallest positive $\eta$ such that
$$c_{\eta}\Big\{\frac{1}{nm}\sum_{l,h=1}^{\infty}\min\{\eta^2,\mu_l\mu_h\} + \frac{\eta^2}{n}\Big\}^{1/2} \le \eta^2, \qquad (18)$$
where $c_{\eta}$ is a constant depending on $b$, $b_X$ and $b_{\epsilon}$. The existence of $\kappa_{n,m}$ and $\eta_{n,m}$ is shown in the proof of Theorem 2.
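To make the transformation in Appendix A concrete, here is a small numerical sketch for $p=2$, based on the reconstruction above: it forms $(R_1\otimes R_2)^{1/2}\,\hat B_\square\,[(R_1\otimes R_2)^{1/2}]^\top$ and eigendecomposes it to obtain the $L^2$ eigenvalues and the eigenvector coefficients. The inputs (`B_square`, `R1`, `R2`) are random placeholders rather than quantities produced by the actual estimator.

```python
import numpy as np

def matrix_sqrt(R):
    # Symmetric square root via eigendecomposition (R assumed symmetric PSD).
    w, V = np.linalg.eigh(R)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def l2_eigensystem(B_square, R1, R2):
    # L2 eigenvalues/eigenvectors of the estimated covariance for p = 2:
    # eigendecompose (R1 kron R2)^{1/2} B_square [(R1 kron R2)^{1/2}]^T.
    S = matrix_sqrt(np.kron(R1, R2))
    M = S @ B_square @ S.T
    M = (M + M.T) / 2          # guard against round-off asymmetry
    evals, evecs = np.linalg.eigh(M)
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]

# Toy illustration with small random inputs (q1 = 3, q2 = 2).
rng = np.random.default_rng(2)
q1, q2 = 3, 2
A = rng.standard_normal((q1 * q2, q1 * q2))
B_square = A @ A.T            # plays the role of a symmetric coefficient matrix
R1 = np.eye(q1) + 0.1 * np.ones((q1, q1))
R2 = np.eye(q2) + 0.1 * np.ones((q2, q2))
evals, evecs = l2_eigensystem(B_square, R1, R2)
print(np.round(evals, 3))
```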
Supplementary Material

In the supplementary material related to this paper, we provide proofs of our theoretical findings and additional simulation results.
Acknowledgement
The research of Raymond K. W. Wong is partially supported by the U.S. National Science Foundation under grants DMS-1806063, DMS-1711952 (subcontract) and CCF-1934904. The research of Xiaoke Zhang is partially supported by the U.S. National Science Foundation under grant DMS-1832046. Portions of this research were conducted with high performance research computing resources provided by Texas A&M University (https://hprc.tamu.edu).

References
Abernethy, J., F. Bach, T. Evgeniou, and J.-P. Vert (2009). A new approach to collaborative filter-ing: Operator estimation with spectral regularization.
Journal of Machine Learning Research 10, 803–826.
Allen, G. I. (2013). Multi-way functional principal components analysis. In , pp. 220–223. IEEE.
Bartlett, P. L., O. Bousquet, and S. Mendelson (2005). Local Rademacher complexities.
The Annals of Statistics 33(4), 1497–1537.
Behera, S. K. and T. Yamagata (2003). Influence of the Indian Ocean Dipole on the Southern Oscillation.
Journal of the Meteorological Society of Japan. Ser. II 81 (1), 169–177.Boyd, S., N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2010). Distributed optimization andstatistical learning via the alternating direction method of multipliers.
Foundations and Trends® in Machine Learning 3(1), 1–122.
Cai, T. T. and M. Yuan (2010). Nonparametric covariance function estimation for functional and longitudinal data. Technical report, Georgia Institute of Technology, Atlanta, GA.
Cai, T. T. and M. Yuan (2012). Minimax and adaptive prediction for functional linear regression. Journal of the American Statistical Association 107(499), 1201–1216.
Chen, K., P. Delicado, and H.-G. Müller (2017). Modelling function-valued stochastic processes, with applications to fertility dynamics.
Journal of the Royal Statistical Society: Series B (Sta-tistical Methodology) 79 (1), 177–196.Chen, K. and H.-G. M¨uller (2012). Modeling repeated functional observations.
Journal of theAmerican Statistical Association 107 (500), 1599–1609.Chen, L.-H. and C.-R. Jiang (2017). Multi-dimensional functional principal component analysis.
Statistics and Computing 27(5), 1181–1192.
Chu, J.-E., K.-J. Ha, J.-Y. Lee, B. Wang, B.-H. Kim, and C. E. Chung (2014). Future change of the Indian Ocean basin-wide and dipole modes in the CMIP5.
Climate Dynamics 43(1-2), 535–551.
Deser, C., M. A. Alexander, S.-P. Xie, and A. S. Phillips (2010). Sea surface temperature variability: Patterns and mechanisms.
Annual Review of Marine Science 2, 115–143.
Ferraty, F. and P. Vieu (2006).
Nonparametric functional data analysis: theory and practice .Springer, New York.Goldsmith, J., J. Bobb, C. M. Crainiceanu, B. Caffo, and D. Reich (2011). Penalized functionalregression.
Journal of Computational and Graphical Statistics 20 (4), 830–851.Gu, C. (2013).
Smoothing Spline ANOVA Models (2nd ed.). New York: Springer.Hackbusch, W. (2012).
Tensor spaces and numerical tensor calculus , Volume 42. Springer Science& Business Media.Halko, N., P.-G. Martinsson, and J. A. Tropp (2009). Finding structure with randomness:Stochastic algorithms for constructing approximate matrix decompositions. arXiv preprintarXiv:0909.4061 .Hall, P. and C. Vial (2006). Assessing the finite dimensionality of functional data.
Journal of theRoyal Statistical Society: Series B 68 (4), 689–705.Horv´ath, L. and P. Kokoszka (2012).
Inference for functional data with applications , Volume 200.Springer, New York.Hsing, T. and R. Eubank (2015).
Theoretical foundations of functional data analysis, with anintroduction to linear operators . John Wiley & Sons.Huang, J. Z., H. Shen, and A. Buja (2009). The analysis of two-way functional data using two-way regularized singular value decompositions.
Journal of the American Statistical Associa-tion 104 (488), 1609–1620.James, G. M., T. J. Hastie, and C. A. Sugar (2000). Principal component models for sparsefunctional data.
Biometrika 87 (3), 587–602.Kadkhodaie, M., K. Christakopoulou, M. Sanjabi, and A. Banerjee (2015). Accelerated alternat-ing direction method of multipliers. In
Proceedings of the 21th ACM SIGKDD internationalconference on knowledge discovery and data mining , pp. 497–506. ACM.Kokoszka, P. and M. Reimherr (2017).
Introduction to functional data analysis . CRC Press.Koltchinskii, V. (2011).
Oracle Inequalities in Empirical Risk Minimization and Sparse RecoveryProblems: Ecole d’Et´e de Probabilit´es de Saint-Flour XXXVIII-2008 , Volume 2033. SpringerScience & Business Media.Li, B. and J. Song (2017). Nonlinear sufficient dimension reduction for functional data.
The Annalsof Statistics 45 (3), 1059–1095.Li, Y. and T. Hsing (2010). Uniform convergence rates for nonparametric regression and principalcomponent analysis in functional/longitudinal data.
The Annals of Statistics 38(6), 3321–3351.
Liebl, D. (2019). Inference for sparse and dense functional data with covariate adjustments.
Journalof Multivariate Analysis 170 , 315–335.Lynch, B. and K. Chen (2018). A test of weak separability for multi-way functional data, withapplication to brain connectivity studies.
Biometrika 105 (4), 815–831.Mazumder, R., T. Hastie, and R. Tibshirani (2010). Spectral regularization algorithms for learninglarge incomplete matrices.
Journal of machine learning research 11 (Aug), 2287–2322.Mercer, J. (1909). Xvi. functions of positive and negative type, and their connection the theory ofintegral equations.
Philosophical transactions of the royal society of London. Series A, containingpapers of a mathematical or physical character 209 (441-458), 415–446.Park, S. Y. and A.-M. Staicu (2015). Longitudinal functional data analysis.
Stat 4 (1), 212–226.Paul, D. and J. Peng (2009). Consistency of restricted maximum likelihood estimators of principalcomponents.
The Annals of Statistics 37 (3), 1229–1271.Pearce, N. D. and M. P. Wand (2006). Penalized splines and reproducing kernel methods.
TheAmerican Statistician 60 (3), 233–240.Poskitt, D. S. and A. Sengarapillai (2013). Description length and dimensionality reduction infunctional data analysis.
Computational Statistics & Data Analysis 58 , 98–113.Ramsay, J. and B. Silverman (2005).
Functional data analysis . Springer, New York.Raskutti, G., M. J. Wainwright, and B. Yu (2012). Minimax-optimal rates for sparse additive modelsover kernel classes via convex programming.
The Journal of Machine Learning Research 13 , 389–427.Reimherr, M., B. Sriperumbudur, and B. Taoufik (2018). Optimal prediction for additive function-on-function regression.
Electronic Journal of Statistics 12 (2), 4571–4601.Rice, J. A. and B. W. Silverman (1991). Estimating the mean and covariance structure non-parametrically when the data are curves.
Journal of the Royal Statistical Society: Series B(Methodological) 53 (1), 233–243.Shamshoian, J., D. Senturk, S. Jeste, and D. Telesca (2019). Bayesian analysis of multidimensionalfunctional data. arXiv preprint arXiv:1909.08763 .Shinoda, T., H. H. Hendon, and M. A. Alexander (2004). Surface and subsurface dipole variabilityin the indian ocean and its relation with enso.
Deep Sea Research Part I: Oceanographic ResearchPapers 51 (5), 619–635.Stuecker, M. F., A. Timmermann, F.-F. Jin, Y. Chikamoto, W. Zhang, A. T. Wittenberg, E. Widi-asih, and S. Zhao (2017). Revisiting enso/indian ocean dipole phase relationships.
GeophysicalResearch Letters 44 (5), 2481–2492.Sun, X., P. Du, X. Wang, and P. Ma (2018). Optimal penalized function-on-function regressionunder a reproducing kernel hilbert space framework.
Journal of the American Statistical Association 113(524), 1601–1611.
Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis.
Psychometrika 31 (3),279–311.Ummenhofer, C. C., M. H. England, P. C. McIntosh, G. A. Meyers, M. J. Pook, J. S. Risbey,A. S. Gupta, and A. S. Taschetto (2009). What causes southeast australia’s worst droughts?
Geophysical Research Letters 36 (4).Wahba, G. (1990).
Spline Models for Observational Data . Philadelphia: SIAM.Wang, W.-T. and H.-C. Huang (2017). Regularized principal component analysis for spatial data.
Journal of Computational and Graphical Statistics 26 (1), 14–25.Wong, R. K. W., Y. Li, and Z. Zhu (2019). Partially linear functional additive models for multi-variate functional data.
Journal of the American Statistical Association 114 (525), 406–418.Wong, R. K. W. and X. Zhang (2019). Nonparametric operator-regularized covariance functionestimation for functional data.
Computational Statistics & Data Analysis 131 , 131–144.Xiao, L., Y. Li, and D. Ruppert (2013). Fast bivariate p-splines: the sandwich smoother.
Journalof the Royal Statistical Society: Series B (Statistical Methodology) 75 (3), 577–599.Yao, F., H.-G. M¨uller, and J.-L. Wang (2005). Functional data analysis for sparse longitudinaldata.
Journal of the American Statistical Association 100 (470), 577–590.Yuan, M. and T. T. Cai (2010). A reproducing kernel hilbert space approach to functional linearregression.
The Annals of Statistics 38 (6), 3412–3444.Zhang, L., H. Shen, and J. Z. Huang (2013). Robust regularized singular value decomposition withapplication to mortality data.
The Annals of Applied Statistics 7 (3), 1540–1561.Zhang, X. and J.-L. Wang (2016). From sparse to dense functional data and beyond.
The Annalsof Statistics 44 (5), 2281–2321.Zhou, L. and H. Pan (2014). Principal component analysis of two-dimensional functional data.
Journal of Computational and Graphical Statistics 23 (3), 779–801.Zhu, H., F. Yao, and H. H. Zhang (2014). Structured functional additive regression in reproducingkernel hilbert spaces.
Journal of the Royal Statistical Society: Series B 76 (3), 581–603.Zipunnikov, V., B. Caffo, D. M. Yousem, C. Davatzikos, B. S. Schwartz, and C. Crainiceanu(2011). Multilevel functional principal component analysis for high-dimensional data.
Journal of Computational and Graphical Statistics 20(4), 852–873.

Supplementary material for “Low-Rank Covariance Function Estimation for Multidimensional Functional Data”
Jiayi Wang , Raymond K. W. Wong ∗ , and Xiaoke Zhang † Department of Statistics, Texas A&M University Department of Statistics, George Washington University
September 1, 2020
S1 Proofs
S1.1 Proof of Theorem 1
For any $\Gamma \in \mathcal{G}$, we can decompose it into two orthogonal parts $\Gamma_1$ and $\Gamma_2$ such that $\Gamma_1 \in \mathcal{G}(L_{n,m})$ and $\Gamma_2 \in (\mathcal{G}(L_{n,m}))^{\perp}$. Since the loss function $\ell(\Gamma_1)$ only depends on the data, it suffices to show that $\Psi_0(\Gamma_{\square}) \ge \Psi_0(\Gamma_{1,\square})$ and $\Psi_k(\Gamma_{(k)}) \ge \Psi_k(\Gamma_{1,(k)})$ for $k=1,\ldots,p$. Below we follow two steps to prove this.

Step 1. Take $\mathcal{H}(L_{n,m}) := \bigotimes_{k=1}^{p}\mathcal{K}_k$. Since we require $\Gamma \in \mathcal{M}_{+}$, we first show that $\Gamma_{1,\square} = \Gamma^{\top}_{1,\square}$ and $\langle \Gamma_{1,\square}f, f\rangle_{\mathcal{H}} \ge 0$ for any $f \in \mathcal{H}$. Note that $\Gamma_{\square} = \Gamma^{\top}_{\square}$, so $\Gamma_{\square} = (\Gamma_{1,\square}+\Gamma_{2,\square}) = (\Gamma^{\top}_{1,\square}+\Gamma^{\top}_{2,\square})$. Since $\Gamma^{\top}_{1,\square} \in \mathcal{H}(L_{n,m})\otimes\mathcal{H}(L_{n,m})$ and $\Gamma^{\top}_{2,\square} \in (\mathcal{H}(L_{n,m})\otimes\mathcal{H}(L_{n,m}))^{\perp}$, we have $\Gamma_{1,\square} = (\Gamma_{1,\square}+\Gamma^{\top}_{1,\square})/2$ and $\Gamma_{2,\square} = (\Gamma_{2,\square}+\Gamma^{\top}_{2,\square})/2$. Thus $\Gamma_{1,\square} = \Gamma^{\top}_{1,\square}$ and $\Gamma_{2,\square} = \Gamma^{\top}_{2,\square}$.

By the definition of $\Gamma_2$, $\langle \Gamma_{2,\square}g, g\rangle_{\mathcal{H}} = 0$ for any $g \in \mathcal{H}(L_{n,m})$, so we have $0 \le \langle \Gamma_{\square}g, g\rangle_{\mathcal{H}} = \langle \Gamma_{1,\square}g, g\rangle_{\mathcal{H}} + \langle \Gamma_{2,\square}g, g\rangle_{\mathcal{H}} = \langle \Gamma_{1,\square}g, g\rangle_{\mathcal{H}}$. Moreover, the definition of $\Gamma_1$ leads to $\langle \Gamma_{1,\square}g, g\rangle_{\mathcal{H}} = 0$ for any $g \in (\mathcal{H}(L_{n,m}))^{\perp}$. Hence $\langle \Gamma_{1,\square}f, f\rangle_{\mathcal{H}} \ge 0$ for any $f \in \mathcal{H}$.

Step 2. Next we show that, for all $k$, $\lambda_k(\Gamma_{\square}) \ge \lambda_k(\Gamma_{1,\square})$ and $\lambda_k(\Gamma_{(j)}) \ge \lambda_k(\Gamma_{1,(j)})$ with $j=1,\ldots,p$. Let $P_{\mathcal{H}(L_{n,m})}$ be the projection operator onto the space $\mathcal{H}(L_{n,m})$ and denote the adjoint of an operator $A$ by $A^{*}$. Then we have
$$\lambda_k(\Gamma_{1,\square}) = \lambda_k\big(P_{\mathcal{H}(L_{n,m})}\Gamma_{\square}P_{\mathcal{H}(L_{n,m})}\big) \le \lambda_k\big(\Gamma_{\square}P_{\mathcal{H}(L_{n,m})}\big) = \lambda_k\big(P_{\mathcal{H}(L_{n,m})}\Gamma^{*}_{\square}\big) \le \lambda_k(\Gamma^{*}_{\square}) = \lambda_k(\Gamma_{\square}).$$
Let $P_{\mathcal{K}_j}$ denote the projection operator onto the space $\mathcal{K}_j$ and $P_{\mathcal{K}_{-j}}$ the projection operator onto the space $\bigotimes_{k=1,k\ne j}^{2p}\mathcal{K}_k$, where $\mathcal{K}_{p+k} = \mathcal{K}_k$, $j=1,\ldots,p$. Then
$$\lambda_k(\Gamma_{1,(j)}) = \lambda_k\big(P_{\mathcal{K}_j}\Gamma_{(j)}P_{\mathcal{K}_{-j}}\big) \le \lambda_k\big(\Gamma_{(j)}P_{\mathcal{K}_{-j}}\big) = \lambda_k\big(P_{\mathcal{K}_{-j}}\Gamma^{*}_{(j)}\big) \le \lambda_k\big(\Gamma^{*}_{(j)}\big) = \lambda_k\big(\Gamma_{(j)}\big).$$
Therefore, $\Psi_0(\Gamma_{\square}) \ge \Psi_0(\Gamma_{1,\square})$ and $\Psi_k(\Gamma_{(k)}) \ge \Psi_k(\Gamma_{1,(k)})$ for $k=1,\ldots,p$.
For notational simplicity, we do not adopt different notations for the fully folded and squarelyunfolded versions of operators (functions) in this section.Write ∆ = ˆΓ − Γ and e ( T ij , T ij ) = ( X i ( T ij ) + (cid:15) ij )( X i ( T ij ) + (cid:15) ij ) − Γ ( T ij , T ij ). From(7), we obtain the following basic inequality: k ∆ k n,m + λI (ˆΓ) ≤ h e, ∆ i n,m + λI (Γ ) . (S1)The term h e, ∆ i n,m involved in (S1) plays a crucial role in the subsequent asymptotic analysis,so we will focus on this term first.Consider G s = { (Γ − Γ ) / { I (Γ) + I (Γ ) } : Γ ∈ G} . To bound h e, ∆ i n,m , we start withcontrolling sup g ∈G s h e, g i n,m . For any g ∈ G s , there exists a Γ ∈ G such that g = (Γ − Γ ) / { I (Γ) + I (Γ ) } . When Γ = Γ , k g k G = 0. Otherwise, k g k G = (cid:13)(cid:13)(cid:13)(cid:13) Γ − Γ I (Γ) + I (Γ ) (cid:13)(cid:13)(cid:13)(cid:13) G ≤ k Γ − Γ k G I (Γ − Γ ) ≤ k Γ − Γ k G k Γ − Γ k G = 1 , where the second inequality is due to that I (Γ) ≥ k Γ k G for any Γ ∈ G , and k · k G isHilbert–Schmidt norm of RKHS G . Take G = { g ∈ G : k g k G ≤ } . From the above,one can easily see that G s ⊆ G , and hence sup g ∈G s h e, g i n,m ≤ sup g ∈G h e, g i n,m for any e . Inthe later part of our analysis, we will bound sup g ∈G h e, g i n,m to control sup g ∈G s h e, g i n,m .First, we note that the functions residing in G are bounded: For any g ∈ G , by theproperty of reproducing kernel,sup g ∈G | g | ∞ ≤ sup ( x ,...,x p ) ∈ [0 , p K (( x , ..., x p ) , ( x , ..., x p )) ≤ b. Next we recall the definition of the sub-exponential norm of a random variable.
Definition S1.
For a random variable X, its sub-exponential norm is defined as k X k ψ = inf { λ > E (exp( | X | /λ )) ≤ } . If k X k ψ < ∞ , then we call X a sub-exponential random variable. Recall that L n,m = { T ijk : i = 1 , ..., n ; j = 1 , ..., m ; k = 1 , ..., p } . We write e ijj = e ( T ij , T ij ). For random variables A and B , we denote by k A | B k ψ the sub-exponential normof the random variable A conditional on B . The notation naturally extends to the case when B is a random vector or a set of random variables. By Lemma 3 in Wong and Zhang (2019),we can see that conditioned on L n,m , e ijj are sub-exponential random variables. Moreover,there exists a constant σ ψ , depending on b X and b (cid:15) , such that k e ijj | L n,m k ψ ≤ σ ψ .2ext we introduce the following random variables:ˆ Z n,m ( e, t ; G ) := sup { g ∈G : k g k n,m ≤ t } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) nm ( m − n X i =1 m X j = j e ijj g ( T ij , T ij ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , ˜ Z n,m ( e, t ; G ) := sup { g ∈G : k g k ≤ t } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) nm ( m − n X i =1 m X j = j e ijj g ( T ij , T ij ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . Our immediate goal is to bound ˆ Z n,m ( e, t ; G ), which will be achieved by bounding ˜ Z n,m ( e, t ; G ).We start with its expectation. Without loss of generality, we use c to denote all the universalconstants. Lemma S1.
There exists a constant c η > , depending on σ ψ and L , such that E (cid:20)n ˜ Z n,m ( e, t ; G ) o (cid:21) ≤ c η nm ∞ X l,h =1 min { t , µ l µ h } + t n . (S2) Proof.
A majority of the proof resembles that of Lemma 42 in Mendelson (2002), with ad-ditional arguments developed to control an important expectation term. Since the sam-ple field of X resides in H , we can decompose X ( t ) = P ∞ h =1 ζ h φ h ( t ) where E ( ζ h ζ h ) = E (cid:8) Γ ( T ij , T ij ) φ h ( T ij ) φ h ( T ij ) (cid:9) . For every s , t ∈ [0 , p , write Φ( s , t ) = (cid:0) √ µ l µ h φ l ( s ) φ h ( t ) (cid:1) ∞ l,h =1 .For two squarely summable sequences a = { a lh } ∞ l,h =1 and b = { b lh } ∞ l,h =1 , define their innerproduct and the 2-norm in the following: h a, b i = P ∞ l,h =1 a lh b lh and k a k = ( P ∞ l,h =1 a lh ) / .One can show that G = { g ( · , ? ) = h β, Φ( · , ? ) i : k β k ≤ } . Let B ( t ) = { β : k g k ≤ t } . It follows that g ∈ G ∩ B ( t ) if and only if β belongs to setΩ = { β : P ∞ l,h =1 β lh ( µ l µ h ) ≤ t , P ∞ l,h =1 β lh ≤ } . Let Ξ = { β : P ∞ l,h =1 β lh ν lh ≤ } , where ν lh = (min { , t /µ l µ h } ) − . We can see that Ξ ⊂ Ω ⊂ √ E (cid:16) ˜ Z n,m ( ω, t ; G ) (cid:17) (cid:16) n m ( m − E sup β ∈ Ξ h β, n X i =1 m X j = j e ijj Φ( T ij , T ij ) i . Next, E sup β ∈ Ξ * β, n X i =1 m X j = j e ijj Φ( T ij , T ij ) + = E sup β ∈ Ξ * ∞ X l,h =1 √ ν lh β lh , ∞ X l,h =1 √ µ l µ h √ ν lh n X i =1 m X j = j e ijj φ l ( T ij ) φ h ( T ij ) + ≤ E ∞ X l,h =1 µ l µ h ν lh n X i =1 m X j = j e ijj φ l ( T ij ) φ h ( T ij ) = n ∞ X l,h =1 µ l µ h ν lh E m X j = j e jj φ l ( T j ) φ h ( T j ) . E ( e ijj | L n,m ) = 0.It remains to bound E nP mj = j e jj φ l ( T j ) φ h ( T j ) o . Write U jj kk = e jj e kk φ l ( T j ) φ h ( T j ) φ l ( T k ) φ h ( T k ) . When j = k and j = k , U jj jj = E e jj φ l ( T j ) φ h ( T j ) = E (cid:2)(cid:8) E ( e jj | L n,m ) (cid:9) φ l ( T j ) φ h ( T j ) (cid:3) ≤ cσ ψ E (cid:8) φ l ( T j ) φ h ( T j ) (cid:9) = cσ ψ , where the inequality follows from the property of sub-exponential random variables and c isa universal constant. When j = k and j = k , U jj jk = E (cid:8) e jj e jk φ l ( T j ) φ h ( T j ) φ h ( T k ) (cid:9) ≤ E h(cid:8) E ( e jj | L n,m ) E ( e jk | L n,m ) (cid:9) / φ l ( T j ) φ h ( T j ) φ h ( T k ) i ≤ cσ ψ E (cid:8) φ l ( T j ) φ h ( T j ) φ h ( T k ) (cid:9) ≤ cσ ψ (cid:8) E φ h ( T j ) E φ h ( T k ) (cid:9) / ≤ cσ ψ . Similarly for j = k and j = k , U jj kj ≤ cσ ψ . When j = k and j = k , U jj kk = E (cid:8) E e jj φ l ( T j ) φ h ( T j ) | X (cid:9) = E (cid:2) E (cid:8) ( X ( T j ) + (cid:15) j )( X ( T j ) + (cid:15) j ) − Γ ( T j , T j ) (cid:9) φ l ( T j ) φ h ( T j ) | X (cid:3) = E (cid:2) E (cid:8) X ( T j ) X ( T j ) φ l ( T j ) φ h ( T j ) | X (cid:9) − E Γ ( T j , T j ) φ l ( T j ) φ h ( T j ) (cid:3) = E " E ∞ X g =1 ζ g φ g ( T j ) φ l ( T j ) | { ζ g : g ≥ } × E ∞ X g =1 ζ g φ g ( T j ) φ l ( T j ) | { ζ g : g ≥ } − E ζ l ζ h = E ( ζ l ζ h − E ζ l ζ h ) ≤ E (cid:0) ζ l ζ h (cid:1) . Putting together all these cases leads to ∞ X l,h =1 µ l µ h ν lh E m X j = j e jj φ l ( T j ) φ h ( T j ) ≤ ∞ X l,h =1 µ l µ h ν lh (cid:8) m ( m − cσ ψ + 3 m ( m − m − cσ ψ + m ( m − m − m − E (cid:0) ζ l ζ h (cid:1)(cid:9) ≤ c m cσ ψ ∞ X l,h =1 µ l µ h ν lh + m ∞ X l,h =1 µ l µ h ν lh E (cid:0) ζ l ζ h (cid:1) ≤ c m cσ ψ ∞ X l,h =1 min { t , µ l µ h } + m t ∞ X l,h =1 E (cid:0) ζ l ζ h (cid:1) . P ∞ l,h =1 E (cid:0) ζ l ζ h (cid:1) = E ( X ( T )) = L < ∞ , E n ˜ Z n,m ( e, t ; G ) o ≤ c η nm ∞ X l,h =1 min { t , µ l µ h } + t n . Next we derive the following concentration inequality for ˜ Z n,m ( e, t ; G ). Lemma S2.
There exists a universal constant c > and a constant c > depending on b and σ ψ , such that with probability at least − exp( − cnt / log n ) , we have ˜ Z n,m ( e, t ; G ) ≤ c n E ˜ Z n,m ( e, t ; G ) + c t o . Proof.
Write e i = (cid:8) e ijj : j = 1 , ..., m (cid:9) , T i = { T ij : j = 1 , ..., m } and f ( e i , T i ) = 1 m ( m − m X j = j e ijj g ( T ij , T ij ) . Note that E ( f ( e i , T i )) = 0. We adopt the Adamczak bound (Theorem 4 in Adamczaket al., 2008; Koltchinskii, 2011) to establish a concentration inequality for the unboundedclass F = { f : g ∈ G , k g k ≤ t } . To this end, we need to bound a variance term σ ( F ) :=sup f ∈F E ( f ( e , T )) and the sub-exponential norm of the envelope function F of the class F . For the variance term, σ ( F ) := sup k g k G ≤ , k g k ≤ t E f ( e , T ))= 1 m ( m − sup k g k G ≤ , k g k ≤ t E m X j = j e jj g ( T j , T j ) = 1 m ( m − m X j = j m X k = k sup k g k G0 ≤ , k g k ≤ t E ( E e jj e kk | T ) g ( T j , T j ) g ( T ik , T ik ) ≤ cσ ψ m ( m − m X j = j m X k = k sup k g k G ≤ , k g k ≤ t E g ( T j , T j ) g ( T ik , T ik ) ≤ cσ ψ m ( m − m X j = j m X k = k sup k g k G ≤ , k g k ≤ t (cid:8) E g ( T j , T j ) E g ( T ik , T ik ) (cid:9) / ≤ cσ ψ t . As for the envelope, k max i =1 ,...,n F ( e i , T i ) k ψ ≤ c max i =1 ,...,n k F ( e i , T i ) k ψ (log n ) ≤ cbm ( m − k m X j = j e ijj k ψ (log n ) ≤ cbσ ψ (log n ) , where the first inequality comes from Theorm 4 of Pisier (1983) and the second inequalityresults from g ( T ij , T ij ) ≤ b . The desired result then follows from Adamzack bound.5y Lemmas S1 and S2, we are able to bound ˜ Z n,m ( e, t ; G ). Then, we relate ˆ Z n,m ( e, t ; G )with ˜ Z n,m ( e, t ; G ) by Lemma S3 below. Recall that κ n,m is the smallest positive real number κ that fulfills the following inequalities cb Q ( κ/b ) ≤ κ , (S3)32 cbQ ( κ ) ≤ κ , (S4)where c is an universal constant that we do not specify and Q ( κ ) = n ( m − ∞ X l,h =1 min (cid:8) κ , µ l µ h (cid:9) / . Note that Q ( κ ) /κ → ∞ as κ →
0. Also, Q ( κ ) /κ is non-increasing in κ . Dividing both sides in(S3) and (S4) by κ , the resulting right hand side is an identity function, which is continuous,strictly increasing and is zero when κ = 0. Therefore κ n,m exists. Lemma S3.
We assume t ≥ κ n,m for all the following cases. For any g ∈ G , there existconstants M , M > , both depending on b , such that (cid:8) k g k n,m ≤ t (cid:9) ⊆ (cid:8) k g k ≤ M t (cid:9) , with probability at least − exp( − cnmκ n,m + log m ) , and (cid:8) k g k ≤ t (cid:9) ⊆ (cid:8) k g k n,m ≤ M t (cid:9) , with probability at least − exp( − cnmκ n,m + log m ) . Additionally, we have k g k − k g k n,m ≤ k g k , holds for all g ∈ G such that k g k > t , with probability at least − exp( − c p nmt + log m ) where c p is a constant depending on b .Proof. For 1 ≤ j, j ≤ m , we call ( j, j ) a pair formed by individuals j and j . When m iseven, by Lemma S4, we are able to partition the collection P = { ( j, j ) : 1 ≤ j < j ≤ m } into ( m −
1) groups G , ..., G m − , such that G k ∩ G k = ∅ for k = k , P = S m − k =1 G k ,card( G k ) = m/ k , and card( { ( j, j ) ∈ G k : j = ˜ j or j = ˜ j } ) = 1 for all ˜ j and k (i.e., noindividual occurs more than one time within a group), where card( A ) denotes the cardinalityof a set A . Therefore it is easy to see that the location pairs in { (cid:0) T ij , T ij (cid:1) : ( j, j ) ∈ G k } areindependent for any fixed k . As an illustration, suppose m = 4. Following the constructionrule in Lemma S4, we obtain three groups G = { (1 , , (2 , } , G = { (1 , , (3 , } and G = { (1 , , (2 , } .Consider the case when m is even. Take f G k ( T ) = nm P ni =1 P ( j,j ) ∈ G k g ( T ij , T ij ), k =1 , ..., m −
1. Note that the nm/ g ( T ij , T ij ) in f G k ( T ) all have expectation k g k ,and are independent due to the above grouping property. To relate k g k and f G k ( T ), we canapply Theorem 3.3 in Bartlett et al. (2005). 6ake R n,m ( t ; G k , G ) = nm sup { g ∈G : k g k ≤ t } | P ni =1 P ( j,j ) ∈ G k σ ijj g ( T ij , T ij ) | to be thecorresponding empirical local Rademacher complexity. By the well-known contraction in-equality and Lemma 42 in Mendelson (2002), it is simple to show that with some universalconstant c , E R n,m ( t ; G k , G ) ≤ b nm E sup g ∈G , k g k ≤ t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X ( j,j ) ∈ G k σ ijj g ( T ij , T ij ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ cb nm ∞ X l,h =1 min { t , µ l µ h } / ≤ cbQ ( t )Note that for ( j, j ) ∈ G k ,Var { g ( T ij , T ij ) } ≤ E { g ( T ij , T ij ) } ≤ b k g k ≤ b t . In Theorem 3.3 in Bartlett et al. (2005), we can take T ( g ) = b k g k , B = b and ψ ( r ) = cb Q ( r / /b ). We then verify a condition in Theorem 3.3 in Bartlett et al. (2005). For any t > b E R n,m ( t ; G k , G ) = 2 b nm E sup g ∈G ,T ( g ) ≤ b t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X ( j,j ) ∈ G k σ ijj g ( T ij , T ij ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ cb Q ( t ) , where the desired condition follows from taking r = b t . From the definition (S3) of κ n,m , wecan see that κ n,m is larger than the fixed point of ψ (i.e., the solution of ψ ( r ) = r ). Theorem3.3 in Bartlett et al. (2005) shows that k g k ≤ f G k ( T ) + 1408 b κ n,m + 2(11 b + 52 b ) κ n,m = 2 f G k ( T ) + (cid:18) b + 126 b (cid:19) κ n,m , holds for all g ∈ G , with probability at least 1 − exp( − nmκ n,m ). Also, f G k ( T ) ≤ k g k + 704 b κ n,m + 2(11 b + 26 b ) κ n,m = 2 k g k + (cid:18) b + 74 b (cid:19) κ n,m , holds for all g ∈ G , with probability at least 1 − exp( − nmκ n,m ).Recall that k g k n,m = m − P m − i =1 f G k ( T ). We proceed by taking union bounds of theprobability statements derived above, over f G , ..., f G m − . If t ≥ κ n,m , k g k ≤ k g k n,m + (cid:18) b + 126 b (cid:19) κ n,m ≤ M t , holds for all g ∈ G such that k g k n,m ≤ t , with probability at least 1 − ( m −
1) exp( − nmκ n,m ).Also, if t ≥ κ n,m , k g k n,m ≤ k g k + (cid:18) b + 74 b (cid:19) κ n,m ≤ M t , g ∈ G such that k g k ≤ t , with probability at least 1 − ( m −
1) exp( − nmκ n,m ).Here M , M > b .Now, we focus on k g k > t . By applying Theorem 2.1 in Bartlett et al. (2005), we obtainthe following inequality k g k − f G k ( T ) ≤ . k g k , holds for all g ∈ G such that k g k > t , with probability at least 1 − exp( − ( mn/ b ) t .Take a union bound over ( m −
1) groups, we will have k g k − k g k n,m ≤ . k g k , holds for all g ∈ G such that k g k > t , with probability at least 1 − ( m −
1) exp( − ( mn/ b ) t ).When m is odd, { ( j, j ) : 1 ≤ j < j ≤ m − } can be decomposed into ( m −
2) groups( G , . . . , G m − ) as described before, since m − { ( j, m ) : j = 1 , , ..., m − } which are not independent. k g k n,m = m − m m − m − X k =1 n X i =1 n ( m − X ( j,j ) ∈ G k g ( T ij , T ij ) + 2 m ( m − m − X j =1 n n X i =1 g ( T ij , T im )(S5)In the odd- m setting, we define f G k ( T ) = n ( m − P ni =1 P ( j,j ) ∈ G k g ( T ij , T ij ). We can applythe similar arguments derived for the even case (with m replaced by m − V j ( T ) = n P ni =1 g ( T ij , T im ) for a fixed 1 ≤ j ≤ m −
1. Note that E { g ( T ij , T im ) } = k g k and thesummands in V j ( T ) are independent.We still apply Theorem 3.3 in Bartlett et al. (2005). The local Rademacher complexitybecomes R ( t ; G ) = E ( n sup g ∈G , k g k ≤ t n X i =1 g ( T ij , T im ) ) ≤ cb n ∞ X l,h =1 min { t , µ l µ h } / . Take κ n to be the smallest positive real number κ that satisfies cb n ∞ X l,h =1 min { ( κ/b ) , µ l µ h } / ≤ κ . By Theorem 3.3 in Bartlett et al. (2005), it can be shown that k g k ≤ V j ( T ) + 1408 b κ n + 2(11 b + 52 b ) mκ n,m k g k /m ≤ V j ( T ) /m + 1408 b κ n /m + 2(11 b + 52 b ) κ n,m holds for all g ∈ G , with probability at least 1 − exp( − nmκ n,m ). Also, V j ( T ) /m ≤ k g k /m + 704 b κ n /m + 2(11( b ) + 26 b ) κ n,m , g ∈ G , with probability at least 1 − exp( − nmκ n,m ).Now, we take a union bound, and then combine it with the result for the first term in(S5). Since κ n /m ≤ κ n,m , we derive the following with assumption t ≥ κ n,m : k g k ≤ k g k n,m + 1408 b κ n,m (cid:18) m − m + 2 (cid:19) + 2(11 b + 52 b ) κ n,m (cid:18) m − m + 2 (cid:19) ≤ M t holds for all g ∈ G such that k g k ≤ t , with probability at least 1 − ( m − m − − nmκ n,m ). k g k n,m ≤ k g k + 704 b κ n,m (cid:18) m − m + 2 (cid:19) + 2(11( b ) + 26 b ) (cid:18) m − m + 2 (cid:19) κ n,m ≤ M t holds for all g ∈ G such that k g k < t , with probability at least 1 − ( m − m − − nmκ n,m ). Here M and M are some constants that depend on b .With similar argument, we will be able to derive for the odd case, k g k − k g k n,m ≤ . k g k , holds for all g ∈ G such that k g k > t , with probability at least 1 − ( m − m − − c p nmt ) for some constant c p = c p (1 /b ).With Lemmas S1, S2 and S3, we are now ready to prove Theorem 2. Recall the definitionof η n,m and ξ n,m . The term η n,m is defined as the smallest positive value η such that c η nm ∞ X l,h =1 min { η , µ l µ h } + η n / ≤ η , where c η > κ n,m , we can show that η n,m exists. By Lemma S1, we can show that E ˜ Z n,m ( e, t ; G ) ≤ q E [ { ˜ Z n,m ( e, t ; G ) } ] ≤ t for t ≥ η n,m .Take ξ n,m = min (cid:26) max { η n,m , κ n,m } , (cid:16) log nn (cid:17) / (cid:27) . We include the term (log n/n ) / mainlydue to the unboundedness of { e ijj } , which leads to the application of Adamzack bound(Lemma S2) instead of simpler forms of Talagrand’s concentration inequality. Proof for Theorem 2.
First, we study the crucial termˆ Z n,m ( e, b ; G ) = sup { g ∈G : k g k n,m ≤ b } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) nm ( m − n X i =1 X j = j e ijj g ( T ij , T ij ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , which is bounded by the maximum of ˆ Z n,m ( e, ξ n,m ; G ) andsup { g ∈G : k g k n,m >ξ n,m } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) nm ( m − n X i =1 X j = j e ijj g ( T ij , T ij ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , (S6)9o it suffices to study the rates of convergence of these two terms.For the rate of the maximum of ˆ Z n,m ( e, ξ n,m ; G ), by Lemmas S1, S2 and S3, we can showthat with probability at least 1 − exp( − cnξ n,m / log n ) for some universal constant c :ˆ Z n,m ( e, ξ n,m ; G ) ≤ ˜ Z n,m ( e, p M ξ n,m ; G ) ≤ c n E ˜ Z n,m ( e, p M ξ n,m ; G ) + c M ξ n,m o ≤ c (cid:8) M ξ n,m + c M ξ n,m (cid:9) ≤ Rξ n,m , where R = cM (1 + c ) and, the first, second and last inequalities are due to Lemmas S3, S2and S1 respectively.For the rate of the second term in (S6), we first prove the following result. For any r > ξ n,m , with probability at least 1 − exp( − cnξ n,m / log n ), we haveˆ Z n,m ( e, r ; G ) = rξ n,m sup { g ∈G : k g k G ≤ ξn,mr , k g k n,m ≤ ξ n,m } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) nm ( m − n X i =1 X j = j e ijj g ( T ij , T ij ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ rξ n,m ˆ Z n,m ( e, ξ n,m ; G ) ≤ rξ n,m Rξ n,m = Rrξ n,m . (S7)For b > ξ n,m , a direct application of the above result with r = b does not provide theincrement with respect to the empirical norm, and so we apply a peeling argument.Set S l := { g ∈ G : 2 l − ξ n,m ≤ k g k n,m ≤ l ξ n,m } , l = 1 , . . . , L , where L = log ( b/ξ n,m ). P sup { g ∈G : k g k n,m >ξ n,m } (cid:12)(cid:12)(cid:12) nm ( m − P ni =1 P j = j e ijj g ( T ij , T ij ) (cid:12)(cid:12)(cid:12) k g k n,m > Rξ n,m ≤ L X l =1 P sup g ∈ S l (cid:12)(cid:12)(cid:12) nm ( m − P ni =1 P j = j e ijj g ( T ij , T ij ) (cid:12)(cid:12)(cid:12) k g k n,m > Rξ n,m ≤ L X l =1 P (cid:16) ˆ Z n,m ( e, l ξ n,m ; G ) > R l − ξ n,m (cid:17) ≤ L exp (cid:18) − c n log n ξ n,m (cid:19) ≤ exp (cid:18) − c n log n ξ n,m (cid:19) , where the second last inequality results from (S7) by taking r = 2 l ξ n,m , and the universalconstant c in the last two lines could be different. For the last inequality, as long as 0 ≤ (log(log(1 /ξ n,m ))) / { n log n ξ n,m } ≤ c exists.Therefore, we have h e, g i n,m ≤ R ( ξ n,m + 2 k g k n,m ξ n,m ) , for every g ∈ G s ⊂ G , with probability at least 1 − exp( − cnξ n,m / log n ). With the sameprobability, we have h e, ∆ i n,m ≤ Rξ n,m n I (ˆΓ) + I (Γ ) o + 2 Rξ n,m k ∆ k n,m . (S8)10n below, we condition on the event (S8). From the basic inequality (S1), with λ = c λ ξ n,m such that c λ > R , k ∆ k n,m ≤ h e, ∆ i n,m + λ ( I (Γ ) − I (ˆΓ)) , k ∆ k n,m ≤ λI (Γ ) + 4 Rξ n,m k ∆ k n,m . Then we have k ∆ k n,m ≤ { c λ I (Γ ) } ξ n,m + 4 Rξ n,m and the proof is complete by taking L = 2 R .Next, we are ready to bound the L norm k ∆ k for ∆ = ˆΓ − Γ obtained by (7). Proof of Theorem 3.
From Lemma S3, we can see that k g k ≤ k g k n,m + ξ n,m , for all g ∈ G with probability at least 1 − exp( − c p ξ n,m ) for some universal constant c p = c p (1 /b ). So withthe same probability we have k ∆ k ≤ / k ∆ k n,m + ξ n,m n I (ˆΓ) + I (Γ ) o . In terms of LemmaS5, we are able to bound the regularization term I (ˆΓ) by a constant L , so finally we get k ∆ k ≤ h { c λ I (Γ ) } + 4 R i ξ n,m + { R + I (Γ ) } ξ n,m ≤ h { c λ I (Γ ) } + 4(2) R + R + I (Γ ) i ξ n,m . By taking L = 4(2) / R + R + I (Γ ), the proof is complete. Proof of Corollary 1.
By Lemma S6, the tensor product eigenvalue sequence has decay µ l (cid:16) ( l − α (log l ) α (2 p − ) as l → ∞ .By the definitions of κ n,m and η n,m , when m = O (cid:0) n / (2 α ) (log n ) p − − / (2 α ) (cid:1) , they are allof the same order, and so is ξ n,m . By Lemma S7, we can see that ξ n,m (cid:16) ( nm ) α/ (1+2 α ) (log nm ) α (2 p − / (2 α +1) . When n / (2 α ) (log n ) p − − α = O ( m ), log n/n will be the dominant term. From Theorems 2 and 3, we can see that k ˆΓ − Γ k n,m and k ˆΓ − Γ k are both of the same order. Overall, we have k ˆΓ − Γ k n,m , k ˆΓ − Γ k = O p (cid:18) ( nm ) − α α (log nm ) α (2 p − α +1 + log nn (cid:19) . S1.3 Auxiliary Lemmas
Lemma S4.
When m is even, we can decompose any collection of individual index pairs { ( j, j ) : 1 ≤ j < j ≤ m } into ( m − groups such that within each group, there are m/ pairs and no repeated individuals.Proof. First, we consider to construct a matrix G ∈ R m × m that satisfies following conditions:1. All the diagonal entries are zero;2. Every row and every column is a permutation of sequence { , , , ..., ( m − } ;3. It is symmetric. 11o begin with, we consider the cycle cyc = { , , ..., ( m − } and construct a sub-matrix G sub ∈ R ( m − × ( m − from it. For i − th row of G sub , we set it to be a sequence that startswith i in cyc and ends until it reaches ( m −
1) elements. For example, the first row will be[1 , , ..., ( m − , , ..., ( m − , m − m −
1) columns of G to be G sub and fill last row and last column of G withzeros. Then obviously G fulfills Conditions 2 and 3.To fulfill Condition 1, set G i,m and G m,i to be G ii and then set G ii = 0 for i = 1 , ..., ( m − m −
1) rows and first ( m −
1) columns,they are still permutations of sequence { , , , ..., ( m − } and symmetrization of G is notviolated. It remains to prove that last row and last column are also a permutation of thesequence, which is equivalent to proving the diagonal part of G sub is a permutation. In fact G sub ( i,i ) is (2 i − cyc , i = 1 , , ..., ( m − m is even, diagonalparts of G sub will cover the whole sequence { , , ..., ( m − } .So for every pair ( j, j ), 1 ≤ j < j ≤ m , we can assign it to Group G k where k = G j.j .In this way, we decompose the collection { ( j, j ) : 1 ≤ j < j ≤ m } into ( m −
1) groups where each group contains m/2 pairs.
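The grouping used in the proof of Lemma S4 can also be generated by the standard round-robin (circle) construction, which is equivalent to the matrix-based construction above. A minimal sketch for even m, with a hypothetical helper name, is given below; the assertions check that all pairs are covered and that no index repeats within a group.

```python
from itertools import combinations

def round_robin_groups(m):
    # Partition all pairs (j, j') with 1 <= j < j' <= m (m even) into m - 1 groups
    # of m/2 pairs each, so that no index appears twice within a group
    # (the circle method for round-robin scheduling).
    assert m % 2 == 0
    players = list(range(1, m + 1))
    groups = []
    for _ in range(m - 1):
        half = m // 2
        groups.append([tuple(sorted((players[i], players[m - 1 - i]))) for i in range(half)])
        # rotate all elements except the first one
        players = [players[0]] + [players[-1]] + players[1:-1]
    return groups

groups = round_robin_groups(6)
assert sorted(p for g in groups for p in g) == sorted(combinations(range(1, 7), 2))
assert all(len({x for p in g for x in p}) == 6 for g in groups)
print(groups[0])
```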
Under the same assumptions as Theorem 2, if λ = c λ ξ n with some constant c λ > R , then there exists a constant R depending on I (Γ ) , R and c λ , such that withprobability at least − exp( c n log n ξ n,m ) , we have I (ˆΓ) ≤ R . Proof.
From the basic inequality (S1), we have k ∆ k n,m + λI (ˆΓ) ≤ h e, ∆ i + λI (Γ ) , (S9) λI (ˆΓ) ≤ h e, ∆ i + λI (Γ ) . (S10)From Theorem 2, we know that h e, ∆ i ≤ Rξ n,m n I (ˆΓ) + I (Γ ) o + 2 Rξ n,m k ∆ k n,m , (S11)and k ∆ k n,m ≤ { c λ I (Γ ) } ξ n,m + 4 Rξ n,m . (S12)Therefore, plug (S12) into (S12), h e, ∆ i ≤ h R n I (ˆΓ) + I (Γ ) o + 2 R { c λ I (Γ ) } + 8 R i ξ n,m . By plugging in(S10), we have( c λ − R ) I (ˆΓ) ≤ RI (Γ ) + 4 R { c λ I (Γ ) } + 16 R + c λ I (Γ ) . Therefore, there exists a constant L , such that I (ˆΓ) ≤ RI (Γ ) + 4 R { c λ I (Γ ) } + 16 R + c λ I (Γ ) c λ − R ≤ R . emma S6. Suppose K ( · , · ) = K ( · , · ) = . . . K p ( · , · ) , then H = H = . . . = H p . If eigenval-ues of K k has decay µ ( k ) n (cid:16) ( n − s ) for some constant s. Then eigenvalues of the reproducingkernel for tensor product N pk =1 H k ⊗ N pk =1 H k will have decay µ n (cid:16) ( n − s (log n ) s (2 p − ) Proof.
A direct application of Theorem 1 (Krieg, 2018) completes the proof.
Lemma S7.
Take $t$ to be the solution of the equality
$$\frac{1}{\sqrt{nm}}\Big(\sum_{h=1}^{\infty}\min\big\{t^2,\mu_h\big\}\Big)^{1/2} = t^2,$$
where $\mu_h \asymp h^{-2\alpha}(\log h)^{2\alpha(2p-1)}$. Then, as $n\to\infty$ and $m\to\infty$, the solution satisfies $t \asymp (nm)^{-\alpha/(2\alpha+1)}(\log nm)^{\alpha(2p-1)/(2\alpha+1)}$.
Take N = nm , To find the order of t . We need to find l such that t (cid:16) l α (log l ) α (2 p − .From some simple analysis, we could see that when N → ∞ , t → l → ∞ . Therefore,when N → ∞ we have t (cid:16) l α (log l ) α (2 p − , t (cid:16) l α (log l ) − α (2 p − ,log (1 /t ) (cid:16) α log l − α (2 p −
1) log(log( l )) (cid:16) log l ,l (cid:16) t − /α (log(1 /t )) p − . It’s easy to see that t l (cid:16) ( t ) − /α (log(1 /t )) p − , P l ≥ l µ l (cid:16) O ( l α +1 (log l ) α (2 p − ) (cid:16)O ( t − /α (log(1 /t )) p − ).So P l ≥ l µ l = O ( ξ n l ), therefore √ t l √ N (cid:16) t ,N (cid:16) (1 /t ) /α (log(1 /t )) p − , log N (cid:16) (2 + 1 /α ) log (1 /ξ n ) + (2 p −
1) log log(1 /t ) (cid:16) log(1 /t ) , /t (cid:16) N α α (log N ) − α (2 p − α +1 ,t (cid:16) N − α α (log N ) α (2 p − α +1 . S2 Simulation
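As a quick numerical sanity check of Lemma S7 (using the reconstructed form of the defining equality above and ignoring the logarithmic factors, i.e., taking $\mu_h = h^{-2\alpha}$), one can solve the equality by bisection and compare the empirical exponent of the solution with $-\alpha/(2\alpha+1)$. The helper below is purely illustrative and is not part of the paper's code.

```python
import numpy as np

def t_star(N, alpha, H=100000):
    # Solve (1/sqrt(N)) * (sum_h min{t^2, mu_h})^{1/2} = t^2 for t by bisection,
    # with mu_h = h^{-2*alpha} (log factors ignored).
    mu = np.arange(1, H + 1, dtype=float) ** (-2 * alpha)
    f = lambda t: np.sqrt(np.minimum(t ** 2, mu).sum() / N) - t ** 2
    lo, hi = 1e-8, 1.0
    for _ in range(80):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return (lo + hi) / 2

alpha = 2.0
Ns = np.array([1e3, 1e4, 1e5, 1e6])
ts = np.array([t_star(N, alpha) for N in Ns])
slope = np.polyfit(np.log(Ns), np.log(ts), 1)[0]
print(slope, -alpha / (2 * alpha + 1))   # empirical exponent vs. -alpha/(2*alpha+1)
```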
S2.1 Eigenfunctions in different simulation settings
We present three simulation settings in a table form (Tables S1, S2 and S3). In each table,rows correspond to basis functions for dimension 1 and columns correspond to basis functions13or dimension 2. Recall that for each dimension, we use e k ( t ) = √ kπt ), k = 1 , , . . . asbasis. Then, for the cell with position row i and column j , it represents the two dimensionalfunction f ij ( s , s ) = e i ( s ) e j ( s ). We use a positive integer k to indicate that this twodimensional function is the k -th eigenfunction. The details of the three settings are given asfollows. Table S1: Eigenfunctions for Setting 1 e e e e e e e e e e e e e e e e e e e e e R = 6, r = 3, r = 2. For dimension 1, we use e , e and e as our basisfunctions. For dimension 2, we use e and e as our basis functions. Let 6 eigenfunctions ψ k be the tensor product of these one dimensional basis with eigenvalue decay λ k = 1 / ( k ), k =1 , , ...,
6. Eigenfunctions can be expressed as ψ k ( t , t ) = e i ( t ) e j ( t ), where k = 2( i −
1) + j for k = 1 , , , ψ ( s, t ) = e ( s ) e ( t ), ψ ( t , t ) = e ( t ) e ( t ). In this setting, R = r ∗ r ,one-way basis are mostly shared among different eigenfunctions.Setting 2: R = 6, r = r = 4. For both dimension 1 and dimension 2, we use e i , i = 1 , .., ψ k with eigenvalue decay λ k = 1 / ( k ), k = 1 , , ..., ψ k ( t , t ) = e i ( t ) e j ( t ), where k = 2( i −
1) + j for k = 1 , , ψ k ( t , t ) = e k − ( t ) e k − ( t ) for k = 4 , ,
6. In this setting, one-way basis are partly shared by differenteigenfunctions.Setting 3: R = r = r = 4. For both dimension 1 and dimension 2, we use e i , i = 1 , .., ψ k with eigenvalue decay λ k = 1 / ( k ), k = 1 , ..., ψ ( t , t ) = e ( t ) e ( t ), ψ ( t , t ) = e ( t ) e ( t ) and ψ k ( t , t ) = e k ( t ) e k ( t ) for k = 3 ,
4. In this case, one-way basis functions are not shared among different eigenfunctions.
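For readers who want to generate data in the spirit of these settings, the sketch below constructs two-dimensional eigenfunctions as tensor products of univariate basis functions and observes one random field sparsely with noise. The univariate basis is assumed to be $e_k(t) = \sqrt{2}\sin(k\pi t)$ (the exact formula is garbled above), and the index pairs and eigenvalue decay used here are placeholders rather than the exact values of Settings 1–3.

```python
import numpy as np

def e(k, t):
    # Univariate basis assumed here: e_k(t) = sqrt(2) * sin(k * pi * t).
    return np.sqrt(2.0) * np.sin(k * np.pi * t)

def psi(i, j, t1, t2):
    # Two-dimensional eigenfunction as a tensor product of univariate basis functions.
    return e(i, t1) * e(j, t2)

def simulate_sparse_field(pairs, lambdas, m, sigma, rng):
    # One random field at m irregular locations: X = sum_k sqrt(lambda_k) * xi_k * psi_k,
    # observed with i.i.d. Gaussian noise of standard deviation sigma.
    T = rng.uniform(size=(m, 2))
    xi = rng.standard_normal(len(lambdas))
    X = sum(np.sqrt(lam) * z * psi(i, j, T[:, 0], T[:, 1])
            for (i, j), lam, z in zip(pairs, lambdas, xi))
    return T, X + sigma * rng.standard_normal(m)

# Hypothetical illustration; index pairs and eigenvalue decay are placeholders.
rng = np.random.default_rng(3)
pairs = [(1, 1), (1, 2), (2, 1), (2, 2), (4, 1), (4, 2)]
lambdas = [1.0 / 2 ** k for k in range(1, 7)]
T, Z = simulate_sparse_field(pairs, lambdas, m=10, sigma=0.1, rng=rng)
print(T.shape, Z.shape)
```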
The simulation results for the sparse design with sample size n = 100 are shown in Table S4.

Table S4: Simulation results for three Settings with the sparse design when the sample size is 100 (n = 100): see description in Table 1.

S2.3 Simulation results for regular design
For the regular design, we selected 10 equally spaced points for each dimension and constructed a regular 10 × 10 grid (m = 100). We set the sample size to be 50 (n = 50). Two different noise levels are considered; since the regular design has dense observations, we pick σ = 0.… and σ = 0.…. We also include spatpca (Wang and Huang, 2017), which performs multi-dimensional covariance function estimation with observations at fixed locations, in our comparisons. Results are shown in Tables S5, S6 and S7.

Table S5: Results for Setting 1 on regular design: see description in Table 1.
σ = 0.80, AISE: mOpCov 0.0629 (4.45e-03); OpCov 0.0676 (4.49e-03); ll-smooth 0.0643 (3.79e-03); spatpca 0.0738 (4.52e-03); ll-smooth+ 0.0639 (3.79e-03).
Table S6: Results for Setting 2 on regular design: see description in Table 1.

σ = 0.80, AISE: mOpCov 0.062 (4.48e-03); OpCov 0.0659 (4.54e-03); ll-smooth 0.0631 (3.79e-03); spatpca 0.0724 (4.47e-03); ll-smooth+ 0.0627 (3.79e-03).
Table S7: Results for Setting 3 on regular design: see description in Table 1.

σ = 0.80, AISE: mOpCov 0.0645 (4.05e-03); OpCov 0.0677 (4.48e-03); ll-smooth 0.0745 (3.94e-03); spatpca 0.0715 (4.20e-03); ll-smooth+ 0.07389 (3.94e-03).
References
Adamczak, R. et al. (2008). A tail inequality for suprema of unbounded empirical processes with applications to Markov chains.
Electronic Journal of Probability 13, 1000–1034.
Bartlett, P. L., O. Bousquet, and S. Mendelson (2005). Local Rademacher complexities.
TheAnnals of Statistics 33 (4), 1497–1537.Koltchinskii, V. (2011).
Oracle Inequalities in Empirical Risk Minimization and Sparse Re-covery Problems: Ecole d’Et´e de Probabilit´es de Saint-Flour XXXVIII-2008 , Volume 2033.Springer Science & Business Media.Krieg, D. (2018). Tensor power sequences and the approximation of tensor product operators.
Journal of Complexity 44, 30–51.
Mendelson, S. (2002). Geometric parameters of kernel machines. In
International Conferenceon Computational Learning Theory , pp. 29–43. Springer.Pisier, G. (1983). Some applications of the metric entropy condition to harmonic analysis.In
Banach Spaces, Harmonic Analysis, and Probability Theory , pp. 123–154. Springer.Wang, W.-T. and H.-C. Huang (2017). Regularized principal component analysis for spatialdata.
Journal of Computational and Graphical Statistics 26 (1), 14–25.Wong, R. K. W. and X. Zhang (2019). Nonparametric operator-regularized covariance func-tion estimation for functional data.
Computational Statistics & Data Analysis 131, 131–144.

Table S8: Simulation results for Setting 1 ($R$ = 6, $r_1$ = 3, and $r_2$ = 2) with the sparse design. The AISE values with standard errors (SE) in parentheses are provided for the four covariance estimators in comparison, together with average two-way ranks ($\bar{R}$) for those estimators which can lead to rank reduction (i.e., mOpCov, OpCov, and ll-smooth+) and average one-way ranks ($\bar{r}_1$, $\bar{r}_2$) for mOpCov.
n = 100, m = 10, σ = 0.1: AISE 0.101 (5.47e-03) for mOpCov, 0.122 (1.20e-02) for OpCov, 4.36 (2.28e+00) for ll-smooth, and 1.702 (8.43e-01) for ll-smooth+.

Table S9: Simulation results for Setting 2 ($R$ = 6 and $r_1$ = $r_2$ = 4) with the sparse design: see description in Table S8.
n = 100, m = 10, σ = 0.1: AISE 0.1 (5.38e-03) for mOpCov, 0.113 (6.12e-03) for OpCov, 2.12 (6.23e-01) for ll-smooth, and 0.826 (1.76e-01) for ll-smooth+.

Table S10: Simulation results for Setting 3 ($R$ = $r_1$ = $r_2$ = 4) with the sparse design: see description in Table S8.
n = 100, m = 10, σ = 0.1: AISE 0.105 (4.75e-03) for mOpCov, 0.115 (7.58e-03) for OpCov, 24.1 (2.28e+01) for ll-smooth, and 1.87 (1.19) for ll-smooth+.

σ = 0.4: AISE 0.0598 (2.68e-03) for mOpCov, 0.0733 (6.14e-03) for OpCov, 0.531 (1.07e-01) for ll-smooth, and 0.323 (4.23e-02) for ll-smooth+.