Non-asymptotic Optimal Prediction Error for RKHS-based Partially Functional Linear Models
Xiaoyu Lei and Huiming Zhang*

School of Mathematical Sciences, Peking University, Beijing 100871, [email protected]
Department of Mathematics, Faculty of Science and Technology, University of Macau, Macau, [email protected]

November 24, 2020

*Correspondence author.
Abstract
Under the framework of reproducing kernel Hilbert spaces (RKHS), we consider the penalized least-squares estimation of partially functional linear models (PFLM), whose predictor contains both a functional and a traditional multivariate part, where the multivariate part allows a diverging number of parameters. From a non-asymptotic point of view, we focus on rate-optimal upper and lower bounds for the prediction error. An exact upper bound for the excess prediction risk is shown in non-asymptotic form under a general assumption on the effective dimension of the model, from which we also obtain prediction consistency when the number of multivariate covariates $p$ slightly increases with the sample size $n$. Our new finding implies a trade-off between the number of non-functional predictors and the effective dimension of the kernel principal components to ensure prediction consistency in the increasing-dimensional setting. The analysis in our proofs hinges on spectral conditions of the sandwich operator of the covariance operator and the reproducing kernel, and on concentration inequalities for random elements in Hilbert space. Finally, we derive the non-asymptotic minimax lower bound under a regularity assumption on the Kullback-Leibler divergence of the models.

Keywords: partially functional linear models, concentration inequality in Hilbert space, reproducing kernel Hilbert space, non-asymptotic bound, minimax rate, diverging number of covariates
1 Introduction

Statistical analysis of functional data has become an important and difficult part of modern statistics since the leading work Ramsay (1982) and the pioneering paper Grenander (1950). Due to technological innovation, progress in data storage enables scientists to acquire complex data sets with the structure of curves, images, or other functional structures (referred to as functional data). There is now a large body of work focusing on many different non-parametric aspects of functional data, such as kernel ridge regressions Cai et al. (2006); Preda (2007); Du and Wang (2014); Reimherr et al. (2018), penalized B-spline regressions Cardot et al. (2003), functional principal component regressions Yao et al. (2005), and local linear regressions Baíllo and Grané (2009); the reader can refer to the review paper Wang et al. (2016) for more details. Prediction in functional data analysis is a crucial topic with a wide range of applications, including chemometrics, econometrics and biomedical studies Ramsay and Silverman (2007); Kokoszka and Reimherr (2017).

Many existing works on the estimation and prediction problems of functional data are based on the framework of functional principal component analysis (FPCA), see Yao et al. (2005); Cai et al. (2006); Hall et al. (2007). However, the predictive power of FPCA-based methods is weakened when the functional principal components cannot form an effective basis for the slope function, which often occurs in practice and coincides with a similar phenomenon in principal component regressions, see Jolliffe (1982).

Let $\mathbf{X} = (X_1, \cdots, X_p)^T$ be the $p$-dimensional multivariate predictor, $Y(t)$ be the functional predictor, $\varepsilon$ be the random noise and $Z$ be the scalar response. In our work, we consider the PFLM taking the semi-parametric form
$$Z = \mathbf{X}^T\boldsymbol{\alpha}_0 + \int_{\mathcal{T}} Y(t)\beta_0(t)\,dt + \varepsilon, \qquad (1)$$
where $\beta_0(t)$ is the slope function for the functional predictor and $\boldsymbol{\alpha}_0$ is the regression coefficient for the multivariate predictor. The model (1) contains both a parametric and a non-parametric part, so it belongs to semi-parametric statistics. We assume a random design in which $\mathbf{X}$, $Y(t)$ and $\varepsilon$ are independent random variables. Because the intercepts of the predictors are easy to estimate by centralizing, we assume $\mathrm{E}\,\mathbf{X} = 0$ and $\mathrm{E}\,Y(t) = 0$ for simplicity.

In some situations a large number of non-functional predictors are collected for practical data analysis, and this increasing-dimensional setting has been considered in Aneiros et al. (2015); Kong et al. (2016). Our work can also be applied to deal with a diverging number of parameters. Theoretically, this setting requires assuming that the number of scalar covariates grows with the sample size, i.e. $p = p_n \to \infty$, and the convergence rate of the desired estimator becomes totally different from the case where the dimension of the non-functional predictors is fixed.

Dealing with functional data as a stochastic process is a significant challenge in functional data analysis. Obviously, a functional covariate $Y(t)$ has an infinite number of predictors over the time domain (observed at discrete time points) that are all highly correlated. The covariance function characterizes the correlation of the functional covariate, and the estimation of the slope function in functional regressions is connected to ill-posed inverse problems.
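For concreteness, the following minimal Python sketch simulates data from model (1) on a grid. The specific slope function, coefficient vector, Karhunen-Loève expansion of $Y$ and noise level are illustrative choices of ours, not quantities from the paper.

```python
import numpy as np

def simulate_pflm(n=200, p=5, m=100, seed=0):
    """Draw n samples from model (1) on T = [0, 1], discretized on m grid points.

    alpha_0, beta_0, the expansion of Y and the noise scale below are
    illustrative assumptions, not the paper's.
    """
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, m)
    dt = t[1] - t[0]
    alpha0 = np.ones(p) / np.sqrt(p)
    beta0 = np.sqrt(2.0) * np.sin(np.pi * t)            # a smooth slope function
    X = rng.normal(size=(n, p))                          # multivariate part, E X = 0
    k = np.arange(1, 21)
    basis = np.sqrt(2.0) * np.sin(np.outer(k, np.pi * t))
    scores = rng.normal(size=(n, k.size)) * k**-1.0      # eigenvalues decay like k^{-2}
    Y = scores @ basis                                   # centered functional part, E Y(t) = 0
    eps = rng.normal(scale=0.5, size=n)
    Z = X @ alpha0 + (Y @ beta0) * dt + eps              # Riemann sum for the integral in (1)
    return t, X, Y, Z
```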
To handle the infinite-dimensionality of $\beta_0(t)$, one often imposes certain regularity conditions on the hypothesized space of the slope function to ensure that the infinite-dimensional problem is tractable via a finite-dimensional approximate solution. The convergence rate of the slope estimators then depends directly on assumptions about the eigenvalue decay of the covariance operator and the restricted space of the slope function. Thus, the convergence rate cannot be parametric, due to the infinite-dimensionality of the model (1).

Some recent developments in PFLM include Zhu et al. (2019); Cui et al. (2020). Because of the shortcomings of FPCA-based methods, we apply penalized least squares under the framework of RKHS here. There are few efforts on non-asymptotic upper and lower bounds in the existing literature. For FPCA-based methods, Brunel et al. (2016) considers an adaptive estimation procedure for functional linear models under a non-asymptotic framework; Wahl (2018) analyses the prediction error of functional principal component regression (FPCR) and proves a non-asymptotic upper bound for the corresponding squared risk. For the RKHS-based method, many works focus on asymptotic results, such as Cai et al. (2006); Cai and Yuan (2012). Under the framework of RKHS-based kernel ridge regressions, Liu and Li (2020) recently studied non-asymptotic RKHS-norm error bounds (called oracle inequalities) for the estimated function $f$ in the Gaussian non-parametric regression $Y = f(X) + \varepsilon$, $\varepsilon \sim N(0,\sigma^2)$, where $f$ belongs to $L^2$; this non-asymptotic approach was studied earlier in ?. By applying the Matérn kernel and supposing $f$ lies in a Hölder space with polynomial decay of the eigenvalues $\lambda_n = O(n^{-2a})$, Liu and Li (2020) derives the nearly minimax optimal convergence rate $\left(\frac{\log n}{n}\right)^{\frac{2a}{2a+1}}$ (up to a $\log n$ factor) for the $L^2$-norm estimation error.

To analyse the PFLM, the main innovation of our work is that we provide non-asymptotic upper and lower bounds for the excess prediction risk under a general assumption on the effective dimension. Tong and Ng (2018) establishes the optimal convergence rate of the excess prediction risk for the RKHS-based slope estimator of functional linear models, but they do not consider the partially linear case. Moreover, we show that the convergence rate of the excess prediction risk of the PFLM is the same as that of the functional linear model, which means that the convergence of the prediction risk of the functional part dominates the convergence of the prediction risk of the whole PFLM. We also derive a minimax lower bound for the excess prediction risk under an assumption on the Kullback-Leibler divergence of the model. The specific theoretical contributions of our work are listed as follows.

• A significant contribution is that we obtain non-asymptotic upper and lower bounds, which have not been well studied in the existing literature. We provide an exact non-asymptotic minimax lower bound on the excess prediction risk in the functional linear regression model. Moreover, a particular application of the proposed non-asymptotic version of the optimal prediction upper bound is that it allows us to analyse the PFLM with a diverging number of non-functional predictors, which leads to prediction consistency under the setting $p^4\log^6(p) = o(n)$.

• We derive the non-asymptotic upper bound of the excess prediction risk for the RKHS-based least squares estimation in PFLM, and the optimal bound we obtain is more exact than that of Cui et al.
(2020), which only obtains the stochastic order of the convergence rate without the definite multiplying constants relevant to the high-probability events. Our derivation of the optimal bound does not need the inverse Cauchy inequality
$$\mathrm{E}\left(\int_{\mathcal{T}} Y(t)f(t)\,dt\right)^4 \le C\left[\mathrm{E}\left(\int_{\mathcal{T}} Y(t)f(t)\,dt\right)^2\right]^2, \quad \text{for } f\in L^2(\mathcal{T}),$$
as a moment assumption on the functional predictor. This condition is imposed in Cui et al. (2020); Cai and Yuan (2012) to attain minimax prediction bounds for (partially) functional linear regressions. Our proof also does not rely on the well-known representation lemma for smoothing splines, see Wahba (1990); Cucker and Smale (2001).

• The proof of Theorem 1 is divided into three steps and relies on new non-trivial results. First, we prove that the difference of the functional part between the true parameter and our least squares estimate is bounded. Second, based on this boundedness, we show that the excess prediction risk contributed by the multivariate part of the predictor converges at an $n^{-1}$-rate. Finally, using the convergence of the multivariate part, we obtain the convergence of the prediction risk corresponding to the functional part at an $n^{-\frac{1}{1+\theta}}$-rate, where $\theta$ is related to the effective dimension in Assumption 4. Specifically, the novelty of the proof lies in Lemma 2, which is crucial for Theorem 1. In Lemma 2, to show the concentration property of random elements in Banach space, we use methods from functional analysis to convert the random elements in Banach space into related random elements in Hilbert space.

The outline of this paper is as follows. In Section 2, we provide the notations and definitions we need and a brief introduction to the RKHS and the PFLM. In Section 3, we state our main theorem on the non-asymptotic upper bound for the excess prediction risk and two relevant corollaries. In Section 4, we state the minimax lower bound for the excess prediction risk. In Section 5, we provide the proof of Theorem 1 of Section 3. In Section 6, we give the proof of Theorem 2 of Section 4. In Sections 7 and 8, we prove the lemmas needed for the proofs of Sections 5 and 6. In Section 9, we summarize our conclusions and point out some future directions for research.

2 Preliminaries

2.1 Notation and the RKHS

Define $\|\mathbf{v}\|_2 := (\sum_{i=1}^p v_i^2)^{1/2}$ to be the $\ell_2$-norm of a vector $\mathbf{v}\in\mathbb{R}^p$. Let $\mathcal{T}\subset\mathbb{R}$ be a compact set. Denote by $L^2(\mathcal{T})$ the Hilbert space of square integrable functions on $\mathcal{T}$, whose inner product and norm are denoted by $\langle f,g\rangle$ and $\|f\|$ for $f,g\in L^2(\mathcal{T})$.

Consider a bounded linear operator $T$ from a Banach space $A$ to a Banach space $B$, endowed with the norms $\|\cdot\|_A$ and $\|\cdot\|_B$ respectively. Define the operator norm of $T$ as $\|T\|_{op} := \sup_{x\in A:\,\|x\|_A=1}\|T(x)\|_B$. Let $T^*$ be the adjoint of $T$ from $B^*$ to $A^*$, defined by $T^*(f)(x) := f(T(x))$ for any $f\in B^*$; then $\|T^*\|_{op} = \|T\|_{op}$.

For a matrix $E = (e_{ij})_{1\le i,j\le p}\in\mathbb{R}^{p\times p}$, when writing $\|E\|_{op}$ we view $E$ as a bounded linear operator from $\mathbb{R}^p$ to $\mathbb{R}^p$ endowed with the $\ell_2$-norm; this is also called the spectral norm in other literature. Let $\|E\|_\infty := \max_{1\le i,j\le p}|e_{ij}|$ be the $\ell_\infty$-norm of the matrix $E$ and $\lambda_{\max}(E)$ be the maximal eigenvalue of $E$. Moreover, we have $\|E\|_{op}\le p\|E\|_\infty$ from Problem
5.6.P23 of Horn and Johnson (2012).

For a real, symmetric, square integrable and nonnegative definite function $R:\mathcal{T}\times\mathcal{T}\to\mathbb{R}$, let $L_R: L^2(\mathcal{T})\to L^2(\mathcal{T})$ be the integral operator (also a bounded linear operator) defined by
$$L_R(f)(t) := \langle R(s,t), f(s)\rangle = \int_{\mathcal{T}} R(s,t)f(s)\,ds.$$
According to the spectral theorem, there exist a set of orthonormal eigenfunctions $\{\psi_k^R : k\ge 1\}$ and a sequence of eigenvalues $\theta_1^R \ge \theta_2^R \ge \cdots > 0$ such that
$$R(s,t) = \sum_{k=1}^{+\infty}\theta_k^R\,\psi_k^R(s)\psi_k^R(t), \quad \forall\, s,t\in\mathcal{T}.$$
Noticing the orthonormality of the eigenfunctions $\{\psi_k^R: k\ge 1\}$, we have
$$L_R(\psi_k^R)(s) = \langle R(s,t),\psi_k^R(t)\rangle = \Big\langle\sum_{i=1}^{+\infty}\theta_i^R\psi_i^R(s)\psi_i^R(t),\,\psi_k^R(t)\Big\rangle = \sum_{i=1}^{+\infty}\theta_i^R\psi_i^R(s)\langle\psi_i^R(t),\psi_k^R(t)\rangle = \theta_k^R\,\psi_k^R(s).$$
In what follows, we let $\{(\theta_k^R,\psi_k^R): k\ge 1\}$ be the eigenvalue-eigenfunction pairs corresponding to the operator (or the equivalent bivariate function) $R$. Let $L_R^{1/2}$ be the operator satisfying $L_R^{1/2}(\psi_k^R) = \sqrt{\theta_k^R}\,\psi_k^R$, and define the bivariate function $R^{1/2}(s,t)$ by
$$R^{1/2}(s,t) := \sum_{k=1}^{+\infty}\sqrt{\theta_k^R}\,\psi_k^R(s)\psi_k^R(t), \quad \forall\, s,t\in\mathcal{T}.$$
For two bivariate functions $R_1, R_2: \mathcal{T}\times\mathcal{T}\to\mathbb{R}$, define
$$(R_1R_2)(s,t) := \langle R_1(s,\cdot), R_2(\cdot,t)\rangle = \int_{\mathcal{T}} R_1(s,u)R_2(u,t)\,du.$$
Then we have the relation $L_{R_1R_2} = L_{R_1}\circ L_{R_2}$ and hence $L_R = L_{R^{1/2}}\circ L_{R^{1/2}}$, where $\circ$ denotes the composition of mappings. To show $L_{R_1R_2} = L_{R_1}\circ L_{R_2}$, we notice
$$L_{R_1}\circ L_{R_2}(f)(t) = \int_{\mathcal{T}} R_1(t,s)L_{R_2}(f)(s)\,ds = \int_{\mathcal{T}} R_1(t,s)\left(\int_{\mathcal{T}} R_2(s,u)f(u)\,du\right)ds = \int_{\mathcal{T}}\left(\int_{\mathcal{T}} R_1(t,s)R_2(s,u)\,ds\right)f(u)\,du = L_{R_1R_2}(f)(t).$$
Let $\mathrm{HS}(\mathcal{T})$ be the Hilbert space of Hilbert-Schmidt operators on $L^2(\mathcal{T})$ with the inner product $\langle A,B\rangle_{HS} := \mathrm{Tr}(B^*A)$ and the norm $\|A\|_{HS}^2 = \sum_{k=1}^{+\infty}\|A(\phi_k)\|^2$, where $\{\phi_k: k\ge 1\}$ is an orthonormal basis of $L^2(\mathcal{T})$. The space $\mathrm{HS}(\mathcal{T})$ is a subspace of the bounded linear operators on $L^2(\mathcal{T})$, with the norm relations $\|A\|_{op}\le\|A\|_{HS}$ and $\|AB\|_{HS}\le\|A\|_{op}\|B\|_{HS}$.

A reproducing kernel $K:\mathcal{T}\times\mathcal{T}\to\mathbb{R}$ is a real, symmetric, square integrable and nonnegative definite function. Given a reproducing kernel $K$, we can uniquely identify an RKHS $H(K)$, a subspace of $L^2(\mathcal{T})$ satisfying $K(t,\cdot)\in H(K)$ for any $t\in\mathcal{T}$, which is endowed with an inner product $\langle\cdot,\cdot\rangle_K$ such that
$$f(t) = \langle K(t,\cdot), f\rangle_K, \quad \text{for any } f\in H(K).$$
There is a well-known fact that $L_{K^{1/2}}(L^2(\mathcal{T})) = H(K)$, i.e. the RKHS $H(K)$ can be characterized as the range of $L_{K^{1/2}}$ equipped with the norm $\|L_{K^{1/2}}(f)\|_K = \|f\|_{L^2(\mathcal{T})}$; see Sun (2005) for more details. For simplicity, we let $H(K)$ be dense in $L^2(\mathcal{T})$, which means $L_{K^{1/2}}$ is injective. Let $\kappa$ be the operator norm of $L_{K^{1/2}}$, $\kappa := \|L_{K^{1/2}}\|_{op}$. Readers can refer to Wahba (1990); Cucker and Smale (2001) for more discussions of RKHS.

2.2 The Penalized Least Squares for PFLM

The goal of prediction given the predictors $\mathbf{X}$ and $Y(t)$ is to recover the prediction $\eta_0$, the right side of (1) without the random noise $\varepsilon$,
$$\eta_0(\mathbf{X}, Y(t)) := \mathbf{X}^T\boldsymbol{\alpha}_0 + \int_{\mathcal{T}} Y(t)\beta_0(t)\,dt.$$
We assume that the training sample $\{(Z_i,\mathbf{X}_i,Y_i(t))\}_{i=1}^n$ is composed of $n$ independent copies of $(Z,\mathbf{X},Y(t))$. To estimate the true parameter $(\boldsymbol{\alpha}_0,\beta_0)$, the penalized least squares estimator is defined as
$$(\hat{\boldsymbol{\alpha}}_n,\hat\beta_n) := \mathop{\arg\min}_{(\boldsymbol{\alpha},\beta)\in\mathbb{R}^p\times H(K)}\frac{1}{n}\sum_{i=1}^n\left(Z_i - \mathbf{X}_i^T\boldsymbol{\alpha} - \int_{\mathcal{T}} Y_i(t)\beta(t)\,dt\right)^2 + \lambda_n\|\beta\|_K^2. \qquad (2)$$
Noticing $L_{K^{1/2}}(L^2(\mathcal{T})) = H(K)$, for every $\beta\in H(K)$ there exists $f\in L^2(\mathcal{T})$ such that $L_{K^{1/2}}(f) = \beta$.
So (2) can be replaced by
$$(\hat{\boldsymbol{\alpha}}_n,\hat f_n) := \mathop{\arg\min}_{(\boldsymbol{\alpha},f)\in\mathbb{R}^p\times L^2(\mathcal{T})}\frac{1}{n}\sum_{i=1}^n\left(Z_i - \mathbf{X}_i^T\boldsymbol{\alpha} - \langle Y_i, L_{K^{1/2}} f\rangle\right)^2 + \lambda_n\|f\|^2. \qquad (3)$$
Let $\hat\eta_n$ be the prediction rule induced by the penalized least squares $(\hat{\boldsymbol{\alpha}}_n,\hat\beta_n)$,
$$\hat\eta_n(\mathbf{X},Y(t)) := \mathbf{X}^T\hat{\boldsymbol{\alpha}}_n + \int_{\mathcal{T}} Y(t)\hat\beta_n(t)\,dt.$$
For a prediction rule $\eta(\mathbf{X},Y(t))$, define the prediction risk to be
$$\mathcal{E}(\eta) := \mathrm{E}[Z^* - \eta(\mathbf{X}^*,Y^*(t))]^2,$$
where $(Z^*,\mathbf{X}^*,Y^*(t))$ is an independent copy of $(Z,\mathbf{X},Y(t))$. We measure the accuracy of the prediction $\hat\eta_n$ by the excess prediction risk
$$\mathcal{E}(\hat\eta_n) - \mathcal{E}(\eta_0) = \mathrm{E}[\hat\eta_n(\mathbf{X}^*,Y^*(t)) - \eta_0(\mathbf{X}^*,Y^*(t))]^2.$$
Let $f_0\in L^2(\mathcal{T})$ satisfy $L_{K^{1/2}} f_0 = \beta_0$ and rewrite $\eta_0(\mathbf{X},Y(t))$ and $\hat\eta_n(\mathbf{X},Y(t))$ as
$$\eta_0(\mathbf{X},Y(t)) = \mathbf{X}^T\boldsymbol{\alpha}_0 + \int_{\mathcal{T}} Y(t)(L_{K^{1/2}} f_0)(t)\,dt \quad \text{and} \quad \hat\eta_n(\mathbf{X},Y(t)) = \mathbf{X}^T\hat{\boldsymbol{\alpha}}_n + \int_{\mathcal{T}} Y(t)(L_{K^{1/2}}\hat f_n)(t)\,dt,$$
by which we can bound the excess prediction risk
$$\mathcal{E}(\hat\eta_n) - \mathcal{E}(\eta_0) = \mathrm{E}\left[\mathbf{X}^{*T}(\boldsymbol{\alpha}_0-\hat{\boldsymbol{\alpha}}_n) + \int_{\mathcal{T}} Y^*(t)(L_{K^{1/2}}(f_0-\hat f_n))(t)\,dt\right]^2 \le 2\,\mathrm{E}[\mathbf{X}^{*T}(\boldsymbol{\alpha}_0-\hat{\boldsymbol{\alpha}}_n)]^2 + 2\,\mathrm{E}\left[\int_{\mathcal{T}} Y^*(t)(L_{K^{1/2}}(f_0-\hat f_n))(t)\,dt\right]^2$$
$$= 2(\boldsymbol{\alpha}_0-\hat{\boldsymbol{\alpha}}_n)^T\mathrm{E}(\mathbf{X}^*\mathbf{X}^{*T})(\boldsymbol{\alpha}_0-\hat{\boldsymbol{\alpha}}_n) + 2\iint_{\mathcal{T}\times\mathcal{T}}\mathrm{E}[Y^*(t)Y^*(s)]\,(L_{K^{1/2}}(f_0-\hat f_n))(t)(L_{K^{1/2}}(f_0-\hat f_n))(s)\,dt\,ds.$$
Define the empirical covariance matrix $D_n$ and the covariance matrix $D$ for the multivariate part of the predictor to be
$$D_n := \frac{1}{n}\sum_{i=1}^n\mathbf{X}_i\mathbf{X}_i^T \quad \text{and} \quad D := \mathrm{E}(\mathbf{X}\mathbf{X}^T).$$
Let $\lambda_{\max} := \lambda_{\max}(D)$ be the maximal eigenvalue of the covariance matrix $D$. Similarly, define the empirical covariance function $C_n(s,t)$ and the covariance function $C(s,t)$ for the functional part of the predictor to be
$$C_n(s,t) := \frac{1}{n}\sum_{i=1}^n Y_i(s)Y_i(t) \quad \text{and} \quad C(s,t) := \mathrm{E}(Y(s)Y(t)).$$
Note that
$$\mathrm{E}\left[\int_{\mathcal{T}} Y(t)f(t)\,dt\right]^2 = \iint_{\mathcal{T}\times\mathcal{T}}\mathrm{E}[Y(s)Y(t)]f(s)f(t)\,ds\,dt = \int_{\mathcal{T}} f(t)\left(\int_{\mathcal{T}} C(s,t)f(s)\,ds\right)dt = \langle f, L_C f\rangle. \qquad (4)$$
Define the sandwich operator of the covariance operator and the reproducing kernel by $T = L_{K^{1/2}}\circ L_C\circ L_{K^{1/2}}$ and its empirical version $T_n = L_{K^{1/2}}\circ L_{C_n}\circ L_{K^{1/2}}$; see Cai and Yuan (2012) for details. With these definitions, we reformulate the upper bound for the excess prediction risk as
$$\mathcal{E}(\hat\eta_n) - \mathcal{E}(\eta_0) \le 2\lambda_{\max}\|\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0\|_2^2 + 2\langle L_{K^{1/2}}(\hat f_n-f_0), L_C L_{K^{1/2}}(\hat f_n-f_0)\rangle = 2\lambda_{\max}\|\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0\|_2^2 + 2\|\sqrt{T}(\hat f_n-f_0)\|^2, \qquad (5)$$
which is relatively easy to analyse.

For simplicity of the following discussion, we need more definitions. Define $g_n := \frac{1}{n}\sum_{i=1}^n\varepsilon_i L_{K^{1/2}} Y_i$ and $\mathbf{a}_n := \frac{1}{n}\sum_{i=1}^n\varepsilon_i\mathbf{X}_i$, which are key quantities in deriving the convergence rate of the desired estimator. Define the bounded linear operators $G_n: L^2(\mathcal{T})\to\mathbb{R}^p$ and $H_n: \mathbb{R}^p\to L^2(\mathcal{T})$ by
$$G_n(f) := \frac{1}{n}\sum_{i=1}^n\langle Y_i, L_{K^{1/2}} f\rangle\,\mathbf{X}_i, \quad \forall f\in L^2(\mathcal{T}), \qquad H_n(\boldsymbol{\alpha}) := \frac{1}{n}\sum_{i=1}^n(\mathbf{X}_i^T\boldsymbol{\alpha})\,L_{K^{1/2}} Y_i, \quad \forall\boldsymbol{\alpha}\in\mathbb{R}^p.$$
Let $\{(\tau_k,\varphi_k): k\ge 1\}$ be the set of eigenvalue-eigenfunction pairs of $T$ as an operator on $L^2(\mathcal{T})$. Define the trace of the operator $(T+\lambda I)^{-1}T$ as
$$\mathcal{D}(\lambda) := \mathrm{Tr}((T+\lambda I)^{-1}T),$$
which is also called the effective dimension in learning theory, see Zhang (2005).

3 Non-asymptotic Upper Bound

According to the definition of the penalized least squares $(\hat{\boldsymbol{\alpha}}_n,\hat f_n)$, which minimizes (3), the difference between the penalized least squares $(\hat{\boldsymbol{\alpha}}_n,\hat f_n)$ and the true parameter $(\boldsymbol{\alpha}_0,f_0)$ can be represented as
$$\hat f_n - f_0 = -\lambda_n(T_n+\lambda_n I)^{-1}f_0 - (T_n+\lambda_n I)^{-1}H_n(\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0) + (T_n+\lambda_n I)^{-1}g_n, \qquad (6)$$
$$\hat{\boldsymbol{\alpha}}_n - \boldsymbol{\alpha}_0 = -D_n^{-1}G_n(\hat f_n-f_0) + D_n^{-1}\mathbf{a}_n, \qquad (7)$$
where $T_n$, $H_n$, $G_n$, $D_n$, $g_n$ and $\mathbf{a}_n$ are given above. The derivation of (6) and (7) is left to Subsection 7.1, where we use the method of the calculus of variations.

Some common regularity assumptions are needed to ensure our main results.

Assumption 1. $\mathbf{X}$ satisfies the growth of moments condition for each component $X_j$ $(1\le j\le p)$: there exist $M_1, \upsilon > 0$ such that
$$\mathrm{E}(|X_j|^l) \le \frac{M_1^2\upsilon^{l-2}\,l!}{2}, \quad \text{for all integers } l\ge 2,\ 1\le j\le p.$$
We also assume $D = \mathrm{E}(\mathbf{X}\mathbf{X}^T)$ and $D^{-1}$ to be positive definite.

Assumption 2. $Y(t)$ is a bounded square integrable stochastic process: there exists $M_2 > 0$ such that $\|Y(t)\|\le M_2$ (a.s.).
Assumption 3. The random noise $\varepsilon$ has zero mean and finite variance: $\mathrm{E}\,\varepsilon = 0$ and $\mathrm{E}\,\varepsilon^2 = \sigma^2 < \infty$.
Assumption 4. The effective dimension of $T$ satisfies $\mathcal{D}(\lambda) = \mathrm{Tr}((T+\lambda I)^{-1}T) \le c\lambda^{-\theta}$ for constants $c, \theta > 0$.
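To make Assumption 4 concrete, the following sketch computes $\mathcal{D}(\lambda) = \sum_k \tau_k/(\tau_k+\lambda)$ for a truncated eigenvalue sequence and compares it against the polynomial bound of Remark 2 below; the decay exponent is an illustrative choice.

```python
import numpy as np

def effective_dimension(tau, lam):
    """D(lambda) = Tr((T + lam I)^{-1} T) = sum_k tau_k / (tau_k + lam)
    for a (truncated) eigenvalue sequence tau of the sandwich operator T."""
    return np.sum(tau / (tau + lam))

# Polynomial decay tau_k = k^{-2r} (illustrative), for which Remark 2 gives
# D(lambda) of order lambda^{-1/(2r)}, i.e. theta = 1/(2r) in Assumption 4.
r = 1.0
tau = np.arange(1, 10**6 + 1, dtype=float) ** (-2 * r)
for lam in [1e-1, 1e-2, 1e-3]:
    print(lam, effective_dimension(tau, lam), lam ** (-1 / (2 * r)))
```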
Assumption 5. The number of multivariate covariates $p = p_n$ may increase as a function of the sample size $n$.

$Y(t)$ is bounded almost surely in Assumption 2. Define a random variable $\xi$ taking values in $\mathrm{HS}(\mathcal{T})$ by
$$\xi(f) := (T+\lambda_n I)^{-1/2}\langle L_{K^{1/2}} Y, f\rangle L_{K^{1/2}} Y.$$
Actually, we can weaken Assumption 2 by assuming the Bernstein growth of moments condition of $\xi$ in $\mathrm{HS}(\mathcal{T})$: there exist $M, \nu > 0$ such that
$$\mathrm{E}(\|\xi-\mathrm{E}\xi\|_{HS}^l) \le \frac{\mathcal{D}(\lambda_n)M^2\nu^{l-2}\,l!}{2}, \quad \text{for all integers } l\ge 2. \qquad (8)$$
Then, using Lemma 7, we obtain the same result as in Lemma 1, by which we have the same result for the non-asymptotic optimal prediction error as in Theorem 1. But the condition (8) is difficult to verify; a similar condition is also provided for the FPCA method in Brunel et al. (2016). Here, we do not provide the complete proof under the condition (8). Assumption 3 follows the general assumptions of zero mean and finite variance on the random noise. Assumption 4 on the effective dimension has been adopted in Tong and Ng (2018); it reflects the convergence of the eigenvalues of $L_C$ and $L_K$ and how their eigenfunctions align. Assumption 4 also contains the common assumption on the decay rate of the eigenvalues of $T$ (see Remark 2).
Remark 1. For simplicity of the later proofs and statements of lemmas, we further assume $\kappa, M_2 \ge 1$. These assumptions make no essential difference compared with the original assumptions because of the boundedness.

Before getting to our main results, we need two important lemmas, whose proofs are left to Sections 7 and 8. The following Lemma 1 and Lemma 2 can be viewed as concentration inequalities for the operator-valued random variables $T_n$, $G_n$ and $H_n$. The concentration inequalities for random variables taking values in a Hilbert space, stated in Lemmas 6 and 7, play an important role in the proofs of these lemmas.
Lemma 1. Under Assumption 2, for any $\delta\in(0,e^{-1})$, with probability at least $1-\delta$ we have
$$\|(T+\lambda_n I)^{-1/2}(T_n-T)\|_{op} \le c_1\log(\tfrac{2}{\delta})B_n, \quad \text{where } c_1 := 2\kappa^2M_2^2 \text{ and } B_n := \frac{1}{n\sqrt{\lambda_n}} + \sqrt{\frac{\mathcal{D}(\lambda_n)}{n}}.$$

Lemma 2 ($\|G_n\|_{op} = \|H_n\|_{op}$). Under Assumptions 1 and 2, for any $\delta\in(0,1)$, with probability at least $1-\delta$, we have
$$\|G_n\|_{op} = \|H_n\|_{op} \le c_2\frac{\log(\frac{2p}{\delta})}{\sqrt{n}}, \quad \text{where } c_2 := 2\sqrt{p}\,\kappa(\upsilon+M_1)M_2.$$

With all the preparations established above, we are able to state the main result of this paper. The following theorem provides a non-asymptotic upper bound for the excess prediction risk.
Theorem 1.
Under Assumptions 1-4, for any $\delta_2,\delta_3,\delta_4,\delta_5\in(0,1)$ and $\delta_1\in(0,e^{-1})$, by taking $\lambda_n = \omega n^{-\frac{1}{1+\theta}}$, there exists an integer $n_0 = \lceil\max\{N_1,N_2\}\rceil$, where
$$N_1 = 48\upsilon^2 p\|D^{-1}\|_{op}\big(48p\|D^{-1}\|_{op}M_1^2+1\big)\log(\tfrac{2p^2}{\delta_4}) \quad \text{and} \quad N_2 = \left(\frac{12p\,\kappa^2(\upsilon+M_1)^2M_2^2\|D^{-1}\|_{op}}{\omega}\log^2(\tfrac{2p}{\delta_2})\right)^{\frac{1+\theta}{\theta}},$$
such that for $n\ge n_0$, we have with probability at least $1-\sum_{i=1}^5\delta_i$,
$$\mathcal{E}(\hat\eta_n) - \mathcal{E}(\eta_0) \le \big(2\lambda_{\max}(2c_4c_6+c_5)^2 + 2c_9^2\big)n^{-1} + \big(4(c_7+c_8)c_9\sqrt{\omega}\big)n^{-\frac{2+\theta}{2(1+\theta)}} + \big(2(c_7+c_8)^2\omega\big)n^{-\frac{1}{1+\theta}}, \qquad (9)$$
where the $c_i$ $(4\le i\le 9)$ are specific constants given in the proof that depend on the true parameters and the assumptions, and can be written as
$$c_4 := 3\sqrt{p}\,\kappa(\upsilon+M_1)M_2\|D^{-1}\|_{op}\log(\tfrac{2p}{\delta_2}), \qquad c_5 := \frac{3\sqrt{p}\,\sigma M_1\|D^{-1}\|_{op}}{2\sqrt{\delta_3}},$$
$$c_6 := \|f_0\| + \frac{2\sqrt{p}\,\kappa(\upsilon+M_1)M_2\,c_5}{\omega}\log(\tfrac{2p}{\delta_2}) + \frac{\sigma(\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})}{\sqrt{\delta_5}},$$
$$c_7 := \|f_0\|\left(2\kappa^2M_2^2(\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})\log(\tfrac{2}{\delta_1})+1\right),$$
$$c_8 := \left(2\kappa^2M_2^2(\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})\log(\tfrac{2}{\delta_1})+1\right)^2\frac{\sigma(\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})}{\sqrt{\delta_5}}$$
and
$$c_9 := \frac{2\sqrt{p}\,\kappa(\upsilon+M_1)M_2(2c_4c_6+c_5)}{\sqrt{\omega}}\left(2\kappa^2M_2^2(\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})\log(\tfrac{2}{\delta_1})+1\right)\log(\tfrac{2p}{\delta_2}).$$

Equation (9) presents an exact convergence rate of the excess prediction risk, with all precise constants determined by the regularity conditions. The first term on the right side of (9) is ascribed to the parametric part of the PFLM. The second term is a mixed rate consisting of both the parametric and the functional parts, since the prediction risk is a squared function composed of both the functional and non-functional predictors. The last term is the dominating term, which reveals that the signal strength $\|f_0\|$, the operator norm of the reproducing kernel, and the variation of the functional predictor play a crucial role in the non-asymptotic convergence rate. These assumption-dependent constants are usually ignored in most references on asymptotic analysis of functional regressions. From the proof of Theorem 1, it is not hard to obtain the following non-asymptotic upper bounds for the functional and the non-functional parameters respectively.
Corollary 1. Under the conditions of Theorem 1, for $n > n_0$ we have
$$P\left(\|\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0\|_2 \le \frac{2c_4c_6+c_5}{\sqrt{n}}\right) \ge 1-\delta_2-\delta_3-\delta_4-\delta_5, \qquad (10)$$
$$P\left(\|\sqrt{T}(\hat f_n-f_0)\| \le (c_7+c_8)\sqrt{\lambda_n} + \frac{c_9}{\sqrt{n}}\right) \ge 1-\sum_{i=1}^5\delta_i. \qquad (11)$$

It should be noted that we are unable to show the asymptotic normality of the non-functional parameter $\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0$, because it is influenced by the functional parameter $\hat f_n-f_0$ as shown in (7), and it is difficult to derive an analogue of the central limit theorem for the functional parameter.

The bounds (10) and (11) in Corollary 1 are useful high-probability events, which can be used to obtain confidence balls for $\boldsymbol{\alpha}_0$ and $f_0$ under the distances $\|\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0\|_2$ and $\|\sqrt{T}(\hat f_n-f_0)\|$. They are also helpful for constructing test statistics, and thus they enable non-asymptotic hypothesis testing for functional regressions, see ? for the case of nonparametric regressions.

Another corollary of Theorem 1 is that the excess prediction risk satisfies $\mathcal{E}(\hat\eta_n)-\mathcal{E}(\eta_0) = O_p(n^{-\frac{1}{1+\theta}})$. From the proof, we notice that the convergence rate of the prediction risk contributed by the multivariate part of the predictor is $O_p(n^{-1})$, faster than the convergence rate corresponding to the functional part of the predictor, which is $O_p(n^{-\frac{1}{1+\theta}})$. Therefore, the convergence rate of the prediction risk of the partially functional linear model is the same as the minimax rate for the functional linear model, Cai and Yuan (2012).
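As a concrete illustration of the estimator (3) behind these guarantees, here is a minimal Python sketch that solves the discretized penalized least squares on a grid. The discretization, the square-root construction of $L_{K^{1/2}}$ via an eigendecomposition of the kernel matrix, and all variable names are our assumptions, not the paper's procedure.

```python
import numpy as np

def fit_pflm(X, Y, Z, t, K, lam):
    """Minimal sketch of the penalized least squares (3) on a grid.

    K is the m x m kernel matrix K(t_i, t_j); only the functional part is
    penalized. Discretization choices here are illustrative assumptions.
    """
    n, p = X.shape
    m = t.size
    dt = t[1] - t[0]
    w, V = np.linalg.eigh(K * dt)                  # discrete integral operator L_K
    w = np.clip(w, 0.0, None)
    Ksqrt = (V * np.sqrt(w)) @ V.T                 # discrete L_{K^{1/2}}
    B = (Y @ Ksqrt) * dt                           # row i approximates <Y_i, L_{K^{1/2}} f>
    A = np.hstack([X, B])                          # joint design for (alpha, f)
    P = np.zeros((p + m, p + m))
    P[p:, p:] = lam * dt * np.eye(m)               # lambda_n * ||f||^2, no penalty on alpha
    coef = np.linalg.solve(A.T @ A / n + P, A.T @ Z / n)
    alpha_hat, f_hat = coef[:p], coef[p:]
    return alpha_hat, Ksqrt @ f_hat                # beta_hat = L_{K^{1/2}} f_hat

# Usage with the simulated data of Section 1 and an illustrative kernel:
# t, X, Y, Z = simulate_pflm()
# K = np.minimum.outer(t, t)                       # Brownian-motion-type kernel
# alpha_hat, beta_hat = fit_pflm(X, Y, Z, t, K, lam=1e-3)
```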
Remark 2. The assumption that the eigenvalues $\tau_k$ decay as $\tau_k \le c'k^{-2r}$ $(r > 1/2)$ is a special case of our more general Assumption 4 on the effective dimension, once we notice
$$\mathcal{D}(\lambda_n) = \sum_{k=1}^{+\infty}\frac{\tau_k}{\tau_k+\lambda_n} \le \sum_{k=1}^{+\infty}\frac{c'k^{-2r}}{c'k^{-2r}+\lambda_n} = \sum_{k=1}^{+\infty}\frac{c'}{c'+\lambda_nk^{2r}} \le \int_0^{+\infty}\frac{c'}{c'+\lambda_nt^{2r}}\,dt \overset{s=\lambda_n^{1/(2r)}t}{=\!=\!=\!=\!=} \lambda_n^{-\frac{1}{2r}}\int_0^{+\infty}\frac{c'}{c'+s^{2r}}\,ds \lesssim \lambda_n^{-\frac{1}{2r}} \asymp n^{\frac{1}{2r+1}} \quad (\lambda_n = \omega n^{-\frac{2r}{2r+1}}),$$
that is, Assumption 4 holds with $\theta = \frac{1}{2r}$. We state this special case as the corollary below.

Corollary 2. Suppose Assumptions 1-3 are satisfied, and assume the eigenvalues $\tau_k$ decay as $\tau_k \le c'k^{-2r}$ for some $c' > 0$ and $r > 1/2$. For any $\delta_2,\delta_3,\delta_4,\delta_5\in(0,1)$ and $\delta_1\in(0,e^{-1})$, by taking $\lambda_n = \omega n^{-\frac{2r}{2r+1}}$, there exists an integer $n_0$ such that for $n > n_0$ we have, with probability at least $1-\sum_{i=1}^5\delta_i$,
$$\mathcal{E}(\hat\eta_n) - \mathcal{E}(\eta_0) \le \big(2\lambda_{\max}(2c_4c_6+c_5)^2 + 2c_9^2\big)n^{-1} + \big(4(c_7+c_8)c_9\sqrt{\omega}\big)n^{-\frac{4r+1}{2(2r+1)}} + \big(2(c_7+c_8)^2\omega\big)n^{-\frac{2r}{2r+1}},$$
where the $c_i$ $(4\le i\le 9)$ and $n_0$ are the same as in Theorem 1 except for replacing $\theta$ by $\frac{1}{2r}$ and $c$ by a constant depending on $c'$ and $r$.

A useful and insightful application of Theorem 1 is the situation where the number of multivariate covariates $p$ increases as a function of $n$. According to the definitions of the $c_i$ $(4\le i\le 9)$ and $N_i$ $(i=1,2)$ in the proof, we see that
$$c_4 = O(\sqrt{p}\log(p)),\quad c_5 = O(\sqrt{p}),\quad c_6 = O(p\log(p)),\quad c_7 = c_8 = O(1),\quad c_9 = O(p^2\log^3(p)),$$
$$N_1 = O(p^2\log(p)) \quad \text{and} \quad N_2 = O\big(p^{\frac{1+\theta}{\theta}}\log^{\frac{2(1+\theta)}{\theta}}(p)\big).$$
From these orders, it follows that
$$\big(2\lambda_{\max}(2c_4c_6+c_5)^2 + 2c_9^2\big)n^{-1} = O(p^4\log^6(p)\,n^{-1}), \qquad \big(4(c_7+c_8)c_9\sqrt{\omega}\big)n^{-\frac{2+\theta}{2(1+\theta)}} = O\big(p^2\log^3(p)\,n^{-\frac{2+\theta}{2(1+\theta)}}\big)$$
and $\big(2(c_7+c_8)^2\omega\big)n^{-\frac{1}{1+\theta}} = O(n^{-\frac{1}{1+\theta}})$. When letting $p^4\log^6(p)n^{-1}\to 0$, i.e. $n\gg p^4\log^6(p)$, we have $p^2\log^3(p)\ll\sqrt{n}$, so that as $n,p\to\infty$,
$$p^2\log^3(p)\,n^{-\frac{2+\theta}{2(1+\theta)}} \ll n^{\frac{1}{2}-\frac{2+\theta}{2(1+\theta)}} = n^{-\frac{1}{2(1+\theta)}} \to 0.$$
Notice $n \gg p^4\log^6(p) \gg O(p^2\log(p)) = N_1$. If we let $\frac{2(1+\theta)}{\theta} < 6 \Leftrightarrow \theta > \frac{1}{2}$, we also have $n\gg N_2$ under the condition $p^4\log^6(p)n^{-1}\to 0$, noting $p\gg\log^\epsilon(p)$ for any $\epsilon\in\mathbb{R}$. Therefore, we have the following prediction consistency for the increasing-dimensional setting of non-functional predictors.
Corollary 3. Under Assumptions 1-5, if the constant $\theta > \frac{1}{2}$ in Assumption 4 and $p^4\log^6(p) = o(n)$ in Assumption 5, we have consistency of the excess prediction risk: $\mathcal{E}(\hat\eta_n)-\mathcal{E}(\eta_0) = o_p(1)$.
Remark 3. If we assume the eigenvalues $\tau_k$ decay as $\tau_k \le c'k^{-2r}$ $(r > 1/2)$ in the increasing-dimensional setting, by applying Corollary 3 and noticing $\theta = \frac{1}{2r}$, we need to further assume $r < 1$ to obtain the prediction consistency, which means the convergence rate of the eigenvalues cannot be too fast. Intuitively, when $p$ increases, $r$ cannot be too large or, equivalently, the effective dimension $\mathcal{D}(\lambda_n)\asymp n^{\frac{1}{2r+1}}$ cannot be too small. This implies that we need to find a trade-off between the number of non-functional predictors and the effective dimension to obtain the prediction consistency.

Prediction consistency theory has been well established for estimators in non-parametric statistics and high-dimensional statistics; see Zhuang and Lederer (2018) for a recent development for general regularized maximum likelihood estimators. However, their work specifically targets non-parametric or high-dimensional models and does not cover the semi-parametric case studied in our paper.
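The trade-off of Remark 3 can be eyeballed numerically. The sketch below tracks only the orders (constants dropped) of the three terms in (9) under the reconstruction above, with $\theta = 1/(2r)$; the growth rate of $p$ and the value of $r$ are illustrative assumptions.

```python
import numpy as np

def bound_orders(n, p, r):
    """Orders (constants dropped) of the three terms in (9) with theta = 1/(2r)."""
    theta = 1.0 / (2.0 * r)
    t1 = p**4 * np.log(p)**6 / n                                       # parametric part
    t2 = p**2 * np.log(p)**3 * n ** (-(2 + theta) / (2 * (1 + theta)))  # mixed part
    t3 = n ** (-1.0 / (1.0 + theta))                                    # functional part
    return t1, t2, t3

# With r = 0.75 < 1 (so theta > 1/2) and p growing like n^0.15, all three
# terms shrink, though the log factors make the decay slow at practical n.
for n in [10**6, 10**8, 10**10]:
    p = max(2, int(n**0.15))
    print(n, p, bound_orders(n, p, r=0.75))
```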
4 Minimax Lower Bound

In this section, we derive a minimax lower bound for the excess prediction risk in the following Theorem 2. To verify the optimality of the upper bound of the prediction risk for the proposed estimator, the result on the minimax lower bound below shows that the prediction risk of our estimator achieves the theoretical lower bound caused by the intrinsic limitation of the PFLM. Let $P_{\boldsymbol{\alpha},\beta}$ be the probability measure over the space of $(Z,\mathbf{X},Y)$, where $Z$ is generated by the true parameter, $Z = \mathbf{X}^T\boldsymbol{\alpha} + \langle Y,\beta\rangle + \varepsilon$. Before stating the main result, we need a regularity assumption on the variance of the random noise.

Assumption 6. For a fixed $\boldsymbol{\alpha}^*\in\mathbb{R}^p$ and different $\beta_1,\beta_2\in H(K)$, we assume that the Kullback-Leibler distance between $P_{\boldsymbol{\alpha}^*,\beta_1}$ and $P_{\boldsymbol{\alpha}^*,\beta_2}$ can be bounded by
$$K(P_{\boldsymbol{\alpha}^*,\beta_1}\,|\,P_{\boldsymbol{\alpha}^*,\beta_2}) := \mathrm{E}_{\boldsymbol{\alpha}^*,\beta_1}\log\left(\frac{dP_{\boldsymbol{\alpha}^*,\beta_1}}{dP_{\boldsymbol{\alpha}^*,\beta_2}}\right) \le K_\sigma\,\mathrm{E}(\langle Y,\beta_1-\beta_2\rangle^2),$$
where $K_\sigma$ is a positive variance-dependent constant and $\mathrm{E}_{\boldsymbol{\alpha}^*,\beta_1}$ means the expectation taken over the probability $P_{\boldsymbol{\alpha}^*,\beta_1}$.

Examples of the constant $K_\sigma$ include noises from exponential families (see Abramovich and Grinshtein (2016); Du and Wang (2014)) and noises with self-concordant log-density functions (see Ostrovskii and Bach (2018)). If we assume the random noise $\varepsilon\sim N(0,\sigma^2)$, the constant is
$$K_\sigma = \frac{1}{2\sigma^2}, \qquad (12)$$
of which the proof is left to Subsection 8.7. Now we state the main theorem of this section.
Theorem 2. Under Assumptions 3 and 6, suppose the eigenvalues $\{\tau_k: k\ge 1\}$ of the operator $T$ decay as $b_2k^{-2r} \le \tau_k \le b_1k^{-2r}$ for some $r\in(0,+\infty)$ and $b_1 \ge b_2 > 0$. Then for any $\rho\in(0,\frac{1}{8})$, there exists a sequence $\{N_n\}_{n\ge 1}$ satisfying
$$\log(N_n) \ge \left(\frac{\log 2}{8}\right)^{\frac{2r}{2r+1}}(b_1K_\sigma)^{\frac{1}{2r+1}}\rho^{-\frac{1}{2r+1}}\,n^{\frac{1}{2r+1}}, \qquad (13)$$
such that when $n \ge \frac{\rho\log 2}{8b_1K_\sigma}$, the excess prediction risk satisfies
$$\inf_{\tilde\eta}\sup_{\eta_0\in\mathbb{R}^p\times H(K)}P\left(\mathcal{E}(\tilde\eta)-\mathcal{E}(\eta_0) \ge b_2 2^{-(3+4r)}\left(\frac{8b_1K_\sigma}{\rho\log 2}\right)^{-\frac{2r}{2r+1}}n^{-\frac{2r}{2r+1}}\right) \ge \frac{\sqrt{N_n}}{1+\sqrt{N_n}}\left(1-2\rho-\sqrt{\frac{2\rho}{\log N_n}}\right),$$
where we identify the prediction rule $\tilde\eta$ with an arbitrary estimator $(\tilde{\boldsymbol{\alpha}},\tilde\beta)$ based on the training samples $\{(Z_i,\mathbf{X}_i,Y_i)\}_{i=1}^n$, and view $\eta_0$ as the true parameter $(\boldsymbol{\alpha}_0,\beta_0)\in\mathbb{R}^p\times H(K)$. We emphasize that the probability $P$ is taken over the product space of the training samples $\{(Z_i,\mathbf{X}_i,Y_i)\}_{i=1}^n$ generated by the true parameter $\eta_0 = (\boldsymbol{\alpha}_0,\beta_0)$.

In the existing literature, most results on the minimax bound are in the asymptotic sense, while the constants in our result are precise and specified. Letting $N_n\to\infty$, the lower bound inequality of Theorem 2 implies
$$\lim_{n\to\infty}\inf_{\tilde\eta}\sup_{\eta_0\in\mathbb{R}^p\times H(K)}P\big(\mathcal{E}(\tilde\eta)-\mathcal{E}(\eta_0) \ge b'\rho^{\frac{2r}{2r+1}}n^{-\frac{2r}{2r+1}}\big) \ge 1-2\rho$$
for some constant $b'$, by which we get the asymptotic minimax lower bound:
$$\lim_{a\to 0}\lim_{n\to\infty}\inf_{\tilde\eta}\sup_{\eta_0\in\mathbb{R}^p\times H(K)}P\big(\mathcal{E}(\tilde\eta)-\mathcal{E}(\eta_0) \ge a\,n^{-\frac{2r}{2r+1}}\big) = 1.$$

5 Proof of Theorem 1

When setting $\lambda_n = \omega n^{-\frac{1}{1+\theta}}$, we have $\frac{1}{n} \le n^{-\frac{1}{1+\theta}} = \frac{\lambda_n}{\omega}$, so that $\frac{1}{n\sqrt{\lambda_n}} \le \frac{\sqrt{\lambda_n}}{\omega}$, and
$$\frac{\mathcal{D}(\lambda_n)}{n} \le \frac{c(\omega n^{-\frac{1}{1+\theta}})^{-\theta}}{n} = c\omega^{-\theta}n^{-\frac{1}{1+\theta}} = c\omega^{-(1+\theta)}\lambda_n,$$
by which we have $B_n \le (\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})\sqrt{\lambda_n}$ in Lemma 1 and $n\lambda_n = \omega n^{\frac{\theta}{1+\theta}} \ge \omega$.

Applying Lemmas 2, 3 and 5 to (7), when $n > n_0$, we have with probability at least $1-\delta_2-\delta_3-\delta_4$,
$$\|\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0\|_2 \le \|D_n^{-1}\|_{op}\|G_n\|_{op}\|\hat f_n-f_0\| + \|D_n^{-1}\|_{op}\|\mathbf{a}_n\|_2 \le \frac{3}{2}\|D^{-1}\|_{op}\,c_2\frac{\log(\frac{2p}{\delta_2})}{\sqrt{n}}\|\hat f_n-f_0\| + \frac{3}{2}\|D^{-1}\|_{op}\frac{c_3}{\sqrt{\delta_3}\sqrt{n}} = \frac{c_4}{\sqrt{n}}\|\hat f_n-f_0\| + \frac{c_5}{\sqrt{n}}, \qquad (14)$$
where we let $c_4 := \frac{3}{2}c_2\|D^{-1}\|_{op}\log(\frac{2p}{\delta_2})$ and $c_5 := \frac{3}{2}c_3\|D^{-1}\|_{op}/\sqrt{\delta_3}$.

First, we want to bound $\|\hat f_n-f_0\|$. According to (6),
$$\|\hat f_n-f_0\| \le \lambda_n\|(T_n+\lambda_n I)^{-1}\|_{op}\|f_0\| + \|(T_n+\lambda_n I)^{-1}\|_{op}\|H_n\|_{op}\|\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0\|_2 + \|(T_n+\lambda_n I)^{-1}g_n\| =: I_1+I_2+I_3.$$
For the term $I_1$, we have $I_1 \le \|f_0\|$ because $\|(T_n+\lambda_n I)^{-1}\|_{op} \le \frac{1}{\lambda_n}$.
For the term $I_2$, combining (14), we have with probability at least $1-\delta_2-\delta_3-\delta_4$,
$$I_2 \le \frac{1}{\lambda_n}c_2\frac{\log(\frac{2p}{\delta_2})}{\sqrt{n}}\left(\frac{c_4}{\sqrt{n}}\|\hat f_n-f_0\| + \frac{c_5}{\sqrt{n}}\right) = \frac{c_2c_4\log(\frac{2p}{\delta_2})}{n\lambda_n}\|\hat f_n-f_0\| + \frac{c_2c_5\log(\frac{2p}{\delta_2})}{n\lambda_n} \le \frac{c_2c_4\log(\frac{2p}{\delta_2})}{n\lambda_n}\|\hat f_n-f_0\| + \frac{c_2c_5}{\omega}\log(\tfrac{2p}{\delta_2}),$$
where we use $n\lambda_n \ge \omega$.
For the term $I_3$, we obtain with probability at least $1-\delta_5$, by Lemma 8,
$$I_3 \le \|(T_n+\lambda_n I)^{-1/2}\|_{op}\|(T_n+\lambda_n I)^{-1/2}g_n\| \le \frac{1}{\sqrt{\lambda_n}}\cdot\frac{\sigma}{\sqrt{\delta_5}}B_n \le \frac{\sigma(\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})}{\sqrt{\delta_5}},$$
where we use $B_n \le (\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})\sqrt{\lambda_n}$ in the last step.
Thus we can bound $\|\hat f_n-f_0\|$ by $I_1+I_2+I_3$:
$$\|\hat f_n-f_0\| \le \|f_0\| + \frac{c_2c_4\log(\frac{2p}{\delta_2})}{n\lambda_n}\|\hat f_n-f_0\| + \frac{c_2c_5}{\omega}\log(\tfrac{2p}{\delta_2}) + \frac{\sigma(\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})}{\sqrt{\delta_5}} = \frac{c_2c_4\log(\frac{2p}{\delta_2})}{n\lambda_n}\|\hat f_n-f_0\| + c_6,$$
where we define $c_6 := \|f_0\| + \frac{c_2c_5}{\omega}\log(\frac{2p}{\delta_2}) + \frac{\sigma(\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})}{\sqrt{\delta_5}}$. Notice that when $\lambda_n = \omega n^{-\frac{1}{1+\theta}}$, we have $n\lambda_n = \omega n^{\frac{\theta}{1+\theta}}$, and when $n > \big(\frac{2c_2c_4\log(2p/\delta_2)}{\omega}\big)^{\frac{1+\theta}{\theta}}$, we have $\frac{c_2c_4\log(2p/\delta_2)}{n\lambda_n} \le \frac{1}{2}$. Therefore, when $n > N_2$, we obtain with probability at least $1-\delta_2-\delta_3-\delta_4-\delta_5$,
$$\frac{1}{2}\|\hat f_n-f_0\| \le \|\hat f_n-f_0\| - \frac{c_2c_4\log(\frac{2p}{\delta_2})}{n\lambda_n}\|\hat f_n-f_0\| \le c_6,$$
which is equivalent to $\|\hat f_n-f_0\| \le 2c_6$.
Next, we turn to bound $\|\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0\|_2$. According to (14), we have with probability at least $1-\delta_2-\delta_3-\delta_4-\delta_5$,
$$\|\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0\|_2 \le \frac{2c_4c_6+c_5}{\sqrt{n}}. \qquad (15)$$
Then we find a way to bound $\|\sqrt{T}(\hat f_n-f_0)\|$.
According to (6), we have
$$\|\sqrt{T}(\hat f_n-f_0)\| \le \lambda_n\|\sqrt{T}(T_n+\lambda_n I)^{-1}\|_{op}\|f_0\| + \|\sqrt{T}(T_n+\lambda_n I)^{-1}\|_{op}\|H_n\|_{op}\|\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0\|_2 + \|\sqrt{T}(T_n+\lambda_n I)^{-1}g_n\| =: E_1+E_2+E_3.$$
For the term $E_1$, we have with probability at least $1-\delta_1$, by Lemma 1 and Inequality (2),
$$E_1 \le \lambda_n\|(T+\lambda_n I)^{1/2}(T_n+\lambda_n I)^{-1/2}\|_{op}\|(T_n+\lambda_n I)^{-1/2}\|_{op}\|f_0\| \le \lambda_n\left(\frac{1}{\sqrt{\lambda_n}}\|(T+\lambda_n I)^{-1/2}(T_n-T)\|_{op}+1\right)\frac{\|f_0\|}{\sqrt{\lambda_n}}$$
$$\le \sqrt{\lambda_n}\,\|f_0\|\left(\frac{c_1\log(\frac{2}{\delta_1})B_n}{\sqrt{\lambda_n}}+1\right) \le \|f_0\|\left(c_1(\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})\log(\tfrac{2}{\delta_1})+1\right)\sqrt{\lambda_n} = c_7\sqrt{\lambda_n},$$
where we use $B_n \le (\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})\sqrt{\lambda_n}$ in the last inequality and define
$$c_7 := \|f_0\|\left(c_1(\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})\log(\tfrac{2}{\delta_1})+1\right).$$
For the term $E_3$, by applying Inequality (2) and Lemma 8, we have with probability at least $1-\delta_1-\delta_5$,
$$E_3 \le \|(T+\lambda_n I)^{1/2}(T_n+\lambda_n I)^{-1/2}\|_{op}\|(T_n+\lambda_n I)^{-1/2}(T+\lambda_n I)^{1/2}\|_{op}\|(T+\lambda_n I)^{-1/2}g_n\| \le \left(\frac{1}{\sqrt{\lambda_n}}\|(T+\lambda_n I)^{-1/2}(T_n-T)\|_{op}+1\right)^2\frac{\sigma}{\sqrt{\delta_5}}B_n$$
$$\le \left(\frac{c_1\log(\frac{2}{\delta_1})B_n}{\sqrt{\lambda_n}}+1\right)^2\frac{\sigma}{\sqrt{\delta_5}}B_n \le \left(c_1(\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})\log(\tfrac{2}{\delta_1})+1\right)^2\frac{\sigma(\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})}{\sqrt{\delta_5}}\sqrt{\lambda_n} = c_8\sqrt{\lambda_n},$$
where in the last inequality we let
$$c_8 := \left(c_1(\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})\log(\tfrac{2}{\delta_1})+1\right)^2\frac{\sigma(\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})}{\sqrt{\delta_5}}.$$
Recall that (15) holds with probability at least $1-\delta_2-\delta_3-\delta_4-\delta_5$. Therefore, for the term $E_2$ we obtain with probability at least $1-\sum_{i=1}^5\delta_i$, using Inequality (2), $\|(T_n+\lambda_n I)^{-1/2}\|_{op} \le \frac{1}{\sqrt{\lambda_n}}$, Lemma 2 and (15),
$$E_2 \le \|(T+\lambda_n I)^{1/2}(T_n+\lambda_n I)^{-1/2}\|_{op}\|(T_n+\lambda_n I)^{-1/2}\|_{op}\|H_n\|_{op}\|\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0\|_2 \le \left(c_1(\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})\log(\tfrac{2}{\delta_1})+1\right)\frac{1}{\sqrt{\lambda_n}}\cdot c_2\frac{\log(\frac{2p}{\delta_2})}{\sqrt{n}}\cdot\frac{2c_4c_6+c_5}{\sqrt{n}}$$
$$\le \frac{c_2(2c_4c_6+c_5)}{\sqrt{\omega}}\left(c_1(\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})\log(\tfrac{2}{\delta_1})+1\right)\log(\tfrac{2p}{\delta_2})\frac{1}{\sqrt{n}} = \frac{c_9}{\sqrt{n}},$$
where in the last step we use $n\lambda_n \ge \omega$ and define
$$c_9 := \frac{c_2(2c_4c_6+c_5)}{\sqrt{\omega}}\left(c_1(\omega^{-1}+\omega^{-\frac{1+\theta}{2}}\sqrt{c})\log(\tfrac{2}{\delta_1})+1\right)\log(\tfrac{2p}{\delta_2}).$$
Thus we bound $\|\sqrt{T}(\hat f_n-f_0)\|$ by
$$\|\sqrt{T}(\hat f_n-f_0)\| \le E_1+E_2+E_3 \le (c_7+c_8)\sqrt{\lambda_n} + \frac{c_9}{\sqrt{n}}.$$
Recall that the excess prediction risk can be bounded by (5), $\mathcal{E}(\hat\eta_n)-\mathcal{E}(\eta_0) \le 2\lambda_{\max}\|\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0\|_2^2 + 2\|\sqrt{T}(\hat f_n-f_0)\|^2$, based on which we further have
$$\mathcal{E}(\hat\eta_n)-\mathcal{E}(\eta_0) \le 2\lambda_{\max}\left(\frac{2c_4c_6+c_5}{\sqrt{n}}\right)^2 + 2\left((c_7+c_8)\sqrt{\lambda_n}+\frac{c_9}{\sqrt{n}}\right)^2 = C_1n^{-1} + C_2\sqrt{\frac{\lambda_n}{n}} + C_3\lambda_n,$$
where $C_1 := 2\lambda_{\max}(2c_4c_6+c_5)^2 + 2c_9^2$, $C_2 := 4(c_7+c_8)c_9$ and $C_3 := 2(c_7+c_8)^2$. Finally, we get the desired conclusion after noticing $\sqrt{\lambda_n/n} = \sqrt{\omega}\,n^{-\frac{2+\theta}{2(1+\theta)}}$. $\square$
6 Proof of Theorem 2

Let $M$ be the smallest integer greater than $b_3n^{\frac{1}{2r+1}}$, where $b_3$ will be defined later in the proof. For a binary sequence $\theta = (\theta_{M+1},\cdots,\theta_{2M})\in\{0,1\}^M$, define
$$\beta_\theta = M^{-1/2}\sum_{k=M+1}^{2M}\theta_k L_{K^{1/2}}\varphi_k.$$
By applying $\langle L_{K^{1/2}}\varphi_j, L_{K^{1/2}}\varphi_k\rangle_K = \langle\varphi_j,\varphi_k\rangle = \delta_{jk}$, we can show $\beta_\theta\in H(K)$, because
$$\|\beta_\theta\|_K^2 = \Big\|M^{-1/2}\sum_{k=M+1}^{2M}\theta_k L_{K^{1/2}}\varphi_k\Big\|_K^2 = \sum_{k=M+1}^{2M}M^{-1}\theta_k^2\|L_{K^{1/2}}\varphi_k\|_K^2 \le M^{-1}\sum_{k=M+1}^{2M}\|L_{K^{1/2}}\varphi_k\|_K^2 = 1.$$
Using Lemma 10, there exists a set $\Theta = \{\theta^i\}_{i=0}^N\subset\{0,1\}^M$ such that (i) $\theta^0 = (0,\cdots,0)$, (ii) $H(\theta^i,\theta^j) > \frac{M}{8}$ for all $i\ne j$, (iii) $N \ge 2^{M/8}$.

For $\eta = (\boldsymbol{\alpha},\beta)$, let $P^n_{\boldsymbol{\alpha},\beta}$ be the joint distribution on the product space of the training samples $\{(Z_i,\mathbf{X}_i,Y_i)\}_{i=1}^n$ generated by the true parameter $(\boldsymbol{\alpha},\beta)$, where $Z_i = \mathbf{X}_i^T\boldsymbol{\alpha} + \langle Y_i,\beta\rangle + \varepsilon_i$, and let $P_{\boldsymbol{\alpha},\beta}$ be the distribution of a single sample $(Z,\mathbf{X},Y)$, where $Z = \mathbf{X}^T\boldsymbol{\alpha} + \langle Y,\beta\rangle + \varepsilon$. By the independence of the training samples, for fixed $\boldsymbol{\alpha}^*\in\mathbb{R}^p$ and different $\theta,\theta'\in\Theta$, we have
$$\log\left(\frac{dP^n_{\boldsymbol{\alpha}^*,\beta_{\theta'}}}{dP^n_{\boldsymbol{\alpha}^*,\beta_\theta}}\big((Z_i,\mathbf{X}_i,Y_i): 1\le i\le n\big)\right) = \sum_{i=1}^n\log\left(\frac{dP_{\boldsymbol{\alpha}^*,\beta_{\theta'}}}{dP_{\boldsymbol{\alpha}^*,\beta_\theta}}(Z_i,\mathbf{X}_i,Y_i)\right).$$
Using Assumption 6, we can bound the Kullback-Leibler distance between $P^n_{\boldsymbol{\alpha}^*,\beta_{\theta'}}$ and $P^n_{\boldsymbol{\alpha}^*,\beta_\theta}$:
$$K(P^n_{\boldsymbol{\alpha}^*,\beta_{\theta'}}\,|\,P^n_{\boldsymbol{\alpha}^*,\beta_\theta}) = \sum_{i=1}^n\mathrm{E}_{\boldsymbol{\alpha}^*,\beta_{\theta'}}\log\left(\frac{dP_{\boldsymbol{\alpha}^*,\beta_{\theta'}}}{dP_{\boldsymbol{\alpha}^*,\beta_\theta}}\right) \le nK_\sigma\,\mathrm{E}(\langle Y,\beta_{\theta'}-\beta_\theta\rangle^2).$$
Noticing $\langle L_{K^{1/2}}\varphi_j, L_CL_{K^{1/2}}\varphi_k\rangle = \langle\varphi_j, T\varphi_k\rangle = \tau_k\delta_{jk}$, we have
$$\mathrm{E}(\langle Y,\beta_{\theta'}-\beta_\theta\rangle^2) = \langle\beta_{\theta'}-\beta_\theta, L_C(\beta_{\theta'}-\beta_\theta)\rangle = M^{-1}\sum_{k=M+1}^{2M}(\theta_k'-\theta_k)^2\tau_k \le M^{-1}\tau_{M+1}\sum_{k=M+1}^{2M}(\theta_k'-\theta_k)^2 = M^{-1}\tau_{M+1}H(\theta',\theta) \le \tau_{M+1} \le b_1M^{-2r},$$
from which we have $K(P^n_{\boldsymbol{\alpha}^*,\beta_{\theta'}}\,|\,P^n_{\boldsymbol{\alpha}^*,\beta_\theta}) \le b_1nK_\sigma M^{-2r}$. If we let $b_3 := \big(\frac{8b_1K_\sigma}{\log 2}\big)^{\frac{1}{2r+1}}\rho^{-\frac{1}{2r+1}}$, then for any $\rho\in(0,\frac{1}{8})$, we have
$$\frac{1}{N}\sum_{j=1}^N K(P^n_{\boldsymbol{\alpha}^*,\beta_{\theta^j}}\,|\,P^n_{\boldsymbol{\alpha}^*,\beta_{\theta^0}}) \le b_1nK_\sigma M^{-2r} \le \rho\log(2^{M/8}) \le \rho\log(N).$$
For $\theta\in\Theta$ and fixed $\boldsymbol{\alpha}^*\in\mathbb{R}^p$, let the prediction rule $\eta_\theta$ be $\eta_\theta(\mathbf{X},Y) := \mathbf{X}^T\boldsymbol{\alpha}^* + \langle Y,\beta_\theta\rangle$. For different $\theta,\theta'\in\Theta$, when the true parameter is $(\boldsymbol{\alpha}^*,\beta_\theta)$, the excess prediction risk of the prediction rule $\eta_{\theta'}$ is
$$\mathcal{E}(\eta_{\theta'})-\mathcal{E}(\eta_\theta) = \mathrm{E}\big[\mathbf{X}^{*T}(\boldsymbol{\alpha}^*-\boldsymbol{\alpha}^*) + \langle Y^*,\beta_{\theta'}-\beta_\theta\rangle\big]^2 = \mathrm{E}(\langle Y^*,\beta_{\theta'}-\beta_\theta\rangle^2) = M^{-1}\sum_{k=M+1}^{2M}(\theta_k'-\theta_k)^2\tau_k \ge M^{-1}\tau_{2M}H(\theta',\theta) \ge M^{-1}b_2(2M)^{-2r}\frac{M}{8} = b_2\,2^{-(2r+3)}M^{-2r}.$$
Since $M$ is the smallest integer greater than $b_3n^{\frac{1}{2r+1}}$, when $b_3n^{\frac{1}{2r+1}} \ge 1 \Leftrightarrow n \ge \frac{\rho\log 2}{8b_1K_\sigma}$, we have $M \le 2b_3n^{\frac{1}{2r+1}}$. Thus we obtain the lower bound for $\mathcal{E}(\eta_{\theta'})-\mathcal{E}(\eta_\theta)$:
$$\mathcal{E}(\eta_{\theta'})-\mathcal{E}(\eta_\theta) \ge b_2\,2^{-(2r+3)}\big(2b_3n^{\frac{1}{2r+1}}\big)^{-2r} = b_2\,2^{-(3+4r)}\left(\frac{8b_1K_\sigma}{\rho\log 2}\right)^{-\frac{2r}{2r+1}}n^{-\frac{2r}{2r+1}}.$$
For fixed $\boldsymbol{\alpha}^*\in\mathbb{R}^p$, consider the set $\Xi := \{(\boldsymbol{\alpha}^*,\beta_\theta): \theta\in\Theta\}$. By Lemma 9, we have
$$\inf_{\tilde\eta}\sup_{\eta_0\in\Xi}P\left(\mathcal{E}(\tilde\eta)-\mathcal{E}(\eta_0) \ge b_2\,2^{-(3+4r)}\left(\frac{8b_1K_\sigma}{\rho\log 2}\right)^{-\frac{2r}{2r+1}}n^{-\frac{2r}{2r+1}}\right) \ge \frac{\sqrt{N}}{1+\sqrt{N}}\left(1-2\rho-\sqrt{\frac{2\rho}{\log N}}\right).$$
Noticing $\sup_{\eta_0\in\Xi}P(\mathcal{E}(\tilde\eta)-\mathcal{E}(\eta_0)\ge\cdots) \le \sup_{\eta_0\in\mathbb{R}^p\times H(K)}P(\mathcal{E}(\tilde\eta)-\mathcal{E}(\eta_0)\ge\cdots)$ and $\log(N) \ge \frac{\log 2}{8}M$, we have the desired conclusion. $\square$

7 Proofs of the Lemmas in Section 5

7.1 Derivation of Equation (6) and Equation (7)

Recall that $Z_i = \mathbf{X}_i^T\boldsymbol{\alpha}_0 + \langle Y_i, L_{K^{1/2}}f_0\rangle + \varepsilon_i$, where $(\mathbf{X}_i,Y_i,\varepsilon_i)$ $(1\le i\le n)$ are independent copies of $(\mathbf{X},Y,\varepsilon)$ in (1). Thus the objective of (3) can be written as
$$F_n(\boldsymbol{\alpha},f) := \frac{1}{n}\sum_{i=1}^n\big(\mathbf{X}_i^T(\boldsymbol{\alpha}-\boldsymbol{\alpha}_0) + \langle Y_i, L_{K^{1/2}}(f-f_0)\rangle - \varepsilon_i\big)^2 + \lambda_n\|f\|^2.$$
Notice that $(\hat{\boldsymbol{\alpha}}_n,\hat f_n)$ is the minimizer of $F_n(\boldsymbol{\alpha},f)$, therefore $\frac{\partial F_n(\hat{\boldsymbol{\alpha}}_n,\hat f_n)}{\partial\boldsymbol{\alpha}} = 0$, from which we have
$$0 = \frac{1}{n}\sum_{i=1}^n\mathbf{X}_i\big(\mathbf{X}_i^T(\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0) + \langle Y_i, L_{K^{1/2}}(\hat f_n-f_0)\rangle - \varepsilon_i\big) = D_n(\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0) + G_n(\hat f_n-f_0) - \mathbf{a}_n,$$
thus $\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0 = -D_n^{-1}G_n(\hat f_n-f_0) + D_n^{-1}\mathbf{a}_n$.
Next, define the function $\varphi_n(t;\boldsymbol{\alpha},f,g) := F_n(\boldsymbol{\alpha},f+tg)$; the fact that $(\hat{\boldsymbol{\alpha}}_n,\hat f_n)$ minimizes $F_n(\boldsymbol{\alpha},f)$ implies
$$\frac{d\varphi_n(t;\hat{\boldsymbol{\alpha}}_n,\hat f_n,g)}{dt}\bigg|_{t=0} = 0, \quad \forall g\in L^2(\mathcal{T}),$$
from which we have
$$0 = \frac{1}{n}\sum_{i=1}^n\big(\mathbf{X}_i^T(\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0) + \langle Y_i, L_{K^{1/2}}(\hat f_n-f_0)\rangle - \varepsilon_i\big)\langle Y_i, L_{K^{1/2}}g\rangle + \lambda_n\langle\hat f_n, g\rangle. \qquad (16)$$
From (16), we have
$$\frac{1}{n}\sum_{i=1}^n\mathbf{X}_i^T(\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0)L_{K^{1/2}}Y_i + \frac{1}{n}\sum_{i=1}^n\langle Y_i, L_{K^{1/2}}(\hat f_n-f_0)\rangle L_{K^{1/2}}Y_i - \frac{1}{n}\sum_{i=1}^n\varepsilon_i L_{K^{1/2}}Y_i + \lambda_n\hat f_n = 0. \qquad (17)$$
Notice $L_{C_n}f = \langle\frac{1}{n}\sum_{i=1}^nY_i(s)Y_i(t), f(t)\rangle = \frac{1}{n}\sum_{i=1}^n\langle Y_i,f\rangle Y_i$, so (17) becomes
$$(T_n+\lambda_n I)\hat f_n - T_nf_0 + H_n(\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0) - g_n = 0.$$
Finally, we have
$$\hat f_n-f_0 = (T_n+\lambda_n I)^{-1}T_nf_0 - f_0 - (T_n+\lambda_n I)^{-1}H_n(\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0) + (T_n+\lambda_n I)^{-1}g_n = -\lambda_n(T_n+\lambda_n I)^{-1}f_0 - (T_n+\lambda_n I)^{-1}H_n(\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0) + (T_n+\lambda_n I)^{-1}g_n. \qquad\square$$

7.2 Proof of Lemma 2

We first prove $H_n = G_n^*$, which shows $\|G_n\|_{op} = \|H_n\|_{op}$. Notice that for any $\boldsymbol{\alpha}\in\mathbb{R}^p$ and $f\in L^2(\mathcal{T})$, we have
$$\boldsymbol{\alpha}^TG_n(f) = \frac{1}{n}\sum_{i=1}^n\langle Y_i, L_{K^{1/2}}f\rangle(\boldsymbol{\alpha}^T\mathbf{X}_i) = \frac{1}{n}\sum_{i=1}^n(\mathbf{X}_i^T\boldsymbol{\alpha})\langle L_{K^{1/2}}Y_i, f\rangle = \langle H_n(\boldsymbol{\alpha}), f\rangle.$$
Now we turn to bound $\|G_n\|_{op}$. For $1\le j\le p$, define the operator $G_{n,j}: L^2(\mathcal{T})\to\mathbb{R}$ by
$$G_{n,j}(f) := \frac{1}{n}\sum_{i=1}^n\langle Y_i, L_{K^{1/2}}f\rangle X_{i,j},$$
which can also be viewed as a random variable taking values in the Hilbert space $L^2(\mathcal{T})^* = L^2(\mathcal{T})$. Notice $G_n = (G_{n,1},\cdots,G_{n,p})^T$, thus we have
$$\|G_n(f)\|_2^2 = \sum_{j=1}^p|G_{n,j}(f)|^2 \le \sum_{j=1}^p\|G_{n,j}\|_{op}^2\|f\|^2 = \Big(\sum_{j=1}^p\|G_{n,j}\|_{op}^2\Big)\|f\|^2,$$
which means $\|G_n\|_{op} \le \big(\sum_{j=1}^p\|G_{n,j}\|_{op}^2\big)^{1/2}$.
Define the operators $\xi_{i,j}: L^2(\mathcal{T})\to\mathbb{R}$ and $\xi_j: L^2(\mathcal{T})\to\mathbb{R}$ by
$$\xi_{i,j}(f) := \langle Y_i, L_{K^{1/2}}f\rangle X_{i,j} \quad \text{and} \quad \xi_j(f) := \langle Y, L_{K^{1/2}}f\rangle X_j,$$
from which we rewrite $G_{n,j} = \frac{1}{n}\sum_{i=1}^n\xi_{i,j}$ and notice $\{\xi_{i,j}\}_{i=1}^n$ are independent copies of $\xi_j$. By the definition of $\xi_j$, after noticing the isometric isomorphism between $L^2(\mathcal{T})^*$ and $L^2(\mathcal{T})$, we have
$$\|\xi_j\|_{op}^l = \|X_jL_{K^{1/2}}Y\|^l \le |X_j|^l\|L_{K^{1/2}}\|_{op}^l\|Y\|^l,$$
by which $\|\xi_j\|_{op}^l \le \kappa^lM_2^l|X_j|^l$. Taking expectations and using the growth of moments condition for $X_j$, we have
$$\mathrm{E}(\|\xi_j\|_{op}^l) \le \kappa^lM_2^l\,\mathrm{E}(|X_j|^l) \le \kappa^lM_2^l\frac{M_1^2\upsilon^{l-2}\,l!}{2} = \frac{(\kappa M_1M_2)^2(\kappa M_2\upsilon)^{l-2}\,l!}{2}.$$
Because $X_j$ and $Y$ are independent with zero mean, we have $(\mathrm{E}\xi_j)(f) = \langle\mathrm{E}Y, L_{K^{1/2}}f\rangle\cdot\mathrm{E}X_j = 0$, which means $\mathrm{E}\xi_j = 0$. Taking $B = \kappa M_1M_2$ and $M = \kappa M_2\upsilon$ in Lemma 7, for any $\delta\in(0,1)$, with probability at least $1-\frac{\delta}{p}$ we have
$$\|G_{n,j}\|_{op} = \Big\|\frac{1}{n}\sum_{i=1}^n\xi_{i,j}\Big\|_{op} \le \frac{2\kappa M_2\upsilon\log(\frac{2p}{\delta})}{n} + \kappa M_1M_2\sqrt{\frac{2\log(\frac{2p}{\delta})}{n}} \le \frac{2\kappa(\upsilon+M_1)M_2\log(\frac{2p}{\delta})}{\sqrt{n}}.$$
Since $\|G_n\|_{op} \le \big(\sum_{j=1}^p\|G_{n,j}\|_{op}^2\big)^{1/2}$, we have the following relation between events:
$$\Big\{\|G_n\|_{op} \ge \frac{2\sqrt{p}\,\kappa(\upsilon+M_1)M_2\log(\frac{2p}{\delta})}{\sqrt{n}}\Big\} \subset \bigcup_{1\le j\le p}\Big\{\|G_{n,j}\|_{op} \ge \frac{2\kappa(\upsilon+M_1)M_2\log(\frac{2p}{\delta})}{\sqrt{n}}\Big\},$$
from which we have
$$P\left(\|G_n\|_{op} \ge \frac{2\sqrt{p}\,\kappa(\upsilon+M_1)M_2\log(\frac{2p}{\delta})}{\sqrt{n}}\right) \le \sum_{j=1}^pP\left(\|G_{n,j}\|_{op} \ge \frac{2\kappa(\upsilon+M_1)M_2\log(\frac{2p}{\delta})}{\sqrt{n}}\right) \le \sum_{j=1}^p\frac{\delta}{p} = \delta.$$
Therefore, we conclude that with probability at least $1-\delta$,
$$\|G_n\|_{op} \le c_2\frac{\log(\frac{2p}{\delta})}{\sqrt{n}}, \quad \text{where we let } c_2 := 2\sqrt{p}\,\kappa(\upsilon+M_1)M_2. \qquad\square$$
7.3 Lemma 3

Lemma 3 shows that $\|\mathbf{a}_n\|_2 = O_p(n^{-1/2})$; its proof is based on Markov's inequality.

Lemma 3.
Under Assumptions 1 and 3, for any $\delta\in(0,1)$, with probability at least $1-\delta$, we have
$$\|\mathbf{a}_n\|_2 \le \frac{c_3}{\sqrt{\delta}\sqrt{n}} \quad \text{with } c_3 := \sqrt{p}\,\sigma M_1.$$

Proof.
Define $\xi := \varepsilon\mathbf{X}$ and $\xi_i := \varepsilon_i\mathbf{X}_i$ $(1\le i\le n)$, by which we rewrite $\mathbf{a}_n = \frac{1}{n}\sum_{i=1}^n\xi_i$. Notice that for $i\ne j$, we have $\mathrm{E}(\xi_i^T\xi_j) = \mathrm{E}(\varepsilon_i\varepsilon_j\mathbf{X}_i^T\mathbf{X}_j) = \mathrm{E}\varepsilon_i\,\mathrm{E}\varepsilon_j\,(\mathrm{E}\mathbf{X}_i)^T\mathrm{E}\mathbf{X}_j = 0$. And
$$\mathrm{E}(\|\xi\|_2^2) = \mathrm{E}(\varepsilon^2\|\mathbf{X}\|_2^2) = \mathrm{E}\varepsilon^2\,\mathrm{E}\|\mathbf{X}\|_2^2 = \sigma^2\sum_{j=1}^p\mathrm{E}|X_j|^2 \le p\sigma^2M_1^2,$$
where we use the growth of moments condition on $\mathbf{X}$ of Assumption 1. Therefore, we have
$$\mathrm{E}(\|\mathbf{a}_n\|_2^2) = \mathrm{E}\Big(\Big\|\frac{1}{n}\sum_{i=1}^n\xi_i\Big\|_2^2\Big) = \frac{\mathrm{E}(\|\xi\|_2^2)}{n} \le \frac{p\sigma^2M_1^2}{n},$$
from which we obtain the Markov inequality for $\mathbf{a}_n$: $P(\|\mathbf{a}_n\|_2 \ge t) \le \frac{p\sigma^2M_1^2}{nt^2}$. Therefore, we can conclude that for $\delta\in(0,1)$, with probability at least $1-\delta$,
$$\|\mathbf{a}_n\|_2 \le \frac{c_3}{\sqrt{\delta}\sqrt{n}} \quad \text{with } c_3 := \sqrt{p}\,\sigma M_1. \qquad\square$$
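A small Monte Carlo check of Lemma 3 is straightforward; the distributions, $\sigma$ and $M_1$ below are our illustrative choices, and the empirical exceedance probability should stay below $\delta$ (Markov's inequality is loose, so it will typically be far below).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, delta, reps = 500, 10, 1.0, 0.1, 2000
M1 = 1.0                                   # E|X_j|^2 <= M1^2 for standard normal X_j
c3 = np.sqrt(p) * sigma * M1
threshold = c3 / (np.sqrt(delta) * np.sqrt(n))
exceed = 0
for _ in range(reps):
    X = rng.normal(size=(n, p))
    eps = rng.normal(scale=sigma, size=n)
    a_n = (eps[:, None] * X).mean(axis=0)  # a_n = (1/n) sum_i eps_i X_i
    exceed += np.linalg.norm(a_n) > threshold
print(exceed / reps, "<=", delta)          # empirical probability vs delta
```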
7.4 Lemma 4

Lemma 4 is a concentration inequality for the empirical covariance matrix $D_n$.

Lemma 4. Under Assumption 1, for $t > 0$, we have
$$P(\|D_n-D\|_\infty \ge t) \le 2p^2\exp\left(-\frac{nt^2}{16\upsilon^2(16M_1^2+t)}\right).$$

Proof. For $1\le j,k\le p$, put $d_{jk} := (D)_{jk}$ and
$$d_{jk}^n := (D_n)_{jk} = \frac{1}{n}\sum_{i=1}^nX_{i,j}X_{i,k} = \frac{1}{n}\sum_{i=1}^nd_{jk,i}^n \quad \text{with } d_{jk,i}^n := X_{i,j}X_{i,k}.$$
Let $e_{jk,i}^n$ be the centralization of $d_{jk,i}^n$, $e_{jk,i}^n := d_{jk,i}^n - \mathrm{E}d_{jk,i}^n = d_{jk,i}^n - d_{jk}$. Thus we have
$$\{\|D_n-D\|_\infty \ge t\} = \bigcup_{1\le j,k\le p}\{|d_{jk}^n-d_{jk}| \ge t\} = \bigcup_{1\le j,k\le p}\Big\{\Big|\frac{1}{n}\sum_{i=1}^ne_{jk,i}^n\Big| \ge t\Big\}.$$
By the Cauchy-Schwarz inequality, we have
$$\mathrm{E}(|d_{jk,i}^n|^l) = \mathrm{E}(|X_{i,j}|^l|X_{i,k}|^l) \le (\mathrm{E}|X_{i,j}|^{2l})^{1/2}(\mathrm{E}|X_{i,k}|^{2l})^{1/2} \le \frac{M_1^2\upsilon^{2l-2}(2l)!}{2},$$
where we use the growth of moments condition on $\mathbf{X}$ in the last step. Using $(2l)! \le 4^l(l!)^2$, we have
$$\mathrm{E}(|d_{jk,i}^n|^l) \le \frac{M_1^2\upsilon^{2l-2}4^l(l!)^2}{2} = \frac{(4\upsilon M_1)^2(4\upsilon^2)^{l-2}(l!)^2}{2}.$$
Using $|a-b|^l \le 2^{l-1}(|a|^l+|b|^l)$ and Jensen's inequality $|\mathrm{E}d_{jk,i}^n|^l \le \mathrm{E}|d_{jk,i}^n|^l$, we have
$$\mathrm{E}(|e_{jk,i}^n|^l) \le 2^l\,\mathrm{E}|d_{jk,i}^n|^l \le \frac{(8\sqrt{2}\,\upsilon M_1)^2(8\upsilon^2)^{l-2}(l!)^2}{2}.$$
For the independent random variables $\{e_{jk,i}^n\}_{i=1}^n$, a Bernstein-type inequality under this growth of moments condition gives
$$P\left(\Big|\frac{1}{n}\sum_{i=1}^ne_{jk,i}^n\Big| \ge t\right) \le 2\exp\left(-\frac{n^2t^2}{2(128n\upsilon^2M_1^2 + 8\upsilon^2nt)}\right) = 2\exp\left(-\frac{nt^2}{16\upsilon^2(16M_1^2+t)}\right).$$
Thus we conclude
$$P(\|D_n-D\|_\infty \ge t) \le \sum_{1\le j,k\le p}P\left(\Big|\frac{1}{n}\sum_{i=1}^ne_{jk,i}^n\Big| \ge t\right) \le 2p^2\exp\left(-\frac{nt^2}{16\upsilon^2(16M_1^2+t)}\right). \qquad\square$$
7.5 Lemma 5

Lemma 5 shows that we can bound $\|D_n^{-1}\|_{op}$ from above using $\|D^{-1}\|_{op}$.

Lemma 5. Under Assumption 1, for any $\delta\in(0,1)$, let
$$N_1 = 48\upsilon^2p\|D^{-1}\|_{op}\big(48p\|D^{-1}\|_{op}M_1^2+1\big)\log(\tfrac{2p^2}{\delta}); \qquad (18)$$
then we have $P\big(\|D_n^{-1}\|_{op} \le \tfrac{3}{2}\|D^{-1}\|_{op}\big) \ge 1-\delta$ when $n > N_1$.

Proof. Notice the fact that if $A,B\in\mathbb{R}^{p\times p}$ are invertible and $\|A^{-1}\|_{op}\|A-B\|_{op} < 1$, then
$$\|A^{-1}-B^{-1}\|_{op} \le \frac{\|A^{-1}\|_{op}^2\|A-B\|_{op}}{1-\|A^{-1}\|_{op}\|A-B\|_{op}}$$
(see Lemma E.4 in Sun et al. (2017)). Let $A = D$ and $B = D_n$; when $\|D^{-1}\|_{op}\|D-D_n\|_{op} \le \frac{1}{3}$, we have
$$\|D^{-1}-D_n^{-1}\|_{op} \le \frac{\|D^{-1}\|_{op}\cdot\frac{1}{3}}{1-\frac{1}{3}} = \frac{1}{2}\|D^{-1}\|_{op},$$
from which we have $\|D_n^{-1}\|_{op} \le \|D^{-1}\|_{op} + \|D^{-1}-D_n^{-1}\|_{op} \le \frac{3}{2}\|D^{-1}\|_{op}$. Therefore, it gives
$$P\big(\|D_n^{-1}\|_{op} \le \tfrac{3}{2}\|D^{-1}\|_{op}\big) \ge P\big(\|D^{-1}\|_{op}\|D-D_n\|_{op} \le \tfrac{1}{3}\big).$$
Recall that $\|A\|_{op} \le p\|A\|_\infty$ for $A\in\mathbb{R}^{p\times p}$; by Lemma 4, we have
$$P\big(\|D_n^{-1}\|_{op} \ge \tfrac{3}{2}\|D^{-1}\|_{op}\big) \le P\big(\|D^{-1}\|_{op}\|D-D_n\|_{op} \ge \tfrac{1}{3}\big) \le P\big(p\|D^{-1}\|_{op}\|D-D_n\|_\infty \ge \tfrac{1}{3}\big) \le 2p^2\exp\left(-\frac{n}{48\upsilon^2p\|D^{-1}\|_{op}(48p\|D^{-1}\|_{op}M_1^2+1)}\right).$$
Let $N_1$ be defined as in (18); for $n > N_1$, we get $P(\|D_n^{-1}\|_{op} \ge \frac{3}{2}\|D^{-1}\|_{op}) \le \delta$. $\square$

8 Proofs of the Lemmas in Sections 5 and 6

Lemma 1, Lemma 8 and Inequalities (1) and (2) are adapted from the proof of Tong and Ng (2018). We provide all the complete proofs in this section for integrity.

8.1 Proof of Lemma 1
Define the random variables
$$\xi(f) := (T+\lambda_nI)^{-1/2}\langle L_{K^{1/2}}Y, f\rangle L_{K^{1/2}}Y \quad \text{and} \quad \xi_i(f) := (T+\lambda_nI)^{-1/2}\langle L_{K^{1/2}}Y_i, f\rangle L_{K^{1/2}}Y_i \quad (1\le i\le n).$$
Then the $\xi_i$ are independent copies of $\xi$, which takes values in $\mathrm{HS}(\mathcal{T})$. Recall that we defined $\{(\tau_k,\varphi_k): k\ge 1\}$ to be the set of eigenvalue-eigenfunction pairs of the operator $T$; we have
$$\|\xi\|_{HS}^2 = \sum_{k=1}^{+\infty}\big\|(T+\lambda_nI)^{-1/2}\langle L_{K^{1/2}}Y,\varphi_k\rangle L_{K^{1/2}}Y\big\|^2 = \Big(\sum_{k=1}^{+\infty}|\langle L_{K^{1/2}}Y,\varphi_k\rangle|^2\Big)\|(T+\lambda_nI)^{-1/2}L_{K^{1/2}}Y\|^2 \le \|(T+\lambda_nI)^{-1/2}\|_{op}^2\|L_{K^{1/2}}Y\|^4 \le \frac{\kappa^4M_2^4}{\lambda_n},$$
where we use the fact $\sum_{k=1}^{+\infty}|\langle L_{K^{1/2}}Y,\varphi_k\rangle|^2 = \|L_{K^{1/2}}Y\|^2$, the bound $\|(T+\lambda_nI)^{-1/2}\|_{op} \le \frac{1}{\sqrt{\lambda_n}}$, and $\|L_{K^{1/2}}Y\| \le \|L_{K^{1/2}}\|_{op}\|Y\| \le \kappa M_2$. Expanding $L_{K^{1/2}}Y = \sum_{l=1}^{+\infty}\langle L_{K^{1/2}}Y,\varphi_l\rangle\varphi_l$, we also have the refinement
$$\|\xi\|_{HS}^2 = \|L_{K^{1/2}}Y\|^2\Big\|\sum_{l=1}^{+\infty}\frac{1}{\sqrt{\tau_l+\lambda_n}}\langle L_{K^{1/2}}Y,\varphi_l\rangle\varphi_l\Big\|^2 = \|L_{K^{1/2}}Y\|^2\sum_{l=1}^{+\infty}\frac{|\langle L_{K^{1/2}}Y,\varphi_l\rangle|^2}{\tau_l+\lambda_n} \le \kappa^2M_2^2\sum_{l=1}^{+\infty}\frac{|\langle L_{K^{1/2}}Y,\varphi_l\rangle|^2}{\tau_l+\lambda_n}.$$
Since
$$\mathrm{E}\big[|\langle L_{K^{1/2}}Y,\varphi_l\rangle|^2\big] = \mathrm{E}\big[|\langle Y, L_{K^{1/2}}\varphi_l\rangle|^2\big] = \langle L_{K^{1/2}}\varphi_l, L_CL_{K^{1/2}}\varphi_l\rangle = \langle T\varphi_l,\varphi_l\rangle = \tau_l,$$
we get
$$\mathrm{E}(\|\xi\|_{HS}^2) \le \kappa^2M_2^2\sum_{l=1}^{+\infty}\frac{\tau_l}{\tau_l+\lambda_n} = \kappa^2M_2^2\,\mathcal{D}(\lambda_n).$$
Notice
$$\mathrm{E}(\langle L_{K^{1/2}}Y,f\rangle\langle L_{K^{1/2}}Y,g\rangle) = \mathrm{E}(\langle Y,L_{K^{1/2}}f\rangle\langle Y,L_{K^{1/2}}g\rangle) = \mathrm{E}\iint_{\mathcal{T}\times\mathcal{T}}Y(s)Y(t)(L_{K^{1/2}}f)(s)(L_{K^{1/2}}g)(t)\,ds\,dt = \int_{\mathcal{T}}\left(\int_{\mathcal{T}}C(s,t)(L_{K^{1/2}}f)(s)\,ds\right)(L_{K^{1/2}}g)(t)\,dt = \langle L_CL_{K^{1/2}}f, L_{K^{1/2}}g\rangle = \langle Tf, g\rangle,$$
from which we have $\mathrm{E}(\langle L_{K^{1/2}}Y,f\rangle L_{K^{1/2}}Y) = T(f)$ and therefore $(\mathrm{E}\xi)(f) = (T+\lambda_nI)^{-1/2}T(f)$, so that $\frac{1}{n}\sum_{i=1}^n\xi_i - \mathrm{E}\xi = (T+\lambda_nI)^{-1/2}(T_n-T)$. Taking $\|\xi\|_{HS} \le \frac{\kappa^2M_2^2}{\sqrt{\lambda_n}}$ and $\mathrm{E}(\|\xi\|_{HS}^2) \le \kappa^2M_2^2\,\mathcal{D}(\lambda_n)$ in Lemma 6, we have for any $\delta\in(0,e^{-1})$, with probability at least $1-\delta$,
$$\|(T+\lambda_nI)^{-1/2}(T_n-T)\|_{op} \le \Big\|\frac{1}{n}\sum_{i=1}^n(\xi_i-\mathrm{E}\xi_i)\Big\|_{HS} \le \frac{2\kappa^2M_2^2\log(\frac{2}{\delta})}{n\sqrt{\lambda_n}} + \sqrt{\frac{2\kappa^2M_2^2\,\mathcal{D}(\lambda_n)\log(\frac{2}{\delta})}{n}} \le c_1\log(\tfrac{2}{\delta})B_n,$$
where we let $c_1 := 2\kappa^2M_2^2$ and $B_n := \frac{1}{n\sqrt{\lambda_n}} + \sqrt{\frac{\mathcal{D}(\lambda_n)}{n}}$. $\square$

8.2 Lemma 6 and Lemma 7

Lemma 6 and Lemma 7 can be seen as concentration inequalities for random variables taking values in a Hilbert space. In Lemma 6, we assume the random variables to be bounded with respect to the norm of the Hilbert space, while in Lemma 7 we assume that the random variables satisfy the growth of moments condition.
Lemma 6.
Let $H$ be a Hilbert space endowed with norm $\|\cdot\|_H$ and let $\xi$ be a random variable taking values in $H$. Let $\{\xi_i\}_{i=1}^n$ be a sequence of $n$ independent copies of $\xi$. Assume that $\|\xi\|_H \le M$ (a.s.). Then for any $\delta\in(0,1)$,
$$\Big\|\frac{1}{n}\sum_{i=1}^n(\xi_i-\mathrm{E}\xi_i)\Big\|_H \le \frac{2M\log(\frac{2}{\delta})}{n} + \sqrt{\frac{2\,\mathrm{E}(\|\xi\|_H^2)\log(\frac{2}{\delta})}{n}}$$
with probability at least $1-\delta$.

Proof. We refer readers to Pinelis (1994) for the proof of this lemma.
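As a quick sanity check of Lemma 6, the following toy simulation (our setup: the "Hilbert space" is $\mathbb{R}^{50}$, $\xi$ uniform on a cube rescaled to the unit ball) compares the deviation of the empirical mean with the stated bound; violations should occur with frequency at most $\delta$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, delta, reps = 50, 1000, 0.05, 500
M = 1.0
violations = 0
for _ in range(reps):
    xi = rng.uniform(-1, 1, size=(n, d))
    xi /= np.maximum(1.0, np.linalg.norm(xi, axis=1, keepdims=True))  # ||xi|| <= 1 = M
    dev = np.linalg.norm(xi.mean(axis=0))             # E xi = 0 by symmetry
    second_moment = d / 3.0                           # an upper bound on E||xi||^2
    bound = 2 * M * np.log(2 / delta) / n + np.sqrt(2 * second_moment * np.log(2 / delta) / n)
    violations += dev > bound
print(violations / reps, "<=", delta)
```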
Lemma 7.
Let $H$ be a Hilbert space endowed with norm $\|\cdot\|_H$. Let $\{\xi_i\}_{i=1}^n$ be a sequence of $n$ independent random variables in $H$ with zero mean. Assume there exist $B, M > 0$ such that for all integers $l\ge 2$,
$$\mathrm{E}(\|\xi_i\|_H^l) \le \frac{B^2\,l!\,M^{l-2}}{2}, \quad i = 1,2,\cdots,n.$$
Then for any $\delta\in(0,1)$, we have
$$P\left(\Big\|\frac{1}{n}\sum_{i=1}^n\xi_i\Big\|_H \ge \frac{2M\log(\frac{2}{\delta})}{n} + \sqrt{\frac{2B^2\log(\frac{2}{\delta})}{n}}\right) \le \delta.$$

Proof.
We refer readers to Yurinsky (2006) for the proof of this lemma.
8.3 Lemma 8

Lemma 8 shows the concentration property of $(T+\lambda_nI)^{-1/2}g_n$.

Lemma 8.
Under Assumption 3, for any $\delta\in(0,1)$, with probability at least $1-\delta$ we have
$$\|(T+\lambda_nI)^{-1/2}g_n\| \le \frac{\sigma}{\sqrt{\delta}}B_n.$$

Proof.
Define random variables $\xi$ and $\xi_i$ $(1\le i\le n)$ taking values in the Hilbert space $L^2(\mathcal{T})$ by
$$\xi := \varepsilon(T+\lambda_nI)^{-1/2}L_{K^{1/2}}Y \quad \text{and} \quad \xi_i := \varepsilon_i(T+\lambda_nI)^{-1/2}L_{K^{1/2}}Y_i,$$
where we notice $\{\xi_i\}_{i=1}^n$ are independent copies of $\xi$ and $(T+\lambda_nI)^{-1/2}g_n = \frac{1}{n}\sum_{i=1}^n\xi_i$. With the independence of $Y$ and $\varepsilon$, we have $\mathrm{E}\xi = (\mathrm{E}\varepsilon)\cdot\mathrm{E}\big((T+\lambda_nI)^{-1/2}L_{K^{1/2}}Y\big)$, which shows $\mathrm{E}\xi = 0$. Expanding $\xi$ in the basis $\{\varphi_k: k\ge 1\}$ of the operator $T$, we have
$$\mathrm{E}(\|\xi\|^2) = \mathrm{E}\Big\|\sum_{k=1}^{+\infty}\langle\varepsilon(T+\lambda_nI)^{-1/2}L_{K^{1/2}}Y,\varphi_k\rangle\varphi_k\Big\|^2 = \mathrm{E}\varepsilon^2\,\mathrm{E}\Big\|\sum_{k=1}^{+\infty}\frac{1}{\sqrt{\tau_k+\lambda_n}}\langle L_{K^{1/2}}Y,\varphi_k\rangle\varphi_k\Big\|^2 = \mathrm{E}\varepsilon^2\,\mathrm{E}\Big(\sum_{k=1}^{+\infty}\frac{|\langle L_{K^{1/2}}Y,\varphi_k\rangle|^2}{\tau_k+\lambda_n}\Big) = \sigma^2\sum_{k=1}^{+\infty}\frac{\langle T\varphi_k,\varphi_k\rangle}{\tau_k+\lambda_n} = \sigma^2\mathcal{D}(\lambda_n).$$
Hence we have
$$\mathrm{E}(\|(T+\lambda_nI)^{-1/2}g_n\|^2) = \mathrm{E}\Big(\Big\|\frac{1}{n}\sum_{i=1}^n\xi_i\Big\|^2\Big) = \frac{\mathrm{E}(\|\xi\|^2)}{n} = \frac{\sigma^2\mathcal{D}(\lambda_n)}{n}.$$
Using the Markov inequality, we get $P(\|(T+\lambda_nI)^{-1/2}g_n\| \ge t) \le \frac{\sigma^2\mathcal{D}(\lambda_n)}{nt^2}$, from which we conclude that with probability at least $1-\delta$,
$$\|(T+\lambda_nI)^{-1/2}g_n\| \le \frac{\sigma}{\sqrt{\delta}}\sqrt{\frac{\mathcal{D}(\lambda_n)}{n}} \le \frac{\sigma}{\sqrt{\delta}}B_n. \qquad\square$$
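The exact identity $\mathrm{E}\|(T+\lambda_nI)^{-1/2}g_n\|^2 = \sigma^2\mathcal{D}(\lambda_n)/n$ used above can be verified numerically in a finite-dimensional toy model where $T$ is diagonal; the eigenvalue sequence and sample sizes are our assumptions.

```python
import numpy as np

# Finite-dimensional sanity check of Lemma 8's identity
# E||(T + lam I)^{-1/2} g_n||^2 = sigma^2 D(lam) / n, with T = diag(tau).
rng = np.random.default_rng(3)
d, n, lam, sigma, reps = 30, 200, 0.05, 1.0, 5000
tau = np.arange(1, d + 1, dtype=float) ** -2.0        # eigenvalues of T
vals = []
for _ in range(reps):
    # scores of L_{K^{1/2}} Y_i in the eigenbasis, with variance tau_k each
    scores = rng.normal(size=(n, d)) * np.sqrt(tau)
    eps = rng.normal(scale=sigma, size=n)
    g_n = (eps[:, None] * scores).mean(axis=0)
    vals.append(np.sum(g_n**2 / (tau + lam)))         # ||(T + lam I)^{-1/2} g_n||^2
D_lam = np.sum(tau / (tau + lam))
print(np.mean(vals), "vs", sigma**2 * D_lam / n)
```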
8.4 Two Operator Inequalities

The following two inequalities play an important role in the proof of Theorem 1; they show that $\|(T+\lambda_nI)(T_n+\lambda_nI)^{-1}\|_{op}$ and $\|(T+\lambda_nI)^{1/2}(T_n+\lambda_nI)^{-1/2}\|_{op}$ can be bounded by means of $\|(T+\lambda_nI)^{-1/2}(T_n-T)\|_{op}$.

Inequality 1.
$$\|(T+\lambda_nI)(T_n+\lambda_nI)^{-1}\|_{op} \le \left(\frac{1}{\sqrt{\lambda_n}}\|(T+\lambda_nI)^{-1/2}(T_n-T)\|_{op}+1\right)^2.$$

Proof. Using the following decomposition of the operator product,
$$BA^{-1} = (B-A)B^{-1}(B-A)A^{-1} + (B-A)B^{-1} + I,$$
with $A = T_n+\lambda_nI$ and $B = T+\lambda_nI$, we have
$$(T+\lambda_nI)(T_n+\lambda_nI)^{-1} = (T-T_n)(T+\lambda_nI)^{-1}(T-T_n)(T_n+\lambda_nI)^{-1} + (T-T_n)(T+\lambda_nI)^{-1} + I =: F_1 + F_2 + I.$$
For the operator $F_1$, we have
$$\|F_1\|_{op} \le \|(T-T_n)(T+\lambda_nI)^{-1/2}\|_{op}\|(T+\lambda_nI)^{-1/2}(T-T_n)\|_{op}\cdot\frac{1}{\lambda_n} = \frac{1}{\lambda_n}\|(T+\lambda_nI)^{-1/2}(T_n-T)\|_{op}^2,$$
where we use the fact $\|AB\|_{op} = \|(AB)^*\|_{op} = \|B^*A^*\|_{op} = \|BA\|_{op}$ for self-adjoint operators $A$ and $B$, and the bound $\|(T_n+\lambda_nI)^{-1}\|_{op} \le \frac{1}{\lambda_n}$. For the operator $F_2$, applying $\|(T+\lambda_nI)^{-1/2}\|_{op} \le \frac{1}{\sqrt{\lambda_n}}$, we have
$$\|F_2\|_{op} \le \|(T-T_n)(T+\lambda_nI)^{-1/2}\|_{op}\|(T+\lambda_nI)^{-1/2}\|_{op} \le \frac{1}{\sqrt{\lambda_n}}\|(T+\lambda_nI)^{-1/2}(T_n-T)\|_{op}.$$
Thus we obtain
$$\|(T+\lambda_nI)(T_n+\lambda_nI)^{-1}\|_{op} \le \frac{1}{\lambda_n}\|(T+\lambda_nI)^{-1/2}(T_n-T)\|_{op}^2 + \frac{1}{\sqrt{\lambda_n}}\|(T+\lambda_nI)^{-1/2}(T_n-T)\|_{op} + 1 \le \left(\frac{1}{\sqrt{\lambda_n}}\|(T+\lambda_nI)^{-1/2}(T_n-T)\|_{op}+1\right)^2. \qquad\square$$
Inequality 2.
$$\|(T_n+\lambda_nI)^{-1/2}(T+\lambda_nI)^{1/2}\|_{op} = \|(T+\lambda_nI)^{1/2}(T_n+\lambda_nI)^{-1/2}\|_{op} \le \frac{1}{\sqrt{\lambda_n}}\|(T+\lambda_nI)^{-1/2}(T_n-T)\|_{op} + 1.$$

Proof. Applying the fact that $\|A^\gamma B^\gamma\|_{op} \le \|AB\|_{op}^\gamma$, $\gamma\in(0,1)$, for positive self-adjoint operators $A$ and $B$ on a Hilbert space (see Lemma A.7 in Blanchard and Krämer (2010)), we have
$$\|(T_n+\lambda_nI)^{-1/2}(T+\lambda_nI)^{1/2}\|_{op} = \|(T+\lambda_nI)^{1/2}(T_n+\lambda_nI)^{-1/2}\|_{op} \le \|(T+\lambda_nI)(T_n+\lambda_nI)^{-1}\|_{op}^{1/2} \le \frac{1}{\sqrt{\lambda_n}}\|(T+\lambda_nI)^{-1/2}(T_n-T)\|_{op} + 1,$$
where in the last step we use Inequality (1). $\square$

8.5 Lemma 9

Lemma 9 is helpful in constructing the lower bound, which is based on testing multiple hypotheses.
Lemma 9.
Assume that $N\ge 2$ and suppose there exists a set $\Theta = \{\theta^i\}_{i=0}^N$ such that the following conditions are satisfied:
1. $2s$-separation condition: $d(\theta^j,\theta^k) \ge 2s > 0$, $\forall\, 0\le j < k\le N$;
2. Kullback-Leibler average condition: $P_j \ll P_0$ for $1\le j\le N$ and
$$\frac{1}{N}\sum_{j=1}^NK(P_j\,|\,P_0) \le \rho\log N$$
for some $0 < \rho < \frac{1}{8}$, where $P_j = P_{\theta^j}$ $(0\le j\le N)$.
Then for all possible estimators $\tilde\theta$, we have
$$\inf_{\tilde\theta}\sup_{\theta\in\Theta}P_\theta\big(d(\tilde\theta,\theta) \ge s\big) \ge \frac{\sqrt{N}}{1+\sqrt{N}}\left(1-2\rho-\sqrt{\frac{2\rho}{\log N}}\right) > 0.$$

Proof. We refer readers to Theorem 2.5 in Tsybakov (2008) for the proof of this lemma.

8.6 Varshamov-Gilbert Lemma
We need the Varshamov-Gilbert lemma in the proof of Theorem 2 to construct the set $\Theta$ of Lemma 9.
Lemma 10.
Let $H(\theta,\theta') = \sum_{k=1}^M\mathbb{1}(\theta_k\ne\theta_k')$ be the Hamming distance between elements $\theta,\theta'$ of $\{0,1\}^M$. For any integer $M\ge 8$, there exist vectors $\{\theta^i\}_{i=0}^N\subset\{0,1\}^M$ such that (i) $\theta^0 = (0,\cdots,0)$, (ii) $H(\theta^i,\theta^j) > \frac{M}{8}$ for all $i\ne j$, (iii) $N \ge 2^{M/8}$.

Proof.
We refer readers to page 104 of Tsybakov (2008) for the proof of this lemma.
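For small $M$, the guarantee of Lemma 10 can be checked by brute force; the greedy packing below is an illustrative construction of ours, not the lemma's proof.

```python
import numpy as np
from itertools import product

# Brute-force check of the Varshamov-Gilbert lemma for a small M: greedily pick
# binary vectors with pairwise Hamming distance > M/8 and compare the packing
# size with the guaranteed 2^{M/8}.
M = 16
picked = [np.zeros(M, dtype=int)]                      # theta^0 = (0, ..., 0)
for cand in product([0, 1], repeat=M):
    c = np.array(cand)
    if all(np.sum(c != q) > M / 8 for q in picked):
        picked.append(c)
print(len(picked) - 1, ">=", 2 ** (M / 8))             # N vs the lemma's bound
```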
8.7 Proof of Equation (12)

Let $f(\mathbf{X},Y)$ be the density of $(\mathbf{X},Y)$ and $f_0(\varepsilon) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{\varepsilon^2}{2\sigma^2}}$ be the density of $\varepsilon$, since we assume $\varepsilon\sim N(0,\sigma^2)$. Then the density of $P_{\boldsymbol{\alpha}^*,\beta}$ can be written as
$$\frac{dP_{\boldsymbol{\alpha}^*,\beta}}{d\mu}(Z,\mathbf{X},Y) = f(\mathbf{X},Y)\,f_0(Z-\mathbf{X}^T\boldsymbol{\alpha}^*-\langle Y,\beta\rangle),$$
where $\mu$ is a dominating measure on the space $\mathbb{R}\times\mathbb{R}^p\times L^2(\mathcal{T})$. Therefore, under the assumption $f_0(\varepsilon) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{\varepsilon^2}{2\sigma^2}}$, we have
$$\log\left(\frac{dP_{\boldsymbol{\alpha}^*,\beta_1}}{dP_{\boldsymbol{\alpha}^*,\beta_2}}\right) = \log\left(\frac{f_0(Z-\mathbf{X}^T\boldsymbol{\alpha}^*-\langle Y,\beta_1\rangle)}{f_0(Z-\mathbf{X}^T\boldsymbol{\alpha}^*-\langle Y,\beta_2\rangle)}\right) = \frac{1}{2\sigma^2}\big((Z-\mathbf{X}^T\boldsymbol{\alpha}^*-\langle Y,\beta_2\rangle)^2 - (Z-\mathbf{X}^T\boldsymbol{\alpha}^*-\langle Y,\beta_1\rangle)^2\big)$$
$$= \frac{1}{\sigma^2}(Z-\mathbf{X}^T\boldsymbol{\alpha}^*-\langle Y,\beta_1\rangle)\langle Y,\beta_1-\beta_2\rangle + \frac{1}{2\sigma^2}(\langle Y,\beta_1-\beta_2\rangle)^2.$$
Notice that when the true parameter is $(\boldsymbol{\alpha}^*,\beta_1)$, we have $Z = \mathbf{X}^T\boldsymbol{\alpha}^* + \langle Y,\beta_1\rangle + \varepsilon$, based on which we obtain
$$\mathrm{E}_{\boldsymbol{\alpha}^*,\beta_1}\big[(Z-\mathbf{X}^T\boldsymbol{\alpha}^*-\langle Y,\beta_1\rangle)\langle Y,\beta_1-\beta_2\rangle\big] = \mathrm{E}\varepsilon\cdot\mathrm{E}\langle Y,\beta_1-\beta_2\rangle = 0.$$
Thus the Kullback-Leibler distance is
$$K(P_{\boldsymbol{\alpha}^*,\beta_1}\,|\,P_{\boldsymbol{\alpha}^*,\beta_2}) = \mathrm{E}_{\boldsymbol{\alpha}^*,\beta_1}\log\left(\frac{dP_{\boldsymbol{\alpha}^*,\beta_1}}{dP_{\boldsymbol{\alpha}^*,\beta_2}}\right) = \frac{1}{2\sigma^2}\mathrm{E}(\langle Y,\beta_1-\beta_2\rangle^2). \qquad\square$$

9 Conclusion

Recently, the PFLM has raised a sizable number of challenging problems in functional data analysis. Numerous studies focus on the asymptotic convergence rate. In contrast, we analyze the kernel ridge estimator for the RKHS-based PFLM and obtain a non-asymptotic upper bound for the corresponding excess prediction risk. Our work deriving the optimal upper bound weakens the common assumptions in the existing literature on (partially) functional linear regressions. The optimal bound reveals that prediction consistency holds in the setting where the number of non-functional parameters $p$ slightly increases with the sample size $n$. For fixed $p$, the convergence rate of the excess prediction risk attains the optimal minimax convergence rate under the eigenvalue decay assumption on the covariance operator. More work could be done to study the non-asymptotic upper bound for doubly penalized partially functional regressions, where the penalization for the non-functional parameters could be the Lasso, the Elastic-net, or their generalizations. The proposed non-asymptotic upper bound is novel and broadly useful. It is also of interest to develop non-asymptotic tests based on large deviation bounds for $\|\hat{\boldsymbol{\alpha}}_n-\boldsymbol{\alpha}_0\|_2$ and $\|\sqrt{T}(\hat f_n-f_0)\|$.

References

Abramovich, F. and Grinshtein, V. (2016). Model selection and minimax estimation in generalized linear models.
IEEE Transactions on Information Theory, 62(6):3721-3730.

Ai, M., Wang, F., Yu, J., and Zhang, H. (2020). Optimal subsampling for large-scale quantile regression. Journal of Complexity, page 101512.

Aneiros, G., Ferraty, F., and Vieu, P. (2015). Variable selection in partial linear regression with functional covariate. Statistics, 49(6):1322-1347.

Baíllo, A. and Grané, A. (2009). Local linear regression for functional predictor and scalar response. Journal of Multivariate Analysis, 100(1):102-111.

Blanchard, G. and Krämer, N. (2010). Optimal learning rates for kernel conjugate gradient regression. In Advances in Neural Information Processing Systems, pages 226-234.

Brunel, E., Mas, A., and Roche, A. (2016). Non-asymptotic adaptive prediction in functional linear models. Journal of Multivariate Analysis, 143:208-232.

Cai, T. T., Hall, P., et al. (2006). Prediction in functional linear regression. The Annals of Statistics, 34(5):2159-2179.

Cai, T. T. and Yuan, M. (2012). Minimax and adaptive prediction for functional linear regression. Journal of the American Statistical Association, 107(499):1201-1216.

Cardot, H., Ferraty, F., and Sarda, P. (2003). Spline estimators for the functional linear model. Statistica Sinica, pages 571-591.

Cucker, F. and Smale, S. (2001). On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1-49.

Cui, X., Lin, H., and Lian, H. (2020). Partially functional linear regression in reproducing kernel Hilbert spaces. Computational Statistics & Data Analysis, page 106978.

Du, P. and Wang, X. (2014). Penalized likelihood functional regression. Statistica Sinica, 24(2):1017-1041.

Grenander, U. (1950). Stochastic processes and statistical inference. Arkiv för Matematik, 1(3):195-277.

Hall, P., Horowitz, J. L., et al. (2007). Methodology and convergence rates for functional linear regression. The Annals of Statistics, 35(1):70-91.

Horn, R. A. and Johnson, C. R. (2012). Matrix Analysis. Cambridge University Press.

Jolliffe, I. T. (1982). A note on the use of principal components in regression. Journal of the Royal Statistical Society: Series C (Applied Statistics), 31(3):300-303.

Kokoszka, P. and Reimherr, M. (2017). Introduction to Functional Data Analysis. CRC Press.

Kong, D., Xue, K., Yao, F., and Zhang, H. H. (2016). Partially functional linear regression in high dimensions. Biometrika, 103(1):147-159.

Liu, Z. and Li, M. (2020). Non-asymptotic analysis in kernel ridge regression. arXiv preprint arXiv:2006.01350.

Ostrovskii, D. and Bach, F. (2018). Finite-sample analysis of M-estimators using self-concordance. arXiv preprint arXiv:1810.06838.

Pinelis, I. (1994). Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, pages 1679-1706.

Preda, C. (2007). Regression models for functional data by reproducing kernel Hilbert spaces methods. Journal of Statistical Planning and Inference, 137(3):829-840.

Ramsay, J. (1982). When the data are functions. Psychometrika, 47(4):379-396.

Ramsay, J. O. and Silverman, B. W. (2007). Applied Functional Data Analysis: Methods and Case Studies. Springer.

Reimherr, M. L., Sriperumbudur, B. K., and Taoufik, B. (2018). Optimal prediction for additive function-on-function regression. Electronic Journal of Statistics, 12(2):4571-4601.

Shin, H. (2009). Partial functional linear regression. Journal of Statistical Planning and Inference, 139(10):3405-3418.

Sun, H. (2005). Mercer theorem for RKHS on noncompact sets. Journal of Complexity, 21(3):337-349.

Sun, Q., Tan, K. M., Liu, H., and Zhang, T. (2017). Graphical nonconvex optimization for optimal estimation in Gaussian graphical models. In ICML 2018.

Tong, H. and Ng, M. (2018). Analysis of regularized least squares for functional linear regression model. Journal of Complexity, 49:85-94.

Tsybakov, A. B. (2008). Introduction to Nonparametric Estimation. Springer Science & Business Media.

Wahba, G. (1990). Spline Models for Observational Data. SIAM.

Wahl, M. (2018). A note on the prediction error of principal component regression. arXiv preprint arXiv:1811.02998.

Wang, J.-L., Chiou, J.-M., and Müller, H.-G. (2016). Functional data analysis. Annual Review of Statistics and Its Application, 3:257-295.

Yao, F., Müller, H.-G., and Wang, J.-L. (2005). Functional linear regression analysis for longitudinal data. The Annals of Statistics, pages 2873-2903.

Yurinsky, V. (2006). Sums and Gaussian Vectors. Springer.

Zhang, T. (2005). Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077-2098.

Zhu, H., Zhang, R., Yu, Z., Lian, H., and Liu, Y. (2019). Estimation and testing for partially functional linear errors-in-variables models. Journal of Multivariate Analysis, 170:296-314.

Zhuang, R. and Lederer, J. (2018). Maximum regularized likelihood estimators: A general prediction theory and applications.