SCENTS: Score explained non-randomized treatment systems
Regression discontinuity design: estimating the treatment effect with standard parametric rate
Debarghya Mukherjee, Moulinath Banerjee and Ya'acov Ritov
Department of Statistics, University of Michigan
February 23, 2021
Abstract
Regression discontinuity design models are widely used for the assessment of treatment effects in psychology, econometrics and biomedicine, specifically in situations where treatment is assigned to an individual based on their characteristics (e.g. a scholarship is allocated based on merit) instead of being allocated randomly, as is the case, for example, in randomized clinical trials. Popular methods that have been largely employed till date for estimation of such treatment effects suffer from slow rates of convergence (i.e. slower than $\sqrt{n}$). In this paper, we present a new model and method that allows estimation of the treatment effect at $\sqrt{n}$ rate in the presence of fairly general forms of confoundedness. Moreover, we show that our estimator is also semi-parametrically efficient in certain situations. We analyze two real datasets via our method and compare our results with those obtained by using previous approaches. We conclude this paper with a discussion on some possible extensions of our method.

Regression discontinuity designs (RDD) have been used extensively in the statistics and econometrics literatures to study effects of treatments and interventions in various scenarios. As an introduction to the idea, imagine that a scholarship granting agency tests a group of high school students and assigns a scholarship to those who clear some pre-determined cutoff (w.l.o.g. 0, after centering). Of interest is to determine whether the scholarship has any tangible outcome on future academic performance. Letting $Y_i$ be the score of student $i$ in a subsequent semester, we can write down a model of the type:
$$Y_i = \alpha_0 \mathbf{1}\{Q_i \ge 0\} + X_i^\top \beta_0 + \nu_i \quad (1.1)$$
where $X_i$ is a covariate vector including demographic information on the students, $Q_i$ is the centered score on the test, and $\nu_i$ is a residual term. The parameter $\alpha_0$ represents the effect of the treatment: scholarship. If the hypothesis $\alpha_0 = 0$ is rejected by a statistical test, one concludes that the scholarship has a significant impact on subsequent academic score.

It is instructive to take a brief moment to compare and contrast RDD with randomized clinical trials. In both, we estimate the effect of a treatment which applies to a subset of the participants. The main difference between RDD and a randomized clinical trial (henceforth RCT) is that the treatment in the latter case is independent of the covariates and error terms in the model, while in RDD the applied treatment is non-trivially correlated with the covariate and error terms and is therefore endogenous.
In our example, the score $Q_i$ on the assessment test, based on which the treatment (scholarship when $Q_i \ge 0$) is applied, is correlated with $Y_i$ beyond the shift effect $\alpha_0$ and the background covariates $X_i$, as the unmeasured sources of variation (say, native abilities of the individual not captured by the observed covariates) that are contained in the residual $\nu_i$ can also contribute non-trivially to the score $Q_i$ on the assessment test. [More meritorious students are more likely to get the scholarship and are also more likely to have higher values of $Y$ anyway, irrespective of the scholarship and their demographics!] This typically translates to what is called the 'endogeneity' assumption: $E(\nu_i \mid Q_i) \neq 0$, and hence the model displayed above cannot be estimated using a simple linear regression of $Y$ on $(\mathbf{1}\{Q \ge 0\}, X)$, unlike in the RCT framework. Early work on models of this kind can be traced to ([48], [14]), who analyzed data from a national merit competition (see [24] and [47]). RDD-type methods have since found varied applications, e.g. [45], [19] in education, [34] in health related research, [7] in social research, [20] in epidemiology, to name a few. The application of RDD to economics started around the mid-1990s (see [3], [4], [10], [2] and [26] for more details).

Most of the analysis in RDD till date is based on a local analysis, i.e. the samples from a small neighborhood of the cutoff $Q$ value are considered as almost free from endogeneity, and only those samples are used to obtain an approximately unbiased estimate of the treatment effect. [22], for the first time, provided intuitive minimal sufficient conditions for identification and estimation of both constant and variable treatment effects. [31] noted that if the score (based on which the treatment will be assigned) has a density function given the attributes of the individuals, then under fairly weak assumptions all conditions mentioned in [22] are satisfied, and the observations near the cutoff region resemble data coming from randomized treatment experiments, an idea that was further expounded in [32] and [18]. Local polynomial estimators for RDD were studied by [42] and [25], while empirical likelihood based methods were pursued in [39]. [15] proposed conditions under which the observation that locally RDD resembles randomized experiments holds, and provided a finite sample analysis, because the number of data points in a small vicinity of the cutoff could be small, rendering asymptotic approximations imprecise. For more recent works on RDD, see [21], [12], [40] and [5], where this last work used techniques from high dimensional statistics to analyze RDD non-parametrically, by imposing an $\ell_1$ norm constraint on the coefficients of the basis.

Our contribution:
To highlight our contribution to the burgeoning literature on RDD, let us first consider a few more details of the standard approach. The extant literature is largely based on the model:
$$Y_i = (\alpha_1 + b_1 Q_i)\, \mathbf{1}\{Q_i < \tau\} + (\alpha_2 + b_2 Q_i)\, \mathbf{1}\{Q_i \ge \tau\} + X_i^\top \beta + \nu_i, \quad (1.2)$$
where $Q_i$ is the score variable based on which treatment is assigned, with known cutoff $\tau$, and the $X_i$'s are background covariates. The parameter of interest is $\alpha_2 - \alpha_1$, which encodes the effect of the treatment. Generally, only weak assumptions are made on the conditional expectation $E(Y \mid Q, X)$ to encode possible endogeneity. As the main takeaway from the model is that both the intercept and the slope of $Q$ change at the cutoff, the traditional approach for the estimation of the treatment effect is as follows: select a (data-driven) bandwidth $h_n$ and look at all the observations $Y_i$ for which $|Q_i - \tau| \le h_n$, then run a weighted local polynomial regression on these observations to estimate $\alpha_2 - \alpha_1$. As this approach only uses $O_p(n h_n)$ observations for estimation purposes, the rate of convergence of the estimator is slower than $\sqrt{n}$ (typically $\sqrt{n h_n}$, where the bandwidth $h_n \to 0$).

Suppose now that, in addition to the score $Q$, we have measured covariates $Z$ on the individuals of interest where some or all components of $Z$ intersect with components of $X$; indeed, in many circumstances such covariates are available: e.g. $Z = X$ is often a natural choice. For example, in the scholarship example alluded to above, it is quite reasonable to view the $X$'s, the covariates that determine academic performance $Y$, as being informative about $Q$. In such cases, we argue below that it becomes possible to use all our samples to estimate the treatment effect at $\sqrt{n}$ rate. To that end, let us augment the linear equation (1.1) by a second equation to obtain:
$$Y_i = \alpha_0 \mathbf{1}\{Q_i \ge 0\} + X_i^\top \beta_0 + \nu_i, \qquad Q_i = Z_i^\top \gamma_0 + \eta_i, \quad \gamma_0 \neq 0. \quad (1.3)$$
The effect of unmeasured covariates that can affect both $Q$ and $Y$ is encoded by the mean 0 error vector $(\nu_i, \eta_i)$, on which we place no parametric assumptions, but the coordinates may be dependent, to take care of the issue of endogeneity: $b(\eta) \equiv E(\nu \mid \eta) \neq 0$. However, we assume the vector $(\nu, \eta)$ to be independent of the observed covariates $(X, Z)$.
We next discuss how this augmentation helps us obtain a $\sqrt{n}$-consistent estimator. If we observed $\eta$, writing $\nu_i = b(\eta_i) + \epsilon_i$, our model would reduce to a simple partial linear model (see e.g. [52], [8], [37] and references therein) with $(\alpha_0, \beta_0)$ being the parametric component and the unknown mean function $b$ being the nonparametric component:
$$Y_i = \alpha_0 \mathbf{1}\{Q_i \ge 0\} + X_i^\top \beta_0 + b(\eta_i) + \epsilon_i. \quad (1.4)$$
Hence, we could acquire a $\sqrt{n}$ consistent estimator of $\alpha_0$ following the standard analysis of the partial linear model, provided $E(\mathrm{var}(\mathbf{1}\{Q > 0\} \mid \eta)) > 0$; otherwise, the effect of $\alpha_0$ cannot be separated from the effect of $b$ in the above model. Now, although we don't observe $\eta$ in our model, by (1.3) we can nevertheless approximate it using the residuals obtained via regressing $Q$ on $Z$. We can plug in the estimate of $\eta_i$ in equation (1.4) and treat the resulting equation as an approximate partial linear model. Indeed, this idea lies at the heart of our method and is elaborated in Section 2. The assumption $\gamma_0 \neq 0$ is critical to our analysis, because if $\gamma_0 = 0$, then $Q_i = \eta_i$ and equation (1.4) becomes:
$$Y_i = \alpha_0 \mathbf{1}\{Q_i \ge 0\} + X_i^\top \beta_0 + b(Q_i) + \epsilon_i.$$
It is then no longer possible to estimate $\alpha_0$ at $\sqrt{n}$ rate, as it is hard to separate the first and third terms on the right hand side. Thus, the augmenting equation $Q = Z^\top \gamma_0 + \eta$ with $\gamma_0 \neq 0$ is what prevents the variable corresponding to the parametric component of interest from becoming a measurable function of the variable corresponding to the non-parametric component of the model, and enables converting the RDD problem into an approximate partial linear model problem. We note here that for $\alpha_0$ to be estimated at $\sqrt{n}$ rate, it is not necessary for $Q$ to have a linear relation with $Z$; in fact, any non-linear parametric relation $Q = \psi(Z, \gamma_0) + \eta$ for some known link function $\psi$ will work, as long as $E(\mathrm{var}(\mathbf{1}\{Q > 0\} \mid \eta)) > 0$.

At first glance it may appear that $Z$ is being used as an instrumental variable. However, this is typically not the case in our model, as $Z$ and $X$ may share several common covariates [since many factors influencing $Y$ can also directly influence $Q$]; indeed, they may even be identical. When this is the case, $Z$ is clearly not an instrument, since the exclusion restriction [35] that is critical in the instrumental variables approach, namely that the instrument should influence $Y$ only through $Q$ and not directly, is no longer satisfied. Consequently, the two-stage least squares procedure [1] typically employed in the instrumental variables literature is invalid in our problem. There is also a degree of similarity between our model and triangular simultaneous equation models, which are well-studied in the economics literature; interested readers may take a look at [38], [27], [29], [23] and [41] and references therein for more details. However, the key difference lies in that triangular simultaneous equation models assume a smooth link function between $Y_i$ and $(X_i, Z_i, Q_i)$, whilst our model presents an inherent discontinuity. Hence our work cannot be derived from analyses of the triangular simultaneous equation model.

Our work makes the following contributions to the literature on RDD:

1. It is able to take care of endogeneity among the errors in a very general form and provides a $\sqrt{n}$-consistent estimator of the treatment effect. To the best of our knowledge, this is the first result in the literature on estimation of such treatment effects at the classic (fast) parametric rate.
Indeed, the main feature of our approach lies in modeling $Q$ itself in terms of covariates up to error terms, which enables the use of the entirety of the data available, and not just the observations in a small vicinity of the boundary defined by the $Q$ threshold, on which existing approaches are typically based.

2. Our estimator achieves semiparametric efficiency under an appropriate model.

3. Our method does not depend on tuning parameters, in the sense that the use of tuning parameters to estimate the treatment effect is secondary. As will be seen in Section 2, we do require tuning parameter specifications for non-parametric estimation of $b(\eta)$, but as long as those parameters satisfy some minimal conditions, our estimate of $\alpha_0$, in terms of both rate of convergence and asymptotic distribution, does not depend on them.

Organization of the paper:
In Section 2 we describe the estimation procedure. Section 3 provides the theoretical results along with a brief outline of the proof of asymptotic normality of our estimator. In Section 4, we provide analyses of two real data examples using our method, as well as comparisons to previous methods. In Section 5 we present some possible future research directions based on this work. Rigorous proofs of the main theorems are established in Section A of the Appendix. In Section B of the Appendix, we present proofs of auxiliary results that are required to prove the main results. In Section C.1 of the Appendix, we present two auxiliary lemmas along with their proofs, which play a pivotal role in our theoretical analysis. In Section C.2 of the Appendix, we provide a few details on the spline estimation techniques used in our analysis, and in Section C.3 of the Appendix, an algorithm integrating the main steps of the proposed methodology for ease of implementation.
Notation:
For any matrix $A$, we denote by $A_{i,*}$ the $i$-th row of $A$ and by $A_{*,j}$ the $j$-th column of $A$. Both $a_n \lesssim_P b_n$ and $a_n = O_p(b_n)$ bear the same meaning, i.e. $a_n / b_n$ is a tight (random) sequence. Also, for two non-negative sequences $\{a_n\}$ and $\{b_n\}$, we denote by $a_n \gg b_n$ (respectively $a_n \ll b_n$) the conditions that $\liminf_n a_n / b_n = \infty$ (respectively $\limsup_n a_n / b_n = 0$). For any random variable $X$, we denote by $\mathcal{F}(X)$ the sigma-field generated by $X$.

Substituting the second equation of (1.3) into the first, our model reads:
$$Y_i = \alpha_0 \mathbf{1}\{Z_i^\top \gamma_0 + \eta_i \ge 0\} + X_i^\top \beta_0 + \nu_i,$$
where $\nu_i$ and $\eta_i$ are correlated. Defining $b(\eta_i) = E(\nu_i \mid \eta_i)$, write $\nu_i = b(\eta_i) + \epsilon_i$, where $E(\epsilon_i \mid \eta_i) = 0$. Using this we can rewrite our model as:
$$Y_i = \alpha_0 \mathbf{1}\{Z_i^\top \gamma_0 + \eta_i \ge 0\} + X_i^\top \beta_0 + b(\eta_i) + \epsilon_i. \quad (2.1)$$
We first divide the whole data into three almost equal parts, say $D_1, D_2, D_3$; henceforth, for simplicity, we assume each $D_i$ contains $n/3$ observations, and we denote the dimensions of $X$ and $Z$ by $p_1$ and $p_2$ respectively. The first two data sets are used to obtain (consistent) estimates of several nuisance parameters, which are then plugged into the equations corresponding to the third data set, from which an estimator of the treatment effect is constructed. The data splitting technique makes the theoretical analysis more tractable, as one can use independence among the three subsamples to our benefit. Furthermore, by rotating the samples (to be elaborated below), we obtain three asymptotically independent and identically distributed estimates, which are then averaged to produce a final estimate that takes advantage of the full sample size. While we believe that the estimator obtained without data splitting achieves the same asymptotic variance, a point that is corroborated by simulation studies (not reported in the manuscript), an analysis of that estimator would be incredibly tedious.

We first estimate $\gamma_0$ from $D_1$ via standard least squares regression of $Q$ on $Z$:
$$\hat\gamma_n = (Z^\top Z)^{-1} Z^\top Q.$$
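As a running illustration of the pipeline, the following sketch (Python/NumPy; the sample size, dimensions, error covariance and all variable names are our own illustrative choices, not the paper's) simulates data from model (1.3), splits the sample into three folds, and carries out this first-stage regression. The later stages are sketched further below.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2 = 3000, 2, 2                      # illustrative sample size and dims of X, Z
alpha0, beta0 = 1.0, np.array([0.5, -0.3])
gamma0 = np.array([1.0, 0.7])               # gamma0 != 0 is essential for the method

# (nu, eta): mean zero, correlated, independent of (X, Z); bivariate normal here,
# so b(eta) = E[nu | eta] is linear; any smooth b satisfying Assumption 3.2 works.
nu, eta = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=n).T
X, Z = rng.normal(size=(n, p1)), rng.normal(size=(n, p2))

Q = Z @ gamma0 + eta                        # score equation of (1.3)
Y = alpha0 * (Q >= 0) + X @ beta0 + nu      # outcome equation of (1.3)

# Split the data into three (roughly) equal folds D1, D2, D3.
D1, D2, D3 = np.array_split(rng.permutation(n), 3)

# First stage on D1: OLS of Q on Z; residuals proxy the unobserved eta.
gamma_hat, *_ = np.linalg.lstsq(Z[D1], Q[D1], rcond=None)
eta_hat = Q - Z @ gamma_hat                 # \hat\eta_i for every observation
```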
Using $\hat\gamma_n$, we expand our first model equation as:
$$Y_i = \alpha_0 S_i + X_i^\top \beta_0 + b(\hat\eta_i) + (\eta_i - \hat\eta_i)\, b'(\hat\eta_i) + R_{1,i} + \epsilon_i = \alpha_0 S_i + X_i^\top \beta_0 + b(\hat\eta_i) + b'(\hat\eta_i)\, Z_i^\top (\hat\gamma_n - \gamma_0) + R_{1,i} + \epsilon_i, \quad (2.2)$$
where $S_i = \mathbf{1}\{Q_i \ge 0\}$, $\hat\eta_i = Q_i - Z_i^\top \hat\gamma_n$ and $R_{1,i} = (\hat\eta_i - \eta_i)^2\, b''(\tilde\eta_i)/2$ for some $\tilde\eta_i$ lying between $\eta_i$ and $\hat\eta_i$. It should be pointed out that if $\eta_i$ were known, our model (equation (2.1)) would reduce to a simple partial linear model and estimation of $\alpha_0$ would become straightforward. As we don't observe $\eta_i$, but rather use $\hat\eta_i$ as its proxy, the corresponding approximation error needs careful handling, as we need to show that the estimator does not inherit any resulting bias. Indeed, this is one of the core technical challenges of this paper, as explained in the subsequent development.

Note that the function $b'$ in equation (2.2) is unknown. Since $\hat\gamma_n = \gamma_0 + O_p(n^{-1/2})$ (which, in turn, implies $\hat\eta_i = \eta_i + O_p(n^{-1/2})$), as long as we have a consistent estimate of $b'$, say $\hat b'$, the approximation error $(b'(\hat\eta_i) - \hat b'(\hat\eta_i))\, Z_i^\top (\hat\gamma_n - \gamma_0)$ is $o_p(n^{-1/2})$ and therefore asymptotically negligible.

We now elaborate how we use a B-spline basis to estimate $b'$ from $D_2$. Equation (2.1) can be rewritten as:
$$Y_i = \alpha_0 S_i + X_i^\top \beta_0 + b(\hat\eta_i) + \epsilon_i + R_i, \quad (2.3)$$
where $R_i = b(\eta_i) - b(\hat\eta_i)$. Ignoring the remainder term (which is shown to be asymptotically negligible under some mild smoothness conditions on $b$ to be specified later), we estimate $b'$ via a B-spline basis from equation (2.3). The theory of spline approximation is mostly explored for estimating compactly supported non-parametric regression functions. Our errors $\eta$ are, of course, assumed to be unbounded, as otherwise the problem would become artificial. However, there are certain technical issues with estimating $b'$ on the entire support (see Remark 2.4), and to circumvent these we restrict ourselves to a compact (but arbitrary) support $[-\tau, \tau]$, i.e. we consider those observations for which $|\hat\eta_i| \le \tau$. We then use a cubic B-spline basis, appropriately scaled to the interval of interest, with equispaced knots to estimate $b'$. Notationally, we use $K+1$ equispaced knots to partition $[-\tau, \tau]$ into $K$ intervals of length $2\tau/K$, where $K = K_n$ increases with $n$ at an appropriate rate (see Remark 2.2 below), giving us in total $K+3$ spline basis functions. (A brief discussion on the B-spline basis and the scaled B-spline basis is presented in Section C.2 of the Appendix for the ease of the reader.) While we work with cubic splines, one may certainly use higher degree polynomials; however, in practice, it has been observed that cubic splines work really well in most scenarios. For a point $x$, we use the notation $\tilde N_K(x) \in \mathbb{R}^{K+3}$ to denote the vector of scaled B-spline basis functions $\{\tilde N_{K,j}\}_{j=1}^{K+3}$ evaluated at $x$. Using these basis functions we further expand equation (2.3) as follows:
$$Y_i = \alpha_0 S_i + X_i^\top \beta_0 + \tilde N_K(\hat\eta_i)^\top \omega_{b,\infty,n} + \epsilon_i + R_i + T_i \quad (2.4)$$
for all those observations with $|\hat\eta_i| \le \tau$, where $T_i = b(\hat\eta_i) - \tilde N_K(\hat\eta_i)^\top \omega_{b,\infty,n}$ is the spline approximation error and $\omega_{b,\infty,n}$ is the (population) parameter defined as:
$$\omega_{b,\infty,n} = \arg\min_{\omega \in \mathbb{R}^{K+3}} \sup_{|x| \le \tau} \left| b(x) - \tilde N_K(x)^\top \omega \right|.$$
Suppose we have $n_2 \le n/3$ observations in $D_2$ with $|\hat\eta_i| \le \tau$. Denote by $Y$ the vector of all the corresponding $n_2$ responses, by $X \in \mathbb{R}^{n_2 \times p_1}$ the covariate matrix, and by $\tilde N_K \in \mathbb{R}^{n_2 \times (K+3)}$ the approximation matrix with rows $\tilde N_{K,i*} = \tilde N_K(\hat\eta_i)$.
Regressing $Y$ on $(S, X, \tilde N_K)$ we estimate $\omega_{b,\infty,n}$ (details can be found in the proof of Proposition 2.1) and set $\hat b'(x) = \nabla \tilde N_K(x)^\top \hat\omega_{b,\infty,n}$, where $\nabla \tilde N_K(x)$ is the vector of derivatives of each of the coordinates of $\tilde N_K(x)$. The following proposition establishes consistency of our estimator of $b'$; a code sketch of this step follows the proposition.

Proposition 2.1.
Under Assumptions 3.1 - 3.4 (elaborated in Section 3) we have:
$$\sup_{|x| \le \tau} \left| b'(x) - \hat b'(x) \right| = \sup_{|x| \le \tau} \left| b'(x) - \nabla \tilde N_K(x)^\top \hat\omega_{b,\infty,n} \right| = o_p(1).$$
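Continuing the earlier sketch, the following code implements this step. To keep it self-contained we use a truncated-power cubic spline basis on $[-\tau, \tau]$ in place of the paper's scaled B-spline basis (with the same knots, the two bases span the same space of cubic splines), and the tuning choices $\tau$ and $K$ are illustrative, not prescriptions from the paper.

```python
tau, K = 2.0, 6                                   # truncation level and number of intervals (illustrative)
knots = np.linspace(-tau, tau, K + 1)[1:-1]       # the K - 1 interior knots

def basis(x):
    # Truncated-power cubic spline basis: 1, x, x^2, x^3, (x - k)_+^3 for interior knots k.
    cols = [np.ones_like(x), x, x**2, x**3] + [np.clip(x - k, 0, None)**3 for k in knots]
    return np.column_stack(cols)

def dbasis(x):
    # Its analytic derivative, used to form \hat b'.
    cols = [np.zeros_like(x), np.ones_like(x), 2 * x, 3 * x**2] \
         + [3 * np.clip(x - k, 0, None)**2 for k in knots]
    return np.column_stack(cols)

# Second stage on D2: partial linear regression of Y on (S, X, basis(eta_hat)),
# using only the observations with |eta_hat| <= tau.
S = (Q >= 0).astype(float)
d2 = D2[np.abs(eta_hat[D2]) <= tau]
coef, *_ = np.linalg.lstsq(
    np.column_stack([S[d2], X[d2], basis(eta_hat[d2])]), Y[d2], rcond=None)
omega_hat = coef[1 + p1:]                         # fitted spline coefficients for b

def b_prime_hat(x):
    return dbasis(x) @ omega_hat                  # \hat b'(x)
```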
Remark 2.2. Henceforth, we choose $K \equiv K_n$ such that $n^{1/4} \ll K \ll n^{1/2}$ to control the approximation errors of certain non-parametric functions (including $b(\eta)$) that appear in our analysis via the B-spline basis. However, the bounds can be improved in the presence of additional derivatives of the non-parametric functions involved.

The final (key) step involves estimating $\alpha_0$ from $D_3$. Suppose there are $n_3 \le n/3$ observations in $D_3$ with $|\hat\eta_i| \le \tau$. Replacing $b'$ by the $\hat b'$ obtained from $D_2$ in equation (2.2), we obtain:
$$\begin{aligned} Y_i &= \alpha_0 S_i + X_i^\top \beta_0 + b(\hat\eta_i) + \hat b'(\hat\eta_i)\, Z_i^\top (\hat\gamma_n - \gamma_0) + R_{1,i} + R_{2,i} + \epsilon_i \\ &= \alpha_0 S_i + X_i^\top \beta_0 + b(\hat\eta_i) + \tilde Z_i^\top (\hat\gamma_n - \gamma_0) + R_{1,i} + R_{2,i} + \epsilon_i, \\ Q_i - Z_i^\top \hat\gamma_n &= -Z_i^\top (\hat\gamma_n - \gamma_0) + \eta_i, \end{aligned} \quad (2.5)$$
where we define $\tilde Z_i = \hat b'(\hat\eta_i) Z_i$ and the residual term $R_{2,i}$ as:
$$R_{2,i} = \left( b'(\hat\eta_i) - \hat b'(\hat\eta_i) \right) Z_i^\top (\hat\gamma_n - \gamma_0) = \left( b'(\hat\eta_i) - \hat b'(\hat\eta_i) \right) (\eta_i - \hat\eta_i).$$
An inspection of the first part of equation (2.5) shows that, up to the remainder terms $\{(R_{1,i}, R_{2,i})\}_{i=1}^{n_3}$, our model is a partial linear model with parameters $(\alpha_0, \beta_0, b)$. These remainder terms are asymptotically negligible, as shown in the proof of our main theorem (Theorem 3.6). We estimate $\alpha_0$ from equation (2.5) using standard techniques for the partial linear model which, again, involve approximating the function $b$ with the same B-spline basis as before:
$$\begin{aligned} Y_i &= \alpha_0 S_i + X_i^\top \beta_0 + \tilde Z_i^\top (\hat\gamma_n - \gamma_0) + b(\hat\eta_i) + R_{1,i} + R_{2,i} + \epsilon_i \\ &= \alpha_0 S_i + X_i^\top \beta_0 + \tilde Z_i^\top (\hat\gamma_n - \gamma_0) + \tilde N_K(\hat\eta_i)^\top \omega_{b,\infty,n} + R_{1,i} + R_{2,i} + R_{3,i} + \epsilon_i, \end{aligned} \quad (2.6)$$
where $R_{3,i}$ is the spline approximation error, i.e. $R_{3,i} = b(\hat\eta_i) - \tilde N_K(\hat\eta_i)^\top \omega_{b,\infty,n}$. Combining equation (2.6) with the second equation of (2.5), we formulate the following linear model equation:
$$\begin{pmatrix} Y \\ Q - Z\hat\gamma_n \end{pmatrix} = \begin{pmatrix} S & X & \tilde Z & \tilde N_K \\ 0 & 0 & -Z & 0 \end{pmatrix} \begin{pmatrix} \alpha_0 \\ \beta_0 \\ \hat\gamma_n - \gamma_0 \\ \omega_{b,\infty,n} \end{pmatrix} + \begin{pmatrix} R + \epsilon \\ \eta \end{pmatrix} = \begin{pmatrix} W & \tilde N_{K,a} \end{pmatrix} \begin{pmatrix} \theta_0 \\ \omega_{b,\infty,n} \end{pmatrix} + \begin{pmatrix} R + \epsilon \\ \eta \end{pmatrix}, \quad (2.7)$$
where $\theta_0 = (\alpha_0, \beta_0, \hat\gamma_n - \gamma_0)$, $W = \begin{pmatrix} W_1 \\ W_2 \end{pmatrix}$ with $W_1 = \begin{pmatrix} S & X & \tilde Z \end{pmatrix}$ and $W_2 = \begin{pmatrix} 0 & 0 & -Z \end{pmatrix}$, where $W_1 \in \mathbb{R}^{n_3 \times (1 + p_1 + p_2)}$ and $W_2 \in \mathbb{R}^{(n/3) \times (1 + p_1 + p_2)}$ (as we are using all the observations of $D_3$ in the second regression equation of (2.5)), $R$ is the vector of the sums of the three residuals $R_{1,i}, R_{2,i}, R_{3,i}$ mentioned in equation (2.6), and the matrix $\tilde N_{K,a}$ (read: $\tilde N_K$ appended) has the form:
$$\tilde N_{K,a} = \begin{pmatrix} \tilde N_K \\ 0 \end{pmatrix}.$$
From equation (2.7), we estimate $\theta_0$ via ordinary least squares:
$$\hat\theta = \left( W^\top \mathrm{proj}^\perp_{\tilde N_{K,a}} W \right)^{-1} W^\top \mathrm{proj}^\perp_{\tilde N_{K,a}} \begin{pmatrix} Y \\ Q - Z\hat\gamma_n \end{pmatrix}$$
and set $\hat\alpha$ as the first coordinate of $\hat\theta$, where for any matrix $A$ we define $\mathrm{proj}_A$ as the projection matrix onto the column space of $A$ and $\mathrm{proj}^\perp_A = I - \mathrm{proj}_A$. Note that, because of data splitting, $\hat\gamma_n$, $\hat b'$ and $D_3$ are mutually independent, which provides a significant technical advantage in dealing with the asymptotics of our estimator.

Finally, we apply the same methodology on permutations of the three sets of data: i.e. we estimate $\hat\gamma_n$ from $D_2$, $\hat b'$ from $D_3$ and $\hat\alpha$ from $D_1$; and $\hat\gamma_n$ from $D_3$, $\hat b'$ from $D_1$ and $\hat\alpha$ from $D_2$. Denote by $\hat\alpha^{(i)}$ the estimator $\hat\alpha$ computed on $D_i$, for $1 \le i \le 3$.
Our final estimate is then $\bar{\hat\alpha} = \frac{1}{3} \sum_{i=1}^{3} \hat\alpha^{(i)}$. A sketch of this final stage appears below.
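Continuing the running sketch, the final stage stacks the two regression equations of (2.7) on $D_3$ and partials out the spline columns via residualization (a Frisch-Waugh equivalent of the projection formula above); $\hat\alpha$ is the first coordinate of the resulting OLS fit. Wrapping all three stages in a loop over the rotations of $(D_1, D_2, D_3)$ and averaging would give $\bar{\hat\alpha}$.

```python
# Final stage on D3: stack the two regression equations of (2.7).
d3 = D3[np.abs(eta_hat[D3]) <= tau]
Zt = b_prime_hat(eta_hat[d3])[:, None] * Z[d3]      # \tilde Z_i = \hat b'(\hat\eta_i) Z_i
W = np.vstack([np.column_stack([S[d3], X[d3], Zt]),                       # W_1
               np.column_stack([np.zeros((len(D3), 1 + p1)), -Z[D3]])])   # W_2
NKa = np.vstack([basis(eta_hat[d3]),
                 np.zeros((len(D3), 4 + len(knots)))])                    # \tilde N_{K,a}
resp = np.concatenate([Y[d3], Q[D3] - Z[D3] @ gamma_hat])

# Residualize W and the response on the spline columns, then run OLS;
# this reproduces the proj-perp formula for \hat\theta.
W_res = W - NKa @ np.linalg.lstsq(NKa, W, rcond=None)[0]
r_res = resp - NKa @ np.linalg.lstsq(NKa, resp, rcond=None)[0]
theta_hat, *_ = np.linalg.lstsq(W_res, r_res, rcond=None)
alpha_hat = theta_hat[0]                            # \hat\alpha from this rotation
```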
Remark 2.3. In our estimation procedure, we effectively estimate the conditional mean function $b(\cdot)$ twice: once while estimating $b'$ from $D_2$, and again while estimating $\alpha_0$ from $D_3$. Note that the second re-estimation of $b$ is quite critical (i.e. we cannot reuse the estimator of $b$ obtained from $D_2$), due to the presence of a higher order bias (slower than $n^{-1/2}$).

Remark 2.4.
As described in this section, we only use observations for which $|\hat\eta_i| \le \tau$, and consequently lose some efficiency by not using all the observations. One way to circumvent this issue is to use a sequence $\{\tau_n\}$ slowly increasing to $\infty$ and consider all those observations for which $|\hat\eta| \le \tau_n$. Although this recovers full efficiency in the limit, we need a stronger set of assumptions to make it work: for starters, we need some condition on the decay of the density of $\eta$, i.e. that it does not vanish anywhere in $[-\tau_n, \tau_n]$, together with the rate at which $\min_{|x| \le \tau_n} f_\eta(x)$ goes to $0$. We also need stronger conditions on some conditional expectation functions (i.e. the conditional expectation of $(S, X, Z)$ given $\eta + a^\top Z$ for some vector $a$), like bounded derivatives in both coordinates over the entire space (see Lemma A.1 of the Appendix). With more technical nuances, we believe our method can be extended to the entire real line by using a growing interval, but it would not add anything insightful to the methodology that we propose here.

In this section we present our main theorems with a broad outline of the proofs. Details are provided in the Appendix. To establish the theory, we need the following assumptions:
Assumption 3.1.
The errors $(\eta, \nu)$ are independent of $(X, Z)$ and have zero expectation.

Assumption 3.2.
The distribution of $(\eta, \nu)$ satisfies the following conditions:
i) The density of $\eta$, denoted by $f_\eta$, is continuously differentiable, and both $f_\eta$ and its derivative are uniformly bounded.
ii) The conditional mean function $b(\eta) = E[\nu \mid \eta]$ is three times differentiable, with $b''$ and $b'''$ uniformly bounded over the real line.
iii) The variance function $\sigma^2(\eta) = \mathrm{var}(\epsilon \mid \eta)$ is uniformly bounded from above.
iv) There exists some $\xi > 0$ such that $\min_{|x| \le \tau + \xi} f_\eta(x) > 0$.

Assumption 3.3.
Define the matrices $\Omega$ and $\Omega^*$ as:
$$\Omega = E\left[ \mathrm{var}\left( \begin{pmatrix} S \\ X \\ Z b'(\eta) \end{pmatrix} \Big|\, \eta \right) \right] + \mathrm{var}\left( \begin{pmatrix} 0 \\ 0 \\ Z \end{pmatrix} \right), \qquad \Omega^* = E\left[ \sigma^2(\eta)\, \mathrm{var}\left( \begin{pmatrix} S \\ X \\ Z b'(\eta) \end{pmatrix} \Big|\, \eta \right) \right] + \mathrm{var}(\eta)\, \mathrm{var}\left( \begin{pmatrix} 0 \\ 0 \\ Z \end{pmatrix} \right).$$
Then, the minimum eigenvalues of $\Omega$ and $\Omega^*$ are strictly positive.

Assumption 3.4.
The distribution of $(X, Z)$ satisfies the following conditions:
i) $(X, Z)$ has a bounded continuous density function and zero expectation.
ii) The first four moments of $X$ and $Z$ are finite.
Remark 3.5.
Assumption 3.2 provides low-level assumptions on the smoothness of the density of $\eta$, the conditional mean function $b(\eta)$ and the conditional variance profile $\sigma^2(\eta)$, which are required for the standard asymptotic analysis of the partial linear model. Assumption 3.3, again, is a standard assumption in the partial linear model literature. It is essential for the asymptotic normality of the treatment effect, as the asymptotic variance of our estimator is a function of these variances. If this assumption is violated, then the asymptotic variance will be infinite and estimation at the $\sqrt{n}$ rate is not possible. (Note that if $\gamma_0 = 0$, then Assumption 3.3 is violated.) As our method does not use all observations, but a fraction depending on the interval $[-\tau, \tau]$, our limiting variance comprises the following truncated versions of $\Omega$ and $\Omega^*$:
$$\Omega_\tau = E\left[ \mathrm{var}\left( \begin{pmatrix} S \\ X \\ Z b'(\eta) \end{pmatrix} \Big|\, \eta \right) \mathbf{1}_{|\eta| \le \tau} \right] + \mathrm{var}\left( \begin{pmatrix} 0 \\ 0 \\ Z \end{pmatrix} \right), \qquad \Omega^*_\tau = E\left[ \sigma^2(\eta)\, \mathrm{var}\left( \begin{pmatrix} S \\ X \\ Z b'(\eta) \end{pmatrix} \Big|\, \eta \right) \mathbf{1}_{|\eta| \le \tau} \right] + \mathrm{var}(\eta)\, \mathrm{var}\left( \begin{pmatrix} 0 \\ 0 \\ Z \end{pmatrix} \right).$$
It is immediate that if $\tau \to \infty$, then $\Omega_\tau \to \Omega$ and $\Omega^*_\tau \to \Omega^*$. Hence, in light of Assumption 3.3, by continuity, the minimum eigenvalues of $\Omega_\tau$ and $\Omega^*_\tau$ are also positive for all large $\tau$. We now state the main result of the paper:
Theorem 3.6.
Consider the estimates obtained at the end of the previous section. Under Assumptions 3.1-3.4:
$$\sqrt{n}\,(\hat\alpha - \alpha_0) \overset{\mathcal{L}}{\Longrightarrow} N\!\left(0,\; 3\, e_1^\top \Omega_\tau^{-1} \Omega^*_\tau \Omega_\tau^{-1} e_1 \right),$$
whilst
$$\sqrt{n}\,\big(\bar{\hat\alpha} - \alpha_0\big) \overset{\mathcal{L}}{\Longrightarrow} N\!\left(0,\; e_1^\top \Omega_\tau^{-1} \Omega^*_\tau \Omega_\tau^{-1} e_1 \right).$$
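The factor of 3 separating the two limiting variances is simply the arithmetic of the three-fold split, implicit in Steps 1-4 of the sketch below: writing $V = e_1^\top \Omega_\tau^{-1} \Omega^*_\tau \Omega_\tau^{-1} e_1$, Steps 1-3 combine to give
$$\sqrt{n}\,(\hat\alpha - \alpha_0) \Longrightarrow e_1^\top \left( \tfrac{1}{3} \Omega_\tau \right)^{-1} N\!\left(0, \tfrac{1}{3} \Omega^*_\tau \right) = N\!\left(0,\; 9 \cdot \tfrac{1}{3}\, e_1^\top \Omega_\tau^{-1} \Omega^*_\tau \Omega_\tau^{-1} e_1 \right) = N(0,\, 3V),$$
while averaging the three asymptotically independent rotated estimators divides this variance by 3: $\mathrm{AsyVar}\big(\sqrt{n}\,\bar{\hat\alpha}\big) = \frac{1}{9} \cdot 3 \cdot 3V = V$.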
Sketch of proof: We present a high-level outline of the key steps of the proof, deferring all technical details to Subsection A.1 of the Appendix. From (2.7), on a set of probability approaching 1, our estimator can be written as:
$$\hat\alpha = e_1^\top \left( W^\top \mathrm{proj}^\perp_{\tilde N_{K,a}} W \right)^{-1} W^\top \mathrm{proj}^\perp_{\tilde N_{K,a}} \begin{pmatrix} Y \\ Q - Z\hat\gamma_n \end{pmatrix} = \alpha_0 + e_1^\top \left( W^\top \mathrm{proj}^\perp_{\tilde N_{K,a}} W \right)^{-1} W^\top \mathrm{proj}^\perp_{\tilde N_{K,a}} \begin{pmatrix} R + \epsilon \\ \eta \end{pmatrix}.$$
Hence,
$$\sqrt{n}\,(\hat\alpha - \alpha_0) = e_1^\top \left( \frac{W^\top \mathrm{proj}^\perp_{\tilde N_{K,a}} W}{n} \right)^{-1} \frac{W^\top \mathrm{proj}^\perp_{\tilde N_{K,a}}}{\sqrt{n}} \begin{pmatrix} R + \epsilon \\ \eta \end{pmatrix}, \quad (3.1)$$
which is our main estimating equation. We next outline the key steps of our proof.

Step 1:
First show that:
$$\frac{W^\top \mathrm{proj}^\perp_{\tilde N_{K,a}} W}{n} \overset{P}{\longrightarrow} \frac{1}{3}\, \Omega_\tau,$$
where $\Omega_\tau$ is as defined in Remark 3.5.

Step 2:
Next, establish the following asymptotic linear expansion:
$$\frac{W^\top \mathrm{proj}^\perp_{\tilde N_{K,a}}}{\sqrt{n}} \begin{pmatrix} \epsilon \\ \eta \end{pmatrix} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n/3} \varphi(X_i, Z_i, \eta_i, \nu_i) + o_p(1)$$
for some influence function $\varphi$.

Step 3:
Apply the central limit theorem to obtain:
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n/3} \varphi(X_i, Z_i, \eta_i, \nu_i) \overset{\mathcal{L}}{\Longrightarrow} N\!\left(0, \frac{1}{3}\, \Omega^*_\tau \right),$$
where $\Omega^*_\tau$ is as defined in Remark 3.5.

Step 4:
Finally, ensure that the residual term is asymptotically negligible:
$$\frac{W^\top \mathrm{proj}^\perp_{\tilde N_{K,a}}}{\sqrt{n}} \begin{pmatrix} R \\ 0 \end{pmatrix} \overset{P}{\longrightarrow} 0.$$
Combining the four steps:
$$\sqrt{n}\,(\hat\alpha - \alpha_0) = e_1^\top \left( \frac{1}{3}\, \Omega_\tau \right)^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^{n/3} \varphi(X_i, Z_i, \eta_i, \nu_i) + o_p(1) \overset{\mathcal{L}}{\Longrightarrow} N\!\left(0,\; 3\, e_1^\top \Omega_\tau^{-1} \Omega^*_\tau \Omega_\tau^{-1} e_1 \right),$$
where the leading term only depends on the observations in $D_3$ and is consequently independent of $D_1, D_2$. Finally, rotating the datasets and taking the average of the $\hat\alpha^{(i)}$'s, we further conclude:
$$\sqrt{n}\,(\bar{\hat\alpha} - \alpha_0) \overset{\mathcal{L}}{\Longrightarrow} N\!\left(0,\; e_1^\top \Omega_\tau^{-1} \Omega^*_\tau \Omega_\tau^{-1} e_1 \right).$$

In this subsection we argue that our estimator is semi-parametrically efficient under certain restrictions. As our estimator is based on a least squares approach, it cannot be shown to be efficient unless the error $\epsilon$ is normal. We prove the following theorem:

Theorem 3.7.
Suppose the model is the following:
$$Y_i = \alpha_0 \mathbf{1}\{Z_i^\top \gamma_0 + \eta_i \ge 0\} + X_i^\top \beta_0 + b(\eta_i) + \epsilon_i,$$
where $\epsilon_i \sim N(0, \sigma_\epsilon^2)$ is independent of $\eta_i$. Then our estimator of $\alpha_0$ is semi-parametrically efficient, i.e. the lower bound on the variance is $\sigma_0^2$, where $\sigma_0^2 = \sigma_\epsilon^2\, e_1^\top \Omega^{-1} e_1$.

Sketch of the proof:
Our proof is based on the techniques introduced in Section 3.3 of [9], albeit we sketch the main ideas here for the convenience of the readers. Suppose $X_1, \ldots, X_n \sim P \in \mathcal{P}$ with density function $p$. We will work with $\sqrt{p}$ instead of $p$ itself, as $s := \sqrt{p}$ lies on the unit sphere of $L_2(\mathbb{R})$ (with respect to Lebesgue measure) and gives rise to a nice Hilbert space. Suppose we are interested in estimating a one dimensional functional $\theta(s)$, where $\theta: L_2(\mathbb{R}) \to \mathbb{R}$. As initially pointed out by Stein [46], estimating any one dimensional functional of some non-parametric component is at least as hard as estimating the functional by restricting oneself to a one-dimensional parametric sub-model that contains the true parameter. In general, any one dimensional smooth parametrization, i.e. a function $\varphi: (-t_0, t_0) \to L_2(\mathbb{R})$ (for some $t_0 > 0$) with $\varphi(0) = s$, introduces a one dimensional parametric sub-model, which essentially is a curve on the unit sphere of $L_2(\mathbb{R})$ passing through $s$. We restrict ourselves to regular parametrizations, which are differentiable on $(-t_0, t_0)$ in the following sense: for any $|t| < t_0$, there exists some function $\dot s_{\varphi,t} \in L_2(\mathbb{R})$ such that
$$\lim_{h \to 0} \left\| \frac{\varphi(t+h) - \varphi(t)}{h} - \dot s_{\varphi,t} \right\|_{L_2(\mathbb{R})} = 0,$$
with $\|\dot s_{\varphi,t}\|_{L_2(\mathbb{R})} > 0$. Let $\mathcal{G}$ be the set of all such regular parametrizations. Under mild conditions, this derivative also coincides with the pointwise derivative of $\varphi(t)$ with respect to $t$, i.e. $\dot s_{\varphi,t}(x) = (d/dt)\,\varphi(t)(x)$. As those conditions are easily satisfied in our model, henceforth we use this fact in our derivations. Define the tangent set of $\mathcal{P}$ at $P$ as $\dot{\mathcal{P}}_P = \{\dot s_{\varphi,0} : \varphi \in \mathcal{G}\}$, and the tangent space $T(P) = \overline{\mathrm{lin}}(\dot{\mathcal{P}}_P)$, the closed linear subspace spanned by $\dot{\mathcal{P}}_P$.

We restrict the discussion to functionals $\theta$ that obey the pathwise norm differentiability condition, which asserts the existence of a bounded linear functional $\mathrm{L}: T(P) \to \mathbb{R}$ such that, for any $\varphi \in \mathcal{G}$:
$$\mathrm{L}(\dot s_{\varphi,0}) := \lim_{t \to 0} \frac{\theta(\varphi(t)) - \theta(\varphi(0))}{t}.$$
Now, for any fixed $\varphi \in \mathcal{G}$, the collection $\{P_t\}_{|t| < t_0}$ constitutes a one dimensional parametric sub-model, and the corresponding Cramér-Rao lower bound for estimating $\theta$ along this path is:
$$\mathrm{IB}(\varphi) = \frac{(\mathrm{L}(\dot s_{\varphi,0}))^2}{4 \int \dot s_{\varphi,0}^2(x)\, dx} = \frac{(\mathrm{L}(\dot s_{\varphi,0}))^2}{\|\dot s_{\varphi,0}\|_F^2} = \mathrm{L}\!\left( \frac{\dot s_{\varphi,0}}{\|\dot s_{\varphi,0}\|_F} \right)^2,$$
where $\|\cdot\|_F = 2\|\cdot\|_{L_2(\mathbb{R})}$ is the Fisher norm ([44], [51]) and the last equality follows from the fact that $\mathrm{L}$ is a bounded linear operator. The optimal asymptotic variance (a term borrowed from [50]) for estimating $\theta(s)$ is defined as the supremum of these Cramér-Rao lower bounds $\mathrm{IB}(\varphi)$ over all regular one dimensional parametrizations $\varphi \in \mathcal{G}$, i.e.:
$$\text{Optimal asymptotic variance} = \sup_{\dot s_{\varphi,0} \in T(P)} \mathrm{L}\!\left( \frac{\dot s_{\varphi,0}}{\|\dot s_{\varphi,0}\|_F} \right)^2 = \left( \sup_{\dot s_{\varphi,0} \in T(P):\, \|\dot s_{\varphi,0}\|_F = 1} \mathrm{L}(\dot s_{\varphi,0}) \right)^2 = \|\mathrm{L}\|_*^2,$$
where $\|\cdot\|_*$ is the dual norm of the functional $\mathrm{L}$ on $T(P)$ with respect to the Fisher norm. As $\mathrm{L}$ is a bounded linear functional on the Hilbert space $T(P)$, by the Riesz representation theorem there exists some $s^* \in T(P)$ such that:
$$\mathrm{L}(\dot s_{\varphi,0}) = \langle s^*, \dot s_{\varphi,0} \rangle_F \quad \forall\, \dot s_{\varphi,0} \in T(P),$$
where $\langle \cdot, \cdot \rangle_F = 4\langle \cdot, \cdot \rangle_{L_2(\mathbb{R})}$. This further implies $\|\mathrm{L}\|_* = \|s^*\|_F$, and consequently the information bound corresponding to the hardest one dimensional parametric sub-model is:
$$\text{Optimal asymptotic variance} = \|s^*\|_F^2.$$
Therefore the problem of computing the efficient information bound boils down to finding the representer $s^*$ in the tangent space $T(P)$. To summarize, the key steps are:
1. First quantify $T(P)$ in the given model.
2. Then find the expression for $\mathrm{L}$ by differentiating $\theta(\varphi(t))$ with respect to $t$.
3. Finally use the identity $\mathrm{L}(\dot s_{\varphi,0}) = \langle s^*, \dot s_{\varphi,0} \rangle_F$ for all $\dot s_{\varphi,0} \in T(P)$ to find $s^*$.

A detailed proof of the efficiency of our estimator under normality is presented in Section A.2, but here we sketch the main idea to give readers a sense of the application of the above approach to our model. The log-likelihood of our model, for a generic observation $(Y, Q, X, Z)$, can be written as:
$$\ell(\vartheta) = 2 \log s(Y \mid Q, X, Z) + 2 \log s(Q \mid X, Z) + 2 \log s(X, Z) = 2 \left\{ \log \phi^{1/2}\!\left( Y - \alpha \mathbf{1}\{Q \ge 0\} - X^\top \beta - b(Q - Z^\top \gamma) \right) + \log s_\eta(Q - Z^\top \gamma) + \log s_{X,Z}(X, Z) \right\},$$
where $\phi$ is the Gaussian density, $s_\eta$ and $s_{X,Z}$ are the square roots of the densities of $\eta$ and $(X, Z)$ respectively, and $\vartheta$ is the collection of all unknown parameters, i.e. $\vartheta = (\alpha, \beta, \gamma, b, s_\eta, s_{X,Z})$. We are interested in the functional $\theta(\varphi(t)) = \alpha_{\varphi(t)}$, which implies that the derivative is $\mathrm{L}(\dot s_{\varphi,0}) = \dot\alpha_\varphi = (d/dt)\, \alpha_{\varphi(t)} \big|_{t=0}$. Hence, the representer $s^* \in T(P)$ should satisfy:
$$\langle s^*, \dot s_{\varphi,0} \rangle_F = \dot\alpha_\varphi \quad (3.2)$$
for all $\varphi \in \mathcal{G}$. As a consequence, the optimal asymptotic variance is $\sigma_0^2 = \langle s^*, s^* \rangle_F$. In the proof (Section A.2 of the Appendix), we use the identity (3.2) for some suitably chosen $\varphi \in \mathcal{G}$ (or equivalently $\dot s_{\varphi,0} \in T(P)$) to obtain $\sigma_0^2$.

Remark 3.8.
The assumption of normality of $\epsilon$ is necessary to establish semiparametric efficiency for least squares type methods, but the assumption of homoskedasticity is essential only if we use the ordinary least squares method. One may easily take care of heteroskedasticity by using weighted least squares instead. The first step in that direction is to approximate the variance profile $\sigma^2(\eta)$ by $\hat\sigma^2(\hat\eta)$ for some non-parametric estimate $\hat\sigma^2(\cdot)$ of $\sigma^2(\cdot)$. Then, defining $D \in \mathbb{R}^{(n_3 + n/3) \times (n_3 + n/3)}$ to be the diagonal matrix with first $n_3$ diagonal entries $\hat\sigma^2(\hat\eta_i)$ (i.e. for all those $\hat\eta_i$'s such that $|\hat\eta_i| \le \tau$) and last $n/3$ diagonal entries $\hat\sigma^2_\eta$ (an estimate of the variance of $\eta$), we estimate the treatment effect as:
$$\hat\alpha = e_1^\top \left( W^\top D^{-1/2}\, \mathrm{proj}^\perp_{D^{-1/2} \tilde N_{K,a}}\, D^{-1/2} W \right)^{-1} W^\top D^{-1/2}\, \mathrm{proj}^\perp_{D^{-1/2} \tilde N_{K,a}}\, D^{-1/2} \begin{pmatrix} Y \\ Q - Z\hat\gamma_n \end{pmatrix}.$$
A more tedious analysis establishes that this estimator is asymptotically normal and semi-parametrically efficient under the error structure $\nu_i = b(\eta_i) + \sigma(\eta_i) \epsilon_i$, where $\epsilon_i \sim N(0, 1)$. As this does not add anything of significance to the core idea of the paper, we confine ourselves to OLS instead of WLS for ease of presentation.

In this section we illustrate our method by analyzing two real datasets. We divide our analysis into two subsections, one for each dataset. We first present a brief description of the data, then present our analysis and compare our results with the existing ones.

In this subsection we study the effect of Islamic party rule in Turkey on women's empowerment, measured in terms of their high school education. In the 1994 municipality elections, Islamic parties won several municipal mayor seats in Turkey. We are interested in investigating whether this winning had any effect on the education of women, i.e. to determine, statistically, whether the concern that Islamic control may be inimical towards gender equality is supported by the data. The dataset we analyze here was collected by the Turkish Statistical Institute and was first analyzed in [36]. Since then it has been used by several authors, appearing, for example, as one of the core illustrations in [16]. (We have downloaded the dataset from https://github.com/rdpackages-replication/CIT_2019_CUP/blob/master/CIT_2019_Cambridge_polecon.csv.) The dataset consists of n = 2629 rows, where the rows represent municipalities, the units of our analysis. The main target/response variable $Y$ is the percentage of women in the 15-20 year age group who were recorded to have completed high school. $Q$ is the difference in vote share between the largest Islamic party (i.e. the Islamic party which got the maximum votes among all Islamic parties) and the largest secular party (i.e. the non-Islamic party which got the maximum votes among all non-Islamic parties). Hence, the cutoff is 0: if $Q_i > 0$, then the $i$-th municipality elected an Islamic party, and if $Q_i < 0$, a secular party. The description of the available covariates is presented in Table 2. For $X$ and $Z$, we use all the covariates in Table 2 except i89, because a sizeable proportion of its values are missing. We estimate $\alpha_0$ by $\bar{\hat\alpha}$ as described in Section 2. To test
$$H_0: \alpha_0 = 0 \quad \text{vs} \quad H_1: \alpha_0 \neq 0,$$
we construct an Efron bootstrap based confidence interval over 500 iterations; a sketch of the bootstrap loop is given below. We present our findings in Table 1.

Table 1: Summary statistics of the data on Islamic party rule based on our method
Point estimate: 0.4071513
Bootstrap mean: 0.5760144
Bootstrap s.e.: 0.48115
Bootstrap 95% C.I.: (−. , .)

From the table, it is clear that we don't have enough evidence to reject $H_0$ at the 5% level, as the confidence interval contains 0. Hence we conclude that there is no significant effect of Islamic party rule on women's education.
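A minimal sketch of this interval construction (percentile/Efron bootstrap) follows; here `fit_alpha_bar` stands for the full three-fold procedure of Section 2 wrapped as a single function, a hypothetical helper of our own, not an interface defined in the paper.

```python
def bootstrap_ci(Y, Q, X, Z, B=500, level=0.95, seed=0):
    """Percentile (Efron) bootstrap for alpha: resample rows, refit, take quantiles."""
    rng = np.random.default_rng(seed)
    n, stats = len(Y), np.empty(B)
    for b in range(B):
        i = rng.integers(0, n, size=n)                # resample with replacement
        stats[b] = fit_alpha_bar(Y[i], Q[i], X[i], Z[i])  # hypothetical full-pipeline fit
    lo, hi = np.quantile(stats, [(1 - level) / 2, (1 + level) / 2])
    return stats.mean(), stats.std(ddof=1), (lo, hi)
```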
We next compare our result to those of [36] and [16]. [36] implemented a simpler model for RDD:
$$Y_i = \beta_0 + \alpha S_i + f(Q_i) + \epsilon_i,$$
with $f$ a polynomial function, and only those observations were used for which $Q_i \in (-h, h)$ for some optimal choice of the bandwidth $h$ (chosen according to the prescription of [25]). The authors found that Islamic party rule has a significantly positive effect on women's education at the 1% level. On the other hand, [16] implemented the model based on (1.2). We replicate their result using the R package rdrobust, as advocated in [16]. The function rdrobust inside the package rdrobust takes inputs Y, Q, X and outputs three different types of estimates of $\alpha$ (along with 95% confidence intervals), namely conventional, bias-corrected and robust. As mentioned in [11], conventional presents the conventional RD estimate, bias-corrected implies the bias-corrected RD estimate with conventional standard errors, and robust indicates the bias-corrected RD estimate with robust standard errors (see [13]). We use the parameters kernel = 'triangular', scaleregul = 1, p = 1, bwselect = 'mserd' of the rdrobust function to run the analysis. As evident from Table 3, all these estimates reject $H_0$ at the 5% level, stipulating a strictly positive effect of Islamic party rule on the education of women, while our method fails to reject the null at the same level.

Table 3: Summary statistics of the data on Islamic party rule based on [16]
Method | Coeff | CI Lower | CI Upper
Conventional | 3.005951 | 0.9622239 | 5.049678
Bias-Corrected | 3.204837 | 1.1611103 | 5.248564
Robust | 3.204837 | 0.8266720 | 5.583003

We next analyze an educational dataset, originally collected and analyzed in [33], where we investigate whether putting students on academic probation due to grades below a pre-determined cutoff has any effect on their subsequent GPA. The data are based on students from 3 independent campuses of a large Canadian university: a major campus and two satellite campuses. The acceptance rate at the major campus is around 55%, and at the satellite campuses around 77%. The data were collected over 8 cohorts of students, up to the end of the 2005 academic year. To observe the students for at least two years, only those who entered the university prior to the beginning of the 2004 academic year were considered. After being put on academic probation in their first year, some students left the university; we therefore only have access to second year GPA for those students who stayed. Thus, our $Y$ variable is the GPA of the first academic term of the second year, and the treatment $S$ is whether the student was put on probation. The treatment determining variable $Q$ is the difference between the first year GPA and the cutoff for academic probation: if $Q < 0$, the student is put on academic probation, otherwise not. The covariates we consider here ($X = Z$) are presented in Table 4. In Table 5 we summarize our results. It is immediate from the bootstrap confidence interval in Table 5 that, for testing $H_0: \alpha_0 = 0$ vs $H_1: \alpha_0 \neq 0$, we have enough evidence to reject $H_0$ at the 5% level and conclude that the students who are put on academic probation and continue with their education tend to improve their performance (note that the estimated $\alpha$ is positive) in the subsequent academic year.
This makes sense, as the students who did not leave the university after being put on academic probation must have a strong incentive to work harder so that they can come off probation. (We have collected the dataset from .)

Table 4: Description of the covariates for the GPA data
hsgrade_pct: High school grade in percentage
totcredits_year1: Total credits taken in first year
loc_campus1: Indicator of whether the student is from Campus 1
loc_campus2: Indicator of whether the student is from Campus 2
male: Indicator of whether the student is male
bpl_north_america: Whether the birth place is in North America
age_at_entry: Age of the student when they entered the college
english: Indicator of whether the student is a native English speaker

Table 5: Summary statistics for the GPA data
Point estimate: 0.2733371
Bootstrap mean: 0.2817404
Bootstrap s.e.: 0.016672
Bootstrap 95% C.I.: (0. , .)

Remark 4.1. Note that in Table 1 and Table 5 we presented bootstrap confidence intervals for our estimator instead of asymptotic confidence intervals. This is because consistent estimation of the asymptotic variance of our estimator is not straightforward. Recall from Theorem 3.6 that the asymptotic variance of our estimator is $e_1^\top \Omega_\tau^{-1} \Omega^*_\tau \Omega_\tau^{-1} e_1$. As mentioned in Step 1 of the sketch of the proof of Theorem 3.6, $W^\top \mathrm{proj}^\perp_{\tilde N_{K,a}} W / n$ is a consistent estimator of $\frac{1}{3}\Omega_\tau$, but it is hard to estimate $\Omega^*_\tau$ consistently, which forces us to resort to the bootstrap confidence interval.

In this paper, we propose a new approach to estimating the treatment effect in a regression discontinuity design at the $\sqrt{n}$ rate, and show that under homoskedasticity and normality of the error our method is semiparametrically efficient. We have also pointed out in Remark 3.8 that one can use weighted least squares instead of ordinary least squares to take care of heteroskedastic models; however, the normality assumption is necessary for semiparametric efficiency, as we are using a least squares method for estimating the treatment effect. Here we articulate some possible future research directions based on our work, which we believe are worth investigating:

Bootstrap consistency: As noted in Remark 4.1, we use a bootstrap confidence interval instead of the asymptotic one, owing to the intricate form of the asymptotic variance of our estimator, which makes it hard to estimate from the data. Therefore, an immediate question of interest is to investigate whether the bootstrap is consistent under our model assumptions. Although empirical evidence suggests the answer to be affirmative, a rigorous theoretical undertaking is essential to establish the claim.

Extension to high dimensions: Modern applications routinely have covariates $X, Z$ that may be numerous, with the order of $p$ exceeding the available sample size, but with only relatively few of the features being relevant to the response. Reverting to equation (1.3), one may assume that $\beta_0, \gamma_0 \in \mathbb{R}^p$ where $p \gg n$, $n$ being the total number of samples. A standard approach for estimation is to assume sparsity, i.e. that there exist $s_1, s_2$ such that $\|\beta_0\|_0 \le s_1$ and $\|\gamma_0\|_0 \le s_2$, $\|\cdot\|_0$ being the $\ell_0$ norm of a vector. It is of interest to estimate the treatment effect in the presence of these high dimensional covariates at the best possible rate. We believe that estimation at the $\sqrt{n}$ rate is possible in this scenario by employing a variant of the techniques used to debias the lasso estimator (e.g. see [49], [53] or [28]).

Non-parametric link function: An important direction for extending our model is to consider a more general link function between the response $Y$ and the score variable $Q$.
Technically speaking, we modify our model equation (1.3) as follows:
$$Y_i = g(Q_i)\, \mathbf{1}\{Q_i \ge 0\} + X_i^\top \beta_0 + \nu_i, \qquad Q_i = Z_i^\top \gamma_0 + \eta_i, \quad \gamma_0 \neq 0, \quad (5.1)$$
for some monotone function $g$. Note that equation (1.3) is a special case of (5.1) when $g$ is a constant function. Suppose, for the rest of the discussion, that background knowledge tells us that the treatment effect should be non-decreasing. To this end, consider a situation where $Q_i$ is the score on an assessment test for selection into an in-depth curriculum program [such tests, for example, exist for middle school students, and those who clear the cutoff are provided exposure to advanced material in out-of-school classes during middle school, e.g. the ATYP program at WMU, see https://wmich.edu/precollege/atyp]. So the treatment is advanced classes, and let $Y_i$ be the performance of the students measured in high school. If we believe that, among the students who receive the treatment, the ones who score higher points on the test might also be able to take better advantage of the advanced curriculum they are exposed to, then the effect of $Q$ on $Y$ in the treatment group can be viewed as a potentially monotone increasing function $g$. In other words, there is a non-trivial interaction between the score on the test and the treatment. The effect of $Q$ for $Q < 0$ is, by contrast, absent in (5.1), since no treatment is received below the cutoff.

A Proofs of main theorems

A.1 Proof of Theorem 3.6

A.1.1 Proof of Step 1

First we decompose the matrix as follows:
$$\frac{W^\top \mathrm{proj}^\perp_{\tilde N_{K,a}} W}{n} = \frac{W_1^\top \mathrm{proj}^\perp_{\tilde N_K} W_1}{n} + \frac{W_2^\top W_2}{n}.$$
We show that:
$$\frac{W_1^\top \mathrm{proj}^\perp_{\tilde N_K} W_1}{n} \overset{P}{\to} \frac{1}{3}\, E\left[ \mathrm{var}\left( \begin{pmatrix} S \\ X \\ Z b'(\eta) \end{pmatrix} \Big|\, \eta \right) \mathbf{1}_{|\eta| \le \tau} \right], \quad (A.1)$$
$$\frac{W_2^\top W_2}{n} \overset{P}{\to} \frac{1}{3}\, \mathrm{var}\left( \begin{pmatrix} 0 \\ 0 \\ Z \end{pmatrix} \right), \quad (A.2)$$
where $W_1, W_2$ are as defined in Section 2. Note that this implies:
$$\frac{W^\top \mathrm{proj}^\perp_{\tilde N_{K,a}} W}{n} \overset{P}{\to} \frac{1}{3}\, E\left[ \mathrm{var}\left( \begin{pmatrix} S \\ X \\ Z b'(\eta) \end{pmatrix} \Big|\, \eta \right) \mathbf{1}_{|\eta| \le \tau} \right] + \frac{1}{3}\, \mathrm{var}\left( \begin{pmatrix} 0 \\ 0 \\ Z \end{pmatrix} \right) = \frac{1}{3}\, \Omega_\tau,$$
which delivers the assertion of Step 1. Equation (A.2) follows immediately from an application of the weak law of large numbers. For equation (A.1), note that $W_1^\top \mathrm{proj}^\perp_{\tilde N_K} W_1 / n$ is the inner product matrix of the residuals of the rows of $W_1$ upon projecting out the effect of $\tilde N_K$, which can be further decomposed as:
$$\frac{W_1^\top \mathrm{proj}^\perp_{\tilde N_K} W_1}{n} = \frac{W^{*\top} \mathrm{proj}^\perp_{\tilde N_K} W^*}{n} + 2\, \frac{W^{*\top} \mathrm{proj}^\perp_{\tilde N_K} (W_1 - W^*)}{n} + \frac{(W_1 - W^*)^\top \mathrm{proj}^\perp_{\tilde N_K} (W_1 - W^*)}{n}, \quad (A.3)$$
where $W^*_{i*} = \begin{pmatrix} S_i & X_i & b'(\eta_i) Z_i \end{pmatrix}$. Note that the only difference between $W_1$ and $W^*$ is in the last $p_2$ coordinates, where we have replaced $\hat b'(\hat\eta_i)$ by $b'(\eta_i)$. We now show that:
$$\frac{W^{*\top} \mathrm{proj}^\perp_{\tilde N_K} (W^* - W_1)}{n} \overset{P}{\longrightarrow} 0;$$
$(W_1 - W^*)^\top \mathrm{proj}^\perp_{\tilde N_K} (W_1 - W^*)/n$ will consequently be $o_p(1)$, as it is a lower order term. Fix $1 \le j, k \le 1 + p_1 + p_2$:
$$\left| \left( \frac{W^{*\top} \mathrm{proj}^\perp_{\tilde N_K} (W^* - W_1)}{n} \right)_{j,k} \right| = \left| \frac{\left\langle \mathrm{proj}^\perp_{\tilde N_K} W^*_{*j},\; \mathrm{proj}^\perp_{\tilde N_K} (W^* - W_1)_{*k} \right\rangle}{n} \right| \le \sqrt{ \frac{\|W^*_{*j}\|^2}{n} \cdot \frac{\|(W^* - W_1)_{*k}\|^2}{n} }.$$
That $\|W^*_{*j}\|^2 / n = O_p(1)$ follows from an immediate application of the WLLN. To show that the other part is $o_p(1)$, note that $\|(W^* - W_1)_{*k}\|^2 / n = 0$ for $1 \le k \le p_1 + 1$. For $p_1 + 2 \le k \le 1 + p_1 + p_2$, define $\tilde k = k - (p_1 + 1)$.
Then:
$$\frac{\|(W^* - W_1)_{*k}\|^2}{n} = \frac{1}{n} \sum_{i=1}^{n/3} \left( b'(\eta_i) - \hat b'(\hat\eta_i) \right)^2 Z_{i,\tilde k}^2\, \mathbf{1}_{|\hat\eta_i| \le \tau} \le \frac{2}{n} \sum_{i=1}^{n/3} \left( b'(\eta_i) - b'(\hat\eta_i) \right)^2 Z_{i,\tilde k}^2\, \mathbf{1}_{|\hat\eta_i| \le \tau} + \frac{2}{n} \sum_{i=1}^{n/3} \left( b'(\hat\eta_i) - \hat b'(\hat\eta_i) \right)^2 Z_{i,\tilde k}^2\, \mathbf{1}_{|\hat\eta_i| \le \tau} \le \frac{2}{n} \sum_{i=1}^{n/3} (\eta_i - \hat\eta_i)^2 \left( b''(\tilde\eta_i) \right)^2 Z_{i,\tilde k}^2 + 2 \sup_{|t| \le \tau} \left| b'(t) - \hat b'(t) \right|^2 \frac{1}{n} \sum_{i=1}^{n/3} Z_{i,\tilde k}^2 = O_p(n^{-1}) + o_p(1) = o_p(1). \quad (A.4)$$
That the first summand is $O_p(n^{-1})$ follows from the facts that $b''$ is uniformly bounded (Assumption 3.2) and $\hat\eta_i - \eta_i = O_p(n^{-1/2})$; that the second summand is $o_p(1)$ follows from Proposition 2.1. We next show:
$$\frac{W^{*\top} \mathrm{proj}^\perp_{\tilde N_K} W^*}{n} \overset{P}{\longrightarrow} \frac{1}{3}\, E\left[ \mathrm{var}\left( \begin{pmatrix} S \\ X \\ Z b'(\eta) \end{pmatrix} \Big|\, \eta \right) \mathbf{1}_{|\eta| \le \tau} \right]. \quad (A.5)$$
Towards that direction, we first claim that $W^{*\top} \mathrm{proj}^\perp_{\tilde N_K} W^* / n = O_p(1)$. For any $1 \le l, m \le 1 + p_1 + p_2$ we have:
$$\left| \left( \frac{W^{*\top} \mathrm{proj}^\perp_{\tilde N_K} W^*}{n} \right)_{l,m} \right| = \left| \frac{\langle W^*_{*l},\, \mathrm{proj}^\perp_{\tilde N_K} W^*_{*m} \rangle}{n} \right| \le \left\| \frac{W^*_{*l}}{\sqrt{n}} \right\| \left\| \frac{W^*_{*m}}{\sqrt{n}} \right\| = O_p(1)$$
by the WLLN. Setting $a_n = \log n / \sqrt{n}$, we next decompose this term into two further terms:
$$\frac{W^{*\top} \mathrm{proj}^\perp_{\tilde N_K} W^*}{n} = \frac{W^{*\top} \mathrm{proj}^\perp_{\tilde N_K} W^*}{n}\, \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n} + \frac{W^{*\top} \mathrm{proj}^\perp_{\tilde N_K} W^*}{n}\, \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| > a_n}.$$
As we have already established $W^{*\top} \mathrm{proj}^\perp_{\tilde N_K} W^* / n = O_p(1)$, it is immediate that:
$$\frac{W^{*\top} \mathrm{proj}^\perp_{\tilde N_K} W^*}{n}\, \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| > a_n} = o_p(1).$$
Therefore, we need to establish the convergence of $(W^{*\top} \mathrm{proj}^\perp_{\tilde N_K} W^* / n)\, \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n}$. Define functions $g(a, t)$ and $V(a, t)$ as:
$$g(a, t) = E\left[ \begin{pmatrix} S \\ X \\ b'(\eta) Z \end{pmatrix} \Big|\, \eta + a^\top Z = t \right], \qquad V(a, t) = \mathrm{var}\left[ \begin{pmatrix} S \\ X \\ b'(\eta) Z \end{pmatrix} \Big|\, \eta + a^\top Z = t \right].$$
The following lemma characterizes some smoothness properties of the functions $g$ and $V$:

Lemma A.1. Under Assumptions 3.1-3.4, the functions $g$ and $V$ are continuous. Moreover, $g$ is continuously differentiable in both of its coordinates, and consequently $g$, $\partial_a g$ and $\partial_t g$ are uniformly bounded on $\|a\| \le 1$ and $|t| \le \tau + 1$.

The proof of this lemma can be found in Section B. Note that the definition of $g$ implies $g(\hat\gamma_n - \gamma_0, t) = E\left[ W^*_{i*} \mid \hat\eta_i = t, \mathcal{F}(D_1) \right]$. As $g$ is a vector valued function with range a subset of $\mathbb{R}^{1 + p_1 + p_2}$, we henceforth denote by $g_j$ the $j$-th coordinate of $g$ for $1 \le j \le 1 + p_1 + p_2$. Now, for each coordinate of $g$, we further define $\omega_{j,n,\infty}$ as the B-spline approximation vector of $g_j$, i.e.
$$\omega_{j,n,\infty} = \arg\min_\omega \sup_{|x| \le \tau} \left| g_j(\hat\gamma_n - \gamma_0, x) - \tilde N_K(x)^\top \omega \right|,$$
where $\tilde N_K$ denotes the scaled B-spline basis functions (for the definition and a brief discussion, see Section C.2). We often drop the index $n$ from $\omega_{j,n,\infty}$ when there is no ambiguity. It is immediate from Theorem C.3 (with $l = r = 0$):
$$\sup_{|x| \le \tau} \left| g_j(\hat\gamma_n - \gamma_0, x) - \tilde N_K(x)^\top \omega_{j,\infty} \right| \lesssim \frac{\tau}{K} \sup_{|t| \le \tau} \left| \partial_t g(\hat\gamma_n - \gamma_0, t) \right| = O_p(K^{-1}). \quad (A.6)$$
We define the matrix $G$ by $G_{i*} = E\left[ W^*_{i*} \mid \hat\eta = \hat\eta_i, \mathcal{F}(D_1) \right] = g(\hat\gamma_n - \gamma_0, \hat\eta_i)$, and the matrix $H$ by $H_{i,j} = \tilde N_K(\hat\eta_i)^\top \omega_{j,\infty}$.
Using these matrices we expand the matrix under consideration as follows:
$$\begin{aligned} \frac{W^{*\top} \mathrm{proj}^\perp_{\tilde N_K} W^*}{n}\, \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n} &= \frac{(W^* - G)^\top \mathrm{proj}^\perp_{\tilde N_K} (W^* - G)}{n}\, \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n} + \frac{(G - H)^\top \mathrm{proj}^\perp_{\tilde N_K} (G - H)}{n}\, \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n} + 2\, \frac{(G - H)^\top \mathrm{proj}^\perp_{\tilde N_K} (W^* - G)}{n}\, \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n} \\ &= \frac{(W^* - G)^\top (W^* - G)}{n}\, \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n} - \frac{(W^* - G)^\top \mathrm{proj}_{\tilde N_K} (W^* - G)}{n}\, \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n} \\ &\quad + \frac{(G - H)^\top \mathrm{proj}^\perp_{\tilde N_K} (G - H)}{n}\, \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n} + 2\, \frac{(G - H)^\top \mathrm{proj}^\perp_{\tilde N_K} (W^* - G)}{n}\, \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n} \\ &:= T_1 + T_2 + T_3 + T_4. \end{aligned} \quad (A.7)$$
We first show that $[(W^* - G)^\top (W^* - G)/n]\, \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n}$ converges to some matrix. Towards that end, we further expand it as follows:
$$\begin{aligned} \frac{(W^* - G)^\top (W^* - G)}{n}\, \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n} &= \frac{1}{n} \sum_{i=1}^{n/3} (W^*_{i*} - g(\hat\gamma_n - \gamma_0, \hat\eta_i))(W^*_{i*} - g(\hat\gamma_n - \gamma_0, \hat\eta_i))^\top\, \mathbf{1}_{|\hat\eta_i| \le \tau,\, \|\hat\gamma_n - \gamma_0\| \le a_n} \\ &= \frac{1}{n} \sum_{i=1}^{n/3} (W^*_{i*} - g(0, \hat\eta_i))(W^*_{i*} - g(0, \hat\eta_i))^\top\, \mathbf{1}_{|\hat\eta_i| \le \tau,\, \|\hat\gamma_n - \gamma_0\| \le a_n} \\ &\quad + \frac{2}{n} \sum_{i=1}^{n/3} (W^*_{i*} - g(0, \hat\eta_i))(g(0, \hat\eta_i) - g(\hat\gamma_n - \gamma_0, \hat\eta_i))^\top\, \mathbf{1}_{|\hat\eta_i| \le \tau,\, \|\hat\gamma_n - \gamma_0\| \le a_n} \\ &\quad + \frac{1}{n} \sum_{i=1}^{n/3} (g(0, \hat\eta_i) - g(\hat\gamma_n - \gamma_0, \hat\eta_i))(g(0, \hat\eta_i) - g(\hat\gamma_n - \gamma_0, \hat\eta_i))^\top\, \mathbf{1}_{|\hat\eta_i| \le \tau,\, \|\hat\gamma_n - \gamma_0\| \le a_n} \\ &:= T_{1,1} + T_{1,2} + T_{1,3}. \end{aligned}$$
We now show that $T_{1,2} = o_p(1)$; that $T_{1,3} = o_p(1)$ follows immediately from there, as it is a lower order term. For $T_{1,2}$, note that by Lemma A.1 the function $g(a, t)$ has a continuous derivative with respect to $a$, which makes $g$ Lipschitz on the ball $\|a\| \le 1$. Hence we have:
$$\left\| \frac{2}{n} \sum_{i=1}^{n/3} (W^*_{i*} - g(0, \hat\eta_i))(g(0, \hat\eta_i) - g(\hat\gamma_n - \gamma_0, \hat\eta_i))^\top\, \mathbf{1}_{|\hat\eta_i| \le \tau,\, \|\hat\gamma_n - \gamma_0\| \le a_n} \right\|_F \le \frac{2}{n} \sum_{i=1}^{n/3} \left\| W^*_{i*} - g(0, \hat\eta_i) \right\| \left\| g(0, \hat\eta_i) - g(\hat\gamma_n - \gamma_0, \hat\eta_i) \right\| \mathbf{1}_{|\hat\eta_i| \le \tau,\, \|\hat\gamma_n - \gamma_0\| \le a_n} \lesssim \|\hat\gamma_n - \gamma_0\|\, \frac{2}{n} \sum_{i=1}^{n/3} \left\| W^*_{i*} - g(0, \hat\eta_i) \right\| \mathbf{1}_{|\hat\eta_i| \le \tau} = O_p(n^{-1/2}).$$
Now, to establish the convergence of $T_{1,1}$, we further expand it as follows:
$$\begin{aligned} T_{1,1} &= \frac{1}{n} \sum_{i=1}^{n/3} (W^*_{i*} - g(0, \eta_i))(W^*_{i*} - g(0, \eta_i))^\top\, \mathbf{1}_{|\hat\eta_i| \le \tau,\, \|\hat\gamma_n - \gamma_0\| \le a_n} \\ &\quad + \frac{2}{n} \sum_{i=1}^{n/3} (W^*_{i*} - g(0, \eta_i))(g(0, \eta_i) - g(0, \hat\eta_i))^\top\, \mathbf{1}_{|\hat\eta_i| \le \tau,\, \|\hat\gamma_n - \gamma_0\| \le a_n} \\ &\quad + \frac{1}{n} \sum_{i=1}^{n/3} (g(0, \eta_i) - g(0, \hat\eta_i))(g(0, \eta_i) - g(0, \hat\eta_i))^\top\, \mathbf{1}_{|\hat\eta_i| \le \tau,\, \|\hat\gamma_n - \gamma_0\| \le a_n} \\ &:= T_{1,1,1} + T_{1,1,2} + T_{1,1,3}. \end{aligned}$$
From the law of large numbers we first conclude:
$$T_{1,1,1} \overset{P}{\longrightarrow} \frac{1}{3}\, E\left[ \left( W^*_{1*} - E(W^*_{1*} \mid \eta) \right) \left( W^*_{1*} - E(W^*_{1*} \mid \eta) \right)^\top \mathbf{1}_{|\eta| \le \tau} \right],$$
and $T_{1,1,3} = o_p(1)$ follows immediately, being a higher order term.
We analyse $T_{1,1,2}$ as follows:
$$\begin{aligned} \|T_{1,1,2}\| &= \left\| \frac{2}{n} \sum_{i=1}^{n/3} (W^*_{i*} - g(0, \eta_i))(g(0, \eta_i) - g(0, \hat\eta_i))^\top\, \mathbf{1}_{|\hat\eta_i| \le \tau,\, \|\hat\gamma_n - \gamma_0\| \le a_n} \right\| \\ &\le \left\| \frac{2}{n} \sum_{i=1}^{n/3} (W^*_{i*} - g(0, \eta_i))(g(0, \eta_i) - g(0, \hat\eta_i))^\top\, \mathbf{1}_{|\hat\eta_i| \le \tau,\, \|\hat\gamma_n - \gamma_0\| \le a_n,\, |\hat\eta_i - \eta_i| \le 1} \right\| + \left\| \frac{2}{n} \sum_{i=1}^{n/3} (W^*_{i*} - g(0, \eta_i))(g(0, \eta_i) - g(0, \hat\eta_i))^\top\, \mathbf{1}_{|\hat\eta_i| \le \tau,\, \|\hat\gamma_n - \gamma_0\| \le a_n,\, |\hat\eta_i - \eta_i| > 1} \right\| \\ &\lesssim \|\hat\gamma_n - \gamma_0\|\, \frac{2}{n} \sum_i \left\| W^*_{i*} - g(0, \eta_i) \right\| \|Z_i\| + \frac{2}{n} \sum_i \left\| W^*_{i*} - g(0, \eta_i) \right\| \left\| g(0, \eta_i) - g(0, \hat\eta_i) \right\| \mathbf{1}_{|\hat\eta_i| \le \tau,\, \|\hat\gamma_n - \gamma_0\| \le a_n,\, |\hat\eta_i - \eta_i| > 1}. \end{aligned}$$
That the first term is $o_p(1)$ is immediate. For the second term, note that:
$$\frac{2}{n} \sum_i \left\| W^*_{i*} - g(0, \eta_i) \right\| \left\| g(0, \eta_i) - g(0, \hat\eta_i) \right\| = O_p(1)$$
and $P(\|\hat\gamma_n - \gamma_0\| \le a_n,\, |\hat\eta_i - \eta_i| > 1) \longrightarrow 0$. This finishes the analysis of $T_{1,1}$, i.e. we have established:
$$T_1 \overset{P}{\longrightarrow} \frac{1}{3}\, E\left[ \left( W^*_{1*} - E(W^*_{1*} \mid \eta) \right) \left( W^*_{1*} - E(W^*_{1*} \mid \eta) \right)^\top \mathbf{1}_{|\eta| \le \tau} \right]. \quad (A.8)$$
For $T_3$ in equation (A.7) we have, for any $1 \le j, k \le 1 + p_1 + p_2$:
$$\left| \left( \frac{(G - H)^\top \mathrm{proj}^\perp_{\tilde N_K} (G - H)}{n} \right)_{j,k} \right| \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n} \le \left| \frac{\langle (G - H)_{*j},\, \mathrm{proj}^\perp_{\tilde N_K} (G - H)_{*k} \rangle}{n} \right| \le \sqrt{ \frac{\|(G - H)_{*j}\|^2}{n} \cdot \frac{\|(G - H)_{*k}\|^2}{n} } \le \sup_{|x| \le \tau} \left| g_j(\hat\gamma_n - \gamma_0, x) - \tilde N_K(x)^\top \omega_{j,\infty} \right| \times \sup_{|x| \le \tau} \left| g_k(\hat\gamma_n - \gamma_0, x) - \tilde N_K(x)^\top \omega_{k,\infty} \right| = O_p(K^{-2}) = o_p(1) \quad \text{[from equation (A.6)]}. \quad (A.9)$$
Similarly for $T_4$ in equation (A.7), for any $1 \le j, k \le 1 + p_1 + p_2$:
$$\left| \left( \frac{(G - H)^\top \mathrm{proj}^\perp_{\tilde N_K} (W^* - G)}{n} \right)_{j,k} \right| \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n} \le \sqrt{ \frac{\|(G - H)_{*j}\|^2}{n} \cdot \frac{\|(W^* - G)_{*k}\|^2}{n} } \le \sup_{|x| \le \tau} \left| g_j(\hat\gamma_n - \gamma_0, x) - \tilde N_K(x)^\top \omega_{j,\infty} \right| \times O_p(1) = o_p(1), \quad (A.10)$$
where $\|(W^* - G)_{*k}\|^2 / n = O_p(1)$ follows from the law of large numbers, and the fact that the uniform spline approximation error is $o_p(1)$ follows from equation (A.6). Finally, for $T_2$ in equation (A.7), first recall that $W^*$ consists of all the rows for which $|\hat\eta_i| \le \tau$. We can extend this matrix to $W^{*,f} \in \mathbb{R}^{(n/3) \times (1 + p_1 + p_2)}$ with
$$W^{*,f}_{i*} = \begin{pmatrix} S_i & X_{i*} & b'(\eta_i) Z_{i*} \end{pmatrix} \mathbf{1}_{|\hat\eta_i| \le \tau}.$$
The matrix $W^{*,f}$ is exactly $W^*$ appended with 0's in the rows where $|\hat\eta_i| > \tau$.
Similarly, we can define $G^f$ with $G^f_{i*} = E\left[ W^{*,f}_{i*} \mid \hat\eta = \hat\eta_i, \mathcal{F}(D_1) \right]$ and $\tilde N^f_K$ as the basis matrix with $\tilde N^f_{K,i*} = \tilde N_K(\hat\eta_i)\, \mathbf{1}_{|\hat\eta_i| \le \tau}$. It is easy to see that for any $1 \le j \le 1 + p_1 + p_2$:
$$\frac{(W^* - G)^\top_{*j} \mathrm{proj}_{\tilde N_K} (W^* - G)_{*j}}{n}\, \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n} = \frac{(W^{*,f} - G^f)^\top_{*j} \mathrm{proj}_{\tilde N^f_K} (W^{*,f} - G^f)_{*j}}{n}\, \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n}.$$
Hence we can bound the $(j, j)$-th term of $T_2$ as follows:
$$\begin{aligned} E\left( \frac{(W^* - G)^\top_{*j} \mathrm{proj}_{\tilde N_K} (W^* - G)_{*j}}{n}\, \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n} \,\Big|\, \mathcal{F}(D_1) \right) &= E\left( \frac{(W^{*,f} - G^f)^\top_{*j} \mathrm{proj}_{\tilde N^f_K} (W^{*,f} - G^f)_{*j}}{n}\, \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n} \,\Big|\, \mathcal{F}(D_1) \right) \\ &\le \frac{1}{n}\, E\left( \mathrm{tr}\left( \mathrm{proj}_{\tilde N^f_K}\, E\left( (W^{*,f} - G^f)_{*j} (W^{*,f} - G^f)^\top_{*j} \,\Big|\, \mathcal{F}(\hat{\boldsymbol\eta}, D_1) \right) \right) \right) \qquad [\hat{\boldsymbol\eta} = \{\hat\eta_i\}_{i=1}^{n/3} \text{ from } D_3] \\ &\le \frac{K + 3}{n} \sup_{|t| \le \tau} \mathrm{var}\left( \begin{pmatrix} S \\ X \\ b'(\eta) Z \end{pmatrix} \,\Big|\, \hat\eta = t, \mathcal{F}(D_1) \right) \mathbf{1}_{\|\hat\gamma_n - \gamma_0\| \le a_n} = O_p\!\left( \frac{K}{n} \right) = o_p(1), \end{aligned} \quad (A.11)$$
where we use Lemma C.1 along with the fact that $\mathrm{tr}(\mathrm{proj}_{\tilde N_K}) \le K + 3$; the finiteness of the conditional variance follows from Lemma A.1. (We use Lemma C.2 here to conclude that a sequence of non-negative random variables is $o_p(1)$ if their conditional expectations are $o_p(1)$.) Combining our findings from equations (A.8), (A.9), (A.10) and (A.11), we conclude (A.1), which along with (A.2) yields:
$$\frac{W^\top \mathrm{proj}^\perp_{\tilde N_{K,a}} W}{n} \overset{P}{\longrightarrow} \frac{1}{3}\, E\left[ \mathrm{var}\left( \begin{pmatrix} S \\ X \\ Z b'(\eta) \end{pmatrix} \Big|\, \eta \right) \mathbf{1}_{|\eta| \le \tau} \right] + \frac{1}{3}\, \mathrm{var}\left( \begin{pmatrix} 0 \\ 0 \\ Z \end{pmatrix} \right) = \frac{1}{3}\, \Omega_\tau.$$
As $\Omega_\tau \to \Omega$, using Assumption 3.3 we conclude that the minimum eigenvalue of $\Omega_\tau$ is positive.

A.1.2 Proof of Step 2

Define $\mathcal{A}_n$ to be the sigma-field generated by $D_1$, $D_2$ and $\{(X_i, Z_i, \eta_i)\}_{i=1}^{n/3}$ in $D_3$. We start with the following decomposition:
$$\frac{W^\top \mathrm{proj}^\perp_{\tilde N_{K,a}}}{\sqrt{n}} \begin{pmatrix} \epsilon \\ \eta \end{pmatrix} = \frac{W_1^\top \mathrm{proj}^\perp_{\tilde N_K} \epsilon}{\sqrt{n}} + \frac{W_2^\top \eta}{\sqrt{n}}.$$
The asymptotic linear expansion of the second summand is immediate:
$$\frac{W_2^\top \eta}{\sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n/3} \begin{pmatrix} 0 \\ 0 \\ -Z_i \eta_i \end{pmatrix}.$$
Recall from the definition of $G$ that $G_{i*} = g(\hat\gamma_n - \gamma_0, \hat\eta_i)$. Define another matrix $G^*$ by $G^*_{i*} = g(0, \eta_i)$. For the first summand, we decompose it as follows:
$$\begin{aligned} \frac{W_1^\top \mathrm{proj}^\perp_{\tilde N_K} \epsilon}{\sqrt{n}} &= \frac{W^{*\top} \mathrm{proj}^\perp_{\tilde N_K} \epsilon}{\sqrt{n}} + \frac{(W_1 - W^*)^\top \mathrm{proj}^\perp_{\tilde N_K} \epsilon}{\sqrt{n}} \\ &= \frac{(W^* - G^*)^\top \epsilon}{\sqrt{n}} + \frac{(W_1 - W^*)^\top \mathrm{proj}^\perp_{\tilde N_K} \epsilon}{\sqrt{n}} - \frac{(W^* - G)^\top \mathrm{proj}_{\tilde N_K} \epsilon}{\sqrt{n}} + \frac{(G - H)^\top \mathrm{proj}^\perp_{\tilde N_K} \epsilon}{\sqrt{n}} + \frac{(G^* - G)^\top \epsilon}{\sqrt{n}} \\ &:= T_1 + T_2 + T_3 + T_4 + T_5. \end{aligned} \quad (A.12)$$
We show that $T_2, T_3, T_4, T_5$ are all $o_p(1)$. This will establish:
$$\frac{W_1^\top \mathrm{proj}^\perp_{\tilde N_K} \epsilon}{\sqrt{n}} = \frac{(W^* - G^*)^\top \epsilon}{\sqrt{n}} + o_p(1). \quad (A.13)$$
For $T_2$, note that for any $p_1 + 2 \le j \le 1 + p_1 + p_2$:
$$E\left( \left( \frac{(W_1 - W^*)^\top_{*j} \mathrm{proj}^\perp_{\tilde N_K} \epsilon}{\sqrt{n}} \right)^2 \,\Big|\, \mathcal{F}(D_1, D_2) \right) = \frac{1}{n}\, E\left[ (W_1 - W^*)^\top_{*j} \mathrm{proj}^\perp_{\tilde N_K}\, E(\epsilon \epsilon^\top \mid \mathcal{A}_n)\, \mathrm{proj}^\perp_{\tilde N_K} (W_1 - W^*)_{*j} \,\Big|\, \mathcal{F}(D_1, D_2) \right] \le \sup_\eta \mathrm{var}(\epsilon \mid \eta) \times E\left[ \frac{(W_1 - W^*)^\top_{*j} (W_1 - W^*)_{*j}}{n} \,\Big|\, \mathcal{F}(D_1, D_2) \right] = \sup_\eta \mathrm{var}(\epsilon \mid \eta) \times o_p(1) \quad \text{[from equation (A.4)]}.$$
Now for $T_3$, for any $1\le j\le 1+p_1+p_2$, by similar calculations:
\[
E\left(\left(\frac{(W^*-G)^\top_{*j}\mathrm{proj}_{\tilde N_K}\epsilon}{\sqrt n}\right)^2\ \Big|\ \mathcal A_n\right)
\le \sup_\eta\mathrm{var}(\epsilon\mid\eta)\times\frac{(W^*-G)^\top_{*j}\mathrm{proj}_{\tilde N_K}(W^*-G)_{*j}}{n}
= \sup_\eta\mathrm{var}(\epsilon\mid\eta)\times o_p(1) \quad [\text{from equation (A.11)}].
\]
For $T_4$, for $1\le j\le 1+p_1+p_2$:
\[
E\left(\left(\frac{(G-H)^\top_{*j}\mathrm{proj}^\perp_{\tilde N_K}\epsilon}{\sqrt n}\right)^2\ \Big|\ \mathcal A_n\right)
\le \sup_\eta\mathrm{var}(\epsilon\mid\eta)\times\frac{(G-H)^\top_{*j}(G-H)_{*j}}{n}
= \sup_\eta\mathrm{var}(\epsilon\mid\eta)\times o_p(1) \quad [\text{from equation (A.9)}].
\]
Now for $T_5$, using a similar technique, for any $1\le j\le 1+p_1+p_2$:
\[
E\left(\left(\frac{(G-G^*)^\top_{*j}\epsilon}{\sqrt n}\right)^2\ \Big|\ \mathcal A_n\right)
\le \sup_\eta\mathrm{var}(\epsilon\mid\eta)\times\frac{\|G_{*j}-G^*_{*j}\|^2}{n}
\]
\[
\le \sup_\eta\mathrm{var}(\epsilon\mid\eta)\times\left[\frac1n\sum_{i=1}^{n/3}\big(g_j(\hat\gamma_n-\gamma,\hat\eta_i)-g_j(0,\eta_i)\big)^2\mathbf 1_{|\hat\eta_i|\le\tau}\mathbf 1_{\|\hat\gamma_n-\gamma\|\le a_n}
+ \frac1n\sum_{i=1}^{n/3}\big(g_j(\hat\gamma_n-\gamma,\hat\eta_i)-g_j(0,\eta_i)\big)^2\mathbf 1_{|\hat\eta_i|\le\tau}\mathbf 1_{\|\hat\gamma_n-\gamma\|>a_n}\right].
\]
The first term inside the square bracket is $o_p(1)$ from the boundedness of the partial derivatives of $g$ with respect to both $a$ and $t$. The second term inside the square bracket is $o_p(1)$ because $\frac1n\sum_{i=1}^{n/3}(g_j(\hat\gamma_n-\gamma,\hat\eta_i)-g_j(0,\eta_i))^2 = O_p(1)$ and $P(\|\hat\gamma_n-\gamma\|>a_n) = o(1)$.

Finally we show that:
\[
\frac{(W^*-G^*)^\top\epsilon}{\sqrt n} = \frac{1}{\sqrt n}\sum_{i=1}^{n/3}\big(W^*_{i*}-G^*_{i*}\big)\epsilon_i\mathbf 1_{|\hat\eta_i|\le\tau}
= \frac{1}{\sqrt n}\sum_{i=1}^{n/3}\big(W^*_{i*}-G^*_{i*}\big)\epsilon_i\mathbf 1_{|\eta_i|\le\tau} + o_p(1). \tag{A.14}
\]
This along with equation (A.13) concludes:
\[
\frac{W^\top\mathrm{proj}^\perp_{\tilde N_K}\epsilon}{\sqrt n}
= \frac{1}{\sqrt n}\sum_{i=1}^{n/3}\big(W^*_{i*}-G^*_{i*}\big)\epsilon_i\mathbf 1_{|\eta_i|\le\tau} + o_p(1)
= \frac{1}{\sqrt n}\sum_{i=1}^{n/3}\left\{\begin{bmatrix}S_i\\ X_i\\ b'(\eta_i)Z_i\end{bmatrix} - E\left[\begin{bmatrix}S_i\\ X_i\\ b'(\eta_i)Z_i\end{bmatrix}\Big|\ \eta=\eta_i\right]\right\}\epsilon_i\mathbf 1_{|\eta_i|\le\tau} + o_p(1).
\]
Taking the function $\varphi$ as:
\[
\varphi(X_i,Z_i,\eta_i,\nu_i) = \left\{\begin{bmatrix}S_i\\ X_i\\ b'(\eta_i)Z_i\end{bmatrix} - E\left[\begin{bmatrix}S_i\\ X_i\\ b'(\eta_i)Z_i\end{bmatrix}\Big|\ \eta=\eta_i\right]\right\}\epsilon_i\mathbf 1_{|\eta_i|\le\tau} + \begin{bmatrix}Z_i\eta_i\end{bmatrix},
\]
we conclude the proof of Step 2. It remains to prove equation (A.14). Define the event $\Delta_i$ as:
\[
\Delta_i = \big\{|\hat\eta_i|\le\tau \cap |\eta_i|>\tau\big\} \cup \big\{|\hat\eta_i|>\tau \cap |\eta_i|\le\tau\big\}.
\]
Now for any $1\le j\le 1+p_1+p_2$:
\[
E\left[\left(\frac{1}{\sqrt n}\sum_{i=1}^{n/3}\big(W^*_{i,j}-G^*_{i,j}\big)\epsilon_i\big(\mathbf 1_{|\hat\eta_i|\le\tau}-\mathbf 1_{|\eta_i|\le\tau}\big)\right)^2\ \Big|\ \mathcal F(D_1)\right]
= E\left[E\left[\left(\cdots\right)^2\ \Big|\ \mathcal A_n\right]\ \Big|\ \mathcal F(D_1)\right]
\]
\[
\le \sup_\eta\mathrm{var}(\epsilon\mid\eta)\times\frac1n\sum_{i=1}^{n/3}E\Big[\big(W^*_{i,j}-G^*_{i,j}\big)^2\mathbf 1_{\Delta_i}\ \Big|\ \mathcal F(D_1)\Big]
\le \sup_\eta\mathrm{var}(\epsilon\mid\eta)\times\frac1n\sum_{i=1}^{n/3}\sqrt{E\Big[\big(W^*_{i,j}-G^*_{i,j}\big)^4\Big]}\,\sqrt{P(\Delta_i\mid\mathcal F(D_1))}
\]
\[
\lesssim \frac1n\sum_{i=1}^{n/3}\sqrt{P(\Delta_i\mid\mathcal F(D_1))} = o_p(1).
\]

A.1.3 Proof of Step 3

From the definition of $\varphi$ in Step 2 and the definition of $\Omega^*_\tau$, Step 3 immediately follows from a direct application of the central limit theorem.

A.1.4 Proof of Step 4

In this subsection we prove that:
\[
\frac{W^\top\mathrm{proj}^\perp_{\tilde N_{K},a}R}{\sqrt n} = \frac{W^\top\mathrm{proj}^\perp_{\tilde N_K}R}{\sqrt n} = o_p(1).
\]
From our discussion in Section 2, the residual vector can be expressed as the sum of three residual terms $R = R_1 + R_2 + R_3$, where $R_{1,i} = (\hat\eta_i-\eta_i)^2\,b''(\tilde\eta_i)$, $R_{2,i} = \big(b'(\hat\eta_i)-\hat b'(\hat\eta_i)\big)(\hat\eta_i-\eta_i)$ and $R_{3,i} = b(\hat\eta_i)-\tilde N_K(\hat\eta_i)^\top\omega_{b,\infty,n}$. We show that each of these terms is asymptotically negligible.
For the first term, we can write:
\[
E\left[\sum_{i=1}^{n/3}R^2_{1,i}\ \Big|\ \mathcal F(D_1)\right]
= \sum_{i=1}^{n/3}E\Big[(\hat\eta_i-\eta_i)^4\big(b''(\tilde\eta_i)\big)^2\mathbf 1_{|\hat\eta_i|\le\tau}\ \Big|\ \mathcal F(D_1)\Big]
\le \sum_{i=1}^{n/3}E\Big[(\hat\eta_i-\eta_i)^4\big(b''(\tilde\eta_i)\big)^2\ \Big|\ \mathcal F(D_1)\Big]
\]
\[
\le \|b''\|^2_\infty\times n\,\|\hat\gamma_n-\gamma\|^4\,E\big[\|Z\|^4\big] = O_p(n^{-1}).
\]
This implies $\|R_1\| = \sqrt{\sum_i R^2_{1,i}} = O_p(n^{-1/2})$, and consequently for any $1\le j\le 1+p_1+p_2$:
\[
\left|\frac{W^\top_{*j}\mathrm{proj}^\perp_{\tilde N_K}R_1}{\sqrt n}\right|
\le \sqrt{\frac{W^\top_{*j}\mathrm{proj}^\perp_{\tilde N_K}W_{*j}}{n}}\,\|R_1\| = O_p(1)\times O_p(n^{-1/2}) = o_p(1),
\]
where $W^\top_{*j}\mathrm{proj}^\perp_{\tilde N_K}W_{*j}/n = O_p(1)$ was proved in the proof of Step 1 (see equation (A.1)).

For the second residual term, we have:
\[
E\left[\sum_{i=1}^{n/3}R^2_{2,i}\ \Big|\ \mathcal F(D_1,D_2)\right]
= \sum_{i=1}^{n/3}E\Big[\big(b'(\hat\eta_i)-\hat b'(\hat\eta_i)\big)^2(\hat\eta_i-\eta_i)^2\mathbf 1_{|\hat\eta_i|\le\tau}\ \Big|\ \mathcal F(D_1,D_2)\Big]
\]
\[
\le \|\hat\gamma_n-\gamma\|^2\sum_{i=1}^{n/3}E\Big[\big(b'(\hat\eta_i)-\hat b'(\hat\eta_i)\big)^2\|Z_i\|^2\mathbf 1_{|\hat\eta_i|\le\tau}\ \Big|\ \mathcal F(D_1,D_2)\Big]
\le n\,\|\hat\gamma_n-\gamma\|^2\sup_{|t|\le\tau}\big|\hat b'(t)-b'(t)\big|^2\,E\big[\|Z\|^2\big].
\]
Now from Proposition 2.1 we have $\sup_{|t|\le\tau}|\hat b'(t)-b'(t)| = o_p(1)$, and from OLS properties we have $n\|\hat\gamma_n-\gamma\|^2 = O_p(1)$. Combining these, we conclude that $E\big[\sum_i R^2_{2,i}\big] = o_p(1)$. Hence:
\[
\left|\frac{W^\top_{*j}\mathrm{proj}^\perp_{\tilde N_K}R_2}{\sqrt n}\right|
\le \sqrt{\frac{W^\top_{*j}\mathrm{proj}^\perp_{\tilde N_K}W_{*j}}{n}}\,\|R_2\| = O_p(1)\times o_p(1) = o_p(1).
\]
For $R_3$ (the residual obtained by approximating the mean function via the B-spline basis), define $R^f_3$ to be the extended version of $R_3$ obtained by putting $0$ in the places where $|\hat\eta_i|>\tau$; i.e. $R_3$ has one coordinate for each retained row, whereas $R^f_3\in\mathbb R^{n/3}$. Using this we have:
\[
\frac{W^\top_{*j}\mathrm{proj}^\perp_{\tilde N_K}R_3}{\sqrt n}
= \frac{(W_{*j}-G_{*j})^\top\mathrm{proj}^\perp_{\tilde N_K}R_3}{\sqrt n} + \frac{(G_{*j}-H_{*j})^\top\mathrm{proj}^\perp_{\tilde N_K}R_3}{\sqrt n},
\]
where $G$, $H$ are as defined in Subsection A.1.1 (just after equation (A.6)). For the first summand, we have:
\[
E\left[\left(\frac{(W_{*j}-G_{*j})^\top\mathrm{proj}^\perp_{\tilde N_K}R_3}{\sqrt n}\right)^2\ \Big|\ \mathcal F(D_1,D_2)\right]
= E\left[\left(\frac{(W^f_{*j}-G^f_{*j})^\top\big(I-\mathrm{proj}_{\tilde N^f_K}\big)R^f_3}{\sqrt n}\right)^2\ \Big|\ \mathcal F(D_1,D_2)\right]
\]
\[
= \frac1n E\Big(R^{f\top}_3\mathrm{proj}^\perp_{\tilde N^f_K}\,E\big[(W^f_{*j}-G^f_{*j})(W^f_{*j}-G^f_{*j})^\top\ \big|\ \mathcal F(\hat\eta, D_1, D_2)\big]\,\mathrm{proj}^\perp_{\tilde N^f_K}R^f_3\ \Big|\ \mathcal F(D_1,D_2)\Big)
\]
\[
\le \sup_{|t|\le\tau}\mathrm{var}\left(\begin{bmatrix}S\\X\\b'(\eta)Z\end{bmatrix}\Big|\ \hat\eta=t,\ \mathcal F(D_1)\right)\,E\big[\|R_3\|^2/n\ \big|\ \mathcal F(D_1,D_2)\big]
\le \sup_{|t|\le\tau}V(\hat\gamma_n-\gamma,t)\times\sup_{|t|\le\tau}\big|b(t)-\tilde N_K(t)^\top\omega_{b,\infty,n}\big|^2 = o_p(1),
\]
where $\hat\eta$ is $\{\hat\eta_i\}_{i=1}^{n/3}$ from $D_3$.
The last line follows from the continuity of $V(a,t)$ (Lemma A.1) and Proposition 2.1. For the second summand:
\[
\frac{(G_{*j}-H_{*j})^\top\mathrm{proj}^\perp_{\tilde N_K}R_3}{\sqrt n}
\le \frac{1}{\sqrt n}\|G_{*j}-H_{*j}\|\,\|R_3\|
= \sqrt n\,\sqrt{\frac{\|G_{*j}-H_{*j}\|^2}{n}}\,\sqrt{\frac{\|R_3\|^2}{n}}
\]
\[
\le \sqrt n\,\sup_{|x|\le\tau}\big|g_j(\hat\gamma_n-\gamma,x)-\tilde N_K(x)^\top\omega_{j,\infty}\big|\times\sup_{|x|\le\tau}\big|b(x)-\tilde N_K(x)^\top\omega_{b,\infty,n}\big|
\lesssim \frac{\sqrt n}{K^4} = o(1),
\]
by Theorem C.3 (applied with $l=r=0$ for $g_j$ and with $l=2$, $r=0$ for $b$) along with Remark 2.2, which completes the proof of the asymptotic negligibility of the residuals.

A.2 Proof of Theorem 3.7

The model we consider here is:
\[
Y_i = \alpha S_i + X_i^\top\beta + b(\eta_i) + \epsilon_i, \qquad Q_i = Z_i^\top\gamma + \eta_i,
\]
where $S_i = \mathbf 1_{Q_i\ge 0}$ and $\epsilon\sim\mathcal N(0,\tau^2)$. For simplicity we assume here $\tau = 1$; an inspection of our proof immediately reveals that the extension to general $\tau$ is straightforward. The statistician observes $\mathcal D = \{(Y_i, X_i, Z_i, Q_i)\}_{i=1}^n$ at stage $n$ of the experiment, and hence the likelihood of the parameters $\theta = (\alpha, \beta, \gamma, b(\cdot), s_\eta, s_{X,Z})$ becomes:
\[
L(\theta\mid\mathcal D) = \prod_{i=1}^n\big[p(Y_i\mid Q_i,X_i,Z_i)\times p(Q_i\mid X_i,Z_i)\times p(X_i,Z_i)\big].
\]
As we calculate the information bound, henceforth we deal with only one observation and generically write:
\[
\ell(\theta) = \log L(\theta) = \log p(Y,Q,X,Z) = \log p(Y\mid Q,X,Z) + \log p(Q\mid X,Z) + \log p(X,Z)
\]
\[
= 2\log s(Y\mid Q,X,Z) + 2\log s(Q\mid X,Z) + 2\log s(X,Z).
\]
Now consider some parametrization $\gamma\in\mathcal R$, where $\mathcal R$ is the set of all regular parametric submodels as mentioned in Subsection 3.2. The derivative of the log-likelihood along this curve at $t=0$ can be written as:
\[
S_\gamma = \frac{d}{dt}\log p_{\gamma(t)}(Y,Q,X,Z)\Big|_{t=0}
= 2\left[\frac{\dot s_{\gamma,1}(Y\mid Q,X,Z)}{s(Y\mid Q,X,Z)} + \frac{\dot s_{\gamma,2}(Q\mid X,Z)}{s(Q\mid X,Z)} + \frac{\dot s_{\gamma,3}(X,Z)}{s(X,Z)}\right],
\]
and as a consequence the Fisher information for estimating $\alpha$ along this parametric submodel curve is:
\[
I(\gamma) = \|\dot s_\gamma\|^2_F = E\big(S(\gamma)^2\big)
= 4\,E\left[\left(\frac{\dot s_{\gamma,1}(Y\mid Q,X,Z)}{s(Y\mid Q,X,Z)}\right)^2 + \left(\frac{\dot s_{\gamma,2}(Q\mid X,Z)}{s(Q\mid X,Z)}\right)^2 + \left(\frac{\dot s_{\gamma,3}(X,Z)}{s(X,Z)}\right)^2\right]
\]
\[
= 4\,E\left[\left(\frac{\phi'(\epsilon)\big[-\dot\alpha S - X^\top\dot\beta - \dot b(\eta) + b'(\eta)Z^\top\dot\gamma\big]}{\phi(\epsilon)}\right)^2\right]
+ 4\,E\left[\left(\frac{\dot s_\eta(\eta) - s'_\eta(\eta)Z^\top\dot\gamma}{s_\eta(\eta)}\right)^2\right]
+ 4\,E\left[\left(\frac{\dot s_{X,Z}(X,Z)}{s_{X,Z}(X,Z)}\right)^2\right],
\]
where $\phi$ is the square root of the density of the standard Gaussian distribution, $s_\eta$ is the square root of the density of $\eta$, and $s_{X,Z}$ is the square root of the joint density of $(X,Z)$. The function $\dot s_\eta$ (resp. $\dot s_{X,Z}$) is defined as $(d/dt)s_{\eta,\gamma(t)}|_{t=0}$ (resp. $(d/dt)s_{X,Z,\gamma(t)}|_{t=0}$); similar definitions hold for $\dot\alpha,\dot\beta,\dot\gamma$, where we omit the subscript $\gamma$ for notational simplicity. The function $s'_\eta$ here denotes the derivative of $s_{\eta,0}(x)$ (the true data-generating density) with respect to $x$. Note that in the last equality we reparametrized $(Y,Q,X,Z)\to(\epsilon,\eta,X,Z)$, which is a bijection.
The Fisher inner product in $T(P)$ corresponding to two parametrizations $\gamma_1, \gamma_2$ can be expressed as:
\[
\langle\dot s_{\gamma_1},\dot s_{\gamma_2}\rangle_F
= 4\left\{E\left[\left(\frac{\phi'(\epsilon)}{\phi(\epsilon)}\right)^2\big\{-\dot\alpha^1 S - X^\top\dot\beta^1 - \dot b^1(\eta) + b'(\eta)Z^\top\dot\gamma^1\big\}\big\{-\dot\alpha^2 S - X^\top\dot\beta^2 - \dot b^2(\eta) + b'(\eta)Z^\top\dot\gamma^2\big\}\right]\right.
\]
\[
+ E\left[\left(\frac{\dot s^1_\eta(\eta)-s'_\eta(\eta)Z^\top\dot\gamma^1}{s_\eta(\eta)}\right)\left(\frac{\dot s^2_\eta(\eta)-s'_\eta(\eta)Z^\top\dot\gamma^2}{s_\eta(\eta)}\right)\right]
+ \left.E\left[\left\{\frac{\dot s^1_{X,Z}(X,Z)}{s_{X,Z}(X,Z)}\right\}\left\{\frac{\dot s^2_{X,Z}(X,Z)}{s_{X,Z}(X,Z)}\right\}\right]\right\},
\]
where the superscript $i\in\{1,2\}$ refers to the parametrization corresponding to $\gamma_i$. Our parameter of interest is $\theta(\gamma(t)) = \alpha_{\gamma(t)}$. Differentiating with respect to $t$ we obtain $\dot L(\dot s_\gamma) = \dot\alpha_\gamma := \dot\alpha$. Hence we need to find the representer $s^*$ such that
\[
\langle s^*, \dot s_\gamma\rangle_F = \dot\alpha \tag{A.15}
\]
for all $\gamma\in\mathcal R$. This further implies:
\[
\dot\alpha = 4\left\{E\left[\left(\frac{\phi'(\epsilon)}{\phi(\epsilon)}\right)^2\big\{-\dot\alpha S - X^\top\dot\beta - \dot b(\eta) + b'(\eta)Z^\top\dot\gamma\big\}\big\{-\alpha^* S - X^\top\beta^* - b^*(\eta) + b'(\eta)Z^\top\gamma^*\big\}\right]\right.
\]
\[
+ E\left[\left(\frac{\dot s_\eta(\eta)-s'_\eta(\eta)Z^\top\dot\gamma}{s_\eta(\eta)}\right)\left(\frac{s^*_\eta(\eta)-s'_\eta(\eta)Z^\top\gamma^*}{s_\eta(\eta)}\right)\right]
+ \left.E\left[\left\{\frac{\dot s_{X,Z}(X,Z)}{s_{X,Z}(X,Z)}\right\}\left\{\frac{s^*_{X,Z}(X,Z)}{s_{X,Z}(X,Z)}\right\}\right]\right\} \tag{A.16}
\]
for all $\gamma\in\mathcal R$, and the optimal asymptotic variance in estimating $\alpha$ is:
\[
\text{Optimal asymptotic variance} = \|s^*\|^2_F = \alpha^*.
\]
In the rest of the analysis we use equation (A.15) repeatedly for different choices of $\dot s_\gamma$ to obtain the value of $\alpha^*$. First, putting $\dot\alpha = \dot\beta = \dot\gamma = \dot b = \dot s_\eta = 0$ (as the zero vector is always in $T(P)$) we obtain:
\[
E\left[\left\{\frac{\dot s_{X,Z}(X,Z)}{s_{X,Z}(X,Z)}\right\}\left\{\frac{s^*_{X,Z}(X,Z)}{s_{X,Z}(X,Z)}\right\}\right] = 0 \quad \forall\ \dot s_{X,Z}.
\]
Hence we have $s^*_{X,Z}\equiv 0$. Thus we can modify equation (A.16) to obtain:
\[
\dot\alpha = 4\left\{E\left[\left(\frac{\phi'(\epsilon)}{\phi(\epsilon)}\right)^2\{-\dot\alpha S - X^\top\dot\beta-\dot b(\eta)+b'(\eta)Z^\top\dot\gamma\}\{-\alpha^*S - X^\top\beta^* - b^*(\eta) + b'(\eta)Z^\top\gamma^*\}\right]\right.
\]
\[
+ \left.E\left[\left(\frac{\dot s_\eta(\eta)-s'_\eta(\eta)Z^\top\dot\gamma}{s_\eta(\eta)}\right)\left(\frac{s^*_\eta(\eta)-s'_\eta(\eta)Z^\top\gamma^*}{s_\eta(\eta)}\right)\right]\right\}. \tag{A.17}
\]
Next we put $\dot\alpha = \dot\beta = \dot\gamma = \dot b = 0$ in equation (A.17) to obtain:
\[
E\left[\left\{\frac{\dot s_\eta(\eta)}{s_\eta(\eta)}\right\}\left(\frac{s^*_\eta(\eta)-s'_\eta(\eta)Z^\top\gamma^*}{s_\eta(\eta)}\right)\right] = 0 \quad \forall\ \dot s_\eta.
\]
As $Z$ is independent of $\eta$ and $E(Z^\top\gamma^*) = 0$, this reduces to $E\big[\{\dot s_\eta(\eta)/s_\eta(\eta)\}\{s^*_\eta(\eta)/s_\eta(\eta)\}\big] = 0$ for all $\dot s_\eta$, and we conclude that $s^*_\eta(\cdot)\equiv 0$. So, modifying equation (A.17), we get the following equation:
\[
\dot\alpha = 4\left\{E\left[\left(\frac{\phi'(\epsilon)}{\phi(\epsilon)}\right)^2\{-\dot\alpha S-X^\top\dot\beta-\dot b(\eta)+b'(\eta)Z^\top\dot\gamma\}\{-\alpha^*S-X^\top\beta^*-b^*(\eta)+b'(\eta)Z^\top\gamma^*\}\right] + \dot\gamma^\top E\left[\left(\frac{s'_\eta(\eta)}{s_\eta(\eta)}\right)^2\right]\Sigma_Z\gamma^*\right\}.
\]
Note that, since $\phi$ is the square root of the standard Gaussian density, $\phi'(\epsilon)/\phi(\epsilon) = -\epsilon/2$, so that
\[
4\,E\left[\left(\frac{\phi'(\epsilon)}{\phi(\epsilon)}\right)^2\right] = E\left[\left(\frac{d}{d\epsilon}\log f_\epsilon(\epsilon)\right)^2\right] = E(\epsilon^2) = 1,
\]
i.e. the outer factor of $4$ cancels in the first term.
Hence, defining $4\,E\big[(s'_\eta(\eta)/s_\eta(\eta))^2\big] = I_\eta$, the above equation becomes:
\[
\dot\alpha = E_{X,\eta}\Big[\{-\dot\alpha S - X^\top\dot\beta - \dot b(\eta) + b'(\eta)Z^\top\dot\gamma\}\{-\alpha^*S - X^\top\beta^* - b^*(\eta) + b'(\eta)Z^\top\gamma^*\}\Big] + I_\eta\,\dot\gamma^\top\Sigma_Z\gamma^*. \tag{A.18}
\]
Next we put $\dot\alpha = \dot\beta = \dot\gamma = 0$ in equation (A.18) to get:
\[
E\Big[\dot b(\eta)\big\{\alpha^*S + X^\top\beta^* + b^*(\eta) - b'(\eta)Z^\top\gamma^*\big\}\Big] = 0.
\]
Again, from the independence of $(X,Z)$ and $\eta$ (together with $E(X) = 0$ and $E(Z) = 0$) we have:
\[
E\Big[\dot b(\eta)\big\{\alpha^*S + b^*(\eta)\big\}\Big] = 0 \quad \forall\ \dot b.
\]
Hence it is immediately clear that the choice $b^*(\eta) = -\alpha^*E(S\mid\eta)$ satisfies this. Using this we modify equation (A.18) as below:
\[
E\Big[\big\{\dot\alpha S + X^\top\dot\beta - b'(\eta)Z^\top\dot\gamma\big\}\big\{\alpha^*(S - E(S\mid\eta)) + X^\top\beta^* - b'(\eta)Z^\top\gamma^*\big\}\Big] + I_\eta\,\dot\gamma^\top\Sigma_Z\gamma^* = \dot\alpha. \tag{A.19}
\]
Next, putting $\dot\alpha = \dot\beta = 0$ in equation (A.19) we get:
\[
E\Big[\big\{-b'(\eta)Z^\top\dot\gamma\big\}\big\{\alpha^*(S-E(S\mid\eta)) + X^\top\beta^* - b'(\eta)Z^\top\gamma^*\big\}\Big] + I_\eta\,\dot\gamma^\top\Sigma_Z\gamma^* = 0, \tag{A.20}
\]
which further implies:
\[
\dot\gamma^\top\big[\alpha^*E_\eta\{-b'(\eta)E(ZS\mid\eta)\}\big] + E\{-b'(\eta)\}\,\dot\gamma^\top\Sigma_{ZX}\beta^* + \big[E\{(b'(\eta))^2\} + I_\eta\big]\dot\gamma^\top\Sigma_Z\gamma^* = 0
\]
\[
\Longrightarrow\ \alpha^*\dot\gamma^\top v_1 + c_1\,\dot\gamma^\top\Sigma_{ZX}\beta^* + c_2\,\dot\gamma^\top\Sigma_Z\gamma^* = 0
\ \Longrightarrow\ \alpha^* v_1 + c_1\Sigma_{ZX}\beta^* + c_2\Sigma_Z\gamma^* = 0. \tag{A.21}
\]
Here $v_1 = E_\eta\{-b'(\eta)E(ZS\mid\eta)\}$, $c_1 = E\{-b'(\eta)\}$ and $c_2 = E\{(b'(\eta))^2\} + I_\eta$. Now equation (A.19) is modified to:
\[
E\Big[\big\{\dot\alpha S + X^\top\dot\beta\big\}\big\{\alpha^*(S-E(S\mid\eta)) + X^\top\beta^* - b'(\eta)Z^\top\gamma^*\big\}\Big] = \dot\alpha. \tag{A.22}
\]
Putting $\dot\alpha = 0$ in equation (A.22) we obtain:
\[
E\Big[\big\{X^\top\dot\beta\big\}\big\{\alpha^*(S-E(S\mid\eta)) + X^\top\beta^* - b'(\eta)Z^\top\gamma^*\big\}\Big] = 0
\]
\[
\Longrightarrow\ \dot\beta^\top\big[\alpha^* E(XS)\big] + \dot\beta^\top\Sigma_X\beta^* + E(-b'(\eta))\,\dot\beta^\top\Sigma_{XZ}\gamma^* = 0
\ \Longrightarrow\ \alpha^* v_2 + \Sigma_X\beta^* + c_1\Sigma_{XZ}\gamma^* = 0, \tag{A.23}
\]
where $v_2 = E(XS)$. Hence equation (A.22) is modified to:
\[
E\big[\{\dot\alpha S\}\big\{\alpha^*(S-E(S\mid\eta)) + X^\top\beta^* - b'(\eta)Z^\top\gamma^*\big\}\big] = \dot\alpha, \tag{A.24}
\]
which implies:
\[
E\big[S\big\{\alpha^*(S-E(S\mid\eta)) + X^\top\beta^* - b'(\eta)Z^\top\gamma^*\big\}\big] = 1
\]
\[
\Longrightarrow\ \alpha^* E_\eta\{\mathrm{var}(S\mid\eta)\} + E(SX^\top)\beta^* + E_\eta\{-b'(\eta)E(SZ^\top\mid\eta)\}\gamma^* = 1
\ \Longrightarrow\ \alpha^* c_0 + v_2^\top\beta^* + v_1^\top\gamma^* = 1, \tag{A.25}
\]
where $c_0 = E_\eta\{\mathrm{var}(S\mid\eta)\}$. Finally we have three unknowns $(\alpha^*,\beta^*,\gamma^*)$ and three equations ((A.21), (A.23) and (A.25)), which we solve to get the value of $\alpha^*$. For the convenience of the reader we rewrite these equations here:
\[
\alpha^* v_1 + c_1\Sigma_{ZX}\beta^* + c_2\Sigma_Z\gamma^* = 0 \ \in\mathbb R^{p_2}, \tag{A.26}
\]
\[
\alpha^* v_2 + \Sigma_X\beta^* + c_1\Sigma_{XZ}\gamma^* = 0 \ \in\mathbb R^{p_1}, \tag{A.27}
\]
\[
\alpha^* c_0 + v_2^\top\beta^* + v_1^\top\gamma^* = 1 \ \in\mathbb R, \tag{A.28}
\]
where $Z\in\mathbb R^{p_2}$ and $X\in\mathbb R^{p_1}$. These three equations can be written in matrix form as:
\[
\begin{bmatrix} c_0 & v_2^\top & v_1^\top\\ v_2 & \Sigma_X & c_1\Sigma_{XZ}\\ v_1 & c_1\Sigma_{ZX} & c_2\Sigma_Z \end{bmatrix}
\begin{bmatrix}\alpha^*\\ \beta^*\\ \gamma^*\end{bmatrix} = e_1
\ \Longrightarrow\
\alpha^* = e_1^\top\begin{bmatrix} c_0 & v_2^\top & v_1^\top\\ v_2 & \Sigma_X & c_1\Sigma_{XZ}\\ v_1 & c_1\Sigma_{ZX} & c_2\Sigma_Z\end{bmatrix}^{-1}e_1
= e_1^\top\begin{bmatrix} c_0 & v_2^\top & -v_1^\top\\ v_2 & \Sigma_X & -c_1\Sigma_{XZ}\\ -v_1 & -c_1\Sigma_{ZX} & c_2\Sigma_Z\end{bmatrix}^{-1}e_1
\]
\[
= e_1^\top\begin{bmatrix} E(\mathrm{var}(S\mid\eta)) & E(SX^\top) & E(S\,b'(\eta)Z^\top)\\ E(SX) & \Sigma_X & E(b'(\eta)XZ^\top)\\ E(S\,b'(\eta)Z) & E(b'(\eta)ZX^\top) & \Sigma_Z\big(I_\eta + E\{(b'(\eta))^2\}\big)\end{bmatrix}^{-1}e_1 \tag{A.29}
\]
\[
=: e_1^\top\,\Omega^{-1}e_1. \tag{A.30}
\]

Remark A.2. Note that if $b$ is a linear function (which happens if $(\epsilon,\eta)$ is generated from a bivariate normal distribution with correlation $\rho$), then $b'$ is a constant function. Hence the second term in the expression of the efficient information vanishes and we get:
\[
I_{\mathrm{eff}} = \Big[E_\eta\big(\mathrm{var}(S\mid\eta)\big) - E(SX^\top)\Sigma_X^{-1}E(SX)\Big],
\]
which is the same as the efficient information of the partially linear model. Hence, one may think of the second term as the price we pay for the non-linearity of $b$.
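To make the closed form (A.29)–(A.30) concrete, the following is a minimal numerical sketch (not part of the paper's method) that approximates $\Omega$ by Monte Carlo under a hypothetical Gaussian design — $\eta\sim\mathcal N(0,1)$ (so $I_\eta = 1$), $Z$ standard Gaussian, $X$ correlated with $Z$ but independent of $\eta$, and a user-chosen $b'$ — and then evaluates the bound $\alpha^* = e_1^\top\Omega^{-1}e_1$. All dimensions, the choice of $\gamma$, and the choice of $b'$ below are illustrative assumptions of ours.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p1, p2, n = 2, 3, 200_000            # illustrative sizes (our choices)
gamma = np.array([0.5, -0.3, 0.2])   # hypothetical gamma

def bprime(e):                       # hypothetical b'; any smooth choice works
    return np.tanh(e)

eta = rng.standard_normal(n)                         # eta ~ N(0,1) => I_eta = 1
Z = rng.standard_normal((n, p2))
X = 0.5 * Z[:, :p1] + rng.standard_normal((n, p1))   # X correlated with Z
S = (Z @ gamma + eta >= 0).astype(float)

p_eta = norm.cdf(eta / np.linalg.norm(gamma))        # P(S=1 | eta) in this toy design
c0 = np.mean(p_eta * (1.0 - p_eta))                  # E[var(S | eta)]
bp = bprime(eta)
v2  = (S[:, None] * X).mean(axis=0)                                      # E(S X)
u   = ((S * bp)[:, None] * Z).mean(axis=0)                               # E(S b'(eta) Z)
XZb = (bp[:, None, None] * X[:, :, None] * Z[:, None, :]).mean(axis=0)   # E(b'(eta) X Z^T)
Sigma_X, Sigma_Z = X.T @ X / n, Z.T @ Z / n
I_eta = 1.0                                          # Fisher information of N(0,1)

Omega = np.block([
    [np.array([[c0]]), v2[None, :], u[None, :]],
    [v2[:, None],      Sigma_X,     XZb],
    [u[:, None],       XZb.T,       Sigma_Z * (I_eta + np.mean(bp**2))],
])
e1 = np.zeros(1 + p1 + p2); e1[0] = 1.0
alpha_star = float(e1 @ np.linalg.solve(Omega, e1))  # the bound of (A.29)-(A.30)
print(f"efficiency bound alpha* ~= {alpha_star:.4f}")
```

Here the independence of $(X,Z)$ from $\eta$ is what licenses the factorized entries (e.g. $E(b'(\eta)XZ^\top)$ computed as a plain sample mean), mirroring the derivation above.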
A.3 Proof of Proposition 2.1

Recall that our model can be written as:
\[
Y_i = \alpha S_i + X_i^\top\beta + b(\eta_i) + \epsilon_i = \alpha S_i + X_i^\top\beta + b(\hat\eta_i) + R_{1,i} + \epsilon_i,
\]
where $R_{1,i} = b(\eta_i) - b(\hat\eta_i)$ is the residual incurred by approximating $\eta_i$ by $\hat\eta_i$. For notational simplicity, we absorb $S_i$ into $X_i$ (and $\alpha$ into $\beta$) and write:
\[
Y_i = X_i^\top\beta + b(\hat\eta_i) + R_{1,i} + \epsilon_i. \tag{A.31}
\]
As mentioned in Section 2, we only need to estimate the derivative of the mean function $b$ on the interval $[-\tau,\tau]$, and consequently we consider only those observations for which $|\hat\eta_i|\le\tau$. We use the scaled B-spline basis $\tilde N_K$ to estimate $b$ non-parametrically on the interval $[-\tau,\tau]$; for more details on the B-spline basis and its scaled version, see Subsection C.2. Define a vector $\omega_{b,\infty,n}$ as:
\[
\omega_{b,\infty,n} = \arg\min_{\omega\in\mathbb R^{K+2}}\sup_{|x|\le\tau}\big|b(x) - \tilde N_K(x)^\top\omega\big|.
\]
By applying Theorem C.3 we conclude:
\[
\big\|b(x) - \tilde N_K(x)^\top\omega_{b,\infty,n}\big\|_{\infty,[-\tau,\tau]} \le C\left(\frac{\tau}{K}\right)^{3}\|b'''\|_{\infty,[-\tau,\tau]}, \tag{A.32}
\]
\[
\big\|b'(x) - \nabla\tilde N_K(x)^\top\omega_{b,\infty,n}\big\|_{\infty,[-\tau,\tau]} \le C\left(\frac{\tau}{K}\right)^{2}\|b'''\|_{\infty,[-\tau,\tau]}, \tag{A.33}
\]
where $\nabla\tilde N_K(x)$ is the vector of derivatives of the coordinates of the basis functions in $\tilde N_K(x)$. Using this spline approximation we further expand equation (A.31):
\[
Y_i = X_i^\top\beta + \tilde N_K(\hat\eta_i)^\top\omega_{b,\infty,n} + R_{1,i} + R_{2,i} + \epsilon_i, \tag{A.34}
\]
where $R_{2,i} = b(\hat\eta_i) - \tilde N_K(\hat\eta_i)^\top\omega_{b,\infty,n}$ is the spline approximation error. To estimate $\omega_{b,\infty,n}$ we first estimate $\beta$ as:
\[
\hat\beta = \big(X^\top\mathrm{proj}^\perp_{\tilde N_K}X\big)^{-1}X^\top\mathrm{proj}^\perp_{\tilde N_K}Y, \tag{A.35}
\]
and then estimate $\omega_{b,\infty,n}$ as:
\[
\hat\omega_{b,\infty,n} = \big(\tilde N_K^\top\tilde N_K\big)^{-1}\tilde N_K^\top\big(Y - X\hat\beta\big), \tag{A.36}
\]
and consequently we set $\hat b(x) = \tilde N_K(x)^\top\hat\omega_{b,\infty,n}$ and $\hat b'(x) = \nabla\tilde N_K(x)^\top\hat\omega_{b,\infty,n}$. The estimation error of $b'$ by $\hat b'$ is then bounded as follows:
\[
\big|\hat b'(x) - b'(x)\big| = \big|\nabla\tilde N_K(x)^\top\hat\omega_{b,\infty,n} - b'(x)\big|
\le \big|\nabla\tilde N_K(x)^\top(\hat\omega_{b,\infty,n}-\omega_{b,\infty,n})\big| + \big|\nabla\tilde N_K(x)^\top\omega_{b,\infty,n} - b'(x)\big|
\]
\[
\le \big\|\nabla\tilde N_K(x)\big\|\,\|\hat\omega_{b,\infty,n}-\omega_{b,\infty,n}\| + \sup_{|t|\le\tau}\big|\nabla\tilde N_K(t)^\top\omega_{b,\infty,n} - b'(t)\big|
\lesssim K\sqrt K\,\|\hat\omega_{b,\infty,n}-\omega_{b,\infty,n}\| + C\left(\frac{\tau}{K}\right)^{2}\|b'''\|_{\infty,[-\tau,\tau]}, \tag{A.37}
\]
where the last inequality follows from the fact that $\|\nabla\tilde N_K(x)\|\lesssim K\sqrt K$ (see Lemma C.7) and equation (A.33).
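The two-stage estimator in (A.35)–(A.36) is an ordinary partialling-out regression, which can be sketched in a few lines of numpy. The sketch below is illustrative only: for simplicity it uses a truncated-power cubic spline basis (and its derivative) in place of the scaled B-spline basis of Subsection C.2, and all function and variable names are ours, not the paper's.

```python
import numpy as np

def spline_basis(x, knots):
    """Truncated-power cubic spline basis and its derivative; an illustrative
    stand-in for the scaled B-spline basis of Subsection C.2."""
    cols  = [np.ones_like(x), x, x**2, x**3]
    dcols = [np.zeros_like(x), np.ones_like(x), 2 * x, 3 * x**2]
    for k in knots:
        cols.append(np.clip(x - k, 0, None)**3)
        dcols.append(3 * np.clip(x - k, 0, None)**2)
    return np.column_stack(cols), np.column_stack(dcols)

def fit_b_prime(Y, Xmat, eta_hat, tau=2.0, K=10):
    """Estimate beta and b'(.) as in (A.35)-(A.36): partial the spline basis
    out of (Y, X) to get beta-hat, then regress Y - X beta-hat on the basis."""
    keep = np.abs(eta_hat) <= tau                      # keep |eta_hat| <= tau
    Y, Xmat, e = Y[keep], Xmat[keep], eta_hat[keep]
    knots = np.linspace(-tau, tau, K)[1:-1]            # interior knots
    N, _ = spline_basis(e, knots)
    resid = lambda V: V - N @ np.linalg.lstsq(N, V, rcond=None)[0]  # proj-perp
    beta_hat = np.linalg.lstsq(resid(Xmat), resid(Y), rcond=None)[0]    # (A.35)
    omega_hat = np.linalg.lstsq(N, Y - Xmat @ beta_hat, rcond=None)[0]  # (A.36)
    b_prime_hat = lambda x: spline_basis(x, knots)[1] @ omega_hat
    return beta_hat, b_prime_hat

# toy usage on simulated data: b = sin, so b' = cos
rng = np.random.default_rng(1)
n = 5000
eta_hat = rng.uniform(-2.5, 2.5, n)
Xmat = rng.standard_normal((n, 3))
Y = Xmat @ np.array([1.0, -0.5, 0.25]) + np.sin(eta_hat) + 0.1 * rng.standard_normal(n)
beta_hat, b_prime_hat = fit_b_prime(Y, Xmat, eta_hat)
print(np.round(beta_hat, 3), float(b_prime_hat(np.array([0.0]))[0]))  # ~ cos(0) = 1
```

The projection $\mathrm{proj}^\perp_{\tilde N_K}$ is computed here via least squares rather than by forming the projection matrix explicitly, which is the standard numerically stable choice.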
We now relate $\hat\omega_{b,\infty,n}$ to $\omega_{b,\infty,n}$ using equation (A.34):
\[
\hat\omega_{b,\infty,n} = \big(\tilde N_K^\top\tilde N_K\big)^{-1}\tilde N_K^\top\big(Y - X\hat\beta\big)
= \omega_{b,\infty,n} + \left(\frac{\tilde N_K^\top\tilde N_K}{n}\right)^{-1}\frac{\tilde N_K^\top X(\hat\beta-\beta)}{n}
+ \left(\frac{\tilde N_K^\top\tilde N_K}{n}\right)^{-1}\left[\frac{\tilde N_K^\top R_1}{n} + \frac{\tilde N_K^\top R_2}{n} + \frac{\tilde N_K^\top\epsilon}{n}\right]
\]
\[
=: \omega_{b,\infty,n} + T_1 + \left(\frac{\tilde N_K^\top\tilde N_K}{n}\right)^{-1}(T_2 + T_3 + T_4). \tag{A.38}
\]
The rest of the proof is devoted to showing $\|\hat\omega_{b,\infty,n}-\omega_{b,\infty,n}\| = o_p\big(K^{-3/2}\big)$.

A.3.1 Bounding T_1

To bound $T_1$, we first bound $\|\hat\beta-\beta\|$ using the following lemma:

Lemma A.3. Under Assumptions 3.1–3.4 we have $\|\hat\beta-\beta\| = o_p\big(K^{-3/2}\big)$.

We next show that $\|(\tilde N_K^\top\tilde N_K/n)^{-1}\|_{op}$ is bounded in probability. Using a conditional version of Theorem C.6 we have:
\[
E\left[\left\|\frac{\tilde N_K^\top\tilde N_K}{n} - E\big(\tilde N_K(\hat\eta)\tilde N_K(\hat\eta)^\top\mathbf 1_{|\hat\eta|\le\tau}\ \big|\ \mathcal F(D_1)\big)\right\|_{op}\ \Big|\ \mathcal F(D_1)\right]
\le C\left(\frac{K\log K}{n} + \sqrt{\frac{K\log K}{n}}\right).
\]
As the bound on the right-hand side does not depend on $\mathcal F(D_1)$, taking expectations on both sides we conclude:
\[
E\left[\left\|\frac{\tilde N_K^\top\tilde N_K}{n} - E\big(\tilde N_K(\hat\eta)\tilde N_K(\hat\eta)^\top\mathbf 1_{|\hat\eta|\le\tau}\ \big|\ \mathcal F(D_1)\big)\right\|_{op}\right]
\le C\left(\frac{K\log K}{n} + \sqrt{\frac{K\log K}{n}}\right). \tag{A.39}
\]
Note that, with $a_n = \log n/\sqrt n$, we can write:
\[
E\big(\tilde N_K(\hat\eta)\tilde N_K(\hat\eta)^\top\mathbf 1_{|\hat\eta|\le\tau}\mid\mathcal F(D_1)\big)
= E\big(\cdots\mid\mathcal F(D_1)\big)\mathbf 1_{\|\hat\gamma_n-\gamma\|\le a_n}
+ E\big(\cdots\mid\mathcal F(D_1)\big)\mathbf 1_{\|\hat\gamma_n-\gamma\|>a_n}.
\]
As both matrices on the right-hand side are p.s.d., we conclude:
\[
\lambda_{\min}\Big(E\big(\tilde N_K(\hat\eta)\tilde N_K(\hat\eta)^\top\mathbf 1_{|\hat\eta|\le\tau}\mid\mathcal F(D_1)\big)\Big)
\ge \lambda_{\min}\Big(E\big(\tilde N_K(\hat\eta)\tilde N_K(\hat\eta)^\top\mathbf 1_{|\hat\eta|\le\tau}\mid\mathcal F(D_1)\big)\Big)\mathbf 1_{\|\hat\gamma_n-\gamma\|\le a_n}.
\]
Choose $\delta>0$ such that $P(\|Z\|\le\delta)>0$. Now, from Theorem C.4:
\[
\lambda_{\min}\Big(E\big(\tilde N_K(\hat\eta)\tilde N_K(\hat\eta)^\top\mathbf 1_{|\hat\eta|\le\tau}\mid\mathcal F(D_1)\big)\Big)\mathbf 1_{\|\hat\gamma_n-\gamma\|\le a_n}
\ge \kappa\min_{|x|\le\tau}f_{\eta+(\gamma-\hat\gamma_n)^\top Z}(x)\,\mathbf 1_{\|\hat\gamma_n-\gamma\|\le a_n}
\ge \kappa\min_{|x|\le\tau,\ \|a\|\le a_n}f_{\eta+a^\top Z}(x)\,\mathbf 1_{\|\hat\gamma_n-\gamma\|\le a_n}
\]
\[
= \kappa\min_{|x|\le\tau,\ \|a\|\le a_n}\int_{\mathbb R^{p_2}}f_\eta(x-a^\top z)f_Z(z)\,dz\,\mathbf 1_{\|\hat\gamma_n-\gamma\|\le a_n}
\ge \kappa\min_{|x|\le\tau,\ \|a\|\le a_n}\int_{\|z\|\le\delta}f_\eta(x-a^\top z)f_Z(z)\,dz\,\mathbf 1_{\|\hat\gamma_n-\gamma\|\le a_n}
\ge \kappa\min_{|x|\le\tau+a_n\delta}f_\eta(x)\,P(\|Z\|\le\delta)\,\mathbf 1_{\|\hat\gamma_n-\gamma\|\le a_n}.
\]
Now for large $n$ we have $a_n\delta\le\xi$ (where $\xi$ is as defined in part (iv) of Assumption 3.2), and hence for all large $n$:
\[
\lambda_{\min}\Big(E\big(\tilde N_K(\hat\eta)\tilde N_K(\hat\eta)^\top\mathbf 1_{|\hat\eta|\le\tau}\mid\mathcal F(D_1)\big)\Big)\mathbf 1_{\|\hat\gamma_n-\gamma\|\le a_n}
\ge \kappa\min_{|x|\le\tau+\xi}f_\eta(x)\,P(\|Z\|\le\delta)\,\mathbf 1_{\|\hat\gamma_n-\gamma\|\le a_n}. \tag{A.40}
\]
From equations (A.39) and (A.40) we conclude:
\[
\left\|\left(\frac{\tilde N_K^\top\tilde N_K}{n}\right)^{-1}\right\|_{op}
= \lambda_{\min}\left(\frac{\tilde N_K^\top\tilde N_K}{n}\right)^{-1} = O_p(1). \tag{A.41}
\]
Now, going back to $T_1$ in equation (A.38), we have:
\[
\|T_1\| = \left\|\left(\frac{\tilde N_K^\top\tilde N_K}{n}\right)^{-1}\frac{\tilde N_K^\top X(\hat\beta-\beta)}{n}\right\|
\le \left\|\left(\frac{\tilde N_K^\top\tilde N_K}{n}\right)^{-1/2}\right\|_{op}\left\|\left(\frac{\tilde N_K^\top\tilde N_K}{n}\right)^{-1/2}\frac{\tilde N_K^\top X(\hat\beta-\beta)}{n}\right\|
\]
\[
\lesssim_P \left\|\frac{X(\hat\beta-\beta)}{\sqrt n}\right\|
\le \|\hat\beta-\beta\|\sqrt{\lambda_{\max}(X^\top X/n)}
= o_p\big(K^{-3/2}\big) \quad [\text{by Lemma A.3}],
\]
where the middle step uses $\|(\tilde N_K^\top\tilde N_K/n)^{-1/2}\tilde N_K^\top v/n\| = \|\mathrm{proj}_{\tilde N_K}v\|/\sqrt n\le\|v\|/\sqrt n$.
A.3.2 Bounding T_2

\[
\|T_2\| = \left\|\frac{\tilde N_K^\top R_1}{n}\right\|
\le \left\|\frac{R_1}{\sqrt n}\right\|\sqrt{\lambda_{\max}\left(\frac{\tilde N_K^\top\tilde N_K}{n}\right)}.
\]
It was already proved in equation (A.41) that $\lambda_{\max}(\tilde N_K^\top\tilde N_K/n) = O_p(1)$. To control $\|R_1\|/\sqrt n$:
\[
E\left(\frac{\|R_1\|^2}{n}\ \Big|\ \mathcal F(D_1)\right)
= E\big((b(\eta)-b(\hat\eta))^2\mid\mathcal F(D_1)\big)
= E\Big(\big(b'(\eta)(\hat\eta-\eta) + \tfrac12 b''(\tilde\eta)(\hat\eta-\eta)^2\big)^2\ \Big|\ \mathcal F(D_1)\Big)
\]
\[
\le 2\,\|\hat\gamma_n-\gamma\|^2\,E\big((b'(\eta)\|Z\|)^2\big) + \tfrac12\|b''\|^2_\infty\|\hat\gamma_n-\gamma\|^4\,E\big(\|Z\|^4\big) = O_p(n^{-1}).
\]
Hence $\|R_1/\sqrt n\| = O_p(n^{-1/2}) = o_p\big(K^{-3/2}\big)$, where the last equality follows from Remark 2.2.

A.3.3 Bounding T_3

\[
\|T_3\| = \left\|\frac{\tilde N_K^\top R_2}{n}\right\|
\le \left\|\frac{R_2}{\sqrt n}\right\|\sqrt{\lambda_{\max}\left(\frac{\tilde N_K^\top\tilde N_K}{n}\right)}.
\]
By the same logic used in bounding $T_2$, all we need is a bound on $\|R_2\|/\sqrt n$. It is immediate that:
\[
\frac{\|R_2\|}{\sqrt n} = \sqrt{\frac{\|R_2\|^2}{n}}
\le \sup_{|t|\le\tau}\big|b(t)-\tilde N_K(t)^\top\omega_{b,\infty,n}\big|
\le C\left(\frac{\tau}{K}\right)^{3}\|b'''\|_\infty = o_p\big(K^{-3/2}\big).
\]

A.3.4 Bounding T_4

\[
E\left(\left\|\frac{\tilde N_K^\top\epsilon}{n}\right\|^2\ \Big|\ \mathcal F(D_1)\right)
= \frac{1}{n^2}E\big[\epsilon^\top\tilde N_K\tilde N_K^\top\epsilon\mid\mathcal F(D_1)\big]
= \frac{1}{n^2}\mathrm{tr}\Big(E\big[E\big(\epsilon\epsilon^\top\mid\mathcal F(Z,\eta,D_1)\big)\tilde N_K\tilde N_K^\top\ \big|\ \mathcal F(D_1)\big]\Big)
\]
\[
\le \frac{K+3}{n}\,\sup_\eta\mathrm{var}(\epsilon\mid\eta)\,\lambda_{\max}\Big(E\big[\tilde N_K(\hat\eta)\tilde N_K(\hat\eta)^\top\mathbf 1_{|\hat\eta|\le\tau}\mid\mathcal F(D_1)\big]\Big)
= O_p\left(\frac{K}{n}\right) = o_p\big(K^{-3}\big) \quad [\text{Remark 2.2}],
\]
where $\eta$ here denotes $\{\eta_i\}_{i=1}^{n/3}$ in $D_2$ and the penultimate inequality follows from Lemma C.1. These bounds establish $\|\hat\omega_{b,\infty,n}-\omega_{b,\infty,n}\| = o_p\big(K^{-3/2}\big)$, which completes the proof.

B Proof of supplementary lemmas

B.1 Proof of Lemma A.1

The definition of the function $g$ is as follows:
\[
g(a,t) = E\left[\begin{pmatrix}S\\ X\\ b'(\eta)Z\end{pmatrix}\ \Big|\ \eta + a^\top Z = t\right].
\]
Note that $g(a,t)\in\mathbb R^{1+p_1+p_2}$. Divide the components of $g$ into three parts as follows:

1. $g_1(a,t) = E\big[S\mid\eta+a^\top Z = t\big]$.
2. $g_2(a,t) = E\big[X\mid\eta+a^\top Z = t\big]$.
3. $g_3(a,t) = E\big[b'(\eta)Z\mid\eta+a^\top Z = t\big]$.

If we prove the continuity of the partial derivatives of $g_1, g_2, g_3$ separately, then we are done. We start with $g_1$. For fixed $a$ (i.e. we consider the partial derivative with respect to $t$):
\[
g_1(a,t) = E\big[S\mid\eta+a^\top Z = t\big]
= \frac{\int_{z:\,t>(a+\gamma)^\top z}f_Z(z)f_\eta(t-a^\top z)\,dz}{f_{\eta+Z^\top a}(t)}
=: \frac{\int_{\Omega(t)}f_Z(z)f_\eta(t-a^\top z)\,dz}{f_{\eta+Z^\top a}(t)}.
\]
By using the Leibniz rule for differentiating an integral over a varying domain, we immediately conclude that $\partial_t g_1(a,t)$ is continuous.
The calculation for $\partial_a g_1(a,t)$ is similar and hence skipped for brevity.

For $g_2$, define $h(Z) = E(X\mid Z)$. Then we have:
\[
g_2(a,t) = E\big[h(Z)\mid\eta+a^\top Z = t\big]
= \frac{\int h(z)f_Z(z)f_\eta(t-a^\top z)\,dz}{\int f_Z(z)f_\eta(t-a^\top z)\,dz}.
\]
That $g_2$ is continuous with respect to both $a$ and $t$ follows from the dominated convergence theorem together with the facts that $E(\|h(Z)\|)<\infty$ (Assumption 3.4) and that $\|f_\eta\|_\infty$ is finite (Assumption 3.2). The differentiability and the continuity of the derivative also follow from the differentiability of $f_\eta$ together with $E(\|Z\|\,\|h(Z)\|)<\infty$ and $E(\|Z\|)<\infty$; differentiation under the integral sign is allowed as $E(\|h(Z)\|)<\infty$.

Finally, for $g_3$, define $\tilde h(Z) = b'(t-a^\top Z)Z$. Then we have:
\[
g_3(a,t) = E\big[\tilde h(Z)\mid\eta+a^\top Z = t\big].
\]
By the same logic as for $g_2$ (i.e. using $E\big(\|Z\|\,|b'(t-a^\top Z)|\big)<\infty$, $E\big(\|Z\|^2|b'(t-a^\top Z)|\big)<\infty$ and $\|f_\eta\|_\infty<\infty$), our conclusion follows. Finally, the continuity of $V(a,t)$ follows directly from the continuity of the density of $\eta$ (Assumption 3.2) and of $(X,Z)$ (Assumption 3.4).

B.2 Proof of Lemma A.3

Recall that $\beta$ in this proof is $(\alpha,\beta^\top)^\top$ and $X = (S,X)\in\mathbb R^{n\times(1+p_1)}$, as mentioned at the beginning of the proof of Proposition 2.1. From the definition of $\hat\beta$ (equation (A.35)):
\[
\hat\beta = \big(X^\top\mathrm{proj}^\perp_{\tilde N_K}X\big)^{-1}X^\top\mathrm{proj}^\perp_{\tilde N_K}Y
= \left(\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}X}{n}\right)^{-1}\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}Y}{n}
\]
\[
= \beta + \left(\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}X}{n}\right)^{-1}\left[\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}R_1}{n} + \frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}R_2}{n} + \frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}\epsilon}{n}\right].
\]
We divide the entire proof into a few steps, which we first articulate:

1. First we show that $X^\top\mathrm{proj}^\perp_{\tilde N_K}X/n = O_p(1)$. More specifically, we show that
\[
\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}X}{n} \xrightarrow{\;P\;} \frac13\,E\big[(X-E(X\mid\eta))(X-E(X\mid\eta))^\top\mathbf 1_{|\eta|\le\tau}\big],
\]
which along with Assumption 3.3 implies that $(X^\top\mathrm{proj}^\perp_{\tilde N_K}X/n)^{-1} = O_p(1)$.

2. Next we show that the residual terms are negligible:
\[
\left(\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}X}{n}\right)^{-1}\left(\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}(R_1+R_2)}{n}\right) = o_p\big(K^{-3/2}\big).
\]

3. Finally we show that $X^\top\mathrm{proj}^\perp_{\tilde N_K}\epsilon/n = o_p\big(K^{-3/2}\big)$. This will complete the proof.

B.2.1 Proof of Step 1

Recall the definition of $W^*$ from Subsection A.1.1. It then follows immediately that:
\[
X = W^*\begin{bmatrix}e_1 & e_2 & \cdots & e_{1+p_1}\end{bmatrix} := W^*A,
\]
where $e_i$ is the $i$-th canonical basis vector of $\mathbb R^{1+p_1+p_2}$. Hence we have:
\[
\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}X}{n} = A^\top\,\frac{W^{*\top}\mathrm{proj}^\perp_{\tilde N_K}W^*}{n}\,A.
\]
As established in Subsection A.1.1 (see equation (A.5)),
\[
\frac{W^{*\top}\mathrm{proj}^\perp_{\tilde N_K}W^*}{n} \xrightarrow{\;P\;} \frac13\,\Omega_\tau,
\]
we conclude that:
\[
\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}X}{n} \xrightarrow{\;P\;} \frac13\,A^\top\Omega_\tau A = \frac13\,E\Big[(X-E(X\mid\eta))(X-E(X\mid\eta))^\top\mathbf 1_{|\eta|\le\tau}\Big]. \tag{B.1}
\]

B.2.2 Proof of Step 2

For the first residual term observe that:
\[
\left\|\left(\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}X}{n}\right)^{-1}\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}R_1}{n}\right\|
\le \left\|\left(\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}X}{n}\right)^{-1/2}\right\|_{op}\left\|\left(\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}X}{n}\right)^{-1/2}\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}R_1}{n}\right\|
\]
\[
\le \left\|\left(\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}X}{n}\right)^{-1/2}\right\|_{op}\left\|\frac{R_1}{\sqrt n}\right\| = o_p\big(K^{-3/2}\big). \tag{B.2}
\]
The last equality follows from the fact that the first factor of the product is $O_p(1)$, while the second factor $\|R_1\|/\sqrt n$ is $o_p\big(K^{-3/2}\big)$ because:
\[
\frac{R_1^\top R_1}{n} = \frac1n\sum_{i=1}^{n/3}\big(b(\eta_i)-b(\hat\eta_i)\big)^2\mathbf 1_{|\hat\eta_i|\le\tau}
\le \frac2n\sum_{i=1}^{n/3}(\eta_i-\hat\eta_i)^2\big(b'(\hat\eta_i)\big)^2\mathbf 1_{|\hat\eta_i|\le\tau} + \frac1{2n}\sum_{i=1}^{n/3}(\eta_i-\hat\eta_i)^4\big(b''(\tilde\eta_i)\big)^2\mathbf 1_{|\hat\eta_i|\le\tau}
\]
\[
\le 2\,\|b'\|^2_{\infty,[-\tau,\tau]}\,\|\hat\gamma_n-\gamma\|^2\,\frac1n\sum_{i=1}^{n/3}\|Z_i\|^2 + \frac12\|b''\|^2_\infty\|\hat\gamma_n-\gamma\|^4\,\frac1n\sum_{i=1}^{n/3}\|Z_i\|^4
= O_p(n^{-1}) + O_p(n^{-2}) = o_p\big(K^{-3}\big) \quad [\text{Remark 2.2}].
\]
For the other residual, the same calculation first yields:
\[
\left\|\left(\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}X}{n}\right)^{-1}\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}R_2}{n}\right\|
\le \left\|\left(\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}X}{n}\right)^{-1/2}\right\|_{op}\left\|\frac{R_2}{\sqrt n}\right\|, \tag{B.3}
\]
and $\|R_2\|/\sqrt n = o_p\big(K^{-3/2}\big)$ follows directly from equation (A.32).

B.2.3 Proof of Step 3

Finally we show that $X^\top\mathrm{proj}^\perp_{\tilde N_K}\epsilon/n = o_p\big(K^{-3/2}\big)$, which completes the proof. Recall that in Subsection A.1.1 we established that $W^{*\top}\mathrm{proj}^\perp_{\tilde N_K}\epsilon/\sqrt n = O_p(1)$. This immediately implies:
\[
\frac{W^{*\top}\mathrm{proj}^\perp_{\tilde N_K}\epsilon}{n} = O_p(n^{-1/2}),
\]
which in turn implies:
\[
\frac{X^\top\mathrm{proj}^\perp_{\tilde N_K}\epsilon}{n} = A^\top\,\frac{W^{*\top}\mathrm{proj}^\perp_{\tilde N_K}\epsilon}{n} = O_p(n^{-1/2}) = o_p\big(K^{-3/2}\big) \quad [\text{Remark 2.2}]. \tag{B.4}
\]
Combining equations (B.1), (B.2), (B.3) and (B.4) we conclude $\|\hat\beta-\beta\| = o_p\big(K^{-3/2}\big)$.

C Algorithm to estimate treatment effect

C.1 Some auxiliary lemmas

In this section we present some auxiliary lemmas that are necessary to establish our main results.

Lemma C.1. Suppose $A$ is a p.s.d. matrix and $B$ is a symmetric matrix. Then $\mathrm{tr}(AB)\le\lambda_{\max}(B)\,\mathrm{tr}(A)$.

Proof. Note that $B - \lambda_{\max}(B)I\preceq 0$. Hence:
\[
\mathrm{tr}\big(A(\lambda_{\max}(B)I - B)\big) = \lambda_{\max}(B)\,\mathrm{tr}(A) - \mathrm{tr}(AB) \ge 0,
\]
since $\mathrm{tr}(AC) = \mathrm{tr}\big(A^{1/2}CA^{1/2}\big)\ge 0$ whenever $A\succeq 0$ and $C\succeq 0$, applied here with $C = \lambda_{\max}(B)I - B$.

Lemma C.2. Suppose $\{X_n\}_{n\in\mathbb N}$ is a sequence of non-negative random variables and $\{\mathcal F_n\}_{n\in\mathbb N}$ is a sequence of sigma-fields. If $E(X_n\mid\mathcal F_n) = o_p(1)$, then $X_n = o_p(1)$.

Proof. Fix $\epsilon>0$. From a conditional version of Markov's inequality, we have:
\[
Y_n := P(X_n>\epsilon\mid\mathcal F_n) \le \frac{E(X_n\mid\mathcal F_n)}{\epsilon} = o_p(1).
\]
Now, as $\{Y_n\}_{n\in\mathbb N}$ is a bounded sequence of random variables converging to $0$ in probability, applying the dominated convergence theorem we conclude:
\[
P(X_n>\epsilon) = E(Y_n) = o(1).
\]
This completes the proof.
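As a quick sanity check of Lemma C.1 (not part of the paper's argument), the trace inequality can be verified numerically on random matrices; the construction of $A$ and $B$ below is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    d = int(rng.integers(2, 8))
    M = rng.standard_normal((d, d))
    A = M @ M.T                          # p.s.d. by construction
    B = rng.standard_normal((d, d))
    B = (B + B.T) / 2                    # symmetric
    lhs = np.trace(A @ B)
    rhs = np.linalg.eigvalsh(B).max() * np.trace(A)
    assert lhs <= rhs + 1e-9             # tr(AB) <= lambda_max(B) tr(A)
print("Lemma C.1 verified on 1000 random instances")
```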
C.2 Some preliminary discussion on the B-spline basis

Recall that, as mentioned in Section 2, we use a B-spline basis to approximate both the unknown mean function $b(\eta) = E(\nu\mid\eta)$ and its derivative. More specifically, we fit the spline basis on $[-\tau_n,\tau_n]$ and define our estimator to be $0$ outside of it. The B-spline basis starts with $0$-th order polynomials (i.e. constant functions) and is then defined recursively for higher-order polynomials. Let the knots inside $[-\tau_n,\tau_n]$ be:
\[
-\tau_n = \xi_0 < \xi_1 < \xi_2 < \cdots < \xi_K = \tau_n.
\]
As we know from spline theory, the dimension of the space generated by the spline basis of degree $s$ is $K+s$. When $s=0$ (i.e. constant functions) we need $K$ basis functions, defined as:
\[
N_{i,0}(t) = \mathbf 1_{\xi_i\le t\le\xi_{i+1}}, \qquad 0\le i\le K-1.
\]
Now we define the recursion, i.e. how we go from a collection of B-spline basis functions of degree $p-1$ to a collection of degree $p$. There are $K+p$ basis functions of degree $p$, and each basis function is local in the sense that it is supported on at most $p+1$ of the intervals (observe that the constant functions are supported on exactly one interval). For this we first append some knots at both ends. For example, to go from the degree-$0$ basis to the degree-$1$ basis, we append two knots, one at the very beginning and one at the very end:
\[
-\tau_n = \xi_{-1} = \xi_0 < \xi_1 < \cdots < \xi_K = \xi_{K+1} = \tau_n.
\]
Our $K+1$ basis functions are defined as:
\[
N_{i,1}(t) = \frac{t-\xi_i}{\xi_{i+1}-\xi_i}N_{i,0}(t) + \frac{\xi_{i+2}-t}{\xi_{i+2}-\xi_{i+1}}N_{i+1,0}(t), \qquad \xi_i\le t\le\xi_{i+2},
\]
for $-1\le i\le K-1$. Note that when $i=-1$, $N_{-1,0}$ does not exist; hence for that case we drop the first summand and define:
\[
N_{-1,1}(t) = \frac{\xi_1-t}{\xi_1-\xi_0}N_{0,0}(t) \quad \text{if } \xi_{-1}\le t\le\xi_1,
\]
and for the last basis function, i.e. $i=K-1$, $N_{K,0}$ is not defined; hence we analogously define:
\[
N_{K-1,1}(t) = \frac{t-\xi_{K-1}}{\xi_K-\xi_{K-1}}N_{K-1,0}(t) \quad \text{if } \xi_{K-1}\le t\le\xi_K.
\]
Now we extend this pattern to a general degree $p$. For that we need to append $p$ knots at both ends:
\[
-\tau_n = \xi_{-p} = \cdots = \xi_{-1} = \xi_0 < \xi_1 < \cdots < \xi_K = \xi_{K+1} = \cdots = \xi_{K+p} = \tau_n,
\]
and define:
\[
N_{i,p}(t) = \frac{t-\xi_i}{\xi_{i+p}-\xi_i}N_{i,p-1}(t) + \frac{\xi_{i+p+1}-t}{\xi_{i+p+1}-\xi_{i+1}}N_{i+1,p-1}(t), \qquad \xi_i\le t\le\xi_{i+p+1},
\]
for $-p\le i\le K-1$ (with the convention that a summand is dropped whenever its denominator vanishes). Define the class of functions $S_{p,K}$ as the set of linear combinations of the $p$-th order B-spline basis $\{N_{i,p}\}_{i=-p}^{K-1}$, i.e.
\[
S_{p,K} = \left\{f : f(x) = \sum_{i=-p}^{K-1}c_iN_{i,p}(x) \ \text{with}\ c_{-p},\ldots,c_{K-1}\in\mathbb R\right\}.
\]
The following theorem, which is Theorem 17 of Chapter 1 of [30], provides the approximation error of a function (and its derivatives) with respect to the B-spline basis:

Theorem C.3 (Functional approximation using the B-spline basis). For any $0\le r\le l\le p$, if $\sup_{|x|\le\tau}|f(x)|<\infty$ and $f$ is $(l+1)$ times differentiable with its $(l+1)$-th derivative also bounded on $[-\tau,\tau]$, then we have:
\[
\inf_{s\in S_{p,K}}\|\partial^r f - \partial^r s\|_\infty \le C\left(\frac{\tau}{K}\right)^{(l+1-r)}\|f^{(l+1)}\|_\infty,
\]
where the constant $C$ only depends on $p$, the order of the spline approximation.

In our paper we also need to control the minimum eigenvalue of the population matrix $E\big(\tilde N_K(\hat\eta)\tilde N_K(\hat\eta)^\top\mathbf 1_{|\hat\eta|\le\tau}\mid D_1\big)$. For that reason, we use a scaled version of the B-spline basis instead:
\[
\tilde N_{i,p}(x) = \sqrt{\frac{K}{\tau}}\,N_{i,p}(x),
\]
and use the following theorem (see the theorem of Section 3 of [17], or Theorem 11 of [30]):

Theorem C.4 (Eigenvalues of the spline matrix). Define a matrix $G\in\mathbb R^{(K+p)\times(K+p)}$ such that:
\[
G_{ij} = \int_{-\tau}^{\tau}\tilde N_{i,p}(x)\tilde N_{j,p}(x)\,dx.
\]
Then there exists a constant $\kappa_p>0$, depending only on $p$, such that:
\[
\kappa_p \le \frac{x^\top Gx}{x^\top x} \le 1 \qquad \text{for all } x\in\mathbb R^{K+p}.
\]
In particular, if $X\sim F$ with density $f$ on $[-\tau,\tau]$ and $0<f_-\le f(x)\le f_+<\infty$, then we have:
\[
\kappa_p f_- \le \lambda_{\min}\Big(E\big(\tilde N_K(X)\tilde N_K(X)^\top\big)\Big) \le \lambda_{\max}\Big(E\big(\tilde N_K(X)\tilde N_K(X)^\top\big)\Big) \le f_+.
\]

Remark C.5. Note that the linear spans of $\{N_{i,p}\}_{i=1}^{K+p}$ and $\{\tilde N_{i,p}\}_{i=1}^{K+p}$ coincide (both equal $S_{p,K}$), because $N_{i,p}$ and $\tilde N_{i,p}$ differ only by a scaling constant. Hence Theorem C.3 remains unaltered even if we use the scaled B-spline basis $\{\tilde N_{i,p}\}_{i=1}^{K+p}$.
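For concreteness, the recursion above can be implemented directly. The sketch below is illustrative only (the function name and boundary-handling details are ours): it evaluates the degree-$p$ B-spline basis on $[-\tau,\tau]$ via the Cox–de Boor recursion with clamped boundary knots, and numerically checks two properties used in this section — the partition of unity $\sum_i N_{i,p}(x) = 1$ and the Gram-eigenvalue bounds of Theorem C.4 for the scaled basis.

```python
import numpy as np

def bspline_basis(x, K=10, p=3, tau=1.0):
    """Degree-p B-spline basis on [-tau, tau] via the Cox-de Boor recursion,
    with p repeated knots at each boundary (K + p functions in total)."""
    knots = np.concatenate([np.full(p, -tau),
                            np.linspace(-tau, tau, K + 1),
                            np.full(p, tau)])
    m = len(knots) - 1                        # number of degree-0 indicators
    B = np.zeros((len(x), m))
    for i in range(m):
        B[:, i] = (knots[i] <= x) & (x < knots[i + 1])
    B[x == tau, p + K - 1] = 1.0              # close the last non-empty interval
    for d in range(1, p + 1):                 # raise the degree step by step
        Bn = np.zeros((len(x), B.shape[1] - 1))
        for i in range(Bn.shape[1]):          # drop terms with zero denominators
            if knots[i + d] > knots[i]:
                Bn[:, i] += (x - knots[i]) / (knots[i + d] - knots[i]) * B[:, i]
            if knots[i + d + 1] > knots[i + 1]:
                Bn[:, i] += (knots[i + d + 1] - x) / (knots[i + d + 1] - knots[i + 1]) * B[:, i + 1]
        B = Bn
    return B                                  # shape (len(x), K + p)

tau, K, p = 1.0, 10, 3
x = np.linspace(-tau, tau, 2001)
N = bspline_basis(x, K, p, tau)
print("partition of unity:", np.allclose(N.sum(axis=1), 1.0))
Nt = np.sqrt(K / tau) * N                                       # scaled basis
G = (Nt[:, :, None] * Nt[:, None, :]).mean(axis=0) * (2 * tau)  # Gram by quadrature
ev = np.linalg.eigvalsh(G)
print(f"Gram eigenvalues in [{ev.min():.3f}, {ev.max():.3f}]  (cf. Theorem C.4)")
```

The locality of the basis (each $x$ touching at most $p+1$ functions) is visible in the sparsity pattern of the returned matrix.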
Another result, due to [43], which we use in this paper to bound the minimum eigenvalue (equivalently, the operator norm of the inverse) of the sample covariance matrix formed by $\{\tilde N_K(\hat\eta_i)\mathbf 1_{|\hat\eta_i|\le\tau}\}_{i=1}^n$ in terms of the population covariance matrix $E\big[\tilde N_K(\hat\eta)\tilde N_K(\hat\eta)^\top\mathbf 1_{|\hat\eta|\le\tau}\big]$, is the following (see Lemma 6.2 of [6]):

Theorem C.6. Let $Q_1,\ldots,Q_n$ be independent symmetric non-negative $k\times k$ matrix-valued random variables. Let $\bar Q = (1/n)\sum_iQ_i$ and $Q = E(\bar Q)$. If $\|Q_i\|_{op}\le M$ a.s., then we have:
\[
E\|\bar Q - Q\|_{op} \le C\left(\frac{M\log k}{n} + \sqrt{\frac{M\|Q\|_{op}\log k}{n}}\right)
\]
for some absolute constant $C$. In particular, if $Q_i = p_ip_i^\top$ for some random vector $p_i$ with $\|p_i\|\le\xi_k$ almost surely, then:
\[
E\|\bar Q - Q\|_{op} \le C\left(\frac{\xi^2_k\log k}{n} + \sqrt{\frac{\xi^2_k\|Q\|_{op}\log k}{n}}\right).
\]

Lastly, as we are working on the estimation of the derivative of the mean function $b$, we need a bound on $\|\nabla\tilde N_{p,K}(x)\|$ for all $|x|\le\tau$, where $\nabla\tilde N_{p,K}(x)$ is the vector of derivatives of $\{\tilde N_{i,p}(x)\}_{i=1}^{K+p}$. Towards that end, we prove the following lemma:

Lemma C.7. For all $|x|\le\tau$ we have $\|\nabla\tilde N_{p,K}(x)\|\le C_pK\sqrt K$ for some constant $C_p$ depending only on the order $p$ of the spline basis and on $\tau$.

Proof. Note that the unscaled B-spline basis $N_K(x)$ forms a partition of unity, i.e. for any $x\in[-\tau,\tau]$ we have $\sum_{j=1}^{K+p}N_{j,p}(x) = 1$; also, by definition, each $x$ contributes to only finitely many (at most $p+1$) basis functions. Hence it is immediate that, for any $x\in[-\tau,\tau]$, $\|N_K(x)\|\lesssim 1$ and $\|\tilde N_K(x)\|\lesssim\sqrt K$. Now, from Theorem 3 of [30], we have for any $1\le j\le K+p$:
\[
\frac{d}{dx}N_{j,p,K}(x) = \frac{K}{\tau}\big(N_{j,p-1,K}(x) - N_{j+1,p-1,K}(x)\big).
\]
Hence:
\[
\frac{d}{dx}\tilde N_{j,p,K}(x) = \frac{K\sqrt K}{\tau\sqrt\tau}\big(N_{j,p-1,K}(x) - N_{j+1,p-1,K}(x)\big),
\]
which along with the partition-of-unity property of $\{N_{i,p}\}_{i=1}^{K+p}$ implies $\|\nabla\tilde N_{p,K}(x)\|\le C_pK\sqrt K$. This completes the proof of the lemma.

C.3 Main algorithm

In this subsection we present our estimation method for the treatment effect $\alpha$, detailed in Section 2, in an algorithmic format.

Algorithm 1: A sketch of the method for efficient estimation of α

1. Divide the whole dataset into three equal parts: $D = D_1\cup D_2\cup D_3$.
2. Estimate $\gamma$ from $D_1$ by an OLS regression of $Q$ on $Z$, i.e. set $\hat\gamma_n = (Z^\top Z)^{-1}Z^\top Q$.
3. Replace $\eta_i$ in equation (2.1) by $\hat\eta_i$ using the $\hat\gamma_n$ obtained in the previous step (i.e. set $\hat\eta_i = Q_i - Z_i^\top\hat\gamma_n$, where $\{Q_i, Z_i\}$ are in $D_2$). Then estimate $b'$ from $D_2$ using equation (2.1) via the spline estimation method.
4. Estimate $\alpha$ from $D_3$ using the $\hat\gamma_n$ and $\hat b'$ estimated in the previous two steps.
5. Finally, repeat the above steps rotating the roles of the three datasets, and combine the resulting estimates to gain efficiency.
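The following is a schematic Python sketch of Algorithm 1 (our own simplified rendering, not the paper's exact implementation): the fold-level routine estimates $\hat\gamma_n$ by OLS on one fold, forms $\hat\eta_i$ on another, and then estimates $\alpha$ by partialling a spline in $\hat\eta$ out of $(Y, S, X)$. The helpers `spline_basis`/`fit_b_prime` from the sketch after equation (A.37) are assumed available, and Step 3's separate derivative estimate $\hat b'$ on $D_2$ is folded into the final partialling-out step for brevity.

```python
import numpy as np

def alpha_one_rotation(Y, S, X, Z, Q, idx1, idx2, idx3, tau=2.0, K=10):
    """One rotation of Algorithm 1 (schematic). idx1/idx2/idx3 index D1/D2/D3;
    idx2 is reserved for the b'-estimation step of the paper (omitted here)."""
    # Step 2: gamma-hat by OLS of Q on Z, on D1.
    gamma_hat = np.linalg.lstsq(Z[idx1], Q[idx1], rcond=None)[0]
    # Step 3 (simplified): form eta-hat on D3 and keep |eta_hat| <= tau.
    eta_hat = Q[idx3] - Z[idx3] @ gamma_hat
    keep = np.abs(eta_hat) <= tau
    knots = np.linspace(-tau, tau, K)[1:-1]
    N, _ = spline_basis(eta_hat[keep], knots)   # helper from the earlier sketch
    # Step 4: alpha-hat = coefficient of S after partialling the spline out.
    D = np.column_stack([S[idx3][keep], X[idx3][keep]])
    resid = lambda V: V - N @ np.linalg.lstsq(N, V, rcond=None)[0]
    coef = np.linalg.lstsq(resid(D), resid(Y[idx3][keep]), rcond=None)[0]
    return coef[0]

def estimate_alpha(Y, S, X, Z, Q, seed=0):
    """Step 5: rotate the three folds and average the resulting estimates."""
    n = len(Y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), 3)
    rots = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]
    return np.mean([alpha_one_rotation(Y, S, X, Z, Q,
                                       folds[a], folds[b], folds[c])
                    for a, b, c in rots])
```

In the paper, Step 4 uses both $\hat\gamma_n$ and $\hat b'$ to build the efficient estimating equation; the plain partialling-out above is only meant to convey the data flow of the three-fold rotation.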
References

[1] Joshua D. Angrist and Guido W. Imbens. Two-stage least squares estimation of average causal effects in models with variable treatment intensity. Journal of the American Statistical Association, 90(430):431–442, 1995.
[2] Joshua D. Angrist and Alan B. Krueger. Does compulsory school attendance affect schooling and earnings? The Quarterly Journal of Economics, 106(4):979–1014, 1991.
[3] Joshua D. Angrist and Alan B. Krueger. Empirical strategies in labor economics. In Handbook of Labor Economics, volume 3, pages 1277–1366. Elsevier, 1999.
[4] Joshua D. Angrist and Victor Lavy. Using Maimonides' rule to estimate the effect of class size on scholastic achievement. The Quarterly Journal of Economics, 114(2):533–575, 1999.
[5] Yoichi Arai, Taisuke Otsu, and Myung Hwan Seo. Causal inference on regression discontinuity designs by high-dimensional methods. Technical report, Suntory and Toyota International Centres for Economics and Related . . . , 2019.
[6] Alexandre Belloni, Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Some new asymptotic theory for least squares series: Pointwise and uniform results. Journal of Econometrics, 186(2):345–366, 2015.
[7] Richard A. Berk and David Rauma. Capitalizing on nonrandom assignment to treatments: A regression-discontinuity evaluation of a crime-control program. Journal of the American Statistical Association, 78(381):21–27, 1983.
[8] P. K. Bhattacharya and Peng-Liang Zhao. Semiparametric inference in a partial linear model. The Annals of Statistics, pages 244–262, 1997.
[9] Peter J. Bickel, Chris A. J. Klaassen, Ya'acov Ritov, and Jon A. Wellner. Efficient and Adaptive Estimation for Semiparametric Models, volume 4. Johns Hopkins University Press, Baltimore, 1993.
[10] Sandra E. Black. Do better schools matter? Parental valuation of elementary education. The Quarterly Journal of Economics, 114(2):577–599, 1999.
[11] Sebastian Calonico. Package 'rdrobust'. 2020.
[12] Sebastian Calonico, Matias D. Cattaneo, Max H. Farrell, and Rocio Titiunik. Regression discontinuity designs using covariates. Review of Economics and Statistics, 101(3):442–451, 2019.
[13] Sebastian Calonico, Matias D. Cattaneo, and Rocio Titiunik. Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica, 82(6):2295–2326, 2014.
[14] Donald T. Campbell and Julian Stanley. Experimental and quasi-experimental designs for research. Rand McNally, Skokie, IL, 1963.
[15] Matias D. Cattaneo, Brigham R. Frandsen, and Rocio Titiunik. Randomization inference in the regression discontinuity design: An application to party advantages in the US Senate. Journal of Causal Inference, 3(1):1–24, 2015.
[16] Matias D. Cattaneo, Nicolás Idrobo, and Rocío Titiunik. A Practical Introduction to Regression Discontinuity Designs: Foundations. Cambridge University Press, 2019.
[17] Carl de Boor. The quasi-interpolant as a tool in elementary polynomial spline theory. Approximation Theory, pages 269–276, 1973.
[18] John DiNardo and David S. Lee. Program evaluation and research designs. In Handbook of Labor Economics, volume 4, pages 463–536. Elsevier, 2011.
[19] Marilyn R. Erickson and Theodore Cromack. Evaluating a tutoring program. The Journal of Experimental Education, 41(2):27–31, 1972.
[20] Michael O. Finkelstein, Bruce Levin, and Herbert Robbins. Clinical and prophylactic trials with assured new treatment for those at greater risk: I. A design proposal. American Journal of Public Health, 86(5):691–695, 1996.
[21] Markus Frölich and Martin Huber. Including covariates in the regression discontinuity design. Journal of Business & Economic Statistics, 37(4):736–748, 2019.
[22] Jinyong Hahn, Petra Todd, and Wilbert Van der Klaauw. Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, 69(1):201–209, 2001.
[23] Sukjin Han. Nonparametric estimation of triangular simultaneous equations models under weak identification. Quantitative Economics, 11(1):161–202, 2020.
[24] John L. Holland and John M. Stalnaker. An honorary scholastic award. The Journal of Higher Education, 28(7):361–368, 1957.
[25] Guido Imbens and Karthik Kalyanaraman. Optimal bandwidth choice for the regression discontinuity estimator. The Review of Economic Studies, 79(3):933–959, 2012.
[26] Guido Imbens and Wilbert Van der Klaauw. Evaluating the cost of conscription in the Netherlands. Journal of Business & Economic Statistics, 13(2):207–215, 1995.
[27] Guido W. Imbens and Whitney K. Newey. Identification and estimation of triangular simultaneous equations models without additivity. Econometrica, 77(5):1481–1512, 2009.
[28] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909, 2014.
[29] Roger Klein and Francis Vella. Estimating a class of triangular simultaneous equations models without exclusion restrictions. Journal of Econometrics, 154(2):154–164, 2010.
[30] Angela Kunoth, Tom Lyche, Giancarlo Sangalli, and Stefano Serra-Capizzano. Splines and PDEs: From Approximation Theory to Numerical Linear Algebra. Springer, 2018.
[31] David S. Lee. Randomized experiments from non-random selection in US House elections. Journal of Econometrics, 142(2):675–697, 2008.
[32] David S. Lee and Thomas Lemieux. Regression discontinuity designs in economics. Journal of Economic Literature, 48(2):281–355, 2010.
[33] Jason M. Lindo, Nicholas J. Sanders, and Philip Oreopoulos. Ability, gender, and performance standards: Evidence from academic probation. American Economic Journal: Applied Economics, 2(2):95–117, 2010.
[34] B. W. Lohr. An historical view of the research on the factors related to the utilization of health services. Bureau for Health Services Research and Evaluation, Social and Economic Analysis Division, Rockville, MD, 1972.
[35] Mette Lise Lousdal. An introduction to instrumental variable assumptions, validation and estimation. Emerging Themes in Epidemiology, 15(1):1–7, 2018.
[36] Erik Meyersson. Islamic rule and the empowerment of the poor and pious. Econometrica, 82(1):229–269, 2014.
[37] Patric Müller and Sara van de Geer. The partial linear model in high dimensions. Scandinavian Journal of Statistics, 42(2):580–608, 2015.
[38] Whitney K. Newey, James L. Powell, and Francis Vella. Nonparametric estimation of triangular simultaneous equations models. Econometrica, 67(3):565–603, 1999.
[39] Taisuke Otsu, Ke-Li Xu, and Yukitoshi Matsushita. Empirical likelihood for regression discontinuity design. Journal of Econometrics, 186(1):94–112, 2015.
[40] Sida Peng and Yang Ning. Regression discontinuity design under self-selection. arXiv preprint arXiv:1911.09248, 2019.
[41] Jons Pinkse. Nonparametric two-step regression estimation when regressors and error are dependent. Canadian Journal of Statistics, 28(2):289–300, 2000.
[42] Jack Porter. Estimation in the regression discontinuity model. Unpublished manuscript, Department of Economics, University of Wisconsin at Madison, 2003.
[43] Mark Rudelson. Random vectors in the isotropic position. Journal of Functional Analysis, 164(1):60–72, 1999.
[44] Thomas A. Severini and Gautam Tripathi. A simplified approach to computing efficiency bounds in semiparametric models. Journal of Econometrics, 102(1):23–66, 2001.
[45] Alice M. Stadthaus. A comparison of the subsequent academic achievement of marginal selectees and rejectees for the Cincinnati Public Schools Special College Preparatory Program: An application of Campbell's regression discontinuity design. PhD thesis, ProQuest Information & Learning, 1972.
[46] Charles Stein. Efficient nonparametric testing and estimation. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1956.
[47] Donald L. Thistlethwaite. Effects of social recognition upon the educational motivation of talented youth. Journal of Educational Psychology, 50(3):111, 1959.
[48] Donald L. Thistlethwaite and Donald T. Campbell. Regression-discontinuity analysis: An alternative to the ex post facto experiment. Journal of Educational Psychology, 51(6):309, 1960.
[49] Sara van de Geer, Peter Bühlmann, Ya'acov Ritov, and Ruben Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.
[50] Aad W. van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
[51] Wing Hung Wong and Thomas A. Severini. On maximum likelihood estimation in infinite dimensional parameter spaces. The Annals of Statistics, pages 603–632, 1991.
[52] Adonis Yatchew. An elementary estimator of the partial linear model. Economics Letters, 57(2):135–143, 1997.
[53] Cun-Hui Zhang and Stephanie S. Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B, 76(1):217–242, 2014.