A Unified Framework for Specification Tests of Continuous Treatment Effect Models
Wei Huang*, Oliver Linton†, and Zheng Zhang‡

*School of Mathematics and Statistics and Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS), University of Melbourne, Australia
†Faculty of Economics, University of Cambridge
‡Center for Applied Statistics, Institute of Statistics & Big Data, Renmin University of China
Abstract
We propose a general framework for the specification testing of continuous treatment effect models. We assume a general residual function, which includes the average and quantile treatment effect models as special cases. The null models are identified under the unconfoundedness condition and contain a nonparametric weighting function. We propose a test statistic for the null model in which the weighting function is estimated by solving an expanding set of moment equations. We establish the asymptotic distributions of our test statistic under the null hypothesis and under fixed and local alternatives. The proposed test statistic is shown to be more efficient than that constructed from the true weighting function and can detect local alternatives deviating from the null models at the rate of $O_P(N^{-1/2})$. A simulation method is provided to approximate the null distribution of the test statistic. Monte-Carlo simulations show that our test exhibits satisfactory finite-sample performance, and an application shows its practical value.

Keywords: Consistent tests; Continuous treatment effect; Series estimation; Bootstrap.
*E-mail: [email protected]  †E-mail: [email protected]  ‡E-mail: [email protected]

1 Introduction

Causal inference is a central topic in economics, statistics, and machine learning. Although a randomized trial is the gold standard for identifying the causal effect, it is often unavailable or even unethical in practice. Observational data, where participation in an intervention is only observed rather than manipulated by scientists, are predominantly what is available. A major challenge for inferring causality in observational studies is confoundedness, whereby individual characteristics are correlated with both the treatment variable and the outcome of interest. To identify causality, the unconfounded treatment assignment condition is frequently imposed in the literature; see Rosenbaum and Rubin (1983, 1984). For a comprehensive review of causal inference and its applications, see Imbens and Wooldridge (2009) and Abadie and Cattaneo (2018).

Treatment effect models are used extensively in economics and statistics to evaluate the causal effect of a treatment or policy. Most of the existing literature focuses on the binary treatment, where an individual either receives the treatment or does not (see, e.g., Hahn, 1998, Hirano, Imbens, and Ridder, 2003, Donald, Hsu, and Lieli, 2014, Abrevaya, Hsu, and Lieli, 2015, Chan, Yam, and Zhang, 2016, Athey, Imbens, and Wager, 2018, Hsu, Lai, and Lieli, 2020, Chen, Hsu, and Wang, 2020, Fan, Hsu, Lieli, and Zhang, 2020, among others). Some papers focus on multivalued treatments (see, e.g., Cattaneo, 2010, Lee, 2018, and Ao, Calonico, and Lee, 2021). In many applications, however, the treatment variable is continuously valued, and its causal effect is of great interest to decision makers. For example, in evaluating how non-labor income affects the labor supply, the causal effect may depend not only on the introduction of the non-labor income but also on the total non-labor income.
Similarly, in evaluating how advertising affects campaign contributions in political analysis, the causal effect may depend not only on whether any advertisements are placed but also on how many of them are distributed.

Estimation of continuous treatment effects has drawn great attention from researchers (see Hirano, Imbens, and Ridder (2003), Galvao and Wang (2015), Kennedy, Ma, McHugh, and Small (2017), Fong, Hazlett, and Imai (2018), Dong, Lee, and Gou (2019), Huber, Hsu, Lee, and Lettry (2020), Colangelo and Lee (2020), among others). Hirano, Imbens, and Ridder (2003), Galvao and Wang (2015), and Fong, Hazlett, and Imai (2018) applied fully parametric methods by modelling either the conditional distribution of the treatment given the confounders or that of the observed outcome given the treatment and the confounders. The shortcoming of these parametric methods is that modelling and testing the relations of the treatment and the observed outcome with the confounders are difficult, especially when multiple confounding variables are involved. If the model is mis-specified, the conclusion can be biased and completely misleading. Kennedy, Ma, McHugh, and Small (2017) and Huber, Hsu, Lee, and Lettry (2020) estimated continuous treatment effects using nonparametric kernel methods. Although nonparametric approaches are much more flexible than parametric ones, they require smoothing of the data rather than estimating finite-dimensional parameters, which leads to less precise fits and convergence rates slower than $N^{-1/2}$. Furthermore, it is usually hard to interpret nonparametric results.

In a recent article, Ai, Linton, Motegi, and Zhang (2021) studied continuous treatment effects by imposing a univariate generalized parametric model for functionals of the potential outcome over the treatment variable. The general framework includes many important causal parameters as special cases, for example, the average and quantile treatment effects.
They proposed a generalized weighting estimator for the causal effect, with the weights modelled nonparametrically and estimated by solving an expanding set of equations. They further derived the semiparametric efficiency bound for the causal effect of treatment under the unconfounded treatment assignment condition and showed that their estimator is $\sqrt{N}$-asymptotically normal and attains the semiparametric efficiency bound. Although Ai, Linton, Motegi, and Zhang (2021)'s estimator enjoys superior asymptotic properties and satisfactory finite-sample performance, they did not address the specification of the parametric models for the functionals of the potential outcomes. If the parametric model is mis-specified, the results developed in Ai, Linton, Motegi, and Zhang (2021) do not hold.

We study the question of model specification. In particular, we propose a consistent specification test for this most general continuous treatment effect model. That is, we consider the generalized parametric model in Ai, Linton, Motegi, and Zhang (2021) as the null model in our hypothesis test. The potential outcome variable in the model is not observable. However, under the unconfounded treatment assignment condition, the model can be identified by a semiparametric weighted conditional model. There is an abundant literature on specification tests for conditional models (see, e.g., Ait-Sahalia, Bickel, and Stoker (2001), Bierens (1982, 1990), Fan and Li (1996), Zheng (1996), Bierens and Ploberger (1997), Stute (1997), Li (1999), Chen and Fan (1999), Fan and Li (2000), Li, Hsiao, and Zinn (2003), Crump, Hotz, Imbens, and Mitnik (2008), among others). Most authors have considered the problem of testing a parametric/semiparametric null model using an integrated-type test statistic. Ait-Sahalia, Bickel, and Stoker (2001) and Chen and Fan (1999) considered testing nonparametric/semiparametric null models using nonparametric kernel methods.
Li, Hsiao, and Zinn (2003) considered testing nonparametric/semiparametric null models using series methods. Crump, Hotz, Imbens, and Mitnik (2008) derived a nonparametric Wald test statistic for testing conditional average treatment effects under the unconfoundedness condition. We estimate our semiparametric weighted null model using the approach developed in Ai, Linton, Motegi, and Zhang (2021) and construct an integrated-type test statistic. Although the weights in our null model are estimated nonparametrically, we show that our proposed test statistic is more efficient than that constructed from the true weights. Moreover, our proposed test statistic can detect local alternatives that deviate from the null model at the rate of $O_P(N^{-1/2})$.

Under the null hypothesis, our test statistic is shown to converge in distribution to a weighted sum of independent chi-squared random variables. It is known that obtaining the exact critical values of such a distribution is extremely difficult in practice. Most of the literature suggests using a residual wild bootstrap procedure to approximate the critical values. This is not applicable in our case because our null model does not imply any explicit form of relationship among the observed outcome, the treatment, and the confounders for residual sampling. To resolve this problem, we propose a simulation method to approximate the null limiting distribution. Monte-Carlo simulations and real data analysis were conducted to demonstrate the numerical properties of our test method and of the limiting distribution approximation.

The remainder of the paper is organized as follows. We introduce the problem formulation and notation in Section 2. Section 3 constructs the test statistic, followed by the study of the asymptotic properties under the null hypothesis and under the fixed and local alternatives in Section 4. In Section 5, we discuss how to approximate the limiting distribution under the null hypothesis.
Finally, Section 6 discusses the choice of the tuning parameters in the estimation and investigates the finite-sample performance through simulations and U.S. campaign advertisement data.

2 Problem formulation

Let $T$ denote a continuous treatment variable with support $\mathcal{T} \subset \mathbb{R}$, where $\mathcal{T}$ is a continuum subset, and $T$ has a marginal density function $f_T(t)$. Let $Y^*(t)$ denote the potential response when treatment $T = t$ is assigned. We are interested in testing the null hypothesis
$$H_0: \exists \text{ some } \theta^* \in \Theta \text{ s.t. } E[m\{Y^*(t); g(t; \theta^*)\}] = 0 \text{ for all } t \in \mathcal{T}, \qquad (2.1)$$
against the alternative hypothesis
$$H_1: \nexists \text{ any } \theta \in \Theta \text{ s.t. } E[m\{Y^*(t); g(t; \theta)\}] = 0 \text{ for all } t \in \mathcal{T},$$
where $\Theta$ is a compact set in $\mathbb{R}^p$ for some integer $p \geq 1$, $m(\cdot)$ is some generalized residual function, which may be non-differentiable, and $g(t; \theta)$ is a parametric working model which is differentiable with respect to $\theta$. If $H_0$ holds, for each $t$, the dose-response function (DRF) is defined as the value $g(t; \theta^*)$ that solves the moment condition in (2.1). The following examples show that the average dose-response function (ADRF) and the quantile dose-response function (QDRF) are special cases of $g(t; \theta^*)$, which result from choosing specific forms of $m(\cdot)$.

• (Average) Setting $m\{Y^*(t); g(t; \theta^*)\} = Y^*(t) - g(t; \theta^*)$ and letting its first moment equal zero for each $t$, we obtain $g(t; \theta^*) = E\{Y^*(t)\}$, the unconditional ADRF, which is also called a marginal structural model in Robins, Hernán, and Brumback (2000). This can recover the average treatment effect (ATE), which is given by $\mathrm{ATE}(t_1, t_2) = E\{Y^*(t_1)\} - E\{Y^*(t_2)\}$. Examples include the linear marginal structural model $E\{Y^*(t)\} = \beta_0 + \beta_1 t$ and the nonlinear marginal structural model $E\{Y^*(t)\} = \beta_1 t + 1/(t + \beta_2)$ studied in Hirano and Imbens (2004).

• (Quantile) Let $\tau \in (0, 1)$ and $F_{Y^*(t)}(\cdot)$ be the cumulative distribution function of $Y^*(t)$.
Setting $m\{Y^*(t); g(t; \theta^*)\} = \tau - 1\{Y^*(t) < g(t; \theta^*)\}$ and letting its first moment equal zero for each $t$, we obtain $g(t; \theta^*) = F^{-1}_{Y^*(t)}(\tau) := \inf\{q : P(Y^*(t) \geq q) \leq 1 - \tau\}$, the unconditional QDRF. This can recover the quantile treatment effect (QTE), which is given by $\mathrm{QTE}(t_1, t_2) = F^{-1}_{Y^*(t_1)}(\tau) - F^{-1}_{Y^*(t_2)}(\tau)$. See Firpo (2007) for a detailed discussion of the QTE. Examples include the linear model $g(t; \theta) = \theta_0 + \theta_1 t$ and the Box-Cox transformation model $g(t; \theta) = h_\lambda(\theta_0 + \theta_1 t)$ studied in Buchinsky (1995), where $h_\lambda(z) = (\lambda z + 1)^{1/\lambda}$.

We consider the observational study where the potential outcome $Y^*(t)$ is not observed for all $t$. Let $Y := Y^*(T)$ denote the observed response. Under the null hypothesis, one may attempt to solve the following equation to find $\theta^*$:
$$E[m\{Y; g(T; \theta)\} \nabla_\theta g(T; \theta)] = 0.$$
However, if there is selection into treatment, even under the null hypothesis, the true value $\theta^*$ does not solve the above equation. Indeed, in this case, the observed response and the treatment assignment data alone cannot identify $\theta^*$. To address this identification issue, most studies in the literature impose a selection-on-observables condition (e.g., Hirano, Imbens, and Ridder, 2003, Imai and van Dyk, 2004, Fong, Hazlett, and Imai, 2018, Ai, Linton, Motegi, and Zhang, 2021). Specifically, let $X \in \mathbb{R}^r$, for some integer $r \geq 1$, denote a vector of observable covariates. The following condition is maintained throughout the paper.

Assumption 1 (Unconfounded Treatment Assignment). For all $t \in \mathcal{T}$, given $X$, $T$ is independent of $Y^*(t)$, that is, $Y^*(t) \perp T \mid X$ for all $t \in \mathcal{T}$.

Let $\{T_i, X_i, Y_i\}_{i=1}^N$ be an independent and identically distributed (i.i.d.) sample drawn from the joint distribution of $(T, X, Y)$. Let $f_{T|X}$ denote the conditional density of $T$ given the observed covariates $X$.
Under Assumption 1, Ai, Linton, Motegi, and Zhang (2021) showed that $E[m\{Y^*(t); g(t; \theta)\}]$ can be identified as follows:
$$E[m\{Y^*(t); g(t; \theta)\}] = E[\pi(T, X)\, m\{Y; g(T; \theta)\} \mid T = t], \quad \forall t \in \mathcal{T},$$
where
$$\pi(T, X) := \frac{f_T(T)}{f_{T|X}(T \mid X)}.$$
The function $\pi(T, X)$ is called the stabilized weight in Robins, Hernán, and Brumback (2000). Then under $H_0$, the true value $\theta^*$ solves the following equation:
$$E[\pi(T, X)\, m\{Y; g(T; \theta)\} \nabla_\theta g(T; \theta)] = 0, \qquad (2.2)$$
where $\nabla_\theta$ denotes the derivative with respect to $\theta$.

The null and alternative hypotheses in (2.1) can then be re-written as
$$H_0: P(E[\pi(T, X)\, m\{Y; g(T; \theta^*)\} \mid T] = 0) = 1 \text{ for some } \theta^* \in \Theta, \qquad (2.3)$$
against the alternative hypothesis
$$H_1: P(E[\pi(T, X)\, m\{Y; g(T; \theta)\} \mid T] \neq 0) > 0 \text{ for all } \theta \in \Theta.$$
This converts the test of (2.1) into a goodness-of-fit test for a univariate regression model, if both $\pi(T, X)$ and $\theta^*$ were given. Specifically, letting
$$U_i := \pi(T_i, X_i)\, m\{Y_i; g(T_i; \theta^*)\}, \qquad (2.4)$$
the null hypothesis $H_0$ is equivalent to $P\{E(U_i \mid T_i) = 0\} = 1$. A popular technique for testing such a conditional moment model is to convert it into an unconditional one.

Note that $P\{E(U_i \mid T_i) = 0\} = 1$ if and only if $E\{U_i M(T_i)\} = 0$ for all bounded and measurable functions $M(\cdot)$. Following Bierens and Ploberger (1997), Stinchcombe and White (1998), Stute (1997), and Li, Hsiao, and Zinn (2003), by choosing a proper weight function $H(\cdot, \cdot)$, $E(U_i \mid T_i) = 0$ a.s. is equivalent to
$$E\{U_i H(T_i, t)\} = 0 \text{ for all } t \in \mathcal{T}. \qquad (2.5)$$
Popular choices of such a weight function are the logistic function $H(T_i, t) = 1/\{1 + \exp(c - t \cdot T_i)\}$ with $c \neq 0$, the cosine-sine function $H(T_i, t) = \cos(t \cdot T_i) + \sin(t \cdot T_i)$, and the indicator function $H(T_i, t) = 1(T_i \leq t)$ (see Stinchcombe and White, 1998, and Stute, 1997, for more detailed discussion).
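As a concrete numerical illustration of (2.4) and (2.5), the sketch below simulates a toy Gaussian selection-on-observables design (our own assumption, not from the paper) in which both densities in the stabilized weight $\pi(T, X) = f_T(T)/f_{T|X}(T \mid X)$ are available in closed form, and checks that the sample analogues of $E\{U_i H(T_i, t)\}$ are near zero under a correctly specified ADRF null.

```python
import numpy as np

rng = np.random.default_rng(0)

def npdf(x, m, s):
    """Normal density, used for the closed-form stabilized weights."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# Hypothetical design: X ~ N(0,1), T | X ~ N(0.5 X, 1), so marginally
# T ~ N(0, 1.25); the outcome Y = 1 + 2 T + X + eps gives ADRF g(t) = 1 + 2t.
N = 5000
X = rng.normal(size=N)
T = 0.5 * X + rng.normal(size=N)
Y = 1.0 + 2.0 * T + X + rng.normal(size=N)

# Stabilized weights pi(T, X) = f_T(T) / f_{T|X}(T | X), known here.
pi = npdf(T, 0.0, np.sqrt(1.25)) / npdf(T, 0.5 * X, 1.0)

# U_i of (2.4) with the average-type residual m = Y - g(T; theta*).
U = pi * (Y - (1.0 + 2.0 * T))

# The three weight functions H(T_i, t) mentioned in the text.
H_funcs = [
    lambda Ti, t: 1.0 / (1.0 + np.exp(1.0 - t * Ti)),   # logistic, c = 1
    lambda Ti, t: np.cos(t * Ti) + np.sin(t * Ti),      # cosine-sine
    lambda Ti, t: (Ti <= t).astype(float),              # indicator
]

# Under H0, E{U_i H(T_i, t)} = 0 for every t: check at a few t values.
checks = [abs(np.mean(U * H(T, t))) for H in H_funcs for t in (-1.0, 0.0, 1.0)]
print(round(max(checks), 3))
```

Replacing $g(t; \theta^*)$ by a misspecified working model makes these sample moments drift away from zero, which is exactly the deviation the CM-type statistic below accumulates.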
Now, letting
$$J_N(t) = \frac{1}{\sqrt{N}} \sum_{i=1}^N U_i H(T_i, t), \qquad (2.6)$$
the sample analogue of $E\{U_i H(T_i, t)\}$ multiplied by $\sqrt{N}$, one can test $H_0$ using the Cramér-von Mises (CM)-type statistic
$$CM_N = \int \{J_N(t)\}^2 \, \widehat{F}_T(dt) = \frac{1}{N} \sum_{i=1}^N \{J_N(T_i)\}^2, \qquad (2.7)$$
where $\widehat{F}_T(\cdot)$ is the empirical distribution of $T_1, \ldots, T_N$. However, both $\pi(T, X)$ and $\theta^*$ are unknown in practice, so the $U_i$'s are unavailable. We have to replace the $U_i$'s with estimates, which is studied in the following section.

3 Construction of the test statistic

One obvious approach for estimating the $U_i$'s is to estimate $f_T(T_i)$ and $f_{T|X}(T_i \mid X_i)$, and then construct estimators of $\pi(T_i, X_i)$ and $\theta^*$. However, it is well known that this ratio estimator of $\pi(T, X)$ is very sensitive to small values of $f_{T|X}(T \mid X)$, because small errors in estimating $f_{T|X}(T \mid X)$ result in large errors in the estimator of $\pi(T, X)$. To avoid or mitigate this problem, we follow Ai, Linton, Motegi, and Zhang (2021)'s idea of estimating the weighting function $\pi(T, X)$ by generalized empirical likelihood. Note that the weighting function satisfies
$$E\{\pi(T, X) u(T) v(X)\} = E\{u(T)\} \cdot E\{v(X)\} \qquad (3.1)$$
for any suitable functions $u(t)$ and $v(x)$. Ai, Linton, Motegi, and Zhang (2021, Theorem 2) showed that the restriction (3.1) identifies the weighting function $\pi(T, X)$. This result suggests that one may estimate the $\pi(T_i, X_i)$'s by solving the sample analogue of (3.1). The challenge is that (3.1) implies an infinite number of equations, which is impossible to solve with a finite sample of observations. To overcome this difficulty, Ai, Linton, Motegi, and Zhang (2021) suggested approximating the infinite-dimensional function space by a sequence of finite-dimensional sieve spaces. Specifically, let $u_{K_1}(T) = (u_{K_1,1}(T), \ldots, u_{K_1,K_1}(T))^\top$ and $v_{K_2}(X) = (v_{K_2,1}(X), \ldots, v_{K_2,K_2}(X))^\top$ denote known basis functions with dimensions $K_1 \in \mathbb{N}$ and $K_2 \in \mathbb{N}$ respectively, and let $K := K_1 \cdot K_2$. The functions $u_{K_1}(t)$ and $v_{K_2}(x)$ are called the approximation sieves, such as B-splines or power series (see Newey, 1997, and Chen, 2007, for more discussion of sieve approximation). Because the sieve approximating space is a subspace of the original function space, $\pi(T, X)$ also satisfies
$$E\{\pi(T, X)\, u_{K_1}(T)\, v_{K_2}(X)^\top\} = E\{u_{K_1}(T)\} \cdot E\{v_{K_2}(X)\}^\top. \qquad (3.2)$$
Following Ai, Linton, Motegi, and Zhang (2021), we estimate the $\pi(T_i, X_i)$'s consistently by the $\widehat{\pi}_i$'s that maximize the generalized empirical likelihood (GEL) function, subject to the sample analogue of (3.2):
$$\{\widehat{\pi}_i\}_{i=1}^N = \arg\max \left( -N^{-1} \sum_{i=1}^N \pi_i \log \pi_i \right)$$
$$\text{subject to } \frac{1}{N} \sum_{i=1}^N \pi_i\, u_{K_1}(T_i)\, v_{K_2}(X_i)^\top = \left\{ \frac{1}{N} \sum_{i=1}^N u_{K_1}(T_i) \right\} \left\{ \frac{1}{N} \sum_{j=1}^N v_{K_2}(X_j)^\top \right\}. \qquad (3.3)$$
Two observations are immediate. First, by including a constant of one in the sieve basis functions, (3.3) guarantees that $N^{-1} \sum_{i=1}^N \widehat{\pi}_i = 1$. Second, we notice that
$$\max \left( -N^{-1} \sum_{i=1}^N \pi_i \log \pi_i \right) = -\min \left\{ \sum_{i=1}^N (N^{-1} \pi_i) \cdot \log \left( \frac{N^{-1} \pi_i}{N^{-1}} \right) \right\}.$$
The entropy maximization problem thus minimizes the Kullback-Leibler divergence between the weights $\{N^{-1} \pi_i\}_{i=1}^N$ and the empirical frequencies $\{N^{-1}\}$, subject to the sample analogue of (3.2).
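The entropy problem (3.3) is easiest to solve through its unconstrained concave dual with $\rho(u) = -\exp(-u-1)$ derived in Ai, Linton, Motegi, and Zhang (2021). The sketch below is a minimal illustration using toy bases and plain gradient ascent; the simulated design, the choice $K_1 = K_2 = 2$, the step size, and the iteration count are all assumptions made for the example (the paper itself suggests a Gauss-Newton scheme).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (hypothetical design, not from the paper).
N = 500
X = rng.normal(size=N)
T = 0.5 * X + rng.normal(size=N)

# Sieve bases with a leading constant; K1 = K2 = 2 is a toy choice.
uK = np.column_stack([np.ones(N), T])   # u_K1(T_i), N x K1
vK = np.column_stack([np.ones(N), X])   # v_K2(X_i), N x K2

# Dual objective: G(Lam) = mean rho(u' Lam v) - mean(u)' Lam mean(v),
# with rho(u) = -exp(-u - 1), so rho'(u) = exp(-u - 1) gives the weights.
target = np.outer(uK.mean(axis=0), vK.mean(axis=0))

Lam = np.zeros((2, 2))
for _ in range(4000):                               # plain gradient ascent
    s = np.einsum('ik,kl,il->i', uK, Lam, vK)       # u_K1(T_i)' Lam v_K2(X_i)
    w = np.exp(-s - 1.0)                            # rho'(s): candidate weights
    grad = (uK * w[:, None]).T @ vK / N - target    # gradient of the dual
    Lam += 0.2 * grad

pi_hat = np.exp(-np.einsum('ik,kl,il->i', uK, Lam, vK) - 1.0)

# The first-order condition reproduces the sample analogue of (3.2); with a
# constant in the basis, the fitted weights average to one.
print(round(pi_hat.mean(), 3))
```

Because the dual is strictly concave, any ascent scheme that converges reaches the same maximizer; gradient ascent is used here only to keep the sketch dependency-free.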
Further, Ai, Linton, Motegi, and Zhang (2021) showed that the dual solution of the primal problem (3.3) is
$$\widehat{\pi}_K(T_i, X_i) := \rho'\{u_{K_1}(T_i)^\top \widehat{\Lambda}_{K_1 \times K_2} v_{K_2}(X_i)\}, \qquad (3.4)$$
where $\rho'$ is the first derivative of $\rho$ with $\rho(u) = -\exp(-u - 1)$, and $\widehat{\Lambda}_{K_1 \times K_2}$ is the maximizer of the strictly concave function $\widehat{G}_{K_1 \times K_2}$ defined by
$$\widehat{G}_{K_1 \times K_2}(\Lambda) := \frac{1}{N} \sum_{i=1}^N \rho\{u_{K_1}(T_i)^\top \Lambda v_{K_2}(X_i)\} - \left\{ \frac{1}{N} \sum_{i=1}^N u_{K_1}(T_i) \right\}^\top \Lambda \left\{ \frac{1}{N} \sum_{j=1}^N v_{K_2}(X_j) \right\}. \qquad (3.5)$$
The first-order condition of (3.5) implies that the $\{\widehat{\pi}_K(T_i, X_i)\}_{i=1}^N$ satisfy the sample analogue of (3.2); such restrictions reduce the chance of obtaining extreme weights. The concavity of (3.5) enables us to obtain the solution quickly via the Gauss-Newton algorithm. To ensure a consistent estimate of $\pi(T, X)$, the dimensions of the bases, $K_1$ and $K_2$, shall increase as the sample size increases. The choice of $K_1$ and $K_2$ in practice is discussed in Section 6.1.

Having estimated the weights, we now estimate $\theta^*$, with the estimator denoted by $\widehat{\theta}$, by solving the sample analogue of (2.2) with respect to $\theta$, i.e.,
$$\frac{1}{N} \sum_{i=1}^N \widehat{\pi}_K(T_i, X_i)\, m\{Y_i; g(T_i; \widehat{\theta})\} \nabla_\theta g(T_i; \widehat{\theta}) = o_P(N^{-1/2}). \qquad (3.6)$$
With the estimators $\{\widehat{\pi}_K(T_i, X_i)\}_{i=1}^N$ of $\{\pi(T_i, X_i)\}_{i=1}^N$ and $\widehat{\theta}$ of $\theta^*$, we estimate $U_i$ by $\widehat{U}_i = \widehat{\pi}_K(T_i, X_i)\, m\{Y_i; g(T_i; \widehat{\theta})\}$ for $i = 1, \ldots, N$. Replacing the $U_i$'s in (2.6) by the $\widehat{U}_i$'s, we have the feasible test statistic for $H_0$ based on
$$\widehat{J}_N(t) = \frac{1}{\sqrt{N}} \sum_{i=1}^N \widehat{U}_i H(T_i, t),$$
and the corresponding estimator of the Cramér-von Mises (CM)-type statistic in (2.7) is
$$\widehat{CM}_N = \frac{1}{N} \sum_{i=1}^N \{\widehat{J}_N(T_i)\}^2.$$

Remark 1.
Crump, Hotz, Imbens, and Mitnik (2008) considered the null hypothesis concerning the conditional average treatment effect (CATE) with a binary treatment, that is, $H_0: \mathrm{CATE}(x) := E[Y^*(1) - Y^*(0) \mid X = x] = 0$ for all $x$. This null hypothesis indicates that there is no heterogeneity in average treatment effects by covariates. Under the unconfoundedness condition, the null hypothesis is identical to $H_0: \mathrm{CATE}(x) = E[Y \mid T = 1, X = x] - E[Y \mid T = 0, X = x] = 0$ for all $x$. Further, they proposed series estimators for the regression functions and formed the Wald test statistic. Their test method is applicable to a particular scenario included in our general formulation (2.1), namely that there is no continuous average treatment effect, that is, $H_0': E[Y^*(t)] = E[\pi(T, X) Y \mid T = t] = 0$ for all $t$, given that $\pi(T, X)$ was known. However, in practice, $\pi(T, X)$ is usually unknown and needs to be estimated.

Remark 2. An alternative estimator of $\theta^*$ can be constructed under $H_0$. Suppose that under $H_0$, $\theta^*$ is identified as the unique solution of the following optimization problem:
$$\theta^* = \arg\min_{\theta \in \Theta} CM(\theta) := N \times \int_{\mathcal{T}} \{E[U_i(\theta) H(T_i, t)]\}^2 f_T(t) \, dt,$$
where $U_i(\theta) := \pi(T_i, X_i)\, m\{Y_i; g(T_i; \theta)\}$. Let $\widehat{U}_i(\theta) := \widehat{\pi}_K(T_i, X_i)\, m\{Y_i; g(T_i; \theta)\}$ and $\widehat{J}_N(t; \theta) := N^{-1/2} \sum_{i=1}^N \widehat{U}_i(\theta) H(T_i, t)$. Under $H_0$, the estimator of $\theta^*$ can be defined by
$$\widehat{\theta}_{\mathrm{opt}} := \arg\min_{\theta \in \Theta} \widehat{CM}_N(\theta) := \arg\min_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^N \{\widehat{J}_N(T_i; \theta)\}^2. \qquad (3.7)$$
As a result, the alternative test statistic is $\widehat{CM}_N(\widehat{\theta}_{\mathrm{opt}})$. However, seeking the global minimizer of $\widehat{CM}_N(\theta)$ is difficult, as $\widehat{CM}_N(\theta)$ may not be differentiable or convex, and may not even be continuous.
For example, taking $m\{Y_i; g(T_i; \theta)\} = \tau - 1\{Y_i \leq g(T_i; \theta)\}$ for the QDRF, there does not exist a unique solution to the problem (3.7). Under the stronger condition that $m(y; g)$ is differentiable in $g$, we establish the asymptotic results for both $\widehat{J}_N(t; \widehat{\theta}_{\mathrm{opt}})$ and $\widehat{CM}_N(\widehat{\theta}_{\mathrm{opt}})$ in Appendix E.

4 Large sample properties
This section studies the asymptotic properties of $\widehat{J}_N(\cdot)$ and the test statistic $\widehat{CM}_N$. To establish these asymptotic properties, the following additional assumptions are imposed.

Assumption 2.
Suppose that $N^{-1} \sum_{i=1}^N \widehat{\pi}_K(T_i, X_i)\, m\{Y_i; g(T_i; \widehat{\theta})\} \nabla_\theta g(T_i; \widehat{\theta}) = o_P(N^{-1/2})$ holds.

Assumption 3.
$\mathrm{Var}(Y \mid T)$ is bounded a.s. on the support of $T$.

Assumption 4. (i) $g(t; \theta)$ is twice continuously differentiable in $\theta \in \Theta$; (ii) $E[m\{Y; g(T; \theta^*)\} \mid T = t, X = x]$ is continuously differentiable in $(t, x)$; (iii) $E[\pi(T, X)\, m\{Y; g(T; \theta)\} \nabla_\theta g(T; \theta)]$ is differentiable w.r.t. $\theta$, and $\nabla_\theta E[\pi(T, X)\, m\{Y; g(T; \theta)\} \nabla_\theta g(T; \theta)] \big|_{\theta = \theta^*}$ is nonsingular.

Assumption 5. (i) $E\big[\sup_{\theta \in \Theta} |m\{Y; g(T; \theta)\}|^{2+\delta}\big] < \infty$ for some $\delta > 0$; (ii) the function class $\{m\{Y; g(T; \theta)\} : \theta \in \Theta\}$ satisfies
$$\left( E\left[ \sup_{\theta_2: \|\theta_2 - \theta_1\| < \delta} |m\{Y; g(T; \theta_2)\} - m\{Y; g(T; \theta_1)\}|^2 \right] \right)^{1/2} \leq a \cdot \delta^b$$
for any $\theta_1 \in \Theta$ and any small $\delta > 0$, and for some finite positive constants $a$ and $b \geq 1$.

Assumption 2 essentially says that the estimating equation is approximately satisfied a.s.; see Pakes and Pollard (1989). Assumption 3 is needed to bound the asymptotic variance of the test statistic. Assumptions 4 (i) and (ii) impose sufficient regularity conditions on both the link function $g$ and the residual function $m$. Assumption 4 (iii) ensures that the variance of the test statistic is finite. Assumption 5 is a stochastic equicontinuity condition, which is needed for establishing the weak convergence of our test statistic; see Andrews (1994).
Again, this is satisfied by widely used residual functions such as $m\{y; g(t; \theta)\} = y - g(t; \theta)$ and $m\{y; g(t; \theta)\} = \tau - 1\{y < g(t; \theta)\}$ discussed in Section 2.

To aid the presentation of the asymptotic properties of the test statistic, define the following quantities:
$$\phi(T_i, X_i; t) := \pi(T_i, X_i) \cdot H(T_i, t) \cdot E[m\{Y_i; g(T_i; \theta^*)\} \mid T_i, X_i] - E[\pi(T_i, X_i)\, m\{Y_i; g(T_i; \theta^*)\} \cdot H(T_i, t) \mid X_i],$$
and
$$\begin{aligned} \psi(T_i, X_i, Y_i; t) := {} & E\left[ \pi(T_i, X_i) \cdot \frac{\partial}{\partial g} E[m\{Y_i; g(T_i; \theta^*)\} \mid T_i, X_i] \cdot \nabla_\theta g(T_i; \theta^*)^\top H(T_i, t) \right] \\ & \times E\left[ \pi(T_i, X_i) \cdot \frac{\partial}{\partial g} E[m\{Y_i; g(T_i; \theta^*)\} \mid T_i, X_i] \cdot \nabla_\theta g(T_i; \theta^*) \nabla_\theta^\top g(T_i; \theta^*) \right]^{-1} \\ & \times \Big\{ \pi(T_i, X_i)\, m\{Y_i; g(T_i; \theta^*)\} \nabla_\theta g(T_i; \theta^*) - \pi(T_i, X_i) \nabla_\theta g(T_i; \theta^*) \cdot E[m\{Y_i; g(T_i; \theta^*)\} \mid T_i, X_i] \\ & \qquad + E[\pi(T_i, X_i) \nabla_\theta g(T_i; \theta^*)\, m\{Y_i; g(T_i; \theta^*)\} \mid X_i] \Big\}, \end{aligned}$$
and
$$\eta(T_i, X_i, Y_i; t) := U_i H(T_i, t) - \phi(T_i, X_i; t) - \psi(T_i, X_i, Y_i; t).$$
The next theorem establishes the weak convergence of $\widehat{J}_N(\cdot)$ and $\widehat{CM}_N$ under $H_0$.

Theorem 1.
Suppose that Assumptions 1-5 and Assumptions 6-9 listed in Appendix A hold; then under $H_0$,
(i) $\widehat{J}_N(t) = \frac{1}{\sqrt{N}} \sum_{i=1}^N \eta(T_i, X_i, Y_i; t) + o_P(1)$;
(ii) $\widehat{J}_N(\cdot)$ converges weakly to $J_\infty(\cdot)$ in $L^2\{\mathcal{T}, dF_T(t)\}$, where $J_\infty$ is a Gaussian process with zero mean and covariance function given by $\Sigma(t, t') = E\{\eta(T_i, X_i, Y_i; t)\, \eta(T_i, X_i, Y_i; t')\}$. Furthermore,
(iii) $\widehat{CM}_N$ converges to $\int \{J_\infty(t)\}^2 \, dF_T(t)$ in distribution.

The proof of Theorem 1 is relegated to Appendix B. Similar to Bierens and Ploberger (1997) and Chen and Fan (1999), it can be shown that $\int \{J_\infty(t)\}^2 \, dF_T(t)$ can be written as an infinite sum of weighted (independent) $\chi^2$ random variables with weights depending on the unknown distribution of $(T_i, X_i, Y_i)$. Hence, it is difficult to obtain the exact critical values. We suggest a simulation method to approximate the critical values of the null limiting distribution of $\widehat{CM}_N$; see Section 5.

The next theorem shows that the proposed test statistic is more efficient than the infeasible test statistic constructed by using the true $\pi(T, X)$. Suppose that $\pi(T, X)$ were known, and let $\widehat{\theta}_0$ be the estimator of $\theta^*$ constructed by using the true ratio function $\pi(T, X)$, defined as the solution of the following equation with respect to $\theta$:
$$\frac{1}{N} \sum_{i=1}^N \pi(T_i, X_i)\, m\{Y_i; g(T_i; \theta)\} \nabla_\theta g(T_i; \theta) = o_P\left(\frac{1}{\sqrt{N}}\right).$$
The infeasible test statistic for $H_0$ is then based on
$$\widehat{J}_0(t) = \frac{1}{\sqrt{N}} \sum_{i=1}^N \widehat{U}_{0,i} H(T_i, t), \quad \text{where } \widehat{U}_{0,i} = \pi(T_i, X_i)\, m\{Y_i; g(T_i; \widehat{\theta}_0)\}.$$
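Computationally, the feasible statistic $\widehat{CM}_N$ of Section 3 is cheap once the $\widehat{U}_i$'s are in hand, since $\widehat{J}_N(t)$ only needs to be evaluated at the sample points $t = T_j$. A minimal vectorized sketch, using hypothetical stand-in inputs rather than estimates from real data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-ins for the estimated quantities: the T_i's and
# U_hat_i = pi_hat_i * m{Y_i; g(T_i; theta_hat)} (hypothetical values).
N = 400
T = rng.normal(size=N)
U_hat = rng.normal(size=N)

def H(Ti, t):
    """Cosine-sine weight function H(T_i, t)."""
    return np.cos(t * Ti) + np.sin(t * Ti)

# J_hat_N(T_j) = N^{-1/2} sum_i U_hat_i H(T_i, T_j), for every j at once:
# entry [j, i] of the matrix below is H(T_i, T_j).
J_at_T = H(T[None, :], T[:, None]) @ U_hat / np.sqrt(N)

# CM_hat_N = N^{-1} sum_j J_hat_N(T_j)^2.
CM_hat = np.mean(J_at_T ** 2)
print(CM_hat)
```

The $N \times N$ kernel matrix makes the cost $O(N^2)$, which is negligible next to the weight estimation step for moderate samples.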
Let
$$\begin{aligned} \psi_0(T_i, X_i, Y_i; t) := {} & E\left[ \pi(T_i, X_i) \cdot \frac{\partial}{\partial g} E[m\{Y_i; g(T_i; \theta^*)\} \mid T_i, X_i] \cdot \nabla_\theta g(T_i; \theta^*)^\top H(T_i, t) \right] \\ & \times E\left[ \pi(T_i, X_i) \cdot \frac{\partial}{\partial g} E[m\{Y_i; g(T_i; \theta^*)\} \mid T_i, X_i] \cdot \nabla_\theta g(T_i; \theta^*) \nabla_\theta^\top g(T_i; \theta^*) \right]^{-1} \\ & \times \pi(T_i, X_i)\, m\{Y_i; g(T_i; \theta^*)\} \nabla_\theta g(T_i; \theta^*), \end{aligned}$$
and
$$\eta_0(T_i, X_i, Y_i; t) := U_i H(T_i, t) - \psi_0(T_i, X_i, Y_i; t).$$
The following theorem establishes the weak convergence of $\widehat{J}_0(\cdot)$ under $H_0$ and shows that the asymptotic variance of the proposed test statistic $\widehat{J}_N(t)$ is smaller than that of $\widehat{J}_0(t)$ for any $t \in \mathcal{T}$.

Theorem 2.
Suppose that Assumptions 3-5 hold; then under $H_0$,
(i) $\widehat{J}_0(t) = \frac{1}{\sqrt{N}} \sum_{i=1}^N \eta_0(T_i, X_i, Y_i; t) + o_P(1)$;
(ii) $\widehat{J}_0(\cdot)$ converges weakly to $J_{0,\infty}(\cdot)$ in $L^2\{\mathcal{T}, dF_T(t)\}$, where $J_{0,\infty}$ is a Gaussian process with zero mean and covariance function given by $\Sigma_0(t, t') = E\{\eta_0(T_i, X_i, Y_i; t)\, \eta_0(T_i, X_i, Y_i; t')\}$. Furthermore, $\Sigma_0(t, t) > \Sigma(t, t)$ for any $t \in \mathcal{T}$.

The proof of Theorem 2 is presented in Appendix C.

4.2 Special cases
This section discusses two important special continuous treatment effect models: the average and the quantile continuous treatment models. In the case of testing for the average dose-response model, that is,
$$H_0: \exists \text{ some } \theta^* \in \Theta \subset \mathbb{R}^p \text{ s.t. } E\{Y^*(t)\} = g(t; \theta^*) \text{ for all } t \in \mathcal{T}, \qquad (4.1)$$
against the alternative hypothesis
$$H_1: \nexists \text{ any } \theta \in \Theta \subset \mathbb{R}^p \text{ s.t. } E\{Y^*(t)\} = g(t; \theta) \text{ for all } t \in \mathcal{T},$$
we have $m\{Y^*(t); g(t; \theta^*)\} = Y^*(t) - g(t; \theta^*)$, $U_i^{\mathrm{ADRF}} = \pi(T_i, X_i)\{Y_i - g(T_i; \theta^*)\}$, and the test statistic for $H_0$ is
$$\widehat{CM}_N^{\mathrm{ADRF}} = \frac{1}{N} \sum_{i=1}^N \{\widehat{J}_N^{\mathrm{ADRF}}(T_i)\}^2, \quad \text{where } \widehat{J}_N^{\mathrm{ADRF}}(t) = \frac{1}{\sqrt{N}} \sum_{i=1}^N \widehat{U}_i^{\mathrm{ADRF}} H(T_i, t), \quad \widehat{U}_i^{\mathrm{ADRF}} = \widehat{\pi}_K(T_i, X_i)\{Y_i - g(T_i; \widehat{\theta})\}.$$
In this special case, the quantities $\phi(T_i, X_i; t)$, $\psi(T_i, X_i, Y_i; t)$, and $\eta(T_i, X_i, Y_i; t)$ in Theorem 1 become
$$\phi^{\mathrm{ADRF}}(T_i, X_i; t) := \pi(T_i, X_i) \cdot H(T_i, t) \cdot E\{Y_i - g(T_i; \theta^*) \mid T_i, X_i\} - E[\pi(T_i, X_i)\{Y_i - g(T_i; \theta^*)\} \cdot H(T_i, t) \mid X_i],$$
and
$$\begin{aligned} \psi^{\mathrm{ADRF}}(T_i, X_i, Y_i; t) := {} & E[\pi(T_i, X_i) \cdot \nabla_\theta g(T_i; \theta^*)^\top H(T_i, t)] \times E[\pi(T_i, X_i) \cdot \nabla_\theta g(T_i; \theta^*) \nabla_\theta^\top g(T_i; \theta^*)]^{-1} \\ & \times \Big\{ \pi(T_i, X_i) \nabla_\theta g(T_i; \theta^*) Y_i - \pi(T_i, X_i) \nabla_\theta g(T_i; \theta^*) \cdot E(Y_i \mid T_i, X_i) \\ & \qquad + E[\pi(T_i, X_i) \nabla_\theta g(T_i; \theta^*)\{Y_i - g(T_i; \theta^*)\} \mid X_i] \Big\}, \end{aligned}$$
and
$$\eta^{\mathrm{ADRF}}(T_i, X_i, Y_i; t) := U_i^{\mathrm{ADRF}} H(T_i, t) - \phi^{\mathrm{ADRF}}(T_i, X_i; t) - \psi^{\mathrm{ADRF}}(T_i, X_i, Y_i; t).$$
Then Theorem 1 implies the following result.

Corollary 3.
Suppose that Assumptions 1-3 and Assumptions 6-9 listed in Appendix A hold; then under $H_0$,
(i) $\widehat{J}_N^{\mathrm{ADRF}}(\cdot)$ converges weakly to $J_\infty^{\mathrm{ADRF}}(\cdot)$ in $L^2\{\mathcal{T}, dF_T(t)\}$, where $J_\infty^{\mathrm{ADRF}}$ is a Gaussian process with zero mean and covariance function given by $\Sigma^{\mathrm{ADRF}}(t, t') = E\{\eta^{\mathrm{ADRF}}(T_i, X_i, Y_i; t)\, \eta^{\mathrm{ADRF}}(T_i, X_i, Y_i; t')\}$. Furthermore,
(ii) $\widehat{CM}_N^{\mathrm{ADRF}}$ converges to $\int \{J_\infty^{\mathrm{ADRF}}(t)\}^2 \, dF_T(t)$ in distribution.

In the case of testing for the quantile dose-response model, that is,
$$H_0: \exists \text{ some } \theta^* \in \Theta \subset \mathbb{R}^p \text{ s.t. } F_{Y^*(t)}^{-1}(\tau) = g(t; \theta^*) \text{ for all } t \in \mathcal{T}, \qquad (4.2)$$
against the alternative hypothesis
$$H_1: \nexists \text{ any } \theta \in \Theta \subset \mathbb{R}^p \text{ s.t. } F_{Y^*(t)}^{-1}(\tau) = g(t; \theta) \text{ for all } t \in \mathcal{T},$$
we have $m\{Y^*(t); g(t; \theta^*)\} = \tau - 1\{Y^*(t) < g(t; \theta^*)\}$ and $U_i^{\mathrm{QDRF}} = \pi(T_i, X_i)[\tau - 1\{Y_i < g(T_i; \theta^*)\}]$, with the statistic $\widehat{CM}_N^{\mathrm{QDRF}}$ and the limit quantities $\phi^{\mathrm{QDRF}}$, $\psi^{\mathrm{QDRF}}$, and $\eta^{\mathrm{QDRF}}$ defined analogously to the ADRF case.

Corollary 4. Suppose that Assumptions 1-3 and Assumptions 6-9 listed in Appendix A hold; then under $H_0$,
(i) $\widehat{J}_N^{\mathrm{QDRF}}(\cdot)$ converges weakly to $J_\infty^{\mathrm{QDRF}}(\cdot)$ in $L^2\{\mathcal{T}, dF_T(t)\}$, where $J_\infty^{\mathrm{QDRF}}$ is a Gaussian process with zero mean and covariance function given by $\Sigma^{\mathrm{QDRF}}(t, t') = E\{\eta^{\mathrm{QDRF}}(T_i, X_i, Y_i; t)\, \eta^{\mathrm{QDRF}}(T_i, X_i, Y_i; t')\}$. Furthermore,
(ii) $\widehat{CM}_N^{\mathrm{QDRF}}$ converges to $\int \{J_\infty^{\mathrm{QDRF}}(t)\}^2 \, dF_T(t)$ in distribution.

This section studies the asymptotic distribution of $\widehat{J}_N(\cdot)$ under the fixed and Pitman local alternatives. The Pitman local alternative is given by
$$H_L: E\left[ m\left\{ Y^*(t); g(t; \theta_N^*) + \frac{1}{\sqrt{N}} \delta(t) \right\} \right] = 0 \text{ for some } \theta_N^* \in \Theta \text{ and all } t \in \mathcal{T},$$
where $\int \{\delta(t)\}^2 \, dF_T(t) < \infty$.
With Assumption 1, $H_L$ can be represented as
\[
H_L:\ E\Big[\pi(T,X)\,m\Big\{Y;\,g(T;\theta_N^*)+\tfrac{1}{\sqrt{N}}\,\delta(T)\Big\}\,\Big|\,T=t\Big]=0 \text{ for some } \theta_N^*\in\Theta \text{ and all } t\in\mathcal{T},
\]
which deviates from the null model at the rate $O_p(N^{-1/2})$. Let $\theta^*$ be the limit of $\theta_N^*$ as $N\to\infty$; it solves
\[
E\big[\pi(T,X)\,m\{Y;g(T;\theta^*)\}\,\big|\,T=t\big]=0 \text{ for all } t\in\mathcal{T}.
\]
Define
\[
\mu(t):=E\Big[\pi(T_i,X_i)\cdot\frac{\partial}{\partial g}E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*)^\top H(T_i,t)\Big]
\]
\[
\times E\Big[\pi(T_i,X_i)\cdot\frac{\partial}{\partial g}E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*)\nabla_\theta^\top g(T_i;\theta^*)\Big]^{-1}
\]
\[
\times E\Big[\pi(T_i,X_i)\cdot\frac{\partial}{\partial g}E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot\delta(T_i)\cdot\nabla_\theta g(T_i;\theta^*)\Big].
\]
The following theorem gives the asymptotic distribution of $\widehat{J}_N(\cdot)$ under the local alternative $H_L$ and the fixed alternative $H_1$.

Theorem 5. Suppose that Assumptions 1-5 and Assumptions 6-9 in Appendix A hold. Under the local alternative hypothesis $H_L$,
\[
(i)\quad \widehat{J}_N(t)=\frac{1}{\sqrt{N}}\sum_{i=1}^N\eta(T_i,X_i,Y_i;t)+\mu(t)+o_P(1); \quad (4.3)
\]
(ii) $\widehat{J}_N(\cdot)$ converges weakly to $J_{\infty,\mu}(\cdot)$ in $L^2\{\mathcal{T},dF_T(t)\}$, where $J_{\infty,\mu}$ is a Gaussian process with mean function $\mu(t)$ and covariance function $\Sigma(t,t')=E\{\eta(T_i,X_i,Y_i;t)\,\eta(T_i,X_i,Y_i;t')\}$. Under the fixed alternative $H_1$, (iii) $\frac{1}{\sqrt{N}}\widehat{J}_N(\cdot)$ converges in probability to $\mu(\cdot)$ in $L^2(\mathcal{T},dt)$, where $\mu(t):=E[\pi(T_i,X_i)\,m\{Y_i;g(T_i;\theta^*)\}\,H(T_i,t)]$.

Comparing Theorem 5(ii) with Theorem 1(ii), we see that our test statistic can detect local alternatives that deviate from the null model at the rate $O_p(N^{-1/2})$.
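In practice, computing the statistic itself is straightforward once the estimated weights and the fitted parameter are in hand; the harder part is obtaining critical values, which the next section addresses. The following is a minimal NumPy sketch for the ADRF residual; the logistic weight form $H(s,t)=1/\{1+e^{c(s-t)}\}$ and all inputs (`pi_hat`, `theta_hat`, the toy data) are illustrative placeholders, not the paper's exact implementation:

```python
import numpy as np

def cm_statistic(T, Y, pi_hat, g, theta_hat, H):
    """CM_N = (1/N) * sum_j J_N(T_j)^2 for the ADRF residual
    m{Y; g(T; theta)} = Y - g(T; theta)."""
    N = len(T)
    U_hat = pi_hat * (Y - g(T, theta_hat))   # U_i = pi_K(T_i, X_i){Y_i - g(T_i; theta_hat)}
    # J_N(t) = N^{-1/2} sum_i U_i H(T_i, t), evaluated at each observed T_j
    Hmat = H(T[:, None], T[None, :])         # Hmat[i, j] = H(T_i, T_j)
    J = (U_hat @ Hmat) / np.sqrt(N)          # J[j] = J_N(T_j)
    return np.mean(J ** 2)

# Toy usage with an assumed logistic weight H(s, t) = 1/(1 + exp(c (s - t))), c = 5,
# uniform weights standing in for pi_hat, and the true linear parameters for theta_hat.
rng = np.random.default_rng(0)
T = rng.uniform(0.0, 1.0, 200)
Y = 1.0 + 2.0 * T + rng.normal(size=200)
g = lambda t, th: th[0] + th[1] * t
H = lambda s, t: 1.0 / (1.0 + np.exp(5.0 * (s - t)))
stat = cm_statistic(T, Y, np.ones(200), g, np.array([1.0, 2.0]), H)
```

The statistic is a nonnegative quadratic functional of the weighted residuals, so large values indicate misspecification of $g$.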
We know from Theorem 1 that $\widehat{CM}_N$ converges in distribution to $\int\{J_\infty(t)\}^2\,dF_T(t)$. Using techniques similar to those in Bierens and Ploberger (1997) and Chen and Fan (1999), one can show that $\int\{J_\infty(t)\}^2\,dF_T(t)$ is an infinite sum of weighted (independent) $\chi^2_1$ random variables, where the weights depend on the unknown distribution of the $(X_i,T_i,Y_i)$'s (see also Li, Hsiao, and Zinn, 2003). Obtaining the exact critical values is therefore difficult, and we propose a simulation method to approximate the null limiting distribution. The method is a special case of the exchangeable bootstrap (Praestgaard and Wellner, 1993, van der Vaart and Wellner, 1996, Chernozhukov, Fernández-Val, and Melly, 2013, Donald and Hsu, 2014). Specifically, we first generate $B$ sets of $N$ independent standard normal random variables $w_{1,b},\ldots,w_{N,b}$, for $b=1,\ldots,B$ and $B$ a large enough integer. We then define
\[
\widehat{J}^*_{N,b}(t)=\frac{1}{\sqrt{N}}\sum_{i=1}^N w_{i,b}\,\widehat{\eta}(T_i,X_i,Y_i;t), \quad (5.1)
\]
where $\widehat{\eta}(T_i,X_i,Y_i;t)=\widehat{U}_iH(T_i,t)-\widehat{\phi}(T_i,X_i;t)-\widehat{\psi}(T_i,X_i,Y_i;t)$, with $\widehat{\phi}(T_i,X_i;t)$ and $\widehat{\psi}(T_i,X_i,Y_i;t)$ consistent nonparametric plug-in estimators of $\phi(T_i,X_i;t)$ and $\psi(T_i,X_i,Y_i;t)$ defined in Theorem 1, for example the additive penalized spline estimator (see, e.g., Ruppert, Wand, and Carroll, 2003) or the series estimator used in Donald and Hsu (2014). It is easy to see that $E^*\{w_{i,b}\,\widehat{\eta}(T_i,X_i,Y_i;t)\}=0$ and
\[
E^*\big\{w_{i,b}^2\,\widehat{\eta}(T_i,X_i,Y_i;t)\,\widehat{\eta}(T_i,X_i,Y_i;t')\big\}=\widehat{\eta}(T_i,X_i,Y_i;t)\,\widehat{\eta}(T_i,X_i,Y_i;t'),
\]
for $i=1,\ldots,N$, $b=1,\ldots,B$, and all $t,t'\in\mathcal{T}$, where $E^*\{\cdot\}$ denotes expectation conditional on the data $(T_i,X_i,Y_i)_{i=1}^N$.
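In code, the multiplier scheme in (5.1) amounts to reweighting the estimated influence function $\widehat{\eta}$ with i.i.d. standard normal draws. A minimal sketch, assuming `eta_hat` is a precomputed $N\times N$ matrix with `eta_hat[i, j]` $=\widehat{\eta}(T_i,X_i,Y_i;T_j)$ (how $\widehat{\eta}$ is estimated is described in the text, not here):

```python
import numpy as np

def bootstrap_pvalue(eta_hat, cm_stat, B=500, seed=0):
    """Approximate the null distribution of CM_N by the multiplier bootstrap:
    J*_{N,b}(T_j) = N^{-1/2} sum_i w_{i,b} eta_hat[i, j],
    CM*_{N,b} = (1/N) sum_j J*_{N,b}(T_j)^2."""
    N = eta_hat.shape[0]
    rng = np.random.default_rng(seed)
    cm_boot = np.empty(B)
    for b in range(B):
        w = rng.standard_normal(N)           # multiplier weights w_{1,b}, ..., w_{N,b}
        J_star = (w @ eta_hat) / np.sqrt(N)  # J*_{N,b} evaluated at each T_j
        cm_boot[b] = np.mean(J_star ** 2)
    # p-value: fraction of bootstrap statistics at least as large as CM_N
    return np.mean(cm_boot >= cm_stat)
```

Only the normal weights are redrawn across the $B$ replications; the data, and hence `eta_hat`, stay fixed, which is what makes the scheme cheap relative to refitting.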
Because $\widehat{\eta}$ is a consistent estimator of $\eta$, $\widehat{J}^*_{N,b}(\cdot)$ has the same limiting process as $\widehat{J}_N(\cdot)$ for $b=1,\ldots,B$. We can therefore approximate the limiting distribution of $\widehat{CM}_N$ under $H_0$ by
\[
\widehat{CM}{}^*_{N,b}=\frac{1}{N}\sum_{i=1}^N\big\{\widehat{J}^*_{N,b}(T_i)\big\}^2,\quad b=1,\ldots,B.
\]
That is, we can approximate the $p$-value by $B^{-1}\sum_{b=1}^B\mathbb{1}\big(\widehat{CM}{}^*_{N,b}\geq\widehat{CM}_N\big)$.

Choice of the smoothing parameters $K_1$ and $K_2$

The large-sample properties of the proposed estimator hold for a range of values of $K_1$ and $K_2$. This presents a dilemma for applied researchers, who have only one finite sample: too little smoothing yields a large variance, and too much smoothing yields a large bias. Applied researchers would therefore like some guidance on the choice of $K_1$ and $K_2$. In this section, we propose a cross-validation method for choosing these smoothing parameters. Specifically, we split the data set into $F$ folds (say $F=5$ or $10$) and select the $K_1$ and $K_2$ that minimize
\[
CV(K_1,K_2)=\sum_{j=1}^F\sum_{k\in S_j}\Big[\widehat{\pi}^{(-j)}_K(T_k,X_k)\,m\big\{Y_k;g\big(T_k;\widehat{\theta}^{(-j)}\big)\big\}\Big]^2, \quad (6.1)
\]
where $S_j$ denotes the $j$th fold of the data $(T,X,Y)$ and, for $j=1,\ldots,F$,
\[
\widehat{\theta}^{(-j)}=\arg\min_{\theta}\Big\|\sum_{i\notin S_j}\widehat{\pi}^{(-j)}_K(T_i,X_i)\,m\{Y_i;g(T_i;\theta)\}\,\nabla_\theta g(T_i;\theta)\Big\|^2,
\]
with $\widehat{\pi}^{(-j)}_K(T_i,X_i)$ obtained in the same way as introduced in Section 2 via (3.4) and (3.5), but without using the individuals in $S_j$.

To assess the performance of our goodness-of-fit test, we conducted Monte Carlo simulation studies on the following four data generating processes (DGPs):

DGP0-L: $T = 1 + 0.X + \xi$, and $Y = 1 + X + T + \epsilon$,
DGP0-NL: $T = 0.
X + \xi$, and $Y = X + T + \epsilon$,
DGP1-L: $T = 1 + 0.X + \xi$, and $Y = 1 + X + T + \epsilon$,
DGP1-NL: $T = 0.X + \xi$, and $Y = X + T + \epsilon$,

where $\xi$ and $\epsilon$ are independent standard normal random variables and $X$ is a uniformly distributed random variable. For all four scenarios, we considered the two-sided hypothesis test in (2.1), with $m\{Y^*(t);g(t;\theta^*)\}=Y^*(t)-g(t;\theta^*)$ (average) and $m\{Y^*(t);g(t;\theta^*)\}=0.5-\mathbb{1}\{Y^*(t)<g(t;\theta^*)\}$ (median), and $g\{t;(\theta^*_1,\theta^*_2)\}=\theta^*_1+\theta^*_2\,t$. Clearly, $H_0$ is true for DGP0-L and DGP0-NL but fails for DGP1-L and DGP1-NL.

For each case, we generated 1000 samples of sizes 100, 200, and 500. The number of samples for the simulation-based approximation of the limiting process is $B=500$, and the number of folds in the cross-validation (6.1) was taken to be $F=10$. We compared the three commonly used weight functions $H$ mentioned in Section 2: the logistic, the cosine-sine, and the indicator weight functions. For the logistic weight function, we took the constant $c=5$.

Table 1: Estimated sizes

                          Logistic              Cosine-Sine           Indicator
m(·)     Model     N      1%    5%    10%      1%    5%    10%      1%    5%    10%
Average  DGP0-L    100    0.017 0.078 0.133    0.017 0.076 0.132    0.011 0.062 0.113
                   200    0.013 0.056 0.112    0.014 0.055 0.113    0.005 0.058 0.125
                   500    0.013 0.051 0.101    0.010 0.047 0.099    0.013 0.053 0.106
         DGP0-NL   100    0.028 0.096 0.174    0.016 0.081 0.135    0.014 0.062 0.110
                   200    0.019 0.084 0.140    0.007 0.062 0.127    0.010 0.059 0.120
                   500    0.018 0.070 0.115    0.011 0.053 0.113    0.015 0.052 0.111
Median   DGP0-L    100    0.028 0.101 0.162    0.019 0.071 0.133    0.026 0.077 0.144
                   200    0.018 0.074 0.142    0.016 0.063 0.117    0.012 0.066 0.120
                   500    0.014 0.059 0.117    0.007 0.064 0.117    0.012 0.052 0.117
         DGP0-NL   100    0.063 0.148 0.233    0.016 0.082 0.136    0.018 0.074 0.148
                   200    0.032 0.100 0.159    0.014 0.060 0.115    0.016 0.063 0.129
                   500    0.018 0.069 0.128    0.009 0.056 0.111    0.013 0.057 0.107

Tables 1 and 2 summarize the empirical rejection probabilities computed at significance levels 1%, 5%, and 10% for each case, which respectively show the estimated sizes (DGP0-L and DGP0-NL) and
the estimated powers (DGP1-L and DGP1-NL) of our test method. We can see from Table 1 that the estimated sizes of our method with the cosine-sine and indicator weight functions are quite close to the nominal sizes from $N=100$ to $N=500$ in all cases. The estimated sizes obtained with the logistic weight function are noticeably inflated when the sample size is small, especially in the nonlinear-$X$ cases, but they too improve as the sample size increases and are close to the nominal sizes when $N=500$. From Table 2, we observe that all tests are quite powerful even when $N=100$. Overall, the simulation studies confirm our asymptotic theorems and show that, in practice, the cosine-sine and indicator weight functions may perform better than the logistic one in nonlinear-$X$ cases.

In this section, we apply our method to examine the model assumption made on the U.S. presidential campaign data in Ai, Linton, Motegi, and Zhang (2021). The data have been analyzed extensively in the treatment effect literature (Urban and Niebler, 2014, Fong, Hazlett, and Imai, 2018), where the interest is in the causal relationship between advertising and campaign contributions.
The treatment of interest is the number of political advertisements aired in each zip code in non-competitive states, which ranges from 0 to 22379 across $N=16265$ zip codes.

Table 2: Estimated powers

                          Logistic              Cosine-Sine           Indicator
m(·)     Model     N      1%    5%    10%      1%    5%    10%      1%    5%    10%
Average  DGP1-L    100    0.999 0.999 0.999    0.999 0.999 1        0.998 1     1
                   200    1     1     1        1     1     1        1     1     1
         DGP1-NL   100    0.995 1     1        0.998 0.998 1        0.998 0.999 1
                   200    1     1     1        1     1     1        1     1     1
Median   DGP1-L    100    1     1     1        1     1     1        1     1     1
                   200    1     1     1        1     1     1        1     1     1
         DGP1-NL   100    0.961 0.989 0.995    0.953 0.985 0.996    0.964 0.983 0.995
                   200    0.999 1     1        1     1     1        1     1     1

The data were first analyzed by Urban and Niebler (2014), who used a binary model to compare the campaign contributions of the 5230 zip codes that received more than 1000 advertisements with those of the other 11035 zip codes that received fewer than 1000 advertisements. Their research suggested that advertising in non-competitive states had a significant causal effect on the level of campaign contributions.

By contrast, Ai, Linton, Motegi, and Zhang (2021) treated the treatment variable (the number of political advertisements) as continuous and assumed that
\[
E\{Y^*(t)\}=\beta_0+\beta_1 t+\beta_2 t^2,
\]
where the observed outcome is $Y^*(T)=\log(\text{Contribution}+1)$ and $T=\log(\text{Advertisements}+1)$. The covariates $X$ considered were
\[
X=\big(\log(\text{Population}),\ \%\text{Age over 65},\ \log(\text{Median Income}),\ \%\text{Hispanic},\ \%\text{Black},\ \log(\text{Population density}+1),\ \%\text{College graduates},\ \mathbb{1}(\text{Can commute to a competitive state})\big).
\]
The definition of each covariate is almost self-explanatory; one can refer to Fong, Hazlett, and Imai (2018) for more details. Ai, Linton, Motegi, and Zhang (2021) found
Figure 1: Histograms of the original campaign contribution data (top left) and the Box-Cox transformed contributions (bottom left), and scatter plots of the original campaign contribution data (top right) and the Box-Cox transformed contributions (bottom right) against the log-transformed number of political advertisements.

that the 95% confidence intervals for $\beta_1$ and $\beta_2$ both contained zero, indicating that no significant causal link between advertising and campaign contributions was found from the linear model. Similar results were reported by Fong, Hazlett, and Imai (2018). The authors concluded that such opposing results from the binary model and the continuous linear model suggest a rather complex relationship between advertising and campaign contributions.

We reached the same conclusion in our data analysis. First, we examined the histogram of the original campaign contribution data and the scatter plot of the campaign contributions against the log-transformed number of advertisements $T$. From the first row of Figure 1, we can see that the campaign contribution data are highly right-skewed both unconditionally and conditionally on $T$; that is, they are unlikely to fit any linear model. We then applied a log-transformation to the contribution data, as in Ai, Linton, Motegi, and Zhang (2021), but the results were similar.

Table 3: Estimated power with $J=100$ subsamples and $B=500$ bootstrap replications, from subsamples of the U.S. presidential campaign data

           Logistic              Cosine-Sine           Indicator
N          1%    5%    10%      1%    5%    10%      1%    5%    10%
200        0.67  0.83  0.90     0.72  0.84  0.91     0.64  0.83  0.85
500        0.71  0.84  0.91     0.73  0.84  0.91     0.65  0.81  0.87
1000       0.93  0.96  0.98     0.91  0.97  0.98     0.90  0.96  0.98

To make the data more amenable to a linear model, we searched across Box-Cox transformations of the form $\{(\text{Contribution}+1)^\lambda-1\}/\lambda$ with respect to $\lambda$ to find a transformation of the contributions whose sample quantiles have the largest correlation with those of a standard normal distribution. This yielded $\lambda=-0.434$.
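The $\lambda$ search just described can be sketched as a grid search maximizing the correlation between the sorted transformed data and standard normal quantiles (a Q-Q plot straightness criterion). The grid, seed, and toy data below are illustrative assumptions, not the values used for the campaign data:

```python
import numpy as np
from statistics import NormalDist

def boxcox(y, lam):
    """Box-Cox transform {(y + 1)^lam - 1} / lam, with the log(y + 1) limit at lam = 0."""
    ly = np.log1p(y)
    if abs(lam) < 1e-12:
        return ly
    return np.expm1(lam * ly) / lam   # expm1 keeps the small-lam case numerically stable

def best_lambda(y, grid):
    """Pick the lambda whose transformed sample quantiles correlate most strongly
    with standard normal quantiles."""
    n = len(y)
    probs = (np.arange(1, n + 1) - 0.5) / n
    z = np.array([NormalDist().inv_cdf(p) for p in probs])   # normal quantiles
    best, best_corr = None, -np.inf
    for lam in grid:
        q = np.sort(boxcox(y, lam))                          # sample quantiles
        corr = np.corrcoef(q, z)[0, 1]
        if corr > best_corr:
            best, best_corr = lam, corr
    return best

# Toy usage on right-skewed data
rng = np.random.default_rng(1)
y = rng.lognormal(mean=0.0, sigma=1.0, size=500)
lam_hat = best_lambda(y, np.linspace(-1.0, 1.0, 81))
```

The same criterion underlies normal Q-Q correlation tests of normality: a transformation that straightens the Q-Q plot maximizes this correlation.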
The histogram and scatter plot of the Box-Cox-transformed contribution data are shown in the second row of Figure 1. We can see that the transformed data are still highly right-skewed unconditionally; however, given $T$, the scatter plot no longer shows as much skewness.

We then applied our method, with the logistic, cosine-sine, and indicator weight functions and a simulation-based approximation with $B=500$, to the Box-Cox-transformed data to test whether they fit a linear model. Following Ai, Linton, Motegi, and Zhang (2021), we took $g(t;\theta)=\theta_0+\theta_1 t+\theta_2 t^2$, $u_{K_1}(T)=(1,T,T^2)^\top$, and $v_{K_2}(X)=(1,X^\top)^\top$ for estimating $\pi$. Unsurprisingly, all tests rejected the null hypothesis of a simple linear model. This leads to the same conclusion as Ai, Linton, Motegi, and Zhang (2021): the relationship between advertisements and campaign contributions is rather complex, or there are confounding variables not included in $X$.

Finally, we treated the full sample of $N=16265$ zip codes as a population, knowing that the linear model is not true for this population, and randomly drew subsamples to assess the power of our test. When we tried sample sizes larger than 1500, nearly all tests rejected the null hypothesis 100% of the time. Table 3 reports the estimated power of our tests computed from 100 random subsamples of sizes 200, 500, and 1000. We can see that all tests perform similarly and powerfully.

Acknowledgements

Wei Huang's research was supported by the Professor Maurice H. Belz Fund of the University of Melbourne. Oliver Linton acknowledges Cambridge INET for financial support. Zheng Zhang acknowledges financial support from the National Natural Science Foundation of China through project 12001535 and the fund for building world-class universities (disciplines) of Renmin University of China.

References

Abadie, A., and M. D.
Cattaneo (2018): “Econometric methods for program evaluation,” Annual Review of Economics, 10, 465-503.

Abrevaya, J., Y.-C. Hsu, and R. P. Lieli (2015): “Estimating conditional average treatment effects,” Journal of Business & Economic Statistics, 33(4), 485-505.

Ai, C., O. Linton, K. Motegi, and Z. Zhang (2021): “A Unified Framework for Efficient Estimation of General Treatment Models,” Quantitative Economics, forthcoming.

Aït-Sahalia, Y., P. J. Bickel, and T. M. Stoker (2001): “Goodness-of-fit tests for kernel regression with an application to option implied volatilities,” Journal of Econometrics, 105(2), 363-412.

Andrews, D. W. K. (1994): “Empirical Process Methods in Econometrics,” in Handbook of Econometrics, ed. by R. F. Engle and D. L. McFadden, vol. 4, chap. 37, pp. 2247-2294.

Ao, W., S. Calonico, and Y.-Y. Lee (2021): “Multivalued treatments and decomposition analysis: An application to the WIA program,” Journal of Business & Economic Statistics, 39(1), 358-371.

Athey, S., G. W. Imbens, and S. Wager (2018): “Approximate residual balancing: debiased inference of average treatment effects in high dimensions,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4), 597-623.

Bierens, H. J. (1982): “Consistent model specification tests,” Journal of Econometrics, 20(1), 105-134.

Bierens, H. J. (1990): “A consistent conditional moment test of functional form,” Econometrica, pp. 1443-1458.

Bierens, H. J., and W. Ploberger (1997): “Asymptotic theory of integrated conditional moment tests,” Econometrica, pp. 1129-1151.

Buchinsky, M. (1995): “Quantile regression, Box-Cox transformation model, and the U.S. wage structure, 1963-1987,” Journal of Econometrics, 65(1), 109-154.

Cattaneo, M. D.
(2010): “Efficient semiparametric estimation of multi-valued treatment effects under ignorability,” Journal of Econometrics, 155(2), 138-154.

Chan, K. C. G., S. C. P. Yam, and Z. Zhang (2016): “Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(3), 673-700.

Chen, X. (2007): “Large sample sieve estimation of semi-nonparametric models,” Handbook of Econometrics, 6(B), 5549-5632.

Chen, X., and Y. Fan (1999): “Consistent hypothesis testing in semiparametric and nonparametric models for econometric time series,” Journal of Econometrics, 91, 373-401.

Chen, Y.-T., Y.-C. Hsu, and H.-J. Wang (2020): “A stochastic frontier model with endogenous treatment status and mediator,” Journal of Business & Economic Statistics, 38(2), 243-256.

Chernozhukov, V., I. Fernández-Val, and B. Melly (2013): “Inference on counterfactual distributions,” Econometrica, 81(6), 2205-2268.

Colangelo, K., and Y.-Y. Lee (2020): “Double debiased machine learning nonparametric inference with continuous treatments,” arXiv preprint arXiv:2004.03036.

Crump, R. K., V. J. Hotz, G. W. Imbens, and O. A. Mitnik (2008): “Nonparametric tests for treatment effect heterogeneity,” The Review of Economics and Statistics, 90(3), 389-405.

Donald, S. G., and Y.-C. Hsu (2014): “Estimation and inference for distribution functions and quantile functions in treatment effect models,” Journal of Econometrics, 178(3), 383-397.

Donald, S. G., Y.-C. Hsu, and R. P. Lieli (2014): “Testing the unconfoundedness assumption via inverse probability weighted estimators of (L)ATT,” Journal of Business & Economic Statistics, 32(3), 395-415.

Dong, Y., Y.-Y. Lee, and M. Gou (2019): “Regression discontinuity designs with a continuous treatment,” Available at SSRN 3167541.

Fan, Q., Y.-C. Hsu, R. P. Lieli, and Y.
Zhang (2020): “Estimation of conditional average treatment effects with high-dimensional data,” Journal of Business & Economic Statistics, pp. 1-15.

Fan, Y., and Q. Li (1996): “Consistent model specification tests: omitted variables and semiparametric functional forms,” Econometrica, pp. 865-890.

Fan, Y., and Q. Li (2000): “Consistent model specification tests: Kernel-based tests versus Bierens' ICM tests,” Econometric Theory, pp. 1016-1041.

Firpo, S. (2007): “Efficient Semiparametric Estimation of Quantile Treatment Effects,” Econometrica, 75(1), 259-276.

Fong, C., C. Hazlett, and K. Imai (2018): “Covariate Balancing Propensity Score for a Continuous Treatment: Application to the Efficacy of Political Advertisements,” Annals of Applied Statistics, 12(1), 156-177.

Galvao, A. F., and L. Wang (2015): “Uniformly semiparametric efficient estimation of treatment effects with a continuous treatment,” Journal of the American Statistical Association, 110(512), 1528-1542.

Hahn, J. (1998): “On the role of the propensity score in efficient semiparametric estimation of average treatment effects,” Econometrica, 66(2), 315-331.

Hirano, K., and G. W. Imbens (2004): “The propensity score with continuous treatments,” in Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, ed. by A. Gelman and X.-L. Meng, chap. 7, pp. 73-84. John Wiley & Sons Ltd.

Hirano, K., G. W. Imbens, and G. Ridder (2003): “Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score,” Econometrica, 71(4), 1161-1189.

Hsu, Y.-C., T.-C. Lai, and R. P. Lieli (2020): “Counterfactual Treatment Effects: Estimation and Inference,” Journal of Business & Economic Statistics, pp. 1-16.

Huber, M., Y.-C. Hsu, Y.-Y. Lee, and L. Lettry (2020): “Direct and indirect effects of continuous treatments based on generalized propensity score weighting,” Journal of Applied Econometrics, 35(7), 814-840.

Imai, K., and D. A.
van Dyk (2004): “Causal inference with general treatment regimes: Generalizing the propensity score,” Journal of the American Statistical Association, 99(467), 854-866.

Imbens, G. W., and J. M. Wooldridge (2009): “Recent developments in the econometrics of program evaluation,” Journal of Economic Literature, 47(1), 5-86.

Kennedy, E. H., Z. Ma, M. D. McHugh, and D. S. Small (2017): “Non-parametric methods for doubly robust estimation of continuous treatment effects,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4), 1229-1245.

Lee, Y.-Y. (2018): “Efficient propensity score regression estimators of multivalued treatment effects for the treated,” Journal of Econometrics, 204(2), 207-222.

Li, Q. (1999): “Consistent model specification tests for time series econometric models,” Journal of Econometrics, 92(1), 101-147.

Li, Q., C. Hsiao, and J. Zinn (2003): “Consistent specification tests for semiparametric/nonparametric models based on series estimation methods,” Journal of Econometrics, 112(2), 295-325.

Newey, W. K. (1997): “Convergence Rates and Asymptotic Normality for Series Estimators,” Journal of Econometrics, 79(1), 147-168.

Pakes, A., and D. Pollard (1989): “Simulation and the asymptotics of optimization estimators,” Econometrica, 57(5), 1027-1057.

Praestgaard, J., and J. A. Wellner (1993): “Exchangeably weighted bootstraps of the general empirical process,” The Annals of Probability, 21(4), 2053-2086.

Robins, J. M., M. A. Hernán, and B. Brumback (2000): “Marginal structural models and causal inference in epidemiology,” Epidemiology, 11(5), 550-560.

Rosenbaum, P. R., and D. B. Rubin (1983): “The central role of the propensity score in observational studies for causal effects,” Biometrika, 70(1), 41-55.

Rosenbaum, P. R., and D. B. Rubin (1984): “Reducing bias in observational studies using subclassification on the propensity score,” Journal of the American Statistical Association, 79(387), 516-524.

Ruppert, D., M.
P. Wand, and R. J. Carroll (2003): Semiparametric Regression. Cambridge University Press.

Stinchcombe, M. B., and H. White (1998): “Consistent specification testing with nuisance parameters present only under the alternative,” Econometric Theory, 14(3), 295-325.

Stute, W. (1997): “Nonparametric model checks for regression,” The Annals of Statistics, pp. 613-641.

Urban, C., and S. Niebler (2014): “Dollars on the Sidewalk: Should U.S. Presidential Candidates Advertise in Uncontested States?,” American Journal of Political Science, 58(2), 322-336.

van der Vaart, A. W. (1998): Asymptotic Statistics. Cambridge University Press.

van der Vaart, A. W., and J. A. Wellner (1996): Weak Convergence and Empirical Processes with Applications to Statistics. Springer.

Zheng, J. X. (1996): “A consistent test of functional form via nonparametric estimation techniques,” Journal of Econometrics, 75(2), 263-289.

Appendix

A Some preliminary results

We recall some preliminary results which have been established in Ai, Linton, Motegi, and Zhang (2021). The following conditions are inherited from that paper.

Assumption 6. (i) The support $\mathcal{X}$ of $X$ is a compact subset of $\mathbb{R}^r$. The support $\mathcal{T}$ of the treatment variable $T$ is a compact subset of $\mathbb{R}$. (ii) There exist two positive constants $\eta_1$ and $\eta_2$ such that $0<\eta_1\leq\pi(t,x)\leq\eta_2<\infty$ for all $(t,x)\in\mathcal{T}\times\mathcal{X}$.

Assumption 7. There exist $\Lambda_{K_1\times K_2}\in\mathbb{R}^{K_1\times K_2}$ and a positive constant $\alpha>0$ such that
\[
\sup_{(t,x)\in\mathcal{T}\times\mathcal{X}}\big|\rho'^{-1}\{\pi(t,x)\}-u_{K_1}(t)^\top\Lambda_{K_1\times K_2}v_{K_2}(x)\big|=O(K^{-\alpha}),
\]
where $\rho(u)=-\exp(-u-1)$ and $\rho'^{-1}$ is the inverse function of $\rho'$.

Assumption 8. (i) For every $K_1$ and $K_2$, the smallest eigenvalues of $E\{u_{K_1}(T)u_{K_1}(T)^\top\}$ and $E\{v_{K_2}(X)v_{K_2}(X)^\top\}$ are bounded away from zero uniformly in $K_1$ and $K_2$.
(ii) There are two sequences of constants $\zeta_1(K_1)$ and $\zeta_2(K_2)$ satisfying $\sup_{t\in\mathcal{T}}\|u_{K_1}(t)\|\leq\zeta_1(K_1)$ and $\sup_{x\in\mathcal{X}}\|v_{K_2}(x)\|\leq\zeta_2(K_2)$, with $K:=K_1(N)K_2(N)$ and $\zeta(K):=\zeta_1(K_1)\zeta_2(K_2)$, such that $\zeta(K)K^{-\alpha}\to 0$ and $\zeta(K)\sqrt{K/N}\to 0$ as $N\to\infty$.

Assumption 9. $\zeta(K)\sqrt{K^2/N}\to 0$ and $\sqrt{N}K^{-\alpha}\to 0$.

See Ai, Linton, Motegi, and Zhang (2021) for a detailed discussion of Assumptions 6-9. Under these conditions, Ai, Linton, Motegi, and Zhang (2021, Theorem 3) established the following results:

Proposition 6. Suppose that Assumptions 6-8 hold. Then we obtain
\[
\sup_{(t,x)\in\mathcal{T}\times\mathcal{X}}|\hat\pi_K(t,x)-\pi(t,x)|=O_p\Big[\max\Big\{\zeta(K)K^{-\alpha},\ \zeta(K)\sqrt{K/N}\Big\}\Big],
\]
\[
\int_{\mathcal{T}\times\mathcal{X}}|\hat\pi_K(t,x)-\pi(t,x)|^2\,dF_{T,X}(t,x)=O_p\Big\{\max\Big(K^{-2\alpha},\ \frac{K}{N}\Big)\Big\},
\]
\[
\frac{1}{N}\sum_{i=1}^N|\hat\pi_K(T_i,X_i)-\pi(T_i,X_i)|^2=O_p\Big\{\max\Big(K^{-2\alpha},\ \frac{K}{N}\Big)\Big\}.
\]
Furthermore, for any estimand of the form $E\{\pi(T,X)R(T,X,Y)\}$ with $R(T,X,Y)\in L^2(dF_{T,X,Y})$, Theorem 5 of Ai, Linton, Motegi, and Zhang (2021) provides an asymptotically equivalent representation of the plug-in estimator $N^{-1}\sum_{i=1}^N\hat\pi_K(T_i,X_i)R(T_i,X_i,Y_i)$:

Proposition 7. Suppose that Assumptions 6-9 hold, and let $R(T,X,Y)$ be any integrable function such that $E\{R(T,X,Y)\mid T=t,X=x\}$ is continuously differentiable. Then
\[
\frac{1}{\sqrt{N}}\sum_{i=1}^N\big[\hat\pi_K(T_i,X_i)R(T_i,X_i,Y_i)-E\{\pi(T,X)R(T,X,Y)\}\big]
=\frac{1}{\sqrt{N}}\sum_{i=1}^N\Big[\pi(T_i,X_i)R(T_i,X_i,Y_i)-E\{\pi(T_i,X_i)R(T_i,X_i,Y_i)\mid T_i,X_i\}
\]
\[
+E\{\pi(T_i,X_i)R(T_i,X_i,Y_i)\mid T_i\}-E\{\pi(T_i,X_i)R(T_i,X_i,Y_i)\}
+E\{\pi(T_i,X_i)R(T_i,X_i,Y_i)\mid X_i\}-E\{\pi(T_i,X_i)R(T_i,X_i,Y_i)\}\Big]+o_p(1).
\]

B Proof of Theorem 1

Proof.
Note that
\[
\widehat U_i=\widehat\pi_K(T_i,X_i)\,m\{Y_i;g(T_i;\widehat\theta)\}
=U_i+\{\widehat\pi_K(T_i,X_i)-\pi(T_i,X_i)\}m\{Y_i;g(T_i;\theta^*)\}
+\pi(T_i,X_i)\big[m\{Y_i;g(T_i;\widehat\theta)\}-m\{Y_i;g(T_i;\theta^*)\}\big]
\]
\[
+\{\widehat\pi_K(T_i,X_i)-\pi(T_i,X_i)\}\big[m\{Y_i;g(T_i;\widehat\theta)\}-m\{Y_i;g(T_i;\theta^*)\}\big],
\]
where $U_i=\pi(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}$. Then we have
\[
\widehat J_N(t)=\frac{1}{\sqrt N}\sum_{i=1}^N\widehat U_iH(T_i,t)
=\frac{1}{\sqrt N}\sum_{i=1}^N U_iH(T_i,t) \quad (B.1)
\]
\[
+\frac{1}{\sqrt N}\sum_{i=1}^N\{\widehat\pi_K(T_i,X_i)-\pi(T_i,X_i)\}m\{Y_i;g(T_i;\theta^*)\}H(T_i,t) \quad (B.2)
\]
\[
+\frac{1}{\sqrt N}\sum_{i=1}^N\pi(T_i,X_i)\big[m\{Y_i;g(T_i;\widehat\theta)\}-m\{Y_i;g(T_i;\theta^*)\}\big]H(T_i,t) \quad (B.3)
\]
\[
+\frac{1}{\sqrt N}\sum_{i=1}^N\{\widehat\pi_K(T_i,X_i)-\pi(T_i,X_i)\}\big[m\{Y_i;g(T_i;\widehat\theta)\}-m\{Y_i;g(T_i;\theta^*)\}\big]H(T_i,t). \quad (B.4)
\]
Using Proposition 7, under $H_0$: $E[\pi(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\mid T_i]=0$, we have
\[
\text{(B.2)}=-\frac{1}{\sqrt N}\sum_{i=1}^N\pi(T_i,X_i)\,E[m\{Y_i;g(T_i;\theta^*)\}H(T_i,t)\mid T_i,X_i]
+\frac{1}{\sqrt N}\sum_{i=1}^N E[\pi(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}H(T_i,t)\mid T_i]
\]
\[
+\frac{1}{\sqrt N}\sum_{i=1}^N E[\pi(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}H(T_i,t)\mid X_i]
-\frac{1}{\sqrt N}\sum_{i=1}^N E[\pi(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}H(T_i,t)]+o_P(1)
\]
\[
=-\frac{1}{\sqrt N}\sum_{i=1}^N\pi(T_i,X_i)\,E[m\{Y_i;g(T_i;\theta^*)\}H(T_i,t)\mid T_i,X_i]
+\frac{1}{\sqrt N}\sum_{i=1}^N E[\pi(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}H(T_i,t)\mid X_i]+o_P(1)
=-\frac{1}{\sqrt N}\sum_{i=1}^N\phi(T_i,X_i;t)+o_P(1).
\]
(B.5)

By Ai, Linton, Motegi, and Zhang (2021, Theorems 4 and 5), under $H_0$ we have
\[
\|\widehat\theta-\theta^*\|=O_P(N^{-1/2}). \quad (B.6)
\]
We next derive the expression for $\sqrt N\{\widehat\theta-\theta^*\}$. Note that
\[
\frac{1}{\sqrt N}\sum_{i=1}^N\widehat\pi_K(T_i,X_i)\,m\{Y_i;g(T_i;\widehat\theta)\}\,\nabla_\theta g(T_i;\widehat\theta)=o_P(1).
\]
Since $m(\cdot)$ may not be differentiable, we cannot simply apply the mean value theorem to obtain the expression for $\sqrt N\{\widehat\theta-\theta^*\}$; instead, we apply the empirical process theory of Andrews (1994). Let
\[
\nu_N(f):=\frac{1}{\sqrt N}\sum_{i=1}^N\big[f(T_i,X_i,Y_i)-E\{f(T_i,X_i,Y_i)\}\big]
\]
be the empirical process indexed by $f(\cdot)$. Note that
\[
o_P(1)=\frac{1}{\sqrt N}\sum_{i=1}^N\widehat\pi_K(T_i,X_i)m\{Y_i;g(T_i;\widehat\theta)\}\nabla_\theta g(T_i;\widehat\theta)
=\frac{1}{\sqrt N}\sum_{i=1}^N\pi(T_i,X_i)m\{Y_i;g(T_i;\widehat\theta)\}\nabla_\theta g(T_i;\widehat\theta) \quad (B.7)
\]
\[
+\frac{1}{\sqrt N}\sum_{i=1}^N\{\widehat\pi_K(T_i,X_i)-\pi(T_i,X_i)\}m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*) \quad (B.8)
\]
\[
+\frac{1}{\sqrt N}\sum_{i=1}^N\{\widehat\pi_K(T_i,X_i)-\pi(T_i,X_i)\}\big[m\{Y_i;g(T_i;\widehat\theta)\}\nabla_\theta g(T_i;\widehat\theta)-m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)\big]. \quad (B.9)
\]
For (B.8), by Proposition 7, under $H_0$ we have
\[
\text{(B.8)}=-\frac{1}{\sqrt N}\sum_{i=1}^N\pi(T_i,X_i)\,E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\,\nabla_\theta g(T_i;\theta^*)
+\frac{1}{\sqrt N}\sum_{i=1}^N E[\pi(T_i,X_i)\,m\{Y_i;g(T_i;\theta^*)\}\mid T_i]\,\nabla_\theta g(T_i;\theta^*)
\]
\[
+\frac{1}{\sqrt N}\sum_{i=1}^N E[\pi(T_i,X_i)\,m\{Y_i;g(T_i;\theta^*)\}\,\nabla_\theta g(T_i;\theta^*)\mid X_i]
-\frac{1}{\sqrt N}\sum_{i=1}^N E[\pi(T_i,X_i)\,m\{Y_i;g(T_i;\theta^*)\}\,\nabla_\theta g(T_i;\theta^*)]+o_P(1)
\]
\[
=-\frac{1}{\sqrt N}\sum_{i=1}^N\pi(T_i,X_i)\,E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\,\nabla_\theta g(T_i;\theta^*)
+\frac{1}{\sqrt N}\sum_{i=1}^N E[\pi(T_i,X_i)\,m\{Y_i;g(T_i;\theta^*)\}\,\nabla_\theta g(T_i;\theta^*)\mid X_i]+o_P(1).
\]
For (B.9), we have
\[
|\text{(B.9)}|=\Big\|\frac{1}{\sqrt N}\sum_{i=1}^N\{\widehat\pi_K(T_i,X_i)-\pi(T_i,X_i)\}\big[m\{Y_i;g(T_i;\widehat\theta)\}\nabla_\theta g(T_i;\widehat\theta)-m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)\big]\Big\|
\]
\[
\leq\sqrt N\cdot\sup_{(t,x)\in\mathcal{T}\times\mathcal{X}}|\widehat\pi_K(t,x)-\pi(t,x)|\cdot\frac{1}{N}\sum_{i=1}^N\Big[\big\|m\{Y_i;g(T_i;\widehat\theta)\}\nabla_\theta g(T_i;\theta^*)-m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)\big\|
+\big\|m\{Y_i;g(T_i;\widehat\theta)\}\nabla^2_\theta g(T_i;\widetilde\theta)\big\|\cdot\|\widehat\theta-\theta^*\|\Big]
\]
\[
=\sqrt N\cdot O_P\Big(\zeta(K)K^{-\alpha}+\zeta(K)\sqrt{K/N}\Big)\cdot\Big\{E\big[\big|m\{Y_i;g(T_i;\widehat\theta)\}-m\{Y_i;g(T_i;\theta^*)\}\big|\cdot\|\nabla_\theta g(T_i;\theta^*)\|\big]+O_P(N^{-1/2})\Big\}
\]
\[
\leq O_P\Big(\zeta(K)K^{-\alpha}+\zeta(K)\sqrt{K/N}\Big)\cdot\sqrt N\cdot\Big\{O(1)\cdot
\]
\[
\|\widehat\theta-\theta^*\|+O_P(N^{-1/2})\Big\}=o_P(1), \quad (B.10)
\]
where the second equality holds by Proposition 6 and the law of large numbers, the second inequality holds by Assumption 5, and the last equality holds by (B.6) and Assumption 8.

For (B.7), we have
\[
\text{(B.7)}=\frac{1}{\sqrt N}\sum_{i=1}^N\pi(T_i,X_i)m\{Y_i;g(T_i;\widehat\theta)\}\nabla_\theta g(T_i;\widehat\theta)
=\nu_N\big(\pi(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)\big)
\]
\[
+\Big\{\nu_N\big(\pi(T_i,X_i)m\{Y_i;g(T_i;\widehat\theta)\}\nabla_\theta g(T_i;\widehat\theta)\big)-\nu_N\big(\pi(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)\big)\Big\}
+\sqrt N\cdot E\big[\pi(T_i,X_i)m\{Y_i;g(T_i;\widehat\theta)\}\nabla_\theta g(T_i;\widehat\theta)\big]
\]
\[
=\nu_N\big(\pi(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)\big)+o_P(1)
+\sqrt N\cdot E\big[\pi(T_i,X_i)m\{Y_i;g(T_i;\widehat\theta)\}\nabla_\theta g(T_i;\widehat\theta)\big],
\]
where the last equality holds because Assumption 5 and the compactness of $\Theta$ imply that the empirical process
\[
\big\{\nu_N\big(\pi(T_i,X_i)m\{Y_i;g(T_i;\theta)\}\nabla_\theta g(T_i;\theta)\big):\theta\in\Theta\big\}
\]
is stochastically equicontinuous (Andrews, 1994, Theorems 4 and 5), and $\|\widehat\theta-\theta^*\|\stackrel{p}{\longrightarrow}0$, so that
\[
\nu_N\big(\pi(T_i,X_i)m\{Y_i;g(T_i;\widehat\theta)\}\nabla_\theta g(T_i;\widehat\theta)\big)-\nu_N\big(\pi(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)\big)=o_P(1).
\]
Note that under $H_0$, $\widehat{\theta}\xrightarrow{P}\theta^*$, so
\begin{align*}
\sqrt{N}\cdot E\big[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\widehat{\theta})\}\nabla_\theta g(T_i;\widehat{\theta})\big]
={}&\sqrt{N}\cdot E\big[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)\big]\\
&+\sqrt{N}\cdot\nabla_\theta E\big[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\widetilde{\theta})\}\nabla_\theta g(T_i;\widetilde{\theta})\big]\cdot\big\{\widehat{\theta}-\theta^*\big\}\\
={}&\nabla_\theta E\big[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\widetilde{\theta})\}\nabla_\theta g(T_i;\widetilde{\theta})\big]\cdot\sqrt{N}\big\{\widehat{\theta}-\theta^*\big\}+o_P(1),
\end{align*}
where $\widetilde{\theta}$ lies between $\theta^*$ and $\widehat{\theta}$. Then we get
\begin{align*}
\text{(B.7)}={}&\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)\\
&+\nabla_\theta E\big[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\widetilde{\theta})\}\nabla_\theta g(T_i;\widetilde{\theta})\big]\cdot\sqrt{N}\big\{\widehat{\theta}-\theta^*\big\}+o_P(1).
\end{align*}
Hence, combining the expressions of (B.7), (B.8), and (B.9), we get
\begin{align*}
&\sqrt{N}\big\{\widehat{\theta}-\theta^*\big\}\tag{B.11}\\
={}&\big\{-\nabla_\theta E\big[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)\big]\big\}^{-1}\\
&\times\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\bigg\{\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)-\pi_0(T_i,X_i)\nabla_\theta g(T_i;\theta^*)\cdot E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\\
&\qquad+E[\pi_0(T_i,X_i)\nabla_\theta g(T_i;\theta^*)m\{Y_i;g(T_i;\theta^*)\}\mid T_i]-E[\pi_0(T_i,X_i)\nabla_\theta g(T_i;\theta^*)m\{Y_i;g(T_i;\theta^*)\}]\\
&\qquad+E[\pi_0(T_i,X_i)\nabla_\theta g(T_i;\theta^*)m\{Y_i;g(T_i;\theta^*)\}\mid X_i]\bigg\}+o_P(1)\\
={}&\bigg\{-E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*)\nabla^{\top}_\theta g(T_i;\theta^*)\bigg]\bigg\}^{-1}\\
&\times\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\bigg\{\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)-\pi_0(T_i,X_i)\nabla_\theta g(T_i;\theta^*)\cdot E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\\
&\qquad+E[\pi_0(T_i,X_i)\nabla_\theta g(T_i;\theta^*)m\{Y_i;g(T_i;\theta^*)\}\mid X_i]\bigg\}+o_P(1),
\end{align*}
where the second equality holds by noting $E[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\mid T_i]=0$ under $H_0$.

Consider the term (B.3). Note that
\begin{align*}
\text{(B.3)}={}&\nu_N\Big(\pi_0(T_i,X_i)\big[m\{Y_i;g(T_i;\widehat{\theta})\}-m\{Y_i;g(T_i;\theta^*)\}\big]H(T_i,t)\Big)\\
&+\sqrt{N}\cdot E\Big\{\pi_0(T_i,X_i)\big[m\{Y_i;g(T_i;\theta)\}-m\{Y_i;g(T_i;\theta^*)\}\big]H(T_i,t)\Big\}\bigg|_{\theta=\widehat{\theta}}.
\end{align*}
By Assumption 5, the compactness of $\Theta$, and Andrews (1994, Theorems 4 and 5), the empirical process
\[
\Big\{\nu_N\Big(\pi_0(T_i,X_i)\big[m\{Y_i;g(T_i;\theta)\}-m\{Y_i;g(T_i;\theta^*)\}\big]H(T_i,t)\Big):\theta\in\Theta\Big\}
\]
is stochastically equicontinuous. With $\|\widehat{\theta}-\theta^*\|\xrightarrow{p}0$ under $H_0$, we have
\[
\nu_N\Big(\pi_0(T_i,X_i)\big[m\{Y_i;g(T_i;\widehat{\theta})\}-m\{Y_i;g(T_i;\theta^*)\}\big]H(T_i,t)\Big)=o_P(1).
\]
Using the mean value theorem and $\|\widehat{\theta}-\theta^*\|\xrightarrow{p}0$ under $H_0$, we have
\begin{align*}
&\sqrt{N}\cdot E\Big\{\pi_0(T_i,X_i)\big[m\{Y_i;g(T_i;\theta)\}-m\{Y_i;g(T_i;\theta^*)\}\big]H(T_i,t)\Big\}\bigg|_{\theta=\widehat{\theta}}\\
={}&\bigg\{\nabla_\theta E\big[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta)\}H(T_i,t)\big]\Big|_{\theta=\widetilde{\theta}}\bigg\}^{\top}\cdot\sqrt{N}\big\{\widehat{\theta}-\theta^*\big\}\\
={}&E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E[m\{Y_i;g(T_i;\theta)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta)^{\top}H(T_i,t)\bigg]\bigg|_{\theta=\widetilde{\theta}}\cdot\sqrt{N}\big\{\widehat{\theta}-\theta^*\big\}\\
={}&E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*)^{\top}H(T_i,t)\bigg]\cdot\sqrt{N}\big\{\widehat{\theta}-\theta^*\big\}+o_P(1).
\end{align*}
By (B.11), we have
\[
\text{(B.3)}=-\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\psi(T_i,X_i,Y_i;t)+o_P(1),\tag{B.12}
\]
where
\begin{align*}
\psi(T_i,X_i,Y_i;t):={}&E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*)^{\top}H(T_i,t)\bigg]\\
&\times E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*)\nabla^{\top}_\theta g(T_i;\theta^*)\bigg]^{-1}\\
&\times\bigg\{\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)-\pi_0(T_i,X_i)\nabla_\theta g(T_i;\theta^*)\cdot E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\\
&\qquad+E[\pi_0(T_i,X_i)\nabla_\theta g(T_i;\theta^*)m\{Y_i;g(T_i;\theta^*)\}\mid X_i]\bigg\}.
\end{align*}
For the term (B.4), we have
\begin{align*}
|\text{(B.4)}|={}&\bigg|\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\{\widehat{\pi}_K(T_i,X_i)-\pi_0(T_i,X_i)\}\big[m\{Y_i;g(T_i;\widehat{\theta})\}-m\{Y_i;g(T_i;\theta^*)\}\big]H(T_i,t)\bigg|\\
\le{}&\sqrt{N}\cdot\sup_{(t,x)\in\mathcal{T}\times\mathcal{X}}|\widehat{\pi}_K(t,x)-\pi_0(t,x)|\cdot\frac{1}{N}\sum_{i=1}^{N}\Big|\big[m\{Y_i;g(T_i;\widehat{\theta})\}-m\{Y_i;g(T_i;\theta^*)\}\big]H(T_i,t)\Big|\\
={}&\sqrt{N}\cdot O_P\bigg(\zeta(K)K^{-\alpha}+\zeta(K)\sqrt{\frac{K}{N}}\bigg)\cdot\Big\{E\Big[\big|m\{Y_i;g(T_i;\widehat{\theta})\}-m\{Y_i;g(T_i;\theta^*)\}\big|\cdot|H(T_i,t)|\Big]+O_P\big(N^{-1/2}\big)\Big\}\\
\le{}&O_P\bigg(\zeta(K)K^{-\alpha}+\zeta(K)\sqrt{\frac{K}{N}}\bigg)\cdot\sqrt{N}\cdot\Big\{O(1)\cdot\|\widehat{\theta}-\theta^*\|+O_P\big(N^{-1/2}\big)\Big\}=o_P(1),\tag{B.13}
\end{align*}
where the second equality holds by Proposition 6 and the law of large numbers, the second inequality holds by Assumption 5, and the last equality holds by (B.6) and Assumption 8.

Hence, combining (B.1), (B.5), (B.12), and (B.13), we have
\begin{align*}
\widehat{J}_N(t)&=\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\eta(T_i,X_i,Y_i;t)+o_P(1)\\
&=\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\{U_iH(T_i,t)-\phi(T_i,X_i;t)-\psi(T_i,X_i,Y_i;t)\}+o_P(1),
\end{align*}
where $E\{\phi(T_i,X_i;t)\}=0$ and $E\{\psi(T_i,X_i,Y_i;t)\}=0$. Therefore, under the null hypothesis $H_0$, $\widehat{J}_N(\cdot)$ converges weakly to $J_\infty(\cdot)$ in $L^2(\mathcal{T},dt)$, where $J_\infty(\cdot)$ is a Gaussian process with zero mean and covariance function given by $\Sigma(t,t')=E\{\eta(T_i,X_i,Y_i;t)\eta(T_i,X_i,Y_i;t')\}$.

(ii) Obviously, $h(J):=\int\{J(t)\}^2\,dF_T(t)$ is a continuous functional on $L^2(\mathcal{T},dF_T)$. Given that $F_T(\cdot)$ is absolutely continuous with respect to the Lebesgue measure, $h(J)$ is also continuous on $L^2(\mathcal{T},dt)$. Therefore, by Theorem 1(i) and the continuous mapping theorem, $\int\{\widehat{J}_N(t)\}^2\,dF_T(t)$ converges to $\int\{J_\infty(t)\}^2\,dF_T(t)$ in distribution.

C Proof of Theorem 2

Similar to Theorem 1, results (i) and (ii) can be established. We next prove $\Sigma_0(t,t)>\Sigma(t,t)$ for any fixed $t\in\mathcal{T}$. Let
\begin{align*}
A_t:={}&E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*)^{\top}H(T_i,t)\bigg]\\
&\times E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*)\nabla^{\top}_\theta g(T_i;\theta^*)\bigg]^{-1}.
\end{align*}
Then
\begin{align*}
\psi(T_i,X_i,Y_i;t):={}&A_t\cdot\bigg\{\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)\\
&\qquad-\pi_0(T_i,X_i)\nabla_\theta g(T_i;\theta^*)\cdot E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\\
&\qquad+E[\pi_0(T_i,X_i)\nabla_\theta g(T_i;\theta^*)m\{Y_i;g(T_i;\theta^*)\}\mid X_i]\bigg\},
\end{align*}
and
\[
\psi_0(T_i,X_i,Y_i;t):=A_t\cdot\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*).
\]
We have the following computation; to lighten notation, write $\pi_i:=\pi_0(T_i,X_i)$, $m_i:=m\{Y_i;g(T_i;\theta^*)\}$, $\nabla g_i:=\nabla_\theta g(T_i;\theta^*)$, and $H_i:=H(T_i,t)$, so that $U_i=\pi_im_i$:
\begin{align*}
\Sigma(t,t)={}&E\big[\{\eta(T_i,X_i,Y_i;t)\}^2\big]=E\big[\{U_iH_i-\phi(T_i,X_i;t)-\psi(T_i,X_i,Y_i;t)\}^2\big]\\
={}&E\Big[\big\{U_iH_i-A_t\cdot\pi_im_i\nabla g_i-\pi_i\cdot E[m_i\mid T_i,X_i]\cdot(H_i-A_t\cdot\nabla g_i)+E[\pi_im_i\cdot(H_i-A_t\cdot\nabla g_i)\mid X_i]\big\}^2\Big]\\
={}&E\Big[\big\{U_iH_i-A_t\cdot\pi_im_i\nabla g_i\big\}^2\Big]+E\Big[\big\{\pi_i\cdot E[m_i\mid T_i,X_i]\cdot(H_i-A_t\cdot\nabla g_i)\big\}^2\Big]\\
&+E\Big[\big\{E[\pi_im_i\cdot(H_i-A_t\cdot\nabla g_i)\mid X_i]\big\}^2\Big]\\
&-2\cdot E\Big[\big\{U_iH_i-A_t\cdot\pi_im_i\nabla g_i\big\}\times\big\{\pi_i\cdot E[m_i\mid T_i,X_i]\cdot(H_i-A_t\cdot\nabla g_i)\big\}\Big]\\
&+2\cdot E\Big[\big\{U_iH_i-A_t\cdot\pi_im_i\nabla g_i\big\}\times\big\{E[\pi_im_i\cdot(H_i-A_t\cdot\nabla g_i)\mid X_i]\big\}\Big]\\
&-2\cdot E\Big[\big\{\pi_i\cdot E[m_i\mid T_i,X_i]\cdot(H_i-A_t\cdot\nabla g_i)\big\}\times\big\{E[\pi_im_i\cdot(H_i-A_t\cdot\nabla g_i)\mid X_i]\big\}\Big]\\
={}&E\Big[\big\{U_iH_i-A_t\cdot\pi_im_i\nabla g_i\big\}^2\Big]+E\Big[\big\{\pi_i\cdot E[m_i\mid T_i,X_i]\cdot(H_i-A_t\cdot\nabla g_i)\big\}^2\Big]\\
&+E\Big[\big\{E[\pi_im_i\cdot(H_i-A_t\cdot\nabla g_i)\mid X_i]\big\}^2\Big]-2\cdot E\Big[\big\{\pi_i\cdot E[m_i\mid T_i,X_i]\cdot(H_i-A_t\cdot\nabla g_i)\big\}^2\Big]\\
&+2\cdot E\Big[\big\{E[\pi_im_i\cdot(H_i-A_t\cdot\nabla g_i)\mid X_i]\big\}^2\Big]-2\cdot E\Big[\big\{E[\pi_im_i\cdot(H_i-A_t\cdot\nabla g_i)\mid X_i]\big\}^2\Big]\\
={}&E\Big[\big\{U_iH_i-A_t\cdot\pi_im_i\nabla g_i\big\}^2\Big]+E\Big[\big\{E[\pi_im_i\cdot(H_i-A_t\cdot\nabla g_i)\mid X_i]\big\}^2\Big]\\
&-E\Big[\big\{\pi_i\cdot E[m_i\mid T_i,X_i]\cdot(H_i-A_t\cdot\nabla g_i)\big\}^2\Big]\\
<{}&E\Big[\big\{U_iH_i-A_t\cdot\pi_im_i\nabla g_i\big\}^2\Big]=\Sigma_0(t,t),
\end{align*}
where the fourth equality holds by using the tower property of the conditional expectation (note that $U_iH_i-A_t\cdot\pi_im_i\nabla g_i=\pi_im_i(H_i-A_t\cdot\nabla g_i)$), and the inequality holds by using Jensen's inequality, since
\[
E\Big[\big\{E[\pi_im_i(H_i-A_t\nabla g_i)\mid X_i]\big\}^2\Big]\le E\Big[\big\{E[\pi_im_i(H_i-A_t\nabla g_i)\mid T_i,X_i]\big\}^2\Big]=E\Big[\big\{\pi_iE[m_i\mid T_i,X_i](H_i-A_t\nabla g_i)\big\}^2\Big].
\]

D Proof of Theorem 5

Proof. We prove parts (i) and (ii); the proof is similar to that for Theorem 1. Let
\[
g_N(t,\theta):=g(t;\theta)+\frac{\delta(t)}{\sqrt{N}}\quad\text{and}\quad U_{iN}:=\pi_0(T_i,X_i)\,m\{Y_i;g_N(T_i;\theta^*_N)\}.
\]
Obviously, $g_N(t,\theta)\to g(t,\theta)$ and $U_{iN}\xrightarrow{a.s.}U_i$.
Then
\begin{align*}
\widehat{U}_i={}&\widehat{\pi}_K(T_i,X_i)m\{Y_i;g(T_i;\widehat{\theta})\}\\
={}&U_{iN}+\{\widehat{\pi}_K(T_i,X_i)-\pi_0(T_i,X_i)\}m\{Y_i;g_N(T_i;\theta^*_N)\}\\
&+\pi_0(T_i,X_i)\big[m\{Y_i;g(T_i;\widehat{\theta})\}-m\{Y_i;g_N(T_i;\theta^*_N)\}\big]\\
&+\{\widehat{\pi}_K(T_i,X_i)-\pi_0(T_i,X_i)\}\big[m\{Y_i;g(T_i;\widehat{\theta})\}-m\{Y_i;g_N(T_i;\theta^*_N)\}\big].
\end{align*}
Then, we have
\begin{align*}
\widehat{J}_N(t)=\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\widehat{U}_iH(T_i,t)
={}&\frac{1}{\sqrt{N}}\sum_{i=1}^{N}U_{iN}H(T_i,t)\tag{D.1}\\
&+\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\{\widehat{\pi}_K(T_i,X_i)-\pi_0(T_i,X_i)\}m\{Y_i;g_N(T_i;\theta^*_N)\}H(T_i,t)\tag{D.2}\\
&+\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\pi_0(T_i,X_i)\big[m\{Y_i;g(T_i;\widehat{\theta})\}-m\{Y_i;g_N(T_i;\theta^*_N)\}\big]H(T_i,t)\tag{D.3}\\
&+\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\{\widehat{\pi}_K(T_i,X_i)-\pi_0(T_i,X_i)\}\big[m\{Y_i;g(T_i;\widehat{\theta})\}-m\{Y_i;g_N(T_i;\theta^*_N)\}\big]H(T_i,t).\tag{D.4}
\end{align*}
Obviously, by Chebyshev's inequality, we have
\[
\text{(D.1)}=\frac{1}{\sqrt{N}}\sum_{i=1}^{N}U_iH(T_i,t)+\frac{1}{\sqrt{N}}\sum_{i=1}^{N}(U_{iN}-U_i)H(T_i,t)=\frac{1}{\sqrt{N}}\sum_{i=1}^{N}U_iH(T_i,t)+o_P(1).
\]
Using Proposition 7, under $H_L$: $E[\pi_0(T_i,X_i)m\{Y_i;g_N(T_i;\theta^*_N)\}\mid T_i]=0$, we have
\begin{align*}
\text{(D.2)}={}&-\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\pi_0(T_i,X_i)\cdot E[m\{Y_i;g_N(T_i;\theta^*_N)\}\cdot H(T_i,t)\mid T_i,X_i]\\
&+\frac{1}{\sqrt{N}}\sum_{i=1}^{N}E[\pi_0(T_i,X_i)m\{Y_i;g_N(T_i;\theta^*_N)\}\cdot H(T_i,t)\mid T_i]\\
&+\frac{1}{\sqrt{N}}\sum_{i=1}^{N}E[\pi_0(T_i,X_i)m\{Y_i;g_N(T_i;\theta^*_N)\}\cdot H(T_i,t)\mid X_i]\\
&-\frac{1}{\sqrt{N}}\sum_{i=1}^{N}E[\pi_0(T_i,X_i)m\{Y_i;g_N(T_i;\theta^*_N)\}\cdot H(T_i,t)]+o_P(1)\\
={}&-\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\pi_0(T_i,X_i)\cdot E[m\{Y_i;g_N(T_i;\theta^*_N)\}\cdot H(T_i,t)\mid T_i,X_i]\\
&+\frac{1}{\sqrt{N}}\sum_{i=1}^{N}E[\pi_0(T_i,X_i)m\{Y_i;g_N(T_i;\theta^*_N)\}\cdot H(T_i,t)\mid X_i]+o_P(1)\\
={}&-\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\pi_0(T_i,X_i)\cdot E[m\{Y_i;g(T_i;\theta^*)\}\cdot H(T_i,t)\mid T_i,X_i]\\
&+\frac{1}{\sqrt{N}}\sum_{i=1}^{N}E[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\cdot H(T_i,t)\mid X_i]+o_P(1)\\
={}&-\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\phi(T_i,X_i;t)+o_P(1).
\end{align*}
We next derive the asymptotic expression for $\sqrt{N}\{\widehat{\theta}-\theta^*_N\}$. Note from Assumption 2 that
\[
\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\widehat{\pi}_K(T_i,X_i)m\{Y_i;g(T_i;\widehat{\theta})\}\nabla_\theta g(T_i;\widehat{\theta})=o_P(1).
\]
Note that
\begin{align*}
\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\widehat{\pi}_K(T_i,X_i)m\{Y_i;g(T_i;\widehat{\theta})\}\nabla_\theta g(T_i;\widehat{\theta})
={}&\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\pi_0(T_i,X_i)m\{Y_i;g(T_i;\widehat{\theta})\}\nabla_\theta g(T_i;\widehat{\theta})\tag{D.5}\\
&+\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\{\widehat{\pi}_K(T_i,X_i)-\pi_0(T_i,X_i)\}m\{Y_i;g(T_i;\theta^*_N)\}\nabla_\theta g(T_i;\theta^*_N)\tag{D.6}\\
&+\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\{\widehat{\pi}_K(T_i,X_i)-\pi_0(T_i,X_i)\}\big[m\{Y_i;g(T_i;\widehat{\theta})\}\nabla_\theta g(T_i;\widehat{\theta})\\
&\qquad\qquad-m\{Y_i;g(T_i;\theta^*_N)\}\nabla_\theta g(T_i;\theta^*_N)\big].\tag{D.7}
\end{align*}
For (D.6), by Proposition 7, under $H_L$, we have
\begin{align*}
&\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\{\widehat{\pi}_K(T_i,X_i)-\pi_0(T_i,X_i)\}m\{Y_i;g(T_i;\theta^*_N)\}\nabla_\theta g(T_i;\theta^*_N)\\
={}&-\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\pi_0(T_i,X_i)\cdot E[m\{Y_i;g(T_i;\theta^*_N)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*_N)\\
&+\frac{1}{\sqrt{N}}\sum_{i=1}^{N}E[\pi_0(T_i,X_i)\cdot m\{Y_i;g(T_i;\theta^*_N)\}\mid T_i]\cdot\nabla_\theta g(T_i;\theta^*_N)\\
&+\frac{1}{\sqrt{N}}\sum_{i=1}^{N}E[\pi_0(T_i,X_i)\cdot m\{Y_i;g(T_i;\theta^*_N)\}\cdot\nabla_\theta g(T_i;\theta^*_N)\mid X_i]\\
&-\frac{1}{\sqrt{N}}\sum_{i=1}^{N}E[\pi_0(T_i,X_i)\cdot m\{Y_i;g(T_i;\theta^*_N)\}\cdot\nabla_\theta g(T_i;\theta^*_N)]+o_P(1)\\
={}&-\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\pi_0(T_i,X_i)\cdot E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*)\\
&+\frac{1}{\sqrt{N}}\sum_{i=1}^{N}E[\pi_0(T_i,X_i)\cdot m\{Y_i;g(T_i;\theta^*)\}\cdot\nabla_\theta g(T_i;\theta^*)\mid X_i]+o_P(1),
\end{align*}
where the last equality holds because $E[\pi_0(T_i,X_i)m\{Y_i;g_N(T_i;\theta^*_N)\}\mid T_i]=0$ under $H_L$ and $\lim_{N\to\infty}\theta^*_N=\theta^*$. Recalling the definition of $\theta^*_N$ in Section 4.3, we can see that $\|\theta^*_N-\theta^*\|=O(N^{-1/2})$. Then, for (D.7), by using a similar argument to that used in establishing (B.13), we can obtain $\text{(D.7)}=o_P(1)$.

For (D.5), we have
\begin{align*}
\text{(D.5)}={}&\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\pi_0(T_i,X_i)m\{Y_i;g(T_i;\widehat{\theta})\}\nabla_\theta g(T_i;\widehat{\theta})\\
={}&\nu_N\big[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)\big]\\
&+\Big\{\nu_N\big(\pi_0(T_i,X_i)m\{Y_i;g(T_i;\widehat{\theta})\}\nabla_\theta g(T_i;\widehat{\theta})\big)-\nu_N\big(\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)\big)\Big\}\\
&+\sqrt{N}\cdot E\big[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\widehat{\theta})\}\nabla_\theta g(T_i;\widehat{\theta})\big]\\
={}&\nu_N\big[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)\big]+o_P(1)+\sqrt{N}\cdot E\big[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\widehat{\theta})\}\nabla_\theta g(T_i;\widehat{\theta})\big],
\end{align*}
where the last equality holds because the empirical process $\{\nu_N(\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta)\}\nabla_\theta g(T_i;\theta)):\theta\in\Theta\}$ is stochastically equicontinuous and $\|\widehat{\theta}-\theta^*\|\xrightarrow{p}0$, so that
\[
\nu_N\big(\pi_0(T_i,X_i)m\{Y_i;g(T_i;\widehat{\theta})\}\nabla_\theta g(T_i;\widehat{\theta})\big)-\nu_N\big(\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)\big)=o_P(1).
\]
Note that under $H_L$, $\theta^*_N\to\theta^*$ and $\widehat{\theta}\xrightarrow{P}\theta^*$, so
\begin{align*}
&\sqrt{N}\cdot E\big[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\widehat{\theta})\}\nabla_\theta g(T_i;\widehat{\theta})\big]\\
={}&\sqrt{N}\cdot E\big[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*_N)\}\nabla_\theta g(T_i;\theta^*_N)\big]\\
&+\sqrt{N}\cdot\nabla_\theta E\big[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\widetilde{\theta}_N)\}\nabla_\theta g(T_i;\widetilde{\theta}_N)\big]\cdot\big\{\widehat{\theta}-\theta^*_N\big\}\\
={}&\sqrt{N}\cdot E\big[\pi_0(T_i,X_i)\cdot E[m\{Y_i;g(T_i;\theta^*_N)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*_N)\big]\\
&+\sqrt{N}\cdot\nabla_\theta E\big[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\widetilde{\theta}_N)\}\nabla_\theta g(T_i;\widetilde{\theta}_N)\big]\cdot\big\{\widehat{\theta}-\theta^*_N\big\}\\
={}&\sqrt{N}\cdot E\big[\pi_0(T_i,X_i)\cdot E[m\{Y_i;g_N(T_i;\theta^*_N)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*_N)\big]\\
&-\sqrt{N}\cdot E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E\big[m\{Y_i;\widetilde{g}_N(T_i;\theta^*_N)\}\mid T_i,X_i\big]\cdot\{g_N(T_i;\theta^*_N)-g(T_i;\theta^*_N)\}\nabla_\theta g(T_i;\theta^*_N)\bigg]\\
&+\nabla_\theta E\big[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\widetilde{\theta}_N)\}\nabla_\theta g(T_i;\widetilde{\theta}_N)\big]\cdot\sqrt{N}\big\{\widehat{\theta}-\theta^*_N\big\}\\
={}&-E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E\big[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i\big]\cdot\delta(T_i)\cdot\nabla_\theta g(T_i;\theta^*)\bigg]\\
&+\big\{\nabla_\theta E\big[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)\big]+o_P(1)\big\}\cdot\sqrt{N}\big\{\widehat{\theta}-\theta^*_N\big\}\\
={}&-E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E\big[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i\big]\cdot\delta(T_i)\cdot\nabla_\theta g(T_i;\theta^*)\bigg]\\
&+\bigg\{E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*)\nabla^{\top}_\theta g(T_i;\theta^*)\bigg]\\
&\qquad+E\big[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla^2_\theta g(T_i;\theta^*)\big]+o_P(1)\bigg\}\cdot\sqrt{N}\big\{\widehat{\theta}-\theta^*_N\big\}\\
={}&-E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E\big[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i\big]\cdot\delta(T_i)\cdot\nabla_\theta g(T_i;\theta^*)\bigg]\\
&+E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*)\nabla^{\top}_\theta g(T_i;\theta^*)\bigg]\cdot\sqrt{N}\big\{\widehat{\theta}-\theta^*_N\big\}+o_P(1),
\end{align*}
where $\widetilde{\theta}_N$ lies between $\theta^*_N$ and $\widehat{\theta}$, $\widetilde{g}_N(T_i;\theta^*_N)$ lies between $g_N(T_i;\theta^*_N)$ and $g(T_i;\theta^*_N)$, and the fourth equality uses $E[\pi_0(T_i,X_i)m\{Y_i;g_N(T_i;\theta^*_N)\}\mid T_i]=0$ under $H_L$ together with $\sqrt{N}\{g_N(T_i;\theta^*_N)-g(T_i;\theta^*_N)\}=\delta(T_i)$. Hence, we get
\begin{align*}
\sqrt{N}\big\{\widehat{\theta}-\theta^*_N\big\}
={}&E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*)\nabla^{\top}_\theta g(T_i;\theta^*)\bigg]^{-1}\\
&\times E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E\big[m\{Y_i;g(T_i;\theta^*_N)\}\mid T_i,X_i\big]\cdot\delta(T_i)\cdot\nabla_\theta g(T_i;\theta^*_N)\bigg]\\
&-E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*)\nabla^{\top}_\theta g(T_i;\theta^*)\bigg]^{-1}\\
&\times\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\bigg\{\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\nabla_\theta g(T_i;\theta^*)-\pi_0(T_i,X_i)\cdot E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*)\\
&\qquad+E[\pi_0(T_i,X_i)\cdot m\{Y_i;g(T_i;\theta^*)\}\cdot\nabla_\theta g(T_i;\theta^*)\mid X_i]\bigg\}+o_P(1).
\end{align*}
Then, similar to (B.12), we have
\[
\text{(D.3)}=-\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\psi(T_i,X_i,Y_i;t)+\mu(t)+o_P(1),
\]
where
\begin{align*}
\mu(t)={}&E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*)^{\top}H(T_i,t)\bigg]\\
&\times E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot\nabla_\theta g(T_i;\theta^*)\nabla^{\top}_\theta g(T_i;\theta^*)\bigg]^{-1}\\
&\times E\bigg[\pi_0(T_i,X_i)\cdot\frac{\partial}{\partial g}E\big[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i\big]\cdot\delta(T_i)\cdot\nabla_\theta g(T_i;\theta^*)\bigg].
\end{align*}
Hence, we have
\begin{align*}
\widehat{J}_N(t)&=\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\eta(T_i,X_i,Y_i;t)+\mu(t)+o_P(1)\\
&=\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\{U_iH(T_i,t)-\phi(T_i,X_i;t)-\psi(T_i,X_i,Y_i;t)\}+\mu(t)+o_P(1),
\end{align*}
where $E\{\phi(T_i,X_i;t)\}=0$ and $E\{\psi(T_i,X_i,Y_i;t)\}=0$. Therefore, under the local alternatives $H_L$, $\widehat{J}_N(\cdot)$ weakly converges to $J_{\infty,\mu}(\cdot)$ in $L^2(\mathcal{T},dt)$, where $J_{\infty,\mu}(\cdot)$ is a Gaussian process with mean function $\mu(t)$ and covariance function given by $\Sigma(t,t')=E\{\eta(T_i,X_i,Y_i;t)\eta(T_i,X_i,Y_i;t')\}$.

We now prove part (iii). Because
\begin{align*}
N^{-1/2}\widehat{J}_N(t)=\frac{1}{N}\sum_{i=1}^{N}\widehat{U}_iH(T_i,t)
={}&\frac{1}{N}\sum_{i=1}^{N}U_iH(T_i,t)\tag{D.8}\\
&+\frac{1}{N}\sum_{i=1}^{N}\{\widehat{\pi}_K(T_i,X_i)-\pi_0(T_i,X_i)\}m\{Y_i;g(T_i;\theta^*)\}H(T_i,t)\tag{D.9}\\
&+\frac{1}{N}\sum_{i=1}^{N}\pi_0(T_i,X_i)\big[m\{Y_i;g(T_i;\widehat{\theta})\}-m\{Y_i;g(T_i;\theta^*)\}\big]H(T_i,t)\tag{D.10}\\
&+\frac{1}{N}\sum_{i=1}^{N}\{\widehat{\pi}_K(T_i,X_i)-\pi_0(T_i,X_i)\}\big[m\{Y_i;g(T_i;\widehat{\theta})\}-m\{Y_i;g(T_i;\theta^*)\}\big]H(T_i,t).\tag{D.11}
\end{align*}
By applying a similar argument to that for (B.2)-(B.4), we have that (D.9)-(D.11) are of $o_P(1)$. Under $H_1$, the law of large numbers implies $\text{(D.8)}=E[U_iH(T_i,t)]+o_P(1)$. Hence, we conclude the proof.
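The object manipulated throughout these proofs, $\widehat{J}_N(t)=N^{-1/2}\sum_i\widehat{U}_iH(T_i,t)$, and its Cramér-von Mises functional are easy to compute from data. The snippet below is a minimal sketch, not the authors' code: it assumes the particular weight function $H(T_i,t)=1\{T_i\le t\}$ (one common choice), replaces integration against $dF_T$ by the empirical distribution of $T$, and takes the fitted residual weights $\widehat{U}_i$ as given; the function name `cvm_statistic` is ours.

```python
import numpy as np

def cvm_statistic(T, U_hat):
    """Empirical Cramer-von Mises statistic
        CM_N = (1/N) * sum_j J_N(T_j)^2,
    with J_N(t) = N^{-1/2} * sum_i U_hat_i * 1{T_i <= t}.
    H(T_i, t) = 1{T_i <= t} is an assumed choice of weight function."""
    T = np.asarray(T, dtype=float)
    U_hat = np.asarray(U_hat, dtype=float)
    N = len(T)
    # H[i, j] = 1{T_i <= T_j}
    H = (T[:, None] <= T[None, :]).astype(float)
    J = U_hat @ H / np.sqrt(N)   # J[j] = J_N evaluated at t = T_j
    return float(np.mean(J ** 2))
```

Under the null, this quantity converges in distribution to $\int\{J_\infty(t)\}^2\,dF_T(t)$ by Theorem 1(ii); in practice its critical values are obtained by the simulation method described in the paper.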
E Asymptotic properties of $\widehat{J}_N(t;\widehat{\theta}_{opt})$ and $\widehat{CM}_N(\widehat{\theta}_{opt})$

Theorem 8. Suppose that $m(y;g)$ is differentiable with respect to $g$, and that Assumptions 1-5 and Assumptions 6-9 listed in Appendix A hold. Then, under $H_0$,

(i) $\widehat{J}_N(t;\widehat{\theta}_{opt})=\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\eta_{opt}(T_i,X_i,Y_i;t)+o_P(1)$;

(ii) $\widehat{J}_N(\cdot;\widehat{\theta}_{opt})$ converges weakly to $J_{\infty,opt}(\cdot)$ in $L^2\{\mathcal{T},dF_T(t)\}$, where $J_{\infty,opt}$ is a Gaussian process with zero mean and covariance function given by $\Sigma_{opt}(t,t')=E\{\eta_{opt}(T_i,X_i,Y_i;t)\eta_{opt}(T_i,X_i,Y_i;t')\}$.

Furthermore,

(iii) $\widehat{CM}_N(\widehat{\theta}_{opt})$ converges to $\int\{J_{\infty,opt}(t)\}^2\,dF_T(t)$ in distribution.

Proof. We first claim that $\|\widehat{\theta}_{opt}-\theta^*\|\xrightarrow{P}0$ under $H_0$. Since
• $\Theta$ is compact;
• by Proposition 6, $|N^{-1}\cdot\widehat{CM}_N(\theta)-CM(\theta)|\xrightarrow{P}0$ for every $\theta\in\Theta$;
• $CM(\theta)$ is continuous in $\theta$;
• $|\widehat{U}_i(\theta)|=|\widehat{\pi}_K(T_i,X_i)m(Y_i;g(T_i;\theta))|\le O_P(1)\times\sup_{\theta\in\Theta}|m(Y_i;g(T_i;\theta))|$ and $E[\sup_{\theta\in\Theta}|m(Y_i;g(T_i;\theta))|]<\infty$;
it follows from van der Vaart (1998, Theorem 5.7) that $\|\widehat{\theta}_{opt}-\theta^*\|\xrightarrow{P}0$.

We then find the asymptotic expression for $\sqrt{N}\{\widehat{\theta}_{opt}-\theta^*\}$. By the first-order condition, we get
\[
\frac{1}{N}\sum_{i=1}^{N}\widehat{J}_N(T_i;\widehat{\theta}_{opt})\cdot\nabla_\theta\widehat{J}_N(T_i;\widehat{\theta}_{opt})=0.
\]
Using the mean value theorem, we get
\begin{align*}
0={}&\frac{1}{N}\sum_{i=1}^{N}\widehat{J}_N(T_i;\theta^*)\cdot\frac{\nabla_\theta\widehat{J}_N(T_i;\theta^*)}{\sqrt{N}}\\
&+\frac{1}{N}\sum_{i=1}^{N}\bigg\{\frac{\nabla_\theta\widehat{J}_N(T_i;\widetilde{\theta}_{opt})}{\sqrt{N}}\cdot\frac{\nabla_\theta\widehat{J}_N(T_i;\widetilde{\theta}_{opt})^{\top}}{\sqrt{N}}+\frac{\widehat{J}_N(T_i;\widetilde{\theta}_{opt})}{\sqrt{N}}\cdot\frac{\nabla^2_\theta\widehat{J}_N(T_i;\widetilde{\theta}_{opt})}{\sqrt{N}}\bigg\}\cdot\sqrt{N}\big\{\widehat{\theta}_{opt}-\theta^*\big\},
\end{align*}
where $\widetilde{\theta}_{opt}$ lies on the line segment joining $\widehat{\theta}_{opt}$ and $\theta^*$.
Using the fact that $\|\widehat{\theta}_{opt}-\theta^*\|\xrightarrow{p}0$ and Proposition 6, under $H_0$ it is easy to obtain
\begin{align*}
&\frac{1}{N}\sum_{i=1}^{N}\bigg\{\frac{\nabla_\theta\widehat{J}_N(T_i;\widetilde{\theta}_{opt})}{\sqrt{N}}\cdot\frac{\nabla_\theta\widehat{J}_N(T_i;\widetilde{\theta}_{opt})^{\top}}{\sqrt{N}}+\frac{\widehat{J}_N(T_i;\widetilde{\theta}_{opt})}{\sqrt{N}}\cdot\frac{\nabla^2_\theta\widehat{J}_N(T_i;\widetilde{\theta}_{opt})}{\sqrt{N}}\bigg\}\\
={}&\int_{\mathcal{T}}E\bigg[\pi_0(T,X)\cdot\frac{\partial}{\partial g}m\{Y;g(T;\theta^*)\}\cdot\nabla_\theta g(T;\theta^*)H(T,t)\bigg]\times E\bigg[\pi_0(T,X)\cdot\frac{\partial}{\partial g}m\{Y;g(T;\theta^*)\}\cdot\nabla_\theta g(T;\theta^*)^{\top}H(T,t)\bigg]f_T(t)\,dt+o_P(1)\\
={}&\int_{\mathcal{T}}B_tB_t^{\top}f_T(t)\,dt+o_P(1),
\end{align*}
where
\[
B_t:=E\bigg[\pi_0(T,X)\cdot\frac{\partial}{\partial g}m\{Y;g(T;\theta^*)\}\cdot\nabla_\theta g(T;\theta^*)H(T,t)\bigg].
\]
For $\widehat{J}_N(t;\theta^*)$, under $H_0$: $E[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}\mid T=t]=0$, by using Proposition 7 we get
\begin{align*}
\widehat{J}_N(t;\theta^*)={}&\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\widehat{\pi}_K(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}H(T_i,t)\\
={}&\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\bigg\{\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}H(T_i,t)-\pi_0(T_i,X_i)\cdot E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot H(T_i,t)\\
&\qquad+E[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}H(T_i,t)\mid X_i]\bigg\}+o_P(1)\\
={}&\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\varphi_{opt}(T_i,X_i,Y_i;t)+o_P(1),
\end{align*}
where
\begin{align*}
\varphi_{opt}(T_i,X_i,Y_i;t):={}&\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}H(T_i,t)-\pi_0(T_i,X_i)\cdot E[m\{Y_i;g(T_i;\theta^*)\}\mid T_i,X_i]\cdot H(T_i,t)\\
&+E[\pi_0(T_i,X_i)m\{Y_i;g(T_i;\theta^*)\}H(T_i,t)\mid X_i].
\end{align*}
Now, we have
\begin{align*}
\sqrt{N}\big\{\widehat{\theta}_{opt}-\theta^*\big\}&=-\bigg\{\int_{\mathcal{T}}B_tB_t^{\top}f_T(t)\,dt\bigg\}^{-1}\bigg\{\frac{1}{N}\sum_{i=1}^{N}\widehat{J}_N(T_i;\theta^*)\cdot\frac{\nabla_\theta\widehat{J}_N(T_i;\theta^*)}{\sqrt{N}}\bigg\}+o_P(1)\\
&=-\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\bigg\{\int_{\mathcal{T}}B_tB_t^{\top}f_T(t)\,dt\bigg\}^{-1}\cdot\int_{\mathcal{T}}\varphi_{opt}(T_i,X_i,Y_i;t)\cdot B_t\cdot f_T(t)\,dt+o_P(1).
\end{align*}
Let
\[
\psi_{opt}(T_i,X_i,Y_i;t):=B_t^{\top}\bigg\{\int_{\mathcal{T}}B_sB_s^{\top}f_T(s)\,ds\bigg\}^{-1}\int_{\mathcal{T}}\varphi_{opt}(T_i,X_i,Y_i;s)\,B_s\,f_T(s)\,ds.
\]
Following a similar argument to that establishing Theorem 1, we get
\begin{align*}
\widehat{J}_N(t;\widehat{\theta}_{opt})&=\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\eta_{opt}(T_i,X_i,Y_i;t)+o_P(1)\\
&=\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\{U_iH(T_i,t)-\phi(T_i,X_i;t)-\psi_{opt}(T_i,X_i,Y_i;t)\}+o_P(1).
\end{align*}
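The strict inequality $\Sigma_0(t,t)>\Sigma(t,t)$ established in the proof of Theorem 2 — the statistic built on the estimated weights $\widehat{\pi}_K$ is more efficient than the one built on the true weights $\pi_0$ — can be seen in miniature by Monte Carlo. The sketch below is our own toy illustration, not the paper's continuous-treatment setting: a binary covariate, a missing-at-random indicator $R$ standing in for the weighting mechanism, and inverse-probability-weighted means computed with true versus cell-frequency-estimated observation probabilities. All names and parameter values are ours.

```python
import numpy as np

rng = np.random.default_rng(42)
reps, N = 4000, 400
p0, p1 = 0.4, 0.8                 # true observation probabilities given X
est_true, est_hat = [], []
for _ in range(reps):
    X = rng.integers(0, 2, size=N)                # binary covariate
    p = np.where(X == 1, p1, p0)
    R = (rng.uniform(size=N) < p).astype(float)   # observation indicator
    Y = 2.0 * X + rng.normal(size=N)              # outcome, used only when R = 1
    # estimated weights: observed fraction within each stratum of X
    p_hat = np.where(X == 1, R[X == 1].mean(), R[X == 0].mean())
    est_true.append(np.mean(R * Y / p))           # true-weight estimator of E[Y] = 1
    est_hat.append(np.mean(R * Y / p_hat))        # estimated-weight estimator
var_true, var_hat = np.var(est_true), np.var(est_hat)
# Both are approximately unbiased; the estimated-weight version has
# smaller Monte Carlo variance.
```

Across the replications the estimated-weight estimator exhibits visibly smaller variance, mirroring the mechanism in the proof: using $\widehat{\pi}_K$ in place of $\pi_0$ projects out part of the influence function, exactly the negative term isolated before the Jensen step.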