Fixed Effects Binary Choice Models with Three or More Periods∗

Laurent Davezies†    Xavier D’Haultfœuille‡    Martin Mugnier§

Abstract
We consider fixed effects binary choice models with a fixed number of periods T and without a large support condition on the regressors. If the time-varying unobserved terms are i.i.d. with known distribution F, Chamberlain (2010) shows that the common slope parameter is point-identified if and only if F is logistic. However, he considers in his proof only T = 2. We show that the result does not in fact generalize to T ≥ 3: the common slope parameter and some parameters of the distribution of the shocks can be identified when F belongs to a family including the logit distribution. Identification is based on a conditional moment restriction. We give necessary and sufficient conditions on the covariates for this restriction to identify the parameters. In addition, we show that under mild conditions, the corresponding GMM estimator reaches the semiparametric efficiency bound when T = 3.

Keywords: binary choice model, panel data, point identification, conditional moment restrictions.
JEL Codes:
C14, C23, C25.

∗We would like to thank Pascal Lavergne for his comments.
†CREST, [email protected].
‡CREST, [email protected].
§CREST, [email protected].

1 Introduction
In this paper, we revisit the classical binary choice model with fixed effects. Specifically, let T denote the number of periods and suppose that we observe, for individual i, (Y_it, X_it)_{t=1,...,T} with

Y_it = 1{X_it'β_0 + γ_i − ε_it ≥ 0},   (1.1)

where β_0 ∈ R^K is unknown and ε_it ∈ R is an idiosyncratic shock. The nonlinear nature of the model and the absence of restriction on the distribution of γ_i conditional on X_i := (X_i1, ..., X_iT) render the identification of β_0 difficult. Rasch (1960) shows that if the (ε_it)_{t=1,...,T} are i.i.d. with a logistic distribution, a conditional maximum likelihood can be used to identify and estimate β_0. Chamberlain (2010) establishes a striking converse of Rasch's result: if the (ε_it)_{t=1,...,T} are i.i.d. with distribution F and the support of X_i is bounded, β_0 is point identified only if F is logistic. Other papers have circumvented such a negative result by either considering large support regressors (see in particular Manski, 1987; Honore and Lewbel, 2002) or allowing for dependence between the shocks (see Magnac, 2004).

It turns out, however, that Chamberlain (2010) only proves his result for T = 2. And in fact, we show that his result does not generalize to T ≥ 3. Specifically, we consider distributions F satisfying

F(x)/(1 − F(x)) = ∑_{k=1}^τ w_k exp(λ_k x)   or   (1 − F(x))/F(x) = ∑_{k=1}^τ w_k exp(−λ_k x),   (1.2)

with T ≥ τ + 1, (w_1, ..., w_τ) ∈ (0, ∞) × [0, ∞)^{τ−1} and 1 = λ_1 < ... < λ_τ. We study the identification of β_0, assuming that λ_0 := (λ_1, ..., λ_τ) is known, but also that of θ_0 := (β_0', λ_0')'. In both cases, the weights w_1, ..., w_τ remain unknown, thus allowing for much more flexibility on the distribution of ε_it than in the logit case. Our main insight is that for any F satisfying (1.2), a conditional moment restriction holds. We then give necessary as well as sufficient conditions for such moment restrictions to identify β_0 or θ_0. The necessary conditions show for instance that with τ ≥ 2, identification of β_0 cannot be achieved with a single, binary X_it. On the other hand, our sufficient conditions imply that, at least if γ is constant, θ_0 is identified if, conditional on (X_{j,t'})_{(j,t')≠(k,t)}, X_{k,t} takes at least 2τ values. Note that Johnson (2004) considers the same family with τ = 2 and T = 3. However, he does not study the general case and does not show any formal identification result based on the corresponding moment conditions.

Obviously, the conditional moment condition can be used to construct GMM estimators. This means, in particular, that √n-consistent estimation is possible beyond the logit case when T > 2, overturning again the negative results of Chamberlain (2010) and Magnac (2004). Further, we show that if T = 3 and mild additional restrictions hold, the optimal GMM estimator based on our conditional moment conditions reaches the semiparametric efficiency bound of the model. This means that at least when T = 3, these moment conditions contain all the information of the model. We also show through simulations that this information is sufficient to form rather precise estimators for usual sample sizes.

The remainder of the paper is organized as follows. Section 2 gives necessary and sufficient conditions for point identification of β_0 and the λ_j. Section 3 discusses estimation and the semiparametric efficiency bound of the model. Section 4 reports results from a Monte-Carlo study. Section 5 concludes. All the proofs are collected in the appendix.

2 Identification

We drop the subscript i in the absence of ambiguity and let Y := (Y_1, ..., Y_T)', X := (X_1, ..., X_T)' and X_t := (X_{1,t}, ..., X_{K,t})'. For any set A ⊂ R^p (for any p ≥ 1), A^* := A\{0} and |A| denotes the cardinality of A. Hereafter, we maintain the following conditions.

Assumption 1 (Binary choice panel model) Equation (1.1) holds and:

1. (X, γ) and (ε_t)_{1≤t≤T} are independent and the (ε_t)_{1≤t≤T} are i.i.d. with a known cumulative distribution function (cdf) F.
2. For all (k, t), E[X_{k,t}^2] < ∞.
3. β_0 ∈ (R^K)^*.

The first condition is also considered in Chamberlain (2010). The second condition is a standard moment restriction on the covariates. Finally, the third condition excludes the case β_0 = 0. This case can be treated separately, as the following proposition shows.

Proposition 2.1
Suppose that Assumption 1 holds, F is strictly increasing on R and there exist (t, t') ∈ {1, ..., T}^2 such that E[(X_t − X_{t'})(X_t − X_{t'})'] is nonsingular. Then β_0 = 0 if and only if

P(Y_t = 1, Y_{t'} = 0 | Y_t + Y_{t'} = 1, X_t, X_{t'}) = 1/2 a.s.   (2.1)

Condition (2.1) can be tested by a specification test on the nonparametric regression of D = Y_t(1 − Y_{t'}) on (X_t, X_{t'}), conditional on the event Y_t + Y_{t'} = 1. See, e.g., Bierens (1990) or Hong and White (1995).

Turning to identification on (R^K)^*, we first recall the negative result of Chamberlain (2010).

Theorem 2.2
Suppose that T = 2, Assumption 1 holds, F is strictly increasing on R with bounded, continuous derivative and Supp(X) is compact. If, for all β_0 ∈ (R^K)^*, β_0 is identified, then F(x)/(1 − F(x)) = w exp(λx) for some (w, λ) ∈ (R_+^*)^2.

Our results below imply, however, that this negative result does not generalize to T > 2. To this end, we consider a family of distributions that includes the logistic distribution and is defined as follows. Hereafter, Λ_τ denotes a subset of {(λ_1, ..., λ_τ) ∈ R^τ : 1 = λ_1 < ... < λ_τ}.

Assumption 2 (“Generalized” logistic distributions)
There exists a known τ ∈ {1, ..., T − 1}, unknown w := (w_1, ..., w_τ) ∈ (0, ∞) × [0, ∞)^{τ−1} and λ_0 := (λ_1, ..., λ_τ) ∈ Λ_τ such that:

Either F(x)/(1 − F(x)) = ∑_{j=1}^τ w_j exp(λ_j x)   (First type),
or (1 − F(x))/F(x) = ∑_{j=1}^τ w_j exp(−λ_j x)   (Second type).

Noteworthy, the family of “generalized” logistic distributions we consider differs from those introduced by Balakrishnan and Leung (1988) and Stukel (1988).
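As a simple numerical sanity check (our own illustration, not part of the original analysis; the function name `gen_logistic_cdf` is ours), a first-type cdf can be built directly from (w, λ) via F = G/(1 + G) with G(x) = ∑_j w_j exp(λ_j x), and τ = 1 with w_1 = 1 recovers the standard logistic cdf:

```python
import numpy as np

def gen_logistic_cdf(x, w, lam):
    # First-type family: F(x) / (1 - F(x)) = sum_j w_j * exp(lam_j * x),
    # hence F(x) = G(x) / (1 + G(x)) with G(x) = sum_j w_j * exp(lam_j * x).
    x = np.asarray(x, dtype=float)
    G = sum(wj * np.exp(lj * x) for wj, lj in zip(w, lam))
    return G / (1.0 + G)

x = np.linspace(-8.0, 8.0, 401)

# tau = 1, w = (1,), lam = (1,): the standard logistic cdf
F1 = gen_logistic_cdf(x, w=[1.0], lam=[1.0])
assert np.allclose(F1, 1.0 / (1.0 + np.exp(-x)))

# tau = 2, lam = (1, 2): still a strictly increasing cdf with limits 0 and 1
F2 = gen_logistic_cdf(x, w=[0.3, 0.7], lam=[1.0, 2.0])
assert np.all(np.diff(F2) > 0) and F2[0] < 1e-3 and F2[-1] > 1 - 1e-3
```

Since G is a positive combination of increasing exponentials, F = G/(1 + G) is automatically a valid, strictly increasing cdf for any admissible (w, λ).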
We fix min{λ_1, ..., λ_τ} to 1 as the scale of the latent variable X_it'β_0 + γ_i − ε_it is not identified. Also, if F is of the second type, then one can show that the cdf of −ε_it is of the first type. Thus, up to changing (Y_t, X_t) into (1 − Y_t, −X_t), we can assume without loss of generality, as we do afterwards, that F is of the first type. We shall see that τ + 1 periods are sufficient to achieve identification. Hence, we assume, again without loss of generality, that T = τ + 1: if T > τ + 1, we can always focus on τ + 1 periods.

We consider the identification of not only β_0 but also λ_0. We then let θ_0 := (β_0', λ_0')' and Θ := (R^K)^* × Λ_τ. We also define, for any (y, x, θ) ∈ {0, 1}^T × Supp(X) × Θ,

m(y, x; θ) := ∑_{t=1}^T 1{y_t = 1, y_{t'} = 0 ∀ t' ≠ t} M_t(x; θ),

where for all j ∈ {1, ..., T}, M_j(x; θ) is the (1, j)-cofactor of the matrix

( 1                ...  1
  exp(λ_1 x_1'β)   ...  exp(λ_1 x_T'β)
  ⋮                     ⋮
  exp(λ_τ x_1'β)   ...  exp(λ_τ x_T'β) ).

As we also consider identification of β_0 alone, we also let, with a slight abuse of notation, m(y, x; β) := m(y, x; (β', λ_0')'). Our first result shows that the conditional moment of m(Y, X; θ_0) is zero.

Theorem 2.3
If Assumptions 1-2 hold, we have, almost surely,

E[m(Y, X; θ_0) | X] = 0.   (2.2)

Theorem 2.3 shows there exists a known moment condition which potentially identifies θ_0 in a model more general than the logistic one. It shows that, as the number of periods T increases, there is an increasing class of distributions F for which β_0 (or θ_0) can be point identified. This is consistent with the idea that if T = ∞, β_0 is point identified for any F, by using variations in X_t of a single individual. It also complements the results of Chernozhukov et al. (2013) showing that bounds on β_0 for general F shrink quickly as T increases.

Note that the result also holds with T = τ + 1 = 2 (or, more generally, with T > τ = 1). In such a case, the conditional moment condition can be written

E[1{Y_2 > Y_1} exp(X_1'β_0) − 1{Y_1 > Y_2} exp(X_2'β_0) | X] = 0.

This conditional moment generates the first-order conditions of the theoretical conditional likelihood, since the latter is equivalent to

E[ (X_2 − X_1)/(exp(X_1'β_0) + exp(X_2'β_0)) (1{Y_2 > Y_1} exp(X_1'β_0) − 1{Y_1 > Y_2} exp(X_2'β_0)) ] = 0.

The discussion above implies that with T = τ + 1 = 2, β_0 is identified by (2.2) as soon as E[(X_2 − X_1)(X_2 − X_1)'] is nonsingular. We now consider sufficient conditions for (2.2) to identify θ_0 (or β_0) more generally, not only with τ = 1. The moment conditions in the general case are highly nonlinear, making it difficult to provide a complete characterization. First, we consider the case where γ is actually constant. For any (k, t) ∈ {1, ..., K} × {1, ..., T}, we let X_k := (X_{k,1}, ..., X_{k,T}), X_{−k} := (X_{k',t})_{k'≠k, t=1,...,T} and X_{k,−t} := (X_{k,s})_{s≠t}.

Proposition 2.4
Assume that Assumptions 1-2 are satisfied, T = τ + 1 ≥ 2, V(γ) = 0 and, for all (k, t) ∈ {1, ..., K} × {1, ..., T}, |Supp(X_{k,t} | X_{−k}, X_{k,−t})| ≥ 2τ. Then,

E[m(Y, X; θ) | X] = 0 a.s. ⇒ θ = θ_0.   (2.3)

Proposition 2.4 shows that in the absence of fixed effects, the conditional moment condition E[m(Y, X; θ) | X] = 0 is sufficient to identify θ_0 under mild restrictions on the distribution of X. In particular, all components of X may be discrete. The result relies in particular on the fact that for any λ ∈ Λ_τ, the family of functions (v ↦ exp(λ_j v))_{j=1,...,τ} forms a Chebyshev system (see, e.g., Krein and Nudelman, 1977): v ↦ ∑_{j=1}^T a_j exp(λ_j v) does not vanish more than T − 1 times unless it is identically zero. Because we consider identification based on (2.2) alone, we suppose this additional restriction to be unknown by the econometrician. A close inspection of the proof reveals that the support restrictions on X could actually be weakened further, but at the expense of complicating the condition.

We now turn to the case where γ is nondegenerate and possibly correlated with X, which is more realistic in practice. For any (t, ℓ, x) ∈ {1, ..., T} × {1, ..., τ} × Supp(X), let us define

a_{t,ℓ,x} : v ↦ E[ exp(λ_ℓ γ) / ( C(γ, x; θ_0, t) (1 + ∑_{j=1}^τ w_j δ_j(x; θ_0, t) exp(λ_j(β_{0k} v + γ))) ) | X = x ],

where C(γ, x; θ_0, t) := ∏_{t'≠t} (1 + ∑_{j=1}^τ w_j exp(λ_j(x_{t'}'β_0 + γ))) and δ_j(x; θ_0, t) := exp(λ_j x_t'β_0). We consider the following conditions.

Assumption 3
1. There exist (t_1, t_2) ∈ {1, ..., T}^2 such that E[(X_{t_1} − X_{t_2})(X_{t_1} − X_{t_2})'] is nonsingular.
2. There exists k ∈ {1, ..., K} such that β_{0k} ≠ 0 and, almost surely, X_k | X_{−k} admits a density with respect to the Lebesgue measure.

Assumption 4

1. X_k ⊥⊥ γ | X_{−k}.
2. There exists some t_3 ∈ {1, ..., T}\{t_1, t_2} such that, for all (β_k, λ) ∈ R^* × Λ_τ, {λ_1 β_k, ..., λ_τ β_k} ∩ {λ_{01} β_{0k}, ..., λ_{0τ} β_{0k}} = ∅ implies that the τ(τ + 1) functions {a_{t_3,ℓ,x}(v) exp(λ_{0ℓ} β_{0k} v), a_{t_3,ℓ,x}(v) exp(λ_1 β_k v), ..., a_{t_3,ℓ,x}(v) exp(λ_τ β_k v)}_{ℓ=1,...,τ} form a free family of functions over R, for almost all x ∈ Supp(X).

Assumption 4'

1. X_k ⊥⊥ γ | X_{−k}.
2. There exists some t_3 ∈ {1, ..., T}\{t_1, t_2} such that, for all β_k ∈ R^*, {λ_{01} β_k, ..., λ_{0τ} β_k} ∩ {λ_{01} β_{0k}, ..., λ_{0τ} β_{0k}} = ∅ implies that the τ(τ + 1) functions {a_{t_3,ℓ,x}(v) exp(λ_{0ℓ} β_{0k} v), a_{t_3,ℓ,x}(v) exp(λ_{01} β_k v), ..., a_{t_3,ℓ,x}(v) exp(λ_{0τ} β_k v)}_{ℓ=1,...,τ} form a free family of functions over R, for almost all x ∈ Supp(X).

Again, Assumptions 3 and 4 (or 3 and 4') are supposed to be unknown by the econometrician. Assumption 3.1 requires some variation in (X_{t_1} − X_{t_2})'β_0. Assumption 3.2 imposes that at least one regressor is continuously distributed. Assumptions 4 and 4' are very close, Assumption 4' being a weaker form of Assumption 4 that turns out to be sufficient to identify β_0 only, when λ_0 is supposed to be known. When combined with Assumption 3.2, Assumption 4.1 (or Assumption 4'.1) is similar to, but less restrictive than, Assumption R.iii of Magnac and Maurin (2007) or Assumptions A.2-3 in Honore and Lewbel (2002). Importantly, it does not imply any large support restriction. Assumptions 4.2 and 4'.2 are high-level conditions that we discuss below.

Proposition 2.5
Suppose that Assumptions 1-3 hold and T = τ + 1. Then:

1. If Assumption 4 holds as well, then (2.3) holds.
2. If Assumption 4' holds as well,

E[m(Y, X; β) | X] = 0 a.s. ⇒ β = β_0.   (2.4)

The proof relies on two main ingredients. The first is, again, the upper bound on the number of roots of exponential “polynomials”. The second is the analyticity of the conditional moment as a function of X_{k,t}. By a continuation theorem on real analytic functions (see e.g. Corollary 1.2.5 in Krantz and Parks, 2002), this allows us to extend the conditional moment function from any x ∈ Supp(X) to any x' such that x'_{j,t} = x_{j,t} for all (j, t) ≠ (k, t') and x'_{k,t'} ∈ R.

Assumptions 4.2 and 4'.2 are high-level and technical. We conjecture that they hold under mild restrictions on the distribution of γ. The following proposition, restricted to T = 3 and a binary γ, substantiates this claim.

Proposition 2.6
Let T = τ + 1 = 3 and Λ_τ ⊂ {(1, λ_2) : λ_2 > 1}. If |Supp(γ | X)| = 2 almost surely, Assumption 4'.2 is satisfied.

We now turn to necessary conditions for (2.4) to hold. We consider the following assumption.
Assumption 5 P(X ∈ {x ∈ R^{KT} : |{x_1, ..., x_T}| = T}) > 0.

Assumption 5 requires that, with positive probability, X = (X_1, ..., X_T)' takes distinct values at all periods. Since we focus here on T ≥ 3, this excludes in particular the case where X_t is binary. But contrary to Assumption 3, Assumption 5 does not exclude the case where all covariates are discrete, and can be expected to hold if |Supp(X_t)| ≥ T. The following proposition shows that Assumption 5 is actually necessary for the conditional moment condition E[m(Y, X; β) | X] = 0 to identify β_0.

Proposition 2.7
Suppose that Assumptions 1-2 are satisfied and T = τ + 1 ≥ 3. Then, if (2.4) holds, Assumption 5 holds as well.

3 Estimation

In the following, we assume that λ_0 is known and focus on the estimation of β_0. The conditional moment condition (2.4) can be transformed into unconditional conditions such that standard GMM estimators can easily be constructed. Letting g(X) ∈ R^K, such estimators β̂ satisfy

β̂ = argmin_{β ∈ B} ( (1/n) ∑_{i=1}^n g(X_i) m(Y_i, X_i; β) )' ( (1/n) ∑_{i=1}^n g(X_i) m(Y_i, X_i; β) ),   (3.1)

where B is a compact subset of (R^K)^*. The optimal estimator among this class is obtained by choosing g^*(X) := R(X)/Ω(X), with R(X) = E[∇_β m(Y, X; β_0) | X] and Ω(X) = V[m(Y, X; β_0) | X] (see Chamberlain, 1987). Given that R(X) and Ω(X) are unknown, an asymptotically efficient GMM estimator can be obtained in two steps. In a first step, g(X) is chosen arbitrarily and we compute the corresponding estimator β̂_1. In a second step, we compute ĝ^*(X) = R̂(X)/Ω̂(X), where R̂(X) = Ê[∇_β m(Y, X; β̂_1) | X] and Ω̂(X) = V̂[m(Y, X; β̂_1) | X] are standard nonparametric estimators (e.g., kernel or series estimators). We then compute the estimator β̂^* based on ĝ^*(X). Under regularity conditions displayed in, e.g., Newey (1990), we have

√n (β̂^* − β_0) →_d N(0, V),   (3.2)

with V := E[Ω(X)^{−1} R(X)R(X)']^{−1}. (Estimation of θ_0 could be performed in the same way as that of β_0, but it is unclear to us whether the corresponding estimator would reach the semiparametric efficiency bound of θ_0, something we prove below for β_0.) To obtain this result, two assumptions are worth mentioning. The first is an identifiability condition when using the optimal instruments: E[g^*(X) m(Y, X; β)] = 0 ⇒ β = β_0. Such a condition may fail to hold, as shown by Dominguez and Lobato (2004).
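To illustrate how a GMM estimator of the form (3.1) can be implemented, the sketch below (our own illustrative code, not the authors' implementation) estimates β_0 in the simplest case τ = 1, T = 2, K = 1 with logistic shocks, using the conditional moment E[1{Y_2 > Y_1} exp(X_1 β_0) − 1{Y_1 > Y_2} exp(X_2 β_0) | X] = 0 of Section 2 with the arbitrary instrument g(X) = X_2 − X_1; the two-step optimal-instrument refinement is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta0 = 20_000, 1.0

# T = 2, K = 1; fixed effect correlated with X; logistic shocks (tau = 1)
X = rng.normal(size=(n, 2))
gamma = 0.5 * X.sum(axis=1) + rng.normal(size=n)
eps = rng.logistic(size=(n, 2))
Y = (beta0 * X + gamma[:, None] - eps >= 0.0).astype(float)

def m(beta):
    # m(Y, X; beta) = 1{Y2 > Y1} exp(X1 beta) - 1{Y1 > Y2} exp(X2 beta)
    return ((Y[:, 1] > Y[:, 0]) * np.exp(beta * X[:, 0])
            - (Y[:, 0] > Y[:, 1]) * np.exp(beta * X[:, 1]))

g = X[:, 1] - X[:, 0]                      # arbitrary instrument g(X)
grid = np.linspace(0.25, 2.5, 901)
obj = np.array([np.mean(g * m(b)) ** 2 for b in grid])
beta_hat = grid[obj.argmin()]              # GMM estimate, close to beta0
```

Here a grid search replaces a numerical optimizer purely to keep the sketch dependency-free; the fixed effect may depend on X arbitrarily, as allowed by the model.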
Other estimators relying on the full set of moments can be used to prevent this identification failure (see in particular Dominguez and Lobato, 2004; Hsu and Kuan, 2011; Lavergne and Patilea, 2013). The second condition is that E[Ω(X)^{−1} R(X)R(X)'] exists and is nonsingular. Nonsingularity holds if and only if E[R(X)R(X)'] is nonsingular, which is a local identification condition.

We now establish that with T = τ + 1 = 3, the semiparametric efficiency bound actually coincides with the asymptotic variance V of the optimal GMM estimator. The result holds under the following condition.

Assumption 6

1. E[Ω^{−1}(X) R(X)R(X)'] exists and is nonsingular.
2. |Supp(γ | X)| ≥ 2 almost surely.

We already discussed the first condition. The second condition we impose is weaker than that imposed by Chamberlain (2010), namely Supp(γ | X) = R.

Theorem 3.1
Assume T = τ + 1 = 3, λ_0 is known with λ_2 = 2 and Assumptions 1-3 and 6 hold. Then the semiparametric efficiency bound of β_0, V^*(β_0), is finite and satisfies V^*(β_0) = V.

Intuitively, this result states that all the information content of the model is included in the conditional moment restriction E[m(Y, X; β_0) | X] = 0. It complements, for T = τ + 1 = 3, the result of Hahn (1997), which states that the conditional maximum likelihood estimator is the efficient estimator of β_0 if F is logistic. The difference between the two results is that here, (w_1, w_2) is unknown rather than known and equal to (1, 0).

4 Monte-Carlo simulations
We conduct numerical simulations in order to characterize the finite sample performance of β̂^*. We let T = τ + 1 = 3 and consider both (w_1, w_2) = (0.1, 0.9) and (w_1, w_2) = (0.5, 0.5). We set λ_0 = (1, 5) and suppose it is known. Next, we let K = 1 and β_0 = 1, with X_t ∈ {−1, 0, 1} (note that X_t is not binary). We first draw X_1 uniformly over {−1, 0, 1}, then draw X_2 uniformly over {−1, 0, 1}\{X_1} and finally let X_3 be the remaining element in {−1, 0, 1}\{X_1, X_2}. Note that Assumption 3.2 fails to hold with such an X. But as explained above, this condition is only sufficient, not necessary, for identification. We then consider five data generating processes (DGPs) where the r.v. γ is:

i. Constant: γ = 0.
ii. Discrete and independent of X: P(γ = −1/2 | X) = P(γ = 0 | X) = P(γ = 1/2 | X) = 1/3.
iii. Continuous and independent of X: γ | X ∼ U([−1/2, 1/2]).
iv. Discrete and correlated with X: γ = UZ, where (U, Z) ∈ {−1/2, 1/2} × {0, 1} and the probabilities P(U = 1/2 | X) and P(Z = 1 | X, U) depend on X.
v. Continuous and correlated with X: γ = UZ, where U | X ∼ U([0, 1/2]), Z ∈ {−1/2, 1/2} and P(Z = 1/2 | X, U) depends on X.

We finally consider n ∈ {500, 1,000, 2,000, 4,000}. With the DGPs above, the subsample effectively used in the estimation, namely {i ∈ {1, ..., n} : ∑_{t=1}^3 Y_it = 1}, represents on average 47.8% of the initial sample.

To compute the optimal GMM estimator, the usual practice is to estimate ĝ^* using an inefficient GMM estimator. However, in the current set-up, such estimators are often equal to zero if g is not chosen appropriately. To overcome this finite sample issue, we first use a rough estimator g̃^* of g^* based on the conditional maximum likelihood estimator β̂_CML of β_0, assuming a logistic distribution. Then, using g̃^*, we obtain an initial GMM estimator β̃, which allows us to compute a second (and consistent) estimator ĝ^* of g^*. Finally, we compute the asymptotically optimal GMM estimator β̂^* using ĝ^*.

Table 1: Simulation results for β̂^*

[Bias and RMSE of β̂^*, for each DGP i.-v., each sample size n and each value of w_1.]

Notes: β_0 = 1, λ_0 = (1, 5), w_2 = 1 − w_1. The optimal instruments are estimated using conditional means and β̂_CML. The results are based on 10,000 sample replications.
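For concreteness, the simulation design above can be mimicked as follows. This is our own sketch, with illustrative names: we use DGP ii, and we take λ = (1, 2) rather than the λ_0 of the text so that the inverse cdf of a first-type F reduces to a quadratic in e^x and can be solved in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta0 = 10_000, 1.0
w1, w2 = 0.1, 0.9          # weights, with w2 = 1 - w1

def draw_eps(size, w1, w2, rng):
    # Inverse-cdf draw from a first-type F with lam = (1, 2), chosen here so
    # that F/(1-F) = w1 e^x + w2 e^{2x} = u/(1-u) is a quadratic in t = e^x,
    # solved by its positive root.
    u = rng.uniform(size=size)
    c = u / (1.0 - u)
    t = (-w1 + np.sqrt(w1 ** 2 + 4.0 * w2 * c)) / (2.0 * w2)
    return np.log(t)

# X_1, X_2, X_3: a random permutation of {-1, 0, 1}, as in the text
X = np.array([rng.permutation([-1.0, 0.0, 1.0]) for _ in range(n)])
# DGP ii: gamma discrete, independent of X, uniform on {-1/2, 0, 1/2}
gamma = rng.choice([-0.5, 0.0, 0.5], size=n)
eps = draw_eps((n, 3), w1, w2, rng)
Y = (beta0 * X + gamma[:, None] - eps >= 0.0).astype(float)

# Only observations with exactly one Y_t = 1 enter the moment function m
share_used = (Y.sum(axis=1) == 1).mean()
```

With w1 + w2 = 1, the median of ε is 0 (since G(0) = 1), which makes the share of informative observations easy to sanity-check against the figure reported in the text.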
For each DGP and the two values of w_1, Table 1 reports the estimated bias and root mean square error (RMSE) of β̂^*. The estimator β̂^* is precise in the absence of fixed effects. When fixed effects are introduced, the bias and RMSE vary with (w, λ_0). Overall, the results suggest that for a given sample size n, the bias and RMSE are lower when w_2 − w_1 increases, when |Supp(γ | X)| increases or when γ is uncorrelated with X. The second case is consistent with our conjecture about Assumption 4.

5 Conclusion

This paper addresses the problem of point identification of the common slope parameter in a static panel binary model with exogenous and bounded regressors. We derive necessary and sufficient conditions for global point identification based on a conditional moment restriction when T ≥ 3, and we show that the corresponding GMM estimator reaches the semiparametric efficiency bound when T = 3. Our paper leaves a few questions unanswered. A first one is whether the family of F considered here is the only one for which point identification can be achieved. Another one is whether the GMM estimator still reaches the semiparametric efficiency bound when T > 3. Both questions raise difficult issues and deserve future investigation.

References
Balakrishnan, N. and Leung, M. (1988), ‘Order statistics from the Type I generalized logistic distribution’, Communications in Statistics - Simulation and Computation 17(1), 25–50.

Bierens, H. J. (1990), ‘A consistent conditional moment test of functional form’, Econometrica 58(6), 1443–1458.

Chamberlain, G. (1987), ‘Asymptotic efficiency in estimation with conditional moment restrictions’, Journal of Econometrics 34(3), 305–334.

Chamberlain, G. (2010), ‘Binary response models for panel data: Identification and information’, Econometrica 78(1), 159–168.

Chernozhukov, V., Fernández-Val, I., Hahn, J. and Newey, W. (2013), ‘Average and quantile effects in nonseparable panel models’, Econometrica 81(2), 535–580.

Davezies, L., D’Haultfœuille, X. and Mugnier, M. (2020), ‘Online appendix for fixed effects binary choice models with three or more periods’, https://faculty.crest.fr/xdhaultfoeuille/wp-content/uploads/sites/9/2020/09/online-appendix.pdf.

Dominguez, M. A. and Lobato, I. N. (2004), ‘Consistent estimation of models defined by conditional moment restrictions’, Econometrica 72(5), 1601–1615.

Hahn, J. (1997), ‘A note on the efficient semiparametric estimation of some exponential panel models’, Econometric Theory 13(4), 583–588.

Hong, Y. and White, H. (1995), ‘Consistent specification testing via nonparametric series regression’, Econometrica 63(5), 1133–1159.

Honore, B. E. and Lewbel, A. (2002), ‘Semiparametric binary choice panel data models without strictly exogeneous regressors’, Econometrica 70(5), 2053–2063.

Hsu, S.-H. and Kuan, C.-M. (2011), ‘Estimation of conditional moment restrictions without assuming parameter identifiability in the implied unconditional moments’, Journal of Econometrics 165(1), 87–99.

Johnson, E. G. (2004), ‘Identification in discrete choice models with fixed effects’, Working paper, Bureau of Labor Statistics.

Krantz, S. and Parks, H. (2002), A Primer of Real Analytic Functions, Advanced Texts Series, Birkhäuser Boston.

Krein, M. and Nudelman, A. A. (1977), The Markov Moment Problem and Extremal Problems, American Mathematical Society.

Lavergne, P. and Patilea, V. (2013), ‘Smooth minimum distance estimation and testing with conditional estimating equations: uniform in bandwidth theory’, Journal of Econometrics 177(1), 47–59.

Magnac, T. (2004), ‘Binary variables and sufficiency: Generalizing conditional logit’, Econometrica 72(6), 1859–1876.

Magnac, T. and Maurin, E. (2007), ‘Identification and information in monotone binary models’, Journal of Econometrics 139(1), 76–104.

Manski, C. F. (1987), ‘Semiparametric analysis of random effects linear models from binary panel data’, Econometrica 55(2), 357–362.

Newey, W. K. (1990), ‘Efficient instrumental variables estimation of nonlinear models’, Econometrica 58(4), 809–837.

Rasch, G. (1960), Probabilistic Models for Some Intelligence and Attainment Tests, Copenhagen: Denmarks Paedagogiske Institute.

Stukel, T. A. (1988), ‘Generalized logistic models’, Journal of the American Statistical Association 83(402), 426–431.

van der Vaart, A. W. (2000), Asymptotic Statistics, Cambridge University Press.
A Proofs of the results
A.1 Proposition 2.1
The sufficiency part is obvious. To prove necessity, suppose β_0 ≠ 0. Since E[(X_t − X_{t'})(X_t − X_{t'})'] is nonsingular, there exists a subset S of the support of (X_t, X_{t'}) such that P(S) > 0 and, for all (x_t, x_{t'}) ∈ S, (x_t − x_{t'})'β_0 has constant, non-zero sign. Without loss of generality, let us assume (x_t − x_{t'})'β_0 > 0. Let G(x) = F(x)/(1 − F(x)). Because G is strictly increasing, we have, for all g ∈ R,

G(x_t'β_0 + g) > G(x_{t'}'β_0 + g).

Equivalently,

F(x_t'β_0 + g)(1 − F(x_{t'}'β_0 + g)) > F(x_{t'}'β_0 + g)(1 − F(x_t'β_0 + g)).

In other words,

P(Y_t = 1, Y_{t'} = 0 | X_t = x_t, X_{t'} = x_{t'}, γ = g) > P(Y_t = 0, Y_{t'} = 1 | X_t = x_t, X_{t'} = x_{t'}, γ = g),

and the result follows by integration over g.

A.2 Theorem 2.3
Let us define

A(x, γ; θ_0) := ( ∑_{j=1}^τ w_j exp(λ_j(x_1'β_0 + γ))   ...   ∑_{j=1}^τ w_j exp(λ_j(x_T'β_0 + γ))
                  exp(λ_1 x_1'β_0)                       ...   exp(λ_1 x_T'β_0)
                  ⋮                                            ⋮
                  exp(λ_τ x_1'β_0)                       ...   exp(λ_τ x_T'β_0) ).

Let A_i(x, γ; θ_0) denote the i-th row of A(x, γ; θ_0). Then

A_1(x, γ; θ_0) = ∑_{j=1}^τ w_j exp(λ_j γ) A_{j+1}(x, γ; θ_0).

It follows that for all (x, γ) ∈ Supp(X) × R,

det A(x, γ; θ_0) = 0.

By Assumption 2 and since we focus on the first type therein, we have G(x) := F(x)/(1 − F(x)) = ∑_{j=1}^{T−1} w_j exp(λ_j x). Now, developing det A(x, γ; θ_0) with respect to the first row yields, by definition of the function m,

∑_{y ∈ {0,1}^T} m(y, x; θ_0) ∏_{t: y_t = 1} G(x_t'β_0 + γ) = 0.

Multiplying this equality by ∏_t (1 − F(x_t'β_0 + γ)), we obtain

∑_{y ∈ {0,1}^T} m(y, x; θ_0) ∏_{t: y_t = 1} F(x_t'β_0 + γ) ∏_{t: y_t = 0} (1 − F(x_t'β_0 + γ)) = 0.

This equation is equivalent to E[m(Y, X; θ_0) | X, γ] = 0 a.s. The result follows by integration over γ.

A.3 Proposition 2.4
Let us suppose that θ = (β', λ')' ∈ Θ satisfies

E[m(Y, X; θ) | X] = 0,   (A.1)

and let us show that θ = θ_0. Since γ = γ̄ almost surely for some γ̄, Equation (A.1) is equivalent to:

∑_{i=1}^τ w_i exp(λ_{0i} γ̄) det(A^i(x)) = 0,   (A.2)

for almost all x ∈ Supp(X), with

A^i(x) := ( exp(λ_{0i} x_1'β_0)   ...   exp(λ_{0i} x_T'β_0)
            exp(λ_1 x_1'β)        ...   exp(λ_1 x_T'β)
            ⋮                           ⋮
            exp(λ_τ x_1'β)        ...   exp(λ_τ x_T'β) ).

Let S denote the subset of Supp(X) on which (A.2) holds. Further, let X(β) = {x ∈ S : |{x_2'β, ..., x_T'β}| = T − 1}. We first show that

P(X(β)) > 0.   (A.3)

This is trivial for T = 2. Otherwise, note first that there exists k_0 such that β_{k_0} ≠ 0. Then:

X(β) = { x ∈ S : ∀ t ≥ 3, x_{k_0,t} ∉ { (x_2'β − x_{−k_0,t}'β_{−k_0})/β_{k_0}, ..., (x_{t−1}'β − x_{−k_0,t}'β_{−k_0})/β_{k_0} } }.

The condition |Supp(X_{k_0,t} | X_2, ..., X_{t−1}, X_{−k_0,t})| ≥ 2τ for all t ≥ 3 (with X_{−k_0,t} = (X_{j,t})_{j≠k_0}) ensures that almost surely,

Supp(X_{k_0,t} | X_2, ..., X_{t−1}, X_{−k_0,t}) ⊄ { (X_2'β − X_{−k_0,t}'β_{−k_0})/β_{k_0}, ..., (X_{t−1}'β − X_{−k_0,t}'β_{−k_0})/β_{k_0} },

and thus (A.3) holds.

Now fix x ∈ X(β) and k ∈ {1, ..., K}. Using again |Supp(X_{k,t} | X_{−k}, X_{k,−t})| ≥ 2τ, there exists A ⊂ R, |A| ≥ 2τ, such that for all x̃ verifying x̃_{j,t} = x_{j,t} for j ≠ k or t ≠ 1 and x̃_{k,1} = x_{k,1} + v, v ∈ A, we have x̃ ∈ S. Applying (A.2) to such x̃ and developing each determinant with respect to the first column, we obtain that for all v ∈ A,

∑_{j=1}^τ ( (−1)^j exp(λ_j x_1'β) ∑_{i=1}^τ w_i exp(λ_{0i} γ̄) det(A^i_{{j+1},{1}}(x)) ) exp(λ_j β_k v)
+ ∑_{i=1}^τ w_i exp(λ_{0i}(x_1'β_0 + γ̄)) det(A^i_{{1},{1}}(x)) exp(λ_{0i} β_{0k} v) = 0,   (A.4)

where A^i_{j,k}(x) denotes the sub-matrix of A^i(x) once row j and column k have been removed.

We first assume that β_{0k} ≠ 0. Suppose that there exists i_0 such that for all j ∈ {1, ..., τ}, λ_j β_k ≠ λ_{0i_0} β_{0k}. The left-hand side of (A.4) is a polynomial of exponential functions with at most 2τ distinct exponential functions and it is equal to 0 on 2τ distinct points v. Then, by Lemma B.1 and because the coefficient of exp(λ_{0i_0} β_{0k} v) is w_{i_0} exp(λ_{0i_0}(x_1'β_0 + γ̄)) det(A^{i_0}_{{1},{1}}(x)), we have

det(A^{i_0}_{{1},{1}}(x)) = 0.   (A.5)

Now, because |{x_2'β, ..., x_T'β}| = T − 1, the definition of Chebyshev systems implies that det(A^{i_0}_{{1},{1}}(x)) ≠ 0, a contradiction. Hence, for all i ∈ {1, ..., τ}, there exists ℓ(i) such that λ_{ℓ(i)} β_k = λ_{0i} β_{0k}. Because λ_{ℓ(i)} and λ_{0i} are both positive, the sign of β_k is then equal to the sign of β_{0k}. Let us suppose without loss of generality (since β_{0k} ≠ 0) that β_{0k} > 0. Then λ_{01} β_{0k} < ... < λ_{0τ} β_{0k}, implying, since β_k > 0, that λ_{ℓ(1)} < ... < λ_{ℓ(τ)}. Hence, ℓ(i) = i for all i ∈ {1, ..., τ} and i = 1 yields β_k = β_{0k}. In turn, this latter equality implies that λ = λ_0.

We now consider the case β_{0k} = 0. Let us assume that β_k ≠ 0. The left-hand side of (A.4) is a polynomial of exponential functions with at most τ + 1 distinct exponential functions (since λ_{0j} β_{0k} = 0 for all j) and it is equal to 0 on 2τ distinct points v. Then, by Lemma B.1,

∑_{i=1}^τ w_i exp(λ_{0i}(x_1'β_0 + γ̄)) det(A^i_{{1},{1}}(x)) = 0.

Now, notice that det(A^i_{{1},{1}}(x)) does not depend on i. As a result,

∑_{i=1}^τ w_i exp(λ_{0i}(x_1'β_0 + γ̄)) = 0,

which is a contradiction. Hence β_k = 0 = β_{0k}. Note that we do not identify λ in this case, but its identification is achieved by the previous paragraph, since there exists k such that β_{0k} ≠ 0. This concludes the proof.

A.4 Proposition 2.5
1. Without loss of generality, we assume hereafter that t = 1 so that t , t ≥
2. Letus suppose that θ = ( β, λ ) ∈ Θ satisfies E [ m ( Y, X ; θ ) | X ] = 0 , (A.6)and let us show that θ = θ . Equation (A.6) is equivalent to τ X i =1 w i E exp( λ i γ ) C ( γ, x ; θ , (cid:16) P τj =1 w j exp( λ j ( x β + γ )) (cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X = x × det exp( λ i x β ) . . . exp( λ i x T β )exp( λ x β ) . . . exp( λ x T β )... ...exp( λ τ x β ) . . . exp( λ τ x T β ) = 0 , (A.7)for almost all x ∈ Supp( X ). Let S denote the subset of Supp( X ) on which (A.7)holds. Further, let X ( β ) = { x ∈ S : |{ x β, . . . , x T β }| = T } . By Assumption 3,19 ( X ( β )) = 1. Now, fix x ∈ X ( β ). By Assumption 3 again, there exist ε ≤ ≤ ε withmax( − ε, ε ) >
0, such that for almost every e x verifying e x t = x t for t > e x j = x j for j = k , | e x k, − x k, | ∈ [ ε, ε ], we have e x ∈ Supp( X ). Applying (A.7) to such e x andusing X k ⊥⊥ γ | X − k , we obtain τ X i =1 w i a ,i,x ( v ) det (cid:16) A i ( v ) (cid:17) = 0 , (A.8)for almost every v ∈ [ ε, ε ], with A i ( v ) = exp( λ i ( x β + β k v )) . . . exp( λ i x T β )exp( λ ( x β + β k v )) . . . exp( λ x T β )... ...exp( λ τ ( x β + β k v )) . . . exp( λ τ x T β ) . Let A iJ,K ( v ) denote the sub-matrix of A i ( v ) once the rows and columns with indicesin J ⊂ { , . . . , T } and K ⊂ { , . . . , T } , respectively, have been removed. We simplynote A iJ,K when A iJ,K ( v ) does not depend on v . Then, developping each A i ( v ) withrespect to the first column, we obtain, for almost every v ∈ [ ε, ε ], τ X i =1 w i " det (cid:16) A i { } , { } (cid:17) exp( λ i ( x β + β k v )) a ,i,x ( v )+ τ X j =1 ( − j det (cid:16) A i { j +1 } , { } (cid:17) exp( λ j ( x β + β k v )) a ,i,x ( v ) = 0 . (A.9)Now, by Lemma B.2, the left-hand side of (A.9) is real analytic, where we recall thata function f : I → R is real analytic if f is equal to its Taylor series at every point of I . Then, by the continuation theorem for real analytic functions (see e.g. Corollary1.2.5 in Krantz and Parks, 2002), (A.8) holds for all v ∈ R . Now, fix i ∈ { , . . . , τ } and let us assume that there is no t ( i ) ∈ { , . . . , τ } such that λ t ( i ) β k = λ i β k . Then,Assumption 4.2 ensures that the functions of v in (A.8) are linearly independent, sothat det (cid:16) A i { t } , { } (cid:17) = 0 , ∀ t ∈ { , . . . , T } , (A.10)Because |{ ( x β, . . . , x T β }| = T , we have, by definition of Chebyshev systems,det (cid:16) A i { } , { } (cid:17) = 0 , i ∈ { , . . . , τ } , there exists t ( i ) such that λ t ( i ) β k = λ i β k . Because λ t ( i ) and λ i are both positive, the sign of β k is then equalto the sign of β k . 
Let us suppose without loss of generality (since, by Assumption 3, $\beta_k^0 \ne 0$) that $\beta_k^0 > 0$. Then $\lambda_1 \beta_k < \dots < \lambda_\tau \beta_k$, implying, since $\beta_k^0 > 0$, that $\lambda_{t(1)}^0 < \dots < \lambda_{t(\tau)}^0$. Hence, $t(i) = i$ for all $i \in \{1, \dots, \tau\}$, and $i = 1$ yields $\beta_k = \beta_k^0$. In turn, this latter equality implies that $\lambda = \lambda^0$.

Now, in (A.8), $\lambda_i = \lambda_i^0$ for all $i \in \{1, \dots, \tau\}$ and $\beta_k = \beta_k^0$. With $\lambda$ replaced by $\lambda^0$ and $\beta_k$ replaced by $\beta_k^0$, (A.9) and Assumption 4.2 still imply that for all $i \in \{1, \dots, \tau\}$,
\[
\det\!\left(A^i_{\{t\},\{1\}}\right) = 0, \quad \forall t \in \{1, \dots, T\} \setminus \{i+1\}. \tag{A.11}
\]
Because $|\{x_1'\beta^0, \dots, x_T'\beta^0\}| = T$, we have, by definition of Chebyshev systems,
\[
\det\!\left(A^i_{\{1,t\},\{1,n\}}\right) \ne 0, \quad \forall (t, n) \in \left(\{2, \dots, T\} \setminus \{i+1\}\right) \times \{2, \dots, T\}.
\]
This, together with (A.11), implies that for $t \in \{2, \dots, T\} \setminus \{i+1\}$, the first row of $A^i_{\{t\},\{1\}}$ is a non-trivial linear combination of the other rows. In other words, for all $t \ne i$, there exists a non-zero vector $(\overline{w}_{t,j})_{j=1,\dots,\tau}$ with $\overline{w}_{t,t} = 0$ such that for all $s \ge 2$,
\[
\exp(\lambda_i x_s'\beta) = \sum_{j=1}^{\tau} \overline{w}_{t,j} \exp(\lambda_j^0 x_s'\beta^0). \tag{A.12}
\]
Let us define $P_t(u) = \sum_{j=1}^{\tau} \overline{w}_{t,j} \exp(\lambda_j^0 u)$ for all $t \in \{1, \dots, \tau\}$. Then, for all $s \ge 2$,
\[
P_1(x_s'\beta^0) = \dots = P_{i-1}(x_s'\beta^0) = P_{i+1}(x_s'\beta^0) = \dots = P_\tau(x_s'\beta^0).
\]
Moreover, because $x \in \mathcal{X}(\beta)$, we have $|\{x_2'\beta^0, \dots, x_T'\beta^0\}| = \tau$. Then, by Lemma B.1, all the $P_t$, $t \ne i$, are equal as functions. But this implies that for all $(t, j) \in (\{1, \dots, \tau\} \setminus \{i\})^2$, $\overline{w}_{t,j} = \overline{w}_{j,j} = 0$. Therefore, by (A.12) again, there exist strictly positive constants $(c_1, \dots, c_\tau) \in (0, \infty)^\tau$ such that $\exp(\lambda_i x_t'\beta) = c_i \exp(\lambda_i^0 x_t'\beta^0)$ for all $t \ge 2$. In other words, there exists $K \in \mathbb{R}$ such that for all $t \ge 2$,
\[
x_t'(\beta - \beta^0) = K. \tag{A.13}
\]
This equality holds in particular for the periods $t$ and $t'$ in Assumption 3.1. Moreover, because $x \in \mathcal{X}(\beta)$ was arbitrary and $P(\mathcal{X}(\beta)) = 1$, this implies that almost surely, $(X_t - X_{t'})'(\beta - \beta^0) = 0$. The first part of Assumption 3 implies $\beta = \beta^0$, which ends the proof.

We follow the exact same reasoning, except that $\lambda^0$ in $\theta^0$ is replaced by $\tilde{\lambda}$. In particular, we obtain the same equation as (A.9), with $\tilde{\lambda}$ in place of $\lambda^0$. Then (A.10) holds under Assumption 3'.2 instead of Assumption 3.2. This implies that $\beta_k = \beta_k^0$. The proof that $\beta_j = \beta_j^0$ for $j \ne k$ is exactly as above.

A.5 Proposition 2.6
We leave $x$ and the conditioning on $X = x$ implicit here. We also let $C(\gamma) := C(\gamma, x; \theta^0, t)$, $\alpha_i := w_i \delta_i(x; \theta^0, t)$, $a_i := \lambda_i \beta_k$, $b_i := \lambda_i^0 \beta_k^0$, $\{\gamma_1, \gamma_2\} := \mathrm{Supp}(\gamma \mid X = x)$, and let $(q_1, q_2)$ denote the corresponding probabilities. We must prove that for all $\mu = (\mu_{j\ell})_{j=0,1,2,\, \ell=1,2}$, if for all $v \in \mathbb{R}$,
\[
\sum_{j=1}^{2} e^{a_j v} \sum_{p=1}^{2} q_p \frac{C(\gamma_p)}{1 + \sum_{i=1}^{2} \alpha_i e^{\lambda_i^0 \gamma_p} e^{b_i v}} \left(\sum_{\ell=1}^{2} \mu_{j\ell} e^{\lambda_\ell^0 \gamma_p}\right) + \sum_{\ell=1}^{2} e^{b_\ell v} \sum_{p=1}^{2} q_p \mu_{0\ell} e^{\lambda_\ell^0 \gamma_\ell} \frac{C(\gamma_p)}{1 + \sum_{i=1}^{2} \alpha_i e^{\lambda_i^0 \gamma_p} e^{b_i v}} = 0,
\]
then $\mu = 0$. Let us define, for $p \in \{1, 2\}$,
\[
f_{j,p}(v) = \begin{cases} \dfrac{e^{a_j v}}{1 + \sum_{i=1}^{2} \alpha_i e^{\lambda_i^0 \gamma_p} e^{b_i v}} & \text{if } j \in \{1, 2\}, \\[6pt] \dfrac{e^{b_{j-2} v}}{1 + \sum_{i=1}^{2} \alpha_i e^{\lambda_i^0 \gamma_p} e^{b_i v}} & \text{if } j \in \{3, 4\}, \end{cases} \qquad G_{j,p}(\mu) = \begin{cases} q_p C(\gamma_p) \sum_{\ell=1}^{2} \mu_{j\ell} e^{\lambda_\ell^0 \gamma_p} & \text{if } j \in \{1, 2\}, \\[6pt] q_p \mu_{0,j-2}\, e^{\lambda_{j-2}^0 \gamma_{j-2}} C(\gamma_p) & \text{if } j \in \{3, 4\}. \end{cases}
\]
Then Assumption 4'.2 can be rewritten as follows:
\[
\sum_{j=1}^{4} \sum_{p=1}^{2} G_{j,p}(\mu) f_{j,p}(v) = 0 \ \ \forall v \in \mathbb{R} \ \Rightarrow \ \mu = 0. \tag{A.14}
\]
To prove (A.14), first remark that if $G_{j,p}(\mu) = 0$ for all $(j, p)$, then $\mu = 0$. This is trivial for the $\mu_{0\ell}$. For the $\mu_{j\ell}$, $j \ge 1$, this follows from Lemma B.1. Thus, Assumption 4'.2 holds if the family $(f_{j,p})_{j=1,\dots,4,\, p=1,2}$ is free, i.e. if for all $\nu = (\nu_{jp})_{j=1,\dots,4,\, p=1,2}$,
\[
\sum_{j=1}^{4} \sum_{p=1}^{2} \nu_{jp} f_{j,p}(v) = 0 \ \ \forall v \in \mathbb{R} \ \Rightarrow \ \nu = 0.
\]
Multiplying by the (positive) product of the two denominators, this amounts to showing that if, for all $v \in \mathbb{R}$,
\[
\begin{aligned}
& (\nu_{11} + \nu_{12}) e^{a_1 v} + \alpha_1 (\nu_{11} e^{\lambda_1^0 \gamma_2} + \nu_{12} e^{\lambda_1^0 \gamma_1}) e^{(a_1 + b_1) v} + \alpha_2 (\nu_{11} e^{\lambda_2^0 \gamma_2} + \nu_{12} e^{\lambda_2^0 \gamma_1}) e^{(a_1 + b_2) v} \\
+\; & (\nu_{21} + \nu_{22}) e^{a_2 v} + \alpha_1 (\nu_{21} e^{\lambda_1^0 \gamma_2} + \nu_{22} e^{\lambda_1^0 \gamma_1}) e^{(a_2 + b_1) v} + \alpha_2 (\nu_{21} e^{\lambda_2^0 \gamma_2} + \nu_{22} e^{\lambda_2^0 \gamma_1}) e^{(a_2 + b_2) v} \\
+\; & (\nu_{31} + \nu_{32}) e^{b_1 v} + \alpha_1 (\nu_{31} e^{\lambda_1^0 \gamma_2} + \nu_{32} e^{\lambda_1^0 \gamma_1}) e^{2 b_1 v} + \alpha_2 (\nu_{31} e^{\lambda_2^0 \gamma_2} + \nu_{32} e^{\lambda_2^0 \gamma_1}) e^{(b_1 + b_2) v} \\
+\; & (\nu_{41} + \nu_{42}) e^{b_2 v} + \alpha_1 (\nu_{41} e^{\lambda_1^0 \gamma_2} + \nu_{42} e^{\lambda_1^0 \gamma_1}) e^{(b_1 + b_2) v} + \alpha_2 (\nu_{41} e^{\lambda_2^0 \gamma_2} + \nu_{42} e^{\lambda_2^0 \gamma_1}) e^{2 b_2 v} = 0,
\end{aligned}
\]
then $\nu := (\nu_{11}, \nu_{12}, \nu_{21}, \nu_{22}, \nu_{31}, \nu_{32}, \nu_{41}, \nu_{42}) = 0$. The proof of this point, which is long and cumbersome, is detailed in our online Appendix (Davezies et al., 2020).

A.6 Proposition 2.7
Let us suppose that Assumption 5 fails. Without loss of generality, assume that $X_1 = X_2$ almost surely. Let us define $y^1 := (1, 0, \dots, 0)'$, $y^2 := (0, 1, 0, \dots, 0)'$ and $f(x; \beta) := E[m(Y, X; \beta, \lambda) \mid X = x]$. By definition,
\[
f(X; \beta) = \sum_{y \in \{0,1\}^T} P(Y = y \mid X)\, m(y, X; \beta). \tag{A.15}
\]
Moreover, almost surely,
\[
\begin{aligned}
P(Y = y^1 \mid X) &= \int F(X_1'\beta^0 + \gamma)(1 - F(X_2'\beta^0 + \gamma))(1 - F(X_3'\beta^0 + \gamma)) \cdots (1 - F(X_T'\beta^0 + \gamma))\, \mathrm{d}F_{\gamma|X}(\gamma) \\
&= \int F(X_2'\beta^0 + \gamma)(1 - F(X_1'\beta^0 + \gamma))(1 - F(X_3'\beta^0 + \gamma)) \cdots (1 - F(X_T'\beta^0 + \gamma))\, \mathrm{d}F_{\gamma|X}(\gamma) \\
&= P(Y = y^2 \mid X), \tag{A.16}
\end{aligned}
\]
where the second equality uses $X_1 = X_2$. Next,
\[
m(y^1, X; \beta) = \det \begin{pmatrix} \exp(\lambda_1 X_2'\beta) & \dots & \exp(\lambda_1 X_T'\beta) \\ \vdots & & \vdots \\ \exp(\lambda_{T-1} X_2'\beta) & \dots & \exp(\lambda_{T-1} X_T'\beta) \end{pmatrix} = \det \begin{pmatrix} \exp(\lambda_1 X_1'\beta) & \exp(\lambda_1 X_3'\beta) & \dots & \exp(\lambda_1 X_T'\beta) \\ \vdots & & & \vdots \\ \exp(\lambda_{T-1} X_1'\beta) & \exp(\lambda_{T-1} X_3'\beta) & \dots & \exp(\lambda_{T-1} X_T'\beta) \end{pmatrix} = -m(y^2, X; \beta), \tag{A.17}
\]
where the second equality again uses $X_1 = X_2$. Moreover, for all $y$ such that $\sum_t y_t = 1$ and $y \notin \{y^1, y^2\}$, $m(y, X; \beta) = 0$ because the corresponding cofactor includes two identical columns (since $X_1 = X_2$). Finally, if $\sum_t y_t \ne 1$, we also have $m(y, X; \beta) = 0$. In view of (A.15), these last points, combined with (A.16)-(A.17), imply $f(X; \beta) = 0$. Since $\beta$ was arbitrary, it means that (2.3) does not identify $\beta^0$. The result follows.

A.7 Theorem 3.1
Let us first summarize the proof. We link the current model with a “complete” model where $\gamma$ is also observed. This model is fully parametric and thus can be analyzed easily. Specifically, we show in a first step that this complete model is differentiable in quadratic mean (see, e.g., van der Vaart, 2000, pp. 64-65 for a definition) and has a nonsingular information matrix. In a second step, we establish an abstract expression for the semiparametric efficiency bound. This expression involves in particular the kernel $\mathcal{K}$ of the conditional expectation operator $g \mapsto E[g(X, Y) \mid X, \gamma]$. In a third step, we show that
\[
\mathcal{K} = \left\{(x, y) \mapsto q(x)\, m(x, y; \beta^0) : E[q^2(X)] < \infty\right\}. \tag{A.18}
\]
The fourth step of the proof concludes.

First step: the complete model is differentiable in quadratic mean and has a nonsingular information matrix.
Let $p(y \mid x, g; \beta) = P(Y = y \mid X = x, \gamma = g; \beta)$. We check that the conditions of Lemma 7.6 in van der Vaart (2000) hold. Under Assumptions 1-2, we have
\[
p(y \mid x, g; \beta) = \prod_{t: y_t = 1} F(x_t'\beta + g) \prod_{t: y_t = 0} \left(1 - F(x_t'\beta + g)\right),
\]
where $F$ is $C^\infty$ on $\mathbb{R}$ and takes values in $(0, 1)$. Hence, $\beta \mapsto \ln p(y \mid x, g; \beta)$ is differentiable. Let $S_\beta = \partial \ln p(Y \mid X, \gamma; \beta^0)/\partial\beta$ and let $S_{\beta k}$ denote its $k$-th component. We prove that $E[S_{\beta k}^2] < \infty$. First, remark that
\[
S_{\beta k} = \sum_{t=1}^{T} X_{k,t} \frac{f(X_t'\beta^0 + \gamma)}{F(X_t'\beta^0 + \gamma)\left[1 - F(X_t'\beta^0 + \gamma)\right]} \left[Y_t - F(X_t'\beta^0 + \gamma)\right].
\]
Then,
\[
|S_{\beta k}| \le \sum_{t=1}^{T} |X_{k,t}| \frac{f(X_t'\beta^0 + \gamma)}{F(X_t'\beta^0 + \gamma)(1 - F(X_t'\beta^0 + \gamma))} = \sum_{t=1}^{T} |X_{k,t}| \frac{\sum_{j=1}^{T-1} w_j \lambda_j e^{\lambda_j (X_t'\beta^0 + \gamma)}}{\sum_{j=1}^{T-1} w_j e^{\lambda_j (X_t'\beta^0 + \gamma)}} \le \lambda_\tau \sum_{t=1}^{T} |X_{k,t}|, \tag{A.19}
\]
where we have used the triangle inequality and $|Y_t - F(X_t'\beta^0 + \gamma)| \le 1$. It follows that $E[S_{\beta k}^2] < \infty$. By the dominated convergence theorem and again (A.19), $\beta \mapsto E[S_\beta S_\beta']$ is continuous. Therefore, the conditions in Lemma 7.6 in van der Vaart (2000) hold, and the complete model is differentiable in quadratic mean. Moreover,
\[
E[S_\beta S_\beta'] = E[V(S_\beta \mid X, \gamma)] = \sum_{t=1}^{T} E\left[\frac{f^2(X_t'\beta^0 + \gamma)}{F(X_t'\beta^0 + \gamma)\left[1 - F(X_t'\beta^0 + \gamma)\right]} X_t X_t'\right].
\]
Then, if for some $\lambda \in \mathbb{R}^K$, $\lambda' E[S_\beta S_\beta'] \lambda = 0$, we would have $X_t'\lambda = 0$ almost surely for all $t \in \{1, \dots, T\}$. By Assumption 3.1, this implies $\lambda = 0$. Hence, the information matrix $E[S_\beta S_\beta']$ is nonsingular.

Second step: $V^*$ depends on the orthogonal projection of $E[S_\beta \mid X, Y]$ on $\mathcal{K}$.

Let $\tilde{\psi} = (\tilde{\psi}_1, \dots, \tilde{\psi}_K)'$ denote the efficient influence function, as defined p. 363 of van der Vaart (2000). Then $V^* = E[\tilde{\psi}\tilde{\psi}']$ and $E[\tilde{\psi}] = 0$. Let $\mathcal{S} = \mathrm{span}(S_\beta)$, $\mathcal{G} = \{q : E[q^2(X, \gamma)] < \infty,\ E[q(X, \gamma)] = 0\}$ and, for any closed convex set $A$ and any $h = (h_1, \dots, h_K)'$, let $\Pi_A$ denote the orthogonal projection on $A$ and $\Pi_A(h) = (\Pi_A(h_1), \dots, \Pi_A(h_K))'$.
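The score formula in the first step can be checked numerically. The sketch below is a minimal illustration, not the paper's estimator: it assumes the form $F = G/(1+G)$ with $G(v) = \sum_j w_j e^{\lambda_j v}$, which is consistent with the ratio $f/[F(1-F)] = \sum_j w_j \lambda_j e^{\lambda_j(\cdot)}/\sum_j w_j e^{\lambda_j(\cdot)}$ used in (A.19); the values of $w_j$, $\lambda_j$, the covariates and the outcomes are made up.

```python
import numpy as np

# Minimal numerical check (hypothetical w_j, lambda_j and data): the analytic
# score S_beta of the complete model, using f/[F(1-F)] = G'/G for
# G(v) = sum_j w_j exp(lambda_j v) and F = G/(1+G), matches a central
# finite-difference gradient of the complete-model log-likelihood.
w = np.array([0.7, 0.3])
lam = np.array([1.0, 1.8])

def G(v):
    return (w * np.exp(np.outer(np.atleast_1d(v), lam))).sum(axis=-1)

def F(v):
    return G(v) / (1.0 + G(v))

def dlogG(v):  # G'/G = f / [F (1 - F)]
    e = w * np.exp(np.outer(np.atleast_1d(v), lam))
    return (e * lam).sum(axis=-1) / e.sum(axis=-1)

X = np.array([[0.5, -1.0], [1.2, 0.3], [-0.4, 0.8]])  # T = 3 periods, K = 2
y = np.array([1.0, 0.0, 1.0])                          # made-up outcomes
beta, gamma = np.array([0.6, -0.9]), 0.2               # made-up parameters

def loglik(b):
    v = X @ b + gamma
    return np.sum(y * np.log(F(v)) + (1 - y) * np.log(1 - F(v)))

v = X @ beta + gamma
score = X.T @ (dlogG(v) * (y - F(v)))   # analytic S_beta at (beta, gamma)

eps, num = 1e-6, np.zeros(2)
for k in range(2):                       # central differences, coordinate k
    d = np.eye(2)[k] * eps
    num[k] = (loglik(beta + d) - loglik(beta - d)) / (2 * eps)

print(np.allclose(score, num, atol=1e-6))  # True
```

The agreement confirms the simplification $y/F - (1-y)/(1-F)$ times $f$ reducing to $(G'/G)(y - F)$, which is what makes the bound (A.19) immediate.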
By Equation (25.29), Lemma 25.34 (since the complete model is differentiable in quadratic mean by the first step) and the same reasoning as in Example 25.36 of van der Vaart (2000), $\tilde{\psi}$ is the function of $(X, Y)$ of minimal $L^2$-norm satisfying
\[
\tilde{\chi} = \Pi_{\mathcal{S} + \mathcal{G}}(\tilde{\psi}), \tag{A.20}
\]
where $\tilde{\chi}$ is the efficient influence function of the large model. Because this large model is parametric, we have
\[
\tilde{\chi} = E[S_\beta S_\beta']^{-1} S_\beta. \tag{A.21}
\]
Equation (A.20) implies $E[(\tilde{\psi} - \tilde{\chi})\tilde{\chi}'] = 0$. Thus, defining $\ell_\beta = E[S_\beta \mid Y, X]$, we get
\[
E[\tilde{\psi}\ell_\beta'] = E[\tilde{\psi}S_\beta'] = \mathrm{Id}. \tag{A.22}
\]
Moreover, because $E[S_\beta \mid X, \gamma] = 0$, $\mathcal{S}$ and $\mathcal{G}$ are orthogonal. Thus, (A.20) is equivalent to $\Pi_{\mathcal{S}}(\tilde{\chi}) = \Pi_{\mathcal{S}}(\tilde{\psi})$ and $\Pi_{\mathcal{G}}(\tilde{\chi}) = \Pi_{\mathcal{G}}(\tilde{\psi})$. Moreover, (A.21) implies that $\Pi_{\mathcal{G}}(\tilde{\chi}) = 0$. Hence, $\tilde{\psi} \in \mathcal{K}^K$. Now, because $\Pi_{\mathcal{K}}$ is an orthogonal projector, we have
\[
E\left[\tilde{\psi}\,\Pi_{\mathcal{K}}(\ell_\beta)'\right] = E\left[\Pi_{\mathcal{K}}(\tilde{\psi})\,\ell_\beta'\right] = E[\tilde{\psi}\ell_\beta'] = \mathrm{Id},
\]
where the last equality follows by (A.22). Hence, if $\Pi_{\mathcal{K}}(\ell_\beta)'\lambda = 0$ a.s., we would have $\lambda = 0$. In other words, $E[\Pi_{\mathcal{K}}(\ell_\beta)\Pi_{\mathcal{K}}(\ell_\beta)']$ is nonsingular. Now, consider the set
\[
\mathcal{F} = \left\{E[\Pi_{\mathcal{K}}(\ell_\beta)\Pi_{\mathcal{K}}(\ell_\beta)']^{-1}\Pi_{\mathcal{K}}(\ell_\beta) + v : E[v\,\Pi_{\mathcal{K}}(\ell_\beta)'] = 0\right\}.
\]
$\mathcal{F}$ is thus the set of vector-valued functions $\psi$ satisfying the equation $E[\psi\,\Pi_{\mathcal{K}}(\ell_\beta)'] = \mathrm{Id}$. Hence, $\tilde{\psi}$ being the element of $\mathcal{F}$ with minimum $L^2$-norm, we obtain
\[
\tilde{\psi} = E[\Pi_{\mathcal{K}}(\ell_\beta)\Pi_{\mathcal{K}}(\ell_\beta)']^{-1}\Pi_{\mathcal{K}}(\ell_\beta).
\]
Finally, because $V^* = E[\tilde{\psi}\tilde{\psi}']$,
\[
V^* = E[\Pi_{\mathcal{K}}(\ell_\beta)\Pi_{\mathcal{K}}(\ell_\beta)']^{-1}. \tag{A.23}
\]

Third step: (A.18) holds.
Let $r \in \mathcal{K}$ and let us prove that $r(y, x) = q(x)\, m(y, x; \beta^0)$ for some $q$. First, by definition of $\mathcal{K}$, we have, for almost all $(g, x) \in \mathrm{Supp}(\gamma, X)$,
\[
\begin{aligned}
0 = \; & r((0,0,0), x) + r((1,0,0), x)\, G(x_1'\beta^0 + g) + r((0,1,0), x)\, G(x_2'\beta^0 + g) \\
& + r((0,0,1), x)\, G(x_3'\beta^0 + g) + r((1,1,0), x)\, G(x_1'\beta^0 + g) G(x_2'\beta^0 + g) \\
& + r((1,0,1), x)\, G(x_1'\beta^0 + g) G(x_3'\beta^0 + g) + r((0,1,1), x)\, G(x_2'\beta^0 + g) G(x_3'\beta^0 + g) \\
& + r((1,1,1), x)\, G(x_1'\beta^0 + g) G(x_2'\beta^0 + g) G(x_3'\beta^0 + g). \tag{A.24}
\end{aligned}
\]
Let $a_t := x_t'\beta^0$ for $t \in \{1, 2, 3\}$ and, for the sake of conciseness, let us remove the dependence of $r$ on $x$. Then, using Assumption 2, we obtain, for almost all $(g, x)$,
\[
0 = A_0 e^{0 \times g} + A_1 e^{g} + A_2 e^{\lambda g} + A_3 e^{2g} + A_4 e^{(1+\lambda)g} + A_5 e^{2\lambda g} + A_6 e^{3g} + A_7 e^{(2+\lambda)g} + A_8 e^{(1+2\lambda)g} + A_9 e^{3\lambda g},
\]
with
\[
\begin{aligned}
A_0 &= r(0,0,0), \\
A_1 &= w_1\left[r(1,0,0) e^{a_1} + r(0,1,0) e^{a_2} + r(0,0,1) e^{a_3}\right], \\
A_2 &= w_2\left[r(1,0,0) e^{\lambda a_1} + r(0,1,0) e^{\lambda a_2} + r(0,0,1) e^{\lambda a_3}\right], \\
A_3 &= w_1^2\left[r(1,1,0) e^{a_1 + a_2} + r(1,0,1) e^{a_1 + a_3} + r(0,1,1) e^{a_2 + a_3}\right], \\
A_4 &= w_1 w_2\left[r(1,1,0)\left(e^{a_1 + \lambda a_2} + e^{a_2 + \lambda a_1}\right) + r(1,0,1)\left(e^{a_1 + \lambda a_3} + e^{a_3 + \lambda a_1}\right) + r(0,1,1)\left(e^{a_2 + \lambda a_3} + e^{a_3 + \lambda a_2}\right)\right], \\
A_5 &= w_2^2\left[r(1,1,0) e^{\lambda(a_1 + a_2)} + r(1,0,1) e^{\lambda(a_1 + a_3)} + r(0,1,1) e^{\lambda(a_2 + a_3)}\right], \\
A_6 &= w_1^3\, r(1,1,1) e^{a_1 + a_2 + a_3}, \\
A_7 &= w_1^2 w_2\, r(1,1,1)\left[e^{a_1 + a_2 + \lambda a_3} + e^{a_1 + \lambda a_2 + a_3} + e^{\lambda a_1 + a_2 + a_3}\right], \\
A_8 &= w_1 w_2^2\, r(1,1,1)\left[e^{a_1 + \lambda(a_2 + a_3)} + e^{a_2 + \lambda(a_1 + a_3)} + e^{a_3 + \lambda(a_1 + a_2)}\right], \\
A_9 &= w_2^3\, r(1,1,1) e^{\lambda(a_1 + a_2 + a_3)}.
\end{aligned}
\]
Since $\lambda = 2$ is excluded by assumption, there are three cases left depending on the number of different exponents in Equation (A.24).

First, we consider $\lambda \notin \{3/2, 3\}$. By Lemma B.1 and because $|\mathrm{Supp}(\gamma \mid X)| \ge 10$, we obtain $A_k = 0$ for all $k \in \{0, \dots, 9\}$. $A_0 = A_9 = 0$ imply that $r(0,0,0) = r(1,1,1) = 0$. Next, $A_3 = A_5 = 0$ implies that either $r(1,1,0) = r(1,0,1) = r(0,1,1) = 0$ or
\[
\begin{aligned}
r(1,1,0) &= -r(1,0,1) e^{\lambda(a_3 - a_2)} - r(0,1,1) e^{\lambda(a_3 - a_1)}, \\
r(1,1,0) &= -r(1,0,1) e^{a_3 - a_2} - r(0,1,1) e^{a_3 - a_1}. \tag{A.25}
\end{aligned}
\]
Consider the second case. $A_4 = 0$ implies, since $(r(1,1,0), r(1,0,1), r(0,1,1)) \ne (0,0,0)$,
\[
r(1,1,0) = -r(1,0,1)\frac{e^{a_1 + \lambda a_3} + e^{a_3 + \lambda a_1}}{e^{a_1 + \lambda a_2} + e^{a_2 + \lambda a_1}} - r(0,1,1)\frac{e^{a_2 + \lambda a_3} + e^{a_3 + \lambda a_2}}{e^{a_1 + \lambda a_2} + e^{a_2 + \lambda a_1}}.
\]
By assumption, for almost every $x = (x_1, x_2, x_3)$, $a_2 \ne a_3$ and $a_1 \ne a_3$. Then, using the latter display with equation (A.25) yields, since $\lambda \ne 1$,
\[
\begin{aligned}
r(1,0,1) &= r(0,1,1)\frac{e^{\lambda(a_3 - a_1)} - e^{a_3 - a_1}}{e^{a_3 - a_2} - e^{\lambda(a_3 - a_2)}}, \\
r(1,0,1) &= r(0,1,1)\left[e^{\lambda(a_3 - a_1)} - \frac{e^{a_2 + \lambda a_3} + e^{a_3 + \lambda a_2}}{e^{a_1 + \lambda a_2} + e^{a_2 + \lambda a_1}}\right] \times \left[\frac{e^{a_1 + \lambda a_3} + e^{a_3 + \lambda a_1}}{e^{a_1 + \lambda a_2} + e^{a_2 + \lambda a_1}} - e^{\lambda(a_3 - a_2)}\right]^{-1}.
\end{aligned}
\]
Since $(r(1,1,0), r(1,0,1), r(0,1,1)) \ne (0,0,0)$, $r(1,0,1) \ne 0$ and $r(0,1,1) \ne 0$. Equating the two right-hand sides above and rearranging then yields an equality between exponential expressions that can hold only if two of the $a_t$ coincide. By assumption, the set of $x$ for which this occurs is of probability zero. In other words, for almost every $x$,
\[
r((1,1,0), x) = r((1,0,1), x) = r((0,1,1), x) = 0.
\]
$A_1 = A_2 = 0$ implies that either $r(1,0,0) = r(0,1,0) = r(0,0,1) = 0$ or
\[
\begin{aligned}
r(0,0,1) &= -e^{a_1 - a_3} r(1,0,0) - e^{a_2 - a_3} r(0,1,0), \\
r(0,0,1) &= -e^{\lambda(a_1 - a_3)} r(1,0,0) - e^{\lambda(a_2 - a_3)} r(0,1,0).
\end{aligned}
\]
In the first case, almost surely $r(Y, X) = 0 = 0 \times m(Y, X; \beta^0)$. In the second case, $r(Y, X) = q(X) \times m(Y, X; \beta^0)$ for some $q \in L^2_X$. The result follows. Now, we turn to $\lambda = 3/2$.
Then, for almost all $(g, x) \in \mathrm{Supp}(\gamma, X)$,
\[
0 = A_0 e^{0 \times g} + A_1 e^{g} + A_2 e^{\frac{3}{2}g} + A_3 e^{2g} + A_4 e^{\frac{5}{2}g} + (A_5 + A_6) e^{3g} + A_7 e^{\frac{7}{2}g} + A_8 e^{4g} + A_9 e^{\frac{9}{2}g}.
\]
By Lemma B.1 and because $|\mathrm{Supp}(\gamma \mid X)| \ge 9$, we obtain $A_5 + A_6 = 0$ and $A_k = 0$ for all $k \notin \{5, 6\}$. $A_0 = A_9 = 0$ implies that $r(0,0,0) = r(1,1,1) = 0$, which in turn implies that $A_6 = 0$ and thus $A_5 = 0$. Hence, we have $A_k = 0$ for all $k \in \{0, \dots, 9\}$ and the same reasoning as when $\lambda \notin \{3/2, 3\}$ allows us to obtain the result.

Finally, we consider $\lambda = 3$. Then, for all $(g, x)$,
\[
0 = A_0 e^{0 \times g} + A_1 e^{g} + A_3 e^{2g} + (A_2 + A_6) e^{3g} + A_4 e^{4g} + A_7 e^{5g} + A_5 e^{6g} + A_8 e^{7g} + A_9 e^{9g}.
\]
By Lemma B.1 and because $|\mathrm{Supp}(\gamma \mid X)| \ge 9$, we obtain $A_2 + A_6 = 0$ and $A_k = 0$ for all $k \notin \{2, 6\}$. $A_0 = A_9 = 0$ implies that $r(0,0,0) = r(1,1,1) = 0$, which in turn implies that $A_6 = 0$ and thus $A_2 = 0$. Hence, $A_k = 0$ for all $k \in \{0, \dots, 9\}$ and the result follows again as when $\lambda \notin \{3/2, 3\}$.

Fourth step: conclusion.

By Steps 2 and 3, there exists $q(X)$ such that $\Pi_{\mathcal{K}}(\ell_\beta) = q(X)\, m(Y, X; \beta^0)$. Moreover, by definition of the orthogonal projection, $\Pi_{\mathcal{K}}(\ell_\beta) - \ell_\beta \in (\mathcal{K}^\perp)^K$. Hence, again by Step 3, we have, for all $\tilde{q} \in L^2_X$,
\[
E\left[\tilde{q}(X)\, q(X)\, m(Y, X; \beta^0)^2\right] = E\left[\ell_\beta\, \tilde{q}(X)\, m(Y, X; \beta^0)\right].
\]
This implies that
\[
q(X)\Omega(X) = E[\ell_\beta\, m(Y, X; \beta^0) \mid X].
\]
As a result, because $\ell_\beta = E[S_\beta \mid Y, X]$,
\[
\Pi_{\mathcal{K}}(\ell_\beta) = \Omega^{-1}(X)\, m(Y, X; \beta^0)\, E[\ell_\beta\, m(Y, X; \beta^0) \mid X] = \Omega^{-1}(X)\, m(Y, X; \beta^0)\, E[S_\beta\, m(Y, X; \beta^0) \mid X].
\]
Then, using (A.23), we obtain
\[
V^* = E\left[\Omega^{-1}(X)\, E[S_\beta\, m(Y, X; \beta^0) \mid X]\, E[S_\beta\, m(Y, X; \beta^0) \mid X]'\right]^{-1}.
\]
Now, by the end of the proof of Theorem 2.3, we have, for all $\beta$,
\[
0 = E_\beta[m(Y, X; \beta) \mid X, \gamma].
\]
As a result,
\[
0 = \nabla_\beta E_\beta[m(Y, X; \beta) \mid X, \gamma] = E_\beta[\nabla_\beta m(Y, X; \beta) \mid X, \gamma] + E_\beta[m(Y, X; \beta) S_\beta \mid X, \gamma].
\]
Evaluating this equality at $\beta^0$ and integrating over $\gamma$ yields
\[
E[S_\beta\, m(Y, X; \beta^0) \mid X] = -E[\nabla_\beta m(Y, X; \beta^0) \mid X] = -R(X).
\]
We conclude that
\[
V^* = E\left[\Omega^{-1}(X) R(X) R(X)'\right]^{-1} = V,
\]
which is a well-defined matrix by Assumption 6.1.

B Technical lemmas
The following two lemmas are key in the proof of Proposition 2.5.
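The first of them, Lemma B.1 below, states that a non-trivial exponential polynomial $\sum_{i=1}^{n} a_i e^{\alpha_i x}$ with distinct exponents has at most $n - 1$ real roots, so that $n$ distinct roots force all coefficients to vanish. A small numerical illustration (not a proof, with made-up coefficients) counts the sign changes of such a function on a fine grid:

```python
import numpy as np

# Illustration of Lemma B.1 (made-up coefficients): P(x) = sum_i a_i exp(alpha_i x)
# with n = 3 nonzero terms and distinct alpha_i has at most n - 1 = 2 real roots.
alpha = np.array([-1.0, 0.5, 2.0])   # distinct exponents
a = np.array([1.0, -3.0, 1.0])       # nonzero coefficients

grid = np.linspace(-10.0, 10.0, 200001)
P = (a * np.exp(np.outer(grid, alpha))).sum(axis=1)
sign_changes = int(np.sum(np.sign(P[:-1]) != np.sign(P[1:])))

print(sign_changes)                   # here P crosses zero exactly twice
print(sign_changes <= len(a) - 1)     # True: the Lemma B.1 bound holds
```

Here $P$ is positive for very negative $x$ (the $e^{-x}$ term dominates), negative at $0$ (where $P(0) = -1$), and positive for large $x$ (the $e^{2x}$ term dominates), so it attains the maximal number of roots allowed by the lemma.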
Lemma B.1 Let $n \ge 1$, let $(\alpha_1, \dots, \alpha_n)$ be $n$ distinct real numbers, $(a_1, \dots, a_n) \in \mathbb{R}^n$ and $P(x) = \sum_{i=1}^{n} a_i \exp(\alpha_i x)$. If $P$ has $n$ distinct roots, then $a_1 = \dots = a_n = 0$.

Lemma B.2 For any $(t, \ell) \in \{1, \dots, T\} \times \{1, \dots, \tau\}$, $a_{t,\ell,x}$ is real analytic for almost all $x \in \mathrm{Supp}(X)$.

B.1 Proof of Lemma B.1

This follows by induction on $n$ and Rolle's theorem; see e.g. Chapter 2, Section 2 of Krein and Nudelman (1977).

B.2 Proof of Lemma B.2
We want to prove that each function $a_{t,\ell,x}$ is real analytic for almost all $x \in \mathrm{Supp}(X)$. Fix $x \in \mathrm{Supp}(X)$, and let $\tilde{w}_j^\gamma := w_j \delta_j(x, \theta^0, t) \exp(\lambda_j \gamma)$ and $\tilde{\lambda}_j := \lambda_j \beta_k$. Let us define
\[
f : (v, \gamma) \mapsto 1 \Big/ \Big(\sum_{j=1}^{\tau} \tilde{w}_j^\gamma \exp(\tilde{\lambda}_j v)\Big).
\]
We have
\[
a_{t,\ell,x}(v) = \int \frac{\exp(\lambda_\ell \gamma)}{C(\gamma, x; \theta^0, t)}\, f(v, \gamma)\, \mathrm{d}F_{\gamma|X=x}(\gamma), \quad \forall v \in \mathbb{R}.
\]
We prove the result in three steps. First, we establish a bound on the derivatives of $f$. Second, we show that $a_{t,\ell,x}$ is $C^\infty$, and we bound its derivatives. Finally, we show that $a_{t,\ell,x}$ is real analytic.

First step: for all $k \ge 0$ and all $(v, \gamma)$,
\[
\left|\frac{\partial^k}{\partial v^k} f(v, \gamma)\right| \le k!\, (e\lambda_\tau |\beta_k|)^k f(v, \gamma). \tag{B.1}
\]
For any infinitely differentiable real function $g : \mathbb{R} \times \mathrm{Supp}(\gamma \mid X = x) \to \mathbb{R}$, we let $g^{(k)}(v, \gamma) = \partial^k g(v, \gamma)/\partial v^k$ and define $P : (v, \gamma) \mapsto \sum_{j=1}^{\tau} \tilde{w}_j^\gamma \tilde{\lambda}_j \exp(\tilde{\lambda}_j v)$. First, remark that for any positive integer $k$,
\[
\left|P^{(k)}(v, \gamma)\right| = \left|\sum_{j=1}^{\tau} \tilde{w}_j^\gamma \tilde{\lambda}_j^{k+1} \exp(\tilde{\lambda}_j v)\right| \le |\tilde{\lambda}_\tau|^{k+1} \sum_{j=1}^{\tau} \tilde{w}_j^\gamma \exp(\tilde{\lambda}_j v) = |\tilde{\lambda}_\tau|^{k+1}/f(v, \gamma). \tag{B.2}
\]
Now, we prove (B.1) by induction. The result is trivial for $k = 0$. Suppose that it holds for $j = 0, \dots, k$, $k \ge 0$. Remark that $f^{(1)} = -f \times (fP)$. Then, by applying twice the general Leibniz rule, we obtain
\[
\begin{aligned}
\left|f^{(k+1)}\right| &= \left|\sum_{j=0}^{k} \binom{k}{j} f^{(j)} (fP)^{(k-j)}\right| \le \sum_{j=0}^{k} \binom{k}{j} \left|f^{(j)}\right| \left|(fP)^{(k-j)}\right| \\
&\le f \sum_{j=0}^{k} \binom{k}{j} j!\, (e|\tilde{\lambda}_\tau|)^j \left|\sum_{i=0}^{k-j} \binom{k-j}{i} f^{(i)} P^{(k-j-i)}\right| \\
&\le f \sum_{j=0}^{k} \binom{k}{j} j!\, (e|\tilde{\lambda}_\tau|)^j \sum_{i=0}^{k-j} \binom{k-j}{i} i!\, (e|\tilde{\lambda}_\tau|)^i\, f\, \frac{|\tilde{\lambda}_\tau|^{k-j-i+1}}{f} \\
&\le f\, |\tilde{\lambda}_\tau|^{k+1} e^{k} \sum_{j=0}^{k} \binom{k}{j} j! \sum_{i=0}^{k-j} \binom{k-j}{i} i!,
\end{aligned}
\]
where we used the induction hypothesis to get the second and third inequalities. The last inequality follows from $e^i \le e^{k-j}$ for all $i \le k - j$. Now, notice that for any $k \in \mathbb{N}^*$, we have
\[
\sum_{s=0}^{k} \binom{k}{s} s! = \sum_{s=0}^{k} \frac{k!}{(k-s)!} \le k!\, e. \tag{B.3}
\]
As a result,
\[
\left|f^{(k+1)}\right| \le f\, |\tilde{\lambda}_\tau|^{k+1} e^{k} \sum_{j=0}^{k} \binom{k}{j} j!\, (k-j)!\, e = f\, |\tilde{\lambda}_\tau|^{k+1} e^{k} \times e \times (k+1) \times k! = (k+1)!\, \left(e|\tilde{\lambda}_\tau|\right)^{k+1} f,
\]
and thus the induction hypothesis holds for $k + 1$. This ends the first step.

Second step: $a_{t,\ell,x}$ is $C^\infty$ and for all $k \ge 0$,
\[
\sup_{v \in \mathbb{R}} \left|\frac{\partial^k a_{t,\ell,x}(v)}{\partial v^k}\right| \le C_{t,\ell,x,\theta^0}\, k!\, (e\lambda_\tau|\beta_k|)^k, \tag{B.4}
\]
for some $C_{t,\ell,x,\theta^0} > 0$. For all $v \in \mathbb{R}$, we have $1/f(v, \gamma) \ge \tilde{w}_\ell^\gamma \exp(\tilde{\lambda}_\ell v)$ and $C(\gamma, x; \theta^0, t) \ge 1$. Thus,
\[
\frac{\exp(\lambda_\ell \gamma)}{C(\gamma, x; \theta^0, t)}\, f(v, \gamma) \le \frac{1}{w_\ell \delta_\ell(x, \theta^0, t)}. \tag{B.5}
\]
Hence, (B.4) holds for $k = 0$, with $C_{t,\ell,x,\theta^0} = 1/[w_\ell \delta_\ell(x, \theta^0, t)]$. Next, $v \mapsto \exp(\lambda_\ell \gamma)\, f(v, \gamma)/C(\gamma, x; \theta^0, t)$ is $C^\infty$ and, by (B.5) and the previous step, we have, for any $k \ge 0$,
\[
\left|\frac{\partial^k}{\partial v^k}\left(\frac{\exp(\lambda_\ell \gamma)}{C(\gamma, x; \theta^0, t)}\, f(v, \gamma)\right)\right| \le k!\, (e\lambda_\tau|\beta_k|)^k\, \frac{\exp(\lambda_\ell \gamma)}{C(\gamma, x; \theta^0, t)}\, f(v, \gamma) \le \frac{k!\, (e\lambda_\tau|\beta_k|)^k}{w_\ell \delta_\ell(x, \theta^0, t)}.
\]
Thus, by the dominated convergence theorem, $a_{t,\ell,x}$ is $C^k$ and we have
\[
\left|\frac{\partial^k a_{t,\ell,x}(v)}{\partial v^k}\right| \le \int \left|\frac{\partial^k}{\partial v^k}\left(\frac{\exp(\lambda_\ell \gamma)}{C(\gamma, x; \theta^0, t)}\, f(v, \gamma)\right)\right| \mathrm{d}F_{\gamma|X=x}(\gamma) \le k!\, (e\lambda_\tau|\beta_k|)^k \int \frac{\exp(\lambda_\ell \gamma)}{C(\gamma, x; \theta^0, t)}\, f(v, \gamma)\, \mathrm{d}F_{\gamma|X=x}(\gamma) = k!\, (e\lambda_\tau|\beta_k|)^k\, a_{t,\ell,x}(v) \le C_{t,\ell,x,\theta^0}\, k!\, (e\lambda_\tau|\beta_k|)^k.
\]

Third step: $a_{t,\ell,x}$ is real analytic. It suffices to show that there exists $R > 0$ such that, for all $v_0$, $a_{t,\ell,x}$ coincides with its Taylor expansion at $v_0$ on $(v_0 - R, v_0 + R)$. Let $R < 1/(2e\lambda_\tau|\beta_k|)$. First, by the second step, we have, for any $v \in (v_0 - R, v_0 + R)$,
\[
\left|\frac{(v - v_0)^k}{k!}\, \frac{\partial^k a_{t,\ell,x}(v_0)}{\partial v^k}\right| \le \frac{R^k}{k!} \sup_{v} \left|\frac{\partial^k a_{t,\ell,x}(v)}{\partial v^k}\right| \le C_{t,\ell,x,\theta^0}\, (Re\lambda_\tau|\beta_k|)^k, \tag{B.6}
\]
and the corresponding series converges since $Re\lambda_\tau|\beta_k| < 1$. Thus, the Taylor series of $a_{t,\ell,x}$ converges at $v$, for any $v \in (v_0 - R, v_0 + R)$. Finally, by the second step again and Taylor's theorem applied to $a_{t,\ell,x}(v)$, we obtain, for any $K > 0$ and $|v - v_0| < R$:
\[
\left|a_{t,\ell,x}(v) - \sum_{k=0}^{K} \frac{(v - v_0)^k}{k!}\, \frac{\partial^k a_{t,\ell,x}(v_0)}{\partial v^k}\right| \le \frac{R^{K+1}}{(K+1)!} \sup_{|v - v_0| < R}
\]