Empirical MSE Minimization to Estimate a Scalar Parameter
Clément de Chaisemartin∗  Xavier D'Haultfœuille†

Abstract
We consider the estimation of a scalar parameter, when two estimators are available. The first is always consistent. The second is inconsistent in general, but has a smaller asymptotic variance than the first, and may be consistent if an assumption is satisfied. We propose to use the weighted sum of the two estimators with the lowest estimated mean-squared error (MSE). We show that this third estimator dominates the other two from a minimax-regret perspective: the maximum asymptotic-MSE-gain one may incur by using this estimator rather than one of the other estimators is larger than the maximum asymptotic-MSE-loss.
Keywords: bias-variance trade-off, mean-squared error, consistent estimator, efficient estimator, statistical decision theory, minimax regret, local asymptotics.
JEL Codes:
C21, C23
∗ University of California at Santa Barbara, [email protected].
† CREST-ENSAE, [email protected].
arXiv preprint, [math.ST].

We consider the estimation of a scalar parameter β when two estimators are available. The first is √n-consistent. The second is inconsistent in general, but it has a smaller asymptotic variance than the first, and it may be √n-consistent if the data generating process satisfies an assumption H. Hereafter, those two estimators are respectively referred to as the consistent and efficient estimators.

To fix ideas, we consider two of the many examples where this set-up is applicable. In stratified randomized experiments, the parameter of interest is the average treatment effect (ATE). To estimate it, one may use the propensity score matching estimator (see Hirano et al. 2003), which is √n-consistent and asymptotically normal under some assumptions. Alternatively, one may regress the outcome of interest on strata fixed effects and units' treatment status, and use the coefficient of the treatment in that regression. It follows from, e.g., Equation (3.3.7) in Angrist & Pischke (2008) that this estimator is √n-consistent and asymptotically normal for a weighted average of the effect of the treatment in each stratum. One can also show that under some assumptions, the asymptotic variance of the strata fixed effects estimator is smaller than that of the propensity score matching estimator. So if the treatment effect is constant across strata, the strata fixed effects estimator is √n-consistent for the ATE, and it is more efficient than the propensity score matching estimator.
But if the treatment effect is heterogeneous, the strata fixed effects estimator is inconsistent.

Another example where our set-up is applicable is a linear and constant treatment effect model, where the treatment is potentially endogenous, but one has an instrumental variable at hand. Then, the 2SLS estimator is √n-consistent for the treatment effect. On the other hand, the OLS estimator is only √n-consistent if the treatment is actually exogenous, but its asymptotic variance is smaller than that of the 2SLS estimator.

To estimate β, we propose to use β̂_MSE, the weighted sum of the consistent and efficient estimators with the lowest estimated mean-squared error (MSE). We show that this third estimator dominates the other two from a minimax-regret perspective: the maximum asymptotic-MSE-gain one may incur by using β̂_MSE rather than one of the other estimators is larger than the maximum asymptotic-MSE-loss that one may incur by doing so.

We also consider a family of pre-test estimators (β̂_{MSE,λ})_{λ≥0}, where λ indexes the critical value used in the pre-test. First, we test whether the consistent and efficient estimators are equal. If the test is accepted, β̂_{MSE,λ} is equal to the efficient estimator. If the test is rejected, β̂_{MSE,λ} is equal to a convex combination of the consistent and efficient estimators. We show that such estimators have similar properties as β̂_MSE. However, β̂_MSE dominates all of them from a minimax-regret perspective.

We then extend the initial result by considering situations where one has two estimators at hand: one is r_n-consistent, where r_n/√n → 0, and the other is inconsistent in general, but may be √n-consistent if the data generating process satisfies an assumption H. Such situations may for instance arise in regression discontinuity (RD) designs. Then, non-parametric estimators such as the one proposed by Hahn et al.
(2001) are n^{2/5}-consistent for the average treatment effect at the cut-off under weak conditions. On the other hand, the estimator using, say, linear regressions to the left and to the right of the cut-off, without restricting the sample to observations in a narrow bandwidth around the cut-off, is √n-consistent if the potential outcomes' CEFs are indeed linear in the running variable, but inconsistent otherwise. Again, we show that β̂_MSE dominates the r_n-consistent estimator from a minimax-regret perspective, under mild assumptions.

The idea of combining consistent and inconsistent, or unbiased and biased, estimators of a parameter has a long tradition in statistics and econometrics. Green & Strawderman (1991) have proposed an estimator related to β̂_MSE, in the context of a normal model with known variances. Their estimator is related to the shrinkage estimator in Stein (1956) and James & Stein (1961), and when the parameter of interest is of dimension greater than three, its MSE is lower than that of the unbiased estimator of the parameter of interest. Judge & Mittelhammer (2004) and Mittelhammer & Judge (2005) have proposed an estimator similar to that in Green & Strawderman (1991), without assuming normality or that the variances are known, and study its asymptotic MSE, again when the parameter of interest is of dimension greater than three. Cheng et al. (2019) have considered an estimator related to β̂_MSE, in a GMM context with some valid and some potentially misspecified moment conditions. They show that asymptotically, the MSE of their estimator is uniformly smaller than that of the GMM estimator using only the valid moment conditions. Again, they focus on a multivariate parameter with a dimension greater than four.
In the context of a linear model with at least three endogenous variables and instruments, Hansen (2017) proposes to use a weighted average of the OLS and 2SLS estimators, with weights that depend on the Hausman-Wu statistic in a test of equality between the OLS and 2SLS estimators. He shows that this estimator has a lower asymptotic MSE than that of the 2SLS estimator. Finally, Breusch et al. (2011) have considered β̂_MSE in a panel data context, in a case where the parameter of interest is univariate. However, they do not study the theoretical properties of that estimator.

To the best of our knowledge, our paper is the first to study the theoretical properties of β̂_MSE when the parameter of interest is univariate. This case is particularly relevant for policy evaluation and treatment choice. Even when one measures the effect of the policy on several outcomes, one is ultimately interested in summarizing those effects into a monetary assessment of the benefits of the policy, to be compared to its cost. Contrary to the previous literature, we do not find that the MSE of β̂_MSE uniformly dominates that of the consistent estimator. However, we show that β̂_MSE dominates the other two estimators from a minimax-regret perspective, thus giving a theoretical justification to its use to estimate a univariate parameter. Our paper also seems to be the first to consider the combination of estimators with different rates of convergence, which may be relevant in a number of contexts, such as RD designs.

The remainder of the paper is organized as follows. Section 2 presents the set-up and main results. Section 3 presents some extensions. Section 4 presents the proofs of the results.
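Before turning to the formal set-up, the IV example above can be made concrete with a short simulation. This sketch is ours, not the paper's; the data generating process and variable names are illustrative assumptions. It generates data with an endogenous treatment and shows that OLS (the efficient but inconsistent estimator here) and 2SLS (the consistent estimator) disagree:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.standard_normal(n)                  # instrument
u = rng.standard_normal(n)                  # unobserved confounder
d = 0.8 * z + u + rng.standard_normal(n)    # endogenous treatment
y = 1.0 * d + u + rng.standard_normal(n)    # true effect beta = 1

# OLS of y on d: inconsistent here because cov(d, u) != 0.
beta_ols = np.cov(d, y)[0, 1] / np.var(d, ddof=1)

# 2SLS with instrument z: consistent for beta, but with a larger variance.
beta_2sls = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]
```

With these parameters, beta_2sls is close to the true effect of 1 while beta_ols is biased upward, illustrating the bias-variance trade-off the paper exploits.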
2 Set-up and main results

We are interested in a parameter β ∈ ℝ. To estimate β, we use a sample of size n. β̂_C is √n-consistent and asymptotically normal for β. β̂_E is √n-consistent and asymptotically normal for β_E. In general, β_E ≠ β, but under an assumption H on the data generating process, β_E = β. The asymptotic variance of β̂_E is smaller than that of β̂_C, so under H, β̂_E is a more efficient estimator of β than β̂_C. Moreover, the asymptotic variance of β̂_E is smaller than that of any weighted sum of β̂_E and β̂_C, which implies that the asymptotic covariance of β̂_E and β̂_C is equal to the asymptotic variance of β̂_E. Finally, we have estimators V̂(β̂_C), V̂(β̂_E), and ĉov(β̂_C, β̂_E) of the variances of β̂_C and β̂_E and of their covariance, such that nV̂(β̂_C), nV̂(β̂_E), and nĉov(β̂_C, β̂_E) are consistent for the asymptotic variances and covariance. We summarize these conditions in Assumption 1 below.

Assumption 1 (Set-up)

1. √n ( β̂_C − β, β̂_E − β_E )′ →d N( 0, Σ ), with Σ = ( σ²_C  σ²_E ; σ²_E  σ²_E ) and σ²_E < σ²_C.

2. nV̂(β̂_C) →p σ²_C, nV̂(β̂_E) →p σ²_E, and nĉov(β̂_C, β̂_E) →p σ²_E.

We consider the following estimator of β.

Definition 2.1 (The empirical-MSE-minimizing estimator of β)

Let

β̂_MSE = p̂ β̂_E + (1 − p̂) β̂_C,   (2.1)

where

p̂ = argmin_{p∈ℝ} p²(β̂_E − β̂_C)² + p²V̂(β̂_E) + (1 − p)²V̂(β̂_C) + 2p(1 − p)ĉov(β̂_C, β̂_E).   (2.2)
β̂_MSE is the weighted sum of β̂_E and β̂_C with the lowest estimated mean-squared error (MSE): in (2.2), p²(β̂_E − β̂_C)² estimates the squared bias of pβ̂_E + (1 − p)β̂_C, and the remaining terms estimate its variance. Solving the problem in Equation (2.2) yields

p̂ = ( V̂(β̂_C) − ĉov(β̂_C, β̂_E) ) / ( (β̂_E − β̂_C)² + V̂(β̂_E − β̂_C) ),   (2.3)

where V̂(β̂_E − β̂_C) = V̂(β̂_E) + V̂(β̂_C) − 2ĉov(β̂_C, β̂_E). Theorem 2.2 below gives the asymptotic distribution of β̂_MSE and compares it with that of β̂_C. In particular, we compare the MSE of the asymptotic distributions of the two estimators.¹

Theorem 2.2 (Asymptotic distribution of β̂_MSE)

Suppose Assumption 1 holds.

1. If β ≠ β_E and β and β_E do not depend on n, √n(β̂_MSE − β) →d N(0, σ²_C).

2. If β = β_E, √n(β̂_MSE − β) →d U₀, where U₀ is such that E(U₀) = 0 and V(U₀) ∈ (σ²_E, σ²_C).

3. If β_E = β + h/√n for some h ∈ ℝ, √n(β̂_MSE − β) →d U_h, where (U_h)_{h∈ℝ} is such that

max_{h∈ℝ} [ E(U_h²) − σ²_C ] < max_{h∈ℝ} [ σ²_C − E(U_h²) ].

¹ If the second moments of the normalized estimators converge, our results provide a comparison of the asymptotic MSE of the estimators. To avoid the issue that convergence in distribution does not imply convergence in L², one could consider instead, as Cheng et al. (2019), a winsorized version of the square loss.

Point 1 of the theorem shows that if β ≠ β_E, β̂_MSE and β̂_C have the same asymptotic distribution. On the other hand, Point 2 shows that if β = β_E, their asymptotic distributions differ, and the MSE of the asymptotic distribution of β̂_MSE is larger than that of β̂_E but smaller than that of β̂_C. This comes from the fact that under H, p̂ converges in distribution to a nondegenerate distribution.
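For concreteness, the closed-form weight in (2.3) is straightforward to implement. The following sketch is ours (illustrative inputs, hypothetical variable names), building β̂_MSE from two point estimates and their estimated variances and covariance:

```python
import numpy as np

def beta_mse(beta_c, beta_e, var_c, var_e, cov_ce):
    """Empirical-MSE-minimizing combination of a consistent estimator
    beta_c and an efficient (possibly inconsistent) estimator beta_e,
    following Equations (2.1)-(2.3)."""
    var_diff = var_e + var_c - 2 * cov_ce        # V-hat(beta_e - beta_c)
    p_hat = (var_c - cov_ce) / ((beta_e - beta_c) ** 2 + var_diff)
    return p_hat * beta_e + (1 - p_hat) * beta_c, p_hat

# Illustrative inputs: e.g., beta_c = a 2SLS estimate, beta_e = an OLS
# estimate in the IV example; under Assumption 1 one may use var_e
# itself as the covariance estimate cov_ce.
est, p_hat = beta_mse(beta_c=1.3, beta_e=1.1, var_c=0.04,
                      var_e=0.01, cov_ce=0.01)
```

Note that p̂ shrinks toward 0 when the two estimates are far apart (large estimated bias) and toward 1 when β̂_E is much more precise than β̂_C.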
In other words, p̂ does not perform consistent "model selection": it does not converge to 1 if H holds and to 0 otherwise.

The asymptotic approximations under fixed values of β and β_E may not give good approximations of the finite-sample behavior of the estimators. Instead, in Point 3 of the theorem, we compare the MSE of their asymptotic distributions under alternatives local to β = β_E, namely β_E = β + h/√n. We find that under this type of asymptotics, the MSE of the asymptotic distribution of β̂_MSE is not always smaller than that of β̂_C. This phenomenon is reminiscent of Hodges' estimator, whose asymptotic distribution has a smaller MSE than that of the standard estimator if the true parameter is 0, but whose maximal risk increases without bound as n → ∞ (see, e.g., Lehmann & Casella 1998, pp. 440-443). An important difference with Hodges' estimator is that here, the maximum asymptotic-MSE-gain one may incur by using β̂_MSE rather than β̂_C is larger than the maximum asymptotic-MSE-loss. Thus, β̂_MSE dominates β̂_C from a minimax-regret perspective (see Savage 1951). It is straightforward to show that β̂_MSE also dominates β̂_E from a minimax-regret perspective.

3 Extensions

3.1 Pre-test estimators

When β = β_E, β̂_MSE is not equivalent to β̂_E, and it has a higher asymptotic variance. This is because the estimated squared bias (β̂_E − β̂_C)² in (2.2) includes some noise and is not negligible even if β = β_E. We now consider a modified version of β̂_MSE that uses a smaller estimator of the squared bias. Specifically, we replace (2.2) by

p̂_λ = argmin_{p∈ℝ} p² max[ 0, (β̂_E − β̂_C)² − λV̂(β̂_E − β̂_C) ] + p²V̂(β̂_E) + (1 − p)²V̂(β̂_C) + 2p(1 − p)ĉov(β̂_C, β̂_E)

for some λ ≥ 0.
We then let β̂_{MSE,λ} = p̂_λ β̂_E + (1 − p̂_λ) β̂_C. β̂_{MSE,λ} can be viewed as a pre-test estimator. Assume that one uses V̂(β̂_E) to estimate cov(β̂_C, β̂_E). Then, let F denote the cdf of a χ²(1) distribution and let α = 1 − F(λ). To compute β̂_{MSE,λ}, one first runs a level-α test of H, using the fact that under H, (β̂_E − β̂_C)²/V̂(β̂_E − β̂_C) →d χ²(1). If H is accepted, then β̂_{MSE,λ} = β̂_E. If H is rejected, β̂_{MSE,λ} is equal to a convex combination of β̂_E and β̂_C, where the weight assigned to β̂_C depends on how far we are from accepting H.

Proposition 3.1
Suppose Assumption 1 holds and let λ be a positive real number.

1. If β ≠ β_E and β and β_E do not depend on n, √n(β̂_{MSE,λ} − β) →d N(0, σ²_C).

2. If β = β_E, √n(β̂_{MSE,λ} − β) →d U_{0,λ}, where U_{0,λ} is such that E(U_{0,λ}) = 0, λ ↦ V(U_{0,λ}) is strictly decreasing, and lim_{λ→+∞} V(U_{0,λ}) = σ²_E.

3. If β_E = β + h/√n for some h ∈ ℝ, √n(β̂_{MSE,λ} − β) →d U_{h,λ}, where (U_{h,λ})_{h∈ℝ} is such that for all λ ≠ 0,

max_{h∈ℝ} [ E(U²_{h,λ}) − E(U²_{h,0}) ] > max_{h∈ℝ} [ E(U²_{h,0}) − E(U²_{h,λ}) ].   (3.1)

Points 1 and 2 in Proposition 3.1 are similar to those in Theorem 2.2, with the additional point that under H, the asymptotic quadratic risk of β̂_{MSE,λ} decreases and gets closer to that of β̂_E as λ increases. However, the third point shows that from a minimax-regret perspective, such estimators are dominated by our initial estimator β̂_MSE, which corresponds to λ = 0. The reason is that the decrease of λ ↦ E(U²_{0,λ}) does not compensate for the quick increase of λ ↦ max_{h∈ℝ} E(U²_{h,λ}) − E(U²_{h,0}). This echoes the discussion in Leeb & Pötscher (2005): as we move closer to an estimator based on a consistent model selection, the maximal asymptotic risk increases without bound.

3.2 Estimators with different rates of convergence

In this subsection, we assume that β̂_C is r_n-consistent for β ∈ ℝ, for some sequence (r_n)_{n∈ℕ} such that r_n → ∞ and r_n/n^{1/2} → 0. The estimator β̂_E is still √n-consistent and asymptotically normal for β_E, which may be equal to β under an assumption H on the data generating process.
We also assume we have estimators V̂(β̂_C), V̂(β̂_E), and ĉov(β̂_C, β̂_E) of the variances of β̂_C and β̂_E and of their covariance, such that r²_n V̂(β̂_C), nV̂(β̂_E), and n^{1/2} r_n ĉov(β̂_C, β̂_E) are consistent for the asymptotic variances and covariance. We summarize these conditions in Assumption 2 below.

Assumption 2 (Set-up)

1. There exists a sequence (r_n)_{n∈ℕ}, with r_n → ∞ and r_n/n^{1/2} → 0, such that

( r_n(β̂_C − β), √n(β̂_E − β_E) )′ →d N( (µ, 0)′, Σ ), with Σ = ( σ²_C  ρ ; ρ  σ²_E ).

2. r²_n V̂(β̂_C) →p σ²_C, nV̂(β̂_E) →p σ²_E, and n^{1/2} r_n ĉov(β̂_C, β̂_E) →p ρ.

Theorem 3.2 below gives the asymptotic distribution of β̂_MSE defined in Equation (2.1) above, under Assumption 2 rather than Assumption 1.
Theorem 3.2 (Asymptotic distribution of β̂_MSE under Assumption 2)

Suppose Assumption 2 holds.

1. If β ≠ β_E and β and β_E do not depend on n, r_n(β̂_MSE − β) →d N(µ, σ²_C).

2. If β = β_E, r_n(β̂_MSE − β) →d U₀, where U₀ is such that E(U₀²) < E(V²), with V ∼ N(µ, σ²_C) the asymptotic distribution of r_n(β̂_C − β).

3. If β_E = β + h/r_n for some h ∈ ℝ and |µ/σ_C| ≤ 0.4, then r_n(β̂_MSE − β) →d U_h, where (U_h)_{h∈ℝ} is such that

max_{h∈ℝ} [ E(U_h²) − (µ² + σ²_C) ] < max_{h∈ℝ} [ µ² + σ²_C − E(U_h²) ].

Point 1 of Theorem 3.2 shows that if β ≠ β_E, β̂_MSE and β̂_C have the same asymptotic distribution. On the other hand, Point 2 shows that if β = β_E, their asymptotic distributions differ, and the MSE of the asymptotic distribution of β̂_MSE is smaller than that of β̂_C. In Point 3 of the theorem, we compare the MSE of their asymptotic distributions under alternatives local to β = β_E, namely β_E = β + h/r_n. Under this type of asymptotics, the maximum asymptotic-MSE-gain one may incur by using β̂_MSE rather than β̂_C is larger than the maximum asymptotic-MSE-loss, provided the first-order bias is no greater in absolute value than 0.4σ_C. Again, β̂_MSE dominates β̂_C from a minimax-regret perspective, provided the first-order bias of β̂_C is not too large.

4 Proofs

We will use the following lemma.
Lemma 4.1
Suppose that f is an odd function such that f(x) > 0 for all x > 0, and that Z has an even density g that is strictly decreasing on ℝ⁺. Then sgn( E[f(x + Z)] ) = sgn(x) for all x ∈ ℝ.

Proof: The result holds if x = 0, because E[f(Z)] = E[f(−Z)] = −E[f(Z)], so E[f(Z)] = 0. For any x < 0, E[f(Z + x)] = E[f(−Z + x)] = E[−f(Z − x)], so it suffices to show that E[f(Z + x)] > 0 for x > 0. We have

E[f(Z + x)] = ∫_ℝ f(z + x)g(z)dz = ∫_ℝ f(z)g(z − x)dz
= ∫_{ℝ⁺} f(z)g(z − x)dz + ∫_{ℝ⁻} f(z)g(z − x)dz
= ∫_{ℝ⁺} f(z)g(z − x)dz − ∫_{ℝ⁺} f(z)g(z + x)dz
= ∫_{ℝ⁺} f(z)[ g(z − x) − g(z + x) ]dz.

Now, for all z ∈ (0, x], |z − x| = x − z < x + z, so g(z − x) > g(z + x). If z > x, |z − x| = z − x < z + x, so again g(z − x) > g(z + x). The result follows since f(z) > 0 on (0, ∞). □

4.1 Theorem 2.2

Proof of Point 1

If β ≠ β_E, it follows from (2.3), Assumption 1, and the continuous mapping theorem that

n p̂ →p (σ²_C − σ²_E)/(β_E − β)².

Moreover,

β̂_MSE − β̂_C = p̂(β̂_E − β̂_C).   (4.1)

Hence, by the continuous mapping theorem again,

n(β̂_MSE − β̂_C) →p (σ²_C − σ²_E)/(β_E − β).

Therefore, √n(β̂_MSE − β̂_C) = o_P(1), and the result follows from Assumption 1 and Slutsky's lemma.

Proof of Point 2
Let (V, W) be a normal vector with means (0, 0), variances (σ²_C, σ²_C − σ²_E), and covariance −(σ²_C − σ²_E). Let

U₀ = V + W(σ²_C − σ²_E)/(W² + σ²_C − σ²_E).   (4.2)

If β = β_E,

√n(β̂_MSE − β) = √n(β̂_C − β) + [ nV̂(β̂_C) − nĉov(β̂_C, β̂_E) ] √n( β̂_E − β_E − (β̂_C − β) ) / { [ √n( β̂_E − β_E − (β̂_C − β) ) ]² + nV̂(β̂_E − β̂_C) } →d U₀.

The equality follows from Equations (4.1) and (2.3) and from β_E = β. The convergence in distribution follows from Assumption 1, the Slutsky lemma, and the continuous mapping theorem.

E(U₀) = 0, as φ: w ↦ w(σ²_C − σ²_E)/(w² + σ²_C − σ²_E) is such that φ(−w) = −φ(w) and the pdf of W is symmetric around 0.

Let Ψ = V + W. The vector (Ψ, W) is normally distributed with E(Ψ) = 0 and cov(Ψ, W) = 0. Hence, Ψ ⊥⊥ W, and since

U₀ = Ψ + W( (σ²_C − σ²_E)/(W² + σ²_C − σ²_E) − 1 ),

V(U₀) > V(Ψ) = σ²_E. Moreover,

V(U₀) − σ²_C = E(U₀²) − E(V²)
= E( (U₀ − V)(U₀ + V) )
= E( φ(W)( 2V + φ(W) ) )
= E( φ(W)( −2W + 2Ψ + φ(W) ) )
= E( W²(σ²_C − σ²_E)/(W² + σ²_C − σ²_E) ( (σ²_C − σ²_E)/(W² + σ²_C − σ²_E) − 2 ) )
< 0.

The first equality follows from E(U₀) = E(V) = 0, the third from Equation (4.2), the fourth from V = −W + Ψ, and the fifth from Ψ ⊥⊥ W and E(Ψ) = 0. The inequality holds since 0 < (σ²_C − σ²_E)/(W² + σ²_C − σ²_E) < 2 with probability 1, as σ²_C > σ²_E.

Proof of Point 3
Let (V, W_h) be a normal vector with means (0, h), variances (σ²_C, σ²_C − σ²_E), and covariance −(σ²_C − σ²_E). Let U_h = V + W_h(σ²_C − σ²_E)/(W_h² + σ²_C − σ²_E). We have

√n(β̂_MSE − β) = √n(β̂_C − β) + [ nV̂(β̂_C) − nĉov(β̂_C, β̂_E) ] ( √n( β̂_E − β_E − (β̂_C − β) ) + h ) / { [ √n( β̂_E − β_E − (β̂_C − β) ) + h ]² + nV̂(β̂_E − β̂_C) } →d U_h.

The equality follows from Equations (4.1) and (2.3) and from β_E = β + h/√n. The convergence in distribution follows from Assumption 1, the Slutsky lemma, and the continuous mapping theorem.

Let Ψ_h = V + W_h. The vector (Ψ_h, W_h) is normally distributed and cov(Ψ_h, W_h) = 0, so Ψ_h ⊥⊥ W_h. Let g = h/(σ²_C − σ²_E)^{1/2} and N_g = W_h/(σ²_C − σ²_E)^{1/2}, so that N_g ∼ N(g, 1). Writing φ(W_h) = (σ²_C − σ²_E)^{1/2} N_g/(N_g² + 1), we have

E(U_h²) − σ²_C = E( (U_h − V)(U_h + V) )
= E( φ(W_h)( 2V + φ(W_h) ) )
= −2cov( φ(W_h), W_h ) + E( φ(W_h)² )
= (σ²_C − σ²_E) { E[ ( N_g/(N_g² + 1) )² ] − 2cov( N_g/(N_g² + 1), N_g ) },   (4.3)

where the third equality uses V = Ψ_h − W_h, Ψ_h ⊥⊥ W_h, and E(Ψ_h) = E(W_h).
Let

Δ(g) = E[ ( N_g/(N_g² + 1) )² ] − 2cov( N_g/(N_g² + 1), N_g ).   (4.4)

Since N_{−g} ∼ −N_g, we have

Δ(−g) = E[ ( −N_g/(N_g² + 1) )² ] − 2cov( −N_g/(N_g² + 1), −N_g ) = Δ(g).

Moreover, we obtain through numerical simulations that min_{g∈ℝ⁺} Δ(g) < 0 < max_{g∈ℝ⁺} Δ(g), and that the minimum is larger in absolute value than the maximum. The inequality

max_{h∈ℝ} [ E(U_h²) − σ²_C ] < max_{h∈ℝ} [ σ²_C − E(U_h²) ]

follows from these last results, Equation (4.3), and σ²_C > σ²_E. □

4.2 Proposition 3.1

Proof of Point 1
First, we have

p̂_λ = ( V̂(β̂_C) − ĉov(β̂_C, β̂_E) ) / ( max[ 0, (β̂_E − β̂_C)² − λV̂(β̂_E − β̂_C) ] + V̂(β̂_E − β̂_C) ).

If β ≠ β_E, Assumption 1 and the continuous mapping theorem yield

n p̂_λ →p (σ²_C − σ²_E)/(β_E − β)².

The result follows as in the previous proof.
Proof of Point 2
Instead of considering U₀ defined by (4.2), we now consider U_{0,λ} defined by

U_{0,λ} = V + W(σ²_C − σ²_E)/( max[ 0, W² − λ(σ²_C − σ²_E) ] + σ²_C − σ²_E ).

Then, the proofs of the convergence in distribution, of E(U_{0,λ}) = 0, and of E(U²_{0,λ}) < σ²_C are identical to those above. We now show that λ ↦ V(U_{0,λ}) is decreasing. Let Z = W/(σ²_C − σ²_E)^{1/2}. Using the same definition of Ψ as above, we have

U_{0,λ} = Ψ + (σ²_C − σ²_E)^{1/2} Z ( 1/( max(0, Z² − λ) + 1 ) − 1 ),

with Ψ ⊥⊥ Z. Then, for any two λ > λ′,

V(U_{0,λ}) − V(U_{0,λ′}) = E[ (U_{0,λ} + U_{0,λ′})(U_{0,λ} − U_{0,λ′}) ]
= (σ²_C − σ²_E) E[ Z² ( 1/( max(0, Z² − λ) + 1 ) + 1/( max(0, Z² − λ′) + 1 ) − 2 ) ( 1/( max(0, Z² − λ) + 1 ) − 1/( max(0, Z² − λ′) + 1 ) ) ].

Moreover,

1/( max(0, Z² − λ) + 1 ) + 1/( max(0, Z² − λ′) + 1 ) − 2 ≤ 0,

and the inequality is strict for |Z| > √λ′. Also,

1/( max(0, Z² − λ) + 1 ) − 1/( max(0, Z² − λ′) + 1 ) ≥ 0,

and again, the inequality is strict for |Z| > √λ′. Hence, V(U_{0,λ}) < V(U_{0,λ′}). Finally, as λ → +∞, U_{0,λ} converges almost surely to Ψ. Moreover, |U_{0,λ}| ≤ |Ψ| + (σ²_C − σ²_E)^{1/2}|Z|, so by the dominated convergence theorem, V(U_{0,λ}) → V(Ψ) = σ²_E.
With the same notation as above, let

U_{h,λ} = Ψ_h + W_h ( (σ²_C − σ²_E)/( max(0, W_h² − λ(σ²_C − σ²_E)) + σ²_C − σ²_E ) − 1 ).

Using a reasoning similar to that in the proof of Point 3 of Theorem 2.2, √n(β̂_{MSE,λ} − β) →d U_{h,λ}.
By the same reasoning as in the proof of Point 3 of Theorem 2.2,

E(U²_{h,λ}) − E(U²_{h,0}) = (σ²_C − σ²_E) { E[ ( N_g/( max(0, N_g² − λ) + 1 ) )² − ( N_g/(N_g² + 1) )² ] − 2cov( N_g ( 1/( max(0, N_g² − λ) + 1 ) − 1/(N_g² + 1) ), N_g ) }.

Then, simulations show that (3.1) holds. □
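The numerical step invoked in these proofs can be reproduced with a simple Monte Carlo approximation of Δ(g) in (4.4). The sketch below is ours, using plain i.i.d. normal draws rather than the Halton draws mentioned later, so exact figures will differ slightly from the paper's:

```python
import numpy as np

def delta(g, n_draws=1_000_000, seed=0):
    """Monte Carlo approximation of Delta(g) in Equation (4.4):
    Delta(g) = E[(N/(N^2+1))^2] - 2*cov(N/(N^2+1), N), with N ~ N(g, 1)."""
    rng = np.random.default_rng(seed)
    n = g + rng.standard_normal(n_draws)
    r = n / (n ** 2 + 1)
    return np.mean(r ** 2) - 2 * np.cov(r, n)[0, 1]
```

Evaluating delta on a grid shows the pattern used in the proof of Point 3 of Theorem 2.2: Δ is even in g, negative at g = 0, positive for large g, and its minimum exceeds its maximum in absolute value.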
4.3 Theorem 3.2

Proof of Point 1
We have

β̂_MSE − β̂_C = p̂(β̂_E − β̂_C).   (4.5)

If β ≠ β_E, it follows from Assumption 2 and the continuous mapping theorem that

r²_n(β̂_MSE − β̂_C) →p σ²_C/(β_E − β).

Therefore, r_n(β̂_MSE − β̂_C) = o_P(1), and the result follows from Assumption 2 and Slutsky's lemma.
Let V be a normal variable with mean µ and variance σ²_C. Let

U₀ = V ( 1 − σ²_C/(V² + σ²_C) ).   (4.6)

If β = β_E,

r_n(β̂_MSE − β) = r_n(β̂_C − β) + [ r²_n V̂(β̂_C) − r²_n ĉov(β̂_C, β̂_E) ] r_n( β̂_E − β_E − (β̂_C − β) ) / { [ r_n( β̂_E − β_E − (β̂_C − β) ) ]² + r²_n V̂(β̂_E − β̂_C) } →d U₀.

The equality follows from Equations (4.5) and (2.3) and from β_E = β. The convergence in distribution follows from Assumption 2, the Slutsky lemma, and the continuous mapping theorem, noting that r²_n ĉov(β̂_C, β̂_E) →p 0 and r²_n V̂(β̂_E) →p 0, since r_n/n^{1/2} → 0. Further,

E(U₀²) − E(V²) = E( (U₀ − V)(U₀ + V) )
= E( −σ²_C V/(V² + σ²_C) ( 2V − σ²_C V/(V² + σ²_C) ) )
= E( −σ²_C V²/(V² + σ²_C) ( 2 − σ²_C/(V² + σ²_C) ) )
< 0,

since 2 > σ²_C/(V² + σ²_C) with probability 1.
Let

U_h = V + (h − V)σ²_C/( (h − V)² + σ²_C ).   (4.7)

If β_E = β + h/r_n,

r_n(β̂_MSE − β) = r_n(β̂_C − β) + [ r²_n V̂(β̂_C) − r²_n ĉov(β̂_C, β̂_E) ] ( r_n( β̂_E − β_E − (β̂_C − β) ) + h ) / { [ r_n( β̂_E − β_E − (β̂_C − β) ) + h ]² + r²_n V̂(β̂_E − β̂_C) } →d U_h.

The equality follows from Equations (4.5) and (2.3) and from β_E = β + h/r_n. The convergence in distribution follows from Assumption 2, the Slutsky lemma, and the continuous mapping theorem.

Let g = (h − µ)/σ_C, N_g = (h − V)/σ_C, and µ_sd = µ/σ_C. Then N_g ∼ N(g, 1) and we have:

E(U_h²) − E(V²) = E( (U_h − V)(U_h + V) )
= E( (h − V)σ²_C/( (h − V)² + σ²_C ) ( 2V + (h − V)σ²_C/( (h − V)² + σ²_C ) ) )
= σ²_C E( N_g/(N_g² + 1) ( 2(g + µ_sd − N_g) + N_g/(N_g² + 1) ) )
= σ²_C ( Δ(g) + 2µ_sd E( N_g/(N_g² + 1) ) ),   (4.8)

with Δ(g) defined as in Equation (4.4). Let Λ(g, µ_sd) = Δ(g) + 2µ_sd E( N_g/(N_g² + 1) ). The function x ↦ x/(x² + 1) and the density of a N(0, 1) satisfy the conditions of Lemma 4.1. Thus, E( N_g/(N_g² + 1) ) ≥ 0 if g ≥ 0, and E( N_g/(N_g² + 1) ) < 0 otherwise. Then, if µ_sd and g are not of the same sign, Λ(g, µ_sd) ≤ Δ(g), so it follows from Point 3 of Theorem 2.2 that for every µ_sd, the minimum of Λ(g, µ_sd) with respect to g is greater in absolute value than its maximum. We can then restrict attention to cases where µ_sd and g are of the same sign. As Λ(−g, −µ_sd) = Λ(g, µ_sd), we can further restrict attention to cases where g ≥ 0 and µ_sd ≥ 0. We use Monte Carlo simulations with Halton draws to approximate Λ(g, µ_sd) for g on a grid {0, 0.1, 0.2, …} and for µ_sd ∈ [0, 0.4].
For every such µ_sd, the absolute value of the minimum of Λ(g, µ_sd) with respect to g is larger than the absolute value of its maximum. This proves the result, together with Equation (4.8). □

References

Angrist, J. D. & Pischke, J.-S. (2008),
Mostly Harmless Econometrics: An Empiricist's Companion, Princeton University Press.

Breusch, T., Ward, M. B., Nguyen, H. T. M. & Kompas, T. (2011), 'On the fixed-effects vector decomposition',
Political Analysis 19(2), 123–134.

Cheng, X., Liao, Z. & Shi, R. (2019), 'On uniform asymptotic risk of averaging GMM estimators', Quantitative Economics 10(3), 931–979.

Green, E. J. & Strawderman, W. E. (1991), 'A James-Stein type estimator for combining unbiased and possibly biased estimators', Journal of the American Statistical Association 86(416), 1001–1006.

Hahn, J., Todd, P. & Van der Klaauw, W. (2001), 'Identification and estimation of treatment effects with a regression-discontinuity design', Econometrica 69(1), 201–209.

Hansen, B. E. (2017), 'Stein-like 2SLS estimator', Econometric Reviews 36(6-9), 840–852.

Hirano, K., Imbens, G. W. & Ridder, G. (2003), 'Efficient estimation of average treatment effects using the estimated propensity score', Econometrica 71(4), 1161–1189.

James, W. & Stein, C. (1961), Estimation with quadratic loss, in 'Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics', The Regents of the University of California.

Judge, G. G. & Mittelhammer, R. C. (2004), 'A semiparametric basis for combining estimation problems under quadratic loss', Journal of the American Statistical Association 99(466), 479–487.

Leeb, H. & Pötscher, B. (2005), 'Model selection and inference: Facts and fiction', Econometric Theory 21(1), 21–59.

Lehmann, E. L. & Casella, G. (1998), Theory of Point Estimation, Springer Texts in Statistics.

Mittelhammer, R. C. & Judge, G. G. (2005), 'Combining estimators to improve structural model estimation and inference under quadratic loss',
Journal of Econometrics 128(1), 1–29.

Savage, L. J. (1951), 'The theory of statistical decision',
Journal of the American Statistical Association 46(253), 55–67.

Stein, C. (1956), Inadmissibility of the usual estimator for the mean of a multivariate normal distribution, in 'Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1', University of California Press.