Empirical MSE Minimization to Estimate a Scalar Parameter
Clément de Chaisemartin∗  Xavier D'Haultfœuille†

Abstract
We consider the estimation of a scalar parameter, when two estimators are available. The first is always consistent. The second is inconsistent in general, but has a smaller asymptotic variance than the first, and may be consistent if an assumption is satisfied. We propose to use the weighted sum of the two estimators with the lowest estimated mean-squared error (MSE). We show that this third estimator dominates the other two from a minimax-regret perspective: the maximum asymptotic-MSE-gain one may incur by using this estimator rather than one of the other estimators is larger than the maximum asymptotic-MSE-loss.
Keywords: bias-variance trade-off, mean-squared error, consistent estimator, efficient estimator, statistical decision theory, minimax regret, local asymptotics.
JEL Codes:
C21, C23
∗ University of California at Santa Barbara, [email protected].
† CREST-ENSAE, [email protected].
arXiv preprint, [math.ST].

We consider the estimation of a scalar parameter β when two estimators are available. The first is √n-consistent. The second is inconsistent in general, but it has a smaller asymptotic variance than the first, and it may be √n-consistent if the data generating process satisfies an assumption H. Hereafter, those two estimators are respectively referred to as the consistent and efficient estimators.

To fix ideas, we consider two of the many examples where this set-up is applicable. In stratified randomized experiments, the parameter of interest is the average treatment effect (ATE). To estimate it, one may use the propensity score matching estimator (see Hirano et al. 2003), which is √n-consistent and asymptotically normal under some assumptions. Alternatively, one may regress the outcome of interest on strata fixed effects and units' treatment status, and use the coefficient of the treatment in that regression. It follows from, e.g., Equation (3.3.7) in Angrist & Pischke (2008) that this estimator is √n-consistent and asymptotically normal for a weighted average of the effect of the treatment in each stratum. One can also show that under some assumptions, the asymptotic variance of the strata fixed effects estimator is smaller than that of the propensity score matching estimator. So if the treatment effect is constant across strata, the strata fixed effects estimator is √n-consistent for the ATE, and it is more efficient than the propensity score matching estimator.
But if the treatment effect is heterogeneous, the strata fixed effects estimator is inconsistent.

Another example where our set-up is applicable is a linear and constant treatment effect model, where the treatment is potentially endogenous, but one has an instrumental variable at hand. Then, the 2SLS estimator is √n-consistent for the treatment effect. On the other hand, the OLS estimator is only √n-consistent if the treatment is actually exogenous, but its asymptotic variance is smaller than that of the 2SLS estimator.

To estimate β, we propose to use β̂_MSE, the weighted sum of the consistent and efficient estimators with the lowest estimated mean-squared error (MSE). We show that this third estimator dominates the other two from a minimax-regret perspective: the maximum asymptotic-MSE-gain one may incur by using β̂_MSE rather than one of the other estimators is larger than the maximum asymptotic-MSE-loss that one may incur by doing so.

We also consider a family of pre-test estimators (β̂_{MSE,λ})_{λ≥0}, where λ indexes the critical value used in the pre-test. First, we test whether the consistent and efficient estimators are equal. If the test is accepted, β̂_{MSE,λ} is equal to the efficient estimator. If the test is rejected, β̂_{MSE,λ} is equal to a convex combination of the consistent and efficient estimators. We show that such estimators have similar properties as β̂_MSE. However, β̂_MSE dominates all of them from a minimax-regret perspective.

We then extend the initial result by considering situations where one has two estimators at hand: one is r_n-consistent, where r_n/√n → 0, and the other is inconsistent in general, but may be √n-consistent if the data generating process satisfies an assumption H. Such situations may for instance arise in regression discontinuity (RD) designs. Then, non-parametric estimators such as the one proposed by Hahn et al.
(2001) are n^{2/5}-consistent for the average treatment effect at the cut-off under weak conditions. On the other hand, the estimator using, say, linear regressions to the left and to the right of the cut-off, without restricting the sample to observations in a narrow bandwidth around the cut-off, is √n-consistent if the potential outcomes' CEFs are indeed linear in the running variable, but inconsistent otherwise. Again, we show that β̂_MSE dominates the r_n-consistent estimator from a minimax-regret perspective, under mild assumptions.

The idea of combining consistent and inconsistent, or unbiased and biased, estimators of a parameter has a long tradition in statistics and econometrics. Green & Strawderman (1991) have proposed an estimator related to β̂_MSE, in the context of a normal model with known variances. Their estimator is related to the shrinkage estimator in Stein (1956) and James & Stein (1961), and when the parameter of interest is of dimension greater than three, its MSE is lower than that of the unbiased estimator of the parameter of interest. Judge & Mittelhammer (2004) and Mittelhammer & Judge (2005) have proposed an estimator similar to that in Green & Strawderman (1991), without assuming normality or that the variances are known, and study its asymptotic MSE, again when the parameter of interest is of dimension greater than three. Cheng et al. (2019) have considered an estimator related to β̂_MSE, in a GMM context with some valid and some potentially misspecified moment conditions. They show that asymptotically, the MSE of their estimator is uniformly smaller than that of the GMM estimator using only the valid moment conditions. Again, they focus on a multivariate parameter with a dimension greater than four.
In the context of a linear model with at least three endogenous variables and instruments, Hansen (2017) proposes to use a weighted average of the OLS and 2SLS estimators, with weights that depend on the Hausman-Wu statistic in a test of equality between the OLS and 2SLS estimators. He shows that this estimator has a lower asymptotic MSE than that of the 2SLS estimator. Finally, Breusch et al. (2011) have considered β̂_MSE in a panel data context, in a case where the parameter of interest is univariate. However, they do not study the theoretical properties of that estimator.

To the best of our knowledge, our paper is the first to study the theoretical properties of β̂_MSE when the parameter of interest is univariate. This case is particularly relevant for policy evaluation and treatment choice. Even when one measures the effect of the policy on several outcomes, one is ultimately interested in summarizing those effects into a monetary assessment of the benefits of the policy, to be compared to its cost. Contrary to the previous literature, we do not find that the MSE of β̂_MSE uniformly dominates that of the consistent estimator. However, we show that β̂_MSE dominates the other two estimators from a minimax-regret perspective, thus giving a theoretical justification to its use to estimate a univariate parameter. Our paper also seems to be the first to consider the combination of estimators with different rates of convergence, which may be relevant in a number of contexts, such as RD designs.

The remainder of the paper is organized as follows. Section 2 presents the set-up and main results. Section 3 presents some extensions. Section 4 presents the proofs of the results.
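Before turning to the formal set-up, the IV example above can be made concrete with a short simulation. This sketch is ours, not the paper's; the data generating process and variable names are illustrative assumptions. It generates data with an endogenous treatment and shows that OLS (the efficient but inconsistent estimator here) and 2SLS (the consistent estimator) disagree:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.standard_normal(n)                  # instrument
u = rng.standard_normal(n)                  # unobserved confounder
d = 0.8 * z + u + rng.standard_normal(n)    # endogenous treatment
y = 1.0 * d + u + rng.standard_normal(n)    # true effect beta = 1

# OLS of y on d: inconsistent here because cov(d, u) != 0.
beta_ols = np.cov(d, y)[0, 1] / np.var(d, ddof=1)

# 2SLS with instrument z: consistent for beta, but with a larger variance.
beta_2sls = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]
```

With these parameters, beta_2sls is close to the true effect of 1 while beta_ols is biased upward, illustrating the bias-variance trade-off the paper exploits.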
2 Set-up and main results

We are interested in a parameter β ∈ ℝ. To estimate β, we use a sample of size n. β̂_C is √n-consistent and asymptotically normal for β. β̂_E is √n-consistent and asymptotically normal for β_E. In general, β_E ≠ β, but under an assumption H on the data generating process, β_E = β. The asymptotic variance of β̂_E is smaller than that of β̂_C, so under H, β̂_E is a more efficient estimator of β than β̂_C. Moreover, the asymptotic variance of β̂_E is smaller than that of any weighted sum of β̂_E and β̂_C, which implies that the asymptotic covariance of β̂_E and β̂_C is equal to the asymptotic variance of β̂_E. Finally, we have estimators V̂(β̂_C), V̂(β̂_E), and ĉov(β̂_C, β̂_E) of the variances of β̂_C and β̂_E and of their covariance, such that nV̂(β̂_C), nV̂(β̂_E), and nĉov(β̂_C, β̂_E) are consistent for the asymptotic variances and covariance. We summarize these conditions in Assumption 1 below.

Assumption 1 (Set-up)

1. √n ( β̂_C − β, β̂_E − β_E )′ →d N( 0, Σ ), with Σ = ( σ²_C  σ²_E ; σ²_E  σ²_E ) and σ²_E < σ²_C.

2. nV̂(β̂_C) →p σ²_C, nV̂(β̂_E) →p σ²_E, and nĉov(β̂_C, β̂_E) →p σ²_E.

We consider the following estimator of β.

Definition 2.1 (The empirical-MSE-minimizing estimator of β)

Let

β̂_MSE = p̂ β̂_E + (1 − p̂) β̂_C,   (2.1)

where

p̂ = argmin_{p∈ℝ} p²(β̂_E − β̂_C)² + p²V̂(β̂_E) + (1 − p)²V̂(β̂_C) + 2p(1 − p)ĉov(β̂_C, β̂_E).   (2.2)
β̂_MSE is the weighted sum of β̂_E and β̂_C with the lowest estimated mean-squared error (MSE): in (2.2), p²(β̂_E − β̂_C)² estimates the squared bias of pβ̂_E + (1 − p)β̂_C, and the remaining terms estimate its variance. Solving the problem in Equation (2.2) yields

p̂ = ( V̂(β̂_C) − ĉov(β̂_C, β̂_E) ) / ( (β̂_E − β̂_C)² + V̂(β̂_E − β̂_C) ),   (2.3)

where V̂(β̂_E − β̂_C) = V̂(β̂_E) + V̂(β̂_C) − 2ĉov(β̂_C, β̂_E). Theorem 2.2 below gives the asymptotic distribution of β̂_MSE and compares it with that of β̂_C. In particular, we compare the MSE of the asymptotic distributions of the two estimators.¹

Theorem 2.2 (Asymptotic distribution of β̂_MSE)

Suppose Assumption 1 holds.

1. If β ≠ β_E and β and β_E do not depend on n, √n(β̂_MSE − β) →d N(0, σ²_C).

2. If β = β_E, √n(β̂_MSE − β) →d U₀, where U₀ is such that E(U₀) = 0 and V(U₀) ∈ (σ²_E, σ²_C).

3. If β_E = β + h/√n for some h ∈ ℝ, √n(β̂_MSE − β) →d U_h, where (U_h)_{h∈ℝ} is such that

max_{h∈ℝ} [ E(U_h²) − σ²_C ] < max_{h∈ℝ} [ σ²_C − E(U_h²) ].

¹ If the second moments of the normalized estimators converge, our results provide a comparison of the asymptotic MSE of the estimators. To avoid the issue that convergence in distribution does not imply convergence in L², one could consider instead, as Cheng et al. (2019), a winsorized version of the square loss.

Point 1 of the theorem shows that if β ≠ β_E, β̂_MSE and β̂_C have the same asymptotic distribution. On the other hand, Point 2 shows that if β = β_E, their asymptotic distributions differ, and the MSE of the asymptotic distribution of β̂_MSE is larger than that of β̂_E but smaller than that of β̂_C. This comes from the fact that under H, p̂ converges in distribution to a nondegenerate distribution.
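For concreteness, the closed-form weight in (2.3) is straightforward to implement. The following sketch is ours (illustrative inputs, hypothetical variable names), building β̂_MSE from two point estimates and their estimated variances and covariance:

```python
import numpy as np

def beta_mse(beta_c, beta_e, var_c, var_e, cov_ce):
    """Empirical-MSE-minimizing combination of a consistent estimator
    beta_c and an efficient (possibly inconsistent) estimator beta_e,
    following Equations (2.1)-(2.3)."""
    var_diff = var_e + var_c - 2 * cov_ce        # V-hat(beta_e - beta_c)
    p_hat = (var_c - cov_ce) / ((beta_e - beta_c) ** 2 + var_diff)
    return p_hat * beta_e + (1 - p_hat) * beta_c, p_hat

# Illustrative inputs: e.g., beta_c = a 2SLS estimate, beta_e = an OLS
# estimate in the IV example; under Assumption 1 one may use var_e
# itself as the covariance estimate cov_ce.
est, p_hat = beta_mse(beta_c=1.3, beta_e=1.1, var_c=0.04,
                      var_e=0.01, cov_ce=0.01)
```

Note that p̂ shrinks toward 0 when the two estimates are far apart (large estimated bias) and toward 1 when β̂_E is much more precise than β̂_C.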
In other words, p̂ does not perform consistent "model selection": it does not converge to 1 if H holds and to 0 otherwise.

The asymptotic approximations under fixed values of β and β_E may not give good approximations of the finite-sample behavior of the estimators. Instead, in Point 3 of the theorem, we compare the MSE of their asymptotic distributions under alternatives local to β = β_E, namely β_E = β + h/√n. We find that under this type of asymptotics, the MSE of the asymptotic distribution of β̂_MSE is not always smaller than that of β̂_C. This phenomenon is reminiscent of Hodges' estimator, whose asymptotic distribution has a smaller MSE than that of the standard estimator if the true parameter is 0, but whose maximal risk increases without bound as n → ∞ (see, e.g., Lehmann & Casella 1998, pp. 440-443). An important difference with Hodges' estimator is that here, the maximum asymptotic-MSE-gain one may incur by using β̂_MSE rather than β̂_C is larger than the maximum asymptotic-MSE-loss. Thus, β̂_MSE dominates β̂_C from a minimax-regret perspective (see Savage 1951). It is straightforward to show that β̂_MSE also dominates β̂_E from a minimax-regret perspective.

3 Extensions

3.1 Pre-test estimators

When β = β_E, β̂_MSE is not equivalent to β̂_E, and it has a higher asymptotic variance. This is because the estimated squared bias (β̂_E − β̂_C)² in (2.2) includes some noise and is not negligible even if β = β_E. We now consider a modified version of β̂_MSE that uses a smaller estimator of the squared bias. Specifically, we replace (2.2) by

p̂_λ = argmin_{p∈ℝ} p² max[ 0, (β̂_E − β̂_C)² − λV̂(β̂_E − β̂_C) ] + p²V̂(β̂_E) + (1 − p)²V̂(β̂_C) + 2p(1 − p)ĉov(β̂_C, β̂_E)

for some λ ≥ 0.
We then let β̂_{MSE,λ} = p̂_λ β̂_E + (1 − p̂_λ) β̂_C. β̂_{MSE,λ} can be viewed as a pre-test estimator. Assume that one uses V̂(β̂_E) to estimate cov(β̂_C, β̂_E). Then, let F denote the cdf of a χ²(1) distribution and let α = 1 − F(λ). To compute β̂_{MSE,λ}, one first runs a level-α test of H, using the fact that under H, (β̂_E − β̂_C)²/V̂(β̂_E − β̂_C) →d χ²(1). If H is accepted, then β̂_{MSE,λ} = β̂_E. If H is rejected, β̂_{MSE,λ} is equal to a convex combination of β̂_E and β̂_C, where the weight assigned to β̂_C depends on how far we are from accepting H.

Proposition 3.1
Suppose Assumption 1 holds and let λ be a positive real number.

1. If β ≠ β_E and β and β_E do not depend on n, √n(β̂_{MSE,λ} − β) →d N(0, σ²_C).

2. If β = β_E, √n(β̂_{MSE,λ} − β) →d U_{0,λ}, where U_{0,λ} is such that E(U_{0,λ}) = 0, λ ↦ V(U_{0,λ}) is strictly decreasing, and lim_{λ→+∞} V(U_{0,λ}) = σ²_E.

3. If β_E = β + h/√n for some h ∈ ℝ, √n(β̂_{MSE,λ} − β) →d U_{h,λ}, where (U_{h,λ})_{h∈ℝ} is such that for all λ ≠ 0,

max_{h∈ℝ} [ E(U²_{h,λ}) − E(U²_{h,0}) ] > max_{h∈ℝ} [ E(U²_{h,0}) − E(U²_{h,λ}) ].   (3.1)

Points 1 and 2 in Proposition 3.1 are similar to those in Theorem 2.2, with the additional point that under H, the asymptotic quadratic risk of β̂_{MSE,λ} decreases and gets closer to that of β̂_E as λ increases. However, the third point shows that from a minimax-regret perspective, such estimators are dominated by our initial estimator β̂_MSE, which corresponds to λ = 0. The reason is that the decrease of λ ↦ E(U²_{0,λ}) does not compensate for the quick increase of λ ↦ max_{h∈ℝ} E(U²_{h,λ}) − E(U²_{h,0}). This echoes the discussion in Leeb & Pötscher (2005): as we move closer to an estimator based on a consistent model selection, the maximal asymptotic risk increases without bound.

3.2 Estimators with different rates of convergence

In this subsection, we assume that β̂_C is r_n-consistent for β ∈ ℝ, for some sequence (r_n)_{n∈ℕ} such that r_n → ∞ and r_n/n^{1/2} → 0. The estimator β̂_E is still √n-consistent and asymptotically normal for β_E, which may be equal to β under an assumption H on the data generating process.
We also assume we have estimators V̂(β̂_C), V̂(β̂_E), and ĉov(β̂_C, β̂_E) of the variances of β̂_C and β̂_E and of their covariance, such that r²_n V̂(β̂_C), nV̂(β̂_E), and n^{1/2} r_n ĉov(β̂_C, β̂_E) are consistent for the asymptotic variances and covariance. We summarize these conditions in Assumption 2 below.

Assumption 2 (Set-up)

1. There exists a sequence (r_n)_{n∈ℕ}, with r_n → ∞ and r_n/n^{1/2} → 0, such that

( r_n(β̂_C − β), √n(β̂_E − β_E) )′ →d N( (µ, 0)′, Σ ), with Σ = ( σ²_C  ρ ; ρ  σ²_E ).

2. r²_n V̂(β̂_C) →p σ²_C, nV̂(β̂_E) →p σ²_E, and n^{1/2} r_n ĉov(β̂_C, β̂_E) →p ρ.

Theorem 3.2 below gives the asymptotic distribution of β̂_MSE defined in Equation (2.1) above, under Assumption 2 rather than Assumption 1.
Theorem 3.2 (Asymptotic distribution of β̂_MSE under Assumption 2)

Suppose Assumption 2 holds.

1. If β ≠ β_E and β and β_E do not depend on n, r_n(β̂_MSE − β) →d N(µ, σ²_C).

2. If β = β_E, r_n(β̂_MSE − β) →d U₀, where U₀ is such that E(U₀²) < E(V²), with V ∼ N(µ, σ²_C) the asymptotic distribution of r_n(β̂_C − β).

3. If β_E = β + h/r_n for some h ∈ ℝ and |µ/σ_C| ≤ 0.4, then r_n(β̂_MSE − β) →d U_h, where (U_h)_{h∈ℝ} is such that

max_{h∈ℝ} [ E(U_h²) − (µ² + σ²_C) ] < max_{h∈ℝ} [ µ² + σ²_C − E(U_h²) ].

Point 1 of Theorem 3.2 shows that if β ≠ β_E, β̂_MSE and β̂_C have the same asymptotic distribution. On the other hand, Point 2 shows that if β = β_E, their asymptotic distributions differ, and the MSE of the asymptotic distribution of β̂_MSE is smaller than that of β̂_C. In Point 3 of the theorem, we compare the MSE of their asymptotic distributions under alternatives local to β = β_E, namely β_E = β + h/r_n. Under this type of asymptotics, the maximum asymptotic-MSE-gain one may incur by using β̂_MSE rather than β̂_C is larger than the maximum asymptotic-MSE-loss, provided the first-order bias is no greater in absolute value than 0.4σ_C. Again, β̂_MSE dominates β̂_C from a minimax-regret perspective, provided the first-order bias of β̂_C is not too large.

4 Proofs

We will use the following lemma.
Lemma 4.1
Suppose that f is an odd function such that f(x) > 0 for all x > 0, and that Z has an even density g that is strictly decreasing on ℝ⁺. Then sgn( E[f(x + Z)] ) = sgn(x) for all x ∈ ℝ.

Proof: The result holds if x = 0, because E[f(Z)] = E[f(−Z)] = −E[f(Z)], so E[f(Z)] = 0. For any x < 0, E[f(Z + x)] = E[f(−Z + x)] = E[−f(Z − x)], so it suffices to show that E[f(Z + x)] > 0 for x > 0. We have

E[f(Z + x)] = ∫_ℝ f(z + x)g(z)dz = ∫_ℝ f(z)g(z − x)dz
= ∫_{ℝ⁺} f(z)g(z − x)dz + ∫_{ℝ⁻} f(z)g(z − x)dz
= ∫_{ℝ⁺} f(z)g(z − x)dz − ∫_{ℝ⁺} f(z)g(z + x)dz
= ∫_{ℝ⁺} f(z)[ g(z − x) − g(z + x) ]dz.

Now, for all z ∈ (0, x], |z − x| = x − z < x + z, so g(z − x) > g(z + x). If z > x, |z − x| = z − x < z + x, so again g(z − x) > g(z + x). The result follows since f(z) > 0 on (0, ∞). □

4.1 Theorem 2.2

Proof of Point 1

If β ≠ β_E, it follows from (2.3), Assumption 1, and the continuous mapping theorem that

n p̂ →p (σ²_C − σ²_E)/(β_E − β)².

Moreover,

β̂_MSE − β̂_C = p̂(β̂_E − β̂_C).   (4.1)

Hence, by the continuous mapping theorem again,

n(β̂_MSE − β̂_C) →p (σ²_C − σ²_E)/(β_E − β).

Therefore, √n(β̂_MSE − β̂_C) = o_P(1), and the result follows from Assumption 1 and Slutsky's lemma.

Proof of Point 2
Let (V, W) be a normal vector with means (0, 0), variances (σ²_C, σ²_C − σ²_E), and covariance −(σ²_C − σ²_E). Let

U₀ = V + W(σ²_C − σ²_E)/(W² + σ²_C − σ²_E).   (4.2)

If β = β_E,

√n(β̂_MSE − β) = √n(β̂_C − β) + [ nV̂(β̂_C) − nĉov(β̂_C, β̂_E) ] √n( β̂_E − β_E − (β̂_C − β) ) / { [ √n( β̂_E − β_E − (β̂_C − β) ) ]² + nV̂(β̂_E − β̂_C) } →d U₀.

The equality follows from Equations (4.1) and (2.3) and from β_E = β. The convergence in distribution follows from Assumption 1, the Slutsky lemma, and the continuous mapping theorem.

E(U₀) = 0, as φ: w ↦ w(σ²_C − σ²_E)/(w² + σ²_C − σ²_E) is such that φ(−w) = −φ(w) and the pdf of W is symmetric around 0.

Let Ψ = V + W. The vector (Ψ, W) is normally distributed with E(Ψ) = 0 and cov(Ψ, W) = 0. Hence, Ψ ⊥⊥ W, and since

U₀ = Ψ + W( (σ²_C − σ²_E)/(W² + σ²_C − σ²_E) − 1 ),

V(U₀) > V(Ψ) = σ²_E. Moreover,

V(U₀) − σ²_C = E(U₀²) − E(V²)
= E( (U₀ − V)(U₀ + V) )
= E( φ(W)( 2V + φ(W) ) )
= E( φ(W)( −2W + 2Ψ + φ(W) ) )
= E( W²(σ²_C − σ²_E)/(W² + σ²_C − σ²_E) ( (σ²_C − σ²_E)/(W² + σ²_C − σ²_E) − 2 ) )
< 0.

The first equality follows from E(U₀) = E(V) = 0, the third from Equation (4.2), the fourth from V = −W + Ψ, and the fifth from Ψ ⊥⊥ W and E(Ψ) = 0. The inequality holds since 0 < (σ²_C − σ²_E)/(W² + σ²_C − σ²_E) < 2 with probability 1, as σ²_C > σ²_E.

Proof of Point 3
Let (V, W_h) be a normal vector with means (0, h), variances (σ²_C, σ²_C − σ²_E), and covariance −(σ²_C − σ²_E). Let U_h = V + W_h(σ²_C − σ²_E)/(W_h² + σ²_C − σ²_E). We have

√n(β̂_MSE − β) = √n(β̂_C − β) + [ nV̂(β̂_C) − nĉov(β̂_C, β̂_E) ] ( √n( β̂_E − β_E − (β̂_C − β) ) + h ) / { [ √n( β̂_E − β_E − (β̂_C − β) ) + h ]² + nV̂(β̂_E − β̂_C) } →d U_h.

The equality follows from Equations (4.1) and (2.3) and from β_E = β + h/√n. The convergence in distribution follows from Assumption 1, the Slutsky lemma, and the continuous mapping theorem.

Let Ψ_h = V + W_h. The vector (Ψ_h, W_h) is normally distributed and cov(Ψ_h, W_h) = 0, so Ψ_h ⊥⊥ W_h. Let g = h/(σ²_C − σ²_E)^{1/2} and N_g = W_h/(σ²_C − σ²_E)^{1/2}, so that N_g ∼ N(g, 1). Writing φ(W_h) = (σ²_C − σ²_E)^{1/2} N_g/(N_g² + 1), we have

E(U_h²) − σ²_C = E( (U_h − V)(U_h + V) )
= E( φ(W_h)( 2V + φ(W_h) ) )
= −2cov( φ(W_h), W_h ) + E( φ(W_h)² )
= (σ²_C − σ²_E) { E[ ( N_g/(N_g² + 1) )² ] − 2cov( N_g/(N_g² + 1), N_g ) },   (4.3)

where the third equality uses V = Ψ_h − W_h, Ψ_h ⊥⊥ W_h, and E(Ψ_h) = E(W_h).
Let

Δ(g) = E[ ( N_g/(N_g² + 1) )² ] − 2cov( N_g/(N_g² + 1), N_g ).   (4.4)

Since N_{−g} ∼ −N_g, we have

Δ(−g) = E[ ( −N_g/(N_g² + 1) )² ] − 2cov( −N_g/(N_g² + 1), −N_g ) = Δ(g).

Moreover, we obtain through numerical simulations that min_{g∈ℝ⁺} Δ(g) < 0 < max_{g∈ℝ⁺} Δ(g), and that the minimum is larger in absolute value than the maximum. The inequality

max_{h∈ℝ} [ E(U_h²) − σ²_C ] < max_{h∈ℝ} [ σ²_C − E(U_h²) ]

follows from these last results, Equation (4.3), and σ²_C > σ²_E. □

4.2 Proposition 3.1

Proof of Point 1
First, we have

p̂_λ = ( V̂(β̂_C) − ĉov(β̂_C, β̂_E) ) / ( max[ 0, (β̂_E − β̂_C)² − λV̂(β̂_E − β̂_C) ] + V̂(β̂_E − β̂_C) ).

If β ≠ β_E, Assumption 1 and the continuous mapping theorem yield

n p̂_λ →p (σ²_C − σ²_E)/(β_E − β)².

The result follows as in the previous proof.
Proof of Point 2
Instead of considering U₀ defined by (4.2), we now consider U_{0,λ} defined by

U_{0,λ} = V + W(σ²_C − σ²_E)/( max[ 0, W² − λ(σ²_C − σ²_E) ] + σ²_C − σ²_E ).

Then, the proofs of the convergence in distribution, of E(U_{0,λ}) = 0, and of E(U²_{0,λ}) < σ²_C are identical to those above. We now show that λ ↦ V(U_{0,λ}) is decreasing. Let Z = W/(σ²_C − σ²_E)^{1/2}. Using the same definition of Ψ as above, we have

U_{0,λ} = Ψ + (σ²_C − σ²_E)^{1/2} Z ( 1/( max(0, Z² − λ) + 1 ) − 1 ),

with Ψ ⊥⊥ Z. Then, for any two λ > λ′,

V(U_{0,λ}) − V(U_{0,λ′}) = E[ (U_{0,λ} + U_{0,λ′})(U_{0,λ} − U_{0,λ′}) ]
= (σ²_C − σ²_E) E[ Z² ( 1/( max(0, Z² − λ) + 1 ) + 1/( max(0, Z² − λ′) + 1 ) − 2 ) ( 1/( max(0, Z² − λ) + 1 ) − 1/( max(0, Z² − λ′) + 1 ) ) ].

Moreover,

1/( max(0, Z² − λ) + 1 ) + 1/( max(0, Z² − λ′) + 1 ) − 2 ≤ 0,

and the inequality is strict for |Z| > √λ′. Also,

1/( max(0, Z² − λ) + 1 ) − 1/( max(0, Z² − λ′) + 1 ) ≥ 0,

and again, the inequality is strict for |Z| > √λ′. Hence, V(U_{0,λ}) < V(U_{0,λ′}). Finally, as λ → +∞, U_{0,λ} converges almost surely to Ψ. Moreover, |U_{0,λ}| ≤ |Ψ| + (σ²_C − σ²_E)^{1/2}|Z|, so by the dominated convergence theorem, V(U_{0,λ}) → V(Ψ) = σ²_E.
With the same notation as above, let

U_{h,λ} = Ψ_h + W_h ( (σ²_C − σ²_E)/( max(0, W_h² − λ(σ²_C − σ²_E)) + σ²_C − σ²_E ) − 1 ).

Using a reasoning similar to that in the proof of Point 3 of Theorem 2.2, √n(β̂_{MSE,λ} − β) →d U_{h,λ}.
By the same reasoning as in the proof of Point 3 of Theorem 2.2,

E(U²_{h,λ}) − E(U²_{h,0}) = (σ²_C − σ²_E) { E[ ( N_g/( max(0, N_g² − λ) + 1 ) )² − ( N_g/(N_g² + 1) )² ] − 2cov( N_g ( 1/( max(0, N_g² − λ) + 1 ) − 1/(N_g² + 1) ), N_g ) }.

Then, simulations show that (3.1) holds. □
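The numerical step invoked in these proofs can be reproduced with a simple Monte Carlo approximation of Δ(g) in (4.4). The sketch below is ours, using plain i.i.d. normal draws rather than the Halton draws mentioned later, so exact figures will differ slightly from the paper's:

```python
import numpy as np

def delta(g, n_draws=1_000_000, seed=0):
    """Monte Carlo approximation of Delta(g) in Equation (4.4):
    Delta(g) = E[(N/(N^2+1))^2] - 2*cov(N/(N^2+1), N), with N ~ N(g, 1)."""
    rng = np.random.default_rng(seed)
    n = g + rng.standard_normal(n_draws)
    r = n / (n ** 2 + 1)
    return np.mean(r ** 2) - 2 * np.cov(r, n)[0, 1]
```

Evaluating delta on a grid shows the pattern used in the proof of Point 3 of Theorem 2.2: Δ is even in g, negative at g = 0, positive for large g, and its minimum exceeds its maximum in absolute value.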
4.3 Theorem 3.2

Proof of Point 1
We have

β̂_MSE − β̂_C = p̂(β̂_E − β̂_C).   (4.5)

If β ≠ β_E, it follows from Assumption 2 and the continuous mapping theorem that

r²_n(β̂_MSE − β̂_C) →p σ²_C/(β_E − β).

Therefore, r_n(β̂_MSE − β̂_C) = o_P(1), and the result follows from Assumption 2 and Slutsky's lemma.
Let V be a normal variable with mean µ and variance σ²_C. Let

U₀ = V ( 1 − σ²_C/(V² + σ²_C) ).   (4.6)

If β = β_E,

r_n(β̂_MSE − β) = r_n(β̂_C − β) + [ r²_n V̂(β̂_C) − r²_n ĉov(β̂_C, β̂_E) ] r_n( β̂_E − β_E − (β̂_C − β) ) / { [ r_n( β̂_E − β_E − (β̂_C − β) ) ]² + r²_n V̂(β̂_E − β̂_C) } →d U₀.

The equality follows from Equations (4.5) and (2.3) and from β_E = β. The convergence in distribution follows from Assumption 2, the Slutsky lemma, and the continuous mapping theorem, noting that r²_n ĉov(β̂_C, β̂_E) →p 0 and r²_n V̂(β̂_E) →p 0, since r_n/n^{1/2} → 0. Further,

E(U₀²) − E(V²) = E( (U₀ − V)(U₀ + V) )
= E( −σ²_C V/(V² + σ²_C) ( 2V − σ²_C V/(V² + σ²_C) ) )
= E( −σ²_C V²/(V² + σ²_C) ( 2 − σ²_C/(V² + σ²_C) ) )
< 0,

since 2 > σ²_C/(V² + σ²_C) with probability 1.
Let

U_h = V + (h − V)σ²_C/( (h − V)² + σ²_C ).   (4.7)

If β_E = β + h/r_n,

r_n(β̂_MSE − β) = r_n(β̂_C − β) + [ r²_n V̂(β̂_C) − r²_n ĉov(β̂_C, β̂_E) ] ( r_n( β̂_E − β_E − (β̂_C − β) ) + h ) / { [ r_n( β̂_E − β_E − (β̂_C − β) ) + h ]² + r²_n V̂(β̂_E − β̂_C) } →d U_h.

The equality follows from Equations (4.5) and (2.3) and from β_E = β + h/r_n. The convergence in distribution follows from Assumption 2, the Slutsky lemma, and the continuous mapping theorem.

Let g = (h − µ)/σ_C, N_g = (h − V)/σ_C, and µ_sd = µ/σ_C. Then N_g ∼ N(g, 1) and we have:

E(U_h²) − E(V²) = E( (U_h − V)(U_h + V) )
= E( (h − V)σ²_C/( (h − V)² + σ²_C ) ( 2V + (h − V)σ²_C/( (h − V)² + σ²_C ) ) )
= σ²_C E( N_g/(N_g² + 1) ( 2(g + µ_sd − N_g) + N_g/(N_g² + 1) ) )
= σ²_C ( Δ(g) + 2µ_sd E( N_g/(N_g² + 1) ) ),   (4.8)

with Δ(g) defined as in Equation (4.4). Let Λ(g, µ_sd) = Δ(g) + 2µ_sd E( N_g/(N_g² + 1) ). The function x ↦ x/(x² + 1) and the density of a N(0, 1) satisfy the conditions of Lemma 4.1. Thus, E( N_g/(N_g² + 1) ) ≥ 0 if g ≥ 0, and E( N_g/(N_g² + 1) ) < 0 otherwise. Then, if µ_sd and g are not of the same sign, Λ(g, µ_sd) ≤ Δ(g), so it follows from Point 3 of Theorem 2.2 that for every µ_sd, the minimum of Λ(g, µ_sd) with respect to g is greater in absolute value than its maximum. We can then restrict attention to cases where µ_sd and g are of the same sign. As Λ(−g, −µ_sd) = Λ(g, µ_sd), we can further restrict attention to cases where g ≥ 0 and µ_sd ≥ 0. We use Monte Carlo simulations with Halton draws to approximate Λ(g, µ_sd) for g on a grid {0, 0.1, 0.2, …} and for µ_sd ∈ [0, 0.4].
For every such µ_sd, the absolute value of the minimum of Λ(g, µ_sd) with respect to g is larger than the absolute value of its maximum. This proves the result, together with Equation (4.8). □

References

Angrist, J. D. & Pischke, J.-S. (2008),
Mostly Harmless Econometrics: An Empiricist's Companion, Princeton University Press.

Breusch, T., Ward, M. B., Nguyen, H. T. M. & Kompas, T. (2011), 'On the fixed-effects vector decomposition',
Political Analysis 19(2), 123–134.

Cheng, X., Liao, Z. & Shi, R. (2019), 'On uniform asymptotic risk of averaging GMM estimators', Quantitative Economics 10(3), 931–979.

Green, E. J. & Strawderman, W. E. (1991), 'A James-Stein type estimator for combining unbiased and possibly biased estimators', Journal of the American Statistical Association 86(416), 1001–1006.

Hahn, J., Todd, P. & Van der Klaauw, W. (2001), 'Identification and estimation of treatment effects with a regression-discontinuity design', Econometrica 69(1), 201–209.

Hansen, B. E. (2017), 'Stein-like 2SLS estimator', Econometric Reviews 36(6-9), 840–852.

Hirano, K., Imbens, G. W. & Ridder, G. (2003), 'Efficient estimation of average treatment effects using the estimated propensity score', Econometrica 71(4), 1161–1189.

James, W. & Stein, C. (1961), Estimation with quadratic loss, in 'Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics', The Regents of the University of California.

Judge, G. G. & Mittelhammer, R. C. (2004), 'A semiparametric basis for combining estimation problems under quadratic loss', Journal of the American Statistical Association 99(466), 479–487.

Leeb, H. & Pötscher, B. (2005), 'Model selection and inference: Facts and fiction', Econometric Theory 21(1), 21–59.

Lehmann, E. L. & Casella, G. (1998), Theory of Point Estimation, Springer Texts in Statistics.

Mittelhammer, R. C. & Judge, G. G. (2005), 'Combining estimators to improve structural model estimation and inference under quadratic loss',
Journal of Econometrics 128(1), 1–29.

Savage, L. J. (1951), 'The theory of statistical decision',
Journal of the American Statistical Association 46(253), 55–67.

Stein, C. (1956), Inadmissibility of the usual estimator for the mean of a multivariate normal distribution, in 'Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1', University of California Press.