Frequentist Shrinkage Under Inequality Constraints
Edvard Bakhitov † University of Pennsylvania
January 30, 2020
Abstract
This paper shows how to shrink extremum estimators towards inequality constraints motivated by economic theory. We propose an Inequality Constrained Shrinkage Estimator (ICSE) which takes the form of a weighted average between the unconstrained and inequality constrained estimators with a data-dependent weight. The weight drives both the direction and degree of shrinkage. We use a local asymptotic framework to derive the asymptotic distribution and risk of the ICSE. We provide conditions under which the asymptotic risk of the ICSE is strictly less than that of the unrestricted extremum estimator. The degree of shrinkage cannot be consistently estimated under the local asymptotic framework. To address this issue, we propose a feasible plug-in estimator and investigate its finite sample behavior. We also apply our framework to gasoline demand estimation under the Slutsky restriction.
Keywords:
James-Stein, extremum estimators, nonlinear models, economic restrictions.
1 Introduction

Inequality constraints are common in applied economic research. Typical examples are monotonicity constraints on utility or production functions, restrictions on estimated covariance matrices such as positive definiteness, restrictions on the Slutsky matrix, etc. If the imposed constraints hold, we will get more efficient estimates. If not, the estimates will be biased.

This paper proposes an alternative way to use economic theory in estimation. We introduce a generalized shrinkage estimator that shrinks an estimator that ignores theoretical restrictions towards inequality constraints motivated by theory. The inequality constrained shrinkage estimator (ICSE) takes a simple weighted average form between the unconstrained and inequality constrained estimators, with the data-driven weight inversely proportional to the loss function evaluated at the two estimates. We show that the degree of shrinkage depends on which constraints bind; thus, both the direction and degree of shrinkage are fully data driven.

† I am grateful to Xu Cheng and Frank DiTraglia for their support and encouragement. Special thanks to Max Kasy for providing the dataset. This paper benefitted from feedback from Karun Adusumilli, Stéphane Bonhomme, Philippe Goulet Coulombe, Phillip Heiler, Toru Kitagawa, and Frank Schorfheide, as well as seminar participants at the UPenn econometrics lunch, and participants at the ESWM 2019 conference in Rotterdam.

We show that under certain conditions the ICSE outperforms the unrestricted estimator, regardless of what the true data generating process is and whether the theory is correct or not. We demonstrate that the ICSE has a smaller asymptotic risk than the unrestricted estimator uniformly over the parameter space local to the restricted (shrinkage) parameter space. The theory we present applies to a large set of extremum estimators, such as the Generalized Method of Moments (GMM) estimator, the Maximum Likelihood estimator (MLE), the Minimum Distance (MD) estimator, etc.

We use the local asymptotic framework to analyze the performance of the ICSE. To be precise, we assume that the parameter space is located in an $n^{-1/2}$-neighborhood of the restricted space, reflecting the belief that the imposed theoretical restrictions are only "approximately correct". In contrast to the generalized James-Stein estimator, the asymptotic distribution of the ICSE is not normal. Since the ICSE is a weighted average of the unconstrained and inequality constrained estimators, the asymptotic distribution of the shrinkage estimator inherits the non-normality of the inequality constrained estimator.

Under the local asymptotic framework, it is impossible to consistently estimate the optimal degree of shrinkage, as it depends on the local $O(n^{-1/2})$ parameters (see e.g. Hjort and Claeskens (2003)). To address this issue, we propose a feasible plug-in estimator based on the asymptotically unbiased estimator of the local parameters. However, this makes the estimated shrinkage parameter asymptotically random, which affects the asymptotic distribution of the averaging weight. As a result, the feasible estimator is not consistent and the dominance result may not hold.

In our Monte Carlo study we investigate the finite sample performance of the feasible ICSE along with the generalized James-Stein estimator of Hansen (2016), the Empirical Bayes (EB) estimator, the unrestricted estimator, and the restricted estimator.
Simulations show that the feasible ICSE dominates the unrestricted estimator in terms of mean squared error. Moreover, it also dominates the generalized James-Stein estimator in cases when a subset of the constraints bind. We also show that the ICSE performs better than the EB estimator when the constraints are violated or close to binding, while the EB estimator dominates the ICSE when the constraints are satisfied as strict inequalities.

In our application we consider gasoline demand estimation under the Slutsky restriction. We estimate the demand curves across three income groups corresponding to the first, second, and third quartiles, respectively. We show that the shrinkage effect is more prominent for the low income group, since consumers with low income are less likely to have upward sloping demand curves. In a similar application, Fessler and Kasy (2019) use the Empirical Bayes framework to show that the degree of shrinkage is similar across different groups.

The literature on shrinkage estimation begins with Stein (1956), who observed that the unconstrained estimator in a Gaussian location model is inadmissible when the dimension of the parameter vector is greater than two. This led to the seminal paper by James and Stein (1961), where they proposed a shrinkage estimator that dominates the MLE. Baranchik (1964) showed that the James-Stein estimator is inadmissible and dominated by its positive part version. However, even the positive part James-Stein estimator is inadmissible: Shao and Strawderman (1994) propose a piecewise linear estimator that has even smaller risk. Theory for the risk analysis of shrinkage estimators was provided by Stein (1981). Hansen (2015) compares the performance of different shrinkage estimators and provides corresponding efficiency bounds.

All of the aforementioned estimators shrink the parameters towards zero. In contrast, Oman (1982a,b) introduces estimators which shrink towards linear subspaces. Del Negro and Schorfheide (2004) show how to shrink to non-linear subspaces in the Bayesian framework, using a DSGE model-based prior to estimate VAR impulse response functions. In their recent paper, Fessler and Kasy (2019) provide an Empirical Bayes framework which allows one to shrink to various theoretical restrictions in the form of both equalities and inequalities. Our paper complements the aforementioned literature by extending the Stein-type shrinkage argument to non-linear inequality constraints.

James and Stein (1961) first showed that the shrinkage estimator dominates the unrestricted MLE in exact normal sampling. Hansen (2016) provides a generalized James-Stein type estimator for parametric models and shows that it dominates the MLE in a pointwise locally asymptotic sense. (For a given real vector $c$, the pointwise local asymptotic analysis considers a sequence of localized parameters $\theta_n = cn^{-1/2}$ and derives the asymptotic (truncated) risk of the averaging estimator under $\theta_n$ for a given $c$. Such an analysis produces a pointwise risk function for the shrinkage estimator.) Hansen (2017) shows that a shrinkage estimator that shrinks the OLS estimator towards the 2SLS estimator has a smaller asymptotic risk than the ordinary OLS estimator. DiTraglia (2016) studies the averaging GMM estimator with the averaging weight based on the focused moment selection criterion; the results in that paper suggest that the averaging estimator does not uniformly dominate the conservative estimator. Unlike the aforementioned papers using the pointwise local asymptotic framework, Cheng et al. (2019) establish the uniform dominance result of the GMM averaging estimator over the conservative estimator.

This paper is also closely related to the frequentist model averaging literature.
Hansen (2007) introduces a model averaging estimator for linear nested models and shows that it is asymptotically optimal. He proposes to minimize a Mallows criterion to select the model weights, which is asymptotically equivalent to minimizing the squared error. Wan et al. (2010) show that the latter result holds not only for discrete but also for continuous model weights and under a non-nested set-up. Hansen and Racine (2012) show that the optimal weights can be obtained by minimizing the cross-validation criterion, which allows for a more efficient use of data; moreover, their approach easily accommodates heteroskedasticity. Liu (2015) points out that the asymptotic distribution of data dependent weights is non-standard, which complicates inference. He augments the results from Hjort and Claeskens (2003) and Claeskens and Hjort (2008) and proposes a procedure that delivers asymptotically correct coverage probabilities for model averaging estimators. Zhu et al. (2017) use a $J$-fold cross-validation criterion to construct optimal averaging weights for model averaging estimators under inequality constraints.

There is a large literature studying estimation under inequality constraints. Andrews (1999) derives the asymptotic distribution of extremum estimators when the parameter of interest is on a boundary of the parameter space. His approach solves an asymptotically equivalent problem by minimizing a stochastic quadratic objective function over a convex cone that approximates the parameter space. The approach follows Chernoff (1954), Feder (1968), Pollard (1985), and Wolak (1989). Andrews extends the results in these papers and allows for cases when the estimator objective function is undefined in the neighborhood of the true parameter.

There has been a growing interest in shrinkage estimators in the modern statistics literature. The main idea there is that shrinkage can be introduced through a penalty imposed on the estimator objective function. The most famous example is the LASSO (Tibshirani (1996)), which simultaneously shrinks and selects variables. Another seminal example is ridge regression, which shrinks the coefficients towards zero but does not perform selection. More complicated penalties lead to more interesting shrinkage spaces: e.g., the fused LASSO (Friedman et al. (2007)) can be used to shrink time-varying parameters towards a random walk, i.e. it penalizes absolute differences $|\theta_t - \theta_{t-1}|$. Another example is nearly isotonic regression (Tibshirani et al. (2011)), which shrinks a sequence of points towards a monotone sequence, i.e. it penalizes only positive part deviations $(\theta_i - \theta_{i+1})_+$.

The remainder of the paper is organized as follows. Section 2 presents the general framework and describes the choice of shrinkage direction and the local asymptotic framework. Section 3 introduces the inequality constrained shrinkage estimator. Section 4 derives the asymptotic distribution of the estimator. Section 5 presents the risk dominance result. Section 6 provides a feasible estimator for the data-dependent weight. Section 7 demonstrates the finite sample performance of the ICSE in a series of simulations. In Section 8 we apply the method to estimate gasoline demand under the Slutsky restriction.
Section 9 concludes. All mathematical proofs and additional details are left to the Appendix.

We use the following notation throughout the paper: $\mathcal{I}_n$ denotes an $n \times n$ identity matrix; $1\{x \geq a\}$ is the indicator function that equals one if $x \geq a$ and zero otherwise; $(x)_+ = \max\{0, x\}$ denotes the "positive part" function. Finally, if $x$ is a vector, we use $x > a$ to denote each vector entry being strictly greater than $a$; the same holds for $x < a$.

2 General Framework

Suppose we observe a random array $X_n = \{X_{in}\}_{i=1}^n$ of iid realizations. Let $Q_n(\theta)$ denote an extremum estimator objective function that depends on $X_n$, for example, a GMM criterion or a log likelihood function. The objective function is indexed by a parameter $\theta \in \Theta \subset \mathbb{R}^m$.

The goal is to estimate the parameter of interest $\theta$ in a setting augmented by the belief that the true value of $\theta$ may be close (in a sense to be made clear later) to a restricted parameter space $\Theta_0 \subset \Theta$ defined by a parametric restriction

$$\Theta_0 = \{\theta \in \Theta : r(\theta) \geq 0\}, \quad (1)$$

where $r(\theta)$ is a differentiable function that maps $\mathbb{R}^m \to \mathbb{R}^p$. Let $R(\theta)$ denote the derivative $\frac{\partial}{\partial \theta'} r(\theta)$.

The pivotal point is that the true parameter value $\theta_0$ may not satisfy the restrictions, i.e. $\theta_0$ does not necessarily lie within $\Theta_0$. The restriction should rather be treated as a reasonable belief or "prior" about the likely value of $\theta_0$: the empirical implications of the imposed theoretical restrictions are only "approximately correct".
Remark 1. In this paper we focus on the parameter $\theta$ itself; the presented theory can be extended to functions of $\theta$ using the delta-method approach. However, one has to be cautious, since the level of shrinkage depends on the dimension of the function's output.

A common example is sign restrictions on all parameters, i.e. $p = m$. In this case the restricted space is $\Theta_0 = \{\theta \in \Theta : \theta \geq 0\}$, where $r(\theta) = \theta$ and $R$ is simply an $m \times m$ identity matrix. The researcher may want to impose sign restrictions only on a subset of parameters. We can easily allow for that by partitioning the parameter space as

$$\theta = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} \begin{matrix} m-p \\ p \end{matrix}.$$

Then the sign restrictions take the form $r(\theta) = \theta_2$, and $R = [0_{p \times (m-p)} \;\; \mathcal{I}_p]$.

In general, $\Theta_0$ may be a non-linear subspace. This can be especially useful for structural estimation when an economic model implies non-linear inequality constraints on structural parameters.

Example 1. In macroeconomics, inequality restrictions often arise in the estimation of DSGE models. Moon and Schorfheide (2009) study an example of interest rate feedback rules, which we briefly describe here. Consider the following interest rate policy rule

$$R_t = \rho_R R_{t-1} + (1 - \rho_R)\psi_1 \pi_t + (1 - \rho_R)\psi_2 x_t + \varepsilon_{R,t}, \quad (2)$$

where $R_t$ is the nominal interest rate in period $t$, $\pi_t$ is the inflation rate, and $x_t$ is a measure of real activity, such as output deviations from trend or output growth. The shock $\varepsilon_{R,t}$ captures unexpected deviations from the systematic component of the policy rule. (For ease of exposition, we skip the details regarding the representation of a typical DSGE model and its solution.) To address potential endogeneity of both inflation and output in equilibrium, the researcher needs instrumental variables; lagged variables of inflation and output are natural candidates. According to a large class of DSGE models, output does not fall in response to an expansionary monetary shock, which leads to the moment restriction

$$\mathbb{E}[-x_t \varepsilon_{R,t}] \geq 0.$$

Let $X_t = (R_{t-1}, \pi_t, x_t)'$ be the vector of regressors, $Z_t = (R_{t-1}, \pi_{t-1}, x_{t-1})'$ be the vector of IVs, and $\theta = (\rho_R, (1-\rho_R)\psi_1, (1-\rho_R)\psi_2)'$ be the parameter vector. Based on (2), one can form the finite sample moment condition $g_T(X_t, Z_t, R_t; \theta) = T^{-1}\sum_{t=1}^T Z_t(R_t - X_t'\theta)$.

Instead of treating the moment restriction as an additional moment condition, one can impose it directly on the estimation problem. The finite sample analog is

$$-T^{-1}\sum_{t=1}^T x_t \varepsilon_{R,t} \geq 0 \iff \rho_R \sum_{t=1}^T x_t R_{t-1} + (1-\rho_R)\psi_1 \sum_{t=1}^T x_t \pi_t + (1-\rho_R)\psi_2 \sum_{t=1}^T x_t^2 - \sum_{t=1}^T x_t R_t \geq 0,$$

which imposes a linear inequality constraint on $\theta$.
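The mapping from the policy rule (2) to an estimable moment condition and a linear inequality constraint can be sketched in a few lines. The sketch below is illustrative only: the simulated series stand in for actual data, the parameter value is hypothetical, and the instrument set follows the lagged-variable suggestion above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 201
# Hypothetical series standing in for the interest rate, inflation, and real activity.
R, pi, x = rng.normal(size=(3, n))

y = R[1:]                                       # R_t
X = np.column_stack([R[:-1], pi[1:], x[1:]])    # X_t = (R_{t-1}, pi_t, x_t)'
Z = np.column_stack([R[:-1], pi[:-1], x[:-1]])  # Z_t = (R_{t-1}, pi_{t-1}, x_{t-1})'

def g_bar(theta):
    # Sample moment condition: T^{-1} sum_t Z_t (R_t - X_t' theta).
    return Z.T @ (y - X @ theta) / len(y)

def slackness(theta):
    # Finite-sample analog of E[-x_t * eps_{R,t}] >= 0; note it is linear in theta.
    return -np.mean(x[1:] * (y - X @ theta))

theta = np.array([0.8, 0.1, 0.05])  # hypothetical (rho_R, (1-rho_R)psi_1, (1-rho_R)psi_2)
print(g_bar(theta), slackness(theta) >= 0)
```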
Example 2. Inequality constraints also arise in many demand models. Consider a consumer who chooses her levels of consumption for different goods $j = 1, \ldots, J$ by maximizing her utility function subject to her budget constraint. One can show that the demand functions

$$D_j = D_j(p, m \,|\, \theta), \quad j = 1, \ldots, J,$$

where $p$ is a price vector, $m$ is income, and $\theta$ are the structural parameters of interest, are not arbitrary. In particular, they must satisfy the budget constraint

$$\sum_{j=1}^J p_j D_j(p, m \,|\, \theta) = m.$$

Furthermore, since they solve a constrained optimization problem, they must satisfy the Slutsky matrix conditions. Let $S$ denote the Slutsky substitution matrix of size $J \times J$, whose generic entry is

$$S_{kj} = \frac{\partial D_j(p, m \,|\, \theta)}{\partial p_k} + \frac{\partial D_j(p, m \,|\, \theta)}{\partial m} D_k(p, m \,|\, \theta).$$

Economic theory tells us that such a matrix must be symmetric and negative semidefinite. These conditions imply inequality restrictions on the vector of structural parameters $\theta$.
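Checking the Slutsky conditions for a candidate parameter value amounts to building $S$ numerically and inspecting its eigenvalues. Below is a minimal sketch under assumed inputs: `demand(p, m)` is any user-supplied demand system (the name is ours), and Cobb-Douglas demands, which satisfy the conditions exactly, serve as a sanity check.

```python
import numpy as np

def slutsky_matrix(demand, p, m, eps=1e-6):
    """Numerical Slutsky matrix S_kj = dD_j/dp_k + dD_j/dm * D_k,
    with derivatives taken by central finite differences."""
    J = len(p)
    D = demand(p, m)
    dDdm = (demand(p, m + eps) - demand(p, m - eps)) / (2 * eps)
    S = np.empty((J, J))
    for k in range(J):
        dp = np.zeros(J)
        dp[k] = eps
        dDdpk = (demand(p + dp, m) - demand(p - dp, m)) / (2 * eps)
        S[k] = dDdpk + dDdm * D[k]
    return S

# Cobb-Douglas demands D_j = a_j * m / p_j satisfy the Slutsky conditions.
a = np.array([0.3, 0.7])
cobb_douglas = lambda p, m: a * m / p
S = slutsky_matrix(cobb_douglas, p=np.array([1.0, 2.0]), m=10.0)
print(np.linalg.eigvalsh((S + S.T) / 2))  # all <= 0 up to numerical error
```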
To measure the accuracy of an estimator $T_n = T_n(X_n)$ of $\theta$ we use a known loss function $\ell(\theta, T_n)$. The corresponding risk is just the expected loss

$$R(\theta, T_n) = \mathbb{E}_\theta\, \ell(\theta, T_n). \quad (3)$$

The most popular loss function in the literature is the weighted quadratic loss

$$\ell(\theta, T_n) = (T_n - \theta)' W (T_n - \theta) \quad (4)$$

for some weight matrix $W > 0$. The risk associated with (4) is simply weighted mean squared error. In general, the choice of a loss function can be motivated by an economic application; see Hansen (2016) for more examples.

The choice of a loss function plays a crucial role in the shrinkage estimator's behavior, since the weights depend on the loss between the unrestricted and the restricted estimators.
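For reference, the scaled weighted quadratic loss used throughout takes one line of code. The helper below (the name is ours, not the paper's) is reused in later sketches.

```python
import numpy as np

def scaled_loss(a, b, W, n):
    # Scaled weighted quadratic loss n * l(a, b) = n * (a - b)' W (a - b), cf. (4).
    d = np.asarray(a) - np.asarray(b)
    return float(n * d @ W @ d)
```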
We specify the following regularity conditions for the loss function.

Assumption 1. The loss function $\ell(\theta, T_n)$ satisfies
(a) $\ell(\theta, T_n) \geq 0$;
(b) $\ell(\theta, \theta) = 0$;
(c) $W(\theta) = \frac{\partial^2}{\partial T_n \partial T_n'}\ell(\theta, T_n)\big|_{T_n = \theta}$ is continuous in a neighborhood of $\theta$.

Assumptions 1(a) and (b) are standard properties of any loss function. The dominance result of James and Stein (1961) hinges on the quadratic loss function; our results, however, hold for a more general family of loss functions. Assumption 1(c) requires the loss function $\ell(\theta, T_n)$ to have a second derivative with respect to the second argument. This allows for smooth loss functions, like quadratic loss, and excludes non-smooth loss functions, such as absolute value loss.

The choice of a weight matrix also plays an important role. If one sets $W = \mathcal{I}_m$, (4) becomes unweighted quadratic loss, which is appropriate for cases where all parameters are roughly identically scaled. When that is not the case, a weight matrix that renders a loss function robust to rotations of the parameter vector $\theta$ is a more plausible choice. We can fulfill the latter task by setting $W = \Omega^{-1}$, where $\Omega^{-1}$ is the inverse of the asymptotic variance of the unrestricted estimator.

2.1 Choice of shrinkage direction

The restriction in (1) defines the direction of shrinkage and is the main building block for the construction of our shrinkage estimator. Inequality constraints impose milder restrictions than equality constraints, which makes them harder to deal with. Equality restrictions provide the researcher with a particular shrinkage direction; with inequality constraints, however, the shrinkage direction depends on the boundary which the true parameter value is close to. This stems from the properties of the inequality constrained estimator; see Section 4 for more details.

The researcher usually believes that restrictions are a reasonable simplification of the unrestricted model specification. It is well known that if the restrictions are correct, the restricted estimator renders more efficient estimates; if not, the restricted estimates will be biased. In the case of equality restrictions, the researcher can easily test them; testing inequality restrictions, however, is an onerous task. Rather than testing inequality constraints, we can use them to construct an Inequality Constrained Shrinkage Estimator, and thereby improve the efficiency of estimates.

2.2 Local asymptotic framework

Our estimation framework is based on the belief that the empirical implications of theoretical restrictions are approximately correct. Put differently, the parameter of interest $\theta_0$ does not necessarily lie within the restricted space $\Theta_0$, but is localized to it. We model that by assuming that the constraints are local to zero, i.e. $r(\theta_0) = cn^{-1/2}$, where $c \in \mathbb{R}^p$. In this framework $c$ is a slackness, or localizing, parameter which measures the discrepancy between $\theta_0$ and $\Theta_0$.
When $c > 0$, the constraints are satisfied and not binding, while if $c < 0$, the constraints are violated. (We are particularly interested in cases when the constraints are locally violated; the analysis, however, does not depend on the sign of the localizing parameter.) This modeling assumption ensures that the normalized asymptotic distribution of the ICSE is identical to its finite sample distribution under exact normality (see e.g. Hansen (2016)).

We do not consider distant alternatives of the form $r(\theta_0) = \kappa_n c$, where $\kappa_n$ is $O(n^{-b})$ with $b < 1/2$, since we are interested in the asymptotic distribution of the normalized estimator. For simplicity, assume we have only one constraint. If $c < 0$, then $n^{1/2} r(\theta_0) = n^{1/2}\kappa_n c \to -\infty$: the constraint is violated so strongly that the restricted estimator is asymptotically biased, and we are better off with the unrestricted estimator. In contrast, if $c > 0$, then $n^{1/2} r(\theta_0) = n^{1/2}\kappa_n c \to \infty$: the constraint is satisfied as a strict inequality and asymptotically never binds, so the restricted estimator coincides with the unrestricted one. In either case there is no meaningful role for shrinkage.

3 The Inequality Constrained Shrinkage Estimator

In order to define the shrinkage estimator, we first need to introduce the unrestricted and restricted estimators. The unrestricted estimator $\hat{\theta}_n$ of $\theta$ maximizes the objective function over $\theta \in \Theta$:

$$Q_n(\hat{\theta}_n) = \sup_{\theta \in \Theta} Q_n(\theta).$$

The restricted estimator $\tilde{\theta}_n$ is defined analogously:

$$Q_n(\tilde{\theta}_n) = \sup_{\theta \in \Theta_0} Q_n(\theta).$$

We assume that the maximum is unique, so that $\hat{\theta}_n$ and $\tilde{\theta}_n$ are well-defined. (We can allow for a numerical error by requiring $Q_n(\hat{\theta}_n)$ to be within $o_p(1)$ of the global maximum of $Q_n(\theta)$ rather than the exact global maximum. This is a common assumption in the extremum estimators literature, yet fairly technical.)

The shrinkage estimator is defined as a weighted average of the unrestricted and restricted estimators,

$$\hat{\theta}_n^* = \hat{w}_n \hat{\theta}_n + (1 - \hat{w}_n)\tilde{\theta}_n, \quad (5)$$

where the weight is data driven and takes the form

$$\hat{w}_n = \left(1 - \frac{\hat{\tau}_n}{n\ell(\hat{\theta}_n, \tilde{\theta}_n)}\right)_+, \quad (6)$$

where $\hat{\tau}_n \geq 0$ is the shrinkage parameter and $n\ell(\hat{\theta}_n, \tilde{\theta}_n)$ is the scaled loss between the unrestricted and restricted estimators. Under the quadratic loss, the latter becomes $n(\hat{\theta}_n - \tilde{\theta}_n)' W (\hat{\theta}_n - \tilde{\theta}_n)$.

The shrinkage parameter $\hat{\tau}_n$ is set to minimize the asymptotic risk of the ICSE. Thus, we allow $\hat{\tau}_n$ to be data-dependent and random, but require it to converge in probability to a non-negative constant.

Assumption 2. $\hat{\tau}_n \xrightarrow{p} \tau \geq 0$ as $n \to \infty$.

The degree of shrinkage determines an optimal bias-variance tradeoff and depends on the ratio of the shrinkage parameter $\hat{\tau}_n$ to the loss $n\ell(\hat{\theta}_n, \tilde{\theta}_n)$. When the restricted estimator is very close to the unrestricted one, i.e. the loss is small, and $\hat{\tau}_n > n\ell(\hat{\theta}_n, \tilde{\theta}_n)$, we put all the weight on the restricted estimator: $\hat{w}_n = 0$ and $\hat{\theta}_n^* = \tilde{\theta}_n$. When $\hat{\tau}_n < n\ell(\hat{\theta}_n, \tilde{\theta}_n)$, then $\hat{\theta}_n^*$ is a weighted average of the restricted and unrestricted estimators. The larger the loss compared to the shrinkage parameter, the more weight we put on the unrestricted estimator. In other words, if the regularization bias is small, we are better off trading it for a reduction in variance.
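Given the two estimates, a shrinkage parameter, and a weight matrix, the ICSE in (5)-(6) is immediate to compute. A minimal sketch, reusing `scaled_loss` from Section 2:

```python
def icse(theta_hat, theta_tilde, W, n, tau):
    # Data-driven weight (6) and the shrinkage estimator (5).
    loss = scaled_loss(theta_hat, theta_tilde, W, n)
    w = max(0.0, 1.0 - tau / loss) if loss > 0 else 0.0  # loss ~ 0: all weight on theta_tilde
    return w * theta_hat + (1.0 - w) * theta_tilde, w
```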
4 Asymptotic Distribution

It is a well-known fact that the asymptotic distribution of the unrestricted extremum estimator is normal (see e.g. Newey and McFadden (1994)); the asymptotic distribution of the inequality constrained estimator, however, takes a more complicated form. Obtaining the restricted estimator requires solving an inequality constrained optimization problem, the solution to which depends on which constraints bind. As a result, the asymptotic distribution takes the form of a sum of truncated normal random variables. We introduce the following regularity conditions.

Assumption 3.
(a) For some function $Q_0(\theta): \Theta \to \mathbb{R}$, $\sup_{\theta \in \Theta} |Q_n(\theta) - Q_0(\theta)| \xrightarrow{p} 0$;
(b) for every $\varepsilon > 0$, $\sup_{\theta \in \Theta \setminus N(\theta_0, \varepsilon)} Q_0(\theta) < Q_0(\theta_0)$, where $N(\theta_0, \varepsilon)$ is an $\varepsilon$-neighborhood of $\theta_0$.

Assumption 3(a) ensures uniform convergence of the sample criterion function to the true criterion function. Assumption 3(b) requires the true criterion function to be uniquely maximized at $\theta_0$. These conditions guarantee that both the unrestricted and restricted estimators are consistent, i.e. $\hat{\theta}_n - \theta_0$ and $\tilde{\theta}_n - \theta_0$ are $o_p(1)$. Note that consistency does not depend on whether the estimator is restricted or not; the only thing that changes is the parameter space over which an estimator is defined (see e.g. Theorem 9.1 in Newey and McFadden (1994)).

Assumption 4.
(a) $\Theta$ is a compact subset of $\mathbb{R}^m$;
(b) $\theta_0$ lies in the interior of $\Theta$;
(c) $Q_n(\theta)$ is twice continuously differentiable in a neighborhood $N(\theta_0, \varepsilon)$ of $\theta_0$;
(d) $n^{1/2}\frac{\partial}{\partial\theta}Q_n(\theta_0) \xrightarrow{d} G = \mathcal{N}(0, \mathcal{V})$ for some nonrandom positive definite matrix $\mathcal{V}$;
(e) for $\theta \in N(\theta_0, \varepsilon)$ there exists $\mathcal{J}(\theta)$ that is continuous and non-singular at $\theta_0$ such that $\sup_{\theta \in N(\theta_0, \varepsilon)} \left\| \frac{\partial^2}{\partial\theta\partial\theta'}Q_n(\theta) + \mathcal{J}(\theta)\right\| \xrightarrow{p} 0$.
Assumption 4 is a standard set of conditions ensuring asymptotic normality of extremum estimators (see e.g. Newey and McFadden (1994)). Note that Assumption 4(b) does not imply that $\theta_0$ lies in the interior of the restricted set $\Theta_0$, and whether $\theta_0$ belongs to the interior of $\Theta_0$ or not will affect the asymptotic distribution of both the restricted and shrinkage estimators.

Assumption 5.
(a) $R(\theta)$ is continuous in some neighborhood of $\theta_0$;
(b) $R(\theta_0)$ has full row rank.

Assumption 5(a) allows for applying the continuous mapping theorem, and Assumption 5(b) rules out linearly dependent constraints.

The asymptotic behavior of the unrestricted estimator is easily characterized; the distribution of the inequality constrained estimator is more complicated. Recall that in order to obtain the restricted estimator, we have to solve the following problem:

$$\sup_{\theta \in \Theta} Q_n(\theta) \quad \text{s.t.} \quad r(\theta) \geq 0. \quad (7)$$

Dealing with non-linear inequality constrained optimization problems typically leads to very cumbersome calculations of the first order conditions. However, it turns out that we do not have to solve the original optimization problem: to derive the asymptotic distribution of the constrained estimator, it is sufficient to solve a simpler, asymptotically equivalent problem (see e.g. Section 21.3.2 in Gourieroux and Monfort (1995)).

In our asymptotic analysis we follow Andrews (1999) and rely on the quadratic approximation of the objective function around the true parameter value. In particular,

$$Q_n(\theta) = Q_{nq}(\theta) + \xi_n(\theta), \quad (8)$$

where

$$Q_{nq}(\theta) = Q_n(\theta_0) + \frac{\partial}{\partial\theta'}Q_n(\theta_0)(\theta - \theta_0) + \frac{1}{2}(\theta - \theta_0)'\frac{\partial^2}{\partial\theta\partial\theta'}Q_n(\theta_0)(\theta - \theta_0)$$

and $\xi_n(\theta)$ is the approximation error. We need to introduce an additional assumption ensuring that $\xi_n(\theta)$ is of the right order, so that the estimator maximizing $Q_{nq}(\theta)$ has the same asymptotic distribution as the true maximum.

Assumption 6.
For all $\delta_n \to 0$,

$$\sup_{\theta \in \Theta:\, \|\theta - \theta_0\| \leq \delta_n} \frac{n|\xi_n(\theta)|}{(1 + \|n^{1/2}(\theta - \theta_0)\|)^2} = o_p(1).$$

Pollard (1985) refers to Assumption 6 as stochastic differentiability, which is a weaker condition than $\xi_n(\theta)$ converging to 0, due to the presence of the denominator term $(1 + \|n^{1/2}(\theta - \theta_0)\|)^2$.

Let

$$\mathcal{J}_n \equiv -\frac{\partial^2}{\partial\theta\partial\theta'}Q_n(\theta_0) \quad \text{and} \quad Z_n \equiv \mathcal{J}_n^{-1} n^{1/2}\frac{\partial}{\partial\theta}Q_n(\theta_0).$$

Then

$$Q_{nq}(\theta) = Q_n(\theta_0) + n^{-1/2}Z_n'\mathcal{J}_n(\theta - \theta_0) - \frac{1}{2}(\theta - \theta_0)'\mathcal{J}_n(\theta - \theta_0) = Q_n(\theta_0) + \frac{1}{2n}Z_n'\mathcal{J}_n Z_n - \frac{1}{n}q_n(n^{1/2}(\theta - \theta_0)),$$

where $q_n(\lambda) \equiv \frac{1}{2}(\lambda - Z_n)'\mathcal{J}_n(\lambda - Z_n)$ and $\lambda \in \mathbb{R}^m$.

Note that under Assumption 6, it is sufficient to minimize $q_n(n^{1/2}(\theta - \theta_0))$ to obtain a maximum of the quadratic approximation of $Q_n(\theta)$. When the parameter space is unrestricted, the maximizer $\hat{\theta}_n$ equals $\theta_0 + n^{-1/2}Z_n$. Therefore $n^{1/2}(\hat{\theta}_n - \theta_0) = Z_n$, and $Z_n$ determines the asymptotic distribution of the unrestricted estimator. The lemma below establishes the asymptotic limit of the re-parameterized quadratic criterion function.

Lemma 1.
Under Assumptions 3–5,

$$Z_n \xrightarrow{d} Z = \mathcal{J}^{-1}G, \qquad q_n(\lambda) \xrightarrow{d} q(\lambda) \equiv \frac{1}{2}(\lambda - Z)'\mathcal{J}(\lambda - Z) \quad \forall \lambda \in \mathbb{R}^m, \quad (9)$$

where $\mathcal{J} \equiv \mathcal{J}(\theta_0)$.

Since the restricted estimator $\tilde{\theta}_n$ is consistent, its asymptotic distribution depends only on the features of the parameter space around the true parameter value $\theta_0$. We use a mean value expansion to approximate the constraints $r(\theta)$ around $\theta_0$:

$$n^{1/2}r(\theta) = n^{1/2}r(\theta_0) + R(\bar{\theta})n^{1/2}(\theta - \theta_0) = c + R(\bar{\theta})n^{1/2}(\theta - \theta_0) \geq 0,$$

where $\bar{\theta}$ lies on a segment between $\theta$ and $\theta_0$. Since $\bar{\theta}_n$ lies on a segment between $\tilde{\theta}_n$ and $\theta_0$, under Assumptions 3 and 5, $R(\bar{\theta}_n) = R(\theta_0) + o_p(1)$. Let $R \equiv R(\theta_0)$. As shown in Lemma 2 below, the asymptotic distribution of $n^{1/2}(\tilde{\theta}_n - \theta_0)$ is given by the distribution of

$$\tilde{\lambda} = \operatorname*{argmin}_{\lambda \in \Lambda_c} q(\lambda), \quad (10)$$

where $\Lambda_c \equiv \{\lambda \in \mathbb{R}^m : c + R\lambda \geq 0\}$. By approximating the objective function with a quadratic counterpart and linearizing the constraints, we have collapsed a potentially highly non-linear problem (7) into a simple quadratic programming problem. (Essentially this approach is the same as approximating the restricted space by a cone of tangents; see e.g. Chernoff (1954), Feder (1968), and Andrews (1999).)

There are $p$ inequality constraints, which form $2^p$ different possible combinations of binding and non-binding constraints. One can think of these combinations as possible boundaries of the restricted parameter space $\Theta_0$. For each such combination, the asymptotic distribution of the restricted estimator is simply a projection of the asymptotic limit of the unrestricted estimator $Z$ on the corresponding boundary. This is exactly the intuition in Andrews (1999), where he shows that under the standard asymptotics the asymptotic distribution of the extremum estimator, when the true parameter value is on a boundary, depends on the binding constraints.

Let us introduce some notation simplifying the exposition. Let $L(\iota)$ be a linear subspace of the form $L(\iota) \equiv \{l \in \mathbb{R}^m : c_\iota + R_\iota l = 0\}$, where $\iota = 1, \ldots, 2^p$ indexes the possible combinations of binding constraints, and let $\iota = 1$ denote the case when none of the constraints bind. $R_\iota$ consists of the rows of the Jacobian matrix $R$ corresponding to the binding constraints indexed by $\iota$. By analogy, $\tilde{\mu}_\iota$ denotes the sub-vector of $\tilde{\mu}$ with entries corresponding to the binding constraints indexed by $\iota$. Note that we also have to index the slackness parameter, as only the entries $c_\iota$ corresponding to binding constraints will affect the asymptotic distribution.

Lemma 2. Suppose that Assumptions 3–6 hold. Then the asymptotic distribution of the constrained estimator takes the form

$$n^{1/2}(\tilde{\theta}_n - \theta_0) \xrightarrow{d} \tilde{\lambda} \equiv Z - \sum_{\iota=2}^{2^p} P_{L(\iota)}(Z + h_\iota)\, 1\{\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \leq 0\}, \quad (11)$$

where $\tilde{\mu} = -(R\mathcal{J}^{-1}R')^{-1}(RZ + c)$ is the vector of Kuhn-Tucker multipliers for problem (10),

$$P_{L(\iota)} \equiv \mathcal{J}^{-1}R_\iota'\left(R_\iota\mathcal{J}^{-1}R_\iota'\right)^{-1}R_\iota \quad (12)$$

is the projection on the linear subspace $L(\iota)$, $h_\iota \equiv R_\iota^- c_\iota$ is the re-parameterized slackness parameter, and $R_\iota^-$ is the right inverse of $R_\iota$.

Note that the distribution in (11) is non-normal and depends on the re-parameterized slackness parameter $h$: it takes the form of a sum of truncated normal random variables. Notice also that the indicator functions are random: they depend on the asymptotic distribution of the Kuhn-Tucker multipliers.

The slackness parameter enters the distribution both through the asymptotic bias term $P_{L(\iota)}h_\iota$ and through the distribution of the Kuhn-Tucker multipliers $\tilde{\mu}$. From the expression for $\tilde{\mu}$ it follows that if $h_j \to \infty$, then $\tilde{\mu}_j \to -\infty$, implying that the $j$th constraint is not binding. If, on the contrary, $h_j \to -\infty$, then $\tilde{\mu}_j \to \infty$, resulting in the $j$th constraint being binding.

The summation starts from $\iota = 2$ since we do not have to project the unrestricted estimator on any subspace when none of the constraints bind. Despite the seemingly complex expression, the basic intuition behind this formula is surprisingly simple: the asymptotic distribution of the inequality constrained estimator is just a projection of the asymptotic limit of the unconstrained estimator onto a boundary defined by the corresponding set of binding constraints.
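Lemma 2 suggests a direct way to simulate the limit distribution: solve the quadratic program (10) by enumerating the $2^p$ candidate sets of binding constraints and keeping the one whose Kuhn-Tucker conditions hold. The sketch below implements this enumeration (a generic QP solver would work equally well; the function name is ours).

```python
import numpy as np
from itertools import chain, combinations

def constrained_limit(Z, J, R, c):
    # Solve min_l 0.5*(l - Z)' J (l - Z)  s.t.  c + R l >= 0 by checking each
    # combination of binding constraints, mirroring the structure of (11).
    p = len(c)
    Jinv = np.linalg.inv(J)
    for idx in chain.from_iterable(combinations(range(p), k) for k in range(p + 1)):
        idx = list(idx)
        lam = Z.copy()
        mu = np.zeros(p)
        if idx:  # project onto the face where the constraints in idx bind
            Ri, ci = R[idx], c[idx]
            M = np.linalg.inv(Ri @ Jinv @ Ri.T)
            mu[idx] = -M @ (Ri @ Z + ci)
            lam = Z - Jinv @ Ri.T @ M @ (Ri @ Z + ci)
        # Kuhn-Tucker check: binding multipliers nonnegative, all constraints feasible.
        if np.all(mu[idx] >= -1e-10) and np.all(c + R @ lam >= -1e-10):
            return lam, mu
```

Drawing $Z \sim \mathcal{N}(0, \Omega)$ and mapping each draw through `constrained_limit` then simulates the distribution in (11).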
The following theorem summarizes the analysis above and presents the asymptotic distributions of the unrestricted, restricted, and shrinkage estimators.

Theorem 1. Under Assumptions 1–6,

$$n^{1/2}(\hat{\theta}_n - \theta_0) \xrightarrow{d} Z \sim \mathcal{N}(0, \Omega), \quad \Omega = \mathcal{J}^{-1}\mathcal{V}\mathcal{J}^{-1}, \quad (13)$$

$$n^{1/2}(\tilde{\theta}_n - \theta_0) \xrightarrow{d} \tilde{\lambda} \equiv Z - \sum_{\iota=2}^{2^p} P_{L(\iota)}(Z + h_\iota)\,1\{\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \leq 0\}, \quad (14)$$

$$n\ell(\hat{\theta}_n, \tilde{\theta}_n) \xrightarrow{d} \xi \equiv \sum_{\iota=2}^{2^p}(Z + h_\iota)'P_{L(\iota)}'WP_{L(\iota)}(Z + h_\iota)\,1\{\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \leq 0\}, \quad (15)$$

$$\hat{w}_n \xrightarrow{d} w = \left(1 - \frac{\tau}{\xi}\right)_+. \quad (16)$$

The asymptotic distribution of the inequality constrained shrinkage estimator is

$$n^{1/2}(\hat{\theta}_n^* - \theta_0) \xrightarrow{d} wZ + (1 - w)\tilde{\lambda}. \quad (17)$$

5 Asymptotic Risk

In practice, obtaining the restricted estimator still requires solving a potentially complicated non-linear problem, so an analytical closed form solution is extremely unlikely. Even if it is possible to derive an analytical solution, it will take a complex form, which the ICSE will inherit. As a result, calculating its finite sample risk may be infeasible. However, we know the asymptotic distribution of the ICSE, which means we can use the asymptotic risk to get a reasonable approximation of the finite sample risk.

Since the ICSE may not have a sufficient number of finite moments, to ensure existence we use an asymptotically trimmed loss. Let $T = \{T_n\}_{n=1}^\infty$ denote a sequence of estimators. The asymptotic risk of the estimator sequence $T$ is defined as

$$\rho(h, T) = \lim_{\zeta \to \infty} \liminf_{n \to \infty} \mathbb{E}_{\theta_0}\min[n\ell(\theta_0, T_n), \zeta]. \quad (18)$$

The loss function is trimmed at $\zeta$; however, the trimming becomes negligible in large samples as $\zeta \to \infty$ with $n \to \infty$.

Hansen (2016) shows that whenever the loss function is locally quadratic, i.e. satisfies Assumption 1, the asymptotic risk (18) of an arbitrary estimator $T_n$ such that $n^{1/2}(T_n - \theta_0) \xrightarrow{d} \psi$, where $\psi$ is some random variable, can be calculated as

$$\rho(h, T) = \mathbb{E}[\psi'W\psi]. \quad (19)$$

Equation (19) allows us to calculate the asymptotic risk of the unrestricted and shrinkage estimators as an expected weighted quadratic loss. Note that $n^{1/2}(\hat{\theta}_n - \theta_0) \xrightarrow{d} Z$; hence the asymptotic risk of the unrestricted estimator is

$$\rho(h, \hat{\theta}_n) = \mathbb{E}[Z'WZ] = tr(W\mathbb{E}[ZZ']) = tr(W\Omega). \quad (20)$$

Define the $m \times m$ matrix $A_{L(\iota)} \equiv W^{1/2\prime}\Omega P_{L(\iota)}'W^{1/2}$, and let $\varphi_{\max}(A_{L(\iota)})$ denote its largest eigenvalue.
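Since $\psi^* = wZ + (1 - w)\tilde{\lambda}$ is an explicit function of $Z$, the asymptotic risk (19) can be approximated by simulation. A sketch, reusing `constrained_limit` from Section 4 (all names are ours):

```python
import numpy as np

def asymptotic_risk(W, Omega, J, R, c, tau, draws=20_000, seed=0):
    # Monte Carlo approximation of E[psi' W psi] for the ICSE limit in (17).
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Omega)
    total = 0.0
    for _ in range(draws):
        Z = L @ rng.standard_normal(len(Omega))
        lam, _ = constrained_limit(Z, J, R, c)
        d = Z - lam
        xi = d @ W @ d                                # limit of the scaled loss, cf. (15)
        w = max(0.0, 1.0 - tau / xi) if xi > 0 else 0.0
        psi = w * Z + (1.0 - w) * lam
        total += psi @ W @ psi
    return total / draws
```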
The following theorem establishes the main result of the paper.

Theorem 2. Under Assumptions 1–6, if

$$0 < \tau \leq 2\sum_{\iota=2}^{2^p}\left(tr(A_{L(\iota)}) - 2\varphi_{\max}(A_{L(\iota)})\right)\gamma_\iota, \quad (21)$$

where

$$\gamma_\iota \equiv \frac{\mathbb{E}[\xi_{L(\iota)}^{-1}]\,\mathbb{P}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \leq 0)}{\sum_{\iota=2}^{2^p}\mathbb{E}[\xi_{L(\iota)}^{-1}]\,\mathbb{P}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \leq 0)}, \quad (22)$$

then for any $h$,

$$\rho(h, \hat{\theta}_n^*) < \rho(h, \hat{\theta}_n). \quad (23)$$

Equation (23) shows that the ICSE has strictly lower asymptotic risk than the unrestricted estimator for all values of the slackness parameter $h$, provided that the shrinkage parameter $\tau$ satisfies restriction (21). The explicit risk bound for the ICSE is

$$\rho(h, \hat{\theta}_n^*) < tr(W\Omega) - \tau\sum_{\iota=2}^{2^p}\mathbb{E}\left[\frac{2\left(tr(A_{L(\iota)}) - 2\varphi_{\max}(A_{L(\iota)})\right) - \tau}{\xi_{L(\iota)}}\right]\mathbb{P}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \leq 0). \quad (24)$$

Since the bound in (24) is quadratic in the shrinkage parameter $\tau$, there exists a unique optimal level of shrinkage $\tau^*$ that minimizes this bound,

$$\tau^* = \sum_{\iota=2}^{2^p}\left(tr(A_{L(\iota)}) - 2\varphi_{\max}(A_{L(\iota)})\right)\gamma_\iota. \quad (25)$$

From (22) it follows that $\gamma_\iota \to 0$ when either the probability of the $\iota$th event, $\mathbb{P}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \leq 0)$, is close to zero, or the expected inverse loss $\mathbb{E}[\xi_{L(\iota)}^{-1}]$ approaches zero. The optimal shrinkage parameter thus puts more weight on events that are more likely to happen and on events where the restricted estimator is close to the unrestricted one. The behavior of $\gamma_\iota$ is ambiguous when $\mathbb{P}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \leq 0)$ goes to zero and $\mathbb{E}[\xi_{L(\iota)}^{-1}]$ approaches infinity.

When $W = \Omega^{-1}$, (21) simplifies to

$$0 < \tau \leq 2\left(\sum_{\iota=2}^{2^p} p_\iota\gamma_\iota - 2\right), \quad (26)$$

which leads to

$$\tau^* = \sum_{\iota=2}^{2^p} p_\iota\gamma_\iota - 2, \quad (27)$$

where $p_\iota$ is the number of binding constraints in combination $\iota$. When $W = \Omega^{-1}$, $tr(A_{L(\iota)}) = tr(W\Omega P_{L(\iota)}') = tr(P_{L(\iota)}') = p_\iota$ and $\varphi_{\max}(A_{L(\iota)}) = 1$, which gives condition (26). This restriction on the shrinkage parameter has the same form as the classical James-Stein condition, $0 < \tau \leq 2(m - 2)$, where $m$ is the dimension of the parameter of interest. As long as $m > 2$, the James-Stein estimator dominates the unrestricted estimator in terms of asymptotic risk. In the case of the ICSE, condition (26) requires $\sum_{\iota=2}^{2^p} p_\iota\gamma_\iota > 2$: the "expected" number of binding constraints must be greater than two for the ICSE to dominate. Constraints that are more likely to bind tell us which boundary of the restricted parameter space we are shrinking to, i.e. they determine the direction of shrinkage.
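Under $W = \Omega^{-1}$ the optimal level of shrinkage (27) is one line of arithmetic. A hedged sketch (the binding counts and the weights would come from the feasible estimates of Section 6):

```python
import numpy as np

def tau_star(p_counts, gamma):
    # Optimal shrinkage (27): sum_iota p_iota * gamma_iota - 2, truncated at zero,
    # since dominance requires the 'expected' number of binding constraints > 2.
    return max(float(np.dot(p_counts, gamma)) - 2.0, 0.0)
```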
6 A Feasible Estimator

As pointed out by Hjort and Claeskens (2003), optimal model averaging (and shrinkage) weights cannot be consistently estimated in the local asymptotic framework, since the localizing parameters are $O(n^{-1/2})$, and the ICSE is no exception. Since the weights $\{\gamma_\iota\}_{\iota=2}^{2^p}$ depend on the localizing parameter, which is unknown, the optimal shrinkage parameter in (25) is infeasible. Furthermore, the localizing parameter $h$, which is a transformation of the original localizing parameter $c$, cannot be consistently estimated under the local asymptotic framework.

The weights $\{\gamma_\iota\}_{\iota=2}^{2^p}$ depend on the localizing parameter through the Kuhn-Tucker multipliers, whose distribution is

$$\tilde{\mu} = -(R\mathcal{J}^{-1}R')^{-1}R(Z + h) = -(R\mathcal{J}^{-1}R')^{-1}(RZ + c) \sim \mathcal{N}(\Psi(c), \Xi),$$

where $\Psi(c) = -(R\mathcal{J}^{-1}R')^{-1}c$ and $\Xi = (R\mathcal{J}^{-1}R')^{-1}R\Omega R'(R\mathcal{J}^{-1}R')^{-1}$. The mean of $\tilde{\mu}$ depends on the localizing parameter; thus the distribution cannot be consistently estimated, nor can the corresponding probabilities. As a result, the optimal shrinkage parameter is infeasible.

A common approach in the literature is to use an asymptotically unbiased estimator of the localizing parameter $c$ (see e.g. Liu (2015)). In our case $\hat{c}_n = n^{1/2}r(\hat{\theta}_n)$ is an asymptotically unbiased estimator of $c$. To see this, approximate $\hat{c}_n$ around the true parameter value $\theta_0$ using a first-order Taylor expansion:

$$\hat{c}_n = n^{1/2}r(\hat{\theta}_n) = n^{1/2}r(\theta_0) + R(\theta_0)n^{1/2}(\hat{\theta}_n - \theta_0) + o_p(1) \xrightarrow{d} c + RZ \sim \mathcal{N}(c, R\Omega R').$$

Note that $\hat{c}_n$ does not converge in probability to $c$; it is only $O_p(1)$, converging in distribution to the random limit $c + RZ$.

We propose to use a plug-in estimator of the optimal shrinkage parameter, $\hat{\tau}_n^* \equiv \tau^*(\hat{c}_n)$. We can replace $R$, $\mathcal{J}$, and $\Omega$ with their consistent estimators $\hat{R}_n = R(\hat{\theta}_n)$, $\hat{\mathcal{J}}_n$, and $\hat{\Omega}_n = \hat{\mathcal{J}}_n^{-1}\hat{\mathcal{V}}_n\hat{\mathcal{J}}_n^{-1}$. A consistent weighting matrix estimate $\hat{W}_n$ can either be constructed from the specific context (e.g. an identity matrix, $\hat{W}_n = \mathcal{I}_m$) or as the second derivative of the loss function, i.e. $\hat{W}_n = W(\hat{\theta}_n)$. We can then estimate $\tau^*$ by

$$\hat{\tau}_n^* = \sum_{\iota=2}^{2^p}\left(tr(\hat{A}_{n,L(\iota)}) - 2\varphi_{\max}(\hat{A}_{n,L(\iota)})\right)\hat{\gamma}_{n,\iota}, \quad (28)$$

where $\hat{A}_{n,L(\iota)} = \hat{W}_n^{1/2\prime}\hat{\Omega}_n\hat{R}_{n,\iota}'(\hat{R}_{n,\iota}\hat{\mathcal{J}}_n^{-1}\hat{R}_{n,\iota}')^{-1}\hat{R}_{n,\iota}\hat{\mathcal{J}}_n^{-1}\hat{W}_n^{1/2}$ and the weights are constructed as

$$\hat{\gamma}_{n,\iota} = \frac{\hat{\mathbb{E}}^{-1}[\xi_{L(\iota)}]\,\hat{\mathbb{P}}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \leq 0)}{\sum_{\iota=2}^{2^p}\hat{\mathbb{E}}^{-1}[\xi_{L(\iota)}]\,\hat{\mathbb{P}}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \leq 0)}. \quad (29)$$

In general, $\xi_{L(\iota)}$ follows a generalized $\chi^2$ distribution, which makes estimating its first inverse moment an extremely onerous task (for more details on the calculation of inverse moments of the generalized $\chi^2$ distribution see e.g. Jones (1986)). Instead, we proxy $\hat{\mathbb{E}}[\xi_{L(\iota)}^{-1}]$ with $\hat{\mathbb{E}}^{-1}[\xi_{L(\iota)}]$, which tends to work well in practice. We can consistently estimate the expected loss $\mathbb{E}[\xi_{L(\iota)}]$ by

$$\hat{\mathbb{E}}[\xi_{L(\iota)}] = n(\hat{\theta}_n - \tilde{\theta}_{n,\iota})'\hat{W}_n(\hat{\theta}_n - \tilde{\theta}_{n,\iota}),$$

where $\tilde{\theta}_{n,\iota}$ is the equality constrained estimator given the constraints indexed by $\iota$. Probability estimates $\hat{\mathbb{P}}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \leq 0)$ are based on the feasible distribution of the Kuhn-Tucker multipliers, $\mathcal{N}(\hat{\Psi}_n, \hat{\Xi}_n)$, where $\hat{\Psi}_n = -(\hat{R}_n\hat{\mathcal{J}}_n^{-1}\hat{R}_n')^{-1}\hat{c}_n$ and $\hat{\Xi}_n = (\hat{R}_n\hat{\mathcal{J}}_n^{-1}\hat{R}_n')^{-1}\hat{R}_n\hat{\Omega}_n\hat{R}_n'(\hat{R}_n\hat{\mathcal{J}}_n^{-1}\hat{R}_n')^{-1}$.

Note that $\{\hat{\gamma}_{n,\iota}\}_{\iota=2}^{2^p}$ are not consistent estimates: they do not converge in probability to their true values but rather converge in distribution to random limits, which implies that the plug-in estimator of the shrinkage parameter $\hat{\tau}_n^*$ does not converge in probability to $\tau^*$. Thus, the proposed feasible estimator (28) is not optimal in the sense that it uses a feasible data-driven weight that does not converge in probability to the optimal one. As a result, the dominance over the unrestricted estimator is not guaranteed. Despite that, in the following sections we show that the feasible estimator works well in practice.
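The probabilities $\hat{\mathbb{P}}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \leq 0)$ have no closed form but are easy to simulate from $\mathcal{N}(\hat{\Psi}_n, \hat{\Xi}_n)$. A sketch of the weights (29) under assumed inputs (`xi_hat` maps each binding pattern to $\hat{\mathbb{E}}[\xi_{L(\iota)}]$; all names are ours):

```python
import numpy as np

def gamma_hat(Psi_hat, Xi_hat, xi_hat, draws=20_000, seed=0):
    # Estimate the weights (29): simulate mu ~ N(Psi_hat, Xi_hat) for the pattern
    # probabilities and proxy E[xi^{-1}] by 1 / E_hat[xi].
    rng = np.random.default_rng(seed)
    p = len(Psi_hat)
    mu = Psi_hat + rng.standard_normal((draws, p)) @ np.linalg.cholesky(Xi_hat).T
    raw = {}
    for pattern, e_xi in xi_hat.items():          # pattern: tuple of binding indices
        bind = np.zeros(p, dtype=bool)
        bind[list(pattern)] = True
        prob = np.mean(np.all(mu[:, bind] > 0, axis=1) & np.all(mu[:, ~bind] <= 0, axis=1))
        raw[pattern] = prob / e_xi
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()} if total > 0 else raw
```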
7 Monte Carlo Simulations

We demonstrate the finite sample performance of the ICSE in the following numerical simulation. Consider the following linear model: for $i = 1, \ldots, n$,

$$y_i = x_{1i}'\theta_1 + x_{2i}'\theta_2 + \varepsilon_i,$$

where $x_{1i}$ is $k_1 \times 1$ and $x_{2i}$ is $k_2 \times 1$.
The vector of regressors $x_i = (x_{1i}', x_{2i}')'$ is distributed $\mathcal{N}(0, \Sigma)$, with $\Sigma_{jj} = 1$ and a common off-diagonal value $\Sigma_{jk}$ for $j \neq k$, and the error term $\varepsilon_i$ is normal. We estimate the coefficients under the belief that $\theta$ may be close to $\Theta_0 = \{\theta \in \mathbb{R}^{k_1+k_2} : \theta_1 \geq 0, \theta_2 = 0\}$. For simplicity, in estimation we use the quadratic loss function.

Let $\hat{\theta}_n$ denote the unrestricted OLS estimator, with $\hat{\Omega}_n$ a consistent estimate of the asymptotic covariance matrix of $n^{1/2}(\hat{\theta}_n - \theta)$. Let $\tilde{\theta}_n$ be the restricted OLS estimator under $\theta_1 \geq 0$, $\theta_2 = 0$.

We compare the performance of five different estimators of $\theta$. The first is $\hat{\theta}_n$, the unrestricted OLS estimator. The second is $\tilde{\theta}_n$, the restricted OLS estimator. The third is the generalized James-Stein estimator of Hansen (2016),

$$\hat{\theta}_n^{JS} = \hat{w}_n\hat{\theta}_n, \qquad \hat{w}_n = \left(1 - \frac{k_1 + k_2 - 2}{n\hat{\theta}_n'\hat{\Omega}_n^{-1}\hat{\theta}_n}\right)_+,$$

which shrinks both $\theta_1$ and $\theta_2$ to zero.

The fourth estimator is the Empirical Bayes estimator $\hat{\theta}_n^{EB}$, which assumes the truncated normal prior $\theta_1 \,|\, \nu \sim \mathcal{N}(0, 1/\nu)\,1\{\theta_1 \geq 0\}$, where $\nu$ is a hyperparameter that tells us how much weight to put on $\theta_1$ being equal to zero: the higher the value of $\nu$, the more concentrated the prior is around zero, hence the more mass is put on zero. The motivation for this estimator comes from the fact that the James-Stein estimator can be represented as an Empirical Bayes estimator (see e.g. Efron and Morris (1972)).

The last estimator is the feasible ICSE, which takes the same form but with the weight

$$\hat{w}_n^* = \left(1 - \frac{\sum_{\iota=1}^{2^{k_1}} p_\iota\hat{\gamma}_{n,\iota} - 2}{n(\hat{\theta}_n - \tilde{\theta}_n)'\hat{\Omega}_n^{-1}(\hat{\theta}_n - \tilde{\theta}_n)}\right)_+,$$

where $p_\iota$ is the total number of binding constraints in case $\iota$. Note that since there are two equality constraints ($k_2 = 2$), even if none of the inequality constraints bind ($\iota = 1$), $p_1 = 2$. The weights $\{\hat{\gamma}_{n,\iota}\}_{\iota=1}^{2^{k_1}}$ are estimated by (29).
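A compact version of one replication of this design is below. The off-diagonal correlation 0.5 and the unit error variance are assumed values for illustration; the restricted OLS imposes $\theta_2 = 0$ by dropping $x_2$ and $\theta_1 \geq 0$ via bounded least squares.

```python
import numpy as np
from scipy.optimize import lsq_linear

def one_rep(n, k1, k2, b, c, rho, rng):
    # Design of Section 7: theta1 = (1,1,1,b,...,b)', theta2 = (c,...,c)'.
    k = k1 + k2
    theta = np.r_[1.0, 1.0, 1.0, np.full(k1 - 3, b), np.full(k2, c)]
    Sigma = np.full((k, k), rho) + (1.0 - rho) * np.eye(k)
    X = rng.standard_normal((n, k)) @ np.linalg.cholesky(Sigma).T
    y = X @ theta + rng.standard_normal(n)
    theta_u = np.linalg.lstsq(X, y, rcond=None)[0]          # unrestricted OLS
    sol = lsq_linear(X[:, :k1], y, bounds=(0.0, np.inf)).x  # theta1 >= 0, theta2 = 0
    theta_r = np.r_[sol, np.zeros(k2)]
    return theta, theta_u, theta_r

rng = np.random.default_rng(0)
reps = [one_rep(200, 5, 2, b=-0.2, c=0.0, rho=0.5, rng=rng) for _ in range(500)]
mse_u = np.mean([np.sum((u - t) ** 2) for t, u, r in reps])
mse_r = np.mean([np.sum((r - t) ** 2) for t, u, r in reps])
print(mse_r / mse_u)  # relative MSE of the restricted estimator
```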
The estimators are compared by mean squared error (MSE), calculated over $N = 2{,}000$ replications. For ease of exposition, we normalize the MSE of the unrestricted estimator to one, so that the MSEs of the other estimators are reported relative to the MSE of the unrestricted one.

We set the regression coefficients to $\theta_1 = (1, 1, 1, b, \ldots, b)'$ and $\theta_2 = (c, c, \ldots, c)'$. Thus, the remaining control parameters in the model are $k_1$, $b$, $c$, and $n$. The value of $b$ allows us to control the strength of the inequality constraints, i.e. whether they are satisfied or not, and $c$ controls the strength of the equality constraints.

Note that the inequality constraints do not change simultaneously with $b$: when $b$ is negative, the first three constraints are satisfied, while the remaining $k_1 - 3$ are violated. This is fundamentally different from shrinking towards equality constraints.

In Figure 1, we display the results for $n = \{200, 500\}$ and $k_1 = \{5, 7, 10\}$, varying $b$ on a 100-point equispaced grid from $-0.5$ to $0.5$. We set $c = 0$ so that the equality constraints are satisfied.

First, the feasible ICSE dominates the unrestricted estimator, while the restricted estimator along with the EB estimator do worse than the unrestricted one when the constraints are violated. Since the posterior is truncated at zero, the EB estimates of $\theta_1$ are always positive, which explains the result. Further details can be found in Appendix D.
Figure 1. MC results.
This figure shows normalized MSEs of the positive-part James-Stein (JS+), ICSE, EB, constrained (C), and unconstrained (UC) estimators for different combinations of $k_1 = \{5, 7, 10\}$ and $n = \{200, 500\}$.

Second, we observe that the James-Stein estimator exhibits almost no improvement upon the unrestricted estimator. This behavior is expected, since the James-Stein estimator shrinks all the coefficients towards zero, which is fundamentally different from shrinking towards inequalities. As a result, the James-Stein estimator puts almost no weight on the restricted estimator. If the shrinkage direction is chosen poorly, it leads to a large bias and poor overall performance; thus, shrinkage gains are guaranteed only if the shrinkage direction is chosen properly.

When $b < 0$, the restricted and EB estimators perform worse than the unrestricted one, while the feasible ICSE achieves significant MSE reduction gains. When $b$ approaches zero, the constrained estimator starts to dominate the feasible ICSE. When $b > 0$, the constrained estimator dominates both shrinkage estimators; however, the EB estimator achieves lower MSE when $b$ is slightly greater than zero. As $b$ grows, the EB estimator converges to the unrestricted estimator. When the number of observations increases, the prior gets less weight, pushing the drop in the MSE closer to $b = 0$. Notice that the MSE of the restricted estimator does not converge to that of the unrestricted one: since $c = 0$, the inequality constrained estimator is more accurate than the unrestricted one, which explains the result. Moreover, as the number of inequality constraints grows, the difference between the unconstrained and constrained estimators vanishes, resulting in lower MSE gains of the constrained estimator over the unconstrained one.

Finally, when the number of inequality constraints increases, the shrinkage effect of the feasible ICSE and EB estimators becomes more prominent, which supports the theoretical findings.

8 Empirical Application: Demand Estimation under the Slutsky Restriction
In our empirical application we consider consumer demand estimation under the Slutsky restriction (see Example 2 for more details). In this application we build on the literature on demand estimation under shape restrictions, especially on the recent results by Blundell et al. (2012), Dette et al. (2016), and Blundell et al. (2017).

Our goal is to estimate price and income elasticities of gasoline demand for different income levels. The Slutsky condition is an inequality constraint on the demand function ensuring that the compensated own-price elasticities are negative. Although in theory consumer choices should abide by the Slutsky restriction, in the data we might find evidence suggesting otherwise. For example, if gasoline prices are high and households anticipate them to rise further, then households will tend to buy more gasoline now and store it for future use, resulting in a positive compensated price elasticity, which violates the Slutsky restriction. That is exactly where we expect shrinkage gains. Implementation details can be found in Appendix E.

We use the same data and sample construction as Blundell et al. (2017), which we briefly describe here. The data are from the 2001 National Household Travel Survey (NHTS). The sample is constructed to reduce heterogeneity by restricting the analysis to households with a white respondent, two or more adults, at least one child under age 16, and at least one driver. Households in the most rural areas and in Hawaii are excluded from the sample, as are households with missing relevant variables or without a gasoline based vehicle. The resulting sample contains 3,640 observations, where the key variables of interest are gasoline demand, price of gasoline, and household income. (Further details on sample construction can be found in Section IV.A of Blundell et al. (2017); a more detailed description of the NHTS dataset is presented in Section 3 of Blundell et al. (2012).)

We present estimates for the low, medium, and high income groups, which correspond to the first, second, and third income quartiles, respectively. As a base estimator we use a local linear regression (LLR) with 20 grid points in the observed range of values for the log price. We set the bandwidths for log price and log income using the rule of thumb, proportional to their respective standard deviations. Further implementation details are left for Appendix E.

Figure 2 plots the unrestricted, restricted, and ICSE estimates of price and income elasticities as functions of price, across the income levels. The degree of shrinkage differs across income groups: we estimate the weight on the unrestricted estimator $\hat{w}$ to be $0$ for the low income group, $0.25$ for the medium income group, and $0.75$ for the high income group. Thus, consumers from higher income groups are more likely to have upward sloping demand curves, which is consistent with the results in Blundell et al. (2012). However, the Empirical Bayes estimates of Fessler and Kasy (2019), based on the local linear quantile regression, suggest shrinking more towards the restricted estimates for all income groups. The reason the estimates differ is that the ICSE shrinks all components of $\hat{\beta}$ by the same factor $\hat{w}$, while the EB estimator provides component-wise shrinkage with different shrinkage factors (for more details see Section 4.1 in Fessler and Kasy (2019)).
Figure 2. Price and income elasticity estimates.
This figure shows the unrestricted, restricted, and ICSE estimates of price and income elasticities for the (a) high, (b) medium, and (c) low income groups.
9 Conclusion

In this paper we have shown how to shrink extremum estimators towards theoretical restrictions in the form of inequality constraints. The ICSE asymptotically uniformly dominates the unrestricted estimator. The shrinkage direction depends only on the binding constraints, rendering it ex ante unknown to the researcher, which is the main difference compared to shrinking towards equality constraints.

An important caveat, however, is that due to the presence of localizing parameters that cannot be consistently estimated, we cannot guarantee the risk dominance result in finite samples, which is a common problem in the frequentist model averaging and shrinkage literatures. One possible improvement would be to establish uniform dominance of the feasible ICSE, but we leave this for future research.

References
Andrews, Donald W. K. (1999). Estimation when a parameter is on a boundary. Econometrica.
Baranchik, A. J. (1964). Multiple regression and estimation of the mean of a multivariate normal distribution. Tech. rep., Department of Statistics, Stanford University.
Blundell, Richard, Joel L. Horowitz, and Matthias Parey (2012). Measuring the price responsiveness of gasoline demand: Economic shape restrictions and nonparametric demand estimation. Quantitative Economics.
Blundell, Richard, Joel L. Horowitz, and Matthias Parey (2017). Nonparametric estimation of a nonseparable demand function under the Slutsky inequality restriction. Review of Economics and Statistics.
Cheng, Xu, Zhipeng Liao, and Ruoyao Shi (2019). On uniform asymptotic risk of averaging GMM estimators. Quantitative Economics.
Chernoff, Herman (1954). On the distribution of the likelihood ratio. The Annals of Mathematical Statistics, pp. 573–578.
Claeskens, Gerda and Nils Lid Hjort (2008). Model Selection and Model Averaging. Cambridge University Press.
Del Negro, Marco and Frank Schorfheide (2004). Priors from general equilibrium models for VARs. International Economic Review.
Dette, Holger, Stefan Hoderlein, and Natalie Neumeyer (2016). Testing multivariate economic restrictions using quantiles: The example of Slutsky negative semidefiniteness. Journal of Econometrics.
DiTraglia, Francis J. (2016). Using invalid instruments on purpose: Focused moment selection and averaging for GMM. Journal of Econometrics.
Efron, Bradley and Carl Morris (1972). Empirical Bayes on vector observations: An extension of Stein's method. Biometrika.
Feder, Paul I. (1968). On the distribution of the log likelihood ratio test statistic when the true parameter is "near" the boundaries of the hypothesis regions. The Annals of Mathematical Statistics.
Fessler, Pirmin and Maximilian Kasy (2019). How to use economic theory to improve estimators: Shrinking toward theoretical restrictions. Review of Economics and Statistics.
Friedman, Jerome, Trevor Hastie, Holger Höfling, and Robert Tibshirani (2007). Pathwise coordinate optimization. The Annals of Applied Statistics.
Gourieroux, Christian and Alain Monfort (1995). Statistics and Econometric Models. Vol. 1. Cambridge University Press.
Hansen, Bruce E. (2007). Least squares model averaging. Econometrica.
Hansen, Bruce E. (2015). Shrinkage efficiency bounds. Econometric Theory.
Hansen, Bruce E. (2016). Efficient shrinkage in parametric models. Journal of Econometrics.
Hansen, Bruce E. (2017). A Stein-like 2SLS estimator. Econometric Reviews.
Hansen, Bruce E. and Jeffrey S. Racine (2012). Jackknife model averaging. Journal of Econometrics.
Hjort, Nils Lid and Gerda Claeskens (2003). Frequentist model average estimators. Journal of the American Statistical Association.
James, W. and Charles Stein (1961). Estimation with quadratic loss. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. Vol. 1, pp. 361–379.
Jones, M. C. (1986). Expressions for inverse moments of positive quadratic forms in normal variables. Australian Journal of Statistics.
Lehmann, Erich L. and George Casella (1998). Theory of Point Estimation. Springer Science and Business Media.
Liu, Chu-An (2015). Distribution theory of the least squares averaging estimator. Journal of Econometrics.
Moon, Hyungsik Roger and Frank Schorfheide (2009). Estimation with overidentifying inequality moment conditions. Journal of Econometrics.
Newey, Whitney K. and Daniel McFadden (1994). Large sample estimation and hypothesis testing. In: Handbook of Econometrics. Vol. 4, pp. 2111–2245.
Oman, Samuel D. (1982a). Contracting towards subspaces when estimating the mean of a multivariate normal distribution. Journal of Multivariate Analysis.
Oman, Samuel D. (1982b). Shrinking towards subspaces in multiple linear regression. Technometrics.
Pollard, David (1985). New ways to prove central limit theorems. Econometric Theory.
Shao, Peter Yi-Shi and William E. Strawderman (1994). Improving on the James-Stein positive-part estimator. The Annals of Statistics.
Stein, Charles M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pp. 1135–1151.
Stein, Charles M. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In: Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. Vol. 1, pp. 197–206.
Tibshirani, Robert (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological).
Tibshirani, Ryan J., Holger Hoefling, and Robert Tibshirani (2011). Nearly-isotonic regression. Technometrics.
Wan, Alan T. K., Xinyu Zhang, and Guohua Zou (2010). Least squares model averaging by Mallows criterion. Journal of Econometrics.
Wolak, Frank A. (1989). Local and global testing of linear and nonlinear inequality constraints in nonlinear econometric models. Econometric Theory.
Zhu et al. (2017). Working paper.

Appendix A Lemmas and Proofs
A.1 Proof of Lemma 1
Assumptions 3–4, along with the Slutsky lemma and the continuous mapping theorem, immediately give $Z_n \xrightarrow{d} Z = \mathcal{J}^{-1}G$ and $q_n(\lambda) \xrightarrow{d} \frac{1}{2}(\lambda - Z)'\mathcal{J}(\lambda - Z)$. ∎
The proof follows directly from Theorems 3 and 5 in Andrews (1999) with two slight modifications. First, note that the restricted space $\Lambda_c$ in (10) is a convex cone with a (possibly) non-zero vertex, while in Andrews (1999) the cone has a zero vertex. This changes the form of the projection in (11), which in our case accommodates the non-zero vertex. Second, the indicator functions in (11) are given in terms of the Kuhn-Tucker multipliers (see the derivation below) instead of the asymptotic limits of the subvectors of $\tilde{\theta}_n$. ∎

Deriving the Kuhn-Tucker multipliers requires writing down the first order conditions for (10):

$$\mathcal{J}(\tilde{\lambda} - Z) - R'\tilde{\mu} = 0, \qquad \tilde{\mu}'(c + R\tilde{\lambda}) = 0, \qquad \tilde{\mu} \geq 0, \qquad c + R\tilde{\lambda} \geq 0.$$

(The FOCs are both necessary and sufficient since this is a quadratic programming problem with linear constraints.) The vector of Kuhn-Tucker multipliers satisfying the first order conditions is given by

$$\tilde{\mu} = -(R\mathcal{J}^{-1}R')^{-1}(RZ + c).$$
To prove (15), we begin by taking a second order mean value expansion of the loss function around $\hat{\theta}_n$:

$$n\ell(\hat{\theta}_n, \tilde{\theta}_n) = n\ell(\hat{\theta}_n, \hat{\theta}_n) + n\frac{\partial}{\partial\theta'}\ell(\hat{\theta}_n, \theta)\Big|_{\theta=\hat{\theta}_n}(\tilde{\theta}_n - \hat{\theta}_n) + n(\tilde{\theta}_n - \hat{\theta}_n)'W(\theta_n^*)(\tilde{\theta}_n - \hat{\theta}_n),$$

where $\theta_n^*$ lies on a line segment between $\hat{\theta}_n$ and $\tilde{\theta}_n$. Assumption 1(b) implies that $n\ell(\hat{\theta}_n, \hat{\theta}_n) = 0$. By Assumption 1(c) and the fact that $\ell(\hat{\theta}_n, \theta)$ is minimized at $\theta = \hat{\theta}_n$, $\frac{\partial}{\partial\theta'}\ell(\hat{\theta}_n, \theta)\big|_{\theta=\hat{\theta}_n} = 0$. Consistency of both $\hat{\theta}_n$ and $\tilde{\theta}_n$ along with Assumption 1(c) implies that $W(\theta_n^*) = W + o_p(1)$.

We have shown that the unrestricted estimator is asymptotically equal to $Z$. Combining this fact with the results from Lemma 2 gives

$$n^{1/2}(\hat{\theta}_n - \tilde{\theta}_n) \xrightarrow{d} \sum_{\iota=2}^{2^p}P_{L(\iota)}(Z + h_\iota)\,1\{\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \leq 0\}.$$

Hence, the asymptotic distribution of the loss function is

$$n\ell(\hat{\theta}_n, \tilde{\theta}_n) = n(\tilde{\theta}_n - \hat{\theta}_n)'W(\theta_n^*)(\tilde{\theta}_n - \hat{\theta}_n) \xrightarrow{d} \sum_{\iota=2}^{2^p}(Z + h_\iota)'P_{L(\iota)}'WP_{L(\iota)}(Z + h_\iota)\,1\{\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \leq 0\} = \xi. \;\; ∎$$

To derive the bound, we use a version of Stein's Lemma (Stein, 1981), as presented in Hansen (2016).
Lemma 3. If $Z \sim \mathcal{N}(0, \mathcal{V})$ is $m \times 1$, $K$ is $m \times m$, and $\eta(x): \mathbb{R}^m \to \mathbb{R}^m$ is absolutely continuous, then

$$\mathbb{E}[\eta(Z + h)'KZ] = \mathbb{E}\, tr\left(\frac{\partial}{\partial x'}\eta(Z + h)\,K\mathcal{V}\right).$$
A.4 Proof of Theorem 2

The proof is similar to Hansen (2016). First, observe that $n^{1/2}(\hat{\theta}^*_n - \theta) \to_d \psi^*$, where $\psi^* = wZ + (1 - w)\tilde{\lambda}$, as shown in (17). Hence, the risk of the shrinkage estimator can be calculated as $\rho(h, \hat{\theta}^*_n) = \mathbb{E}[\psi^{*\prime} W \psi^*]$. The distribution of the variable $\psi^*$ is based on the classic James-Stein distribution with positive part trimming. Define a similar random variable $\psi$ without positive part trimming,
$$\psi = Z\left(1 - \frac{\tau}{\xi}\right) + \frac{\tau}{\xi}\tilde{\lambda} = Z - \frac{\tau}{\xi}\sum_{\iota=2}^{p} P_{L(\iota)}(Z + h_\iota)\,\mathbb{1}\{\tilde{\mu}_\iota > 0,\, \tilde{\mu}_{-\iota} \le 0\}. \qquad (A.1)$$
It is a well-known fact that positive part trimming always reduces risk under the standard quadratic loss (see, e.g., Theorem 5.5.4 in Lehmann and Casella (1998), or Lemma 2 in Hansen (2015)). Thus, using this fact and (19),
$$\rho(h, \hat{\theta}^*_n) = \mathbb{E}[\psi^{*\prime} W \psi^*] < \mathbb{E}[\psi' W \psi]. \qquad (A.2)$$
Using (A.1), we calculate that the asymptotic risk in (A.2) is equal to
$$\begin{aligned}
\mathbb{E}[\psi' W \psi] &= \mathbb{E}[Z' W Z] + \tau^2\, \mathbb{E}\left[\frac{\left(\sum_{\iota=2}^{p} P_{L(\iota)}(Z + h_\iota)\mathbb{1}\{\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \le 0\}\right)' W \left(\sum_{\iota=2}^{p} P_{L(\iota)}(Z + h_\iota)\mathbb{1}\{\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \le 0\}\right)}{\xi^2}\right] \\
&\quad - 2\tau\, \mathbb{E}\left[\frac{\sum_{\iota=2}^{p} (Z + h_\iota)' P'_{L(\iota)} W Z\, \mathbb{1}\{\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \le 0\}}{\xi}\right] \\
&= tr(W\Omega) + \tau^2 \sum_{\iota=2}^{p} \mathbb{E}\left[\xi^{-1}_{L(\iota)}\right] \mathbb{P}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \le 0) - 2\tau \sum_{\iota=2}^{p} \mathbb{E}\left[\frac{(Z + h_\iota)' P'_{L(\iota)} W Z}{\xi_{L(\iota)}}\right] \mathbb{P}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \le 0) \\
&= tr(W\Omega) + \sum_{\iota=2}^{p} \left( \tau^2\, \mathbb{E}\left[\xi^{-1}_{L(\iota)}\right] - 2\tau\, \mathbb{E}\left[\frac{(Z + h_\iota)' P'_{L(\iota)} W Z}{\xi_{L(\iota)}}\right] \right) \mathbb{P}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \le 0). \qquad (A.3)
\end{aligned}$$
Take a closer look at the second expectation term of the $\iota$-th summand,
$$\mathbb{E}\left[\frac{(Z + h_\iota)' P'_{L(\iota)} W Z}{\xi_{L(\iota)}}\right] = \mathbb{E}\left[\eta_{L(\iota)}(Z + h_\iota)' P'_{L(\iota)} W Z\right], \qquad \eta_{L(\iota)}(x) = \frac{x}{x' B_{L(\iota)} x}.$$
Next, before applying Stein's Lemma, we calculate
$$\frac{\partial}{\partial x'}\eta_{L(\iota)}(x) = \frac{1}{x' B_{L(\iota)} x}\,\mathcal{I} - \frac{2 B_{L(\iota)} x x'}{(x' B_{L(\iota)} x)^2}. \qquad (A.4)$$
Using Lemma 3 and (A.4),
$$\begin{aligned}
\mathbb{E}\left[\eta_{L(\iota)}(Z + h_\iota)' P'_{L(\iota)} W Z\right] &= \mathbb{E}\, tr\left(\frac{\partial}{\partial x'}\eta_{L(\iota)}(Z + h_\iota)\, P'_{L(\iota)} W \Omega\right) \\
&= \mathbb{E}\, tr\left(\frac{P'_{L(\iota)} W \Omega}{(Z + h_\iota)' B_{L(\iota)} (Z + h_\iota)}\right) - 2\,\mathbb{E}\, tr\left(\frac{B_{L(\iota)}(Z + h_\iota)(Z + h_\iota)' P'_{L(\iota)} W \Omega}{[(Z + h_\iota)' B_{L(\iota)} (Z + h_\iota)]^2}\right). \qquad (A.5)
\end{aligned}$$
Moreover,
$$\begin{aligned}
tr\left(B_{L(\iota)}(Z + h_\iota)(Z + h_\iota)' P'_{L(\iota)} W \Omega\right) &= (Z + h_\iota)' P'_{L(\iota)} W \Omega B_{L(\iota)} (Z + h_\iota) \\
&= (Z + h_\iota)' P'_{L(\iota)} W^{1/2} W^{1/2\prime} \Omega P'_{L(\iota)} W^{1/2} W^{1/2\prime} P_{L(\iota)} (Z + h_\iota) \\
&= (Z + h_\iota)' \tilde{B}'_{L(\iota)} A_{L(\iota)} \tilde{B}_{L(\iota)} (Z + h_\iota), \qquad (A.6)
\end{aligned}$$
where $\tilde{B}_{L(\iota)} = W^{1/2\prime} P_{L(\iota)}$ and $B_{L(\iota)} = \tilde{B}'_{L(\iota)}\tilde{B}_{L(\iota)}$. Combining (A.5), (A.6), and the fact that $tr(A_{L(\iota)}) = tr(P'_{L(\iota)} W \Omega)$, we get
$$\begin{aligned}
\mathbb{E}\left[\eta_{L(\iota)}(Z + h_\iota)' P'_{L(\iota)} W Z\right] &= \mathbb{E}\left(\frac{tr(A_{L(\iota)})}{(Z + h_\iota)' B_{L(\iota)} (Z + h_\iota)}\right) - 2\,\mathbb{E}\left(\frac{(Z + h_\iota)' \tilde{B}'_{L(\iota)} A_{L(\iota)} \tilde{B}_{L(\iota)} (Z + h_\iota)}{[(Z + h_\iota)' B_{L(\iota)} (Z + h_\iota)]^2}\right) \\
&\ge \mathbb{E}\left[\frac{tr(A_{L(\iota)}) - 2\varphi_{max}(A_{L(\iota)})}{(Z + h_\iota)' B_{L(\iota)} (Z + h_\iota)}\right]. \qquad (A.7)
\end{aligned}$$
The inequality in (A.7) comes from the fact that
$$\frac{x' A_{L(\iota)} x}{x'x} \le \max_x \frac{x' A_{L(\iota)} x}{x'x} = \varphi_{max}(A_{L(\iota)}),$$
which gives $(Z + h_\iota)' \tilde{B}'_{L(\iota)} A_{L(\iota)} \tilde{B}_{L(\iota)} (Z + h_\iota) \le (Z + h_\iota)' B_{L(\iota)} (Z + h_\iota)\, \varphi_{max}(A_{L(\iota)})$. Combining (A.7) and (A.3), we can show that
$$\begin{aligned}
\mathbb{E}[\psi' W \psi] &\le tr(W\Omega) - \tau \sum_{\iota=2}^{p} \mathbb{E}\left[\frac{2\left(tr(A_{L(\iota)}) - 2\varphi_{max}(A_{L(\iota)})\right) - \tau}{(Z + h_\iota)' B_{L(\iota)} (Z + h_\iota)}\right] \mathbb{P}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \le 0) \\
&= tr(W\Omega) - \tau \sum_{\iota=2}^{p} \mathbb{E}\left[\frac{2\left(tr(A_{L(\iota)}) - 2\varphi_{max}(A_{L(\iota)})\right) - \tau}{\xi_{L(\iota)}}\right] \mathbb{P}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \le 0). \qquad (A.8)
\end{aligned}$$
In order for the shrinkage estimator to have lower asymptotic risk than the unrestricted estimator, given $\tau > 0$, we require
$$\begin{aligned}
&\sum_{\iota=2}^{p} \left[2\left(tr(A_{L(\iota)}) - 2\varphi_{max}(A_{L(\iota)})\right) - \tau\right] \mathbb{E}[\xi^{-1}_{L(\iota)}]\, \mathbb{P}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \le 0) \ge 0 \\
&\Longleftrightarrow\quad 2\sum_{\iota=2}^{p} \left(tr(A_{L(\iota)}) - 2\varphi_{max}(A_{L(\iota)})\right) \mathbb{E}[\xi^{-1}_{L(\iota)}]\, \mathbb{P}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \le 0) - \tau \sum_{\iota=2}^{p} \mathbb{E}[\xi^{-1}_{L(\iota)}]\, \mathbb{P}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \le 0) \ge 0 \\
&\Longleftrightarrow\quad 0 < \tau \le 2\sum_{\iota=2}^{p} \left(tr(A_{L(\iota)}) - 2\varphi_{max}(A_{L(\iota)})\right)\gamma_\iota, \qquad (A.9)
\end{aligned}$$
where
$$\gamma_\iota \equiv \frac{\mathbb{E}[\xi^{-1}_{L(\iota)}]\, \mathbb{P}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \le 0)}{\sum_{\iota=2}^{p} \mathbb{E}[\xi^{-1}_{L(\iota)}]\, \mathbb{P}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \le 0)}.$$
Given the condition in (A.9) and $h < \infty$, the risk in (A.8) is strictly less than $tr(W\Omega)$, which establishes (23). $\blacksquare$
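The dominance result can be illustrated numerically. The sketch below (ours, not part of the paper) simulates a textbook special case rather than the ICSE itself: a normal-means problem with $W = \mathcal{I}$, a single binding configuration, and shrinkage toward the origin, for which the admissible range of the tuning constant reduces to $0 < \tau \le 2(m - 2)$. All function and variable names are hypothetical.

```python
import numpy as np

def simulate_risk(m=8, tau=None, h_scale=1.0, n_rep=200_000, seed=0):
    """Monte Carlo risk of a positive-part shrinkage estimator in a toy
    normal-means problem: Z ~ N(h, I_m), shrink Z toward 0 with weight
    w = (1 - tau / ||Z||^2)_+, and compare E||w*Z - h||^2 with E||Z - h||^2 = m.
    This is a simplified special case (W = I, one binding configuration), not the ICSE."""
    rng = np.random.default_rng(seed)
    if tau is None:
        tau = m - 2                              # midpoint of the admissible range (0, 2(m-2)]
    h = h_scale * np.ones(m)                     # local parameter (distance from the restriction)
    Z = rng.standard_normal((n_rep, m)) + h
    xi = np.sum(Z**2, axis=1)                    # ||Z||^2 plays the role of xi
    w = np.clip(1.0 - tau / xi, 0.0, None)       # positive-part weight
    psi = w[:, None] * Z                         # shrinkage estimator of h
    risk_unrestricted = np.mean(np.sum((Z - h)**2, axis=1))
    risk_shrinkage = np.mean(np.sum((psi - h)**2, axis=1))
    return risk_unrestricted, risk_shrinkage

if __name__ == "__main__":
    for scale in (0.0, 0.5, 2.0):
        r_u, r_s = simulate_risk(h_scale=scale)
        print(f"h_scale={scale}: unrestricted risk={r_u:.3f}, shrinkage risk={r_s:.3f}")
```

In this simplified setting the shrinkage risk stays below the unrestricted risk for every value of the local parameter, with the largest gains when the mean is close to the shrinkage target, mirroring the uniform-improvement statement above.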
Appendix B Mixtures of equality and inequality constraints

Some estimation problems involve a combination of equality and inequality constraints, e.g., estimating a parameter vector of probabilities, which have to be greater than zero and sum to one. It turns out that it is straightforward to incorporate equality constraints into our analysis. To be specific, assume that there are $q$ equality constraints and $p - q$ inequality constraints. Since the distribution of the unconstrained estimator is unaffected by the composition of constraints, the main object of interest is the distribution of the constrained estimator, which determines the form of the shrinkage parameter.

Following the intuition from Section 4, the asymptotic distribution of the restricted estimator is given by
$$n^{1/2}(\tilde{\theta}_n - \theta) \to_d Z - \sum_{\iota=1}^{p-q} P_{L(\iota)}(Z + h_\iota)\,\mathbb{1}\{\tilde{\mu}_\iota > 0,\, \tilde{\mu}_{-\iota} \le 0\}. \qquad (B.1)$$
When none of the inequality constraints bind ($\iota = 1$), the distribution in (B.1) collapses to $Z - P_{L(1)}(Z + h_1)$, which is simply the distribution of the restricted estimator under $q$ equality constraints.

By analogy, the optimal level of shrinkage is
$$\tau^* = \sum_{\iota=1}^{p-q} \left(tr(A_{L(\iota)}) - 2\varphi_{max}(A_{L(\iota)})\right)\gamma_\iota, \qquad (B.2)$$
where
$$\gamma_\iota \equiv \frac{\mathbb{E}[\xi^{-1}_{L(\iota)}]\, \mathbb{P}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \le 0)}{\sum_{\iota=1}^{p-q} \mathbb{E}[\xi^{-1}_{L(\iota)}]\, \mathbb{P}(\tilde{\mu}_\iota > 0, \tilde{\mu}_{-\iota} \le 0)}.$$
The presence of equality constraints makes it easier to satisfy condition (21). For example, when $W = \Omega^{-1}$, if $q > 2$, then $\tau^* > 0$. The presence of equality constraints provides additional ex ante information about the shrinkage direction. However, the exact shrinkage direction will still depend on which additional constraints bind.
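To make (B.2) concrete, the following sketch evaluates the formula given user-supplied ingredients; the inputs (the matrices $A_{L(\iota)}$ and the weights $\gamma_\iota$) would in practice come from plug-in estimates, and the function name is ours. With a single configuration it reduces to the familiar James-Stein constant $tr(A) - 2\varphi_{max}(A)$.

```python
import numpy as np

def optimal_tau(A_list, gamma):
    """Evaluate tau* = sum_iota (tr(A_iota) - 2*phi_max(A_iota)) * gamma_iota, as in (B.2).
    A_list : list of square ndarrays, one A_{L(iota)} per binding configuration.
    gamma  : nonnegative weights gamma_iota summing to one."""
    gamma = np.asarray(gamma, dtype=float)
    terms = []
    for A in A_list:
        A = np.asarray(A, dtype=float)
        # largest eigenvalue of A_{L(iota)}; use eigvalsh instead if A is symmetric
        phi_max = np.linalg.eigvals(A).real.max()
        terms.append(np.trace(A) - 2.0 * phi_max)
    return float(np.dot(terms, gamma))
```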
Appendix C Example of a linear model with sign restrictions
To illustrate the idea, we derive the asymptotic distribution of the restricted estimator for a linear model with sign restrictions by solving an asymptotically equivalent problem.

Consider a simple linear model with sign restrictions. Let $y_i = x_i'\theta + \varepsilon_i$, where $\{(y_i, x_i, \varepsilon_i)\}_{i=1}^n$ are iid. For simplicity, assume that we have only two parameters, $\theta \in \mathbb{R}^2$, and we want to shrink towards all coefficients being non-negative, $\theta_j \ge 0$, $j = 1, 2$. In this example $\Theta_0 = \{\theta \in \Theta : \theta \ge 0\}$, $r(\theta) = \theta$, $R = \mathcal{I}_2$, and the local-to-zero assumption becomes $\theta = c n^{-1/2}$. The unconstrained least squares estimator is
$$\hat{\theta}_n = (X'X)^{-1}(X'Y),$$
where $X$ is $n \times 2$ and $Y$ is $n \times 1$. The restricted estimator solves
$$\tilde{\theta}_n = \arg\min_{\theta \in \Theta} \|Y - X\theta\|^2 \quad \text{s.t.} \quad \theta \ge 0. \qquad (C.1)$$
A solution to (C.1) depends on which constraints bind, which is in turn driven by the location of the true parameter value $\theta$ relative to the restricted space $\Theta_0$. Figure C.1 illustrates the main idea. When $\theta \in \Theta_0$, meaning none of the constraints bind, the restricted estimator coincides with the unrestricted one, $\tilde{\theta}_n = \hat{\theta}_n$ (see Figure C.1(a)). However, if $\theta \notin \Theta_0$, the restricted estimator becomes a projection of the unrestricted estimator on the closest boundary. In Figure C.1(b), $\theta$ is close to the boundary where $\theta_1 = 0$ and $\theta_2 > 0$; hence, $\tilde{\theta}_n$ is a projection of $\hat{\theta}_n$ on the half-space $\{\theta : \theta_1 = 0, \theta_2 > 0\}$. Figure C.1(c) demonstrates the reciprocal case, where $\tilde{\theta}_n$ is a projection on $\{\theta : \theta_1 > 0, \theta_2 = 0\}$. Finally, in Figure C.1(d), $\tilde{\theta}_n$ is a projection on $\{\theta : \theta_1 = 0, \theta_2 = 0\}$.

[Figure C.1. Geometry of the restricted estimator. Panels: (a) none of the constraints bind; (b) the first constraint binds, $\theta_1 = 0$; (c) the second constraint binds, $\theta_2 = 0$; (d) both constraints bind. The figure shows the behavior of the restricted estimator depending on the position of the true parameter value in the two-dimensional case, $\theta_1 \ge 0$, $\theta_2 \ge 0$. The blue dashed lines are LS objective contour sets.]

Let
$$\mathcal{J}_n = n^{-1}X'X, \quad \mathcal{J} = \mathbb{E}[x_i x_i'], \quad Z_n = n^{1/2}(X'X)^{-1}X'\varepsilon, \quad Z = \mathcal{J}^{-1}G, \quad G \sim \mathcal{N}(0, \mathcal{V}), \quad \mathcal{V} = \mathbb{E}[x_i x_i' \varepsilon_i^2].$$
The quadratic approximation of the objective function takes the form
$$q_n(\lambda) = \frac{1}{2n}(\lambda - Z_n)'(X'X)(\lambda - Z_n).$$
The vector of Kuhn-Tucker multipliers is
$$\tilde{\mu}_n = -(n^{-1}X'X)^{-1}(Z_n + c) \to_d \tilde{\mu} = -\mathcal{J}^{-1}(Z + c).$$
All constraints are satisfied as equalities if $\tilde{\mu}_{n,j} > 0$, $j = 1, 2$. In this case the restricted estimator is equal to zero, $\tilde{\theta}_n = 0$, and $\tilde{\lambda}_n = -c$. All constraints are satisfied as strict inequalities if $\tilde{\mu}_{n,j} \le 0$, $j = 1, 2$. Then the restricted estimator equals the unrestricted one, $\tilde{\theta}_n = \hat{\theta}_n$, and $\tilde{\lambda}_n = Z_n$. The first constraint binds if $\tilde{\mu}_{n,1} > 0$ and $\tilde{\mu}_{n,2} \le 0$, which implies that $\tilde{\lambda}_1 = -c_1$ and $\tilde{\lambda}_2 = Z_2 - \mathcal{J}_{22}^{-1}\mathcal{J}_{21}(Z_1 + c_1)$. We get a similar result for the case where the second constraint binds. Therefore, according to Lemma 2, the resulting asymptotic distribution of the restricted estimator takes the form
$$n^{1/2}\begin{pmatrix}\tilde{\theta}_{n,1} - \theta_1 \\ \tilde{\theta}_{n,2} - \theta_2\end{pmatrix} \to_d \begin{pmatrix}Z_1 \\ Z_2\end{pmatrix}\mathbb{1}\{\tilde{\mu}_1 \le 0, \tilde{\mu}_2 \le 0\} + \begin{pmatrix}-c_1 \\ Z_2 - \mathcal{J}_{22}^{-1}\mathcal{J}_{21}(Z_1 + c_1)\end{pmatrix}\mathbb{1}\{\tilde{\mu}_1 > 0, \tilde{\mu}_2 \le 0\} + \begin{pmatrix}Z_1 - \mathcal{J}_{11}^{-1}\mathcal{J}_{12}(Z_2 + c_2) \\ -c_2\end{pmatrix}\mathbb{1}\{\tilde{\mu}_1 \le 0, \tilde{\mu}_2 > 0\} + \begin{pmatrix}-c_1 \\ -c_2\end{pmatrix}\mathbb{1}\{\tilde{\mu}_1 > 0, \tilde{\mu}_2 > 0\}.$$
(The Hessian matrix here is just $2 \times 2$; therefore, the sub-matrices in the expression for $\tilde{\lambda}$ are just the corresponding scalar elements of $\mathcal{J}$.)
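As a computational illustration of (C.1), the restricted estimator can be obtained with an off-the-shelf bounded least squares solver, and one can check directly which of the four regimes in Figure C.1 a given sample falls into. The use of scipy and all names below are our own choices, not part of the paper.

```python
import numpy as np
from scipy.optimize import lsq_linear

def restricted_ls(X, Y):
    """Sign-restricted least squares: argmin ||Y - X theta||^2 s.t. theta >= 0, as in (C.1)."""
    res = lsq_linear(X, Y, bounds=(0.0, np.inf))
    return res.x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, c = 500, np.array([0.5, -1.0])                  # local parameters: theta = c / sqrt(n)
    theta = c / np.sqrt(n)                             # true value close to (and partly outside) the boundary
    X = rng.standard_normal((n, 2))
    Y = X @ theta + rng.standard_normal(n)
    theta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]   # unrestricted LS
    theta_tilde = restricted_ls(X, Y)                  # restricted LS
    binding = theta_tilde <= 1e-10                     # which sign constraints bind in this sample
    print("unrestricted:", theta_hat, "restricted:", theta_tilde, "binding:", binding)
```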
Appendix D Empirical Bayes Estimator

In matrix form, the linear model from Section 7 is $Y = X\theta + \varepsilon$. Hence, the likelihood density is
$$p(Y|\theta) = (2\pi)^{-n/2}\exp\left\{-\frac{1}{2}(Y - X\theta)'(Y - X\theta)\right\}.$$
We assume the prior $\theta|\nu \sim \mathcal{N}(0, \mathcal{I}_m/\nu)\,\mathbb{1}\{\theta \ge 0\}$, whose density is
$$p(\theta|\nu) = \prod_{j=1}^{m}\left(\frac{2\pi}{\nu}\right)^{-1/2}\mathbb{P}_{\theta|\nu}(\theta_j \ge 0)^{-1}\exp\left\{-\frac{\nu}{2}\theta_j^2\right\}\mathbb{1}\{\theta_j \ge 0\} = \left(\frac{2\pi}{\nu}\right)^{-m/2}\Phi(0)^{-m}\exp\left\{-\frac{\nu}{2}\theta'\theta\right\}\mathbb{1}\{\theta \ge 0\},$$
where $\Phi(\cdot)$ is the cdf of the standard normal distribution. By Bayes' rule, the posterior distribution is
$$p(\theta|Y, \nu) \propto p(Y|\theta)\,p(\theta|\nu) \propto \exp\left\{-\frac{1}{2}(\theta - \bar{\theta})'\bar{V}_\theta^{-1}(\theta - \bar{\theta})\right\}\mathbb{1}\{\theta \ge 0\},$$
where
$$\bar{\theta} = (X'X + \nu\mathcal{I}_m)^{-1}X'Y, \qquad \bar{V}_\theta = (X'X + \nu\mathcal{I}_m)^{-1}.$$
Thus, the posterior also has a truncated normal form, $\theta|Y, \nu \sim \mathcal{N}(\bar{\theta}, \bar{V}_\theta)\,\mathbb{1}\{\theta \ge 0\}$. The marginal likelihood is
$$p(Y|\nu) = \int p(Y|\theta)\,p(\theta|\nu)\,d\theta.$$
We can use Bayes' rule again to get an explicit expression for the marginal likelihood density,
$$p(Y|\nu) = \frac{p(Y|\theta)\,p(\theta|\nu)}{p(\theta|Y, \nu)} = (2\pi)^{-n/2}\,\nu^{m/2}\,\Phi(0)^{-m}\,|\bar{V}_\theta|^{1/2}\,D\exp\left\{-\frac{1}{2}\left[Y'Y - Y'X(X'X + \nu\mathcal{I}_m)^{-1}X'Y\right]\right\},$$
where $D = \mathbb{P}_{\theta|Y,\nu}(\theta \ge 0)$ is a normalizing constant for the posterior. Note that $D$ depends on $\nu$. We select $\nu$ by maximizing the marginal likelihood density:
$$\hat{\nu} = \arg\max_{\nu \in \mathbb{R}_+} p(Y|\nu).$$
Then the Empirical Bayes estimator is simply the mean of the posterior distribution given $\hat{\nu}$, $p(\theta|Y, \hat{\nu})$.
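A minimal numerical sketch of this procedure is given below. The implementation choices are ours and are not prescribed by the paper: a grid search over $\nu$, Monte Carlo evaluation of $D$, rejection sampling from the truncated posterior, and the error variance normalized to one as in the likelihood above; all names are hypothetical.

```python
import numpy as np

def log_marginal_likelihood(nu, X, Y, n_mc=20_000, rng=None):
    """log p(Y | nu) for the prior N(0, I/nu) 1{theta >= 0}, up to a constant in nu;
    D = P(theta >= 0 | Y, nu) is evaluated by Monte Carlo."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, m = X.shape
    V_bar = np.linalg.inv(X.T @ X + nu * np.eye(m))         # posterior covariance
    theta_bar = V_bar @ (X.T @ Y)                            # posterior mean before truncation
    draws = rng.multivariate_normal(theta_bar, V_bar, size=n_mc)
    D = max(np.mean(np.all(draws >= 0.0, axis=1)), 1e-12)    # normalizing constant D
    quad = Y @ Y - Y @ X @ V_bar @ (X.T @ Y)
    sign, logdet = np.linalg.slogdet(V_bar)
    return 0.5 * m * np.log(nu) + 0.5 * logdet + np.log(D) - 0.5 * quad

def empirical_bayes(X, Y, nu_grid=np.geomspace(1e-2, 1e2, 40), n_mc=50_000, seed=0):
    """Pick nu on a grid by maximizing the marginal likelihood, then return the posterior
    mean of N(theta_bar, V_bar) truncated to theta >= 0 (crude rejection sampling)."""
    rng = np.random.default_rng(seed)
    nu_hat = max(nu_grid, key=lambda nu: log_marginal_likelihood(nu, X, Y, rng=rng))
    m = X.shape[1]
    V_bar = np.linalg.inv(X.T @ X + nu_hat * np.eye(m))
    theta_bar = V_bar @ (X.T @ Y)
    draws = rng.multivariate_normal(theta_bar, V_bar, size=n_mc)
    accepted = draws[np.all(draws >= 0.0, axis=1)]
    if accepted.size == 0:                                   # fall back if the truncation region has tiny mass
        return nu_hat, np.clip(theta_bar, 0.0, None)
    return nu_hat, accepted.mean(axis=0)
```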
Appendix E Empirical Application: Implementation Details

The set-up is close to Fessler and Kasy (2019). Let $Q$, $P$, and $Y$ denote the quantity (of gasoline in our application) demanded by a consumer, the price paid, and the consumer's income. Assume that we observe data $\{Q_i, P_i, Y_i\}_{i=1}^n$ on $n$ randomly sampled consumers. We assume that the variables are related as
$$Q = g(P, Y) + U,$$
where $g$ is an unknown demand function, and $U$ satisfies $\mathbb{E}[U|P = p, Y = y] = 0$ for all $p$ and $y$. The latter assumption on the unobserved shock requires that prices and incomes are statistically independent of unobserved preference heterogeneity across consumers; that is, we ignore endogeneity concerns for ease of exposition.

Our goal is to estimate the price and income elasticities of $g(p, y)$, $\beta_{pj}$ and $\beta_{yj}$, respectively, at different price levels $p_1, \ldots, p_J$ and a given income level $y$,
$$\beta_{pj} = \frac{\partial \log g(p_j, y)}{\partial \log p}, \qquad \beta_{yj} = \frac{\partial \log g(p_j, y)}{\partial \log y}.$$
We can obtain the unrestricted elasticity estimates for a price-income pair $(p_j, y)$ using a local linear regression (LLR),
$$\left(\hat{\alpha}_j, \hat{\beta}_{pj}, \hat{\beta}_{yj}\right) = \arg\min_{a, b_p, b_y}\sum_{i=1}^{n}\left(Q_i - a - b_p(\log P_i - \log p_j) - b_y(\log Y_i - \log y)\right)^2 \times K_h\left(\log P_i - \log p_j, \log Y_i - \log y\right), \qquad (E.1)$$
where $\hat{\alpha}_j = \hat{g}(p_j, y)$, and $K_h$ is a kernel function with bandwidth $h$ (we use the Epanechnikov kernel). We use the nonparametric bootstrap to estimate the joint variance $V$ of $\hat{\beta} \equiv \left(\hat{\beta}_{pj}, \hat{\beta}_{yj}\right)_{j=1}^{J}$ across all $j$. As in Fessler and Kasy (2019), the variance of $\hat{\alpha}_j$ is negligible compared to $V$ in our application.

The Slutsky condition is an inequality constraint on the demand function ensuring that the compensated own-price elasticities are non-positive,
$$\frac{\partial g(p_j, y)}{\partial p} + \frac{\partial g(p_j, y)}{\partial y}\,g(p_j, y) \le 0, \qquad j = 1, \ldots, J.$$
Rewritten in terms of elasticities, it gives us the desired theoretical restriction
$$\beta_{pj} + \beta_{yj}\,g(p_j, y)\,\frac{p_j}{y} \le 0, \qquad j = 1, \ldots, J. \qquad (E.2)$$
The restricted estimator $\tilde{\beta} \equiv \left(\tilde{\beta}_{pj}, \tilde{\beta}_{yj}\right)_{j=1}^{J}$ solves (E.1) under the condition (E.2) for all $j$. The ICSE takes the weighted average form $\hat{\beta}^* = \hat{w}\hat{\beta} + (1 - \hat{w})\tilde{\beta}$, where
$$\hat{w} = \left(1 - \frac{\hat{\tau}^*}{n(\hat{\beta} - \tilde{\beta})'(\hat{\beta} - \tilde{\beta})}\right)_+,$$
and $\hat{\tau}^*$ is given by (28). Note that the feasible ICSE shrinks $\hat{\beta}$ to $\tilde{\beta}$ jointly across all $j$.
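A minimal sketch of the final combination step is given below, assuming the unrestricted and restricted elasticity vectors, the plug-in $\hat{\tau}^*$, and the sample size have already been computed; the variable names are ours.

```python
import numpy as np

def icse_combine(beta_hat, beta_tilde, tau_star_hat, n):
    """Feasible ICSE: beta* = w * beta_hat + (1 - w) * beta_tilde with the positive-part weight
    w = (1 - tau_star_hat / (n * (beta_hat - beta_tilde)' (beta_hat - beta_tilde)))_+.
    Shrinkage is applied jointly across all price points j."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    beta_tilde = np.asarray(beta_tilde, dtype=float)
    diff = beta_hat - beta_tilde
    denom = n * float(diff @ diff)
    # if no constraint is violated, beta_hat = beta_tilde and the weight is immaterial
    w = 0.0 if denom == 0.0 else max(0.0, 1.0 - tau_star_hat / denom)
    return w * beta_hat + (1.0 - w) * beta_tilde, w
```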