A Distance Covariance-based Estimator
AA Distance Covariance-based Estimator ∗ Emmanuel Selorm Tsyawo † Abdul-Nasah Soale ‡ February 2020
Abstract
Weak instruments present a major setback to empirical work. This paperintroduces an estimator that admits weak, uncorrelated, or mean-independentinstruments that are non-independent of endogenous covariates. Relative toconventional instrumental variable methods, the proposed estimator weakensthe relevance condition considerably without imposing a stronger exclusion re-striction. Identification mainly rests on (1) a weak conditional median exclu-sion restriction imposed on pairwise differences in disturbances and (2) non-independence between covariates and instruments. Under mild conditions, theestimator is consistent and asymptotically normal. Monte Carlo experimentsshowcase an excellent performance of the estimator, and two empirical examplesillustrate its practical utility.
Keywords: distance covariance, dependence, weak instrument, endogeneity, U -statistics JEL classification: C13, C14, C26 ∗ The authors would like to thank Brantly Callaway, Oleg Rytchkov, Dai Zusai, and Weige Huangfor helpful comments. † Corresponding author, [email protected], FGSES, Mohammed VI Polytechnic Uni-versity. ‡ [email protected], Department of Statistics, Temple University. a r X i v : . [ ec on . E M ] F e b Introduction
Empirical work in economics largely depends on instrumental variable (IV) meth-ods. When instruments are weakly correlated with endogenous covariates, thesemethods become unreliable; they can produce biased estimates and hypothesis testswith large size distortions. Moreover, IV methods are infeasible when instruments areuncorrelated with endogenous variables. Econometric literature on the weak instru-ment problem largely focusses on detection and weak-instrument-robust inference.Theoretical progress on estimation is, however, scant. This paper introduces an es-timator which minimises the distance covariance (dCov) between a vector of distur-bances and a set of instruments. The dCov is introduced by Sz´ekely, Rizzo, Bakirov,et al. (2007). It measures the distributional dependence between two random vari-ables of arbitrary dimensions. The proposed Minimum Dependence estimator (MDephereafter) weakens the relevance (or correlatedness) requirement considerably andadmits instruments that are non-independent of endogenous covariates.The non-independence identification condition implies the MDep estimator max-imises the set of relevant instruments. This feature of the proposed estimator makesit fundamentally different from conventional methods. The exclusion restriction im-posed is a conditional median independence assumption on pairwise differences indisturbances. Like the uncorrelatedness exclusion restriction of standard IV methods(see, e.g., Wooldridge (2010, Assumption 2SLS.1)), the MDep exclusion restriction isweaker than the (strong) conditional mean independence restriction (see, e.g., Chen,Chen, and Lewis (2020) and Dieterle and Snell (2016, Assumption A.1)). Unlike IVmethods, however, the form of dependence between endogenous covariates and instru-ments needs to be neither known nor specified, effectively eliminating the sensitivityof estimates to first-stage model specification. Although the MDep estimator gen-erally applies to linear and non-linear models typically estimated using IV methods,it shares the “robustness” feature of quantile estimators (see, e.g., Powell (1991) andOberhofer and Haupt (2016)) in the sense that its asymptotic properties do not de-pend on the existence of moments of the outcome variable. For example, the limit of This paper subsumes methods such as IV, ordinary least squares (OLS), two-stage least squares(2SLS), the Control Function (CF) approach, and generalised method of moments (GMM) underthe IV category. Dieterle and Snell (2016), for example, uncovers substantial sensitivity of conclusions to speci-fication (linear versus quadratic) of the first stage. This property holds for a class of models considered in the paper where the disturbance is linear When instruments are uncorrelatedor mean-independent of endogenous covariates, IV methods break down. The MDep,on the other hand, performs reasonably well. This paper considers two empiricalexamples that illustrate the practical usefulness of the MDep estimator. In the firstexample, the excluded instrument is IV-strong and hence MDep-strong by construc-tion. Consequently, both sets of estimates and standard errors are similar. Thesecond application presents an interesting case where the instrument has non-trivialnon-linear and non-monotone identifying variation that the MDep, unlike the IV, isable to exploit. 
In this example, MDep estimates are more precise than the IV.The econometric literature on weak instruments largely focusses on detection andweak-instrument-robust inference (e.g., Staiger and Stock (1997), Stock and Yogo(2005), Andrews, Moreira, and Stock (2006), Kleibergen and Paap (2006), Olea andPflueger (2013), Andrews and Mikusheva (2016), Sanderson and Windmeijer (2016),and Andrews and Armstrong (2017)). See Andrews, Stock, and Sun (2019) for anexcellent review. Normal distributions of conventional IV estimates can be poor andhypothesis tests based on them can be unreliable when instruments are weak (Nel-son and Startz, 1990a; Nelson and Startz, 1990b; Bound, Jaeger, and Baker, 1995).Although the econometric literature on weak instruments does not focus much onestimation, three exceptions deserve discussion. Hirano and Porter (2015) provesthat if instruments can be arbitrarily weak, no unbiased estimator exists withoutfurther restrictions. Andrews and Armstrong (2017) shows that asymptotically un-biased estimation is possible granted one correctly imposes a sign restriction on thefirst-stage regression coefficient. By extracting non-linear identifying variation ininstruments using machine learning techniques (e.g., neural networks and randomforests) and sample splitting, Chen, Chen, and Lewis (2020) achieves higher precisionand increased instrument strength. While it is conceivable to take transformationsof the instrument to extract more variation (see, e.g., Dieterle and Snell (2016)), in the intercept. Simulation results on non-linear models are available in the Online Appendix. When it is feasible, the approach easily results inhigh-dimensionality (see, e.g., Belloni, Chen, Chernozhukov, and Hansen (2012) andHansen and Kozbur (2014)). This paper’s contribution to the literature on weakinstruments is the MDep estimator. In addition to linear identifying variation inendogenous covariates that IV methods exploit, the MDep estimator exploits ad-ditional monotone and non-monotone forms of dependence that may exist betweeninstruments and endogenous covariates for identification. The MDep gives a new per-spective to handling weak IVs in empirical practice; poorly correlated, uncorrelated,or mean-independent instruments in the conventional IV setting can be MDep-strong.Several applications of the dCov measure have emerged since the seminal paperSz´ekely, Rizzo, Bakirov, et al. (2007). The dCov measure is used primarily to testindependence between two random variables of arbitrary dimensions. It is consistentagainst all types of dependent alternatives including linear, non-linear, monotone,and non-monotone forms of dependence. Sz´ekely, Rizzo, et al. (2014) derives thepartial distance covariance ( pdCov ) test of independence between two random vari-ables with dependence on a third set removed. Based on the dCov measure, Shaoand Zhang (2014) derives the martingale difference divergence (MDD) measure totest for conditional mean independence, and Park, Shao, Yao, et al. (2015) derivesa test of partial conditional mean independence. Su and Zheng (2017) and Xu andChen (2020), respectively, adapt the Shao and Zhang (2014) MDD measure to testmodel specification in conditional mean and quantile regression models. Zhou (2012)introduces the auto-covariance function for strictly stationary multivariate time se-ries. Davis, Matsui, Mikosch, Wan, et al. (2018) tests goodness-of-fit of time seriesmodels by applying an adapted distance covariance function to residuals. 
The closestwork to the current paper is perhaps Sheng and Yin (2013); it proposes an estimatorof single-index models as a tool for sufficient dimension reduction using the dCov asa criterion. There is a growing literature on the application of the dCov measureto variable selection and dimension reduction (see, e.g., Li, Zhong, and Zhu (2012),Shao and Zhang (2014), and Zhong and Zhu (2015)). The reader is referred to Edel-mann, Fokianos, and Pitsillou (2019) for a review of several applications of the dCov As an illustration, let outcome y = x + u , endogenous covariate x = x ∗ + u , and instrument z = | x ∗ | with cov( u, z ) = 0. For a symmetric mean-zero random variable x ∗ , cov( x, z ) = 0. It is notfeasible, without further information, to transform z in order to induce correlation with x . This section presents the MDep estimator. The estimator is based on the distancecovariance measure proposed by Sz´ekely, Rizzo, Bakirov, et al. (2007).
Sz´ekely, Rizzo, Bakirov, et al. (2007) introduces distance covariance (dCov) as ameasure of dependence between two random variables of arbitrary dimensions. Let ϕ U,Z ( t, s ) denote the joint characteristic function of [ U, Z ], and [ ϕ U ( t ) , ϕ Z ( s )] denotetheir marginal characteristic functions. The distance covariance measure is definedbelow. Definition 2.1.
The distance covariance between random variables U and Z withfinite first moments is the non-negative number V ( U, Z ) defined by V ( U, Z ) ≡ | ϕ U,Z ( t, s ) − ϕ U ( t ) ϕ Z ( s ) | w = (cid:90) | ϕ U,Z ( t, s ) − ϕ U ( t ) ϕ Z ( s ) | w ( t, s ) dtds = (cid:90) (cid:12)(cid:12) E (cid:2) exp( ι ( t (cid:48) U + s (cid:48) Z )) (cid:3) − E (cid:2) exp( ιt (cid:48) U ) (cid:3) E (cid:2) exp( ιs (cid:48) Z ) (cid:3)(cid:12)(cid:12) w ( t, s ) dtds (2.1) where ι = √− and the weight w ( t, s ) is an arbitrary positive function for which theintegration exists. | ζ | = ζ ¯ ζ , where ¯ ζ is the complex conjugate of ζ . | ϕ U,Z ( t, s ) − ϕ U ( t ) ϕ Z ( s ) | w ≥ U and Z , and | ϕ U,Z ( t, s ) − ϕ U ( t ) ϕ Z ( s ) | w = 0 if and only if U and Z are independent.A sample from the joint distribution of [ U, X, Z ] is denoted by the n × ( p u + p x + p z )matrix [ U , X , Z ]. Define E n [ ξ i ] ≡ n (cid:80) ni =1 ξ i and E n [ ξ ij ] ≡ n (cid:80) ni =1 (cid:80) nj =1 ξ ij . p u = 1 ismaintained throughout the rest of the paper since this paper considers only univariateresponse models. The distance covariance measure defined in terms of empiricalcharacteristic functions, | ϕ U , Z ( t, s ) − ϕ U ( t ) ϕ Z ( s ) | w,n , is (cid:90) (cid:12)(cid:12) E n (cid:2) exp( ιtu i + s (cid:48) z i ) (cid:3) − E n (cid:2) exp( ιtu i ) (cid:3) E n (cid:2) exp( ιs (cid:48) z i ) (cid:3)(cid:12)(cid:12) w ( t, s ) dtds. Using the weight function w ( t, s ) = ( c p u c p z || t || p u || s || p z ) − where c p = π (1+ p ) / Γ((1+ p ) / for p ≥ || · || is the Euclidean norm, and Γ( · ) is the complete gamma function,Sz´ekely, Rizzo, Bakirov, et al. (2007) proves the equality | ϕ U , Z ( t, s ) − ϕ U ( t ) ϕ Z ( s ) | w,n = V n ( U , Z ) where V n ( U , Z ) = S n ( U , Z ) + S n ( U , Z ) − S n ( U , Z ) , S n ( U , Z ) = n (cid:80) ni =1 (cid:80) nj =1 | u i − u j | × || z i − z j || , S n ( U , Z ) = n (cid:80) ni =1 (cid:80) nj =1 | u i − u j | n (cid:80) ni =1 (cid:80) nj =1 || z i − z j || , and S n ( U , Z ) = n (cid:80) ni =1 (cid:80) nj =1 (cid:80) nk =1 | u i − u j ||| z i − z k || . The empirical distance covariance measure V n ( U , Z ) simplifies to (2.2) V n ( U , Z ) ≡ n n (cid:88) i =1 n (cid:88) j =1 ˇ z ij × | ˜ u ij | where ˇ z ij ≡ || ˜ z ij || − n n (cid:88) k =1 ( || ˜ z ik || + || ˜ z kj || ) + 1 n n (cid:88) k =1 n (cid:88) l =1 || ˜ z kl || , ˜ u ij ≡ u i − u j , and˜ z ij ≡ z i − z j .Sz´ekely, Rizzo, Bakirov, et al.’s (2007) weight function w ( t, s ) = ( c p u c p z || t || p u || s || p z ) − ,besides yielding a reliable measure of dependence, results in a computationally tractablemeasure, obviates the choice of smoothing parameters (e.g., bandwidth or number of See Sz´ekely, Rizzo, Bakirov, et al. (2007, eqn. 2.18). This follows because n (cid:80) ni =1 (cid:80) nj =1 (cid:80) nk =1 || z i − z k || × | u i − u j | = n (cid:80) ni =1 (cid:80) nj =1 (cid:80) nk =1 || z k − z i ||×| u i − u j | = n (cid:80) ni =1 (cid:80) nj =1 (cid:80) nk =1 || z k − z j ||×| u j − u i | = n (cid:80) ni =1 (cid:80) nj =1 (cid:80) nk =1 || z k − z j ||×| u i − u j | by the permutation symmetry of || · || , the commutativity of sums, and the permutation symmetryof | · | . While the form | ϕ U , Z ( t, s ) − ϕ U ( t ) ϕ Z ( s ) | w,n of the dCovmeasure is natural and intuitive as a non-negative measure of distributional depen-dence, the form V n ( U , Z ) is computationally tractable. The MDep estimator proposedin this paper is based on the latter. 
For the estimator presented in this paper, thesimplified formulation (2.2) has two main advantages: the permutation symmetryof ˇ z ij , i.e., ˇ z ij = ˇ z ji , simplifies the application of U-statistic theory in the proof ofasymptotic normality, and lessens computational time needed to evaluate (2.2).For ease of reference, the properties of the dCov measure in Sz´ekely, Rizzo,Bakirov, et al. (2007) and Sz´ekely and Rizzo (2009) are stated below. The followingproperties of the dCov measure hold. Properties of dCov. (a) V n ( U , Z ) = | ϕ U , Z ( t, s ) − ϕ U ( t ) ϕ Z ( s ) | w,n .(b) V n ( U , Z ) ≥ .(c) V ( U, Z ) = 0 if and only if U is independent of Z .(d) lim n →∞ V n ( U , Z ) = V ( U, Z ) almost surely (a.s.). The properties are provided and proved in the following papers: Property (a) inSz´ekely, Rizzo, Bakirov, et al. (2007, Theorem 1), Property (b) in Sz´ekely and Rizzo(2009, Theorem 4 (i)), Property (c) holds by definition (see (2.1)), and Property(d) in Sz´ekely, Rizzo, Bakirov, et al. (2007, Theorem 2). From (2.2), it is obviousthat V n ( U , Z ) = 0 if every sample observation of U , Z , or both is identical. Animportant implication of Properties (b) to (d) is that V n ( U , Z ) = 0 if and only if U is independent of Z with probability approaching one (w.p.a.1). For considerations of space and scope, the current paper is limited to paramet-ric regression models which assume the scalar outcome y i is generated by y i = G (cid:0) θ o,c + g ( x i , θ o )+ u i (cid:1) , where G ( · ) is a known invertible function, g ( · , · ) is a known dif-ferentiable function with unknown parameter vector θ o ∈ R p θ , and θ o,c is the intercept. See Sz´ekely, Rizzo, Bakirov, et al. (2007, section 2.1) for a discussion. Also, see Sz´ekely andRizzo (2012) on a discussion and proof of the uniqueness of the weight function.
7y the invertibility of G ( · ), the disturbance function is u i ( θ ) = G − ( y i ) − g ( x i , θ ).The intercept is ignored because it is differenced out in the objective function (2.3). The dependence of u i ( θ ) on covariates x i is suppressed for notational convenience.Observed data [ Y , X , Z ] include an n × Y , an n × p x matrix ofcovariates, and an n × p z matrix of instruments Z . While the class of models underconsideration includes interesting examples such as the linear model u i ( θ ) = y i − x i θ ,non-linear parametric models, e.g., u i ( θ ) = y i − exp( x i θ ), fractional response mod-els, e.g., u i ( θ ) = log( y i / (1 − y i )) − x i θ , and special cases of Box-Cox models e.g., u i ( θ ) = log( y i ) − x i θ , it excludes equally interesting ones such as binary responsemodels and quantile regression. The objective function is a squared distance covariance of U ( θ ) and Z , where the i ’th element of the n × U ( θ ) is u i ( θ ). Using (2.2), the objective function isgiven by(2.3) V n ( U ( θ ) , Z ) = 1 n n (cid:88) i =1 n (cid:88) j =1 ˇ z ij | ˜ u ij ( θ ) | where ˜ u ij ( θ ) ≡ u i ( θ ) − u j ( θ )and the MDep estimator is ˆ θ n = arg min θ ∈ Θ V n ( U ( θ ) , Z ) . (2.3) is a sum of specially weighted absolute pairwise differences in disturbances. (2.3)for example, explains why the asymptotic behaviour of the MDep estimator is similarto that of quantile estimators. It is instructive to note how the MDep relates tomoment estimators which include conventional IV methods such as OLS, IV, 2SLS,MM, and GMM. A moment estimator typically minimises the (possibly weighted)linear dependence between Z and U ( θ ) by solving Z (cid:48) U ( θ ) = . The MDep, on theother hand, minimises their distributional dependence. The MDep hence minimisesa more general form of dependence between Z and U ( θ ). See Section D.2 in the Online Appendix for an extension to the class of models such as thePoisson which allow non-linearity of the disturbance function in the intercept. Asymptotic Theory
Asymptotic theory for the MDep estimator falls within the category of estima-tors of models with non-smooth objective functions such as quantile regression (QR)methods, e.g., Koenker and Bassett Jr (1978), Powell (1991), and Oberhofer andHaupt (2016), the instrumental variable QR methods of Chernozhukov and Hansen(2006) and Chernozhukov and Hansen (2008), the CF approach to QR of Lee (2007),and the valid moment selection procedure of Cheng and Liao (2015).Define the n × p θ matrix X g ( θ ) whose i ’th row is x gi ( θ ) where x gi ( θ ) ≡ ∂u i ( θ ) ∂ θ (cid:48) ,the n × p θ matrix ˜ X g ( θ ) whose rows comprise { ˜ x gij ( θ ) : 1 ≤ i, j ≤ n } , ˜ x gij ( θ ) ≡ x gi ( θ ) − x gj ( θ ), ˜ u ij ≡ ˜ u ij ( θ o ), and the true parameter vector θ o . Also, let ˜ x g ( θ ) , x g ( θ ),˜ z , and ˇ z be any random variables defined on the supports of ˜ x gij ( θ ) , x gi ( θ ), ˜ z ij ,and ˇ z ij , respectively. Define { w i : 1 ≤ i ≤ n } , w i ≡ [ u i , x i , z i ] on a probabilityspace ( W , W , P ). Following Huber (1967), the normalised minimand is defined as Q n ( θ ) ≡ n n (cid:88) i =1 n (cid:88) j =1 q ( w i , w j , θ ), where q ( w i , w i , θ ) = ˇ z ij (cid:0) | ˜ u ij ( θ ) | − | ˜ u ij | (cid:1) . Normal-ising the minimand avoids unnecessary moment conditions on U . See Powell (1991)and Oberhofer and Haupt (2016) for example. Two sets of regularity conditions are imposed to guarantee the consistency of theMDep estimator ˆ θ n . The first set comprises sampling, smoothing, and dominanceconditions which ensure that the difference between the normalised minimand and itsexpectation converges to zero uniformly in θ . These are stated below as assumptions1(a)-1(d). Assumption 1.
For a constant
C > :(a) w i ≡ [ u i , x i , z i ] are independently distributed random vectors across i , and theoutcome is generated as y i = G (cid:0) θ o,c + g ( x i , θ o ) + u i (cid:1) .(b) u ( θ ) is measurable in [ u i , x i ] for all θ and is twice differentiable in θ for all [ u, x ] in the support of [ u i , x i ] . x g ( θ ) is measurable in x i for all θ .(c) E (cid:2) sup θ ∈ Θ || max {| ˇ z | , } ˜ x g ( θ ) || (cid:3) ≤ C . d) θ o is in the interior of a compact subset Θ in R p θ . Assumptions 1 (a) to (d) above suffice for the application of the uniform lawof large numbers in Andrews (1987). Assumption 1(a) allows for heteroskedasticity.Addressing other forms of dependence in the data, e.g., autocorrelation and clusteringis beyond the scope of the current paper. The condition on the data generating processin Assumption 1(a) and the differentiability requirement in Assumption 1(b) definethe class of models considered in the current paper, e.g., the linear model. Theyexclude models such as Koenker and Bassett Jr’s (1978) QR and binary responsemodels. ˜ u ij ( θ ) = ˜ u ij + ˜ x gij ( ¯ θ )( θ − θ o ) for some ¯ θ between θ and θ o is a usefulexpression for subsequent analyses thanks to Assumption 1(b) and the mean-valuetheorem (MVT).Assumption 1(c) is an MDep analogue of a standard condition that uniformlybounds the fourth moment of both the weighted and unweighted Jacobian, i.e., E (cid:2) sup θ ∈ Θ || max {| ˇ z | , } ˜ x g ( θ ) || (cid:3) ≤ C implies both E (cid:2) sup θ ∈ Θ || ˇ z ˜ x g ( θ ) || (cid:3) ≤ C and E (cid:2) sup θ ∈ Θ || ˜ x g ( θ ) || (cid:3) ≤ C . It can be shown (see Lemma A.1) that under Assump-tion 1(c), B ( x i , x j , z i , z j ) ≡ sup θ ∈ Θ || ˇ z ij ˜ x gij ( θ ) || , which does not depend on [ u i , u j ], isintegrable, and | q ( w i , w j , θ ) | ≤ B ( x i , x j , z i , z j ) || θ − θ o || a.s. for all θ , θ ∈ Θ . FromAssumption 1(d) and the foregoing, the expected value of q ( w i , w j , θ ) exists even ifthe expected value of u i does not exist. This special feature of the MDep estimatoris emphasised in the following remarks. Remark 3.1.
Although the MDep estimator developed in this paper is not a quantileestimator, i.e., it does not estimate regression quantiles, it shares the “robustness”feature of quantile estimators over the class of models in Assumption 1(a) where theasymptotic properties do not depend on the existence of moments of the disturbance u i . The remaining set of regularity conditions for consistency (Assumptions 2(a) and2(b)) are identification conditions which ensure that the expected value of the mini-mand is uniquely minimised at θ o in large samples. Assumption 2. The differentiability requirement is for ease of treatment as the case of non-smooth objectivefunctions is beyond the scope of this paper. a) Given ˜ x gij , the median of ˜ u ij is independent of ˇ z ij .(b) There exists a ˜ δ ε > such that inf { γ ∈ R pθ : || γ ||≥ ε } E [ V n ( X g ( θ ) γ , Z )] > ˜ δ ε for any ε > , θ ∈ Θ , and n sufficiently large. These assumptions are perhaps the most important because one easily sees the advan-tage the MDep estimator has vis-`a-vis conventional IV methods. While the exclusionrestriction (Assumption 2(a)), like that of conventional methods, is not stronger thanthe conditional mean independence exclusion restriction, the relevance condition (As-sumption 2(b)) is much weaker.Assumption 2(a) requires that the median of ˜ u ij conditional on [ x gij , ˇ z ij ] be con-stant. Unlike Assumption 2(a) which is imposed on the pairwise differences in dis-turbances, similar exclusion restrictions on conditional quantiles are imposed on thelevels of disturbances for quantile estimators under (possible) endogeneity (see, e.g.,Chernozhukov and Hansen (2006, Assumption A.2), Lee (2007, Assumption 3.6), andPowell (1991, Assumption B2)). Specific to the median of u i , Lee (2007), for example,requires that the median of u i , conditional on [ x i , z i ] be constant. By the symmetryof the unconditional density of ˜ u ij around zero, Assumption 2(a) is equivalent to aconditional mean independence restriction E [˜ u ij | ˜ x gij , ˇ z ij ] = E [˜ u ij | ˜ x gij ] if the expectedvalue of u i exists. In words, E [˜ u ij | ˜ x gij , ˇ z ij ] = E [˜ u ij | ˜ x gij ] says given ˜ x gij , the mean of˜ u ij is independent of ˇ z ij . Assumption 2(a) is implied by the stronger conditionalmean-independence condition E [ u i | ˜ x gij , ˇ z ij ] = E [ u i | ˜ x gij ]. Taking the expectation ofboth sides conditional on w i gives the more familiar conditional mean-independenceexclusion restriction in levels E [ u i | x gi , z i ] = E [ u i | x gi ]. This sufficient (but not nec-essary) condition for Assumption 2(a) can be verified using the conditional meanindependence specification test of Su and Zheng (2017). Both the MDep exclusionrestriction Assumption 2(a) and the uncorrelatedness exclusion restriction of condi-tional IV methods (see, e.g., Wooldridge (2010, Assumption 2SLS.1)) are weaker thanthe conditional mean-independence condition. Simulations in Section 4 suggest theMDep exclusion restriction (Assumption 2(a)) is not stronger than the uncorrelated-ness exclusion restriction of conventional IV methods.Assumption 2(b) is the condition of non-independence between X and Z ; it is theMDep analogue of the relevance condition in the IV/CF setting (see, e.g., Wooldridge See the Online Appendix for an evaluation of the test’s finite sample performance using simula-tions. pdCov test of Sz´ekely, Rizzo, et al. (2014) can beused to test Assumption 2(b) when there is a single endogenous covariate. 
Assump-tion 2(b) is, however, a significantly weaker assumption. For example, suppose thereis one endogenous covariate X and a single instrument Z . A standard IV relevancecondition is that X be correlated with Z . The MDep, on the other hand simply re-quires that X be non-independent of Z . This effectively allows X to be uncorrelated,or even mean-independent of Z as long as some points on its distribution are notindependent of Z . All strong instruments in the conventional IV sense are MDep-strong. The converse is, however, not true. This idea is further explored in Section 4via simulations. Assumption 2(b) fails when the Jacobian matrix X g ( · ) does not havefull column rank or when instruments are fewer than parameters, i.e., p z < p θ . Tosee this, note that X g ( · ) γ can be zero (which implies V n ( X g ( · ) γ , Z ) = 0 w.p.a.1) for γ (cid:54) = if X g ( · ) does not have full column rank. Also, if instruments are fewer thanparameters, it is easy to construct an example where V n ( X g ( · ) γ , Z ) = 0 w.p.a.1 for X not independent of Z and γ (cid:54) = . Remark 3.2.
MDep admits the largest set of instruments possible. In addition toIV-relevant instruments that both MDep and conventional IV methods admit, MDepadmits instruments that are uncorrelated or mean-independent but non-independentof covariates. This clear advantage of the MDep over IV methods does not come atan added cost as the MDep exclusion restriction (Assumption 2(a)), like the uncor-relatedness exclusion restriction of IV methods, is weaker than the conditional meanindependence restriction.
The support of ˇ z ij is not non-negative. This makes identification proof techniquesbased on convexity of the objective function (see, e.g., Koenker and Bassett Jr (1978), See the Online Appendix for an evaluation of the test’s finite sample performance using simula-tions. Consider a linear model with p z = 1, p θ = 2, X g ( · ) = X = [ ξ , ξ − ξ ], Z = ξ , ξ independentof ξ , and γ = [1 , (cid:48) . V n ( X γ , Z ) = V n ( ξ , ξ ) = 0 w.p.a.1 although X is not independent of Z . Therather technical formulation of Assumption 2(b) is meant to avoid such identification failure. θ ∈ Θ , A θ : [0 , (cid:16) W where A θ ( τ ) = { x , x † , z , z † ∈ W : F ˜ U | ˜ X g , ˆ Z ( λ u ˜ x g ( ¯ θ )( θ − θ o )) = τ } , [ x † , z † ] is an independentcopy of [ x , z ], λ u ∈ (0 , θ satisfies ˜ u ( θ ) = ˜ u + ˜ x g ( ¯ θ )( θ − θ o ). The followingprovides an important result for identification. Proposition 3.1.
Suppose Assumptions 1(b) and 2(a) hold, then for any w , w † de-fined on the support of w i , E [ q ( w , w † , θ )] = (cid:90) | τ − | (cid:90) A θ ( τ ) (cid:2) ˇ z (cid:12)(cid:12) ˜ x g ( ¯ θ )( θ − θ o ) (cid:12)(cid:12)(cid:3) d P ˜ X g , ˆ Z dτ. The inner integrand ˇ z (cid:12)(cid:12) ˜ x g ( ¯ θ )( θ − θ o ) (cid:12)(cid:12) has the same form as the summands of(2.2). This is important because the expectation of the normalised minimand can beexpressed in terms of the distance covariance between X g ( ¯ θ )( θ − θ o ) and Z for all θ (cid:54) = θ o and ¯ θ between θ and θ o . Identification then follows from Assumption 2(b). Lemma 3.1.
Suppose Assumptions 1(b) and 2 hold, then for any ε > and n sufficiently large, there exists a δ ε > such that inf { θ ∈ Θ : || θ − θ o ||≥ ε } E [ Q n ( θ )] > δ ε . Consistency of the MDep estimator is presented in the following theorem.
Theorem 1 (Consistency) . Under Assumptions 1 and 2, ˆ θ n p → θ o . Notice that ˇ z ij is not a bona fide function of [ z i , z j ] because it is dependent onother observations. To simplify the study of the asymptotic distribution of the MDepestimator, the asymptotically equivalent function, ˆ z ij = D n ( z i , z j ) ≡ || z i − z j || − n n (cid:88) k =1 (cid:0) E [( || z i − z k || ) | z i ]+ E [( || z k − z j || ) | z j ] (cid:1) + 1 n n (cid:88) k =1 n (cid:88) l =1 E [ || z k − z l || ], is used instead ofˇ z ij . Denote the equivalent normalised minimand by Q n ( θ ) ≡ n (cid:80) ni =1 (cid:80) nj =1 ˆ z ij (cid:0) | ˜ u ij ( θ ) |−| ˜ u ij | (cid:1) . It is shown in Lemma D.1 that the normalised minimands Q n ( θ ) and Q n ( θ )are asymptotically equivalent for all θ ∈ Θ . Using the form Q n ( θ ) not only simplifiesthe application of Hoeffding’s (1948) U-statistic theory for the proof of asymptoticnormality but also leads to a less cumbersome and a computationally less expensivecovariance matrix of the score (see Lemma D.3).13efine ψ ( w i , w j ) ≡ ˆ z ij (cid:0) − u ij < (cid:1) ˜ x gij ( θ o ) (cid:48) . ˆ z ij = ˆ z ji and (cid:0) − u ij < (cid:1) ˜ x gij ( θ o ) = (cid:0) − u ji < (cid:1) ˜ x gji ( θ o ) hence ψ ( · , · ) is permutation symmetric. Thescore function at θ o is S n ( θ o ) ≡ ∂ Q n ( θ ) ∂ θ (cid:48) (cid:12)(cid:12) θ = θ o = n (cid:80) ni =1 (cid:80) nj =1 ψ ( w i , w j ) (cid:48) . By thepermutation symmetry of ψ ( · , · ) and for any τ n ∈ R p θ such that || τ n || = 1, let V n = τ (cid:48) n S n ( θ o ) (cid:48) = n (cid:80) ni =1 (cid:80) nj =1 τ (cid:48) n ψ ( w i , w j ) = (1 − /n ) U n , where U n = (cid:0) n (cid:1) − (cid:80) ni =1 (cid:80) nj =1 12 τ (cid:48) n ψ ( w i , w j ). E [ ψ ( w i , w j )] = for 1 ≤ i, j ≤ n (seeLemma D.3) implies E [ U n ] = 0. U n is a U -statistic (see, e.g., Hoeffding (1948) and Vander Vaart (1998, chapter 11)). The connection to U-statistics is particularly useful forestablishing a central limit theorem on √ n S n ( θ ) because { ψ ( w i , w j ) , ≤ i, j ≤ n } are dependent and standard central limit theorems do not apply directly. Under As-sumption 1(a), the Haj´ek projection of U n is ˆ U n ≡ (cid:80) ni =1 E [ U n | w i ] = n (cid:80) ni =1 τ (cid:48) n ψ i ( w i )where ψ i ( w i ) ≡ n − (cid:80) j (cid:54) = i E [ ψ ( w i , w j ) | w i ] (see, e.g., Hoeffding (1948, eqn. 8.1) andVan der Vaart (1998, Lemma 11.10)). { ψ i ( w i ) , ≤ i ≤ n } are independently dis-tributed random vectors, and the Lyapunov central limit theorem (CLT) applies.Denote Ψ n ≡ var( √ n ˆ U n ) = var( √ n (cid:80) ni =1 ψ i ( w i )) = n (cid:80) ni =1 E [ ψ i ( w i ) ψ i ( w i ) (cid:48) ].Define ϑ n ( θ ) = Q n ( θ ) −S n ( θ o )( θ − θ o ), then by Lemma D.3, E [ ϑ n ( θ )] = E [ Q n ( θ )] − E [ S n ( θ o )]( θ − θ o ) = E [ Q n ( θ )]. R n ( θ ) = √ n ( ϑ n ( θ ) − E [ ϑ n ( θ )]) / || θ − θ o || is the re-mainder term of the Taylor approximation (see, e.g., Pollard (1985) and Newey andMcFadden (1994, sect. 7)). Define Σ o ≡ H − o Ω o H − o where Ω o ≡ lim n →∞ E [ Ω n ( θ o )],var( √ n S n ( θ o )) = E [ Ω n ( θ o )], H o ≡ lim n →∞ E [ H n ( θ o )], and E [ H n ( θ o )] = ∂ E [ S n ( θ )] ∂ θ (cid:12)(cid:12)(cid:12) θ = θ o ,the Hessian matrix (see Lemmata Lemmas D.3 and D.4). 
The expressions for E [ Ω n ( θ o )]and E [ H n ( θ o )] are E [ Ω n ( θ o )] = 1 n n (cid:88) i =1 n (cid:88) j =1 n (cid:88) k (cid:54) = i,j E (cid:104) ˆ z ij ˜ x gij ( θ o ) (cid:48) ˜ x gij ( θ o )+ 8ˆ z ij ˆ z ik cov (cid:0) I(˜ u ij < , I(˜ u ik < | ω ˜ x, ˆ zijk (cid:1) ˜ x gij ( θ o ) (cid:48) ˜ x gik ( θ o ) (cid:105) (3.1)and(3.2) E [ H n ( θ o )] = 2 n n (cid:88) i =1 n (cid:88) j =1 E (cid:104) ˆ z ij f ˜ U | ˜ X g , ˆ Z (0 | ω ˜ x, ˆ zij )˜ x gij ( θ o ) (cid:48) ˜ x gij ( θ o ) (cid:105) where ω ˜ x, ˆ zijk ≡ σ ( x i , x j , x k , z i , z j , z k ), ω ˜ x, ˆ zij ≡ σ ( x i , x j , z i , z j ), σ ( · ) denotes the sigmaalgebra generated by its argument, and cov( ξ a , ξ b | ξ c ) denotes covariance of ξ a and ξ b ξ c . Let ρ min ( A ) denote the minimum eigen-value of A . The followingadditional assumptions are needed to establish the asymptotic distribution of theMDep estimator. Assumption 3.
For the constant C defined in Assumption 1:(a) ρ min ( Ψ n ) > C / / for all n sufficiently large.(b) sup || θ − θ o ||≤ δ n |R n ( θ ) | √ n || θ − θ o || p → for any δ n → .(c) H o is non-singular.(d) The conditional distribution of ˜ u , F ˜ U | ˜ X g , ˆ Z ( · ) , is differentiable with density f ˜ U | ˜ X g , ˆ Z ( · ) , C − < f ˜ U | ˜ X g , ˆ Z ( (cid:15) ) ≤ sup γ ∈ R f ˜ U | ˜ X g , ˆ Z ( γ ) ≤ C / , and | f ˜ U | ˜ X g , ˆ Z ( (cid:15) ) − f ˜ U | ˜ X g , ˆ Z ( (cid:15) ) | ≤ C / | (cid:15) − (cid:15) | a.s. for all (cid:15), (cid:15) , (cid:15) in a neighbourhood of zero and ˜ u defined on thesupport of ˜ u ij .(e) E (cid:2) sup θ ∈ Θ || max {| ˆ z | , } ˜ x g ( θ ) || (cid:3) ≤ C , E (cid:2) sup θ ∈ Θ (cid:12)(cid:12)(cid:12)(cid:12) ˆ z ∂ ˜ x g ( θ ) ∂ θ (cid:12)(cid:12)(cid:12)(cid:12) (cid:3) ≤ C , and E (cid:2) || ˜ z || (cid:3) ≤ C . Assumption 3(a) means Ψ n is positive definite. A necessary (but not sufficient) condi-tion for this is the full column rank condition of the Jacobian matrix X g ( · ). Assump-tion 3(b) is is a stochastic equi-continuity condition for establishing the asymptoticnormality of estimators with non-smooth objective functions (see, e.g., Pollard (1985),Pakes and Pollard (1989), and Cheng and Liao (2015)). It is verified in Lemma D.6.Non-singularity of the Hessian (Assumption 3(c)) is imposed instead of positive defi-niteness because the MDep objective function (2.3) is not convex. Assumption 3(d) isa standard assumption for quantile estimators (see, e.g., Lee (2007, Assumption 3.6),Chernozhukov and Hansen (2006, Assumption 2 R.4), Chernozhukov and Hansen(2008, Assumption R.4), Powell (1991, Assumption C4. (i) and (ii)), and Oberhoferand Haupt (2016, Assumption A.14)); it ensures the Hessian is well-defined. Thefirst statement in Assumption 3(e) simply adapts Assumption 1(c) for the equivalentnormalised minimand Q n ( θ ). The second statement is required to apply the domi-nated convergence theorem in the derivation of the Hessian (see Lemmata D.4 andD.5). The third statement is an MDep analogue of a standard regularity condition(e.g., Cheng and Liao (2015, condition 4.1 (iii))); it is used in Lemma D.1(a) to show15hat | ˇ z ij − ˆ z ij | = O p ( n − / ) for 1 ≤ i, j ≤ n . The asymptotic normality of the MDepestimator is stated below. Theorem 2.
Suppose Assumptions 1 to 3 hold, then(a) √ n ˜ τ (cid:48) n E [ Ω n ( θ o )] − / S n ( θ o ) (cid:48) d → N (0 , for any ˜ τ n ∈ R p θ with || ˜ τ n || = 1 ;(b) √ n ( ˆ θ n − θ o ) d → N ( , Σ o ) . Theorem 2(a) together with the Cram´er-Wold device yields the asymptotic normalityof √ n S n ( θ o ), i.e, √ n S n ( θ o ) d → N ( , Ω o ). This result plays a crucial role in the proofof Theorem 2(b) (see, e.g., Newey and McFadden (1994, Theorem 7.1)). The preceding subsection presents the asymptotic normality of the MDep esti-mator. This subsection provides the covariance matrix estimator and proves its con-sistency. Consistency of the covariance estimator is essential for statistical inferenceprocedures viz. Wald tests, and confidence intervals. The approach adopted followsPowell (1991). The estimators of Ω o and H o are given byˇ Ω n ( ˆ θ n ) = 1 n n (cid:88) i =1 n (cid:88) j =1 n (cid:88) k (cid:54) = i,j (cid:110) ˇ z ij ˜ x gij ( ˆ θ n ) (cid:48) ˜ x gij ( ˆ θ n )+ 2ˇ z ij ˇ z ik (cid:0) (cid:0) I(˜ u ij ( ˆ θ n ) < u ik ( ˆ θ n ) < (cid:1) − (cid:1) ˜ x gij ( ˆ θ n ) (cid:48) ˜ x gik ( ˆ θ n ) (cid:111) andˇ H n ( ˆ θ n ) = 1 n ˆ c n n (cid:88) i =1 n (cid:88) j =1 (cid:110) I( | ˜ u ij ( ˆ θ n ) | ≤ ˆ c n )ˇ z ij ˜ x gij ( ˆ θ n ) (cid:48) ˜ x gij ( ˆ θ n ) (cid:111) respectively where ˆ c n is a (possibly random) bandwidth and the uniform kernel re-places the conditional density in E [ H n ( θ o )]. The estimator of the covariance matrixis Σ n ( ˆ θ n ) = ˇ H n ( ˆ θ n ) − ˇ Ω n ( ˆ θ n ) ˇ H n ( ˆ θ n ) − . In addition to preceding assumptions, thefollowing condition is imposed on the sequence of bandwidths ˆ c n . Assumption 4.
For some non-stochastic sequence c n with c n → and √ nc n → ∞ , plim n →∞ (ˆ c n /c n ) = 1 . H n ( ˆ θ n );it means ˆ c n satisfies ˆ c n = o p (1) and ˆ c − n = o p ( √ n ). The next theorem states the con-sistency of Σ n ( ˆ θ n ). Theorem 3.
Under Assumptions 1 to 4, plim n →∞ Σ n ( ˆ θ n ) = Σ o . Estimating the asymptotic covariance matrix involves specifying the bandwidth ˆ c n .Poor performance of the covariance matrix can result from an inappropriate choiceof bandwidth. In addition to the asymptotic covariance matrix estimator Σ n ( ˆ θ n ),this paper considers the bootstrap. The bootstrap obviates the specification of ˆ c n .Simulations on the coverage performance of the estimator Σ n ( ˆ θ n ) and the bootstrapare available in the Online Appendix. This section examines the empirical performance of the MDep estimator vis-`a-visIV methods for the linear model using simulations. Non-linear parametric modelsare considered in the Online Appendix where the MDep is compared to maximumlikelihood, quasi-maximum likelihood, GMM, and the control function (CF) approach.Each element of x = [ x , x ∗ ] is standard normal N (0 ,
1) with cov( x , x ∗ ) = 0 . z = [ z , z ] are the instruments. θ c = 0 . θ ≡ [ θ , θ ] = [1 , −
1] are main-tained across all specifications. For each specification, the mean bias (MB), medianabsolute deviation (MAD), and root mean squared error (RMSE) are reported. 2000simulations are run for each sample size n ∈ { , , } . ν ∼ ( χ (1) − / √ σ ξ is the standard deviation of a random variable ξ , Ber(0 .
5) is the Bernouilli distribu-tion with mean equal to 0 . φ ( · ) is the standard normal probability density function(pdf), Φ − ( τ ) is the τ ’th quantile of the standard normal distribution, and M x ( ξ )is the residual from the linear projection of ξ on x . The following specifications areconsidered.(i) y = θ c + θ x + θ x + u , u = ( ˙ u − E [ ˙ u ]) /σ ˙ u , ˙ u = ν + φ (( x − x ∗ ) / . x = x ∗ ,and z = [ x , x ∗ ].(ii) y = θ c + θ x + θ x + u , u = ( ˙ u − E [ ˙ u ]) /σ ˙ u , ˙ u = ν + φ (( x − x ∗ ) / . x = ( x ∗ + ν ) / √ z = [ x , ( x ∗ + w ) / √ w ∼ N (0 , y = θ c + θ x + θ x + u , u = ˙ u/σ ˙ u , ˙ x = 6 x ∗ / √ n + (1 + x ∗ ) ν , x = ˙ x /σ ˙ x , z = [ x , ( x ∗ + w ) / √ w ∼ N (0 , u = (1 + D | x | + (1 − D ) | x ∗ | ) ν , and D ∼ Ber(0 . y = θ c + θ x + θ x + u , u = ˙ u/σ ˙ u , ˙ x = (1 + x ∗ ) ν , x = ˙ x /σ ˙ x , z =[ x , ( x ∗ + w ) / √ w ∼ N (0 , u = (1+ D | x | +(1 − D ) | x ∗ | ) ν , and D ∼ Ber(0 . y = θ c + θ x + θ x + ν , x = ( x ∗ + ν ) / √
2, and z = [ x , | x ∗ | / (cid:112) − /π ].(vi) y = θ c + θ x + θ x + ν , x = ( x ∗ + ν ) / √
2, and z = [ x , | x ∗ | < − Φ − (0 . y = θ c + θ x + θ x + ν , x = ( x ∗ + ν ) / √
2, and z = [ x , M x ( | x | )].Specifications (i) and (ii) consider instrument validity under the uncorrelatednessexclusion restriction cov( z , u ) = but not the strong conditional mean independencerestriction E [ u | z ] = E [ u ]. These help to demonstrate the performance of the MDepunder the weakest exclusion restriction needed for the validity of IV methods. Weakinstruments are introduced in specification (iii); the weight 6 / √ n is chosen to ensurethat the non-homoskedasticity-robust F-statistic (see Andrews, Stock, and Sun (2019,eqn. 7.)) is approximately 5. (iv) to (vii) consider cases where x is uncorrelated butnot independent of its instrument z . In (iv), x is mean independent but not (dis-tributionally) independent of z . (v) and (vi) examine the performance of the MDepwhen z is uncorrelated but non-monotonically dependent on x . Non-monotonicityrenders transformations of z that improve correlation with x infeasible. (iii) to (vi)present cases where z is IV-weak but MDep-strong. (i) to (iv) help to examine therelative performance of MDep and IV under conditional heteroskedasticity. Specifi-cation (vii) presents an interesting case where an instrument is constructed by takinga non-monotone transformation of a continuous endogenous x and rendering it or-thogonal to x . This specification illustrates a unique strength of the MDep estimatorwhen valid instruments are unavailable or too MDep-weak. Table 4.1 presents simulation results for the linear model. Entries exceeding 500are omitted. In the standard cases (i) and (ii) where the instrument is IV-strong, oneobserves good performance of both MDep and OLS/IV with the MDep performingbetter especially in terms of MAD and RMSE. MDep’s competitive performance in thespecifications (i) and (ii) suggests the MDep exclusion restriction (Assumption 2(a)) The pdCov test can be used to assess the strength of the constructed instrument. 
θ θ n 100 500 2500 100 500 2500Specification (i): (100 × Sample Bias)MDep OLS MDep OLS MDep OLS MDep OLS MDep OLS MDep OLSMB 0.442 0.832 0.123 0.245 0.02 0.037 0.482 0.59 0.093 0.213 0.013 0.005MAD 3.818 7.448 1.43 3.163 0.678 1.422 3.535 6.908 1.5 3.194 0.644 1.382RMSE 5.951 10.515 2.242 4.585 0.989 2.08 5.933 10.467 2.306 4.638 0.964 2.04Specification (ii): (100 × Sample Bias)MDep IV MDep IV MDep IV MDep IV MDep IV MDep IVMB -0.8 0.581 -0.152 0.041 -0.043 -0.026 4.585 -2.289 0.746 -0.006 0.045 -0.055MAD 3.645 7.273 1.57 3.231 0.678 1.476 6.946 14.197 2.525 6.458 1.111 2.941RMSE 6.177 11.787 2.405 4.886 1.001 2.146 12.255 25.664 4.075 9.516 1.672 4.289Specification (iii): (100 × Sample Bias)MDep IV MDep IV MDep IV MDep IV MDep IV MDep IVMB 0.277 0.499 0.326 0.877 -0.053 -0.492 12.318 -24.458 1.777 -21.28 -0.437 -27.43MAD 4.237 8.297 2.244 3.642 1.02 1.744 19.906 29.302 6.02 29.091 2.149 27.78RMSE 6.756 81.681 3.392 23.206 1.506 37.132 28.603 - 8.971 419.118 3.177 -Specification (iv): (100 × Sample Bias)MDep IV MDep IV MDep IV MDep IV MDep IV MDep IVMB 1.987 1.43 0.409 0.843 -0.053 -3.547 8.519 103.793 1.206 -136.934 -0.359 -11.069MAD 5.354 9.979 2.214 4.3 1.005 1.967 10.931 84.134 4.188 84.998 1.828 82.703RMSE 8.315 132.652 3.414 53.491 1.497 184.761 18.824 - 6.364 - 2.707 -Specification (v): (100 × Sample Bias)MDep IV MDep IV MDep IV MDep IV MDep IV MDep IVMB -1.893 -13.05 -0.612 -7.502 -0.148 -2.376 13.098 87.437 4.138 42.286 0.96 13.143MAD 3.131 14.773 0.994 13.163 0.36 13.287 12.187 72.464 4.053 73.944 1.141 73.269RMSE 5.835 276.485 1.91 330.653 0.687 304.325 18.246 - 6.096 - 1.708 -Specification (vi): (100 × Sample Bias)MDep IV MDep IV MDep IV MDep IV MDep IV MDep IVMB -2.704 -14.84 -0.822 -28.669 -0.194 -28.705 16.502 105.025 4.919 172.413 1.114 161.015MAD 3.405 21.787 1.119 21.935 0.376 22.425 15.047 124.811 4.763 122.47 1.372 125.558RMSE 6.649 - 2.199 255.696 0.759 339.074 22.088 - 7.549 - 2.135 -Specification (vii): (100 × Sample Bias)MDep OLS MDep OLS MDep OLS MDep OLS MDep OLS MDep OLSMB -2.826 -12.085 -1.943 -12.7 -1.756 -12.891 16.853 71.175 10.243 72.537 8.858 72.914MAD 3.285 11.897 1.88 12.681 1.713 12.886 15.423 71.012 10.021 72.405 8.821 72.962RMSE 5.946 14.183 2.713 13.124 1.929 12.977 19.654 72.287 10.767 72.789 8.967 72.965This table provides results for the MDep and OLS/IV estimators of the linear model. Resultsare based on 2000 random repetitions. The mean bias (MB), median absolute deviation(MAD), and root mean squared errors (RMSE) are reported. Entries exceeding 500 areexcluded.
19s not stronger than the weak uncorrelatedness condition of IV methods. Results onspecifications (iii) and (iv) confirm MDep’s ability, unlike the case of conventionalIV methods, to exploit identifying variation from the quantile dependence of x on z . MDep continues to perform well in specifications (iv) to (vii) where x is eithermean-independent or uncorrelated with z . On the whole, the MDep’s performancein terms of bias and efficiency is good even in specifications where IV methods sufferidentification failure. This section uses two empirical examples to illustrate the usefulness of the MDepestimator in real-world data settings. The first application considers a simple demandestimation problem using data from Graddy (1995), and the second considers theimpact of migration on productivity using data from Hornung (2014). In the firstapplication, the instrument is IV-strong and hence MDep-strong by construction. Inthe second application, the instrument has additional non-linear and non-monotoneidentifying variation that the MDep, unlike the IV, is able to exploit.
The parameter of interest is the elasticity of demand for fish in a simple linearCobb-Douglas demand model with additive disturbance. The disturbance as a func-tion of parameters is specified as u ( θ ) = ln( Q p ) − θ c − θ ln( P ) − θ − W where Q p is the total quantity of fish sold per day, P is the daily average price, θ is the price elasticity of demand, and W is a vector of day dummies (see Graddy(1995) and Chernozhukov and Hansen (2008, sect. 5.1) for more details). In a typicaldemand estimation framework, price P is endogenous. Graddy (1995) proposes aninstrument Stormy which measures weather conditions at sea. Specifically,
Stormy is a dummy variable which indicates whether wave height is greater than 4.5 ft andwind speed is greater than 18 knots.The upper panel of Table 5.1 presents coefficients and standard errors while specifi-cation and relevance tests are presented in the lower panel. The tests comprise pdCov P ) -0.558 -0.541 -1.105 -1.082 -0.454 -0.563 -1.233 -1.119(0.186) (0.168) (0.459) (0.574) (0.191) (0.158) (0.471) (0.545)Const. 8.415 8.419 8.309 8.314 8.610 8.607 8.482 8.506(0.076) (0.074) (0.114) (0.136) (0.124) (0.114) (0.156) (0.155)Day Dummies (cid:88) (cid:88) (cid:88) (cid:88) Excluded Instr. (cid:88) (cid:88) (cid:88) (cid:88)
Relevance pdC nln p-value 0.189 0.527 pdC n p-value 0.000 0.000KB F -Stat. 20.663 22.929Specification T n p-value 0.573 0.406 0.918 0.892 0.894 0.952 0.681 0.793 Notes:
The number of observations in each specification is 111. For each specification,standard errors (in parentheses) are computed using the same 999 samples of the non-parametric bootstrap. 999 wild-bootstrap samples are used for the pdC nln , pdC n , and T n tests. The test statistics are omitted because the tests are not pivotal. Excluded Instr. -whether the excluded instrument is used. KB F-Stat. denotes the rk Wald F -statistic ofKleibergen and Paap (2006). pdC nln for non-linear MDep-relevance and pdC n for overall MDep-relevance of the instrument, and the T n specification test ofSu and Zheng (2017). A pdC nln p-value below 0 .
05 suggests the instrument has astatistically significant non-linear dependence with the endogenous covariate at the5% level whereas a T n p-value greater than 0 .
05 suggests failure to reject a null hy-pothesis of correct model specification. The rk Wald F-statistic of Kleibergen andPaap (2006) is provided as a measure of IV-relevance.MDep estimates of the price elasticity of demand do not differ significantly fromthe OLS/IV estimates. MDep estimates are slightly less precise than OLS estimatesin specifications (1) and (3) but are slightly more precise than IV estimates in speci-fications (2) and (4). One notices from Su and Zheng’s (2017) specification tests thatno specification is rejected by the data. Also, the relevance of the instrument
Stormy is largely linear because non-linear dependence (see the pdC nln p-value) is not signifi-cant at any conventional level. P-values of the partial distance covariance pdC n test,like Kleibergen and Paap’s (2006) rank test, confirm the strength of the instrument Stormy . Huguenots fled religious persecution in France and settled in Brandenburg-Prussiain 1685. They compensated for population losses due to plagues during the ThirtyYears’ War. Hornung (2014) examines the long-term impact of their skilled-workermigration on the productivity of textile manufactories in Prussia. The author exploitsthe settlement pattern in an IV approach and identifies the long-term impact ofskilled-worker migration on productivity. This empirical example is a revisit to theproblem and it compares OLS/IV to MDep.The disturbance function is derived from a standard firm-level Cobb-Douglas typeof production function: The linear dependence between the endogenous covariate and the instrument is removed beforethe pdCov test is conducted. See the Online Appendix for details. Su and Zheng’s (2017) T n test is a test of conditional mean independence. Conditional meanindependence is sufficient but not necessary for the MDep exclusion restriction (Assumption 2(a)). The rk Wald F -statistics of Kleibergen and Paap (2006) are above Staiger and Stock’s (1997)rule-of-thumb cut-off of 10. ( θ ) = y − θ c − θ l ln( L ) − θ k ln( K ) − θ m ln( M ) − θ (cid:16) HuguenotsT ownP opulation (cid:17) − θ − W where productivity y measures the value of goods produced by a firm in a given town, L is the number of workers, K is the number of looms, M is the value of materialsused in the production process, and W collects regional and town characteristics thatmight affect productivity. As instrument for the population share of Hougenots, theauthor proposes population losses during the Thirty Years’ War. Population dataused to construct the instrument are years removed from the beginning and the endof the Thirty Years’ War, and this may induce measurement error. The author thusconstructs an interpolated version of the instrument (see Hornung (2014) for details).Coefficient estimates and standard errors are presented in the upper panel andhypothesis tests are provided in the lower panel. The specifications considered repli-cate results in Hornung (2014, table 4). Table 5.2 presents an interesting case wherethe instrument has strong non-linear identifying variation. For the main parameterof interest (population share of Hougenots), one notices that OLS/IV estimates areless stable across specifications. Also, IV standard errors are about three times aslarge as MDep standard errors. This result is not surprising since, beyond lineardependence, the MDep estimator exploits the highly significant non-linear form ofidentifying variation (see the pdC nln p-value) in the instrument.23able 5.2: Estimates - Huguenot Productivity - Hornung (2014)MDep OLS MDep IV MDep IV(1) (1) (2) (2) (3) (3)Percent Huguenots 1700 1.816 1.741 2.003 3.475 1.625 3.380(0.293) (0.462) (0.340) (0.986) (0.282) (1.014)ln workers 0.115 0.123 0.111 0.135 0.111 0.134(0.024) (0.034) (0.024) (0.035) (0.025) (0.035)ln looms 0.096 0.102 0.095 0.082 0.094 0.083(0.026) (0.035) (0.026) (0.036) (0.027) (0.036)ln materials (imputed) 0.814 0.800 0.814 0.811 0.815 0.811(0.021) (0.028) (0.021) (0.030) (0.022) (0.030)Const. 1.344 2.043 1.186 2.830 2.060 2.787(0.427) (0.441) (0.444) (0.547) (0.445) (0.540)Excluded Instr. 
(cid:88) (cid:88) (cid:88) (cid:88) Relevance pdC nln p-value 0.000 0.000 pdC n p-value 0.000 0.000KB F -Stat. 14.331 22.531Specification T n p-value 0.721 0.559 0.738 0.405 0.638 0.438 Notes:
The number of observations in each specification is 150 firms. For each specification,standard errors (in parentheses) are computed using the same 999 samples of the non-parametric bootstrap. 999 wild-bootstrap samples are considered for the pdC nln , pdC n , and T n tests. The test statistics are omitted because they are not pivotal. Percent populationlosses in 30 Years’ War is the instrument used in specification (2), and
Percent populationlosses in 30 Years’ War, interpolated is the instrument used in specification (3) (see Hornung(2014) for details). Excluded Instr. - whether the excluded instrument is used. KB F-Stat.denotes Kleibergen and Paap’s (2006) rk Wald F -statistic. Conclusion
This paper introduces the MDep estimator which weakens the relevance conditionof IV methods to non-independence and hence maximises the set of relevant instru-ments without imposing a stronger exclusion restriction. The estimator is basedon the distance covariance measure of Sz´ekely, Rizzo, Bakirov, et al. (2007) whichmeasures the distributional dependence between random variables of arbitrary di-mensions. The estimator provides a fundamentally different and practically usefulapproach to dealing with the weak-IV problem; the proposed estimator admits in-struments that may be uncorrelated or mean-independent but non-independent ofendogenous covariates.Consistency, and asymptotic normality of the MDep estimator hold under mildconditions. Several Monte Carlo simulations showcase the gains of the MDep estima-tor vis-`a-vis existing IV methods. The empirical examples considered in the paperillustrate the practical utility of the MDep estimator in empirical work. In the firstexample where the instrument has a strong linear but trivial non-linear relationshipto the endogenous covariate, the MDep is comparable to the IV in terms of precision.In the second example where the linear dependence is quite weak but the non-lineardependence is non-trivial, MDep estimates are more precise and stable relative to theIV. This paper is a first step in developing a distance covariance-based estimatorfor econometric models under possible endogeneity. It will be interesting to extendthe current paper along a number of dimensions. First, an MDep estimator can bedeveloped for models which are beyond the scope of the current paper, e.g., quantileregression and binary response models. Second, owing to the frequent occurrenceof autocorrelated and clustered data in empirical practice, it will be interesting toextend the theory to accommodate mixing and clustering in disturbances.25 eferences [1] Andrews, Donald WK. “Consistency in nonlinear econometric models: A genericuniform law of large numbers”.
Econometrica: Journal of the Econometric Society (1987), pp. 1465–1471.
[2] Andrews, Donald W. K., Marcelo J. Moreira, and James H. Stock. "Optimal two-sided invariant similar tests for instrumental variables regression". Econometrica (2006).
[3] Quantitative Economics.
[4] Econometrica.
[5] Andrews, Isaiah, James H. Stock, and Liyang Sun. "Weak instruments in instrumental variables regression: Theory and practice". Annual Review of Economics 11 (2019), pp. 727–753.
[6] Belloni, Alexandre, Daniel Chen, Victor Chernozhukov, and Christian Hansen. "Sparse models and methods for optimal instruments with an application to eminent domain". Econometrica (2012).
[7] Bernstein, Dennis S. Matrix Mathematics: Theory, Facts, and Formulas. 2009.
[8] Bound, John, David A. Jaeger, and Regina M. Baker. "Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak". Journal of the American Statistical Association (1995).
[9] Chen, Jiafeng, Daniel L. Chen, and Greg Lewis. arXiv preprint arXiv:2011.06158 (2020).
[10] Cheng, Xu and Zhipeng Liao. "Select the valid and relevant moments: An information-based LASSO for GMM with many moments". Journal of Econometrics (2015).
[11] Journal of Econometrics.
[12] Journal of Econometrics.
[13] Bernoulli.
[14] Dieterle, Steven G. and Andy Snell. Labour Economics 42 (2016), pp. 76–86.
[15] Edelmann, Dominic, Konstantinos Fokianos, and Maria Pitsillou. "An updated literature review of distance correlation and its applications to time series". International Statistical Review (2019).
[16] The RAND Journal of Economics (1995), pp. 75–92.
[17] Hansen, Christian and Damian Kozbur. "Instrumental variables estimation with many weak instruments using regularized JIVE". Journal of Econometrics (2014).
[18] Econometric Reviews.
[19] Hoeffding, Wassily. "A class of statistics with asymptotically normal distribution". The Annals of Mathematical Statistics (1948).
[20] American Economic Review.
[21] Huber, Peter J. "The behavior of maximum likelihood estimates under nonstandard conditions". Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. Vol. 1. 1. University of California Press, 1967, pp. 221–233.
[22] Kleibergen, Frank and Richard Paap. "Generalized reduced rank tests using the singular value decomposition". Journal of Econometrics (2006).
[23] Koenker, Roger and Gilbert Bassett. "Regression quantiles". Econometrica: Journal of the Econometric Society (1978), pp. 33–50.
[24] Lee, Sokbae. "Endogeneity in quantile regression models: A control function approach". Journal of Econometrics (2007).
[25] Journal of the American Statistical Association.
[26] Nelson, Charles R. and Richard Startz. Journal of Business (1990), S125–S140.
[27] Nelson, Charles R. and Richard Startz. "Some Further Results on the Exact Small Sample Properties of the Instrumental Variable Estimator". Econometrica (1990).
[28] Newey, Whitney K. and Daniel McFadden. "Large sample estimation and hypothesis testing". Handbook of Econometrics (1994).
[29] Oberhofer, Walter and Harry Haupt. Econometric Theory (2016).
[30] Journal of Business & Economic Statistics.
[31] Pakes, Ariel and David Pollard. "Simulation and the asymptotics of optimization estimators". Econometrica: Journal of the Econometric Society (1989), pp. 1027–1057.
[32] Park, Trevor, Xiaofeng Shao, Shun Yao, et al. "Partial martingale difference correlation". Electronic Journal of Statistics (2015).
[33] Econometric Theory (1985), pp. 295–313.
[34] Powell, James L. "Estimation of monotonic regression models under quantile restrictions". Nonparametric and semiparametric methods in Econometrics (1991), pp. 357–384.
[35] Sanderson, Eleanor and Frank Windmeijer. "A weak instrument F-test in linear IV models with multiple endogenous variables". Journal of Econometrics (2016).
[36] Journal of the American Statistical Association.
[37] Journal of Multivariate Analysis 122 (2013), pp. 148–161.
[38] Staiger, Douglas and James H. Stock. "Instrumental Variables Regression with Weak Instruments". Econometrica (1997).
[39] Stock, James H. and Motohiro Yogo. "Testing for weak instruments in linear IV regression". Identification and Inference for Econometric Models. Ed. by Andrews, Donald W. K. New York: Cambridge University Press, 2005, pp. 80–108.
[40] Su, Liangjun and Xin Zheng. "A martingale-difference-divergence-based test for specification". Economics Letters 156 (2017), pp. 162–167.
[41] Székely, Gábor J. and Maria L. Rizzo. "Brownian distance covariance". The Annals of Applied Statistics (2009), pp. 1236–1265.
[42] Székely, Gábor J. and Maria L. Rizzo. "On the uniqueness of distance covariance". Statistics & Probability Letters (2012).
[43] Székely, Gábor J., Maria L. Rizzo, and Nail K. Bakirov. "Measuring and testing dependence by correlation of distances". The Annals of Statistics (2007).
[44] Székely, Gábor J. and Maria L. Rizzo. "Partial distance correlation with methods for dissimilarities". The Annals of Statistics (2014).
[45] Van der Vaart, Aad W. Asymptotic statistics. Vol. 3. Cambridge University Press, 1998.
[46] Wooldridge, Jeffrey M. Econometric analysis of cross section and panel data. MIT Press, 2010.
[47] Xu, Kai and Fangxue Chen. "Martingale-difference-divergence-based tests for goodness-of-fit in quantile models". Journal of Statistical Planning and Inference 207 (2020), pp. 138–154.
[48] Zhong, Wei and Liping Zhu. "An iterative approach to distance correlation-based sure independence screening". Journal of Statistical Computation and Simulation (2015).
[49] Journal of Time Series Analysis.
Appendix

A Proof of Theorem 1
The proof of Theorem 1 begins with a uniform law of large numbers result. The following lemma verifies the conditions of Andrews (1987, Corollary 2).
Lemma A.1.
Suppose Assumption 1 holds, then
(a) there exists a non-random function $B:\mathcal{W}^2\rightarrow\mathbb{R}_+$ with $\lim_{n\rightarrow\infty}\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}[B(x_i,x_j,z_i,z_j)]<\infty$ such that for any $\theta_1,\theta_2\in\Theta$, $|q(w_i,w_j,\theta_1)-q(w_i,w_j,\theta_2)|\leq B(x_i,x_j,z_i,z_j)\,||\theta_1-\theta_2||$ a.s.;
(b) $\mathrm{E}[Q_n(\theta)]$ is continuous on $\Theta$ uniformly in $n$, and $\sup_{\theta\in\Theta}|Q_n(\theta)-\mathrm{E}[Q_n(\theta)]|\xrightarrow{p}0$.

Proof of Lemma A.1. Part (a): First, let $B(x,x^{\dagger},z,z^{\dagger})\equiv\sup_{\theta\in\Theta}||\check{z}\,\tilde{x}_g(\theta)||$. By Assumption 1(c) and Jensen's inequality, $\mathrm{E}[B(x,x^{\dagger},z,z^{\dagger})]\equiv\mathrm{E}[\sup_{\theta\in\Theta}||\check{z}\,\tilde{x}_g(\theta)||]\leq(\mathrm{E}[\sup_{\theta\in\Theta}||\check{z}\,\tilde{x}_g(\theta)||^2])^{1/2}\leq C^{1/2}$ for any $[x,z],[x^{\dagger},z^{\dagger}]$ defined on the support of $[x_i,z_i]$. This implies $\lim_{n\rightarrow\infty}\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}[B(x_i,x_j,z_i,z_j)]\leq C^{1/2}<\infty$. By Assumption 1(a) and the measurability of the Jacobian (Assumption 1(b)), $B(x_i,x_j,z_i,z_j)\equiv\sup_{\theta\in\Theta}||\check{z}_{ij}\,\tilde{x}_{gij}(\theta)||$ is measurable.

Second, for any $w,w^{\dagger}$ defined on the support of $w_i$ and $\theta_1,\theta_2,\bar{\theta}_{1,2}\in\Theta$, where $\bar{\theta}_{1,2}$, by Assumption 1(b) and the Mean-Value Theorem (MVT), satisfies $\tilde{u}(\theta_1)-\tilde{u}(\theta_2)=\tilde{x}_g(\bar{\theta}_{1,2})(\theta_1-\theta_2)$,
$$B(x,x^{\dagger},z,z^{\dagger})\cdot||\theta_1-\theta_2||\geq|\check{z}|\cdot|\tilde{x}_g(\bar{\theta}_{1,2})(\theta_1-\theta_2)|=|\check{z}|\cdot|\tilde{u}(\theta_1)-\tilde{u}(\theta_2)|\geq\big|\check{z}\big(|\tilde{u}(\theta_1)|-|\tilde{u}(\theta_2)|\big)\big|=|q(w,w^{\dagger},\theta_1)-q(w,w^{\dagger},\theta_2)|.$$
The second inequality follows from the triangle inequality.
Part (b): From part (a) above, Assumptions 1(a)-1(c) imply Assumption A4 of Andrews (1987). Assumptions 1(a) and 1(b) imply the measurability of $q(w,w^{\dagger},\theta)\equiv\check{z}(|\tilde{u}(\theta)|-|\tilde{u}|)$ for all $\theta\in\Theta$ and any random vectors $w,w^{\dagger}$ on the support of $w_i$. This corresponds to Andrews (1987, Assumption A2). In addition to Assumption 1(d), the conclusion follows from Corollary 2 of Andrews (1987).

As a next step in the proof of Theorem 1, Proposition 3.1 is useful in proving the identification of $\theta_o$.

Proof of Proposition 3.1. The following expression is a useful mean-value expansion around $b=0$:
$$|\xi-b|-|\xi|=\big(2\,\mathrm{I}(\xi<\lambda_u b)-1\big)b\tag{A.1}$$
where $\lambda_u\in(0,1)$ and $\xi$ is a random variable.

Thanks to Assumption 1(b), $\tilde{u}(\theta)=\tilde{u}-\tilde{x}_g(\bar{\theta})(\theta-\theta_o)$ by the MVT for any pair of random vectors $w,w^{\dagger}$ defined on the support of $w_i$. Using the expression (A.1) for some $\lambda_u\in(0,1)$,
$$q(w,w^{\dagger},\theta)=\check{z}\big(|\tilde{u}(\theta)|-|\tilde{u}|\big)=\check{z}\big(|\tilde{u}-\tilde{x}_g(\bar{\theta})(\theta-\theta_o)|-|\tilde{u}|\big)=\check{z}\Big(2\,\mathrm{I}\big(\tilde{u}<\lambda_u\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big)-1\Big)\tilde{x}_g(\bar{\theta})(\theta-\theta_o).\tag{A.2}$$
Taking expectations and applying the law of iterated expectations (LIE),
$$\mathrm{E}[q(w,w^{\dagger},\theta)]=\mathrm{E}\Big[\check{z}\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[2\,\mathrm{I}\big(\tilde{u}<\lambda_u\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big)-1\big]\,\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\Big]=\mathrm{E}\Big[\check{z}\Big(2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big(\lambda_u\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big)-1\Big)\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\Big].$$
Noting that $F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0)=1/2$, $\big(2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\lambda_u\tilde{x}_g(\bar{\theta})(\theta-\theta_o))-1\big)\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\geq0$. This implies
$$\mathrm{E}[q(w,w^{\dagger},\theta)]=\mathrm{E}\Big[\check{z}\,\Big|\big(2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\lambda_u\tilde{x}_g(\bar{\theta})(\theta-\theta_o))-1\big)\Big|\times\big|\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big|\Big]=\int|2\tau-1|\,\mathrm{E}\big[\check{z}\,\big|\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big|\times\mathrm{I}\big((x,x^{\dagger},z,z^{\dagger})\in A_{\theta}(\tau)\big)\big]\,d\tau=\int|2\tau-1|\int_{A_{\theta}(\tau)}\big[\check{z}\,\big|\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big|\big]\,d\mathrm{P}_{\tilde{X}_g,\hat{Z}}\,d\tau.$$

With Lemma A.1 and Proposition 3.1 in hand, the identification result is proved next.
Proof of Lemma 3.1. Under the assumptions of Proposition 3.1, $\mathrm{E}[q(w,w^{\dagger},\theta)]=\int|2\tau-1|\int_{A_{\theta}(\tau)}\big[\check{z}\,|\tilde{x}_g(\bar{\theta})(\theta-\theta_o)|\big]\,d\mathrm{P}_{\tilde{X}_g,\hat{Z}}\,d\tau$. This implies, by the LIE,
$$\mathrm{E}[Q_n(\theta)]\equiv\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}[q(w_i,w_j,\theta)]=\int|2\tau-1|\,\mathrm{E}\big[V_{n,\tau}\big(X_g(\bar{\theta})(\theta-\theta_o),Z\big)\big]\,d\tau,$$
so that the minimiser $\theta_o$ of $\mathrm{E}[Q_n(\theta)]$ on $\Theta$ is unique. By Property (b), there exists $\tilde{\delta}_{\varepsilon}>0$ such that, for $n$ sufficiently large, $\inf_{\{\theta\in\Theta:||\theta-\theta_o||\geq\varepsilon\}}\mathrm{E}\big[V_{n,\tau}\big(X_g(\bar{\theta})(\theta-\theta_o),Z\big)\big]\geq\tilde{\delta}_{\varepsilon}$ for all $\tau\in[0,1]$. Define $f_{\theta}(\tau)=\mathrm{E}\big[V_{n,\tau}\big(X_g(\bar{\theta})(\theta-\theta_o),Z\big)\big]/\mathrm{E}\big[V_n\big(X_g(\bar{\theta})(\theta-\theta_o),Z\big)\big]$ and $F_{\theta}(\tau)=\int_0^{\tau}f_{\theta}(t)\,dt$. For all $\theta\neq\theta_o$, $f_{\theta}(\tau)\geq0$, $\int_0^1 f_{\theta}(t)\,dt=1$, and $\int|2\tau-1|f_{\theta}(\tau)\,d\tau=1-2\big(\int_{0.5}^{1}F_{\theta}(\tau)\,d\tau-\int_0^{0.5}F_{\theta}(\tau)\,d\tau\big)$ using integration by parts. By the MVT, there exist $\bar{\tau}^{*}_{\theta}$ and $\tau^{*}_{\theta}$, $0<\tau^{*}_{\theta}<0.5<\bar{\tau}^{*}_{\theta}<1$, such that $\int_0^{0.5}F_{\theta}(\tau)\,d\tau=0.5F_{\theta}(\tau^{*}_{\theta})$ and $\int_{0.5}^{1}F_{\theta}(\tau)\,d\tau=0.5F_{\theta}(\bar{\tau}^{*}_{\theta})$. Further, $0<F_{\theta}(\tau^{*}_{\theta})\leq F_{\theta}(\bar{\tau}^{*}_{\theta})<1$, and this implies
$$\mathrm{E}[Q_n(\theta)]=\mathrm{E}\big[V_n\big(X_g(\bar{\theta})(\theta-\theta_o),Z\big)\big]\int|2\tau-1|f_{\theta}(\tau)\,d\tau=\big(1-(F_{\theta}(\bar{\tau}^{*}_{\theta})-F_{\theta}(\tau^{*}_{\theta}))\big)\,\mathrm{E}\big[V_n\big(X_g(\bar{\theta})(\theta-\theta_o),Z\big)\big]\equiv d_n(\theta)\,\mathrm{E}\big[V_n\big(X_g(\bar{\theta})(\theta-\theta_o),Z\big)\big],$$
with $0<\inf_{\{\theta\in\Theta:||\theta-\theta_o||\geq\varepsilon\}}d_n(\theta)\leq\sup_{\{\theta\in\Theta:||\theta-\theta_o||\geq\varepsilon\}}d_n(\theta)<1$. Set $\delta_{\varepsilon}=\tilde{\delta}_{\varepsilon}\inf_{\{\theta\in\Theta:||\theta-\theta_o||\geq\varepsilon\}}d_n(\theta)$ for $n$ sufficiently large. The conclusion follows from Assumption 2(b).

The final part of the proof of Theorem 1 follows Newey and McFadden (1994, Theorem 2.1). For every $\epsilon>0$, with probability approaching one, (a) $\mathrm{E}[Q_n(\hat{\theta}_n)]<Q_n(\hat{\theta}_n)+\epsilon/3$ by the uniform convergence in Lemma A.1, (b) $Q_n(\hat{\theta}_n)<Q_n(\theta_o)+\epsilon/3$ by the definition of $\hat{\theta}_n$, and (c) $Q_n(\theta_o)<\mathrm{E}[Q_n(\theta_o)]+\epsilon/3$. Combining (a)-(c), $\mathrm{E}[Q_n(\hat{\theta}_n)]<Q_n(\theta_o)+2\epsilon/3<\mathrm{E}[Q_n(\theta_o)]+\epsilon$, i.e., for $n$ sufficiently large and every $\epsilon>0$, $\mathrm{E}[Q_n(\hat{\theta}_n)]<\mathrm{E}[Q_n(\theta_o)]+\epsilon$. Under the assumptions of Lemma 3.1, for every $\varepsilon>0$ there exists a $\delta_{\varepsilon}>0$ such that $\mathrm{E}[Q_n(\theta_o)]<\inf_{\{\theta\in\Theta:||\theta-\theta_o||\geq\varepsilon\}}\mathrm{E}[Q_n(\theta)]-\delta_{\varepsilon}$. By the continuity of $\mathrm{E}[Q_n(\theta)]$ (Lemma A.1) and compactness of the set $N_{\varepsilon}\equiv\{\theta\in\Theta:||\theta-\theta_o||\geq\varepsilon\}$, let $\inf_{\theta\in N_{\varepsilon}}\mathrm{E}[Q_n(\theta)]=\mathrm{E}[Q_n(\theta^{*})]$ for some $\theta^{*}\in N_{\varepsilon}$. This implies $\mathrm{E}[Q_n(\theta_o)]<\mathrm{E}[Q_n(\theta^{*})]-\delta_{\varepsilon}$. Choosing $\epsilon=\mathrm{E}[Q_n(\theta^{*})]-\mathrm{E}[Q_n(\theta_o)]-\delta_{\varepsilon}$, it follows that $\mathrm{E}[Q_n(\hat{\theta}_n)]<\inf_{\theta\in N_{\varepsilon}}\mathrm{E}[Q_n(\theta)]-\delta_{\varepsilon}$ for $n$ sufficiently large, and thus $\hat{\theta}_n\in N^{c}_{\varepsilon}$ (the complement of $N_{\varepsilon}$) with probability approaching one.

B Proof of Theorem 2
Part (a): By Lemma 11.10 of Van der Vaart (1998), the Hájek projection of $U_n$ is $\hat{U}_n\equiv\sum_{i=1}^{n}\mathrm{E}[U_n\,|\,w_i]=\frac{2}{n}\sum_{i=1}^{n}\tau_n'\psi_i(w_i)$ under Assumption 1(a). The rest of the proof of part (a) involves the verification of the conditions of Hoeffding (1948, Theorem 8.1). These conditions are needed to apply the Lyapunov CLT because $\{\psi_i(w_i),\,1\leq i\leq n\}$ are independently distributed random vectors. The conditions are stated in the following lemma.

Lemma B.1.
Suppose Assumption 3(e) holds, then for any random vectors $w,w^{\dagger}$ defined on the support of $w_i$ and $\tau_n\in\mathbb{R}^{p_{\theta}}$ with $||\tau_n||=1$,
(i) $\mathrm{E}[(\tau_n'\psi(w,w^{\dagger}))^2]\leq C^{1/2}$;
(ii) $\mathrm{E}[|\tau_n'\psi_i(w_i)|^3]\leq C^{1/2}$;
(iii) $\lim_{n\rightarrow\infty}\sum_{i=1}^{n}\mathrm{E}[|\tau_n'\psi_i(w_i)|^3]\big/\big(\sum_{i=1}^{n}\mathrm{E}[(\tau_n'\psi_i(w_i))^2]\big)^{3/2}=0$.

Proof. Condition (i): $\mathrm{E}[(\tau_n'\psi(w,w^{\dagger}))^2]=\mathrm{E}[\tau_n'\psi(w,w^{\dagger})\psi(w,w^{\dagger})'\tau_n]=(\tau_n'\otimes\tau_n')\mathrm{vec}\big(\mathrm{E}[\psi(w,w^{\dagger})\psi(w,w^{\dagger})']\big)\leq||\tau_n'\otimes\tau_n'||\cdot||\mathrm{E}[\psi(w,w^{\dagger})\psi(w,w^{\dagger})']||\leq\mathrm{E}[||\hat{z}\tilde{x}_g(\theta_o)||^2]\leq\big(\mathrm{E}[||\hat{z}\tilde{x}_g(\theta_o)||^4]\big)^{1/2}\leq C^{1/2}<\infty$ by Jensen's inequality, Assumption 3(e), and $||\tau_n'\otimes\tau_n'||=||\tau_n||^2$ by Bernstein (2009, Fact 9.7.27).

Condition (ii): For any $i$, $1\leq i\leq n$,
$$\mathrm{E}[|\tau_n'\psi_i(w_i)|^3]=\frac{1}{(n-1)^3}\mathrm{E}\Big[\Big|\sum_{j\neq i}\tau_n'\mathrm{E}[\psi(w_i,w_j)\,|\,w_i]\Big|^3\Big]\leq\max_{1\leq j\leq n}\mathrm{E}\big[\big|\mathrm{E}[\tau_n'\psi(w_i,w_j)\,|\,w_i]\big|^3\big]\leq\max_{1\leq j\leq n}\mathrm{E}\big[|\tau_n'\psi(w_i,w_j)|^3\big]=\max_{1\leq j\leq n}\mathrm{E}\big[|\hat{z}_{ij}\tau_n'\tilde{x}_{gij}(\theta_o)'|^3\big]\leq\max_{1\leq j\leq n}\mathrm{E}\big[||\hat{z}_{ij}\tilde{x}_{gij}(\theta_o)||^3\big]\leq\max_{1\leq j\leq n}\big(\mathrm{E}\big[||\hat{z}_{ij}\tilde{x}_{gij}(\theta_o)||^6\big]\big)^{1/2}\leq C^{1/2}.$$
The first inequality expands the cube into a triple sum and applies Jensen's inequality (for the conditional expectation), the convexity of $|\cdot|^3$, and the LIE; the penultimate inequality follows since $|(\tilde{x}_{gij}(\theta_o)'\otimes\hat{z}_{ij})\mathrm{vec}(\tau_n')|\leq||\tilde{x}_{gij}(\theta_o)'\otimes\hat{z}_{ij}||\cdot||\mathrm{vec}(\tau_n')||=||\hat{z}_{ij}\tilde{x}_{gij}(\theta_o)||$; and the last inequality follows from Jensen's inequality and Assumption 3(e).

Condition (iii):
$$\sum_{i=1}^{n}\mathrm{E}[(\tau_n'\psi_i(w_i))^2]=\tau_n'\Big(\sum_{i=1}^{n}\mathrm{E}[\psi_i(w_i)\psi_i(w_i)']\Big)\tau_n=n\,\tau_n'\,\mathrm{var}\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi_i(w_i)\Big)\tau_n=n\,\tau_n'\Psi_n\tau_n\geq n\min_{||\tau||=1}\tau'\Psi_n\tau=n\rho_{\min}(\Psi_n)\geq nC^{-1/2}$$
by Assumption 3(a) for $n$ sufficiently large. This, in conjunction with part (ii) above, implies
$$\frac{\sum_{i=1}^{n}\mathrm{E}[|\tau_n'\psi_i(w_i)|^3]}{\big(\sum_{i=1}^{n}\mathrm{E}[(\tau_n'\psi_i(w_i))^2]\big)^{3/2}}\leq\frac{nC^{1/2}}{(nC^{-1/2})^{3/2}}=O(n^{-1/2})=o(1),$$
and the last condition of Hoeffding (1948, Theorem 8.1) is verified.

By Hoeffding (1948, Theorem 8.1), the preceding three conditions (i) to (iii) imply $U_n/\mathrm{var}(U_n)^{1/2}\xrightarrow{d}N(0,1)$. Since $V_n=(1-1/n)U_n$, $U_n/\mathrm{var}(U_n)^{1/2}=V_n/\mathrm{var}(V_n)^{1/2}=\sqrt{n}\,\tau_n'S_n(\theta_o)'/(\tau_n'\mathrm{E}[\Omega_n(\theta_o)]\tau_n)^{1/2}=\sqrt{n}\,||\tau_n'\mathrm{E}[\Omega_n(\theta_o)]^{1/2}||^{-1}\tau_n'S_n(\theta_o)'$. Noting that $||\tilde{\tau}_n||=1$ where $\tilde{\tau}_n=||\tau_n'\mathrm{E}[\Omega_n(\theta_o)]^{1/2}||^{-1}\tau_n'\mathrm{E}[\Omega_n(\theta_o)]^{1/2}$ completes the proof of Theorem 2(a).

Part (b): The proof of part (b) proceeds by verifying the conditions of Newey and McFadden (1994, Theorem 7.1). Twice differentiability of $\mathrm{E}[Q_n(\theta)]$ at $\theta_o$ holds under the assumptions of Lemma D.4. Under the assumptions of Theorem 1, $\hat{\theta}_n\xrightarrow{p}\theta_o$. $\mathrm{E}[Q_n(\theta)]$ is minimised by $\theta_o$ for $n$ sufficiently large under the assumptions of Lemmata 3.1 and D.1. $\sqrt{n}\,S_n(\theta_o)\xrightarrow{d}N(\mathbf{0},\Omega_o)$ follows from part (a) and the Cramér-Wold device. In addition to Assumptions 1(d), 3(b), and 3(c), the conclusion follows.

C Proof of Theorem 3
The following expression is useful in subsequent results. For any $\epsilon>0$,
$$\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\mathrm{I}(|\tilde{u}|\leq\epsilon)]/(2\epsilon)=\frac{F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\epsilon)-F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(-\epsilon)}{2\epsilon}=f_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big((1-2\lambda)\epsilon\big)\tag{C.1}$$
for some $\lambda\in(0,1)$ by Assumption 3(d) and the MVT (taken about $-\epsilon$).

A first step in the proof is to show that $\mathrm{plim}_{n\rightarrow\infty}\check{H}_n(\hat{\theta}_n)=\lim_{n\rightarrow\infty}\mathrm{E}[H_n(\theta_o)]\equiv H_o$. Recall
$$\check{H}_n(\hat{\theta}_n)=\frac{1}{n^2\hat{c}_n}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big\{\mathrm{I}(|\tilde{u}_{ij}(\hat{\theta}_n)|\leq\hat{c}_n)\,\check{z}_{ij}\,\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gij}(\hat{\theta}_n)\Big\}\quad\text{and}\quad\mathrm{E}[H_n(\theta_o)]=\frac{2}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\Big[\hat{z}_{ij}\,f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\,\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)\Big].$$
Define
$$\tilde{H}_n(\theta_o)=\frac{1}{n^2c_n}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big\{\mathrm{I}(|\tilde{u}_{ij}|\leq c_n)\,\hat{z}_{ij}\,\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)\Big\}.$$
By the triangle inequality, $||\check{H}_n(\hat{\theta}_n)-\tilde{H}_n(\theta_o)||\leq\frac{c_n}{\hat{c}_n}(A_{n,1}+A_{n,2}+A_{n,3}+A_{n,4})$ a.s., where
$$A_{n,1}\equiv\frac{1}{n^2c_n}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big\{|\check{z}_{ij}-\hat{z}_{ij}|\times\mathrm{I}(|\tilde{u}_{ij}(\hat{\theta}_n)|\leq\hat{c}_n)\times||\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gij}(\hat{\theta}_n)||\Big\},$$
$$A_{n,2}\equiv\frac{1}{n^2c_n}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big\{\big|\mathrm{I}(|\tilde{u}_{ij}(\hat{\theta}_n)|\leq\hat{c}_n)-\mathrm{I}(|\tilde{u}_{ij}|\leq\hat{c}_n)\big|\times||\hat{z}_{ij}\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gij}(\hat{\theta}_n)||\Big\},$$
$$A_{n,3}\equiv\frac{1}{n^2c_n}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big\{\mathrm{I}(|\tilde{u}_{ij}|\leq\hat{c}_n)\times||\hat{z}_{ij}[\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gij}(\hat{\theta}_n)-\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)]||\Big\},\quad\text{and}$$
$$A_{n,4}\equiv\frac{1}{n^2c_n}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big\{\big|\mathrm{I}(|\tilde{u}_{ij}|\leq\hat{c}_n)-\mathrm{I}(|\tilde{u}_{ij}|\leq c_n)\big|\times||\hat{z}_{ij}\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)||\Big\}.$$
The following lemma studies the terms $A_{n,1}$, $A_{n,2}$, $A_{n,3}$, and $A_{n,4}$.

Lemma C.1.
Under Assumptions 1 to 4, $\mathrm{E}[A_{n,1}]=o(1)$, $\mathrm{E}[A_{n,2}]=o(1)$, $\mathrm{E}[A_{n,3}]=o(1)$, and $\mathrm{E}[A_{n,4}]=o(1)$.

Proof. The verification of the terms $A_{n,1}$, $A_{n,2}$, $A_{n,3}$, and $A_{n,4}$ proceeds in the following parts.

$A_{n,1}$: By the Cauchy-Schwarz (CS) inequality,
$$\mathrm{E}[A_{n,1}]=\frac{1}{n^2c_n}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\big[|\check{z}_{ij}-\hat{z}_{ij}|\times\mathrm{I}(|\tilde{u}_{ij}(\hat{\theta}_n)|\leq\hat{c}_n)\times||\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gij}(\hat{\theta}_n)||\big]\leq\frac{1}{n^2c_n}\sum_{i=1}^{n}\sum_{j=1}^{n}\big(\mathrm{E}[(\check{z}_{ij}-\hat{z}_{ij})^2]\big)^{1/2}\big(\mathrm{E}[||\tilde{x}_{gij}(\hat{\theta}_n)||^4]\big)^{1/2}.$$
From the proof of Lemma D.1(a), Assumption 1(c), Assumption 3(e), and Assumption 4, it follows that $\mathrm{E}[A_{n,1}]=O\big(1/(\sqrt{n}c_n)\big)=o(1)$.

$A_{n,2}$: For $\bar{\theta}_n$ between $\hat{\theta}_n$ and $\theta_o$, Assumption 1(b) and the MVT give $\tilde{u}_{ij}(\hat{\theta}_n)=\tilde{u}_{ij}+\tilde{x}_{gij}(\bar{\theta}_n)(\hat{\theta}_n-\theta_o)$, hence, writing $\tilde{J}_{x,\theta}\equiv\tilde{x}_{gij}(\bar{\theta}_n)(\hat{\theta}_n-\theta_o)$ for ease of notation,
$$\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[\big|\mathrm{I}(|\tilde{u}_{ij}(\hat{\theta}_n)|\leq\hat{c}_n)-\mathrm{I}(|\tilde{u}_{ij}|\leq\hat{c}_n)\big|\big]=\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[\mathrm{I}(\hat{c}_n<\tilde{u}_{ij}\leq\hat{c}_n-\tilde{J}_{x,\theta})\big]+\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[\mathrm{I}(-\hat{c}_n-\tilde{J}_{x,\theta}\leq\tilde{u}_{ij}<-\hat{c}_n)\big]\equiv\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}^{1}_{ij}]+\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}^{2}_{ij}].$$
By Assumption 3(d) and the MVT,
$$\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}^{1}_{ij}]=\max\big\{0,\,F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\hat{c}_n-\tilde{J}_{x,\theta}\,|\,\omega^{\tilde{x},\hat{z}}_{ij})-F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\hat{c}_n\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\big\}=\max\big\{0,\,-f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\hat{c}_n-\lambda_1\tilde{J}_{x,\theta}\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\tilde{J}_{x,\theta}\big\}$$
and
$$\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}^{2}_{ij}]=\max\big\{0,\,F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(-\hat{c}_n\,|\,\omega^{\tilde{x},\hat{z}}_{ij})-F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(-\hat{c}_n-\tilde{J}_{x,\theta}\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\big\}=\max\big\{0,\,f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(-\hat{c}_n-\lambda_2\tilde{J}_{x,\theta}\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\tilde{J}_{x,\theta}\big\}.$$
Since $f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\cdot)\leq C^{1/2}$ a.s. by Assumption 3(d), $|\tilde{J}_{x,\theta}|=O_p(n^{-1/2})$ by Assumption 3(e) and Theorem 2(b), and $\tilde{J}_{x,\theta}/c_n=o_p(1)$ by Assumption 4, it follows that $\big(\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}^{1}_{ij}]+\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}^{2}_{ij}]\big)/(2c_n)=o_p(1)$. From the foregoing, the LIE, the CS inequality, Jensen's inequality, and Assumption 3(e),
$$\mathrm{E}[A_{n,2}]\leq\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big\{\mathrm{E}\Big[\Big(\frac{\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}^{1}_{ij}]+\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}^{2}_{ij}]}{2c_n}\Big)^2\Big]\,\mathrm{E}\big[||\max\{|\hat{z}_{ij}|,1\}\tilde{x}_{gij}(\hat{\theta}_n)||^4\big]\Big\}^{1/2}=o(1).$$

$A_{n,3}$: First, by Assumption 3(d), eq. (C.1), and the MVT, $\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\mathrm{I}(|\tilde{u}_{ij}|\leq\hat{c}_n)]/(2c_n)=f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\lambda_3\hat{c}_n\,|\,\omega^{\tilde{x},\hat{z}}_{ij})(\hat{c}_n/c_n)$ for some $\lambda_3\in(-1,1)$. Second, for any $1\leq i,j,k,l\leq n$ and $\bar{\theta}_{1n},\bar{\theta}_{2n}$ between $\hat{\theta}_n$ and $\theta_o$, the CS inequality, the MVT, and Assumption 3(e) yield
$$\mathrm{E}\big[|\hat{z}_{ij}\hat{z}_{kl}|\times||\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gkl}(\hat{\theta}_n)-\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gkl}(\theta_o)||\big]\leq2C^{3/4}\big(\mathrm{E}[||\hat{\theta}_n-\theta_o||^2]\big)^{1/2}.\tag{C.2}$$
It then follows from the LIE, Assumption 3(d), Assumption 4, the CS inequality, eq. (C.2), and Theorem 2(b) that $\mathrm{E}[A_{n,3}]\leq2C^{3/4}\big(\mathrm{E}[||\hat{\theta}_n-\theta_o||^2]\big)^{1/2}\times O(1)=o(1)$.

$A_{n,4}$: By Assumption 3(d), Assumption 4, eq. (C.1), and the MVT, for some $\lambda_4,\lambda_5\in(0,1)$,
$$\frac{\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[\big|\mathrm{I}(|\tilde{u}_{ij}|\leq\hat{c}_n)-\mathrm{I}(|\tilde{u}_{ij}|\leq c_n)\big|\big]}{2c_n}=\max\Big\{0,\frac{F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\hat{c}_n\,|\,\omega^{\tilde{x},\hat{z}}_{ij})-F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(c_n\,|\,\omega^{\tilde{x},\hat{z}}_{ij})}{2c_n}\Big\}+\max\Big\{0,\frac{F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(-c_n\,|\,\omega^{\tilde{x},\hat{z}}_{ij})-F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(-\hat{c}_n\,|\,\omega^{\tilde{x},\hat{z}}_{ij})}{2c_n}\Big\}=o_p(1)$$
since $\hat{c}_n/c_n\xrightarrow{p}1$. By arguments analogous to the case of $A_{n,2}$, $\mathrm{E}[A_{n,4}]=o(1)$.

To complete the proof of the consistency of $\check{H}_n(\hat{\theta}_n)$, it suffices to show that $\tilde{H}_n(\theta_o)-\mathrm{E}[H_n(\theta_o)]$ converges to zero in quadratic mean. This is shown in the following lemma.

Lemma C.2.
Under Assumptions 1(a), 3(d)-3(e), and 4, $\tilde{H}_n(\theta_o)-\mathrm{E}[H_n(\theta_o)]$ converges to zero in quadratic mean.

Proof. First,
$$\big|\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\mathrm{I}(|\tilde{u}_{ij}|\leq c_n)]/(2c_n)-f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\big|=\big|f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\lambda c_n\,|\,\omega^{\tilde{x},\hat{z}}_{ij})-f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\big|\leq\lambda C^{1/2}c_n\ \text{a.s.}$$
for some $\lambda\in(0,1)$ by Assumption 3(d) and the MVT. In addition to Assumption 3(e), Assumption 4, and the CS inequality, this implies
$$||\mathrm{E}[\tilde{H}_n(\theta_o)]-\mathrm{E}[H_n(\theta_o)]||\leq\Big(\mathrm{E}\Big[\big(\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\mathrm{I}(|\tilde{u}_{ij}|\leq c_n)]/(2c_n)-f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\big)^2\Big]\Big)^{1/2}\Big(\mathrm{E}\big[||\max\{|\hat{z}_{ij}|,1\}\tilde{x}_{gij}(\theta_o)||^4\big]\Big)^{1/2}\leq\lambda Cc_n=O(c_n)=o(1).$$
Let $\tau_1$ and $\tau_2$ be two $p_{\theta}\times1$ vectors with $||\tau_1||=||\tau_2||=1$. Then
$$\mathrm{var}(\tau_1'\tilde{H}_n(\theta_o)\tau_2)=\frac{1}{n^4c_n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n}\sum_{l=1}^{n}\mathrm{cov}\Big(\mathrm{I}(|\tilde{u}_{ij}|\leq c_n)\hat{z}_{ij}\tau_1'\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)\tau_2,\ \mathrm{I}(|\tilde{u}_{kl}|\leq c_n)\hat{z}_{kl}\tau_1'\tilde{x}_{gkl}(\theta_o)'\tilde{x}_{gkl}(\theta_o)\tau_2\Big).$$
By Assumption 1(a), the covariance terms with $\{k,l\}\cap\{i,j\}=\emptyset$ vanish, so at most $O(n^3)$ terms are non-zero. Bounding each non-zero covariance by the CS inequality, $\mathrm{var}(\tau_1'\tilde{M}\tau_2)\leq\mathrm{E}[||\tilde{M}||^2]$ for a matrix-valued random variable $\tilde{M}$ (using $||\tau_1'\otimes\tau_2'||=||\tau_1||\cdot||\tau_2||$ by Bernstein (2009, Fact 9.7.27)), Jensen's inequality, and Assumptions 3(e) and 4 give $\mathrm{var}(\tau_1'\tilde{H}_n(\theta_o)\tau_2)\leq C^{1/2}/(nc_n^2)=o(1)$, and the assertion is proved.

Combining Lemmata C.1 and C.2 shows that $\mathrm{plim}_{n\rightarrow\infty}\check{H}_n(\hat{\theta}_n)=\lim_{n\rightarrow\infty}\mathrm{E}[H_n(\theta_o)]\equiv H_o$. The next part of the proof of Theorem 3 concerns the consistency of $\check{\Omega}_n(\hat{\theta}_n)$. The result is stated in the following lemma.

Lemma C.3.
Under Assumptions 1 to 3, $\mathrm{plim}_{n\rightarrow\infty}\check{\Omega}_n(\hat{\theta}_n)=\Omega_o$.

Proof. Recall
$$\check{\Omega}_n(\hat{\theta}_n)=\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\Big\{\check{z}_{ij}^2\,\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gij}(\hat{\theta}_n)+2\check{z}_{ij}\check{z}_{ik}\big(4\,\mathrm{I}(\tilde{u}_{ij}(\hat{\theta}_n)<0,\ \tilde{u}_{ik}(\hat{\theta}_n)<0)-1\big)\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gik}(\hat{\theta}_n)\Big\}.$$
Also, define
$$\Omega_n(\theta)=\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\Big\{\hat{z}_{ij}^2\,\tilde{x}_{gij}(\theta)'\tilde{x}_{gij}(\theta)+2\hat{z}_{ij}\hat{z}_{ik}\big(4\,\mathrm{I}(\tilde{u}_{ij}(\theta)<0,\ \tilde{u}_{ik}(\theta)<0)-1\big)\tilde{x}_{gij}(\theta)'\tilde{x}_{gik}(\theta)\Big\}\equiv\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\big\{A_{ij}(\theta)+2B_{ijk}(\theta)\big\},$$
where $\Omega_o\equiv\mathrm{plim}_{n\rightarrow\infty}\Omega_n(\theta_o)$. By the triangle inequality, $||\check{\Omega}_n(\hat{\theta}_n)-\Omega_n(\theta_o)||\leq||\check{\Omega}_n(\hat{\theta}_n)-\Omega_n(\hat{\theta}_n)||+||\Omega_n(\hat{\theta}_n)-\Omega_n(\theta_o)||$ a.s. Noting that $\check{z}_{ij}\check{z}_{kl}-\hat{z}_{ij}\hat{z}_{kl}=(\check{z}_{ij}-\hat{z}_{ij})\check{z}_{kl}+(\check{z}_{kl}-\hat{z}_{kl})\hat{z}_{ij}$, $|\check{z}_{ij}+\hat{z}_{ij}|\leq2\max\{|\check{z}_{ij}|,|\hat{z}_{ij}|\}$, and $|4\,\mathrm{I}(\tilde{u}_{ij}(\hat{\theta}_n)<0,\ \tilde{u}_{ik}(\hat{\theta}_n)<0)-1|\leq3$ for any $1\leq i,j,k,l\leq n$,
$$||\check{\Omega}_n(\hat{\theta}_n)-\Omega_n(\hat{\theta}_n)||\leq\frac{2}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}|\check{z}_{ij}-\hat{z}_{ij}|\times||\max\{|\check{z}_{ij}|,|\hat{z}_{ij}|\}\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gij}(\hat{\theta}_n)||+\frac{6}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}|\check{z}_{ij}-\hat{z}_{ij}|\times||\check{z}_{ik}\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gik}(\hat{\theta}_n)||+\frac{6}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}|\check{z}_{ik}-\hat{z}_{ik}|\times||\hat{z}_{ij}\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gik}(\hat{\theta}_n)||\ \text{a.s.}$$
From the proof of Lemma D.1(a), the CS inequality, Jensen's inequality, Assumption 1(c), and Assumption 3(e), each of the three terms has expectation of order $O(n^{-1/2})$, so that $\mathrm{E}[||\check{\Omega}_n(\hat{\theta}_n)-\Omega_n(\hat{\theta}_n)||]=O(n^{-1/2})=o(1)$.

To complete the proof, it suffices to show that $||\Omega_n(\hat{\theta}_n)-\Omega_n(\theta_o)||$ converges to zero in probability. Using (C.2), the CS inequality, the MVT, and Assumptions 1(b) and 3(e),
$$\mathrm{E}[||A_{ij}(\hat{\theta}_n)-A_{ij}(\theta_o)||]=\mathrm{E}\big[\hat{z}_{ij}^2\,||\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gij}(\hat{\theta}_n)-\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)||\big]\leq2C^{3/4}\big(\mathrm{E}[||\hat{\theta}_n-\theta_o||^2]\big)^{1/2}=o(1).$$
By Assumption 1(b) and the MVT,
$$\big|\mathrm{I}(\tilde{u}_{ij}(\hat{\theta}_n)<0,\ \tilde{u}_{ik}(\hat{\theta}_n)<0)-\mathrm{I}(\tilde{u}_{ij}<0,\ \tilde{u}_{ik}<0)\big|\leq\mathrm{I}\big(0\leq\tilde{u}_{ij}<-\tilde{x}_{gij}(\bar{\theta}_n)(\hat{\theta}_n-\theta_o)\big)+\mathrm{I}\big(0\leq\tilde{u}_{ik}<-\tilde{x}_{gik}(\bar{\theta}_n)(\hat{\theta}_n-\theta_o)\big)\equiv\tilde{\mathrm{I}}_{ij}+\tilde{\mathrm{I}}_{ik}.$$
For some $\lambda_6\in(0,1)$,
$$\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}_{ij}]\leq\max\big\{0,\ F_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big(-\tilde{x}_{gij}(\bar{\theta}_n)(\hat{\theta}_n-\theta_o)\,|\,\omega^{\tilde{x},\hat{z}}_{ij}\big)-F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\big\}=\max\big\{0,\ -f_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big(-\lambda_6\tilde{x}_{gij}(\bar{\theta}_n)(\hat{\theta}_n-\theta_o)\,|\,\omega^{\tilde{x},\hat{z}}_{ij}\big)\tilde{x}_{gij}(\bar{\theta}_n)(\hat{\theta}_n-\theta_o)\big\}=O_p(n^{-1/2}),$$
and $\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}_{ik}]=O_p(n^{-1/2})$ by a similar derivation, whence
$$\mathrm{E}\big[||B_{ijk}(\hat{\theta}_n)-B_{ijk}(\theta_o)||\big]\leq4\,\mathrm{E}\big[\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}_{ij}+\tilde{\mathrm{I}}_{ik}]\cdot||\hat{z}_{ij}\tilde{x}_{gij}(\hat{\theta}_n)||\cdot||\hat{z}_{ik}\tilde{x}_{gik}(\hat{\theta}_n)||\big]+3\,\mathrm{E}\big[|\hat{z}_{ij}\hat{z}_{ik}|\times||\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gik}(\hat{\theta}_n)-\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gik}(\theta_o)||\big]\leq C^{3/4}\times O(n^{-1/2})+6C^{3/4}\big(\mathrm{E}[||\hat{\theta}_n-\theta_o||^2]\big)^{1/2}=o(1)$$
by Assumption 3(e), the CS inequality, and eq. (C.2). This completes the proof.
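For readers who wish to compute the plug-in standard errors in practice, the following is a minimal numerical sketch of the estimators analysed above, specialised to a linear disturbance $u_i(\theta)=y_i-x_i\theta$. The function names, the explicit $O(n^3)$ loop, the exact double-centring convention for the instrument distances, and the sandwich form $H^{-1}\Omega H^{-1}/n$ are illustrative assumptions based on the expressions above rather than the paper's own implementation.

```python
import numpy as np

def double_centered_dist(z):
    """Pairwise Euclidean distance matrix of z, double-centered as in the dCov literature."""
    z = np.asarray(z, dtype=float).reshape(len(z), -1)
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=2)
    return d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()

def mdep_sandwich(y, X, Z, theta_hat, c_n):
    """Plug-in covariance sketch for the linear model u_i(theta) = y_i - X_i theta.
    Returns (H_hat, Omega_hat, avar); avar uses the assumed sandwich H^-1 Omega H^-1 / n."""
    n, p = X.shape
    u = y - X @ np.asarray(theta_hat)
    du = u[:, None] - u[None, :]                 # pairwise residual differences u_ij
    xg = -(X[:, None, :] - X[None, :, :])        # Jacobian of u_ij w.r.t. theta (linear model)
    zc = double_centered_dist(Z)                 # centered instrument distances z_ij

    # Kernel-smoothed Hessian estimator with bandwidth c_n
    K = (np.abs(du) <= c_n).astype(float)
    H = np.einsum('ij,ij,ijp,ijq->pq', K, zc, xg, xg) / (n ** 2 * c_n)

    # Score-covariance estimator; the triple loop mirrors the i, j, k != i, j sum above
    Omega = np.zeros((p, p))
    neg = (du < 0).astype(float)
    for i in range(n):
        for j in range(n):
            Omega += (n - 2) * zc[i, j] ** 2 * np.outer(xg[i, j], xg[i, j])
            for k in range(n):
                if k == i or k == j:
                    continue
                w = 4.0 * neg[i, j] * neg[i, k] - 1.0
                Omega += 2.0 * zc[i, j] * zc[i, k] * w * np.outer(xg[i, j], xg[i, k])
    Omega /= n ** 3

    Hinv = np.linalg.inv(H)
    return H, Omega, Hinv @ Omega @ Hinv / n
```

Standard errors would then be the square roots of the diagonal of the returned covariance matrix; the bandwidth argument c_n can be supplied by the rule discussed in Section E.3.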
Online Appendix

D Useful Lemmata

D.1 Asymptotic Equivalence of Normalised Minimands

Lemma D.1.
Under Assumptions 1(a), 1(b), 1(d), and 3(e),
(a) $|\check{z}_{ij}-\hat{z}_{ij}|=O_p(n^{-1/2})$;
(b) $\sup_{\theta\in\Theta}|\check{Q}_n(\theta)-Q_n(\theta)|\xrightarrow{p}0$.

Proof. Part (a): For any $i,j$, $1\leq i,j\leq n$, $\mathrm{E}[\check{z}_{ij}-\hat{z}_{ij}]=0$ by the LIE. Moreover, it follows from Loève's $c_r$-inequality, Assumption 1(a), the CS inequality, and Assumption 3(e) that
$$\mathrm{E}[|\check{z}_{ij}-\hat{z}_{ij}|^2]\leq\frac{3}{n^2}\mathrm{E}\Big[\Big(\sum_{k=1}^{n}\big(||\tilde{z}_{ik}||-\mathrm{E}[\,||\tilde{z}_{ik}||\,|\,z_i]\big)\Big)^2\Big]+\frac{3}{n^2}\mathrm{E}\Big[\Big(\sum_{k=1}^{n}\big(||\tilde{z}_{kj}||-\mathrm{E}[\,||\tilde{z}_{kj}||\,|\,z_j]\big)\Big)^2\Big]+\frac{3}{n^4}\mathrm{E}\Big[\Big(\sum_{k=1}^{n}\sum_{l=1}^{n}\big(||\tilde{z}_{kl}||-\mathrm{E}[||\tilde{z}_{kl}||]\big)\Big)^2\Big]$$
$$=\frac{3}{n^2}\sum_{k=1}^{n}\mathrm{E}[\mathrm{var}(||\tilde{z}_{ik}||\,|\,z_i)]+\frac{3}{n^2}\sum_{k=1}^{n}\mathrm{E}[\mathrm{var}(||\tilde{z}_{kj}||\,|\,z_j)]+\frac{3}{n^4}\sum_{k=1}^{n}\sum_{l=1}^{n}\mathrm{var}(||\tilde{z}_{kl}||)+\frac{6}{n^4}\sum_{k=1}^{n}\sum_{l=1}^{n}\sum_{l'\neq l}\mathrm{cov}(||\tilde{z}_{kl}||,||\tilde{z}_{kl'}||)\leq\frac{3C^{1/2}}{n}+\frac{3C^{1/2}}{n}+\frac{3C^{1/2}}{n}+\frac{6C^{1/2}}{n}=\frac{15C^{1/2}}{n}.\tag{D.1}$$
This implies $|\check{z}_{ij}-\hat{z}_{ij}|=O_p(n^{-1/2})$ by the Markov inequality. (As applied here, the $c_r$-inequality is $\mathrm{E}[(\xi_1+\xi_2+\xi_3)^2]\leq3\mathrm{E}[\xi_1^2]+3\mathrm{E}[\xi_2^2]+3\mathrm{E}[\xi_3^2]$; see Davidson (1994, eqn. 9.62).)

Part (b): From (D.1) above, Jensen's inequality, Assumption 1(b), the MVT, and the CS inequality,
$$\mathrm{E}[|\check{Q}_n(\theta)-Q_n(\theta)|]\leq\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\big[|\check{z}_{ij}-\hat{z}_{ij}|\times\big||\tilde{u}_{ij}(\theta)|-|\tilde{u}_{ij}|\big|\big]\leq\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\big[|\check{z}_{ij}-\hat{z}_{ij}|\times|\tilde{u}_{ij}(\theta)-\tilde{u}_{ij}|\big]=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\big[|\check{z}_{ij}-\hat{z}_{ij}|\times|\tilde{x}_{gij}(\bar{\theta})(\theta-\theta_o)|\big]$$
$$\leq\frac{||\theta-\theta_o||}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\big(\mathrm{E}[(\check{z}_{ij}-\hat{z}_{ij})^2]\big)^{1/2}\big(\mathrm{E}[||\tilde{x}_{gij}(\bar{\theta})||^2]\big)^{1/2}\leq\frac{\sqrt{15}\,C^{3/4}\,||\theta-\theta_o||}{\sqrt{n}}$$
for any $\theta\in\Theta$. By Assumption 1(d) (compactness of $\Theta$), $\sup_{\theta\in\Theta}|\check{Q}_n(\theta)-Q_n(\theta)|=O_p(n^{-1/2})$ by the Markov inequality.
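As a computational companion to Lemma D.1, the following is a minimal sketch of the sample-centred criterion $\check{Q}_n(\theta)$ that the lemma compares with its population-centred counterpart, again specialised to a linear disturbance $u_i(\theta)=y_i-x_i\theta$. The double-centring convention, the derivative-free Nelder-Mead optimiser, and the zero starting value are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def double_centered_dist(z):
    # Same centering helper as in the covariance sketch above.
    z = np.asarray(z, dtype=float).reshape(len(z), -1)
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=2)
    return d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()

def mdep_objective(theta, y, X, zc):
    # (1/n^2) * sum_ij zc_ij * |u_i(theta) - u_j(theta)| for the linear model.
    u = y - X @ np.asarray(theta)
    return np.mean(zc * np.abs(u[:, None] - u[None, :]))

def mdep_fit(y, X, Z, theta0=None):
    # The criterion is non-smooth in theta, hence a derivative-free minimiser.
    zc = double_centered_dist(Z)
    theta0 = np.zeros(X.shape[1]) if theta0 is None else np.asarray(theta0)
    res = minimize(mdep_objective, theta0, args=(y, X, zc), method="Nelder-Mead")
    return res.x
```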
D.2 The Intercept and Jacobian

The class of models considered in the main text assumes the disturbance function is linear in the intercept (Assumption 1(a)). This subsection considers an extension to models where the disturbance function is not necessarily linear in the intercept, i.e., $y_i=G\big(g(x_i,\theta_o,\theta_{o,c})+u_i\big)$, where the intercept is absorbed into a possibly non-linear function. In this extended class, the intercept is no longer differenced out in (2.3). Moreover, a constant in $z_i$ and $z_j$ cannot drive identification of the intercept term because it is differenced out in $\check{z}_{ij}$. Whenever linearity of $u_i(\theta)$ in the intercept is not assumed, the intercept is required for identifying $\theta_o$. This problem is resolved, however, by noting that for a specified model the intercept is typically a deterministic function of $\theta$ and the data.

Let $\theta_c(\theta)$ denote the intercept as a function of any $\theta\in\Theta$ and the data. The model is specified through the disturbance function as $u_i(\theta)=G^{-1}(y_i)-g(x_i,\theta,\theta_c(\theta))$, e.g., $u_i(\theta)=y_i-\exp(\theta_c(\theta)+x_i\theta)$ for a non-linear parametric model such as the Poisson. Note that for a given sample size $n$, the intercept, as a function of $\theta$, is $\theta_c(\theta)=\log\big(\mathrm{E}_n[y_i]/\mathrm{E}_n[\exp(x_i\theta)]\big)$. Where $\theta_c(\theta)$ is available in closed form, one can directly substitute it into $u_i(\theta)$, whence, for example, $u_i(\theta)=y_i-\big(\mathrm{E}_n[y_i]/\mathrm{E}_n[\exp(x_i\theta)]\big)\exp(x_i\theta)$ for the non-linear model. Generally, $\theta_c(\theta)$ is implicitly defined by $\mathrm{E}_n[y_i-g(x_i,\theta,\theta_c(\theta))]=0$ for a sample of size $n$.

The identification of $\theta_o$ is tied to that of the intercept, which is implicitly defined by $\mathrm{E}_n[y_i-g(x_i,\theta,\theta_c(\theta))]=0$. The expected value of $\mathrm{E}_n[y_i-g(x_i,\theta,\theta_c(\theta))]$ does not exist if the expected value of $u_i$ does not exist. This implies that extending the class of models in the main text to accommodate disturbances that are non-linear in the intercept comes at the cost of additional assumptions for the intercept as a function of $\theta$ and for the Jacobian $\tilde{x}_{gij}(\theta)$ to be well-defined. The following lemma applies the implicit function theorem in the derivation of a general form of the Jacobian $x_{gi}(\theta)$ in this extended class. Fix $\theta_s\in\Theta$ and define $\theta_c^{*}$ such that $\mathrm{E}_n[y_i-g(x_i,\theta,\theta_c)]=0$ at $[\theta',\theta_c]=[\theta_s',\theta_c^{*}]$.

Lemma D.2.
In addition to Assumptions 1(b) and 1(d), suppose $\inf_{\theta_s\in\Theta}\Big|\mathrm{E}_n\Big[\frac{\partial g(x_i,\theta,\theta_c)}{\partial\theta_c}\Big|_{\theta=\theta_s,\,\theta_c=\theta_c^{*}}\Big]\Big|>C^{-1}$ a.s. for all $n$ and $\mathrm{E}[|u_i|]\leq C^{1/2}$, then the Jacobian $x_{gi}(\theta)$ exists for all $\theta\in\Theta$, $1\leq i\leq n$.

Proof. Under the assumption $\mathrm{E}[|u_i|]\leq C^{1/2}$, Assumption 1(b), and Assumption 1(d), the expectation of $\mathrm{E}_n[y_i-g(x_i,\theta,\theta_c(\theta))]=\mathrm{E}_n[u_i(\theta)]=\mathrm{E}_n[u_i+x_{gi}(\bar{\theta})(\theta-\theta_o)]$ exists, from which the intercept, as a function of $\theta$, is (implicitly) well-defined. Under the assumption of the lemma and Assumption 1(b),
$$\frac{\partial\theta_c(\theta)}{\partial\theta'}=-\Big(\mathrm{E}_n\Big[\frac{\partial g(x_i,\theta,\theta_c)}{\partial\theta_c}\Big]\Big)^{-1}\mathrm{E}_n\Big[\frac{\partial g(x_i,\theta,\theta_c)}{\partial\theta'}\Big]$$
holds by the implicit function theorem. The Jacobian is well-defined, and it is given by
$$x_{gi}(\theta)\equiv\frac{\partial u_i(\theta)}{\partial\theta'}=-\frac{\partial g(x_i,\theta,\theta_c)}{\partial\theta'}-\frac{\partial g(x_i,\theta,\theta_c)}{\partial\theta_c}\frac{\partial\theta_c(\theta)}{\partial\theta'}=-\frac{\partial g(x_i,\theta,\theta_c)}{\partial\theta'}+\frac{\partial g(x_i,\theta,\theta_c)}{\partial\theta_c}\Big(\mathrm{E}_n\Big[\frac{\partial g(x_i,\theta,\theta_c)}{\partial\theta_c}\Big]\Big)^{-1}\mathrm{E}_n\Big[\frac{\partial g(x_i,\theta,\theta_c)}{\partial\theta'}\Big].$$
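To make the closed-form case discussed above concrete, the following is a minimal sketch of the profiled intercept for the exponential example $u_i(\theta)=y_i-\exp(\theta_c(\theta)+x_i\theta)$; the function names are illustrative.

```python
import numpy as np

def intercept_exp(theta, y, X):
    # theta_c(theta) = log( E_n[y_i] / E_n[exp(x_i theta)] ), the closed form given in the text.
    return np.log(np.mean(y) / np.mean(np.exp(X @ np.asarray(theta))))

def disturbance_exp(theta, y, X):
    # u_i(theta) = y_i - exp(theta_c(theta) + x_i theta); E_n[u_i(theta)] = 0 by construction.
    theta = np.asarray(theta)
    return y - np.exp(intercept_exp(theta, y, X) + X @ theta)
```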
D.3 Covariance of the Score

Denote the score function by $S_n(\theta)\equiv\frac{\partial Q_n(\theta)}{\partial\theta'}$, and let $\mathrm{E}[\Omega_n(\theta_o)]\equiv\mathrm{var}(\sqrt{n}\,S_n(\theta_o))$. The following lemma derives $\mathrm{E}[\Omega_n(\theta_o)]$ and shows its boundedness.

Lemma D.3. Under Assumptions 1(a), 2(a), and 3(e), (a) $\mathrm{E}[S_n(\theta_o)]=\mathbf{0}$ and (b) $\mathrm{E}[\Omega_n(\theta_o)]$ is given by eq. (3.1) with $||\mathrm{E}[\Omega_n(\theta_o)]||\leq C^{1/2}$.

Proof of Lemma D.3. Part (a): Expressing $Q_n(\theta)$ as $\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\hat{z}_{ij}\big((1-2\,\mathrm{I}(\tilde{u}_{ij}(\theta)<0))\tilde{u}_{ij}(\theta)-|\tilde{u}_{ij}|\big)$, the score function is
$$S_n(\theta)=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\hat{z}_{ij}\big(1-2\,\mathrm{I}(\tilde{u}_{ij}(\theta)<0)\big)\tilde{x}_{gij}(\theta).$$
Evaluating the expectation of $S_n(\theta)$ at $\theta=\theta_o$ and applying the LIE,
$$\mathrm{E}[S_n(\theta_o)]=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\big[\hat{z}_{ij}\big(1-2\,\mathrm{I}(\tilde{u}_{ij}(\theta_o)<0)\big)\tilde{x}_{gij}(\theta_o)\big]=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\Big[\hat{z}_{ij}\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[1-2\,\mathrm{I}(\tilde{u}_{ij}<0)\big]\tilde{x}_{gij}(\theta_o)\Big]=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\Big[\hat{z}_{ij}\big(1-2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\big)\tilde{x}_{gij}(\theta_o)\Big]=\mathbf{0}.$$
The last equality follows from Assumption 2(a) and the definition of the median. The next step is to obtain $\mathrm{E}[\Omega_n(\theta_o)]\equiv\mathrm{var}(\sqrt{n}\,S_n(\theta_o))$, the covariance of $\sqrt{n}\,S_n(\theta_o)$.

Part (b):
$$\mathrm{E}[\Omega_n(\theta_o)]=n\,\mathrm{E}[S_n(\theta_o)'S_n(\theta_o)]=\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n}\sum_{l=1}^{n}\mathrm{E}\big[\hat{z}_{ij}\hat{z}_{kl}\big(1-2\,\mathrm{I}(\tilde{u}_{ij}<0)\big)\big(1-2\,\mathrm{I}(\tilde{u}_{kl}<0)\big)\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gkl}(\theta_o)\big]$$
$$=\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\mathrm{E}\Big\{\hat{z}_{ij}^2\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[(1-2\,\mathrm{I}(\tilde{u}_{ij}<0))^2\big]\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)+2\hat{z}_{ij}\hat{z}_{ik}\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[(1-2\,\mathrm{I}(\tilde{u}_{ij}<0))(1-2\,\mathrm{I}(\tilde{u}_{ik}<0))\big]\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gik}(\theta_o)\Big\},$$
since for $i\neq k,l$ and $j\neq k,l$, $\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[(1-2\,\mathrm{I}(\tilde{u}_{ij}<0))(1-2\,\mathrm{I}(\tilde{u}_{kl}<0))\big]=\big(1-2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\big)\big(1-2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{kl})\big)=0$. Simplifying terms, $(1-2\,\mathrm{I}(\tilde{u}_{ij}<0))^2=1$ and $\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[(1-2\,\mathrm{I}(\tilde{u}_{ij}<0))(1-2\,\mathrm{I}(\tilde{u}_{ik}<0))\big]=4\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[\mathrm{I}(\tilde{u}_{ij}<0,\ \tilde{u}_{ik}<0)\big]-1$. Substituting terms,
$$\mathrm{E}[\Omega_n(\theta_o)]=\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\mathrm{E}\Big\{\hat{z}_{ij}^2\,\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)+2\hat{z}_{ij}\hat{z}_{ik}\big(4\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\mathrm{I}(\tilde{u}_{ij}<0,\ \tilde{u}_{ik}<0)]-1\big)\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gik}(\theta_o)\Big\}=\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\mathrm{E}\Big\{\hat{z}_{ij}^2\,\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)+8\hat{z}_{ij}\hat{z}_{ik}\,\mathrm{cov}\big(\mathrm{I}(\tilde{u}_{ij}<0),\ \mathrm{I}(\tilde{u}_{ik}<0)\,|\,\omega^{\tilde{x},\hat{z}}_{ijk}\big)\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gik}(\theta_o)\Big\},$$
where $\omega^{\tilde{x},\hat{z}}_{ijk}\equiv\sigma(x_i,x_j,x_k,z_i,z_j,z_k)$ and $\mathrm{cov}(\xi_a,\xi_b\,|\,\xi_c)$ denotes the covariance of $\xi_a$ and $\xi_b$ conditional on $\xi_c$.
The second equality follows since
$$4\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[\mathrm{I}(\tilde{u}_{ij}<0,\ \tilde{u}_{ik}<0)\big]-1=4\,\mathrm{cov}\big(\mathrm{I}(\tilde{u}_{ij}<0),\mathrm{I}(\tilde{u}_{ik}<0)\,|\,\omega^{\tilde{x},\hat{z}}_{ijk}\big)+4\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[\mathrm{I}(\tilde{u}_{ij}<0)\big]\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[\mathrm{I}(\tilde{u}_{ik}<0)\big]-1=4\,\mathrm{cov}\big(\mathrm{I}(\tilde{u}_{ij}<0),\mathrm{I}(\tilde{u}_{ik}<0)\,|\,\omega^{\tilde{x},\hat{z}}_{ijk}\big)+4F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ik})-1=4\,\mathrm{cov}\big(\mathrm{I}(\tilde{u}_{ij}<0),\mathrm{I}(\tilde{u}_{ik}<0)\,|\,\omega^{\tilde{x},\hat{z}}_{ijk}\big)$$
by Assumption 2(a) and the definition of the median.

For the uniform boundedness of $\mathrm{E}[\Omega_n(\theta_o)]$, note that
$$||\mathrm{E}[\Omega_n(\theta_o)]||\leq\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\Big\{\big|\big|\mathrm{E}\big[\hat{z}_{ij}^2\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)\big]\big|\big|+8\big|\big|\mathrm{E}\big[\hat{z}_{ij}\hat{z}_{ik}\,\mathrm{cov}\big(\mathrm{I}(\tilde{u}_{ij}<0),\mathrm{I}(\tilde{u}_{ik}<0)\,|\,\omega^{\tilde{x},\hat{z}}_{ijk}\big)\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gik}(\theta_o)\big]\big|\big|\Big\}$$
$$\leq\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\Big\{\mathrm{E}\big[||\hat{z}_{ij}\tilde{x}_{gij}(\theta_o)||^2\big]+8\Big(\mathrm{E}\big[\big(\mathrm{cov}(\mathrm{I}(\tilde{u}_{ij}<0),\mathrm{I}(\tilde{u}_{ik}<0)\,|\,\omega^{\tilde{x},\hat{z}}_{ijk})\big)^2\big]\,\mathrm{E}\big[||\hat{z}_{ij}\hat{z}_{ik}\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gik}(\theta_o)||^2\big]\Big)^{1/2}\Big\}$$
$$\leq\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\Big\{\mathrm{E}\big[||\hat{z}_{ij}\tilde{x}_{gij}(\theta_o)||^2\big]+2\Big(\mathrm{E}\big[||\hat{z}_{ij}\tilde{x}_{gij}(\theta_o)||^2\big]\,\mathrm{E}\big[||\hat{z}_{ik}\tilde{x}_{gik}(\theta_o)||^2\big]\Big)^{1/2}\Big\}\leq C^{1/2}.$$
The second inequality follows from the CS inequality; the third uses Assumption 2(a) and the definition of the median, noting that
$$\big|\mathrm{cov}\big(\mathrm{I}(\tilde{u}_{ij}<0),\mathrm{I}(\tilde{u}_{ik}<0)\,|\,\omega^{\tilde{x},\hat{z}}_{ijk}\big)\big|\leq\big(F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})(1-F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij}))\big)^{1/2}\big(F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ik})(1-F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ik}))\big)^{1/2}=1/4;$$
the last inequality follows from Jensen's inequality and Assumption 3(e).

D.4 The Hessian Matrix
Lemma D.4.
Suppose Assumptions 1(b), 2(a), and 3(d)-3(e) hold, then the Hessian matrix $\mathrm{E}[H_n(\theta_o)]$ is given by eq. (3.2), and $||\mathrm{E}[H_n(\theta_o)]||\leq C^{1/2}$.

Proof of Lemma D.4. Recall from the proof of Lemma D.3 that the score function is $S_n(\theta)=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\hat{z}_{ij}\big(1-2\,\mathrm{I}(\tilde{u}_{ij}(\theta)<0)\big)\tilde{x}_{gij}(\theta)$. By the LIE and using $\tilde{u}_{ij}(\theta)=\tilde{u}_{ij}+\tilde{x}_{gij}(\bar{\theta})(\theta-\theta_o)$, which holds by the MVT and Assumption 1(b),
$$\mathrm{E}[S_n(\theta)]=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\Big[\hat{z}_{ij}\big(1-2\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\mathrm{I}(\tilde{u}_{ij}(\theta)<0)]\big)\tilde{x}_{gij}(\theta)\Big]=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\Big[\hat{z}_{ij}\big(1-2\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\mathrm{I}(\tilde{u}_{ij}<-\tilde{x}_{gij}(\bar{\theta})(\theta-\theta_o))]\big)\tilde{x}_{gij}(\theta)\Big]=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\Big[\hat{z}_{ij}\big(1-2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(-\tilde{x}_{gij}(\bar{\theta})(\theta-\theta_o))\big)\tilde{x}_{gij}(\theta)\Big].$$
Under the assumptions of Lemma D.5, the expectation and derivative are exchangeable by the dominated convergence theorem. The expression for $\mathrm{E}[H_n(\theta)]\equiv\frac{\partial\mathrm{E}[S_n(\theta)]}{\partial\theta}$ becomes
$$\mathrm{E}[H_n(\theta)]=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big\{2\,\mathrm{E}\Big[\hat{z}_{ij}\,f_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big(-\tilde{x}_{gij}(\bar{\theta})(\theta-\theta_o)\big)\tilde{x}_{gij}(\theta)'\tilde{x}_{gij}(\bar{\theta})\Big]+\mathrm{E}\Big[\hat{z}_{ij}\big(1-2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(-\tilde{x}_{gij}(\bar{\theta})(\theta-\theta_o))\big)\frac{\partial\tilde{x}_{gij}(\theta)}{\partial\theta}\Big]\Big\}.$$
Evaluating $\mathrm{E}[H_n(\theta)]$ at $\theta=\theta_o$,
$$\mathrm{E}[H_n(\theta_o)]=\frac{2}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\Big[\hat{z}_{ij}\,f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\,\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)\Big]$$
because $\bar{\theta}$ is trapped between $\theta$ and $\theta_o$ and $F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0)=1/2$. Finally, $||\mathrm{E}[H_n(\theta_o)]||\leq C^{1/2}$ by Assumptions 3(d) and 3(e).

The result in the following lemma is essential to applying the dominated convergence theorem in the proof of Lemma D.4. Define
$$\eta(\theta)\equiv\Big[\hat{z}\,f_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big(-\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big)\tilde{x}_g(\theta)'\tilde{x}_g(\bar{\theta})\Big]+\Big[\hat{z}\Big(1-2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big(-\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big)\Big)\frac{\partial\tilde{x}_g(\theta)}{\partial\theta}\Big]\equiv\eta_A(\theta)+\eta_B(\theta).$$
The following lemma derives a bound on each summand of $\eta(\theta)$.

Lemma D.5. Under Assumptions 3(d) and 3(e), $\mathrm{E}\big[\sup_{\theta\in\Theta}||\eta_A(\theta)||\big]\leq C^{1/2}$ and $\mathrm{E}\big[\sup_{\theta\in\Theta}||\eta_B(\theta)||\big]\leq C^{1/2}$ for any $[x,z]$ and $[x^{\dagger},z^{\dagger}]$ defined on the support of $[x_i,z_i]$.

Proof of Lemma D.5. For any $[x,z]$ and $[x^{\dagger},z^{\dagger}]$ defined on the support of $[x_i,z_i]$ and $\theta\in\Theta$,
$$||\eta_A(\theta)||=\big|\big|\hat{z}\,f_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big(-\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big)\tilde{x}_g(\theta)'\tilde{x}_g(\bar{\theta})\big|\big|\leq f_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big(-\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big)\,\big|\big|\hat{z}\,\tilde{x}_g(\theta)'\tilde{x}_g(\bar{\theta})\big|\big|$$
and
$$||\eta_B(\theta)||=\big|1-2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big(-\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big)\big|\times\Big|\Big|\hat{z}\,\frac{\partial\tilde{x}_g(\theta)}{\partial\theta}\Big|\Big|\leq\Big|\Big|\hat{z}\,\frac{\partial\tilde{x}_g(\theta)}{\partial\theta}\Big|\Big|,$$
noting that $\big|1-2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\cdot)\big|\leq1$. By the CS inequality and Assumptions 3(d) and 3(e),
$$\mathrm{E}\big[\sup_{\theta\in\Theta}||\eta_A(\theta)||\big]\leq\Big(\mathrm{E}\big[\big(\sup_{\theta\in\Theta}f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(-\tilde{x}_g(\bar{\theta})(\theta-\theta_o))\big)^2\big]\Big)^{1/2}\Big(\mathrm{E}\big[\sup_{\theta\in\Theta}||\max\{|\hat{z}|,1\}\tilde{x}_g(\theta)||^4\big]\Big)^{1/2}\leq C^{1/2}\quad\text{and}\quad\mathrm{E}\big[\sup_{\theta\in\Theta}||\eta_B(\theta)||\big]\leq\Big(\mathrm{E}\big[\sup_{\theta\in\Theta}\big|\big|\hat{z}\,\tfrac{\partial\tilde{x}_g(\theta)}{\partial\theta}\big|\big|^2\big]\Big)^{1/2}\leq C^{1/2}.$$

D.5 The Stochastic Equi-continuity Condition
The following lemma verifies Assumption 3(b).
Lemma D.6.
Let $\mathcal{R}_n(\theta)=\sqrt{n}\,\big(\vartheta_n(\theta)-\mathrm{E}[\vartheta_n(\theta)]\big)$. For any sequence $\delta_n\rightarrow0$, $\sup_{||\theta-\theta_o||\leq\delta_n}\dfrac{|\mathcal{R}_n(\theta)|}{1+\sqrt{n}\,||\theta-\theta_o||}\xrightarrow{p}0$ under Assumptions 1(a)-1(b), 2, and 3(e).

Proof of Lemma D.6. By the $c_r$-inequality,
$$|\vartheta_n(\theta)-\mathrm{E}[\vartheta_n(\theta)]|^2=\big|Q_n(\theta)-S_n(\theta_o)(\theta-\theta_o)-\mathrm{E}[Q_n(\theta)]\big|^2\leq2\big|Q_n(\theta)-\mathrm{E}[Q_n(\theta)]\big|^2+2\big|S_n(\theta_o)(\theta-\theta_o)\big|^2\equiv\vartheta_{An}(\theta)+2\vartheta_{Bn}(\theta).$$
Following the expression in (A.2), which holds by Assumption 1(b) and the MVT for some $\lambda_u\in(0,1)$, and writing $\tilde{a}_{ij}(\theta)\equiv\big[\hat{z}_{ij}\big(2\,\mathrm{I}(\tilde{u}_{ij}<\lambda_u\tilde{x}_{gij}(\bar{\theta})(\theta-\theta_o))-1\big)\tilde{x}_{gij}(\bar{\theta})\big]-\mathrm{E}\big[\hat{z}_{ij}\big(2\,\mathrm{I}(\tilde{u}_{ij}<\lambda_u\tilde{x}_{gij}(\bar{\theta})(\theta-\theta_o))-1\big)\tilde{x}_{gij}(\bar{\theta})\big]$,
$$\vartheta_{An}(\theta)=2\Big|\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\tilde{a}_{ij}(\theta)(\theta-\theta_o)\Big|^2=2(\theta-\theta_o)'\Big(\frac{1}{n^4}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n}\sum_{l=1}^{n}\tilde{a}_{ij}(\theta)'\tilde{a}_{kl}(\theta)\Big)(\theta-\theta_o).$$
For $i\neq k,l$ and $j\neq k,l$, $\mathrm{E}[\tilde{a}_{ij}(\theta)'\tilde{a}_{kl}(\theta)]=\mathrm{cov}\big(\tilde{a}_{ij}(\theta),\tilde{a}_{kl}(\theta)\big)=\mathbf{0}$ by Assumption 1(a). This implies that for any $\theta\in\Theta$,
$$\mathrm{E}[\vartheta_{An}(\theta)]=2(\theta-\theta_o)'\Big(\frac{1}{n^4}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\big(\mathrm{E}[\tilde{a}_{ij}(\theta)'\tilde{a}_{ij}(\theta)]+2\,\mathrm{E}[\tilde{a}_{ij}(\theta)'\tilde{a}_{ik}(\theta)]\big)\Big)(\theta-\theta_o)\leq\frac{C^{1/2}}{n}\,||\theta-\theta_o||^2$$
by Assumption 3(e) and the CS inequality, since $||\mathrm{E}[\tilde{a}_{ij}(\theta)'\tilde{a}_{ik}(\theta)]||\leq\big(\mathrm{E}[||\tilde{a}_{ij}(\theta)||^2]\,\mathrm{E}[||\tilde{a}_{ik}(\theta)||^2]\big)^{1/2}$ is bounded, noting that $\big(2\,\mathrm{I}(\tilde{u}_{ij}<\lambda_u\tilde{x}_{gij}(\bar{\theta})(\theta-\theta_o))-1\big)^2=1$. Similarly,
$$\mathrm{E}[\vartheta_{Bn}(\theta)]=\mathrm{E}\big[|S_n(\theta_o)(\theta-\theta_o)|^2\big]=(\theta-\theta_o)'\mathrm{E}[S_n(\theta_o)'S_n(\theta_o)](\theta-\theta_o)=\frac{1}{n}(\theta-\theta_o)'\mathrm{E}[\Omega_n(\theta_o)](\theta-\theta_o)\leq\frac{1}{n}||\mathrm{E}[\Omega_n(\theta_o)]||\cdot||\theta-\theta_o||^2\leq\frac{C^{1/2}}{n}\,||\theta-\theta_o||^2$$
under the assumptions of Lemma D.3. Thus $\mathrm{E}[\vartheta_{An}(\theta)]$ and $\mathrm{E}[\vartheta_{Bn}(\theta)]$ are each bounded by $C^{1/2}||\theta-\theta_o||^2/n$, which implies $\sup_{\theta\in\Theta}\dfrac{|\vartheta_n(\theta)-\mathrm{E}[\vartheta_n(\theta)]|}{||\theta-\theta_o||}=O_p(n^{-1/2})$ by the Markov inequality. Also, by the definition of a derivative, $\vartheta_n(\theta)$ and $\mathrm{E}[\vartheta_n(\theta)]$ go to zero faster than $||\theta-\theta_o||$, i.e., there exists a constant $\tilde{C}>0$ such that $\dfrac{|\vartheta_n(\theta)-\mathrm{E}[\vartheta_n(\theta)]|}{||\theta-\theta_o||}=\dfrac{|Q_n(\theta)-\mathrm{E}[Q_n(\theta)]-S_n(\theta_o)(\theta-\theta_o)|}{||\theta-\theta_o||}\leq\tilde{C}^{1/2}|\vartheta_n(\theta)-\mathrm{E}[\vartheta_n(\theta)]|$ a.s. The conclusion follows by combining both preceding arguments, i.e.,
$$\sup_{||\theta-\theta_o||\leq\delta_n}\frac{|\mathcal{R}_n(\theta)|}{1+\sqrt{n}\,||\theta-\theta_o||}=\sup_{||\theta-\theta_o||\leq\delta_n}\frac{\sqrt{n}\,|\vartheta_n(\theta)-\mathrm{E}[\vartheta_n(\theta)]|}{1+\sqrt{n}\,||\theta-\theta_o||}\leq\sup_{||\theta-\theta_o||\leq\delta_n}\frac{|\vartheta_n(\theta)-\mathrm{E}[\vartheta_n(\theta)]|}{||\theta-\theta_o||}=O_p(n^{-1/2}).$$

E Simulations
E.1 Linear Model - Additional Simulation Results
Maximum likelihood estimation of the linear model (under normally distributed disturbances) is known to be efficient. This simulation exercise compares the MDep estimator to the maximum likelihood estimator (MLE), primarily in terms of efficiency, both under correctly and incorrectly imposed distributional assumptions on the disturbance. Also, the MDep estimator shares the "robustness" feature of quantile estimators although it estimates models estimable by IV methods. It is therefore interesting to examine the performance of the MDep vis-à-vis OLS under, for example, Cauchy(location = 0, scale = 1)-distributed disturbances, where the expectation of the disturbance does not exist. Three specifications are considered below.

(i) $y=\theta_c+\theta_1x_1+\theta_2x_2+u$, $u=\nu$, $x_2=x_2^{*}$, $z=[x_1,x_2^{*}]$, and $\nu\sim(\chi^2(1)-1)/\sqrt{2}$;
(ii) $y=\theta_c+\theta_1x_1+\theta_2x_2+u$, $u=\nu$, $x_2=x_2^{*}$, $z=[x_1,x_2^{*}]$, and $\nu\sim N(0,1)$;
(iii) $y=\theta_c+\theta_1x_1+\theta_2x_2+u$, $u=\nu$, $x_2=x_2^{*}$, $z=[x_1,x_2^{*}]$, and $\nu\sim\mathrm{Cauchy}(0,1)$.

Table E.1: Linear model simulation results (entries multiplied by 100)

                          θ1                                              θ2
n              100            500            2500           100            500            2500
Specification (i)
            MDep   MLE     MDep   MLE     MDep   MLE     MDep   MLE     MDep   MLE     MDep   MLE
MB          0.479  0.916   0.1    0.27    0.018  0.042   0.344  0.54    0.064  0.23    0.006  0.01
MAD         2.578  7.295   0.826  3.166   0.321  1.411   2.525  6.892   0.868  3.141   0.324  1.364
RMSE        5.002  10.39   1.494  4.57    0.537  2.064   5.069  10.423  1.527  4.59    0.523  2.028
Specification (ii)
            MDep   MLE     MDep   MLE     MDep   MLE     MDep   MLE     MDep   MLE     MDep   MLE
MB          0.066  0.077   0.039  -0.016  0.018  -0.004  -0.022 -0.083  -0.042 -0.008  0.012  0.024
MAD         7.188  6.972   3.363  3.202   1.471  1.411   7.348  7.248   3.178  2.953   1.425  1.437
RMSE        10.956 10.51   4.961  4.729   2.145  2.048   10.973 10.479  4.705  4.518   2.161  2.085
Specification (iii)
            MDep   OLS     MDep   OLS     MDep   OLS     MDep   OLS     MDep   OLS     MDep   OLS
MB          -0.698 18.758  0.003  59.2    0.008  51.105  0.278  27.691  -0.22  -71.169 -0.19  38.73
MAD         13.333 84.066  5.773  87.888  2.517  82.152  13.728 80.794  5.901  81.822  2.539  79.623
RMSE        21.078 3455.6  8.734  1735.6  3.775  3472.6  21.348 2076.3  8.687  4211.2  3.797  4431.3

This table provides results for the MDep and MLE/OLS estimators of the linear model. Results are based on 2000 random repetitions. The mean bias (MB), median absolute deviation (MAD), and root mean squared error (RMSE) are reported for each specification.
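For completeness, a minimal sketch of the summary statistics reported in Table E.1 (and the later tables), given a vector of simulation estimates of a scalar parameter. Treating the MAD as the median absolute deviation from the true value and the scaling by 100 are assumptions based on the table notes and headers.

```python
import numpy as np

def simulation_summary(estimates, true_value):
    # Mean bias (MB), median absolute deviation from the true value (MAD),
    # and root mean squared error (RMSE), scaled by 100 as in the table headers.
    err = np.asarray(estimates, dtype=float) - true_value
    return {"MB": 100 * err.mean(),
            "MAD": 100 * np.median(np.abs(err)),
            "RMSE": 100 * np.sqrt(np.mean(err ** 2))}
```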
Table E.1 presents results for this short simulation exercise. When the normality assumption is not satisfied (specification (i)), one observes a better performance of the MDep. Under normality of the disturbance (specification (ii)), one observes a better performance of the MLE. In terms of MAD and RMSE, the relative gain of the MLE over the MDep in specification (ii) is smaller than the MDep's relative gain over the MLE in specification (i). These results suggest the MDep performs better when there are deviations from normality of the disturbance. A surprising result is that the MDep under the skewed χ²-distributed disturbance (specification (i)) appears to be more efficient than the MDep under normality (specification (ii)).

Specification (iii) presents an interesting case where the expected value of the disturbance does not exist. From Table E.1, it can be observed that the MDep is "robust" to the non-existence of the expected value of the disturbance. As expected, the OLS in this case fails; the estimator is not consistent for the slope parameters that generated the outcome. The MDep, on the other hand, performs reasonably well; its bias shrinks towards zero with the sample size.

E.2 Non-linear model
While Section 4 focusses on the linear model, this subsection examines the performance of the MDep estimator vis-à-vis conventional IV methods for non-linear parametric models such as the Poisson quasi-maximum likelihood (QMLE), GMM, and control function (CF) methods. To conserve space, only the non-linear parametric model with the exponential function is considered. Specifications in the extended class of models (see Section D.2) are considered. $\nu\sim\mathrm{Uniform}[-\sqrt{3},\sqrt{3}]$ in specifications (i)-(v), and $\nu$ follows a Cauchy distribution with unit scale in specification (vii). Each column of the matrix $x=[x_1,x_2^{*}]$ is standard normal $N(0,1)$ with $\mathrm{cov}(x_1,x_2^{*})=0.25$, truncated to the interval $[-\sqrt{3},\sqrt{3}]$ in specifications (i)-(iii) and unrestricted in specifications (iv) to (vii). $\theta\equiv[\theta_1,\theta_2]=[1,-1]$ in all specifications considered. The intercept $\theta_c$ in specifications (i)-(iii) is set to ensure $y$ is non-negative. $z$ is the matrix of instrumental variables. The operator $\tilde{T}(\dot{x},a)=\frac{a}{\max\{\dot{x}\}-\min\{\dot{x}\}}\big(2\dot{x}-(\max\{\dot{x}\}+\min\{\dot{x}\})\big)$ restricts the support of $\dot{x}$ to the interval $[-a,a]$.

(i) $y=\exp(\theta_c+\theta_1x_1+\theta_2x_2)+u$, $x_2=x_2^{*}$, $u=\sqrt{D|x_1|+(1-D)|x_2|}\,\nu$, $D\sim\mathrm{Ber}(0.5)$, and $z=[x_1,x_2^{*}]$.
(ii) $y=\exp(\theta_c+\theta_1x_1+\theta_2x_2)+u$, $u=\nu$, $x_1=\tilde{T}(\dot{x}_1,\sqrt{3})$, $x_2=x_2^{*}+\nu/\sqrt{2}$, and $z=[x_1,x_2^{*}]$.
(iii) $y=\exp(\theta_c+\theta_1x_1+\theta_2x_2)+u$, $x_1=\tilde{T}(\dot{x}_1,\sqrt{3})$, $x_2=3x_2^{*}/\sqrt{n}+(1+x_2^{*})u/\sqrt{2}$, $u=\tilde{T}(\dot{u},\sqrt{3})$, $\dot{u}=(1+D|x_1|+(1-D)|x_2|)\nu$, $D\sim\mathrm{Ber}(0.5)$, and $z=[x_1,x_2^{*}]$.
(iv) $y=\max\{0,\dot{y}\}$, $\dot{y}=\exp(\theta_c+\theta_1x_1+\theta_2x_2)+u$, $x_2=(x_2^{*}+u)/\sqrt{2}$, $z=[z_1,z_2/\sigma_{z_2},z_3/\sigma_{z_3}]$, $z_1=x_1$, $z_2=|x_2^{*}|+(x_2^{*})^2+w_2$, $z_3=|x_2^{*}|+(x_2^{*})^3+w_3$, $w_2\sim N(0,1)$, $w_3\sim N(0,1)$.
(v) $y=\max\{0,\dot{y}\}$, $\dot{y}=\exp(\theta_c+\theta_1x_1+\theta_2x_2)+u$, $x_2=\mathrm{I}(x_2^{*}+u\leq0)$, $z=[z_1,z_2/\sigma_{z_2},z_3/\sigma_{z_3}]$, $z_1=x_1$, $z_2=\mathrm{I}(-\sqrt{2/\pi}+|x_2^{*}|+6x_2^{*}/\sqrt{n}+w_2\leq0)$, $z_3=\mathrm{I}(-1+(x_2^{*})^2+6x_2^{*}/\sqrt{n}+w_3\leq0)$, $w_2\sim N(0,1)$, $w_3\sim N(0,1)$.
(vi) $y=\mathrm{Pois}\big(\exp(\theta_c+\theta_1x_1+\theta_2x_2)\big)$, $x_2=x_2^{*}$, and $z=[x_1,x_2^{*}]$.
(vii) $y=\theta_c+\exp(\theta_1x_1+\theta_2x_2)+u$, and $u=\nu$.

Instruments in specifications (iii)-(v) are set to ensure a median first-stage non-homoskedasticity-robust F-statistic less than 10 and a median pdCov test p-value less than 0.05.

Table E.2: Non-linear model simulation results (entries multiplied by 100)

                          θ1                                              θ2
n              100            500            2500           100            500            2500
Specification (i)
            MDep   QMLE    MDep   QMLE    MDep   QMLE    MDep   QMLE    MDep   QMLE    MDep   QMLE
MB          0.003  0.004   0      0.002   0      0       -0.002 -0.001  0.002  0.001   0.001  0.001
MAD         0.071  0.08    0.03   0.037   0.014  0.016   0.068  0.078   0.032  0.037   0.014  0.016
RMSE        0.102  0.119   0.046  0.053   0.02   0.023   0.107  0.125   0.045  0.053   0.021  0.024
Specification (ii)
            MDep   GMM     MDep   GMM     MDep   GMM     MDep   GMM     MDep   GMM     MDep   GMM
MB          -0.007 -0.871  -0.002 -1.234  -0.001 -1.338  0.043  0.848   0.01   1.163   0.002  1.249
MAD         0.115  0.508   0.054  0.789   0.026  1.157   0.116  0.562   0.051  0.791   0.023  1.082
RMSE        0.204  2.176   0.086  2.024   0.037  1.745   0.211  2.13    0.083  1.913   0.036  1.638
Specification (iii)
            MDep   CF      MDep   CF      MDep   CF      MDep   CF      MDep   CF      MDep   CF
MB          0.007  0.035   -0.003 0.012   -0.007 0.006   0.129  0.198   0.034  1.347   0.032  2.158
MAD         0.109  0.141   0.071  0.076   0.04   0.04    0.106  0.624   0.048  0.94    0.032  1.546
RMSE        0.181  0.278   0.104  0.171   0.059  0.106   0.307  2.234   0.074  10.391  0.045  10.464
Specification (iv)
            MDep   CF      MDep   CF      MDep   CF      MDep   CF      MDep   CF      MDep   CF
MB          -0.022 -0.064  -0.007 -0.088  -0.003 -0.014  0.126  0.339   0.032  0.401   0.01   -0.03
MAD         0.074  0.222   0.028  0.306   0.012  0.732   0.111  0.849   0.035  1.683   0.015  4.041
RMSE        0.135  0.793   0.048  0.895   0.02   2.371   0.22   3.05    0.079  4.824   0.032  14.004
Specification (v)
            MDep   CF      MDep   CF      MDep   CF      MDep   CF      MDep   CF      MDep   CF
MB          -0.024 0.04    -0.008 0.015   -0.004 -0.015  -0.645 -4.407  -0.179 -8.948  -0.032 -11.864
MAD         0.153  0.216   0.062  0.105   0.028  0.081   0.562  4.445   0.186  8.744   0.06   11.914
RMSE        0.247  0.403   0.099  0.213   0.043  0.205   1.152  6.074   0.429  9.59    0.101  12.335
Specification (vi)
            MDep   MLE     MDep   MLE     MDep   MLE     MDep   MLE     MDep   MLE     MDep   MLE
MB          -0.621 0.376   -0.924 0.104   -0.828 -0.028  0.752  -0.09   0.899  -0.01   0.917  0.046
MAD         6.176  4.421   2.934  1.695   1.583  0.776   6.13   4.352   2.916  1.845   1.628  0.811
RMSE        9.318  6.763   4.397  2.739   2.342  1.141   9.343  6.644   4.54   2.752   2.351  1.196
Specification (vii)
            MDep   GMM     MDep   GMM     MDep   GMM     MDep   GMM     MDep   GMM     MDep   GMM
MB          -0.893 96.472  0.082  17.436  -0.031 -3.558  0.303  200.154 0.041  183.718 0.01   145.422
MAD         4.818  40.463  1.885  34.587  0.812  31.606  4.794  20.736  1.947  18.322  0.78   16.516
RMSE        9.932  2456.5  3.181  3929.1  1.253  5313.9  9.507  4070.9  3.213  4479.4  1.233  4305.3

This table provides results of the MDep, GMM, CF, MLE, and QMLE estimators for the non-linear model. Results are based on 1000 random repetitions. The mean bias (MB), median absolute deviation (MAD), and root mean squared error (RMSE) are reported for each specification.
The continuously updated GMM is implemented in specification (ii). The CF includes the vector of first-stage residuals and its square in the second stage.

E.3 Coverage performance of covariance estimators

Theorem 3 shows the consistency of the asymptotic covariance matrix estimator. This subsection compares the confidence intervals based on the covariance matrix estimator and the bootstrap in terms of empirical coverage probabilities. Three bootstrap confidence intervals are considered, namely, the percentile bootstrap, the empirical bootstrap, and a normal approximation using the bootstrap standard error. The Hall and Sheather (1988) bandwidth with the standard normal pdf $\phi(\cdot)$ and quantile function $\Phi^{-1}(\cdot)$, i.e., $\hat{c}_n=n^{-1/3}\big(\Phi^{-1}(1-\alpha/2)\big)^{2/3}\big(1.5\,\phi^2(0)\big)^{1/3}$, is used in the estimation of the Hessian. The following specifications are considered:

(i) $y=\theta_c+\theta_1x_1+\theta_2x_2+u$, $x_2=x_2^{*}$, and $u\sim(\chi^2(1)-1)/\sqrt{2}$;
(ii) $y=\theta_c+\theta_1x_1+\theta_2x_2+u$, $x_2=(x_2^{*}+u)/\sqrt{2}$, $z=[x_1,\,|x_2^{*}|/\sqrt{1-2/\pi}]$, and $u\sim(\chi^2(1)-1)/\sqrt{2}$;
(iii) $y=\theta_c+\theta_1x_1+\theta_2x_2+u$, $x_2=(x_2^{*}+u)/\sqrt{2}$, $z=[x_1,\,0.5x_2^{*}+(\sqrt{3}/2)w_2,\,0.5x_2^{*}+(\sqrt{3}/2)w_3]$, $w_2\sim N(0,1)$, $w_3\sim N(0,1)$, and $u\sim(\chi^2(1)-1)/\sqrt{2}$;
(iv) $y=\exp(\theta_c+\theta_1x_1+\theta_2x_2)+u$, $x_2=(x_2^{*}+u)/\sqrt{2}$, $z=[x_1,\,0.5x_2^{*}+(\sqrt{3}/2)w_2,\,0.5x_2^{*}+(\sqrt{3}/2)w_3]$, $w_2\sim N(0,1)$, $w_3\sim N(0,1)$, and $u\sim\mathrm{Uniform}[-\sqrt{3},\sqrt{3}]$;
(v) $y=\exp(\theta_c+\theta_1x_1+\theta_2x_2)+u$, $x_2=x_2^{*}$, and $u\sim\mathrm{Uniform}[-\sqrt{3},\sqrt{3}]$.
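A minimal sketch of the Hall and Sheather (1988) bandwidth in the form used above; at the median, $\Phi^{-1}(1/2)=0$, so the sparsity factor reduces to $(1.5\,\phi^2(0))^{1/3}$. The default $\alpha=0.05$ is an assumption.

```python
from scipy.stats import norm

def hall_sheather_bandwidth(n, alpha=0.05):
    # c_n = n^(-1/3) * (Phi^{-1}(1 - alpha/2))^(2/3) * (1.5 * phi(0)^2)^(1/3)
    z = norm.ppf(1 - alpha / 2)
    return n ** (-1 / 3) * z ** (2 / 3) * (1.5 * norm.pdf(0) ** 2) ** (1 / 3)
```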
E.4 Partial dCov test of instrument relevance

A standard F-test of IV relevance (see, for example, Stock and Yogo (2005) and Kleibergen and Paap (2006)) verifies a sufficient (but not necessary) relevance condition for Assumption 2(b). The strength of an instrument lies in its identifying variation independently of other instruments and covariates.

Table E.3: 95% Empirical Coverage Probabilities

                             θ_1                                        θ_2
   n       Asymp.   B-%tile   B-Emp.   B-Asymp.       Asymp.   B-%tile   B-Emp.   B-Asymp.
Specification (i)
  100       96.2      94.0     98.8      98.4          97.8      96.4     99.4      99.4
  500       97.8      95.0     99.4      98.8          97.8      95.0     99.6      98.8
Specification (ii)
  100       95.01     93.80    96.80     98.00         93.19     74.60    87.60     90.60
  500       95.07     91.00    96.80     97.40         97.68     80.6     90.60     93.20
Specification (iii)
  100       93.00     96.00    98.00     98.60         87.53     89.60    96.00     97.20
  500       95.8      94.8     99.0      99.0          95.8      93.0     99.4      99.2
Specification (iv)
  100       87.26     96.60    97.80     99.20         89.16     87.80    96.40     97.20
  500       86.72     95.80    98.20     99.20         89.47     94.00    97.60     98.00
Specification (v)
  100       82.41     95.60    97.40     98.20         80.41     95.40    97.60     98.60
  500       79.40     96.80    98.80     98.40         77.60     96.40    97.4      97.40
Notes: This table presents coverage probabilities of confidence intervals constructed from the asymptotic covariance matrix (Asymp.) and the non-parametric bootstrap. Bootstrap confidence intervals considered are the percentile (B-%tile), empirical (B-Emp.), and normal approximation (B-Asymp.) intervals.

For this reason, Székely, Rizzo, et al.'s (2014) partial distance covariance between x_2 and z_2, netting out dependence on x_1, is considered. Partition x and z as x = [x_1, x_2] and z = [x_1, z_2], where z_2 may have more than one column. For a random variable ξ of arbitrary dimension, let M_ξ be the n × n matrix whose (i, j)'th entry is

||ξ_i − ξ_j|| − (n − 2)^{-1} Σ_{l=1}^{n} (||ξ_i − ξ_l|| + ||ξ_l − ξ_j||) + ((n − 1)(n − 2))^{-1} Σ_{k=1}^{n} Σ_{l=1}^{n} ||ξ_k − ξ_l||

if j ≠ i, and 0 otherwise. Székely, Rizzo, et al.'s (2014) partial distance covariance (pdCov) between x_2 and z_2 after netting out dependence on x_1 is

pdC_n(x_2, z_2; x_1) = (1/(n(n − 3))) P_{x_1⊥}(x_2) · P_{x_1⊥}(z_2),

where P_{x_1⊥}(x_2) = M_{x_2} − ((M_{x_2} · M_{x_1})/(M_{x_1} · M_{x_1})) M_{x_1}, P_{x_1⊥}(z_2) = M_{z_2} − ((M_{z_2} · M_{x_1})/(M_{x_1} · M_{x_1})) M_{x_1}, and A · B ≡ Σ_{i=1}^{n} Σ_{j=1}^{n} A_{ij} B_{ij}. The null hypothesis of interest is H_0: pdC_n(x_2, z_2; x_1) = 0 against the alternative hypothesis H_1: pdC_n(x_2, z_2; x_1) > 0. Under the null, z_2 is independent of, i.e., not an MDep-relevant instrument for, x_2 after netting out dependence on x_1. One may be interested in the non-linear relevance of z_2. This is achieved by replacing x_2 with residuals from its linear projection on z and testing for partial dependence. The partial distance covariance test requires resampling techniques because the statistic is not asymptotically pivotal under the null (Székely, Rizzo, et al., 2014). The bootstrap procedure is described in Section E.6.

Monte Carlo simulations are used to examine the size and power of the pdCov test for MDep relevance. The first-stage data generating process considered is x_2 = x_1 + z + λ f_T(z) + u, where f_T(z) is a non-monotone transformation of z and λ ∈ [0, 1]. x_1 and z are each standard normally distributed with correlation 0.25, and u ∼ (χ²(1) − 1)/√2. Two non-monotone transformations are considered: f_T(z) = |z| (Figure 1a) and f_T(z) = I(|z| < −Φ^{-1}(0.25)) (Figure 1b), where Φ^{-1}(0.25) denotes the first quartile of the standard normal distribution. The size of the test is examined under λ = 0, while the power of the test is examined at two positive values of λ, the larger being λ = 1. For each setting, the performance of the pdCov test is examined at three nominal levels, viz. 1%, 5%, and 10%.

Figure 1: Size and power of the pdCov test. (a) Non-monotone transformation f_T(z) = |z|. (b) Non-monotone transformation f_T(z) = I(|z| < −Φ^{-1}(0.25)).
Notes: Figures on the left examine size control of the pdCov test of MDep relevance. Figures in the middle and on the right examine the power of the test at the smaller positive value of λ and at λ = 1, respectively. Φ^{-1}(0.25) denotes the first quartile of the standard normal distribution. The black dashed lines in the graphs on the left correspond to the respective nominal sizes.

Results for the pdCov test of non-linear dependence and overall dependence are quite similar, so results on the latter are omitted. The graphs on the left in Figures 1a and 1b demonstrate good size control across all sample sizes and nominal levels considered. Graphs in the middle and on the right of Figures 1a and 1b demonstrate the test's ability to detect the presence of non-linear/non-monotone MDep relevance. As expected, power increases in λ.
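The U-centred matrices and the pdCov statistic defined in this subsection can be computed directly, as in the R sketch below; the pdcov function in the energy package (Rizzo and Szekely, 2019) offers a packaged implementation of the same quantity. The variable names (x2, z2, x1) are illustrative, and the sketch follows the formulas above rather than the authors' code.

```r
# Minimal sketch: U-centred distance matrix M_xi and the partial distance covariance
# pdC_n(x2, z2; x1) as defined above. energy::pdcov() computes the same object.
ucenter <- function(xi) {
  d <- as.matrix(dist(xi))                    # pairwise Euclidean distances ||xi_i - xi_j||
  n <- nrow(d)
  M <- d - outer(rowSums(d), rep(1, n)) / (n - 2) -
           outer(rep(1, n), colSums(d)) / (n - 2) +
           sum(d) / ((n - 1) * (n - 2))
  diag(M) <- 0                                # the (i, i) entries are set to zero
  M
}

dotp <- function(A, B) sum(A * B) / (nrow(A) * (nrow(A) - 3))   # (A . B) / (n(n - 3))

pdcov_n <- function(x2, z2, x1) {
  Mx <- ucenter(x2); Mz <- ucenter(z2); Mw <- ucenter(x1)
  Px <- Mx - (dotp(Mx, Mw) / dotp(Mw, Mw)) * Mw   # net out dependence on x1
  Pz <- Mz - (dotp(Mz, Mw) / dotp(Mw, Mw)) * Mw
  dotp(Px, Pz)
}
```

Since the statistic is not asymptotically pivotal under the null, its p-value is obtained by the bootstrap modification of Algorithm 1 described in Section E.6.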
E.5 Specification testing

The distance covariance test proposed by Székely, Rizzo, Bakirov, et al. (2007) is a test of independence. Shao and Zhang (2014) extends the dCov to test conditional mean independence between observed random variables. Su and Zheng (2017) derives a test from Shao and Zhang's (2014) measure which applies to specification testing. Assumption 2(a) is a weak exclusion restriction. Su and Zheng's (2017) test applies to conditional mean models of the form y = g(x; θ) + u where E[y|x] = g(x; θ). The test remains applicable whether the researcher uses excluded instruments or not. The null hypothesis of interest is

H_0: P(E[y_i | x_i] = g(x_i, θ_o, θ_{o,c})) = 1

against the alternative hypothesis

H_1: P(E[y_i | x_i] = g(x_i, θ, θ_c(θ))) < 1 for all θ ∈ Θ.

Let û_i be the residual corresponding to observation i such that Σ_{i=1}^{n} û_i = 0 (for example, by setting û_i = u_i(θ̂_n) − E_n[u_i(θ̂_n)]). Su and Zheng's (2017) test statistic is given by

T_n = −(1/n) Σ_{i≠j} û_i û_j ||z_i − z_j||.

Details of the wild bootstrap used to implement the test as in Su and Zheng (2017) are described in Section E.6. It is instructive to use simulations to verify the performance of Su and Zheng's (2017) specification test in the context of the MDep estimator. The specifications (under the null of exogeneity) considered are specifications (v) in Section 4 and (i) in Section E.1. The power of the test is examined under endogeneity using x_2 = (ẋ + u)/√2, where cov(ẋ, x_2*) = 0.25 and cov(ẋ, u) = 0. Figure 2 presents the size and power of the test at nominal levels 1%, 5%, and 10%. In all specifications considered, one observes good size control and non-trivial power. In sum, Su and Zheng's (2017) specification test is reliable in assessing, among other sources of mis-specification, violations of the sufficient conditional mean independence condition in the MDep setting.
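As a point of reference, the statistic T_n can be computed as in the short R sketch below; the 1/n normalisation reflects the reconstruction of the formula above and should be checked against Su and Zheng (2017), and the object names are illustrative.

```r
# Minimal sketch of the specification-test statistic written above:
# T_n = -(1/n) * sum_{i != j} uhat_i * uhat_j * ||z_i - z_j||.
sz_stat <- function(uhat, z) {
  n <- length(uhat)
  D <- as.matrix(dist(z))                  # ||z_i - z_j||, zero on the diagonal
  -(1 / n) * sum(outer(uhat, uhat) * D)    # i = j terms vanish because D_ii = 0
}
```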
E.6 Implementing the wild bootstrap
Below is an outline of the wild-bootstrap procedure used in conducting the Su and Zheng (2017) specification test. Define the transformation T(û_i) = √(n/(n − p_θ − 1)) û_i. The bootstrap errors are sampled from Mammen's two-point distribution: ν*_i = −(√5 − 1)/2 with probability (√5 + 1)/(2√5) and ν*_i = (√5 + 1)/2 with probability (√5 − 1)/(2√5).
Figure 2: Size and power of Su and Zheng's (2017) specification test.
Notes: Figures on the left examine size control of Su and Zheng's (2017) specification test, while the figures on the right examine its power. The graphs on top correspond to specification (v) and the graphs below correspond to specification (i) in Section E.1. The black dashed lines in the graphs on the left correspond to the respective nominal sizes.

These weights satisfy E[ν*_i] = 0, E[(ν*_i)²] = 1, and E[(ν*_i)³] = 1 (see Mammen (1993)). Other distributions with the same properties can be found in Liu et al. (1988). The bootstrap procedure is specified below.

Algorithm 1.
1. (a) Regress y_i on x_i. Obtain θ̂_n and u_i(θ̂_n) = y_i − g(x_i, θ̂_n, θ_c(θ̂_n)).
   (b) Compute û_i = u_i(θ̂_n) − E_n[u_i(θ̂_n)] and the test statistic T_n.
2. For b = 1, ..., B, Do
   (a) Generate y*_{bi} = g(x_i, θ̂_n, θ_c(θ̂_n)) + E_n[u_i(θ̂_n)] + T(û_i) ν*_{bi}.
   (b) Regress y*_{bi} on x_i and obtain θ̂*_{bn}, u_i(θ̂*_{bn}) = y*_{bi} − g(x_i, θ̂*_{bn}), û*_{bi} = u_i(θ̂*_{bn}) − E_n[u_i(θ̂*_{bn})], and the test statistic T*_{bn}.
3. End Do
The p-value of the specification test is computed as B^{-1} Σ_{b=1}^{B} I(T*_{bn} > T_n). Testing MDep-relevance using Székely, Rizzo, et al.'s (2014) partial distance covariance test requires modifying Algorithm 1. The MDep-relevance test proceeds by replacing û_i with x̌_i = x_i − E_n[x_i], generating bootstrap samples as x̌*_{bi} = T(x̌_i) ν*_{bi}, skipping the regression steps, and computing the p-value as B^{-1} Σ_{b=1}^{B} I(pdC*_{bn} > pdC_n). Conducting a pdCov test of non-linear relevance only requires replacing x̌_i with residuals from a linear projection of x on z. Computation of the dCov and partial dCov measures is done using the energy R package (Rizzo and Szekely, 2019).
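A compact R sketch of Algorithm 1 is given below. It reuses the sz_stat helper from Section E.5, draws the Mammen two-point weights defined above, and relies on a hypothetical fit_model(y, x) routine standing in for whatever estimator produces θ̂_n and the fitted values g(x_i, θ̂_n, θ_c(θ̂_n)); the degrees-of-freedom correction in T(·) follows the reconstruction in this section, so the sketch is illustrative rather than the authors' implementation.

```r
# Minimal sketch of Algorithm 1: wild-bootstrap p-value for the specification test.
mammen <- function(n) {
  # Mammen's two-point weights: E[v] = 0, E[v^2] = 1, E[v^3] = 1.
  p <- (sqrt(5) + 1) / (2 * sqrt(5))
  sample(c(-(sqrt(5) - 1) / 2, (sqrt(5) + 1) / 2), n, replace = TRUE, prob = c(p, 1 - p))
}

wild_boot_pvalue <- function(y, x, z, fit_model, B = 499,
                             p_theta = ncol(as.matrix(x))) {
  n    <- length(y)
  fit  <- fit_model(y, x)                      # step 1(a): theta_hat and fitted values
  u    <- y - fit$fitted                       # u_i(theta_hat)
  uhat <- u - mean(u)                          # step 1(b): centred residuals
  Tn   <- sz_stat(uhat, z)                     # test statistic T_n
  Tu   <- sqrt(n / (n - p_theta - 1)) * uhat   # transformation T(uhat_i)
  Tstar <- replicate(B, {                      # step 2: bootstrap repetitions
    yb <- fit$fitted + mean(u) + Tu * mammen(n)
    fb <- fit_model(yb, x)
    ub <- yb - fb$fitted
    sz_stat(ub - mean(ub), z)
  })
  mean(Tstar > Tn)                             # step 3: bootstrap p-value
}
```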
References

[1] Davidson, James. Stochastic Limit Theory: An Introduction for Econometricians. Oxford: Oxford University Press, 1994.

[2] Hall, Peter and Simon J. Sheather. "On the distribution of a studentized quantile". Journal of the Royal Statistical Society: Series B (Methodological) 50.3 (1988), pp. 381–391.

[3] Kleibergen, Frank and Richard Paap. "Generalized reduced rank tests using the singular value decomposition". Journal of Econometrics 133.1 (2006), pp. 97–126.

[4] Liu, Regina Y. "Bootstrap procedures under some non-i.i.d. models". The Annals of Statistics 16.4 (1988), pp. 1696–1708.

[5] Mammen, Enno. "Bootstrap and wild bootstrap for high dimensional linear models". The Annals of Statistics 21.1 (1993), pp. 255–285.

[6] Rizzo, Maria and Gabor Szekely. energy: E-Statistics: Multivariate Inference via the Energy of Data. R package version 1.7-7. 2019. URL: https://CRAN.R-project.org/package=energy.

[7] Shao, Xiaofeng and Jingsi Zhang. "Martingale difference correlation and its use in high-dimensional variable screening". Journal of the American Statistical Association 109.507 (2014), pp. 1302–1318.

[8] Stock, James H. and Motohiro Yogo. "Testing for weak instruments in linear IV regression". In: Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg. Ed. by Donald W. K. Andrews and James H. Stock. New York: Cambridge University Press, 2005, pp. 80–108.

[9] Su, Liangjun and Xin Zheng. "A martingale-difference-divergence-based test for specification". Economics Letters 156 (2017), pp. 162–167.

[10] Székely, Gábor J., Maria L. Rizzo, et al. "Partial distance correlation with methods for dissimilarities". The Annals of Statistics 42.6 (2014), pp. 2382–2412.