A Distance Covariance-based Estimator
AA Distance Covariance-based Estimator ∗ Emmanuel Selorm Tsyawo † Abdul-Nasah Soale ‡ February 2020
Abstract
Weak instruments present a major setback to empirical work. This paperintroduces an estimator that admits weak, uncorrelated, or mean-independentinstruments that are non-independent of endogenous covariates. Relative toconventional instrumental variable methods, the proposed estimator weakensthe relevance condition considerably without imposing a stronger exclusion re-striction. Identification mainly rests on (1) a weak conditional median exclu-sion restriction imposed on pairwise differences in disturbances and (2) non-independence between covariates and instruments. Under mild conditions, theestimator is consistent and asymptotically normal. Monte Carlo experimentsshowcase an excellent performance of the estimator, and two empirical examplesillustrate its practical utility.
Keywords: distance covariance, dependence, weak instrument, endogeneity, U -statistics JEL classification: C13, C14, C26 ∗ The authors would like to thank Brantly Callaway, Oleg Rytchkov, Dai Zusai, and Weige Huangfor helpful comments. † Corresponding author, [email protected], FGSES, Mohammed VI Polytechnic Uni-versity. ‡ [email protected], Department of Statistics, Temple University. a r X i v : . [ ec on . E M ] F e b Introduction
Empirical work in economics largely depends on instrumental variable (IV) meth-ods. When instruments are weakly correlated with endogenous covariates, thesemethods become unreliable; they can produce biased estimates and hypothesis testswith large size distortions. Moreover, IV methods are infeasible when instruments areuncorrelated with endogenous variables. Econometric literature on the weak instru-ment problem largely focusses on detection and weak-instrument-robust inference.Theoretical progress on estimation is, however, scant. This paper introduces an es-timator which minimises the distance covariance (dCov) between a vector of distur-bances and a set of instruments. The dCov is introduced by Sz´ekely, Rizzo, Bakirov,et al. (2007). It measures the distributional dependence between two random vari-ables of arbitrary dimensions. The proposed Minimum Dependence estimator (MDephereafter) weakens the relevance (or correlatedness) requirement considerably andadmits instruments that are non-independent of endogenous covariates.The non-independence identification condition implies the MDep estimator max-imises the set of relevant instruments. This feature of the proposed estimator makesit fundamentally different from conventional methods. The exclusion restriction im-posed is a conditional median independence assumption on pairwise differences indisturbances. Like the uncorrelatedness exclusion restriction of standard IV methods(see, e.g., Wooldridge (2010, Assumption 2SLS.1)), the MDep exclusion restriction isweaker than the (strong) conditional mean independence restriction (see, e.g., Chen,Chen, and Lewis (2020) and Dieterle and Snell (2016, Assumption A.1)). Unlike IVmethods, however, the form of dependence between endogenous covariates and instru-ments needs to be neither known nor specified, effectively eliminating the sensitivityof estimates to first-stage model specification. Although the MDep estimator gen-erally applies to linear and non-linear models typically estimated using IV methods,it shares the “robustness” feature of quantile estimators (see, e.g., Powell (1991) andOberhofer and Haupt (2016)) in the sense that its asymptotic properties do not de-pend on the existence of moments of the outcome variable. For example, the limit of This paper subsumes methods such as IV, ordinary least squares (OLS), two-stage least squares(2SLS), the Control Function (CF) approach, and generalised method of moments (GMM) underthe IV category. Dieterle and Snell (2016), for example, uncovers substantial sensitivity of conclusions to speci-fication (linear versus quadratic) of the first stage. This property holds for a class of models considered in the paper where the disturbance is linear When instruments are uncorrelatedor mean-independent of endogenous covariates, IV methods break down. The MDep,on the other hand, performs reasonably well. This paper considers two empiricalexamples that illustrate the practical usefulness of the MDep estimator. In the firstexample, the excluded instrument is IV-strong and hence MDep-strong by construc-tion. Consequently, both sets of estimates and standard errors are similar. Thesecond application presents an interesting case where the instrument has non-trivialnon-linear and non-monotone identifying variation that the MDep, unlike the IV, isable to exploit. 
In this example, MDep estimates are more precise than the IV.The econometric literature on weak instruments largely focusses on detection andweak-instrument-robust inference (e.g., Staiger and Stock (1997), Stock and Yogo(2005), Andrews, Moreira, and Stock (2006), Kleibergen and Paap (2006), Olea andPflueger (2013), Andrews and Mikusheva (2016), Sanderson and Windmeijer (2016),and Andrews and Armstrong (2017)). See Andrews, Stock, and Sun (2019) for anexcellent review. Normal distributions of conventional IV estimates can be poor andhypothesis tests based on them can be unreliable when instruments are weak (Nel-son and Startz, 1990a; Nelson and Startz, 1990b; Bound, Jaeger, and Baker, 1995).Although the econometric literature on weak instruments does not focus much onestimation, three exceptions deserve discussion. Hirano and Porter (2015) provesthat if instruments can be arbitrarily weak, no unbiased estimator exists withoutfurther restrictions. Andrews and Armstrong (2017) shows that asymptotically un-biased estimation is possible granted one correctly imposes a sign restriction on thefirst-stage regression coefficient. By extracting non-linear identifying variation ininstruments using machine learning techniques (e.g., neural networks and randomforests) and sample splitting, Chen, Chen, and Lewis (2020) achieves higher precisionand increased instrument strength. While it is conceivable to take transformationsof the instrument to extract more variation (see, e.g., Dieterle and Snell (2016)), in the intercept. Simulation results on non-linear models are available in the Online Appendix. When it is feasible, the approach easily results inhigh-dimensionality (see, e.g., Belloni, Chen, Chernozhukov, and Hansen (2012) andHansen and Kozbur (2014)). This paper’s contribution to the literature on weakinstruments is the MDep estimator. In addition to linear identifying variation inendogenous covariates that IV methods exploit, the MDep estimator exploits ad-ditional monotone and non-monotone forms of dependence that may exist betweeninstruments and endogenous covariates for identification. The MDep gives a new per-spective to handling weak IVs in empirical practice; poorly correlated, uncorrelated,or mean-independent instruments in the conventional IV setting can be MDep-strong.Several applications of the dCov measure have emerged since the seminal paperSz´ekely, Rizzo, Bakirov, et al. (2007). The dCov measure is used primarily to testindependence between two random variables of arbitrary dimensions. It is consistentagainst all types of dependent alternatives including linear, non-linear, monotone,and non-monotone forms of dependence. Sz´ekely, Rizzo, et al. (2014) derives thepartial distance covariance ( pdCov ) test of independence between two random vari-ables with dependence on a third set removed. Based on the dCov measure, Shaoand Zhang (2014) derives the martingale difference divergence (MDD) measure totest for conditional mean independence, and Park, Shao, Yao, et al. (2015) derivesa test of partial conditional mean independence. Su and Zheng (2017) and Xu andChen (2020), respectively, adapt the Shao and Zhang (2014) MDD measure to testmodel specification in conditional mean and quantile regression models. Zhou (2012)introduces the auto-covariance function for strictly stationary multivariate time se-ries. Davis, Matsui, Mikosch, Wan, et al. (2018) tests goodness-of-fit of time seriesmodels by applying an adapted distance covariance function to residuals. 
The closestwork to the current paper is perhaps Sheng and Yin (2013); it proposes an estimatorof single-index models as a tool for sufficient dimension reduction using the dCov asa criterion. There is a growing literature on the application of the dCov measureto variable selection and dimension reduction (see, e.g., Li, Zhong, and Zhu (2012),Shao and Zhang (2014), and Zhong and Zhu (2015)). The reader is referred to Edel-mann, Fokianos, and Pitsillou (2019) for a review of several applications of the dCov As an illustration, let outcome y = x + u , endogenous covariate x = x ∗ + u , and instrument z = | x ∗ | with cov( u, z ) = 0. For a symmetric mean-zero random variable x ∗ , cov( x, z ) = 0. It is notfeasible, without further information, to transform z in order to induce correlation with x . This section presents the MDep estimator. The estimator is based on the distancecovariance measure proposed by Sz´ekely, Rizzo, Bakirov, et al. (2007).
Sz´ekely, Rizzo, Bakirov, et al. (2007) introduces distance covariance (dCov) as ameasure of dependence between two random variables of arbitrary dimensions. Let ϕ U,Z ( t, s ) denote the joint characteristic function of [ U, Z ], and [ ϕ U ( t ) , ϕ Z ( s )] denotetheir marginal characteristic functions. The distance covariance measure is definedbelow. Definition 2.1.
The distance covariance between random variables U and Z withfinite first moments is the non-negative number V ( U, Z ) defined by V ( U, Z ) ≡ | ϕ U,Z ( t, s ) − ϕ U ( t ) ϕ Z ( s ) | w = (cid:90) | ϕ U,Z ( t, s ) − ϕ U ( t ) ϕ Z ( s ) | w ( t, s ) dtds = (cid:90) (cid:12)(cid:12) E (cid:2) exp( ι ( t (cid:48) U + s (cid:48) Z )) (cid:3) − E (cid:2) exp( ιt (cid:48) U ) (cid:3) E (cid:2) exp( ιs (cid:48) Z ) (cid:3)(cid:12)(cid:12) w ( t, s ) dtds (2.1) where ι = √− and the weight w ( t, s ) is an arbitrary positive function for which theintegration exists. | ζ | = ζ ¯ ζ , where ¯ ζ is the complex conjugate of ζ . | ϕ U,Z ( t, s ) − ϕ U ( t ) ϕ Z ( s ) | w ≥ U and Z , and | ϕ U,Z ( t, s ) − ϕ U ( t ) ϕ Z ( s ) | w = 0 if and only if U and Z are independent.A sample from the joint distribution of [ U, X, Z ] is denoted by the n × ( p u + p x + p z )matrix [ U , X , Z ]. Define E n [ ξ i ] ≡ n (cid:80) ni =1 ξ i and E n [ ξ ij ] ≡ n (cid:80) ni =1 (cid:80) nj =1 ξ ij . p u = 1 ismaintained throughout the rest of the paper since this paper considers only univariateresponse models. The distance covariance measure defined in terms of empiricalcharacteristic functions, | ϕ U , Z ( t, s ) − ϕ U ( t ) ϕ Z ( s ) | w,n , is (cid:90) (cid:12)(cid:12) E n (cid:2) exp( ιtu i + s (cid:48) z i ) (cid:3) − E n (cid:2) exp( ιtu i ) (cid:3) E n (cid:2) exp( ιs (cid:48) z i ) (cid:3)(cid:12)(cid:12) w ( t, s ) dtds. Using the weight function w ( t, s ) = ( c p u c p z || t || p u || s || p z ) − where c p = π (1+ p ) / Γ((1+ p ) / for p ≥ || · || is the Euclidean norm, and Γ( · ) is the complete gamma function,Sz´ekely, Rizzo, Bakirov, et al. (2007) proves the equality | ϕ U , Z ( t, s ) − ϕ U ( t ) ϕ Z ( s ) | w,n = V n ( U , Z ) where V n ( U , Z ) = S n ( U , Z ) + S n ( U , Z ) − S n ( U , Z ) , S n ( U , Z ) = n (cid:80) ni =1 (cid:80) nj =1 | u i − u j | × || z i − z j || , S n ( U , Z ) = n (cid:80) ni =1 (cid:80) nj =1 | u i − u j | n (cid:80) ni =1 (cid:80) nj =1 || z i − z j || , and S n ( U , Z ) = n (cid:80) ni =1 (cid:80) nj =1 (cid:80) nk =1 | u i − u j ||| z i − z k || . The empirical distance covariance measure V n ( U , Z ) simplifies to (2.2) V n ( U , Z ) ≡ n n (cid:88) i =1 n (cid:88) j =1 ˇ z ij × | ˜ u ij | where ˇ z ij ≡ || ˜ z ij || − n n (cid:88) k =1 ( || ˜ z ik || + || ˜ z kj || ) + 1 n n (cid:88) k =1 n (cid:88) l =1 || ˜ z kl || , ˜ u ij ≡ u i − u j , and˜ z ij ≡ z i − z j .Sz´ekely, Rizzo, Bakirov, et al.’s (2007) weight function w ( t, s ) = ( c p u c p z || t || p u || s || p z ) − ,besides yielding a reliable measure of dependence, results in a computationally tractablemeasure, obviates the choice of smoothing parameters (e.g., bandwidth or number of See Sz´ekely, Rizzo, Bakirov, et al. (2007, eqn. 2.18). This follows because n (cid:80) ni =1 (cid:80) nj =1 (cid:80) nk =1 || z i − z k || × | u i − u j | = n (cid:80) ni =1 (cid:80) nj =1 (cid:80) nk =1 || z k − z i ||×| u i − u j | = n (cid:80) ni =1 (cid:80) nj =1 (cid:80) nk =1 || z k − z j ||×| u j − u i | = n (cid:80) ni =1 (cid:80) nj =1 (cid:80) nk =1 || z k − z j ||×| u i − u j | by the permutation symmetry of || · || , the commutativity of sums, and the permutation symmetryof | · | . While the form | ϕ U , Z ( t, s ) − ϕ U ( t ) ϕ Z ( s ) | w,n of the dCovmeasure is natural and intuitive as a non-negative measure of distributional depen-dence, the form V n ( U , Z ) is computationally tractable. The MDep estimator proposedin this paper is based on the latter. 
For the estimator presented in this paper, thesimplified formulation (2.2) has two main advantages: the permutation symmetryof ˇ z ij , i.e., ˇ z ij = ˇ z ji , simplifies the application of U-statistic theory in the proof ofasymptotic normality, and lessens computational time needed to evaluate (2.2).For ease of reference, the properties of the dCov measure in Sz´ekely, Rizzo,Bakirov, et al. (2007) and Sz´ekely and Rizzo (2009) are stated below. The followingproperties of the dCov measure hold. Properties of dCov. (a) V n ( U , Z ) = | ϕ U , Z ( t, s ) − ϕ U ( t ) ϕ Z ( s ) | w,n .(b) V n ( U , Z ) ≥ .(c) V ( U, Z ) = 0 if and only if U is independent of Z .(d) lim n →∞ V n ( U , Z ) = V ( U, Z ) almost surely (a.s.). The properties are provided and proved in the following papers: Property (a) inSz´ekely, Rizzo, Bakirov, et al. (2007, Theorem 1), Property (b) in Sz´ekely and Rizzo(2009, Theorem 4 (i)), Property (c) holds by definition (see (2.1)), and Property(d) in Sz´ekely, Rizzo, Bakirov, et al. (2007, Theorem 2). From (2.2), it is obviousthat V n ( U , Z ) = 0 if every sample observation of U , Z , or both is identical. Animportant implication of Properties (b) to (d) is that V n ( U , Z ) = 0 if and only if U is independent of Z with probability approaching one (w.p.a.1). For considerations of space and scope, the current paper is limited to paramet-ric regression models which assume the scalar outcome y i is generated by y i = G (cid:0) θ o,c + g ( x i , θ o )+ u i (cid:1) , where G ( · ) is a known invertible function, g ( · , · ) is a known dif-ferentiable function with unknown parameter vector θ o ∈ R p θ , and θ o,c is the intercept. See Sz´ekely, Rizzo, Bakirov, et al. (2007, section 2.1) for a discussion. Also, see Sz´ekely andRizzo (2012) on a discussion and proof of the uniqueness of the weight function.
7y the invertibility of G ( · ), the disturbance function is u i ( θ ) = G − ( y i ) − g ( x i , θ ).The intercept is ignored because it is differenced out in the objective function (2.3). The dependence of u i ( θ ) on covariates x i is suppressed for notational convenience.Observed data [ Y , X , Z ] include an n × Y , an n × p x matrix ofcovariates, and an n × p z matrix of instruments Z . While the class of models underconsideration includes interesting examples such as the linear model u i ( θ ) = y i − x i θ ,non-linear parametric models, e.g., u i ( θ ) = y i − exp( x i θ ), fractional response mod-els, e.g., u i ( θ ) = log( y i / (1 − y i )) − x i θ , and special cases of Box-Cox models e.g., u i ( θ ) = log( y i ) − x i θ , it excludes equally interesting ones such as binary responsemodels and quantile regression. The objective function is a squared distance covariance of U ( θ ) and Z , where the i ’th element of the n × U ( θ ) is u i ( θ ). Using (2.2), the objective function isgiven by(2.3) V n ( U ( θ ) , Z ) = 1 n n (cid:88) i =1 n (cid:88) j =1 ˇ z ij | ˜ u ij ( θ ) | where ˜ u ij ( θ ) ≡ u i ( θ ) − u j ( θ )and the MDep estimator is ˆ θ n = arg min θ ∈ Θ V n ( U ( θ ) , Z ) . (2.3) is a sum of specially weighted absolute pairwise differences in disturbances. (2.3)for example, explains why the asymptotic behaviour of the MDep estimator is similarto that of quantile estimators. It is instructive to note how the MDep relates tomoment estimators which include conventional IV methods such as OLS, IV, 2SLS,MM, and GMM. A moment estimator typically minimises the (possibly weighted)linear dependence between Z and U ( θ ) by solving Z (cid:48) U ( θ ) = . The MDep, on theother hand, minimises their distributional dependence. The MDep hence minimisesa more general form of dependence between Z and U ( θ ). See Section D.2 in the Online Appendix for an extension to the class of models such as thePoisson which allow non-linearity of the disturbance function in the intercept. Asymptotic Theory
Asymptotic theory for the MDep estimator falls within the category of estima-tors of models with non-smooth objective functions such as quantile regression (QR)methods, e.g., Koenker and Bassett Jr (1978), Powell (1991), and Oberhofer andHaupt (2016), the instrumental variable QR methods of Chernozhukov and Hansen(2006) and Chernozhukov and Hansen (2008), the CF approach to QR of Lee (2007),and the valid moment selection procedure of Cheng and Liao (2015).Define the n × p θ matrix X g ( θ ) whose i ’th row is x gi ( θ ) where x gi ( θ ) ≡ ∂u i ( θ ) ∂ θ (cid:48) ,the n × p θ matrix ˜ X g ( θ ) whose rows comprise { ˜ x gij ( θ ) : 1 ≤ i, j ≤ n } , ˜ x gij ( θ ) ≡ x gi ( θ ) − x gj ( θ ), ˜ u ij ≡ ˜ u ij ( θ o ), and the true parameter vector θ o . Also, let ˜ x g ( θ ) , x g ( θ ),˜ z , and ˇ z be any random variables defined on the supports of ˜ x gij ( θ ) , x gi ( θ ), ˜ z ij ,and ˇ z ij , respectively. Define { w i : 1 ≤ i ≤ n } , w i ≡ [ u i , x i , z i ] on a probabilityspace ( W , W , P ). Following Huber (1967), the normalised minimand is defined as Q n ( θ ) ≡ n n (cid:88) i =1 n (cid:88) j =1 q ( w i , w j , θ ), where q ( w i , w i , θ ) = ˇ z ij (cid:0) | ˜ u ij ( θ ) | − | ˜ u ij | (cid:1) . Normal-ising the minimand avoids unnecessary moment conditions on U . See Powell (1991)and Oberhofer and Haupt (2016) for example. Two sets of regularity conditions are imposed to guarantee the consistency of theMDep estimator ˆ θ n . The first set comprises sampling, smoothing, and dominanceconditions which ensure that the difference between the normalised minimand and itsexpectation converges to zero uniformly in θ . These are stated below as assumptions1(a)-1(d). Assumption 1.
For a constant
C > :(a) w i ≡ [ u i , x i , z i ] are independently distributed random vectors across i , and theoutcome is generated as y i = G (cid:0) θ o,c + g ( x i , θ o ) + u i (cid:1) .(b) u ( θ ) is measurable in [ u i , x i ] for all θ and is twice differentiable in θ for all [ u, x ] in the support of [ u i , x i ] . x g ( θ ) is measurable in x i for all θ .(c) E (cid:2) sup θ ∈ Θ || max {| ˇ z | , } ˜ x g ( θ ) || (cid:3) ≤ C . d) θ o is in the interior of a compact subset Θ in R p θ . Assumptions 1 (a) to (d) above suffice for the application of the uniform lawof large numbers in Andrews (1987). Assumption 1(a) allows for heteroskedasticity.Addressing other forms of dependence in the data, e.g., autocorrelation and clusteringis beyond the scope of the current paper. The condition on the data generating processin Assumption 1(a) and the differentiability requirement in Assumption 1(b) definethe class of models considered in the current paper, e.g., the linear model. Theyexclude models such as Koenker and Bassett Jr’s (1978) QR and binary responsemodels. ˜ u ij ( θ ) = ˜ u ij + ˜ x gij ( ¯ θ )( θ − θ o ) for some ¯ θ between θ and θ o is a usefulexpression for subsequent analyses thanks to Assumption 1(b) and the mean-valuetheorem (MVT).Assumption 1(c) is an MDep analogue of a standard condition that uniformlybounds the fourth moment of both the weighted and unweighted Jacobian, i.e., E (cid:2) sup θ ∈ Θ || max {| ˇ z | , } ˜ x g ( θ ) || (cid:3) ≤ C implies both E (cid:2) sup θ ∈ Θ || ˇ z ˜ x g ( θ ) || (cid:3) ≤ C and E (cid:2) sup θ ∈ Θ || ˜ x g ( θ ) || (cid:3) ≤ C . It can be shown (see Lemma A.1) that under Assump-tion 1(c), B ( x i , x j , z i , z j ) ≡ sup θ ∈ Θ || ˇ z ij ˜ x gij ( θ ) || , which does not depend on [ u i , u j ], isintegrable, and | q ( w i , w j , θ ) | ≤ B ( x i , x j , z i , z j ) || θ − θ o || a.s. for all θ , θ ∈ Θ . FromAssumption 1(d) and the foregoing, the expected value of q ( w i , w j , θ ) exists even ifthe expected value of u i does not exist. This special feature of the MDep estimatoris emphasised in the following remarks. Remark 3.1.
Although the MDep estimator developed in this paper is not a quantileestimator, i.e., it does not estimate regression quantiles, it shares the “robustness”feature of quantile estimators over the class of models in Assumption 1(a) where theasymptotic properties do not depend on the existence of moments of the disturbance u i . The remaining set of regularity conditions for consistency (Assumptions 2(a) and2(b)) are identification conditions which ensure that the expected value of the mini-mand is uniquely minimised at θ o in large samples. Assumption 2. The differentiability requirement is for ease of treatment as the case of non-smooth objectivefunctions is beyond the scope of this paper. a) Given ˜ x gij , the median of ˜ u ij is independent of ˇ z ij .(b) There exists a ˜ δ ε > such that inf { γ ∈ R pθ : || γ ||≥ ε } E [ V n ( X g ( θ ) γ , Z )] > ˜ δ ε for any ε > , θ ∈ Θ , and n sufficiently large. These assumptions are perhaps the most important because one easily sees the advan-tage the MDep estimator has vis-`a-vis conventional IV methods. While the exclusionrestriction (Assumption 2(a)), like that of conventional methods, is not stronger thanthe conditional mean independence exclusion restriction, the relevance condition (As-sumption 2(b)) is much weaker.Assumption 2(a) requires that the median of ˜ u ij conditional on [ x gij , ˇ z ij ] be con-stant. Unlike Assumption 2(a) which is imposed on the pairwise differences in dis-turbances, similar exclusion restrictions on conditional quantiles are imposed on thelevels of disturbances for quantile estimators under (possible) endogeneity (see, e.g.,Chernozhukov and Hansen (2006, Assumption A.2), Lee (2007, Assumption 3.6), andPowell (1991, Assumption B2)). Specific to the median of u i , Lee (2007), for example,requires that the median of u i , conditional on [ x i , z i ] be constant. By the symmetryof the unconditional density of ˜ u ij around zero, Assumption 2(a) is equivalent to aconditional mean independence restriction E [˜ u ij | ˜ x gij , ˇ z ij ] = E [˜ u ij | ˜ x gij ] if the expectedvalue of u i exists. In words, E [˜ u ij | ˜ x gij , ˇ z ij ] = E [˜ u ij | ˜ x gij ] says given ˜ x gij , the mean of˜ u ij is independent of ˇ z ij . Assumption 2(a) is implied by the stronger conditionalmean-independence condition E [ u i | ˜ x gij , ˇ z ij ] = E [ u i | ˜ x gij ]. Taking the expectation ofboth sides conditional on w i gives the more familiar conditional mean-independenceexclusion restriction in levels E [ u i | x gi , z i ] = E [ u i | x gi ]. This sufficient (but not nec-essary) condition for Assumption 2(a) can be verified using the conditional meanindependence specification test of Su and Zheng (2017). Both the MDep exclusionrestriction Assumption 2(a) and the uncorrelatedness exclusion restriction of condi-tional IV methods (see, e.g., Wooldridge (2010, Assumption 2SLS.1)) are weaker thanthe conditional mean-independence condition. Simulations in Section 4 suggest theMDep exclusion restriction (Assumption 2(a)) is not stronger than the uncorrelated-ness exclusion restriction of conventional IV methods.Assumption 2(b) is the condition of non-independence between X and Z ; it is theMDep analogue of the relevance condition in the IV/CF setting (see, e.g., Wooldridge See the Online Appendix for an evaluation of the test’s finite sample performance using simula-tions. pdCov test of Sz´ekely, Rizzo, et al. (2014) can beused to test Assumption 2(b) when there is a single endogenous covariate. 
Assump-tion 2(b) is, however, a significantly weaker assumption. For example, suppose thereis one endogenous covariate X and a single instrument Z . A standard IV relevancecondition is that X be correlated with Z . The MDep, on the other hand simply re-quires that X be non-independent of Z . This effectively allows X to be uncorrelated,or even mean-independent of Z as long as some points on its distribution are notindependent of Z . All strong instruments in the conventional IV sense are MDep-strong. The converse is, however, not true. This idea is further explored in Section 4via simulations. Assumption 2(b) fails when the Jacobian matrix X g ( · ) does not havefull column rank or when instruments are fewer than parameters, i.e., p z < p θ . Tosee this, note that X g ( · ) γ can be zero (which implies V n ( X g ( · ) γ , Z ) = 0 w.p.a.1) for γ (cid:54) = if X g ( · ) does not have full column rank. Also, if instruments are fewer thanparameters, it is easy to construct an example where V n ( X g ( · ) γ , Z ) = 0 w.p.a.1 for X not independent of Z and γ (cid:54) = . Remark 3.2.
MDep admits the largest set of instruments possible. In addition toIV-relevant instruments that both MDep and conventional IV methods admit, MDepadmits instruments that are uncorrelated or mean-independent but non-independentof covariates. This clear advantage of the MDep over IV methods does not come atan added cost as the MDep exclusion restriction (Assumption 2(a)), like the uncor-relatedness exclusion restriction of IV methods, is weaker than the conditional meanindependence restriction.
The support of ˇ z ij is not non-negative. This makes identification proof techniquesbased on convexity of the objective function (see, e.g., Koenker and Bassett Jr (1978), See the Online Appendix for an evaluation of the test’s finite sample performance using simula-tions. Consider a linear model with p z = 1, p θ = 2, X g ( · ) = X = [ ξ , ξ − ξ ], Z = ξ , ξ independentof ξ , and γ = [1 , (cid:48) . V n ( X γ , Z ) = V n ( ξ , ξ ) = 0 w.p.a.1 although X is not independent of Z . Therather technical formulation of Assumption 2(b) is meant to avoid such identification failure. θ ∈ Θ , A θ : [0 , (cid:16) W where A θ ( τ ) = { x , x † , z , z † ∈ W : F ˜ U | ˜ X g , ˆ Z ( λ u ˜ x g ( ¯ θ )( θ − θ o )) = τ } , [ x † , z † ] is an independentcopy of [ x , z ], λ u ∈ (0 , θ satisfies ˜ u ( θ ) = ˜ u + ˜ x g ( ¯ θ )( θ − θ o ). The followingprovides an important result for identification. Proposition 3.1.
Suppose Assumptions 1(b) and 2(a) hold, then for any w , w † de-fined on the support of w i , E [ q ( w , w † , θ )] = (cid:90) | τ − | (cid:90) A θ ( τ ) (cid:2) ˇ z (cid:12)(cid:12) ˜ x g ( ¯ θ )( θ − θ o ) (cid:12)(cid:12)(cid:3) d P ˜ X g , ˆ Z dτ. The inner integrand ˇ z (cid:12)(cid:12) ˜ x g ( ¯ θ )( θ − θ o ) (cid:12)(cid:12) has the same form as the summands of(2.2). This is important because the expectation of the normalised minimand can beexpressed in terms of the distance covariance between X g ( ¯ θ )( θ − θ o ) and Z for all θ (cid:54) = θ o and ¯ θ between θ and θ o . Identification then follows from Assumption 2(b). Lemma 3.1.
Suppose Assumptions 1(b) and 2 hold, then for any ε > and n sufficiently large, there exists a δ ε > such that inf { θ ∈ Θ : || θ − θ o ||≥ ε } E [ Q n ( θ )] > δ ε . Consistency of the MDep estimator is presented in the following theorem.
Theorem 1 (Consistency) . Under Assumptions 1 and 2, ˆ θ n p → θ o . Notice that ˇ z ij is not a bona fide function of [ z i , z j ] because it is dependent onother observations. To simplify the study of the asymptotic distribution of the MDepestimator, the asymptotically equivalent function, ˆ z ij = D n ( z i , z j ) ≡ || z i − z j || − n n (cid:88) k =1 (cid:0) E [( || z i − z k || ) | z i ]+ E [( || z k − z j || ) | z j ] (cid:1) + 1 n n (cid:88) k =1 n (cid:88) l =1 E [ || z k − z l || ], is used instead ofˇ z ij . Denote the equivalent normalised minimand by Q n ( θ ) ≡ n (cid:80) ni =1 (cid:80) nj =1 ˆ z ij (cid:0) | ˜ u ij ( θ ) |−| ˜ u ij | (cid:1) . It is shown in Lemma D.1 that the normalised minimands Q n ( θ ) and Q n ( θ )are asymptotically equivalent for all θ ∈ Θ . Using the form Q n ( θ ) not only simplifiesthe application of Hoeffding’s (1948) U-statistic theory for the proof of asymptoticnormality but also leads to a less cumbersome and a computationally less expensivecovariance matrix of the score (see Lemma D.3).13efine ψ ( w i , w j ) ≡ ˆ z ij (cid:0) − u ij < (cid:1) ˜ x gij ( θ o ) (cid:48) . ˆ z ij = ˆ z ji and (cid:0) − u ij < (cid:1) ˜ x gij ( θ o ) = (cid:0) − u ji < (cid:1) ˜ x gji ( θ o ) hence ψ ( · , · ) is permutation symmetric. Thescore function at θ o is S n ( θ o ) ≡ ∂ Q n ( θ ) ∂ θ (cid:48) (cid:12)(cid:12) θ = θ o = n (cid:80) ni =1 (cid:80) nj =1 ψ ( w i , w j ) (cid:48) . By thepermutation symmetry of ψ ( · , · ) and for any τ n ∈ R p θ such that || τ n || = 1, let V n = τ (cid:48) n S n ( θ o ) (cid:48) = n (cid:80) ni =1 (cid:80) nj =1 τ (cid:48) n ψ ( w i , w j ) = (1 − /n ) U n , where U n = (cid:0) n (cid:1) − (cid:80) ni =1 (cid:80) nj =1 12 τ (cid:48) n ψ ( w i , w j ). E [ ψ ( w i , w j )] = for 1 ≤ i, j ≤ n (seeLemma D.3) implies E [ U n ] = 0. U n is a U -statistic (see, e.g., Hoeffding (1948) and Vander Vaart (1998, chapter 11)). The connection to U-statistics is particularly useful forestablishing a central limit theorem on √ n S n ( θ ) because { ψ ( w i , w j ) , ≤ i, j ≤ n } are dependent and standard central limit theorems do not apply directly. Under As-sumption 1(a), the Haj´ek projection of U n is ˆ U n ≡ (cid:80) ni =1 E [ U n | w i ] = n (cid:80) ni =1 τ (cid:48) n ψ i ( w i )where ψ i ( w i ) ≡ n − (cid:80) j (cid:54) = i E [ ψ ( w i , w j ) | w i ] (see, e.g., Hoeffding (1948, eqn. 8.1) andVan der Vaart (1998, Lemma 11.10)). { ψ i ( w i ) , ≤ i ≤ n } are independently dis-tributed random vectors, and the Lyapunov central limit theorem (CLT) applies.Denote Ψ n ≡ var( √ n ˆ U n ) = var( √ n (cid:80) ni =1 ψ i ( w i )) = n (cid:80) ni =1 E [ ψ i ( w i ) ψ i ( w i ) (cid:48) ].Define ϑ n ( θ ) = Q n ( θ ) −S n ( θ o )( θ − θ o ), then by Lemma D.3, E [ ϑ n ( θ )] = E [ Q n ( θ )] − E [ S n ( θ o )]( θ − θ o ) = E [ Q n ( θ )]. R n ( θ ) = √ n ( ϑ n ( θ ) − E [ ϑ n ( θ )]) / || θ − θ o || is the re-mainder term of the Taylor approximation (see, e.g., Pollard (1985) and Newey andMcFadden (1994, sect. 7)). Define Σ o ≡ H − o Ω o H − o where Ω o ≡ lim n →∞ E [ Ω n ( θ o )],var( √ n S n ( θ o )) = E [ Ω n ( θ o )], H o ≡ lim n →∞ E [ H n ( θ o )], and E [ H n ( θ o )] = ∂ E [ S n ( θ )] ∂ θ (cid:12)(cid:12)(cid:12) θ = θ o ,the Hessian matrix (see Lemmata Lemmas D.3 and D.4). 
The expressions for E [ Ω n ( θ o )]and E [ H n ( θ o )] are E [ Ω n ( θ o )] = 1 n n (cid:88) i =1 n (cid:88) j =1 n (cid:88) k (cid:54) = i,j E (cid:104) ˆ z ij ˜ x gij ( θ o ) (cid:48) ˜ x gij ( θ o )+ 8ˆ z ij ˆ z ik cov (cid:0) I(˜ u ij < , I(˜ u ik < | ω ˜ x, ˆ zijk (cid:1) ˜ x gij ( θ o ) (cid:48) ˜ x gik ( θ o ) (cid:105) (3.1)and(3.2) E [ H n ( θ o )] = 2 n n (cid:88) i =1 n (cid:88) j =1 E (cid:104) ˆ z ij f ˜ U | ˜ X g , ˆ Z (0 | ω ˜ x, ˆ zij )˜ x gij ( θ o ) (cid:48) ˜ x gij ( θ o ) (cid:105) where ω ˜ x, ˆ zijk ≡ σ ( x i , x j , x k , z i , z j , z k ), ω ˜ x, ˆ zij ≡ σ ( x i , x j , z i , z j ), σ ( · ) denotes the sigmaalgebra generated by its argument, and cov( ξ a , ξ b | ξ c ) denotes covariance of ξ a and ξ b ξ c . Let ρ min ( A ) denote the minimum eigen-value of A . The followingadditional assumptions are needed to establish the asymptotic distribution of theMDep estimator. Assumption 3.
For the constant C defined in Assumption 1:(a) ρ min ( Ψ n ) > C / / for all n sufficiently large.(b) sup || θ − θ o ||≤ δ n |R n ( θ ) | √ n || θ − θ o || p → for any δ n → .(c) H o is non-singular.(d) The conditional distribution of ˜ u , F ˜ U | ˜ X g , ˆ Z ( · ) , is differentiable with density f ˜ U | ˜ X g , ˆ Z ( · ) , C − < f ˜ U | ˜ X g , ˆ Z ( (cid:15) ) ≤ sup γ ∈ R f ˜ U | ˜ X g , ˆ Z ( γ ) ≤ C / , and | f ˜ U | ˜ X g , ˆ Z ( (cid:15) ) − f ˜ U | ˜ X g , ˆ Z ( (cid:15) ) | ≤ C / | (cid:15) − (cid:15) | a.s. for all (cid:15), (cid:15) , (cid:15) in a neighbourhood of zero and ˜ u defined on thesupport of ˜ u ij .(e) E (cid:2) sup θ ∈ Θ || max {| ˆ z | , } ˜ x g ( θ ) || (cid:3) ≤ C , E (cid:2) sup θ ∈ Θ (cid:12)(cid:12)(cid:12)(cid:12) ˆ z ∂ ˜ x g ( θ ) ∂ θ (cid:12)(cid:12)(cid:12)(cid:12) (cid:3) ≤ C , and E (cid:2) || ˜ z || (cid:3) ≤ C . Assumption 3(a) means Ψ n is positive definite. A necessary (but not sufficient) condi-tion for this is the full column rank condition of the Jacobian matrix X g ( · ). Assump-tion 3(b) is is a stochastic equi-continuity condition for establishing the asymptoticnormality of estimators with non-smooth objective functions (see, e.g., Pollard (1985),Pakes and Pollard (1989), and Cheng and Liao (2015)). It is verified in Lemma D.6.Non-singularity of the Hessian (Assumption 3(c)) is imposed instead of positive defi-niteness because the MDep objective function (2.3) is not convex. Assumption 3(d) isa standard assumption for quantile estimators (see, e.g., Lee (2007, Assumption 3.6),Chernozhukov and Hansen (2006, Assumption 2 R.4), Chernozhukov and Hansen(2008, Assumption R.4), Powell (1991, Assumption C4. (i) and (ii)), and Oberhoferand Haupt (2016, Assumption A.14)); it ensures the Hessian is well-defined. Thefirst statement in Assumption 3(e) simply adapts Assumption 1(c) for the equivalentnormalised minimand Q n ( θ ). The second statement is required to apply the domi-nated convergence theorem in the derivation of the Hessian (see Lemmata D.4 andD.5). The third statement is an MDep analogue of a standard regularity condition(e.g., Cheng and Liao (2015, condition 4.1 (iii))); it is used in Lemma D.1(a) to show15hat | ˇ z ij − ˆ z ij | = O p ( n − / ) for 1 ≤ i, j ≤ n . The asymptotic normality of the MDepestimator is stated below. Theorem 2.
Suppose Assumptions 1 to 3 hold, then(a) √ n ˜ τ (cid:48) n E [ Ω n ( θ o )] − / S n ( θ o ) (cid:48) d → N (0 , for any ˜ τ n ∈ R p θ with || ˜ τ n || = 1 ;(b) √ n ( ˆ θ n − θ o ) d → N ( , Σ o ) . Theorem 2(a) together with the Cram´er-Wold device yields the asymptotic normalityof √ n S n ( θ o ), i.e, √ n S n ( θ o ) d → N ( , Ω o ). This result plays a crucial role in the proofof Theorem 2(b) (see, e.g., Newey and McFadden (1994, Theorem 7.1)). The preceding subsection presents the asymptotic normality of the MDep esti-mator. This subsection provides the covariance matrix estimator and proves its con-sistency. Consistency of the covariance estimator is essential for statistical inferenceprocedures viz. Wald tests, and confidence intervals. The approach adopted followsPowell (1991). The estimators of Ω o and H o are given byˇ Ω n ( ˆ θ n ) = 1 n n (cid:88) i =1 n (cid:88) j =1 n (cid:88) k (cid:54) = i,j (cid:110) ˇ z ij ˜ x gij ( ˆ θ n ) (cid:48) ˜ x gij ( ˆ θ n )+ 2ˇ z ij ˇ z ik (cid:0) (cid:0) I(˜ u ij ( ˆ θ n ) < u ik ( ˆ θ n ) < (cid:1) − (cid:1) ˜ x gij ( ˆ θ n ) (cid:48) ˜ x gik ( ˆ θ n ) (cid:111) andˇ H n ( ˆ θ n ) = 1 n ˆ c n n (cid:88) i =1 n (cid:88) j =1 (cid:110) I( | ˜ u ij ( ˆ θ n ) | ≤ ˆ c n )ˇ z ij ˜ x gij ( ˆ θ n ) (cid:48) ˜ x gij ( ˆ θ n ) (cid:111) respectively where ˆ c n is a (possibly random) bandwidth and the uniform kernel re-places the conditional density in E [ H n ( θ o )]. The estimator of the covariance matrixis Σ n ( ˆ θ n ) = ˇ H n ( ˆ θ n ) − ˇ Ω n ( ˆ θ n ) ˇ H n ( ˆ θ n ) − . In addition to preceding assumptions, thefollowing condition is imposed on the sequence of bandwidths ˆ c n . Assumption 4.
For some non-stochastic sequence c n with c n → and √ nc n → ∞ , plim n →∞ (ˆ c n /c n ) = 1 . H n ( ˆ θ n );it means ˆ c n satisfies ˆ c n = o p (1) and ˆ c − n = o p ( √ n ). The next theorem states the con-sistency of Σ n ( ˆ θ n ). Theorem 3.
Under Assumptions 1 to 4, plim n →∞ Σ n ( ˆ θ n ) = Σ o . Estimating the asymptotic covariance matrix involves specifying the bandwidth ˆ c n .Poor performance of the covariance matrix can result from an inappropriate choiceof bandwidth. In addition to the asymptotic covariance matrix estimator Σ n ( ˆ θ n ),this paper considers the bootstrap. The bootstrap obviates the specification of ˆ c n .Simulations on the coverage performance of the estimator Σ n ( ˆ θ n ) and the bootstrapare available in the Online Appendix. This section examines the empirical performance of the MDep estimator vis-`a-visIV methods for the linear model using simulations. Non-linear parametric modelsare considered in the Online Appendix where the MDep is compared to maximumlikelihood, quasi-maximum likelihood, GMM, and the control function (CF) approach.Each element of x = [ x , x ∗ ] is standard normal N (0 ,
1) with cov( x , x ∗ ) = 0 . z = [ z , z ] are the instruments. θ c = 0 . θ ≡ [ θ , θ ] = [1 , −
1] are main-tained across all specifications. For each specification, the mean bias (MB), medianabsolute deviation (MAD), and root mean squared error (RMSE) are reported. 2000simulations are run for each sample size n ∈ { , , } . ν ∼ ( χ (1) − / √ σ ξ is the standard deviation of a random variable ξ , Ber(0 .
5) is the Bernouilli distribu-tion with mean equal to 0 . φ ( · ) is the standard normal probability density function(pdf), Φ − ( τ ) is the τ ’th quantile of the standard normal distribution, and M x ( ξ )is the residual from the linear projection of ξ on x . The following specifications areconsidered.(i) y = θ c + θ x + θ x + u , u = ( ˙ u − E [ ˙ u ]) /σ ˙ u , ˙ u = ν + φ (( x − x ∗ ) / . x = x ∗ ,and z = [ x , x ∗ ].(ii) y = θ c + θ x + θ x + u , u = ( ˙ u − E [ ˙ u ]) /σ ˙ u , ˙ u = ν + φ (( x − x ∗ ) / . x = ( x ∗ + ν ) / √ z = [ x , ( x ∗ + w ) / √ w ∼ N (0 , y = θ c + θ x + θ x + u , u = ˙ u/σ ˙ u , ˙ x = 6 x ∗ / √ n + (1 + x ∗ ) ν , x = ˙ x /σ ˙ x , z = [ x , ( x ∗ + w ) / √ w ∼ N (0 , u = (1 + D | x | + (1 − D ) | x ∗ | ) ν , and D ∼ Ber(0 . y = θ c + θ x + θ x + u , u = ˙ u/σ ˙ u , ˙ x = (1 + x ∗ ) ν , x = ˙ x /σ ˙ x , z =[ x , ( x ∗ + w ) / √ w ∼ N (0 , u = (1+ D | x | +(1 − D ) | x ∗ | ) ν , and D ∼ Ber(0 . y = θ c + θ x + θ x + ν , x = ( x ∗ + ν ) / √
2, and z = [ x , | x ∗ | / (cid:112) − /π ].(vi) y = θ c + θ x + θ x + ν , x = ( x ∗ + ν ) / √
2, and z = [ x , | x ∗ | < − Φ − (0 . y = θ c + θ x + θ x + ν , x = ( x ∗ + ν ) / √
2, and z = [ x , M x ( | x | )].Specifications (i) and (ii) consider instrument validity under the uncorrelatednessexclusion restriction cov( z , u ) = but not the strong conditional mean independencerestriction E [ u | z ] = E [ u ]. These help to demonstrate the performance of the MDepunder the weakest exclusion restriction needed for the validity of IV methods. Weakinstruments are introduced in specification (iii); the weight 6 / √ n is chosen to ensurethat the non-homoskedasticity-robust F-statistic (see Andrews, Stock, and Sun (2019,eqn. 7.)) is approximately 5. (iv) to (vii) consider cases where x is uncorrelated butnot independent of its instrument z . In (iv), x is mean independent but not (dis-tributionally) independent of z . (v) and (vi) examine the performance of the MDepwhen z is uncorrelated but non-monotonically dependent on x . Non-monotonicityrenders transformations of z that improve correlation with x infeasible. (iii) to (vi)present cases where z is IV-weak but MDep-strong. (i) to (iv) help to examine therelative performance of MDep and IV under conditional heteroskedasticity. Specifi-cation (vii) presents an interesting case where an instrument is constructed by takinga non-monotone transformation of a continuous endogenous x and rendering it or-thogonal to x . This specification illustrates a unique strength of the MDep estimatorwhen valid instruments are unavailable or too MDep-weak. Table 4.1 presents simulation results for the linear model. Entries exceeding 500are omitted. In the standard cases (i) and (ii) where the instrument is IV-strong, oneobserves good performance of both MDep and OLS/IV with the MDep performingbetter especially in terms of MAD and RMSE. MDep’s competitive performance in thespecifications (i) and (ii) suggests the MDep exclusion restriction (Assumption 2(a)) The pdCov test can be used to assess the strength of the constructed instrument. 
θ θ n 100 500 2500 100 500 2500Specification (i): (100 × Sample Bias)MDep OLS MDep OLS MDep OLS MDep OLS MDep OLS MDep OLSMB 0.442 0.832 0.123 0.245 0.02 0.037 0.482 0.59 0.093 0.213 0.013 0.005MAD 3.818 7.448 1.43 3.163 0.678 1.422 3.535 6.908 1.5 3.194 0.644 1.382RMSE 5.951 10.515 2.242 4.585 0.989 2.08 5.933 10.467 2.306 4.638 0.964 2.04Specification (ii): (100 × Sample Bias)MDep IV MDep IV MDep IV MDep IV MDep IV MDep IVMB -0.8 0.581 -0.152 0.041 -0.043 -0.026 4.585 -2.289 0.746 -0.006 0.045 -0.055MAD 3.645 7.273 1.57 3.231 0.678 1.476 6.946 14.197 2.525 6.458 1.111 2.941RMSE 6.177 11.787 2.405 4.886 1.001 2.146 12.255 25.664 4.075 9.516 1.672 4.289Specification (iii): (100 × Sample Bias)MDep IV MDep IV MDep IV MDep IV MDep IV MDep IVMB 0.277 0.499 0.326 0.877 -0.053 -0.492 12.318 -24.458 1.777 -21.28 -0.437 -27.43MAD 4.237 8.297 2.244 3.642 1.02 1.744 19.906 29.302 6.02 29.091 2.149 27.78RMSE 6.756 81.681 3.392 23.206 1.506 37.132 28.603 - 8.971 419.118 3.177 -Specification (iv): (100 × Sample Bias)MDep IV MDep IV MDep IV MDep IV MDep IV MDep IVMB 1.987 1.43 0.409 0.843 -0.053 -3.547 8.519 103.793 1.206 -136.934 -0.359 -11.069MAD 5.354 9.979 2.214 4.3 1.005 1.967 10.931 84.134 4.188 84.998 1.828 82.703RMSE 8.315 132.652 3.414 53.491 1.497 184.761 18.824 - 6.364 - 2.707 -Specification (v): (100 × Sample Bias)MDep IV MDep IV MDep IV MDep IV MDep IV MDep IVMB -1.893 -13.05 -0.612 -7.502 -0.148 -2.376 13.098 87.437 4.138 42.286 0.96 13.143MAD 3.131 14.773 0.994 13.163 0.36 13.287 12.187 72.464 4.053 73.944 1.141 73.269RMSE 5.835 276.485 1.91 330.653 0.687 304.325 18.246 - 6.096 - 1.708 -Specification (vi): (100 × Sample Bias)MDep IV MDep IV MDep IV MDep IV MDep IV MDep IVMB -2.704 -14.84 -0.822 -28.669 -0.194 -28.705 16.502 105.025 4.919 172.413 1.114 161.015MAD 3.405 21.787 1.119 21.935 0.376 22.425 15.047 124.811 4.763 122.47 1.372 125.558RMSE 6.649 - 2.199 255.696 0.759 339.074 22.088 - 7.549 - 2.135 -Specification (vii): (100 × Sample Bias)MDep OLS MDep OLS MDep OLS MDep OLS MDep OLS MDep OLSMB -2.826 -12.085 -1.943 -12.7 -1.756 -12.891 16.853 71.175 10.243 72.537 8.858 72.914MAD 3.285 11.897 1.88 12.681 1.713 12.886 15.423 71.012 10.021 72.405 8.821 72.962RMSE 5.946 14.183 2.713 13.124 1.929 12.977 19.654 72.287 10.767 72.789 8.967 72.965This table provides results for the MDep and OLS/IV estimators of the linear model. Resultsare based on 2000 random repetitions. The mean bias (MB), median absolute deviation(MAD), and root mean squared errors (RMSE) are reported. Entries exceeding 500 areexcluded.
19s not stronger than the weak uncorrelatedness condition of IV methods. Results onspecifications (iii) and (iv) confirm MDep’s ability, unlike the case of conventionalIV methods, to exploit identifying variation from the quantile dependence of x on z . MDep continues to perform well in specifications (iv) to (vii) where x is eithermean-independent or uncorrelated with z . On the whole, the MDep’s performancein terms of bias and efficiency is good even in specifications where IV methods sufferidentification failure. This section uses two empirical examples to illustrate the usefulness of the MDepestimator in real-world data settings. The first application considers a simple demandestimation problem using data from Graddy (1995), and the second considers theimpact of migration on productivity using data from Hornung (2014). In the firstapplication, the instrument is IV-strong and hence MDep-strong by construction. Inthe second application, the instrument has additional non-linear and non-monotoneidentifying variation that the MDep, unlike the IV, is able to exploit.
The parameter of interest is the elasticity of demand for fish in a simple linearCobb-Douglas demand model with additive disturbance. The disturbance as a func-tion of parameters is specified as u ( θ ) = ln( Q p ) − θ c − θ ln( P ) − θ − W where Q p is the total quantity of fish sold per day, P is the daily average price, θ is the price elasticity of demand, and W is a vector of day dummies (see Graddy(1995) and Chernozhukov and Hansen (2008, sect. 5.1) for more details). In a typicaldemand estimation framework, price P is endogenous. Graddy (1995) proposes aninstrument Stormy which measures weather conditions at sea. Specifically,
Stormy is a dummy variable which indicates whether wave height is greater than 4.5 ft andwind speed is greater than 18 knots.The upper panel of Table 5.1 presents coefficients and standard errors while specifi-cation and relevance tests are presented in the lower panel. The tests comprise pdCov P ) -0.558 -0.541 -1.105 -1.082 -0.454 -0.563 -1.233 -1.119(0.186) (0.168) (0.459) (0.574) (0.191) (0.158) (0.471) (0.545)Const. 8.415 8.419 8.309 8.314 8.610 8.607 8.482 8.506(0.076) (0.074) (0.114) (0.136) (0.124) (0.114) (0.156) (0.155)Day Dummies (cid:88) (cid:88) (cid:88) (cid:88) Excluded Instr. (cid:88) (cid:88) (cid:88) (cid:88)
Relevance pdC nln p-value 0.189 0.527 pdC n p-value 0.000 0.000KB F -Stat. 20.663 22.929Specification T n p-value 0.573 0.406 0.918 0.892 0.894 0.952 0.681 0.793 Notes:
The number of observations in each specification is 111. For each specification,standard errors (in parentheses) are computed using the same 999 samples of the non-parametric bootstrap. 999 wild-bootstrap samples are used for the pdC nln , pdC n , and T n tests. The test statistics are omitted because the tests are not pivotal. Excluded Instr. -whether the excluded instrument is used. KB F-Stat. denotes the rk Wald F -statistic ofKleibergen and Paap (2006). pdC nln for non-linear MDep-relevance and pdC n for overall MDep-relevance of the instrument, and the T n specification test ofSu and Zheng (2017). A pdC nln p-value below 0 .
05 suggests the instrument has astatistically significant non-linear dependence with the endogenous covariate at the5% level whereas a T n p-value greater than 0 .
05 suggests failure to reject a null hy-pothesis of correct model specification. The rk Wald F-statistic of Kleibergen andPaap (2006) is provided as a measure of IV-relevance.MDep estimates of the price elasticity of demand do not differ significantly fromthe OLS/IV estimates. MDep estimates are slightly less precise than OLS estimatesin specifications (1) and (3) but are slightly more precise than IV estimates in speci-fications (2) and (4). One notices from Su and Zheng’s (2017) specification tests thatno specification is rejected by the data. Also, the relevance of the instrument
Stormy is largely linear because non-linear dependence (see the pdC nln p-value) is not signifi-cant at any conventional level. P-values of the partial distance covariance pdC n test,like Kleibergen and Paap’s (2006) rank test, confirm the strength of the instrument Stormy . Huguenots fled religious persecution in France and settled in Brandenburg-Prussiain 1685. They compensated for population losses due to plagues during the ThirtyYears’ War. Hornung (2014) examines the long-term impact of their skilled-workermigration on the productivity of textile manufactories in Prussia. The author exploitsthe settlement pattern in an IV approach and identifies the long-term impact ofskilled-worker migration on productivity. This empirical example is a revisit to theproblem and it compares OLS/IV to MDep.The disturbance function is derived from a standard firm-level Cobb-Douglas typeof production function: The linear dependence between the endogenous covariate and the instrument is removed beforethe pdCov test is conducted. See the Online Appendix for details. Su and Zheng’s (2017) T n test is a test of conditional mean independence. Conditional meanindependence is sufficient but not necessary for the MDep exclusion restriction (Assumption 2(a)). The rk Wald F -statistics of Kleibergen and Paap (2006) are above Staiger and Stock’s (1997)rule-of-thumb cut-off of 10. ( θ ) = y − θ c − θ l ln( L ) − θ k ln( K ) − θ m ln( M ) − θ (cid:16) HuguenotsT ownP opulation (cid:17) − θ − W where productivity y measures the value of goods produced by a firm in a given town, L is the number of workers, K is the number of looms, M is the value of materialsused in the production process, and W collects regional and town characteristics thatmight affect productivity. As instrument for the population share of Hougenots, theauthor proposes population losses during the Thirty Years’ War. Population dataused to construct the instrument are years removed from the beginning and the endof the Thirty Years’ War, and this may induce measurement error. The author thusconstructs an interpolated version of the instrument (see Hornung (2014) for details).Coefficient estimates and standard errors are presented in the upper panel andhypothesis tests are provided in the lower panel. The specifications considered repli-cate results in Hornung (2014, table 4). Table 5.2 presents an interesting case wherethe instrument has strong non-linear identifying variation. For the main parameterof interest (population share of Hougenots), one notices that OLS/IV estimates areless stable across specifications. Also, IV standard errors are about three times aslarge as MDep standard errors. This result is not surprising since, beyond lineardependence, the MDep estimator exploits the highly significant non-linear form ofidentifying variation (see the pdC nln p-value) in the instrument.23able 5.2: Estimates - Huguenot Productivity - Hornung (2014)MDep OLS MDep IV MDep IV(1) (1) (2) (2) (3) (3)Percent Huguenots 1700 1.816 1.741 2.003 3.475 1.625 3.380(0.293) (0.462) (0.340) (0.986) (0.282) (1.014)ln workers 0.115 0.123 0.111 0.135 0.111 0.134(0.024) (0.034) (0.024) (0.035) (0.025) (0.035)ln looms 0.096 0.102 0.095 0.082 0.094 0.083(0.026) (0.035) (0.026) (0.036) (0.027) (0.036)ln materials (imputed) 0.814 0.800 0.814 0.811 0.815 0.811(0.021) (0.028) (0.021) (0.030) (0.022) (0.030)Const. 1.344 2.043 1.186 2.830 2.060 2.787(0.427) (0.441) (0.444) (0.547) (0.445) (0.540)Excluded Instr. 
(cid:88) (cid:88) (cid:88) (cid:88) Relevance pdC nln p-value 0.000 0.000 pdC n p-value 0.000 0.000KB F -Stat. 14.331 22.531Specification T n p-value 0.721 0.559 0.738 0.405 0.638 0.438 Notes:
The number of observations in each specification is 150 firms. For each specification,standard errors (in parentheses) are computed using the same 999 samples of the non-parametric bootstrap. 999 wild-bootstrap samples are considered for the pdC nln , pdC n , and T n tests. The test statistics are omitted because they are not pivotal. Percent populationlosses in 30 Years’ War is the instrument used in specification (2), and
Percent populationlosses in 30 Years’ War, interpolated is the instrument used in specification (3) (see Hornung(2014) for details). Excluded Instr. - whether the excluded instrument is used. KB F-Stat.denotes Kleibergen and Paap’s (2006) rk Wald F -statistic. Conclusion
This paper introduces the MDep estimator which weakens the relevance conditionof IV methods to non-independence and hence maximises the set of relevant instru-ments without imposing a stronger exclusion restriction. The estimator is basedon the distance covariance measure of Sz´ekely, Rizzo, Bakirov, et al. (2007) whichmeasures the distributional dependence between random variables of arbitrary di-mensions. The estimator provides a fundamentally different and practically usefulapproach to dealing with the weak-IV problem; the proposed estimator admits in-struments that may be uncorrelated or mean-independent but non-independent ofendogenous covariates.Consistency, and asymptotic normality of the MDep estimator hold under mildconditions. Several Monte Carlo simulations showcase the gains of the MDep estima-tor vis-`a-vis existing IV methods. The empirical examples considered in the paperillustrate the practical utility of the MDep estimator in empirical work. In the firstexample where the instrument has a strong linear but trivial non-linear relationshipto the endogenous covariate, the MDep is comparable to the IV in terms of precision.In the second example where the linear dependence is quite weak but the non-lineardependence is non-trivial, MDep estimates are more precise and stable relative to theIV. This paper is a first step in developing a distance covariance-based estimatorfor econometric models under possible endogeneity. It will be interesting to extendthe current paper along a number of dimensions. First, an MDep estimator can bedeveloped for models which are beyond the scope of the current paper, e.g., quantileregression and binary response models. Second, owing to the frequent occurrenceof autocorrelated and clustered data in empirical practice, it will be interesting toextend the theory to accommodate mixing and clustering in disturbances.25 eferences [1] Andrews, Donald WK. “Consistency in nonlinear econometric models: A genericuniform law of large numbers”.
Econometrica: Journal of the Econometric Society (1987), pp. 1465–1471.
[2] Andrews, Donald W. K., Marcelo J. Moreira, and James H. Stock. "Optimal two-sided invariant similar tests for instrumental variables regression". Econometrica (2006).
[3] Quantitative Economics.
[4] Econometrica.
[5] Andrews, Isaiah, James H. Stock, and Liyang Sun. "Weak instruments in instrumental variables regression: Theory and practice". Annual Review of Economics 11 (2019), pp. 727–753.
[6] Belloni, Alexandre, Daniel Chen, Victor Chernozhukov, and Christian Hansen. "Sparse models and methods for optimal instruments with an application to eminent domain". Econometrica (2012).
[7] Bernstein, Dennis S. Matrix Mathematics: Theory, Facts, and Formulas. 2009.
[8] Bound, John, David A. Jaeger, and Regina M. Baker. "Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak". Journal of the American Statistical Association (1995).
[9] Chen, Jiafeng, Daniel L. Chen, and Greg Lewis. arXiv preprint arXiv:2011.06158 (2020).
[10] Cheng, Xu and Zhipeng Liao. "Select the valid and relevant moments: An information-based LASSO for GMM with many moments". Journal of Econometrics (2015).
[11] Journal of Econometrics.
[12] Journal of Econometrics.
[13] Bernoulli.
[14] Dieterle, Steven G. and Andy Snell. Labour Economics 42 (2016), pp. 76–86.
[15] Edelmann, Dominic, Konstantinos Fokianos, and Maria Pitsillou. "An updated literature review of distance correlation and its applications to time series". International Statistical Review (2019).
[16] The RAND Journal of Economics (1995), pp. 75–92.
[17] Hansen, Christian and Damian Kozbur. "Instrumental variables estimation with many weak instruments using regularized JIVE". Journal of Econometrics (2014).
[18] Econometric Reviews.
[19] Hoeffding, Wassily. "A class of statistics with asymptotically normal distribution". The Annals of Mathematical Statistics (1948).
[20] American Economic Review.
[21] Huber, Peter J. "The behavior of maximum likelihood estimates under nonstandard conditions". Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. Vol. 1. 1. University of California Press, 1967, pp. 221–233.
[22] Kleibergen, Frank and Richard Paap. "Generalized reduced rank tests using the singular value decomposition". Journal of Econometrics (2006).
[23] Koenker, Roger and Gilbert Bassett. "Regression quantiles". Econometrica: Journal of the Econometric Society (1978), pp. 33–50.
[24] Lee, Sokbae. "Endogeneity in quantile regression models: A control function approach". Journal of Econometrics (2007).
[25] Journal of the American Statistical Association.
[26] Nelson, Charles R. and Richard Startz. Journal of Business (1990), S125–S140.
[27] Nelson, Charles R. and Richard Startz. "Some Further Results on the Exact Small Sample Properties of the Instrumental Variable Estimator". Econometrica (1990).
[28] Newey, Whitney K. and Daniel McFadden. "Large sample estimation and hypothesis testing". Handbook of Econometrics (1994).
[29] Oberhofer, Walter and Harry Haupt. Econometric Theory (2016).
[30] Journal of Business & Economic Statistics.
[31] Pakes, Ariel and David Pollard. "Simulation and the asymptotics of optimization estimators". Econometrica: Journal of the Econometric Society (1989), pp. 1027–1057.
[32] Park, Trevor, Xiaofeng Shao, Shun Yao, et al. "Partial martingale difference correlation". Electronic Journal of Statistics (2015).
[33] Econometric Theory (1985), pp. 295–313.
[34] Powell, James L. "Estimation of monotonic regression models under quantile restrictions". Nonparametric and semiparametric methods in Econometrics (1991), pp. 357–384.
[35] Sanderson, Eleanor and Frank Windmeijer. "A weak instrument F-test in linear IV models with multiple endogenous variables". Journal of Econometrics (2016).
[36] Journal of the American Statistical Association.
[37] Journal of Multivariate Analysis 122 (2013), pp. 148–161.
[38] Staiger, Douglas and James H. Stock. "Instrumental Variables Regression with Weak Instruments". Econometrica (1997).
[39] Stock, James H. and Motohiro Yogo. "Testing for weak instruments in linear IV regression". Identification and Inference for Econometric Models. Ed. by Andrews, Donald W. K. New York: Cambridge University Press, 2005, pp. 80–108.
[40] Su, Liangjun and Xin Zheng. "A martingale-difference-divergence-based test for specification". Economics Letters 156 (2017), pp. 162–167.
[41] Székely, Gábor J. and Maria L. Rizzo. "Brownian distance covariance". The Annals of Applied Statistics (2009), pp. 1236–1265.
[42] Székely, Gábor J. and Maria L. Rizzo. "On the uniqueness of distance covariance". Statistics & Probability Letters (2012).
[43] Székely, Gábor J., Maria L. Rizzo, and Nail K. Bakirov. "Measuring and testing dependence by correlation of distances". The Annals of Statistics (2007).
[44] Székely, Gábor J. and Maria L. Rizzo. "Partial distance correlation with methods for dissimilarities". The Annals of Statistics (2014).
[45] Van der Vaart, Aad W. Asymptotic statistics. Vol. 3. Cambridge University Press, 1998.
[46] Wooldridge, Jeffrey M. Econometric analysis of cross section and panel data. MIT Press, 2010.
[47] Xu, Kai and Fangxue Chen. "Martingale-difference-divergence-based tests for goodness-of-fit in quantile models". Journal of Statistical Planning and Inference 207 (2020), pp. 138–154.
[48] Zhong, Wei and Liping Zhu. "An iterative approach to distance correlation-based sure independence screening". Journal of Statistical Computation and Simulation (2015).
[49] Journal of Time Series Analysis.
Appendix

A Proof of Theorem 1
The proof of Theorem 1 begins with a uniform law of large numbers result. The following lemma verifies the conditions of Andrews (1987, Corollary 2).
Lemma A.1.
Suppose Assumption 1 holds, then
(a) there exists a non-random function $B:\mathcal{W}^2\rightarrow\mathbb{R}_+$ with $\lim_{n\rightarrow\infty}\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}[B(x_i,x_j,z_i,z_j)]<\infty$ such that for any $\theta_1,\theta_2\in\Theta$, $|q(w_i,w_j,\theta_1)-q(w_i,w_j,\theta_2)|\leq B(x_i,x_j,z_i,z_j)\,||\theta_1-\theta_2||$ a.s.;
(b) $\mathrm{E}[Q_n(\theta)]$ is continuous on $\Theta$ uniformly in $n$, and $\sup_{\theta\in\Theta}|Q_n(\theta)-\mathrm{E}[Q_n(\theta)]|\xrightarrow{p}0$.

Proof of Lemma A.1. Part (a): First, let $B(x,x^{\dagger},z,z^{\dagger})\equiv\sup_{\theta\in\Theta}||\check{z}\,\tilde{x}_g(\theta)||$. By Assumption 1(c) and Jensen's inequality, $\mathrm{E}[B(x,x^{\dagger},z,z^{\dagger})]\equiv\mathrm{E}[\sup_{\theta\in\Theta}||\check{z}\,\tilde{x}_g(\theta)||]\leq(\mathrm{E}[\sup_{\theta\in\Theta}||\check{z}\,\tilde{x}_g(\theta)||^2])^{1/2}\leq C^{1/2}$ for any $[x,z],[x^{\dagger},z^{\dagger}]$ defined on the support of $[x_i,z_i]$. This implies $\lim_{n\rightarrow\infty}\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}[B(x_i,x_j,z_i,z_j)]\leq C^{1/2}<\infty$. By Assumption 1(a) and the measurability of the Jacobian (Assumption 1(b)), $B(x_i,x_j,z_i,z_j)\equiv\sup_{\theta\in\Theta}||\check{z}_{ij}\,\tilde{x}_{gij}(\theta)||$ is measurable.

Second, for any $w,w^{\dagger}$ defined on the support of $w_i$ and $\theta_1,\theta_2,\bar{\theta}_{1,2}\in\Theta$, where $\bar{\theta}_{1,2}$, by Assumption 1(b) and the Mean-Value Theorem (MVT), satisfies $\tilde{u}(\theta_1)-\tilde{u}(\theta_2)=\tilde{x}_g(\bar{\theta}_{1,2})(\theta_1-\theta_2)$,
$$B(x,x^{\dagger},z,z^{\dagger})\cdot||\theta_1-\theta_2||\geq|\check{z}|\cdot|\tilde{x}_g(\bar{\theta}_{1,2})(\theta_1-\theta_2)|=|\check{z}|\cdot|\tilde{u}(\theta_1)-\tilde{u}(\theta_2)|\geq\big|\check{z}\big(|\tilde{u}(\theta_1)|-|\tilde{u}(\theta_2)|\big)\big|=|q(w,w^{\dagger},\theta_1)-q(w,w^{\dagger},\theta_2)|.$$
The second inequality follows from the triangle inequality.
Part (b): From part (a) above, Assumptions 1(a)-1(c) imply Assumption A4 of Andrews (1987). Assumptions 1(a) and 1(b) imply the measurability of $q(w,w^{\dagger},\theta)\equiv\check{z}(|\tilde{u}(\theta)|-|\tilde{u}|)$ for all $\theta\in\Theta$ and any random vectors $w,w^{\dagger}$ on the support of $w_i$. This corresponds to Andrews (1987, Assumption A2). In addition to Assumption 1(d), the conclusion follows from Corollary 2 of Andrews (1987).

As a next step in the proof of Theorem 1, Proposition 3.1 is useful in proving the identification of $\theta_o$.

Proof of Proposition 3.1. The following expression is a useful mean-value expansion around $b=0$:
$$|\xi-b|-|\xi|=\big(2\,\mathrm{I}(\xi<\lambda_u b)-1\big)b\tag{A.1}$$
where $\lambda_u\in(0,1)$ and $\xi$ is a random variable.

Thanks to Assumption 1(b), $\tilde{u}(\theta)=\tilde{u}-\tilde{x}_g(\bar{\theta})(\theta-\theta_o)$ by the MVT for any pair of random vectors $w,w^{\dagger}$ defined on the support of $w_i$. Using the expression (A.1) for some $\lambda_u\in(0,1)$,
$$q(w,w^{\dagger},\theta)=\check{z}\big(|\tilde{u}(\theta)|-|\tilde{u}|\big)=\check{z}\big(|\tilde{u}-\tilde{x}_g(\bar{\theta})(\theta-\theta_o)|-|\tilde{u}|\big)=\check{z}\Big(2\,\mathrm{I}\big(\tilde{u}<\lambda_u\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big)-1\Big)\tilde{x}_g(\bar{\theta})(\theta-\theta_o).\tag{A.2}$$
Taking expectations and applying the law of iterated expectations (LIE),
$$\mathrm{E}[q(w,w^{\dagger},\theta)]=\mathrm{E}\Big[\check{z}\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[2\,\mathrm{I}\big(\tilde{u}<\lambda_u\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big)-1\big]\,\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\Big]=\mathrm{E}\Big[\check{z}\Big(2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big(\lambda_u\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big)-1\Big)\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\Big].$$
Noting that $F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0)=1/2$, $\big(2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\lambda_u\tilde{x}_g(\bar{\theta})(\theta-\theta_o))-1\big)\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\geq0$. This implies
$$\mathrm{E}[q(w,w^{\dagger},\theta)]=\mathrm{E}\Big[\check{z}\,\Big|\big(2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\lambda_u\tilde{x}_g(\bar{\theta})(\theta-\theta_o))-1\big)\Big|\times\big|\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big|\Big]=\int|2\tau-1|\,\mathrm{E}\big[\check{z}\,\big|\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big|\times\mathrm{I}\big((x,x^{\dagger},z,z^{\dagger})\in A_{\theta}(\tau)\big)\big]\,d\tau=\int|2\tau-1|\int_{A_{\theta}(\tau)}\big[\check{z}\,\big|\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big|\big]\,d\mathrm{P}_{\tilde{X}_g,\hat{Z}}\,d\tau.$$

With Lemma A.1 and Proposition 3.1 in hand, the identification result is proved next.
Proof of Lemma 3.1. Under the assumptions of Proposition 3.1, $\mathrm{E}[q(w,w^{\dagger},\theta)]=\int|2\tau-1|\int_{A_{\theta}(\tau)}\big[\check{z}\,|\tilde{x}_g(\bar{\theta})(\theta-\theta_o)|\big]\,d\mathrm{P}_{\tilde{X}_g,\hat{Z}}\,d\tau$. This implies, by the LIE,
$$\mathrm{E}[Q_n(\theta)]\equiv\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}[q(w_i,w_j,\theta)]=\int|2\tau-1|\,\mathrm{E}\big[V_{n,\tau}\big(X_g(\bar{\theta})(\theta-\theta_o),Z\big)\big]\,d\tau,$$
so that the minimiser $\theta_o$ of $\mathrm{E}[Q_n(\theta)]$ on $\Theta$ is unique. By Property (b), there exists $\tilde{\delta}_{\varepsilon}>0$ such that, for $n$ sufficiently large, $\inf_{\{\theta\in\Theta:||\theta-\theta_o||\geq\varepsilon\}}\mathrm{E}\big[V_{n,\tau}\big(X_g(\bar{\theta})(\theta-\theta_o),Z\big)\big]\geq\tilde{\delta}_{\varepsilon}$ for all $\tau\in[0,1]$. Define $f_{\theta}(\tau)=\mathrm{E}\big[V_{n,\tau}\big(X_g(\bar{\theta})(\theta-\theta_o),Z\big)\big]/\mathrm{E}\big[V_n\big(X_g(\bar{\theta})(\theta-\theta_o),Z\big)\big]$ and $F_{\theta}(\tau)=\int_0^{\tau}f_{\theta}(t)\,dt$. For all $\theta\neq\theta_o$, $f_{\theta}(\tau)\geq0$, $\int_0^1 f_{\theta}(t)\,dt=1$, and $\int|2\tau-1|f_{\theta}(\tau)\,d\tau=1-2\big(\int_{0.5}^{1}F_{\theta}(\tau)\,d\tau-\int_0^{0.5}F_{\theta}(\tau)\,d\tau\big)$ using integration by parts. By the MVT, there exist $\bar{\tau}^{*}_{\theta}$ and $\tau^{*}_{\theta}$, $0<\tau^{*}_{\theta}<0.5<\bar{\tau}^{*}_{\theta}<1$, such that $\int_0^{0.5}F_{\theta}(\tau)\,d\tau=0.5F_{\theta}(\tau^{*}_{\theta})$ and $\int_{0.5}^{1}F_{\theta}(\tau)\,d\tau=0.5F_{\theta}(\bar{\tau}^{*}_{\theta})$. Further, $0<F_{\theta}(\tau^{*}_{\theta})\leq F_{\theta}(\bar{\tau}^{*}_{\theta})<1$, and this implies
$$\mathrm{E}[Q_n(\theta)]=\mathrm{E}\big[V_n\big(X_g(\bar{\theta})(\theta-\theta_o),Z\big)\big]\int|2\tau-1|f_{\theta}(\tau)\,d\tau=\big(1-(F_{\theta}(\bar{\tau}^{*}_{\theta})-F_{\theta}(\tau^{*}_{\theta}))\big)\,\mathrm{E}\big[V_n\big(X_g(\bar{\theta})(\theta-\theta_o),Z\big)\big]\equiv d_n(\theta)\,\mathrm{E}\big[V_n\big(X_g(\bar{\theta})(\theta-\theta_o),Z\big)\big],$$
with $0<\inf_{\{\theta\in\Theta:||\theta-\theta_o||\geq\varepsilon\}}d_n(\theta)\leq\sup_{\{\theta\in\Theta:||\theta-\theta_o||\geq\varepsilon\}}d_n(\theta)<1$. Set $\delta_{\varepsilon}=\tilde{\delta}_{\varepsilon}\inf_{\{\theta\in\Theta:||\theta-\theta_o||\geq\varepsilon\}}d_n(\theta)$ for $n$ sufficiently large. The conclusion follows from Assumption 2(b).

The final part of the proof of Theorem 1 follows Newey and McFadden (1994, Theorem 2.1). For every $\epsilon>0$, with probability approaching one, (a) $\mathrm{E}[Q_n(\hat{\theta}_n)]<Q_n(\hat{\theta}_n)+\epsilon/3$ by the uniform convergence in Lemma A.1, (b) $Q_n(\hat{\theta}_n)<Q_n(\theta_o)+\epsilon/3$ by the definition of $\hat{\theta}_n$, and (c) $Q_n(\theta_o)<\mathrm{E}[Q_n(\theta_o)]+\epsilon/3$. Combining (a)-(c), $\mathrm{E}[Q_n(\hat{\theta}_n)]<Q_n(\theta_o)+2\epsilon/3<\mathrm{E}[Q_n(\theta_o)]+\epsilon$, i.e., for $n$ sufficiently large and every $\epsilon>0$, $\mathrm{E}[Q_n(\hat{\theta}_n)]<\mathrm{E}[Q_n(\theta_o)]+\epsilon$. Under the assumptions of Lemma 3.1, for every $\varepsilon>0$ there exists a $\delta_{\varepsilon}>0$ such that $\mathrm{E}[Q_n(\theta_o)]<\inf_{\{\theta\in\Theta:||\theta-\theta_o||\geq\varepsilon\}}\mathrm{E}[Q_n(\theta)]-\delta_{\varepsilon}$. By the continuity of $\mathrm{E}[Q_n(\theta)]$ (Lemma A.1) and compactness of the set $N_{\varepsilon}\equiv\{\theta\in\Theta:||\theta-\theta_o||\geq\varepsilon\}$, let $\inf_{\theta\in N_{\varepsilon}}\mathrm{E}[Q_n(\theta)]=\mathrm{E}[Q_n(\theta^{*})]$ for some $\theta^{*}\in N_{\varepsilon}$. This implies $\mathrm{E}[Q_n(\theta_o)]<\mathrm{E}[Q_n(\theta^{*})]-\delta_{\varepsilon}$. Choosing $\epsilon=\mathrm{E}[Q_n(\theta^{*})]-\mathrm{E}[Q_n(\theta_o)]-\delta_{\varepsilon}$, it follows that $\mathrm{E}[Q_n(\hat{\theta}_n)]<\inf_{\theta\in N_{\varepsilon}}\mathrm{E}[Q_n(\theta)]-\delta_{\varepsilon}$ for $n$ sufficiently large, and thus $\hat{\theta}_n\in N^{c}_{\varepsilon}$ (the complement of $N_{\varepsilon}$) with probability approaching one.

B Proof of Theorem 2
Part (a): By Lemma 11.10 of Van der Vaart (1998), the Hájek projection of $U_n$ is $\hat{U}_n\equiv\sum_{i=1}^{n}\mathrm{E}[U_n\,|\,w_i]=\frac{2}{n}\sum_{i=1}^{n}\tau_n'\psi_i(w_i)$ under Assumption 1(a). The rest of the proof of part (a) involves the verification of the conditions of Hoeffding (1948, Theorem 8.1). These conditions are needed to apply the Lyapunov CLT because $\{\psi_i(w_i),\,1\leq i\leq n\}$ are independently distributed random vectors. The conditions are stated in the following lemma.

Lemma B.1.
Suppose Assumption 3(e) holds, then for any random vectors $w,w^{\dagger}$ defined on the support of $w_i$ and $\tau_n\in\mathbb{R}^{p_{\theta}}$ with $||\tau_n||=1$,
(i) $\mathrm{E}[(\tau_n'\psi(w,w^{\dagger}))^2]\leq C^{1/2}$;
(ii) $\mathrm{E}[|\tau_n'\psi_i(w_i)|^3]\leq C^{1/2}$;
(iii) $\lim_{n\rightarrow\infty}\sum_{i=1}^{n}\mathrm{E}[|\tau_n'\psi_i(w_i)|^3]\big/\big(\sum_{i=1}^{n}\mathrm{E}[(\tau_n'\psi_i(w_i))^2]\big)^{3/2}=0$.

Proof. Condition (i): $\mathrm{E}[(\tau_n'\psi(w,w^{\dagger}))^2]=\mathrm{E}[\tau_n'\psi(w,w^{\dagger})\psi(w,w^{\dagger})'\tau_n]=(\tau_n'\otimes\tau_n')\mathrm{vec}\big(\mathrm{E}[\psi(w,w^{\dagger})\psi(w,w^{\dagger})']\big)\leq||\tau_n'\otimes\tau_n'||\cdot||\mathrm{E}[\psi(w,w^{\dagger})\psi(w,w^{\dagger})']||\leq\mathrm{E}[||\hat{z}\tilde{x}_g(\theta_o)||^2]\leq\big(\mathrm{E}[||\hat{z}\tilde{x}_g(\theta_o)||^4]\big)^{1/2}\leq C^{1/2}<\infty$ by Jensen's inequality, Assumption 3(e), and $||\tau_n'\otimes\tau_n'||=||\tau_n||^2$ by Bernstein (2009, Fact 9.7.27).

Condition (ii): For any $i$, $1\leq i\leq n$,
$$\mathrm{E}[|\tau_n'\psi_i(w_i)|^3]=\frac{1}{(n-1)^3}\mathrm{E}\Big[\Big|\sum_{j\neq i}\tau_n'\mathrm{E}[\psi(w_i,w_j)\,|\,w_i]\Big|^3\Big]\leq\max_{1\leq j\leq n}\mathrm{E}\big[\big|\mathrm{E}[\tau_n'\psi(w_i,w_j)\,|\,w_i]\big|^3\big]\leq\max_{1\leq j\leq n}\mathrm{E}\big[|\tau_n'\psi(w_i,w_j)|^3\big]=\max_{1\leq j\leq n}\mathrm{E}\big[|\hat{z}_{ij}\tau_n'\tilde{x}_{gij}(\theta_o)'|^3\big]\leq\max_{1\leq j\leq n}\mathrm{E}\big[||\hat{z}_{ij}\tilde{x}_{gij}(\theta_o)||^3\big]\leq\max_{1\leq j\leq n}\big(\mathrm{E}\big[||\hat{z}_{ij}\tilde{x}_{gij}(\theta_o)||^6\big]\big)^{1/2}\leq C^{1/2}.$$
The first inequality expands the cube into a triple sum and applies Jensen's inequality (for the conditional expectation), the convexity of $|\cdot|^3$, and the LIE; the penultimate inequality follows since $|(\tilde{x}_{gij}(\theta_o)'\otimes\hat{z}_{ij})\mathrm{vec}(\tau_n')|\leq||\tilde{x}_{gij}(\theta_o)'\otimes\hat{z}_{ij}||\cdot||\mathrm{vec}(\tau_n')||=||\hat{z}_{ij}\tilde{x}_{gij}(\theta_o)||$; and the last inequality follows from Jensen's inequality and Assumption 3(e).

Condition (iii):
$$\sum_{i=1}^{n}\mathrm{E}[(\tau_n'\psi_i(w_i))^2]=\tau_n'\Big(\sum_{i=1}^{n}\mathrm{E}[\psi_i(w_i)\psi_i(w_i)']\Big)\tau_n=n\,\tau_n'\,\mathrm{var}\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi_i(w_i)\Big)\tau_n=n\,\tau_n'\Psi_n\tau_n\geq n\min_{||\tau||=1}\tau'\Psi_n\tau=n\rho_{\min}(\Psi_n)\geq nC^{-1/2}$$
by Assumption 3(a) for $n$ sufficiently large. This, in conjunction with part (ii) above, implies
$$\frac{\sum_{i=1}^{n}\mathrm{E}[|\tau_n'\psi_i(w_i)|^3]}{\big(\sum_{i=1}^{n}\mathrm{E}[(\tau_n'\psi_i(w_i))^2]\big)^{3/2}}\leq\frac{nC^{1/2}}{(nC^{-1/2})^{3/2}}=O(n^{-1/2})=o(1),$$
and the last condition of Hoeffding (1948, Theorem 8.1) is verified.

By Hoeffding (1948, Theorem 8.1), the preceding three conditions (i) to (iii) imply $U_n/\mathrm{var}(U_n)^{1/2}\xrightarrow{d}N(0,1)$. Since $V_n=(1-1/n)U_n$, $U_n/\mathrm{var}(U_n)^{1/2}=V_n/\mathrm{var}(V_n)^{1/2}=\sqrt{n}\,\tau_n'S_n(\theta_o)'/(\tau_n'\mathrm{E}[\Omega_n(\theta_o)]\tau_n)^{1/2}=\sqrt{n}\,||\tau_n'\mathrm{E}[\Omega_n(\theta_o)]^{1/2}||^{-1}\tau_n'S_n(\theta_o)'$. Noting that $||\tilde{\tau}_n||=1$ where $\tilde{\tau}_n=||\tau_n'\mathrm{E}[\Omega_n(\theta_o)]^{1/2}||^{-1}\tau_n'\mathrm{E}[\Omega_n(\theta_o)]^{1/2}$ completes the proof of Theorem 2(a).

Part (b): The proof of part (b) proceeds by verifying the conditions of Newey and McFadden (1994, Theorem 7.1). Twice differentiability of $\mathrm{E}[Q_n(\theta)]$ at $\theta_o$ holds under the assumptions of Lemma D.4. Under the assumptions of Theorem 1, $\hat{\theta}_n\xrightarrow{p}\theta_o$. $\mathrm{E}[Q_n(\theta)]$ is minimised by $\theta_o$ for $n$ sufficiently large under the assumptions of Lemmata 3.1 and D.1. $\sqrt{n}\,S_n(\theta_o)\xrightarrow{d}N(\mathbf{0},\Omega_o)$ follows from part (a) and the Cramér-Wold device. In addition to Assumptions 1(d), 3(b), and 3(c), the conclusion follows.

C Proof of Theorem 3
The following expression is useful in subsequent results. For any $\epsilon>0$,
$$\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\mathrm{I}(|\tilde{u}|\leq\epsilon)]/(2\epsilon)=\frac{F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\epsilon)-F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(-\epsilon)}{2\epsilon}=f_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big((1-2\lambda)\epsilon\big)\tag{C.1}$$
for some $\lambda\in(0,1)$ by Assumption 3(d) and the MVT (taken about $-\epsilon$).

A first step in the proof is to show that $\mathrm{plim}_{n\rightarrow\infty}\check{H}_n(\hat{\theta}_n)=\lim_{n\rightarrow\infty}\mathrm{E}[H_n(\theta_o)]\equiv H_o$. Recall
$$\check{H}_n(\hat{\theta}_n)=\frac{1}{n^2\hat{c}_n}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big\{\mathrm{I}(|\tilde{u}_{ij}(\hat{\theta}_n)|\leq\hat{c}_n)\,\check{z}_{ij}\,\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gij}(\hat{\theta}_n)\Big\}\quad\text{and}\quad\mathrm{E}[H_n(\theta_o)]=\frac{2}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\Big[\hat{z}_{ij}\,f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\,\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)\Big].$$
Define
$$\tilde{H}_n(\theta_o)=\frac{1}{n^2c_n}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big\{\mathrm{I}(|\tilde{u}_{ij}|\leq c_n)\,\hat{z}_{ij}\,\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)\Big\}.$$
By the triangle inequality, $||\check{H}_n(\hat{\theta}_n)-\tilde{H}_n(\theta_o)||\leq\frac{c_n}{\hat{c}_n}(A_{n,1}+A_{n,2}+A_{n,3}+A_{n,4})$ a.s., where
$$A_{n,1}\equiv\frac{1}{n^2c_n}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big\{|\check{z}_{ij}-\hat{z}_{ij}|\times\mathrm{I}(|\tilde{u}_{ij}(\hat{\theta}_n)|\leq\hat{c}_n)\times||\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gij}(\hat{\theta}_n)||\Big\},$$
$$A_{n,2}\equiv\frac{1}{n^2c_n}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big\{\big|\mathrm{I}(|\tilde{u}_{ij}(\hat{\theta}_n)|\leq\hat{c}_n)-\mathrm{I}(|\tilde{u}_{ij}|\leq\hat{c}_n)\big|\times||\hat{z}_{ij}\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gij}(\hat{\theta}_n)||\Big\},$$
$$A_{n,3}\equiv\frac{1}{n^2c_n}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big\{\mathrm{I}(|\tilde{u}_{ij}|\leq\hat{c}_n)\times||\hat{z}_{ij}[\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gij}(\hat{\theta}_n)-\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)]||\Big\},\quad\text{and}$$
$$A_{n,4}\equiv\frac{1}{n^2c_n}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big\{\big|\mathrm{I}(|\tilde{u}_{ij}|\leq\hat{c}_n)-\mathrm{I}(|\tilde{u}_{ij}|\leq c_n)\big|\times||\hat{z}_{ij}\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)||\Big\}.$$
The following lemma studies the terms $A_{n,1}$, $A_{n,2}$, $A_{n,3}$, and $A_{n,4}$.

Lemma C.1.
Under Assumptions 1 to 4, $\mathrm{E}[A_{n,1}]=o(1)$, $\mathrm{E}[A_{n,2}]=o(1)$, $\mathrm{E}[A_{n,3}]=o(1)$, and $\mathrm{E}[A_{n,4}]=o(1)$.

Proof. The verification of the terms $A_{n,1}$, $A_{n,2}$, $A_{n,3}$, and $A_{n,4}$ proceeds in the following parts.

$A_{n,1}$: By the Cauchy-Schwarz (CS) inequality,
$$\mathrm{E}[A_{n,1}]=\frac{1}{n^2c_n}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\big[|\check{z}_{ij}-\hat{z}_{ij}|\times\mathrm{I}(|\tilde{u}_{ij}(\hat{\theta}_n)|\leq\hat{c}_n)\times||\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gij}(\hat{\theta}_n)||\big]\leq\frac{1}{n^2c_n}\sum_{i=1}^{n}\sum_{j=1}^{n}\big(\mathrm{E}[(\check{z}_{ij}-\hat{z}_{ij})^2]\big)^{1/2}\big(\mathrm{E}[||\tilde{x}_{gij}(\hat{\theta}_n)||^4]\big)^{1/2}.$$
From the proof of Lemma D.1(a), Assumption 1(c), Assumption 3(e), and Assumption 4, it follows that $\mathrm{E}[A_{n,1}]=O\big(1/(\sqrt{n}c_n)\big)=o(1)$.

$A_{n,2}$: For $\bar{\theta}_n$ between $\hat{\theta}_n$ and $\theta_o$, Assumption 1(b) and the MVT give $\tilde{u}_{ij}(\hat{\theta}_n)=\tilde{u}_{ij}+\tilde{x}_{gij}(\bar{\theta}_n)(\hat{\theta}_n-\theta_o)$, hence, writing $\tilde{J}_{x,\theta}\equiv\tilde{x}_{gij}(\bar{\theta}_n)(\hat{\theta}_n-\theta_o)$ for ease of notation,
$$\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[\big|\mathrm{I}(|\tilde{u}_{ij}(\hat{\theta}_n)|\leq\hat{c}_n)-\mathrm{I}(|\tilde{u}_{ij}|\leq\hat{c}_n)\big|\big]=\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[\mathrm{I}(\hat{c}_n<\tilde{u}_{ij}\leq\hat{c}_n-\tilde{J}_{x,\theta})\big]+\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[\mathrm{I}(-\hat{c}_n-\tilde{J}_{x,\theta}\leq\tilde{u}_{ij}<-\hat{c}_n)\big]\equiv\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}^{1}_{ij}]+\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}^{2}_{ij}].$$
By Assumption 3(d) and the MVT,
$$\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}^{1}_{ij}]=\max\big\{0,\,F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\hat{c}_n-\tilde{J}_{x,\theta}\,|\,\omega^{\tilde{x},\hat{z}}_{ij})-F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\hat{c}_n\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\big\}=\max\big\{0,\,-f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\hat{c}_n-\lambda_1\tilde{J}_{x,\theta}\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\tilde{J}_{x,\theta}\big\}$$
and
$$\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}^{2}_{ij}]=\max\big\{0,\,F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(-\hat{c}_n\,|\,\omega^{\tilde{x},\hat{z}}_{ij})-F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(-\hat{c}_n-\tilde{J}_{x,\theta}\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\big\}=\max\big\{0,\,f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(-\hat{c}_n-\lambda_2\tilde{J}_{x,\theta}\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\tilde{J}_{x,\theta}\big\}.$$
Since $f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\cdot)\leq C^{1/2}$ a.s. by Assumption 3(d), $|\tilde{J}_{x,\theta}|=O_p(n^{-1/2})$ by Assumption 3(e) and Theorem 2(b), and $\tilde{J}_{x,\theta}/c_n=o_p(1)$ by Assumption 4, it follows that $\big(\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}^{1}_{ij}]+\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}^{2}_{ij}]\big)/(2c_n)=o_p(1)$. From the foregoing, the LIE, the CS inequality, Jensen's inequality, and Assumption 3(e),
$$\mathrm{E}[A_{n,2}]\leq\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big\{\mathrm{E}\Big[\Big(\frac{\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}^{1}_{ij}]+\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}^{2}_{ij}]}{2c_n}\Big)^2\Big]\,\mathrm{E}\big[||\max\{|\hat{z}_{ij}|,1\}\tilde{x}_{gij}(\hat{\theta}_n)||^4\big]\Big\}^{1/2}=o(1).$$

$A_{n,3}$: First, by Assumption 3(d), eq. (C.1), and the MVT, $\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\mathrm{I}(|\tilde{u}_{ij}|\leq\hat{c}_n)]/(2c_n)=f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\lambda_3\hat{c}_n\,|\,\omega^{\tilde{x},\hat{z}}_{ij})(\hat{c}_n/c_n)$ for some $\lambda_3\in(-1,1)$. Second, for any $1\leq i,j,k,l\leq n$ and $\bar{\theta}_{1n},\bar{\theta}_{2n}$ between $\hat{\theta}_n$ and $\theta_o$, the CS inequality, the MVT, and Assumption 3(e) yield
$$\mathrm{E}\big[|\hat{z}_{ij}\hat{z}_{kl}|\times||\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gkl}(\hat{\theta}_n)-\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gkl}(\theta_o)||\big]\leq2C^{3/4}\big(\mathrm{E}[||\hat{\theta}_n-\theta_o||^2]\big)^{1/2}.\tag{C.2}$$
It then follows from the LIE, Assumption 3(d), Assumption 4, the CS inequality, eq. (C.2), and Theorem 2(b) that $\mathrm{E}[A_{n,3}]\leq2C^{3/4}\big(\mathrm{E}[||\hat{\theta}_n-\theta_o||^2]\big)^{1/2}\times O(1)=o(1)$.

$A_{n,4}$: By Assumption 3(d), Assumption 4, eq. (C.1), and the MVT, for some $\lambda_4,\lambda_5\in(0,1)$,
$$\frac{\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[\big|\mathrm{I}(|\tilde{u}_{ij}|\leq\hat{c}_n)-\mathrm{I}(|\tilde{u}_{ij}|\leq c_n)\big|\big]}{2c_n}=\max\Big\{0,\frac{F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\hat{c}_n\,|\,\omega^{\tilde{x},\hat{z}}_{ij})-F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(c_n\,|\,\omega^{\tilde{x},\hat{z}}_{ij})}{2c_n}\Big\}+\max\Big\{0,\frac{F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(-c_n\,|\,\omega^{\tilde{x},\hat{z}}_{ij})-F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(-\hat{c}_n\,|\,\omega^{\tilde{x},\hat{z}}_{ij})}{2c_n}\Big\}=o_p(1)$$
since $\hat{c}_n/c_n\xrightarrow{p}1$. By arguments analogous to the case of $A_{n,2}$, $\mathrm{E}[A_{n,4}]=o(1)$.

To complete the proof of the consistency of $\check{H}_n(\hat{\theta}_n)$, it suffices to show that $\tilde{H}_n(\theta_o)-\mathrm{E}[H_n(\theta_o)]$ converges to zero in quadratic mean. This is shown in the following lemma.

Lemma C.2.
Under Assumptions 1(a), 3(d)-3(e), and 4, $\tilde{H}_n(\theta_o)-\mathrm{E}[H_n(\theta_o)]$ converges to zero in quadratic mean.

Proof. First,
$$\big|\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\mathrm{I}(|\tilde{u}_{ij}|\leq c_n)]/(2c_n)-f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\big|=\big|f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\lambda c_n\,|\,\omega^{\tilde{x},\hat{z}}_{ij})-f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\big|\leq\lambda C^{1/2}c_n\ \text{a.s.}$$
for some $\lambda\in(0,1)$ by Assumption 3(d) and the MVT. In addition to Assumption 3(e), Assumption 4, and the CS inequality, this implies
$$||\mathrm{E}[\tilde{H}_n(\theta_o)]-\mathrm{E}[H_n(\theta_o)]||\leq\Big(\mathrm{E}\Big[\big(\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\mathrm{I}(|\tilde{u}_{ij}|\leq c_n)]/(2c_n)-f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\big)^2\Big]\Big)^{1/2}\Big(\mathrm{E}\big[||\max\{|\hat{z}_{ij}|,1\}\tilde{x}_{gij}(\theta_o)||^4\big]\Big)^{1/2}\leq\lambda Cc_n=O(c_n)=o(1).$$
Let $\tau_1$ and $\tau_2$ be two $p_{\theta}\times1$ vectors with $||\tau_1||=||\tau_2||=1$. Then
$$\mathrm{var}(\tau_1'\tilde{H}_n(\theta_o)\tau_2)=\frac{1}{n^4c_n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n}\sum_{l=1}^{n}\mathrm{cov}\Big(\mathrm{I}(|\tilde{u}_{ij}|\leq c_n)\hat{z}_{ij}\tau_1'\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)\tau_2,\ \mathrm{I}(|\tilde{u}_{kl}|\leq c_n)\hat{z}_{kl}\tau_1'\tilde{x}_{gkl}(\theta_o)'\tilde{x}_{gkl}(\theta_o)\tau_2\Big).$$
By Assumption 1(a), the covariance terms with $\{k,l\}\cap\{i,j\}=\emptyset$ vanish, so at most $O(n^3)$ terms are non-zero. Bounding each non-zero covariance by the CS inequality, $\mathrm{var}(\tau_1'\tilde{M}\tau_2)\leq\mathrm{E}[||\tilde{M}||^2]$ for a matrix-valued random variable $\tilde{M}$ (using $||\tau_1'\otimes\tau_2'||=||\tau_1||\cdot||\tau_2||$ by Bernstein (2009, Fact 9.7.27)), Jensen's inequality, and Assumptions 3(e) and 4 give $\mathrm{var}(\tau_1'\tilde{H}_n(\theta_o)\tau_2)\leq C^{1/2}/(nc_n^2)=o(1)$, and the assertion is proved.

Combining Lemmata C.1 and C.2 shows that $\mathrm{plim}_{n\rightarrow\infty}\check{H}_n(\hat{\theta}_n)=\lim_{n\rightarrow\infty}\mathrm{E}[H_n(\theta_o)]\equiv H_o$. The next part of the proof of Theorem 3 concerns the consistency of $\check{\Omega}_n(\hat{\theta}_n)$. The result is stated in the following lemma.

Lemma C.3.
Under Assumptions 1 to 3, $\mathrm{plim}_{n\rightarrow\infty}\check{\Omega}_n(\hat{\theta}_n)=\Omega_o$.

Proof. Recall
$$\check{\Omega}_n(\hat{\theta}_n)=\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\Big\{\check{z}_{ij}^2\,\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gij}(\hat{\theta}_n)+2\check{z}_{ij}\check{z}_{ik}\big(4\,\mathrm{I}(\tilde{u}_{ij}(\hat{\theta}_n)<0,\ \tilde{u}_{ik}(\hat{\theta}_n)<0)-1\big)\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gik}(\hat{\theta}_n)\Big\}.$$
Also, define
$$\Omega_n(\theta)=\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\Big\{\hat{z}_{ij}^2\,\tilde{x}_{gij}(\theta)'\tilde{x}_{gij}(\theta)+2\hat{z}_{ij}\hat{z}_{ik}\big(4\,\mathrm{I}(\tilde{u}_{ij}(\theta)<0,\ \tilde{u}_{ik}(\theta)<0)-1\big)\tilde{x}_{gij}(\theta)'\tilde{x}_{gik}(\theta)\Big\}\equiv\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\big\{A_{ij}(\theta)+2B_{ijk}(\theta)\big\},$$
where $\Omega_o\equiv\mathrm{plim}_{n\rightarrow\infty}\Omega_n(\theta_o)$. By the triangle inequality, $||\check{\Omega}_n(\hat{\theta}_n)-\Omega_n(\theta_o)||\leq||\check{\Omega}_n(\hat{\theta}_n)-\Omega_n(\hat{\theta}_n)||+||\Omega_n(\hat{\theta}_n)-\Omega_n(\theta_o)||$ a.s. Noting that $\check{z}_{ij}\check{z}_{kl}-\hat{z}_{ij}\hat{z}_{kl}=(\check{z}_{ij}-\hat{z}_{ij})\check{z}_{kl}+(\check{z}_{kl}-\hat{z}_{kl})\hat{z}_{ij}$, $|\check{z}_{ij}+\hat{z}_{ij}|\leq2\max\{|\check{z}_{ij}|,|\hat{z}_{ij}|\}$, and $|4\,\mathrm{I}(\tilde{u}_{ij}(\hat{\theta}_n)<0,\ \tilde{u}_{ik}(\hat{\theta}_n)<0)-1|\leq3$ for any $1\leq i,j,k,l\leq n$,
$$||\check{\Omega}_n(\hat{\theta}_n)-\Omega_n(\hat{\theta}_n)||\leq\frac{2}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}|\check{z}_{ij}-\hat{z}_{ij}|\times||\max\{|\check{z}_{ij}|,|\hat{z}_{ij}|\}\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gij}(\hat{\theta}_n)||+\frac{6}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}|\check{z}_{ij}-\hat{z}_{ij}|\times||\check{z}_{ik}\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gik}(\hat{\theta}_n)||+\frac{6}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}|\check{z}_{ik}-\hat{z}_{ik}|\times||\hat{z}_{ij}\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gik}(\hat{\theta}_n)||\ \text{a.s.}$$
From the proof of Lemma D.1(a), the CS inequality, Jensen's inequality, Assumption 1(c), and Assumption 3(e), each of the three terms has expectation of order $O(n^{-1/2})$, so that $\mathrm{E}[||\check{\Omega}_n(\hat{\theta}_n)-\Omega_n(\hat{\theta}_n)||]=O(n^{-1/2})=o(1)$.

To complete the proof, it suffices to show that $||\Omega_n(\hat{\theta}_n)-\Omega_n(\theta_o)||$ converges to zero in probability. Using (C.2), the CS inequality, the MVT, and Assumptions 1(b) and 3(e),
$$\mathrm{E}[||A_{ij}(\hat{\theta}_n)-A_{ij}(\theta_o)||]=\mathrm{E}\big[\hat{z}_{ij}^2\,||\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gij}(\hat{\theta}_n)-\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)||\big]\leq2C^{3/4}\big(\mathrm{E}[||\hat{\theta}_n-\theta_o||^2]\big)^{1/2}=o(1).$$
By Assumption 1(b) and the MVT,
$$\big|\mathrm{I}(\tilde{u}_{ij}(\hat{\theta}_n)<0,\ \tilde{u}_{ik}(\hat{\theta}_n)<0)-\mathrm{I}(\tilde{u}_{ij}<0,\ \tilde{u}_{ik}<0)\big|\leq\mathrm{I}\big(0\leq\tilde{u}_{ij}<-\tilde{x}_{gij}(\bar{\theta}_n)(\hat{\theta}_n-\theta_o)\big)+\mathrm{I}\big(0\leq\tilde{u}_{ik}<-\tilde{x}_{gik}(\bar{\theta}_n)(\hat{\theta}_n-\theta_o)\big)\equiv\tilde{\mathrm{I}}_{ij}+\tilde{\mathrm{I}}_{ik}.$$
For some $\lambda_6\in(0,1)$,
$$\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}_{ij}]\leq\max\big\{0,\ F_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big(-\tilde{x}_{gij}(\bar{\theta}_n)(\hat{\theta}_n-\theta_o)\,|\,\omega^{\tilde{x},\hat{z}}_{ij}\big)-F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\big\}=\max\big\{0,\ -f_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big(-\lambda_6\tilde{x}_{gij}(\bar{\theta}_n)(\hat{\theta}_n-\theta_o)\,|\,\omega^{\tilde{x},\hat{z}}_{ij}\big)\tilde{x}_{gij}(\bar{\theta}_n)(\hat{\theta}_n-\theta_o)\big\}=O_p(n^{-1/2}),$$
and $\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}_{ik}]=O_p(n^{-1/2})$ by a similar derivation, whence
$$\mathrm{E}\big[||B_{ijk}(\hat{\theta}_n)-B_{ijk}(\theta_o)||\big]\leq4\,\mathrm{E}\big[\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\tilde{\mathrm{I}}_{ij}+\tilde{\mathrm{I}}_{ik}]\cdot||\hat{z}_{ij}\tilde{x}_{gij}(\hat{\theta}_n)||\cdot||\hat{z}_{ik}\tilde{x}_{gik}(\hat{\theta}_n)||\big]+3\,\mathrm{E}\big[|\hat{z}_{ij}\hat{z}_{ik}|\times||\tilde{x}_{gij}(\hat{\theta}_n)'\tilde{x}_{gik}(\hat{\theta}_n)-\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gik}(\theta_o)||\big]\leq C^{3/4}\times O(n^{-1/2})+6C^{3/4}\big(\mathrm{E}[||\hat{\theta}_n-\theta_o||^2]\big)^{1/2}=o(1)$$
by Assumption 3(e), the CS inequality, and eq. (C.2). This completes the proof.
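For readers who wish to compute the plug-in standard errors in practice, the following is a minimal numerical sketch of the estimators analysed above, specialised to a linear disturbance $u_i(\theta)=y_i-x_i\theta$. The function names, the explicit $O(n^3)$ loop, the exact double-centring convention for the instrument distances, and the sandwich form $H^{-1}\Omega H^{-1}/n$ are illustrative assumptions based on the expressions above rather than the paper's own implementation.

```python
import numpy as np

def double_centered_dist(z):
    """Pairwise Euclidean distance matrix of z, double-centered as in the dCov literature."""
    z = np.asarray(z, dtype=float).reshape(len(z), -1)
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=2)
    return d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()

def mdep_sandwich(y, X, Z, theta_hat, c_n):
    """Plug-in covariance sketch for the linear model u_i(theta) = y_i - X_i theta.
    Returns (H_hat, Omega_hat, avar); avar uses the assumed sandwich H^-1 Omega H^-1 / n."""
    n, p = X.shape
    u = y - X @ np.asarray(theta_hat)
    du = u[:, None] - u[None, :]                 # pairwise residual differences u_ij
    xg = -(X[:, None, :] - X[None, :, :])        # Jacobian of u_ij w.r.t. theta (linear model)
    zc = double_centered_dist(Z)                 # centered instrument distances z_ij

    # Kernel-smoothed Hessian estimator with bandwidth c_n
    K = (np.abs(du) <= c_n).astype(float)
    H = np.einsum('ij,ij,ijp,ijq->pq', K, zc, xg, xg) / (n ** 2 * c_n)

    # Score-covariance estimator; the triple loop mirrors the i, j, k != i, j sum above
    Omega = np.zeros((p, p))
    neg = (du < 0).astype(float)
    for i in range(n):
        for j in range(n):
            Omega += (n - 2) * zc[i, j] ** 2 * np.outer(xg[i, j], xg[i, j])
            for k in range(n):
                if k == i or k == j:
                    continue
                w = 4.0 * neg[i, j] * neg[i, k] - 1.0
                Omega += 2.0 * zc[i, j] * zc[i, k] * w * np.outer(xg[i, j], xg[i, k])
    Omega /= n ** 3

    Hinv = np.linalg.inv(H)
    return H, Omega, Hinv @ Omega @ Hinv / n
```

Standard errors would then be the square roots of the diagonal of the returned covariance matrix; the bandwidth argument c_n can be supplied by the rule discussed in Section E.3.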
Online Appendix

D Useful Lemmata

D.1 Asymptotic Equivalence of Normalised Minimands

Lemma D.1.
Under Assumptions 1(a), 1(b), 1(d), and 3(e),
(a) $|\check{z}_{ij}-\hat{z}_{ij}|=O_p(n^{-1/2})$;
(b) $\sup_{\theta\in\Theta}|\check{Q}_n(\theta)-Q_n(\theta)|\xrightarrow{p}0$.

Proof. Part (a): For any $i,j$, $1\leq i,j\leq n$, $\mathrm{E}[\check{z}_{ij}-\hat{z}_{ij}]=0$ by the LIE. Moreover, it follows from Loève's $c_r$-inequality, Assumption 1(a), the CS inequality, and Assumption 3(e) that
$$\mathrm{E}[|\check{z}_{ij}-\hat{z}_{ij}|^2]\leq\frac{3}{n^2}\mathrm{E}\Big[\Big(\sum_{k=1}^{n}\big(||\tilde{z}_{ik}||-\mathrm{E}[\,||\tilde{z}_{ik}||\,|\,z_i]\big)\Big)^2\Big]+\frac{3}{n^2}\mathrm{E}\Big[\Big(\sum_{k=1}^{n}\big(||\tilde{z}_{kj}||-\mathrm{E}[\,||\tilde{z}_{kj}||\,|\,z_j]\big)\Big)^2\Big]+\frac{3}{n^4}\mathrm{E}\Big[\Big(\sum_{k=1}^{n}\sum_{l=1}^{n}\big(||\tilde{z}_{kl}||-\mathrm{E}[||\tilde{z}_{kl}||]\big)\Big)^2\Big]$$
$$=\frac{3}{n^2}\sum_{k=1}^{n}\mathrm{E}[\mathrm{var}(||\tilde{z}_{ik}||\,|\,z_i)]+\frac{3}{n^2}\sum_{k=1}^{n}\mathrm{E}[\mathrm{var}(||\tilde{z}_{kj}||\,|\,z_j)]+\frac{3}{n^4}\sum_{k=1}^{n}\sum_{l=1}^{n}\mathrm{var}(||\tilde{z}_{kl}||)+\frac{6}{n^4}\sum_{k=1}^{n}\sum_{l=1}^{n}\sum_{l'\neq l}\mathrm{cov}(||\tilde{z}_{kl}||,||\tilde{z}_{kl'}||)\leq\frac{3C^{1/2}}{n}+\frac{3C^{1/2}}{n}+\frac{3C^{1/2}}{n}+\frac{6C^{1/2}}{n}=\frac{15C^{1/2}}{n}.\tag{D.1}$$
This implies $|\check{z}_{ij}-\hat{z}_{ij}|=O_p(n^{-1/2})$ by the Markov inequality. (As applied here, the $c_r$-inequality is $\mathrm{E}[(\xi_1+\xi_2+\xi_3)^2]\leq3\mathrm{E}[\xi_1^2]+3\mathrm{E}[\xi_2^2]+3\mathrm{E}[\xi_3^2]$; see Davidson (1994, eqn. 9.62).)

Part (b): From (D.1) above, Jensen's inequality, Assumption 1(b), the MVT, and the CS inequality,
$$\mathrm{E}[|\check{Q}_n(\theta)-Q_n(\theta)|]\leq\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\big[|\check{z}_{ij}-\hat{z}_{ij}|\times\big||\tilde{u}_{ij}(\theta)|-|\tilde{u}_{ij}|\big|\big]\leq\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\big[|\check{z}_{ij}-\hat{z}_{ij}|\times|\tilde{u}_{ij}(\theta)-\tilde{u}_{ij}|\big]=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\big[|\check{z}_{ij}-\hat{z}_{ij}|\times|\tilde{x}_{gij}(\bar{\theta})(\theta-\theta_o)|\big]$$
$$\leq\frac{||\theta-\theta_o||}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\big(\mathrm{E}[(\check{z}_{ij}-\hat{z}_{ij})^2]\big)^{1/2}\big(\mathrm{E}[||\tilde{x}_{gij}(\bar{\theta})||^2]\big)^{1/2}\leq\frac{\sqrt{15}\,C^{3/4}\,||\theta-\theta_o||}{\sqrt{n}}$$
for any $\theta\in\Theta$. By Assumption 1(d) (compactness of $\Theta$), $\sup_{\theta\in\Theta}|\check{Q}_n(\theta)-Q_n(\theta)|=O_p(n^{-1/2})$ by the Markov inequality.
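As a computational companion to Lemma D.1, the following is a minimal sketch of the sample-centred criterion $\check{Q}_n(\theta)$ that the lemma compares with its population-centred counterpart, again specialised to a linear disturbance $u_i(\theta)=y_i-x_i\theta$. The double-centring convention, the derivative-free Nelder-Mead optimiser, and the zero starting value are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def double_centered_dist(z):
    # Same centering helper as in the covariance sketch above.
    z = np.asarray(z, dtype=float).reshape(len(z), -1)
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=2)
    return d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()

def mdep_objective(theta, y, X, zc):
    # (1/n^2) * sum_ij zc_ij * |u_i(theta) - u_j(theta)| for the linear model.
    u = y - X @ np.asarray(theta)
    return np.mean(zc * np.abs(u[:, None] - u[None, :]))

def mdep_fit(y, X, Z, theta0=None):
    # The criterion is non-smooth in theta, hence a derivative-free minimiser.
    zc = double_centered_dist(Z)
    theta0 = np.zeros(X.shape[1]) if theta0 is None else np.asarray(theta0)
    res = minimize(mdep_objective, theta0, args=(y, X, zc), method="Nelder-Mead")
    return res.x
```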
D.2 The Intercept and Jacobian

The class of models considered in the main text assumes the disturbance function is linear in the intercept (Assumption 1(a)). This subsection considers an extension to models where the disturbance function is not necessarily linear in the intercept, i.e., $y_i=G\big(g(x_i,\theta_o,\theta_{o,c})+u_i\big)$, where the intercept is absorbed into a possibly non-linear function. In this extended class, the intercept is no longer differenced out in (2.3). Moreover, a constant in $z_i$ and $z_j$ cannot drive identification of the intercept term because it is differenced out in $\check{z}_{ij}$. Whenever linearity of $u_i(\theta)$ in the intercept is not assumed, the intercept is required for identifying $\theta_o$. This problem is resolved, however, by noting that for a specified model the intercept is typically a deterministic function of $\theta$ and the data.

Let $\theta_c(\theta)$ denote the intercept as a function of any $\theta\in\Theta$ and the data. The model is specified through the disturbance function as $u_i(\theta)=G^{-1}(y_i)-g(x_i,\theta,\theta_c(\theta))$, e.g., $u_i(\theta)=y_i-\exp(\theta_c(\theta)+x_i\theta)$ for a non-linear parametric model such as the Poisson. Note that for a given sample size $n$, the intercept, as a function of $\theta$, is $\theta_c(\theta)=\log\big(\mathrm{E}_n[y_i]/\mathrm{E}_n[\exp(x_i\theta)]\big)$. Where $\theta_c(\theta)$ is available in closed form, one can directly substitute it into $u_i(\theta)$, whence, for example, $u_i(\theta)=y_i-\big(\mathrm{E}_n[y_i]/\mathrm{E}_n[\exp(x_i\theta)]\big)\exp(x_i\theta)$ for the non-linear model. Generally, $\theta_c(\theta)$ is implicitly defined by $\mathrm{E}_n[y_i-g(x_i,\theta,\theta_c(\theta))]=0$ for a sample of size $n$.

The identification of $\theta_o$ is tied to that of the intercept, which is implicitly defined by $\mathrm{E}_n[y_i-g(x_i,\theta,\theta_c(\theta))]=0$. The expected value of $\mathrm{E}_n[y_i-g(x_i,\theta,\theta_c(\theta))]$ does not exist if the expected value of $u_i$ does not exist. This implies that extending the class of models in the main text to accommodate disturbances that are non-linear in the intercept comes at the cost of additional assumptions for the intercept as a function of $\theta$ and for the Jacobian $\tilde{x}_{gij}(\theta)$ to be well-defined. The following lemma applies the implicit function theorem in the derivation of a general form of the Jacobian $x_{gi}(\theta)$ in this extended class. Fix $\theta_s\in\Theta$ and define $\theta_c^{*}$ such that $\mathrm{E}_n[y_i-g(x_i,\theta,\theta_c)]=0$ at $[\theta',\theta_c]=[\theta_s',\theta_c^{*}]$.

Lemma D.2.
In addition to Assumptions 1(b) and 1(d), suppose $\inf_{\theta_s\in\Theta}\Big|\mathrm{E}_n\Big[\frac{\partial g(x_i,\theta,\theta_c)}{\partial\theta_c}\Big|_{\theta=\theta_s,\,\theta_c=\theta_c^{*}}\Big]\Big|>C^{-1}$ a.s. for all $n$ and $\mathrm{E}[|u_i|]\leq C^{1/2}$, then the Jacobian $x_{gi}(\theta)$ exists for all $\theta\in\Theta$, $1\leq i\leq n$.

Proof. Under the assumption $\mathrm{E}[|u_i|]\leq C^{1/2}$, Assumption 1(b), and Assumption 1(d), the expectation of $\mathrm{E}_n[y_i-g(x_i,\theta,\theta_c(\theta))]=\mathrm{E}_n[u_i(\theta)]=\mathrm{E}_n[u_i+x_{gi}(\bar{\theta})(\theta-\theta_o)]$ exists, from which the intercept, as a function of $\theta$, is (implicitly) well-defined. Under the assumption of the lemma and Assumption 1(b),
$$\frac{\partial\theta_c(\theta)}{\partial\theta'}=-\Big(\mathrm{E}_n\Big[\frac{\partial g(x_i,\theta,\theta_c)}{\partial\theta_c}\Big]\Big)^{-1}\mathrm{E}_n\Big[\frac{\partial g(x_i,\theta,\theta_c)}{\partial\theta'}\Big]$$
holds by the implicit function theorem. The Jacobian is well-defined, and it is given by
$$x_{gi}(\theta)\equiv\frac{\partial u_i(\theta)}{\partial\theta'}=-\frac{\partial g(x_i,\theta,\theta_c)}{\partial\theta'}-\frac{\partial g(x_i,\theta,\theta_c)}{\partial\theta_c}\frac{\partial\theta_c(\theta)}{\partial\theta'}=-\frac{\partial g(x_i,\theta,\theta_c)}{\partial\theta'}+\frac{\partial g(x_i,\theta,\theta_c)}{\partial\theta_c}\Big(\mathrm{E}_n\Big[\frac{\partial g(x_i,\theta,\theta_c)}{\partial\theta_c}\Big]\Big)^{-1}\mathrm{E}_n\Big[\frac{\partial g(x_i,\theta,\theta_c)}{\partial\theta'}\Big].$$
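To make the closed-form case discussed above concrete, the following is a minimal sketch of the profiled intercept for the exponential example $u_i(\theta)=y_i-\exp(\theta_c(\theta)+x_i\theta)$; the function names are illustrative.

```python
import numpy as np

def intercept_exp(theta, y, X):
    # theta_c(theta) = log( E_n[y_i] / E_n[exp(x_i theta)] ), the closed form given in the text.
    return np.log(np.mean(y) / np.mean(np.exp(X @ np.asarray(theta))))

def disturbance_exp(theta, y, X):
    # u_i(theta) = y_i - exp(theta_c(theta) + x_i theta); E_n[u_i(theta)] = 0 by construction.
    theta = np.asarray(theta)
    return y - np.exp(intercept_exp(theta, y, X) + X @ theta)
```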
D.3 Covariance of the Score

Denote the score function by $S_n(\theta)\equiv\frac{\partial Q_n(\theta)}{\partial\theta'}$, and let $\mathrm{E}[\Omega_n(\theta_o)]\equiv\mathrm{var}(\sqrt{n}\,S_n(\theta_o))$. The following lemma derives $\mathrm{E}[\Omega_n(\theta_o)]$ and shows its boundedness.

Lemma D.3. Under Assumptions 1(a), 2(a), and 3(e), (a) $\mathrm{E}[S_n(\theta_o)]=\mathbf{0}$ and (b) $\mathrm{E}[\Omega_n(\theta_o)]$ is given by eq. (3.1) with $||\mathrm{E}[\Omega_n(\theta_o)]||\leq C^{1/2}$.

Proof of Lemma D.3. Part (a): Expressing $Q_n(\theta)$ as $\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\hat{z}_{ij}\big((1-2\,\mathrm{I}(\tilde{u}_{ij}(\theta)<0))\tilde{u}_{ij}(\theta)-|\tilde{u}_{ij}|\big)$, the score function is
$$S_n(\theta)=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\hat{z}_{ij}\big(1-2\,\mathrm{I}(\tilde{u}_{ij}(\theta)<0)\big)\tilde{x}_{gij}(\theta).$$
Evaluating the expectation of $S_n(\theta)$ at $\theta=\theta_o$ and applying the LIE,
$$\mathrm{E}[S_n(\theta_o)]=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\big[\hat{z}_{ij}\big(1-2\,\mathrm{I}(\tilde{u}_{ij}(\theta_o)<0)\big)\tilde{x}_{gij}(\theta_o)\big]=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\Big[\hat{z}_{ij}\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[1-2\,\mathrm{I}(\tilde{u}_{ij}<0)\big]\tilde{x}_{gij}(\theta_o)\Big]=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\Big[\hat{z}_{ij}\big(1-2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\big)\tilde{x}_{gij}(\theta_o)\Big]=\mathbf{0}.$$
The last equality follows from Assumption 2(a) and the definition of the median. The next step is to obtain $\mathrm{E}[\Omega_n(\theta_o)]\equiv\mathrm{var}(\sqrt{n}\,S_n(\theta_o))$, the covariance of $\sqrt{n}\,S_n(\theta_o)$.

Part (b):
$$\mathrm{E}[\Omega_n(\theta_o)]=n\,\mathrm{E}[S_n(\theta_o)'S_n(\theta_o)]=\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n}\sum_{l=1}^{n}\mathrm{E}\big[\hat{z}_{ij}\hat{z}_{kl}\big(1-2\,\mathrm{I}(\tilde{u}_{ij}<0)\big)\big(1-2\,\mathrm{I}(\tilde{u}_{kl}<0)\big)\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gkl}(\theta_o)\big]$$
$$=\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\mathrm{E}\Big\{\hat{z}_{ij}^2\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[(1-2\,\mathrm{I}(\tilde{u}_{ij}<0))^2\big]\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)+2\hat{z}_{ij}\hat{z}_{ik}\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[(1-2\,\mathrm{I}(\tilde{u}_{ij}<0))(1-2\,\mathrm{I}(\tilde{u}_{ik}<0))\big]\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gik}(\theta_o)\Big\},$$
since for $i\neq k,l$ and $j\neq k,l$, $\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[(1-2\,\mathrm{I}(\tilde{u}_{ij}<0))(1-2\,\mathrm{I}(\tilde{u}_{kl}<0))\big]=\big(1-2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\big)\big(1-2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{kl})\big)=0$. Simplifying terms, $(1-2\,\mathrm{I}(\tilde{u}_{ij}<0))^2=1$ and $\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[(1-2\,\mathrm{I}(\tilde{u}_{ij}<0))(1-2\,\mathrm{I}(\tilde{u}_{ik}<0))\big]=4\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[\mathrm{I}(\tilde{u}_{ij}<0,\ \tilde{u}_{ik}<0)\big]-1$. Substituting terms,
$$\mathrm{E}[\Omega_n(\theta_o)]=\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\mathrm{E}\Big\{\hat{z}_{ij}^2\,\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)+2\hat{z}_{ij}\hat{z}_{ik}\big(4\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\mathrm{I}(\tilde{u}_{ij}<0,\ \tilde{u}_{ik}<0)]-1\big)\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gik}(\theta_o)\Big\}=\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\mathrm{E}\Big\{\hat{z}_{ij}^2\,\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)+8\hat{z}_{ij}\hat{z}_{ik}\,\mathrm{cov}\big(\mathrm{I}(\tilde{u}_{ij}<0),\ \mathrm{I}(\tilde{u}_{ik}<0)\,|\,\omega^{\tilde{x},\hat{z}}_{ijk}\big)\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gik}(\theta_o)\Big\},$$
where $\omega^{\tilde{x},\hat{z}}_{ijk}\equiv\sigma(x_i,x_j,x_k,z_i,z_j,z_k)$ and $\mathrm{cov}(\xi_a,\xi_b\,|\,\xi_c)$ denotes the covariance of $\xi_a$ and $\xi_b$ conditional on $\xi_c$.
The second equality follows since
$$4\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[\mathrm{I}(\tilde{u}_{ij}<0,\ \tilde{u}_{ik}<0)\big]-1=4\,\mathrm{cov}\big(\mathrm{I}(\tilde{u}_{ij}<0),\mathrm{I}(\tilde{u}_{ik}<0)\,|\,\omega^{\tilde{x},\hat{z}}_{ijk}\big)+4\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[\mathrm{I}(\tilde{u}_{ij}<0)\big]\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big[\mathrm{I}(\tilde{u}_{ik}<0)\big]-1=4\,\mathrm{cov}\big(\mathrm{I}(\tilde{u}_{ij}<0),\mathrm{I}(\tilde{u}_{ik}<0)\,|\,\omega^{\tilde{x},\hat{z}}_{ijk}\big)+4F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ik})-1=4\,\mathrm{cov}\big(\mathrm{I}(\tilde{u}_{ij}<0),\mathrm{I}(\tilde{u}_{ik}<0)\,|\,\omega^{\tilde{x},\hat{z}}_{ijk}\big)$$
by Assumption 2(a) and the definition of the median.

For the uniform boundedness of $\mathrm{E}[\Omega_n(\theta_o)]$, note that
$$||\mathrm{E}[\Omega_n(\theta_o)]||\leq\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\Big\{\big|\big|\mathrm{E}\big[\hat{z}_{ij}^2\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)\big]\big|\big|+8\big|\big|\mathrm{E}\big[\hat{z}_{ij}\hat{z}_{ik}\,\mathrm{cov}\big(\mathrm{I}(\tilde{u}_{ij}<0),\mathrm{I}(\tilde{u}_{ik}<0)\,|\,\omega^{\tilde{x},\hat{z}}_{ijk}\big)\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gik}(\theta_o)\big]\big|\big|\Big\}$$
$$\leq\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\Big\{\mathrm{E}\big[||\hat{z}_{ij}\tilde{x}_{gij}(\theta_o)||^2\big]+8\Big(\mathrm{E}\big[\big(\mathrm{cov}(\mathrm{I}(\tilde{u}_{ij}<0),\mathrm{I}(\tilde{u}_{ik}<0)\,|\,\omega^{\tilde{x},\hat{z}}_{ijk})\big)^2\big]\,\mathrm{E}\big[||\hat{z}_{ij}\hat{z}_{ik}\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gik}(\theta_o)||^2\big]\Big)^{1/2}\Big\}$$
$$\leq\frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\Big\{\mathrm{E}\big[||\hat{z}_{ij}\tilde{x}_{gij}(\theta_o)||^2\big]+2\Big(\mathrm{E}\big[||\hat{z}_{ij}\tilde{x}_{gij}(\theta_o)||^2\big]\,\mathrm{E}\big[||\hat{z}_{ik}\tilde{x}_{gik}(\theta_o)||^2\big]\Big)^{1/2}\Big\}\leq C^{1/2}.$$
The second inequality follows from the CS inequality; the third uses Assumption 2(a) and the definition of the median, noting that
$$\big|\mathrm{cov}\big(\mathrm{I}(\tilde{u}_{ij}<0),\mathrm{I}(\tilde{u}_{ik}<0)\,|\,\omega^{\tilde{x},\hat{z}}_{ijk}\big)\big|\leq\big(F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})(1-F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij}))\big)^{1/2}\big(F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ik})(1-F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ik}))\big)^{1/2}=1/4;$$
the last inequality follows from Jensen's inequality and Assumption 3(e).

D.4 The Hessian Matrix
Lemma D.4.
Suppose Assumptions 1(b), 2(a), and 3(d)-3(e) hold, then the Hessian matrix $\mathrm{E}[H_n(\theta_o)]$ is given by eq. (3.2), and $||\mathrm{E}[H_n(\theta_o)]||\leq C^{1/2}$.

Proof of Lemma D.4. Recall from the proof of Lemma D.3 that the score function is $S_n(\theta)=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\hat{z}_{ij}\big(1-2\,\mathrm{I}(\tilde{u}_{ij}(\theta)<0)\big)\tilde{x}_{gij}(\theta)$. By the LIE and using $\tilde{u}_{ij}(\theta)=\tilde{u}_{ij}+\tilde{x}_{gij}(\bar{\theta})(\theta-\theta_o)$, which holds by the MVT and Assumption 1(b),
$$\mathrm{E}[S_n(\theta)]=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\Big[\hat{z}_{ij}\big(1-2\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\mathrm{I}(\tilde{u}_{ij}(\theta)<0)]\big)\tilde{x}_{gij}(\theta)\Big]=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\Big[\hat{z}_{ij}\big(1-2\,\mathrm{E}_{\tilde{U}|\tilde{X}_g,\hat{Z}}[\mathrm{I}(\tilde{u}_{ij}<-\tilde{x}_{gij}(\bar{\theta})(\theta-\theta_o))]\big)\tilde{x}_{gij}(\theta)\Big]=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\Big[\hat{z}_{ij}\big(1-2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(-\tilde{x}_{gij}(\bar{\theta})(\theta-\theta_o))\big)\tilde{x}_{gij}(\theta)\Big].$$
Under the assumptions of Lemma D.5, the expectation and derivative are exchangeable by the dominated convergence theorem. The expression for $\mathrm{E}[H_n(\theta)]\equiv\frac{\partial\mathrm{E}[S_n(\theta)]}{\partial\theta}$ becomes
$$\mathrm{E}[H_n(\theta)]=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big\{2\,\mathrm{E}\Big[\hat{z}_{ij}\,f_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big(-\tilde{x}_{gij}(\bar{\theta})(\theta-\theta_o)\big)\tilde{x}_{gij}(\theta)'\tilde{x}_{gij}(\bar{\theta})\Big]+\mathrm{E}\Big[\hat{z}_{ij}\big(1-2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(-\tilde{x}_{gij}(\bar{\theta})(\theta-\theta_o))\big)\frac{\partial\tilde{x}_{gij}(\theta)}{\partial\theta}\Big]\Big\}.$$
Evaluating $\mathrm{E}[H_n(\theta)]$ at $\theta=\theta_o$,
$$\mathrm{E}[H_n(\theta_o)]=\frac{2}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{E}\Big[\hat{z}_{ij}\,f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0\,|\,\omega^{\tilde{x},\hat{z}}_{ij})\,\tilde{x}_{gij}(\theta_o)'\tilde{x}_{gij}(\theta_o)\Big]$$
because $\bar{\theta}$ is trapped between $\theta$ and $\theta_o$ and $F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(0)=1/2$. Finally, $||\mathrm{E}[H_n(\theta_o)]||\leq C^{1/2}$ by Assumptions 3(d) and 3(e).

The result in the following lemma is essential to applying the dominated convergence theorem in the proof of Lemma D.4. Define
$$\eta(\theta)\equiv\Big[\hat{z}\,f_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big(-\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big)\tilde{x}_g(\theta)'\tilde{x}_g(\bar{\theta})\Big]+\Big[\hat{z}\Big(1-2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big(-\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big)\Big)\frac{\partial\tilde{x}_g(\theta)}{\partial\theta}\Big]\equiv\eta_A(\theta)+\eta_B(\theta).$$
The following lemma derives a bound on each summand of $\eta(\theta)$.

Lemma D.5. Under Assumptions 3(d) and 3(e), $\mathrm{E}\big[\sup_{\theta\in\Theta}||\eta_A(\theta)||\big]\leq C^{1/2}$ and $\mathrm{E}\big[\sup_{\theta\in\Theta}||\eta_B(\theta)||\big]\leq C^{1/2}$ for any $[x,z]$ and $[x^{\dagger},z^{\dagger}]$ defined on the support of $[x_i,z_i]$.

Proof of Lemma D.5. For any $[x,z]$ and $[x^{\dagger},z^{\dagger}]$ defined on the support of $[x_i,z_i]$ and $\theta\in\Theta$,
$$||\eta_A(\theta)||=\big|\big|\hat{z}\,f_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big(-\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big)\tilde{x}_g(\theta)'\tilde{x}_g(\bar{\theta})\big|\big|\leq f_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big(-\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big)\,\big|\big|\hat{z}\,\tilde{x}_g(\theta)'\tilde{x}_g(\bar{\theta})\big|\big|$$
and
$$||\eta_B(\theta)||=\big|1-2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}\big(-\tilde{x}_g(\bar{\theta})(\theta-\theta_o)\big)\big|\times\Big|\Big|\hat{z}\,\frac{\partial\tilde{x}_g(\theta)}{\partial\theta}\Big|\Big|\leq\Big|\Big|\hat{z}\,\frac{\partial\tilde{x}_g(\theta)}{\partial\theta}\Big|\Big|,$$
noting that $\big|1-2F_{\tilde{U}|\tilde{X}_g,\hat{Z}}(\cdot)\big|\leq1$. By the CS inequality and Assumptions 3(d) and 3(e),
$$\mathrm{E}\big[\sup_{\theta\in\Theta}||\eta_A(\theta)||\big]\leq\Big(\mathrm{E}\big[\big(\sup_{\theta\in\Theta}f_{\tilde{U}|\tilde{X}_g,\hat{Z}}(-\tilde{x}_g(\bar{\theta})(\theta-\theta_o))\big)^2\big]\Big)^{1/2}\Big(\mathrm{E}\big[\sup_{\theta\in\Theta}||\max\{|\hat{z}|,1\}\tilde{x}_g(\theta)||^4\big]\Big)^{1/2}\leq C^{1/2}\quad\text{and}\quad\mathrm{E}\big[\sup_{\theta\in\Theta}||\eta_B(\theta)||\big]\leq\Big(\mathrm{E}\big[\sup_{\theta\in\Theta}\big|\big|\hat{z}\,\tfrac{\partial\tilde{x}_g(\theta)}{\partial\theta}\big|\big|^2\big]\Big)^{1/2}\leq C^{1/2}.$$

D.5 The Stochastic Equi-continuity Condition
The following lemma verifies Assumption 3(b).
Lemma D.6.
Let $\mathcal{R}_n(\theta)=\sqrt{n}\,\big(\vartheta_n(\theta)-\mathrm{E}[\vartheta_n(\theta)]\big)$. For any sequence $\delta_n\rightarrow0$, $\sup_{||\theta-\theta_o||\leq\delta_n}\dfrac{|\mathcal{R}_n(\theta)|}{1+\sqrt{n}\,||\theta-\theta_o||}\xrightarrow{p}0$ under Assumptions 1(a)-1(b), 2, and 3(e).

Proof of Lemma D.6. By the $c_r$-inequality,
$$|\vartheta_n(\theta)-\mathrm{E}[\vartheta_n(\theta)]|^2=\big|Q_n(\theta)-S_n(\theta_o)(\theta-\theta_o)-\mathrm{E}[Q_n(\theta)]\big|^2\leq2\big|Q_n(\theta)-\mathrm{E}[Q_n(\theta)]\big|^2+2\big|S_n(\theta_o)(\theta-\theta_o)\big|^2\equiv\vartheta_{An}(\theta)+2\vartheta_{Bn}(\theta).$$
Following the expression in (A.2), which holds by Assumption 1(b) and the MVT for some $\lambda_u\in(0,1)$, and writing $\tilde{a}_{ij}(\theta)\equiv\big[\hat{z}_{ij}\big(2\,\mathrm{I}(\tilde{u}_{ij}<\lambda_u\tilde{x}_{gij}(\bar{\theta})(\theta-\theta_o))-1\big)\tilde{x}_{gij}(\bar{\theta})\big]-\mathrm{E}\big[\hat{z}_{ij}\big(2\,\mathrm{I}(\tilde{u}_{ij}<\lambda_u\tilde{x}_{gij}(\bar{\theta})(\theta-\theta_o))-1\big)\tilde{x}_{gij}(\bar{\theta})\big]$,
$$\vartheta_{An}(\theta)=2\Big|\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\tilde{a}_{ij}(\theta)(\theta-\theta_o)\Big|^2=2(\theta-\theta_o)'\Big(\frac{1}{n^4}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n}\sum_{l=1}^{n}\tilde{a}_{ij}(\theta)'\tilde{a}_{kl}(\theta)\Big)(\theta-\theta_o).$$
For $i\neq k,l$ and $j\neq k,l$, $\mathrm{E}[\tilde{a}_{ij}(\theta)'\tilde{a}_{kl}(\theta)]=\mathrm{cov}\big(\tilde{a}_{ij}(\theta),\tilde{a}_{kl}(\theta)\big)=\mathbf{0}$ by Assumption 1(a). This implies that for any $\theta\in\Theta$,
$$\mathrm{E}[\vartheta_{An}(\theta)]=2(\theta-\theta_o)'\Big(\frac{1}{n^4}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k\neq i,j}\big(\mathrm{E}[\tilde{a}_{ij}(\theta)'\tilde{a}_{ij}(\theta)]+2\,\mathrm{E}[\tilde{a}_{ij}(\theta)'\tilde{a}_{ik}(\theta)]\big)\Big)(\theta-\theta_o)\leq\frac{C^{1/2}}{n}\,||\theta-\theta_o||^2$$
by Assumption 3(e) and the CS inequality, since $||\mathrm{E}[\tilde{a}_{ij}(\theta)'\tilde{a}_{ik}(\theta)]||\leq\big(\mathrm{E}[||\tilde{a}_{ij}(\theta)||^2]\,\mathrm{E}[||\tilde{a}_{ik}(\theta)||^2]\big)^{1/2}$ is bounded, noting that $\big(2\,\mathrm{I}(\tilde{u}_{ij}<\lambda_u\tilde{x}_{gij}(\bar{\theta})(\theta-\theta_o))-1\big)^2=1$. Similarly,
$$\mathrm{E}[\vartheta_{Bn}(\theta)]=\mathrm{E}\big[|S_n(\theta_o)(\theta-\theta_o)|^2\big]=(\theta-\theta_o)'\mathrm{E}[S_n(\theta_o)'S_n(\theta_o)](\theta-\theta_o)=\frac{1}{n}(\theta-\theta_o)'\mathrm{E}[\Omega_n(\theta_o)](\theta-\theta_o)\leq\frac{1}{n}||\mathrm{E}[\Omega_n(\theta_o)]||\cdot||\theta-\theta_o||^2\leq\frac{C^{1/2}}{n}\,||\theta-\theta_o||^2$$
under the assumptions of Lemma D.3. Thus $\mathrm{E}[\vartheta_{An}(\theta)]$ and $\mathrm{E}[\vartheta_{Bn}(\theta)]$ are each bounded by $C^{1/2}||\theta-\theta_o||^2/n$, which implies $\sup_{\theta\in\Theta}\dfrac{|\vartheta_n(\theta)-\mathrm{E}[\vartheta_n(\theta)]|}{||\theta-\theta_o||}=O_p(n^{-1/2})$ by the Markov inequality. Also, by the definition of a derivative, $\vartheta_n(\theta)$ and $\mathrm{E}[\vartheta_n(\theta)]$ go to zero faster than $||\theta-\theta_o||$, i.e., there exists a constant $\tilde{C}>0$ such that $\dfrac{|\vartheta_n(\theta)-\mathrm{E}[\vartheta_n(\theta)]|}{||\theta-\theta_o||}=\dfrac{|Q_n(\theta)-\mathrm{E}[Q_n(\theta)]-S_n(\theta_o)(\theta-\theta_o)|}{||\theta-\theta_o||}\leq\tilde{C}^{1/2}|\vartheta_n(\theta)-\mathrm{E}[\vartheta_n(\theta)]|$ a.s. The conclusion follows by combining both preceding arguments, i.e.,
$$\sup_{||\theta-\theta_o||\leq\delta_n}\frac{|\mathcal{R}_n(\theta)|}{1+\sqrt{n}\,||\theta-\theta_o||}=\sup_{||\theta-\theta_o||\leq\delta_n}\frac{\sqrt{n}\,|\vartheta_n(\theta)-\mathrm{E}[\vartheta_n(\theta)]|}{1+\sqrt{n}\,||\theta-\theta_o||}\leq\sup_{||\theta-\theta_o||\leq\delta_n}\frac{|\vartheta_n(\theta)-\mathrm{E}[\vartheta_n(\theta)]|}{||\theta-\theta_o||}=O_p(n^{-1/2}).$$

E Simulations
E.1 Linear Model - Additional Simulation Results
Maximum likelihood estimation of the linear model (under normally distributed disturbances) is known to be efficient. This simulation exercise compares the MDep estimator to the maximum likelihood estimator (MLE), primarily in terms of efficiency, both under correctly and incorrectly imposed distributional assumptions on the disturbance. Also, the MDep estimator shares the "robustness" feature of quantile estimators although it estimates models estimable by IV methods. It is therefore interesting to examine the performance of the MDep vis-à-vis OLS under, for example, Cauchy(location = 0, scale = 1)-distributed disturbances, where the expectation of the disturbance does not exist. Three specifications are considered below.

(i) $y=\theta_c+\theta_1x_1+\theta_2x_2+u$, $u=\nu$, $x_2=x_2^{*}$, $z=[x_1,x_2^{*}]$, and $\nu\sim(\chi^2(1)-1)/\sqrt{2}$;
(ii) $y=\theta_c+\theta_1x_1+\theta_2x_2+u$, $u=\nu$, $x_2=x_2^{*}$, $z=[x_1,x_2^{*}]$, and $\nu\sim N(0,1)$;
(iii) $y=\theta_c+\theta_1x_1+\theta_2x_2+u$, $u=\nu$, $x_2=x_2^{*}$, $z=[x_1,x_2^{*}]$, and $\nu\sim\mathrm{Cauchy}(0,1)$.

Table E.1: Linear model simulation results (entries multiplied by 100)

                          θ1                                              θ2
n              100            500            2500           100            500            2500
Specification (i)
            MDep   MLE     MDep   MLE     MDep   MLE     MDep   MLE     MDep   MLE     MDep   MLE
MB          0.479  0.916   0.1    0.27    0.018  0.042   0.344  0.54    0.064  0.23    0.006  0.01
MAD         2.578  7.295   0.826  3.166   0.321  1.411   2.525  6.892   0.868  3.141   0.324  1.364
RMSE        5.002  10.39   1.494  4.57    0.537  2.064   5.069  10.423  1.527  4.59    0.523  2.028
Specification (ii)
            MDep   MLE     MDep   MLE     MDep   MLE     MDep   MLE     MDep   MLE     MDep   MLE
MB          0.066  0.077   0.039  -0.016  0.018  -0.004  -0.022 -0.083  -0.042 -0.008  0.012  0.024
MAD         7.188  6.972   3.363  3.202   1.471  1.411   7.348  7.248   3.178  2.953   1.425  1.437
RMSE        10.956 10.51   4.961  4.729   2.145  2.048   10.973 10.479  4.705  4.518   2.161  2.085
Specification (iii)
            MDep   OLS     MDep   OLS     MDep   OLS     MDep   OLS     MDep   OLS     MDep   OLS
MB          -0.698 18.758  0.003  59.2    0.008  51.105  0.278  27.691  -0.22  -71.169 -0.19  38.73
MAD         13.333 84.066  5.773  87.888  2.517  82.152  13.728 80.794  5.901  81.822  2.539  79.623
RMSE        21.078 3455.6  8.734  1735.6  3.775  3472.6  21.348 2076.3  8.687  4211.2  3.797  4431.3

This table provides results for the MDep and MLE/OLS estimators of the linear model. Results are based on 2000 random repetitions. The mean bias (MB), median absolute deviation (MAD), and root mean squared error (RMSE) are reported for each specification.
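For completeness, a minimal sketch of the summary statistics reported in Table E.1 (and the later tables), given a vector of simulation estimates of a scalar parameter. Treating the MAD as the median absolute deviation from the true value and the scaling by 100 are assumptions based on the table notes and headers.

```python
import numpy as np

def simulation_summary(estimates, true_value):
    # Mean bias (MB), median absolute deviation from the true value (MAD),
    # and root mean squared error (RMSE), scaled by 100 as in the table headers.
    err = np.asarray(estimates, dtype=float) - true_value
    return {"MB": 100 * err.mean(),
            "MAD": 100 * np.median(np.abs(err)),
            "RMSE": 100 * np.sqrt(np.mean(err ** 2))}
```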
Table E.1 presents results for this short simulation exercise. When the normality assumption is not satisfied (specification (i)), one observes a better performance of the MDep. Under normality of the disturbance (specification (ii)), one observes a better performance of the MLE. In terms of MAD and RMSE, the relative gain of the MLE over the MDep in specification (ii) is smaller than the MDep's relative gain over the MLE in specification (i). These results suggest the MDep performs better when there are deviations from normality of the disturbance. A surprising result is that the MDep under the skewed χ²-distributed disturbance (specification (i)) appears to be more efficient than the MDep under normality (specification (ii)).

Specification (iii) presents an interesting case where the expected value of the disturbance does not exist. From Table E.1, it can be observed that the MDep is "robust" to the non-existence of the expected value of the disturbance. As expected, the OLS in this case fails; the estimator is not consistent for the slope parameters that generated the outcome. The MDep, on the other hand, performs reasonably well; its bias shrinks towards zero with the sample size.

E.2 Non-linear model
While Section 4 focusses on the linear model, this subsection examines the performance of the MDep estimator vis-à-vis conventional IV methods for non-linear parametric models such as the Poisson quasi-maximum likelihood (QMLE), GMM, and control function (CF) methods. To conserve space, only the non-linear parametric model with the exponential function is considered. Specifications in the extended class of models (see Section D.2) are considered. $\nu\sim\mathrm{Uniform}[-\sqrt{3},\sqrt{3}]$ in specifications (i)-(v), and $\nu$ follows a Cauchy distribution with unit scale in specification (vii). Each column of the matrix $x=[x_1,x_2^{*}]$ is standard normal $N(0,1)$ with $\mathrm{cov}(x_1,x_2^{*})=0.25$, truncated to the interval $[-\sqrt{3},\sqrt{3}]$ in specifications (i)-(iii) and unrestricted in specifications (iv) to (vii). $\theta\equiv[\theta_1,\theta_2]=[1,-1]$ in all specifications considered. The intercept $\theta_c$ in specifications (i)-(iii) is set to ensure $y$ is non-negative. $z$ is the matrix of instrumental variables. The operator $\tilde{T}(\dot{x},a)=\frac{a}{\max\{\dot{x}\}-\min\{\dot{x}\}}\big(2\dot{x}-(\max\{\dot{x}\}+\min\{\dot{x}\})\big)$ restricts the support of $\dot{x}$ to the interval $[-a,a]$.

(i) $y=\exp(\theta_c+\theta_1x_1+\theta_2x_2)+u$, $x_2=x_2^{*}$, $u=\sqrt{D|x_1|+(1-D)|x_2|}\,\nu$, $D\sim\mathrm{Ber}(0.5)$, and $z=[x_1,x_2^{*}]$.
(ii) $y=\exp(\theta_c+\theta_1x_1+\theta_2x_2)+u$, $u=\nu$, $x_1=\tilde{T}(\dot{x}_1,\sqrt{3})$, $x_2=x_2^{*}+\nu/\sqrt{2}$, and $z=[x_1,x_2^{*}]$.
(iii) $y=\exp(\theta_c+\theta_1x_1+\theta_2x_2)+u$, $x_1=\tilde{T}(\dot{x}_1,\sqrt{3})$, $x_2=3x_2^{*}/\sqrt{n}+(1+x_2^{*})u/\sqrt{2}$, $u=\tilde{T}(\dot{u},\sqrt{3})$, $\dot{u}=(1+D|x_1|+(1-D)|x_2|)\nu$, $D\sim\mathrm{Ber}(0.5)$, and $z=[x_1,x_2^{*}]$.
(iv) $y=\max\{0,\dot{y}\}$, $\dot{y}=\exp(\theta_c+\theta_1x_1+\theta_2x_2)+u$, $x_2=(x_2^{*}+u)/\sqrt{2}$, $z=[z_1,z_2/\sigma_{z_2},z_3/\sigma_{z_3}]$, $z_1=x_1$, $z_2=|x_2^{*}|+(x_2^{*})^2+w_2$, $z_3=|x_2^{*}|+(x_2^{*})^3+w_3$, $w_2\sim N(0,1)$, $w_3\sim N(0,1)$.
(v) $y=\max\{0,\dot{y}\}$, $\dot{y}=\exp(\theta_c+\theta_1x_1+\theta_2x_2)+u$, $x_2=\mathrm{I}(x_2^{*}+u\leq0)$, $z=[z_1,z_2/\sigma_{z_2},z_3/\sigma_{z_3}]$, $z_1=x_1$, $z_2=\mathrm{I}(-\sqrt{2/\pi}+|x_2^{*}|+6x_2^{*}/\sqrt{n}+w_2\leq0)$, $z_3=\mathrm{I}(-1+(x_2^{*})^2+6x_2^{*}/\sqrt{n}+w_3\leq0)$, $w_2\sim N(0,1)$, $w_3\sim N(0,1)$.
(vi) $y=\mathrm{Pois}\big(\exp(\theta_c+\theta_1x_1+\theta_2x_2)\big)$, $x_2=x_2^{*}$, and $z=[x_1,x_2^{*}]$.
(vii) $y=\theta_c+\exp(\theta_1x_1+\theta_2x_2)+u$, and $u=\nu$.

Instruments in specifications (iii)-(v) are set to ensure a median first-stage non-homoskedasticity-robust F-statistic less than 10 and a median pdCov test p-value less than 0.05.

Table E.2: Non-linear model simulation results (entries multiplied by 100)

                          θ1                                              θ2
n              100            500            2500           100            500            2500
Specification (i)
            MDep   QMLE    MDep   QMLE    MDep   QMLE    MDep   QMLE    MDep   QMLE    MDep   QMLE
MB          0.003  0.004   0      0.002   0      0       -0.002 -0.001  0.002  0.001   0.001  0.001
MAD         0.071  0.08    0.03   0.037   0.014  0.016   0.068  0.078   0.032  0.037   0.014  0.016
RMSE        0.102  0.119   0.046  0.053   0.02   0.023   0.107  0.125   0.045  0.053   0.021  0.024
Specification (ii)
            MDep   GMM     MDep   GMM     MDep   GMM     MDep   GMM     MDep   GMM     MDep   GMM
MB          -0.007 -0.871  -0.002 -1.234  -0.001 -1.338  0.043  0.848   0.01   1.163   0.002  1.249
MAD         0.115  0.508   0.054  0.789   0.026  1.157   0.116  0.562   0.051  0.791   0.023  1.082
RMSE        0.204  2.176   0.086  2.024   0.037  1.745   0.211  2.13    0.083  1.913   0.036  1.638
Specification (iii)
            MDep   CF      MDep   CF      MDep   CF      MDep   CF      MDep   CF      MDep   CF
MB          0.007  0.035   -0.003 0.012   -0.007 0.006   0.129  0.198   0.034  1.347   0.032  2.158
MAD         0.109  0.141   0.071  0.076   0.04   0.04    0.106  0.624   0.048  0.94    0.032  1.546
RMSE        0.181  0.278   0.104  0.171   0.059  0.106   0.307  2.234   0.074  10.391  0.045  10.464
Specification (iv)
            MDep   CF      MDep   CF      MDep   CF      MDep   CF      MDep   CF      MDep   CF
MB          -0.022 -0.064  -0.007 -0.088  -0.003 -0.014  0.126  0.339   0.032  0.401   0.01   -0.03
MAD         0.074  0.222   0.028  0.306   0.012  0.732   0.111  0.849   0.035  1.683   0.015  4.041
RMSE        0.135  0.793   0.048  0.895   0.02   2.371   0.22   3.05    0.079  4.824   0.032  14.004
Specification (v)
            MDep   CF      MDep   CF      MDep   CF      MDep   CF      MDep   CF      MDep   CF
MB          -0.024 0.04    -0.008 0.015   -0.004 -0.015  -0.645 -4.407  -0.179 -8.948  -0.032 -11.864
MAD         0.153  0.216   0.062  0.105   0.028  0.081   0.562  4.445   0.186  8.744   0.06   11.914
RMSE        0.247  0.403   0.099  0.213   0.043  0.205   1.152  6.074   0.429  9.59    0.101  12.335
Specification (vi)
            MDep   MLE     MDep   MLE     MDep   MLE     MDep   MLE     MDep   MLE     MDep   MLE
MB          -0.621 0.376   -0.924 0.104   -0.828 -0.028  0.752  -0.09   0.899  -0.01   0.917  0.046
MAD         6.176  4.421   2.934  1.695   1.583  0.776   6.13   4.352   2.916  1.845   1.628  0.811
RMSE        9.318  6.763   4.397  2.739   2.342  1.141   9.343  6.644   4.54   2.752   2.351  1.196
Specification (vii)
            MDep   GMM     MDep   GMM     MDep   GMM     MDep   GMM     MDep   GMM     MDep   GMM
MB          -0.893 96.472  0.082  17.436  -0.031 -3.558  0.303  200.154 0.041  183.718 0.01   145.422
MAD         4.818  40.463  1.885  34.587  0.812  31.606  4.794  20.736  1.947  18.322  0.78   16.516
RMSE        9.932  2456.5  3.181  3929.1  1.253  5313.9  9.507  4070.9  3.213  4479.4  1.233  4305.3

This table provides results of the MDep, GMM, CF, MLE, and QMLE estimators for the non-linear model. Results are based on 1000 random repetitions. The mean bias (MB), median absolute deviation (MAD), and root mean squared error (RMSE) are reported for each specification.
The continuously updated GMM is implemented in specification (ii). The CF includes the vector of first-stage residuals and its square in the second stage.

E.3 Coverage performance of covariance estimators

Theorem 3 shows the consistency of the asymptotic covariance matrix estimator. This subsection compares the confidence intervals based on the covariance matrix estimator and the bootstrap in terms of empirical coverage probabilities. Three bootstrap confidence intervals are considered, namely, the percentile bootstrap, the empirical bootstrap, and a normal approximation using the bootstrap standard error. The Hall and Sheather (1988) bandwidth with the standard normal pdf $\phi(\cdot)$ and quantile function $\Phi^{-1}(\cdot)$, i.e., $\hat{c}_n=n^{-1/3}\big(\Phi^{-1}(1-\alpha/2)\big)^{2/3}\big(1.5\,\phi^2(0)\big)^{1/3}$, is used in the estimation of the Hessian. The following specifications are considered:

(i) $y=\theta_c+\theta_1x_1+\theta_2x_2+u$, $x_2=x_2^{*}$, and $u\sim(\chi^2(1)-1)/\sqrt{2}$;
(ii) $y=\theta_c+\theta_1x_1+\theta_2x_2+u$, $x_2=(x_2^{*}+u)/\sqrt{2}$, $z=[x_1,\,|x_2^{*}|/\sqrt{1-2/\pi}]$, and $u\sim(\chi^2(1)-1)/\sqrt{2}$;
(iii) $y=\theta_c+\theta_1x_1+\theta_2x_2+u$, $x_2=(x_2^{*}+u)/\sqrt{2}$, $z=[x_1,\,0.5x_2^{*}+(\sqrt{3}/2)w_2,\,0.5x_2^{*}+(\sqrt{3}/2)w_3]$, $w_2\sim N(0,1)$, $w_3\sim N(0,1)$, and $u\sim(\chi^2(1)-1)/\sqrt{2}$;
(iv) $y=\exp(\theta_c+\theta_1x_1+\theta_2x_2)+u$, $x_2=(x_2^{*}+u)/\sqrt{2}$, $z=[x_1,\,0.5x_2^{*}+(\sqrt{3}/2)w_2,\,0.5x_2^{*}+(\sqrt{3}/2)w_3]$, $w_2\sim N(0,1)$, $w_3\sim N(0,1)$, and $u\sim\mathrm{Uniform}[-\sqrt{3},\sqrt{3}]$;
(v) $y=\exp(\theta_c+\theta_1x_1+\theta_2x_2)+u$, $x_2=x_2^{*}$, and $u\sim\mathrm{Uniform}[-\sqrt{3},\sqrt{3}]$.
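A minimal sketch of the Hall and Sheather (1988) bandwidth in the form used above; at the median, $\Phi^{-1}(1/2)=0$, so the sparsity factor reduces to $(1.5\,\phi^2(0))^{1/3}$. The default $\alpha=0.05$ is an assumption.

```python
from scipy.stats import norm

def hall_sheather_bandwidth(n, alpha=0.05):
    # c_n = n^(-1/3) * (Phi^{-1}(1 - alpha/2))^(2/3) * (1.5 * phi(0)^2)^(1/3)
    z = norm.ppf(1 - alpha / 2)
    return n ** (-1 / 3) * z ** (2 / 3) * (1.5 * norm.pdf(0) ** 2) ** (1 / 3)
```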
E.4 Partial dCov test of instrument relevance

A standard F-test of IV relevance (see, for example, Stock and Yogo (2005) and Kleibergen and Paap (2006)) verifies a sufficient (but not necessary) relevance condition for Assumption 2(b). The strength of an instrument lies in its identifying variation independently of other instruments and covariates.

Table E.3: 95% Empirical Coverage Probabilities

                             θ_1                                        θ_2
   n       Asymp.   B-%tile   B-Emp.   B-Asymp.       Asymp.   B-%tile   B-Emp.   B-Asymp.
Specification (i)
  100       96.2      94.0     98.8      98.4          97.8      96.4     99.4      99.4
  500       97.8      95.0     99.4      98.8          97.8      95.0     99.6      98.8
Specification (ii)
  100       95.01     93.80    96.80     98.00         93.19     74.60    87.60     90.60
  500       95.07     91.00    96.80     97.40         97.68     80.6     90.60     93.20
Specification (iii)
  100       93.00     96.00    98.00     98.60         87.53     89.60    96.00     97.20
  500       95.8      94.8     99.0      99.0          95.8      93.0     99.4      99.2
Specification (iv)
  100       87.26     96.60    97.80     99.20         89.16     87.80    96.40     97.20
  500       86.72     95.80    98.20     99.20         89.47     94.00    97.60     98.00
Specification (v)
  100       82.41     95.60    97.40     98.20         80.41     95.40    97.60     98.60
  500       79.40     96.80    98.80     98.40         77.60     96.40    97.4      97.40
Notes: This table presents coverage probabilities of confidence intervals constructed from the asymptotic covariance matrix (Asymp.) and the non-parametric bootstrap. Bootstrap confidence intervals considered are the percentile (B-%tile), empirical (B-Emp.), and normal approximation (B-Asymp.) intervals.

For this reason, Székely, Rizzo, et al.'s (2014) partial distance covariance between x_2 and z_2, netting out dependence on x_1, is considered. Partition x and z as x = [x_1, x_2] and z = [x_1, z_2], where z_2 may have more than one column. For a random variable ξ of arbitrary dimension, let M_ξ be the n × n matrix whose (i, j)'th entry is

||ξ_i − ξ_j|| − (n − 2)^{-1} Σ_{l=1}^{n} (||ξ_i − ξ_l|| + ||ξ_l − ξ_j||) + ((n − 1)(n − 2))^{-1} Σ_{k=1}^{n} Σ_{l=1}^{n} ||ξ_k − ξ_l||

if j ≠ i, and 0 otherwise. Székely, Rizzo, et al.'s (2014) partial distance covariance (pdCov) between x_2 and z_2 after netting out dependence on x_1 is

pdC_n(x_2, z_2; x_1) = (1/(n(n − 3))) P_{x_1⊥}(x_2) · P_{x_1⊥}(z_2),

where P_{x_1⊥}(x_2) = M_{x_2} − ((M_{x_2} · M_{x_1})/(M_{x_1} · M_{x_1})) M_{x_1}, P_{x_1⊥}(z_2) = M_{z_2} − ((M_{z_2} · M_{x_1})/(M_{x_1} · M_{x_1})) M_{x_1}, and A · B ≡ Σ_{i=1}^{n} Σ_{j=1}^{n} A_{ij} B_{ij}. The null hypothesis of interest is H_0: pdC_n(x_2, z_2; x_1) = 0 against the alternative hypothesis H_1: pdC_n(x_2, z_2; x_1) > 0. Under the null, z_2 is independent of, i.e., not an MDep-relevant instrument for, x_2 after netting out dependence on x_1. One may be interested in the non-linear relevance of z_2. This is achieved by replacing x_2 with residuals from its linear projection on z and testing for partial dependence. The partial distance covariance test requires resampling techniques because the statistic is not asymptotically pivotal under the null (Székely, Rizzo, et al., 2014). The bootstrap procedure is described in Section E.6.

Monte Carlo simulations are used to examine the size and power of the pdCov test for MDep relevance. The first-stage data generating process considered is x_2 = x_1 + z + λ f_T(z) + u, where f_T(z) is a non-monotone transformation of z and λ ∈ [0, 1]. x_1 and z are each standard normally distributed with correlation 0.25, and u ∼ (χ²(1) − 1)/√2. Two non-monotone transformations are considered: f_T(z) = |z| (Figure 1a) and f_T(z) = I(|z| < −Φ^{-1}(0.25)) (Figure 1b), where Φ^{-1}(0.25) denotes the first quartile of the standard normal distribution. The size of the test is examined under λ = 0, while the power of the test is examined at two positive values of λ, the larger being λ = 1. For each setting, the performance of the pdCov test is examined at three nominal levels, viz. 1%, 5%, and 10%.

Figure 1: Size and power of the pdCov test. (a) Non-monotone transformation f_T(z) = |z|. (b) Non-monotone transformation f_T(z) = I(|z| < −Φ^{-1}(0.25)).
Notes: Figures on the left examine size control of the pdCov test of MDep relevance. Figures in the middle and on the right examine the power of the test at the smaller positive value of λ and at λ = 1, respectively. Φ^{-1}(0.25) denotes the first quartile of the standard normal distribution. The black dashed lines in the graphs on the left correspond to the respective nominal sizes.

Results for the pdCov test of non-linear dependence and overall dependence are quite similar, so results on the latter are omitted. The graphs on the left in Figures 1a and 1b demonstrate good size control across all sample sizes and nominal levels considered. Graphs in the middle and on the right of Figures 1a and 1b demonstrate the test's ability to detect the presence of non-linear/non-monotone MDep relevance. As expected, power increases in λ.
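The U-centred matrices and the pdCov statistic defined in this subsection can be computed directly, as in the R sketch below; the pdcov function in the energy package (Rizzo and Szekely, 2019) offers a packaged implementation of the same quantity. The variable names (x2, z2, x1) are illustrative, and the sketch follows the formulas above rather than the authors' code.

```r
# Minimal sketch: U-centred distance matrix M_xi and the partial distance covariance
# pdC_n(x2, z2; x1) as defined above. energy::pdcov() computes the same object.
ucenter <- function(xi) {
  d <- as.matrix(dist(xi))                    # pairwise Euclidean distances ||xi_i - xi_j||
  n <- nrow(d)
  M <- d - outer(rowSums(d), rep(1, n)) / (n - 2) -
           outer(rep(1, n), colSums(d)) / (n - 2) +
           sum(d) / ((n - 1) * (n - 2))
  diag(M) <- 0                                # the (i, i) entries are set to zero
  M
}

dotp <- function(A, B) sum(A * B) / (nrow(A) * (nrow(A) - 3))   # (A . B) / (n(n - 3))

pdcov_n <- function(x2, z2, x1) {
  Mx <- ucenter(x2); Mz <- ucenter(z2); Mw <- ucenter(x1)
  Px <- Mx - (dotp(Mx, Mw) / dotp(Mw, Mw)) * Mw   # net out dependence on x1
  Pz <- Mz - (dotp(Mz, Mw) / dotp(Mw, Mw)) * Mw
  dotp(Px, Pz)
}
```

Since the statistic is not asymptotically pivotal under the null, its p-value is obtained by the bootstrap modification of Algorithm 1 described in Section E.6.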
E.5 Specification testing

The distance covariance test proposed by Székely, Rizzo, Bakirov, et al. (2007) is a test of independence. Shao and Zhang (2014) extends the dCov to test conditional mean independence between observed random variables. Su and Zheng (2017) derives a test from Shao and Zhang's (2014) measure which applies to specification testing. Assumption 2(a) is a weak exclusion restriction. Su and Zheng's (2017) test applies to conditional mean models of the form y = g(x; θ) + u where E[y|x] = g(x; θ). The test remains applicable whether the researcher uses excluded instruments or not. The null hypothesis of interest is

H_0: P(E[y_i | x_i] = g(x_i, θ_o, θ_{o,c})) = 1

against the alternative hypothesis

H_1: P(E[y_i | x_i] = g(x_i, θ, θ_c(θ))) < 1 for all θ ∈ Θ.

Let û_i be the residual corresponding to observation i such that Σ_{i=1}^{n} û_i = 0 (for example, by setting û_i = u_i(θ̂_n) − E_n[u_i(θ̂_n)]). Su and Zheng's (2017) test statistic is given by

T_n = −(1/n) Σ_{i≠j} û_i û_j ||z_i − z_j||.

Details of the wild bootstrap used to implement the test as in Su and Zheng (2017) are described in Section E.6. It is instructive to use simulations to verify the performance of Su and Zheng's (2017) specification test in the context of the MDep estimator. The specifications (under the null of exogeneity) considered are specifications (v) in Section 4 and (i) in Section E.1. The power of the test is examined under endogeneity using x_2 = (ẋ + u)/√2, where cov(ẋ, x_2*) = 0.25 and cov(ẋ, u) = 0. Figure 2 presents the size and power of the test at nominal levels 1%, 5%, and 10%. In all specifications considered, one observes good size control and non-trivial power. In sum, Su and Zheng's (2017) specification test is reliable in assessing, among other sources of mis-specification, violations of the sufficient conditional mean independence condition in the MDep setting.
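As a point of reference, the statistic T_n can be computed as in the short R sketch below; the 1/n normalisation reflects the reconstruction of the formula above and should be checked against Su and Zheng (2017), and the object names are illustrative.

```r
# Minimal sketch of the specification-test statistic written above:
# T_n = -(1/n) * sum_{i != j} uhat_i * uhat_j * ||z_i - z_j||.
sz_stat <- function(uhat, z) {
  n <- length(uhat)
  D <- as.matrix(dist(z))                  # ||z_i - z_j||, zero on the diagonal
  -(1 / n) * sum(outer(uhat, uhat) * D)    # i = j terms vanish because D_ii = 0
}
```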
E.6 Implementing the wild bootstrap
Below is an outline of the wild-bootstrap procedure used in conducting the Su and Zheng (2017) specification test. Define the transformation T(û_i) = √(n/(n − p_θ − 1)) û_i. The bootstrap errors are sampled from Mammen's two-point distribution: ν*_i = −(√5 − 1)/2 with probability (√5 + 1)/(2√5) and ν*_i = (√5 + 1)/2 with probability (√5 − 1)/(2√5).
Figure 2: Size and power of Su and Zheng's (2017) specification test.
Notes: Figures on the left examine size control of Su and Zheng's (2017) specification test, while the figures on the right examine its power. The graphs on top correspond to specification (v) and the graphs below correspond to specification (i) in Section E.1. The black dashed lines in the graphs on the left correspond to the respective nominal sizes.

These weights satisfy E[ν*_i] = 0, E[(ν*_i)²] = 1, and E[(ν*_i)³] = 1 (see Mammen (1993)). Other distributions with the same properties can be found in Liu et al. (1988). The bootstrap procedure is specified below.

Algorithm 1.
1. (a) Regress y_i on x_i. Obtain θ̂_n and u_i(θ̂_n) = y_i − g(x_i, θ̂_n, θ_c(θ̂_n)).
   (b) Compute û_i = u_i(θ̂_n) − E_n[u_i(θ̂_n)] and the test statistic T_n.
2. For b = 1, ..., B, Do
   (a) Generate y*_{bi} = g(x_i, θ̂_n, θ_c(θ̂_n)) + E_n[u_i(θ̂_n)] + T(û_i) ν*_{bi}.
   (b) Regress y*_{bi} on x_i and obtain θ̂*_{bn}, u_i(θ̂*_{bn}) = y*_{bi} − g(x_i, θ̂*_{bn}), û*_{bi} = u_i(θ̂*_{bn}) − E_n[u_i(θ̂*_{bn})], and the test statistic T*_{bn}.
3. End Do
The p-value of the specification test is computed as B^{-1} Σ_{b=1}^{B} I(T*_{bn} > T_n). Testing MDep-relevance using Székely, Rizzo, et al.'s (2014) partial distance covariance test requires modifying Algorithm 1. The MDep-relevance test proceeds by replacing û_i with x̌_i = x_i − E_n[x_i], generating bootstrap samples as x̌*_{bi} = T(x̌_i) ν*_{bi}, skipping the regression steps, and computing the p-value as B^{-1} Σ_{b=1}^{B} I(pdC*_{bn} > pdC_n). Conducting a pdCov test of non-linear relevance only requires replacing x̌_i with residuals from a linear projection of x on z. Computation of the dCov and partial dCov measures is done using the energy R package (Rizzo and Szekely, 2019).
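A compact R sketch of Algorithm 1 is given below. It reuses the sz_stat helper from Section E.5, draws the Mammen two-point weights defined above, and relies on a hypothetical fit_model(y, x) routine standing in for whatever estimator produces θ̂_n and the fitted values g(x_i, θ̂_n, θ_c(θ̂_n)); the degrees-of-freedom correction in T(·) follows the reconstruction in this section, so the sketch is illustrative rather than the authors' implementation.

```r
# Minimal sketch of Algorithm 1: wild-bootstrap p-value for the specification test.
mammen <- function(n) {
  # Mammen's two-point weights: E[v] = 0, E[v^2] = 1, E[v^3] = 1.
  p <- (sqrt(5) + 1) / (2 * sqrt(5))
  sample(c(-(sqrt(5) - 1) / 2, (sqrt(5) + 1) / 2), n, replace = TRUE, prob = c(p, 1 - p))
}

wild_boot_pvalue <- function(y, x, z, fit_model, B = 499,
                             p_theta = ncol(as.matrix(x))) {
  n    <- length(y)
  fit  <- fit_model(y, x)                      # step 1(a): theta_hat and fitted values
  u    <- y - fit$fitted                       # u_i(theta_hat)
  uhat <- u - mean(u)                          # step 1(b): centred residuals
  Tn   <- sz_stat(uhat, z)                     # test statistic T_n
  Tu   <- sqrt(n / (n - p_theta - 1)) * uhat   # transformation T(uhat_i)
  Tstar <- replicate(B, {                      # step 2: bootstrap repetitions
    yb <- fit$fitted + mean(u) + Tu * mammen(n)
    fb <- fit_model(yb, x)
    ub <- yb - fb$fitted
    sz_stat(ub - mean(ub), z)
  })
  mean(Tstar > Tn)                             # step 3: bootstrap p-value
}
```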
References

[1] Davidson, James. Stochastic Limit Theory: An Introduction for Econometricians. Oxford: Oxford University Press, 1994.

[2] Hall, Peter and Simon J. Sheather. "On the distribution of a studentized quantile". Journal of the Royal Statistical Society: Series B (Methodological) 50.3 (1988), pp. 381–391.

[3] Kleibergen, Frank and Richard Paap. "Generalized reduced rank tests using the singular value decomposition". Journal of Econometrics 133.1 (2006), pp. 97–126.

[4] Liu, Regina Y. "Bootstrap procedures under some non-i.i.d. models". The Annals of Statistics 16.4 (1988), pp. 1696–1708.

[5] Mammen, Enno. "Bootstrap and wild bootstrap for high dimensional linear models". The Annals of Statistics 21.1 (1993), pp. 255–285.

[6] Rizzo, Maria and Gabor Szekely. energy: E-Statistics: Multivariate Inference via the Energy of Data. R package version 1.7-7. 2019. URL: https://CRAN.R-project.org/package=energy.

[7] Shao, Xiaofeng and Jingsi Zhang. "Martingale difference correlation and its use in high-dimensional variable screening". Journal of the American Statistical Association 109.507 (2014), pp. 1302–1318.

[8] Stock, James H. and Motohiro Yogo. "Testing for weak instruments in linear IV regression". In: Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg. Ed. by Donald W. K. Andrews and James H. Stock. New York: Cambridge University Press, 2005, pp. 80–108.

[9] Su, Liangjun and Xin Zheng. "A martingale-difference-divergence-based test for specification". Economics Letters 156 (2017), pp. 162–167.

[10] Székely, Gábor J., Maria L. Rizzo, et al. "Partial distance correlation with methods for dissimilarities". The Annals of Statistics 42.6 (2014), pp. 2382–2412.