Testing for Quantile Sample Selection∗

Valentina Corradi† (Surrey University)   Daniel Gutknecht‡ (Goethe University Frankfurt)

April 10, 2020
Abstract
This paper provides a testing approach for detecting sample selection in nonparametric conditional quantile functions. Our testing strategy consists of a two-step procedure: the first test is an omitted predictor test with the propensity score as the omitted variable. As with any omnibus test, in the case of rejection we cannot distinguish between rejection due to genuine selection or to misspecification. Thus, since the differentiation of the two causes has implications for nonparametric (point) identification and estimation of the conditional quantile function(s), we suggest a second test to identify whether the cause for rejection at the first stage was solely due to selection or not. Using only individuals with propensity score close to one, this second test relies on an ‘identification at infinity’ argument, but accommodates cases of irregular identification. Our testing procedure does not require any parametric assumptions on the selection equation, and all our results hold uniformly across quantile ranks in a compact set. We apply our procedure to test for selection in log hourly wages using UK Family Expenditure Survey data.
Key-Words:
Nonparametric Estimation, Conditional Quantile Function, Irregular Identification, Wild Bootstrap, Specification Test.
JEL Classification:
C12, C14, C21.

∗We are grateful to Bertille Antoine, Federico Bugni, Xavier D’Hautefoeuille, Giovanni Mellace, Toru Kitagawa, Peter C.B. Phillips, and Youngki Shin for useful discussions and comments. Moreover, we would like to thank seminar participants at the ISNPS meeting (Salerno, 2018), the ICEEE meeting (Lecce, 2019), the CIREQ Montreal Econometrics Conference (Montreal, 2019), the Warwick Econometrics Workshop (Warwick, 2019), and the Southampton Workshop in Econometrics and Statistics (Southampton, 2019) for helpful comments.

†Department of Economics, University of Surrey, School of Economics, Guildford GU2 7XH, UK. Email:
[email protected]

‡Corresponding Author: Department of Economics and Business, Goethe University Frankfurt, Theodor-W.-Adorno-Platz 4, 60629 Frankfurt, Germany. Email:
[email protected]

Introduction
Empirical studies using non-experimental data are often plagued by the presence of non-random sample selection: individuals typically self-select into employment, training programs etc. on the basis of characteristics which are believed to be non-random and unobservable to the researcher(s) (Gronau, 1974; Heckman, 1974). In fact, it is well known that ignoring selection in conditional mean models induces a bias in the estimation, which can be additive (see e.g. Heckman, 1979; Das et al., 2003) or multiplicative (Jochmans, 2015) depending on the functional form of the model. In both cases, one can deal with the selection bias by adopting a control function approach. On the other hand, until recently very little was known about the identification and estimation of conditional quantile functions in the presence of endogenous selection; see the recent survey by Arellano and Bonhomme (2017b). A notable exception is the case of sample selection in location shift models, where it induces a parallel shift in the quantile function, and hence can be taken into account by simply correcting for the selection bias as in the mean case. In all other cases, including linear quantile regression models, however, the presence of endogenous selection causes a rotation of the quantile function (Arellano and Bonhomme, 2017a) and control function methods can no longer be applied. In fact, as shown by these authors, in the absence of parametric assumptions, conditional quantile functions are point identified ‘at infinity’ or when the joint distribution of outcome and selection error is real analytic, a condition which is difficult to verify and whose practical implementation still requires additional parametric restrictions (e.g. on the copula of the two errors). This highlights the importance of testing for sample selection when estimation of nonparametric conditional quantile functions is the goal, and marks our starting point.
More specifically, this paper provides a testing approach for sample selection in conditional quantile functions, imposing only a minimal set of functional form assumptions on both the outcome and the selection equation(s). In fact, the only additional assumption is that selection (if present) affects the outcome through the propensity score, which is the probability of being in the selected sample, a standard assumption in the selection literature (e.g., Das et al., 2003).

Our objective is then to develop a rule for deciding between sample selection and another possible confounder, namely misspecification of the nonparametric conditional quantile function, and to control the overall classification errors asymptotically. The distinction between non-random selection and misspecification in the form of omitted predictors is particularly relevant for the consistent estimation of and inference about nonparametric conditional quantile functions: selection generally leads to a loss of point identification (cf. Arellano and Bonhomme, 2017a), but consistent estimation and inference may still be carried out on a subset of observations with propensity score close to one. By contrast, omitting relevant predictors impedes consistent estimation and inference altogether, regardless of the presence of endogenous selection.

To understand the heuristics of our testing strategy, note that we have exclusively a problem of sample selection if the conditional quantile error depends on the propensity score when the latter is in the interior of the unit interval, but is independent of the propensity score when the latter is one. This is so because individuals with a propensity score equal to one are selected into the sample almost surely. By contrast, our conditional quantile function is likely to miss out on relevant predictor(s) correlated with the propensity score if it depends on the latter, regardless of whether it takes on values in the interior of the unit interval or close to one.
We formalize this heuristic argument in a decision rule, which we implement in a two-step testing procedure. In the first step, we propose a test for omitted predictors, where the omitted predictor is the (estimated) propensity score. Here, the null hypothesis is that the conditional quantile error does not depend on the propensity score, when the latter is in the interior of its support. Our test statistic resembles that of Volgushev et al. (2013). However, we establish asymptotic normality under the null hypothesis uniformly over all quantile ranks in a compact subset of (0, 1). Importantly, while the second step relies on a so-called ‘identification at infinity’ argument, meaning that the support of the propensity score has to comprise the boundary point one, our test does allow for a thin set of observations close to the boundary, thus accommodating cases of so-called irregular identification (Khan and Tamer, 2010). In fact, the rate of convergence of the second test depends both on the degree of irregularity of the marginal density of the propensity score and on the size of the set of covariate values for which identification at infinity holds. We therefore suggest a studentized version of the test statistic, which is rate adaptive and converges weakly even if numerator and denominator of the statistic diverge individually at the same rate.

To make the testing procedure operational, we establish the first order validity of wild bootstrap critical values for the first and the second test. Moreover, the decision rule is formalized, and corresponding classification errors associated with our decision are obtained. Finally, we apply our testing procedure to test for selection in log hourly wages of females and males in the UK using data from the UK Family Expenditure Survey from 1995 to 2000. The same data was recently also used by Arellano and Bonhomme (2017a) to analyze gender wage inequality in the UK.
We run our testing procedure on two different sub-periods of different economic performance, namely 1995–1997 and 1998–2000. As a preview of the results, we cannot find evidence for selection among females for the 1995–1997 period, but only for the 1998–2000 period. By contrast, while we reject the null of the first test for males with data from 1995 to 1997, our second test strongly suggests that this rejection may actually be due to misspecification of the quantile function, a feature that might have remained undetected without our testing procedure.

Finally, in supplementary material to this paper we also provide an extension of the above testing idea to nonparametric conditional mean functions, which are commonly used in practice. In fact, since we consider both tests to be important (even as ‘standalone’ tests), we derive asymptotic results not only for the first, but also for the second test. More specifically, while the first test builds on a statistic suggested by Delgado and Gonzalez-Manteiga (2001), the second test is again a localized version using only observations with propensity score close to one. Tests for conditional mean selection bias in a local average treatment effects framework have already been suggested by Black et al. (2017). These tests are based on the regression of parametric residuals from the null model on those variables which are assumed to affect selection (but not the outcome). Thus, to obtain power against misspecification in this test we require that the omitted predictor(s) are correlated with the propensity score, even when the latter is at or close to one.
We begin by outlining the data generating process. As is customary in the sample selection literature, we postulate that the continuous outcome variable of interest, $y_i$, is observed if and only if $s_i = 1$, where $s_i$ denotes a binary selection indicator. For every individual $i$, we observe covariate(s) $x_i$. (We only consider the case of continuous outcomes in this paper. For generic inference methods for conditional quantile functions with discrete outcome variables see Chernozhukov et al. (2018).) A leading example is (log) wages: $y_i$ are only observed for individuals who participate in the labor market and who are employed ($s_i = 1$), and different sub-groups (e.g., males and females) may differ in terms of their unobservable labor market attachment. Thus, conventional measures of wage gaps or wage inequality may be biased (Heckman, 1974, 1979). In addition to $x_i$, we also observe instrumental variable(s) $z_i$. Here, $z_i$ is assumed to affect the process of selection into the sample governed by $s_i$, but not $y_i$ directly, an assumption which is testable in the context of the sample selection model (Kitagawa, 2010). Note also that the variables $x_i$ and $z_i$ need not be disjoint, although our testing procedure requires some of the continuous variables in $z_i$ to be excluded from $x_i$ (cf. Assumption A.1 below).

Throughout the paper, the maintained assumptions are that (i) the instrumental variable(s) and (ii) non-random selection (if present) enter the conditional quantile function only through the propensity score $p_i \equiv \Pr(s_i = 1 \mid z_i)$, the probability of being in the selected sample for a given $z_i$. Formally, we can express this as follows:

A.Q
For all $\tau \in \mathcal{T}$,
$$\Pr\left(y_i \le q_\tau(x_i) \,\middle|\, x_i, z_i, s_i = 1\right) \overset{(i)}{=} \Pr\left(y_i \le q_\tau(x_i) \,\middle|\, x_i, p_i, s_i = 1\right) \overset{(ii)}{=} \Pr\left(y_i \le q_\tau(x_i) \,\middle|\, x_i, p_i\right) \quad (1)$$
holds almost surely, where $q_\tau(x_i)$ denotes the conditional $\tau$-quantile of $y_i$ given $x_i$ and selection $s_i = 1$, the probability limit of the conditional local quantile regression estimator defined in (15) below.

Assumption A.Q is implied by standard threshold crossing selection models where $s_i = 1\{p(z_i) > v_i\}$ and the unobservable error terms from the selection and the outcome equation are jointly independent of $x_i$ and $z_i$ (see Remark 1 below). In particular, note that we will only require that the propensity score is a smooth, but not necessarily monotonic, function of $z_i$. In fact, the conditions set out in (1) are the only ‘structure’ we impose on the way in which selection enters the conditional mean or quantile function.

Remark 1: To see that the condition in (1) is implied by the set-up of e.g. Arellano and Bonhomme (2017a), assume that there exists an unobserved outcome $y_i^*$ (e.g., market wages) given by $y_i^* = q(u_i, x_i)$ and a selection indicator (e.g., employment status) $s_i = 1\{p(z_i) > v_i\}$, where $(u_i, v_i)$ are assumed to be jointly statistically independent of $z_i$ given $x_i$. Also, assume that $y_i = y_i^*$ iff $s_i = 1$. Then, if $(u_i, v_i)$ are absolutely continuous w.r.t. Lebesgue measure, have standard uniform marginal distributions, and as $F_{y^*|x}(y_i^* \mid x_i)$ and its inverse are strictly increasing, we obtain that
$$\Pr\left(y_i^* \le q_\tau(x_i) \,\middle|\, x_i, z_i, s_i = 1\right) = \Pr\left(q(u_i, x_i) \le q_\tau(x_i) \,\middle|\, x_i, z_i, v_i < p(z_i)\right) \quad (2)$$
$$= \Pr\left(y_i \le q_\tau(x_i) \,\middle|\, x_i, p_i\right) = \Pr\left(u_i \le \tau \,\middle|\, x_i, p_i\right).$$
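For intuition, the threshold-crossing set-up of Remark 1 is easy to simulate. The following sketch uses hypothetical functional forms (a logistic location model for $q(u, x)$ and a simple mixture copula for $(u_i, v_i)$) that are illustrative choices, not the paper's specification:

```python
# Simulation sketch of the threshold-crossing selection model of Remark 1.
# All functional forms are hypothetical: q(u, x) is a logistic location
# model and (u_i, v_i) follow a mixture copula with uniform marginals.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

x = rng.random(n)                      # covariate x_i
z = rng.random(n)                      # continuous instrument z_i
p = 0.2 + 0.8 * z                      # propensity score p(z_i), smooth in z_i

u = rng.random(n)                      # outcome rank u_i ~ U(0, 1)
w = rng.random(n)
mix = rng.random(n) < 0.5
v = np.where(mix, u, w)                # v_i ~ U(0, 1), dependent on u_i

ystar = 1.0 + 2.0 * x + np.log(u / (1.0 - u))   # y*_i = q(u_i, x_i)
s = (p > v).astype(int)                # selection: s_i = 1{p(z_i) > v_i}
y = np.where(s == 1, ystar, np.nan)    # y_i observed iff s_i = 1
```

Because $v_i$ is positively dependent on $u_i$ here, the selected sample over-represents low outcome ranks, so observed conditional quantiles of $y_i$ differ from those of $y_i^*$ — exactly the situation the tests below are designed to detect.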
Note that in fact $q_\tau(x_i)$, the ‘observed’ $\tau$-quantile of $y_i$ given $x_i$ and $s_i = 1$, coincides with the $\tau$-quantile of $y_i^*$ given $x_i$ when selection is random, i.e. when $F_{y|x,s=1}(y \mid x, s = 1) = F_{y^*|x}(y \mid x)$ almost surely.

Given the existence of a valid (continuous) instrument and under Equation (1), our aim is now to develop a rule for deciding between selection and misspecification (where the latter does not necessarily rule out selection), or none of the two, and to obtain bounds on the classification error. As outlined in the introduction, this decision rule is based on the outcome of a two-step testing procedure. We first outline the two sets of hypotheses tested in the first and the second step.

In the first step, we test the hypothesis that the propensity score is not an omitted predictor, against its negation. In what follows, let $\mathcal{T} = [\underline{\tau}, \overline{\tau}]$ denote the compact set of quantile ranks to be examined, where $0 < \underline{\tau} \le \overline{\tau} < 1$. Also, we use $\mathcal{X}$ to denote a compact set in the interior of the union of the supports of covariates $\mathcal{R}_x$, and $\mathcal{P} = [\underline{p}, \overline{p}] \subset (0, 1)$ to denote a compact subset of the support of $p(z_i)$. In the first step we test $H^{(1)}_{0,q}$ versus $H^{(1)}_{A,q}$ using the subset of selected individuals for which $s_i = 1$, i.e.
$$H^{(1)}_{0,q}: \Pr\left(\Pr\left(y_i \le q_\tau(x_i) \mid x_i = x, p_i = p\right) = \tau\right) = 1 \text{ for all } \tau \in \mathcal{T},\ x \in \mathcal{X}, \text{ and } p \in \mathcal{P} \quad (3)$$
versus
$$H^{(1)}_{A,q}: \Pr\left(\Pr\left(y_i \le q_\tau(x_i) \mid x_i = x, p_i = p\right) = \tau\right) < 1 \text{ for some } \tau \in \mathcal{T},\ x \in \mathcal{X}, \text{ and } p \in \mathcal{P}. \quad (4)$$
The logic behind $H^{(1)}_{0,q}$ vs. $H^{(1)}_{A,q}$ is that, given (1), $\Pr(y_i \le q_\tau(x_i) \mid x_i, p_i) = \Pr(y_i \le q_\tau(x_i) \mid x_i, p_i, s_i = 1) = \tau$ if and only if $\Pr(y_i \le q_\tau(x_i) \mid x_i, p_i, s_i = 1) = \Pr(y_i \le q_\tau(x_i) \mid x_i, s_i = 1)$. (From here onwards, we make the conditioning on values of $x_i$ and $p_i$ explicit whenever required for clarity.) Note that the null hypothesis of no omitted predictor in (3) could also have been stated in terms of conditional distribution functions:
$$F_{y|x,p}(y \mid x_i = x, p_i = p) = F_{y|x}(y \mid x_i = x) \quad (5)$$
for all $x \in \mathcal{X}$, $p \in \mathcal{P}$, and $y \in \mathcal{Y}$ for some set $\mathcal{Y}$, a subset of the support of $y_i$. A test for the null in (5) is based on the difference between CDFs estimated using a larger and a smaller information set. While this circumvents the issue of extreme quantile estimation, such a test would suffer from a dimensionality problem (see Remark 3 in the next section). Moreover, in the context of selection, interest often lies in specific (conditional) quantiles or an interior set of quantile ranks. For instance, we might only be interested in testing for sample selection in the (log) wage distribution of males and females from lower conditional quantiles such as from the 10% to the 25% quantiles, or of individuals that earn below the (conditional) median wage etc. To carry out this type of analysis in the conditional distribution function context would require finding corresponding values, say $\underline{y}$ and $\overline{y}$, to examine all $y$ such that $\underline{y} \le y \le \overline{y}$.
These values are typically unknown and require estimating the conditional quantiles in the first place.

Under (1) and assumptions outlined in the next section, failure to reject $H^{(1)}_{0,q}$ rules out endogenous selection asymptotically, with probability approaching one. Therefore, if we fail to reject the null hypothesis, we stop the testing procedure and decide against selection. By contrast, rejection in this first test can occur either due to genuine selection or due to an omitted variable in the outcome equation which happens to be correlated with the propensity score. This is so since the omitted predictor test, as any omnibus test, does not possess directed power against specific alternatives. We therefore design a test in the second step which has directed power against misspecification. To render this argument more formal, suppose there is an omitted relevant predictor $\pi_i$, which we define as follows:

Definition 1: Let $\widetilde{q}_\tau(x_i, \pi_i)$ and $q_\tau(x_i)$ denote the probability limits of two local polynomial quantile estimators for the selected subsample of $y_i$ on $x_i$ and $\pi_i$ as well as on $x_i$ only, respectively. We say that $\pi_i$ is a relevant predictor if for some $\tau \in \mathcal{T}$ and $\pi \in \mathcal{R}_\pi$, where $\mathcal{R}_\pi$ denotes the support of $\pi_i$, $\widetilde{q}_\tau(x, \pi) \ne q_\tau(x)$ for at least all $x$ in a subset of $\mathcal{X}$ with non-zero Lebesgue measure.

Therefore, if $\pi_i$ is a relevant, omitted predictor which is correlated with $p_i$, we expect indeed that $\Pr\left(y_i \le q_\tau(x_i) \mid x_i = x, p_i = p\right) \ne \tau$ with positive probability for some $\tau \in \mathcal{T}$, $x \in \mathcal{X}$, and $p \in \mathcal{P}$.

Remark 2: Consider again the set-up of Remark 1, but suppose that the true $\tau$-conditional quantile of $y_i^*$ is given by $\widetilde{q}(\tau, x_i, \pi_i)$. Hence, $\pi_i$ is an omitted predictor, which is assumed to be correlated with $p_i$.
Given A.2 below, and letting $y_i^* = \widetilde{q}(\widetilde{u}_i, x_i, \pi_i)$, we can write
$$\Pr(y_i^* \le q_\tau(x_i) \mid x_i, z_i, s_i = 1) = \Pr(\widetilde{u}_i \le \tau \mid x_i, z_i, s_i = 1) = \Pr(\widetilde{u}_i \le \tau \mid x_i, p_i) \ne \tau$$
with positive probability, where the last inequality follows since $\widetilde{u}_i$ is a function of $\pi_i$ (and $x_i$), which is correlated with $p_i$, even in the absence of non-random selection.

Hence, we want to disentangle selection from relevant omitted predictors correlated with the propensity score, which may as well cover the case of endogenous selection. In order to impose no selection as maintained hypothesis, we require the existence of at least one value $z$ in the support of $z_i$ s.t. $p(z) = 1$. This type of condition is typically labelled ‘identification at infinity’ in the nonparametric identification literature (e.g. Chamberlain, 1986) and requires the existence of a continuous instrument exhibiting sufficient independent variation from $x_i$. Note, however, that in Section 4 we will address concerns that the marginal density of $p_i$ may not be bounded away from zero at $p_i = 1$ (so-called irregular identification), resulting in very few observations with (estimated) propensity score close to one.

Since at $p_i = 1$ every individual is selected into the sample with certainty, and so selection is not present, in a second step we test the null hypothesis that the propensity score is not an omitted predictor when $p_i = 1$. That is, we test
$$H^{(2)}_{0,q}: \Pr\left(\Pr\left(y_i \le q_\tau(x_i) \mid x_i = x, p_i = 1\right) = \tau\right) = 1 \quad (6)$$
for all $\tau \in \mathcal{T}$ and $x \in \mathcal{X}$ for which identification at infinity holds (a more precise notion will be given in Section 4), versus
$$H^{(2)}_{A,q}: \Pr\left(\Pr\left(y_i \le q_\tau(x_i) \mid x_i = x, p_i = 1\right) = \tau\right) < 1 \quad (7)$$
for some $\tau \in \mathcal{T}$ and some $x$. Thus, if selection is the sole cause for rejection of $H^{(1)}_{0,q}$, we do not expect to reject $H^{(2)}_{0,q}$ (at least asymptotically).
By contrast, if we reject $H^{(2)}_{0,q}$, we take this as an indication that misspecification was likely the, or at least a major, driver of the rejection at the first stage. We formalize these arguments in Section 5 via a Decision Rule for which we shall establish bounds on the classification error. (As detailed in Section 4, this second test requires the assumption that misspecification, if present, is not independent of the propensity score $p_i$ when the latter is one or close to one. Of course, as we discuss in Section 5, we cannot rule out selection if both misspecification and selection are present and lead to a rejection simultaneously.)
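The two-step logic just described can be summarized schematically. The function below is only a sketch of the Decision Rule formalized in Section 5; the two rejection indicators are supplied externally by the test statistics developed in the next sections:

```python
def decide(reject_test_1, reject_test_2=None):
    """Schematic two-step decision rule.

    reject_test_1: outcome of the omitted-predictor test of H^(1)_0.
    reject_test_2: outcome of the 'identification at infinity' test of
                   H^(2)_0; only consulted when the first test rejects.
    """
    if not reject_test_1:
        # Failure to reject H^(1)_0: no evidence of selection or of an
        # omitted predictor correlated with the propensity score.
        return "no selection"
    if reject_test_2:
        # Rejection at p_i close to one points to misspecification, since
        # selection cannot operate there (everyone is selected).
        return "misspecification (selection not ruled out)"
    return "selection"
```

For instance, `decide(True, False)` returns `"selection"`: the first test rejects, but no rejection occurs among individuals with propensity score close to one.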
We now introduce a statistic for testing $H^{(1)}_{0,q}$ vs. $H^{(1)}_{A,q}$, as defined in (3) and (4). For notational simplicity, from here onwards we assume that all components of $x_i$ and $z_i$ are continuous. The extension to discrete elements in both vectors (as long as both still contain continuous elements which exhibit independent variation of each other) is immediate at the cost of more complicated notation and more lengthy arguments in the proofs. Also, note that one would generally expect the convergence rate of our statistics to depend only on the number of continuous elements in $x_i$ (and $z_i$) (cf. Li and Racine, 2008).

To implement our test, we rely on a statistic very close to that of Volgushev et al. (2013). This statistic has the advantage of requiring an estimate of the conditional quantile function only under the null hypothesis, i.e. where the conditional quantile is a function of $x_i$ only. To estimate the conditional quantile function(s) at some point $x_i = x$, we use an $r$-th order local polynomial estimator, which we denote by $\widehat{q}_\tau(x)$, while its corresponding probability limit is denoted by $q^{\dagger}_\tau(x)$; both are formally defined in Appendix Equations (15) and (16). Moreover, define $\widehat{u}_\tau(x_i) \equiv y_i - \widehat{q}_\tau(x_i)$, $u_\tau(x_i) \equiv y_i - q^{\dagger}_\tau(x_i)$, and let $\underline{x} = (\underline{x}_1, \ldots, \underline{x}_{d_x})$ and $\overline{x} = (\overline{x}_1, \ldots, \overline{x}_{d_x})$, $\underline{x}, \overline{x} \in \mathcal{X}$, where $d_x$ denotes the dimension of $x_i$. The test statistic is given by:
$$Z_{q,n} = \sup_{\tau \in \mathcal{T},\, (\underline{x}, \overline{x}) \in \mathcal{X}^2,\, (\underline{p}, \overline{p}) \in \mathcal{P}^2} \left| Z_{q,n}\left(\tau, \underline{x}, \overline{x}, \underline{p}, \overline{p}\right) \right|,$$
where
$$Z_{q,n}\left(\tau, \underline{x}, \overline{x}, \underline{p}, \overline{p}\right) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} s_i \left(1\{\widehat{u}_\tau(x_i) \le 0\} - \tau\right) \Pi_{j=1}^{d_x} 1\{\underline{x}_j < x_{j,i} < \overline{x}_j\}\, 1\{\underline{p} < \widehat{p}_i < \overline{p}\}.$$
The statistic $Z_{q,n}(\tau, \underline{x}, \overline{x}, \underline{p}, \overline{p})$ differs from Volgushev et al. (2013) in two aspects. First, the omitted regressor $p_i$ is not observable and is thus replaced by a nonparametric estimator, $\widehat{p}_i$.
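For concreteness, a brute-force evaluation of $Z_{q,n}$ over finite grids (for scalar $x_i$, so $d_x = 1$) could look as follows. The grids and the precomputed indicator matrix `uhat` are illustrative choices, not the paper's implementation:

```python
# Brute-force sketch of Z_{q,n}: sup over quantile ranks and over
# rectangles in (x, p) of the normalized, centered indicator sum.
import numpy as np

def z_stat(s, x, p_hat, uhat, taus, x_grid, p_grid):
    """uhat[t, i] = 1{y_i - qhat_{tau_t}(x_i) <= 0}, precomputed from a
    local polynomial quantile fit under the null."""
    n = len(s)
    best = 0.0
    for t, tau in enumerate(taus):
        resid = s * (uhat[t] - tau)            # s_i (1{u_hat <= 0} - tau)
        for xa in x_grid:
            for xb in x_grid:
                if xb <= xa:
                    continue
                in_x = (x > xa) & (x < xb)
                for pa in p_grid:
                    for pb in p_grid:
                        if pb <= pa:
                            continue
                        in_p = (p_hat > pa) & (p_hat < pb)
                        val = abs(resid[in_x & in_p].sum()) / np.sqrt(n)
                        best = max(best, val)
    return best
```

Under $H^{(1)}_{0,q}$ the summands are centered, so the supremum stays bounded in probability; under the alternative some rectangle accumulates a drift of order $\sqrt{n}$.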
Under regularity and bandwidth conditions outlined below, we show however that the estimation error arising from $\widehat{p}_i$ is asymptotically negligible. This is a well known result for estimates affecting the statistic only through a weight function (cf. Escanciano et al., 2014). Second, and more importantly, our test statistic is constructed taking the supremum also w.r.t. $\tau$ (over $\mathcal{T}$). We therefore can test for selection not only at a given quantile, but across all interior quantile ranks. Heuristically, this is achieved via the use of a local polynomial quantile estimator for which Guerre and Sabbah (2012) established a Bahadur representation uniform over compact sets $\mathcal{X}$ and $\mathcal{T}$. In fact, in the simulations, we do not find the assumption of compactness of $\mathcal{X}$ to be of great importance in finite samples.

In the sequel, we make the following assumptions:

A.1 $(y_i, x_i', z_i', s_i) \subset \mathcal{R}_y \times \mathcal{R}_x \times \mathcal{R}_z \times \{0, 1\}$ are identically and independently distributed. Let $\mathcal{X} \equiv \mathcal{X}_1 \times \ldots \times \mathcal{X}_{d_x}$ denote a compact subset of the interior of $\mathcal{R}_x$. $z_i$ contains at least one variable which is not contained in $x_i$ and which is not $x_i$-measurable. The distributions of $x_i$ and $z_i$ have a probability density function with respect to Lebesgue measure which is strictly positive and continuously differentiable (with bounded derivatives) over the interior of their respective supports. Also, assume that the joint density function of $y_i$, $x_i$ and $p_i$ is uniformly bounded everywhere, and that $\Pr(s_i = 1 \mid x, p) = \Pr(s_i = 1 \mid p) > 0$ for all $x \in \mathcal{X}$ and $p \in \mathcal{P}$.

A.2
The distribution function $F_{y|x,s=1}(\cdot \mid \cdot, \cdot)$ of $y_i$ given $x_i$ and selection $s_i = 1$ has a continuous probability density function $f_{y|x,s=1}(y \mid x, s = 1)$ w.r.t. Lebesgue measure which is strictly positive and bounded for all $y \in \mathcal{R}_y$, $x \in \mathcal{X}$. The partial derivative(s) $\nabla_x F_{y|x,s=1}(y \mid x, s = 1)$ are continuous on $\mathcal{R}_y \times \mathcal{X}$. Moreover, there exists a positive constant $C$ such that:
$$|f_{y|x,s=1}(y \mid x, s = 1) - f_{y|x,s=1}(y' \mid x', s = 1)| \le C \|(y, x) - (y', x')\|$$
for all $(y, x), (y', x') \in \mathcal{R}_y \times \mathcal{X}$. Also assume that $q_\tau(x)$ is $(r+1)$-th times continuously differentiable on $\mathcal{X}$ for all $\tau \in \mathcal{T}$ with $r > d_x$.

A.3
There exists an estimator $\widehat{p}(z_i)$ such that $\sup_{z \in \mathcal{Z}} |\widehat{p}(z) - p(z)| = o_p(n^{-1/4})$ with $\mathcal{Z}$ a compact subset of $\mathcal{R}_z$, and that:
$$\Pr\left(\exists i : z_i \in \mathcal{R}_z \setminus \mathcal{Z},\ p(z_i) \in \mathcal{P}\right) = o(n^{-1/2}).$$

A.4
For some positive constant $C$, it holds that:
$$|F_{p|x,u_\tau,s=1}(p \mid x, 0, s = 1) - F_{p|x,u_\tau,s=1}(p' \mid x', 0, s = 1)| \le C \|(p, x) - (p', x')\|$$
for all $\tau \in \mathcal{T}$, $(p, p') \in \mathcal{P}^2$, and $(x, x') \in \mathcal{X}^2$. (Qu and Yoon (2015) recently presented a uniform (in $x_i$) Bahadur representation for the conditional (re-arranged) quantile estimator on an unbounded set $\mathcal{X}$. While this feature is certainly appealing, their representation does not hold uniformly in $\tau$. We therefore rely on a representation derived by Guerre and Sabbah (2012, see below for details), which holds uniformly on compact sets $\mathcal{X}$ and $\mathcal{T}$.)

A.5 The non-negative kernel function $K(\cdot)$ is a bounded, continuously differentiable function with uniformly bounded derivative and compact support on $[-1, 1]$, $\int K(v)\,dv = 1$ as well as $\int v K(v)\,dv = 0$.

Assumption A.1 imposes the existence of at least one continuous instrumental variable, and ensures the existence of selected observations for all values in $\mathcal{X}$ and $\mathcal{P}$. Assumptions A.2 and
A.4, on the other hand, are rather standard smoothness assumptions, while
A.3 is a high-level condition which ensures that $p(z_i)$ can be estimated at a specific rate uniformly over $\mathcal{Z}$, so that the estimation error in $\widehat{p}(z_i)$ is asymptotically negligible. In fact, in the case where $\widehat{p}(z_i)$ is a local constant kernel estimator, the use of a second order kernel imposes restrictions on the dimensionality of the number of continuous regressors, namely $d_z < 4$. Note also that a sufficient condition for the second part of
A.3 is the existence of sufficient moments. Letting ‘$\Rightarrow$’ denote weak convergence, we establish the asymptotic behavior of $Z_{q,n}$.

Theorem 1:
Let Assumptions A.1–A.5 and A.Q hold. Moreover, let $h_x$ denote a deterministic bandwidth sequence that satisfies $h_x \to 0$ as $n \to \infty$. If as $n \to \infty$, $(n h_x^{d_x}) / \log n \to \infty$ and $n h_x^{2r} \log n \to 0$, then

(i) under $H^{(1)}_{0,q}$, $Z_{q,n} \Rightarrow Z_q$, where $Z_q$ is the supremum of a zero mean Gaussian process whose covariance kernel is defined in the proof of Theorem 1;

(ii) under $H^{(1)}_{A,q}$, there exists $\varepsilon > 0$ such that $\lim_{n \to \infty} \Pr\left(Z_{q,n} > \varepsilon\right) = 1$.

The results of Theorem 1 rely on an appropriate choice of $h_x$. As common in the nonparametric testing literature, our rate conditions require undersmoothing, and thus cross-validation is not directly applicable in our setting. However, to still pick $h_x$ in a data-driven manner while ensuring minimal bias at the same time, one possibility in practice could be to choose $h_x$ on the basis of cross-validation for a local polynomial estimator of order smaller than the one assumed for the test. More specifically, if we estimate $\widehat{p}_i$ using a local constant estimator and select the bandwidth via cross-validation, we may choose $h_z$, the bandwidth of this estimator, to be of order $h_z = O(n^{-1/(4+d_z)})$. In this case, when $d_z <$
4, the bias is of order $n^{-2/(4+d_z)} = o(n^{-1/4})$, while for the standard deviation we obtain $(\sqrt{n h_z^{d_z}})^{-1} = o(n^{-1/4})$. On the contrary, if $d_z \ge$
4, we instead require a local polynomial estimator of order greater than one, as the order of the bandwidth selected by cross-validation is too large for $n h_x^{2r} \log(n) \to 0$ to hold. Taking $r = 3$ as an example, $h_x$ could be chosen by cross-validation for a local linear estimator, i.e. $h_x = O(n^{-1/(4+d_x)})$. This in turn implies that $n h_x^{2r} \log(n) \to 0$ and $n h_x^{d_x} / \log(n) \to \infty$ whenever $d_x < 2$.

Remark 3: While the above test restricts itself to $\mathcal{T}$, a compact subset of $(0, 1)$, and hence excludes extreme quantile ranks. A corresponding test statistic, which might be more suitable for extreme quantiles, could for instance be based on a weighted comparison of two empirical conditional distribution functions:
$$h^{(d_x+1)/2} \sum_{j=1}^{n} \left(\widehat{F}_{y|x,p,s=1}(y_j \mid x_j, \widehat{p}_j, s_j = 1) - \widehat{F}_{y|x,s=1}(y_j \mid x_j, s_j = 1)\right)^2 \omega(x_j, \widehat{p}_j),$$
where $\omega(x_j, \widehat{p}_j)$ is a non-negative weighting function, and $\widehat{F}$ denotes a kernel estimator of a nonparametric conditional distribution function. However, a limitation of this test is that the above statistic converges at a nonparametric rate, which depends on the dimension of the larger information set $(x_j', \widehat{p}_j)'$; see e.g. Corradi et al. (2019). This is not the case with the statistic presented here, which converges at a parametric rate.

Since the limiting distribution $Z_q$ depends on features of the data generating process, we derive a bootstrap approximation for it. In particular, we follow He and Zhu (2003), and use the bootstrap statistic:
$$Z^*_{q,n}\left(\tau, \underline{x}, \overline{x}, \underline{p}, \overline{p}\right) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} s_i \left(B_{i,\tau} - \tau\right) \Pi_{j=1}^{d_x} 1\{\underline{x}_j < x_{j,i} < \overline{x}_j\} \Big(\big(1\{\widehat{p}_i < \overline{p}\} - 1\{\widehat{p}_i < \underline{p}\}\big) \quad (8)$$
$$- \Big(\widehat{F}_{p|x,u_\tau,s=1}(\overline{p} \mid x_i, 0, s_i = 1) - \widehat{F}_{p|x,u_\tau,s=1}\big(\underline{p} \mid x_i, 0, s_i = 1\big)\Big)\Big),$$
where $B_{i,\tau} = 1\{U_i \le \tau\}$ with $U_i$ i.i.d.
$\sim U(0, 1)$, and $\widehat{F}_{p|x,u_\tau,s=1}(p \mid x_i, 0, s_i = 1)$ denotes a nonparametric kernel estimator with corresponding bandwidth sequence $h_F$ satisfying $h_F \to 0$ as $n \to \infty$ (see Equation (18) in the Appendix for a formal definition). (The restriction to interior quantile ranks seems innocuous: detecting sample selection among, say, the upper 5% or lower 5% of the (conditional) wage distribution does not appear to be of particular relevance for most applied researchers.) The bootstrap test statistic is then
$$Z^*_{q,n} = \sup_{\tau \in \mathcal{T},\, (\underline{x}, \overline{x}) \in \mathcal{X}^2,\, (\underline{p}, \overline{p}) \in \mathcal{P}^2} \left| Z^*_{q,n}\left(\tau, \underline{x}, \overline{x}, \underline{p}, \overline{p}\right) \right|.$$
Let $c^{*(1)}_{(1-\alpha),n,R}$ be the $(1-\alpha)$ percentile of the empirical distribution of $Z^{*}_{q,1,n}, \ldots, Z^{*}_{q,R,n}$, where $R$ is the number of bootstrap replications. The following theorem establishes the first order validity of inference based on the bootstrap critical values, $c^{*(1)}_{(1-\alpha),n,R}$.

Theorem 1*: Let Assumptions
A.1 - A.5 and
A.Q hold. If as $n \to \infty$, $(n h_x^{d_x}) / \log n \to \infty$, $n h_x^{2r} \log n \to 0$, $h_F \to 0$, $n h_F^{d_x+1} \to \infty$, and $R \to \infty$, then

(i) under $H^{(1)}_{0,q}$, $\lim_{n,R \to \infty} \Pr\left(Z_{q,n} \ge c^{*(1)}_{(1-\alpha),n,R}\right) = \alpha$;

(ii) under $H^{(1)}_{A,q}$, $\lim_{n,R \to \infty} \Pr\left(Z_{q,n} \ge c^{*(1)}_{(1-\alpha),n,R}\right) = 1$.

If we fail to reject the null hypothesis $H^{(1)}_{0,q}$, we can conclude that the propensity score is not an omitted predictor and thus that there is no endogenous selection. This is so because the Type II error approaches zero in probability. On the other hand, if we reject the null, this may be either due to genuine non-random selection or instead due to an omitted regressor which is correlated with the propensity score, as the first test does not have directed power against either of the alternatives. We therefore rely on an ‘identification at infinity’ argument, which in turn allows us to discriminate between both alternatives. More specifically, provided $p(z) = 1$ for some $z \in \mathcal{Z}$, under correct specification (i.e., in the absence of omitted predictors), whenever $p \to$
1, the selection bias vanishes and

$$\lim_{p \to 1} \Pr\big( y_i \le q_\tau(x_i) \mid x_i, p \big) = \tau.$$

By contrast, when $\pi_i$ is also a relevant predictor in the sense of Definition 1 of Section 2 for value(s) $x \in \mathcal{X}$ with $p(z)$ close to one, an assumption that we make explicit in condition A.8 below, then $\lim_{p\to 1} \Pr\big( y_i \le q_\tau(x_i) \mid x_i, p \big) \ne \tau$ with positive probability. This heuristic motivates the second statistic, based on observations with (estimated) propensity score close to one, for testing $H^{(2)}_{0,q}$ vs. $H^{(2)}_{A,q}$ as defined in (6) and (7).

A common concern in the context of 'identification at infinity' is so-called irregular identification (Khan and Tamer, 2010), where, although conditional quantiles are point identified, they cannot be estimated at a regular convergence rate, as the marginal density of $p_i$ may not be bounded away from zero at the evaluation point $p(z) = 1$. That is, heuristically, even if 'identification at infinity' holds and, for some value $z \in \mathcal{R}_z$, $p(z)$ can reach one, it is still possible that observations in the neighborhood of one are very sparse in practice ('thin density set'), so that convergence occurs at an irregular rate (Khan and Tamer, 2010). To address this issue, we only use observations from parts of the support where the density of $p_i$ is bounded away from zero. Formally, this is implemented by introducing a trimming sequence, converging to zero at a sufficiently slow rate, so that irregular identification is no longer a concern.
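As an illustration of this trimming device, the following sketch computes kernel weights that localize around propensity scores close to (but bounded away from) one. The rates used for $H$ and $h_p$ are hypothetical choices made only for illustration, picked so that $H \to 0$ and $H/h_p \to \infty$; they are not the paper's recommended tuning rules.

```python
import numpy as np

def trimming_weights(p_hat, C_H=1.0, C_hp=1.0):
    """Epanechnikov kernel weights localizing around p close to one.

    The trimming point is delta = 1 - H; only observations with estimated
    propensity score in (delta - h_p, delta + h_p) receive positive weight.
    The rates below are illustrative (assumptions, not from the paper):
    h_p ~ n^(-1/2) and H ~ n^(-1/4), so H -> 0 while H / h_p -> infinity.
    """
    n = len(p_hat)
    h_p = C_hp * n ** (-0.5)       # window width around the trimming point
    H = C_H * n ** (-0.25)         # trimming sequence, slower than h_p
    delta = 1.0 - H                # evaluation point bounded away from 1
    u = (p_hat - delta) / h_p
    w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
    return w, delta, h_p

# Only observations with p_hat in a shrinking window around 1 - H contribute.
rng = np.random.default_rng(0)
p_hat = rng.uniform(0.0, 1.0, size=10_000)
w, delta, h_p = trimming_weights(p_hat)
```

In a real implementation the scaling constants and rates would be chosen via the data-driven procedure discussed after Theorem 2 below.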
Thus, let $\delta = 1 - H$ with $H \to 0$ and $H/h_p \to \infty$ as $n \to \infty$, where $H$ governs the speed of the trimming sequence $\delta$, while $h_p$ defines the window width around $\delta$. Then, for some fixed set $(\underline{x}, \overline{x})$, the second test is based on the statistic

$$\frac{Z^{(2)}_{q,n}(\tau, \underline{x}, \overline{x}, 1)}{\sqrt{\widehat{\mathrm{var}}\big(Z^{(2)}_{q,n}(\tau, \underline{x}, \overline{x}, 1)\big)}} = \frac{\sum_{i=1}^{n} s_i\big(1\{\hat u_\tau(x_i) \le 0\} - \tau\big)\,\Pi_{j=1}^{d_x} 1\{\underline{x}_j < x_{j,i} < \overline{x}_j\}\, K\big(\frac{\hat p_i - \delta}{h_p}\big)}{\Big(\int K^2(v)\,dv\,\sum_{i=1}^{n} s_i\big(1\{\hat u_\tau(x_i) \le 0\} - \tau\big)^2\,\Pi_{j=1}^{d_x} 1\{\underline{x}_j < x_{j,i} < \overline{x}_j\}\, K\big(\frac{\hat p_i - \delta}{h_p}\big)\Big)^{1/2}}. \qquad (9)$$

This statistic only uses observations with (estimated) propensity score $\hat p_i \in (1 - H - h_p, 1 - H + h_p)$, and thus overcomes the issue of possible irregular identification, as long as a sufficient number of observations are assumed to exist in this set (see below). Note here that the convergence speed of $H$ is inherently pegged to the tail behavior of the density of $p_i$ in the neighborhood of $p = 1$, which is of course unknown in practice. That is, the thinner the density tail of $p_i$, the slower $H$ has to go to zero. We discuss this issue, and a potential data-driven way to select $H$ and $h_p$ in given finite samples, after Theorem 2 further below.

In what follows, let $\xi_{j,n}$, $j \in \{1, \ldots, d_x\}$, be a deterministic sequence that, for each $j$, may converge to 0 or to some $\xi_j > 0$ as $n \to \infty$. Furthermore, let $\mathcal{C}_{0,n} \equiv \otimes_{j=1}^{d_x} [x_{0,j} - \xi_{j,n},\, x_{0,j} + \xi_{j,n}]$, for some point $x_0 \in \mathcal{X}$ defined in A.6 below. Note that $\mathcal{C}_{0,n} \subseteq \otimes_{j=1}^{d_x} \big[\underline{x}_j, \overline{x}_j\big]$. Finally, define $G_{x_i}(\tau, 1 - H) \equiv \Pr(y_i \le q_\tau(x_i) \mid x_i, 1 - H)$, and note that under $H^{(2)}_{0,q}$ it holds that $\lim_{H\to 0} G_{x_i}(\tau, 1 - H) = \tau$ for every $\tau \in \mathcal{T}$ and $x_i \in \lim_{n\to\infty} \mathcal{C}_{0,n}$, and that $\lim_{H\to 0} \Pr(s_i = 1 \mid 1 - H) = 1$. We make the following additional assumptions:

A.6
Assume there exists at least one point $x_0 \in \mathcal{X}$ such that, for at least one $z \in \mathcal{R}_z$, it holds that $p(z) = 1$. Moreover, there exist strictly positive, continuous, and integrable functions $g_{y,x,p}(y, x, 1)$ and $g_{x,p}(x, 1)$ such that for all $x \in \mathcal{C}_{0,n}$ and $y \in \mathcal{R}_y$:

$$\sup_{x \in \mathcal{C}_{0,n},\, y \in \mathcal{R}_y} \left| \frac{f_{y,x,p}(y, x, 1 - H)}{g_{y,x,p}(y, x, 1)} H^{\eta} - 1 \right| \to 0 \quad \text{and} \quad \sup_{x \in \mathcal{C}_{0,n}} \left| \frac{f_{x,p}(x, 1 - H)}{g_{x,p}(x, 1)} H^{\eta} - 1 \right| \to 0$$

as $n \to \infty$, for some $0 \le \eta < 1$. Moreover, for all $x \notin \mathcal{C}_{0,n}$ and $y \in \mathcal{R}_y$, it holds that $f_{y,x,p}(y(x), x, 1 - H) = 0$ for every $n$.

A.7
The distribution function $F_{y|x,p,s=1}(\cdot \mid \cdot, \cdot, \cdot)$ of $y_i$ given $x_i$, $p_i$, and selection $s_i = 1$ has a continuous probability density function $f_{y|x,p,s=1}(y \mid x, p, s = 1)$ w.r.t. Lebesgue measure. The functions $f_{y|x,p,s=1}(y_i \mid x_i, p_i, s_i = 1)$, $f_{p|x,s=1}(p_i \mid x_i, s_i = 1)$, $\Pr(s_i = 1 \mid p_i)$, and $f_{x,p}(x_i, p_i)$ are continuously differentiable w.r.t. $p_i$ on $(0, 1)$ for all $x_i \in \mathcal{X}$ and $z_i \in \mathcal{Z}$. Moreover, assume that for every $x \in \mathcal{X}$ and $y$, $f_{y,x,p}(y, x, \cdot)$, $f_{x,p}(x, \cdot)$, and $f_p(\cdot)$ are left-continuous at $p = 1$.

A.8
The set of $x \in \mathcal{X}$ for which $\pi_i$ is a relevant predictor is a subset of $\mathcal{C}_{0,n}$ (see Assumption A.6).

A.9
Assume that for all $x \in \mathcal{X}$ and $\tau \in \mathcal{T}$, there exist positive constants $C(x)$ and $C$ such that:

$$\big|G_x(\tau, 1 - H) - G_x(\tau, 1)\big| \le C(x)\, H^{1-\eta} \quad \text{as well as} \quad \big|\Pr(s_i = 1 \mid 1 - H) - 1\big| \le C H^{1-\eta}.$$

Moreover, the partial derivatives $\sup_{y \in \mathcal{R}_y,\, x \in \mathcal{X},\, p \in (0,1)} |\nabla_p f_{y|x,p,s=1}(y \mid x, p, s = 1)|$, $\sup_{x \in \mathcal{X},\, p \in (0,1)} |\nabla_p f_{p|x,s=1}(p \mid x, s = 1)|$, $\sup_{p \in (0,1)} |\nabla_p \Pr(s_i = 1 \mid p)|$, and $\sup_{x \in \mathcal{X},\, p \in (0,1)} |\nabla_p f_{x,p}(x, p)|$ are bounded.

Assumption A.6 requires identification at infinity for at least some values of the covariates. In particular, we allow for both the case of $\xi_{j,n} = \xi_j > 0$ and the case of $\xi_{j,n} \to 0$ as $n \to \infty$. The case of $\xi_{j,n} = \xi_j > 0$ for all $j \in \{1, \ldots, d_x\}$ corresponds to the case of strong support, as we require the propensity score to approach one for all $x$'s in a set of non-zero Lebesgue measure in a compact subset of $\mathbb{R}^{d_x}$. In the case when instead $\xi_{j,n} \to 0$ for some, but not all, $j$, we require identification at infinity over a subset of non-zero Lebesgue measure in a compact subset of $\mathbb{R}^{d_x'}$, with $d_x' < d_x$. Finally, if $\xi_{j,n} \to 0$ as $n \to \infty$ for all $j$, we only search over an interval shrinking to a singleton in $\mathcal{X}$. Furthermore, in all cases we allow for so-called irregular support, in the sense that $f_{x,p}(x, 1)$ is not necessarily bounded away from zero at $p = 1$. In fact, when $\eta = 0$, $\lim_{H\to 0} f_{x,p}(x, 1 - H)$ is bounded away from zero for all $x \in \mathcal{C}_{0,n}$, while this is no longer the case when $\eta > 0$ (with larger $\eta$ representing thinner tails). That is, if $\eta > 0$, we allow for a thin set of observations with a propensity score close to one. Similarly, when $\eta = 0$, the first part of A.9 becomes a standard Lipschitz condition, while as $\eta$ gets closer to one and the tails of the densities in A.6 become thinner, we allow $G_x(\tau, 1 - H)$ and $\Pr(s_i = 1 \mid 1 - H)$ to approach $G_x(\tau, 1)$ and 1, respectively, at a slower rate. Finally, Assumption A.8 is crucial for the test to have directed power against misspecification, since it postulates that omitted predictors $\pi_i$, if present, are correlated with the event $\{p_i = 1\}$ or, more specifically, $\{p_i \in (1 - H - h_p,\, 1 - H + h_p)\}$.

The rate of convergence of the numerator in (9) depends both on $\big(\Pi_{j=1}^{d_x} \xi_{j,n}\big)$, the measure of the set $\mathcal{C}_{0,n}$, and on $H^{\eta}$, the tail behavior of the density $f_{x,p}(x, p)$ around $p = 1$, which are both of course unknown in practice. In fact, we are generally ignorant about the rate of convergence, given by $\sqrt{n h_p \big(\Pi_{j=1}^{d_x} \xi_{j,n}\big) H^{\eta}}$, which may in principle be as fast as $\sqrt{n h_p}$. To address this problem, we use a studentized statistic, which allows the convergence rate to vary depending on both the measure of the set of $x$ for which $p(z) = 1$ and on the sparsity of observations around $p$ close to one. That is, as we cannot infer the appropriate scaling factor, it is crucial that, regardless of the 'strength' of the support and the degree of thinness of the set of observations with propensity score close to one, $Z^{(2)}_{q,n}(\tau, \underline{x}, \overline{x}, 1)\big/\sqrt{\widehat{\mathrm{var}}\big(Z^{(2)}_{q,n}(\tau, \underline{x}, \overline{x}, 1)\big)}$ remains self-normalized. We treat $Z^{(2)}_{q,n}(\tau, \underline{x}, \overline{x}, 1)\big/\sqrt{\widehat{\mathrm{var}}\big(Z^{(2)}_{q,n}(\tau, \underline{x}, \overline{x}, 1)\big)}$ as an empirical process over $\tau \in \mathcal{T}$, but for fixed values of the interval extremes $\underline{x}$ and $\overline{x}$. The reason is that, for the case of $\xi_n \to 0$, the statistic does not depend on $\underline{x}, \overline{x}$, as it vanishes outside a shrinking interval of $x_0$.

Theorem 2:
Let Assumption
A.1 , A.3 , A.5 , A.6 , A.7 , A.8 , A.9 , and
A.Q hold. If as $n \to \infty$, $(n h_x^{d_x})/\log n \to \infty$, $n h_x^{2r}\log n \to 0$, $H \to 0$, $H/h_p \to \infty$, $n h_p H^{2-\eta}\big(\prod_{j=1}^{d_x} \xi_{j,n}\big) \to 0$, and $n h_p \big(\prod_{j=1}^{d_x} \xi_{j,n}\big) H^{\eta} \to \infty$, then

(i) under $H^{(2)}_{0,q}$ (under which $G_x(\tau, 1) = \tau$ almost surely),

$$\sup_{\tau \in \mathcal{T}} \left| \frac{Z^{(2)}_{q,n}(\tau, \underline{x}, \overline{x}, 1)}{\sqrt{\widehat{\mathrm{var}}\big(Z^{(2)}_{q,n}(\tau, \underline{x}, \overline{x}, 1)\big)}} \right| \Rightarrow \mathcal{Z}^{(2)}_{q},$$

where $\mathcal{Z}^{(2)}_{q}$ is the supremum of a zero mean Gaussian process with covariance kernel defined in the proof of Theorem 2;

(ii) under $H^{(2)}_{A,q}$, there exists $\varepsilon > 0$ such that

$$\lim_{n \to \infty} \Pr\left( \sup_{\tau \in \mathcal{T}} \left| \frac{Z^{(2)}_{q,n}(\tau, \underline{x}, \overline{x}, 1)}{\sqrt{\widehat{\mathrm{var}}\big(Z^{(2)}_{q,n}(\tau, \underline{x}, \overline{x}, 1)\big)}} \right| > \varepsilon \right) = 1.$$

Theorem 2 establishes the limiting distribution of the studentized statistic. As the theoretical results crucially hinge on the tuning parameters $H$ and $h_p$, whose rates depend in turn on the unknown $\xi_{j,n}$ and $\eta$, a discussion of their choice in practice is warranted. In fact, a possible data-driven choice of these parameters, without claiming optimality of a specific kind, could be as follows: as shown in the supplementary material, one may re-write $h_p$ and $H$, which is a function of $h_p$ itself, as functions of $\eta$ only, i.e. $h_p(\eta) = C n^{-(1-\eta)-\varepsilon+\varepsilon\eta}\log(n)^{-1}$ and $H(\eta) = h_p(\eta)^{1-\varepsilon}$ for some arbitrary $\varepsilon > 0$, $\eta < \overline{\eta}$, and some scaling constant $C$. Here, $\overline{\eta}$ represents the threshold value with the slowest possible convergence rate still satisfying the rate conditions of Theorem 2. Thus, in order to 'choose' the smallest possible $\eta$ in practice, which in turn corresponds to the fastest possible convergence rate, one could for instance plot

$$\frac{1}{n h_p(\eta)} \sum_{i=1}^{n} K\left( \frac{\hat p_i - (1 - H(\eta))}{h_p(\eta)} \right)$$

for a given $\varepsilon > 0$ (e.g. $\varepsilon = 0.1$) on a grid of different $\eta$ values with $\eta \in [0, \overline{\eta}]$, choosing $\hat\eta$ as the smallest value for which the estimated density is bounded away from zero, e.g. above some small minimum threshold value. In fact, if the set of $\hat p_i$ close to 1 is not 'thin', we would expect to select $\hat\eta = 0$ in large enough samples with this type of procedure.

Finally, note that an alternative test for selection against omitted relevant predictors could in principle be based on the null $q_\tau(x,$
$1) = q_\tau(x, p)$ for all $\tau \in \mathcal{T}$ and for some $x$ for which identification at infinity holds. A statistic for this null could be constructed using the weighted difference of the two corresponding estimators of $q_\tau(x, p)$ and $q_\tau(x, 1)$. Hence, when constructing the wild bootstrap statistic, we do not have to 'subtract' an estimator of the conditional distribution of $p_i$. On the other hand, as the rate of convergence depends on the 'degree' of irregular identification at $p$ close to 1 and on the set of covariates for which identification at infinity holds, we also need an appropriately studentized bootstrap statistic, i.e.

$$Z^{*(2)}_{q,n} = \sup_{\tau \in \mathcal{T}} \left| \frac{Z^{*(2)}_{q,n}(\tau, \underline{x}, \overline{x}, 1)}{\sqrt{\widehat{\mathrm{var}}^{*}\big(Z^{(2)}_{q,n}(\tau, \underline{x}, \overline{x}, 1)\big)}} \right|, \qquad (10)$$

where

$$Z^{*(2)}_{q,n}(\tau, \underline{x}, \overline{x}, 1) = \sum_{i=1}^{n} s_i (B_{i,\tau} - \tau)\,\Pi_{j=1}^{d_x} 1\{\underline{x}_j < x_{j,i} < \overline{x}_j\}\, K\left(\frac{\hat p_i - \delta}{h_p}\right)$$

with $B_{i,\tau} = 1\{U_i \le \tau\}$, $U_i$ i.i.d. $\sim U(0,1)$ and independent of the sample, and

$$\widehat{\mathrm{var}}^{*}\big(Z^{(2)}_{q,n}(\tau, \underline{x}, \overline{x}, 1)\big) = \left( \frac{1}{n}\sum_{i=1}^{n} (B_{i,\tau} - \tau)^2 \right) \int K^2(v)\,dv \sum_{i=1}^{n} s_i\,\Pi_{j=1}^{d_x} 1\{\underline{x}_j < x_{j,i} < \overline{x}_j\}\, K\left(\frac{\hat p_i - \delta}{h_p}\right). \qquad (11)$$

By noting that $\frac{1}{n}\sum_{i=1}^{n}(B_{i,\tau} - \tau)^2 = \tau(1-\tau) + o^{*}_p(1)$, given (11), we see that whenever identification at infinity holds at all $x \in \mathcal{X}$ and the number of observations with propensity score in the interval $(1 - H - h_p,\, 1 - H + h_p)$ grows at rate $n h_p$, then both numerator and denominator in (10) are bounded in probability; otherwise, they diverge at the same rate.

Let $c^{*(2)}_{(1-\gamma),n,R}$ be the $(1-\gamma)$ percentile of the empirical distribution of $Z^{*(2)}_{q,1,n}, \ldots, Z^{*(2)}_{q,R,n}$, where $R$ is the number of bootstrap replications. The following Theorem establishes the first order validity of inference based on the bootstrap critical values, $c^{*(2)}_{(1-\gamma),n,R}$.

Theorem 2*: Let Assumption
A.1 , A.3 , A.5 , A.6 , A.7 , A.8 , A.9 , and
A.Q hold. If as $n \to \infty$, $(n h_x^{d_x})/\log n \to \infty$, $n h_x^{2r}\log n \to 0$, $H \to 0$, $H/h_p \to \infty$, $n h_p H^{2-\eta}\big(\prod_{j=1}^{d_x} \xi_{j,n}\big) \to 0$, $n h_p \big(\prod_{j=1}^{d_x} \xi_{j,n}\big) H^{\eta} \to \infty$, and $R \to \infty$, then

(i) under $H^{(2)}_{0,q}$: $\lim_{n,R\to\infty} \Pr\big( Z^{(2)}_{q,n} \ge c^{*(2)}_{(1-\gamma),n,R} \big) = \gamma$;

(ii) under $H^{(2)}_{A,q}$: $\lim_{n,R\to\infty} \Pr\big( Z^{(2)}_{q,n} \ge c^{*(2)}_{(1-\gamma),n,R} \big) = 1$.

Theorem 2* establishes the first order validity of inference based on wild bootstrap critical values. Under $H^{(2)}_{0,q}$, the studentized statistic and its bootstrap counterpart have the same limiting distribution. Under $H^{(2)}_{A,q}$, the statistic diverges, as the numerator is of larger probability order than the denominator, while the bootstrap statistic remains bounded in probability.

As detailed in the introduction, if we fail to reject the null hypothesis of the first test, we decide against endogenous selection, which allows us to rely on nonparametric estimators of the conditional quantiles using all selected individuals in the data. On the other hand, if we reject the first test but fail to reject the second one, one may still estimate the conditional quantile function(s) using only individuals with propensity score close to one, e.g. as in (17) in the Appendix. Finally, if the null hypotheses of both tests are rejected, there is evidence for relevant omitted predictor(s) (and possibly endogenous selection), and neither the estimator using all selected individuals nor the one using only those with propensity score close to one will deliver estimates consistent for the conditional quantile function(s) of interest.

Thus, there are three cases to be distinguished, namely no sample selection ($H_{NS,q}$), sample selection only ($H_{S,q}$), and misspecification possibly with sample selection ($H_{M,q}$). We have that:

$$H_{NS,q} = H^{(1)}_{0,q}, \qquad (12)$$
$$H_{S,q} = H^{(1)}_{A,q} \cap H^{(2)}_{0,q}, \qquad (13)$$

and finally:

$$H_{M,q} = H^{(1)}_{A,q} \cap H^{(2)}_{A,q}. \qquad (14)$$

This means that we decide for no selection if we fail to reject $H^{(1)}_{0,q}$, but decide for selection only if we reject $H^{(1)}_{0,q}$ and fail to reject $H^{(2)}_{0,q}$. By contrast, if we reject both $H^{(1)}_{0,q}$ and $H^{(2)}_{0,q}$, we opt for misspecification (and selection).

We now formalize the rules for differentiating among these three cases. Let $c^{*(1)}_{(1-\alpha),n,R}$ and $c^{*(2)}_{(1-\gamma),n,R}$ be, respectively, the $(1-\alpha)$ and $(1-\gamma)$ bootstrap critical values for the quantile case (as defined in Theorem 1* and Theorem 2*). Based on the outcome of the first and second test, we devise the following decision rule.

Rule RS:
(1) If $Z_{q,n} \le c^{*(1)}_{(1-\alpha),n,R}$, we decide that $H_{NS,q}$ is true. That is, we decide in favor of no selection.
(2) If $Z_{q,n} \ge c^{*(1)}_{(1-\alpha),n,R}$ and $Z^{(2)}_{q,n} \le c^{*(2)}_{(1-\gamma),n,R}$, we decide that $H_{S,q}$ is true. That is, we decide in favor of selection only.
(3) If $Z_{q,n} \ge c^{*(1)}_{(1-\alpha),n,R}$ and $Z^{(2)}_{q,n} \ge c^{*(2)}_{(1-\gamma),n,R}$, we decide that $H_{M,q}$ is true. That is, we decide in favor of misspecification, or misspecification and selection.

The Theorem below establishes the validity of our procedure by showing that the mis-classification probabilities (e.g., deciding for no selection when there is selection and/or misspecification) are asymptotically controlled by our decision rule at pre-specified levels.
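Operationally, the rule above is just a pair of threshold comparisons; the following minimal sketch takes the (already computed) statistics and bootstrap critical values as inputs, with hypothetical variable names:

```python
def rule_rs(z1, c1, z2, c2):
    """Two-step decision rule (sketch of Rule RS).

    z1, c1: first-test statistic and its (1 - alpha) bootstrap critical value.
    z2, c2: second-test (studentized, sup over tau) statistic and its
            (1 - gamma) bootstrap critical value.
    """
    if z1 <= c1:
        return "no selection"        # H_NS: fail to reject the first test
    if z2 <= c2:
        return "selection only"      # H_S: reject first, fail to reject second
    return "misspecification (possibly with selection)"  # H_M: reject both
```

Only when the first test rejects does the second statistic enter the decision; under the "selection only" outcome one would then proceed with the 'identification at infinity' estimator (17) in the Appendix.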
Theorem RS
Let all the Assumptions and the rate conditions in Theorems 1, 1*, 2, and 2* hold. Then,

(i) $\lim_{n\to\infty} \Pr(\text{choose } H_{NS,q} \text{ or } H_{S,q} \mid H_{M,q} \text{ is true}) = 0$,
(ii) $\lim_{n\to\infty} \Pr(\text{choose } H_{S,q} \text{ or } H_{M,q} \mid H_{NS,q} \text{ is true}) \le \alpha$, and
(iii) $\lim_{n\to\infty} \Pr(\text{choose } H_{M,q} \text{ or } H_{NS,q} \mid H_{S,q} \text{ is true}) \le \gamma$.

Since the estimator we use depends on the outcome of a testing procedure, one may be concerned about the size problem arising when one fails to reject a Hausman test of endogeneity and then conducts inference based on OLS estimators, as outlined in Guggenberger (2010a,b). In terms of (2) in Remark 1, suppose that the correlation between $u_i$ and $p_i$ is weak, say of order $n^{-1/2}$. In this case, we may fail to reject the null of no selection, and thus move on to estimate conditional quantiles using all selected individuals. If inference is based on a nonparametric quantile estimator, this estimator converges to its limiting distribution at a rate slower than $n^{1/2}$, and thus pre-testing would not represent a problem. It is only in the case where we decide to conduct inference using an estimator for a parametric conditional quantile function that the issue of size distortion may arise, as in the set-up of Guggenberger (2010a,b). Moreover, while in the Hausman pre-testing case the cost of always using instrumental variables is only in terms of efficiency loss, in our context the cost lies predominantly in a much slower convergence rate.

As pointed out in the previous section, our testing procedure cannot disentangle selection and omitted regressors when the latter are uncorrelated with $p_i$ when $p_i$ takes on values close to one. In this case, we still decide in favor of selection, and so we estimate the quantiles using only observations with propensity score close to one.
However, even in this case we make the 'right' decision, in the sense that for observations with propensity score close to one, omitted predictor bias is not present. Finally, it is also noteworthy that our testing procedure only requires a second stage when we are unsure about the correct specification of the outcome function and the correlation of the unobserved factors with the instrument(s): when the instrumental variable(s) are free of these concerns, e.g. because they have been constructed on the basis of a randomized control trial, a second stage is not required.

Our illustration is based on a subsample of the UK wage data from the Family Expenditure Survey used by Arellano and Bonhomme (2017a). As pointed out by these authors, due to changes in employment rates over time, simply examining wage inequality for females and males at work over time may provide a distorted picture of market-level wage inequality. We will therefore run our selection testing procedure on two different subsets of the data, namely 1995 to 1997, a period of increasing gross domestic product (GDP) growth rates, and 1998 to 2000, a period of high but stable GDP growth rates. Unlike Arellano and Bonhomme (2017a), however, our testing procedure for selection will not rely on a parametric specification of the conditional log-wage quantile functions, but remains completely nonparametric.

The covariates we include in $x_i$ are dummies for marital status, education (end of schooling at 17 or 18, and end of schooling after 18), location (eleven regional dummies), number of kids (split by six age categories), time (year dummies), as well as age in years. This set of covariates is identical to the one used by Arellano and Bonhomme (2017a), except that the latter used cohort dummies instead of age in years. The continuous instrumental variable is given by the measure of potential out-of-work (welfare) income, interacted with marital status.
This variable, which was also used by Arellano and Bonhomme (2017a), builds on Blundell et al. (2003) and is constructed for each individual in the sample (employed and non-employed) using the Institute for Fiscal Studies (IFS) tax and welfare-benefit simulation model. (For the exact construction of the sample, see their paper and references therein.)

The final sample for the years 1995-1997 comprises 21,263 individuals, 11,647 of which are females and 9,616 of which are males. The number of working females (males) with a positive log hourly wage in that sample is 7,761 (7,623). By contrast, for the 1998-2000 period we obtain 16,350 observations, 8,904 females and 7,446 males. The numbers of working females (males) in that sample are 5,931 (6,058).

All estimates are constructed using routines from the np package of Hayfield and Racine (2008). More specifically, we estimate the propensity score $\Pr(s_i = 1 \mid z_i)$ fully nonparametrically using a standard kernel estimator, with bandwidths determined by cross-validation on subsamples ($n = 450$), selecting the median values over 50 replications. The conditional quantile function $q_\tau(x_i)$ is estimated as in Equation (19) of Li and Racine (2008), while the conditional distribution function $F_{p|x,u_\tau,s=1}(\cdot \mid \cdot, \cdot, \cdot)$ is constructed as in Equation (4) of the same paper. The bandwidths are again determined as before. The quantile grid is chosen to be $\mathcal{T} = \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\}$.

To provide the reader with a better illustration of the potential magnitudes of selection into work, we replicate the predictions from the (estimated) parametric conditional quantile functions from Figure 1 of Arellano and Bonhomme (2017a, p.16) for the sub-periods 1995-1997 and 1998-2000; see Figures 1 and 2, respectively. In these pictures, solid lines represent estimated uncorrected (for selection) conditional log-wage quantile functions, while dashed lines are the ones corrected for sample selection. Throughout, female quantile lines lie below the male quantile lines.
The figures, which display selection corrections based on a linear quantile regression model and parametric selection correction, show little difference between the original and the corrected lines for both males and females for the 1995-1997 period (except for males at lower percentile levels), but more pronounced differences for the subsequent 1998-2000 period. In terms of magnitude, the effect of the correction appears to be generally bigger for males than for females in both subperiods. (For the exact specification used, see Arellano and Bonhomme (2017a).)

Turning to the test results in Table 1, we see that while we cannot find any evidence for selection during the 1995-1997 period for females at conventional significance levels, there is some evidence for males at the 10% significance level. In fact, taking a closer look at the results in Table 1, we observe that rejection for males occurs on the basis of the 10th percentile, which is in line with the graphical evidence in Figure 1. Switching over to the right panel of Table 1, however, we obtain a different picture: for females, $H^{(1)}_{0,q}$ is rejected at any conventional level, and rejection is most pronounced at the 20th and the 30th percentile. On the other hand, we cannot reject $H^{(1)}_{0,q}$ for males. This failure to reject $H^{(1)}_{0,q}$ for males is in contrast to the graphical evidence in Figure 2, and highlights the importance of formal testing under a more flexible specification. (Recall that there are no continuous covariates contained in $x_i$, and thus the theoretical rate conditions do not directly apply here.)

Following our testing procedure outlined in Section 2, we perform the second test for males in the 1995-1997 period, and for females in the 1998-2000 period (see Table 2). Turning to the results, we reject the null of correct specification of $q_\tau(x_i)$ for males, but fail to reject that null for females at any conventional levels. Both results appear to be robust to different choices of $\delta$ and $h_p$.
Thus, under the assumption that out-of-work income is a valid instrument and that selection indeed enters the outcome as postulated in Equation (1), our test results suggest that there is evidence for selection among females for the 1998-2000 period, but not for males. In fact, what appears to be selection among males during the 1995-1997 period may actually be attributed to misspecification of the conditional quantile function. (In a related paper, Kitagawa (2010) tested for the validity of the same instrumental variable (but for the interaction with marital status) in a similar data set on the basis of the UK Family Expenditure Survey used by Blundell et al. (2007). Although his test results are not directly informative here, as his test is run on a much coarser set of covariates $x_i$, not including e.g. regional, marital, or family information, his evidence suggested that the conditional independence of the instrument and outcome (given $x_i$ and selection $s_i = 1$) may indeed be violated for some sub-groups (in particular, younger males with moderate levels of education). Thus, rejection in the second test for males could also be related to this feature.)

[Figure 1 with panels (a)-(i) for $\tau = 10\%, 20\%, \ldots, 90\%$ omitted.]

Figure 1: Corrected and Uncorrected Log Hourly Wage Quantiles by Gender 1995-1997 (Arellano and Bonhomme, 2017a).
Note: male quantiles are always at the top, female ones at the bottom (solid lines: uncorrected quantiles; dashed lines: selection corrected quantiles).

[Figure 2 with panels (a)-(i) for $\tau = 10\%, 20\%, \ldots, 90\%$ omitted.]

Figure 2: Corrected and Uncorrected Log Hourly Wage Quantiles by Gender 1998-2000 (Arellano and Bonhomme, 2017a).
Note: male quantiles are always at the top, female ones at the bottom (solid lines: uncorrected quantiles; dashed lines: selection corrected quantiles).
[Table 1 (first test): test results by quantile for females and males over the 1995-1997 (left panel) and 1998-2000 (right panel) periods; the numerical entries are not recoverable from this source.]

Note: Number of Bootstrap Replications is 400.
This paper introduces a novel testing procedure to detect sample selection in conditional quantile functions, without imposing parametric assumptions on either the outcome or the selection equation. This is accomplished via two tests, the first of which is an omitted predictor test, with the estimated propensity score as omitted predictor. As with any omnibus test, rejection in the first step can be due either to selection or to the omission of a predictor which is correlated with the estimated propensity score. Since selection and misspecification have very different implications for the estimation of nonparametric (conditional) quantile functions, we aim at disentangling the two if we reject in the first step. That is, after rejection in the first test, we proceed to the second test, which is a localized version of the first test, using only observations with (estimated) propensity score close to one. A rejection in this case indicates the presence of misspecification, possibly in conjunction with selection. Importantly, the second test, although relying on 'identification at infinity', allows for irregular identification by using observations close, but not too close, to one. We establish the first order validity of bootstrap critical values based on the wild bootstrap.

In our empirical illustration, we test for sample selection in log hourly wages of females and males in the UK using data from the UK Family Expenditure Survey. Using the periods 1995-1997 and 1998-2000 as examples, we find evidence for selection among females for the 1998-2000 period, but not for males. In fact, what appears to be selection among males during the 1995-1997 period may actually be attributed to misspecification of the conditional quantile function.
[Table 2 (second test): studentized test statistics for males (1995-1997) and females (1998-2000) across different choices of $\delta$ and $h_p$; the numerical entries are not recoverable from this source.]

Note: Number of Bootstrap Replications is 1,000.
Nonparametric Estimators: As detailed in the main text, to estimate the conditional quantile function at some point $x_i = x$, we use an $r$-th order local polynomial estimator based on the standard 'check type' objective function $l_\tau(v) = 2v(\tau - 1\{v \le 0\})$. (In the discrete case, we can set up a local constant estimator in the direction of the discrete elements.) The local polynomial estimator is then given by:

$$\hat b_{h_x}(\tau, x) = \arg\min_{b} \frac{1}{n h_x^{d_x}} \sum_{i=1}^{n} l_\tau\Big( y_i - b_0 - \sum_{1 \le |t| \le r} b_t (x_i - x)^t \Big)\, s_i\, K\Big(\frac{x_i - x}{h_x}\Big), \qquad (15)$$

and is an estimator of $b^{\dagger}_{h_x}(\tau, x)$ with

$$b^{\dagger}(\tau, x) = \arg\min_{b} \lim_{n \to \infty} \frac{1}{n h_x^{d_x}} \sum_{i=1}^{n} E\Big[ l_\tau\Big( y_i - b_0 - \sum_{1 \le |t| \le r} b_t (x_i - x)^t \Big)\, s_i\, K\Big(\frac{x_i - x}{h_x}\Big) \Big]. \qquad (16)$$

Here, $K(\cdot)$ denotes a $d_x$ dimensional product kernel, and, borrowing notation from Masry (1996), $t = (t_1, \ldots, t_{d_x})'$, $|t| = \sum_{j=1}^{d_x} t_j$, and $\sum_{0 \le |t| \le r} = \sum_{j=0}^{r} \sum_{t_1=0}^{j} \ldots \sum_{t_{d_x}=0}^{j}$. We use $\hat q_\tau(x) = \hat b_{0,h_x}(\tau, x)$, the first element of $\hat b_{h_x}(\tau, x)$, and set $q^{\dagger}_\tau(x) = b^{\dagger}_0(\tau, x)$. If the quantile functions are estimated using only observations with propensity score close to one, a corresponding estimator can be defined as:

$$\hat{\tilde b}_{h_x}(\tau, x) = \arg\min_{b} \frac{1}{n h_x^{d_x} h_p} \sum_{i=1}^{n} l_\tau\Big( y_i - b_0 - \sum_{1 \le |t| \le r} b_t (x_i - x)^t \Big)\, s_i\, K\Big(\frac{x_i - x}{h_x}\Big) K\Big(\frac{\hat p_i - \delta_n}{h_p}\Big). \qquad (17)$$

Finally, the kernel estimator of $F_{p|x,u_\tau,s=1}(p \mid x_i, s_i = 1)$ used to construct the bootstrap statistic of the first test is given by:

$$\hat F_{p|x,u_\tau,s=1}(p \mid x_i, s_i = 1) = \frac{\frac{1}{n h_F^{d_x+1}} \sum_{j=1}^{n} s_j\, 1\{\hat p_j \le p\}\, K\big(\frac{\hat u_{j,\tau}}{h_F}\big) K\big(\frac{x_j - x_i}{h_F}\big)}{\frac{1}{n h_F^{d_x+1}} \sum_{j=1}^{n} s_j\, K\big(\frac{\hat u_{j,\tau}}{h_F}\big) K\big(\frac{x_j - x_i}{h_F}\big)}. \qquad (18)$$

Auxiliary Lemmas: In the following, let $E_{S_n}[\cdot]$ denote the expectation operator conditional on the actual sample realizations. Moreover, since $1\{\underline{p} \le \hat p_i \le \overline{p}\} = 1\{\hat p_i \le \overline{p}\} - 1\{\hat p_i \le \underline{p}\}$, we will ignore the part of the statistic which involves $1\{\hat p_i \le \underline{p}\}$ in the sequel.

Lemma 1: Let Assumptions
A.1 - A.5 and
A.Q hold. Moreover, let $h_x$ denote a deterministic bandwidth sequence that satisfies $h_x \to 0$ as $n \to \infty$. If as $n \to \infty$, $(n h_x^{d_x})/\log n \to \infty$ and $n h_x^{2r}\log n \to 0$, then uniformly over $\mathcal{T}$, $\mathcal{X}$, and $\mathcal{P}$:

(i) Under $H^{(1)}_{0,q}$:

$$\sqrt{n}\, E_{S_n}\Big[ s_i\big(1\{\hat u_\tau(x_i) \le 0\} - \tau\big)\,\Pi_{j=1}^{d_x} 1\{\underline{x}_j < x_{j,i} < \overline{x}_j\}\, 1\{\hat p_i \le \overline{p}\} - s_i\big(1\{u_\tau(x_i) \le 0\} - \tau\big)\,\Pi_{j=1}^{d_x} 1\{\underline{x}_j < x_{j,i} < \overline{x}_j\}\, 1\{p_i \le \overline{p}\} \Big]$$
$$= -\frac{1}{\sqrt{n}} \sum_{j=1}^{n} F_{p|x,u_\tau,s=1}(\overline{p} \mid x_j, s_j = 1)\, \big(s_j(1\{u_\tau(x_j) \le 0\} - \tau)\big)\,\Pi_{l=1}^{d_x} 1\{\underline{x}_l < x_{l,j} < \overline{x}_l\} + o_p(1).$$

(ii) Under $H^{(1)}_{A,q}$:

$$\sqrt{n}\, E_{S_n}\Big[ s_i\big(1\{\hat u_\tau(x_i) \le 0\} - \tau\big)\,\Pi_{j=1}^{d_x} 1\{\underline{x}_j < x_{j,i} < \overline{x}_j\}\, 1\{\hat p_i \le \overline{p}\} - s_i\big(1\{u_\tau(x_i) \le 0\} - \tau\big)\,\Pi_{j=1}^{d_x} 1\{\underline{x}_j < x_{j,i} < \overline{x}_j\}\, 1\{p_i \le \overline{p}\} \Big] = O_p\left( \frac{\ln(n)}{\sqrt{h_x^{d_x}}} \right).$$

Lemma 2: Let Assumptions
A.1 , A.3 , A.5 , A.6 , A.7 , A.8 , A.9 , and
A.Q hold. If as $n \to \infty$, $(n h_x^{d_x})/\log n \to \infty$, $n h_x^{2r}\log n \to 0$, $H \to 0$, $H/h_p \to \infty$, $n h_p H^{2-\eta}\big(\prod_{j=1}^{d_x} \xi_{j,n}\big) \to 0$, and $n h_p \big(\prod_{j=1}^{d_x} \xi_{j,n}\big) H^{\eta} \to \infty$, then uniformly over $\mathcal{T}$:

(i)
$$\frac{n}{\sqrt{n h_p H^{\eta}\big(\prod_{j=1}^{d_x} \xi_{j,n}\big)}}\, E_{S_n}\Big[ s_i\big(1\{\hat u_\tau(x_i) \le 0\} - \tau\big)\, 1\{x_i \in \mathcal{C}_{0,n}\}\, K\Big(\frac{\hat p_i - \delta}{h_p}\Big) - s_i\big(1\{u_\tau(x_i) \le 0\} - \tau\big)\, 1\{x_i \in \mathcal{C}_{0,n}\}\, K\Big(\frac{p_i - \delta}{h_p}\Big) \Big] = o_p(1);$$

(ii)
$$\frac{1}{\sqrt{n h_p H^{\eta}\big(\prod_{j=1}^{d_x} \xi_{j,n}\big)}} \sum_{i=1}^{n} \left( s_i\big(1\{\hat u_\tau(x_i) \le 0\} - \tau\big)\, 1\{x_i \in \mathcal{C}_{0,n}\}\, K\Big(\frac{\hat p_i - \delta}{h_p}\Big) - E_{S_n}\Big[ s_i\big(1\{\hat u_\tau(x_i) \le 0\} - \tau\big)\, 1\{x_i \in \mathcal{C}_{0,n}\}\, K\Big(\frac{\hat p_i - \delta}{h_p}\Big) \Big] \right)$$
$$- \frac{1}{\sqrt{n h_p H^{\eta}\big(\prod_{j=1}^{d_x} \xi_{j,n}\big)}} \sum_{i=1}^{n} \left( s_i\big(1\{u_\tau(x_i) \le 0\} - \tau\big)\, 1\{x_i \in \mathcal{C}_{0,n}\}\, K\Big(\frac{p_i - \delta}{h_p}\Big) - E_{S_n}\Big[ s_i\big(1\{u_\tau(x_i) \le 0\} - \tau\big)\, 1\{x_i \in \mathcal{C}_{0,n}\}\, K\Big(\frac{p_i - \delta}{h_p}\Big) \Big] \right) = o_p(1).$$

Lemma 3: Let Assumptions
A.1 , A.3 , A.5 , A.6 , A.7 , A.8 , A.9 , and
A.Q hold. If as n →∞ , ( nh d x x ) / log n → ∞ , nh rx log n → , H → , H/h p → ∞ , nh p H − η (cid:16)(cid:81) d x j =1 ξ j,n (cid:17) →
0, and nh p (cid:16)(cid:81) dj =1 ξ j,n (cid:17) H η → ∞ , then: 30 i) Under H (2)0 ,q and H (2) A,q , pointwise in τ ∈ T : (cid:80) ni =1 s i (1 { u τ ( x i ) ≤ } − G x ( τ, (cid:81) d x j =1 (cid:8) x j ≤ x i,j ≤ x j (cid:9) K (cid:16) p i − δh p (cid:17)(cid:114)(cid:82) K ( v ) dv (cid:80) ni =1 s i (1 { u τ ( x i ) ≤ } − G x ( τ, Π d x j =1 { x j < x j,i < x j } K (cid:16) p i − δh p (cid:17) d → N (0 , . (ii) Under H (2)0 ,q , uniformly in τ ∈ T :1 (cid:114) nh p H η (cid:16)(cid:81) d x j =1 ξ j,n (cid:17) n (cid:88) i =1 s i ( G x i ( τ, p i ) − τ )1 { x i ∈ C ,n } K (cid:18) p i − δh p (cid:19) = o p (1) Proofs of Theorem 1 and 2 : Proof of Theorem 1 : (i) Start by noting that we can decompose Z ,n ( τ, x, x, p ) as follows:1 √ n n (cid:88) i =1 s i (1 { (cid:98) u τ ( x i ) ≤ } − τ )Π d x j =1 { x j < x j,i < x j } { (cid:98) p i ≤ p } = 1 √ n n (cid:88) i =1 s i (1 { u τ ( x i ) ≤ } − τ )Π d x j =1 { x j < x j,i < x j } { p i ≤ p } + √ nE S n (cid:34) s i (1 { (cid:98) u τ ( x i ) ≤ } − τ ) Π d x j =1 { x j < x j,i < x j } { (cid:98) p i ≤ p }− s i (1 { u τ ( x i ) ≤ } − τ ) Π d x j =1 { x j < x j,i < x j } { p i ≤ p } (cid:35) − √ n n (cid:88) i =1 (cid:40) s i (1 { u τ ( x i ) ≤ } − τ ) Π d x j =1 { x j < x j,i < x j } { p i ≤ p }− E S n (cid:34) s i (1 { u τ ( x i ) ≤ } − τ ) Π d x j =1 { x j < x j,i < x j } { p i ≤ p } (cid:35)(cid:41) + 1 √ n n (cid:88) i =1 (cid:40) s i (1 { (cid:98) u τ ( x i ) ≤ } − τ ) Π d x j =1 { x j < x j,i < x j } { (cid:98) p i ≤ p }− E S n (cid:34) s i (1 { (cid:98) u τ ( x i ) ≤ } − τ ) Π d x j =1 { x j < x j,i < x j } { (cid:98) p i ≤ p } (cid:35)(cid:41) = I n + II n + III n . From Lemma 1(i), II n = − √ n n (cid:88) j =1 F p | x,u τ ,s =1 ( p | x j , , s j = 1) ( s j (1 { u τ ( x j ) ≤ } − τ )) Π d x l =1 { x l < x l,j < x l } + o p (1) , where the o p (1) term holds uniformly over T , X , and P . As for III n , we first, we apply Lemma A.1of Escanciano et al. 
(2014) to the function classes F ≡ { f ( s, τ, x ) = s (1 { u τ ( x ) ≤ } − τ )Π d x j =1 { x j T × X × P , with covariancekernel cov (cid:16) Z q ,n (cid:0) τ, x, x, p, p (cid:1) , Z q ,n (cid:0) τ (cid:48) , x (cid:48) , x (cid:48) , p (cid:48) , p (cid:48) (cid:1)(cid:17) = E (cid:104) (1 { u τ ( x i ) ≤ } − τ )Π dj =1 { x j < x j,i < x j } (cid:0) s i (1 { p i ≤ p } − { p i ≤ p } ) − ( F p | x,u τ ,s =1 ( p | x i , , s i = 1) − F p | x,u τ ,s =1 ( p | x i , , s i = 1)) Pr( s i = 1 | x i ) (cid:1) (1 { u τ (cid:48) ( x i ) ≤ } − τ (cid:48) )Π dj =1 { x (cid:48) j < x j,i < x (cid:48) j } (cid:0) s i (1 { p i ≤ p (cid:48) } − { p i ≤ p (cid:48) } ) − ( F p | x,u τ ,s =1 ( p (cid:48) | x i , , s i = 1) − F p | x,u τ ,s =1 ( p (cid:48) | x i , , s i = 1)) Pr( s i = 1 | x i ) (cid:1)(cid:3) As an immediate consequence, we also obtain the weak convergence of any continuous functional andso: Z q ,n ⇒ Z q . (ii) Given Lemma 1(ii): Z ,n ( τ, x, x, p )= 1 √ n n (cid:88) i =1 (cid:32) (1 { u τ ( x i ) ≤ } − τ )Π dj =1 { x j < x j,i < x j } s i (cid:32) { p i ≤ p } − F p | x,u τ ,s =1 ( p | x i , , s i = 1) (cid:33) − E (cid:104) (1 { u τ ( x i ) ≤ } − τ )Π dj =1 { x j < x j,i < x j } s i (cid:0) { p i ≤ p } − F p | x,u τ ,s =1 ( p | x i , , s i = 1) (cid:1)(cid:17)(cid:105) + √ n E (cid:104) (1 { u τ ( x i ) ≤ } − τ )Π dj =1 { x j < x j,i < x j } s i (cid:0) { p i ≤ p } − F p | x,u τ ,s =1 ( p | x i , , s i = 1) (cid:1)(cid:105) + O p (cid:32) ln n (cid:112) h d x x (cid:33) (19)with the O p term holding uniformly over T , X , and P . The statement then follows as the first termon the right hand side (RHS) of (19) weakly converges, and ln n √ h dxx diverges at a rate slower than √ n given that nh d x x → ∞ . 
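To fix ideas, the first-step process $Z_{1,n}(\tau,\underline{x},\overline{x},p)$ whose decomposition the proof studies can be computed directly from the selected sample. The sketch below is purely illustrative: the logistic stand-in for the propensity score, the marginal-quantile stand-in for the estimated conditional quantiles, and all variable names are our own simplifications, not the nonparametric first-stage estimators used in the paper.

```python
import numpy as np

def first_step_statistic(x, s, p_hat, u_hat, taus, p_grid, x_low, x_high):
    """Kolmogorov-Smirnov-type functional of the first-step process:
    Z_{1,n}(tau, x_low, x_high, p)
      = n^{-1/2} sum_i s_i (1{u_hat_tau(x_i) <= 0} - tau)
                 * prod_j 1{x_low_j < x_{j,i} < x_high_j} * 1{p_hat_i <= p}.
    u_hat is an (n, len(taus)) array of estimated quantile residuals."""
    n = x.shape[0]
    in_cell = np.all((x > x_low) & (x < x_high), axis=1)   # product of indicators
    stat = 0.0
    for k, tau in enumerate(taus):
        resid = (u_hat[:, k] <= 0).astype(float) - tau     # 1{u_hat <= 0} - tau
        for p in p_grid:
            z = np.sum(s * resid * in_cell * (p_hat <= p)) / np.sqrt(n)
            stat = max(stat, abs(z))
    return stat

# Toy illustration with simulated data and crude parametric stand-ins.
rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=(n, 2))
z = rng.normal(size=n)
p_hat = 1.0 / (1.0 + np.exp(-(0.5 + z)))                   # stand-in propensity score
s = (rng.uniform(size=n) <= p_hat).astype(float)           # selection indicator
y = x @ np.array([1.0, -0.5]) + rng.normal(size=n)
taus = np.array([0.25, 0.5, 0.75])
q_hat = np.quantile(y[s == 1], taus)                       # stand-in for q_hat_tau(x)
u_hat = y[:, None] - q_hat[None, :]                        # residuals y_i - q_hat_tau
stat = first_step_statistic(x, s, p_hat, u_hat, taus,
                            p_grid=np.linspace(0.1, 0.9, 9),
                            x_low=np.array([-1.0, -1.0]),
                            x_high=np.array([1.0, 1.0]))
```

In practice the critical values of this functional would be obtained by the wild bootstrap discussed in the paper; the sketch only illustrates the shape of the statistic, not its calibration.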
Proof of Theorem 2: (i) Given Assumption A.6,
\[
Z_{q_2,n}(\tau,\underline{x},\overline{x},1)=\frac{Z^{N}_{q_2,n}(\tau,\underline{x},\overline{x},1)}{\Big(\big(\int K^{2}(v)\,dv\big)\frac{1}{nh_pH^{\eta}\big(\prod_{j=1}^{d_x}\xi_{j,n}\big)}\sum_{i=1}^{n}s_i\big(1\{\hat{u}_\tau(x_i)\le 0\}-\tau\big)^{2}1\{x_i\in C_{1,n}\}K\big(\frac{\hat{p}_i-1}{\delta h_p}\big)\Big)^{1/2}}\big(1+o_p(1)\big),\qquad(20)
\]
where $C_{1,n}\subseteq\otimes_{j=1}^{d_x}[\underline{x}_j,\overline{x}_j]$ and
\[
Z^{N}_{q_2,n}(\tau,\underline{x},\overline{x},1)\equiv\frac{1}{\sqrt{nh_pH^{\eta}\big(\prod_{j=1}^{d_x}\xi_{j,n}\big)}}\sum_{i=1}^{n}s_i\big(1\{\hat{u}_\tau(x_i)\le 0\}-\tau\big)1\{x_i\in C_{1,n}\}K\Big(\frac{\hat{p}_i-1}{\delta h_p}\Big).
\]
The $o_p(1)$ term, which holds uniformly in $\tau\in\mathcal{T}$, follows from Assumption A.6, given that for all $x$ in the complement of $C_{1,n}$, the set of observations with propensity score close to one is thinner than for all $x\in C_{1,n}$. Hereafter, for brevity, we ignore the $(1+o_p(1))$ term. Now,
\[
\begin{aligned}
Z^{N}_{q_2,n}(\tau,\underline{x},\overline{x},1)&=\frac{1}{\sqrt{nh_pH^{\eta}\big(\prod_{j=1}^{d_x}\xi_{j,n}\big)}}\sum_{i=1}^{n}s_i\big(1\{u_\tau(x_i)\le 0\}-\tau\big)1\{x_i\in C_{1,n}\}K\Big(\frac{p_i-1}{\delta h_p}\Big)\\
&\quad+\frac{n}{\sqrt{nh_pH^{\eta}\big(\prod_{j=1}^{d_x}\xi_{j,n}\big)}}\,E_{S_n}\Big[s_i\big(1\{\hat{u}_\tau(x_i)\le 0\}-\tau\big)1\{x_i\in C_{1,n}\}K\Big(\frac{\hat{p}_i-1}{\delta h_p}\Big)\\
&\qquad\qquad\qquad - s_i\big(1\{u_\tau(x_i)\le 0\}-\tau\big)1\{x_i\in C_{1,n}\}K\Big(\frac{p_i-1}{\delta h_p}\Big)\Big]\\
&\quad-\frac{1}{\sqrt{nh_pH^{\eta}\big(\prod_{j=1}^{d_x}\xi_{j,n}\big)}}\sum_{i=1}^{n}\Big(s_i\big(1\{u_\tau(x_i)\le 0\}-\tau\big)1\{x_i\in C_{1,n}\}K\Big(\frac{p_i-1}{\delta h_p}\Big)-E_{S_n}\big[\,\cdot\,\big]\Big)\\
&\quad+\frac{1}{\sqrt{nh_pH^{\eta}\big(\prod_{j=1}^{d_x}\xi_{j,n}\big)}}\sum_{i=1}^{n}\Big(s_i\big(1\{\hat{u}_\tau(x_i)\le 0\}-\tau\big)1\{x_i\in C_{1,n}\}K\Big(\frac{\hat{p}_i-1}{\delta h_p}\Big)-E_{S_n}\big[\,\cdot\,\big]\Big)\\
&= I_n + II_n + III_n,
\end{aligned}
\]
where $E_{S_n}[\,\cdot\,]$ denotes the $E_{S_n}$-expectation of the preceding summand and $III_n$ collects the last two (centered) sums. Given Lemma 2(i)-(ii), $II_n$ and $III_n$ are $o_p(1)$ uniformly in $\tau\in\mathcal{T}$. Thus, it suffices to derive the limiting distribution of $I_n$.

Recalling that $G_{x_i}(\tau,p_i)=\Pr(y_i\le q_\tau(x_i)\,|\,x_i,p_i)$, $I_n$ reads as:
\[
\begin{aligned}
I_n&=\frac{1}{\sqrt{nh_pH^{\eta}\big(\prod_{j=1}^{d_x}\xi_{j,n}\big)}}\sum_{i=1}^{n}s_i\big(1\{u_\tau(x_i)\le 0\}-G_{x_i}(\tau,p_i)\big)1\{x_i\in C_{1,n}\}K\Big(\frac{p_i-1}{\delta h_p}\Big)\\
&\quad+\frac{1}{\sqrt{nh_pH^{\eta}\big(\prod_{j=1}^{d_x}\xi_{j,n}\big)}}\sum_{i=1}^{n}s_i\big(G_{x_i}(\tau,p_i)-\tau\big)1\{x_i\in C_{1,n}\}K\Big(\frac{p_i-1}{\delta h_p}\Big)\qquad(21)\\
&=I_{1,n}+I_{2,n}.
\end{aligned}
\]
The first term drives the limiting distribution, while the second term can be thought of as a bias, since $G_x(\tau,p)\to\tau$ only as $p\to 1$. More specifically, by Lemma 3(i), $I_{1,n}$ satisfies a CLT for triangular arrays pointwise in $\tau\in\mathcal{T}$, while by Lemma 3(ii), $I_{2,n}=o_p(I_{1,n})$ uniformly in $\tau\in\mathcal{T}$ and for all $x_i\in C_{1,n}$. We now need to study the denominator in (20):
\[
\Big(\int K^{2}(v)\,dv\Big)\frac{1}{nh_pH^{\eta}\big(\prod_{j=1}^{d_x}\xi_{j,n}\big)}\sum_{i=1}^{n}s_i\big(1\{\hat{u}_\tau(x_i)\le 0\}-\tau\big)^{2}1\{x_i\in C_{1,n}\}K\Big(\frac{\hat{p}_i-1}{\delta h_p}\Big)
=\Big(\int_{-1}^{1}K^{2}(v)\,dv\Big)\tau(1-\tau)\frac{\int_{C_{1,n}}g(x,1)\,dx}{\prod_{j=1}^{d_x}\xi_{j,n}}+o_p(1)=O(1)
\]
uniformly over $\mathcal{T}$. The covariance kernel of the statistic is therefore given by:
\[
\begin{aligned}
&\mathrm{cov}\big(Z_{q_2}(\tau,\underline{x},\overline{x},1),Z_{q_2}(\tau',\underline{x},\overline{x},1)\big)\\
&=\lim_{n\to\infty}E\Bigg[\frac{\sum_{i=1}^{n}s_i\big(1\{u_\tau(x_i)\le 0\}-\tau\big)1\{x_i\in C_{1,n}\}K\big(\frac{p_i-1}{\delta h_p}\big)}{\sqrt{\big(\int K^{2}(v)\,dv\big)\sum_{i=1}^{n}s_i\big(1\{u_\tau(x_i)\le 0\}-\tau\big)^{2}1\{x_i\in C_{1,n}\}K\big(\frac{p_i-1}{\delta h_p}\big)}}\\
&\qquad\qquad\times\frac{\sum_{i=1}^{n}s_i\big(1\{u_{\tau'}(x_i)\le 0\}-\tau'\big)1\{x_i\in C_{1,n}\}K\big(\frac{p_i-1}{\delta h_p}\big)}{\sqrt{\big(\int K^{2}(v)\,dv\big)\sum_{i=1}^{n}s_i\big(1\{u_{\tau'}(x_i)\le 0\}-\tau'\big)^{2}1\{x_i\in C_{1,n}\}K\big(\frac{p_i-1}{\delta h_p}\big)}}\Bigg].
\end{aligned}
\]
Finally, by Lemmas A.1 and B.3 of Escanciano et al. (2014), we can also conclude that the numerator and denominator of $Z_{q_2,n}(\tau,\underline{x},\overline{x},1)$ are Donsker, and hence, by Theorem 2.10.6 of Van der Vaart and Wellner (1996), that $Z_{q_2,n}(\tau,\underline{x},\overline{x},1)$ is Donsker as well. Thus, it follows that $Z_{q_2,n}(\tau,\underline{x},\overline{x},1)$, as defined in (20), converges weakly in $\ell^{\infty}(\mathcal{T})$, and by continuous mapping, so does the functional $\sup_{\tau\in\mathcal{T}}\big|Z_{q_2,n}(\tau,\underline{x},\overline{x},1)\big|$, as postulated in the statement of part (i).

(ii) Now, if there is an omitted relevant regressor, given Assumption A.7,
\[
\frac{1}{nh_p\big(\prod_{j=1}^{d_x}\xi_{j,n}\big)H^{\eta}}\sum_{i=1}^{n}s_i\big(G_{x_i}(\tau,p_i)-\tau\big)\Pi_{j=1}^{d_x}1\{\underline{x}_j\le x_{i,j}\le\overline{x}_j\}K\Big(\frac{\hat{p}_i-1}{\delta h_p}\Big)
=\underbrace{\lim_{n\to\infty}\frac{1}{h_p\big(\prod_{j=1}^{d_x}\xi_{j,n}\big)H^{\eta}}E\Big[s_i\big(G_{x_i}(\tau,p_i)-\tau\big)1\{x_i\in C_{1,n}\}K\Big(\frac{p_i-1}{\delta h_p}\Big)\Big]}_{\neq 0}+o_p(1)
\]
and
\[
\begin{aligned}
&\frac{1}{nh_pH^{\eta}\big(\prod_{j=1}^{d_x}\xi_{j,n}\big)}\sum_{i=1}^{n}s_i\big(1\{\hat{u}_\tau(x_i)\le 0\}-\tau\big)^{2}\prod_{j=1}^{d_x}1\{\underline{x}_j\le x_{i,j}\le\overline{x}_j\}K\Big(\frac{\hat{p}_i-1}{\delta h_p}\Big)\\
&\qquad\stackrel{p}{\to}\lim_{n\to\infty}\frac{1}{h_p\big(\prod_{j=1}^{d_x}\xi_{j,n}\big)H^{\eta}}E\Big[s_i\big(G_x(\tau,p_i)+\tau^{2}-2\tau G_x(\tau,p_i)\big)1\{x_i\in C_{1,n}\}K\Big(\frac{p_i-1}{\delta h_p}\Big)\Big]>0.
\end{aligned}
\]
Hence, $Z_{q_2,n}(\tau,\underline{x},\overline{x},1)$ diverges at rate $\sqrt{nh_p\big(\prod_{j=1}^{d_x}\xi_{j,n}\big)H^{\eta}}$, which ensures power against such alternatives.

References

Arellano, M. and S. Bonhomme (2017a). Quantile selection models with an application to understanding changes in wage inequality. Econometrica 85(1), 1–28.

Arellano, M. and S. Bonhomme (2017b). Sample selection in quantile regression: A survey. In R. Koenker, V. Chernozhukov, X. He, and L. Peng (Eds.), Handbook of Quantile Regression (1st ed.), Chapter 13, pp. 209–221. Chapman and Hall/CRC.

Black, D., J. Joo, R. LaLonde, J.
Smith, and E. Taylor (2017). Simple tests for selection bias: Learning more from instrumental variables. IZA DP 9346, Institute for the Study of Labor.

Blundell, R., A. Gosling, H. Ichimura, and C. Meghir (2007). Changes in the distribution of male and female wages accounting for employment composition using bounds. Econometrica 75, 323–363.

Blundell, R., H. Reed, and T. Stoker (2003). Interpreting aggregate wage growth. American Economic Review 93(4), 1114–1131.

Breunig, C. (2017). Testing for missing at random using instrumental variables. Journal of Business and Economic Statistics, forthcoming.

Chamberlain, G. (1986). Asymptotic efficiency in semi-parametric models with censoring. Journal of Econometrics 32, 189–218.

Chernozhukov, V., I. Fernandez-Val, B. Melly, and K. Wuethrich (2018). Generic inference on quantile and quantile effect functions for discrete outcomes. Working Paper arXiv:1608.05142v4, arXiv.

Corradi, V., W. Distaso, and M. Fernandes (2019). Testing for jump spillovers without testing for jumps. Journal of the American Statistical Association, forthcoming.

Das, M., W. K. Newey, and F. Vella (2003). Nonparametric estimation of sample selection models. Review of Economic Studies 70(1), 33–58.

Delgado, M. and W. Gonzalez-Manteiga (2001). Significance testing in nonparametric regression based on the bootstrap. The Annals of Statistics 29(5), 1469–1507.

Escanciano, J. C., D. Jacho-Chavez, and A. Lewbel (2014). Uniform convergence of weighted sums of non- and semiparametric residuals for estimation and testing. Journal of Econometrics 178, 426–443.

Gronau, R. (1974). Wage comparisons: A selectivity bias. Journal of Political Economy 82, 1119–1143.

Guerre, E. and C. Sabbah (2012). Uniform bias study and Bahadur representation for local polynomial estimators of the conditional quantile function. Econometric Theory 28, 87–129.

Guggenberger, P. (2010a). The impact of a Hausman pretest on the asymptotic size of a hypothesis test. Econometric Theory 26, 369–382.

Guggenberger, P. (2010b). The impact of a Hausman pretest on the size of a hypothesis test: the panel data case. Journal of Econometrics 156(2), 337–343.

Hayfield, T. and J. Racine (2008). Nonparametric econometrics: The np package. Journal of Statistical Software 27(5).

He, X. and L. Zhu (2003). A lack-of-fit test for quantile regression. Journal of the American Statistical Association 98(464), 1013–1022.

Heckman, J. (1974). Shadow prices, market wages and labor supply. Econometrica 42, 679–694.

Heckman, J. (1979). Sample selection bias as a specification error. Econometrica 47, 153–161.

Huber, M. and B. Melly (2015). A test of the conditional independence assumption in sample selection models. Journal of Applied Econometrics 30, 1144–1168.

Jochmans, K. (2015). Multiplicative-error models with sample selection. Journal of Econometrics 184, 315–327.

Khan, S. and E. Tamer (2010). Irregular identification, support conditions, and inverse weight estimation. Econometrica 78(6), 2021–2042.

Kitagawa, T. (2010). Testing for instrument independence in the selection model. Unpublished manuscript, UCL.

Li, Q. and J. Racine (2008). Nonparametric estimation of conditional CDF and quantile functions with mixed categorical and continuous data. Journal of Business and Economic Statistics 26(4), 423–434.

Masry, E. (1996). Multivariate regression estimation: Local polynomial fitting for time series. Stochastic Processes and their Applications 65, 81–101.

Qu, Z. and J. Yoon (2015). Nonparametric estimation and inference on conditional quantile processes. Journal of Econometrics 185, 1–19.

Racine, J. (1993). An efficient cross-validation algorithm for window width selection for nonparametric kernel regression. Communications in Statistics 22(4), 1107–1114.

Van der Vaart, A. and J. Wellner (1996). Weak Convergence and Empirical Processes (1st ed.). Springer Series in Statistics. Springer Verlag.

Volgushev, S., M. Birke, H. Dette, and N. Neumeyer (2013). Significance testing in quantile regression.
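As a numerical complement to the proof of Theorem 2, the self-normalized second-step statistic can be sketched as follows. This is an illustrative stand-in only: the Epanechnikov kernel, the bandwidth choice, and the simplified normalization (which drops the $H^{\eta}\prod_{j=1}^{d_x}\xi_{j,n}$ scaling, since it cancels in the studentized ratio) are our own assumptions, not the paper's implementation.

```python
import numpy as np

def epanechnikov(v):
    # Kernel with support [-1, 1]: int K(v) dv = 1, int K(v)^2 dv = 3/5.
    return 0.75 * np.maximum(1.0 - v ** 2, 0.0)

def second_step_statistic(s, p_hat, u_hat_tau, tau, in_cell, h):
    """Self-normalized second-step statistic in the spirit of Lemma 3(i):
    kernel weights concentrate on observations with propensity score near one,
      num   = sum_i s_i (1{u_hat_i <= 0} - tau) 1{x_i in C} K((p_hat_i - 1)/h),
      denom = sqrt((int K^2) sum_i s_i (1{u_hat_i <= 0} - tau)^2 1{x_i in C}
                   K((p_hat_i - 1)/h))."""
    w = epanechnikov((p_hat - 1.0) / h)        # weight mass only near p = 1
    resid = (u_hat_tau <= 0).astype(float) - tau
    num = np.sum(s * resid * in_cell * w)
    denom = np.sqrt(0.6 * np.sum(s * resid ** 2 * in_cell * w))
    return num / denom if denom > 0 else 0.0

# Toy illustration: deterministic stand-in scores, random stand-in residuals.
n = 500
p_hat = np.linspace(0.01, 0.99, n)
u_hat_tau = np.random.default_rng(1).normal(size=n)
stat = second_step_statistic(np.ones(n), p_hat, u_hat_tau,
                             tau=0.5, in_cell=np.ones(n), h=0.2)
```

Only observations with stand-in score above 0.8 receive positive weight here, mimicking the 'identification at infinity' argument; in the paper the degree of irregular identification governs how fast this effective subsample shrinks.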