Better Lee Bounds
Vira Semenova*
August 31, 2020
Abstract
This paper develops methods for tightening Lee (2009) bounds on average causal effects when the number of pre-randomization covariates is large, potentially exceeding the sample size. These Better Lee Bounds are guaranteed to be sharp when few of the covariates affect selection and the outcome. If this sparsity assumption fails, the bounds remain valid. I propose inference methods that enable hypothesis testing in either case. My results rely on a weakened monotonicity assumption that only needs to hold conditional on covariates. I show that the unconditional monotonicity assumption that motivates traditional Lee bounds fails for the JobCorps training program. After imposing only conditional monotonicity, Better Lee Bounds are found to be much more informative than standard Lee bounds in a variety of settings.

*Email: [email protected]. I am grateful to Victor Chernozhukov, Michael Jansson, Patrick Kline, Anna Mikusheva, and Whitney Newey for their guidance and encouragement. I am thankful to Alberto Abadie, Chris Ackerman, Sydnee Caldwell, Denis Chetverikov, Ben Deaner, Mert Demirer, Jerry Hausman, Peter Hull, Tetsuya Kaji, Kevin Li, Elena Manresa, Rachael Meager, Francesca Molinari, Denis Nekipelov, Oles Shtanko, Cory Smith, Sophie Sun, Roman Zarate, and the participants at the MIT Econometrics Lunch for helpful comments.

1 Introduction
Randomized controlled trials are often complicated by endogenous sample selection and non-response. This problem occurs when treatment affects the researcher's ability to observe an outcome (a selection effect) in addition to the outcome itself (the causal effect of interest). For example, being randomized into a job training program affects both an individual's wage and employment status. As a result, wages in the treatment and control groups are not directly comparable, since wages only exist for employed individuals. A common way to estimate the average causal effect is to bound this effect from above and below, focusing on a partially latent group of subjects whose outcomes are observed regardless of their treatment status (the always-observed principal stratum, Frangakis and Rubin (2002), or the always-takers, Lee (2009)).

Seminal work by Lee (2009) leverages the monotonicity assumption to bound the average causal effect for always-takers. For example, if job training cannot deter employment, the Lee lower bound is the treatment-control difference in wages, where the top wages in the treated group are trimmed until the employment rates in both groups are equal. If pre-randomization covariates are available, Lee bounds can be tightened by averaging covariate-specific bounds over the always-takers' covariate distribution. However, it is hard or impossible to estimate sharp (i.e., the tightest possible) bounds, since Lee's method requires a positive number of treated and control outcomes for each covariate value. As a result, empirical researchers spend a lot of energy selecting and discretizing covariates, a process that is subjective, labor-intensive, and prone to erroneous inference.

In this paper, I propose a generalization of Lee bounds, called better Lee bounds, and provide theoretical, simulation, and empirical evidence that they substantially outperform standard Lee bounds.
First, better Lee bounds are based on a weaker monotonicity assumption that only needs to hold conditional on covariates. Specifically, each subject is allowed to have either a positive or negative selection response, as long as the direction of this response is identified by a covariate vector. In contrast, standard Lee bounds require the same direction of the treatment effect on selection for all subjects. Second, better Lee bounds are asymptotically sharp as long as few of the covariates affect selection and the outcome, permitting the total number of covariates under consideration to exceed the sample size. In contrast, standard Lee bounds are sharp only in a model that has a handful of covariates. Finally, better Lee bounds accommodate a broad class of machine learning techniques for estimating the conditional probability of treatment (i.e., the propensity score), overcoming a key historical limitation to the widespread adoption of Lee bounds in quasi-experiments.

As a first step towards sharpness, I represent each bound via a semiparametric moment equation that depends on the conditional outcome quantile and the conditional probability of selection. A naïve approach would be to estimate these functions by quantile and logistic series regressions. If the true functions are sufficiently smooth relative to the covariate vector's dimension, these estimators provide a good approximation to the true outcome quantile. However, the smoothness assumption implicitly restricts the number of covariates (Stone (1982)). This restriction is problematic for the JobCorps data set (Schochet et al. (2008)), which has 9,145 observations and 5,177 covariates, and calls for selecting covariates in a data-driven way.

The first main contribution of this paper is a method for estimating and conducting inference on sharp Lee bounds with a built-in model selection process based on modern machine learning techniques. For example, if few of the covariates affect selection and the outcome, ℓ₁-regularized logistic (Belloni et al. (2017), Belloni et al. (2016)) and quantile (Belloni and Chernozhukov (2013)) estimators deliver a good approximation to the true functions. An implicit cost of ℓ₁-regularization is a bias that converges more slowly than the parametric rate. To prevent the transmission of this bias into the bounds, I propose a Neyman-orthogonal (Neyman (1959)) moment equation for each bound. Leveraging Neyman-orthogonality and sample splitting ideas, my proposed better Lee bounds permit inference based on the standard normal approximation. The proposed bounds are straightforward to compute using the R software package leebounds, available at https://github.com/vsemenova/leebounds.

In settings where sparsity is not economically plausible, researchers can utilize sample splitting strategies to leverage machine learning techniques for model selection, although the results will not be as sharp. To account for the uncertainty generated by the choice of sample split, Chernozhukov et al. (2017) suggest generating several random splits and aggregating the lower and upper bounds over the various partitions. On the one hand, this approach aims at less sharp bounds and leads to conservative inference due to sample splitting. On the other hand, this approach is fully agnostic: it does not require any assumptions on the model selection procedure.

My main result has several extensions. First, I allow the outcome variable to be multi-dimensional and show that the sharp identified set for the treatment effect parameter is compact and convex.
Next, I derive an orthogonal moment equation for the identified set's boundary (i.e., its support function) and provide a large sample approximation that holds uniformly over the boundary. I also propose a weighted bootstrap procedure for conducting inference on the boundary. In contrast to conventional bootstrap techniques, my algorithm is faster to compute since, by virtue of orthogonality, only the second stage is repeated in the simulation. Second, better Lee bounds accommodate within-cluster dependence and panel data. Third, I derive better Lee bounds for the Intent-to-Treat and Local Average Treatment Effect parameters and provide a complete set of identification, estimation, and inference results. Finally, I provide inference methods accommodating an unknown propensity score in quasi-experiments.

The paper builds on a growing literature that incorporates modern regularized machine learning techniques into econometrics; see Mullainathan and Spiess (2017) for a review. A large body of this literature is devoted to establishing convergence properties of ℓ₁-regularized estimators (Belloni et al. (2016), Belloni and Chernozhukov (2013)), as well as to conducting debiased inference on parameters following Lasso model selection (Belloni et al. (2017), Belloni et al. (2014), Belloni et al. (2016), van de Geer et al. (2014), Javanmard and Montanari (2014), Zhang and Zhang (2014)). Leveraging the work of Belloni et al. (2017), Belloni et al. (2014), Belloni et al. (2016), Chernozhukov et al. (2016), and Chernozhukov et al. (2018), I derive an orthogonal moment equation for better Lee bounds and propose asymptotic theory for conducting inference on these bounds in both one- and multi-dimensional settings. The agnostic approach to inference is an extension of Chernozhukov et al. (2017)'s general machine learning approach to heterogeneous treatment effects, adapted for a partial identification problem.

In the final part of the paper, I estimate Lee bounds in three empirical applications. First, I study the effect of the JobCorps training program on wages and wage growth, using data from Schochet et al. (2008). After accounting for the differential JobCorps effect on employment, I find that the average JobCorps effect on the always-takers' week 90 wages is between 4.8% and 4.9%, and the effect on wage growth ranges between -11% and 11%. Thus, the average growth rate is 15% in the control status and ranges between 4% and 26% in the treated status. Second, I study the effect of private school subsidies on pupils' educational achievement, as in Angrist et al. (2002). I find that the voucher effect on Mathematics, Reading, and Writing is smaller than Angrist et al. (2002)'s original estimate that does not account for selection bias, by a factor of 0.75. Finally, I study the effect of a Medicaid lottery on applicants' self-reported healthcare utilization and health, as in Finkelstein et al. (2012). After accounting for non-response bias, I find that Medicaid exposure and insurance has had a positive effect on all measures of health, confirming Finkelstein et al. (2012)'s baseline results. Better Lee bounds attain near point-identification in all three applications. In contrast, conventional Lee bounds are too wide to determine the direction of the treatment effect in any of these settings.

The paper is organized as follows. Section 2 reviews basic Lee bounds and Lee's estimator under the standard monotonicity assumption. Section 3 presents evidence against unconditional monotonicity in the JobCorps training program. Section 4 establishes the asymptotic properties of better Lee bounds, assuming sparsity. Section 5 proposes an agnostic approach for conducting inference on Lee bounds when sparsity fails. Section 6 discusses extensions of my baseline framework to allow for a multi-dimensional outcome, intent-to-treat and local average treatment effect target parameters, clustered or panel data, and the case when the propensity score is unknown. Section 7 presents a simulation study based on JobCorps data. Section 8 presents empirical applications. Section 9 concludes. Appendix A contains additional tables and figures supporting the results from the main text. Appendix B contains supplementary theoretical statements. Appendix C contains proofs. Appendix D contains additional simulations. Appendix E defines JobCorps covariates and contains supplementary results for Section 3. Appendix F contains supplementary results for all empirical applications.
In this section, I review the Lee (2009) sample selection model and formally define Lee bounds. I describe the bounds' estimator and the confidence region for the identified set. I then discuss how to tighten Lee bounds by conditioning on baseline covariates.
I use the standard Rubin (1974) potential outcomes framework. Let D ∈ {0, 1} denote the treatment status, and let Y(1) and Y(0) denote the potential outcomes if an individual is treated or not, respectively. Likewise, let S(1) and S(0) denote the potential selection status. The observed data (D_i, X_i, S_i, S_i Y_i), i = 1, ..., N, consist of the treatment status D, a baseline covariate vector X, the selection status

S = D \cdot S(1) + (1 - D) \cdot S(0)

and the outcome

S \cdot Y = S \cdot (D \cdot Y(1) + (1 - D) \cdot Y(0))

for selected individuals. The object of interest is the average treatment effect (ATE)

\beta_0 = E[Y(1) - Y(0) \mid S(1) = 1, S(0) = 1]   (2.1)

for subjects who are selected into the sample regardless of treatment receipt (the always-takers).

ASSUMPTION 1 (Assumptions of Lee (2009)). The following statements hold.
(1) (Independence). The random vector (Y(1), Y(0), S(1), S(0), X) is independent of D.
(2) (Monotonicity). S(1) ≥ S(0) a.s.

Suppose Assumption 1 holds. By monotonicity, any outcome observed in the control group must belong to an always-taker. Thus, the always-takers' expected outcome in the control status is identified:

E[Y(0) \mid S(1) = 1, S(0) = 1] = E[Y \mid S = 1, D = 0].

In contrast, a treated outcome can be either an always-taker's outcome or a complier's outcome, but it is not possible to distinguish between the two types in the treated group. Nevertheless, by Assumption 1, the proportion of always-takers in the {D = 1, S = 1} group is identified as

p_0 = \Pr(S(1) = 1, S(0) = 1 \mid S = 1, D = 1) = \frac{\Pr(S = 1 \mid D = 0)}{\Pr(S = 1 \mid D = 1)}.   (2.2)

When the always-takers comprise the top p_0 outcome quantile in the treated group, the ATE (2.1) attains its largest possible value:

\bar\beta_U = E[Y \mid Y \geq Q(1 - p_0), D = 1, S = 1] - E[Y \mid D = 0, S = 1],

where Q(1 - p_0) is the level-(1 - p_0) outcome quantile in the treated selected group. The lower bound \bar\beta_L is defined analogously.
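These population bounds have direct sample analogs, formalized as the estimator (2.3)-(2.6) below. The following is a minimal illustrative sketch in Python; the function name and interface are mine, it caps the estimated trimming share at one (mirroring the capping discussed in Section 3), and it is not the paper's companion R package leebounds.

```python
import numpy as np

def lee_bounds(D, S, Y):
    """Basic Lee (2009) trimming bounds.

    D: 0/1 treatment, S: 0/1 selection, Y: outcome (used only where S == 1).
    Returns (lower, upper) bounds on the always-takers' ATE.
    """
    D, S, Y = map(np.asarray, (D, S, Y))
    # Trimming proportion: control selection rate over treated selection rate,
    # capped at one under monotonicity.
    p_hat = min(S[D == 0].mean() / S[D == 1].mean(), 1.0)
    y1 = Y[(D == 1) & (S == 1)]              # treated selected outcomes
    y0_mean = Y[(D == 0) & (S == 1)].mean()  # control selected mean
    # Trim the top (upper bound) or bottom (lower bound) of the treated group
    # until the retained share equals p_hat.
    lower = y1[y1 <= np.quantile(y1, p_hat)].mean() - y0_mean
    upper = y1[y1 >= np.quantile(y1, 1 - p_hat)].mean() - y0_mean
    return lower, upper
```

When selection is exogenous, the selection rates coincide, the estimated trimming share equals one, and both bounds collapse to the treated-control difference in mean outcomes, matching the point-identification discussion later in this section.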
Lee's estimator (\hat{\bar\beta}_L, \hat{\bar\beta}_U) is defined as follows:

\hat{\bar\beta}_U = \frac{\sum_{i=1}^N D_i S_i Y_i 1\{Y_i \geq \hat{Q}(1 - \hat{p})\}}{\sum_{i=1}^N D_i S_i 1\{Y_i \geq \hat{Q}(1 - \hat{p})\}} - \frac{\sum_{i=1}^N (1 - D_i) S_i Y_i}{\sum_{i=1}^N (1 - D_i) S_i},   (2.3)

\hat{\bar\beta}_L = \frac{\sum_{i=1}^N D_i S_i Y_i 1\{Y_i \leq \hat{Q}(\hat{p})\}}{\sum_{i=1}^N D_i S_i 1\{Y_i \leq \hat{Q}(\hat{p})\}} - \frac{\sum_{i=1}^N (1 - D_i) S_i Y_i}{\sum_{i=1}^N (1 - D_i) S_i},   (2.4)

\hat{Q}(u) = \min_{y \in \mathcal{Y}} \left\{ y : \frac{\sum_{i=1}^N D_i S_i 1\{Y_i \leq y\}}{\sum_{i=1}^N D_i S_i} \geq u \right\}, \quad u \in [0, 1],   (2.5)

\hat{p} = \frac{\sum_{i=1}^N S_i (1 - D_i) / \sum_{i=1}^N (1 - D_i)}{\sum_{i=1}^N S_i D_i / \sum_{i=1}^N D_i},   (2.6)

where \hat{p} and \hat{Q}(u) are the sample analogs of p_0 and Q(u). If selection is not exogenous (i.e., p_0 \neq 1), the quantile Q(p_0) needs to be well approximated by its empirical analog \hat{Q}(\hat{p}) in a large sample. A confidence region for the true identified set [\beta_L, \beta_U] that covers the set with a pre-specified probability 1 - \alpha takes the form

[\hat{\bar\beta}_L - N^{-1/2} \hat\Omega_{LL} c_{1-\alpha/2}, \; \hat{\bar\beta}_U + N^{-1/2} \hat\Omega_{UU} c_{1-\alpha/2}],   (2.7)

where \hat\Omega_{LL} and \hat\Omega_{UU} are estimates of the asymptotic standard deviations of \hat{\bar\beta}_L and \hat{\bar\beta}_U, respectively, and c_{1-\alpha/2} is the critical value based on the standard normal approximation. To conduct inference on the true parameter \beta_0, Imbens and Manski (2004) (IM) propose an adjustment of (2.7) that covers \beta_0 with a pre-specified probability.

The approach described above can be applied after conditioning on a vector X of baseline, or pre-randomization, covariates. Define the conditional trimming threshold as

p_0(x) = \frac{\Pr(S = 1 \mid D = 0, X = x)}{\Pr(S = 1 \mid D = 1, X = x)} = \frac{s(0, x)}{s(1, x)}, \quad x \in \mathcal{X}.   (2.8)

The conditional outcome quantile Q(u, x) in the treated group is implicitly defined by

\Pr(Y \leq Q(u, x) \mid D = 1, S = 1, X = x) = u, \quad u \in [0, 1], \; x \in \mathcal{X}.   (2.9)

The conditional upper bound is

\bar\beta_U(x) = E[Y \mid D = 1, S = 1, Y \geq Q(1 - p_0(x), x), X = x] - E[Y \mid D = 0, S = 1, X = x].

To aggregate \bar\beta_U(x) into an average, I need to reweight \bar\beta_U(x) by the probability mass function in the always-takers group:

\beta_U = \int_{x \in \mathcal{X}} \bar\beta_U(x) f(x \mid S(1) = 1, S(0) = 1) dx = \int_{x \in \mathcal{X}} \bar\beta_U(x) f(x \mid S = 1, D = 0) dx.   (2.10)

Lee (2009) has shown that (2.10) is a sharp (i.e., the smallest possible) upper bound on \beta_0:

\beta_0 \leq \beta_U \leq \bar\beta_U.   (2.11)

Algorithm 1
(Standard Lee bounds with covariates).
1. Partition the covariate space \mathcal{X} into J discrete cells {C_1, C_2, ..., C_J}.
2. Estimate the vector of cell-specific lower and upper bounds \{\hat{\bar\beta}_L(j), \hat{\bar\beta}_U(j)\}_{j=1}^J and the probability mass function \{\hat{f}(j \mid S = 1, D = 0)\}_{j=1}^J in the selected control group.
3. Estimate the bounds as

\hat\beta_L = \sum_{j=1}^J \hat{\bar\beta}_L(j) \hat{f}(j \mid S = 1, D = 0), \quad \hat\beta_U = \sum_{j=1}^J \hat{\bar\beta}_U(j) \hat{f}(j \mid S = 1, D = 0).   (2.12)

Algorithm 1 describes Lee's estimator with covariates. For the estimator (2.12) to be well-defined, each covariate group must contain both treated and control subjects, and a non-zero fraction of control subjects must be selected into the sample. Consequently, the estimator (2.12) can accommodate only coarse partitions of the covariate space. If the vector X contains many informative covariates, Lee (2009)'s covariate-based estimator will not be close to the sharp bound \beta_U in a large sample.

Lee (2009) argues that including covariates can lead to point identification in extreme cases. First, consider the case where selection is exogenous conditional on covariates X. Then, the conditional probability of selection must be the same in the treatment and control groups: s(1, x) = s(0, x) for all x \in \mathcal{X}. As a result, the trimming threshold p_0(x) = 1 for all x \in \mathcal{X}, and

\beta_L = \beta_0 = \beta_U.   (2.13)

Second, consider the case where the outcome is a deterministic function of the covariates. Then, the conditional quantile function Q(u, x) does not vary within covariate groups, and Q(p_0(x), x) = Q(1 - p_0(x), x) for all x \in \mathcal{X}. As a result, (2.13) holds. Thus, the covariates that explain most of the variation in either selection or outcome are likely to be the most useful for tightening the bounds.

In this section, I review the basics of Lee (2009)'s empirical analysis of the JobCorps training program and replicate Lee's results.
I then discuss how the direction of JobCorps' effect on employment differs with observed characteristics.

Lee (2009) studies the effect of winning a lottery to attend JobCorps, a federal vocational and training program, on applicants' wages. In the mid-1990s, JobCorps used lottery-based admission to assess its effectiveness. The control group of 5,977 applicants was essentially embargoed from the program for three years, while the remaining applicants (the treated group) could enroll in JobCorps as usual. The sample consists of 9,145 JobCorps applicants and has data on lottery outcome, hours worked, and wages for 208 consecutive weeks after random assignment. In addition, the data contain educational attainment, employment, recruiting experiences, household composition, income, drug use, arrest records, and applicants' background information. These data were collected as part of a baseline interview, conducted by Mathematica Policy Research (MPR) shortly after randomization (Schochet et al. (2008)). After converting applicants' answers to binary vectors and adding numeric demographic characteristics, I obtain a total of 5,177 raw baseline covariates, which are summarized in Section C.2.
Having access to baseline covariates X means that the monotonicity assumption can be tested. Using the notation of Section 2, let S correspond to employment and Y correspond to log wages. If monotonicity holds, the treatment-control difference in employment rates

\Delta(x) = s(1, x) - s(0, x) = \Pr(S = 1 \mid D = 1, X = x) - \Pr(S = 1 \mid D = 0, X = x), \quad x \in \mathcal{X},   (3.1)

must be either non-positive or non-negative for all covariate values. Consequently, it cannot be the case that

\Pr(\Delta(X) > 0) > 0 \text{ and } \Pr(\Delta(X) < 0) > 0.   (3.2)

My first exercise is to estimate s(1, x) and s(0, x) by a week-specific cross-sectional logistic regression

s(D, X) = \Lambda(X'\alpha + D \cdot X'\gamma),   (3.3)

where \Lambda(t) = \exp(t) / (1 + \exp(t)) is the logistic CDF, X is a vector of baseline covariates that includes a constant, D \cdot X is a vector of covariates interacted with treatment, and \alpha and \gamma are fixed vectors. I report the average treatment-control difference for the covariate groups {\Delta(x) > 0} and {\Delta(x) < 0} in Figure 1 and the fraction of subjects in the covariate group {\Delta(x) > 0} in Figure 2.

The second exercise is to test monotonicity without relying on the logistic approximation. For each week, I select a small number of discrete covariates and partition the sample into discrete cells C_j, j \in \{1, 2, ..., J\}, determined by covariate values. For example, one binary covariate corresponds to J = 2 cells. If monotonicity holds, the vector of cell-specific employment effects, \mu = (E[\Delta(X) \mid X \in C_j])_{j=1}^J, must be non-negative:

H_0: (-1) \cdot \mu \leq 0.   (3.4)

The test statistic for the hypothesis in equation (3.4) is

T = \max_{1 \leq j \leq J} \frac{(-1) \cdot \hat\mu_j}{\hat\sigma_j},   (3.5)

and the critical value is the self-normalized critical value of Chernozhukov et al. (2019).

Figure 1 shows the treatment-control difference in employment rates for applicant groups whose estimated employment effect is positive (black dots) or negative (gray dots) conditional on covariates. The fraction of applicants with a positive employment effect increases over time.
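The cell-level statistic (3.5) is simple to compute. Below is an illustrative simplified sketch: it uses unweighted cell means (the paper uses design weights) and, in place of the self-normalized critical value of Chernozhukov et al. (2019), a crude Bonferroni normal cutoff; the function name and interface are mine.

```python
import numpy as np
from statistics import NormalDist

def monotonicity_max_stat(D, S, cell, alpha=0.05):
    """Max t-statistic for H0: (-1) * mu <= 0, where mu_j is the cell-j
    treatment-control difference in employment (selection) rates."""
    tstats = []
    for j in np.unique(cell):
        s1 = S[(D == 1) & (cell == j)]
        s0 = S[(D == 0) & (cell == j)]
        mu_j = s1.mean() - s0.mean()          # cell-specific employment effect
        se_j = np.sqrt(s1.var(ddof=1) / len(s1) + s0.var(ddof=1) / len(s0))
        tstats.append(-mu_j / se_j)           # (-1) * mu_hat_j / sigma_hat_j
    T = max(tstats)
    # Bonferroni stand-in for the self-normalized critical value.
    crit = NormalDist().inv_cdf(1 - alpha / len(tstats))
    return T, crit
```

The null is rejected when T exceeds the critical value, i.e., when some cell shows a significantly negative employment effect despite the maintained sign normalization.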
Focusing on week 90, I find that a smaller chance of employment is associated with being female, black, receiving public assistance before RA, such as food stamps or other welfare, being raised in a family that has received welfare most or all of the time, being in fair (not excellent or good) health at the moment of RA, and smoking hashish or marijuana a few times each week. In addition, JobCorps is likely to hurt week 90 employment chances for subjects whose most recent arrest occurred less than 1 year before the baseline interview or who are on probation or parole at the moment of the interview. Week 90 is a special week since it is the first week where the average employment effect switches from negative to positive, and the only one out of five horizons where Lee found the average wage effect on the always-takers to be significant.
Figure 1: Treatment-control differences in employment rate by week.
Notes. The horizontal axis shows the number of weeks since random assignment. The vertical axis shows the treatment-control difference in employment rate. The black dots represent applicants whose conditional employment effect \Delta(x) is positive, and the gray dots their complement. (For each week, \Delta(x) is defined as in equation (3.1) and estimated as in equation (3.3).) The size of each dot is proportional to the fraction of applicants. Computations use design weights.

Figure 2 plots the fraction of subjects with a positive JobCorps effect on employment in each week (that is, the fraction of applicants in black dots in Figure 1). In the first weeks after random assignment, there is no evidence of a positive JobCorps effect on employment for any group. By the end of the second year (week 104), JobCorps increases employment for nearly half of the individuals, and this fraction continues to rise thereafter.

Figure 2: Fraction of JobCorps applicants with positive conditional employment effect by week.
Notes. The horizontal axis shows the number of weeks since random assignment. The vertical axis shows the fraction of applicants whose conditional employment effect \Delta(x) is positive. Following week 60, a week is shaded if the test statistic T exceeds the critical value at the p = 0.01 (dark gray) or p \in [0.01, 0.05) (light gray) significance level. For each week, \Delta(x) is defined in equation (3.1) and estimated as in equation (3.3), the null hypothesis is as in equation (3.4), the test statistic T is as in equation (3.5), and the test cells and critical values are as defined in Table E.9. Computations use design weights.

Figure 2 shows the results of testing the inequality in (3.4) for each week. The direction of the employment effect varies with socio-economic factors. For example, the applicants who received AFDC benefits during the 8 months before RA, or who belonged to median income and yearly earnings groups, experience a significantly positive (p ≤ 0.05) employment effect at weeks 60-89, although the average effect is significantly negative. As another example, the applicants who answered "1: Very important" to the question "How important was getting away from community on the scale from 1 (very important) to 3 (not important)?" and who smoke marijuana or hashish a few times each month experience a significantly negative (p ≤ 0.05) employment effect at weeks 117-152, despite the average effect being positive. Finally, at weeks 153-186, the average JobCorps effect is significantly negative for subjects whose most recent arrest occurred less than 12 months ago, despite the average effect being positive.

Table 1 replicates Lee's estimates of basic (Column (1)) bounds on the JobCorps effect on the wages of always-takers. In addition, I also compute the covariate-based bounds using the discretized predicted wage potential covariate that Lee proposed (Column (2)). Week 90 is the only horizon where Lee found the JobCorps effect on wages to be statistically significant. However, basic Lee bounds do not overlap with the covariate-based ones. Sharpness fails because one of the five covariate-specific trimming thresholds exceeds 1 and is capped at 0.999 to impose unconditional monotonicity. Capping corresponds to the researcher's belief that the covariate-specific threshold exceeded 1 due to sampling noise, the only belief consistent with unconditional monotonicity. Once this assumption is weakened, basic Lee bounds do not cover zero in any week (Table 3, Column 1).
In this section, I introduce better Lee bounds. Section 4.1 presents bounds under a weakened monotonicity assumption that only needs to hold conditional on covariates. Section 4.2 formulates the statistical assumptions on the data generating process and states asymptotic results under these assumptions.
ASSUMPTION 2 (Conditional monotonicity). The following statements hold.
(1) (Independence). The vector (Y(1), Y(0), S(1), S(0)) is independent of D conditional on X.
(2) (Monotonicity). There exists a partition of the covariate space \mathcal{X} = \mathcal{X}_{help} \sqcup \mathcal{X}_{hurt} such that S(1) \geq S(0) a.s. on \mathcal{X}_{help} and S(0) \geq S(1) a.s. on \mathcal{X}_{hurt}.

Assumption 2 allows the sign of the treatment effect on employment to vary along with covariates. A subject with covariate vector X belongs to the covariate group \mathcal{X}_{help} if and only if his treatment-control difference in selection rate is positive conditional on X:

X \in \mathcal{X}_{help} \Leftrightarrow p_0(X) \leq 1 \Leftrightarrow S(1) \geq S(0) \text{ a.s.},

where p_0(x) is defined in (2.8). When there are no covariates, Assumption 2 coincides with Assumption 1. The more covariates are available, the weaker the assumption is. The sharp lower and upper bounds are given in equation (B.12) in Appendix B.

ASSUMPTION 3 (Strong Overlap and Endogeneity). The following conditions hold.
(1) (Overlap). There exist 0 < \underline{s} < \bar{s} < 1 such that \underline{s} < s(d, x) < \bar{s} for any x \in \mathcal{X} and d \in \{0, 1\}.
(2) (Endogeneity). There exist a set \bar{\mathcal{X}} \subseteq \mathcal{X} with \Pr(X \in \mathcal{X} \setminus \bar{\mathcal{X}}) = 0 and an absolute constant \varepsilon > 0 such that \inf_{x \in \bar{\mathcal{X}}} |s(1, x) - s(0, x)| > \varepsilon.

Assumption 3 is the price of relaxing Assumption 1(2) to Assumption 2(2). It ensures that subjects are correctly assigned into \mathcal{X}_{help} and \mathcal{X}_{hurt} when the sample size is large, with high probability. To attain correct classification, a sufficient condition on the trimming threshold p_0(x) is to have its support bounded away from both zero and one. In particular, the trimming threshold p_0(x) cannot be equal to one (i.e., selection cannot be conditionally exogenous) with positive probability. Lee's model implicitly imposes Assumption 3, requiring that the covariate-specific trimming thresholds are bounded away from one for each discrete cell. Suppose Assumption 1 holds (i.e., \mathcal{X}_{hurt} = \emptyset). Assumption 3 is still required for sharpness, to ensure that there are enough data points above and below the quantile level u = 1 - p_0(x) to estimate the conditional quantile Q(u, x).
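Because Assumption 3(2) bounds |s(1, x) - s(0, x)| away from zero, the partition \mathcal{X}_{help} \sqcup \mathcal{X}_{hurt} can be recovered from the sign of the estimated selection contrast. A minimal sketch with a discrete covariate, using raw cell means in place of the paper's logistic or lasso first stage (names and interface are mine):

```python
import numpy as np

def classify_help_hurt(D, S, cell):
    """Assign each covariate cell to X_help (s(1,x) >= s(0,x)) or X_hurt
    (s(1,x) < s(0,x)) using cell-level selection rates."""
    labels = {}
    for j in np.unique(cell):
        s1_hat = S[(D == 1) & (cell == j)].mean()  # s_hat(1, x)
        s0_hat = S[(D == 0) & (cell == j)].mean()  # s_hat(0, x)
        labels[j] = "help" if s1_hat >= s0_hat else "hurt"
    return labels
```

Since the true contrast is bounded away from zero on each cell, the estimated sign agrees with the true sign with probability approaching one as the sample grows.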
Table 1: Estimated bounds on the JobCorps effect on log wages under monotonicity.

             Basic              Covariate-based
             (1)                (2)
Week 45      [-0.072, 0.140]    [-0.074, 0.127]
             (-0.097, 0.170)    (-0.096, 0.156)
             (-0.096, 0.168)    (-0.096, 0.155)
Week 90      [0.048, 0.049]     [0.036, 0.048]
             (0.011, 0.081)     (0.011, 0.075)
             (0.012, 0.081)     (0.012, 0.073)
Week 104     [0.017, 0.064]     [0.017, 0.054]
             (-0.020, 0.102)    (-0.009, 0.081)
             (-0.012, 0.095)    (-0.007, 0.079)
Week 135     [-0.007, 0.084]    [-0.001, 0.075]
             (-0.042, 0.113)    (-0.032, 0.103)
             (-0.037, 0.109)    (-0.028, 0.100)
Week 180     [-0.032, 0.087]    [-0.019, 0.080]
             (-0.063, 0.112)    (-0.048, 0.107)
             (-0.060, 0.109)    (-0.044, 0.104)
Week 208     [-0.020, 0.095]    [-0.014, 0.084]
             (-0.050, 0.118)    (-0.041, 0.109)
             (-0.047, 0.117)    (-0.039, 0.107)
Covariates   N/A                5

Notes. The sample (N = 9,145) and the time horizons are the same as in Lee (2009). Each panel reports estimated bounds (first row), the 95% confidence region for the identified set (second row), and the 95% Imbens and Manski (2004) confidence interval for the true parameter (third row). Column (1) reports basic Lee bounds. Column (2) reports covariate-based Lee bounds. All bounds assume that JobCorps discourages employment in week 45 and helps employment following week 90. The covariate in Column (2) is a linear combination of 28 baseline covariates, selected by Lee, given in Table E.1. The covariates are weighted by the coefficients from a regression of week 208 wages on all baseline characteristics in the control group. The five discrete groups are formed according to whether the predicted wage is within intervals defined by $6.75, $7, $7.50, and $8.50. Week 90 is highlighted in bold as the only week where Lee found a statistically significant effect on wages. Computations use design weights.

4.2 Estimation

In this subsection, I state the asymptotic results for better Lee bounds. For the sake of clarity, I derive the results below under Assumption 1 rather than Assumption 2. Appendix B contains the general-case derivations.

Suppose Assumption 1 holds. Applying Bayes' rule (see Lemma B.1), I derive a moment equation for \beta_U:

\beta_U = \mu_0^{-1} E\left[ \frac{D}{\Pr(D = 1)} \cdot S \cdot Y \cdot 1\{Y \geq Q(1 - p_0(X), X)\} - \frac{1 - D}{\Pr(D = 0)} \cdot S \cdot Y \right] = E[m_U(W, \xi_0)],   (4.1)

where W = (D, X, S, S \cdot Y) is the data vector, \mu_0 = \Pr(S = 1 \mid D = 0),

\xi_0 = \{s(1, x), s(0, x), Q(1 - p_0(x), x)\}

is the first-stage nuisance parameter, and m_U(W, \xi_0) is the moment function. The nuisance parameter \xi_0 contains the conditional probability of selection {s(1, x), s(0, x)} in both the treated and control status, and the conditional quantile function Q(u, x). For simplicity, the population parameters \Pr(D = 1), \Pr(D = 0) = 1 - \Pr(D = 1), and \Pr(D = 0, S = 1) are treated as known; their estimation does not conceptually affect the results.

ASSUMPTION 4 (First-Stage Rate of Selection Equation). There exist sequences of numbers \varepsilon_N = o(1), s_N = o(N^{-1/4}), and a sequence of sets \mathcal{S}_N such that the first-stage estimates \hat{s}(1, x) of the true function s(1, x) and \hat{s}(0, x) of the true function s(0, x) belong to \mathcal{S}_N with probability at least 1 - \varepsilon_N. Furthermore, the sets \mathcal{S}_N shrink sufficiently fast around the true functions:

\sup_{\tilde{s} \in \mathcal{S}_N} \left( E_X [\tilde{s}(1, X) - s(1, X)]^2 \right)^{1/2} \leq s_N = o(N^{-1/4}),   (4.2)

\sup_{\tilde{s} \in \mathcal{S}_N} \left( E_X [\tilde{s}(0, X) - s(0, X)]^2 \right)^{1/2} \leq s_N = o(N^{-1/4}).

Assumption 4 states that the functions s(1, x) and s(0, x) are estimated with sufficient quality. This is a classic assumption in the semiparametric literature (see, e.g., Newey (1994)).
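Before turning to the first-stage rates, it may help to see the plug-in version of (4.1) in code. The sketch below passes the nuisance estimates as callables; the interface is hypothetical, and this is the naive plug-in rather than the Neyman-orthogonal estimator developed in this paper.

```python
import numpy as np

def beta_upper_plugin(D, S, Y, X, s1_hat, s0_hat, q_hat):
    """Naive plug-in for the upper bound in (4.1):
    beta_U = mu0^{-1} E[ D/Pr(D=1) * S*Y*1{Y >= Q(1-p0(X), X)}
                         - (1-D)/Pr(D=0) * S*Y ],  mu0 = Pr(S=1 | D=0).
    s1_hat, s0_hat: x -> selection probabilities; q_hat: (u, x) -> quantile."""
    pD1 = D.mean()                      # Pr(D = 1)
    p0 = s0_hat(X) / s1_hat(X)          # trimming threshold p0(x), eq. (2.8)
    thresh = q_hat(1.0 - p0, X)         # conditional quantile Q(1 - p0(x), x)
    mu0 = S[D == 0].mean()              # Pr(S = 1 | D = 0)
    treated = D / pD1 * S * Y * (Y >= thresh)
    control = (1 - D) / (1 - pD1) * S * Y
    return (treated - control).mean() / mu0
```

A useful sanity check is the exogenous-selection case: when the two selection probabilities coincide, p_0(x) = 1, no trimming occurs, and the expression reduces to the selected-sample difference in means.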
Assumption 4 rules out any second-order effects of the estimation errors \hat{s}(1, X) - s(1, X) and \hat{s}(0, X) - s(0, X) on the bounds, and allows the researcher to focus on the first-order terms.

Focusing on the function s(1, x), a common approach to estimating s(1, x) is to consider a logistic approximation

s(1, x) = \Lambda(B(x)'\gamma_0) + r_0(x),   (4.3)

where \Lambda(\cdot) is the logistic CDF, B(x) = (B_1(x), B_2(x), ..., B_{p_S}(x))' is a vector of basis functions (e.g., polynomial series or splines), \gamma_0 \in \mathbb{R}^{p_S} is the pseudo-true value of the logistic parameter, and r_0(x) is its approximation error.

Primitive Condition 1 (Smooth Selection Model). The function s(1, x) is continuously differentiable of order \kappa, where \kappa is sufficiently large relative to \dim(X).

Primitive Condition 1 is a low-level sufficient condition for Assumption 4. If this condition holds, the logistic series estimator of Hirano et al. (2003) takes the form

\hat\gamma_{LSE} = \arg\min_{\gamma \in \mathbb{R}^{p_S}} \ell(\gamma),   (4.4)

where \ell(\gamma) is the logistic (negative log-) likelihood function

\ell(\gamma) = \frac{1}{N} \sum_{i=1}^N 1\{D_i = 1\} \left( \log(1 + \exp(B(X_i)'\gamma)) - S_i B(X_i)'\gamma \right).   (4.5)

If Primitive Condition 1 holds, plugging in \hat\gamma = \hat\gamma_{LSE} delivers an estimator

\hat{s}(1, x) = \Lambda(B(x)'\hat\gamma), \quad x \in \mathcal{X},   (4.6)

which converges at the rate s_N = (p_S / N)^{1/2} = o(N^{-1/4}).

When \dim(X) \geq \log N, a consistent estimate of a smooth function does not exist in the general case (Stone (1982)). To make progress, we need alternative assumptions on the structure of the function s(1, x). One possible assumption is approximate sparsity, which requires that a few of the basis functions in the vector B(x) can approximate s(1, x) sufficiently well.

Primitive Condition 2 (Approximately Sparse Selection Model).
There exists a vector $\gamma \in \mathbb{R}^{p_S}$ with only $s_\gamma$ non-zero coordinates such that the approximation error $r(x)$ in (4.3) decays sufficiently fast relative to the sampling error:

$$\left( \frac{1}{N} \sum_{i=1}^N r^2(X_i) \right)^{1/2} \lesssim_P \sqrt{\frac{s_\gamma \log p_S}{N}} =: s_N.$$

Primitive Condition 2 is a low-level sufficient condition for Assumption 4. If this condition holds, the $\ell_1$-regularized logistic estimator of Belloni et al. (2017) takes the form

$$\hat{\gamma}_{\ell_1} = \arg\min_{\gamma \in \mathbb{R}^{p_S}} \ell(\gamma) + \lambda \|\gamma\|_1, \quad (4.7)$$

where the penalty term $\lambda \|\gamma\|_1$ prevents overfitting in high dimensions by shrinking the estimate toward zero. Belloni et al. (2017) provides practical choices of the penalty $\lambda$ that provably guard against overfitting. An inherent cost of applying the penalty $\lambda$ is regularization, or shrinkage, bias that does not vanish faster than the root-$N$ rate. To prevent this bias from affecting the second stage, I construct a Neyman-orthogonal moment equation for each bound.

If Primitive Condition 2 holds with a sufficiently small $s_\gamma$, both the lasso-logistic estimator and its post-penalized analog satisfy Assumption 4 with

$$s_N = \sqrt{\frac{s_\gamma \log p_S}{N}} = o(N^{-1/4}).$$

In contrast to the smooth model, the convergence rate $s_N$ depends on $\log p_S$ rather than $p_S$ itself, permitting the number of covariates under consideration to exceed the sample size.

ASSUMPTION 5 (Quantile First-Stage Rate: One-Dimensional Case). Let $\bar{U}$ be a compact set in $(0,1)$ containing the support of $p(X)$ and $1 - p(X)$. There exist a rate $q_N = o(N^{-1/4})$, a sequence of numbers $\varepsilon_N = o(1)$, and a sequence of sets $\mathcal{Q}_N$ such that the first-stage estimate of the quantile function $Q(u,x): [0,1] \times \mathcal{X} \to \mathbb{R}$ belongs to $\mathcal{Q}_N$ with probability at least $1 - \varepsilon_N$. Furthermore, the set $\mathcal{Q}_N$ shrinks sufficiently fast around the true value $Q_0(u,x)$ uniformly on $\bar{U}$:

$$\sup_{Q \in \mathcal{Q}_N} \sup_{u \in \bar{U}} \left( E_X (Q(u,X) - Q_0(u,X))^2 \right)^{1/2} \leq q_N = o(N^{-1/4}).$$

Assumption 5 is an analog of Assumption 4 for the quantile function.
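A minimal sketch of the penalized estimator (4.7) via proximal gradient descent (the penalty level `lam` is fixed by hand here, whereas Belloni et al. derive data-driven choices; setting `lam=0` recovers the unpenalized series fit (4.4)):

```python
import numpy as np

def lasso_logistic(B, S, lam, steps=3000, lr=0.2):
    """l1-penalized logistic selection equation, in the spirit of eq. (4.7),
    solved by proximal gradient descent (illustrative, not production code)."""
    n, p = B.shape
    g = np.zeros(p)
    for _ in range(steps):
        pr = 1.0 / (1.0 + np.exp(-B @ g))        # fitted selection probabilities
        g = g - lr * B.T @ (pr - S) / n          # gradient step on the avg. logistic loss
        g = np.sign(g) * np.maximum(np.abs(g) - lr * lam, 0.0)  # soft-thresholding
    return g
```

The soft-thresholding step is what drives small coefficients exactly to zero, producing the sparsity that Primitive Condition 2 exploits.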
It states that the conditional quantile function $Q(u,x)$ is estimated with sufficient quality, as measured by the mean-square convergence rate $q_N = o(N^{-1/4})$, and is a classic assumption in the nonparametric literature. A classic approach is to estimate $Q(u,x)$ by quantile regression. When $Q(u,x)$ is a sufficiently smooth function of $x$, the quantile series estimator of Belloni et al. (2019) converges at rate

$$q_N = \sqrt{\frac{p_Q}{N}} = o(N^{-1/4})$$

uniformly over $\bar{U}$, where $p_Q$ is the number of series terms used to approximate $Q(u,x)$. Likewise, the $\ell_1$-penalized quantile regression estimator of Belloni and Chernozhukov (2013) satisfies Assumption 5 with $q_N = \sqrt{s_Q \log p_Q / N} = o(N^{-1/4})$ under the choice of $\lambda$ proposed in Belloni and Chernozhukov (2013) if the model is sufficiently sparse. Appendix B.2 contains a more technical discussion of sufficient primitive conditions on the conditional quantile function.

In this section, I discuss two well-known ideas: cross-fitting and Neyman-orthogonality (Neyman (1959)). Combining these two ideas, I propose a better Lee bounds estimator of the sharp Lee bounds $(\beta_L, \beta_U)$.

Cross-fitting.
When the first-stage parameter $\xi$ is estimated by $\ell_1$-regularized methods, sample splitting is not required for the asymptotic results, but it accommodates larger sparsity indices $s_\gamma$ and $s_Q$ (Chernozhukov et al. (2018)). For other machine learning techniques, I rely on the cross-fitting idea of Chernozhukov et al. (2018) to establish theoretical guarantees.

Neyman-orthogonality.
A two-stage estimation procedure is orthogonal if the second stage is insensitive (formally, orthogonal) to the first-stage parameter (Neyman (1959)). Lee's moment equation (4.1) is not orthogonal. In particular, its derivative with respect to a local parametrization of the first-stage nuisance parameter $\xi$ at its true value $\xi_0$ is not equal to zero:

$$\partial_\xi E\, m_U(W, \xi_0)[\hat{\xi} - \xi_0] \neq 0. \quad (4.8)$$

As a result, the biased estimation error $\hat{\xi} - \xi_0$ translates into bias in the moment equation (4.1). To prevent transmission of this bias into the second stage, I derive an orthogonal moment equation for each of the lower and upper bounds. The proposed moment equations, given in equations (B.23)-(B.24), obey the zero-derivative property

$$\partial_\xi E\, g_U(W, \xi_0)[\hat{\xi} - \xi_0] = 0. \quad (4.9)$$

As a result, the first-stage bias does not affect the asymptotic distribution of the bounds under Assumptions 4 and 5.

Theorem 1 (Asymptotic Theory for Sharp Bounds). Suppose Assumptions 2-5 hold. In addition, if $\mathcal{X}^{help} \neq \emptyset$ and $\mathcal{X}^{hurt} \neq \emptyset$, suppose $\hat{s}(d,x)$ converges to $s(d,x)$ uniformly over $\mathcal{X}$ for each $d \in \{0,1\}$. Then, the better Lee bounds estimator $(\hat{\beta}_L, \hat{\beta}_U)$ is consistent and asymptotically normal,

$$\sqrt{N} \begin{pmatrix} \hat{\beta}_L - \beta_L \\ \hat{\beta}_U - \beta_U \end{pmatrix} \Rightarrow N(0, \Omega),$$

where $\Omega$ is a positive-definite covariance matrix,

$$\Omega = \begin{pmatrix} E\, g_L^2(W, \xi_0) & E\, g_L(W, \xi_0)\, g_U(W, \xi_0) \\ E\, g_L(W, \xi_0)\, g_U(W, \xi_0) & E\, g_U^2(W, \xi_0) \end{pmatrix}, \quad (4.10)$$

that can be estimated by a sample analog.

Theorem 1 delivers a root-$N$ consistent, asymptotically normal estimator of $(\beta_L, \beta_U)$, assuming the conditional probability of selection and the conditional quantile function are estimated at sufficiently fast rates. In particular, this assumption is satisfied when few of the covariates affect selection and the outcome. By orthogonality, the first-stage estimation error does not contribute to the total uncertainty of the two-stage procedure.
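The gain from orthogonality can be seen in a toy missing-data example (this is the standard AIPW score for a mean with a known propensity, used purely as an illustration; it is not the paper's $g_U$): a constant first-stage error $\delta$ shifts a plug-in moment one-for-one, but leaves the orthogonal moment unchanged.

```python
import numpy as np

# Tiny deterministic example: known propensity e = 0.5, exactly half treated.
X = np.array([0., 1., 0., 1.])
D = np.array([1, 0, 0, 1])
Y = np.array([1.0, 0.0, 0.0, 2.0])
e = 0.5

def plugin(mhat):
    # non-orthogonal: simply average the fitted regression mhat(X)
    return np.mean(mhat)

def orthogonal(mhat):
    # AIPW / Neyman-orthogonal score for E[Y(1)] with known propensity:
    # the correction term D/e * (Y - mhat) offsets first-stage errors
    return np.mean(mhat + D / e * (Y - mhat))

mhat = np.array([0.9, 0.1, 0.9, 1.9])   # some first-stage fit of E[Y|D=1,X]
delta = 0.3                             # a constant first-stage error
```

Because the sample average of $D/e$ equals one here, shifting `mhat` by `delta` moves `plugin` by exactly `delta` while `orthogonal` is unaffected, which is the zero-derivative property (4.9) in miniature.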
As a result, the asymptotic variance (4.10) does not have any additional terms due to the first-stage estimation.

Remark. Unlike standard Lee bounds, $\hat{\beta}_L$ and $\hat{\beta}_U$ are not ordered by construction. Chernozhukov et al. (2013a) suggests that sorting the estimated bounds can only improve the convergence rate: the sorted estimator $(\tilde{\beta}_L, \tilde{\beta}_U)$,

$$\tilde{\beta}_L = \min(\hat{\beta}_L, \hat{\beta}_U), \quad \tilde{\beta}_U = \max(\hat{\beta}_L, \hat{\beta}_U),$$

must converge at least as fast as the original estimator $(\hat{\beta}_L, \hat{\beta}_U)$. Likewise, sorting the confidence region continues to guarantee coverage if the original (unsorted) confidence region guarantees coverage. However, the Imbens and Manski (2004) local superefficiency assumption no longer holds (Stoye (2009)). Instead, one can use Stoye (2009)'s modification of the IM confidence interval,

$$CI^{Stoye}_\alpha = \begin{cases} \left[ \hat{\beta}_L - \hat{\Omega}_{LL}^{1/2} \dfrac{c_{SL,\alpha}}{\sqrt{N}},\; \hat{\beta}_U + \hat{\Omega}_{UU}^{1/2} \dfrac{c_{SU,\alpha}}{\sqrt{N}} \right], & \hat{\beta}_L - \hat{\Omega}_{LL}^{1/2} \dfrac{c_{SL,\alpha}}{\sqrt{N}} \leq \hat{\beta}_U + \hat{\Omega}_{UU}^{1/2} \dfrac{c_{SU,\alpha}}{\sqrt{N}}, \\ \emptyset, & \text{otherwise}, \end{cases}$$

where $c_{SL,\alpha}$ and $c_{SU,\alpha}$ are Stoye's critical values. If $\hat{\beta}_U$ is too far below $\hat{\beta}_L$, the interval $CI^{Stoye}_\alpha$ is empty, indicating that Assumptions 4 and 5 are likely to be violated.

In this section, I relax the sparsity assumption of Section 4. Suppose Assumption 1 holds. Let $X_A$ be a subvector of the covariates $X$. By Lemma B.1, $\beta_{AL} \leq \beta \leq \beta_{AU}$, where $[\beta_{AL}, \beta_{AU}]$ is the sharp identified set for $\beta$ in the model $(D, X_A, S, S \cdot Y)$ in which only the covariates $X_A$ are observed. To select covariates in a data-driven way, randomly split the full sample into an auxiliary sample $A$ and a main sample $M$. In the auxiliary sample, select the covariates by an arbitrary machine learning method. In the main sample, define the target bounds as the sharp bounds in the model $(D, X_A, S, S \cdot Y)$ with the selected covariates $X_A$.
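The split-then-select step can be sketched as follows; the `select` and `estimate` callables are placeholders for the user's machine learning method and the bound estimator of Section 4, not functions defined in the paper:

```python
import numpy as np

def split_and_select(n, select, estimate, seed=0):
    """Auxiliary/main sample splitting for data-driven covariate selection:
    choose covariates on the auxiliary sample A with any ML method, then
    estimate the bounds on the main sample M only."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    A, M = idx[: n // 2], idx[n // 2:]
    chosen = select(A)           # e.g., the lasso support on the auxiliary sample
    return estimate(M, chosen)   # bounds computed on the main sample only
```

Because selection uses only $A$, the bounds computed on $M$ are valid conditional on the auxiliary sample, which is what makes the Column-style conditional inference below go through.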
Assuming Primitive Condition 1 holds for the selected model, let $\hat{\beta}_{AL}$ and $\hat{\beta}_{AU}$ be the estimates of $\beta_{AL}$ and $\beta_{AU}$ based on logistic and quantile series estimates of the first-stage nuisance parameters. Conditional on the auxiliary sample, the $(1-\alpha)$ confidence region for $[\beta_{AL}, \beta_{AU}]$ is

$$[L_A, U_A] = \left[ \hat{\beta}_{AL} - |M|^{-1/2} \hat{\Omega}_{A,LL}^{1/2} c_{\alpha/2},\; \hat{\beta}_{AU} + |M|^{-1/2} \hat{\Omega}_{A,UU}^{1/2} c_{1-\alpha/2} \right].$$

Variational Inference.
Different splits $(A, M)$ of the sample $\{1, 2, \ldots, N\}$ yield different target bounds $(\beta_{AL}, \beta_{AU})$ and different approximate distributions of these bounds. If we take the splitting uncertainty into account, the pair of bounds $(\beta_{AL}, \beta_{AU})$ is random conditional on the full data sample. In practice, one may want to generate several random splits and aggregate the resulting bounds over the partitions. For reporting purposes, I use Chernozhukov et al. (2017)'s adjusted point estimator

$$\hat{\beta}_L = \underline{\mathrm{Med}}[\hat{\beta}_{AL} \mid \mathrm{Data}], \quad \hat{\beta}_U = \overline{\mathrm{Med}}[\hat{\beta}_{AU} \mid \mathrm{Data}].$$

To quantify the uncertainty of the random split, I use Chernozhukov et al. (2017)'s adjusted confidence interval of level $1-\alpha$:

$$[L, U] = \left[ \underline{\mathrm{Med}}[L_A \mid \mathrm{Data}],\; \overline{\mathrm{Med}}[U_A \mid \mathrm{Data}] \right], \quad (5.1)$$

where $\underline{\mathrm{Med}}(X) = \inf\{x \in \mathbb{R} : P_X(X \leq x) \geq 1/2\}$ is the lower median and $\overline{\mathrm{Med}}(X) = \sup\{x \in \mathbb{R} : P_X(X \geq x) \geq 1/2\}$ is the upper median. A more formal analysis of the agnostic approach is given in Section B.3.

Extensions
In this section, I discuss extensions of my basic setup. In Section 6.1, I extend the setup to allow for a multi-dimensional outcome, using the standardized treatment effect as the main motivation for this extension. Sections 6.2 and 6.3 derive better Lee bounds for the Intent-to-Treat and Local Average Treatment Effect parameters. Sections 6.4 and 6.5 generalize my results to settings with multiple observations for each cross-sectional unit. Section 6.6 considers the case when the propensity score is unknown. The results I present in the main text summarize the insights from a formal analysis in Appendix B.
In this section, I generalize the Lee (2009) sample selection model to allow for a multi-dimensional outcome variable. As before, the observed sample $(D_i, X_i, S_i, S_i Y_i)_{i=1}^N$ consists of the realized treatment $D$, the vector of baseline covariates $X$, the selection outcome $S = D \cdot S(1) + (1-D) \cdot S(0)$, and outcomes for the selected subjects $S \cdot Y = S \cdot (D \cdot Y(1) + (1-D) \cdot Y(0))$, where $S \in \mathbb{R}^d$ and $Y \in \mathbb{R}^d$ are $d$-vectors. The parameter of interest is the average treatment effect

$$\beta = E[Y(1) - Y(0) \mid S(1) = S(0) = 1] \quad (6.1)$$

for the group of subjects who are selected into the sample for each scalar outcome regardless of treatment status.

Proposition B.4 shows that the sharp identified set for $\beta$ is compact and convex, and thus can be summarized by its projections on various directions of economic interest. For any point $q$ on the unit sphere, the largest admissible value $\sigma(q)$ of $q'\beta$ consistent with the observed data is commonly referred to as the support function. In Appendix B, I provide a Gaussian approximation for the support function process that is uniform over the unit sphere and propose a Bayes bootstrap procedure to conduct simultaneous inference uniformly over the sphere. Examples 1 and 2 explain the use of the support function in applied work.

Example 1. Wage Growth
Let $S = (S_{t_1}, S_{t_2})$ be a vector of employment outcomes for $t \in \{t_1, t_2\}$, $Y = (Y_{t_1}, Y_{t_2})$ be a vector of log wages, and $\beta = (\beta_{t_1}, \beta_{t_2})$ be the effects on log wage in time periods $t_1$ and $t_2$. The sharp upper and lower bounds on the average growth effect from $t_1$ to $t_2$, $\beta_{t_2} - \beta_{t_1}$, are given by

$$[-\sqrt{2}\, \sigma(-q),\; \sqrt{2}\, \sigma(q)], \quad q = (-1/\sqrt{2},\; 1/\sqrt{2}). \quad (6.2)$$

Example 2. Standardized Treatment Effect
Let $Y$ be a vector of related outcomes and $\beta$ be a vector of average effects. A common approach for summarizing findings is to consider the standardized treatment effect

$$STE = \frac{1}{d} \sum_{j=1}^d \frac{\beta_j}{\zeta_j}, \quad (6.3)$$

where $\zeta_j$ is the standard deviation of outcome $j$ in the control group. The sharp lower and upper bounds on the STE are given by

$$[-C_\zeta\, \sigma(-q),\; C_\zeta\, \sigma(q)], \quad (6.4)$$

where $q = \bar{\zeta}/\|\bar{\zeta}\|$ with $\bar{\zeta} = (1/\zeta_1, \ldots, 1/\zeta_d)'$, and $C_\zeta = \|\bar{\zeta}\|/d$.

Example 2 demonstrates the use of support functions when the direction $q$ is a population parameter. In contrast to Example 1, the direction $q$ is unknown and needs to be estimated. Therefore, it is important that the support function estimator can be approximated uniformly in some neighborhood of $q$ in addition to the point $q$ itself. I establish this approximation in Theorem B.7 and give inference methods for hypothesis testing in Theorem B.9.

6.2 Intent-to-Treat

I consider the standard Intent-to-Treat parameter. Let $\bar{X}$ be a vector of stratification covariates (i.e., fixed effects), so that $D$ is randomly assigned conditional on $\bar{X}$. In addition, $\bar{X}$ is a saturated vector, and $X$ is a full covariate vector that includes $\bar{X}$. The object of interest is the intent-to-treat effect (ITT) $\beta_1$ in the regression

$$Y = \beta_0 + \beta_1 D + \bar{X}'\beta_2 + \varepsilon, \quad S(1) = S(0) = 1, \quad (6.5)$$

where $\beta_1$ is the main coefficient of interest, interpreted as the average causal effect of being offered treatment on an always-taker. Since only one of $S(1)$ and $S(0)$ is observed, the parameter $\beta_1$ is not point-identified. For the sake of simplicity, suppose $\mathcal{X}^{hurt} = \emptyset$. A sharp upper bound on $\beta_1$ is the regression coefficient on $D$ in the truncated regression on the selected outcomes, where the bottom outcomes in the treated group are trimmed until the selection rates in the $D=1$ and $D=0$ groups are equal conditional on $X$ (Proposition B.13, Appendix B). Furthermore, any subvector of $X$ that contains $\bar{X}$ corresponds to another valid bound on $\beta_1$ that may not be sharp. However, a subvector of $X$ that does not contain $\bar{X}$ does not correspond to a valid bound.
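In the simplest case with no covariates beyond a constant, the truncated regression collapses to a trimmed difference in means; the following is a hypothetical sketch of that special case (Proposition B.13 covers the general case with $\bar{X}$):

```python
import numpy as np

def itt_upper_bound(D, Y, S):
    """Trimmed-mean upper bound on the ITT effect, no covariates (sketch).
    Trims the *bottom* outcomes of the selected treated units until the
    treated selection rate matches the control selection rate; assumes
    monotonicity, i.e., Pr(S=1|D=0) <= Pr(S=1|D=1)."""
    sel_t = S[D == 1].mean()                 # Pr(S=1|D=1)
    sel_c = S[D == 0].mean()                 # Pr(S=1|D=0)
    p = sel_c / sel_t                        # share of treated outcomes to keep
    y_t = np.sort(Y[(D == 1) & (S == 1)])
    k = max(1, int(round(p * len(y_t))))
    y_t_trim = y_t[-k:]                      # keep the top p-fraction
    return y_t_trim.mean() - Y[(D == 0) & (S == 1)].mean()
```

With only a treatment dummy on the right-hand side, the regression coefficient on $D$ equals this difference in means, so the sketch matches the truncated-regression description above.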
Lemma B.14 in Appendix B extends the orthogonal and agnostic approaches to the Intent-to-Treat parameter.

I consider the Local Average Treatment Effect parameter defined in Imbens and Angrist (1994). Let $Z$ be the randomly assigned treatment offer and $D$ the endogenous treatment take-up, with $\bar{X}$, $X$, $S$, and $Y$
as in the previous section. Suppose the potential selection outcome is fully determined by the value of the instrument:

$$S(1,z) = S(0,z) = S(z) \quad \text{for any } z \in \{0,1\}.$$

The object of interest is the average treatment effect on the subjects who comply with the treatment offer and select into the sample regardless of the treatment offer. Finally, suppose being a complier is independent of being an always-taker conditional on all observed covariates. Then, the target parameter is identified as the coefficient $\pi_1$ in the two-stage least squares (2SLS) regression

$$Y = \pi_0 + \pi_1 D + \bar{X}'\pi_2 + \nu, \quad S(1) = S(0) = 1, \quad (6.6)$$

where the first-stage equation is

$$D = \delta_0 + \delta_1 Z + \bar{X}'\delta_2 + \zeta, \quad S(1) = S(0) = 1. \quad (6.7)$$

As shown in Proposition B.18 in Appendix B, a sharp upper bound on $\pi_1$ is the 2SLS coefficient in the truncated regression on the selected outcomes, where the bottom outcomes in the treated group are trimmed until the selection rates in the $(D=1, Z=1)$ and $(D=0, Z=0)$ groups are the same conditional on $X$. Furthermore, any subvector of $X$ that contains $\bar{X}$ corresponds to another valid bound on $\pi_1$ that may not be sharp. However, a subvector of $X$ that does not contain $\bar{X}$ may not correspond to a valid bound.

Suppose that the researcher observes data sampled from $G$ clusters: $\{W_{ig},\; i = 1, \ldots, N_g,\; g = 1, \ldots, G\}$. Each cluster size is non-random, and $1 \leq N_g \leq \bar{N} < \infty$ for a constant $\bar{N}$ that does not depend on the sample size $G$. According to Chiang (2020), the post-lasso-logistic estimator with the cluster-robust penalty parameter satisfies the analog of Assumption 4 with rate $s_G = o(G^{-1/4})$. Generalizing Chiang (2020)'s arguments, one can establish the analog of Assumption 5 with $q_G = o(G^{-1/4})$. Under these assumptions, the better Lee bounds estimator is consistent and asymptotically normal.
The cluster-robust estimator of the asymptotic variance $\Omega$ in Theorem 1 takes the form

$$\hat{\Omega}^{cr} = \frac{1}{G} \sum_{g=1}^G \begin{pmatrix} \left( \sum_{c=1}^{n_g} g_L(W_{gc}, \hat{\xi}) \right)^2 & \left( \sum_{c=1}^{n_g} g_L(W_{gc}, \hat{\xi}) \right) \left( \sum_{j=1}^{n_g} g_U(W_{gj}, \hat{\xi}) \right) \\ \left( \sum_{c=1}^{n_g} g_L(W_{gc}, \hat{\xi}) \right) \left( \sum_{j=1}^{n_g} g_U(W_{gj}, \hat{\xi}) \right) & \left( \sum_{c=1}^{n_g} g_U(W_{gc}, \hat{\xi}) \right)^2 \end{pmatrix}. \quad (6.8)$$

Consider a setting where the units $(D_i, S_{it}, S_{it} Y_{it}, X_i)$, $i = 1, \ldots, N$, are observed over $t = 1, 2, \ldots, T$ time periods. Using the notation of Section 6.1, let $S_i := (S_{i1}, S_{i2}, \ldots, S_{iT})$ be a vector of selection indicators for an individual $i$ and $Y_i := (Y_{i1}, Y_{i2}, \ldots, Y_{iT})$ be a vector of outcomes. The target parameter $\beta$ is the average treatment effect

$$\beta = E[Y(1) - Y(0) \mid S(1) = S(0) = 1]$$

for subjects who are selected into the sample in each period regardless of treatment status. In contrast to the cross-sectional setup of Section 6.1, it is important to allow the observations to be correlated over time within each individual. In this case, one can group the observations over time into clusters and use the cluster-robust standard error derived in Section 6.4 with $G = N$ and $N_g = T$ for each $g \in \{1, 2, \ldots, N\}$.

In this section, I extend better Lee bounds to accommodate the case when the conditional probability of treatment is unknown. The orthogonal moment equation (B.18) involves an additional nuisance parameter,

$$\left\{ E[Y \mid S=1, D=0, X],\; E[Y \mid Y \leq Q(u,X), S=1, D=d, X],\; d \in \{0,1\} \right\},$$

that needs to be estimated. If the sparsity assumption holds, the function $E[Y \mid S=1, D=0, X]$ can be estimated by Belloni et al. (2017)'s linear lasso estimator. The truncated conditional mean function $E[Y \mid Y \leq Q(u,X), S=1, D=d, X]$, $u \in \{p(X), 1-p(X)\}$, can be estimated by Chernozhukov et al. (2018)'s automatic debiasing approach, aimed at generic nuisance functions that emerge as a result of orthogonalization.
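Returning to the clustered setting, the cluster-robust formula (6.8) amounts to summing the orthogonal scores within each cluster before taking outer products; a sketch with hypothetical score vectors `gL`, `gU`:

```python
import numpy as np

def cluster_cov(gL, gU, cluster):
    """Cluster-robust covariance in the spirit of (6.8): sum the scores
    within clusters, then average the 2x2 outer products across the G
    clusters (scores are assumed already centered)."""
    ids = np.unique(cluster)
    G = len(ids)
    sums = np.array([[gL[cluster == c].sum(), gU[cluster == c].sum()] for c in ids])
    return sums.T @ sums / G
```

Summing within clusters first is what lets arbitrary within-cluster correlation of the scores pass through to the variance estimate.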
If few of the covariates affect selection and the outcome, the assumptions of Chernozhukov et al. (2018)'s setting hold, and the proposed estimate obeys an analog of Assumptions 4-5.

In this section, I compare the performance of the basic, naive, and better Lee methods, building a simulation exercise on the JobCorps data set. The vector $X = (1, X_1, X_2)$ consists of a constant and two binary indicators, one for female gender ($X_1$) and one for getting away from home being a very important motivation for joining JobCorps ($X_2$), taken from the JobCorps data. An artificial treatment variable $D$ is determined by an unbiased coin flip. A binary employment indicator $S$ is

$$S = 1\{X'\alpha + D \cdot X'\gamma + U > 0\}, \quad (7.1)$$

where $U$ is an independently drawn logistic shock. Likewise, log wages are generated according to the model

$$Y = (1, X_1)'\kappa + \varepsilon, \quad \varepsilon \sim N(0, \tilde{\sigma}^2), \quad (7.2)$$

where $\varepsilon$ is an independent normal random variable. The parameter vector $(\alpha, \gamma, \kappa, \tilde{\sigma})$ is taken to be the estimates of (7.1) and (7.2), where $S$ and $Y$ are week 90 employment and log wages, respectively, adjusted as described in Appendix D. The sets $\mathcal{X}^{help}$ and $\mathcal{X}^{hurt}$ are determined by the sign of the parameter $\gamma$. The population data set is taken to be 9,145 observations of the baseline covariates $X$ and the artificial variables $D$, $S$, $S \cdot Y$, generated for each observation. By construction, the average treatment effect on the always-takers $\beta$ is zero. The true sharp identified set is $[-0.011, 0.018]$. Basic Lee bounds are defined as the weighted average of standard Lee bounds on $\mathcal{X}^{help}$ and $\mathcal{X}^{hurt}$. The true basic identified set is $[-0.014, 0.035]$.

I compare the performance of four estimators (oracle, basic, naive, and better Lee methods) by drawing random samples with replacement from the population data set. To mimic the researcher's covariate selection problem, I augment this data set with 28 covariates selected by Lee. Although these variables are absent from equations (7.1) and (7.2), they are strongly correlated with $X_1$ and $X_2$, making covariate selection an interesting problem. The oracle method is the output of Algorithm 1, where the oracle knows the identities of the covariates in the vector $X$ and the direction of the employment effect on $\mathcal{X}^{help}$ and $\mathcal{X}^{hurt}$. In contrast, all other methods need to learn $\mathcal{X}^{help}$ and $\mathcal{X}^{hurt}$ from the available sample. The basic method estimates $\mathcal{X}^{help}$ by logistic and quantile regression on the 28 raw covariates. It targets the basic identified set $[-0.014, 0.035]$. Both the naive and the better methods target the sharp identified set $[-0.011, 0.018]$. The naive method estimates the first-stage functions (2.8) and (2.9) by standard regression methods on all 28 covariates. In contrast, the better method selects covariates by the post-lasso-logistic estimator of Belloni et al. (2016) for the employment equation and by the post-lasso estimator of Belloni et al. (2017) for the wage equation. In the second stage, both the naive and the better methods rely on the orthogonal moment equations (B.18) for the lower and the upper bound, respectively. Table 2 reports the finite-sample performance of the oracle, basic, naive, and better methods.
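The design (7.1)-(7.2) can be reproduced in stylized form; the parameter values passed below are placeholders for illustration, not the calibrated JobCorps estimates:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, alpha, gamma, kappa, sigma):
    """Stylized version of the Section 7 design (placeholder parameters)."""
    X1 = rng.integers(0, 2, n)            # female indicator (stand-in)
    X2 = rng.integers(0, 2, n)            # motivation indicator (stand-in)
    X = np.column_stack([np.ones(n), X1, X2])
    D = rng.integers(0, 2, n)             # unbiased coin flip
    U = rng.logistic(size=n)              # logistic employment shock
    S = (X @ alpha + D * (X @ gamma) + U > 0).astype(int)              # eq. (7.1)
    Y = np.column_stack([np.ones(n), X1]) @ kappa + sigma * rng.normal(size=n)  # eq. (7.2)
    return D, X, S, S * Y                 # wages observed only when S = 1
```

Because $\gamma$ enters only through $D \cdot X'\gamma$, its sign at a given $X$ determines whether treatment helps or hurts employment there, i.e., whether that covariate cell belongs to $\mathcal{X}^{help}$ or $\mathcal{X}^{hurt}$.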
I focus on the lower and the upper bound separately to detect any outward bias or poor coverage of either bound, which is masked when a confidence interval is considered.

Table 2: Finite-sample performance of oracle, basic, naive and better Lee methods

Panel A: Lower Bound

                    Bias                        St. Dev.                    Coverage Rate
N         Oracle  Basic  Naive  Better   Oracle  Basic  Naive  Better   Oracle  Basic  Naive  Better
3,000     -0.00   -0.05  -0.06  -0.02    0.01    0.02   0.03   0.03     0.94    0.21   0.64   0.88
5,000     -0.00   -0.04  -0.04  -0.01    0.01    0.01   0.02   0.01     0.94    0.24   0.63   0.88
9,000      0.00   -0.03  -0.03  -0.01    0.01    0.01   0.02   0.01     0.95    0.26   0.65   0.93
10,000     0.00   -0.03  -0.03  -0.01    0.01    0.01   0.01   0.01     0.95    0.25   0.64   0.93
15,000     0.00   -0.02  -0.02  -0.01    0.00    0.01   0.01   0.01     0.95    0.23   0.64   0.92

Panel B: Upper Bound

3,000      0.00    0.05   0.04   0.00    0.01    0.02   0.03   0.02     0.94    0.21   0.64   0.93
5,000      0.00    0.04   0.03  -0.00    0.01    0.01   0.02   0.01     0.95    0.25   0.63   0.95
9,000     -0.00    0.03   0.02   0.00    0.01    0.01   0.01   0.01     0.95    0.28   0.65   0.97
10,000    -0.00    0.03   0.02  -0.00    0.01    0.01   0.01   0.01     0.95    0.29   0.64   0.97
15,000    -0.00    0.02   0.01  -0.00    0.00    0.01   0.01   0.01     0.94    0.28   0.64   0.97

Notes. Results are based on 10,000 simulation runs. In Panel A, the true parameter value is -0.014 for the basic method and -0.011 for all other methods. In Panel B, the true parameter value is 0.035 for the basic method and 0.018 for all other methods. Bias is the difference between the true parameter and the estimate, averaged across simulation runs. St. Dev. is the standard deviation of the estimate. Coverage Rate is the fraction of times a two-sided symmetric CI with critical values $c_{\alpha/2}$ and $c_{1-\alpha/2}$ covers the true parameter, where $\alpha = 0.05$. $N$ is the sample size in each simulation run. Oracle, basic, naive and better estimated bounds cover zero in 100% of the cases. The naive method estimates the first-stage functions (2.8) and (2.9) by logistic and quantile regression on all 28 covariates.

On average, the width of the oracle bounds is about 0.03, close to the width of the true sharp identified set. The covariate $X_1$ is always included in the logistic regression, together with irrelevant covariates. Although the vector $X$ is not selected in full for the employment equation in at least 90% of simulation runs (i.e., "perfect" model selection by lasso-logistic rarely occurs), the better estimates prove robust to incorrectly attributing an observation from $\mathcal{X}^{help}$ to $\mathcal{X}^{hurt}$ and vice versa. Table D.1 in Appendix D presents additional simulation evidence suggesting that the agnostic version of the better method also outperforms the basic method in terms of width and coverage and has comparable precision.

In this section, I demonstrate how better Lee bounds can achieve near point-identification in three empirical settings. First, I study the effect of JobCorps on wages, as in Lee (2009). Second, I study the effect of the PACES voucher tuition subsidy on pupils' test scores, as in Angrist et al. (2002). Finally, I study the effect of Medicaid eligibility and insurance on self-reported healthcare utilization and health, as in Finkelstein et al. (2012).
Lee (2009) studies the effect of winning a lottery to attend JobCorps, a federal vocational and training program, on applicants' wages. The data set is the same as in Section 3.

Table 3 reports estimated bounds on the JobCorps week 90 wage effect on the always-takers and the confidence region for the identified set. The basic Lee bounds cannot determine the direction of the effect (Column (1)). Neither can the sharp bounds given Lee's covariates (Column (2)). If few of the covariates affect week 90 employment and wages, the Column (3) bounds suggest that JobCorps raises week 90 wages by 4.0-4.6% on average, which is slightly smaller than Lee's original estimate. The basic bounds are loose because the unconditional trimming threshold $\hat{p}$ is close to one. In contrast, the better bounds account for the differential sign of the JobCorps effect on applicants' employment. The better Lee bounds are tight because variation in employment is well explained by reasons for joining JobCorps and highest grade completed, and variation in wages is explained by pre-randomization earnings, household income, gender, and other socio-economic factors.

The bounds in Column (3) assume sparsity, which excludes some wage covariates from the employment equation and vice versa. In Column (4), the target bounds are defined as the sharp bounds given the 15 covariates selected for either the employment or the wage equation in Column (3). The Column (4) bounds are almost the same as the Column (3) ones, suggesting that it is plausible for the week 90 employment and wage equations to be sparse. However, the Column (4) confidence region does not account for the uncertainty in how these 15 covariates are selected.

To properly quantify the uncertainty of the Column (4) bounds, I invoke the conditional (Column (5)) and variational (Column (6)) agnostic approaches. In Column (5), the auxiliary sample is taken to be the 6,241 applicants that Lee excluded from consideration due to missing data in weeks other than week 90. The Column (5) bounds target the sharp bounds given the covariates selected on this auxiliary sample. The estimates suggest that JobCorps raises week 90 wages by 4.1-4.3%.

Let $S_{week}(1)$ and $S_{week}(0)$ be the potential employment outcomes in a given week, and $Y_{week}(1)$ and $Y_{week}(0)$ be the potential log wage outcomes. Let $S_{week} = D \cdot S_{week}(1) + (1-D) \cdot S_{week}(0)$ be the realized employment and $Y_{week}$ be the realized wage. I focus on a subset of subjects whose treatment effect lower bound, averaged across weeks 80-120, is positive conditional on the full covariate vector $X$. My subjects of interest are $\mathcal{X}_{PLB} = \mathcal{X}^{help}_{PLB} \cup \mathcal{X}^{hurt}_{PLB}$, where

$$\mathcal{X}^{help}_{PLB} = \Big\{ X \in \mathcal{X}^{help} : \sum_{week=80}^{120} \big( E[Y_{week} \mid D=1, S_{week}=1, Y_{week} \leq Q_{week}(p_{week}(X), X), X] \quad (8.1)$$
$$- E[Y_{week} \mid D=0, S_{week}=1, X] \big) > 0 \Big\},$$

and $\mathcal{X}^{hurt}_{PLB}$ is its analog for $\mathcal{X}^{hurt}$.

Table 3: Estimated bounds on the JobCorps effect on week 90 log wages

                 Average Treatment Effect (ATE)                                                          ATE on X_PLB
(1)              (2)              (3)             (4)             (5)             (6)             (7)
[-0.027, 0.111]  [-0.005, 0.091]  [0.040, 0.046]  [0.041, 0.059]  [0.041, 0.043]  [0.024, 0.065]  [0.047, 0.061]
(-0.058, 0.142)  (-0.054, 0.135)  (0.001, 0.078)  (-0.019, 0.112) (-0.023, 0.101) (-0.05, 0.131)  (0.005, 0.100)
Selection covs      28     28     5,177   15     13     12-13   5,177
Post-lasso-log.     N/A    N/A    9       N/A    N/A    N/A     9
Wage covs           0      28     470     15     13     12-13   470
Post-lasso          N/A    N/A    6       N/A    N/A    N/A     6
Notes. Estimated bounds are in square brackets and the 95% confidence region for the identified set is in parentheses. All subjects are partitioned into the sets $\mathcal{X}^{help} = \{\hat{p}(X) < 1\}$ and $\mathcal{X}^{hurt} = \{\hat{p}(X) > 1\}$, where the trimming threshold $\hat{p}(x) = \hat{s}(0,x)/\hat{s}(1,x)$ is estimated as in equation (3.3). Column (1): basic bounds given Lee's 28 covariates. Column (2): sharp bounds given Lee's 28 covariates (i.e., naive bounds). Column (3): sharp bounds given all covariates, assuming few of them affect employment and wages. Column (4): sharp bounds given the union of the raw covariates selected for the employment and wage equations in Column (3). Column (5): sharp bounds given the covariates selected on the sample that Lee excluded due to missing data in weeks other than 90. Column (6): variational bounds defined in Section 5. Column (7): sharp bounds, based on the Column (3) first-stage estimates, for the PLB sample defined in equation (8.1).

Figure 3: Estimated bounds on the JobCorps effect on log wages by week.
Notes. The horizontal axis shows the number of weeks since random assignment. The black (gray) circles are an estimated upper (lower) bound on the average wage effect. The black (gray) fitted line is estimated by local linear approximation to the 201 black (gray) points, respectively. The sample consists of subjects whose average treatment effect lower bound is positive across weeks 80-120, as defined in equation (8.1). Computations use design weights.

Figure 3 plots the average JobCorps effect on wages. The black and gray lines show the upper and lower bounds on the average wage effect for the always-takers in the $\mathcal{X}_{PLB}$ subgroup. The lower bound sharply increases from -0.122 at week 5 to 0.072 at week 110 and declines to zero afterwards. The upper bound on the wage effect decreases to about 0.09 around week 80, after which it fluctuates around that level.

JobCorps Effect on Wage Growth Rate.
I study the JobCorps effect on long-term wage growth from week 104 to week 208. The two-year span between week 104 and week 208 is the longest possible period to consider without encountering short-term effects. Specifically, week 104 is one of the earliest weeks when the lower bound on the wage effect has become positive and stopped growing, according to Figure 3.

Figure 4: Geometric interpretation of the JobCorps effect on wage growth from week 104 to week 208.
Notes. This figure shows the best circular approximation to the estimated identified set (solid red perimeter), the best circular approximation to the 95% pointwise confidence region (dashed red perimeter), the estimated projections of the true sets on the -45-degree line (four black dots), and the intercepts of the corresponding tangent lines. Computations use design weights. See Appendix F.2 for the details.

Consider the group of applicants

$$AT^{\square}_{+} = \{S_{104}(1) = 1,\; S_{104}(0) = 1,\; S_{208}(1) = 1,\; S_{208}(0) = 1,\; \Delta_{104}(X) > 0,\; \Delta_{208}(X) > 0\}, \quad (8.2)$$

where "squared" refers to the two time periods under consideration and "plus" refers to the positive sign of the treatment-control difference in the employment rate. For the $AT^{\square}_{+}$ group, define the potential wage growth in the treated ($d=1$) and control ($d=0$) status,

$$\rho(d) = E[Y_{208}(d) - Y_{104}(d) \mid AT^{\square}_{+}], \quad d \in \{0,1\}. \quad (8.3)$$

The wage growth rate in the control group is identified as

$$\rho(0) = E[Y_{208} - Y_{104} \mid S_{104} = 1, S_{208} = 1, D = 0, \Delta_{104}(X) > 0, \Delta_{208}(X) > 0].$$

The growth rate in the treated group is not identified, but can be bounded using the relation

$$\rho(1) = \rho(0) + \beta_{208} - \beta_{104}, \quad (8.4)$$

where $\beta_{week}$ stands for the average treatment effect for the $AT^{\square}_{+}$ group in a particular week.

A simplistic approach to construct an upper bound on $\beta_{208} - \beta_{104}$ is to subtract the lower bound on $\beta_{104}$ from the upper bound on $\beta_{208}$. Since wages in weeks 104 and 208 are likely to be correlated, this upper bound may not be sustained by any data generating process consistent with the observed data. To obtain the sharp bound, project the true identified set on the -45-degree line and take the intercepts of the corresponding tangent lines. In Figure 4, the projection endpoints correspond to the inner black dots, and the tangent lines passing through them have intercepts equal to -0.11 and a positive upper value. Adding $\rho(0) = 0.149$ to the intercepts yields the bounds on $\rho(1)$.

Figure A.5 reports the upper and lower bounds on the average log wage for the always-takers in the control status, $E[Y_{week}(0) \mid S_{week}(1) = S_{week}(0) = 1]$, for each week. The lower bound grows from 1.45 in week 5 to 1.978 in week 208, and the upper bound from 1.90 to slightly above 2. The gap between the lower and the upper bound shrinks from 0.45 in week 14 to 0.01 in week 208, as the share of applicants with a positive employment effect, for whom the average log wage is identified in the control status, increases over time. One can interpret Figure A.5 as corroborating the Ashenfelter (1978) pattern and showing that earnings would have recovered even without JobCorps training. Therefore, evaluating JobCorps would have been very difficult without a randomized experiment, as one would need to explicitly model mean reversion in the potential wage in the control status.
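The projection step behind Figure 4 can be sketched numerically: given a support function $\sigma(\cdot)$ for the identified set of $(\beta_{104}, \beta_{208})$ (assumed supplied, e.g., by the Appendix B estimator), the bounds on $\rho(1)$ follow from (8.4). This mirrors the Example 1 direction choice and is an illustrative sketch, not the paper's implementation:

```python
import math

def growth_bounds(sigma, rho0):
    """Sharp bounds on rho(1) = rho(0) + beta_208 - beta_104 via the support
    function sigma of the identified set for (beta_104, beta_208),
    projected on the -45 degree line."""
    q = (-1 / math.sqrt(2), 1 / math.sqrt(2))   # direction of beta_208 - beta_104
    neg_q = (-q[0], -q[1])
    upper = rho0 + math.sqrt(2) * sigma(q)       # sup of the projection
    lower = rho0 - math.sqrt(2) * sigma(neg_q)   # inf of the projection
    return lower, upper
```

For a singleton identified set the two bounds coincide, as the tangent-line picture in Figure 4 would suggest.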
Angrist et al. (2002) studies the effect of winning a voucher from the Colombia PACES program, a voucher initiative established in 1991 to subsidize private school education, on pupils' test scores. In 1999, Angrist et al. (2002) administered a grade-specific test and found the voucher effect on the total test score to be equal to 0.2 standard deviations. The sample consists of N = 3,610 subjects from Bogota's 1995 applicant cohort and has the lottery outcome and test scores for Mathematics, Reading, and Writing. In addition, the sample has 25 demographic characteristics, including the applicant's age and gender, their father's and mother's ages, father's and mother's highest grade completed, and a collection of indicators for area of residence. While the number of raw covariates is moderate, the number of their three-way interactions, p = 900, is quite large for logistic and quantile series methods.

Table 4 reports bounds on the voucher effect on test scores in Mathematics (Panel A), Reading (Panel B), and Writing (Panel C) for various parameters of interest. The sharp bounds based on 25 raw covariates (Columns (4)-(5)) cover zero for all three subjects. Assuming sparsity, the better Lee bounds consider 900 technical covariates and achieve near point-identification for all three subjects (Column (6)). In addition, the better Lee bounds are more precisely estimated than the Column (4) ones. The width of the 95% confidence region for the true identified set is smaller than its Column (4) analog by a factor ranging from 0.38 (Reading) to 0.48 (Mathematics). For Mathematics and Writing, the average voucher effect on the always-takers comprises a substantial share of the ITT effect.

Finkelstein et al. (2012) studies the effect of access to Medicaid on self-reported healthcare utilization and measures of health.

Table 4: Estimated bounds on the PACES voucher effect on pupils' test scores
            ITT             Average Treatment Effect (ATE)
            Exogeneity      Monotonicity    Conditional monotonicity
            (1)             (2)             (3)             (4)             (5)             (6)
Mathematics [0.178, 0.178]  [0.075, 0.279]  [-0.432, 0.590] [-0.274, 0.545] [-0.169, 0.538] [0.056, 0.084]
            (-0.058, 0.413) (-0.243, 0.548) (-0.707, 0.828) (-0.710, 1.016) (-0.603, 0.975) (-0.298, 0.405)
Reading     [0.204, 0.204]  [0.029, 0.261]  [-0.386, 0.783] [-0.333, 0.654] [-0.252, 0.526] [0.163, 0.177]
            (-0.021, 0.429) (-0.256, 0.538) (-0.643, 1.031) (-0.849, 1.143) (-0.735, 0.924) (-0.140, 0.460)
Writing     [0.126, 0.126]  [-0.057, 0.222] [-0.552, 0.433] [-0.396, 0.380] [-0.150, 0.474] [0.086, 0.094]
            (-0.101, 0.353) (-0.346, 0.486) (-0.808, 0.679) (-0.855, 0.825) (-0.586, 0.888) (-0.249, 0.396)
N                   282     3610    3610    3610    3610    3610
Test-tak. covs      N/A     N/A     25      25      150     900
Post-lasso-logistic N/A     N/A     N/A     N/A     N/A     10
Test score covs     N/A     N/A     0       25      25      25
Post-lasso          N/A     N/A     N/A     N/A     N/A     9
Notes. Estimated bounds are in square brackets and the 95% confidence region for the identified set is in parentheses. Any test participant (a pupil who arrives at a testing location) is tested in all three subjects. Column (1): ITT estimate from Angrist et al. (2002), Table 5, Column 1. Column (2): basic Lee bounds under unconditional monotonicity. In Columns (3)-(6), all subjects are partitioned into the sets X_help = {p̂(X) < 1} and X_hurt = {p̂(X) > 1}, where the trimming threshold p̂(x) = ŝ(0,x)/ŝ(1,x) is estimated as in equation (3.3). Column (3): basic Lee bounds based on 25 raw covariates. Column (4): sharp Lee bounds based on 25 raw covariates. Column (5) differs from Column (4) only by adding second-order interactions of 6 continuous covariates into the test-taking equation. Column (6): sharp Lee bounds, based on the technical covariates selected by post-lasso-logistic for the test participation and on the raw covariates selected by post-lasso for the test score. Baseline covariates are described in Table F.15. First-stage estimates are given in Table F.16. Selected covariates are described in Figure F.7. See Appendix F.4 for details.

The data come from the Oregon Health Insurance Experiment (OHIE), which allowed a subset of uninsured low-income applicants to apply for Medicaid in 2008. OHIE used a lottery to determine who was eligible to apply for Medicaid. One year after randomization, a subset of applicants was mailed a survey with questions about recent changes in their healthcare utilization and general well-being. The sample contains the lottery outcome, actual Medicaid enrollment, and survey responses. In addition, the sample has 64 pre-determined characteristics, including demographics, enrollment in the SNAP and TANF government programs, and pre-existing health conditions. While the number of raw covariates is moderate, the number of their pairwise interactions, p = 64² = 4,096, is quite large for classic nonparametric methods. Since the survey response rate is close to 50% and the control applicants are 1.07 times as likely to respond as the treated ones, Finkelstein et al. (2012)'s findings are subject to potential nonresponse bias.

Finkelstein et al. (2012) studies the effect of winning the Medicaid lottery using the intent-to-treat (ITT) framework. If an applicant wins the lottery, all members of their household become eligible to enroll. As a result, larger households are more likely to win the lottery than smaller ones. Furthermore, the control applicants were oversampled in the earlier survey waves. To account for the correlation between household size and survey wave fixed effects, the intent-to-treat equation takes the form

Y_ih = β_0 + β_1 Lottery_h + X̄'_ih β_2 + ε_ih, (8.5)

where i denotes an individual, h denotes a household, Lottery_h = 1 if household h was offered access to Medicaid and 0 otherwise, and X̄_ih is a vector of stratification characteristics (survey wave and household size fixed effects). The coefficient β_1 is the main coefficient of interest, interpreted as the impact of being able to apply for Medicaid through the Oregon lottery. Finkelstein et al. (2012) also studies the local average treatment effect (LATE) of insurance,

Y_ih = π_0 + π_1 Insurance_ih + X̄'_ih π_2 + ν_ih, (8.6)

where Insurance_ih is an applicant-specific measure of insurance coverage defined as "ever on Medicaid during study period", and all other variables are as defined in (8.5). Finkelstein et al. (2012) estimates (8.6) by two-stage least squares (2SLS), using Lottery_h as an instrument for Insurance_ih and including X̄_ih in both the first and the second stages of 2SLS. The coefficient π_1 is the main coefficient of interest: it shows the impact of insurance coverage on subjects who enroll in Medicaid if and only if they become eligible.
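The 2SLS logic in (8.6) can be sketched on simulated data. The snippet below omits the stratification controls X̄_ih, in which case 2SLS reduces to the Wald ratio cov(Y, Lottery)/cov(Insurance, Lottery); the data-generating process and all numbers are hypothetical:

```python
import random

random.seed(0)

# Hypothetical DGP: the lottery is randomized; compliers enroll in Medicaid
# iff they win; always-takers enroll regardless and differ in outcomes,
# which makes a naive OLS of Y on Insurance biased. True LATE pi1 = 0.5.
pi1_true = 0.5
n = 200_000
lottery, insurance, outcome = [], [], []
for _ in range(n):
    z = 1 if random.random() < 0.5 else 0
    complier = random.random() < 0.3
    always = random.random() < 0.1
    d = 1 if (always or (complier and z == 1)) else 0
    y = pi1_true * d + (0.2 if always else 0.0) + random.gauss(0.0, 1.0)
    lottery.append(z)
    insurance.append(d)
    outcome.append(y)

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (w - mb) for x, w in zip(a, b)) / len(a)

# Without controls, 2SLS collapses to the Wald ratio:
# reduced form cov(Y, Z) over first stage cov(D, Z).
pi1_hat = cov(outcome, lottery) / cov(insurance, lottery)
```

With the controls included, the same logic goes through after partialling out X̄_ih from every variable in both stages.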
If non-response is exogenous for each household size and survey wave, Medicaid eligibility and enrollment have a positive and significant effect on all measures of health and healthcare utilization (Tables 5, 6, A.8, A.9, Columns (1) and (4)).

I examine whether the intent-to-treat (8.5) and local average treatment effect (8.6) equations are robust to non-response bias. Tables 5 and 6 show the results for self-reported health outcomes. The standard trimming approach is very conservative and cannot determine the direction of the effect for any of the health outcomes. For each household size and survey wave stratum, the smallest number of the worst-case responses is trimmed in the control group until the treatment-control difference in response rates exceeds zero for each stratum. Since incorporating the additional 48 baseline covariates requires considering more than 2^48 discrete cells, it is not possible to incorporate all of them at once. An ad hoc choice of three demographic indicators (gender, English as preferred language, and urban area of residence) does not improve the standard estimates.

A smoothness assumption on the conditional response probability and the outcome quantile drastically changes the result. Tables 5 and 6, Columns (3) and (6), suggest that Medicaid eligibility and insurance have had a positive effect on 7 out of 7 health outcomes. Furthermore, the estimated lower bounds imply that Medicaid insurance increased the number of days in good health by at least 0.981, and Medicaid eligibility by at least 0.317 (Table 6, Columns (6) and (3)).

Conclusion

Lee bounds are a popular empirical strategy for addressing post-randomization selection bias. In this paper, I show that Lee bounds can be improved by incorporating baseline covariates and modern regularized machine learning techniques. First, better Lee bounds accommodate differential selection response. This relaxation is especially important for JobCorps, since the JobCorps effect on employment is unlikely to be in the same direction for everyone. Second, better Lee bounds are sharp if few of the very many covariates under consideration affect selection and outcome. In practice, better Lee bounds achieve near point-identification in all three empirical examples under interpretable assumptions (i.e., smoothness and sparsity) that are often invoked in point-identified problems. Therefore, better Lee bounds are expected to deliver stronger conclusions that are also more robust to violations of monotonicity.

Table 5: Estimated lower bound on the effect of access to Medicaid on self-reported binary health outcomes

            ITT                     LATE
            (1)     (2)      (3)    (4)     (5)      (6)
            None    Standard ML     None    Standard ML
Health good / very good / excellent 0.039 -0.013 (0.008) (0.013) (0.017) (0.026) (0.044) (0.058)
Health fair / good / very good / excellent 0.029 -0.052 (0.005) (0.012) (0.010) (0.018) (0.038) (0.033)
Health same or gotten better 0.033 -0.033 (0.007) (0.014) (0.019) (0.023) (0.049) (0.065)
Did not screen positive for depression 0.023 -0.045 (0.007) (0.014) (0.010) (0.025) (0.049) (0.065)
Compulsory covariates (stratification) N/A 16 16 N/A 16 16
Additional covariates (trimming)       N/A 0  21 N/A 0  21

* Standard errors in parentheses. This table reports results from a Lee bounding exercise on self-reported health outcomes for 3 specifications: no trimming, standard trimming, and the agnostic ML approach. Columns (1)-(3) report the coefficient and standard error on Lottery from estimating equation (8.5) by OLS. Columns (4)-(6) report the coefficient and standard error on Insurance from estimating equation (8.6) by 2SLS with Lottery as an instrument for Insurance. All regressions include household size fixed effects, survey wave fixed effects, and their interactions. Trimming methods. None: exact replicate of Finkelstein et al. (2012), Table IX. Standard: the minimal number of zero outcomes is trimmed in the control group until the treatment-control difference in response rates switches from negative to non-negative for each stratum. Agnostic: Step 1. 21 additional covariates are selected on an auxiliary sample of 4,000 households as described in Appendix F.5. Step 2. In the main sample of 46,000 households, a zero outcome with covariate vector x is trimmed in the control group if a coin flip with success probability (1 − p(x))/φ(x) succeeds, where the trimming threshold p(x) is defined in (F.4) and the zero-outcome probability φ(x) is defined in (F.2). Standard errors are estimated by a cluster-robust bootstrap.

Table 6: Estimated lower bound on the effect of access to Medicaid on self-reported number of days in good health

        ITT                     LATE
        (1)     (2)      (3)    (4)     (5)      (6)
        None    Standard NP     None    Standard NP
        (0.162) (0.349) (0.166) (0.562) (1.166) (0.577)
        (0.174) (0.384) (0.170) (0.605) (1.308) (0.592)
        (0.184) (0.374) (0.179) (0.640) (1.298) (0.624)
Compulsory covariates (stratification) N/A 16 16 N/A 16 16
Additional covariates (trimming)       N/A 0  9  N/A 0  9

* Standard errors in parentheses. This table reports results from a Lee bounding exercise on self-reported health outcomes for 3 specifications: no trimming, standard trimming, and the classic nonparametric (NP) approach. Columns (1)-(3) report the coefficient and standard error on Lottery from estimating equation (8.5) by OLS. Columns (4)-(6) report the coefficient and standard error on Insurance from estimating equation (8.6) by 2SLS with Lottery as an instrument for Insurance. All regressions include household size fixed effects, survey wave fixed effects, and their interactions. Trimming methods. None: exact replicate of Finkelstein et al. (2012), Table IX. Standard: the minimal number of control outcomes is trimmed from below for each value of the fixed effects until the treatment-control difference in response rates switches from negative to non-negative for each stratum. NP: Step 1. 9 additional covariates are taken as described in Appendix F.5. Step 2. An outcome with covariate vector x is trimmed if it is less than Q(1 − 1/p(x), x), where the trimming threshold p(x) is defined in equation (F.4) and the conditional quantile is defined in equation (2.8). Standard errors are estimated by a cluster-robust bootstrap.

References

Andrews, D. (1994). Asymptotics for semiparametric econometric models via stochastic equicontinuity. Econometrica, 62:43–72.
Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2002). Vouchers for private schooling in Colombia: Evidence from a randomized natural experiment. The American Economic Review, 92(5):1535–1558.
Angrist, J. and Pischke, J.-S. (2009). Mostly Harmless Econometrics. Princeton University Press.
Angrist, J. D. and Imbens, G. W. (1995). Two-stage least squares estimation of average causal effects in models with variable treatment intensity. Journal of the American Statistical Association, 90(430):431–442.
Ashenfelter, O. (1978). Estimating the effect of training programs on earnings. Review of Economics and Statistics, 60:47–50.
Belloni, A. and Chernozhukov, V. (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli, 19(2):521–547.
Belloni, A., Chernozhukov, V., Chetverikov, D., and Fernandez-Val, I. (2019). Conditional quantile processes based on series or many regressors. Journal of Econometrics, 213:4–29.
Belloni, A., Chernozhukov, V., Fernandez-Val, I., and Hansen, C. (2017). Program evaluation and causal inference with high-dimensional data. Econometrica, 85:233–298.
Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selection amongst high-dimensional controls. Review of Economic Studies, 81(2):608–650.
Belloni, A., Chernozhukov, V., and Wei, Y. (2016). Post-selection inference for generalized linear models with many controls. Journal of Business & Economic Statistics, 34(4):606–619.
Chandrasekhar, A., Chernozhukov, V., Molinari, F., and Schrimpf, P. (2012). Inference for best linear approximations to set identified functions. arXiv e-prints, arXiv:1212.5627.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal, 21:C1–C68.
Chernozhukov, V., Chetverikov, D., and Kato, K. (2019). Inference on causal and structural parameters using many moment inequalities. Review of Economic Studies, 86:1867–1900.
Chernozhukov, V., Demirer, M., Duflo, E., and Fernández-Val, I. (2017). Generic machine learning inference on heterogenous treatment effects in randomized experiments. arXiv e-prints, arXiv:1712.04802.
Chernozhukov, V., Escanciano, J. C., Ichimura, H., Newey, W. K., and Robins, J. M. (2016). Locally robust semiparametric estimation. arXiv e-prints, arXiv:1608.00033.
Chernozhukov, V., Fernandez-Val, I., and Melly, B. (2013a). Inference on counterfactual distributions. Econometrica, 81(6):2205–2268.
Chernozhukov, V., Lee, S., and Rosen, A. (2013b). Intersection bounds: Estimation and inference. Econometrica, 81:667–737.
Chernozhukov, V., Newey, W., and Singh, R. (2018). De-biased machine learning of global and local parameters using regularized Riesz representers. arXiv e-prints, arXiv:1802.08667.
Chiang, H. D. (2020). Many average partial effects: With an application to text regression.
Finkelstein, A., Taubman, S., Wright, B., Bernstein, M., Gruber, J., Newhouse, J., Allen, H., Baicker, K., and the Oregon Health Study Group (2012). The Oregon Health Insurance Experiment: Evidence from the first year. Quarterly Journal of Economics, 127(3):1057–1106.
Frangakis, C. E. and Rubin, D. B. (2002). Principal stratification in causal inference. Biometrics, 58(1):21–29.
Hirano, K., Imbens, G., and Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189.
Ichimura, H. and Newey, W. K. (2015). The influence function of semiparametric estimators. arXiv e-prints, arXiv:1508.01378.
Imbens, G. and Manski, C. (2004). Confidence intervals for partially identified parameters. Econometrica, 72(6):1845–1857.
Imbens, G. W. and Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2):467–475.
Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15:2869–2909.
Lee, D. (2005). Trimming for bounds on treatment effects with missing outcomes. Working Paper.
Lee, D. (2009). Training, wages, and sample selection: Estimating sharp bounds on treatment effects. Review of Economic Studies, 76(3):1071–1102.
Mullainathan, S. and Spiess, J. (2017). Machine learning: An applied econometric approach. Journal of Economic Perspectives, 31(2):87–106.
Newey, W. (1994). The asymptotic variance of semiparametric estimators. Econometrica, 62:245–271.
Neyman, J. (1959). Optimal asymptotic tests of composite statistical hypotheses. Probability and Statistics, 213–234.
Rockafellar, R. T. (1997). Convex Analysis. Princeton University Press.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688–701.
Schochet, P. Z., Burghardt, J., and McConnell, S. (2008). Does Job Corps work? Impact findings from the National Job Corps Study. American Economic Review, 98(5):1864–1886.
Stone, C. (1982). Optimal global rates of convergence for nonparametric regression. Annals of Statistics, 10(4):1040–1053.
Stoye, J. (2009). More on confidence intervals for partially identified parameters. Econometrica, 77(4):1299–1315.
van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Annals of Statistics, 42(3):1166–1202.
van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press.
Zhang, C.-H. and Zhang, S. (2014). Confidence intervals for low-dimensional parameters in high-dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242.
Appendix A: Supplementary Figures and Tables

Table A.7: Estimated bounds on the JobCorps effect on log wages

         Average Treatment Effect (ATE)    ATE on X_PLB
         (1)              (2)              (3)             (4)
Week 45  [-0.085, 0.141]  [-0.081, 0.136]  [-0.048, 0.087] [-0.043, 0.081]
         (-0.11, 0.172)   (-0.117, 0.172)  (-0.081, 0.121) (0.000, 0.118)
Week 104 [-0.027, 0.103]  [-0.022, 0.098]  [0.027, 0.032]  [0.044, 0.055]
         (-0.058, 0.133)  (-0.074, 0.148)  (-0.021, 0.078) (0.000, 0.106)
Week 135 [-0.025, 0.102]  [-0.025, 0.098]  [0.030, 0.037]  [0.059, 0.066]
         (-0.066, 0.136)  (-0.087, 0.149)  (-0.018, 0.076) (0.005, 0.108)
Week 180 [-0.047, 0.109]  [-0.042, 0.101]  [0.033, 0.063]  [0.047, 0.081]
         (-0.076, 0.133)  (-0.096, 0.149)  (-0.014, 0.103) (0.000, 0.125)
Week 208 [-0.019, 0.094]  [0.000, 0.091]   [0.030, 0.065]  [0.032, 0.093]
         (-0.051, 0.119)  (-0.054, 0.142)  (-0.016, 0.106) (0.000, 0.136)
Employment covs     28   28   5,177          5,177
Post-lasso-logistic N/A  N/A  Varies (10-20) Varies (10-20)
Wage covs           N/A  28   470            470
Post-lasso covs     N/A  N/A  Varies (6-10)  Varies (6-10)

Notes. Estimated bounds are in square brackets and the 95% confidence region for the identified set is in parentheses. All subjects are partitioned into the sets X_help = {p̂(X) < 1} and X_hurt = {p̂(X) > 1}, where the trimming threshold p̂(x) = ŝ(0,x)/ŝ(1,x) is estimated as in equation (3.3). Column (1): basic bounds based on Lee's covariates. Column (2): sharp bounds based on Lee's covariates. Columns (3) and (4): sharp bounds given all covariates, assuming few of them affect employment and wage, where the full sample N = 9,145 is used in Column (3) and the PLB sample is used in Column (4). The PLB sample is defined as the always-takers whose conditional treatment-effect lower bound is positive in at least one of the six horizons considered by Lee (weeks 45, 90, 104, 135, 180, 208). Covariates are defined in Section C.2. Computations use design weights. See Appendix F.3 for details.

Figure A.5: Estimated bounds on the average wage in the control status by week.
Notes. The horizontal axis shows the number of weeks since random assignment. The black (gray) circles show the upper (lower) bound on the average untreated log wage for the always-takers. The black (gray) fitted line is estimated by local linear approximation to 201 black (gray) points, respectively. Computations use design weights.

Table A.8: Estimated lower bound on the effect of access to Medicaid on self-reported healthcare utilization: extensive margin

            ITT                     LATE
            (1)     (2)      (3)    (4)     (5)      (6)
            None    Standard ML     None    Standard ML
Prescription drugs currently         0.025  -0.008  0.017  0.088  -0.036  0.060
            (0.008) (0.014) (0.017) (0.029) (0.046) (0.060)
Outpatient visits last six months    0.062   0.005  0.042  0.212   0.001  0.146
            (0.007) (0.013) (0.017) (0.025) (0.045) (0.058)
ER visits last six months            0.006  -0.020 -0.004  0.022  -0.076 -0.015
            (0.007) (0.008) (0.011) (0.023) (0.030) (0.037)
Hospital admissions last six months  0.002  -0.005  0.002  0.008  -0.020  0.007
            (0.004) (0.004) (0.005) (0.014) (0.016) (0.016)
Compulsory covariates (stratification) N/A 16 16 N/A 16 16
Additional covariates (trimming)       N/A 0  21 N/A 0  21

* Standard errors in parentheses. This table reports results from a Lee bounding exercise on self-reported healthcare utilization outcomes for 3 specifications: no trimming, standard trimming, and the agnostic ML approach. Columns (1)-(3) report the coefficient and standard error on Lottery from estimating equation (8.5) by OLS. Columns (4)-(6) report the coefficient and standard error on Insurance from estimating equation (8.6) by 2SLS with Lottery as an instrument for Insurance.
All regressions include household size fixed effects, survey wave fixed effects, and their interactions. Trimming methods. None: exact replicate of Finkelstein et al. (2012), Table V. Standard: the minimal number of control outcomes is trimmed from below for each value of the fixed effects until the treatment-control difference in response rates switches from negative to non-negative. Agnostic: Step 1. 21 additional covariates are selected on an auxiliary sample of 4,000 households as described in Appendix F.5. Step 2. In the main sample of 46,000 households, a zero outcome with covariate vector x is trimmed in the control group if a coin flip with success probability (1 − p(x))/φ(x) succeeds, where the trimming threshold p(x) is defined in (F.4) and the zero-outcome probability φ(x) is defined in (F.2). Standard errors are estimated by a cluster-robust bootstrap.

Table A.9: Estimated lower bound on the effect of access to Medicaid on self-reported healthcare utilization: total utilization

            ITT                     LATE
            (1)     (2)      (3)    (4)     (5)      (6)
            None    Standard NP     None    Standard NP
Prescription drugs currently         0.100  -0.024  0.077  0.347  -0.124  0.270
            (0.051) (0.066) (0.052) (0.175) (0.225) (0.179)
Outpatient visits last six months    0.314   0.121  0.246  1.083   0.372  0.853
            (0.054) (0.065) (0.054) (0.182) (0.228) (0.183)
ER visits last six months            0.007  -0.040 -0.008  0.026  -0.152 -0.027
            (0.016) (0.019) (0.016) (0.056) (0.065) (0.056)
Hospital admissions last six months  0.006  -0.004  0.003  0.021  -0.014  0.010
            (0.006) (0.007) (0.006) (0.021) (0.024) (0.021)
Compulsory covariates (stratification) N/A 16 16 N/A 16 16
Additional covariates (trimming)       N/A 0  9  N/A 0  9

* Standard errors in parentheses. This table reports results from a Lee bounding exercise on self-reported healthcare utilization outcomes for 3 specifications: no trimming, standard trimming, and the classic nonparametric (NP) approach. Columns (1)-(3) report the coefficient and standard error on Lottery from estimating equation (8.5) by OLS. Columns (4)-(6) report the coefficient and standard error on Insurance from estimating equation (8.6) by 2SLS with Lottery as an instrument for Insurance. All regressions include household size fixed effects, survey wave fixed effects, and their interactions. Trimming methods. None: exact replicate of Finkelstein et al. (2012), Table V. Standard: the minimal number of control outcomes is trimmed from below for each value of the fixed effects until the treatment-control difference in response rates switches from negative to non-negative. NP: Step 1.
9 additional covariates are taken as described in Appendix F.5. Step 2. An outcome with covariate vector x is trimmed if it is less than Q(1 − 1/p(x), x), where the trimming threshold p(x) is defined in equation (F.4) and the conditional quantile is defined in equation (2.8). Standard errors are estimated by a cluster-robust bootstrap.

Appendix B: Supplementary Statements for Sections 4-6

B.1 Definitions
Figure B.6: Graphical representation of the better Lee bounds estimator with cross-fitting.

First Stage. The rectangles represent the partition of the data into K = 2 subsamples, Data_1 and Data_2, with a machine learning instance ξ̂_k trained on each subsample.

Second Stage. For each bound, the orthogonal moment function is the sum of the original moment equation for the bound and a mean-zero bias-correction term: g_L = m_L + correction_L for the lower bound and g_U = m_U + correction_U for the upper bound. For each partition k ∈ {1, 2}, the lower and upper bounds are the sample average of the respective moment function, using Data_k and the ML instance trained on the other subsample. The final estimates are β̃_L = min(β̂_L, β̂_U) and β̃_U = max(β̂_L, β̂_U).
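The sample-splitting pattern in Figure B.6 can be sketched generically. The snippet below is a two-fold cross-fitting template in which a simple group-mean fit stands in for the first-stage ML instance ξ̂; it illustrates the fold-swapping logic only, not the paper's full estimator, and the data-generating process is hypothetical:

```python
import random
import statistics

random.seed(1)

# Toy data: Y = 2*X + noise with X in {0, 1}; the target is theta = E[Y] = 1.
data = []
for _ in range(2000):
    x = random.choice([0, 1])
    data.append((x, 2.0 * x + random.gauss(0, 1)))

def fit_nuisance(sample):
    """First stage: estimate mu(x) = E[Y | X = x] by group means."""
    return {x: statistics.fmean(y for xi, y in sample if xi == x)
            for x in (0, 1)}

def moment(obs, mu_hat):
    """Second stage moment: here simply the fitted value mu_hat(X)."""
    x, _ = obs
    return mu_hat[x]

# Two-fold cross-fitting: the nuisance is trained on one fold, the moment
# is averaged on the other fold, and the two fold-estimates are averaged.
fold1, fold2 = data[: len(data) // 2], data[len(data) // 2 :]
theta_hats = []
for est_fold, nuis_fold in [(fold1, fold2), (fold2, fold1)]:
    mu_hat = fit_nuisance(nuis_fold)
    theta_hats.append(statistics.fmean(moment(obs, mu_hat) for obs in est_fold))
theta_hat = statistics.fmean(theta_hats)
```

Because each observation is never used to train the nuisance that enters its own moment, overfitting bias from the first stage does not contaminate the second-stage average.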
Sharp Lee Bounds: Definition. In this section, I derive the target parameter, the sharp Lee bounds, under Assumption 2. (This appendix is for online publication.) The conditional trimming threshold is

p(x) = s(0, x)/s(1, x) = E[S = 1 | D = 0, X = x] / E[S = 1 | D = 1, X = x].

The sets X_help and X_hurt are

X_help = {X : p(X) < 1},  X_hurt = {X : p(X) > 1}. (B.1)

By Assumption 3, Pr(X ∈ X_help ∪ X_hurt) = 1. The conditional probability of treatment (i.e., the propensity score) is

μ_1(X) = Pr(D = 1 | X),  μ_0(X) = 1 − μ_1(X) = Pr(D = 0 | X). (B.2)

The conditional quantiles in the selected treated and selected control groups are

Q_d(u, x) : Pr(Y ≤ Q_d(u, x) | S = 1, D = d, X = x) = u,  u ∈ [0, 1],  d ∈ {0, 1}. (B.3)

Because Q_1(u, x) is invoked only for x ∈ X_help and Q_0(u, x) is invoked only for x ∈ X_hurt, it makes sense to define the combined conditional quantile

Q(u, x) = 1{x ∈ X_help} Q_1(u, x) + 1{x ∈ X_hurt} Q_0(u, x). (B.4)

Likewise, the conditional outcome densities in the selected treated and selected control groups are

f_d(t | x) = f(t | S = 1, D = d, X = x),  d ∈ {0, 1},

and the combined conditional density is

f(t | x) = 1{x ∈ X_help} f_1(t | x) + 1{x ∈ X_hurt} f_0(t | x). (B.5)

For x ∈ X_help, the conditional upper bound is

β̄_U^help(x) = E[Y | D = 1, S = 1, Y ≥ Q_1(1 − p(x), x), X = x] − E[Y | D = 0, S = 1, X = x] (B.6)

and the conditional lower bound is

β̄_L^help(x) = E[Y | D = 1, S = 1, Y ≤ Q_1(p(x), x), X = x] − E[Y | D = 0, S = 1, X = x]. (B.7)

For x ∈ X_hurt, the conditional upper bound is

β̄_U^hurt(x) = E[Y | D = 1, S = 1, X = x] − E[Y | D = 0, S = 1, Y ≤ Q_0(1/p(x), x), X = x] (B.8)

and the conditional lower bound is

β̄_L^hurt(x) = E[Y | D = 1, S = 1, X = x] − E[Y | D = 0, S = 1, Y ≥ Q_0(1 − 1/p(x), x), X = x]. (B.9)

Define the treated and control components of each conditional bound as

β̄_*^⋆(x) = β̄_{1*}^⋆(x) − β̄_{0*}^⋆(x),  * ∈ {L, U},  ⋆ ∈ {help, hurt}, (B.10)

and the normalizing constants as

μ_{10}^help = E[s(0, X) | X ∈ X_help],  μ_{11}^hurt = E[s(1, X) | X ∈ X_hurt]. (B.11)

The sharp Lee bounds β_L and β_U are

β_* = (μ_{10}^help)^{-1} Pr(X ∈ X_help) · E[β̄_*^help(X) s(0, X) | X ∈ X_help]
    + (μ_{11}^hurt)^{-1} Pr(X ∈ X_hurt) · E[β̄_*^hurt(X) s(1, X) | X ∈ X_hurt],  * ∈ {L, U}. (B.12)
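For a single discrete covariate cell, the threshold (B.1) and the trimmed-mean bounds (B.6)-(B.7) admit a direct plug-in sketch. The data-generating process below is hypothetical, with selection independent of the outcome within each treatment arm so that the true always-taker effect (0.3) is known:

```python
import random

random.seed(2)

# Hypothetical cell of a discrete covariate x, with selection rates
# s(0, x) = 0.6 < s(1, x) = 0.8, so the cell belongs to X_help: p(x) < 1.
n = 50_000
rows = []  # (D, S, Y); Y is observed only when S = 1
for _ in range(n):
    d = 1 if random.random() < 0.5 else 0
    s = 1 if random.random() < (0.8 if d == 1 else 0.6) else 0
    y = random.gauss(1.0 + 0.3 * d, 1.0) if s == 1 else None
    rows.append((d, s, y))

# Trimming threshold p(x) = s(0, x) / s(1, x), as in (B.1).
def share(rows, d):
    sel = [s for di, s, _ in rows if di == d]
    return sum(sel) / len(sel)

p_hat = share(rows, 0) / share(rows, 1)
assert p_hat < 1  # the cell is classified into X_help

# Trimmed means for (B.6)-(B.7): keep the top (upper bound) or the
# bottom (lower bound) p(x) share of selected treated outcomes.
treated_y = sorted(y for d, s, y in rows if d == 1 and s == 1)
control_mean = sum(y for d, s, y in rows if d == 0 and s == 1) / sum(
    s for d, s, _ in rows if d == 0)
k = int(p_hat * len(treated_y))                  # number of outcomes kept
upper = sum(treated_y[-k:]) / k - control_mean   # trim from below
lower = sum(treated_y[:k]) / k - control_mean    # trim from above
assert lower <= 0.3 <= upper  # true always-taker effect lies inside
```

Averaging such cell-level bounds over the always-takers' covariate distribution, with the weights in (B.12), yields the sharp bounds.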
Sharp Lee Bounds: Moment Equation. If the propensity score μ_1(x) in (B.2) is known, the first-stage nuisance parameter ξ is

ξ = {s(0, x), s(1, x), Q(u, x)}. (B.13)

Otherwise, ξ is

ξ = {s(0, x), s(1, x), Q(u, x), μ_1(x)} (B.14)

for a non-orthogonal moment equation m_*(W, ξ), * ∈ {L, U}, and

ξ = {s(0, x), s(1, x), Q(u, x), μ_1(x), β_{1*}^help(x), β_{0*}^help(x), β_{1*}^hurt(x), β_{0*}^hurt(x)} (B.15)

for an orthogonal moment equation g_*(W, ξ), * ∈ {L, U}, described below. Let μ_1(X) and μ_0(X) be as in (B.2), and μ_{10}^help and μ_{11}^hurt as in (B.11). The original (i.e., non-orthogonal) moment equation for β_U is

m_U(W, ξ) = (μ_{10}^help)^{-1} 1{X ∈ X_help} ( D·S·Y·1{Y ≥ Q(1 − p(X), X)}/μ_1(X) − (1 − D)·S·Y/μ_0(X) )
          + (μ_{11}^hurt)^{-1} 1{X ∈ X_hurt} ( D·S·Y/μ_1(X) − (1 − D)·S·Y·1{Y ≤ Q(1/p(X), X)}/μ_0(X) ), (B.16)

and for β_L is

m_L(W, ξ) = (μ_{10}^help)^{-1} 1{X ∈ X_help} ( D·S·Y·1{Y ≤ Q(p(X), X)}/μ_1(X) − (1 − D)·S·Y/μ_0(X) )
          + (μ_{11}^hurt)^{-1} 1{X ∈ X_hurt} ( D·S·Y/μ_1(X) − (1 − D)·S·Y·1{Y ≥ Q(1 − 1/p(X), X)}/μ_0(X) ). (B.17)
Sharp Lee Bounds: Orthogonal Moment Equation. An orthogonal moment function g_*(W, ξ) is

g_*(W, ξ) = m_*(W, ξ) + (μ_{10}^help)^{-1} 1{X ∈ X_help} α_*^help(W, ξ) + (μ_{11}^hurt)^{-1} 1{X ∈ X_hurt} α_*^hurt(W, ξ),  * ∈ {L, U}. (B.18)

The bias-correction terms α_U^help(W; ξ) and α_U^hurt(W; ξ) are

α_U^help(W; ξ) = Q_1(1 − p(X), X) ( (1 − D)·S/μ_0(X) − s(0, X) )
  − Q_1(1 − p(X), X) p(X) ( D·S/μ_1(X) − s(1, X) )
  + Q_1(1 − p(X), X) s(1, X) ( D·S·1{Y ≤ Q_1(1 − p(X), X)}/(s(1, X) μ_1(X)) − 1 + p(X) )
  − ( μ_0(X) β_{1U}^help(X) + (1 − μ_0(X)) β_{0U}^help(X) ) · s(0, X) · (D − μ_1(X)), (B.19)

α_U^hurt(W; ξ) = Q_0(1/p(X), X) p(X) ( (1 − D)·S/μ_0(X) − s(0, X) )
  + Q_0(1/p(X), X) ( D·S/μ_1(X) − s(1, X) )
  + Q_0(1/p(X), X) s(0, X) ( (1 − D)·S·1{Y ≤ Q_0(1/p(X), X)}/(s(0, X) μ_0(X)) − 1/p(X) )
  − ( μ_0(X) β_{1U}^hurt(X) + (1 − μ_0(X)) β_{0U}^hurt(X) ) · s(1, X) · (D − μ_1(X)), (B.20)

and, for the lower bound,

α_L^help(W; ξ) = Q_1(p(X), X) ( (1 − D)·S/μ_0(X) − s(0, X) )
  − Q_1(p(X), X) p(X) ( D·S/μ_1(X) − s(1, X) )
  − Q_1(p(X), X) s(1, X) ( D·S·1{Y ≤ Q_1(p(X), X)}/(s(1, X) μ_1(X)) − p(X) )
  − ( μ_0(X) β_{1L}^help(X) + (1 − μ_0(X)) β_{0L}^help(X) ) · s(0, X) · (D − μ_1(X)), (B.21)

α_L^hurt(W; ξ) = − Q_0(1 − 1/p(X), X) p(X) ( (1 − D)·S/μ_0(X) − s(0, X) )
  + Q_0(1 − 1/p(X), X) ( D·S/μ_1(X) − s(1, X) )
  − Q_0(1 − 1/p(X), X) s(0, X) ( (1 − D)·S·1{Y ≤ Q_0(1 − 1/p(X), X)}/(s(0, X) μ_0(X)) − 1 + 1/p(X) )
  − ( μ_0(X) β_{1L}^hurt(X) + (1 − μ_0(X)) β_{0L}^hurt(X) ) · s(1, X) · (D − μ_1(X)), (B.22)

where β_{1*}^⋆(x) and β_{0*}^⋆(x) are as in (B.10).
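The practical payoff of a mean-zero bias correction can be seen in a stylized analogue. Below, a plug-in moment for a mean under randomized treatment is deliberately evaluated at a perturbed propensity μ̂ = μ + ε, with and without a correction term of the Neyman-orthogonal form; the moment, "population", and perturbation are hypothetical and far simpler than (B.18)-(B.22), and the point is only that the corrected moment's error is not first order in ε:

```python
# Stylized population: D is randomized with propensity mu = 0.5, and
# Y(1) has mean theta = 1.0; the target is theta = E[D*Y] / mu.
mu, theta = 0.5, 1.0
# "Population" moments computed exactly rather than from a sample:
E_DY = mu * theta          # E[D*Y]
E_D = mu                   # E[D]
m = theta                  # correct outcome regression E[Y(1)]

def plugin(mu_hat):
    # Non-orthogonal moment: E[ D*Y / mu_hat ]
    return E_DY / mu_hat

def orthogonal(mu_hat):
    # Bias-corrected moment: E[ D*Y/mu_hat - (D/mu_hat - 1) * m ]
    return E_DY / mu_hat - (E_D / mu_hat - 1.0) * m

eps = 0.05                 # perturbation of the nuisance
err_plugin = abs(plugin(mu + eps) - theta)
err_orth = abs(orthogonal(mu + eps) - theta)
assert err_orth < err_plugin  # the correction removes first-order sensitivity
```

The α terms above play the same role for the trimmed moments: their conditional means are zero at the true nuisance values, so first-stage estimation error enters the bounds only at second order.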
When Assumption 1 holds and the propensity score is known, the bias-correction terms simplify to

correction_U(W, ξ) = (Pr(S = 1 | D = 0))^{-1} [ Q(1 − p(X), X) ( (1 − D)·S/Pr(D = 0) − s(0, X) )
  − Q(1 − p(X), X) p(X) ( D·S/Pr(D = 1) − s(1, X) )
  + Q(1 − p(X), X) s(1, X) ( D·S·1{Y ≤ Q(1 − p(X), X)}/(s(1, X) Pr(D = 1)) − 1 + p(X) ) ] (B.23)

and, for the lower bound,

correction_L(W, ξ) = (Pr(S = 1 | D = 0))^{-1} [ Q(p(X), X) ( (1 − D)·S/Pr(D = 0) − s(0, X) )
  − Q(p(X), X) p(X) ( D·S/Pr(D = 1) − s(1, X) )
  − Q(p(X), X) s(1, X) ( D·S·1{Y ≤ Q(p(X), X)}/(s(1, X) Pr(D = 1)) − p(X) ) ]. (B.24)

Lemma B.1 (Identification of better Lee bounds). Under Assumption 2, the following statements hold.

(a) The bounds β_L and β_U defined in (B.12) are sharp valid bounds on β in equation (2.1). The moment functions (B.16)-(B.17) and (B.18) obey

E g_U(W, ξ_0) = E m_U(W, ξ_0) = β_U,  E g_L(W, ξ_0) = E m_L(W, ξ_0) = β_L.

(b) If the conditional densities f_d(y | x), d ∈ {0, 1}, in (B.5) have a convex and compact support almost surely in X and a first derivative that is bounded away from zero and infinity, the interval [β_L, β_U] is a sharp identified set for β.

(c) If a stronger version of the independence assumption (Assumption 2(1)) holds,

D ⊥ (Y(1), Y(0), S(1), S(0), X) | X̄,

then (a) and (b) remain true.

Proposition B.2 (Estimation of the Sets X_help and X_hurt). Suppose Assumption 3 holds. Furthermore, suppose ŝ(d, x) converges uniformly over X:

sup_{x ∈ X} sup_{d ∈ {0,1}} |ŝ(d, x) − s(d, x)| = o_P(1).

Then, for N sufficiently large,

Pr( p̂(X_i) < 1 ⇔ X_i ∈ X_help, for all 1 ≤ i ≤ N ) → 1 as N → ∞.

Proof of Proposition B.2. Step 1.
Consider the open set $(\underline{s}/2,\, 1-\bar{s}/2) \times (\underline{s}/2,\, 1-\bar{s}/2)$. With probability $1 - o(1)$, the pair of estimated functions $(\widehat s(0,\cdot), \widehat s(1,\cdot))$ belongs to this set. The function $f(t_1, t_2) = t_1/t_2$ has bounded partial derivatives in any direction on this set. Therefore, $\sup_{x \in \mathcal X} |\widehat p(x) - p(x)| = o_P(1)$ holds. Step 2.
The following statement holds:
\[
\Pr\big( \widehat p(X_i) < 1 < p(X_i) \ \text{or}\ p(X_i) < 1 < \widehat p(X_i)\ \text{for some } i \big) \le \Pr\big( X_i \in \mathcal X:\ 0 < |1 - p(X_i)| < |\widehat p(X_i) - p(X_i)|\ \text{for some } i \big) \le \Pr\Big( \sup_{x \in \mathcal X} |\widehat p(x) - p(x)| > \varepsilon \Big) \to 0.
\]

B.2 Supplementary Statements for Section 4
Let $\mathcal U \subset (0,1)$ be an open set that contains the support of $p(X)$ and $1 - p(X)$. For the sake of exposition, suppose $\mathcal{X}^{\mathrm{hurt}} = \emptyset$, so that $Q(u,x) = Q_1(u,x)$. Suppose that $Q(u,x)$ is a sufficiently smooth function of $x$ relative to its dimension for each $u \in \mathcal U$, and the smoothness index is the same for all $u \in \mathcal U$. Then $Q(u,x)$ can be approximated by a linear form
\[
Q(u,x) = Z(x)'\delta_0(u) + R(u,x), \tag{B.25}
\]
where $Z(x) \in \mathbb{R}^{p_Q}$ is a vector of basis functions, $\delta_0(u)$ is the pseudo-true parameter value, and $R(u,x)$ is the approximation error. Let $N_1 = \sum_{i=1}^N D_i S_i$. The quantile regression estimate $\widehat Q(u,x) = Z(x)'\widehat\delta(u)$, $u \in \mathcal U$, where
\[
\widehat\delta(u) := \arg\min_{\delta \in \mathbb{R}^{p_Q}} \frac{1}{N_1} \sum_{D_i = 1,\, S_i = 1} \rho_u\big(Y_i - Z(X_i)'\delta\big) = \arg\min_{\delta \in \mathbb{R}^{p_Q}} \frac{1}{N_1} \sum_{D_i = 1,\, S_i = 1} \big(u - 1\{Y_i - Z(X_i)'\delta < 0\}\big) \cdot \big(Y_i - Z(X_i)'\delta\big), \tag{B.26}
\]
converges at rate $q_N = \sqrt{p_Q/N} = o(N^{-1/4})$, as shown in Belloni et al. (2019). Remark
B.1 ($\ell_1$-regularized quantile regression of Belloni and Chernozhukov (2013)). For each $u \in \mathcal U$, suppose there exists a vector $\delta_0(u) \in \mathbb{R}^{p_Q}$ with only $s_Q$ out of $p_Q$ non-zero coordinates,
\[
\sup_{u \in \mathcal U} \|\delta_0(u)\|_0 = \sup_{u \in \mathcal U} \sum_{j=1}^{p_Q} 1\{\delta_{0,j}(u) \neq 0\} = s_Q \ll N, \tag{B.27}
\]
so that the approximation error $R(u,x)$ is sufficiently small relative to the sampling error $\sqrt{s_Q \log p_Q / N}$:
\[
\sup_{u \in \mathcal U} \bigg( \frac{1}{N_1} \sum_{D_i = 1,\, S_i = 1} R^2(u, X_i) \bigg)^{1/2} \lesssim_P \sqrt{\frac{s_Q \log p_Q}{N}}.
\]
Furthermore, suppose $\delta_0(u)$ is a Lipschitz function of $u$. The $\ell_1$-regularized quantile regression estimator of Belloni and Chernozhukov (2013) minimizes the $\ell_1$-penalized check function
\[
\widehat\delta^{\mathrm{Lasso}}(u) = \arg\min_{\delta \in \mathbb{R}^{p_Q}} \frac{1}{N_1} \sum_{D_i = 1,\, S_i = 1} \rho_u\big(Y_i - Z(X_i)'\delta\big) + \frac{\lambda \sqrt{u(1-u)}}{N_1} \sum_{j=1}^{p_Q} \widehat\rho_j |\delta_j|, \tag{B.28}
\]
where $\lambda \ge 0$ is the penalty level and $\widehat\rho_j = \big( \frac{1}{N_1} \sum_{D_i = 1,\, S_i = 1} Z_j^2(X_i) \big)^{1/2}$. If the model is sufficiently sparse, Assumption 5 is satisfied with $q_N = \sqrt{s_Q \log p_Q / N} = o(N^{-1/4})$ under the choice of $\lambda$ proposed in Belloni and Chernozhukov (2013).

B.3 Supplementary Statements for Section 5
In this section, I provide a formal analysis of the agnostic approach summarized in Section 5.
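As a minimal sketch of the sample-splitting step (hypothetical simulated data; an $\ell_1$-penalized logistic selection model stands in for whichever covariate-selection rule the researcher prefers), one might split the indices into an auxiliary sample $A$ and a main sample $M$ and select covariates on $A$ only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n, p = 1000, 30
X = rng.normal(size=(n, p))
D = rng.integers(0, 2, size=n)
# Selection depends only on the first covariate (and treatment)
S = (rng.uniform(size=n) < 1 / (1 + np.exp(-(0.5 + X[:, 0] + 0.5 * D)))).astype(int)

idx = rng.permutation(n)
A, M = idx[: n // 2], idx[n // 2:]          # auxiliary / main split

# Select covariates on A only, here via an l1-penalized logistic selection model
sel = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X[A], S[A])
chosen = np.flatnonzero(np.abs(sel.coef_[0]) > 1e-6)
# Bounds would then be estimated on the main sample M using only X[:, chosen]
```

Because the selection rule only sees $A$, the bounds computed on $M$ are, conditional on $A$, free of the selection step's overfitting.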
Algorithm 2
Agnostic Bounds.
1. Select covariates $X = X_A$ based on the auxiliary sample $A$.
2. Let $\beta_L = \beta_{A,L}$ and $\beta_U = \beta_{A,U}$ be the sharp bounds in the model $(D, X_A, S, S \cdot Y)$.
3. Report $(\widehat\beta_{A,L}, \widehat\beta_{A,U})$ based on either (1) conventional Lee bounds, (2) the non-orthogonal moment function (B.16), or (3) the orthogonal moment function (B.18), with logistic and quantile series estimators of the first-stage functions $s_A(d,x)$ and $Q_A(u,x)$.

The key advantage of the agnostic approach is that one is not required to use the orthogonal moment equations (B.18) for the bounds. If $X_A$ consists of a few discrete covariates, one can use conventional Lee (2009) bounds without any smoothness assumptions. If smoothness is economically plausible, one can use the original moment equation (B.17) for the bounds and estimate $s_A(d,x)$ and $Q_A(u,x)$ by classic nonparametric estimators with under-smoothing. Example 4 in Chernozhukov et al. (2013b) gives explicit conditions on the function $Q_A(u,x)$ so that a quantile series estimator's bias due to approximation error is asymptotically negligible. Likewise, Hirano et al. (2003) gives explicit conditions on the function $s_A(d,x)$ so that a logistic series estimator's bias due to approximation error is asymptotically negligible.

Conditional Inference.
Conditional on the auxiliary sample $\mathrm{Data}_A$, the estimator $(\widehat\beta_{A,L}, \widehat\beta_{A,U})$ is asymptotically normal:
\[
\sqrt{|M|} \begin{pmatrix} \widehat\beta_{A,L} - \beta_{A,L} \\ \widehat\beta_{A,U} - \beta_{A,U} \end{pmatrix} \Rightarrow N(0, \Omega_A).
\]
The $(1-\alpha)$ conditional confidence region takes the form
\[
[L_A, U_A] = \big[\, \widehat\beta_{A,L} - |M|^{-1/2}\, \widehat\Omega_{A,LL}^{1/2}\, c_{1-\alpha/2},\ \ \widehat\beta_{A,U} + |M|^{-1/2}\, \widehat\Omega_{A,UU}^{1/2}\, c_{1-\alpha/2} \,\big],
\]
where choosing $c_\tau$ as the $\tau$-quantile of $N(0,1)$ delivers a confidence region for the identified set $[\beta_{A,L}, \beta_{A,U}]$.

Variational Inference.
Different splits $(A, M)$ of the sample $\{1, 2, \ldots, N\}$ yield different target bounds $(\beta_{A,L}, \beta_{A,U})$ and different approximate distributions of these bounds. If we take the splitting uncertainty into account, the pair of bounds $(\beta_{A,L}, \beta_{A,U})$ is random conditional on the full data sample. In practice, one may want to generate several random splits and aggregate the bounds over the partitions. Suppose the following regularity condition holds.

ASSUMPTION 6 (Regularity condition). Suppose that $\mathcal A$ is a set of regular data configurations such that, for all $x \in [0,1]$, under the null hypothesis
\[
\sup_{P \in \mathcal P} \big|\Pr_P[p_A \le x] - x\big| \le \delta = o(1), \quad \text{and} \quad \inf_{P \in \mathcal P} \Pr_P[\mathrm{Data}_A \in \mathcal A] \ge 1 - \gamma = 1 - o(1).
\]
In particular, suppose that this holds for the p-values
\[
\Phi\big(\widehat\Omega_{A,LL}^{-1/2}(\widehat\beta_{A,L} - \beta_{A,L})\big), \quad \Phi\big(\widehat\Omega_{A,UU}^{-1/2}(\widehat\beta_{A,U} - \beta_{A,U})\big), \quad 1 - \Phi\big(\widehat\Omega_{A,LL}^{-1/2}(\widehat\beta_{A,L} - \beta_{A,L})\big), \quad 1 - \Phi\big(\widehat\Omega_{A,UU}^{-1/2}(\widehat\beta_{A,U} - \beta_{A,U})\big).
\]

Assumption 6 is an extension of the PV condition in Chernozhukov et al. (2017). In comparison to the PV condition in Chernozhukov et al. (2017), Assumption 6 involves twice as many p-values: two p-values for the lower bound and two more for the upper bound. For reporting purposes, I use Chernozhukov et al. (2017)'s adjusted point estimator
\[
\widehat\beta_L = \mathrm{Med}[\widehat\beta_{A,L} \mid \mathrm{Data}], \quad \widehat\beta_U = \mathrm{Med}[\widehat\beta_{A,U} \mid \mathrm{Data}].
\]
To quantify the uncertainty of the random split, I define the confidence region of level $(1-\alpha)$:
\[
[L, U] = \big[\, \underline{\mathrm{Med}}[L_A \mid \mathrm{Data}],\ \overline{\mathrm{Med}}[U_A \mid \mathrm{Data}] \,\big], \tag{B.29}
\]
where $\underline{\mathrm{Med}}(X) = \inf\{x \in \mathbb{R} : P_X(X \le x) \ge 1/2\}$ is the lower median and $\overline{\mathrm{Med}}(X) = \sup\{x \in \mathbb{R} : P_X(X \ge x) \ge 1/2\}$ is the upper median.
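A minimal numeric sketch of the lower- and upper-median aggregation in (B.29), with made-up endpoint estimates from a handful of hypothetical splits:

```python
import numpy as np

def lower_median(x):
    """inf{t : P(X <= t) >= 1/2} for the empirical distribution of x."""
    x = np.sort(np.asarray(x))
    return x[int(np.ceil(len(x) / 2)) - 1]

def upper_median(x):
    """sup{t : P(X >= t) >= 1/2} for the empirical distribution of x."""
    x = np.sort(np.asarray(x))
    return x[len(x) - int(np.ceil(len(x) / 2))]

rng = np.random.default_rng(4)
# Hypothetical interval endpoints [L_A, U_A] from 10 random (A, M) splits
L_A = 0.5 + 0.05 * rng.normal(size=10)
U_A = 1.4 + 0.05 * rng.normal(size=10)
L, U = lower_median(L_A), upper_median(U_A)   # aggregated region as in (B.29)
```

For an even number of splits, the lower and upper medians differ by one order statistic, so the aggregated region is slightly conservative relative to a single split.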
I will also consider a related confidence region of level $1-\alpha$:
\[
\mathrm{CR} = \big\{ \beta \in \mathbb{R} :\ p_L(\beta) > \alpha/2,\ p_U(\beta) > \alpha/2 \big\}, \tag{B.30}
\]
where
\[
p_L(\beta) = \Phi\big(\widehat\Omega_{A,LL}^{-1/2}(\beta - \widehat\beta_{A,L})\big), \quad p_U(\beta) = \Phi\big(\widehat\Omega_{A,UU}^{-1/2}(\widehat\beta_{A,U} - \beta)\big).
\]

Theorem B.3 (Uniform Validity of Variational Confidence Region). The confidence region $\mathrm{CR}$ in equation (B.30) obeys $\mathrm{CR} \subseteq [L, U]$. Under Assumption 6,
\[
\Pr_P\big( [\beta_{A,L}, \beta_{A,U}] \subseteq \mathrm{CR} \big) \ge 1 - \alpha - 2(\delta + \gamma) = 1 - \alpha - o(1), \tag{B.31}
\]
and therefore
\[
\Pr_P\big( \beta \in \mathrm{CR} \big) \ge 1 - \alpha - 2(\delta + \gamma) = 1 - \alpha - o(1). \tag{B.32}
\]

B.4 Supplementary Statements for Section 6.1

Multi-Dimensional Outcome: Definitions.
The $d$-dimensional unit sphere is $\mathbb{S}^{d-1} = \{q \in \mathbb{R}^d : \|q\| = 1\}$. For a point of interest $q$ on the unit sphere, the data vector is $W_q = (D, X, S, S \cdot Y_q)$, where $D$ and $X$ are as defined in the one-dimensional case, $S$ is equal to one if and only if each scalar outcome is selected into the sample, and $Y_q = q'Y$. The conditional quantiles in the selected treated and the selected control groups are $Q_d(q,u,x)$:
\[
\Pr\big(Y_q \le Q_d(q,u,x) \mid S = 1, D = d, X = x\big) = u, \quad u \in [0,1],\ d \in \{0,1\},
\]
and the combined conditional quantile is
\[
Q(q,u,x) = 1\{x \in \mathcal{X}^{\mathrm{help}}\}\, Q_1(q,u,x) + 1\{x \in \mathcal{X}^{\mathrm{hurt}}\}\, Q_0(q,u,x).
\]
Likewise, the combined conditional density is
\[
f(q, t \mid x) = 1\{x \in \mathcal{X}^{\mathrm{help}}\}\, f_1(q, t \mid x) + 1\{x \in \mathcal{X}^{\mathrm{hurt}}\}\, f_0(q, t \mid x),
\]
where $f_d(q, t \mid x)$ is the conditional density of $q'Y$ in the $S = 1$, $D = d$, $X = x$ group. Depending on whether $\mu(x)$ is known or not, the first-stage nuisance parameter $\xi(q)$ is as in (B.13)-(B.15), where $Q(u,x)$ is replaced by $Q(q,u,x)$. The sharp upper bound on $q'\beta$ is
\[
\sigma(q) = E\, m_U(W_q, \xi(q)) \tag{B.33}
\]
and the sharp identified set $\mathcal B$ for $\beta$ is
\[
\mathcal B = \bigcap_{q \in \mathbb{R}^d : \|q\| = 1} \{ b \in \mathbb{R}^d : q'b \le \sigma(q) \}. \tag{B.34}
\]
Denote the sample average of a function $f(\cdot)$ as $\mathbb{E}_N[f(W_i)] := \frac{1}{N}\sum_{i=1}^N f(W_i)$ and the centered, root-$N$ scaled sample average as $\mathbb{G}_N[f(W_i)] := \frac{1}{\sqrt N}\sum_{i=1}^N \big[f(W_i) - \int f(w)\, dP(w)\big]$. Let $\ell^\infty(\mathbb{S}^{d-1})$ be the space of almost surely bounded functions defined on the unit sphere $\mathbb{S}^{d-1}$ and $BL(\mathbb{S}^{d-1}, [0,1])$ be the set of real functions on $\mathbb{S}^{d-1}$ with Lipschitz norm bounded by 1.

Definition .
1. For a random sample of size $N$, denote a $K$-fold random partition of the sample indices $[N] = \{1, 2, \ldots, N\}$ by $(J_k)_{k=1}^K$, where $K$ is the number of partitions and the sample size of each fold is $n = N/K$. For each $k \in [K] = \{1, 2, \ldots, K\}$, define $J_k^c = \{1, 2, \ldots, N\} \setminus J_k$.
2. For each $k \in [K]$, construct an estimator $\widehat\xi_k(q)$ of the nuisance parameter $\xi$ using only the data $\{W_j : j \in J_k^c\}$.

Definition . Define
\[
\widehat\sigma(q) := \frac{1}{N} \sum_{i=1}^N g_U(W_{iq}, \widehat\xi(q)), \tag{B.35}
\]
where $g_U(W_{iq}, \widehat\xi(q)) := g_U(W_{iq}, \widehat\xi_k(q))$ for any observation $i \in J_k$, $k = 1, 2, \ldots, K$.

Proposition B.4 (Characterization of Identified Set). Suppose Assumption 2 holds. Then the set $\mathcal B$, defined in (B.34), is a sharp identified set for $\beta$. Furthermore, $\mathcal B$ is a convex and compact set, and $\sigma(q)$, defined in (B.33), is its support function.

Proposition B.4 proves that the sharp identified set $\mathcal B$ is compact and convex and proposes a semiparametric moment equation for its support function.

Proposition B.5 (Orthogonal Moment Equation for Support Function). Suppose Assumption 2 holds and let $g_U(W, \xi)$ be as in (B.18). Then
\[
E\big[\sigma(q) - g_U(W_q, \xi(q))\big] = 0 \tag{B.36}
\]
is an orthogonal moment equation for $\sigma(q)$.

Proposition B.5 establishes that the moment equation (B.36) is orthogonal w.r.t. the nuisance parameter $\xi(q)$ for each $q \in \mathbb{S}^{d-1}$. Define $h(W, q) = \sigma(q) - g_U(W_q, \xi(q))$.

ASSUMPTION 7 (Quantile First-Stage Rate: Multi-Dimensional Case). (1) For each $q \in \mathbb{S}^{d-1}$, the conditional density $f(q, t \mid x)$ exists, is bounded from above and away from zero, and has a bounded derivative, where the bounds do not depend on $q$, almost surely in $X$. (2) Let $\bar{\mathcal U}$ be a compact set in $(0,1)$ containing the support of $p(X)$ and $1 - p(X)$.
There exist a rate $q_N = o(N^{-1/4})$, a sequence of numbers $\varepsilon_N = o(1)$, and a sequence of sets $\mathcal Q_N$ such that the first-stage estimate $\widehat Q(q,u,x)$ of the quantile function $Q(q,u,x): \mathbb{S}^{d-1} \times [0,1] \times \mathcal X \to \mathbb{R}$ belongs to $\mathcal Q_N$ w.p. at least $1 - \varepsilon_N$. Furthermore, the set $\mathcal Q_N$ shrinks sufficiently fast around the true value $Q(q,u,x)$ uniformly on $\bar{\mathcal U}$ and $\mathbb{S}^{d-1}$:
\[
\sup_{\bar Q \in \mathcal Q_N}\ \sup_{q \in \mathbb{S}^{d-1}}\ \sup_{u \in (0,1)} \Big( E_X \big( \bar Q(q,u,X) - Q(q,u,X) \big)^2 \Big)^{1/2} \lesssim q_N = o(N^{-1/4}).
\]
Assumption 7 is a generalization of Assumption 5 from the one-dimensional to the multi-dimensional case.

Lemma B.6 (Limit Theory for the Support Function Process). Suppose Assumptions 2, 4, 7 hold. In addition, if $\mathcal{X}^{\mathrm{help}} \neq \emptyset$ and $\mathcal{X}^{\mathrm{hurt}} \neq \emptyset$, suppose $\widehat s(d,x)$ converges to $s(d,x)$ uniformly over $\mathcal X$ for each $d \in \{0,1\}$. The support function process $S_N(q) = \sqrt N (\widehat\sigma(q) - \sigma(q))$ admits an approximation
\[
S_N(q) = \mathbb{G}_N[h(q)] + o_P(1) \quad \text{in } \ell^\infty(\mathbb{S}^{d-1}).
\]
Moreover, the support function process admits an approximation
\[
S_N(q) = \mathbb{G}[h(q)] + o_P(1) \quad \text{in } \ell^\infty(\mathbb{S}^{d-1}),
\]
where the process $\mathbb{G}[h(q)]$ is a tight P-Brownian bridge in $\ell^\infty(\mathbb{S}^{d-1})$ with covariance function
\[
\Omega(q_1, q_2) = E[h(W, q_1)\, h(W, q_2)], \quad q_1, q_2 \in \mathbb{S}^{d-1},
\]
that is uniformly Hölder on $\mathbb{S}^{d-1} \times \mathbb{S}^{d-1}$. Furthermore, the canonical distance between the law of the support function process $S_N(q)$ and the law of $\mathbb{G}[h(q)]$ in $\ell^\infty(\mathbb{S}^{d-1})$ approaches zero, namely
\[
\sup_{g \in BL(\mathbb{S}^{d-1}, [0,1])} \big| E[g(S_N)] - E[g(\mathbb{G}[h])] \big| \to 0.
\]

Lemma B.6 says that the support function estimator is asymptotically equivalent to a Gaussian process and can be used for pointwise and uniform inference about the support function. By orthogonality, the first-stage estimation error does not contribute to the total uncertainty of the two-stage procedure.
In particular, orthogonality allows me to avoid relying on any particular first-stage estimator and to employ modern regularized techniques to estimate the first stage.
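The construction of $\mathcal B$ in (B.34) from its support function can be illustrated numerically. The sketch below uses a made-up set whose support function is known in closed form (a unit disk centered at $(1, 0)$, rather than an estimated Lee-bound set) and checks membership of candidate points against a grid of directions on $\mathbb{S}^1$:

```python
import numpy as np

# Support function of the unit disk centered at c = (1, 0): sigma(q) = q'c + 1
thetas = np.linspace(0, 2 * np.pi, 200, endpoint=False)
Q = np.column_stack([np.cos(thetas), np.sin(thetas)])   # grid of directions on S^1
sigma = Q @ np.array([1.0, 0.0]) + 1.0

def in_set(b):
    """Membership in B = intersection over q of {b : q'b <= sigma(q)}."""
    return bool(np.all(Q @ b <= sigma + 1e-9))

inside = in_set(np.array([1.5, 0.0]))    # a point inside the disk
outside = in_set(np.array([2.5, 0.5]))   # a point outside the disk
```

With an estimated support function $\widehat\sigma(q)$ in place of the closed form, the same intersection of half-spaces yields the estimated identified set.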
Theorem B.7 (Limit Inference on Support Function Process). Let the assumptions of Lemma B.6 hold. For any $\widehat c_N = c_N + o_P(1)$, $c_N = O_P(1)$, and $f \in \mathcal F_c$,
\[
\Pr\big( f(S_N) \le \widehat c_N \big) - \Pr\big( f(\mathbb{G}[h]) \le c_N \big) \to 0.
\]
If $c_N(1-\tau)$ is the $(1-\tau)$-quantile of $f(\mathbb{G}[h])$ and $\widehat c_N(1-\tau) = c_N(1-\tau) + o_P(1)$ is any consistent estimate of this quantile, then
\[
\Pr\big( f(S_N) \le \widehat c_N(1-\tau) \big) \to 1 - \tau.
\]

Definition . Let $B$ denote the number of bootstrap repetitions. For each $b \in \{1, 2, \ldots, B\}$, repeat:
1. Draw $N$ i.i.d. exponential random variables $(e_i)_{i=1}^N$, $e_i \sim \mathrm{Exp}(1)$.
2. Estimate $\widetilde\sigma_b(q) = \mathbb{E}_N\big[ e_i\, g_U(W_{iq}, \widehat\xi_i(q)) \big]$.
Let $\bar e_i = e_i - 1$ and $\bar h = h - E[h]$, and let $P_e$ be the probability measure conditional on the data.

Lemma B.8 (Limit Theory for the Bootstrap Support Function Process). Suppose the assumptions of Lemma B.6 hold. The bootstrap support function process $\widetilde S_N(q) = \sqrt N (\widetilde\sigma(q) - \widehat\sigma(q))$ admits the following approximation conditional on the data:
\[
\widetilde S_N(q) = \mathbb{G}_N[\bar e_i\, h_i(q)] + o_{P_e}(1) \quad \text{in } \ell^\infty(\mathbb{S}^{d-1}).
\]
Moreover, the bootstrap support function process admits an approximation conditional on the data
\[
\widetilde S_N(q) = \widetilde{\mathbb{G}}[h(q)] + o_{P_e}(1) \quad \text{in } \ell^\infty(\mathbb{S}^{d-1}), \text{ in probability } P,
\]
where $\widetilde{\mathbb{G}}[h(q)]$ is a sequence of tight P-Brownian bridges in $\ell^\infty(\mathbb{S}^{d-1})$ with the same distributions as the processes $\mathbb{G}_N[h(q)]$ defined in Lemma B.6, and independent of $\mathbb{G}_N[h(q)]$. Furthermore, the canonical distance between the law of the bootstrap support function process $\widetilde S_N(q)$ conditional on the data and the law of $\mathbb{G}[h]$ in $\ell^\infty(\mathbb{S}^{d-1})$ approaches zero, namely
\[
\sup_{g \in BL(\mathbb{S}^{d-1}, [0,1])} \big| E_{P_e}[g(\widetilde S_N)] - E[g(\mathbb{G}[h])] \big| \to_P 0,
\]
where $BL(\mathbb{S}^{d-1}, [0,1])$ is the set of real functions on $\mathbb{S}^{d-1}$ with Lipschitz norm bounded by 1.
Lemma B.8 says that the weighted bootstrap can be used to approximate the support function process of Lemma B.6 and to conduct uniform inference about the support function. By orthogonality, I do not need to re-estimate the first-stage parameter in each bootstrap repetition. Instead, I estimate the first-stage parameter once on an auxiliary sample and plug the estimate into the bootstrap sampling procedure. Therefore, the orthogonal Bayes bootstrap is faster to compute than a non-orthogonal Bayes bootstrap, where both the first and the second stage are re-estimated in each bootstrap repetition.
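A minimal numeric sketch of the exponential-weight (Bayes) bootstrap for a sample-average functional, with made-up scores standing in for the orthogonal scores $g_U(W_i, \widehat\xi)$; the first stage is computed once and held fixed across repetitions:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
h = rng.normal(loc=1.0, size=n)              # stand-in for fixed orthogonal scores

B = 300
sigma_hat = h.mean()                          # point estimate (first stage fixed)
boot = np.empty(B)
for b in range(B):
    e = rng.exponential(scale=1.0, size=n)    # i.i.d. Exp(1) bootstrap weights
    boot[b] = np.average(h, weights=e)        # one weighted-bootstrap draw

# The bootstrap spread approximates the sampling sd of the mean, sd(h)/sqrt(n)
boot_sd_scaled = np.sqrt(n) * boot.std()      # should be close to sd(h), here ~1
```

Because only the weights are redrawn, each repetition costs one weighted average rather than a full re-estimation of the first and second stages.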
Theorem B.9 (Bootstrap Inference on the Support Function Process). Suppose the assumptions of Lemma B.6 hold. For any $c_N = O_P(1)$ and $f \in \mathcal F$, we have
\[
\Pr\big( f(S_N) \le c_N \big) - \Pr_e\big( f(\widetilde S_N) \le c_N \big) \to_P 0.
\]
In particular, if $\widetilde c_N(1-\tau)$ is the $(1-\tau)$-quantile of $f(\widetilde S_N)$ under $\Pr_e$, then
\[
\Pr\big( f(S_N) \le \widetilde c_N(1-\tau) \big) \to_P 1 - \tau.
\]

Lemmas B.6 and B.8 are generalizations of Theorems 1–4 in Chandrasekhar et al. (2012). Unlike Chandrasekhar et al. (2012), the support function estimator proposed here is based on the orthogonal moment equation (B.36). Therefore, I do not rely on a series estimator of the first-stage nuisance parameter $\xi(q)$, but allow any machine learning method obeying Assumption 10 to be used. As a result, this theory can accommodate many covariates.

B.5 Supplementary Statements for Section 6.2
In this section, I derive sharp Lee bounds for the intent-to-treat parameter given in equation (8.5). For the sake of exposition, suppose $\mathcal{X}^{\mathrm{hurt}} = \emptyset$. Suppose there exists a vector $\bar X$ of saturated covariates so that a stronger version of the independence assumption holds.

ASSUMPTION 8 (Conditional Independence). Conditional on $\bar X$, $D$ is independent of $(Y(0), Y(1), S(0), S(1))$.

The full vector $X$ is $X = (\bar X, \widetilde X)$, where $\bar X$ is the stratification covariate vector and $\widetilde X$ includes all other baseline covariates. Let $A$ be an event and $\xi$ be a random variable. For a given event $A$, its probability conditional on being an always-taker is
\[
\Pr_0(A) = \Pr(A \mid S(1) = S(0) = 1). \tag{B.37}
\]
For a random variable $\xi$, its expectation conditional on being an always-taker is
\[
E_0[\xi] = E[\xi \mid S(1) = S(0) = 1]. \tag{B.38}
\]
Suppose $\mathcal{X}^{\mathrm{hurt}} = \emptyset$. The lower and upper truncation sets are
\[
T_L(W) := \{Y \ge Q(p(X), X)\} \cap \{D = 1\}, \tag{B.39}
\]
\[
T_U(W) := \{Y \le Q(1 - p(X), X)\} \cap \{D = 1\}. \tag{B.40}
\]
In what follows, I present the argument for the lower truncation set $T_L(W)$; a symmetric argument holds for $T_U(W)$. For a given event $A$, its probability conditional on not being trimmed is
\[
\Pr_{T_L}(A) = \Pr(A \mid S = 1, W \notin T_L(W)). \tag{B.41}
\]
For a random variable $\xi$, its expectation conditional on not being trimmed is
\[
E_{T_L}[\xi] = E[\xi \mid S = 1, W \notin T_L(W)]. \tag{B.42}
\]
For a covariate value $x$, the density of $X$ conditional on being an always-taker is $f_0(x) = f(x \mid S(1) = S(0) = 1)$. Likewise, the density of $X$ conditional on not being trimmed is $f_{T_L}(x) = f(x \mid S = 1, W \notin T_L(W))$. The target parameter $\beta$ is the coefficient on $D$ in the infeasible regression (8.5). The proposed bound $\beta_L$ is the coefficient on $D$ in the feasible trimmed regression
\[
Y = \beta_{L,0} + D \beta_L + \bar X' \beta_{L,X} + U, \quad S = 1,\ W \notin T_L(W), \tag{B.43}
\]
and $\beta_U$ is defined similarly, where $T_L(W)$ is replaced by $T_U(W)$.
An orthogonal moment equation for $\beta_L$ is
\[
g_L(W, \xi) = (\mu_{10})^{-1}\, w_{T_L}(\bar X)\big( \beta^{\mathrm{help}}_L(X)\, s(0,X) + \alpha_L(W; \xi) \big), \tag{B.44}
\]
where $\beta^{\mathrm{help}}_L(x)$ is defined in (B.7), $\alpha_L(W;\xi)$ is defined in (B.21), and $\mu_{10}$ is as in (B.11).

Lemma B.10 (Intent-to-Treat Theorem, Angrist and Pischke (2009)). Under Assumption 8, the regression coefficient $\beta$ can be represented as
\[
\beta = E\, w(\bar X)\big( E[Y \mid D = 1, \bar X] - E[Y \mid D = 0, \bar X] \big), \tag{B.45}
\]
where the weighting function is
\[
w(\bar X) = \frac{\Pr(D = 1 \mid \bar X)\Pr(D = 0 \mid \bar X)}{E[\Pr(D = 1 \mid \bar X)\Pr(D = 0 \mid \bar X)]}.
\]
Likewise, $\beta_L$ can be represented as (B.45), where $\Pr$ is replaced by $\Pr_{T_L}$.

Lemma B.11 (Equal covariate distributions under $\Pr_0$ and $\Pr_{T_L}$). Under Assumptions 2 and 8, the conditional density $f_0(x)$ coincides with $f_{T_L}(x)$ almost surely in $X$: $f_0(x) = f_{T_L}(x)$, $x \in \mathcal X$.

Lemma B.12 (Equal propensity scores under $\Pr_0$ and $\Pr_{T_L}$). Under Assumptions 2 and 8, the conditional propensity score $\Pr(D = 1 \mid X = x)$ coincides with $\Pr_{T_L}(D = 1 \mid X = x)$ almost surely in $X$:
\[
\Pr(D = 1 \mid X = x) \overset{(i)}{=} \Pr_0(D = 1 \mid X) \overset{(ii)}{=} \Pr(D = 1 \mid \bar X) \overset{(iii)}{=} \Pr_{T_L}(D = 1 \mid X = x). \tag{B.46}
\]

Proposition B.13 (Sharp Bounds on Intent-to-Treat Effect: Identification). Let $\beta_L$ and $\beta_U$ be as defined in equation (B.43), and $\beta$ be as defined in equation (8.5). Under Assumptions 2 and 8, $\beta_L$ and $\beta_U$ are sharp bounds on $\beta$:
\[
\beta_L \le \beta \le \beta_U. \tag{B.47}
\]

Lemma B.14 (Sharp Bounds on Intent-to-Treat Effect: Estimation and Inference). Suppose Assumptions 2 and 8 hold with $\mathcal{X}^{\mathrm{hurt}} = \emptyset$. Suppose the first-stage parameter $\xi$ is estimated by logistic and quantile series, and under-smoothing conditions hold. Then the empirical analog of (B.43) based on the estimated $\widehat\xi$ is asymptotically equivalent to a sample average. In addition, if Assumptions 4 and 7 hold, the plug-in cross-fitting orthogonal estimator of $\beta_L$ based on equation (B.44) is asymptotically linear:
\[
\frac{1}{\sqrt N}\sum_{i=1}^N g_L(W_i, \widehat\xi) = \frac{1}{\sqrt N}\sum_{i=1}^N g_L(W_i, \xi) + o_P(1).
\]
(B.48)

B.6 Supplementary Statements for Section 6.3
In this section, I define the sharp bounds on the local average treatment effect parameter defined in equation (8.6). Let $Z \in \{0,1\}$ denote the instrument and $D \in \{0,1\}$ the treatment. Let $S(d,z)$ be the selection outcome when the instrument is equal to $z$ and the treatment is equal to $d$, and let $Y(d,z)$ be the corresponding potential outcome. The observed data $(Z_i, D_i, X_i, S_i, S_i Y_i)_{i=1}^N$ consist of the instrument, the treatment, a baseline covariate vector $X_i$, the selection outcome
\[
S_i = \sum_{d \in \{0,1\}} \sum_{z \in \{0,1\}} 1\{Z_i = z\}\, 1\{D_i = d\}\, S_i(d,z),
\]
and the observed outcomes for the selected subjects,
\[
S_i \cdot Y_i = S_i \cdot \Big( \sum_{d \in \{0,1\}} \sum_{z \in \{0,1\}} 1\{Z_i = z\}\, 1\{D_i = d\}\, Y_i(d,z) \Big).
\]

ASSUMPTION 9 (Assumptions for LATE). The following statements hold.
(1) The vector $\big(D(0), D(1), \{(Y(d,z), S(d,z))\}_{d \in \{0,1\},\, z \in \{0,1\}}\big)$ is independent of $Z$ conditional on a subset of stratification covariates $\bar X$.
(2) Exclusion for outcome. For each $d \in \{0,1\}$, the outcome $Y(d,0) = Y(d,1) = Y(d)$ almost surely.
(3) Monotonicity of treatment w.r.t. instrument. The instrument affects treatment in the same direction: $D(1) \ge D(0)$ almost surely.
(4) First stage. $\Pr(D(1) > D(0) \mid \bar X = \bar x) > 0$ almost surely in $\bar X$.
(5) Exclusion for selection. For each $z \in \{0,1\}$, the outcome $S(0,z) = S(1,z) = S(z)$ almost surely.
(6) Independence of selection and treatment. Treatment potential outcomes $\{D(0), D(1)\}$ are independent of selection potential outcomes $\{S(0), S(1)\}$ conditional on $X$.

For the sake of exposition, suppose $\mathcal{X}^{\mathrm{hurt}} = \emptyset$. Let $Q(u,x,d)$ be the conditional $u$-quantile of $Y$ in the $\{S = 1, D = d, Z = 1\}$ group:
\[
Q(u,x,d):\ \Pr\big(Y \le Q(u,x,d) \mid S = 1, D = d, Z = 1, X = x\big) = u.
\]
The lower truncation set is
\[
\Lambda_L(W) = \bigcup_{d \in \{0,1\}} \big\{ \{Y \ge Q(p(X), X, d)\} \cap \{D = d\} \cap \{Z = 1\} \big\}. \tag{B.49}
\]
The upper truncation set is
\[
\Lambda_U(W) = \bigcup_{d \in \{0,1\}} \big\{ \{Y \le Q(1 - p(X), X, d)\} \cap \{D = d\} \cap \{Z = 1\} \big\}. \tag{B.50}
\]
For an event $A$, the probability of $A$ conditional on not being trimmed is
\[
\Pr_{\Lambda_L}(A) = \Pr(A \mid S = 1, W \notin \Lambda_L(W)).
\]
For a random variable $\xi$, the expectation of $\xi$ conditional on not being trimmed is
\[
E_{\Lambda_L}[\xi] = E[\xi \mid S = 1, W \notin \Lambda_L(W)].
\]
The target parameter $\pi$ is the 2SLS coefficient on $D$ in the infeasible regression (8.6)-(6.7). The proposed bound $\pi_L$ is the 2SLS coefficient on $D$ in the feasible trimmed regression
\[
Y = \pi_{L,0} + D \pi_L + \bar X' \pi_{L,X} + \varepsilon, \quad S = 1,\ W \notin \Lambda_L(W), \tag{B.51}
\]
where the first-stage equation is
\[
D = \delta_{L,0} + Z \delta_L + \bar X' \delta_{L,X} + \xi, \quad S = 1,\ W \notin \Lambda_L(W). \tag{B.52}
\]

Lemma B.15 (LATE Theorem, Angrist and Imbens (1995)). Let $\pi$ follow the definition in (8.6). If Assumption 9 (1)–(4) holds, $\pi$ can be represented as
\[
\pi = E\, \omega(\bar X)\, E[Y(1) - Y(0) \mid D(1) > D(0), \bar X] = E\, \omega(\bar X)\, \frac{E[Y \mid Z = 1, \bar X] - E[Y \mid Z = 0, \bar X]}{\Pr(D = 1 \mid Z = 1, \bar X) - \Pr(D = 1 \mid Z = 0, \bar X)}, \tag{B.53}
\]
where the weighting function $\omega(\bar X)$ takes the form
\[
\omega(\bar X) = \frac{V\big( \Pr(D = 1 \mid \bar X, Z) \mid \bar X \big)}{E\big[ V\big( \Pr(D = 1 \mid \bar X, Z) \mid \bar X \big) \big]}.
\]
Likewise, $\pi_L$ can be represented as (B.53), where $\Pr$ is replaced by $\Pr_{\Lambda_L}$.

Lemma B.16 (Equal covariate distributions under $\Pr_0$ and $\Pr_{\Lambda_L}$). Under Assumptions 2 and 9, the conditional densities are equal: $f_0(x) = f_{\Lambda_L}(x)$ for any $x \in \mathcal X$. Furthermore, the integrated covariate densities are equal to each other,
\[
\int_{\widetilde x \in \widetilde{\mathcal X}} f_0(\bar x, \widetilde x)\, d\widetilde x = \int_{\widetilde x \in \widetilde{\mathcal X}} f_{\Lambda_L}(\bar x, \widetilde x)\, d\widetilde x, \quad \text{for any } \bar x.
\]

Lemma B.17 (Equal treatment distributions under $\Pr_0$ and $\Pr_{\Lambda_L}$). Under Assumptions 2, 8 and 9, the following equality holds:
\[
\Pr(D = 1 \mid Z = z, X) = \Pr_0(D = 1 \mid Z = z, X) = \Pr_{\Lambda_L}(D = 1 \mid Z = z, X), \quad z \in \{0,1\}. \tag{B.54}
\]

Proposition B.18 (Sharp Bounds on LATE: Identification). Let $\pi_L$ and $\pi_U$ be as defined in equations (B.51)-(B.52), and $\pi$ be as defined in equations (8.6)-(6.7). Under Assumptions 2 and 9, $\pi_L$ and $\pi_U$ are sharp bounds on $\pi$:
\[
\pi_L \le \pi \le \pi_U. \tag{B.55}
\]

Appendix C: Proofs for Sections 4-6
Notation.
I use the following standard notation. Let $\mathbb{S}^{d-1} = \{q \in \mathbb{R}^d : \|q\| = 1\}$ be the $d$-dimensional unit sphere. I use standard notation for numeric and stochastic dominance. For two numeric sequences $\{a_n, n \ge 1\}$ and $\{b_n, n \ge 1\}$, let $a_n \lesssim b_n$ stand for $a_n = O(b_n)$. For two sequences of random variables $\{a_n, n \ge 1\}$ and $\{b_n, n \ge 1\}$, let $a_n \lesssim_P b_n$ stand for $a_n = O_P(b_n)$. Let $a \wedge b = \min\{a, b\}$ and $a \vee b = \max\{a, b\}$. For a random variable $\xi$, write $\bar\xi := \xi - E[\xi]$ for its centered version. Let $\ell^\infty(\mathbb{S}^{d-1})$ be the space of almost surely bounded functions defined on the unit sphere $\mathbb{S}^{d-1}$. Define the $L_{P,c}$ norm of a vector-valued random variable $W$ as $\|W\|_{L_{P,c}} := \big( \int_{w \in \mathcal W} \|w\|^c\, dP(w) \big)^{1/c}$. Let $\mathcal W$ be the support of the data vector $W$ with distribution $P_W$, and let $(W_i)_{i=1}^N$ be an i.i.d. sample from the distribution $P_W$.

I use standard notation for vector and matrix norms. For a vector $v \in \mathbb{R}^d$, denote the $\ell_2$ norm of $v$ as $\|v\|_2 := \big( \sum_{j=1}^d v_j^2 \big)^{1/2}$, the $\ell_1$ norm of $v$ as $\|v\|_1 := \sum_{j=1}^d |v_j|$, the $\ell_\infty$ norm of $v$ as $\|v\|_\infty := \max_{1 \le j \le d} |v_j|$, and the $\ell_0$ norm of $v$ as $\|v\|_0 := \sum_{j=1}^d 1\{v_j \neq 0\}$. For a matrix $M$, denote its operator norm by $\|M\| = \sup_{\alpha \in \mathbb{S}^{d-1}} \|M\alpha\|_2$. Denote the sample average of a function $f(\cdot)$ as $\mathbb{E}_N[f(W_i)] := \frac{1}{N}\sum_{i=1}^N f(W_i)$ and the centered, root-$N$ scaled sample average as $\mathbb{G}_N[f(W_i)] := \frac{1}{\sqrt N}\sum_{i=1}^N [f(W_i) - \int f(w)\, dP(w)]$.

Fix a partition $k$ in the set of partitions $[K] = \{1, 2, \ldots, K\}$. Define the sample average of a function $f(\cdot)$ within this partition as
\[
\mathbb{E}_{n,k}[f] = \frac{1}{n}\sum_{i \in J_k} f(W_i) \tag{C.1}
\]
and the scaled, centered sample average as
\[
\mathbb{G}_{n,k}[f] = \frac{1}{\sqrt n}\sum_{i \in J_k} \big[ f(W_i) - \int f(w)\, dP(w) \big].
\]
For each partition index $k \in [K]$, define the event $\mathcal E_{n,k} := \{\widehat\xi_k \in \Xi_N\}$ as the nuisance estimate $\widehat\xi_k$ belonging to the nuisance realization set $\Xi_N$.
Define
\[
\mathcal E_N = \cap_{k=1}^K \mathcal E_{n,k} \tag{C.2}
\]
as the intersection of such events.

Proof of Lemma B.1.
Proof of Lemma B.1(a) is shown in Steps 1-3. Proof of LemmaB.1(b) is shown in Steps 4-6.
Step 1. According to Lee (2009) (Proposition 3 in the working paper version, Lee (2005)), (B.6)-(B.7) are valid sharp bounds on $E[Y(1) - Y(0) \mid S(1) = S(0) = 1, X = x]$ for $x \in \mathcal{X}^{\mathrm{help}}$. Likewise, (B.8)-(B.9) are valid sharp bounds on $E[Y(1) - Y(0) \mid S(1) = S(0) = 1, X = x]$ for $x \in \mathcal{X}^{\mathrm{hurt}}$.

Step 2. For $x \in \mathcal X^\star$, Bayes' rule implies that
\[
f^\star(x \mid S(1) = S(0) = 1) = \frac{\Pr(S(1) = S(0) = 1 \mid X = x)\, f(x)}{\Pr(S(1) = S(0) = 1 \mid X \in \mathcal X^\star)}, \quad \star \in \{\mathrm{help}, \mathrm{hurt}\}.
\]
By Assumption 2,
\[
\Pr(S(1) = S(0) = 1 \mid X = x) = \Pr(S(0) = 1 \mid X = x) = s(0,x) \quad \text{for any } x \in \mathcal{X}^{\mathrm{help}},
\]
\[
\Pr(S(1) = S(0) = 1 \mid X = x) = \Pr(S(1) = 1 \mid X = x) = s(1,x) \quad \text{for any } x \in \mathcal{X}^{\mathrm{hurt}}.
\]
Therefore, the bounds
\[
\int_{x \in \mathcal{X}^{\mathrm{help}}} \bar\beta^{\mathrm{help}}_\star(x)\, f^{\mathrm{help}}(x \mid S(1) = S(0) = 1)\, dx \cdot \Pr(X \in \mathcal{X}^{\mathrm{help}}) + \int_{x \in \mathcal{X}^{\mathrm{hurt}}} \bar\beta^{\mathrm{hurt}}_\star(x)\, f^{\mathrm{hurt}}(x \mid S(1) = S(0) = 1)\, dx \cdot \Pr(X \in \mathcal{X}^{\mathrm{hurt}}), \quad \star \in \{L, U\},
\]
coincide with the bounds in (B.12).

Step 3. Validity of the moment equations. For $X \in \mathcal{X}^{\mathrm{help}}$,
\[
s(0,X)\, E\big[Y \mid D = 1, S = 1, X, Y \ge Q(1 - p(X), X)\big] \overset{(i)}{=} s(0,X)\, E\Big[\frac{Y \cdot 1\{Y \ge Q(1 - p(X), X)\}}{p(X)} \,\Big|\, D = 1, S = 1, X\Big]
\]
\[
= s(0,X)\, E\Big[\frac{D \cdot S \cdot Y \cdot 1\{Y \ge Q(1 - p(X), X)\}}{p(X)\, s(1,X)\, \mu(X)} \,\Big|\, X\Big] \overset{(ii)}{=} E\Big[\frac{D}{\mu(X)} \cdot S \cdot Y \cdot 1\{Y \ge Q(1 - p(X), X)\} \,\Big|\, X\Big],
\]
where (i) follows from $\Pr(Y \ge Q(1 - p(X), X) \mid D = 1, S = 1, X) = p(X)$ and (ii) follows from $p(X) = s(0,X)/s(1,X)$. Likewise,
\[
s(0,X)\, E[Y \mid S = 1, D = 0, X] = E\Big[\frac{1 - D}{1 - \mu(X)}\, S \cdot Y \,\Big|\, X\Big],
\]
and
\[
E\big[ 1\{X \in \mathcal{X}^{\mathrm{help}}\}\, s(0,X)\, \bar\beta^{\mathrm{help}}_U(X) \big] = E\big[ 1\{X \in \mathcal{X}^{\mathrm{help}}\}\, m_U(W, \xi) \big].
\]
By a similar argument, $E[1\{X \in \mathcal{X}^{\mathrm{hurt}}\}\, s(1,X)\, \bar\beta^{\mathrm{hurt}}_U(X)] = E[1\{X \in \mathcal{X}^{\mathrm{hurt}}\}\, m_U(W, \xi)]$.

Step 4. Proof of Lemma B.1(b). First, suppose there are no covariates (i.e., $X = \emptyset$) and Assumption 1 holds with p <
1. It suffices to show that, for each value of $\beta \in [\beta_L, \beta_U]$, there exists a distribution of the always-takers' outcome $Y(1)$ such that $E[Y(1) - Y(0) \mid S(1) = S(0) = 1] = \beta$. Let $\lambda \in [0,1]$. Consider the following distribution function $F_\lambda(t)$, $t \in (-\infty, \infty)$:
\[
F_\lambda(t) = \begin{cases} 0, & t \le Q((1-\lambda)(1-p)), \\ p^{-1} \int_{Q((1-\lambda)(1-p))}^{t} f_1(y)\, dy, & Q((1-\lambda)(1-p)) \le t \le Q(\lambda p + (1-\lambda)), \\ 1, & t \ge Q(\lambda p + (1-\lambda)), \end{cases}
\]
where $Q(u)$ is the $u$-quantile of $Y$ in the $S = 1$, $D = 1$ group. When $\lambda = 0$, $F_0(t)$ corresponds to the upper tail of the distribution $f_1(y)$, and $E_{F_0}[Y(1)] - E[Y(0)] = \beta_U$. When $\lambda = 1$, $F_1(t)$ corresponds to the lower tail of the distribution $f_1(y)$, and $E_{F_1}[Y(1)] - E[Y(0)] = \beta_L$. Observe that
\[
F(t) = p F_\lambda(t) + (1 - p) G_\lambda(t), \tag{C.3}
\]
where $G_\lambda(t)$ is a valid c.d.f. Therefore, $F(t)$ can be represented as a mixture of two valid c.d.f.s with mixing probabilities $p$ and $1 - p$. Therefore, $F_\lambda(t)$ is a plausible c.d.f. for the always-takers.

Step 5. In this step, I show that $\beta_\lambda = E_{F_\lambda}[Y(1)] - E[Y(0)]$, $\lambda \in [0,1] \to [\beta_L, \beta_U]$, is a non-increasing and continuous function of $\lambda$. By construction, $F_{\lambda'}(t)$ first-order stochastically dominates $F_\lambda(t)$ for $\lambda' < \lambda$, which implies the first statement. Second, let $C$ be an upper bound on $\sup_{t \in \mathbb{R}} |t \cdot f_1(t)|$. By the assumption of the Lemma, $C$ is finite. Thus, for two real numbers $\lambda'$ and $\lambda$, the mean value theorem implies:
\[
|\beta_\lambda - \beta_{\lambda'}| \le p^{-1} C \big| Q(\lambda p + (1-\lambda)) - Q(\lambda' p + (1-\lambda')) \big| + p^{-1} C \big| Q((1-\lambda)(1-p)) - Q((1-\lambda')(1-p)) \big| \le 2 p^{-1} C \sup_{u \in \mathcal U} |Q'(u)|\, |\lambda - \lambda'|,
\]
where $Q'(u) = 1/f_1(Q(u))$ is the quantile's derivative, which exists and is bounded by the assumption of the Lemma. Therefore, $\beta_\lambda$ is a continuous and monotone function from $[0,1]$ onto $[\beta_L, \beta_U]$. Therefore, each point $\beta \in [\beta_L, \beta_U]$ corresponds to some $\lambda \in [0,1]$, and Lemma B.1(b) follows.

Step 6. Suppose Assumption 2 holds instead of Assumption 1.
By the argument above, $[\beta_L^{\mathrm{help}}(x), \beta_U^{\mathrm{help}}(x)]$ is a sharp identified set for $E[Y(1) - Y(0) \mid S(1) = S(0) = 1, X = x]$, $x \in \mathcal{X}^{\mathrm{help}}$. Because the density
\[
f^{\mathrm{help}}(x \mid S(1) = S(0) = 1) = (\mu_{10}^{\mathrm{help}})^{-1}\, s(0,x)\, f(x)
\]
is an identified function, Lemma B.1(b) holds under Assumption 2. Furthermore, Lemma B.1(c) holds.

Proof of Theorem 1.
According to Proposition B.2, the sets $\mathcal{X}^{\mathrm{help}}$ and $\mathcal{X}^{\mathrm{hurt}}$ are consistently estimated. Therefore, in the proof below, I condition on the event $\{\widehat p(X_i) < 1 \Leftrightarrow X_i \in \mathcal{X}^{\mathrm{help}}\}$ for all $i \in \{1, 2, \ldots, N\}$, which occurs w.p. approaching 1. Theorem 1 is a special case of Theorem B.7 with $\mathbb{S}^{d-1} = \{-1, 1\}$. Here I verify Assumption 7 for $\mathbb{S}^{d-1} = \{-1, 1\}$. In the one-dimensional case, $Q(1, u, x)$ reduces to $Q(u,x)$. Furthermore, the quantile of $-Y$, denoted by $Q(-1, u, x)$, reduces to $-Q(1-u, x)$. Assumption 7(1)-(2) holds by Assumption 5(1)-(2).

Proof of Theorem B.3.
Step 1.
The probability of non-coverage is bounded as
\[
\Pr_P\big( [\beta_{A,L}, \beta_{A,U}] \not\subseteq \mathrm{CR} \big) \le \Pr_P\big( p_L(\beta_{A,L}) < \alpha/2 \big) + \Pr_P\big( p_U(\beta_{A,U}) < \alpha/2 \big) \le \alpha + 2(\delta + \gamma),
\]
where the last inequality holds by Lemma 3.1 in Chernozhukov et al. (2017). Therefore, CR is a valid confidence region for the identified set $[\beta_{A,L}, \beta_{A,U}]$. Step 2.
By the proof of Theorem 3.2 in Chernozhukov et al. (2017),
\[
\{\beta \in \mathbb{R} : p_U(\beta) > \alpha/2\} = \big\{\beta \in \mathbb{R} : \mathrm{Med}\big[ \widehat\Omega_{A,UU}^{-1/2}(\beta - \widehat\beta_{A,U}) - \Phi^{-1}(1 - \alpha/2) \,\big|\, \mathrm{Data} \big] < 0 \big\}. \tag{C.4}
\]
Since the R.H.S. of (C.4) is a monotone increasing function of $\beta$, it suffices to show that
\[
\mathrm{Med}\big[ \widehat\Omega_{A,UU}^{-1/2}(U - \widehat\beta_{A,U}) - \Phi^{-1}(1 - \alpha/2) \,\big|\, \mathrm{Data} \big] \ge 0. \tag{C.5}
\]
By the definition of $U$,
\[
E\big[ 1\{ U - \widehat\beta_{A,U} - \widehat\Omega_{A,UU}^{1/2}\, \Phi^{-1}(1 - \alpha/2) \ge 0 \} \,\big|\, \mathrm{Data} \big] \ge 1/2. \tag{C.6}
\]
By Step 2 of the proof of Theorem 3.2 in Chernozhukov et al. (2017), (C.6) implies (C.5). Step 3.
By the definition of Lee bounds, $\beta \in [\beta_{A,L}, \beta_{A,U}]$ for any $A$. Therefore, (B.31) implies (B.32).

Proof of Proposition B.4.
Step 1.
I invoke Lemma B.1 with data $W_q = (D, X, S, S \cdot Y_q)$, where $S$ is equal to one if and only if each scalar outcome is selected, and $Y_q = q'Y$. Let $\beta$ be the true parameter and $\mathcal B'$ be the true sharp identified set for $\beta$. By Lemma B.1, (B.33) is a sharp upper bound on $q'\beta$. Therefore, (B.34) implies that
\[
\sigma(q) = \sup_{b \in \mathcal B'} q'b. \tag{C.7}
\]
Step 2.
To show that $\sigma(q)$ defined in (B.33) is the support function of some compact and convex set, I need to show that $\sigma(q)$ is (1) a convex, (2) positive homogeneous of degree one, and (3) lower-semicontinuous function of $q$. By Theorem 13.2 of Rockafellar (1997), properties (1)-(3) imply that $B$ is a convex and compact set and $\sigma(q)$ is its support function. Therefore, $B$ in (B.34) coincides with the true sharp identified set for $\beta$.

Verification of (1). Lemma B.1 proves that $\sigma(\lambda q_1 + (1-\lambda)q_2)$ is a sharp upper bound on $(\lambda q_1 + (1-\lambda)q_2)'\beta$. Furthermore, by Lemma B.1, $q_1'\beta \le \sigma(q_1)$ and $q_2'\beta \le \sigma(q_2)$. Therefore, $(\lambda q_1 + (1-\lambda)q_2)'\beta \le \lambda\sigma(q_1) + (1-\lambda)\sigma(q_2)$. By sharpness, $\sigma(\lambda q_1 + (1-\lambda)q_2)$ is the smallest bound on $(\lambda q_1 + (1-\lambda)q_2)'\beta$. Therefore, $\sigma(\lambda q_1 + (1-\lambda)q_2) \le \lambda\sigma(q_1) + (1-\lambda)\sigma(q_2)$, which implies that $\sigma(q)$ is a convex function of $q$.

Verification of (2). Let $\lambda > 0$. Observe that the event $\{\lambda Y_q \le Q(\lambda q, p(X), X)\}$ holds if and only if $\{Y_q \le Q(q, p(X), X)\}$ holds. Since $Y_q = q'Y$ is a linear function of $q$, $\sigma(q)$ defined in (B.33) is positive homogeneous of degree one.

Verification of (3). Consider a sequence of vectors $q_k \to q_0$, $k \to \infty$. Suppose $\sigma(q_k) \le C$. Then $q_k'\beta \le \sigma(q_k) \le C$, which implies that $q_0'\beta \le C$ must hold. Therefore, $C$ is a (possibly non-sharp) bound on $q_0'\beta$. By sharpness, $\sigma(q_0)$ is the smallest bound on $q_0'\beta$, which implies $\sigma(q_0) \le C$. By Theorem 13.2 in Rockafellar (1997), $\sigma(q)$ is the support function of a convex, compact set $B$ defined by the list of linear inequalities (C.7).

Proof of Proposition B.5.
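Properties (1)-(2) can be checked numerically for the support function of a finite point set; a minimal sketch in which the set $B$ and the direction vectors are illustrative stand-ins, not objects from the paper:

```python
import numpy as np

def support_fn(B, q):
    """Support function sigma(q) = sup_{b in B} q'b of a finite set B (rows are points)."""
    return np.max(B @ q)

rng = np.random.default_rng(0)
B = rng.normal(size=(50, 2))                     # vertices of a random planar point cloud
q1, q2 = np.array([1.0, 0.0]), np.array([0.3, -0.8])

# (1) convexity: sigma(lam*q1 + (1-lam)*q2) <= lam*sigma(q1) + (1-lam)*sigma(q2)
lam = 0.4
lhs = support_fn(B, lam * q1 + (1.0 - lam) * q2)
rhs = lam * support_fn(B, q1) + (1.0 - lam) * support_fn(B, q2)
convex_ok = bool(lhs <= rhs + 1e-12)

# (2) positive homogeneity of degree one: sigma(lam*q) = lam*sigma(q) for lam > 0
homog_ok = bool(np.isclose(support_fn(B, 3.0 * q1), 3.0 * support_fn(B, q1)))
```

Both checks hold for any finite set, because the maximum of linear functions is convex and scales linearly in positive directions.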
The moment equation (B.18) is the sum of the original moment equation (B.16) and three bias correction terms, for $s(1, x)$, $s(0, x)$, and $Q(u, x)$, and is therefore an orthogonal moment in the sense of Newey (1994). The bias corrections for $s(1, x)$, $s(0, x)$ and $\mu(x)$ follow from Proposition 4 in Newey (1994). The bias correction for $Q(u, x)$ follows from Proposition 6 in Ichimura and Newey (2015). According to Newey (1994), the bias correction term for the vector-valued nuisance parameter $\xi(x) = \{s(1, x), s(0, x), Q(u, x), \mu(x)\}$ is the sum of the individual bias correction terms.

ASSUMPTION 10 (Quality of the First-Stage Estimation). There exists a sequence $\{\Xi_N, N \ge 1\}$ of subsets of $\Xi$ (i.e., $\Xi_N \subseteq \Xi$) such that the following conditions hold. (1) The true value $\xi_0$ belongs to $\Xi_N$ for all $N \ge 1$. There exists a sequence of numbers $\varphi_N = o(1)$ such that the first-stage estimator $\widehat{\xi}(q)$ of $\xi(q)$ belongs to $\Xi_N$ with probability at least $1 - \varphi_N$. There exist sequences $r_N, r'_N, \delta_N$ with $r'_N \log^{1/2}(1/r'_N) = o(1)$, $r_N = o(N^{-1/2})$, and $\delta_N = o(N^{-1/2})$ such that the following bound holds:
$$\sup_{r \in [0,1)} \sup_{q \in S^{d-1}} \left| \partial_r^2\, E\, g(W, q, r(\xi(q) - \xi_0(q)) + \xi_0(q)) \right| \le r_N.$$
(2) The following conditions hold for the function class
$$\mathcal{F}_\xi = \{g(W, q, \xi(q)),\ q \in S^{d-1}\}. \quad \text{(C.8)}$$
There exists a measurable envelope function $F_\xi = F_\xi(W)$ that almost surely bounds all elements of the class: $\sup_{q \in S^{d-1}} |g(W, q, \xi(q))| \le F_\xi(W)$ a.s. There exists $c > 2$ such that $\|F_\xi\|_{P,c} := \left(\int_{w \in \mathcal{W}} (F_\xi(w))^c\right)^{1/c} < \infty$. There exist constants $a, v$ that do not depend on $N$ such that the uniform covering entropy of the function class $\mathcal{F}_\xi$ is bounded:
$$\log \sup_Q N(\varepsilon \|F_\xi\|_{Q,2}, \mathcal{F}_\xi, \|\cdot\|_{Q,2}) \le v \log(a/\varepsilon), \quad \text{for all } 0 < \varepsilon \le 1. \quad \text{(C.9)}$$

Lemma C.1 (Verification of Assumption 10(1)). Under Assumptions 3, 4 and 7, the orthogonal moment equation (B.18) obeys Assumption 10(1).

Proof of Lemma C.1.
Step 1. Consider the case $\mathcal{X}_{hurt} = \emptyset$ and the case of the lower bound. Define the following quantities:
$$\Lambda_1(W, \xi) := \frac{D}{\mu(X)} \cdot S \cdot Y\, 1\{Y \le Q(p(X), X)\} - \frac{1-D}{1-\mu(X)} \cdot S \cdot Y,$$
$$\Lambda_2(W, \xi) := Q(p(X), X) \left( \frac{(1-D) \cdot S}{1-\mu(X)} - s(0, X) \right),$$
$$\Lambda_3(W, \xi) := -Q(p(X), X)\, p(X) \left( \frac{D \cdot S}{\mu(X)} - s(1, X) \right),$$
$$\Lambda_4(W, \xi) := -Q(p(X), X)\, s(1, X) \left( \frac{D \cdot S \cdot 1\{Y \le Q(p(X), X)\}}{s(1, X)\, \mu(X)} - p(X) \right),$$
and observe that $g_L(W, \xi) = (\mu_{10}^{help})^{-1} \sum_{k=1}^{4} \Lambda_k(W, \xi)$.

Step 2. Let $\xi_0 := \{s_0(1, x), s_0(0, x), Q_0(u, x)\}$ be the true value of the nuisance parameter and $\xi := \{s(1, x), s(0, x), Q(u, x)\}$ be a candidate value in the neighborhood of $\xi_0$. Let $\partial_\alpha E[g(W, \xi) \mid X = x]$, $\partial_\beta E[g(W, \xi) \mid X = x]$, $\partial_\gamma E[g(W, \xi) \mid X = x]$ denote the partial derivatives of $E[g(W, \xi) \mid X = x]$ with respect to the outputs of the functions $s(0, x)$, $s(1, x)$ and $Q(u, x)$, respectively. Let $\xi_r := r(\xi - \xi_0) + \xi_0$.

Step 3. By construction of the orthogonal moment $g_L(W, \xi)$, $\partial_t E[\Lambda_1(W, \xi) \mid X = x]$, $t \in \{\alpha, \beta, \gamma\}$, coincides with the multipliers in the bias correction terms:
$$\partial_\alpha E[\Lambda_1(W, \xi) \mid X = x] = Q(p(x), x), \quad \partial_\beta E[\Lambda_1(W, \xi) \mid X = x] = Q(p(x), x)\, p(x),$$
$$\partial_\gamma E[\Lambda_1(W, \xi) \mid X = x] = Q(p(x), x)\, s(1, x)\, f(Q(p(x), x) \mid x).$$
Furthermore, since $p(x) = s(0, x)/s(1, x)$ and $\partial_u Q(u, x) = f^{-1}(Q(u, x) \mid x)$, the first partial derivatives of $Q(p(x), x)$ take the form
$$\partial_\alpha Q(p(x), x) = f^{-1}(Q(p(x), x) \mid x)\, s^{-1}(1, x), \quad \partial_\beta Q(p(x), x) = -f^{-1}(Q(p(x), x) \mid x)\, s^{-2}(1, x)\, s(0, x), \quad \partial_\gamma Q(p(x), x) = 1,$$
and, writing $f = f(Q(p(x), x) \mid x)$ and $f' = f'(Q(p(x), x) \mid x)$, the second partial derivatives take the form
$$\partial_{\alpha\alpha} Q(p(x), x) = -f^{-3} f'\, s^{-2}(1, x),$$
$$\partial_{\beta\beta} Q(p(x), x) = -f^{-3} f'\, s^{-4}(1, x)\, s^2(0, x) + 2 f^{-1}\, s^{-3}(1, x)\, s(0, x),$$
$$\partial_{\alpha\beta} Q(p(x), x) = \partial_{\beta\alpha} Q(p(x), x) = -f^{-1}\, s^{-2}(1, x) + f^{-3} f'\, s^{-3}(1, x)\, s(0, x),$$
$$\partial_{\gamma\alpha} Q(p(x), x) = f^{-2} f'\, s^{-1}(1, x), \quad \partial_{\gamma\beta} Q(p(x), x) = f^{-2} f'\, s^{-2}(1, x)\, s(0, x),$$
$$\partial_{\alpha\gamma} Q(p(x), x) = \partial_{\beta\gamma} Q(p(x), x) = \partial_{\gamma\gamma} Q(p(x), x) = 0.$$
By Assumptions 3 and 7(1), all functions of $x$ above are bounded a.s. in $X$. For $k = 2$, $\Lambda_2(W, \xi)$ is the product of $Q(p(x), x)$ and a linear function of $s(0, x)$ at $X = x$. For $k = 3$, $\Lambda_3(W, \xi)$ is the product of $Q(p(x), x)$, $p(x)$, and a linear function of $s(1, x)$ at $X = x$. Therefore,
$$\sup_{t_1, t_2 \in \{\alpha, \beta, \gamma\}} \sup_{x \in \mathcal{X}} |\partial_{t_1 t_2} E[\Lambda_k(W, \xi) \mid X = x]|, \quad k \in \{1, 2, 3\},$$
is bounded a.s.

Step 4. Let $\rho_0(x) := Q_0(p_0(x), x)$ and let $\rho(x)$ be some function in the neighborhood of $\rho_0(x)$. Denote $\rho_r(x) := r(\rho(x) - \rho_0(x)) + \rho_0(x)$. Consider the function
$$\lambda(r, x) := E[1\{Y \le r(\rho(X) - \rho_0(X)) + \rho_0(X)\} - 1\{Y \le \rho_0(X)\} \mid X = x] = F(\rho_r(x) \mid x) - F(\rho_0(x) \mid x).$$
Therefore, the first derivative is $\lambda'(r, x) = f(\rho_r(x) \mid x)(\rho(x) - \rho_0(x))$ and $\lambda''(r, x) = f'(\rho_r(x) \mid x)(\rho(x) - \rho_0(x))^2$. By Assumption 7, $f'(\rho_r(x) \mid x)$ is bounded a.s. in $X$.

Step 5. Conclusion. By Steps 1-4 and Assumptions 4 and 7,
$$\sup_{r \in [0,1)} |\partial_{rr} E[g(W, r(\xi - \xi_0) + \xi_0)]| \lesssim \sup_{t_1, t_2 \in \{\alpha, \beta, \gamma\}} \sup_{x \in \mathcal{X}} |\partial_{t_1 t_2} E[g(W, \xi) \mid X = x]|\, (q_N^2 + q_N s_N + s_N^2) = o(N^{-1/2}).$$
The terms $\Lambda_k(W, \xi)$ are smooth, infinitely differentiable functions of the output of $\mu(x)$ and $1 - \mu(x)$ on some open set in $(0, 1)$ that contains the support of $\mu(X)$ and $1 - \mu(X)$.

Denote the conditional c.d.f. as $F(q, t \mid x) := \Pr(q'Y \le t \mid S = 1, D = 1, X = x)$ and the conditional density as $f(q, t \mid x) = \partial_t F(q, t \mid x)$. The argument is given under the assumption $\mathcal{X}_{hurt} = \emptyset$.

Lemma C.2 (Verification of Assumption 10(2)). (1) Suppose $Y \in \mathbb{R}^d$ is a random vector with a.s. bounded coordinates. (2) There exists an integrable function $m(x)$ such that
$$\sup_{t \in \mathbb{R}} |F(q_1, t \mid x) - F(q_2, t \mid x)| \le m(x) \|q_1 - q_2\| \quad \text{and} \quad \inf_{q \in S^{d-1}} \inf_{t \in \mathbb{R}} |f(q, t \mid x)| \ge C > 0.$$
Then, Assumption 10(2) holds.

Proof of Lemma C.2.

Step 1.
The function class $\mathcal{M}$ is $P$-Donsker:
$$\mathcal{M} := \{X \to Q(q, p(X), X),\ q \in S^{d-1}\}.$$
Since $Y \in \mathbb{R}^d$ is an a.s. bounded random vector with $\|Y\| \le C$, one can take $C$ to be the envelope. Invoking the identity
$$[F(q_1, Q(q_1, p(x), x) \mid x) - F(q_2, Q(q_1, p(x), x) \mid x)] + [F(q_2, Q(q_1, p(x), x) \mid x) - F(q_2, Q(q_2, p(x), x) \mid x)] = p(x) - p(x) = 0,$$
and the mean value theorem for $t_1 = Q(q_1, p(x), x)$ and $t_2 = Q(q_2, p(x), x)$,
$$F(q_2, t_1 \mid x) - F(q_2, t_2 \mid x) = f(q_2, \tilde{t} \mid x)(t_1 - t_2),$$
gives
$$|Q(q_1, p(x), x) - Q(q_2, p(x), x)| \le \sup_{t \in \mathbb{R}} |F(q_1, t \mid x) - F(q_2, t \mid x)| \cdot \sup_{t \in \mathbb{R}} f^{-1}(q_2, t \mid x) \le C^{-1} m(x) \|q_1 - q_2\|. \quad \text{(C.10)}$$
By Example 19.7 from van der Vaart (1998), the bracketing numbers of the function class $\mathcal{M}$ obey
$$N_{[\,]}(\varepsilon \|m\|_{P,r}, \mathcal{M}, L_r(P)) \lesssim \left(\frac{1}{\varepsilon}\right)^d, \quad \text{for every } 0 < \varepsilon < 1.$$
Finally, since $Y \in \mathbb{R}^d$ is an a.s. bounded vector, each element of the class $\mathcal{M}$ is bounded by $\|Y\| \le C$ a.s., and $C$ can be taken as the envelope of $\mathcal{M}$. Therefore, $\mathcal{M}$ is $P$-Donsker and obeys (C.9) with $v = d$ and a fixed constant $a$.

Step 2.
By Step 1, the function class
$$\mathcal{H}' = \left\{W \to q'Y - Q(q, p(X), X),\ q \in S^{d-1}\right\}$$
is the sum of two VC classes. Therefore, by Andrews (1994), $\mathcal{H}'$ is a VC class itself. Therefore, the class of indicators
$$\mathcal{H}_0 := \left\{W \to 1\{q'Y - Q(q, p(X), X) \le 0\},\ q \in S^{d-1}\right\}$$
is also a VC class with a constant envelope, and, therefore, $P$-Donsker.

Step 3. The function class
$$\mathcal{H}_1 = \left\{W \to \frac{D \cdot S \cdot 1\{q'Y \le Q(q, p(X), X)\}}{\mu(X)}\right\}$$
is obtained by multiplying each element of $\mathcal{H}_0$ by the a.s. bounded random variable $D \cdot S / \mu(X)$. The function class
$$\mathcal{H}_2 = \left\{W \to Q(q, p(X), X)\left(\frac{1 - D}{1 - \mu(X)} - s(0, X)\right)\right\}$$
is obtained from $\mathcal{M}$ by multiplying each element of $\mathcal{M}$ by the a.s. bounded random variable $\left(\frac{1 - D}{1 - \mu(X)} - s(0, X)\right)$. The same argument applies to the function class
$$\mathcal{H}_3 = \left\{W \to Q(q, p(X), X)\, p(X)\left(\frac{D}{\mu(X)} - s(1, X)\right)\right\}.$$
The function class
$$\mathcal{H}_4 = \left\{W \to Q(q, p(X), X)\, s(1, X)\left(\frac{D \cdot S\, 1\{q'Y \le Q(q, p(X), X)\}}{\mu(X)\, s(1, X)} - p(X)\right)\right\}$$
is obtained as a product of the function classes $\mathcal{M}$ and $\mathcal{H}_1$, multiplied by the random variable $s(1, X)$. Finally, the function class $\mathcal{F}_\xi$ in (C.8) is obtained by adding the elements of $\mathcal{H}_k$, $k = 1, 2, 3, 4$. Since entropies obey the rules of addition and multiplication by a random variable (Andrews (1994)), the argument follows.
Proof of Lemma B.6.
Let $\mathbb{E}_{n,k}[\cdot]$ and $\mathbb{E}_N$ be as defined in (C.1) and (C.2). Steps 1 and 2 establish the statement of the theorem under Assumption 10. Steps 3 and 4 verify statements (1) and (2) of Assumption 10, respectively. Step 5 concludes. Decompose
$$\sqrt{n}\, |\mathbb{E}_{n,k}[g(W_i, q, \widehat{\xi}(q)) - g(W_i, q, \xi(q))]| \le \sqrt{n}\, |E[g(W_i, q, \widehat{\xi}(q)) - g(W_i, q, \xi(q))]| + |\mathbb{G}_{n,k}[g(W_i, q, \widehat{\xi}(q)) - g(W_i, q, \xi(q))]| =: |i(q)| + |ii(q)|.$$

Step 1.
Introduce the function
$$\lambda(r) := E[g(W_i, q, r(\widehat{\xi}(q) - \xi(q)) + \xi(q)) \mid \mathcal{E}_N \cup (W_i)_{i \in J_k^c}] - E[g(W_i, q, \xi(q))].$$
By Taylor's expansion, $\lambda(r) = \lambda(0) + \lambda'(0) + \lambda''(\tilde{r})/2$ for some $\tilde{r} \in (0, 1)$. By construction, $\lambda(0) = 0$. Because $g(w, q, \xi(q))$ is an orthogonal moment function, $\lambda'(0) = 0$. By Assumption 10(1), $|\lambda''(\tilde{r})| \le \sup_{r \in (0,1)} |\lambda''(r)| \le r_N$. Therefore, $|i(q)|$ converges to zero conditionally on the data $(W_i)_{i \in J_k^c}$ and the event $\mathcal{E}_N$:
$$\sup_{q \in S^{d-1}} |i(q)| := \sup_{q \in S^{d-1}} \sqrt{n}\, |E[g(W_i, q, \widehat{\xi}(q)) - g(W_i, q, \xi(q)) \mid \mathcal{E}_N \cup (W_i)_{i \in J_k^c}]| \le \sup_{q \in S^{d-1}} \sup_{\xi \in \Xi_n} \sqrt{n}\, |E[g(W_i, q, \xi(q)) - g(W_i, q, \xi_0(q))]| \le \sqrt{n}\, r_n = o(1).$$
By Lemma 6.1 of Chernozhukov et al. (2018), the term $i(q) = O(r_n) = o(1)$ unconditionally.

Step 2.
To bound the second quantity, consider the function class
$$\mathcal{F}_{\widehat{\xi}\xi} = \{g(W_i, q, \widehat{\xi}(q)) - g(W_i, q, \xi(q)),\ q \in S^{d-1}\}$$
for some fixed $\widehat{\xi}$. By definition of the class, $E \sup_{q \in S^{d-1}} |ii(q)| := E \sup_{f \in \mathcal{F}_{\widehat{\xi}\xi}} |\mathbb{G}_{n,k}[f]|$. We apply Lemma 6.2 of Chernozhukov et al. (2018) conditionally on the data $(W_i)_{i \in J_k^c}$ and the event $\mathcal{E}_N$, so that $\widehat{\xi}(q) = \widehat{\xi}_k$ can be treated as a fixed member of $\Xi_n$. The function class $\mathcal{F}_{\widehat{\xi}\xi}$ is obtained as the difference of two function classes, $\mathcal{F}_{\widehat{\xi}\xi} := \mathcal{F}_{\widehat{\xi}} - \mathcal{F}_\xi$, each of which has an integrable envelope and a bounded logarithm of covering numbers. In particular, one can choose the integrable envelope $F_{\widehat{\xi}\xi} := F_{\widehat{\xi}} + F_\xi$ and bound the covering numbers as
$$\log \sup_Q N(\varepsilon \|F_{\widehat{\xi}\xi}\|_{Q,2}, \mathcal{F}_{\widehat{\xi}\xi}, \|\cdot\|_{Q,2}) \le \log \sup_Q N(\varepsilon \|F_{\widehat{\xi}}\|_{Q,2}, \mathcal{F}_{\widehat{\xi}}, \|\cdot\|_{Q,2}) + \log \sup_Q N(\varepsilon \|F_\xi\|_{Q,2}, \mathcal{F}_\xi, \|\cdot\|_{Q,2}) \le 2v \log(a/\varepsilon),$$
for all $0 < \varepsilon \le 1$. Finally, we can choose the speed of shrinkage $(r'_n)$ such that
$$\sup_{q \in S^{d-1}} \sup_{\xi \in \Xi_n} \left(E[g(W_i, q, \xi(q)) - g(W_i, q, \xi_0(q))]^2\right)^{1/2} \le r'_n.$$
The application of Lemma 6.2 of Chernozhukov et al. (2018), with $M := \max_{i \in I_k^c} F_{\widehat{\xi}\xi}(W_i)$, gives
$$\sup_{q \in S^{d-1}} |ii(q)| \le \sup_{q \in S^{d-1}} |\mathbb{G}_{n,k}[g(W_i, q, \widehat{\xi}(q)) - g(W_i, q, \xi(q))]| \lesssim \sqrt{v (r'_n)^2 \log(a \|F_{\widehat{\xi}\xi}\|_{P,2}/r'_n)} + v \|M\|_{P,c'} n^{-1/2} \log(a \|F_{\widehat{\xi}\xi}\|_{P,2}/r'_n) \lesssim_P r'_n \log^{1/2}(1/r'_n) + n^{-1/2 + 1/c'} \log(1/r'_n),$$
where $\|M\|_{P,c'} \le n^{1/c'} \|F\|_{P,c'}$ for a constant $c' \ge 2$.

Step 3.
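The fold structure $\mathbb{E}_{n,k}$ used in Steps 1-2 — estimating the nuisance on the complementary folds $J_k^c$ and averaging the moment on the held-out fold — can be sketched with a simple toy moment (the DGP and the variance-type moment are illustrative stand-ins, not the paper's $g(W, q, \xi)$):

```python
import numpy as np

def cross_fit_average(W, fit_nuisance, moment, K=5, seed=0):
    """Average moment(w, xi_hat) over K folds; the nuisance xi_hat is always
    fit on the complementary folds, never on the fold being averaged."""
    idx = np.random.default_rng(seed).permutation(len(W))
    folds = np.array_split(idx, K)
    fold_means = []
    for k in range(K):
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        xi_hat = fit_nuisance(W[train])              # first stage on the complement of fold k
        fold_means.append(moment(W[folds[k]], xi_hat).mean())  # E_{n,k} on fold k
    return float(np.mean(fold_means))

# toy moment: E[(W - mu)^2] with nuisance mu = E[W]; cross-fitting avoids using
# the same observations for mu_hat and for the moment average
rng = np.random.default_rng(1)
W = rng.normal(loc=2.0, scale=1.0, size=2000)
var_hat = cross_fit_average(W, fit_nuisance=lambda w: w.mean(),
                            moment=lambda w, mu: (w - mu) ** 2)
```

With the simulated data, `var_hat` is close to the true variance of 1, illustrating that the fold-out plug-in does not distort the target.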
Lemma C.1 verifies Assumption 10(1).
Step 4.
Lemma C.2 verifies Assumption 10(2).
Step 5. Asymptotic Normality.
In Lemma C.2, we have shown that the function class $\mathcal{F}_\xi = \{g(W, q, \xi(q)),\ q \in S^{d-1}\}$ is $P$-Donsker. By Theorem 19.14 from van der Vaart (1998), the asymptotic representation follows from the Skorohod-Dudley-Wichura representation, assuming the space $\ell^\infty(S^{d-1})$ is rich enough to support this representation.

Proof of Theorem B.7.
The proof of Theorem B.7 follows from Lemma B.6 and the proof of Theorem 2 in Chandrasekhar et al. (2012).
Lemma C.3.
Let Assumption 10 hold. Then
$$\sqrt{N}\, \mathbb{E}_N\left(g(W_i, q, \widehat{\xi}(q)) - g(W_i, q, \xi(q))\right) e_i = o_P(1).$$

Proof.
Step 1.
Decompose the sample average into sample averages within each partition:
$$\mathbb{E}_N\left(g(W_i, q, \widehat{\xi}(q)) - g(W_i, q, \xi(q))\right) e_i = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{n,k}\left(g(W_i, q, \widehat{\xi}(q)) - g(W_i, q, \xi(q))\right) e_i.$$
Since the number of partitions $K$ is finite, it suffices to show that the bound holds on every partition:
$$\mathbb{E}_{n,k}\left(g(W_i, q, \widehat{\xi}(q)) - g(W_i, q, \xi(q))\right) e_i = o_P(1).$$
Let $\mathcal{E}_N := \cap_{k=1}^{K} \{\widehat{\xi}_{I_k^c} \in \Xi_n\}$. By Assumption 10, $\Pr(\mathcal{E}_N) \ge 1 - K\varphi_N = 1 - o(1)$. The analysis below is conditional on $\mathcal{E}_N$ for some fixed element $\widehat{\xi}_k \in \Xi_n$. Since the probability of $\mathcal{E}_N$ approaches one, the statements continue to hold unconditionally, which follows from Lemma 6.1 of Chernozhukov et al. (2018).

Step 2.
Consider the function class
$$\mathcal{F}_{e\xi\xi} := \{(g(W_i, q, \xi(q)) - g(W_i, q, \xi_0(q))) e_i,\ q \in S^{d-1}\}.$$
The class is obtained by multiplying each random element of the class $\mathcal{F}_{\xi\xi}$ by the integrable random variable $e_i$. Therefore, $\mathcal{F}_{e\xi\xi}$ is also $P$-Donsker and has bounded uniform covering entropy. The expectation of a random element of the class $\mathcal{F}_{e\xi\xi}$ is bounded as
$$\sqrt{n} \sup_{q \in S^{d-1}} |E[(g(W_i, q, \widehat{\xi}(q)) - g(W_i, q, \xi(q))) e_i \mid \mathcal{E}_N]| \lesssim \sup_{\xi \in \Xi_n} \sqrt{n}\, |E[g(W_i, q, \xi(q)) - g(W_i, q, \xi_0(q))]| \lesssim \sqrt{n}\, \mu_n = o(1).$$
The variance of each element of the class $\mathcal{F}_{e\xi\xi}$ is bounded as
$$\sup_{q \in S^{d-1}} \sup_{\xi \in \Xi_n} E((g(W_i, q, \xi(q)) - g(W_i, q, \xi_0(q))) e_i)^2 = \sup_{q \in S^{d-1}} \sup_{\xi \in \Xi_n} E\left((g(W_i, q, \xi(q)) - g(W_i, q, \xi_0(q)))^2\right) E e_i^2 \le 2 \sup_{q \in S^{d-1}} \sup_{\xi \in \Xi_n} E((g(W_i, q, \xi(q)) - g(W_i, q, \xi_0(q)))^2) \lesssim (r''_n)^2,$$
where the bound follows from the conditional independence of $e_i$ from $W_i$, $E e_i^2 = 2$ for $e_i \sim \mathrm{Exp}(1)$, and Assumption 10.

Proof of Lemma B.8.
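The $\mathrm{Exp}(1)$-multiplier bootstrap underlying these lemmas (i.i.d. weights $e_i$ with $E e_i = \mathrm{Var}\, e_i = 1$) can be sketched for a simple sample mean; the centering at $\bar{x}$ and the toy DGP are illustrative simplifications, not the paper's support-function process:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(loc=1.0, scale=2.0, size=n)
xbar = x.mean()

# multiplier bootstrap: perturb the sample mean with centered Exp(1) weights e_i - 1
draws = []
for _ in range(500):
    e = rng.exponential(scale=1.0, size=n)
    draws.append(xbar + np.mean((e - 1.0) * (x - xbar)))
boot_se = float(np.std(draws))

true_se = float(x.std(ddof=1) / np.sqrt(n))   # analytic s.e. of the sample mean
```

Because the weights have unit variance, the spread of the multiplier draws approximates the sampling standard error of the mean without resampling observations.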
The difference between the bootstrap and the true support function is as follows:
$$\sqrt{N}(\widetilde{\sigma}(q, B) - \sigma(q, B)) = \underbrace{\frac{1}{\sqrt{N}} \sum_{i=1}^{N} e_i \left(g(W_i, q, \widehat{\xi}(q)) - g(W_i, q, \xi(q))\right)}_{K_{\xi\xi}(q)} + \frac{1}{\sqrt{N}} \sum_{i=1}^{N} e_i (g(W_i, q, \xi(q)) - \sigma(q)).$$
By Lemma C.3, $\sup_{q \in S^{d-1}} |K_{\xi\xi}(q)| = o_P(1)$. The remainder of the proof follows from Steps 2 and 3 of the proof of Theorem 3 in Chandrasekhar et al. (2012).

Proof of Theorem B.9.
The proof of Theorem B.9 follows from Lemma B.8 and the proof of Theorem 4 in Chandrasekhar et al. (2012).

C.1 Proofs for Section 6.3
I present the argument for the lower truncated measure $f_{T_L}$ defined in equation (B.39); a similar argument applies for $f_{T_U}$. I will make use of the following statements. By definition of $T_L(W)$,
$$\Pr(S = 1, W \not\in T_L(W) \mid X) = \Pr(S = 1, W \not\in T_L(W) \mid X, D = 1)\, \mu(X) + \Pr(S = 1 \mid X, D = 0)(1 - \mu(X)) = p_0(X)\, s(1, X)\, \mu(X) + s(0, X)(1 - \mu(X)) = s(0, X). \quad \text{(C.11)}$$
For the event $D = 1$, Bayes rule and (C.11) imply
$$\Pr(D = 1 \mid S = 1, W \not\in T_L(W), X) = \frac{p_0(X)\, s(1, X)}{s(0, X)} \Pr(D = 1 \mid X) = \Pr(D = 1 \mid X). \quad \text{(C.12)}$$

Proof of Lemma B.11.
Invoking Bayes rule, (C.11), and Assumption 8 gives
$$f_{T_L}(x) = E^{-1}[s(0, X)]\, s(0, x)\, f(x),$$
which coincides with the always-takers' covariate density.

Proof of Lemma B.12.
By Assumption 1, $D$ is independent of $S(1)$ and $S(0)$ given $\bar{X}$, which implies $i$. Invoking (C.12) gives $iii$. By Assumption 8, $\Pr(D = 1 \mid X)$ is a function of $\bar{X}$, and $ii$ holds. Therefore, the numerators of the functions $w(\bar{X})$ and $w_{T_L}(\bar{X})$ are equal to each other. By Lemma B.11, the denominators of the functions $w(\bar{X})$ and $w_{T_L}(\bar{X})$ are equal to each other.

Proof of Proposition B.13.
Step 1.
By Lemma B.1(c), the functions $\beta_L(X)$ and $\beta_U(X)$ bound $E[Y(1) - Y(0) \mid S(1) = S(0) = 1, X]$:
$$\beta_L(X) = E_{T_L}[Y \mid D = 1, X] - E_{T_L}[Y \mid D = 0, X] \le E[Y(1) - Y(0) \mid X] \le E_{T_U}[Y \mid D = 1, X] - E_{T_U}[Y \mid D = 0, X] = \beta_U(X),$$
where $\beta_L(X)$ is defined in (B.7) and $\beta_U(X)$ is defined in (B.6). By the Law of Iterated Expectations (LIE),
$$E[\beta_L(X) \mid \bar{X}] \le E[Y(1) - Y(0) \mid \bar{X}] \le E[\beta_U(X) \mid \bar{X}].$$

Step 2.
Let $w(\bar{X})$ and $w_{T_*}(\bar{X})$ be the weighting functions defined in Lemma B.12. The following statements hold:
$$\beta_L \overset{i}{=} E_{T_L}\, w_{T_L}(\bar{X})\, E_{T_L}\!\left[E_{T_L}[Y \mid D = 1, X] - E_{T_L}[Y \mid D = 0, X] \,\middle|\, \bar{X}\right] \overset{ii}{=} E_{T_L}\, w_{T_L}(\bar{X})\, E_{T_L}[\beta_L(X) \mid \bar{X}] \overset{iii}{=} E\, w_{T_L}(\bar{X})\, E[\beta_L(X) \mid \bar{X}]$$
$$\overset{iv}{\le} E\, w(\bar{X})\, E[Y(1) - Y(0) \mid \bar{X}] \overset{v}{\le} E\, w(\bar{X})\, E[\beta_U(X) \mid \bar{X}] \overset{vi}{=} E_{T_U}\, w_{T_U}(\bar{X})\, E_{T_U}[\beta_U(X) \mid \bar{X}] \overset{vii}{=} E_{T_U}\, w_{T_U}(\bar{X})\left(E_{T_U}[Y \mid D = 1, \bar{X}] - E_{T_U}[Y \mid D = 0, \bar{X}]\right) \overset{viii}{=} \beta_U,$$
where $i$ and $viii$ follow from Lemma B.10, $ii$ and $vii$ follow from (B.7) and (B.6), $iii$ and $vi$ follow from Lemmas B.11 and B.12, and $iv$ and $v$ come from Step 1.

Proof of Lemma B.14.

Step 1. Validity of (B.44). Observe that
$$E[g_L(W, \xi)] \overset{i}{=} \mu^{-1} E[w_{T_L}(\bar{X})\, \beta_L(X)\, s(0, X)] \overset{ii}{=} E_{T_L}[w_{T_L}(\bar{X})\, \beta_L(X)] \overset{iii}{=} \beta_L,$$
where $i$ follows from $\alpha_L(W; \xi)$ being the sum of three functions that are mean zero conditional on $X$, $ii$ follows from Lemma B.11, and $iii$ follows from Lemma B.10.

Step 2. Orthogonality of (B.44). Observe that $\mu^{-1} \alpha_L(W; \xi)$ is the bias correction term for $\mu^{-1} \beta_L(X)\, s(0, X)$, the original (non-orthogonal) component of (B.18). Therefore, the bias correction term for $\mu^{-1} w_{T_L}(\bar{X})\, \beta_L(X)\, s(0, X)$ is equal to $\mu^{-1} w_{T_L}(\bar{X})\, \alpha_L(W; \xi)$ (Newey (1994)).

Step 3. Asymptotic linearity of (B.44). From the proofs of Lemmas C.1 and C.2, the moment equation (B.18) obeys Assumption 10. The weighting function $w_{T_L}(\bar{X})$ is bounded a.s. Therefore, (B.44) obeys Assumption 10 and equation (B.48) holds.

Step 4. From the proofs of Lemmas C.1 and C.2, the moment equation (B.18) obeys Assumption 10. As shown in Newey (1994) and Chernozhukov et al. (2016), the series estimators have the low-bias property that implies the last statement of the Lemma.

Proof of Lemma B.16.
Step 1. In the proof of Lemma B.11, I showed that $f_0(x) = (E[s(0, X)])^{-1} s(0, x) f(x)$.

Step 2. Observe that
$$\Pr(S = 1, W \not\in \Lambda_L(W) \mid X) = \Pr(S = 1, W \not\in \Lambda_L(W) \mid X, Z = 1)\, \mu(X) + \Pr(S = 1 \mid X, Z = 0)(1 - \mu(X))$$
$$= \Pr(S = 1, W \not\in \Lambda_L(W) \mid X, Z = 1, D = 1) \Pr(D = 1 \mid Z = 1, X)\, \mu(X) + \Pr(S = 1, W \not\in \Lambda_L(W) \mid X, Z = 1, D = 0) \Pr(D = 0 \mid Z = 1, X)\, \mu(X) + \Pr(S = 1 \mid X, Z = 0)(1 - \mu(X))$$
$$= p_0(X)\, s(1, X)\left(\Pr(D = 1 \mid Z = 1, X) + \Pr(D = 0 \mid Z = 1, X)\right) \mu(X) + s(0, X)(1 - \mu(X)) = s(0, X). \quad \text{(C.13)}$$
Invoking Bayes rule gives
$$f_{\Lambda_L}(x) = E^{-1}[s(0, X)]\, s(0, x)\, f(x) = f_0(x).$$

Proof of Lemma B.17.
Step 1.
The following equality holds:
$$\Pr(S(1) = S(0) = 1 \mid D = 1, Z = z, X) = \Pr(S(1) = S(0) = 1 \mid D(z) = 1, Z = z, X) \overset{i}{=} \Pr(S(1) = S(0) = 1 \mid D(z) = 1, X) \overset{ii}{=} \Pr(S(1) = S(0) = 1 \mid X),$$
where $i$ holds by Assumption 8 and $ii$ holds by Assumption 9(6). Likewise, $\Pr(S(1) = S(0) = 1 \mid D = 0, Z = z, X) = \Pr(S(1) = S(0) = 1 \mid X)$. Bayes rule implies
$$\Pr(D = 1 \mid S(1) = S(0) = 1, Z = z, X) = \frac{\Pr(S(1) = S(0) = 1 \mid D = 1, Z = z, X) \Pr(D = 1 \mid Z = z, X)}{\Pr(S(1) = S(0) = 1 \mid Z = z, X)} = \Pr(D = 1 \mid Z = z, X),$$
which establishes $a$. $b$ follows from Assumption 8.

Step 2. When $Z = 0$, observations are not truncated. Therefore,
$$\Pr(S = 1, W \not\in \Lambda_L(W) \mid D = d, Z = 0, X) \overset{i}{=} \Pr(S = 1 \mid D = d, Z = 0, X) \overset{ii}{=} \Pr(S = 1 \mid Z = 0, X) \overset{iii}{=} \Pr(S(0) = 1 \mid Z = 0, X) = s(0, X),$$
where $i$ follows from the definition of $\Lambda_L(W)$ in (B.49), $ii$ follows from Assumption 9(5), and $iii$ follows from Assumption 9(1).

Step 3.
When $Z = 1$, observations are truncated at equal proportions in the $\{D = 1, Z = 1\}$ and $\{D = 0, Z = 1\}$ groups:
$$\Pr(S = 1, W \not\in \Lambda_L(W) \mid D = d, Z = 1, X) = \Pr(W \not\in \Lambda_L(W) \mid D = d, Z = 1, S = 1, X) \Pr(S = 1 \mid D = d, Z = 1, X) = p_0(X)\, s(1, X) = s(0, X), \quad d \in \{0, 1\}.$$
Thus $\Pr(W \not\in \Lambda_L(W) \mid D = d, Z = z, X) = \Pr(W \not\in \Lambda_L(W) \mid X)$ does not depend on either $d$ or $z$. Bayes rule implies
$$\Pr_{\Lambda_L}(D = 1 \mid Z = z, X) = \frac{\Pr(W \not\in \Lambda_L(W) \mid D = 1, Z = z, X) \Pr(D = 1 \mid Z = z, X)}{\Pr(W \not\in \Lambda_L(W) \mid Z = z, X)} = \Pr(D = 1 \mid Z = z, X).$$

Step 4.
Steps 1, 2, and 3 imply that
$$\Pr[D = 1 \mid \bar{X}, Z = z] = \Pr_{\Lambda_L}[D = 1 \mid \bar{X}, Z = z], \quad z \in \{0, 1\},$$
and therefore
$$V(E[D \mid \bar{X}, Z] \mid \bar{X}) = V_{\Lambda_L}(E_{\Lambda_L}[D \mid \bar{X}, Z] \mid \bar{X}).$$
Hence the numerators of $\omega_{\Lambda_L}(\bar{X})$ and $\omega(\bar{X})$ are equal to each other. By Lemma B.16, their denominators are also equal to each other.

Proof of Proposition B.18.
Step 1. By Lemma B.1(c),
$$\beta_L(X) = E_{\Lambda_L}[Y \mid Z = 1, X] - E_{\Lambda_L}[Y \mid Z = 0, X] \le E[(D(1) - D(0)) \cdot (Y(1) - Y(0)) \mid X] \le E_{\Lambda_U}[Y \mid Z = 1, X] - E_{\Lambda_U}[Y \mid Z = 0, X] = \beta_U(X).$$
Furthermore, $[\beta_L(X), \beta_U(X)]$ is a sharp identified set for $E[Y(1) - Y(0) \mid X]$.

Step 2. Conclusion.
$$\pi_L \overset{i}{=} E_{\Lambda_L}\, \omega_{\Lambda_L}(\bar{X})\, E_{\Lambda_L}\!\left[\frac{E_{\Lambda_L}[Y \mid Z = 1, X] - E_{\Lambda_L}[Y \mid Z = 0, X]}{E_{\Lambda_L}[D \mid Z = 1, X] - E_{\Lambda_L}[D \mid Z = 0, X]} \,\middle|\, \bar{X}\right] \overset{ii}{=} E_{\Lambda_L}\, \omega_{\Lambda_L}(\bar{X})\, E_{\Lambda_L}\!\left[\frac{\beta_L(X)}{E_{\Lambda_L}[D \mid Z = 1, X] - E_{\Lambda_L}[D \mid Z = 0, X]} \,\middle|\, \bar{X}\right]$$
$$\overset{iii}{=} E\, \omega(\bar{X})\, E\!\left[\frac{\beta_L(X)}{E[D \mid Z = 1, X] - E[D \mid Z = 0, X]} \,\middle|\, \bar{X}\right] \overset{iv}{\le} E\, w(\bar{X})\, E\!\left[Y(1) - Y(0) \,\middle|\, D(1) > D(0), \bar{X}\right] \overset{v}{\le} E\, w(\bar{X})\, E\!\left[\frac{\beta_U(X)}{E_{\Lambda_U}[D \mid Z = 1, X] - E_{\Lambda_U}[D \mid Z = 0, X]} \,\middle|\, \bar{X}\right]$$
$$\overset{vi}{=} E_{\Lambda_U}\, \omega_{\Lambda_U}(\bar{X})\, E_{\Lambda_U}\!\left[\frac{E_{\Lambda_U}[Y \mid Z = 1, X] - E_{\Lambda_U}[Y \mid Z = 0, X]}{E_{\Lambda_U}[D \mid Z = 1, X] - E_{\Lambda_U}[D \mid Z = 0, X]} \,\middle|\, \bar{X}\right] \overset{vii}{=} \pi_U,$$
where $i$ and $vii$ follow from Lemma B.15, $ii$ and $vi$ follow from Lemmas B.1(c) and B.11, $iii$ follows from Lemma B.17, and $iv$ follows by Step 1.

Appendix D: Additional Simulations
Definition of the parameters.
Consider the parameters in equations (7.1)-(7.2). First, the parameters $\alpha_0$ and $\widetilde{\sigma}$ are rescaled so that the artificial wage's interquantile range matches its week-90 counterpart. Second, $(\gamma_0)_1$, the first coefficient of $\gamma_0$, is multiplied by a factor below one to make the classification of $\mathcal{X}_{help}$ and $\mathcal{X}_{hurt}$ sufficiently difficult, as it could be in the real JobCorps example with 5,177 covariates. Finally, $\alpha_0$ is multiplied by 3 to make the artificial employment rate match its week-90 counterpart. The true basic identified set is the weighted average of basic Lee bounds (2.3) and (2.4), defined separately for $\mathcal{X}_{help}$ and $\mathcal{X}_{hurt}$. The true sharp identified set is the output of Algorithm 1, where the direction of the treatment effect on employment is positive if the treatment-control difference in employment rates exceeds zero and is negative otherwise.

Table D.1: Finite sample performance of oracle, basic and better Lee bounds

Panel A: Lower Bound

            Bias                     St. Dev.                 Coverage Rate
N           Oracle  Basic   Better   Oracle  Basic   Better   Oracle  Basic   Better
9,000        0.00   -0.03   -0.01     0.01    0.01    0.01     0.95    0.26    0.90
10,000       0.00   -0.03   -0.01     0.01    0.01    0.01     0.95    0.25    0.90
15,000       0.00   -0.02   -0.01     0.00    0.01    0.01     0.95    0.23    0.90

Panel B: Upper Bound

9,000       -0.00    0.03    0.00     0.01    0.01    0.01     0.95    0.28    0.95
10,000      -0.00    0.03    0.00     0.01    0.01    0.01     0.95    0.29    0.95
15,000      -0.00    0.02    0.00     0.00    0.01    0.01     0.94    0.28    0.95

Notes. Results are based on 10,000 simulation runs. In Panel A, the true parameter value is -0.014 for the basic method and -0.011 for all other methods. In Panel B, the true parameter value is 0.035 for the basic method and 0.018 for all other methods. Bias is the difference between the true parameter and the estimate, averaged across simulation runs. St. Dev. is the standard deviation of the estimate. Coverage Rate is the fraction of times a two-sided symmetric CI with critical values $c_{\alpha/2}$ and $c_{1-\alpha/2}$ covers the true parameter, where $\alpha = 0.05$. N is the sample size in each simulation run. Oracle, basic, naive, and better estimated bounds cover zero in 100% of the cases.

The basic method.
The basic method is defined as the weighted average of basic Lee bounds, estimated on $\mathcal{X}_{help}$ and $\mathcal{X}_{hurt}$ separately. The trimming threshold is estimated as described in equation (F.1) based on all covariates.

The better method: orthogonal approach. In Section 7 of the main text, the better method is the sample average of the orthogonal moment equations (B.18). The first-stage parameters are estimated as described in Section F.1. For the employment equation, covariates are selected by the post-lasso-logistic estimator of Belloni et al. (2016). For the wage equation, covariates are selected by the post-lasso estimator of Belloni et al. (2017).
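The trimming construction behind basic Lee bounds can be sketched numerically; this is a minimal no-covariate version with simulated data (the DGP is illustrative, and the sketch assumes treatment helps selection):

```python
import numpy as np

def lee_bounds(D, S, Y):
    """Basic (no-covariate) Lee bounds on the always-takers' average effect,
    assuming treatment helps selection: s(1) >= s(0)."""
    s1, s0 = S[D == 1].mean(), S[D == 0].mean()
    p = s0 / s1                                    # trimming threshold
    y1 = np.sort(Y[(D == 1) & (S == 1)])           # selected treated outcomes
    y0_mean = Y[(D == 0) & (S == 1)].mean()        # selected control mean
    k = int(np.floor(p * len(y1)))                 # number of treated outcomes kept
    lower = y1[:k].mean() - y0_mean                # trim the top of the treated outcomes
    upper = y1[-k:].mean() - y0_mean               # trim the bottom
    return lower, upper

# simulated example: constant treatment effect 1.0, selection independent of outcomes
rng = np.random.default_rng(0)
n = 20000
D = rng.integers(0, 2, n)
S = (rng.uniform(size=n) < np.where(D == 1, 0.8, 0.6)).astype(int)
Y = rng.normal(size=n) + 1.0 * D
lo, hi = lee_bounds(D, S, Y)
```

With random selection the true always-takers' effect is 1.0, and the estimated bounds bracket it from below and above.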
The better method: agnostic approach.
In this section, I consider an alternative version of the better method based on the agnostic approach. To construct the bounds, I randomly split the sample into an auxiliary part with N/100 observations and a main part with the remaining 99N/100 observations. On the auxiliary part, I select three covariates. The first covariate is the one with the largest absolute value of the coefficient in the wage equation estimated by the linear lasso of Belloni et al. (2017). The next two covariates are the top two covariates according to the importance measure of the random forest (the ranger R command) in the employment equation. For each covariate, its importance shows the reduction in variance explained by the random forest if a given covariate is excluded. Thus, the target bounds are the sharp bounds conditional on three covariates.

Appendix E: Additional details for Section 3
C.2 JobCorps empirical application.
Data description.
In this section, I describe the baseline covariates for the JobCorps empirical application. The data is taken from Schochet et al. (2008), who provide covariate descriptions in Appendix L. All covariates describe experiences before random assignment (RA). Most of the covariates represent answers to multiple-choice questions; for these covariates, I list the question and the list of possible answers. An answer is highlighted in boldface if it is selected by the post-lasso-logistic estimator of Belloni et al. (2016) for one of the employment equation specifications described below. Table E.1 lists the covariates selected by Lee (2009). A full list of numeric covariates, not provided here, includes p = 5,177 numeric covariates.

Covariates selected by Lee (2009). Lee (2009) selected 28 baseline covariates to estimate a parametric specification of the sample selection model. They are given in Table E.1.
Table E.1: Baseline covariates selected by Lee (2009).
Name — Description
FEMALE — female
AGE — age
BLACK, HISP, OTHERRAC — race categories
MARRIED, TOGETHER, SEPARATED — family status categories
HASCHILD — has child
NCHILD — number of children
EVARRST — ever arrested
HGC — highest grade completed
HGC_MOTH, HGC_FATH — mother's and father's HGC
HH_INC1-HH_INC5 — five household income groups
PERS_INC1-PERS_INC4 — four personal income groups

Reasons for joining JobCorps (R_X). Applicants were asked a question of the form "How important was reason X, on a scale from 'very important' to 'not important', or (N/A), for joining JobCorps?" Each reason X was asked about in an independent question.

Table E.2: Reasons for joining JobCorps
R_HOME — getting away from home
R_COMM — getting away from community
R_GETGED — getting a GED
R_CRGOAL — desire to achieve a career goal
R_TRAIN — getting job training
R_NOWORK — not being able to find work
For example, the covariate R_HOME1 is a binary indicator for the reason R_HOME being ranked as a very important reason for joining JobCorps.
Sources of advice about the decision to enroll in JobCorps (IMP_X). Applicants were asked a question of the form "How important was the advice of X, on a scale from 'important' to 'not important'?" Each source of advice was asked about in an independent question.

Table E.3: Sources of advice about the decision to enroll in JobCorps.
IMP_PAR — parent or legal guardian
IMP_FRD — friend
IMP_TCH — teacher
IMP_CW — case worker
IMP_PRO — probation officer
IMP_CHL — church leader
Main types of worry about joining JobCorps (TYPEWORR). Applicants were asked to select one main type of worry about joining JobCorps.

Table E.4: Types of worry about joining JobCorps
safety
homesickness
not knowing what it will be like
dealing with other people

Drug use summary (DRUG_SUMP). Applicants were asked to select one of the possible answers best describing their drug use in the past year before RA.

Table E.5: Summary of drug use in the year before RA
marijuana/hashish only
both marijuana and other drugs

Frequency of marijuana use (FRQ_POT). Applicants were asked to select one of the possible answers best describing their marijuana/hashish use in the past year before RA.

Table E.6: Frequency of marijuana/hashish use in the year before RA
a few times each month

Applicant's welfare receipt history. Applicants were asked whether they ever received food stamps (GOTFS), AFDC benefits (GOTAFDC), or other welfare (GOTOTHW) in the year prior to RA. In case of receipt, they were asked about the duration of receipt in months (MOS_ANYW, MOS_AFDC). For example, GOTAFDC=1 and MOS_AFDC=8 describes an applicant who received AFDC benefits during 8 months before RA.
Household welfare receipt history (WELF_KID).
Applicants were asked about family welfare receipt history during childhood.

Table E.7: Family was on welfare when growing up
most or all time
Health status (HEALTH). Applicants were asked to rate their health at the moment of RA.

Table E.8: Health status at RA
fair

Table E.9: Figure 2 details: monotonicity test results
Weeks           Cell with the largest t-statistic                  Average test statistic
(1)             (2)                                                (3)
Weeks 60-89     MOS_AFDC=8 or PERS_INC=3 and EARN_YR ∈ [·, ·]

Notes. This table shows the results for the monotonicity test in Figure 2. The test is conducted separately for each week, using a week-specific test statistic and p-value. For each test, I partition the N = 9,145 subjects into J = 2 cells C_1, C_2. Column (2) describes the cell with the largest t-statistic, whose value is compared to the critical value. Column (3) shows the average test statistic across the time period in Column (1). The test statistic is T = max_{j ∈ {1,2}} μ̂_j / σ̂_j, where μ̂_j and σ̂_j are the sample average and standard deviation of the random variable ξ_j := E[(2D − 1) · S | X ∈ C_j], weighted by the design weights DSGN_WGT. The critical value c_α is the self-normalized critical value of Chernozhukov et al. (2019).

Arrest experience. CPAROLE21=1 is a binary indicator for being on probation or parole at the moment of RA. In addition, arrested applicants were asked about the time passed since the most recent arrest (MARRCAT).

Table E.10: Number of months since most recent arrest
   less than 12
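The self-normalized statistic reported in Table E.9 is a maximum of cell-level t-statistics. A minimal sketch of its computation follows; the function and variable names are illustrative, not those of the replication code, and the √n factor is the conventional t-statistic scaling, which the paper's σ̂_j notation may absorb.

```python
import numpy as np

def max_t_statistic(xi, cell, weights):
    """Self-normalized max t-statistic over cells C_1, ..., C_J.

    xi      : per-subject values of the tested variable
    cell    : integer cell labels
    weights : design weights (e.g., DSGN_WGT)
    """
    stats = []
    for j in np.unique(cell):
        m = cell == j
        w = weights[m] / weights[m].sum()                 # normalized weights
        mu = np.sum(w * xi[m])                            # weighted cell mean
        sigma = np.sqrt(np.sum(w * (xi[m] - mu) ** 2))    # weighted cell std
        # conventional t scaling; the paper's sigma_j may fold in 1/sqrt(n)
        stats.append(np.sqrt(m.sum()) * mu / sigma)
    return max(stats)

# toy example with two cells; cell 1 has the larger t-statistic
xi = np.array([1.0, 2.0, 3.0, 4.0])
cell = np.array([0, 0, 1, 1])
weights = np.ones(4)
T = max_t_statistic(xi, cell, weights)
```

The maximum is then compared with a self-normalized critical value c_α, as in Chernozhukov et al. (2019).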
Appendix F: Empirical applications: additional details
F.1 General Description of better Lee bounds
In this section, I describe how to compute the first-stage estimates for better Lee bounds; the procedure applies to all three applications under consideration. All subjects are partitioned into the sets X̂_help = {X : p̂(X) < 1} (JobCorps helps employment) and X̂_hurt = {X : p̂(X) > 1} (JobCorps hurts employment) by plugging the covariate vector X into either a logistic or a post-lasso-logistic estimate of the trimming threshold p(x) = s(0, x)/s(1, x), defined as

ŝ(0, x) = Λ(x′α̂),   ŝ(1, x) = Λ(x′(α̂ + γ̂)),   p̂(x) = ŝ(0, x)/ŝ(1, x),   (F.1)

where Λ(t) = exp(t)/(1 + exp(t)) is the logistic function, α̂ is the baseline coefficient, and γ̂ is the interaction coefficient. The treatment variable D is always included in the final logistic regression, regardless of whether it is selected by post-lasso-logistic.

For a continuous outcome, the quantile estimate Q̂(p̂(x), x) is evaluated in four steps:
1. The parameter δ(u) from equation (B.25) is estimated by the quantile regression defined in equation (B.26), with Z(x) = x and u ∈ {0.01, 0.02, ..., 0.99}. Likewise, an analog of δ(u) is estimated by the quantile regression defined in equation (B.26) for the S = 1, D = 0 group.
2. For each covariate value x and quantile level u ∈ {0.01, 0.02, ..., 0.99}, Q̂(u, x) := x′δ̂(u) is evaluated.
3. For each covariate value x, the vector (Q̂(u, x)) over u = 0.01, ..., 0.99 is sorted. Furthermore, Q̂(u, x) is capped at the minimal and maximal outcome values.
4. For each covariate value x, the trimming threshold p̂(x) = round(p̂(x), 2) is rounded to 2 decimal places, and Q̂(p̂(x), x) is evaluated.

For a binary outcome (e.g., a binary outcome in Finkelstein et al. (2012)), the conditional probability of a zero outcome in the treated group,

φ(x) := Pr(Y = 0 | X = x) := Λ(x′δ),   (F.2)

is estimated by logistic regression. An outcome is trimmed if a coin with head probability (1 − p̂(x))/φ̂(x) turns out heads.

F.2 Details of Figure 4.
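Figure 4 approximates the identified set by the circle that best fits the estimated support function in the least-squares sense formalized in equation (F.3) below. A minimal sketch of this fit, using hypothetical direction vectors and a known circle as input (all values illustrative):

```python
import numpy as np

# directions q_j on the unit circle
J = 16
theta = 2 * np.pi * np.arange(J) / J
Q = np.column_stack([np.cos(theta), np.sin(theta)])   # J x 2 matrix of q_j'

# support function of a circle with center beta and radius R:
# sigma(q) = q' beta + R (a known circle here, as a sanity check)
beta_true, R_true = np.array([1.0, -2.0]), 0.5
sigma_hat = Q @ beta_true + R_true

# OLS of sigma_hat on (Q, 1): the coefficients are (beta, R)
X = np.column_stack([Q, np.ones(J)])
coef, *_ = np.linalg.lstsq(X, sigma_hat, rcond=None)
beta_tilde, R_tilde = coef[:2], coef[2]
```

With noiseless support-function values, the least-squares fit recovers the center and radius exactly; with estimated values σ̂(q_j), it returns the best circle approximation.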
Observe that the support function of a circle centered at β with radius R takes the form σ(q) = q′β + R. Given an estimate of the support function σ̂(q) evaluated at q ∈ {q_1, ..., q_J}, the best circle approximation to an identified set is defined as the circle with center β̃ and radius R̃ chosen as the minimizers of the ordinary least squares problem:

(β̃, R̃) = argmin over (β, R) of (1/J) ∑_{j=1}^{J} (σ̂(q_j) − q_j′β − R)².   (F.3)

F.3 Lee (2009) empirical details

Table F.11: First-Stage Estimates, Table 3, Columns (3) and (7).
                               Logistic                           Quantile
    (1)                        Baseline coef. (α)  Interaction coef. (γ)  Control  Treated
                               (2)                 (3)                    (4)      (5)
1   (Intercept)                -0.518              0.154                  2.305    2.561
2   BLACK and R_GETGED=1       -0.200
3   R_COMM=1 and R_GETGED=1    -0.224
4   MOS_ANYW and R_GETGED=1    -0.022
5   HGC : EVWORK                0.044
6   HGC : HRWAGER               0.001
7   HGC : MOSINJOB              0.004
8   HRWAGER : MOSINJOB          0.006
9   EARN_YR                     0.000
10  R_HOME

Notes. Table shows the first-stage logistic and quantile regression estimates that produce the bounds in Columns (3) and (7) of Table 3. Column (2): baseline coefficient α of equation (F.1). Column (3): interaction coefficient γ of equation (F.1). Column (4): δ(u) of equation (B.26) for wage at week 90, control group. Column (5): δ(u) of equation (B.26) for wage at week 90, treated group.

Table F.12: First-Stage Estimates, Table 3, Columns (1)-(2).

                   Logistic                           Quantile
    (1)            Baseline coef. (α)  Interaction coef. (γ)  Control (S=1, D=0)  Treated (S=1, D=1)
                   (2)                 (3)                    (4)                 (5)
1   (Intercept)    -1.047              0.553                  2.669               2.197
2   AGE             0.038             -0.037                 -0.003               0.014
3   BLACK          -0.203             -0.109                 -0.135              -0.176
4   CURRJOB         0.201             -0.044                  0.036               0.085
5   EARN_YR         0.000              0.000                  0.000               0.000
6   EVARRST        -0.123              0.147                 -0.024               0.024
7   FEMALE         -0.230             -0.058                 -0.113              -0.126
8   HASCHLD         0.425             -0.177                 -0.012               0.103
9   HGC             0.036              0.026                 -0.011              -0.011
10  HGC_FATH        0.013             -0.001                  0.003               0.004
11  HGC_MOTH       -0.004              0.008                  0.003               0.000
12  HH_INC2         0.148             -0.186                 -0.032              -0.026
13  HH_INC3         0.142             -0.035                 -0.013              -0.010
14  HH_INC4         0.373             -0.230                  0.007               0.061
15  HH_INC5         0.276              0.036                  0.077               0.151
16  HISP           -0.155              0.004                  0.095               0.029
17  HRSWK_JR       -0.006              0.003                  0.000              -0.003
18  MARRIED         0.339             -0.253                 -0.034              -0.021
19  MOSINJOB        0.039              0.007                 -0.006               0.000
20  NCHLD          -0.324              0.137                  0.067               0.023
21  OTHERRAC       -0.191             -0.284                  0.121               0.054
22  PERS_INC2       0.182              0.007                  0.172              -0.059
23  PERS_INC3       0.200             -0.024                  0.185               0.044
24  PERS_INC4       0.031              0.419                  0.222              -0.140
25  SEPARATED      -0.149             -0.165                 -0.084              -0.105
26  TOGETHER       -0.199              0.339                 -0.026               0.014
27  WKEARNR         0.001             -0.001                  0.001               0.001
28  YR_WORK         0.260              0.147                 -0.070              -0.042

Notes. Table shows the first-stage logistic and quantile regression estimates that produce the bounds in Columns (1)-(2) of Table 3.
Column (2): baseline coefficient α in equation (F.1). Column (3): interaction coefficient γ in equation (F.1). Column (4): δ(u) from equation (B.26) for wage at week 90, control group. Column (5): δ(u) from equation (B.26) for wage at week 90, treated group.

Table F.13: First-Stage Estimates, Table 3, Column (4).

                              Logistic                           Quantile
    (1)                       Baseline coef. (α)  Interaction coef. (γ)  Control  Treated
                              (2)                 (3)                    (4)      (5)
1   (Intercept)               -0.68                0.38                   2.21     0.14
2   EARN_YR                    0.00               -0.00                   0.00     0.00
3   EVWORK                    -0.40               -0.08                  -0.02    -0.01
4   FEMALE                    -0.22               -0.06                  -0.13     0.01
5   HGC                        0.07               -0.02                   0.01    -0.00
6   HH_INC5                    0.14                0.16                   0.04     0.09
7   HRWAGER                    0.16               -0.00                   0.00    -0.01
8   MOSINJOB                   0.04                0.02                   0.00    -0.00
9   MOS_ANYW                  -0.02                0.00                   0.00    -0.00
10  PAY_RENT1                 -0.09                0.13                   0.06     0.04
11  PERS_INC1                 -0.09               -0.02                  -0.01    -0.10
12  RACE_ETH2                 -0.15               -0.04                  -0.15     0.05
13  R_COMM1                   -0.11               -0.05                   0.02    -0.07
14  R_GETGED1                 -0.27               -0.01                  -0.05     0.04
15  R_HOME1                   -0.21               -0.06                  -0.04     0.04
16  WKEARNR                   -0.00                0.00                   0.00     0.00
17  R_GETGED1:RACE_ETH2       -0.021
18  HGC:EVWORK                 0.081
19  R_COMM1:R_GETGED1         -0.054
20  R_GETGED1:MOS_ANYW         0.004
21  HRWAGER:HGC               -0.014
22  HGC:MOSINJOB               0.000
23  HRWAGER:MOSINJOB           0.003
Notes. Table shows the first-stage logistic and quantile regression estimates that produce the bounds in Column (4) of Table 3. Column (2): baseline coefficient α of equation (F.1). Column (3): interaction coefficient γ of equation (F.1). Column (4): δ(u) of equation (B.26) for wage at week 90, control group. Column (5): δ(u) of equation (B.26) for wage at week 90, treated group.

Table F.14: First-Stage Estimates, Table 3, Column (5).

                   Logistic                           Quantile
    (1)            Baseline coef. (α)  Interaction coef. (γ)  Control  Treated
                   (2)                 (3)                    (4)      (5)
1   (Intercept)    -1.31                0.46                   2.14     0.17
2   AGE             0.06               -0.02                   0.01    -0.01
3   BLACK          -0.22               -0.02                  -0.17     0.03
4   EARN_CMP        0.00                0.00                   0.01    -0.00
5   EARN_YR         0.00               -0.00                  -0.00     0.00
6   FEMALE         -0.14               -0.01                  -0.11     0.00
7   HGC_FATH        0.02               -0.00                   0.00     0.00
8   HRWAGER         0.06               -0.01                   0.00    -0.00
9   MONINED         0.00                0.00                  -0.01     0.00
10  MOSINJOB        0.06                0.02                   0.01    -0.01
11  MOS_ANYW       -0.02                0.00                  -0.00     0.00
12  PERS_INC1      -0.13               -0.05                  -0.00    -0.07
13  WKEARNR        -0.00                0.00                   0.00     0.00
Notes. Table shows the first-stage logistic and quantile regression estimates that produce the bounds in Column (5) of Table 3. Column (2): baseline coefficient α of equation (F.1). Column (3): interaction coefficient γ of equation (F.1). Column (4): δ(u) of equation (B.26) for wage at week 90, control group. Column (5): δ(u) of equation (B.26) for wage at week 90, treated group.

F.4 Angrist et al. (2002) empirical details
Table F.15: Baseline covariates in Angrist et al. (2002)
Name             Description        Name             Description
AGE2             pupil's age        SEX_NAME         gender by first name
MOM_AGE          mother's age       DAD_AGE          father's age
MOM_SCH          mother's HGC       DAD_SCH          father's HGC
MOM_AGE_IS_NA    missing mom age    DAD_AGE_IS_NA    missing dad age
MOM_SCH_IS_NA    missing mom HGC    DAD_SCH_IS_NA    missing dad HGC
DAREA4, ..., DAREA19   zip codes    STRATA1, ...     strata

Notes. For each of the four parental characteristics, missing values are replaced by zero and an additional indicator variable for having a (non-)missing record is generated. Imputation by the median rather than zero leads to quantitatively similar results. Out of 19 zip codes, 9 zip codes that are not perfectly multi-collinear are selected. HGC stands for highest grade completed.

Table F.16: First-Stage Estimates, Table 4, Column (6).
                          Logistic                           Quantile
    (1)                   Baseline coef. (α)  Interaction coef. (γ)  Math    Reading  Writing
                          (2)                 (3)                    (4)     (5)      (6)
1   (Intercept)           -2.041               0.442                  4.884   3.123    4.651
2   STRATA2                0.669
3   MOM_AGE_IS_NA         -3.992                                      1.350   0.165   -0.528
4   MOM_AGE               -0.005
5   AGE2:DAD_AGE_IS_NA    -0.012
6   MOM_SCH:DAREA11        0.072
7   MOM_AGE:DAREA17        0.036
8   MOM_AGE:DAREA19        0.016
9   DAREA11:DAD_AGE        0.013
10  AGE2                                                             -0.270  -0.141   -0.199
11  DAD_SCH                                                           0.034   0.018    0.017
12  DAREA6                                                           -1.354  -0.140   -0.228
13  MOM_SCH                                                          -0.032   0.013   -0.065
14  DAREA4                                                            0.135  -0.635   -0.962
15  MOM_SCH_IS_NA                                                    -1.239  -0.612   -1.873
16  DAD_AGE                                                           0.012  -0.001   -0.003
17  DAREA15                                                          -2.073  -1.381   -1.070
Notes. Table shows the first-stage logistic and quantile regression estimates that produce better Lee bounds in Table 4 (Column (6)). Column (2): estimate of the baseline coefficient α of equation (F.1). Column (3): estimate of the interaction coefficient γ of equation (F.1). Columns (4)-(6): covariate effect δ(u) of equation (B.26) on the math, reading, and writing test scores.

Data description.
The data set for analysis is obtained by merging the baseline data set aerdat4.sas7bdat with the test score data set tab5v1.sas7bdat on the ID and filtering in cases from the Bogota 1995 applicant cohort that have non-missing records of the pupil's age and gender. Since baseline covariates were collected three years after randomization, I focus only on the covariates whose values are pre-determined at the moment of randomization and are deemed exogenous by Angrist et al. (2002). The voucher is randomly assigned with a probability that does not depend on covariates (i.e., Assumption 1(1) holds). Table 2 in Angrist et al. (2002) verifies that the voucher is indeed balanced across baseline characteristics.

Figure F.7: Graphical representation of covariate selection for Table 4, Column (6).

Notes.
Test participation equation. The covariates for test participation are selected by the post-lasso-logistic of Belloni et al. (2016), regressing S on D interacted with p = 900 covariates, obtained by interacting all 25 baseline covariates with 6 pairwise interactions of continuous covariates, with the default choice of penalty λ/N. Test score equation. For each subject, the covariates for test scores are selected by the post-lasso of Belloni et al. (2017), regressing Y (test score) on the baseline covariates in the treated and the control group separately. The resulting set of covariates is the union of all covariates selected across the 3 test subjects (math, reading, writing) and the 2 possibilities for treated and control groups.

F.5 Finkelstein et al. (2012) empirical details

Data source.
The data set is the output of
OHIE_QJE_Replication_Code/SubPrograms/prepare_data.do file, one of the subprograms of the OHIE replication package of Finkelstein et al. (2012). It contains survey wave and household size fixed effects and their interactions, and 48 optional baseline covariates, summarized in Table F.17.

Agnostic approach: composition of X_help and X_hurt. To estimate the composition of X_help and X_hurt, I invoke the post-lasso-logistic of Belloni et al. (2016), with X equal to the baseline covariates and the penalty λ equal to the recommended choice, on the full sample. For each of the outcomes reported in Tables 5, 6, A.8, and A.9, the trimming threshold exceeds one for nearly all subjects. For that reason, X_help is taken to be ∅ for each outcome under consideration.

Covariate selection for Tables 5 and A.8: Agnostic approach.
The main sample M consists of randomly selected households, and the auxiliary sample A is its complement. On the auxiliary sample A, my selection equation is (3.3), where D = 1 is a binary indicator for winning the Medicaid lottery, X collects the baseline covariates and their pairwise interactions, and S = 1 is a binary indicator for a non-missing response about receiving any prescription drugs (Table A.8, Row 1, rx_any_12m). Invoking the logistic lasso of Belloni et al. (2016) to estimate (3.3), I select pairwise interactions and break them down into 21 raw covariates. They are listed in Table F.18, Column (1).

Covariate selection for Tables 6 and A.9. The selected covariates are: female_list, english_list, zip_msa, snap_ever_prenotify_07, tanf_ever_prenotify_07, snap_tot_prenotify_07, tanf_tot_prenotify_07, num_visit_pre_cens_ed, num_out_pre_cens_ed.

First-Stage Estimates: Selection Equation. The selection equation is

S = 1{X′α + Z · X′γ + U > 0},   (F.4)

where Z = 1 is a binary indicator of treatment offer (i.e., "treatment"), X is the vector of covariates selected on the auxiliary sample, and S = 1 is a binary indicator for a non-missing response. Therefore, ŝ(0, x) := Λ(x′α̂) and ŝ(1, x) := Λ(x′(α̂ + γ̂)).

First-Stage Estimates: Outcome Equation for ITT.
The outcome equation is

Y = 1{X′κ + ξ > 0},   S = 1, Z = 0,   (F.5)

where Y = 1 is a binary indicator for a negative ("No") answer in Table 5, Row 1. The estimate of φ(x) in equation (F.2) is φ̂(x) := Λ(x′κ̂). To construct a trimmed data set for ITT, a zero outcome in the control group is trimmed if a coin with success probability (1 − p̂(x))/φ̂(x) turns out a success. For numerical stability, φ̂(x) is truncated below at a small positive constant.

First-Stage Estimates for Binary Outcomes: Outcome Equation for LATE.
The outcome equation is

Y = 1{X′δ + D · X′ρ + ξ > 0},   S = 1, Z = 0,   (F.6)

where D = 1 is a binary indicator of having Medicaid insurance (i.e., "insurance"). Therefore, φ̂(0, x) := Λ(x′δ̂) and φ̂(1, x) := Λ(x′(δ̂ + ρ̂)). To construct a trimmed data set for LATE, a zero outcome in the control uninsured group is trimmed if a coin with success probability (1 − p̂(x))/φ̂(0, x) turns out a success. Likewise, a zero outcome in the control insured group is trimmed if a coin with success probability (1 − p̂(x))/φ̂(1, x) turns out a success.

Table F.17: Baseline covariates in Oregon Health Insurance Experiment.
Name                      Description
female_list               female
english_list              requested English materials
zip_msa                   zip code is in MSA
visit_pre_ed              ED visit
hosp_pre_ed               ED visit resulting in hospital admission
out_pre_ed                outpatient ED visit
on_pre_ed                 ED visit on week-day
off_pre_ed                week-end or nighttime ED visit
edcnnp_pre_ed             emergent, non-preventable ED visit
edcnpa_pre_ed             emergent, preventable ED visit
unclas_pre_ed             unclassified ED visit
epct_pre_ed               primary care treatable ED visit
ne_pre_ed                 non-emergent ED visit
acsc_pre_ed               ambulatory case sensitive ED visit
chron_pre_ed              ED visit for chronic condition
inj_pre_ed                ED visit for injury, pre-randomization
skin_pre_ed               ED visit for skin condition
abdo_pre_ed               abdominal pain visit
back_pre_ed               ED visit for back pain
back_ed                   back pain ED visit
heart_pre_ed              chest pain ED visit
depres_pre_ed             mood disorders ED visit
psysub_pre_ed             psych conditions/substance abuse ED visit
hiun_pre_ed               high uninsured volume hospital ED visit
loun_pre_ed               low uninsured volume hospital ED visit
charg_tot_pre_ed          total charges
ed_charg_tot_pre_ed       ED total charges
snap_ever_prenotify_07    ever on SNAP
tanf_ever_prenotify_07    ever on TANF
snap_tot_prenotify_07     total household benefits from SNAP
tanf_tot_prenotify_07     total household benefits from TANF
ddd_numhh_li_j            household size fixed effect for j, k
Notes. The ddd* variables in X represent fixed effects for household size and survey waves.

Table F.18: First-Stage Estimates, Table 5, Columns (3) and (6).

                             ITT               LATE
                         α        γ        κ        δ        ρ
(Intercept)             -0.55     0.14     0.08     0.07
any_acsc_pre_ed         -0.10    -0.19     0.09    -1.96
any_back_pre_ed          0.13     0.62     0.33     1.25
any_depres_pre_ed        0.02    -0.11     0.15    -1.59
any_head_pre_ed         -0.04     0.39     0.02     1.70
any_hiun_pre_ed         -0.23     0.28     0.26     0.34
any_hosp_pre_ed          0.09     0.29     0.22     0.38
any_on_pre_ed           -0.07    -0.16    -0.11    -0.61
charg_tot_pre_ed         0.00     0.00     0.00    -0.00
english_list             0.23    -0.44    -0.43    -0.20
female_list              0.33    -0.07    -0.02    -0.46
num_epct_pre_ed         -0.02     0.18     0.21    -0.13
num_ne_pre_ed           -0.04    -0.03     0.13    -0.54
num_on_pre_cens_ed       0.04     0.11     0.21    -0.05
num_out_pre_cens_ed      0.11     0.19     0.27    -0.27
num_skin_pre_cens_ed    -0.01     0.16     0.13     0.79
num_visit_pre_cens_ed   -0.17    -0.23    -0.44     0.63
snap_ever_prenotify07   -0.04     0.53     0.47     0.43
snap_tot_hh_prenotify07 -0.00    -0.00     0.00    -0.00
tanf_ever_prenotify07   -0.53    -0.95    -1.33     0.83
tanf_tot_hh_prenotify07 -0.00     0.00     0.00    -0.00
zip_msa                 -0.11    -0.20    -0.15    -0.32
ddddraXnum_2_2          -0.335   -0.009    0.000
ddddraXnum_2_3           0.458    0.147    0.175   -0.694
ddddraXnum_3_2           0.083    0.100    0.065    0.000
ddddraXnum_3_3           0.752    0.274
ddddraXnum_4_2          -0.079   -0.112   -0.009   -0.185   0.606
ddddraXnum_5_2          -0.041   -0.249   -0.271    0.154
ddddraXnum_6_2          -0.057   -0.225   -0.250    0.129
ddddraXnum_7_2           0.162   -0.237    0.553    0.000
ddddraw_sur_2            0.015    0.010   -0.087   -0.104   0.126
ddddraw_sur_3           -0.128    0.133   -0.073   -0.103   0.205
ddddraw_sur_4           -0.040    0.056    0.003    0.024  -0.094
ddddraw_sur_5           -0.053    0.002    0.089    0.093   0.078
ddddraw_sur_6           -0.110    0.047    0.068    0.052   0.213
ddddraw_sur_7           -0.053    0.002   -0.029   -0.021  -0.025
dddnumhh_li_2            0.147   -0.027   -0.105   -0.070  -0.223
dddnumhh_li_3           -1.066    0.649  -11.630  -11.667   0.000
N
                        53,646    8,383    8,383    8,383   8,383
Notes. Table shows the first-stage estimates for the estimated effect of Medicaid exposure (Column (3)) and insurance (Column (6)) in Table 5, Row 1. Column (2): baseline coefficient α in equation (F.4). Column (3): interaction coefficient γ of equation (F.4). Column (4): baseline coefficient κ in equation (F.5), estimated in the S = 1, Z = 0 group. Columns (5)-(6): baseline coefficient δ and interaction coefficient ρ in equation (F.6), estimated in the S = 1, Z = 0 group (sample size 8,383), used to estimate LATE bounds. Computations use survey weights.
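Putting equations (F.4), (F.2), and the Bernoulli trimming step together, the construction of a trimmed data set can be sketched as follows. The sketch simulates a selection equation, fits the logistic first stage by Newton-Raphson (standing in for the post-lasso-logistic used in the paper), forms the trimming threshold p̂(x), and flips the trimming coin. All data, names, and the value of φ̂(x) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def logit_fit(X, y, iters=25):
    """Logistic regression via Newton-Raphson (stand-in for post-lasso-logistic)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        H = X.T @ (X * (p * (1.0 - p))[:, None])     # Hessian of the log-likelihood
        beta += np.linalg.solve(H, X.T @ (y - p))    # Newton step
    return beta

# simulated selection equation, as in (F.4): S = 1{X'a + Z * X'g + U > 0}
n = 5000
Z = rng.integers(0, 2, n)                                # treatment offer
X = np.column_stack([np.ones(n), rng.normal(size=n)])    # intercept + one covariate
a_true, g_true = np.array([0.2, 0.5]), np.array([0.4, -0.1])
S = (X @ a_true + Z * (X @ g_true) + rng.logistic(size=n) > 0).astype(float)

# first stage: stack [X, Z*X] so the fitted coefficients are (alpha, gamma)
coef = logit_fit(np.column_stack([X, Z[:, None] * X]), S)
a_hat, g_hat = coef[:2], coef[2:]
s0 = 1.0 / (1.0 + np.exp(-(X @ a_hat)))                  # s_hat(0, x)
s1 = 1.0 / (1.0 + np.exp(-(X @ (a_hat + g_hat))))        # s_hat(1, x)
p_hat = s0 / s1                                          # trimming threshold, eq. (F.1)

# Bernoulli trimming for a binary outcome, eq. (F.2):
# drop a zero outcome with probability (1 - p_hat)/phi_hat
phi_hat = np.full(n, 0.6)                # illustrative estimate of Pr(Y = 0 | X)
phi_hat = np.maximum(phi_hat, 0.01)      # numerical-stability floor
trim_prob = np.clip((1.0 - p_hat) / phi_hat, 0.0, 1.0)   # clip: valid in the "help" region
drop = rng.random(n) < trim_prob         # True = this zero outcome is trimmed
```

The untrimmed observations then feed into the treatment-control comparison that delivers the bound; the LATE variant in (F.6) repeats the coin flip with group-specific φ̂(0, x) and φ̂(1, x).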