CAUSAL INFERENCE IN CASE-CONTROL STUDIES∗

SUNG JAE JUN† AND SOKBAE LEE‡

PENN STATE UNIV.; COLUMBIA UNIV. AND IFS
April 20, 2020
Abstract.
We investigate identification of causal parameters in case-control and related studies. The odds ratio in the sample is our main estimand of interest and we articulate its relationship with causal parameters under various scenarios. It turns out that the odds ratio is generally a sharp upper bound for counterfactual relative risk under some monotonicity assumptions, without resorting to strong ignorability or to the rare-disease assumption. Further, we propose semiparametrically efficient, easy-to-implement, machine-learning-friendly estimators of the aggregated (log) odds ratio by exploiting an explicit form of the efficient influence function. Using our new estimators, we develop methods for causal inference and illustrate the usefulness of our methods by a real-data example.
Key Words: relative risk, causality, monotonicity, case-control sample, machine learning, partial identification, semiparametric efficiency bound
JEL Classification Codes:
C21, C55, C83

∗ We would like to thank Chuck Manski and seminar participants at Cemmap, Oxford, and Penn State for helpful comments. This work was supported in part by the European Research Council (ERC-2014-CoG-646917-ROMIA) and by the UK Economic and Social Research Council (ESRC) through research grant (ES/P008909/1) to the CeMMAP.
† Department of Economics, Penn State University, 619 Kern Graduate Building, University Park, PA 16802, [email protected]
‡ Department of Economics, Columbia University, 1022 International Affairs Building, 420 West 118th Street, New York, NY 10027, [email protected]

1. INTRODUCTION
Empirical researchers often find it useful to work with outcome-based or case-control samples when they study rare events: cancer (Breslow and Day, 1980), infant death (Currie and Neidell, 2005), consumer bankruptcy (Domowitz and Sartain, 1999), and drug trafficking (Carvalho and Soares, 2016), among many others. Case-control sampling arises frequently in biostatistics when doctors or epidemiologists study risk factors for a rare disease: random sampling may yield only a few observations with the disease among several thousands of data points. In econometrics, it is often referred to as choice-based or response-based sampling because the outcome of interest is a discrete choice in many economic applications (see, e.g., Chapter 6 of Manski, 2009).

Inference methods that work with random samples are generally not suitable when data are outcome-based. In the econometrics literature, parametric estimation with outcome-based samples has been investigated by Manski and Lerman (1977), Cosslett (1981), Manski and McFadden (1981), Hsieh, Manski, and McFadden (1985), Imbens (1992), and Lancaster and Imbens (1996), among others. This strand of the literature has focused mainly on the consistency or efficiency of parametric estimators in discrete response models; see, e.g., McFadden (2015) for a review. In the biostatistics and epidemiology literature (e.g. Breslow, 1996), logistic regression has been the standard workhorse model in analyzing case-control studies, with more emphasis on sampling designs.

To motivate the setup of this paper, we start with a simple example. Table 1 summarizes data from the American Community Survey (ACS) 2018, cross-tabulating the likelihood of top income by educational attainment. The sample is restricted to white males residing in California with at least a bachelor's degree.
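A 2×2 cross-tabulation like Table 1 supports several comparisons. The following minimal Python sketch, using purely hypothetical counts (not the actual ACS figures), computes the prospective risk measures and their retrospective counterparts, and illustrates the well-known fact that the odds ratio is the same in both directions:

```python
# Illustrative only: the counts below are hypothetical placeholders,
# not the actual entries of Table 1.
def risk_measures(n11, n01, n10, n00):
    """Prospective and retrospective comparisons from a 2x2 table.

    n11 = #(Y=1, T=1), n01 = #(Y=0, T=1),
    n10 = #(Y=1, T=0), n00 = #(Y=0, T=0).
    """
    p1 = n11 / (n11 + n01)          # P(Y=1 | T=1)
    p0 = n10 / (n10 + n00)          # P(Y=1 | T=0)
    rd = p1 - p0                    # attributable risk (risk difference)
    rr = p1 / p0                    # relative risk
    odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
    # Retrospective proportions, as one would see in a case-control sample:
    q1 = n11 / (n11 + n10)          # P(T=1 | Y=1)
    q0 = n01 / (n01 + n00)          # P(T=1 | Y=0)
    retro_or = (q1 / (1 - q1)) / (q0 / (1 - q0))
    return {"RD": rd, "RR": rr, "OR": odds_ratio, "retro_OR": retro_or}

m = risk_measures(n11=120, n01=2880, n10=80, n00=4920)
# The prospective and retrospective odds ratios coincide (invariance of the
# odds ratio), while the relative risk is not recoverable retrospectively.
assert abs(m["OR"] - m["retro_OR"]) < 1e-12
```

Both odds ratios reduce to the cross-product ratio n11·n00 / (n01·n10), which is why retrospective data suffice to recover the odds ratio but not the relative risk.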
The binary outcome (Y) is defined to be one if a respondent's annual total pre-tax wage and salary income is top-coded. The binary treatment (T) is defined to be one if a respondent has a master's degree, a professional degree, or a doctoral degree. The extract is taken from IPUMS USA (Ruggles, Flood, Goeken, Grover, Meyer, Pacas, and Sobek, 2019). The ACS is an ongoing annual survey by the US Census Bureau that provides key information about the US population; the IPUMS database contains samples from the 2000-2018 ACS. The ACS sample is not a case-control sample, but we will use it to illustrate our proposed methods. In ACS 2018, the threshold income for top-coding differs across states; in our sample extract, the top-coded income bracket has median income $565,000 and the next highest income that is not top-coded is $327,000.

TABLE 1. Top Income and Education (Y cross-tabulated against T, i.e. going beyond a bachelor's degree, with totals)

In the prospective direction, the natural comparisons from Table 1 are P(Y = 1 | T = 1) and P(Y = 1 | T = 0), their difference P(Y = 1 | T = 1) − P(Y = 1 | T = 0), their ratio P(Y = 1 | T = 1)/P(Y = 1 | T = 0), and the odds ratio [P(Y = 1 | T = 1)/P(Y = 0 | T = 1)] / [P(Y = 1 | T = 0)/P(Y = 0 | T = 0)]. In the retrospective direction, the proportions of going beyond a bachelor's degree are P(T = 1 | Y = 1) and P(T = 1 | Y = 0), and the retrospective odds ratio [P(T = 1 | Y = 1)/P(T = 0 | Y = 1)] / [P(T = 1 | Y = 0)/P(T = 0 | Y = 0)] coincides with the prospective one.

Throughout, we distinguish three sets of random vectors: (Y∗(0), Y∗(1), T∗, X∗), (Y∗, T∗, X∗), and (Y, T, X), where Y∗(t) is the potential binary outcome under treatment t ∈ {
0, 1}, Y∗ = Y∗(1)T∗ + Y∗(0)(1 − T∗), T∗ and X∗ are the outcome, treatment and covariates that would have been observed under random sampling, and Y, T and X are the variables that are actually observed in the outcome-based sample. As to the main causal parameter of interest, we focus on

θRR(x) := P{Y∗(1) = 1 | X∗ = x} / P{Y∗(0) = 1 | X∗ = x},    (3)

which is causal relative risk conditional on X∗ = x. To identify θRR(x), we face two separate challenges: one results from the usual missing data problem of potential outcomes and the other stems from the fact that the researcher does not have access to (Y∗, T∗, X∗) but only to (Y, T, X).

Our contributions are two-fold. First, we articulate how the causal parameter is related with functionals of the distribution of (Y, T, X) under two different versions of outcome-based sampling schemes: i.e. the traditional case-control sampling and the case-population sampling considered in Lancaster and Imbens (1996). It turns out that the odds ratio between Y and T conditional on X = x is generally a sharp upper bound for θRR(x) under the MTR and MTS assumptions. This interpretation does not require strong ignorability, nor does it require the usual rare-disease assumption. Therefore, our identification analysis shows that we can provide the conventional estimand, i.e. the odds ratio in the sample, with causal interpretation from the perspective of partial identification (see, e.g., Manski, 2003, 2009; Tamer, 2010).

Second, we propose two novel estimation algorithms for the aggregated (log) odds ratio. For this purpose we obtain an explicit form of the efficient influence function, after which we construct suitable sample analogs. The first estimator we build is a plug-in sieve estimator (e.g. X. Chen, 2007) and the second one is a double/debiased machine learning (DML) estimator (e.g. Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins, 2018).
The former is simpler but the latter accommodates LASSO-type or more general nonparametric estimators. Both estimators achieve the semiparametric efficiency bound (e.g. Newey, 1990, 1994) and can be easily implemented using standard statistical packages. Using our estimators and the ACS data, we illustrate how to draw causal inferences based on our partial identification results as well as how to carry out a sensitivity analysis.

To the best of our knowledge, there are no directly relevant papers in the literature. In fact, the recent econometrics literature on outcome-based sampling is rather sparse; however, it is an important reality that random sampling can be expensive when the outcome of interest is rare. The goal of this paper is to revamp outcome-based sampling from the perspective of modern econometrics. Our paper is the first paper that nonparametrically connects the three dots: outcome-based sampling, causal inference and partial identification. We provide a further discussion on how our paper is related to the existing literature in section 7.

The remainder of the paper is organized as follows. Section 2 presents the framework and identification results. We describe two sampling schemes, i.e. case-control sampling and case-population sampling, after which we discuss causal parameters and their identification. In section 3, we derive the semiparametric efficiency bound for our estimand, and in section 4, we propose two estimation algorithms. We analytically establish the local robustness property of one of our estimating equations, yielding an estimator that is well suited for machine learning, in section 4.3. Section 5 summarizes the main takeaways and discusses several inferential issues. Section 6 presents an empirical example using the ACS data. We conclude the paper by discussing the related literature and topics for future research in section 7.
Appendices, along with an online supplement, include additional materials and all the proofs.

2. FRAMEWORK
In this section, we describe the scheme of outcome-based sampling, define causal parameters and discuss their identification under two sets of assumptions: one with strong ignorability and the other without it.

2.1. Bernoulli Sampling.
Let (Y∗, T∗, X∗) be the random variables that would have been observed if a researcher had collected data via random sampling from the population of interest, where Y∗ is a binary outcome, T∗ is a binary treatment, and X∗ is a vector of covariates. We assume that a random sample of (Y∗, T∗, X∗) is unavailable and hence (Y∗, T∗, X∗) is not observed. Instead, we assume that we have a random sample of (Y, T, X), where (Y, T, X) represents the random variables that are actually observed in the sample that is drawn by the researcher's sampling design, i.e. Bernoulli sampling (e.g. Breslow, Robins, and Wellner, 2000), which we further describe below and discuss in section 7.

In Bernoulli sampling, the researcher draws a Bernoulli variable Y first from a pre-specified marginal distribution, after which she randomly draws (T, X) from P_y if and only if Y = y. Since h = P(Y = 1) is part of the sampling scheme, we assume that it is known. If P_y is identical to the conditional distribution of (T∗, X∗) given Y∗ = y, then this is known as case-control sampling. The Bernoulli scheme allows for other possibilities. Below are the two leading cases that we focus on throughout the paper. In order to simplify our discussion, we first make a common-support assumption. Let X∗ and X_y be the support of X∗ and that of X given Y = y, respectively.

Assumption A (Common Support). There is a common support X satisfying X = X∗ = X_0 = X_1.

This assumption can be restrictive. For example, if Y∗ represents breast cancer and we have two covariates to consider, i.e. gender and age, then the joint support of gender and age depends highly on whether we condition on Y∗ = 1 or not. In this case, we may restrict attention to the subpopulation of women, so that X∗ and X represent the age; X∗ is the age that would have been drawn from the population of women and X is the age that is drawn from the subpopulation of women with or without breast cancer, depending on the corresponding value of Y.
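As a quick numerical sanity check (not part of the paper's formal development, and using hypothetical joint probabilities), one can verify that case-control sampling of this kind preserves the conditional odds ratio between Y and T, whatever value of h the researcher chooses; this is the fact formalized in Lemma 1 below:

```python
# Hypothetical joint distribution of (Y*, T*) at a fixed covariate value x:
# p[y][t] = P(Y* = y, T* = t | X* = x).
p = {1: {1: 0.02, 0: 0.01}, 0: {1: 0.37, 0: 0.60}}

def pop_odds_ratio(p):
    # OR*(x) reduces to the cross-product ratio of the joint probabilities.
    return (p[1][1] * p[0][0]) / (p[0][1] * p[1][0])

def case_control_odds_ratio(p, h):
    # Design 1: draw Y ~ Bernoulli(h), then draw T given Y = y from the
    # population conditional P(T* = t | Y* = y); h is the researcher's choice.
    q = {y: p[y][1] / (p[y][1] + p[y][0]) for y in (0, 1)}  # P(T=1 | Y=y)
    # The odds ratio between Y and T in the sample; note that h cancels out.
    return (q[1] / (1 - q[1])) / (q[0] / (1 - q[0]))

for h in (0.1, 0.5, 0.9):
    assert abs(case_control_odds_ratio(p, h) - pop_odds_ratio(p)) < 1e-12
```

The design-chosen marginal h distorts the marginal distribution of (T, X) but drops out of the retrospective odds, which is why the sample odds ratio remains informative about the population.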
Throughout the paper, we leave implicit the possibility of stratification using extra covariates (different from those included in X∗). Let P_y(t, x) = f_{X|Y}(x|y) P(T = t | X = x, Y = y), where f_{X|Y} is the probability density (or mass) function of X given Y = y for y = 0, 1.
Design 1 (Case-Control Sampling). Suppose that for all (t, x) ∈ {0, 1} × X and for y ∈ {0, 1},

f_{X|Y}(x|y) = f_{X∗|Y∗}(x|y) and P(T = t | X = x, Y = y) = P(T∗ = t | X∗ = x, Y∗ = y).

In other words, P_1 is the distribution of (T∗, X∗) given Y∗ = 1, while P_0 is that of (T∗, X∗) given Y∗ = 0.

Design 2 (Case-Population Sampling). Suppose that for all (t, x) ∈ {0, 1} × X,

f_{X|Y}(x|0) = f_{X∗}(x) and P(T = t | X = x, Y = 0) = P(T∗ = t | X∗ = x),
f_{X|Y}(x|1) = f_{X∗|Y∗}(x|1) and P(T = t | X = x, Y = 1) = P(T∗ = t | X∗ = x, Y∗ = 1).

In other words, P_0 represents the distribution of (T∗, X∗) of the entire population, while P_1 is that of (T∗, X∗) conditional on Y∗ = 1.

Design 1 is arguably the most popular form of case-control studies and design 2, which we call case-population sampling, is considered in Lancaster and Imbens (1996). The notation here distinguishes the original variables (Y∗, T∗, X∗) of interest from the sampled ones (Y, T, X); see, e.g., K. Chen (2001) and Xie, Lin, Yan, and Tang (2019) for the same notational device. The advantage of this approach is that it becomes straightforward to apply asymptotic theory under random sampling to observations generated from (Y, T, X) because we can regard them as a collection of independent and identically distributed (i.i.d.) copies of (Y, T, X). The marginal distribution of (T, X) is identified from the data, while that of (T∗, X∗) is not. For instance, in design 1, we have f_X(x) = f_{X∗|Y∗}(x|1)h + f_{X∗|Y∗}(x|0)(1 − h) ≠ f_{X∗}(x) if h ≠ P(Y∗ = 1); h is part of the sampling scheme, while P(Y∗ = 1) is the true probability of the case in the population. Further, f_{Y,X}(1, x) = f_{X∗|Y∗}(x|1)h = f_{X∗}(x) P(Y∗ = 1 | X∗ = x) h / P(Y∗ = 1), which yields the likelihood function studied in e.g. Manski and Lerman (1977). We emphasize that P(Y = 1 | X = x) does not have an economic (or structural) interpretation like P(Y∗ = 1 | X∗ = x), where the latter is often modeled by the rational behavior of an economic agent.

2.2. Causal Functional Parameters.
To define causal functional parameters pertinent to outcome-based samples, let Y∗(t) ∈ {0, 1} be the binary potential outcome of interest for treatment t ∈ {0, 1}. The observed outcome Y∗ can be written as Y∗ = T∗Y∗(1) + (1 − T∗)Y∗(0). The central counterfactual probabilities are P{Y∗(1) = 1 | X∗ = x} and P{Y∗(0) = 1 | X∗ = x}. Conditional on X∗ = x, one may consider the difference or ratio between the two counterfactual probabilities, which are called (conditional) attributable and relative risk in the literature (see, e.g. Manski, 2009). In this paper, we focus on the latter, namely causal relative risk θRR(x) defined in equation (3). In view of the convenience of the odds ratio, as we demonstrated in the Introduction, we also consider a causal odds ratio that is defined by

θOR(x) := [P{Y∗(1) = 1 | X∗ = x} / P{Y∗(1) = 0 | X∗ = x}] / [P{Y∗(0) = 1 | X∗ = x} / P{Y∗(0) = 0 | X∗ = x}].

2.3. Identification under Strong Ignorability.
We begin this section by articulating how the odds ratio in the sample is related with some population quantities under each sampling design. Let OR(x) be the odds ratio given X = x that is observed in the sample: i.e.

OR(x) := [P(Y = 1 | T = 1, X = x) / P(Y = 0 | T = 1, X = x)] / [P(Y = 1 | T = 0, X = x) / P(Y = 0 | T = 0, X = x)],    (4)

where we assume that 0 < OR(x) < ∞ for all x ∈ X throughout the paper. Similarly, we define OR∗(x) and RR∗(x) by the conditional odds ratio and relative risk, respectively, in the population: i.e.

OR∗(x) := [P(Y∗ = 1 | T∗ = 1, X∗ = x) / P(Y∗ = 0 | T∗ = 1, X∗ = x)] / [P(Y∗ = 1 | T∗ = 0, X∗ = x) / P(Y∗ = 0 | T∗ = 0, X∗ = x)],    (5)

RR∗(x) := P(Y∗ = 1 | T∗ = 1, X∗ = x) / P(Y∗ = 1 | T∗ = 0, X∗ = x).    (6)

Since we do not have a random sample of (Y∗, T∗, X∗), identification of OR∗(x) or RR∗(x) is a priori unclear. However, the Bayes rule shows the following result.

Lemma 1.
Under design 1, we have OR(x) = OR∗(x) for all x ∈ X. Similarly, under design 2, we have OR(x) = RR∗(x) for all x ∈ X.

Lemma 1 shows how to relate the odds ratio in the case-control sample (respectively, case-population sample) with the odds ratio (respectively, relative risk) of the population. It requires additional assumptions to connect the odds ratio or relative risk of the population with the causal parameters defined in terms of the potential outcomes. The simplest approach is to use the idea of strong ignorability (see, e.g., Imbens and Rubin, 2015). In our context, strong ignorability consists of the following two assumptions.
Assumption B (Overlap). For all (t, x) ∈ {0, 1} × X, we have 0 < P{Y∗(t) = 1 | X∗ = x} < 1 and 0 < P(T∗ = 1 | X∗ = x) < 1.

Assumption C (Unconfoundedness). For all t ∈ {0, 1} and x ∈ X, P{Y∗(t) = 1 | T∗ = 1, X∗ = x} = P{Y∗(t) = 1 | T∗ = 0, X∗ = x}.

The first requirement of assumption B implies that the potential outcome Y∗(t) cannot be 0 or 1 with probability 1 for some value of x. The second condition of assumption B is the standard overlap condition in the literature. Assumption C says that the potential outcomes Y∗(1) and Y∗(0) are conditionally independent of the treatment T∗ given X∗ = x.

We now provide the following identification result in the spirit of Holland and Rubin (1988).

Theorem 1 (Holland and Rubin (1988)). Suppose that assumptions B and C are satisfied. Then, under design 1, we have θOR(x) = OR∗(x) = OR(x) for all x ∈ X; under design 2, we have θRR(x) = RR∗(x) = OR(x) for all x ∈ X.

Theorem 1 slightly extends the result of Holland and Rubin (1988); they did not consider design 2, but their arguments can be used in a straightforward manner. In substance, the observed odds ratio OR(x) identifies the causal odds ratio θOR(x) under design 1 and the causal relative risk θRR(x) under design 2. One practical message of theorem 1 is that it might be more beneficial to sample a control group from the unconditional population if a researcher cares mainly about θRR(x). In light of this, we may regard designs 1 and 2 as studies suitable for the causal odds ratio and the causal relative risk, respectively.

2.4. Causal Interpretation without Strong Ignorability.
Strong ignorability is convenient, but it may be too strong for observational data; T∗ is often a deliberate decision of an individual agent. In this subsection, we establish an alternative causal interpretation of OR(x) using the framework of partial identification. In particular, we build on the assumptions of monotone treatment response (Manski, 1997) and monotone treatment selection (Manski and Pepper, 2000).

Assumption D (Monotone Treatment Response). We have Y∗(1) ≥ Y∗(0) almost surely.

Assumption E (Monotone Treatment Selection). For all t ∈ {0, 1} and x ∈ X, P{Y∗(t) = 1 | T∗ = 1, X∗ = x} ≥ P{Y∗(t) = 1 | T∗ = 0, X∗ = x}.

Assumption D rules out the possibility that the treatment strictly decreases the outcome, i.e. Y∗(1) < Y∗(0); it implies that both θRR(x) and θOR(x) are at least one.

Theorem 2.
Suppose that assumption B holds. The following inequalities are sharp.
(i) If assumption D is satisfied, then 1 ≤ θRR(x) ≤ θOR(x) for all x ∈ X under each of the two sampling designs.
(ii) If assumption E is satisfied, then θOR(x) ≤ OR(x) under design 1 and θRR(x) ≤ OR(x) under design 2.
(iii) The following two statements are equivalent:
(a) θOR(x) = OR(x) in design 1 and θRR(x) = OR(x) in design 2;
(b) Assumption E is satisfied with equality, i.e. assumption C holds.

Parts (i) and (ii) of theorem 2 imply that if assumptions D and E are satisfied, then OR(x) can be understood as a sharp upper bound of causal relative risk under both designs 1 and 2. More specifically, for all x ∈ X, we have

1 ≤ θRR(x) ≤ θOR(x) ≤ OR(x) under design 1;    (7)
1 ≤ θRR(x) ≤ OR(x) under design 2.    (8)

Theorems 1 and 2 articulate how to give causal interpretation to OR(x) in general. Assumption E allows for assumption C as a special case. Indeed, theorem 2 shows that point identification holds if and only if the unconfoundedness condition is satisfied.

Assumptions D and E are not individually testable, but they jointly have a testable implication, i.e. OR(x) ≥ 1 for all x ∈ X by theorem 2, for which a nonparametric test can be constructed via the general framework of testing functional inequalities (see, e.g., Chernozhukov, Lee, and Rosen, 2013; Lee, Song, and Whang, 2018).

In case-control studies, it is commonly assumed that there is some ε̃ > 0 such that 0 < P(Y∗ = 1 | X∗ = x) ≤ ε̃ for all x ∈ X. When we consider the case where ε̃ → 0, we refer to this condition as the rare-disease assumption (e.g. Breslow, 1996; Manski, 2009). The rare-disease assumption leads to |θRR(x) − θOR(x)| → 0 as ε̃ → 0. Hence, if both strong ignorability and rare disease are assumed, then θRR(x) is well approximated by OR(x) under design 1. However, our identification analysis shows that a researcher does not have to resort to strong ignorability, nor to the rare-disease assumption, in order to provide OR(x) with causal interpretation. If both the MTR and MTS conditions are plausible, then a researcher can interpret OR(x) as the (sharp) upper bound of the causal relative risk θRR(x) under both designs 1 and 2.

2.5. Heterogeneity and Aggregation.
The functional parameter OR(x) is difficult to estimate nonparametrically with good precision when the dimension of X is high. To avoid the curse of dimensionality, it is popular in case-control studies to adopt logistic regression at the true population level: that is,

P(Y∗ = 1 | T∗ = t, X∗ = x) = exp(α0 + tα1 + x′α2 + tx′α3) / [1 + exp(α0 + tα1 + x′α2 + tx′α3)],    (9)

which implies that α1 + x′α3 = log{OR∗(x)} = log{OR(x)} for all x ∈ X; therefore, under the rare-disease assumption, log{RR∗(x)} ≈ α1 + x′α3 as well. The parametric assumption is popular, but it is restrictive. For instance, the formulation in equation (9) limits the possible forms of heterogeneous causal effects; without the parametric assumption, log{OR(x)} is generally an unknown function of x that can be highly nonlinear. In this paper we take a nonparametric approach, where we aim at estimating OR(x) without using any parametric assumption, after which we aggregate it by integrating over x.

Since F_{X|Y}(·|1) and F_{X|Y}(·|0) are identified in our study designs, we consider

β(y) := ∫_X log{OR(x)} dF_{X|Y}(x|y) for y = 0, 1,    (10)

which is the weighted average of the log odds ratio using F_{X|Y}(x|y) as weights: the argument y indicates which distribution of X, and hence which distribution of X∗, is used to aggregate the log odds ratio. Specifically, under design 2, β(0) is equal to E[log{OR(X∗)}]. Under design 1, if the population fraction of the case (i.e. P(Y∗ = 1)) is known to the researcher, which has been frequently assumed in the econometrics literature (since Manski and Lerman, 1977), then E[log{OR(X∗)}] can be obtained by taking the weighted average of β(y), i.e. E[log{OR(X∗)}] = β(1)P(Y∗ = 1) + β(0)P(Y∗ = 0).

If P(Y∗ = 1) is unknown but only its upper bound is known, then we can undertake a bound analysis on E[log{OR(X∗)}] by using β(0) and β(1); this problem will be further discussed in section 5. Therefore, in the next two sections, we will treat β(y) as the main estimand of interest; our discussion on semiparametric efficiency and machine-learning approaches will focus on β(y). It depends on the researcher's view of assumptions C to E whether β(y) aggregates the logarithm of the causal parameter itself or its sharp identifiable upper bound. Using our proposed estimators, we discuss how to carry out causal inferences in section 5.

In equation (10), the logarithm is taken before aggregating the odds ratio; alternatively, one may take an expectation of the odds ratio directly. This case can be handled similarly. See Appendix A for details.

3. EFFICIENT INFLUENCE FUNCTION FOR β(y)

We consider estimating the parameter β(y), for which we do not impose any parametric restrictions anywhere.
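As a concrete illustration of the aggregation in equation (10): when X takes finitely many values, the integral is a finite weighted sum. The sketch below uses purely hypothetical conditional treatment probabilities and weights (none of these numbers come from the paper):

```python
import math

# Hypothetical primitives for a two-point support X in {"a", "b"}:
# pT[(x, y)] = P(T=1 | X=x, Y=y) and fX[y][x] = f_{X|Y}(x|y).
pT = {("a", 1): 0.70, ("a", 0): 0.40, ("b", 1): 0.55, ("b", 0): 0.30}
fX = {1: {"a": 0.6, "b": 0.4}, 0: {"a": 0.3, "b": 0.7}}

def log_odds_ratio(x):
    # log OR(x) written retrospectively: the odds of T = 1 given Y = 1
    # over the odds of T = 1 given Y = 0, both at X = x.
    odds = lambda p: p / (1 - p)
    return math.log(odds(pT[(x, 1)]) / odds(pT[(x, 0)]))

def beta(y):
    # beta(y) = sum_x log OR(x) * f_{X|Y}(x|y), the weighted average in (10).
    return sum(log_odds_ratio(x) * w for x, w in fX[y].items())

b1, b0 = beta(1), beta(0)
```

With these numbers, b1 and b0 differ only because the case and control samples weight the two support points differently, which is exactly the role of the argument y in equation (10).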
As a first step, we derive the semiparametric efficiency bound under both designs 1 and 2; since the mathematical structure of the likelihood function is the same, we do not need to distinguish design 1 from design 2. For this purpose, we will use the generic notation of the observed variables (Y, T, X) instead of the original random variables of interest, i.e. (Y∗, T∗, X∗). We start with the following assumptions for regularity.

Assumption F (Bounded Probabilities). There is a constant ε > 0 such that for each y = 0, 1, ε ≤ P(T = 1 | X, Y = y) ≤ 1 − ε and ε ≤ P(Y = 1 | X) ≤ 1 − ε almost surely.

Assumption G (Regular Distribution). The distribution function F_{X|Y} has a probability density f_{X|Y} that satisfies 0 < f_{X|Y}(x|y) < ∞ for all x ∈ X and y = 0, 1.

Assumptions F and G are, in principle, testable since they are about the random variables observed in the sample. Assumption F is slightly stronger than what we need to derive the efficient influence function, but it will be needed to establish statistical properties of our proposed estimators later. Assumption G focuses on the case where X is continuous, but this is only for the sake of notational simplicity; if X is discrete or mixed, then f_{X|Y} should be understood as a general Radon-Nikodym density with respect to some dominating measure.

Under the Bernoulli sampling scheme, the likelihood of a single observation (Y, T, X) is given by

L(Y, T, X) = {(1 − h)P_0(T, X)}^{1−Y} {hP_1(T, X)}^Y,    (11)

where for y = 0, 1,

P_y(T, X) = f_{X|Y}(X|y) P(T = 1 | X, Y = y)^T {1 − P(T = 1 | X, Y = y)}^{1−T}.    (12)

The likelihood in equation (11) is a simple mixture of two binary likelihoods. The tangent space can be derived by using regular parametric submodels P_y(T, X; γ) such that P_y(T, X; γ_0) = P_y(T, X) for y = 0, 1. The tangent space is described in the following lemma.
Lemma 2.
Consider the Bernoulli sampling scheme of design 1 or design 2. The tangent space is given by the set of functions of the following form:

s(Y, T, X) = (1 − Y)[a_0(X) + {T − P(T = 1 | X, Y = 0)} b_0(X)] + Y[a_1(X) + {T − P(T = 1 | X, Y = 1)} b_1(X)],

where the functions a_y and b_y are such that E{a_y(X) | Y = y} = 0 and E{s^2(Y, T, X)} < ∞ for each y = 0, 1.

The following theorem shows that β(y) is pathwise differentiable along the regular parametric submodels at γ_0 in the sense of Newey (1990, 1994). Before we present the theorem, define

w(X) := f_{X|Y}(X|1) / f_{X|Y}(X|0).    (13)

Further, for y = 0, 1, define

∆_y(Y, T, X) := Y^y (1 − Y)^{1−y} {T − P(T = 1 | X, Y = y)} / [P(T = 1 | X, Y = y) {1 − P(T = 1 | X, Y = y)}].

We establish the following result using the approach taken by Hahn (1998).

Theorem 3.
Suppose that assumptions A, F and G hold and that we have a sample obtained by Bernoulli sampling. Then, for y = 0, 1, β(y) is pathwise differentiable and its pathwise derivative is given by

F_y(Y, T, X) = [Y^y (1 − Y)^{1−y} / {h^y (1 − h)^{1−y}}] {log OR(X) − β(y)} − [∆_0(Y, T, X) / (1 − h)] w(X)^y + w(X)^{y−1} [∆_1(Y, T, X) / h].

Further, F_y is an element of the tangent space, and therefore, the semiparametric efficiency bound for β(y) is given by E{F_y^2(Y, T, X)}.

Theorem 3 shows the efficiency bound for β(y), and it also implies that the asymptotic variance of a √n-consistent and asymptotically linear estimator of β(y) should be E{F_y^2(Y, T, X)} by Theorem 2.1 of Newey (1994). Since β(y) is the expectation of log OR(X) with respect to the distribution of X given Y = y, it satisfies

E{log OR(X) − β(y) | Y = y} = E[ Y^y (1 − Y)^{1−y} / {h^y (1 − h)^{1−y}} {log OR(X) − β(y)} ] = 0,    (14)

which is the expected value of the first term that appears in F_y(Y, T, X); the other terms in F_y(Y, T, X) are adjustments that address the effect of first-step nonparametric estimation of log OR(X) via P(T = 1 | X = x, Y = y).

4. EFFICIENT ESTIMATION OF β(y)

Efficient estimators of β(y) for y = 0, 1 can be constructed in multiple ways. The most straightforward approach is just using equation (14), i.e. we base an estimator on

β(y) = E[ Y^y (1 − Y)^{1−y} / {h^y (1 − h)^{1−y}} log OR(X) ],    (15)

where we plug in a nonparametric estimator of OR(x). Alternatively, we may include the adjustment terms upfront to use E{F_y(Y, T, X)} = 0 by constructing a sample analog estimator from the following alternative expression: β(y) is equal to

E[ Y^y (1 − Y)^{1−y} / {h^y (1 − h)^{1−y}} log OR(X) − {∆_0(Y, T, X) / (1 − h)} w(X)^y + w(X)^{y−1} {∆_1(Y, T, X) / h} ].    (16)

This approach requires additional (nonparametric) estimation of w(X), but since E{∆_y(Y, T, X) | X} = 0 for y = 0, 1, having an incorrect function for w(X) does not matter for the consistency of the estimator based on equation (16).

Suppose that we have the sample {(Y_i, T_i, X_i) : i = 1, . . . , n}, where the (Y_i, T_i, X_i)'s are i.i.d. copies of (Y, T, X). Using this sample, we propose sieve logistic estimators based on equation (15) in section 4.1. In section 4.2, we show that the moment condition in equation (16) satisfies Neyman orthogonality in the sense of Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018, DML hereafter). This leads to double/debiased machine learning (DML) estimators, which we present in section 4.3. Throughout the discussion we assume that h is known since it is part of the sampling scheme. However, if it is unknown, then using ĥ = ∑_{i=1}^n Y_i / n instead of h does not change the first-order asymptotic behaviors of the estimators based on (15) and (16), as long as P_0 and P_1 do not depend on h.

4.1. Retrospective Sieve Logistic Estimation.
Recall that the observed odds ratio in equation (4) can be expressed as

OR(x) = [P(T = 1 | X = x, Y = 1) / P(T = 0 | X = x, Y = 1)] / [P(T = 1 | X = x, Y = 0) / P(T = 0 | X = x, Y = 0)].

We model the treatment probabilities by infinite-dimensional logistic regression: i.e. for y = 0, 1,

P(T = 1 | X = x, Y = y) = exp(∑_{j=1}^∞ φ_j(x) µ_{j,y}) / [1 + exp(∑_{j=1}^∞ φ_j(x) µ_{j,y})],

where {φ_j : j = 1, 2, . . .} is a series of basis functions and {µ_{j,y} : j = 1, 2, . . .} is a series of unknown coefficients for each y = 0, 1. (Footnote: Misspecification of w(X) may affect the asymptotic distribution of our proposed estimator. We limit our attention to nonparametric estimation of w(X) to minimize the possibility of misspecification.) It then follows that for each y = 0, 1,

log[ P(T = 1 | X = x, Y = y) / P(T = 0 | X = x, Y = y) ] = ∑_{j=1}^∞ φ_j(x) µ_{j,y}.    (17)

Therefore, by using equation (15) and assumption F, we obtain

β(y) = ∑_{j=1}^∞ ∫_X φ_j(x) dF_{X|Y}(x|y) (µ_{j,1} − µ_{j,0}) ≈ ∑_{j=1}^{J_n} ∫_X φ_j(x) dF_{X|Y}(x|y) (µ_{j,1} − µ_{j,0}),    (18)

provided that J_n diverges to infinity as n → ∞. Equation (18) suggests the following two-step sieve estimation strategy:

(i) In the first step, for each y = 0, 1, estimate {µ_{j,y} : j = 1, . . . , J_n} by logistic regression of T_i on {φ_j(X_i) : j = 1, . . . , J_n} with the Y_i = y sample.
(ii) In the second step, construct a sample analog of equation (18): i.e.

β̂(y) := ∑_{j=1}^{J_n} ∫_X φ_j(x) dF̂_{X|Y}(x|y) (µ̂_{j,1} − µ̂_{j,0}),    (19)

where the µ̂_{j,y}'s are sieve logit estimates from the first step and

∫_X φ_j(x) dF̂_{X|Y}(x|y) = ∑_{i=1}^n Y_i^y (1 − Y_i)^{1−y} φ_j(X_i) / ∑_{i=1}^n Y_i^y (1 − Y_i)^{1−y}.

Since the retrospective probability model is used in equation (17), we call the estimator defined in (19) the retrospective sieve logistic estimator of β(y), y = 0, 1. It can be computed using standard software for logistic regression, as described in algorithm 1.

Algorithm 1: Retrospective Sieve Logistic Estimator of β(1)
Input: {(Y_i, T_i, X_i) : i = 1, . . . , n}, tuning parameter J_n and basis functions {φ_j(·) : j = 1, . . . , J_n}
Output: estimate of β(1) and its standard error
Construct {φ_1(X_i), . . . , φ_{J_n}(X_i) : i = 1, . . . , n}, where an intercept term is excluded from the φ_j's;
For each j = 1, . . . , J_n, compute the empirical mean of φ_j(X_i) using only the case sample (Y_i = 1) and construct the demeaned version, say ϕ_j(X_i), of φ_j(X_i);
Run a logistic regression of T_i on the following regressors: an intercept term, Y_i, ϕ_j(X_i), j = 1, . . . , J_n, and interactions between Y_i and ϕ_j(X_i), j = 1, . . . , J_n, using standard software;
Read off the estimated coefficient for Y_i and its standard error.

The procedure described in algorithm 1 achieves the first step by running a combined logistic regression of T_i on Y_i, the sieve basis terms, and the interactions between Y_i and the sieve basis terms. This is first-order equivalent since Y_i is binary and full interaction terms are included. For the second step, instead of evaluating the right-hand side of equation (19) after the logistic regression, the φ_j(X_i)'s are demeaned first using only the case sample, so that the resulting coefficient for Y_i is first-order equivalent to the estimator defined in equation (19). The advantage of the formulation in algorithm 1 is that the standard error of β̂(1) can be read off directly from standard software without any further programming. It is straightforward to modify algorithm 1 for estimating β(0): one has to compute the empirical mean of φ_j(X_i) using only the control sample (Y_i = 0) for the demeaning step.

Sieve logistic estimators have been popular in the literature, including the propensity score estimator used in Hirano, Imbens, and Ridder (2003). To the best of our knowledge, it is novel to adopt retrospective sieve logistic estimators in the context of case-control studies. It is not difficult to work out formal asymptotic properties of our proposed sieve estimator in view of the well-established literature on two-step sieve estimation (see, e.g., Ai and Chen, 2003, 2012; Ackerberg, Chen, Hahn, and Liao, 2014, among many others). Furthermore, conventional normal inference based on the standard error obtained in algorithm 1 is valid for semiparametric inference (e.g. Ackerberg, Chen, and Hahn, 2012). For brevity of the paper, we omit details.

4.2.
4.2. Neyman Orthogonality.
Both of the estimating equations in (15) and (16) depend on nonparametric objects that need to be estimated in advance. Equation (15) is simpler, but equation (16) has the advantage that it is robust to local perturbations of the unknown functions that are estimated in the first step. It requires extra notation to discuss this result formally.

Let W be the set of functions on X that are bounded and bounded away from zero. Similarly, let G be the set of functions g : X → [ε, 1−ε] for some ε > 0. For η = (η_1ᵀ, η_2)ᵀ with η_1 = (a, b)ᵀ ∈ G² and η_2 ∈ W, define

OR̃(η_1)[X] = [b(X){1 − a(X)}] / [{1 − b(X)} a(X)] and w̃(η)[X] = η_2(X).

So, OR̃(·)[X] and w̃(·)[X] denote (candidate) mappings from G² and W, respectively, such that they are equal to OR(X) and w(X) when they are evaluated at η_10 ∈ G² and η_20 ∈ W, respectively, where

η_10(x) = (P(T=1 | X=x, Y=0), P(T=1 | X=x, Y=1))ᵀ and η_20(x) = w(x).

Now, we define the mapping F̃_y(·)[Y, T, X] by

F̃_y(η)[Y, T, X] := [Y^y (1−Y)^{1−y} / {h^y (1−h)^{1−y}}] {log OR̃(η_1)[X] − β(y)}
+ (Y/h) {w̃(η)[X]}^{1−y} {T − b(X)} / [b(X){1 − b(X)}]
− {(1−Y)/(1−h)} {w̃(η)[X]}^{−y} {T − a(X)} / [a(X){1 − a(X)}],

where η = (a, b, η_2)ᵀ ∈ G² × W. So, we have F̃_y(η_0)[Y, T, X] = F_y(Y, T, X), where η_0 = (η_10ᵀ, η_20)ᵀ. We are now ready to state the main theorem of this subsection.

Theorem 4. Suppose that assumptions F and G hold. Then, under both designs 1 and 2, and for each y = 0, 1, the Gateaux derivative of F̃_y(·)[Y, T, X] at η_0 has mean zero: i.e.

E[ ∂_γ F̃_y{η_0 + γ(η − η_0)}[Y, T, X] |_{γ=0} ] = 0 for all η ∈ G² × W.

In other words, F_y(Y, T, X) provides a Neyman orthogonal moment function. The fact that small perturbations around η_0 do not have first-order asymptotic consequences is known as the local robustness property. In this case, the first-step nonparametric estimation does not have any first-order effect, i.e. the limiting distribution would be the same as if η_0 were known, because all the adjustment terms that are needed to address the effect of the first-step estimation are already reflected in F_y.

4.3. Retrospective Double/Debiased Machine Learning Estimation.
When the dimension of X is higher than the sample size, it is infeasible to implement the sieve estimator proposed in section 4.1. In this section we consider using machine-learning-based estimators in the first step, which allows X to be of high dimension. In view of the Neyman orthogonality established in section 4.2, we build a new estimator based on equation (16), which requires estimation of w(x) defined in equation (13). In high-dimensional settings, it would be impractical to estimate f_{X|Y}(x|1) and f_{X|Y}(x|0) separately and to take the ratio to obtain an estimator of w(x). Instead, we use the Bayes rule to obtain

w(x) = f_{X|Y}(x|0) / f_{X|Y}(x|1) = [P(Y=0 | X=x) / P(Y=1 | X=x)] · [h / (1−h)], (20)

which suggests that we estimate P(Y=1 | X=x), since h = P(Y=1) is either known or trivial to estimate. (Heckman and Todd (2009) use the same relationship in the context of propensity score matching under treatment-based sampling.) The key insight here is that it may be unrealistic to assume the sparsity of f_{X|Y}(x|y) for each y = 0, 1, but w(x) can be estimated by sparsity-based models, since the sparsity of w(x) is equivalent to that of P(Y=0 | X=x)/P(Y=1 | X=x). Therefore, we may rely on machine-learning methods to estimate not only P(T=1 | X=x, Y=y), y = 0, 1, but also P(Y=1 | X=x). For example, we may use ℓ₁-penalized logistic estimation for estimating all relevant probability models to construct an estimator of β(y), y = 0, 1.

Specifically, suppose that K ≥ 2 and that n is divisible by K. Let {I_k : k = 1, . . . , K} denote a K-fold partition of {1, . . . , n} such that |I_k| = n/K for each k. Suppose that one estimates η_0 = (η_10ᵀ, η_20)ᵀ using a machine-learning estimator, say η̂_k, using observations that belong to I_k^c := {1, . . . , n} \ I_k for each k. Then, the retrospective double/debiased machine learning estimator β̂_DML(y) of β(y), y = 0, 1, is defined by

β̂_DML(y) := (1/K) ∑_{k=1}^K (1/|I_k|) ∑_{i ∈ I_k} ψ̂_{i,k}(y), (21)

where ĥ := n^{−1} ∑_{i=1}^n Y_i,

ψ̂_{i,k}(y) := [Y_i^y (1−Y_i)^{1−y} / {ĥ^y (1−ĥ)^{1−y}}] log OR̃(η̂_k)[X_i]
+ (Y_i/ĥ) {w̃(η̂_k)[X_i]}^{1−y} {T_i − p̂_{1,k}(X_i)} / [p̂_{1,k}(X_i){1 − p̂_{1,k}(X_i)}]
− {(1−Y_i)/(1−ĥ)} {w̃(η̂_k)[X_i]}^{−y} {T_i − p̂_{0,k}(X_i)} / [p̂_{0,k}(X_i){1 − p̂_{0,k}(X_i)}], (22)

and

p̂_{1,k}(x) := P̂_{ML,k}(T=1 | X=x, Y=1), p̂_{0,k}(x) := P̂_{ML,k}(T=1 | X=x, Y=0),

OR̃(η̂_k)[x] := [P̂_{ML,k}(T=1 | X=x, Y=1) P̂_{ML,k}(T=0 | X=x, Y=0)] / [P̂_{ML,k}(T=0 | X=x, Y=1) P̂_{ML,k}(T=1 | X=x, Y=0)],

w̃(η̂_k)[x] := [P̂_{ML,k}(Y=0 | X=x) / P̂_{ML,k}(Y=1 | X=x)] · [ĥ / (1−ĥ)].

Here, P̂_{ML,k} denotes a machine-learning estimator of a probability model using observations that belong to I_k^c. We summarize the estimation procedure in algorithm 2.

Algorithm 2:
Retrospective Double/Debiased Machine Learning Estimator of β(y), y = 0, 1

Input: {(Y_i, T_i, X_i) : i = 1, . . . , n}, K, and machine learning methods for estimating probability models
Output: estimate of β(y) and its standard error
Construct a K-fold partition {I_k : k = 1, . . . , K} of {1, . . . , n} of approximately equal size;
For each k, use observations belonging to I_k^c to obtain machine learning estimates η̂_k of P(T=1 | X=x, Y=1), P(T=1 | X=x, Y=0) and P(Y=1 | X=x), respectively;
For each k, use observations belonging to I_k to construct ψ̂_{i,k}(y) in equation (22);
Obtain the estimate of β(y) by equation (21) and its standard error σ̂_DML(y)/√n by

σ̂²_DML(y) := (1/K) ∑_{k=1}^K (1/|I_k|) ∑_{i ∈ I_k} {ψ̂_{i,k}(y) − β̂_DML(y)}². (23)

Let ‖·‖_{P,2} denote the L²(P)-norm, where P is the probability distribution that (Y, T, X) takes: i.e.

‖a‖_{P,2} = max_{1 ≤ ℓ ≤ d} {E[a_ℓ(Y, T, X)²]}^{1/2}

for a d-dimensional vector-valued function a := (a_1, . . . , a_d).

Assumption H (First-Stage Estimation). There exist sequences δ_n ≥ n^{−1/2} and τ_n of positive constants both approaching zero such that for each k = 1, . . . , K,

‖η̂_k − η_0‖_{P,2} ≤ δ_n n^{−1/4}

with probability no less than 1 − τ_n.

Assumption H resembles classical rate requirements in semiparametric estimation. The general theory of DML allows for a more general norm; however, the L²(P)-norm is the most convenient for machine learning estimators. The required rate is attainable for a variety of machine learning methods. For instance, primitive conditions for ℓ₁-penalized logit estimators are worked out by van de Geer (2008) and Belloni, Chernozhukov, and Wei (2016), among others.

An application of Theorems 3.1 and 3.2 of DML gives the following result that formally justifies the estimation method proposed in algorithm 2.

Theorem 5.
Let {P_n : n ≥ 1} be a sequence of sets of probability distributions of (Y, T, X). Suppose that for all n ≥ 1 and P ∈ P_n, (16) and assumptions F to H hold and that we have a sample by the Bernoulli sampling scheme of design 1 or design 2. Then, for y = 0, 1,

√n {β̂_DML(y) − β(y)} / σ̂_DML(y) →_d N(0, 1)

uniformly over P ∈ P_n, and σ̂²_DML(y) →_p E[F_y(Y, T, X)²] uniformly over P ∈ P_n, where σ̂²_DML(y) is defined in equation (23).
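A concrete, purely illustrative rendering of algorithm 2 is given below. It cross-fits the three probability models and averages the scores in equation (22); sklearn's default (ridge-type) penalized logit stands in for the ℓ₁-penalized fits discussed in the text, and all names are ours, not the authors':

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def beta_dml(Y, T, X, y, K=5, seed=0):
    """Cross-fitted sketch of equations (21)-(23) for beta(y), y in {0, 1}.

    Y, T : (n,) binary arrays; X : (n, p) covariate matrix."""
    n = len(Y)
    h = Y.mean()
    eps = 1e-6                      # keep estimated probabilities inside (0, 1)
    psi = np.empty(n)
    for train, test in KFold(K, shuffle=True, random_state=seed).split(X):
        # First-stage fits on the complement sample I_k^c.
        m1 = LogisticRegression().fit(X[train][Y[train] == 1], T[train][Y[train] == 1])
        m0 = LogisticRegression().fit(X[train][Y[train] == 0], T[train][Y[train] == 0])
        mY = LogisticRegression().fit(X[train], Y[train])
        p1 = np.clip(m1.predict_proba(X[test])[:, 1], eps, 1 - eps)  # P(T=1|X,Y=1)
        p0 = np.clip(m0.predict_proba(X[test])[:, 1], eps, 1 - eps)  # P(T=1|X,Y=0)
        pY = np.clip(mY.predict_proba(X[test])[:, 1], eps, 1 - eps)  # P(Y=1|X)
        w = (1 - pY) / pY * h / (1 - h)                              # equation (20)
        log_or = np.log(p1 * (1 - p0)) - np.log((1 - p1) * p0)
        Yk, Tk = Y[test], T[test]
        # Score psi_{i,k}(y) from equation (22), evaluated on the held-out fold.
        psi[test] = (Yk**y * (1 - Yk)**(1 - y)) / (h**y * (1 - h)**(1 - y)) * log_or \
            + Yk / h * w**(1 - y) * (Tk - p1) / (p1 * (1 - p1)) \
            - (1 - Yk) / (1 - h) * w**(-y) * (Tk - p0) / (p0 * (1 - p0))
    return psi.mean(), psi.std() / np.sqrt(n)   # beta-hat and its standard error
```

In practice one would substitute ℓ₁-penalized fits (e.g. `penalty="l1"` with a suitable solver, or glmnet in R as in section 6) and tune the penalty by cross-validation.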
5. DISCUSSION: THE MAIN TAKEAWAY AND INFERENTIAL ISSUES
In this section we discuss and summarize some of the important messages from our findings. Recall that the main estimand of interest is β(y) for y = 0, 1, which is an aggregated version of log{OR(x)}. With causal inference in mind, the corresponding causal parameters would be either

ξ_RR(y) := ∫_X log{θ_RR(x)} dF_{X|Y}(x|y) or ξ_OR(y) := ∫_X log{θ_OR(x)} dF_{X|Y}(x|y).

Here, we note that log θ_RR(x) is easier to interpret than log θ_OR(x), so the former is a more natural causal parameter to target. Also, it is arguably more desirable to aggregate log θ_RR(x) by the true distribution of X*: i.e.

ξ_RR := E[log P{Y*(1) = 1 | X*}] − E[log P{Y*(0) = 1 | X*}], (24)

and we have

ξ_RR = ξ_RR(0)(1 − p*) + ξ_RR(1) p* under design 1, and ξ_RR = ξ_RR(0) under design 2,

where p* := P(Y* = 1).

In this setup, causal inference can be understood as how we relate the estimand β(y) with ξ_RR(y), and eventually with ξ_RR, for which we need to address the fact that p* is unidentified. Below we discuss each step in detail.

How to relate β(y) with ξ(y) depends on several assumptions as well as the sampling design itself. In the case of case-control sampling (i.e. design 1), strong ignorability ensures that β(y) = ξ_OR(y), but we do not learn about ξ_RR(y) from β(y) unless the rare-disease assumption is additionally in place: if the case is rare in the population (uniformly across the values of X*), then ξ_OR(y) is a good approximation of ξ_RR(y). The case of case-population sampling (i.e. design 2) is easier, because strong ignorability is sufficient to guarantee β(0) = ξ_RR(0). Therefore, a confidence interval for ξ_RR(0) in this case can be computed in the usual symmetric and two-sided way by using any of the proposed estimators of β(0).

If strong ignorability is not credible, then the (approximate) equality relationship between β(y) and ξ_RR(y) breaks down.
However, we have shown that if the MTR and MTS conditions are satisfied, we have

0 ≤ ξ_RR(y) ≤ β(y) (25)

under both designs 1 and 2, where the inequalities are sharp. Further, these inequalities do not require the rare-disease assumption, and hence they are robust against its violation. Equation (25) implies that an estimate of β(y), e.g. β̂_DML(y), should be interpreted carefully: a large estimate does not necessarily confirm a large causal effect, but a small estimate does confirm a small causal effect. Also, a confidence interval for ξ_RR(y) should be computed differently. For example, if β̂_DML(y) is used, then an asymptotically valid confidence interval for ξ_RR(y) should be computed by [0, β̂_DML(y) + z_{1−α} · σ̂_DML(y)/√n], where z_{1−α} is a one-sided standard normal critical value. Since β̂_DML(y) is an efficient estimator, it will lead to a tight one-sided confidence interval for ξ_RR(y).

If the final object of interest is ξ_RR, i.e. the aggregated version of log θ_RR(x) over the entire population, then design 2 is clearly more convenient than design 1. In the case-control sampling design, we need to compute the weighted average of ξ_RR(0) and ξ_RR(1). If p* is known, then conducting inference on ξ_RR is not hard: all of our discussion above applies again, though we need to use the standard error of the linear combination of β̂_DML(0) and β̂_DML(1). More realistically, the only information available to a researcher may be p* ∈ [0, p̄] for some known upper bound p̄. Then, the sharp bounds for ξ_RR will be given by

0 ≤ ξ_RR ≤ max{β(0), β(0)(1 − p̄) + β(1) p̄}. (26)

Equation (26) suggests that we can implement "union bounds" to obtain a confidence interval for ξ_RR. Specifically, we first check if β(0) ≥ β(1) by comparing their estimates. If so, then we use the estimate of β(0) and its standard error to compute a one-sided confidence interval. If not, then we use the estimate of β(0)(1 − p̄) + β(1) p̄ and its standard error.
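The union-bound construction just described can be sketched in a few lines. The helper below is ours and treats the covariance between the two β̂'s as a supplied input (in practice it would be estimated from the influence functions):

```python
import numpy as np
from scipy.stats import norm

def union_bound_ci(b0, se0, b1, se1, cov01, p_bar, alpha=0.05):
    """One-sided confidence interval for xi_RR based on equation (26).

    b0, b1   : estimates of beta(0) and beta(1)
    se0, se1 : their standard errors; cov01 : their estimated covariance
    p_bar    : known upper bound on p* = P(Y*=1)."""
    z = norm.ppf(1 - alpha)
    if b0 >= b1:
        # Linear combination is maximized at p* = 0.
        upper = b0 + z * se0
    else:
        # Maximized at p* = p_bar; use the delta-method standard error.
        mix = b0 * (1 - p_bar) + b1 * p_bar
        se_mix = np.sqrt((1 - p_bar)**2 * se0**2 + p_bar**2 * se1**2
                         + 2 * p_bar * (1 - p_bar) * cov01)
        upper = mix + z * se_mix
    return (0.0, upper)
```

The lower end point is zero by equation (26), so only the upper end point is data-dependent.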
6. AN EMPIRICAL EXAMPLE

In this section, we provide an empirical example to illustrate the usefulness of our approach. We revisit the ACS 2018 sample extract in the Introduction and add covariates to implement the estimation methods we have proposed in this paper. Recall that the sample is restricted to white males residing in California with at least a bachelor's degree. The case sample (Y = 1) is composed of 921 individuals whose income is top-coded. To mimic design 1, the control sample (Y = 0) of equal size is randomly drawn without replacement from the pool of individuals whose income is not top-coded. Thus, by design, P(Y = 1) = h = 1/2. The covariates (X) include age and industry codes, and the binary treatment (T) is defined to be one if an individual has a degree beyond bachelor's. Age is restricted to be between 25 and 70.

We consider two different estimators: (i) the retrospective sieve logit and (ii) the retrospective DML estimator. For (i), only age is included as a covariate, with cubic B-splines having three inner knots. (Specifically, the knots are 34, 45 and 55, which correspond to the 0.25, 0.50 and 0.75 quantiles of the empirical age distribution.) For (ii), both age and industry codes are used. In particular, cubic B-splines of age with 17 inner knots (hence, J_n = 20) as well as 254 industry dummies are included in this specification, which can be viewed as a high-dimensional specification. Specifically, we implement ℓ₁-penalized logistic estimation with the glmnet package in R (Friedman, Hastie, and Tibshirani, 2010) to estimate P(T=1 | X=x, Y=y), y = 0, 1, and P(Y=1 | X=x) with 5-fold cross-fitting. The underlying assumption here is that the B-spline terms plus the industry dummies are rich enough to approximate P(T=1 | X=x, Y=y) as well as P(Y=1 | X=x). The penalization tuning parameter is chosen by cross-validation (that is, lambda.min in the glmnet package). To present a representative result, we draw the control sample 100 times and compute estimates for each draw. Estimates and standard errors reported below are median values out of 100 replications.
TABLE 2. Empirical Results: Sieve Logit

Panel A.
                                β(1)        β(0)
  Retrospective Estimate        0.656       0.489
                               (0.101)     (0.167)
  Note. Standard errors are in the parentheses.

Panel B.
                                exp[β(1)]   exp[β(0)]
  Retrospective Estimate        1.927       1.631
  95% Confidence Interval       [1, 2.276]  [1, 2.147]
  Note. Confidence intervals are obtained under the assumption that the point estimate is the upper bound of exp[β(y)], y = 0, 1.

Table 2 reports estimation results with sieve logit estimation. Looking at Panel A, the retrospective sieve estimate of β(1) is 0.656, which is larger than that of β(0), thereby suggesting that there is heterogeneity among individuals. However, the standard error of β̂(0) is larger than that of β̂(1), which indicates that the difference between the two estimates might be driven by sampling uncertainty. In Panel B, we present point estimates of exp[β(y)], y = 0, 1, and their confidence intervals under the assumption that the point estimate is the upper bound, because the MTR and MTS conditions make β(y) an upper bound for ξ_RR(y). (Sieve logit estimation without penalization produced bogus results.) The estimates of exp[β(y)] are comparable to the usual odds ratio in terms of scale; therefore, they can be interpreted similarly. For example, the value 1.927 of exp[β̂(1)] roughly means that obtaining a higher-level degree doubles the upper bound for the chance of earning very high incomes. The end point of the confidence interval ranges from 2.15 to 2.28, which includes the unconditional odds ratio of 2.19 using the full sample.
TABLE 3. Empirical Results: Retrospective DML Estimator

Panel A.
                                β(1)        β(0)
  Retrospective Estimate        0.816       0.663
                               (0.145)     (0.124)
  Note. Standard errors are in the parentheses.

Panel B.
                                exp[β(1)]   exp[β(0)]
  Retrospective Estimate        2.261       1.940
  95% Confidence Interval       [1, 2.868]  [1, 2.377]
  Note. Confidence intervals are obtained under the assumption that the point estimate is the upper bound of exp[β(y)], y = 0, 1.

Table 3 reports estimation results with the retrospective DML estimator. The point estimates are larger than those in table 2, indicating that the effect of higher educational attainment might be larger. It is impressive that the standard errors are about the same size as those reported in table 2, given that 254 industry dummies are additionally included with more B-spline terms for age.

In semiparametric estimation with sieve approximation of unknown functions, it is necessary to choose the number J_n of approximating terms. Typically, the optimal choice of J_n for semiparametric estimation is different from the one for nonparametric estimation. Furthermore, unlike age, there is no natural ordering in industry codes; thus, it would require an ad hoc grouping of industry dummies to reduce the number of covariates if a researcher needs to use logistic regression without penalization. Alternatively, a researcher might want to use machine learning methods to deal with the high dimensionality of B-spline terms and full industry codes. However, this could lead to the question whether and how to conduct inference if one mainly cares about parameters such as β̂(y). The retrospective DML estimation method provides a constructive and affirmative answer to this question.

We end this section by illustrating a sensitivity analysis for ξ_RR.

FIGURE 1. The Upper Bounds of ξ_RR and exp(ξ_RR): Sensitivity Analysis
(Each panel plots the estimate of β(1)·P(Y*=1) + β(0)·{1 − P(Y*=1)}, or its exponential, with a one-sided 95% pointwise confidence interval.)
Note. ξ_RR = E[log P{Y*(1) = 1 | X*}] − E[log P{Y*(0) = 1 | X*}]. The left panel shows the estimate and 95% one-sided pointwise confidence interval for ξ_RR, as a function of P(Y* = 1), and the right panel those for exp(ξ_RR).

The left panel of figure 1 shows the estimate and 95% one-sided pointwise confidence interval for ξ_RR, as a function of P(Y* = 1), and the right panel those for exp(ξ_RR). In the case-control sampling, the true value of P(Y* = 1) may be unknown; however, as we can see from figure 1, we can trace out ξ_RR as a function of P(Y* = 1), thereby providing a tool for the sensitivity analysis. In the range of P(Y* = 1) from 0 to 0.5, the upper end point of the 95% pointwise confidence interval for ξ_RR (respectively, exp(ξ_RR)) is at most 0.9 (respectively, 2.5). Roughly speaking, this implies that it is highly unlikely that obtaining a degree beyond bachelor's improves the chance of earning high incomes by more than a factor of 2.5.
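Tracing the curve in figure 1 amounts to evaluating a linear combination and its one-sided band over a grid of p = P(Y* = 1). A minimal sketch (our own helper, with the covariance of the two β̂'s treated as a supplied input):

```python
import numpy as np
from scipy.stats import norm

def sensitivity_curve(b0, se0, b1, se1, cov01, alpha=0.05):
    """Estimated upper bound beta(0)(1-p) + beta(1)p for xi_RR, with a
    one-sided pointwise confidence band, over a grid of p = P(Y*=1)."""
    p = np.linspace(0.0, 1.0, 101)
    est = b0 * (1 - p) + b1 * p
    # Delta-method standard error of the linear combination at each p.
    se = np.sqrt((1 - p)**2 * se0**2 + p**2 * se1**2 + 2 * p * (1 - p) * cov01)
    return p, est, est + norm.ppf(1 - alpha) * se
```

Exponentiating the estimate and the band pointwise gives the right panel of the figure.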
7. RELATED LITERATURE AND FUTURE RESEARCH

The literature on causal inference using observational data is vast and the literature on non-random sampling is extensive. In this section we discuss some of the important papers in the context of what we have achieved in this paper.

We have labeled designs 1 and 2 together as Bernoulli sampling, which is the term that we borrowed from Breslow, Robins, and Wellner (2000). The two sampling schemes have been studied under different names by other authors. For instance, Imbens and Lancaster (1996) refer to design 1 as multinomial sampling, and Lancaster and Imbens (1996) call design 2 case-control sampling with contamination, which is borrowed from Heckman and Robb (1985).

The objective of Heckman and Robb (1985) is to estimate the impact of training on earnings under various data scenarios. In that study they discuss common data problems such as oversampling of trainees or "contamination" in the control group, i.e. the training status of the individuals in the control group being unknown. Although the sampling schemes of Heckman and Robb (1985) are similar to designs 1 and 2, they are distinct in the sense that they are not outcome-based but treatment-based sampling. In our context, having a control group drawn from the whole population without conditioning on the outcome status makes it easier, not harder, to identify the causal relative risk parameter. For this reason we have referred to design 2 as case-population sampling in order to remove connotations of negativeness from the word "contamination."

Estimating the average treatment effect under treatment-based sampling has been studied by other authors as well. For instance, Heckman and Todd (2009) point out that a matching estimator can be implemented by using the odds ratio of the propensity score fit on the sample because it is a monotone transformation of the true propensity scores.
Kennedy, Sjölander, and Small (2015) show that one can estimate the average treatment effect on the treated without knowledge of the true population probability of the treatment. Assuming the latter is known, Hu and Qin (2018) and Zhang, Hu, and Liu (2019) have developed weighted estimators of the average treatment effect. However, all these methods are based on strong ignorability, and to the best of our knowledge, we are not aware of any work that does not rely on it. We leave it for future research how to extend the approach taken in this paper to the context of treatment-based sampling.

The term Bernoulli sampling has been alternatively used by e.g. Kalbfleisch and Lawless (1988) to describe the case where an individual unit is randomly drawn from the entire population but is retained or discarded with stratum-specific probabilities. Imbens and Lancaster (1996) use the same terminology, while they call our design 1 multinomial sampling, as we mentioned earlier. The case where a given number of observations are randomly drawn from each stratum has traditionally been called the classical stratified sampling scheme (e.g. Hausman and Wise, 1981). However, Imbens and Lancaster (1996) have shown that there is no meaningful difference among the three schemes in that they lead to the same likelihood function to estimate the parameters that appear in the choice probabilities. Since this paper is concerned with a binary outcome, Bernoulli sampling seems more appropriate than multinomial sampling.

In the literature on choice-based sampling, the objective is usually efficiently estimating the parameters that appear in the parametrically specified prospective probabilities. Manski and Lerman (1977) propose a weighted likelihood approach for this purpose under outcome-based sampling. Cosslett (1981) shows that it is feasible to compute the full maximum likelihood estimator. By far the most common specification is the logistic model.
However, as Xie and Manski (1989) point out, the logit model can be quite misleading under outcome-based sampling if the truth is not logistic. Despite its convenience, the logistic specification imposes restrictions on the form of heterogeneity in the causal effect. In contrast, our approach does not restrict the shape of the causal relative risk function θ_RR(·), thereby allowing an unrestricted form of heterogeneity in the causal treatment effect.

Many papers in this literature use the term "semiparametric" to describe the fact that the marginal distribution of the regressors is left unspecified in their analysis, while the prospective probability, i.e. the conditional distribution of the outcome given the regressors, is still parametric: see e.g. Imbens and Lancaster (1996) and Breslow, Robins, and Wellner (2000). By contrast, our approach is semi-nonparametric in the sense of X. Chen (2007) because we do not impose parametric restrictions anywhere. Instead of relying on the parametric assumption, we directly target the aggregated log odds ratio as the estimand of interest, we articulate its relationship with the fundamental causal parameter of interest, and we have derived the efficiency bound for the estimand under Bernoulli sampling. By combining all these results we can draw robust and efficient inferences on the causal parameter of interest.

In the statistics and epidemiology literature, misspecification and robustness have been addressed from a different perspective. For instance, H.Y. Chen (2007) considers estimating the parameters that appear in the odds ratio in such a way that consistency and asymptotic normality follow as long as either the prospective or the retrospective probability is correctly specified: this approach is known as a doubly robust estimation method. Tchetgen Tchetgen, Robins, and Rotnitzky (2010) take a similar approach, but their estimator is simpler to implement than H.Y. Chen (2007)'s; it is then further operationalized by Tchetgen Tchetgen (2013) under the finite-dimensional logistic assumption. Our estimating equation in (16) is different because our parameter of interest is semi-nonparametric. It is also noteworthy that statisticians and epidemiologists have maintained an active research agenda in case-control studies, unlike econometricians. In addition to the aforementioned papers, for instance, Zhou, Herring, Bhattacharya, Olshan, Dunson, and Study (2016) investigate how to deal with high-dimensional predictors in the case-control setup using a nonparametric Bayesian approach.

Finally, our causal parameter is defined by a ratio, but it is probably fair to say that a difference (attributable risk in our setup) is a more common measure in econometrics (e.g. Hahn, 1998; Hirano, Imbens, and Ridder, 2003). We do this only because the ratio is mathematically more convenient under outcome-based sampling, thanks to the invariance property of the odds ratio. However, it has long been questioned whether the emphasis on relative risk combined with the rare-disease assumption is relevant for public policies: see, e.g., Hsieh, Manski, and McFadden (1985) and Manski (2009) among others. We take a pragmatic approach to this debate and believe that both attributable risk and relative risk are useful for evidence-based policymaking. We plan to work out details for causal attributable risk in a separate paper since its analysis is sufficiently distinct from that of causal relative risk.
APPENDIX A. AVERAGING WITHOUT TAKING THE LOGARITHM
In the main text our key estimand was an aggregated version of the logarithm of the odds ratio, i.e. β(y) = E[log{OR(X)} | Y = y] for y = 0, 1. As a result, the central causal parameter ξ_RR was defined in (24) by the logarithm of relative risk. Alternatively, one may want to proceed without taking the logarithm, in which case we are led to consider

ζ_RR := E[θ_RR(X*)], ζ_RR(y) := ∫_X θ_RR(x) dF_{X|Y}(x|y), and κ(y) := ∫_X OR(x) dF_{X|Y}(x|y)

for y = 0, 1. Again, if the MTR and MTS conditions are satisfied, then we have

1 ≤ ζ_RR(y) ≤ κ(y) (27)

under both designs 1 and 2, where the inequalities are sharp.

Efficient estimation of κ(y) can be explored exactly in the same way as in section 4. Below we present the formula of the efficient influence function, which is an analog of theorem 3.

Theorem A.1. Suppose that assumptions A, F and G hold and that we have a sample by Bernoulli sampling. Then, for y = 0, 1, κ(y) is pathwise differentiable and its pathwise derivative is given by

K_y(Y, T, X) = [Y^y (1−Y)^{1−y} / {h^y (1−h)^{1−y}}] {OR(X) − κ(y)}
− OR(X) Δ_0(Y, T, X) / [(1−h) {w(X)}^y] + OR(X) {w(X)}^{1−y} Δ_1(Y, T, X) / h.

Further, K_y is an element of the tangent space, and therefore, the semiparametric efficiency bound for κ(y) is given by E{K_y(Y, T, X)²}.

We omit the proof of theorem A.1 because it is essentially identical to that of theorem 3. We can construct efficient estimators of κ(y) and carry out causal inference on ζ_RR by methods identical to those used in section 4. We do not repeat all the details for brevity.

In general we have the relationship ξ_RR ≤ log(ζ_RR) by Jensen's inequality. We have chosen ξ_RR as our central causal parameter to focus on in the main text because (i) it corresponds to the usual parameter when a parametric logistic regression model is used, and (ii) an average of the log odds ratio is less likely to be affected unduly by outliers than that of the odds ratio itself.
APPENDIX B. AUXILIARY LEMMAS
Lemma B.1. Suppose that assumption D holds. Then, for t = 0, 1 and for all x ∈ X,

−1 ≤ (−1)^t [P{Y*(t) = 1 | X* = x} − P(Y* = 1 | X* = x)] ≤ 0,

where the bounds are sharp.

Proof. Since the two inequalities are similar, we focus on the case of t = 1. In this case, the claimed inequality follows from

P{Y*(1) = 1, T* = 1 | X* = x} + P{Y*(1) = 1, T* = 0 | X* = x}
≥ P{Y*(1) = 1, T* = 1 | X* = x} + P{Y*(0) = 1, T* = 0 | X* = x}.

For sharpness, we know from assumption D that

P{Y*(1) = 1, T* = 0 | X* = x} − P{Y*(0) = 1, T* = 0 | X* = x} = P{Y*(1) = 1, Y*(0) = 0, T* = 0 | X* = x},

where the right-hand side is unrestricted between 0 and 1. □

Lemma B.2.
Suppose that Assumption E holds. Then, for $t = 0,1$ and for all $x \in \mathcal{X}$,
\[
(-1)^t \bigl[ P\{Y^*(t) = 1 \mid X^* = x\} - P(Y^* = 1 \mid T^* = t, X^* = x) \bigr] \ge 0,
\]
where the bounds are sharp. Furthermore, if $0 < P(T^* = 1 \mid X^* = x) < 1$, these inequalities hold with equality if and only if Assumption E is satisfied with equality.

Proof. Since the two inequalities are similar, we focus on the case of $t = 1$. First,
\[
P\{Y^*(1) = 1 \mid X^* = x\} = P(Y^* = 1 \mid T^* = 1, X^* = x)\, P(T^* = 1 \mid X^* = x) + P\{Y^*(1) = 1 \mid T^* = 0, X^* = x\}\, P(T^* = 0 \mid X^* = x), \quad (28)
\]
where we note from Assumption E that there exists some $C_x \in [0,1]$ such that
\[
P(Y^* = 1 \mid T^* = 1, X^* = x) = P\{Y^*(1) = 1 \mid T^* = 0, X^* = x\} + C_x. \quad (29)
\]
Combining equations (28) and (29) yields the first inequality in the lemma statement. Therefore,
\[
P\{Y^*(1) = 1 \mid X^* = x\} = P(Y^* = 1 \mid T^* = 1, X^* = x) - C_x \cdot P(T^* = 0 \mid X^* = x) \le P(Y^* = 1 \mid T^* = 1, X^* = x). \quad (30)
\]
Sharpness follows from the fact that $C_x$ is not restricted except that it lies between 0 and 1. Also, if $P(T^* = 0 \mid X^* = x) > 0$, then the last inequality in equation (30) holds with equality if and only if $C_x = 0$. $\square$
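Both lemmas admit a quick Monte Carlo sanity check. The data-generating process below is made up solely to satisfy our reading of the monotonicity conditions: $Y^*(1) \ge Y^*(0)$ pointwise for Assumption D, and treatment take-up more likely when $Y^*(1) = 1$ for Assumption E; it is not the paper's design.

```python
import numpy as np

# Made-up DGP: Y*(1) >= Y*(0) pointwise (monotone response), and take-up
# probability is higher when Y*(1) = 1 (monotone selection into treatment).
rng = np.random.default_rng(1)
n = 200_000
y0 = rng.binomial(1, 0.2, n)
y1 = np.maximum(y0, rng.binomial(1, 0.4, n))    # enforces Y*(1) >= Y*(0)
t = rng.binomial(1, np.where(y1 == 1, 0.7, 0.3))
y = np.where(t == 1, y1, y0)                    # realized outcome Y* = Y*(T*)

# Lemma B.1 ordering: P{Y*(1)=1} >= P(Y*=1) >= P{Y*(0)=1}
assert y1.mean() >= y.mean() >= y0.mean()
# Lemma B.2 with t = 1: P{Y*(1)=1} <= P(Y*=1 | T*=1)
assert y1.mean() <= y[t == 1].mean()
```

The first assertion holds deterministically here because $Y^*(0) \le Y^* \le Y^*(1)$ holds draw by draw; the second reflects the positive selection built into the take-up rule.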
Appendix C. Proofs of the Results in the Main Text
Proof of Lemma 1: By the Bayes rule,
\[
\mathrm{OR}(x) = \frac{P(T = 1 \mid X = x, Y = 1)\, P(T = 0 \mid X = x, Y = 0)}{P(T = 0 \mid X = x, Y = 1)\, P(T = 1 \mid X = x, Y = 0)}.
\]
Then, under design 1, for all $x \in \mathcal{X}$,
\[
\mathrm{OR}(x) = \frac{P(T^* = 1 \mid X^* = x, Y^* = 1)\, P(T^* = 0 \mid X^* = x, Y^* = 0)}{P(T^* = 0 \mid X^* = x, Y^* = 1)\, P(T^* = 1 \mid X^* = x, Y^* = 0)} = \mathrm{OR}^*(x),
\]
where the second equality again follows from the Bayes rule. Now, under design 2, for all $x \in \mathcal{X}$,
\[
\mathrm{OR}(x) = \frac{P(T^* = 1 \mid X^* = x, Y^* = 1)\, P(T^* = 0 \mid X^* = x)}{P(T^* = 0 \mid X^* = x, Y^* = 1)\, P(T^* = 1 \mid X^* = x)} = \mathrm{RR}^*(x). \quad \square
\]
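The design-2 conclusion, that the sample odds ratio coincides with the population relative risk when controls are drawn from the whole population, can be verified by exact arithmetic on an illustrative joint distribution for $(Y^*, T^*)$ (numbers made up; $X$ suppressed):

```python
# Exact population check with made-up probabilities for (Y*, T*).
p_t = 0.4                                     # P(T*=1)
p_y_t1, p_y_t0 = 0.10, 0.04                   # P(Y*=1 | T*=t)
p_y = p_t * p_y_t1 + (1 - p_t) * p_y_t0       # P(Y*=1)
rr = p_y_t1 / p_y_t0                          # population relative risk RR*

# Design 2: cases are draws from Y*=1, "controls" from the whole population.
p_t_case = p_t * p_y_t1 / p_y                 # P(T*=1 | Y*=1), by the Bayes rule
p_t_ctrl = p_t                                # controls carry the marginal of T*
or_sample = (p_t_case / (1 - p_t_case)) / (p_t_ctrl / (1 - p_t_ctrl))

assert abs(or_sample - rr) < 1e-9             # Lemma 1, design 2: sample OR = RR*
```

The identity is algebraic, so it holds without any rare-disease approximation.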
Proof of Lemma 2: Let $\gamma$ be the parameter indexing regular parametric submodels, where the true value is denoted by $\gamma_0$. Then, by using the likelihood function in equation (11), the score evaluated at $\gamma_0$ is equal to
\[
(1-Y) \Bigl[ S_{X|Y}(X \mid 0) + \frac{\{T - P(T=1 \mid X, Y=0)\}\, \partial_\gamma P(T=1 \mid X, Y=0; \gamma_0)}{P(T=1 \mid X, Y=0)\{1 - P(T=1 \mid X, Y=0)\}} \Bigr] + Y \Bigl[ S_{X|Y}(X \mid 1) + \frac{\{T - P(T=1 \mid X, Y=1)\}\, \partial_\gamma P(T=1 \mid X, Y=1; \gamma_0)}{P(T=1 \mid X, Y=1)\{1 - P(T=1 \mid X, Y=1)\}} \Bigr], \quad (31)
\]
where $S_{X|Y}(x \mid y) = \partial_\gamma \log f_{X|Y}(x \mid y; \gamma_0)$ is restricted only by $E\{S_{X|Y}(X \mid y) \mid Y = y\} = 0$, while the derivatives $\partial_\gamma P(T=1 \mid X, Y=y; \gamma_0)$ are unrestricted. $\square$
Proof of Theorem 1: In view of Lemma 1, the theorem follows immediately since
\[
P(Y^* = 1 \mid T^* = t, X^* = x) = P\{Y^*(t) = 1 \mid T^* = t, X^* = x\} = P\{Y^*(t) = 1 \mid X^* = x\},
\]
where the last equality is by the assumption of unconfoundedness. $\square$
Proof of Theorem 2:

Part (i). The sharp lower bound of $\theta_{RR}(x)$ follows from Lemma B.1. To prove that $\theta_{RR}(x) \le \theta_{OR}(x)$ for all $x \in \mathcal{X}$, note that
\[
\frac{\theta_{OR}(x)}{\theta_{RR}(x)} = \frac{1 - P\{Y^*(0) = 1 \mid X^* = x\}}{1 - P\{Y^*(1) = 1 \mid X^* = x\}} \ge 1,
\]
because $(-1)^t \bigl[ P\{Y^*(t) = 1 \mid X^* = x\} - P(Y^* = 1 \mid X^* = x) \bigr] \le 0$ for $t = 0, 1$.

Part (ii). The sharp upper bound of $\theta_{RR}(x)$ under design 2 follows from Lemmas 1 and B.2 because
\[
\theta_{RR}(x) \le \frac{P(Y^* = 1 \mid T^* = 1, X^* = x)}{P(Y^* = 1 \mid T^* = 0, X^* = x)} = \mathrm{RR}^*(x) = \mathrm{OR}(x). \quad (32)
\]
The case of $\theta_{OR}(x)$ under design 1 similarly uses the fact that Lemma B.2 yields
\[
\frac{1 - P\{Y^*(0) = 1 \mid X^* = x\}}{1 - P\{Y^*(1) = 1 \mid X^* = x\}} \le \frac{1 - P(Y^* = 1 \mid T^* = 0, X^* = x)}{1 - P(Y^* = 1 \mid T^* = 1, X^* = x)}. \quad (33)
\]
Combining equation (33) with (32) yields that, under design 1, $\theta_{OR}(x) \le \mathrm{OR}^*(x) = \mathrm{OR}(x)$.

Part (iii). The final statement follows immediately from Lemma B.2. $\square$
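The ratio identity used in part (i) is transparent in numbers; a tiny check with made-up counterfactual probabilities at a fixed $x$:

```python
# Made-up counterfactual success probabilities at a fixed x:
p1, p0 = 0.30, 0.10            # P{Y*(1)=1 | x} and P{Y*(0)=1 | x}

theta_rr = p1 / p0                                   # counterfactual relative risk
theta_or = (p1 / (1 - p1)) / (p0 / (1 - p0))         # counterfactual odds ratio

# Part (i): theta_OR / theta_RR = (1 - p0) / (1 - p1), which is >= 1 when p1 >= p0.
assert abs(theta_or / theta_rr - (1 - p0) / (1 - p1)) < 1e-12
assert theta_rr <= theta_or
```

Note that the two parameters coincide only as $p_1, p_0 \to 0$, which is the rare-disease regime the paper deliberately avoids assuming.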
Proof of Theorem 3: For brevity, we focus on $\beta(0)$ and let $\beta = \beta(0)$; the proof for $\beta(1)$ is analogous. Let $p_0(x) = P(T=1 \mid X=x, Y=0)$ and $p_1(x) = P(T=1 \mid X=x, Y=1)$. Note that
\[
\beta(\gamma) = \int_{\mathcal{X}} \log \Bigl[ \underbrace{\frac{p_1(x;\gamma)}{1 - p_1(x;\gamma)} \cdot \frac{1 - p_0(x;\gamma)}{p_0(x;\gamma)}}_{:= \mathrm{OR}(x;\gamma)} \Bigr] f_{X|Y}(x \mid 0; \gamma)\, dx, \quad (34)
\]
where $\gamma$ represents regular parametric submodels such that $\gamma_0$ is the truth. Then,
\[
\partial_\gamma \mathrm{OR}(x;\gamma_0) = \frac{\partial_\gamma p_1(x;\gamma_0) \{1 - p_0(x)\}}{p_0(x) \{1 - p_1(x)\}^2} - \frac{\partial_\gamma p_0(x;\gamma_0)\, p_1(x)}{p_0(x)^2 \{1 - p_1(x)\}} = \frac{\partial_\gamma p_1(x;\gamma_0)}{p_1(x)\{1 - p_1(x)\}} \mathrm{OR}(x) - \frac{\partial_\gamma p_0(x;\gamma_0)}{p_0(x)\{1 - p_0(x)\}} \mathrm{OR}(x). \quad (35)
\]
Therefore,
\[
\partial_\gamma \beta(\gamma_0) = \int \Bigl[ \frac{\partial_\gamma \mathrm{OR}(x;\gamma_0)}{\mathrm{OR}(x)} + \log\{\mathrm{OR}(x)\}\, S_{X|Y}(x \mid 0) \Bigr] f_{X|Y}(x \mid 0)\, dx = \int \Bigl[ \frac{\partial_\gamma p_1(x;\gamma_0)}{p_1(x)\{1 - p_1(x)\}} - \frac{\partial_\gamma p_0(x;\gamma_0)}{p_0(x)\{1 - p_0(x)\}} + \log\{\mathrm{OR}(x)\}\, S_{X|Y}(x \mid 0) \Bigr] f_{X|Y}(x \mid 0)\, dx. \quad (36)
\]
Now, we only need to verify the equality between $E\{F(Y,T,X)\, S(Y,T,X)\}$ and
\[
\int \Bigl[ \underbrace{\frac{\partial_\gamma p_1(x;\gamma_0)}{p_1(x)\{1 - p_1(x)\}}}_{:= A_1(x)} - \underbrace{\frac{\partial_\gamma p_0(x;\gamma_0)}{p_0(x)\{1 - p_0(x)\}}}_{:= A_0(x)} + \log\{\mathrm{OR}(x)\}\, S_{X|Y}(x \mid 0) \Bigr] f_{X|Y}(x \mid 0)\, dx, \quad (37)
\]
where $F(Y,T,X)$ and $S(Y,T,X)$ are given in the theorem statement and equation (31), respectively; i.e.,
\[
S(Y,T,X) = (1-Y) \bigl[ S_{X|Y}(X \mid 0) + \{T - p_0(X)\} A_0(X) \bigr] + Y \bigl[ S_{X|Y}(X \mid 1) + \{T - p_1(X)\} A_1(X) \bigr],
\]
\[
F(Y,T,X) = \frac{1-Y}{1-h} \Bigl[ \log \mathrm{OR}(X) - \beta - \frac{T - p_0(X)}{p_0(X)\{1 - p_0(X)\}} \Bigr] + \frac{Y}{h}\, \frac{f_{X|Y}(X \mid 0)}{f_{X|Y}(X \mid 1)}\, \frac{T - p_1(X)}{p_1(X)\{1 - p_1(X)\}}.
\]
Note that $F(Y,T,X)\, S(Y,T,X)$ is equal to
\[
\frac{1-Y}{1-h} \Bigl[ \log \mathrm{OR}(X) - \beta - \frac{T - p_0(X)}{p_0(X)\{1 - p_0(X)\}} \Bigr] \bigl[ S_{X|Y}(X \mid 0) + \{T - p_0(X)\} A_0(X) \bigr] + \frac{Y}{h}\, \frac{f_{X|Y}(X \mid 0)}{f_{X|Y}(X \mid 1)} \Bigl[ \frac{T - p_1(X)}{p_1(X)\{1 - p_1(X)\}} \Bigr] \bigl[ S_{X|Y}(X \mid 1) + \{T - p_1(X)\} A_1(X) \bigr].
\]
Here, taking expectations directly shows that $E\{F(Y,T,X)\, S(Y,T,X)\}$ is equal to
\[
E\bigl\{ \log\{\mathrm{OR}(X)\}\, S_{X|Y}(X \mid 0) - A_0(X) \bigm| Y = 0 \bigr\} + E\Bigl\{ \frac{f_{X|Y}(X \mid 0)}{f_{X|Y}(X \mid 1)}\, A_1(X) \Bigm| Y = 1 \Bigr\},
\]
which is equal to the expression in equation (37) since
\[
E\Bigl\{ \frac{f_{X|Y}(X \mid 0)}{f_{X|Y}(X \mid 1)}\, A_1(X) \Bigm| Y = 1 \Bigr\} = E\bigl\{ A_1(X) \bigm| Y = 0 \bigr\}.
\]
Finally, it follows from Lemma 2 that $F$ is an element of the tangent space. $\square$

The proofs of Theorems 4 and 5 are provided in Appendix S-1, which is available online. The proof of Theorem 4 is similar to that of Theorem 3, and the proof of Theorem 5 does not provide any additional insight beyond DML.

References

Ackerberg, D., X. Chen, and J. Hahn (2012): "A practical asymptotic variance estimator for two-step semiparametric estimators," Review of Economics and Statistics, 94(2), 481–498.
Ackerberg, D., X. Chen, J. Hahn, and Z. Liao (2014): "Asymptotic efficiency of semiparametric two-step GMM," Review of Economic Studies, 81(3), 919–943.

Ai, C., and X. Chen (2003): "Efficient estimation of models with conditional moment restrictions containing unknown functions," Econometrica, 71(6), 1795–1843.

——— (2012): "The semiparametric efficiency bound for models of sequential moment restrictions containing unknown functions," Journal of Econometrics, 170(2), 442–457.

Belloni, A., V. Chernozhukov, and Y. Wei (2016): "Post-Selection Inference for Generalized Linear Models With Many Controls," Journal of Business & Economic Statistics, 34(4), 606–619.

Bhattacharya, J., A. M. Shaikh, and E. Vytlacil (2008): "Treatment effect bounds under monotonicity assumptions: an application to Swan-Ganz catheterization," American Economic Review: Papers and Proceedings, 98(2), 351–356.

Bhattacharya, J., A. M. Shaikh, and E. Vytlacil (2012): "Treatment effect bounds: An application to Swan-Ganz catheterization," Journal of Econometrics, 168(2), 223–243.

Breslow, N. E. (1996): "Statistics in epidemiology: the case-control study," Journal of the American Statistical Association, 91(433), 14–28.

Breslow, N. E., and N. E. Day (1980): Statistical Methods in Cancer Research I. The Analysis of Case-Control Studies, vol. 1. International Agency for Research on Cancer, Lyon, France.

Breslow, N. E., J. M. Robins, and J. A. Wellner (2000): "On the semiparametric efficiency of logistic regression under case-control sampling," Bernoulli, 6(3), 447–455.

Carvalho, L. S., and R. R. Soares (2016): "Living on the edge: Youth entry, career and exit in drug-selling gangs," Journal of Economic Behavior & Organization, 121, 77–98.

Chen, K. (2001): "Parametric models for response-biased sampling," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(4), 775–789.

Chen, X. (2007): "Large Sample Sieve Estimation of Semi-Nonparametric Models," in Handbook of Econometrics, vol. 6, pp. 5549–5632. Elsevier.

Chen, H. Y. (2007): "A semiparametric odds ratio model for measuring association," Biometrics, 63(2), 413–421.

Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018): "Double/debiased machine learning for treatment and structural parameters," Econometrics Journal, 21, C1–C68.

Chernozhukov, V., S. Lee, and A. M. Rosen (2013): "Intersection bounds: Estimation and inference," Econometrica, 81(2), 667–737.

Cornfield, J. (1951): "A method of estimating comparative rates from clinical data. Applications to cancer of the lung, breast, and cervix," Journal of the National Cancer Institute, 11(6), 1269–1275.

Cosslett, S. R. (1981): "Maximum Likelihood Estimator for Choice-Based Samples," Econometrica, 49(5), 1289–1316.

Currie, J., and M. Neidell (2005): "Air pollution and infant health: what can we learn from California's recent experience?," Quarterly Journal of Economics, 120(3), 1003–1030.

Domowitz, I., and R. L. Sartain (1999): "Determinants of the Consumer Bankruptcy Decision," Journal of Finance, 54(1), 403–420.

Friedman, J., T. Hastie, and R. Tibshirani (2010): "Regularization Paths for Generalized Linear Models via Coordinate Descent," Journal of Statistical Software, 33(1), 1–22.

Hahn, J. (1998): "On the role of the propensity score in efficient semiparametric estimation of average treatment effects," Econometrica, 66(2), 315–331.

Hausman, J. A., and D. A. Wise (1981): "Stratification on endogenous variables and estimation: The Gary income maintenance experiment," in Structural Analysis of Discrete Data with Econometric Applications, pp. 365–391.

Heckman, J. J., and R. Robb (1985): "Alternative methods for evaluating the impact of interventions: An overview," Journal of Econometrics, 30(1–2), 239–267.

Heckman, J. J., and P. E. Todd (2009): "A note on adapting propensity score matching and selection models to choice based samples," Econometrics Journal, 12(s1), S230–S234.

Hirano, K., G. W. Imbens, and G. Ridder (2003): "Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score," Econometrica, 71(4), 1161–1189.

Holland, P. W., and D. B. Rubin (1988): "Causal Inference in Retrospective Studies," Evaluation Review, 12(3), 203–231.

Hsieh, D. A., C. F. Manski, and D. McFadden (1985): "Estimation of Response Probabilities from Augmented Retrospective Observations," Journal of the American Statistical Association, 80(391), 651–662.

Hu, Z., and J. Qin (2018): "Generalizability of causal inference in observational studies under retrospective convenience sampling," Statistics in Medicine, 37(19), 2874–2883.

Imbens, G. W. (1992): "An Efficient Method of Moments Estimator for Discrete Choice Models With Choice-Based Sampling," Econometrica, 60(5), 1187–1214.

Imbens, G. W., and T. Lancaster (1996): "Efficient estimation and stratified sampling," Journal of Econometrics, 74(2), 289–318.

Imbens, G. W., and D. B. Rubin (2015): Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.

Jun, S. J., and S. Lee (2019): "Identifying the effect of persuasion," arXiv:1812.02276.

Kalbfleisch, J., and J. Lawless (1988): "Estimation of Reliability in Field-Performance Studies," Technometrics, 30(4), 365–378.

Kennedy, E. H., A. Sjölander, and D. Small (2015): "Semiparametric causal inference in matched cohort studies," Biometrika, 102(3), 739–746.

Kim, W., K. Kwon, S. Kwon, and S. Lee (2018): "The identification power of smoothness assumptions in models with counterfactual outcomes," Quantitative Economics, 9(2), 617–642.

Kreider, B., J. V. Pepper, C. Gundersen, and D. Jolliffe (2012): "Identifying the Effects of SNAP (Food Stamps) on Child Health Outcomes When Participation Is Endogenous and Misreported," Journal of the American Statistical Association, 107(499), 958–975.

Lancaster, T., and G. Imbens (1996): "Case-control studies with contaminated controls," Journal of Econometrics, 71(1–2), 145–160.

Lee, S., K. Song, and Y.-J. Whang (2018): "Testing for a general class of functional inequalities," Econometric Theory, 34(5), 1018–1064.

Machado, C., A. M. Shaikh, and E. J. Vytlacil (2019): "Instrumental variables and the sign of the average treatment effect," Journal of Econometrics, 212(2), 522–555.

Manski, C. F. (1997): "Monotone Treatment Response," Econometrica, 65(6), 1311–1334.

——— (2003): Partial Identification of Probability Distributions. Springer-Verlag.

——— (2009): Identification for Prediction and Decision. Harvard University Press.

Manski, C. F., and S. R. Lerman (1977): "The Estimation of Choice Probabilities from Choice Based Samples," Econometrica, 45(8), 1977–1988.

Manski, C. F., and D. McFadden (1981): "Alternative estimators and sample designs for discrete choice analysis," in Structural Analysis of Discrete Data with Econometric Applications, ed. by C. F. Manski and D. McFadden, vol. 2, pp. 51–111. MIT Press, Cambridge, MA.

Manski, C. F., and J. V. Pepper (2000): "Monotone instrumental variables: With an application to the returns to schooling," Econometrica, 68(4), 997–1010.

McFadden, D. (2015): "Observational Studies: Outcome-Based Sampling," in International Encyclopedia of the Social & Behavioral Sciences (Second Edition), ed. by J. D. Wright, pp. 103–106. Elsevier, Oxford.

Newey, W. K. (1990): "Semiparametric efficiency bounds," Journal of Applied Econometrics, 5(2), 99–135.

——— (1994): "The asymptotic variance of semiparametric estimators," Econometrica, 62(6), 1349–1382.

Okumura, T., and E. Usui (2014): "Concave-monotone treatment response and monotone treatment selection: With an application to the returns to schooling," Quantitative Economics, 5(1), 175–194.

Ruggles, S., S. Flood, R. Goeken, J. Grover, E. Meyer, J. Pacas, and M. Sobek (2019): "IPUMS USA: Version 9.0 [dataset]," https://doi.org/10.18128/D010.V9.0.

Tamer, E. (2010): "Partial identification in econometrics," Annual Review of Economics, 2(1), 167–195.

Tchetgen Tchetgen, E. J. (2013): "On a closed-form doubly robust estimator of the adjusted odds ratio for a binary exposure," American Journal of Epidemiology, 177(11), 1314–1316.

Tchetgen Tchetgen, E. J., J. M. Robins, and A. Rotnitzky (2010): "On doubly robust estimation in a semiparametric odds ratio model," Biometrika, 97(1), 171–180.

van de Geer, S. A. (2008): "High-dimensional generalized linear models and the lasso," Annals of Statistics, 36(2), 614–645.

Xie, J., Y. Lin, X. Yan, and N. Tang (2019): "Category-Adaptive Variable Screening for Ultra-High Dimensional Heterogeneous Categorical Data," Journal of the American Statistical Association, forthcoming.

Xie, Y., and C. F. Manski (1989): "The logit model and response-based samples," Sociological Methods & Research, 17(3), 283–302.

Zhang, Z., Z. Hu, and C. Liu (2019): "Estimating the Population Average Treatment Effect in Observational Studies with Choice-Based Sampling," International Journal of Biostatistics, 15(1).

Zhou, J., A. H. Herring, A. Bhattacharya, A. F. Olshan, D. B. Dunson, and the National Birth Defects Prevention Study (2016): "Nonparametric Bayes modeling for case control studies with many predictors," Biometrics, 72(1), 184–192.

Online Supplement to "Causal Inference in Case-Control Studies" by Jun and Lee

Appendix S-1. Additional Proofs
Proof of Theorem 4: For simplicity, in the proof we focus on $\beta(0)$ and let it be denoted by $\beta$. The case of $\beta(1)$ is similar. Recall that
\[
\tilde{F}(\eta)[Y,T,X] = \frac{1-Y}{1-h} \Bigl[ \log \widetilde{\mathrm{OR}}(\eta)[X] - \beta - \frac{T - a(X)}{a(X)\{1 - a(X)\}} \Bigr] + \frac{Y}{h}\, \tilde{w}(\eta)[X]\, \frac{T - b(X)}{b(X)\{1 - b(X)\}}.
\]
Note that
\[
E\bigl\{ \tilde{F}(\eta)[Y,T,X] \bigr\} = E\bigl\{ \log \widetilde{\mathrm{OR}}(\eta)[X] - \beta \bigm| Y = 0 \bigr\} + E\bigl\{ \tilde{\Delta}_0(\eta)[T,X] \bigm| Y = 0 \bigr\} + E\bigl\{ \tilde{\Delta}_1(\eta)[T,X] \bigm| Y = 1 \bigr\}, \quad (S.1)
\]
where
\[
\tilde{\Delta}_0(\eta)[T,X] = -\frac{T - a(X)}{a(X)\{1 - a(X)\}}, \quad (S.2)
\]
\[
\tilde{\Delta}_1(\eta)[T,X] = \tilde{w}(\eta)[X]\, \frac{T - b(X)}{b(X)\{1 - b(X)\}}. \quad (S.3)
\]
Here,
\[
E\Bigl( \partial_\gamma \tilde{\Delta}_0\{\eta_0 + \gamma(\eta - \eta_0)\}[T,X] \bigm|_{\gamma=0} \Bigm| X, Y = 0 \Bigr) = \frac{a(X) - p_0(X)}{p_0(X)\{1 - p_0(X)\}},
\]
\[
E\Bigl( \partial_\gamma \tilde{\Delta}_1\{\eta_0 + \gamma(\eta - \eta_0)\}[T,X] \bigm|_{\gamma=0} \Bigm| X, Y = 1 \Bigr) = -w(X)\, \frac{b(X) - p_1(X)}{p_1(X)\{1 - p_1(X)\}},
\]
where $p_0(X) = P(T=1 \mid X, Y=0)$ and $p_1(X) = P(T=1 \mid X, Y=1)$ as before. Therefore,
\[
E\Bigl( \partial_\gamma \tilde{\Delta}_0\{\eta_0 + \gamma(\eta - \eta_0)\}[T,X] \bigm|_{\gamma=0} \Bigm| Y = 0 \Bigr) = E\Bigl\{ \frac{a(X) - p_0(X)}{p_0(X)\{1 - p_0(X)\}} \Bigm| Y = 0 \Bigr\}, \quad (S.4)
\]
and
\[
E\Bigl( \partial_\gamma \tilde{\Delta}_1\{\eta_0 + \gamma(\eta - \eta_0)\}[T,X] \bigm|_{\gamma=0} \Bigm| Y = 1 \Bigr) = -E\Bigl\{ w(X)\, \frac{b(X) - p_1(X)}{p_1(X)\{1 - p_1(X)\}} \Bigm| Y = 1 \Bigr\} = -E\Bigl\{ \frac{b(X) - p_1(X)}{p_1(X)\{1 - p_1(X)\}} \Bigm| Y = 0 \Bigr\}. \quad (S.5)
\]
Now, similarly to equation (35), we have
\[
\partial_\gamma \log \widetilde{\mathrm{OR}}\{\eta_0 + \gamma(\eta - \eta_0)\}[X] \bigm|_{\gamma=0} = \frac{b(X) - p_1(X)}{p_1(X)\{1 - p_1(X)\}} - \frac{a(X) - p_0(X)}{p_0(X)\{1 - p_0(X)\}}. \quad (S.6)
\]
Therefore, the conclusion follows from equations (S.1) and (S.4) to (S.6).
$\square$
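Theorem 4's orthogonality claim, that the derivative of the mean score with respect to the nuisances vanishes at the truth, can be checked by finite differences in a stripped-down version with $X$ suppressed, so that the nuisances $(a, b, \tilde{w})$ are scalars. All the numbers below are made up:

```python
import numpy as np

# Scalar sketch (X suppressed) of the orthogonality property in theorem 4.
p0, p1 = 0.3, 0.6                  # true P(T=1|Y=0), P(T=1|Y=1)
a1, b1, w1 = 0.4, 0.5, 1.3         # a perturbed nuisance eta; truth is (p0, p1, 1)
beta0 = np.log(p1 * (1 - p0) / ((1 - p1) * p0))   # true log odds ratio

def mean_score(g):
    """E[psi] at (beta0, eta0 + g*(eta - eta0)), in closed form."""
    a = p0 + g * (a1 - p0)
    b = p1 + g * (b1 - p1)
    w = 1.0 + g * (w1 - 1.0)
    log_or = np.log(b * (1 - a) / ((1 - b) * a))
    return (log_or - beta0 + (a - p0) / (a * (1 - a))
            + w * (p1 - b) / (b * (1 - b)))

def naive_score(g):
    """Plug-in without the correction terms (not Neyman orthogonal)."""
    a = p0 + g * (a1 - p0)
    b = p1 + g * (b1 - p1)
    return np.log(b * (1 - a) / ((1 - b) * a)) - beta0

eps = 1e-5
deriv = (mean_score(eps) - mean_score(-eps)) / (2 * eps)
naive_deriv = (naive_score(eps) - naive_score(-eps)) / (2 * eps)

assert abs(deriv) < 1e-8          # derivative vanishes at the truth
assert abs(naive_deriv) > 0.1     # the uncorrected plug-in is not orthogonal
```

The contrast with the uncorrected plug-in shows why the correction terms $\tilde{\Delta}_0$ and $\tilde{\Delta}_1$ matter: they cancel the first-order effect of nuisance estimation error.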
Proof of Theorem 5: As in the previous proofs, we focus on $\beta(0) \equiv \beta$. The case of $\beta(1)$ is similar. We verify Assumptions 3.1 and 3.2 of DML (Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins, 2018). Using the notation of DML, $\psi(W; \beta, \eta) = \tilde{F}_y(\eta)[Y,T,X]$ with $W = (Y,T,X)$. Then our case belongs to that of linear scores, namely
\[
\psi(W; \theta, \eta) = \psi^a(W; \eta)\, \theta + \psi^b(W; \eta),
\]
where
\[
\psi^a(W; \eta) = -\frac{1-Y}{1-h}, \qquad \psi^b(W; \eta) = \frac{1-Y}{1-h} \Bigl[ \log \widetilde{\mathrm{OR}}(\eta)[X] - \frac{T - a(X)}{a(X)\{1 - a(X)\}} \Bigr] + \frac{Y}{h}\, \tilde{w}(\eta)[X]\, \frac{T - b(X)}{b(X)\{1 - b(X)\}}.
\]

Verification of Assumption 3.1 of DML. Under assumptions F and G, Assumption 3.1 of DML is satisfied with $\lambda_N = 0$: parts (a) and (b) hold by the definition of $\psi$; part (c) is by assumptions F and G; part (d) is by Theorem 4; part (e) follows because $E[\psi^a(W; \eta)] = -1$ for every $\eta$.

Verification of Assumption 3.2 (b) of DML. It holds trivially that $|\psi^a(W; \eta)|$ is bounded by a constant uniformly in $\eta$. Moreover, by assumption F, there is a constant $c < \infty$ such that $|\psi(W; \beta_0, \eta)| \le c$ uniformly in $\eta$ almost surely.

Verification of Assumption 3.2 (d) of DML. Note that
\[
E[\psi^2(W; \beta_0, \eta_0)] \ge \frac{1}{1-h}\, E\Bigl[ \{\log \mathrm{OR}(X) - \beta_0\}^2 + \frac{1}{p_0(X)\{1 - p_0(X)\}} \Bigm| Y = 0 \Bigr],
\]
which is bounded from below by a constant under assumption F.

Since Assumption 3.2 (a) of DML is the definition of the first-stage estimator, Theorem 5 follows immediately from Theorems 3.1 and 3.2 of DML, provided that we verify the remaining Assumption 3.2 (c) of DML.

Verification of Assumption 3.2 (c) of DML. Using the notation of DML, define
\[
r_n := \sup_{\eta \in \mathcal{T}_N} \bigl| E[\psi^a(W; \eta) - \psi^a(W; \eta_0)] \bigr|, \qquad r'_n := \sup_{\eta \in \mathcal{T}_N} \bigl( E[|\psi(W; \beta_0, \eta) - \psi(W; \beta_0, \eta_0)|^2] \bigr)^{1/2},
\]
\[
\lambda'_n := \sup_{\gamma \in (0,1),\, \eta \in \mathcal{T}_N} \bigl| \partial^2_\gamma E[\psi(W; \beta_0, \eta_0 + \gamma(\eta - \eta_0))] \bigr|.
\]

Step 1. Note that $r_n = 0$ because $\psi^a(W; \eta)$ does not depend on $\eta$.

Step 2. Now write
\[
\bigl( E[|\psi(W; \beta_0, \eta) - \psi(W; \beta_0, \eta_0)|^2] \bigr)^{1/2} = \|\psi(W; \beta_0, \eta) - \psi(W; \beta_0, \eta_0)\|_{P,2} \le \|\mathcal{T}_1\|_{P,2} + \|\mathcal{T}_2\|_{P,2},
\]
where
\[
\mathcal{T}_1 := \frac{1-Y}{1-h} \Bigl[ \log \widetilde{\mathrm{OR}}(\eta)[X] - \frac{T - a(X)}{a(X)\{1 - a(X)\}} \Bigr] - \frac{1-Y}{1-h} \Bigl[ \log \mathrm{OR}(X) - \frac{T - p_0(X)}{p_0(X)\{1 - p_0(X)\}} \Bigr],
\]
\[
\mathcal{T}_2 := \frac{Y}{h}\, \tilde{w}(\eta)[X] \Bigl[ \frac{T - b(X)}{b(X)\{1 - b(X)\}} \Bigr] - \frac{Y}{h}\, w(X) \Bigl[ \frac{T - p_1(X)}{p_1(X)\{1 - p_1(X)\}} \Bigr].
\]
Then, in view of assumptions F and H, there exists a sequence $\tilde{\delta}_n \to 0$ such that
\[
\bigl( E\{ |\psi(W; \beta_0, \eta) - \psi(W; \beta_0, \eta_0)|^2 \} \bigr)^{1/2} \le \tilde{\delta}_n
\]
holds with probability at least $1 - \tau_n$. This implies that we can take $r'_n = \tilde{\delta}_n$.

Step 3. Define $a_\gamma(X) := p_0(X) + \gamma\{a(X) - p_0(X)\}$ and $b_\gamma(X) := p_1(X) + \gamma\{b(X) - p_1(X)\}$. Note that
\[
\partial_\gamma \log \widetilde{\mathrm{OR}}\{\eta_0 + \gamma(\eta - \eta_0)\}[X] = \frac{\partial_\gamma \widetilde{\mathrm{OR}}\{\eta_0 + \gamma(\eta - \eta_0)\}[X]}{\widetilde{\mathrm{OR}}\{\eta_0 + \gamma(\eta - \eta_0)\}[X]} = \frac{b(X) - p_1(X)}{b_\gamma(X)\{1 - b_\gamma(X)\}} - \frac{a(X) - p_0(X)}{a_\gamma(X)\{1 - a_\gamma(X)\}}.
\]
In addition,
\[
\partial_\gamma \Bigl[ \frac{T - a_\gamma(X)}{a_\gamma(X)\{1 - a_\gamma(X)\}} \Bigr] = -\frac{a(X) - p_0(X)}{a_\gamma(X)\{1 - a_\gamma(X)\}} - \frac{\{T - a_\gamma(X)\}\{1 - 2 a_\gamma(X)\}}{\bigl[ a_\gamma(X)\{1 - a_\gamma(X)\} \bigr]^2} \{a(X) - p_0(X)\},
\]
\[
\partial_\gamma \Bigl[ \tilde{w}\{\eta_0 + \gamma(\eta - \eta_0)\}[X]\, \frac{T - b_\gamma(X)}{b_\gamma(X)\{1 - b_\gamma(X)\}} \Bigr] = \frac{T - b_\gamma(X)}{b_\gamma(X)\{1 - b_\gamma(X)\}} (\eta - \eta_0)[X] - \tilde{w}\{\eta_0 + \gamma(\eta - \eta_0)\}[X]\, \frac{b(X) - p_1(X)}{b_\gamma(X)\{1 - b_\gamma(X)\}} - \tilde{w}\{\eta_0 + \gamma(\eta - \eta_0)\}[X]\, \frac{\{T - b_\gamma(X)\}\{1 - 2 b_\gamma(X)\}}{\bigl[ b_\gamma(X)\{1 - b_\gamma(X)\} \bigr]^2} \{b(X) - p_1(X)\}.
\]
Combining these yields
\[
\partial_\gamma \psi(W; \beta_0, \eta_0 + \gamma(\eta - \eta_0)) = \frac{1-Y}{1-h} \Bigl[ \frac{b(X) - p_1(X)}{b_\gamma(X)\{1 - b_\gamma(X)\}} + \frac{\{T - a_\gamma(X)\}\{1 - 2 a_\gamma(X)\}}{\bigl[ a_\gamma(X)\{1 - a_\gamma(X)\} \bigr]^2} \{a(X) - p_0(X)\} \Bigr] + \frac{Y}{h} \Bigl[ \frac{T - b_\gamma(X)}{b_\gamma(X)\{1 - b_\gamma(X)\}} (\eta - \eta_0)[X] - \tilde{w}\{\eta_0 + \gamma(\eta - \eta_0)\}[X]\, \frac{b(X) - p_1(X)}{b_\gamma(X)\{1 - b_\gamma(X)\}} - \tilde{w}\{\eta_0 + \gamma(\eta - \eta_0)\}[X]\, \frac{\{T - b_\gamma(X)\}\{1 - 2 b_\gamma(X)\}}{\bigl[ b_\gamma(X)\{1 - b_\gamma(X)\} \bigr]^2} \{b(X) - p_1(X)\} \Bigr].
\]
If we take another derivative in $\gamma$, each term of the resulting second-order derivative can be bounded in absolute value by a constant times $\chi(a,b)$, which is defined to be equal to
\[
\max \Bigl[ \{a(X) - p_0(X)\}^2,\ \{b(X) - p_1(X)\}^2,\ \bigl| (\eta - \eta_0)[X] \{b(X) - p_1(X)\} \bigr| \Bigr].
\]
Therefore, there exists a universal constant $C < \infty$ such that
\[
\bigl| \partial^2_\gamma E[\psi(W; \beta_0, \eta_0 + \gamma(\eta - \eta_0))] \bigr| \le C\, E\{\chi(a,b)\}.
\]
Then, by assumption H, there exists a sequence $\tilde{\delta}'_n \to 0$ such that
\[
\sup_{\gamma \in (0,1),\, \eta \in \mathcal{T}_N} \bigl| \partial^2_\gamma E[\psi(W; \beta_0, \eta_0 + \gamma(\eta - \eta_0))] \bigr| \le \tilde{\delta}'_n\, n^{-1/2}
\]
holds with probability at least $1 - \tau_n$. Therefore, we can take $\lambda'_n = \tilde{\delta}'_n\, n^{-1/2}$. $\square$

Reference

Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018): "Double/debiased machine learning for treatment and structural parameters," Econometrics Journal, 21, C1–C68.
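To make the DML recipe concrete, here is a hedged, self-contained sketch of a cross-fitted estimator of $\beta(0) = E[\log \mathrm{OR}(X) \mid Y = 0]$ for a discrete covariate, with the nuisances $(a, b, \tilde{w})$ estimated by simple cell frequencies. Everything below (the data-generating numbers, the two-fold split, the frequency estimators) is our own illustrative choice, not the paper's implementation:

```python
import numpy as np

# Illustrative case-control DGP: discrete X with 5 support points (made up).
rng = np.random.default_rng(3)
K = 5
f0 = np.array([0.30, 0.25, 0.20, 0.15, 0.10])   # f(x | Y=0)
f1 = np.array([0.10, 0.15, 0.20, 0.25, 0.30])   # f(x | Y=1)
p0x = np.linspace(0.20, 0.40, K)                # P(T=1 | X=x, Y=0)
p1x = np.linspace(0.35, 0.70, K)                # P(T=1 | X=x, Y=1)
beta_true = float(f0 @ np.log(p1x * (1 - p0x) / ((1 - p1x) * p0x)))

n = 100_000                                      # balanced case-control sample
y = rng.binomial(1, 0.5, n)
x = np.where(y == 1, rng.choice(K, n, p=f1), rng.choice(K, n, p=f0))
t = rng.binomial(1, np.where(y == 1, p1x[x], p0x[x]))

folds = rng.integers(0, 2, n)                    # two-fold cross-fitting
num, den = np.empty(n), np.empty(n)
for k in (0, 1):
    tr, te = folds != k, folds == k
    # First-stage nuisances from the training fold (cell frequencies):
    a = np.array([t[tr & (y == 0) & (x == j)].mean() for j in range(K)])
    b = np.array([t[tr & (y == 1) & (x == j)].mean() for j in range(K)])
    w = np.array([np.mean(x[tr & (y == 0)] == j) /
                  np.mean(x[tr & (y == 1)] == j) for j in range(K)])
    h = y[tr].mean()
    log_or = np.log(b[x] * (1 - a[x]) / ((1 - b[x]) * a[x]))
    psi_b = ((1 - y) / (1 - h) * (log_or - (t - a[x]) / (a[x] * (1 - a[x])))
             + y / h * w[x] * (t - b[x]) / (b[x] * (1 - b[x])))
    num[te], den[te] = psi_b[te], ((1 - y) / (1 - h))[te]

# Linear score: psi = psi_a * beta + psi_b with psi_a = -(1-Y)/(1-h),
# so setting the sample mean of psi to zero gives:
beta_hat = num.mean() / den.mean()
assert abs(beta_hat - beta_true) < 0.1
```

With smooth or high-dimensional $X$, the cell frequencies would be replaced by machine-learning fits of $P(T=1 \mid X, Y=y)$ and of the density ratio, which is exactly where the cross-fitting and the orthogonal score earn their keep.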