CAUSAL INFERENCE IN CASE-CONTROL STUDIES∗

SUNG JAE JUN† AND SOKBAE LEE‡

PENN STATE UNIV.; COLUMBIA UNIV. AND IFS
April 20, 2020
Abstract.
We investigate identification of causal parameters in case-control and related studies. The odds ratio in the sample is our main estimand of interest and we articulate its relationship with causal parameters under various scenarios. It turns out that the odds ratio is generally a sharp upper bound for counterfactual relative risk under some monotonicity assumptions, without resorting to strong ignorability or to the rare-disease assumption. Further, we propose semiparametrically efficient, easy-to-implement, machine-learning-friendly estimators of the aggregated (log) odds ratio by exploiting an explicit form of the efficient influence function. Using our new estimators, we develop methods for causal inference and illustrate the usefulness of our methods by a real-data example.
Key Words: relative risk, causality, monotonicity, case-control sample, machine learning, partial identification, semiparametric efficiency bound
JEL Classification Codes:
C21, C55, C83

∗ We would like to thank Chuck Manski and seminar participants at Cemmap, Oxford, and Penn State for helpful comments. This work was supported in part by the European Research Council (ERC-2014-CoG-646917-ROMIA) and by the UK Economic and Social Research Council (ESRC) through research grant (ES/P008909/1) to the CeMMAP.
† Department of Economics, Penn State University, 619 Kern Graduate Building, University Park, PA 16802, [email protected]
‡ Department of Economics, Columbia University, 1022 International Affairs Building, 420 West 118th Street, New York, NY 10027, [email protected]

1. INTRODUCTION
Empirical researchers often find it useful to work with outcome-based or case-control samples when they study rare events: cancer (Breslow and Day, 1980), infant death (Currie and Neidell, 2005), consumer bankruptcy (Domowitz and Sartain, 1999), and drug trafficking (Carvalho and Soares, 2016), among many others. Case-control sampling arises frequently in biostatistics when doctors or epidemiologists study risk factors for a rare disease: random sampling may yield only a few observations with the disease among several thousands of data points. In econometrics, it is often referred to as choice-based or response-based sampling because the outcome of interest is a discrete choice in many economic applications (see, e.g., Chapter 6 of Manski, 2009).

Inference methods that work with random samples are generally not suitable when data are outcome-based. In the econometrics literature, parametric estimation with outcome-based samples has been investigated by Manski and Lerman (1977), Cosslett (1981), Manski and McFadden (1981), Hsieh, Manski, and McFadden (1985), Imbens (1992), and Lancaster and Imbens (1996), among others. This strand of the literature has focused mainly on the consistency or efficiency of parametric estimators in discrete response models; see, e.g., McFadden (2015) for a review. In the biostatistics and epidemiology literature (e.g. Breslow, 1996), logistic regression has been the standard workhorse model in analyzing case-control studies, with more emphasis on sampling designs.

To motivate the setup of this paper, we start with a simple example. Table 1 summarizes data from the American Community Survey (ACS) 2018, cross-tabulating the likelihood of top income by educational attainment. The sample is restricted to white males residing in California with at least a bachelor's degree.
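A 2×2 cross-tabulation like Table 1 supports several comparisons. The following minimal Python sketch, using purely hypothetical counts (not the actual ACS figures), computes the prospective risk measures and their retrospective counterparts, and illustrates the well-known fact that the odds ratio is the same in both directions:

```python
# Illustrative only: the counts below are hypothetical placeholders,
# not the actual entries of Table 1.
def risk_measures(n11, n01, n10, n00):
    """Prospective and retrospective comparisons from a 2x2 table.

    n11 = #(Y=1, T=1), n01 = #(Y=0, T=1),
    n10 = #(Y=1, T=0), n00 = #(Y=0, T=0).
    """
    p1 = n11 / (n11 + n01)          # P(Y=1 | T=1)
    p0 = n10 / (n10 + n00)          # P(Y=1 | T=0)
    rd = p1 - p0                    # attributable risk (risk difference)
    rr = p1 / p0                    # relative risk
    odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
    # Retrospective proportions, as one would see in a case-control sample:
    q1 = n11 / (n11 + n10)          # P(T=1 | Y=1)
    q0 = n01 / (n01 + n00)          # P(T=1 | Y=0)
    retro_or = (q1 / (1 - q1)) / (q0 / (1 - q0))
    return {"RD": rd, "RR": rr, "OR": odds_ratio, "retro_OR": retro_or}

m = risk_measures(n11=120, n01=2880, n10=80, n00=4920)
# The prospective and retrospective odds ratios coincide (invariance of the
# odds ratio), while the relative risk is not recoverable retrospectively.
assert abs(m["OR"] - m["retro_OR"]) < 1e-12
```

Both odds ratios reduce to the cross-product ratio n11·n00 / (n01·n10), which is why retrospective data suffice to recover the odds ratio but not the relative risk.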
The binary outcome (Y) is defined to be one if a respondent's annual total pre-tax wage and salary income is top-coded. The binary treatment (T) is defined to be one if a respondent has a master's degree, a professional degree, or a doctoral degree. The extract is taken from IPUMS USA (Ruggles, Flood, Goeken, Grover, Meyer, Pacas, and Sobek, 2019). The ACS is an ongoing annual survey by the US Census Bureau that provides key information about the US population; the IPUMS database contains samples from the 2000-2018 ACS. The ACS sample is not a case-control sample, but we will use it to illustrate our proposed methods. In ACS 2018, the threshold income for top-coding differs across states; in our sample extract, the top-coded income bracket has median income $565,000 and the next highest income that is not top-coded is $327,000.

TABLE 1. Top Income and Education (Y cross-tabulated against T, i.e. going beyond a bachelor's degree, with totals)

In the prospective direction, the natural comparisons from Table 1 are P(Y = 1 | T = 1) and P(Y = 1 | T = 0), their difference P(Y = 1 | T = 1) − P(Y = 1 | T = 0), their ratio P(Y = 1 | T = 1)/P(Y = 1 | T = 0), and the odds ratio [P(Y = 1 | T = 1)/P(Y = 0 | T = 1)] / [P(Y = 1 | T = 0)/P(Y = 0 | T = 0)]. In the retrospective direction, the proportions of going beyond a bachelor's degree are P(T = 1 | Y = 1) and P(T = 1 | Y = 0), and the retrospective odds ratio [P(T = 1 | Y = 1)/P(T = 0 | Y = 1)] / [P(T = 1 | Y = 0)/P(T = 0 | Y = 0)] coincides with the prospective one.

Throughout, we distinguish three sets of random vectors: (Y∗(0), Y∗(1), T∗, X∗), (Y∗, T∗, X∗), and (Y, T, X), where Y∗(t) is the potential binary outcome under treatment t ∈ {
0, 1}, Y∗ = Y∗(1)T∗ + Y∗(0)(1 − T∗), T∗ and X∗ are the outcome, treatment and covariates that would have been observed under random sampling, and Y, T and X are the variables that are actually observed in the outcome-based sample. As to the main causal parameter of interest, we focus on

θRR(x) := P{Y∗(1) = 1 | X∗ = x} / P{Y∗(0) = 1 | X∗ = x},    (3)

which is causal relative risk conditional on X∗ = x. To identify θRR(x), we face two separate challenges: one results from the usual missing data problem of potential outcomes and the other stems from the fact that the researcher does not have access to (Y∗, T∗, X∗) but only to (Y, T, X).

Our contributions are two-fold. First, we articulate how the causal parameter is related with functionals of the distribution of (Y, T, X) under two different versions of outcome-based sampling schemes: i.e. the traditional case-control sampling and the case-population sampling considered in Lancaster and Imbens (1996). It turns out that the odds ratio between Y and T conditional on X = x is generally a sharp upper bound for θRR(x) under the MTR and MTS assumptions. This interpretation does not require strong ignorability, nor does it require the usual rare-disease assumption. Therefore, our identification analysis shows that we can provide the conventional estimand, i.e. the odds ratio in the sample, with causal interpretation from the perspective of partial identification (see, e.g., Manski, 2003, 2009; Tamer, 2010).

Second, we propose two novel estimation algorithms for the aggregated (log) odds ratio. For this purpose we obtain an explicit form of the efficient influence function, after which we construct suitable sample analogs. The first estimator we build is a plug-in sieve estimator (e.g. X. Chen, 2007) and the second one is a double/debiased machine learning (DML) estimator (e.g. Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins, 2018).
The former is simpler but the latter accommodates LASSO-type or more general nonparametric estimators. Both estimators achieve the semiparametric efficiency bound (e.g. Newey, 1990, 1994) and can be easily implemented using standard statistical packages. Using our estimators and the ACS data, we illustrate how to draw causal inferences based on our partial identification results as well as how to carry out a sensitivity analysis.

To the best of our knowledge, there are no directly relevant papers in the literature. In fact, the recent econometrics literature on outcome-based sampling is rather sparse; however, it is an important reality that random sampling can be expensive when the outcome of interest is rare. The goal of this paper is to revamp outcome-based sampling from the perspective of modern econometrics. Our paper is the first paper that nonparametrically connects the three dots: outcome-based sampling, causal inference and partial identification. We provide a further discussion on how our paper is related to the existing literature in section 7.

The remainder of the paper is organized as follows. Section 2 presents the framework and identification results. We describe two sampling schemes, i.e. case-control sampling and case-population sampling, after which we discuss causal parameters and their identification. In section 3, we derive the semiparametric efficiency bound for our estimand, and in section 4, we propose two estimation algorithms. We analytically establish the local robustness property of one of our estimating equations, yielding an estimator that is well suited for machine learning, in section 4.3. Section 5 summarizes the main takeaways and discusses several inferential issues. Section 6 presents an empirical example using the ACS data. We conclude the paper by discussing the related literature and topics for future research in section 7.
Appendices, along with an online supplement, include additional materials and all the proofs.

2. FRAMEWORK
In this section, we describe the scheme of outcome-based sampling, define causal parameters and discuss their identification under two sets of assumptions: one with strong ignorability and the other without it.

2.1. Bernoulli Sampling.
Let (Y∗, T∗, X∗) be the random variables that would have been observed if a researcher had collected data via random sampling from the population of interest, where Y∗ is a binary outcome, T∗ is a binary treatment, and X∗ is a vector of covariates. We assume that a random sample of (Y∗, T∗, X∗) is unavailable and hence (Y∗, T∗, X∗) is not observed. Instead, we assume that we have a random sample of (Y, T, X), where (Y, T, X) represents the random variables that are actually observed in the sample that is drawn by the researcher's sampling design, i.e. Bernoulli sampling (e.g. Breslow, Robins, and Wellner, 2000), which we further describe below and discuss in section 7.

In Bernoulli sampling, the researcher draws a Bernoulli variable Y first from a pre-specified marginal distribution, after which she randomly draws (T, X) from P_y if and only if Y = y. Since h = P(Y = 1) is part of the sampling scheme, we assume that it is known. If P_y is identical to the conditional distribution of (T∗, X∗) given Y∗ = y, then this is known as case-control sampling. The Bernoulli scheme allows for other possibilities. Below are the two leading cases that we focus on throughout the paper. In order to simplify our discussion, we first make a common-support assumption. Let X∗ and X_y be the support of X∗ and that of X given Y = y, respectively.

Assumption A (Common Support). There is a common support X satisfying X = X∗ = X_0 = X_1.

This assumption can be restrictive. For example, if Y∗ represents breast cancer and we have two covariates to consider, i.e. gender and age, then the joint support of gender and age depends highly on whether we condition on Y∗ = 1 or not. In this case, we may restrict attention to the subpopulation of women, so that X∗ and X represent the age; X∗ is the age that would have been drawn from the population of women and X is the age that is drawn from the subpopulation of women with or without breast cancer, depending on the corresponding value of Y.
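As a quick numerical sanity check (not part of the paper's formal development, and using hypothetical joint probabilities), one can verify that case-control sampling of this kind preserves the conditional odds ratio between Y and T, whatever value of h the researcher chooses; this is the fact formalized in Lemma 1 below:

```python
# Hypothetical joint distribution of (Y*, T*) at a fixed covariate value x:
# p[y][t] = P(Y* = y, T* = t | X* = x).
p = {1: {1: 0.02, 0: 0.01}, 0: {1: 0.37, 0: 0.60}}

def pop_odds_ratio(p):
    # OR*(x) reduces to the cross-product ratio of the joint probabilities.
    return (p[1][1] * p[0][0]) / (p[0][1] * p[1][0])

def case_control_odds_ratio(p, h):
    # Design 1: draw Y ~ Bernoulli(h), then draw T given Y = y from the
    # population conditional P(T* = t | Y* = y); h is the researcher's choice.
    q = {y: p[y][1] / (p[y][1] + p[y][0]) for y in (0, 1)}  # P(T=1 | Y=y)
    # The odds ratio between Y and T in the sample; note that h cancels out.
    return (q[1] / (1 - q[1])) / (q[0] / (1 - q[0]))

for h in (0.1, 0.5, 0.9):
    assert abs(case_control_odds_ratio(p, h) - pop_odds_ratio(p)) < 1e-12
```

The design-chosen marginal h distorts the marginal distribution of (T, X) but drops out of the retrospective odds, which is why the sample odds ratio remains informative about the population.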
Throughout the paper, we leave implicit the possibility of stratification using extra covariates (different from those included in X∗). Let P_y(t, x) = f_{X|Y}(x|y) P(T = t | X = x, Y = y), where f_{X|Y} is the probability density (or mass) function of X given Y = y for y = 0, 1.
Design 1 (Case-Control Sampling). Suppose that for all (t, x) ∈ {0, 1} × X and for y ∈ {0, 1},

f_{X|Y}(x|y) = f_{X∗|Y∗}(x|y) and P(T = t | X = x, Y = y) = P(T∗ = t | X∗ = x, Y∗ = y).

In other words, P_1 is the distribution of (T∗, X∗) given Y∗ = 1, while P_0 is that of (T∗, X∗) given Y∗ = 0.

Design 2 (Case-Population Sampling). Suppose that for all (t, x) ∈ {0, 1} × X,

f_{X|Y}(x|0) = f_{X∗}(x) and P(T = t | X = x, Y = 0) = P(T∗ = t | X∗ = x),
f_{X|Y}(x|1) = f_{X∗|Y∗}(x|1) and P(T = t | X = x, Y = 1) = P(T∗ = t | X∗ = x, Y∗ = 1).

In other words, P_0 represents the distribution of (T∗, X∗) of the entire population, while P_1 is that of (T∗, X∗) conditional on Y∗ = 1.

Design 1 is arguably the most popular form of case-control studies and design 2, which we call case-population sampling, is considered in Lancaster and Imbens (1996). The notation here distinguishes the original variables (Y∗, T∗, X∗) of interest from the sampled ones (Y, T, X); see, e.g., K. Chen (2001) and Xie, Lin, Yan, and Tang (2019) for the same notational device. The advantage of this approach is that it becomes straightforward to apply asymptotic theory under random sampling to observations generated from (Y, T, X) because we can regard them as a collection of independent and identically distributed (i.i.d.) copies of (Y, T, X). The marginal distribution of (T, X) is identified from the data, while that of (T∗, X∗) is not. For instance, in design 1, we have f_X(x) = f_{X∗|Y∗}(x|1)h + f_{X∗|Y∗}(x|0)(1 − h) ≠ f_{X∗}(x) if h ≠ P(Y∗ = 1); h is part of the sampling scheme, while P(Y∗ = 1) is the true probability of the case in the population. Further, f_{Y,X}(1, x) = f_{X∗|Y∗}(x|1)h = f_{X∗}(x) P(Y∗ = 1 | X∗ = x) h / P(Y∗ = 1), which yields the likelihood function studied in e.g. Manski and Lerman (1977). We emphasize that P(Y = 1 | X = x) does not have an economic (or structural) interpretation like P(Y∗ = 1 | X∗ = x), where the latter is often modeled by the rational behavior of an economic agent.

2.2. Causal Functional Parameters.
To define causal functional parameters pertinent to outcome-based samples, let Y∗(t) ∈ {0, 1} be the binary potential outcome of interest for treatment t ∈ {0, 1}. The observed outcome Y∗ can be written as Y∗ = T∗Y∗(1) + (1 − T∗)Y∗(0). The central counterfactual probabilities are P{Y∗(1) = 1 | X∗ = x} and P{Y∗(0) = 1 | X∗ = x}. Conditional on X∗ = x, one may consider the difference or ratio between the two counterfactual probabilities, which are called (conditional) attributable and relative risk in the literature (see, e.g. Manski, 2009). In this paper, we focus on the latter, namely causal relative risk θRR(x) defined in equation (3). In view of the convenience of the odds ratio, as we demonstrated in the Introduction, we also consider a causal odds ratio that is defined by

θOR(x) := [P{Y∗(1) = 1 | X∗ = x} / P{Y∗(1) = 0 | X∗ = x}] / [P{Y∗(0) = 1 | X∗ = x} / P{Y∗(0) = 0 | X∗ = x}].

2.3. Identification under Strong Ignorability.
We begin this section by articulating how the odds ratio in the sample is related with some population quantities under each sampling design. Let OR(x) be the odds ratio given X = x that is observed in the sample: i.e.

OR(x) := [P(Y = 1 | T = 1, X = x) / P(Y = 0 | T = 1, X = x)] / [P(Y = 1 | T = 0, X = x) / P(Y = 0 | T = 0, X = x)],    (4)

where we assume that 0 < OR(x) < ∞ for all x ∈ X throughout the paper. Similarly, we define OR∗(x) and RR∗(x) by the conditional odds ratio and relative risk, respectively, in the population: i.e.

OR∗(x) := [P(Y∗ = 1 | T∗ = 1, X∗ = x) / P(Y∗ = 0 | T∗ = 1, X∗ = x)] / [P(Y∗ = 1 | T∗ = 0, X∗ = x) / P(Y∗ = 0 | T∗ = 0, X∗ = x)],    (5)

RR∗(x) := P(Y∗ = 1 | T∗ = 1, X∗ = x) / P(Y∗ = 1 | T∗ = 0, X∗ = x).    (6)

Since we do not have a random sample of (Y∗, T∗, X∗), identification of OR∗(x) or RR∗(x) is a priori unclear. However, the Bayes rule shows the following result.

Lemma 1.
Under design 1, we have OR(x) = OR∗(x) for all x ∈ X. Similarly, under design 2, we have OR(x) = RR∗(x) for all x ∈ X.

Lemma 1 shows how to relate the odds ratio in the case-control sample (respectively, case-population sample) with the odds ratio (respectively, relative risk) of the population. It requires additional assumptions to connect the odds ratio or relative risk of the population with the causal parameters defined in terms of the potential outcomes. The simplest approach is to use the idea of strong ignorability (see, e.g., Imbens and Rubin, 2015). In our context, strong ignorability consists of the following two assumptions.
Assumption B (Overlap). For all (t, x) ∈ {0, 1} × X, we have 0 < P{Y∗(t) = 1 | X∗ = x} < 1 and 0 < P(T∗ = 1 | X∗ = x) < 1.

Assumption C (Unconfoundedness). For all t ∈ {0, 1} and x ∈ X, P{Y∗(t) = 1 | T∗ = 1, X∗ = x} = P{Y∗(t) = 1 | T∗ = 0, X∗ = x}.

The first requirement of assumption B implies that the potential outcome Y∗(t) cannot be 0 or 1 with probability 1 for some value of x. The second condition of assumption B is the standard overlap condition in the literature. Assumption C says that the potential outcomes Y∗(1) and Y∗(0) are conditionally independent of the treatment T∗ given X∗ = x.

We now provide the following identification result in the spirit of Holland and Rubin (1988).

Theorem 1 (Holland and Rubin (1988)). Suppose that assumptions B and C are satisfied. Then, under design 1, we have θOR(x) = OR∗(x) = OR(x) for all x ∈ X; under design 2, we have θRR(x) = RR∗(x) = OR(x) for all x ∈ X.

Theorem 1 slightly extends the result of Holland and Rubin (1988); they did not consider design 2, but their arguments can be used in a straightforward manner. In substance, the observed odds ratio OR(x) identifies the causal odds ratio θOR(x) under design 1 and the causal relative risk θRR(x) under design 2. One practical message of theorem 1 is that it might be more beneficial to sample a control group from the unconditional population if a researcher cares mainly about θRR(x). In light of this, we may regard designs 1 and 2 as studies suitable for the causal odds ratio and the causal relative risk, respectively.

2.4. Causal Interpretation without Strong Ignorability.
Strong ignorability is convenient, but it may be too strong for observational data; T∗ is often a deliberate decision of an individual agent. In this subsection, we establish an alternative causal interpretation of OR(x) using the framework of partial identification. In particular, we build on the assumptions of monotone treatment response (Manski, 1997) and monotone treatment selection (Manski and Pepper, 2000).

Assumption D (Monotone Treatment Response). We have Y∗(1) ≥ Y∗(0) almost surely.

Assumption E (Monotone Treatment Selection). For all t ∈ {0, 1} and x ∈ X, P{Y∗(t) = 1 | T∗ = 1, X∗ = x} ≥ P{Y∗(t) = 1 | T∗ = 0, X∗ = x}.

Assumption D rules out the possibility that the treatment strictly decreases the outcome, i.e. Y∗(1) < Y∗(0); it implies that both θRR(x) and θOR(x) are at least one.

Theorem 2.
Suppose that assumption B holds. The following inequalities are sharp.
(i) If assumption D is satisfied, then 1 ≤ θRR(x) ≤ θOR(x) for all x ∈ X under each of the two sampling designs.
(ii) If assumption E is satisfied, then θOR(x) ≤ OR(x) under design 1 and θRR(x) ≤ OR(x) under design 2.
(iii) The following two statements are equivalent:
(a) θOR(x) = OR(x) in design 1 and θRR(x) = OR(x) in design 2;
(b) Assumption E is satisfied with equality, i.e. assumption C holds.

Parts (i) and (ii) of theorem 2 imply that if assumptions D and E are satisfied, then OR(x) can be understood as a sharp upper bound of causal relative risk under both designs 1 and 2. More specifically, for all x ∈ X, we have

1 ≤ θRR(x) ≤ θOR(x) ≤ OR(x) under design 1;    (7)
1 ≤ θRR(x) ≤ OR(x) under design 2.    (8)

Theorems 1 and 2 articulate how to give causal interpretation to OR(x) in general. Assumption E allows for assumption C as a special case. Indeed, theorem 2 shows that point identification holds if and only if the unconfoundedness condition is satisfied.

Assumptions D and E are not individually testable, but they jointly have a testable implication, i.e. OR(x) ≥ 1 for all x ∈ X by theorem 2, for which a nonparametric test can be constructed via the general framework of testing functional inequalities (see, e.g., Chernozhukov, Lee, and Rosen, 2013; Lee, Song, and Whang, 2018).

In case-control studies, it is commonly assumed that there is some ε̃ > 0 such that 0 < P(Y∗ = 1 | X∗ = x) ≤ ε̃ for all x ∈ X. When we consider the case where ε̃ → 0, we refer to this condition as the rare-disease assumption (e.g. Breslow, 1996; Manski, 2009). The rare-disease assumption leads to |θRR(x) − θOR(x)| → 0 as ε̃ → 0. Hence, if both strong ignorability and rare disease are assumed, then θRR(x) is well approximated by OR(x) under design 1. However, our identification analysis shows that a researcher does not have to resort to strong ignorability, nor to the rare-disease assumption, in order to provide OR(x) with causal interpretation. If both the MTR and MTS conditions are plausible, then a researcher can interpret OR(x) as the (sharp) upper bound of the causal relative risk θRR(x) under both designs 1 and 2.

2.5. Heterogeneity and Aggregation.
The functional parameter OR(x) is difficult to estimate nonparametrically with good precision when the dimension of X is high. To avoid the curse of dimensionality, it is popular in case-control studies to adopt logistic regression at the true population level: that is,

P(Y∗ = 1 | T∗ = t, X∗ = x) = exp(α0 + tα1 + x′α2 + tx′α3) / [1 + exp(α0 + tα1 + x′α2 + tx′α3)],    (9)

which implies that α1 + x′α3 = log{OR∗(x)} = log{OR(x)} for all x ∈ X; therefore, under the rare-disease assumption, log{RR∗(x)} ≈ α1 + x′α3 as well. The parametric assumption is popular, but it is restrictive. For instance, the formulation in equation (9) limits the possible forms of heterogeneous causal effects; without the parametric assumption, log{OR(x)} is generally an unknown function of x that can be highly nonlinear. In this paper we take a nonparametric approach, where we aim at estimating OR(x) without using any parametric assumption, after which we aggregate it by integrating over x.

Since F_{X|Y}(·|1) and F_{X|Y}(·|0) are identified in our study designs, we consider

β(y) := ∫_X log{OR(x)} dF_{X|Y}(x|y) for y = 0, 1,    (10)

which is the weighted average of the log odds ratio using F_{X|Y}(x|y) as weights: the argument y indicates which distribution of X, and hence which distribution of X∗, is used to aggregate the log odds ratio. Specifically, under design 2, β(0) is equal to E[log{OR(X∗)}]. Under design 1, if the population fraction of the case (i.e. P(Y∗ = 1)) is known to the researcher, which has been frequently assumed in the econometrics literature (since Manski and Lerman, 1977), then E[log{OR(X∗)}] can be obtained by taking the weighted average of β(y), i.e. E[log{OR(X∗)}] = β(1)P(Y∗ = 1) + β(0)P(Y∗ = 0).

If P(Y∗ = 1) is unknown but only its upper bound is known, then we can undertake a bound analysis on E[log{OR(X∗)}] by using β(0) and β(1); this problem will be further discussed in section 5. Therefore, in the next two sections, we will treat β(y) as the main estimand of interest; our discussion on semiparametric efficiency and machine-learning approaches will focus on β(y). It depends on the researcher's view of assumptions C to E whether β(y) aggregates the logarithm of the causal parameter itself or its sharp identifiable upper bound. Using our proposed estimators, we discuss how to carry out causal inferences in section 5.

In equation (10), the logarithm is taken before aggregating the odds ratio; alternatively, one may take an expectation of the odds ratio directly. This case can be handled similarly. See Appendix A for details.

3. EFFICIENT INFLUENCE FUNCTION FOR β(y)

We consider estimating the parameter β(y), for which we do not impose any parametric restrictions anywhere.
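As a concrete illustration of the aggregation in equation (10): when X takes finitely many values, the integral is a finite weighted sum. The sketch below uses purely hypothetical conditional treatment probabilities and weights (none of these numbers come from the paper):

```python
import math

# Hypothetical primitives for a two-point support X in {"a", "b"}:
# pT[(x, y)] = P(T=1 | X=x, Y=y) and fX[y][x] = f_{X|Y}(x|y).
pT = {("a", 1): 0.70, ("a", 0): 0.40, ("b", 1): 0.55, ("b", 0): 0.30}
fX = {1: {"a": 0.6, "b": 0.4}, 0: {"a": 0.3, "b": 0.7}}

def log_odds_ratio(x):
    # log OR(x) written retrospectively: the odds of T = 1 given Y = 1
    # over the odds of T = 1 given Y = 0, both at X = x.
    odds = lambda p: p / (1 - p)
    return math.log(odds(pT[(x, 1)]) / odds(pT[(x, 0)]))

def beta(y):
    # beta(y) = sum_x log OR(x) * f_{X|Y}(x|y), the weighted average in (10).
    return sum(log_odds_ratio(x) * w for x, w in fX[y].items())

b1, b0 = beta(1), beta(0)
```

With these numbers, b1 and b0 differ only because the case and control samples weight the two support points differently, which is exactly the role of the argument y in equation (10).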
As a first step, we derive the semiparametric efficiency bound under both designs 1 and 2; since the mathematical structure of the likelihood function is the same, we do not need to distinguish design 1 from design 2. For this purpose, we will use the generic notation of the observed variables (Y, T, X) instead of the original random variables of interest, i.e. (Y∗, T∗, X∗). We start with the following assumptions for regularity.

Assumption F (Bounded Probabilities). There is a constant ε > 0 such that for each y = 0, 1, ε ≤ P(T = 1 | X, Y = y) ≤ 1 − ε and ε ≤ P(Y = 1 | X) ≤ 1 − ε almost surely.

Assumption G (Regular Distribution). The distribution function F_{X|Y} has a probability density f_{X|Y} that satisfies 0 < f_{X|Y}(x|y) < ∞ for all x ∈ X and y = 0, 1.

Assumptions F and G are, in principle, testable since they are about the random variables observed in the sample. Assumption F is slightly stronger than what we need to derive the efficient influence function, but it will be needed to establish statistical properties of our proposed estimators later. Assumption G focuses on the case where X is continuous, but this is only for the sake of notational simplicity; if X is discrete or mixed, then f_{X|Y} should be understood as a general Radon-Nikodym density with respect to some dominating measure.

Under the Bernoulli sampling scheme, the likelihood of a single observation (Y, T, X) is given by

L(Y, T, X) = {(1 − h)P_0(T, X)}^{1−Y} {hP_1(T, X)}^Y,    (11)

where for y = 0, 1,

P_y(T, X) = f_{X|Y}(X|y) P(T = 1 | X, Y = y)^T {1 − P(T = 1 | X, Y = y)}^{1−T}.    (12)

The likelihood in equation (11) is a simple mixture of two binary likelihoods. The tangent space can be derived by using regular parametric submodels P_y(T, X; γ) such that P_y(T, X; γ_0) = P_y(T, X) for y = 0, 1. The tangent space is described in the following lemma.
Lemma 2.
Consider the Bernoulli sampling scheme of design 1 or design 2. The tangent space is given by the set of functions of the following form:

s(Y, T, X) = (1 − Y)[a_0(X) + {T − P(T = 1 | X, Y = 0)} b_0(X)] + Y[a_1(X) + {T − P(T = 1 | X, Y = 1)} b_1(X)],

where the functions a_y and b_y are such that E{a_y(X) | Y = y} = 0 and E{s^2(Y, T, X)} < ∞ for each y = 0, 1.

The following theorem shows that β(y) is pathwise differentiable along the regular parametric submodels at γ_0 in the sense of Newey (1990, 1994). Before we present the theorem, define

w(X) := f_{X|Y}(X|1) / f_{X|Y}(X|0).    (13)

Further, for y = 0, 1, define

∆_y(Y, T, X) := Y^y (1 − Y)^{1−y} {T − P(T = 1 | X, Y = y)} / [P(T = 1 | X, Y = y) {1 − P(T = 1 | X, Y = y)}].

We establish the following result using the approach taken by Hahn (1998).

Theorem 3.
Suppose that assumptions A, F and G hold and that we have a sample obtained by Bernoulli sampling. Then, for y = 0, 1, β(y) is pathwise differentiable and its pathwise derivative is given by

F_y(Y, T, X) = [Y^y (1 − Y)^{1−y} / {h^y (1 − h)^{1−y}}] {log OR(X) − β(y)} − [∆_0(Y, T, X) / (1 − h)] w(X)^y + w(X)^{y−1} [∆_1(Y, T, X) / h].

Further, F_y is an element of the tangent space, and therefore, the semiparametric efficiency bound for β(y) is given by E{F_y^2(Y, T, X)}.

Theorem 3 shows the efficiency bound for β(y), and it also implies that the asymptotic variance of a √n-consistent and asymptotically linear estimator of β(y) should be E{F_y^2(Y, T, X)} by Theorem 2.1 of Newey (1994). Since β(y) is the expectation of log OR(X) with respect to the distribution of X given Y = y, it satisfies

E{log OR(X) − β(y) | Y = y} = E[ Y^y (1 − Y)^{1−y} / {h^y (1 − h)^{1−y}} {log OR(X) − β(y)} ] = 0,    (14)

which is the expected value of the first term that appears in F_y(Y, T, X); the other terms in F_y(Y, T, X) are adjustments that address the effect of first-step nonparametric estimation of log OR(X) via P(T = 1 | X = x, Y = y).

4. EFFICIENT ESTIMATION OF β(y)

Efficient estimators of β(y) for y = 0, 1 can be constructed in multiple ways. The most straightforward approach is just using equation (14), i.e. we base an estimator on

β(y) = E[ Y^y (1 − Y)^{1−y} / {h^y (1 − h)^{1−y}} log OR(X) ],    (15)

where we plug in a nonparametric estimator of OR(x). Alternatively, we may include the adjustment terms upfront to use E{F_y(Y, T, X)} = 0 by constructing a sample analog estimator from the following alternative expression: β(y) is equal to

E[ Y^y (1 − Y)^{1−y} / {h^y (1 − h)^{1−y}} log OR(X) − {∆_0(Y, T, X) / (1 − h)} w(X)^y + w(X)^{y−1} {∆_1(Y, T, X) / h} ].    (16)

This approach requires additional (nonparametric) estimation of w(X), but since E{∆_y(Y, T, X) | X} = 0 for y = 0, 1, having an incorrect function for w(X) does not matter for the consistency of the estimator based on equation (16).

Suppose that we have the sample {(Y_i, T_i, X_i) : i = 1, . . . , n}, where the (Y_i, T_i, X_i)'s are i.i.d. copies of (Y, T, X). Using this sample, we propose sieve logistic estimators based on equation (15) in section 4.1. In section 4.2, we show that the moment condition in equation (16) satisfies Neyman orthogonality in the sense of Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018, DML hereafter). This leads to double/debiased machine learning (DML) estimators, which we present in section 4.3. Throughout the discussion we assume that h is known since it is part of the sampling scheme. However, if it is unknown, then using ĥ = ∑_{i=1}^n Y_i / n instead of h does not change the first-order asymptotic behaviors of the estimators based on (15) and (16), as long as P_0 and P_1 do not depend on h.

4.1. Retrospective Sieve Logistic Estimation.
Recall that the observed odds ratio in equation (4) can be expressed as

OR(x) = [P(T = 1 | X = x, Y = 1) / P(T = 0 | X = x, Y = 1)] / [P(T = 1 | X = x, Y = 0) / P(T = 0 | X = x, Y = 0)].

We model the treatment probabilities by infinite-dimensional logistic regression: i.e. for y = 0, 1,

P(T = 1 | X = x, Y = y) = exp(∑_{j=1}^∞ φ_j(x) µ_{j,y}) / [1 + exp(∑_{j=1}^∞ φ_j(x) µ_{j,y})],

where {φ_j : j = 1, 2, . . .} is a series of basis functions and {µ_{j,y} : j = 1, 2, . . .} is a series of unknown coefficients for each y = 0, 1. (Footnote: Misspecification of w(X) may affect the asymptotic distribution of our proposed estimator. We limit our attention to nonparametric estimation of w(X) to minimize the possibility of misspecification.) It then follows that for each y = 0, 1,

log[ P(T = 1 | X = x, Y = y) / P(T = 0 | X = x, Y = y) ] = ∑_{j=1}^∞ φ_j(x) µ_{j,y}.    (17)

Therefore, by using equation (15) and assumption F, we obtain

β(y) = ∑_{j=1}^∞ ∫_X φ_j(x) dF_{X|Y}(x|y) (µ_{j,1} − µ_{j,0}) ≈ ∑_{j=1}^{J_n} ∫_X φ_j(x) dF_{X|Y}(x|y) (µ_{j,1} − µ_{j,0}),    (18)

provided that J_n diverges to infinity as n → ∞. Equation (18) suggests the following two-step sieve estimation strategy:

(i) In the first step, for each y = 0, 1, estimate {µ_{j,y} : j = 1, . . . , J_n} by logistic regression of T_i on {φ_j(X_i) : j = 1, . . . , J_n} with the Y_i = y sample.
(ii) In the second step, construct a sample analog of equation (18): i.e.

β̂(y) := ∑_{j=1}^{J_n} ∫_X φ_j(x) dF̂_{X|Y}(x|y) (µ̂_{j,1} − µ̂_{j,0}),    (19)

where the µ̂_{j,y}'s are sieve logit estimates from the first step and

∫_X φ_j(x) dF̂_{X|Y}(x|y) = ∑_{i=1}^n Y_i^y (1 − Y_i)^{1−y} φ_j(X_i) / ∑_{i=1}^n Y_i^y (1 − Y_i)^{1−y}.

Since the retrospective probability model is used in equation (17), we call the estimator defined in (19) the retrospective sieve logistic estimator of β(y), y = 0, 1. It can be computed using standard software for logistic regression, as described in algorithm 1.

Algorithm 1: Retrospective Sieve Logistic Estimator of β(1)
Input: {(Y_i, T_i, X_i) : i = 1, . . . , n}, tuning parameter J_n and basis functions {φ_j(·) : j = 1, . . . , J_n}
Output: estimate of β(1) and its standard error
Construct {φ_1(X_i), . . . , φ_{J_n}(X_i) : i = 1, . . . , n}, where an intercept term is excluded from the φ_j's;
For each j = 1, . . . , J_n, compute the empirical mean of φ_j(X_i) using only the case sample (Y_i = 1) and construct the demeaned version, say ϕ_j(X_i), of φ_j(X_i);
Run a logistic regression of T_i on the following regressors: an intercept term, Y_i, ϕ_j(X_i), j = 1, . . . , J_n, and interactions between Y_i and ϕ_j(X_i), j = 1, . . . , J_n, using standard software;
Read off the estimated coefficient for Y_i and its standard error.

The procedure described in algorithm 1 achieves the first step by running a combined logistic regression of T_i on Y_i, the sieve basis terms, and the interactions between Y_i and the sieve basis terms. This is first-order equivalent since Y_i is binary and full interaction terms are included. For the second step, instead of evaluating the right-hand side of equation (19) after the logistic regression, the φ_j(X_i)'s are demeaned first using only the case sample, so that the resulting coefficient for Y_i is first-order equivalent to the estimator defined in equation (19). The advantage of the formulation in algorithm 1 is that the standard error of β̂(1) can be read off directly from standard software without any further programming. It is straightforward to modify algorithm 1 for estimating β(0): one has to compute the empirical mean of φ_j(X_i) using only the control sample (Y_i = 0) for the demeaning step.

Sieve logistic estimators have been popular in the literature, including the propensity score estimator used in Hirano, Imbens, and Ridder (2003). To the best of our knowledge, it is novel to adopt retrospective sieve logistic estimators in the context of case-control studies. It is not difficult to work out formal asymptotic properties of our proposed sieve estimator in view of the well-established literature on two-step sieve estimation (see, e.g., Ai and Chen, 2003, 2012; Ackerberg, Chen, Hahn, and Liao, 2014, among many others). Furthermore, conventional normal inference based on the standard error obtained in algorithm 1 is valid for semiparametric inference (e.g. Ackerberg, Chen, and Hahn, 2012). For brevity of the paper, we omit details.

4.2.
4.2. Neyman Orthogonality.
Both of the estimating equations in (15) and (16) depend on nonparametric objects that need to be estimated in advance. Equation (15) is simpler, but equation (16) has the advantage that it is robust to local perturbations of the unknown functions that are estimated in the first step. It requires extra notation to discuss this result formally.

Let W be the set of functions on X that are bounded and bounded away from zero. Similarly, let G be the set of functions g : X → [ε, 1−ε] for some ε > 0. For η = (η_1ᵀ, η_2)ᵀ with η_1 = (a, b)ᵀ ∈ G² and η_2 ∈ W, define

OR̃(η_1)[X] = [b(X){1 − a(X)}] / [{1 − b(X)} a(X)] and w̃(η)[X] = η_2(X).

So, OR̃(·)[X] and w̃(·)[X] denote (candidate) mappings from G² and W, respectively, such that they are equal to OR(X) and w(X) when they are evaluated at η_10 ∈ G² and η_20 ∈ W, respectively, where

η_10(x) = (P(T=1 | X=x, Y=0), P(T=1 | X=x, Y=1))ᵀ and η_20(x) = w(x).

Now, we define the mapping F̃_y(·)[Y, T, X] by

F̃_y(η)[Y, T, X] := [Y^y (1−Y)^{1−y} / {h^y (1−h)^{1−y}}] {log OR̃(η_1)[X] − β(y)}
+ (Y/h) {w̃(η)[X]}^{1−y} {T − b(X)} / [b(X){1 − b(X)}]
− {(1−Y)/(1−h)} {w̃(η)[X]}^{−y} {T − a(X)} / [a(X){1 − a(X)}],

where η = (a, b, η_2)ᵀ ∈ G² × W. So, we have F̃_y(η_0)[Y, T, X] = F_y(Y, T, X), where η_0 = (η_10ᵀ, η_20)ᵀ. We are now ready to state the main theorem of this subsection.

Theorem 4. Suppose that assumptions F and G hold. Then, under both designs 1 and 2, and for each y = 0, 1, the Gateaux derivative of F̃_y(·)[Y, T, X] at η_0 has mean zero: i.e.

E[ ∂_γ F̃_y{η_0 + γ(η − η_0)}[Y, T, X] |_{γ=0} ] = 0 for all η ∈ G² × W.

In other words, F_y(Y, T, X) provides a Neyman orthogonal moment function. The fact that small perturbations around η_0 do not have first-order asymptotic consequences is known as the local robustness property. In this case, the first-step nonparametric estimation does not have any first-order effect, i.e. the limiting distribution would be the same as if η_0 were known, because all the adjustment terms that are needed to address the effect of the first-step estimation are already reflected in F_y.

4.3. Retrospective Double/Debiased Machine Learning Estimation.
When the dimension of X is higher than the sample size, it is infeasible to implement the sieve estimator proposed in section 4.1. In this section we consider using machine-learning-based estimators in the first step, which allows X to be of high dimension. In view of the Neyman orthogonality established in section 4.2, we build a new estimator based on equation (16), which requires estimation of w(x) defined in equation (13). In high-dimensional settings, it would be impractical to estimate f_{X|Y}(x|1) and f_{X|Y}(x|0) separately and to take the ratio to obtain an estimator of w(x). Instead, we use the Bayes rule to obtain

w(x) = f_{X|Y}(x|0) / f_{X|Y}(x|1) = [P(Y=0 | X=x) / P(Y=1 | X=x)] · [h / (1−h)], (20)

which suggests that we estimate P(Y=1 | X=x), since h = P(Y=1) is either known or trivial to estimate. (Heckman and Todd (2009) use the same relationship in the context of propensity score matching under treatment-based sampling.) The key insight here is that it may be unrealistic to assume the sparsity of f_{X|Y}(x|y) for each y = 0, 1, but w(x) can be estimated by sparsity-based models, since the sparsity of w(x) is equivalent to that of P(Y=0 | X=x)/P(Y=1 | X=x). Therefore, we may rely on machine-learning methods to estimate not only P(T=1 | X=x, Y=y), y = 0, 1, but also P(Y=1 | X=x). For example, we may use ℓ₁-penalized logistic estimation for estimating all relevant probability models to construct an estimator of β(y), y = 0, 1.

Specifically, suppose that K ≥ 2 and that n is divisible by K. Let {I_k : k = 1, . . . , K} denote a K-fold partition of {1, . . . , n} such that |I_k| = n/K for each k. Suppose that one estimates η_0 = (η_10ᵀ, η_20)ᵀ using a machine-learning estimator, say η̂_k, using observations that belong to I_k^c := {1, . . . , n} \ I_k for each k. Then, the retrospective double/debiased machine learning estimator β̂_DML(y) of β(y), y = 0, 1, is defined by

β̂_DML(y) := (1/K) ∑_{k=1}^K (1/|I_k|) ∑_{i ∈ I_k} ψ̂_{i,k}(y), (21)

where ĥ := n^{−1} ∑_{i=1}^n Y_i,

ψ̂_{i,k}(y) := [Y_i^y (1−Y_i)^{1−y} / {ĥ^y (1−ĥ)^{1−y}}] log OR̃(η̂_k)[X_i]
+ (Y_i/ĥ) {w̃(η̂_k)[X_i]}^{1−y} {T_i − p̂_{1,k}(X_i)} / [p̂_{1,k}(X_i){1 − p̂_{1,k}(X_i)}]
− {(1−Y_i)/(1−ĥ)} {w̃(η̂_k)[X_i]}^{−y} {T_i − p̂_{0,k}(X_i)} / [p̂_{0,k}(X_i){1 − p̂_{0,k}(X_i)}], (22)

and

p̂_{1,k}(x) := P̂_{ML,k}(T=1 | X=x, Y=1), p̂_{0,k}(x) := P̂_{ML,k}(T=1 | X=x, Y=0),

OR̃(η̂_k)[x] := [P̂_{ML,k}(T=1 | X=x, Y=1) P̂_{ML,k}(T=0 | X=x, Y=0)] / [P̂_{ML,k}(T=0 | X=x, Y=1) P̂_{ML,k}(T=1 | X=x, Y=0)],

w̃(η̂_k)[x] := [P̂_{ML,k}(Y=0 | X=x) / P̂_{ML,k}(Y=1 | X=x)] · [ĥ / (1−ĥ)].

Here, P̂_{ML,k} denotes a machine-learning estimator of a probability model using observations that belong to I_k^c. We summarize the estimation procedure in algorithm 2.

Algorithm 2:
Retrospective Double/Debiased Machine Learning Estimator of β(y), y = 0, 1

Input: {(Y_i, T_i, X_i) : i = 1, . . . , n}, K, and machine learning methods for estimating probability models
Output: estimate of β(y) and its standard error
Construct a K-fold partition {I_k : k = 1, . . . , K} of {1, . . . , n} of approximately equal size;
For each k, use observations belonging to I_k^c to obtain machine learning estimates η̂_k of P(T=1 | X=x, Y=1), P(T=1 | X=x, Y=0) and P(Y=1 | X=x), respectively;
For each k, use observations belonging to I_k to construct ψ̂_{i,k}(y) in equation (22);
Obtain the estimate of β(y) by equation (21) and its standard error σ̂_DML(y)/√n by

σ̂²_DML(y) := (1/K) ∑_{k=1}^K (1/|I_k|) ∑_{i ∈ I_k} {ψ̂_{i,k}(y) − β̂_DML(y)}². (23)

Let ‖·‖_{P,2} denote the L²(P)-norm, where P is the probability distribution that (Y, T, X) takes: i.e.

‖a‖_{P,2} = max_{1 ≤ ℓ ≤ d} {E[a_ℓ(Y, T, X)²]}^{1/2}

for a d-dimensional vector-valued function a := (a_1, . . . , a_d).

Assumption H (First-Stage Estimation). There exist sequences δ_n ≥ n^{−1/2} and τ_n of positive constants both approaching zero such that for each k = 1, . . . , K,

‖η̂_k − η_0‖_{P,2} ≤ δ_n n^{−1/4}

with probability no less than 1 − τ_n.

Assumption H resembles classical rate requirements in semiparametric estimation. The general theory of DML allows for a more general norm; however, the L²(P)-norm is the most convenient for machine learning estimators. The required rate is attainable for a variety of machine learning methods. For instance, primitive conditions for ℓ₁-penalized logit estimators are worked out by van de Geer (2008) and Belloni, Chernozhukov, and Wei (2016), among others.

An application of Theorems 3.1 and 3.2 of DML gives the following result that formally justifies the estimation method proposed in algorithm 2.

Theorem 5.
Let {P_n : n ≥ 1} be a sequence of sets of probability distributions of (Y, T, X). Suppose that for all n ≥ 1 and P ∈ P_n, (16) and assumptions F to H hold and that we have a sample by the Bernoulli sampling scheme of design 1 or design 2. Then, for y = 0, 1,

√n {β̂_DML(y) − β(y)} / σ̂_DML(y) →_d N(0, 1)

uniformly over P ∈ P_n, and σ̂²_DML(y) →_p E[F_y(Y, T, X)²] uniformly over P ∈ P_n, where σ̂²_DML(y) is defined in equation (23).
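A concrete, purely illustrative rendering of algorithm 2 is given below. It cross-fits the three probability models and averages the scores in equation (22); sklearn's default (ridge-type) penalized logit stands in for the ℓ₁-penalized fits discussed in the text, and all names are ours, not the authors':

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def beta_dml(Y, T, X, y, K=5, seed=0):
    """Cross-fitted sketch of equations (21)-(23) for beta(y), y in {0, 1}.

    Y, T : (n,) binary arrays; X : (n, p) covariate matrix."""
    n = len(Y)
    h = Y.mean()
    eps = 1e-6                      # keep estimated probabilities inside (0, 1)
    psi = np.empty(n)
    for train, test in KFold(K, shuffle=True, random_state=seed).split(X):
        # First-stage fits on the complement sample I_k^c.
        m1 = LogisticRegression().fit(X[train][Y[train] == 1], T[train][Y[train] == 1])
        m0 = LogisticRegression().fit(X[train][Y[train] == 0], T[train][Y[train] == 0])
        mY = LogisticRegression().fit(X[train], Y[train])
        p1 = np.clip(m1.predict_proba(X[test])[:, 1], eps, 1 - eps)  # P(T=1|X,Y=1)
        p0 = np.clip(m0.predict_proba(X[test])[:, 1], eps, 1 - eps)  # P(T=1|X,Y=0)
        pY = np.clip(mY.predict_proba(X[test])[:, 1], eps, 1 - eps)  # P(Y=1|X)
        w = (1 - pY) / pY * h / (1 - h)                              # equation (20)
        log_or = np.log(p1 * (1 - p0)) - np.log((1 - p1) * p0)
        Yk, Tk = Y[test], T[test]
        # Score psi_{i,k}(y) from equation (22), evaluated on the held-out fold.
        psi[test] = (Yk**y * (1 - Yk)**(1 - y)) / (h**y * (1 - h)**(1 - y)) * log_or \
            + Yk / h * w**(1 - y) * (Tk - p1) / (p1 * (1 - p1)) \
            - (1 - Yk) / (1 - h) * w**(-y) * (Tk - p0) / (p0 * (1 - p0))
    return psi.mean(), psi.std() / np.sqrt(n)   # beta-hat and its standard error
```

In practice one would substitute ℓ₁-penalized fits (e.g. `penalty="l1"` with a suitable solver, or glmnet in R as in section 6) and tune the penalty by cross-validation.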
5. DISCUSSION: THE MAIN TAKEAWAY AND INFERENTIAL ISSUES
In this section we discuss and summarize some of the important messages from our findings. Recall that the main estimand of interest is β(y) for y = 0, 1, which is an aggregated version of log{OR(x)}. With causal inference in mind, the corresponding causal parameters would be either

ξ_RR(y) := ∫_X log{θ_RR(x)} dF_{X|Y}(x|y) or ξ_OR(y) := ∫_X log{θ_OR(x)} dF_{X|Y}(x|y).

Here, we note that log θ_RR(x) is easier to interpret than log θ_OR(x), so the former is a more natural causal parameter to target. Also, it is arguably more desirable to aggregate log θ_RR(x) by the true distribution of X*: i.e.

ξ_RR := E[log P{Y*(1) = 1 | X*}] − E[log P{Y*(0) = 1 | X*}], (24)

and we have

ξ_RR = ξ_RR(0)(1 − p*) + ξ_RR(1) p* under design 1, and ξ_RR = ξ_RR(0) under design 2,

where p* := P(Y* = 1).

In this setup, causal inference can be understood as how we relate the estimand β(y) with ξ_RR(y), and eventually with ξ_RR, for which we need to address the fact that p* is unidentified. Below we discuss each step in detail.

How to relate β(y) with ξ(y) depends on several assumptions as well as the sampling design itself. In the case of case-control sampling (i.e. design 1), strong ignorability ensures that β(y) = ξ_OR(y), but we do not learn about ξ_RR(y) from β(y) unless the rare-disease assumption is additionally in place: if the case is rare in the population (uniformly across the values of X*), then ξ_OR(y) is a good approximation of ξ_RR(y). The case of case-population sampling (i.e. design 2) is easier, because strong ignorability is sufficient to guarantee β(0) = ξ_RR(0). Therefore, a confidence interval for ξ_RR(0) in this case can be computed in the usual symmetric and two-sided way by using any of the proposed estimators of β(0).

If strong ignorability is not credible, then the (approximate) equality relationship between β(y) and ξ_RR(y) breaks down.
However, we have shown that if the MTR and MTS conditions are satisfied, we have

0 ≤ ξ_RR(y) ≤ β(y) (25)

under both designs 1 and 2, where the inequalities are sharp. Further, these inequalities do not require the rare-disease assumption, and hence they are robust against its violation. Equation (25) implies that an estimate of β(y), e.g. β̂_DML(y), should be interpreted carefully: a large estimate does not necessarily confirm a large causal effect, but a small estimate does confirm a small causal effect. Also, a confidence interval for ξ_RR(y) should be computed differently. For example, if β̂_DML(y) is used, then an asymptotically valid confidence interval for ξ_RR(y) should be computed by [0, β̂_DML(y) + z_{1−α} · σ̂_DML(y)/√n], where z_{1−α} is a one-sided standard normal critical value. Since β̂_DML(y) is an efficient estimator, it will lead to a tight one-sided confidence interval for ξ_RR(y).

If the final object of interest is ξ_RR, i.e. the aggregated version of log θ_RR(x) over the entire population, then design 2 is clearly more convenient than design 1. In the case-control sampling design, we need to compute the weighted average of ξ_RR(0) and ξ_RR(1). If p* is known, then conducting inference on ξ_RR is not hard: all of our discussion above applies again, though we need to use the standard error of the linear combination of β̂_DML(0) and β̂_DML(1). More realistically, the only information available to a researcher may be p* ∈ [0, p̄] for some known upper bound p̄. Then, the sharp bounds for ξ_RR will be given by

0 ≤ ξ_RR ≤ max{β(0), β(0)(1 − p̄) + β(1) p̄}. (26)

Equation (26) suggests that we can implement "union bounds" to obtain a confidence interval for ξ_RR. Specifically, we first check if β(0) ≥ β(1) by comparing their estimates. If so, then we use the estimate of β(0) and its standard error to compute a one-sided confidence interval. If not, then we use the estimate of β(0)(1 − p̄) + β(1) p̄ and its standard error.
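The union-bound construction just described can be sketched in a few lines. The helper below is ours and treats the covariance between the two β̂'s as a supplied input (in practice it would be estimated from the influence functions):

```python
import numpy as np
from scipy.stats import norm

def union_bound_ci(b0, se0, b1, se1, cov01, p_bar, alpha=0.05):
    """One-sided confidence interval for xi_RR based on equation (26).

    b0, b1   : estimates of beta(0) and beta(1)
    se0, se1 : their standard errors; cov01 : their estimated covariance
    p_bar    : known upper bound on p* = P(Y*=1)."""
    z = norm.ppf(1 - alpha)
    if b0 >= b1:
        # Linear combination is maximized at p* = 0.
        upper = b0 + z * se0
    else:
        # Maximized at p* = p_bar; use the delta-method standard error.
        mix = b0 * (1 - p_bar) + b1 * p_bar
        se_mix = np.sqrt((1 - p_bar)**2 * se0**2 + p_bar**2 * se1**2
                         + 2 * p_bar * (1 - p_bar) * cov01)
        upper = mix + z * se_mix
    return (0.0, upper)
```

The lower end point is zero by equation (26), so only the upper end point is data-dependent.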
6. AN EMPIRICAL EXAMPLE

In this section, we provide an empirical example to illustrate the usefulness of our approach. We revisit the ACS 2018 sample extract in the Introduction and add covariates to implement the estimation methods we have proposed in this paper. Recall that the sample is restricted to white males residing in California with at least a bachelor's degree. The case sample (Y = 1) is composed of 921 individuals whose income is top-coded. To mimic design 1, the control sample (Y = 0) of equal size is randomly drawn without replacement from the pool of individuals whose income is not top-coded. Thus, by design, P(Y = 1) = h = 1/2. The covariates (X) include age and industry codes, and the binary treatment (T) is defined to be one if an individual has a degree beyond bachelor's. Age is restricted to be between 25 and 70.

We consider two different estimators: (i) the retrospective sieve logit and (ii) the retrospective DML estimator. For (i), only age is included as a covariate, with cubic B-splines having three inner knots. (Specifically, the knots are 34, 45 and 55, which correspond to the 0.25, 0.50 and 0.75 quantiles of the empirical age distribution.) For (ii), both age and industry codes are used. In particular, cubic B-splines of age with 17 inner knots (hence, J_n = 20) as well as 254 industry dummies are included in this specification, which can be viewed as a high-dimensional specification. Specifically, we implement ℓ₁-penalized logistic estimation with the glmnet package in R (Friedman, Hastie, and Tibshirani, 2010) to estimate P(T=1 | X=x, Y=y), y = 0, 1, and P(Y=1 | X=x) with 5-fold cross-fitting. The underlying assumption here is that the B-spline terms plus the industry dummies are rich enough to approximate P(T=1 | X=x, Y=y) as well as P(Y=1 | X=x). The penalization tuning parameter is chosen by cross-validation (that is, lambda.min in the glmnet package). To present a representative result, we draw the control sample 100 times and compute estimates for each draw. Estimates and standard errors reported below are median values out of 100 replications.
TABLE 2. Empirical Results: Sieve Logit

Panel A.
                                β(1)        β(0)
  Retrospective Estimate        0.656       0.489
                               (0.101)     (0.167)
  Note. Standard errors are in the parentheses.

Panel B.
                                exp[β(1)]   exp[β(0)]
  Retrospective Estimate        1.927       1.631
  95% Confidence Interval       [1, 2.276]  [1, 2.147]
  Note. Confidence intervals are obtained under the assumption that the point estimate is the upper bound of exp[β(y)], y = 0, 1.

Table 2 reports estimation results with sieve logit estimation. Looking at Panel A, the retrospective sieve estimate of β(1) is 0.656, which is larger than that of β(0), thereby suggesting that there is heterogeneity among individuals. However, the standard error of β̂(0) is larger than that of β̂(1), which indicates that the difference between the two estimates might be driven by sampling uncertainty. In Panel B, we present point estimates of exp[β(y)], y = 0, 1, and their confidence intervals under the assumption that the point estimate is the upper bound, because the MTR and MTS conditions make β(y) an upper bound for ξ_RR(y). (Sieve logit estimation without penalization produced bogus results.) The estimates of exp[β(y)] are comparable to the usual odds ratio in terms of scale; therefore, they can be interpreted similarly. For example, the value 1.927 of exp[β̂(1)] roughly means that obtaining a higher-level degree doubles the upper bound for the chance of earning very high incomes. The end point of the confidence interval ranges from 2.15 to 2.28, which includes the unconditional odds ratio of 2.19 using the full sample.
TABLE 3. Empirical Results: Retrospective DML Estimator

Panel A.
                                β(1)        β(0)
  Retrospective Estimate        0.816       0.663
                               (0.145)     (0.124)
  Note. Standard errors are in the parentheses.

Panel B.
                                exp[β(1)]   exp[β(0)]
  Retrospective Estimate        2.261       1.940
  95% Confidence Interval       [1, 2.868]  [1, 2.377]
  Note. Confidence intervals are obtained under the assumption that the point estimate is the upper bound of exp[β(y)], y = 0, 1.

Table 3 reports estimation results with the retrospective DML estimator. The point estimates are larger than those in table 2, indicating that the effect of higher educational attainment might be larger. It is impressive that the standard errors are about the same size as those reported in table 2, given that 254 industry dummies are additionally included with more B-spline terms for age.

In semiparametric estimation with sieve approximation of unknown functions, it is necessary to choose the number J_n of approximating terms. Typically, the optimal choice of J_n for semiparametric estimation is different from the one for nonparametric estimation. Furthermore, unlike age, there is no natural ordering in industry codes; thus, it would require an ad hoc grouping of industry dummies to reduce the number of covariates if a researcher needs to use logistic regression without penalization. Alternatively, a researcher might want to use machine learning methods to deal with the high dimensionality of B-spline terms and full industry codes. However, this could lead to the question whether and how to conduct inference if one mainly cares about parameters such as β̂(y). The retrospective DML estimation method provides a constructive and affirmative answer to this question.

We end this section by illustrating a sensitivity analysis for ξ_RR.

FIGURE 1. The Upper Bounds of ξ_RR and exp(ξ_RR): Sensitivity Analysis
(Each panel plots the estimate of β(1)·P(Y*=1) + β(0)·{1 − P(Y*=1)}, or its exponential, with a one-sided 95% pointwise confidence interval.)
Note. ξ_RR = E[log P{Y*(1) = 1 | X*}] − E[log P{Y*(0) = 1 | X*}]. The left panel shows the estimate and 95% one-sided pointwise confidence interval for ξ_RR, as a function of P(Y* = 1), and the right panel those for exp(ξ_RR).

The left panel of figure 1 shows the estimate and 95% one-sided pointwise confidence interval for ξ_RR, as a function of P(Y* = 1), and the right panel those for exp(ξ_RR). In the case-control sampling, the true value of P(Y* = 1) may be unknown; however, as we can see from figure 1, we can trace out ξ_RR as a function of P(Y* = 1), thereby providing a tool for the sensitivity analysis. In the range of P(Y* = 1) from 0 to 0.5, the upper end point of the 95% pointwise confidence interval for ξ_RR (respectively, exp(ξ_RR)) is at most 0.9 (respectively, 2.5). Roughly speaking, this implies that it is highly unlikely that obtaining a degree beyond bachelor's improves the chance of earning high incomes by more than a factor of 2.5.
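Tracing the curve in figure 1 amounts to evaluating a linear combination and its one-sided band over a grid of p = P(Y* = 1). A minimal sketch (our own helper, with the covariance of the two β̂'s treated as a supplied input):

```python
import numpy as np
from scipy.stats import norm

def sensitivity_curve(b0, se0, b1, se1, cov01, alpha=0.05):
    """Estimated upper bound beta(0)(1-p) + beta(1)p for xi_RR, with a
    one-sided pointwise confidence band, over a grid of p = P(Y*=1)."""
    p = np.linspace(0.0, 1.0, 101)
    est = b0 * (1 - p) + b1 * p
    # Delta-method standard error of the linear combination at each p.
    se = np.sqrt((1 - p)**2 * se0**2 + p**2 * se1**2 + 2 * p * (1 - p) * cov01)
    return p, est, est + norm.ppf(1 - alpha) * se
```

Exponentiating the estimate and the band pointwise gives the right panel of the figure.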
7. RELATED LITERATURE AND FUTURE RESEARCH

The literature on causal inference using observational data is vast and the literature on non-random sampling is extensive. In this section we discuss some of the important papers in the context of what we have achieved in this paper.

We have labeled designs 1 and 2 together as Bernoulli sampling, which is the term that we borrowed from Breslow, Robins, and Wellner (2000). The two sampling schemes have been studied under different names by other authors. For instance, Imbens and Lancaster (1996) refer to design 1 as multinomial sampling, and Lancaster and Imbens (1996) call design 2 case-control sampling with contamination, which is borrowed from Heckman and Robb (1985).

The objective of Heckman and Robb (1985) is to estimate the impact of training on earnings under various data scenarios. In that study they discuss common data problems such as oversampling of trainees or "contamination" in the control group, i.e. the training status of the individuals in the control group being unknown. Although the sampling schemes of Heckman and Robb (1985) are similar to designs 1 and 2, they are distinct in the sense that they are not outcome-based but treatment-based sampling. In our context, having a control group drawn from the whole population without conditioning on the outcome status makes it easier, not harder, to identify the causal relative risk parameter. For this reason we have referred to design 2 as case-population sampling in order to remove connotations of negativeness from the word "contamination."

Estimating the average treatment effect under treatment-based sampling has been studied by other authors as well. For instance, Heckman and Todd (2009) point out that a matching estimator can be implemented by using the odds ratio of the propensity score fit on the sample because it is a monotone transformation of the true propensity scores.
Kennedy, Sjölander, and Small (2015) show that one can estimate the average treatment effect on the treated without knowledge of the true population probability of the treatment. Assuming the latter is known, Hu and Qin (2018) and Zhang, Hu, and Liu (2019) have developed weighted estimators of the average treatment effect. However, all these methods are based on strong ignorability, and to the best of our knowledge, we are not aware of any work that does not rely on it. We leave it for future research how to extend the approach taken in this paper to the context of treatment-based sampling.

The term Bernoulli sampling has been alternatively used by e.g. Kalbfleisch and Lawless (1988) to describe the case where an individual unit is randomly drawn from the entire population but is retained or discarded with stratum-specific probabilities. Imbens and Lancaster (1996) use the same terminology, while they call our design 1 multinomial sampling, as we mentioned earlier. The case where a given number of observations are randomly drawn from each stratum has traditionally been called the classical stratified sampling scheme (e.g. Hausman and Wise, 1981). However, Imbens and Lancaster (1996) have shown that there is no meaningful difference among the three schemes in that they lead to the same likelihood function to estimate the parameters that appear in the choice probabilities. Since this paper is concerned with a binary outcome, Bernoulli sampling seems more appropriate than multinomial sampling.

In the literature on choice-based sampling, the objective is usually efficiently estimating the parameters that appear in the parametrically specified prospective probabilities. Manski and Lerman (1977) propose a weighted likelihood approach for this purpose under outcome-based sampling. Cosslett (1981) shows that it is feasible to compute the full maximum likelihood estimator. By far the most common specification is the logistic model.
However, as Xie and Manski (1989) point out, the logit model can be quite misleading under outcome-based sampling if the truth is not logistic. Despite its convenience, the logistic specification imposes restrictions on the form of heterogeneity in the causal effect. In contrast, our approach does not restrict the shape of the causal relative risk function θ_RR(·), thereby allowing an unrestricted form of heterogeneity in the causal treatment effect.

Many papers in this literature use the term "semiparametric" to describe the fact that the marginal distribution of the regressors is left unspecified in their analysis, while the prospective probability, i.e. the conditional distribution of the outcome given the regressors, is still parametric: see e.g. Imbens and Lancaster (1996) and Breslow, Robins, and Wellner (2000). By contrast, our approach is semi-nonparametric in the sense of X. Chen (2007) because we do not impose parametric restrictions anywhere. Instead of relying on the parametric assumption, we directly target the aggregated log odds ratio as the estimand of interest, we articulate its relationship with the fundamental causal parameter of interest, and we have derived the efficiency bound for the estimand under Bernoulli sampling. By combining all these results we can draw robust and efficient inferences on the causal parameter of interest.

In the statistics and epidemiology literature, misspecification and robustness have been addressed from a different perspective. For instance, H.Y. Chen (2007) considers estimating the parameters that appear in the odds ratio in such a way that consistency and asymptotic normality follow as long as either the prospective or the retrospective probability is correctly specified: this approach is known as a doubly robust estimation method. Tchetgen Tchetgen, Robins, and Rotnitzky (2010) take a similar approach, but their estimator is simpler to implement than H.Y. Chen (2007)'s; it is then further operationalized by Tchetgen Tchetgen (2013) under the finite-dimensional logistic assumption. Our estimating equation in (16) is different because our parameter of interest is semi-nonparametric. It is also noteworthy that statisticians and epidemiologists have maintained an active research agenda in case-control studies, unlike econometricians. In addition to the aforementioned papers, for instance, Zhou, Herring, Bhattacharya, Olshan, Dunson, and Study (2016) investigate how to deal with high-dimensional predictors in the case-control setup using a nonparametric Bayesian approach.

Finally, our causal parameter is defined by a ratio, but it is probably fair to say that a difference (attributable risk in our setup) is a more common measure in econometrics (e.g. Hahn, 1998; Hirano, Imbens, and Ridder, 2003). We do this only because the ratio is mathematically more convenient under outcome-based sampling, thanks to the invariance property of the odds ratio. However, it has long been questioned whether the emphasis on relative risk combined with the rare-disease assumption is relevant for public policies: see, e.g., Hsieh, Manski, and McFadden (1985) and Manski (2009) among others. We take a pragmatic approach to this debate and believe that both attributable risk and relative risk are useful for evidence-based policymaking. We plan to work out details for causal attributable risk in a separate paper since its analysis is sufficiently distinct from that of causal relative risk.
APPENDIX A. AVERAGING WITHOUT TAKING THE LOGARITHM
In the main text our key estimand was an aggregated version of the logarithm of the odds ratio, i.e. β(y) = E[log{OR(X)} | Y = y] for y = 0, 1. As a result, the central causal parameter ξ_RR was defined in (24) by the logarithm of relative risk. Alternatively, one may want to proceed without taking the logarithm, in which case we are led to consider

ζ_RR := E[θ_RR(X*)], ζ_RR(y) := ∫_X θ_RR(x) dF_{X|Y}(x|y), and κ(y) := ∫_X OR(x) dF_{X|Y}(x|y)

for y = 0, 1. Again, if the MTR and MTS conditions are satisfied, then we have

1 ≤ ζ_RR(y) ≤ κ(y) (27)

under both designs 1 and 2, where the inequalities are sharp.

Efficient estimation of κ(y) can be explored exactly in the same way as in section 4. Below we present the formula of the efficient influence function, which is an analog of theorem 3.

Theorem A.1. Suppose that assumptions A, F and G hold and that we have a sample by Bernoulli sampling. Then, for y = 0, 1, κ(y) is pathwise differentiable and its pathwise derivative is given by

K_y(Y, T, X) = [Y^y (1−Y)^{1−y} / {h^y (1−h)^{1−y}}] {OR(X) − κ(y)}
− OR(X) Δ_0(Y, T, X) / [(1−h) {w(X)}^y] + OR(X) {w(X)}^{1−y} Δ_1(Y, T, X) / h.

Further, K_y is an element of the tangent space, and therefore, the semiparametric efficiency bound for κ(y) is given by E{K_y(Y, T, X)²}.

We omit the proof of theorem A.1 because it is essentially identical to that of theorem 3. We can construct efficient estimators of κ(y) and carry out causal inference on ζ_RR by methods identical to those used in section 4. We do not repeat all the details for brevity.

In general we have the relationship ξ_RR ≤ log(ζ_RR) by Jensen's inequality. We have chosen ξ_RR as our central causal parameter to focus on in the main text because (i) it corresponds to the usual parameter when a parametric logistic regression model is used, and (ii) an average of the log odds ratio is less likely to be affected unduly by outliers than that of the odds ratio itself.
APPENDIX B. AUXILIARY LEMMAS
Lemma B.1. Suppose that assumption D holds. Then, for t = 0, 1 and for all x ∈ X,

−1 ≤ (−1)^t [P{Y*(t) = 1 | X* = x} − P(Y* = 1 | X* = x)] ≤ 0,

where the bounds are sharp.

Proof. Since the two inequalities are similar, we focus on the case of t = 1. In this case, the claimed inequality follows from

P{Y*(1) = 1, T* = 1 | X* = x} + P{Y*(1) = 1, T* = 0 | X* = x}
≥ P{Y*(1) = 1, T* = 1 | X* = x} + P{Y*(0) = 1, T* = 0 | X* = x}.

For sharpness, we know from assumption D that

P{Y*(1) = 1, T* = 0 | X* = x} − P{Y*(0) = 1, T* = 0 | X* = x} = P{Y*(1) = 1, Y*(0) = 0, T* = 0 | X* = x},

where the right-hand side is unrestricted between 0 and 1. □

Lemma B.2.
Suppose that Assumption E holds. Then, for $t = 0,1$ and for all $x \in \mathcal{X}$,
\[
(-1)^t \bigl[ P\{Y^*(t) = 1 \mid X^* = x\} - P(Y^* = 1 \mid T^* = t, X^* = x) \bigr] \ge 0,
\]
where the bounds are sharp. Furthermore, if $0 < P(T^* = 1 \mid X^* = x) < 1$, these inequalities hold with equality if and only if Assumption E is satisfied with equality.

Proof. Since the two inequalities are similar, we focus on the case of $t = 1$. First,
\[
P\{Y^*(1) = 1 \mid X^* = x\} = P(Y^* = 1 \mid T^* = 1, X^* = x)\, P(T^* = 1 \mid X^* = x) + P\{Y^*(1) = 1 \mid T^* = 0, X^* = x\}\, P(T^* = 0 \mid X^* = x), \quad (28)
\]
where we note from Assumption E that there exists some $C_x \in [0,1]$ such that
\[
P(Y^* = 1 \mid T^* = 1, X^* = x) = P\{Y^*(1) = 1 \mid T^* = 0, X^* = x\} + C_x. \quad (29)
\]
Combining equations (28) and (29) yields the first inequality in the lemma statement. Therefore,
\[
P\{Y^*(1) = 1 \mid X^* = x\} = P(Y^* = 1 \mid T^* = 1, X^* = x) - C_x \cdot P(T^* = 0 \mid X^* = x) \le P(Y^* = 1 \mid T^* = 1, X^* = x). \quad (30)
\]
Sharpness follows from the fact that $C_x$ is not restricted except that it lies between 0 and 1. Also, if $P(T^* = 0 \mid X^* = x) > 0$, then the last inequality in equation (30) holds with equality if and only if $C_x = 0$. $\square$
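Both lemmas admit a quick Monte Carlo sanity check. The data-generating process below is made up solely to satisfy our reading of the monotonicity conditions: $Y^*(1) \ge Y^*(0)$ pointwise for Assumption D, and treatment take-up more likely when $Y^*(1) = 1$ for Assumption E; it is not the paper's design.

```python
import numpy as np

# Made-up DGP: Y*(1) >= Y*(0) pointwise (monotone response), and take-up
# probability is higher when Y*(1) = 1 (monotone selection into treatment).
rng = np.random.default_rng(1)
n = 200_000
y0 = rng.binomial(1, 0.2, n)
y1 = np.maximum(y0, rng.binomial(1, 0.4, n))    # enforces Y*(1) >= Y*(0)
t = rng.binomial(1, np.where(y1 == 1, 0.7, 0.3))
y = np.where(t == 1, y1, y0)                    # realized outcome Y* = Y*(T*)

# Lemma B.1 ordering: P{Y*(1)=1} >= P(Y*=1) >= P{Y*(0)=1}
assert y1.mean() >= y.mean() >= y0.mean()
# Lemma B.2 with t = 1: P{Y*(1)=1} <= P(Y*=1 | T*=1)
assert y1.mean() <= y[t == 1].mean()
```

The first assertion holds deterministically here because $Y^*(0) \le Y^* \le Y^*(1)$ holds draw by draw; the second reflects the positive selection built into the take-up rule.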
Appendix C. Proofs of the Results in the Main Text
Proof of Lemma 1: By the Bayes rule,
\[
\mathrm{OR}(x) = \frac{P(T = 1 \mid X = x, Y = 1)\, P(T = 0 \mid X = x, Y = 0)}{P(T = 0 \mid X = x, Y = 1)\, P(T = 1 \mid X = x, Y = 0)}.
\]
Then, under design 1, for all $x \in \mathcal{X}$,
\[
\mathrm{OR}(x) = \frac{P(T^* = 1 \mid X^* = x, Y^* = 1)\, P(T^* = 0 \mid X^* = x, Y^* = 0)}{P(T^* = 0 \mid X^* = x, Y^* = 1)\, P(T^* = 1 \mid X^* = x, Y^* = 0)} = \mathrm{OR}^*(x),
\]
where the second equality again follows from the Bayes rule. Now, under design 2, for all $x \in \mathcal{X}$,
\[
\mathrm{OR}(x) = \frac{P(T^* = 1 \mid X^* = x, Y^* = 1)\, P(T^* = 0 \mid X^* = x)}{P(T^* = 0 \mid X^* = x, Y^* = 1)\, P(T^* = 1 \mid X^* = x)} = \mathrm{RR}^*(x). \quad \square
\]
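The design-2 conclusion, that the sample odds ratio coincides with the population relative risk when controls are drawn from the whole population, can be verified by exact arithmetic on an illustrative joint distribution for $(Y^*, T^*)$ (numbers made up; $X$ suppressed):

```python
# Exact population check with made-up probabilities for (Y*, T*).
p_t = 0.4                                     # P(T*=1)
p_y_t1, p_y_t0 = 0.10, 0.04                   # P(Y*=1 | T*=t)
p_y = p_t * p_y_t1 + (1 - p_t) * p_y_t0       # P(Y*=1)
rr = p_y_t1 / p_y_t0                          # population relative risk RR*

# Design 2: cases are draws from Y*=1, "controls" from the whole population.
p_t_case = p_t * p_y_t1 / p_y                 # P(T*=1 | Y*=1), by the Bayes rule
p_t_ctrl = p_t                                # controls carry the marginal of T*
or_sample = (p_t_case / (1 - p_t_case)) / (p_t_ctrl / (1 - p_t_ctrl))

assert abs(or_sample - rr) < 1e-9             # Lemma 1, design 2: sample OR = RR*
```

The identity is algebraic, so it holds without any rare-disease approximation.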
Proof of Lemma 2: Let $\gamma$ be the parameter indexing regular parametric submodels, where the true value is denoted by $\gamma_0$. Then, by using the likelihood function in equation (11), the score evaluated at $\gamma_0$ is equal to
\[
(1-Y) \Bigl[ S_{X|Y}(X \mid 0) + \frac{\{T - P(T=1 \mid X, Y=0)\}\, \partial_\gamma P(T=1 \mid X, Y=0; \gamma_0)}{P(T=1 \mid X, Y=0)\{1 - P(T=1 \mid X, Y=0)\}} \Bigr] + Y \Bigl[ S_{X|Y}(X \mid 1) + \frac{\{T - P(T=1 \mid X, Y=1)\}\, \partial_\gamma P(T=1 \mid X, Y=1; \gamma_0)}{P(T=1 \mid X, Y=1)\{1 - P(T=1 \mid X, Y=1)\}} \Bigr], \quad (31)
\]
where $S_{X|Y}(x \mid y) = \partial_\gamma \log f_{X|Y}(x \mid y; \gamma_0)$ is restricted only by $E\{S_{X|Y}(X \mid y) \mid Y = y\} = 0$, while the derivatives $\partial_\gamma P(T=1 \mid X, Y=y; \gamma_0)$ are unrestricted. $\square$
Proof of Theorem 1: In view of Lemma 1, the theorem follows immediately since
\[
P(Y^* = 1 \mid T^* = t, X^* = x) = P\{Y^*(t) = 1 \mid T^* = t, X^* = x\} = P\{Y^*(t) = 1 \mid X^* = x\},
\]
where the last equality is by the assumption of unconfoundedness. $\square$
Proof of Theorem 2:

Part (i). The sharp lower bound of $\theta_{RR}(x)$ follows from Lemma B.1. To prove that $\theta_{RR}(x) \le \theta_{OR}(x)$ for all $x \in \mathcal{X}$, note that
\[
\frac{\theta_{OR}(x)}{\theta_{RR}(x)} = \frac{1 - P\{Y^*(0) = 1 \mid X^* = x\}}{1 - P\{Y^*(1) = 1 \mid X^* = x\}} \ge 1,
\]
because $(-1)^t \bigl[ P\{Y^*(t) = 1 \mid X^* = x\} - P(Y^* = 1 \mid X^* = x) \bigr] \le 0$ for $t = 0, 1$.

Part (ii). The sharp upper bound of $\theta_{RR}(x)$ under design 2 follows from Lemmas 1 and B.2 because
\[
\theta_{RR}(x) \le \frac{P(Y^* = 1 \mid T^* = 1, X^* = x)}{P(Y^* = 1 \mid T^* = 0, X^* = x)} = \mathrm{RR}^*(x) = \mathrm{OR}(x). \quad (32)
\]
The case of $\theta_{OR}(x)$ under design 1 similarly uses the fact that Lemma B.2 yields
\[
\frac{1 - P\{Y^*(0) = 1 \mid X^* = x\}}{1 - P\{Y^*(1) = 1 \mid X^* = x\}} \le \frac{1 - P(Y^* = 1 \mid T^* = 0, X^* = x)}{1 - P(Y^* = 1 \mid T^* = 1, X^* = x)}. \quad (33)
\]
Combining equation (33) with (32) yields that, under design 1, $\theta_{OR}(x) \le \mathrm{OR}^*(x) = \mathrm{OR}(x)$.

Part (iii). The final statement follows immediately from Lemma B.2. $\square$
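The ratio identity used in part (i) is transparent in numbers; a tiny check with made-up counterfactual probabilities at a fixed $x$:

```python
# Made-up counterfactual success probabilities at a fixed x:
p1, p0 = 0.30, 0.10            # P{Y*(1)=1 | x} and P{Y*(0)=1 | x}

theta_rr = p1 / p0                                   # counterfactual relative risk
theta_or = (p1 / (1 - p1)) / (p0 / (1 - p0))         # counterfactual odds ratio

# Part (i): theta_OR / theta_RR = (1 - p0) / (1 - p1), which is >= 1 when p1 >= p0.
assert abs(theta_or / theta_rr - (1 - p0) / (1 - p1)) < 1e-12
assert theta_rr <= theta_or
```

Note that the two parameters coincide only as $p_1, p_0 \to 0$, which is the rare-disease regime the paper deliberately avoids assuming.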
Proof of Theorem 3: For brevity, we focus on $\beta(0)$ and let $\beta = \beta(0)$; the proof for $\beta(1)$ is analogous. Let $p_0(x) = P(T=1 \mid X=x, Y=0)$ and $p_1(x) = P(T=1 \mid X=x, Y=1)$. Note that
\[
\beta(\gamma) = \int_{\mathcal{X}} \log \Bigl[ \underbrace{\frac{p_1(x;\gamma)}{1 - p_1(x;\gamma)} \cdot \frac{1 - p_0(x;\gamma)}{p_0(x;\gamma)}}_{:= \mathrm{OR}(x;\gamma)} \Bigr] f_{X|Y}(x \mid 0; \gamma)\, dx, \quad (34)
\]
where $\gamma$ represents regular parametric submodels such that $\gamma_0$ is the truth. Then,
\[
\partial_\gamma \mathrm{OR}(x;\gamma_0) = \frac{\partial_\gamma p_1(x;\gamma_0) \{1 - p_0(x)\}}{p_0(x) \{1 - p_1(x)\}^2} - \frac{\partial_\gamma p_0(x;\gamma_0)\, p_1(x)}{p_0(x)^2 \{1 - p_1(x)\}} = \frac{\partial_\gamma p_1(x;\gamma_0)}{p_1(x)\{1 - p_1(x)\}} \mathrm{OR}(x) - \frac{\partial_\gamma p_0(x;\gamma_0)}{p_0(x)\{1 - p_0(x)\}} \mathrm{OR}(x). \quad (35)
\]
Therefore,
\[
\partial_\gamma \beta(\gamma_0) = \int \Bigl[ \frac{\partial_\gamma \mathrm{OR}(x;\gamma_0)}{\mathrm{OR}(x)} + \log\{\mathrm{OR}(x)\}\, S_{X|Y}(x \mid 0) \Bigr] f_{X|Y}(x \mid 0)\, dx = \int \Bigl[ \frac{\partial_\gamma p_1(x;\gamma_0)}{p_1(x)\{1 - p_1(x)\}} - \frac{\partial_\gamma p_0(x;\gamma_0)}{p_0(x)\{1 - p_0(x)\}} + \log\{\mathrm{OR}(x)\}\, S_{X|Y}(x \mid 0) \Bigr] f_{X|Y}(x \mid 0)\, dx. \quad (36)
\]
Now, we only need to verify the equality between $E\{F(Y,T,X)\, S(Y,T,X)\}$ and
\[
\int \Bigl[ \underbrace{\frac{\partial_\gamma p_1(x;\gamma_0)}{p_1(x)\{1 - p_1(x)\}}}_{:= A_1(x)} - \underbrace{\frac{\partial_\gamma p_0(x;\gamma_0)}{p_0(x)\{1 - p_0(x)\}}}_{:= A_0(x)} + \log\{\mathrm{OR}(x)\}\, S_{X|Y}(x \mid 0) \Bigr] f_{X|Y}(x \mid 0)\, dx, \quad (37)
\]
where $F(Y,T,X)$ and $S(Y,T,X)$ are given in the theorem statement and equation (31), respectively; i.e.,
\[
S(Y,T,X) = (1-Y) \bigl[ S_{X|Y}(X \mid 0) + \{T - p_0(X)\} A_0(X) \bigr] + Y \bigl[ S_{X|Y}(X \mid 1) + \{T - p_1(X)\} A_1(X) \bigr],
\]
\[
F(Y,T,X) = \frac{1-Y}{1-h} \Bigl[ \log \mathrm{OR}(X) - \beta - \frac{T - p_0(X)}{p_0(X)\{1 - p_0(X)\}} \Bigr] + \frac{Y}{h}\, \frac{f_{X|Y}(X \mid 0)}{f_{X|Y}(X \mid 1)}\, \frac{T - p_1(X)}{p_1(X)\{1 - p_1(X)\}}.
\]
Note that $F(Y,T,X)\, S(Y,T,X)$ is equal to
\[
\frac{1-Y}{1-h} \Bigl[ \log \mathrm{OR}(X) - \beta - \frac{T - p_0(X)}{p_0(X)\{1 - p_0(X)\}} \Bigr] \bigl[ S_{X|Y}(X \mid 0) + \{T - p_0(X)\} A_0(X) \bigr] + \frac{Y}{h}\, \frac{f_{X|Y}(X \mid 0)}{f_{X|Y}(X \mid 1)} \Bigl[ \frac{T - p_1(X)}{p_1(X)\{1 - p_1(X)\}} \Bigr] \bigl[ S_{X|Y}(X \mid 1) + \{T - p_1(X)\} A_1(X) \bigr].
\]
Here, taking expectations directly shows that $E\{F(Y,T,X)\, S(Y,T,X)\}$ is equal to
\[
E\bigl\{ \log\{\mathrm{OR}(X)\}\, S_{X|Y}(X \mid 0) - A_0(X) \bigm| Y = 0 \bigr\} + E\Bigl\{ \frac{f_{X|Y}(X \mid 0)}{f_{X|Y}(X \mid 1)}\, A_1(X) \Bigm| Y = 1 \Bigr\},
\]
which is equal to the expression in equation (37) since
\[
E\Bigl\{ \frac{f_{X|Y}(X \mid 0)}{f_{X|Y}(X \mid 1)}\, A_1(X) \Bigm| Y = 1 \Bigr\} = E\bigl\{ A_1(X) \bigm| Y = 0 \bigr\}.
\]
Finally, it follows from Lemma 2 that $F$ is an element of the tangent space. $\square$

The proofs of Theorems 4 and 5 are provided in Appendix S-1, which is available online. The proof of Theorem 4 is similar to that of Theorem 3, and the proof of Theorem 5 does not provide any additional insight beyond DML.

References

Ackerberg, D., X. Chen, and J. Hahn (2012): "A practical asymptotic variance estimator for two-step semiparametric estimators," Review of Economics and Statistics, 94(2), 481–498.
Ackerberg, D., X. Chen, J. Hahn, and Z. Liao (2014): "Asymptotic efficiency of semiparametric two-step GMM," Review of Economic Studies, 81(3), 919–943.

Ai, C., and X. Chen (2003): "Efficient estimation of models with conditional moment restrictions containing unknown functions," Econometrica, 71(6), 1795–1843.

——— (2012): "The semiparametric efficiency bound for models of sequential moment restrictions containing unknown functions," Journal of Econometrics, 170(2), 442–457.

Belloni, A., V. Chernozhukov, and Y. Wei (2016): "Post-Selection Inference for Generalized Linear Models With Many Controls," Journal of Business & Economic Statistics, 34(4), 606–619.

Bhattacharya, J., A. M. Shaikh, and E. Vytlacil (2008): "Treatment effect bounds under monotonicity assumptions: an application to Swan-Ganz catheterization," American Economic Review: Papers and Proceedings, 98(2), 351–356.

Bhattacharya, J., A. M. Shaikh, and E. Vytlacil (2012): "Treatment effect bounds: An application to Swan-Ganz catheterization," Journal of Econometrics, 168(2), 223–243.

Breslow, N. E. (1996): "Statistics in epidemiology: the case-control study," Journal of the American Statistical Association, 91(433), 14–28.

Breslow, N. E., and N. E. Day (1980): Statistical Methods in Cancer Research I. The Analysis of Case-Control Studies, vol. 1. International Agency for Research on Cancer, Lyon, France.

Breslow, N. E., J. M. Robins, and J. A. Wellner (2000): "On the semiparametric efficiency of logistic regression under case-control sampling," Bernoulli, 6(3), 447–455.

Carvalho, L. S., and R. R. Soares (2016): "Living on the edge: Youth entry, career and exit in drug-selling gangs," Journal of Economic Behavior & Organization, 121, 77–98.

Chen, K. (2001): "Parametric models for response-biased sampling," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(4), 775–789.

Chen, X. (2007): "Large Sample Sieve Estimation of Semi-Nonparametric Models," in Handbook of Econometrics, vol. 6, pp. 5549–5632. Elsevier.

Chen, H. Y. (2007): "A semiparametric odds ratio model for measuring association," Biometrics, 63(2), 413–421.

Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018): "Double/debiased machine learning for treatment and structural parameters," Econometrics Journal, 21, C1–C68.

Chernozhukov, V., S. Lee, and A. M. Rosen (2013): "Intersection bounds: Estimation and inference," Econometrica, 81(2), 667–737.

Cornfield, J. (1951): "A method of estimating comparative rates from clinical data. Applications to cancer of the lung, breast, and cervix," Journal of the National Cancer Institute, 11(6), 1269–1275.

Cosslett, S. R. (1981): "Maximum Likelihood Estimator for Choice-Based Samples," Econometrica, 49(5), 1289–1316.

Currie, J., and M. Neidell (2005): "Air pollution and infant health: what can we learn from California's recent experience?," Quarterly Journal of Economics, 120(3), 1003–1030.

Domowitz, I., and R. L. Sartain (1999): "Determinants of the Consumer Bankruptcy Decision," Journal of Finance, 54(1), 403–420.

Friedman, J., T. Hastie, and R. Tibshirani (2010): "Regularization Paths for Generalized Linear Models via Coordinate Descent," Journal of Statistical Software, 33(1), 1–22.

Hahn, J. (1998): "On the role of the propensity score in efficient semiparametric estimation of average treatment effects," Econometrica, 66(2), 315–331.

Hausman, J. A., and D. A. Wise (1981): "Stratification on endogenous variables and estimation: The Gary income maintenance experiment," in Structural Analysis of Discrete Data with Econometric Applications, pp. 365–391.

Heckman, J. J., and R. Robb (1985): "Alternative methods for evaluating the impact of interventions: An overview," Journal of Econometrics, 30(1–2), 239–267.

Heckman, J. J., and P. E. Todd (2009): "A note on adapting propensity score matching and selection models to choice based samples," Econometrics Journal, 12(s1), S230–S234.

Hirano, K., G. W. Imbens, and G. Ridder (2003): "Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score," Econometrica, 71(4), 1161–1189.

Holland, P. W., and D. B. Rubin (1988): "Causal Inference in Retrospective Studies," Evaluation Review, 12(3), 203–231.

Hsieh, D. A., C. F. Manski, and D. McFadden (1985): "Estimation of Response Probabilities from Augmented Retrospective Observations," Journal of the American Statistical Association, 80(391), 651–662.

Hu, Z., and J. Qin (2018): "Generalizability of causal inference in observational studies under retrospective convenience sampling," Statistics in Medicine, 37(19), 2874–2883.

Imbens, G. W. (1992): "An Efficient Method of Moments Estimator for Discrete Choice Models With Choice-Based Sampling," Econometrica, 60(5), 1187–1214.

Imbens, G. W., and T. Lancaster (1996): "Efficient estimation and stratified sampling," Journal of Econometrics, 74(2), 289–318.

Imbens, G. W., and D. B. Rubin (2015): Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.

Jun, S. J., and S. Lee (2019): "Identifying the effect of persuasion," arXiv:1812.02276.

Kalbfleisch, J., and J. Lawless (1988): "Estimation of Reliability in Field-Performance Studies," Technometrics, 30(4), 365–378.

Kennedy, E. H., A. Sjölander, and D. Small (2015): "Semiparametric causal inference in matched cohort studies," Biometrika, 102(3), 739–746.

Kim, W., K. Kwon, S. Kwon, and S. Lee (2018): "The identification power of smoothness assumptions in models with counterfactual outcomes," Quantitative Economics, 9(2), 617–642.

Kreider, B., J. V. Pepper, C. Gundersen, and D. Jolliffe (2012): "Identifying the Effects of SNAP (Food Stamps) on Child Health Outcomes When Participation Is Endogenous and Misreported," Journal of the American Statistical Association, 107(499), 958–975.

Lancaster, T., and G. Imbens (1996): "Case-control studies with contaminated controls," Journal of Econometrics, 71(1–2), 145–160.

Lee, S., K. Song, and Y.-J. Whang (2018): "Testing for a general class of functional inequalities," Econometric Theory, 34(5), 1018–1064.

Machado, C., A. M. Shaikh, and E. J. Vytlacil (2019): "Instrumental variables and the sign of the average treatment effect," Journal of Econometrics, 212(2), 522–555.

Manski, C. F. (1997): "Monotone Treatment Response," Econometrica, 65(6), 1311–1334.

——— (2003): Partial Identification of Probability Distributions. Springer-Verlag.

——— (2009): Identification for Prediction and Decision. Harvard University Press.

Manski, C. F., and S. R. Lerman (1977): "The Estimation of Choice Probabilities from Choice Based Samples," Econometrica, 45(8), 1977–1988.

Manski, C. F., and D. McFadden (1981): "Alternative estimators and sample designs for discrete choice analysis," in Structural Analysis of Discrete Data with Econometric Applications, ed. by C. F. Manski and D. McFadden, vol. 2, pp. 51–111. MIT Press, Cambridge, MA.

Manski, C. F., and J. V. Pepper (2000): "Monotone instrumental variables: With an application to the returns to schooling," Econometrica, 68(4), 997–1010.

McFadden, D. (2015): "Observational Studies: Outcome-Based Sampling," in International Encyclopedia of the Social & Behavioral Sciences (Second Edition), ed. by J. D. Wright, pp. 103–106. Elsevier, Oxford.

Newey, W. K. (1990): "Semiparametric efficiency bounds," Journal of Applied Econometrics, 5(2), 99–135.

——— (1994): "The asymptotic variance of semiparametric estimators," Econometrica, 62(6), 1349–1382.

Okumura, T., and E. Usui (2014): "Concave-monotone treatment response and monotone treatment selection: With an application to the returns to schooling," Quantitative Economics, 5(1), 175–194.

Ruggles, S., S. Flood, R. Goeken, J. Grover, E. Meyer, J. Pacas, and M. Sobek (2019): "IPUMS USA: Version 9.0 [dataset]," https://doi.org/10.18128/D010.V9.0.

Tamer, E. (2010): "Partial identification in econometrics," Annual Review of Economics, 2(1), 167–195.

Tchetgen Tchetgen, E. J. (2013): "On a closed-form doubly robust estimator of the adjusted odds ratio for a binary exposure," American Journal of Epidemiology, 177(11), 1314–1316.

Tchetgen Tchetgen, E. J., J. M. Robins, and A. Rotnitzky (2010): "On doubly robust estimation in a semiparametric odds ratio model," Biometrika, 97(1), 171–180.

van de Geer, S. A. (2008): "High-dimensional generalized linear models and the lasso," Annals of Statistics, 36(2), 614–645.

Xie, J., Y. Lin, X. Yan, and N. Tang (2019): "Category-Adaptive Variable Screening for Ultra-High Dimensional Heterogeneous Categorical Data," Journal of the American Statistical Association, forthcoming.

Xie, Y., and C. F. Manski (1989): "The logit model and response-based samples," Sociological Methods & Research, 17(3), 283–302.

Zhang, Z., Z. Hu, and C. Liu (2019): "Estimating the Population Average Treatment Effect in Observational Studies with Choice-Based Sampling," International Journal of Biostatistics, 15(1).

Zhou, J., A. H. Herring, A. Bhattacharya, A. F. Olshan, D. B. Dunson, and the National Birth Defects Prevention Study (2016): "Nonparametric Bayes modeling for case control studies with many predictors," Biometrics, 72(1), 184–192.

Online Supplement to "Causal Inference in Case-Control Studies" by Jun and Lee

Appendix S-1. Additional Proofs
Proof of Theorem 4: For simplicity, in the proof we focus on $\beta(0)$ and let it be denoted by $\beta$. The case of $\beta(1)$ is similar. Recall that
\[
\tilde{F}(\eta)[Y,T,X] = \frac{1-Y}{1-h} \Bigl[ \log \widetilde{\mathrm{OR}}(\eta)[X] - \beta - \frac{T - a(X)}{a(X)\{1 - a(X)\}} \Bigr] + \frac{Y}{h}\, \tilde{w}(\eta)[X]\, \frac{T - b(X)}{b(X)\{1 - b(X)\}}.
\]
Note that
\[
E\bigl\{ \tilde{F}(\eta)[Y,T,X] \bigr\} = E\bigl\{ \log \widetilde{\mathrm{OR}}(\eta)[X] - \beta \bigm| Y = 0 \bigr\} + E\bigl\{ \tilde{\Delta}_0(\eta)[T,X] \bigm| Y = 0 \bigr\} + E\bigl\{ \tilde{\Delta}_1(\eta)[T,X] \bigm| Y = 1 \bigr\}, \quad (S.1)
\]
where
\[
\tilde{\Delta}_0(\eta)[T,X] = -\frac{T - a(X)}{a(X)\{1 - a(X)\}}, \quad (S.2)
\]
\[
\tilde{\Delta}_1(\eta)[T,X] = \tilde{w}(\eta)[X]\, \frac{T - b(X)}{b(X)\{1 - b(X)\}}. \quad (S.3)
\]
Here,
\[
E\Bigl( \partial_\gamma \tilde{\Delta}_0\{\eta_0 + \gamma(\eta - \eta_0)\}[T,X] \bigm|_{\gamma=0} \Bigm| X, Y = 0 \Bigr) = \frac{a(X) - p_0(X)}{p_0(X)\{1 - p_0(X)\}},
\]
\[
E\Bigl( \partial_\gamma \tilde{\Delta}_1\{\eta_0 + \gamma(\eta - \eta_0)\}[T,X] \bigm|_{\gamma=0} \Bigm| X, Y = 1 \Bigr) = -w(X)\, \frac{b(X) - p_1(X)}{p_1(X)\{1 - p_1(X)\}},
\]
where $p_0(X) = P(T=1 \mid X, Y=0)$ and $p_1(X) = P(T=1 \mid X, Y=1)$ as before. Therefore,
\[
E\Bigl( \partial_\gamma \tilde{\Delta}_0\{\eta_0 + \gamma(\eta - \eta_0)\}[T,X] \bigm|_{\gamma=0} \Bigm| Y = 0 \Bigr) = E\Bigl\{ \frac{a(X) - p_0(X)}{p_0(X)\{1 - p_0(X)\}} \Bigm| Y = 0 \Bigr\}, \quad (S.4)
\]
and
\[
E\Bigl( \partial_\gamma \tilde{\Delta}_1\{\eta_0 + \gamma(\eta - \eta_0)\}[T,X] \bigm|_{\gamma=0} \Bigm| Y = 1 \Bigr) = -E\Bigl\{ w(X)\, \frac{b(X) - p_1(X)}{p_1(X)\{1 - p_1(X)\}} \Bigm| Y = 1 \Bigr\} = -E\Bigl\{ \frac{b(X) - p_1(X)}{p_1(X)\{1 - p_1(X)\}} \Bigm| Y = 0 \Bigr\}. \quad (S.5)
\]
Now, similarly to equation (35), we have
\[
\partial_\gamma \log \widetilde{\mathrm{OR}}\{\eta_0 + \gamma(\eta - \eta_0)\}[X] \bigm|_{\gamma=0} = \frac{b(X) - p_1(X)}{p_1(X)\{1 - p_1(X)\}} - \frac{a(X) - p_0(X)}{p_0(X)\{1 - p_0(X)\}}. \quad (S.6)
\]
Therefore, the conclusion follows from equations (S.1) and (S.4) to (S.6).
$\square$
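Theorem 4's orthogonality claim, that the derivative of the mean score with respect to the nuisances vanishes at the truth, can be checked by finite differences in a stripped-down version with $X$ suppressed, so that the nuisances $(a, b, \tilde{w})$ are scalars. All the numbers below are made up:

```python
import numpy as np

# Scalar sketch (X suppressed) of the orthogonality property in theorem 4.
p0, p1 = 0.3, 0.6                  # true P(T=1|Y=0), P(T=1|Y=1)
a1, b1, w1 = 0.4, 0.5, 1.3         # a perturbed nuisance eta; truth is (p0, p1, 1)
beta0 = np.log(p1 * (1 - p0) / ((1 - p1) * p0))   # true log odds ratio

def mean_score(g):
    """E[psi] at (beta0, eta0 + g*(eta - eta0)), in closed form."""
    a = p0 + g * (a1 - p0)
    b = p1 + g * (b1 - p1)
    w = 1.0 + g * (w1 - 1.0)
    log_or = np.log(b * (1 - a) / ((1 - b) * a))
    return (log_or - beta0 + (a - p0) / (a * (1 - a))
            + w * (p1 - b) / (b * (1 - b)))

def naive_score(g):
    """Plug-in without the correction terms (not Neyman orthogonal)."""
    a = p0 + g * (a1 - p0)
    b = p1 + g * (b1 - p1)
    return np.log(b * (1 - a) / ((1 - b) * a)) - beta0

eps = 1e-5
deriv = (mean_score(eps) - mean_score(-eps)) / (2 * eps)
naive_deriv = (naive_score(eps) - naive_score(-eps)) / (2 * eps)

assert abs(deriv) < 1e-8          # derivative vanishes at the truth
assert abs(naive_deriv) > 0.1     # the uncorrected plug-in is not orthogonal
```

The contrast with the uncorrected plug-in shows why the correction terms $\tilde{\Delta}_0$ and $\tilde{\Delta}_1$ matter: they cancel the first-order effect of nuisance estimation error.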
Proof of Theorem 5: As in the previous proofs, we focus on $\beta(0) \equiv \beta$. The case of $\beta(1)$ is similar. We verify Assumptions 3.1 and 3.2 of DML (Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins, 2018). Using the notation of DML, $\psi(W; \beta, \eta) = \tilde{F}_y(\eta)[Y,T,X]$ with $W = (Y,T,X)$. Then our case belongs to that of linear scores, namely
\[
\psi(W; \theta, \eta) = \psi^a(W; \eta)\, \theta + \psi^b(W; \eta),
\]
where
\[
\psi^a(W; \eta) = -\frac{1-Y}{1-h}, \qquad \psi^b(W; \eta) = \frac{1-Y}{1-h} \Bigl[ \log \widetilde{\mathrm{OR}}(\eta)[X] - \frac{T - a(X)}{a(X)\{1 - a(X)\}} \Bigr] + \frac{Y}{h}\, \tilde{w}(\eta)[X]\, \frac{T - b(X)}{b(X)\{1 - b(X)\}}.
\]

Verification of Assumption 3.1 of DML. Under assumptions F and G, Assumption 3.1 of DML is satisfied with $\lambda_N = 0$: parts (a) and (b) hold by the definition of $\psi$; part (c) is by assumptions F and G; part (d) is by Theorem 4; part (e) follows because $E[\psi^a(W; \eta)] = -1$ for every $\eta$.

Verification of Assumption 3.2 (b) of DML. It holds trivially that $|\psi^a(W; \eta)|$ is bounded by a constant uniformly in $\eta$. Moreover, by assumption F, there is a constant $c < \infty$ such that $|\psi(W; \beta_0, \eta)| \le c$ uniformly in $\eta$ almost surely.

Verification of Assumption 3.2 (d) of DML. Note that
\[
E[\psi^2(W; \beta_0, \eta_0)] \ge \frac{1}{1-h}\, E\Bigl[ \{\log \mathrm{OR}(X) - \beta_0\}^2 + \frac{1}{p_0(X)\{1 - p_0(X)\}} \Bigm| Y = 0 \Bigr],
\]
which is bounded from below by a constant under assumption F.

Since Assumption 3.2 (a) of DML is the definition of the first-stage estimator, Theorem 5 follows immediately from Theorems 3.1 and 3.2 of DML, provided that we verify the remaining Assumption 3.2 (c) of DML.

Verification of Assumption 3.2 (c) of DML. Using the notation of DML, define
\[
r_n := \sup_{\eta \in \mathcal{T}_N} \bigl| E[\psi^a(W; \eta) - \psi^a(W; \eta_0)] \bigr|, \qquad r'_n := \sup_{\eta \in \mathcal{T}_N} \bigl( E[|\psi(W; \beta_0, \eta) - \psi(W; \beta_0, \eta_0)|^2] \bigr)^{1/2},
\]
\[
\lambda'_n := \sup_{\gamma \in (0,1),\, \eta \in \mathcal{T}_N} \bigl| \partial^2_\gamma E[\psi(W; \beta_0, \eta_0 + \gamma(\eta - \eta_0))] \bigr|.
\]

Step 1. Note that $r_n = 0$ because $\psi^a(W; \eta)$ does not depend on $\eta$.

Step 2. Now write
\[
\bigl( E[|\psi(W; \beta_0, \eta) - \psi(W; \beta_0, \eta_0)|^2] \bigr)^{1/2} = \|\psi(W; \beta_0, \eta) - \psi(W; \beta_0, \eta_0)\|_{P,2} \le \|\mathcal{T}_1\|_{P,2} + \|\mathcal{T}_2\|_{P,2},
\]
where
\[
\mathcal{T}_1 := \frac{1-Y}{1-h} \Bigl[ \log \widetilde{\mathrm{OR}}(\eta)[X] - \frac{T - a(X)}{a(X)\{1 - a(X)\}} \Bigr] - \frac{1-Y}{1-h} \Bigl[ \log \mathrm{OR}(X) - \frac{T - p_0(X)}{p_0(X)\{1 - p_0(X)\}} \Bigr],
\]
\[
\mathcal{T}_2 := \frac{Y}{h}\, \tilde{w}(\eta)[X] \Bigl[ \frac{T - b(X)}{b(X)\{1 - b(X)\}} \Bigr] - \frac{Y}{h}\, w(X) \Bigl[ \frac{T - p_1(X)}{p_1(X)\{1 - p_1(X)\}} \Bigr].
\]
Then, in view of assumptions F and H, there exists a sequence $\tilde{\delta}_n \to 0$ such that
\[
\bigl( E\{ |\psi(W; \beta_0, \eta) - \psi(W; \beta_0, \eta_0)|^2 \} \bigr)^{1/2} \le \tilde{\delta}_n
\]
holds with probability at least $1 - \tau_n$. This implies that we can take $r'_n = \tilde{\delta}_n$.

Step 3. Define $a_\gamma(X) := p_0(X) + \gamma\{a(X) - p_0(X)\}$ and $b_\gamma(X) := p_1(X) + \gamma\{b(X) - p_1(X)\}$. Note that
\[
\partial_\gamma \log \widetilde{\mathrm{OR}}\{\eta_0 + \gamma(\eta - \eta_0)\}[X] = \frac{\partial_\gamma \widetilde{\mathrm{OR}}\{\eta_0 + \gamma(\eta - \eta_0)\}[X]}{\widetilde{\mathrm{OR}}\{\eta_0 + \gamma(\eta - \eta_0)\}[X]} = \frac{b(X) - p_1(X)}{b_\gamma(X)\{1 - b_\gamma(X)\}} - \frac{a(X) - p_0(X)}{a_\gamma(X)\{1 - a_\gamma(X)\}}.
\]
In addition,
\[
\partial_\gamma \Bigl[ \frac{T - a_\gamma(X)}{a_\gamma(X)\{1 - a_\gamma(X)\}} \Bigr] = -\frac{a(X) - p_0(X)}{a_\gamma(X)\{1 - a_\gamma(X)\}} - \frac{\{T - a_\gamma(X)\}\{1 - 2 a_\gamma(X)\}}{\bigl[ a_\gamma(X)\{1 - a_\gamma(X)\} \bigr]^2} \{a(X) - p_0(X)\},
\]
\[
\partial_\gamma \Bigl[ \tilde{w}\{\eta_0 + \gamma(\eta - \eta_0)\}[X]\, \frac{T - b_\gamma(X)}{b_\gamma(X)\{1 - b_\gamma(X)\}} \Bigr] = \frac{T - b_\gamma(X)}{b_\gamma(X)\{1 - b_\gamma(X)\}} (\eta - \eta_0)[X] - \tilde{w}\{\eta_0 + \gamma(\eta - \eta_0)\}[X]\, \frac{b(X) - p_1(X)}{b_\gamma(X)\{1 - b_\gamma(X)\}} - \tilde{w}\{\eta_0 + \gamma(\eta - \eta_0)\}[X]\, \frac{\{T - b_\gamma(X)\}\{1 - 2 b_\gamma(X)\}}{\bigl[ b_\gamma(X)\{1 - b_\gamma(X)\} \bigr]^2} \{b(X) - p_1(X)\}.
\]
Combining these yields
\[
\partial_\gamma \psi(W; \beta_0, \eta_0 + \gamma(\eta - \eta_0)) = \frac{1-Y}{1-h} \Bigl[ \frac{b(X) - p_1(X)}{b_\gamma(X)\{1 - b_\gamma(X)\}} + \frac{\{T - a_\gamma(X)\}\{1 - 2 a_\gamma(X)\}}{\bigl[ a_\gamma(X)\{1 - a_\gamma(X)\} \bigr]^2} \{a(X) - p_0(X)\} \Bigr] + \frac{Y}{h} \Bigl[ \frac{T - b_\gamma(X)}{b_\gamma(X)\{1 - b_\gamma(X)\}} (\eta - \eta_0)[X] - \tilde{w}\{\eta_0 + \gamma(\eta - \eta_0)\}[X]\, \frac{b(X) - p_1(X)}{b_\gamma(X)\{1 - b_\gamma(X)\}} - \tilde{w}\{\eta_0 + \gamma(\eta - \eta_0)\}[X]\, \frac{\{T - b_\gamma(X)\}\{1 - 2 b_\gamma(X)\}}{\bigl[ b_\gamma(X)\{1 - b_\gamma(X)\} \bigr]^2} \{b(X) - p_1(X)\} \Bigr].
\]
If we take another derivative in $\gamma$, each term of the resulting second-order derivative can be bounded in absolute value by a constant times $\chi(a,b)$, which is defined to be equal to
\[
\max \Bigl[ \{a(X) - p_0(X)\}^2,\ \{b(X) - p_1(X)\}^2,\ \bigl| (\eta - \eta_0)[X] \{b(X) - p_1(X)\} \bigr| \Bigr].
\]
Therefore, there exists a universal constant $C < \infty$ such that
\[
\bigl| \partial^2_\gamma E[\psi(W; \beta_0, \eta_0 + \gamma(\eta - \eta_0))] \bigr| \le C\, E\{\chi(a,b)\}.
\]
Then, by assumption H, there exists a sequence $\tilde{\delta}'_n \to 0$ such that
\[
\sup_{\gamma \in (0,1),\, \eta \in \mathcal{T}_N} \bigl| \partial^2_\gamma E[\psi(W; \beta_0, \eta_0 + \gamma(\eta - \eta_0))] \bigr| \le \tilde{\delta}'_n\, n^{-1/2}
\]
holds with probability at least $1 - \tau_n$. Therefore, we can take $\lambda'_n = \tilde{\delta}'_n\, n^{-1/2}$. $\square$

Reference

Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018): "Double/debiased machine learning for treatment and structural parameters," Econometrics Journal, 21, C1–C68.
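To make the DML recipe concrete, here is a hedged, self-contained sketch of a cross-fitted estimator of $\beta(0) = E[\log \mathrm{OR}(X) \mid Y = 0]$ for a discrete covariate, with the nuisances $(a, b, \tilde{w})$ estimated by simple cell frequencies. Everything below (the data-generating numbers, the two-fold split, the frequency estimators) is our own illustrative choice, not the paper's implementation:

```python
import numpy as np

# Illustrative case-control DGP: discrete X with 5 support points (made up).
rng = np.random.default_rng(3)
K = 5
f0 = np.array([0.30, 0.25, 0.20, 0.15, 0.10])   # f(x | Y=0)
f1 = np.array([0.10, 0.15, 0.20, 0.25, 0.30])   # f(x | Y=1)
p0x = np.linspace(0.20, 0.40, K)                # P(T=1 | X=x, Y=0)
p1x = np.linspace(0.35, 0.70, K)                # P(T=1 | X=x, Y=1)
beta_true = float(f0 @ np.log(p1x * (1 - p0x) / ((1 - p1x) * p0x)))

n = 100_000                                      # balanced case-control sample
y = rng.binomial(1, 0.5, n)
x = np.where(y == 1, rng.choice(K, n, p=f1), rng.choice(K, n, p=f0))
t = rng.binomial(1, np.where(y == 1, p1x[x], p0x[x]))

folds = rng.integers(0, 2, n)                    # two-fold cross-fitting
num, den = np.empty(n), np.empty(n)
for k in (0, 1):
    tr, te = folds != k, folds == k
    # First-stage nuisances from the training fold (cell frequencies):
    a = np.array([t[tr & (y == 0) & (x == j)].mean() for j in range(K)])
    b = np.array([t[tr & (y == 1) & (x == j)].mean() for j in range(K)])
    w = np.array([np.mean(x[tr & (y == 0)] == j) /
                  np.mean(x[tr & (y == 1)] == j) for j in range(K)])
    h = y[tr].mean()
    log_or = np.log(b[x] * (1 - a[x]) / ((1 - b[x]) * a[x]))
    psi_b = ((1 - y) / (1 - h) * (log_or - (t - a[x]) / (a[x] * (1 - a[x])))
             + y / h * w[x] * (t - b[x]) / (b[x] * (1 - b[x])))
    num[te], den[te] = psi_b[te], ((1 - y) / (1 - h))[te]

# Linear score: psi = psi_a * beta + psi_b with psi_a = -(1-Y)/(1-h),
# so setting the sample mean of psi to zero gives:
beta_hat = num.mean() / den.mean()
assert abs(beta_hat - beta_true) < 0.1
```

With smooth or high-dimensional $X$, the cell frequencies would be replaced by machine-learning fits of $P(T=1 \mid X, Y=y)$ and of the density ratio, which is exactly where the cross-fitting and the orthogonal score earn their keep.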