Selecting and ranking individualized treatment rules with unmeasured confounding
Bo Zhang, Jordan Weiss, Dylan S. Small, Qingyuan Zhao*
University of Pennsylvania and University of Cambridge
Abstract: It is common to compare individualized treatment rules based on the value function, which is the expected potential outcome under the treatment rule. Although the value function is not point-identified when there is unmeasured confounding, it still defines a partial order among the treatment rules under Rosenbaum's sensitivity analysis model. We first consider how to compare two treatment rules with unmeasured confounding in the single-decision setting and then use this pairwise test to rank multiple treatment rules. We consider how to, among many treatment rules, select the best rules and select the rules that are better than a control rule. The proposed methods are illustrated using two real examples, one about the benefit of malaria prevention programs to different age groups and another about the effect of late retirement on senior health in different gender and occupation groups.
Keywords: Multiple testing; Observational studies; Partial order; Policy discovery; Sensitivity analysis.

*Correspondence to: Qingyuan Zhao, Statistical Laboratory, University of Cambridge, Wilberforce Road, Cambridge, CB3 0WB, UK. Email: [email protected].

1 Introduction
A central statistical problem in precision medicine and health policy is to learn treatment rules that are tailored to the patient's characteristics. There is now an exploding literature on individualized policy discovery; see Kosorok and Laber [2019] for an up-to-date review. Although randomized experiments remain the gold standard for causal inference, there has been a growing interest in using observational data to draw causal conclusions and discover individualized treatment rules due to the increasing availability of electronic health records and other observational data sources [Moodie et al., 2012, Athey and Wager, 2017, Kallus, 2017, Wu et al., 2019, Zhao et al., 2019b].

A common way to formulate the problem of individualized policy discovery is via the value function, which is the expected potential outcome under a treatment rule or regime. The optimal treatment rule is usually defined as the one that maximizes the value function. In the single-decision setting, the value function can be easily identified when the data come from a randomized experiment (as long as the probability of receiving treatment is never 0 or 1). When the data come from an observational study, the value function can still be identified under the assumption that all confounders are measured. This assumption can be further extended to the multiple-decision setting [Murphy, 2003, Robins, 2004]. In this paper we will focus our discussion on the single-decision setting but consider the possibility of unmeasured confounding.

With few exceptions, the vast majority of existing methods for treatment rule discovery from observational data are based on the no unmeasured confounding assumption. Typically, these methods first estimate the value function assuming no unmeasured confounding and then select the treatment rule that maximizes the estimated value function. However, it is common that a substantial fraction of the population appear to behave similarly under treatment or control.
From a statistical perspective, if there is truly no unmeasured confounder, we should still attempt to estimate the treatment effect for individuals in this subpopulation and optimize the treatment rule accordingly. However, the optimal treatment decisions for these individuals are, intuitively, also the most sensitive to unmeasured confounding. It may only take a small amount of unmeasured confounding to change the sign of the estimated treatment effects for these individuals. From a policy perspective (especially when there is a cost constraint), learning the "optimal" treatment decision for these individuals from observational data seems likely to be error-prone.

1.1 Sensitivity analysis for individualized treatment rules
There is a long literature on studying the sensitivity of observational studies to unmeasured confounding, dating from Cornfield et al. [1959]. In short, such sensitivity analysis asks how much unmeasured confounding is needed to qualitatively alter the causal conclusion of an observational study. In this paper, we will study the sensitivity of individualized treatment rules to unmeasured confounding using a prominent model proposed by Rosenbaum [1987], where the odds ratio of receiving the treatment for any two individuals with the same observed covariates is bounded between 1/Γ and Γ (Γ ≥ 1; Γ = 1 corresponds to no unmeasured confounding). More specifically, we will consider selecting and ranking individualized treatment rules under Rosenbaum's model for unmeasured confounding.

Our investigation is motivated by the impact of effect modification on the power of Rosenbaum's sensitivity analysis that is studied by Hsu et al. [2013]. A phenomenon found by Hsu et al. [2013] is that subgroups with larger treatment effect may have larger design sensitivity. For example, suppose a subgroup A has a larger treatment effect than a subgroup B based on observational data. Then, there may exist a Γ > 1 at which the conclusion for subgroup A is still insensitive to unmeasured confounding but the conclusion for subgroup B is not. This suggests that the value function only defines a partial order on the set of individualized treatment rules, so two rules with different value functions assuming no unmeasured confounding may become indistinguishable under the Γ-sensitivity model when Γ > 1. Fundamentally, the reason is that the value function is only partially identified in Rosenbaum's Γ-sensitivity model.

As an example, let us use r ≻_Γ r′ (abbreviated as r ≻ r′ if the value of Γ is clear from the context) to denote that the value of rule r is always greater than the value of r′ when the unmeasured confounding satisfies the Γ-sensitivity model. Then, it is possible that:

- Under Γ = 1, r1 ≻ r2 ≻ r3 (so r1 ≻ r3);
- Under some Γ > 1, r1 ≻ r3 but r1 ⊁ r2.

This phenomenon occurs frequently in real data examples; see Figure 1 in Section 3.2. Note that the relation ≻ is defined using the value function computed using the population instead of a particular sample.

Because the value function only defines a partial order on the treatment rules, it is no longer well-defined to estimate the optimal treatment rule when there is unmeasured confounding. Instead, we aim to recover the partial ordering of a set of treatment rules or select a subset of rules that satisfy certain statistical properties. This problem is related to the problem of selecting and ranking subpopulations (as a post hoc analysis for randomized experiments), which has been extensively studied in statistics [Gupta and Panchapakesan, 1979, Gibbons et al., 1999]. Unfortunately, in problems considered by the existing literature, the subpopulations always have a total order. For example, a prototypical problem in that literature is to select a subset that contains the largest μ_i based on independent observations Y_i ∼ N(μ_i, 1).

Existing methods for individualized policy discovery from observational data often take an exploratory stance.
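To make the partial-order phenomenon described above concrete, here is a toy numerical sketch (our own illustration with made-up numbers, not the paper's data or estimator): pretend each pairwise value difference is only identified up to an interval whose half-width grows with Γ, and declare one rule dominated by another only when the whole interval for the value difference is positive.

```python
# Toy illustration (made-up numbers): a total order at Gamma = 1 can
# degrade into a partial order at Gamma > 1. We pretend the value
# difference V(r_j) - V(r_i) is only identified up to +/- half_width,
# where the half-width grows with Gamma.

def dominated(delta_hat, gamma, scale=1.0):
    """r_i is dominated by r_j only if the whole identified interval
    for V(r_j) - V(r_i) lies above zero."""
    half_width = scale * (gamma - 1.0) / (gamma + 1.0)  # toy width
    return delta_hat - half_width > 0

# Hypothetical point estimates of V(r_j) - V(r_i) for three rules.
deltas = {("r1", "r2"): 0.30, ("r2", "r3"): 0.10, ("r1", "r3"): 0.40}

for gamma in (1.0, 1.5):
    order = sorted(pair for pair, d in deltas.items() if dominated(d, gamma))
    print(gamma, order)
# At Gamma = 1 all three orderings hold, a total order; at Gamma = 1.5
# the small difference between r2 and r3 no longer survives, so those
# two rules become incomparable -- a partial order.
```

The exact pattern here is invented for illustration, but it shows the same mechanism: orderings supported by small value differences are the first to disappear as Γ grows.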
These exploratory methods often aim to select the individualized treatment rule, often within an infinite-dimensional function class, that maximizes the estimated value function, using outcome regression [Robins, 2004, Qian and Murphy, 2011], inverse-probability weighting [Zhao et al., 2012, Kallus, 2017], or doubly-robust estimation [Dudík et al., 2014, Athey and Wager, 2017]. In order to estimate the value function, some parametric or semiparametric models are specified to model the outcome and/or the treatment selection process. To identify the value function, the vast majority of these approaches make the no unmeasured confounding assumption, which may be unrealistic in many applications. The only exception to our knowledge is Kallus and Zhou [2018], in which the authors propose to maximize the minimum value of an individualized treatment rule when the unmeasured confounding satisfies a marginal sensitivity model [Tan, 2006, Zhao et al., 2019a]. This is further extended to the estimation of the conditional average treatment effect with unmeasured confounding in Kallus et al. [2018]. Another related work is Yadlowsky et al. [2018], who consider semiparametric inference for the average treatment effect in Rosenbaum's sensitivity model.

In this paper we take a different perspective. Our approach is based on a statistical test to compare the values of two individualized treatment rules when there is limited unmeasured confounding. Briefly speaking, we first match the treated and control observations by the observed covariates and then propose to use Rosenbaum's sensitivity model to quantify the magnitude of unmeasured confounding after matching (the deviation of the matched observational study from a pairwise randomized experiment). At the core of our proposal is a randomization test introduced by Fogarty [2016] to compare the values of two individualized treatment rules in Rosenbaum's sensitivity model.
Based on this test, we introduce a framework to rank and select treatment rules within a given finite collection and show that different statistical errors can be controlled with the appropriate multiple hypothesis testing methods.

In principle, our framework can be used with an arbitrary (finite) number of pre-specified treatment rules. In practice, it is more suitable for small-scale policy discovery with relatively few decision variables, where it is not needed to use machine learning methods to discover complex patterns, or where such methods have already been employed in a preliminary study to suggest a few candidate rules. The design-based nature of our approach makes it particularly useful for confirmatory analyses, the importance of which is widely acknowledged in the policy discovery literature [e.g. Kallus, 2017, Zhang et al., 2018, Kosorok and Laber, 2019]. Methods proposed in this paper thus complement the existing literature on individualized treatment rules by providing a way to confirm the effectiveness of a treatment rule learned from observational data and assess its robustness to unmeasured confounding. When there are several competing treatment rules, our framework further helps the decision maker select or rank the treatment rules using flexible criteria.

The rest of the paper is organized as follows. In Section 2 we introduce a real data example that will be used to illustrate the proposed methods. We then introduce some notation and discuss how to compare two treatment rules when there is unmeasured confounding. In Section 3 we consider three questions about ranking and selecting among multiple treatment rules. We compare our proposal with some baseline procedures in a simulation study in Section 4 and apply our method to another application using data from the Health and Retirement Study in Section 5. Finally, we conclude the paper with some brief discussion in Section 6.
2 Comparing treatment rules with unmeasured confounding
The Garki Project, conducted by the World Health Organization and the Government of Nigeria from 1969 to 1976, was an observational study that compared several strategies to control malaria. Hsu et al. [2013] studied the effect modification for one of the malaria control strategies, namely spraying with an insecticide, propoxur, together with mass administration of a drug, sulfalene-pyrimethamine, at high frequency. The outcome is the difference between the frequency of Plasmodium falciparum in blood samples, that is, the frequency of a protozoan parasite that causes malaria, measured before and after the treatment. Using 1560 pairs of treated and control individuals matched by their age and gender, Hsu et al. [2013] found that the treatment was much more beneficial for young children than for other individuals, if there is no unmeasured confounding.

More interestingly, they found that, despite the reduced sample size, the 447 pairs of young children exhibit a treatment effect that is far less sensitive to unmeasured confounding bias than the full sample of 1560 pairs. So from a policy perspective, it may be preferable to implement the treatment only for young children rather than the whole population. In the rest of this paper we will generalize this idea to selecting and ranking treatment rules. We will use the matched dataset in Hsu et al. [2013] to illustrate the definitions and methodologies in the paper; see the original article for more information about the Garki Project dataset and the matched design. A different application concerning the effect of late retirement on health outcomes will be presented in Section 5 near the end of this article.
We first introduce some notation in order to compare treatment rules when there is unmeasured confounding. Let X denote all the pre-treatment covariates measured by the investigator. In the single-decision setting considered in this paper, an individualized treatment rule (or treatment regime) d maps a vector of pre-treatment covariates X to a binary treatment decision in {0, 1} (0 indicates control and 1 indicates treatment). In our running example, we shall consider six treatment rules, r0, r1, ..., r5, where r_i assigns treatment to the youngest i × 20% of the individuals. Specifically, the minimum, the 20%, 40%, 60%, 80% quantiles, and the maximum of age are 0 (newborn), 7, 20, 31, 41, and 73 years old.

Let Y be the outcome and Y(0), Y(1) be the potential outcomes under control and treatment. The potential outcome under a treatment rule d is defined, naturally, as Y(d) = Y(0)1{d(X) = 0} + Y(1)1{d(X) = 1}. A common way to compare treatment rules is to use the value function, defined as the expected potential outcome under that rule, V(d) = E[Y(d)]. The value difference of two treatment rules, r1 and r2, is thus

V(r1) − V(r2) = E[Y(r1) − Y(r2) | r1 ≠ r2] · P(r1 ≠ r2)
= E[Y(1) − Y(0) | r1 > r2] · P(r1 > r2) − E[Y(1) − Y(0) | r1 < r2] · P(r1 < r2), (1)

where for simplicity the event r1(X) ≠ r2(X) is abbreviated as r1 ≠ r2 (similarly for r1 < r2 and r1 > r2). Note that the event r1 > r2 is the same as r1 = 1, r2 = 0 because the treatment decision is binary. One of the terms on the right-hand side of (1) will become 0 if the treatment rules are nested. In the malaria example, r0 ≤ r1 ≤ ··· ≤ r5, so the value difference of the rules r2 and r1 can be written as V(r2) − V(r1) = E[Y(1) − Y(0) | Age ∈ [7, 20)] · P(Age ∈ [7, 20)). In this case, testing the sign of V(r2) − V(r1) is equivalent to testing the sign of the conditional average treatment effect, E[Y(1) − Y(0) | r2 > r1].

The definition of the value function depends on the potential outcomes. To identify the value function using observational data, it is standard to make the following assumptions [Kosorok and Laber, 2019]:

1. Positivity: P(A = a | X = x) > 0 for all a and x;
2. Consistency (SUTVA): Y = Y(A);
3. Ignorability (no unmeasured confounding): Y(a) ⊥ A | X for all a.

Under these conditions, it is straightforward to show that the value function is identified by [Qian and Murphy, 2011]

V(d) = E[ Y · I(A = d(X)) / π(A, X) ],

where I is the indicator function of an event and π(a, x) = P(A = a | X = x) is the propensity score.

The value function gives a natural and total order to the treatment rules. If the above identification assumptions hold, the value functions can be identified and thus this order can be consistently estimated as the sample size increases to infinity. In general, it is impossible to recover this order when there is unmeasured confounding. However, if the magnitude of unmeasured confounding is bounded according to a sensitivity model (a collection of distributions of the observed variables and unobserved potential outcomes), it is possible to partially identify the difference between the values of two treatment rules and thus obtain a partial order.

Definition 1. Let r1 and r2 be two treatment rules that map a vector of pre-treatment covariates X to a binary treatment decision in {0, 1}, and V(r1), V(r2) their corresponding value functions. Given a sensitivity analysis model indexed by Γ, we say that the rule r1 is dominated by r2 with a margin δ if V(r2) − V(r1) > δ for all distributions in the sensitivity analysis model. We denote this relation as r1 ≺_{Γ,δ} r2 and further abbreviate it as r1 ≺_Γ r2 if δ = 0. We write r1 ⊀_Γ r2 if r1 is not dominated by r2 with margin δ = 0.

Notice that the partial order should be defined in terms of the partially identified interval for V(r2) − V(r1) instead of the partially identified intervals for V(r1) and V(r2). This is because the same distribution of the unobserved potential outcomes needs to be used when computing the partially identified interval for V(r2) − V(r1), so it is not simply the difference between the partially identified intervals for the individual values (the easiest way to see this is to take r1 = r2). We thank an anonymous reviewer for pointing this out.

It is easy to see that ≺_Γ is a strict partial order on the set of treatment rules because it satisfies irreflexivity (not r ≺_Γ r), transitivity (r1 ≺_Γ r2 and r2 ≺_Γ r3 imply r1 ≺_Γ r3), and asymmetry (r1 ≺_Γ r2 implies not r2 ≺_Γ r1). In Rosenbaum's sensitivity model, to be introduced in the section below, Γ = 1 corresponds to no unmeasured confounding and thus the relationship ≺_{Γ=1} is a total order.

Testing r1 ⊀_Γ r2 using matched observational studies

With the goal of selecting and ranking treatment rules with unmeasured confounding in mind, in this section we consider the easier but essential task of comparing the values of two treatment rules, r1 and r2, under Rosenbaum's sensitivity model. This test will then serve as the basic element of our procedures for selecting and ranking among multiple treatment rules below.
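Before turning to the matched-pairs setup, the identification formula V(d) = E[Y · I(A = d(X)) / π(A, X)] above can be checked in a short simulation. The sketch below is our own illustration (synthetic data, known propensity score, all variable names hypothetical), assuming no unmeasured confounding:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Synthetic observational data; in practice pi would be estimated.
X = rng.uniform(0.0, 1.0, n)
pi_treat = 0.3 + 0.4 * X                   # P(A = 1 | X), known here
A = rng.binomial(1, pi_treat)
Y = 2.0 * A * X + rng.normal(0.0, 1.0, n)  # effect 2X, so E[Y(1)] = 1

def value_ipw(d, Y, A, X, pi_treat):
    """Inverse-probability-weighted estimate of V(d) = E[Y(d)]."""
    pi_a = np.where(A == 1, pi_treat, 1.0 - pi_treat)  # pi(A, X)
    keep = (A == d(X)).astype(float)                   # I(A = d(X))
    return float(np.mean(Y * keep / pi_a))

treat_all = lambda x: np.ones_like(x, dtype=int)
treat_none = lambda x: np.zeros_like(x, dtype=int)
print(value_ipw(treat_all, Y, A, X, pi_treat))   # close to E[2X] = 1
print(value_ipw(treat_none, Y, A, X, pi_treat))  # close to 0
```

With a known propensity score and ignorability, the weighted average recovers the value of each rule up to sampling noise; the rest of this section is about what happens when ignorability fails.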
We will first introduce the pair-matched design of an observational study and Rosenbaum's sensitivity model, and then describe a studentized sensitivity analysis proposed by Fogarty [2016] that tests Neyman's null hypothesis of the average treatment effect being zero under Rosenbaum's sensitivity model. This test can be immediately extended to compare the values of treatment rules.

Suppose the observed data are n pairs, i = 1, 2, ..., n, of two subjects, j = 1, 2. These n pairs are matched for observed covariates X and within each pair, one subject is treated, denoted A_ij = 1, and the other control, denoted A_ij = 0, so that we have X_i1 = X_i2 and A_i1 + A_i2 = 1 for all i. In a sensitivity analysis, we may fail to match on an unobserved confounder U_ij and thus incur unmeasured confounding bias.

Rosenbaum [1987, 2002] proposed a one-parameter sensitivity model. Let F = {(Y_ij(0), Y_ij(1), X_ij, U_ij), i = 1, ..., n, j = 1, 2} be the collection of all measured or unmeasured variables other than the treatment assignment. Rosenbaum's sensitivity model assumes that π_i = P(A_i1 = 1 | F) satisfies

1/(1 + Γ) ≤ π_i ≤ Γ/(1 + Γ), i = 1, 2, ..., n. (2)

When Γ = 1, this model asserts that π_i = 1/2 for all i and thus every subject has equal probability of being assigned to treatment or control (i.e., no unmeasured confounding). In general, Γ > 1 allows the two subjects in a matched pair to have different probabilities of receiving the treatment, reflecting bias from the unmeasured confounder. Sensitivity analyses in this model are classically based on randomization tests of Fisher's sharp null hypothesis of no treatment effect, that is, Y_ij(0) = Y_ij(1) for all i, j. The asymptotic properties of these randomization tests are studied in Rosenbaum [2004, 2015] and Zhao [2018].

In the context of comparing individualized treatment rules, Fisher's sharp null hypothesis is no longer suitable because we expect to have (and indeed are tasked to find) heterogeneous treatment effects. Recently, Fogarty [2016] developed a valid studentized test for Neyman's null hypothesis that the average treatment effect is equal to zero, (2n)^{-1} Σ_{ij} [Y_ij(1) − Y_ij(0)] = 0, under Rosenbaum's sensitivity model. We briefly describe Fogarty's test. Let D_i denote the treated-minus-control difference in the i-th matched pair, D_i = (A_i1 − A_i2)(Y_i1 − Y_i2). Fix the sensitivity parameter Γ and define

D_{i,Γ} = D_i − ((Γ − 1)/(Γ + 1)) |D_i|,  D̄_Γ = (1/n) Σ_{i=1}^n D_{i,Γ},  and  se(D̄_Γ)² = (1/(n(n − 1))) Σ_{i=1}^n (D_{i,Γ} − D̄_Γ)².

Fogarty [2016] showed that the one-sided Student-t test that rejects Neyman's hypothesis when

D̄_Γ / se(D̄_Γ) > Φ^{-1}(1 − α)

is asymptotically valid with level α under Rosenbaum's sensitivity model (2) and mild regularity conditions. This test can be easily extended to test the null that the average treatment effect is no greater than δ by replacing D_i with D_i − δ. Fogarty [2016] also provided a randomization-based reference distribution in addition to the large-sample normal approximation.

The above test for the average treatment effect can be readily extended to comparing treatment rules. Recall that equation (1) implies the value difference of two rules r1 and r2 is a weighted difference of two conditional average treatment effects on the sets r1 > r2 and r2 > r1. When the two rules are nested (without loss of generality assume r2 ≥ r1), testing the null hypothesis that r1 ⊀_Γ r2 is equivalent to testing a Neyman-type hypothesis E[Y(1) − Y(0) | r2 > r1] ≤ 0 using only the matched pairs i that satisfy r2(X_i) > r1(X_i). When the two rules are not nested, we can flip the sign of D_i for those i such that r2(X_i) < r1(X_i) and then apply Fogarty's test. In summary, to test the null hypothesis that r1 ⊀_Γ r2, we can simply apply Fogarty's test to {D_i · [r2(X_i) − r1(X_i)], for i such that r1(X_i) ≠ r2(X_i)}. To test the hypothesis r1 ⊀_{Γ,δ} r2, we can use Fogarty's test for the average treatment effect no greater than δ · (n/m), where m = |{i : r1(X_i) ≠ r2(X_i)}|.

A hallmark of Rosenbaum's sensitivity analysis framework is its tipping-point analysis, and that extends to the comparison of treatment rules.
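Fogarty's studentized test reduces to a few lines of code. The sketch below is our own illustration (synthetic matched-pair differences, not the paper's software), implementing the statistics D_{i,Γ}, D̄_Γ and se(D̄_Γ) described above and sweeping a grid of Γ, which is exactly the kind of sweep a tipping-point analysis performs:

```python
import numpy as np
from statistics import NormalDist

def fogarty_reject(D, gamma, alpha=0.05):
    """One-sided studentized sensitivity analysis for Neyman's null that
    the average treatment effect is <= 0, following the formulas in the
    text. D holds treated-minus-control matched-pair differences."""
    D = np.asarray(D, dtype=float)
    n = len(D)
    D_gamma = D - (gamma - 1.0) / (gamma + 1.0) * np.abs(D)  # worst case
    mean = D_gamma.mean()
    se = np.sqrt(np.sum((D_gamma - mean) ** 2) / (n * (n - 1)))
    return mean / se > NormalDist().inv_cdf(1 - alpha)

# Matched pairs with a strong positive effect (synthetic illustration).
rng = np.random.default_rng(1)
D = rng.normal(0.8, 1.0, 400)

for gamma in (1.0, 2.0, 4.0, 10.0):
    print(gamma, fogarty_reject(D, gamma))
# Rejection typically persists for moderate Gamma and eventually fails,
# because the worst-case shift (Gamma-1)/(Gamma+1)*|D_i| grows with Gamma.
```

To compare two treatment rules rather than test the average effect, the same function can be applied to the signed differences D_i · [r2(X_i) − r1(X_i)] over the pairs with r1(X_i) ≠ r2(X_i), as described in the text.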
When testing r1 ⊀_Γ r2 with a series of Γ, there exists a smallest Γ such that the null hypothesis cannot be rejected; that is, we are no longer confident that r1 is dominated by r2 in that Γ-sensitivity model. This tipping point is commonly referred to as the sensitivity value [Zhao, 2018]. Formally, we define the sensitivity value for r1 ≺ r2 as

Γ*_α(r1 ≺ r2) = inf{Γ ≥ 1 : the null hypothesis V(r1) ≥ V(r2) cannot be rejected at level α under the Γ-sensitivity model}.

Let r0 be the null treatment rule (for example, assigning control to the entire population). The sensitivity value Γ*_α(r0 ≺ r) is further abbreviated as Γ*_α(r).

Zhao [2018] studied the asymptotic properties of the sensitivity value when testing Fisher's sharp null hypothesis using a class of signed score statistics. Below, we will give the asymptotic distribution of Γ*_α(r1 ≺ r2) using Fogarty's test as described in the last section. The result will be stated in terms of a transformation of the sensitivity value,

κ*_α(r1 ≺ r2) = (Γ*_α(r1 ≺ r2) − 1) / (Γ*_α(r1 ≺ r2) + 1),

so that Γ* = 1 is transformed to κ* = 0 and 0 ≤ κ* < 1.

Proposition 1. Assume the treatment rules are nested, r1(x) ≤ r2(x), and let I be the set of indices i where r1(X_i) < r2(X_i). Assuming the moments of |D_i| exist and E[D_i | r1 < r2] > 0, we have

√|I| ( κ*_α(r1 ≺ r2) − E[D_i | r1 < r2] / E[|D_i| | r1 < r2] ) →d N(z_α μ, σ²), as |I| → ∞, (3)

where z_α is the upper-α quantile of the standard normal distribution and the parameters μ and σ² depend on the distribution of D_i (the expressions can be found in the Appendix).

The proof of this proposition can be found in the Appendix. When the treatment rules are not nested, one can simply replace D_i with D_i · [r2(X_i) − r1(X_i)] and the condition r1 < r2 with r1 ≠ r2 in the proposition statement. The asymptotic distribution of Γ*_α(r1 ≺ r2) can be found by the delta method and we omit further detail.

The asymptotic distribution in (3) is similar to the one obtained in Zhao [2018, Thm. 1]. When the treatment rules are nested, r1 ≤ r2, and |I| → ∞, the sensitivity value converges to a number that depends on the distribution of D_i,

Γ*_α(r1 ≺ r2) →p ( E[|D_i| | r1 < r2] + E[D_i | r1 < r2] ) / ( E[|D_i| | r1 < r2] − E[D_i | r1 < r2] ).

The limit on the right-hand side is the design sensitivity [Rosenbaum, 2004] of Fogarty's test for comparing the treatment rules. As the sample size converges to infinity, the power of Fogarty's test converges to 1 at Γ smaller than the design sensitivity and to 0 at Γ larger than the design sensitivity. The normal distribution in (3) further approximates the finite-sample behavior of the sensitivity value and can be used to compute the power of a sensitivity analysis by the fact that rejecting r1 ⊀_Γ r2 at level α is equivalent to Γ*_α(r1 ≺ r2) ≥ Γ.

3 Ranking and selecting multiple treatment rules

Next we consider the problem of comparing multiple treatment rules with unmeasured confounding. To this end, we need to define the goal and the statistical error we would like to control.
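The sensitivity value defined above can be computed numerically. Here is a sketch (our own code, not the paper's), which bisects on Γ assuming the rejection decision of a Fogarty-style test is monotone in Γ; the data are synthetic:

```python
import numpy as np
from statistics import NormalDist

def rejects(D, gamma, alpha=0.05):
    """Fogarty-style one-sided test of 'average effect <= 0' under the
    Gamma-sensitivity model (same construction as in the text)."""
    D = np.asarray(D, dtype=float)
    n = len(D)
    Dg = D - (gamma - 1.0) / (gamma + 1.0) * np.abs(D)
    se = np.sqrt(np.sum((Dg - Dg.mean()) ** 2) / (n * (n - 1)))
    return Dg.mean() / se > NormalDist().inv_cdf(1 - alpha)

def sensitivity_value(D, alpha=0.05, gamma_max=1000.0, tol=1e-4):
    """Approximate Gamma*_alpha: the smallest Gamma at which the test no
    longer rejects, found by bisection (rejection assumed monotone)."""
    if not rejects(D, 1.0, alpha):
        return 1.0
    lo, hi = 1.0, gamma_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rejects(D, mid, alpha):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(2)
D = rng.normal(0.8, 1.0, 400)   # synthetic matched-pair differences
print(sensitivity_value(D))     # finite-sample tipping point
```

For this data-generating process, the design sensitivity (E[|D_i|] + E[D_i]) / (E[|D_i|] − E[D_i]) is roughly 7.7, and the finite-sample sensitivity value sits below it, consistent with the asymptotics in (3).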
A problem related to comparing multiple treatment rules is the selection and ordering of multiple subpopulations [Gupta and Panchapakesan, 1979, Gibbons et al., 1999], for example, given K independent measurements Y_i ∼ N(μ_i, 1), where μ_i is some characteristic of the i-th subpopulation. When comparing the μ_i, there are many goals we can define. In fact, Gibbons et al. [1999, p. 4] gave a list of 7 possible goals for ranking and selection of subpopulations and considered them in the rest of their book. We believe at least 3 out of their 7 goals have practically meaningful counterparts in comparing treatment rules. Given K + 1 treatment rules, R = {r0, r1, ..., rK}, we may ask, in terms of their values:

1. What is the ordering of all the treatment rules?
2. Which treatment rule is the best?
3. Which treatment rule(s) are better than the null/control r0?

In a randomized experiment or an observational study with no unmeasured confounding, it may be possible to obtain estimates of the value that are jointly asymptotically normal and then directly use the methods in Gibbons et al. [1999]. However, as discussed in Section 2, this no longer applies when there is unmeasured confounding because the value function may only be partially identified.

When there is unmeasured confounding, the three goals above need to be modified because the value function only defines a partial order among the treatment rules (Definition 1). We make the following definitions.
Definition 2. In the Γ-sensitivity model, the maximal rules in R are the ones not dominated by any other rule,

R_{max,Γ} = {r_i : r_i ⊀_Γ r_j, ∀j}.

The positive rules are the ones which dominate the control and the null rules are the ones which do not dominate the control,

R_{pos,Γ} = {r_i : r0 ≺_Γ r_i},  R_{nul,Γ} = R \ R_{pos,Γ}.

The maximal set R_{max,Γ} and the null set R_{nul,Γ} are always non-empty (the latter is because r0 ∈ R_{nul,Γ}), become larger as Γ increases, and in general become the full set R as Γ → ∞.

In the rest of this section, we will consider the following three statistical problems: for some pre-specified significance level α > 0,

1. Can we construct a set of ordered pairs of treatment rules, Ô_Γ ⊂ {(r_i, r_j), i, j = 0, ..., K, i ≠ j}, such that the probability that all the orderings are correct is at least 1 − α, that is, P(r_i ≺_Γ r_j, ∀(r_i, r_j) ∈ Ô_Γ) ≥ 1 − α?

2. Can we construct a subset of treatment rules, R̂_{max,Γ}, such that the probability that it contains all maximal rules is at least 1 − α, that is, P(R_{max,Γ} ⊆ R̂_{max,Γ}) ≥ 1 − α?

3. Can we construct a subset of treatment rules, R̂_{pos,Γ}, such that the probability that it does not cover any null rule is at least 1 − α, that is, P(R̂_{pos,Γ} ∩ R_{nul,Γ} = ∅) ≥ 1 − α?

Next, we will propose strategies to achieve the above statistical goals based on the test of two treatment rules with unmeasured confounding described above.

To start with, let us consider the first goal, ordering the treatment rules, as the statistical inference is more straightforward. It is the same as the multiple testing problem where we would like to control the family-wise error rate (FWER) for the collection of hypotheses, {H_ij : r_i ⊀_Γ r_j, i, j = 0, ..., K, i ≠ j}. In principle, we can apply any multiple testing procedure that controls the FWER. A simple example is Bonferroni's correction for all the K(K − 1) tests.

In sensitivity analysis problems, we can often greatly improve the statistical power by reducing the number of tests using a planning sample [Heller et al., 2009, Zhao et al., 2018]. This is because Rosenbaum's sensitivity analysis considers the worst-case scenario and is generally conservative when Γ > 1. The planning sample can be further used to order the hypotheses so we can sequentially test them, for example, using a fixed sequence testing procedure [Koch and Gansky, 1996, Westfall and Krishen, 2001].

There are many possible ways to screen out, order, and then test the hypotheses. Here we demonstrate one possibility:

Step 1: Split the data into two parts. The first part is used for planning and the second part for testing.

Step 2: For every pair of treatment rules (r_i, r_j), use the planning sample to estimate the population parameters in the asymptotic distribution of the sensitivity value (3).

Step 3: Compute the approximate power of testing H_ij : r_i ⊀_Γ r_j in the testing sample using (3). Order the hypotheses by the estimated power, from highest to lowest.

Step 4: Sequentially test the ordered hypotheses using the testing sample at level α, until one hypothesis cannot be rejected.

Step 5: Output a Hasse diagram of the treatment rules by using all the rejected hypotheses.

A Hasse diagram is an informative graph to represent a partial order (in our case, ≺_Γ). In this diagram, each treatment rule is represented by a vertex and an edge goes upward from rule r_i to rule r_j if r_i ≺_Γ r_j and there exists no r_k such that r_i ≺_Γ r_k and r_k ≺_Γ r_j.

Due to the transitivity of a partial order, an upward path from r_i to r_j in the Hasse diagram (for example, in Figure 1 with Γ = 1.3) indicates that r_i ≺_Γ r_j, even if we could not directly reject r_i ⊀_Γ r_j in Step 4. The next proposition shows that the above multiple testing procedure also controls the FWER for all the apparent and implied orders represented by the Hasse diagram.

Proposition 2. Let Ô_Γ ⊂ {(r_i, r_j), i ≠ j} be a random set of ordered treatment rules obtained using the procedure above or any other multiple testing procedure. Let

Ô_{Γ,ext} = Ô_Γ ∪ {(r_i, r_j) : ∃ k_1, ..., k_m such that (r_i, r_{k_1}), (r_{k_1}, r_{k_2}), ..., (r_{k_m}, r_j) ∈ Ô_Γ}

be the extended set implied from the Hasse diagram. Then the FWER with respect to Ô_{Γ,ext} is the same as the FWER with respect to Ô_Γ:

P(r_i ≺_Γ r_j, ∀(r_i, r_j) ∈ Ô_Γ) = P(r_i ≺_Γ r_j, ∀(r_i, r_j) ∈ Ô_{Γ,ext}).

Proof. We show the two events are equivalent. The ⊆ direction is trivial. For ⊇, notice that any false positive in Ô_{Γ,ext}, say r_i ≺_Γ r_j, implies that there is at least one false positive along the path from r_i to r_j, that is, there is at least one false positive among r_i ≺_Γ r_{k_1}, r_{k_1} ≺_Γ r_{k_2}, ..., r_{k_m} ≺_Γ r_j, which are all in Ô_Γ. Thus, any false positive in Ô_{Γ,ext} implies that there is also at least one false positive in Ô_Γ.

We illustrate the proposed method using the malaria dataset. We first use half of the data to estimate the population parameters in (3) for each pair of treatment rules (r_i, r_j). For every value of Γ, we use (3) to compute the asymptotic power for the test of H_ij : r_i ⊀_Γ r_j using the other half of the data. We then order the hypotheses by the estimated power, from the highest to the lowest; the resulting order of the hypotheses H_ij varies with Γ. Finally we follow Steps 4 and 5 above. We obtained Hasse diagrams for a variety of Γ, which are shown in Figure 1. As a baseline for comparison, Figure 2 shows the Hasse diagrams obtained by a simple Bonferroni adjustment for all K(K −
1) = 30 hypotheses using all the data. Although onlyhalf of the data is used to test, ordering the hypotheses not only identified all the discoveries thatthe Bonferroni procedure identified, but also made one extra discovery when Γ = 1 .
3, 1 .
5, 2 .
5, 3 . .
8, 2, and 3. r r r r r r Γ = 1 | ˆ O| = 12 r r r r r r Γ = 1 . | ˆ O| = 10 r r r r r r Γ = 1 . | ˆ O| = 9 r r r r r r Γ = 1 . | ˆ O| = 8 r r r r r r Γ = 2 | ˆ O| = 7 r r r r r r Γ = 2 . | ˆ O| = 6 r r r r r r Γ = 3 . | ˆ O| = 5 r r r r r r Γ = 3 . | ˆ O| = 3 r r r r r r Γ = 4 | ˆ O| = 2 r r r r r r Γ = 6 | ˆ O| = 0Figure 1: Malaria example: Hasse diagrams obtained using sample-splitting and fixed sequencetesting; α = 0 .
1. 15 r r r r r Γ = 1 | ˆ O| = 10 r r r r r r Γ = 1 . | ˆ O| = 9 r r r r r r Γ = 1 . | ˆ O| = 8 r r r r r r Γ = 1 . | ˆ O| = 6 r r r r r r Γ = 2 | ˆ O| = 5 r r r r r r Γ = 2 . | ˆ O| = 5 r r r r r r Γ = 3 | ˆ O| = 3 r r r r r r Γ = 3 . | ˆ O| = 2 r r r r r r Γ = 4 | ˆ O| = 1 r r r r r r Γ = 6 | ˆ O| = 0Figure 2: Malaria example: Hasse diagrams obtained using Bonferroni’s adjustment; α = 0 . Next we consider constructing a set that covers all the maximal rules. Our proposal is based onthe following observation: if the hypothesis r i (cid:54)≺ Γ r j can be rejected, then r i is unlikely a maximalrule. More precisely, because r i ∈ R max , Γ implies that r i (cid:54)≺ Γ r j must be true, by the definition ofthe type I error of a hypothesis test, P ( r i (cid:54)≺ Γ r j is rejected | r i ∈ R max , Γ ) ≤ α. This suggests that we can derive a set of maximal rules from an estimated set of partial orders:ˆ R max , Γ = { r i : ( r i , r j ) (cid:54)∈ ˆ O Γ , ∀ j } . (4)In other words, ˆ R max , Γ contains all the “leaves” in the Hasse diagram of ˆ O Γ (a leaf in the Hassediagram is a vertex who has no edge going upward). For example, in Figure 1, the leaves are { r , r , r } when Γ = 1 . { r , r , r , r } when Γ = 1 .
5. Because {R_{max,Γ} ⊄ R̂_{max,Γ}} = {∃ r_i ∈ R_{max,Γ} such that (r_i, r_j) ∈ Ô_Γ for some j}, the estimated set of maximal rules satisfies P(R_{max,Γ} ⊄ R̂_{max,Γ}) ≤ α, as desired, whenever Ô_Γ strongly controls the FWER at level α.

Equation (4) suggests that only one hypothesis r_i ⊀_Γ r_j needs to be rejected in order to exclude r_i from R̂_{max,Γ}. This means that, when the purpose is to select the maximal rules, we do not need to test r_i ⊀_Γ r_j if another hypothesis r_i ⊀_Γ r_k for some k ≠ j is already rejected. Therefore, we can modify the procedure of finding Ô_Γ to further decrease the size of R̂_{max,Γ} obtained from (4). For example, in the five-step procedure demonstrated in Section 3.2, we can replace Step 3 by:

Step 3': After ordering the hypotheses in Step 3, remove any hypothesis H_{ij} : r_i ⊀_Γ r_j if there is already a hypothesis H_{ik} appearing before H_{ij} for some k ≠ j.

Again we use the malaria example to illustrate the selection of the best treatment rules. As an example, when Γ = 2.
0, Step 3' reduced the original sequence of hypotheses to a sequence of six hypotheses. We used the hold-out samples to test these hypotheses sequentially at level α = 0.1. Therefore, a level α = 0.1 estimate of the maximal set contains four rules when Γ = 2. Table 1 lists the estimated maximal set R̂_{max,Γ} for a range of values of Γ between 1 and 2.5.

Table 1: R̂_{max,Γ} for different choices of Γ.
Γ   R̂_{max,Γ}           Γ   R̂_{max,Γ}
    { r , r , r }            { r , r , r , r }
    { r , r , r }            { r , r , r , r , r }
    { r , r , r , r }        { r , r , r , r , r }
    { r , r , r , r }        { r , r , r , r , r }
    { r , r , r , r }        { r , r , r , r , r , r }

Finally we consider how to select treatment rules that are better than a control rule. This can also be transformed into a multiple testing problem for the hypotheses H_i : r_0 ⊀_Γ r_i, i = 1, …, K. Let R̂_{pos,Γ} be the collection of rejected hypotheses following some multiple testing procedure. By the definition of FWER, P(R̂_{pos,Γ} ∩ R_{nul,Γ} ≠ ∅) ≤ α if the multiple testing procedure strongly controls the FWER at level α. As an example, one can modify the procedure in Section 3.2 to select the positive rules by only considering H_i, i = 1, …, K, in Step 3.

In practice, a small increase of the value function, though statistically significant, may not justify a policy change. In this case, it may be desirable to estimate the positive rules that dominate the control rule by a margin δ, R_{pos,Γ,δ} = {r_i : r_0 ≺_{Γ,δ} r_i}. To obtain an estimate of R_{pos,Γ,δ}, one can further modify the procedure in Section 3.2 by replacing the hypothesis H_i : r_0 ⊀_Γ r_i with the stronger H_i : r_0 ⊀_{Γ,δ} r_i.

Table 2: Estimated positive rules R̂_{pos,Γ,δ} for different choices of Γ and δ.
Γ = 1 Γ = 1 . . .
δ = 0   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r , r }
δ = 1   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r , r }
δ = 2   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r , r }
δ = 4   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r , r }
δ = 6   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r }   { r }

Γ = 2 . . .
δ = 0   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r , r }
δ = 1   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r }
δ = 2   { r , r , r , r , r }   { r , r , r }   { r , r }
δ = 4   { r , r , r }   ∅   ∅
δ = 6   ∅   ∅   ∅

Γ = 3 . . .
δ = 0   { r , r , r }   { r , r }   ∅
δ = 1   { r , r }   { r }   ∅
δ = 2   ∅   ∅   ∅
δ = 4   ∅   ∅   ∅
δ = 6   ∅   ∅   ∅

We construct R̂_{pos,Γ,δ} with various choices of Γ and δ for the malaria example. In this case, δ measures the decrease in the number of Plasmodium falciparum parasites per milliliter of blood, averaged over the entire study sample. Table 2 gives a summary of the results. As expected, the estimated set of positive rules becomes smaller as Γ or δ increases. We observe that, although the two rules assigning treatment to those under 7 and under 20 are unlikely to be the optimal rules if there is no unmeasured confounding (Table 1), they are more robust to unmeasured confounding than the others, dominating the control rule up to Γ = 4.

We study and report the performance of three methods of selecting the positive rules R_{pos,Γ,δ} using numerical simulations in this section. Simulation results for selecting the maximal rules are reported in the Supplementary Materials. We constructed 5 or 10 cohorts of data in which the treatment effect is constant within each cohort but differs between cohorts. After matching, the treated-minus-control difference in each cohort was normally distributed, with the mean vector µ taking one of four settings (the values are listed in the captions of Tables 3 to 6):

1. five cohorts with all positive effects;
2. five cohorts with one negative effect;
3. ten cohorts with all positive effects;
4. ten cohorts with two negative effects.

We compare the following three methods.

1. Bonferroni:
The full data are used to test the hypotheses H_i : r_0 ⊀_Γ r_i, and the Bonferroni correction is used to adjust for multiple comparisons.

2. Ordering by power:
This is the procedure described in Section 3.2, using sample splitting and fixed sequence testing.

3.
Ordering by value function:
This is the same as above except that the hypotheses are ordered by their estimated value at Γ = 1.

For the second and third methods, we used either a half or a quarter of the matched pairs (randomly chosen) to order the hypotheses. Extra simulation results using different split proportions are reported in the Supplementary Materials. This simulation was replicated 1000 times to report the power and the error rate of each method. The power is defined as the average size of the estimated set of positive rules R̂_{pos,Γ}, and the error rate is 1 − P(R̂_{pos,Γ} ⊆ R_{pos,Γ}), with nominal level 0.05. Ordering the hypotheses by (the estimated) power, though it loses some sample for testing, can be much more powerful at larger values of Γ. For instance, in Table 5 when Γ = 3.0, the two power-based methods are more than twice as powerful as the Bonferroni method. We observe that using only a small planning sample (25%) seems to work well in the simulations. This is not too surprising given our theoretical results: Proposition 1 suggests that only the first two moments of D and |D| are needed to estimate the sensitivity value asymptotically.

Table 3: Power and error rate (separated by /) of three methods estimating R_{pos,Γ}. Power is defined as the size of R̂_{pos,Γ} and the error rate is defined as 1 − P(R̂_{pos,Γ} ⊆ R_{pos,Γ}). The treatment effect in the 5 cohorts is given by a 5-dimensional µ with all positive components. The true R_{pos,Γ} is listed below each Γ value.
Cohort size Method Γ = 1 . . . . .
{ r , . . . , r } { r , . . . , r } { r , r , r } { r , r } { r }
250  Bonferroni   5.00 / 0.00  2.54 / 0.01  1.60 / 0.03  0.72 / 0.02  0.11 / 0.00
     Value (50%)  5.00 / 0.00  0.51 / 0.07  0.08 / 0.04  0.00 / 0.00  0.00 / 0.00
     Power (50%)  5.00 / 0.00  2.30 / 0.07  1.46 / 0.07  0.74 / 0.04  0.20 / 0.00
     Value (25%)  5.00 / 0.00  0.73 / 0.07  0.18 / 0.05  0.03 / 0.01  0.00 / 0.00
     Power (25%)  5.00 / 0.00  2.64 / 0.07  1.69 / 0.06  0.85 / 0.05  0.21 / 0.00
100  Bonferroni   4.99 / 0.00  1.39 / 0.02  0.75 / 0.02  0.37 / 0.02  0.08 / 0.00
     Value (50%)  4.80 / 0.00  0.49 / 0.07  0.16 / 0.04  0.04 / 0.02  0.00 / 0.00
     Power (50%)  4.77 / 0.00  1.15 / 0.05  0.75 / 0.05  0.38 / 0.03  0.15 / 0.02
     Value (25%)  4.99 / 0.00  0.61 / 0.06  0.24 / 0.03  0.12 / 0.03  0.01 / 0.00
     Power (25%)  4.99 / 0.00  1.33 / 0.05  0.80 / 0.05  0.52 / 0.05  0.12 / 0.00

Table 4: Power and error rate (separated by /) of three methods estimating R_{pos,Γ}. Power is defined as the size of R̂_{pos,Γ} and the error rate is defined as 1 − P(R̂_{pos,Γ} ⊆ R_{pos,Γ}). The treatment effect in the 5 cohorts is given by a 5-dimensional µ with one negative component. The true R_{pos,Γ} is listed below each Γ value.
Cohort size Method Γ = 1 . . . . .
{ r , r , r , r } { r , r , r } { r , r } { r } ∅
250  Bonferroni   3.14 / 0.00  2.36 / 0.00  0.78 / 0.00  0.45 / 0.01  0.01 / 0.01
     Value (50%)  3.21 / 0.00  1.78 / 0.00  0.06 / 0.02  0.00 / 0.00  0.00 / 0.00
     Power (50%)  3.21 / 0.00  2.03 / 0.00  0.75 / 0.02  0.47 / 0.02  0.07 / 0.07
     Value (25%)  3.25 / 0.00  2.21 / 0.00  0.07 / 0.03  0.00 / 0.00  0.00 / 0.00
     Power (25%)  3.25 / 0.00  2.35 / 0.00  0.83 / 0.02  0.54 / 0.02  0.07 / 0.07
100  Bonferroni   3.02 / 0.00  1.19 / 0.00  0.37 / 0.00  0.21 / 0.01  0.02 / 0.02
     Value (50%)  3.03 / 0.00  0.71 / 0.00  0.04 / 0.02  0.00 / 0.00  0.00 / 0.00
     Power (50%)  3.02 / 0.00  0.93 / 0.00  0.43 / 0.01  0.29 / 0.04  0.08 / 0.08
     Value (25%)  3.10 / 0.00  1.12 / 0.00  0.04 / 0.02  0.00 / 0.00  0.00 / 0.00
     Power (25%)  3.11 / 0.00  1.32 / 0.00  0.42 / 0.01  0.31 / 0.03  0.07 / 0.07
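To make the simulation design concrete, the following is a minimal Python sketch of one replicate of the Bonferroni method. The effect vector `mu`, the cohort size, and the nominal level here are illustrative stand-ins (the exact values are not all recoverable from this version of the text), and `gamma_statistic` implements the worst-case standardized statistic D̄_Γ / se(D̄_Γ) used in the sensitivity analysis (equation (5) in the Appendix).

```python
import numpy as np
from statistics import NormalDist

def gamma_statistic(d, gamma):
    """Worst-case standardized statistic D_bar_Gamma / se(D_bar_Gamma) for
    matched-pair differences d under Rosenbaum's sensitivity model."""
    kappa = (gamma - 1.0) / (gamma + 1.0)
    d_gamma = d - kappa * np.abs(d)           # D_{i, Gamma}
    n = d_gamma.size
    se = np.sqrt(np.sum((d_gamma - d_gamma.mean()) ** 2)) / n
    return d_gamma.mean() / se

# One replicate: test H_i for each cohort at level alpha / K and record
# which cohorts are declared positive at this value of Gamma.
rng = np.random.default_rng(1)
mu = np.array([0.2, 0.4, 0.6, 0.8, 1.0])      # hypothetical cohort effects
n_pairs, gamma, alpha = 250, 2.0, 0.05
z_bonf = NormalDist().inv_cdf(1.0 - alpha / mu.size)
rejected = [k for k in range(mu.size)
            if gamma_statistic(rng.normal(mu[k], 1.0, n_pairs), gamma) > z_bonf]
```

Repeating this over many replicates and averaging the size of `rejected` (and the frequency with which a null cohort is included) gives power and error-rate summaries of the kind reported in the tables.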
Finally, we apply the proposed method to study the potentially heterogeneous effect of retirement timing on senior health. Many empirical studies have focused on the effect of retirement timing on

Table 5: Power and error rate (separated by /) of three methods estimating R_{pos,Γ}. Power is defined as the size of R̂_{pos,Γ} and the error rate is defined as 1 − P(R̂_{pos,Γ} ⊆ R_{pos,Γ}). The treatment effect in the 10 cohorts is given by a 10-dimensional µ with all positive components. The true R_{pos,Γ} is listed below each Γ value.
Cohort size Method Γ = 1 . . . . .
{ r , . . . , r } { r , . . . , r } { r , . . . , r } { r , r } { r }
250  Bonferroni   10.00 / 0.00  6.80 / 0.01  2.41 / 0.00  0.20 / 0.00  0.02 / 0.01
     Value (50%)  10.00 / 0.00  0.88 / 0.06  0.00 / 0.00  0.00 / 0.00  0.00 / 0.00
     Power (50%)  10.00 / 0.00  6.30 / 0.05  2.34 / 0.03  0.44 / 0.02  0.11 / 0.05
     Value (25%)  10.00 / 0.00  1.12 / 0.01  0.00 / 0.00  0.00 / 0.00  0.00 / 0.00
     Power (25%)  10.00 / 0.00  7.06 / 0.06  2.73 / 0.02  0.42 / 0.02  0.10 / 0.05
100  Bonferroni    9.99 / 0.00  3.97 / 0.01  1.14 / 0.00  0.12 / 0.00  0.03 / 0.02
     Value (50%)   9.95 / 0.00  0.76 / 0.05  0.03 / 0.01  0.00 / 0.00  0.00 / 0.00
     Power (50%)   9.91 / 0.00  3.18 / 0.04  1.17 / 0.02  0.28 / 0.03  0.10 / 0.05
     Value (25%)   9.95 / 0.00  1.06 / 0.04  0.06 / 0.01  0.00 / 0.00  0.00 / 0.00
     Power (25%)   9.99 / 0.00  3.93 / 0.04  1.39 / 0.02  0.25 / 0.02  0.09 / 0.05

short-term and long-term health status of elderly people [Morrow-Howell et al., 2001, Alavinia and Burdorf, 2008, Börsch-Supan and Jürges, 2006]. One theory, known as the "psychosocial-materialist" approach, suggests that retiring late may have health benefits because work forms a key part of the identity of the elderly and provides financial, social, and psychological resources [Calvo et al., 2012]. However, the health benefits of late retirement may differ across subpopulations [Dave et al., 2008, Westerlund et al., 2009].

We obtained the observational data from the Health and Retirement Study (HRS), an ongoing nationally representative survey of more than 30,000 adults who are older than 50 and their spouses in the United States. The HRS is sponsored by the National Institute on Aging; detailed information on the HRS and its design can be found in Sonnega et al. [2014]. We use the RAND HRS Longitudinal File 2014 (V2), an easy-to-use dataset based on the HRS core data that consists of a follow-up study of 15,
843 elderly people [RAND, 2018].

We defined the treatment as late retirement (retirement after 65 years old and before 70 years old) and asked how it impacted self-reported health status at the age of 70 (coded as: 5 - extremely good, 4 - very good, 3 - good, 2 - fair, and 1 - poor). We included individuals who retired before 70 and had complete measurements of the following confounders: year of birth, gender, education (years), race (white or not), occupation (1: executives and managers; 2: professional specialty; 3: sales and administration; 4: protection services and armed forces; 5: cleaning, building, food prep, and personal services; 6: production, construction, and operation), partnered, annual income, and

Table 6: Power and error rate (separated by /) of three methods estimating R_{pos,Γ}. Power is defined as the size of R̂_{pos,Γ} and the error rate is defined as 1 − P(R̂_{pos,Γ} ⊆ R_{pos,Γ}). The treatment effect in the 10 cohorts is given by a 10-dimensional µ with two negative components. The true R_{pos,Γ} is listed below each Γ value.
Cohort size Method Γ = 1 . . . . .
{ r , . . . , r , r , r } { r , . . . , r } { r , r , r } { r , r } { r }
250  Bonferroni   5.98 / 0.00  3.51 / 0.00  1.55 / 0.00  0.37 / 0.00  0.07 / 0.00
     Value (50%)  5.97 / 0.02  0.10 / 0.02  0.00 / 0.00  0.00 / 0.00  0.00 / 0.00
     Power (50%)  6.00 / 0.02  3.43 / 0.02  1.53 / 0.01  0.56 / 0.02  0.21 / 0.01
     Value (25%)  6.02 / 0.03  0.23 / 0.04  0.03 / 0.00  0.01 / 0.00  0.00 / 0.00
     Power (25%)  6.02 / 0.02  3.67 / 0.04  1.84 / 0.01  0.66 / 0.01  0.22 / 0.01
100  Bonferroni   5.60 / 0.01  2.42 / 0.00  0.68 / 0.00  0.19 / 0.00  0.06 / 0.00
     Value (50%)  5.24 / 0.03  0.22 / 0.03  0.04 / 0.00  0.01 / 0.00  0.00 / 0.00
     Power (50%)  5.48 / 0.03  2.23 / 0.03  0.86 / 0.02  0.31 / 0.02  0.16 / 0.02
     Value (25%)  5.58 / 0.02  0.71 / 0.03  0.16 / 0.00  0.04 / 0.00  0.01 / 0.00
     Power (25%)  5.71 / 0.02  2.61 / 0.03  0.98 / 0.01  0.36 / 0.02  0.14 / 0.02

smoking status. This left us with 1934 treated subjects and 4831 controls. Figure 3 plots the distribution of retirement age in all samples and in the treatment group. The distribution of retirement age in the treatment group is right-skewed, with a spike of people retiring shortly after 65 years old. In the Supplementary Materials, we give a detailed account of data preprocessing and sample inclusion criteria.

Figure 3: Distribution of retirement age. (a) In all samples; (b) in the treated group.

Using optimal matching as implemented in the optmatch
R package [Hansen and Klopfer, 2006], we form 1858 matched pairs, matching exactly on the year of birth, gender, occupation, and partnered or not, and balancing race, years of education, and smoking status. Table 7 summarizes the covariate balance after matching. After matching, the treated and control groups are well balanced: the standardized differences of all covariates are less than 0.1. Additionally, the propensity scores in the treated and control groups have good overlap before and after matching (see the Supplementary Materials).

Table 7: Covariate balance after matching.
                                                                   Control  Treated  std.diff
Year of birth                                                      1936.27  1936.27    0.00
Female                                                                0.53     0.53    0.00
Non-hispanic white                                                    0.77     0.75   -0.04
Education (yrs)                                                      12.52    12.53    0.00
Occupation: cleaning, building, food prep, and personal services      0.10     0.10    0.00
Occupation: executives and managers                                   0.16     0.16    0.00
Occupation: production, construction, and operation                   0.28     0.28    0.00
Occupation: professional specialty                                    0.19     0.19    0.00
Occupation: protection services and armed forces                      0.02     0.02    0.00
Occupation: sales and admin                                           0.25     0.25    0.00
Partnered                                                             0.74     0.74    0.00
Smoke ever                                                            0.63     0.59   -0.08

We considered two potential effect modifiers, namely gender and occupation. More complicated treatment rules can in principle be considered within our framework, though having more treatment rules generally reduces the power of multiple testing. We grouped the 6 occupations into 2 broad categories: white-collar jobs (executives and managers, and professional specialties) and blue-collar jobs (sales, administration, protection services, personal services, production, construction, and operation). There were 4 subgroups defined by these two potential effect modifiers: male white-collar workers (G_1), female white-collar workers (G_2), male blue-collar workers (G_3), and female blue-collar workers (G_4). Thus, there were a total of 2^4 = 16 different regimes formed out of these two effect modifiers.
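This enumeration can be made mechanical. The sketch below lists all 2^4 = 16 rules by their binary and decimal codings; the subgroup labels and the bit order (G_1 through G_4) are assumptions for illustration, not taken from the paper.

```python
from itertools import product

# assumed order of the four subgroups (G_1, ..., G_4)
subgroups = ["male white-collar", "female white-collar",
             "male blue-collar", "female blue-collar"]

# Each rule is a binary vector over subgroups; its decimal coding names
# the rule, e.g. bits (1, 0, 1, 0) -> "1010" -> rule 10.
rules = {}
for bits in product([0, 1], repeat=4):
    decimal = int("".join(map(str, bits)), 2)
    rules[decimal] = [g for g, b in zip(subgroups, bits) if b]
```

For example, `rules[0]` is the rule that treats nobody and `rules[15]` treats all four subgroups.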
We gave decimal as well as binary codings to the 16 groups: r_0 (r_(0000)) assigns control to everyone; r_8 (r_(1000)), r_4 (r_(0100)), r_2 (r_(0010)), and r_1 (r_(0001)) assign treatment to exactly one of the 4 subgroups; and so forth. We split the matched samples, used 1/4 of them to order the hypotheses, and then followed the procedures proposed in Section 3 to rank and select the treatment rules.

Figure 4 reports the estimated Hasse diagram at Γ = 1.2; additional results can be found in the Supplementary Materials. The estimated maximal rules for various choices of Γ are reported in Table 8, and the estimated positive rules are reported in the Supplementary Materials.

Figure 4: The effect of late retirement on health: Hasse diagram at Γ = 1.2. (Vertices are the 16 rules labeled by their decimal and binary codings: 0 (0000); 8 (1000), 4 (0100), 2 (0010), 1 (0001); 12 (1100), 10 (1010), 9 (1001), 6 (0110), 5 (0101), 3 (0011); 14 (1110), 13 (1101), 11 (1011), 7 (0111); 15 (1111).)

At Γ = 1, the estimated maximal rules are the rule that assigns late retirement to all but female white-collar workers, the rule that assigns late retirement to all but male blue-collar workers, and r_15 (r_(1111)), which assigns treatment to everyone. When Γ increases, the rule which assigns treatment to male white-collar workers and female blue-collar workers further enters the set of maximal rules. The estimated positive rules suggest that the rule assigning late retirement only to female blue-collar workers, together with one other rule, though not among the maximal rules at Γ = 1 in Table 8, is the most robust to unmeasured confounding. This suggests that late retirement perhaps benefits female blue-collar workers more than others.

Table 8: The effect of late retirement on health: R̂_{max,Γ} for different choices of Γ.
Γ   R̂_{max,Γ}
    { r , r , r }
    { r , r , r , r }
    { r , r , r , r , r , r , r , r }

Rosenbaum and Silber [2009] proposed to amplify Γ to a two-dimensional curve indexed by (Λ, ∆), where Λ describes the relationship between the unmeasured confounder and the treatment assignment, and ∆ describes the relationship between the unmeasured confounder and the outcome. Hsu and Small [2013] proposed to calibrate the (Λ, ∆) values to the coefficients of observed covariates; however, their method only works for a binary outcome and a binary treatment. In the Supplementary Materials, we describe a calibration analysis that handles the ordinal self-reported health status in our application, which has 5 levels. We follow Hsu and Small [2013] and use a plot to summarize the calibration analysis.
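For reference, the Rosenbaum and Silber [2009] amplification identifies a single Γ with the curve of (Λ, ∆) pairs satisfying Γ = (Λ∆ + 1)/(Λ + ∆). A minimal sketch of computing this curve (the grid of Λ values is an arbitrary choice for plotting):

```python
import numpy as np

def amplification_delta(lam, gamma):
    """Solve Gamma = (Lambda * Delta + 1) / (Lambda + Delta) for Delta.

    Requires Lambda > Gamma; the closed form is
    Delta = (Gamma * Lambda - 1) / (Lambda - Gamma)."""
    return (gamma * lam - 1.0) / (lam - gamma)

gamma = 1.2
lambdas = np.linspace(gamma + 0.1, 10.0, 200)   # Lambda must exceed Gamma
deltas = amplification_delta(lambdas, gamma)    # the amplification curve
```

Roughly speaking, a covariate whose (Λ, ∆) point falls below this curve represents confounding weaker than what the sensitivity model with this Γ allows.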
In Figure 5, the blue curve represents Rosenbaum and Silber [2009]'s two-dimensional amplification of Γ = 1.2 into (Λ, ∆). The estimated coefficients of observed covariates are represented by black dots (after taking an exponential so they are comparable to (Λ, ∆)). We followed the suggestion in Gelman [2008] and standardized all the non-binary covariates to have mean 0 and standard deviation 0.5, so the coefficient of each binary variable can be interpreted directly, and the coefficient of each continuous or ordinal variable can be interpreted as the effect of a 2-SD increase in the covariate value, which roughly corresponds to flipping a binary variable from 0 to 1. Note that all coefficients lie below the Γ = 1.2 amplification curve, as does a hypothetical unmeasured confounder U constructed from smoking and education (red star in Figure 5).

In this paper we proposed a general framework to compare, select, and rank treatment rules when there is a limited degree of unmeasured confounding, and illustrated the proposed methods with two real data examples. A central message is that the best treatment rule (with the largest estimated value) assuming no unmeasured confounding is often not the most robust to unmeasured confounding. This may have important policy implications when individualized treatment rules are learned from observational data.

Because the value function only defines a partial order on the treatment rules when there is unmeasured confounding, there is a multitude of statistical questions one can ask about selecting and ranking the treatment rules. We have considered three questions that we believe are most
Figure 5: The effect of late retirement on health: calibration of the sensitivity analysis. The blue curve is Rosenbaum and Silber [2009]'s amplification of Γ = 1.2. Black dots represent the estimated coefficient of each observed covariate. The red marker represents the aggregate effect of flipping smoking status and increasing education by 2 standard deviations.

relevant to policy research, but there are many other questions (such as in Gibbons et al. [1999]) one could ask.

In principle, our framework can be used with an arbitrary number of prespecified individualized treatment rules. However, to maintain good statistical power in the multiple testing, the prespecified treatment rules should not be too many. This limitation makes our method most suitable as a confirmatory analysis to complement machine learning algorithms for individualized treatment rule discovery. Alternatively, if the number of decision variables is relatively low due to economic or practical reasons, our method is also reasonably powered for treatment rule discovery.
Acknowledgement
JW received funding from the Population Research Training Grant (NIH T32 HD007242) awarded to the Population Studies Center at the University of Pennsylvania by the NIH's Eunice Kennedy Shriver National Institute of Child Health and Human Development.
A Appendix: Proof of Proposition 1
To simplify the notation, suppose r_1(X_i) < r_2(X_i) for all 1 ≤ i ≤ I. Let

D_{i,Γ} = D_i − ((Γ − 1)/(Γ + 1)) |D_i|,   \bar{D} = (1/I) \sum_{i=1}^I D_i,   \overline{|D|} = (1/I) \sum_{i=1}^I |D_i|,

\bar{D}_Γ = (1/I) \sum_{i=1}^I D_{i,Γ} = \bar{D} − ((Γ − 1)/(Γ + 1)) \overline{|D|},   and   se(\bar{D}_Γ)^2 = (1/I^2) \sum_{i=1}^I (D_{i,Γ} − \bar{D}_Γ)^2.

When E[D_i] > 0, Γ*(r_1, r_2) is obtained by solving the following equation in Γ:

\bar{D}_Γ / se(\bar{D}_Γ) = Φ^{−1}(1 − α).   (5)

Square both sides of the equation above and plug in the expressions for \bar{D}_Γ and se(\bar{D}_Γ)^2. Let κ = (Γ − 1)/(Γ + 1) and z_α = Φ^{−1}(1 − α). Denote

s_D^2 = (1/I) \sum_{i=1}^I (D_i − \bar{D})^2,   s_{|D|}^2 = (1/I) \sum_{i=1}^I (|D_i| − \overline{|D|})^2,   s_{D,|D|} = (1/I) \sum_{i=1}^I (D_i − \bar{D})(|D_i| − \overline{|D|}).

One can show that the sensitivity value Γ*(r_1, r_2) corresponds to the κ* that solves the quadratic equation

(\overline{|D|}^2 − (1/I) s_{|D|}^2 z_α^2) κ^2 − 2 (\bar{D} \overline{|D|} − (1/I) s_{D,|D|} z_α^2) κ + (\bar{D}^2 − (1/I) s_D^2 z_α^2) = 0.

Specifically, we have

κ* = [\bar{D} \overline{|D|} − (1/I) s_{D,|D|} z_α^2 ± √Δ] / [\overline{|D|}^2 − (1/I) s_{|D|}^2 z_α^2],   (6)

where Δ = (\bar{D} \overline{|D|} − (1/I) s_{D,|D|} z_α^2)^2 − (\overline{|D|}^2 − (1/I) s_{|D|}^2 z_α^2)(\bar{D}^2 − (1/I) s_D^2 z_α^2). Note that

√Δ = z_α √[ (1/I)(s_{|D|}^2 \bar{D}^2 + s_D^2 \overline{|D|}^2 − 2 \bar{D} \overline{|D|} s_{D,|D|}) + (1/I^2) z_α^2 (s_{D,|D|}^2 − s_{|D|}^2 s_D^2) ].

Let us denote A = E[D] E[|D|], B = −z_α √(σ_{|D|}^2 E[D]^2 + σ_D^2 E[|D|]^2 − 2 E[D] E[|D|] σ_{D,|D|}), C = E[|D|]^2, R_1 = √I (\bar{D} \overline{|D|} − A), and R_2 = √I (\overline{|D|}^2 − C). We have

κ* = (A + R_1/√I + B/√I) / (C + R_2/√I) + o_p(1/√I) = (1/C)(A + R_1/√I + B/√I)(1 − R_2/(C√I)) + o_p(1/√I).

Scaling both sides by √I and rearranging the terms, we obtain

√I (κ* − A/C) = B/C + (1/C) R_1 − (A/C^2) R_2 + o_p(1).

Applying the delta method with φ : R^2 → R^2, φ(x, y) = (xy, y^2), to the central limit theorem

√I (\bar{D} − E[D], \overline{|D|} − E[|D|]) → N(0, Σ)

gives

(R_1, R_2) = √I (\bar{D} \overline{|D|} − E[D] E[|D|], \overline{|D|}^2 − E[|D|]^2) → N(0, φ′ Σ (φ′)^T),

where

Σ = ( Var[D], Cov(D, |D|); Cov(D, |D|), Var[|D|] )   and   φ′ = ( E[|D|], E[D]; 0, 2 E[|D|] ).

Plugging in the expressions for A, B, and C and computing the variance of (1/C, −A/C^2)(R_1, R_2)^T, we conclude that

√I (κ* − E[D]/E[|D|]) → N(z_α µ, σ^2),

where

µ = −√(σ_{|D|}^2 E[D]^2 + σ_D^2 E[|D|]^2 − 2 E[D] E[|D|] σ_{D,|D|}) / E[|D|]^2

and

σ^2 = (Var[D] E[|D|]^2 + Var[|D|] E[D]^2 − 2 E[D] E[|D|] Cov(D, |D|)) / E[|D|]^4.

The Supplementary Materials contain additional appendices about matching in observational studies and further simulation and real data results.

References
Seyed Mohammad Alavinia and Alex Burdorf. Unemployment and retirement and ill-health: a cross-sectional analysis across European countries. International Archives of Occupational and Environmental Health, 82(1):39–45, 2008.

Susan Athey and Stefan Wager. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.

Axel Börsch-Supan and Hendrik Jürges. Early retirement, social security and well-being in Germany. Technical report, National Bureau of Economic Research, 2006.

Esteban Calvo, Natalia Sarkisian, and Christopher R. Tamborini. Causal effects of retirement timing on subjective physical and emotional health. The Journals of Gerontology: Series B, 68(1):73–84, 2012. doi: 10.1093/geronb/gbs097.

J. Cornfield, W. Haenszel, E. Hammond, A. Lilienfeld, M. Shimkin, and E. Wynder. Smoking and
Journal of the National Cancer Institute , 22:173–203, 1959.Dhaval Dave, Inas Rashad, and Jasmina Spasojevic. The effects of retirement on physical andmental health outcomes.
Southern Economic Journal, 75(2):497–523, 2008.

Miroslav Dudík, Dumitru Erhan, John Langford, Lihong Li, et al. Doubly robust policy evaluation and optimization.
Statistical Science , 29(4):485–511, 2014.Colin B Fogarty. Studentized sensitivity analysis for the sample average treatment effect in pairedobservational studies. arXiv preprint arXiv:1609.02112 , 2016.A. Gelman. Scaling regression inputs by dividing by two standard deviations.
Statistics in Medicine ,27:2865–2873, 2008.Jean Dickinson Gibbons, Ingram Olkin, and Milton Sobel.
Selecting and ordering populations: Anew statistical methodology . SIAM, 2nd edition, 1999.Shanti S Gupta and Subramanian Panchapakesan.
Multiple Decision Procedures: Theory and Methodology of Selecting and Ranking Populations. SIAM, 1979.

Ben B. Hansen and Stephanie Olsen Klopfer. Optimal full matching and related designs via network flows.
Journal of Computational and Graphical Statistics , 15(3):609–627, 2006.Ruth Heller, Paul R Rosenbaum, and Dylan S Small. Split samples and design sensitivity inobservational studies.
Journal of the American Statistical Association , 104(487):1090–1101, 2009.J. Y. Hsu and D. S. Small. Calibrating sensitivity analyses to observed covariates in observationalstudies.
Biometrics , 69:803–811, 2013.Jesse Y. Hsu, Dylan S. Small, and Paul R. Rosenbaum. Effect modification and design sensitivity inobservational studies.
Journal of the American Statistical Association , 108(501):135–148, 2013.doi: 10.1080/01621459.2012.742018. URL https://doi.org/10.1080/01621459.2012.742018 .Nathan Kallus. Recursive partitioning for personalization using observational data. In
Proceedingsof the 34th International Conference on Machine Learning-Volume 70 , pages 1789–1798. JMLR.org, 2017.Nathan Kallus and Angela Zhou. Confounding-robust policy improvement. In
Advances in NeuralInformation Processing Systems , pages 9269–9279, 2018.Nathan Kallus, Xiaojie Mao, and Angela Zhou. Interval estimation of individual-level causal effectsunder unobserved confounding. arXiv preprint arXiv:1810.02894 , 2018.Gary G Koch and Stuart A Gansky. Statistical considerations for multiplicity in confirmatoryprotocols.
Drug Information Journal , 30(2):523–534, 1996.Michael R. Kosorok and Eric B. Laber. Precision medicine.
Annual Review of Statistics andIts Application , 6(1):263–286, 2019. doi: 10.1146/annurev-statistics-030718-105251. URL https://doi.org/10.1146/annurev-statistics-030718-105251 .Erica EM Moodie, Bibhas Chakraborty, and Michael S Kramer. Q-learning for estimating optimaldynamic treatment rules from observational data.
Canadian Journal of Statistics , 40(4):629–645,2012.Nancy Morrow-Howell, James Hinterlong, Michael Sherraden, et al.
Productive aging: Conceptsand challenges . JHU Press, 2001.Susan A Murphy. Optimal dynamic treatment regimes.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331–355, 2003.

Min Qian and Susan A. Murphy. Performance guarantees for individualized treatment rules. The Annals of Statistics, 39(2):1180–1210, 2011.

RAND. RAND HRS Longitudinal File 2014 (V2) public use dataset. Produced by the RAND Center for the Study of Aging, with funding from the National Institute on Aging and the Social Security Administration, 2018.

James M Robins. Optimal structural nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium in Biostatistics, pages 189–326. Springer, 2004.

P. R. Rosenbaum. Sensitivity analysis for certain permutation inferences in matched observational studies.
Biometrika , 74:13–26, 1987.P. R. Rosenbaum.
Observational Studies . Springer., 2002.P. R. Rosenbaum. Bahadur Efficiency of Sensitivity Analyses in Observational Studies.
Journal ofthe American Statistical Association , 110:205–217, 2015.P. R. Rosenbaum and J. H. Silber. Amplification of sensitivity analysis in matched observationalstudies.
Journal of the American Statistical Association , 104:1398–1405, 2009.Paul R Rosenbaum. Design sensitivity in observational studies.
Biometrika , 91(1):153–164, 2004.Paul R Rosenbaum. A new U-statistic with superior design sensitivity in matched observationalstudies.
Biometrics , 67(3):1017–1027, 2011.Amanda Sonnega, Jessica D Faul, Mary Beth Ofstedal, Kenneth M Langa, John WR Phillips,and David R Weir. Cohort Profile: the Health and Retirement Study (HRS).
InternationalJournal of Epidemiology , 43(2):576–585, 03 2014. ISSN 0300-5771. doi: 10.1093/ije/dyu067.URL https://doi.org/10.1093/ije/dyu067 .Zhiqiang Tan. A distributional approach for causal inference using propensity scores.
Journal of the American Statistical Association, 101(476):1619–1637, 2006.

Hugo Westerlund, Mika Kivimäki, Archana Singh-Manoux, Maria Melchior, Jane E Ferrie, Jaana Pentti, Markus Jokela, Constanze Leineweber, Marcel Goldberg, Marie Zins, and Jussi Vahtera. Self-rated health before and after retirement in France (GAZEL): a cohort study. The Lancet, 374(9705):1889–1896, 2009. doi: 10.1016/S0140-6736(09)61570-1.

Peter H Westfall and Alok Krishen. Optimally weighted, fixed sequence and gatekeeper multiple testing procedures.
Journal of Statistical Planning and Inference , 99(1):25–40, 2001.Peng Wu, Donglin Zeng, and Yuanjia Wang. Matched learning for optimizing indi-vidualized treatment strategies using electronic health records.
Journal of the Ameri-can Statistical Association , 0(0):1–23, 2019. doi: 10.1080/01621459.2018.1549050. URL https://doi.org/10.1080/01621459.2018.1549050 .Steve Yadlowsky, Hongseok Namkoong, Sanjay Basu, John Duchi, and Lu Tian. Bounds on the con-ditional and average treatment effect in the presence of unobserved confounders. arXiv preprintarXiv:1808.09521 , 2018.Yichi Zhang, Eric B Laber, Marie Davidian, and Anastasios A Tsiatis. Interpretable dynamictreatment regimes.
Journal of the American Statistical Association , 113(524):1541–1549, 2018.Qingyuan Zhao. On sensitivity value of pair-matched observational studies.
Journal of theAmerican Statistical Association , 0(0):1–10, 2018. doi: 10.1080/01621459.2018.1429277. URL https://doi.org/10.1080/01621459.2018.1429277 .Qingyuan Zhao, Dylan S Small, and Paul R Rosenbaum. Cross-screening in observational studiesthat test many hypotheses.
Journal of the American Statistical Association , 113(523):1070–1084,2018.Qingyuan Zhao, Dylan S. Small, and Bhaswar B. Bhattacharya. Sensitivity analysis for inverseprobability weighting estimators via the percentile bootstrap, 2019a.Ying-Qi Zhao, Eric B Laber, Yang Ning, Sumona Saha, and Bruce Sands. Efficient augmentationand relaxation learning for individualized treatment rules using observational data.
Journal ofMachine Learning Research , 20:1–23, 2019b.Yingqi Zhao, Donglin Zeng, A. John Rush, and Michael R. Kosorok. Estimating individ-ualized treatment rules using outcome weighted learning.
Journal of the American Statistical Association, 107(499):1106–1118, 2012. doi: 10.1080/01621459.2012.695674. URL https://doi.org/10.1080/01621459.2012.695674.