Selecting and ranking individualized treatment rules with unmeasured confounding
Bo Zhang, Jordan Weiss, Dylan S. Small, Qingyuan Zhao*
University of Pennsylvania and University of Cambridge
Abstract: It is common to compare individualized treatment rules based on the value function, which is the expected potential outcome under the treatment rule. Although the value function is not point-identified when there is unmeasured confounding, it still defines a partial order among the treatment rules under Rosenbaum's sensitivity analysis model. We first consider how to compare two treatment rules with unmeasured confounding in the single-decision setting and then use this pairwise test to rank multiple treatment rules. We consider how to, among many treatment rules, select the best rules and select the rules that are better than a control rule. The proposed methods are illustrated using two real examples, one about the benefit of malaria prevention programs to different age groups and another about the effect of late retirement on senior health in different gender and occupation groups.
Keywords: Multiple testing; Observational studies; Partial order; Policy discovery; Sensitivity analysis.

*Correspondence to: Qingyuan Zhao, Statistical Laboratory, University of Cambridge, Wilberforce Road, Cambridge, CB3 0WB, UK. Email: [email protected].

1 Introduction
A central statistical problem in precision medicine and health policy is to learn treatment rules that are tailored to the patient's characteristics. There is now an exploding literature on individualized policy discovery; see Kosorok and Laber [2019] for an up-to-date review. Although randomized experiments remain the gold standard for causal inference, there has been a growing interest in using observational data to draw causal conclusions and discover individualized treatment rules due to the increasing availability of electronic health records and other observational data sources [Moodie et al., 2012, Athey and Wager, 2017, Kallus, 2017, Wu et al., 2019, Zhao et al., 2019b].

A common way to formulate the problem of individualized policy discovery is via the value function, which is the expected potential outcome under a treatment rule or regime. The optimal treatment rule is usually defined as the one that maximizes the value function. In the single-decision setting, the value function can be easily identified when the data come from a randomized experiment (as long as the probability of receiving treatment is never 0 or 1). When the data come from an observational study, the value function can still be identified under the assumption that all confounders are measured. This assumption can be further extended to the multiple-decision setting [Murphy, 2003, Robins, 2004]. In this paper we will focus our discussion on the single-decision setting but consider the possibility of unmeasured confounding.

With few exceptions, the vast majority of existing methods for treatment rule discovery from observational data are based on the no unmeasured confounding assumption. Typically, these methods first estimate the value function assuming no unmeasured confounding and then select the treatment rule that maximizes the estimated value function. However, it is common that a substantial fraction of the population appear to behave similarly under treatment or control.
From a statistical perspective, if there is truly no unmeasured confounder, we should still attempt to estimate the treatment effect for individuals in this subpopulation and optimize the treatment rule accordingly. However, the optimal treatment decisions for these individuals are, intuitively, also the most sensitive to unmeasured confounding. It may only take a small amount of unmeasured confounding to change the sign of the estimated treatment effects for these individuals. From a policy perspective (especially when there is a cost constraint), learning the "optimal" treatment decision for these individuals from observational data seems likely to be error-prone.

1.1 Sensitivity analysis for individualized treatment rules
There is a long literature on studying the sensitivity of observational studies to unmeasured confounding, dating from Cornfield et al. [1959]. In short, such sensitivity analysis asks how much unmeasured confounding is needed to qualitatively alter the causal conclusion of an observational study. In this paper, we will study the sensitivity of individualized treatment rules to unmeasured confounding using a prominent model proposed by Rosenbaum [1987], where the odds ratio of receiving the treatment for any two individuals with the same observed covariates is bounded between 1/Γ and Γ (Γ ≥ 1; Γ = 1 corresponds to no unmeasured confounding). More specifically, we will consider selecting and ranking individualized treatment rules under Rosenbaum's model for unmeasured confounding.

Our investigation is motivated by the impact of effect modification on the power of Rosenbaum's sensitivity analysis that is studied by Hsu et al. [2013]. A phenomenon found by Hsu et al. [2013] is that subgroups with larger treatment effect may have larger design sensitivity. For example, suppose a subgroup A has a larger treatment effect than a subgroup B based on observational data. Then, there may exist a Γ > 1 at which the conclusion for subgroup A is still insensitive to unmeasured confounding but the conclusion for subgroup B is not. This suggests that the value function only defines a partial order on the set of individualized treatment rules, so two rules with different value functions assuming no unmeasured confounding may become indistinguishable under the Γ-sensitivity model when Γ > 1. Fundamentally, the reason is that the value function is only partially identified in Rosenbaum's Γ-sensitivity model.

As an example, let us use r ≻_Γ r′ (abbreviated as r ≻ r′ if the value of Γ is clear from the context) to denote that the value of rule r is always greater than the value of r′ when the unmeasured confounding satisfies the Γ-sensitivity model. Then, it is possible that:

- Under Γ = 1, r1 ≻ r2 ≻ r3 (so r1 ≻ r3);
- Under some Γ > 1, r1 ≻ r3 but r1 ⊁ r2.

This phenomenon occurs frequently in real data examples; see Figure 1 in Section 3.2. Note that the relation ≻ is defined using the value function computed using the population instead of a particular sample.

Because the value function only defines a partial order on the treatment rules, it is no longer well-defined to estimate the optimal treatment rule when there is unmeasured confounding. Instead, we aim to recover the partial ordering of a set of treatment rules or select a subset of rules that satisfy certain statistical properties. This problem is related to the problem of selecting and ranking subpopulations (as a post hoc analysis for randomized experiments), which has been extensively studied in statistics [Gupta and Panchapakesan, 1979, Gibbons et al., 1999]. Unfortunately, in problems considered by the existing literature, the subpopulations always have a total order. For example, a prototypical problem in that literature is to select a subset that contains the largest μ_i based on independent observations Y_i ∼ N(μ_i, 1).

Existing methods for individualized policy discovery from observational data often take an exploratory stance.
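To make the partial-order phenomenon described above concrete, here is a toy numerical sketch (our own illustration with made-up numbers, not the paper's data or estimator): pretend each pairwise value difference is only identified up to an interval whose half-width grows with Γ, and declare one rule dominated by another only when the whole interval for the value difference is positive.

```python
# Toy illustration (made-up numbers): a total order at Gamma = 1 can
# degrade into a partial order at Gamma > 1. We pretend the value
# difference V(r_j) - V(r_i) is only identified up to +/- half_width,
# where the half-width grows with Gamma.

def dominated(delta_hat, gamma, scale=1.0):
    """r_i is dominated by r_j only if the whole identified interval
    for V(r_j) - V(r_i) lies above zero."""
    half_width = scale * (gamma - 1.0) / (gamma + 1.0)  # toy width
    return delta_hat - half_width > 0

# Hypothetical point estimates of V(r_j) - V(r_i) for three rules.
deltas = {("r1", "r2"): 0.30, ("r2", "r3"): 0.10, ("r1", "r3"): 0.40}

for gamma in (1.0, 1.5):
    order = sorted(pair for pair, d in deltas.items() if dominated(d, gamma))
    print(gamma, order)
# At Gamma = 1 all three orderings hold, a total order; at Gamma = 1.5
# the small difference between r2 and r3 no longer survives, so those
# two rules become incomparable -- a partial order.
```

The exact pattern here is invented for illustration, but it shows the same mechanism: orderings supported by small value differences are the first to disappear as Γ grows.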
These exploratory methods often aim to select the individualized treatment rule, often within an infinite-dimensional function class, that maximizes the estimated value function, using outcome regression [Robins, 2004, Qian and Murphy, 2011], inverse-probability weighting [Zhao et al., 2012, Kallus, 2017], or doubly-robust estimation [Dudík et al., 2014, Athey and Wager, 2017]. In order to estimate the value function, some parametric or semiparametric models are specified to model the outcome and/or the treatment selection process. To identify the value function, the vast majority of these approaches make the no unmeasured confounding assumption, which may be unrealistic in many applications. The only exception to our knowledge is Kallus and Zhou [2018], in which the authors propose to maximize the minimum value of an individualized treatment rule when the unmeasured confounding satisfies a marginal sensitivity model [Tan, 2006, Zhao et al., 2019a]. This is further extended to the estimation of the conditional average treatment effect with unmeasured confounding in Kallus et al. [2018]. Another related work is Yadlowsky et al. [2018], who consider semiparametric inference for the average treatment effect in Rosenbaum's sensitivity model.

In this paper we take a different perspective. Our approach is based on a statistical test to compare the values of two individualized treatment rules when there is limited unmeasured confounding. Briefly speaking, we first match the treated and control observations by the observed covariates and then propose to use Rosenbaum's sensitivity model to quantify the magnitude of unmeasured confounding after matching (the deviation of the matched observational study from a pairwise randomized experiment). At the core of our proposal is a randomization test introduced by Fogarty [2016] to compare the values of two individualized treatment rules in Rosenbaum's sensitivity model.
Based on this test, we introduce a framework to rank and select treatment rules within a given finite collection and show that different statistical errors can be controlled with the appropriate multiple hypothesis testing methods.

In principle, our framework can be used with an arbitrary (finite) number of pre-specified treatment rules. In practice, it is more suitable for small-scale policy discovery with relatively few decision variables, where it is not needed to use machine learning methods to discover complex patterns, or where such methods have already been employed in a preliminary study to suggest a few candidate rules. The design-based nature of our approach makes it particularly useful for confirmatory analyses, the importance of which is widely acknowledged in the policy discovery literature [e.g. Kallus, 2017, Zhang et al., 2018, Kosorok and Laber, 2019]. Methods proposed in this paper thus complement the existing literature on individualized treatment rules by providing a way to confirm the effectiveness of a treatment rule learned from observational data and assess its robustness to unmeasured confounding. When there are several competing treatment rules, our framework further helps the decision maker select or rank the treatment rules using flexible criteria.

The rest of the paper is organized as follows. In Section 2 we introduce a real data example that will be used to illustrate the proposed methods. We then introduce some notation and discuss how to compare two treatment rules when there is unmeasured confounding. In Section 3 we consider three questions about ranking and selecting among multiple treatment rules. We compare our proposal with some baseline procedures in a simulation study in Section 4 and apply our method to another application using data from the Health and Retirement Study in Section 5. Finally, we conclude the paper with some brief discussion in Section 6.
2 Comparing treatment rules with unmeasured confounding
The Garki Project, conducted by the World Health Organization and the Government of Nigeria from 1969 to 1976, was an observational study that compared several strategies to control malaria. Hsu et al. [2013] studied the effect modification for one of the malaria control strategies, namely spraying with an insecticide, propoxur, together with mass administration of a drug, sulfalene-pyrimethamine, at high frequency. The outcome is the difference between the frequency of Plasmodium falciparum in blood samples, that is, the frequency of a protozoan parasite that causes malaria, measured before and after the treatment. Using 1560 pairs of treated and control individuals matched by their age and gender, Hsu et al. [2013] found that the treatment was much more beneficial for young children than for other individuals, if there is no unmeasured confounding.

More interestingly, they found that, despite the reduced sample size, the 447 pairs of young children exhibit a treatment effect that is far less sensitive to unmeasured confounding bias than the full sample of 1560 pairs. So from a policy perspective, it may be preferable to implement the treatment only for young children rather than the whole population. In the rest of this paper we will generalize this idea to selecting and ranking treatment rules. We will use the matched dataset in Hsu et al. [2013] to illustrate the definitions and methodologies in the paper; see the original article for more information about the Garki Project dataset and the matched design. A different application concerning the effect of late retirement on health outcomes will be presented in Section 5 near the end of this article.
We first introduce some notation in order to compare treatment rules when there is unmeasured confounding. Let X denote all the pre-treatment covariates measured by the investigator. In the single-decision setting considered in this paper, an individualized treatment rule (or treatment regime) d maps a vector of pre-treatment covariates X to a binary treatment decision in {0, 1} (0 indicates control and 1 indicates treatment). In our running example, we shall consider six treatment rules, r0, r1, ..., r5, where r_i assigns treatment to the youngest i × 20% of the individuals. Specifically, the minimum, the 20%, 40%, 60%, 80% quantiles, and the maximum of age are 0 (newborn), 7, 20, 31, 41, and 73 years old.

Let Y be the outcome and Y(0), Y(1) be the potential outcomes under control and treatment. The potential outcome under a treatment rule d is defined, naturally, as Y(d) = Y(0)1{d(X) = 0} + Y(1)1{d(X) = 1}. A common way to compare treatment rules is to use the value function, defined as the expected potential outcome under that rule, V(d) = E[Y(d)]. The value difference of two treatment rules, r1 and r2, is thus

V(r1) − V(r2) = E[Y(r1) − Y(r2) | r1 ≠ r2] · P(r1 ≠ r2)
= E[Y(1) − Y(0) | r1 > r2] · P(r1 > r2) − E[Y(1) − Y(0) | r1 < r2] · P(r1 < r2), (1)

where for simplicity the event r1(X) ≠ r2(X) is abbreviated as r1 ≠ r2 (similarly for r1 < r2 and r1 > r2). Note that the event r1 > r2 is the same as r1 = 1, r2 = 0 because the treatment decision is binary. One of the terms on the right-hand side of (1) will become 0 if the treatment rules are nested. In the malaria example, r0 ≤ r1 ≤ ··· ≤ r5, so the value difference of the rules r2 and r1 can be written as V(r2) − V(r1) = E[Y(1) − Y(0) | Age ∈ [7, 20)] · P(Age ∈ [7, 20)). In this case, testing the sign of V(r2) − V(r1) is equivalent to testing the sign of the conditional average treatment effect, E[Y(1) − Y(0) | r2 > r1].

The definition of the value function depends on the potential outcomes. To identify the value function using observational data, it is standard to make the following assumptions [Kosorok and Laber, 2019]:

1. Positivity: P(A = a | X = x) > 0 for all a and x;
2. Consistency (SUTVA): Y = Y(A);
3. Ignorability (no unmeasured confounding): Y(a) ⊥ A | X for all a.

Under these conditions, it is straightforward to show that the value function is identified by [Qian and Murphy, 2011]

V(d) = E[ Y · I(A = d(X)) / π(A, X) ],

where I is the indicator function of an event and π(a, x) = P(A = a | X = x) is the propensity score.

The value function gives a natural and total order to the treatment rules. If the above identification assumptions hold, the value functions can be identified and thus this order can be consistently estimated as the sample size increases to infinity. In general, it is impossible to recover this order when there is unmeasured confounding. However, if the magnitude of unmeasured confounding is bounded according to a sensitivity model (a collection of distributions of the observed variables and unobserved potential outcomes), it is possible to partially identify the difference between the values of two treatment rules and thus obtain a partial order.

Definition 1. Let r1 and r2 be two treatment rules that map a vector of pre-treatment covariates X to a binary treatment decision in {0, 1}, and V(r1), V(r2) their corresponding value functions. Given a sensitivity analysis model indexed by Γ, we say that the rule r1 is dominated by r2 with a margin δ if V(r2) − V(r1) > δ for all distributions in the sensitivity analysis model. We denote this relation as r1 ≺_{Γ,δ} r2 and further abbreviate it as r1 ≺_Γ r2 if δ = 0. We write r1 ⊀_Γ r2 if r1 is not dominated by r2 with margin δ = 0.

Notice that the partial order should be defined in terms of the partially identified interval for V(r2) − V(r1) instead of the partially identified intervals for V(r1) and V(r2). This is because the same distribution of the unobserved potential outcomes needs to be used when computing the partially identified interval for V(r2) − V(r1), so it is not simply the difference between the partially identified intervals for the individual values (the easiest way to see this is to take r1 = r2). We thank an anonymous reviewer for pointing this out.

It is easy to see that ≺_Γ is a strict partial order on the set of treatment rules because it satisfies irreflexivity (not r ≺_Γ r), transitivity (r1 ≺_Γ r2 and r2 ≺_Γ r3 imply r1 ≺_Γ r3), and asymmetry (r1 ≺_Γ r2 implies not r2 ≺_Γ r1). In Rosenbaum's sensitivity model, to be introduced in the section below, Γ = 1 corresponds to no unmeasured confounding and thus the relationship ≺_{Γ=1} is a total order.

Testing r1 ⊀_Γ r2 using matched observational studies

With the goal of selecting and ranking treatment rules with unmeasured confounding in mind, in this section we consider the easier but essential task of comparing the values of two treatment rules, r1 and r2, under Rosenbaum's sensitivity model. This test will then serve as the basic element of our procedures for selecting and ranking among multiple treatment rules below.
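Before turning to the matched-pairs setup, the identification formula V(d) = E[Y · I(A = d(X)) / π(A, X)] above can be checked in a short simulation. The sketch below is our own illustration (synthetic data, known propensity score, all variable names hypothetical), assuming no unmeasured confounding:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Synthetic observational data; in practice pi would be estimated.
X = rng.uniform(0.0, 1.0, n)
pi_treat = 0.3 + 0.4 * X                   # P(A = 1 | X), known here
A = rng.binomial(1, pi_treat)
Y = 2.0 * A * X + rng.normal(0.0, 1.0, n)  # effect 2X, so E[Y(1)] = 1

def value_ipw(d, Y, A, X, pi_treat):
    """Inverse-probability-weighted estimate of V(d) = E[Y(d)]."""
    pi_a = np.where(A == 1, pi_treat, 1.0 - pi_treat)  # pi(A, X)
    keep = (A == d(X)).astype(float)                   # I(A = d(X))
    return float(np.mean(Y * keep / pi_a))

treat_all = lambda x: np.ones_like(x, dtype=int)
treat_none = lambda x: np.zeros_like(x, dtype=int)
print(value_ipw(treat_all, Y, A, X, pi_treat))   # close to E[2X] = 1
print(value_ipw(treat_none, Y, A, X, pi_treat))  # close to 0
```

With a known propensity score and ignorability, the weighted average recovers the value of each rule up to sampling noise; the rest of this section is about what happens when ignorability fails.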
We will first introduce the pair-matched design of an observational study and Rosenbaum's sensitivity model, and then describe a studentized sensitivity analysis proposed by Fogarty [2016] that tests Neyman's null hypothesis of the average treatment effect being zero under Rosenbaum's sensitivity model. This test can be immediately extended to compare the values of treatment rules.

Suppose the observed data are n pairs, i = 1, 2, ..., n, of two subjects, j = 1, 2. These n pairs are matched for observed covariates X and within each pair, one subject is treated, denoted A_ij = 1, and the other control, denoted A_ij = 0, so that we have X_i1 = X_i2 and A_i1 + A_i2 = 1 for all i. In a sensitivity analysis, we may fail to match on an unobserved confounder U_ij and thus incur unmeasured confounding bias.

Rosenbaum [1987, 2002] proposed a one-parameter sensitivity model. Let F = {(Y_ij(0), Y_ij(1), X_ij, U_ij), i = 1, ..., n, j = 1, 2} be the collection of all measured or unmeasured variables other than the treatment assignment. Rosenbaum's sensitivity model assumes that π_i = P(A_i1 = 1 | F) satisfies

1/(1 + Γ) ≤ π_i ≤ Γ/(1 + Γ), i = 1, 2, ..., n. (2)

When Γ = 1, this model asserts that π_i = 1/2 for all i and thus every subject has equal probability of being assigned to treatment or control (i.e., no unmeasured confounding). In general, Γ > 1 allows the two subjects in a matched pair to have different probabilities of receiving the treatment, reflecting bias from the unmeasured confounder. Sensitivity analyses in this model are classically based on randomization tests of Fisher's sharp null hypothesis of no treatment effect, that is, Y_ij(0) = Y_ij(1) for all i, j. The asymptotic properties of these randomization tests are studied in Rosenbaum [2004, 2015] and Zhao [2018].

In the context of comparing individualized treatment rules, Fisher's sharp null hypothesis is no longer suitable because we expect to have (and indeed are tasked to find) heterogeneous treatment effects. Recently, Fogarty [2016] developed a valid studentized test for Neyman's null hypothesis that the average treatment effect is equal to zero, (2n)^{-1} Σ_{ij} [Y_ij(1) − Y_ij(0)] = 0, under Rosenbaum's sensitivity model. We briefly describe Fogarty's test. Let D_i denote the treated-minus-control difference in the i-th matched pair, D_i = (A_i1 − A_i2)(Y_i1 − Y_i2). Fix the sensitivity parameter Γ and define

D_{i,Γ} = D_i − ((Γ − 1)/(Γ + 1)) |D_i|,  D̄_Γ = (1/n) Σ_{i=1}^n D_{i,Γ},  and  se(D̄_Γ)² = (1/(n(n − 1))) Σ_{i=1}^n (D_{i,Γ} − D̄_Γ)².

Fogarty [2016] showed that the one-sided Student-t test that rejects Neyman's hypothesis when

D̄_Γ / se(D̄_Γ) > Φ^{-1}(1 − α)

is asymptotically valid with level α under Rosenbaum's sensitivity model (2) and mild regularity conditions. This test can be easily extended to test the null that the average treatment effect is no greater than δ by replacing D_i with D_i − δ. Fogarty [2016] also provided a randomization-based reference distribution in addition to the large-sample normal approximation.

The above test for the average treatment effect can be readily extended to comparing treatment rules. Recall that equation (1) implies the value difference of two rules r1 and r2 is a weighted difference of two conditional average treatment effects on the sets r1 > r2 and r2 > r1. When the two rules are nested (without loss of generality assume r2 ≥ r1), testing the null hypothesis that r1 ⊀_Γ r2 is equivalent to testing a Neyman-type hypothesis E[Y(1) − Y(0) | r2 > r1] ≤ 0 using only the matched pairs i that satisfy r2(X_i) > r1(X_i). When the two rules are not nested, we can flip the sign of D_i for those i such that r2(X_i) < r1(X_i) and then apply Fogarty's test. In summary, to test the null hypothesis that r1 ⊀_Γ r2, we can simply apply Fogarty's test to {D_i · [r2(X_i) − r1(X_i)], for i such that r1(X_i) ≠ r2(X_i)}. To test the hypothesis r1 ⊀_{Γ,δ} r2, we can use Fogarty's test for the average treatment effect no greater than δ · (n/m), where m = |{i : r1(X_i) ≠ r2(X_i)}|.

A hallmark of Rosenbaum's sensitivity analysis framework is its tipping-point analysis, and that extends to the comparison of treatment rules.
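Fogarty's studentized test reduces to a few lines of code. The sketch below is our own illustration (synthetic matched-pair differences, not the paper's software), implementing the statistics D_{i,Γ}, D̄_Γ and se(D̄_Γ) described above and sweeping a grid of Γ, which is exactly the kind of sweep a tipping-point analysis performs:

```python
import numpy as np
from statistics import NormalDist

def fogarty_reject(D, gamma, alpha=0.05):
    """One-sided studentized sensitivity analysis for Neyman's null that
    the average treatment effect is <= 0, following the formulas in the
    text. D holds treated-minus-control matched-pair differences."""
    D = np.asarray(D, dtype=float)
    n = len(D)
    D_gamma = D - (gamma - 1.0) / (gamma + 1.0) * np.abs(D)  # worst case
    mean = D_gamma.mean()
    se = np.sqrt(np.sum((D_gamma - mean) ** 2) / (n * (n - 1)))
    return mean / se > NormalDist().inv_cdf(1 - alpha)

# Matched pairs with a strong positive effect (synthetic illustration).
rng = np.random.default_rng(1)
D = rng.normal(0.8, 1.0, 400)

for gamma in (1.0, 2.0, 4.0, 10.0):
    print(gamma, fogarty_reject(D, gamma))
# Rejection typically persists for moderate Gamma and eventually fails,
# because the worst-case shift (Gamma-1)/(Gamma+1)*|D_i| grows with Gamma.
```

To compare two treatment rules rather than test the average effect, the same function can be applied to the signed differences D_i · [r2(X_i) − r1(X_i)] over the pairs with r1(X_i) ≠ r2(X_i), as described in the text.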
When testing r1 ⊀_Γ r2 with a series of Γ, there exists a smallest Γ such that the null hypothesis cannot be rejected; that is, we are no longer confident that r1 is dominated by r2 in that Γ-sensitivity model. This tipping point is commonly referred to as the sensitivity value [Zhao, 2018]. Formally, we define the sensitivity value for r1 ≺ r2 as

Γ*_α(r1 ≺ r2) = inf{Γ ≥ 1 : the null hypothesis V(r1) ≥ V(r2) cannot be rejected at level α under the Γ-sensitivity model}.

Let r0 be the null treatment rule (for example, assigning control to the entire population). The sensitivity value Γ*_α(r0 ≺ r) is further abbreviated as Γ*_α(r).

Zhao [2018] studied the asymptotic properties of the sensitivity value when testing Fisher's sharp null hypothesis using a class of signed score statistics. Below, we will give the asymptotic distribution of Γ*_α(r1 ≺ r2) using Fogarty's test as described in the last section. The result will be stated in terms of a transformation of the sensitivity value,

κ*_α(r1 ≺ r2) = (Γ*_α(r1 ≺ r2) − 1) / (Γ*_α(r1 ≺ r2) + 1),

so that Γ* = 1 is transformed to κ* = 0 and 0 ≤ κ* < 1.

Proposition 1. Assume the treatment rules are nested, r1(x) ≤ r2(x), and let I be the set of indices i where r1(X_i) < r2(X_i). Assuming the moments of |D_i| exist and E[D_i | r1 < r2] > 0, we have

√|I| ( κ*_α(r1 ≺ r2) − E[D_i | r1 < r2] / E[|D_i| | r1 < r2] ) →d N(z_α μ, σ²), as |I| → ∞, (3)

where z_α is the upper-α quantile of the standard normal distribution and the parameters μ and σ² depend on the distribution of D_i (the expressions can be found in the Appendix).

The proof of this proposition can be found in the Appendix. When the treatment rules are not nested, one can simply replace D_i with D_i · [r2(X_i) − r1(X_i)] and the condition r1 < r2 with r1 ≠ r2 in the proposition statement. The asymptotic distribution of Γ*_α(r1 ≺ r2) can be found by the delta method and we omit further detail.

The asymptotic distribution in (3) is similar to the one obtained in Zhao [2018, Thm. 1]. When the treatment rules are nested, r1 ≤ r2, and |I| → ∞, the sensitivity value converges to a number that depends on the distribution of D_i,

Γ*_α(r1 ≺ r2) →p ( E[|D_i| | r1 < r2] + E[D_i | r1 < r2] ) / ( E[|D_i| | r1 < r2] − E[D_i | r1 < r2] ).

The limit on the right-hand side is the design sensitivity [Rosenbaum, 2004] of Fogarty's test for comparing the treatment rules. As the sample size converges to infinity, the power of Fogarty's test converges to 1 at Γ smaller than the design sensitivity and to 0 at Γ larger than the design sensitivity. The normal distribution in (3) further approximates the finite-sample behavior of the sensitivity value and can be used to compute the power of a sensitivity analysis by the fact that rejecting r1 ⊀_Γ r2 at level α is equivalent to Γ*_α(r1 ≺ r2) ≥ Γ.

3 Ranking and selecting multiple treatment rules

Next we consider the problem of comparing multiple treatment rules with unmeasured confounding. To this end, we need to define the goal and the statistical error we would like to control.
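The sensitivity value defined above can be computed numerically. Here is a sketch (our own code, not the paper's), which bisects on Γ assuming the rejection decision of a Fogarty-style test is monotone in Γ; the data are synthetic:

```python
import numpy as np
from statistics import NormalDist

def rejects(D, gamma, alpha=0.05):
    """Fogarty-style one-sided test of 'average effect <= 0' under the
    Gamma-sensitivity model (same construction as in the text)."""
    D = np.asarray(D, dtype=float)
    n = len(D)
    Dg = D - (gamma - 1.0) / (gamma + 1.0) * np.abs(D)
    se = np.sqrt(np.sum((Dg - Dg.mean()) ** 2) / (n * (n - 1)))
    return Dg.mean() / se > NormalDist().inv_cdf(1 - alpha)

def sensitivity_value(D, alpha=0.05, gamma_max=1000.0, tol=1e-4):
    """Approximate Gamma*_alpha: the smallest Gamma at which the test no
    longer rejects, found by bisection (rejection assumed monotone)."""
    if not rejects(D, 1.0, alpha):
        return 1.0
    lo, hi = 1.0, gamma_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rejects(D, mid, alpha):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(2)
D = rng.normal(0.8, 1.0, 400)   # synthetic matched-pair differences
print(sensitivity_value(D))     # finite-sample tipping point
```

For this data-generating process, the design sensitivity (E[|D_i|] + E[D_i]) / (E[|D_i|] − E[D_i]) is roughly 7.7, and the finite-sample sensitivity value sits below it, consistent with the asymptotics in (3).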
A problem related to comparing multiple treatment rules is the selection and ordering of multiple subpopulations [Gupta and Panchapakesan, 1979, Gibbons et al., 1999], for example, given K independent measurements Y_i ∼ N(μ_i, 1), where μ_i is some characteristic of the i-th subpopulation. When comparing the μ_i, there are many goals we can define. In fact, Gibbons et al. [1999, p. 4] gave a list of 7 possible goals for ranking and selection of subpopulations and considered them in the rest of their book. We believe at least 3 out of their 7 goals have practically meaningful counterparts in comparing treatment rules. Given K + 1 treatment rules, R = {r0, r1, ..., rK}, we may ask, in terms of their values:

1. What is the ordering of all the treatment rules?
2. Which treatment rule is the best?
3. Which treatment rule(s) are better than the null/control r0?

In a randomized experiment or an observational study with no unmeasured confounding, it may be possible to obtain estimates of the value that are jointly asymptotically normal and then directly use the methods in Gibbons et al. [1999]. However, as discussed in Section 2, this no longer applies when there is unmeasured confounding because the value function may only be partially identified.

When there is unmeasured confounding, the three goals above need to be modified because the value function only defines a partial order among the treatment rules (Definition 1). We make the following definitions.
Definition 2. In the Γ-sensitivity model, the maximal rules in R are the ones not dominated by any other rule,

R_{max,Γ} = {r_i : r_i ⊀_Γ r_j, ∀j}.

The positive rules are the ones which dominate the control and the null rules are the ones which do not dominate the control,

R_{pos,Γ} = {r_i : r0 ≺_Γ r_i},  R_{nul,Γ} = R \ R_{pos,Γ}.

The maximal set R_{max,Γ} and the null set R_{nul,Γ} are always non-empty (the latter is because r0 ∈ R_{nul,Γ}), become larger as Γ increases, and in general become the full set R as Γ → ∞.

In the rest of this section, we will consider the following three statistical problems: for some pre-specified significance level α > 0,

1. Can we construct a set of ordered pairs of treatment rules, Ô_Γ ⊂ {(r_i, r_j), i, j = 0, ..., K, i ≠ j}, such that the probability that all the orderings are correct is at least 1 − α, that is, P(r_i ≺_Γ r_j, ∀(r_i, r_j) ∈ Ô_Γ) ≥ 1 − α?

2. Can we construct a subset of treatment rules, R̂_{max,Γ}, such that the probability that it contains all maximal rules is at least 1 − α, that is, P(R_{max,Γ} ⊆ R̂_{max,Γ}) ≥ 1 − α?

3. Can we construct a subset of treatment rules, R̂_{pos,Γ}, such that the probability that it does not cover any null rule is at least 1 − α, that is, P(R̂_{pos,Γ} ∩ R_{nul,Γ} = ∅) ≥ 1 − α?

Next, we will propose strategies to achieve the above statistical goals based on the test of two treatment rules with unmeasured confounding described above.

To start with, let us consider the first goal, ordering the treatment rules, as the statistical inference is more straightforward. It is the same as the multiple testing problem where we would like to control the family-wise error rate (FWER) for the collection of hypotheses, {H_ij : r_i ⊀_Γ r_j, i, j = 0, ..., K, i ≠ j}. In principle, we can apply any multiple testing procedure that controls the FWER. A simple example is Bonferroni's correction for all the K(K − 1) tests.

In sensitivity analysis problems, we can often greatly improve the statistical power by reducing the number of tests using a planning sample [Heller et al., 2009, Zhao et al., 2018]. This is because Rosenbaum's sensitivity analysis considers the worst-case scenario and is generally conservative when Γ > 1. The planning sample can be further used to order the hypotheses so we can sequentially test them, for example, using a fixed sequence testing procedure [Koch and Gansky, 1996, Westfall and Krishen, 2001].

There are many possible ways to screen out, order, and then test the hypotheses. Here we demonstrate one possibility:

Step 1: Split the data into two parts. The first part is used for planning and the second part for testing.

Step 2: For every pair of treatment rules (r_i, r_j), use the planning sample to estimate the population parameters in the asymptotic distribution of the sensitivity value (3).

Step 3: Compute the approximate power of testing H_ij : r_i ⊀_Γ r_j in the testing sample using (3). Order the hypotheses by the estimated power, from highest to lowest.

Step 4: Sequentially test the ordered hypotheses using the testing sample at level α, until one hypothesis cannot be rejected.

Step 5: Output a Hasse diagram of the treatment rules by using all the rejected hypotheses.

A Hasse diagram is an informative graph to represent a partial order (in our case, ≺_Γ). In this diagram, each treatment rule is represented by a vertex and an edge goes upward from rule r_i to rule r_j if r_i ≺_Γ r_j and there exists no r_k such that r_i ≺_Γ r_k and r_k ≺_Γ r_j.

Due to the transitivity of a partial order, an upward path from r_i to r_j in the Hasse diagram (for example, in Figure 1 with Γ = 1.3) indicates that r_i ≺_Γ r_j, even if we could not directly reject r_i ⊀_Γ r_j in Step 4. The next proposition shows that the above multiple testing procedure also controls the FWER for all the apparent and implied orders represented by the Hasse diagram.

Proposition 2. Let Ô_Γ ⊂ {(r_i, r_j), i ≠ j} be a random set of ordered treatment rules obtained using the procedure above or any other multiple testing procedure. Let

Ô_{Γ,ext} = Ô_Γ ∪ {(r_i, r_j) : ∃ k_1, ..., k_m such that (r_i, r_{k_1}), (r_{k_1}, r_{k_2}), ..., (r_{k_m}, r_j) ∈ Ô_Γ}

be the extended set implied from the Hasse diagram. Then the FWER with respect to Ô_{Γ,ext} is the same as the FWER with respect to Ô_Γ:

P(r_i ≺_Γ r_j, ∀(r_i, r_j) ∈ Ô_Γ) = P(r_i ≺_Γ r_j, ∀(r_i, r_j) ∈ Ô_{Γ,ext}).

Proof. We show the two events are equivalent. The ⊆ direction is trivial. For ⊇, notice that any false positive in Ô_{Γ,ext}, say r_i ≺_Γ r_j, implies that there is at least one false positive along the path from r_i to r_j, that is, there is at least one false positive among r_i ≺_Γ r_{k_1}, r_{k_1} ≺_Γ r_{k_2}, ..., r_{k_m} ≺_Γ r_j, which are all in Ô_Γ. Thus, any false positive in Ô_{Γ,ext} implies that there is also at least one false positive in Ô_Γ.

We illustrate the proposed method using the malaria dataset. We first use half of the data to estimate the population parameters in (3) for each pair of treatment rules (r_i, r_j). For every value of Γ, we use (3) to compute the asymptotic power for the test of H_ij : r_i ⊀_Γ r_j using the other half of the data. We then order the hypotheses by the estimated power, from the highest to the lowest; the resulting order of the hypotheses H_ij varies with Γ. Finally we follow Steps 4 and 5 above. We obtained Hasse diagrams for a variety of Γ, which are shown in Figure 1. As a baseline for comparison, Figure 2 shows the Hasse diagrams obtained by a simple Bonferroni adjustment for all K(K −
1) = 30 hypotheses using all the data. Although onlyhalf of the data is used to test, ordering the hypotheses not only identified all the discoveries thatthe Bonferroni procedure identified, but also made one extra discovery when Γ = 1 .
3, 1 .
5, 2 .
5, 3 . .
8, 2, and 3. r r r r r r Γ = 1 | ˆ O| = 12 r r r r r r Γ = 1 . | ˆ O| = 10 r r r r r r Γ = 1 . | ˆ O| = 9 r r r r r r Γ = 1 . | ˆ O| = 8 r r r r r r Γ = 2 | ˆ O| = 7 r r r r r r Γ = 2 . | ˆ O| = 6 r r r r r r Γ = 3 . | ˆ O| = 5 r r r r r r Γ = 3 . | ˆ O| = 3 r r r r r r Γ = 4 | ˆ O| = 2 r r r r r r Γ = 6 | ˆ O| = 0Figure 1: Malaria example: Hasse diagrams obtained using sample-splitting and fixed sequencetesting; α = 0 .
1. 15 r r r r r Γ = 1 | ˆ O| = 10 r r r r r r Γ = 1 . | ˆ O| = 9 r r r r r r Γ = 1 . | ˆ O| = 8 r r r r r r Γ = 1 . | ˆ O| = 6 r r r r r r Γ = 2 | ˆ O| = 5 r r r r r r Γ = 2 . | ˆ O| = 5 r r r r r r Γ = 3 | ˆ O| = 3 r r r r r r Γ = 3 . | ˆ O| = 2 r r r r r r Γ = 4 | ˆ O| = 1 r r r r r r Γ = 6 | ˆ O| = 0Figure 2: Malaria example: Hasse diagrams obtained using Bonferroni’s adjustment; α = 0 . Next we consider constructing a set that covers all the maximal rules. Our proposal is based onthe following observation: if the hypothesis r i (cid:54)≺ Γ r j can be rejected, then r i is unlikely a maximalrule. More precisely, because r i ∈ R max , Γ implies that r i (cid:54)≺ Γ r j must be true, by the definition ofthe type I error of a hypothesis test, P ( r i (cid:54)≺ Γ r j is rejected | r i ∈ R max , Γ ) ≤ α. This suggests that we can derive a set of maximal rules from an estimated set of partial orders:ˆ R max , Γ = { r i : ( r i , r j ) (cid:54)∈ ˆ O Γ , ∀ j } . (4)In other words, ˆ R max , Γ contains all the “leaves” in the Hasse diagram of ˆ O Γ (a leaf in the Hassediagram is a vertex who has no edge going upward). For example, in Figure 1, the leaves are { r , r , r } when Γ = 1 . { r , r , r , r } when Γ = 1 .
5. Because {R_{max,Γ} ⊄ R̂_{max,Γ}} = {∃ r_i ∈ R_{max,Γ} such that (r_i, r_j) ∈ Ô_Γ for some j}, the estimated set of maximal rules satisfies P(R_{max,Γ} ⊄ R̂_{max,Γ}) ≤ α, as desired, whenever Ô_Γ strongly controls the FWER at level α.

Equation (4) suggests that only one hypothesis r_i ⊀_Γ r_j needs to be rejected in order to exclude r_i from R̂_{max,Γ}. This means that, when the purpose is to select the maximal rules, we do not need to test r_i ⊀_Γ r_j if another hypothesis r_i ⊀_Γ r_k for some k ≠ j is already rejected. Therefore, we can modify the procedure of finding Ô_Γ to further decrease the size of R̂_{max,Γ} obtained from (4). For example, in the five-step procedure demonstrated in Section 3.2, we can replace Step 3 by:

Step 3': After ordering the hypotheses in Step 3, remove any hypothesis H_{ij} : r_i ⊀_Γ r_j if there is already a hypothesis H_{ik} appearing before H_{ij} for some k ≠ j.

Again we use the malaria example to illustrate the selection of the best treatment rules. As an example, when Γ = 2.
0, Step 3' reduced the original sequence of hypotheses to a sequence of six hypotheses. We used the hold-out samples to test these hypotheses sequentially at level α = 0.1. Therefore, a level α = 0.1 estimate of the maximal set contains four rules when Γ = 2. Table 1 lists the estimated maximal set R̂_{max,Γ} for a range of values of Γ between 1 and 2.5.

Table 1: R̂_{max,Γ} for different choices of Γ.
Γ   R̂_{max,Γ}           Γ   R̂_{max,Γ}
    { r , r , r }            { r , r , r , r }
    { r , r , r }            { r , r , r , r , r }
    { r , r , r , r }        { r , r , r , r , r }
    { r , r , r , r }        { r , r , r , r , r }
    { r , r , r , r }        { r , r , r , r , r , r }

Finally we consider how to select treatment rules that are better than a control rule. This can also be transformed into a multiple testing problem for the hypotheses H_i : r_0 ⊀_Γ r_i, i = 1, …, K. Let R̂_{pos,Γ} be the collection of rejected hypotheses following some multiple testing procedure. By the definition of FWER, P(R̂_{pos,Γ} ∩ R_{nul,Γ} ≠ ∅) ≤ α if the multiple testing procedure strongly controls the FWER at level α. As an example, one can modify the procedure in Section 3.2 to select the positive rules by only considering H_i, i = 1, …, K, in Step 3.

In practice, a small increase of the value function, though statistically significant, may not justify a policy change. In this case, it may be desirable to estimate the positive rules that dominate the control rule by a margin δ, R_{pos,Γ,δ} = {r_i : r_0 ≺_{Γ,δ} r_i}. To obtain an estimate of R_{pos,Γ,δ}, one can further modify the procedure in Section 3.2 by replacing the hypothesis H_i : r_0 ⊀_Γ r_i with the stronger H_i : r_0 ⊀_{Γ,δ} r_i.

Table 2: Estimated positive rules R̂_{pos,Γ,δ} for different choices of Γ and δ.
Γ = 1 Γ = 1 . . .
δ = 0   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r , r }
δ = 1   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r , r }
δ = 2   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r , r }
δ = 4   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r , r }
δ = 6   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r }   { r }

Γ = 2 . . .
δ = 0   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r , r , r }
δ = 1   { r , r , r , r , r }   { r , r , r , r , r }   { r , r , r }
δ = 2   { r , r , r , r , r }   { r , r , r }   { r , r }
δ = 4   { r , r , r }   ∅   ∅
δ = 6   ∅   ∅   ∅

Γ = 3 . . .
δ = 0   { r , r , r }   { r , r }   ∅
δ = 1   { r , r }   { r }   ∅
δ = 2   ∅   ∅   ∅
δ = 4   ∅   ∅   ∅
δ = 6   ∅   ∅   ∅

We construct R̂_{pos,Γ,δ} with various choices of Γ and δ for the malaria example. In this case, δ measures the decrease in the number of Plasmodium falciparum parasites per milliliter of blood, averaged over the entire study sample. Table 2 gives a summary of the results. As expected, the estimated set of positive rules becomes smaller as Γ or δ increases. We observe that, although the two rules assigning treatment to those under 7 and under 20 are unlikely to be the optimal rules if there is no unmeasured confounding (Table 1), they are more robust to unmeasured confounding than the others, dominating the control rule up to Γ = 4.

We study and report the performance of three methods of selecting the positive rules R_{pos,Γ,δ} using numerical simulations in this section. Simulation results for selecting the maximal rules are reported in the Supplementary Materials. We constructed 5 or 10 cohorts of data in which the treatment effect is constant within each cohort but differs between cohorts. After matching, the treated-minus-control difference in each cohort was normally distributed, with the mean vector µ taking one of four settings (the values are listed in the captions of Tables 3 to 6):

1. five cohorts with all positive effects;
2. five cohorts with one negative effect;
3. ten cohorts with all positive effects;
4. ten cohorts with two negative effects.

We compare the following three methods.

1. Bonferroni:
The full data are used to test the hypotheses H_i : r_0 ⊀_Γ r_i, and the Bonferroni correction is used to adjust for multiple comparisons.

2. Ordering by power:
This is the procedure described in Section 3.2, using sample splitting and fixed sequence testing.

3.
Ordering by value function:
This is the same as above except that the hypotheses are ordered by their estimated value at Γ = 1.

For the second and third methods, we used either a half or a quarter of the matched pairs (randomly chosen) to order the hypotheses. Extra simulation results using different split proportions are reported in the Supplementary Materials. This simulation was replicated 1000 times to report the power and the error rate of each method. The power is defined as the average size of the estimated set of positive rules R̂_{pos,Γ}, and the error rate is 1 − P(R̂_{pos,Γ} ⊆ R_{pos,Γ}), with nominal level 0.05. Ordering the hypotheses by (the estimated) power, though it loses some sample for testing, can be much more powerful at larger values of Γ. For instance, in Table 5 when Γ = 3.0, the two power-based methods are more than twice as powerful as the Bonferroni method. We observe that using only a small planning sample (25%) seems to work well in the simulations. This is not too surprising given our theoretical results: Proposition 1 suggests that only the first two moments of D and |D| are needed to estimate the sensitivity value asymptotically.

Table 3: Power and error rate (separated by /) of three methods estimating R_{pos,Γ}. Power is defined as the size of R̂_{pos,Γ} and the error rate is defined as 1 − P(R̂_{pos,Γ} ⊆ R_{pos,Γ}). The treatment effect in the 5 cohorts is given by a 5-dimensional µ with all positive components. The true R_{pos,Γ} is listed below each Γ value.
Cohort size Method Γ = 1 . . . . .
{ r , . . . , r } { r , . . . , r } { r , r , r } { r , r } { r }
250  Bonferroni   5.00 / 0.00  2.54 / 0.01  1.60 / 0.03  0.72 / 0.02  0.11 / 0.00
     Value (50%)  5.00 / 0.00  0.51 / 0.07  0.08 / 0.04  0.00 / 0.00  0.00 / 0.00
     Power (50%)  5.00 / 0.00  2.30 / 0.07  1.46 / 0.07  0.74 / 0.04  0.20 / 0.00
     Value (25%)  5.00 / 0.00  0.73 / 0.07  0.18 / 0.05  0.03 / 0.01  0.00 / 0.00
     Power (25%)  5.00 / 0.00  2.64 / 0.07  1.69 / 0.06  0.85 / 0.05  0.21 / 0.00
100  Bonferroni   4.99 / 0.00  1.39 / 0.02  0.75 / 0.02  0.37 / 0.02  0.08 / 0.00
     Value (50%)  4.80 / 0.00  0.49 / 0.07  0.16 / 0.04  0.04 / 0.02  0.00 / 0.00
     Power (50%)  4.77 / 0.00  1.15 / 0.05  0.75 / 0.05  0.38 / 0.03  0.15 / 0.02
     Value (25%)  4.99 / 0.00  0.61 / 0.06  0.24 / 0.03  0.12 / 0.03  0.01 / 0.00
     Power (25%)  4.99 / 0.00  1.33 / 0.05  0.80 / 0.05  0.52 / 0.05  0.12 / 0.00

Table 4: Power and error rate (separated by /) of three methods estimating R_{pos,Γ}. Power is defined as the size of R̂_{pos,Γ} and the error rate is defined as 1 − P(R̂_{pos,Γ} ⊆ R_{pos,Γ}). The treatment effect in the 5 cohorts is given by a 5-dimensional µ with one negative component. The true R_{pos,Γ} is listed below each Γ value.
Cohort size Method Γ = 1 . . . . .
{ r , r , r , r } { r , r , r } { r , r } { r } ∅
250  Bonferroni   3.14 / 0.00  2.36 / 0.00  0.78 / 0.00  0.45 / 0.01  0.01 / 0.01
     Value (50%)  3.21 / 0.00  1.78 / 0.00  0.06 / 0.02  0.00 / 0.00  0.00 / 0.00
     Power (50%)  3.21 / 0.00  2.03 / 0.00  0.75 / 0.02  0.47 / 0.02  0.07 / 0.07
     Value (25%)  3.25 / 0.00  2.21 / 0.00  0.07 / 0.03  0.00 / 0.00  0.00 / 0.00
     Power (25%)  3.25 / 0.00  2.35 / 0.00  0.83 / 0.02  0.54 / 0.02  0.07 / 0.07
100  Bonferroni   3.02 / 0.00  1.19 / 0.00  0.37 / 0.00  0.21 / 0.01  0.02 / 0.02
     Value (50%)  3.03 / 0.00  0.71 / 0.00  0.04 / 0.02  0.00 / 0.00  0.00 / 0.00
     Power (50%)  3.02 / 0.00  0.93 / 0.00  0.43 / 0.01  0.29 / 0.04  0.08 / 0.08
     Value (25%)  3.10 / 0.00  1.12 / 0.00  0.04 / 0.02  0.00 / 0.00  0.00 / 0.00
     Power (25%)  3.11 / 0.00  1.32 / 0.00  0.42 / 0.01  0.31 / 0.03  0.07 / 0.07
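To make the simulation design concrete, the following is a minimal Python sketch of one replicate of the Bonferroni method. The effect vector `mu`, the cohort size, and the nominal level here are illustrative stand-ins (the exact values are not all recoverable from this version of the text), and `gamma_statistic` implements the worst-case standardized statistic D̄_Γ / se(D̄_Γ) used in the sensitivity analysis (equation (5) in the Appendix).

```python
import numpy as np
from statistics import NormalDist

def gamma_statistic(d, gamma):
    """Worst-case standardized statistic D_bar_Gamma / se(D_bar_Gamma) for
    matched-pair differences d under Rosenbaum's sensitivity model."""
    kappa = (gamma - 1.0) / (gamma + 1.0)
    d_gamma = d - kappa * np.abs(d)           # D_{i, Gamma}
    n = d_gamma.size
    se = np.sqrt(np.sum((d_gamma - d_gamma.mean()) ** 2)) / n
    return d_gamma.mean() / se

# One replicate: test H_i for each cohort at level alpha / K and record
# which cohorts are declared positive at this value of Gamma.
rng = np.random.default_rng(1)
mu = np.array([0.2, 0.4, 0.6, 0.8, 1.0])      # hypothetical cohort effects
n_pairs, gamma, alpha = 250, 2.0, 0.05
z_bonf = NormalDist().inv_cdf(1.0 - alpha / mu.size)
rejected = [k for k in range(mu.size)
            if gamma_statistic(rng.normal(mu[k], 1.0, n_pairs), gamma) > z_bonf]
```

Repeating this over many replicates and averaging the size of `rejected` (and the frequency with which a null cohort is included) gives power and error-rate summaries of the kind reported in the tables.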
Finally, we apply the proposed method to study the potentially heterogeneous effect of retirement timing on senior health. Many empirical studies have focused on the effect of retirement timing on

Table 5: Power and error rate (separated by /) of three methods estimating R_{pos,Γ}. Power is defined as the size of R̂_{pos,Γ} and the error rate is defined as 1 − P(R̂_{pos,Γ} ⊆ R_{pos,Γ}). The treatment effect in the 10 cohorts is given by a 10-dimensional µ with all positive components. The true R_{pos,Γ} is listed below each Γ value.
Cohort size Method Γ = 1 . . . . .
{ r , . . . , r } { r , . . . , r } { r , . . . , r } { r , r } { r }
250  Bonferroni   10.00 / 0.00  6.80 / 0.01  2.41 / 0.00  0.20 / 0.00  0.02 / 0.01
     Value (50%)  10.00 / 0.00  0.88 / 0.06  0.00 / 0.00  0.00 / 0.00  0.00 / 0.00
     Power (50%)  10.00 / 0.00  6.30 / 0.05  2.34 / 0.03  0.44 / 0.02  0.11 / 0.05
     Value (25%)  10.00 / 0.00  1.12 / 0.01  0.00 / 0.00  0.00 / 0.00  0.00 / 0.00
     Power (25%)  10.00 / 0.00  7.06 / 0.06  2.73 / 0.02  0.42 / 0.02  0.10 / 0.05
100  Bonferroni    9.99 / 0.00  3.97 / 0.01  1.14 / 0.00  0.12 / 0.00  0.03 / 0.02
     Value (50%)   9.95 / 0.00  0.76 / 0.05  0.03 / 0.01  0.00 / 0.00  0.00 / 0.00
     Power (50%)   9.91 / 0.00  3.18 / 0.04  1.17 / 0.02  0.28 / 0.03  0.10 / 0.05
     Value (25%)   9.95 / 0.00  1.06 / 0.04  0.06 / 0.01  0.00 / 0.00  0.00 / 0.00
     Power (25%)   9.99 / 0.00  3.93 / 0.04  1.39 / 0.02  0.25 / 0.02  0.09 / 0.05

short-term and long-term health status of elderly people [Morrow-Howell et al., 2001, Alavinia and Burdorf, 2008, Börsch-Supan and Jürges, 2006]. One theory, known as the "psychosocial-materialist" approach, suggests that retiring late may have health benefits because work forms a key part of the identity of the elderly and provides financial, social, and psychological resources [Calvo et al., 2012]. However, the health benefits of late retirement may differ across subpopulations [Dave et al., 2008, Westerlund et al., 2009].

We obtained the observational data from the Health and Retirement Study (HRS), an ongoing nationally representative survey of more than 30,000 adults who are older than 50 and their spouses in the United States. The HRS is sponsored by the National Institute on Aging; detailed information on the HRS and its design can be found in Sonnega et al. [2014]. We use the RAND HRS Longitudinal File 2014 (V2), an easy-to-use dataset based on the HRS core data that consists of a follow-up study of 15,
843 elderly people [RAND, 2018].

We defined the treatment as late retirement (retirement after 65 years old and before 70 years old) and asked how it impacted self-reported health status at the age of 70 (coded as: 5 - extremely good, 4 - very good, 3 - good, 2 - fair, and 1 - poor). We included individuals who retired before 70 and had complete measurements of the following confounders: year of birth, gender, education (years), race (white or not), occupation (1: executives and managers; 2: professional specialty; 3: sales and administration; 4: protection services and armed forces; 5: cleaning, building, food prep, and personal services; 6: production, construction, and operation), partnered, annual income, and

Table 6: Power and error rate (separated by /) of three methods estimating R_{pos,Γ}. Power is defined as the size of R̂_{pos,Γ} and the error rate is defined as 1 − P(R̂_{pos,Γ} ⊆ R_{pos,Γ}). The treatment effect in the 10 cohorts is given by a 10-dimensional µ with two negative components. The true R_{pos,Γ} is listed below each Γ value.
Cohort size Method Γ = 1 . . . . .
{ r , . . . , r , r , r } { r , . . . , r } { r , r , r } { r , r } { r }
250  Bonferroni   5.98 / 0.00  3.51 / 0.00  1.55 / 0.00  0.37 / 0.00  0.07 / 0.00
     Value (50%)  5.97 / 0.02  0.10 / 0.02  0.00 / 0.00  0.00 / 0.00  0.00 / 0.00
     Power (50%)  6.00 / 0.02  3.43 / 0.02  1.53 / 0.01  0.56 / 0.02  0.21 / 0.01
     Value (25%)  6.02 / 0.03  0.23 / 0.04  0.03 / 0.00  0.01 / 0.00  0.00 / 0.00
     Power (25%)  6.02 / 0.02  3.67 / 0.04  1.84 / 0.01  0.66 / 0.01  0.22 / 0.01
100  Bonferroni   5.60 / 0.01  2.42 / 0.00  0.68 / 0.00  0.19 / 0.00  0.06 / 0.00
     Value (50%)  5.24 / 0.03  0.22 / 0.03  0.04 / 0.00  0.01 / 0.00  0.00 / 0.00
     Power (50%)  5.48 / 0.03  2.23 / 0.03  0.86 / 0.02  0.31 / 0.02  0.16 / 0.02
     Value (25%)  5.58 / 0.02  0.71 / 0.03  0.16 / 0.00  0.04 / 0.00  0.01 / 0.00
     Power (25%)  5.71 / 0.02  2.61 / 0.03  0.98 / 0.01  0.36 / 0.02  0.14 / 0.02

smoking status. This left us with 1934 treated subjects and 4831 controls. Figure 3 plots the distribution of retirement age in all samples and in the treatment group. The distribution of retirement age in the treatment group is right-skewed, with a spike of people retiring shortly after 65 years old. In the Supplementary Materials, we give a detailed account of data preprocessing and sample inclusion criteria.

Figure 3: Distribution of retirement age. (a) In all samples; (b) in the treated group.

Using optimal matching as implemented in the optmatch
R package [Hansen and Klopfer, 2006], we form 1858 matched pairs, matching exactly on the year of birth, gender, occupation, and partnered or not, and balancing race, years of education, and smoking status. Table 7 summarizes the covariate balance after matching. After matching, the treated and control groups are well balanced: the standardized differences of all covariates are less than 0.1. Additionally, the propensity scores in the treated and control groups have good overlap before and after matching (see the Supplementary Materials).

Table 7: Covariate balance after matching.
                                                                   Control  Treated  std.diff
Year of birth                                                      1936.27  1936.27    0.00
Female                                                                0.53     0.53    0.00
Non-hispanic white                                                    0.77     0.75   -0.04
Education (yrs)                                                      12.52    12.53    0.00
Occupation: cleaning, building, food prep, and personal services      0.10     0.10    0.00
Occupation: executives and managers                                   0.16     0.16    0.00
Occupation: production, construction, and operation                   0.28     0.28    0.00
Occupation: professional specialty                                    0.19     0.19    0.00
Occupation: protection services and armed forces                      0.02     0.02    0.00
Occupation: sales and admin                                           0.25     0.25    0.00
Partnered                                                             0.74     0.74    0.00
Smoke ever                                                            0.63     0.59   -0.08

We considered two potential effect modifiers, namely gender and occupation. More complicated treatment rules can in principle be considered within our framework, though having more treatment rules generally reduces the power of multiple testing. We grouped the 6 occupations into 2 broad categories: white-collar jobs (executives and managers, and professional specialties) and blue-collar jobs (sales, administration, protection services, personal services, production, construction, and operation). There were 4 subgroups defined by these two potential effect modifiers: male white-collar workers (G_1), female white-collar workers (G_2), male blue-collar workers (G_3), and female blue-collar workers (G_4). Thus, there were a total of 2^4 = 16 different regimes formed out of these two effect modifiers.
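This enumeration can be made mechanical. The sketch below lists all 2^4 = 16 rules by their binary and decimal codings; the subgroup labels and the bit order (G_1 through G_4) are assumptions for illustration, not taken from the paper.

```python
from itertools import product

# assumed order of the four subgroups (G_1, ..., G_4)
subgroups = ["male white-collar", "female white-collar",
             "male blue-collar", "female blue-collar"]

# Each rule is a binary vector over subgroups; its decimal coding names
# the rule, e.g. bits (1, 0, 1, 0) -> "1010" -> rule 10.
rules = {}
for bits in product([0, 1], repeat=4):
    decimal = int("".join(map(str, bits)), 2)
    rules[decimal] = [g for g, b in zip(subgroups, bits) if b]
```

For example, `rules[0]` is the rule that treats nobody and `rules[15]` treats all four subgroups.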
We gave decimal as well as binary codings to the 16 groups: r_0 (r_(0000)) assigns control to everyone; r_8 (r_(1000)), r_4 (r_(0100)), r_2 (r_(0010)), and r_1 (r_(0001)) assign treatment to exactly one of the 4 subgroups; and so forth. We split the matched samples, used 1/4 of them to order the hypotheses, and then followed the procedures proposed in Section 3 to rank and select the treatment rules.

Figure 4 reports the estimated Hasse diagram at Γ = 1.2; additional results can be found in the Supplementary Materials. The estimated maximal rules for various choices of Γ are reported in Table 8, and the estimated positive rules are reported in the Supplementary Materials.

Figure 4: The effect of late retirement on health: Hasse diagram at Γ = 1.2. (Vertices are the 16 rules labeled by their decimal and binary codings: 0 (0000); 8 (1000), 4 (0100), 2 (0010), 1 (0001); 12 (1100), 10 (1010), 9 (1001), 6 (0110), 5 (0101), 3 (0011); 14 (1110), 13 (1101), 11 (1011), 7 (0111); 15 (1111).)

At Γ = 1, the estimated maximal rules are the rule that assigns late retirement to all but female white-collar workers, the rule that assigns late retirement to all but male blue-collar workers, and r_15 (r_(1111)), which assigns treatment to everyone. When Γ increases, the rule which assigns treatment to male white-collar workers and female blue-collar workers further enters the set of maximal rules. The estimated positive rules suggest that the rule assigning late retirement only to female blue-collar workers, together with one other rule, though not among the maximal rules at Γ = 1 in Table 8, is the most robust to unmeasured confounding. This suggests that late retirement perhaps benefits female blue-collar workers more than others.

Table 8: The effect of late retirement on health: R̂_{max,Γ} for different choices of Γ.
Γ   R̂_{max,Γ}
    { r , r , r }
    { r , r , r , r }
    { r , r , r , r , r , r , r , r }

Rosenbaum and Silber [2009] proposed to amplify Γ to a two-dimensional curve indexed by (Λ, ∆), where Λ describes the relationship between the unmeasured confounder and the treatment assignment, and ∆ describes the relationship between the unmeasured confounder and the outcome. Hsu and Small [2013] proposed to calibrate the (Λ, ∆) values to the coefficients of observed covariates; however, their method only works for a binary outcome and a binary treatment. In the Supplementary Materials, we describe a calibration analysis that handles the ordinal self-reported health status in our application, which has 5 levels. We follow Hsu and Small [2013] and use a plot to summarize the calibration analysis.
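For reference, the Rosenbaum and Silber [2009] amplification identifies a single Γ with the curve of (Λ, ∆) pairs satisfying Γ = (Λ∆ + 1)/(Λ + ∆). A minimal sketch of computing this curve (the grid of Λ values is an arbitrary choice for plotting):

```python
import numpy as np

def amplification_delta(lam, gamma):
    """Solve Gamma = (Lambda * Delta + 1) / (Lambda + Delta) for Delta.

    Requires Lambda > Gamma; the closed form is
    Delta = (Gamma * Lambda - 1) / (Lambda - Gamma)."""
    return (gamma * lam - 1.0) / (lam - gamma)

gamma = 1.2
lambdas = np.linspace(gamma + 0.1, 10.0, 200)   # Lambda must exceed Gamma
deltas = amplification_delta(lambdas, gamma)    # the amplification curve
```

Roughly speaking, a covariate whose (Λ, ∆) point falls below this curve represents confounding weaker than what the sensitivity model with this Γ allows.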
In Figure 5, the blue curve represents Rosenbaum and Silber [2009]'s two-dimensional amplification of Γ = 1.2 into (Λ, ∆). The estimated coefficients of observed covariates are represented by black dots (after taking an exponential so they are comparable to (Λ, ∆)). We followed the suggestion in Gelman [2008] and standardized all the non-binary covariates to have mean 0 and standard deviation 0.5, so the coefficient of each binary variable can be interpreted directly, and the coefficient of each continuous or ordinal variable can be interpreted as the effect of a 2-SD increase in the covariate value, which roughly corresponds to flipping a binary variable from 0 to 1. Note that all coefficients lie below the Γ = 1.2 amplification curve, as does a hypothetical unmeasured confounder U constructed from smoking and education (red star in Figure 5).

In this paper we proposed a general framework to compare, select, and rank treatment rules when there is a limited degree of unmeasured confounding, and illustrated the proposed methods with two real data examples. A central message is that the best treatment rule (with the largest estimated value) assuming no unmeasured confounding is often not the most robust to unmeasured confounding. This may have important policy implications when individualized treatment rules are learned from observational data.

Because the value function only defines a partial order on the treatment rules when there is unmeasured confounding, there is a multitude of statistical questions one can ask about selecting and ranking the treatment rules. We have considered three questions that we believe are most
Figure 5: The effect of late retirement on health: calibration of the sensitivity analysis. The blue curve is Rosenbaum and Silber [2009]'s amplification of Γ = 1.2. Black dots represent the estimated coefficient of each observed covariate. The red marker represents the aggregate effect of flipping smoking status and increasing education by 2 standard deviations.

relevant to policy research, but there are many other questions (such as in Gibbons et al. [1999]) one could ask.

In principle, our framework can be used with an arbitrary number of prespecified individualized treatment rules. However, to maintain good statistical power in the multiple testing, the prespecified treatment rules should not be too many. This limitation makes our method most suitable as a confirmatory analysis to complement machine learning algorithms for individualized treatment rule discovery. Alternatively, if the number of decision variables is relatively low due to economic or practical reasons, our method is also reasonably powered for treatment rule discovery.
Acknowledgement
JW received funding from the Population Research Training Grant (NIH T32 HD007242) awarded to the Population Studies Center at the University of Pennsylvania by the NIH's Eunice Kennedy Shriver National Institute of Child Health and Human Development.
A Appendix: Proof of Proposition 1
To simplify the notation, suppose r_1(X_i) < r_2(X_i) for all 1 ≤ i ≤ I. Let

D_{i,Γ} = D_i − ((Γ − 1)/(Γ + 1)) |D_i|,   \bar{D} = (1/I) \sum_{i=1}^I D_i,   \overline{|D|} = (1/I) \sum_{i=1}^I |D_i|,

\bar{D}_Γ = (1/I) \sum_{i=1}^I D_{i,Γ} = \bar{D} − ((Γ − 1)/(Γ + 1)) \overline{|D|},   and   se(\bar{D}_Γ)^2 = (1/I^2) \sum_{i=1}^I (D_{i,Γ} − \bar{D}_Γ)^2.

When E[D_i] > 0, Γ*(r_1, r_2) is obtained by solving the following equation in Γ:

\bar{D}_Γ / se(\bar{D}_Γ) = Φ^{−1}(1 − α).   (5)

Square both sides of the equation above and plug in the expressions for \bar{D}_Γ and se(\bar{D}_Γ)^2. Let κ = (Γ − 1)/(Γ + 1) and z_α = Φ^{−1}(1 − α). Denote

s_D^2 = (1/I) \sum_{i=1}^I (D_i − \bar{D})^2,   s_{|D|}^2 = (1/I) \sum_{i=1}^I (|D_i| − \overline{|D|})^2,   s_{D,|D|} = (1/I) \sum_{i=1}^I (D_i − \bar{D})(|D_i| − \overline{|D|}).

One can show that the sensitivity value Γ*(r_1, r_2) corresponds to the κ* that solves the quadratic equation

(\overline{|D|}^2 − (1/I) s_{|D|}^2 z_α^2) κ^2 − 2 (\bar{D} \overline{|D|} − (1/I) s_{D,|D|} z_α^2) κ + (\bar{D}^2 − (1/I) s_D^2 z_α^2) = 0.

Specifically, we have

κ* = [\bar{D} \overline{|D|} − (1/I) s_{D,|D|} z_α^2 ± √Δ] / [\overline{|D|}^2 − (1/I) s_{|D|}^2 z_α^2],   (6)

where Δ = (\bar{D} \overline{|D|} − (1/I) s_{D,|D|} z_α^2)^2 − (\overline{|D|}^2 − (1/I) s_{|D|}^2 z_α^2)(\bar{D}^2 − (1/I) s_D^2 z_α^2). Note that

√Δ = z_α √[ (1/I)(s_{|D|}^2 \bar{D}^2 + s_D^2 \overline{|D|}^2 − 2 \bar{D} \overline{|D|} s_{D,|D|}) + (1/I^2) z_α^2 (s_{D,|D|}^2 − s_{|D|}^2 s_D^2) ].

Let us denote A = E[D] E[|D|], B = −z_α √(σ_{|D|}^2 E[D]^2 + σ_D^2 E[|D|]^2 − 2 E[D] E[|D|] σ_{D,|D|}), C = E[|D|]^2, R_1 = √I (\bar{D} \overline{|D|} − A), and R_2 = √I (\overline{|D|}^2 − C). We have

κ* = (A + R_1/√I + B/√I) / (C + R_2/√I) + o_p(1/√I) = (1/C)(A + R_1/√I + B/√I)(1 − R_2/(C√I)) + o_p(1/√I).

Scaling both sides by √I and rearranging the terms, we obtain

√I (κ* − A/C) = B/C + (1/C) R_1 − (A/C^2) R_2 + o_p(1).

Applying the delta method with φ : R^2 → R^2, φ(x, y) = (xy, y^2), to the central limit theorem

√I (\bar{D} − E[D], \overline{|D|} − E[|D|]) → N(0, Σ)

gives

(R_1, R_2) = √I (\bar{D} \overline{|D|} − E[D] E[|D|], \overline{|D|}^2 − E[|D|]^2) → N(0, φ′ Σ (φ′)^T),

where

Σ = ( Var[D], Cov(D, |D|); Cov(D, |D|), Var[|D|] )   and   φ′ = ( E[|D|], E[D]; 0, 2 E[|D|] ).

Plugging in the expressions for A, B, and C and computing the variance of (1/C, −A/C^2)(R_1, R_2)^T, we conclude that

√I (κ* − E[D]/E[|D|]) → N(z_α µ, σ^2),

where

µ = −√(σ_{|D|}^2 E[D]^2 + σ_D^2 E[|D|]^2 − 2 E[D] E[|D|] σ_{D,|D|}) / E[|D|]^2

and

σ^2 = (Var[D] E[|D|]^2 + Var[|D|] E[D]^2 − 2 E[D] E[|D|] Cov(D, |D|)) / E[|D|]^4.

The Supplementary Materials contain additional appendices about matching in observational studies and further simulation and real data results.

References
Seyed Mohammad Alavinia and Alex Burdorf. Unemployment and retirement and ill-health: a cross-sectional analysis across European countries. International Archives of Occupational and Environmental Health, 82(1):39–45, 2008.

Susan Athey and Stefan Wager. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.

Axel Börsch-Supan and Hendrik Jürges. Early retirement, social security and well-being in Germany. Technical report, National Bureau of Economic Research, 2006.

Esteban Calvo, Natalia Sarkisian, and Christopher R. Tamborini. Causal effects of retirement timing on subjective physical and emotional health. The Journals of Gerontology: Series B, 68(1):73–84, 2012. doi: 10.1093/geronb/gbs097.

J. Cornfield, W. Haenszel, E. Hammond, A. Lilienfeld, M. Shimkin, and E. Wynder. Smoking and
Journal of the National Cancer Institute , 22:173–203, 1959.Dhaval Dave, Inas Rashad, and Jasmina Spasojevic. The effects of retirement on physical andmental health outcomes.
Southern Economic Journal, 75(2):497–523, 2008.

Miroslav Dudík, Dumitru Erhan, John Langford, Lihong Li, et al. Doubly robust policy evaluation and optimization.
Statistical Science , 29(4):485–511, 2014.Colin B Fogarty. Studentized sensitivity analysis for the sample average treatment effect in pairedobservational studies. arXiv preprint arXiv:1609.02112 , 2016.A. Gelman. Scaling regression inputs by dividing by two standard deviations.
Statistics in Medicine ,27:2865–2873, 2008.Jean Dickinson Gibbons, Ingram Olkin, and Milton Sobel.
Selecting and ordering populations: Anew statistical methodology . SIAM, 2nd edition, 1999.Shanti S Gupta and Subramanian Panchapakesan.
Multiple Decision Procedures: Theory and Methodology of Selecting and Ranking Populations. SIAM, 1979.

Ben B. Hansen and Stephanie Olsen Klopfer. Optimal full matching and related designs via network flows.
Journal of Computational and Graphical Statistics , 15(3):609–627, 2006.Ruth Heller, Paul R Rosenbaum, and Dylan S Small. Split samples and design sensitivity inobservational studies.
Journal of the American Statistical Association , 104(487):1090–1101, 2009.J. Y. Hsu and D. S. Small. Calibrating sensitivity analyses to observed covariates in observationalstudies.
Biometrics , 69:803–811, 2013.Jesse Y. Hsu, Dylan S. Small, and Paul R. Rosenbaum. Effect modification and design sensitivity inobservational studies.
Journal of the American Statistical Association , 108(501):135–148, 2013.doi: 10.1080/01621459.2012.742018. URL https://doi.org/10.1080/01621459.2012.742018 .Nathan Kallus. Recursive partitioning for personalization using observational data. In
Proceedingsof the 34th International Conference on Machine Learning-Volume 70 , pages 1789–1798. JMLR.org, 2017.Nathan Kallus and Angela Zhou. Confounding-robust policy improvement. In
Advances in NeuralInformation Processing Systems , pages 9269–9279, 2018.Nathan Kallus, Xiaojie Mao, and Angela Zhou. Interval estimation of individual-level causal effectsunder unobserved confounding. arXiv preprint arXiv:1810.02894 , 2018.Gary G Koch and Stuart A Gansky. Statistical considerations for multiplicity in confirmatoryprotocols.
Drug Information Journal , 30(2):523–534, 1996.Michael R. Kosorok and Eric B. Laber. Precision medicine.
Annual Review of Statistics andIts Application , 6(1):263–286, 2019. doi: 10.1146/annurev-statistics-030718-105251. URL https://doi.org/10.1146/annurev-statistics-030718-105251 .Erica EM Moodie, Bibhas Chakraborty, and Michael S Kramer. Q-learning for estimating optimaldynamic treatment rules from observational data.
Canadian Journal of Statistics , 40(4):629–645,2012.Nancy Morrow-Howell, James Hinterlong, Michael Sherraden, et al.
Productive aging: Conceptsand challenges . JHU Press, 2001.Susan A Murphy. Optimal dynamic treatment regimes.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331–355, 2003.

Min Qian and Susan A. Murphy. Performance guarantees for individualized treatment rules. The Annals of Statistics, 39(2):1180–1210, 2011.

RAND. RAND HRS Longitudinal File 2014 (V2) public use dataset. Produced by the RAND Center for the Study of Aging, with funding from the National Institute on Aging and the Social Security Administration, 2018.

James M Robins. Optimal structural nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium in Biostatistics, pages 189–326. Springer, 2004.

P. R. Rosenbaum. Sensitivity analysis for certain permutation inferences in matched observational studies.
Biometrika , 74:13–26, 1987.P. R. Rosenbaum.
Observational Studies . Springer., 2002.P. R. Rosenbaum. Bahadur Efficiency of Sensitivity Analyses in Observational Studies.
Journal ofthe American Statistical Association , 110:205–217, 2015.P. R. Rosenbaum and J. H. Silber. Amplification of sensitivity analysis in matched observationalstudies.
Journal of the American Statistical Association , 104:1398–1405, 2009.Paul R Rosenbaum. Design sensitivity in observational studies.
Biometrika , 91(1):153–164, 2004.Paul R Rosenbaum. A new U-statistic with superior design sensitivity in matched observationalstudies.
Biometrics , 67(3):1017–1027, 2011.Amanda Sonnega, Jessica D Faul, Mary Beth Ofstedal, Kenneth M Langa, John WR Phillips,and David R Weir. Cohort Profile: the Health and Retirement Study (HRS).
InternationalJournal of Epidemiology , 43(2):576–585, 03 2014. ISSN 0300-5771. doi: 10.1093/ije/dyu067.URL https://doi.org/10.1093/ije/dyu067 .Zhiqiang Tan. A distributional approach for causal inference using propensity scores.
Journal of the American Statistical Association, 101(476):1619–1637, 2006.

Hugo Westerlund, Mika Kivimäki, Archana Singh-Manoux, Maria Melchior, Jane E Ferrie, Jaana Pentti, Markus Jokela, Constanze Leineweber, Marcel Goldberg, Marie Zins, and Jussi Vahtera. Self-rated health before and after retirement in France (GAZEL): a cohort study. The Lancet, 374(9705):1889–1896, 2009. doi: 10.1016/S0140-6736(09)61570-1.

Peter H Westfall and Alok Krishen. Optimally weighted, fixed sequence and gatekeeper multiple testing procedures.
Journal of Statistical Planning and Inference , 99(1):25–40, 2001.Peng Wu, Donglin Zeng, and Yuanjia Wang. Matched learning for optimizing indi-vidualized treatment strategies using electronic health records.
Journal of the Ameri-can Statistical Association , 0(0):1–23, 2019. doi: 10.1080/01621459.2018.1549050. URL https://doi.org/10.1080/01621459.2018.1549050 .Steve Yadlowsky, Hongseok Namkoong, Sanjay Basu, John Duchi, and Lu Tian. Bounds on the con-ditional and average treatment effect in the presence of unobserved confounders. arXiv preprintarXiv:1808.09521 , 2018.Yichi Zhang, Eric B Laber, Marie Davidian, and Anastasios A Tsiatis. Interpretable dynamictreatment regimes.
Journal of the American Statistical Association , 113(524):1541–1549, 2018.Qingyuan Zhao. On sensitivity value of pair-matched observational studies.
Journal of theAmerican Statistical Association , 0(0):1–10, 2018. doi: 10.1080/01621459.2018.1429277. URL https://doi.org/10.1080/01621459.2018.1429277 .Qingyuan Zhao, Dylan S Small, and Paul R Rosenbaum. Cross-screening in observational studiesthat test many hypotheses.
Journal of the American Statistical Association , 113(523):1070–1084,2018.Qingyuan Zhao, Dylan S. Small, and Bhaswar B. Bhattacharya. Sensitivity analysis for inverseprobability weighting estimators via the percentile bootstrap, 2019a.Ying-Qi Zhao, Eric B Laber, Yang Ning, Sumona Saha, and Bruce Sands. Efficient augmentationand relaxation learning for individualized treatment rules using observational data.
Journal ofMachine Learning Research , 20:1–23, 2019b.Yingqi Zhao, Donglin Zeng, A. John Rush, and Michael R. Kosorok. Estimating individ-ualized treatment rules using outcome weighted learning.
Journal of the American Statistical Association, 107(499):1106–1118, 2012. doi: 10.1080/01621459.2012.695674. URL https://doi.org/10.1080/01621459.2012.695674.