The Optimal Dynamic Treatment Rule SuperLearner: Considerations, Performance, and Application
Lina Montoya, Mark van der Laan, Alexander Luedtke, Jennifer Skeem, Jeremy Coyle, Maya Petersen
University of North Carolina, Chapel Hill, Department of Biostatistics
University of California, Berkeley, Division of Biostatistics
University of Washington, Department of Statistics
University of California, Berkeley, Departments of Social Welfare and Public Policy

January 2021

*Email address for correspondence: [email protected]

Abstract

The optimal dynamic treatment rule (ODTR) framework offers an approach for understanding which kinds of patients respond best to specific treatments – in other words, treatment effect heterogeneity. Recently, there has been a proliferation of methods for estimating the ODTR. One such method is an extension of the SuperLearner algorithm – an ensemble method to optimally combine candidate algorithms extensively used in prediction problems – to ODTRs. Following the “causal roadmap,” we causally and statistically define the ODTR and provide an introduction to estimating it using the ODTR SuperLearner. Additionally, we highlight practical choices when implementing the algorithm, including the choice of candidate algorithms, metalearners to combine the candidates, and risk functions to select the best combination of algorithms. Using simulations, we illustrate how estimating the ODTR using this SuperLearner approach can uncover treatment effect heterogeneity more effectively than traditional approaches based on fitting a parametric regression of the outcome on the treatment, covariates, and treatment-covariate interactions. We investigate the implications of choices in implementing an ODTR SuperLearner at various sample sizes.
Our results show the advantages of: (1) including a combination of both flexible machine learning algorithms and simple parametric estimators in the library of candidate algorithms; (2) using an ensemble metalearner to combine candidates rather than selecting only the best-performing candidate; and (3) using the mean outcome under the rule as a risk function. Finally, we apply the ODTR SuperLearner to the “Interventions” study, an ongoing randomized controlled trial, to identify which justice-involved adults with mental illness benefit most from cognitive behavioral therapy (CBT) to reduce criminal re-offending.

Introduction
The primary objective of a clinical trial is often to evaluate the overall, average effect of a treatment on an outcome in a given population [1, 2, 3]. To accomplish this objective in the point treatment setting, baseline covariate, treatment, and outcome data are often collected and the average treatment effect (ATE) is estimated, quantifying the average impact of the treatment in a population. Researchers may then interpret the impact of the treatment as beneficial, neutral, or harmful. In this interpretation, the treatment’s impact is one-size-fits-all; in other words, the effect of the treatment is interpreted as the same for everyone in the study population. But it may be the case that an intervention tends to yield better outcomes for certain kinds of people and not for others. For example, because justice-involved people with mental illness are a heterogeneous group with diverse symptoms, risk factors, and other treatment-relevant characteristics [4, 5], assigning Cognitive Behavioral Therapy (CBT) may decrease the probability of recidivism for individuals with a high risk of recidivism but not a low risk of recidivism [6]. The ATE analysis may lead one to conclude that there is no treatment effect in a given population, when there is, in fact, a differential treatment effect across levels of covariates.

Precision health aims to shift the question from “which treatment works best” to “which treatment works best for whom?” (sometimes, it further asks: at what time? and/or at what dose? [3]). The point of moving towards this question is to move towards better subject outcomes. While a range of novel study designs can help to address these questions by generating data in which individualized treatment effects are unconfounded [3, 7, 8], data from classic randomized controlled trials also provide a rich data source for discovering treatment effect heterogeneity.
Under the assumption of no unmeasured confounding, the same methods can be applied to observational data.

One way of learning which treatment works best for whom is to estimate effects within subgroups. Following our above example within the field of criminal justice, one could split the sample into subjects who are likely versus unlikely to re-offend, and look at the average effect of CBT on recidivism within these two risk categories. Such a classic subgroup analysis helps to move a step closer to understanding the treatment that works best for whom. However, the need to restrict the number of tests performed and to pre-specify analyses limits traditional subgroup analyses to comparing intervention effects in a small set of subgroups in which heterogeneous treatment effects are expected [2, 9]. In practice, the subject characteristics that are most important for determining the best-suited intervention may not be clear based on background knowledge. Further, effectively predicting the type of intervention that a subject will best respond to may require accounting for a wide range of subject characteristics and complex interactions between them. For instance, identifying the subjects most likely to respond to CBT versus, for example, treatment as usual (TAU) may require considering not only risk level, but also age, educational attainment, sex, substance abuse, psychological distress, and internal motivation to adhere to treatment – as well as various interactions between these. In summary, the challenge is to take a wide range of subject characteristics and flexibly learn how to best combine them into a strategy or rule that assigns to each subject the specific intervention that works best for him or her.

Estimating the optimal dynamic treatment rule (ODTR) for a given population offers a formal approach for learning about heterogeneous treatment effects and developing such a strategy.
A dynamic treatment rule can be thought of as a rule or algorithm where the input is subject characteristics and the output is an individualized treatment choice for each subject [10, 11, 12, 13]. An optimal dynamic treatment rule (also known as an optimal treatment regime, optimal strategy, individualized treatment rule, optimal policy, etc.) is the dynamic treatment rule that yields the best overall subject outcomes [14, 15]. In our criminal justice example, a dynamic treatment rule takes as input subject characteristics such as age, criminal history, and education level and outputs a treatment decision – either CBT or TAU. The ODTR is the dynamic treatment rule under which the highest proportion of patients are not re-arrested. It is the most effective and, if one incorporates costs or constraints on resources [16], efficient way of allocating the interventions at our disposal based on measured subject characteristics.

There have been major advances in estimating the ODTR within the fields of statistics and computer science, with important extensions to the case where treatment decisions are made at multiple points in time. Regression-based approaches, such as Q-learning, learn the ODTR by modeling the outcome regression (i.e., the expected outcome given treatment and covariates) directly [14, 17, 18, 19, 20]. Robins and Murphy developed methods of estimating the ODTR by modeling blip-to-reference functions (i.e., the strata-specific effect of the observed treatment versus control) and regret functions (i.e., the strata-specific loss incurred when given the optimal treatment versus the observed treatment), respectively [14, 15, 21]. Direct-estimation approaches to learning the ODTR, such as outcome weighted learning (OWL), aim to search among a large class of candidate rules for the one that yields the best expected outcome [22, 23, 24].
These are examples of broad classes of ODTR estimators; within and outside of them there has been a proliferation of methods to estimate the ODTR (see [3, 25, 26] for reviews of the state of the art in estimating ODTRs and precision medicine).

Given the vast number of methods available for estimating the ODTR, the question becomes: which approach to use? In some settings, some algorithms may work better than others. SuperLearning [27] (or, more specific to prediction, stacked regression [28]) was originally proposed as a method for data-adaptively choosing or combining prediction algorithms. The basic idea is to define a library of candidate algorithms and choose the candidate or the combination of candidates that gives the best performance based on V-fold cross-validation. This requires defining: (1) the algorithms to include in the library, (2) a parametric family of weighted combinations of these algorithms, the “metalearning” step [29], and (3) the choice of performance metric (i.e., risk) as the criterion for selecting the optimal combination of algorithms. Given these three requirements, one can then estimate the risk for each combination of algorithms using V-fold cross-validation, and choose the combination with the lowest cross-validated risk. The SuperLearner framework has been implemented extensively for prediction problems [30, 31, 32, 33], and has been extended to the ODTR setting [34, 35].
In particular, Luedtke and van der Laan showed that in the randomized controlled trial (RCT) and sequential multiple assignment randomized trial (SMART) [7, 25, 36] settings, under the assumption that the loss function is bounded, the ODTR SuperLearner estimator will be asymptotically equivalent to the ODTR estimator chosen by the oracle selector (that is, the ODTR estimator, among the candidate ODTR estimators, that yields the lowest risk under the true data distribution [27]). This implies that the ODTR SuperLearner will asymptotically do as well as or better than any single candidate estimator in the library, provided that none of the candidate algorithms are correctly specified parametric models. If there is a well-specified parametric model in the library, the ODTR SuperLearner estimator of the ODTR will achieve near-parametric rates of convergence to the true rule.

These theoretical results lay important groundwork for understanding the asymptotic benefits of using the algorithm; however, less has been published on how the ODTR SuperLearner performs in finite samples, the practical implications of key choices when implementing the algorithm, and illustrations of implementing this algorithm on real RCT data. In this paper, we provide an introduction to the implementation of the ODTR SuperLearner in the point treatment setting, and use simulations to investigate the tradeoffs inherent in these user-supplied choices and how they may differ with varying sample sizes.
In particular, for sample sizes 1,000 and 300, we examine: (1) how to select the candidate algorithms for estimating the ODTR, specifically, the costs and benefits of expanding the library to include a wider set of diverse ODTR algorithms, including simple parametric models versus more data-adaptive algorithms, and blip-based versus direct-estimation algorithms; (2) implications of the choice of parametric family for creating weighted combinations of candidate ODTR learners (i.e., choice of metalearner); and (3) implications of the choice of risk function used to judge performance and thereby select the optimal weighted combination of candidate learners. Finally, we apply the ODTR SuperLearner to real data generated from the Correctional Intervention for People with Mental Illness, or “Interventions,” trial, an ongoing RCT in which justice-involved adults with mental illness were randomized to either CBT or TAU. In applying the ODTR SuperLearner to this sample, we aim to identify which people benefit most from CBT versus TAU, in order to reduce recidivism.

The organization of this article is as follows. First, we step through the causal roadmap (as described in [37]) for defining the true ODTR for a given population. We focus on the case in which baseline covariates are measured, a single binary treatment is randomized, and an outcome is measured. We then give a brief introduction to some estimators of the ODTR, and in particular, describe the SuperLearner approach for estimating the optimal rule that builds on Luedtke and van der Laan’s work [34]. We investigate the implications of the three sets of implementation choices outlined above in finite samples using simulations (with corresponding R code illustrating implementation of all estimators considered), and the performance under such options. Lastly, we show results for the ODTR SuperLearner algorithm applied to the “Interventions” study. We close with concluding remarks and future directions.
Consider point-treatment data where W ∈ 𝒲 are baseline covariates, A ∈ {0, 1} is the treatment, and Y ∈ ℝ is the outcome measured at the end of the study. Our data can be described by the following structural causal model (SCM), ℳ^F [38]:

W = f_W(U_W)
A = f_A(W, U_A)
Y = f_Y(W, A, U_Y),

where the full data X = (W, A, Y) are endogenous nodes, U = (U_W, U_A, U_Y) ∼ P_U are unmeasured exogenous variables, and f = (f_W, f_A, f_Y) are structural equations. If it is known that the data were generated from an RCT using simple randomization with equal probability for each arm, then the above structural causal model would state that Y may be affected by both W and A, but that W does not affect A (as in the “Interventions” trial); this can be represented in the above model by letting U_A ∼ Bernoulli(p = 0.5) and A = U_A. In this point treatment setting, a dynamic treatment rule is a function d that takes as input some function V of the measured baseline covariates W and outputs a treatment decision: V → d(V) ∈ {0, 1}. For the remainder of the paper, we consider the case where V = W; in other words, we consider treatment rules that potentially respond to all measured baseline covariates. However, consideration of dynamic rules based on a more restrictive set of baseline covariates is also of frequent practical interest, allowing, for example, for consideration of dynamic rules based on measurements that can be more readily attained; all methods described extend directly to this case.
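To make the SCM concrete, the following sketch simulates data from a simple RCT-style data-generating process. The structural equations, coefficients, and the rule `d` below are our own illustrative assumptions (not the "Interventions" trial), chosen so that W does not affect A and the treatment effect is heterogeneous in W:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Structural equations for an RCT-style SCM (illustrative choices):
U_W = rng.normal(size=n)
W = U_W                                  # W = f_W(U_W)
A = rng.binomial(1, 0.5, size=n)         # A = U_A ~ Bernoulli(0.5): W does not affect A
# Y = f_Y(W, A, U_Y): treatment helps when W > 0 and hurts otherwise,
# so the treatment effect varies with W
p_Y = 1 / (1 + np.exp(-(0.2 * W + A * np.where(W > 0, 1.0, -1.0) - 0.2)))
Y = rng.binomial(1, p_Y)

# A dynamic treatment rule d: W -> {0, 1}; under this data-generating
# process, the true ODTR treats exactly when W > 0
def d(w):
    return (w > 0).astype(int)
```

Under this process, `d` is the true optimal rule because the conditional treatment effect is positive exactly when W > 0.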
We denote the set of all dynamic treatment rules as 𝒟.

The “Interventions” data consist of baseline covariates W, which include intervention site, sex, ethnicity, age, Colorado Symptom Index (CSI) score (a measure of psychiatric symptoms), level of substance use, Level of Service Inventory (LSI) score (a risk score to predict future recidivism that summarizes risk factors like criminal history, educational and employment problems, and attitudes supportive of crime), number of prior adult convictions, most serious offense, and Treatment Motivation Questionnaire (TMQ) score (a measure of internal motivation for undergoing treatment); the randomized treatment A, either a manualized Cognitive Behavioral Intervention for people in the criminal justice system (abbreviated CBT; A = 1) or treatment as usual (TAU), primarily psychiatric or correctional services (A = 0); and the outcome Y, an indicator that the person was not re-arrested within one year after study enrollment. In Table 2 we show the distribution of the data.

Let d(W) be a deterministic function that takes as input a vector of baseline covariates, and gives as output a treatment assignment (in this case, either 0 or 1). For a given rule d, we intervene on the above SCM to derive counterfactual outcomes:

W = f_W(U_W)
A = d(W)
Y_d(W) = f_Y(W, d(W), U_Y).

Here, Y_d(W) is the counterfactual outcome for a subject if his/her treatment A were assigned using the dynamic treatment rule d(W); to simplify notation we refer to this counterfactual outcome as Y_d. The counterfactual outcomes for a person if he/she were assigned treatment or given control are denoted Y_1 and Y_0, respectively. Together, the distribution of the exogenous variables P_U and structural equations f imply a distribution of the counterfactual outcomes, and the SCM provides a model for the set of possible counterfactual distributions: P_{U,X} ∈ ℳ^F.

Our target parameter of interest in this paper is the ODTR, defined as the rule that, among all candidate rules 𝒟, yields the best expected outcomes. Using the convention that larger values of Y correspond to better outcomes, an ODTR is defined as a maximizer of E_{P_{U,X}}[Y_d] over all candidate rules:

d* ∈ arg max_{d ∈ 𝒟} E_{P_{U,X}}[Y_d].   (1)

Any such ODTR can be defined in terms of the conditional additive treatment effect (CATE), namely E_{P_{U,X}}[Y_1 − Y_0 | W], which is the effect of treatment for a given value of covariates W. Any ODTR assigns treatment 1 and 0 to all strata of covariates for which the CATE is positive and negative, respectively.
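As a small worked example (the counterfactual means and covariate distribution here are made up for illustration), suppose W is binary and the stratum-specific counterfactual means are known. The ODTR treats exactly the stratum whose CATE is positive:

```python
# Hypothetical counterfactual means E[Y1 | W = w] and E[Y0 | W = w]
# for a single binary covariate W (illustrative numbers, not estimates):
E_Y1 = {0: 0.40, 1: 0.70}
E_Y0 = {0: 0.50, 1: 0.45}

# CATE per stratum: E[Y1 - Y0 | W = w]
cate = {w: E_Y1[w] - E_Y0[w] for w in (0, 1)}     # negative for W=0, positive for W=1

# The ODTR treats a stratum iff its CATE is positive
d_star = {w: int(cate[w] > 0) for w in (0, 1)}    # {0: 0, 1: 1}

# Mean outcome under the rule, assuming P(W = 1) = 0.5:
EYd = 0.5 * (E_Y1[1] if d_star[1] else E_Y0[1]) + \
      0.5 * (E_Y1[0] if d_star[0] else E_Y0[0])
```

Here the rule-specific mean outcome (0.60) beats treating everyone (0.55) and treating no one (0.475), illustrating why the ODTR is the maximizer in equation (1).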
If the CATE is 0 for a particular W (i.e., there is no treatment effect for that stratum of W), the ODTR as defined above may have more than one maximizing rule and therefore may be non-unique; this is why the right-hand side of equation 1 above is a set [39]. An ODTR can take an arbitrary value for strata at which the CATE is 0. If we assume that assigning treatment 0 is preferable to assigning treatment 1 in the absence of a treatment effect, then we would prefer the following ODTR as a function of the CATE:

d*(W) ≡ I[E_{P_{U,X}}[Y_1 − Y_0 | W] > 0].

In other words, if a subject’s expected counterfactual outcome is better under treatment versus no treatment given his or her covariate profile, then assign treatment; otherwise, assign control. A subject’s counterfactual outcome under the ODTR is Y_{d*}, and the expected outcome had everyone received the treatment assigned by the ODTR is E_{P_{U,X}}[Y_{d*}].

Following our applied example, Y_1, Y_0, and Y_d are the counterfactual outcomes for a person if he/she were given CBT, TAU, and either CBT or TAU based on the rule d, respectively; here, d* is the rule for assigning CBT versus TAU using subjects’ covariates that would yield the highest probability of no re-arrest, E_{P_{U,X}}[Y_{d*}].

We assume that our observed data were generated by sampling n independent observations O_i ≡ (W_i, A_i, Y_i), i = 1, ..., n, from a data-generating system described by ℳ^F above (e.g., the “Interventions” study consists of 441 i.i.d. observations of O). The distribution of the observed data can be written as:

P_0(O) = P_{W,0}(W) g_0(A | W) P_{Y,0}(Y | A, W),

where P_{W,0} is the true distribution of W; g_0 is the true conditional distribution of A given W, or the treatment mechanism; and P_{Y,0} is the true conditional distribution of Y given A and W. The distribution of the data P_0 is an element of the statistical model ℳ, which in our RCT example is semi-parametric.
Further, if the data are generated from an RCT design, as in the “Interventions” study, then the true g_0 is known, and the backdoor criterion (with the implied randomization assumption [38, 40]), Y_d ⊥ A | W ∀ d ∈ 𝒟, and the positivity assumption, Pr(min_{a ∈ {0,1}} g_0(A = a | W) > 0) = 1, hold. Define Q_0(a, w) ≡ E_0[Y | A = a, W = w]. Under the above assumptions, E_{P_{U,X}}[Y_d] (a parameter of the counterfactual distribution) is identified as E_0[Q_0(A = d, W)] (a parameter of the observed distribution) for any candidate rule d. Thus, the ODTR is identified by

d* ∈ arg max_{d ∈ 𝒟} E_0[Q_0(A = d, W)].

In addition, the CATE is identified as Q_0(1, W) − Q_0(0, W), where the latter is sometimes referred to as the blip function B_0(W). Then, the true optimal rule can also be defined as a parameter of the observed data distribution using the blip function:

d*(W) ≡ I[B_0(W) > 0].

Analogous to the definition of the ODTR as a function of the CATE, in words, the blip function essentially says that if treatment for a type of subject W = w is effective (i.e., the blip is greater than 0), treat that type of person. If not, do not treat him/her. If all subjects were assigned treatment in this way, then this would result in the highest expected outcome, which is the goal.

We denote estimators with a subscript n, so that, for example, an estimator of the true ODTR d* is d*_n. Estimates are functions of P_n, which is the empirical distribution that gives each observation weight 1/n; P_n ∈ ℳ^NP, where ℳ^NP is a non-parametric model. In what follows, we briefly describe examples of common methods for estimating the ODTR. We first describe methods that estimate the ODTR via an estimate of the blip function. We then describe methods that directly estimate a rule that maximizes the mean outcome.

Blip-based approaches aim to learn the blip, which implies an ODTR. A benefit of doing this is that one can look at the distribution of the predicted estimates of the blip for a given sample.
Having the blip distribution allows one to identify the patients in a sample who benefit most (or least, or little) from treatment. Additionally, estimating the blip function can allow for estimating the ODTR under resource constraints; for example, an ODTR in which only k% of the population can receive treatment [16]. Below we illustrate two methods of estimating the ODTR by way of the blip function (i.e., blip-based estimators of the ODTR).

Single stage Q-learning.
A plug-in estimator naturally follows from the above definition of the optimal rule. One can estimate Q_n(A, W) using any regression-based approach for estimating an outcome regression and predict Q_n(1, W) and Q_n(0, W). This provides an estimate of the blip:

B_n(W) = Q_n(1, W) − Q_n(0, W),

which implies an estimate of the optimal rule: d*_n = I[B_n(W) > 0] [3, 18, 25, 42].

Estimating the blip function.
Consider the double-robust pseudo-outcome [43]:

D(Q, g) = [(2A − 1) / g(A | W)] [Y − Q(A, W)] + Q(1, W) − Q(0, W).

Importantly, E[D(Q, g) | W] = B_0(W) if Q = Q_0 or g = g_0. Using this result, one could estimate the blip by regressing the pseudo-outcome D_n(Q_n, g_n) (which we abbreviate from here on as D_n) on W using any regression-based approach. As in the previous method, this estimate of the blip implies an estimate of the optimal rule d*_n = I[B_n(W) > 0] [15, 34, 44].

Instead of estimating the blip function, which implies an ODTR, one could estimate the ODTR directly by selecting a rule d that maximizes the estimated E_{U,X}[Y_d]. Below we illustrate outcome weighted learning (OWL) – one example of a direct-estimation method for the ODTR.

Single stage outcome weighted learning.
We briefly describe the general concept of outcome weighted learning here, but refer to [22] and [45] for a more thorough explanation. The optimal rule defined above as a function of P_0 could equivalently be written as an inverse probability of treatment weighted (IPTW) estimand:

d* ∈ arg max_{d ∈ 𝒟} E_0[Q_0(A = d, W)] = arg max_{d ∈ 𝒟} E_0[ (Y / g_0(A | W)) I[A = d] ].

Written this way, estimating d* could be regarded as a classification problem, where the weighted outcome Y / g_0(A | W) helps us learn what kind of patients should get treatment: if a certain kind of patient W = w has large weighted outcomes and they were treated according to candidate rule d, future patients with that covariate profile should be treated using that rule. Conversely, the smaller the weighted outcome among patients W who were treated according to d, the larger the “misclassification error” and the less likely those kinds of patients should be treated according to d. This maximization problem is equivalent to the following minimization problem:

d* ∈ arg min_{d ∈ 𝒟} E_0[ (Y / g_0(A | W)) I[A ≠ d] ].   (2)

Now, if patients W = w who did not follow the rule d have large weighted outcomes (and thus larger “misclassification error”), those kinds of patients should be given the opposite treatment that d proposes. Note that in the RCT setting, if one uses the known g_0 and if treatments are given with equal probability, then this reduces to finding the rule that minimizes the mean outcome among patients who did not follow the rule. Equation 2 could alternatively be written as a minimization problem for, instead of a rule d, a function f:

f* ∈ arg min_{f ∈ 𝓕} E_0[ (Y / g_0(A | W)) I[A ≠ (sign(f(W)) + 1)/2] ],   (3)

where sign(x) = −1 if x ≤ 0 and sign(x) = 1 if x > 0. Under the true data distribution P_0, f is the blip function B_0. In order to solve this minimization problem using data, we can use a plug-in estimator of (3); however, since it involves a 0-1 loss function (i.e., it is discontinuous and non-convex), one could use a convex surrogate function to approximate it, and instead minimize:

f*_n ∈ arg min_{f ∈ 𝓕} (1/n) Σ_{i=1}^{n} [Y_i / g_n(A_i | W_i)] Φ(A_i f(W_i)) + λ_n ‖f‖,   (4)

where Φ(t) is the surrogate loss function (e.g., hinge loss, exponential loss, logistic loss), ‖f‖ is the norm of f, and λ_n is the estimated penalization parameter on f to avoid overfitting of the rule. This can also be generalized with the IPTW function replaced by the augmented IPTW [34, 45]. Once f*_n is found as the solution to equation (4), the estimated ODTR is: d*_n = sign(f*_n(W)).

The overarching goal of SuperLearner is to let a user-supplied library of candidate algorithms, such as specific implementations of the general approaches described above, “team up” to improve estimation of the ODTR. In order to implement the ODTR SuperLearner, there are three user-supplied decisions one must make. First, one must consider the library of candidate algorithms to include. These could include algorithms for estimating the blip function (which imply a rule), algorithms that search for the ODTR directly (such as OWL estimators), static rules that determine treatment regardless of covariates, or combinations of the above classes of algorithms. Second, in what is sometimes referred to as the metalearning step, one can either implement a SuperLearner that chooses one algorithm out of the library of candidate algorithms (i.e., “discrete” SuperLearner), or a SuperLearner that is a combination of the candidate algorithms (i.e., “continuous” SuperLearner).
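A minimal sketch of the OWL idea: with a logistic surrogate loss Φ, the weighted minimization in (4) reduces to a weighted classification of A on W with weights Y_i / g(A_i | W_i). The simulated data, feature set, and penalty below are our own illustrative choices; the original OWL proposal of [22] uses a hinge loss (a support vector machine), which `sklearn.svm.SVC` with `sample_weight` would mirror:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 2000
W = rng.normal(size=n)
A = rng.binomial(1, 0.5, size=n)                  # RCT: g(A|W) = 0.5 is known
p = 1 / (1 + np.exp(-(0.2 * W + A * np.where(W > 0, 1.0, -1.0) - 0.2)))
Y = rng.binomial(1, p)

# OWL as weighted classification: classify A on W with weights Y / g(A|W),
# so patients with good outcomes "vote" for the treatment they received
weights = Y / 0.5
X = np.column_stack([W, (W > 0).astype(float)])   # illustrative feature set
clf = LogisticRegression().fit(X, A, sample_weight=weights)

# Estimated rule d_n = sign(f_n(W)), expressed here as the predicted class
d_n = clf.predict(X)
```

Under this data-generating process the true ODTR treats exactly when W > 0, and the weighted classifier recovers that rule for most subjects.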
For the latter, one again has a choice of metalearner; we consider weighted convex combinations of candidate estimators of the blip and combinations of estimates of the rules themselves (through a weighted “majority vote”). Finally, one must choose the risk function used to judge the performance of the weighted combinations of algorithms (estimated using V-fold cross-validation). Here, we consider two risk functions: the mean-squared error (MSE) and the mean outcome under the candidate rule.

The steps for implementing the ODTR SuperLearner are as follows; they closely follow the implementation of the canonical SuperLearner for regression [46]:

1. Choose J candidate algorithms for estimating the optimal rule d_{n,j}(W) for j = 1, ..., J. Candidates can include approaches based on estimating the blip B_{n,j}(W), e.g., candidate regressions of D_n on W, where the candidate regressions might consist of a parametric linear model (corresponding to a classic approach of fitting a parametric outcome regression on A and W) as well as more flexible machine-learning-type approaches such as neural networks [47], multivariate adaptive regression splines [48], or recursive partitioning and regression trees [49]. Candidate algorithms might also include approaches for estimating the optimal rule directly, such as an OWL estimator. Finally, the static treatment rules that treat all patients or treat no patients, regardless of their covariate values, can also be included as candidates. Inclusion of both simple parametric model estimators, as well as static rules such as treating all and treating none, is important to allow for the possibility that the underlying true ODTR may in fact be simple (or well-approximated by a simple rule), and providing less aggressive candidates in the SuperLearner library can help protect against overfitting in finite samples.

2. Split the data into V exhaustive and mutually exclusive folds.
Let each fold in turn serve as the validation set and the complement data as the training set.

3. Fit each of the J candidate algorithms on the training set. Importantly, candidate algorithms might depend on nuisance parameters, and those nuisance parameters should be fit on the training set as well. For example, if a candidate algorithm regresses D_n on W to estimate the blip (which implies an ODTR), then Q and g should be fit and predicted on the training set, and then plugged into D to fit that candidate algorithm on the same training set (this is also called “nested” cross-validation, described in detail by [35]).

4. Predict the estimated blip or the treatment assigned under the estimated ODTR for each observation in the validation set for each algorithm, based on the corresponding training set fit.

5. Choose to either implement the discrete SuperLearner, which selects one algorithm out of the candidate algorithms, or the continuous SuperLearner, which creates a weighted average of the candidate algorithms.

(a) Continuous SuperLearner. Create different convex combinations of the candidate blip or treatment rule estimates that were predicted on the validation set (i.e., convex combinations of the predictions from the previous step). Formally, define an estimator B_{n,α}(W) or d_{n,α}(W) as a convex combination of the candidate algorithms (indexed by j); each convex combination of algorithms is indexed by a weight vector α. A given convex combination of blip or rule estimates is denoted as:

B_{n,α}(W) = Σ_j α_j B_{n,j}(W), α_j ≥ 0 ∀ j, Σ_j α_j = 1
d_{n,α}(W) = I[Σ_j α_j d_{n,j}(W) > 0.5], α_j ≥ 0 ∀ j, Σ_j α_j = 1.

(b) Discrete SuperLearner. Restrict the weight vector α so that each α_j must be either 0 or 1. Such an approach may be particularly advantageous when sample size is small:

B_{n,α}(W) = Σ_j α_j B_{n,j}(W), α_j ∈ {0, 1} ∀ j, Σ_j α_j = 1
d_{n,α}(W) = I[Σ_j α_j d_{n,j}(W) > 0.5], α_j ∈ {0, 1} ∀ j, Σ_j α_j = 1.

6. Choose the risk function used to evaluate each candidate combination of algorithms (i.e., each choice of α). Here, we discuss two choices of risk functions for step (6)
First, mean-squared error risk targeting the blip function, which wewill refer to as R MSE : R MSE = E [( D ( Q , g ) − B ( W )) ] .Because the MSE compares D n to the predicted blip, the candidate estima-tors under the MSE risk function must output estimated blip values. Ofnote, E P U , X [[ Y − Y − B ( W )] ] is identified as R MSE ( B ) if either Q = Q or g = g . The second risk function, which we call R E [ Y d ] , uses the expectedrule-specific outcome as criterion: R E [ Y d ] = − E [ E [ Y | A = d , W ]] .Intuitively, SuperLearner aims to choose the combination of treatment rulealgorithms that minimizes a cross-validated empirical risk, so it makes senseto have that risk be the negative of the expected outcome, such that Super-Learner chooses the combination of algorithms that maximizes the expectedoutcome, since that is the ultimate goal of the ODTR. Candidate estimatorsfor the SuperLearner that use the expected mean outcome under the rule asthe risk function can include both blip estimators that imply treatment rulesas well as direct estimators of the treatment rules. When the expected rulespecific outcome is chosen as the risk function, a further practical choiceis how to estimate this quantity; we focus here on a cross-validated tar-geted maximum likelihood estimator (TMLE) due to its favorable theoret-ical properties (double robustness, semi-parametric efficiency, and greaterprotection against overfitting through the use of sample splitting [46]); how-ever, one can use any estimator of treatment specific mean outcomes to es-timate this quantity [10, 11, 50].7. Average the risks across the validation sets resulting in one estimated crossvalidated risk for each candidate convex combination (i.e., each possiblechoice of α ). 14. Choose the estimator (i.e., the convex combination α ) that yields the small-est cross-validated empirical risk. Call this “best” weighting of the algo-rithms α n .9. 
Fit each candidate estimator B_{n,j}(W) of the blip or d_{n,j}(W) of the optimal rule on the entire data set. Generate predictions for each candidate algorithm individually, and then combine them using the weights α_n obtained in the previous step. This is the SuperLearner estimate of the ODTR: either d*_{n,B} = I[B_{n,α_n}(W) > 0] or d*_{n,d} = d_{n,α_n}(W) directly.

We summarize the practical implications of 3 key choices for implementing the ODTR SuperLearner here and in Table 1. In the first dimension, selection of the candidate algorithms, for illustration we consider having a library with only blip function estimators (called "Blip only" library) or a library with blip estimators, direct-estimation estimators, and static treatment rules that treat everyone or no one (called "Full" library). If one chooses to include direct-search estimators of the ODTR or static rules (i.e., functions that do not output a blip estimate in the process), then one is constrained to using the vote-based metalearner and R_E[Yd] risk function, because the blip-based metalearner and R_MSE risk function both rely on estimates of the blip for combining and choosing the algorithms, respectively.

For the second dimension, the choice of how to combine algorithms, we consider the metalearner that either (a) selects only one candidate algorithm (called "Discrete"), (b) uses a weighted average to combine predicted blip estimates and directly plugs those into the risk (called "Blip-based"), or (c) uses a weighted average to combine predicted treatments under the candidate combinations of rules and creates a weighted majority vote of these candidate rules as input into the risk (called "Vote-based"). The third dimension is the choice of performance measure, or risk function – either the MSE (R_MSE) or the mean outcome under the candidate rule (R_E[Yd]).
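The cross-validation scheme above can be sketched in code. The following is an illustrative discrete-SuperLearner loop under the MSE risk, not the authors' R implementation; the toy candidate "blip learners" and the fold scheme are assumptions of this example, and nuisance-parameter estimation is taken as already done (the pseudo-outcome D_n is passed in directly as D).

```python
import numpy as np

def cv_superlearner_discrete(W, D, candidates, V=10, seed=0):
    """Illustrative discrete ODTR SuperLearner: fit each candidate blip
    learner on training folds, predict on validation folds, and pick the
    candidate with the smallest cross-validated MSE against the
    pseudo-outcome D_n (passed in here as D)."""
    n = len(D)
    folds = np.random.default_rng(seed).permutation(n) % V  # step 2: V folds
    cv_preds = np.zeros((len(candidates), n))
    for v in range(V):
        train, valid = folds != v, folds == v
        for j, fit in enumerate(candidates):        # step 3: fit on training set
            model = fit(W[train], D[train])
            cv_preds[j, valid] = model(W[valid])    # step 4: predict on validation set
    risks = ((cv_preds - D) ** 2).mean(axis=1)      # steps 6-7: CV MSE risk per candidate
    j_star = int(np.argmin(risks))                  # step 8: discrete selection
    blip = candidates[j_star](W, D)(W)              # step 9: refit winner on all data
    return j_star, risks, (blip > 0).astype(int)    # implied rule: I(B_n(W) > 0)

# Two toy candidate blip learners (assumptions of this example):
def mean_learner(W_tr, D_tr):
    m = D_tr.mean()
    return lambda W_new: np.full(len(W_new), m)

def ols_learner(W_tr, D_tr):
    X = np.column_stack([np.ones(len(D_tr)), W_tr[:, 0]])
    beta, *_ = np.linalg.lstsq(X, D_tr, rcond=None)
    return lambda W_new: beta[0] + beta[1] * W_new[:, 0]
```

If D is generated so that the blip truly depends on W1, the OLS candidate attains the smaller cross-validated risk and is selected, and the implied rule treats (roughly) when W1 is positive.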
For the second and third dimensions, if one uses the vote-based metalearner, then the R_MSE risk cannot be used, because R_MSE requires an estimate of the blip to choose the best algorithm, and the vote-based metalearner does not output an estimate of the blip.
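As a toy illustration of the distinction between the blip-based and vote-based metalearners (all numbers below are hypothetical):

```python
import numpy as np

# Hypothetical blip predictions from 3 candidates (rows) for 5
# validation observations (columns), with hypothetical convex weights.
blips = np.array([[ 0.2, -0.1,  0.4, -0.3,  0.0],
                  [ 0.1, -0.2,  0.5, -0.1,  0.1],
                  [-0.1, -0.3,  0.2, -0.2,  0.2]])
alpha = np.array([0.5, 0.3, 0.2])

# Blip-based metalearner: weighted average of blips, thresholded at 0.
# The combined blip itself is available, so R_MSE can be evaluated.
combined_blip = alpha @ blips
rule_from_blip = (combined_blip > 0).astype(int)

# Vote-based metalearner: weighted majority vote over candidate rules.
# Only treatments survive -- there is no combined blip, so R_MSE cannot
# be evaluated for this metalearner.
rules = (blips > 0).astype(int)
vote = alpha @ rules
rule_from_vote = (vote > 0.5).astype(int)
```

Note that the two metalearners can disagree: here the last observation has a (slightly) positive combined blip but does not win a weighted majority of votes.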
Choice 1: Library    Choice 2: Metalearner      Choice 3: Risk    Possible?
Blip only            Discrete                   R_MSE             ✓
Blip only            Discrete                   R_E[Yd]           ✓
Blip only            Continuous (blip-based)    R_MSE             ✓
Blip only            Continuous (blip-based)    R_E[Yd]           ✓
Blip only            Continuous (vote-based)    R_MSE             ✗
Blip only            Continuous (vote-based)    R_E[Yd]           ✓
Full                 Discrete                   R_MSE             ✗
Full                 Discrete                   R_E[Yd]           ✓
Full                 Continuous (blip-based)    R_MSE             ✗
Full                 Continuous (blip-based)    R_E[Yd]           ✗
Full                 Continuous (vote-based)    R_MSE             ✗
Full                 Continuous (vote-based)    R_E[Yd]           ✓

Table 1: Summary of the possible ODTR SuperLearner configurations across the library, metalearner, and risk dimensions. The last column ("Possible?") indicates whether the particular configuration is possible to implement: a checkmark (✓) indicates that it is possible to construct that kind of ODTR SuperLearner algorithm; an x-mark (✗) indicates otherwise.

We use simulations to: (1) illustrate the potential benefit of estimating the ODTR using a SuperLearner approach, as compared to a more traditional approach to studying treatment effect heterogeneity based on fitting an outcome regression with interaction terms on covariates and treatment, as is often standard practice [1, 51, 52, 53]; and (2) investigate the implications of practical choices when implementing an ODTR SuperLearner in finite samples, including specification of candidate algorithms in the library, choice of metalearner, and choice of risk function.
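The compatibility constraints summarized in Table 1 can be encoded as a small validity check (an illustrative helper, not part of any package):

```python
def odtr_config_possible(library, metalearner, risk):
    """Whether an ODTR SuperLearner configuration is implementable,
    encoding the constraints of Table 1.
    library: 'blip_only' | 'full'
    metalearner: 'discrete' | 'blip_based' | 'vote_based'
    risk: 'mse' | 'EYd'"""
    # R_MSE needs every candidate (and the combined fit) to output a blip:
    # ruled out by direct-search/static candidates (Full library) and by
    # the vote-based metalearner, which outputs only treatments.
    if risk == "mse" and (library == "full" or metalearner == "vote_based"):
        return False
    # A blip-based (weighted-average-of-blips) metalearner needs blips
    # from all candidates, which direct-search and static rules lack.
    if metalearner == "blip_based" and library == "full":
        return False
    return True
```

Enumerating all 12 configurations reproduces the 7 checkmarks in Table 1.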
All simulations were implemented in R [54], and the code, simulated data, and results can be found at https://github.com/lmmontoya/SL.ODTR. We examine these comparisons using two types of data generating processes (DGPs). Each simulation consists of 1,000 iterations of either n = 1,000 or n = 300 observations. For both DGPs, the data are generated as:

W1, W2, W3, W4 ∼ Normal(µ = 0, σ² = 1)
A ∼ Bernoulli(p = 0.5)
Y ∼ Bernoulli(p).

The probability p of having a successful outcome differs for the two DGPs, which, in this case, means that the blip functions differ as well. The first DGP is an example of one with a complex blip function, and the second DGP is one with a blip function that is a simpler function of one variable. The first DGP is directly from work by Luedtke and van der Laan [16, 34, 44], and the second is modified from the first. The probability of a successful outcome for DGP 1 is:

p = 0.5 logit⁻¹(1 − W1² + 3W2 + 5W3²A − 4.45A) + 0.5 logit⁻¹(−0.5 − W3 + 2W1W2 + 3|W2|A − 1.5A),

and then the true blip function is:

B(W) = 0.5[logit⁻¹(1 − W1² + 3W2 + 5W3² − 4.45) + logit⁻¹(−0.5 − W3 + 2W1W2 + 3|W2| − 1.5) − logit⁻¹(1 − W1² + 3W2) − logit⁻¹(−0.5 − W3 + 2W1W2)].

For DGP 1, the true mean outcome under the optimal rule, E_{P_{U,X}}[Y_{d*}], exceeds the mean outcomes under the static rules, E_{P_{U,X}}[Y_1] and E_{P_{U,X}}[Y_0]; E_{P_{U,X}}[d*] gives the proportion of the population treated under the optimal rule. The probability of a successful outcome for DGP 2 is:

p = logit⁻¹(W1 + 0.5A + W1A).

Thus the true blip function is:

B(W) = logit⁻¹(W1 + 0.5 + W1) − logit⁻¹(W1).

Again, for DGP 2, E_{P_{U,X}}[Y_{d*}] exceeds both E_{P_{U,X}}[Y_1] and E_{P_{U,X}}[Y_0].

For each data generating process, we consider a number of estimators of the ODTR. First, mirroring epidemiologic practice, we model the outcome as an additive function of the treatment and covariates, and interactions between the treatment and all covariates [1, 51, 52, 53]. Such an approach translates to using the following parametric model for the outcome regression:

h(E[Y | A, W]) = β_0 + ∑_{i=1}^p β_i W_i + (γ_0 + ∑_{i=1}^p γ_i W_i) A,

where h(.) denotes a link function, and p is the number of baseline covariates in W. Using a linear link, the following parametric model for the blip function is implied:

B(W) = γ_0 + ∑_{i=1}^p γ_i W_i.

Next, we examine the finite sample implications of the aforementioned user-supplied choices in implementing a SuperLearner estimator of the ODTR, providing guidance for practical data analysis.
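To make DGP 1 concrete, the sketch below simulates it and verifies by Monte Carlo that the rule implied by the blip weakly dominates both static rules. The specific outcome-probability coefficients are taken from the Luedtke and van der Laan simulation design and should be treated as an assumption of this illustration.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_outcome(A, W):
    """P(Y = 1 | A, W) for DGP 1 (coefficients assumed, per the
    Luedtke-van der Laan simulation design)."""
    W1, W2, W3 = W[:, 0], W[:, 1], W[:, 2]
    return 0.5 * expit(1 - W1**2 + 3*W2 + A*(5*W3**2 - 4.45)) \
         + 0.5 * expit(-0.5 - W3 + 2*W1*W2 + A*(3*np.abs(W2) - 1.5))

rng = np.random.default_rng(0)
W = rng.normal(size=(200_000, 4))           # W1..W4 ~ Normal(0, 1)

ones, zeros = np.ones(len(W)), np.zeros(len(W))
blip = p_outcome(ones, W) - p_outcome(zeros, W)
d_star = (blip > 0).astype(float)           # true ODTR: treat iff blip > 0

# Monte Carlo approximations of the rule-specific mean outcomes.
EY_dstar = p_outcome(d_star, W).mean()      # mean outcome under the ODTR
EY1 = p_outcome(ones, W).mean()             # static rule: treat everyone
EY0 = p_outcome(zeros, W).mean()            # static rule: treat no one

# The ODTR maximizes the conditional mean pointwise, so it can only
# improve on either static rule.
assert EY_dstar >= max(EY1, EY0)
```

The same check applies to DGP 2 by swapping in its outcome-probability function.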
First, we examine the choice of library. We consider the library that only combines candidate blip estimators ("Blip only" library; i.e., a library with candidate algorithms suited for regressing D_n on W) versus a library that has blip estimators, direct-estimation algorithms, and static treatment rules ("Full" library). The "Blip only" libraries consist of either:

(a) Simple parametric models only (denoted "Parametric blip models"). This consisted of univariate GLMs with each covariate.

(b) Machine learning algorithms only (denoted "ML blip models"), such as SL.glm (generalized linear models),
SL.mean (the average),
SL.glm.interaction (generalized linear models with interactions between all pairs of variables),
SL.earth (multivariate adaptive regression splines [48]),
SL.nnet (neural networks [47]),
SL.svm (support vector machines [55]), and
SL.rpart (recursive partitioning and regression trees [49]) from the SuperLearner package [56].

(c) A combination of (a) and (b) above, denoted "Parametric + ML blip models."

The "Full" library includes other ODTR algorithms like direct-estimation methods, static rules, and other blip-based methods. Specifically, the "Full" library includes either the "ML blip models" or "Parametric + ML blip models" from the "Blip only" libraries above, in addition to Q-learning [25], OWL [22], residual weighted learning (RWL) [57], efficient augmentation and relaxation learning (EARL) [58], and optimal classification algorithms [59] (the latter 4 are from the DynTxRegime package [60], with function names owl, rwl, earl, and optimalClass, respectively), as well as static rules that treat all patients and no patients, regardless of the patient covariate profiles. For the algorithms from the DynTxRegime package, except for the nuisance parameters Q_n and g_n, we use default parameters, and the rule as a function of all covariates. Additionally, for the optimal classification algorithm, the solver method is recursive partitioning for regression trees (rpart). Thus, the possible "Full" libraries are:

(d) Algorithms from the "ML blip models" library, plus direct maximizers and static rules, denoted "ML blip models and E[Yd] maximizers"

(e) All possible algorithms – that is, algorithms from the "Parametric + ML blip models" library, plus direct maximizers and static rules, denoted "All blip models and E[Yd] maximizers"

Second, we examine the performance of different metalearners for combining the candidate ODTR algorithms.
We examine the blip-based metalearner using the "Blip only" libraries, and the discrete and vote-based metalearners using both the "Blip only" libraries and "Full" libraries.

Third, we examine the performance of using either the MSE (R_MSE) or the expected outcome under the candidate rule (R_E[Yd]) as the risk criterion for choosing the optimal linear combination of candidate ODTR algorithms. In particular, CV-TMLE is used for estimating R_E[Yd].

We fully estimate the ODTR SuperLearner by additionally estimating nuisance parameters (as opposed to using the true nuisance parameter functions) in a nested fashion [35] as described above. Specifically, we estimate Q_n and g_n using the canonical SuperLearner [27, 56] and a correctly specified parametric model (i.e., a logistic regression of A on the intercept), respectively. We use 10-fold cross-validation throughout.

We measure performance by computing the percent accuracy of the algorithm; that is, in a sample, the proportion of times the treatment assigned by the estimated ODTR matches the true optimal treatment (i.e., the treatment that would have been assigned under the true ODTR) for each observation in the sample, averaged across simulation repetitions. We also evaluate the difference between the true conditional expected outcome under the estimated rule, averaged across the sample, and the true mean outcome under the true optimal rule, E_n[Q_0(d*_n(W), W)] − E[Y_{d*}] (as an approximation to the regret E[Q_0(d*_n(W), W)] − E[Y_{d*}]). We compute the mean and variance of this difference across the simulation repetitions. Instead of presenting the raw variance of the regret, we present a variance relative to the regret yielded by estimating the blip, and thus the optimal rule, using a parametric GLM that models the blip as an additive function of all covariates. Additionally, we compute the 2.5th and 97.5th quantiles of the distribution of E_n[Q_0(d*_n(W), W)] across the simulation repetitions.

Figure 1 displays simulation results. Below we discuss results specific to each DGP, configuration dimension, and sample size. In general, results within each DGP (i.e., across sample sizes) follow generally similar patterns; however, any differences in performance between libraries, metalearners, or risks are more pronounced for the smaller sample size.
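The two performance metrics just described can be written compactly; the estimated rule d_hat, the true outcome regression Q0_of, and the true optimal value E[Y_{d*}] are placeholders supplied by the simulation.

```python
import numpy as np

def odtr_performance(d_hat, d_star, Q0_of, W, EY_dstar):
    """Illustrative performance metrics for an estimated rule d_hat:
    - accuracy: proportion of observations whose assigned treatment
      matches the true optimal rule d_star;
    - regret (approximate): E_n[Q0(d_hat(W), W)] - E[Y_{d*}], which is
      non-positive by optimality of d*."""
    accuracy = np.mean(d_hat == d_star)
    regret = Q0_of(d_hat, W).mean() - EY_dstar
    return accuracy, regret
```

For instance, in a toy DGP where treating is optimal exactly when W1 > 0 and the blip magnitude is 0.1, the static "treat everyone" rule has 50% accuracy and a small negative regret.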
Above, we showed that DGP 1 yields a blip function that is a complex function of all of the available covariates. Here, for a larger sample size, we would expect a benefit to more aggressive approaches to estimating the ODTR, such as including more flexible machine learning-based approaches in the library of candidates, as well as use of more aggressive metalearners (either vote- or blip-based) over a discrete SuperLearner, due to the better ability of these approaches to approximate the true underlying function. That said, for smaller sample sizes, this benefit might be attenuated, or even result in worse performance than simpler alternatives. For this DGP, at a sample size of 1,000, indeed we find a benefit to the use of both more aggressive metalearners and larger libraries. Interestingly, however, this benefit is maintained for sample size 300. Specifically, libraries that included data adaptive, machine learning algorithms (as opposed to incorrectly specified parametric models alone) more accurately and precisely approximated the rule, even for a sample size of 300. Results also show that for both sample sizes, within the discrete metalearner, the R_E[Yd] risk performed better than the R_MSE risk, and more saliently, the blip-based and vote-based metalearners performed better than the discrete SuperLearner. Finally, as predicted by theory, all SuperLearner approaches evaluated substantially outperformed a traditional parametric regression approach at both sample sizes. Below we describe results specific to each sample size.

n = 1,000. Libraries that contain machine learning algorithms (i.e., "ML blip models," "Parametric + ML blip models," "ML blip models and E[Yd] maximizers," and "All blip models and E[Yd] maximizers") overall perform better than libraries with parametric models only (i.e., "Parametric blip models") and the standard GLM (i.e., "GLM"), across all performance metrics.
For example, the percent match between the true ODTR and the estimated ODTR spans 72.0%-77.7% for the libraries with machine learning algorithms, whereas the percent match for libraries with only parametric models is from 56.4% to 58.0%.

There are no stark differences within the libraries that contain machine learning algorithms across the metalearner and risk dimensions, except when using a discrete metalearner and R_MSE risk. Specifically, the discrete metalearner that uses R_MSE has a higher average regret and relative variance than all other algorithms that use machine learning (e.g., for the "Parametric + ML blip models" "Blip only" library that uses a discrete metalearner, the average regret when using R_MSE versus R_E[Yd] is -0.0389 versus -0.0284, respectively, and the relative variance when using R_MSE versus R_E[Yd] is 2.137 versus 0.7781, respectively).

n = 300. As expected, given the limited data available to estimate a complex underlying function, both accuracy of treatment assignment and approximated regret (the extent to which the expected outcome under the estimated rule fell short of the best outcomes available) deteriorated with smaller sample sizes. That said, even in this challenging situation of a complex true pattern of treatment effect heterogeneity and limited data with which to discover it, the ODTR SuperLearner would have improved the expected outcome by approximately 4.5% relative to the static rule that treats everyone, an approach that would have been suggested based on estimation of the ATE.

Libraries with only parametric models perform worse than libraries that contain machine learning algorithms in terms of average regret and accuracy. For example, SuperLearner ODTRs that contain libraries with parametric blip models match the true ODTR 54%-55.4% of the time, while the SuperLearners that contain libraries with machine learning algorithms match 60.9%-66.1% of the time.
These results parallel those found with sample size 1,000, except the discrepancy between libraries with machine learning models and parametric models is not as pronounced.

Similar to the n = 1,000 case, among the libraries that used machine learning algorithms, using the performance of the rule as the risk is better across all performance metrics than using MSE as the risk for the discrete metalearner. As long as machine learning methods were included in the library, performance was similar across risk functions and choices of metalearner, with the exception of the MSE risk combined with the discrete metalearner.
As shown above, DGP 2 has a true blip function that is a simple function of one covariate (referred to here as a "simple" blip function). Here, the true optimal rule is described by a simple parametric model for the blip; thus, we expect this approach to perform well. However, in practice one is unlikely to be sure that the truth can be well approximated by a simple rule; it is thus of interest to evaluate what price is paid for expanding the library to include more aggressive machine learning algorithms and metalearners. In particular, one might expect that, for smaller sample size, adding machine learning-based candidates and more complex metalearners risks substantial drop-off in performance. However, specifying a library that includes simpler parametric models, in addition to machine learning approaches, may help mitigate this risk. In fact, for this particular DGP, we see, across metalearners and risks, only a small price in performance from adding machine learning algorithms to a library including only simple parametric models. In short, having an ODTR SuperLearner library that also includes parametric models is better than having a library that only consists of data adaptive, machine learning algorithms. Within the libraries that did contain parametric models, particularly for the discrete metalearner, the R_MSE risk performs slightly better than the R_E[Yd] risk; for other metalearners there is little difference in performance in terms of risk. Performance of the metalearners varies slightly by sample size. Below we describe results specific to each sample size.

n = 1,000. In terms of accuracy, the libraries that only contain parametric models perform the best, followed closely by libraries that contain parametric models and machine learning models, followed by the library with only machine learning models. This pattern is evident in the percent match with the true ODTR: for example, within the discrete metalearner with R_MSE risk, the percent accuracy is 90.7% for the library with parametric models only ("Parametric blip models" library), 88.8% for the library that contains both parametric models and machine learning models ("Parametric + ML blip models"), and 81.9% for the library that contains machine learning algorithms only ("ML blip models"). This same pattern is apparent in terms of average regret; that is, the libraries that contain parametric blip models or a combination of parametric blip models and machine learning models have the lowest average regret (-0.0041 to -0.0095), while the libraries that only contain machine learning models have the highest average regret (-0.0100 to -0.0138). Modeling the blip with a single parametric model often used in standard epidemiological analysis (which, in this case, is incorrectly specified) yields an average regret of -0.0100 (higher than the libraries with a combination of parametric models and/or machine learning algorithms, and at the same level as having machine learning algorithms only).

Within the libraries that contain parametric models and use a discrete metalearner, R_MSE performs better than R_E[Yd]. For example, the mean regret and relative variance for the discrete metalearner that only used parametric models in the library is -0.0041 and 1.0267, respectively, when using the R_MSE risk, and -0.0046 and 1.3174, respectively, when using the R_E[Yd] risk. Otherwise, there were no apparent differences in performance by risk.

For libraries that contain parametric models and use R_MSE, the discrete ODTR SuperLearner performs better than the blip-based ODTR SuperLearner, with regards to all performance metrics.
For example, the average regret and relative variance for the library with only parametric models that uses R_MSE was -0.0041 and 1.0267, respectively, when using a discrete metalearner, versus -0.0059 and 1.1855, respectively, when using the blip-based metalearner. This pattern is also evident for the library that has both parametric models and machine learning algorithms.

n = 300. As in the case where n = 1,000, within the discrete metalearner with R_MSE risk, the percent accuracy for the library that uses only parametric models is 78%, followed by a 75.5% accuracy when there is a combination of parametric models and machine learning, while the library with only machine learning models had a 61.7% match with the true ODTR. While this pattern parallels that of the n = 1,000 case, the dropoff in accuracy when the library uses only parametric models versus when the library only uses machine learning algorithms is larger in the smaller sample size (16.3% difference) than in the larger sample size (8.8% difference).

Similar to the n = 1,000 case, among libraries that contain parametric models and in the discrete metalearner case, R_MSE yields slightly better performance results than R_E[Yd]. In contrast to the n = 1,000 case, for libraries that contain parametric models and used R_MSE, the blip-based metalearner performs slightly better than the discrete metalearner, with regards to all performance metrics. For example, the average regret and relative variance for the library with only parametric models that uses R_MSE is -0.0188 and 1.6102, respectively, when using a blip-based metalearner, versus -0.0190 and 1.8109, respectively, when using the discrete metalearner.
In the "Interventions" study, 230 (52.2%) participants received CBT and 211 (47.8%) TAU. Out of the 441 participants, 271 (61.5%) were not re-arrested within the year. The estimated probability of no re-arrest had everyone been assigned CBT is 62.2%, and the estimated probability of no re-arrest had everyone been assigned TAU is 60.7%; there was no significant difference between these two probabilities (risk difference: 1.51%, CI: [-8.03%, 11.06%]). After adjusting for covariates using TMLE to improve the precision of this ATE estimate [61], the risk difference is, similarly, 1.53% (CI: [-7.31%, 10.37%]).

Figure 2 shows subgroup plots for each covariate – that is, the unadjusted difference in probability of no re-arrest between those who received CBT versus TAU, within each covariate group level. One might begin to identify trends towards differential treatment effects; for example, participants may have benefited more from CBT at the San Francisco site, or if they had offended three or more times. Accurate interpretation of any such subgroup analyses, however, requires variance estimates and hypothesis tests with appropriate correction for multiple testing. In addition, as mentioned before, it may be the case that the best way to assign treatment is by using information on more than one variable at a time, and even interactions between those variables.

Thus, we estimated the ODTR on the "Interventions" data to determine which justice-involved adults with mental illness should receive CBT. Specifically, we implemented the ODTR SuperLearner with a blip-only library, a continuous, blip-based metalearner, and the R_E[Yd] risk function. We chose a blip-based library in order to generate estimates of the blip, which themselves can be informative. The library for d*_n consisted of a combination of simple parametric models (univariate GLMs with each covariate) and machine learning algorithms (SL.glm, SL.mean, SL.glm.interaction, SL.earth, and
SL.rpart). As in the simulations, the outcome regression Q_n was estimated using the canonical SuperLearner, g_n was estimated as an intercept-only logistic regression, and we used 10-fold cross-validation.

Interestingly, despite implementing a continuous metalearner, the ODTR SuperLearner algorithm assigned all weight to a GLM that modeled the blip as a linear function of only substance use. As shown in Figure 3, a plot depicting the distribution of the predicted blip estimates for all participants, the algorithm can be interpreted as follows: if a justice-involved person with mental illness has a low substance use score, give him/her CBT; otherwise, give him/her TAU. Under this ODTR estimate, for this sample, 52.38% of the participants would receive CBT.

We described the ODTR SuperLearner and illustrated its performance for sample DGPs under different configurations of the algorithm and finite sample sizes. These results build on existing work [34, 35] by fully estimating the ODTR and including an expanded SuperLearner library with not only blip-based regression estimators, but also direct-estimation methods and static interventions.
We highlighted the practical choices one must consider when implementing the ODTR SuperLearner across three dimensions: (1) the ODTR SuperLearner library of candidate algorithms, namely, whether to include parametric models, machine learning algorithms, or both, and whether to include only estimators that output a predicted blip or a combination of blip estimators, direct estimators, and static treatment rules; (2) the metalearner that either chooses a single algorithm or combines the algorithms; and (3) the risk function that chooses the "best" estimator or combination of estimators of the candidate ODTR algorithms.

Simulation-based results illustrate the shortcomings of an approach to treatment effect heterogeneity based on approximating the blip as an additive function of the available covariates, or equivalently, modeling the outcome as an additive function of the treatment and covariates, and interactions between the treatment and all covariates, which is common practice in epidemiologic analyses for heterogeneous treatment effects [1, 51, 52, 53]. With respect to choice of library, we recommend specifying a library with both simple parametric models and more aggressive data adaptive algorithms, as well as static rules such as treat all or treat none, allowing for flexible estimation of both simple and complex underlying rules. Inclusion of a full range of algorithms from simple to aggressive was particularly important for small sample sizes. In terms of the choice of metalearner, both vote- and blip-based ensemble learners performed well; a vote-based metalearner has the advantage, however, of allowing for the integration of a larger library of candidate algorithms (including direct estimation approaches) and ease of integration of static rules. Of note, in these simulations, vote- and blip-based metalearners outperformed the discrete ODTR SuperLearner approach, even for sample size 300.
However, we caution that this may not always be the case, and when sample size is small, a discrete SuperLearner approach may provide benefits – in fact, one could include a convex metalearner as a candidate algorithm. Finally, with respect to choice of risk function, both the MSE and the expected outcome under the rule performed well; in practice one might prefer R_E[Yd] because it allows for the use of a larger library of candidate algorithms.

Additionally, as an illustration of how to apply the ODTR SuperLearner to real data, we estimated the ODTR using the "Interventions" study to determine which types of justice-involved adults with mental illness should be assigned CBT versus TAU, to yield the highest probability of no re-arrest. Preliminary results show that the ODTR SuperLearner placed all weight on a simple parametric model with only substance use; thus, the algorithm suggests that, in this sample, participants with lower levels of substance use may benefit more from CBT. We note that this is an example of a case in which the ODTR SuperLearner generated an ODTR estimate that was fully interpretable – although we used a continuous metalearner and thus the SuperLearner could have allowed for combinations of algorithms, the SuperLearner happened to choose only one algorithm: a GLM with substance use as a regressor. To guarantee interpretability in the SuperLearner (for example, if working with practitioners who may want a treatment decision rule that could be easily written down [3, 62]), one could implement the ODTR SuperLearner with a discrete metalearner and a simple parametric library only.

Importantly, in a companion paper, we evaluate this ODTR – that is, we ask the causal question: what would have been the probability of no re-arrest had participants been assigned CBT according to the ODTR SuperLearner (i.e., using only substance use)? Further, is assigning CBT according to the ODTR SuperLearner significantly better than assigning CBT to everyone or no one?
In this way, we can determine if it is of clinical significance to assign CBT according to this rule – namely, if assigning CBT using only substance use scores results in a statistically significant reduction of recidivism, and if so, how much better one does with this ODTR compared to a non-individualized rule (such as giving CBT to all). Under the appropriate causal assumptions, one could use any of the methods for estimating treatment-specific means to interpret this estimate as the expected outcome under the true optimal rule or the estimated optimal rule.

Future work could extend these simulations to the multiple treatment (i.e., more than 2 treatment levels) [35] and multiple timepoint settings (i.e., estimating a sequential ODTR from, for example, a SMART design [7, 25, 36] instead of an RCT design). We also wish to apply the ODTR SuperLearner on the full "Interventions" dataset (n = 720), once data collection is finished.

This work contributes to understanding the toolbox of methods for analyzing the heterogeneity in how patients respond to treatment. By learning which patients respond best to what treatment in a flexible manner, we can improve patient outcomes – moving us closer to the goals of precision health.

                                   TAU (A = 0)     CBT (A = 1)     p
n                                  211             230
No re-arrest (Y = 1) (%)           128 (60.7)      143 (62.2)      0.820
Site = San Francisco (%)           87 (41.2)       104 (45.2)      0.455
Gender = Female (%)                38 (18.0)       37 (16.1)       0.682
Ethnicity = Hispanic (%)           50 (23.7)       42 (18.3)       0.198
Age (mean (SD))                    38.08 (11.05)   37.01 (11.22)   0.317
CSI (mean (SD))                    32.35 (11.13)   33.46 (11.27)   0.300
LSI (mean (SD))                    5.59 (1.33)     5.50 (1.48)     0.472
SES (mean (SD))                    3.81 (1.89)     3.81 (2.12)     0.995
Prior adult convictions (%)                                        0.156
  Zero to two times                74 (35.1)       93 (40.4)
  Three or more times              134 (63.5)      129 (56.1)
  Missing                          3 (1.4)         8 (3.5)
Most serious offense (mean (SD))   5.29 (2.54)     5.09 (2.52)     0.415
Motivation (mean (SD))             3.22 (1.36)     3.27 (1.37)     0.720
Substance use (%)                                                  0.184
  0                                53 (25.1)       76 (33.0)
  1                                47 (22.3)       55 (23.9)
  2                                109 (51.7)      98 (42.6)
  Missing                          2 (0.9)         1 (0.4)

Table 2: Distribution of "Interventions" data by treatment assignment.
References

[1] Issa J Dahabreh, Rodney Hayward, and David M Kent. Using group data to treat individuals: understanding heterogeneous treatment effects in the age of precision medicine and patient-centred evidence. International Journal of Epidemiology, 45(6):2184–2193, 2016.

[2] David M Kent, Ewout Steyerberg, and David van Klaveren. Personalized evidence based medicine: predictive approaches to heterogeneous treatment effects. BMJ, 363:k4245, 2018.

[3] Michael R Kosorok and Eric B Laber. Precision medicine. Annual Review of Statistics and Its Application, 6:263–286, 2019.

[4] Jennifer L Skeem, Sarah Manchak, and Jillian K Peterson. Correctional policy for offenders with mental illness: Creating a new paradigm for recidivism reduction. Law and Human Behavior, 35(2):110–126, 2011.

[5] Jennifer L Skeem, Eliza Winter, Patrick J Kennealy, Jennifer Eno Louden, and Joseph R Tatar II. Offenders with mental illness have criminogenic needs, too: Toward recidivism reduction. Law and Human Behavior, 38(3):212, 2014.

[6] Mark W Lipsey, Nana A Landenberger, and Sandra J Wilson. Effects of cognitive-behavioral programs for criminal offenders. Campbell Systematic Reviews, 3(1):1–27, 2007.

[7] Huitian Lei, Inbal Nahum-Shani, K Lynch, David Oslin, and Susan A Murphy. A "SMART" design for building individualized treatment sequences. Annual Review of Clinical Psychology, 8:21–48, 2012.

[8] Feifang Hu and William F Rosenberger. The Theory of Response-Adaptive Randomization in Clinical Trials, volume 525. John Wiley & Sons, 2006.

[9] Ilya Lipkovich, Alex Dmitrienko, and Ralph B D'Agostino Sr. Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Statistics in Medicine, 36(1):136–196, 2017.

[10] Oliver Bembom and Mark J van der Laan. A practical illustration of the importance of realistic individualized treatment rules in causal inference. Electronic Journal of Statistics, 1:574, 2007.

[11] Mark J van der Laan and Maya L Petersen. Causal effect models for realistic individualized treatment and intention to treat rules. The International Journal of Biostatistics, 3(1), 2007.

[12] James Robins. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9-12):1393–1512, 1986.

[13] Bibhas Chakraborty and Erica EM Moodie. Statistical Methods for Dynamic Treatment Regimes. Springer, 2013.

[14] Susan A Murphy. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331–355, 2003.

[15] James M Robins. Optimal structural nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium in Biostatistics, pages 189–326. Springer, 2004.

[16] Alexander R Luedtke and Mark J van der Laan. Optimal individualized treatments in resource-limited settings. The International Journal of Biostatistics, 12(1):283–303, 2016.

[17] Eric B Laber, Kristin A Linn, and Leonard A Stefanski. Interactive model building for Q-learning. Biometrika, 101(4):831–847, 2014.

[18] Phillip J Schulte, Anastasios A Tsiatis, Eric B Laber, and Marie Davidian. Q- and A-learning methods for estimating optimal dynamic treatment regimes. Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 29(4):640, 2014.

[19] Erica EM Moodie, Bibhas Chakraborty, and Michael S Kramer. Q-learning for estimating optimal dynamic treatment rules from observational data. Canadian Journal of Statistics, 40(4):629–645, 2012.

[20] Min Qian and Susan A Murphy. Performance guarantees for individualized treatment rules. Annals of Statistics, 39(2):1180, 2011.

[21] Erica EM Moodie, Thomas S Richardson, and David A Stephens. Demystifying optimal dynamic treatment regimes. Biometrics, 63(2):447–455, 2007.

[22] Yingqi Zhao, Donglin Zeng, A John Rush, and Michael R Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.

[23] Baqun Zhang, Anastasios A Tsiatis, Marie Davidian, Min Zhang, and Eric Laber. Estimating optimal treatment regimes from a classification perspective. Stat, 1(1):103–114, 2012.

[24] Ying-Qi Zhao, Donglin Zeng, Eric B Laber, and Michael R Kosorok. New statistical learning methods for estimating optimal dynamic treatment regimes. Journal of the American Statistical Association, 110(510):583–598, 2015.

[25] Michael R Kosorok and Erica EM Moodie. Adaptive Treatment Strategies in Practice: Planning Trials and Analyzing Data for Personalized Medicine, volume 21. SIAM, 2015.

[26] Ying-Qi Zhao and Eric B Laber. Estimation of optimal dynamic treatment regimes.
Clinical Trials , 11(4):400–407, 2014.[27] Mark J van der Laan, Eric C Polley, and Alan E Hubbard. Super learner.
Statistical applications in genetics and molecular biology , 6(1), 2007.[28] Leo Breiman. Stacked regressions.
Machine learning , 24(1):49–64, 1996.[29] Erin LeDell, Mark J van der Laan, and Maya Petersen. Auc-maximizingensembles through metalearning.
The international journal of biostatistics ,12(1):203–218, 2016.[30] Eric C Polley and Mark J van der Laan. Super learner in prediction. 2010.[31] Romain Pirracchio, Maya L Petersen, and Mark van der Laan. Improvingpropensity score estimators’ robustness to model misspecification using su-per learner.
American journal of epidemiology , 181(2):108–119, 2014.[32] Romain Pirracchio, Maya L Petersen, Marco Carone, Matthieu RescheRigon, Sylvie Chevret, and Mark J van der Laan. Mortality prediction in in-tensive care units with the super icu learner algorithm (sicula): a population-based study.
The Lancet Respiratory Medicine , 3(1):42–52, 2015.[33] Maya L Petersen, Erin LeDell, Joshua Schwab, Varada Sarovar, RobertGross, Nancy Reynolds, Jessica E Haberer, Kathy Goggin, Carol Golin, JuliaArnsten, et al. Super learner analysis of electronic adherence data improvesviral prediction and may provide strategies for selective hiv rna monitoring.
Journal of acquired immune deficiency syndromes (1999) , 69(1):109, 2015.[34] Alexander R Luedtke and Mark J van der Laan. Super-learning of an op-timal dynamic treatment rule.
The international journal of biostatistics ,12(1):305–332, 2016.[35] Jeremy Robert Coyle.
Computational Considerations for Targeted Learning .PhD thesis, UC Berkeley, 2017.[36] Daniel Almirall, Inbal Nahum-Shani, Nancy E Sherwood, and Susan A Mur-phy. Introduction to smart designs for the development of adaptive interven-tions: with application to weight loss research.
Translational behavioralmedicine , 4(3):260–274, 2014. 2937] Maya L Petersen and Mark J van der Laan. Causal models and learning fromdata: integrating causal modeling and statistical estimation.
Epidemiology(Cambridge, Mass.) , 25(3):418, 2014.[38] Judea Pearl.
Causality: models, reasoning and inference , volume 29.Springer, 2000.[39] Alexander R Luedtke and Mark J van der Laan. Statistical inference forthe mean outcome under a possibly non-unique optimal treatment strategy.
Annals of statistics , 44(2):713, 2016.[40] James M Robins and Miguel A Hern´an. Estimation of the causal effects oftime-varying exposures.
Longitudinal data analysis , 553:599, 2009.[41] Maya L Petersen, Kristin E Porter, Susan Gruber, Yue Wang, and Mark Jvan der Laan. Diagnosing and responding to violations in the positivityassumption.
Statistical methods in medical research , 21(1):31–54, 2012.[42] Inbal Nahum-Shani, Min Qian, Daniel Almirall, William E Pelham, BethGnagy, Gregory A Fabiano, James G Waxmonsky, Jihnhee Yu, and Susan AMurphy. Q-learning: A data analysis method for constructing adaptive in-terventions.
Psychological methods , 17(4):478, 2012.[43] Daniel Rubin and Mark J van der Laan. A doubly robust censoring unbiasedtransformation.
The international journal of biostatistics , 3(1), 2007.[44] Mark J van der Laan and Alexander R Luedtke. Targeted learning of themean outcome under an optimal dynamic treatment rule.
Journal of causalinference , 3(1):61–95, 2015.[45] Daniel B Rubin and Mark J van der Laan. Statistical issues and limitations inpersonalized medicine research with clinical trials.
The international journalof biostatistics , 8(1), 2012.[46] Mark J van der Laan and Sherri Rose.
Targeted learning: causal inferencefor observational and experimental data . Springer Science & Business Me-dia, 2011.[47] Brian D Ripley and NL Hjort.
Pattern recognition and neural networks .Cambridge university press, 1996.[48] Jerome H Friedman et al. Multivariate adaptive regression splines.
Theannals of statistics , 19(1):1–67, 1991.[49] Leo Breiman.
Classification and regression trees . Routledge, 2017.3050] Mark J van der Laan and Susan Gruber. Targeted minimum loss based esti-mation of an intervention specific mean outcome. 2011.[51] Ravi Varadhan, Jodi B Segal, Cynthia M Boyd, Albert W Wu, and Carlos OWeiss. A framework for the analysis of heterogeneity of treatment effectin patient-centered outcomes research.
Journal of clinical epidemiology ,66(8):818–825, 2013.[52] David van Klaveren, Yvonne Vergouwe, Vasim Farooq, Patrick W Serruys,and Ewout W Steyerberg. Estimates of absolute treatment benefit for indi-vidual patients required careful modeling of statistical interactions.
Journalof clinical epidemiology , 68(11):1366–1374, 2015.[53] Salim Yusuf, Janet Wittes, Jeffrey Probstfield, and Herman A Tyroler. Anal-ysis and interpretation of treatment effects in subgroups of patients in ran-domized clinical trials.
Jama , 266(1):93–98, 1991.[54] R Core Team.
R: A Language and Environment for Statistical Computing .R Foundation for Statistical Computing, Vienna, Austria, 2018.[55] Chih-Chung Chang and Chih-Jen Lin. Libsvm: A library for support vectormachines.
ACM transactions on intelligent systems and technology (TIST) ,2(3):27, 2011.[56] Eric Polley, Erin LeDell, Chris Kennedy, and Mark van der Laan.
Super-Learner: Super Learner Prediction , 2018. R package version 2.0-24.[57] Xin Zhou, Nicole Mayer-Hamblett, Umer Khan, and Michael R Kosorok.Residual weighted learning for estimating individualized treatment rules.
Journal of the American Statistical Association , 112(517):169–187, 2017.[58] Ying-Qi Zhao, Eric B Laber, Yang Ning, Sumona Saha, and Bruce E Sands.Efficient augmentation and relaxation learning for individualized treatmentrules using observational data.
Journal of Machine Learning Research ,20(48):1–23, 2019.[59] Baqun Zhang, Anastasios A Tsiatis, Marie Davidian, Min Zhang, and EricLaber. Estimating optimal treatment regimes from a classification perspec-tive.
Stat , 1(1):103–114, 2012.[60] S. T. Holloway, E. B. Laber, K. A. Linn, B. Zhang, M. Davidian, and A. A.Tsiatis.
DynTxRegime: Methods for Estimating Optimal Dynamic TreatmentRegimes , 2019. R package version 4.1.3161] Kelly L Moore and Mark J van der Laan. Covariate adjustment in random-ized trials with binary outcomes: targeted maximum likelihood estimation.
Statistics in medicine , 28(1):39–64, 2009.[62] Zachary D Cohen and Robert J DeRubeis. Treatment selection in depression.
Figure 1: Performance of E_n[Q(Y | A = d*_n, W)] (an approximation of the true mean outcome under the estimated ODTR) for DGP 1 (top two panels) and DGP 2 (bottom two panels). The horizontal black line depicts E_{P_{U,X}}[Y_{d*}]; the horizontal red and blue lines depict E_{P_{U,X}}[Y_1] and E_{P_{U,X}}[Y_0], respectively. We compare the SuperLearner algorithm to an incorrectly specified GLM (in gray, with N/A as the metalearner and an unfilled diamond). We also compare: (1) a SuperLearner library with (a) only algorithms that estimate the blip (i.e., "blip-only" libraries), containing parametric blip models only (blue), machine-learning blip models only (red), or both (purple), versus (b) an expanded "full" library that adds methods that directly estimate the ODTR and static rules, with blip function regressions estimated via machine learning only (orange-yellow) or via both machine learning and parametric models (green); (2) a metalearner (depicted on the x-axis) that either selects a single algorithm (the "discrete" SuperLearner) or combines blip predictions/treatment predictions (the "continuous" SuperLearner); and (3) the MSE risk function (R_MSE, shown as a square) versus the mean outcome under the candidate rule risk function (R_{E[Y_d]}, shown as a triangle).
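The caption above contrasts a "discrete" metalearner, which selects the single best-performing candidate, with a "continuous" metalearner, which weights candidates, evaluated under the mean-outcome-under-the-rule risk. The following Python sketch illustrates that distinction only; it is not the paper's implementation (which uses the R SuperLearner machinery). The data-generating process, the two candidate blip predictions, and the IPW risk estimate are all invented here for illustration, assuming a randomized trial with known P(A = 1) = 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Toy randomized-trial data: blip (CATE) equals W, so the optimal rule
# treats exactly when W > 0.
W = rng.normal(size=n)
true_blip = W
A = rng.binomial(1, 0.5, size=n)
Y = 0.5 * W + A * true_blip + rng.normal(scale=0.5, size=n)

# Hypothetical cross-validated blip predictions from two candidates:
# a noisy-but-informative estimator and a constant "treat everyone" estimator.
cv_blips = np.column_stack([
    true_blip + rng.normal(scale=0.3, size=n),
    np.full(n, 0.2),
])

def mean_outcome_risk(blip_pred):
    """IPW estimate of -E[Y_d] for the rule d(W) = 1{blip(W) > 0}."""
    d = (blip_pred > 0).astype(int)
    g = 0.5                      # known randomization probability
    follows = (A == d)
    return -np.mean(follows * Y / g)

# Discrete metalearner: pick the single candidate minimizing the risk.
risks = [mean_outcome_risk(cv_blips[:, j]) for j in range(cv_blips.shape[1])]
best = int(np.argmin(risks))

# Continuous metalearner: grid-search convex weights over the candidates,
# minimizing the same risk for the combined blip prediction.
alphas = np.linspace(0, 1, 101)
ens_risks = [mean_outcome_risk(a * cv_blips[:, 0] + (1 - a) * cv_blips[:, 1])
             for a in alphas]
best_alpha = float(alphas[int(np.argmin(ens_risks))])
```

Because the weighted combination includes each candidate as a special case (weights 0 and 1), the continuous metalearner's cross-validated risk can never exceed that of the discrete selection on the same candidates, which is one intuition behind the ensemble's advantage reported in the abstract.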