Optimising Individual-Treatment-Effect Using Bandits
Jeroen Berrevoets
University of Brussels (VUB) [email protected]
Sam Verboven
University of Brussels (VUB) [email protected]
Wouter Verbeke
University of Brussels (VUB) [email protected]
Abstract
Applying causal inference models in areas such as economics, healthcare and marketing receives great interest from the machine learning community. In particular, estimating the individual-treatment-effect (ITE) in settings such as precision medicine and targeted advertising has seen a surge in applications. Optimising this ITE under the strong ignorability assumption, meaning all confounders expressing influence on the outcome of a treatment are registered in the data, is often referred to as uplift modeling (UM). While these techniques have proven useful in many settings, they degrade quickly in a dynamic environment due to concept drift. Take for example the negative influence on a marketing campaign when a competitor product is released. To counter this, we propose the uplifted contextual multi-armed bandit (U-CMAB), a novel approach to optimise the ITE by drawing upon bandit literature. Experiments on real and simulated data indicate that our proposed approach compares favourably against the state-of-the-art. All our code can be found online at https://github.com/vub-dl/u-cmab.

Introduction

Making individual-level causal predictions is an important problem in many fields. For example, individual-treatment-effect (ITE) predictions can be used to prescribe medicine only when it causes the best outcome for a specific patient, or to advertise only to those that were not going to buy otherwise. While many ITE prediction methods exist, they fail to adapt through time. We believe this is a crucial problem in causal inference, as many environments are dynamic in nature: patients could build a tolerance to their prescribed medicine, or an initially successful marketing campaign could suffer from a competitor's product release [4]. In machine learning, we refer to deteriorating behaviour due to a changing environment as concept drift [17, 5].

A first naive attempt to create dynamic causal inference models could be an adapted on-line learning method, e.g., on-line random forests [14]. However, such methods require a target variable, which is absent here as the counterfactual outcome is unobservable. A second naive approach would be to use a change detection algorithm [5], initiating a retraining subroutine when necessary. We have done exactly this in our experiments, but found it to perform poorly compared to our method.

We take a fundamentally different approach to the naive strategies described above, as we reformulate uplift modeling as a bandit problem [12]. Since bandits learn continuously, they easily adapt to dynamic environments using a windowed estimation of their target [16].
CausalML Workshop at the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Preliminaries and Background
Uplift models estimate the net impact of a treatment $T \in \{0, 1\}$ on a response $Y \in \{0, 1\}$ for an individual $x \in \mathbb{R}^n$. This net impact is measured through an incremental probability, $\hat{u}(Y, T, x) \doteq \hat{p}(Y = 1 \mid T = 1, x) - \hat{p}(Y = 1 \mid T = 0, x)$, where $T = 1$ when the treatment is applied and $T = 0$ when it is not [2, 6]. Given a high $\hat{u}$, we derive that an $x$ can be caused to respond ($Y = 1$) to the treatment [2, 13]. Uplift models are then employed to identify a subpopulation with high $\hat{u}$. By limiting treatment to this subpopulation we reduce over-treatment, refraining from treating individuals indifferent to treatment ($\hat{u} = 0$) or, worse, averse to it ($\hat{u} < 0$). Typically, datasets in UM are built using a randomised trial setting, where $Y \perp\!\!\!\perp T \mid x$ and $0 < p(T = 1 \mid x) < 1$ for all $x$, assuring the strong ignorability assumption [13, 15, 2]. Hence, use of the $do$-operator is not required, in contrast to the case where strong ignorability is violated [10].

Contextual multi-armed bandits (CMAB) differ from UM as they apply treatment as a function of the expected response only. We define this response as $r(T = i, x) \doteq \mathbb{E}[R(Y) \mid T = i, x]$, where: $x$ is considered a context; $\{T = 0, T = 1\}$ is the set of arms; and $R : Y \to \mathbb{R}$ is the numerical reward for $Y$ [18, 9]. Optimal treatment selection is then motivated by an estimation $\hat{r}$ of this expected response, as in (1),

$$T^*_b = \arg\max_i \{\hat{r}(T = i, x)\}. \qquad (1)$$

The treatment $T^*_b$ is chosen over other treatments even if $T^*_b$ offers only a marginally higher expected response. This formulation suggests two major components in a CMAB's objective: (i) response estimation through $\hat{r}$; and (ii) proper treatment selection through (1). Randomly applying treatments ensures that $\hat{r}$ is unbiased, but conflicts with the second objective. Balancing these components is often referred to as the exploration-exploitation trade-off [16]. We use this formulation to frame our experiments in Section 4.

The difference between UM and CMABs is apparent through the maximisation in (1). Such maximisation contrasts with UM, as uplift models inform a decision maker to make causal decisions, only applying a treatment when the treatment has a sufficiently positive effect on $x$, i.e., when $\hat{u}$ is higher than some threshold $\tau \in [-1, 1]$. As such, the optimal treatment in UM is found using

$$T^*_u = \mathbb{I}[\hat{u}(Y, T, x) > \tau], \qquad (2)$$

where $\mathbb{I}[\cdot]$ is the indicator function. Using our notation, this difference is simply: $T^*_b \neq T^*_u$.

We contribute by defining $\tau$, indicating when $\hat{u}$ is considered high enough. We then apply our findings to bandit algorithms, making them optimise for uplift. By leveraging the ability to learn continuously, the U-CMAB offers resilience in a dynamic environment for individual-level causal models.
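To make the contrast between (1) and (2) concrete, consider the following minimal Python sketch. It is our own illustration, not the paper's released code, and all names (`update_r_hat`, `bandit_choice`, `uplift_choice`, `ALPHA`) are hypothetical. We assume a binary reward with $R(Y = 1) = 1$, so that $\hat{r}(T, x) = \hat{p}(Y = 1 \mid T, x)$ and the estimated uplift is simply the difference of the two arm estimates.

```python
import numpy as np

# Minimal sketch (hypothetical names, not the paper's code). Assumes a
# binary reward with R(Y=1) = 1, so r_hat(T, x) = p_hat(Y=1 | T, x).
ALPHA = 0.1  # constant step size: recent observations dominate, so the
             # estimate acts as a moving window over a drifting target [16]

def update_r_hat(r_hat, arm, reward, alpha=ALPHA):
    """Incremental (Robbins-Monro style) update of the expected response."""
    r_hat[arm] += alpha * (reward - r_hat[arm])

def bandit_choice(r_hat):
    """Treatment selection as in (1): argmax, however small the margin."""
    return int(np.argmax(r_hat))

def uplift_choice(r_hat, tau):
    """Treatment selection as in (2): treat only when u_hat exceeds tau."""
    u_hat = r_hat[1] - r_hat[0]
    return int(u_hat > tau)

r_hat = [0.30, 0.31]  # current estimates for T=0 and T=1
print(bandit_choice(r_hat))             # 1: the bandit always treats
print(uplift_choice(r_hat, tau=0.05))   # 0: an uplift of 0.01 is below tau
update_r_hat(r_hat, arm=1, reward=1.0)  # a response nudges r_hat[1] upwards
```

On the same estimates, the argmax rule treats while the thresholded uplift rule refrains: exactly the disagreement $T^*_b \neq T^*_u$ described above.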
Introducing a penalty $\psi$ associated with the cost of the treatment, with $T = i \to \psi_i \in \mathbb{R}$ and $\psi = [\psi_0, \psi_1]^\top$, enables causal decision making by the U-CMAB. While $\tau$ is generally chosen heuristically [2], we provide an analytical method based on $\psi$:

$$\tau = \frac{\psi_1 - \psi_0}{R(Y = 1)}, \qquad (3)$$

where: $\psi_1$ is the penalty of applying the treatment ($T = 1$); $\psi_0$ is the penalty of not applying the treatment ($T = 0$); and $R(Y = 1)$ is the potential (numerical) reward when $x$ responds. Two benefits of (3) come to mind: (i) $\tau$ is now composed of parameters we can share with a bandit algorithm; and (ii) there is an intuitive appeal to (3): when $\psi_1$ is high, so is $\tau$, translating into the requirement of a high $\hat{u}$ before treatment is applied, i.e., an expensive treatment should have a higher net impact than an inexpensive one before being applied.

Once $\psi$ is chosen according to (3), it is deducted from the bandit's estimated reward $\hat{r}(T, x)$,

$$\hat{r}_u(T = i, x) \doteq \mathbb{E}[R(Y) - \psi_i \mid T = i, x], \qquad (4)$$

creating a new form of reward, $\hat{r}_u$, associated with every $T = i$. When $\hat{r}$ is replaced with $\hat{r}_u$, optimal treatment selection through (1) is altered. Operating according to this $\hat{r}_u$ will yield treatment decisions similar to those made by an uplift model respecting some threshold $\tau$. We back this claim through experiments (in Section 4) and a proof of (3) in the Appendix.

Figure 1: The MDP for a CMAB in ITE optimisation: grey circles denote individuals of type $X_j \subset \mathcal{X}$; squares indicate the response $Y = 1$ or $Y = 0$; black circles represent treatments, with $T_0$ if $T = 0$ and $T_1$ if $T = 1$, done so for brevity; dashed arrows are used when $t(Y, T, X_j) = 0$; and full arrows are used when $t(Y, T, X_j) = 1$.

Some intuition into (4) can be achieved by formulating a Markov decision process (MDP), $\langle \mathcal{X}, \mathcal{T}, \mathcal{Y}, t, R \rangle$, where: $\mathcal{X}$ is the set of individuals, $x \in \mathcal{X}$; $\mathcal{T}$ is the set of treatments, $T \in \mathcal{T}$; $\mathcal{Y}$ is the set of responses, $Y \in \mathcal{Y}$; $t$ describes the transition probability to $Y$ (a terminal state in this bandit setting) from $x$ after applying treatment $T$, thus $t(Y, T, x) \doteq p(Y \mid T, x)$; and $R$ is the reward function denoted $R : \mathcal{Y} \to \mathbb{R}$.

As illustrated in Figure 1, we can use this MDP, with $t \to \{0, 1\}$, to subdivide $\mathcal{X}$ into four different kinds of individuals based on their transition properties [2]:

$X_1 \subset \mathcal{X}$: responds ($Y = 1$) only when treated ($T = 1$);
$X_2 \subset \mathcal{X}$: never responds ($Y = 0$), regardless of treatment;
$X_3 \subset \mathcal{X}$: always responds ($Y = 1$), regardless of treatment;
$X_4 \subset \mathcal{X}$: responds ($Y = 1$) only when untreated ($T = 0$).

If $Y = 1$ is the desired outcome, one can deduce from Figure 1 that only individuals from $X_1$ yield a positive causal relationship between $T$ and $Y$, as applying treatment (i.e., following $T_1$) to any other type of individual will either not result in $Y = 1$, or will regardless of $T$. As an example, take the individuals in $X_3$: as both $T = 1$ and $T = 0$ yield a transition probability of $t = 1$, it does not matter which treatment the agent applies for the individuals to respond ($Y = 1$). Therefore, a causal agent should only apply treatment ($T = 1$) when given an individual from $X_1$. Using $\hat{r}$ to differentiate between treatments, an agent would not find an optimum in the case of $X_3$. However, by adding penalties $\psi_i$, we can further differentiate between treatments and incorporate $\tau$.
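The sketch below, again our own hypothetical construction (the released implementation lives at https://github.com/vub-dl/u-cmab; names such as `PSI`, `R_Y1` and the toy environment are ours), shows how the penalised reward of (4) can be plugged into an $\epsilon$-greedy bandit with a constant step size, so that the bandit's greedy choice coincides with the uplift rule (2) under the $\tau$ implied by (3).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sketch of the penalised reward in (4) inside an
# epsilon-greedy loop; not the paper's released implementation.
PSI = np.array([0.0, 0.06])      # psi_0, psi_1: per-treatment penalties
R_Y1 = 1.0                       # reward R(Y=1) when the individual responds
EPS, ALPHA = 0.1, 0.05           # exploration rate, constant step size

# tau implied by (3): treat only when u_hat > (psi_1 - psi_0) / R(Y=1) = 0.06
tau = (PSI[1] - PSI[0]) / R_Y1

r_u = np.zeros(2)                # running estimates of r_u(T=i, x) as in (4)
p_true = np.array([0.30, 0.34])  # toy environment: p(Y=1 | T=i, x)

for step in range(5000):
    # epsilon-greedy over the penalised estimates: explore, else exploit (1)
    arm = rng.integers(2) if rng.random() < EPS else int(np.argmax(r_u))
    reward = R_Y1 * (rng.random() < p_true[arm]) - PSI[arm]
    r_u[arm] += ALPHA * (reward - r_u[arm])   # windowed estimate of (4)

# The true uplift (0.04) is below tau (0.06), so the penalised bandit
# settles on T=0, matching the uplift rule (2).
print(r_u, int(np.argmax(r_u)))
```

The constant step size is what keeps this adaptive: if `p_true` drifts mid-stream, the estimates of (4) follow, without any explicit change detection.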
Experiments

We frame our experiments using the CMAB's objective: (i) ITE prediction (rather than response prediction); and (ii) causal treatment selection. As the U-CMAB is a UM method, we compare against the state-of-the-art in UM, being an uplift random forest (URF) [2].

ITE prediction is tested using the Hillstrom dataset (https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html), a well-known resource for ITE prediction with two treatments and eighteen variables [2, 11]. We evaluate performance using a qini-chart (a relative of the gini-chart) [8]: after ranking each individual in a hold-out test-set according to their estimated $\hat{u}$, the cumulative incremental response-rate is calculated using

$$q(b) \doteq \frac{Y_{1,b}}{N_{1,b}} - \frac{Y_{0,b}}{N_{0,b}}, \qquad (5)$$

where: $q(b)$ accounts for the first $b \in \mathbb{N}$ bins of size $N/B$; $Y_{i,b}$ is the number of responders with $T = i$; and $N_{i,b}$ is the number of individuals treated with $T = i$. As individuals with high $\hat{u}$ are ranked first, (5) should score high for the first individuals and gradually decrease as more individuals are included in the evaluation.
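A minimal sketch of (5), assuming a randomised hold-out set with binary outcomes; the function name and interface are our own, not those of the released code.

```python
import numpy as np

def qini_curve(u_hat, y, t, n_bins=10):
    """Cumulative incremental response rate per (5). A sketch only:
    assumes a randomised hold-out set; names are our own.

    u_hat: estimated uplift (used only for ranking);
    y: binary response; t: applied treatment (0 or 1)."""
    u_hat, y, t = map(np.asarray, (u_hat, y, t))
    order = np.argsort(-u_hat)            # highest estimated uplift first
    y, t = y[order], t[order]
    edges = np.linspace(0, len(y), n_bins + 1).astype(int)
    q = []
    for b in range(1, n_bins + 1):        # first b bins of size N/B
        yb, tb = y[:edges[b]], t[:edges[b]]
        n1 = max((tb == 1).sum(), 1)      # guard against empty bins
        n0 = max((tb == 0).sum(), 1)
        q.append(yb[tb == 1].sum() / n1 - yb[tb == 0].sum() / n0)
    return np.array(q)
```

A model that ranks well produces a curve that starts high and decays as more of the test-set is included, which is what Figure 2 visualises relative to random selection.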
Figure 2: Compared performance on the Hillstrom dataset of a single batch constrained ANN against two separate URFs, where URF (T=1) was trained for treatment $T = 1$ and URF (T=2) was trained for $T = 2$. The farther a model is removed from the random selection line, the better.
Figure 3: Averaged performance over ten runs of the U-CMAB, URF and CMAB in various randomly-generated simulated environments [1]. The grey dashed line indicates the end of the first data gatheringperiod for the URF, yielding a regret of . as treatments are applied randomly. Dotted lines inFigure 3b indicate a sudden drift. ˆ u , the cumulative incremental response-rate is calculated using, q ( b ) . = (cid:18) Y ,b N ,b − Y ,b N ,b (cid:19) , (5)where: q ( b ) accounts for the first b ∈ N bins of size NB ; Y i,b is the amount of responders with T = i ;and N i,b is the amount of individuals treated with T = i . As an individual with high ˆ u is rankedfirst, (5) should score high for the first individuals and gradually decrease when more individuals areincluded in the evaluation.In our experiment we compared a batch constrained artificial neural network (ANN) [3, 7] to train ˆ r u ,as in (4), against two separate URFs—one for each treatment as current methods can only estimatefor one treatment at a time. From Figure 2 we recognise that the U-CMAB, using a batch ANN,compares favourably against both URFs, and is thus able to predict the ITE nicely using ˆ r u . Causal treatment selection is tested using a simulated environment [1] allowing us to compareagainst an all-knowing optimal policy, while controlling how dynamic the environment should be.In Figure 3 we have plotted performance of: (i) a URF [2], which we combined with an adaptivesliding window (ADWIN) change detection algorithm, initiating a data collection and retrainingroutine when necessary [5]; (ii) a regular CMAB; and (iii) the U-CMAB. We chose an (cid:15) -greedytraining strategy for both bandits for two major reasons: (i) typical implementations use a Robins-Monro estimation of their objective (both ˆ r and ˆ r u are an expectation), which is easily upgraded fordynamic settings using a constant step-size; (ii) (cid:15) -greedy has been shown to converge in a variety ofenvironments [9] which aids in our setting, as the environment is usually ill-documented [2].Performance shown is measured in a regret metric, taking into account the causal nature of eachtreatment decision [1]. Our results clearly indicate a performance increase in both dynamic and static4nvironments, while confirming immense instability of the URF in dynamic environments, even whenameliorated with an ADWIN change detection strategy. As expected, the CMAB performs worstin a static environment (Figure 3a) since it is the only method not optimising an ITE, however, itoutperforms the URF in dynamic environments (Figures 3b and 3c) further confirming the importanceof dynamic methods. Through the results shown in Section 4, we provide evidence that (2) and (3) allow bandit algorithmsto make treatment decisions based on a prediction for the individual-treatment-effect. The useof bandits minimises the amount of random experiments through efficient exploration and offersresilience against a dynamic environment.In light of further work, we are interested in the U-CMAB’s extension to full reinforcement learning[16] using an estimated τ through time, potentially allowing an agent to make causal decisionsleading to more efficient use of resources. Efficiently managing resources required to obtain a certainreward could greatly affect the application in practical settings. References [1] B
[1] Berrevoets, Jeroen; Verbeke, Wouter: Causal Simulations for Uplift Modeling. In: arXiv preprint arXiv:1902.00287 (2019)
[2] Devriendt, Floris; Moldovan, Darie; Verbeke, Wouter: A Literature Survey and Experimental Evaluation of the State-of-the-Art in Uplift Modeling: A Stepping Stone Toward the Development of Prescriptive Analytics. In: Big Data 6 (2018), No. 1, pp. 13–41. URL https://doi.org/10.1089/big.2017.0104. PMID: 29570415
[3] Ernst, Damien; Geurts, Pierre; Wehenkel, Louis: Tree-based batch mode reinforcement learning. In: Journal of Machine Learning Research 6 (2005), pp. 503–556
[4] Fang, Xiao: Uplift Modeling for Randomized Experiments and Observational Studies. Massachusetts Institute of Technology, Dissertation, 2018
[5] Gama, João; Žliobaitė, Indrė; Bifet, Albert; Pechenizkiy, Mykola; Bouchachia, Abdelhamid: A survey on concept drift adaptation. In: ACM Computing Surveys (CSUR) 46 (2014), No. 4, p. 44
[6] Gutierrez, Pierre; Gérardy, Jean-Yves: Causal Inference and Uplift Modelling: A Review of the Literature. In: International Conference on Predictive Applications and APIs, 2017, pp. 1–13
[7] Johansson, Fredrik; Shalit, Uri; Sontag, David: Learning representations for counterfactual inference. In: International Conference on Machine Learning, 2016, pp. 3020–3029
[8] Kane, Kathleen; Lo, Victor S.; Zheng, Jane: Mining for the truly responsive customers and prospects using true-lift modeling: Comparison of new and existing methods. In: Journal of Marketing Analytics 2 (2014), No. 4, pp. 218–238
[9] Kuleshov, Volodymyr; Precup, Doina: Algorithms for multi-armed bandit problems. In: arXiv preprint arXiv:1402.6028 (2014)
[10] Pearl, Judea: Causality. Cambridge, UK: Cambridge University Press, 2009
[11] Radcliffe, Nicholas J.; Surry, Patrick D.: Real-world uplift modelling with significance-based uplift trees. In: White Paper TR-2011-1, Stochastic Solutions (2011)
[12] Robbins, Herbert: Some aspects of the sequential design of experiments. In: Bulletin of the American Mathematical Society 58 (1952), pp. 527–535
[13] Rubin, Donald B.: Causal Inference Using Potential Outcomes. In: Journal of the American Statistical Association 100 (2005), No. 469, pp. 322–331. URL https://doi.org/10.1198/016214504000001880
[14] Saffari, Amir; Leistner, Christian; Santner, Jakob; Godec, Martin; Bischof, Horst: On-line random forests. In: IEEE International Conference on Computer Vision Workshops, 2009, pp. 1393–1400
[15] Shalit, Uri; Johansson, Fredrik D.; Sontag, David: Estimating individual treatment effect: generalization bounds and algorithms. In: Proceedings of the 34th International Conference on Machine Learning, Volume 70, JMLR.org, 2017, pp. 3076–3085
[16] Sutton, Richard S.; Barto, Andrew G.: Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA, USA: MIT Press, 2018
[17] Tsymbal, Alexey: The problem of concept drift: definitions and related work. Technical report, Computer Science Department, Trinity College Dublin, 2004
[18] Zhou, Li: A survey on contextual multi-armed bandits. In: arXiv preprint arXiv:1508.03326 (2015)

Appendix
Python code used to test the U-CMAB as in Section 4 is provided online at https://github.com/vub-dl/u-cmab. In this code you will find hyperparameters, notebooks documenting plot methods, and extra visualisations and experiments further confirming the instability of current methods.
Proof.
We prove that the equality

$$\tau = \frac{\psi_1 - \psi_0}{R(Y = 1)}$$

allows a bandit to make decisions based on some $\tau$ as in (2). We introduce a penalty $\psi_i$ for a treatment $T = i$ in the treatment selection procedure, as in (1) and (4):

$$T^* = \arg\max_i \{\mathbb{E}[R(Y) - \psi_i \mid T = i, x]\}, \qquad (6)$$

reflecting the definition of $\hat{r}_u$. In the case of a single treatment ($T = 1$) and control ($T = 0$), the $\arg\max_i \{\cdot\}$ in (6) can be simplified to

$$T^* = \mathbb{I}[\hat{r}(T = 1, x) - \psi_1 > \hat{r}(T = 0, x) - \psi_0], \qquad (7)$$

as $\psi_i$ is a constant and $\mathbb{E}[\cdot]$ a linear operator, with $\hat{r}$ an expected value based on the transition function [16],

$$\hat{r}(T, x) \doteq R(Y = 1)\,\hat{p}(Y = 1 \mid T, x), \qquad (8)$$

with $R(Y = 1)$ the reward received after responding to $T$. Substituting (8) into (7), we get

$$T^* = \mathbb{I}[R(Y = 1)\,\hat{p}(Y = 1 \mid T = 1, x) - \psi_1 > R(Y = 1)\,\hat{p}(Y = 1 \mid T = 0, x) - \psi_0]. \qquad (9)$$

Rearranging (9) yields

$$T^* = \mathbb{I}\!\left[\hat{u}(Y, T, x) > \frac{\psi_1 - \psi_0}{R(Y = 1)}\right], \qquad (10)$$

which through (2) implies

$$\tau = \frac{\psi_1 - \psi_0}{R(Y = 1)}. \qquad \blacksquare$$
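As a quick numeric sanity check of this result (our own addition, not part of the paper), one can verify for random parameters that the penalised comparison in (7) and the thresholded uplift rule in (2), with $\tau$ as in (3), always agree:

```python
import numpy as np

# Numeric sanity check of (3): for random p(Y=1|T,x), psi and R(Y=1) > 0,
# the penalised comparison (7) agrees with thresholding u_hat at tau (2).
rng = np.random.default_rng(42)
for _ in range(10_000):
    p0, p1 = rng.random(2)          # p(Y=1 | T=0, x), p(Y=1 | T=1, x)
    psi0, psi1 = rng.random(2)      # per-treatment penalties
    R = rng.random() + 0.1          # reward R(Y=1), kept strictly positive
    tau = (psi1 - psi0) / R                        # eq. (3)
    bandit = int(R * p1 - psi1 > R * p0 - psi0)    # eq. (7)
    uplift = int((p1 - p0) > tau)                  # eq. (2), u_hat = p1 - p0
    assert bandit == uplift
print("penalised bandit decisions match the uplift rule for tau as in (3)")
```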