Uplift Modeling from Separate Labels
Ikko Yamane, Florian Yger, Jamal Atif, Masashi Sugiyama
The University of Tokyo, Tokyo, Japan
RIKEN Center for Advanced Intelligence Project (AIP), Tokyo, Japan
LAMSADE, CNRS, Université Paris-Dauphine, Université PSL, Paris, France
{yamane@ms., sugi@}k.u-tokyo.ac.jp, {florian.yger@, jamal.atif@}dauphine.fr

Abstract
Uplift modeling is aimed at estimating the incremental impact of an action on an individual's behavior, which is useful in various application domains such as targeted marketing (advertisement campaigns) and personalized medicine (medical treatments). Conventional methods of uplift modeling require every instance to be jointly equipped with two types of labels: the taken action and its outcome. However, obtaining two labels for each instance at the same time is difficult or expensive in many real-world problems. In this paper, we propose a novel method of uplift modeling that is applicable to a more practical setting where only one type of label is available for each instance. We show a mean squared error bound for the proposed estimator and demonstrate its effectiveness through experiments.
In many real-world problems, a central objective is to optimally choose the right action to maximize the profit of interest. For example, in marketing, an advertising campaign is designed to encourage people to purchase a product [29]. A marketer can choose whether or not to deliver an advertisement to each individual, and the outcome is the number of purchases of the product. Another example is personalized medicine, where a treatment is chosen depending on each patient to maximize the medical effect and minimize the risk of adverse events or harmful side effects [1, 13]. In this case, giving or not giving a medical treatment to each individual are the possible actions to choose from, and the outcome is the rate of recovery or survival from the disease. Hereafter, we use the word treatment for taking an action, following the personalized medicine example.
A/B testing [14] is a standard method for such tasks, where two groups of people, A and B, are randomly chosen. The outcomes are measured separately for the two groups after treating all the members of Group A but none of Group B. By comparing the outcomes between the two groups with a statistical test, one can examine whether the treatment positively or negatively affected the outcome. However, A/B testing only compares the two extreme options: treating everyone or no one. These two options can both be far from optimal when the treatment has a positive effect on some individuals but a negative effect on others.

To overcome this drawback of A/B testing, uplift modeling has been investigated recently [11, 28, 32].
Uplift modeling is the problem of estimating the individual uplift, the incremental profit brought by the treatment conditioned on the features of each individual. Uplift modeling enables us to design a refined decision rule for optimally determining whether to treat each individual or not, depending on his/her features. Such a treatment rule allows us to target only those who respond positively to the treatment and avoid treating negative responders.

In the standard uplift modeling setup, there are two types of labels [11, 28, 32]: one is whether the treatment has been given to the individual, and the other is its outcome. Existing uplift modeling methods require each individual to be jointly given these two labels for analyzing the association
Preprint. Work in progress.

between outcomes and the treatment [11, 28, 32]. However, joint labels are expensive or hard (or even impossible) to obtain in many real-world problems. For example, when distributing an advertisement by email, we can easily record to whom the advertisement has been sent. However, for technical or privacy reasons, it is difficult to keep track of those people until we observe the outcomes on whether they buy the product or not. Alternatively, we can easily obtain information about purchasers of the product at the moment when the purchases are actually made. However, we cannot know whether those who are buying the product have been exposed to the advertisement or not. Thus, every individual always has one missing label. We term such samples separately labeled samples.

In this paper, we consider a more practical uplift modeling setup where no jointly labeled samples are available, but only separately labeled samples are given. Theoretically, we first show that the individual uplift is identifiable when we have two sets of separately labeled samples collected under different treatment policies. We then propose a novel method that directly estimates the individual uplift only from separately labeled samples. Finally, we demonstrate the effectiveness of the proposed method through experiments.

This paper focuses on estimation of the individual uplift $u(x)$, often called the individual treatment effect (ITE) in the causal inference literature [31], defined as
$$u(x) := E[Y_1 \mid x] - E[Y_{-1} \mid x],$$
where $E[\cdot \mid \cdot]$ denotes the conditional expectation, $x$ is an $\mathcal{X}$-valued random variable ($\mathcal{X} \subseteq \mathbb{R}^d$) representing features of an individual, and $Y_1, Y_{-1}$ are $\mathcal{Y}$-valued potential outcome variables [31] ($\mathcal{Y} \subseteq \mathbb{R}$) representing the outcomes that would be observed if the individual was treated and not treated, respectively. Note that only one of $Y_1$ and $Y_{-1}$ can be observed for each individual.
We denote the $\{1, -1\}$-valued random variable of the treatment assignment by $t$, where $t = 1$ means that the individual has been treated and $t = -1$ not treated. We refer to the population for which we want to evaluate $u(x)$ as the test population, and denote the density of the test population by $p(Y_1, Y_{-1}, x, t)$. We assume that $t$ is unconfounded with either of $Y_1$ and $Y_{-1}$ conditioned on $x$, i.e., $p(Y_1 \mid x, t) = p(Y_1 \mid x)$ and $p(Y_{-1} \mid x, t) = p(Y_{-1} \mid x)$. Unconfoundedness is an assumption commonly made in observational studies [5, 33]. For notational convenience, we denote by $y := Y_t$ the outcome of the treatment assignment $t$. Furthermore, we refer to any conditional density of $t$ given $x$ as a treatment policy.

In addition to the test population, we suppose that there are two training populations $k = 1, 2$, whose joint probability densities $p_k(Y_1, Y_{-1}, x, t)$ satisfy
$$p_k(Y_t = y \mid x = x) = p(Y_t = y \mid x = x) \quad (\text{for } k = 1, 2), \tag{1}$$
$$p_1(t = t \mid x = x) \neq p_2(t = t \mid x = x), \tag{2}$$
for all possible realizations $x \in \mathcal{X}$, $t \in \{-1, 1\}$, and $y \in \mathcal{Y}$. Intuitively, Eq. (1) means that potential outcomes depend on $x$ in the same way as those in the test population, and Eq. (2) states that the two policies give a treatment with different probabilities for every $x = x$.

We suppose that the following four training data sets, which we call separately labeled samples, are given:
$$\{(x_i^{(k)}, y_i^{(k)})\}_{i=1}^{n_k} \overset{\text{i.i.d.}}{\sim} p_k(x, y), \qquad \{(\tilde{x}_i^{(k)}, t_i^{(k)})\}_{i=1}^{\tilde{n}_k} \overset{\text{i.i.d.}}{\sim} p_k(x, t) \quad (\text{for } k = 1, 2),$$
where $n_k$ and $\tilde{n}_k$, $k = 1, 2$, are positive integers. Under Assumptions (1), (2), and the unconfoundedness, we have $p_k(Y_t \mid x, t = t) = p(Y_t \mid x, t = t) = p(Y_t \mid x)$ for $t \in \{-1, 1\}$ and $k \in \{1, 2\}$. Note that we can safely denote $p(y \mid x, t) := p_k(y \mid x, t)$.
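To give a concrete picture of this data-collection setup, the following sketch draws separately labeled samples from two hypothetical logistic treatment policies and a hypothetical outcome model; all concrete model choices here are illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_separately_labeled(n, n_tilde, policy, outcome):
    """Draw {(x_i, y_i)} ~ p_k(x, y) and {(x~_i, t_i)} ~ p_k(x, t) for one
    training population with treatment policy p_k(t=1|x) = `policy` and a
    shared outcome model y = `outcome`(x, t)."""
    # samples carrying only outcome labels: t is drawn internally, then discarded
    x = rng.standard_normal((n, 2))
    t = np.where(rng.random(n) < policy(x), 1, -1)
    y = outcome(x, t)
    # samples carrying only treatment labels: the outcome is never observed
    x_tilde = rng.standard_normal((n_tilde, 2))
    t_tilde = np.where(rng.random(n_tilde) < policy(x_tilde), 1, -1)
    return (x, y), (x_tilde, t_tilde)

# hypothetical models for illustration only
outcome = lambda x, t: np.where(
    rng.random(len(x)) < 1 / (1 + np.exp(-(x[:, 0] * t))), 1, -1)
pol1 = lambda x: 1 / (1 + np.exp(-x[:, 0]))          # p_1(t=1|x)
pol2 = lambda x: 1 / (1 + np.exp(-(x[:, 0] + 0.5)))  # p_2(t=1|x), shifted

(x1, y1), (xt1, t1) = sample_separately_labeled(500, 500, pol1, outcome)
(x2, y2), (xt2, t2) = sample_separately_labeled(500, 500, pol2, outcome)
```

Note how each record carries exactly one of the two labels, matching the separately labeled setting.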
Moreover, we have $E[Y_t \mid x] = E[y \mid x, t = t]$ for $t = 1, -1$, and thus our goal boils down to the estimation of
$$u(x) = E[y \mid x, t = 1] - E[y \mid x, t = -1] \tag{3}$$
from the separately labeled samples, where the conditional expectations are taken over $p(y \mid x, t)$. Estimation of the individual uplift is important for the following reasons.

It enables the estimation of the average uplift.
The average uplift $U(\pi)$ of a treatment policy $\pi(t \mid x)$ is the average outcome of $\pi$, subtracted by that of the policy $\pi_{-1}$, which constantly assigns the treatment as $t = -1$, i.e., $\pi_{-1}(t = \tau \mid x) := 1[\tau = -1]$, where $1[\cdot]$ denotes the indicator function:
$$U(\pi) := \int\!\!\int \sum_{t = -1, 1} y\, p(y \mid x, t)\, \pi(t \mid x)\, p(x)\, \mathrm{d}y\, \mathrm{d}x - \int\!\!\int \sum_{t = -1, 1} y\, p(y \mid x, t)\, \pi_{-1}(t \mid x)\, p(x)\, \mathrm{d}y\, \mathrm{d}x = \int u(x)\, \pi(t = 1 \mid x)\, p(x)\, \mathrm{d}x. \tag{4}$$
This quantity can be estimated from samples of $x$ once we obtain an estimate of $u(x)$.

It provides the optimal treatment policy.
The treatment policy given by $\pi(t = 1 \mid x) = 1[0 \le u(x)]$ is the optimal treatment policy that maximizes the average uplift $U(\pi)$ and, equivalently, the average outcome $\int\!\!\int \sum_{t = -1, 1} y\, p(y \mid x, t)\, \pi(t \mid x)\, p(x)\, \mathrm{d}y\, \mathrm{d}x$ (see Eq. (4)) [32].

It is the optimal ranking scoring function.
From a practical viewpoint, it may be useful to prioritize individuals to be treated according to some ranking scores, especially when the treatment is costly and only a limited number of individuals can be treated due to some budget constraint. In fact, $u(x)$ serves as the optimal ranking score for this purpose [36]. More specifically, we define a family of treatment policies $\{\pi_{f,\alpha}\}_{\alpha \in \mathbb{R}}$ associated with a scoring function $f$ by $\pi_{f,\alpha}(t = 1 \mid x) = 1[\alpha \le f(x)]$. Then, under some technical condition, $f = u$ maximizes the area under the uplift curve (AUUC) defined as
$$\mathrm{AUUC}(f) := \int U(\pi_{f,\alpha})\, \mathrm{d}C_\alpha = \int\!\!\int u(x)\, 1[\alpha \le f(x)]\, p(x)\, \mathrm{d}x\, \mathrm{d}C_\alpha = E[1[f(x) \le f(x')]\, u(x')],$$
where $C_\alpha := \Pr[f(x) < \alpha]$, $x, x' \overset{\text{i.i.d.}}{\sim} p(x)$, and $E$ denotes the expectation with respect to these variables. AUUC is a standard performance measure for uplift modeling methods [11, 25, 28, 32]. For more details, see Appendix B in the supplementary material.

Remark on the problem setting:
Uplift modeling is often referred to as individual treatment effect estimation or heterogeneous treatment effect estimation and has been extensively studied, especially in the causal inference literature [5, 7, 9, 12, 16, 24, 31, 37]. In particular, recent research has investigated the problem under the setting of observational studies, i.e., inference using data obtained from uncontrolled experiments, because of its practical importance [33]. Here, experiments are said to be uncontrolled when some of the treatment variables are not controlled to have designed values. Given that treatment policies are unknown, our problem setting is also that of observational studies, but it poses an additional challenge that stems from missing labels. What makes our problem feasible is that we have two kinds of data sets following different treatment policies.

It is also important to note that our setting generalizes the standard setting for observational studies, since the former is reduced to the latter when one of the treatment policies always assigns individuals to the treatment group, and the other to the control group. Our problem is also closely related to individual treatment effect estimation via instrumental variables [2, 6, 10, 19].

A naive approach is first estimating the conditional densities $p_k(y \mid x)$ and $p_k(t \mid x)$ from training samples by some conditional density estimator [4, 34], and then solving the following linear system for $p(y \mid x, t = 1)$ and $p(y \mid x, t = -1)$:
$$\underbrace{p_k(y \mid x)}_{\text{Estimated from } \{(x_i^{(k)}, y_i^{(k)})\}_{i=1}^{n_k}} = \sum_{t = -1, 1} p(y \mid x, t) \underbrace{p_k(t \mid x)}_{\text{Estimated from } \{(\tilde{x}_i^{(k)}, t_i^{(k)})\}_{i=1}^{\tilde{n}_k}} \quad (\text{for } k = 1, 2). \tag{5}$$

Among the related papers mentioned above, the most relevant one is Lewis and Syrgkanis [19], which is concurrent work with ours.
Then, the conditional expectations of $y$ over $p(y \mid x, t = 1)$ and $p(y \mid x, t = -1)$ are calculated by numerical integration, and finally their difference is calculated to obtain an estimate of $u(x)$. However, this may not yield a good estimate due to the difficulty of conditional density estimation and the instability of numerical integration. This issue may be alleviated by working on the following linear system implied by Eq. (5) instead: $E_k[y \mid x] = \sum_{t = -1, 1} E[y \mid x, t]\, p_k(t \mid x)$, $k = 1, 2$, where $E_k[y \mid x]$ and $p_k(t \mid x)$ can be estimated from our samples. Solving this new system for $E[y \mid x, t = 1]$ and $E[y \mid x, t = -1]$ and taking their difference gives an estimate of $u(x)$. A method called two-stage least-squares for instrumental variable regression takes such an approach [10]. This second approach of estimating $E_k[y \mid x]$ and $p_k(t \mid x)$ avoids both conditional density estimation and numerical integration, but it still involves post-processing of solving the linear system and subtraction, which is a potential cause of performance deterioration.

In this section, we develop a method that can overcome the aforementioned problems by directly estimating the individual uplift.
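For reference, the second naive approach above, which solves the linear system $E_k[y \mid x] = \sum_t E[y \mid x, t]\, p_k(t \mid x)$ pointwise and then takes the difference, can be sketched as follows. The regression estimates fed in are assumed to come from any off-the-shelf method; the function name is ours, not the paper's.

```python
import numpy as np

def uplift_from_system(m1, m2, q1, q2):
    """Solve the 2x2 system E_k[y|x] = sum_t E[y|x,t] p_k(t|x) at each point.

    m1, m2: (n,) regression estimates of E_1[y|x], E_2[y|x]
    q1, q2: (n,) estimates of p_1(t=1|x), p_2(t=1|x)
    Returns the implied uplift E[y|x,t=1] - E[y|x,t=-1] at each point.
    """
    u = np.empty_like(m1)
    for i in range(len(m1)):
        A = np.array([[q1[i], 1 - q1[i]],
                      [q2[i], 1 - q2[i]]])
        rhs = np.array([m1[i], m2[i]])
        # the system becomes ill-conditioned when q1 is close to q2
        mu_pos, mu_neg = np.linalg.solve(A, rhs)
        u[i] = mu_pos - mu_neg
    return u
```

This makes the degeneracy discussed later explicit: as $p_1(t \mid x) \to p_2(t \mid x)$, the two rows of the matrix coincide and the solve becomes unstable.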
First, we will show an important lemma that directly relates the marginal distributions of separatelylabeled samples to the individual uplift u ( x ) . Lemma 1.
For every $x$ such that $p_1(t = 1 \mid x) \neq p_2(t = 1 \mid x)$, $u(x)$ can be expressed as
$$u(x) = 2 \times \frac{E_{y \sim p_1(y \mid x)}[y] - E_{y \sim p_2(y \mid x)}[y]}{E_{t \sim p_1(t \mid x)}[t] - E_{t \sim p_2(t \mid x)}[t]}. \tag{6}$$
For a proof, refer to Appendix C in the supplementary material.

Using Eq. (6), we can re-interpret the naive methods described in Section 3 as estimating the conditional expectations on the right-hand side by separately performing regression on $\{(x_i^{(1)}, y_i^{(1)})\}_{i=1}^{n_1}$, $\{(x_i^{(2)}, y_i^{(2)})\}_{i=1}^{n_2}$, $\{(\tilde{x}_i^{(1)}, t_i^{(1)})\}_{i=1}^{\tilde{n}_1}$, and $\{(\tilde{x}_i^{(2)}, t_i^{(2)})\}_{i=1}^{\tilde{n}_2}$. This approach may result in unreliable performance when the denominator is close to zero, i.e., $p_1(t \mid x) \simeq p_2(t \mid x)$.

Lemma 1 can be simplified by introducing auxiliary variables $z$ and $w$, which are $\mathcal{Z}$-valued and $\{-1, 1\}$-valued random variables whose conditional probability density and mass are defined by
$$p(z = z \mid x) = \tfrac{1}{2} p_1(y = z \mid x) + \tfrac{1}{2} p_2(y = -z \mid x),$$
$$p(w = w \mid x) = \tfrac{1}{2} p_1(t = w \mid x) + \tfrac{1}{2} p_2(t = -w \mid x),$$
for any $z \in \mathcal{Z}$ and any $w \in \{-1, 1\}$, where $\mathcal{Z} := \{s y \mid y \in \mathcal{Y}, s \in \{1, -1\}\}$.

Lemma 2.
For every $x$ such that $p_1(t = 1 \mid x) \neq p_2(t = 1 \mid x)$, $u(x)$ can be expressed as
$$u(x) = \frac{2 E[z \mid x]}{E[w \mid x]},$$
where $E[z \mid x]$ and $E[w \mid x]$ are the conditional expectations of $z$ given $x$ over $p(z \mid x)$ and of $w$ given $x$ over $p(w \mid x)$, respectively. A proof can be found in Appendix D in the supplementary material.

Let $w_i^{(k)} := (-1)^{k-1} t_i^{(k)}$ and $z_i^{(k)} := (-1)^{k-1} y_i^{(k)}$. Assuming that $p_1(x) = p_2(x) =: p(x)$, $n_1 = n_2$, and $\tilde{n}_1 = \tilde{n}_2$ for simplicity, $\{(\tilde{x}_i, w_i)\}_{i=1}^{\tilde{n}} := \{(\tilde{x}_i^{(k)}, w_i^{(k)})\}_{k = 1, 2;\, i = 1, \dots, \tilde{n}_k}$ and $\{(x_i, z_i)\}_{i=1}^{n} := \{(x_i^{(k)}, z_i^{(k)})\}_{k = 1, 2;\, i = 1, \dots, n_k}$ can be seen as samples drawn from $p(x, z) := p(z \mid x) p(x)$ and $p(x, w) := p(w \mid x) p(x)$, respectively, where $n = n_1 + n_2$ and $\tilde{n} = \tilde{n}_1 + \tilde{n}_2$. The more general cases where $p_1(x) \neq p_2(x)$, $n_1 \neq n_2$, or $\tilde{n}_1 \neq \tilde{n}_2$ are discussed in Appendix I in the supplementary material.

Theorem 1. Assume that $\mu_w, \mu_z \in L^2(p)$ and $\mu_w(x) \neq 0$ for every $x$ such that $p(x) > 0$, where $L^2(p) := \{f : \mathcal{X} \to \mathbb{R} \mid E_{x \sim p(x)}[f(x)^2] < \infty\}$. The individual uplift $u(x)$ equals the solution to the following least-squares problem:
$$u(x) = \operatorname*{argmin}_{f \in L^2(p)} E[(\mu_w(x) f(x) - 2\mu_z(x))^2], \tag{7}$$
where $E$ denotes the expectation over $p(x)$, $\mu_w(x) := E[w \mid x]$, and $\mu_z(x) := E[z \mid x]$.

Theorem 1 follows from Lemma 2. Note that $p_1(t \mid x) \neq p_2(t \mid x)$ in Eq. (2) implies $\mu_w(x) \neq 0$. In what follows, we develop a method that directly estimates $u(x)$ by solving Eq. (7). A challenge here is that it is not straightforward to evaluate the objective functional since it involves the unknown functions $\mu_w$ and $\mu_z$. Our idea is to transform the objective functional in Eq.
(7) into another form in which $\mu_w(x)$ and $\mu_z(x)$ appear separately and linearly inside the expectation operator, so that we can approximate them using our separately labeled samples.

For any function $g \in L^2(p)$ and any $x \in \mathcal{X}$, expanding the left-hand side of the inequality $E[(\mu_w(x) f(x) - 2\mu_z(x) - g(x))^2] \ge 0$, we have
$$E[(\mu_w(x) f(x) - 2\mu_z(x))^2] \ge 2 E[(\mu_w(x) f(x) - 2\mu_z(x)) g(x)] - E[g(x)^2] =: J(f, g). \tag{8}$$
The equality is attained when $g(x) = \mu_w(x) f(x) - 2\mu_z(x)$ for any fixed $f$. This means that the objective functional of Eq. (7) can be calculated by maximizing $J(f, g)$ with respect to $g$. Hence,
$$u(x) = \operatorname*{argmin}_{f \in L^2(p)} \max_{g \in L^2(p)} J(f, g). \tag{9}$$
Furthermore, $\mu_w$ and $\mu_z$ are separately and linearly included in $J(f, g)$, which makes it possible to write it in terms of $z$ and $w$ as
$$J(f, g) = 2 E[w f(x) g(x)] - 4 E[z g(x)] - E[g(x)^2]. \tag{10}$$
Unlike the original objective functional in Eq. (7), $J(f, g)$ can be easily estimated using sample averages by
$$\widehat{J}(f, g) = \frac{2}{\tilde{n}} \sum_{i=1}^{\tilde{n}} w_i f(\tilde{x}_i) g(\tilde{x}_i) - \frac{4}{n} \sum_{i=1}^{n} z_i g(x_i) - \frac{1}{2n} \sum_{i=1}^{n} g(x_i)^2 - \frac{1}{2\tilde{n}} \sum_{i=1}^{\tilde{n}} g(\tilde{x}_i)^2. \tag{11}$$
In practice, we solve the following regularized empirical optimization problem:
$$\min_{f \in \mathcal{F}} \max_{g \in \mathcal{G}} \widehat{J}(f, g) + \Omega(f, g), \tag{12}$$
where $\mathcal{F}$ and $\mathcal{G}$ are models for $f$ and $g$, respectively, and $\Omega(f, g)$ is some regularizer. An advantage of the proposed framework is that it is model-independent: any models can be trained by optimizing the above objective.

The function $g$ can be interpreted as a critic of $f$, as follows. Minimizing Eq. (10) with respect to $f$ for a fixed $g$ is equivalent to minimizing $E[g(x)\{\mu_w(x) f(x) - 2\mu_z(x)\}]$.
$g(x)$ serves as a good critic of $f(x)$ when it makes the cost $g(x)\{\mu_w(x) f(x) - 2\mu_z(x)\}$ larger for $x$ at which $f$ makes a larger error $|\mu_w(x) f(x) - 2\mu_z(x)|$. In particular, $g$ maximizes the objective above when $g(x) = \mu_w(x) f(x) - 2\mu_z(x)$ for any $f$, and the maximum coincides with the least-squares objective in Eq. (7).

Suppose that $\mathcal{F}$ and $\mathcal{G}$ are linear-in-parameter models: $\mathcal{F} = \{f_\alpha : x \mapsto \alpha^\top \phi(x) \mid \alpha \in \mathbb{R}^{b_f}\}$ and $\mathcal{G} = \{g_\beta : x \mapsto \beta^\top \psi(x) \mid \beta \in \mathbb{R}^{b_g}\}$, where $\phi$ and $\psi$ are $b_f$-dimensional and $b_g$-dimensional vectors of basis functions in $L^2(p)$. Then,
$$\widehat{J}(f_\alpha, g_\beta) = 2\alpha^\top A \beta - 4 b^\top \beta - \beta^\top C \beta,$$
where
$$A := \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} w_i \phi(\tilde{x}_i) \psi(\tilde{x}_i)^\top, \quad b := \frac{1}{n} \sum_{i=1}^{n} z_i \psi(x_i), \quad C := \frac{1}{2n} \sum_{i=1}^{n} \psi(x_i) \psi(x_i)^\top + \frac{1}{2\tilde{n}} \sum_{i=1}^{\tilde{n}} \psi(\tilde{x}_i) \psi(\tilde{x}_i)^\top.$$
With $\ell_2$-regularizers $\Omega(f, g) = \lambda_f \alpha^\top \alpha - \lambda_g \beta^\top \beta$ for some positive constants $\lambda_f$ and $\lambda_g$, the solution to the inner maximization problem can be obtained in the following analytical form:
$$\widehat{\beta}_\alpha := \operatorname*{argmax}_{\beta} \widehat{J}(f_\alpha, g_\beta) = \widetilde{C}^{-1}(A^\top \alpha - 2b),$$
where $\widetilde{C} := C + \lambda_g I_{b_g}$ and $I_{b_g}$ is the $b_g$-by-$b_g$ identity matrix. Then, we can obtain the solution to Eq. (12) analytically as
$$\widehat{\alpha} := \operatorname*{argmin}_{\alpha} \widehat{J}(f_\alpha, g_{\widehat{\beta}_\alpha}) = 2(A \widetilde{C}^{-1} A^\top + \lambda_f I_{b_f})^{-1} A \widetilde{C}^{-1} b.$$
Finally, from Eq. (7), our estimate of $u(x)$ is given as $\widehat{\alpha}^\top \phi(x)$.

Remark on model selection:
Model selection for $\mathcal{F}$ and $\mathcal{G}$ is not straightforward since the test performance measure cannot be directly evaluated with (held-out) training data of our problem. Instead, we may evaluate the value of $\widehat{J}(\widehat{f}, \widehat{g})$, where $(\widehat{f}, \widehat{g}) \in \mathcal{F} \times \mathcal{G}$ is the optimal solution pair to $\min_{f \in \mathcal{F}} \max_{g \in \mathcal{G}} \widehat{J}(f, g)$. However, it is still nontrivial to tell whether the objective value is small because the solution is good in terms of the outer minimization, or because it is poor in terms of the inner maximization. We leave this issue for future work.

A theoretically appealing property of the proposed method is that its objective consists of simple sample averages. This enables us to establish a generalization error bound in terms of the Rademacher complexity [15, 22]. Denote $\varepsilon_{\mathcal{G}}(f) := \sup_{g \in L^2(p)} J(f, g) - \sup_{g \in \mathcal{G}} J(f, g)$. Also, let $\mathfrak{R}^N_q(\mathcal{H})$ denote the Rademacher complexity of a set of functions $\mathcal{H}$ over $N$ random variables following probability density $q$ (refer to Appendix E for the definition). Proofs of the following theorems and corollary can be found in Appendices E, F, and G in the supplementary material.

Theorem 2.
Assume that $n_1 = n_2$, $\tilde{n}_1 = \tilde{n}_2$, $p_1(x) = p_2(x)$, $W := \inf_{x \in \mathcal{X}} |\mu_w(x)| > 0$, $M_{\mathcal{Z}} := \sup_{z \in \mathcal{Z}} |z| < \infty$, $M_{\mathcal{F}} := \sup_{f \in \mathcal{F}, x \in \mathcal{X}} |f(x)| < \infty$, and $M_{\mathcal{G}} := \sup_{g \in \mathcal{G}, x \in \mathcal{X}} |g(x)| < \infty$. Then, the following holds with probability at least $1 - \delta$ for every $f \in \mathcal{F}$:
$$E_{x \sim p(x)}[(f(x) - u(x))^2] \le \frac{1}{W^2}\left[\sup_{g \in \mathcal{G}} \widehat{J}(f, g) + \mathfrak{R}^{n, \tilde{n}}_{\mathcal{F}, \mathcal{G}} + \left(\frac{M_z}{\sqrt{n}} + \frac{M_w}{\sqrt{\tilde{n}}}\right)\sqrt{\log\frac{2}{\delta}} + \varepsilon_{\mathcal{G}}(f)\right],$$
where $M_z := 4 M_Y M_{\mathcal{G}} + M_{\mathcal{G}}^2/2$, $M_w := 2 M_{\mathcal{F}} M_{\mathcal{G}} + M_{\mathcal{G}}^2/2$, and
$$\mathfrak{R}^{n, \tilde{n}}_{\mathcal{F}, \mathcal{G}} := 2(M_{\mathcal{F}} + 4 M_{\mathcal{Z}})\, \mathfrak{R}^{n}_{p(x, z)}(\mathcal{G}) + 2(2 M_{\mathcal{F}} + M_{\mathcal{G}})\, \mathfrak{R}^{\tilde{n}}_{p(x, w)}(\mathcal{F}) + 2(M_{\mathcal{F}} + M_{\mathcal{G}})\, \mathfrak{R}^{\tilde{n}}_{p(x, w)}(\mathcal{G}).$$

In particular, the following bound holds for the linear-in-parameter models.
Corollary 1.
Let $\mathcal{F} = \{x \mapsto \alpha^\top \phi(x) \mid \|\alpha\| \le \Lambda_{\mathcal{F}}\}$ and $\mathcal{G} = \{x \mapsto \beta^\top \psi(x) \mid \|\beta\| \le \Lambda_{\mathcal{G}}\}$. Assume that $r_{\mathcal{F}} := \sup_{x \in \mathcal{X}} \|\phi(x)\| < \infty$ and $r_{\mathcal{G}} := \sup_{x \in \mathcal{X}} \|\psi(x)\| < \infty$, where $\|\cdot\|$ is the $L^2$-norm. Under the assumptions of Theorem 2, it holds with probability at least $1 - \delta$ that for every $f \in \mathcal{F}$,
$$E_{x \sim p(x)}[(f(x) - u(x))^2] \le \frac{1}{W^2}\left[\sup_{g \in \mathcal{G}} \widehat{J}(f, g) + \frac{C_z\sqrt{\log\frac{2}{\delta}} + D_z}{\sqrt{n}} + \frac{C_w\sqrt{\log\frac{2}{\delta}} + D_w}{\sqrt{\tilde{n}}} + \varepsilon_{\mathcal{G}}(f)\right],$$
where $C_z$, $C_w$, $D_z$, and $D_w$ are constants depending only on $r_{\mathcal{F}}$, $r_{\mathcal{G}}$, $\Lambda_{\mathcal{F}}$, $\Lambda_{\mathcal{G}}$, and $M_Y$.

Theorem 2 and Corollary 1 imply that minimizing $\sup_{g \in \mathcal{G}} \widehat{J}(f, g)$, as the proposed method does, amounts to minimizing an upper bound of the mean squared error. In fact, for the linear-in-parameter models, it can be shown that the mean squared error of the proposed estimator is upper bounded by $O(1/\sqrt{n} + 1/\sqrt{\tilde{n}})$ plus some model misspecification error with high probability, as follows.

Theorem 3 (Informal). Let $\widehat{f} \in \mathcal{F}$ be any approximate solution to $\inf_{f \in \mathcal{F}} \sup_{g \in \mathcal{G}} \widehat{J}(f, g)$ with sufficient precision. Under the assumptions of Corollary 1, it holds with probability at least $1 - \delta$ that
$$E_{x \sim p(x)}[(\widehat{f}(x) - u(x))^2] \le O\left(\left(\frac{1}{\sqrt{n}} + \frac{1}{\sqrt{\tilde{n}}}\right)\log\frac{1}{\delta}\right) + \frac{2\varepsilon_{\mathcal{F}\mathcal{G}} + \varepsilon_{\mathcal{F}}}{W^2}, \tag{13}$$
where $\varepsilon_{\mathcal{F}\mathcal{G}} := \sup_{f \in \mathcal{F}} \varepsilon_{\mathcal{G}}(f)$ and $\varepsilon_{\mathcal{F}} := \inf_{f \in \mathcal{F}} J(f)$. A more formal version of Theorem 3 can be found in Appendix G.
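For concreteness, the analytical solution for the linear-in-parameter models derived above can be sketched in a few lines of NumPy. This is a minimal sketch under the simplifying assumption (not required by the method) that $f$ and $g$ share one basis; the function name is ours, not the paper's.

```python
import numpy as np

def minmax_uplift_fit(phi_w, w, phi_z, z, lam_f=1e-3, lam_g=1e-3):
    """Closed-form solution of the regularized min-max problem (Eq. 12)
    for linear-in-parameter models f(x) = alpha^T phi(x), g(x) = beta^T psi(x),
    here with phi = psi for simplicity.

    phi_w: (n_w, b) basis features of the samples carrying w labels
    w:     (n_w,)   labels w_i = (-1)^(k-1) t_i^(k)
    phi_z: (n_z, b) basis features of the samples carrying z labels
    z:     (n_z,)   labels z_i = (-1)^(k-1) y_i^(k)
    Returns alpha; predict the uplift with phi(x) @ alpha.
    """
    n_w, b = phi_w.shape
    n_z = phi_z.shape[0]
    A = (phi_w * w[:, None]).T @ phi_w / n_w          # (1/n~) sum w_i phi psi^T
    bvec = phi_z.T @ z / n_z                          # (1/n)  sum z_i psi
    C = phi_z.T @ phi_z / (2 * n_z) + phi_w.T @ phi_w / (2 * n_w)
    Cinv = np.linalg.inv(C + lam_g * np.eye(b))       # C-tilde inverse
    # alpha-hat = 2 (A C~^-1 A^T + lam_f I)^-1 A C~^-1 b
    return 2 * np.linalg.solve(A @ Cinv @ A.T + lam_f * np.eye(b),
                               A @ Cinv @ bvec)
```

A quick sanity check: when the two policies are deterministic and opposite, $w \equiv 1$ and the fitted $f$ should recover a constant uplift.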
Our framework can be extended to more general loss functions:
$$\inf_{f \in L^2(p)} E[\ell(\mu_w(x) f(x), \mu_z(x))], \tag{14}$$
where $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a loss function that is lower semi-continuous and convex with respect to both the first and the second argument; here, a function $\varphi : \mathbb{R} \to \mathbb{R}$ is lower semi-continuous if $\liminf_{y \to y_0} \varphi(y) = \varphi(y_0)$ for every $y_0 \in \mathbb{R}$ [30]. As with the squared loss, a major difficulty in solving this optimization problem is that the operand of the expectation depends nonlinearly on both $\mu_w(x)$ and $\mu_z(x)$ at the same time. Below, we show a way to transform the objective functional into a form that can be easily approximated using separately labeled samples.

From the assumptions on $\ell$, we have $\ell(y, y') = \sup_{z \in \mathbb{R}} [yz - \ell^*(z, y')]$, where $\ell^*(\cdot, y')$ is the convex conjugate of the function $y \mapsto \ell(y, y')$, defined for any $y' \in \mathbb{R}$ by $z \mapsto \ell^*(z, y') = \sup_{y \in \mathbb{R}} [yz - \ell(y, y')]$ (see Rockafellar [30]). Hence,
$$E[\ell(\mu_w(x) f(x), \mu_z(x))] = \sup_{g \in L^2(p)} E[\mu_w(x) f(x) g(x) - \ell^*(g(x), \mu_z(x))].$$
Similarly, we obtain
$$E[\ell^*(g(x), \mu_z(x))] = \sup_{h \in L^2(p)} E[\mu_z(x) h(x)] - E[\ell^{**}(g(x), h(x))],$$
where $\ell^{**}(y, \cdot)$ is the convex conjugate of the function $y' \mapsto \ell^*(y, y')$, defined for any $y, z' \in \mathbb{R}$ by $\ell^{**}(y, z') := \sup_{y' \in \mathbb{R}} [y' z' - \ell^*(y, y')]$. Thus, Eq. (14) can be rewritten as
$$\inf_{f \in L^2(p)} \sup_{g \in L^2(p)} \inf_{h \in L^2(p)} K(f, g, h),$$
where $K(f, g, h) := E[\mu_w(x) f(x) g(x)] - E[\mu_z(x) h(x)] + E[\ell^{**}(g(x), h(x))]$. Since $\mu_w$ and $\mu_z$ appear separately and linearly, $K(f, g, h)$ can be approximated by sample averages using separately labeled samples.
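As a worked illustration (ours, not spelled out in the text), instantiating this conjugate machinery with the squared loss $\ell(y, y') = (y - y')^2$ recovers a least-squares-type objective:

```latex
% convex conjugate in the first argument:
\ell^*(z, y') = \sup_{y \in \mathbb{R}} \left[ yz - (y - y')^2 \right] = y'z + \tfrac{z^2}{4},
% conjugate of y' \mapsto \ell^*(z, y'), which is linear in y':
\ell^{**}(z, z') = \sup_{y' \in \mathbb{R}} \left[ y'z' - y'z - \tfrac{z^2}{4} \right]
  = \begin{cases} -\tfrac{z^2}{4} & (z' = z), \\ +\infty & (z' \neq z), \end{cases}
% so the inner infimum over h forces h = g, and
\sup_{g} \inf_{h} K(f, g, h)
  = \sup_{g} E\!\left[ \mu_w(x) f(x) g(x) - \mu_z(x) g(x) - \tfrac{g(x)^2}{4} \right]
  = E\!\left[ \left( \mu_w(x) f(x) - \mu_z(x) \right)^2 \right],
% with the maximizing critic g(x) = 2\left( \mu_w(x) f(x) - \mu_z(x) \right).
```

The structure mirrors Eqs. (8)-(10): a quadratic penalty on the critic $g$ and linear terms in which $\mu_w$ and $\mu_z$ can be replaced by sample averages over $w$ and $z$.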
In this section, we test the proposed method and compare it with baselines.
We use the following data sets for experiments.
Synthetic data:
Features $x$ are drawn from the two-dimensional Gaussian distribution with mean zero and covariance $I$. We set $p(y \mid x, t)$ as the following logistic models: $p(y \mid x, t) = 1 / (1 + \exp(-y\, a_t^\top x))$, where $a_{-1} = (10, \dots)^\top$ and $a_1 = (10, -\dots)^\top$. We also use logistic models for $p_k(t \mid x)$: $p_1(t \mid x) = 1 / (1 + \exp(-t x_1))$ and $p_2(t \mid x) = 1 / (1 + \exp(-t\{x_1 + b\}))$, where $b$ is varied over 25 equally spaced points in $[0, \dots]$. We investigate how the performance changes when the difference between $p_1(t \mid x)$ and $p_2(t \mid x)$ varies.

Email data:
This data set consists of data collected in an email advertisement campaign for promoting customers to visit a website of a store [8, 27]. Outcomes are whether customers visited the website or not. We use … and … randomly sub-sampled data points for training and evaluation, respectively.

($\liminf_{y \to y_0} \varphi(y) := \lim_{\delta \searrow 0} \inf_{|y - y_0| \le \delta} \varphi(y)$.)

Jobs data: This data set consists of randomized experimental data obtained from a job training program called the National Supported Work Demonstration [17], available at http://users.nber.org/~rdehejia/data/nswdata2.html. There are 9 features, and outcomes are income levels after the training program. The sample sizes are … for the treatment group and … for the control group. We use … randomly sub-sampled data points for training and … for evaluation.

Criteo data:
This data set consists of banner advertisement log data collected by Criteo [18], available at …. The task is to select a product to be displayed in a given banner so that the click rate will be maximized. We only use records for banners with only one advertisement slot. Each display banner has 10 features, and each product has 35 features. We take the 12th feature of a product as a treatment variable merely because it is a well-balanced binary variable. The outcome is whether the displayed advertisement was clicked. We treat the data set as the population although it is biased from the actual population, since non-clicked impressions were randomly sub-sampled down to … to reduce the data set size. We made two subsets with different treatment policies by appropriately sub-sampling according to the predefined treatment policies (see Appendix L in the supplementary material). We set $p_k(t \mid x)$ as $p_1(t \mid x) = 1 / (1 + \exp(-t\, \mathbf{1}^\top x))$ and $p_2(t \mid x) = 1 / (1 + \exp(t\, \mathbf{1}^\top x))$, where $\mathbf{1} := (1, \dots, 1)^\top$.

We conduct experiments under the following settings.
Methods compared:
We compare the proposed method with baselines that separately estimate the four conditional expectations in Eq. (6). In the case of binary outcomes, we use a logistic-regression-based method (denoted by FourLogistic) and a neural-network-based method trained with the softmax cross-entropy loss (denoted by FourNNC). In the case of real-valued outcomes, we use a ridge-regression-based method (denoted by FourRidge) and a neural-network-based method trained with the squared loss (denoted by FourNNR). The neural networks are fully connected ones with two hidden layers, each with 10 hidden units. For the proposed method, we use the linear-in-parameter models with Gaussian basis functions centered at randomly sub-sampled training data points (see Appendix K for more details).
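The baseline strategy of separately estimating the four conditional expectations in Eq. (6) can be sketched as follows. To keep the sketch self-contained, all four fits use plain ridge regression rather than the logistic or neural-network regressors described above, and the function names are ours.

```python
import numpy as np

def ridge_fit(x, y, lam=1e-3):
    """Linear ridge regression with an intercept; returns a predictor."""
    X = np.hstack([x, np.ones((len(x), 1))])
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return lambda xq: np.hstack([xq, np.ones((len(xq), 1))]) @ w

def plug_in_uplift(x1, y1, x2, y2, xt1, t1, xt2, t2, x_test, eps=1e-12):
    """Plug-in estimate of Eq. (6):
        u(x) = 2 (E_1[y|x] - E_2[y|x]) / (E_1[t|x] - E_2[t|x]),
    with each conditional expectation fit by a separate regression.
    Unreliable when p_1(t|x) ~ p_2(t|x): the denominator approaches zero."""
    num = ridge_fit(x1, y1)(x_test) - ridge_fit(x2, y2)(x_test)
    den = ridge_fit(xt1, t1)(x_test) - ridge_fit(xt2, t2)(x_test)
    return 2 * num / np.where(np.abs(den) < eps, eps, den)
```
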
Performance evaluation:
We evaluate trained uplift models by the area under the uplift curve (AUUC) estimated on test samples with joint labels, as well as by uplift curves [26]. The uplift curve of an estimated individual uplift is the trajectory of the average uplift when individuals are gradually moved from the control group to the treated group in descending order of the ranking given by the estimated individual uplift. These quantities can be estimated when the data come from randomized experiments. The Criteo data are not randomized experiment data, unlike the other data sets, but accurately logged propensity scores are available. In this case, uplift curves and AUUCs can be estimated using inverse propensity scoring [3, 20]. We conduct … trials of each experiment with different random seeds.

The results on the synthetic data are summarized in Figure 1. From the plots, we can see that all methods perform relatively well in terms of AUUCs when the policies are distant from each other (i.e., $b$ is larger). However, the performance of the baseline methods immediately declines as the treatment policies get closer to each other (i.e., $b$ is smaller). In contrast, the proposed method maintains its performance relatively longer, until $b$ reaches the point around …. Note that the two policies would be identical when $b = 0$, which makes it impossible to identify the individual uplift from their samples by any method, since the system in Eq. (5) degenerates. Figure 2 highlights their performance in terms of the squared error. For FourNNC, test points with a small policy difference $|p_1(t = 1 \mid x) - p_2(t = 1 \mid x)|$ (colored darker) tend to have very large estimation errors. On the other hand, the proposed method has relatively small errors even for such points. Figure 3 shows results on real data sets. The proposed method and the baseline method with logistic regressors both performed better than the baseline method with neural nets on the Email data set (Figure 3a).
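The uplift-curve evaluation described above can be sketched as follows. This is one common convention (rank by the estimated uplift, then estimate the average uplift of treating exactly the top fraction from randomized test data); the paper's exact estimator may differ in normalization.

```python
import numpy as np

def uplift_curve(scores, t, y, n_points=20):
    """Uplift curve from randomized test data.

    scores: (n,) estimated individual uplifts (ranking scores)
    t:      (n,) randomized treatment assignments in {-1, +1}
    y:      (n,) observed outcomes
    Returns fractions q of treated individuals and the estimated average
    uplift of the policy that treats the top-q fraction by score.
    """
    order = np.argsort(-scores)          # descending by estimated uplift
    t, y = t[order], y[order]
    n = len(y)
    qs, curve = [], []
    for k in np.linspace(1, n, n_points, dtype=int):
        top_t, top_y = t[:k], y[:k]
        treated, control = top_y[top_t == 1], top_y[top_t == -1]
        if len(treated) == 0 or len(control) == 0:
            continue                     # uplift not identifiable at this cut
        qs.append(k / n)
        # average uplift if exactly these k individuals were treated
        curve.append((treated.mean() - control.mean()) * k / n)
    return np.array(qs), np.array(curve)
```
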
The instability of the performance of FourLogistic can be explained as follows. FourLogistic uses linear models, whose expressive power is limited. The resulting estimator has small variance but potentially large bias. Since different $b$ induce different $u(x)$, the bias depends on $b$. For this reason, the method works well for some $b$ but poorly for others.
Figure 1: Results on the synthetic data. The plot shows the average AUUCs obtained by the proposed method and the baseline methods for different $b$. $p_1(t \mid x)$ and $p_2(t \mid x)$ are closer to each other when $b$ is smaller.

Figure 2 ((a) Baseline (FourLogistic); (b) Baseline (FourNNC); (c) Proposed (MinMaxGau)): The plots show the squared errors of the estimated individual uplifts on the synthetic data with $b = 1$. Each point is darker-colored when $|p_1(t = 1 \mid x) - p_2(t = 1 \mid x)|$ is smaller, and lighter-colored otherwise.
(a) The Email data. (b) The Jobs data. (c) The Criteo data.
Figure 3: Average uplifts as well as their standard errors on the real-world data sets.

On the Jobs data set, the proposed method again performed better than the baseline methods with neural networks. For the Criteo data set, the proposed method outperformed the baseline methods (Figure 3c). Overall, we confirmed the superiority of the proposed method both on synthetic and real data sets.
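The AUUC, used as the main performance measure throughout, can also be estimated directly from its pairwise definition $\mathrm{AUUC}(f) = E[1[f(x) \le f(x')]\, u(x')]$ once individual uplifts (or estimates of them) are available on a test sample; a minimal sketch:

```python
import numpy as np

def auuc(scores, uplift):
    """Pairwise estimate of AUUC(f) = E[ 1[f(x) <= f(x')] u(x') ]
    over pairs (x, x') drawn from the test sample.

    scores: (n,) ranking scores f(x_i)
    uplift: (n,) individual uplifts u(x_i), or estimates of them
    """
    n = len(scores)
    # leq[i, j] = 1 if f(x_i) <= f(x_j)
    leq = (scores[:, None] <= scores[None, :]).astype(float)
    # average 1[f(x) <= f(x')] * u(x') over all ordered pairs
    return (leq * uplift[None, :]).sum() / (n * n)
```

The quadratic pairwise form is shown for clarity; for large test sets one would sort once and accumulate cumulative sums instead.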
We proposed a theoretically guaranteed and practically useful method for uplift modeling, or individual treatment effect estimation, under the presence of systematically missing labels. The proposed method showed promising results in our experiments on synthetic and real data sets. The proposed framework is model-independent: any models can be used to approximate the individual uplift, including ones tailored to specific problems and complex models such as neural networks. On the other hand, model selection may be a challenging problem due to the min-max structure. Addressing this issue would be an important research direction for further expanding the applicability and improving the performance of the proposed method.

Acknowledgments
We are grateful to Marthinus Christoffel du Plessis and Takeshi Teshima for their inspiring suggestions and for the meaningful discussions. We would like to thank the anonymous reviewers for their helpful comments. IY was supported by JSPS KAKENHI 16J07970. JA and FY would like to thank Adway for its support. MS was supported by the International Research Center for Neurointelligence (WPI-IRCN) at The University of Tokyo Institutes for Advanced Study.
References

[1] E. Abrahams and M. Silver. The case for personalized medicine. Journal of Diabetes Science and Technology, 3(4):680–684, July 2009.
[2] S. Athey, J. Tibshirani, and S. Wager. Generalized random forests. arXiv:1610.01271 [econ, stat], October 2016.
[3] P. C. Austin. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3):399–424, 2011.
[4] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., 2006.
[5] P. Gutierrez and J. Y. Gérardy. Causal inference and uplift modelling: A review of the literature. In Proceedings of The 3rd International Conference on Predictive Applications and APIs, volume 67 of Proceedings of Machine Learning Research, pages 1–13. PMLR, October 2017.
[6] J. Hartford, G. Lewis, K. Leyton-Brown, and M. Taddy. Deep IV: A flexible approach for counterfactual prediction. In International Conference on Machine Learning, pages 1414–1423, July 2017.
[7] J. L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, January 2011.
[8] K. Hillstrom. The MineThatData e-mail analytics and data mining challenge, 2008. https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html.
[9] K. Imai and M. Ratkovic. Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics, 7(1):443–470, March 2013.
[10] G. W. Imbens. Instrumental variables: An econometrician's perspective. Statistical Science, 29(3):323–358, August 2014.
[11] M. Jaskowski and S. Jaroszewicz. Uplift modeling for clinical trial data. In ICML Workshop on Clinical Data Analysis, 2012.
[12] F. D. Johansson, U. Shalit, and D. Sontag. Learning representations for counterfactual inference. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 3020–3029. JMLR, 2016.
[13] S. H. Katsanis, G. Javitt, and K. Hudson. A case study of personalized medicine. Science, 320(5872):53–54, April 2008.
[14] R. Kohavi, R. Longbotham, D. Sommerfield, and R. M. Henne. Controlled experiments on the web: Survey and practical guide. Data Mining and Knowledge Discovery, 18(1):140–181, February 2009.
[15] V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, July 2001.
[16] S. R. Künzel, J. S. Sekhon, P. J. Bickel, and B. Yu. Meta-learners for estimating heterogeneous treatment effects using machine learning. arXiv:1706.03461 [math, stat], June 2017.
[17] R. J. LaLonde. Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pages 604–620, 1986.
[18] D. Lefortier, A. Swaminathan, X. Gu, T. Joachims, and M. de Rijke. Large-scale validation of counterfactual learning methods: A test-bed. In NIPS Workshop on "Inference and Learning of Hypothetical and Counterfactual Interventions in Complex Systems", 2016.
[19] G. Lewis and V. Syrgkanis. Adversarial generalized method of moments. arXiv:1803.07164 [cs, econ, math, stat], March 2018.
[20] L. Li, W. Chu, J. Langford, and X. Wang. An unbiased offline evaluation of contextual bandit algorithms with generalized linear models. In Proceedings of the Workshop on On-line Trading of Exploration and Exploitation 2, volume 26 of Proceedings of Machine Learning Research, pages 19–36. PMLR, July 2012.
[21] S. Liu, A. Takeda, T. Suzuki, and K. Fukumizu. Trimmed density ratio estimation. In Advances in Neural Information Processing Systems 30, pages 4518–4528. Curran Associates, Inc., 2017.
[22] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
[23] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, November 2010.
[24] J. Pearl. Causality. Cambridge University Press, 2009.
[25] N. Radcliffe. Using control groups to target on predicted lift: Building and assessing uplift models. Direct Marketing Analytics Journal, pages 14–21, 2007.
[26] N. Radcliffe and P. Surry. Differential response analysis: Modeling true responses by isolating the effect of a single action. Credit Scoring and Credit Control IV, 1999.
[27] N. J. Radcliffe. Hillstrom's MineThatData email analytics challenge: An approach using uplift modelling. Technical report, Stochastic Solutions Limited, 2008.
[28] N. J. Radcliffe and P. D. Surry. Real-world uplift modelling with significance-based uplift trees. White Paper TR-2011-1, Stochastic Solutions, 2011.
[29] R. Renault. Chapter 4 - Advertising in markets. In S. P. Anderson, J. Waldfogel, and D. Strömberg, editors, Handbook of Media Economics, volume 1, pages 121–204. North-Holland, January 2015.
[30] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
[31] D. B. Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331, 2005.
[32] P. Rzepakowski and S. Jaroszewicz. Decision trees for uplift modeling with single and multiple treatments. Knowledge and Information Systems, 32(2):303–327, 2012.
[33] U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: Generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3076–3085. PMLR, August 2017.
[34] M. Sugiyama, I. Takeuchi, T. Suzuki, T. Kanamori, H. Hachiya, and D. Okanohara. Conditional density estimation via least-squares density ratio estimation. In International Conference on Artificial Intelligence and Statistics, pages 781–788, 2010.
[35] M. Sugiyama, T. Suzuki, and T. Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.
[36] S. Tufféry. Data Mining and Statistics for Decision Making, volume 2. Wiley, 2011.
[37] S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests. arXiv:1510.04342 [math, stat], October 2015.
[38] M. Yamada, T. Suzuki, T. Kanamori, H. Hachiya, and M. Sugiyama. Relative density-ratio estimation for robust distribution comparison. Neural Computation, 25(5):1324–1370, 2013.

A Average Uplift in Terms of the Individual Uplift

U(π) = ∫∫ Σ_{t=−1,1} y p(y | t, x) π(t | x) p(x) dy dx − ∫∫ Σ_{t=−1,1} y p(y | t, x) 1[t = −1] p(x) dy dx
= ∫∫ y [p(y | t = 1, x) π(t = 1 | x) − p(y | t = −1, x) π(t = 1 | x)] p(x) dy dx
= ∫∫ y [p(y | t = 1, x) − p(y | t = −1, x)] π(t = 1 | x) p(x) dy dx
= ∫ u(x) π(t = 1 | x) p(x) dx.    (15)

B Area Under the Uplift Curve and Ranking
Define the following symbols:
• C_α := Pr[f(x) < α],
• U(α; f) := ∫ u(x) 1[α ≤ f(x)] p(x) dx,
• Rank(f) := E[1[f(x′) ≤ f(x)][u(x′) − u(x)]],
• AUUC(f) := ∫ U(α; f) dC_α.

Then, we have

AUUC(f) = ∫_{−∞}^{∞} U(α; f) (dC_α/dα) dα = ∫_{−∞}^{∞} U(α; f) p_{f(x)}(α) dα = ∫_{R^d} U(f(x); f) p(x) dx
= ∫∫ 1[f(x) ≤ f(x′)] u(x′) p(x′) dx′ p(x) dx = E[1[f(x) ≤ f(x′)] u(x′)]
(= E[1[f(x) ≤ f(x′)][y₊ − y₋]]),

where y₊ ∼ p(y | x′, t = 1) and y₋ ∼ p(y | x′, t = −1).

Assuming Pr[f(x′) = f(x)] = 0, we have

Rank(f) = E[1[f(x) ≥ f(x′)][u(x) − u(x′)]]
= E[1[f(x) ≥ f(x′)] u(x)] − E[1[f(x) ≥ f(x′)] u(x′)]
= AUUC(f) − E[(1 − 1[f(x) ≤ f(x′)]) u(x′)]
= 2 AUUC(f) − E[u(x)].

Thus, Rank(f) = 2(AUUC(f) − AUUC(r)), where r: R^d → R is the random ranking scoring function that outputs 1 or −1 with probability 1/2 each for any input x, so that AUUC(r) = E[u(x)]/2. Rank(f) is maximized when f(x) ≤ f(x′) ⇐⇒ u(x) ≤ u(x′) for almost every pair of x ∈ R^d and x′ ∈ R^d. In particular, f = u is a maximizer of the objective.

C Proof of Lemma 1
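Lemma 1, restated and proven below, can be sanity-checked on a toy discrete model before reading the proof (all policy probabilities and conditional means are made-up values):

```python
# Toy check of Lemma 1 at a fixed x with t in {-1, +1}.
# a, b: p_1(t=1|x), p_2(t=1|x); m1, m0: E[y|x,t=1], E[y|x,t=-1].
a, b = 0.7, 0.3
m1, m0 = 2.0, 0.5

u = m1 - m0                      # individual uplift u(x)
Ey1 = a * m1 + (1 - a) * m0      # E_{y ~ p_1(y|x)}[y]
Ey2 = b * m1 + (1 - b) * m0      # E_{y ~ p_2(y|x)}[y]
Et1 = 2 * a - 1                  # E_{t ~ p_1(t|x)}[t]
Et2 = 2 * b - 1                  # E_{t ~ p_2(t|x)}[t]

# Eq. (16): u(x) = 2 (E_1[y] - E_2[y]) / (E_1[t] - E_2[t])
assert abs(u - 2 * (Ey1 - Ey2) / (Et1 - Et2)) < 1e-12
```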
Lemma 1.
For every x such that p_1(t = 1 | x) ≠ p_2(t = 1 | x), u(x) can be expressed as

u(x) = 2 × (E_{y∼p_1(y|x)}[y] − E_{y∼p_2(y|x)}[y]) / (E_{t∼p_1(t|x)}[t] − E_{t∼p_2(t|x)}[t]).    (16)

Proof.

E_{y∼p_1(y|x)}[y] − E_{y∼p_2(y|x)}[y]
= ∫ Σ_{τ=−1,1} y p(y | x, t = τ) p_1(t = τ | x) dy − ∫ Σ_{τ=−1,1} y p(y | x, t = τ) p_2(t = τ | x) dy
= ∫ Σ_{τ=−1,1} y p(y | x, t = τ) (p_1(t = τ | x) − p_2(t = τ | x)) dy
= Σ_{τ=−1,1} E_{y∼p(y|x,t=τ)}[y] (p_1(t = τ | x) − p_2(t = τ | x))
= E_{y∼p(y|x,t=1)}[y] (p_1(t = 1 | x) − p_2(t = 1 | x)) − E_{y∼p(y|x,t=−1)}[y] (p_1(t = 1 | x) − p_2(t = 1 | x))
= u(x) (p_1(t = 1 | x) − p_2(t = 1 | x)),

where we used p_1(t = −1 | x) − p_2(t = −1 | x) = −(p_1(t = 1 | x) − p_2(t = 1 | x)). When p_1(t = 1 | x) ≠ p_2(t = 1 | x), it holds that

u(x) = (E_{y∼p_1(y|x)}[y] − E_{y∼p_2(y|x)}[y]) / (p_1(t = 1 | x) − p_2(t = 1 | x))
= 2 × (E_{y∼p_1(y|x)}[y] − E_{y∼p_2(y|x)}[y]) / (E_{t∼p_1(t|x)}[t] − E_{t∼p_2(t|x)}[t]),

where the last equality uses E_{t∼p_k(t|x)}[t] = 2 p_k(t = 1 | x) − 1.

D Proof of Lemma 2
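Lemma 2, restated and proven below, can likewise be checked by enumeration with binary y and t. The probabilities below are made-up values; z and w follow the constructions p(z = ζ | x) = ½ p_1(y = ζ | x) + ½ p_2(y = −ζ | x) and its analogue for w, i.e., keep the label from policy-1 samples and flip its sign for policy-2 samples:

```python
# Enumeration check of Lemma 2 at a fixed x with y, t in {-1, +1}.
a, b = 0.8, 0.2       # p_1(t=1|x), p_2(t=1|x)        (made-up)
q1, q0 = 0.9, 0.4     # p(y=1|x,t=1), p(y=1|x,t=-1)   (made-up)

u = (2 * q1 - 1) - (2 * q0 - 1)   # E[y|x,t=1] - E[y|x,t=-1]

def p_y(pk):
    """p_k(y=1|x) under a policy with p_k(t=1|x) = pk."""
    return pk * q1 + (1 - pk) * q0

# E[z|x] with p(z=zeta|x) = 0.5*p_1(y=zeta|x) + 0.5*p_2(y=-zeta|x)
p_z1 = 0.5 * p_y(a) + 0.5 * (1 - p_y(b))   # p(z=+1|x)
Ez = p_z1 * 1 + (1 - p_z1) * (-1)
# E[w|x] with p(w=omega|x) = 0.5*p_1(t=omega|x) + 0.5*p_2(t=-omega|x)
p_w1 = 0.5 * a + 0.5 * (1 - b)
Ew = p_w1 * 1 + (1 - p_w1) * (-1)

assert abs(u - 2 * Ez / Ew) < 1e-12   # Lemma 2: u(x) = 2 E[z|x] / E[w|x]
```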
Lemma 2.
For every x such that p_1(t = 1 | x) ≠ p_2(t = 1 | x), u(x) can be expressed as

u(x) = 2 × E[z | x] / E[w | x],

where E[z | x] and E[w | x] are the conditional expectations of z given x over p(z | x) and of w given x over p(w | x), respectively.

Proof. We have

E[z | x] = ∫ ζ [(1/2) p_1(y = ζ | x) + (1/2) p_2(y = −ζ | x)] dζ
= (1/2) ∫ ζ p_1(y = ζ | x) dζ + (1/2) ∫ ζ p_2(y = −ζ | x) dζ
= (1/2) ∫ y p_1(y | x) dy − (1/2) ∫ y p_2(y | x) dy
= (1/2) (E_{y∼p_1(y|x)}[y] − E_{y∼p_2(y|x)}[y]).

Similarly, we obtain E[w | x] = (1/2) (E_{t∼p_1(t|x)}[t] − E_{t∼p_2(t|x)}[t]). Thus,

2 × E[z | x] / E[w | x] = 2 × (E_{y∼p_1(y|x)}[y] − E_{y∼p_2(y|x)}[y]) / (E_{t∼p_1(t|x)}[t] − E_{t∼p_2(t|x)}[t]) = u(x)

by Lemma 1.

E Proof of Theorem 2
We restate Theorem 2 below.
Theorem 2.
Assume that n_1 = n_2, ñ_1 = ñ_2, p_1(x) = p_2(x), W := inf_{x∈X} |μ_w(x)| > 0, M_Z := sup_{z∈Z} |z| < ∞, M_F := sup_{f∈F, x∈X} |f(x)| < ∞, and M_G := sup_{g∈G, x∈X} |g(x)| < ∞. Then, the following holds with probability at least 1 − δ: for every f ∈ F,

E_{x∼p(x)}[(f(x) − u(x))²] ≤ (1/W²) [ sup_{g∈G} Ĵ(f, g) + R^{n,ñ}_{F,G} + (M_z/√n + M_w/√ñ) √(2 log(2/δ)) + ε_G(f) ],

where M_z := 4 M_Y M_G + M_G²/2 (note that M_Z = M_Y for Z = {y, −y | y ∈ Y}), M_w := 2 M_F M_G + M_G²/2, and

R^{n,ñ}_{F,G} := 2(M_F + 4 M_Z) R^n_{p(x,z)}(G) + 2(2 M_F + M_G) R^ñ_{p(x,w)}(F) + 2(M_F + M_G) R^ñ_{p(x,w)}(G).

Define J(f, g) and Ĵ(f, g) as in Section 3.2 and denote ε_G(f) := sup_{g∈L²(p)} J(f, g) − sup_{g∈G} J(f, g).

Definition 1 (Rademacher Complexity). We define the
Rademacher complexity of a function class H over N random variables following the probability distribution q by

R^N_q(H) := E_{V_1, ..., V_N, σ_1, ..., σ_N} [ sup_{h∈H} (1/N) Σ_{i=1}^N σ_i h(V_i) ],

where V_1, ..., V_N ∼ q and σ_1, ..., σ_N are independent, {−1, 1}-valued uniform random variables.

Lemma 3.

Under the assumptions of Theorem 2, with probability at least 1 − δ, it holds for every f ∈ F and g ∈ G that

J(f, g) ≤ Ĵ(f, g) + R_{F,G} + (M_z/√n + M_w/√ñ) √(2 log(2/δ)),

where R_{F,G} is as defined in Eq. (23). To prove Lemma 3, we use the following lemma, which is a slightly modified version of Theorem 3.1 in Mohri et al. [22].
Lemma 4.
Let H be a set of real-valued functions on a measurable space D. Assume that M := sup_{h∈H, v∈D} |h(v)| < ∞. Then, for any D-valued i.i.d. random variables V, V_1, ..., V_N following density q and any δ > 0, it holds with probability at least 1 − δ that for every h ∈ H,

E[h(V)] ≤ (1/N) Σ_{i=1}^N h(V_i) + 2 R^N_q(H) + M √((2/N) log(1/δ)).    (17)

Proof of Lemma 4.

We follow the proof of Theorem 3.1 in Mohri et al. [22], except that we set the constant B_ϕ in Eq. (28) to 2M/N when we apply McDiarmid's inequality (Section M).

Now, we prove Lemma 3.
Proof of Lemma 3.
For any f ∈ F, g ∈ G, x′, x̃′ ∈ X, z′ ∈ Z := {y, −y | y ∈ Y}, and w′ ∈ {−1, 1}, we define h_z and h_w as follows:

h_z(x′, z′; g) := −4 z′ g(x′) − (1/2) g(x′)²,
h_w(x̃′, w′; f, g) := 2 w′ f(x̃′) g(x̃′) − (1/2) g(x̃′)².

Denoting H_z := {(x′, z′) ↦ h_z(x′, z′; g) | g ∈ G}, we have

sup_{h∈H_z, x′∈X, z′∈Z} |h(x′, z′)| ≤ 4 M_Z M_G + (1/2) M_G² = M_z < ∞,

and thus, we can apply Lemma 4 to confirm that with probability at least 1 − δ/2,

E_{(x,z)∼p(x,z)}[h_z(x, z; g)] ≤ (1/n) Σ_{(x_i, z_i)∈S_z} h_z(x_i, z_i; g) + 2 R^n_p(H_z) + M_z √((2/n) log(2/δ)),

where {(x_i, z_i)}_{i=1}^n =: S_z are the samples defined in Section 4.1. Similarly, it holds with probability at least 1 − δ/2 that

E_{(x̃,w)∼p(x,w)}[h_w(x̃, w; f, g)] ≤ (1/ñ) Σ_{(x̃_i, w_i)∈S_w} h_w(x̃_i, w_i; f, g) + 2 R^ñ_p(H_w) + M_w √((2/ñ) log(2/δ)),

where H_w := {(x̃′, w′) ↦ h_w(x̃′, w′; f, g) | f ∈ F, g ∈ G}, M_w := 2 M_F M_G + (1/2) M_G², and {(x̃_i, w_i)}_{i=1}^ñ =: S_w are the samples defined in Section 4.1.

By the union bound, we have the following with probability at least 1 − δ:

E_{(x,z)∼p(x,z)}[h_z(x, z; g)] + E_{(x̃,w)∼p(x,w)}[h_w(x̃, w; f, g)]    (18)
≤ (1/n) Σ_{(x_i, z_i)∈S_z} h_z(x_i, z_i; g) + (1/ñ) Σ_{(x̃_i, w_i)∈S_w} h_w(x̃_i, w_i; f, g)    (19)
+ 2 (R^n_p(H_z) + R^ñ_p(H_w)) + (M_z/√n + M_w/√ñ) √(2 log(2/δ)).    (20)

Using some properties of the Rademacher complexity, including Talagrand's lemma, we can show that

R^n_p(H_z) ≤ (M_F + 4 M_Z) R^n_p(G),    (21)
R^ñ_p(H_w) ≤ (2 M_F + M_G) R^ñ_p(F) + (M_F + M_G) R^ñ_p(G).    (22)

Clearly,

Ĵ(f, g) = (1/n) Σ_{(x_i, z_i)∈S_z} h_z(x_i, z_i; g) + (1/ñ) Σ_{(x̃_i, w_i)∈S_w} h_w(x̃_i, w_i; f, g),
J(f, g) = E_{(x,z)∼p(x,z)}[h_z(x, z; g)] + E_{(x̃,w)∼p(x,w)}[h_w(x̃, w; f, g)].

From Eq. (20), Eq. (21), and Eq. (22), we obtain

J(f, g) ≤ Ĵ(f, g) + R_{F,G} + (M_z/√n + M_w/√ñ) √(2 log(2/δ)),    (23)

where R_{F,G} := 2(M_F + 4 M_Z) R^n_p(G) + 2(2 M_F + M_G) R^ñ_p(F) + 2(M_F + M_G) R^ñ_p(G). Finally, we prove Theorem 2.
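Before turning to the proof of Theorem 2, note that the empirical objective Ĵ(f, g) is a simple plug-in average over the two sample sets. A minimal sketch, assuming the summands h_z(x, z; g) = −4zg(x) − g(x)²/2 and h_w(x̃, w; f, g) = 2wf(x̃)g(x̃) − g(x̃)²/2 (these coefficients are our assumption, chosen to be consistent with the constants M_z = 4M_Y M_G + M_G²/2 and M_w = 2M_F M_G + M_G²/2):

```python
import numpy as np

def J_hat(f, g, x_z, z, x_w, w):
    """Empirical min-max objective J_hat(f, g).

    f, g: callables mapping an array of inputs to an array of outputs.
    (x_z, z): sample set S_z; (x_w, w): sample set S_w.
    """
    gz, gw = g(x_z), g(x_w)
    h_z = -4.0 * z * gz - 0.5 * gz ** 2          # summands over S_z
    h_w = 2.0 * w * f(x_w) * gw - 0.5 * gw ** 2  # summands over S_w
    return h_z.mean() + h_w.mean()
```

The learner would then minimize sup over g of this quantity with respect to f.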
Proof of Theorem 2.
From Lemma 3, with probability at least 1 − δ, it holds for all f ∈ F that

sup_{g∈G} J(f, g) ≤ sup_{g∈G} Ĵ(f, g) + R_{F,G} + (M_z/√n + M_w/√ñ) √(2 log(2/δ)).    (24)

Moreover, recalling W := inf_{x∈X} |μ_w(x)| and sup_{g∈L²(p)} J(f, g) = E[(μ_w(x) f(x) − 2 μ_z(x))²], we have

E[(f(x) − u(x))²] = E[(f(x) − 2 μ_z(x)/μ_w(x))²]    (25)
≤ (1/W²) E[(μ_w(x) f(x) − 2 μ_z(x))²]    (26)
= (1/W²) [ ε_G(f) + sup_{g∈G} J(f, g) ].    (27)

Combining Eq. (24) and Eq. (27) yields the inequality of the theorem.

F Proof of Corollary 1
Corollary 1.
Let F = {x ↦ α⊤φ(x) | ‖α‖ ≤ Λ_F}, G = {x ↦ β⊤ψ(x) | ‖β‖ ≤ Λ_G}, and assume that r_F := sup_{x∈X} ‖φ(x)‖ < ∞ and r_G := sup_{x∈X} ‖ψ(x)‖ < ∞, where ‖·‖ is the L²-norm. Under the assumptions of Theorem 2, it holds with probability at least 1 − δ that for every f ∈ F,

E_{x∼p(x)}[(f(x) − u(x))²] ≤ (1/W²) [ sup_{g∈G} Ĵ(f, g) + (C_z √(log(2/δ)) + D_z)/√n + (C_w √(log(2/δ)) + D_w)/√ñ + ε_G(f) ],

where the constants are obtained by substituting M_F ≤ r_F Λ_F, M_G ≤ r_G Λ_G, and M_Z = M_Y into the bound of Theorem 2:

C_z := √2 (4 M_Y r_G Λ_G + r_G² Λ_G²/2), C_w := √2 (2 r_F r_G Λ_F Λ_G + r_G² Λ_G²/2),
D_z := 2 (r_F Λ_F + 4 M_Y) r_G Λ_G, D_w := 2 (2 r_F Λ_F + r_G Λ_G) r_F Λ_F + 2 (r_F Λ_F + r_G Λ_G) r_G Λ_G.

Proof. Under the assumptions, it is known that the Rademacher complexity of the linear-in-parameter model F can be upper bounded as follows [22]:

R^N_p(F) ≤ r_F Λ_F / √N.

We can bound R^N_p(G) similarly. Applying these bounds to Theorem 2, we obtain the statement of Corollary 1.

G Proof of Theorem 3
We prove the following formal version of Theorem 3.
Theorem 3.
Under the assumptions of Corollary 1, it holds with probability at least 1 − 2δ that

E[(f̂(x) − u(x))²] ≤ (4 e_{n,δ} + 2 ε_{F,G} + ε_F) / W²,

where ε_{F,G} := sup_{f∈F} ε_G(f), ε_F := inf_{f∈F} J(f), f̂ ∈ F is any approximate solution to inf_{f∈F} sup_{g∈G} Ĵ(f, g) satisfying

sup_{g∈G} Ĵ(f̂, g) ≤ inf_{f∈F} sup_{g∈G} Ĵ(f, g) + e_{n,δ},

and e_{n,δ} := (C_z √(log(2/δ)) + D_z)/√n + (C_w √(log(2/δ)) + D_w)/√ñ.

Proof.

Let J(f) := sup_{g∈L²(p)} J(f, g) = E[(μ_w(x) f(x) − 2 μ_z(x))²], J_G(f) := sup_{g∈G} J(f, g), and Ĵ_G(f) := sup_{g∈G} Ĵ(f, g). Let f̃ ∈ F be any approximate solution to inf_{f∈F} J(f) satisfying J(f̃) ≤ ε_F + e_{n,δ}. As a special case of Eq. (24), we can prove that with probability at least 1 − δ, it holds for every f ∈ F that J_G(f) ≤ Ĵ_G(f) + e_{n,δ}; the same argument applied in the other direction gives Ĵ_G(f) ≤ J_G(f) + e_{n,δ} for every f ∈ F, again with probability at least 1 − δ. By the union bound, it holds with probability at least 1 − 2δ that

J(f̂) ≤ [J(f̂) − J_G(f̂)] + [J_G(f̂) − Ĵ_G(f̂)] + [Ĵ_G(f̂) − Ĵ_G(f̃)] + [Ĵ_G(f̃) − J_G(f̃)] + [J_G(f̃) − J(f̃)] + J(f̃)
≤ ε_{F,G} + e_{n,δ} + e_{n,δ} + e_{n,δ} + ε_{F,G} + [ε_F + e_{n,δ}]
= 4 e_{n,δ} + 2 ε_{F,G} + ε_F,

where the fifth bracket was bounded using J_G(f̃) − J(f̃) ≤ 0 ≤ ε_{F,G}. Since E[(f̂(x) − u(x))²] ≤ (1/W²) J(f̂), we obtain the bound in Theorem 3.

H Binary Outcomes
When outcomes y take on binary values, e.g., success or failure, we can assume without loss of generality that y ∈ {−1, 1}. Then, by the definition of the individual uplift, u(x) ∈ [−2, 2] for any x ∈ R^d. In order to incorporate this fact, we may add the following range constraints on f: −2 ≤ f(x) ≤ 2 for every x ∈ {x_i}_{i=1}^n ∪ {x̃_i}_{i=1}^ñ.

I Cases Where p_1(x) ≠ p_2(x) or (n_1, ñ_1) ≠ (n_2, ñ_2)

So far, we have assumed that p_1(x) = p_2(x), n_1 = n_2, and ñ_1 = ñ_2. The proposed method can be adapted to the more general case where these assumptions may not hold. Let

r_k(x) = (n / (2 n_k)) · (p(x) / p_k(x)) and r̃_k(x) = (ñ / (2 ñ_k)) · (p(x) / p_k(x)), k = 1, 2,

for every x with p_k(x) > 0. Let k_i := 1 if the sample x_i originally comes from p_1(x), and k_i := 2 if it comes from p_2(x). Similarly, define k̃_i ∈ {1, 2} according to whether x̃_i comes from p_1(x) or p_2(x). Then, unbiased estimators of the three terms in the proposed objective Eq. (10) are given as the following weighted sample averages using r_k and r̃_k:

E_{x∼p(x)}[w f(x) g(x)] ≈ (1/ñ) Σ_{i=1}^ñ w_i f(x̃_i) g(x̃_i) r̃_{k̃_i}(x̃_i),
E_{x∼p(x)}[z g(x)] ≈ (1/n) Σ_{i=1}^n z_i g(x_i) r_{k_i}(x_i),
E_{x∼p(x)}[g(x)²] ≈ (1/(2n)) Σ_{i=1}^n g(x_i)² r_{k_i}(x_i) + (1/(2ñ)) Σ_{i=1}^ñ g(x̃_i)² r̃_{k̃_i}(x̃_i).

The density ratios p_k(x)/p(x) can be accurately estimated using i.i.d. samples from p_k(x) and p(x) [21, 23, 35, 38].

J Unbiasedness of the Weighted Sample Average
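The weighted sample averages whose unbiasedness is shown below can be sketched as follows; the n/(2 n_k) normalization in r_k is our reading of the weighting, and the density-ratio values p(x_i)/p_{k_i}(x_i) are assumed to be supplied (e.g., estimated by a density-ratio method):

```python
import numpy as np

def policy_weights(k, dens_ratio):
    """Importance weights r_k(x_i) = n/(2 n_k) * p(x_i)/p_{k_i}(x_i).

    k: array of policy indices in {1, 2} for each sample,
    dens_ratio: array of p(x_i)/p_{k_i}(x_i) values.
    """
    n = len(k)
    r = np.empty(n, dtype=float)
    for kk in (1, 2):
        mask = (k == kk)
        r[mask] = n / (2.0 * mask.sum()) * dens_ratio[mask]
    return r

def weighted_mean(values, weights):
    """Weighted sample average (1/n) * sum_i values_i * weights_i."""
    return float((values * weights).mean())
```

When p_1 = p_2 = p and n_1 = n_2, all weights reduce to 1 and the estimators fall back to plain sample averages.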
Below, we show that the weighted sample averages are unbiased estimates. We only give the proof for E[w f(x) g(x)] since the other cases can be proven similarly. The expectation of the weighted sample average transforms as follows:

(1/ñ) Σ_{i=1}^ñ E_{x̃_i ∼ p_{k̃_i}(x), t_i ∼ p_{k̃_i}(t | x̃_i)} [w_i f(x̃_i) g(x̃_i) r̃_{k̃_i}(x̃_i)]
= (1/ñ) Σ_{k=1,2} Σ_{i=1}^{ñ_k} E_{x ∼ p_k(x), t ∼ p_k(t|x)} [(−1)^{k−1} t f(x) g(x) · (ñ / (2 ñ_k)) · (p(x) / p_k(x))]
= (1/2) Σ_{k=1,2} E_{x ∼ p(x), t ∼ p_k(t|x)} [(−1)^{k−1} t f(x) g(x)]
= ∫∫ Σ_{k=1,2} (1/2) p_k(t | x) (−1)^{k−1} t f(x) g(x) p(x) dt dx
= ∫∫ w p(w | x) f(x) g(x) p(x) dw dx
= E_{x ∼ p(x), w ∼ p(w|x)} [w f(x) g(x)],

where the second-to-last step substitutes w = (−1)^{k−1} t and uses p(w | x) = (1/2) p_1(t = w | x) + (1/2) p_2(t = −w | x).

K Gaussian Basis Functions Used in Experiments
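The Gaussian basis expansion defined in this appendix can be sketched as a design-matrix computation; the 2σ² scaling in the exponent is our assumption about the exact parameterization, and the random choice of centers is left to the caller:

```python
import numpy as np

def gaussian_design(X, centers, sigma):
    """Design matrix Phi with Phi[i, l] = exp(-||x_i - c_l||^2 / (2 sigma^2)).

    X: (n, d) inputs; centers: (b_f, d) randomly chosen training points.
    """
    # squared Euclidean distances between every input and every center
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```

Centers would be drawn as, e.g., `X[rng.choice(len(X), b_f, replace=False)]`, matching the "randomly chosen training data points" below.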
The l-th element of φ(x) = (φ_1(x), ..., φ_{b_f}(x))⊤ is defined by

φ_l(x) := exp(−‖x − x^{(l)}‖² / (2σ²)),

where x^{(l)}, l = 1, ..., b_f, are randomly chosen training data points. We used b_f = 100 and σ = 25 for all experiments. ψ is defined similarly.

L Justification of the Sub-Sampling Procedure
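The sub-sampling procedure justified in this appendix can be sketched as follows. The max-ratio normalization that turns the ratio p_k(t|x)/p(t|x) into a valid acceptance probability is our addition; any positive proportionality constant works and cancels in the conditional distribution derived below:

```python
import numpy as np

def subsample_policy(x, t, y, p_orig, p_k, rng):
    """Accept each sample with probability proportional to p_k(t|x)/p(t|x).

    p_orig, p_k: callables returning p(t|x) and p_k(t|x) for arrays (t, x).
    Returns the indices of the accepted samples (the subset S_k).
    """
    ratio = p_k(t, x) / p_orig(t, x)
    accept_prob = ratio / ratio.max()   # scale so probabilities lie in (0, 1]
    keep = rng.random(len(t)) < accept_prob
    return np.flatnonzero(keep)
```

When p_k equals the original policy, every acceptance probability is 1 and the whole data set is kept.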
Suppose that we want a sample subset S_k following the treatment policy p_k(t | x). For each sample (x_i, t_i, y_i) ∼ p(x, t, y) in the original data set, we randomly add it to S_k with probability proportional to p_k(t_i | x_i) / p(t_i | x_i). Then,

p(x_i, t_i, y_i | (x_i, t_i, y_i) ∈ S_k)
= p((x_i, t_i, y_i) ∈ S_k | x_i, t_i, y_i) p(x_i, t_i, y_i) / [ ∫ Σ_{t_i} ∫ p((x_i, t_i, y_i) ∈ S_k | x_i, t_i, y_i) p(x_i, t_i, y_i) dy_i dx_i ]
= p_k(t_i | x_i) p(y_i | x_i, t_i) p(x_i) / [ ∫ Σ_{t_i} ∫ p_k(t_i | x_i) p(y_i | x_i, t_i) p(x_i) dy_i dx_i ]
= p_k(t_i | x_i) p(y_i | x_i, t_i) p(x_i),

where the proportionality constant of the acceptance probability cancels between the numerator and the denominator, and the denominator integrates to one. This means that the subsamples S_k preserve the original p(y | x, t) and p(x) but follow the desired treatment policy p_k(t | x).

M McDiarmid's Inequality
Although McDiarmid's inequality is a well-known theorem, we present the statement to make the paper self-contained.
Theorem 4 (McDiarmid's inequality). Let ϕ: D^N → R be a measurable function. Assume that there exists a real number B_ϕ > 0 such that

|ϕ(v_1, ..., v_N) − ϕ(v′_1, ..., v′_N)| ≤ B_ϕ,    (28)

for any v_1, ..., v_N, v′_1, ..., v′_N ∈ D where v_i = v′_i for all but one i ∈ {1, ..., N}. Then, for any D-valued independent random variables V_1, ..., V_N and any δ > 0, the following holds with probability at least 1 − δ:

ϕ(V_1, ..., V_N) ≤ E[ϕ(V_1, ..., V_N)] + √((N B_ϕ² / 2) log(1/δ)).