A Comparison of Methods for Treatment Assignment with an Application to Playlist Generation
Carlos Fernández-Loría, Foster Provost, Jesse Anderton, Benjamin Carterette, Praveen Chandar
CARLOS FERNÁNDEZ-LORÍA and FOSTER PROVOST, New York University
JESSE ANDERTON, BENJAMIN CARTERETTE, and PRAVEEN CHANDAR, Spotify
We present a systematic analysis of causal treatment assignment decision making, a general problem that arises in many applications and has received significant attention from economists, computer scientists, and social scientists. We focus on choosing, for each user, the best algorithm for playlist generation in order to optimize engagement. We characterize the various methods proposed in the literature into three general approaches: learning models to predict outcomes, learning models to predict causal effects, and learning models to predict optimal treatment assignments. We show analytically that optimizing for outcome or causal-effect prediction is not the same as optimizing for treatment assignments, and thus we should prefer learning models that optimize for treatment assignments. For our playlist generation application, we compare and contrast the three approaches empirically. This is the first comparison of the different treatment assignment approaches on a real-world application at scale (based on more than half a billion individual treatment assignments). Our results show (i) that applying different algorithms to different users can improve streams substantially compared to deploying the same algorithm for everyone, (ii) that personalized assignments improve substantially with larger data sets, and (iii) that learning models by optimizing treatment assignments rather than outcome or causal-effect predictions can improve treatment assignment performance by more than 28%.

CCS Concepts: • Information systems → Data mining;
Additional Key Words and Phrases: treatment assignment, treatment effects, predictive modeling
INTRODUCTION

Systems that make automated decisions are often deployed with the underlying goal of improving (rather than just predicting) outcomes. For instance, many recommendation engines are deployed with the intent of driving customer engagement rather than just predicting the items that customers are likely to choose. Therefore, even though issues in data collection, modeling, and deployment often make it hard for recommender systems to be directly optimized in terms of such goals, we should ultimately evaluate these systems in terms of their ability to improve the outcomes we care about (e.g., customer engagement). Our focal application is choosing, for each listener, which playlist generation algorithm to apply in order to maximize the number of song streams.

One may frame this type of decision as a treatment assignment problem [18], where each possible algorithm corresponds to a different 'treatment', and ideally we would like to assign each individual to the treatment associated with the highest number of streams. Of course, optimal treatments may vary from one individual to another depending on their characteristics and the context, which may be captured by various features. For example, algorithm A may work better for newer users, whereas algorithm B may work better for more experienced users. Thus, using statistical modeling, we could learn a treatment assignment policy from data to map individuals to optimal treatments based on features such as tenure.

The statistical estimation of treatment assignment policies from sample data has been studied across many different fields, including econometrics [18], data mining [15], and multi-armed bandits [5]. The first contribution of this paper is to gather these various methods into three general approaches. The first approach is to learn a model that predicts outcomes for each treatment and assigns individuals to the treatment with the best predicted outcome.
The second approach is to learn a model for heterogeneous causal-effect estimation that assigns individuals to the treatment with the largest predicted causal effect. Finally, the third approach consists of learning a weighted classification model, where the treatment assignment is the target variable and the outcome serves to weigh observations. Thus, the classification model may be used to predict (and assign) the treatment that will have the best outcome.

At first glance, the three approaches may seem equivalent: they all seek to assign the treatment that is estimated to lead to the best outcome. However, as a second contribution, this paper highlights two key distinctions between them. The first distinction is their level of generality in terms of the tasks they can perform. For instance, models that predict outcomes may be used to estimate causal effects, whereas models that predict causal effects generally cannot predict outcomes. The second distinction is in the objective function each approach uses to learn models from data. The first approach optimizes models to predict outcomes, the second to predict causal effects, and the third to predict treatment assignments. As a result, each may lead to different treatment assignment policies. Importantly, our analysis shows that optimizing models to predict outcomes or causal effects is not the same as optimizing models for treatment assignment. Therefore, in theory, learning models that predict optimal treatments (the third approach) should lead to better treatment assignments than the other two approaches.

Finally, as a third contribution, we conduct a massive-scale experimental comparison of the three treatment assignment approaches in the context of music recommendations at Spotify. As we will describe in more detail below, each treatment corresponds to a different algorithm that could be used to build music playlists, and the goal is to maximize engagement (measured as number of song streams).
To our knowledge, this is the first real-world, at-scale comparison of these three approaches. Apart from confirming our analytical findings, the experiment shows that (1) a heterogeneous treatment assignment policy can substantially improve total streaming compared to deploying the same algorithm for everyone, and that (2) larger data sets lead to significantly better policies, illustrating the advantages of running large-scale A/B tests for the purpose of gathering unconfounded training data.

THE TREATMENT ASSIGNMENT PROBLEM

This paper focuses on settings where a decision-maker wants to maximize the overall causal effect of decisions on an outcome of interest (e.g., deciding what playlist generation algorithm to use for each listener to maximally increase streams). We frame this as a treatment assignment problem, so that each possible alternative corresponds to a different treatment, and the goal is to assign individuals to the treatment that maximizes their outcome. In recent years, this problem has been approached from various methodological perspectives, including econometrics [6], uplift modeling [15], heterogeneous effect estimation [25], and multi-armed bandits [5]. This section provides an overview of the problem formulation.

We specifically consider settings in which decisions are independent and the treatment assignment policy is learned from historical data on previous decisions made at random. This implies that each decision affects a single unit (or instance) and there is no selection bias in the data. In the causal inference literature, the first assumption is also known as the
Stable Unit Treatment Value Assumption (SUTVA) [7]. The second assumption ensures that there is no confounding in the data (i.e., unconfoundedness holds). The unconfoundedness assumption goes by many names in different fields, including ignorability [22], back-door criterion [19], and exogeneity [26]. Practically speaking, we use a carefully randomized A/B test to gather data where these assumptions hold.

Let T be the treatment assignment variable and Y be the observed outcome. We use potential outcomes to frame causality [23] and define Y_j as the outcome we would observe if we were to assign treatment j. Therefore, Y = Y_j if T = j, and the treatment that leads to the best outcome (on average) can be defined as:

T* = argmax_j E[Y_j]    (1)

Using historical data, one could estimate the best performing treatment by choosing the treatment with the largest estimated mean (Ê):

T̂ = argmax_j Ê[Y | T = j]    (2)

It is important to contrast our setting with a standard A/B test approach that compares multiple treatments across predefined populations. Such an approach does not directly apply to our setting, because we want to learn the individuals or subpopulations to which each treatment should be applied. Suppose individuals vary with respect to a set of variables (features) X. We can then think of a feature vector x as a subpopulation where X = x and formulate the optimal decision for subpopulation x (on average) as:

T*(x) = argmax_j E[Y_j | X = x]    (3)

It should be clear that, without the argmax, the right-hand side is essentially the formulation of a statistical machine learning model. Applying statistical modeling frees us from specifying in advance what are the particular subpopulations of interest. In the next section, we will discuss various methods that have been proposed to estimate T*(x) from historical data, resulting in a treatment assignment policy, T̂(x).
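Equations (2) and (3) amount to computing argmaxes of (conditional) sample means on randomized data. A minimal sketch on simulated data follows; the binary feature, the two treatments, and the data-generating process are all illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# One binary feature x (e.g., newer vs. more experienced user) and two
# treatments, randomized 50/50. Treatment 1 works better when x = 0;
# treatment 2 works better when x = 1 (an illustrative assumption).
x = rng.binomial(1, 0.3, size=n)
t = rng.integers(1, 3, size=n)
y = rng.poisson(lam=np.where((t == 1) == (x == 0), 5.0, 3.0))

# Equation (2): the single best treatment for the whole population.
pop_means = {j: y[t == j].mean() for j in (1, 2)}
t_hat = max(pop_means, key=pop_means.get)

# Equation (3): the best treatment within each subpopulation X = x.
policy = {xv: max((1, 2), key=lambda j: y[(t == j) & (x == xv)].mean())
          for xv in (0, 1)}

print(t_hat, policy)  # 1 {0: 1, 1: 2}
```

A fixed assignment (Equation 2) averages over the heterogeneity and picks treatment 1 for everyone; the per-subpopulation argmax (Equation 3) recovers that treatment 2 is better for the x = 1 group, which is exactly the gap a personalized policy exploits.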
In our setting, we leverage the unconfounded data from a randomized A/B test to learn and evaluate models for treatment assignments conditioned on the individuals' features.

In theoretical analyses, treatment assignment policies are typically evaluated in terms of their ability to minimize the expected difference between the best potential outcome that could possibly be obtained and the potential outcome that would be obtained by deploying the policy. This evaluation measure is also known as expected regret in decision theory:

Regret = E[Y_{T*(X)} − Y_{T̂(X)}]    (4)

In our setting, minimizing expected regret is the same as maximizing the expected conditional outcome (across treatments):

E[Y_{T̂(X)}]    (5)

However, evaluating the causal effects of treatment assignment policies using historical data (as is typical when building standard machine learning models) is challenging because we do not observe all potential outcomes for any given individual; we only observe one potential outcome at a time. Therefore, if (for any given individual) the policy assigns a treatment that is different from the treatment that was assigned in the data, we do not know the corresponding potential outcome. Fortunately, given a data set of N individuals from a randomized A/B test, we can still obtain an unbiased and consistent estimate of Equation 5 (see [16] for a detailed proof):

(1/N) Σ_i 1(T̂(x_i) = t_i) y_i / P(t_i)    (6)

(This formulation may be extended easily to include treatment costs.) Here, for each individual i, x_i is the feature vector, t_i is the assigned treatment, y_i is the observed outcome, and P(t_i) is the probability of being assigned to treatment t_i in the data (a known quantity if the data was collected through a randomized A/B test).
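Equation (6), an inverse-propensity estimate of a policy's value, is straightforward to compute from logged randomized data. A small sketch (the function and variable names are ours):

```python
import numpy as np

def ips_policy_value(assigned, outcomes, propensities, policy_assignments):
    """Estimate E[Y_{T_hat(X)}] (Equation 6) from randomized data.

    assigned:           treatment t_i actually given in the A/B test
    outcomes:           observed outcome y_i
    propensities:       P(t_i), the known randomization probability of t_i
    policy_assignments: the treatment the candidate policy would assign
    """
    assigned = np.asarray(assigned)
    match = np.asarray(policy_assignments) == assigned
    return np.mean(match * np.asarray(outcomes) / np.asarray(propensities))

# Toy check: a policy that always picks treatment 1, under a 50/50 test.
t = np.array([1, 2, 1, 2])
y = np.array([4.0, 2.0, 6.0, 0.0])
p = np.array([0.5, 0.5, 0.5, 0.5])
print(ips_policy_value(t, y, p, np.array([1, 1, 1, 1])))  # (8 + 12) / 4 = 5.0
```

Only the individuals whose logged treatment matches the policy's choice contribute, and dividing by the (known) randomization probability reweights them to represent the whole population.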
The causal inference literature often focuses its attention on the estimation of aggregate causal effects, such as the so-called average treatment (or causal) effect (ATE), which corresponds to the average effect of a treatment across the individuals in some well-defined population. Unfortunately, estimating the ATE does not help us to target different individuals with different treatments, because it does not discriminate between the individuals in the population at all. Thus, a fundamental assumption we are making is that the population exhibits heterogeneous treatment effects (HTEs), which are defined in terms of the degree to which a treatment may have different effects on different individuals [13].

One can account for HTEs through the estimation of conditional average treatment effects (CATEs), which correspond to the average causal effect conditioned on a set of available features. Thus, to the extent that individuals in the population differ on their features (and those features are related to causal effects), we may estimate different causal effects for each individual. Of course, treatment effects may still vary among individuals that share the same features (since we may not be accounting for all aspects related to the causal effect), but the estimation of HTEs by using CATEs allows us to make different interventions for different individuals without knowing the relevant subpopulations in advance.

The ideas behind CATE estimation have been increasingly applied to the development of new methods for treatment assignment. An important contribution of this paper is to group these methods into three general approaches (described below) for learning treatment assignment policies from data. Each approach has been recommended in prior research, so we also compare them analytically and empirically in subsequent sections as a second contribution.

(1) Outcome Prediction (OP): A model to predict the outcome under each treatment is learned using a standard machine learning method. The model assigns individuals to the treatment with the largest predicted outcome and is optimized to discriminate them according to how their outcomes vary.

(2) Causal-Effect Prediction (CP): A model to predict the differences between the outcome of each treatment and a baseline (or control) is learned using a machine learning method specifically designed to estimate CATEs. The model assigns individuals to the treatment with the largest predicted difference (i.e., treatment effect) and is optimized to discriminate individuals according to how their treatment effects vary.

(3) Treatment Assignment Prediction (TP): A (weighted) classification model to predict the treatment with the largest weight is learned using a standard machine learning method. Weights are defined by the observed outcome under each treatment condition, so that weights are larger for treatments with larger outcomes. Individuals are assigned to the predicted treatment and the model is optimized to discriminate individuals according to how their preferred treatments vary.
Uplift modeling. Uplift or true lift modeling [15, 17] estimates the incremental (causal) impact of a treatment on individuals' behaviour, and it has been recommended by the data mining community for targeting applications such as online advertising and customer retention [21]. The uplift modeling literature typically focuses on settings where treatment assignments and outcomes are binary, so methods are usually grouped into two main categories [24]: the two-model approach and the single-model approach (which are specific instances of OP and CP, respectively). As the name suggests, the two-model approach consists of building two outcome classifiers (one for treated and one for untreated individuals) and then taking the difference between their predictions to assess whether the treatment would be beneficial. On the other hand, the single-model approach directly models the difference between treatment and control probabilities. Approaches to do this include (1) algorithms specifically designed to discriminate according to treatment effects [24] and (2) transforming treatment assignments and outcomes into a new target variable that is modeled using standard machine learning [14].
Econometrics. The econometrics literature has also argued that assigning treatments to maximize social welfare is a distinct problem from the point estimation and hypothesis testing problems usually considered in the causal inference literature [12]. As a result, several authors have proposed various methods for estimating optimal policies, which typically use estimation procedures that regress the outcome on the treatment assignment and a set of observed features [6, 8, 12, 18]. Therefore, all of these methods would fall under the OP approach. Most of these studies are typically concerned with the asymptotic properties of their proposed methods, but some have showcased their models in practical settings. For instance, [6] describe how to estimate treatment assignment policies in settings with budget constraints and evaluate their method in the context of efficient provision of anti-malaria bed net subsidies, using data from a randomized experiment conducted in Kenya. Their results show that subsidy allocation based on wealth, presence of children, and possession of a bank account can lead to a rise in subsidy use of about 9 percentage points compared to allocation based on wealth only, and of 17 percentage points compared to a purely random allocation.
Causal Inference. Others in the causal inference literature have also proposed machine learning methods specifically designed for CATE estimation [3]. Some of the most promising alternatives use Bayesian additive regression trees (BART) [11], random forests [25], and regularized support vector machines (SVM) [13]. Importantly, a main motivation behind these methods is their use in the estimation of policies for treatment assignment [2, 3], which corresponds to CP in our context. There is also a relatively large number of papers showing asymptotic properties of CP for treatment assignment when an efficient estimator of CATE is known [see 4 for an overview], but none of them discuss (to our knowledge) any results when deploying such systems in practice.
Multi-armed bandits. Finally, treatment assignment policies are also at the core of contextual multi-armed bandits. Models for multi-armed bandit problems may be used to learn how to make decisions in situations where the payoff of only one choice is observed [5, 9]. Such methods have been used to make automated decisions about online news recommendations to maximize clicks [16], for example. It is precisely in this stream of research that it was first noted that the treatment assignment problem (as defined in Section 2) is numerically equivalent to a weighted classification problem [5], leading to the suggestion of TP. Nonetheless, OP has also been recommended for multi-armed bandit algorithms (LinUCB being a well-known example [16]).

An important distinction between our setting and the multi-armed bandit problem is that the goal in bandit problems is to learn a treatment assignment policy while actively making treatment assignment decisions for incoming subjects. Therefore, there is an exploration-exploitation dilemma that plays an important role in the decision-making procedure, whereas in our case the decision-maker cannot re-estimate the treatment assignment policy after making each decision. Our setting is also referred to as "offline learning" in this community [5].
COMPARISON OF THE APPROACHES
At first, the three approaches we discussed in the previous section may seem equivalent, and in fact, each of them has been shown to be asymptotically optimal (i.e., the approaches converge to optimal treatment assignments with large enough samples) when the machine learning procedure that is used to learn the models is a consistent estimator; see [12, 18] for OP, see [4, 12] for CP, and see [5] for TP. However, there are two subtle but key differences between the approaches that are critically important when applying these causal estimators in practice.
The first key distinction is their level of generality (the approaches are listed in Section 3 from the most general to the least general). OP is the most general of the approaches because models that predict outcomes may also be used to predict causal effects or optimal treatments. More specifically, causal-effect predictions may be obtained by taking the difference between the predicted outcomes of two treatments under consideration (as suggested in uplift modeling), and optimal-treatment predictions may be obtained by selecting the treatment with the largest predicted outcome. Therefore, OP models may be used for three different purposes.

On the other hand, models that predict causal effects cannot be used to predict outcomes. For such models, predictions estimate the expected marginal increase (or decrease) in the outcome that results from assigning some specific treatment, but the predictions cannot be used to estimate expected outcomes under an arbitrary treatment condition. Therefore, while causal-effect predictions may still be used to predict optimal treatments (by selecting the treatment with the largest predicted effect), CP models are not as general as OP models. Finally, models trained to predict optimal treatment assignments (TP models) can only be used for that purpose; these models, the least general, cannot predict the outcome or the effect that would result from making those assignments.
The second key distinction is that each approach uses a different objective (or loss) function to learn their respective predictive models. OP uses a loss function designed to optimize outcome predictions; CP uses a loss function designed to optimize causal-effect predictions; and TP uses a loss function designed to optimize treatment assignments. This implies that, while all approaches share the same ultimate goal (optimizing treatment assignments as specified by Equations 4, 5, and 6), they differ with respect to the procedures they use to learn from data.

This distinction is important because an improvement in the prediction of outcomes or causal effects does not imply an improvement in the prediction of optimal treatment assignments (as we show in detail in the next subsections). In fact, such improvements may actually occur at the expense of worse treatment assignments. Thus, we should expect machine learning with loss functions specifically tailored to optimize treatment assignments to produce better models: TP should outperform the other approaches with finite training data when making treatment assignment decisions.

Nonetheless, various research communities recommend OP and CP for treatment assignment, and few studies have compared these three approaches either analytically or empirically. As exceptions, [5] provides a theoretical regret analysis for multi-armed bandit problems showing that, for a given family of 'regressors' (e.g., decision trees), TP has a smaller regret lower bound than OP. These analytical results are supported by experiments on multi-class benchmark data sets that were repurposed to simulate potential outcomes, showing TP to be a superior alternative to OP. [14] used a similar experimental approach to compare OP and CP, showing each of the approaches outperforming the other in different empirical examples.
They also note the potential of OP to outperform CP in settings where outcomes are strongly correlated with causal effects, suggesting that choosing between approaches should be an empirical undertaking. There are no studies (to our knowledge) that compare these approaches in an actual practical setting.

[Fig. 1. Comparison of outcome prediction vs treatment assignment for a single individual. The model depicted in (a) makes a better treatment assignment than the model depicted in (b) despite having larger outcome prediction errors.]

In the following subsections, we compare the three approaches analytically to illustrate how their choice of objective functions may affect their performance in treatment assignments. Then, in the next section, we provide an experimental comparison of the three approaches in a real, practical setting where better treatment assignment policies can generate substantial value.
As mentioned, OP assigns treatments by learning a model that predicts the expected outcome of each treatment (μ̂):

μ̂(x, j) = Ê[Y | X = x, T = j]    (7)

and then selecting the treatment with the best predicted outcome:

T̂_μ(x) = argmax_j μ̂(x, j)    (8)

A standard approach to fit Equation 7 is to regress outcome Y on features X and T using various machine learning methods designed to minimize the mean squared error for the outcome (MSE_μ):

MSE_μ = E[(Y_T − μ̂(X, T))²]    (9)

and then to choose the model with the lowest empirical MSE_μ.

The premise here is that minimizing MSE_μ implies better outcome predictions, and therefore better treatment assignments. However, optimizing outcome predictions (by minimizing MSE_μ or other measures such as mean absolute error or cross-entropy) does not necessarily optimize treatment assignments. Figure 1 compares the outcome predictions made by two different models for a single individual, one with high prediction errors (Figure 1a) and another with low prediction errors (Figure 1b). The blue (dark) dots correspond to the true conditional expectation (they are the same for
Comparison of causal effect prediction vs treatment assignment for a single individual . The model depicted in (a)makes a better treatment assignment than the model depicted in (b) despite having larger causal-effect prediction errors. both graphs), whereas the red dots correspond to the predictions. A larger distance between the blue and the red dotsimplies that the model has a larger
MSE_μ; it makes worse outcome predictions.

In this example, the conditional expectation when T = 1 is larger than when T = 2, so the optimal treatment is T = 1. Whenever μ̂|T=1 > μ̂|T=2, the model makes the optimal treatment assignment. Going back to the example, Figure 1a shows that the model with larger prediction errors makes the optimal treatment assignment because the ranking of the predicted outcomes is the same as the ranking of the true values. The second model makes a worse assignment, even though its prediction errors are smaller, because the ranking is inverted. Therefore, choosing the model with the better
MSE_μ leads to a worse treatment assignment.

The second treatment assignment approach, CP, is to learn a model to estimate CATE (τ̂) directly:

τ̂(x, j) = Ê[Y | X = x, T = j] − Ê[Y | X = x, T = 0]    (10)

where T = 0 corresponds to the baseline (control) treatment, and then to assign the treatment with the largest predicted effect:

T̂_τ(x) = argmax_j τ̂(x, j)    (11)

As mentioned in Section 3, there is a growing literature in the use of machine learning methods for the estimation of CATE. The (sometimes unstated) goal of these methods is to minimize the mean squared error for treatment effects (MSE_τ):

MSE_τ = E[((Y_T − Y_0) − τ̂(X, T))²] = E[(τ − τ̂(X, T))²]    (12)

Therefore, these methods are not optimized to predict outcomes but rather to predict causal effects (which are usually defined as the difference between potential outcomes, e.g., Y_1 − Y_0). The main challenge is that we only observe one potential outcome for any given individual, so we cannot calculate Equation 12 directly because τ is not observable. However, we may use alternative formulations to estimate MSE_τ from data (up to a constant) [1], allowing us to compare (and optimize) models on the basis of how good they are at predicting causal effects.

Unfortunately for our application, and similarly to the previous section, optimizing causal-effect predictions (by minimizing MSE_τ) is not the same as optimizing treatment assignments either. We illustrate this using Figure 2, which shows a similar example to the one illustrated in Figure 1, except it compares the causal effect (rather than outcome) predictions made by two models. Therefore, in this example the blue (dark) dots represent the causal effect of the treatments for a specific individual (these dots are the same in both graphs), and the red dots represent the estimation of the effect by the models. As before, the first model has high prediction errors (Figure 2a) but makes a better assignment, while the second has lower prediction errors (Figure 2b) but makes a worse assignment. Thus, the model that makes a better causal-effect prediction (i.e.
that has lower MSE_τ) makes a worse treatment assignment.

Surprisingly, this implies that models that are (relatively) bad at causal-effect prediction may be good at making treatment assignments. This result, while seemingly counter-intuitive at first, may be attributed to the bias-variance decomposition of errors. In the machine learning community, it is well known that models that have good classification performance are not necessarily good at estimating class probabilities (and vice versa) [10]. A useful analogy in our context is to think about treatment assignment as a classification problem and to think about causal-effect estimation as a probability estimation problem; the two tasks are closely related but not exactly the same. Importantly, the bias and variance components of the estimation error in causal-effect predictions may combine to influence treatment assignment in a very different way than with the squared error of the predictions themselves. For instance, certain types of very high bias may be canceled by low variance to produce accurate treatment assignments. Therefore, it is important to distinguish between good causal-effect estimation and good treatment assignments.

The third approach, TP, estimates the treatment assignment policy by directly learning the treatment assignments that lead to the best outcomes. As [5] describe in detail, the treatment assignment problem can be transformed into a weighted multi-class classification problem. The general idea is that, given a probability distribution P(T) over the treatments (e.g., the probability that an individual gets assigned to treatment T in the A/B test data), each observation (x, y, t) can be transformed into an importance-weighted multi-class example where y / P(t) is the cost of not predicting treatment t given input x. These examples can then be fed to any importance-weighted multi-class classifier learning algorithm.
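As a concrete illustration of this transformation, here is a toy importance-weighted "learner" that, for each feature value, simply picks the treatment with the largest accumulated weight y / P(t). In practice any importance-weighted multi-class classifier would take its place; the data, names, and propensities below are made up:

```python
from collections import defaultdict

def fit_tp(data, propensity):
    """Toy TP learner: importance-weighted majority vote per feature value.

    data:       iterable of (feature, observed outcome, assigned treatment)
    propensity: P(t), the randomization probability of each treatment
    """
    weights = defaultdict(float)
    for x, y, t in data:
        weights[(x, t)] += y / propensity[t]  # cost of NOT predicting t at x
    xs = list(dict.fromkeys(x for x, _, _ in data))
    ts = sorted({t for _, _, t in data})
    return {x: max(ts, key=lambda t: weights[(x, t)]) for x in xs}

# Each record: (user segment, observed streams, assigned algorithm).
data = [("new", 5, "A"), ("new", 1, "B"), ("new", 4, "A"),
        ("exp", 1, "A"), ("exp", 6, "B"), ("exp", 7, "B")]
policy = fit_tp(data, propensity={"A": 0.5, "B": 0.5})
print(policy)  # {'new': 'A', 'exp': 'B'}
```

Note that the outcome never appears as a regression target: it only reweights the classification examples, which is what ties the learned classifier directly to assignment performance rather than to outcome accuracy.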
The predictions of the output classifier (θ̂) would correspond to:

θ̂(x, j) = P̂(T = j | X = x)    (13)

and may be used to choose the optimal treatment as follows:

T̂_θ(x) = argmax_j θ̂(x, j)    (14)

As mentioned, the model predictions defined in Equation 13 are optimized to minimize the weighted misclassification rate (WMR):

WMR = E[1(T̂_θ(X) ≠ T) Y / P(T)]    (15)

The objective function presented in Equation 15 is directly tied to treatment assignment performance because minimizing it is equivalent to minimizing expected regret (as defined in Equation 4); see [5] for more details. We present a simplified proof sketch here:

argmin_T̂ E[1(T̂(X) ≠ T) Y / P(T)]
  = argmin_T̂ Σ_j P(T = j) E[1(T̂(X) ≠ j) Y_j / P(T = j) | T = j]
  = argmin_T̂ Σ_j E[1(T̂(X) ≠ j) Y_j | T = j]    (16)

and given the unconfoundedness assumption:

  = argmin_T̂ Σ_j E[1(T̂(X) ≠ j) Y_j]
  = argmin_T̂ (Σ_j E[Y_j]) − E[Y_{T̂(X)}]
  = argmin_T̂ −E[Y_{T̂(X)}]
  = argmin_T̂ E[Y_{T*(X)} − Y_{T̂(X)}]

As a result, because minimizing
WMR is equivalent to optimizing for expected regret (i.e., treatment assignments), we should expect WMR to be a better objective function than MSE_μ or MSE_τ when the goal is to make the best possible treatment assignments.

We now present our third contribution, an empirical comparison of the three treatment assignment approaches for choosing which playlist generation algorithm to apply for each listener.
In our playlist generation setting, the treatment variants consist of different algorithmic playlist generation (recommender) systems that are being tested in production by Spotify, a media services provider. Each recommender system uses a different algorithm to select and rank songs in "algorithmic" playlists (playlists that are built dynamically according to user data). The company has multiple goals when deploying such systems (e.g., converting users from free to premium, reducing churn, increasing engagement with the platform). We focus specifically on the number of song streams as our target outcome metric. However, (as stated before) issues in data collection, modeling, and deployment have made it hard historically for recommender systems to be directly optimized in terms of these goals. Therefore, the models that underlie these systems are often heuristic (e.g., songs may be ranked according to their similarity to previous songs the user has played).

To evaluate new recommender systems, firms typically run A/B tests to compare the new variant(s) with the existing production system (as a baseline) and decide whether to replace the production system with one of the new systems. This essentially chooses the same treatment assignment for all users.
However, as we have argued throughout this paper, different variants may work better for different users: if System A is best for new users and System B is best for more experienced users, then deploying the same system for all users would lead to sub-optimal treatment assignments. Since the outcome of interest in this case is song streams, the goal is to learn a treatment assignment policy that deploys different systems for different users in order to maximize the number of streams.

Before proceeding, we want to reemphasize that treatments in this setting consist of algorithms, not recommendations. The treatment assignment policy is choosing among playlist generation algorithms, each of which makes personalized decisions according to user data. Therefore, recommended playlists would still vary from one user to another, even if we use the same variant for everyone.
We compare the three treatment assignment approaches using data from a massive, production A/B test. The A/B test produced a data set in which four different recommender systems were randomly assigned to users to build algorithmic playlists: three newly developed recommender systems and the system that was currently in production. More specifically, each observation corresponds to a user who selected a playlist, and each playlist was built using one of the four systems (chosen at random) to select and rank songs. There are 770 million observations in the data: 86.68% assigned to the production system, and 4.44% to each of the new variants. For each observation, we have the following categorical features: country (19 values), playlist ID (6 values), platform (e.g., Android; 3 values), days since the account was created (transformed into a discrete variable with 4 values), and product (e.g., free, premium; 8 values). For each categorical variable, the categories with fewer than 10,000,000 observations were grouped together in a category named 'other', resulting in the number of values for each variable reported above. Balance tests with respect to these features confirmed an adequate randomization of the systems. For each observation, we also have the user's total number of streams (increasing streams is the outcome of interest).
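The grouping of rare categories described above can be sketched with pandas; the function name and small demo threshold below are ours, not from the paper (the paper uses a 10,000,000-observation cutoff).

```python
import pandas as pd

def lump_rare(series, min_count=10_000_000, other="other"):
    """Replace categories with fewer than min_count observations by 'other'."""
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    # Keep values in frequent categories; map everything else to 'other'.
    return series.where(series.isin(keep), other)

# Tiny demo with an illustrative threshold of 3 observations.
s = pd.Series(["US"] * 5 + ["SE"] * 2 + ["PY"])
print(lump_rare(s, min_count=3).value_counts())
```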
As in the analytical section, we compare three approaches to estimate treatment assignment policies: (1) the outcome policy (OP) ($\hat{\mu}$), which uses a regression model to predict the total streams of each system and assigns the system (treatment) with the largest predicted number of total streams (Equation 8); (2) the causal-effect policy (CP) ($\hat{\tau}$), which uses a regression model to predict the causal effect of each system on total streams (compared to control) and assigns the system with the largest predicted effect (Equation 11); and (3) the treatment policy (TP) ($\hat{\theta}$), which uses a classification model to predict (and assign) the system that is estimated to produce the largest number of streams (Equation 14). All of these models are learned and applied at the user level.

We use tree-based algorithms to learn all models so that differences in performance can be attributed to the loss functions used by each approach rather than the machine learning algorithm being used. For OP, we used a decision tree regressor that minimizes $MSE_{\mu}$ to learn $\hat{\mu}$. For CP, we used a decision tree regressor on the transformed variable proposed by [1] to learn $\hat{\tau}$ by minimizing $MSE_{\tau}$, i.e., a 'causal tree'. In the case of CP, we had to train 3 causal trees (one for each system except control) because the method does not support non-binary treatments. Finally, for TP, we used a weighted decision tree classifier that minimizes $WMR$ to learn $\hat{\theta}$.

The models were learned, tuned, and evaluated using 10-fold nested cross-validation, which separates the cross-validation used for hyperparameter optimization from the test folds used for evaluation [20]. We used the empirical measure described in Equation 6 to evaluate all models, using $P(t_i) = 86.68\%$ when $t_i = 0$ (the production system) and $P(t_i) = 4.44\%$ otherwise. We report the performance of each policy ($P$) in terms of the impact on streams ($I$, Equation 6) relative to just assigning the control system to everyone ($C$, Equation 6 when $\hat{T}(X_i) = 0, \forall i$):
$$I = \frac{1}{N} \sum_{i}^{N} \frac{\mathbb{1}(T_i = \hat{T}(X_i))\, Y_i}{P(T_i)}, \quad (17)$$
$$C = \frac{1}{N} \sum_{i}^{N} \frac{\mathbb{1}(T_i = 0)\, Y_i}{P(T_i)}, \quad (18)$$
$$P = \frac{I - C}{C}. \quad (19)$$
Importantly, maximizing Equation 19 is equivalent to minimizing empirical expected regret (Equation 4) when the assumptions discussed in Section 2 are met [16]. Therefore, we can test the policies using historical data from the A/B test. All models' hyperparameters were tuned to optimize their respective loss functions: OP was tuned to optimize $MSE_{\mu}$; CP was tuned to optimize $MSE_{\tau}$; and TP was tuned to optimize $WMR$.

Figure 3 shows the performance of each approach (measured as the increase in streams relative to the baseline) as the size of the data increases. The red line is the policy in which the system that performs best on average is applied to everyone: what we would get from a typical A/B test, i.e., choosing which recommender system works best generally, without user-specific modeling at the level of recommender system choice. In addition, the figure shows the performance of the three approaches we discussed to learn treatment assignment policies: OP (blue line), CP (orange line), and TP (green line). The areas around the lines represent 95% confidence intervals calculated using the ten results from the cross-validation.
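Equations 17-19 amount to inverse-propensity-scored estimates computed from the logged A/B-test data: an observation contributes to a policy's estimated value only when the logged treatment matches the policy's choice, reweighted by the logging probability. A minimal sketch (function names are ours, not from the paper):

```python
import numpy as np

def policy_value(assigned, observed_t, outcomes, propensities):
    """Inverse-propensity-scored impact estimate (Equation 17):
    average of 1(T_i == T_hat(X_i)) * Y_i / P(T_i)."""
    match = (observed_t == assigned).astype(float)
    return float(np.mean(match * outcomes / propensities))

def relative_improvement(assigned, observed_t, outcomes, propensities, control=0):
    """P = (I - C) / C (Equation 19), with C the value of always
    assigning the control treatment (Equation 18)."""
    I = policy_value(assigned, observed_t, outcomes, propensities)
    C = policy_value(np.full_like(observed_t, control),
                     observed_t, outcomes, propensities)
    return (I - C) / C
```

Because treatments were randomized with known probabilities, these estimates are unbiased for the value each policy would have achieved if deployed.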
The first interesting finding in Figure 3 is that choosing the system that performs best on average does not have a significant impact on the total number of song streams. This implies that no single 'best system' for all users performs much better than the baseline (existing production system). Importantly, if we were to follow a traditional A/B test approach, we might come to the erroneous conclusion that the best thing to do is simply to keep the system currently in production, because no other system produces an increase in streams that is statistically significant at the population level.

We show in Table 1 the percentage of users that would be assigned to each recommender system, depending on the treatment assignment policy that is used. Recall that our analysis was conducted using cross-validation, so the table was built using out-of-sample treatment assignments for all users. The first row in the table shows that different systems may be selected as the 'best system' (on average) depending on the cross-validation folds used to compare the systems.

[Fig. 3. Treatment assignment performance.]

[Table 1. Percentage of users assigned to each treatment. Columns: policy; share of users assigned to each system ($T = 0$ through $T = 3$); and the average entropy of treatment assignments across folds (min is 0, max is 2). Note: different systems perform best (on average) depending on the folds used to select the 'best system'; thus, not everyone is assigned to a single system when using cross-validation to evaluate and analyze the 'best on average' policy.]

To quantify this heterogeneity, we computed the entropy of the treatment assignments within each fold, and then obtained the average entropy across all 10 folds (i.e., column 4 in Table 1). As we can see, all three policies exhibit a large entropy (i.e., high heterogeneity) in treatment assignments within the folds, resulting in substantially more streams.
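The entropy measure reported in Table 1 can be computed directly from the assignments; a small sketch (function name is ours) using base-2 entropy, which ranges from 0 (everyone assigned the same system) to 2 (uniform assignment over the four systems):

```python
import numpy as np

def assignment_entropy(assignments, n_treatments=4):
    """Base-2 entropy of the empirical treatment-assignment distribution."""
    counts = np.bincount(assignments, minlength=n_treatments)
    probs = counts / counts.sum()
    probs = probs[probs > 0]            # 0 * log(0) is defined as 0
    return float(-(probs * np.log2(probs)).sum())
```

The per-fold entropies would then be averaged across the 10 cross-validation folds, as in the table's last column.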
A second important finding shown in Figure 3 is that treatment assignment policies become increasingly better with more training data, illustrating the importance of conducting large A/B tests to generate unconfounded training data to learn models for individual-level treatment assignments. As we mentioned in Section 3, the estimation of causal effects by using CATEs instead of the ATE is a substantial improvement for the purposes of deciding on individual interventions. However, the estimation of accurate CATEs requires much larger data sets; otherwise the models are likely to overfit. Fortunately, large A/B tests alleviate this problem because the data can be partitioned into fine-grained subpopulations of users without losing substantial statistical power. Correspondingly, we can fit more complex causal models with less overfitting. As part of our analysis, we assessed overfitting by comparing the treatment assignments made by the various models we built with cross-validation. We found that, for any given individual, models are more likely to assign the same treatment as sample size grows, suggesting that the models overfit less with more data.

[Table 2. Policy comparison at different tasks. Columns: policy; $MSE_{\mu}$; $MSE_{\tau}$ (the MSE of the transformed outcome proposed by [1] to estimate causal effects); performance $P$ as defined in Equation 19, Section 5.3; and the percentage of decisions that differ from the decisions made by TP.]

Finally, Figure 3 shows that the policy that was learned by optimizing treatment assignments (green line) works substantially better than the policies that were learned by optimizing outcome and causal-effect predictions (blue line and orange line, respectively), thus validating our analytical findings. As discussed in detail above, objective functions that optimize things other than treatment assignment prediction (i.e., better outcome or causal-effect predictions) do not necessarily favor better treatment assignments. Going back to our analytical examples, we would expect each approach to perform best at doing whatever it is optimized to do. For example, the predictive model used by OP should perform better at predicting outcomes than the models used by CP and TP. However, as discussed in Section 4.1, we cannot (in general) use causal-effect or treatment assignment models to estimate outcomes.
Therefore, in order to compare the various approaches at different tasks, in the following analysis we adapt CP's and TP's models. Since we are using tree-based models for all policies, we generalize the models used by CP and TP by using different prediction functions to aggregate the training observations at each leaf, depending on the task at hand. For instance, if we want to make outcome predictions, the prediction function would consist of the average outcome of the observations in the leaf (rather than the average causal effect in the case of CP, or the treatment with the largest outcome in the case of TP). Thus, the structures of all models remain the same, but the prediction function at the leaf level may be adjusted to predict outcomes (CP and TP) or causal effects (TP).

Table 2 shows the performance of each approach at the three different tasks, evaluated using nested cross-validation on the entire data set. The tasks are predicting outcomes (a lower $MSE_{\mu}$ is better), predicting causal effects (a lower $MSE_{\tau}$ is better), and predicting treatment assignments (a higher $P$ is better). As expected, each approach is best at doing what it was optimized to do. Nevertheless, the models with the best performance in outcome prediction (OP) and causal-effect prediction (CP) are not the best models at making treatment assignments. In fact, in relative terms, the impact of TP is 28% larger than the impact of OP or CP.

Finally, the last column of the table shows the mismatch between the decisions made by each policy and the decisions made by TP. Interestingly, the mismatch is relatively large despite the small differences in performance between the approaches and their similar distribution of treatment assignments (as shown in Table 1). This suggests that models with different objective functions can also lead to quite different decisions at the individual level (even if their overall performance is similar).
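The leaf-level adaptation can be sketched as follows: keep the fitted tree's structure, but replace its prediction function with an aggregate of the training observations that fall in each leaf. This is an illustrative reimplementation (the helper name is ours), shown for outcome prediction using scikit-learn's `apply`, which maps examples to leaf indices:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def leaf_outcome_predictor(tree, X_train, y_train):
    """Reuse a fitted tree's structure for a different task: predict the
    mean training outcome of the leaf each example falls into."""
    leaves = tree.apply(X_train)                      # leaf index per example
    leaf_means = {leaf: y_train[leaves == leaf].mean()
                  for leaf in np.unique(leaves)}
    def predict(X):
        return np.array([leaf_means[leaf] for leaf in tree.apply(X)])
    return predict
```

The same pattern supports the other tasks: aggregating a transformed outcome per leaf yields causal-effect predictions, and taking the treatment with the largest mean outcome per leaf yields treatment assignments.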
This paper categorizes individualized treatment assignment methods into three main approaches: outcome prediction, causal-effect prediction, and treatment assignment prediction. To our knowledge, this is the first study to compare these three approaches on a real application at scale. We discuss and illustrate two key distinctions between them: (1) their level of generality in terms of the types of tasks that their models may address, and (2) the objective function they use to estimate models from data. Importantly, we show that despite the fact that all three approaches have been recommended for treatment assignment in prior research, optimizing for outcome or causal-effect predictions is not the same as optimizing for treatment assignments, and that the latter ought to be better in practical (non-asymptotic) settings such as ours.

We then compare and contrast the three approaches for the real-world application of choosing, for each listener, which playlist generation algorithm to apply in order to maximize the number of song streams. We illustrate how unconfounded training data can be generated from A/B tests, and that large A/B tests can provide substantial value for learning treatment assignment policies. The results also show that individualized treatment assignment prediction indeed substantially outperforms causal-effect prediction and outcome prediction. Specifically, the machine-learned treatment assignment policy causally increases listening by over 28% relative to the other approaches. This is the case despite the fact that none of the individual treatments (the different playlist generation algorithms) outperforms the others when applied over the whole population. This final observation highlights the two different uses of A/B tests: the traditional use is to choose the treatment that has the largest average treatment effect; we use the A/B test to generate unconfounded training data to learn models that will allow us to target specific treatments to specific individuals.
REFERENCES
[1] Susan Athey and Guido Imbens. 2016. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113, 27 (2016), 7353–7360.
[2] Susan Athey and Guido W Imbens. 2017. The State of Applied Econometrics: Causality and Policy Evaluation. Journal of Economic Perspectives 31, 2 (2017), 3–32.
[3] Susan Athey and Guido W Imbens. 2019. Machine Learning Methods That Economists Should Know About. Annual Review of Economics 11 (2019).
[4] Susan Athey and Stefan Wager. 2017. Efficient policy learning. arXiv preprint arXiv:1702.02896 (2017).
[5] Alina Beygelzimer and John Langford. 2009. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 129–138.
[6] Debopam Bhattacharya and Pascaline Dupas. 2012. Inferring welfare maximizing treatment assignment under budget constraints. Journal of Econometrics.
[7] … Journal of Econometrics.
[8] …
[9] … In Proceedings of the 28th International Conference on Machine Learning (2011).
[10] Jerome H Friedman. 1997. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery 1, 1 (1997), 55–77.
[11] Jennifer L Hill. 2011. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20, 1 (2011), 217–240.
[12] Keisuke Hirano and Jack R Porter. 2009. Asymptotics for statistical treatment rules. Econometrica 77, 5 (2009), 1683–1701.
[13] Kosuke Imai, Marc Ratkovic, et al. 2013. Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics 7, 1 (2013), 443–470.
[14] Maciej Jaskowski and Szymon Jaroszewicz. 2012. Uplift modeling for clinical trial data. In ICML Workshop on Clinical Data Analysis.
[15] Kathleen Kane, Victor SY Lo, and Jane Zheng. 2014. Mining for the truly responsive customers and prospects using true-lift modeling: Comparison of new and existing methods. Journal of Marketing Analytics 2, 4 (2014), 218–238.
[16] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web. ACM, 661–670.
[17] Victor SY Lo. 2002. The true lift model: a novel data mining approach to response modeling in database marketing. ACM SIGKDD Explorations Newsletter 4, 2 (2002), 78–86.
[18] Charles F Manski. 2004. Statistical treatment rules for heterogeneous populations. Econometrica 72, 4 (2004), 1221–1246.
[19] Judea Pearl. 2009. Causality: Models, Reasoning and Inference. Cambridge University Press.
[20] Foster Provost and Tom Fawcett. 2013. Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking. O'Reilly Media, Inc.
[21] Nicholas J Radcliffe and Patrick D Surry. 2011. Real-world uplift modelling with significance-based uplift trees. White Paper TR-2011-1, Stochastic Solutions (2011).
[22] Paul R Rosenbaum and Donald B Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 1 (1983), 41–55.
[23] Donald B Rubin. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66, 5 (1974), 688.
[24] Piotr Rzepakowski and Szymon Jaroszewicz. 2012. Decision trees for uplift modeling with single and multiple treatments. Knowledge and Information Systems 32, 2 (2012), 303–327.
[25] Stefan Wager and Susan Athey. 2018. Estimation and inference of heterogeneous treatment effects using random forests. J. Amer. Statist. Assoc. 113, 523 (2018), 1228–1242.