"Improving" prediction of human behavior using behavior modification
Galit Shmueli
Institute of Service Science, National Tsing Hua University, Hsinchu, Taiwan. [email protected]
Abstract
The fields of statistics and machine learning design algorithms, models, and approaches to improve prediction. Larger and richer behavioral data increase predictive power, as evident from recent advances in behavioral prediction technology. Large internet platforms that collect behavioral big data predict user behavior for internal purposes and for third parties (advertisers, insurers, security forces, political consulting firms) who utilize the predictions for personalization, targeting, and other decision-making. While standard data collection and modeling efforts are directed at improving predicted values, internet platforms can minimize prediction error by "pushing" users' actions towards their predicted values using behavior modification (BM) techniques. The better the platform can make users conform to their predicted outcomes, the more it can boast its predictive accuracy and ability to induce behavior change. Hence, platforms are strongly incentivized to "make predictions true". This strategy is absent from the ML and statistics literature. Investigating its properties requires incorporating causal notation into the correlation-based predictive environment, an integration that is currently missing. To tackle this void, we integrate Pearl's causal $do(\cdot)$ operator into the predictive framework. We then decompose the expected prediction error given BM, and identify the components impacting predictive power. Our derivation elucidates the implications of such BM for data scientists, platforms, their clients, and the humans whose behavior is manipulated. BM can make users' behavior more predictable and even more homogeneous; yet this apparent predictability might not generalize when clients use predictions in practice. Outcomes pushed towards their predictions can be at odds with clients' intentions, and harmful to manipulated users.
Keywords: behavior modification · behavioral big data · machine learning · prediction error · causal intervention · internet platforms

Recent years have seen an incredible growth in predictive modeling of user behavior using behavioral big data in both industry and academia. Behavioral big data (BBD) are large and highly detailed datasets on human and social actions and interactions (Shmueli, 2017). BBD-based predictions now shape almost every aspect of modern life, both online and on the ground (Agrawal et al., 2018). In contrast to how statistics and machine learning have approached the task of reducing prediction error by improving predictions, a surprising new approach relies on behavior modification techniques, now popularly used in industry. Such behavior modification can be aimed at pushing user actions towards their predicted values, thereby making predictions more certain.

In her enlightening and alarming book, Zuboff (2019) describes the processes used by several large internet platforms that collect BBD to package the raw material of users' actively shared data and passively generated data (e.g. location data, video usage, friendship ties) into "prediction products" that are then sold to business customers (insurance companies, marketers, advertisers, security forces, political consulting firms, etc.) in "behavioral futures markets". We use the term "BBD platform" for internet platforms that collect users' BBD. The predictions are used to modify users' behavior, shaping it toward desired commercial or other outcomes. Often, the BBD platform delivers the interventions on its client's behalf.

One example is the recently launched Google Analytics "predictive audiences" service that "automatically enriches your data by bringing Google machine-learning expertise to bear on your dataset to predict the future behavior of your users": "Purchase Probability, which predicts the likelihood that users who have visited your app or site will purchase in the next seven days... Churn Probability, predicts how likely it is that recently active users will not visit your app or site in the next seven days." (https://blog.google/products/marketingplatform/analytics/new-predictive-capabilities-google-analytics/)

Another example is Facebook's "loyalty prediction" service, which offers advertisers the ability to target users based on how they will behave, what they will buy, and what they will think. Commenting on this service, Frank Pasquale, a law professor at the University of Maryland and scholar at Yale's Information Society Project, said he

worried how the company could turn algorithmic predictions into "self-fulfilling prophecies," since "once they've made this prediction, they have a financial interest in making it true." That is, once Facebook tells an advertising partner you're going to do some thing or other next month, the onus is on Facebook to either make that event come to pass, or show that they were able to help effectively prevent it (how Facebook can verify to a marketer that it was indeed able to change the future is unclear).

In other words, the more accurate these prediction products, the higher the value they provide their customers, and in turn the higher the revenues for the BBD platform. Moreover, the better the BBD platform is able to make users conform to their algorithmically determined destiny, the more it can boast both its predictive accuracy and its ability to induce behavior change (Rushkoff, 2019).
Hence, BBD platforms have a strong incentive to improve prediction accuracy, that is, to reduce prediction error.

While Zuboff (2019) uncovered the dangers of a company using behavior predictions to manipulate its customers' behavior for its own commercial gain, we take this one step further: BBD platforms have the technical ability and incentive to manipulate their users' behaviors not only in directions that increase their clients' gains, but also in a direction that showcases the platform's own prediction capabilities, thereby misleading its clients and manipulating humans in possibly dangerous directions. An extreme example is predicting mental health risk for a healthcare stress-reduction app. While the app maker aims to lower the stress of high-risk users, the BBD platform can demonstrate high prediction accuracy by turning high-risk predictions into high-risk realities.

The goal of this work is to introduce a technical vocabulary that enables investigating this new behavior modification approach to minimizing prediction error. Technical terminology and notation are needed in order to identify the properties and implications of the behavior modification approach for the resulting predictive power. Various questions arise: Can behavior modification mask poor predictive algorithms? Can one infer from the manipulated predictive power the counterfactual of non-manipulated predictive power? Can platform clients running routine A/B testing detect this scheme? What are the roles of personalized predictions and of personalized behavior modifications within the error minimization strategy?

Using the $do(\cdot)$ operator by Pearl (2009), we aim to enable the analysis and evaluation of the effect of behavior manipulation on predictive power. While do-calculus is well developed for causal effect identification (Pearl, 2009), the challenge here lies in combining the causal $do(\cdot)$ operator with the existing correlation-based predictive framework.
Our goal is to make transparent the effects of behavior modification on predictive power, thereby enabling the study of its impact on business, social, and humanistic aspects, and its potential implications.

The fields of statistics and machine learning have been introducing new and improved models, algorithms, approaches, and even data, aimed at improving predictive power. Approaches such as regularization, boosting, and ensembles have proven highly useful in generating more precise predictions. From transparent regression models and tree-based algorithms, to more blackbox support vector machines, k-nearest neighbors, neural nets, and especially deep learning algorithms, their justification and adoption lie in their ability to capture intricate signals linking inputs and a to-be-predicted output.

Predictive performance is typically measured by out-of-sample prediction errors, which compare predicted values with actual values for new observations. More formally, the prediction error $e_i$ for record $i$ is defined as the difference between the actual outcome value $y_i$ and its prediction $\hat{y}_i$, that is, $e_i = y_i - \hat{y}_i$. For a sample of $n$ records, we have a set of actual outcome values $\vec{y} = [y_1, y_2, \ldots, y_n]$, a set of predicted outcome values $\vec{\hat{y}} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n]$, and a set of prediction errors $\vec{e} = [e_1, e_2, \ldots, e_n]$. For each record $i$ we also have predictor information in the form of $p$ measurements $\vec{x}_i = [x_{i,1}, \ldots, x_{i,p}]$. The predictor information for the $n$ records is contained in the matrix $X$. Predicted values are obtained from $\hat{f}$, the algorithm trained on (or model estimated from) data on inputs $X$ and actual outcomes $\vec{y}$, so that $\vec{\hat{y}} = \hat{f}(X)$.

Predictive algorithms and methods are designed and tuned to minimize some aggregation of the error values ($\vec{e}$) by operating on the predicted values ($\vec{\hat{y}}$).
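As a minimal numerical sketch of these definitions (the data, model, and numbers below are hypothetical illustrations, not from the paper), the error vector $\vec{e}$ and a common squared-error aggregation can be computed as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n records, a single predictor (all numbers are illustrative)
n = 200
x = rng.uniform(0, 10, n)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, n)   # actual outcomes, the vector y

# A simple stand-in for f-hat: a least-squares line fit to (X, y)
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept               # predicted outcomes, the vector y-hat

e = y - y_hat                               # prediction errors e_i = y_i - y_hat_i
mse = np.mean(e ** 2)                       # one common aggregation of the errors
```

Here the aggregation is the mean squared error; other choices (mean absolute error, quantile losses) fit the same template of operating on $\vec{e}$.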
(Source of the Facebook "loyalty prediction" quote above: https://theintercept.com/2018/04/13/facebook-advertising-data-artificial-intelligence-ai/)

Improving predicted values is typically achieved by improving three components:
1. the structure of the algorithm/model $\hat{f}$ that relates the predictor information $X$ to the outcome (e.g., new algorithms and methods),
2. the estimation/computation of $\hat{f}$,
3. the quality and quantity of predictor information $X$. Larger, richer behavioral datasets have been shown to improve predictive accuracy (Martens et al., 2016).

In all these approaches, the actual outcome values $\vec{y}$ are considered fixed. The top panel of Figure 1 illustrates the statistical and machine learning approach of improving the above three components in order to minimize prediction error.

When predicting an outcome $y$ that is not expected to be manipulated between training and deployment, we anticipate prediction error due to the inability of the model $\hat{f}$ to (1) correctly capture the underlying $f$ even with unlimited training data (bias), (2) correctly estimate $f$ due to insufficient data (variance), and (3) capture the errors for individual observations $\vec{\epsilon}$ (noise). For predicting a numerical outcome or probability for a new observation, these three sources are formalized through a bias-variance decomposition of the expected prediction error (EPE), using squared-error loss (Geman et al., 1992):

$$EPE(\vec{x}) = E\left[(y \mid \vec{x}) - \hat{f}(\vec{x})\right]^2 = E(\epsilon^2) + \left[f(\vec{x}) - E(\hat{f}(\vec{x}))\right]^2 + E\left[\hat{f}(\vec{x}) - E(\hat{f}(\vec{x}))\right]^2 = \sigma^2 + Bias^2(\hat{f}(\vec{x})) + Var(\hat{f}(\vec{x})). \quad (1)$$

In statistics and machine learning, prediction is based on an assumption of continuity, where the predicted observations come from the same underlying processes and environment as the data used for training the predictive algorithm and testing its predictive performance. The deterministic underlying function $f$ and the random noise distribution are both assumed to remain unchanged between the time of model training and evaluation and the time of deployment.
This assumption underlies the practice of randomly partitioning the data into separate training and test sets (or into multiple "folds" in cross-validation), where the model is trained on the training data and evaluated on the separate test data. Of course, the continuity assumption is often violated to some degree, depending on the distance (temporal, geographical, etc.) between the training/test data and the to-be-predicted data, and on how fast or abruptly the environment changes between these two contexts. These challenges increase prediction errors beyond the disparity observed between training and test prediction errors. Hence, predictive power based on the test data might provide an overly optimistic estimate compared to actual performance at deployment.

Companies such as Google, Facebook, Uber, Netflix, and Amazon have been investing in improving prediction algorithms through collecting, buying, storing, and processing unprecedented amounts and types of data. They have also hired top statistics and machine learning talent, purchased AI companies, and developed in-house predictive algorithms and platforms. These are aimed at improving predictions along the three strategies described earlier.
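The decomposition in eq. (1) can be checked numerically. The sketch below (a hypothetical setup, not from the paper) fits a deliberately misspecified linear model to a quadratic $f$ across many fresh training samples, then compares a direct Monte Carlo estimate of $EPE(x_0)$ to $\sigma^2 + Bias^2 + Var$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: quadratic truth, misspecified linear f-hat (illustrative numbers)
f = lambda x: 0.5 * x ** 2
sigma2 = 1.0                      # noise variance sigma^2
x0 = 3.0                          # the query point x in eq. (1)
n, reps = 100, 4000

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 5, n)                      # fresh training sample
    y = f(x) + rng.normal(0, np.sqrt(sigma2), n)
    b1, b0 = np.polyfit(x, y, deg=1)              # biased (linear) f-hat
    preds[r] = b1 * x0 + b0                       # f-hat(x0) for this sample

bias = f(x0) - preds.mean()       # f(x0) - E[f-hat(x0)]
var = preds.var()                 # Var(f-hat(x0))

# Direct Monte Carlo estimate of EPE(x0): new noisy outcome vs. each trained model
y_new = f(x0) + rng.normal(0, np.sqrt(sigma2), reps)
epe = np.mean((y_new - preds) ** 2)

decomposed = sigma2 + bias ** 2 + var   # right-hand side of eq. (1)
```

Up to Monte Carlo error, `epe` and `decomposed` agree, with the misspecification showing up in the bias term.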
(Footnote to eq. 1: assuming underlying model $y \mid \vec{x} = f(\vec{x}) + \epsilon$, where $\epsilon$ has zero mean and variance $\sigma^2$.)

BBD platforms now have the incentive and technology to minimize prediction errors in a direction that is absent from academic prediction research: by manipulating actual outcomes ($\vec{y}$). When the outcome of interest is a human behavior, online or offline (clicking an ad, purchasing an item, posting sensitive information, visiting a doctor, voting, etc.), this action can be indirectly manipulated by using behavior modification techniques. Behavior modification, or behavior change techniques, are "an observable, replicable and irreducible component of an intervention designed to alter or redirect causal processes that regulate behavior" (Michie et al., 2013). The most popular technique is the nudge, defined as "any aspect of the choice architecture that alters people's behavior in a predictable way without forbidding any options or significantly changing their economic incentives" (Thaler and Sunstein, 2009, p. 6). Zuboff (2019) identifies two more types of behavior modification: herding, which is controlling key elements in a person's immediate context in order to guide their behavior towards a predictable one; and operant conditioning, a term coined by the famous behavioral psychologist B.F. Skinner, which uses positive and negative reinforcement to encourage certain behaviors and extinguish others.

(For example, Facebook's "AI backbone" FBLearner Flow combines machine learning and experimentation capabilities that can be applied to the entire Facebook user base: https://engineering.fb.com/core-data/introducing-fblearner-flow-facebook-s-ai-backbone/)

[Figure 1 caption: $do(B)$ pushes the observed user behavior towards its predicted value. Note that only orange arrows denote a causal effect. The squiggly black arrows denote a correlation-based predictive relationship.]
BJ Fogg, Stanford University's Behavior Design Lab director, lists seven types of "persuasive" technology tools (Fogg, 2002). While the field of marketing has used behavior modification even prior to the advent of the internet (Nord and Peter, 1980), today's technologies and big data enable more covert, pervasive, and powerful manipulation due to their networked, continuously updated, dynamic, and pervasive nature (Yeung, 2017). Zuboff (2019) explains,

These interventions are designed to enhance certainty by doing things: they nudge, tune, herd, manipulate, and modify behavior in specific directions by executing actions as subtle as inserting a specific phrase into your Facebook news feed, timing the appearance of a BUY button on your phone, or shutting down your car engine when an insurance payment is late.

While these examples do not necessarily involve prediction, prediction-based behavior modification is common in recommendation systems, targeted advertising, precision marketing, and other "personalized" interventions that intend to cause human users to change their behavior in a specific direction that is beneficial to the intervention initiator: towards longer online engagement, higher purchase propensity, increased information sharing, or, in our case, towards the platform's predicted values.

The two key points are that (1) BBD platforms have a plethora of powerful and tested behavior modification tools, and (2) behavior modification techniques are designed to modify behavior in a predictable way, here "pushing" outcome values towards their predicted values, in order "to shape individual, group, and population behavior in ways that continuously improve their approximation to guaranteed outcomes" (Zuboff, 2019, p. 339).
Consider an insurance company interested in acquiring new customers, but trying to avoid high-risk customers. Now consider an internet platform that is interested in selling to the insurance company the risk scores of its users. To showcase its predictive power, the platform can generate risk scores for a set of users, then use behavior modification to "push" users' behaviors towards their predicted scores (e.g. encouraging/discouraging engagement with the app during driving; showing/not showing ads for alcoholic beverages during work hours). Such a strategy would turn high-risk predictions into high-risk realities. Note that in this scenario, the gap between the platform's goal (showcasing accurate predictions) and the insurance company's objective (avoiding high-risk customers) leads to pushing high-risk users' behaviors in a direction that is not only ethically dubious but also at odds with the insurance company's interests.

Another example is a political consulting firm interested in reaching likely "Vote for T" individuals. An internet platform that wants to sell predicted "Vote for T" scores of its users might prove its predictive power by generating "Vote for T" probability predictions, and then using behavior modification to push users' behaviors towards voting for T. In this scenario, the platform's strategy would turn voting predictions into voting realities. Note that, on top of serious ethical and legal implications, the platform's strategy might be in line with the political consulting firm's goal if the firm is trying to promote voting for T, but at odds with it if the consulting firm is non-partisan or trying to promote a different candidate.

4.2 Why would a platform follow this strategy?
Given the financial incentives and technical capabilities of internet platforms to showcase predictive power for their prediction products, behavior modification for "improving" prediction might be used intentionally by a platform's management or by a data scientist under pressure to showcase performance.

Even without intention, such a strategy might be taking place on a platform due to the now-popular use of algorithms such as reinforcement learning, which employ behavior modification (and user feedback) in order to optimize a predetermined objective function. The common objective of machine learning algorithms to minimize prediction error would lead to this new outcome of "improved" prediction.
In the case of no behavior modification, differences between the training and deployment environments introduce uncertainty, typically by increasing bias and/or changing the noise distribution, and are therefore likely to cause larger prediction errors at deployment. Behavior modification implies that, by design, the contexts of training and deployment of the predictive model are made different, albeit in a way that reduces uncertainty. While differences between the deployment and training contexts arising from uncontrollable and unforeseeable conditions increase uncertainty, behavior modification intends to shift actual outcome values $y$ in a specific direction, which is by design closer to $\hat{y}$. For example, when predicting that a user is likely to become depressed, displaying depressing news, friends' posts, and depression-related ads increases that user's chance of depression (Facebook's emotional contagion experiment by Kramer et al. (2014) demonstrates such capability). When predicting the arrival time of a delivery, incentivizing faster/slower driving can increase the accuracy of the predicted arrival time. Displaying donation amounts by "friends" with amounts similar to the user's predicted amount can increase the chance the user donates the predicted amount.

To study the components affecting prediction error under behavior modification, it is useful to decompose the new form of expected prediction error (under behavior modification) into separate meaningful sources. This can help us identify components such as bias, variance, and noise. However, we need technical vocabulary that can encode both correlation-based and causal-based terminology. The challenge is that standard notation and terminology used in statistics and machine learning for predictive modeling are insufficient for formalizing the problem of minimizing prediction error by intentionally manipulating the actual outcome values by way of behavior modification.
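A toy numerical sketch of this "pushing" idea (hypothetical numbers, not a model of any real platform): suppose the platform can shift each user's outcome a fraction $\lambda$ of the way toward its prediction. The observed squared error then shrinks by the factor $(1-\lambda)^2$:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 10_000
y_hat = rng.uniform(0, 10, n)              # the platform's predictions for n users
y = y_hat + rng.normal(0, 2.0, n)          # outcomes without any manipulation

# do(B_i): push each outcome a fraction lam of the way toward its prediction
lam = 0.7
y_tilde = y + lam * (y_hat - y)

mse_no_bm = np.mean((y - y_hat) ** 2)      # close to 4.0 by construction
mse_bm = np.mean((y_tilde - y_hat) ** 2)   # exactly (1 - lam)^2 times mse_no_bm
```

Nothing about the predictions improved here; only the outcomes moved, which is precisely the concern raised above.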
The bottom panel of Figure 1 illustrates this new scenario. Specifically, predictive terminology lacks notation for denoting an intentional manipulation, as distinguished from correlation-based relationships. At the same time, while causal notation does exist in the world of causal effects and causal inference (Pearl, 2009; Rubin, 1974), in that world correlation-based prediction is excluded. Figure 1, which includes both causal arrows (orange) and a correlational connector (depicted as a squiggly black arrow, but with no causal interpretation), is incoherent in the world of causal diagrams, as well as in the world of prediction. We therefore propose to integrate causal notation into the existing predictive terminology and context in a parsimonious way. We do this by adopting the $do(\cdot)$ operator by Pearl (2009), where $do(B)$ denotes that variable $B$ is not simply observed but rather manipulated. This allows us to incorporate intentional behavior modification into the predictive modeling context. We then use this notation to decompose the expected prediction error, in order to identify the different components that affect predictive power.

Denote the manipulated outcome as $\tilde{y}$. Using the $do(\cdot)$ operator, we note that it is incorrect to write $\tilde{y} \doteq do(y)$, because the user's outcome $y$ is not directly manipulated. Instead, the modified outcome is fully mediated: the platform tailors its behavior $B$ ($do(B)$, or personalized $do(B_i)$) to manipulate the user's instinct or mental state (e.g. emotion, thought), which in turn leads to the modified outcome $\tilde{y}_i$. This manipulation is specifically aimed at pushing the outcome towards its prediction. We therefore write

$$\tilde{y}_i \doteq y_i \mid do(B_i). \quad (2)$$

(We chose to use a single-headed squiggly arrow rather than a bi-directional straight arrow for the correlation-based predictive relationship to convey the asymmetric input-output roles of $X$ and $y$.)
(In causal diagrams, bi-directional arrows convey an unobservable variable affecting the two variables at the arrowheads, and there is no way to represent an asymmetric correlation-based predictive relationship.)

("The $do(x)$ operator is a mathematical device that helps us specify explicitly and formally what is held constant, and what is free to vary" (Pearl, 2009, p. 358).)

(It is possible to use Rubin's potential outcomes notation intended for estimating treatment effects (e.g. Imbens and Rubin, 2015, p. 33). This requires defining $B = \{0, 1, 2, \ldots\}$ as the intervention assignment, and denoting by $y_i(B)$ the outcome, where $y_i(0)$ is the un-manipulated outcome. The quantity $y_i \mid do(B), \vec{x}$ is written as $y_i(B) \mid B, \vec{x}$. We prefer the $do(\cdot)$ operator since it conveys the causal nature of the manipulation $B$ and clearly differentiates it from the correlation-based prediction components $\vec{x}$.)

Table 1: Short notation, full notation, and description of each term.

$y_i$ | $y_i \mid \vec{x}_i = f(\vec{x}_i) + \epsilon_i$ | Outcome under no manipulation
$f$ | $f(\vec{x})$ | True function under no manipulation
$\hat{f}$ | $\hat{y} \mid \vec{x} = \hat{f}(\vec{x})$ | Predicted outcome under no manipulation
$\sigma^2$ | $Var(\epsilon) = E(\epsilon^2)$ | Noise variance under no manipulation
$f_{do}$ | $g(do(B), \vec{x})$ | True function under $do(B)$
$\tilde{y}_i$ | $y_i \mid do(B_i), \vec{x}_i = g(do(B_i), \vec{x}_i) + \tilde{\epsilon}_i$ | Manipulated outcome
$\tilde{\sigma}^2$ | $Var(\tilde{\epsilon}) = Var(\tilde{y}) = E(\tilde{\epsilon}^2)$ | Noise variance under $do(B)$

To allow the effects of the behavior modification to be heterogeneous in the user's specific predictor information $X_i = \vec{x}_i$ (e.g. user $i$'s browsing history, demographics, location), we can write:

$$\tilde{y}_i \doteq y_i \mid do(B_i), \vec{x}_i. \quad (3)$$

Second, to denote the predictive (correlation-based) relationship between the outcome and the predictors, we continue using the standard predictive notation:

$$y_i = f(\vec{x}_i) + \epsilon_i. \quad (4)$$

Third, for the manipulated outcome, we use $f_{do}$ to denote the underlying function, which can be a completely different function from $f$:

$$\tilde{y}_i = f_{do}(do(B_i), \vec{x}_i) + \tilde{\epsilon}_i = g(do(B_i), \vec{x}_i) + \tilde{\epsilon}_i. \quad (5)$$

We use $\sim$ on top of terms affected by $do(B)$. We note that the quantity $E(\tilde{y}_i \mid \vec{x}_i) - E(y_i \mid \vec{x}_i) = E(\tilde{y}_i - y_i \mid \vec{x}_i)$ is called the (population) Conditional Average Treatment Effect (CATE) (Athey and Imbens, 2016; Imbens and Rubin, 2015) or
Individual Treatment Effect (ITE) (Shalit et al., 2017), and is of key interest in treatment effect estimation and testing.

In the predictive modeling phase, we estimate $f$ using $\hat{f}$. We then compare the predicted value $\hat{y}$ to the manipulated outcome $\tilde{y}$. Table 1 provides the short notation, full notation, and description for each of the above terms. Together with equations 2-5, we now have a sufficient vocabulary for examining the prediction error under behavior modification.

Behavior modification techniques are used by BBD platforms for two purposes other than reducing prediction error: for estimating the overall effect of a behavior modification intervention (A/B testing), and for predicting personalized user reactions to a behavior intervention for precision targeting (uplift modeling). These two purposes differ from the focus of this paper, and are also different from each other. Using the terminology and the $do(\cdot)$ operator, we briefly describe these approaches in Appendix A, summarizing the key differences in Table 2.

Expected prediction error under behavior modification ($\widetilde{EPE}$)

When outcome values are intentionally "pushed" towards their predicted values, it is intuitive that the resulting expected prediction error will be lower than under no manipulation. We can now formalize the following questions: Given a specific prediction algorithm $\hat{f}$, trained on data with no behavior modification $(X, \vec{y})$, when will the expected prediction error for a manipulated user with predictors $\vec{x}$ and manipulation $do(B_i) = b$ be lower than if the user was not manipulated? That is, for a $p$-norm loss function $L_p$, when will we get

$$E[L_p(\tilde{y}, \hat{f}(b, \vec{x}))] < E[L_p(y, \hat{f}(\vec{x}))]? \quad (6)$$

When might the manipulation lead to worse predictive power? To answer these questions, we proceed to break down the EPE into several non-overlapping components.
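The CATE just defined can be estimated under a randomized manipulation by a simple difference in means. The sketch below uses a hypothetical outcome model with a constant effect, so the CATE equals the overall average effect; with genuinely heterogeneous effects one would condition on $\vec{x}$ (as in uplift modeling):

```python
import numpy as np

rng = np.random.default_rng(3)

n = 20_000
x = rng.uniform(0, 1, n)                      # a single predictor
treat = rng.integers(0, 2, n).astype(bool)    # randomized do(B) assignment

# Hypothetical outcome model: the manipulation adds 1.5 on average for everyone
y0 = 2.0 * x + rng.normal(0, 1.0, n)          # outcome without do(B)
y = y0 + np.where(treat, 1.5, 0.0)            # observed outcome

# With randomization, a difference in means estimates E(y | do(B)) - E(y)
cate_hat = y[treat].mean() - y[~treat].mean()
```

This is the A/B-testing use of behavior modification; the prediction-error-minimization strategy discussed here instead tailors the manipulation to each user's predicted value.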
Using the standard $L_2$ loss function, we can obtain the EPE under behavior modification as follows (the full derivation is given in Appendix B):

$$\widetilde{EPE}(\vec{x}) = E\left[(y \mid do(B), \vec{x}) - \hat{f}(\vec{x})\right]^2 = \tilde{\sigma}^2 + \left[CATE(\vec{x}) + Bias(\hat{f}(\vec{x}))\right]^2 + Var(\hat{f}(\vec{x})). \quad (7)$$

(Note that $y_i \mid \vec{x}_i$ assumes no manipulation. Pearl (2009, p. 70-72) offers an alternative formulation to encode manipulation vs. no manipulation, by adding a binary intervention indicator $I_B$ that obtains values in $\{do(B_i), idle\}$. In our case $I_B = idle$ for $y_i \mid \vec{x}_i$.)

(In the prediction minimization process, all subjects are initially not $B$-manipulated, and later a sample is $B$-manipulated using personalized modifications.)

[Figure 2 caption (partial): ... $CATE = -Bias$ (blue circles), or on both shifting the average outcome and shrinking the variance $\tilde{\sigma}^2$ (red circles). Yellow X marks are predicted values $\hat{f}(x_i)$. (The schematic assumes a very large training sample, and thus $\hat{f} \approx E(\hat{f})$.)]

Each of the terms in eq. 7 has an interesting meaning and different implications for the effect of behavior modification on EPE. The additive nature of this formulation provides insights on the roles of data size, predictive algorithm properties, and behavior modification qualities. By comparing $\widetilde{EPE}(\vec{x})$ to $EPE(\vec{x})$ (the manipulated and non-manipulated scenarios), we can see the following:

Data size: Whether manipulating or not, data size affects $\widetilde{EPE}$ via the variance of the predictive algorithm, indicating that larger training samples can improve not only predictions, but also the average manipulated prediction error. Pushing the outcome towards a more stable prediction leads to smaller errors.

Magnitude of behavior modification effect: The second term shows the role of the average behavior modification magnitude (CATE) in countering the bias of the predictive algorithm. This term is minimized when $CATE = -Bias(\hat{f})$, that is, when, on average, $do(B)$ pushes the user's behavior in a direction and magnitude that exactly counters the prediction algorithm's bias. Thus, an effective behavior modification can improve predictive power by combating the predictive algorithm's bias, as long as $CATE$ lies between $0$ and $-2\,Bias$ (i.e., $0 < CATE < -2\,Bias$ when the bias is negative, or $-2\,Bias < CATE < 0$ when the bias is positive).

Noise (homogeneity of prediction errors): Compared to $\sigma^2$ in the no-manipulation $EPE$, the first term in $\widetilde{EPE}$ is $\tilde{\sigma}^2$, the noise variance under behavior modification. This means behavior modification can also decrease/increase the variability of prediction errors (or equivalently, of $\tilde{y}$ relative to $y$) across different users.

To modify or not to modify? Three Scenarios
To better understand the trade-offs and implications of the four $\widetilde{EPE}$ sources ($\tilde{\sigma}^2$, $CATE$, $Bias(\hat{f})$, $Var(\hat{f})$) for the expected prediction error, we consider three scenarios. Figure 2 is a simplistic illustration of the roles of CATE and $\tilde{\sigma}^2$. Suppose predictions are drivers' risk scores in terms of risky driving behaviors, and the goal is to minimize the (squared) differences between the predicted and actual values. A ride-sharing or social media platform can modify the driver's behavior by manipulating the driver's engagement with their app while driving. Suppose the x-axis is the daily distance traveled, so that risk is a quadratic function of distance, yet the predictive model estimates a linear relationship. We consider three specific distances: $x_1$, $x_2$, and $x_3$. While $\hat{f}$ is a biased algorithm, for $x_1$ there is no bias, for $x_2$ the bias is negative, and for $x_3$ the bias is positive (for simplicity, the schematic assumes a very large training sample and thus $\hat{f} \approx E(\hat{f})$).

(The machine learning "bias" is asymptotic in sample size: an algorithm is biased "if no matter how much training data we give it, the learning curve will never reach perfect accuracy" (Provost and Fawcett, 2013).)

Scenario 1: low-bias $\hat{f}$ trained on a very large sample

This scenario would be akin to deep learning algorithms applied to massive training data. The very large sample means $Var(\hat{f}) \approx 0$, and $Bias(\hat{f})$ is very small. The strategy of setting $CATE = -Bias(\hat{f})$ is optimal if the behavior modification also decreases error heterogeneity, so that $\tilde{\sigma}^2 \leq \sigma^2$. Because the bias is small, the optimal behavior modification should have a small effect. In Figure 2, $\hat{f}(x_1)$ has no bias, and therefore applying behavior modification to drivers with distance $x_1$ will introduce bias, and is only useful if it can sufficiently shrink the variability of the resulting risky behaviors (red points).

Scenario 2: high-bias $\hat{f}$ trained on a very large sample

High-bias algorithms include models with relatively few parameters (e.g. naive Bayes, linear regression, shallow trees, and k-NN with large k). As in Scenario 1, here too $Var(\hat{f}) \approx 0$. The strategy of setting $CATE = -Bias(\hat{f})$ (e.g. in Figure 2, increasing average risky behaviors for $x_2$ by $|Bias(\hat{f}(x_2))|$ and decreasing them for $x_3$ by $|Bias(\hat{f}(x_3))|$) is optimal if the behavior modification does not increase error heterogeneity, so that $\tilde{\sigma}^2 \leq \sigma^2$. While a small modification effect (in the right direction) can help counter the bias, the ideal modification effect must be as large as the bias. Note that EPE is computed for a specific $\vec{x}$, and therefore generalizing the above rule to any $\vec{x}$ requires either assuming homoskedastic errors $\tilde{\epsilon}$, or that the inequality holds for all $\vec{x}$ ($\forall \vec{x}:\ \tilde{\sigma}^2_{\vec{x}} \leq \sigma^2_{\vec{x}}$).

Scenario 3: high-variance $\hat{f}$

If the predictive model has high variance and is estimated on a relatively small sample, then potential minimization of $\tilde{\sigma}^2$ and/or $[CATE + Bias(\hat{f})]^2$ by way of behavior modification might be negligible relative to $Var(\hat{f})$. Because behavior modification is based on "pushing" behavior towards $\hat{f}(\vec{x}_i)$, a highly volatile $\hat{f}$ might result in erratic $do(B_i)$ modifications in terms of magnitude or even direction. Hence the choice of predictive model or algorithm can be detrimental to the effectiveness of behavior modification.
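The large-sample scenarios can be checked numerically against eq. (7). The sketch below (hypothetical numbers throughout) fixes $\hat{f}$ so that $Var(\hat{f}) \approx 0$, simulates manipulated outcomes with a given mean shift (CATE) and noise variance $\tilde{\sigma}^2$, and confirms that $\widetilde{EPE}$ is minimized when the manipulation exactly counters the bias:

```python
import numpy as np

rng = np.random.default_rng(4)

# Large-sample setting (Scenarios 1-2): treat f-hat as fixed, so Var(f-hat) ~ 0.
# All numbers below are hypothetical.
f_x0 = 4.5              # true f(x0) under no manipulation
fhat_x0 = 5.4           # the algorithm's prediction at x0
bias = f_x0 - fhat_x0   # Bias(f-hat(x0)) = f(x0) - E(f-hat(x0)), here -0.9

sigma2_tilde = 0.5      # noise variance under do(B)
reps = 200_000

def epe_tilde(cate):
    """Monte Carlo estimate of E[((y | do(B), x0) - f-hat(x0))^2] for a given CATE."""
    y_tilde = f_x0 + cate + rng.normal(0.0, np.sqrt(sigma2_tilde), reps)
    return np.mean((y_tilde - fhat_x0) ** 2)

# eq. (7) with Var(f-hat) = 0: EPE-tilde = sigma-tilde^2 + (CATE + Bias)^2,
# minimized when the manipulation exactly counters the bias (CATE = -Bias)
results = {cate: epe_tilde(cate) for cate in (-bias, 0.0, bias)}
```

With `cate = -bias`, only the manipulated noise $\tilde{\sigma}^2$ remains; pushing in the wrong direction (`cate = bias`) doubles the mean error and inflates $\widetilde{EPE}$ instead.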
We described a new strategy that might be used by BBD platforms for reducing prediction error, one completely different from the approaches taken by the fields of statistics and machine learning. This strategy involves behavior modification, and therefore formalizing it in technical language requires supplementing predictive notation with causal terminology. While our $\widetilde{EPE}$ formula also applies to behavior modification for commercial benefit (e.g., advertising), we have focused on the more extreme case of a potentially rogue BBD platform aiming to minimize prediction errors, or unintentionally doing so by using automated personalization techniques such as reinforcement learning. The platform's and its clients' efforts can be misaligned, as in risk prediction applications where the client aims to reduce risk while the platform pushes risky users towards the risky action. Using the $do(\cdot)$ operator, we are able to describe the entire system, including the training dataset, the predictive algorithm, and the behavior modification. We are also able to distinguish this strategy from two related, but different, behavior modification usages commonly employed by companies: A/B testing and uplift modeling. The same distinction can be applied to other related approaches such as reinforcement learning, which is especially relevant due to its combined use of prediction and behavior modification for personalization.

Contrasting the bias-variance decompositions of the manipulated and non-manipulated scenarios highlighted two key sources of the manipulated prediction error: the CATE-bias relationship, and its tradeoff with the manipulated noise variance. We now use these insights to return to the questions we posed earlier.
BBD has been shown to be very noisy, sparse, and high-dimensional (De Cnudde et al., 2020). Behavior modification can improve $\widetilde{EPE}$ by countering the predictive algorithm's bias as well as by reducing the noise variance. This means that poor performance of a predictive algorithm, due to the algorithm's bias and variance and/or due to the data's noisiness, can be masked by $do(B)$. Therefore, customers of BBD platforms wanting to achieve the (manipulated) prediction accuracy level demonstrated by the platform must purchase both the predictions and the ability to apply a behavior modification similar to the one performed by the BBD platform. Purchasing the predictions alone might uncover a much weaker predictive performance when they are deployed to non-manipulated users (or with a less effective modification).

Can the no-manipulation $EPE$ be inferred from the manipulated $\widetilde{EPE}$?

The difference between the two quantities, the no-manipulation $EPE$ and the behavior-modified $\widetilde{EPE}$, involves $CATE$, $Bias(\hat f)$, $\sigma^2$, and $\tilde\sigma^2$:

$$\widetilde{EPE} - EPE = \tilde\sigma^2 - \sigma^2 + CATE^2 + 2 \times CATE \times Bias(\hat f).$$

Some of these quantities can be estimated (e.g., CATE), while others are difficult, if not impossible, to estimate. Hence, it is unlikely that the no-manipulation predictive power can be ascertained from the manipulated $\widetilde{EPE}$. This means customers of BBD platforms who want to evaluate the no-manipulation predictive power will need to obtain (purchase) information about the estimated $EPE$ at the time of algorithm testing.
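The masking effect described above can be sketched with a small simulation (all numbers are invented for illustration; in particular, the assumption that the platform pushes each user's mean outcome 80% of the way towards the prediction is not from the paper). A platform that pushes outcomes towards its predictions reports a small error, while a client deploying the same predictions to non-manipulated users observes a much larger one.

```python
import random

random.seed(7)

def f(x):
    # True mean behavior of a non-manipulated user
    return 3.0 * x

def f_hat(x):
    # The platform's model, carrying some bias (invented for illustration)
    return 3.0 * x + 1.0

def mse(pairs):
    return sum((y - yhat) ** 2 for y, yhat in pairs) / len(pairs)

xs = [random.uniform(0, 2) for _ in range(50_000)]

# Platform-side evaluation: each user's mean outcome is pushed 80% of the
# way towards the prediction before the error is measured
pushed = [(0.2 * f(x) + 0.8 * f_hat(x) + random.gauss(0, 0.5), f_hat(x))
          for x in xs]

# Client-side deployment: the same predictions, but no behavior modification
plain = [(f(x) + random.gauss(0, 0.5), f_hat(x)) for x in xs]

print(f"error shown by platform: {mse(pushed):.2f}")  # small
print(f"error at deployment:     {mse(plain):.2f}")   # much larger
```

The predictions are identical in both rows; only the presence of the manipulation differs, so the platform's demonstrated accuracy does not generalize to the client's setting.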
Platform clients who regularly run A/B tests on the platform are not likely to detect the error minimization strategy, because of the random allocation of users in an A/B test. This randomization spreads the B-"affected" users across the A and B conditions, and therefore the difference between the group averages will cancel out the B effect. The A/B test statistic and its statistical significance are therefore not impacted by B.

An ideal behavior modification reduces not only the average magnitude of the errors, but also their variability, so that errors are more consistent across users. This highlights the role of the personalized prediction that companies now invest in: a user's personal data $\vec x_i$ are used to select the best modification $do(B_i) = h(B_i, \vec x_i)$, where the range of $B_i$ choices includes not only different stimuli (e.g., different content displays: "People who do a lot of research on products may see an ad that features positive product reviews, whereas those who have signed up for regular deliveries of other products in the past might see an ad offering a discount for those who 'Subscribe & Save.'"), but also different types of reinforcement (e.g., positive reinforcement such as rewards, recognition, and praise, vs. negative reinforcement such as time pressure and social pressure). For example, Kosinski et al. (2013) showed how Facebook users' Likes can predict their psychological attributes, ranging from sexual orientation to intelligence, and suggested that including such attributes can improve personalized interventions: "online insurance advertisements might emphasize security when facing emotionally unstable (neurotic) users but stress potential threats when dealing with emotionally stable ones." Personalized interventions are also becoming more powerful with the introduction of reinforcement learning, which personalizes the system's behavior by using users' traits ($\vec x_i$) combined with their implicit feedback (den Hengst et al., 2020).

Personalized predictions have the potential to minimize $\widetilde{EPE}$ more equally, both within a certain user profile $\vec x$ and across different user profiles, by lowering the conditional bias via manipulating CATE, and by shrinking the (manipulated) outcomes' variance.

Finally, the bias-variance decomposition highlights the important role of a large training dataset in minimizing the predictive algorithm's variance
$Var(\hat f)$. Since predictive models trade off bias and variance, in the behavior modification context low-bias algorithms are advantageous in that they require a smaller manipulation effect to minimize EPE. One avenue for further research is the effect of behavior modification on classification accuracy (for binary outcomes), where the effects of bias and variance on EPE are multiplicative rather than additive, and the literature reports conflicting results on their roles (e.g., Friedman, 1997; Domingos, 2000).

Behavior modification, now pervasively applied by BBD platforms to their "data subjects", is geared towards optimizing the platform's commercial interest, often at the cost of users' well-being and agency. "Persuasive technology", a design philosophy now implemented on platforms from e-commerce sites and social networks to smartphones and fitness wristbands, aims at generating "behavioral change" and "habit formation", most often without the user's knowledge or consent (Rushkoff, 2019). This application of behavior modification to platform users diverges from its application to employees for increasing the organization's productivity and workers' job satisfaction.
And, clearly, such use diverges from the original intention of behavior modification procedures "to change socially significant behaviors, with the goal of improving some aspect of a person's life" (Miltenberger, 2015).

Given the often conflicting goals of data subjects and the platforms that collect and use their data as well as manipulate their behavior, it is important to introduce these causal mechanisms into the predictive environment, so that our statistics, machine learning, and computational social science communities can study their technical properties and implications. By introducing and integrating causal notation into the predictive terminology, we can start studying how behavior modification can create "better" predictions. This allows examining the effects of different behavior modification types, magnitudes, variations, and directions on anticipated outcomes.
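As a minimal example of the kind of analysis this vocabulary enables (all quantities below are invented for illustration), one can trace how the manipulated error changes with the modification magnitude, using the decomposition $\widetilde{EPE} - EPE = \tilde\sigma^2 - \sigma^2 + CATE^2 + 2 \times CATE \times Bias(\hat f)$ and checking it against a Monte Carlo simulation:

```python
import random

random.seed(3)

# Invented setup: a fixed input profile with Bias(f_hat) = f - E(f_hat) = -0.5,
# noise SD sigma = sigma_tilde = 1, and a model with Var(f_hat) ~ 0
BIAS, SIGMA, N = -0.5, 1.0, 200_000

def simulated_epe_tilde(cate):
    # Monte Carlo: at this profile, y_do - f_hat = CATE + Bias + eps_tilde
    return sum((cate + BIAS + random.gauss(0, SIGMA)) ** 2
               for _ in range(N)) / N

results = {}
for cate in (0.0, 0.25, 0.5, 0.75):
    simulated = simulated_epe_tilde(cate)
    # Decomposition with Var(f_hat) = 0: EPE_tilde = sigma^2 + (CATE + Bias)^2
    formula = SIGMA ** 2 + (cate + BIAS) ** 2
    results[cate] = (simulated, formula)
    print(f"CATE={cate:.2f}  simulated={simulated:.3f}  formula={formula:.3f}")
```

The simulated and analytical values agree, and both are minimized at $CATE = -Bias(\hat f) = 0.5$, the "make predictions true" sweet spot; larger or smaller manipulations leave residual error.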
Conclusion
Behavior modification can make users’ behavior not only more predictable but also more homogeneous. However,this apparent “predictability" is not guaranteed to generalize when the predictions are used by platform clients outsideof the platform environment, or within the platform with a different or no behavior modification strategy. Outcomespushed towards their predictions can also be at odds with the client’s intention, and harmful to the manipulatedusers. While platforms have the incentive and capabilities to minimize prediction errors, such minimization mighteven occur without the platform’s knowledge, due to automated personalization techniques that combine users’ dataand their implicit feedback. It is therefore critical to have a useful technical vocabulary that integrates intentionalbehavior modification into the correlation-based predictive framework to enable studying such contemporarystrategies.
References
Agrawal, A., Gans, J., and Goldfarb, A. (2018). Prediction machines: The simple economics of artificial intelligence. Harvard Business Press.

Athey, S. and Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353-7360.

De Cnudde, S., Martens, D., Evgeniou, T., and Provost, F. (2020). A benchmarking study of classification techniques for behavioral data. International Journal of Data Science and Analytics, 9(2):131-173.

den Hengst, F., Grua, E. M., el Hassouni, A., and Hoogendoorn, M. (2020). Reinforcement learning for personalization: A systematic literature review. Data Science, pages 1-41.

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning, pages 231-238.

Fogg, B. J. (2002). Persuasive technology: Using computers to change what we think and do. Morgan Kaufmann.

Friedman, J. H. (1997). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1(1):55-77.

Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58.

Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.

Kohavi, R. and Longbotham, R. (2017). Online controlled experiments and A/B testing. In Sammut, C. and Webb, G. I., editors, Encyclopedia of Machine Learning and Data Mining. Springer.

Kohavi, R. and Thomke, S. (2017). The surprising power of online experiments. Harvard Business Review.

Kosinski, M., Stillwell, D., and Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110(15):5802-5805.

Kramer, A. D., Guillory, J. E., and Hancock, J. T. (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 111(24):8788-8790.

Lo, V. S. (2002). The true lift model: A novel data mining approach to response modeling in database marketing. ACM SIGKDD Explorations Newsletter, 4(2):78-86.

Martens, D., Provost, F., Clark, J., and de Fortuny, E. J. (2016). Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly, 40(4):869-888.

Michie, S., Richardson, M., Johnston, M., Abraham, C., Francis, J., Hardeman, W., Eccles, M. P., Cane, J., and Wood, C. E. (2013). The behavior change technique taxonomy (v1) of 93 hierarchically clustered techniques: Building an international consensus for the reporting of behavior change interventions. Annals of Behavioral Medicine, 46(1):81-95.

Miltenberger, R. G. (2015). Behavior modification: Principles and procedures. Cengage Learning, 6th edition.

Nord, W. R. and Peter, J. P. (1980). A behavior modification perspective on marketing. Journal of Marketing, 44(2):36-47.

Pearl, J. (2009). Causality. Cambridge University Press, 2nd edition.

Provost, F. and Fawcett, T. (2013). Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

Radcliffe, N. J. and Surry, P. D. (2011). Real-world uplift modelling with significance-based uplift trees. White Paper TR-2011-1, Stochastic Solutions, pages 1-33.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.

Rushkoff, D. (2019). Team Human. WW Norton & Company.

Rzepakowski, P. and Jaroszewicz, S. (2012). Uplift modeling in direct marketing. Journal of Telecommunications and Information Technology, pages 43-50.

Shalit, U., Johansson, F. D., and Sontag, D. (2017). Estimating individual treatment effect: Generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3076-3085. JMLR.

Shmueli, G. (2017). Research dilemmas with behavioral big data. Big Data, 5(2):98-119.

Thaler, R. H. and Sunstein, C. R. (2009). Nudge: Improving decisions about health, wealth, and happiness. Penguin.

Yeung, K. (2017). 'Hypernudge': Big data as a mode of regulation by design. Information, Communication & Society, 20(1):118-136.

Zuboff, S. (2019). The age of surveillance capitalism: The fight for a human future at the new frontier of power. Profile Books.

Appendix

A Differences between A/B testing, uplift modeling, and error minimization
As mentioned earlier, behavior modification techniques are used by BBD platforms for two purposes other than reducing prediction error: estimating the overall effect of a behavior modification intervention (A/B testing), and predicting personalized user reactions to a behavior intervention for precision targeting (uplift modeling). We now show how these two purposes differ from error minimization and from each other. Table 2 summarizes the key differences.

Table 2: Differences between A/B testing, uplift modeling, and error minimization
A/B testing:
- Business goal: test the effectiveness of a new design/feature
- Analysis goal: test the average effect of the new feature, $ATE = \frac{1}{n_B}\sum_{i=1}^{n_B} \tilde y_i - \frac{1}{n_A}\sum_{i=1}^{n_A} \tilde y_i$
- Sequence of events: intervention → estimation
- Intervention levels: $do(B=0)$, $do(B=1)$
- X used for: not used, or for CATE

Uplift modeling:
- Business goal: effective user targeting
- Analysis goal: predict the uplift for each user $i$: $uplift_i = \hat{\tilde y}_{i,do(B=1)} - \hat{\tilde y}_{i,do(B=0)}$
- Sequence of events: intervention → prediction
- Intervention levels: $do(B=0)$, $do(B=1)$
- X used for: training the predictive model(s)

Error minimization:
- Business goal: increasing the value of predictive products
- Analysis goal: minimize the overall prediction error, e.g., $MSE = \frac{1}{n}\sum_i (\tilde y_i - \hat y_i)^2$
- Sequence of events: prediction → intervention
- Intervention levels: $do(B_i)$ (personalized)
- X used for: training $\hat f$ and personalizing $do(B_i)$

A.1 Estimating the overall effect of a behavior modification intervention for product improvement (A/B testing)
BBD platforms try to improve their products and users' experience on an ongoing basis, from updating website design to introducing new features, new ad layouts, marketing emails, and more. To do this, they employ A/B tests, which compare the impact of a new design/feature (version B) against that of an existing one (version A). Amazon, Microsoft, Facebook, Google, and similar companies conduct thousands to tens of thousands of A/B tests each year, on millions of users, testing user interface changes, enhancements to algorithms (search, ads, personalization, recommendation, etc.), changes to apps, content management systems, and more (Kohavi and Longbotham, 2017; Kohavi and Thomke, 2017). The A/B test is a simple randomized experiment comparing the average outcome of a treatment group to that of a control group. Suppose that version A is the current website functionality ($B = 0$), and version B is a new feature ($B = 1$). The A/B testing process is as follows:

1. Randomly assign $n_A$ users to condition A ($do(B = 0)$) and $n_B$ users to condition B ($do(B = 1)$).
2. Measure the outcome for users in condition A ($\tilde y_i$, $i = 1, \ldots, n_A$) and in condition B ($\tilde y_i$, $i = 1, \ldots, n_B$).
3. Estimate the average treatment effect: $ATE = \frac{1}{n_B}\sum_{i=1}^{n_B} \tilde y_i - \frac{1}{n_A}\sum_{i=1}^{n_A} \tilde y_i$.
4. Use statistical inference and effect magnitude estimates to determine whether the new feature adds value.

When there is interest in the ATE for certain subgroups, such as by the user's language, steps 3-4 can be supplemented with CATE. In summary: A/B tests are used to determine the effectiveness of a new behavior modification feature compared to an existing one. This is done by comparing the average outcomes of the two randomly assigned $do(B)$ groups, and estimating and testing the ATE or CATE.

A.2 Predicting personalized behavior modification effects for maximizing conversion/revenue (uplift modeling)
Uplift modeling, also known as differential response analysis or true lift modeling (Rzepakowski and Jaroszewicz, 2012; Radcliffe and Surry, 2011), is used in precision marketing and in political persuasion for identifying people who will modify their behavior (e.g., purchase a product, or vote for a candidate) conditional on being given a particular treatment (e.g., receiving a coupon, or a phone call), assuming the treatment can also cause a negative outcome for some people. In uplift modeling, an intervention is applied to a randomly sampled group of people. The resulting data, along with data from a control group, are used to build predictive model(s) for predicting a person's change in response, or uplift, due to the intervention. (Uplift modeling includes a two-model approach, where models of the form $\tilde y = g(X) + \tilde\epsilon$ are trained separately on the treatment and control groups, and a single-model approach that trains a single model on the combined dataset: $\tilde y = h(X, B) + \tilde\delta$.) The predictive model(s) produce predicted outcomes under $do(B = 0)$ and $do(B = 1)$, which are then combined to obtain $uplift_i = \hat{\tilde y}_{i,do(B=1)} - \hat{\tilde y}_{i,do(B=0)}$. Finally, the uplift values are used to determine which users to treat ($do(B = 1)$) and for which to avoid treatment ($do(B = 0)$). This process is as follows:

1. Randomly assign $n_A$ users to condition A ($do(B = 0)$) and $n_B$ users to condition B ($do(B = 1)$).
2. Measure the outcome and predictors for users in condition A ($\{\tilde y_i, \vec x_i\}$, $i = 1, \ldots, n_A$) and in condition B ($\{\tilde y_i, \vec x_i\}$, $i = 1, \ldots, n_B$).
3. Train a predictive model of $\tilde y$ on $X, B$ (e.g., Lo, 2002).
4. Use the model to predict $\tilde y_{i,do(B=1)}$ and $\tilde y_{i,do(B=0)}$, and compute $uplift_i = \hat{\tilde y}_{i,do(B=1)} - \hat{\tilde y}_{i,do(B=0)}$.
5. Based on the uplift values, determine who should be treated.

A.3 Behavior modification for minimizing prediction error
The process of minimizing prediction error can be summarized as follows:

1. Collect observational predictors $X, B$ and outcomes $\vec y$ for a sample of $n$ users.
2. Build a predictive model $\hat y = \hat f(X, B)$. Compute personalized predicted scores $\hat y_i = \hat f(\vec x_i, B_i)$ for each user.
3. Apply behavior modification $do(B_i)$ to each user to push their outcome towards their predicted value $\hat y_i$.
4. Compute the manipulated prediction errors $\tilde y_i - \hat y_i$, and summarize them to demonstrate impressive predictive power.

B Derivation of the $\widetilde{EPE}$ bias-variance decomposition (Equation 7)
The derivation of Equation 7 (the bias-variance decomposition under $do(B)$) is as follows. For convenience, we use $f$ and $\hat f$ to denote the no-manipulation true function and its estimated model. For the manipulated scenario, we use $f_{do}$ to denote the true function under behavior modification $do(B)$.

For a new observation with inputs $\vec x$ and manipulated outcome $y \mid do(B), \vec x$, we can decompose the expected prediction error as follows (for convenience, we drop the subscript $i$):

$$\widetilde{EPE}(\vec x) = E\big[(y \mid do(B), \vec x) - \hat f(\vec x)\big]^2 = E(\tilde y - \hat f)^2 = E(\tilde y - f_{do} + f_{do} - \hat f)^2 = E(\tilde y - f_{do})^2 + E(f_{do} - \hat f)^2 + 2E\big[(\tilde y - f_{do})(f_{do} - \hat f)\big]. \quad (8)$$

These three terms can be further simplified. The first term can be simplified by noting that $\tilde y = f_{do} + \tilde\epsilon$:

$$E(\tilde y - f_{do})^2 = E(\tilde\epsilon^2) = \tilde\sigma^2. \quad (9)$$

The second term can be written as:

$$E(f_{do} - \hat f)^2 = E\big(f_{do} - E(\hat f) + E(\hat f) - \hat f\big)^2 = \big(f_{do} - E(\hat f)\big)^2 + E\big(\hat f - E(\hat f)\big)^2 = \big(f_{do} - E(\hat f)\big)^2 + Var(\hat f) \quad (10)$$

because the cross product is zero:

$$2E\big[\big(f_{do} - E(\hat f)\big)\big(E(\hat f) - \hat f\big)\big] = 2\big(f_{do} - E(\hat f)\big)\big(E(\hat f) - E(\hat f)\big) = 0. \quad (11)$$

We can further write Equation 10 as a function of the bias and variance of $\hat f$:

$$\big[f_{do} - E(\hat f)\big]^2 + Var(\hat f) = \big[f_{do} - f + f - E(\hat f)\big]^2 + Var(\hat f) = \big[CATE + Bias(\hat f)\big]^2 + Var(\hat f). \quad (12)$$

Finally, using the independence of the new observation's prediction error $\tilde\epsilon$ from the prediction $\hat f$ based on the training data ($E(\tilde\epsilon \hat f) = 0$), the third term can be shown to be zero:

$$2E\big[(\tilde y - f_{do})(f_{do} - \hat f)\big] = 2E\big[\tilde\epsilon\,(f_{do} - \hat f)\big] = 2\big[f_{do}\,E(\tilde\epsilon) - E(\tilde\epsilon \hat f)\big] = 0. \quad (13)$$

Therefore, we can write $\widetilde{EPE}$ from Equation (8) as: