Estimating Heterogeneous Treatment Effects in Residential Demand Response
WORKING PAPER. LAST UPDATED: OCTOBER 23, 2018
Datong P. Zhou∗, Maximilian Balandat†, and Claire J. Tomlin†, Fellow, IEEE
Abstract—We evaluate the causal effect of hour-ahead price interventions on the reduction in residential electricity consumption using a data set from a large-scale experiment on 7,000 households in California. By estimating user-level counterfactuals using time-series prediction, we estimate an average treatment effect of ≈11% per DR event and user. We further detect heterogeneous treatment effects with causal decision trees fitted on household census data, and benchmark them against a k-means clustering approach that naively detects heterogeneity in the feature space, confirming the superiority of causal decision trees. Lastly, we comment on how our methods to detect heterogeneity can be used for targeting households to improve cost efficiency.

Index Terms—Demand Response, Randomized Controlled Trial, Smart Grid, Smart Meter Data, Time Series, Individual Treatment Effects, Average Treatment Effect, Causal Inference, Decision Trees, Causal Trees, k-Means Clustering.

I. INTRODUCTION
This paper studies the causal effect of incentivizing residential households to participate in Demand Response (DR) to temporarily reduce electricity consumption. DR has been promoted by the introduction of demand-side management (DSM) programs after the 1970s energy crisis [1], enabled by the integration of information and communications technology in the electric grid. The rationale behind DSM is to alleviate the inelasticity of energy supply due to the slowness of power plants' output adjustment, which causes small increases and decreases in demand to result in a price boom or bust, respectively. DSM attempts to protect utilities against such price risks by partially relaying them to end-users, increasing market efficiency [2].

In 2015, the California Public Utilities Commission launched a Demand Response Auction Mechanism (DRAM) [3], requiring electric utilities to procure a certain amount of reduction capacity from DR providers. These aggregators incentivize their customers (also called "Proxy Demand Resource" (PDR) [4]) under contract to temporarily reduce their consumption relative to their projected usage without intervention, referred to in this context as the counterfactual or baseline, based on which compensations for (non-)fulfilled reductions are settled: If the consumer uses less (more) energy than the baseline, she receives a reward (incurs a penalty). Figure 1 illustrates the interactions between agents.

∗Dept. of Mechanical Engineering, University of California, Berkeley (UCB). †Dept. of Electrical Engineering and Computer Sciences, UCB. [datong.zhou, balandat, tomlin]@berkeley.edu
[Figure: Wholesale Market — Scheduling Coordinator — DR Provider (PDR) — Electric Utility — End-Use Customers]

Fig. 1: Interactions of Agents in Residential DR
The estimation of the materialized reduction arguably is the most critical component of the DR bidding process. If the reductions are estimated with a biased counterfactual, either the DR provider or the utility clearing the bids is systematically discriminated against. If the baseline is unbiased but plagued by high variance, the profit settlement is highly volatile. Existing baselines employed by major power grid operators in the United States (e.g. the California Independent System Operator (CAISO) and the New York ISO) are calculated as simple arithmetic averages of previous observations [4] and are therefore inaccurate. The estimation of more accurate baselines is a significant contribution of this paper.
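The arithmetic-average baselines mentioned above can be sketched in a few lines. The helper below is a hypothetical illustration of the general averaging idea, not the grid operators' exact eligibility and adjustment rules:

```python
import statistics

def averaging_baseline(same_hour_history, n_days=10):
    """Hypothetical sketch of an averaging baseline: predict consumption
    for a given hour as the mean of the readings observed at that hour
    on the n_days most recent comparable days.
    `same_hour_history` lists past same-hour kWh readings, oldest first."""
    if len(same_hour_history) < n_days:
        raise ValueError("not enough history")
    return statistics.mean(same_hour_history[-n_days:])

# Ten past same-hour readings on comparable days (kWh):
history = [1.2, 1.1, 1.3, 1.0, 1.2, 1.4, 1.1, 1.2, 1.3, 1.2]
print(round(averaging_baseline(history), 3))  # 1.2
```

Because such a rule ignores weather and other covariates, any systematic difference between the averaging window and the event day translates directly into baseline bias.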
A. Contributions
We estimate the average treatment effect (ATE) of hour-ahead notifications on the reduction of electricity consumption by evaluating a Randomized Controlled Trial (RCT) on ≈7,000 households. We estimate the average reduction per DR event and user and discover noticeable geographic and temporal heterogeneity among users, as the largest estimated reductions occur in summer months as well as in regions with warmer climate. Further, we utilize household census data to detect treatment effect heterogeneity of users by modifying the training procedure of classification and regression trees [5] in a way that partitions the feature space into axis-aligned regions with piecewise constant treatment effects. The splits at each interior node are optimized by penalizing variation of outcomes within the treatment and control groups and rewarding differences of the mean outcome between treated and control units. This approach is inspired by [6], but departs from the original model to fit our setting. Additionally, we benchmark its performance against a more naive k-means clustering approach that simply fits centroids in the feature space, each of which corresponds to a particular "group" with an identical treatment effect.
B. Background and Related Work
Performing causal inference to estimate the effect of an intervention on a metric of interest has long been tantamount to setting up a randomized controlled trial (RCT), whereby the difference in means between the control and treatment groups is defined to be the average treatment effect. However, many experiments are infeasible due to budget and ethical concerns. With the rapid growth of collected user data, a new direction of research at the intersection of machine learning and economics is concerned with using nonexperimental estimation techniques in lieu of RCTs. The general idea of such techniques is to partition observations into treatment and control sets in order to fit a nominal model on the latter, which, when applied to the treatment set, yields the treatment effect of interest. Examples of such models are [7], who evaluates welfare effects of home automation by calculating the Kolmogorov-Smirnov statistic between users, and [8], who constructs a convex combination of US states as the counterfactual estimate for tobacco consumption to estimate the effect of a tobacco control program in California.

Our setting involves time-series prediction of smart meter data, which is essentially a short-term load forecasting (STLF) problem. Within STLF, commonly employed tools are ARIMA models with a seasonal component [9] and classic regression models, among which support vector regression (SVR) and neural networks yield the highest accuracy [10], [11]. For a comprehensive comparison of ML techniques for forecasting and differing levels of load aggregation, see [12].

In our previous work [13], [14], [15], we have developed de-biased nonexperimental estimators for predicting individual treatment effects (ITEs) of a residential DR program and benchmarked their accuracy against an experimental gold standard.
We achieved this by modifying unit-level fixed effects, which are often used in fixed effects regressions in applied economics [16], into explicit user-level terms, making the resulting regression model amenable to general ML models. While we begin by reviewing the concepts of nonexperimental estimators and an overview of the experimental setting, the focus of this paper lies in more advanced analyses using k-means clustering and causal decision trees to explore heterogeneity of treatment effects across the population based on demographic features. Pursuing this direction is well-motivated from a practical perspective, as detecting heterogeneous subpopulations would allow us to design adaptive targeting schemes that customize incentives to maximize a certain objective, e.g. cost efficiency.

II. EXPERIMENTAL SETUP
A. Setup of the Experiment
The experiment is carried out by OhmConnect, Inc., using funds provided by the California Energy Commission. Figure 2 shows a flowchart of the experimental setup. Over the course of the experimental time period (Nov. 2016 - Dec. 2017), each consumer that signs up for the study is randomly assigned to one of the following groups:
[Figure: Phase 1 (90 days): All Participants → Treatment-Encouraged / Treatment-Non-Encouraged / Control; Phase 2 (90 days): Estimate Responses → Targeted-High / Targeted-Low / Non-Targeted; Phase 3 (90 days): Shuffle Users → Moral Suasion / Price / Moral Suasion & Price / Control]

Fig. 2: Setup of Experiment

• Treatment-Encouraged: The user receives an average number of 25 DR events in the 90 days following recruitment, with event incentive levels selected uniformly at random from the set R = {0.05, 0.25, 0.50, 1.00, 3.00} USD/kWh. Also, the user is given a rebate for purchasing a smart home automation device.
• Treatment-Non-Encouraged: Same as in Treatment-Encouraged, but without the smart home automation rebate.
• Control (Initially Delayed): Users do not receive any notifications for DR events for a period of 90 days after enrollment.

These three groups form Phase 1 of the experiment. Users in Phase 1 treatment groups that have reached 90 days of age are pooled and assigned to one of three different groups for Phase 2 interventions. Users receive either large incentives, smaller ones, or a mix of both, depending on the group they are assigned to. Lastly, users with completed Phase 2 as well as Phase 1 control users after 90 days of age undergo Phase 3 (moral suasion), which occurs on an event-by-event level and selectively includes moral suasion in text messages.

In Sections III and IV, we evaluate Phase 1 of the experiment. Phases 2 and 3 are analyzed in our previous work and are hence omitted in this paper. Instead, we focus on detecting heterogeneity in treatment effects in Sections V and VI.

III. NONEXPERIMENTAL TREATMENT EFFECT ESTIMATION
A. Potential Outcomes Framework
To estimate the effect of the DR intervention program, we adopt the potential outcomes framework introduced by Rubin (1974) [17]. Let $\mathcal{I} = \{1, \dots, n\}$ denote the set of users. The indicator $D_{it} \in \{0, 1\}$ encodes whether or not user $i$ received DR treatment at time $t$. Each user is equipped with a consumption time series $y_i = \{y_{i1}, \dots, y_{i\tau}\}$ and associated covariates $X_i = \{x_{i1}, \dots, x_{i\tau}\} \in \times_{t=1}^{\tau} \mathcal{X}_i$, $\mathcal{X}_i \subset \mathbb{R}^{n_x}$, where time is indexed by $t \in \mathcal{T} = \{1, \dots, \tau\}$ and $n_x$ is the dimension of the covariate space $\mathcal{X}_i$. Let $y^0_{it}$ and $y^1_{it}$ denote user $i$'s electricity consumption at time $t$ for $D_{it} = 0$ and $D_{it} = 1$, respectively. Let $\mathcal{C}_i$ and $\mathcal{T}_i$ denote the set of control and treatment times for user $i$. That is,

$$\mathcal{C}_i = \{t \in \mathcal{T} \mid D_{it} = 0\}, \quad \mathcal{T}_i = \{t \in \mathcal{T} \mid D_{it} = 1\}. \quad (1)$$
The number of treatment hours is much smaller than the number of non-treatment hours; thus $0 < |\mathcal{T}_i| / |\mathcal{C}_i| \ll 1$. Further, let $\mathcal{D}_{i,t}$ and $\mathcal{D}_{i,c}$ denote user $i$'s covariate-outcome pairs of treatment and control times, respectively. That is,

$$\mathcal{D}_{i,t} = \{(x_{it}, y^1_{it}) \mid t \in \mathcal{T}_i\}, \quad \mathcal{D}_{i,c} = \{(x_{it}, y^0_{it}) \mid t \in \mathcal{C}_i\}. \quad (2)$$

The one-sample estimate of the treatment effect on user $i$ at time $t$, given the covariates $x_{it} \in \mathbb{R}^{n_x}$, is

$$\beta_{it}(x_{it}) := y^1_{it}(x_{it}) - y^0_{it}(x_{it}) \quad \forall i \in \mathcal{I}, \; t \in \mathcal{T}, \quad (3)$$

which varies across time, the covariate space, and the user population. Marginalizing this one-sample estimate over the set of treatment times $\mathcal{T}_i$ and the covariate space $\mathcal{X}_i$ yields the user-specific Individual Treatment Effect (ITE) $\beta_i$:

$$\beta_i := \mathbb{E}_{\mathcal{X}_i} \mathbb{E}_{t \in \mathcal{T}_i} \left[ y^1_{it} - y^0_{it} \,\middle|\, x_{it} \right] = \frac{1}{|\mathcal{T}_i|} \sum_{t \in \mathcal{T}_i} \left( y^1_{it} - y^0_{it} \right). \quad (4)$$

The average treatment effect on the treated (ATT) follows from (4):

$$\mathrm{ATT} := \mathbb{E}_{i \in \mathcal{I}} \left[ \beta_i \right] = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \frac{1}{|\mathcal{T}_i|} \sum_{t \in \mathcal{T}_i} \left( y^1_{it} - y^0_{it} \right). \quad (5)$$

Since users were put into different experimental groups in a randomized fashion, the ATT and the average treatment effect (ATE) are identical [18]. Lastly, the conditional average treatment effect (CATE) on $\tilde{x}$ is obtained by marginalizing the conditional distribution of one-sample estimates (3) on $\tilde{x}$ over all users and treatment times, where $\tilde{x} \in \mathbb{R}^{\tilde{n}_x}$ is a subvector of $x \in \mathbb{R}^{n_x}$, $0 < \tilde{n}_x < n_x$:

$$\mathrm{CATE}(\tilde{x}) := \mathbb{E}_{i \in \mathcal{I}} \mathbb{E}_{t \in \mathcal{T}_i} \left[ \left( y^1_{it} - y^0_{it} \right) \,\middle|\, \tilde{x}_{it} = \tilde{x} \right]. \quad (6)$$

The CATE captures heterogeneity among users, e.g. with respect to specific hours of the day, the geographic distribution of users, the extent to which a user possesses "smart home" appliances, group or peer effects, etc.
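Once counterfactuals are available (in practice only as estimates, cf. Section III-B), the ITE (4) and ATT (5) reduce to simple averages. A minimal sketch with hypothetical per-user data:

```python
def ite(y_treated, y0_hat):
    """ITE (4): mean difference between observed treated outcomes y1_it
    and (estimated) counterfactuals y0_it over user i's DR hours."""
    return sum(y1 - y0 for y1, y0 in zip(y_treated, y0_hat)) / len(y_treated)

def att(users):
    """ATT (5): average of the per-user ITEs.
    `users` maps user id -> (treated outcomes, counterfactual estimates)."""
    ites = [ite(y1, y0) for y1, y0 in users.values()]
    return sum(ites) / len(ites)

# Hypothetical data: two users, three DR events each (kWh).
users = {
    "a": ([0.9, 1.0, 0.8], [1.0, 1.1, 1.0]),   # ITE ≈ -0.13
    "b": ([1.2, 1.1, 1.0], [1.2, 1.2, 1.1]),   # ITE ≈ -0.07
}
print(round(att(users), 2))  # -0.1
```

Note that each user's ITE is averaged first, so users with many DR events do not dominate the ATT, matching the double sum in (5).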
To rule out the existence of unobserved factors that could influence the assignment mechanism generating the complete observed data set $\{(x_{it}, y_{it}, D_{it}) \mid i \in \mathcal{I}, t \in \mathcal{T}\}$, we make the following standard assumptions:

Assumption 1 (Unconfoundedness of Treatment Assignment). Given the covariates $\{x_{it}\}_{t \in \mathcal{T}}$, the potential outcomes are independent of treatment assignment:

$$\left( y^0_{it}, y^1_{it} \right) \perp D_{it} \mid x_{it} \quad \forall i \in \mathcal{I}, \; t \in \mathcal{T}. \quad (7)$$

Assumption 2 (Stationarity of Potential Outcomes). Given the covariates $\{x_{it}\}_{t \in \mathcal{T}}$, the potential outcomes are independent of time, that is,

$$\left( y^0_{it}, y^1_{it} \right) \perp t \mid x_{it} \quad \forall i \in \mathcal{I}, \; t \in \mathcal{T}. \quad (8)$$

Assumption 1 is the "ignorable treatment assignment" assumption introduced by Rosenbaum and Rubin [19]. Under this assumption, the assignment of DR treatment to users is implemented in a randomized fashion, which allows the calculation of unbiased ATEs (5) and CATEs (6). Assumption 2, motivated by the time-series nature of the observational data, ensures that the set of observable covariates $\{x_{it} \mid t \in \mathcal{T}\}$ can capture seasonality effects in the estimation of the potential outcomes. That is, the conditional distribution of the potential outcomes, given covariates, remains constant.

The fundamental problem of causal inference [20] refers to the fact that either the treatment or the control outcome can be observed, but never both (granted there are no missing observations). That is,

$$y_{it} = y^0_{it} + D_{it} \cdot \left( y^1_{it} - y^0_{it} \right) \quad \forall t \in \mathcal{T}. \quad (9)$$

Thus, the ITE (4) is not identified, because one and only one of both potential outcomes is observed, namely $\{y^1_{it} \mid t \in \mathcal{T}_i\}$ for the treatment times and $\{y^0_{it} \mid t \in \mathcal{C}_i\}$ for the control times. It therefore becomes necessary to estimate counterfactuals.

B. Non-Experimental Estimation of Counterfactuals
Consider the following model for the estimation of counterfactuals:

$$y_{it} = f_i(x_{it}) + D_{it} \cdot \beta_{it}(x_{it}) + \varepsilon_{it}, \quad (10)$$

where $\varepsilon_{it}$ denotes noise uncorrelated with covariates and treatment assignment. $f_i(\cdot): \mathbb{R}^{n_x} \mapsto \mathbb{R}$ is the conditional mean function and pertains to $D_{it} = 0$. To obtain an estimate for $f_i(\cdot)$, denoted by $\hat{f}_i(\cdot)$, the control outcomes $\{y_{it} \mid t \in \mathcal{C}_i\}$ are first regressed on their observable covariates $\{x_{it} \mid t \in \mathcal{C}_i\}$. In a second step, the counterfactual $\hat{y}^0_{it}$ for any $t \in \mathcal{T}_i$ can be estimated by evaluating $\hat{f}_i(\cdot)$ on its associated covariate vector $x_{it}$. Finally, subtracting $\hat{y}^0_{it}$ from $y_{it}$ isolates the one-sample estimate $\beta_{it}(x_{it})$, from which the user-specific ITE (4) can be estimated. Figure 3 illustrates this process of estimating the reduction during a DR event by comparing the materialized consumption $y_{it}$ with the predicted counterfactual $\hat{y}^0_{it} = \hat{f}_i(x_{it})$. We restrict our estimators $\hat{f}_i(\cdot)$ to a single-hour prediction horizon, as DR events are at most one hour long.
Fig. 3: Estimation of the Counterfactual $\hat{y}^0_{it}$ using Treatment Covariates $x_{it}$, and Predicted Reduction $y_{it} - \hat{y}^0_{it}$

To estimate $f_i(\cdot)$, we use the following classical regression methods [21], referred to as estimators:
• (E1): Ordinary Least Squares Regression (OLS)
• (E2): L1 Regularized (LASSO) Linear Regression (L1)
• (E3): L2 Regularized (Ridge) Linear Regression (L2)
• (E4): k-Nearest Neighbors Regression (KNN)
• (E5): Decision Tree Regression (DT)
• (E6): Random Forest Regression (RF)
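The two-step procedure can be sketched with the simplest of these estimators, OLS (E1), using a single covariate; all numbers below are synthetic stand-ins, not the study's data:

```python
import random

random.seed(0)

# Hypothetical control-period data for one user: a single covariate
# (outdoor temperature, °C) and hourly consumption (kWh) while untreated.
temps = [random.uniform(5, 30) for _ in range(200)]
loads = [0.6 + 0.02 * T + random.gauss(0, 0.05) for T in temps]

# Step 1: fit the conditional mean f_i on control times — here plain
# one-variable OLS (estimator E1).
mean_t = sum(temps) / len(temps)
mean_y = sum(loads) / len(loads)
slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(temps, loads))
         / sum((t - mean_t) ** 2 for t in temps))
intercept = mean_y - slope * mean_t

# Step 2: predict the counterfactual during a DR event and subtract it
# from the materialized consumption to get the one-sample estimate (3).
t_event = 28.0
y0_hat = intercept + slope * t_event   # estimated counterfactual
y_event = y0_hat - 0.12                # pretend the user reduced by 0.12 kWh
beta_hat = y_event - y0_hat
print(round(beta_hat, 2))  # -0.12
```

The multivariate estimators (E1)-(E6) follow the same pattern, only with the richer covariate vector (11) in place of the single temperature regressor.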
DT (E5) and RF (E6) follow the procedure of Classification and Regression Trees [5]. We compare estimators (E1)-(E6) to the CAISO 10-in-10 Baseline (BL) [4], which, for any given hour on a weekday, is calculated as the mean of the hourly consumptions on the 10 most recent business days during the selected hour. For weekend days and holidays, the mean of the 4 most recent observations is calculated. This BL is further adjusted with a Load Point Adjustment, which corrects the BL by a factor proportional to the consumption three hours prior to a DR event [4].

To estimate user $i$'s counterfactual outcome $\hat{y}^0_{it}$ during a DR event $t \in \mathcal{T}_i$, we use the following covariates:
• Hourly consumption measurements preceding time $t$
• Air temperature at time $t$ and the 4 preceding measurements
• Hour of the day, an indicator variable for (non-)business days, and month of the year as categorical variables

Thus, the covariate vector writes

$$x_{it} = \left[ y_{i,t-1} \cdots \;\; T_{it} \cdots T_{i,t-4} \;\; \mathrm{C}(\mathrm{HoD}_{it}) : \mathrm{C}(\mathrm{is\_Bday}_{it}) \;\; \mathrm{C}(\mathrm{MoY}_{it}) \right]. \quad (11)$$

In (11), $T_{it}$ denotes temperature, $\mathrm{HoD}_{it}$ the hour of day, $\mathrm{is\_Bday}_{it}$ an indicator for business days, and $\mathrm{MoY}_{it}$ the month of year for user $i$ at time $t$. "C" denotes categorical variables and ":" their interaction.

C. Estimation of Individual Treatment Effects
To obtain point estimates for user $i$'s ITE $\beta_i$, we simply average all one-sample estimates (3) according to (4). To obtain an estimate of whether or not a given user $i$ has actually reduced consumption, we utilize a nonparametric permutation test with the null hypothesis of a zero ITE:

$$H_0: \beta_i = 0, \quad H_1: \beta_i \neq 0. \quad (12)$$

Given user $i$'s paired samples $\{z_{it} = y_{it} - \hat{y}^0_{it} \mid t \in \mathcal{T}_i\}$ during DR periods, the $p$-value associated with $H_0$ (12) is

$$p = \frac{\sum_{D \in \mathcal{P}_i} \mathbb{1}\left( \bar{D} \leq \hat{\beta}_i \right)}{2^{|\mathcal{T}_i|}}. \quad (13)$$

In (13), $\bar{D}$ denotes the mean of $D$, and $\mathcal{P}_i$ denotes the set of all possible assignments of signs to the pairwise differences in the set $\{z_{it} \mid t \in \mathcal{T}_i\}$. That is,

$$\mathcal{P}_i = \left\{ s_1 z_{i1}, \dots, s_{|\mathcal{T}_i|} z_{i|\mathcal{T}_i|} \,\middle|\, s_j \in \{\pm 1\}, \; 1 \leq j \leq |\mathcal{T}_i| \right\}, \quad (14)$$

which is of size $2^{|\mathcal{T}_i|}$. Finally, the $p$-value from (13) is calculated as the fraction of all possible sign assignments whose means are less than or equal to the estimated ITE $\hat{\beta}_i$. In practice, as the number of DR events per user in Phase 1 is about 25 (see Figure 2), the number of total possible assignments becomes computationally infeasible. Thus, we randomly generate a subset of assignments from $\mathcal{P}_i$ to compute the $p$-value in (13). Moreover, we use the percentile bootstrap method [22] to compute a confidence interval of the estimated ITE for user $i$ around the point estimate $\hat{\beta}_i$.

IV. NONEXPERIMENTAL ESTIMATION RESULTS
A. Average Treatment Effect
Figure 4 shows ATE point estimates and their 99% bootstrapped confidence intervals conditional on differing reward levels for all estimators as well as the CAISO BL. Due to the empirical de-biasing procedure (cf. [13], [14], [15]), the point estimates for estimators (E1)-(E6) are close to each other. BL appears to be biased in favor of the users, as it systematically predicts larger reductions than (E1)-(E6).
Fig. 4: CATEs by Incentive Level & Confidence Intervals
The ATE averaged over the predictions of estimators (E1)-(E6) corresponds to an approximately 11% consumption reduction per DR event and user. The slope of the fitted demand curve is -0.013 kWh/USD, meaning that users reduce an additional 0.013 kWh per dollar offered, a small change. Due to the idiosyncratic nature of the CATE at individual incentive levels, the slope and intercept have to be interpreted with caution. However, the results give rise to a notable correlation between incentive levels and reductions.

To compare the prediction accuracy of the estimators, Table I reports the width of the confidence intervals for each method and incentive level. The inferiority of the CAISO baseline compared to the non-experimental estimators, among which RF achieves the tightest confidence intervals, becomes apparent. Therefore, in the remainder of this paper, we restrict all results achieved with non-experimental estimators to those obtained with RF.
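The demand-curve fit above is an ordinary least-squares line through the five (incentive, reduction) pairs. A minimal sketch, with illustrative stand-in values rather than the paper's estimates:

```python
# Hypothetical CATE point estimates (kWh) at the five incentive levels.
incentives = [0.05, 0.25, 0.50, 1.00, 3.00]          # USD/kWh
reductions = [-0.08, -0.09, -0.10, -0.10, -0.12]     # illustrative only

# One-variable OLS: slope and intercept of the demand curve.
mean_x = sum(incentives) / len(incentives)
mean_y = sum(reductions) / len(reductions)
slope = (sum((x - mean_x) * (y - mean_y)
             for x, y in zip(incentives, reductions))
         / sum((x - mean_x) ** 2 for x in incentives))
intercept = mean_y - slope * mean_x

# A negative slope means larger rewards buy (slightly) larger reductions.
print(slope < 0, round(intercept, 3))
```

With only five support points and an idiosyncratic CATE at individual levels, the fitted slope is sensitive to single observations, which is why the text advises caution in interpreting it.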
TABLE I: Width of 99% Confidence Intervals (kWh) around the ATE Point Estimate, by Incentive Level (0.05, 0.25, 0.50, 1.00, 3.00 USD/kWh) and Estimator (BL, KNN, OLS, L1, L2, DT, RF)
Figure 5 plots the CATE broken out by month of the year, where we compare the estimates from the Random Forest estimator to a Fixed Effects (FE) estimator, which serves as an experimental gold standard. For further explanation of this comparison, the reader is referred to our previous work [15]. Here we focus our attention on the larger reductions in summer months, which are presumably attributable to an increased usage of air conditioning.
B. Individual Treatment Effects
Fig. 5: CATE by Month, Fixed Effects Estimator (cf. [15]) vs. Random Forest Estimator

Figure 6 plots ITEs for a randomly selected subset of 800 users who received at least 10 DR events in Phase 1, estimated with RF. Users are sorted by their point estimates (blue), whose 95% bootstrapped confidence intervals are drawn in black. Yellow lines represent users with at least one active smart home automation device. By marginalizing the point estimates over all users with at least 10 events, we obtain an ATE that differs slightly from the one reported earlier; the difference ensues from only considering users with at least 10 DR events.
Fig. 6: Distribution of ITEs with Confidence Intervals
Table II reports estimated ATEs for users with or without active smart home automation devices, which are obtained by aggregating the relevant estimated ITEs from Figure 6. We notice larger responses as well as a larger percentage of estimated reducers among automated users.
TABLE II: Estimated CATEs by Automation Status (Automated, Non-Automated, All) for Users with ≥ 10 DR Events, RF (E6)
Table III reports the percentage of significant reducers for different confidence levels, obtained with the permutation test under the null (12). From Tables II and III, it becomes clear that automated users show larger reductions than non-automated ones, which agrees with expectations.
TABLE III: Estimated Number of Significant Reducers according to the Permutation Test at Confidence Levels 1−α = 0.90 / 0.95 / 0.99: 225 / 205 / 159 Users, RF Estimator (E6)
Larger reductions are estimated in warm summer months. To test the hypothesis of whether or not there exists such a correlation, Figure 7 scatter-plots estimated ITEs as a function of the average ambient air temperature observed during the relevant DR events. This gives rise to a noticeable positive correlation between ambient air temperature and the magnitude of reductions. Indeed, a subsequent hypothesis test with the null being a zero slope is rejected with a $p$-value far below conventional significance levels.
Fig. 7: Correlation between Average Ambient Air Temperature and ITEs.
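The zero-slope test above can be mimicked with a simple permutation scheme. The data below is synthetic with a built-in temperature effect, purely to illustrate the mechanics, and the permutation-on-slope design is our illustrative choice, not necessarily the paper's exact test:

```python
import random

random.seed(1)

# Hypothetical (avg. event temperature, estimated ITE) pairs: hotter
# events come with larger (more negative) reductions by construction.
temps = [random.uniform(10, 35) for _ in range(300)]
ites = [-0.004 * T + random.gauss(0, 0.02) for T in temps]

def ols_slope(xs, ys):
    """One-variable OLS slope."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

slope = ols_slope(temps, ites)

# Permutation test of H0 (zero slope): shuffle temperatures and count
# how often a slope at least as extreme arises by chance.
null_slopes = [ols_slope(random.sample(temps, len(temps)), ites)
               for _ in range(2000)]
p = sum(abs(s) >= abs(slope) for s in null_slopes) / len(null_slopes)
print(slope < 0, p < 0.01)
```

Shuffling breaks any real temperature-ITE link while preserving both marginal distributions, so the share of shuffled slopes as extreme as the observed one directly estimates the $p$-value.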
V. CAUSAL DECISION TREES
We utilize household census data to detect subpopulations with "similar" treatment effects. The data set consists of demographic features (age, ethnicity, relationship status, household size, income) as well as features that describe the electricity usage of customers, such as solar/electric/gas heating and the age of the house. Following best practices, we standardize the census data set (i.e. zero mean, unit variance).
A. Training Algorithm
We now explain the training procedure for fitting the causal tree. As a first step, we take all available training data, namely the collection of covariates, outcomes, and treatment indicators for all treated and control users over the entire experimental period, which we denote as $\{(x_{it}, y_{it}, D_{it}) \mid i \in \mathcal{I}, t \in \mathcal{T}\}$. Let $\mathcal{S} = \{1, \dots, n\}$ denote the indices of the resulting training sample. Algorithm 1 provides the pseudocode used for fitting a causal tree.

To avoid overfitting, Algorithm 1 stops growing the tree when the number of treated observations or control observations drops below a certain threshold $n_{\min}$, or if the current depth of the node exceeds the maximal depth max_depth (cf. line 1). If any of these three criteria is fulfilled, the current node is defined to be a leaf that predicts the CATE as the difference in means between treated and control observations:

$$\hat{\beta}(\mathcal{S}) = \frac{1}{|\{i \mid i \in \mathcal{S}, D_i = 1\}|} \sum_{j \in \{i \mid i \in \mathcal{S}, D_i = 1\}} y_j \;-\; \frac{1}{|\{i \mid i \in \mathcal{S}, D_i = 0\}|} \sum_{j \in \{i \mid i \in \mathcal{S}, D_i = 0\}} y_j. \quad (15)$$
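The leaf prediction (15) is just a difference in means; a minimal stand-alone sketch:

```python
def leaf_cate(outcomes, treated):
    """Leaf prediction (15): difference in mean outcomes between the
    treated (D_i = 1) and control (D_i = 0) samples in the leaf."""
    t = [y for y, d in zip(outcomes, treated) if d == 1]
    c = [y for y, d in zip(outcomes, treated) if d == 0]
    return sum(t) / len(t) - sum(c) / len(c)

# Four samples falling into one leaf (outcome in kWh, treatment flag):
print(round(leaf_cate([1.0, 1.2, 0.8, 0.9], [0, 0, 1, 1]), 2))  # -0.25
```

This is also why the stopping thresholds matter: with too few treated or control samples in a leaf, one of the two means in (15) becomes unreliable (or undefined).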
Algorithm 1 Recursive Causal Tree Algorithm FitTree(S)
Input: Set of sample point indices $\mathcal{S} = \{1, \dots, n\}$
Output: TreeNode root
1: if $n_{\text{treat}} < n_{\min}$ or $n_{\text{ctrl}} < n_{\min}$ or curr_depth ≥ max_depth then
2:   return leaf with prediction $\hat{\beta}(\mathcal{S})$
3: else
4:   Randomly create set of eligible features $\mathcal{J}$
5:   Choose best splitting feature $j \in \mathcal{J}$ and splitting threshold $\gamma$
6:   $\mathcal{S}_l = \{i \mid x_{ij} \leq \gamma\}$, $\mathcal{S}_r = \{i \mid x_{ij} > \gamma\}$
7:   node.left = FitTree($\mathcal{S}_l$)
8:   node.right = FitTree($\mathcal{S}_r$)
9: end if

Otherwise, we use bagging [5] to randomly generate an eligible subset of candidate splitting features $\mathcal{J}$ (line 4). For each such feature, we iterate through various threshold values to find the one that minimizes the weighted cost $J$ as follows:

$$J(\mathcal{S}) = \frac{|\mathcal{S}_l| J(\mathcal{S}_l) + |\mathcal{S}_r| J(\mathcal{S}_r)}{|\mathcal{S}_l| + |\mathcal{S}_r|}, \quad (16)$$

where the costs for the left and right subtree are defined as

$$J(\mathcal{S}_l) = \mathrm{mse}^0_l + \mathrm{mse}^1_l - \alpha \left( \mu^1(\mathcal{S}_l) - \mu^0(\mathcal{S}_l) \right)^2, \quad (17a)$$
$$J(\mathcal{S}_r) = \mathrm{mse}^0_r + \mathrm{mse}^1_r - \alpha \left( \mu^1(\mathcal{S}_r) - \mu^0(\mathcal{S}_r) \right)^2. \quad (17b)$$

In (17a) and (17b), $\mathrm{mse}^0_l$ and $\mathrm{mse}^0_r$ denote the mean squared error of all control samples in the left and right subtree, respectively:

$$\mathrm{mse}^0_l = \frac{\sum_{j \in \{i \mid i \in \mathcal{S}_l, D_i = 0\}} \left( y_j - \mu^0(\mathcal{S}_l) \right)^2}{|\{i \mid i \in \mathcal{S}_l, D_i = 0\}|}, \quad (18a)$$
$$\mathrm{mse}^0_r = \frac{\sum_{j \in \{i \mid i \in \mathcal{S}_r, D_i = 0\}} \left( y_j - \mu^0(\mathcal{S}_r) \right)^2}{|\{i \mid i \in \mathcal{S}_r, D_i = 0\}|}. \quad (18b)$$

$\mathrm{mse}^1_l$ and $\mathrm{mse}^1_r$ are defined analogously. Next, the terms $\left( \mu^1(\mathcal{S}_l) - \mu^0(\mathcal{S}_l) \right)^2$ and $\left( \mu^1(\mathcal{S}_r) - \mu^0(\mathcal{S}_r) \right)^2$ denote the squared differences in means between control and treatment observations in the left and right subtree, respectively. Taken together, the cost function $J$ penalizes variation in the outcomes within the control or treatment group, but rewards separation between them. The scalar parameter $\alpha \in \mathbb{R}_+$ trades off these two contributions.

Finally, the sample $\mathcal{S}$ is split into $\mathcal{S}_l \cup \mathcal{S}_r$ according to the best splitting feature and threshold minimizing the cost (16), and FitTree is called recursively on $\mathcal{S}_l$ and $\mathcal{S}_r$.

Taken together, Algorithm 1 partitions the feature space into a set of leaves $\Pi = \{f_1, \dots, f_{|\Pi|}\}$, where each leaf $f \in \Pi$ carries a prediction of the CATE $\hat{\beta}(\mathcal{S}(f))$, where $\mathcal{S}(f) \subset \mathcal{S}$ denotes the collection of sample points in leaf $f$.

B. Validation of Splitting Algorithm
We now verify that Algorithm 1 works correctly by simulating synthetic responses on a reduced data set. In doing so, we also find the optimal set of hyperparameters that minimizes the mean squared prediction error (MSE), which we define to be

$$\mathrm{mse} = \frac{1}{|\mathcal{S}_{\mathrm{val}}|} \sum_{f \in \Pi} \sum_{i \in f \subseteq \mathcal{S}_{\mathrm{val}}} \left( \hat{\beta}(\mathcal{S}_{\mathrm{tr}}(f)) - \beta_i \right)^2, \quad (19)$$

that is, the mean sum of squares between the actual treatment effect $\beta_i$ and the CATE predicted in the corresponding leaf $f$, i.e. $\hat{\beta}(\mathcal{S}_{\mathrm{tr}}(f))$. By using a training sample $\mathcal{S}_{\mathrm{tr}}$ to fit the causal tree $\Pi$ and predict the CATEs $\{\hat{\beta}(\mathcal{S}_{\mathrm{tr}}(f))\}_{f \in \Pi}$, and a separate validation set $\mathcal{S}_{\mathrm{val}}$ to compute (19), we ensure the model neither overfits nor underfits.

More precisely, we perform cross-validation on the hyperparameters $\alpha$ (cf. (17a) and (17b)) and max_depth to minimize (19), given a synthetic linear response which we construct as follows:

$$y_{it} \leftarrow y_{it} - (c_T \cdot T_{it} + \varepsilon_T) - (c_m \cdot \mathrm{mean\_kwh}_i + \varepsilon_m) - (\mathrm{pugasheat}_i + \varepsilon_{pu}) - (\mathrm{prenter}_i + \varepsilon_{pr}) - (c_a \cdot \mathrm{avgfamsize}_i + \varepsilon_a), \quad (20)$$

where $c_T, c_m, c_a$ are fixed fractional coefficients and $\varepsilon_T, \varepsilon_m, \varepsilon_{pu}, \varepsilon_{pr}, \varepsilon_a$ are independent zero-mean Gaussians with common variance. The variables in (20) denote the temperature of user $i$ at time $t$, the mean consumption, the percentage of electricity consumed for gas heating, the probability of renting a house, and the average family size. This subjectively chosen response is applied to a subset of users, defined to be the treated group $\mathcal{T}$, whereas the remaining users become the control group $\mathcal{C}$. Next, we randomly draw 80% of samples from both $\mathcal{T}$ and $\mathcal{C}$ to create the training set $\mathcal{S}_{\mathrm{tr}}$; the remaining 20% form the validation set $\mathcal{S}_{\mathrm{val}}$, see Figure 8 for an illustration.

Fig. 8: Partitioning of data for cross-validation. First, apply (20) on 10% of available data (red area) to create treatment data; the remaining 90% (green) is control data. Second, randomly choose 80% of treatment and control data for training (blue rectangle, $\mathcal{S}_{\mathrm{tr}}$); the remaining 20% is validation data (orange rectangle, $\mathcal{S}_{\mathrm{val}}$).

Figure 9 plots the validation MSE (19) for various $\alpha$ and $n_{\min}$ with a maximal depth max_depth = 30. Since the training and validation data sets consist of separate data points, we observe that for small $n_{\min}$ some leaves will not be populated by samples in the validation set, resulting in an undefined MSE. As this phenomenon is a clear indication of overfitting, we define the MSE in this case to be infinity. Thus, in Figure 9, all points to the left of a particular line have an infinite MSE.

As the MSEs appear to be relatively insensitive to the choice of $\alpha$, particularly for the value $n_{\min} = 175$ that minimizes the MSE, we will use $\alpha = 1$ for all subsequent analyses of the entire data set.
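The validation criterion (19) itself is straightforward to compute once each validation sample carries its true synthetic effect and the CATE of the leaf it falls into; a minimal sketch with hypothetical values:

```python
import random

random.seed(0)

# Hypothetical validation samples: (true synthetic effect beta_i,
# CATE predicted by the training-fitted leaf the sample falls into).
samples = [(-0.1 + 0.01 * random.random(), -0.1) for _ in range(100)]

# Validation MSE (19): mean squared gap between leaf prediction and truth.
mse = sum((pred - beta) ** 2 for beta, pred in samples) / len(samples)
print(mse < 1e-3)  # small, since the leaf predictions track the truth
```

Because the synthetic response makes every $\beta_i$ known exactly, (19) can be evaluated without the identification problem that blocks it on real data.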
Fig. 9: Validation MSE (19) for varying values of $\alpha$ and minimum number of samples per leaf

C. Results and Interpretation
We now perform the same analysis from the previous subsection on the full data set with synthetic responses across all covariates. Specifically, we divide the data into a training and a validation set and choose the set of hyperparameters that minimizes the MSE (19). As the training procedure presents no novel insights, we omit the learning curve for the MSE (cf. Figure 9). Figure 10 plots an example of a causal tree that has been fitted with a subjectively chosen set of hyperparameters to fit within the page limit.

The optimal set of hyperparameters achieving the smallest validation MSE (19) was found to be $n_{\min} = 200$, max_depth = 15. As in the previous subsection, the influence of the parameter $\alpha$ on the MSE seems to be negligible, motivating us to use $\alpha = 1$ as in the validation case.

VI. K-MEANS CLUSTERING FOR HETEROGENEITY
In the previous section we illustrated how a causal tree can detect heterogeneity in treatment effects. We now present a more ad-hoc approach which utilizes k-means clustering, and compare its prediction accuracy to the mean squared error (19) achieved by the causal tree. For a valid comparison, we use the same sample $\mathcal{S} = \mathcal{S}_{\mathrm{tr}} \cup \mathcal{S}_{\mathrm{val}}$ with the same synthetic responses (20) as in the causal tree case.

The well-known k-means clustering algorithm places $k$ centroids $c_1, \dots, c_k$ into the $n$-dimensional feature space, where $n$ is the number of features of the training samples. The CATE belonging to cluster $c_i$ is then defined to be

$$\hat{\beta}(c_i) = \frac{1}{|\{j \in \mathcal{S}_{\mathrm{tr}} \mid c(j) = c_i, D_j = 1\}|} \sum_{k \in \{j \in \mathcal{S}_{\mathrm{tr}} \mid c(j) = c_i, D_j = 1\}} \beta_k \;-\; \frac{1}{|\{j \in \mathcal{S}_{\mathrm{tr}} \mid c(j) = c_i, D_j = 0\}|} \sum_{k \in \{j \in \mathcal{S}_{\mathrm{tr}} \mid c(j) = c_i, D_j = 0\}} \beta_k, \quad (21)$$

which is similar in nature to (15). In (21), $c(j)$ denotes the index of the centroid with the smallest Euclidean distance to training sample $j \in \mathcal{S}_{\mathrm{tr}}$. Hence the k-means algorithm generates $k$ distinct CATEs, namely one for each cluster. Finally, the MSE is determined as the mean of squared differences between predicted and actual treatment effect:

$$\mathrm{mse} = \frac{1}{|\mathcal{S}_{\mathrm{val}}|} \sum_{i \in \mathcal{S}_{\mathrm{val}}} \left( \hat{\beta}(c(i)) - \beta_i \right)^2. \quad (22)$$

To find the optimal value of $k$, we perform cross-validation on a range of possible $k$, see Figure 11. In addition to the MSEs, Figure 11 also plots their rolling mean, as the MSE oscillates around its trend. Using this technique, we determine the optimal $k$ to be 70.

Comparing Figures 9 and 11, we immediately see that the validation MSE of the causal tree is two orders of magnitude smaller than in the k-means clustering case, which is consistent with our expectations, as the k-means clustering algorithm fails to take into account the labels of the data points and operates on a much more naive level than the causal tree.

VII. DISCUSSION AND CONCLUSION
We analyzed Residential Demand Response as a human-in-the-loop cyber-physical system that incentivizes users to curtail electricity consumption during designated peak hours. Utilizing data collected from a Randomized Controlled Trial funded by the California Energy Commission and conducted by a DR provider in the San Francisco Bay Area, we estimated the causal effect of hour-ahead price interventions on electricity reduction.

We developed a two-step non-experimental estimation framework and employed off-the-shelf regression models to learn a consumption model on non-DR periods, which predicts counterfactuals during the DR hours of interest. The estimated ATE amounts to an 11% reduction per Demand Response event and household. Further, we observe a weak positive correlation between the incentive level and the estimated reductions, suggesting that users are only weakly elastic in response to incentives.

Moreover, we discovered notable heterogeneity of users in time and by automation status: the largest reductions were observed in summer months as well as among users with at least one connected smart home automation device. Further, the ambient air temperature was found to correlate positively with the amount of reduction, suggesting that air conditioning units are a major contributor to reductions.

Finally, we used causal decision trees and k-means clustering to detect subpopulations with "similar" treatment effects. While k-means is a naive algorithm, fitting a causal decision tree requires more complex computations and yields much higher accuracy. In our previous work [15], we made use of a very simplistic targeting algorithm, which assigns large incentives to small reducers and vice versa in order to reduce the cost of the DR provider. Future work will investigate to what degree cost efficiency can be improved by more fine-grained targeting approaches enabled by causal decision trees.

REFERENCES

[1] P. Palensky and D. Dietrich, "Demand Side Management: Demand Response, Intelligent Energy Systems, and Smart Loads," IEEE Transactions on Industrial Informatics, vol. 7, no. 3, pp. 381–388, 2011.
[2] S. Borenstein, "The Long-Run Efficiency of Real-Time Electricity Pricing," The Energy Journal, 2005.
[3] "Public Utilities Commission of the State of California: Resolution E-4728. Approval with Modifications to the Joint Utility Proposal for a Demand Response Auction Mechanism Pilot," July 2015.
[4] "California Independent System Operator Corporation (CAISO): Fifth Replacement FERC Electric Tariff," 2014.
[Figure 10: causal tree diagram. Splits use air_temp, i_mean_daily_kwh, and hourly consumption/temperature features (e.g. i_mean_kwh_HoD19, i_mean_atemp_HoD5); leaf CATEs range from −0.5429 kWh to 0.0394 kWh.]
Fig. 10: Example of a causal tree fitted on the demeaned data set with maximal depth 6. Blue boxes (leaves) show CATEs computed with (15).
[Figure 11: validation MSE versus number of clusters K, with the rolling mean and the best K marked.]

Fig. 11: Validation MSE (19) for varying values of K.

[5] L. Breiman, J. Friedman, C. Stone, and R. A. Olshen, Classification and Regression Trees. CRC Press, 1984.
[6] S. Athey and G. W. Imbens, "Recursive Partitioning for Heterogeneous Causal Effects," Proceedings of the National Academy of Sciences of the United States of America, vol. 113, no. 27, pp. 7353–7360, 2016.
[7] B. Bollinger and W. R. Hartmann, "Welfare Effects of Home Automation Technology with Dynamic Pricing," Stanford University, Graduate School of Business Research Papers, 2015.
[8] A. Abadie, A. Diamond, and J. Hainmueller, "Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program," Journal of the American Statistical Association, vol. 105, no. 490, pp. 493–505, 2012.
[9] J. W. Taylor and P. E. McSharry, "Short-Term Load Forecasting Methods: An Evaluation Based on European Data," IEEE Transactions on Power Systems, vol. 22, no. 4, pp. 2213–2219, 2007.
[10] T. Senjyu, H. Takara, K. Uezato, and T. Funabashi, "One-Hour-Ahead Load Forecasting Using Neural Network," IEEE Transactions on Power Systems, vol. 17, no. 1, pp. 113–118, 2002.
[11] E. E. Elattar, J. Goulermas, and Q. H. Wu, "Electric Load Forecasting Based on Locally Weighted Support Vector Regression," IEEE Transactions on Systems, Man, and Cybernetics, vol. 40, no. 4, 2010.
[12] P. Mirowski, S. Chen, T. K. Ho, and C.-N. Yu, "Demand Forecasting in Smart Grids," Bell Labs Technical Journal, 2014.
[13] D. Zhou, M. Balandat, and C. Tomlin, "Residential Demand Response Targeting Using Machine Learning with Observational Data," 2016.
[14] ——, "A Bayesian Perspective on Residential Demand Response Using Smart Meter Data," 2016.
[15] ——, "Estimation and Targeting of Residential Households for Hour-Ahead Demand Response Interventions - A Case Study in California," 2018.
[16] H. Allcott, "Rethinking Real-Time Electricity Pricing," Resource and Energy Economics, vol. 33, no. 4, pp. 820–842, 2011.
[17] D. B. Rubin, "Estimating Causal Effects of Treatments in Randomized and Non-Randomized Studies," Journal of Educational Psychology, vol. 66, no. 5, pp. 688–701, 1974.
[18] J.-S. Pischke and J. D. Angrist, Mostly Harmless Econometrics, 1st ed. Princeton University Press, 2009.
[19] P. R. Rosenbaum and D. B. Rubin, "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, vol. 70, no. 1, pp. 41–55, 1983.
[20] P. W. Holland, "Statistics and Causal Inference," Journal of the American Statistical Association, vol. 81, no. 396, pp. 945–960, 1986.
[21] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer New York, 2009.
[22] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. CRC Press, 1994.
Datong Paul Zhou received a B.S. in Mechanical Engineering from TU Munich, Germany, in 2014. He received M.S. degrees in Electrical Engineering and Computer Science and in Mechanical Engineering from the University of California, Berkeley, in 2016 and 2017, respectively, as well as an M.A. in Mathematics in 2018. He is currently working towards his Ph.D. in Mechanical Engineering with a focus on machine learning, game theory, and control theory applied to electricity markets.
Maximilian Balandat received a Dipl.-Ing. in Electrical Engineering from TU Darmstadt, Germany, in 2010, and an M.A. in Mathematics and a Ph.D. in Electrical Engineering from UC Berkeley in 2016. His research interests include Control Theory, Machine Learning, Causal Inference, Economic Incentives, and Bayesian Optimization. He is currently a Research Scientist at Facebook.