A Python Library For Empirical Calibration
Xiaojing Wang
Google [email protected]
Jingang Miao
Google [email protected]
Yunting Sun
Google [email protected]

ABSTRACT
Dealing with biased data samples is a common task across many statistical fields. In survey sampling, bias often occurs due to unrepresentative samples. In causal studies with observational data, the treated versus untreated group assignment is often correlated with covariates, i.e., not random. Empirical calibration is a generic weighting method that presents a unified view on correcting or reducing the data biases for the tasks mentioned above. We provide a Python library EC to compute the empirical calibration weights. The problem is formulated as convex optimization and solved efficiently in the dual form. Compared to existing software, EC is both more efficient and robust. EC also accommodates different optimization objectives, supports weight clipping, and allows inexact calibration, which improves usability. We demonstrate its usage across various experiments with both simulated and real-world data.

1 INTRODUCTION
Biased data samples are prevalent in statistical problems. In survey sampling, unrepresentative data can occur when certain sub-populations are under- or over-represented in the sample. For example, researchers may decide to over-sample females in a marketing study of cosmetics using differential sampling probabilities, which, if not accounted for, would lead to selection bias. Further, once invited to join the study with a cash incentive, richer individuals may be less likely to participate, and the differential response rates may cause self-selection bias. To guard against such biases, sampling weights, non-response weights, and calibration weights are used at different stages of survey data adjustment. In causal inference with observational data, the treated versus untreated group assignment is often correlated with covariates, i.e., not random. Weighting can be applied to the untreated individuals such that the reweighted untreated group better mimics the control group of a randomized controlled experiment.

Weighting is a generic bias correction technique that can take different forms. One is inverse probability weighting, where the inverse of the probability/propensity of being included in the sample or receiving the treatment is used as the weight. Such probability is either known or estimated from a propensity model. Popular examples include the Horvitz-Thompson estimator (Horvitz & Thompson, 1952) in survey sampling and inverse propensity score weighting (Rosenbaum & Rubin, 1983) in causal inference.

Empirical calibration takes a different form: instead of trying to summarize the systematic differences with one probability/propensity, one seeks weights that directly balance out the covariates yet deviate least from the uniform weights, to reduce the inflation of variance.

Solving for the empirical calibration weights can be formulated as a convex optimization problem. Suppose there are n biased data points {X_i}_{i=1}^n and p marginal constraints {c̄_j}_{j=1}^p.
The goal is to find an n-dimensional weight vector w that is the least unequal while subject to the weight normalization and marginal balance constraints:

    minimize_w  L(w)
    subject to  Σ_{i=1}^n w_i c_j(X_i) = c̄_j,  j = 1, ..., p,
                Σ_{i=1}^n w_i = 1,
                w_i ≥ 0,  i = 1, ..., n,

where L is a convex loss function and c_j(X_i) is the j-th transformation of X_i.

There are variations of the definition of unequalness, which determines the loss function L. Hainmueller (2012) proposed entropy balancing in the context of causal inference with observational data. It corresponds to the entropy loss

    L(w) = Σ_{i=1}^n w_i log(w_i),

which can be viewed as the Kullback-Leibler divergence between w and the uniform weights. Alternatively, Deville & Särndal (1992) used the Euclidean or squared distance between w and some base weights — typically uniform weights. It is equivalent to a quadratic loss

    L(w) = (1/2) Σ_{i=1}^n w_i².

Further, minimizing this quadratic loss can be seen as maximizing the effective sample size (Kish, 1965):

    (Σ_{i=1}^n w_i)² / Σ_{i=1}^n w_i².

More broadly, both the entropy and quadratic losses can be considered special cases of the empirical likelihood (Owen, 2001).

We provide a Python library EC that supports both the entropy and quadratic loss functions. When there is no feasible solution that satisfies the balance constraints, EC allows for inexact constraints, which improves robustness and usability. For some applications, it can be desirable to bound the weights to a certain range; EC conveniently has this weight restraint option built in, which is superior to post-optimization clipping or winsorization.

This paper is structured as follows. Section 2 briefly presents EC's interface. In Section 3, we give a short tour of various applications of empirical calibration, and present experimental results on both simulated and real-world data. Some implementation issues are discussed in Section 4.
Concluding remarks are given in Section 5.

2 SOFTWARE
The EC library is available at https://github.com/google/empirical_calibration and can be imported as

    import empirical_calibration as ec

The primary interface is the function ec.calibrate:

    def calibrate(covariates: np.ndarray,
                  target_covariates: np.ndarray,
                  target_weights: np.ndarray = None,
                  autoscale: bool = False,
                  objective: Objective = ec.Objective.ENTROPY,
                  max_weight: float = 1.0,
                  l2_norm: float = 0) -> Tuple[np.ndarray, bool]:

Its arguments are:

covariates: X, the covariates to be calibrated. All values must be numeric. For categorical values, the from_formula function is often more convenient.

target_covariates: the covariates to be used as the target in calibration. The sum of each column is a marginal constraint (c̄_j). The number of columns should match covariates.

target_weights: the weights for target_covariates. These are needed when the target_covariates themselves have weights. Its length must equal the number of rows in target_covariates. If None, equal weights are used.

autoscale: whether to scale covariates to [0, 1] and apply the same scaling to target_covariates. Setting it to True might help improve numerical stability.

objective: the objective of the convex optimization problem. Supported values are ec.Objective.ENTROPY (Σ_{i=1}^n w_i log(w_i)) and ec.Objective.QUADRATIC (Σ_{i=1}^n w_i²).

max_weight: the upper bound on the weights. Must be between the uniform weight (1 / number of rows in covariates) and 1.0.

l2_norm: ε, the allowed L2 norm of the covariate balance constraint, i.e., the Euclidean distance between the weighted mean of the covariates and the simple mean of the target covariates after balancing.

It returns a tuple (weights, success), where:

weights: the weights for the subjects. They should sum up to 1.

success: whether the constrained optimization succeeded.

The entropy objective is a common choice for causal inference problems, while the quadratic objective is more widely used in the context of survey calibration. One practical difference is that the entropy objective yields strictly positive weights, while the quadratic objective allows zero weights; zero-weight data points are effectively discarded in subsequent analysis.

The given constraints may yield no feasible solution, in which case one can either drop some constraints or soften the hard constraints by specifying, via the l2_norm argument, a maximal distance allowed between the weighted covariates and the target covariates. Further, instead of manually specifying the soft constraint, one can use ec.maybe_exact_calibrate, which automatically chooses the smallest soft margin that yields a feasible solution.

Also provided is the from_formula interface, which is parallel to the functions above, with covariates and target_covariates replaced with:

formula: the formula used to generate the design matrix. No outcome variable is allowed.

df: the data to be calibrated.

target_df: the data containing the target.

formula here follows the patsy style, so that one can conveniently construct square terms, interaction terms, etc. For example, formula='~ -1 + x1 + x2 ** 2 + x3:x4' tells EC to match the first moment of x1, the second moment of x2, as well as the interaction between x3 and x4. This saves the user the trouble of manually constructing these quantities.

We illustrate the usage of these interfaces in the Applications section below.

3 APPLICATIONS
3.1 SURVEY CALIBRATION
Surveys or samples are drawn to make inferences about a target population, which is the bread and butter of statistics. A valid inference is only possible when the survey or sample is representative of the target population.

Survey weighting is commonly used to correct for unequal sampling probabilities, differential response rates, and coverage errors (Deville & Särndal, 1992; Kott, 2006). Calibration, which is often one of the steps in survey weighting, makes use of information about the target population that is available after the sample is collected (auxiliary information). For example, for a survey of internet users, such information may be found in the Current Population Survey Computer and Internet Use Supplement (Ryan & Lewis, 2017), including the distributions of internet users' gender, age, education, and household income. Calibration aims to make the sample more representative of the target population. Popular methods of using auxiliary information include ratio estimation (Fuller, 2011), post-stratification or raking (Little, 1993), and calibration estimation (Deville & Särndal, 1992).

Weighted samples can be used to estimate the population average of a survey response. The weights are often constructed such that weighted moments — typically the first two moments — agree with known population benchmarks. The benchmarks could come from census data or other large-scale, high-quality surveys. Meanwhile, the calibration weights are chosen to deviate least from the base weights, which account for the sampling design and differential non-response. In the simplest form, the base weights are uniform.

3.1.1 ESTIMATE POPULATION MEAN
A common use case of surveys is to estimate the mean or total of a quantity. We replicate the direct standardization (Fleiss et al., 2003) example by Fu et al. (2018). Data is obtained from the CVXR package (Fu et al., 2017), where dspop contains 1000 rows with columns sex ∼ Bernoulli(0.5), age ∼ Uniform(10, 60), and y_i ∼ N(5 × sex_i + 0.1 × age_i, 1); dssamp contains a skewed sample of 100 rows with small values of y over-represented, thus biasing its distribution downwards.

    cols = ['sex', 'age']
    weights, _ = ec.maybe_exact_calibrate(
        covariates=dssamp[cols],
        target_covariates=dspop[cols],
        objective=ec.Objective.ENTROPY)

    print('True mean of y: {}'.format(dspop['y'].mean()))
    print('Unweighted sample mean: {}'.format(dssamp['y'].mean()))
    print('Weighted mean: {}'.format(np.average(dssamp['y'], weights=weights)))

The true mean of y based on the data generating process is 6.0. Using the generated population of size 1000 and a sample of size 100 contained in the CVXR package, the population mean is 6.01, but the mean of the skewed sample is 3.76, which is a gross underestimation. With empirical calibration, however, the weighted mean is 5.82, which is much closer to the population mean.

The kernel density plots (Figure 1) show that the unweighted curve is biased toward smaller values of y, but the weights help recover the true density.

3.2 OBSERVATIONAL DATA WITH A BINARY TREATMENT
Empirical calibration is also applicable in observational studies with a binary treatment. It aims to achieve covariate balance by calibrating the untreated group against the treated group. It assigns non-negative weights to individual control units such that certain moments — typically means — of the covariates between the treatment group and the untreated group are matched. The weighted untreated units then mimic a control sample as if it were from a randomized experiment.

We follow the potential outcome language of the Rubin causal model to describe the inference problem with a binary treatment. Suppose each unit i is associated with a pair of potential outcomes: Y_i(1) if treated (T_i = 1) and Y_i(0) if untreated (T_i = 0). The treatment effect for this unit is defined as τ_i = Y_i(1) − Y_i(0). Inference of the τ_i's becomes a missing data problem since the two potential outcomes are never both observed — commonly known as the fundamental problem of causal inference. If Y_i(T_i) is observed, Y_i(1 − T_i) is the counterfactual, i.e., what would have been observed if T_i had not occurred.

We limit the discussion to the sample average treatment effect on the treated (ATT) estimand:

    (1/n_1) Σ_{i: T_i = 1} [Y_i(1) − Y_i(0)],    (1)

where n_1 is the number of treated units. The first term in (1) is straightforward to compute using the treated observations. The problem boils down to estimating the second term — the mean of the unobserved counterfactuals — from the control observations.

In experimental settings, the treatment assignment is independent of the outcomes, (Y(1), Y(0)) ⊥ T, and one can simply use (1/n_0) Σ_{i: T_i = 0} Y_i as an estimate of (1/n_1) Σ_{i: T_i = 1} Y_i(0), where n_0 is the number of control units. In observational studies, however, due to the treatment selection bias, the treated group is often systematically different from the control group, rendering the latter two quantities unequal.
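This gap is easy to reproduce in a toy simulation (the data-generating process below is ours, purely for illustration, and not from any dataset in this paper): the treatment probability increases with a covariate that also drives the untreated outcome, so the plain control mean underestimates the treated group's counterfactual mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)               # confounding covariate
p = 1.0 / (1.0 + np.exp(-x))             # treatment more likely when x is large
t = rng.binomial(1, p)                   # treatment assignment
y0 = x + rng.standard_normal(n)          # untreated potential outcome, driven by x

control_mean = y0[t == 0].mean()         # what the naive estimate uses
counterfactual_mean = y0[t == 1].mean()  # the quantity the ATT actually needs
```

Here control_mean falls markedly below counterfactual_mean. Reweighting the control units so that the weighted mean of x matches the treated group is exactly how calibration weights close this gap.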
Conventionally, we assume the observed covariates X contain all the information about the selection bias (i.e., there is no unobserved confounding variable, which is a strong assumption), and estimate the ATT in two stages: 1) "design", purely based on matching of the observed covariates X, and 2) outcome analysis of Y. Stage 1 equates or balances the distributions of the covariates between the treated and control groups. Matching and weighting are the two main approaches to achieve the balance; see Appendix B for a short tour of common methods. Stage 2 compares the outcomes of the treated and control units, and estimates the causal effect of the treatment. It generally involves some regression adjustments to account for the small residual covariate imbalance between the groups after stage 1. The outcome regression must be able to accept weights if stage 1 is done via weighting. The matching methods and outcome analysis in the two stages have been shown to work best in combination (Rubin, 1973). The intuition is the same as that behind the double-robust estimators (Robins et al., 1994), which are asymptotically unbiased if either the propensity score matching model or the outcome regression model is correctly specified.

For empirical calibration weighting, we focus on the affine estimator:

    τ̂_EC = (1/n_1) Σ_{i: T_i = 1} Y_i − Σ_{i: T_i = 0} w_i Y_i,    (2)

where n_1 is the number of treated units and the w_i's are the empirical calibration weights that sum up to 1. The first term is the simple average of the observed outcomes for the treated units, and the second term is the weighted average of the observed outcomes for the control units.

3.2.1 KANG-SCHAFER SIMULATION
Kang et al. (2007) used a simulation study to illustrate the selection bias of the outcome under informative non-response. The study has become a standard benchmark for comparing different estimators of causal estimands. The true set of covariates is generated independently and identically distributed from the standard normal distribution, (Z_1, Z_2, Z_3, Z_4) ∼ N(0, I_4). The outcome is generated as

    Y = 210 + 27.4 Z_1 + 13.7 Z_2 + 13.7 Z_3 + 13.7 Z_4 + ε,  where ε ∼ N(0, 1).

The propensity score is defined as

    Pr(T = 1 | Z) = expit(−Z_1 + 0.5 Z_2 − 0.25 Z_3 − 0.1 Z_4).

This mechanism produces equal-sized treated and control groups on average. Given the covariates, the outcome is independent of the treatment assignment; thus the true ATT is zero. The overall outcome mean is 210. Due to the treatment selection bias, the outcome mean for the treated group (200) is lower than that of the control group (220).

A typical exercise is to examine the performance of an observational method under both correctly specified and misspecified propensity score and/or outcome regression models. Misspecification occurs when the following nonlinear transformations X_i are observed in place of the true covariates:

    X_1 = exp(Z_1 / 2),
    X_2 = Z_2 / (1 + exp(Z_1)) + 10,
    X_3 = (Z_1 Z_3 / 25 + 0.6)³,
    X_4 = (Z_2 + Z_4 + 20)².

To estimate the ATT, we use empirical calibration to weight the control group so that it matches the treatment group in terms of their covariate distribution. Simulations with a sample size of 1000 are used. Without weights, the bias is 20.2 and the RMSE is 20.3. With correctly specified covariates to match (Z_1, ..., Z_4), the bias is almost zero (0.001) and the RMSE is 0.09. With incorrectly specified covariates to match (X_1, ..., X_4), the bias is −4.5 and the RMSE is 4.6. With matching on additional transformations of the incorrectly specified covariates (X_1, ..., X_4 plus interaction terms and log-transformed versions), the bias is −1.8 and the RMSE is 2.0. It is evident that weighting significantly reduces the bias of the ATT estimate.

We also use empirical calibration to estimate the population mean and compare it with the results reported in Kang et al. (2007). The treatment group is weighted so that it matches the population in terms of their covariate distribution. The estimator is the weighted average of y in the treatment group. Again, simulations with a sample size of 1000 are used. With correctly specified covariates to match (Z_1, ..., Z_4), the bias is almost zero (0.0003) and the RMSE is 1.13, which is better than all the methods described in Kang et al. (2007).

3.2.2 LALONDE DATA ANALYSIS
The LaLonde (1986) data is another canonical benchmark in the causal inference literature. It consists of three groups for evaluating the effect of a large scale job training program, the National Supported Work Demonstration (NSW): an experimental treatment group with 185 observations, an experimental control group with 260 observations, and an observational control group drawn from the Current Population Survey (CPS) with 15,992 observations. The outcome variable is the post-intervention earnings in 1978, and the covariates common to all three datasets include age, education, earnings in 1974, earnings in 1975, and binary indicators of being black, being Hispanic, being married, and having completed high school.

Using the experimental control, the simple difference-in-means estimate of the average treatment effect on the treated is 1794.3, with a 95% confidence interval of [479.2, 3109.5]. A linear regression with earnings in 1978 as the outcome and the treatment indicator together with the other variables as independent variables gives an estimate of 1698, with a 95% confidence interval of [458, 2938], which is narrower than the difference-in-means estimate.

We apply entropy balancing on the observational control with respect to 52 covariates as described in Hainmueller (2012), and the difference-in-means estimate is 1571, which is identical to what was reported in Hainmueller (2012). Using quadratic balancing, the point estimate is 1713, which is very close to the estimate using the experimental control.

3.3 GEO EXPERIMENTS
An advertiser must measure the impact and return on investment of its online ad campaigns. A randomized controlled experiment, or A/B testing, is considered the gold standard for drawing causal conclusions. A user-level experiment randomly assigns users to the treatment or control group. No ad is served to the control users. The causal effect of advertising can then be estimated by comparing the observed outcomes between the two groups. It can be difficult, however, to maintain the integrity of the user-level group assignment due to various issues such as users signing in and out, cookie churn, and cross-device usage.

Figure 2: Causal inference for geo 30 with the (a) unweighted and (b) weighted mean of control geos as the counterfactual estimate. The pretest period is to the left of the dashed lines; the intervention period is between the dashed lines; and the cooldown period is to the right of the dashed lines.

Geo-level experiments offer an attractive alternative to user-level experiments by experimenting at the geographic level. A region is first partitioned into non-overlapping subregions, or simply "geos", which are then randomly assigned to a control or treatment condition. Each geo realizes its assigned treatment condition through the use of geo-targeted advertising. For example, an online ad campaign is run in the treatment geos but not in the control geos. The behavior changes, such as website visits and purchases, of customers in the treatment geos can then be attributed to the ad campaign.

Despite the random assignment of treatment and control geos, the systematic difference between the two groups can still be substantial, because there may only be a small number of highly heterogeneous geos available for experimentation. A common solution is to first pair similar geos together and then apply random assignment within each pair. The issue is that some geos, for example New York City, are too large to find a matching geo and cannot be sufficiently balanced out.

Empirical calibration can help lessen the imbalance of the geo-level randomization, and thus reduce the bias of the causal effect estimate. It finds weights for the control geos such that the weighted average of the control geo time series matches that of the treatment geos in the pretest period.
The same weights are then applied to the control time series in the test period to form the treatment counterfactual time series.

Empirical calibration does not aggregate time series across the pretest or test period as in geo-based regression (GBR) (Vaver & Koehler, 2011), nor does it aggregate data across control geos as in time-based regression (TBR) (Kerman et al., 2017). It leverages the individual time series of each geo for causal inference. Although GBR allows geo-level weights in its linear model to account for heteroscedasticity caused by differences in geo size, unlike in empirical calibration, GBR's weights are determined in a less principled way — simply the inverse of the pretest period metrics. Empirical calibration shares the spirit of using contemporaneous controls with Bayesian structural time series models (Brodersen et al., 2015), but does not need to specify a time series model and is thus more robust to model misspecification.

We demonstrate the usage of empirical calibration for geo experiments using the data found in Google Inc (2017). The goal is to estimate the treatment effect on geo 30. The unweighted mean of controls is systematically above that of the treatment geo in the pretest period, which suggests that the difference between the unweighted means in the test period is not an indication of any causal effect of the treatment. The weighted mean of controls, on the other hand, matches the treatment geo much better in the pretest period, which leads to a more reasonable counterfactual estimate.

Instead of looking at one treatment geo at a time, one can also choose to study the mean of all treatment geos as compared to the control geos.
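The mechanics can be sketched with made-up numbers (this is not the Google Inc (2017) dataset; the series, weights, and lift below are all invented): once calibration has produced control-geo weights on the pretest window, the same weighted combination of the control series carried into the test window serves as the counterfactual.

```python
import numpy as np

rng = np.random.default_rng(1)
n_days, n_controls = 90, 20
pretest, test = slice(0, 60), slice(60, 90)

controls = rng.normal(100.0, 5.0, size=(n_days, n_controls))  # control geo series

# For illustration, the treated series is an exact mixture of five control geos
# plus a campaign lift in the test window; in practice the weights would be
# estimated by calibration on the pretest window only.
w = np.zeros(n_controls)
w[:5] = 0.2
lift = np.zeros(n_days)
lift[test] = 3.0
treated = controls @ w + lift

counterfactual = controls @ w                       # weighted control mean
effect = (treated - counterfactual)[test].mean()    # recovers the lift of 3.0
```

Because the weights reproduce the treated series exactly in the pretest window here, the estimated effect in the test window equals the injected lift; with real data, any residual pretest mismatch propagates into the counterfactual.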
4 IMPLEMENTATION
The Stata package ebalance (Hainmueller & Xu, 2013) and the R package ebal (Hainmueller, 2014) find the entropy balancing weights using the algorithm proposed by Hainmueller (2012). It exploits the dual form of the convex optimization problem, where the primal weights can be expressed as a log-linear function of the covariates specified in the moment conditions. The dual problem is more tractable than the primal problem because it is unconstrained and its dimensionality reduces to the number of marginal constraints. A Levenberg-Marquardt scheme is employed to update the dual solution iteratively. When the problem is feasible, the algorithm is globally convergent.

The R package survey offers some overlapping functionality with our Python implementation. Calling survey::calibrate with the argument calfun="linear" is equivalent to using ec.calibrate with the quadratic objective. If no upper or lower bounds are placed on the weights, a closed-form solution is used; however, negative weights are possible. If, on the other hand, bounds are set, then an iterative algorithm is used instead. One advantage of our implementation is that survey::calibrate would fail when there is no feasible solution, while ec.maybe_exact_calibrate allows gradual relaxation of the constraints until a feasible solution can be found.

Alternatively, both the entropy balancing and quadratic balancing weights can be solved for by general purpose convex optimization software. The Python library CVXPY (Diamond & Boyd, 2016) and the R package CVXR (Fu et al., 2017) are two common choices, which provide a domain-specific modeling language and interface with open-source solvers such as ECOS (Domahidi et al., 2013) and SCS (O'Donoghue et al., 2016).

Our implementation mostly follows the entropy balancing dual solution. We also express the primal weights as a function of the covariates and the Lagrangian multipliers. The function is log-linear for entropy balancing and linear with a ReLU activation for quadratic balancing. Instead of relying on the Levenberg-Marquardt update, we find the dual solution by solving a set of non-linear equations with Python's scipy.optimize.fsolve; see Appendix A for details.

There could be no feasible solution when the sample size is small or the number of imposed constraints is large. In the context of observational studies with a binary treatment, it is particularly common to have no feasible solution when there is limited overlap between the treated and untreated covariate distributions. Instead of forcing our users to drop certain marginal constraints altogether, we provide an option to relax the hard constraints. In our experience, a small constraint relaxation often leads to feasible solutions without much degradation of the weights, significantly improving the numerical computation experience.

In other applications, it might be desirable to restrict the weights to a particular range. The conventional solution is to apply weight clipping or winsorization in a post-processing step, but this leads to a violation of the weight normalization constraint and sub-optimal weight solutions. Our implementation allows the weight clipping to be imposed directly as an additional constraint in the optimization problem, yielding optimal weights that satisfy all constraints at once.

4.1 BENCHMARK AGAINST R PACKAGES EBAL AND CVXR
Benchmarking was done against the R package ebal version 0.1-6 and the R package CVXR version 0.99-4, on a Kang et al. (2007) simulation of sample size 2,000. The computation was repeated 200 times for each package. Package EC was 22 and 73 times as fast as ebal and CVXR, respectively (Table 1). Further, package EC had less variability in its running times.

Table 1: Running times (mean, minimum, and maximum over the 200 runs) for packages ebal, CVXR, and EC.

5 CONCLUSION
Empirical calibration has many applications, particularly in survey sampling and causal inference. EC provides an easy to use, robust, and efficient implementation. The source code is shared at https://github.com/google/empirical_calibration, where additional documentation and application examples can be found.

ACKNOWLEDGMENTS
The authors thank Art Owen, Jim Koehler, Joseph Kelly, Georg Goerg, Jon Vaver, Susanna Makela, Mike Hankin, and Mike Wurm for helpful discussions. The authors appreciate Tony Fagan, Penny Chu, and Elissa Lee for their support. Special thanks go to Jim Koehler, David Chan, and Tim Au for reviewing this manuscript.

REFERENCES
Kay H. Brodersen, Fabian Gallusser, Jim Koehler, Nicolas Remy, and Steven L. Scott. Inferring causal impact using Bayesian structural time-series models. The Annals of Applied Statistics, 9(1):247–274, 2015.

Jean-Claude Deville and Carl-Erik Särndal. Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418):376–382, 1992.

Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17(1):2909–2913, 2016.

Alexander Domahidi, Eric Chu, and Stephen Boyd. ECOS: An SOCP solver for embedded systems. In European Control Conference (ECC), 2013, pp. 3071–3076. IEEE, 2013.

Joseph L. Fleiss, Bruce Levin, and Myunghee Cho Paik. Statistical Methods for Rates and Proportions. Wiley, 2003.

Anqi Fu, Balasubramanian Narasimhan, and Stephen Boyd. CVXR: An R package for disciplined convex optimization. arXiv preprint arXiv:1711.07582, 2017.

Anqi Fu, Balasubramanian Narasimhan, and Stephen Boyd. CVXR: A direct standardization example. https://cvxr.rbind.io/cvxr_examples/cvxr_direct-standardization, 2018.

Wayne A. Fuller. Sampling Statistics, volume 560. John Wiley & Sons, 2011.

Google Inc. R package GeoexperimentsResearch. https://github.com/google/GeoexperimentsResearch, 2017.

Jens Hainmueller. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 20(1):25–46, 2012.

Jens Hainmueller. ebal: Entropy reweighting to create balanced samples. https://CRAN.R-project.org/package=ebal, 2014. R package version 0.1-6.

Jens Hainmueller and Yiqing Xu. ebalance: A Stata package for entropy balancing. Journal of Statistical Software, 2013.

Keisuke Hirano, Guido W. Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.

Daniel G. Horvitz and Donovan J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.

Joseph D. Y. Kang and Joseph L. Schafer. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4):523–539, 2007.

Jouni Kerman, Peng Wang, and Jon Vaver. Estimating ad effectiveness using geo experiments in a time-based regression framework. ai.google/research/pubs/pub45950, 2017. Google, Inc.

Gary King and Richard Nielsen. Why propensity scores should not be used for matching. Working paper, 2015.

Leslie Kish. Survey Sampling. John Wiley and Sons, 1965.

Phillip S. Kott. Using calibration weighting to adjust for nonresponse and coverage errors. Survey Methodology, 32(2):133, 2006.

Robert J. LaLonde. Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pp. 604–620, 1986.

Roderick J. A. Little. Post-stratification: A modeler's perspective. Journal of the American Statistical Association, 88(423):1001–1012, 1993.

Art B. Owen. Empirical Likelihood. Chapman and Hall/CRC, 2001.

Brendan O'Donoghue, Eric Chu, Neal Parikh, and Stephen Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, 2016.

James M. Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.

Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.

Donald B. Rubin. The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics, pp. 185–203, 1973.

Camille Ryan and Jamie M. Lewis. Computer and internet use in the United States: 2015. American Community Survey Reports, 2017.

Jon Vaver and Jim Koehler. Measuring ad effectiveness using geo experiments. https://ai.google/research/pubs/pub38355, 2011. Google Inc.

Qingyuan Zhao and Daniel Percival. Entropy balancing is doubly robust. Journal of Causal Inference, 5(1), 2017.
APPENDIX A CONVEX OPTIMIZATION DETAILS

A.1 SOLVING THE DUAL PROBLEM
We present a detailed solution for the quadratic loss. The solution for the entropy loss can be derived similarly. We first convert the optimization into vector form:

    minimize_w  (1/2) wᵀw                  (3)
    subject to  Zᵀw = 0_p,                 (4)
                1_nᵀw = 1,                 (5)
                w ≥ 0_n,                   (6)

where Z is the n × p covariate matrix with the marginal constraint subtracted, i.e., the i-th row is Z_i = c(X_i) − c̄. We construct the Lagrangian with β ∈ R^p, θ ∈ R, and α ∈ R^n_{≥0} as the Lagrangian multipliers:

    L = (1/2) wᵀw − βᵀZᵀw − θ(1_nᵀw − 1) − αᵀw.

Setting ∂L/∂w_i to zero reveals that the primal weight is a linear function of the transformed covariates and the Lagrangian multipliers:

    w_i = Z_iᵀβ + θ + α_i.

Multiplying by w_i on both sides and using the complementary slackness Karush-Kuhn-Tucker (KKT) condition w_i α_i = 0, we eliminate α_i and obtain

    w_i (w_i − Z_iᵀβ − θ) = 0,             (7)

which further reduces to

    w_i = max{0, Z_iᵀβ + θ}.               (8)

Substituting (8) into (4)-(5), we obtain a (p+1)-dimensional estimating equation with respect to the dual parameters (β, θ). The equation can be solved using a general purpose nonlinear equation solver, such as scipy.optimize.fsolve. When such a solution exists, it is unique because the quadratic objective is strictly convex.

The dual solution for the entropy objective can be derived similarly, but with an exponential link function replacing (8):

    w_i = e^{Z_iᵀβ + θ}.                   (9)

By construction of (8) and (9), sample data points with identical covariates are given identical weights, which is a desirable property.

A.2 BOUNDING THE WEIGHTS
A.2 BOUNDING THE WEIGHTS

For certain applications, it is desirable to impose an additional upper and/or lower bound on the weights to reduce the impact of extreme weights. Suppose the imposed bound is $[l, u]$, where $l$ and $u$ are both non-negative. We only need to change the weight link functions accordingly: (8) becomes

$$w_i = \min\{u, \max\{l, Z_i^T \beta + \theta\}\},$$

and (9) becomes

$$w_i = \min\{u, \max\{l, e^{Z_i^T \beta + \theta}\}\}.$$
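As a sketch (function names ours, not the EC API), the bounded links simply clip the unconstrained weights:

```python
import numpy as np

def bounded_quadratic_link(Z, beta, theta, l, u):
    # Bounded version of link (8): clip Z_i^T beta + theta into [l, u].
    return np.clip(Z @ beta + theta, l, u)

def bounded_entropy_link(Z, beta, theta, l, u):
    # Bounded version of link (9): clip exp(Z_i^T beta + theta) into [l, u].
    return np.clip(np.exp(Z @ beta + theta), l, u)
```

The dual estimating equation is then solved with these links in place of (8) and (9).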
A.3 RELAXING THE EQUALITY CONSTRAINT

When there is no feasible solution, we need to either drop some marginal constraints or soften the hard constraints, as in the soft-margin support vector machine:

$$\|Z^T w\|_2 \le \epsilon, \quad (10)$$

where $\epsilon$ is the tolerance. It requires only slightly more work to solve the optimization problem with the tolerance added. Following common practice, we first introduce two $p$-dimensional non-negative slack variables $\Delta_1$ and $\Delta_2$, and reformulate the optimization problem as

$$
\begin{aligned}
\underset{w}{\text{minimize}}\quad & \tfrac{1}{2} w^T w, &(11)\\
\text{subject to}\quad & Z^T w + \Delta_1 - \Delta_2 = \mathbf{0}_p, &(12)\\
& \mathbf{1}_n^T w = 1, &(13)\\
& w \ge \mathbf{0}_n, &(14)\\
& \tfrac{1}{2}\epsilon^2 - \tfrac{1}{2}\Delta_1^T \Delta_1 - \tfrac{1}{2}\Delta_2^T \Delta_2 \ge 0, &(15)\\
& \Delta_1 \ge \mathbf{0}_p, &(16)\\
& \Delta_2 \ge \mathbf{0}_p. &(17)
\end{aligned}
$$

We then construct the Lagrangian with $\beta \in \mathbb{R}^p$, $\theta \in \mathbb{R}$, $\alpha \in \mathbb{R}^n_{\ge 0}$, $\lambda \in \mathbb{R}_{\ge 0}$, and $\gamma_1, \gamma_2 \in \mathbb{R}^p_{\ge 0}$ as the Lagrange multipliers

$$L = \tfrac{1}{2} w^T w - \beta^T (Z^T w + \Delta_1 - \Delta_2) - \theta(\mathbf{1}_n^T w - 1) - \alpha^T w - \lambda\left(\tfrac{1}{2}\epsilon^2 - \tfrac{1}{2}\Delta_1^T \Delta_1 - \tfrac{1}{2}\Delta_2^T \Delta_2\right) - \gamma_1^T \Delta_1 - \gamma_2^T \Delta_2.$$

Setting $\partial L / \partial w_i$ to zero and applying the complementary slackness KKT condition $w_i \alpha_i = 0$ leads to the same solution (8) for $w_i$. Setting $\partial L / \partial \Delta_1$ and $\partial L / \partial \Delta_2$ to zero and applying their respective complementary slackness KKT conditions, we have

$$\lambda \Delta_1 = \max\{0, \beta\}, \qquad \lambda \Delta_2 = \max\{0, -\beta\},$$

which in turn leads to

$$\lambda (\Delta_1 - \Delta_2) = \beta, \quad (18)$$

$$\lambda^2 (\Delta_1^T \Delta_1 + \Delta_2^T \Delta_2) = \|\beta\|_2^2. \quad (19)$$

As a result, the slack variables can be eliminated from (12):

$$\lambda Z^T w + \beta = \mathbf{0}_p. \quad (20)$$

Combining (19) and the complementary slackness KKT condition $\lambda(\tfrac{1}{2}\epsilon^2 - \tfrac{1}{2}\Delta_1^T \Delta_1 - \tfrac{1}{2}\Delta_2^T \Delta_2) = 0$, we get

$$\lambda \epsilon = \|\beta\|_2,$$

which nicely connects the $L_2$ tolerance $\epsilon$ to the $L_2$ norm of the dual parameter $\beta$, and further eliminates $\lambda$ from (20):

$$Z^T w + \epsilon \frac{\beta}{\|\beta\|_2} = \mathbf{0}_p. \quad (21)$$

It reveals that it is the scaled dual parameter $\beta$ that serves as the slack for the relaxed covariate balance constraint (12). Finally, combining (8), (13) and (21), we again obtain a $(p+1)$-dimensional estimating equation with respect to the dual parameter $(\beta, \theta)$, and can solve it using a general purpose nonlinear equation solver.
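The relaxed estimating equation can be sketched in code as well. This is an illustrative numpy/scipy solver of our own naming, under the assumption that $\beta$ stays away from zero so that $\beta / \|\beta\|_2$ is well defined:

```python
import numpy as np
from scipy.optimize import fsolve

def solve_relaxed_dual(Z, eps):
    """Solve the relaxed estimating equation for the quadratic loss:
    Z^T w + eps * beta / ||beta||_2 = 0 together with sum(w) = 1."""
    n, p = Z.shape

    def estimating_eq(params):
        beta, theta = params[:p], params[p]
        w = np.maximum(0.0, Z @ beta + theta)         # link function (8)
        norm = np.linalg.norm(beta) + 1e-12           # guard against beta = 0
        return np.concatenate([Z.T @ w + eps * beta / norm,  # relaxed balance (21)
                               [w.sum() - 1.0]])             # normalization (13)

    # Start from near-uniform weights with a small nonzero beta.
    x0 = np.concatenate([np.full(p, 1e-4), [1.0 / n]])
    sol = fsolve(estimating_eq, x0)
    return np.maximum(0.0, Z @ sol[:p] + sol[p])
```

At the solution, the residual imbalance $Z^T w = -\epsilon \beta / \|\beta\|_2$ has $L_2$ norm $\epsilon$, split across covariates in proportion to $\beta$.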
APPENDIX B: EMPIRICAL CALIBRATION AND OTHER CAUSAL METHODS
We briefly compare empirical calibration with other causal inference methods. Randomized experiments use a randomized assignment mechanism to ensure that the covariate distributions between the treated and control groups are naturally balanced. For observational data, matching and weighting are the two main methods to enforce this balance.
Matching pairs treatment and control units at the individual level, while weighting assigns continuous weights to control units such that their weighted average matches the treatment average. Zhao & Percival (2017) summarized the matching and weighting methods in the table below.

                        Discrete weights             Continuous weights
  By raw covariates     Raw matching                 Empirical calibration
  By propensity scores  Propensity score matching    Propensity score weighting

Table 2: Causal methods with observational data.

Weighting is generally preferred to matching due to its efficiency gain (Zhao & Percival, 2017). Compared to propensity score weighting, empirical calibration enjoys similar theoretical properties but yields more stable weights.
Raw matching uses raw pre-treatment covariates for matching and does not attempt to model the assignment mechanism. Both exact matching and $k:1$ nearest neighbor matching fall into this category. Their variants differ in the choice of distance metric and in whether the same control unit is allowed to be used multiple times as a match (matching with replacement).

Raw matching is intuitive and easy to implement. In many applications, however, it can be difficult or impossible to find $k$ closely matched control units for certain treatment units. Empirical calibration, on the other hand, aims to balance the treated and control groups at the aggregate level, which is usually sufficient for estimating the ATT.

Nearest neighbor matching implies binary or discrete weights on control units. For observational data with a large control group, many control units are effectively discarded in the analysis. By adopting continuous weights, empirical calibration uses information from more control units. This leads to a larger effective sample size for the control group and reduces the variance of the ATT estimate.

Propensity score matching (PSM) assumes a regression model for the propensity score $p(X) = P(T = 1 \mid X)$. The treated and control units are matched using the estimated propensity scores. The propensity score serves as a sufficient statistic of the treatment assignment, and reduces the problem to a univariate matching problem.

The success of the propensity score method hinges on the quality of the estimated scores. Accurately estimated propensity scores stochastically balance the covariates between the treatment and control groups. In practice, however, matching or weighting solely based on the propensity score rarely guarantees actual covariate balance. If the data is balanced to begin with, PSM may even increase the imbalance (King & Nielsen, 2015).
Empirical calibration takes the guesswork out of the procedure by directly optimizing towards an explicit covariate balancing goal, with guaranteed results.

Propensity score weighting (PSW) weights the control units using the inverse probability weighting (IPW) scheme

$$\hat{\tau}_{IPW} = \frac{1}{n_1} \sum_{i: T_i = 1} Y_i - \frac{1}{n_1} \sum_{i: T_i = 0} \frac{\hat{p}(X_i)}{1 - \hat{p}(X_i)} Y_i, \quad (22)$$

where $n_1$ is the number of treated units and $\hat{p}(X_i)$ is the estimated propensity score for the $i$-th unit.

The inverse probability weights can be volatile and sensitive to model misspecification. If the estimated propensity score for a control unit is close to one, its odds weight $\hat{p}(X_i) / (1 - \hat{p}(X_i))$ becomes very large and unstable, inflating the variance of the IPW estimate of the ATT. One remedy is to trim extreme weights or apply winsorization, but it is hard to choose the trimming threshold in a principled way. Empirical calibration seeks weights that deviate least from uniform weights, so large weights are naturally penalized in the optimization, and the resulting weights are far more stable. Furthermore, if desired, bounds on the weights can be imposed directly in the optimization without the need for a separate trimming step.

Conventionally, the propensity score regression and outcome regression models are fitted separately and then combined to construct a doubly robust
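For concreteness, the IPW scheme in (22) can be written as a minimal numpy sketch. The function name `ipw_att` is ours, and we assume the propensity scores `p_hat` have already been estimated:

```python
import numpy as np

def ipw_att(y, t, p_hat):
    """IPW estimate of the ATT as in (22): mean treated outcome minus
    control outcomes reweighted by the odds p_hat / (1 - p_hat)."""
    y, t, p_hat = (np.asarray(a, dtype=float) for a in (y, t, p_hat))
    n1 = np.sum(t == 1)
    treated_sum = y[t == 1].sum()
    odds = p_hat[t == 0] / (1.0 - p_hat[t == 0])
    control_sum = (odds * y[t == 0]).sum()
    return (treated_sum - control_sum) / n1
```

Note how a control unit with `p_hat` near one receives an exploding odds weight, which is the instability discussed above.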