Machine Learning for Experimental Design: Methods for Improved Blocking
Brian Quistorff* and Gentry Johnson†

November 2, 2020
Abstract
Restricting randomization in the design of experiments (e.g., using blocking/stratification, pair-wise matching, or rerandomization) can improve the treatment-control balance on important covariates and therefore improve the estimation of the treatment effect, particularly for small- and medium-sized experiments. Existing guidance on how to identify these variables and implement the restrictions is incomplete and conflicting. We identify that differences are mainly due to the fact that what is important in the pre-treatment data may not translate to the post-treatment data. We highlight settings where there is sufficient data to provide clear guidance and outline improved methods to mostly automate the process using modern machine learning (ML) techniques. We show in simulations using real-world data that these methods reduce both the mean squared error of the estimate (14%-34%) and the size of the standard error (6%-16%).
Keywords:
Machine Learning, Big Data, Experimentation, Causality, Blocking, Stratification, Pair-wise matching, Rerandomization
In the design of experiments, the method of treatment randomization can be used to reduce the variance of the estimated treatment effect so as to improve efficiency, protect against type I errors, and increase power (Bruhn and McKenzie, 2009), particularly for small- and medium-sized experiments. This is achieved by improving variable balance (similarity of a variable's distribution between the treated and control groups) for variables that are important predictors of the post-treatment outcome.

For illustrative purposes we first look at the most common randomization method, blocking (sometimes called stratifying; in some settings "stratification" refers to drawing the experimental sample from the population and "blocking" to assigning treatment), originally proposed by Fisher (1935). Blocking creates a partition of the sample, separating pre-treatment data into blocks with a minimum size c_B (typically four; Kernan 1999) and assigning an equal number of treated and control units within each block. By dividing an important variable's extent with these blocks, one can increase balance along this variable. For example, if we ensure that units with different values of an important categorical variable are partitioned into separate blocks, then we can ensure that, even in finite sample (not just in expectation), treatment will be uncorrelated with this variable.

Existing guidance on how to use pre-treatment data is not fully data-driven and so involves many decisions by the experimenter, wasting time and potentially resulting in sub-optimal treatment effect estimation. There are two main existing strategies for picking blocks. We show how both can be improved by using modern off-the-shelf machine learning (ML) solutions that make our proposed procedures mostly data-driven.

* Microsoft Technology + Research. Contact: [email protected].
† Amazon, AWS Central Economics. Contact: [email protected].
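The within-block assignment mechanics described above can be sketched in a few lines (a minimal illustration; the grouping variable, unit labels, and block sizes are hypothetical, and real designs must also handle odd-sized blocks):

```python
import random
from collections import defaultdict

def block_randomize(units, category, seed=0):
    """Group units into blocks by a categorical variable, then assign
    exactly half of each block to treatment, so treatment is balanced
    on that variable even in finite sample (misfits aside)."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for u in units:
        blocks[category[u]].append(u)
    assignment = {}
    for _, members in sorted(blocks.items()):
        rng.shuffle(members)
        half = len(members) // 2
        for u in members[:half]:
            assignment[u] = 1   # treated
        for u in members[half:]:
            assignment[u] = 0   # control
    return assignment

units = list(range(8))
category = {0: "N", 1: "N", 2: "N", 3: "N", 4: "S", 5: "S", 6: "S", 7: "S"}
assign = block_randomize(units, category)
# each 4-unit block receives exactly 2 treated and 2 control units
```

Within each block, which units are treated is still random; only the treated/control counts per block are fixed.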
We also show how to choose among the available strategies.

The most common approach for determining blocks is what we will refer to as variable selection. This strategy attempts to select variables that will be strongly related to the post-treatment outcome. Blocks are then defined by a (potentially uneven) grid from splitting each variable separately and taking the Cartesian product. The number of selected variables is kept small, as there is a cost to stratifying on unimportant variables. Imbens et al. (2009) show that while stratification on a variable cannot increase the true expected squared error of the estimate, it does increase the estimate of its variance when accounting for stratification, due to a degrees-of-freedom adjustment as discussed below (one could use the variance estimator which ignores the stratification, but this is overly conservative). In their survey of this approach, Bruhn and McKenzie (2009) recommend selecting at least the pre-treatment outcome variable, as most outcomes have some unit-level persistence, and a geographic variable, as outcome shocks are likely to be correlated within geographic regions (i.e., the data generating process (DGP) is time-varying). With multiple pre-treatment periods of data, they suggest that one could determine which additional variables to include based on how each early variable is correlated with later pre-treatment outcomes. Even if there are many related variables, they caution against including too many, since each new variable decreases the balance on existing ones (given that blocks have a minimum size, the granularity along existing dimensions must decrease). In their simulation studies, they include four blocking variables. They recommend splitting variables by a roughly even number of quantiles.
Overall, the guidance leaves some areas unspecified (how exactly to determine which variables to include, how many quantiles to split variables by) and some areas sub-optimal (partitioning by a grid is sub-optimal when units are unevenly distributed across multiple selected variables, as increasing the granularity of the grid quickly causes some grid cells to reach the minimum cell size).

An alternative strategy for blocking has been to use an estimated prediction model to determine which units to group together. Barrios (2014) and Aufenanger (2017) both suggest, in different settings, using pre-treatment data to build a model of the pre-period outcome using pre-period covariates and then generating predicted values, so-called pre-period prognostic scores (Hansen, 2008). Blocks are formed by ordering units by their prognostic score and then sequentially allocating blocks of a common size. The guidance here is well specified, though we note that there may be more optimal ways to partition based on the prognostic score, given that the default method may create more blocks than is helpful for minimizing treatment effect error and therefore result in larger than necessary standard errors.

The approaches differ due to their assumptions about the DGP, particularly about its temporal properties. If the DGP is constant over time and well estimated by the predictive model in terms of available predictors and functional form, then the prognostic score approach is optimal. It efficiently uses pre-treatment information both by utilizing weakly related covariates discarded by the variable selection strategy and by collapsing all covariates to a single dimension, thus making it easier to find an optimal partition. Reducing to a single dimension, however, often results in units with similar prognostic scores that have very different covariates. If the DGP changes over time, units with similar pre-treatment prognostic scores may not have similar future prognostic scores.
In this case, it is beneficial to block instead on a handful of fixed characteristics (e.g., geographic and demographic variables). Similarly, if the predictive model cannot closely approximate the functional form of the DGP, it may be more beneficial to block on separate variables rather than a composite index. (For example, suppose you estimate a model with a squared covariate term; then a unit will be put close to one with the opposite covariate value, which may differ significantly under the true model.) If the predictive model is missing variables, then there may be persistence in the outcome variable over time not captured by the model, making it beneficial to block on the actual value of the pre-treatment outcome, as this is informative in addition to ŷ_pre1.
We show that, when there are multiple pre-treatment periods of data, there are ways to choose between the variable selection and prognostic score strategies. We also apply standard ML tools to automate both. This includes a strategy for determining the number of blocks, again an area with little guidance, where we balance the goals of improving estimate accuracy and reducing standard errors. Most of the improvements can be made using off-the-shelf ML tools, though we detail some areas where custom solutions would be helpful.

We note that there are additional situations in which one would want to block an experiment. It is commonly done if subgroup analysis is expected to be performed, as (a) the pre-specification guards against claims of searching indiscriminately for statistically significant subgroups and (b) it improves the precision of these estimates. We show how to include these extra block-constraints in the strategies we present. Finally, it is noteworthy that in the context of two-stage randomized trials, which we do not study here, one can form blocks in the second stage (using data from the first stage) to vary the treatment percentage across blocks in ways that can increase estimation precision, the so-called
Neyman Allocation (Tabord-Meehan, 2018).

We discuss the basic ML tools involved and our proposed strategies in Section 2. In Section 3 we discuss the application of these tools to the other most common methods of randomization. In Section 4 we use real-world data to compare our proposed strategies to hand-built blocking. We conclude in Section 5.
We first describe our notation and review basic goals. Then we discuss a few standard general ML tasks and highlight the most common method currently used for each. We then outline our proposed automated strategies that use these methods and how to select the optimal one. Finally, we propose modified strategies for when different types of data are available.
Suppose the following data generating process (DGP):

    y_it = β·d_it + h_t(X_i) + u_it

where i ∈ {1, ..., n} indexes experimental units (e.g., customers), t indexes time, and, as above, d is the binary treatment (zero for all units in the pre-periods; treatment only changes in one time period), h is potentially time-varying, X_i are the observed covariates, and the u_it are independent across units but may be correlated across time for an individual, as we do not measure all characteristics. We assume we have one post period and at least one pre-period (i.e., a baseline) of data. We also assume that we will analyze the experiment using data from the post period and include dummy variables for each block. (Though Bruhn and McKenzie (2009) note that in practice blocking dummies are often not included, they show empirically that this leads to overly conservative standard errors. CPMP (2004) similarly states that "analysis should reflect the restriction on randomisation implied by the stratification.")

In this paper, as is widely accepted in the literature on experimentation and causal inference more broadly, we follow the potential outcomes framework (Cochran and Rubin, 1973; Holland, 1986). More formally, if we just consider the post period, then we have outcome y_i of unit i. When unit i receives treatment d_i = 1, her outcome is y_i(1), and when she receives treatment d_i = 0, her outcome is y_i(0). The identification problem is that we cannot observe unit i in both states, meaning that we must use different units altogether to serve as the counterfactual for i. The average treatment effect (ATE) is defined as β = E[y_i(1) − y_i(0)] = E[y_i(1)] − E[y_i(0)]. Without any source of randomization we would estimate

    β̃ = E[y_i | d_i = 1] − E[y_i | d_i = 0]
       = (E[y_i(1) − y_i(0) | d_i = 1]) + (E[y_i(0) | d_i = 1] − E[y_i(0) | d_i = 0]).

The first term on the right-hand side is the average treatment effect on the treated. The second is often referred to as selection bias.
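This decomposition is easy to see in a toy simulation (a sketch under assumed parameters: a true effect β = 2, a single covariate x, and self-selection into treatment for high-x units; all names here are illustrative):

```python
import random
import statistics

rng = random.Random(42)
n = 20000
beta = 2.0  # assumed true treatment effect

# Toy DGP: the control potential outcome depends on a covariate x.
x = [rng.gauss(0, 1) for _ in range(n)]
y0 = [xi + rng.gauss(0, 1) for xi in x]   # y_i(0)
y1 = [y0i + beta for y0i in y0]           # y_i(1) = y_i(0) + beta

def diff_in_means(d):
    """Naive estimator: mean observed outcome of treated minus control."""
    yt = [y1[i] for i in range(n) if d[i] == 1]
    yc = [y0[i] for i in range(n) if d[i] == 0]
    return statistics.mean(yt) - statistics.mean(yc)

d_selected = [1 if x[i] > 0 else 0 for i in range(n)]  # selection on x
d_random = [rng.randrange(2) for _ in range(n)]        # pure randomization

biased = diff_in_means(d_selected)   # beta plus a positive selection-bias term
unbiased = diff_in_means(d_random)   # close to beta
```

With selection on x, the second term of the decomposition (E[y_i(0) | d_i = 1] − E[y_i(0) | d_i = 0]) is positive, so the naive difference in means overstates β; under randomization it vanishes.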
To eliminate selection bias the Conditional Independence Assumption, {y_i(0), y_i(1)} ⊥ d_i | X_i, must be met (Rosenbaum and Rubin, 1983). In non-experimental causal inference settings, the intuition underlying this assumption is that, after controlling for observable characteristics, treatment is as good as random. It also implies that E[y_i | X_i, d_i = 1] − E[y_i | X_i, d_i = 0] = E[y_i(1) − y_i(0) | X_i], and therefore β̃ = β.

In experimental settings, it can be easily seen that random assignment to treatment implies the Conditional Independence Assumption. In fact, conditioning on X_i is not even necessary for the assumption to hold, and thus inference in experimental settings is generally far more convincing than in observational settings. While pure randomization provides identification of β, more sophisticated treatment assignment mechanisms such as blocking bring other advantages, as mentioned briefly above and described in more detail below.

The standard benefits mentioned to motivate blocking include reducing Type I error, reducing Type II error (increasing power), and increasing efficiency. Type I error refers to the chance of a false-positive result given no effect exists. This can happen if there is a finite-sample correlation between the assigned treatment and a prognostic factor. Blocking will reduce the chance of such a correlation, and while some experimenters may address this with an ex-post adjustment, ex-ante restrictions are more efficient (Bruhn and McKenzie, 2009). We, therefore, can reduce Type I errors by reducing the mean-squared error (MSE) of the estimated treatment effect. Type II error refers to the chance of failing to detect an effect when one exists. This is directly related to the variance of the outcomes between the two treatment arms. Blocking on prognostic factors reduces the sample variances. We, therefore, can increase power by reducing the standard error of the estimated treatment effect.
Efficiency refers to the number of observations required to detect an effect for a given experimental setup. While dependent on many factors, it is typically thought of along the dimension of power: the more power required, the larger the experiment must be. We will therefore think about statistical efficiency in terms of reducing standard errors.

Blocking would, therefore, ideally reduce both the estimate's MSE and the estimate's standard error. These two goals are typically, but not always, aligned. Blocking on the most important expected prognostic factors typically improves both, but as mentioned above there is a degrees-of-freedom cost in the estimate's standard error. For example, using the OLS regression formula, ŝ.e.(β̂) = sqrt(s²·(X̃'X̃)⁻¹_dd), where s² = û'û/(n − b − 1), û are the fitted residuals, there are b blocks, and X̃ includes d along with all the blocking variables. (The diagonal elements of (X̃'X̃)⁻¹ measure the linear dependence of each column of X̃ against the rest. To see this, let x̂_j be the projection of x_j on the subspace spanned by the other columns and let ε_j = x_j − x̂_j; then (X̃'X̃)⁻¹_jj = 1/‖ε_j‖².) Suppose two blocking partitions, with b and b + 1 blocks respectively. As treatment is assigned orthogonal to the blocking factors, we can ignore differences in the (X̃'X̃)⁻¹ term. The extra blocking may reduce the residuals û'û, but might increase the standard error through the sqrt(1/(n − b − 1)) term. If the extra blocking does not improve the residuals, then the cost in terms of the relative increase in standard error is

    ŝ.e.(τ̂_{b+1}) / ŝ.e.(τ̂_b) = sqrt((n − b − 1) / (n − b − 2)).

This cost decreases with sample size, and conditional on sample size, each additional block is increasingly costly (though bounded, as the maximum number of blocks is roughly n/c_B). For
(Since the treatment remains orthogonal to the blocking factors, its measure of dependence will remain roughly constant.) For n = 200 and c_B = 4, the first additional block increases the standard error by roughly 0.25% and the (n/c_B)-th block increases the standard error by 0.34%. For a sample of n = 400 the numbers are 0.13% and 0.14%. The partition that minimizes the estimate's standard error can therefore have fewer blocks than the one that minimizes the estimate's MSE. If any block in the partition that minimizes the effect standard error has a size of at least 2c_B, then splitting it would improve the estimate's MSE.

An additional motivation for blocking can emerge if the experimenter expects to perform subgroup analysis across a particular variable to look for heterogeneity. In this case, the experimenter would have selected X̃ variables with an existing partition (likely a grid) Π̃ for this analysis. For example, in the simplest case, all X̃ variables could be split at their median values, creating 2^|X̃| initial blocks. In Section 2.3 and Section 2.4, we suggest minor adaptations to the procedures in this paper when an ex-ante subgroup analysis plan motivates blocking.

Given that the two goals (of reducing the estimate's MSE and standard error) might diverge, one must decide on an overall strategy. Ex-post one could optimally select among candidate partitions given a weighting between the two goals, but ex-ante this is much harder. While we note later how to do this when data is available (see Section 2.6.2), typically the data will not be available or can be put to better use. We therefore put forward a simple and reasonable approach that tries to balance them without explicitly optimizing the two goals. We create a sequence of partitions (discussed below) and find the partition where its fitted outcome model (fitting y to the block variables) has the best expected out-of-sample accuracy.
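The degrees-of-freedom cost above is easy to compute directly; a small sketch reproducing the back-of-the-envelope numbers for n = 200:

```python
import math

def se_cost_of_extra_block(n, b):
    """Relative increase in the estimated standard error from moving from
    b to b+1 blocks, assuming the extra block does not reduce residuals:
    sqrt((n - b - 1) / (n - b - 2)) - 1."""
    return math.sqrt((n - b - 1) / (n - b - 2)) - 1

# n = 200, c_B = 4: the first extra block costs ~0.25%,
# the (n/c_B)-th (50th) block costs ~0.34%.
first = se_cost_of_extra_block(200, 1)
last = se_cost_of_extra_block(200, 200 // 4 - 1)
```

The cost is convex in b for fixed n, which is why the marginal block becomes increasingly expensive as the partition approaches its maximal granularity.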
This naturally limits the partition complexity to some degree: when building a predictive model, a partition that is too fine-grained will over-fit to its training data and perform badly out-of-sample (e.g., a partition with a block for every observation has clearly gone beyond finding generalizable patterns and instead memorizes the idiosyncrasies of the current sample). This strategy, of estimating the out-of-sample performance of a model, can be done efficiently using the same data via a procedure called cross-validation (discussed below).

The value of blocking decreases with sample size. As the sample size increases, the chance of a finite-sample correlation between treatment and a prognostic factor decreases, so the concern about the estimate's MSE becomes less important. For statistical efficiency, it is commonly believed that blocking in larger samples is less important (Kernan, 1999), but this will depend more on the nature of the data. Overall, many clinical trialists suggest blocking is less important with samples over 400 (Kernan, 1999). Our view, however, is that if the process can be made easy, then the benefits may outweigh the costs at many sample sizes.

We focus first on the situation of having at least two pre-periods worth of data, t ∈ {pre1, pre2}, as this is the cleanest setup for the models. Both strategies will model the relation between y_pre2 and [X, y_pre1] to form partitions, and we show how to use an out-of-sample method to pick between them. After deciding which strategy to use, given there is likely some temporal dependence, we proceed with estimation by using the selected model to generate partitions using [X, y_pre2] rather than [X, y_pre1].

Before we detail the ML methods, we first discuss a general difference from more common methods used in economics. ML models typically have hyper-parameters, which often control the model's overall complexity.
One benefit of many ML methods is that they can be quite complex, but increasing their complexity too much can mean that they overfit to the sample data, essentially memorizing the idiosyncrasies of the current sample and behaving badly out of sample. Experimenters and practitioners have therefore developed procedures to modulate model complexity and limit overfitting. The main procedure is cross-validation (CV), which simulates the out-of-sample error. CV randomly splits the data into K "folds" (usually 5 or 10). Out-of-sample predictions are made for each observation using a model that was trained on all data but the fold for that observation (so there are K separately trained sub-models). One can then fit the model with different values of the hyper-parameter and pick the hyper-parameter value with the lowest mean squared prediction error (MSPE).

We list three common ML tasks and identify for each a method that is common, simple, and can be used off-the-shelf:

• Partitioning: This task is to create a partition, Π, with cells ℓ, from a feature-space X, to a variable level of complexity. The goal when constructing the partition is that the set of dummy variables covering each block maximizes their predictive power for y (i.e., the predicted value for each block is the block's mean outcome). Finding the globally optimal partition is too computationally intensive, so the most common method (Hastie et al., 2009) for this task is the Classification and Regression Tree (Cart; Breiman 1993). Cart starts with the whole feature space as a single block and recursively splits each block into two using rectilinear cuts. To split a block, it searches over each dimension and the possible values in that block and finds the split that most reduces the overall MSE of the outcome of the two sub-blocks. Intuitively, it finds a split such that the two sides have very different mean outcomes.
The main hyperparameters are the tree depth (which we choose through CV) and the minimum leaf size (which we set as c_B).

• Feature selection: In this task, we have a generic outcome y and features X and we would like to find the subset X* that is most important for determining y. The most common method (Taddy, 2019) is the Least Absolute Shrinkage and Selection Operator (Lasso; Tibshirani 1996). Lasso is a linear model that adds to the OLS objective function a penalty on the L1-norm of the coefficients, solving min_β ‖y − Xβ‖₂² + λ‖β‖₁. The Lasso solution will typically set many coefficients to exactly zero due to the geometry of the L1 penalization. If the true DGP is sparse in terms of the non-zero coefficients, then under certain conditions the Lasso can achieve the oracle property and be consistent in terms of selecting the true subset (Zou, 2006). We highlight three usage notes. First, as the absolute sizes of the coefficients are all penalized, we typically normalize all features to have a standard mean and variance. Second, as it is a linear model, variables that interact or affect the outcome non-linearly may not be selected; to help address this, a common practice is to augment X with common transformations. Third, we will follow common practice and set the λ hyperparameter using CV.

  – A common sub-task is to identify importance weights {w_k} of the selected variables. As the Lasso coefficients are biased due to the L1 regularization, one can construct importance weights by performing a subsequent OLS on just the Lasso-selected features (the Post-Lasso of Belloni and Chernozhukov 2013) and taking the absolute value of the coefficients.

• Prediction: In this task we wish to form a robust prediction in the face of potential non-linearities, learning y ≈ ĝ(X). There are many options for this task, but in most statistical data applications (i.e., not visual or text data), the Random Forest (Breiman, 2001) is common, simple, and performs well (Taddy, 2019).
A Random Forest is the average of a large number of separate tree models (typically Cart). Each tree is trained on a slight modification of the original data (the data is bootstrapped, and at each splitting decision a random subset of features is selected as candidates for splitting), yielding different trees and adding smoothness and robustness.

A few further notes on these defaults. As an alternative to picking the CV-minimizing hyper-parameter, the "1se" rule (Friedman et al., 2010) selects the simplest model whose MSE is no more than one standard error above the minimum; this is typically used if there is a strong reason to believe the model will be used with new data drawn from a different distribution. The partitioning methods introduced above may also be used as a non-linear variable-selection procedure by generating a partition and then selecting the variables that were used to split on at least once; with many variables, however, the performance of decision trees suffers (Hastie et al., 2009), so in these cases especially, Lasso is preferred. We note that for Lasso, some plug-in estimates for setting λ have attractive theoretical properties (Belloni et al., 2012). For more complicated methods, one can re-run the models, each time omitting one covariate, and use the increase in the MSE of the outcome as a measure of importance.

We note that, while we have picked a popular, widely available, and simple method for each purpose, there are alternatives (e.g., Best Subset instead of Lasso and Boosted Trees instead of Random Forests). If there are data or computational reasons to pick an alternative, that should be explored by the experimenter. The above can be thought of as default choices to operationalize the algorithms below. In the description of the algorithms, we will use the generic task name (partitioning, feature selection, or prediction) rather than any particular method.

Finally, as with most ML methods, the ones noted here can function even when there are more features than observations. They are therefore quite useful in settings where, despite a small sample size, we nonetheless have rich data on individuals.
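The core Cart operation described above, searching every dimension and cut-point for the split that most reduces outcome MSE subject to a minimum leaf size, can be sketched in a few lines (a toy single-split version; real implementations recurse and choose the depth by CV):

```python
def best_split(X, y, min_leaf=4):
    """Greedy Cart-style search: over every feature j and threshold t,
    find the binary split minimizing total within-leaf squared error,
    requiring both leaves to have at least min_leaf units (the role
    played by the blocking constraint c_B)."""
    def sse(idx):
        if not idx:
            return 0.0
        m = sum(y[i] for i in idx) / len(idx)
        return sum((y[i] - m) ** 2 for i in idx)

    n, k = len(y), len(X[0])
    best = None  # (total_sse, feature, threshold)
    for j in range(k):
        for t in sorted({row[j] for row in X}):
            left = [i for i in range(n) if X[i][j] <= t]
            right = [i for i in range(n) if X[i][j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            total = sse(left) + sse(right)
            if best is None or total < best[0]:
                best = (total, j, t)
    return best

# Units with x <= -1 have low outcomes, the rest high ones:
X = [[-3], [-2], [-2], [-1], [1], [2], [2], [3]]
y = [0, 0, 0, 0, 10, 10, 10, 10]
total_sse, feature, threshold = best_split(X, y)
# the search recovers the split at x <= -1 separating the two groups
```

Intuitively, the chosen split is the one whose two sides have the most different mean outcomes, exactly the behavior described in the Partitioning task above.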
As mentioned above, we can use a dedicated feature selection method initially or directly use the partitioning algorithm. The choice will depend on the number of covariates, K, and the experimenter's prior on the sparsity of the covariates in the DGP. If K is relatively small, then the partitioning algorithm can be used directly on the variables to create blocks. If K is relatively large, then the performance of partitioning algorithms tends to suffer. In that case, and especially if the experimenter's prior is that only a sparse subset of the variables matter for predicting the outcome, we can use a preliminary feature selection method.

Note that in addition to the standard variables, we could pre-generate ŷ_pre1 (from a prediction model y_pre1 ≈ g_PS^pre1(X)) and include it as well. Its inclusion might improve performance and focuses this strategy on alleviating issues arising from model misspecification and dynamic DGPs. Performance improvements will result if there is a long tail of covariates in X that are weakly related to y_pre1 and can therefore be compactly represented in ŷ_pre1. An orientation towards the issue of dynamic DGPs will result insofar as ŷ_pre1 captures information from X that explains the static components of the DGP, meaning the covariates selected from X when ŷ_pre1 is included in the model will be those whose influence may vary over time. Concisely, if y_pre1 is selected, this is evidence of persistence (unspecified variables), and if some of X is selected, then this is evidence of a dynamic DGP.

In some circumstances, blocking on real variables (even if chosen by a model) may be preferred, for interpretability and trustworthiness reasons, to using a synthetic feature such as ŷ_pre1. Using a synthetic measure such as this can result in unintuitive groups (units that, while having similar prognostic scores, have very different covariates).
This concern is raised similarly in the matching literature (King and Nielsen, 2016) in the context of propensity-score matching. If interpretable blocks are required, then the experimenter may prefer to leave ŷ_pre1 out of the Variable Selection strategy.

Full details are in Algorithm 1. This algorithm has multiple advantages over the existing manual process:

1. It has a common method for selecting among y_pre1, geographic variables, and other features. It also focuses on predictive power in a joint setting rather than using bivariate correlations. While the feature selection method does not jointly pick blocking variables and partition, it does have an automatic stopping rule (the cross-validated λ) for limiting the selected set of blocking variables.

2. In general, a tree-based partition is preferred to a grid-partition as it can have increased granularity while adapting to densely and sparsely populated regions of the covariate space. (Technically the minimum block size is only used on the pre2 data, but blocks could be made smaller for the pre1 data as well; in practice, typical decision trees have minimum leaf sizes of around six, so as to not estimate means from very small samples, and this variant is therefore unlikely to be helpful.)

Algorithm 1: Variable Selection Blocking Strategy
Inputs: y_pre1, y_pre2, and X.

1. Estimate a prediction model y_pre1 ≈ g_PS^pre1(X) and generate ŷ_pre1. Define M = {y_pre1, ŷ_pre1, X}.
2. Estimate a prediction model y_pre2 ≈ g_PS^pre2(X) and generate ŷ_pre2.
3. If K is large (or assuming sparsity): use a feature selection method predicting y_pre2 using M. Redefine M as the selected set of features. (If needed for a downstream task, return the importance weights.)
4. Perform partitioning (with CV tree depth) predicting y_pre2 using M, yielding partition Π.
5. Assign blocks based on updated data: b = Π(y_pre2, ŷ_pre2, X). Ensure that the partition did not create blocks smaller than c_B with the updated data (if so, prune back the tree complexity until this constraint is satisfied).

Return: b
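A toy end-to-end sketch of the algorithm's shape (with deliberately simplified stand-ins: correlation ranking in place of Lasso, a median split in place of Cart, no prediction model or CV, and hypothetical variable names throughout):

```python
import statistics

def variable_selection_blocking(X, y_pre1, y_pre2, n_select=1):
    """Toy sketch of the Variable Selection strategy. The candidate set M
    includes the lagged outcome y_pre1 alongside the columns of X; features
    are ranked by |correlation| with y_pre2 (stand-in for feature selection)
    and blocks come from median splits of the selected features (stand-in
    for Cart). Returns a block label per unit plus the selected features."""
    n, k = len(y_pre2), len(X[0])

    def corr(a, b):
        ma, mb = statistics.mean(a), statistics.mean(b)
        cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
        va = sum((ai - ma) ** 2 for ai in a)
        vb = sum((bi - mb) ** 2 for bi in b)
        return cov / (va * vb) ** 0.5 if va and vb else 0.0

    features = {("x", j): [row[j] for row in X] for j in range(k)}
    features[("y_pre1", -1)] = list(y_pre1)

    ranked = sorted(features, key=lambda f: -abs(corr(features[f], y_pre2)))
    selected = ranked[:n_select]

    labels = []
    for i in range(n):
        cell = tuple(features[f][i] > statistics.median(features[f])
                     for f in selected)
        labels.append(cell)
    return labels, selected

# Persistent outcome: y_pre2 tracks y_pre1 perfectly, X only weakly.
y_pre1 = [1, 2, 3, 4, 5, 6, 7, 8]
y_pre2 = [1, 2, 3, 4, 5, 6, 7, 8]
X = [[5], [1], [4], [2], [3], [6], [2], [7]]
labels, selected = variable_selection_blocking(X, y_pre1, y_pre2)
```

In this persistent-outcome example the lagged outcome dominates the covariate and is selected, matching the interpretation in the text (selection of y_pre1 is evidence of persistence).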
3. As the partitioning algorithm ensures that there is expected benefit to a finer partition, we naturally balance the trade-off between increased granularity and the downstream degrees-of-freedom adjustment.

If the experimenter is unsure whether to use an initial variable selection method in Algorithm 1, one can create both versions of the variable selection strategy and use the procedure in Section 2.5 to decide between them. If the experimenter is motivated to block in order to carry out a pre-specified subgroup analysis, then we suggest the following modification to Algorithm 1: in the partitioning step, we start with the existing partition Π̃ as described in Section 2.1 and recursively partition cells from that point. If the experimenter is using an initial feature selection, then the procedure should be constrained to only allow new splits on selected dimensions.

Remark (Adaptive Grid Alternative). The partition created by Cart will subdivide the space into hyperrectangles, but the partition can still be quite irregular and hard to understand. If the partition needs to be understandable on its own, then an alternative is to use an adaptive grid partition. This grid can be built by dividing (when possible) covariates on quantiles. There should be more blocks across variables that are more important; therefore, attempt to make the number of blocks across variables roughly proportional to their importance weights. The overall granularity is a hyperparameter that can be set by CV (assuming a minimum block size of c_B).

Remark (Misfits). Blocks for experiments often contain an odd number of units, preventing a perfectly even distribution of treatment in the block. One of the units (typically at random) is held out and labeled the "misfit" and the rest are randomized evenly across treatments.
One may want to ensure that the treatment assignment of the misfits is also even across the distribution. If blocks span only a single dimension, then we can iterate across the blocks in order and assign misfits to alternating treatments. If blocks span multiple dimensions, however, there are no simple solutions (if the misfits themselves form a rectangular lattice then this is possible, but this is highly unlikely). Practice varies in this situation, and non-random solutions are typically slow and approximate.

One approach, with any progressive partition method, is to view just the misfit units from a higher, coarser level of partitioning and re-do blocking at this higher level. With Cart, this can be done easily by simply iterating across the tree leaves in order and assigning misfits to alternating treatments.
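For the single-dimension case, alternating the misfits' arms across ordered blocks is straightforward (a small sketch; the block ordering is assumed to follow the blocking variable, and unit labels are illustrative):

```python
import random

def assign_with_misfits(ordered_blocks, seed=0):
    """Randomize evenly within each block; in odd-sized blocks hold out one
    random 'misfit' and assign misfits to alternating arms across blocks,
    so misfit treatment is also balanced along the blocking dimension."""
    rng = random.Random(seed)
    assignment = {}
    next_misfit_arm = 0
    for members in ordered_blocks:
        members = list(members)
        rng.shuffle(members)
        if len(members) % 2 == 1:
            misfit = members.pop()          # one random held-out unit
            assignment[misfit] = next_misfit_arm
            next_misfit_arm = 1 - next_misfit_arm   # alternate arms
        half = len(members) // 2
        for u in members[:half]:
            assignment[u] = 1
        for u in members[half:]:
            assignment[u] = 0
    return assignment

# Four ordered blocks of three units each: four misfits, arms 0,1,0,1.
blocks = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
assign = assign_with_misfits(blocks)
```

With an even number of odd-sized blocks, the misfits split evenly between arms, so the overall design stays exactly balanced.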
Remark (Feature learning). In ML, a task related to feature selection is feature learning. This focuses on generating (often a small set of) synthetic features, transformations or combinations of the original features, that can perform better than the original features for some downstream estimation. This is a task that many experimenters already do manually (e.g., constructing composite indexes, averages, and log/polynomial transformations of existing features), but feature learning performs it in an automated way. Learned features are often constructed using neural networks (Hinton and Salakhutdinov, 2006). A full treatment of the theory and application of feature learning is beyond the scope of this paper, so we merely note situations where feature learning might be helpful: cases where FPS will not perform the best (the true DGP is difficult to approximate or the DGP is dynamic), but where VS does not perform as well as it should (e.g., because there are too many variables to select from, so some combination is helpful). We note that, given the procedure must learn an additional set of transformations, the task usually requires a larger sample size, potentially limiting its usefulness. To our knowledge, feature learning has not yet been applied to experimental blocking.
We construct a Future Prognostic Score (FPS) by using a prediction model to approximate

    y_pre2 ≈ g_FPS(X, y_pre1).    (1)

Note that this is different from the simple prognostic score models of Barrios (2014) and Aufenanger (2017): it looks one step ahead and incorporates a past outcome value. This ensures that this strategy uses the same data as the variable selection strategy. With the past outcome value, FPS can now deal with outcome persistence, though since it collapses the match-space to a single index it cannot deal with a dynamic DGP. We must still also consider the fact that our model may be misspecified.

As above, blocking is carried out on the predicted value using updated features, ĝ_FPS(X, y_pre2). The existing standard method, which we call Sequential Allocation, is to arrange units according to their FPS and generate groups of size c_B. Groups can be made larger to incorporate segments of units with identical predicted values. This ensures that extra blocks are only created when there is a benefit to (in-sample) predictive performance. This might create more odd-sized cells, but misfits are less of a problem in this approach, as we can ensure an even distribution of the treatment arms across the span of the prognostic scores by iterating across the misfits in order and alternately assigning treatment. In the case of blocking motivated by a pre-planned subgroup analysis, the experimenter should start with the existing partition Π̃ as described in Section 2.1, arrange units within each block by their FPS, and proceed partitioning from that point (ensuring no cell with size below c_B).

Remark (Alternate score-based partitioning). The existing approach of taking prognostic scores and performing Sequential Allocation may create too many blocks, since it focuses on in-sample predictive performance.
The first-stage predictive method for learning ĝ_FPS does use tools to control for over-fitting (so that ŷ_i is not too influenced by y_i), but will likely still create too many unique levels of ŷ, and that is all the Sequential Allocator focuses on. We need to treat the joint process of learning ĝ_FPS and constructing the allocation as a combined partition method, using CV to control for the final complexity (the number of blocks). Given that we want a second-stage partitioning method that can create a partition with a less-than-maximal number of blocks, we may want more complexity than the Sequential Allocator. [Footnote: A pre-generated ŷ_pre1 could be used here as well, but it would not improve performance unless it were estimated using a different algorithm.] Options include:
• Simple: Create equal-quantile groups, yielding fewer blocks than N/c_B but still roughly evenly sized. This is simple, but far from optimal.
• Complex: Since we are only dealing with a single dimension, there will be many fewer possible partitions, and we can jointly optimize the splitting rules rather than use a greedy solution such as Cart. A straightforward approach would start with quantile splits and then use coordinate descent to sequentially optimize each split until no changes are made.
Regardless of the actual partitioning method used, the complexity should still be tuned for CV performance. As this is a two-stage process, for each fold f we learn a separate ĝ_FPS^f and partition using all data but fold f, creating an outcome prediction (the average prognostic score in each block), and then see the out-of-sample performance on fold f.

There are different ways to determine which strategy to use depending on the available data:
• If there is another pre-treatment period, pre3, we can empirically see which resulting partition has the best predictive performance on y_pre3.
• If not, then we can compare performance using cross-validation, where here we choose between different model types rather than between different hyper-parameters for a single model type.
Given that we need sufficient units per block, a 2-fold CV version is best to maximize the size of the held-out fold. One can average the results over multiple random splits to reduce noise. Note that this works best with larger datasets. [Footnote: Comparing directly the performance of the partitions from the above models on y_pre2 would be biased, as the ML models were trained on that data.] After deciding which strategy to use, given there is likely temporal dependence, we use the model to generate partitions using y_pre2 rather than y_pre1. [Footnote: Another simple alternative would be to use Cart targeting y_pre2 and blocking on ŷ_pre2 = ĝ_FPS(X, y_pre1) to create the partition. This solution likely does not offer any benefit in single-dimensional partitioning, as the greedy solution will result in a very uneven distribution of sizes (some blocks roughly twice the size of others).]

If there are time-varying covariates Z_it, then they should be used in the same way as y_pre: Z_pre1 would be used when modeling y_pre2, and then updated values Z_pre2 would be used to construct the final partition.

With more time periods, we can improve several parts of the process. One option is to use the above strategies with additional look-ahead predictions:
• Variable selection: Use variables M = {y_pre2, y_pre1, X, ŷ_pre1, ŷ_pre2} and either construct the partition directly or via an initial feature selection method targeting y_pre3.
• FPS: Generate prediction values from y_pre3 ≈ g_FPS+(y_pre2, y_pre1, X).
A second option is to use y_pre3 to find an optimal trade-off between the goal of reducing the estimate's MSE and its standard error. For every candidate partition created using data from {y_pre2, y_pre1, X}, one could simulate S different randomizations and then calculate the average ...
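The core FPS pipeline described above — fit Equation (1), re-score on updated features, and run Sequential Allocation — can be sketched as follows. Plain least squares stands in for the prediction model only to keep the sketch self-contained (the simulations below use a random forest), and the function name is ours:

```python
import numpy as np

def fps_blocks(X, y_pre1, y_pre2, c_B=4):
    """Future Prognostic Score blocking via Sequential Allocation (a sketch).

    Any prediction model can play the role of g_FPS; ordinary least squares
    is used here only for self-containment. The model is fit to predict
    y_pre2 from (X, y_pre1); units are then scored on the *updated*
    features (X, y_pre2), and consecutive units of the sorted scores form
    blocks of size c_B.
    """
    def design(X, y):
        return np.column_stack([np.ones(len(y)), X, y])

    beta, *_ = np.linalg.lstsq(design(X, y_pre1), y_pre2, rcond=None)
    scores = design(X, y_pre2) @ beta  # hat{g}_FPS(X, y_pre2)
    order = np.argsort(scores)
    block_of = np.empty(len(scores), dtype=int)
    for b, start in enumerate(range(0, len(order), c_B)):
        block_of[order[start:start + c_B]] = b
    n_full, rem = divmod(len(order), c_B)
    if rem and n_full:  # fold an undersized remainder into the last block
        block_of[order[-rem:]] = n_full - 1
    return block_of
```

In practice, the partition complexity would then be tuned by cross-validation as described in the text, rather than fixed at groups of exactly c_B.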
We review here the other main randomization methods, pair-wise matching and rerandomization, and how the above strategies can be modified when either is preferred to blocking.
Pair-wise matching divides the sample into similar pairs, with each pair randomly assigned to have a treated and a control unit. If the experimenter wants to improve balance along a certain variable, this can be explicitly achieved by including that variable in the match criteria.

Application of strategies:
• Variable selection: It is straightforward to use the feature selection method above to select a match space and then construct pairs. Each unit has values for its selected features, M, and so our task becomes to divide the units into pairs with a method that attempts to minimize the overall within-pair differences (where we define distance as geometric distance in M, weighting each dimension by its importance w_k). This is similar to the problem of matching treated to control units in 1-1 matching estimators. As in that domain, the optimal solution (Greevy, 2004) is quite difficult, so most implementations take the approach of finding the "nearest available match" (King et al., 2007). We therefore suggest the same: select available units randomly and pair them with their nearest available unit.
• Future prognostic score: Use a prediction model to generate prognostic scores, order units by their score, and then sequentially put them into pairs.
Selection between strategies: As we can produce pair-level dummies similar to the block-level dummies, the selection procedure is the same as with blocking.

Rerandomization techniques (Taves, 1974; Pocock and Simon, 1975) repeatedly randomize units to treatment and control arms until the imbalance across important variables meets some criterion. Two methods are commonly used: "big stick," which rerandomizes until no important variable has a significant difference at a pre-specified level (commonly 5%), and "min-max," which computes, for a pre-specified number of draws R (commonly 1000), the maximum t-statistic difference for the important variables and then chooses the randomization with the minimum maximum.
Notice that, in contrast to the other methods, this ensures a parametric rather than non-parametric form of balance, as we explicitly specify the moments (typically means) that should be matched. We will focus on the min-max strategy, but it is straightforward to adapt the methods for the "big stick" approach. Let θ_rk be the t-statistic for the difference in means of the k-th variable between the two treatment arms in the r-th randomization, so that the standard min-max strategy selects r* = argmin_r [max_k θ_rk].

Application of strategies:
• Variable selection: Proceeding as in the variable selection setting above, we use the feature selection method to get selected variables M. These will constitute the set of variables for which we compare the t-statistics of mean differences across treated and control units. We suggest taking into account the relative importance of the variables by finding the ideal randomization via r* = argmin_r [max_k w_k θ_rk].
• Future prognostic score: Use a prediction model to generate future prognostic scores. Let θ̃_r be the t-statistic for the difference in means of the future prognostic scores for the r-th randomization. As we have collapsed the dimensions, we now simply choose r* = argmin_r θ̃_r.
Selection between strategies: If we have access to an additional pre-period of data, then we can choose between the above methods in a similar way, by taking both approaches and seeing how well they do at minimizing average differences in y_pre3 between treatment and control groups. If we do not, we can use the method for blocking using CV and see the average difference between the arms in the hold-out samples.

Table 1: Coefficient MSE

Method              Mexico, n=100  Mexico, n=300  Sri Lanka, n=100  Sri Lanka, n=300
FPS: Random Forest  .0242458       .007467        .0294429          .0109078
Manual: 48 blocks   .0368593       .0093043       .0355722          .0115674
VS: CART            .0246845       .0079382       .0310299          .0101904
VS: Lasso + CART    .0251519       .0071399       .0285996          .0099049
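Returning to the weighted min-max rule above, a minimal sketch (the function name and interface are ours; the standard strategy corresponds to all w_k = 1):

```python
import numpy as np

def minmax_rerandomize(X_imp, w, R=1000, seed=0):
    """Weighted min-max rerandomization (a sketch).

    For R random half-splits into treatment and control, compute the
    two-sample t-statistic for each important variable in X_imp (an n x k
    array), and keep the draw minimizing max_k w_k * |theta_rk|.
    """
    rng = np.random.default_rng(seed)
    n, k = X_imp.shape
    w = np.asarray(w, dtype=float)
    best_T, best_score = None, np.inf
    for _ in range(R):
        T = np.zeros(n, dtype=bool)
        T[rng.permutation(n)[:n // 2]] = True
        a, b = X_imp[T], X_imp[~T]
        se = np.sqrt(a.var(ddof=1, axis=0) / len(a) +
                     b.var(ddof=1, axis=0) / len(b))
        theta = np.abs(a.mean(axis=0) - b.mean(axis=0)) / se
        score = np.max(w * theta)
        if score < best_score:
            best_T, best_score = T, score
    return best_T, best_score
```

The "big stick" variant replaces the loop's bookkeeping with an early exit once no weighted t-statistic exceeds the pre-specified significance threshold.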
To analyze empirically how well our strategies perform, we use the data and framework of Bruhn and McKenzie (2009), comparing their manually constructed blocks against our blocking strategies. We use the two datasets from their framework containing more than two pre-treatment outcome periods: a panel survey of microenterprises in Sri Lanka (de Mel et al., 2008) and a sub-sample of the Mexican employment survey (ENE). In both of these, the subgroup studied received no treatments. We treat the first two periods as pre1 and pre2 and the third as post. For both, we estimate results using the n=100 and n=300 samples. The Sri Lanka dataset has 29 covariates and the Mexican sample has 30 covariates. The benefit of the ML strategies we propose typically increases with the number of covariates. We perform 10,000 simulations of placebo assignments to units and assess the performance of the strategies above as compared to the strategy of Bruhn and McKenzie (2009), which constructs 48 blocks by hand-picking four variables and then manually determining a grid.

We analyze our results in terms of the MSE of the treatment effect (given we know the true effect is zero) and the size of the standard error. Table 1 reports the MSE of the estimated coefficient. We see that all of our strategies perform better than the manual method across all samples. The reduction in the MSE from using the best ML method ranges from 16%-34%. The Future Prognostic Score strategy performed best on the Mexican ENE sample with N = 100, whereas the Variable Selection strategy with initial Feature Selection performed best on the Mexican ENE sample with N = 300 and on both Sri Lankan samples.

Table 2 reports the size of the standard error of the estimate, a measure of precision. All ML algorithms again perform better than the manual strategy across all samples. The reduction in the standard error from using the best ML method ranges from 6%-16%.
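The placebo-evaluation framework used here can be sketched as follows. This is a simplified illustration under assumptions of ours (a block-weighted difference-in-means estimator and a user-supplied assignment function), not the paper's exact code:

```python
import numpy as np

def placebo_mse(y_post, assign_fn, n_sim=1000, seed=0):
    """Placebo evaluation of a randomization strategy (a sketch).

    No unit is actually treated, so the true effect is zero. For each
    simulation, `assign_fn(rng)` returns block ids and a 0/1 treatment
    vector; the 'effect' is estimated as the block-size-weighted average
    of within-block treated-minus-control differences, and we report the
    MSE of the estimates around zero.
    """
    rng = np.random.default_rng(seed)
    ests = []
    for _ in range(n_sim):
        blocks, T = assign_fn(rng)
        est, ntot = 0.0, 0
        for b in np.unique(blocks):
            m = blocks == b
            est += m.sum() * (y_post[m & (T == 1)].mean()
                              - y_post[m & (T == 0)].mean())
            ntot += m.sum()
        ests.append(est / ntot)
    return float(np.mean(np.square(ests)))
```

Plugging in competing strategies (manual blocks vs. the ML-constructed ones) and the same post-period outcome gives directly comparable MSE figures of the kind reported in Table 1.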
All three automated strategies performed best in at least one context.

Table 2: Size of Coefficient Standard Error

Method              Mexico, n=100  Mexico, n=300  Sri Lanka, n=100  Sri Lanka, n=300
FPS: Random Forest  509.4455       268.1693       917.4573          515.8929
Manual: 48 blocks   611.7684       300.0989       964.0424          537.3345
VS: CART            525.2979       274.4434       925.6401          499.0057
VS: Lasso + CART    514.9183       264.9388       905.8749          500.7876

Restricting randomization in experiments to reduce treatment-control imbalances on variables that are important for predicting the post-treatment outcome improves efficiency, protects against type I errors, and increases power for the estimated treatment effect (Bruhn and McKenzie, 2009), particularly for small- and medium-sized samples. Existing guidance for this process has been conflicting and demands many ad hoc decisions. We show that this incompleteness in guidance is due to differing views on the dynamics of the data generating process (DGP). In the case of having at least two pre-periods of baseline data, we outline methods that resolve these differences and automate the process using modern, off-the-shelf machine learning (ML) techniques. For the main type of randomization restriction, blocking, we determine the important dimensions along which to create blocks, how to create the blocks, and how many should be made. Crucially, for determining how many blocks to create, we provide a way to balance the goal of improving the estimator's true accuracy, which improves with more blocks, against the goal of reducing the estimated standard error, which can increase due to a degrees-of-freedom correction if the extra blocks are only marginally helpful. Applications are also shown for the other main types of randomization restrictions: pair-wise matching and rerandomization. With real-world data, we see reductions in the mean squared error of the estimated coefficient of 14%-34% and reductions in the standard error of the estimate of 6%-16%.
We also detail custom tools that may improve performance even more.
References
Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353-7360, 2016. doi:10.1073/pnas.1510489113.

Tobias Aufenanger. Machine learning to improve experimental design. Technical report, FAU Discussion Papers in Economics, 2017. URL https://ideas.repec.org/p/zbw/iwqwdp/162017.html.

Thomas Barrios. Optimal stratification in randomized experiments. Mimeo, 2014. URL https://scholar.harvard.edu/files/tbarrios/files/opstratv17_0.pdf.

A. Belloni, D. Chen, V. Chernozhukov, and C. Hansen. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6):2369-2429, 2012. doi:10.3982/ecta9626.

Alexandre Belloni and Victor Chernozhukov. Least squares after model selection in high-dimensional sparse models. Bernoulli, 19(2):521-547, 2013. doi:10.3150/11-bej410.

Leo Breiman. Classification and regression trees. Chapman & Hall, New York, 1993. ISBN 9780412048418.

Leo Breiman. Random forests. Machine Learning, 45(1):5-32, 2001. doi:10.1023/a:1010933404324.

Miriam Bruhn and David McKenzie. In pursuit of balance: Randomization in practice in development field experiments. American Economic Journal: Applied Economics, 1(4):200-232, 2009. doi:10.1257/app.1.4.200.

William G. Cochran and Donald B. Rubin. Controlling bias in observational studies: A review. Sankhya: The Indian Journal of Statistics, Series A, 35(4):417-446, 1973.

CPMP. Committee for Proprietary Medicinal Products (CPMP) points to consider on adjustment for baseline covariates. Statistics in Medicine, 23(5):701-709, 2004. doi:10.1002/sim.1647.

Suresh de Mel, David McKenzie, and Christopher Woodruff. Returns to capital in microenterprises: Evidence from a field experiment. Quarterly Journal of Economics, 123(4):1329-1372, 2008. doi:10.1162/qjec.2008.123.4.1329.

R. A. Fisher. The design of experiments. Oliver and Boyd, Edinburgh, 1935.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 2010. doi:10.18637/jss.v033.i01.

R. Greevy. Optimal multivariate matching before randomization. Biostatistics, 5(2):263-275, 2004. doi:10.1093/biostatistics/5.2.263.

B. B. Hansen. The prognostic analogue of the propensity score. Biometrika, 95(2):481-488, 2008. doi:10.1093/biomet/asn004.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer-Verlag New York Inc., 2009. ISBN 0387848576.

G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006. doi:10.1126/science.1127647. URL http://science.sciencemag.org/content/313/5786/504.

Paul W. Holland. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945-960, 1986.

Guido Imbens, Gary King, David McKenzie, and Geert Ridder. On the finite sample benefits of stratification in randomized experiments. Mimeo, 2009.

W. Kernan. Stratified randomization for clinical trials. Journal of Clinical Epidemiology, 52(1):19-26, 1999. doi:10.1016/s0895-4356(98)00138-3.

Gary King and Richard Nielsen. Why propensity scores should not be used for matching. Mimeo, 2016. URL https://gking.harvard.edu/files/gking/files/psnot.pdf.

Gary King, Emmanuela Gakidou, Nirmala Ravishankar, Ryan T. Moore, Jason Lakin, Manett Vargas, Martha María Téllez-Rojo, Juan Eugenio Hernández Ávila, Mauricio Hernández Ávila, and Héctor Hernández Llamas. A "politically robust" experimental design for public policy evaluation, with application to the Mexican universal health insurance program. Journal of Policy Analysis and Management, 26(3):479-506, 2007. doi:10.1002/pam.20279.

S. J. Pocock and R. Simon. Sequential treatment assignment with balancing for prognostic factors in the controlled clinical trial. Biometrics, 31:103-115, 1975.

Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41-55, 1983.

Max Tabord-Meehan. Stratification trees for adaptive randomization in randomized controlled trials. Working paper, 2018. URL https://sites.northwestern.edu/mtu579/.

Matt Taddy. Business Data Science. McGraw-Hill Education Ltd, 2019. ISBN 1260452778.

Donald R. Taves. Minimization: A new method of assigning patients to treatment and control groups. Clinical Pharmacology & Therapeutics, 15(5):443-453, 1974. doi:10.1002/cpt1974155443.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267-288, 1996.

Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418-1429, 2006. doi:10.1198/016214506000000735.
A Alternative tree splitting rule for partitions
While partitioning algorithms are fit on one set of data, they are designed to not overfit to the sample and are instead tuned to do well on the general population of data that the sample was drawn from. The standard way to do this is to fit a full sequence of partitions of increasing granularity, each focusing on in-sample fit, and then to pick the one that does best on CV out-of-sample predictions. An alternative, pioneered by Athey and Imbens (2016), is to incorporate this out-of-sample focus directly into each splitting decision in cases where we know the size of the auxiliary sample on which the partition will be used. Taking the example of Cart, one can write the typical objective function as finding the partition Π that minimizes the (modified) MSE

    MSE(Π; S_pre) = -(1/N) Σ_{ℓ∈Π} N_ℓ μ̂²(ℓ; S_pre, Π).

Athey and Imbens (2016) show that if we take the auxiliary sample into account during the split, we should instead minimize the expected MSE, which can be estimated as
    EMSE(Π; S_pre) = -(1/N) Σ_{ℓ∈Π} N_ℓ μ̂²(ℓ; S_pre, Π) + (2/N) Σ_{ℓ∈Π} N_ℓ V̂(μ̂(ℓ; S_pre, Π)),

where we now penalize blocks that have high variance in their estimates. Using this for partitioning requires custom tools (Athey and Imbens (2016) provide tools for partitioning on estimated treatment effects, not estimated outcomes), so we leave this for future work.
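A sketch of evaluating this criterion for a candidate partition on pre-period outcomes (the function is ours; it returns the negative of the EMSE estimate, so larger is better — rewarding large squared block means while penalizing blocks whose mean is estimated with high variance):

```python
import numpy as np

def neg_emse(y, blocks):
    """Negative of the estimated EMSE for a candidate partition (a sketch).

    For each block: reward N_l * mu_hat^2, penalize 2 * N_l * Var_hat of the
    block-mean estimate; the total is averaged over the N units.
    """
    n = len(y)
    total = 0.0
    for b in np.unique(blocks):
        seg = y[blocks == b]
        mu = seg.mean()
        v = seg.var(ddof=1) / len(seg)  # Var-hat of the block mean
        total += len(seg) * mu ** 2 - 2 * len(seg) * v
    return total / n
```

A split search would then compare this score across candidate splits instead of the in-sample MSE, favoring splits whose block means are both distinct and precisely estimated.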
B Alternative available data
B.1 1 pre-period
This is the typical case studied in the previous literature. We can automate a few portions of the standard strategies, but we cannot deal with the general temporal dynamics of the DGP:
• Variable selection: Since we only have a single outcome, we do not have a separate target with which to jointly pick the best variables from [X, y_pre1]. We therefore take the guidance of Bruhn and McKenzie (2009) and force the inclusion of y_pre1, separately selecting the features X* from a feature selection model targeting y_pre1 with X. Similarly, we can no longer construct a partition based on a joint predictive model. We could construct an adaptive grid (as above). The experimenter would have to give a relative weight for y_pre1 compared to the variables in X*. Obvious candidates would be Σ_{k∈X*} w_k (so that y_pre1 has equal weight to all of X*) or (1/|X*|) Σ_{k∈X*} w_k (the average weight from X*).
• Prognostic score: Construct the simple prognostic scores from a model of y_pre1 ≈ g_PS(X). Then order units by their prognostic score and partition them into groups of c_B.
Selection between strategies: Here the experimenter would have to take a stand on the amount of temporal dependence in the DGP (which could potentially be assessed in another data source).

Remark (Auxiliary sample). If there is an auxiliary sample with improved data (e.g., [X, y, y] with no treatment applied), then we can construct the partition tree using the auxiliary sample and bring the partition over to the main sample. If the main sample is smaller, then the partition can be pruned back until the minimum cell has at least c_B units. As there is not sufficient data to tune this new partition to out-of-sample performance, it might result in slightly more blocks than are optimal.

B.2 Zero pre-period outcomes