PCA Rerandomization
Hengtao Zhang, Guosheng Yin and Donald B. Rubin

Department of Statistics and Actuarial Science, The University of Hong Kong, Pokfulam Road, Hong Kong
Department of Statistics, Harvard University, Cambridge, MA, USA
Yau Mathematical Sciences Center, Tsinghua University, Beijing, China
Fox Business School, Temple University, Philadelphia, PA, USA

Abstract. Mahalanobis distance between treatment group and control group covariate means is often adopted as a balance criterion when implementing a rerandomization strategy. However, this criterion may not work well for high-dimensional cases because it balances all orthogonalized covariates equally. Here, we propose leveraging principal component analysis (PCA) to identify proper subspaces in which Mahalanobis distance should be calculated. Not only can PCA effectively reduce the dimensionality for high-dimensional cases while capturing most of the information in the covariates, but it also provides computational simplicity by focusing on the top orthogonal components. We show that our PCA rerandomization scheme has desirable theoretical properties on balancing covariates and thereby on improving the estimation of average treatment effects. We also show that this conclusion is supported by numerical studies using both simulated and real examples.

KEY WORDS: Covariate Balance; Experimental Design; Mahalanobis Distance; Principal Component Analysis; Randomization.
1 Introduction

Randomized experiments have long been regarded as the gold standard to measure the effect of an intervention, because randomization can reduce the potential bias of estimates by balancing the covariate distributions between treatment groups on average. However, when pure (complete) randomization is implemented in practice, it often yields unbalanced allocations, so that the groups should be rerandomized before the experiment is actually conducted. Although rerandomization has been discussed earlier (Fisher, 1926; Cox, 2009; Worrall, 2010), its formal theoretical framework was not well established until the publication of Morgan and Rubin (2012). Using Mahalanobis distance as the balance criterion for treatment-control experiments, rerandomization was shown to improve the covariate balance and the precision of estimated treatment effects. Following the work of Morgan and Rubin (2012), effort has been made to extend or modify such rerandomization schemes. For example, Morgan and Rubin (2015) proposed a rerandomization strategy for covariates with different tiers of anticipated importance with respect to the outcome variable. The extension of rerandomization to $2^K$ factorial designs was developed by Branson et al. (2016) based on a real example with educational data. Zhou et al. (2018) considered rerandomization for experiments with sequentially enrolled units. Li et al. (2018, 2020) investigated the asymptotic properties of the standard treatment effect estimator for the treatment-control setting and $2^K$ factorial designs, respectively. Li and Ding (2020) further established asymptotic properties for the combination of rerandomization and regression adjustment.

All the aforementioned works use Mahalanobis distance as the balance measure owing to its several appealing characteristics. First, it is invariant to any affine transformation of the original covariates. Second, Morgan and Rubin (2012) showed that, not only can Mahalanobis distance guarantee the unbiasedness of the treatment effect estimator as well as the balance of covariate means between equal-sized treatment and control groups, but it also reduces an equal percent of sampling variance for each covariate and principal component. Apart from rerandomization, Mahalanobis distance is also widely applied in matching methods for observational studies (Rubin, 1973a,b, 1979, 1980; Rosenbaum and Rubin, 1985; Stuart, 2010).

Despite the advantages discussed above, the full-rank Mahalanobis distance may not work well for rerandomization with high-dimensional data (Branson and Shao, 2021), because it is difficult to equally balance a large number of principal components with different magnitudes of sampling variances. In related work, Morgan and Rubin (2015) proposed balancing the covariates hierarchically using prespecified tiers of importance of covariates related to the outcome. Branson and Shao (2021) pointed out that it might be difficult to specify the relative importance for a large number of covariates a priori. They proposed including a ridge term in Mahalanobis distance, which they showed puts more emphasis on the top principal components. However, the ridge rerandomization of Branson and Shao (2021) relies on complicated Monte Carlo integration and constrained optimization to determine the value of the ridging parameter.
Rather than using Mahalanobis distance, Johansson and Schultzberg (2020) proposed a different rank-based balance measure for rerandomization, but their heuristic metric is designed for longitudinal data where the pre-experimental outcomes are available to estimate the relative importance of each covariate. Moreover, the theoretical properties have not yet been developed under their proposed balance metric.

Here, we propose using only the top principal components from a principal component analysis (PCA) to calculate Mahalanobis distance in the associated subspace and then perform rerandomization. Our PCA rerandomization can be viewed as using a lower-dimensional alternative to the full-rank Mahalanobis distance. We show that because PCA rerandomization only reduces the variance of the selected top components, it thereby imposes more shrinkage on them relative to full-rank rerandomization given the same acceptance probability. Moreover, the lower-dimensional orthogonality of the top principal components simplifies the covariance matrix into a diagonal matrix and thus improves the computational efficiency when calculating Mahalanobis distance.

We establish theory for PCA rerandomization, including the sampling distribution of the modified balance criterion and the variance reduction properties for the standard treatment effect estimator and covariate mean differences compared with complete randomization. Practically, despite using PCA, our method is as easy to implement as the original rerandomization, without the cumbersome parameter specification or increased computation involved in ridge rerandomization.

In Section 2, we review the rerandomization framework based on Mahalanobis distance (Morgan and Rubin, 2012). We present the details of PCA rerandomization and its theoretical properties in Section 3. Section 4 reports the results of numerical experiments, which show the potential desirable performance of our proposed method compared with other randomization schemes. Section 5 concludes with a discussion.

2 Rerandomization Based on Mahalanobis Distance

Let $X = (x_1, \ldots, x_n)^\top \in \mathbb{R}^{n \times d}$ be the fixed covariate matrix representing $n$ trial participants with $x_i \in \mathbb{R}^d$. For simplicity, assume that all $x_i$'s are standardized to have zero mean and unit variance: $X^\top 1_n = 0$, where $1_n \in \mathbb{R}^n$ is the $n$-component vector with all components equal to 1. We focus here on treatment-versus-control experiments, and define $W = (W_1, \ldots, W_n)^\top \in \{0, 1\}^n$ to be the random indicator vector of allocations, where $W_i = 1$ means the $i$th unit is assigned to the treatment group, whereas $W_i = 0$ means the $i$th unit is assigned to the control group. In the theoretical developments, for simplicity, we further impose the constraint on $W$ such that the treatment and control groups are the same size,
$$\sum_{i=1}^n W_i = \sum_{i=1}^n (1 - W_i) = n/2. \qquad (2.1)$$
Given $W$ and $X$, let
$$\bar{x}_T = \frac{2}{n} X^\top W \quad \text{and} \quad \bar{x}_C = \frac{2}{n} X^\top (1_n - W), \qquad (2.2)$$
which denote the mean vectors of the covariates of the treatment and control groups, respectively. According to the potential outcomes framework (Neyman, 1923; Rubin, 1974; Imbens and Rubin, 2015), each unit is associated with two potential outcomes $y_i(1)$ and $y_i(0)$ corresponding to the situation where the unit is assigned to the treatment group or the control group. Only one outcome can be observed after the allocation, which can be written as $y_i = W_i y_i(1) + (1 - W_i) y_i(0)$.
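To fix ideas, here is a minimal numerical sketch of the setup in (2.1)-(2.2). The helper names (`balanced_allocation`, `group_means`) are ours for illustration, not from any published implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def balanced_allocation(n, rng):
    """Draw W uniformly over all equal-sized assignments, per (2.1)."""
    W = np.zeros(n, dtype=int)
    W[rng.choice(n, size=n // 2, replace=False)] = 1
    return W

def group_means(X, W):
    """Treatment and control covariate means as in (2.2)."""
    n = X.shape[0]
    return 2.0 / n * X.T @ W, 2.0 / n * X.T @ (1 - W)

n, d = 100, 10
X = rng.standard_normal((n, d))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize, as assumed above
W = balanced_allocation(n, rng)
xbar_T, xbar_C = group_means(X, W)
```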
The standard goal of causal inference is to estimate the sample average treatment effect (SATE),
$$\tau = \frac{1}{n} \sum_{i=1}^n \{ y_i(1) - y_i(0) \}.$$
The standard simple estimator of the SATE is the mean difference of the observed outcomes between the two observed groups,
$$\hat{\tau} = \bar{y}_T - \bar{y}_C = \frac{2}{n} \{ W^\top y - (1_n - W)^\top y \},$$
where $y = (y_1, \ldots, y_n)^\top$. In randomized experiments, more balanced covariates across treatment and control groups generally lead to a more precise estimator $\hat{\tau}$. Traditional pure (or complete) randomization only balances covariates between the treatment and control groups on average, and thus a particular realized allocation can be unbalanced, thereby adversely affecting statistical inference.

Morgan and Rubin (2012) formally established the rerandomization framework using Mahalanobis distance, $M$, as the balance measure, where
$$M = (\bar{x}_T - \bar{x}_C)^\top \Sigma^{-1} (\bar{x}_T - \bar{x}_C), \qquad (2.3)$$
and $\Sigma = \mathrm{cov}(\bar{x}_T - \bar{x}_C \mid X) = 4 \, \mathrm{cov}(X)/n$ is the covariance matrix of $\bar{x}_T - \bar{x}_C$ with respect to all $W$ satisfying condition (2.1), and $\mathrm{cov}(X) = X^\top X/(n-1)$ is the sample covariance matrix of $X$. When $\Sigma$ is singular, the pseudo-inverse is applied to calculate the distance. Rather than performing one single randomization, basic rerandomization continues generating feasible allocation vectors $W$ until the corresponding $M$ is smaller than a predefined threshold $a$ ($a > 0$); the first $W$ with $M \le a$ is chosen to determine the actual allocation in the experiment. Under the mild condition that $\bar{x}_T - \bar{x}_C \mid X \sim N(0, \Sigma)$, the Mahalanobis distance $M \mid X$ follows a $\chi^2_d$ distribution, so the threshold $a$ can be determined by controlling the acceptance probability $p_a$ with $P(M \le a \mid X) = p_a$.

Rerandomization using Mahalanobis distance has several attractive properties. First, rerandomization ensures the unbiased estimation of $\tau$, that is, $E(\hat{\tau} \mid X, M \le a) = \tau$, because it balances the mean difference of any observed or even unobserved covariate $x$: $E(\bar{x}_T - \bar{x}_C \mid X, M \le a) = 0$. Second, it reduces the variance of each covariate by the same proportion, $100(1 - v_a)\%$, relative to complete randomization,
$$\mathrm{cov}(\bar{x}_T - \bar{x}_C \mid X, M \le a) = v_a \, \mathrm{cov}(\bar{x}_T - \bar{x}_C \mid X), \qquad (2.4)$$
where $v_a = P(\chi^2_{d+2} \le a)/P(\chi^2_d \le a) \in (0, 1)$. For the outcome $y$ and covariates $x$, the variance reduction carries over to the estimated treatment effect, $\mathrm{var}(\hat{\tau} \mid x, M \le a) = \{1 - (1 - v_a) R^2\} \, \mathrm{var}(\hat{\tau} \mid x)$, where $R^2$ is the multiple squared correlation between $y$ and $x$.

Branson and Shao (2021) developed ridge rerandomization by including a ridge term when calculating Mahalanobis distance,
$$M_\lambda = (\bar{x}_T - \bar{x}_C)^\top (\Sigma + \lambda I_d)^{-1} (\bar{x}_T - \bar{x}_C),$$
where $I_d$ denotes the $d$-dimensional identity matrix; they showed that such ridging automatically assigns more weight to the top components and can provide better balance in high-dimensional/high-collinearity cases. We propose using the top principal components rather than the original covariates (i.e., all principal components) to perform the rerandomization. Our lower-dimensional method inherits most of the theoretical properties of rerandomization and can be more convenient to implement, without the possibly cumbersome specification of tuning parameters as in ridge rerandomization.
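Before turning to the PCA variant, the following sketch shows the classic procedure just reviewed: keep drawing balanced allocations until $M \le a$, with $a$ the $p_a$ quantile of $\chi^2_d$. It reuses the helpers from the previous snippet; `mahalanobis_M` and `rerandomize` are our illustrative names, and $p_a = 0.1$ is an arbitrary choice for the example.

```python
from scipy.stats import chi2

def mahalanobis_M(X, W):
    """Balance criterion (2.3); the pseudo-inverse handles a singular Sigma."""
    n = X.shape[0]
    xbar_T, xbar_C = group_means(X, W)
    diff = xbar_T - xbar_C
    Sigma = 4.0 / n * np.cov(X, rowvar=False)  # cov(xbar_T - xbar_C | X)
    return diff @ np.linalg.pinv(Sigma) @ diff

def rerandomize(X, p_a, rng):
    """Accept the first balanced W with M <= a, where P(chi2_d <= a) = p_a."""
    a = chi2.ppf(p_a, df=X.shape[1])
    while True:
        W = balanced_allocation(X.shape[0], rng)
        if mahalanobis_M(X, W) <= a:
            return W

W = rerandomize(X, p_a=0.1, rng=rng)
```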
3 PCA Rerandomization

3.1 Methodology

Define $X = U D V^\top$ to be the singular value decomposition of $X$, where $U \in \mathbb{R}^{n \times p}$ and $V \in \mathbb{R}^{d \times p}$ correspond to the matrices of the left and right singular vectors, with $U^\top U = V^\top V = I_p$ and $p = \min\{n, d\}$; $D = \mathrm{diag}\{\sigma_1, \ldots, \sigma_p\}$ is a diagonal matrix composed of the non-negative singular (i.e., eigen) values $\sigma_1 \ge \cdots \ge \sigma_p > 0$. Here, we focus on the common case with $p = d$, so that $Z = (z_{ij}) = U D$ are the principal components of $X$. Let $Z_k = U_k D_k = (z^{(k)}_1, \ldots, z^{(k)}_n)^\top \in \mathbb{R}^{n \times k}$ denote the matrix of the top $k$ ($k \le d$) principal components of $X$, where $z^{(k)}_i$ is the first $k$ elements of the $i$th row of $Z$, $U_k \in \mathbb{R}^{n \times k}$ is the first $k$ columns of $U$, and $D_k \in \mathbb{R}^{k \times k}$ is the top $k$-dimensional sub-matrix of $D$. Similarly, let $\tilde{Z}_{d-k} = \tilde{U}_{d-k} \tilde{D}_{d-k} \in \mathbb{R}^{n \times (d-k)}$ denote the last $d - k$ principal components, where $\tilde{U}_{d-k} \in \mathbb{R}^{n \times (d-k)}$ is the last $d - k$ columns of $U$ and $\tilde{D}_{d-k} \in \mathbb{R}^{(d-k) \times (d-k)}$ is the last $(d-k)$-dimensional sub-matrix of $D$. We calculate Mahalanobis distance based on the top $k$ principal components,
$$M_k = (\bar{z}^{(k)}_T - \bar{z}^{(k)}_C)^\top \Sigma_z^{-1} (\bar{z}^{(k)}_T - \bar{z}^{(k)}_C), \qquad (3.1)$$
where $\bar{z}^{(k)}_T$ and $\bar{z}^{(k)}_C$ are defined analogously to (2.2),
$$\bar{z}^{(k)}_T = \frac{2}{n} Z_k^\top W \quad \text{and} \quad \bar{z}^{(k)}_C = \frac{2}{n} Z_k^\top (1_n - W),$$
and $\Sigma_z = C_n Z_k^\top Z_k = C_n D_k^2$ with $C_n = 4/\{(n-1)n\}$. For selecting a treatment assignment, we proceed by generating $W$ until the criterion $M_k \le a_k$ is reached, which is referred to as PCA rerandomization, or more precisely PCA-$k$ rerandomization.

For a particular randomized allocation, let $\bar{z}_{T,j}$ and $\bar{z}_{C,j}$ be the mean values of the $j$th principal component for the treatment and control groups, respectively. Let $s_j^2$ denote the sample variance of the $j$th component. It can be shown that $s_j^2 = \sum_{i=1}^n z_{ij}^2/(n-1) = \sigma_j^2/(n-1)$, because $Z$ is also centered: $1_n^\top Z = 1_n^\top X V = 0$. Thus, we can rewrite (3.1) as
$$M_k = \frac{n}{4} \sum_{j=1}^k \left( \frac{\bar{z}_{T,j} - \bar{z}_{C,j}}{s_j} \right)^2,$$
which is the sum of the standardized mean differences of the top $k$ principal components. Note that $Z$ can be obtained from the affine transformation of the covariates, $Z = X V$. Following the affinely invariant property of Mahalanobis distance, it can be shown that
$$M = \frac{n}{4} \sum_{j=1}^d \left( \frac{\bar{z}_{T,j} - \bar{z}_{C,j}}{s_j} \right)^2 \quad \text{and} \quad M_\lambda = \frac{n}{4} \sum_{j=1}^d \frac{\sigma_j^2}{\sigma_j^2 + \lambda/C_n} \left( \frac{\bar{z}_{T,j} - \bar{z}_{C,j}}{s_j} \right)^2,$$
where the former is the original Mahalanobis distance and the latter corresponds to ridge rerandomization. Therefore, PCA rerandomization is a truncated version of the original rerandomization, i.e., $M_k \le M$ because $k \le d$, where the equality holds only if $k = d$. Furthermore, $M_\lambda$ is a weighted version of $M$, with weight $\sigma_j^2/(\sigma_j^2 + \lambda/C_n)$ attached to all components, i.e., there is no dimension reduction. Therefore, ridge rerandomization can be viewed as a smooth counterpart to PCA rerandomization, which uses a binary weight,
$$1_{j \le k}(j) = \begin{cases} \sigma_j^2/(\sigma_j^2 + 0) = 1, & \text{if } j \le k, \\ \sigma_j^2/(\sigma_j^2 + \infty) = 0, & \text{if } j > k. \end{cases}$$

Although there are two tuning parameters for PCA rerandomization, they are easy to specify. As in routine PCA, the number of top components $k$ can be determined using the percent of cumulative variation that is explained by the selected principal components,
$$k = \min_j \left\{ j \;\middle|\; \frac{\sum_{i=1}^j \sigma_i^2}{\sum_{i=1}^d \sigma_i^2} \ge \gamma_k \right\}, \qquad (3.2)$$
where $\gamma_k \in (0, 1)$ is a prespecified constant. Given $k$, we determine the threshold $a_k$ via the acceptance probability $p_{a_k}$, analogous to the full-rank rerandomization: $P(M_k \le a_k \mid X) = p_{a_k}$, where the distribution of $M_k \mid X$ is $\chi^2_k$.
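A sketch of PCA-$k$ rerandomization as just described, under the same assumptions: choose $k$ by rule (3.2), compute $M_k$ in its diagonal form, and accept when $M_k \le a_k$ with $P(\chi^2_k \le a_k) = p_{a_k}$. The function names are ours, and the snippet reuses `balanced_allocation` from the earlier sketch.

```python
def choose_k(s, gamma):
    """Rule (3.2): smallest k explaining at least gamma of total variation."""
    frac = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(frac, gamma) + 1)

def pca_rerandomize(X, p_a, gamma, rng):
    """PCA-k rerandomization; X is assumed column-centered."""
    n = X.shape[0]
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = choose_k(s, gamma)
    Z_k = U[:, :k] * s[:k]            # top-k principal components Z_k = U_k D_k
    C_n = 4.0 / ((n - 1) * n)
    a_k = chi2.ppf(p_a, df=k)         # threshold from the chi^2_k distribution
    while True:
        W = balanced_allocation(n, rng)
        diff = 2.0 / n * Z_k.T @ (2 * W - 1)          # zbar_T - zbar_C
        M_k = np.sum(diff ** 2 / (C_n * s[:k] ** 2))  # (3.1) with diagonal Sigma_z
        if M_k <= a_k:
            return W, k

W, k = pca_rerandomize(X, p_a=0.1, gamma=0.95, rng=rng)
```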
3.2 Theoretical Properties of PCA Rerandomization

Given $k$ and $a_k$, several statistical properties can be derived for PCA rerandomization; we defer the corresponding technical details to the Appendix. PCA rerandomization balances the covariates between the treatment and control groups on average, and additionally leads to the unbiased estimation of $\tau$.

Theorem 1. Given a constant $a_k > 0$,
$$E(\bar{x}_T - \bar{x}_C \mid X, M_k \le a_k) = 0 \quad \text{and} \quad E(\hat{\tau} \mid X, M_k \le a_k) = \tau.$$

According to the definition of $M_k$, it takes the same value for both allocations $W$ and $1_n - W$ for any given threshold $a_k > 0$.
Furthermore, we assume that equation (2.1) holds in PCA rerandomization. Therefore, the unbiasedness for $\bar{x}_T - \bar{x}_C$ and $\hat{\tau}$ follows from Theorem 2.1 and Corollary 2.2 in Morgan and Rubin (2012).

The characteristic of covariate balance in Theorem 1 can be further extended to unobserved covariates, as implied by Corollary 2.2 of Morgan and Rubin (2012). In addition to removing the conditional bias, PCA rerandomization tends to make the difference of covariate means and $\hat{\tau}$ more concentrated.
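As a quick Monte Carlo illustration of Theorem 1 (our sketch, reusing `pca_rerandomize` and `group_means` from the earlier snippets): averaging $\bar{x}_T - \bar{x}_C$ over many accepted allocations should be near zero, reflecting the mirror symmetry between $W$ and $1_n - W$.

```python
diffs = []
for _ in range(2000):
    W, k = pca_rerandomize(X, p_a=0.2, gamma=0.95, rng=rng)
    xbar_T, xbar_C = group_means(X, W)
    diffs.append(xbar_T - xbar_C)
# Entries should be ~0 up to Monte Carlo error, per Theorem 1.
print(np.abs(np.mean(diffs, axis=0)).max())
```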
Before showing such results, we first provide the distribution of $M_k$ under the condition $(\bar{x}_T - \bar{x}_C) \mid X \sim N(0, \Sigma)$.

Theorem 2. Let $k$ denote the number of selected top principal components. If $(\bar{x}_T - \bar{x}_C) \mid X \sim N(0, \Sigma)$, we have that $M_k \mid X \sim \chi^2_k$.

Under PCA rerandomization, $M_k$ thus follows a chi-squared distribution with the degrees of freedom being the number of selected top principal components. The original rerandomization yields $M \mid X \sim \chi^2_d$, because it uses all $d$ components. One major application of Theorem 2 is to specify the threshold $a_k$ through the probability $p_{a_k}$ such that $p_{a_k} = P(\chi^2_k \le a_k)$. A smaller degree of freedom yields a smaller threshold given the same acceptance probability $p_a = p_{a_k}$, so we have $a_k < a$ for PCA rerandomization relative to the threshold in pure rerandomization. Moreover, it is easier for our method to determine $a_k$ than for ridge rerandomization, which has to specify the threshold from a mixture distribution.
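For instance, the threshold $a_k$ and the shrinkage factor $v_{a_k} = P(\chi^2_{k+2} \le a_k)/P(\chi^2_k \le a_k)$ appearing in Theorem 3 below reduce to one-line quantile and CDF computations; the values $d = 50$, $k = 10$, $p_a = 0.1$ here are arbitrary illustrations, not from the paper.

```python
from scipy.stats import chi2

def threshold_and_v(df, p_a):
    """Threshold a with P(chi2_df <= a) = p_a, and v = P(chi2_{df+2} <= a) / p_a."""
    a = chi2.ppf(p_a, df)
    return a, chi2.cdf(a, df + 2) / chi2.cdf(a, df)

p_a = 0.1
a_d, v_a = threshold_and_v(50, p_a)    # full-rank rerandomization, d = 50
a_k, v_ak = threshold_and_v(10, p_a)   # PCA rerandomization, k = 10
print(a_k < a_d, v_ak < v_a)           # smaller threshold, stronger shrinkage
```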
Given the distribution of $M_k$ and the threshold $a_k$, we can quantify how much variance is reduced for the covariance matrix of $\bar{x}_T - \bar{x}_C$ through the following theorem.

Theorem 3. Given the top $k$ principal components and the threshold $a_k > 0$, if $(\bar{x}_T - \bar{x}_C) \mid X \sim N(0, \Sigma)$,
$$\mathrm{cov}(\bar{x}_T - \bar{x}_C \mid X, M_k \le a_k) = C_n V \begin{pmatrix} v_{a_k} D_k^2 & 0 \\ 0 & \tilde{D}_{d-k}^2 \end{pmatrix} V^\top,$$
where $C_n = 4/\{(n-1)n\}$ and $v_{a_k} = P(\chi^2_{k+2} \le a_k)/P(\chi^2_k \le a_k)$.

Theorem 3 states the obvious fact that PCA rerandomization only reduces the variation of the top $k$ selected principal components, for which the percent reduction in variance (PRV) is $100(1 - v_{a_k})\%$. This strategy automatically identifies and balances the most variable subspace of the original covariates. Furthermore, we find that $v_{a_k}$ in PCA rerandomization is at most its rerandomization counterpart $v_a$, given the same acceptance probability. From Morgan and Rubin (2012), the covariance reduction for rerandomization is
$$\mathrm{cov}(\bar{x}_T - \bar{x}_C \mid X, M \le a) = C_n V \begin{pmatrix} v_a D_k^2 & 0 \\ 0 & v_a \tilde{D}_{d-k}^2 \end{pmatrix} V^\top. \qquad (3.3)$$
Therefore, a smaller $v_{a_k}$ means that we reduce more variance along the top principal axes than pure rerandomization. In contrast to ridge rerandomization, our scheme reduces the same percent of variance for each of the selected $k$ components, whereas ridge rerandomization reduces the variance of all components but by different percents, with top components receiving more shrinkage, i.e.,
$$\mathrm{cov}(\bar{x}_T - \bar{x}_C \mid X, M_\lambda \le a_\lambda) = C_n V \, \mathrm{diag}(\xi_{\lambda,1} \sigma_1^2, \ldots, \xi_{\lambda,d} \sigma_d^2) \, V^\top,$$
where $0 < \xi_{\lambda,1} \le \cdots \le \xi_{\lambda,d}$ are constants defined in Theorem 4.2 of Branson and Shao (2021) for a given $\lambda$. The aforementioned discussion focuses on the PRV of the principal components, whereas the following corollary further provides the PRV for the $j$th entry of the covariate mean difference $\bar{x}_{T,j} - \bar{x}_{C,j}$ under PCA rerandomization.
Corollary 1.
Given the conditions in Theorem 3 and that the $(j,j)$th element of $\Sigma$ satisfies $\Sigma_{jj} > 0$ for $j \in [d] = \{1, \ldots, d\}$, the PRV of $\bar{x}_{T,j} - \bar{x}_{C,j}$ is $100(1 - v_{a_k,j})\%$ for $j \in [k]$, where
$$v_{a_k,j} = \frac{\left( C_n V \, \mathrm{diag}\{ v_{a_k} D_k^2, \, \tilde{D}_{d-k}^2 \} V^\top \right)_{jj}}{\Sigma_{jj}} \in (0, 1).$$

To assess how PCA rerandomization can improve the estimation of $\tau$, we follow Morgan and Rubin (2012) and assume that the treatment effect is additive:
$$y_i(W_i) = \beta_0 + x_i^\top \beta + \tau W_i + \varepsilon_i, \qquad (3.4)$$
where $\beta_0 + x_i^\top \beta$ is the projection of the outcome $y_i$ onto the subspace spanned by $(1_n, X)$, and $\varepsilon_i$ is the residual of $y_i$ orthogonal to the linear subspace spanned by $X$. We further assume that $\hat{\tau}$, the estimator of $\tau$, follows a normal distribution given $X$, as in Morgan and Rubin (2012).
Theorem 4. Given the conditions in Theorem 3, if the data arise from (3.4) and $\hat{\tau}$ is normally distributed given $X$, then for any $\beta \in \mathbb{R}^d$,
$$\mathrm{var}(\hat{\tau} \mid X) - \mathrm{var}(\hat{\tau} \mid X, M_k \le a_k) = C_n \beta^\top V \begin{pmatrix} (1 - v_{a_k}) D_k^2 & 0 \\ 0 & 0 \end{pmatrix} V^\top \beta \ge 0.$$
Furthermore, let $\tilde{\beta} = V^\top \beta$, where the $j$th component of $\tilde{\beta}$ is denoted by $\tilde{\beta}_j$, and let $\mathcal{C} = \{ \beta \in \mathbb{R}^d \mid \exists j \in [k] \text{ s.t. } \tilde{\beta}_j \ne 0 \}$. If $\beta \in \mathcal{C}$,
$$\mathrm{var}(\hat{\tau} \mid X) - \mathrm{var}(\hat{\tau} \mid X, M_k \le a_k) > 0.$$

This theorem shows that PCA rerandomization never harms the precision of the treatment effect estimator $\hat{\tau}$ relative to pure rerandomization, although it requires an additional constraint $\beta \in \mathcal{C}$ to guarantee a strict sampling variance reduction. Note that each column of $V$ corresponds to the direction of a principal component. The constraint intuitively means that the coefficients $\beta$ should not fall into the subspace spanned by the last $d - k$ components. As shown in the proof in the Appendix, the sampling variance reduction of $\hat{\tau}$ is related to that of the covariates. In particular, we have for all rerandomization schemes,
$$\mathrm{var}(\hat{\tau} \mid X) - \mathrm{var}(\hat{\tau} \mid X, M \le a) = \beta^\top \{ \mathrm{cov}(\bar{x}_T - \bar{x}_C \mid X) - \mathrm{cov}(\bar{x}_T - \bar{x}_C \mid X, M \le a) \} \beta. \qquad (3.5)$$
Therefore, the variance reduction of $\hat{\tau}$ is governed by all principal components for pure rerandomization and ridge rerandomization, due to the fact that $\mathrm{cov}(\bar{x}_T - \bar{x}_C \mid X) > \mathrm{cov}(\bar{x}_T - \bar{x}_C \mid X, M \le a)$ under both schemes. Specifically, each principal component has an equal contribution to the reduction under pure rerandomization, whereas top principal components are associated with larger weights in $M_\lambda$ and thus result in more variance reduction than the tail components under ridge rerandomization. The reductions of all principal components are finally aggregated through the inner product with the coefficients $\beta$. On the other hand, using only the top principal components for PCA rerandomization leads to a constraint to ensure the strict reduction in (3.5).

4 Simulation Studies

4.1 Simulation Design

Our Monte Carlo study can be compactly described as a $4 \times 4 \times 3 \times 3 \times 2 \times 2 \times 2$ factorial design. We simulate $n \in \{100, 200, 500, 1000\}$ samples from a $d$-dimensional multivariate normal distribution,
$$x \sim N\left(0, (1 - \rho) I_d + \rho \, 1_d 1_d^\top \right),$$
where the correlation coefficient $\rho$ takes three levels and the dimension $d \in \{10, 50, 90, 180\}$. The sample size ($n$) and the covariate dimension ($d$) are known factors at the design stage, whereas the correlation $\rho$ is an estimable factor that is also essentially known to the investigator. Half of the $n$ units are assigned to the treatment group and the other half to the control group, where the allocation used is generated by one of the randomization schemes being compared. Given the covariates and the allocation variable, the outcome is simulated from the following model with a fixed additive treatment effect $\tau$,
$$y = g(x, \beta) + \tau W + \varepsilon, \qquad (4.1)$$
where $g(x, \beta)$ represents the response surface, which is unknown to the investigator but estimable to some extent, and the residual follows a normal distribution $\varepsilon \sim N(0, \sigma_\varepsilon^2)$, also unknown to the investigator. We adopt two different functions for the response surface, $g(x, \beta) \in \{ x^\top \beta, \exp(x)^\top \beta \}$, with $\beta$ set to either $1_d$ or a vector whose first $d/2$ and last $d/2$ entries take two different constant values. The residual variance $\sigma_\varepsilon^2$ takes two values, and we set $\tau = 1$. All factors with their levels and descriptions are summarized in Table 1.
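A sketch of this data-generating process under model (4.1). The particular values of $\rho$, $p_a$, and $\sigma_\varepsilon$ below are placeholders rather than the paper's grid, and `pca_rerandomize` is the sketch from Section 3.1.

```python
def equicorrelated_X(n, d, rho, rng):
    """Covariates x ~ N(0, (1 - rho) I_d + rho 1_d 1_d^T)."""
    cov = (1.0 - rho) * np.eye(d) + rho * np.ones((d, d))
    return rng.multivariate_normal(np.zeros(d), cov, size=n)

def outcome(X, W, beta, tau, sigma_eps, surface, rng):
    """Model (4.1); sigma_eps is the residual standard deviation."""
    g = X @ beta if surface == "linear" else np.exp(X) @ beta
    return g + tau * W + rng.normal(0.0, sigma_eps, size=len(W))

n, d = 200, 10
X = equicorrelated_X(n, d, rho=0.5, rng=rng)
X = X - X.mean(axis=0)                        # center, as the method assumes
W, k = pca_rerandomize(X, p_a=0.1, gamma=0.95, rng=rng)
y = outcome(X, W, beta=np.ones(d), tau=1.0, sigma_eps=1.0,
            surface="linear", rng=rng)
tau_hat = y[W == 1].mean() - y[W == 0].mean()
```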
We compare PCA rerandomization (PCAReR) with three other randomization schemes: (a) complete (pure) randomization (CR) with the constraint (2.1), (b) original rerandomization (ReR), and (c) ridge rerandomization (RidgeReR). For fair comparisons, we set the same acceptance probability for all rerandomization-based methods, i.e., $p_{a_k} = p_{a_\lambda} = p_a$, and use $\gamma_k = 0.95$ in (3.2) to specify the number of top principal components to be selected. The optimal ridge coefficient $\lambda$ is determined by the procedure given in Branson and Shao (2021).

All four randomization methods are evaluated using three criteria: the final covariate balance, the true estimation precision of an additive treatment effect, and computational time. Because all approaches produce balanced covariates on average, we adopt the average variance of the mean difference $\bar{x}_{T,j} - \bar{x}_{C,j}$ across all covariates $j = 1, \ldots, d$, denoted by $\bar{\sigma}^2$, to evaluate the empirical balance. For the treatment effect, we choose the mean squared error (MSE) of $\hat{\tau}$ as the evaluation metric. To normalize these metrics, we compute the reduction percents of $\bar{\sigma}^2$ and MSE for each rerandomization scheme relative to CR, which are denoted by $r_{\bar{\sigma}}$ and $r_{\mathrm{MSE}}$, respectively. We also record the computational time for generating a feasible allocation in seconds under each rerandomization method. It is important to have a fast algorithm if further analysis requires generating many acceptable allocations (Morgan and Rubin, 2012), such as performing randomization tests. In addition, we consider the selected number of top components $k$ as well as the variance shrinkage coefficient $v_{a_k}$ in Theorem 3 for PCAReR to help explain its performance under different settings.

Given a triplet of $(n, d, \rho)$, we simulate 2000 covariate matrices. For each covariate matrix, the four randomization schemes are used to generate allocations and then the responses are simulated from model (4.1) for all combinations of $\{ g(x, \beta), \beta, \sigma_\varepsilon^2 \}$. Following Rubin (1979), we use the same covariate matrix when comparing different rerandomization methods, and the covariate matrices with smaller $n$ and $d$ are nested within those with larger $n$ and $d$ for each $\rho$. This nested design strategy minimizes the number of random samples and correlates the results of different rerandomization methods, which thus makes comparisons more precise.

All evaluation metrics are calculated under each configuration of factors based on the 2000 replications. We divide the 2000 replications into 10 separate groups of size 200 to create repeated metrics for calculating their within-configuration mean square in the ANOVA. In each group, we compute the three major metrics: $r_{\bar{\sigma}}$, $r_{\mathrm{MSE}}$ and the computational time. Under each rerandomization scheme, we can obtain $4 \times 4 \times 3 \times 10 = 480$ different values for both $r_{\bar{\sigma}}$ and the computational time based on our replication strategy, because these two metrics are completely determined by the three known or estimable factors. However, given the scheme, we have $4 \times 4 \times 3 \times 2 \times 2 \times 2 \times 10 = 3840$ different values for $r_{\mathrm{MSE}}$ with respect to all the remaining six factors.
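A sketch of how these reduction metrics can be computed from replicated draws; the helper name is ours, and the input format is an assumption for illustration.

```python
def reduction_metrics(scheme_draws, cr_draws, tau):
    """r_sigma and r_MSE of a scheme relative to CR.

    Each *_draws is a list of (xbar_T - xbar_C, tau_hat) pairs
    collected over replications.
    """
    diffs_s, taus_s = (np.array(v) for v in zip(*scheme_draws))
    diffs_c, taus_c = (np.array(v) for v in zip(*cr_draws))
    sigma2_s = diffs_s.var(axis=0, ddof=1).mean()  # average variance over covariates
    sigma2_c = diffs_c.var(axis=0, ddof=1).mean()
    mse_s = np.mean((taus_s - tau) ** 2)
    mse_c = np.mean((taus_c - tau) ** 2)
    return 1.0 - sigma2_s / sigma2_c, 1.0 - mse_s / mse_c
```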
4.2 Simulation Results

Following Rubin (1979) and Gutman and Rubin (2013), we begin with three separate ANOVAs based on the seven-factor design to identify the most influential factors in terms of the three major evaluation metrics. In the ANOVAs, we consider the main effects of the factors as well as all of their interactions. The relative importance of the factors and their interactions is evaluated by their F-ratios or, equivalently, the ANOVA mean squares.

Table 2 presents the ANOVA results of $r_{\bar{\sigma}}$ for the known factors, where we find that the covariate dimension, $d$, and correlation, $\rho$, strongly influence the covariate balance metric, and the first three factors ($d$, scheme and $\rho$) account for around 82% of the total sum of squares. We only report the top 11 factorial effects because the rest have mean squares much smaller than the mean squares for the top 11. Furthermore, the within-configuration mean square indicates small variability of the metric due to random sampling in the data generation. The ANOVA results of $r_{\mathrm{MSE}}$ are reported in Table 3, where we only display the top 20 influential factorial effects based on the F-ratios. We observe that the covariate dimension, $d$, the correlation coefficient, $\rho$, and the type of response surface, $g(x, \beta)$, are the most important factors in addition to the scheme, and the residuals also have a relatively small mean square. Additionally, the first four factors explain around 72% of the total sum of squares. From Table 4, we can identify the scheme, $d$ and $n$ as the most influential factors for the computational time, and about 68% of the total sum of squares can be explained by the first five factorial effects. After identifying the crucial factors, we take the average of each metric over the levels of all the remaining factors to provide a comprehensive characterization of the effects of the selected factors.

The first panel of Table 5 compares the different rerandomization schemes in terms of $r_{\bar{\sigma}}$ under different $\rho$ and $d$. To indicate the degree of dimension reduction by PCA, we also note the number of selected top components in parentheses. In general, although PCAReR uses a smaller number of principal components, it can result in more balanced covariates than ReR, and the improvement is more obvious in the cases with large $\rho$ and $d$. The main reason is that the variance shrinkage coefficient $v_{a_k}$ of PCAReR is typically smaller than the counterpart $v_a$ of ReR, especially for large $\rho$ and $d$, as revealed in Figure 1. Compared with RidgeReR, which takes all principal components into account and puts more emphasis on the top ones, PCAReR only uses the most variable subspace to perform the rerandomization, and thus its performance is inferior to a certain extent. The difference in $r_{\bar{\sigma}}$ between RidgeReR and PCAReR is small when $k$ is close to $d$, for example, when $k = 9.9$ or $43.2$.

Table 5 also presents $r_{\mathrm{MSE}}$ for each rerandomization scheme under different settings of the influential factors $(d, \rho, g(x, \beta))$. Because the estimation precision of $\hat{\tau}$ is closely related to the covariate balance, as implied by Theorem 4, we again observe that PCAReR has an intermediate performance between ReR and RidgeReR for a given type of response surface, and the advantages of PCAReR and RidgeReR are more evident for large $\rho$ and $d$. All methods yield smaller $r_{\mathrm{MSE}}$ for the nonlinear surface $\exp(x)^\top \beta$ relative to the linear surface $x^\top \beta$, because these approaches only balance the first moments of the covariates.
Additionally, all three rerandomization schemes perform better for small $d$ given the same $\rho$ and $g(x, \beta)$, because it is more difficult to simultaneously balance numerous covariates.

As shown by Theorem 4 in this paper and Theorem 4.3 in Branson and Shao (2021), the estimation precision of $\hat{\tau}$ for PCAReR and RidgeReR involves the transformed coefficient $\tilde{\beta} = V^\top \beta$. Therefore, a superior performance of PCAReR relative to RidgeReR is expected when the values of $\tilde{\beta}$ corresponding to the top components are smaller than those of the tail components among the $k$ principal components selected by PCA, such as $\tilde{\beta} = (1, 2, \ldots, k, 0^\top_{d-k})^\top / \sum_{i=1}^k i$. We demonstrate this phenomenon through a simulation by setting the coefficient as $\beta = V \tilde{\beta}$ under a linear response surface $g(x, \beta) = x^\top \beta$, with sample size $n = 200$ and residual variance $\sigma_\varepsilon^2 = 1$. The factors $d$ and $\rho$ are kept at the same values as given in Table 1. The last panel of Table 5 presents the percent reduction $r_{\mathrm{MSE}}$. We still observe that PCAReR works better than ReR, because PCAReR tends to yield more balanced covariates in the selected subspace. This example shows that PCAReR may outperform RidgeReR if the subspace spanned by the tail components among the $k$ selected ones is associated with larger values of the regression coefficients.

Table 6 summarizes the computational time for the three rerandomization schemes, which shows that PCAReR dominates in most cases. Generally, it can be complicated for RidgeReR to specify the optimal values for $\lambda$ and $a_\lambda$, so RidgeReR spends considerable time in generating a feasible allocation. When $d$ is close to $n$, such as $(n, d) = (100, 90)$ or $(200, 180)$, ReR takes much longer than in the cases with $n \gg d$ to generate an allocation. This is due to the long rejection period for a given threshold $a$, because it is more difficult to balance a large number of covariates. Note that $a$ is determined under a multivariate normal assumption that only holds asymptotically, and the assumption may not be satisfied when $d$ is close to $n$, so the given threshold level $a$ may be too restrictive for ReR. This result also reflects the practical advantage of using PCA to accelerate the search for a feasible allocation.

In Table 6, ReR displays some unusual results when $d > n$, i.e., $(n, d) = (100, 180)$, where
$$M = \frac{n-1}{n} (2W - 1_n)^\top X (X^\top X)^{-} X^\top (2W - 1_n) = n - 1,$$
where $(X^\top X)^{-}$ refers to the pseudo-inverse of $X^\top X$ and $U \in \mathbb{R}^{n \times n}$ is composed of the left singular vectors of $X$ with $X (X^\top X)^{-} X^\top = U U^\top = I_n$. We further find that the threshold $a$ of ReR is larger than $n - 1$ when $(n, d) = (100, 180)$, so that every allocation is immediately accepted.

4.3 Guideline for Rerandomization

Based on the simulation results, we provide a guideline for the investigator to select a proper scheme under different settings of the factors. Tables 2-4 show that $d$ and $\rho$ are the most influential factors in terms of the F-ratios. Given the known factors $d$ and $n$ and the estimable factor $\rho$, the investigator may choose a suitable rerandomization scheme as follows. When both $d$ and $\rho$ are small, all three rerandomization schemes show similar performance in terms of $r_{\bar{\sigma}}$ and $r_{\mathrm{MSE}}$. When either $d$ or $\rho$ is large, PCAReR and RidgeReR should be considered because they leverage the importance of the principal components to improve covariate balance and thus estimation precision. Note that $\beta$ and $g(x, \beta)$ are unknown to the investigator during the design stage, and Table 5 implies that both methods could yield better performance than ReR. We hence recommend using PCAReR, because the fast implementation and sufficient dimension reduction of PCAReR allow the investigator to make prompt attempts at expanding the dimension of the covariate matrix by incorporating different combinations of nonlinear features (e.g., second moments of covariates), which may boost its performance under nonlinear response surfaces.

4.4 A Real-Data Example

For illustration, we compare the different rerandomization methods using a real dataset. We use the first part of the data obtained from the Infant Health and Development Program (IHDP) (Hill, 2011; Louizos et al., 2017). This program aimed at improving cognitive development for low-birth-weight and premature infants by providing high-quality child care and home visits from trained specialists for the infants in the treatment group. This intervention successfully promoted cognitive test scores for the treated children compared with the control group. The dataset consists of 747 participants with 25 covariates and continuous test scores. Six covariates are continuous, which are standardized, and the others are binary. To mimic the real process, we use all observed data to fit a sparse linear model via the LASSO using the 25 main-effect covariates and all $\binom{25}{2} = 300$ two-way interactions among the variables. The LASSO penalty is chosen to be 0.055 from cross-validation, and 35 covariates are finally selected. The fitted model is used to generate responses given the allocation during randomization. We construct the covariate matrix $X$ from all 747 instances, and plan to assign an approximately equal number of units to the treatment (373) and control (374) groups.
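A sketch of this response-surface construction (main effects plus all two-way interactions, with a cross-validated LASSO). The synthetic stand-in data below merely mimic the shape of the 747 x 25 IHDP covariate matrix, and the helper name is ours.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LassoCV

def with_interactions(X):
    """Append all two-way interaction columns to the main effects."""
    pairs = [X[:, [i]] * X[:, [j]] for i, j in combinations(range(X.shape[1]), 2)]
    return np.hstack([X] + pairs)

rng = np.random.default_rng(1)
X_ihdp = rng.standard_normal((747, 25))      # stand-in for the real covariates
y_obs = X_ihdp @ rng.standard_normal(25) + rng.standard_normal(747)
X_full = with_interactions(X_ihdp)           # 25 + 300 = 325 columns
lasso = LassoCV(cv=5).fit(X_full, y_obs)     # penalty chosen by cross-validation
selected = np.flatnonzero(lasso.coef_)       # sparse set of selected covariates
```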
Different randomization approaches are then applied to generate 1000 independent allocations, from which we calculate various empirical metrics. We first draw box plots of the mean differences $\bar{x}_{T,j} - \bar{x}_{C,j}$ for the continuous features and the first six binary variables in Figure 2(a); the other binary covariates demonstrate similar balance properties to the first six. We observe that PCAReR, using 81 principal components, achieves more balanced covariates than CR, and its performance is comparable to ReR. We further present the density plot of the treatment effect estimator $\hat{\tau}$ with respect to the 1000 replications for ReR, RidgeReR and PCAReR in Figure 2(b), where the distribution of $\hat{\tau}$ under PCAReR is more concentrated than that under ReR. Table 7 reports the overall reduction percentages for the covariate mean differences and the MSE of $\hat{\tau}$ with respect to CR, where PCAReR yields performance between ReR and RidgeReR. However, PCAReR runs 159 times faster than RidgeReR, which takes around 7.9 hours for a total of 1000 allocations; this could be an issue if inferences were to be based on randomization tests.

5 Discussion

We propose a PCA rerandomization scheme to allocate participants in experiments with high-dimensional or highly collinear covariates. Compared with the classical rerandomization (Morgan and Rubin, 2012), the major difference is that we use the top principal components rather than the original covariates to calculate Mahalanobis distance. We show that not only does PCA rerandomization share most theoretical characteristics with the simple rerandomization, but it also demonstrates empirical advantages with high-dimensional or highly correlated covariates in terms of covariate balance, the precision of treatment effect estimation, as well as computational time.

Analogous to rerandomization, we can extend PCA rerandomization to other types of experiments. One line of future work is to develop the asymptotic properties of the treatment effect estimator under PCA rerandomization, as done in Li et al. (2018) for rerandomization. Another path worth exploring is to apply PCA rerandomization to $2^K$ factorial designs following Branson et al. (2016).

References
Branson, Z., Dasgupta, T., and Rubin, D. B. (2016). Improving covariate balance in 2^K factorial designs via rerandomization with an application to a New York City Department of Education high school study. The Annals of Applied Statistics, 10(4):1958-1976.

Branson, Z. and Shao, S. (2021). Ridge rerandomization: An experimental design strategy in the presence of covariate collinearity. Journal of Statistical Planning and Inference, 211:287-314.

Cox, D. (2009). Randomization in the design of experiments. International Statistical Review, 77(3):415-429.

Fisher, R. A. (1926). The arrangement of field experiments. Journal of the Ministry of Agriculture of Great Britain, 33:503-513.

Gutman, R. and Rubin, D. B. (2013). Robust estimation of causal effects of binary treatments in unconfounded studies with dichotomous outcomes. Statistics in Medicine, 32(11):1795-1814.

Gutman, R. and Rubin, D. B. (2015). Estimation of causal effects of binary treatments in unconfounded studies. Statistics in Medicine, 34(26):3381-3398.

Gutman, R. and Rubin, D. B. (2017). Estimation of causal effects of binary treatments in unconfounded studies with one continuous covariate. Statistical Methods in Medical Research, 26(3):1199-1215.

Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217-240.

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, Cambridge, UK.

Johansson, P. and Schultzberg, M. (2020). Rerandomization strategies for balancing covariates using pre-experimental longitudinal data. Journal of Computational and Graphical Statistics, in press.

Li, X. and Ding, P. (2020). Rerandomization and regression adjustment. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1):241-268.

Li, X., Ding, P., and Rubin, D. B. (2018). Asymptotic theory of rerandomization in treatment-control experiments. Proceedings of the National Academy of Sciences, 115(37):9157-9162.

Li, X., Ding, P., and Rubin, D. B. (2020). Rerandomization in 2^K factorial experiments. The Annals of Statistics, 48(1):43-63.

Louizos, C., Shalit, U., Mooij, J. M., Sontag, D., Zemel, R., and Welling, M. (2017). Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems, pages 6446-6456.

Morgan, K. L. and Rubin, D. B. (2012). Rerandomization to improve covariate balance in experiments. The Annals of Statistics, 40(2):1263-1282.

Morgan, K. L. and Rubin, D. B. (2015). Rerandomization to balance tiers of covariates. Journal of the American Statistical Association, 110(512):1412-1421.

Neyman, J. (1923). On the application of probability theory to agricultural experiments: Essay on principles, Section 9. Roczniki Nauk Rolniczych Tom X [in Polish]; translated in Statistical Science, 5:465-471.

Rosenbaum, P. R. and Rubin, D. B. (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39(1):33-38.

Rubin, D. B. (1973a). Matching to remove bias in observational studies. Biometrics, 29(1):159-183.

Rubin, D. B. (1973b). The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics, 29(1):185-203.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688-701.

Rubin, D. B. (1979). Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association, 74(366):318-328.

Rubin, D. B. (1980). Bias reduction using Mahalanobis-metric matching. Biometrics, 36(2):293-298.

Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1):1-21.

Worrall, J. (2010). Evidence: philosophy of science meets medicine. Journal of Evaluation in Clinical Practice, 16(2):356-362.

Zhou, Q., Ernst, P. A., Morgan, K. L., Rubin, D. B., and Zhang, A. (2018). Sequential rerandomization. Biometrika, 105(3):745-752.
Appendix
A.1 Proof of Theorem 2
Proof.
From (2.2), we have
$$\bar{x}_T - \bar{x}_C = \frac{2}{n} X^\top (2W - 1_n) = \frac{2}{n} V Z^\top (2W - 1_n) \sim N(0, \Sigma).$$
Furthermore, it is easy to see that $Z = (Z_k, \tilde{Z}_{d-k})$ and
$$V^\top \Sigma V = C_n D^2 = C_n \begin{pmatrix} D_k^2 & 0 \\ 0 & \tilde{D}_{d-k}^2 \end{pmatrix}, \qquad (A.1)$$
where $C_n = 4/\{(n-1)n\}$. Consequently,
$$\bar{z}^{(k)}_T - \bar{z}^{(k)}_C = \frac{2}{n} Z_k^\top (2W - 1_n) = \frac{2}{n} (I_k, 0) V^\top X^\top (2W - 1_n) \sim N(0, \Sigma_z), \qquad (A.2)$$
where $\Sigma_z = C_n Z_k^\top Z_k = C_n D_k^2$. This concludes that $M_k \mid X \sim \chi^2_k$ according to the property of a multivariate normal distribution.
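Under complete randomization, the chi-square approximation can be checked empirically (our sketch, reusing `balanced_allocation` and `choose_k` from the earlier snippets and the standardized $X$ generated there): the realized $M_k$ values across many balanced allocations should track the $\chi^2_k$ distribution.

```python
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = choose_k(s, gamma=0.95)
Z_k = U[:, :k] * s[:k]
C_n = 4.0 / ((n - 1) * n)
M_vals = []
for _ in range(20000):
    W = balanced_allocation(n, rng)
    diff = 2.0 / n * Z_k.T @ (2 * W - 1)
    M_vals.append(np.sum(diff ** 2 / (C_n * s[:k] ** 2)))
# Compare empirical quantiles with chi^2_k quantiles.
print(np.quantile(M_vals, [0.25, 0.5, 0.75]))
print(chi2.ppf([0.25, 0.5, 0.75], df=k))
```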
A.2 Proof of Theorem 3

Proof.
For simplicity, let $\eta = V^\top (\bar{x}_T - \bar{x}_C) = (\eta_k^\top, \tilde{\eta}_{d-k}^\top)^\top$, where $\eta_k = (\eta_1, \ldots, \eta_k)^\top \in \mathbb{R}^k$ and $\tilde{\eta}_{d-k} = (\eta_{k+1}, \ldots, \eta_d)^\top \in \mathbb{R}^{d-k}$. According to (A.1) and (A.2), we know that $\eta \mid X \sim N(0, C_n D^2)$, $\eta_k \mid X \sim N(0, C_n D_k^2)$ and $\tilde{\eta}_{d-k} \mid X \sim N(0, C_n \tilde{D}_{d-k}^2)$. Thus, $M_k$ can be written in terms of $\eta_k$ and the singular values $\sigma_1 \ge \cdots \ge \sigma_k > 0$, i.e., $M_k = \sum_{j=1}^k \eta_j^2/(C_n \sigma_j^2)$.

Note that we have
$$\mathrm{cov}(\bar{x}_T - \bar{x}_C \mid X, M_k \le a_k) = V \, \mathrm{cov}(\eta \mid X, M_k \le a_k) \, V^\top. \qquad (A.3)$$
Let $a \overset{d}{=} b$ denote two variables $a$ and $b$ having the same distribution. First, one can obtain that $E(\eta_j \mid X, M_k \le a_k) = 0$ through the symmetry of the normal distribution, i.e., $-\eta_j \overset{d}{=} \eta_j \sim N(0, C_n \sigma_j^2)$ for all $j \in [k]$ (Branson and Shao, 2021). Specifically,
$$E(\eta_j \mid X, M_k \le a_k) = E\left( \eta_j \,\middle|\, X, \sum_{j=1}^k \frac{\eta_j^2}{C_n \sigma_j^2} \le a_k \right) = E\left( -\eta_j \,\middle|\, X, \sum_{j=1}^k \frac{(-\eta_j)^2}{C_n \sigma_j^2} \le a_k \right) = -E(\eta_j \mid X, M_k \le a_k), \qquad (A.4)$$
which leads to $E(\eta_j \mid X, M_k \le a_k) = 0$. Now, we focus on the covariance $\mathrm{cov}(\eta_i, \eta_j \mid X, M_k \le a_k)$ with $i \ne j$ and $i, j \in [k]$. Similar to (A.4), we only need to flip the sign of $\eta_j$ to show $E(\eta_i \eta_j \mid X, M_k \le a_k) = 0$ and thus further derive $\mathrm{cov}(\eta_i, \eta_j \mid X, M_k \le a_k) = E(\eta_i \eta_j \mid X, M_k \le a_k) = 0$. On the other hand, the $\eta_j^2/(C_n \sigma_j^2)$ are independent and identically follow $\chi^2_1$ for $j \in [k]$. Combining the above results, the variance of $\eta_j$ for $j \in [k]$ is
$$\mathrm{var}(\eta_j \mid X, M_k \le a_k) = E(\eta_j^2 \mid X, M_k \le a_k) = C_n \sigma_j^2 \, E\left( \frac{\eta_j^2}{C_n \sigma_j^2} \,\middle|\, X, \sum_{j=1}^k \frac{\eta_j^2}{C_n \sigma_j^2} \le a_k \right) = \frac{C_n \sigma_j^2}{k} E(M_k \mid X, M_k \le a_k),$$
where the last equation follows from the exchangeability of the i.i.d. $\{\eta_j^2/(C_n \sigma_j^2)\}_{j \in [k]}$. Because $M_k \mid X \sim \chi^2_k$, we have $E(M_k \mid X, M_k \le a_k)/k = v_{a_k} = P(\chi^2_{k+2} \le a_k)/P(\chi^2_k \le a_k)$, following the proof of Theorem 3.1 in Morgan and Rubin (2012). Therefore, the variance is $\mathrm{var}(\eta_j \mid X, M_k \le a_k) = C_n v_{a_k} \sigma_j^2$.

It is evident that for $j = k+1, \ldots, d$, we have
$$E(\eta_j \mid X, M_k \le a_k) = E(\eta_j \mid X) = 0, \quad \mathrm{var}(\eta_j \mid X, M_k \le a_k) = \mathrm{var}(\eta_j \mid X) = C_n \sigma_j^2,$$
$$\mathrm{cov}(\eta_i, \eta_j \mid X, M_k \le a_k) = \mathrm{cov}(\eta_i, \eta_j \mid X) = 0, \quad \forall i \ne j, \; i = k+1, \ldots, d.$$
Moreover, $\mathrm{cov}(\eta_i, \eta_j \mid X, M_k \le a_k) = 0$ when $i \in [k]$ and $j = k+1, \ldots, d$, as follows:
$$\mathrm{cov}(\eta_i, \eta_j \mid X, M_k \le a_k) = E(\eta_i \eta_j \mid X, M_k \le a_k) = E\{ \eta_i E(\eta_j \mid \eta_i, X, M_k \le a_k) \mid X, M_k \le a_k \} = E\{ \eta_i E(\eta_j \mid X) \mid X, M_k \le a_k \} = 0.$$
Considering all the aforementioned results for the variances and covariances, we have
$$\mathrm{cov}(\eta \mid X, M_k \le a_k) = C_n \begin{pmatrix} v_{a_k} D_k^2 & 0 \\ 0 & \tilde{D}_{d-k}^2 \end{pmatrix}.$$
The theorem can be proved after plugging the above equation into (A.3).
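A quick numerical sanity check of this result (our self-contained sketch, under the normal approximation): conditional on $M_k \le a_k$, the variances of the first $k$ coordinates shrink by the factor $v_{a_k}$, while the rest are unchanged.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
k, d = 5, 12
a_k = chi2.ppf(0.2, k)
var_eta = np.linspace(4.0, 1.0, d)            # stand-in for C_n * sigma_j^2
eta = rng.normal(0.0, np.sqrt(var_eta), size=(200_000, d))
M_k = np.sum(eta[:, :k] ** 2 / var_eta[:k], axis=1)   # ~ chi^2_k
kept = eta[M_k <= a_k]
v_ak = chi2.cdf(a_k, k + 2) / chi2.cdf(a_k, k)
print(v_ak)
print(kept[:, :k].var(axis=0) / var_eta[:k])  # approx v_ak for j <= k
print(kept[:, k:].var(axis=0) / var_eta[k:])  # approx 1 for j > k
```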
A.3 Proof of Corollary 1
Proof.
According to the definition of the PRV, we can obtain the expression of $v_{a_k,j}$ as the ratio of the $j$th diagonal entry of the covariance matrix for PCAReR to that of $\Sigma$; that is,
$$v_{a_k,j} = \frac{\left( C_n V \, \mathrm{diag}\{ v_{a_k} D_k^2, \, \tilde{D}_{d-k}^2 \} V^\top \right)_{jj}}{\Sigma_{jj}}.$$
Here, we show that $v_{a_k,j} \in (0, 1)$. Partition $V = (V_k, \tilde{V}_{d-k})$, where $V_k \in \mathbb{R}^{d \times k}$ and $\tilde{V}_{d-k} \in \mathbb{R}^{d \times (d-k)}$ correspond to the first $k$ and last $d - k$ components. Let $\alpha_j = (0^\top_{j-1}, 1, 0^\top_{d-j})^\top$ be the $d$-dimensional vector of 0's except for the $j$th element taking the value 1. We have that
$$\Sigma_{jj} = \alpha_j^\top \Sigma \alpha_j = C_n \alpha_j^\top V \begin{pmatrix} D_k^2 & 0 \\ 0 & \tilde{D}_{d-k}^2 \end{pmatrix} V^\top \alpha_j = C_n \alpha_j^\top V_k D_k^2 V_k^\top \alpha_j + C_n \alpha_j^\top \tilde{V}_{d-k} \tilde{D}_{d-k}^2 \tilde{V}_{d-k}^\top \alpha_j > 0,$$
and
$$\left( C_n V \, \mathrm{diag}\{ v_{a_k} D_k^2, \, \tilde{D}_{d-k}^2 \} V^\top \right)_{jj} = v_{a_k} C_n \alpha_j^\top V_k D_k^2 V_k^\top \alpha_j + C_n \alpha_j^\top \tilde{V}_{d-k} \tilde{D}_{d-k}^2 \tilde{V}_{d-k}^\top \alpha_j.$$
Note that $v_{a_k} = P(\chi^2_{k+2} \le a_k)/P(\chi^2_k \le a_k) \in (0, 1)$ for any $a_k > 0$. We can conclude that
$$0 < \left( C_n V \, \mathrm{diag}\{ v_{a_k} D_k^2, \, \tilde{D}_{d-k}^2 \} V^\top \right)_{jj} < \Sigma_{jj},$$
and thus $v_{a_k,j} \in (0, 1)$, because both $V_k D_k^2 V_k^\top$ and $\tilde{V}_{d-k} \tilde{D}_{d-k}^2 \tilde{V}_{d-k}^\top$ are positive semi-definite.

A.4 Proof of Theorem 4
Proof.
Define $\bar{\varepsilon}_T = 2 W^\top \varepsilon / n$ and $\bar{\varepsilon}_C = 2 (1_n - W)^\top \varepsilon / n$, where $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top$. According to (3.4), $\hat{\tau}$ can be written as
$$\hat{\tau} = (\bar{x}_T - \bar{x}_C)^\top \beta + \tau + (\bar{\varepsilon}_T - \bar{\varepsilon}_C).$$
Leveraging the orthogonality between the first and last terms, we have that
$$\mathrm{var}(\hat{\tau} \mid X) = \mathrm{var}\{ (\bar{x}_T - \bar{x}_C)^\top \beta \mid X \} + \mathrm{var}(\bar{\varepsilon}_T - \bar{\varepsilon}_C \mid X) = \beta^\top \Sigma \beta + \mathrm{var}(\bar{\varepsilon}_T - \bar{\varepsilon}_C \mid X) = C_n \beta^\top V \begin{pmatrix} D_k^2 & 0 \\ 0 & \tilde{D}_{d-k}^2 \end{pmatrix} V^\top \beta + \mathrm{var}(\bar{\varepsilon}_T - \bar{\varepsilon}_C \mid X). \qquad (A.5)$$
Furthermore, the conditional normality assumption on $\hat{\tau}$ and $\bar{x}_T - \bar{x}_C$ leads to the conditional independence between $\bar{x}_T - \bar{x}_C$ and $\bar{\varepsilon}_T - \bar{\varepsilon}_C$, because these two terms are uncorrelated. Therefore, $M_k$ is also conditionally independent of $\bar{\varepsilon}_T - \bar{\varepsilon}_C$, and we have that
$$\mathrm{var}(\hat{\tau} \mid X, M_k \le a_k) = \mathrm{var}\{ (\bar{x}_T - \bar{x}_C)^\top \beta \mid X, M_k \le a_k \} + \mathrm{var}(\bar{\varepsilon}_T - \bar{\varepsilon}_C \mid X, M_k \le a_k) = \beta^\top \mathrm{cov}(\bar{x}_T - \bar{x}_C \mid X, M_k \le a_k) \beta + \mathrm{var}(\bar{\varepsilon}_T - \bar{\varepsilon}_C \mid X) = C_n \beta^\top V \begin{pmatrix} v_{a_k} D_k^2 & 0 \\ 0 & \tilde{D}_{d-k}^2 \end{pmatrix} V^\top \beta + \mathrm{var}(\bar{\varepsilon}_T - \bar{\varepsilon}_C \mid X). \qquad (A.6)$$
Combining (A.5) and (A.6), it can be shown that
$$\mathrm{var}(\hat{\tau} \mid X) - \mathrm{var}(\hat{\tau} \mid X, M_k \le a_k) = C_n \beta^\top V \begin{pmatrix} (1 - v_{a_k}) D_k^2 & 0 \\ 0 & 0 \end{pmatrix} V^\top \beta \ge 0,$$
where the non-negativity arises from the positive semi-definiteness of the matrix. When $\beta \in \mathcal{C}$, since $v_{a_k} \in (0, 1)$ and $D_k^2$ is positive definite, one can conclude that $\mathrm{var}(\hat{\tau} \mid X) > \mathrm{var}(\hat{\tau} \mid X, M_k \le a_k)$.

Figure 1: The reduction percent of the variance shrinkage coefficient $v_{a_k}$ of PCAReR with respect to $v_a$ of ReR, i.e., $1 - v_{a_k}/v_a$, across feature dimensions $d$ and correlation coefficients $\rho$ with $n = 100$.
Figure 2: The results of different randomization approaches on the IHDP dataset. (a) Box plots of the covariate mean differences between the treatment and control groups for the continuous variables and the first six binary variables; (b) density plots of the estimated treatment effect, where the black vertical line denotes the 'true' value in the linear model. CR: complete randomization with an equal number of units in each group; ReR: rerandomization; RidgeReR: ridge rerandomization; PCAReR: PCA rerandomization.

Table 1: Factors and their corresponding levels in the simulation study.

Factors       Levels                        Descriptions
n             {100, 200, 500, 1000}         Sample size
d             {10, 50, 90, 180}             Dimension of covariates
ρ             three levels                  Correlation coefficient of covariates
Scheme        {ReR, RidgeReR, PCAReR}       Rerandomization (ReR) scheme
g(x, β)       {x'β, exp(x)'β}               Response surface
β             two coefficient vectors       Coefficient vector in the response surface
σ_ε²          two values                    Residual variance

Table 2: ANOVA results of the 11 most influential factors for r_σ̄, which is the reduction percent of the average empirical variance of x̄_{T,j} − x̄_{C,j} across all covariates (j = 1, ..., d) relative to complete randomization. DF is the degrees of freedom, MS represents the mean square, and the last row, Residual, refers to the within-configuration mean square.

Factors             DF    MS      F-ratio
d
ρ
Scheme
ρ × Scheme          4     1.41    789
d × Scheme          6     0.28    157
n × Scheme          6     0.09    50
d × ρ
n × ρ × Scheme      12    0.04    25
d × ρ × Scheme      12    0.04    21
n × d × Scheme      18    0.03    19
n
Residual

Table 3: ANOVA results of the 20 most influential factorial effects, ordered by F-ratio, for r_MSE, which is the reduction percent of the mean squared error (MSE) of τ̂ relative to complete randomization. DF is the degrees of freedom, MS represents the mean square, and the last row, Residual, refers to the within-configuration mean square.

Factors                  DF    MS      F-ratio
Scheme                   2     74.83   12128
d
g(x, β)                  1     55.71   9029
ρ
ρ × g(x, β)              2     5.78    937
d × Scheme               6     3.47    563
ρ × Scheme               4     3.43    556
d × g(x, β)              3     3.06    496
Scheme × g(x, β)         2     1.96    317
n × Scheme               6     1.24    201
n
ρ × Scheme × g(x, β)     4     0.43    69
n × d × Scheme           18    0.27    44
d × ρ
n × ρ × Scheme           12    0.20    32
d × ρ × Scheme           12    0.09    15
n × ρ
d × Scheme × g(x, β)     6     0.08    14
n × g(x, β)              3     0.05    7
n × d
Residual

Table 4: ANOVA results of the most influential factorial effects for the computational time. DF is the degrees of freedom, MS represents the mean square, and the last row, Residual, refers to the within-configuration mean square.

Factors                  DF      MS      F-ratio
Scheme
d
n × d × Scheme           18      16566   2096
n × Scheme               6       13658   1728
n
d × Scheme               6       10328   1307
n × d
ρ
d × ρ
n × ρ
n × d × ρ × Scheme       36      1077    136
n × ρ × Scheme           12      1047    132
n × d × ρ                18      851     108
ρ × Scheme               4       748     95
d × ρ × Scheme           12      618     78
Residual                 1296    832

Table 5: The evaluation metrics r_σ̄ and r_MSE (scaled) for the three rerandomization (ReR) schemes under different combinations of ρ and d, where r_σ̄ and r_MSE denote the reduction percents of the average empirical variance of x̄_{T,j} − x̄_{C,j} across all covariates (j = 1, ..., d) and of the mean squared error (MSE) of τ̂ relative to complete randomization, respectively. We report the number k of selected top principal components for PCA rerandomization in parentheses. The panels give, in turn, r_σ̄, r_MSE for the linear response surface, r_MSE for the nonlinear response surface, and r_MSE for the special choice of β; the rows within each panel correspond to ReR, RidgeReR, and PCAReR (k).

Table 6: Computational time (in seconds) of the rerandomization schemes for generating a feasible allocation under different combinations of n and d. The number of selected top principal components k for PCA rerandomization is given in parentheses.

                    n = 100                        n = 200
Scheme        d = 10    50      90      180     10      50      90      180
ReR           0.02      0.16    21.79   0.01    0.02    0.08    0.32    128.29
RidgeReR      1.01      3.24    5.94    92.65   1.41    2.99    6.06    19.41
PCAReR (k)    (8)       (29)    (41)    (55)    (8)     (33)    (52)    (82)

                    n = 500                        n = 1000
Scheme        d = 10    50      90      180     10      50      90      180
ReR           0.02      0.08    0.18    0.87    0.02    0.08    0.18    0.63
RidgeReR      1.45      2.97    5.73    16.19   1.21    3.15    6.32    19.79
PCAReR (k)    (8)       (36)    (60)    (109)   (8)     (37)    (64)    (120)

Table 7: Comparisons of different rerandomization (ReR) schemes on the real dataset, with r_σ̄: the reduction percent of the variance of x̄_{T,j} − x̄_{C,j} across all covariates with respect to complete randomization (CR); r_MSE: the reduction percent of the mean squared error (MSE) of τ̂ relative to CR; and Time(s): computational time in seconds.

Scheme         r_σ̄     r_MSE    Time(s)
ReR            0.07     0.08     0.21
RidgeReR       0.26     0.56     28.54
PCAReR (81)