Kernel-Distance-Based Covariate Balancing
Xialing Wen, Ying Yan∗, Wenliang Pan, Xianyang Zhang

School of Mathematics, Sun Yat-sen University, No.135, Xingang West Road, Guangzhou, P.R. China

Abstract
A common concern in observational studies is how to properly evaluate the causal effect, which usually refers to the average treatment effect or the average treatment effect on the treated. In this paper, we propose a data preprocessing method, kernel-distance-based covariate balancing, for observational studies with binary treatments. The proposed method yields a set of unit weights for the treatment and control groups, respectively, such that the reweighted covariate distributions satisfy a set of pre-specified balance conditions. This preprocessing methodology can effectively reduce the confounding bias of subsequent causal effect estimation. We demonstrate the implementation and performance of kernel-distance-based covariate balancing with Monte Carlo simulation experiments and a real data analysis.
Key words: Observational Studies; Causal Effect; Average Treatment Effect; Covariate Balance; Kernel Distance
∗ The corresponding author. Email: [email protected]

In causal inference, there are two crucial concepts. (i) The propensity score (Rosenbaum and Rubin, 1983): the probability of assignment to the treatment group given the observed covariates. (ii) The potential outcomes (Neyman, 1923; Rubin, 1974): the possible outcomes under different treatment decisions or interventions. The potential outcomes framework, also known as the Neyman-Rubin potential outcomes framework, was first proposed for completely randomized experiments by Neyman (1923) and was extended to both experimental and observational studies by Rubin (2005). An overview of the implementation of the potential outcomes framework can be found in Imbens and Rubin (2015). Both concepts are central to evaluating the causal effect, a core topic in causal inference. Depending on the purpose of the analysis, researchers may prefer estimating the average treatment effect (ATE), the average treatment effect on the treated (ATT), the average treatment effect on the control (ATC), or all of them.

In observational studies, however, it is not valid to evaluate the causal effect by merely computing the mean difference in response between the experimental groups, because of confounding variables and the non-random assignment mechanism. One method proposed by Rubin (2007) to solve this problem is to balance the covariate distributions between the treated and the control groups. Two popular preprocessing approaches, matching and propensity score methods, have been widely used in observational studies, and weighting methods that adjust for observed covariates enjoy substantial popularity. In this paper, we focus on weighting methods. The weights derived from adjusting the empirical distributions of the observed covariates are applied to the outcomes so that the reweighted data appear randomized, which can yield stable estimates of the parameters of interest when there are no unmeasured confounders.
Generally speaking, inverse probability weighting is less robust and works no better than moment balancing methods. We therefore propose a weighting framework, namely kernel-distance-based (KDB) covariate balancing, and use the first-moment balance of all covariates as a constraint in the optimization problem. The KDB method is a preprocessing technique for achieving covariate balance in observational studies with a binary treatment. We give estimators of the ATE and ATT based on the KDB framework, with ATE estimation as our primary focus; within the same framework, the optimization problem can be flexibly changed to estimate the ATT or ATC.

We further explore the asymptotic properties of these estimators and compare the proposed method with other preprocessing methods, such as inverse probability weighting (IPW) (Horvitz and Thompson, 1952; Hirano et al., 2003), the covariate balancing propensity score (CBPS) (Imai and Ratkovic, 2014), entropy balancing (EB) (Hainmueller, 2012), and non-parametric calibration weighting (CAL) (Chan et al., 2016). The proposed framework has several attractive features. Most importantly, the KDB method can automatically balance the second and possibly higher moments of the covariate distributions, as well as interactions. Besides, the KDB estimator of the ATE shows greater stability in nonlinear simulation scenarios compared to existing weighting estimators.

The remainder of this article is organized as follows. Section 2 introduces the definitions, notation, and assumptions used throughout the article. Section 3 reviews the main results of four weighting methods, namely IPW, CBPS, EB, and CAL. Section 4 introduces the weighted kernel distance in detail and describes the theoretical framework of the proposed KDB method. Section 5 illustrates the proposed method and compares its performance with other weighting methods in two Monte Carlo simulations and a real data analysis.
Section 6 concludes this article and discusses further explorations.

We use the symbol '$\triangleq$' to distinguish definitions from equalities. Suppose the observed dataset consists of a random sample or finite population of $N$ units, each assigned to one of two groups, treatment or control, for which covariate-balanced comparisons are of interest. For each unit $i$ $(i = 1, \dots, N)$, we observe $(Y_i, T_i, X_i)$, where $Y$ is an outcome variable, $T$ is a treatment indicator, and $X = (X_1, \dots, X_D)$, with expectation $\mu$ and covariance matrix $\Sigma$, is a set of $D$ covariates. Each $T_i$ takes the value 1 or 0, indicating that unit $i$ is assigned to the treatment or the control group, respectively. We denote the sample sizes of the treatment and control groups by $n_1$ and $n_0$, and the observed data in the two groups by $\{(Y_i, T_i, X_i) : i = 1, \dots, n_1\}$ and $\{(Y_j, T_j, X_j) : j = 1, \dots, n_0\}$, respectively, where $X_i = (X_{i,1}, \dots, X_{i,D})$ and $X_j = (X_{j,1}, \dots, X_{j,D})$.

The probability of assignment to the treatment group given the covariates is the so-called propensity score (Rosenbaum and Rubin, 1983), $e(X) = \Pr(T = 1 \mid X)$, which plays a central role in causal inference. Under the potential outcomes framework (Neyman, 1923; Rubin, 1974), we define a pair of potential outcomes $\{Y_i(0), Y_i(1)\}$ for every unit $i$ under treatment 0 or 1, respectively. The observed outcome can be expressed through the pair of potential outcomes under the consistency assumption:
$$Y_i = (1 - T_i) Y_i(0) + T_i Y_i(1).$$
In this paper, we focus on two causal estimands: the average treatment effect (ATE),
$$\tau \triangleq E[Y(1) - Y(0)] = \mu_1 - \mu_0, \quad \text{where } \mu_t \triangleq E[Y(t)],$$
and the average treatment effect on the treated (ATT),
$$\tau_1 \triangleq E[Y(1) - Y(0) \mid T = 1] = \mu_{1|1} - \mu_{0|1}, \quad \text{where } \mu_{t|1} \triangleq E[Y(t) \mid T = 1] \text{ for } t = 0, 1.$$
The expectations of the potential outcomes given the covariates are $\mu_0(X) \triangleq E[Y(0) \mid X]$ and $\mu_1(X) \triangleq E[Y(1) \mid X]$. For causal comparisons, we make the following assumptions (Rosenbaum and Rubin, 1983) throughout this article so that unbiased estimators of the ATE and ATT are available in observational studies.
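Before stating the assumptions, the setup above can be illustrated with a toy simulated dataset. This is a hypothetical sketch of ours, not from the paper: the propensity and outcome models below are arbitrary choices, with a constant causal effect of 2, and the observed outcome is built from the potential outcomes via the consistency assumption.

```python
import numpy as np

# Hypothetical data-generating process (our own illustration):
# treatment depends on X only, so strong ignorability holds by construction.
rng = np.random.default_rng(0)
N = 1000
X = rng.normal(size=(N, 2))                         # observed covariates
e_X = 1.0 / (1.0 + np.exp(-X[:, 0]))                # true propensity score e(X)
T = rng.binomial(1, e_X)                            # treatment indicator
Y0 = X @ np.array([1.0, 0.5]) + rng.normal(size=N)  # potential outcome Y(0)
Y1 = Y0 + 2.0                                       # Y(1); constant effect tau = 2
# consistency assumption: only one potential outcome is observed per unit
Y = (1 - T) * Y0 + T * Y1
```

Here the naive mean difference `Y[T == 1].mean() - Y[T == 0].mean()` is biased for $\tau = 2$, because $X_{1}$ drives both the treatment assignment and the outcome.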
Assumption 1 (Strong Ignorability): $\{Y(0), Y(1)\} \perp T \mid X$, where $\perp$ denotes independence.

We assume the potential outcomes $\{Y(0), Y(1)\}$ are independent of the treatment indicator $T$ given the covariates $X$, which implies there are no unmeasured confounders that may cause selection bias. Formally, as shown by Rosenbaum and Rubin (1983), Assumption 1 also implies
$$\{Y(0), Y(1)\} \perp T \mid e(X), \qquad E[Y(t) \mid T = 0, X] = E[Y(t) \mid T = 1, X] = E[Y(t) \mid X] \ \text{ for } t = 0, 1.$$
Under Assumption 1, we can investigate the ATE through
$$\tau(X) \triangleq \mu_1(X) - \mu_0(X) = E[Y(1) \mid T = 1, X] - E[Y(0) \mid T = 0, X] = E[Y \mid T = 1, X] - E[Y \mid T = 0, X],$$
where the first equality follows from $\{Y(0), Y(1)\} \perp T \mid X$ and the second from consistency.

Assumption 2 (Overlap): $0 < P(T = 1 \mid X) < 1$ for all $X$.

To ensure that there are useful observations for estimating the causal effect, we assume the probability that a subject is assigned to the treatment group lies strictly between 0 and 1.
Assumption 2 also helps to ensure that the bias-correction information is available over the entire domain of $X$ (Zhao and Percival, 2016); equivalently, it requires that the covariate distributions provide sufficient overlap between the treatment and the control groups. Recall that under Assumption 1, the ATE can be expressed as
$$\tau = E[Y_i(1) - Y_i(0)] = E\{E[Y(1) - Y(0) \mid X]\} = E[\mu_1(x) - \mu_0(x)] = \int \mu_1(x)\, dF(x) - \int \mu_0(x)\, dF(x), \qquad (2.1)$$
where $F(x)$ denotes the true distribution of the covariates. Within a weighting framework, we denote the weighted sample average treatment effect (SATE) with weights $w = \{w_1, \dots, w_N\} = \{p_1, \dots, p_{n_1}, q_1, \dots, q_{n_0}\}$, $n_1 + n_0 = N$, as
$$\hat\tau^w_N = \sum_{i=1}^{N} T_i w_i Y_i - \sum_{i=1}^{N} (1 - T_i) w_i Y_i = \sum_{i=1}^{n_1} p_i Y_i - \sum_{j=1}^{n_0} q_j Y_j,$$
with weighted empirical distributions
$$\hat F^w_{N,1}(x) = \sum_{i=1}^{n_1} p_i I(X_i \le x), \qquad \hat F^w_{N,0}(x) = \sum_{j=1}^{n_0} q_j I(X_j \le x),$$
where $I(\cdot)$ is an indicator function. Therefore, we have
$$\hat\tau^w_N - \tau = \left\{ \sum_{i=1}^{N} T_i w_i \big(\mu_1(X_i) - \mu_0(X_i)\big) - \tau \right\} + \left\{ \sum_{i=1}^{N} T_i w_i \mu_0(X_i) - \sum_{i=1}^{N} (1 - T_i) w_i \mu_0(X_i) \right\} + \left\{ \sum_{i=1}^{N} T_i w_i \big(Y_i - \mu_1(X_i)\big) - \sum_{i=1}^{N} (1 - T_i) w_i \big(Y_i - \mu_0(X_i)\big) \right\}. \qquad (2.2)$$
According to Equation (2.2), the bias of the weighted SATE $\hat\tau^w_N$ depends entirely on the three items
$$\sum_{i=1}^{N} T_i w_i \big(\mu_1(X_i) - \mu_0(X_i)\big) - \tau, \qquad \sum_{i=1}^{N} T_i w_i \mu_0(X_i) - \sum_{i=1}^{N} (1 - T_i) w_i \mu_0(X_i), \qquad \sum_{i=1}^{N} T_i w_i \big(Y_i - \mu_1(X_i)\big) - \sum_{i=1}^{N} (1 - T_i) w_i \big(Y_i - \mu_0(X_i)\big).$$
Here, the second item directly shows the importance of balancing the covariate distributions between the treated and control groups, which is the goal of our proposed weighting framework. To further ensure $\lim_{N \to \infty} (\hat\tau^w_N - \tau) = 0$, we need the following Assumption 3.

Assumption 3 (Constant Conditional ATE): $\mu_1(X) - \mu_0(X) = \tau$ for all $X$.

Under Assumption 3, we assume a constant causal effect, so the ATE is numerically the same as the ATT.
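The weighted SATE defined above is straightforward to compute once the weights are given; a minimal sketch (the function name is ours):

```python
import numpy as np

def weighted_sate(Y, T, p, q):
    """Weighted sample ATE:  sum_i p_i Y_i (treated) - sum_j q_j Y_j (control).

    p holds one weight per treated unit (summing to one), in the order the
    treated units appear; q likewise for the control units."""
    Y = np.asarray(Y, dtype=float)
    T = np.asarray(T)
    return float(np.sum(p * Y[T == 1]) - np.sum(q * Y[T == 0]))
```

With uniform weights $p_i = 1/n_1$ and $q_j = 1/n_0$ this reduces to the unadjusted difference in group means.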
We next give a brief overview of some existing approaches.
According to Ma and Wang (2010), to estimate the ATE the IPW method assigns the weight
$$\hat p^{IPW}_i = \frac{T_i / \{N p(X_i)\}}{\sum_{i=1}^{N} T_i / \{N p(X_i)\}}$$
to $Y_i$ (an outcome in the treatment group), $i = 1, \dots, n_1$, and the weight
$$\hat q^{IPW}_j = \frac{(1 - T_j) / \{N (1 - p(X_j))\}}{\sum_{j=1}^{N} (1 - T_j) / \{N (1 - p(X_j))\}}$$
to $Y_j$ (an outcome in the control group), $j = 1, \dots, n_0$, thereby deriving a natural estimator of the ATE, that is,
$$\hat\tau^{IPW} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{T_i Y_i}{p(X_i)} - \frac{(1 - T_i) Y_i}{1 - p(X_i)} \right] = \frac{1}{N} \sum_{i=1}^{N} \frac{(2 T_i - 1) Y_i}{1 - T_i + (2 T_i - 1) p(X_i)} = \sum_{i=1}^{n_1} \frac{T_i Y_i}{N p(X_i)} - \sum_{j=1}^{n_0} \frac{(1 - T_j) Y_j}{N (1 - p(X_j))}.$$
The ATT estimator using inverse probability weighting is
$$\hat\tau^{IPW}_1 = \frac{1}{n_1} \sum_{i=1}^{N} \left[ T_i Y_i - \frac{(1 - T_i)\, \hat e(X_i)}{1 - \hat e(X_i)} Y_i \right] = \frac{1}{n_1} \sum_{i=1}^{N} T_i Y_i - \frac{1}{n_1} \sum_{i=1}^{N} \frac{(1 - T_i)\, \hat e(X_i)}{1 - \hat e(X_i)} Y_i = \sum_{i=1}^{n_1} \frac{Y_i}{n_1} - \sum_{j=1}^{n_0} \frac{\hat e(X_j)}{1 - \hat e(X_j)} \frac{Y_j}{n_1}.$$
Therefore, the weight assigned to $Y_i$, $i = 1, \dots, n_1$, for estimating the ATT is $\hat p^{IPW}_{i,trt} = 1/n_1$, and the weight assigned to $Y_j$, $j = 1, \dots, n_0$, is
$$\hat q^{IPW}_{j,trt} = \frac{\hat e(X_j)}{n_1 \{1 - \hat e(X_j)\}}.$$

The entropy balancing (EB) scheme of Hainmueller (2012) considers
$$\min_{w_i} \ H(w) = \sum_{i \mid T = 0} h(w_i) \quad \text{s.t.} \quad \sum_{i \mid T = 0} w_i c_{ri}(X_i) = m_r \ \text{ with } r \in 1, \dots, R; \quad \sum_{i \mid T = 0} w_i = 1, \ w_i \ge 0 \ \text{for all } i \text{ such that } T = 0.$$
The weights $\hat p^{EB}_i = 1/n_1$ $(i = 1, \dots, n_1)$ and $\hat q^{EB}_j = w_j$ $(j = 1, \dots, n_0)$ of the EB method can be directly calculated with the R package "ebal".

Imai and Ratkovic (2014) proposed a relatively more stable method, namely the covariate balancing propensity score (CBPS), for estimating the propensity score so that covariate balance is optimized. CBPS can significantly improve the poor empirical performance of propensity score matching and weighting methods. The key idea is to use propensity score weighting to characterize covariate balance:
$$E\left( \frac{T_i \tilde X_i}{\pi_\beta(X_i)} - \frac{(1 - T_i) \tilde X_i}{1 - \pi_\beta(X_i)} \right) = 0, \qquad (3.3)$$
where $\tilde X_i = f(X_i)$ is a $D$-dimensional vector-valued measurable function of $X_i$ specified by the researcher. Equation (3.3) holds for any function of the covariates. Selecting a specific $f(\cdot)$, for example setting $\tilde X_i = (X_i^T, (X_i^2)^T)^T$, Equation (3.3) helps to balance both the first and second moments of all covariates. The CBPS method can be implemented through the open-source R package "CBPS".

A general class of calibration (CAL) estimators of the ATE is established with a wide class of calibration weights, such as exponential tilting. The weights are constructed to achieve an exact moment balance of observed covariates among the treated, the control, and the combined group, and global semiparametric efficiency has been established for the CAL estimators. For a fixed $g \in \mathbb{R}$, let $D(f, g)$ denote a continuously differentiable distance measure for $f \in \mathbb{R}$; being nonnegative and strictly convex in $f$, $D(f, g) = 0$ if and only if $f = g$. The general idea of calibration, as in Deville and Särndal (1992), is to minimize the aggregate distance between the final weights $w = (w_1, \dots, w_N)$ and a given vector of design weights $d = (d_1, \dots, d_N)$ subject to moment constraints. To avoid estimating the design weights, CAL first uses a vector of misspecified uniform design weights $d^* = (1, \dots, 1)$ and constructs the final weights $w$ by solving the constrained optimization problem
$$\text{minimize } \sum_{i=1}^{N} D(w_i, 1) \quad \text{s.t.} \quad \frac{1}{N} \sum_{i=1}^{N} T_i w_i u(X_i) = \frac{1}{N} \sum_{i=1}^{N} u(X_i), \quad \frac{1}{N} \sum_{i=1}^{N} (1 - T_i) w_i u(X_i) = \frac{1}{N} \sum_{i=1}^{N} u(X_i). \qquad (3.1)$$
This CAL method can be implemented through the open-source R package "ATE" for estimating both the ATE and the ATT.

4 The Weighted Kernel Covariate Balancing
Let $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$ be random samples from two unknown probability measures $P$ and $Q$; researchers are often interested in detecting the difference and estimating the discrepancy between them. In this paper, we use a probability metric with $\zeta$-structure (Zolotarev, 1983),
$$\gamma_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \left| \int f \, dP - \int f \, dQ \right|,$$
where $\mathcal{F}$ indicates a class of functions, to measure the distance between the treatment and the control covariates. As for the implementation of $\gamma(P, Q)$, we can define $\|\cdot\|_{\mathcal{K}}$ to be the norm of a reproducing kernel Hilbert space (RKHS) $\mathcal{K}$ and set $\mathcal{F} = \{f : \|f\|_{\mathcal{K}} \le 1\}$, by which we get a "kernelized" version of the total variation distance. Restricting $k$ to be a strictly positive definite kernel function corresponding to the RKHS $\mathcal{K}$, we obtain a closed-form solution for $\gamma(P, Q)$ (Sriperumbudur et al., 2012), that is,
$$\gamma_k(P_n, Q_n) = \left\| \sum_{i=1}^{N} w_i k(\cdot, X_i) \right\|_{\mathcal{K}} = \sqrt{\sum_{i,j=1}^{N} w_i w_j k(X_i, X_j)},$$
where $P_n$ and $Q_n$ respectively denote the empirical measures of $X \mid T = 1$ and $X \mid T = 0$, and $w_i$ is the sample weight for unit $i \in \{1, \dots, N\}$. If we define $w_i = 1/n_1$ if $T_i = 1$ and $w_i = -1/n_0$ if $T_i = 0$, we get a simple closed-form analytical solution for the empirical estimate of the probability metric under the assumption that the function class is the unit ball of an RKHS. The computation of $\gamma_k(P_n, Q_n)$ is $O(N^2)$, but it is independent of the dimension of the confounders. We can use $\gamma_k(P_n, Q_n)$ as a diagnostic for covariate balance: smaller values denote better balance.
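The closed-form kernel distance above can be computed directly. A minimal sketch with the Gaussian kernel, assuming the kernel width $\sigma$ is supplied (function names are ours):

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """k(x, x') = exp(-||x - x'||^2 / sigma) evaluated on all pairs of rows."""
    d2 = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(d2, 0.0) / sigma)

def kernel_distance(X_trt, X_col, sigma):
    """gamma_k with signed uniform weights w_i = 1/n1 (treated), -1/n0 (control):
    sqrt( sum_{ij} w_i w_j k(X_i, X_j) )."""
    n1, n0 = len(X_trt), len(X_col)
    ktt = gaussian_gram(X_trt, X_trt, sigma).sum() / n1**2
    kcc = gaussian_gram(X_col, X_col, sigma).sum() / n0**2
    ktc = gaussian_gram(X_trt, X_col, sigma).sum() / (n1 * n0)
    # each treated/control cross pair enters twice with opposite signs
    return float(np.sqrt(max(ktt + kcc - 2.0 * ktc, 0.0)))
```

Identical treated and control samples give a distance of zero, and larger covariate shifts give larger values, matching its use as a balance diagnostic.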
Given two probability measures $\mathbb{P}$ and $\mathbb{Q}$ defined on a measurable space $S$, the integral probability metric (IPM) (Sriperumbudur et al., 2012), also called a probability metric with $\zeta$-structure, between $\mathbb{P}$ and $\mathbb{Q}$ is defined as
$$\gamma_{\mathcal{F}}(\mathbb{P}, \mathbb{Q}) = \sup \left\{ \left| \int_S f \, d\mathbb{P} - \int_S f \, d\mathbb{Q} \right| : f \in \mathcal{F} \right\}, \qquad (4.1)$$
where $\mathcal{F}$ is a class of real-valued bounded measurable functions on $S$. The choice of functions is the crucial distinction between different IPMs, and various popular distance measures, for example the Kantorovich metric and the total variation distance, can be obtained by choosing an appropriate $\mathcal{F}$ in (4.1). When $\mathcal{F} = \{f : \|f\|_{\mathcal{H}} \le 1\}$, the corresponding $\gamma_{\mathcal{F}}$ is called the kernel distance or maximum mean discrepancy, where $\mathcal{H}$ represents a reproducing kernel Hilbert space (RKHS) with $k$ as its reproducing kernel (r.k.). In this situation, we can rewrite the function space $\mathcal{F}$ as $(\mathcal{H}, k)$ and the IPM $\gamma_{\mathcal{F}}$ as $\gamma_{\mathcal{F}_k}$. The kernel distance has been widely used in statistical applications, including homogeneity testing, independence testing, conditional independence testing, and mixture density estimation. Let $\{X_{P,i} : i = 1, \dots, n\}$ and $\{X_{Q,j} : j = 1, \dots, m\}$ be i.i.d.
samples randomly drawn from $\mathbb{P}$ and $\mathbb{Q}$, respectively. The empirical estimator of $\gamma_{\mathcal{F}}(\mathbb{P}, \mathbb{Q})$ is given by
$$\gamma_{\mathcal{F}}(\mathbb{P}_n, \mathbb{Q}_m) = \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{N} w^E_i f(X_i) \right|, \qquad (4.2)$$
where $\mathbb{P}_n(x) := \sum_{i=1}^{n} \frac{1}{n} I(X_{P,i} \le x)$ and $\mathbb{Q}_m(x) := \sum_{j=1}^{m} \frac{1}{m} I(X_{Q,j} \le x)$ represent the empirical distributions of $\mathbb{P}$ and $\mathbb{Q}$, respectively, $I(\cdot)$ is an indicator function, $N = n + m$, $w^E_i = \frac{1}{n}$ when $X_i = X_{P,i}$ for $i = 1, \dots, n$, and $w^E_j = -\frac{1}{m}$ when $X_j = X_{Q,j}$ for $j = 1, \dots, m$. Correspondingly, we define the weighted estimator of $\gamma_{\mathcal{F}}(\mathbb{P}, \mathbb{Q})$ as
$$\gamma_{\mathcal{F}}(\mathbb{P}^w_n, \mathbb{Q}^w_m) = \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{N} w_i f(X_i) \right|, \qquad (4.3)$$
where $\mathbb{P}^w_n(x) := \sum_{i=1}^{n} p_i I(X_{P,i} \le x)$ and $\mathbb{Q}^w_m(x) := \sum_{j=1}^{m} q_j I(X_{Q,j} \le x)$ represent the weighted distributions of $\mathbb{P}$ and $\mathbb{Q}$, respectively, $w_i = p_i$ when $X_i = X_{P,i}$ for $i = 1, \dots, n$, and $w_j = -q_j$ when $X_j = X_{Q,j}$ for $j = 1, \dots, m$. Here $\{p_i : i = 1, \dots, n\}$ and $\{q_j : j = 1, \dots, m\}$ are balancing weights derived from some weighting method.

Theorem 4.1.
Let $k$ be a strictly positive definite kernel, i.e., for all $N \in \mathbb{N}$, $\{\alpha_i\}_{i=1}^{N} \subset \mathbb{R} \setminus \{0\}$, and all mutually distinct $\{\theta_i\}_{i=1}^{N} \subset S$, $\sum_{i,j=1}^{N} \alpha_i \alpha_j k(\theta_i, \theta_j) > 0$. Then, for $\mathcal{F} = \mathcal{F}_k := \{f : \|f\|_{\mathcal{H}} \le 1\}$, the following function $f$ is the unique solution to (4.3):
$$f = \frac{1}{\left\| \sum_{i=1}^{N} w_i k(\cdot, X_i) \right\|_{\mathcal{H}}} \sum_{i=1}^{N} w_i k(\cdot, X_i), \qquad (4.4)$$
and the corresponding weighted estimator of the kernel distance is
$$\gamma_{k,N}(\mathbb{P}_n, \mathbb{Q}_m) = \left\| \sum_{i=1}^{N} w_i k(\cdot, X_i) \right\|_{\mathcal{H}} = \sqrt{\sum_{i,j=1}^{N} w_i w_j k(X_i, X_j)}. \qquad (4.5)$$

Proof.
Consider $\gamma_{k,N}(\mathbb{P}_n, \mathbb{Q}_m) := \sup \left\{ \sum_{i=1}^{N} w_i f(X_i) : \|f\|_{\mathcal{H}} \le 1 \right\}$, which can be written as
$$\gamma_{k,N}(\mathbb{P}_n, \mathbb{Q}_m) = \sup \left\{ \left\langle f, \sum_{i=1}^{N} w_i k(\cdot, X_i) \right\rangle_{\mathcal{H}} : \|f\|_{\mathcal{H}} \le 1 \right\},$$
where we have used the reproducing property of $\mathcal{H}$, i.e., $\forall f \in \mathcal{H}$, $\forall x \in S$, $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$. By the Cauchy-Schwarz inequality, which states that for all vectors $u$ and $v$ of an inner product space
$$|\langle u, v \rangle| \le \|u\| \cdot \|v\|, \qquad (4.6)$$
where $\langle \cdot, \cdot \rangle$ is the inner product, we have
$$\gamma_{k,N}(\mathbb{P}_n, \mathbb{Q}_m) \le \sup \left\{ \|f\|_{\mathcal{H}} \cdot \left\| \sum_{i=1}^{N} w_i k(\cdot, X_i) \right\|_{\mathcal{H}} : \|f\|_{\mathcal{H}} \le 1 \right\} \le \left\| \sum_{i=1}^{N} w_i k(\cdot, X_i) \right\|_{\mathcal{H}}.$$
Therefore, we can set
$$f = \frac{1}{\left\| \sum_{i=1}^{N} w_i k(\cdot, X_i) \right\|_{\mathcal{H}}} \sum_{i=1}^{N} w_i k(\cdot, X_i).$$
Since $k$ is strictly positive definite, $\gamma_{k,N}(\mathbb{P}_n, \mathbb{Q}_m) = 0$ if and only if $\mathbb{P}_n = \mathbb{Q}_m$, which ensures that (4.4) is well-defined.

According to Rasmussen (2003), one choice of kernel $k$ for covariate balance is the Gaussian kernel,
$$K(x, x'; \sigma) = \exp\left( -\frac{\|x - x'\|^2}{\sigma} \right),$$
where the parameter $\sigma$ determines the width of the Gaussian kernel and can either be fixed or estimated from the data. In this article, we estimate $\sigma$ with the median of all positive pairwise squared Euclidean distances between subjects:
$$\sigma = \mathrm{median}\left\{ \|X_i - X_j\|^2 : \|X_i - X_j\|^2 > 0, \ i, j \in \mathrm{trt} \cup \mathrm{col} \right\}.$$
The information matrix $K_G$ corresponding to the Gaussian kernel $K(x, x'; \sigma)$ is
$$K_G = \begin{pmatrix} K_{11} & -K_{10} \\ -K_{10}^T & K_{00} \end{pmatrix} = \begin{pmatrix} \big(K(X_i, X_j)\big)_{n_1 \times n_1} & -\big(K(X_i, X_j)\big)_{n_1 \times n_0} \\ -\big(K(X_i, X_j)\big)_{n_1 \times n_0}^T & \big(K(X_i, X_j)\big)_{n_0 \times n_0} \end{pmatrix}. \qquad (4.7)$$

In this article, we propose a preprocessing procedure, the kernel-distance-based covariate balancing method, to create balanced samples for evaluating treatment effects. In this procedure, we create two groups of scalar weights, for the treatment and the control groups, so that the reweighted treatment and control groups match precisely at the specified moments.
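The two ingredients just described, the median heuristic for $\sigma$ and the information matrix $K_G$ of (4.7), can be sketched as follows (function names are ours):

```python
import numpy as np

def pairwise_sq_dists(X):
    """All pairwise squared Euclidean distances between rows of X."""
    sq = np.sum(X**2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)

def median_sigma(X):
    """Median of all positive pairwise squared distances (the heuristic above)."""
    d2 = pairwise_sq_dists(X)
    return float(np.median(d2[d2 > 1e-12]))

def information_matrix(X, T, sigma):
    """K_G of (4.7): the Gaussian Gram matrix with the treated/control cross
    blocks negated, obtained here as D K D with D = diag(+1 treated, -1 control)."""
    K = np.exp(-pairwise_sq_dists(X) / sigma)
    s = np.where(np.asarray(T) == 1, 1.0, -1.0)
    return K * np.outer(s, s)
```

Because $K_G = D K D$ with $K$ positive semi-definite and $D$ diagonal, $K_G$ is positive semi-definite as well, which is the content of Proposition 1 below.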
In this section, we introduce the estimation of the ATE, $\tau \triangleq E[Y(1) - Y(0)]$, via
$$\hat\tau = \sum_{i=1}^{n_1} p_i Y_i - \sum_{j=1}^{n_0} q_j Y_j.$$
The weights $\{p_i : i = 1, \dots, n_1\}$ and $\{q_j : j = 1, \dots, n_0\}$ derive from the following KDB reweighting scheme:
$$\min_{\vec p, \vec q} \ r_W(\vec p, \vec q) = \sum_{i,j=1}^{N} p_i p_j K(X_i, X_j)\, I(i, j \in \mathrm{trt}) + \sum_{i,j=1}^{N} q_i q_j K(X_i, X_j)\, I(i, j \in \mathrm{col}) - 2 \sum_{i,j=1}^{N} p_i q_j K(X_i, X_j)\, I(i \in \mathrm{trt}, j \in \mathrm{col}) \qquad (4.8)$$
$$\text{s.t.} \quad \sum_{i=1}^{n_1} p_i = 1, \ p_i \ge 0; \qquad \sum_{j=1}^{n_0} q_j = 1, \ q_j \ge 0; \qquad \sum_{i=1}^{n_1} p_i g(X_i) = \sum_{j=1}^{n_0} q_j g(X_j),$$
where $I(\cdot)$ is an indicator function and $g(\cdot)$ indicates a set of constraint functions imposed on the covariate moments. Here, for convenience, we set $g(\cdot) = \{g_1(\cdot), \dots, g_D(\cdot)\}$ to pick out each covariate, $g_d(X_i) = X_{i,d}$ for $d \in \{1, \dots, D\}$, so that the reweighted treatment and control groups match on the first moment. To express this concisely, we define $\vec w = (p_1, \dots, p_{n_1}, q_1, \dots, q_{n_0})^T$ and $\vec b_{(2+D+N) \times 1} = (1, 1, 0, \dots, 0)^T$, and let $A^T \in \mathbb{R}^{(2+D+N) \times N}$ stack, from top to bottom: a row of ones over the treated entries (zeros elsewhere), a row of ones over the control entries, the rows $(\vec X_{d,\mathrm{trt}},\, -\vec X_{d,\mathrm{col}})$ for $d = 1, \dots, D$, and the identity matrix $I_{N \times N}$, where $D$ refers to the total number of covariates and $\vec X_{d,\mathrm{trt}}$ and $\vec X_{d,\mathrm{col}}$ respectively collect the $d$-th covariate in the treatment and control groups. The optimization problem for estimating the ATE can then be expressed as
$$\min_{\vec w} \ r_W(\vec w) = \vec w^T K_G \vec w \qquad (4.9)$$
s.t.
$$A^T \vec w \ge \vec b.$$
Here, the first $2 + D$ constraints are treated as equality constraints, and all further ones as inequality constraints.

To construct more stable weights for estimating the ATE, we can make a small modification to the optimization problem in (4.9). Compared to the reweighting scheme above, we only add a tuning parameter $\lambda$ to the diagonal of $K_G$ in order to control the variance of $\hat\tau$, while keeping all other settings and parameters the same. Defining $\vec w_0 = \vec w \,\big|_{p_i = 1/n_1,\, q_j = 1/n_0} = (1/n_1, \dots, 1/n_1, 1/n_0, \dots, 1/n_0)^T$, the reweighting scheme for stable ATE estimation is:
$$\min_{\vec w} \ r_W(\vec w) = \vec w^T K_G \vec w + \lambda (\vec w - \vec w_0)^T (\vec w - \vec w_0) \qquad (4.10) \qquad \text{s.t. } A^T \vec w \ge \vec b.$$
It is easy to prove that, under the constraints above, the optimization problem (4.10) is equivalent to
$$\min_{\vec w} \ r_W(\vec w) = \vec w^T (K_G + \lambda I_{N \times N}) \vec w \qquad (4.11) \qquad \text{s.t. } A^T \vec w \ge \vec b,$$
since $\vec w_0^T \vec w = 1/n_1 + 1/n_0$ is constant whenever the weights in each group sum to one. When $\lambda = 0$, problem (4.11) degenerates into problem (4.9), which controls the bias but not the variance of the ATE estimation. As $\lambda$ increases, the optimization problem may increase the bias while reducing the variance of the ATE estimation.

In this section, we introduce the KDB reweighting scheme for estimating the ATT, $\tau_1 \triangleq E[Y(1) - Y(0) \mid T = 1] = \mu_{1|1} - \mu_{0|1}$. We estimate $\tau_1$ with the difference between the treatment outcome mean and the weighted control outcomes, that is,
$$\hat\tau_1 = \sum_{i=1}^{n_1} \frac{Y_i}{n_1} - \sum_{j=1}^{n_0} q_j Y_j,$$
where $\{q_j : j = 1, \dots, n_0\}$ derive from the following reweighting scheme:
$$\min_{\vec q} \ r_W(\vec q) = \sum_{i,j=1}^{N} \frac{1}{n_1^2} K(X_i, X_j)\, I(i, j \in \mathrm{trt}) + \sum_{i,j=1}^{N} q_i q_j K(X_i, X_j)\, I(i, j \in \mathrm{col}) - 2 \sum_{i,j=1}^{N} \frac{q_j}{n_1} K(X_i, X_j)\, I(i \in \mathrm{trt}, j \in \mathrm{col}) \qquad (4.12)$$
$$\text{s.t.} \quad \sum_{j=1}^{n_0} q_j = 1, \ q_j \ge 0; \qquad \frac{1}{n_1} \sum_{i=1}^{n_1} g(X_i) = \sum_{j=1}^{n_0} q_j g(X_j).$$
As with the ATE estimation, we define $\vec q_{n_0 \times 1} = (q_1, \dots, q_{n_0})^T$ and reset $\vec w$, $\vec b_{(1+D+n_0) \times 1}$, and $A^T$ respectively to $\vec w = (p_1, \dots, p_{n_1}, q_1, \dots, q_{n_0})^T$ with $p_i = 1/n_1$ for $i \in \{1, \dots, n_1\}$,
$$\vec b_{(1+D+n_0) \times 1} = (1, \bar X_{1,1}, \dots, \bar X_{D,1}, 0, \dots, 0)^T,$$
and $A^T \in \mathbb{R}^{(1+D+n_0) \times n_0}$ stacking a row of ones, the rows $\vec X_{d,\mathrm{col}}$ for $d = 1, \dots, D$, and the identity matrix $I_{n_0 \times n_0}$, where $\bar X_{d,1}$, $d \in \{1, \dots, D\}$, refers to the sample mean of the $d$-th covariate in the treatment group. Therefore, we can also simplify the ATT optimization problem above into the matrix form
$$\min_{\vec w} \ r_W(\vec w) = \vec w^T K_G \vec w \quad \text{s.t. } A^T \vec q \ge \vec b. \qquad (4.13)$$
Here, the first $1 + D$ constraints are treated as equality constraints, and all further ones as inequality constraints.

4.7 Propositions of the KDB Covariate Balance

Proposition 1.
The Gaussian Information Matrix K G is positive semi-definite. Proof.
Based on the kernel property $K(x, x') = \langle \varphi(x), \varphi(x') \rangle$, for any $v \in \mathbb{R}^N$ we have
$$v^T K_G v = \sum_{i,j=1}^{N} v_i v_j (K_G)_{ij} = \sum_{i,j \in \mathrm{trt}} v_i v_j \langle \varphi(X_i), \varphi(X_j) \rangle + \sum_{i,j \in \mathrm{col}} v_i v_j \langle \varphi(X_i), \varphi(X_j) \rangle - 2 \sum_{i \in \mathrm{trt}} \sum_{j \in \mathrm{col}} v_i v_j \langle \varphi(X_i), \varphi(X_j) \rangle = \left\| \sum_{i \in \mathrm{trt}} v_i \varphi(X_i) - \sum_{j \in \mathrm{col}} v_j \varphi(X_j) \right\|^2 \ge 0,$$
as required.

Optimization Problem:
We aim to solve the following minimization problem:
$$\min_{x \in \mathbb{R}^N} \ f(x) = \frac{1}{2} x^T K x, \quad \text{s.t. } A x = b, \ B x \le 0_N, \qquad (4.14)$$
where $A = \begin{pmatrix} \vec 1^T_{n_1} & \vec 0^T_{n_0} \\ \vec 0^T_{n_1} & \vec 1^T_{n_0} \end{pmatrix}$, $B = -I$, and $b = (1, 1)^T$.

Proposition 2. (4.14) is a standard convex optimization problem.

The Dual of the Above Convex Problem: The Lagrange function is given by
$$L(x, \lambda, \nu) = f(x) + \lambda^T B x + \nu^T (A x - b)$$
and the dual function is given by $g(\lambda, \nu) = \inf_{x \in \mathbb{R}^N} L(x, \lambda, \nu)$. Note that
$$L(x, \lambda, \nu) = \frac{1}{2} x^T K x + (\lambda^T B + \nu^T A) x - \nu^T b. \qquad (4.15)$$
The dual problem of (4.14) is $d^* = \sup_{\lambda \ge 0_N, \nu} g(\lambda, \nu)$. Let $p^*$ be the optimal value of (4.14); then $d^* \le p^*$. Since the inequality constraints are all affine, the weak Slater condition gives $d^* = p^*$. From the KKT conditions we have
$$K x + B^T \lambda + A^T \nu = 0 \qquad (4.16)$$
and $A x = b$. Assuming $K$ is positive definite and solving (4.16), we have $x = -K^{-1}(B^T \lambda + A^T \nu)$. Therefore,
$$g(\lambda, \nu) = -\frac{1}{2} (B^T \lambda + A^T \nu)^T K^{-1} (B^T \lambda + A^T \nu) - \nu^T b.$$
The dual problem is
$$\max_{\lambda, \nu} \ -\frac{1}{2} (B^T \lambda + A^T \nu)^T K^{-1} (B^T \lambda + A^T \nu) - \nu^T b \quad \text{s.t. } \lambda_i \ge 0, \ i = 1, \dots, N. \qquad (4.17)$$
Solving for $\nu$, we have
$$\nabla_\nu g(\lambda, \nu) = -A K^{-1} A^T \nu - A K^{-1} B^T \lambda - b = 0 \ \Rightarrow\ \nu = -(A K^{-1} A^T)^{-1} (A K^{-1} B^T \lambda + b). \qquad (4.18)$$
Substituting $\nu$ with (4.18), (4.17) becomes
$$\max_{\lambda_i \ge 0} \ \frac{1}{2} \lambda^T \left( B K^{-1} A^T (A K^{-1} A^T)^{-1} A K^{-1} B^T - B K^{-1} B^T \right) \lambda + b^T (A K^{-1} A^T)^{-1} A K^{-1} B^T \lambda. \qquad (4.19)$$
Suppose $K^{-1}$ is partitioned by treatment group as $\begin{pmatrix} M_{11} & M_{10} \\ M_{10}^T & M_{00} \end{pmatrix}$, so that $Q \triangleq A K^{-1} A^T$ can be computed blockwise. For simplicity, substituting $B = -I$, (4.19) is converted to
$$\max_{\lambda_i \ge 0} \ \frac{1}{2} \lambda^T \left( K^{-1} A^T Q^{-1} A K^{-1} - K^{-1} \right) \lambda + b^T Q^{-1} A K^{-1} \lambda. \qquad (4.20)$$

In this section, we perform simulation studies to evaluate the performance of different weighting methods, including inverse probability weighting (IPW), the covariate balancing propensity score (CBPS) (Imai and Ratkovic, 2014), entropy balancing (EB) (Hainmueller, 2012), the calibration estimator (CAL) (Chan et al., 2016), and our proposed KDB covariate balancing. We use the oracle and unadjusted (UnAD) estimators as benchmarks. In all the following simulation experiments, we assume a constant causal effect, so the ATE is numerically the same as the ATT. For each simulation experiment, we construct $N_{sim}$ Monte Carlo datasets. In addition to the simulation studies, we also conduct a real data analysis with the 'National Supported Work' (NSW) dataset.
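As a minimal numerical sketch of the KDB program (4.9)/(4.11), the quadratic objective and its constraints can be handed to a general-purpose solver; below we use SciPy's SLSQP purely for illustration (all names are ours, and a dedicated QP solver would typically be faster and more accurate):

```python
import numpy as np
from scipy.optimize import minimize

def kdb_ate_weights(X, T, sigma, lam=0.0):
    """Sketch of (4.11): minimise w^T (K_G + lam*I) w subject to the weights of
    each group summing to one, nonnegativity, and first-moment balance."""
    X = np.asarray(X, dtype=float)
    T = np.asarray(T)
    N = len(T)
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    s = np.where(T == 1, 1.0, -1.0)              # negate cross blocks (Eq. 4.7)
    KG = np.exp(-d2 / sigma) * np.outer(s, s) + lam * np.eye(N)

    t, c = T == 1, T == 0
    cons = [
        {"type": "eq", "fun": lambda w: np.sum(w[t]) - 1.0},
        {"type": "eq", "fun": lambda w: np.sum(w[c]) - 1.0},
        # first-moment balance for every covariate (constraints of Eq. 4.8)
        {"type": "eq", "fun": lambda w: X[t].T @ w[t] - X[c].T @ w[c]},
    ]
    w0 = np.where(t, 1.0 / t.sum(), 1.0 / c.sum())  # uniform starting weights
    res = minimize(lambda w: w @ KG @ w, w0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * N, constraints=cons,
                   options={"maxiter": 500})
    return res.x
```

The returned vector stacks the treated weights $p$ and control weights $q$ in unit order, so the ATE estimate is $\sum_i w_i T_i Y_i - \sum_i w_i (1 - T_i) Y_i$; setting `lam > 0` gives the stabilized variant (4.11).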
To evaluate the performance of estimating ATE or ATT, we use the metric
Bias , % Bias , empiricalstandard deviation (SD) , and RMSE . Bias is the difference between the average estimate and the true value of ATE/ATT. % Bias is the bias as a percentage of the estimate’s standard deviation. A useful rule of thumb (Kanget al., 2007) is that the performance of interval estimates and test statistics begins to deterioratewhen the bias of the point estimate exceeds about 40% of its standard deviation. The empirical SD for an estimator is the square root of average squared differences between all estimates from themean. Here, we use the empirical SD corresponding to the average value of ATE/ATT estimates,20nd thus it is the empirical standard deviation of all estimated values divided by the square rootof sample size N sim . RMSE is the root-mean-square error for measuring the differences betweenvalues predicted by a weighting method and the observed values.To further compare the power of balancing covariates, we use the balancing statistics r W ,kernel distance (KD) , the maximum, average, and median value of absolute standardized meandifference (maxASMD, meanASMD, medASMD) , the average Kolmogorov-Smirnov statistic (meanKS) and the average T statistic (meanT) . We introduce each balancing statistic with moredetails as follows.For every set of sample weights { (cid:98) p i } ( i = 1 , , · · · , n ) and { (cid:98) q j } ( j = 1 , , · · · , n ) obtained bydifferent balancing methods for each Monte Carlo dataset, we calculate r W as (cid:98) r W ( (cid:126)p, (cid:126)q ) = N (cid:88) i =1 N (cid:88) j =1 (cid:98) p i (cid:98) p j K ( X i , X j ) · I ( i, j ∈ trt )+ N (cid:88) i =1 N (cid:88) j =1 (cid:98) q i (cid:98) q j K ( X i , X j ) · I ( i, j ∈ col ) − N (cid:88) i =1 N (cid:88) j =1 (cid:98) p i (cid:98) q j K ( X i , X j ) · I ( i ∈ trt , j ∈ col ) . 
(5.1)

For a simulation experiment with a total of $N_{sim}$ iterations, we calculate the average value of the $\hat{r}_W(\vec{p}, \vec{q})$ statistics for each balancing method and compare them. KD, the square root of $r_W$, has been widely used in statistical applications (Sriperumbudur et al., 2012) and can be regarded as a reasonable measure of covariate balance (Xie, 2018).

In estimating the ATE, the ASMD of covariate $X_d$ $(d = 1, \cdots, D)$ with balancing weights $\hat{p}_i$ $(i = 1, \cdots, n_1)$ and $\hat{q}_j$ $(j = 1, \cdots, n_0)$ is defined as (Zhang et al., 2019):
$$ASMD_{d,ATE} = \frac{|\bar{X}^p_{d,1} - \bar{X}^q_{d,0}|}{\sqrt{(S^2_{d,1} + S^2_{d,0})/2}},$$
where $\bar{X}^p_{d,1}$ and $\bar{X}^q_{d,0}$ respectively refer to the weighted sample means of covariate $X_d$ in the treatment and control groups,
$$\bar{X}^p_{d,1} = \sum_{i=1}^{n_1} \hat{p}_i X_{i,d,1}, \qquad \bar{X}^q_{d,0} = \sum_{j=1}^{n_0} \hat{q}_j X_{j,d,0},$$
and $S^2_{d,1}$ and $S^2_{d,0}$ denote the sample variances of covariate $X_d$ in the treatment and control groups without any adjustment,
$$S^2_{d,1} = \frac{\sum_{i=1}^{n_1} (X_{i,d,1} - \bar{X}_{d,1})^2}{n_1 - 1}, \qquad S^2_{d,0} = \frac{\sum_{j=1}^{n_0} (X_{j,d,0} - \bar{X}_{d,0})^2}{n_0 - 1}.$$
In estimating the ATT, the
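Given the weights produced by any balancing method, $\hat{r}_W$ and KD are direct to compute. A small numpy sketch (the Gaussian kernel and the function names are our own choices):

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # K[i, j] = exp(-||x_i - y_j||^2 / (2 * sigma^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def r_w(p, q, Xt, Xc, sigma=1.0):
    """r_W = p'K(Xt,Xt)p + q'K(Xc,Xc)q - 2 p'K(Xt,Xc)q; KD = sqrt(r_W)."""
    return (p @ gaussian_kernel(Xt, Xt, sigma) @ p
            + q @ gaussian_kernel(Xc, Xc, sigma) @ q
            - 2.0 * p @ gaussian_kernel(Xt, Xc, sigma) @ q)
```

For a characteristic kernel such as the Gaussian, $r_W$ is non-negative and equals zero exactly when the two weighted samples coincide, so $KD = \sqrt{r_W}$ behaves like a distance between the reweighted covariate distributions.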
ASMD of covariate $X_d$ $(d = 1, \cdots, D)$ with balancing weights $\hat{p}_i = 1/n_1$ $(i = 1, \cdots, n_1)$ and $\hat{q}_j$ $(j = 1, \cdots, n_0)$ is defined as (Xie et al., 2019):
$$ASMD_{d,ATT} = \frac{|\bar{X}^p_{d,1} - \bar{X}^q_{d,0}|}{sd(X_{d,1})},$$
where $sd(X_{d,1})$ denotes the sample standard deviation of covariate $X_d$ in the treatment group without any adjustment,
$$sd(X_{d,1}) = \sqrt{\frac{\sum_{i=1}^{n_1} (X_{i,d,1} - \bar{X}_{d,1})^2}{n_1 - 1}}.$$
For each Monte Carlo dataset, we calculate the
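The two ASMD variants differ only in their denominators. A sketch for a single covariate, with one-dimensional arrays `xt` and `xc` for the treated and control observations (function names are ours):

```python
import numpy as np

def asmd_ate(p, q, xt, xc):
    # |difference of weighted means| over the pooled unadjusted SD
    pooled_sd = np.sqrt((np.var(xt, ddof=1) + np.var(xc, ddof=1)) / 2.0)
    return abs(p @ xt - q @ xc) / pooled_sd

def asmd_att(q, xt, xc):
    # treatment weights are fixed at 1/n1; scale by the treated-group SD
    p = np.full(len(xt), 1.0 / len(xt))
    return abs(p @ xt - q @ xc) / np.std(xt, ddof=1)
```

maxASMD, meanASMD, and medASMD are then simply the maximum, mean, and median of these values over the $D$ covariates.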
ASMD values over all $D$ covariates and record the maximum, mean, and median values.

A Kolmogorov-Smirnov test can be applied to test whether two underlying one-dimensional probability distributions differ significantly, and the KS statistic quantifies a distance between the empirical distribution functions of two samples. We calculate the KS statistic between the weighted samples in the treatment and control groups for each of the $D$ covariates and take the average as one balancing statistic:
$$meanKS = \frac{\sum_{d=1}^{D} KS_d}{D}, \qquad KS_d = \sup_x |F_{1,n}(X^p_{d,1}) - F_{0,n}(X^q_{d,0})|,$$
where $F_{1,n}(X^p_{d,1})$ and $F_{0,n}(X^q_{d,0})$ are the empirical distribution functions of the weighted samples $\{\hat{p}_i \cdot X_{d,i}\}$ $(i = 1, \cdots, n_1)$ and $\{\hat{q}_j \cdot X_{d,j}\}$ $(j = 1, \cdots, n_0)$, respectively, and $\sup$ is the supremum function.

A t-test is commonly applied to determine whether the means of two sets of data differ significantly from each other. We conduct a Welch's t-test between the weighted means in the treatment and control groups for each of the $D$ covariates and take the average as one balancing statistic:
$$meanT = \frac{\sum_{d=1}^{D} T_d}{D}, \qquad T_d = \frac{\bar{X}^p_{d,1} - \bar{X}^q_{d,0}}{S_{\bar{\Delta},d}}, \qquad S_{\bar{\Delta},d} = \sqrt{\frac{S^2_{X^p_{d,1}}}{n_1} + \frac{S^2_{X^q_{d,0}}}{n_0}},$$
where $S^2_{X^p_{d,1}}$ and $S^2_{X^q_{d,0}}$ are the unbiased estimators of the variances of the two weighted samples $\{\hat{p}_i \cdot X_{d,i}\}$ $(i = 1, \cdots, n_1)$ and $\{\hat{q}_j \cdot X_{d,j}\}$ $(j = 1, \cdots, n_0)$, respectively.

In the different simulation scenarios, sample oracle estimators, sample unadjusted estimators, and their biases serve as benchmarks for comparing the different weighting methods.
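Both summary statistics can be computed with scipy, following the text's convention of scaling each observation by its weight (scaling observations versus using weighted ECDFs is a design choice; we mirror the definition above, and the function names are ours):

```python
import numpy as np
from scipy.stats import ks_2samp, ttest_ind

def mean_ks(p, q, Xt, Xc):
    # average two-sample KS statistic over the D covariates,
    # applied to the weight-scaled samples {p_i * x_i} and {q_j * x_j}
    D = Xt.shape[1]
    return np.mean([ks_2samp(p * Xt[:, d], q * Xc[:, d]).statistic
                    for d in range(D)])

def mean_t(p, q, Xt, Xc):
    # average Welch t statistic (unequal variances) over the D covariates
    D = Xt.shape[1]
    return np.mean([ttest_ind(p * Xt[:, d], q * Xc[:, d],
                              equal_var=False).statistic
                    for d in range(D)])
```

Perfect balance corresponds to meanKS near zero and meanT near zero; large absolute values flag covariates whose weighted distributions still differ.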
The sample oracle estimators of the ATE and ATT are respectively defined as
$$\hat{\tau}_{Oracle} = \frac{1}{N}\sum_{i=1}^{N} Y_i(1) - \frac{1}{N}\sum_{i=1}^{N} Y_i(0), \qquad \hat{\tau}_{trt,Oracle} = \frac{1}{n_1}\sum_{i=1}^{n_1} [Y_i(1) \mid T_i = 1] - \frac{1}{n_1}\sum_{i=1}^{n_1} [Y_i(0) \mid T_i = 1].$$
Without any adjustment, we assign weight $\hat{p}^{UnAD}_i = 1/n_1$ $(i = 1, \cdots, n_1)$ to each unit in the treatment group and $\hat{q}^{UnAD}_j = 1/n_0$ $(j = 1, \cdots, n_0)$ to each unit in the control group. From this, we obtain the sample unadjusted estimator of the ATE,
$$\hat{\tau}_{UnAD} = \sum_{i=1}^{n_1} \hat{p}^{UnAD}_i Y_i - \sum_{j=1}^{n_0} \hat{q}^{UnAD}_j Y_j.$$
For convenience, we denote

• KDBC: the proposed KDB covariate balancing with only the constraints
$$\sum_{i=1}^{n_1} p_i = 1, \; p_i \ge 0; \qquad \sum_{j=1}^{n_0} q_j = 1, \; q_j \ge 0.$$
• KDM1: the proposed KDB covariate balancing with the constraints
$$\sum_{i=1}^{n_1} p_i = 1, \; p_i \ge 0; \qquad \sum_{j=1}^{n_0} q_j = 1, \; q_j \ge 0; \qquad \sum_{i=1}^{n_1} p_i X_{k,i} = \sum_{j=1}^{n_0} q_j X_{k,j}, \; k = 1, \cdots, K.$$

5.3.1 Simulation 1

In this section, we follow the simulation example in Josey et al. (2020), an extensive set of experimental scenarios adapted from Kang et al. (2007), to compare the different weighting methods. We first generate the observed covariates $X = (X_1, X_2, X_3, X_4)^T$, where $X_d \sim N(0, 1)$, $d = 1, \cdots, 4$, and then make the following transformations to build the hidden variables $U = (U_1, U_2, U_3, U_4)^T$:
$$U_1 = \exp(X_1/2), \quad U_2 = \frac{X_2}{1 + \exp(X_1)} + 10, \quad U_3 = \left(\frac{X_1 X_3}{25} + 0.6\right)^3, \quad U_4 = (X_2 + X_4 + 20)^2.$$
In subsequent applications, we standardize $U = (U_1, U_2, U_3, U_4)^T$ so that it has a mean of zero and marginal variances of one. The propensity score model is defined as
$$\pi^{(\delta_T)}_i = \frac{\exp[\eta^{(\delta_T)}_i]}{1 + \exp[\eta^{(\delta_T)}_i]} \;\Rightarrow\; T^{(\delta_T)}_i \sim Bin(1, \pi^{(\delta_T)}_i),$$
where $\delta_T \in \{X, U\}$ refers to the generative process that determines the treatment assignment. Here, $\delta_T = X$ indicates that the log odds of the propensity score are linear, with $\eta^{(X)}_i = -X_{i1} + 0.5 X_{i2} - 0.25 X_{i3} - 0.1 X_{i4}$, while $\delta_T = U$ implies a nonlinear relationship between the log odds of the propensity score and the observed covariates, with $\eta^{(U)}_i = -U_{i1} + 0.5 U_{i2} - 0.25 U_{i3} - 0.1 U_{i4}$. As for the outcome process, we know that $Y = Y(1) \cdot T + Y(0) \cdot (1 - T)$, and we construct the paired potential outcomes $\{Y_i(1), Y_i(0)\}$, $i = 1, \cdots, N$, with a bivariate model
$$\begin{pmatrix} Y_i(0) \\ Y_i(1) \end{pmatrix}^{(\delta_O)} \sim N\left( \begin{pmatrix} \mu^{(\delta_O)}_i \\ \mu^{(\delta_O)}_i + \gamma \end{pmatrix}, \begin{pmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{pmatrix} \right),$$
where $\sigma^2$ is the error variance, $\rho$ is the correlation between the potential outcomes, $\gamma = 20$ represents the treatment effect, and $\delta_O \in \{X, U\}$ indicates the covariates, whether $X$ or $U$, that participate in the outcome process. For example, if $\delta_O = X$, we have $\mu^{(X)}_i = 210 + 27.4 X_{i1} + 13.7 X_{i2} + 13.7 X_{i3} + 13.7 X_{i4}$. It should be emphasized that although we may use $U$ to construct the variables $T$ or $Y$, it cannot be observed, and the covariates to be balanced are always $X$. In other words, the observed simulated data consist of $\{X_i, T_i, Y_i\}$, $i = 1, \cdots, N$.

Based on this data generation mechanism, we know that
• ATE: $\tau = E[Y_i(1) - Y_i(0)] = E[Y_i(1)] - E[Y_i(0)] = \left(\mu^{(\delta_O)}_i + \gamma\right) - \mu^{(\delta_O)}_i = \gamma = 20$;
• ATT: $\tau_{trt} = E[Y_i(1) - Y_i(0) \mid T = 1] = E\left[\left(\mu^{(\delta_O)}_i + \gamma\right) - \mu^{(\delta_O)}_i \mid T = 1\right] = \gamma = 20$.

To better understand the role of different parameters in estimating the ATE, we choose the sample size $N$ in $\{200, 1000\}$, the error variance $\sigma^2$ in $\{2, 5, 10\}$, the correlation between the potential outcomes $\rho$ in $\{-0.3, 0, 0.5\}$,
the generative process $\delta_T$ that determines the treatment assignment in $\{U$: $T$ is generated with covariates $U$; $X$: $T$ is generated with covariates $X\}$, and the outcome process $\delta_O$ in $\{U$: $Y$ is generated with covariates $U$; $X$: $Y$ is generated with covariates $X\}$, so there are a total of 72 different simulation scenarios. We change only one parameter between simulation scenarios and run 500 Monte Carlo datasets for each setting.

Fixing $\delta_T$ and $\delta_O$ ($\{\delta_T = X, \delta_O = X\}$, $\{\delta_T = X, \delta_O = U\}$, $\{\delta_T = U, \delta_O = X\}$, $\{\delta_T = U, \delta_O = U\}$), we run 1000 Monte Carlo datasets to compare the performance of four weighting methods (IPW, CBPS, EB, KDBC) for estimating the ATE under different combinations of $N$, $\sigma^2$, and $\rho$. The results are briefly shown in Figures [1, 2, 3, 4]. Each figure corresponds to a different combination of $\delta_T$ and $\delta_O$; the values of $\sigma^2$ corresponding to the rows from top to bottom are $\{2, 5, 10\}$, and the values of $\rho$ corresponding to the columns from left to right are $\{-0.3, 0, 0.5\}$. It can be seen that the sample size $N$ and the error variance $\sigma^2$ mainly influence the standard errors of the ATE estimators, while the correlation between the potential outcomes $Y_i(0)$ and $Y_i(1)$ has almost no effect. This observation is consistent with Josey et al. (2020), and thus we focus on the simulation scenario with parameters $N = 200$, $\sigma^2 = 10$, and $\rho = 0$.

We run 500 Monte Carlo experiments for estimating both the ATE and the ATT. The results of the ATE estimation are shown in Figure [5] and Tables [1, 2], and the results of the ATT estimation are shown in Tables [3, 4]. Tables [1, 3] together show that the KDBC estimators, for both the ATE and the ATT, can perform as well as the other weighting estimators when all the true predictors are included in the observed dataset. In the two scenarios $\{\delta_T = X, \delta_O = U\}$ and $\{\delta_T = U, \delta_O = U\}$, where key information about the true covariates is missing, the KDBC estimators perform much better than the other weighting estimators.
This reflects the high stability of our proposed KDB covariate balancing in estimating causal effects, even when the observed dataset misses some key information about the real generation mechanism.

To better observe the adjustment effect of our proposed method, we draw the cumulative distribution functions (CDF) of all covariates without any adjustment, and those of the weighted covariates under KDBC. It can be seen from Figure [6] that, regardless of the values of $\delta_T$ and $\delta_O$, under our proposed KDB covariate balancing with only regularization constraints (KDBC), the CDFs of the weighted covariates in the treatment and control groups always match each other well. This means that our proposed method can effectively achieve covariate balance between the treatment and control groups.
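For reference, the Simulation 1 data-generating process described above can be sketched as follows (a hedged reimplementation with our own function and argument names, not the authors' code):

```python
import numpy as np

def simulate_kang_schafer(N, gamma=20.0, sigma2=10.0, rho=0.0,
                          t_scen="X", y_scen="X", rng=None):
    """One Monte Carlo dataset for Simulation 1 (Kang-Schafer design)."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((N, 4))
    # hidden nonlinear transformations
    U = np.column_stack([
        np.exp(X[:, 0] / 2),
        X[:, 1] / (1 + np.exp(X[:, 0])) + 10,
        (X[:, 0] * X[:, 2] / 25 + 0.6) ** 3,
        (X[:, 1] + X[:, 3] + 20) ** 2,
    ])
    U = (U - U.mean(axis=0)) / U.std(axis=0)      # standardize
    # treatment assignment: logistic in X or in U
    Z = X if t_scen == "X" else U
    eta = -Z[:, 0] + 0.5 * Z[:, 1] - 0.25 * Z[:, 2] - 0.1 * Z[:, 3]
    T = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    # outcome process: correlated potential outcomes with constant effect
    W = X if y_scen == "X" else U
    mu = 210 + 27.4 * W[:, 0] + 13.7 * (W[:, 1] + W[:, 2] + W[:, 3])
    cov = sigma2 * np.array([[1.0, rho], [rho, 1.0]])
    eps = rng.multivariate_normal([0.0, 0.0], cov, size=N)
    Y0, Y1 = mu + eps[:, 0], mu + gamma + eps[:, 1]
    Y = np.where(T == 1, Y1, Y0)
    return X, T, Y, Y1 - Y0
```

Only `X`, `T`, and `Y` would be handed to the weighting methods; `Y1 - Y0` is kept here solely to verify the constant effect $\gamma$.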
Figure 1: T Scenario = "X", Y Scenario = "X".
Figure 2: T Scenario = "X", Y Scenario = "U".
Figure 3: T Scenario = "U", Y Scenario = "X".
Figure 4: T Scenario = "U", Y Scenario = "U".

Figure 5: ATE estimates using five balancing methods (IPW, CBPS, EB, CAL, and KDBC) with $N = 200$, $\sigma^2 = 10$, and $\rho = 0$. Both the oracle estimate and the unadjusted (UnAD) estimate are used as benchmarks for comparison.
Table 1: Details on ATE estimates with 500 Monte Carlo datasets (Bias, sd, and RMSE for each combination of the treatment assignment scenario $\delta_T$ and the outcome scenario $\delta_O$; columns: Oracle, UnAD, IPW, CBPS, EB, CAL, KDBC, KDM1). Selected Bias rows:

  $\delta_T = U$, $\delta_O = X$: -0.00635 (Oracle), -15.85058 (UnAD), 0.37174 (IPW), -0.00156 (CBPS), -0.04647 (EB), -0.00179 (CAL), -0.01756 (KDBC), -0.00392 (KDM1)
  $\delta_T = U$, $\delta_O = U$: -0.00826 (Oracle), -14.69162 (UnAD), -4.69057 (IPW), -4.80178 (CBPS), -4.16993 (EB), -4.13912 (CAL), -0.10005 (KDBC), -0.08974 (KDM1)
Table 2: ATE - Details on Balancing Metrics (KD, maxASMD, meanASMD, medASMD, meanKS, meanT) for each combination of $\delta_T$ and $\delta_O$ with 500 Monte Carlo datasets.
Table 3: Details on ATT estimates with 500 Monte Carlo datasets (columns: Oracle, IPW, CBPS, EB, CAL, KDM1, KDBC). Selected Bias rows:

  $\delta_T = U$, $\delta_O = X$: -0.00635 (Oracle), 5.08391 (IPW), -0.00717 (CBPS), -0.05765 (EB), -0.00689 (CAL), -0.02487 (KDM1), -0.22012 (KDBC)
  $\delta_T = U$, $\delta_O = U$: -0.00826 (Oracle), -0.92883 (IPW), -3.60219 (CBPS), -3.62439 (EB), -3.60196 (CAL), -0.03336 (KDM1), -0.25254 (KDBC)
Table 4: ATT - Details on Balancing Metrics (KD, maxASMD, meanASMD, medASMD, meanKS, meanT) with 500 Monte Carlo datasets.
Figure 6: The CDFs of the unweighted covariates are arranged in the odd rows, and those of the weighted covariates under KDBC are arranged in the even rows. All combinations of "T scenario" and "Y scenario" are included. The dashed line indicates the treatment group, and the solid line indicates the control group.

5.3.2 Simulation 2

In this section, we perform a simulation experiment with a parameter setting different from that of Simulation 1 (Section 5.3.1). We wish to explore the mechanism of the KDB method in mining covariate information, as well as to check the performance of the KDB stable ATE estimation.

We first construct the treatment indicators $T_i$, $i = 1, \cdots, N$, from a binomial distribution $Bin(1, p)$. Here, we set $p$, the probability of a unit being assigned to the treatment group, to be 0.5. Next, we use two different multivariate normal distributions to generate the covariates $X_1$ and $X_2$ for the treatment group and the control group, respectively:
$$\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \Big| \, T = 1 \sim N(\mu_1, \Sigma_1) \quad \text{and} \quad \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \Big| \, T = 0 \sim N\left(\mu_1 + \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix}, \Sigma_0\right),$$
where $\alpha_1$ and $\alpha_2$ are fixed constants in $(0, 1)$ that shift the location of the control-group distribution. Letting $X_3$ and $X_4$ be distributed as $N(0, 1)$, we continue to construct the interaction term $X_5 = X_1 \times X_2$ and a squared term $X_6$. The vector $(X_1, X_2, X_3, X_4, X_5, X_6)^T$ is subsequently standardized to have a mean of zero and marginal variances of one. The observed covariates are $\{X_1, X_2, X_3, X_4\}$. For the outcome variable, we use the bivariate model
$$\begin{pmatrix} Y_i(0) \\ Y_i(1) \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_i \\ \mu_i + \gamma \end{pmatrix}, \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{pmatrix} \right),$$
where $\gamma = 10$, $\sigma^2 = 10$, and $\mu_i = 20 X_{1i} + 10 X_{2i} + 5 X_{3i} + 5 X_{4i} + \alpha_3 \cdot X_{5i} + \alpha_4 \cdot X_{6i}$ with $\alpha_3 = 1$, $\alpha_4 = 2$. To explore whether the KDB stable ATE estimation can effectively reduce the variance of the ATE estimator, we add a tuning parameter $\lambda$ to the diagonal of $K_G$ and vary its value to see how the estimation results change:
$$K_G = \begin{pmatrix} K_{11} & -K_{12} \\ -K_{21} & K_{22} \end{pmatrix} + \lambda I_N.$$
With $N = 200$, we run 500 Monte Carlo datasets for each value of $\lambda$ in $\{0, 1, 2, 5, 10, 100\}$. The results of the ATE estimation with $\lambda \in \{0, 1, 2, 5, 10, 100\}$ are shown in Table [5].
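Adding $\lambda$ to the diagonal of the kernel matrix is, in effect, ridge regularization. A small numpy check of how it improves the conditioning of a Gaussian Gram matrix (the bandwidth and sample size are arbitrary choices of ours):

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    # symmetric Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
K = gaussian_gram(rng.standard_normal((50, 2)))
# cond(K + lam*I) = (eig_max + lam) / (eig_min + lam), decreasing in lam
conds = {lam: np.linalg.cond(K + lam * np.eye(50))
         for lam in [0.0, 1.0, 10.0, 100.0]}
```

Better conditioning of the regularized problem is one plausible reading of the variance reduction observed for KDM1 below, at the price of some bias when $\lambda$ grows large.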
In each block of the table for a specific value of $\lambda$, the rows are arranged from top to bottom according to the absolute value of the bias. Comparing the results across all blocks, we find that the UnAD estimator always performs the worst and is far from the true value. This shows the importance of covariate balance, especially when the real generation mechanism of the data involves nonlinear information. As the value of $\lambda$ increases, the performance of KDBC gradually deteriorates, but KDM1 always remains near the top of the ranking, which indicates the necessity of adding the first-moment balance to the balance constraints of the KDB method. Furthermore, adding a non-zero $\lambda$ does help reduce the variance of the KDM1 estimator of the ATE compared with the results when $\lambda = 0$; the mechanism behind this remains to be explored.

When $\lambda$ changes from 0 to 1, the performance of KDM1 is not as good as that of CBPS and CAL. However, when we increase $\lambda$ from 2 to 100, KDM1 again becomes the best weighting method with the smallest absolute bias. This probably implies that there is a threshold for $\lambda$ to induce better performance, and it may be related to the information matrix (Section 4.5) of the observed dataset.

Considering the ASMD values of $X_5$ and $X_6$ under the different weighting methods in Table [5], we see that the KDB methods, including KDBC and KDM1, always perform the best. Reviewing the performance of the KDB covariate balancing in Simulation 1 (Section 5.3.1) with $\{\delta_T = X, \delta_O = U\}$ and $\{\delta_T = U, \delta_O = U\}$, we conclude that the KDB covariate balancing may automatically discover and include some latent variables, such as interaction terms between different covariates or higher-order terms of covariates, when analyzing and estimating the ATE.
Within each $\lambda$ block of Table [5] (metrics: ATE, abs(Bias), sd, RMSE, $r_W$, KD, maxASMD, meanASMD, medASMD, X5ASMD, X6ASMD, meanKS, meanT), the methods ranked by absolute bias, best to worst, are:

  $\lambda = 0$:   Oracle, KDM1, KDBC, CBPS, CAL, IPW, EB, UnAD
  $\lambda = 1$:   Oracle, CBPS, CAL, KDM1, IPW, EB, KDBC, UnAD
  $\lambda = 2$:   Oracle, CAL, KDM1, CBPS, EB, IPW, KDBC, UnAD
  $\lambda = 5$:   Oracle, KDM1, CAL, CBPS, EB, IPW, KDBC, UnAD
  $\lambda = 10$:  Oracle, KDM1, CAL, CBPS, EB, IPW, KDBC, UnAD
  $\lambda = 100$: Oracle, KDM1, CAL, CBPS, EB, IPW, KDBC, UnAD

Selected rows:

  UnAD ($\lambda = 0$): ATE = -13.16629, abs(Bias) = 23.16629, sd = 0.15097, RMSE = 1.04695, $r_W$ = 0.04264, KD = 0.20455, maxASMD = 0.89249, meanASMD = 0.46910, medASMD = 0.46869, X5ASMD = 0.87363, X6ASMD = 0.11004, meanKS = 0.24557, meanT = -2.83703
  KDBC ($\lambda = 10$): ATE = -1.93687, abs(Bias) = 11.93687, sd = 0.09537, RMSE = 0.54227, $r_W$ = 0.01188, KD = 0.10779, maxASMD = 0.48651, meanASMD = 0.25751, medASMD = 0.25467, X5ASMD = 0.47040, X6ASMD = 0.11033, meanKS = 0.23875, meanT = -1.53728
  KDBC ($\lambda = 100$): ATE = -11.07394, abs(Bias) = 21.07394, sd = 0.14527, RMSE = 0.95356, $r_W$ = 0.03548, KD = 0.18653, maxASMD = 0.82059, meanASMD = 0.43223, medASMD = 0.43112, X5ASMD = 0.80211, X6ASMD = 0.11307, meanKS = 0.24343, meanT = -2.61005
Table 5: Results of ATE estimation in Simulation 2 for $\lambda \in \{0, 1, 2, 5, 10, 100\}$ with 500 Monte Carlo datasets.

5.4 Real Data Analysis

To further compare the different weighting methods, we investigate the causal effect in a real data set: the labour training program data previously analyzed in LaLonde (1986) and Dehejia and Wahba (1999), among many others. The dataset used here is a subset, containing information on earnings in 1974 (RE74), of the National Supported Work (NSW) Demonstration data as used in LaLonde (1986). The NSW Demonstration was a transitional, subsidized work experience program for four target groups of people with longstanding employment problems: ex-offenders, former drug addicts, women who were long-term recipients of welfare benefits, and school dropouts, many with criminal records. It was designed to test whether, and to what extent, 12 to 18 months of employment in a supportive but performance-oriented environment would equip hard-to-employ people to get and hold regular, unsubsidized jobs.
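The bootstrap evaluation used for this dataset (averaging weighted ATE estimates over resamples and reporting the standard error of that average) can be sketched generically as follows; the function names and the uniform-weight placeholder are ours, and any of the weighting methods can be plugged in for `weights_fn`:

```python
import numpy as np

def weighted_ate(Y, T, p, q):
    # difference of weighted outcome means, treatment minus control
    return p @ Y[T == 1] - q @ Y[T == 0]

def bootstrap_ate(Y, T, weights_fn, B=500, rng=None):
    """Average the weighted ATE over B bootstrap resamples; the reported
    sd is the standard error of that average, sd(estimates)/sqrt(B)."""
    rng = np.random.default_rng(rng)
    n = len(Y)
    ests = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)      # resample with replacement
        Yb, Tb = Y[idx], T[idx]
        p, q = weights_fn(Tb)
        ests.append(weighted_ate(Yb, Tb, p, q))
    ests = np.asarray(ests)
    return ests.mean(), ests.std(ddof=1) / np.sqrt(B)

def uniform_weights(T):
    # UnAD benchmark: weights 1/n1 for the treated and 1/n0 for the controls
    n1, n0 = (T == 1).sum(), (T == 0).sum()
    return np.full(n1, 1.0 / n1), np.full(n0, 1.0 / n0)
```

A production version would also guard against resamples that contain no treated (or no control) units, which this sketch omits.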
  Metric   UnAD        IPW         CBPS        EB          CAL         KDBC        KDps        KDM1
  ATE      1804.78064  1641.77770  1637.56504  1612.77550  1612.44289  1774.39211  2311.99732  1672.08084
Table 6: Results of ATE estimation for the NSW data (Dehejia and Wahba, 1999). The ATE is an average of 500 bootstrap estimates, and sd = sd($\widehat{ATE}$)/$\sqrt{500}$ is the standard error of the average ATE estimate.

Variables included in the dataset are: treatment indicator (1 if treated, 0 if not treated), age, education, Black (1 if black, 0 otherwise), Hispanic (1 if Hispanic, 0 otherwise), married (1 if married, 0 otherwise), nodegree (1 if no degree, 0 otherwise), RE74 (earnings in 1974), RE75 (earnings in 1975), and RE78 (earnings in 1978). We compare the weighting estimates of earnings in 1978 between the treatment and control groups, which are estimates of the treatment effect. We compare the proposed KDB covariate balancing with IPW, CBPS, EB, and CAL, and take the UnAD estimator as the benchmark. We set the balance restrictions in the KDB method as follows: regularization constraints only, regularization constraints plus propensity score balance, and regularization constraints plus the first-moment balance; the three scenarios are denoted by "KDBC", "KDps", and "KDM1", respectively. We obtain 500 bootstrap datasets by resampling with replacement, with which we estimate the ATE. The results are shown in Table [6]; all numerical results are rounded to five decimal places.

As can be seen from Table [6], the estimation results of all the weighting methods differ noticeably from the UnAD estimates. Except for the KDB methods (KDBC, KDps, KDM1), the estimation results of the other weighting methods are very similar. Both KDBC and KDM1 have the smallest standard deviations (sd), so they show outstanding stability in estimating the ATE. Considering the specific ATE estimates, however, adding only regularization constraints may not be enough to achieve a good balance of covariates between the treatment and control groups. In addition, the KDps estimates are completely different from those of the other methods, including UnAD, and their sd is very large.
Therefore, it seems inappropriate to use regularization constraints plus propensity score balance as the balance restrictions in the KDB method. This is an important reminder to set balance restrictions carefully, which is an important focus for further exploration.

We plot the probability density functions (PDF) of the covariates "age" (denoted by $X_1$) and "education" (denoted by $X_2$), as well as the PDFs of $X_1$ and $X_2$ weighted by the different methods (UnAD, IPW, CBPS, EB, CAL, KDBC), to compare performance. The first PDF plots in Figures [7] and [8] indicate that $X_1$ and $X_2$ have almost the same distributions in the treatment and control groups, which is consistent with the randomization setting of the NSW data. In Figures [7, 8], the relative relationship between the PDFs of $X_1$ and $X_2$ weighted by the IPW, CBPS, EB, and CAL methods in the treatment and control groups is very similar, which may explain why the estimation results of these weighting methods in Table [6] are so similar. Although all methods widen the gap between the probability density functions of the treatment group and the control group, the KDBC method seems to better retain the characteristics of the original PDFs, such as bimodality.
Figure 7: Density of $X_1$ = age under the different balancing methods.
Figure 8: Density of $X_2$ = education under the different balancing methods.

Conclusions and Further Discussions
With the goal of creating balanced samples, we have proposed a kernel-distance-based covariate balancing method with the Gaussian kernel in this article. The proposed method calculates sample weights for the treatment and control groups in order to yield proper estimators of the causal effect. It performs well in discovering latent variables, such as interactions of different predictors, and yields stable estimates of the ATE.
There are two main directions in which our proposed method can develop. (i) In addition to setting the kernel distance as the loss function, we could choose other commonly used covariate balance metrics as the loss function and use the kernel distance as an optimization constraint, making it possible to control the balance performance along different dimensions at the same time. (ii) We look forward to exploring further applications of the kernel distance in high-dimensional spaces.
Acknowledgments

References
Chan, K. C. G., S. C. P. Yam, and Z. Zhang (2016): “Globally efficient non-parametricinference of average treatment effects by empirical balancing calibration weighting,” Journal ofthe Royal Statistical Society. Series B, Statistical methodology, 78, 673.
Dehejia, R. H. and S. Wahba (1999): “Causal effects in nonexperimental studies: Reevaluatingthe evaluation of training programs,” Journal of the American statistical Association, 94, 1053–1062.
Deville, J.-C. and C.-E. Särndal (1992): "Calibration estimators in survey sampling," Journal of the American Statistical Association, 87, 376–382.

Erlander, S. (1977): "Entropy in linear programs: an approach to planning," Report LiTH-MAT-R-77-3, Department of Mathematics, Linköping University, Linköping, Sweden.
Hainmueller, J. (2012): “Entropy balancing for causal effects: A multivariate reweighting methodto produce balanced samples in observational studies,” Political Analysis, 20, 25–46.
Hirano, K., G. W. Imbens, and G. Ridder (2003): “Efficient estimation of average treatmenteffects using the estimated propensity score,” Econometrica, 71, 1161–1189.
Horvitz, D. G. and D. J. Thompson (1952): “A generalization of sampling without replacementfrom a finite universe,” Journal of the American statistical Association, 47, 663–685.
Imai, K. and M. Ratkovic (2014): “Covariate balancing propensity score,” Journal of the RoyalStatistical Society: Series B (Statistical Methodology), 76, 243–263.
Imbens, G. W. and D. B. Rubin (2015): Causal inference in statistics, social, and biomedicalsciences, Cambridge University Press.
Josey, K. P., E. Juarez-Colunga, F. Yang, and D. Ghosh (2020): “A framework for covariatebalance using Bregman distances,” Scandinavian Journal of Statistics.
Kang, J. D., J. L. Schafer, et al. (2007): “Demystifying double robustness: A comparison ofalternative strategies for estimating a population mean from incomplete data,” Statistical science,22, 523–539.
LaLonde, R. J. (1986): “Evaluating the econometric evaluations of training programs with exper-imental data,” The American economic review, 604–620.
Ma, X. and J. Wang (2010): “Robust inference using inverse probability weighting,” Journal ofthe American Statistical Association, 1–10.
Newey, W. K. and R. J. Smith (2004): “Higher order properties of GMM and generalizedempirical likelihood estimators,” Econometrica, 72, 219–255.
Neyman, J. S. (1923): "On the application of probability theory to agricultural experiments. Essay on principles. Section 9. (Translated and edited by D. M. Dabrowska and T. P. Speed, Statistical Science (1990), 5, 465–480)," Annals of Agricultural Sciences, 10, 1–51.

Rasmussen, C. E. (2003): "Gaussian processes in machine learning," in Summer School on Machine Learning, Springer, 63–71.
Rosenbaum, P. R. and D. B. Rubin (1983): “The central role of the propensity score in obser-vational studies for causal effects,” Biometrika, 70, 41–55.
Rubin, D. B. (1974): “Estimating causal effects of treatments in randomized and nonrandomizedstudies.” Journal of educational Psychology, 66, 688.——— (1978): “Bayesian inference for causal effects: The role of randomization,” The Annals ofstatistics, 34–58.——— (2005): “Causal inference using potential outcomes: Design, modeling, decisions,” Journalof the American Statistical Association, 100, 322–331.——— (2007): “The design versus the analysis of observational studies for causal effects: parallelswith the design of randomized trials,” Statistics in medicine, 26, 20–36.
Sriperumbudur, B. K., K. Fukumizu, A. Gretton, B. Schölkopf, G. R. Lanckriet,et al. (2012): “On the empirical estimation of integral probability metrics,” Electronic Journalof Statistics, 6, 1550–1599.
Wong, R. K. and K. C. G. Chan (2018): “Kernel-based covariate functional balancing forobservational studies,” Biometrika, 105, 199–213.
Xie, Y. (2018): “Causal Inference with Covariate Balance Optimization,” Ph.D. thesis, Universityof Waterloo.
Xie, Y., Y. Zhu, C. A. Cotton, and P. Wu (2019): “A model averaging approach for estimatingpropensity scores by optimizing balance,” Statistical methods in medical research, 28, 84–101.
Zhang, Z., H. J. Kim, G. Lonjon, Y. Zhu, et al. (2019): "Balance diagnostics after propensity score matching," Annals of Translational Medicine, 7.

Zhao, Q. and D. Percival (2016): "Entropy balancing is doubly robust," Journal of Causal Inference, 5.
Zhu, Y., J. S. Savage, and D. Ghosh (2018): “A kernel-based metric for balance assessment,”Journal of causal inference, 6.