Kernel-Distance-Based Covariate Balancing
Xialing Wen, Ying Yan∗, Wenliang Pan, Xianyang Zhang

School of Mathematics, Sun Yat-sen University, No.135, Xingang West Road, Guangzhou, P.R. China

Abstract
A common concern in observational studies is how to properly evaluate the causal effect, which usually refers to the average treatment effect or the average treatment effect on the treated. In this paper, we propose a data preprocessing method, kernel-distance-based covariate balancing, for observational studies with binary treatments. The proposed method yields a set of unit weights for the treatment and control groups, respectively, such that the reweighted covariate distributions satisfy a set of pre-specified balance conditions. This preprocessing methodology can effectively reduce the confounding bias of subsequent causal effect estimation. We demonstrate the implementation and performance of kernel-distance-based covariate balancing with Monte Carlo simulation experiments and a real data analysis.
Key words: Observational Studies; Causal Effect; Average Treatment Effect; Covariate Balance; Kernel Distance
∗ The corresponding author. Email: [email protected]

In causal inference, there are two crucial concepts. (i) The propensity score (Rosenbaum and Rubin, 1983): the probability of assignment to the treatment group given the observed covariates. (ii) The potential outcomes (Neyman, 1923; Rubin, 1974): the possible outcomes under different treatment decisions or interventions. The potential outcomes framework, also known as the Neyman-Rubin potential outcomes framework, was first proposed for completely randomized experiments by Neyman (1923) and was extended to both experimental and observational studies by Rubin (2005). An overview of the implementation of the potential outcomes framework can be found in Imbens and Rubin (2015). Both concepts are central to evaluating the causal effect, a core topic in causal inference. Depending on the purpose of the analysis, researchers may prefer estimating the average treatment effect (ATE), the average treatment effect on the treated (ATT), the average treatment effect on the control (ATC), or all of them.

In observational studies, however, it is not valid to evaluate the causal effect by merely computing the mean difference in response between the experimental groups, because of confounding variables and the non-random assignment mechanism. One method proposed by Rubin (2007) to solve this problem is to balance the covariate distributions between the treated and the control groups. Two popular preprocessing approaches, matching and propensity score methods, have been widely used in observational studies, and weighting methods that adjust for observed covariates enjoy substantial popularity. In this paper, we focus on weighting methods. The weights derived from adjusting the empirical distributions of the observed covariates are applied to the outcomes so that the reweighted data appear randomized, which can yield stable estimates of the parameters of interest when there are no unmeasured confounders.
Generally speaking, inverse probability weighting is less robust and works no better than moment balancing methods. We therefore propose a weighting framework, namely kernel-distance-based (KDB) covariate balancing, and use the first-moment balance of all covariates as a constraint in the optimization problem. The KDB method is a preprocessing technique for achieving covariate balance in observational studies with a binary treatment. We give estimators of the ATE and ATT based on the KDB framework, with ATE estimation as our primary focus; within the same framework, the optimization problem can be flexibly changed to estimate the ATT or ATC.

We further explore the asymptotic properties of these estimators and compare the proposed method with other preprocessing methods, such as inverse probability weighting (IPW) (Horvitz and Thompson, 1952; Hirano et al., 2003), the covariate balancing propensity score (CBPS) (Imai and Ratkovic, 2014), entropy balancing (EB) (Hainmueller, 2012), and non-parametric calibration weighting (CAL) (Chan et al., 2016). The proposed framework has several attractive features. Most importantly, the KDB method can automatically balance the second and possibly higher moments of the covariate distributions, as well as interactions. Besides, the KDB estimator of the ATE shows greater stability in nonlinear simulation scenarios compared to existing weighting estimators.

The remainder of this article is organized as follows. Section 2 introduces the definitions, notation, and assumptions used throughout the article. Section 3 reviews the main results of four weighting methods, namely IPW, CBPS, EB, and CAL. Section 4 introduces the weighted kernel distance in detail and describes the theoretical framework of the proposed KDB method. Section 5 illustrates the proposed method and compares its performance with other weighting methods in two Monte Carlo simulations and a real data analysis.
Section 6 concludes this article and discusses further explorations.

We use the symbol '$\triangleq$' to distinguish definitions from equalities. Suppose the observed dataset consists of a random sample or finite population of $N$ units, each assigned to one of two groups, treatment or control, for which covariate-balanced comparisons are of interest. For each unit $i$ $(i = 1, \dots, N)$, we observe $(Y_i, T_i, X_i)$, where $Y$ is an outcome variable, $T$ is a treatment indicator, and $X = (X_1, \dots, X_D)$, with expectation $\mu$ and covariance matrix $\Sigma$, is a set of $D$ covariates. Each $T_i$ takes the value 1 or 0, indicating that unit $i$ is assigned to the treatment or the control group, respectively. We denote the sample sizes of the treatment and control groups by $n_1$ and $n_0$, and the observed data in the two groups by $\{(Y_i, T_i, X_i) : i = 1, \dots, n_1\}$ and $\{(Y_j, T_j, X_j) : j = 1, \dots, n_0\}$, respectively, where $X_i = (X_{i,1}, \dots, X_{i,D})$ and $X_j = (X_{j,1}, \dots, X_{j,D})$.

The probability of assignment to the treatment group given the covariates is the so-called propensity score (Rosenbaum and Rubin, 1983), $e(X) = \Pr(T = 1 \mid X)$, which plays a central role in causal inference. Under the potential outcomes framework (Neyman, 1923; Rubin, 1974), we define a pair of potential outcomes $\{Y_i(0), Y_i(1)\}$ for every unit $i$ under treatment 0 or 1, respectively. The observed outcome can be expressed through the pair of potential outcomes under the consistency assumption:
$$Y_i = (1 - T_i) Y_i(0) + T_i Y_i(1).$$
In this paper, we focus on two causal estimands: the average treatment effect (ATE),
$$\tau \triangleq E[Y(1) - Y(0)] = \mu_1 - \mu_0, \quad \text{where } \mu_t \triangleq E[Y(t)],$$
and the average treatment effect on the treated (ATT),
$$\tau_1 \triangleq E[Y(1) - Y(0) \mid T = 1] = \mu_{1|1} - \mu_{0|1}, \quad \text{where } \mu_{t|1} \triangleq E[Y(t) \mid T = 1] \text{ for } t = 0, 1.$$
The expectations of the potential outcomes given the covariates are $\mu_0(X) \triangleq E[Y(0) \mid X]$ and $\mu_1(X) \triangleq E[Y(1) \mid X]$. For causal comparisons, we make the following assumptions (Rosenbaum and Rubin, 1983) throughout this article so that unbiased estimators of the ATE and ATT are available in observational studies.
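Before stating the assumptions, the setup above can be illustrated with a toy simulated dataset. This is a hypothetical sketch of ours, not from the paper: the propensity and outcome models below are arbitrary choices, with a constant causal effect of 2, and the observed outcome is built from the potential outcomes via the consistency assumption.

```python
import numpy as np

# Hypothetical data-generating process (our own illustration):
# treatment depends on X only, so strong ignorability holds by construction.
rng = np.random.default_rng(0)
N = 1000
X = rng.normal(size=(N, 2))                         # observed covariates
e_X = 1.0 / (1.0 + np.exp(-X[:, 0]))                # true propensity score e(X)
T = rng.binomial(1, e_X)                            # treatment indicator
Y0 = X @ np.array([1.0, 0.5]) + rng.normal(size=N)  # potential outcome Y(0)
Y1 = Y0 + 2.0                                       # Y(1); constant effect tau = 2
# consistency assumption: only one potential outcome is observed per unit
Y = (1 - T) * Y0 + T * Y1
```

Here the naive mean difference `Y[T == 1].mean() - Y[T == 0].mean()` is biased for $\tau = 2$, because $X_{1}$ drives both the treatment assignment and the outcome.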
Assumption 1 (Strong Ignorability): $\{Y(0), Y(1)\} \perp T \mid X$, where $\perp$ denotes independence.

We assume the potential outcomes $\{Y(0), Y(1)\}$ are independent of the treatment indicator $T$ given the covariates $X$, which implies there are no unmeasured confounders that may cause selection bias. Formally, as shown by Rosenbaum and Rubin (1983), Assumption 1 also implies
$$\{Y(0), Y(1)\} \perp T \mid e(X), \qquad E[Y(t) \mid T = 0, X] = E[Y(t) \mid T = 1, X] = E[Y(t) \mid X] \ \text{ for } t = 0, 1.$$
Under Assumption 1, we can investigate the ATE through
$$\tau(X) \triangleq \mu_1(X) - \mu_0(X) = E[Y(1) \mid T = 1, X] - E[Y(0) \mid T = 0, X] = E[Y \mid T = 1, X] - E[Y \mid T = 0, X],$$
where the first equality follows from $\{Y(0), Y(1)\} \perp T \mid X$ and the second from consistency.

Assumption 2 (Overlap): $0 < P(T = 1 \mid X) < 1$ for all $X$.

To ensure that there are useful observations for estimating the causal effect, we assume the probability that a subject is assigned to the treatment group lies strictly between 0 and 1.
Assumption 2 also helps to ensure that the bias-correction information is available over the entire domain of $X$ (Zhao and Percival, 2016); equivalently, it requires that the covariate distributions provide sufficient overlap between the treatment and the control groups. Recall that under Assumption 1, the ATE can be expressed as
$$\tau = E[Y_i(1) - Y_i(0)] = E\{E[Y(1) - Y(0) \mid X]\} = E[\mu_1(x) - \mu_0(x)] = \int \mu_1(x)\, dF(x) - \int \mu_0(x)\, dF(x), \qquad (2.1)$$
where $F(x)$ denotes the true distribution of the covariates. Within a weighting framework, we denote the weighted sample average treatment effect (SATE) with weights $w = \{w_1, \dots, w_N\} = \{p_1, \dots, p_{n_1}, q_1, \dots, q_{n_0}\}$, $n_1 + n_0 = N$, as
$$\hat\tau^w_N = \sum_{i=1}^{N} T_i w_i Y_i - \sum_{i=1}^{N} (1 - T_i) w_i Y_i = \sum_{i=1}^{n_1} p_i Y_i - \sum_{j=1}^{n_0} q_j Y_j,$$
with weighted empirical distributions
$$\hat F^w_{N,1}(x) = \sum_{i=1}^{n_1} p_i I(X_i \le x), \qquad \hat F^w_{N,0}(x) = \sum_{j=1}^{n_0} q_j I(X_j \le x),$$
where $I(\cdot)$ is an indicator function. Therefore, we have
$$\hat\tau^w_N - \tau = \left\{ \sum_{i=1}^{N} T_i w_i \big(\mu_1(X_i) - \mu_0(X_i)\big) - \tau \right\} + \left\{ \sum_{i=1}^{N} T_i w_i \mu_0(X_i) - \sum_{i=1}^{N} (1 - T_i) w_i \mu_0(X_i) \right\} + \left\{ \sum_{i=1}^{N} T_i w_i \big(Y_i - \mu_1(X_i)\big) - \sum_{i=1}^{N} (1 - T_i) w_i \big(Y_i - \mu_0(X_i)\big) \right\}. \qquad (2.2)$$
According to Equation (2.2), the bias of the weighted SATE $\hat\tau^w_N$ depends entirely on the three items
$$\sum_{i=1}^{N} T_i w_i \big(\mu_1(X_i) - \mu_0(X_i)\big) - \tau, \qquad \sum_{i=1}^{N} T_i w_i \mu_0(X_i) - \sum_{i=1}^{N} (1 - T_i) w_i \mu_0(X_i), \qquad \sum_{i=1}^{N} T_i w_i \big(Y_i - \mu_1(X_i)\big) - \sum_{i=1}^{N} (1 - T_i) w_i \big(Y_i - \mu_0(X_i)\big).$$
Here, the second item directly shows the importance of balancing the covariate distributions between the treated and control groups, which is the goal of our proposed weighting framework. To further ensure $\lim_{N \to \infty} (\hat\tau^w_N - \tau) = 0$, we need the following Assumption 3.

Assumption 3 (Constant Conditional ATE): $\mu_1(X) - \mu_0(X) = \tau$ for all $X$.

Under Assumption 3, we assume a constant causal effect, so the ATE is numerically the same as the ATT.
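The weighted SATE defined above is straightforward to compute once the weights are given; a minimal sketch (the function name is ours):

```python
import numpy as np

def weighted_sate(Y, T, p, q):
    """Weighted sample ATE:  sum_i p_i Y_i (treated) - sum_j q_j Y_j (control).

    p holds one weight per treated unit (summing to one), in the order the
    treated units appear; q likewise for the control units."""
    Y = np.asarray(Y, dtype=float)
    T = np.asarray(T)
    return float(np.sum(p * Y[T == 1]) - np.sum(q * Y[T == 0]))
```

With uniform weights $p_i = 1/n_1$ and $q_j = 1/n_0$ this reduces to the unadjusted difference in group means.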
We next give a brief overview of some existing approaches.
According to Ma and Wang (2010), to estimate the ATE the IPW method assigns the weight
$$\hat p^{IPW}_i = \frac{T_i / \{N p(X_i)\}}{\sum_{i=1}^{N} T_i / \{N p(X_i)\}}$$
to $Y_i$ (an outcome in the treatment group), $i = 1, \dots, n_1$, and the weight
$$\hat q^{IPW}_j = \frac{(1 - T_j) / \{N (1 - p(X_j))\}}{\sum_{j=1}^{N} (1 - T_j) / \{N (1 - p(X_j))\}}$$
to $Y_j$ (an outcome in the control group), $j = 1, \dots, n_0$, thereby deriving a natural estimator of the ATE, that is,
$$\hat\tau^{IPW} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{T_i Y_i}{p(X_i)} - \frac{(1 - T_i) Y_i}{1 - p(X_i)} \right] = \frac{1}{N} \sum_{i=1}^{N} \frac{(2 T_i - 1) Y_i}{1 - T_i + (2 T_i - 1) p(X_i)} = \sum_{i=1}^{n_1} \frac{T_i Y_i}{N p(X_i)} - \sum_{j=1}^{n_0} \frac{(1 - T_j) Y_j}{N (1 - p(X_j))}.$$
The ATT estimator using inverse probability weighting is
$$\hat\tau^{IPW}_1 = \frac{1}{n_1} \sum_{i=1}^{N} \left[ T_i Y_i - \frac{(1 - T_i)\, \hat e(X_i)}{1 - \hat e(X_i)} Y_i \right] = \frac{1}{n_1} \sum_{i=1}^{N} T_i Y_i - \frac{1}{n_1} \sum_{i=1}^{N} \frac{(1 - T_i)\, \hat e(X_i)}{1 - \hat e(X_i)} Y_i = \sum_{i=1}^{n_1} \frac{Y_i}{n_1} - \sum_{j=1}^{n_0} \frac{\hat e(X_j)}{1 - \hat e(X_j)} \frac{Y_j}{n_1}.$$
Therefore, the weight assigned to $Y_i$, $i = 1, \dots, n_1$, for estimating the ATT is $\hat p^{IPW}_{i,trt} = 1/n_1$, and the weight assigned to $Y_j$, $j = 1, \dots, n_0$, is
$$\hat q^{IPW}_{j,trt} = \frac{\hat e(X_j)}{n_1 \{1 - \hat e(X_j)\}}.$$

The entropy balancing (EB) scheme of Hainmueller (2012) considers
$$\min_{w_i} \ H(w) = \sum_{i \mid T = 0} h(w_i) \quad \text{s.t.} \quad \sum_{i \mid T = 0} w_i c_{ri}(X_i) = m_r \ \text{ with } r \in 1, \dots, R; \quad \sum_{i \mid T = 0} w_i = 1, \ w_i \ge 0 \ \text{for all } i \text{ such that } T = 0.$$
The weights $\hat p^{EB}_i = 1/n_1$ $(i = 1, \dots, n_1)$ and $\hat q^{EB}_j = w_j$ $(j = 1, \dots, n_0)$ of the EB method can be directly calculated with the R package "ebal".

Imai and Ratkovic (2014) proposed a relatively more stable method, namely the covariate balancing propensity score (CBPS), for estimating the propensity score so that covariate balance is optimized. CBPS can significantly improve the poor empirical performance of propensity score matching and weighting methods. The key idea is to use propensity score weighting to characterize covariate balance:
$$E\left( \frac{T_i \tilde X_i}{\pi_\beta(X_i)} - \frac{(1 - T_i) \tilde X_i}{1 - \pi_\beta(X_i)} \right) = 0, \qquad (3.3)$$
where $\tilde X_i = f(X_i)$ is a $D$-dimensional vector-valued measurable function of $X_i$ specified by the researcher. Equation (3.3) holds for any function of the covariates. Selecting a specific $f(\cdot)$, for example setting $\tilde X_i = (X_i^T, (X_i^2)^T)^T$, Equation (3.3) helps to balance both the first and second moments of all covariates. The CBPS method can be implemented through the open-source R package "CBPS".

A general class of calibration (CAL) estimators of the ATE is established with a wide class of calibration weights, such as exponential tilting. The weights are constructed to achieve an exact moment balance of observed covariates among the treated, the control, and the combined group, and global semiparametric efficiency has been established for the CAL estimators. For a fixed $g \in \mathbb{R}$, let $D(f, g)$ denote a continuously differentiable distance measure for $f \in \mathbb{R}$; being nonnegative and strictly convex in $f$, $D(f, g) = 0$ if and only if $f = g$. The general idea of calibration, as in Deville and Särndal (1992), is to minimize the aggregate distance between the final weights $w = (w_1, \dots, w_N)$ and a given vector of design weights $d = (d_1, \dots, d_N)$ subject to moment constraints. To avoid estimating the design weights, CAL first uses a vector of misspecified uniform design weights $d^* = (1, \dots, 1)$ and constructs the final weights $w$ by solving the constrained optimization problem
$$\text{minimize } \sum_{i=1}^{N} D(w_i, 1) \quad \text{s.t.} \quad \frac{1}{N} \sum_{i=1}^{N} T_i w_i u(X_i) = \frac{1}{N} \sum_{i=1}^{N} u(X_i), \quad \frac{1}{N} \sum_{i=1}^{N} (1 - T_i) w_i u(X_i) = \frac{1}{N} \sum_{i=1}^{N} u(X_i). \qquad (3.1)$$
This CAL method can be implemented through the open-source R package "ATE" for estimating both the ATE and the ATT.

4 The Weighted Kernel Covariate Balancing
Let $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$ be random samples from two unknown probability measures $P$ and $Q$; researchers are often interested in detecting the difference and estimating the discrepancy between them. In this paper, we use a probability metric with $\zeta$-structure (Zolotarev, 1983),
$$\gamma_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \left| \int f \, dP - \int f \, dQ \right|,$$
where $\mathcal{F}$ indicates a class of functions, to measure the distance between the treatment and the control covariates. As for the implementation of $\gamma(P, Q)$, we can define $\|\cdot\|_{\mathcal{K}}$ to be the norm of a reproducing kernel Hilbert space (RKHS) $\mathcal{K}$ and set $\mathcal{F} = \{f : \|f\|_{\mathcal{K}} \le 1\}$, by which we get a "kernelized" version of the total variation distance. Restricting $k$ to be a strictly positive definite kernel function corresponding to the RKHS $\mathcal{K}$, we obtain a closed-form solution for $\gamma(P, Q)$ (Sriperumbudur et al., 2012), that is,
$$\gamma_k(P_n, Q_n) = \left\| \sum_{i=1}^{N} w_i k(\cdot, X_i) \right\|_{\mathcal{K}} = \sqrt{\sum_{i,j=1}^{N} w_i w_j k(X_i, X_j)},$$
where $P_n$ and $Q_n$ respectively denote the empirical measures of $X \mid T = 1$ and $X \mid T = 0$, and $w_i$ is the sample weight for unit $i \in \{1, \dots, N\}$. If we define $w_i = 1/n_1$ if $T_i = 1$ and $w_i = -1/n_0$ if $T_i = 0$, we get a simple closed-form analytical solution for the empirical estimate of the probability metric under the assumption that the function class is the unit ball of an RKHS. The computation of $\gamma_k(P_n, Q_n)$ is $O(N^2)$, but it is independent of the dimension of the confounders. We can use $\gamma_k(P_n, Q_n)$ as a diagnostic for covariate balance: smaller values denote better balance.
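The closed-form kernel distance above can be computed directly. A minimal sketch with the Gaussian kernel, assuming the kernel width $\sigma$ is supplied (function names are ours):

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """k(x, x') = exp(-||x - x'||^2 / sigma) evaluated on all pairs of rows."""
    d2 = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(d2, 0.0) / sigma)

def kernel_distance(X_trt, X_col, sigma):
    """gamma_k with signed uniform weights w_i = 1/n1 (treated), -1/n0 (control):
    sqrt( sum_{ij} w_i w_j k(X_i, X_j) )."""
    n1, n0 = len(X_trt), len(X_col)
    ktt = gaussian_gram(X_trt, X_trt, sigma).sum() / n1**2
    kcc = gaussian_gram(X_col, X_col, sigma).sum() / n0**2
    ktc = gaussian_gram(X_trt, X_col, sigma).sum() / (n1 * n0)
    # each treated/control cross pair enters twice with opposite signs
    return float(np.sqrt(max(ktt + kcc - 2.0 * ktc, 0.0)))
```

Identical treated and control samples give a distance of zero, and larger covariate shifts give larger values, matching its use as a balance diagnostic.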
Given two probability measures $\mathbb{P}$ and $\mathbb{Q}$ defined on a measurable space $S$, the integral probability metric (IPM) (Sriperumbudur et al., 2012), also called a probability metric with $\zeta$-structure, between $\mathbb{P}$ and $\mathbb{Q}$ is defined as
$$\gamma_{\mathcal{F}}(\mathbb{P}, \mathbb{Q}) = \sup \left\{ \left| \int_S f \, d\mathbb{P} - \int_S f \, d\mathbb{Q} \right| : f \in \mathcal{F} \right\}, \qquad (4.1)$$
where $\mathcal{F}$ is a class of real-valued bounded measurable functions on $S$. The choice of functions is the crucial distinction between different IPMs, and various popular distance measures, for example the Kantorovich metric and the total variation distance, can be obtained by choosing an appropriate $\mathcal{F}$ in (4.1). When $\mathcal{F} = \{f : \|f\|_{\mathcal{H}} \le 1\}$, the corresponding $\gamma_{\mathcal{F}}$ is called the kernel distance or maximum mean discrepancy, where $\mathcal{H}$ represents a reproducing kernel Hilbert space (RKHS) with $k$ as its reproducing kernel (r.k.). In this situation, we can rewrite the function space $\mathcal{F}$ as $(\mathcal{H}, k)$ and the IPM $\gamma_{\mathcal{F}}$ as $\gamma_{\mathcal{F}_k}$. The kernel distance has been widely used in statistical applications, including homogeneity testing, independence testing, conditional independence testing, and mixture density estimation. Let $\{X_{P,i} : i = 1, \dots, n\}$ and $\{X_{Q,j} : j = 1, \dots, m\}$ be i.i.d.
samples randomly drawn from $\mathbb{P}$ and $\mathbb{Q}$, respectively. The empirical estimator of $\gamma_{\mathcal{F}}(\mathbb{P}, \mathbb{Q})$ is given by
$$\gamma_{\mathcal{F}}(\mathbb{P}_n, \mathbb{Q}_m) = \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{N} w^E_i f(X_i) \right|, \qquad (4.2)$$
where $\mathbb{P}_n(x) := \sum_{i=1}^{n} \frac{1}{n} I(X_{P,i} \le x)$ and $\mathbb{Q}_m(x) := \sum_{j=1}^{m} \frac{1}{m} I(X_{Q,j} \le x)$ represent the empirical distributions of $\mathbb{P}$ and $\mathbb{Q}$, respectively, $I(\cdot)$ is an indicator function, $N = n + m$, $w^E_i = \frac{1}{n}$ when $X_i = X_{P,i}$ for $i = 1, \dots, n$, and $w^E_j = -\frac{1}{m}$ when $X_j = X_{Q,j}$ for $j = 1, \dots, m$. Correspondingly, we define the weighted estimator of $\gamma_{\mathcal{F}}(\mathbb{P}, \mathbb{Q})$ as
$$\gamma_{\mathcal{F}}(\mathbb{P}^w_n, \mathbb{Q}^w_m) = \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{N} w_i f(X_i) \right|, \qquad (4.3)$$
where $\mathbb{P}^w_n(x) := \sum_{i=1}^{n} p_i I(X_{P,i} \le x)$ and $\mathbb{Q}^w_m(x) := \sum_{j=1}^{m} q_j I(X_{Q,j} \le x)$ represent the weighted distributions of $\mathbb{P}$ and $\mathbb{Q}$, respectively, $w_i = p_i$ when $X_i = X_{P,i}$ for $i = 1, \dots, n$, and $w_j = -q_j$ when $X_j = X_{Q,j}$ for $j = 1, \dots, m$. Here $\{p_i : i = 1, \dots, n\}$ and $\{q_j : j = 1, \dots, m\}$ are balancing weights derived from some weighting method.

Theorem 4.1.
Let $k$ be a strictly positive definite kernel, i.e., for all $N \in \mathbb{N}$, $\{\alpha_i\}_{i=1}^{N} \subset \mathbb{R} \setminus \{0\}$, and all mutually distinct $\{\theta_i\}_{i=1}^{N} \subset S$, $\sum_{i,j=1}^{N} \alpha_i \alpha_j k(\theta_i, \theta_j) > 0$. Then, for $\mathcal{F} = \mathcal{F}_k := \{f : \|f\|_{\mathcal{H}} \le 1\}$, the following function $f$ is the unique solution to (4.3):
$$f = \frac{1}{\left\| \sum_{i=1}^{N} w_i k(\cdot, X_i) \right\|_{\mathcal{H}}} \sum_{i=1}^{N} w_i k(\cdot, X_i), \qquad (4.4)$$
and the corresponding weighted estimator of the kernel distance is
$$\gamma_{k,N}(\mathbb{P}_n, \mathbb{Q}_m) = \left\| \sum_{i=1}^{N} w_i k(\cdot, X_i) \right\|_{\mathcal{H}} = \sqrt{\sum_{i,j=1}^{N} w_i w_j k(X_i, X_j)}. \qquad (4.5)$$

Proof.
Consider $\gamma_{k,N}(\mathbb{P}_n, \mathbb{Q}_m) := \sup \left\{ \sum_{i=1}^{N} w_i f(X_i) : \|f\|_{\mathcal{H}} \le 1 \right\}$, which can be written as
$$\gamma_{k,N}(\mathbb{P}_n, \mathbb{Q}_m) = \sup \left\{ \left\langle f, \sum_{i=1}^{N} w_i k(\cdot, X_i) \right\rangle_{\mathcal{H}} : \|f\|_{\mathcal{H}} \le 1 \right\},$$
where we have used the reproducing property of $\mathcal{H}$, i.e., $\forall f \in \mathcal{H}$, $\forall x \in S$, $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$. By the Cauchy-Schwarz inequality, which states that for all vectors $u$ and $v$ of an inner product space
$$|\langle u, v \rangle| \le \|u\| \cdot \|v\|, \qquad (4.6)$$
where $\langle \cdot, \cdot \rangle$ is the inner product, we have
$$\gamma_{k,N}(\mathbb{P}_n, \mathbb{Q}_m) \le \sup \left\{ \|f\|_{\mathcal{H}} \cdot \left\| \sum_{i=1}^{N} w_i k(\cdot, X_i) \right\|_{\mathcal{H}} : \|f\|_{\mathcal{H}} \le 1 \right\} \le \left\| \sum_{i=1}^{N} w_i k(\cdot, X_i) \right\|_{\mathcal{H}}.$$
Therefore, we can set
$$f = \frac{1}{\left\| \sum_{i=1}^{N} w_i k(\cdot, X_i) \right\|_{\mathcal{H}}} \sum_{i=1}^{N} w_i k(\cdot, X_i).$$
Since $k$ is strictly positive definite, $\gamma_{k,N}(\mathbb{P}_n, \mathbb{Q}_m) = 0$ if and only if $\mathbb{P}_n = \mathbb{Q}_m$, which ensures that (4.4) is well-defined.

According to Rasmussen (2003), one choice of kernel $k$ for covariate balance is the Gaussian kernel,
$$K(x, x'; \sigma) = \exp\left( -\frac{\|x - x'\|^2}{\sigma} \right),$$
where the parameter $\sigma$ determines the width of the Gaussian kernel and can either be fixed or estimated from the data. In this article, we estimate $\sigma$ with the median of all positive pairwise squared Euclidean distances between subjects:
$$\sigma = \mathrm{median}\left\{ \|X_i - X_j\|^2 : \|X_i - X_j\|^2 > 0, \ i, j \in \mathrm{trt} \cup \mathrm{col} \right\}.$$
The information matrix $K_G$ corresponding to the Gaussian kernel $K(x, x'; \sigma)$ is
$$K_G = \begin{pmatrix} K_{11} & -K_{10} \\ -K_{10}^T & K_{00} \end{pmatrix} = \begin{pmatrix} \big(K(X_i, X_j)\big)_{n_1 \times n_1} & -\big(K(X_i, X_j)\big)_{n_1 \times n_0} \\ -\big(K(X_i, X_j)\big)_{n_1 \times n_0}^T & \big(K(X_i, X_j)\big)_{n_0 \times n_0} \end{pmatrix}. \qquad (4.7)$$

In this article, we propose a preprocessing procedure, the kernel-distance-based covariate balancing method, to create balanced samples for evaluating treatment effects. In this procedure, we create two groups of scalar weights, for the treatment and the control groups, so that the reweighted treatment and control groups match precisely at the specified moments.
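The two ingredients just described, the median heuristic for $\sigma$ and the information matrix $K_G$ of (4.7), can be sketched as follows (function names are ours):

```python
import numpy as np

def pairwise_sq_dists(X):
    """All pairwise squared Euclidean distances between rows of X."""
    sq = np.sum(X**2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)

def median_sigma(X):
    """Median of all positive pairwise squared distances (the heuristic above)."""
    d2 = pairwise_sq_dists(X)
    return float(np.median(d2[d2 > 1e-12]))

def information_matrix(X, T, sigma):
    """K_G of (4.7): the Gaussian Gram matrix with the treated/control cross
    blocks negated, obtained here as D K D with D = diag(+1 treated, -1 control)."""
    K = np.exp(-pairwise_sq_dists(X) / sigma)
    s = np.where(np.asarray(T) == 1, 1.0, -1.0)
    return K * np.outer(s, s)
```

Because $K_G = D K D$ with $K$ positive semi-definite and $D$ diagonal, $K_G$ is positive semi-definite as well, which is the content of Proposition 1 below.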
In this section, we introduce the estimation of the ATE, $\tau \triangleq E[Y(1) - Y(0)]$, via
$$\hat\tau = \sum_{i=1}^{n_1} p_i Y_i - \sum_{j=1}^{n_0} q_j Y_j.$$
The weights $\{p_i : i = 1, \dots, n_1\}$ and $\{q_j : j = 1, \dots, n_0\}$ derive from the following KDB reweighting scheme:
$$\min_{\vec p, \vec q} \ r_W(\vec p, \vec q) = \sum_{i,j=1}^{N} p_i p_j K(X_i, X_j)\, I(i, j \in \mathrm{trt}) + \sum_{i,j=1}^{N} q_i q_j K(X_i, X_j)\, I(i, j \in \mathrm{col}) - 2 \sum_{i,j=1}^{N} p_i q_j K(X_i, X_j)\, I(i \in \mathrm{trt}, j \in \mathrm{col}) \qquad (4.8)$$
$$\text{s.t.} \quad \sum_{i=1}^{n_1} p_i = 1, \ p_i \ge 0; \qquad \sum_{j=1}^{n_0} q_j = 1, \ q_j \ge 0; \qquad \sum_{i=1}^{n_1} p_i g(X_i) = \sum_{j=1}^{n_0} q_j g(X_j),$$
where $I(\cdot)$ is an indicator function and $g(\cdot)$ indicates a set of constraint functions imposed on the covariate moments. Here, for convenience, we set $g(\cdot) = \{g_1(\cdot), \dots, g_D(\cdot)\}$ to pick out each covariate, $g_d(X_i) = X_{i,d}$ for $d \in \{1, \dots, D\}$, so that the reweighted treatment and control groups match on the first moment. To express this concisely, we define $\vec w = (p_1, \dots, p_{n_1}, q_1, \dots, q_{n_0})^T$ and $\vec b_{(2+D+N) \times 1} = (1, 1, 0, \dots, 0)^T$, and let $A^T \in \mathbb{R}^{(2+D+N) \times N}$ stack, from top to bottom: a row of ones over the treated entries (zeros elsewhere), a row of ones over the control entries, the rows $(\vec X_{d,\mathrm{trt}},\, -\vec X_{d,\mathrm{col}})$ for $d = 1, \dots, D$, and the identity matrix $I_{N \times N}$, where $D$ refers to the total number of covariates and $\vec X_{d,\mathrm{trt}}$ and $\vec X_{d,\mathrm{col}}$ respectively collect the $d$-th covariate in the treatment and control groups. The optimization problem for estimating the ATE can then be expressed as
$$\min_{\vec w} \ r_W(\vec w) = \vec w^T K_G \vec w \qquad (4.9)$$
s.t.
$$A^T \vec w \ge \vec b.$$
Here, the first $2 + D$ constraints are treated as equality constraints, and all further ones as inequality constraints.

To construct more stable weights for estimating the ATE, we can make a small modification to the optimization problem in (4.9). Compared to the reweighting scheme above, we only add a tuning parameter $\lambda$ to the diagonal of $K_G$ in order to control the variance of $\hat\tau$, while keeping all other settings and parameters the same. Defining $\vec w_0 = \vec w \,\big|_{p_i = 1/n_1,\, q_j = 1/n_0} = (1/n_1, \dots, 1/n_1, 1/n_0, \dots, 1/n_0)^T$, the reweighting scheme for stable ATE estimation is:
$$\min_{\vec w} \ r_W(\vec w) = \vec w^T K_G \vec w + \lambda (\vec w - \vec w_0)^T (\vec w - \vec w_0) \qquad (4.10) \qquad \text{s.t. } A^T \vec w \ge \vec b.$$
It is easy to prove that, under the constraints above, the optimization problem (4.10) is equivalent to
$$\min_{\vec w} \ r_W(\vec w) = \vec w^T (K_G + \lambda I_{N \times N}) \vec w \qquad (4.11) \qquad \text{s.t. } A^T \vec w \ge \vec b,$$
since $\vec w_0^T \vec w = 1/n_1 + 1/n_0$ is constant whenever the weights in each group sum to one. When $\lambda = 0$, problem (4.11) degenerates into problem (4.9), which controls the bias but not the variance of the ATE estimation. As $\lambda$ increases, the optimization problem may increase the bias while reducing the variance of the ATE estimation.

In this section, we introduce the KDB reweighting scheme for estimating the ATT, $\tau_1 \triangleq E[Y(1) - Y(0) \mid T = 1] = \mu_{1|1} - \mu_{0|1}$. We estimate $\tau_1$ with the difference between the treatment outcome mean and the weighted control outcomes, that is,
$$\hat\tau_1 = \sum_{i=1}^{n_1} \frac{Y_i}{n_1} - \sum_{j=1}^{n_0} q_j Y_j,$$
where $\{q_j : j = 1, \dots, n_0\}$ derive from the following reweighting scheme:
$$\min_{\vec q} \ r_W(\vec q) = \sum_{i,j=1}^{N} \frac{1}{n_1^2} K(X_i, X_j)\, I(i, j \in \mathrm{trt}) + \sum_{i,j=1}^{N} q_i q_j K(X_i, X_j)\, I(i, j \in \mathrm{col}) - 2 \sum_{i,j=1}^{N} \frac{q_j}{n_1} K(X_i, X_j)\, I(i \in \mathrm{trt}, j \in \mathrm{col}) \qquad (4.12)$$
$$\text{s.t.} \quad \sum_{j=1}^{n_0} q_j = 1, \ q_j \ge 0; \qquad \frac{1}{n_1} \sum_{i=1}^{n_1} g(X_i) = \sum_{j=1}^{n_0} q_j g(X_j).$$
As with the ATE estimation, we define $\vec q_{n_0 \times 1} = (q_1, \dots, q_{n_0})^T$ and reset $\vec w$, $\vec b_{(1+D+n_0) \times 1}$, and $A^T$ respectively to $\vec w = (p_1, \dots, p_{n_1}, q_1, \dots, q_{n_0})^T$ with $p_i = 1/n_1$ for $i \in \{1, \dots, n_1\}$,
$$\vec b_{(1+D+n_0) \times 1} = (1, \bar X_{1,1}, \dots, \bar X_{D,1}, 0, \dots, 0)^T,$$
and $A^T \in \mathbb{R}^{(1+D+n_0) \times n_0}$ stacking a row of ones, the rows $\vec X_{d,\mathrm{col}}$ for $d = 1, \dots, D$, and the identity matrix $I_{n_0 \times n_0}$, where $\bar X_{d,1}$, $d \in \{1, \dots, D\}$, refers to the sample mean of the $d$-th covariate in the treatment group. Therefore, we can also simplify the ATT optimization problem above into the matrix form
$$\min_{\vec w} \ r_W(\vec w) = \vec w^T K_G \vec w \quad \text{s.t. } A^T \vec q \ge \vec b. \qquad (4.13)$$
Here, the first $1 + D$ constraints are treated as equality constraints, and all further ones as inequality constraints.

4.7 Propositions of the KDB Covariate Balance

Proposition 1.
The Gaussian Information Matrix K G is positive semi-definite. Proof.
Based on the kernel property $K(x, x') = \langle \varphi(x), \varphi(x') \rangle$, for any $v \in \mathbb{R}^N$ we have
$$v^T K_G v = \sum_{i,j=1}^{N} v_i v_j (K_G)_{ij} = \sum_{i,j \in \mathrm{trt}} v_i v_j \langle \varphi(X_i), \varphi(X_j) \rangle + \sum_{i,j \in \mathrm{col}} v_i v_j \langle \varphi(X_i), \varphi(X_j) \rangle - 2 \sum_{i \in \mathrm{trt}} \sum_{j \in \mathrm{col}} v_i v_j \langle \varphi(X_i), \varphi(X_j) \rangle = \left\| \sum_{i \in \mathrm{trt}} v_i \varphi(X_i) - \sum_{j \in \mathrm{col}} v_j \varphi(X_j) \right\|^2 \ge 0,$$
as required.

Optimization Problem:
We aim to solve the following minimization problem:
$$\min_{x \in \mathbb{R}^N} \ f(x) = \frac{1}{2} x^T K x, \quad \text{s.t. } A x = b, \ B x \le 0_N, \qquad (4.14)$$
where $A = \begin{pmatrix} \vec 1^T_{n_1} & \vec 0^T_{n_0} \\ \vec 0^T_{n_1} & \vec 1^T_{n_0} \end{pmatrix}$, $B = -I$, and $b = (1, 1)^T$.

Proposition 2. (4.14) is a standard convex optimization problem.

The Dual of the Above Convex Problem: The Lagrange function is given by
$$L(x, \lambda, \nu) = f(x) + \lambda^T B x + \nu^T (A x - b)$$
and the dual function is given by $g(\lambda, \nu) = \inf_{x \in \mathbb{R}^N} L(x, \lambda, \nu)$. Note that
$$L(x, \lambda, \nu) = \frac{1}{2} x^T K x + (\lambda^T B + \nu^T A) x - \nu^T b. \qquad (4.15)$$
The dual problem of (4.14) is $d^* = \sup_{\lambda \ge 0_N, \nu} g(\lambda, \nu)$. Let $p^*$ be the optimal value of (4.14); then $d^* \le p^*$. Since the inequality constraints are all affine, the weak Slater condition gives $d^* = p^*$. From the KKT conditions we have
$$K x + B^T \lambda + A^T \nu = 0 \qquad (4.16)$$
and $A x = b$. Assuming $K$ is positive definite and solving (4.16), we have $x = -K^{-1}(B^T \lambda + A^T \nu)$. Therefore,
$$g(\lambda, \nu) = -\frac{1}{2} (B^T \lambda + A^T \nu)^T K^{-1} (B^T \lambda + A^T \nu) - \nu^T b.$$
The dual problem is
$$\max_{\lambda, \nu} \ -\frac{1}{2} (B^T \lambda + A^T \nu)^T K^{-1} (B^T \lambda + A^T \nu) - \nu^T b \quad \text{s.t. } \lambda_i \ge 0, \ i = 1, \dots, N. \qquad (4.17)$$
Solving for $\nu$, we have
$$\nabla_\nu g(\lambda, \nu) = -A K^{-1} A^T \nu - A K^{-1} B^T \lambda - b = 0 \ \Rightarrow\ \nu = -(A K^{-1} A^T)^{-1} (A K^{-1} B^T \lambda + b). \qquad (4.18)$$
Substituting $\nu$ with (4.18), (4.17) becomes
$$\max_{\lambda_i \ge 0} \ \frac{1}{2} \lambda^T \left( B K^{-1} A^T (A K^{-1} A^T)^{-1} A K^{-1} B^T - B K^{-1} B^T \right) \lambda + b^T (A K^{-1} A^T)^{-1} A K^{-1} B^T \lambda. \qquad (4.19)$$
Suppose $K^{-1}$ is partitioned by treatment group as $\begin{pmatrix} M_{11} & M_{10} \\ M_{10}^T & M_{00} \end{pmatrix}$, so that $Q \triangleq A K^{-1} A^T$ can be computed blockwise. For simplicity, substituting $B = -I$, (4.19) is converted to
$$\max_{\lambda_i \ge 0} \ \frac{1}{2} \lambda^T \left( K^{-1} A^T Q^{-1} A K^{-1} - K^{-1} \right) \lambda + b^T Q^{-1} A K^{-1} \lambda. \qquad (4.20)$$

In this section, we perform simulation studies to evaluate the performance of different weighting methods, including inverse probability weighting (IPW), the covariate balancing propensity score (CBPS) (Imai and Ratkovic, 2014), entropy balancing (EB) (Hainmueller, 2012), the calibration estimator (CAL) (Chan et al., 2016), and our proposed KDB covariate balancing. We use the oracle and unadjusted (UnAD) estimators as benchmarks. In all the following simulation experiments, we assume a constant causal effect, so the ATE is numerically the same as the ATT. For each simulation experiment, we construct $N_{sim}$ Monte Carlo datasets. In addition to the simulation studies, we also conduct a real data analysis with the 'National Supported Work' (NSW) dataset.
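As a minimal numerical sketch of the KDB program (4.9)/(4.11), the quadratic objective and its constraints can be handed to a general-purpose solver; below we use SciPy's SLSQP purely for illustration (all names are ours, and a dedicated QP solver would typically be faster and more accurate):

```python
import numpy as np
from scipy.optimize import minimize

def kdb_ate_weights(X, T, sigma, lam=0.0):
    """Sketch of (4.11): minimise w^T (K_G + lam*I) w subject to the weights of
    each group summing to one, nonnegativity, and first-moment balance."""
    X = np.asarray(X, dtype=float)
    T = np.asarray(T)
    N = len(T)
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    s = np.where(T == 1, 1.0, -1.0)              # negate cross blocks (Eq. 4.7)
    KG = np.exp(-d2 / sigma) * np.outer(s, s) + lam * np.eye(N)

    t, c = T == 1, T == 0
    cons = [
        {"type": "eq", "fun": lambda w: np.sum(w[t]) - 1.0},
        {"type": "eq", "fun": lambda w: np.sum(w[c]) - 1.0},
        # first-moment balance for every covariate (constraints of Eq. 4.8)
        {"type": "eq", "fun": lambda w: X[t].T @ w[t] - X[c].T @ w[c]},
    ]
    w0 = np.where(t, 1.0 / t.sum(), 1.0 / c.sum())  # uniform starting weights
    res = minimize(lambda w: w @ KG @ w, w0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * N, constraints=cons,
                   options={"maxiter": 500})
    return res.x
```

The returned vector stacks the treated weights $p$ and control weights $q$ in unit order, so the ATE estimate is $\sum_i w_i T_i Y_i - \sum_i w_i (1 - T_i) Y_i$; setting `lam > 0` gives the stabilized variant (4.11).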
To evaluate the performance of estimating ATE or ATT, we use the metric
Bias , % Bias , empiricalstandard deviation (SD) , and RMSE . Bias is the difference between the average estimate and the true value of ATE/ATT. % Bias is the bias as a percentage of the estimate’s standard deviation. A useful rule of thumb (Kanget al., 2007) is that the performance of interval estimates and test statistics begins to deterioratewhen the bias of the point estimate exceeds about 40% of its standard deviation. The empirical SD for an estimator is the square root of average squared differences between all estimates from themean. Here, we use the empirical SD corresponding to the average value of ATE/ATT estimates,20nd thus it is the empirical standard deviation of all estimated values divided by the square rootof sample size N sim . RMSE is the root-mean-square error for measuring the differences betweenvalues predicted by a weighting method and the observed values.To further compare the power of balancing covariates, we use the balancing statistics r W ,kernel distance (KD) , the maximum, average, and median value of absolute standardized meandifference (maxASMD, meanASMD, medASMD) , the average Kolmogorov-Smirnov statistic (meanKS) and the average T statistic (meanT) . We introduce each balancing statistic with moredetails as follows.For every set of sample weights { (cid:98) p i } ( i = 1 , , · · · , n ) and { (cid:98) q j } ( j = 1 , , · · · , n ) obtained bydifferent balancing methods for each Monte Carlo dataset, we calculate r W as (cid:98) r W ( (cid:126)p, (cid:126)q ) = N (cid:88) i =1 N (cid:88) j =1 (cid:98) p i (cid:98) p j K ( X i , X j ) · I ( i, j ∈ trt )+ N (cid:88) i =1 N (cid:88) j =1 (cid:98) q i (cid:98) q j K ( X i , X j ) · I ( i, j ∈ col ) − N (cid:88) i =1 N (cid:88) j =1 (cid:98) p i (cid:98) q j K ( X i , X j ) · I ( i ∈ trt , j ∈ col ) . 
(5.1)

For a simulation experiment with a total of $N_{sim}$ iterations, we calculate the average value of the $\hat{r}_W(\vec{p}, \vec{q})$ statistics for each balancing method and compare them. KD, the square root of $r_W$, has been widely used in statistical applications (Sriperumbudur et al., 2012) and can be regarded as a reasonable measure of covariate balance (Xie, 2018).

In estimating the ATE, the ASMD of covariate $X_d$ $(d = 1, \cdots, D)$ with balancing weights $\hat{p}_i$ $(i = 1, \cdots, n_1)$ and $\hat{q}_j$ $(j = 1, \cdots, n_0)$ is defined as (Zhang et al., 2019):
$$ASMD_{d,ATE} = \frac{|\bar{X}^p_{d,1} - \bar{X}^q_{d,0}|}{\sqrt{(S^2_{d,1} + S^2_{d,0})/2}},$$
where $\bar{X}^p_{d,1}$ and $\bar{X}^q_{d,0}$ respectively refer to the weighted sample means of covariate $X_d$ in the treatment and control groups,
$$\bar{X}^p_{d,1} = \sum_{i=1}^{n_1} \hat{p}_i X_{i,d,1}, \qquad \bar{X}^q_{d,0} = \sum_{j=1}^{n_0} \hat{q}_j X_{j,d,0},$$
and $S^2_{d,1}$ and $S^2_{d,0}$ denote the sample variances of covariate $X_d$ in the treatment and control groups without any adjustment,
$$S^2_{d,1} = \frac{\sum_{i=1}^{n_1} (X_{i,d,1} - \bar{X}_{d,1})^2}{n_1 - 1}, \qquad S^2_{d,0} = \frac{\sum_{j=1}^{n_0} (X_{j,d,0} - \bar{X}_{d,0})^2}{n_0 - 1}.$$
In estimating the ATT, the
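Given the weights produced by any balancing method, $\hat{r}_W$ and KD are direct to compute. A small numpy sketch (the Gaussian kernel and the function names are our own choices):

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # K[i, j] = exp(-||x_i - y_j||^2 / (2 * sigma^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def r_w(p, q, Xt, Xc, sigma=1.0):
    """r_W = p'K(Xt,Xt)p + q'K(Xc,Xc)q - 2 p'K(Xt,Xc)q; KD = sqrt(r_W)."""
    return (p @ gaussian_kernel(Xt, Xt, sigma) @ p
            + q @ gaussian_kernel(Xc, Xc, sigma) @ q
            - 2.0 * p @ gaussian_kernel(Xt, Xc, sigma) @ q)
```

For a characteristic kernel such as the Gaussian, $r_W$ is non-negative and equals zero exactly when the two weighted samples coincide, so $KD = \sqrt{r_W}$ behaves like a distance between the reweighted covariate distributions.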
ASMD of covariate $X_d$ $(d = 1, \cdots, D)$ with balancing weights $\hat{p}_i = 1/n_1$ $(i = 1, \cdots, n_1)$ and $\hat{q}_j$ $(j = 1, \cdots, n_0)$ is defined as (Xie et al., 2019):
$$ASMD_{d,ATT} = \frac{|\bar{X}^p_{d,1} - \bar{X}^q_{d,0}|}{sd(X_{d,1})},$$
where $sd(X_{d,1})$ denotes the sample standard deviation of covariate $X_d$ in the treatment group without any adjustment,
$$sd(X_{d,1}) = \sqrt{\frac{\sum_{i=1}^{n_1} (X_{i,d,1} - \bar{X}_{d,1})^2}{n_1 - 1}}.$$
For each Monte Carlo dataset, we calculate the
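The two ASMD variants differ only in their denominators. A sketch for a single covariate, with one-dimensional arrays `xt` and `xc` for the treated and control observations (function names are ours):

```python
import numpy as np

def asmd_ate(p, q, xt, xc):
    # |difference of weighted means| over the pooled unadjusted SD
    pooled_sd = np.sqrt((np.var(xt, ddof=1) + np.var(xc, ddof=1)) / 2.0)
    return abs(p @ xt - q @ xc) / pooled_sd

def asmd_att(q, xt, xc):
    # treatment weights are fixed at 1/n1; scale by the treated-group SD
    p = np.full(len(xt), 1.0 / len(xt))
    return abs(p @ xt - q @ xc) / np.std(xt, ddof=1)
```

maxASMD, meanASMD, and medASMD are then simply the maximum, mean, and median of these values over the $D$ covariates.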
ASMD values over all $D$ covariates and record the maximum, mean, and median values.

A Kolmogorov-Smirnov test can be applied to test whether two underlying one-dimensional probability distributions differ significantly, and the KS statistic quantifies a distance between the empirical distribution functions of two samples. We calculate the KS statistic between the weighted samples in the treatment and control groups for each of the $D$ covariates and take the average as one balancing statistic:
$$meanKS = \frac{\sum_{d=1}^{D} KS_d}{D}, \qquad KS_d = \sup_x |F_{1,n}(X^p_{d,1}) - F_{0,n}(X^q_{d,0})|,$$
where $F_{1,n}(X^p_{d,1})$ and $F_{0,n}(X^q_{d,0})$ are the empirical distribution functions of the weighted samples $\{\hat{p}_i \cdot X_{d,i}\}$ $(i = 1, \cdots, n_1)$ and $\{\hat{q}_j \cdot X_{d,j}\}$ $(j = 1, \cdots, n_0)$, respectively, and $\sup$ is the supremum function.

A t-test is commonly applied to determine whether the means of two sets of data differ significantly from each other. We conduct a Welch's t-test between the weighted means in the treatment and control groups for each of the $D$ covariates and take the average as one balancing statistic:
$$meanT = \frac{\sum_{d=1}^{D} T_d}{D}, \qquad T_d = \frac{\bar{X}^p_{d,1} - \bar{X}^q_{d,0}}{S_{\bar{\Delta},d}}, \qquad S_{\bar{\Delta},d} = \sqrt{\frac{S^2_{X^p_{d,1}}}{n_1} + \frac{S^2_{X^q_{d,0}}}{n_0}},$$
where $S^2_{X^p_{d,1}}$ and $S^2_{X^q_{d,0}}$ are the unbiased estimators of the variances of the two weighted samples $\{\hat{p}_i \cdot X_{d,i}\}$ $(i = 1, \cdots, n_1)$ and $\{\hat{q}_j \cdot X_{d,j}\}$ $(j = 1, \cdots, n_0)$, respectively.

In the different simulation scenarios, sample oracle estimators, sample unadjusted estimators, and their biases serve as benchmarks for comparing the different weighting methods.
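Both summary statistics can be computed with scipy, following the text's convention of scaling each observation by its weight (scaling observations versus using weighted ECDFs is a design choice; we mirror the definition above, and the function names are ours):

```python
import numpy as np
from scipy.stats import ks_2samp, ttest_ind

def mean_ks(p, q, Xt, Xc):
    # average two-sample KS statistic over the D covariates,
    # applied to the weight-scaled samples {p_i * x_i} and {q_j * x_j}
    D = Xt.shape[1]
    return np.mean([ks_2samp(p * Xt[:, d], q * Xc[:, d]).statistic
                    for d in range(D)])

def mean_t(p, q, Xt, Xc):
    # average Welch t statistic (unequal variances) over the D covariates
    D = Xt.shape[1]
    return np.mean([ttest_ind(p * Xt[:, d], q * Xc[:, d],
                              equal_var=False).statistic
                    for d in range(D)])
```

Perfect balance corresponds to meanKS near zero and meanT near zero; large absolute values flag covariates whose weighted distributions still differ.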
The sample oracle estimators of the ATE and ATT are respectively defined as
$$\hat{\tau}_{Oracle} = \frac{1}{N}\sum_{i=1}^{N} Y_i(1) - \frac{1}{N}\sum_{i=1}^{N} Y_i(0), \qquad \hat{\tau}_{trt,Oracle} = \frac{1}{n_1}\sum_{i=1}^{n_1} [Y_i(1) \mid T_i = 1] - \frac{1}{n_1}\sum_{i=1}^{n_1} [Y_i(0) \mid T_i = 1].$$
Without any adjustment, we assign weight $\hat{p}^{UnAD}_i = 1/n_1$ $(i = 1, \cdots, n_1)$ to each unit in the treatment group and $\hat{q}^{UnAD}_j = 1/n_0$ $(j = 1, \cdots, n_0)$ to each unit in the control group. From this, we obtain the sample unadjusted estimator of the ATE,
$$\hat{\tau}_{UnAD} = \sum_{i=1}^{n_1} \hat{p}^{UnAD}_i Y_i - \sum_{j=1}^{n_0} \hat{q}^{UnAD}_j Y_j.$$
For convenience, we denote

• KDBC: the proposed KDB covariate balancing with only the constraints
$$\sum_{i=1}^{n_1} p_i = 1, \; p_i \ge 0; \qquad \sum_{j=1}^{n_0} q_j = 1, \; q_j \ge 0.$$
• KDM1: the proposed KDB covariate balancing with the constraints
$$\sum_{i=1}^{n_1} p_i = 1, \; p_i \ge 0; \qquad \sum_{j=1}^{n_0} q_j = 1, \; q_j \ge 0; \qquad \sum_{i=1}^{n_1} p_i X_{k,i} = \sum_{j=1}^{n_0} q_j X_{k,j}, \; k = 1, \cdots, K.$$

5.3.1 Simulation 1

In this section, we follow the simulation example in Josey et al. (2020), an extensive set of experimental scenarios adapted from Kang et al. (2007), to compare the different weighting methods. We first generate the observed covariates $X = (X_1, X_2, X_3, X_4)^T$, where $X_d \sim N(0, 1)$, $d = 1, \cdots, 4$, and then make the following transformations to build the hidden variables $U = (U_1, U_2, U_3, U_4)^T$:
$$U_1 = \exp(X_1/2), \quad U_2 = \frac{X_2}{1 + \exp(X_1)} + 10, \quad U_3 = \left(\frac{X_1 X_3}{25} + 0.6\right)^3, \quad U_4 = (X_2 + X_4 + 20)^2.$$
In subsequent applications, we standardize $U = (U_1, U_2, U_3, U_4)^T$ so that it has a mean of zero and marginal variances of one. The propensity score model is defined as
$$\pi^{(\delta_T)}_i = \frac{\exp[\eta^{(\delta_T)}_i]}{1 + \exp[\eta^{(\delta_T)}_i]} \;\Rightarrow\; T^{(\delta_T)}_i \sim Bin(1, \pi^{(\delta_T)}_i),$$
where $\delta_T \in \{X, U\}$ refers to the generative process that determines the treatment assignment. Here, $\delta_T = X$ indicates that the log odds of the propensity score are linear, with $\eta^{(X)}_i = -X_{i1} + 0.5 X_{i2} - 0.25 X_{i3} - 0.1 X_{i4}$, while $\delta_T = U$ implies a nonlinear relationship between the log odds of the propensity score and the observed covariates, with $\eta^{(U)}_i = -U_{i1} + 0.5 U_{i2} - 0.25 U_{i3} - 0.1 U_{i4}$. As for the outcome process, we know that $Y = Y(1) \cdot T + Y(0) \cdot (1 - T)$, and we construct the paired potential outcomes $\{Y_i(1), Y_i(0)\}$, $i = 1, \cdots, N$, with a bivariate model
$$\begin{pmatrix} Y_i(0) \\ Y_i(1) \end{pmatrix}^{(\delta_O)} \sim N\left( \begin{pmatrix} \mu^{(\delta_O)}_i \\ \mu^{(\delta_O)}_i + \gamma \end{pmatrix}, \begin{pmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{pmatrix} \right),$$
where $\sigma^2$ is the error variance, $\rho$ is the correlation between the potential outcomes, $\gamma = 20$ represents the treatment effect, and $\delta_O \in \{X, U\}$ indicates the covariates, whether $X$ or $U$, that participate in the outcome process. For example, if $\delta_O = X$, we have $\mu^{(X)}_i = 210 + 27.4 X_{i1} + 13.7 X_{i2} + 13.7 X_{i3} + 13.7 X_{i4}$. It should be emphasized that although we may use $U$ to construct the variables $T$ or $Y$, it cannot be observed, and the covariates to be balanced are always $X$. In other words, the observed simulated data consist of $\{X_i, T_i, Y_i\}$, $i = 1, \cdots, N$.

Based on this data generation mechanism, we know that
• ATE: $\tau = E[Y_i(1) - Y_i(0)] = E[Y_i(1)] - E[Y_i(0)] = \left(\mu^{(\delta_O)}_i + \gamma\right) - \mu^{(\delta_O)}_i = \gamma = 20$;
• ATT: $\tau_{trt} = E[Y_i(1) - Y_i(0) \mid T = 1] = E\left[\left(\mu^{(\delta_O)}_i + \gamma\right) - \mu^{(\delta_O)}_i \mid T = 1\right] = \gamma = 20$.

To better understand the role of different parameters in estimating the ATE, we choose the sample size $N$ in $\{200, 1000\}$, the error variance $\sigma^2$ in $\{2, 5, 10\}$, the correlation between the potential outcomes $\rho$ in $\{-0.3, 0, 0.5\}$,
the generative process $\delta_T$ that determines the treatment assignment in $\{U$: $T$ is generated with covariates $U$; $X$: $T$ is generated with covariates $X\}$, and the outcome process $\delta_O$ in $\{U$: $Y$ is generated with covariates $U$; $X$: $Y$ is generated with covariates $X\}$, so there are a total of 72 different simulation scenarios. We change only one parameter between simulation scenarios and run 500 Monte Carlo datasets for each setting.

Fixing $\delta_T$ and $\delta_O$ ($\{\delta_T = X, \delta_O = X\}$, $\{\delta_T = X, \delta_O = U\}$, $\{\delta_T = U, \delta_O = X\}$, $\{\delta_T = U, \delta_O = U\}$), we run 1000 Monte Carlo datasets to compare the performance of four weighting methods (IPW, CBPS, EB, KDBC) for estimating the ATE under different combinations of $N$, $\sigma^2$, and $\rho$. The results are briefly shown in Figures [1, 2, 3, 4]. Each figure corresponds to a different combination of $\delta_T$ and $\delta_O$; the values of $\sigma^2$ corresponding to the rows from top to bottom are $\{2, 5, 10\}$, and the values of $\rho$ corresponding to the columns from left to right are $\{-0.3, 0, 0.5\}$. It can be seen that the sample size $N$ and the error variance $\sigma^2$ mainly influence the standard errors of the ATE estimators, while the correlation between the potential outcomes $Y_i(0)$ and $Y_i(1)$ has almost no effect. This observation is consistent with Josey et al. (2020), and thus we focus on the simulation scenario with parameters $N = 200$, $\sigma^2 = 10$, and $\rho = 0$.

We run 500 Monte Carlo experiments for estimating both the ATE and the ATT. The results of the ATE estimation are shown in Figure [5] and Tables [1, 2], and the results of the ATT estimation are shown in Tables [3, 4]. Tables [1, 3] together show that the KDBC estimators, for both the ATE and the ATT, can perform as well as the other weighting estimators when all the true predictors are included in the observed dataset. In the two scenarios $\{\delta_T = X, \delta_O = U\}$ and $\{\delta_T = U, \delta_O = U\}$, where key information about the true covariates is missing, the KDBC estimators perform much better than the other weighting estimators.
This reflects the high stability of our proposed KDB covariate balancing in estimating causal effects, even when the observed dataset misses some key information about the real generation mechanism.

To better observe the adjustment effect of our proposed method, we draw the cumulative distribution functions (CDF) of all covariates without any adjustment, and those of the weighted covariates under KDBC. It can be seen from Figure [6] that, regardless of the values of $\delta_T$ and $\delta_O$, under our proposed KDB covariate balancing with only regularization constraints (KDBC), the CDFs of the weighted covariates in the treatment and control groups always match each other well. This means that our proposed method can effectively achieve covariate balance between the treatment and control groups.
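For reference, the Simulation 1 data-generating process described above can be sketched as follows (a hedged reimplementation with our own function and argument names, not the authors' code):

```python
import numpy as np

def simulate_kang_schafer(N, gamma=20.0, sigma2=10.0, rho=0.0,
                          t_scen="X", y_scen="X", rng=None):
    """One Monte Carlo dataset for Simulation 1 (Kang-Schafer design)."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((N, 4))
    # hidden nonlinear transformations
    U = np.column_stack([
        np.exp(X[:, 0] / 2),
        X[:, 1] / (1 + np.exp(X[:, 0])) + 10,
        (X[:, 0] * X[:, 2] / 25 + 0.6) ** 3,
        (X[:, 1] + X[:, 3] + 20) ** 2,
    ])
    U = (U - U.mean(axis=0)) / U.std(axis=0)      # standardize
    # treatment assignment: logistic in X or in U
    Z = X if t_scen == "X" else U
    eta = -Z[:, 0] + 0.5 * Z[:, 1] - 0.25 * Z[:, 2] - 0.1 * Z[:, 3]
    T = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    # outcome process: correlated potential outcomes with constant effect
    W = X if y_scen == "X" else U
    mu = 210 + 27.4 * W[:, 0] + 13.7 * (W[:, 1] + W[:, 2] + W[:, 3])
    cov = sigma2 * np.array([[1.0, rho], [rho, 1.0]])
    eps = rng.multivariate_normal([0.0, 0.0], cov, size=N)
    Y0, Y1 = mu + eps[:, 0], mu + gamma + eps[:, 1]
    Y = np.where(T == 1, Y1, Y0)
    return X, T, Y, Y1 - Y0
```

Only `X`, `T`, and `Y` would be handed to the weighting methods; `Y1 - Y0` is kept here solely to verify the constant effect $\gamma$.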
Figure 1: T Scenario = "X", Y Scenario = "X".
Figure 2: T Scenario = "X", Y Scenario = "U".
Figure 3: T Scenario = "U", Y Scenario = "X".
Figure 4: T Scenario = "U", Y Scenario = "U".

Figure 5: ATE estimates using five balancing methods (IPW, CBPS, EB, CAL, and KDBC) with $N = 200$, $\sigma^2 = 10$, and $\rho = 0$. Both the oracle estimate and the unadjusted (UnAD) estimate are used as benchmarks for comparison.
Table 1: Details on ATE estimates with 500 Monte Carlo datasets (Bias, sd, and RMSE for each combination of the treatment assignment scenario $\delta_T$ and the outcome scenario $\delta_O$; columns: Oracle, UnAD, IPW, CBPS, EB, CAL, KDBC, KDM1). Selected Bias rows:

  $\delta_T = U$, $\delta_O = X$: -0.00635 (Oracle), -15.85058 (UnAD), 0.37174 (IPW), -0.00156 (CBPS), -0.04647 (EB), -0.00179 (CAL), -0.01756 (KDBC), -0.00392 (KDM1)
  $\delta_T = U$, $\delta_O = U$: -0.00826 (Oracle), -14.69162 (UnAD), -4.69057 (IPW), -4.80178 (CBPS), -4.16993 (EB), -4.13912 (CAL), -0.10005 (KDBC), -0.08974 (KDM1)
Table 2: ATE - Details on Balancing Metrics (KD, maxASMD, meanASMD, medASMD, meanKS, meanT) for each combination of $\delta_T$ and $\delta_O$ with 500 Monte Carlo datasets.
Table 3: Details on ATT estimates with 500 Monte Carlo datasets (columns: Oracle, IPW, CBPS, EB, CAL, KDM1, KDBC). Selected Bias rows:

  $\delta_T = U$, $\delta_O = X$: -0.00635 (Oracle), 5.08391 (IPW), -0.00717 (CBPS), -0.05765 (EB), -0.00689 (CAL), -0.02487 (KDM1), -0.22012 (KDBC)
  $\delta_T = U$, $\delta_O = U$: -0.00826 (Oracle), -0.92883 (IPW), -3.60219 (CBPS), -3.62439 (EB), -3.60196 (CAL), -0.03336 (KDM1), -0.25254 (KDBC)
Table 4: ATT - Details on Balancing Metrics (KD, maxASMD, meanASMD, medASMD, meanKS, meanT) with 500 Monte Carlo datasets.
Figure 6: The CDFs of the unweighted covariates are arranged in the odd rows, and those of the weighted covariates under KDBC are arranged in the even rows. All combinations of "T scenario" and "Y scenario" are included. The dashed line indicates the treatment group, and the solid line indicates the control group.

5.3.2 Simulation 2

In this section, we perform a simulation experiment with a parameter setting different from that of Simulation 1 (Section 5.3.1). We wish to explore the mechanism of the KDB method in mining covariate information, as well as to check the performance of the KDB stable ATE estimation.

We first construct the treatment indicators $T_i$, $i = 1, \cdots, N$, from a binomial distribution $Bin(1, p)$. Here, we set $p$, the probability of a unit being assigned to the treatment group, to be 0.5. Next, we use two different multivariate normal distributions to generate the covariates $X_1$ and $X_2$ for the treatment group and the control group, respectively:
$$\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \Big| \, T = 1 \sim N(\mu_1, \Sigma_1) \quad \text{and} \quad \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \Big| \, T = 0 \sim N\left(\mu_1 + \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix}, \Sigma_0\right),$$
where $\alpha_1$ and $\alpha_2$ are fixed constants in $(0, 1)$ that shift the location of the control-group distribution. Letting $X_3$ and $X_4$ be distributed as $N(0, 1)$, we continue to construct the interaction term $X_5 = X_1 \times X_2$ and a squared term $X_6$. The vector $(X_1, X_2, X_3, X_4, X_5, X_6)^T$ is subsequently standardized to have a mean of zero and marginal variances of one. The observed covariates are $\{X_1, X_2, X_3, X_4\}$. For the outcome variable, we use the bivariate model
$$\begin{pmatrix} Y_i(0) \\ Y_i(1) \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_i \\ \mu_i + \gamma \end{pmatrix}, \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{pmatrix} \right),$$
where $\gamma = 10$, $\sigma^2 = 10$, and $\mu_i = 20 X_{1i} + 10 X_{2i} + 5 X_{3i} + 5 X_{4i} + \alpha_3 \cdot X_{5i} + \alpha_4 \cdot X_{6i}$ with $\alpha_3 = 1$, $\alpha_4 = 2$. To explore whether the KDB stable ATE estimation can effectively reduce the variance of the ATE estimator, we add a tuning parameter $\lambda$ to the diagonal of $K_G$ and vary its value to see how the estimation results change:
$$K_G = \begin{pmatrix} K_{11} & -K_{12} \\ -K_{21} & K_{22} \end{pmatrix} + \lambda I_N.$$
With $N = 200$, we run 500 Monte Carlo datasets for each value of $\lambda$ in $\{0, 1, 2, 5, 10, 100\}$. The results of the ATE estimation with $\lambda \in \{0, 1, 2, 5, 10, 100\}$ are shown in Table [5].
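Adding $\lambda$ to the diagonal of the kernel matrix is, in effect, ridge regularization. A small numpy check of how it improves the conditioning of a Gaussian Gram matrix (the bandwidth and sample size are arbitrary choices of ours):

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    # symmetric Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
K = gaussian_gram(rng.standard_normal((50, 2)))
# cond(K + lam*I) = (eig_max + lam) / (eig_min + lam), decreasing in lam
conds = {lam: np.linalg.cond(K + lam * np.eye(50))
         for lam in [0.0, 1.0, 10.0, 100.0]}
```

Better conditioning of the regularized problem is one plausible reading of the variance reduction observed for KDM1 below, at the price of some bias when $\lambda$ grows large.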
In each block of the table for a specific value of $\lambda$, the rows are arranged from top to bottom according to the absolute value of the bias. Comparing the results across all blocks, we find that the UnAD estimator always performs the worst and is far from the true value. This shows the importance of covariate balance, especially when the real generation mechanism of the data involves nonlinear information. As the value of $\lambda$ increases, the performance of KDBC gradually deteriorates, but KDM1 always remains near the top of the ranking, which indicates the necessity of adding the first-moment balance to the balance constraints of the KDB method. Furthermore, adding a non-zero $\lambda$ does help reduce the variance of the KDM1 estimator of the ATE compared with the results when $\lambda = 0$; the mechanism behind this remains to be explored.

When $\lambda$ changes from 0 to 1, the performance of KDM1 is not as good as that of CBPS and CAL. However, when we increase $\lambda$ from 2 to 100, KDM1 again becomes the best weighting method with the smallest absolute bias. This probably implies that there is a threshold for $\lambda$ to induce better performance, and it may be related to the information matrix (Section 4.5) of the observed dataset.

Considering the ASMD values of $X_5$ and $X_6$ under the different weighting methods in Table [5], we see that the KDB methods, including KDBC and KDM1, always perform the best. Reviewing the performance of the KDB covariate balancing in Simulation 1 (Section 5.3.1) with $\{\delta_T = X, \delta_O = U\}$ and $\{\delta_T = U, \delta_O = U\}$, we conclude that the KDB covariate balancing may automatically discover and include some latent variables, such as interaction terms between different covariates or higher-order terms of covariates, when analyzing and estimating the ATE.
Within each $\lambda$ block of Table [5] (metrics: ATE, abs(Bias), sd, RMSE, $r_W$, KD, maxASMD, meanASMD, medASMD, X5ASMD, X6ASMD, meanKS, meanT), the methods ranked by absolute bias, best to worst, are:

  $\lambda = 0$:   Oracle, KDM1, KDBC, CBPS, CAL, IPW, EB, UnAD
  $\lambda = 1$:   Oracle, CBPS, CAL, KDM1, IPW, EB, KDBC, UnAD
  $\lambda = 2$:   Oracle, CAL, KDM1, CBPS, EB, IPW, KDBC, UnAD
  $\lambda = 5$:   Oracle, KDM1, CAL, CBPS, EB, IPW, KDBC, UnAD
  $\lambda = 10$:  Oracle, KDM1, CAL, CBPS, EB, IPW, KDBC, UnAD
  $\lambda = 100$: Oracle, KDM1, CAL, CBPS, EB, IPW, KDBC, UnAD

Selected rows:

  UnAD ($\lambda = 0$): ATE = -13.16629, abs(Bias) = 23.16629, sd = 0.15097, RMSE = 1.04695, $r_W$ = 0.04264, KD = 0.20455, maxASMD = 0.89249, meanASMD = 0.46910, medASMD = 0.46869, X5ASMD = 0.87363, X6ASMD = 0.11004, meanKS = 0.24557, meanT = -2.83703
  KDBC ($\lambda = 10$): ATE = -1.93687, abs(Bias) = 11.93687, sd = 0.09537, RMSE = 0.54227, $r_W$ = 0.01188, KD = 0.10779, maxASMD = 0.48651, meanASMD = 0.25751, medASMD = 0.25467, X5ASMD = 0.47040, X6ASMD = 0.11033, meanKS = 0.23875, meanT = -1.53728
  KDBC ($\lambda = 100$): ATE = -11.07394, abs(Bias) = 21.07394, sd = 0.14527, RMSE = 0.95356, $r_W$ = 0.03548, KD = 0.18653, maxASMD = 0.82059, meanASMD = 0.43223, medASMD = 0.43112, X5ASMD = 0.80211, X6ASMD = 0.11307, meanKS = 0.24343, meanT = -2.61005
Table 5: Results of ATE estimation in Simulation 2 for $\lambda \in \{0, 1, 2, 5, 10, 100\}$ with 500 Monte Carlo datasets.

5.4 Real Data Analysis

To further compare the different weighting methods, we investigate the causal effect in a real data set: the labour training program data previously analyzed in LaLonde (1986) and Dehejia and Wahba (1999), among many others. The dataset used here is a subset, containing information on earnings in 1974 (RE74), of the National Supported Work (NSW) Demonstration data as used in LaLonde (1986). The NSW Demonstration was a transitional, subsidized work experience program for four target groups of people with longstanding employment problems: ex-offenders, former drug addicts, women who were long-term recipients of welfare benefits, and school dropouts, many with criminal records. It was designed to test whether, and to what extent, 12 to 18 months of employment in a supportive but performance-oriented environment would equip hard-to-employ people to get and hold regular, unsubsidized jobs.
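The bootstrap evaluation used for this dataset (averaging weighted ATE estimates over resamples and reporting the standard error of that average) can be sketched generically as follows; the function names and the uniform-weight placeholder are ours, and any of the weighting methods can be plugged in for `weights_fn`:

```python
import numpy as np

def weighted_ate(Y, T, p, q):
    # difference of weighted outcome means, treatment minus control
    return p @ Y[T == 1] - q @ Y[T == 0]

def bootstrap_ate(Y, T, weights_fn, B=500, rng=None):
    """Average the weighted ATE over B bootstrap resamples; the reported
    sd is the standard error of that average, sd(estimates)/sqrt(B)."""
    rng = np.random.default_rng(rng)
    n = len(Y)
    ests = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)      # resample with replacement
        Yb, Tb = Y[idx], T[idx]
        p, q = weights_fn(Tb)
        ests.append(weighted_ate(Yb, Tb, p, q))
    ests = np.asarray(ests)
    return ests.mean(), ests.std(ddof=1) / np.sqrt(B)

def uniform_weights(T):
    # UnAD benchmark: weights 1/n1 for the treated and 1/n0 for the controls
    n1, n0 = (T == 1).sum(), (T == 0).sum()
    return np.full(n1, 1.0 / n1), np.full(n0, 1.0 / n0)
```

A production version would also guard against resamples that contain no treated (or no control) units, which this sketch omits.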
  Metric   UnAD        IPW         CBPS        EB          CAL         KDBC        KDps        KDM1
  ATE      1804.78064  1641.77770  1637.56504  1612.77550  1612.44289  1774.39211  2311.99732  1672.08084
Table 6: Results of ATE estimation for the NSW data (Dehejia and Wahba, 1999). The ATE is an average of 500 bootstrap estimates, and sd = sd($\widehat{ATE}$)/$\sqrt{500}$ is the standard error of the average ATE estimate.

Variables included in the dataset are: treatment indicator (1 if treated, 0 if not treated), age, education, Black (1 if black, 0 otherwise), Hispanic (1 if Hispanic, 0 otherwise), married (1 if married, 0 otherwise), nodegree (1 if no degree, 0 otherwise), RE74 (earnings in 1974), RE75 (earnings in 1975), and RE78 (earnings in 1978). We compare the weighting estimates of earnings in 1978 between the treatment and control groups, which are estimates of the treatment effect. We compare the proposed KDB covariate balancing with IPW, CBPS, EB, and CAL, and take the UnAD estimator as the benchmark. We set the balance restrictions in the KDB method as follows: regularization constraints only, regularization constraints plus propensity score balance, and regularization constraints plus the first-moment balance; the three scenarios are denoted by "KDBC", "KDps", and "KDM1", respectively. We obtain 500 bootstrap datasets by resampling with replacement, with which we estimate the ATE. The results are shown in Table [6]; all numerical results are rounded to five decimal places.

As can be seen from Table [6], the estimation results of all the weighting methods differ noticeably from the UnAD estimates. Except for the KDB methods (KDBC, KDps, KDM1), the estimation results of the other weighting methods are very similar. Both KDBC and KDM1 have the smallest standard deviations (sd), so they show outstanding stability in estimating the ATE. Considering the specific ATE estimates, however, adding only regularization constraints may not be enough to achieve a good balance of covariates between the treatment and control groups. In addition, the KDps estimates are completely different from those of the other methods, including UnAD, and their sd is very large.
Therefore, it seems inappropriate to use regularization constraints plus propensity score balance as the balance restrictions in the KDB method. This is an important reminder to set balance restrictions carefully, which is an important focus for further exploration.

We plot the probability density functions (PDF) of the covariates "age" (denoted by $X_1$) and "education" (denoted by $X_2$), as well as the PDFs of $X_1$ and $X_2$ weighted by the different methods (UnAD, IPW, CBPS, EB, CAL, KDBC), to compare performance. The first PDF plots in Figures [7] and [8] indicate that $X_1$ and $X_2$ have almost the same distributions in the treatment and control groups, which is consistent with the randomization setting of the NSW data. In Figures [7, 8], the relative relationship between the PDFs of $X_1$ and $X_2$ weighted by the IPW, CBPS, EB, and CAL methods in the treatment and control groups is very similar, which may explain why the estimation results of these weighting methods in Table [6] are so similar. Although all methods widen the gap between the probability density functions of the treatment group and the control group, the KDBC method seems to better retain the characteristics of the original PDFs, such as bimodality.
Figure 7: Density of $X_1$ = age under the different balancing methods.
Figure 8: Density of $X_2$ = education under the different balancing methods.

Conclusions and Further Discussions
With the goal of creating balanced samples, we have proposed a kernel-distance-based covariate balancing method with the Gaussian kernel in this article. The proposed method calculates sample weights for the treatment and control groups in order to yield proper estimators of the causal effect. It performs well in discovering latent variables, such as interactions of different predictors, and yields stable estimates of the ATE.
There are two main directions in which our proposed method can develop. (i) In addition to setting the kernel distance as the loss function, we could choose other commonly used covariate balance metrics as the loss function and use the kernel distance as an optimization constraint, making it possible to control the balance performance along different dimensions at the same time. (ii) We look forward to exploring further applications of the kernel distance in high-dimensional spaces.
Acknowledgments

References
Chan, K. C. G., S. C. P. Yam, and Z. Zhang (2016): “Globally efficient non-parametricinference of average treatment effects by empirical balancing calibration weighting,” Journal ofthe Royal Statistical Society. Series B, Statistical methodology, 78, 673.
Dehejia, R. H. and S. Wahba (1999): “Causal effects in nonexperimental studies: Reevaluatingthe evaluation of training programs,” Journal of the American statistical Association, 94, 1053–1062.
Deville, J.-C. and C.-E. Särndal (1992): "Calibration estimators in survey sampling," Journal of the American Statistical Association, 87, 376–382.

Erlander, S. (1977): "Entropy in linear programs: an approach to planning," Report LiTH-MAT-R-77-3, Department of Mathematics, Linköping University, Linköping, Sweden.
Hainmueller, J. (2012): “Entropy balancing for causal effects: A multivariate reweighting methodto produce balanced samples in observational studies,” Political Analysis, 20, 25–46.
Hirano, K., G. W. Imbens, and G. Ridder (2003): “Efficient estimation of average treatmenteffects using the estimated propensity score,” Econometrica, 71, 1161–1189.
Horvitz, D. G. and D. J. Thompson (1952): “A generalization of sampling without replacementfrom a finite universe,” Journal of the American statistical Association, 47, 663–685.
Imai, K. and M. Ratkovic (2014): “Covariate balancing propensity score,” Journal of the RoyalStatistical Society: Series B (Statistical Methodology), 76, 243–263.
Imbens, G. W. and D. B. Rubin (2015): Causal inference in statistics, social, and biomedicalsciences, Cambridge University Press.
Josey, K. P., E. Juarez-Colunga, F. Yang, and D. Ghosh (2020): “A framework for covariatebalance using Bregman distances,” Scandinavian Journal of Statistics.
Kang, J. D., J. L. Schafer, et al. (2007): “Demystifying double robustness: A comparison ofalternative strategies for estimating a population mean from incomplete data,” Statistical science,22, 523–539.
LaLonde, R. J. (1986): “Evaluating the econometric evaluations of training programs with exper-imental data,” The American economic review, 604–620.
Ma, X. and J. Wang (2010): “Robust inference using inverse probability weighting,” Journal ofthe American Statistical Association, 1–10.
Newey, W. K. and R. J. Smith (2004): “Higher order properties of GMM and generalizedempirical likelihood estimators,” Econometrica, 72, 219–255.
Neyman, J. S. (1923): "On the application of probability theory to agricultural experiments. Essay on principles. Section 9. (Translated and edited by D. M. Dabrowska and T. P. Speed, Statistical Science (1990), 5, 465–480)," Annals of Agricultural Sciences, 10, 1–51.

Rasmussen, C. E. (2003): "Gaussian processes in machine learning," in Summer School on Machine Learning, Springer, 63–71.
Rosenbaum, P. R. and D. B. Rubin (1983): “The central role of the propensity score in obser-vational studies for causal effects,” Biometrika, 70, 41–55.
Rubin, D. B. (1974): “Estimating causal effects of treatments in randomized and nonrandomizedstudies.” Journal of educational Psychology, 66, 688.——— (1978): “Bayesian inference for causal effects: The role of randomization,” The Annals ofstatistics, 34–58.——— (2005): “Causal inference using potential outcomes: Design, modeling, decisions,” Journalof the American Statistical Association, 100, 322–331.——— (2007): “The design versus the analysis of observational studies for causal effects: parallelswith the design of randomized trials,” Statistics in medicine, 26, 20–36.
Sriperumbudur, B. K., K. Fukumizu, A. Gretton, B. Schölkopf, G. R. Lanckriet,et al. (2012): “On the empirical estimation of integral probability metrics,” Electronic Journalof Statistics, 6, 1550–1599.
Wong, R. K. and K. C. G. Chan (2018): “Kernel-based covariate functional balancing forobservational studies,” Biometrika, 105, 199–213.
Xie, Y. (2018): “Causal Inference with Covariate Balance Optimization,” Ph.D. thesis, Universityof Waterloo.
Xie, Y., Y. Zhu, C. A. Cotton, and P. Wu (2019): “A model averaging approach for estimatingpropensity scores by optimizing balance,” Statistical methods in medical research, 28, 84–101.
Zhang, Z., H. J. Kim, G. Lonjon, Y. Zhu, et al. (2019): "Balance diagnostics after propensity score matching," Annals of Translational Medicine, 7.

Zhao, Q. and D. Percival (2016): "Entropy balancing is doubly robust," Journal of Causal Inference, 5.
Zhu, Y., J. S. Savage, and D. Ghosh (2018): “A kernel-based metric for balance assessment,”Journal of causal inference, 6.