Estimating Average Treatment Effects with Support Vector Machines
Alexander Tarr†  Kosuke Imai‡

February 25, 2021
Abstract
Support vector machine (SVM) is one of the most popular classification algorithms in the machine learning literature. We demonstrate that SVM can be used to balance covariates and estimate average causal effects under the unconfoundedness assumption. Specifically, we adapt the SVM classifier as a kernel-based weighting procedure that minimizes the maximum mean discrepancy between the treatment and control groups while simultaneously maximizing effective sample size. We also show that SVM is a continuous relaxation of the quadratic integer program for computing the largest balanced subset, establishing its direct relation to the cardinality matching method. Another important feature of SVM is that the regularization parameter controls the trade-off between covariate balance and effective sample size. As a result, the existing SVM path algorithm can be used to compute the balance-sample size frontier. We characterize the bias of causal effect estimation arising from this trade-off, connecting the proposed SVM procedure to the existing kernel balancing methods. Finally, we conduct simulation and empirical studies to evaluate the performance of the proposed methodology and find that SVM is competitive with the state-of-the-art covariate balancing methods.
Keywords: causal inference, covariate balance, matching, subset selection, weighting

∗ We thank Brian Lee for making his simulation code available and Francesca Dominici, Gary King, Jose Zubizarreta, and seminar participants at the Institute for Quantitative Social Science, Harvard University, for helpful discussions. Imai thanks the Sloan Foundation for its support.
† Ph.D. Candidate, Department of Electrical Engineering, Princeton University, Princeton, NJ 08544. Email: [email protected]
‡ Professor, Department of Government and Department of Statistics, Harvard University, Cambridge, MA 02138. Phone: 617–384–6778, Email: [email protected], URL: https://imai.fas.harvard.edu

1 Introduction
Estimating causal effects in an observational study is complicated by the lack of randomization in treatment assignment, which may lead to confounding bias. The standard approach is to weight observations such that the empirical distribution of observed covariates is similar between the treatment and control groups (see e.g., Lunceford and Davidian, 2004; Rubin, 2006; Ho et al., 2007; Stuart, 2010). Researchers then estimate the causal effects using the weighted sample while assuming the absence of unobserved confounders. Recently, a large number of weighting methods have been proposed to directly optimize covariate balance for causal effect estimation (e.g., Hainmueller, 2012; Imai and Ratkovic, 2014; Zubizarreta, 2015; Chan et al., 2016; Athey et al., 2018; Li et al., 2018; Wong and Chan, 2018; Zhao, 2019; Hazlett, 2020; Kallus, 2020; Ning et al., 2020; Tan, 2020).

This paper provides a new insight into this fast-growing literature on covariate balancing by demonstrating that support vector machine (SVM), one of the most popular classification algorithms in the machine learning literature (Cortes and Vapnik, 1995; Schölkopf et al., 2002), can be used to balance covariates and estimate the average treatment effect under the standard unconfoundedness assumption. Specifically, we adapt the SVM classifier as a kernel-based weighting procedure that minimizes the maximum mean discrepancy between the treatment and control groups (Gretton et al., 2007a) while simultaneously maximizing effective sample size. The resulting weights are bounded, leading to stable causal effect estimation. Importantly, as SVM has been extensively studied and widely used, we can exploit its well-known theoretical properties and highly optimized implementation.

All matching and weighting methods face the same trade-off between effective sample size and covariate balance, with a better balance typically leading to a smaller sample size.
We show that SVM directly addresses this fundamental trade-off. Specifically, the dual optimization problem for SVM computes a set of balancing weights as the dual coefficients while yielding the support vectors that comprise a largest balanced subset. In addition, the regularization parameter of SVM controls the trade-off between sample size and covariate balance. This implies that the existing path algorithm (Hastie et al., 2004; Sentelle et al., 2016) can efficiently characterize the balance-sample size frontier (King et al., 2017). Since both sample size and covariate balance affect the statistical properties of causal estimates, we analyze how this trade-off affects causal effect estimation.

In the causal inference literature, we are not the first to recognize the connection between SVM and covariate balancing. In an unpublished working paper, Ratkovic (2014) notes that the hinge-loss function of SVM has a first-order condition which leads to balanced covariate sums amongst the support vectors. Instead, we show that the dual form of the SVM optimization problem leads to covariate mean balance. In addition, Ghosh (2018) notes the relationship between the SVM margin and the region of covariate overlap. The author argues that the support vectors correspond to observations lying in the intersection of the convex hulls for the treated and control samples (King and Zeng, 2006). In contrast, we show that SVM can be used to obtain weights for causal effect estimation. Furthermore, neither of these two previous works studies the relationship between the regularization parameter of SVM and the fundamental trade-off between covariate balance and effective sample size.

The proposed methodology is also related to several other covariate balancing methods. First, we establish that SVM can be seen as a continuous relaxation of the quadratic integer program for computing the largest balanced subset.
Indeed, SVM approximates an optimization problem closely related to cardinality matching (Zubizarreta et al., 2014). Second, SVM is a kernel-based covariate balancing method. Several researchers have recently developed weighting methods to balance functions in a reproducing kernel Hilbert space (RKHS) (Wong and Chan, 2018; Hazlett, 2020; Kallus, 2020). SVM shares the advantage of these methods that it can balance a general class of functions and easily accommodate non-linearity and non-additivity in the conditional expectation functions for the outcomes. In particular, we show that SVM fits into the kernel optimal matching framework (Kallus, 2020). Unlike these covariate balancing methods, however, we can exploit the existing path algorithms of SVM to compute the set of solutions over the entire regularization path with comparable complexity to computing a single solution (Hastie et al., 2004; Sentelle et al., 2016). This allows us to efficiently characterize the trade-off between covariate balance and effective sample size.

The rest of the paper is structured as follows. In Section 2, we present our methodological results. In Section 3, we conduct simulation studies to compare the performance of SVM with that of the aforementioned related covariate balancing methods. Lastly, in Section 4, we apply SVM to the data from the right heart catheterization observational study (Connors et al., 1996).
2 The Proposed Methodology

In this section, we establish several properties of SVM as a covariate balancing method. We first show that the SVM dual can be viewed as a regularized optimization problem that minimizes the maximum mean discrepancy (MMD). We then compare SVM to cardinality matching and show how the regularization path algorithm for SVM can be viewed as a balance-sample size frontier. Lastly, we discuss how to use SVM for causal effect estimation and compare SVM to existing kernel balancing methods.
Suppose that we observe a simple random sample of N units from a super-population of interest, P. Denote the observed data by D = {X_i, Y_i, T_i}_{i=1}^N, where X_i ∈ X represents a D-dimensional vector of covariates, Y_i is the outcome variable, and T_i is a binary treatment assignment variable that is equal to 1 if unit i is treated and 0 otherwise. We define the index sets for the treatment and control groups as T = {i : T_i = 1} and C = {i : T_i = 0}, with the group sizes equal to n_T = |T| and n_C = |C|, respectively. Finally, we define the observed outcome as Y_i = T_i Y_i(1) + (1 − T_i) Y_i(0), where Y_i(1) and Y_i(0) are the potential outcomes under the treatment and control conditions, respectively. This notation implies the Stable Unit Treatment Value Assumption (SUTVA): no interference between units and the same version of the treatment (Rubin, 1990). Furthermore, we make the following standard identification assumptions. All of these assumptions are maintained throughout this paper.

Assumption 1 (Unconfoundedness)
The potential outcomes {Y_i(1), Y_i(0)} are independent of the treatment assignment T_i conditional on the covariates X_i. That is, for all x ∈ X, we have {Y_i(1), Y_i(0)} ⊥⊥ T_i | X_i = x.

Assumption 2 (Overlap)
For all x ∈ X, the propensity score e(x) = Pr(T_i = 1 | X_i = x) is bounded away from 0 and 1, i.e., 0 < e(x) < 1.

To consider causal inference with SVM, it is convenient to define the following transformed treatment variable, which is equal to either −1 or 1:

  W_i = 2T_i − 1 ∈ {−1, 1}.  (1)

In addition, for t = 0, 1, we define the conditional expectation functions, disturbances, and conditional variance functions as

  E(Y_i(t) | X_i) = f_t(X_i),  ε_i(t) = Y_i(t) − f_t(X_i),  σ_t²(X_i) = V(Y_i(t) | X_i).

Note that by construction, we have E(ε_i(t) | X_i) = 0 for t = 0, 1. Lastly, let H_K denote a reproducing kernel Hilbert space (RKHS) with norm ‖·‖_{H_K} and kernel K(X_i, X_j) = ⟨φ(X_i), φ(X_j)⟩_{H_K}, where φ : R^D → H_K is a feature mapping of the covariates to the RKHS.

Support vector machines (SVMs) are a widely used methodology for two-class classification problems (Cortes and Vapnik, 1995; Schölkopf et al., 2002). SVM aims to compute a separating hyperplane of the form

  f(X_i) = β⊤φ(X_i) + β_0,  (2)

where β ∈ H_K is the normal vector for the hyperplane and β_0 is the offset. In this paper, we use SVM for the classification of treatment status. In the case of non-separable data, β and β_0 are computed according to the soft-margin SVM problem, which is formulated as

  min_{β, β_0, ξ}  (λ/2)‖β‖²_{H_K} + Σ_{i=1}^N ξ_i
  s.t.  W_i f(X_i) ≥ 1 − ξ_i,  i = 1, ..., N,
        ξ_i ≥ 0,  i = 1, ..., N,  (3)

where {ξ_i}_{i=1}^N are the so-called slack variables and λ is a regularization parameter controlling the trade-off between the margin width and margin violation of the hyperplane. Note that λ is related to the traditional SVM cost parameter C via the equality λ = 1/C.

Defining the matrix Q with elements Q_ij = W_i W_j K(X_i, X_j) and the vector W with elements W_i, this problem has a corresponding dual form given by

  min_α  (1/(2λ)) α⊤Qα − 1⊤α
  s.t.  W⊤α = 0,
        0 ⪯ α ⪯ 1,  (4)

where 1 represents a vector of ones and ⪯ denotes an element-wise inequality.

We begin by providing an intuitive explanation of how SVM can be viewed as a covariate balancing procedure.
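To make these objects concrete, the following pure-Python sketch assembles W_i = 2T_i − 1 and Q_ij = W_i W_j K(X_i, X_j) for a linear kernel and evaluates the dual objective (1/(2λ))α⊤Qα − 1⊤α. The data and weights are hypothetical toy values, not output of an actual SVM solver.

```python
# Sketch of the ingredients of the SVM dual; toy data are hypothetical.

def linear_kernel(x, y):
    # K(x, y) = x'y
    return sum(a * b for a, b in zip(x, y))

def dual_objective(alpha, X, T, lam, kernel=linear_kernel):
    """Evaluate (1/(2*lam)) * a'Qa - 1'a, where Q_ij = W_i W_j K(X_i, X_j)
    and W_i = 2*T_i - 1 is the transformed treatment variable."""
    W = [2 * t - 1 for t in T]
    n = len(X)
    quad = sum(alpha[i] * alpha[j] * W[i] * W[j] * kernel(X[i], X[j])
               for i in range(n) for j in range(n))
    return quad / (2.0 * lam) - sum(alpha)

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # covariates
T = [1, 1, 0, 0]                                      # treatment indicators
alpha = [0.5, 0.5, 0.5, 0.5]  # feasible: W'alpha = 0 and 0 <= alpha_i <= 1
# These toy weights balance the two groups exactly (a'Qa = 0), so the
# objective reduces to -1'a = -2.0.
print(dual_objective(alpha, X, T, lam=1.0))
```

The toy covariates are chosen so that the weighted group means coincide, which makes the quadratic term vanish and isolates the sample-size term of the objective.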
First, note that the quadratic term in the above dual objective function can be written as a weighted measure of covariate discrepancy between the treatment and control groups,

  α⊤Qα = ‖ Σ_{i∈T} α_i φ(X_i) − Σ_{i∈C} α_i φ(X_i) ‖²_{H_K},  (5)

while the constraint W⊤α = 0 ensures that the sum of weights is identical between the treatment and control groups,

  W⊤α = 0  ⟺  Σ_{i∈T} α_i = Σ_{i∈C} α_i.

Lastly, the second term in the objective, 1⊤α, is proportional to the sum of weights for each treatment group, since the above constraint implies Σ_{i∈T} α_i = Σ_{i∈C} α_i = 1⊤α/2. Thus, SVM simultaneously minimizes the covariate discrepancy and maximizes the effective sample size, which in turn leads to minimization of the weighted difference-in-means in the transformed covariate space. It is also important to note that, unlike some other balancing methods, the weights are bounded as represented by the constraint 0 ≤ α_i ≤ 1 for all i, leading to stable causal effect estimation (Tan, 2010).

The choice of the kernel function K(X_i, X_j) and its corresponding feature map φ determines the type of covariate balance enforced by SVM, as shown in equation (5). In this paper, we focus on the linear, polynomial, and radial basis function (RBF) kernels. The linear kernel K(X_i, X_j) = X_i⊤X_j corresponds to the feature map φ(X_i) = X_i, and hence the quadratic term α⊤Qα measures the discrepancy in the original covariates. The general form of the degree-d polynomial kernel with scale parameter c is K(X_i, X_j) = (X_i⊤X_j + c)^d. For example, when d = 2, this kernel has the corresponding feature map

  φ(X_i) = [ X_{i1}², ..., X_{iD}², √2 X_{iD} X_{i,D−1}, ..., √2 X_{iD} X_{i1}, ..., √2 X_{i2} X_{i1}, √(2c) X_{iD}, ..., √(2c) X_{i1}, c ]⊤.

Hence, the quadratic kernel leads to a discrepancy measure of the original covariates, their squares, and all pairwise interactions. In general, the degree-d polynomial kernel leads to a feature map consisting of all powers of the original covariates and all interactions up to degree d. The final kernel considered in this paper is the RBF kernel with scale parameter γ: K(X_i, X_j) = exp(−γ‖X_i − X_j‖²).
This kernel can be viewed as a generalization of the polynomial kernel in the limit d → ∞.

In addition, SVM sets the weights to zero for the units whose treatment status is easy to classify. To see this, note that the Karush–Kuhn–Tucker (KKT) conditions for soft-margin SVM lead to the following useful characterization of a solution α:

  W_i f(X_i) = 1  ⟹  0 ≤ α_i ≤ 1,
  W_i f(X_i) < 1  ⟹  α_i = 1,  (6)
  W_i f(X_i) > 1  ⟹  α_i = 0.

The units that satisfy W_i f(X_i) = 1 are referred to as marginal support vectors, whereas the units that meet W_i f(X_i) < 1 are called non-marginal support vectors. The regularization parameter λ controls which of the two components of the dual objective receives more emphasis. SVM chooses optimal weights such that easy-to-classify units are given zero weight. Our goal in the remainder of this section is to extend the above intuition and establish a more rigorous connection between SVM, covariate balancing, and causal effect estimation.

We now show that SVM minimizes the maximum mean discrepancy (MMD) of the covariate distribution between the treatment and control groups. The MMD is a commonly used measure of distance between probability distributions (Gretton et al., 2007a) that was recently proposed as a metric for balance assessment in causal inference (Zhu et al., 2018). Specifically, we show that the SVM dual problem given in equation (4) can be viewed as a regularized optimization problem for computing weights which minimize the MMD.

The MMD, which is also called the kernel distance, is a measure of distance between two probability distributions based on the difference in mean function values for functions in the unit ball of an RKHS (Gretton et al., 2007a). The MMD has found use in several statistical applications, such as hypothesis testing (Gretton et al., 2007b, 2012) and density estimation (Sriperumbudur, 2011).
Given the unit ball of the RKHS, F_K = {f ∈ H_K : ‖f‖_{H_K} ≤ 1}, and two probability measures F and G, the MMD is defined as

  γ_K(F, G) := sup_{f∈F_K} | ∫ f dF − ∫ f dG |.  (7)

An important property of the MMD is that when K is a characteristic kernel (e.g., the Gaussian radial basis function kernel and the Laplace kernel), γ_K(F, G) = 0 if and only if F = G (Sriperumbudur et al., 2010).

The computation of γ_K(F, G) requires the knowledge of both F and G, which is typically unavailable. In practice, an estimate of γ_K(F, G) using the empirical distributions F̂_m and Ĝ_n can be computed as

  γ_K(F̂_m, Ĝ_n) = ‖ (1/m) Σ_{i: X_i ∼ F} φ(X_i) − (1/n) Σ_{j: X_j ∼ G} φ(X_j) ‖_{H_K},  (8)

where m and n are the sizes of the samples drawn from F and G, respectively. The properties of this statistic are well studied (see e.g., Sriperumbudur et al., 2012). In causal inference, the empirical MMD can be used to assess balance between the treated and control samples (Zhu et al., 2018). This is done by setting F = P(X_i | T_i = 1) and G = P(X_i | T_i = 0). Then, the quantity γ_K(F̂_m, Ĝ_n) gives a measure of independence between the treatment assignment T_i and the observed pre-treatment covariates X_i.

Equation (8) naturally suggests a weighting procedure that balances the covariate distributions between the treatment and control groups by minimizing the empirical MMD. We define a weighted variant of the empirical MMD as

  γ_K(F̂_α, Ĝ_α) = ‖ Σ_{i∈T} α_i φ(X_i) − Σ_{j∈C} α_j φ(X_j) ‖_{H_K} = √(α⊤Qα),  (9)

where F̂_α and Ĝ_α denote the reweighted empirical distributions under the weights α.
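Because γ_K(F̂_α, Ĝ_α)² = α⊤Qα, the weighted MMD can be evaluated from the kernel alone, without constructing φ. A small pure-Python sketch (hypothetical toy data) checks this identity for the linear kernel, where φ is the identity and the squared norm can also be computed explicitly:

```python
import math

def mmd_squared_kernel(alpha, X, T, kernel):
    """a'Qa via the kernel trick: sum_ij s_i s_j a_i a_j K(X_i, X_j),
    where s_i = +1 for treated units and s_i = -1 for controls."""
    s = [1 if t == 1 else -1 for t in T]
    n = len(X)
    return sum(s[i] * s[j] * alpha[i] * alpha[j] * kernel(X[i], X[j])
               for i in range(n) for j in range(n))

def mmd_squared_explicit(alpha, X, T):
    """Direct evaluation of the squared norm in equation (9) for the linear
    kernel (phi = identity): weighted treated-minus-control covariate sums."""
    d = len(X[0])
    diff = [0.0] * d
    for a, x, t in zip(alpha, X, T):
        sign = 1.0 if t == 1 else -1.0
        for k in range(d):
            diff[k] += sign * a * x[k]
    return sum(v * v for v in diff)

linear = lambda x, y: sum(a * b for a, b in zip(x, y))
X = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.5], [1.0, 1.0]]  # hypothetical covariates
T = [1, 1, 0, 0]
alpha = [0.7, 0.3, 0.4, 0.6]
assert math.isclose(mmd_squared_kernel(alpha, X, T, linear),
                    mmd_squared_explicit(alpha, X, T))
print(round(mmd_squared_explicit(alpha, X, T), 2))  # 1.3
```

For a non-linear kernel such as the RBF, only the kernel-trick version is available, which is the computational appeal of measuring balance through a norm.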
The weights are restricted to the simplex set,

  A_simplex = { α ∈ R^N : 0 ⪯ α ⪯ 1, Σ_{i∈T} α_i = Σ_{j∈C} α_j = 1 }.  (10)

The optimization problem for finding the MMD-minimizing weights is therefore formulated as

  min_α √(α⊤Qα)  s.t.  α ∈ A_simplex.  (11)

Note that computing weights according to this problem alone is generally not preferable due to the lack of regularization, which leads to overfitting and a sparse α, resulting in many discarded samples. The following theorem, which we prove in Appendix A.1, establishes that the SVM dual problem can be viewed as a regularized version of the optimization problem in equation (11).

Theorem 1 (SVM Dual Problem as Regularized MMD Minimization) Let α*(λ) denote the solution to the SVM dual problem under λ, defined in equation (4). Consider the normalized weights α̃*(λ) = 2α*(λ)/1⊤α*(λ) such that α̃*(λ) ∈ A_simplex. Then,

(i) there exists λ* such that α̃*(λ*) is a solution to the MMD minimization problem defined in equation (11);

(ii) the quantity α̃*(λ)⊤Q α̃*(λ) is a monotonically increasing function of λ.

Theorem 1 shows that SVM minimizes the MMD, with the regularization parameter λ controlling the trade-off between the covariate imbalance, measured as the MMD, and the effective sample size, measured as the sum of the support vector weights 1⊤α. Thus, a greater size of the support vector set may lead to a worse covariate balance between the treatment and control groups within that set.

SVM can also be seen as a continuous relaxation of the quadratic integer program (QIP) for computing the largest balanced subset. Consider the modified version of the optimization problem in equation (4), in which we replace the continuous constraint 0 ⪯ α ⪯ 1 with the integrality constraint α ∈ {0,1}^N.
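Before turning to the integer-constrained variant, note that the renormalization used in Theorem 1 is simple to implement. A sketch with hypothetical dual weights checks that α̃ satisfies the group-sum constraints of A_simplex:

```python
def normalize_to_simplex(alpha):
    """tilde(alpha) = 2 * alpha / (1'alpha). When W'alpha = 0, each treatment
    group carries half of 1'alpha, so each group's normalized weights sum to 1."""
    total = sum(alpha)
    return [2.0 * a / total for a in alpha]

alpha = [1.0, 0.25, 0.75, 0.5]   # hypothetical dual solution with W'alpha = 0
T = [1, 1, 0, 0]                 # treatment indicators
tilde = normalize_to_simplex(alpha)
print(sum(a for a, t in zip(tilde, T) if t == 1))  # treated weights sum to 1.0
print(sum(a for a, t in zip(tilde, T) if t == 0))  # control weights sum to 1.0
```

The normalization rescales both groups by the same factor, so it leaves the relative weighting within each group, and hence the weighted difference-in-means up to scale, unchanged.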
Recalling that the quadratic term in the objective can be rewritten as a norm, this problem, which we refer to as SVM-QIP, is given by:

  min_α  (1/(2λ)) α⊤Qα − 1⊤α
  s.t.  W⊤α = 0,
        α ∈ {0,1}^N.  (12)

Interpreting the variables α_i as indicators of whether or not unit i is selected into the optimal subset, we see that the objective is a trade-off between subset balance in the projected features (first term) and subset size (second term). Here, balance is measured by a difference in sums. However, the constraint W⊤α = 0 requires the optimal subset to have an equal number of treated and control units, so balancing the feature sums also implies balancing the feature means.

Thus, the SVM dual in equation (4) can be viewed as a continuous relaxation of the largest balanced subset problem represented by SVM-QIP in equation (12), with the set of support vectors comprising an approximation to the largest balanced subset, as these are the units for which α_i > 0. The quality of this approximation depends on the data, the value of λ, and the choice of kernel. In our own experiments, we observe significant overlap between the two solutions, suggesting that SVM uses non-integer weights to augment the SVM-QIP solution without compromising balance in the selected subset. We compare the differences between the two methods in Section 3.2.

Closely related to the SVM-QIP formulation in equation (12) is cardinality matching (Zubizarreta et al., 2014), which maximizes the number of matches subject to a set of covariate balance constraints. The objective of cardinality matching is given by

  max_{m_ij}  Σ_{i∈T} Σ_{j∈C} m_ij
  s.t.  | Σ_{i∈T} Σ_{j∈C} m_ij [f_b(X_ik) − f_b(X_jk)] | ≤ ε_kb Σ_{i∈T} Σ_{j∈C} m_ij,  k = 1, ..., D,  b = 1, ..., B,
        Σ_{j∈C} m_ij ≤ 1,  i ∈ T,
        Σ_{i∈T} m_ij ≤ 1,  j ∈ C,
        m_ij ∈ {0,1},  i ∈ T, j ∈ C,  (13)

where m_ij are selection variables indicating whether treated unit i is matched to control unit j, X_ik denotes the k-th element of the covariate vector X_i, f_b is an arbitrary function of the covariates specifying each of the B balance conditions, and ε_kb is a tolerance selected by the researcher. Common choices for f_b are the first- and second-order moments, and ε_kb is typically set to a scalar multiple of the corresponding standardized difference-in-means.

To establish the connection between SVM-QIP and cardinality matching, we first note that cardinality matching need not be formulated as a matched-pair optimization problem. In fact, the balance constraints between pairs, as formulated in equation (13), are equivalent to those between the treatment and control groups in the selected subsample. Similarly, the one-to-one matching constraints are equivalent to constraining the number of treated and control units in the selected subsample to be equal. Therefore, defining the indicator variable α_i for selection into the optimal subset, we can rewrite the objective of cardinality matching as

  max_α  Σ_{i=1}^N α_i
  s.t.  | Σ_{i∈T} α_i f_b(X_ik) − Σ_{j∈C} α_j f_b(X_jk) | ≤ ε_kb Σ_{i=1}^N α_i,  k = 1, ..., D,  b = 1, ..., B,
        Σ_{i=1}^N α_i W_i = 0,
        α_i ∈ {0,1},  i = 1, ..., N.  (14)

The difference between cardinality matching and SVM-QIP lies in the way that balance is enforced in the optimal subset. Cardinality matching imposes covariate-specific balance by bounding each dimension's difference-in-means, while SVM-QIP imposes aggregated balance by penalizing the normed difference-in-means. The preference between these two measures of balance depends on the dimensionality of the covariates and a priori knowledge about confounding mechanisms.
If we suspect certain covariates to be confounders, then bounding those specific dimensions is more reasonable. However, if no such information is available and the covariate space is high-dimensional, then restricting the overall balance may be preferable. We empirically examine the relative performance of these two methods in Section 4.

We emphasize that using the norm to measure balance is also computationally attractive, since it avoids the direct calculation of the balance conditions φ(X_i) through the so-called "kernel trick." Furthermore, as we discuss below, SVM can be used to approximate solutions to SVM-QIP with high accuracy and a much lower computational cost, which allows us to approximate the regularization path for SVM-QIP faster than a single solution for cardinality matching can be computed.

Another important advantage of using SVM to perform covariate balancing is the existence of path algorithms, which can efficiently compute the set of solutions to equation (4) over different values of λ. Since Theorem 1 established that λ controls the trade-off between the MMD and a heuristic measure of subset size, the path algorithm for SVM can be viewed as the weighting analog of the balance-sample size frontier (King et al., 2017). Below, we briefly discuss the algorithm for computing the SVM regularization path and describe how the path can be interpreted as a balance-sample size frontier.

The algorithm. Path algorithms for SVM were first proposed in Hastie et al. (2004), who showed that the weights α and the scaled intercept α_0 := λβ_0 are piecewise linear in λ and presented an algorithm for computing the entire path of solutions at a computational cost comparable to finding a single solution. However, their algorithm was prone to numerical problems and would fail in the presence of singular submatrices of Q. Recent work on SVM path algorithms has focused on resolving the issues with singularities. In our analysis, we use the algorithm presented in Sentelle et al. (2016), which is briefly described in Appendix A.2.

Initial solution.
The initial solution in the SVM regularization path corresponds to the solution at λ_max such that for any λ > λ_max, the minimizing weight vector α does not change. We assume without loss of generality that n_T ≤ n_C. Then initially, α_i = 1 for all i ∈ T, and the remaining weights are computed according to

  argmin_{α ∈ [0,1]^N}  α⊤Qα  s.t.  Σ_{i∈C} α_i = n_T,  α_i = 1, i ∈ T.  (15)

Thus, the initial solution computes control weights such that the weighted empirical MMD is minimized while fixing the renormalized weights α̃_i = n_T^{−1} for the treated units. Note that this solution also corresponds to the largest subset, as measured by Σ_i α_i, amongst all solutions on the regularization path.

Terminal solution.
The regularization path completes when the resulting solution has no non-marginal support vectors or, in the case of non-separable data, when λ = 0. In practice, however, path algorithms run into numerical issues when λ is small, so we terminate the path at a small value λ_min, which appears to work well in our experiments. This value is often greater than λ*, which corresponds to the MMD-minimizing solution defined in Theorem 1. In practice, however, we find the differences in balance between these two solutions to be negligible.

Summarizing the regularization path, we see that the initial solution at λ_max has the largest weight sum Σ_{i=1}^N α_i = 2 min{n_T, n_C} and can be viewed as the largest balanced subset retaining all observations in the minority class. As we move through the path, the SVM dual problem imposes greater restrictions on balance in the subset, which leads to smaller subsets, until we reach the terminal value λ_min, at which the weighted empirical MMD is smallest on the path.

2.7 Causal Effect Estimation

Theorem 1 establishes that the SVM dual problem can be viewed as a regularized optimization problem for computing balancing weights to minimize the MMD. Under the unconfoundedness assumption, therefore, the resulting weighted support vector set composes a subsample of the data that approximates randomization of treatment assignment. However, achieving a high degree of balance often requires significant pruning of the original sample, especially in scenarios where the covariate distributions for the treated and control groups have limited overlap. In this section, we provide a characterization of this trade-off between subset size and subset balance and discuss its impact on the bias of causal effect estimates.

Recent work by Kallus (2020) established that many existing matching and weighting methods in causal inference minimize the dual norm of the bias for a weighted estimator, a property called error dual norm minimizing.
The author also proposes a new method, kernel optimal matching (KOM), which minimizes the dual norm of the bias when the conditional expectation functions are embedded in an RKHS. Below, we show that SVM also fits into the KOM framework.

We restrict our attention to the following weighted difference-in-means estimator,

  τ̂ = Σ_{i∈T} α_i Y_i − Σ_{i∈C} α_i Y_i,  (16)

where α ∈ A_simplex is computed via the application of SVM to the data {X_i, T_i}_{i=1}^N. Below, we derive the form of the conditional bias with respect to two estimands, the sample average treatment effect (SATE), τ_SATE, and the sample average treatment effect for the treated (SATT), τ_SATT. We then discuss how to compute this bias when the conditional expectation functions f_0 and f_1 are unknown.

As we show in Appendix A.3, under Assumptions 1 and 2, the conditional bias with respect to τ_SATE and τ_SATT for the estimator above is given by

  E(τ̂ − τ | {X_i, T_i}_{i=1}^N) = Σ_{i=1}^N α_i W_i f_0(X_i) + Σ_{i=1}^N (α_i T_i − v_i) τ(X_i),  (17)

where

  v_i = 1/N if τ = τ_SATE,  v_i = T_i/n_T if τ = τ_SATT,  (18)

and τ(X_i) := E(Y_i(1) − Y_i(0) | X_i) = f_1(X_i) − f_0(X_i). Fan et al. (2016) use a similar bias decomposition.

The first term in equation (17) represents the bias due to the imbalance of the prognostic score (Hansen, 2008). Note that the estimation of this quantity is difficult since f_0 is typically unknown. Instead, we embed f_0 in a unit-ball RKHS, F_K, and consider an f_0 that maximizes this bias term. As we show in Appendix A.4, minimizing this worst-case quantity leads to the following optimization problem, which is of the same form as the MMD minimization problem given in equation (11),

  min_α  γ_K(F̂_α, Ĝ_α)  s.t.  α ∈ A_simplex.  (19)

Thus, SVM can also be viewed as a regularization method for minimizing prognostic imbalance. However, SVM does not address the second term of the conditional bias in equation (17).
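For concreteness, the estimator in equation (16) and the weights v_i in equation (18) can be sketched in a few lines of pure Python; all data below are hypothetical toy values rather than SVM output.

```python
def weighted_dim(alpha, Y, T):
    """Equation (16): weighted difference-in-means, sum_T a_i Y_i - sum_C a_i Y_i.
    The weights are assumed to lie in the simplex set of equation (10)."""
    return sum(a * y if t == 1 else -a * y for a, y, t in zip(alpha, Y, T))

def v_weights(T, estimand="SATE"):
    """Equation (18): v_i = 1/N for the SATE and v_i = T_i/n_T for the SATT."""
    N = len(T)
    if estimand == "SATE":
        return [1.0 / N] * N
    n_T = sum(T)
    return [t / n_T for t in T]

Y = [3.0, 5.0, 2.0, 4.0]        # hypothetical outcomes
T = [1, 1, 0, 0]                # treatment indicators
alpha = [0.5, 0.5, 0.25, 0.75]  # simplex weights: each group sums to one
print(weighted_dim(alpha, Y, T))       # 4.0 - 3.5 = 0.5
print(v_weights(T, estimand="SATT"))   # [0.5, 0.5, 0.0, 0.0]
```

Note that with α_i = 1 for all treated units, the renormalized treated weights coincide with the SATT choice v_i = T_i/n_T, which is the situation discussed next.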
This term corresponds to the bias due to extrapolation outside of the weighted treatment group to the population of interest. If the SATT is the target quantity, the term represents the bias due to the difference between the weighted and unweighted CATE for the treatment group. Thus, the prognostic balance achieved by SVM may come at the expense of this CATE bias, which represents the discrepancies between the weighted covariate distribution for the treatment group and the unweighted covariate distribution for the sample of interest. This is a direct consequence of the trade-off between balance and effective sample size, which is controlled by the regularization parameter as shown earlier.

To illustrate this point more clearly, suppose that the target estimand is the SATT and n_T ≤ n_C, a scenario often encountered in practice. In this case, the initial solution on the regularization path for SVM is given by α_i = 1 for i ∈ T. This implies that the renormalized simplex weights for the treated observations are given by α̃_i = n_T^{−1} for i ∈ T, and hence the second bias term in equation (17) vanishes. The remaining weights are then chosen such that the imbalance in the prognostic score is minimized. Since the treated group is unmodified in this setting, imbalance between the unweighted treated and reweighted control covariate distributions will be at its largest value on the path, since balancing is more difficult in the unpruned data. As the penalty for imbalance is increased, the SVM solution will trim both treated and control samples that are difficult to balance. This pruning improves balance but may increase the CATE bias.

2.8 Relation to Kernel Balancing Methods

A closely related set of methods recently proposed in the causal inference literature is kernel balancing (Hazlett, 2020; Kallus et al., 2018; Wong and Chan, 2018). We now discuss the relations between SVM and existing kernel balancing methods.
Consider the following alternative decomposition of the conditional bias,

  E(τ̂ − τ | {X_i, T_i}_{i=1}^N) = Σ_{i=1}^N (α_i T_i − v_i) f_1(X_i) + Σ_{i=1}^N {v_i − α_i(1 − T_i)} f_0(X_i).  (20)

To minimize this bias, kernel balancing methods restrict f_0 and f_1 to the RKHS ball F_K := {(f_0, f_1) ∈ H_K × H_K : √(‖f_0‖²_{H_K} + ‖f_1‖²_{H_K}) ≤ c} and consider minimizing the largest bias under the pair (f_0, f_1) ∈ F_K. This problem is given by

  min_α  sup_{(f_0,f_1)∈F_K} [ Σ_{i=1}^N (α_i T_i − v_i) f_1(X_i) − Σ_{i=1}^N {α_i(1 − T_i) − v_i} f_0(X_i) ],  s.t.  α ∈ A,  (21)

where A denotes the constraints on the weights. For example, Kallus et al. (2018) restrict them to A_simplex, whereas Wong and Chan (2018) essentially use α_i ≥ N^{−1}, though their formulation is slightly different from that given above.

As shown in Kallus et al. (2018), the problem of minimizing this worst-case conditional bias amounts to computing weights that balance the treatment and control covariate distributions with respect to the empirical distribution for the population of interest. Let F̂_α and Ĝ_α denote the weighted empirical covariate distributions for the treatment and control groups, respectively, and let Ĥ_v represent the empirical covariate distribution corresponding to the population of interest. Then, if F_K is also restricted to the unit ball, i.e., c = 1 (fixing the size of f_0 and f_1 is necessary since the bias scales linearly with ‖f_0‖_{H_K} and ‖f_1‖_{H_K}), the optimization problem in equation (21) can be written in terms of the minimization of the empirical MMD statistics:

  min_α  γ_K(F̂_α, Ĥ_v) + γ_K(Ĝ_α, Ĥ_v),  s.t.  α ∈ A.  (22)

The objective in equation (22) does not contain a measure of distance between the conditional covariate distributions F̂_α and Ĝ_α.
Instead, balance between these two distributions is indirectly encouraged through balancing each one individually with respect to the target distribution Ĥ_v. This is in contrast with SVM, which directly balances the covariate distribution between the treatment and control groups.

3 Simulation Studies

In this section, we examine the performance of SVM in ATE estimation under two different simulation settings. We also examine the connection between SVM and the QIP for the largest balanced subset.
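The contrast between the integer program and its continuous relaxation can be seen on a toy problem. The sketch below is our own illustration (with a hypothetical penalty value ν and a linear kernel, not the paper's code): it minimizes a dual-style objective α⊤Qα − ν1⊤α with Q_ij = W_iW_jK_ij, once by brute force over binary weights satisfying the balance constraint, and once over the continuous relaxation 0 ≤ α ≤ 1; by construction the relaxed optimum is never worse:

```python
import itertools
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 8
X = rng.normal(size=(n, 2))
W = np.array([1, 1, 1, 1, -1, -1, -1, -1])   # +1 for treated, -1 for control
K = X @ X.T                                   # linear kernel
Q = np.outer(W, W) * K                        # Q_ij = W_i W_j K_ij
nu = 0.5                                      # hypothetical penalty value

def objective(a):
    return a @ Q @ a - nu * a.sum()

# Integer problem: alpha in {0, 1}^n with equal treated and control weight sums.
best_int = np.inf
for bits in itertools.product([0, 1], repeat=n):
    a = np.array(bits, dtype=float)
    if a[W == 1].sum() == a[W == -1].sum():
        best_int = min(best_int, objective(a))

# Continuous relaxation: 0 <= alpha <= 1 with the balance constraint sum_i W_i alpha_i = 0.
res = minimize(objective, x0=np.full(n, 0.5), bounds=[(0.0, 1.0)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ W}],
               method="SLSQP")
best_cont = float(res.fun)
```

Because every integer-feasible weight vector is also feasible for the relaxation, `best_cont <= best_int` always holds; the comparisons in this section examine how large that gap is in practice.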
3.1 Setup

We consider two simulation setups used in previous studies. Simulation A comes from Lee et al. (2010), who use a slightly modified version of the simulations presented in Setoguchi et al. (2008). Specifically, we adopt the exact setup corresponding to their "scenario G," which is briefly summarized here; we refer readers to the original article for the exact specification. For each simulated dataset, we generate 10 covariates X_i = (X_{i1}, ..., X_{i10})^⊤ from the standard normal distribution, with correlation introduced between four pairs of variables. Treatment assignment is generated according to P(T_i = 1 | X_i) = expit(β^⊤ f(X_i)), where β is a coefficient vector and f(X_i) controls the degree of additivity and linearity in the true propensity score model. This scenario uses a true propensity score model with a moderate amount of non-linearity and non-additivity. The outcome model is linear in the observed covariates with a constant, additive treatment effect: Y_i(T_i) = γ_0 + γ^⊤ X_i + τ T_i + ε_i, with a negative constant effect τ and Gaussian noise ε_i.

Simulation B is based on Wong and Chan (2018). For each simulated dataset, we generate Z_i = (Z_{i1}, ..., Z_{i10})^⊤ from the standard normal distribution. The observed covariates are defined as nonlinear functions of these variables, i.e., X_i = (X_{i1}, ..., X_{i10})^⊤, where X_{i1} = exp(Z_{i1}/2), X_{i2} = Z_{i2}/[1 + exp(Z_{i1})], X_{i3} = (Z_{i1} Z_{i3}/25 + 0.6)^3, X_{i4} = (Z_{i2} + Z_{i4} + 20)^2, and X_{ij} = Z_{ij} for j = 5, ..., 10. Treatment assignment follows a logistic (expit) model that is linear in the latent variables Z_i, corresponding to Model 1 of Wong and Chan (2018). Finally, the outcome model is specified as Y_i(T_i) = 200 + 10 T_i + (1.5 T_i − 0.5)(27.4 Z_{i1} + 13.7 Z_{i2} + 13.7 Z_{i3} + 13.7 Z_{i4}) + ε_i, with mean-zero Gaussian noise ε_i, so that the treatment effect is heterogeneous.

3.2 Comparison between SVM and SVM-QIP

We begin by examining the connection between SVM and SVM-QIP by comparing the solutions obtained on one simulated dataset of N = 500 units generated according to Simulation A. Specifically, we first compute the SVM path using the path algorithm described in Section 2.6, obtaining a set of regularization parameter breakpoints λ. Next, we compute the SVM-QIP solution at each of these breakpoints using the Gurobi optimization software (Gurobi Optimization, 2020). We limit the solver to 5 minutes of runtime for each problem; finding the exact integer-valued solution under a given λ requires a significant amount of time, but a good approximation can typically be found in a few seconds.

For both methods, we compute the objective function value at each of the breakpoints as well as the coverage of the SVM-QIP solution by the SVM solution. The latter represents the proportion of units with non-zero SVM weights that are included in the largest balanced subset identified by SVM-QIP. Formally, the coverage is defined as

\mathrm{cvg}(\lambda) = \frac{|\lceil \alpha^{\text{SVM}}(\lambda) \rceil \cap \alpha^{\text{SVM-QIP}}(\lambda)|}{|\alpha^{\text{SVM-QIP}}(\lambda)|}.

In order to examine the effects of separability on the quality of the approximation, we perform the above analysis using three different types of features. Specifically, we use a linear kernel with the untransformed covariates (linear), a linear kernel with the degree-2 polynomial features formed by concatenating the original covariates with all two-way interactions and squared terms (polynomial), and the Gaussian RBF kernel with its scale parameter chosen according to the median heuristic (RBF).
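The median heuristic mentioned above sets the RBF scale parameter to the median pairwise Euclidean distance between observations. A minimal numpy sketch of this choice, assuming one common parameterization of the Gaussian RBF kernel (the helper name is our own):

```python
import numpy as np

def median_heuristic_rbf(X):
    """RBF Gram matrix with the bandwidth set to the median pairwise
    Euclidean distance among distinct rows of X (the median heuristic)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    dists = np.sqrt(sq[np.triu_indices_from(sq, k=1)])
    h = np.median(dists)                       # median heuristic bandwidth
    return np.exp(-sq / (2.0 * h ** 2)), h

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 10))
Xs = (X - X.mean(axis=0)) / X.std(axis=0)      # column-standardize, as in the text
K, h = median_heuristic_rbf(Xs)
```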
In all cases, we scale the input feature matrix so that the columns have mean 0 and standard deviation 1 before performing the kernel computation.

Figure 1 shows that the objective values for the SVM and SVM-QIP solutions are close when the penalty on balance λ^{-1} is small, with divergence between the two methods occurring towards the end of the regularization path. In the linear case, the paths for the two methods are nearly identical, suggesting that the solutions of the two problems are essentially the same. Divergence in the polynomial and RBF settings is more pronounced due to greater separability in the transformed covariate space, which is more difficult to balance without non-integer weights. When λ is very small, we also find that SVM-QIP returns α = 0, indicating that the penalty on balance is too great. Lastly, the effects of approximating the SVM-QIP solution are reflected in the RBF setting, where upon close inspection the objective value appears to be somewhat noisy and non-monotonic.

Figure 1: Comparison of the objective value between SVM and SVM-QIP under the (a) linear, (b) polynomial, and (c) RBF settings. The blue line denotes the objective value of the SVM solution, and the orange dotted line denotes the objective value of the SVM-QIP solution.

Figure 2: The proportion of samples in the SVM-QIP largest balanced subset covered by the SVM solution under the (a) linear, (b) polynomial, and (c) RBF settings. The instances of zero coverage in the polynomial and RBF settings represent the cases where the SVM-QIP fails to find a nontrivial solution.

Interestingly, the coverage plots in Figure 2 show that even when the objective values of the two methods diverge, the SVM solution still predominantly covers the SVM-QIP solution. The regions with zero coverage in the polynomial and RBF settings correspond to instances where the balance penalty is so large that a nontrivial solution cannot be found for the SVM-QIP. This result illustrates that SVM approximates one-to-one matching by augmenting a well-balanced matched subsample with some non-integer weights, which increases the subset size while preserving the overall balance within the subsample.

Figure 3: ATE estimates for Simulations A (top) and B (bottom) over the SVM regularization path under the (a) linear, (b) polynomial, and (c) RBF settings. The boxplots represent the distribution of the ATE estimates over the Monte Carlo simulations. The red dashed line corresponds to the true ATE.
Next, we evaluate the performance of SVM in estimating the ATE for Simulations A and B. For each scenario, we generate 1,000 datasets with N = 500 samples. For each simulated dataset, we compute the ATE estimate over a fixed grid of 100 λ values chosen based on the simulation scenario and input feature. As described in Section 3.2, we use the linear, polynomial, and RBF-induced features, standardizing the covariate matrix before passing it to the kernel in all cases.

In Figure 3, we plot the distribution of ATE estimates over the Monte Carlo simulations against the regularization parameter λ. The results for Simulation A (top panel) show that the bias approaches zero as the penalty on balance increases (λ decreases). This behavior is due to the fact that the conditional bias of the estimate under the outcome model for Simulation A is given by

E[\hat{\tau} - \tau \mid X^N, T^N] = \gamma^\top \left( \sum_{i=1}^N \alpha_i W_i X_i \right),

where W_i = 2T_i − 1. This implies that all bias comes from prognostic score imbalance. This quantity becomes smallest when minimizing ‖Σ_{i=1}^N α_i W_i X_i‖, which is controlled by the regularization parameter λ in the SVM dual objective under the linear setting. Note that this quantity is also small under both the polynomial and RBF input features.

We also find that under all three settings there is relatively little change in the variance of the estimates along most of the path, suggesting that the variance gained from trimming the sample is counteracted by the variance decrease from correcting for heteroscedasticity. The exception to this observation occurs at the beginning of the linear case, where the reduction in bias also reduces the variance, and at the end of the RBF path, where the amount of trimming is so substantial relative to the balance gained that the variance increases.

For Simulation B (bottom panel), we also observe that the bias decreases as the penalty on balance increases.
However, due to misspecification, nonlinearity, and the presence of treatment effect heterogeneity in the outcome model, the bias never decays to zero, as shown in Section 2.7. We also find that SVM with the linear kernel can reduce bias as well as the other kernels, suggesting that SVM is robust to misspecification and nonlinearity in the outcome model. Similar to Simulation A, we also observe relatively small changes in the variance as the constraint on balance increases, except at the end of the RBF path where there is substantial sample pruning.

Next, we compare the performance of SVM with that of other methods. Our results below show that the performance of SVM is comparable to that of related state-of-the-art covariate balancing methods available in the literature. In particular, we consider kernel optimal matching (KOM; Kallus et al., 2018), kernel covariate balancing (KCB; Wong and Chan, 2018), cardinality matching (CARD; Zubizarreta et al., 2014), and inverse propensity score weighting (IPW) based on logistic regression (GLM) and random forest (RFRST), both of which were used in the original simulation study by Lee et al. (2010). For SVM, we compute solutions at fixed values of λ^{-1} for each setting: 0.10 and 2.60 for Simulation A under the polynomial and RBF settings, and 1.92 and 10.48 for Simulation B under the polynomial and RBF settings, with analogously chosen values under the linear setting. These values are taken from the grid of λ values used in the simulation, based on visual inspection of the path plots in Figure 3 around where the estimate curve flattens out.

For KOM, we compute weights under the linear, polynomial, and RBF settings described earlier with the default settings of the provided code. For KCB, we compute weights using the RBF kernel with its default settings; while KCB allows for other kernel functions, it was originally designed for use with RBF and Sobolev kernels, and we find its results to be poor when using the linear and polynomial features. For CARD, we used thresholds of 0.01 and 0.1 times the standardized difference-in-means. The IPW propensity scores are estimated via logistic regression and random forest, following Lee et al. (2010).

Figure 4: Boxplots of the ATE estimates for Simulations A (left) and B (right). The hatch pattern denotes the input feature (Linear, Polynomial, or RBF). The methods are kernel optimal matching (KOM; Kallus et al., 2018), kernel covariate balancing (KCB; Wong and Chan, 2018), cardinality matching (CARD; Zubizarreta et al., 2014), and IPW with propensity score modeling via logistic regression (GLM) and random forest (RFRST). The red dashed line corresponds to the true ATE.

Figure 4 plots the distributions of the effect estimates over the 1,000 simulated datasets for both scenarios. Simulation A (left panel) shows comparable performance across all methods, with SVM and KOM having the best performance in terms of both bias and variance. In particular, SVM achieves near zero bias under all three input features. The results for KCB show that it performs slightly worse than the other kernel methods, with greater bias and variance under the RBF setting.

The results for CARD show near identical performance to SVM under the linear setting; however, the results under the polynomial setting are notably worse. The reason for this is the choice of the balance threshold, which was set to 0.1 times the standardized difference-in-means of the input feature matrix. Although decreasing the scalar below 0.1 would lead to a more balanced matching, we found that the algorithm was unable to consistently find a solution for all datasets with scalar multiples smaller than 0.1. This result highlights the main issue with defining balance dimension-by-dimension, which makes it difficult to enforce small overall imbalance without information on the underlying geometry of the data. Lastly, the propensity score methods show the worst performance. This is somewhat expected, as the true propensity score model is more complicated than the true outcome model under this simulation setting.

We note that a further reduction in the variance of the SVM solution while preserving bias is likely possible with a more principled method of choosing the solution for each simulated dataset. In general, a value of λ that works well for one dataset may not work well for another, and a better approach would examine estimates over the path and the balance-sample size curves for each dataset individually.
Nevertheless, our heuristic procedure for selecting a solution produced high-quality results, demonstrating the strength of SVM as a balancing method.

The results for Simulation B (right panel) show somewhat more varied performance across methods. Amongst the kernel methods, we find that KOM has the best performance under the polynomial and RBF settings, achieving near zero bias in these scenarios, while SVM has the best performance under the linear setting. The discrepancy under the linear setting is due to misspecification, which leads to a poor regularization parameter choice and consequently poor balance and bias under the KOM procedure.

We also find that SVM is unable to drive the bias to zero, which is due to the treatment effect heterogeneity in the outcome model. As discussed in Section 2.7, SVM ignores the second term in the conditional bias decomposition (17), which is zero under the constant additive treatment effect of Simulation A but is nonzero in Simulation B. In contrast, KOM targets both bias terms in its formulation, which leads to greater bias reduction.

In comparison to the other kernel methods, we find that KCB has comparable bias to SVM but greater variance. For CARD, we observe results comparable to SVM under the linear setting, but again worse performance under the polynomial setting for the reasons mentioned above. Lastly, we find mixed results between the two propensity score methods: logistic regression (GLM) has the worst performance of all methods, while random forest (RFRST) exhibits the second best. This result is likely due to the simple structure of the true propensity score model, whose nonlinearity can only be accurately modeled by RFRST.

4 Empirical Application

In this section, we apply the proposed methodology to the right heart catheterization (RHC) data set originally analyzed in Connors et al. (1996).
This observational data set was used to study the effectiveness of right heart catheterization, a diagnostic procedure, for critically ill patients. The key result from the study was that, after adjusting for a large number of pre-treatment covariates, right heart catheterization appeared to reduce survival rates. This finding contradicts the existing medical perception that the procedure is beneficial.
The data set consists of 5,735 patients, with 2,184 of them assigned to the treatment group and 3,551 assigned to the control group. For each patient, we observe the treatment status, which indicates whether or not he/she received catheterization within 24 hours of hospital admission. The outcome variable represents death within 30 days. Finally, the data set contains a total of 72 pre-treatment covariates that are thought to be related to the decision to perform right heart catheterization. These variables include background information about the patient, such as age, sex, and race, indicator variables for primary/secondary diseases and comorbidities, and various measurements from medical test results.

We compute the full SVM regularization paths under the linear, polynomial, and RBF settings described in Section 3.1. In forming the polynomial features, we exclude all trivial interactions (e.g., interactions between categories of the same categorical variable) and squares of binary-valued covariates. For comparison, we also compute the KOM weights under all three settings, the KCB weights under the RBF setting, and the CARD weights under the linear and polynomial settings with a threshold fixed to 0.1 times the standardized difference-in-means.
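The standardized difference-in-means that defines CARD's balance threshold can be sketched as follows. This is our own generic implementation, not the paper's: it uses a pooled unweighted standard deviation in the denominator as one common convention, and the exact convention used by the authors may differ:

```python
import numpy as np

def standardized_diff_in_means(X, T, w=None):
    """Per-covariate weighted difference in means between treated (T == 1)
    and control (T == 0), scaled by the pooled unweighted standard deviation."""
    if w is None:
        w = np.ones(len(T))
    wt, wc = w * (T == 1), w * (T == 0)
    mean_t = (wt @ X) / wt.sum()
    mean_c = (wc @ X) / wc.sum()
    pooled = np.sqrt((X[T == 1].var(axis=0) + X[T == 0].var(axis=0)) / 2.0)
    return (mean_t - mean_c) / pooled

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
T = rng.integers(0, 2, size=200)
smd = standardized_diff_in_means(X, T)   # one entry per covariate
```

Passing balancing weights through `w` gives the weighted version of the diagnostic used in the figures of this section.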
Figure 5 plots the ATE estimates over the SVM regularization paths with pointwise 95% confidence intervals based on the weighted Neyman variance estimator (Imbens and Rubin, 2015, Chapter 19). The horizontal axis represents the normed difference-in-means within the weighted subset as a covariate balance measure. For all three settings, we find that the estimated ATE slightly increases as the weighted subset becomes more balanced, supporting the results originally reported in Connors et al. (1996) that right heart catheterization decreased survival rates.

Figure 5: ATE estimates for the RHC data over the SVM regularization path under the (a) linear, (b) polynomial, and (c) RBF settings. The horizontal axis represents the normed difference-in-means in covariates within the weighted subset. The solid blue line denotes the average estimate, and the solid gray background denotes the pointwise 95% confidence intervals.

Figure 6: Trade-off between balance and effective sample size under the (a) linear, (b) polynomial, and (c) RBF settings. The black dashed line indicates the estimated elbow point.

Figure 6 illustrates the trade-off between the balance measure (the normed difference-in-means in covariates within the weighted subset) and the effective subset size, i.e., the balance-sample size frontier. Such graphs can be useful to researchers in selecting a solution along the regularization path for estimating the ATE. Across all cases, we achieve a good amount of balance improvement once the data set is pruned to about 3,500 units, which occurs around where the trade-off between subset size and balance becomes less favorable.

We also examine differences in dimension-by-dimension balance between SVM and CARD and between SVM and KOM under the linear and polynomial settings. We do not conduct such a comparison for RBF, whose feature space is infinite dimensional. Here, we consider four different SVM solutions: the largest subset size solution whose standardized difference-in-means in covariates is below 0.1 for all dimensions, the solution whose effective sample size is nearest the subset size of the other method, the solution whose normed difference-in-means in covariates is closest to that of the other method, and the solution occurring at the kneedle estimate of the elbow of the balance-weight sum curve.
We take the minimum-balance solution when no elbow exists, as in the linear case. The effective sample size is computed according to Kish's formula:

N_e = \frac{\left( \sum_{i \in \mathcal{T}} \alpha_i \right)^2}{\sum_{i \in \mathcal{T}} \alpha_i^2} + \frac{\left( \sum_{i \in \mathcal{C}} \alpha_i \right)^2}{\sum_{i \in \mathcal{C}} \alpha_i^2}.  (23)

Figure 7 presents the covariate balance comparisons between SVM and CARD for both the linear and polynomial settings. Comparing against the small difference-in-means solution (leftmost column), for which the standardized difference-in-means for all covariates are below 0.1, CARD retains more observations in its selected subset, but SVM achieves better covariate balance than CARD for most dimensions, although there are some large imbalances. This is expected because SVM minimizes the overall covariate imbalance without a per-dimension constraint as in CARD. We observe a similar result when comparing CARD with the SVM solutions based on the closest effective sample size (left-middle column) and the closest normed difference-in-means (right-middle column). It is notable that the latter generally achieves better covariate balance while retaining more observations than CARD. Finally, the results for the elbow SVM solution (rightmost column) show that tight balance is attainable with a moderate amount of sample pruning, with near exact balance in the linear setting. This level of covariate balance is difficult to achieve with CARD due to the infeasibility of the optimization, particularly in high dimensional settings.

Figure 8 shows the dimension-by-dimension balance comparison of the KOM solution against the SVM solution. Here, we see that under the linear setting, KOM retains significantly more units than the SVM solution while attaining the same balance. This is due to the Σ_i α_i W_i = 0 constraint of SVM, which encourages the selected subset to have a roughly equal proportion of treated and control units, while the KOM solution allows the resulting subset to be more imbalanced.
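Kish's formula in equation (23) is straightforward to compute; a minimal numpy sketch with hypothetical weights (the helper name is our own):

```python
import numpy as np

def kish_effective_sample_size(alpha, T):
    """Effective sample size of equation (23): Kish's formula applied
    separately to the treated (T == 1) and control (T == 0) weights."""
    ess = 0.0
    for group in (T == 1, T == 0):
        a = alpha[group]
        if a.sum() > 0:
            ess += a.sum() ** 2 / (a ** 2).sum()
    return ess

alpha = np.array([1.0, 1.0, 0.5, 0.0, 1.0, 0.25, 0.0, 1.0])
T = np.array([1, 1, 1, 1, 0, 0, 0, 0])
ne = kish_effective_sample_size(alpha, T)
```

With uniform 0/1 weights the formula reduces to the number of selected units, so fractional weights always yield an effective size no larger than the raw subset size.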
Under the polynomial setting, however, we find that SVM retains significantly more units than KOM while achieving a similar degree of covariate balance, which is likely due to a poor regularization parameter choice by the KOM algorithm.

Figure 7: Standardized difference-in-means of covariates: comparisons between SVM and cardinality matching (CARD) under the linear (top) and polynomial (bottom) settings with different SVM solutions: (a) standardized difference-in-means less than 0.1 in all covariates, (b) effective sample size closest to that of CARD, (c) normed difference-in-means closest to that of CARD, and (d) elbow of the regularization path. The effective sample size for each method is given as N_e in the parentheses (for the SVM solutions, N_e = 3830, 4186, 4367, and 3442 under the linear setting and N_e = 3777, 4071, 4304, and 3325 under the polynomial setting for panels (a) through (d), respectively). Note that darker areas correspond to higher concentrations of points.

Lastly, we compare the point estimates of the ATE, the weighted Neyman standard errors, and the effective sample sizes for SVM, CARD (with the linear and polynomial features), KOM, and KCB (with RBF) in Table 1. We consider three different solutions from the SVM path: SVM_imbalance, which corresponds to the initial solution for which the balance constraint is most relaxed and α_i = 1 for i ∈ T; SVM_balance, which corresponds to the most regularized solution with the best covariate balance on the path; and SVM_elbow, which corresponds to the solution occurring at the elbow of the balance-weight sum curves shown in Figure 6.

The results show that SVM leads to a positive estimate in all cases, which agrees with the original finding reported in Connors et al. (1996).
We also find that the three SVM solutions differ most significantly in their standard errors, which increase as the constraint on balance becomes stronger and the subset is more heavily pruned, as shown in the effective sample size column. In particular, the heavily balanced SVM_balance solution under the RBF setting leads to a 95% confidence interval that overlaps with zero. This is in contrast with the less balanced SVM_elbow solution, which has both a larger effect estimate and a smaller confidence interval. This result demonstrates the necessity of computing the regularization path, so that researchers may avoid low-quality solutions due to poor parameter choice.

Figure 8: Standardized difference-in-means comparisons between SVM and kernel optimal matching (KOM) under the linear (left) and polynomial (right) settings. The effective sample sizes of the SVM solutions are N_e = 3443 (linear) and N_e = 3356 (polynomial).

Comparing against the other methods, we observe that KOM yields larger positive effect estimates with comparable standard errors in the linear and RBF settings. However, under the polynomial setting, its standard error is much larger and the sample is significantly more pruned than for the modestly balanced SVM_elbow solution. Both CARD and KCB produce smaller positive effect estimates, with the standard error for KCB leading to a 95% confidence interval that overlaps with zero.
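The point estimates and standard errors discussed above combine a weighted difference in means with a weighted Neyman variance. The sketch below shows one common form of such an estimator; it is our own simplification and not necessarily the exact estimator of Imbens and Rubin (2015) used in the paper:

```python
import numpy as np

def weighted_ate_and_se(Y, T, alpha):
    """Hajek-style weighted difference in means with a Neyman-style
    standard error (one common form, not necessarily the paper's exact one)."""
    wt = alpha * (T == 1)
    wc = alpha * (T == 0)
    wt, wc = wt / wt.sum(), wc / wc.sum()        # normalize within each arm
    mu_t, mu_c = wt @ Y, wc @ Y
    var_t = (wt ** 2) @ ((Y - mu_t) ** 2)        # weighted within-arm variances
    var_c = (wc ** 2) @ ((Y - mu_c) ** 2)
    return mu_t - mu_c, np.sqrt(var_t + var_c)

rng = np.random.default_rng(4)
n = 300
T = rng.integers(0, 2, size=n)
Y = 1.0 * T + rng.normal(size=n)          # true effect of 1 for illustration
alpha = rng.uniform(0.5, 1.0, size=n)     # arbitrary nonnegative weights
ate, se = weighted_ate_and_se(Y, T, alpha)
```

As the weights concentrate on fewer units, the squared-weight terms grow, which is the mechanism behind the larger standard errors of the heavily pruned solutions above.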
Table 1: ATE estimates, weighted Neyman standard errors, and effective sample sizes for the RHC data under each method and input feature. The three SVM solutions are the solution with the best covariate balance on the path (SVM_balance), the elbow solution (SVM_elbow), and the solution with the worst covariate balance on the path (SVM_imbalance).

5 Concluding Remarks

In this paper, we show how support vector machines (SVMs) can be used to compute covariate balancing weights and estimate causal effects. We establish a number of interpretations of SVM as a covariate balancing procedure. First, the SVM dual problem computes weights that minimize a trade-off between the MMD and a measure of subset size while simultaneously maximizing the effective sample size. Second, the SVM dual problem can be viewed as a continuous relaxation of the largest balanced subset problem, which is closely related to cardinality matching. Lastly, similar to the existing kernel balancing methods, SVM weights were shown to minimize the worst-case bias due to prognostic score imbalance. Additionally, path algorithms can be used to compute the entire set of SVM solutions as the regularization parameter varies, which constitutes a balance-sample size frontier. These methods provide researchers with a characterization of the balance-sample size trade-off and allow them to visualize how causal effect estimates vary as the constraint on balance changes.

Our work suggests several possible directions for future research. On the algorithmic side, a disadvantage of the proposed methodology is that it encourages a roughly equal effective number of treated and control units in the optimal subset, which can lead to unnecessary sample pruning. One could use weighted SVM (Lin and Wang, 2002) to address this problem, but existing path algorithms are applicable only to unweighted SVM. On the theoretical side, the results in this paper suggest a fundamental connection between the support vectors and the set of overlap, i.e., {X : 0 < e(X) < 1}.
Steinwart (2004) shows that the fraction of support vectors for a variant of the SVM discussed here asymptotically approaches the measure of this overlap set, suggesting that SVM may be used to develop a statistical test for the overlap assumption.

References

Athey, S., Imbens, G. W., and Wager, S. (2018). Approximate residual balancing: debiased inference of average treatment effects in high dimensions.
Journal of the Royal Statistical Society, Series B, Methodological, (4), 597–623.

Chan, K. C. G., Yam, S. C. P., and Zhang, Z. (2016). Globally efficient nonparametric inference of average treatment effects by empirical balancing calibration weighting. Journal of the Royal Statistical Society, Series B, Methodological, 673–700.

Connors, A. F., Speroff, T., Dawson, N. V., Thomas, C., Harrell, F. E., Wagner, D., Desbiens, N., Goldman, L., Wu, A. W., Califf, R. M., et al. (1996). The effectiveness of right heart catheterization in the initial care of critically ill patients. JAMA, (11), 889–897.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, (3), 273–297.

Dinkelbach, W. (1967). On nonlinear fractional programming. Management Science, (7), 492–498.

Fan, J., Imai, K., Liu, H., Ning, Y., and Yang, X. (2016). Improving covariate balancing propensity score: A doubly robust and efficient approach. Technical report, Princeton University.

Ghosh, D. (2018). Relaxed covariate overlap and margin-based causal effect estimation. Statistics in Medicine, (28), 4252–4265.

Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. J. (2007a). A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, pages 513–520.

Gretton, A., Fukumizu, K., Teo, C., Song, L., Schölkopf, B., and Smola, A. (2007b). A kernel statistical test of independence. Advances in Neural Information Processing Systems, 585–592.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. The Journal of Machine Learning Research, (1), 723–773.

Gurobi Optimization, L. (2020). Gurobi optimizer reference manual.

Hainmueller, J. (2012). Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, pages 25–46.

Hansen, B. B. (2008). The prognostic analogue of the propensity score. Biometrika, (2), 481–488.

Hastie, T., Rosset, S., Tibshirani, R., and Zhu, J. (2004). The entire regularization path for the support vector machine. Journal of Machine Learning Research, 1391–1415.

Hazlett, C. (2020). Kernel balancing: A flexible non-parametric weighting procedure for estimating causal effects. Statistica Sinica, (3), 1155–1189.

Ho, D. E., Imai, K., King, G., and Stuart, E. A. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis, (3), 199–236.

Imai, K. and Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society, Series B (Statistical Methodology), (1), 243–263.

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.

Kallus, N. (2020). Generalized optimal matching methods for causal inference. Journal of Machine Learning Research, (62), 1–54.

Kallus, N., Pennicooke, B., and Santacatterina, M. (2018). More robust estimation of sample average treatment effects using kernel optimal matching in an observational study of spine surgical interventions. arXiv preprint arXiv:1811.04274.

King, G. and Zeng, L. (2006). The dangers of extreme counterfactuals. Political Analysis, (2), 131–159.

King, G., Lucas, C., and Nielsen, R. A. (2017). The balance-sample size frontier in matching methods for causal inference. American Journal of Political Science, (2), 473–489.

Lee, B. K., Lessler, J., and Stuart, E. A. (2010). Improving propensity score weighting using machine learning. Statistics in Medicine, (3), 337–346.

Li, F., Morgan, K. L., and Zaslavsky, A. M. (2018). Balancing covariates via propensity score weighting. Journal of the American Statistical Association, (521), 390–400.

Lin, C.-F. and Wang, S.-D. (2002). Fuzzy support vector machines. IEEE Transactions on Neural Networks, (2), 464–471.

Lunceford, J. K. and Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine, (19), 2937–2960.

Ning, Y., Peng, S., and Imai, K. (2020). Robust estimation of causal effects via high-dimensional covariate balancing propensity score. Biometrika, (3), 533–554.

Ratkovic, M. (2014). Balancing within the margin: Causal effect estimation with support vector machines. Technical report, Department of Politics, Princeton University, Princeton, NJ.

Rubin, D. B. (1990). Comments on "On the application of probability theory to agricultural experiments. Essay on principles. Section 9" by J. Splawa-Neyman, translated from the Polish and edited by D. M. Dabrowska and T. P. Speed. Statistical Science, 472–480.

Rubin, D. B. (2006). Matched Sampling for Causal Effects. Cambridge University Press, Cambridge.

Schaible, S. (1976). Fractional programming. II, On Dinkelbach's algorithm. Management Science, (8), 868–873.

Schölkopf, B., Smola, A. J., Bach, F., et al. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.

Sentelle, C., Anagnostopoulos, G., and Georgiopoulos, M. (2016). A simple method for solving the SVM regularization path for semidefinite kernels. IEEE Transactions on Neural Networks and Learning Systems, (4), 709.

Setoguchi, S., Schneeweiss, S., Brookhart, M. A., Glynn, R. J., and Cook, E. F. (2008). Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiology and Drug Safety, (6), 546–555.

Sriperumbudur, B. K. (2011). Mixture density estimation via Hilbert space embedding of measures. In , pages 1027–1030. IEEE.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. The Journal of Machine Learning Research, 1517–1561.

Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., Lanckriet, G. R., et al. (2012). On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 1550–1599.

Steinwart, I. (2004). Sparseness of support vector machines—some asymptotically sharp bounds. In Advances in Neural Information Processing Systems, pages 1069–1076.

Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, (1), 1–21.

Tan, Z. (2010). Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika, (3), 661–682.

Tan, Z. (2020). Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data. Biometrika, (1), 137–158.

Wong, R. K. and Chan, K. C. G. (2018). Kernel-based covariate functional balancing for observational studies. Biometrika, (1), 199–213.

Zhao, Q. (2019). Covariate balancing propensity score by tailored loss functions. Annals of Statistics, (2), 965–993.

Zhu, Y., Savage, J. S., and Ghosh, D. (2018). A kernel-based metric for balance assessment. Journal of Causal Inference, (2).

Zubizarreta, J. R. (2015). Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association, (511), 910–922.

Zubizarreta, J. R., Paredes, R. D., Rosenbaum, P. R., et al. (2014). Matching for balance, pairing for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit high schools in Chile. The Annals of Applied Statistics, (1), 204–231.

Supplementary Appendix
A.1 Proof of Theorem 1
We prove this theorem by establishing several equivalent reformulations of the SVM dual problem. By equivalence, we mean that these problems differ from one another only in the scaling of the regularization parameter, implying that their regularization paths consist of the same set of solutions. More formally, given two optimization problems P1 and P2, we say that P1 and P2 are equivalent if a solution $\alpha^*$ for P1 can be used to construct a solution for P2.

Denote the SVM weight set
\[
\mathcal{A}_{\mathrm{SVM}} = \left\{ \alpha \in \mathbb{R}^N : 0 \preceq \alpha \preceq 1, \; \sum_{i \in \mathcal{T}} \alpha_i = \sum_{j \in \mathcal{C}} \alpha_j \right\},
\]
and consider a rescaled version of the SVM dual given in equation (4), which we label as P1:
\[
\min_{\alpha} \; \alpha^\top Q \alpha - \nu \mathbf{1}^\top \alpha \quad \text{s.t.} \quad \alpha \in \mathcal{A}_{\mathrm{SVM}}. \tag{P1}
\]
Note that for a given $\lambda$ and solution $\alpha^*$ to the original problem defined in (4) under $\lambda$, $\alpha^*$ is also a solution to the rescaled problem P1 under $\nu = 2\lambda$. This establishes the equivalence between these two problems.

We begin by proving the following lemma, which allows us to replace the squared seminorm term $\alpha^\top Q \alpha$ with $\sqrt{\alpha^\top Q \alpha}$ to obtain the problem
\[
\min_{\alpha} \; \sqrt{\alpha^\top Q \alpha} - \nu \mathbf{1}^\top \alpha \quad \text{s.t.} \quad \alpha \in \mathcal{A}_{\mathrm{SVM}}. \tag{P2}
\]
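To make the objects in this proof concrete, the rescaled dual (P1) can be solved numerically on toy data with a generic solver. The sketch below is purely illustrative: the data, the Gaussian kernel, and the value $\nu = 0.5$ are arbitrary assumptions, and scipy's general-purpose SLSQP routine stands in for the specialized path algorithm used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy data (hypothetical): 6 treated and 6 control units with 2 covariates,
# the treated group shifted so that covariates are imbalanced.
N = 12
T = np.array([1] * 6 + [0] * 6)   # treatment indicators T_i
W = 2 * T - 1                     # signed labels W_i = 2 T_i - 1
X = rng.normal(size=(N, 2)) + 0.5 * T[:, None]

# Q_ij = W_i W_j K(X_i, X_j) with a Gaussian kernel (bandwidth 1, illustrative).
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
Q = np.outer(W, W) * np.exp(-sq_dist)

nu = 0.5  # rescaled regularization parameter (arbitrary for illustration)

def objective(a):
    # Rescaled SVM dual objective of (P1).
    return a @ Q @ a - nu * a.sum()

# Feasible set A_SVM: 0 <= alpha_i <= 1 with equal treated and control sums.
balance = {"type": "eq", "fun": lambda a: a[T == 1].sum() - a[T == 0].sum()}
res = minimize(objective, x0=np.full(N, 0.5), bounds=[(0.0, 1.0)] * N,
               constraints=[balance], method="SLSQP")
alpha = res.x
```

The resulting weights lie in $\mathcal{A}_{\mathrm{SVM}}$ and attain a lower objective value than the feasible starting point, which is all (P1) requires for the arguments that follow.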
Lemma 1
The problems (P1) and (P2) are equivalent.
Proof
This result follows from the strong duality of the SVM dual problem, which allows us to form the following equivalent problem in which the penalized term $\mathbf{1}^\top \alpha$ is replaced with a hard constraint with threshold $\epsilon$:
\[
\min_{\alpha} \; \alpha^\top Q \alpha \quad \text{s.t.} \quad \alpha \in \mathcal{A}_{\mathrm{SVM}}, \; \mathbf{1}^\top \alpha \geq \epsilon. \tag{P3}
\]
The solution to (P3) is unchanged whether we minimize $\alpha^\top Q \alpha$ or $\sqrt{\alpha^\top Q \alpha}$, so the problem
\[
\min_{\alpha} \; \sqrt{\alpha^\top Q \alpha} \quad \text{s.t.} \quad \alpha \in \mathcal{A}_{\mathrm{SVM}}, \; \mathbf{1}^\top \alpha \geq \epsilon
\]
is identical to (P3). By strong duality, we can again enforce the hard constraint on the term $\mathbf{1}^\top \alpha$ through a penalized term with a new regularization parameter, which establishes the equivalence between (P1) and (P2). $\Box$

Next, we consider the fractional program
\[
\min_{\alpha} \; \frac{\sqrt{\alpha^\top Q \alpha}}{\mathbf{1}^\top \alpha / 2} \quad \text{s.t.} \quad \alpha \in \mathcal{A}_{\mathrm{SVM}}. \tag{P4}
\]
The following lemma connects (P4) to the reformulated SVM problem (P2) through Dinkelbach's method (Dinkelbach, 1967; Schaible, 1976):
Lemma 2 (Dinkelbach, 1967, Theorem 1) Suppose $\alpha_* \in \mathcal{A}_{\mathrm{SVM}}$ and $\alpha_* \neq 0$. Then
\[
q^* = \frac{\sqrt{\alpha_*^\top Q \alpha_*}}{\mathbf{1}^\top \alpha_* / 2} = \min_{\alpha \in \mathcal{A}_{\mathrm{SVM}}} \frac{\sqrt{\alpha^\top Q \alpha}}{\mathbf{1}^\top \alpha / 2}
\]
if, and only if,
\[
\min_{\alpha \in \mathcal{A}_{\mathrm{SVM}}} \sqrt{\alpha^\top Q \alpha} - q^* \, \mathbf{1}^\top \alpha / 2 = \sqrt{\alpha_*^\top Q \alpha_*} - q^* \, \mathbf{1}^\top \alpha_* / 2 = 0.
\]

Thus, the solution to the rescaled SVM dual problem (P2) under $\nu = q^*/2$ is also a solution to the fractional program (P4). Finally, consider the problem
\[
\min_{\alpha} \; \sqrt{\alpha^\top Q \alpha} \quad \text{s.t.} \quad \alpha \in \mathcal{A}_{\mathrm{simplex}}. \tag{P5}
\]
The following lemma establishes equivalence between (P4) and (P5) under the proper renormalization of the fractional program solution.

Lemma 3
Assume any solution $\alpha^*$ to (P4) is such that $\alpha^* \neq 0$. Then the problems (P4) and (P5) are equivalent.
Proof
Let $\alpha^{(1)} \neq 0$ and $\alpha^{(2)}$ be solutions to problems (P4) and (P5), respectively, and consider the vector-valued function $f : \mathcal{A}_{\mathrm{SVM}} \setminus \{0\} \mapsto \mathcal{A}_{\mathrm{simplex}}$, $f(\alpha) = \alpha / (\mathbf{1}^\top \alpha / 2)$. Since $\mathcal{A}_{\mathrm{simplex}} \subset \mathcal{A}_{\mathrm{SVM}} \setminus \{0\}$, $\alpha^{(2)}$ is feasible for (P4). Then by the optimality of $\alpha^{(1)}$, we have
\[
\frac{\sqrt{\alpha^{(1)\top} Q \alpha^{(1)}}}{\mathbf{1}^\top \alpha^{(1)} / 2} \leq \frac{\sqrt{\alpha^{(2)\top} Q \alpha^{(2)}}}{\mathbf{1}^\top \alpha^{(2)} / 2} = \sqrt{\alpha^{(2)\top} Q \alpha^{(2)}}.
\]
Next, note that $f(\alpha^{(1)})$ is feasible for (P5). Then by the optimality of $\alpha^{(2)}$, we have
\[
\sqrt{\alpha^{(2)\top} Q \alpha^{(2)}} \leq \sqrt{f(\alpha^{(1)})^\top Q f(\alpha^{(1)})} = \frac{\sqrt{\alpha^{(1)\top} Q \alpha^{(1)}}}{\mathbf{1}^\top \alpha^{(1)} / 2}.
\]
In order for both of these inequalities to be true, we must have
\[
\frac{\sqrt{\alpha^{(1)\top} Q \alpha^{(1)}}}{\mathbf{1}^\top \alpha^{(1)} / 2} = \sqrt{\alpha^{(2)\top} Q \alpha^{(2)}},
\]
so that $f(\alpha^{(1)})$ solves (P5) and $\alpha^{(2)}$ solves (P4). Note that the assumption $\alpha^{(1)} \neq 0$ is only a formality, since by Lemma 2 the trivial solution $\alpha = 0$ can occur only when $\nu < q^*/2$, which is outside the portion of the regularization path that we consider. $\Box$

We are now ready to prove Theorem 1. Part (i): Lemma 1 establishes equivalence between the regularization paths for the rescaled SVM dual (P1) and (P2). In addition, Lemma 2 establishes the existence of $\nu^*$ such that the solution to (P2) under $\nu^*$ is also a solution to (P4). It then follows that there exists $\lambda^*$ such that the solution to the rescaled SVM dual problem under $\lambda^*$ minimizes (P4). Finally, recall that Lemma 3 establishes that the minimizing solution to (P4) is also a solution to the weighted MMD minimization problem. Therefore, there exists $\lambda^*$ such that the solution to the SVM dual under $\lambda^*$ minimizes the weighted MMD. Part (ii): The proof follows from Schaible (1976, Lemma 3).

A.2 The Path Algorithm of Sentelle et al. (2016)

We briefly describe the path algorithm of Sentelle et al. (2016) used in this paper.
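Before turning to the exact algorithm, note that its output can be approximated by brute force: re-solve the penalized dual on a decreasing grid of $\lambda$ values and record the balance term $\alpha^\top Q \alpha$ and the effective sample size $\mathbf{1}^\top \alpha$ at each grid point. The sketch below does exactly this with a generic solver on toy data; the grid, kernel, and data are illustrative assumptions, not the method of Sentelle et al. (2016).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy data (hypothetical): 5 treated, 5 control units with shifted means.
N = 10
T = np.array([1] * 5 + [0] * 5)
W = 2 * T - 1
X = rng.normal(size=(N, 2)) + 0.4 * T[:, None]
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
Q = np.outer(W, W) * np.exp(-sq_dist)  # Q_ij = W_i W_j K(X_i, X_j)

# Constraint defining A_SVM together with the box bounds below.
balance = {"type": "eq", "fun": lambda a: a[T == 1].sum() - a[T == 0].sum()}

frontier = []
for lam in [2.0, 1.0, 0.5, 0.25]:   # decreasing grid of regularization values
    res = minimize(lambda a: a @ Q @ a - 2 * lam * a.sum(),  # dual with nu = 2*lam
                   x0=np.full(N, 0.5), bounds=[(0.0, 1.0)] * N,
                   constraints=[balance], method="SLSQP")
    a = res.x
    # Record (balance term, effective sample size) at this grid point.
    frontier.append((float(a @ Q @ a), float(a.sum())))
```

As the theory predicts, the effective sample size is non-increasing as $\lambda$ decreases along the grid, tracing out the balance-sample size trade-off; the exact algorithm below obtains the same path at its breakpoints without re-solving from scratch.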
The regularization path for SVM is characterized by a sequence of breakpoints, representing the values of $\lambda$ at which either one of the support vectors on the margin, i.e., $W_i f(X_i) = 1$, exits the margin, or a non-marginal observation reaches the margin. Between these breakpoints, the coefficients $\alpha_i$ of the marginal support vectors change linearly in $\lambda$, while the coefficients of all other observations stay fixed as $\lambda$ is changed. Since the KKT conditions must be met for any solution $\alpha$, we can use a linear system of equations to compute how each $\alpha_i$ and $\alpha$ changes with respect to $\lambda$.

Based on this idea, beginning with an initial solution corresponding to some large initial value of $\lambda$, the path algorithm first computes how the current marginal support vectors change with respect to $\lambda$. Given this quantity, the next breakpoint in the path is computed by decreasing $\lambda$ until a marginal support vector exits the margin, i.e., $\alpha_i = 0$ or $\alpha_i = 1$, or a non-marginal observation enters the margin, i.e., $W_i f(X_i) = 1$. At this point, the marginal support vector set is updated, and the changes in $\alpha_i$ and $\alpha$, as well as the next breakpoint, are computed. This procedure repeats until the terminal value of $\lambda$ is reached.

A.3 Conditional Bias with Respect to SATE and SATT
In this section, we derive the conditional bias for the weighted difference-in-means estimator. Note that our derivation follows the one given in Kallus et al. (2018). Consider the problem of estimating the SATE and the SATT, defined as
\[
\tau_{\mathrm{SATE}} = \frac{1}{N} \sum_{i=1}^N \{Y_i(1) - Y_i(0)\} \quad \text{and} \quad \tau_{\mathrm{SATT}} = \frac{1}{n_T} \sum_{i \in \mathcal{T}} \{Y_i(1) - Y_i(0)\}, \tag{24}
\]
respectively. We denote the weighted estimator by $\widehat{\tau}$, which has the form
\[
\widehat{\tau} = \sum_{i \in \mathcal{T}} \alpha_i Y_i - \sum_{i \in \mathcal{C}} \alpha_i Y_i, \tag{25}
\]
where $\alpha \in \mathcal{A}_{\mathrm{simplex}}$. The conditional bias with respect to the SATE is given by
\begin{align*}
\mathbb{E}[\widehat{\tau} - \tau_{\mathrm{SATE}} \mid X^N, T^N]
&= \sum_{i \in \mathcal{T}} \mathbb{E}[\alpha_i Y_i \mid X^N, T^N] - \sum_{i \in \mathcal{C}} \mathbb{E}[\alpha_i Y_i \mid X^N, T^N] - \mathbb{E}[\tau_{\mathrm{SATE}} \mid X^N, T^N] \\
&= \sum_{i=1}^N \alpha_i \{T_i - (1 - T_i)\} \, \mathbb{E}[Y_i(T_i) \mid X_i, T_i] - \frac{1}{N} \sum_{i=1}^N \mathbb{E}[Y_i(1) - Y_i(0) \mid X_i, T_i] \\
&= \sum_{i=1}^N \alpha_i \{T_i - (1 - T_i)\} \, \mathbb{E}[Y_i(T_i) \mid X_i] - \frac{1}{N} \sum_{i=1}^N \mathbb{E}[Y_i(1) - Y_i(0) \mid X_i] \\
&= \sum_{i=1}^N \alpha_i T_i \, \mathbb{E}[Y_i(1) \mid X_i] - \sum_{i=1}^N \alpha_i (1 - T_i) \, \mathbb{E}[Y_i(0) \mid X_i] - \frac{1}{N} \sum_{i=1}^N \tau(X_i) \\
&= \sum_{i=1}^N \alpha_i T_i \{f(X_i) + \tau(X_i)\} - \sum_{i=1}^N \alpha_i (1 - T_i) f(X_i) - \frac{1}{N} \sum_{i=1}^N \tau(X_i) \\
&= \sum_{i=1}^N \left( \alpha_i T_i - \frac{1}{N} \right) \tau(X_i) + \sum_{i=1}^N \alpha_i \{T_i - (1 - T_i)\} f(X_i) \\
&= \sum_{i=1}^N \left( \alpha_i T_i - \frac{1}{N} \right) \tau(X_i) + \sum_{i=1}^N \alpha_i W_i f(X_i),
\end{align*}
where the second and third equalities follow from Assumptions 1 and 2. By a similar argument, the conditional bias with respect to the SATT is
\[
\mathbb{E}[\widehat{\tau} - \tau_{\mathrm{SATT}} \mid X^N, T^N] = \sum_{i \in \mathcal{T}} \left( \alpha_i - \frac{1}{n_T} \right) \tau(X_i) + \sum_{i=1}^N \alpha_i W_i f(X_i).
\]

A.4 Worst-case Bias in an RKHS
We consider the problem of minimizing the bias due to prognostic score imbalance, defined in (17). Restricting $f$ to the unit ball of the RKHS, defined as $\mathcal{F}_K$ in Section 2.3, and considering the $f$ which maximizes the absolute value of this quantity, we compute the worst-case squared bias due to prognostic score imbalance as
\[
B(\alpha; X^N, T^N) = \sup_{f \in \mathcal{F}_K} \left( \sum_{i=1}^N \alpha_i W_i f(X_i) \right)^2.
\]
We can simplify this expression as
\begin{align*}
B(\alpha; X^N, T^N) &= \sup_{f \in \mathcal{F}_K} \left( \sum_{i=1}^N \alpha_i W_i f(X_i) \right)^2 = \sup_{f \in \mathcal{F}_K} \left( \sum_{i=1}^N \alpha_i W_i \langle f, \phi(X_i) \rangle \right)^2 \\
&= \sup_{f \in \mathcal{F}_K} \left\langle f, \sum_{i=1}^N \alpha_i W_i \phi(X_i) \right\rangle^2 = \left\| \sum_{i=1}^N \alpha_i W_i \phi(X_i) \right\|_{\mathcal{H}_K}^2 = \gamma_K\big( \widehat{F}_\alpha, \widehat{G}_\alpha \big)^2.
\end{align*}
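The final identity, that the worst-case bias over the unit ball of an RKHS reduces to the RKHS norm of the weighted embedding and is therefore computable through the kernel matrix alone, can be checked numerically for a kernel with a known finite-dimensional feature map. In the sketch below, the polynomial kernel, the toy data, and the simplex weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data and simplex weights (hypothetical): 3 treated, 5 control units.
N = 8
T = np.array([1] * 3 + [0] * 5)
W = 2 * T - 1
x = rng.normal(size=N)
alpha = np.where(T == 1, 1 / 3, 1 / 5)  # uniform weights within each group

# Polynomial kernel K(x, z) = (1 + x z)^2 has the explicit feature map
# phi(x) = (1, sqrt(2) x, x^2), since <phi(x), phi(z)> = 1 + 2 x z + (x z)^2.
K = (1 + np.outer(x, x)) ** 2
phi = np.stack([np.ones(N), np.sqrt(2) * x, x ** 2], axis=1)  # shape (N, 3)

v = alpha * W
# Worst-case squared bias via the kernel trick: v' K v.
bias_kernel = float(v @ K @ v)
# The same quantity via the explicit embedding: || sum_i alpha_i W_i phi(x_i) ||^2.
embed = phi.T @ v
bias_feature = float(embed @ embed)
```

The two quantities agree to numerical precision, illustrating that the supremum over the unit ball never needs to be computed explicitly: the squared weighted MMD $v^\top K v$ suffices.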