Estimating Average Treatment Effects with Support Vector Machines
Alexander Tarr†  Kosuke Imai‡

February 25, 2021
Abstract
Support vector machine (SVM) is one of the most popular classification algorithms in the machine learning literature. We demonstrate that SVM can be used to balance covariates and estimate average causal effects under the unconfoundedness assumption. Specifically, we adapt the SVM classifier as a kernel-based weighting procedure that minimizes the maximum mean discrepancy between the treatment and control groups while simultaneously maximizing effective sample size. We also show that SVM is a continuous relaxation of the quadratic integer program for computing the largest balanced subset, establishing its direct relation to the cardinality matching method. Another important feature of SVM is that the regularization parameter controls the trade-off between covariate balance and effective sample size. As a result, the existing SVM path algorithm can be used to compute the balance-sample size frontier. We characterize the bias of causal effect estimation arising from this trade-off, connecting the proposed SVM procedure to the existing kernel balancing methods. Finally, we conduct simulation and empirical studies to evaluate the performance of the proposed methodology and find that SVM is competitive with the state-of-the-art covariate balancing methods.
Keywords: causal inference, covariate balance, matching, subset selection, weighting

∗ We thank Brian Lee for making his simulation code available and Francesca Dominici, Gary King, Jose Zubizarreta, and seminar participants at the Institute for Quantitative Social Science, Harvard University, for helpful discussions. Imai thanks the Sloan Foundation for its support.
† Ph.D. Candidate, Department of Electrical Engineering, Princeton University, Princeton, NJ 08544. Email: [email protected]
‡ Professor, Department of Government and Department of Statistics, Harvard University, Cambridge, MA 02138. Phone: 617–384–6778, Email: [email protected], URL: https://imai.fas.harvard.edu

1 Introduction
Estimating causal effects in an observational study is complicated by the lack of randomization in treatment assignment, which may lead to confounding bias. The standard approach is to weight observations such that the empirical distribution of observed covariates is similar between the treatment and control groups (see e.g., Lunceford and Davidian, 2004; Rubin, 2006; Ho et al., 2007; Stuart, 2010). Researchers then estimate the causal effects using the weighted sample while assuming the absence of unobserved confounders. Recently, a large number of weighting methods have been proposed to directly optimize covariate balance for causal effect estimation (e.g., Hainmueller, 2012; Imai and Ratkovic, 2014; Zubizarreta, 2015; Chan et al., 2016; Athey et al., 2018; Li et al., 2018; Wong and Chan, 2018; Zhao, 2019; Hazlett, 2020; Kallus, 2020; Ning et al., 2020; Tan, 2020).

This paper provides a new insight into this fast-growing literature on covariate balancing by demonstrating that support vector machine (SVM), one of the most popular classification algorithms in the machine learning literature (Cortes and Vapnik, 1995; Schölkopf et al., 2002), can be used to balance covariates and estimate the average treatment effect under the standard unconfoundedness assumption. Specifically, we adapt the SVM classifier as a kernel-based weighting procedure that minimizes the maximum mean discrepancy between the treatment and control groups (Gretton et al., 2007a) while simultaneously maximizing effective sample size. The resulting weights are bounded, leading to stable causal effect estimation. Importantly, as SVM has been extensively studied and widely used, we can exploit its well-known theoretical properties and highly optimized implementation.

All matching and weighting methods face the same trade-off between effective sample size and covariate balance, with a better balance typically leading to a smaller sample size.
We show that SVM directly addresses this fundamental trade-off. Specifically, the dual optimization problem for SVM computes a set of balancing weights as the dual coefficients while yielding the support vectors that comprise a largest balanced subset. In addition, the regularization parameter of SVM controls the trade-off between sample size and covariate balance. This implies that the existing path algorithm (Hastie et al., 2004; Sentelle et al., 2016) can efficiently characterize the balance-sample size frontier (King et al., 2017). Since both sample size and covariate balance affect the statistical properties of causal estimates, we analyze how this trade-off affects causal effect estimation.

In the causal inference literature, we are not the first to recognize the connection between SVM and covariate balancing. In an unpublished working paper, Ratkovic (2014) notes that the hinge-loss function of SVM has a first-order condition which leads to balanced covariate sums amongst the support vectors. Instead, we show that the dual form of the SVM optimization problem leads to covariate mean balance. In addition, Ghosh (2018) notes the relationship between the SVM margin and the region of covariate overlap. The author argues that the support vectors correspond to observations lying in the intersection of the convex hulls for the treated and control samples (King and Zeng, 2006). In contrast, we show that SVM can be used to obtain weights for causal effect estimation. Furthermore, neither of these two previous works studies the relationship between the regularization parameter of SVM and the fundamental trade-off between covariate balance and effective sample size.

The proposed methodology is also related to several other covariate balancing methods. First, we establish that SVM can be seen as a continuous relaxation of the quadratic integer program for computing the largest balanced subset.
Indeed, SVM approximates an optimization problem closely related to cardinality matching (Zubizarreta et al., 2014). Second, SVM is a kernel-based covariate balancing method. Several researchers have recently developed weighting methods to balance functions in a reproducing kernel Hilbert space (RKHS) (Wong and Chan, 2018; Hazlett, 2020; Kallus, 2020). SVM shares the advantage of these methods that it can balance a general class of functions and easily accommodate non-linearity and non-additivity in the conditional expectation functions for the outcomes. In particular, we show that SVM fits into the kernel optimal matching framework (Kallus, 2020). Unlike these covariate balancing methods, however, we can exploit the existing path algorithms of SVM to compute the set of solutions over the entire regularization path with comparable complexity to computing a single solution (Hastie et al., 2004; Sentelle et al., 2016). This allows us to efficiently characterize the trade-off between covariate balance and effective sample size.

The rest of the paper is structured as follows. In Section 2, we present our methodological results. In Section 3, we conduct simulation studies to compare the performance of SVM with that of the aforementioned related covariate balancing methods. Lastly, in Section 4, we apply SVM to the data from the right heart catheterization observational study (Connors et al., 1996).
2 The Proposed Methodology

In this section, we establish several properties of SVM as a covariate balancing method. We first show that the SVM dual can be viewed as a regularized optimization problem that minimizes the maximum mean discrepancy (MMD). We then compare SVM to cardinality matching and show how the regularization path algorithm for SVM can be viewed as a balance-sample size frontier. Lastly, we discuss how to use SVM for causal effect estimation and compare SVM to existing kernel balancing methods.
Suppose that we observe a simple random sample of N units from a super-population of interest, P. Denote the observed data by D = {X_i, Y_i, T_i}_{i=1}^N, where X_i ∈ X represents a D-dimensional vector of covariates, Y_i is the outcome variable, and T_i is a binary treatment assignment variable that is equal to 1 if unit i is treated and 0 otherwise. We define the index sets for the treatment and control groups as T = {i : T_i = 1} and C = {i : T_i = 0}, with the group sizes equal to n_T = |T| and n_C = |C|, respectively. Finally, we define the observed outcome as Y_i = T_i Y_i(1) + (1 − T_i) Y_i(0), where Y_i(1) and Y_i(0) are the potential outcomes under the treatment and control conditions, respectively. This notation implies the Stable Unit Treatment Value Assumption (SUTVA): no interference between units and the same version of the treatment (Rubin, 1990). Furthermore, we make the following standard identification assumptions. All of these assumptions are maintained throughout this paper.

Assumption 1 (Unconfoundedness)
The potential outcomes {Y_i(1), Y_i(0)} are independent of the treatment assignment T_i conditional on the covariates X_i. That is, for all x ∈ X, we have {Y_i(1), Y_i(0)} ⊥⊥ T_i | X_i = x.

Assumption 2 (Overlap)
For all x ∈ X, the propensity score e(x) = Pr(T_i = 1 | X_i = x) is bounded away from 0 and 1, i.e., 0 < e(x) < 1.

To consider causal inference with SVM, it is convenient to define the following transformed treatment variable, which is equal to either −1 or 1:

  W_i = 2T_i − 1 ∈ {−1, 1}.  (1)

In addition, for t = 0, 1, we define the conditional expectation functions, disturbances, and conditional variance functions as

  E(Y_i(t) | X_i) = f_t(X_i),  ε_i(t) = Y_i(t) − f_t(X_i),  σ_t²(X_i) = V(Y_i(t) | X_i).

Note that by construction, we have E(ε_i(t) | X_i) = 0 for t = 0, 1. Lastly, let H_K denote a reproducing kernel Hilbert space (RKHS) with norm ‖·‖_{H_K} and kernel K(X_i, X_j) = ⟨φ(X_i), φ(X_j)⟩_{H_K}, where φ : R^D → H_K is a feature mapping of the covariates to the RKHS.

Support vector machines (SVMs) are a widely used methodology for two-class classification problems (Cortes and Vapnik, 1995; Schölkopf et al., 2002). SVM aims to compute a separating hyperplane of the form

  f(X_i) = β⊤φ(X_i) + β_0,  (2)

where β ∈ H_K is the normal vector for the hyperplane and β_0 is the offset. In this paper, we use SVM for the classification of treatment status. In the case of non-separable data, β and β_0 are computed according to the soft-margin SVM problem, which is formulated as

  min_{β, β_0, ξ}  (λ/2)‖β‖²_{H_K} + Σ_{i=1}^N ξ_i
  s.t.  W_i f(X_i) ≥ 1 − ξ_i,  i = 1, ..., N,
        ξ_i ≥ 0,  i = 1, ..., N,  (3)

where {ξ_i}_{i=1}^N are the so-called slack variables and λ is a regularization parameter controlling the trade-off between the margin width and margin violation of the hyperplane. Note that λ is related to the traditional SVM cost parameter C via the equality λ = 1/C.

Defining the matrix Q with elements Q_ij = W_i W_j K(X_i, X_j) and the vector W with elements W_i, this problem has a corresponding dual form given by

  min_α  (1/(2λ)) α⊤Qα − 1⊤α
  s.t.  W⊤α = 0,
        0 ⪯ α ⪯ 1,  (4)

where 1 represents a vector of ones and ⪯ denotes an element-wise inequality.

We begin by providing an intuitive explanation of how SVM can be viewed as a covariate balancing procedure.
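To make these objects concrete, the following pure-Python sketch assembles W_i = 2T_i − 1 and Q_ij = W_i W_j K(X_i, X_j) for a linear kernel and evaluates the dual objective (1/(2λ))α⊤Qα − 1⊤α. The data and weights are hypothetical toy values, not output of an actual SVM solver.

```python
# Sketch of the ingredients of the SVM dual; toy data are hypothetical.

def linear_kernel(x, y):
    # K(x, y) = x'y
    return sum(a * b for a, b in zip(x, y))

def dual_objective(alpha, X, T, lam, kernel=linear_kernel):
    """Evaluate (1/(2*lam)) * a'Qa - 1'a, where Q_ij = W_i W_j K(X_i, X_j)
    and W_i = 2*T_i - 1 is the transformed treatment variable."""
    W = [2 * t - 1 for t in T]
    n = len(X)
    quad = sum(alpha[i] * alpha[j] * W[i] * W[j] * kernel(X[i], X[j])
               for i in range(n) for j in range(n))
    return quad / (2.0 * lam) - sum(alpha)

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # covariates
T = [1, 1, 0, 0]                                      # treatment indicators
alpha = [0.5, 0.5, 0.5, 0.5]  # feasible: W'alpha = 0 and 0 <= alpha_i <= 1
# These toy weights balance the two groups exactly (a'Qa = 0), so the
# objective reduces to -1'a = -2.0.
print(dual_objective(alpha, X, T, lam=1.0))
```

The toy covariates are chosen so that the weighted group means coincide, which makes the quadratic term vanish and isolates the sample-size term of the objective.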
First, note that the quadratic term in the above dual objective function can be written as a weighted measure of covariate discrepancy between the treatment and control groups,

  α⊤Qα = ‖ Σ_{i∈T} α_i φ(X_i) − Σ_{i∈C} α_i φ(X_i) ‖²_{H_K},  (5)

while the constraint W⊤α = 0 ensures that the sum of weights is identical between the treatment and control groups,

  W⊤α = 0  ⟺  Σ_{i∈T} α_i = Σ_{i∈C} α_i.

Lastly, the second term in the objective, 1⊤α, is proportional to the sum of weights for each treatment group, since the above constraint implies Σ_{i∈T} α_i = Σ_{i∈C} α_i = 1⊤α/2. Thus, SVM simultaneously minimizes the covariate discrepancy and maximizes the effective sample size, which in turn leads to minimization of the weighted difference-in-means in the transformed covariate space. It is also important to note that, unlike some other balancing methods, the weights are bounded as represented by the constraint 0 ≤ α_i ≤ 1 for all i, leading to stable causal effect estimation (Tan, 2010).

The choice of the kernel function K(X_i, X_j) and its corresponding feature map φ determines the type of covariate balance enforced by SVM, as shown in equation (5). In this paper, we focus on the linear, polynomial, and radial basis function (RBF) kernels. The linear kernel K(X_i, X_j) = X_i⊤X_j corresponds to the feature map φ(X_i) = X_i, and hence the quadratic term α⊤Qα measures the discrepancy in the original covariates. The general form of the degree-d polynomial kernel with scale parameter c is K(X_i, X_j) = (X_i⊤X_j + c)^d. For example, when d = 2, this kernel has the corresponding feature map

  φ(X_i) = [ X_{i1}², ..., X_{iD}², √2 X_{iD} X_{i,D−1}, ..., √2 X_{iD} X_{i1}, ..., √2 X_{i2} X_{i1}, √(2c) X_{iD}, ..., √(2c) X_{i1}, c ]⊤.

Hence, the quadratic kernel leads to a discrepancy measure of the original covariates, their squares, and all pairwise interactions. In general, the degree-d polynomial kernel leads to a feature map consisting of all powers of the original covariates and all interactions up to degree d. The final kernel considered in this paper is the RBF kernel with scale parameter γ: K(X_i, X_j) = exp(−γ‖X_i − X_j‖²).
This kernel can be viewed as a generalization of the polynomial kernel in the limit d → ∞.

In addition, SVM sets the weights to zero for the units whose treatment status is easy to classify. To see this, note that the Karush–Kuhn–Tucker (KKT) conditions for soft-margin SVM lead to the following useful characterization of a solution α:

  W_i f(X_i) = 1  ⟹  0 ≤ α_i ≤ 1,
  W_i f(X_i) < 1  ⟹  α_i = 1,  (6)
  W_i f(X_i) > 1  ⟹  α_i = 0.

The units that satisfy W_i f(X_i) = 1 are referred to as marginal support vectors, whereas the units that meet W_i f(X_i) < 1 are called non-marginal support vectors. The regularization parameter λ controls which of the two components of the dual objective receives more emphasis. SVM chooses optimal weights such that easy-to-classify units are given zero weight. Our goal in the remainder of this section is to extend the above intuition and establish a more rigorous connection between SVM, covariate balancing, and causal effect estimation.

We now show that SVM minimizes the maximum mean discrepancy (MMD) of the covariate distribution between the treatment and control groups. The MMD is a commonly used measure of distance between probability distributions (Gretton et al., 2007a) that was recently proposed as a metric for balance assessment in causal inference (Zhu et al., 2018). Specifically, we show that the SVM dual problem given in equation (4) can be viewed as a regularized optimization problem for computing weights which minimize the MMD.

The MMD, which is also called the kernel distance, is a measure of distance between two probability distributions based on the difference in mean function values for functions in the unit ball of an RKHS (Gretton et al., 2007a). The MMD has found use in several statistical applications, such as hypothesis testing (Gretton et al., 2007b, 2012) and density estimation (Sriperumbudur, 2011).
Given the unit ball of the RKHS, F_K = {f ∈ H_K : ‖f‖_{H_K} ≤ 1}, and two probability measures F and G, the MMD is defined as

  γ_K(F, G) := sup_{f∈F_K} | ∫ f dF − ∫ f dG |.  (7)

An important property of the MMD is that when K is a characteristic kernel (e.g., the Gaussian radial basis function kernel and the Laplace kernel), γ_K(F, G) = 0 if and only if F = G (Sriperumbudur et al., 2010).

The computation of γ_K(F, G) requires the knowledge of both F and G, which is typically unavailable. In practice, an estimate of γ_K(F, G) using the empirical distributions F̂_m and Ĝ_n can be computed as

  γ_K(F̂_m, Ĝ_n) = ‖ (1/m) Σ_{i: X_i ∼ F} φ(X_i) − (1/n) Σ_{j: X_j ∼ G} φ(X_j) ‖_{H_K},  (8)

where m and n are the sizes of the samples drawn from F and G, respectively. The properties of this statistic are well studied (see e.g., Sriperumbudur et al., 2012). In causal inference, the empirical MMD can be used to assess balance between the treated and control samples (Zhu et al., 2018). This is done by setting F = P(X_i | T_i = 1) and G = P(X_i | T_i = 0). Then, the quantity γ_K(F̂_m, Ĝ_n) gives a measure of independence between the treatment assignment T_i and the observed pre-treatment covariates X_i.

Equation (8) naturally suggests a weighting procedure that balances the covariate distributions between the treatment and control groups by minimizing the empirical MMD. We define a weighted variant of the empirical MMD as

  γ_K(F̂_α, Ĝ_α) = ‖ Σ_{i∈T} α_i φ(X_i) − Σ_{j∈C} α_j φ(X_j) ‖_{H_K} = √(α⊤Qα),  (9)

where F̂_α and Ĝ_α denote the reweighted empirical distributions under the weights α.
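Because γ_K(F̂_α, Ĝ_α)² = α⊤Qα, the weighted MMD can be evaluated from the kernel alone, without constructing φ. A small pure-Python sketch (hypothetical toy data) checks this identity for the linear kernel, where φ is the identity and the squared norm can also be computed explicitly:

```python
import math

def mmd_squared_kernel(alpha, X, T, kernel):
    """a'Qa via the kernel trick: sum_ij s_i s_j a_i a_j K(X_i, X_j),
    where s_i = +1 for treated units and s_i = -1 for controls."""
    s = [1 if t == 1 else -1 for t in T]
    n = len(X)
    return sum(s[i] * s[j] * alpha[i] * alpha[j] * kernel(X[i], X[j])
               for i in range(n) for j in range(n))

def mmd_squared_explicit(alpha, X, T):
    """Direct evaluation of the squared norm in equation (9) for the linear
    kernel (phi = identity): weighted treated-minus-control covariate sums."""
    d = len(X[0])
    diff = [0.0] * d
    for a, x, t in zip(alpha, X, T):
        sign = 1.0 if t == 1 else -1.0
        for k in range(d):
            diff[k] += sign * a * x[k]
    return sum(v * v for v in diff)

linear = lambda x, y: sum(a * b for a, b in zip(x, y))
X = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.5], [1.0, 1.0]]  # hypothetical covariates
T = [1, 1, 0, 0]
alpha = [0.7, 0.3, 0.4, 0.6]
assert math.isclose(mmd_squared_kernel(alpha, X, T, linear),
                    mmd_squared_explicit(alpha, X, T))
print(round(mmd_squared_explicit(alpha, X, T), 2))  # 1.3
```

For a non-linear kernel such as the RBF, only the kernel-trick version is available, which is the computational appeal of measuring balance through a norm.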
The weights are restricted to the simplex set,

  A_simplex = { α ∈ R^N : 0 ⪯ α ⪯ 1, Σ_{i∈T} α_i = Σ_{j∈C} α_j = 1 }.  (10)

The optimization problem for finding the MMD-minimizing weights is therefore formulated as

  min_α √(α⊤Qα)  s.t.  α ∈ A_simplex.  (11)

Note that computing weights according to this problem alone is generally not preferable due to the lack of regularization, which leads to overfitting and a sparse α, resulting in many discarded samples. The following theorem, which we prove in Appendix A.1, establishes that the SVM dual problem can be viewed as a regularized version of the optimization problem in equation (11).

Theorem 1 (SVM Dual Problem as Regularized MMD Minimization) Let α*(λ) denote the solution to the SVM dual problem under λ, defined in equation (4). Consider the normalized weights α̃*(λ) = 2α*(λ)/1⊤α*(λ) such that α̃*(λ) ∈ A_simplex. Then,

(i) there exists λ* such that α̃*(λ*) is a solution to the MMD minimization problem defined in equation (11);

(ii) the quantity α̃*(λ)⊤Q α̃*(λ) is a monotonically increasing function of λ.

Theorem 1 shows that SVM minimizes the MMD, with the regularization parameter λ controlling the trade-off between the covariate imbalance, measured as the MMD, and the effective sample size, measured as the sum of the support vector weights 1⊤α. Thus, a greater size of the support vector set may lead to a worse covariate balance between the treatment and control groups within that set.

SVM can also be seen as a continuous relaxation of the quadratic integer program (QIP) for computing the largest balanced subset. Consider the modified version of the optimization problem in equation (4), in which we replace the continuous constraint 0 ⪯ α ⪯ 1 with the integrality constraint α ∈ {0,1}^N.
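Before turning to the integer-constrained variant, note that the renormalization used in Theorem 1 is simple to implement. A sketch with hypothetical dual weights checks that α̃ satisfies the group-sum constraints of A_simplex:

```python
def normalize_to_simplex(alpha):
    """tilde(alpha) = 2 * alpha / (1'alpha). When W'alpha = 0, each treatment
    group carries half of 1'alpha, so each group's normalized weights sum to 1."""
    total = sum(alpha)
    return [2.0 * a / total for a in alpha]

alpha = [1.0, 0.25, 0.75, 0.5]   # hypothetical dual solution with W'alpha = 0
T = [1, 1, 0, 0]                 # treatment indicators
tilde = normalize_to_simplex(alpha)
print(sum(a for a, t in zip(tilde, T) if t == 1))  # treated weights sum to 1.0
print(sum(a for a, t in zip(tilde, T) if t == 0))  # control weights sum to 1.0
```

The normalization rescales both groups by the same factor, so it leaves the relative weighting within each group, and hence the weighted difference-in-means up to scale, unchanged.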
Recalling that the quadratic term in the objective can be rewritten as a norm, this problem, which we refer to as SVM-QIP, is given by:

  min_α  (1/(2λ)) α⊤Qα − 1⊤α
  s.t.  W⊤α = 0,
        α ∈ {0,1}^N.  (12)

Interpreting the variables α_i as indicators of whether or not unit i is selected into the optimal subset, we see that the objective is a trade-off between subset balance in the projected features (first term) and subset size (second term). Here, balance is measured by a difference in sums. However, the constraint W⊤α = 0 requires the optimal subset to have an equal number of treated and control units, so balancing the feature sums also implies balancing the feature means.

Thus, the SVM dual in equation (4) can be viewed as a continuous relaxation of the largest balanced subset problem represented by SVM-QIP in equation (12), with the set of support vectors comprising an approximation to the largest balanced subset, as these are the units for which α_i > 0. The quality of this approximation depends on the data, the value of λ, and the choice of kernel. In our own experiments, we observe significant overlap between the two solutions, suggesting that SVM uses non-integer weights to augment the SVM-QIP solution without compromising balance in the selected subset. We compare the differences between the two methods in Section 3.2.

Closely related to the SVM-QIP formulation in equation (12) is cardinality matching (Zubizarreta et al., 2014), which maximizes the number of matches subject to a set of covariate balance constraints. The objective of cardinality matching is given by

  max_{m_ij}  Σ_{i∈T} Σ_{j∈C} m_ij
  s.t.  | Σ_{i∈T} Σ_{j∈C} m_ij [f_b(X_ik) − f_b(X_jk)] | ≤ ε_kb Σ_{i∈T} Σ_{j∈C} m_ij,  k = 1, ..., D,  b = 1, ..., B,
        Σ_{j∈C} m_ij ≤ 1,  i ∈ T,
        Σ_{i∈T} m_ij ≤ 1,  j ∈ C,
        m_ij ∈ {0,1},  i ∈ T, j ∈ C,  (13)

where m_ij are selection variables indicating whether treated unit i is matched to control unit j, X_ik denotes the k-th element of the covariate vector X_i, f_b is an arbitrary function of the covariates specifying each of the B balance conditions, and ε_kb is a tolerance selected by the researcher. Common choices for f_b are the first- and second-order moments, and ε_kb is typically set to a scalar multiple of the corresponding standardized difference-in-means.

To establish the connection between SVM-QIP and cardinality matching, we first note that cardinality matching need not be formulated as a matched-pair optimization problem. In fact, the balance constraints between pairs, as formulated in equation (13), are equivalent to those between the treatment and control groups in the selected subsample. Similarly, the one-to-one matching constraints are equivalent to constraining the number of treated and control units in the selected subsample to be equal. Therefore, defining the indicator variable α_i for selection into the optimal subset, we can rewrite the objective of cardinality matching as

  max_α  Σ_{i=1}^N α_i
  s.t.  | Σ_{i∈T} α_i f_b(X_ik) − Σ_{j∈C} α_j f_b(X_jk) | ≤ ε_kb Σ_{i=1}^N α_i,  k = 1, ..., D,  b = 1, ..., B,
        Σ_{i=1}^N α_i W_i = 0,
        α_i ∈ {0,1},  i = 1, ..., N.  (14)

The difference between cardinality matching and SVM-QIP lies in the way that balance is enforced in the optimal subset. Cardinality matching imposes covariate-specific balance by bounding each dimension's difference-in-means, while SVM-QIP imposes aggregated balance by penalizing the normed difference-in-means. The preference between these two measures of balance depends on the dimensionality of the covariates and a priori knowledge about confounding mechanisms.
If we suspect certain covariates to be confounders, then bounding those specific dimensions is more reasonable. However, if no such information is available and the covariate space is high-dimensional, then restricting the overall balance may be preferable. We empirically examine the relative performance of these two methods in Section 4.

We emphasize that using the norm to measure balance is also computationally attractive, since it avoids the direct calculation of the balance conditions φ(X_i) through the so-called "kernel trick." Furthermore, as we discuss below, SVM can be used to approximate solutions to SVM-QIP with high accuracy and a much lower computational cost, which allows us to approximate the regularization path for SVM-QIP faster than a single solution for cardinality matching can be computed.

Another important advantage of using SVM to perform covariate balancing is the existence of path algorithms, which can efficiently compute the set of solutions to equation (4) over different values of λ. Since Theorem 1 established that λ controls the trade-off between the MMD and a heuristic measure of subset size, the path algorithm for SVM can be viewed as the weighting analog of the balance-sample size frontier (King et al., 2017). Below, we briefly discuss the algorithm for computing the SVM regularization path and describe how the path can be interpreted as a balance-sample size frontier.

The algorithm. Path algorithms for SVM were first proposed in Hastie et al. (2004), who showed that the weights α and the scaled intercept α_0 := λβ_0 are piecewise linear in λ and presented an algorithm for computing the entire path of solutions at a computational cost comparable to finding a single solution. However, their algorithm was prone to numerical problems and would fail in the presence of singular submatrices of Q. Recent work on SVM path algorithms has focused on resolving the issues with singularities. In our analysis, we use the algorithm presented in Sentelle et al. (2016), which is briefly described in Appendix A.2.

Initial solution.
The initial solution in the SVM regularization path corresponds to the solution at λ_max such that for any λ > λ_max, the minimizing weight vector α does not change. We assume without loss of generality that n_T ≤ n_C. Then initially, α_i = 1 for all i ∈ T, and the remaining weights are computed according to

  argmin_{α ∈ [0,1]^N}  α⊤Qα  s.t.  Σ_{i∈C} α_i = n_T,  α_i = 1, i ∈ T.  (15)

Thus, the initial solution computes control weights such that the weighted empirical MMD is minimized while fixing the renormalized weights α̃_i = n_T^{−1} for the treated units. Note that this solution also corresponds to the largest subset, as measured by Σ_i α_i, amongst all solutions on the regularization path.

Terminal solution.
The regularization path completes when the resulting solution has no non-marginal support vectors or, in the case of non-separable data, when λ = 0. In practice, however, path algorithms run into numerical issues when λ is small, so we terminate the path at a small value λ_min, which appears to work well in our experiments. This value is often greater than λ*, which corresponds to the MMD-minimizing solution defined in Theorem 1. In practice, however, we find the differences in balance between these two solutions to be negligible.

Summarizing the regularization path, we see that the initial solution at λ_max has the largest weight sum Σ_{i=1}^N α_i = 2 min{n_T, n_C} and can be viewed as the largest balanced subset retaining all observations in the minority class. As we move through the path, the SVM dual problem imposes greater restrictions on balance in the subset, which leads to smaller subsets, until we reach the terminal value λ_min, at which the weighted empirical MMD is smallest on the path.

2.7 Causal Effect Estimation

Theorem 1 establishes that the SVM dual problem can be viewed as a regularized optimization problem for computing balancing weights to minimize the MMD. Under the unconfoundedness assumption, therefore, the resulting weighted support vector set composes a subsample of the data that approximates randomization of treatment assignment. However, achieving a high degree of balance often requires significant pruning of the original sample, especially in scenarios where the covariate distributions for the treated and control groups have limited overlap. In this section, we provide a characterization of this trade-off between subset size and subset balance and discuss its impact on the bias of causal effect estimates.

Recent work by Kallus (2020) established that many existing matching and weighting methods in causal inference minimize the dual norm of the bias for a weighted estimator, a property called error dual norm minimizing.
The author also proposes a new method, kernel optimal matching (KOM), which minimizes the dual norm of the bias when the conditional expectation functions are embedded in an RKHS. Below, we show that SVM also fits into the KOM framework.

We restrict our attention to the following weighted difference-in-means estimator,

  τ̂ = Σ_{i∈T} α_i Y_i − Σ_{i∈C} α_i Y_i,  (16)

where α ∈ A_simplex is computed via the application of SVM to the data {X_i, T_i}_{i=1}^N. Below, we derive the form of the conditional bias with respect to two estimands, the sample average treatment effect (SATE), τ_SATE, and the sample average treatment effect for the treated (SATT), τ_SATT. We then discuss how to compute this bias when the conditional expectation functions f_0 and f_1 are unknown.

As we show in Appendix A.3, under Assumptions 1 and 2, the conditional bias with respect to τ_SATE and τ_SATT for the estimator above is given by

  E(τ̂ − τ | {X_i, T_i}_{i=1}^N) = Σ_{i=1}^N α_i W_i f_0(X_i) + Σ_{i=1}^N (α_i T_i − v_i) τ(X_i),  (17)

where

  v_i = 1/N if τ = τ_SATE,  v_i = T_i/n_T if τ = τ_SATT,  (18)

and τ(X_i) := E(Y_i(1) − Y_i(0) | X_i) = f_1(X_i) − f_0(X_i). Fan et al. (2016) use a similar bias decomposition.

The first term in equation (17) represents the bias due to the imbalance of the prognostic score (Hansen, 2008). Note that the estimation of this quantity is difficult since f_0 is typically unknown. Instead, we embed f_0 in a unit-ball RKHS, F_K, and consider an f_0 that maximizes this bias term. As we show in Appendix A.4, minimizing this worst-case quantity leads to the following optimization problem, which is of the same form as the MMD minimization problem given in equation (11),

  min_α  γ_K(F̂_α, Ĝ_α)  s.t.  α ∈ A_simplex.  (19)

Thus, SVM can also be viewed as a regularization method for minimizing prognostic imbalance. However, SVM does not address the second term of the conditional bias in equation (17).
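For concreteness, the estimator in equation (16) and the weights v_i in equation (18) can be sketched in a few lines of pure Python; all data below are hypothetical toy values rather than SVM output.

```python
def weighted_dim(alpha, Y, T):
    """Equation (16): weighted difference-in-means, sum_T a_i Y_i - sum_C a_i Y_i.
    The weights are assumed to lie in the simplex set of equation (10)."""
    return sum(a * y if t == 1 else -a * y for a, y, t in zip(alpha, Y, T))

def v_weights(T, estimand="SATE"):
    """Equation (18): v_i = 1/N for the SATE and v_i = T_i/n_T for the SATT."""
    N = len(T)
    if estimand == "SATE":
        return [1.0 / N] * N
    n_T = sum(T)
    return [t / n_T for t in T]

Y = [3.0, 5.0, 2.0, 4.0]        # hypothetical outcomes
T = [1, 1, 0, 0]                # treatment indicators
alpha = [0.5, 0.5, 0.25, 0.75]  # simplex weights: each group sums to one
print(weighted_dim(alpha, Y, T))       # 4.0 - 3.5 = 0.5
print(v_weights(T, estimand="SATT"))   # [0.5, 0.5, 0.0, 0.0]
```

Note that with α_i = 1 for all treated units, the renormalized treated weights coincide with the SATT choice v_i = T_i/n_T, which is the situation discussed next.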
This term corresponds to the bias due to extrapolation outside of the weighted treatment group to the population of interest. If the SATT is the target quantity, the term represents the bias due to the difference between the weighted and unweighted CATE for the treatment group. Thus, the prognostic balance achieved by SVM may come at the expense of this CATE bias, which represents the discrepancies between the weighted covariate distribution for the treatment group and the unweighted covariate distribution for the sample of interest. This is a direct consequence of the trade-off between balance and effective sample size, which is controlled by the regularization parameter as shown earlier.

To illustrate this point more clearly, suppose that the target estimand is the SATT and n_T ≤ n_C, a scenario often encountered in practice. In this case, the initial solution on the regularization path for SVM is given by α_i = 1 for i ∈ T. This implies that the renormalized simplex weights for the treated observations are given by α̃_i = n_T^{−1} for i ∈ T, and hence the second bias term in equation (17) vanishes. The remaining weights are then chosen such that the imbalance in the prognostic score is minimized. Since the treated group is unmodified in this setting, imbalance between the unweighted treated and reweighted control covariate distributions will be at its largest value on the path, since balancing is more difficult in the unpruned data. As the penalty for imbalance is increased, the SVM solution will trim both treated and control samples that are difficult to balance. This pruning improves balance but may increase the CATE bias.

2.8 Relation to Kernel Balancing Methods

A closely related set of methods recently proposed in the causal inference literature is kernel balancing (Hazlett, 2020; Kallus et al., 2018; Wong and Chan, 2018). We now discuss the relations between SVM and existing kernel balancing methods.
Consider the following alternative decomposition of the conditional bias,

  E(τ̂ − τ | {X_i, T_i}_{i=1}^N) = Σ_{i=1}^N (α_i T_i − v_i) f_1(X_i) + Σ_{i=1}^N {v_i − α_i(1 − T_i)} f_0(X_i).  (20)

To minimize this bias, kernel balancing methods restrict f_0 and f_1 to the RKHS ball F_K := {(f_0, f_1) ∈ H_K × H_K : √(‖f_0‖²_{H_K} + ‖f_1‖²_{H_K}) ≤ c} and consider minimizing the largest bias under the pair (f_0, f_1) ∈ F_K. This problem is given by

  min_α  sup_{(f_0,f_1)∈F_K} [ Σ_{i=1}^N (α_i T_i − v_i) f_1(X_i) − Σ_{i=1}^N {α_i(1 − T_i) − v_i} f_0(X_i) ],  s.t.  α ∈ A,  (21)

where A denotes the constraints on the weights. For example, Kallus et al. (2018) restrict them to A_simplex, whereas Wong and Chan (2018) essentially use α_i ≥ N^{−1}, though their formulation is slightly different from that given above.

As shown in Kallus et al. (2018), the problem of minimizing this worst-case conditional bias amounts to computing weights that balance the treatment and control covariate distributions with respect to the empirical distribution for the population of interest. Let F̂_α and Ĝ_α denote the weighted empirical covariate distributions for the treatment and control groups, respectively, and let Ĥ_v represent the empirical covariate distribution corresponding to the population of interest. Then, if F_K is also restricted to the unit ball, i.e., c = 1 (fixing the size of f_0 and f_1 is necessary since the bias scales linearly with ‖f_0‖_{H_K} and ‖f_1‖_{H_K}), the optimization problem in equation (21) can be written in terms of the minimization of the empirical MMD statistics:

  min_α  γ_K(F̂_α, Ĥ_v) + γ_K(Ĝ_α, Ĥ_v),  s.t.  α ∈ A.  (22)

The objective in equation (22) does not contain a measure of distance between the conditional covariate distributions F̂_α and Ĝ_α.
Instead, balance between these two distributions is indirectly encouraged through balancing each one individually with respect to the target distribution Ĥ_v. This is in contrast with SVM, which directly balances the covariate distribution between the treatment and control groups.

3 Simulation Studies

In this section, we examine the performance of SVM in ATE estimation under two different simulation settings. We also examine the connection between SVM and the QIP for the largest balanced subset.
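The contrast between the integer program and its continuous relaxation can be seen on a toy problem. The sketch below is our own illustration (with a hypothetical penalty value ν and a linear kernel, not the paper's code): it minimizes a dual-style objective α⊤Qα − ν1⊤α with Q_ij = W_iW_jK_ij, once by brute force over binary weights satisfying the balance constraint, and once over the continuous relaxation 0 ≤ α ≤ 1; by construction the relaxed optimum is never worse:

```python
import itertools
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 8
X = rng.normal(size=(n, 2))
W = np.array([1, 1, 1, 1, -1, -1, -1, -1])   # +1 for treated, -1 for control
K = X @ X.T                                   # linear kernel
Q = np.outer(W, W) * K                        # Q_ij = W_i W_j K_ij
nu = 0.5                                      # hypothetical penalty value

def objective(a):
    return a @ Q @ a - nu * a.sum()

# Integer problem: alpha in {0, 1}^n with equal treated and control weight sums.
best_int = np.inf
for bits in itertools.product([0, 1], repeat=n):
    a = np.array(bits, dtype=float)
    if a[W == 1].sum() == a[W == -1].sum():
        best_int = min(best_int, objective(a))

# Continuous relaxation: 0 <= alpha <= 1 with the balance constraint sum_i W_i alpha_i = 0.
res = minimize(objective, x0=np.full(n, 0.5), bounds=[(0.0, 1.0)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ W}],
               method="SLSQP")
best_cont = float(res.fun)
```

Because every integer-feasible weight vector is also feasible for the relaxation, `best_cont <= best_int` always holds; the comparisons in this section examine how large that gap is in practice.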
3.1 Setup

We consider two simulation setups used in previous studies. Simulation A comes from Lee et al. (2010), who use a slightly modified version of the simulations presented in Setoguchi et al. (2008). Specifically, we adopt the exact setup corresponding to their "scenario G," which is briefly summarized here; we refer readers to the original article for the exact specification. For each simulated dataset, we generate 10 covariates X_i = (X_{i1}, ..., X_{i10})^⊤ from the standard normal distribution, with correlation introduced between four pairs of variables. Treatment assignment is generated according to P(T_i = 1 | X_i) = expit(β^⊤ f(X_i)), where β is a coefficient vector and f(X_i) controls the degree of additivity and linearity in the true propensity score model. This scenario uses a true propensity score model with a moderate amount of non-linearity and non-additivity. The outcome model is linear in the observed covariates with a constant, additive treatment effect: Y_i(T_i) = γ_0 + γ^⊤ X_i + τ T_i + ε_i, with a negative constant effect τ and Gaussian noise ε_i.

Simulation B is based on Wong and Chan (2018). For each simulated dataset, we generate Z_i = (Z_{i1}, ..., Z_{i10})^⊤ from the standard normal distribution. The observed covariates are defined as nonlinear functions of these variables, i.e., X_i = (X_{i1}, ..., X_{i10})^⊤, where X_{i1} = exp(Z_{i1}/2), X_{i2} = Z_{i2}/[1 + exp(Z_{i1})], X_{i3} = (Z_{i1} Z_{i3}/25 + 0.6)^3, X_{i4} = (Z_{i2} + Z_{i4} + 20)^2, and X_{ij} = Z_{ij} for j = 5, ..., 10. Treatment assignment follows a logistic (expit) model that is linear in the latent variables Z_i, corresponding to Model 1 of Wong and Chan (2018). Finally, the outcome model is specified as Y_i(T_i) = 200 + 10 T_i + (1.5 T_i − 0.5)(27.4 Z_{i1} + 13.7 Z_{i2} + 13.7 Z_{i3} + 13.7 Z_{i4}) + ε_i, with mean-zero Gaussian noise ε_i, so that the treatment effect is heterogeneous.

3.2 Comparison between SVM and SVM-QIP

We begin by examining the connection between SVM and SVM-QIP by comparing the solutions obtained on one simulated dataset of N = 500 units generated according to Simulation A. Specifically, we first compute the SVM path using the path algorithm described in Section 2.6, obtaining a set of regularization parameter breakpoints λ. Next, we compute the SVM-QIP solution at each of these breakpoints using the Gurobi optimization software (Gurobi Optimization, 2020). We limit the solver to 5 minutes of runtime for each problem; finding the exact integer-valued solution under a given λ requires a significant amount of time, but a good approximation can typically be found in a few seconds.

For both methods, we compute the objective function value at each of the breakpoints as well as the coverage of the SVM-QIP solution by the SVM solution. The latter represents the proportion of units with non-zero SVM weights that are included in the largest balanced subset identified by SVM-QIP. Formally, the coverage is defined as

\mathrm{cvg}(\lambda) = \frac{|\lceil \alpha^{\text{SVM}}(\lambda) \rceil \cap \alpha^{\text{SVM-QIP}}(\lambda)|}{|\alpha^{\text{SVM-QIP}}(\lambda)|}.

In order to examine the effects of separability on the quality of the approximation, we perform the above analysis using three different types of features. Specifically, we use a linear kernel with the untransformed covariates (linear), a linear kernel with the degree-2 polynomial features formed by concatenating the original covariates with all two-way interactions and squared terms (polynomial), and the Gaussian RBF kernel with its scale parameter chosen according to the median heuristic (RBF).
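The median heuristic mentioned above sets the RBF scale parameter to the median pairwise Euclidean distance between observations. A minimal numpy sketch of this choice, assuming one common parameterization of the Gaussian RBF kernel (the helper name is our own):

```python
import numpy as np

def median_heuristic_rbf(X):
    """RBF Gram matrix with the bandwidth set to the median pairwise
    Euclidean distance among distinct rows of X (the median heuristic)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    dists = np.sqrt(sq[np.triu_indices_from(sq, k=1)])
    h = np.median(dists)                       # median heuristic bandwidth
    return np.exp(-sq / (2.0 * h ** 2)), h

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 10))
Xs = (X - X.mean(axis=0)) / X.std(axis=0)      # column-standardize, as in the text
K, h = median_heuristic_rbf(Xs)
```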
In all cases, we scale the input feature matrix so that the columns have mean 0 and standard deviation 1 before performing the kernel computation.

Figure 1 shows that the objective values for the SVM and SVM-QIP solutions are close when the penalty on balance λ^{-1} is small, with divergence between the two methods occurring towards the end of the regularization path. In the linear case, the paths for the two methods are nearly identical, suggesting that the solutions of the two problems are essentially the same. Divergence in the polynomial and RBF settings is more pronounced due to greater separability in the transformed covariate space, which is more difficult to balance without non-integer weights. When λ is very small, we also find that SVM-QIP returns α = 0, indicating that the penalty on balance is too great. Lastly, the effects of approximating the SVM-QIP solution are reflected in the RBF setting, where upon close inspection the objective value appears to be somewhat noisy and non-monotonic.

Figure 1: Comparison of the objective value between SVM and SVM-QIP under the (a) linear, (b) polynomial, and (c) RBF settings. The blue line denotes the objective value of the SVM solution, and the orange dotted line denotes the objective value of the SVM-QIP solution.

Figure 2: The proportion of samples in the SVM-QIP largest balanced subset covered by the SVM solution under the (a) linear, (b) polynomial, and (c) RBF settings. The instances of zero coverage in the polynomial and RBF settings represent the cases where the SVM-QIP fails to find a nontrivial solution.

Interestingly, the coverage plots in Figure 2 show that even when the objective values of the two methods diverge, the SVM solution still predominantly covers the SVM-QIP solution. The regions with zero coverage in the polynomial and RBF settings correspond to instances where the balance penalty is so large that a nontrivial solution cannot be found for the SVM-QIP. This result illustrates that SVM approximates one-to-one matching by augmenting a well-balanced matched subsample with some non-integer weights, which increases the subset size while preserving the overall balance within the subsample.

Figure 3: ATE estimates for Simulations A (top) and B (bottom) over the SVM regularization path under the (a) linear, (b) polynomial, and (c) RBF settings. The boxplots represent the distribution of the ATE estimates over the Monte Carlo simulations. The red dashed line corresponds to the true ATE.
Next, we evaluate the performance of SVM in estimating the ATE for Simulations A and B. For each scenario, we generate 1,000 datasets with N = 500 samples. For each simulated dataset, we compute the ATE estimate over a fixed grid of 100 λ values chosen based on the simulation scenario and input feature. As described in Section 3.2, we use the linear, polynomial, and RBF-induced features, standardizing the covariate matrix before passing it to the kernel in all cases.

In Figure 3, we plot the distribution of ATE estimates over the Monte Carlo simulations against the regularization parameter λ. The results for Simulation A (top panel) show that the bias approaches zero as the penalty on balance increases (λ decreases). This behavior is due to the fact that the conditional bias of the estimate under the outcome model for Simulation A is given by

E[\hat{\tau} - \tau \mid X^N, T^N] = \gamma^\top \left( \sum_{i=1}^N \alpha_i W_i X_i \right),

where W_i = 2T_i − 1. This implies that all bias comes from prognostic score imbalance. This quantity becomes smallest when minimizing ‖Σ_{i=1}^N α_i W_i X_i‖, which is controlled by the regularization parameter λ in the SVM dual objective under the linear setting. Note that this quantity is also small under both the polynomial and RBF input features.

We also find that under all three settings there is relatively little change in the variance of the estimates along most of the path, suggesting that the variance gained from trimming the sample is counteracted by the variance decrease from correcting for heteroscedasticity. The exception to this observation occurs at the beginning of the linear case, where the reduction in bias also reduces the variance, and at the end of the RBF path, where the amount of trimming is so substantial relative to the balance gained that the variance increases.

For Simulation B (bottom panel), we also observe that the bias decreases as the penalty on balance increases.
However, due to misspecification, nonlinearity, and the presence of treatment effect heterogeneity in the outcome model, the bias never decays to zero, as shown in Section 2.7. We also find that SVM with the linear kernel can reduce bias as well as the other kernels, suggesting that SVM is robust to misspecification and nonlinearity in the outcome model. Similar to Simulation A, we also observe relatively small changes in the variance as the constraint on balance increases, except at the end of the RBF path where there is substantial sample pruning.

Next, we compare the performance of SVM with that of other methods. Our results below show that the performance of SVM is comparable to that of related state-of-the-art covariate balancing methods available in the literature. In particular, we consider kernel optimal matching (KOM; Kallus et al., 2018), kernel covariate balancing (KCB; Wong and Chan, 2018), cardinality matching (CARD; Zubizarreta et al., 2014), and inverse propensity score weighting (IPW) based on logistic regression (GLM) and random forest (RFRST), both of which were used in the original simulation study by Lee et al. (2010). For SVM, we compute solutions at fixed values of λ^{-1} for each setting: 0.10 and 2.60 for Simulation A under the polynomial and RBF settings, and 1.92 and 10.48 for Simulation B under the polynomial and RBF settings, with analogously chosen values under the linear setting. These values are taken from the grid of λ values used in the simulation, based on visual inspection of the path plots in Figure 3 around where the estimate curve flattens out.

For KOM, we compute weights under the linear, polynomial, and RBF settings described earlier with the default settings of the provided code. For KCB, we compute weights using the RBF kernel with its default settings; while KCB allows for other kernel functions, it was originally designed for use with RBF and Sobolev kernels, and we find its results to be poor when using the linear and polynomial features. For CARD, we used thresholds of 0.01 and 0.1 times the standardized difference-in-means. The IPW propensity scores are estimated via logistic regression and random forest, following Lee et al. (2010).

Figure 4: Boxplots of the ATE estimates for Simulations A (left) and B (right). The hatch pattern denotes the input feature (Linear, Polynomial, or RBF). The methods are kernel optimal matching (KOM; Kallus et al., 2018), kernel covariate balancing (KCB; Wong and Chan, 2018), cardinality matching (CARD; Zubizarreta et al., 2014), and IPW with propensity score modeling via logistic regression (GLM) and random forest (RFRST). The red dashed line corresponds to the true ATE.

Figure 4 plots the distributions of the effect estimates over the 1,000 simulated datasets for both scenarios. Simulation A (left panel) shows comparable performance across all methods, with SVM and KOM having the best performance in terms of both bias and variance. In particular, SVM achieves near zero bias under all three input features. The results for KCB show that it performs slightly worse than the other kernel methods, with greater bias and variance under the RBF setting.

The results for CARD show near identical performance to SVM under the linear setting; however, the results under the polynomial setting are notably worse. The reason for this is the choice of the balance threshold, which was set to 0.1 times the standardized difference-in-means of the input feature matrix. Although decreasing the scalar below 0.1 would lead to a more balanced matching, we found that the algorithm was unable to consistently find a solution for all datasets with scalar multiples smaller than 0.1. This result highlights the main issue with defining balance dimension-by-dimension, which makes it difficult to enforce small overall imbalance without information on the underlying geometry of the data. Lastly, the propensity score methods show the worst performance. This is somewhat expected, as the true propensity score model is more complicated than the true outcome model under this simulation setting.

We note that a further reduction in the variance of the SVM solution while preserving bias is likely possible with a more principled method of choosing the solution for each simulated dataset. In general, a value of λ that works well for one dataset may not work well for another, and a better approach would examine estimates over the path and the balance-sample size curves for each dataset individually.
Nevertheless, our heuristic procedure for selecting a solution produced high-quality results, demonstrating the strength of SVM as a balancing method.

The results for Simulation B (right panel) show somewhat more varied performance across methods. Amongst the kernel methods, we find that KOM has the best performance under the polynomial and RBF settings, achieving near zero bias in these scenarios, while SVM has the best performance under the linear setting. The discrepancy under the linear setting is due to misspecification, which leads to a poor regularization parameter choice and consequently poor balance and bias under the KOM procedure.

We also find that SVM is unable to drive the bias to zero, which is due to the treatment effect heterogeneity in the outcome model. As discussed in Section 2.7, SVM ignores the second term in the conditional bias decomposition (17), which is zero under the constant additive treatment effect of Simulation A but is nonzero in Simulation B. In contrast, KOM targets both bias terms in its formulation, which leads to greater bias reduction.

In comparison to the other kernel methods, we find that KCB has comparable bias to SVM but greater variance. For CARD, we observe results comparable to SVM under the linear setting, but again worse performance under the polynomial setting for the reasons mentioned above. Lastly, we find mixed results between the two propensity score methods: logistic regression (GLM) has the worst performance of all methods, while random forest (RFRST) exhibits the second best. This result is likely due to the simple structure of the true propensity score model, whose nonlinearity can only be accurately modeled by RFRST.

4 Empirical Application

In this section, we apply the proposed methodology to the right heart catheterization (RHC) data set originally analyzed in Connors et al. (1996).
This observational data set was used to study the effectiveness of right heart catheterization, a diagnostic procedure, for critically ill patients. The key result from the study was that, after adjusting for a large number of pre-treatment covariates, right heart catheterization appeared to reduce survival rates. This finding contradicts the existing medical perception that the procedure is beneficial.
The data set consists of 5,735 patients, with 2,184 of them assigned to the treatment group and 3,551 assigned to the control group. For each patient, we observe the treatment status, which indicates whether or not he/she received catheterization within 24 hours of hospital admission. The outcome variable represents death within 30 days. Finally, the data set contains a total of 72 pre-treatment covariates that are thought to be related to the decision to perform right heart catheterization. These variables include background information about the patient, such as age, sex, and race, indicator variables for primary/secondary diseases and comorbidities, and various measurements from medical test results.

We compute the full SVM regularization paths under the linear, polynomial, and RBF settings described in Section 3.1. In forming the polynomial features, we exclude all trivial interactions (e.g., interactions between categories of the same categorical variable) and squares of binary-valued covariates. For comparison, we also compute the KOM weights under all three settings, the KCB weights under the RBF setting, and the CARD weights under the linear and polynomial settings with a threshold fixed to 0.1 times the standardized difference-in-means.
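The standardized difference-in-means that defines CARD's balance threshold can be sketched as follows. This is our own generic implementation, not the paper's: it uses a pooled unweighted standard deviation in the denominator as one common convention, and the exact convention used by the authors may differ:

```python
import numpy as np

def standardized_diff_in_means(X, T, w=None):
    """Per-covariate weighted difference in means between treated (T == 1)
    and control (T == 0), scaled by the pooled unweighted standard deviation."""
    if w is None:
        w = np.ones(len(T))
    wt, wc = w * (T == 1), w * (T == 0)
    mean_t = (wt @ X) / wt.sum()
    mean_c = (wc @ X) / wc.sum()
    pooled = np.sqrt((X[T == 1].var(axis=0) + X[T == 0].var(axis=0)) / 2.0)
    return (mean_t - mean_c) / pooled

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
T = rng.integers(0, 2, size=200)
smd = standardized_diff_in_means(X, T)   # one entry per covariate
```

Passing balancing weights through `w` gives the weighted version of the diagnostic used in the figures of this section.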
Figure 5 plots the ATE estimates over the SVM regularization paths with pointwise 95% confidence intervals based on the weighted Neyman variance estimator (Imbens and Rubin, 2015, Chapter 19). The horizontal axis represents the normed difference-in-means within the weighted subset as a covariate balance measure. For all three settings, we find that the estimated ATE slightly increases as the weighted subset becomes more balanced, supporting the results originally reported in Connors et al. (1996) that right heart catheterization decreased survival rates.

Figure 5: ATE estimates for the RHC data over the SVM regularization path under the (a) linear, (b) polynomial, and (c) RBF settings. The horizontal axis represents the normed difference-in-means in covariates within the weighted subset. The solid blue line denotes the average estimate, and the solid gray background denotes the pointwise 95% confidence intervals.

Figure 6: Trade-off between balance and effective sample size under the (a) linear, (b) polynomial, and (c) RBF settings. The black dashed line indicates the estimated elbow point.

Figure 6 illustrates the trade-off between the balance measure (the normed difference-in-means in covariates within the weighted subset) and the effective subset size, i.e., the balance-sample size frontier. Such graphs can be useful to researchers in selecting a solution along the regularization path for estimating the ATE. Across all cases, we achieve a good amount of balance improvement once the data set is pruned to about 3,500 units, which occurs around where the trade-off between subset size and balance becomes less favorable.

We also examine differences in dimension-by-dimension balance between SVM and CARD and between SVM and KOM under the linear and polynomial settings. We do not conduct such a comparison for RBF, whose feature space is infinite dimensional. Here, we consider four different SVM solutions: the largest subset size solution whose standardized difference-in-means in covariates is below 0.1 for all dimensions, the solution whose effective sample size is nearest the subset size of the other method, the solution whose normed difference-in-means in covariates is closest to that of the other method, and the solution occurring at the kneedle estimate of the elbow of the balance-weight sum curve.
We take the minimum-balance solution when no elbow exists, as in the linear case. The effective sample size is computed according to Kish's formula:

N_e = \frac{\left( \sum_{i \in \mathcal{T}} \alpha_i \right)^2}{\sum_{i \in \mathcal{T}} \alpha_i^2} + \frac{\left( \sum_{i \in \mathcal{C}} \alpha_i \right)^2}{\sum_{i \in \mathcal{C}} \alpha_i^2}.  (23)

Figure 7 presents the covariate balance comparisons between SVM and CARD for both the linear and polynomial settings. Comparing against the small difference-in-means solution (leftmost column), for which the standardized difference-in-means for all covariates are below 0.1, CARD retains more observations in its selected subset, but SVM achieves better covariate balance than CARD for most dimensions, although there are some large imbalances. This is expected because SVM minimizes the overall covariate imbalance without a per-dimension constraint as in CARD. We observe a similar result when comparing CARD with the SVM solutions based on the closest effective sample size (left-middle column) and the closest normed difference-in-means (right-middle column). It is notable that the latter generally achieves better covariate balance while retaining more observations than CARD. Finally, the results for the elbow SVM solution (rightmost column) show that tight balance is attainable with a moderate amount of sample pruning, with near exact balance in the linear setting. This level of covariate balance is difficult to achieve with CARD due to the infeasibility of the optimization, particularly in high dimensional settings.

Figure 8 shows the dimension-by-dimension balance comparison of the KOM solution against the SVM solution. Here, we see that under the linear setting, KOM retains significantly more units than the SVM solution while attaining the same balance. This is due to the Σ_i α_i W_i = 0 constraint of SVM, which encourages the selected subset to have a roughly equal proportion of treated and control units, while the KOM solution allows the resulting subset to be more imbalanced.
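Kish's formula in equation (23) is straightforward to compute; a minimal numpy sketch with hypothetical weights (the helper name is our own):

```python
import numpy as np

def kish_effective_sample_size(alpha, T):
    """Effective sample size of equation (23): Kish's formula applied
    separately to the treated (T == 1) and control (T == 0) weights."""
    ess = 0.0
    for group in (T == 1, T == 0):
        a = alpha[group]
        if a.sum() > 0:
            ess += a.sum() ** 2 / (a ** 2).sum()
    return ess

alpha = np.array([1.0, 1.0, 0.5, 0.0, 1.0, 0.25, 0.0, 1.0])
T = np.array([1, 1, 1, 1, 0, 0, 0, 0])
ne = kish_effective_sample_size(alpha, T)
```

With uniform 0/1 weights the formula reduces to the number of selected units, so fractional weights always yield an effective size no larger than the raw subset size.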
Under the polynomial setting, however, we find that SVM retains significantly more units than KOM while achieving a similar degree of covariate balance, which is likely due to a poor regularization parameter choice by the KOM algorithm.

Figure 7: Standardized difference-in-means of covariates: comparisons between SVM and cardinality matching (CARD) under the linear (top) and polynomial (bottom) settings with different SVM solutions: (a) standardized difference-in-means less than 0.1 in all covariates, (b) effective sample size closest to that of CARD, (c) normed difference-in-means closest to that of CARD, and (d) elbow of the regularization path. The effective sample size for each method is given as N_e in the parentheses (for the SVM solutions, N_e = 3830, 4186, 4367, and 3442 under the linear setting and N_e = 3777, 4071, 4304, and 3325 under the polynomial setting for panels (a) through (d), respectively). Note that darker areas correspond to higher concentrations of points.

Lastly, we compare the point estimates of the ATE, the weighted Neyman standard errors, and the effective sample sizes for SVM, CARD (with the linear and polynomial features), KOM, and KCB (with RBF) in Table 1. We consider three different solutions from the SVM path: SVM_imbalance, which corresponds to the initial solution for which the balance constraint is most relaxed and α_i = 1 for i ∈ T; SVM_balance, which corresponds to the most regularized solution with the best covariate balance on the path; and SVM_elbow, which corresponds to the solution occurring at the elbow of the balance-weight sum curves shown in Figure 6.

The results show that SVM leads to a positive estimate in all cases, which agrees with the original finding reported in Connors et al. (1996).
We also find that the three SVM solutions differ most significantly in their standard errors, which increase as the constraint on balance becomes stronger and the subset is more heavily pruned, as shown in the effective sample size column. In particular, the heavily balanced SVM_balance solution under the RBF setting leads to a 95% confidence interval that overlaps with zero. This is in contrast with the less balanced SVM_elbow solution, which has both a larger effect estimate and a smaller confidence interval. This result demonstrates the necessity of computing the regularization path, so that researchers may avoid low-quality solutions due to poor parameter choice.

Figure 8: Standardized difference-in-means comparisons between SVM and kernel optimal matching (KOM) under the linear (left) and polynomial (right) settings. The effective sample sizes of the SVM solutions are N_e = 3443 (linear) and N_e = 3356 (polynomial).

Comparing against the other methods, we observe that KOM yields larger positive effect estimates with comparable standard errors in the linear and RBF settings. However, under the polynomial setting, its standard error is much larger and the sample is significantly more pruned than for the modestly balanced SVM_elbow solution. Both CARD and KCB produce smaller positive effect estimates, with the standard error for KCB leading to a 95% confidence interval that overlaps with zero.
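The point estimates and standard errors discussed above combine a weighted difference in means with a weighted Neyman variance. The sketch below shows one common form of such an estimator; it is our own simplification and not necessarily the exact estimator of Imbens and Rubin (2015) used in the paper:

```python
import numpy as np

def weighted_ate_and_se(Y, T, alpha):
    """Hajek-style weighted difference in means with a Neyman-style
    standard error (one common form, not necessarily the paper's exact one)."""
    wt = alpha * (T == 1)
    wc = alpha * (T == 0)
    wt, wc = wt / wt.sum(), wc / wc.sum()        # normalize within each arm
    mu_t, mu_c = wt @ Y, wc @ Y
    var_t = (wt ** 2) @ ((Y - mu_t) ** 2)        # weighted within-arm variances
    var_c = (wc ** 2) @ ((Y - mu_c) ** 2)
    return mu_t - mu_c, np.sqrt(var_t + var_c)

rng = np.random.default_rng(4)
n = 300
T = rng.integers(0, 2, size=n)
Y = 1.0 * T + rng.normal(size=n)          # true effect of 1 for illustration
alpha = rng.uniform(0.5, 1.0, size=n)     # arbitrary nonnegative weights
ate, se = weighted_ate_and_se(Y, T, alpha)
```

As the weights concentrate on fewer units, the squared-weight terms grow, which is the mechanism behind the larger standard errors of the heavily pruned solutions above.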
Table 1: ATE estimates, weighted Neyman standard errors, and effective sample sizes for the RHC data under each method and input feature. The three SVM solutions are the solution with the best covariate balance on the path (SVM_balance), the elbow solution (SVM_elbow), and the solution with the worst covariate balance on the path (SVM_imbalance).

5 Concluding Remarks

In this paper, we show how support vector machines (SVMs) can be used to compute covariate balancing weights and estimate causal effects. We establish a number of interpretations of SVM as a covariate balancing procedure. First, the SVM dual problem computes weights that minimize a trade-off between the MMD and a measure of subset size while simultaneously maximizing the effective sample size. Second, the SVM dual problem can be viewed as a continuous relaxation of the largest balanced subset problem, which is closely related to cardinality matching. Lastly, similar to the existing kernel balancing methods, SVM weights were shown to minimize the worst-case bias due to prognostic score imbalance. Additionally, path algorithms can be used to compute the entire set of SVM solutions as the regularization parameter varies, which constitutes a balance-sample size frontier. These methods provide researchers with a characterization of the balance-sample size trade-off and allow them to visualize how causal effect estimates vary as the constraint on balance changes.

Our work suggests several possible directions for future research. On the algorithmic side, a disadvantage of the proposed methodology is that it encourages a roughly equal effective number of treated and control units in the optimal subset, which can lead to unnecessary sample pruning. One could use weighted SVM (Lin and Wang, 2002) to address this problem, but existing path algorithms are applicable only to unweighted SVM. On the theoretical side, the results in this paper suggest a fundamental connection between the support vectors and the set of overlap, i.e., {X : 0 < e(X) < 1}.
Steinwart (2004) shows that the fraction of support vectors for a variant of the SVM discussed here asymptotically approaches the measure of this overlap set, suggesting that SVM may be used to develop a statistical test for the overlap assumption.

References

Athey, S., Imbens, G. W., and Wager, S. (2018). Approximate residual balancing: debiased inference of average treatment effects in high dimensions.
Journal of the Royal Statistical Society, Series B, Methodological, (4), 597–623.

Chan, K. C. G., Yam, S. C. P., and Zhang, Z. (2016). Globally efficient nonparametric inference of average treatment effects by empirical balancing calibration weighting. Journal of the Royal Statistical Society, Series B, Methodological, 673–700.

Connors, A. F., Speroff, T., Dawson, N. V., Thomas, C., Harrell, F. E., Wagner, D., Desbiens, N., Goldman, L., Wu, A. W., Califf, R. M., et al. (1996). The effectiveness of right heart catheterization in the initial care of critically ill patients. JAMA, (11), 889–897.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, (3), 273–297.

Dinkelbach, W. (1967). On nonlinear fractional programming. Management Science, (7), 492–498.

Fan, J., Imai, K., Liu, H., Ning, Y., and Yang, X. (2016). Improving covariate balancing propensity score: A doubly robust and efficient approach. Technical report, Princeton University.

Ghosh, D. (2018). Relaxed covariate overlap and margin-based causal effect estimation. Statistics in Medicine, (28), 4252–4265.

Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. J. (2007a). A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, pages 513–520.

Gretton, A., Fukumizu, K., Teo, C., Song, L., Schölkopf, B., and Smola, A. (2007b). A kernel statistical test of independence. Advances in Neural Information Processing Systems, 585–592.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. The Journal of Machine Learning Research, (1), 723–773.

Gurobi Optimization, L. (2020). Gurobi optimizer reference manual.

Hainmueller, J. (2012). Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, pages 25–46.

Hansen, B. B. (2008). The prognostic analogue of the propensity score. Biometrika, (2), 481–488.

Hastie, T., Rosset, S., Tibshirani, R., and Zhu, J. (2004). The entire regularization path for the support vector machine. Journal of Machine Learning Research, 1391–1415.

Hazlett, C. (2020). Kernel balancing: A flexible non-parametric weighting procedure for estimating causal effects. Statistica Sinica, (3), 1155–1189.

Ho, D. E., Imai, K., King, G., and Stuart, E. A. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis, (3), 199–236.

Imai, K. and Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society, Series B (Statistical Methodology), (1), 243–263.

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.

Kallus, N. (2020). Generalized optimal matching methods for causal inference. Journal of Machine Learning Research, (62), 1–54.

Kallus, N., Pennicooke, B., and Santacatterina, M. (2018). More robust estimation of sample average treatment effects using kernel optimal matching in an observational study of spine surgical interventions. arXiv preprint arXiv:1811.04274.

King, G. and Zeng, L. (2006). The dangers of extreme counterfactuals. Political Analysis, (2), 131–159.

King, G., Lucas, C., and Nielsen, R. A. (2017). The balance-sample size frontier in matching methods for causal inference. American Journal of Political Science, (2), 473–489.

Lee, B. K., Lessler, J., and Stuart, E. A. (2010). Improving propensity score weighting using machine learning. Statistics in Medicine, (3), 337–346.

Li, F., Morgan, K. L., and Zaslavsky, A. M. (2018). Balancing covariates via propensity score weighting. Journal of the American Statistical Association, (521), 390–400.

Lin, C.-F. and Wang, S.-D. (2002). Fuzzy support vector machines. IEEE Transactions on Neural Networks, (2), 464–471.

Lunceford, J. K. and Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine, (19), 2937–2960.

Ning, Y., Peng, S., and Imai, K. (2020). Robust estimation of causal effects via high-dimensional covariate balancing propensity score. Biometrika, (3), 533–554.

Ratkovic, M. (2014). Balancing within the margin: Causal effect estimation with support vector machines. Technical report, Department of Politics, Princeton University, Princeton, NJ.

Rubin, D. B. (1990). Comments on "On the application of probability theory to agricultural experiments. Essay on principles. Section 9" by J. Splawa-Neyman, translated from the Polish and edited by D. M. Dabrowska and T. P. Speed. Statistical Science, 472–480.

Rubin, D. B. (2006). Matched Sampling for Causal Effects. Cambridge University Press, Cambridge.

Schaible, S. (1976). Fractional programming. II, On Dinkelbach's algorithm. Management Science, (8), 868–873.

Schölkopf, B., Smola, A. J., Bach, F., et al. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.

Sentelle, C., Anagnostopoulos, G., and Georgiopoulos, M. (2016). A simple method for solving the SVM regularization path for semidefinite kernels. IEEE Transactions on Neural Networks and Learning Systems, (4), 709.

Setoguchi, S., Schneeweiss, S., Brookhart, M. A., Glynn, R. J., and Cook, E. F. (2008). Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiology and Drug Safety, (6), 546–555.

Sriperumbudur, B. K. (2011). Mixture density estimation via Hilbert space embedding of measures. In , pages 1027–1030. IEEE.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. The Journal of Machine Learning Research, 1517–1561.

Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., Lanckriet, G. R., et al. (2012). On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 1550–1599.

Steinwart, I. (2004). Sparseness of support vector machines—some asymptotically sharp bounds. In Advances in Neural Information Processing Systems, pages 1069–1076.

Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, (1), 1–21.

Tan, Z. (2010). Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika, (3), 661–682.

Tan, Z. (2020). Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data. Biometrika, (1), 137–158.

Wong, R. K. and Chan, K. C. G. (2018). Kernel-based covariate functional balancing for observational studies. Biometrika, (1), 199–213.

Zhao, Q. (2019). Covariate balancing propensity score by tailored loss functions. Annals of Statistics, (2), 965–993.

Zhu, Y., Savage, J. S., and Ghosh, D. (2018). A kernel-based metric for balance assessment. Journal of Causal Inference, (2).

Zubizarreta, J. R. (2015). Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association, (511), 910–922.

Zubizarreta, J. R., Paredes, R. D., Rosenbaum, P. R., et al. (2014). Matching for balance, pairing for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit high schools in Chile. The Annals of Applied Statistics, (1), 204–231.

Supplementary Appendix
A.1 Proof of Theorem 1
We prove this theorem by establishing several equivalent reformulations of the SVM dual problem. By equivalence, we mean that these problems differ from one another only in the scaling of the regularization parameter, implying that their regularization paths consist of the same set of solutions. More formally, given two optimization problems P1 and P2, we say that P1 and P2 are equivalent if a solution $\alpha^*$ for P1 can be used to construct a solution for P2.

Denote the SVM weight set
\[
\mathcal{A}_{\mathrm{SVM}} = \left\{ \alpha \in \mathbb{R}^N : 0 \preceq \alpha \preceq 1, \; \sum_{i \in \mathcal{T}} \alpha_i = \sum_{j \in \mathcal{C}} \alpha_j \right\},
\]
and consider a rescaled version of the SVM dual given in equation (4), which we label as P1:
\[
\min_{\alpha} \; \alpha^\top Q \alpha - \nu \mathbf{1}^\top \alpha \quad \text{s.t.} \quad \alpha \in \mathcal{A}_{\mathrm{SVM}}. \tag{P1}
\]
Note that for a given $\lambda$ and solution $\alpha^*$ to the original problem defined in (4) under $\lambda$, $\alpha^*$ is also a solution to the rescaled problem P1 under $\nu = 2\lambda$. This establishes the equivalence between these two problems.

We begin by proving the following lemma, which allows us to replace the squared seminorm term $\alpha^\top Q \alpha$ with $\sqrt{\alpha^\top Q \alpha}$ to obtain the problem
\[
\min_{\alpha} \; \sqrt{\alpha^\top Q \alpha} - \nu \mathbf{1}^\top \alpha \quad \text{s.t.} \quad \alpha \in \mathcal{A}_{\mathrm{SVM}}. \tag{P2}
\]
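To make the objects in this proof concrete, the rescaled dual (P1) can be solved numerically on toy data with a generic solver. The sketch below is purely illustrative: the data, the Gaussian kernel, and the value $\nu = 0.5$ are arbitrary assumptions, and scipy's general-purpose SLSQP routine stands in for the specialized path algorithm used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy data (hypothetical): 6 treated and 6 control units with 2 covariates,
# the treated group shifted so that covariates are imbalanced.
N = 12
T = np.array([1] * 6 + [0] * 6)   # treatment indicators T_i
W = 2 * T - 1                     # signed labels W_i = 2 T_i - 1
X = rng.normal(size=(N, 2)) + 0.5 * T[:, None]

# Q_ij = W_i W_j K(X_i, X_j) with a Gaussian kernel (bandwidth 1, illustrative).
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
Q = np.outer(W, W) * np.exp(-sq_dist)

nu = 0.5  # rescaled regularization parameter (arbitrary for illustration)

def objective(a):
    # Rescaled SVM dual objective of (P1).
    return a @ Q @ a - nu * a.sum()

# Feasible set A_SVM: 0 <= alpha_i <= 1 with equal treated and control sums.
balance = {"type": "eq", "fun": lambda a: a[T == 1].sum() - a[T == 0].sum()}
res = minimize(objective, x0=np.full(N, 0.5), bounds=[(0.0, 1.0)] * N,
               constraints=[balance], method="SLSQP")
alpha = res.x
```

The resulting weights lie in $\mathcal{A}_{\mathrm{SVM}}$ and attain a lower objective value than the feasible starting point, which is all (P1) requires for the arguments that follow.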
Lemma 1
The problems (P1) and (P2) are equivalent.
Proof
This result follows from the strong duality of the SVM dual problem, which allows us to form the following equivalent problem in which the penalized term $\mathbf{1}^\top \alpha$ is replaced with a hard constraint with threshold $\epsilon$:
\[
\min_{\alpha} \; \alpha^\top Q \alpha \quad \text{s.t.} \quad \alpha \in \mathcal{A}_{\mathrm{SVM}}, \; \mathbf{1}^\top \alpha \geq \epsilon. \tag{P3}
\]
The solution to (P3) is unchanged whether we minimize $\alpha^\top Q \alpha$ or $\sqrt{\alpha^\top Q \alpha}$, so the problem
\[
\min_{\alpha} \; \sqrt{\alpha^\top Q \alpha} \quad \text{s.t.} \quad \alpha \in \mathcal{A}_{\mathrm{SVM}}, \; \mathbf{1}^\top \alpha \geq \epsilon
\]
is identical to (P3). By strong duality, we can again enforce the hard constraint on the term $\mathbf{1}^\top \alpha$ through a penalized term with a new regularization parameter, which establishes the equivalence between (P1) and (P2). $\Box$

Next, we consider the fractional program
\[
\min_{\alpha} \; \frac{\sqrt{\alpha^\top Q \alpha}}{\mathbf{1}^\top \alpha / 2} \quad \text{s.t.} \quad \alpha \in \mathcal{A}_{\mathrm{SVM}}. \tag{P4}
\]
The following lemma connects (P4) to the reformulated SVM problem (P2) through Dinkelbach's method (Dinkelbach, 1967; Schaible, 1976):
Lemma 2 (Dinkelbach, 1967, Theorem 1) Suppose $\alpha_* \in \mathcal{A}_{\mathrm{SVM}}$ and $\alpha_* \neq 0$. Then
\[
q^* = \frac{\sqrt{\alpha_*^\top Q \alpha_*}}{\mathbf{1}^\top \alpha_* / 2} = \min_{\alpha \in \mathcal{A}_{\mathrm{SVM}}} \frac{\sqrt{\alpha^\top Q \alpha}}{\mathbf{1}^\top \alpha / 2}
\]
if, and only if,
\[
\min_{\alpha \in \mathcal{A}_{\mathrm{SVM}}} \sqrt{\alpha^\top Q \alpha} - q^* \, \mathbf{1}^\top \alpha / 2 = \sqrt{\alpha_*^\top Q \alpha_*} - q^* \, \mathbf{1}^\top \alpha_* / 2 = 0.
\]

Thus, the solution to the rescaled SVM dual problem (P2) under $\nu = q^*/2$ is also a solution to the fractional program (P4). Finally, consider the problem
\[
\min_{\alpha} \; \sqrt{\alpha^\top Q \alpha} \quad \text{s.t.} \quad \alpha \in \mathcal{A}_{\mathrm{simplex}}. \tag{P5}
\]
The following lemma establishes equivalence between (P4) and (P5) under the proper renormalization of the fractional program solution.

Lemma 3
Assume any solution $\alpha^*$ to (P4) is such that $\alpha^* \neq 0$. Then the problems (P4) and (P5) are equivalent.
Proof
Let $\alpha^{(1)} \neq 0$ and $\alpha^{(2)}$ be solutions to problems (P4) and (P5), respectively, and consider the vector-valued function $f : \mathcal{A}_{\mathrm{SVM}} \setminus \{0\} \mapsto \mathcal{A}_{\mathrm{simplex}}$, $f(\alpha) = \alpha / (\mathbf{1}^\top \alpha / 2)$. Since $\mathcal{A}_{\mathrm{simplex}} \subset \mathcal{A}_{\mathrm{SVM}} \setminus \{0\}$, $\alpha^{(2)}$ is feasible for (P4). Then by the optimality of $\alpha^{(1)}$, we have
\[
\frac{\sqrt{\alpha^{(1)\top} Q \alpha^{(1)}}}{\mathbf{1}^\top \alpha^{(1)} / 2} \leq \frac{\sqrt{\alpha^{(2)\top} Q \alpha^{(2)}}}{\mathbf{1}^\top \alpha^{(2)} / 2} = \sqrt{\alpha^{(2)\top} Q \alpha^{(2)}}.
\]
Next, note that $f(\alpha^{(1)})$ is feasible for (P5). Then by the optimality of $\alpha^{(2)}$, we have
\[
\sqrt{\alpha^{(2)\top} Q \alpha^{(2)}} \leq \sqrt{f(\alpha^{(1)})^\top Q f(\alpha^{(1)})} = \frac{\sqrt{\alpha^{(1)\top} Q \alpha^{(1)}}}{\mathbf{1}^\top \alpha^{(1)} / 2}.
\]
In order for both of these inequalities to be true, we must have
\[
\frac{\sqrt{\alpha^{(1)\top} Q \alpha^{(1)}}}{\mathbf{1}^\top \alpha^{(1)} / 2} = \sqrt{\alpha^{(2)\top} Q \alpha^{(2)}},
\]
so that $f(\alpha^{(1)})$ solves (P5) and $\alpha^{(2)}$ solves (P4). Note that the assumption $\alpha^{(1)} \neq 0$ is only a formality, since by Lemma 2 the trivial solution $\alpha = 0$ can occur only when $\nu < q^*/2$, which is outside the portion of the regularization path that we consider. $\Box$

We are now ready to prove Theorem 1. Part (i): Lemma 1 establishes equivalence between the regularization paths for the rescaled SVM dual (P1) and (P2). In addition, Lemma 2 establishes the existence of $\nu^*$ such that the solution to (P2) under $\nu^*$ is also a solution to (P4). It then follows that there exists $\lambda^*$ such that the solution to the rescaled SVM dual problem under $\lambda^*$ minimizes (P4). Finally, recall that Lemma 3 establishes that the minimizing solution to (P4) is also a solution to the weighted MMD minimization problem. Therefore, there exists $\lambda^*$ such that the solution to the SVM dual under $\lambda^*$ minimizes the weighted MMD. Part (ii): The proof follows from Schaible (1976, Lemma 3).

A.2 The Path Algorithm of Sentelle et al. (2016)

We briefly describe the path algorithm of Sentelle et al. (2016) used in this paper.
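Before turning to the exact algorithm, note that its output can be approximated by brute force: re-solve the penalized dual on a decreasing grid of $\lambda$ values and record the balance term $\alpha^\top Q \alpha$ and the effective sample size $\mathbf{1}^\top \alpha$ at each grid point. The sketch below does exactly this with a generic solver on toy data; the grid, kernel, and data are illustrative assumptions, not the method of Sentelle et al. (2016).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy data (hypothetical): 5 treated, 5 control units with shifted means.
N = 10
T = np.array([1] * 5 + [0] * 5)
W = 2 * T - 1
X = rng.normal(size=(N, 2)) + 0.4 * T[:, None]
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
Q = np.outer(W, W) * np.exp(-sq_dist)  # Q_ij = W_i W_j K(X_i, X_j)

# Constraint defining A_SVM together with the box bounds below.
balance = {"type": "eq", "fun": lambda a: a[T == 1].sum() - a[T == 0].sum()}

frontier = []
for lam in [2.0, 1.0, 0.5, 0.25]:   # decreasing grid of regularization values
    res = minimize(lambda a: a @ Q @ a - 2 * lam * a.sum(),  # dual with nu = 2*lam
                   x0=np.full(N, 0.5), bounds=[(0.0, 1.0)] * N,
                   constraints=[balance], method="SLSQP")
    a = res.x
    # Record (balance term, effective sample size) at this grid point.
    frontier.append((float(a @ Q @ a), float(a.sum())))
```

As the theory predicts, the effective sample size is non-increasing as $\lambda$ decreases along the grid, tracing out the balance-sample size trade-off; the exact algorithm below obtains the same path at its breakpoints without re-solving from scratch.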
The regularization path for SVM is characterized by a sequence of breakpoints, representing the values of $\lambda$ at which either one of the support vectors on the margin, i.e., $W_i f(X_i) = 1$, exits the margin, or a non-marginal observation reaches the margin. Between these breakpoints, the coefficients $\alpha_i$ of the marginal support vectors change linearly in $\lambda$, while the coefficients of all other observations stay fixed as $\lambda$ is changed. Since the KKT conditions must be met for any solution $\alpha$, we can use a linear system of equations to compute how each $\alpha_i$ and $\alpha$ changes with respect to $\lambda$.

Based on this idea, beginning with an initial solution corresponding to some large initial value of $\lambda$, the path algorithm first computes how the current marginal support vectors change with respect to $\lambda$. Given this quantity, the next breakpoint in the path is computed by decreasing $\lambda$ until a marginal support vector exits the margin, i.e., $\alpha_i = 0$ or $\alpha_i = 1$, or a non-marginal observation enters the margin, i.e., $W_i f(X_i) = 1$. At this point, the marginal support vector set is updated, and the changes in $\alpha_i$ and $\alpha$, as well as the next breakpoint, are computed. This procedure repeats until the terminal value of $\lambda$ is reached.

A.3 Conditional Bias with Respect to SATE and SATT
In this section, we derive the conditional bias for the weighted difference-in-means estimator. Note that our derivation follows the one given in Kallus et al. (2018). Consider the problem of estimating the SATE and the SATT, defined as
\[
\tau_{\mathrm{SATE}} = \frac{1}{N} \sum_{i=1}^N \{Y_i(1) - Y_i(0)\} \quad \text{and} \quad \tau_{\mathrm{SATT}} = \frac{1}{n_T} \sum_{i \in \mathcal{T}} \{Y_i(1) - Y_i(0)\}, \tag{24}
\]
respectively. We denote the weighted estimator by $\widehat{\tau}$, which has the form
\[
\widehat{\tau} = \sum_{i \in \mathcal{T}} \alpha_i Y_i - \sum_{i \in \mathcal{C}} \alpha_i Y_i, \tag{25}
\]
where $\alpha \in \mathcal{A}_{\mathrm{simplex}}$. The conditional bias with respect to the SATE is given by
\begin{align*}
\mathbb{E}[\widehat{\tau} - \tau_{\mathrm{SATE}} \mid X^N, T^N]
&= \sum_{i \in \mathcal{T}} \mathbb{E}[\alpha_i Y_i \mid X^N, T^N] - \sum_{i \in \mathcal{C}} \mathbb{E}[\alpha_i Y_i \mid X^N, T^N] - \mathbb{E}[\tau_{\mathrm{SATE}} \mid X^N, T^N] \\
&= \sum_{i=1}^N \alpha_i \{T_i - (1 - T_i)\} \, \mathbb{E}[Y_i(T_i) \mid X_i, T_i] - \frac{1}{N} \sum_{i=1}^N \mathbb{E}[Y_i(1) - Y_i(0) \mid X_i, T_i] \\
&= \sum_{i=1}^N \alpha_i \{T_i - (1 - T_i)\} \, \mathbb{E}[Y_i(T_i) \mid X_i] - \frac{1}{N} \sum_{i=1}^N \mathbb{E}[Y_i(1) - Y_i(0) \mid X_i] \\
&= \sum_{i=1}^N \alpha_i T_i \, \mathbb{E}[Y_i(1) \mid X_i] - \sum_{i=1}^N \alpha_i (1 - T_i) \, \mathbb{E}[Y_i(0) \mid X_i] - \frac{1}{N} \sum_{i=1}^N \tau(X_i) \\
&= \sum_{i=1}^N \alpha_i T_i \{f(X_i) + \tau(X_i)\} - \sum_{i=1}^N \alpha_i (1 - T_i) f(X_i) - \frac{1}{N} \sum_{i=1}^N \tau(X_i) \\
&= \sum_{i=1}^N \left( \alpha_i T_i - \frac{1}{N} \right) \tau(X_i) + \sum_{i=1}^N \alpha_i \{T_i - (1 - T_i)\} f(X_i) \\
&= \sum_{i=1}^N \left( \alpha_i T_i - \frac{1}{N} \right) \tau(X_i) + \sum_{i=1}^N \alpha_i W_i f(X_i),
\end{align*}
where the second and third equalities follow from Assumptions 1 and 2. By a similar argument, the conditional bias with respect to the SATT is
\[
\mathbb{E}[\widehat{\tau} - \tau_{\mathrm{SATT}} \mid X^N, T^N] = \sum_{i \in \mathcal{T}} \left( \alpha_i - \frac{1}{n_T} \right) \tau(X_i) + \sum_{i=1}^N \alpha_i W_i f(X_i).
\]

A.4 Worst-case Bias in an RKHS
We consider the problem of minimizing the bias due to prognostic score imbalance, defined in (17). Restricting $f$ to the unit ball of the RKHS, defined as $\mathcal{F}_K$ in Section 2.3, and considering the $f$ which maximizes the absolute value of this quantity, we compute the worst-case squared bias due to prognostic score imbalance as
\[
B(\alpha; X^N, T^N) = \sup_{f \in \mathcal{F}_K} \left( \sum_{i=1}^N \alpha_i W_i f(X_i) \right)^2.
\]
We can simplify this expression as
\begin{align*}
B(\alpha; X^N, T^N) &= \sup_{f \in \mathcal{F}_K} \left( \sum_{i=1}^N \alpha_i W_i f(X_i) \right)^2 = \sup_{f \in \mathcal{F}_K} \left( \sum_{i=1}^N \alpha_i W_i \langle f, \phi(X_i) \rangle \right)^2 \\
&= \sup_{f \in \mathcal{F}_K} \left\langle f, \sum_{i=1}^N \alpha_i W_i \phi(X_i) \right\rangle^2 = \left\| \sum_{i=1}^N \alpha_i W_i \phi(X_i) \right\|_{\mathcal{H}_K}^2 = \gamma_K\big( \widehat{F}_\alpha, \widehat{G}_\alpha \big)^2.
\end{align*}
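The final identity, that the worst-case bias over the unit ball of an RKHS reduces to the RKHS norm of the weighted embedding and is therefore computable through the kernel matrix alone, can be checked numerically for a kernel with a known finite-dimensional feature map. In the sketch below, the polynomial kernel, the toy data, and the simplex weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data and simplex weights (hypothetical): 3 treated, 5 control units.
N = 8
T = np.array([1] * 3 + [0] * 5)
W = 2 * T - 1
x = rng.normal(size=N)
alpha = np.where(T == 1, 1 / 3, 1 / 5)  # uniform weights within each group

# Polynomial kernel K(x, z) = (1 + x z)^2 has the explicit feature map
# phi(x) = (1, sqrt(2) x, x^2), since <phi(x), phi(z)> = 1 + 2 x z + (x z)^2.
K = (1 + np.outer(x, x)) ** 2
phi = np.stack([np.ones(N), np.sqrt(2) * x, x ** 2], axis=1)  # shape (N, 3)

v = alpha * W
# Worst-case squared bias via the kernel trick: v' K v.
bias_kernel = float(v @ K @ v)
# The same quantity via the explicit embedding: || sum_i alpha_i W_i phi(x_i) ||^2.
embed = phi.T @ v
bias_feature = float(embed @ embed)
```

The two quantities agree to numerical precision, illustrating that the supremum over the unit ball never needs to be computed explicitly: the squared weighted MMD $v^\top K v$ suffices.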