Matching on What Matters: A Pseudo-Metric Learning Approach to Matching Estimation in High Dimensions
Gentry Johnson
Department of Economics, University of Maryland, College Park, MD 20742
[email protected]

Brian Quistorff
AI + Research, Microsoft, Redmond, WA 98052

Matt Goldman
AI + Research, Microsoft, Redmond, WA 98052
[email protected]
Abstract
When pre-processing observational data via matching, we seek to approximate each unit with maximally similar peers that had an alternative treatment status, essentially replicating a randomized block design. However, as one considers a growing number of continuous features, a curse of dimensionality applies, making asymptotically valid inference impossible (Abadie and Imbens, 2006). The alternative of ignoring plausibly relevant features is certainly no better, and the resulting trade-off substantially limits the application of matching methods to "wide" datasets. Instead, Li and Fu (2017) recast the problem of matching in a metric learning framework that maps features to a low-dimensional space that facilitates "closer matches" while still capturing important aspects of unit-level heterogeneity. However, that method lacks key theoretical guarantees and can produce inconsistent estimates in cases of heterogeneous treatment effects. Motivated by a straightforward extension of existing results in the matching literature, we present alternative techniques that learn latent matching features through either MLPs or siamese neural networks trained on a carefully selected loss function. We benchmark the resulting methods in simulations as well as against two experimental data sets, including the canonical NSW worker training program data set, and find superior performance of the neural-net-based methods.
1 Introduction

While experimental approaches to causal inference have strong theoretical guarantees, they are impractical in some settings. This leads researchers to rely on observational methods, which suppose that conditioning on available covariates is sufficient to yield unbiased estimates. This can be especially tenuous (or prone to researcher manipulation) when the conditioning variables are selected in an ad hoc way. Recognizing this limitation, there has been growing interest in importing machine learning (ML) tools to automate and accelerate the model selection process (Athey, 2017). Most examples of this work (Chernozhukov et al., 2016; Wager and Athey, 2018) rely on "black-box" ML algorithms that may obscure the underlying counterfactual reasoning. By contrast, matching methods yield a set of direct unit-to-unit comparisons that are fully transparent to policy makers, but are
Preprint. Under review.

much less robust to the curse of dimensionality presented by high-dimensional data. Nonetheless, there has been relatively little work on the problem of model selection in high dimensions for these methods.

Instead, propensity score matching (PSM) remains the most commonly used matching method in applied economics and related fields. This approach side-steps the curse of dimensionality by collapsing all covariates to a single number reflecting the probability of treatment (McCaffrey et al., 2004; Wyss et al., 2014). However, PSM has been criticized because it does not guarantee covariate balance in the resulting sample, thus allowing significant model dependence and inefficiency in downstream estimations (King and Nielsen, 2016). Prognostic score matching (Hansen, 2008) also has the advantage of collapsing the covariates used for matching to a single dimension. While it does not explicitly balance covariates, it does balance on counterfactual outcomes, eliminating the potentially large efficiency losses associated with PSM.

Matching directly on the space of covariates, using methods like Mahalanobis Distance Matching (MDM) or Coarsened Exact Matching (CEM), has gained popularity in fields such as statistics, epidemiology, sociology, and political science (Ho et al., 2007). This has the advantage of not requiring correct specification of either a propensity or prognostic model. Additionally, these methods provide better covariate balance and, to the extent those covariates drive variance in outcomes, improve the efficiency of downstream estimations when compared to PSM. Finally, these methods prevent the creation of unintuitive matches that occur when units with very different covariate values have similar estimated propensity (or prognostic) scores. However, MDM and CEM are particularly vulnerable to the curse of dimensionality discussed in Abadie and Imbens (2006) and demonstrated in Gu and Rosenbaum (1993).
That is, even for the case of two continuous covariates, the non-parametric bias resulting from covariate differences between matched pairs can be large enough to make asymptotically valid inference impossible. Along with Li and Fu (2017), we explore a middle ground, recasting the problem of matching in a metric learning framework that maps our covariates to a low-dimensional space that facilitates "closer matches" while still capturing important aspects of unit-level heterogeneity. These methods are a natural choice, as they were initially designed to predict similarity in high-dimensional problems like facial recognition (Parkhi et al., 2015), person re-identification (Liao et al., 2015), and image retrieval (Hoi et al., 2010).

To proceed, we first establish high-level conditions that a learned metric must satisfy for matching estimators to be consistent. Then, we utilize feed-forward neural networks (NN) and siamese neural networks (SNN) trained (separately) on outcomes and treatment labels, and extract the semi-final layer of these networks as low-dimensional representations that are nonetheless sufficient to control for the confounding effects of our covariates. The dimension of this embedding layer can be easily altered to trade off a richer representation against concerns about non-parametric bias. Through simulation and comparison to two canonical experimental datasets from the applied social sciences, we show our methods perform better than available competitors.
2 Theoretical framework

Following the potential outcomes framework, we let T ∈ {0, 1} denote a unit's treatment status and Y(1), Y(0) denote the outcomes that would be realized for each unit if it were treated (or not). Additionally, we adopt the following conventional assumptions.

Assumption 1 (compact support). Let X be a random vector of continuous covariates with dimension k and with density bounded away from zero on some compact and convex support 𝒳.

Assumption 2.
For some fixed η > 0 and ∀ x ∈ 𝒳:

• (unconfoundedness) T ⊥ (Y(0), Y(1)) | X
• (common support) ρ(x) = P(T = 1 | X = x) ∈ (η, 1 − η)
• (continuity) Denote m(x) := E[Y(0) | X = x] and τ(x) := E[Y(1) | X = x] − E[Y(0) | X = x]; then ρ, m, and τ are all continuous functions of x.

Assumption 3 (iid data). {Y_i, T_i, X_i} are independent and identically distributed draws from some joint distribution of (Y, T, X).

Implicit in Assumption 3 is the Stable Unit Treatment Value Assumption (SUTVA), which rules out any contaminating effects of one unit's treatment on another unit's outcomes.

The average treatment effects on the treated (ATT) and untreated (ATUT) populations are then estimated under nearest neighbor matching (NNM) as

  ÂTT_d = (1/N_1) Σ_{i : T_i = 1} [ Y_i − Y_{j*_d(i)} ]    (2.1)
  ÂTUT_d = (1/N_0) Σ_{i : T_i = 0} [ Y_i − Y_{j*_d(i)} ],

where j*_d(i) = argmin_{k : T_k ≠ T_i} d(X_i, X_k) gives the nearest neighbor with opposing treatment status, and N_1 (N_0) is the number of treated (untreated) units. We write the d subscript to emphasize the estimator's dependence on some (potentially learned) distance metric. We now establish high-level conditions that d must obey for the resulting estimator to be consistent.

Theorem 2.1 (Consistency). Suppose d is a metric on the space of X.

a. Suppose ∃ C > 0 such that for every x, y ∈ 𝒳 either
  • PSM: d(x, y) > C · |ρ(x) − ρ(y)|, or
  • PGM_C: d(x, y) > C · |m(x) − m(y)|;
then ÂTT_d is an asymptotically unbiased estimator of the ATT.

b. Suppose ∃ C > 0 such that for every x, y ∈ 𝒳 either
  • PSM: d(x, y) > C · |ρ(x) − ρ(y)|, or
  • PGM_T: d(x, y) > C · |(m(x) + τ(x)) − (m(y) + τ(y))|;
then ÂTUT_d is an asymptotically unbiased estimator of the ATUT.

Consistency of either estimator then follows from an application of a law of large numbers to the residuals.

Proof.
See Supplement D.

Theorem 2.1 guarantees consistency for metrics that suitably penalize deviations in either the propensity score or a specific prognostic score (depending on the estimand). It is immediate that the theorem applies to the already well-established results for MDM matching, as well as to propensity score and prognostic score matching when those models are correctly specified. Furthermore, it demonstrates the consistency of methods which (given guarantees on convergence) learn a matching metric using outcomes as labels, provided that only outcomes for control units are used to learn a metric for estimating the ATT, or conversely, that only outcomes for treated units are used to learn a metric for estimating the ATUT. This closely parallels the results in Antonelli et al. (2016), which require two separate prognostic scores for average treatment effect estimation. Note that Theorem 2.1 must, however, be combined with additional results about convergence of the underlying metric learning methods in order to ensure consistency in practice.

Critically, the theorem does not support the consistency of matching estimators that utilize metrics learned on pooled samples of outcome labels, including the procedure detailed in Li and Fu (2017). To demonstrate the potential failure of consistency in such settings, we pose a simple example with a single covariate x ∼ U[−1, 1] and outcomes given by

  E[Y(0) | x] = x
  E[Y(1) | x] = 4
  τ(x) = E[Y(1) | x] − E[Y(0) | x] = 4 − x,    (2.2)

with the propensity ρ(x) chosen so that the pooled conditional mean E[Y | x] = ρ(x) E[Y(1) | x] + (1 − ρ(x)) E[Y(0) | x] depends on x only through |x| (e.g., E[Y | x] = 2 + |x|, which obtains for ρ(x) = 2/(4 − x) when x ≥ 0 and ρ(x) = (2 − 2x)/(4 − x) when x < 0, both strictly inside (0, 1)).

A metric learned to minimize mean squared error on the pooled (treatment and control) outcome data would then converge to d(x_1, x_2) = (E[Y | x_1] − E[Y | x_2])² = (|x_1| − |x_2|)². That is, for a treated unit with x_i = 1, we would be indifferent between matching to a control unit with x = 1, implying τ̂_i = 3, and matching to a control unit with x = −1, implying τ̂_i = 5.
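This failure is easy to verify numerically. The sketch below is our own construction: it uses the conditional means of the stylized example, one concrete propensity under which the pooled mean depends on x only through |x|, and the pooled-MSE-optimal metric (|x_1| − |x_2|)² in closed form. Matching under that metric yields a visibly biased ATT estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x = rng.uniform(-1.0, 1.0, n)

# One concrete propensity (our choice) under which the pooled conditional
# mean is E[Y|x] = 2 + |x|, i.e. it depends on x only through |x|.
rho = np.where(x >= 0, 2.0 / (4.0 - x), (2.0 - 2.0 * x) / (4.0 - x))
t = rng.uniform(size=n) < rho
# Conditional means of the stylized example: E[Y(0)|x] = x, E[Y(1)|x] = 4.
y = np.where(t, 4.0, x) + rng.normal(scale=0.1, size=n)

# Under d(x1, x2) = (|x1| - |x2|)**2, the nearest control to a treated unit
# is simply the control with the closest |x| (its sign is ignored).
treated, controls, y_c = x[t], x[~t], y[~t]
order = np.argsort(np.abs(controls))
s = np.abs(controls)[order]
pos = np.clip(np.searchsorted(s, np.abs(treated)), 1, len(s) - 1)
idx = np.where(np.abs(treated) - s[pos - 1] <= s[pos] - np.abs(treated),
               pos - 1, pos)

att_hat = np.mean(y[t] - y_c[order][idx])   # pooled-metric matching estimate
att_true = np.mean(4.0 - treated)           # tau(x) = 4 - x on the treated
print(att_hat, att_true)
```

Because matched controls sit at ±|x_i| rather than at x_i, the counterfactual outcomes are systematically mis-imputed and the gap between `att_hat` and `att_true` does not vanish with n.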
It is immediately evident that such an estimator would be inconsistent for both the ATT and ATUT parameters. The example in (2.2) is highly stylized, but as long as one supposes heterogeneous treatment effects, it is straightforward to construct counter-examples to the consistency of matching metrics learned on pooled outcome data.

In addition to using a control-only or treatment-only subsample of training units, an alternative but somewhat more parametric attempt to mitigate potential bias resulting from heterogeneous treatment effects is detailed in Johansson et al. (2016), wherein treatment status is concatenated with the learned data representation only at the stage immediately before prediction.

3 Methods

In this section, we describe a general procedure for pre-processing data for matching estimation. This procedure is then combined with NNM to produce a full matching method. We describe in additional detail how this procedure can be applied to the pseudo-metric learning paradigm specifically.
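Throughout, the downstream estimator is ordinary nearest-neighbor matching on whichever features the pre-processing step produces. A minimal sketch of the ATT estimator in (2.1), in our own code (brute-force pairwise distances, no tie-breaking or bias correction):

```python
import numpy as np

def att_nnm(y, t, features):
    """ATT via 1-NN matching as in (2.1): each treated unit is paired with
    the nearest opposite-status unit under Euclidean distance in `features`."""
    treated = np.flatnonzero(t)
    controls = np.flatnonzero(~t)
    # Brute-force pairwise distances (fine for small n; use a KD-tree at scale).
    d = np.linalg.norm(features[treated, None, :] - features[None, controls, :],
                       axis=-1)
    j_star = controls[np.argmin(d, axis=1)]   # nearest opposing unit j*_d(i)
    return np.mean(y[treated] - y[j_star])

# Toy check on a single confounder with a constant treatment effect of 1.0.
rng = np.random.default_rng(1)
x = rng.normal(size=(2000, 1))
t = rng.uniform(size=2000) < 1.0 / (1.0 + np.exp(-x[:, 0]))
y = x[:, 0] + 1.0 * t + rng.normal(scale=0.1, size=2000)
print(att_nnm(y, t, x))
```

In the methods below, the `features` argument is the learned low-dimensional representation rather than the raw covariates.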
3.1 General framework

Our general framework, which is conformable to a variety of pre-processing techniques, involves training a predictive algorithm to predict an outcome Y, as well as training a separate instance of the algorithm to predict treatment status D. We then combine the learned representations of the data from these two tasks, where the method of combination depends on the predictive algorithm used, and carry out nearest-neighbor matching on the resulting object. Computationally, the complexity of the matching process is dominated by the component ML models. As these are standard model types, their complexity is well established in the literature.

3.2 Pseudo-metric learning methods

The following methods employ different neural net architectures to discover a low-dimensional representation of the original covariates to which a standard distance metric (e.g., Euclidean) can be applied; together, these constitute a learned pseudo-metric. In analogy to the alternatives of propensity and prognostic score matching, we learn two separate low-dimensional representations of the data: one in which units with similar potential outcomes are near each other, and one in which units with similar propensity for treatment are near each other.

For a given unit i, these separate representations are characterized by two vectors of continuous-valued features, m_{i,y} and m_{i,d}. We propose performing matching on the space defined by the union of these two representations, M_union = [M_y M_d], where dim(M_d) = n × z_d and dim(M_y) = n × z_y. The researcher may choose z_d = z_y, but this need not be true. Note that, unlike methods which select some subset of the original covariates, the combination of learned features discussed here includes all learned features resulting from estimating the separate treatment and outcome models.
Thus, the researcher, through specifying the dimension of the hidden layers to be extracted and used in matching, can explicitly control the dimension of the matching space. As a practical matter, we drop learned features which are perfectly correlated with another learned feature, as well as learned features with near-zero variance. We recommend standard hyperparameter tuning for both the number of hidden layers and their size. When training the neural nets which learn similarity in outcomes, we use control-only or treatment-only subsamples. This decision is motivated by the discussion in Section 2 regarding heterogeneous treatment effects.

3.2.1 NN method

NNs are, theoretically, universal function approximators (Cybenko, 1989). The success of NNs derives from the capacity of a network's inner layers to learn successive transformations of the data. Extracting these inner layers and using them as input into a second model is a known technique for dimensionality reduction (Hinton and Salakhutdinov, 2006).

Specifically, we refer to the final hidden layers of the networks that predict treatment and outcome as M_d and M_y, respectively. These are of dimension z_d and z_y, respectively; the researcher controls these parameters when designing the network architecture, and they determine the dimension of the matching space. Additionally, denote the vectors of final-hidden-layer weights as σ_dl and σ_yl, respectively, and let e_n = [1 1 ... 1] be a (1 × n) vector of ones. Then, letting ⊗ be the Kronecker product operator, define A, an (n × z_d) matrix, and B, an (n × z_y) matrix, as

  A = e_n′ ⊗ σ_dl,   B = e_n′ ⊗ σ_yl.    (3.1)

Using the above definitions, and letting the operator ⊙ refer to element-wise matrix multiplication, we propose matching on

  X_s = [ M_d ⊙ A   M_y ⊙ B ],    (3.2)

an n × (z_y + z_d) matrix. The matrix X_s can be thought of simply as the matrix of scaled learned features.
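Because A and B simply replicate the final-layer weight vectors across rows, the construction in (3.1)-(3.2) reduces to column-wise scaling, which NumPy broadcasting performs directly. A sketch with placeholder shapes and values (all names are ours):

```python
import numpy as np

# Shapes follow (3.1)-(3.2): M_d is (n, z_d), M_y is (n, z_y); sigma_dl and
# sigma_yl are the output-layer weight vectors. The Kronecker products A and B
# replicate those vectors across n rows, so M_d (.) A and M_y (.) B are plain
# column-wise scalings.
rng = np.random.default_rng(0)
n, z_d, z_y = 5, 3, 4
M_d = rng.normal(size=(n, z_d))       # extracted final hidden layer (treatment net)
M_y = rng.normal(size=(n, z_y))       # extracted final hidden layer (outcome net)
sigma_dl = rng.normal(size=z_d)       # output-layer weights, treatment net
sigma_yl = rng.normal(size=z_y)       # output-layer weights, outcome net

X_s = np.hstack([M_d * sigma_dl, M_y * sigma_yl])   # (n, z_d + z_y)
print(X_s.shape)   # (5, 7)
```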
If a feature in the final hidden layer receives a very large weight as it enters the output layer, it is desirable to also increase the scale of that feature in the downstream matching estimation. Similarly, if a feature in the final hidden layer receives a near-zero weight as it is passed to the output layer, we would like to assign less importance to matching closely on that feature than on others.

3.2.2 SNN method

The siamese neural network structure is not designed to predict a value, but rather to learn a mapping to a low-dimensional manifold such that alike observations are near each other and dissimilar observations are far from each other on the manifold. Each pass through the network involves two observations, X_i and X_j. Each observation is passed through identical networks, terminating in a layer whose length the researcher controls.

The distance between the two layers is then input into one of two loss functions. The SNN designed to learn similarity in outcomes uses a loss function that penalizes embedding distance according to differences in outcomes; the SNN designed to learn similarity in treatment propensity uses a standard contrastive loss function.

We recommend extracting the final hidden layer of each SNN and matching on the combination of these layers. Denote the layer extracted from the treatment-target SNN as M_d and the layer extracted from the outcome-target SNN as M_y. In the SNN architecture there is no set of weights connecting the extracted layer to some terminal node; instead, the extracted layer enters directly into the loss function. Thus, unlike in the NN method, there is no need to weight the extracted layer before its input into the matching estimator.
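For reference, the standard contrastive loss used for the treatment-target SNN takes the following commonly used form (the margin is a tuning parameter; the outcome-target loss replaces the binary pair label with a measure of outcome similarity):

```python
import numpy as np

def contrastive_loss(emb_i, emb_j, same, margin=1.0):
    """Standard contrastive loss: pull same-label pairs together (squared
    embedding distance) and push different-label pairs at least `margin`
    apart. `same` is 1 for pairs with the same treatment status, 0 otherwise."""
    d = np.linalg.norm(emb_i - emb_j, axis=-1)
    return np.mean(same * d**2 + (1 - same) * np.maximum(0.0, margin - d)**2)
```

In training, `emb_i` and `emb_j` are the twin-network outputs for a sampled pair of observations; minimizing this loss shapes the embedding that is later extracted for matching.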
We propose matching on

  X_s = [ M_d   M_y ],    (3.3)

an n × (z_y + z_d) matrix. The rows of X_s can be seen as representing the position of each observation i on the learned manifold, where the mapping to that manifold has been explicitly constructed such that alike pairs are near each other in terms of Euclidean distance. This structure is naturally amenable to a standard matching procedure, which assumes that observations that are near each other in Euclidean distance are alike. Indeed, this transformation of the raw covariate space renders that implicit assumption far more plausible.

4 Simulations

In order to benchmark the performance of the NN and SNN methods proposed here, we compare them to a number of other methods described in Table 1, including variable selection methods that choose a subset of the raw features, across three different simulated environments characterized by DGPs of increasing complexity.

Table 1: Matching methods used in simulations

Abbrev.      Type            Matching features
NN           Pseudo-metric   Union of semi-final layers from NNs trained on Y, D
SNN          Pseudo-metric   Union of semi-final layers from SNNs trained on Y, D
L*           Var. selection  Union of features selected by Lasso models on Y, D
RRF*         Var. selection  Union of features selected by RRFs on Y, D
PSM          Prop. score     Propensity scores from logit regression of D on X
PSMSQ        Prop. score     Propensity scores from logit regression of D on {X, X²}
U. Oracle    Var. selection  Union of features known to impact either Y or D
Int. Oracle  Var. selection  Intersection of features known to impact both Y and D

For all simulated environments, the following hold:

  X_i ∼ N(0_K, I_K);   K = 50;   N = 8000;   β = 1,    (4.1)

where X_i denotes the vector of covariates for observation i, K is the dimension of the covariate space, N is the number of observations in a given simulation, and β is the true treatment effect. We examine performance under three different DGPs: a sparse linear DGP, a sparse linear DGP with quadratic terms, and a NN DGP. Complete descriptions of the simulated environments can be found in Supplement A.

As additional benchmarks, we include two methods which generate a matching space through variable selection. First, in accordance with Section 3, we estimate a pair of LASSO regressions to choose matching features. In the second alternative, we estimate a pair of Regularized Random Forests (RRF) (Deng and Runger, 2012) to choose matching features on the basis of variable importance. Supplement B contains a more complete treatment of these alternatives.

Examining Table 2, it is evident that, when the DGP is relatively simple, both PSM and PSMSQ do quite well in RMSE terms. Importantly, the two neural-net-based methods we propose, NN and SNN, perform nearly identically to PSM and PSMSQ in this setting.

At first pass, it may seem counter-intuitive that oracle estimators, endowed with perfect knowledge of the relevant covariates, underperform supervised algorithms which seek to recover some form of this knowledge.
However, the non-parametric bias induced by matching to a nearest neighbor is O(N^(−1/K)) for problems with K continuous covariates (Abadie and Imbens, 2006). This explains the poor performance of the oracle estimators and further motivates the pseudo-metric learning paradigm.

Table 2: Sparse linear DGP simulation results

       NN    SNN   L     RRF   PSM   PSMSQ  U. Oracle  Int. Oracle
Mean   0.96  1.00  1.42  1.52  0.99  0.99   1.29       1.13
SD     0.02  0.04  0.04  0.03  0.05  0.04   0.03       0.05
RMSE   0.05  0.04  0.42  0.52  0.05  0.04   0.30       0.14

In Table 3 we see that, as the DGP grows slightly non-linear, the performance of PSM and PSMSQ begins to deteriorate. It should be expected that PSM, now explicitly misspecified, begins to falter. However, PSMSQ sees a non-trivial uptick in RMSE as well.

Table 3: Sparse linear with sq. terms DGP simulation results

       NN    SNN   L     RRF   PSM   PSMSQ  U. Oracle  Int. Oracle
Mean   0.99  1.04  1.81  1.92  1.19  1.01   1.66       1.34
SD     0.05  0.07  0.16  0.17  0.26  0.13   0.13       0.11
RMSE   0.05  0.08  0.83  0.94  0.32  0.13   0.67       0.36

The simulation results from the NN DGP in Table 4 indicate that, as the DGP grows increasingly complex, PSM and PSMSQ become wholly unusable. SNN and NN exhibit a very gentle uptick in RMSE, but are still nearly unbiased. If the researcher takes seriously the possibility that the DGP may be highly non-linear, then it is prudent to employ something akin to NN or SNN: it is harmless if the DGP turns out to be simple, but provides significant gains if it is not.

Table 4: NN DGP simulation results

       NN    SNN   L     RRF   PSM   PSMSQ
Mean   1.01  1.07  1.58  1.56  1.51  1.39
SD     0.10  0.21  0.06  0.04  0.60  0.46
RMSE   0.10  0.21  0.58  0.56  0.77  0.59
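The severity of the dimensionality penalty is easy to see directly: at a fixed sample size, the average distance from a unit to its nearest match grows rapidly with the number of continuous covariates. A small empirical sketch (ours):

```python
import numpy as np

# At fixed n, the mean distance from each unit in one sample to its nearest
# neighbor in another sample grows quickly with the covariate dimension K,
# which is the source of the O(N**(-1/K)) matching discrepancy.
rng = np.random.default_rng(0)
n = 300

def mean_nn_dist(K):
    a = rng.uniform(size=(n, K))          # "treated" sample
    b = rng.uniform(size=(n, K))          # "control" sample
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean()           # avg. distance to nearest match

dists = {K: mean_nn_dist(K) for K in (1, 2, 10, 50)}
print(dists)
```

Mapping K = 50 raw covariates into a handful of learned features is precisely what keeps this discrepancy, and hence the matching bias, small.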
5 Applications

5.1 LaLonde (NSW) dataset

We first examine the performance of the methods proposed in Section 3 on the canonical LaLonde (1986) data set. Following Imbens (2015), we look specifically at the experimental and non-experimental versions of the original data set introduced in Dehejia and Wahba (1999).

In the LaLonde data, applicants to National Supported Work (NSW), a labor market training program, were selected at random for participation. Using this data alone, we can learn the experimental ATT. Combining these data with a non-experimental comparison group derived from the Panel Study of Income Dynamics (PSID), it is possible to benchmark the performance of various causal inference methods designed to recover the ATT from observational data.

The variable of interest is labor market earnings in 1978, and available features include earnings in 1974 and 1975 as well as various demographic variables. In Table 5, we compare non-experimental results with the experimentally estimated ATT of $1,794. The NN and SNN estimates, standard errors, and confidence intervals are taken as the average over 100 iterations of each method. In the case of the non-experimental LaLonde data, the control and treated groups differ significantly in the observed features. For that reason, we find it prudent to average over different iterations of the neural-net-based methods in a situation with such extreme baseline control-treated feature imbalance.

As is standard practice in cases of significant class imbalance, we recommend oversampling in NN and SNN training if the treated units comprise a sufficiently small share of the entire sample. Other methods of protecting against overfitting may be used as well.

Table 5: LaLonde non-experimental ATT results
              Est.*     Difference  SE**     95% CI
Experimental  1794.34   0.00        671.00   (479, 3110)
NN            1632.74   -161.60     872.79   (-78, 3343)
SNN           1736.51   -57.84      795.11   (178, 3295)
PSM           897.94    -896.40     1045.56  (-1151, 2947)
PSMSQ         2109.21   314.87      990.90   (167, 4051)
OLS           914.65    -879.69     551.32   (-166, 1995)

*Estimates for NN and SNN methods are averaged over 100 runs each.
**SEs are calculated as in Abadie and Imbens (2006) for matching estimators.
Standard errors are calculated according to Abadie and Imbens (2006) for all estimators besides OLS. Among all methods, SNN comes closest to recovering the experimental treatment effect, with NN outperforming all non-SNN methods by a considerable margin as well. While OLS provides the lowest standard error, it is considerably biased. We attempted to benchmark our results on this data set against those reported in Li and Fu (2017), but were unable to recreate their exact environment.

5.2 IHDP (1993) dataset

We additionally examine the performance of the methods proposed in Section 3 on a modified version of the IHDP (Infant Health and Development Program) data set. The IHDP data set originates from a randomized longitudinal trial designed to discover the effect of a comprehensive set of interventions designed to curb health and developmental issues in low-birth-weight, premature infants. In order to simulate an observational study, we follow Hill (2011), Li and Fu (2017), and others in removing all children with non-white mothers from the treatment group. The subset of removed infants is clearly non-random and therefore induces the type of structural feature imbalance that we typically expect in non-experimental data.

Because the theoretical framework we present maintains the unconfoundedness assumption, we simulate outcomes using only pretreatment features and treatment assignment. The response surfaces we use follow Li and Fu (2017) exactly. A full description can be found in Supplement C.

We simulate 50 response surfaces and report the results in Table 6. Both NN and SNN result in small bias and RMSE, providing further support for the efficacy of these pseudo-metric learning methods across a variety of settings. Li and Fu (2017) report only the absolute value of the error in the average treatment effect; their reported error is bested by both the NN and SNN methods.
Table 6: IHDP simulated outcome results (ATT = 4)

       Avg Est.*  Avg Difference*  RMSE**
NN     3.89       -0.11            0.28
SNN    4.13       0.13             0.34
PSM    3.74       -0.26            1.03
PSMSQ  3.92       -0.08            0.46
OLS    3.83       -0.17            0.33

*Averages over 50 sets of simulated outcomes.
**RMSE is calculated using results from 50 simulated outcomes.
6 Conclusion

When estimating the effects of conditionally exogenous treatment, pre-processing data with matching is a popular tool, and researchers with high-dimensional data may face a difficult decision about how best to define an appropriate metric. Following Li and Fu (2017), we recast the problem of matching in a metric learning framework that maps our covariates to a low-dimensional space. This facilitates "closer matches" while still capturing important aspects of unit-level heterogeneity.

We provide general conditions under which a learned distance metric is sure to lead to consistent estimation. This illuminates the importance of using control-only or treatment-only data when using supervised learning to discover a metric from outcome data. We also provide applied researchers with two new methods, leveraging MLPs and siamese neural nets, which compare favorably to state-of-the-art alternatives.
References
Abadie, A. and Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1):235–267.

Abadie, A. and Imbens, G. W. (2016). Matching on the estimated propensity score. Econometrica, 84(2):781–807.

Antonelli, J., Cefalu, M., Palmer, N., and Agniel, D. (2016). Doubly robust matching estimators for high dimensional confounding adjustment.

Athey, S. (2017). Beyond prediction: Using big data for policy problems. Science, 355(6324):483–485.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. K. (2016). Double machine learning for treatment and causal parameters. Technical report, cemmap working paper, Centre for Microdata Methods and Practice.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314.

Dehejia, R. H. and Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448):1053–1062.

Deng, H. and Runger, G. (2012). Feature selection via regularized trees. In The 2012 International Joint Conference on Neural Networks (IJCNN), pages 1–8.

Gu, X. S. and Rosenbaum, P. R. (1993). Comparison of multivariate matching methods: Structures, distances, and algorithms. Journal of Computational and Graphical Statistics, 2(4):405–420.

Hansen, B. B. (2008). The prognostic analogue of the propensity score. Biometrika, 95(2):481–488.

Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240.

Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

Ho, D. E., Imai, K., King, G., and Stuart, E. A. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis, 15(3):199–236.

Hoi, S. C., Liu, W., and Chang, S.-F. (2010). Semi-supervised distance metric learning for collaborative image retrieval and clustering. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 6(3):18.

Imbens, G. W. (2015). Matching methods in practice. Journal of Human Resources, 50(2):373–419.

Johansson, F., Shalit, U., and Sontag, D. (2016). Learning representations for counterfactual inference. In Balcan, M. F. and Weinberger, K. Q., editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 3020–3029, New York, New York, USA. PMLR.

King, G. and Nielsen, R. (2016). Why propensity scores should not be used for matching. Mimeo.

LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pages 604–620.

Li, S. and Fu, Y. (2017). Matching on balanced nonlinear representations for treatment effects estimation. In Advances in Neural Information Processing Systems, pages 929–939.

Liao, S., Hu, Y., Zhu, X., and Li, S. Z. (2015). Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2197–2206.

McCaffrey, D. F., Ridgeway, G., and Morral, A. R. (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods, 9(4):403–425.

Parkhi, O. M., Vedaldi, A., Zisserman, A., et al. (2015). Deep face recognition. In BMVC, volume 1, page 6.

Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242.

Wyss, R., Ellis, A. R., Brookhart, M. A., Girman, C. J., Funk, M. J., LoCasale, R., and Stürmer, T. (2014). The role of prediction modeling in propensity score estimation: An evaluation of logistic regression, bCART, and the covariate-balancing propensity score. American Journal of Epidemiology, 180(6):645–655.

Supplements

A Simulation settings
Descriptions for each simulated environment described in Section 4 are provided below. Each DGP is simulated 1000 times.

1. A sparse linear DGP characterized by
$$Y_i = D_i \beta + X_i \gamma + \epsilon_i; \qquad \epsilon_i \sim N(0, \sigma^2_\epsilon) \qquad (A.1)$$
$$D_i = \mathrm{Bernoulli}\left(\frac{1}{1 + e^{-(X_i \omega)}}\right) \qquad (A.2)$$
Additionally, define $L_{nz} = \{\omega_k \in \omega \mid \omega_k \neq 0\}$ and $G_{nz} = \{\gamma_k \in \gamma \mid \gamma_k \neq 0\}$. Then,
$$|L_{nz}| = |G_{nz}| = 8 \qquad (A.3)$$
and all nonzero coefficients share a common value:
$$\omega_i = \gamma_i \quad \forall\, \omega_i \in L_{nz},\ \gamma_i \in G_{nz}. \qquad (A.4)$$
Finally, there is overlap in the $x_i$ which have non-zero $\omega_i$ and non-zero $\gamma_i$. Specifically, six $x_i$ have non-zero values of both coefficients, and four have a non-zero value of just one of $\omega_i$ or $\gamma_i$.

2. A sparse linear DGP with squared terms characterized by
$$Y_i = D_i \beta + X_i \gamma_1 + X_i^2 \gamma_2 + \epsilon_i; \qquad \epsilon_i \sim N(0, \sigma^2_\epsilon) \qquad (A.5)$$
$$D_i = \mathrm{Bernoulli}\left(\frac{1}{1 + e^{-(X_i \omega_1 + X_i^2 \omega_2)}}\right) \qquad (A.6)$$
Additionally, define $L^1_{nz} = \{\omega_{1k} \in \omega_1 \mid \omega_{1k} \neq 0\}$, $G^1_{nz} = \{\gamma_{1k} \in \gamma_1 \mid \gamma_{1k} \neq 0\}$, $L^2_{nz} = \{\omega_{2k} \in \omega_2 \mid \omega_{2k} \neq 0\}$, and $G^2_{nz} = \{\gamma_{2k} \in \gamma_2 \mid \gamma_{2k} \neq 0\}$. Then,
$$|L^1_{nz}| = |G^1_{nz}| = 8; \qquad |L^2_{nz}| = |G^2_{nz}| = 2 \qquad (A.7)$$
and, as in (A.4), all nonzero coefficients share a common value:
$$\omega_i = \gamma_i \quad \forall\, \omega_i \in \{L^1_{nz}, L^2_{nz}\},\ \gamma_i \in \{G^1_{nz}, G^2_{nz}\}. \qquad (A.8)$$
There is the same level of overlap in the $x_i$ which have non-zero $\omega_1$ and non-zero $\gamma_1$ values as in the sparse linear DGP. This need not be true of the squared terms, however.

3. A random NN DGP characterized by
$$Y_i = D_i \beta + F_Y(X_i) + \epsilon_i; \qquad \epsilon_i \sim N(0, \sigma^2_\epsilon) \qquad (A.9)$$
$$D_i = \mathrm{Bernoulli}\left(\frac{1}{1 + e^{-F_D(X_i)}}\right) \qquad (A.10)$$
$F_Y(\cdot)$ and $F_D(\cdot)$ each have one very wide hidden layer and one shallow hidden layer. The first layer of each network uses an ELU activation function, while the second uses a ReLU activation function. Importantly, the ELU activation function is never used to learn the matching embedding. We deliberately use a different structure as well as the ELU activation functions in order to ensure that this DGP is not nested in the neural nets which will learn the matching embedding.

The network weights for both $F_Y(\cdot)$ and $F_D(\cdot)$, which can be seen as analogous to the coefficient values of the first two DGPs above, are generated with some correlation in order to induce an upward bias of roughly the same magnitude as in the simpler DGPs (here, referring to the bias that results from simply running a linear regression of $y$ on treatment). Additionally, network weights are randomly set to zero with some probability $p$ in order to limit the complexity of the DGP.
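As a concrete illustration, the sparse linear DGP above can be simulated in a few lines. The paper's code is written in R; this is a minimal NumPy sketch, in which the common coefficient value `coef` and the unit noise variance are illustrative placeholders, while the overlap pattern (six features shared between $\gamma$ and $\omega$, two unique to each) follows the description above.

```python
import numpy as np

def simulate_sparse_linear(n=1000, p=100, beta=1.0, coef=0.5, seed=0):
    """Sketch of the sparse linear DGP: 8 nonzero entries each in gamma and
    omega, 6 of which overlap. `coef` is an illustrative placeholder."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    gamma = np.zeros(p)
    omega = np.zeros(p)
    gamma[:8] = coef       # outcome-relevant features: indices 0-7
    omega[2:10] = coef     # treatment-relevant: indices 2-9 (overlap on 2-7)
    p_treat = 1.0 / (1.0 + np.exp(-(X @ omega)))   # logistic propensity (A.2)
    D = rng.binomial(1, p_treat)
    Y = D * beta + X @ gamma + rng.normal(size=n)  # outcome equation (A.1)
    return X, D, Y
```

The squared-terms DGP follows the same pattern with additional `X**2` terms in both equations.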
B Description of variable selection methods for matching

The methods below are instantiations of the general procedure described in Section 3.1. Here, that framework is applied to variable selection algorithms, where the learned representation of the data is a subset of the input features.

For both of the methods below, we recommend standard tuning of the penalty parameter through cross-validation.
B.1 L1 Method
First, we run a LASSO of $y$ on $X$ and find $\hat\beta_{ly}$:
$$\hat\beta_{ly} = \arg\min_\beta \sum_{i=1}^n (y_i - X_i'\beta)^2 + \lambda \sum_{j=1}^p |\beta_j| \qquad (B.1)$$
Second, we run an L1-penalized logit of $d$ on $X$ and find $\hat\beta_{ld}$:
$$\hat\beta_{ld} = \arg\min_\beta \; -\log \prod_{i: d_i=1} p(X_i) \prod_{j: d_j=0} (1 - p(X_j)) + \lambda \sum_{k=1}^p |\beta_k| \qquad (B.2)$$
We then collect $X_{select} = \{X_i \mid \hat\beta_{ly,i} \neq 0\} \cup \{X_j \mid \hat\beta_{ld,j} \neq 0\}$.

In the simulations we present, we select the penalty parameter $\lambda$ through k-fold cross-validation separately for each regression. However, it is not obvious that the optimal $\lambda$ which arises from solving the outcome and treatment prediction problems will coincide with the optimal $\lambda$ for the downstream matching estimator. Specifically, the cost of selecting irrelevant variables may be much larger for a matching estimator due to the order of the non-parametric bias. In fact, our simulation results on a true sparse DGP suggest that, when the goal is to input selected variables into a matching estimator, a more aggressive penalty may be necessary than that which is selected through cross-validation on the prediction problems.
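The two-step union-of-supports selection above can be sketched as follows. The paper's implementation uses glmnet in R with cross-validated penalties; this scikit-learn sketch instead uses fixed, illustrative penalty levels (`alpha` and `C`).

```python
import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

def l1_select(X, y, d, alpha=0.1, C=0.1):
    """Keep features with a nonzero coefficient in either the LASSO of y on X
    (B.1) or the L1-penalized logit of d on X (B.2). `alpha` and `C` are
    illustrative; in practice the penalty is tuned by cross-validation."""
    beta_ly = Lasso(alpha=alpha).fit(X, y).coef_
    beta_ld = LogisticRegression(penalty="l1", C=C,
                                 solver="liblinear").fit(X, d).coef_.ravel()
    return np.flatnonzero((beta_ly != 0) | (beta_ld != 0))
```

Downstream, `X[:, l1_select(X, y, d)]` would then be passed to the matching step.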
B.2 RRF Method

The regularization framework for tree-based methods amounts to comparing the (penalized) gain for a split on a new variable to the maximum gain possible from splitting on a variable that has already been used for splitting. Formally, at any given node, define the set of features used in previous splits as $F$. For some feature $X_j \notin F$, we would like to ensure that it is selected only if
$$\mathrm{gain}(X_j) \gg \max_{\{i \mid X_i \in F\}} \mathrm{gain}(X_i).$$
We can accomplish this by applying a penalty $\lambda \in [0, 1]$ to $\mathrm{gain}(X_j)$ for all $X_j \notin F$. Define $\mathrm{gain}_R(X_j)$ as:
$$\mathrm{gain}_R(X_j) = \begin{cases} \lambda \cdot \mathrm{gain}(X_j) & X_j \notin F \\ \mathrm{gain}(X_j) & X_j \in F \end{cases} \qquad (B.3)$$
An RRF model is then trained on both the outcome and treatment prediction problems, and we collect $X_{select} = \{X_i \mid X_i \in F_y\} \cup \{X_j \mid X_j \in F_d\}$, where $F_y$ is the set of features selected by the outcome model and $F_d$ is the set of features selected by the treatment model.

Similarly to the L1 methods, we select the regularization parameter $\lambda$ through k-fold cross-validation. The simulation results suggest that, as is the case with the L1 methods, this process may select too conservative a penalty parameter; here, that corresponds to a $\lambda$ which is larger than what would be optimal if the non-parametric nature of the matching estimator were fully internalized when selecting $\lambda$.

In fact, in our simulations with a true sparse DGP, nearly every available covariate ends up in $X_{select}$. However, we use variable importance, $c_j$, as measured by average informational gain, to scale each $X_j \in X_{select}$, so that matching is performed on the space defined by the set $X_{scale} = \{c_j \cdot X_j \mid X_j \in X_{select}\}$. Scaling selected features according to their importance greatly reduces the contribution of low-importance features to the distance metric employed in the matching step, and therefore mitigates the non-parametric bias induced by their inclusion.
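The penalized-gain rule (B.3) and the importance scaling can be sketched as follows. This is a schematic of the two rules only, not the full RRF training loop; `lam` and the importance weights are illustrative inputs.

```python
import numpy as np

def gain_R(gain, feature, used, lam=0.5):
    """Regularized gain (B.3): a feature not yet in the used set F has its
    split gain shrunk by lam in [0, 1]; already-used features keep raw gain."""
    return gain if feature in used else lam * gain

def scale_by_importance(X_select, importance):
    """Scale each selected column by its importance c_j, so low-importance
    features contribute little to the matching distance metric."""
    return X_select * np.asarray(importance)
```

At a node with used set $F$, a split on a new feature is accepted only when its shrunken gain beats the best raw gain among features already in $F$.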
However, a more elegant solution would involve a modified algorithm for selecting $\lambda$, as well as careful tuning of other parameters that can affect the size of $X_{select}$, such as the number of trees, the number of splitting candidates evaluated at each node, and the depth of each tree.

C Description of IHDP settings
The response surfaces we use follow Li and Fu (2017), specifically:

i. $Y(0) = \exp((X + W)\beta) + \epsilon_0$
ii. $Y(1) = X\beta - \alpha + \epsilon_1$
iii. The factual and counterfactual outcomes are assigned as in the standard Rubin causal model framework.

Here $W$ is an offset matrix with each element equal to $0.5$, and $\beta \in \mathbb{R}^{d \times 1}$ is a vector of coefficients with each element sampled randomly from $(0, 0.1, 0.2, 0.3, 0.4)$ with respective probabilities $(0.6, 0.1, 0.1, 0.1, 0.1)$. The elements of the vectors $\epsilon_0$ and $\epsilon_1$ are randomly sampled from the standard normal distribution $N(0, 1)$, and $\alpha \in \mathbb{R}^{n \times 1}$ is a constant vector, with the value of the constant chosen such that the ATT will be equal to 4.

We simulate 50 such response surfaces and report the results in Section 5. Both NN and SNN result in small bias and RMSE, providing further support for the efficacy of these feature learning methods across a variety of settings.
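A sketch of generating one such response surface is below. As a simplifying assumption, $\alpha$ is calibrated so that the average effect over the full sample (rather than over treated units only) equals the target, and the coefficient values follow the description above.

```python
import numpy as np

def ihdp_surfaces(X, att=4.0, seed=0):
    """Sketch of the response surfaces: Y(0) = exp((X + W) beta) + eps0,
    Y(1) = X beta - alpha + eps1, with W an offset of 0.5 and alpha set so
    the average effect equals `att` (full-sample calibration here)."""
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    beta = rng.choice([0.0, 0.1, 0.2, 0.3, 0.4], size=dim,
                      p=[0.6, 0.1, 0.1, 0.1, 0.1])
    mu0 = np.exp((X + 0.5) @ beta)
    mu1 = X @ beta
    alpha = (mu1 - mu0).mean() - att   # shift so mean(Y1 - Y0) equals att
    y0 = mu0 + rng.normal(size=n)
    y1 = mu1 - alpha + rng.normal(size=n)
    return y0, y1
```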
D Proof of Theorem 2.1

Theorem D.1 (Consistency). Let $d$ be a metric on the space of $X$. Additionally,

a. suppose $\exists\, C > 0$ such that for every $x, y \in X$ either
• $P_{SM}$: $d(x, y) > C \cdot |\rho(x) - \rho(y)|$, or
• $P_{GM}^C$: $d(x, y) > C \cdot |m(x) - m(y)|$;
then $\widehat{ATT}_d$ is an asymptotically unbiased estimator of the ATT.

b. suppose $\exists\, C > 0$ such that for every $x, y \in X$ either
• $P_{SM}$: $d(x, y) > C \cdot |\rho(x) - \rho(y)|$, or
• $P_{GM}^T$: $d(x, y) > C \cdot |(m(x) + \tau(x)) - (m(y) + \tau(y))|$;
then $\widehat{ATUT}_d$ is an asymptotically unbiased estimator of the ATUT.

Consistency of either estimator then follows from applying a law of large numbers to the residuals.

Proof. First note that by combining Assumptions 1-3, we know the density on $X$ is everywhere positive for both treated and control units. Then for the case of part (a), there must exist a sequence $\delta_n \to 0$ such that either $P_{GM}^C$: $|m(x_i) - m(x_{j^*_d(i)})| < \delta_n$ or $P_{SM}$: $|\rho(x_i) - \rho(x_{j^*_d(i)})| < \delta_n$ is satisfied. Now, write the true ATT and corresponding estimator as
$$ATT = \frac{1}{N_1} \sum_{i: T_i = 1} [Y_i - Y_i(0)], \qquad \widehat{ATT}_d = \frac{1}{N_1} \sum_{i: T_i = 1} \left[ Y_i - Y_{j^*_d(i)} \right],$$
and the expectation of their difference as
$$E[\widehat{ATT}_d - ATT] = E\left[ \frac{1}{N_1} \sum_{i: T_i = 1} \left[ Y_i(0) - Y_{j^*_d(i)} \right] \right] = E\left[ \frac{1}{N_1} \sum_{i: T_i = 1} \left[ m(X_i) - m(X_{j^*_d(i)}) \right] \right].$$
In the $P_{GM}^C$ case, it is immediate that the bias is bounded:
$$E\left| \frac{1}{N_1} \sum_{i: T_i = 1} \left[ m(X_i) - m(X_{j^*_d(i)}) \right] \right| \leq \delta_n \to 0,$$
whereas in the $P_{SM}$ case the bias is controlled by standard arguments about the estimated propensity score (see, for example, Abadie (2016)).

For (b), the proof has the same structure, but in the $P_{GM}^T$ case it depends on our ability to provide matches with similar counterfactual treated outcomes. That is,
$$E[\widehat{ATUT}_d - ATUT] = E\left[ \frac{1}{N_0} \sum_{i: T_i = 0} \left[ Y_{j^*_d(i)} - Y_i(1) \right] \right] = E\left[ \frac{1}{N_0} \sum_{i: T_i = 0} \left[ (m(X_{j^*_d(i)}) + \tau(X_{j^*_d(i)})) - (m(X_i) + \tau(X_i)) \right] \right],$$
which is bounded so long as $|(m(X_i) + \tau(X_i)) - (m(X_{j^*_d(i)}) + \tau(X_{j^*_d(i)}))| < \delta_n$.

This difference between the treated and control cases illuminates the importance of learning a metric for ATT (ATUT) estimation using only outcome labels from control (treated) data.
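To make the objects in the proof concrete, the matching estimator $\widehat{ATT}_d$ can be sketched as below, where `phi` stands in for the learned low-dimensional representation of $X$ (an assumed input, produced by any of the representation-learning or variable-selection methods above).

```python
import numpy as np

def att_matching(phi, d, y):
    """Nearest-neighbor matching estimator of the ATT in a learned feature
    space phi: pair each treated unit with its closest control under
    Euclidean distance, then average treated-minus-matched outcomes."""
    treated = np.flatnonzero(d == 1)
    control = np.flatnonzero(d == 0)
    # pairwise distances between treated and control representations
    dist = np.linalg.norm(phi[treated][:, None, :] - phi[control][None, :, :],
                          axis=-1)
    j_star = control[dist.argmin(axis=1)]   # index of closest control match
    return float(np.mean(y[treated] - y[j_star]))
```

The ATUT analogue simply swaps the roles of the treated and control sets.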
E Computing infrastructure
All simulations and empirical demonstrations were executed in R. We use R wrappers for keras and tensorflow to build all neural networks, and the glmnet and RRF packages to estimate the variable selection methods described in Supplement B.

We have run the code on a Linux HPC as well as locally on Windows machines. The least powerful machine on which we've run the code is an 8-core, 2.90GHz Windows desktop with an Intel Core i7-7700T processor and 16GB of RAM.