Optimal selection of the number of control units in kNN algorithm to estimate average treatment effects
Andrés Ramírez-Hassan, Raquel Vargas-Correa, Gustavo García, Daniel Londoño
August 18, 2020
Abstract
We propose a simple approach to optimally select the number of control units in the $k$ nearest neighbors (kNN) algorithm, focusing on minimizing the mean squared error of the average treatment effects. Our approach is non-parametric, and confidence intervals for the treatment effects are calculated using asymptotic results with bias correction. Simulation exercises show that our approach attains relatively small mean squared errors and a balance between confidence interval length and type I error. We analyze the average treatment effect on the treated (ATET) of participation in 401(k) plans on accumulated net financial assets, confirming significant effects on the amount and on the probability of positive net assets. Our optimal $k$ selection produces significantly narrower ATET confidence intervals compared with the common practice of using $k = 1$.

JEL Classification: C14, C18, O18, R42.
Keywords: Average treatment effects, k nearest neighbors, optimal number of control units, 401(k).

All authors: Department of Economics, Universidad EAFIT, Medellín, Colombia; emails: aramir21@eafit.edu.co, rvargas@eafit.edu.co, ggarci24@eafit.edu.co, dlondoko@eafit.edu.co.

1 Introduction
Matching estimators are commonly used to estimate average treatment effects when treatment assignment is independent of the outcome conditional on confounding variables (Cameron and Trivedi, 2005, p. 871). However, optimal selection of the number of matching units ($k$) remains an open question, as most of the applied literature fixes $k$ small without any optimal criterion, often $k = 1$ (Abadie and Imbens, 2016). The choice of $k$ is a trade-off between bias and variance: small $k$ implies small bias but high variance, and vice versa. The aim of this paper is to propose a simple approach to optimally select $k$. In particular, we focus on the $k$ nearest neighbors (kNN) algorithm when used to estimate treatment effects: the average treatment effect (ATE) and the average treatment effect on the treated (ATET).

Optimal selection of $k$ in the kNN algorithm is done by cross-validation when prediction or classification is the objective (Hastie et al., 2009, p. 241). This is easily performed because the statistical object of interest is observable in-sample. Average treatment effects, on the other hand, are not observable anywhere. We therefore follow arguments similar to Athey et al. (2015), proposing an unbiased function of observable variables to optimally choose $k$ through cross-validation when the main inferential concern is treatment effects.

We apply our proposal to analyze the ATET of enrollment in 401(k) plans on the amount and on the probability of positive accumulated net financial assets. We find significant ATETs whose point estimates are approximately $15K and 19%, given an optimal $k^* = 19$ in both cases.

After this brief introduction, Section 2 presents our proposal to optimally select $k$ in the kNN algorithm. Section 3 displays results from some simulation exercises, and Section 4 shows the results of our application regarding 401(k) enrollment. Section 5 ends with some concluding remarks.

2 Optimal selection of k in kNN

We will focus on the average treatment effect on the treated, as we use it in our application.
However, results associated with the average treatment effects are in the Appendix (see subsection 6.1). We build the matching units using $k$ nearest neighbors (kNN) with replacement, where $k$ is selected by minimizing the mean squared error between a conditionally unbiased estimator of the average treatment effect on the treated (ATET) and a matching estimator of the ATET.

We consider a framework with a random binary treatment $D_i \in \{0, 1\}$ that is independent of the outcome variable $Y_i$ conditional on observable variables $X_i \subset \mathbb{R}^P$, $i = 1, 2, \ldots, N$. The average treatment effect on the treated is

$$\Delta^T = E\left[(Y_i(1) - Y_i(0)) \mid D_i = 1, X_i = x_i\right], \qquad (1)$$

where $X_i$ are control variables and $\{Y_i(0), Y_i(1)\}$ are the potential outcomes under the two states, untreated and treated:

$$Y_i = Y_i(D_i) = \begin{cases} Y_i(0), & D_i = 0 \\ Y_i(1), & D_i = 1. \end{cases}$$

However, we do not simultaneously observe the same individual under both states. So, we propose to use the matching approach to build the synthetic controls; the general formula is (Cameron and Trivedi, 2005, p. 875)

$$\hat{\Delta}^{TM} = \frac{1}{N_1} \sum_{i \in \{D_i = 1\}} \left( Y_i(1) - \frac{1}{k} \sum_{j \in A_{ik}(x)} Y_j(0) \right), \qquad (2)$$

where $A_{ik}(x) = \{j : \|X_i - X_j\| < \|X_i - X_l\|,\; j \in \{D_j = 0\},\; l \in \{D_l = 0\}\}$ is the set of the $k$ closest untreated units to treated unit $i$ in terms of the Euclidean norm in covariates, $j, l = 1, 2, \ldots, N_0$, $\|\cdot\|$ is the Euclidean norm, $N_0$ is the sample size of untreated individuals ($D_i = 0$), $N_1$ is the sample size of treated individuals ($D_i = 1$), and $k$ is the cardinality of $A_{ik}(x)$, that is, the number of controls for each treated unit:

$$|A_{ik}(x)| = \sum_{j \in A_{ik}(x)} 1\{\|X_i - X_j\| < \|X_i - X_l\|\} = k.$$

Therefore, the synthetic control for treated individual $i$ is built using an average of the nearest untreated individuals (neighbors).

The choice of $k$ is a trade-off between bias and variance. Selecting just one neighbor minimizes bias, which equals zero when there is an exact match ($k = 1$, $x_i = x_j$), but implies high variability. On the other hand, a large number of neighbors decreases variance but increases bias. In general, $x_i = x_j$ is an event of probability 0 for continuous covariates, and Abadie and Imbens (2006) show that matching estimators then have an asymptotic bias. This asymptotic bias can be ignored when the number of continuous covariates $P_c$ is small relative to the sample sizes (roughly, when $N_1^{P_c/2}/N_0 \to 0$). The conditional bias of the matching estimator is

$$B^{TM} = \frac{1}{N_1} \sum_{i \in \{D_i = 1\}} \left( \mu_0(x_i) - \frac{1}{k} \sum_{j \in A_{ik}(x)} \mu_0(x_j) \right), \qquad (3)$$

where $\mu_0(x_i) = E[Y \mid X = x_i, D = 0]$. This can be estimated non-parametrically using a series expansion estimator to obtain $\hat{B}^{TM}$ (Abadie and Imbens, 2011). Therefore, by equations 2 and 3, the bias-corrected matching estimator is

$$\hat{\Delta}^{BC} = \hat{\Delta}^{TM} - \hat{B}^{TM}. \qquad (4)$$

Abadie and Imbens (2006) show, under suitable assumptions (see Assumptions 1, 2', 3' and 4 in their paper), that

$$\left( V^{E,T} + V^{\Delta(X),T} \right)^{-1/2} \sqrt{N_1} \left( \hat{\Delta}^{TM} - B^{TM} - \Delta^T \right) \overset{d}{\longrightarrow} N(0, 1),$$

where $V^{E,T} = \frac{1}{N_1} \sum_{i=1}^{N} \left( D_i + (1 - D_i) \frac{J_{ik}}{k} \right)^2 \sigma^2(X_i, D_i)$, $J_{ik} = \sum_{j=1}^{N} 1\{i \in A_{jk}\}$ is the number of times unit $i$ is used as a match given $k$ matches per unit, and $\sigma^2(X_i, D_i) = V(Y \mid X_i = x_i, D_i = d_i)$, which can be consistently estimated by

$$\hat{\sigma}^2(X_i, D_i) = \frac{k}{k+1} \left( Y_i - \frac{1}{k} \sum_{l \in L_{ik}(x)} Y_l \right)^2,$$

where $L_{ik}(x)$ is the set of the $k$ closest units to unit $i$, in terms of the Euclidean norm in covariates, that have the same value of the treatment variable, and $V^{\Delta(X),T} = E\left[ \left( [\mu_1(X_i) - \mu_0(X_i)] - \Delta^T \right)^2 \mid D = 1 \right]$ is the variance of the conditional average treatment effect on the treated.

In an ideal situation, we would select $k$ in the kNN regression framework by minimizing the mean squared error of the average treatment effect on the treated (equation 1), that is,

$$\arg\min_k \; E\left( \hat{\Delta}^{BC} - \Delta^T \right)^2. \qquad (5)$$

However, the average treatment effect on the treated ($\Delta^T$) is not observed. Therefore, we follow arguments similar to Athey et al. (2015), proposing an unbiased function of potentially observable variables to define the ATET in program 5.

We have the following assumptions:

Assumption 1. Stable unit treatment value assumption (SUTVA).

Assumption 2. Ignorability (unconfoundedness): $D_i \perp Y_i(0) \mid X_i = x_i$.

Assumption 3. Overlap (matching): $0 \leq P(D_i = 1 \mid X_i = x_i) < 1$.

Remark 1. Assumption 1 implies no spillover, interaction or general equilibrium effects (Rubin, 1978). Assumption 2 establishes that treatment assignment ignores the untreated outcome given the control variables (Rosenbaum and Rubin, 1983; Rubin, 1978). Assumption 3 says that it is necessary to have overlap in subsamples of $X_i$; that is, for each treated individual it is necessary to have an analogous untreated individual (synthetic control) to identify the treatment effect on the treated (Rosenbaum and Rubin, 1983).

We assume a conditionally independent and identically distributed random sample drawn from a population, which implicitly implies Assumption 1 (SUTVA). Then, we define the individual treatment effect on the treated as

$$\Delta_i^{T*} = \frac{Y_i \left( D_i - P(X_i) \right)}{P[D = 1] \left( 1 - P(X_i) \right)}, \qquad (6)$$

where $P[D = 1]$ is the marginal probability of treatment, which is the same for every individual given random sampling, and $P(X_i) = P[D_i = 1 \mid X_i = x_i]$ is the conditional probability of treatment given the regressors $X_i$, that is, the propensity score (Rosenbaum and Rubin, 1983).

Proposition 1. Assuming a conditionally independent and identically distributed random sample such that Assumptions 2 and 3 are satisfied,

$$E\left[ \Delta_i^{T*} \mid P(X_i), X_i = x \right] = \Delta^T.$$

Proof. We omit conditioning on $P[D = 1]$ and $X_i = x$ to simplify notation.

$$
\begin{aligned}
E\left[ \Delta_i^{T*} \right] &= E\left[ \frac{Y_i (D_i - P(X_i))}{P[D=1](1 - P(X_i))} \right] \\
&= \frac{1}{P[D=1](1 - P(X_i))} \left\{ (1 - P(X_i)) E[D_i (Y_i(1) - Y_i(0))] + E[Y_i(0)(D_i - P(X_i))] \right\} \\
&= \frac{1}{P[D=1](1 - P(X_i))} \left\{ (1 - P(X_i)) E\left[ E\left[ D_i (Y_i(1) - Y_i(0)) \mid D_i \right] \right] + E[D_i] E[Y_i(0)] - P(X_i) E[Y_i(0)] \right\} \\
&= \frac{1}{P[D=1](1 - P(X_i))} \left\{ (1 - P(X_i)) P[D=1] E\left[ (Y_i(1) - Y_i(0)) \mid D_i = 1 \right] + E[Y_i(0)] \left( E[D_i] - P(X_i) \right) \right\} \\
&= E\left[ (Y_i(1) - Y_i(0)) \mid D_i = 1 \right].
\end{aligned}
$$

The first equality is by definition. The second follows after some simple algebra, taking into account that $Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)$ and $D_i Y_i = D_i Y_i(1) + D_i Y_i(0) - D_i Y_i(0)$ given $D_i \in \{0, 1\}$. The third is an application of the law of iterated expectations for the first term, and of Assumption 2 (ignorability) for the second term. The fourth equality again uses $D_i \in \{0, 1\}$ and $P[D_i = 1] = P[D = 1]$ (random sampling). Finally, we take into account that $E[D_i \mid X_i = x_i] = P[D_i = 1 \mid X_i = x_i] = P(X_i)$. Notice that Assumption 3 is required to have $P[D = 1](1 - P(X_i)) \neq 0$.
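The transformed outcome in equation 6 is straightforward to compute once propensity scores are available. The following minimal Python sketch (function names are ours, and the propensity scores are supplied by the user rather than fitted, for simplicity) makes the construction concrete:

```python
def transformed_outcome(y, d, p_hat, p_marg):
    """Sample version of Delta_i^{T*} (equation 6):
    Y_i (D_i - P(X_i)) / (P[D=1] (1 - P(X_i))).
    Overlap (Assumption 3, P(X_i) < 1) keeps the denominator positive."""
    return y * (d - p_hat) / (p_marg * (1.0 - p_hat))

def atet_reference(Y, D, p_hats):
    """Average of the transformed outcomes; by Proposition 1 this is a
    conditionally unbiased reference for the ATET. The marginal
    treatment probability is estimated by the sample share of treated."""
    p_marg = sum(D) / len(D)
    vals = [transformed_outcome(Y[i], D[i], p_hats[i], p_marg)
            for i in range(len(D))]
    return sum(vals) / len(vals)
```

For instance, with two units, $Y = (2, 1)$, $D = (1, 0)$ and a constant propensity of 0.5, the average transformed outcome equals the treated–untreated mean difference of 1, as Proposition 1 suggests.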
Remark 2. Observe that $E[Y_i(0) D_i] = P[D = 1] E[Y_i(0) \mid D_i = 1]$. Therefore, we also obtain the required statement if $E[Y_i(0) \mid D_i = 1] = E[Y_i(0) \mid D_i = 0] = E[Y_i(0)]$ conditional on $X_i = x_i$, that is, if the conditional mean assumption on the control group is satisfied (Angrist and Krueger, 1999, p. 1316): the outcome of untreated individuals does not determine participation.

We require estimators of $P[D = 1]$ and $P(X_i)$ to estimate $\Delta_i^{T*}$. We use $\frac{1}{N}\sum_{i=1}^{N} D_i$, with $N = N_0 + N_1$, as the estimator of the marginal probability of treatment, and a logit model to estimate the conditional probability of treatment. Obviously, these estimators can be changed. Notice that this introduces an extra source of variability, as we do not observe $P[D = 1]$ and $P(X_i)$ in observational data sets. This kind of issue is discussed by Athey et al. (2015), who point out that transformations like equation 6 do not optimally use all available information, which may mean extra variability. Additionally, Abadie and Imbens (2016) show that the ATE using matching based on the estimated propensity score is more efficient than matching based on the true propensity score. On the other hand, the ATET using the estimated propensity score can be more or less efficient compared with the true propensity score. We performed simulation exercises that show better inferential performance based on the estimated propensity score (available upon request from the authors).
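Pulling the pieces of this section together, a minimal Python sketch of the matching estimator (equation 2) and an Abadie–Imbens-style standard error may help fix ideas. The function names and the toy construction are ours, not the paper's; the conditional-variance term $V^{\Delta(X),T}$ and the bias correction are omitted for brevity:

```python
import math

def knn_atet(X, Y, D, k):
    """Matching estimator of the ATET (equation 2): for each treated unit,
    average the outcomes of its k nearest untreated neighbours (with
    replacement) and average the resulting differences."""
    treated = [i for i in range(len(D)) if D[i] == 1]
    untreated = [i for i in range(len(D)) if D[i] == 0]
    effects = []
    for i in treated:
        # A_ik(x): the k untreated units closest to i in Euclidean norm
        nbrs = sorted(untreated, key=lambda j: math.dist(X[i], X[j]))[:k]
        effects.append(Y[i] - sum(Y[j] for j in nbrs) / k)
    return sum(effects) / len(effects)

def knn_atet_se(X, Y, D, k):
    """Sketch of the matched-sample standard error: sigma^2(X_i, D_i) is
    estimated from the k nearest neighbours of unit i within its own
    treatment arm, and J_ik counts how often an untreated unit serves
    as a match (V^{Delta(X),T} is ignored here)."""
    n = len(D)
    treated = [i for i in range(n) if D[i] == 1]
    untreated = [i for i in range(n) if D[i] == 0]
    J = [0] * n
    for i in treated:
        for j in sorted(untreated, key=lambda j: math.dist(X[i], X[j]))[:k]:
            J[j] += 1
    var = 0.0
    for i in range(n):
        arm = treated if D[i] == 1 else untreated
        same = sorted((j for j in arm if j != i),
                      key=lambda j: math.dist(X[i], X[j]))[:k]
        ybar = sum(Y[j] for j in same) / len(same)
        sig2 = k / (k + 1) * (Y[i] - ybar) ** 2
        var += (D[i] + (1 - D[i]) * J[i] / k) ** 2 * sig2
    return math.sqrt(var) / len(treated)
```

The standard error divides $\sqrt{\sum_i (\cdot)^2 \hat{\sigma}^2}$ by $N_1$, which matches the $\sqrt{N_1}$ scaling of the asymptotic result above.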
In addition, as our ATET (ATE) estimator is asymptotically bias corrected, the mean squared error decomposition shows that variance explains most of it (75% on average), as expected; a remaining proportion (25% on average) is due to finite-sample bias.

Given that $\Delta_i^{T*}$ in equation 6 is a conditionally unbiased estimator of $\Delta^T$ (Proposition 1), we select $k$ in the kNN regression framework as

$$\arg\min_k \; \frac{1}{G} \sum_{g=1}^{G} \left( \hat{\Delta}_g^{BC} - \frac{1}{N_g} \sum_{i=1}^{N_g} \widehat{\Delta}_{i,g}^{T*} \right)^2, \qquad (7)$$

where

$$\widehat{\Delta}_{i,g}^{T*} = \frac{Y_{i,g}^{Test} \left( D_{i,g}^{Test} - \widehat{P}(X_{i,g}^{Test}) \right)}{\left( \frac{1}{N_g} \sum_{i=1}^{N_g} D_{i,g}^{Test} \right) \left( 1 - \widehat{P}(X_{i,g}^{Test}) \right)},$$

$$\hat{\Delta}_g^{TM} = \frac{1}{N_{1,g}} \sum_{i \in \{D_{i,g}^{Test} = 1\}} \left( Y_{i,g}^{Test}(1) - \frac{1}{k} \sum_{j \in A_{ik}(x_g^{Test})} Y_{j,g}^{Train}(0) \right),$$

$$\hat{B}_g^{TM} = \frac{1}{N_{1,g}} \sum_{i \in \{D_{i,g}^{Test} = 1\}} \left( \hat{\mu}_0(x_i)_g^{Test} - \frac{1}{k} \sum_{j \in A_{ik}(x_g^{Test})} \hat{\mu}_0(x_j)_g^{Train} \right),$$

and $\hat{\Delta}_g^{BC} = \hat{\Delta}_g^{TM} - \hat{B}_g^{TM}$, where $G$ is the number of groups, $N_g$ is the sample size of group $g$, and $N_{1,g}$ is the sample size of the treated units in group $g$, $g = 1, 2, \ldots, G$. The superscripts $Train$ and $Test$ refer to the training and test data sets (see below). Notice that $\widehat{\Delta}_{i,g}^{T*}$ is the sample version of the conditionally unbiased estimator $\Delta_i^{T*}$ of the ATET, so its average is a consistent estimator of the ATET. On the other hand, $\hat{\Delta}_g^{BC}$ is the bias-corrected version of the matching estimator (Abadie and Imbens, 2006).

kNN regression is a standard approach in the machine and statistical learning communities (Hastie et al., 2009, p. 14). So, we follow $G$-fold cross-validation to select the optimal $k$ (Hastie et al., 2009, p. 241). In particular, we randomly split the data set into $G$ roughly equal-sized groups such that we keep the proportion between treated and untreated individuals in each group, and then calculate $\frac{1}{N_g} \sum_{i=1}^{N_g} \widehat{\Delta}_{i,g}^{T*}$, $g = 1, 2, \ldots, G$. We fit the model using the $G - 1$ remaining groups given $k$, and identify the $k$ nearest neighbors in the control group of the training data set, in terms of the Euclidean norm, to each individual $i$ in the treated group of the $g$-th left-out group (the test data set), that is,

$$|A_{ik}(x_g^{Test})| = \sum_{j \in A_{ik}(x_g^{Test})} 1\{\|X_{i,g}^{Test} - X_j^{Train}\| < \|X_{i,g}^{Test} - X_l^{Train}\|\} = k.$$

Then we estimate $\frac{1}{k} \sum_{j \in A_{ik}(x_g^{Test})} Y_{j,g}^{Train}(0)$, and obtain $\hat{\Delta}_{i,g}^{TM} = Y_{i,g}^{Test}(1) - \frac{1}{k} \sum_{j \in A_{ik}(x_g^{Test})} Y_{j,g}^{Train}(0)$ and $\hat{B}_{i,g}^{TM} = \hat{\mu}_0(x_i)_g^{Test} - \frac{1}{k} \sum_{j \in A_{ik}(x_g^{Test})} \hat{\mu}_0(x_j)_g^{Train}$ for $i \in \{D_{i,g}^{Test} = 1\}$. So, we can calculate the mean squared error for the $g$-th left-out group. This is done for $g = 1, 2, \ldots, G$ and different values of $k$, such that we average over the $G$ prediction errors given $k$ and select the $k$ with the minimum average error (see program 7).

Given an optimal $k = k^*$, we use equation 4 to estimate the average treatment effect on the treated using the whole sample. As noted, this estimator converges in distribution to a normal distribution, so we use the asymptotic result to build confidence intervals. Algorithm A1 summarizes our methodological proposal.

Algorithm A1: Optimal $k$ in kNN to estimate the ATET

1. Calculate $\frac{1}{N} \sum_{i=1}^{N} D_i$.
2. For $i = 1, 2, \ldots, N$: fit a logit model where the dependent variable is $D_i$ to get $\widehat{P}(X_i)$, and calculate $\widehat{\Delta}_i^{T*}$, the sample version of equation 6.
3. Randomly split the data set into $G$ roughly equal-sized groups such that the proportion between treated and untreated units in each group is kept the same as in the original data set. Set the training data set as the $G - 1$ groups other than the $g$-th left-out group.
4. For $k = 1, 2, \ldots, K$ and $g = 1, 2, \ldots, G$:
   - Calculate $\frac{1}{N_g} \sum_{i=1}^{N_g} \widehat{\Delta}_{i,g}^{T*}$, where $N_g$ is the sample size of the $g$-th group (test data set).
   - Find the $k$ nearest neighbors of $X_{i,g}$ for treated individuals of the $g$-th group ($D_{i,g} = 1$ in the test data set) among the untreated individuals in the training data set ($D_j = 0$ in the remaining $G - 1$ groups).
   - Calculate $\frac{1}{k} \sum_{j \in A_{ik}(x_g^{Test})} Y_{j,g}^{Train}(0)$ and $\frac{1}{k} \sum_{j \in A_{ik}(x_g^{Test})} \hat{\mu}_0(x_j)_g^{Train}$.
   - Obtain $\hat{\Delta}_{i,g}^{TM} = Y_{i,g}^{Test}(1) - \frac{1}{k} \sum_{j} Y_{j,g}^{Train}(0)$ and $\hat{B}_{i,g}^{TM} = \hat{\mu}_0(x_i)_g^{Test} - \frac{1}{k} \sum_{j} \hat{\mu}_0(x_j)_g^{Train}$ for $i \in \{D_{i,g}^{Test} = 1\}$.
   - Estimate $\hat{\Delta}_g^{TM} = \frac{1}{N_{1,g}} \sum_{i} \hat{\Delta}_{i,g}^{TM}$, $\hat{B}_g^{TM} = \frac{1}{N_{1,g}} \sum_{i} \hat{B}_{i,g}^{TM}$ and $\hat{\Delta}_g^{BC} = \hat{\Delta}_g^{TM} - \hat{B}_g^{TM}$.
   - Calculate the mean squared error $\frac{1}{G} \sum_{g=1}^{G} \left( \hat{\Delta}_g^{BC} - \frac{1}{N_g} \sum_{i=1}^{N_g} \widehat{\Delta}_{i,g}^{T*} \right)^2$.
5. Select the $k$ that minimizes the mean squared error ($k^*$).
6. Given $k^*$ and the whole data set, find the $k^*$ nearest neighbors in the untreated group for each individual in the treated group, and calculate $\hat{\Delta}_i^{BC} = Y_i(1) - \frac{1}{k^*} \sum_{j \in A_{ik^*}(x)} Y_j(0)$.
7. Use $\hat{\Delta}^{BC}$ as a measure of central tendency of the ATET at the neighborhood level, and use Theorems 4 and 7 in Abadie and Imbens (2006) to build confidence intervals for the ATET.

3 Simulation exercises

We set exactly the same simulation setting as Otsu and Rai (2017), who propose a weighted bootstrap to perform inference for matching estimators of the ATE and ATET, since it is well known that the naive bootstrap produces invalid inference because it fails to reproduce the distribution of the number of times each unit is used as a match (Abadie and Imbens, 2008). We show the simulation setting for exposition purposes:

$$
\begin{aligned}
Y_i(1) &= \tau + m_j(\|X_i\|) + \epsilon_i, \\
Y_i(0) &= m_j(\|X_i\|) + \epsilon_i, \\
D_i &= 1\{P(X_i) \geq v_i\}, \quad v_i \sim U[0, 1], \\
P(X_i) &= \gamma_0 + \gamma_1 \|X_i\|, \quad X_i = [X_{i1}, \ldots, X_{iP}]^\top, \\
X_{ij} &= \psi_i |\zeta_{ij}| / \|\zeta_i\|, \quad j = 1, 2, \ldots, P, \\
\psi_i &\sim U[0, 1], \quad \zeta_i \sim N(0, I_P), \quad \epsilon_i \sim N(0, \sigma_\epsilon^2),
\end{aligned}
$$

where $\epsilon_i$, $v_i$, $\psi_i$ and $\zeta_i$ are mutually independent, $P = 5$, and the values of $\tau$, $\gamma_0$, $\gamma_1$ and $\sigma_\epsilon$ are set as in Otsu and Rai (2017). The ATET, as well as the ATE, equals $\tau$, and there are six different curves $m_j(\|X_i\|)$ (see Otsu and Rai (2017) for details).
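The continuous-outcome design above can be sketched in code as follows. The numeric values of `tau`, `gamma0`, `gamma1`, the error standard deviation, and the linear $m(\cdot)$ used here are illustrative placeholders, not the calibrated values of Otsu and Rai (2017):

```python
import math
import random

def simulate(n, tau=0.2, gamma0=0.1, gamma1=0.5, P=5, seed=0):
    """One draw from a DGP of the Otsu-Rai form used in the paper
    (continuous outcome). Note that ||X_i|| = psi_i by construction,
    since X_i = psi_i |zeta_i| / ||zeta_i||."""
    rng = random.Random(seed)
    X, Y, D = [], [], []
    for _ in range(n):
        zeta = [rng.gauss(0.0, 1.0) for _ in range(P)]
        norm_z = math.sqrt(sum(z * z for z in zeta))
        psi = rng.random()
        x = [psi * abs(z) / norm_z for z in zeta]
        m = 0.5 * psi                      # an illustrative linear m_j(.)
        eps = rng.gauss(0.0, 0.2)          # illustrative error scale
        p = gamma0 + gamma1 * psi          # propensity P(X_i)
        d = 1 if p >= rng.random() else 0  # D_i = 1{P(X_i) >= v_i}
        X.append(x)
        D.append(d)
        Y.append((tau + m + eps) if d == 1 else (m + eps))
    return X, Y, D
```

Because $\|X_i\| = \psi_i \in [0, 1]$, the propensity stays inside the unit interval for the placeholder values of $\gamma_0$ and $\gamma_1$, so the overlap assumption holds in this sketch.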
The six curves $m_j(\|X_i\|)$, $j = 1, 2, \ldots, 6$, range from a linear function of $\|X_i\|$, through combinations of linear and Gaussian-bump terms, to highly oscillatory specifications involving $\sin(8\|X_i\| - 5)$ and $\cos(30\|X_i\|)$ terms; see Otsu and Rai (2017) for the exact expressions.

In addition, we perform other simulation exercises where the outcome variable is binary. In particular,

$$Y_i(1) = \begin{cases} 1, & \tau + \beta m_j(\|X_i\|) + \epsilon_i > 0 \\ 0, & \tau + \beta m_j(\|X_i\|) + \epsilon_i \leq 0, \end{cases} \qquad Y_i(0) = \begin{cases} 1, & \beta m_j(\|X_i\|) + \epsilon_i > 0 \\ 0, & \beta m_j(\|X_i\|) + \epsilon_i \leq 0, \end{cases}$$

where $\epsilon_i \sim LG(0, 1)$, $LG$ denotes a logistic distribution, $\tau = 0.5$, and $P(D_i = 1) = P(\gamma_0 + \gamma_1 \|X_i\| > v_i) = F_{LG}(\gamma_0 + \gamma_1 \|X_i\|)$, with $v_i \sim LG(0, 1)$, $F_{LG}$ the logistic distribution function, and $\beta$, $\gamma_0$ and $\gamma_1$ fixed constants. Other components of the specification are as in the previous setting. Observe that in this setting $ATE_i = E(Y_i(1) - Y_i(0) \mid X_i = x_i) = P(Y_i(1) = 1 \mid X_i = x_i) - P(Y_i(0) = 1 \mid X_i = x_i) = F_{LG}(\tau + \beta m_j(\|X_i\|)) - F_{LG}(\beta m_j(\|X_i\|))$, which is the same analytical expression as for the ATET, but conditional on the relevant sample ($D_i = 1$).

Our results are based on a sample size equal to 100 (Otsu and Rai, 2017), with 1,000 replications for each simulation exercise. We performed other simulation exercises using larger sample sizes; outcomes show similar results and are available upon request from the authors. Results for the average treatment effects are in the Appendix, subsection 6.2; outcomes are similar to the ATET.

First, we present the results for the average treatment effect on the treated with the continuous outcome variable. Table 1 displays the mean relative squared errors (MRSE),

$$MRSE = \frac{1}{S_k} \sum_{s=1}^{S_k} \left( \frac{\hat{\Delta}_s^{BC}(k) - ATET}{\hat{\Delta}_s^{BC}(k^*) - ATET} \right)^2, \quad k^* \neq k, \; k = 1, 2, \ldots, 20,$$

where $\hat{\Delta}_s^{BC}(k)$ and $\hat{\Delta}_s^{BC}(k^*)$ are the ATET estimates using any $k$ and the optimal $k = k^*$, and $S_k$ is the number of replications in which $k$ was not optimal. We see in Table 1 that there are relatively large advantages to using our optimal $k^*$, as all figures are larger than 1.

Table 1: Mean relative squared errors, continuous outcome: average treatment effects on the treated, by $k$ and curve $m_1$–$m_6$. [Entries not recoverable in this version.]

Figure 1 shows the average over 1,000 replications of the ratios between the length of the 95% confidence intervals associated with any number of neighbors ($k$, x-axis) and our optimal selection. This is done for the six data generating settings ($m_j(\|X_i\|)$, $j = 1, 2, \ldots, 6$). Small $k$ implies relatively wide intervals, whereas large $k$ implies narrow intervals; this is expected, as small $k$ implies high variability compared with large $k$. Observe that interval lengths decrease very fast and stabilize around $k = 10$. On the other hand, we see in Figure 2 that the type I error from rejecting the null hypothesis $ATET = \tau$, at a 5% significance level, is lowest using $k = 1$. Using the $k$ that minimizes the mean squared error gives a relatively small type I error compared with other $k$ selections.

Figure 1: Relative 95% confidence interval lengths, continuous outcome: small $k$ implies relatively wide intervals, large $k$ narrow intervals. The dotted line is the optimal selection of $k$ minimizing the mean squared error; by construction it has a ratio equal to 1.

Figure 2: Type I errors (5% nominal size), continuous outcome: average treatment effects on the treated. The optimal $k$ gives a relatively small type I error compared with other $k$ selections; $k = 1$ always attains the best nominal size.

Figure 3 displays the histograms of the optimal $k$ in our six simulation settings. We see that the optimal selection is dominated by small $k$. This could be evidence to support $k = 1$, which is widely used in many applications, although values of $k$ slightly above 1 appear most relevant, and large values of $k$ cannot be discarded, especially with large sample sizes (we performed other simulation exercises that show this).

Figure 3: Optimal $k$ frequencies, continuous outcome: average treatment effects on the treated. Small $k$ dominates the optimal choice, although large $k$ cannot be discarded.

Results of the ATET for our simulation setting with a binary outcome are qualitatively similar to the continuous outcome. Table 2 shows the mean relative squared errors. Again, all figures are larger than 1, which means that using a non-optimal $k$ implies larger estimation errors.

Table 2: Mean relative squared errors, binary outcome: average treatment effects on the treated, by $k$ and curve $m_1$–$m_6$. [Entries not recoverable in this version.]

Figure 4 shows the relative 95% confidence interval length ratios associated with $k$ and $k^*$. All six simulation settings look very similar, with wider confidence intervals when $k$ is small and narrower intervals when $k$ is large; there is stabilization around $k = 10$. Figure 5 shows type I errors using a 5% significance level. Larger $k$ implies larger errors, and using $k^*$ implies small type I errors in general, except when $k$ is very small. The small type I error can be explained by means of Figure 6, where we see that the optimal selection of $k$ is again concentrated on small values.

Figure 4: Relative 95% confidence interval lengths, binary outcome: small $k$ implies relatively wide intervals, large $k$ narrow intervals. The dotted line is the optimal $k$ minimizing the mean squared error; by construction it has a ratio equal to 1.

In general, we observe that our proposal implies relatively smaller estimation errors, achieving a compromise between confidence interval length and type I error, with a small optimal $k$ most of the time.
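The selection rule exercised in these simulations, equation 7, can be sketched in code as follows. This is a simplified version of Algorithm A1 (names are ours, and the bias-correction term is omitted):

```python
import math
import random

def select_k(X, Y, D, p_hats, G=5, k_max=10, seed=0):
    """G-fold cross-validated choice of k (equation 7), without the
    bias-correction term of Algorithm A1: for each fold, compare the
    matching estimate on the test fold (neighbours taken from the
    training folds) with the fold average of the transformed outcomes."""
    n = len(D)
    p_marg = sum(D) / n
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[g::G] for g in range(G)]
    best_k, best_mse = None, float("inf")
    for k in range(1, k_max + 1):
        errs = []
        for g in range(G):
            test = folds[g]
            in_test = set(test)
            controls = [j for j in idx if j not in in_test and D[j] == 0]
            treated = [i for i in test if D[i] == 1]
            if len(controls) < k or not treated:
                continue
            # matching estimate of the ATET on the test fold
            est = 0.0
            for i in treated:
                nbrs = sorted(controls, key=lambda j: math.dist(X[i], X[j]))[:k]
                est += Y[i] - sum(Y[j] for j in nbrs) / k
            est /= len(treated)
            # fold average of the transformed outcomes (equation 6)
            ref = sum(Y[i] * (D[i] - p_hats[i]) / (p_marg * (1 - p_hats[i]))
                      for i in test) / len(test)
            errs.append((est - ref) ** 2)
        if not errs:
            continue
        mse = sum(errs) / len(errs)
        if mse < best_mse:
            best_k, best_mse = k, mse
    return best_k
```

A full implementation would also fit a logit for `p_hats`, estimate $\hat{\mu}_0(\cdot)$ for the bias correction, and keep the treated/untreated proportion fixed across folds, as Algorithm A1 prescribes.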
Figure 5: Type I errors (5% nominal size), binary outcome: average treatment effects on the treated. The optimal $k$ gives a relatively small type I error compared with other $k$ selections; $k = 1$ always attains the best nominal size.

Figure 6: Optimal $k$ frequencies, binary outcome: average treatment effects on the treated. Small $k$ dominates the optimal choice, although large $k$ cannot be discarded.

4 Application: 401(k) enrollment

To illustrate our proposal, we estimate the average treatment effect on the treated of 401(k) participation on accumulated net financial assets. A 401(k) is a retirement plan in which savings contributions are deducted by employers from employees' pay before taxes, so that taxes on capital gains are avoided. Employees must meet some eligibility criteria, so the plan does not have universal applicability in principle. The impact evaluation of 401(k) plans on net financial assets has been carried out previously by Benjamin (2003); Chernozhukov et al. (2017); Chernozhukov and Hansen (2004); Conley et al. (2012), among others. We overcome the lack of random assignment following the arguments of Chernozhukov et al. (2017): enrollment in a 401(k) plan is exogenous when controlling for observable variables which drive employment decisions, such as income, when the plan initially became available. Our data set is from the 1991 Survey of Income and Program Participation and comprises 9,915 observations.

We control for the same set of variables as Chernozhukov and Hansen (2004): age (linear and squared), education (years), household size, income (seven levels), and binary variables indicating defined-benefit pension, home ownership, marital status (married), participation in an IRA (individual retirement account), and two-earner status (both household heads contribute to household income). Details can be found in Benjamin (2003).

Table 3 shows unconditional mean difference tests for our outcome variables: accumulated net financial assets, and a binary variable indicating positive accumulated net financial assets.

Table 3: Descriptive statistics, outcome variables
Response   | Mean treated    | Mean untreated  | Statistic
Continuous | $38.26K (79.09) | $10.89K (55.26) | 16.28**
Binary     | 0.86 (0.35)     | 0.55 (0.50)     | 35.27**
Standard errors in parentheses. The test of the difference of two proportions uses a normally distributed statistic $z = (\hat{p}_1 - \hat{p}_2)/\sqrt{\hat{p}_p(1 - \hat{p}_p)(1/n_1 + 1/n_2)}$, with pooled proportion $\hat{p}_p = (x_1 + x_2)/(n_1 + n_2)$, where $x_l$ and $n_l$ are the number of successes and the sample size in each group, $l = 1, 2$. The test of the difference of two means uses a Student's $t$ calculated as $t = (\bar{x}_1 - \bar{x}_2)/\sqrt{s_1^2/n_1 + s_2^2/n_2}$. Null hypothesis: means (proportions) are equal. The critical value at the 5% significance level is 1.96. ** Rejection of the null hypothesis at the 5% significance level.

Employees enrolled in 401(k) plans have $27.4K more accumulated net financial assets on average than employees who are not enrolled, and their unconditional probability of positive net financial assets is 31 percentage points higher on average. Both unconditional differences are statistically significant.

Figure 7 shows the ATET on accumulated net financial assets using our optimal selection, $k^* = 19$, with $k = 1$ as a reference. We see similar ATET point estimates of approximately $15K (similar outcomes are found by Conley et al. (2012) under the exogeneity assumption, using a Bayesian approach for a local average treatment effect, LATE).
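The unconditional tests described in the footnote of Table 3 can be sketched as follows (function names are ours):

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic, as in the Table 3 footnote:
    z = (p1 - p2) / sqrt(pp (1 - pp) (1/n1 + 1/n2)),
    with pooled proportion pp = (x1 + x2) / (n1 + n2)."""
    p1, p2 = x1 / n1, x2 / n2
    pp = (x1 + x2) / (n1 + n2)
    return (p1 - p2) / math.sqrt(pp * (1 - pp) * (1 / n1 + 1 / n2))

def two_mean_t(m1, s1, n1, m2, s2, n2):
    """Unequal-variance two-sample t statistic, as in the Table 3
    footnote: t = (xbar1 - xbar2) / sqrt(s1^2/n1 + s2^2/n2)."""
    return (m1 - m2) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
```

Comparing $|z|$ or $|t|$ with 1.96 gives the 5% two-sided decision rule used in Table 3.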
However, the 95% confidence interval using $k^*$ is ($10.9K, $18.9K), whereas the $k = 1$ interval is ($9.1K, $19.9K), that is, 35% wider. Results regarding the probability of positive accumulated net financial assets can be seen in Figure 8. Both $k^* = 19$ and $k = 1$ indicate statistically significant ATETs, whose point estimates are approximately 19.9% and 18.7%, respectively. The 95% confidence intervals are (16.3%, 21.3%) and (17.7%, 21.3%) for $k = 1$ and $k^* = 19$; that is, using $k = 1$ implies a 38.9% wider interval than using $k^* = 19$. Observe that both exercises indicate the same optimal $k$, which means that the diagnostic balance tests are based on the same control groups. Figure 9 shows the love plot, indicating that the control groups based on $k^* = 19$ satisfy mean balancing conditions.

Figure 7: Average treatment effects on the treated of 401(k) participation on accumulated net financial assets: point estimates with $k = 1$ (gray dot) and optimal $k^*$ (black dot), and 95% confidence intervals. Both $k = 1$ and $k^*$ indicate a 5% significant positive ATET of approximately $15K; the 95% confidence interval using $k = 1$ is wider than that using $k^*$.

Figure 8: Average treatment effects on the treated of 401(k) participation on the probability of positive accumulated net financial assets: point estimates with $k = 1$ (gray dot) and optimal $k^*$ (black dot), and 95% confidence intervals. Both indicate a 5% significant positive ATET of approximately 19%; the 95% confidence interval using $k = 1$ is wider than that using $k^*$.

Figure 9: Mean balancing conditions for the control variables used in calculating the ATET: love plot using $k^* = 19$. Mean balancing conditions appear to be satisfied, as the statistical tests (black dots) are inside the critical levels (dashed red lines).

5 Concluding remarks

We present a simple approach to optimally select the number of control units ($k$) used to build the "synthetic" or "counterfactual" unit for each individual in the $k$ nearest neighbors algorithm. Our approach is based on a simple unbiased reference estimator (the individual treatment effect on the treated for the ATET, and the individual treatment effect for the ATE), such that we select the $k$ minimizing the mean squared error with respect to this reference estimator. The optimal $k = k^*$ thereby achieves a balance between confidence interval length and type I error.

Our application suggests that $k^*$ can be relatively large compared with the common practice of using $k = 1$, which implies significant reductions in confidence interval lengths, although both choices show 5% statistically significant effects of 401(k) enrollment on the probability of positive accumulated net financial assets and on its amount.

A final remark is that using as the reference estimator for the treatment effects the one obtained with $k = 1$ is an interesting alternative. We obtained the same qualitative results, that is, a balance between interval length and type I error, and a considerable reduction in MRSE. Simulation exercises confirm this.

References

Abadie, A. and Imbens, G. (2006). Large sample properties of matching estimators for average treatment effects.
Econometrica, 74(1):235–267.

Abadie, A. and Imbens, G. (2008). On the failure of the bootstrap for matching estimators. Econometrica, 76(6):1537–1557.

Abadie, A. and Imbens, G. (2011). Bias-corrected matching estimators for average treatment effects. Journal of Business & Economic Statistics, 29(1):1–11.

Abadie, A. and Imbens, G. (2016). Matching on the estimated propensity score. Econometrica, 84(2):781–807.

Angrist, J. and Krueger, A. (1999). Empirical strategies in labor economics. In Ashenfelter, O. and Card, D., editors, Handbook of Labor Economics, volume 3. Elsevier.

Athey, S., Imbens, G., and Ramachandra, V. (2015). Machine learning methods for estimating heterogeneous causal effects. Technical report, Stanford University.

Benjamin, D. (2003). Does 401(k) eligibility increase savings? Evidence from propensity score subclassification. Journal of Public Economics, 87:1259–1290.

Cameron, C. and Trivedi, P. (2005). Microeconometrics: Methods and Applications. Cambridge University Press.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2017). Double/debiased machine learning for treatment and causal parameters. The Econometrics Journal, 21:1–68.

Chernozhukov, V. and Hansen, C. (2004). The effects of 401(k) participation on the wealth distribution: an instrumental quantile regression analysis. Review of Economics and Statistics, 86(3):735–751.

Conley, T., Hansen, C., and Rossi, P. (2012). Plausibly exogenous. The Review of Economics and Statistics, 94(1):260–272.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, second edition.

Otsu, T. and Rai, Y. (2017). Bootstrap inference of matching estimators for average treatment effects. Journal of the American Statistical Association, 112(520):1720–1732.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.

Rubin, D. (1978). Bayesian inference for causal effects: The role of randomization. Annals of Statistics, 6:34–58.
Appendix: Optimal k in kNN: Average treatment effects

The average treatment effect is

\[ \Delta^{ATE} = E\left[\left(Y_i(1) - Y_i(0)\right) \mid X_i = x_i\right]. \]

A matching estimator for the ATE is given by

\[ \hat{\Delta}^{ATE} = \frac{1}{N}\sum_{i=1}^{N}\left[\hat{Y}_i(1) - \hat{Y}_i(0)\right], \]

where

\[ \hat{Y}_i(1) = \begin{cases} \frac{1}{k}\sum_{j\in A_{ik}(x)} Y_j(1), & D_i = 0 \\ Y_i, & D_i = 1 \end{cases}, \qquad \hat{Y}_i(0) = \begin{cases} Y_i, & D_i = 0 \\ \frac{1}{k}\sum_{j\in A_{ik}(x)} Y_j(0), & D_i = 1 \end{cases}, \]

and A_{ik}(x) defines the relevant control group conditional on being treated or untreated.

The bias of the ATE is

\[ B^{ATE} = \frac{1}{N}\left\{\sum_{i\in\{D_i=1\}}\left[\mu_0(x_i) - \frac{1}{k}\sum_{j\in A_{ik}(x)}\mu_0(x_j)\right] - \sum_{i\in\{D_i=0\}}\left[\mu_1(x_i) - \frac{1}{k}\sum_{j\in A_{ik}(x)}\mu_1(x_j)\right]\right\}, \]

where \mu_0(x_i) = E[Y \mid X = x_i, D = 0] and \mu_1(x_i) = E[Y \mid X = x_i, D = 1]. These can be estimated non-parametrically using a series expansion estimator to obtain \hat{B}^{ATE} (Abadie and Imbens, 2011).

Therefore, the bias-corrected matching estimator is

\[ \hat{\Delta}^{BC,ATE} = \hat{\Delta}^{ATE} - \hat{B}^{ATE}. \]

Theorem 4(i) in Abadie and Imbens (2006) shows asymptotic distributional convergence results for the bias-corrected average treatment effect, and Section 3.2 in their paper has the required variance expressions (marginal and conditional).

The individual treatment effect is
\[ \Delta^{ATE*}_i = \frac{Y_i\left(D_i - P(X_i)\right)}{P(X_i)\left(1 - P(X_i)\right)}, \tag{8} \]

such that E\left[\Delta^{ATE*}_i \mid X_i = x\right] = \Delta^{ATE} (Athey et al., 2015).

Therefore, the optimal selection of k is given by solving the program

\[ \operatorname*{arg\,min}_{k} \frac{1}{G}\sum_{g=1}^{G}\left(\hat{\Delta}^{BC,ATE}_{g} - \frac{1}{N_g}\sum_{i=1}^{N_g}\widehat{\Delta}^{ATE*}_{i,g}\right)^2, \]

where

\[ \widehat{\Delta}^{ATE*}_{i,g} = \frac{Y^{Test}_{i,g}\left(D^{Test}_{i,g} - \widehat{P}(X^{Test}_{i,g})\right)}{\widehat{P}(X^{Test}_{i,g})\left(1 - \widehat{P}(X^{Test}_{i,g})\right)}, \]

\[ \hat{\Delta}^{ATE}_{g} = \frac{1}{N_g}\left\{\sum_{i\in\{D^{Test}_{i,g}=1\}}\left[Y^{Test}_{i,g}(1) - \frac{1}{k}\sum_{j\in A_{ik}(x^{Test}_g)} Y^{Train}_{j,g}(0)\right] - \sum_{i\in\{D^{Test}_{i,g}=0\}}\left[Y^{Test}_{i,g}(0) - \frac{1}{k}\sum_{j\in A_{ik}(x^{Test}_g)} Y^{Train}_{j,g}(1)\right]\right\}, \]

\[ \hat{B}^{ATE}_{g} = \frac{1}{N_g}\left\{\sum_{i\in\{D^{Test}_{i,g}=1\}}\left[\hat{\mu}_0(x^{Test}_{i,g}) - \frac{1}{k}\sum_{j\in A_{ik}(x^{Test}_g)}\hat{\mu}_0(x^{Train}_{j,g})\right] - \sum_{i\in\{D^{Test}_{i,g}=0\}}\left[\hat{\mu}_1(x^{Test}_{i,g}) - \frac{1}{k}\sum_{j\in A_{ik}(x^{Test}_g)}\hat{\mu}_1(x^{Train}_{j,g})\right]\right\}, \]

and \hat{\Delta}^{BC,ATE}_{g} = \hat{\Delta}^{ATE}_{g} - \hat{B}^{ATE}_{g}.

An algorithm similar to Algorithm A1 can be implemented to obtain an optimal k for the ATE given these definitions.

Figure 10: 95% confidence interval length ratio, continuous outcome: average treatment effects. Small k implies relatively wide intervals, whereas large k implies narrow intervals. The dotted line is the optimal selection of k minimizing mean squared error; by construction it has a ratio equal to 1.

Figure 11: Type I errors, continuous outcome: average treatment effects. The optimal k gives a relatively small average type I error compared with other k selections; k = 1 always gets the best nominal size.

Figure 12: 95% confidence interval length ratio, binary outcome: average treatment effects. Small k implies relatively wide intervals, whereas large k implies narrow intervals. The dotted line is the optimal selection of k minimizing mean squared error; by construction it has a ratio equal to 1.

Figure 13: Type I errors (5% nominal size), binary outcome: average treatment effects. The optimal k gives a relatively small type I error compared with other k selections; k = 1 always gets the best nominal size.

Table 4: Mean relative squared errors, continuous outcome: average treatment effects. [Table values not recoverable.] Note:

\[ MRSE = \frac{1}{S_k}\sum_{s=1}^{S_k}\left(\frac{\hat{\Delta}^{BC}_s(k) - ATE}{\hat{\Delta}^{BC}_s(k^*) - ATE}\right)^2, \quad k^* \neq k, \]

where \hat{\Delta}^{BC}_s(k) and \hat{\Delta}^{BC}_s(k^*) are the ATE estimates using any k and the optimal k = k^*, and S_k is the number of times that k was not optimal. It seems that there are relatively large advantages from using our optimal k^*.

Table 5: Mean relative squared errors, binary outcome: average treatment effects. [Table values not recoverable.] The MRSE is defined as in Table 4. It seems that there are relatively large advantages from using an optimal k.
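The appendix's recipe can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it uses plain Euclidean kNN imputation, the transformed outcome of equation (8), and a G-fold search over k in which each test fold is matched against the training folds. For brevity the bias-correction term is omitted and the propensity score is taken as known; all function names are illustrative.

```python
import numpy as np

def knn_impute(y_pool, x_pool, x_query, k):
    """Mean outcome of the k nearest (Euclidean) pool units per query row."""
    out = np.empty(len(x_query))
    for i, xi in enumerate(x_query):
        idx = np.argsort(np.linalg.norm(x_pool - xi, axis=1))[:k]
        out[i] = y_pool[idx].mean()
    return out

def knn_match_ate(y, d, x, y_pool, d_pool, x_pool, k):
    """kNN matching ATE (no bias correction): each unit's missing potential
    outcome is imputed from the opposite treatment arm of the donor pool."""
    y1 = np.where(d == 1, y, knn_impute(y_pool[d_pool == 1], x_pool[d_pool == 1], x, k))
    y0 = np.where(d == 0, y, knn_impute(y_pool[d_pool == 0], x_pool[d_pool == 0], x, k))
    return float(np.mean(y1 - y0))

def transformed_outcome(y, d, pscore):
    """Transformed outcome whose conditional expectation equals the ATE."""
    return y * (d - pscore) / (pscore * (1.0 - pscore))

def choose_k(y, d, x, pscore, ks, n_folds=5, seed=0):
    """Pick k minimizing the average squared gap between the fold-level
    matching estimate (test matched to train) and the fold mean of the
    transformed outcome, mirroring the program in the appendix."""
    folds = np.random.default_rng(seed).integers(0, n_folds, size=len(y))
    losses = {}
    for k in ks:
        loss = 0.0
        for g in range(n_folds):
            te, tr = folds == g, folds != g
            ate_g = knn_match_ate(y[te], d[te], x[te], y[tr], d[tr], x[tr], k)
            target_g = transformed_outcome(y[te], d[te], pscore[te]).mean()
            loss += (ate_g - target_g) ** 2
        losses[k] = loss / n_folds
    return min(losses, key=losses.get)

# Toy check on simulated, randomized data (true ATE = 2, known p-score 0.5).
rng = np.random.default_rng(1)
n = 400
x = rng.normal(size=(n, 2))
p = np.full(n, 0.5)
d = rng.binomial(1, p)
y = x[:, 0] + 2.0 * d + rng.normal(scale=0.5, size=n)
k_star = choose_k(y, d, x, p, ks=range(1, 16))
ate_hat = knn_match_ate(y, d, x, y, d, x, k_star)
```

In the toy check the final estimate matches the full sample against itself (each arm donates to the other), so the recovered ATE should be close to the true value of 2 for any reasonable k; the cross-validated k^* simply picks the bias-variance balance automatically.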