Optimal selection of the number of control units in kNN algorithm to estimate average treatment effects
Andrés Ramírez-Hassan, Raquel Vargas-Correa, Gustavo García, Daniel Londoño
August 18, 2020
Abstract
We propose a simple approach to optimally select the number of control units in the $k$ nearest neighbors (kNN) algorithm, focusing on minimizing the mean squared error of the average treatment effects. Our approach is non-parametric, and confidence intervals for the treatment effects are calculated using asymptotic results with bias correction. Simulation exercises show that our approach attains relatively small mean squared errors and a balance between confidence interval length and type I error. We analyze the average treatment effect on the treated (ATET) of participation in 401(k) plans on accumulated net financial assets, confirming significant effects on the amount and on the probability of positive net assets. Our optimal $k$ selection produces significantly narrower ATET confidence intervals compared with the common practice of using $k = 1$.

JEL Classification: C14, C18, O18, R42.
Keywords: Average treatment effects, k nearest neighbors, optimal number of control units, 401(k).

All authors: Department of Economics, Universidad EAFIT, Medellín, Colombia; emails: aramir21@eafit.edu.co, rvargas@eafit.edu.co, ggarci24@eafit.edu.co, dlondoko@eafit.edu.co.

1 Introduction
Matching estimators are commonly used to estimate average treatment effects when treatment assignment is independent of the outcome conditional on confounding variables (Cameron and Trivedi, 2005, p. 871). However, optimal selection of the number of matching units ($k$) remains an open question, as most of the applied literature fixes $k$ small without any optimal criterion, often $k = 1$ (Abadie and Imbens, 2016). The choice of $k$ is a trade-off between bias and variance: small $k$ implies small bias but high variance, and vice versa. The aim of this paper is to propose a simple approach to optimally select $k$. In particular, we focus on the $k$ nearest neighbors (kNN) algorithm when used to estimate treatment effects: the average treatment effect (ATE) and the average treatment effect on the treated (ATET).

Optimal selection of $k$ in the kNN algorithm is done by cross-validation when prediction or classification is the objective (Hastie et al., 2009, p. 241). This is easily performed because the statistical object of interest is observable in-sample. Average treatment effects, on the other hand, are not observable anywhere. We therefore follow arguments similar to Athey et al. (2015), proposing an unbiased function of observable variables to optimally choose $k$ through cross-validation when the main inferential concern is treatment effects.

We apply our proposal to analyze the ATET of enrollment in 401(k) plans on the amount and on the probability of positive accumulated net financial assets. We find significant ATETs whose point estimates are approximately $15K and 19%, given an optimal $k^* = 19$ in both cases.

After this brief introduction, Section 2 presents our proposal to optimally select $k$ in the kNN algorithm. Section 3 displays results from some simulation exercises, and Section 4 shows the results of our application regarding 401(k) enrollment. Section 5 ends with some concluding remarks.

2 Optimal selection of k in kNN

We will focus on the average treatment effect on the treated, as we use it in our application.
However, results associated with the average treatment effects are in the Appendix (see subsection 6.1). We build the matching units using $k$ nearest neighbors (kNN) with replacement, where $k$ is selected by minimizing the mean squared error between a conditionally unbiased estimator of the average treatment effect on the treated (ATET) and a matching estimator of the ATET.

We consider a framework with a random binary treatment $D_i \in \{0, 1\}$ that is independent of the outcome variable $Y_i$ conditional on observable variables $X_i \subset \mathbb{R}^P$, $i = 1, 2, \ldots, N$. The average treatment effect on the treated is

$$\Delta^T = E\left[(Y_i(1) - Y_i(0)) \mid D_i = 1, X_i = x_i\right], \qquad (1)$$

where $X_i$ are control variables and $\{Y_i(0), Y_i(1)\}$ are the potential outcomes under the two states, untreated and treated:

$$Y_i = Y_i(D_i) = \begin{cases} Y_i(0), & D_i = 0 \\ Y_i(1), & D_i = 1. \end{cases}$$

However, we do not simultaneously observe the same individual under both states. So, we propose to use the matching approach to build the synthetic controls; the general formula is (Cameron and Trivedi, 2005, p. 875)

$$\hat{\Delta}^{TM} = \frac{1}{N_1} \sum_{i \in \{D_i = 1\}} \left( Y_i(1) - \frac{1}{k} \sum_{j \in A_{ik}(x)} Y_j(0) \right), \qquad (2)$$

where $A_{ik}(x) = \{j : \|X_i - X_j\| < \|X_i - X_l\|,\; j \in \{D_j = 0\},\; l \in \{D_l = 0\}\}$ is the set of the $k$ closest untreated units to treated unit $i$ in terms of the Euclidean norm in covariates, $j, l = 1, 2, \ldots, N_0$, $\|\cdot\|$ is the Euclidean norm, $N_0$ is the sample size of untreated individuals ($D_i = 0$), $N_1$ is the sample size of treated individuals ($D_i = 1$), and $k$ is the cardinality of $A_{ik}(x)$, that is, the number of controls for each treated unit:

$$|A_{ik}(x)| = \sum_{j \in A_{ik}(x)} 1\{\|X_i - X_j\| < \|X_i - X_l\|\} = k.$$

Therefore, the synthetic control for treated individual $i$ is built using an average of the nearest untreated individuals (neighbors).

The choice of $k$ is a trade-off between bias and variance. Selecting just one neighbor minimizes bias, which equals zero when there is an exact match ($k = 1$, $x_i = x_j$), but implies high variability. On the other hand, a large number of neighbors decreases variance but increases bias. In general, $x_i = x_j$ is an event of probability 0 for continuous covariates, and Abadie and Imbens (2006) show that matching estimators then have an asymptotic bias. This asymptotic bias can be ignored when the number of continuous covariates $P_c$ is small relative to the sample sizes (roughly, when $N_1^{P_c/2}/N_0 \to 0$). The conditional bias of the matching estimator is

$$B^{TM} = \frac{1}{N_1} \sum_{i \in \{D_i = 1\}} \left( \mu_0(x_i) - \frac{1}{k} \sum_{j \in A_{ik}(x)} \mu_0(x_j) \right), \qquad (3)$$

where $\mu_0(x_i) = E[Y \mid X = x_i, D = 0]$. This can be estimated non-parametrically using a series expansion estimator to obtain $\hat{B}^{TM}$ (Abadie and Imbens, 2011). Therefore, by equations 2 and 3, the bias-corrected matching estimator is

$$\hat{\Delta}^{BC} = \hat{\Delta}^{TM} - \hat{B}^{TM}. \qquad (4)$$

Abadie and Imbens (2006) show, under suitable assumptions (see Assumptions 1, 2', 3' and 4 in their paper), that

$$\left( V^{E,T} + V^{\Delta(X),T} \right)^{-1/2} \sqrt{N_1} \left( \hat{\Delta}^{TM} - B^{TM} - \Delta^T \right) \overset{d}{\longrightarrow} N(0, 1),$$

where $V^{E,T} = \frac{1}{N_1} \sum_{i=1}^{N} \left( D_i + (1 - D_i) \frac{J_{ik}}{k} \right)^2 \sigma^2(X_i, D_i)$, $J_{ik} = \sum_{j=1}^{N} 1\{i \in A_{jk}\}$ is the number of times unit $i$ is used as a match given $k$ matches per unit, and $\sigma^2(X_i, D_i) = V(Y \mid X_i = x_i, D_i = d_i)$, which can be consistently estimated by

$$\hat{\sigma}^2(X_i, D_i) = \frac{k}{k+1} \left( Y_i - \frac{1}{k} \sum_{l \in L_{ik}(x)} Y_l \right)^2,$$

where $L_{ik}(x)$ is the set of the $k$ closest units to unit $i$, in terms of the Euclidean norm in covariates, that have the same value of the treatment variable, and $V^{\Delta(X),T} = E\left[ \left( [\mu_1(X_i) - \mu_0(X_i)] - \Delta^T \right)^2 \mid D = 1 \right]$ is the variance of the conditional average treatment effect on the treated.

In an ideal situation, we would select $k$ in the kNN regression framework by minimizing the mean squared error of the average treatment effect on the treated (equation 1), that is,

$$\arg\min_k \; E\left( \hat{\Delta}^{BC} - \Delta^T \right)^2. \qquad (5)$$

However, the average treatment effect on the treated ($\Delta^T$) is not observed. Therefore, we follow arguments similar to Athey et al. (2015), proposing an unbiased function of potentially observable variables to define the ATET in program 5.

We have the following assumptions:

Assumption 1. Stable unit treatment value assumption (SUTVA).

Assumption 2. Ignorability (unconfoundedness): $D_i \perp Y_i(0) \mid X_i = x_i$.

Assumption 3. Overlap (matching): $0 \leq P(D_i = 1 \mid X_i = x_i) < 1$.

Remark 1. Assumption 1 implies no spillover, interaction or general equilibrium effects (Rubin, 1978). Assumption 2 establishes that treatment assignment ignores the untreated outcome given the control variables (Rosenbaum and Rubin, 1983; Rubin, 1978). Assumption 3 says that it is necessary to have overlap in subsamples of $X_i$; that is, for each treated individual it is necessary to have an analogous untreated individual (synthetic control) to identify the treatment effect on the treated (Rosenbaum and Rubin, 1983).

We assume a conditionally independent and identically distributed random sample drawn from a population, which implicitly implies Assumption 1 (SUTVA). Then, we define the individual treatment effect on the treated as

$$\Delta_i^{T*} = \frac{Y_i \left( D_i - P(X_i) \right)}{P[D = 1] \left( 1 - P(X_i) \right)}, \qquad (6)$$

where $P[D = 1]$ is the marginal probability of treatment, which is the same for every individual given random sampling, and $P(X_i) = P[D_i = 1 \mid X_i = x_i]$ is the conditional probability of treatment given the regressors $X_i$, that is, the propensity score (Rosenbaum and Rubin, 1983).

Proposition 1. Assuming a conditionally independent and identically distributed random sample such that Assumptions 2 and 3 are satisfied,

$$E\left[ \Delta_i^{T*} \mid P(X_i), X_i = x \right] = \Delta^T.$$

Proof. We omit conditioning on $P[D = 1]$ and $X_i = x$ to simplify notation.

$$
\begin{aligned}
E\left[ \Delta_i^{T*} \right] &= E\left[ \frac{Y_i (D_i - P(X_i))}{P[D=1](1 - P(X_i))} \right] \\
&= \frac{1}{P[D=1](1 - P(X_i))} \left\{ (1 - P(X_i)) E[D_i (Y_i(1) - Y_i(0))] + E[Y_i(0)(D_i - P(X_i))] \right\} \\
&= \frac{1}{P[D=1](1 - P(X_i))} \left\{ (1 - P(X_i)) E\left[ E\left[ D_i (Y_i(1) - Y_i(0)) \mid D_i \right] \right] + E[D_i] E[Y_i(0)] - P(X_i) E[Y_i(0)] \right\} \\
&= \frac{1}{P[D=1](1 - P(X_i))} \left\{ (1 - P(X_i)) P[D=1] E\left[ (Y_i(1) - Y_i(0)) \mid D_i = 1 \right] + E[Y_i(0)] \left( E[D_i] - P(X_i) \right) \right\} \\
&= E\left[ (Y_i(1) - Y_i(0)) \mid D_i = 1 \right].
\end{aligned}
$$

The first equality is by definition. The second follows after some simple algebra, taking into account that $Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)$ and $D_i Y_i = D_i Y_i(1) + D_i Y_i(0) - D_i Y_i(0)$ given $D_i \in \{0, 1\}$. The third is an application of the law of iterated expectations for the first term, and of Assumption 2 (ignorability) for the second term. The fourth equality again uses $D_i \in \{0, 1\}$ and $P[D_i = 1] = P[D = 1]$ (random sampling). Finally, we take into account that $E[D_i \mid X_i = x_i] = P[D_i = 1 \mid X_i = x_i] = P(X_i)$. Notice that Assumption 3 is required to have $P[D = 1](1 - P(X_i)) \neq 0$.
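The transformed outcome in equation 6 is straightforward to compute once propensity scores are available. The following minimal Python sketch (function names are ours, and the propensity scores are supplied by the user rather than fitted, for simplicity) makes the construction concrete:

```python
def transformed_outcome(y, d, p_hat, p_marg):
    """Sample version of Delta_i^{T*} (equation 6):
    Y_i (D_i - P(X_i)) / (P[D=1] (1 - P(X_i))).
    Overlap (Assumption 3, P(X_i) < 1) keeps the denominator positive."""
    return y * (d - p_hat) / (p_marg * (1.0 - p_hat))

def atet_reference(Y, D, p_hats):
    """Average of the transformed outcomes; by Proposition 1 this is a
    conditionally unbiased reference for the ATET. The marginal
    treatment probability is estimated by the sample share of treated."""
    p_marg = sum(D) / len(D)
    vals = [transformed_outcome(Y[i], D[i], p_hats[i], p_marg)
            for i in range(len(D))]
    return sum(vals) / len(vals)
```

For instance, with two units, $Y = (2, 1)$, $D = (1, 0)$ and a constant propensity of 0.5, the average transformed outcome equals the treated–untreated mean difference of 1, as Proposition 1 suggests.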
Remark 2. Observe that $E[Y_i(0) D_i] = P[D = 1] E[Y_i(0) \mid D_i = 1]$. Therefore, we also obtain the required statement if $E[Y_i(0) \mid D_i = 1] = E[Y_i(0) \mid D_i = 0] = E[Y_i(0)]$ conditional on $X_i = x_i$, that is, if the conditional mean assumption on the control group is satisfied (Angrist and Krueger, 1999, p. 1316): the outcome of untreated individuals does not determine participation.

We require estimators of $P[D = 1]$ and $P(X_i)$ to estimate $\Delta_i^{T*}$. We use $\frac{1}{N}\sum_{i=1}^{N} D_i$, with $N = N_0 + N_1$, as the estimator of the marginal probability of treatment, and a logit model to estimate the conditional probability of treatment. Obviously, these estimators can be changed. Notice that this introduces an extra source of variability, as we do not observe $P[D = 1]$ and $P(X_i)$ in observational data sets. This kind of issue is discussed by Athey et al. (2015), who point out that transformations like equation 6 do not optimally use all available information, which may mean extra variability. Additionally, Abadie and Imbens (2016) show that the ATE using matching based on the estimated propensity score is more efficient than matching based on the true propensity score. On the other hand, the ATET using the estimated propensity score can be more or less efficient compared with the true propensity score. We performed simulation exercises that show better inferential performance based on the estimated propensity score (available upon request from the authors).
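Pulling the pieces of this section together, a minimal Python sketch of the matching estimator (equation 2) and an Abadie–Imbens-style standard error may help fix ideas. The function names and the toy construction are ours, not the paper's; the conditional-variance term $V^{\Delta(X),T}$ and the bias correction are omitted for brevity:

```python
import math

def knn_atet(X, Y, D, k):
    """Matching estimator of the ATET (equation 2): for each treated unit,
    average the outcomes of its k nearest untreated neighbours (with
    replacement) and average the resulting differences."""
    treated = [i for i in range(len(D)) if D[i] == 1]
    untreated = [i for i in range(len(D)) if D[i] == 0]
    effects = []
    for i in treated:
        # A_ik(x): the k untreated units closest to i in Euclidean norm
        nbrs = sorted(untreated, key=lambda j: math.dist(X[i], X[j]))[:k]
        effects.append(Y[i] - sum(Y[j] for j in nbrs) / k)
    return sum(effects) / len(effects)

def knn_atet_se(X, Y, D, k):
    """Sketch of the matched-sample standard error: sigma^2(X_i, D_i) is
    estimated from the k nearest neighbours of unit i within its own
    treatment arm, and J_ik counts how often an untreated unit serves
    as a match (V^{Delta(X),T} is ignored here)."""
    n = len(D)
    treated = [i for i in range(n) if D[i] == 1]
    untreated = [i for i in range(n) if D[i] == 0]
    J = [0] * n
    for i in treated:
        for j in sorted(untreated, key=lambda j: math.dist(X[i], X[j]))[:k]:
            J[j] += 1
    var = 0.0
    for i in range(n):
        arm = treated if D[i] == 1 else untreated
        same = sorted((j for j in arm if j != i),
                      key=lambda j: math.dist(X[i], X[j]))[:k]
        ybar = sum(Y[j] for j in same) / len(same)
        sig2 = k / (k + 1) * (Y[i] - ybar) ** 2
        var += (D[i] + (1 - D[i]) * J[i] / k) ** 2 * sig2
    return math.sqrt(var) / len(treated)
```

The standard error divides $\sqrt{\sum_i (\cdot)^2 \hat{\sigma}^2}$ by $N_1$, which matches the $\sqrt{N_1}$ scaling of the asymptotic result above.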
In addition, as our ATET (ATE) estimator is asymptotically bias corrected, the mean squared error decomposition shows that variance explains most of it (75% on average), as expected; a remaining proportion (25% on average) is due to finite-sample bias.

Given that $\Delta_i^{T*}$ in equation 6 is a conditionally unbiased estimator of $\Delta^T$ (Proposition 1), we select $k$ in the kNN regression framework as

$$\arg\min_k \; \frac{1}{G} \sum_{g=1}^{G} \left( \hat{\Delta}_g^{BC} - \frac{1}{N_g} \sum_{i=1}^{N_g} \widehat{\Delta}_{i,g}^{T*} \right)^2, \qquad (7)$$

where

$$\widehat{\Delta}_{i,g}^{T*} = \frac{Y_{i,g}^{Test} \left( D_{i,g}^{Test} - \widehat{P}(X_{i,g}^{Test}) \right)}{\left( \frac{1}{N_g} \sum_{i=1}^{N_g} D_{i,g}^{Test} \right) \left( 1 - \widehat{P}(X_{i,g}^{Test}) \right)},$$

$$\hat{\Delta}_g^{TM} = \frac{1}{N_{1,g}} \sum_{i \in \{D_{i,g}^{Test} = 1\}} \left( Y_{i,g}^{Test}(1) - \frac{1}{k} \sum_{j \in A_{ik}(x_g^{Test})} Y_{j,g}^{Train}(0) \right),$$

$$\hat{B}_g^{TM} = \frac{1}{N_{1,g}} \sum_{i \in \{D_{i,g}^{Test} = 1\}} \left( \hat{\mu}_0(x_i)_g^{Test} - \frac{1}{k} \sum_{j \in A_{ik}(x_g^{Test})} \hat{\mu}_0(x_j)_g^{Train} \right),$$

and $\hat{\Delta}_g^{BC} = \hat{\Delta}_g^{TM} - \hat{B}_g^{TM}$, where $G$ is the number of groups, $N_g$ is the sample size of group $g$, and $N_{1,g}$ is the sample size of the treated units in group $g$, $g = 1, 2, \ldots, G$. The superscripts $Train$ and $Test$ refer to the training and test data sets (see below). Notice that $\widehat{\Delta}_{i,g}^{T*}$ is the sample version of the conditionally unbiased estimator $\Delta_i^{T*}$ of the ATET, so its average is a consistent estimator of the ATET. On the other hand, $\hat{\Delta}_g^{BC}$ is the bias-corrected version of the matching estimator (Abadie and Imbens, 2006).

kNN regression is a standard approach in the machine and statistical learning communities (Hastie et al., 2009, p. 14). So, we follow $G$-fold cross-validation to select the optimal $k$ (Hastie et al., 2009, p. 241). In particular, we randomly split the data set into $G$ roughly equal-sized groups such that we keep the proportion between treated and untreated individuals in each group, and then calculate $\frac{1}{N_g} \sum_{i=1}^{N_g} \widehat{\Delta}_{i,g}^{T*}$, $g = 1, 2, \ldots, G$. We fit the model using the $G - 1$ remaining groups given $k$, and identify the $k$ nearest neighbors in the control group of the training data set, in terms of the Euclidean norm, to each individual $i$ in the treated group of the $g$-th left-out group (the test data set), that is,

$$|A_{ik}(x_g^{Test})| = \sum_{j \in A_{ik}(x_g^{Test})} 1\{\|X_{i,g}^{Test} - X_j^{Train}\| < \|X_{i,g}^{Test} - X_l^{Train}\|\} = k.$$

Then we estimate $\frac{1}{k} \sum_{j \in A_{ik}(x_g^{Test})} Y_{j,g}^{Train}(0)$, and obtain $\hat{\Delta}_{i,g}^{TM} = Y_{i,g}^{Test}(1) - \frac{1}{k} \sum_{j \in A_{ik}(x_g^{Test})} Y_{j,g}^{Train}(0)$ and $\hat{B}_{i,g}^{TM} = \hat{\mu}_0(x_i)_g^{Test} - \frac{1}{k} \sum_{j \in A_{ik}(x_g^{Test})} \hat{\mu}_0(x_j)_g^{Train}$ for $i \in \{D_{i,g}^{Test} = 1\}$. So, we can calculate the mean squared error for the $g$-th left-out group. This is done for $g = 1, 2, \ldots, G$ and different values of $k$, such that we average over the $G$ prediction errors given $k$ and select the $k$ with the minimum average error (see program 7).

Given an optimal $k = k^*$, we use equation 4 to estimate the average treatment effect on the treated using the whole sample. As noted, this estimator converges in distribution to a normal distribution, so we use the asymptotic result to build confidence intervals. Algorithm A1 summarizes our methodological proposal.

Algorithm A1: Optimal $k$ in kNN to estimate the ATET

1. Calculate $\frac{1}{N} \sum_{i=1}^{N} D_i$.
2. For $i = 1, 2, \ldots, N$: fit a logit model where the dependent variable is $D_i$ to get $\widehat{P}(X_i)$, and calculate $\widehat{\Delta}_i^{T*}$, the sample version of equation 6.
3. Randomly split the data set into $G$ roughly equal-sized groups such that the proportion between treated and untreated units in each group is kept the same as in the original data set. Set the training data set as the $G - 1$ groups other than the $g$-th left-out group.
4. For $k = 1, 2, \ldots, K$ and $g = 1, 2, \ldots, G$:
   - Calculate $\frac{1}{N_g} \sum_{i=1}^{N_g} \widehat{\Delta}_{i,g}^{T*}$, where $N_g$ is the sample size of the $g$-th group (test data set).
   - Find the $k$ nearest neighbors of $X_{i,g}$ for treated individuals of the $g$-th group ($D_{i,g} = 1$ in the test data set) among the untreated individuals in the training data set ($D_j = 0$ in the remaining $G - 1$ groups).
   - Calculate $\frac{1}{k} \sum_{j \in A_{ik}(x_g^{Test})} Y_{j,g}^{Train}(0)$ and $\frac{1}{k} \sum_{j \in A_{ik}(x_g^{Test})} \hat{\mu}_0(x_j)_g^{Train}$.
   - Obtain $\hat{\Delta}_{i,g}^{TM} = Y_{i,g}^{Test}(1) - \frac{1}{k} \sum_{j} Y_{j,g}^{Train}(0)$ and $\hat{B}_{i,g}^{TM} = \hat{\mu}_0(x_i)_g^{Test} - \frac{1}{k} \sum_{j} \hat{\mu}_0(x_j)_g^{Train}$ for $i \in \{D_{i,g}^{Test} = 1\}$.
   - Estimate $\hat{\Delta}_g^{TM} = \frac{1}{N_{1,g}} \sum_{i} \hat{\Delta}_{i,g}^{TM}$, $\hat{B}_g^{TM} = \frac{1}{N_{1,g}} \sum_{i} \hat{B}_{i,g}^{TM}$ and $\hat{\Delta}_g^{BC} = \hat{\Delta}_g^{TM} - \hat{B}_g^{TM}$.
   - Calculate the mean squared error $\frac{1}{G} \sum_{g=1}^{G} \left( \hat{\Delta}_g^{BC} - \frac{1}{N_g} \sum_{i=1}^{N_g} \widehat{\Delta}_{i,g}^{T*} \right)^2$.
5. Select the $k$ that minimizes the mean squared error ($k^*$).
6. Given $k^*$ and the whole data set, find the $k^*$ nearest neighbors in the untreated group for each individual in the treated group, and calculate $\hat{\Delta}_i^{BC} = Y_i(1) - \frac{1}{k^*} \sum_{j \in A_{ik^*}(x)} Y_j(0)$.
7. Use $\hat{\Delta}^{BC}$ as a measure of central tendency of the ATET at the neighborhood level, and use Theorems 4 and 7 in Abadie and Imbens (2006) to build confidence intervals for the ATET.

3 Simulation exercises

We set exactly the same simulation setting as Otsu and Rai (2017), who propose a weighted bootstrap to perform inference for matching estimators of the ATE and ATET, since it is well known that the naive bootstrap produces invalid inference because it fails to reproduce the distribution of the number of times each unit is used as a match (Abadie and Imbens, 2008). We show the simulation setting for exposition purposes:

$$
\begin{aligned}
Y_i(1) &= \tau + m_j(\|X_i\|) + \epsilon_i, \\
Y_i(0) &= m_j(\|X_i\|) + \epsilon_i, \\
D_i &= 1\{P(X_i) \geq v_i\}, \quad v_i \sim U[0, 1], \\
P(X_i) &= \gamma_0 + \gamma_1 \|X_i\|, \quad X_i = [X_{i1}, \ldots, X_{iP}]^\top, \\
X_{ij} &= \psi_i |\zeta_{ij}| / \|\zeta_i\|, \quad j = 1, 2, \ldots, P, \\
\psi_i &\sim U[0, 1], \quad \zeta_i \sim N(0, I_P), \quad \epsilon_i \sim N(0, \sigma_\epsilon^2),
\end{aligned}
$$

where $\epsilon_i$, $v_i$, $\psi_i$ and $\zeta_i$ are mutually independent, $P = 5$, and the values of $\tau$, $\gamma_0$, $\gamma_1$ and $\sigma_\epsilon$ are set as in Otsu and Rai (2017). The ATET, as well as the ATE, equals $\tau$, and there are six different curves $m_j(\|X_i\|)$ (see Otsu and Rai (2017) for details).
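The continuous-outcome design above can be sketched in code as follows. The numeric values of `tau`, `gamma0`, `gamma1`, the error standard deviation, and the linear $m(\cdot)$ used here are illustrative placeholders, not the calibrated values of Otsu and Rai (2017):

```python
import math
import random

def simulate(n, tau=0.2, gamma0=0.1, gamma1=0.5, P=5, seed=0):
    """One draw from a DGP of the Otsu-Rai form used in the paper
    (continuous outcome). Note that ||X_i|| = psi_i by construction,
    since X_i = psi_i |zeta_i| / ||zeta_i||."""
    rng = random.Random(seed)
    X, Y, D = [], [], []
    for _ in range(n):
        zeta = [rng.gauss(0.0, 1.0) for _ in range(P)]
        norm_z = math.sqrt(sum(z * z for z in zeta))
        psi = rng.random()
        x = [psi * abs(z) / norm_z for z in zeta]
        m = 0.5 * psi                      # an illustrative linear m_j(.)
        eps = rng.gauss(0.0, 0.2)          # illustrative error scale
        p = gamma0 + gamma1 * psi          # propensity P(X_i)
        d = 1 if p >= rng.random() else 0  # D_i = 1{P(X_i) >= v_i}
        X.append(x)
        D.append(d)
        Y.append((tau + m + eps) if d == 1 else (m + eps))
    return X, Y, D
```

Because $\|X_i\| = \psi_i \in [0, 1]$, the propensity stays inside the unit interval for the placeholder values of $\gamma_0$ and $\gamma_1$, so the overlap assumption holds in this sketch.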
The six curves $m_j(\|X_i\|)$, $j = 1, 2, \ldots, 6$, range from a linear function of $\|X_i\|$, through combinations of linear and Gaussian-bump terms, to highly oscillatory specifications involving $\sin(8\|X_i\| - 5)$ and $\cos(30\|X_i\|)$ terms; see Otsu and Rai (2017) for the exact expressions.

In addition, we perform other simulation exercises where the outcome variable is binary. In particular,

$$Y_i(1) = \begin{cases} 1, & \tau + \beta m_j(\|X_i\|) + \epsilon_i > 0 \\ 0, & \tau + \beta m_j(\|X_i\|) + \epsilon_i \leq 0, \end{cases} \qquad Y_i(0) = \begin{cases} 1, & \beta m_j(\|X_i\|) + \epsilon_i > 0 \\ 0, & \beta m_j(\|X_i\|) + \epsilon_i \leq 0, \end{cases}$$

where $\epsilon_i \sim LG(0, 1)$, $LG$ denotes a logistic distribution, $\tau = 0.5$, and $P(D_i = 1) = P(\gamma_0 + \gamma_1 \|X_i\| > v_i) = F_{LG}(\gamma_0 + \gamma_1 \|X_i\|)$, with $v_i \sim LG(0, 1)$, $F_{LG}$ the logistic distribution function, and $\beta$, $\gamma_0$ and $\gamma_1$ fixed constants. Other components of the specification are as in the previous setting. Observe that in this setting $ATE_i = E(Y_i(1) - Y_i(0) \mid X_i = x_i) = P(Y_i(1) = 1 \mid X_i = x_i) - P(Y_i(0) = 1 \mid X_i = x_i) = F_{LG}(\tau + \beta m_j(\|X_i\|)) - F_{LG}(\beta m_j(\|X_i\|))$, which is the same analytical expression as for the ATET, but conditional on the relevant sample ($D_i = 1$).

Our results are based on a sample size equal to 100 (Otsu and Rai, 2017), with 1,000 replications for each simulation exercise. We performed other simulation exercises using larger sample sizes; outcomes show similar results and are available upon request from the authors. Results for the average treatment effects are in the Appendix, subsection 6.2; outcomes are similar to the ATET.

First, we present the results for the average treatment effect on the treated with the continuous outcome variable. Table 1 displays the mean relative squared errors (MRSE),

$$MRSE = \frac{1}{S_k} \sum_{s=1}^{S_k} \left( \frac{\hat{\Delta}_s^{BC}(k) - ATET}{\hat{\Delta}_s^{BC}(k^*) - ATET} \right)^2, \quad k^* \neq k, \; k = 1, 2, \ldots, 20,$$

where $\hat{\Delta}_s^{BC}(k)$ and $\hat{\Delta}_s^{BC}(k^*)$ are the ATET estimates using any $k$ and the optimal $k = k^*$, and $S_k$ is the number of replications in which $k$ was not optimal. We see in Table 1 that there are relatively large advantages to using our optimal $k^*$, as all figures are larger than 1.

Table 1: Mean relative squared errors, continuous outcome: average treatment effects on the treated, by $k$ and curve $m_1$–$m_6$. [Entries not recoverable in this version.]

Figure 1 shows the average over 1,000 replications of the ratios between the length of the 95% confidence intervals associated with any number of neighbors ($k$, x-axis) and our optimal selection. This is done for the six data generating settings ($m_j(\|X_i\|)$, $j = 1, 2, \ldots, 6$). Small $k$ implies relatively wide intervals, whereas large $k$ implies narrow intervals; this is expected, as small $k$ implies high variability compared with large $k$. Observe that interval lengths decrease very fast and stabilize around $k = 10$. On the other hand, we see in Figure 2 that the type I error from rejecting the null hypothesis $ATET = \tau$, at a 5% significance level, is lowest using $k = 1$. Using the $k$ that minimizes the mean squared error gives a relatively small type I error compared with other $k$ selections.

Figure 1: Relative 95% confidence interval lengths, continuous outcome: small $k$ implies relatively wide intervals, large $k$ narrow intervals. The dotted line is the optimal selection of $k$ minimizing the mean squared error; by construction it has a ratio equal to 1.

Figure 2: Type I errors (5% nominal size), continuous outcome: average treatment effects on the treated. The optimal $k$ gives a relatively small type I error compared with other $k$ selections; $k = 1$ always attains the best nominal size.

Figure 3 displays the histograms of the optimal $k$ in our six simulation settings. We see that the optimal selection is dominated by small $k$. This could be evidence to support $k = 1$, which is widely used in many applications, although values of $k$ slightly above 1 appear most relevant, and large values of $k$ cannot be discarded, especially with large sample sizes (we performed other simulation exercises that show this).

Figure 3: Optimal $k$ frequencies, continuous outcome: average treatment effects on the treated. Small $k$ dominates the optimal choice, although large $k$ cannot be discarded.

Results of the ATET for our simulation setting with a binary outcome are qualitatively similar to the continuous outcome. Table 2 shows the mean relative squared errors. Again, all figures are larger than 1, which means that using a non-optimal $k$ implies larger estimation errors.

Table 2: Mean relative squared errors, binary outcome: average treatment effects on the treated, by $k$ and curve $m_1$–$m_6$. [Entries not recoverable in this version.]

Figure 4 shows the relative 95% confidence interval length ratios associated with $k$ and $k^*$. All six simulation settings look very similar, with wider confidence intervals when $k$ is small and narrower intervals when $k$ is large; there is stabilization around $k = 10$. Figure 5 shows type I errors using a 5% significance level. Larger $k$ implies larger errors, and using $k^*$ implies small type I errors in general, except when $k$ is very small. The small type I error can be explained by means of Figure 6, where we see that the optimal selection of $k$ is again concentrated on small values.

Figure 4: Relative 95% confidence interval lengths, binary outcome: small $k$ implies relatively wide intervals, large $k$ narrow intervals. The dotted line is the optimal $k$ minimizing the mean squared error; by construction it has a ratio equal to 1.

In general, we observe that our proposal implies relatively smaller estimation errors, achieving a compromise between confidence interval length and type I error, with a small optimal $k$ most of the time.
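The selection rule exercised in these simulations, equation 7, can be sketched in code as follows. This is a simplified version of Algorithm A1 (names are ours, and the bias-correction term is omitted):

```python
import math
import random

def select_k(X, Y, D, p_hats, G=5, k_max=10, seed=0):
    """G-fold cross-validated choice of k (equation 7), without the
    bias-correction term of Algorithm A1: for each fold, compare the
    matching estimate on the test fold (neighbours taken from the
    training folds) with the fold average of the transformed outcomes."""
    n = len(D)
    p_marg = sum(D) / n
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[g::G] for g in range(G)]
    best_k, best_mse = None, float("inf")
    for k in range(1, k_max + 1):
        errs = []
        for g in range(G):
            test = folds[g]
            in_test = set(test)
            controls = [j for j in idx if j not in in_test and D[j] == 0]
            treated = [i for i in test if D[i] == 1]
            if len(controls) < k or not treated:
                continue
            # matching estimate of the ATET on the test fold
            est = 0.0
            for i in treated:
                nbrs = sorted(controls, key=lambda j: math.dist(X[i], X[j]))[:k]
                est += Y[i] - sum(Y[j] for j in nbrs) / k
            est /= len(treated)
            # fold average of the transformed outcomes (equation 6)
            ref = sum(Y[i] * (D[i] - p_hats[i]) / (p_marg * (1 - p_hats[i]))
                      for i in test) / len(test)
            errs.append((est - ref) ** 2)
        if not errs:
            continue
        mse = sum(errs) / len(errs)
        if mse < best_mse:
            best_k, best_mse = k, mse
    return best_k
```

A full implementation would also fit a logit for `p_hats`, estimate $\hat{\mu}_0(\cdot)$ for the bias correction, and keep the treated/untreated proportion fixed across folds, as Algorithm A1 prescribes.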
Figure 5: Type I errors (5% nominal size), binary outcome: average treatment effects on the treated. The optimal $k$ gives a relatively small type I error compared with other $k$ selections; $k = 1$ always attains the best nominal size.

Figure 6: Optimal $k$ frequencies, binary outcome: average treatment effects on the treated. Small $k$ dominates the optimal choice, although large $k$ cannot be discarded.

4 Application: 401(k) enrollment

To illustrate our proposal, we estimate the average treatment effect on the treated of 401(k) participation on accumulated net financial assets. A 401(k) is a retirement plan in which savings contributions are deducted by employers from employees' pay before taxes, so that taxes on capital gains are avoided. Employees must meet some eligibility criteria, so the plan does not have universal applicability in principle. The impact evaluation of 401(k) plans on net financial assets has been carried out previously by Benjamin (2003); Chernozhukov et al. (2017); Chernozhukov and Hansen (2004); Conley et al. (2012), among others. We overcome the lack of random assignment following the arguments of Chernozhukov et al. (2017): enrollment in a 401(k) plan is exogenous when controlling for observable variables which drive employment decisions, such as income, when the plan initially became available. Our data set is from the 1991 Survey of Income and Program Participation and comprises 9,915 observations.

We control for the same set of variables as Chernozhukov and Hansen (2004): age (linear and squared), education (years), household size, income (seven levels), and binary variables indicating defined-benefit pension, home ownership, marital status (married), participation in an IRA (individual retirement account), and two-earner status (both household heads contribute to household income). Details can be found in Benjamin (2003).

Table 3 shows unconditional mean difference tests for our outcome variables: accumulated net financial assets, and a binary variable indicating positive accumulated net financial assets.

Table 3: Descriptive statistics, outcome variables
Response   | Mean treated    | Mean untreated  | Statistic
Continuous | $38.26K (79.09) | $10.89K (55.26) | 16.28**
Binary     | 0.86 (0.35)     | 0.55 (0.50)     | 35.27**
Standard errors in parentheses. The test of the difference of two proportions uses a normally distributed statistic $z = (\hat{p}_1 - \hat{p}_2)/\sqrt{\hat{p}_p(1 - \hat{p}_p)(1/n_1 + 1/n_2)}$, with pooled proportion $\hat{p}_p = (x_1 + x_2)/(n_1 + n_2)$, where $x_l$ and $n_l$ are the number of successes and the sample size in each group, $l = 1, 2$. The test of the difference of two means uses a Student's $t$ calculated as $t = (\bar{x}_1 - \bar{x}_2)/\sqrt{s_1^2/n_1 + s_2^2/n_2}$. Null hypothesis: means (proportions) are equal. The critical value at the 5% significance level is 1.96. ** Rejection of the null hypothesis at the 5% significance level.

Employees enrolled in 401(k) plans have $27.4K more accumulated net financial assets on average than employees who are not enrolled, and their unconditional probability of positive net financial assets is 31 percentage points higher on average. Both unconditional differences are statistically significant.

Figure 7 shows the ATET on accumulated net financial assets using our optimal selection, $k^* = 19$, with $k = 1$ as a reference. We see similar ATET point estimates of approximately $15K (similar outcomes are found by Conley et al. (2012) under the exogeneity assumption, using a Bayesian approach for a local average treatment effect, LATE).
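The unconditional tests described in the footnote of Table 3 can be sketched as follows (function names are ours):

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic, as in the Table 3 footnote:
    z = (p1 - p2) / sqrt(pp (1 - pp) (1/n1 + 1/n2)),
    with pooled proportion pp = (x1 + x2) / (n1 + n2)."""
    p1, p2 = x1 / n1, x2 / n2
    pp = (x1 + x2) / (n1 + n2)
    return (p1 - p2) / math.sqrt(pp * (1 - pp) * (1 / n1 + 1 / n2))

def two_mean_t(m1, s1, n1, m2, s2, n2):
    """Unequal-variance two-sample t statistic, as in the Table 3
    footnote: t = (xbar1 - xbar2) / sqrt(s1^2/n1 + s2^2/n2)."""
    return (m1 - m2) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
```

Comparing $|z|$ or $|t|$ with 1.96 gives the 5% two-sided decision rule used in Table 3.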
However, the 95% confidence interval using $k^*$ is ($10.9K, $18.9K), whereas the $k = 1$ interval is ($9.1K, $19.9K), that is, 35% wider. Results regarding the probability of positive accumulated net financial assets can be seen in Figure 8. Both $k^* = 19$ and $k = 1$ indicate statistically significant ATETs, whose point estimates are approximately 19.9% and 18.7%, respectively. The 95% confidence intervals are (16.3%, 21.3%) and (17.7%, 21.3%) for $k = 1$ and $k^* = 19$; that is, using $k = 1$ implies a 38.9% wider interval than using $k^* = 19$. Observe that both exercises indicate the same optimal $k$, which means that the diagnostic balance tests are based on the same control groups. Figure 9 shows the love plot, indicating that the control groups based on $k^* = 19$ satisfy mean balancing conditions.

Figure 7: Average treatment effects on the treated of 401(k) participation on accumulated net financial assets: point estimates with $k = 1$ (gray dot) and optimal $k^*$ (black dot), and 95% confidence intervals. Both $k = 1$ and $k^*$ indicate a 5% significant positive ATET of approximately $15K; the 95% confidence interval using $k = 1$ is wider than that using $k^*$.

Figure 8: Average treatment effects on the treated of 401(k) participation on the probability of positive accumulated net financial assets: point estimates with $k = 1$ (gray dot) and optimal $k^*$ (black dot), and 95% confidence intervals. Both indicate a 5% significant positive ATET of approximately 19%; the 95% confidence interval using $k = 1$ is wider than that using $k^*$.

Figure 9: Mean balancing conditions for the control variables used in calculating the ATET: love plot using $k^* = 19$. Mean balancing conditions appear to be satisfied, as the statistical tests (black dots) are inside the critical levels (dashed red lines).

5 Concluding remarks

We present a simple approach to optimally select the number of control units ($k$) used to build the "synthetic" or "counterfactual" unit for each individual in the $k$ nearest neighbors algorithm. Our approach is based on a simple unbiased reference estimator (the individual treatment effect on the treated for the ATET, and the individual treatment effect for the ATE), such that we select the $k$ minimizing the mean squared error with respect to this reference estimator. The optimal $k = k^*$ thereby achieves a balance between confidence interval length and type I error.

Our application suggests that $k^*$ can be relatively large compared with the common practice of using $k = 1$, which implies significant reductions in confidence interval lengths, although both choices show 5% statistically significant effects of 401(k) enrollment on the probability of positive accumulated net financial assets and on its amount.

A final remark is that using as the reference estimator for the treatment effects the one obtained with $k = 1$ is an interesting alternative. We obtained the same qualitative results, that is, a balance between interval length and type I error, and a considerable reduction in MRSE. Simulation exercises confirm this.

References

Abadie, A. and Imbens, G. (2006). Large sample properties of matching estimators for average treatment effects.
Econometrica, 74(1):235–267.

Abadie, A. and Imbens, G. (2008). On the failure of the bootstrap for matching estimators. Econometrica, 76(6):1537–1557.

Abadie, A. and Imbens, G. (2011). Bias-corrected matching estimators for average treatment effects. Journal of Business & Economic Statistics, 29(1):1–11.

Abadie, A. and Imbens, G. (2016). Matching on the estimated propensity score. Econometrica, 84(2):781–807.

Angrist, J. and Krueger, A. (1999). Empirical strategies in labor economics. In Ashenfelter, O. and Card, D., editors, Handbook of Labor Economics, volume 3. Elsevier.

Athey, S., Imbens, G., and Ramachandra, V. (2015). Machine learning methods for estimating heterogeneous causal effects. Technical report, Stanford University.

Benjamin, D. (2003). Does 401(k) eligibility increase savings? Evidence from propensity score subclassification. Journal of Public Economics, 87:1259–1290.

Cameron, C. and Trivedi, P. (2005). Microeconometrics: Methods and Applications. Cambridge University Press.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2017). Double/debiased machine learning for treatment and causal parameters. The Econometrics Journal, 21:1–68.

Chernozhukov, V. and Hansen, C. (2004). The effects of 401(k) participation on the wealth distribution: an instrumental quantile regression analysis. Review of Economics and Statistics, 86(3):735–751.

Conley, T., Hansen, C., and Rossi, P. (2012). Plausibly exogenous. The Review of Economics and Statistics, 94(1):260–272.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, second edition.

Otsu, T. and Rai, Y. (2017). Bootstrap inference of matching estimators for average treatment effects. Journal of the American Statistical Association, 112(520):1720–1732.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.

Rubin, D. (1978). Bayesian inference for causal effects: The role of randomization. Annals of Statistics, 6:34–58.
Appendix: Optimal k in kNN: Average treatment effects

The average treatment effect is

\[ \Delta^{ATE} = E\left[\left(Y_i(1) - Y_i(0)\right) \mid X_i = x_i\right]. \]

A matching estimator for the ATE is given by

\[ \hat{\Delta}^{ATE} = \frac{1}{N}\sum_{i=1}^{N}\left[\hat{Y}_i(1) - \hat{Y}_i(0)\right], \]

where

\[ \hat{Y}_i(1) = \begin{cases} \frac{1}{k}\sum_{j\in A_{ik}(x)} Y_j(1), & D_i = 0 \\ Y_i, & D_i = 1 \end{cases}, \qquad \hat{Y}_i(0) = \begin{cases} Y_i, & D_i = 0 \\ \frac{1}{k}\sum_{j\in A_{ik}(x)} Y_j(0), & D_i = 1 \end{cases}, \]

and A_{ik}(x) defines the relevant control group conditional on being treated or untreated.

The bias of the ATE is

\[ B^{ATE} = \frac{1}{N}\left\{\sum_{i\in\{D_i=1\}}\left[\mu_0(x_i) - \frac{1}{k}\sum_{j\in A_{ik}(x)}\mu_0(x_j)\right] - \sum_{i\in\{D_i=0\}}\left[\mu_1(x_i) - \frac{1}{k}\sum_{j\in A_{ik}(x)}\mu_1(x_j)\right]\right\}, \]

where \mu_0(x_i) = E[Y \mid X = x_i, D = 0] and \mu_1(x_i) = E[Y \mid X = x_i, D = 1]. These can be estimated non-parametrically using a series expansion estimator to obtain \hat{B}^{ATE} (Abadie and Imbens, 2011).

Therefore, the bias-corrected matching estimator is

\[ \hat{\Delta}^{BC,ATE} = \hat{\Delta}^{ATE} - \hat{B}^{ATE}. \]

Theorem 4(i) in Abadie and Imbens (2006) shows asymptotic distributional convergence results for the bias-corrected average treatment effect, and Section 3.2 in their paper has the required variance expressions (marginal and conditional).

The individual treatment effect is
\[ \Delta^{ATE*}_i = \frac{Y_i\left(D_i - P(X_i)\right)}{P(X_i)\left(1 - P(X_i)\right)}, \tag{8} \]

such that E\left[\Delta^{ATE*}_i \mid X_i = x\right] = \Delta^{ATE} (Athey et al., 2015).

Therefore, the optimal selection of k is given by solving the program

\[ \operatorname*{arg\,min}_{k} \frac{1}{G}\sum_{g=1}^{G}\left(\hat{\Delta}^{BC,ATE}_{g} - \frac{1}{N_g}\sum_{i=1}^{N_g}\widehat{\Delta}^{ATE*}_{i,g}\right)^2, \]

where

\[ \widehat{\Delta}^{ATE*}_{i,g} = \frac{Y^{Test}_{i,g}\left(D^{Test}_{i,g} - \widehat{P}(X^{Test}_{i,g})\right)}{\widehat{P}(X^{Test}_{i,g})\left(1 - \widehat{P}(X^{Test}_{i,g})\right)}, \]

\[ \hat{\Delta}^{ATE}_{g} = \frac{1}{N_g}\left\{\sum_{i\in\{D^{Test}_{i,g}=1\}}\left[Y^{Test}_{i,g}(1) - \frac{1}{k}\sum_{j\in A_{ik}(x^{Test}_g)} Y^{Train}_{j,g}(0)\right] - \sum_{i\in\{D^{Test}_{i,g}=0\}}\left[Y^{Test}_{i,g}(0) - \frac{1}{k}\sum_{j\in A_{ik}(x^{Test}_g)} Y^{Train}_{j,g}(1)\right]\right\}, \]

\[ \hat{B}^{ATE}_{g} = \frac{1}{N_g}\left\{\sum_{i\in\{D^{Test}_{i,g}=1\}}\left[\hat{\mu}_0(x^{Test}_{i,g}) - \frac{1}{k}\sum_{j\in A_{ik}(x^{Test}_g)}\hat{\mu}_0(x^{Train}_{j,g})\right] - \sum_{i\in\{D^{Test}_{i,g}=0\}}\left[\hat{\mu}_1(x^{Test}_{i,g}) - \frac{1}{k}\sum_{j\in A_{ik}(x^{Test}_g)}\hat{\mu}_1(x^{Train}_{j,g})\right]\right\}, \]

and \hat{\Delta}^{BC,ATE}_{g} = \hat{\Delta}^{ATE}_{g} - \hat{B}^{ATE}_{g}.

An algorithm similar to Algorithm A1 can be implemented to obtain an optimal k for the ATE given these definitions.

Figure 10: 95% confidence interval length ratio, continuous outcome: average treatment effects. Small k implies relatively wide intervals, whereas large k implies narrow intervals. The dotted line is the optimal selection of k minimizing mean squared error; by construction it has a ratio equal to 1.

Figure 11: Type I errors, continuous outcome: average treatment effects. The optimal k gives a relatively small average type I error compared with other k selections; k = 1 always gets the best nominal size.

Figure 12: 95% confidence interval length ratio, binary outcome: average treatment effects. Small k implies relatively wide intervals, whereas large k implies narrow intervals. The dotted line is the optimal selection of k minimizing mean squared error; by construction it has a ratio equal to 1.

Figure 13: Type I errors (5% nominal size), binary outcome: average treatment effects. The optimal k gives a relatively small type I error compared with other k selections; k = 1 always gets the best nominal size.

Table 4: Mean relative squared errors, continuous outcome: average treatment effects. [Table values not recoverable.] Note:

\[ MRSE = \frac{1}{S_k}\sum_{s=1}^{S_k}\left(\frac{\hat{\Delta}^{BC}_s(k) - ATE}{\hat{\Delta}^{BC}_s(k^*) - ATE}\right)^2, \quad k^* \neq k, \]

where \hat{\Delta}^{BC}_s(k) and \hat{\Delta}^{BC}_s(k^*) are the ATE estimates using any k and the optimal k = k^*, and S_k is the number of times that k was not optimal. It seems that there are relatively large advantages from using our optimal k^*.

Table 5: Mean relative squared errors, binary outcome: average treatment effects. [Table values not recoverable.] The MRSE is defined as in Table 4. It seems that there are relatively large advantages from using an optimal k.
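The appendix's recipe can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it uses plain Euclidean kNN imputation, the transformed outcome of equation (8), and a G-fold search over k in which each test fold is matched against the training folds. For brevity the bias-correction term is omitted and the propensity score is taken as known; all function names are illustrative.

```python
import numpy as np

def knn_impute(y_pool, x_pool, x_query, k):
    """Mean outcome of the k nearest (Euclidean) pool units per query row."""
    out = np.empty(len(x_query))
    for i, xi in enumerate(x_query):
        idx = np.argsort(np.linalg.norm(x_pool - xi, axis=1))[:k]
        out[i] = y_pool[idx].mean()
    return out

def knn_match_ate(y, d, x, y_pool, d_pool, x_pool, k):
    """kNN matching ATE (no bias correction): each unit's missing potential
    outcome is imputed from the opposite treatment arm of the donor pool."""
    y1 = np.where(d == 1, y, knn_impute(y_pool[d_pool == 1], x_pool[d_pool == 1], x, k))
    y0 = np.where(d == 0, y, knn_impute(y_pool[d_pool == 0], x_pool[d_pool == 0], x, k))
    return float(np.mean(y1 - y0))

def transformed_outcome(y, d, pscore):
    """Transformed outcome whose conditional expectation equals the ATE."""
    return y * (d - pscore) / (pscore * (1.0 - pscore))

def choose_k(y, d, x, pscore, ks, n_folds=5, seed=0):
    """Pick k minimizing the average squared gap between the fold-level
    matching estimate (test matched to train) and the fold mean of the
    transformed outcome, mirroring the program in the appendix."""
    folds = np.random.default_rng(seed).integers(0, n_folds, size=len(y))
    losses = {}
    for k in ks:
        loss = 0.0
        for g in range(n_folds):
            te, tr = folds == g, folds != g
            ate_g = knn_match_ate(y[te], d[te], x[te], y[tr], d[tr], x[tr], k)
            target_g = transformed_outcome(y[te], d[te], pscore[te]).mean()
            loss += (ate_g - target_g) ** 2
        losses[k] = loss / n_folds
    return min(losses, key=losses.get)

# Toy check on simulated, randomized data (true ATE = 2, known p-score 0.5).
rng = np.random.default_rng(1)
n = 400
x = rng.normal(size=(n, 2))
p = np.full(n, 0.5)
d = rng.binomial(1, p)
y = x[:, 0] + 2.0 * d + rng.normal(scale=0.5, size=n)
k_star = choose_k(y, d, x, p, ks=range(1, 16))
ate_hat = knn_match_ate(y, d, x, y, d, x, k_star)
```

In the toy check the final estimate matches the full sample against itself (each arm donates to the other), so the recovered ATE should be close to the true value of 2 for any reasonable k; the cross-validated k^* simply picks the bias-variance balance automatically.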