Spatial Differencing for Sample Selection Models with Unobserved Heterogeneity
Alexander Klein∗  Guy Tchuente†
University of Kent
June 2020
Abstract
This paper derives identification, estimation and inference results using spatial differencing in sample selection models with unobserved heterogeneity. We show that under the assumption of smooth changes across space of the unobserved sub-location specific heterogeneities and inverse Mills ratio, key parameters of a sample selection model are identified. The smoothness of the sub-location specific heterogeneities implies a correlation in the outcomes. We assume that the correlation is restricted within a location or cluster and derive asymptotic results showing that as the number of independent clusters increases, the estimators are consistent and asymptotically normal. We also propose a formula for standard error estimation. A Monte-Carlo experiment illustrates the small sample properties of our estimator. The application of our procedure to estimate the determinants of the municipality tax rate in Finland shows the importance of accounting for unobserved heterogeneity.

∗ School of Economics, e-mail: [email protected]
† Corresponding author: School of Economics, University of Kent, e-mail: [email protected]. Address: Kennedy Building, Park Wood Road, Canterbury, Kent, CT2 7FS. Tel: +44 1227 827249.

Keywords:
Sample selection, Spatial difference, Unobserved heterogeneity.
In linear models, spatial differencing has been used to deal with unobserved omitted variables. The availability of geographical locations has allowed empirical papers to take advantage of the spatial dimension of the data and control for various forms of unobserved heterogeneity (e.g. Duranton, Gobillon, and Overman (2011), Black (1999) or Holmes (1998)). In general, spatial differencing offers an identification strategy in situations where researchers face cross-sectional data with unobserved heterogeneity and lack suitable instrumental variables. This paper extends spatial differencing to a model with sample selection.

For economists, the question of omitted variables is a serious concern in the context of nonexperimental data. The solution is straightforward when omitted variables are simply the result of not including all relevant variables for which data exist: we add such variables to the model to avoid the bias induced by their omission. When omitted variables are unobserved, researchers have essentially three options: they can use (i) proxies, (ii) instrumental variables, or (iii) differencing of the data across time or space.

Proxies reduce the bias if they manage to capture the effect of the omitted variables such that what remains is uncorrelated with the error term. However, it is often the case that a proxy is imperfect: it may still be related to the unobserved heterogeneity, or to the error term if it turns out to be endogenous, or it may be irrelevant after controlling for observed covariates. In such cases, the inclusion of the proxy will not solve the bias problem and may even exacerbate it. The second solution, using a valid set of instruments, may help alleviate the bias. However, as discussed in Todd and Wolpin (2003), the "quasi-experimental" local average treatment effect (LATE) obtained in the instrumental variable model may not correspond to the ceteris paribus effect and thus may not correspond to the deep structural parameter of interest.
Lastly, panel data sets allow researchers to control for unobserved heterogeneity. They help identify the causal effect when, for example, time-constant unobserved heterogeneity might cause an endogeneity problem and strong instruments satisfying exclusion restrictions cannot be found. However, there might be situations when such data sets are not available.

Our paper is a contribution to the literature identifying and estimating model parameters in the presence of unobserved omitted variables. We propose an identification strategy based on spatial differencing. As discussed above, this approach has been used in the context of linear regressions. However, little is known about its performance in non-linear models. We extend spatial differencing in this direction, specifically to the case of cross-section data with sample selection. We show that under justifiable assumptions on the smoothness of the unobserved heterogeneity (i.e. spatially close individuals have similar unobserved heterogeneity and the derivatives of their inverse Mills ratios are similar), spatial differencing eliminates the unobserved effects even in the presence of a nonlinear element, in our case the inverse Mills ratio. The parameters of interest of our sample selection model are estimated using the standard two-step approach of Heckman (1974, 1979). We derive asymptotic properties and propose a correction of the standard errors accounting for the two-step nature of our estimation and for spatial differencing. The asymptotic behavior of the estimator reveals important properties of spatial differencing that researchers need to be cautious about.

See Todd and Wolpin (2003) for details on the use of proxies and Oster (2019) for a rigorous treatment of the evaluation of robustness to omitted variables.
The new estimator and the standard errors correction are easy to implement.

The intuition for the model of sample selection with unobserved spatial heterogeneity that we consider in this paper can be described as follows. Suppose we have cross-sectional data on municipalities which are organized into larger geographical units called regions, and which have the authority to set the levels of local taxation. Municipality tax rates must be at least as high as the threshold set by the central government. As a result, municipalities self-select into those with tax rates at the threshold and those above it. We are interested in what determines the municipalities' tax rates. The tax rate will depend on various socio-economic characteristics, such as the age composition of the population and income, but also on amenities. These can depend on the region where the municipalities are located: for example, regions with natural landscapes might have a different level and composition of amenities than regions without them. We can control for them with region-specific dummies. However, there can be considerable unobserved heterogeneity at the municipality level. Controlling for that with municipality-specific dummies might not be an option, since we may quickly run out of degrees of freedom. Therefore, we face the problem of a self-selected cross-sectional sample with unobserved heterogeneity which we cannot fully control with dummies, and which has two spatial dimensions: a high-level one which we call locations (in our example, regions), and a low-level one which we call sub-locations (in our example, municipalities). Spatial differencing will eliminate the sub-location specific unobserved heterogeneity. It will, at the same time, also induce a correlation in the error terms.
We take that correlation into account, and derive the asymptotic properties of our estimator using arguments similar to those used in the derivation of the asymptotic behavior of clustered standard errors: the number of locations goes to infinity and the size of each location is assumed random and bounded almost surely. We find that this result also extends to a linear model without sample selection. This has important implications that researchers need to be cautious about. Indeed, the consistency of the estimator applied to the spatially differenced data requires (i) a large number of locations and (ii) a limited number of individuals in each location. Monte Carlo simulations also suggest that it is better if the number of individuals in sub-locations is small as well. Before we continue, note that locations in our model are equivalent to clusters, and we use 'location' and 'cluster' interchangeably.

Since our estimator is derived for a clustered sample with unobserved heterogeneity, this paper contributes to the literature on selection correction in panel data. In this literature, the main challenge is the presence of individual-specific unobserved heterogeneity in both the outcome and the selection equations. The existing solutions are based on either a full model specification or on a differencing procedure. Wooldridge (1995) uses a Mundlak approach to specify the individual-specific unobserved heterogeneity in both equations. He also imposes a special functional form on the selection mechanism. Kyriazidou (1997), on the other hand, does not impose strong restrictions on the functional form of the selection equation and uses a nonparametric approach to difference out the unobserved fixed effect. Rochina-Barrachina (1999) similarly relies on differencing to identify the parameters of the model, but she also imposes additional distributional assumptions on the selection equation.
Even if our problem has similarities with the panel data selection correction literature, the main difference is that we observe a clustered cross-section. In each cluster, there is a finer common sub-location specific unobserved heterogeneity shared by some individuals in that cluster. This heterogeneity, however, is different from the cluster- and individual-specific ones studied in panel data models and implies a different cluster asymptotic. Since the outcomes of individuals are not independent in our model, while they are in the panel data case, our asymptotic results are derived using a large number of clusters asymptotic with heterogeneous, random and bounded cluster size.

The clustered dependence created by the finer sub-location specific unobserved heterogeneity relates our asymptotic discussion to the papers dealing with clustering at the variance level (see Wooldridge (2010) for a textbook treatment). The asymptotics in that literature are derived using either a large or a fixed number of clusters. A fixed number of clusters leads to non-normal asymptotics; a discussion of recent contributions can be found in Hansen and Lee (2019). A large number of clusters asymptotic was first derived by White (1984) and has been investigated by several authors allowing either fixed cluster size or heterogeneous clusters. Recent developments include Hansen and Lee (2019), who propose conditions on the relation between the cluster sample sizes and the full sample in a regular asymptotic, and Djogbenou, MacKinnon, and Nielsen (2019), who derive asymptotics with varying cluster sizes and carry out a cluster wild bootstrap. Our results complement this literature by extending the cluster asymptotic to a sample selection model.

We present an empirical application of our new estimator. We examine the determinants of tax rates across four hundred and eleven Finnish municipalities spread across nineteen Finnish regions.
In 1999, the Finnish central government decided to raise the lower bound of the tax rate municipalities could set from 0.2% to 0.5%. This created a sample selection mechanism which resulted in more than half of the municipalities opting for 0.5% while the rest charged a higher tax rate. We use our spatial differencing estimator to control for the unobserved municipality effect, which can be correlated with the error term, thus creating an endogeneity problem and rendering the standard sample selection estimator biased and inconsistent. Our results clearly show the presence of unobserved heterogeneity across municipalities, the limitations of using only region dummies (nineteen in our case) to fully control for the municipalities' unobserved heterogeneity, and the importance of spatial differencing to control for it.

The structure of the paper is as follows. First, we extend the spatial differencing method from the case of the linear regression model to the case of sample selection. Then we discuss identification assumptions, propose an estimation procedure, and derive the estimator of the corrected standard errors. Lastly, we conduct Monte Carlo simulations and present an empirical application of our estimator.

In many economic applications, we are interested in estimating the following regression equation:

$$y_{ij} = x_{ij}'\delta + \gamma_j + \gamma_{j\alpha} + \varepsilon_{ij} \qquad (1)$$

where $x_{ij}$ is a vector of exogenous control variables, $\gamma_j$ is a location fixed effect, $\gamma_{j\alpha}$ is a sub-location specific effect for sub-location $\alpha$, which is at a finer spatial scale than location $j$, and $\varepsilon_{ij}$ is the error term.

Examples of applications of this model can be found in the estimation of the fertilizer effect on wheat crop yields in farms growing multiple crops, or the effect of local taxation on the growth of firms. Crop yields depend on the soil quality of location $j$ (e.g. a village), but also on the sub-location specific soil composition (e.g. a farm in the village); see Collins, Alva, Boydston, Cochran, Hamm, McGuire, and Riga (2006).
Similarly, the impact of local taxation on the growth of firms may vary by county but also by sub-locations such as neighborhoods, as in Duranton, Gobillon, and Overman (2011). We can control for $\gamma_j$ with location dummy variables. However, they might not be enough to capture all unobserved heterogeneity related to location $j$, as there can be considerable heterogeneity at the finer spatial scale of sub-locations: using the example above, the firms are located in various neighborhoods $\alpha$ which are sub-locations of location $j$. Furthermore, the standard location fixed effect $\gamma_j$ relies upon an arbitrary specification of the comparison neighborhood group, as pointed out by Gibbons and Machin (2003), making it an imperfect control for the sub-location specific effect $\gamma_{j\alpha}$. If $\gamma_{j\alpha}$ is correlated with $x_{ij}$, the OLS estimate of $\delta$ will be biased. In the absence of suitable instrumental variables for $x_{ij}$, spatial differencing offers a solution by differencing out the unobserved sub-location specific effects $\gamma_{j\alpha}$.

Duranton, Gobillon, and Overman (2011), Black (1999) and Holmes (1998) use spatial differencing in the case of linear models to solve endogeneity problems arising from the unobserved sub-location effect $\gamma_{j\alpha}$. They take advantage of the fact that for sufficiently small distances between sub-locations, the specific effect $\gamma_{j\alpha}$ changes smoothly across space, thus allowing them to difference it out. This corresponds to the following assumption.

The sub-location specific component, $\gamma_{j\alpha}$, is a simplification of $\gamma_{j\alpha_i}$. We are implicitly assuming that the sub-location specific effects are the same for all its individuals.

Assumption I1:
The sub-location specific unobservable effect is homogeneous in a neighborhood of the individual, i.e. $\Delta_d \gamma_{j\alpha} = 0$ for $d$ small enough.

In several economic models, in addition to the sub-location specific fixed effects $\gamma_{j\alpha}$, the outcome of interest is not observed for a selected sub-sample. The selection can be the result of a decision of the individuals or of the researcher. The presence of sample selection introduces nonlinearity into model (1).

We specify the model with sample selection as follows. Consider two latent dependent variables $y_{1ij}^*$ and $y_{2ij}^*$ in a cross-section which follow a regular linear model for individual $i$ in location $j$:

$$y_{1ij}^* = z_{ij}'\beta + \theta_{j\alpha} + \theta_j + \varepsilon_{1ij} \quad \text{(selection equation)},$$
$$y_{2ij}^* = x_{ij}'\delta + \gamma_{j\alpha} + \gamma_j + \varepsilon_{2ij} \quad \text{(outcome equation)}.$$

The individual error terms are $\varepsilon_{1ij}$ and $\varepsilon_{2ij}$; $\theta_{j\alpha}$ and $\gamma_{j\alpha}$ are sub-location specific effects for a sub-location $\alpha$ in location $j$, affecting the selection and the outcome equation respectively. The exogenous characteristics $x_{ij}$ affect the outcome. They could be correlated with $\gamma_{j\alpha} + \gamma_j$ but not with $\varepsilon_{1ij}$ and $\varepsilon_{2ij}$. The variables $z_{ij}$ are exogenous variables determining selection; they can contain a subset of $x_{ij}$. However, for identification purposes, some elements of $z_{ij}$ are assumed to be absent from $x_{ij}$.

Assumption I2: $\varepsilon_{1ij}$ and $\varepsilon_{2ij}$ are independent identically distributed normal random variables for all $i, j$.

The outcome is modelled in the form of a truncated sample selection model and is represented by equation (2):

$$y_{ij} = \begin{cases} y_{2ij}^* & \text{if } y_{1ij}^* > 0 \\ \text{not observed} & \text{if } y_{1ij}^* \leq 0 \end{cases} \qquad (2)$$

Condition 1:
$\mathrm{Cov}[z_{ij},\, \theta_{j\alpha} + \theta_j + \varepsilon_{1ij}] = 0$; $z_{ij}$ is exogenous.

Condition 2: $\mathrm{Cov}[x_{ij},\, \gamma_{j\alpha} + \gamma_j + \varepsilon_{2ij}] = 0$; $x_{ij}$ is exogenous.

Condition 3: the errors $(\varepsilon_{1ij}, \varepsilon_{2ij})$ satisfy $\varepsilon_{2ij} = \rho \times \varepsilon_{1ij} + v_{ij}$ with $\varepsilon_{1ij} \sim N(0,1)$ and $v_{ij}$ independent of $\varepsilon_{1ij}$.

It is possible to consistently estimate $\delta$ by Tobit regression under these three conditions. In most applications, Conditions 1 and 2 are unlikely to hold because there is a possibility that, within a location, there could be a sub-location specific omitted variable affecting both the outcome and some observed characteristics of interest. Thus, it is possible that
$\mathrm{Cov}[z_{ij},\, \theta_{j\alpha} + \theta_j] \neq 0$ and $\mathrm{Cov}[x_{ij},\, \gamma_{j\alpha} + \gamma_j] \neq 0$. The standard way to deal with the correlation between $x_{ij}$ and $\gamma_{j\alpha}$ would be to find a suitable instrument for $x_{ij}$ and run an IV Tobit or an IV two-stage Heckit.

The very local nature of the sub-location specific effect means that it is not always easy to find a variable correlated with $x_{ij}$ and uncorrelated with $\gamma_{j\alpha}$. The exclusion restriction is likely to be violated and IV two-stage Heckit will yield inconsistent estimates of $\delta$. Another option is to use finer location fixed effects and estimate the model using the classic Heckman two-stage procedure, but in practice this will lead to a proliferation of variables and a loss of degrees of freedom.

Identification requires an exclusion restriction, i.e. a variable that affects $y_{1ij}^*$ but not $y_{2ij}^*$. Otherwise, identification relies on the nonlinearity of the inverse Mills ratio.

This section investigates the application of the spatial differencing technique to the case of cross-section sample selection models. We denote by $\Delta_d$ a spatial difference operator. One example is the pair-wise difference operator, which takes the difference between each observation and another observation located at a distance less than $d$ from that observation. In a location $j$, with individuals $i$ and $k$ who are neighbours, the pair-wise differencing of a variable $A$ is:

$$\Delta_d A = A_{ij} - A_{kj}.$$

Another example is the difference between the individual outcome and the average outcome of his/her neighbourhood $N_{id}$. This operator is similar to the neighbourhood fixed effect operator, the difference being that the neighbourhoods can overlap. We call this operator the fixed-effect difference operator. Let $N_{id} = \{k \text{ in neighbourhood } d\}$ and let $N_d$ be the sample size of $N_{id}$; the differencing is given by:

$$\Delta_{df} A = A_{ij} - \frac{1}{N_d}\sum_{k \in N_{id}} A_{kj}.$$

A further possibility is to use a kernel, as in Kyriazidou (1997), to weight the neighbours in $N_{id}$ according to how far they are in terms of observable characteristics. This operator is the kernel difference operator:

$$\Delta_{dK} A = A_{ij} - \sum_{k \in N_{id}} \psi(i,k)\, A_{kj},$$

where $\psi(i,k) = \frac{1}{h_{N_d}} K\!\left(\frac{(z_{ij}' - z_{kj}')\beta + (x_{ij}' - x_{kj}')\delta}{h_{N_d}}\right)$, $K$ is a kernel density function and $h_{N_d}$ is a sequence of bandwidths. To illustrate our identification strategy and for the asymptotic derivation, we use the pairwise spatial difference operator, while for the empirical application and for the Monte Carlo simulations, the fixed-effect difference is used.

For the spatial difference operator $\Delta_d$, $\Delta_d y_{ij} = y_{ij} - y_{kj}$ with $k$ an observation in the neighborhood $d$ of $i$. Let $\xi_{ij} \equiv \{x_{ij}, z_{ij}, y_{1ij}^* > 0, \gamma_{id}, \theta_{id}\}$ with $\gamma_{id} = \{\gamma_{kj} \text{ with } k \in N_{id} \cup \{i\}\}$ and $\theta_{id} = \{\theta_{kj} \text{ with } k \in N_{id} \cup \{i\}\}$. Then

$$E[\Delta_d y_{ij} \mid \xi_{ij}, \xi_{kj}] = E[y_{ij} - y_{kj} \mid \xi_{ij}, \xi_{kj}] \qquad (3)$$
$$= E[y_{ij} \mid \xi_{ij}] - E[y_{kj} \mid \xi_{kj}] \qquad (4)$$
$$= x_{ij}'\delta + \gamma_{j\alpha_i} + \gamma_j + \rho\,\lambda(z_{ij}'\beta + \theta_{j\alpha_i} + \theta_j) - \left[x_{kj}'\delta + \gamma_{j\alpha_k} + \gamma_j + \rho\,\lambda(z_{kj}'\beta + \theta_{j\alpha_k} + \theta_j)\right] \qquad (5)$$
$$= \Delta_d x_{ij}'\delta + \Delta_d \gamma_{j\alpha} + \rho\,\Delta_d \lambda(z_{ij}'\beta + \theta_{j\alpha} + \theta_j) \qquad (6)$$

where $\lambda(c) = \phi(c)/\Phi(c)$ is the inverse Mills ratio, while $\phi(c)$ and $\Phi(c)$ are respectively the density and the distribution function of a normal random variable with mean zero and variance 1.

To go from Equation (3) to Equation (4) we use the linearity of the expectation and the mean independence of $y_{ij}$ and $y_{kj}^*$ conditional on $\xi_{ij}$, as well as the mean independence of $y_{kj}$ and $y_{ij}^*$ conditional on $\xi_{kj}$, since we have assumed in Assumption I2 that $\varepsilon_{1ij}$ and $\varepsilon_{2ij}$ are iid. The separation of the conditioning sets $\xi_{ij}$ and $\xi_{kj}$ is possible because we are working with cross-sectional data. Such a separation of the conditioning set is not possible for panel data. Indeed, in the context of panel data with individual effects and sample selection, when differencing is used to remove the fixed effects, the conditioning set cannot be separated as we have done to move from Equation (3) to (4).
For example, Kyriazidou (1997) has to impose a "conditional exchangeability" assumption that is conditioned on the variables related to the two periods used in differencing. In the case of models with censoring, Lee (2001) discusses conditions under which first-differencing can be applied, and applies the linear implication of the "conditional exchangeability" assumption. In a similar context using first differences, Rochina-Barrachina (1999) imposes joint normality between the difference in the errors of the outcome equation and the errors in the selection equation in the two time periods.

Estimating equation (6) presents two challenges for the identification of the parameter of interest $\delta$ and the sample selection parameter $\rho$: the sub-location specific difference $\Delta_d \gamma_{j\alpha}$, and the sample selection term $\rho\,\Delta_d \lambda(z_{ij}'\beta + \theta_{j\alpha} + \theta_j)$. As for the sub-location specific difference $\Delta_d \gamma_{j\alpha}$, under Assumptions I1 and I2, equation (6) becomes

$$E[\Delta_d y_{ij} \mid \xi_{ij}, \xi_{kj}] = \Delta_d x_{ij}'\delta + \rho\,\Delta_d \lambda(z_{ij}'\beta + \theta_{j\alpha} + \theta_j). \qquad (7)$$

These assumptions allow us to difference out the sub-location specific unobserved effect $\gamma_{j\alpha}$, a strategy that was applied by Duranton, Gobillon, and Overman (2011).

As for the sample selection term $\rho\,\Delta_d \lambda(z_{ij}'\beta + \theta_{j\alpha} + \theta_j)$, we see that it depends on the unobservable sub-location specific and location effects $\theta_{j\alpha} + \theta_j$. Because the sample selection term is a nonlinear function, simple spatial differencing will not always work, unlike in the case of $\gamma_{j\alpha}$. Therefore, the following assumption helps us to deal with this challenge:

See Dustmann and Rochina-Barrachina (2007) for a review of selection correction in panel data models.

Assumption I3:

(i) The sub-location specific unobservable selection effect is homogeneous in a neighborhood of the individual, i.e. $\Delta_d \theta_{j\alpha} = 0$ for $d$ small enough.

(ii) The changes in the inverse Mills ratio in a neighborhood of the individual satisfy
$$\frac{\lambda(z_{ij}'\beta + \theta_{j\alpha_i} + \theta_j) - \lambda(z_{ij}'\beta)}{\theta_{j\alpha_i} + \theta_j} = \lambda'(c_i) = \lambda'(c_k) = \frac{\lambda(z_{kj}'\beta + \theta_{j\alpha_k} + \theta_j) - \lambda(z_{kj}'\beta)}{\theta_{j\alpha_k} + \theta_j} \qquad (8)$$

for $i$ and $k$ in a neighborhood $d$ small enough, with $\theta_{j\alpha_i} + \theta_j$ and $\theta_{j\alpha_k} + \theta_j$ both different from 0, where $\lambda'(\cdot)$ is the first derivative of the inverse Mills ratio, and $c_i$ and $c_k$ lie, respectively, in the intervals $[z_{ij}'\beta,\, z_{ij}'\beta + \theta_{j\alpha_i} + \theta_j]$ and $[z_{kj}'\beta,\, z_{kj}'\beta + \theta_{j\alpha_k} + \theta_j]$ such that Equation (8) holds.

Assumption I3 (i) is similar to Assumption I1. It seems plausible that if that assumption holds for the outcome equation, it will hold for the selection equation as well.

Assumption I3 (ii) is novel and one of the contributions of this paper. It assumes that if the exact Taylor expansion is applied to the individual inverse Mills ratio for individuals $i$ and $k$ in location $j$, the intermediate points $c_i$ and $c_k$ should be similar. If the level of nonlinearity of $\lambda(\cdot)$ is low, then the assumption will also hold. In the extreme case of local linearity of the inverse Mills ratio, Assumption I3 (ii) holds exactly.

The combination of Assumptions I3 (i) and I3 (ii) implies that

$$\lambda(z_{ij}'\beta + \theta_{j\alpha_i} + \theta_j) - \lambda(z_{ij}'\beta) = \lambda(z_{kj}'\beta + \theta_{j\alpha_k} + \theta_j) - \lambda(z_{kj}'\beta).$$

Thus, $\Delta_d \lambda(z_{ij}'\beta) = \Delta_d \lambda(z_{ij}'\beta + \theta_{j\alpha} + \theta_j)$.

Theorem 1.
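Assumption I3 (ii) rests on the mean-value form of the inverse Mills ratio, whose derivative has the well-known closed form $\lambda'(c) = -\lambda(c)\,[c + \lambda(c)]$. A quick numerical check (our own, with arbitrarily chosen evaluation points) of the closed form and of the difference-quotient step behind the assumption:

```python
import numpy as np
from scipy.stats import norm

def inv_mills(c):
    # lambda(c) = phi(c) / Phi(c)
    return norm.pdf(c) / norm.cdf(c)

def inv_mills_prime(c):
    # Closed form of the derivative: lambda'(c) = -lambda(c) * (c + lambda(c))
    lam = inv_mills(c)
    return -lam * (c + lam)

c, h = 0.7, 1e-6
fd = (inv_mills(c + h) - inv_mills(c - h)) / (2 * h)  # central finite difference
print(fd, inv_mills_prime(c))                          # the two agree closely

# Mean-value step behind Assumption I3(ii): for a small shift theta,
# [lambda(c + theta) - lambda(c)] / theta = lambda'(c_i) for some c_i in [c, c + theta].
theta = 0.05
quotient = (inv_mills(c + theta) - inv_mills(c)) / theta
print(quotient, inv_mills_prime(c))                    # close when theta is small
```

The second pair of numbers illustrates why the assumption is mild when the fixed effects $\theta_{j\alpha} + \theta_j$ are small: the difference quotient is then pinned close to $\lambda'$ at the observed index.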
Let us consider the sample selection model presented in Equation (2). Under Assumptions I1 to I3, the parameters $\delta$ and $\rho$ are identified.

Proof of Theorem 1
We have already shown that under Assumptions I1 and I2 we can obtain Equation (7). Applying Assumption I3 to Equation (7) leads to the following equation:

$$E[\Delta_d y_{ij} \mid \xi_{ij}, \xi_{kj}] = \Delta_d x_{ij}'\delta + \rho\,\Delta_d \lambda(z_{ij}'\beta). \qquad (9)$$

Thus, Assumptions I1 to I3 are sufficient for the identification of $\delta$ and $\rho$.

We have derived the results using the pairwise spatial difference operator. However, the identification result holds for other spatial difference operators as well. In the case of the average or kernel difference operator, the conditioning in Equation (9) is on $\xi_{kj}$ with $k \in N_{id}$ for the average difference operator, and on $k$ in the full sample for the kernel operator. Note that under Assumptions I1 and I3, any difference of a weighted average in a neighborhood of the individual will enable us to remove the sub-location specific effect. The conditional expectation presented in Equation (9) depends on exogenous observable variables and the parameters of interest.

2.2 Estimation and Asymptotic Properties

In this section, we present an estimation procedure and derive asymptotic properties of the proposed estimator. The estimation procedure involves two steps. In the first step, a probit model is estimated and the inverse Mills ratio predicted. In the second step, a spatial difference operator differences out both the location and the sub-location specific unobserved heterogeneity. The model is then estimated using an ordinary least squares estimator. With a sample of $N$ individuals, the estimation procedure is as follows:

Step 1:
Estimate $\beta$ by probit with location effects $\theta_j$, and calculate $\hat\lambda_i = \lambda(z_{ij}'\hat\beta)$.
Estimate $\delta$ and $\rho$ in the OLS regression

$$\Delta_d y_{ij} = \Delta_d x_{ij}'\delta + \rho\,\Delta_d \lambda(z_{ij}'\hat\beta) + w_{ikj}. \qquad (10)$$

Since we used spatial differencing and $\lambda(z_{ij}'\hat\beta)$ is estimated in the first step, a particular structure of the variance-covariance matrix emerges. Therefore, we also need to derive the correct estimator of the standard errors, which we do in Section 2.3.

We will now show that the estimator obtained by the above procedure is consistent and asymptotically normal. To derive the asymptotic properties we use arguments similar to those used to derive the asymptotic properties of clustered standard errors. Specifically, the population size of each location is assumed random and bounded almost surely, and the law of large numbers is applied by letting the number of locations (clusters in the case of clustered standard errors) go to infinity.

We consider a generic matrix of spatial differences $\Delta$. The matrix form of Equation (10) can be expressed as

$$\Delta y = \Delta x'\delta + \rho\,\Delta \lambda(z'\hat\beta) + \Delta \eta \qquad (11)$$

where the $\eta_{ij}$ are the same errors as in standard sample selection models. Let us denote $\theta = (\delta, \rho)'$ and $W = [x', \lambda(z'\hat\beta)]$. The simplified estimation Equation (11) is $\Delta y = \Delta W \theta + \Delta \eta$, and the OLS estimator of $\theta$ is

$$\hat\theta = [(\Delta W)'\Delta W]^{-1}[(\Delta W)'\Delta y]. \qquad (12)$$

The spatial nature of the data implies that an observation $k$ with $n$ neighbours may appear in several pairs. This induces correlation in the error term $\Delta\eta$ for all $n$ of these pairs because of the spatial differencing in the second step of the estimation procedure. As a result, a particular structure of the covariance matrix emerges, and we need to take it into account when calculating the standard errors.

To proceed further, we need to introduce the assumptions under which the asymptotic properties of our estimator are derived.

Assumption E1:
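The two steps can be sketched as follows. This is our own illustrative implementation on a simulated toy sample, not the authors' code: the probit likelihood is hand-rolled, a single location-level effect stands in for $\gamma_j + \gamma_{j\alpha}$, the first-step probit omits location dummies, and consecutive within-location pairs of selected observations stand in for distance-based neighbours.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)

# --- Toy clustered data (hypothetical DGP): many locations, few individuals ---
J, m = 80, 4
n = J * m
loc = np.repeat(np.arange(J), m)
z = rng.standard_normal(n)                  # excluded instrument for selection
x = rng.standard_normal(n)
gamma = 0.2 * rng.standard_normal(J)[loc]   # unobserved effect, shared within location
e1 = rng.standard_normal(n)
e2 = 0.5 * e1 + np.sqrt(1 - 0.5**2) * rng.standard_normal(n)
sel = (z + gamma + e1) > 0                  # selection equation
y = 2.0 * x + gamma + e2                    # outcome, used only where sel is True

# --- Step 1: probit of sel on (1, z); hand-rolled negative log-likelihood ---
Z = np.column_stack([np.ones(n), z])
def negll(b):
    q = 2 * sel - 1                         # +1 if selected, -1 otherwise
    return -np.sum(norm.logcdf(q * (Z @ b)))
beta_hat = minimize(negll, np.zeros(2), method="BFGS").x
lam = norm.pdf(Z @ beta_hat) / norm.cdf(Z @ beta_hat)   # inverse Mills ratio

# --- Step 2: difference selected observations within a location, then OLS ---
left, right = [], []
for j in range(J):
    idx = np.where((loc == j) & sel)[0]
    left += list(idx[:-1]); right += list(idx[1:])      # consecutive pairs
left, right = np.array(left), np.array(right)
dy = y[left] - y[right]                                  # gamma differences out
dW = np.column_stack([x[left] - x[right], lam[left] - lam[right]])
theta_hat = np.linalg.lstsq(dW, dy, rcond=None)[0]
print(theta_hat)  # delta_hat should be near 2; rho_hat is noisy at this sample size
```

Note that the overlapping pairs built in Step 2 are exactly what induces the correlation across pairs discussed above, which is why OLS standard errors from this regression would be wrong without the correction of Section 2.3.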
The sample is formed of $N$ individuals from the population.

(i) We observe $\{x_{ij}, z_{ij}\}$, independent and identically distributed random variables with $i = 1, \ldots, N$ and $j = 1, \ldots, J$.

(ii) The number of individuals in a location $j$, $N_j$, is exogenous, random, identically distributed with $N_j < n$ almost surely and $E(N_j) < \infty$, where $n$ is a scalar.

(iii) The outcomes and the latent variables are independent across locations, i.e. for $j \neq j'$ the variables $y_{ij} \perp y_{i'j'}$ and $y_{ij}^* \perp y_{i'j'}^*$.

The variables without subscripts represent vectors or matrices of all observations in the sample. We use the notation that $\lambda(z'\hat\beta)$ is a vector with typical element $\lambda(z_{ij}'\hat\beta)$.

An implication of Assumption E1 (i) in conjunction with Assumption I2 is that $\theta_j$ and $\gamma_j$ are iid. However, within a location $j$, there is a certain level of correlation among individuals which operates through $\theta_{j\alpha_i}$ or $\gamma_{j\alpha_i}$. This means that our assumptions restrict how that within-location individual correlation occurs.

Assumption E1 (ii) restricts the location size to be bounded and implies that the number of locations has to grow to achieve a large sample size in our asymptotic calculation. This assumption is similar to those used in the literature on cluster sample asymptotics, and it leads to a "large number of clusters" asymptotic theory similar to the one discussed in Wooldridge (2010), who assumes fixed cluster size. This assumption corresponds to a specific case of Assumption 1 in Hansen and Lee (2019), who allow for different cluster sizes ranging from fixed to infinite. We have, however, derived the asymptotics of our estimator under the more restrictive condition of Assumption E1 (ii). The reason is that it can be proven that under a joint asymptotic ($N, J \to \infty$), Assumption 1 is equivalent to assuming that the size of the sample in each location is bounded.
If we instead allow for a sequential asymptotic where the number of locations is fixed and the sample size goes to infinity, then there exists at least one location with an infinite number of individuals and the inequality used in the proof of Hansen and Lee (2019)'s Theorem 1 becomes invalid.

To better illustrate our argument, let us consider the location sample size proposed by Hansen and Lee (2019): $N_j = N^\alpha$ with $0 \leq \alpha < 1$; we can show that $1 - \alpha = \ln(J)/\ln(N)$. If we allow for a joint asymptotic, $\alpha$ is not defined. If, on the contrary, we assume that the number of locations $J$ is fixed, then $\alpha$ goes to 1. In both cases, relying on Hansen and Lee (2019)'s Assumption 1 does not seem enough to warrant the desired asymptotic regularities.

Assumption E2: $z'$ and $W$ are of full column rank, with each element having finite moments up to the 4th moment.

Theorem 2.
We consider the sample selection model presented in Equation (2), under Assumptions I1 to I3, E1 and E2.

(i) $\hat\theta \to_p \theta$ as $N \to \infty$.

(ii) $\sqrt{N}(\hat\theta - \theta) \to_d N(0, \Theta)$ with $\Theta = C\,\Gamma\,C'$, where $C^{-1} = E\big((\Delta W_{ij})'\Delta W_{ij}\big)$,

$$\Gamma = \rho^2 E\big[(\Delta W_{ij})'\Omega_{ij}\Delta W_{ij}\big] + E\big[(\Delta W_{ij})'\Delta e_{ij}\,\Delta e_{ij}'(\Delta W_{ij})\big],$$

and $\Omega_{ij} = [\lambda'(z_{ij}'\beta)]^2\, z_{ij}' V_\beta z_{ij}$, taking $V_\beta$ as the first-step probit variance-covariance matrix.

Proof of Theorem 2:
In the appendix.

It is important to notice that the same type of asymptotics should be used in a linear model. In this respect, we complement Duranton, Gobillon, and Overman (2011), who propose a correction for the standard errors but do not discuss the asymptotic properties of their estimators. Similarly, Black (1999) and Holmes (1998) use spatial differencing but do not account for the fact that differencing will lead to a correlation between pairs in which an individual is present. Our asymptotic derivations do account for the presence of correlation between pairs, and are valid not only for a model with but also without sample selection (in our model, the absence of selection implies $\rho = 0$). They also have important practical implications: the consistency of the estimator requires a large number of locations $\gamma_j$ and a small number of individuals in each sub-location $\gamma_{j\alpha}$.

This section derives a procedure to estimate the variance-covariance matrix of the estimator in Equation (12), which has a particular structure arising from (i) spatial differencing and (ii) Heckman's two-step estimation procedure.

We consider $B = [(\Delta W)'\Delta W]^{-1}$ and $\Sigma = \mathrm{Var}[(\Delta W)'\Delta\eta]$ such that the conditional variance-covariance matrix of $\hat\theta$ is

$$\mathrm{Var}(\hat\theta) = B\,\Sigma\,B'.$$

Note that $\Sigma = (\Delta W)'\,\mathrm{Var}(\Delta\eta)\,(\Delta W)$. This means that we need a consistent estimator of $\mathrm{Var}(\Delta\eta)$ to compute correct standard errors for $\hat\theta$.

Let us consider that $\mathrm{Var}(\Delta\eta) = V_1 + V_2$ with

$$V_1 = \Delta\,\mathrm{Var}(e)\,\Delta' = \rho^2\,\Delta R\,\Delta'$$

where $R$ is a diagonal matrix of dimension $N$ (the total number of observations), with $d_{ij} = -\lambda(z_{ij}'\beta)[z_{ij}'\beta + \lambda(z_{ij}'\beta)]$ as the diagonal elements, and

$$V_2 = \rho^2\,\Delta\,\mathrm{Var}\big[\lambda(z'\hat\beta) - \lambda(z'\beta)\big]\,\Delta' = \rho^2\,\Delta D z V_\beta z' D \Delta'$$

where $D$ is the square diagonal matrix of dimension $N$ with $1 - d_{ij}$ as the diagonal elements, $z$ is the data matrix of the selection equation, and $V_\beta$ is the variance-covariance estimate from the probit estimation of the selection equation.

Theorem 3.
We consider the sample selection model presented in Equation 2. Under Assumptions I1 to I3, E1 and E2, the variance-covariance estimator of $\hat\theta$ is given by
$$V_{twostep} = B\,(\Delta W)'\,[\hat V_1 + \hat V_2]\,(\Delta W)\,B' \qquad (13)$$
where $\hat V_1 = \hat\rho^2\,\Delta\hat R\,\Delta'$ and $\hat V_2 = \hat\rho^2\,\Delta\hat D z \hat V_\beta z' \hat D\,\Delta'$, with all unknown parameters replaced by their estimates. Moreover, this is a consistent estimator of $Var(\hat\theta)$.

Proof of Theorem 3:
The result holds by construction.
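To make the estimator in Theorem 3 concrete, the following is a minimal NumPy sketch of Equation (13) under our reading of the matrices $R$ and $D$ above. The differencing operator $\Delta$, the differenced regressors $\Delta W$, the selection-equation data $z$, and the first-step probit estimates ($\hat\beta$, $V_\beta$, $\hat\rho$) are taken as given; all function names here are ours, not the paper's.

```python
import numpy as np
from scipy.stats import norm

def inverse_mills(u):
    """Inverse Mills ratio: lambda(u) = phi(u) / Phi(u)."""
    return norm.pdf(u) / norm.cdf(u)

def twostep_vcov(DW, Delta, z, beta_hat, V_beta, rho_hat):
    """Sketch of Equation (13): V = B (DW)' [V1 + V2] (DW) B'.

    DW     : differenced regressors Delta @ W   (pairs x k)
    Delta  : spatial differencing operator      (pairs x N)
    z      : selection-equation regressors      (N x q)
    """
    u = z @ beta_hat
    lam = inverse_mills(u)
    d = lam * (u + lam)            # d_ij = lambda(u)[u + lambda(u)], lies in (0, 1)
    R = np.diag(1.0 - d)           # diagonal of R-hat
    D = np.diag(-d)                # lambda'(u) = -d_ij on the diagonal of D-hat
    V1 = rho_hat**2 * Delta @ R @ Delta.T
    V2 = rho_hat**2 * Delta @ (D @ z @ V_beta @ z.T @ D) @ Delta.T
    B = np.linalg.inv(DW.T @ DW)
    return B @ (DW.T @ (V1 + V2) @ DW) @ B.T
```

Since $0 < d_{ij} < 1$, both $V_1$ and $V_2$ are positive semi-definite by construction, so the diagonal of the returned matrix is non-negative and its square roots can be reported as standard errors.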
In this section we present the results of Monte Carlo simulations to (i) describe the behavior of the estimator proposed in this paper and (ii) offer empirical guidance for applied research. Regarding the latter, we pay close attention to the implication of Assumption E1(ii), according to which it is important to have a large number of locations relative to the number of individuals in the sub-locations. The Monte Carlo experiments offer empirical guidance as to when the number of locations is large enough.

The estimator developed in this paper is referred to as the "Sub-location Differencing" estimator, and it accounts for the sub-location specific effect $\gamma_{j\alpha}$. To highlight its features, we compare it to two other estimators. One ignores the presence of both $\gamma_j$ and $\gamma_{j\alpha}$ and applies a simple two-step estimator with no spatial differencing; we call it the "No-Differencing" estimator. The other accounts only for the location fixed effect $\gamma_j$; we call it the "Location Differencing" estimator. For each estimator, the mean bias and the coverage rate of the 95% confidence level test are reported in Tables I to III.

The data are obtained using the following data generating process. We assume that there are $J = 20$, $30$, $100$ non-overlapping locations, each location is divided into $s = 2$, $4$, $8$ sub-locations, and there are $n_j = 3$, $5$, $8$, $10$ individuals sharing the same sub-location. The latent variables are
$$y^*_{1ij} = z_{ij}\beta + \theta_{js} + \theta_j + \varepsilon_{1ij} \quad \text{and} \quad y^*_{2ij} = x_{ij}\delta + \gamma_{js} + \gamma_j + \varepsilon_{2ij},$$
where $\theta_{js} = 10 - j \times s$ and $\gamma_{js} = 5\,j \times s$ are the sub-location specific effects, while $\theta_j = 10 - j$ and $\gamma_j = 10\,j$ are the location effects; for all $i$ and $j$, $x_{ij} \sim N(0,1)$ and $z_{ij} \sim U(0,1)$, each drawn independently; $\delta = 1$ and $\beta = 0.2$. The error terms in both equations are generated, for all $i$ and $j$, as $\varepsilon_{1ij} \sim N(0,1)$ and $\varepsilon_{2ij} = \rho\,\varepsilon_{1ij} + v_{ij}$, where $v_{ij} \sim N(0,1)$ is independent of $\varepsilon_{1ij}$ and $\rho = 0.$

There is room for improvement concerning our inference strategy. Cluster-robust inference is part of a large and growing literature, and our work gives some insight as to how differencing can be used in cross-sectional data. Future work will investigate the importance of heteroscedasticity, and small-sample procedures such as the bootstrap will be used to improve inference.
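The data generating process above can be sketched directly. Note that the value of $\rho$ used below is an illustrative assumption, since its exact value is not restated here:

```python
import numpy as np

def simulate_dgp(J=20, s=2, n=3, rho=0.5, seed=0):
    """One draw from the Monte Carlo design: J locations, s sub-locations
    per location, n individuals per sub-location. rho = 0.5 is our
    illustrative choice, not a value taken from the design."""
    rng = np.random.default_rng(seed)
    beta, delta = 0.2, 1.0
    data = []
    for j in range(1, J + 1):
        theta_j, gamma_j = 10 - j, 10 * j               # location effects
        for a in range(1, s + 1):
            theta_ja, gamma_ja = 10 - j * a, 5 * j * a  # sub-location effects
            for _ in range(n):
                x, z = rng.normal(), rng.uniform()
                eps1 = rng.normal()
                eps2 = rho * eps1 + rng.normal()        # correlated errors
                selected = int(z * beta + theta_ja + theta_j + eps1 > 0)
                outcome = x * delta + gamma_ja + gamma_j + eps2
                data.append((j, a, x, z, selected, outcome))
    return data
```

The outcome is treated as observed only when the selection indicator equals one; for example, `simulate_dgp(J=100, s=2, n=3)` gives a 600-individual sample.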
Table I: Mean bias and coverage rate of the 95% test, J = 20 locations

Numb. of sub-locations  Sub-location size  Estimator                  Mean bias  Coverage rate
2                       3                  No-Differencing            -0.079     95.8
                                           Location Differencing      -0.393     74.2
                                           Sub-location Differencing  -0.039     82.2
2                       5                  No-Differencing            -0.344     94.5
                                           Location Differencing      -0.953     82.1
                                           Sub-location Differencing   0.011     87.0
2                       8                  No-Differencing            -0.173     94.4
                                           Location Differencing      -1.307     90.7
                                           Sub-location Differencing   0.016     79.5
2                       10                 No-Differencing             0.233     95.8
                                           Location Differencing      -2.264     93.6
                                           Sub-location Differencing   0.067     93.6
4                       3                  No-Differencing             0.369     95.3
                                           Location Differencing       0.117     88.1
                                           Sub-location Differencing   0.001     75.5
4                       5                  No-Differencing             0.996     96.0
                                           Location Differencing       2.798     92.7
                                           Sub-location Differencing  -0.019     82.5
4                       8                  No-Differencing             0.085     94.5
                                           Location Differencing       5.824     95.2
                                           Sub-location Differencing  -0.035     80.5
4                       10                 No-Differencing            -0.241     95.6
                                           Location Differencing      -2.494     96.1
                                           Sub-location Differencing  -0.066     88.2
8                       3                  No-Differencing             0.833     94.7
                                           Location Differencing      -0.310     94.9
                                           Sub-location Differencing  -0.005     66.3
8                       5                  No-Differencing            -0.176     93.6
                                           Location Differencing       1.678     96.8
                                           Sub-location Differencing   0.011     76.8
8                       8                  No-Differencing             0.013     95.4
                                           Location Differencing      -0.269     99.0
                                           Sub-location Differencing   0.006     83.3
8                       10                 No-Differencing             0.271     95.2
                                           Location Differencing      -2.573     99.3
                                           Sub-location Differencing  -0.016     86.1
Table II: Mean bias and coverage rate of the 95% test, J = 30 locations

Numb. of sub-locations  Sub-location size  Estimator                  Mean bias  Coverage rate
2                       3                  No-Differencing            -0.392     95.3
                                           Location Differencing       0.497     75.1
                                           Sub-location Differencing  -0.006     78.4
2                       5                  No-Differencing             0.095     95.2
                                           Location Differencing      -0.965     84.5
                                           Sub-location Differencing   0.007     84.8
2                       8                  No-Differencing            -0.108     94.8
                                           Location Differencing       1.330     92.3
                                           Sub-location Differencing  -0.053     85.8
2                       10                 No-Differencing             0.043     94.5
                                           Location Differencing      -2.052     95.8
                                           Sub-location Differencing  -0.028     79.2
4                       3                  No-Differencing            -0.177     95.2
                                           Location Differencing      -1.027     89.5
                                           Sub-location Differencing   0.004     71.3
4                       5                  No-Differencing            -0.227     95.3
                                           Location Differencing      -1.387     92.8
                                           Sub-location Differencing  -0.017     78.7
4                       8                  No-Differencing             0.424     93.6
                                           Location Differencing      -1.437     95.0
                                           Sub-location Differencing   0.012     84.2
4                       10                 No-Differencing            -0.156     95.1
                                           Location Differencing       0.348     96.8
                                           Sub-location Differencing   0.025     78.9
8                       3                  No-Differencing             0.031     95.0
                                           Location Differencing      -0.200     95.0
                                           Sub-location Differencing   0.010     67.4
8                       5                  No-Differencing             0.279     94.4
                                           Location Differencing       2.118     97.0
                                           Sub-location Differencing  -0.008     73.6
8                       8                  No-Differencing            -0.101     93.2
                                           Location Differencing       4.108     98.0
                                           Sub-location Differencing   0.005     80.8
8                       10                 No-Differencing            -1.369     95.6
                                           Location Differencing      -1.866     99.5
                                           Sub-location Differencing  -0.041     83.6
Table III: Mean bias and coverage rate of the 95% test, J = 100 locations

Numb. of sub-locations  Sub-location size  Estimator                  Mean bias  Coverage rate
2                       3                  No-Differencing            -0.702     94.9
                                           Location Differencing       1.017     74.0
                                           Sub-location Differencing   0.007     63.6
2                       5                  No-Differencing            -0.716     95.3
                                           Location Differencing      -2.314     82.0
                                           Sub-location Differencing  -0.001     72.2
2                       8                  No-Differencing            -0.089     94.6
                                           Location Differencing      -5.000     90.5
                                           Sub-location Differencing  -0.005     84.2
2                       10                 No-Differencing             0.198     94.9
                                           Location Differencing      -0.197     96.2
                                           Sub-location Differencing  -0.050     85.3
4                       3                  No-Differencing            -0.096     94.5
                                           Location Differencing       0.858     89.1
                                           Sub-location Differencing  -0.001     59.4
4                       5                  No-Differencing            -0.691     94.7
                                           Location Differencing       0.712     89.6
                                           Sub-location Differencing   0.001     64.0
4                       8                  No-Differencing            -0.177     96.7
                                           Location Differencing       1.216     96.9
                                           Sub-location Differencing   0.015     79.2
4                       10                 No-Differencing            -0.384     94.0
                                           Location Differencing      -10.139    97.9
                                           Sub-location Differencing  -0.075     79.6
8                       3                  No-Differencing            -1.018     95.2
                                           Location Differencing       0.598     94.1
                                           Sub-location Differencing   0.002     53.7
8                       5                  No-Differencing             0.050     95.1
                                           Location Differencing       1.400     95.7
                                           Sub-location Differencing  -0.001     61.0
8                       8                  No-Differencing            -1.093     95.7
                                           Location Differencing      -10.195    98.1
                                           Sub-location Differencing   0.010     75.7
8                       10                 No-Differencing            -0.642     94.7
                                           Location Differencing       6.82      98.9
                                           Sub-location Differencing   0.009     79.1

1. As expected, the "No-Differencing" estimator has a larger mean bias in the presence of spatial heterogeneity. This result holds for both small and large numbers of locations, as well as for few or many individuals $N_{id}$ sharing the same sub-location specific unobserved heterogeneity.

2. The mean bias of the "Sub-location Differencing" estimator is smaller than that of the other estimators. It increases with the number of individuals in the sub-locations and decreases with the number of locations. For example, in a sample of 600 individuals spread across 100 locations with 2 sub-locations and 3 individuals in each sub-location, the mean bias is 0.007.
However, for the same sample size spread across 30 locations with 2 sub-locations and 10 individuals in each sub-location, the mean bias is -0.028.

This section shows the empirical importance of the spatial differencing methodology proposed in the previous sections. To illustrate the importance of our estimator, we ask what determines the tax rates set by regional governing bodies. This question raises an important issue of identification, since circular causation or omitted variable bias leads to biased and inconsistent estimators. We use our spatial differencing method to examine the case of changes in the Finnish local property tax rate at the turn of the millennium.

Finland consists of 411 municipalities (in 1999) spread across 19 regions, and the municipalities choose the property tax rate within the limits set by the central government. In 1999, the central government decided to raise the lower limit for the year 2000 from 0.2% to 0.5%. This change created a probability mass of municipalities at the lower bound: more than half of the municipalities have a taxation rate of 0.5%, making the data sample censored. We investigate what affected municipalities' tax rates in the year 2000.

We estimate the parameters of the outcome equation in the model represented as in Equation (2). Specifically, the outcome variable is the level of the general property tax in municipality i in region j, and the explanatory variables include the municipality's age structure of the population, the level of the municipality's income, received subsidies, the local income tax rate, and a dummy for the region j in which municipality i is located.
The selection equation determines whether the municipality sets its general tax rate at the mandatory minimum of 0.5% or above, and it contains all the variables which are in the outcome equation except for the local income tax rate.

As illustrated in Equation (1), there can be an unobserved sub-location specific effect operating at a finer spatial scale than region j, in our case at the level of the municipalities of which region j consists. Indeed, municipalities' tax levels can depend not only on their population size, income, and the subsidies received from the central government, but also on the level of amenities in the municipality. (There is a large literature which examines a range of factors influencing local tax rates, e.g. Charney (1983), Ashworth and Heyndels (1997), Ross and Yinger (1999), Charlot and Paty (2007), Charlot, Paty, and Piguet (2015), Crowley and Sobel (2011), Baskaran (2014), and Buettner and von Schwerin (2016).) Amenities are usually difficult to measure. More importantly, even if we have a few measures of amenities or their proxies, they might not capture all of them, leaving some amenities unobserved. In our case, unobserved amenities can be correlated with municipalities' population, income level, or the level of subsidies, which implies that not controlling for them renders the estimates biased and inconsistent. Therefore, using the fact that two municipalities from the same region sharing a border are neighbours, we apply our spatial differencing method to tackle this problem.

We estimate Equation (2) with spatial differencing conducted as the difference between municipality i and the average of its neighbours. Columns 1 and 2 present the results without spatial differencing, and columns 3-6 with spatial differencing. Estimation is conducted with and without regional dummies, and with two different estimators of the standard errors: the wild cluster bootstrap, and the spatially-adjusted standard errors derived in Section 2.3.
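The differencing used in this application, municipality i minus the average of its within-region border neighbours, can be sketched as follows; the adjacency mapping passed in is a hypothetical illustration, not the actual Finnish border data:

```python
import numpy as np

def neighbour_diff(neighbours, n):
    """Row i of Delta computes x_i minus the average over i's neighbours.

    neighbours : dict {municipality index: list of neighbour indices
                 within the same region}  (illustrative input)
    n          : total number of municipalities
    """
    Delta = np.zeros((len(neighbours), n))
    for row, (i, nbrs) in enumerate(sorted(neighbours.items())):
        Delta[row, i] = 1.0
        for k in nbrs:
            Delta[row, k] -= 1.0 / len(nbrs)
    return Delta
```

Because each row of Delta sums to zero, any unobserved effect common to a municipality and its neighbours, such as a shared amenity, is removed: `Delta @ (x + c)` equals `Delta @ x` for any constant `c`.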
Clustering of the standard errors is done at the level of region j. Since there are only 19 regions, we use the wild cluster bootstrap procedure developed by Cameron, Gelbach, and Miller (2008), whose properties were studied by, e.g., Davidson and MacKinnon (2010), MacKinnon (2013), and MacKinnon and Webb (2017). Specifically, we use the recently developed wild bootstrap package boottest by Roodman, Nielsen, MacKinnon, and Webb (2019), implemented in Stata.

We begin by discussing the results with spatial differencing, columns 3-6. Columns 3 and 4 present the results when we use spatial differencing with the standard errors calculated using the formula derived in Section 2.3. There is only one significant variable: the share of population older than 75 in column 4. This is not surprising, since our estimator of the variance-covariance matrix is an asymptotic estimator, while the estimation is done on a sample with a small number of clusters (nineteen regions). Therefore we use the wild bootstrap procedure, which is known to be suitable for a small number of clusters. The results using the wild bootstrap are shown in columns 5 and 6, and we see a considerable increase in the number of statistically significant results.

The comparison of columns 5 and 6 with columns 1 and 2 reveals the importance of spatial differencing. Controlling for the sub-location specific unobserved effect $\gamma_{j\alpha}$ by spatial differencing renders four variables statistically significant: the share of population younger than 15, the share of population older than 75, government grants, and the income tax rate. This is in contrast to columns 1 and 2, in which the income tax rate is the only significant variable. Not controlling for $\gamma_{j\alpha}$ leads to an omitted variable bias which, apart from rendering the estimates inconsistent, inflates the standard errors and makes the estimates mostly insignificant. Spatial differencing controls for the omitted variables, which means that they do not 'end up' in the error term and do not inflate the standard errors. In addition to comparing estimates with and without spatial differencing, it is also instructive to compare columns 5 and 6: spatial differencing with and without regional dummies. We see that controlling for the regional unobserved effect $\gamma_j$ adds little once the sub-location specific effects $\gamma_{j\alpha}$ are differenced out. Indeed, the magnitude and the statistical significance of the estimates change very little, and even the one variable that loses its statistical significance after including regional dummies (municipality's income) is only marginally significant without these dummies.

Overall, our empirical analysis shows that controlling for the unobserved municipality effects matters. Estimations which control for spatial unobserved effects only at the regional level suggest that the income tax rate is the only determinant of the general tax rate set by the municipalities. However, after controlling for the sub-location specific unobserved effects, the tax rate depends not only on the income tax rate but also on the age composition of the population: the share of the young as well as the share of the elderly. These results indicate that spatial differencing is an important tool to deal with the omitted variable bias which often plagues empirical studies on local taxation.

Table IV: Determinants of Municipality Taxation Rate
                        No Spatial Differencing     Spatial Differencing
                        Wild bootstrap              Spatially adj. se           Wild bootstrap
                        No Reg.     Reg.            No Reg.     Reg.            No Reg.     Reg.
                        Dummies     Dummies         Dummies     Dummies         Dummies     Dummies
                        (1)         (2)             (3)         (4)             (5)         (6)
Population              -1.897      -0.7047         0.1346      0.2448          0.1346      0.2448
                        [-1.420]    [-1.0432]       [0.0489]    [0.2656]        [0.507]     [0.8793]
Share pop. < 15         0.009       -0.0015         -0.0116     -0.0113         -0.0116**   -0.0113**
                        [0.492]     [-0.0996]       [-0.0933]   [-0.1373]       [-2.536]    [-2.3921]
Share 61 < pop. < 74    -0.0055     -0.0054         -0.0089     -0.0092         -0.0089     -0.0092
                        [-0.923]    [-0.343]        [-0.1106]   [-0.0721]       [-1.495]    [-1.4902]
Share pop. > 75         0.0211      0.0049          -0.0145     -0.0140**       -0.0145*    -0.014*
                        [1.027]     [0.4688]        [-0.1575]   [-1.8771]       [-1.794]    [-1.6746]
Income                  2.40E-07    6.48E-06        1.23E-05    0.00001         1.23E-05*   1.1E-05
                        [0.047]     [0.8092]        [0.0234]    [0.0213]        [1.924]     [1.6688]
Gov. grant              -1.7E-05    1.8E-05         -1.78E-05   -2.16E-05       -1.78E-05   -2.16E-05
                        [-0.868]    [0.2555]        [-0.0458]   [-0.0925]       [-0.483]    [-0.5666]
Income tax rate         0.0350***   0.0396**        0.0482      0.0453          0.0482***   0.0453***
                        [3.383]     [3.547]         [0.2775]    [0.2965]        [3.362]     [3.088]
Inverse Mills ratio     0.3028      0.1731          0.0159      0.0034          0.0159      0.0034
                        [1.505]     [1.1468]        [0.0346]    [0.0298]        [0.854]     [0.1525]
Constant                -0.5327     -0.3601         0.0111      -0.006          0.0111      -0.006
                        [-0.713]    [-0.598]        [1.540]     [-0.230]        [1.540]     [-0.199]
Observations            403         403             273         273             273         273
Regional Dummies        NO          YES             NO          YES             NO          YES
Number of Dummies       19          19              19          19              19          19
R-squared               0.197       0.271           0.248       0.279           0.248       0.279
Source: see text. Note: t-statistics in brackets; *** p<0.01, ** p<0.05, * p<0.1.

This paper has investigated a sample selection model with unobserved heterogeneity at a very fine location level. It proposes spatial differencing as an alternative identification strategy when instrumental variables and/or panel data are not available. We discuss the assumptions under which the parameters of the model are identified. The estimation of the parameters is done using the classic Heckman two-step estimation procedure. The differencing and the two-step procedure lead to a novel estimator with properties that are also relevant for spatial differencing in linear models. To understand the behavior of the new estimator, we derive a cluster asymptotic theory for it. The derivation reveals two important implications for its empirical implementation: (i) the number of clusters needs to be large for inference to be based on the normal distribution; (ii) each cluster should have a bounded number of individuals.

Monte Carlo experiments show that accounting for sub-location specific heterogeneity is crucial for identification. They also confirm the estimator's properties derived in our asymptotic theory. In particular, the estimator performs better with an increasing number of locations and fewer individuals in the sub-locations. In addition, when sub-locations are ignored and spatial differencing is applied only to more aggregate geographical units subsuming them, the mean bias is larger. The coverage rate of the test based on the corrected standard error is lower than the theoretical one.

In the empirical application, which looked at the determinants of the municipal tax rate, we show that using spatial differencing in combination with cluster wild bootstrap inference tools can be extremely useful. Indeed, the new estimator reveals several determinants of the municipal tax rate that would have been missed otherwise.
The development of a bootstrap appropriate for sample selection models is left for future research.

Appendix

Proof of Theorem 2:
The proof is written conditional on the set of numbers of individuals in the locations. Thus, when $E(N_j)$ is used, it can be considered a constant.

The substitution of the true value of $\Delta y$ in Equation (12) yields the equality
$$\hat\theta = \theta + [(\Delta W)'\Delta W]^{-1}[(\Delta W)'\Delta\eta].$$
Let us assume that $y_{ij} = x'_{ij}\delta + \gamma_{j\alpha} + \gamma_j + \rho\lambda(z'_{ij}\beta + \theta_{j\alpha} + \theta_j) + e_{ij}$ with $E(e_{ij}\mid\xi_{ij}) = 0$. Thus, $\Delta y_{ij} = \Delta x'_{ij}\delta + \rho\Delta\lambda(z'_{ij}\beta + \theta_{j\alpha} + \theta_j) + \Delta e_{ij}$. Under the identification assumptions I1 to I3 we have $\Delta y_{ij} = \Delta x'_{ij}\delta + \rho\Delta\lambda(z'_{ij}\beta) + \Delta e_{ij}$.

The second-step regression equation is equivalent to
$$\Delta y_{ij} = \Delta x'_{ij}\delta + \rho\Delta\lambda(z'_{ij}\hat\beta) + \Delta\big[\rho\big(\lambda(z'_{ij}\beta) - \lambda(z'_{ij}\hat\beta)\big) + e_{ij}\big].$$
$\hat\beta$ is estimated by maximum likelihood probit in the first step, on the full sample of size $N_0$, with variance-covariance matrix $V_\beta$, while the second step uses the selected sample of size $N$; we assume that the ratio $N/N_0$ converges. Given that $\lambda(\cdot)$ is twice differentiable, the continuous mapping theorem implies that $\lambda(z'_{ij}\beta) - \lambda(z'_{ij}\hat\beta)$ goes to zero in probability and is asymptotically normal. We therefore have
$$\sqrt{N_0}\big(\lambda(z'_{ij}\beta) - \lambda(z'_{ij}\hat\beta)\big) \to_d N(0,\Omega_{ij}), \qquad (14)$$
where $\Omega_{ij} = [\lambda'(z'_{ij}\beta)]^2 z'_{ij} V_\beta z_{ij}$.

We are interested in the limiting distribution of $\sqrt{N}(\hat\theta - \theta)$:
$$\sqrt{N}(\hat\theta - \theta) = N[(\Delta W)'\Delta W]^{-1}\frac{1}{\sqrt{N}}(\Delta W)'\Delta\eta = \left[\frac{\sum_{i=1}^{N}(\Delta W_i)'\Delta W_i}{N}\right]^{-1}\frac{1}{\sqrt{N}}\sum_{i=1}^{N}(\Delta W_i)'\Delta\eta_i.$$
While the $W_i$ are iid, the $\Delta W_i$ are not independent, because an individual is allowed to appear in many pairs. We therefore have to use a LLN and a CLT for non-independent random variables. The dependence structure is driven by the operator $\Delta$. If $\Delta$ is such that each individual appears in only one pair, then the classical CLT and LLN could be applied. However, if individuals are allowed to appear in several pairs, then we need to apply a CLT and LLN accounting for the correlation.
$$\frac{\sum_{i=1}^{N}(\Delta W_i)'\Delta W_i}{N} = \frac{\sum_{j=1}^{J}\sum_{k=1}^{N_j}(\Delta W_{kj})'\Delta W_{kj}}{N} = \frac{1}{J}\sum_{j=1}^{J}\frac{1}{E(N_j)}\sum_{k=1}^{N_j}(\Delta W_{kj})'\Delta W_{kj}.$$
Let us consider $Y_j = \frac{1}{E(N_j)}\sum_{k=1}^{N_j}(\Delta W_{kj})'\Delta W_{kj}$; these variables are iid. Moreover, note that $N = N_1 + N_2 + \dots + N_J = J\,E(N_j)$. Under Assumption E1, all locations have a bounded maximum capacity $N_j < n$, with $n$ a scalar, and all second moments of the variables in $W$ exist (Assumption E2).

$\frac{1}{J}\sum_{j=1}^{J} Y_j$ is a matrix; thus, the law of large numbers applies to it if and only if it applies to all its elements. Let $a_j$ be a typical element of the matrix $Y_j$, and let $t$ and $m$ be two variables from the set of variables forming $W$. For example, we can consider $t = x_1$, the first column of the random variable $x$.
If $t \neq m$, then
$$E|a_j| \le \frac{1}{E(N_j)}\sum_{k=1}^{N_j}E|\Delta t_{kj}\,\Delta m_{kj}| \qquad (15)$$
$$= \frac{1}{E(N_j)}\sum_{k=1}^{N_j}E|(t_{kj}-t_{ij})(m_{kj}-m_{ij})| \qquad (16)$$
$$\le \frac{4}{E(N_j)}\sum_{k=1}^{N_j}E|t_{kj}m_{kj}| \qquad (17)$$
$$\le \frac{4}{E(N_j)}\sum_{k=1}^{N_j}\sqrt{E(|t_{kj}|^2)\,E(|m_{kj}|^2)} \qquad (18)$$
$$\le M, \qquad (19)$$
with $M$ a constant. The result is obtained by using successively the triangle inequality, the identical distribution of the variables in $W$, the Cauchy-Schwarz inequality, and the existence of moments up to the fourth (which implies that the second moments exist). If $t = m$ we have
$$E|a_j| \le \frac{1}{E(N_j)}\sum_{k=1}^{N_j}E|\Delta t_{kj}|^2 \qquad (20)$$
$$= \frac{1}{E(N_j)}\sum_{k=1}^{N_j}E|t_{kj}-t_{ij}|^2 \qquad (21)$$
$$\le \frac{2}{E(N_j)}\sum_{k=1}^{N_j}\big[E|t_{kj}t_{ij}| + E(t_{kj}^2)\big] \qquad (22)$$
$$\le \frac{4}{E(N_j)}\sum_{k=1}^{N_j}E(t_{kj}^2) \qquad (23)$$
$$\le M. \qquad (24)$$
Thus, the LLN implies that
$$\frac{\sum_{i=1}^{N}(\Delta W_i)'\Delta W_i}{N} \to_p E[(\Delta W_{ij})'\Delta W_{ij}] = C^{-1}.$$
We can also show that
$$\frac{1}{\sqrt{N}}\sum_{i=1}^{N}(\Delta W_i)'\Delta\eta_i = \frac{\rho}{\sqrt{N}}\sum_{i=1}^{N}(\Delta W_i)'\Delta\big(\lambda(z'_{ij}\beta)-\lambda(z'_{ij}\hat\beta)\big) + \frac{1}{\sqrt{N}}\sum_{i=1}^{N}(\Delta W_i)'\Delta e_{ij}.$$
We consider $\Lambda_j = \sum_{k=1}^{N_j}(\Delta W_{kj})'\Delta\big(\lambda(z'_{kj}\beta)-\lambda(z'_{kj}\hat\beta)\big)$ and $E_j = \sum_{k=1}^{N_j}(\Delta W_{kj})'\Delta e_{kj}$. Conditional on $\hat\beta$, the $\Lambda_j$ are iid random variables, and so are the $E_j$. We assume that the number of individuals in a group is iid with finite mean $E(N_j)$. (The application of the LLN implies, for consistency reasons, that $N/J \to_p E(N_j)$; thus $J\,E(N_j) \approx N$.) Given that all locations are assumed to be disjoint,
$$\frac{1}{\sqrt{N}}\sum_{i=1}^{N}(\Delta W_i)'\Delta\eta_i = \frac{\rho}{\sqrt{N}}\sum_{j=1}^{J}\Lambda_j + \frac{1}{\sqrt{N}}\sum_{j=1}^{J}E_j.$$
We have $E(E_j) = 0$ for each $j$. Moreover,
$$Var(E_j) = E\Big[\Big(\sum_{k=1}^{N_j}(\Delta W_{kj})'\Delta e_{kj}\Big)\Big(\sum_{k=1}^{N_j}(\Delta W_{kj})'\Delta e_{kj}\Big)'\Big] = E\Big[\sum_{k=1}^{N_j}(\Delta W_{kj})'(\Delta e_{kj})^2\,\Delta W_{kj}\Big] = E(N_j)\,E[(\Delta W_{kj})'(\Delta e_{kj})^2\,\Delta W_{kj}].$$
Under Assumption E2,
$Var(E_j)$ is finite, because all variables have moments up to the fourth. Indeed, if we consider a typical element of $E[(\Delta W_{kj})'(\Delta e_{kj})^2\,\Delta W_{kj}]$, formed by the variables $t$ and $m$,
$$E[(t_{kj}-t_{ij})(m_{kj}-m_{ij})(\Delta e_{kj})^2] \le 4\,E|t_{kj}m_{kj}(\Delta e_{kj})^2| \le 4\big[E(|t_{kj}|^4)\,E((\Delta e_{kj})^4)\big]^{1/4}\big[E(|m_{kj}|^4)\,E((\Delta e_{kj})^4)\big]^{1/4} \le M.$$
It should be noted that $E[(\Delta e_{kj})^2] = 2E(e_{kj}^2) < \infty$.

Similarly, we can show that $E(\Lambda_j) = 0$ and
$$Var(\Lambda_j) = E\Big[\Big(\sum_{k=1}^{N_j}(\Delta W_{kj})'\Delta\big(\lambda(z'_{kj}\beta)-\lambda(z'_{kj}\hat\beta)\big)\Big)\Big(\sum_{k=1}^{N_j}(\Delta W_{kj})'\Delta\big(\lambda(z'_{kj}\beta)-\lambda(z'_{kj}\hat\beta)\big)\Big)'\Big] \qquad (25)$$
$$= E\Big[\sum_{k=1}^{N_j}(\Delta W_{kj})'\,\Delta\big(\lambda(z'_{kj}\beta)-\lambda(z'_{kj}\hat\beta)\big)\,\Delta\big(\lambda(z'_{kj}\beta)-\lambda(z'_{kj}\hat\beta)\big)'\,\Delta W_{kj}\Big] \qquad (26)$$
$$= E(N_j)\,E[(\Delta W_{kj})'\,\Omega_{kj}\,\Delta W_{kj}]. \qquad (27)$$
We need to show that $E[(\Delta W_{kj})'\Omega_{kj}\Delta W_{kj}]$, with $\Omega_{kj} = [\lambda'(z'_{kj}\beta)]^2 z'_{kj} V_\beta z_{kj}$, is finite. A typical element of this matrix is $E[(t_{kj}-t_{ij})\,\Omega_{kj}\,(m_{kj}-m_{ij})]$. We can show the following, using the Cauchy-Schwarz inequality:
$$E[(t_{kj}-t_{ij})\,\Omega_{kj}\,(m_{kj}-m_{ij})] \le 4\,E[|t_{kj}m_{kj}|\,\Omega_{kj}] \qquad (28)$$
$$\le 4\,E[|t_{kj}m_{kj}|\,z'_{kj}V_\beta z_{kj}] \qquad (29)$$
$$\le 4\,\big[E(|t_{kj}|^4)\big]^{1/4}\big[E(|m_{kj}|^4)\big]^{1/4}\big[E\big((z'_{kj}V_\beta z_{kj})^2\big)\big]^{1/2} \qquad (30)$$
$$< \infty, \qquad (31)$$
which follows from noting that $|\lambda'(\cdot)| \le 1$ and that the variables in $z$ have moments up to the fourth. The typical element $E[(t_{kj}-t_{ij})\Omega_{kj}(m_{kj}-m_{ij})]$ is therefore finite, which proves that the variance is finite.

It is important to notice that, conditional on $W$, $\sum_{i=1}^{N}(\Delta W_i)'\Delta\big(\lambda(z'_{ij}\beta)-\lambda(z'_{ij}\hat\beta)\big)$ and $\sum_{i=1}^{N}(\Delta W_i)'\Delta e_{ij}$ are independent random variables.
Therefore,
$$\frac{1}{\sqrt{N}}\sum_{i=1}^{N}(\Delta W_i)'\Delta\eta_i \to_d N(0,\Gamma), \qquad (32)$$
where $\Gamma = \rho^2 E[(\Delta W_{ij})'\Omega_{ij}\Delta W_{ij}] + E[(\Delta W_{ij})'(\Delta e_{ij})^2\Delta W_{ij}]$, and
$$\sqrt{N}(\hat\theta - \theta) \to_d N(0,\Theta) \qquad (33)$$
with $\Theta = C\Gamma C'$. This proves the asymptotic normality of our two-step estimator.

We have proven that under Assumptions I1, I2, I3, E1 and E2,
$$\frac{\sum_{i=1}^{N}(\Delta W_i)'\Delta W_i}{N} \to_p E[(\Delta W)'\Delta W] = C^{-1}.$$
Using similar arguments, we can show that
$$\frac{\sum_{i=1}^{N}(\Delta W_i)'\Delta\eta_i}{N} \to_p E[(\Delta W)'\Delta\eta] = 0,$$
which means that $\hat\theta$ is a consistent estimator of $\theta$. The estimator is thus both consistent and asymptotically normal. This ends the proof of Theorem 2.

References

Ashworth, J., and B. Heyndels (1997): “Politicians’ preferences on local tax rates: an empirical analysis,”
European Journal of Political Economy, 13(3), 479–502.
Baskaran, T. (2014): “Identifying local tax mimicking with administrative borders and a policy reform,”
Journal of Public Economics, 118, 41–51.
Black, S. E. (1999): “Do Better Schools Matter? Parental Valuation of Elementary Education,”
The Quarterly Journal of Economics, 114(2), 577–599.
Buettner, T., and A. von Schwerin (2016): “Yardstick competition and partial coordination: Exploring the empirical distribution of local business tax rates,”
Journal of Economic Behavior & Organization, 124, 178–201.
Cameron, A. C., J. B. Gelbach, and D. L. Miller (2008): “Bootstrap-based improvements for inference with clustered errors,”
The Review of Economics and Statistics, 90(3), 414–427.
Charlot, S., and S. Paty (2007): “Market access effect and local tax setting: evidence from French panel data,”
Journal of Economic Geography, 7(3), 247–263.
Charlot, S., S. Paty, and V. Piguet (2015): “Does fiscal cooperation increase local tax rates in urban areas?,”
Regional Studies, 49(10), 1706–1721.
Charney, A. H. (1983): “Intraurban manufacturing location decisions and local tax differentials,”
Journal of Urban Economics, 14(2), 184–205.
Crowley, G. R., and R. S. Sobel (2011): “Does fiscal decentralization constrain Leviathan? New evidence from local property tax competition,”
Public Choice, 149(1-2), 5.
Davidson, R., and J. G. MacKinnon (2010): “Wild bootstrap tests for IV regression,”
Journal of Business & Economic Statistics, 28(1), 128–144.
Djogbenou, A. A., J. G. MacKinnon, and M. Ø. Nielsen (2019): “Asymptotic theory and wild bootstrap inference with clustered errors,”
Journal of Econometrics, 212(2), 393–412.
Duranton, G., L. Gobillon, and H. G. Overman (2011): “Assessing the Effects of Local Taxation using Microgeographic Data,”
The Economic Journal, 121(555), 1017–1046.
Dustmann, C., and M. E. Rochina-Barrachina (2007): “Selection correction in panel data models: An application to the estimation of females’ wage equations,”
The Econometrics Journal, 10(2), 263–293.
Gibbons, S., and S. Machin (2003): “Valuing English primary schools,”
Journal of Urban Economics, 53(2), 197–219.
Hansen, B. E., and S. Lee (2019): “Asymptotic theory for clustered samples,”
Journal of Econometrics, 210(2), 268–290.
Heckman, J. (1974): “Shadow Prices, Market Wages, and Labor Supply,”
Econometrica, 42(4), 679–694.

Heckman, J. J. (1979): “Sample Selection Bias as a Specification Error,” Econometrica, 47(1), 153–161.
Holmes, T. J. (1998): “The Effect of State Policies on the Location of Manufacturing: Evidence from State Borders,”
Journal of Political Economy, 106(4), 667–705.
Kyriazidou, E. (1997): “Estimation of a Panel Data Sample Selection Model,”
Econometrica, 65(6), 1335–1364.
Lee, M.-j. (2001): “First-difference estimator for panel censored-selection models,”
Economics Letters, 70(1), 43–49.
MacKinnon, J. G. (2013): “Thirty years of heteroskedasticity-robust inference,” in
Recent advances and future directions in causality, prediction, and specification analysis, pp. 437–461. Springer.
MacKinnon, J. G., and M. D. Webb (2017): “Wild bootstrap inference for wildly different cluster sizes,”
Journal of Applied Econometrics, 32(2), 233–254.
Oster, E. (2019): “Unobservable selection and coefficient stability: Theory and evidence,”
Journal of Business & Economic Statistics, 37(2), 187–204.
Rochina-Barrachina, M. E. (1999): “A new estimator for panel data sample selection models,”
Annales d’Economie et de Statistique, pp. 153–181.
Roodman, D., M. Ø. Nielsen, J. G. MacKinnon, and M. D. Webb (2019): “Fast and wild: Bootstrap inference in Stata using boottest,”
The Stata Journal, 19(1), 4–60.

Ross, S., and J. Yinger (1999): “Sorting and voting: A review of the literature on urban public finance,” Handbook of Regional and Urban Economics, 3, 2001–2060.
Todd, P. E., and K. I. Wolpin (2003): “On the specification and estimation of the production function for cognitive achievement,”
The Economic Journal, 113(485), F3–F33.
White, H. (1984):
Asymptotic Theory for Econometricians. Academic Press.
Wooldridge, J. M. (1995): “Selection corrections for panel data models under conditional mean independence assumptions,”
Journal of Econometrics, 68(1), 115–132.

Wooldridge, J. M. (2010): Econometric Analysis of Cross Section and Panel Data. MIT Press.