Spatial Differencing for Sample Selection Models with Unobserved Heterogeneity
Alexander Klein∗  Guy Tchuente†
University of Kent
June 2020
Abstract
This paper derives identification, estimation and inference results using spatial differencing in sample selection models with unobserved heterogeneity. We show that under the assumption of smooth changes across space of the unobserved sub-location specific heterogeneities and inverse Mills ratio, key parameters of a sample selection model are identified. The smoothness of the sub-location specific heterogeneities implies a correlation in the outcomes. We assume that the correlation is restricted within a location or cluster and derive asymptotic results showing that as the number of independent clusters increases, the estimators are consistent and asymptotically normal. We also propose a formula for standard error estimation. A Monte-Carlo experiment illustrates the small sample properties of our estimator. The application of our procedure to estimate the determinants of the municipality tax rate in Finland shows the importance of accounting for unobserved heterogeneity.

∗ School of Economics, e-mail: [email protected]
† Corresponding author: School of Economics, University of Kent, e-mail: [email protected]. Address: Kennedy Building, Park Wood Road, Canterbury, Kent, CT2 7FS. Tel: +44 1227 827249.

Keywords:
Sample selection, Spatial difference, Unobserved heterogeneity.
In linear models, spatial differencing has been used to deal with unobserved omitted variables. The availability of geographical locations has allowed empirical papers to take advantage of the spatial dimension of the data and control for various forms of unobserved heterogeneity (e.g. Duranton, Gobillon, and Overman (2011), Black (1999) or Holmes (1998)). In general, spatial differencing offers an identification strategy in situations where researchers face cross-sectional data with unobserved heterogeneity and lack suitable instrumental variables. This paper extends spatial differencing to a model with sample selection.

For economists, the question of omitted variables is a serious concern in the context of nonexperimental data. The solution is straightforward when omitted variables are simply the result of not including all relevant variables for which data exist: we add such variables to the model to avoid the bias induced by their omission. When omitted variables are unobserved, researchers have essentially three options: they can use (i) proxies, (ii) instrumental variables, or (iii) differencing of the data across time or space.

Proxies reduce the bias if they manage to capture the effect of the omitted variables such that what remains is uncorrelated with the error term. However, it is often the case that a proxy is imperfect: it may still be related to the unobserved heterogeneity, or to the error term if it turns out to be endogenous, or it may be irrelevant after controlling for observed covariates. In such cases, the inclusion of the proxy will not solve the bias problem and may even exacerbate it. The second solution, using a valid set of instruments, may help alleviate the bias. However, as discussed in Todd and Wolpin (2003), the "quasi-experimental" local average treatment effect (LATE) obtained in the instrumental variable model may not correspond to the ceteris paribus effect and thus may not correspond to the deep structural parameter of interest.
Lastly, panel data sets allow researchers to control for unobserved heterogeneity. They help identify the causal effect when, for example, time-constant unobserved heterogeneity might cause an endogeneity problem and strong instruments satisfying exclusion restrictions cannot be found. However, there might be situations when such data sets are not available.

Our paper is a contribution to the literature identifying and estimating model parameters in the presence of unobserved omitted variables. We propose an identification strategy based on spatial differencing. As discussed above, this approach has been used in the context of linear regressions. However, little is known about its performance in non-linear models. We extend spatial differencing in this direction, specifically to the case of cross-section data with sample selection. We show that under justifiable assumptions on the smoothness of the unobserved heterogeneity (i.e. spatially close individuals have similar unobserved heterogeneity and the derivatives of their inverse Mills ratios are similar), spatial differencing eliminates the unobserved effects even in the presence of a nonlinear element, in our case the inverse Mills ratio. The parameters of interest of our sample selection model are estimated using the standard two-step approach of Heckman (1974, 1979). We derive asymptotic properties and propose a correction of the standard errors accounting for the two-step nature of our estimation and for spatial differencing. The asymptotic behavior of the estimator reveals important properties of spatial differencing that researchers need to be cautious about.

See Todd and Wolpin (2003) for details on the use of proxies and Oster (2019) for a rigorous treatment of the evaluation of robustness to omitted variables.
The new estimator and the standard errors correction are easy to implement.

The intuition for the model of sample selection with unobserved spatial heterogeneity that we consider in this paper can be described as follows. Suppose we have cross-sectional data on municipalities which are organized into larger geographical units called regions, and which have the authority to set the levels of local taxation. Municipality tax rates must be at least as high as the threshold set by the central government. As a result, municipalities self-select into those with tax rates at the threshold and those above it. We are interested in what determines the municipalities' tax rates. The tax rate will depend on various socio-economic characteristics, such as the age composition of the population and income, but also on amenities. These can depend on the region where the municipalities are located: for example, regions with natural landscapes might have a different level and composition of amenities than regions without them. We can control for them with region-specific dummies. However, there can be considerable unobserved heterogeneity at the municipality level. Controlling for that with municipality-specific dummies might not be an option, since we may quickly run out of degrees of freedom. Therefore, we face the problem of a self-selected cross-sectional sample with unobserved heterogeneity which we cannot fully control with dummies, and which has two spatial dimensions: a high-level one which we call locations (in our example, regions), and a low-level one which we call sub-locations (in our example, municipalities). Spatial differencing will eliminate the sub-location specific unobserved heterogeneity. It will, at the same time, also induce a correlation in the error terms.
We take that correlation into account, and derive the asymptotic properties of our estimator using arguments similar to those used in the derivation of the asymptotic behavior of clustered standard errors: the number of locations goes to infinity and the size of each location is assumed random and bounded almost surely. We find that this result also extends to a linear model without sample selection. This has important implications that researchers need to be cautious about. Indeed, the consistency of the estimator applied to the spatially differenced data requires (i) a large number of locations and (ii) a limited number of individuals in each location. Monte Carlo simulations also suggest that it is better if the number of individuals in sub-locations is small as well. Before we continue, note that locations in our model are equivalent to clusters, and we use 'location' and 'cluster' interchangeably.

Since our estimator is derived for a clustered sample with unobserved heterogeneity, this paper contributes to the literature on selection correction in panel data. In this literature, the main challenge is the presence of individual-specific unobserved heterogeneity in both the outcome and the selection equations. The existing solutions are based on either a full model specification or on a differencing procedure. Wooldridge (1995) uses a Mundlak approach to specify the individual-specific unobserved heterogeneity in both equations. He also imposes a special functional form on the selection mechanism. Kyriazidou (1997), on the other hand, does not impose strong restrictions on the functional form of the selection equation and uses a nonparametric approach to difference out the unobserved fixed effect. Rochina-Barrachina (1999) similarly relies on differencing to identify the parameters of the model, but she also imposes additional distributional assumptions on the selection equation.
Even if our problem has similarities with the panel data selection correction literature, the main difference is that we observe a clustered cross-section. In each cluster, there is a finer common sub-location specific unobserved heterogeneity shared by some individuals in that cluster. This heterogeneity, however, is different from the cluster- and individual-specific ones studied in panel data models and implies a different cluster asymptotic. Since the outcomes of individuals are not independent in our model, while they are in the panel data case, our asymptotic results are derived using a large number of clusters asymptotic with heterogeneous, random and bounded cluster size.

The clustered dependence created by the finer sub-location specific unobserved heterogeneity relates our asymptotic discussion to the papers dealing with clustering at the variance level (see Wooldridge (2010) for a textbook treatment). The asymptotics in that literature are derived using either a large or a fixed number of clusters. A fixed number of clusters leads to non-normal asymptotics; a discussion of recent contributions can be found in Hansen and Lee (2019). A large number of clusters asymptotic was first derived by White (1984) and has been investigated by several authors allowing either fixed cluster size or heterogeneous clusters. Recent developments include Hansen and Lee (2019), who propose conditions on the relation between the cluster sample sizes and the full sample in a regular asymptotic, and Djogbenou, MacKinnon, and Nielsen (2019), who derive asymptotics with varying cluster sizes and carry out a cluster wild bootstrap. Our results complement this literature by extending the cluster asymptotic to a sample selection model.

We present an empirical application of our new estimator. We examine the determinants of tax rates across four hundred and eleven Finnish municipalities spread across nineteen Finnish regions.
In 1999, the Finnish central government decided to raise the lower bound of the tax rate municipalities could set from 0.2% to 0.5%. This created a sample selection mechanism which resulted in more than half of the municipalities opting for 0.5% while the rest charged a higher tax rate. We use our spatial differencing estimator to control for the unobserved municipality effect, which can be correlated with the error term, thus creating an endogeneity problem and rendering the standard sample selection estimator biased and inconsistent. Our results clearly show the presence of unobserved heterogeneity across municipalities, the limitations of using only region dummies (nineteen in our case) to fully control for the municipalities' unobserved heterogeneity, and the importance of spatial differencing to control for it.

The structure of the paper is as follows. First, we extend the spatial differencing method from the case of the linear regression model to the case of sample selection. Then we discuss identification assumptions, propose an estimation procedure, and derive the estimator of the corrected standard errors. Lastly, we conduct Monte Carlo simulations and present an empirical application of our estimator.

In many economic applications, we are interested in estimating the following regression equation:

$$y_{ij} = x_{ij}'\delta + \gamma_j + \gamma_{j\alpha} + \varepsilon_{ij} \qquad (1)$$

where $x_{ij}$ is a vector of exogenous control variables, $\gamma_j$ is a location fixed effect, $\gamma_{j\alpha}$ is a sub-location specific effect for sub-location $\alpha$, which is at a finer spatial scale than location $j$, and $\varepsilon_{ij}$ is the error term.

Examples of applications of this model can be found in the estimation of the fertilizer effect on wheat crop yields in farms growing multiple crops, or the effect of local taxation on the growth of firms. Crop yields depend on the soil quality of location $j$ (e.g. a village), but also on the sub-location specific soil composition (e.g. a farm in the village); see Collins, Alva, Boydston, Cochran, Hamm, McGuire, and Riga (2006).
Similarly, the impact of local taxation on the growth of firms may vary by county but also by sub-locations such as neighborhoods, as in Duranton, Gobillon, and Overman (2011). We can control for $\gamma_j$ with location dummy variables. However, they might not be enough to capture all unobserved heterogeneity related to location $j$, as there can be considerable heterogeneity at the finer spatial scale of sub-locations: using the example above, the firms are located in various neighborhoods $\alpha$ which are sub-locations of location $j$. Furthermore, the standard location fixed effect $\gamma_j$ relies upon an arbitrary specification of the comparison neighborhood group, as pointed out by Gibbons and Machin (2003), making it an imperfect control for the sub-location specific effect $\gamma_{j\alpha}$. If $\gamma_{j\alpha}$ is correlated with $x_{ij}$, the OLS estimate of $\delta$ will be biased. In the absence of suitable instrumental variables for $x_{ij}$, spatial differencing offers a solution by differencing out the unobserved sub-location specific effects $\gamma_{j\alpha}$.

Duranton, Gobillon, and Overman (2011), Black (1999) and Holmes (1998) use spatial differencing in the case of linear models to solve endogeneity problems arising from the unobserved sub-location effect $\gamma_{j\alpha}$. They take advantage of the fact that for sufficiently small distances between sub-locations, the specific effect $\gamma_{j\alpha}$ changes smoothly across space, thus allowing them to difference it out. This corresponds to the following assumption.

The sub-location specific component, $\gamma_{j\alpha}$, is a simplification of $\gamma_{j\alpha_i}$. We are implicitly assuming that the sub-location specific effects are the same for all its individuals.

Assumption I1:
The sub-location specific unobservable effect is homogeneous in a neighborhood of the individual, i.e. $\Delta_d \gamma_{j\alpha} = 0$ for $d$ small enough.

In several economic models, in addition to the sub-location specific fixed effects $\gamma_{j\alpha}$, the outcome of interest is not observed for a selected sub-sample. The selection can be the result of a decision of the individuals or of the researcher. The presence of sample selection introduces nonlinearity into model (1).

We specify the model with sample selection as follows. Consider two latent dependent variables $y_{1ij}^*$ and $y_{2ij}^*$ in a cross-section which follow a regular linear model for individual $i$ in location $j$:

$$y_{1ij}^* = z_{ij}'\beta + \theta_{j\alpha} + \theta_j + \varepsilon_{1ij} \quad \text{(selection equation)},$$
$$y_{2ij}^* = x_{ij}'\delta + \gamma_{j\alpha} + \gamma_j + \varepsilon_{2ij} \quad \text{(outcome equation)}.$$

The individual error terms are $\varepsilon_{1ij}$ and $\varepsilon_{2ij}$; $\theta_{j\alpha}$ and $\gamma_{j\alpha}$ are sub-location specific effects for a sub-location $\alpha$ in location $j$, affecting the selection and the outcome equation respectively. The exogenous characteristics $x_{ij}$ affect the outcome. They could be correlated with $\gamma_{j\alpha} + \gamma_j$ but not with $\varepsilon_{1ij}$ and $\varepsilon_{2ij}$. The variables $z_{ij}$ are exogenous variables determining selection; they can contain a subset of $x_{ij}$. However, for identification purposes, some elements of $z_{ij}$ are assumed to be absent from $x_{ij}$.

Assumption I2: $\varepsilon_{1ij}$ and $\varepsilon_{2ij}$ are independent identically distributed normal random variables for all $i, j$.

The outcome is modelled in the form of a truncated sample selection model and is represented by equation (2):

$$y_{ij} = \begin{cases} y_{2ij}^* & \text{if } y_{1ij}^* > 0 \\ \text{not observed} & \text{if } y_{1ij}^* \leq 0 \end{cases} \qquad (2)$$

Condition 1:
$\mathrm{Cov}[z_{ij},\, \theta_{j\alpha} + \theta_j + \varepsilon_{1ij}] = 0$; $z_{ij}$ is exogenous.

Condition 2: $\mathrm{Cov}[x_{ij},\, \gamma_{j\alpha} + \gamma_j + \varepsilon_{2ij}] = 0$; $x_{ij}$ is exogenous.

Condition 3: the errors $(\varepsilon_{1ij}, \varepsilon_{2ij})$ satisfy $\varepsilon_{2ij} = \rho \times \varepsilon_{1ij} + v_{ij}$ with $\varepsilon_{1ij} \sim N(0,1)$ and $v_{ij}$ independent of $\varepsilon_{1ij}$.

It is possible to consistently estimate $\delta$ by Tobit regression under these three conditions. In most applications, Conditions 1 and 2 are unlikely to hold because there is a possibility that, within a location, there could be a sub-location specific omitted variable affecting both the outcome and some observed characteristics of interest. Thus, it is possible that
$\mathrm{Cov}[z_{ij},\, \theta_{j\alpha} + \theta_j] \neq 0$ and $\mathrm{Cov}[x_{ij},\, \gamma_{j\alpha} + \gamma_j] \neq 0$. The standard way to deal with the correlation between $x_{ij}$ and $\gamma_{j\alpha}$ would be to find a suitable instrument for $x_{ij}$ and run an IV Tobit or an IV two-stage Heckit.

The very local nature of the sub-location specific effect means that it is not always easy to find a variable correlated with $x_{ij}$ and uncorrelated with $\gamma_{j\alpha}$. The exclusion restriction is likely to be violated and IV two-stage Heckit will yield inconsistent estimates of $\delta$. Another option is to use finer location fixed effects and estimate the model using the classic Heckman two-stage procedure, but in practice this will lead to a proliferation of variables and a loss of degrees of freedom.

Identification requires an exclusion restriction, i.e. a variable that affects $y_{1ij}^*$ but not $y_{2ij}^*$. Otherwise, identification relies on the nonlinearity of the inverse Mills ratio.

This section investigates the application of the spatial differencing technique to the case of cross-section sample selection models. We denote by $\Delta_d$ a spatial difference operator. One example is the pair-wise difference operator, which takes the difference between each observation and another observation located at a distance less than $d$ from that observation. In a location $j$, with individuals $i$ and $k$ who are neighbours, the pair-wise differencing of a variable $A$ is:

$$\Delta_d A = A_{ij} - A_{kj}.$$

Another example is the difference between the individual outcome and the average outcome of his/her neighbourhood $N_{id}$. This operator is similar to the neighbourhood fixed effect operator, the difference being that the neighbourhoods can overlap. We call this operator the fixed-effect difference operator. Let $N_{id} = \{k \text{ in neighbourhood } d\}$ and let $N_d$ be the sample size of $N_{id}$; the differencing is given by:

$$\Delta_{df} A = A_{ij} - \frac{1}{N_d}\sum_{k \in N_{id}} A_{kj}.$$

A further possibility is to use a kernel, as in Kyriazidou (1997), to weight the neighbours in $N_{id}$ according to how far they are in terms of observable characteristics. This operator is the kernel difference operator:

$$\Delta_{dK} A = A_{ij} - \sum_{k \in N_{id}} \psi(i,k)\, A_{kj},$$

where $\psi(i,k) = \frac{1}{h_{N_d}} K\!\left(\frac{(z_{ij}' - z_{kj}')\beta + (x_{ij}' - x_{kj}')\delta}{h_{N_d}}\right)$, $K$ is a kernel density function and $h_{N_d}$ is a sequence of bandwidths. To illustrate our identification strategy and for the asymptotic derivation, we use the pairwise spatial difference operator, while for the empirical application and for the Monte Carlo simulations, the fixed-effect difference is used.

For the spatial difference operator $\Delta_d$, $\Delta_d y_{ij} = y_{ij} - y_{kj}$ with $k$ an observation in the neighborhood $d$ of $i$. Let $\xi_{ij} \equiv \{x_{ij}, z_{ij}, y_{1ij}^* > 0, \gamma_{id}, \theta_{id}\}$ with $\gamma_{id} = \{\gamma_{kj} \text{ with } k \in N_{id} \cup \{i\}\}$ and $\theta_{id} = \{\theta_{kj} \text{ with } k \in N_{id} \cup \{i\}\}$. Then

$$E[\Delta_d y_{ij} \mid \xi_{ij}, \xi_{kj}] = E[y_{ij} - y_{kj} \mid \xi_{ij}, \xi_{kj}] \qquad (3)$$
$$= E[y_{ij} \mid \xi_{ij}] - E[y_{kj} \mid \xi_{kj}] \qquad (4)$$
$$= x_{ij}'\delta + \gamma_{j\alpha_i} + \gamma_j + \rho\,\lambda(z_{ij}'\beta + \theta_{j\alpha_i} + \theta_j) - \left[x_{kj}'\delta + \gamma_{j\alpha_k} + \gamma_j + \rho\,\lambda(z_{kj}'\beta + \theta_{j\alpha_k} + \theta_j)\right] \qquad (5)$$
$$= \Delta_d x_{ij}'\delta + \Delta_d \gamma_{j\alpha} + \rho\,\Delta_d \lambda(z_{ij}'\beta + \theta_{j\alpha} + \theta_j) \qquad (6)$$

where $\lambda(c) = \phi(c)/\Phi(c)$ is the inverse Mills ratio, while $\phi(c)$ and $\Phi(c)$ are respectively the density and the distribution function of a normal random variable with mean zero and variance 1.

To go from Equation (3) to Equation (4) we use the linearity of the expectation and the mean independence of $y_{ij}$ and $y_{kj}^*$ conditional on $\xi_{ij}$, as well as the mean independence of $y_{kj}$ and $y_{ij}^*$ conditional on $\xi_{kj}$, since we have assumed in Assumption I2 that $\varepsilon_{1ij}$ and $\varepsilon_{2ij}$ are iid. The separation of the conditioning sets $\xi_{ij}$ and $\xi_{kj}$ is possible because we are working with cross-sectional data. Such a separation of the conditioning set is not possible for panel data. Indeed, in the context of panel data with individual effects and sample selection, when differencing is used to remove the fixed effects, the conditioning set cannot be separated as we have done to move from Equation (3) to (4).
For example, Kyriazidou (1997) has to impose a "conditional exchangeability" assumption that is conditioned on the variables related to the two periods used in differencing. In the case of models with censoring, Lee (2001) discusses conditions under which first-differencing can be applied, and applies the linear implication of the "conditional exchangeability" assumption. In a similar context using first differences, Rochina-Barrachina (1999) imposes joint normality between the difference in the errors of the outcome equation and the errors in the selection equation in the two time periods.

Estimating equation (6) presents two challenges for the identification of the parameter of interest $\delta$ and the sample selection parameter $\rho$: the sub-location specific difference $\Delta_d \gamma_{j\alpha}$, and the sample selection term $\rho\,\Delta_d \lambda(z_{ij}'\beta + \theta_{j\alpha} + \theta_j)$. As for the sub-location specific difference $\Delta_d \gamma_{j\alpha}$, under Assumptions I1 and I2, equation (6) becomes

$$E[\Delta_d y_{ij} \mid \xi_{ij}, \xi_{kj}] = \Delta_d x_{ij}'\delta + \rho\,\Delta_d \lambda(z_{ij}'\beta + \theta_{j\alpha} + \theta_j). \qquad (7)$$

These assumptions allow us to difference out the sub-location specific unobserved effect $\gamma_{j\alpha}$, a strategy that was applied by Duranton, Gobillon, and Overman (2011).

As for the sample selection term $\rho\,\Delta_d \lambda(z_{ij}'\beta + \theta_{j\alpha} + \theta_j)$, we see that it depends on the unobservable sub-location specific and location effects $\theta_{j\alpha} + \theta_j$. Because the sample selection term is a nonlinear function, simple spatial differencing will not always work, unlike in the case of $\gamma_{j\alpha}$. Therefore, the following assumption helps us to deal with this challenge:

See Dustmann and Rochina-Barrachina (2007) for a review of selection correction in panel data models.

Assumption I3:

(i) The sub-location specific unobservable selection effect is homogeneous in a neighborhood of the individual, i.e. $\Delta_d \theta_{j\alpha} = 0$ for $d$ small enough.

(ii) The changes in the inverse Mills ratio in a neighborhood of the individual satisfy
$$\frac{\lambda(z_{ij}'\beta + \theta_{j\alpha_i} + \theta_j) - \lambda(z_{ij}'\beta)}{\theta_{j\alpha_i} + \theta_j} = \lambda'(c_i) = \lambda'(c_k) = \frac{\lambda(z_{kj}'\beta + \theta_{j\alpha_k} + \theta_j) - \lambda(z_{kj}'\beta)}{\theta_{j\alpha_k} + \theta_j} \qquad (8)$$

for $i$ and $k$ in a neighborhood $d$ small enough, with $\theta_{j\alpha_i} + \theta_j$ and $\theta_{j\alpha_k} + \theta_j$ both different from 0, where $\lambda'(\cdot)$ is the first derivative of the inverse Mills ratio, and $c_i$ and $c_k$ lie, respectively, in the intervals $[z_{ij}'\beta,\, z_{ij}'\beta + \theta_{j\alpha_i} + \theta_j]$ and $[z_{kj}'\beta,\, z_{kj}'\beta + \theta_{j\alpha_k} + \theta_j]$ such that Equation (8) holds.

Assumption I3 (i) is similar to Assumption I1. It seems plausible that if that assumption holds for the outcome equation, it will hold for the selection equation as well.

Assumption I3 (ii) is novel and one of the contributions of this paper. It assumes that if the exact Taylor expansion is applied to the individual inverse Mills ratio for individuals $i$ and $k$ in location $j$, the intermediate points $c_i$ and $c_k$ should be similar. If the level of nonlinearity of $\lambda(\cdot)$ is low, then the assumption will also hold. In the extreme case of local linearity of the inverse Mills ratio, Assumption I3 (ii) holds exactly.

The combination of Assumptions I3 (i) and I3 (ii) implies that

$$\lambda(z_{ij}'\beta + \theta_{j\alpha_i} + \theta_j) - \lambda(z_{ij}'\beta) = \lambda(z_{kj}'\beta + \theta_{j\alpha_k} + \theta_j) - \lambda(z_{kj}'\beta).$$

Thus, $\Delta_d \lambda(z_{ij}'\beta) = \Delta_d \lambda(z_{ij}'\beta + \theta_{j\alpha} + \theta_j)$.

Theorem 1.
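Assumption I3 (ii) rests on the mean-value form of the inverse Mills ratio, whose derivative has the well-known closed form $\lambda'(c) = -\lambda(c)\,[c + \lambda(c)]$. A quick numerical check (our own, with arbitrarily chosen evaluation points) of the closed form and of the difference-quotient step behind the assumption:

```python
import numpy as np
from scipy.stats import norm

def inv_mills(c):
    # lambda(c) = phi(c) / Phi(c)
    return norm.pdf(c) / norm.cdf(c)

def inv_mills_prime(c):
    # Closed form of the derivative: lambda'(c) = -lambda(c) * (c + lambda(c))
    lam = inv_mills(c)
    return -lam * (c + lam)

c, h = 0.7, 1e-6
fd = (inv_mills(c + h) - inv_mills(c - h)) / (2 * h)  # central finite difference
print(fd, inv_mills_prime(c))                          # the two agree closely

# Mean-value step behind Assumption I3(ii): for a small shift theta,
# [lambda(c + theta) - lambda(c)] / theta = lambda'(c_i) for some c_i in [c, c + theta].
theta = 0.05
quotient = (inv_mills(c + theta) - inv_mills(c)) / theta
print(quotient, inv_mills_prime(c))                    # close when theta is small
```

The second pair of numbers illustrates why the assumption is mild when the fixed effects $\theta_{j\alpha} + \theta_j$ are small: the difference quotient is then pinned close to $\lambda'$ at the observed index.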
Let us consider the sample selection model presented in Equation (2). Under Assumptions I1 to I3, the parameters $\delta$ and $\rho$ are identified.

Proof of Theorem 1
We have already shown that under Assumptions I1 and I2 we can obtain Equation (7). Applying Assumption I3 to Equation (7) leads to the following equation:

$$E[\Delta_d y_{ij} \mid \xi_{ij}, \xi_{kj}] = \Delta_d x_{ij}'\delta + \rho\,\Delta_d \lambda(z_{ij}'\beta). \qquad (9)$$

Thus, Assumptions I1 to I3 are sufficient for the identification of $\delta$ and $\rho$.

We have derived the results using the pairwise spatial difference operator. However, the identification result holds for other spatial difference operators as well. In the case of the average or kernel difference operator, the conditioning in Equation (9) is on $\xi_{kj}$ with $k \in N_{id}$ for the average difference operator, and on $k$ in the full sample for the kernel operator. Note that under Assumptions I1 and I3, any difference of a weighted average in a neighborhood of the individual will enable us to remove the sub-location specific effect. The conditional expectation presented in Equation (9) depends on exogenous observable variables and the parameters of interest.

2.2 Estimation and Asymptotic Properties

In this section, we present an estimation procedure and derive asymptotic properties of the proposed estimator. The estimation procedure involves two steps. In the first step, a probit model is estimated and the inverse Mills ratio predicted. In the second step, a spatial difference operator differences out both the location and the sub-location specific unobserved heterogeneity. The model is then estimated using an ordinary least squares estimator. With a sample of $N$ individuals, the estimation procedure is as follows:

Step 1:
Estimate $\beta$ by probit with location effects $\theta_j$, and calculate $\hat\lambda_i = \lambda(z_{ij}'\hat\beta)$.
Estimate $\delta$ and $\rho$ in the OLS regression

$$\Delta_d y_{ij} = \Delta_d x_{ij}'\delta + \rho\,\Delta_d \lambda(z_{ij}'\hat\beta) + w_{ikj}. \qquad (10)$$

Since we used spatial differencing and $\lambda(z_{ij}'\hat\beta)$ is estimated in the first step, a particular structure of the variance-covariance matrix emerges. Therefore, we also need to derive the correct estimator of the standard errors, which we do in Section 2.3.

We will now show that the estimator obtained by the above procedure is consistent and asymptotically normal. To derive the asymptotic properties we use arguments similar to those used to derive the asymptotic properties of clustered standard errors. Specifically, the population size of each location is assumed random and bounded almost surely, and the law of large numbers is applied by letting the number of locations (clusters in the case of clustered standard errors) go to infinity.

We consider a generic matrix of spatial differences $\Delta$. The matrix form of Equation (10) can be expressed as

$$\Delta y = \Delta x'\delta + \rho\,\Delta \lambda(z'\hat\beta) + \Delta \eta \qquad (11)$$

where the $\eta_{ij}$ are the same errors as in standard sample selection models. Let us denote $\theta = (\delta, \rho)'$ and $W = [x', \lambda(z'\hat\beta)]$. The simplified estimation Equation (11) is $\Delta y = \Delta W \theta + \Delta \eta$, and the OLS estimator of $\theta$ is

$$\hat\theta = [(\Delta W)'\Delta W]^{-1}[(\Delta W)'\Delta y]. \qquad (12)$$

The spatial nature of the data implies that an observation $k$ with $n$ neighbours may appear in several pairs. This induces correlation in the error term $\Delta\eta$ for all $n$ of these pairs because of the spatial differencing in the second step of the estimation procedure. As a result, a particular structure of the covariance matrix emerges, and we need to take it into account when calculating the standard errors.

To proceed further, we need to introduce the assumptions under which the asymptotic properties of our estimator are derived.

Assumption E1:
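The two steps can be sketched as follows. This is our own illustrative implementation on a simulated toy sample, not the authors' code: the probit likelihood is hand-rolled, a single location-level effect stands in for $\gamma_j + \gamma_{j\alpha}$, the first-step probit omits location dummies, and consecutive within-location pairs of selected observations stand in for distance-based neighbours.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)

# --- Toy clustered data (hypothetical DGP): many locations, few individuals ---
J, m = 80, 4
n = J * m
loc = np.repeat(np.arange(J), m)
z = rng.standard_normal(n)                  # excluded instrument for selection
x = rng.standard_normal(n)
gamma = 0.2 * rng.standard_normal(J)[loc]   # unobserved effect, shared within location
e1 = rng.standard_normal(n)
e2 = 0.5 * e1 + np.sqrt(1 - 0.5**2) * rng.standard_normal(n)
sel = (z + gamma + e1) > 0                  # selection equation
y = 2.0 * x + gamma + e2                    # outcome, used only where sel is True

# --- Step 1: probit of sel on (1, z); hand-rolled negative log-likelihood ---
Z = np.column_stack([np.ones(n), z])
def negll(b):
    q = 2 * sel - 1                         # +1 if selected, -1 otherwise
    return -np.sum(norm.logcdf(q * (Z @ b)))
beta_hat = minimize(negll, np.zeros(2), method="BFGS").x
lam = norm.pdf(Z @ beta_hat) / norm.cdf(Z @ beta_hat)   # inverse Mills ratio

# --- Step 2: difference selected observations within a location, then OLS ---
left, right = [], []
for j in range(J):
    idx = np.where((loc == j) & sel)[0]
    left += list(idx[:-1]); right += list(idx[1:])      # consecutive pairs
left, right = np.array(left), np.array(right)
dy = y[left] - y[right]                                  # gamma differences out
dW = np.column_stack([x[left] - x[right], lam[left] - lam[right]])
theta_hat = np.linalg.lstsq(dW, dy, rcond=None)[0]
print(theta_hat)  # delta_hat should be near 2; rho_hat is noisy at this sample size
```

Note that the overlapping pairs built in Step 2 are exactly what induces the correlation across pairs discussed above, which is why OLS standard errors from this regression would be wrong without the correction of Section 2.3.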
The sample is formed of $N$ individuals from the population.

(i) We observe $\{x_{ij}, z_{ij}\}$, independent and identically distributed random variables with $i = 1, \ldots, N$ and $j = 1, \ldots, J$.

(ii) The number of individuals in a location $j$, $N_j$, is exogenous, random, identically distributed with $N_j < n$ almost surely and $E(N_j) < \infty$, where $n$ is a scalar.

(iii) The outcomes and the latent variables are independent across locations, i.e. for $j \neq j'$ the variables $y_{ij} \perp y_{i'j'}$ and $y_{ij}^* \perp y_{i'j'}^*$.

The variables without subscripts represent vectors or matrices of all observations in the sample. We use the notation that $\lambda(z'\hat\beta)$ is a vector with typical element $\lambda(z_{ij}'\hat\beta)$.

An implication of Assumption E1 (i) in conjunction with Assumption I2 is that $\theta_j$ and $\gamma_j$ are iid. However, within a location $j$, there is a certain level of correlation among individuals which operates through $\theta_{j\alpha_i}$ or $\gamma_{j\alpha_i}$. This means that our assumptions restrict how that within-location individual correlation occurs.

Assumption E1 (ii) restricts the location size to be bounded and implies that the number of locations has to grow to achieve a large sample size in our asymptotic calculation. This assumption is similar to those used in the literature on cluster sample asymptotics, and it leads to a "large number of clusters" asymptotic theory similar to the one discussed in Wooldridge (2010), who assumes fixed cluster size. This assumption corresponds to a specific case of Assumption 1 in Hansen and Lee (2019), who allow for different cluster sizes ranging from fixed to infinite. We have, however, derived the asymptotics of our estimator under the more restrictive condition of Assumption E1 (ii). The reason is that it can be proven that under a joint asymptotic ($N, J \to \infty$), Assumption 1 is equivalent to assuming that the size of the sample in each location is bounded.
If we instead allow for a sequential asymptotic where the number of locations is fixed and the sample size goes to infinity, then there exists at least one location with an infinite number of individuals and the inequality used in the proof of Hansen and Lee (2019)'s Theorem 1 becomes invalid.

To better illustrate our argument, let us consider the location sample size proposed by Hansen and Lee (2019): $N_j = N^\alpha$ with $0 \leq \alpha < 1$; we can show that $1 - \alpha = \ln(J)/\ln(N)$. If we allow for a joint asymptotic, $\alpha$ is not defined. If, on the contrary, we assume that the number of locations $J$ is fixed, then $\alpha$ goes to 1. In both cases, relying on Hansen and Lee (2019)'s Assumption 1 does not seem enough to warrant the desired asymptotic regularities.

Assumption E2: $z'$ and $W$ are of full column rank, with each element having finite moments up to the 4th moment.

Theorem 2.
We consider the sample selection model presented in Equation (2), under Assumptions I1 to I3, E1 and E2.

(i) $\hat\theta \to_p \theta$ as $N \to \infty$.

(ii) $\sqrt{N}(\hat\theta - \theta) \to_d N(0, \Theta)$ with $\Theta = C\,\Gamma\,C'$, where $C^{-1} = E\big((\Delta W_{ij})'\Delta W_{ij}\big)$,

$$\Gamma = \rho^2 E\big[(\Delta W_{ij})'\Omega_{ij}\Delta W_{ij}\big] + E\big[(\Delta W_{ij})'\Delta e_{ij}\,\Delta e_{ij}'(\Delta W_{ij})\big],$$

and $\Omega_{ij} = [\lambda'(z_{ij}'\beta)]^2\, z_{ij}' V_\beta z_{ij}$, taking $V_\beta$ as the first-step probit variance-covariance matrix.

Proof of Theorem 2:
In the appendix.

It is important to notice that the same type of asymptotics should be used in a linear model. In this respect, we complement Duranton, Gobillon, and Overman (2011), who propose a correction for the standard errors but do not discuss the asymptotic properties of their estimators. Similarly, Black (1999) and Holmes (1998) use spatial differencing but do not account for the fact that differencing will lead to a correlation between pairs in which an individual is present. Our asymptotic derivations do account for the presence of correlation between pairs, and are valid not only for a model with but also without sample selection (in our model, the absence of selection implies $\rho = 0$). They also have important practical implications: the consistency of the estimator requires a large number of locations $\gamma_j$ and a small number of individuals in each sub-location $\gamma_{j\alpha}$.

This section derives a procedure to estimate the variance-covariance matrix of the estimator in Equation (12), which has a particular structure arising from (i) spatial differencing and (ii) Heckman's two-step estimation procedure.

We consider $B = [(\Delta W)'\Delta W]^{-1}$ and $\Sigma = \mathrm{Var}[(\Delta W)'\Delta\eta]$ such that the conditional variance-covariance matrix of $\hat\theta$ is

$$\mathrm{Var}(\hat\theta) = B\,\Sigma\,B'.$$

Note that $\Sigma = (\Delta W)'\,\mathrm{Var}(\Delta\eta)\,(\Delta W)$. This means that we need a consistent estimator of $\mathrm{Var}(\Delta\eta)$ to compute correct standard errors for $\hat\theta$.

Let us consider that $\mathrm{Var}(\Delta\eta) = V_1 + V_2$ with

$$V_1 = \Delta\,\mathrm{Var}(e)\,\Delta' = \rho^2\,\Delta R\,\Delta'$$

where $R$ is a diagonal matrix of dimension $N$ (the total number of observations), with $d_{ij} = -\lambda(z_{ij}'\beta)[z_{ij}'\beta + \lambda(z_{ij}'\beta)]$ as the diagonal elements, and

$$V_2 = \rho^2\,\Delta\,\mathrm{Var}\big[\lambda(z'\hat\beta) - \lambda(z'\beta)\big]\,\Delta' = \rho^2\,\Delta D z V_\beta z' D \Delta'$$

where $D$ is the square diagonal matrix of dimension $N$ with $1 - d_{ij}$ as the diagonal elements, $z$ is the data matrix of the selection equation, and $V_\beta$ is the variance-covariance estimate from the probit estimation of the selection equation.

Theorem 3.
We consider the sample selection model presented in Equation 2. Under Assumptions I1 to I3, E1 and E2, the variance-covariance estimator of $\hat\theta$ is given by
$$V_{twostep} = B\,(\Delta W)'\,[\hat V_1 + \hat V_2]\,(\Delta W)\,B' \qquad (13)$$
where $\hat V_1 = \hat\rho^2\,\Delta\hat R\,\Delta'$ and $\hat V_2 = \hat\rho^2\,\Delta\hat D z \hat V_\beta z' \hat D\,\Delta'$, with all unknown parameters replaced by their estimates. Moreover, this is a consistent estimator of $Var(\hat\theta)$.

Proof of Theorem 3:
The result holds by construction.
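To make the estimator in Theorem 3 concrete, the following is a minimal NumPy sketch of Equation (13) under our reading of the matrices $R$ and $D$ above. The differencing operator $\Delta$, the differenced regressors $\Delta W$, the selection-equation data $z$, and the first-step probit estimates ($\hat\beta$, $V_\beta$, $\hat\rho$) are taken as given; all function names here are ours, not the paper's.

```python
import numpy as np
from scipy.stats import norm

def inverse_mills(u):
    """Inverse Mills ratio: lambda(u) = phi(u) / Phi(u)."""
    return norm.pdf(u) / norm.cdf(u)

def twostep_vcov(DW, Delta, z, beta_hat, V_beta, rho_hat):
    """Sketch of Equation (13): V = B (DW)' [V1 + V2] (DW) B'.

    DW     : differenced regressors Delta @ W   (pairs x k)
    Delta  : spatial differencing operator      (pairs x N)
    z      : selection-equation regressors      (N x q)
    """
    u = z @ beta_hat
    lam = inverse_mills(u)
    d = lam * (u + lam)            # d_ij = lambda(u)[u + lambda(u)], lies in (0, 1)
    R = np.diag(1.0 - d)           # diagonal of R-hat
    D = np.diag(-d)                # lambda'(u) = -d_ij on the diagonal of D-hat
    V1 = rho_hat**2 * Delta @ R @ Delta.T
    V2 = rho_hat**2 * Delta @ (D @ z @ V_beta @ z.T @ D) @ Delta.T
    B = np.linalg.inv(DW.T @ DW)
    return B @ (DW.T @ (V1 + V2) @ DW) @ B.T
```

Since $0 < d_{ij} < 1$, both $V_1$ and $V_2$ are positive semi-definite by construction, so the diagonal of the returned matrix is non-negative and its square roots can be reported as standard errors.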
In this section we present the results of Monte Carlo simulations to (i) describe the behavior of the estimator proposed in this paper and (ii) offer empirical guidance for applied research. Regarding the latter, we pay close attention to the implication of Assumption E1(ii), according to which it is important to have a large number of locations relative to the number of individuals in the sub-locations. The Monte Carlo experiments offer empirical guidance as to when the number of locations is large enough.

The estimator developed in this paper is referred to as the "Sub-location Differencing" estimator, and it accounts for the sub-location specific effect $\gamma_{j\alpha}$. To highlight its features, we compare it to two other estimators. One ignores the presence of both $\gamma_j$ and $\gamma_{j\alpha}$ and applies a simple two-step estimator with no spatial differencing; we call it the "No-Differencing" estimator. The other accounts only for the location fixed effect $\gamma_j$; we call it the "Location Differencing" estimator. For each estimator, the mean bias and the coverage rate of the 95% confidence level test are reported in Tables I to III.

The data are obtained using the following data generating process. We assume that there are $J = 20$, $30$, $100$ non-overlapping locations, each location is divided into $s = 2$, $4$, $8$ sub-locations, and there are $n_j = 3$, $5$, $8$, $10$ individuals sharing the same sub-location. The latent variables are
$$y^*_{1ij} = z_{ij}\beta + \theta_{js} + \theta_j + \varepsilon_{1ij} \quad \text{and} \quad y^*_{2ij} = x_{ij}\delta + \gamma_{js} + \gamma_j + \varepsilon_{2ij},$$
where $\theta_{js} = 10 - j \times s$ and $\gamma_{js} = 5\,j \times s$ are the sub-location specific effects, while $\theta_j = 10 - j$ and $\gamma_j = 10\,j$ are the location effects; for all $i$ and $j$, $x_{ij} \sim N(0,1)$ and $z_{ij} \sim U(0,1)$, each drawn independently; $\delta = 1$ and $\beta = 0.2$. The error terms in both equations are generated, for all $i$ and $j$, as $\varepsilon_{1ij} \sim N(0,1)$ and $\varepsilon_{2ij} = \rho\,\varepsilon_{1ij} + v_{ij}$, where $v_{ij} \sim N(0,1)$ is independent of $\varepsilon_{1ij}$ and $\rho = 0.$

There is room for improvement concerning our inference strategy. Cluster-robust inference is part of a large and growing literature, and our work gives some insight as to how differencing can be used in cross-sectional data. Future work will investigate the importance of heteroscedasticity, and small-sample procedures such as the bootstrap will be used to improve inference.
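The data generating process above can be sketched directly. Note that the value of $\rho$ used below is an illustrative assumption, since its exact value is not restated here:

```python
import numpy as np

def simulate_dgp(J=20, s=2, n=3, rho=0.5, seed=0):
    """One draw from the Monte Carlo design: J locations, s sub-locations
    per location, n individuals per sub-location. rho = 0.5 is our
    illustrative choice, not a value taken from the design."""
    rng = np.random.default_rng(seed)
    beta, delta = 0.2, 1.0
    data = []
    for j in range(1, J + 1):
        theta_j, gamma_j = 10 - j, 10 * j               # location effects
        for a in range(1, s + 1):
            theta_ja, gamma_ja = 10 - j * a, 5 * j * a  # sub-location effects
            for _ in range(n):
                x, z = rng.normal(), rng.uniform()
                eps1 = rng.normal()
                eps2 = rho * eps1 + rng.normal()        # correlated errors
                selected = int(z * beta + theta_ja + theta_j + eps1 > 0)
                outcome = x * delta + gamma_ja + gamma_j + eps2
                data.append((j, a, x, z, selected, outcome))
    return data
```

The outcome is treated as observed only when the selection indicator equals one; for example, `simulate_dgp(J=100, s=2, n=3)` gives a 600-individual sample.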
Table I: Mean bias and coverage rate of the 95% test, J = 20 locations

Numb. of sub-locations  Sub-location size  Estimator                  Mean bias  Coverage rate
2                       3                  No-Differencing            -0.079     95.8
                                           Location Differencing      -0.393     74.2
                                           Sub-location Differencing  -0.039     82.2
2                       5                  No-Differencing            -0.344     94.5
                                           Location Differencing      -0.953     82.1
                                           Sub-location Differencing   0.011     87.0
2                       8                  No-Differencing            -0.173     94.4
                                           Location Differencing      -1.307     90.7
                                           Sub-location Differencing   0.016     79.5
2                       10                 No-Differencing             0.233     95.8
                                           Location Differencing      -2.264     93.6
                                           Sub-location Differencing   0.067     93.6
4                       3                  No-Differencing             0.369     95.3
                                           Location Differencing       0.117     88.1
                                           Sub-location Differencing   0.001     75.5
4                       5                  No-Differencing             0.996     96.0
                                           Location Differencing       2.798     92.7
                                           Sub-location Differencing  -0.019     82.5
4                       8                  No-Differencing             0.085     94.5
                                           Location Differencing       5.824     95.2
                                           Sub-location Differencing  -0.035     80.5
4                       10                 No-Differencing            -0.241     95.6
                                           Location Differencing      -2.494     96.1
                                           Sub-location Differencing  -0.066     88.2
8                       3                  No-Differencing             0.833     94.7
                                           Location Differencing      -0.310     94.9
                                           Sub-location Differencing  -0.005     66.3
8                       5                  No-Differencing            -0.176     93.6
                                           Location Differencing       1.678     96.8
                                           Sub-location Differencing   0.011     76.8
8                       8                  No-Differencing             0.013     95.4
                                           Location Differencing      -0.269     99.0
                                           Sub-location Differencing   0.006     83.3
8                       10                 No-Differencing             0.271     95.2
                                           Location Differencing      -2.573     99.3
                                           Sub-location Differencing  -0.016     86.1
Table II: Mean bias and coverage rate of the 95% test, J = 30 locations

Numb. of sub-locations  Sub-location size  Estimator                  Mean bias  Coverage rate
2                       3                  No-Differencing            -0.392     95.3
                                           Location Differencing       0.497     75.1
                                           Sub-location Differencing  -0.006     78.4
2                       5                  No-Differencing             0.095     95.2
                                           Location Differencing      -0.965     84.5
                                           Sub-location Differencing   0.007     84.8
2                       8                  No-Differencing            -0.108     94.8
                                           Location Differencing       1.330     92.3
                                           Sub-location Differencing  -0.053     85.8
2                       10                 No-Differencing             0.043     94.5
                                           Location Differencing      -2.052     95.8
                                           Sub-location Differencing  -0.028     79.2
4                       3                  No-Differencing            -0.177     95.2
                                           Location Differencing      -1.027     89.5
                                           Sub-location Differencing   0.004     71.3
4                       5                  No-Differencing            -0.227     95.3
                                           Location Differencing      -1.387     92.8
                                           Sub-location Differencing  -0.017     78.7
4                       8                  No-Differencing             0.424     93.6
                                           Location Differencing      -1.437     95.0
                                           Sub-location Differencing   0.012     84.2
4                       10                 No-Differencing            -0.156     95.1
                                           Location Differencing       0.348     96.8
                                           Sub-location Differencing   0.025     78.9
8                       3                  No-Differencing             0.031     95.0
                                           Location Differencing      -0.200     95.0
                                           Sub-location Differencing   0.010     67.4
8                       5                  No-Differencing             0.279     94.4
                                           Location Differencing       2.118     97.0
                                           Sub-location Differencing  -0.008     73.6
8                       8                  No-Differencing            -0.101     93.2
                                           Location Differencing       4.108     98.0
                                           Sub-location Differencing   0.005     80.8
8                       10                 No-Differencing            -1.369     95.6
                                           Location Differencing      -1.866     99.5
                                           Sub-location Differencing  -0.041     83.6
Table III: Mean bias and coverage rate of the 95% test, J = 100 locations

Numb. of sub-locations  Sub-location size  Estimator                  Mean bias  Coverage rate
2                       3                  No-Differencing            -0.702     94.9
                                           Location Differencing       1.017     74.0
                                           Sub-location Differencing   0.007     63.6
2                       5                  No-Differencing            -0.716     95.3
                                           Location Differencing      -2.314     82.0
                                           Sub-location Differencing  -0.001     72.2
2                       8                  No-Differencing            -0.089     94.6
                                           Location Differencing      -5.000     90.5
                                           Sub-location Differencing  -0.005     84.2
2                       10                 No-Differencing             0.198     94.9
                                           Location Differencing      -0.197     96.2
                                           Sub-location Differencing  -0.050     85.3
4                       3                  No-Differencing            -0.096     94.5
                                           Location Differencing       0.858     89.1
                                           Sub-location Differencing  -0.001     59.4
4                       5                  No-Differencing            -0.691     94.7
                                           Location Differencing       0.712     89.6
                                           Sub-location Differencing   0.001     64.0
4                       8                  No-Differencing            -0.177     96.7
                                           Location Differencing       1.216     96.9
                                           Sub-location Differencing   0.015     79.2
4                       10                 No-Differencing            -0.384     94.0
                                           Location Differencing      -10.139    97.9
                                           Sub-location Differencing  -0.075     79.6
8                       3                  No-Differencing            -1.018     95.2
                                           Location Differencing       0.598     94.1
                                           Sub-location Differencing   0.002     53.7
8                       5                  No-Differencing             0.050     95.1
                                           Location Differencing       1.400     95.7
                                           Sub-location Differencing  -0.001     61.0
8                       8                  No-Differencing            -1.093     95.7
                                           Location Differencing      -10.195    98.1
                                           Sub-location Differencing   0.010     75.7
8                       10                 No-Differencing            -0.642     94.7
                                           Location Differencing       6.82      98.9
                                           Sub-location Differencing   0.009     79.1

1. As expected, the "No-Differencing" estimator has a larger mean bias in the presence of spatial heterogeneity. This result holds for both small and large numbers of locations, as well as for few or many individuals $N_{id}$ sharing the same sub-location specific unobserved heterogeneity.

2. The mean bias of the "Sub-location Differencing" estimator is smaller than that of the other estimators. It increases with the number of individuals in the sub-locations and decreases with the number of locations. For example, in a sample of 600 individuals spread across 100 locations with 2 sub-locations and 3 individuals in each sub-location, the mean bias is 0.007.
However, for the same sample size spread across 30 locations with 2 sub-locations and 10 individuals in each sub-location, the mean bias is -0.028.

This section shows the empirical importance of the spatial differencing methodology proposed in the previous sections. To illustrate the importance of our estimator, we ask what determines the tax rates set by regional governing bodies. This question raises an important issue of identification, since circular causation or omitted variable bias leads to biased and inconsistent estimators. We use our spatial differencing method to examine the case of changes in the Finnish local property tax rate at the turn of the millennium.

Finland consists of 411 municipalities (in 1999) spread across 19 regions, and the municipalities choose the property tax rate within the limits set by the central government. In 1999, the central government decided to raise the lower limit for the year 2000 from 0.2% to 0.5%. This change created a probability mass of municipalities at the lower bound: more than half of the municipalities have a taxation rate of 0.5%, making the data sample censored. We investigate what affected municipalities' tax rates in the year 2000.

We estimate the parameters of the outcome equation in the model represented as in Equation (2). Specifically, the outcome variable is the level of the general property tax in municipality i in region j, and the explanatory variables include the municipality's age structure of the population, the level of the municipality's income, received subsidies, the local income tax rate, and a dummy for the region j in which municipality i is located.
The selection equation determines whether the municipality sets its general tax rate at the mandatory minimum of 0.5% or above, and it contains all the variables which are in the outcome equation except for the local income tax rate.

As illustrated in Equation (1), there can be an unobserved sub-location specific effect operating at a finer spatial scale than region j, in our case at the level of the municipalities of which region j consists. Indeed, municipalities' tax levels can depend not only on their population size, income, and the subsidies received from the central government, but also on the level of amenities in the municipality. (There is a large literature which examines a range of factors influencing local tax rates, e.g. Charney (1983), Ashworth and Heyndels (1997), Ross and Yinger (1999), Charlot and Paty (2007), Charlot, Paty, and Piguet (2015), Crowley and Sobel (2011), Baskaran (2014), and Buettner and von Schwerin (2016).) Amenities are usually difficult to measure. More importantly, even if we have a few measures of amenities or their proxies, they might not capture all of them, leaving some amenities unobserved. In our case, unobserved amenities can be correlated with municipalities' population, income level, or the level of subsidies, which implies that not controlling for them renders the estimates biased and inconsistent. Therefore, using the fact that two municipalities from the same region sharing a border are neighbours, we apply our spatial differencing method to tackle this problem.

We estimate Equation (2) with spatial differencing conducted as the difference between municipality i and the average of its neighbours. Columns 1 and 2 present the results without spatial differencing, and columns 3-6 with spatial differencing. Estimation is conducted with and without regional dummies, and with two different estimators of the standard errors: the wild cluster bootstrap, and the spatially-adjusted standard errors derived in Section 2.3.
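The differencing used in this application, municipality i minus the average of its within-region border neighbours, can be sketched as follows; the adjacency mapping passed in is a hypothetical illustration, not the actual Finnish border data:

```python
import numpy as np

def neighbour_diff(neighbours, n):
    """Row i of Delta computes x_i minus the average over i's neighbours.

    neighbours : dict {municipality index: list of neighbour indices
                 within the same region}  (illustrative input)
    n          : total number of municipalities
    """
    Delta = np.zeros((len(neighbours), n))
    for row, (i, nbrs) in enumerate(sorted(neighbours.items())):
        Delta[row, i] = 1.0
        for k in nbrs:
            Delta[row, k] -= 1.0 / len(nbrs)
    return Delta
```

Because each row of Delta sums to zero, any unobserved effect common to a municipality and its neighbours, such as a shared amenity, is removed: `Delta @ (x + c)` equals `Delta @ x` for any constant `c`.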
Clustering of the standard errors is done at the level of region j. Since there are only 19 regions, we use the wild cluster bootstrap procedure developed by Cameron, Gelbach, and Miller (2008), whose properties were studied by, e.g., Davidson and MacKinnon (2010), MacKinnon (2013), and MacKinnon and Webb (2017). Specifically, we use the recently developed wild bootstrap package boottest by Roodman, Nielsen, MacKinnon, and Webb (2019), implemented in Stata.

We begin by discussing the results with spatial differencing, columns 3-6. Columns 3 and 4 present the results when we use spatial differencing with the standard errors calculated using the formula derived in Section 2.3. There is only one significant variable: the share of population older than 75 in column 4. This is not surprising, since our estimator of the variance-covariance matrix is an asymptotic estimator, while the estimation is done on a sample with a small number of clusters (nineteen regions). Therefore we use the wild bootstrap procedure, which is known to be suitable for a small number of clusters. The results using the wild bootstrap are shown in columns 5 and 6, and we see a considerable increase in the number of statistically significant results.

The comparison of columns 5 and 6 with columns 1 and 2 reveals the importance of spatial differencing. Controlling for the sub-location specific unobserved effect $\gamma_{j\alpha}$ by spatial differencing renders four variables statistically significant: the share of population younger than 15, the share of population older than 75, government grants, and the income tax rate. This is in contrast to columns 1 and 2, in which the income tax rate is the only significant variable. Not controlling for $\gamma_{j\alpha}$ leads to an omitted variable bias which, apart from rendering the estimates inconsistent, inflates the standard errors and makes the estimates mostly insignificant. Spatial differencing controls for the omitted variables, which means that they do not 'end up' in the error term and do not inflate the standard errors. In addition to comparing estimates with and without spatial differencing, it is also instructive to compare columns 5 and 6: spatial differencing with and without regional dummies. We see that controlling for the regional unobserved effect $\gamma_j$ adds little once the sub-location specific effects $\gamma_{j\alpha}$ are differenced out. Indeed, the magnitude and the statistical significance of the estimates change very little, and even the one variable that loses its statistical significance after including regional dummies (municipality's income) is only marginally significant without these dummies.

Overall, our empirical analysis shows that controlling for the unobserved municipality effects matters. Estimations which control for spatial unobserved effects only at the regional level suggest that the income tax rate is the only determinant of the general tax rate set by the municipalities. However, after controlling for the sub-location specific unobserved effects, the tax rate depends not only on the income tax rate but also on the age composition of the population: the share of the young as well as the share of the elderly. These results indicate that spatial differencing is an important tool to deal with the omitted variable bias which often plagues empirical studies on local taxation.

Table IV: Determinants of Municipality Taxation Rate
                        No Spatial Differencing     Spatial Differencing
                        Wild bootstrap              Spatially adj. se           Wild bootstrap
                        No Reg.     Reg.            No Reg.     Reg.            No Reg.     Reg.
                        Dummies     Dummies         Dummies     Dummies         Dummies     Dummies
                        (1)         (2)             (3)         (4)             (5)         (6)
Population              -1.897      -0.7047         0.1346      0.2448          0.1346      0.2448
                        [-1.420]    [-1.0432]       [0.0489]    [0.2656]        [0.507]     [0.8793]
Share pop. < 15         0.009       -0.0015         -0.0116     -0.0113         -0.0116**   -0.0113**
                        [0.492]     [-0.0996]       [-0.0933]   [-0.1373]       [-2.536]    [-2.3921]
Share 61 < pop. < 74    -0.0055     -0.0054         -0.0089     -0.0092         -0.0089     -0.0092
                        [-0.923]    [-0.343]        [-0.1106]   [-0.0721]       [-1.495]    [-1.4902]
Share pop. > 75         0.0211      0.0049          -0.0145     -0.0140**       -0.0145*    -0.014*
                        [1.027]     [0.4688]        [-0.1575]   [-1.8771]       [-1.794]    [-1.6746]
Income                  2.40E-07    6.48E-06        1.23E-05    0.00001         1.23E-05*   1.1E-05
                        [0.047]     [0.8092]        [0.0234]    [0.0213]        [1.924]     [1.6688]
Gov. grant              -1.7E-05    1.8E-05         -1.78E-05   -2.16E-05       -1.78E-05   -2.16E-05
                        [-0.868]    [0.2555]        [-0.0458]   [-0.0925]       [-0.483]    [-0.5666]
Income tax rate         0.0350***   0.0396**        0.0482      0.0453          0.0482***   0.0453***
                        [3.383]     [3.547]         [0.2775]    [0.2965]        [3.362]     [3.088]
Inverse Mills ratio     0.3028      0.1731          0.0159      0.0034          0.0159      0.0034
                        [1.505]     [1.1468]        [0.0346]    [0.0298]        [0.854]     [0.1525]
Constant                -0.5327     -0.3601         0.0111      -0.006          0.0111      -0.006
                        [-0.713]    [-0.598]        [1.540]     [-0.230]        [1.540]     [-0.199]
Observations            403         403             273         273             273         273
Regional Dummies        NO          YES             NO          YES             NO          YES
Number of Dummies       19          19              19          19              19          19
R-squared               0.197       0.271           0.248       0.279           0.248       0.279
Source: see text. Note: t-statistics in brackets; *** p<0.01, ** p<0.05, * p<0.1.

This paper has investigated a sample selection model with unobserved heterogeneity at a very fine location level. It proposes spatial differencing as an alternative identification strategy when instrumental variables and/or panel data are not available. We discuss the assumptions under which the parameters of the model are identified. The estimation of the parameters is done using the classic Heckman two-step estimation procedure. The differencing and the two-step procedure lead to a novel estimator with properties that are also relevant for spatial differencing in linear models. To understand the behavior of the new estimator, we derive a cluster asymptotic theory for it. The derivation reveals two important implications for its empirical implementation: (i) the number of clusters needs to be large for inference to be based on the normal distribution; (ii) each cluster should have a bounded number of individuals.

Monte Carlo experiments show that accounting for sub-location specific heterogeneity is crucial for identification. They also confirm the estimator's properties derived in our asymptotic theory. In particular, the estimator performs better with an increasing number of locations and fewer individuals in the sub-locations. In addition, when sub-locations are ignored and spatial differencing is applied only to more aggregate geographical units subsuming them, the mean bias is larger. The coverage rate of the test based on the corrected standard error is lower than the theoretical one.

In the empirical application, which looked at the determinants of the municipal tax rate, we show that using spatial differencing in combination with cluster wild bootstrap inference tools can be extremely useful. Indeed, the new estimator reveals several determinants of the municipal tax rate that would have been missed otherwise.
The development of a bootstrap appropriate for sample selection models is left for future research.

Appendix

Proof of Theorem 2:
The proof is written conditional on the set of numbers of individuals in the locations. Thus, when $E(N_j)$ is used, it can be considered a constant.

The substitution of the true value of $\Delta y$ in Equation (12) yields the equality
$$\hat\theta = \theta + [(\Delta W)'\Delta W]^{-1}[(\Delta W)'\Delta\eta].$$
Let us assume that $y_{ij} = x'_{ij}\delta + \gamma_{j\alpha} + \gamma_j + \rho\lambda(z'_{ij}\beta + \theta_{j\alpha} + \theta_j) + e_{ij}$ with $E(e_{ij}\mid\xi_{ij}) = 0$. Thus, $\Delta y_{ij} = \Delta x'_{ij}\delta + \rho\Delta\lambda(z'_{ij}\beta + \theta_{j\alpha} + \theta_j) + \Delta e_{ij}$. Under the identification assumptions I1 to I3 we have $\Delta y_{ij} = \Delta x'_{ij}\delta + \rho\Delta\lambda(z'_{ij}\beta) + \Delta e_{ij}$.

The second-step regression equation is equivalent to
$$\Delta y_{ij} = \Delta x'_{ij}\delta + \rho\Delta\lambda(z'_{ij}\hat\beta) + \Delta\big[\rho\big(\lambda(z'_{ij}\beta) - \lambda(z'_{ij}\hat\beta)\big) + e_{ij}\big].$$
$\hat\beta$ is estimated by maximum likelihood probit in the first step, on the full sample of size $N_0$, with variance-covariance matrix $V_\beta$, while the second step uses the selected sample of size $N$; we assume that the ratio $N/N_0$ converges. Given that $\lambda(\cdot)$ is twice differentiable, the continuous mapping theorem implies that $\lambda(z'_{ij}\beta) - \lambda(z'_{ij}\hat\beta)$ goes to zero in probability and is asymptotically normal. We therefore have
$$\sqrt{N_0}\big(\lambda(z'_{ij}\beta) - \lambda(z'_{ij}\hat\beta)\big) \to_d N(0,\Omega_{ij}), \qquad (14)$$
where $\Omega_{ij} = [\lambda'(z'_{ij}\beta)]^2 z'_{ij} V_\beta z_{ij}$.

We are interested in the limiting distribution of $\sqrt{N}(\hat\theta - \theta)$:
$$\sqrt{N}(\hat\theta - \theta) = N[(\Delta W)'\Delta W]^{-1}\frac{1}{\sqrt{N}}(\Delta W)'\Delta\eta = \left[\frac{\sum_{i=1}^{N}(\Delta W_i)'\Delta W_i}{N}\right]^{-1}\frac{1}{\sqrt{N}}\sum_{i=1}^{N}(\Delta W_i)'\Delta\eta_i.$$
While the $W_i$ are iid, the $\Delta W_i$ are not independent, because an individual is allowed to appear in many pairs. We therefore have to use a LLN and a CLT for non-independent random variables. The dependence structure is driven by the operator $\Delta$. If $\Delta$ is such that each individual appears in only one pair, then the classical CLT and LLN could be applied. However, if individuals are allowed to appear in several pairs, then we need to apply a CLT and LLN accounting for the correlation.
$$\frac{\sum_{i=1}^{N}(\Delta W_i)'\Delta W_i}{N} = \frac{\sum_{j=1}^{J}\sum_{k=1}^{N_j}(\Delta W_{kj})'\Delta W_{kj}}{N} = \frac{1}{J}\sum_{j=1}^{J}\frac{1}{E(N_j)}\sum_{k=1}^{N_j}(\Delta W_{kj})'\Delta W_{kj}.$$
Let us consider $Y_j = \frac{1}{E(N_j)}\sum_{k=1}^{N_j}(\Delta W_{kj})'\Delta W_{kj}$; these variables are iid. Moreover, note that $N = N_1 + N_2 + \dots + N_J = J\,E(N_j)$. Under Assumption E1, all locations have a bounded maximum capacity $N_j < n$, with $n$ a scalar, and all second moments of the variables in $W$ exist (Assumption E2).

$\frac{1}{J}\sum_{j=1}^{J} Y_j$ is a matrix; thus, the law of large numbers applies to it if and only if it applies to all its elements. Let $a_j$ be a typical element of the matrix $Y_j$, and let $t$ and $m$ be two variables from the set of variables forming $W$. For example, we can consider $t = x_1$, the first column of the random variable $x$.
If $t \neq m$, then
$$E|a_j| \le \frac{1}{E(N_j)}\sum_{k=1}^{N_j}E|\Delta t_{kj}\,\Delta m_{kj}| \qquad (15)$$
$$= \frac{1}{E(N_j)}\sum_{k=1}^{N_j}E|(t_{kj}-t_{ij})(m_{kj}-m_{ij})| \qquad (16)$$
$$\le \frac{4}{E(N_j)}\sum_{k=1}^{N_j}E|t_{kj}m_{kj}| \qquad (17)$$
$$\le \frac{4}{E(N_j)}\sum_{k=1}^{N_j}\sqrt{E(|t_{kj}|^2)\,E(|m_{kj}|^2)} \qquad (18)$$
$$\le M, \qquad (19)$$
with $M$ a constant. The result is obtained by using successively the triangle inequality, the identical distribution of the variables in $W$, the Cauchy-Schwarz inequality, and the existence of moments up to the fourth (which implies that the second moments exist). If $t = m$ we have
$$E|a_j| \le \frac{1}{E(N_j)}\sum_{k=1}^{N_j}E|\Delta t_{kj}|^2 \qquad (20)$$
$$= \frac{1}{E(N_j)}\sum_{k=1}^{N_j}E|t_{kj}-t_{ij}|^2 \qquad (21)$$
$$\le \frac{2}{E(N_j)}\sum_{k=1}^{N_j}\big[E|t_{kj}t_{ij}| + E(t_{kj}^2)\big] \qquad (22)$$
$$\le \frac{4}{E(N_j)}\sum_{k=1}^{N_j}E(t_{kj}^2) \qquad (23)$$
$$\le M. \qquad (24)$$
Thus, the LLN implies that
$$\frac{\sum_{i=1}^{N}(\Delta W_i)'\Delta W_i}{N} \to_p E[(\Delta W_{ij})'\Delta W_{ij}] = C^{-1}.$$
We can also show that
$$\frac{1}{\sqrt{N}}\sum_{i=1}^{N}(\Delta W_i)'\Delta\eta_i = \frac{\rho}{\sqrt{N}}\sum_{i=1}^{N}(\Delta W_i)'\Delta\big(\lambda(z'_{ij}\beta)-\lambda(z'_{ij}\hat\beta)\big) + \frac{1}{\sqrt{N}}\sum_{i=1}^{N}(\Delta W_i)'\Delta e_{ij}.$$
We consider $\Lambda_j = \sum_{k=1}^{N_j}(\Delta W_{kj})'\Delta\big(\lambda(z'_{kj}\beta)-\lambda(z'_{kj}\hat\beta)\big)$ and $E_j = \sum_{k=1}^{N_j}(\Delta W_{kj})'\Delta e_{kj}$. Conditional on $\hat\beta$, the $\Lambda_j$ are iid random variables, and so are the $E_j$. We assume that the number of individuals in a group is iid with finite mean $E(N_j)$. (The application of the LLN implies, for consistency reasons, that $N/J \to_p E(N_j)$; thus $J\,E(N_j) \approx N$.) Given that all locations are assumed to be disjoint,
$$\frac{1}{\sqrt{N}}\sum_{i=1}^{N}(\Delta W_i)'\Delta\eta_i = \frac{\rho}{\sqrt{N}}\sum_{j=1}^{J}\Lambda_j + \frac{1}{\sqrt{N}}\sum_{j=1}^{J}E_j.$$
We have $E(E_j) = 0$ for each $j$. Moreover,
$$Var(E_j) = E\Big[\Big(\sum_{k=1}^{N_j}(\Delta W_{kj})'\Delta e_{kj}\Big)\Big(\sum_{k=1}^{N_j}(\Delta W_{kj})'\Delta e_{kj}\Big)'\Big] = E\Big[\sum_{k=1}^{N_j}(\Delta W_{kj})'(\Delta e_{kj})^2\,\Delta W_{kj}\Big] = E(N_j)\,E[(\Delta W_{kj})'(\Delta e_{kj})^2\,\Delta W_{kj}].$$
Under Assumption E2,
$Var(E_j)$ is finite, because all variables have moments up to the fourth. Indeed, if we consider a typical element of $E[(\Delta W_{kj})'(\Delta e_{kj})^2\,\Delta W_{kj}]$, formed by the variables $t$ and $m$,
$$E[(t_{kj}-t_{ij})(m_{kj}-m_{ij})(\Delta e_{kj})^2] \le 4\,E|t_{kj}m_{kj}(\Delta e_{kj})^2| \le 4\big[E(|t_{kj}|^4)\,E((\Delta e_{kj})^4)\big]^{1/4}\big[E(|m_{kj}|^4)\,E((\Delta e_{kj})^4)\big]^{1/4} \le M.$$
It should be noted that $E[(\Delta e_{kj})^2] = 2E(e_{kj}^2) < \infty$.

Similarly, we can show that $E(\Lambda_j) = 0$ and
$$Var(\Lambda_j) = E\Big[\Big(\sum_{k=1}^{N_j}(\Delta W_{kj})'\Delta\big(\lambda(z'_{kj}\beta)-\lambda(z'_{kj}\hat\beta)\big)\Big)\Big(\sum_{k=1}^{N_j}(\Delta W_{kj})'\Delta\big(\lambda(z'_{kj}\beta)-\lambda(z'_{kj}\hat\beta)\big)\Big)'\Big] \qquad (25)$$
$$= E\Big[\sum_{k=1}^{N_j}(\Delta W_{kj})'\,\Delta\big(\lambda(z'_{kj}\beta)-\lambda(z'_{kj}\hat\beta)\big)\,\Delta\big(\lambda(z'_{kj}\beta)-\lambda(z'_{kj}\hat\beta)\big)'\,\Delta W_{kj}\Big] \qquad (26)$$
$$= E(N_j)\,E[(\Delta W_{kj})'\,\Omega_{kj}\,\Delta W_{kj}]. \qquad (27)$$
We need to show that $E[(\Delta W_{kj})'\Omega_{kj}\Delta W_{kj}]$, with $\Omega_{kj} = [\lambda'(z'_{kj}\beta)]^2 z'_{kj} V_\beta z_{kj}$, is finite. A typical element of this matrix is $E[(t_{kj}-t_{ij})\,\Omega_{kj}\,(m_{kj}-m_{ij})]$. We can show the following, using the Cauchy-Schwarz inequality:
$$E[(t_{kj}-t_{ij})\,\Omega_{kj}\,(m_{kj}-m_{ij})] \le 4\,E[|t_{kj}m_{kj}|\,\Omega_{kj}] \qquad (28)$$
$$\le 4\,E[|t_{kj}m_{kj}|\,z'_{kj}V_\beta z_{kj}] \qquad (29)$$
$$\le 4\,\big[E(|t_{kj}|^4)\big]^{1/4}\big[E(|m_{kj}|^4)\big]^{1/4}\big[E\big((z'_{kj}V_\beta z_{kj})^2\big)\big]^{1/2} \qquad (30)$$
$$< \infty, \qquad (31)$$
which follows from noting that $|\lambda'(\cdot)| \le 1$ and that the variables in $z$ have moments up to the fourth. The typical element $E[(t_{kj}-t_{ij})\Omega_{kj}(m_{kj}-m_{ij})]$ is therefore finite, which proves that the variance is finite.

It is important to notice that, conditional on $W$, $\sum_{i=1}^{N}(\Delta W_i)'\Delta\big(\lambda(z'_{ij}\beta)-\lambda(z'_{ij}\hat\beta)\big)$ and $\sum_{i=1}^{N}(\Delta W_i)'\Delta e_{ij}$ are independent random variables.
Therefore,
$$\frac{1}{\sqrt{N}}\sum_{i=1}^{N}(\Delta W_i)'\Delta\eta_i \to_d N(0,\Gamma), \qquad (32)$$
where $\Gamma = \rho^2 E[(\Delta W_{ij})'\Omega_{ij}\Delta W_{ij}] + E[(\Delta W_{ij})'(\Delta e_{ij})^2\Delta W_{ij}]$, and
$$\sqrt{N}(\hat\theta - \theta) \to_d N(0,\Theta) \qquad (33)$$
with $\Theta = C\Gamma C'$. This proves the asymptotic normality of our two-step estimator.

We have proven that under Assumptions I1, I2, I3, E1 and E2,
$$\frac{\sum_{i=1}^{N}(\Delta W_i)'\Delta W_i}{N} \to_p E[(\Delta W)'\Delta W] = C^{-1}.$$
Using similar arguments, we can show that
$$\frac{\sum_{i=1}^{N}(\Delta W_i)'\Delta\eta_i}{N} \to_p E[(\Delta W)'\Delta\eta] = 0,$$
which means that $\hat\theta$ is a consistent estimator of $\theta$. The estimator is thus both consistent and asymptotically normal. This ends the proof of Theorem 2.

References

Ashworth, J., and B. Heyndels (1997): “Politicians’ preferences on local tax rates: an empirical analysis,”
European Journal of Political Economy, 13(3), 479–502.
Baskaran, T. (2014): “Identifying local tax mimicking with administrative borders and a policy reform,”
Journal of Public Economics, 118, 41–51.
Black, S. E. (1999): “Do Better Schools Matter? Parental Valuation of Elementary Education,”
The Quarterly Journal of Economics, 114(2), 577–599.
Buettner, T., and A. von Schwerin (2016): “Yardstick competition and partial coordination: Exploring the empirical distribution of local business tax rates,”
Journal of Economic Behavior & Organization, 124, 178–201.
Cameron, A. C., J. B. Gelbach, and D. L. Miller (2008): “Bootstrap-based improvements for inference with clustered errors,”
The Review of Economics and Statistics, 90(3), 414–427.
Charlot, S., and S. Paty (2007): “Market access effect and local tax setting: evidence from French panel data,”
Journal of Economic Geography, 7(3), 247–263.
Charlot, S., S. Paty, and V. Piguet (2015): “Does fiscal cooperation increase local tax rates in urban areas?,”
Regional Studies, 49(10), 1706–1721.
Charney, A. H. (1983): “Intraurban manufacturing location decisions and local tax differentials,”
Journal of Urban Economics, 14(2), 184–205.
Crowley, G. R., and R. S. Sobel (2011): “Does fiscal decentralization constrain Leviathan? New evidence from local property tax competition,”
Public Choice, 149(1-2), 5.
Davidson, R., and J. G. MacKinnon (2010): “Wild bootstrap tests for IV regression,”
Journal of Business & Economic Statistics, 28(1), 128–144.
Djogbenou, A. A., J. G. MacKinnon, and M. Ø. Nielsen (2019): “Asymptotic theory and wild bootstrap inference with clustered errors,”
Journal of Econometrics, 212(2), 393–412.
Duranton, G., L. Gobillon, and H. G. Overman (2011): “Assessing the Effects of Local Taxation using Microgeographic Data,”
The Economic Journal, 121(555), 1017–1046.
Dustmann, C., and M. E. Rochina-Barrachina (2007): “Selection correction in panel data models: An application to the estimation of females’ wage equations,”
The Econometrics Journal, 10(2), 263–293.
Gibbons, S., and S. Machin (2003): “Valuing English primary schools,”
Journal of Urban Economics, 53(2), 197–219.
Hansen, B. E., and S. Lee (2019): “Asymptotic theory for clustered samples,”
Journal of Econometrics, 210(2), 268–290.
Heckman, J. (1974): “Shadow Prices, Market Wages, and Labor Supply,”
Econometrica, 42(4), 679–694.

Heckman, J. J. (1979): “Sample Selection Bias as a Specification Error,” Econometrica, 47(1), 153–161.
Holmes, T. J. (1998): “The Effect of State Policies on the Location of Manufacturing: Evidence from State Borders,”
Journal of Political Economy, 106(4), 667–705.
Kyriazidou, E. (1997): “Estimation of a Panel Data Sample Selection Model,”
Econometrica, 65(6), 1335–1364.
Lee, M.-j. (2001): “First-difference estimator for panel censored-selection models,”
Economics Letters, 70(1), 43–49.
MacKinnon, J. G. (2013): “Thirty years of heteroskedasticity-robust inference,” in
Recent advances and future directions in causality, prediction, and specification analysis, pp. 437–461. Springer.
MacKinnon, J. G., and M. D. Webb (2017): “Wild bootstrap inference for wildly different cluster sizes,”
Journal of Applied Econometrics, 32(2), 233–254.
Oster, E. (2019): “Unobservable selection and coefficient stability: Theory and evidence,”
Journal of Business & Economic Statistics, 37(2), 187–204.
Rochina-Barrachina, M. E. (1999): “A new estimator for panel data sample selection models,”
Annales d’Economie et de Statistique, pp. 153–181.
Roodman, D., M. Ø. Nielsen, J. G. MacKinnon, and M. D. Webb (2019): “Fast and wild: Bootstrap inference in Stata using boottest,”
The Stata Journal, 19(1), 4–60.

Ross, S., and J. Yinger (1999): “Sorting and voting: A review of the literature on urban public finance,” Handbook of Regional and Urban Economics, 3, 2001–2060.
Todd, P. E., and K. I. Wolpin (2003): “On the specification and estimation of the production function for cognitive achievement,”
The Economic Journal, 113(485), F3–F33.
White, H. (1984):
Asymptotic Theory for Econometricians. Academic Press.
Wooldridge, J. M. (1995): “Selection corrections for panel data models under conditional mean independence assumptions,”
Journal of Econometrics, 68(1), 115–132.

Wooldridge, J. M. (2010): Econometric Analysis of Cross Section and Panel Data. MIT Press.