Better Bunching, Nicer Notching
Marinho Bertanha, Andrew H. McCallum, Nathan Seegert

First draft: August 5, 2017. This draft: August 14, 2020.
Abstract
We study the bunching identification strategy for an elasticity parameter that summarizes agents' response to changes in slope (kink) or intercept (notch) of a schedule of incentives. A notch identifies the elasticity but a kink does not, when the distribution of agents is fully flexible. We propose new non-parametric and semi-parametric identification assumptions on the distribution of agents that are weaker than assumptions currently made in the literature. We revisit the original empirical application of the bunching estimator and find that our weaker identification assumptions result in meaningfully different estimates. We provide the Stata package bunching to implement our procedures.
JEL: C14, H24, J20
Keywords: partial identification, censored regression, bunching, notching

1 Introduction
Estimating agents' responses to incentives is a central objective in economics and many other social sciences. A continuous distribution of agents that face a piecewise-linear schedule of incentives results in a distribution of responses with mass points located where the slope or intercept of the schedule changes. For example, a progressive schedule of marginal income tax rates induces a mass of heterogeneous individuals to report the same income at the level where marginal rates increase. Many studies in economics use mass points in the response distribution to recover primitive parameters that govern agents' responses to incentives.

Pioneering work by Saez (2010), Chetty, Friedman, Olsen, and Pistaferri (2011), and Kleven and Waseem (2013) develops bunching estimators that use mass points in response distributions to recover primitive parameters. These estimators are widely applied in economics and rely on the idea that a mass point is larger, the more responsive agents are to incentives. The size of the mass point, however, also depends on the unobserved distribution of agents' heterogeneity. Current methods are only able to map the size of mass points to primitive parameters because they make specific assumptions about the unobserved distribution.

This paper places bunching estimators on a statistical foundation and makes three contributions on the identification of a primitive parameter that summarizes agents' responses to incentives. First, we clarify how the mapping of observed variables to an elasticity parameter depends on assumptions about the unobserved distribution of heterogeneity. The elasticity parameter captures the log percentage change of a response to a log percentage change in an incentive.
A change in the intercept of the incentive schedule admits non-parametric point identification of the elasticity, but a change in slope does not. Second, we examine the assumptions made by current bunching methods and propose weaker assumptions for partial and point identification of the elasticity. Third, we revisit the original empirical application of the bunching estimator, which is in the economics literature that examines the largest means-tested cash transfer program in the United States, the Earned Income Tax Credit (EITC). Our weaker assumptions about the unobserved distribution of heterogeneity result in meaningful changes in estimates of individual responses to taxes.

Our first contribution is to clarify the importance of assumptions about unobserved heterogeneity for the identification of the elasticity. Many existing estimates are based on an agent optimization problem with a piece-wise linear constraint that has one change in slope or intercept. Slope changes in the constraint are often referred to as "kinks," while intercept changes are often called "notches." We generalize the constraint of the agent's problem to a schedule with multiple changes in intercepts and slopes because agents typically encounter a combination of both kinks and notches.

We highlight three insights about identification with kinks and notches, assuming a non-parametric family of distributions for unobserved heterogeneity that have continuous probability density functions (PDFs). First, if the constraint has at least one notch, it is possible to point identify the elasticity. Identification comes from using the empty interval in the support of the observed distribution that is created by agents' responses to a notch. Second, point identification is impossible if the incentive schedule only contains kinks. Identification is impossible because there always exists an unobserved distribution that reconciles any elasticity with the observed distribution of responses.
Third, inference methods designed for one kink can be applied in cases with multiple kinks at each kink separately, as long as there are no notches preceding the kink under study. This is because the range of heterogeneous agents that bunch at a kink is the same regardless of whether it is the first kink in the schedule or if it is followed by another kink. In contrast, the range of agents that bunch at a kink changes if that kink is preceded by a notch. Thus, methods designed for one kink could be invalidated by a preceding notch, but our new identification strategies can handle both kinks and notches simultaneously.

Our second contribution is to propose three novel identification strategies for the elasticity if the incentive schedule has kinks but no notches. Each of these strategies relies on weaker assumptions than those implicit in current implementations of the bunching estimator. Our first strategy identifies upper and lower bounds on the elasticity (that is, partially identifies the elasticity) by making a mild shape restriction on the non-parametric family of heterogeneity distributions. The other two strategies point identify the elasticity using covariates and semi-parametric restrictions on the distribution of heterogeneity.

The first strategy partially identifies the elasticity by assuming a bound on the slope magnitude of the heterogeneity PDF, that is, Lipschitz continuity. Intuition for identification of the elasticity in this setting is as follows. We observe the mass of agents who bunch, which equals the area under the heterogeneity PDF inside an interval. The length of this bunching interval depends on the unknown elasticity. The maximum slope magnitude of the PDF implies upper and lower bounds for all possible PDF values inside the bunching interval that are consistent with the observed bunching mass. This translates into lower and upper bounds, respectively, on the size of the bunching interval, which corresponds to lower and upper bounds on the elasticity.
These bounds allow researchers to examine the magnitude of the impossibility result in their empirical context. Depending on the data, it might take an unreasonably high slope magnitude on the heterogeneity PDF to produce bounds that include all possible elasticity values. In other settings, the difference between upper and lower bounds may be economically large even for small slope magnitudes.

The next two strategies rely on the fact that bunching can be rewritten as a censored regression model with a middle censoring point. We stress that while these strategies necessarily add structure to point identify the elasticity, they do not require fully parametric assumptions, such as normality, on the unconditional distribution of heterogeneity.

The second strategy identifies the elasticity by estimating a maximum likelihood mid-censored model, using data truncated to a window local to the kink. The likelihood function assumes that the unobserved distribution conditional on covariates is parametric, but we demonstrate that correct specification of the conditional distribution is not necessary for consistency, as long as the unconditional distribution is correctly specified. For example, conditional normality yields a mid-censored Tobit model, which has a globally concave likelihood and is easy to implement. Nevertheless, consistency only requires that the unobserved distribution is a semi-parametric mixture of normals, and conditional normality is not necessary. Truncating the sample around the kink point improves the fit of the model and further weakens these distribution assumptions.

The third strategy restricts a quantile of the unobserved distribution, conditional on covariates, and point identification follows existing theory for censored quantile regressions (Powell, 1986; Chernozhukov and Hong, 2002; Chernozhukov, Fernández-Val, and Kowalski, 2015).

Both semi-parametric methods are censored regression models that incorporate covariates.
These approaches extend bunching estimators to control for observable heterogeneity for the first time. Observable individual characteristics generally account for substantial variation across agents and leave less heterogeneity unobserved. This fact suggests that identification strategies that utilize covariates should be preferred over identifying assumptions that only restrict the shape of the unobserved distribution without covariates.

Our third contribution is to illustrate the empirical relevance of our methods by revisiting Saez (2010)'s original influential application of bunching in the distribution of U.S. income caused by kinks in the EITC schedule. That approach implicitly assumes the unobserved PDF of agents that bunch is linear and uses a trapezoidal approximation to compute the bunching mass. This assumption fits poorly when the true density is non-linear or the interval of agents that bunch is large. We compare elasticity estimates based on our identification assumptions with estimates based on the trapezoidal approximation, using annual samples of U.S. federal tax returns from the Internal Revenue Service (IRS).

Our partial identification method indicates that households adjust their reported income in response to marginal tax rates by a considerable amount. Placing a conservative limit on the slope magnitude, the lower bound for the elasticity is 0.34; that is, a one percent increase in the marginal tax rate results in a reduction in reported income of at least 0.34 percent. This estimate contrasts with the estimate of 0.43 using the trapezoidal approximation. The difference in these estimates matters. For example, Saez (2001) shows that the optimal top marginal tax rate for an economy with an elasticity of 0.34 is 13 percentage points higher than when the elasticity is 0.43.

The truncated Tobit model with covariates fits the observed distribution of income well, making our semi-parametric consistency result operative.
Elasticity estimates from this model differ substantially from estimates based on the trapezoidal approximation for some categories of U.S. taxpayers. For example, we estimate an elasticity of 0.72 versus a trapezoidal estimate of 1.10 for married and self-employed individuals. This large difference highlights the sensitivity of estimates to functional form assumptions, as well as the need for methods that rely on weaker assumptions.

Our three new methods provide a suite of ways to recover elasticities from bunching behavior. Each method differs in the assumptions it makes about the unobserved distribution to achieve identification. There is no way to determine which assumption is correct because the unobserved distribution is not fully identified. Nevertheless, estimates that are stable across many methods indicate that different identifying assumptions do not play a major role in the construction of those estimates. On the contrary, estimates that are sensitive to different assumptions are dependent on the validity of those assumptions. Therefore, we recommend that researchers examine the sensitivity of elasticity estimates across all available methods as a matter of routine.

Bunching estimators are widely applied in settings including fuel economy regulations (Sallee and Slemrod, 2012), electricity demand (Ito, 2014), real estate taxes (Kopczuk and Munroe, 2015), labor regulations (Garicano, Lelarge, and Van Reenen, 2016), prescription drug insurance (Einav, Finkelstein, and Schrimpf, 2017), marathon finishing times (Allen, Dechow, Pope, and Wu, 2017), attribute-based regulations (Ito and Sallee, 2018), education (Dee, Dobbie, Jacob, and Rockoff, 2019; Caetano, Caetano, and Nielsen, 2020b), minimum wage (Jales, 2018; Cengiz, Dube, Lindner, and Zipperer, 2019), and air-pollution data manipulation (Ghanem, Shen, and Zhang, 2019), among others.
Variation in the size of the mass point across groups of individuals has also been used as a first stage in a two-stage approach to control for endogeneity (Chetty, Friedman, and Saez, 2013; Caetano, 2015; Grossman and Khalil, 2019). An additional complication in many applications arises when the bunching mass is spread over a range instead of being a mass point. Blomquist, Kumar,

Footnote: Econometric approaches using bunching for causal identification include Khalil and Yildiz (2017), Caetano and Maheshri (2018), Caetano, Kinsler, and Teng (2019), and Caetano, Caetano, and Nielsen (2020a).

We provide the Stata package bunching that implements our procedures.

Footnote: The Stata package is available at the Statistical Software Components (SSC) online repository. Type ssc install bunching in Stata to install the package. The package is also available for download from the website of the authors.

Firms' and individuals' optimization problems often face piecewise-linear constraints. The nature of constraints is dictated by differential tax rates, insurance reimbursement rates, or contract bonuses. A budget set is fully characterized by a sequence of intercepts and slopes that change at known points. A change in the intercept is referred to as a notch, and a change in the slope is referred to as a kink.
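As an aside, a budget set of this form maps directly into code. The minimal Python sketch below (all numerical values are illustrative assumptions, not taken from the paper) represents a schedule as a sequence of thresholds, intercepts, and slopes, and evaluates consumption at a given income; a jump in the intercepts at a threshold would encode a notch, a change in the slopes a kink:

```python
from bisect import bisect_right

def consumption(y, thresholds, intercepts, slopes):
    """Consumption at income y on a piecewise-linear budget set.
    Segment j applies when thresholds[j] <= y < thresholds[j+1]."""
    j = bisect_right(thresholds, y) - 1
    return intercepts[j] + slopes[j] * (y - thresholds[j])

# Toy schedule (assumed): two segments, kink at y = 10 where the
# net-of-tax slope falls from 0.9 to 0.8, no notch.
thresholds = [0.0, 10.0]
intercepts = [5.0, 5.0 + 0.9 * 10.0]   # continuous at the kink
slopes = [0.9, 0.8]
print(consumption(9.0, thresholds, intercepts, slopes))   # 5 + 0.9*9 = 13.1
print(consumption(12.0, thresholds, intercepts, slopes))  # 14 + 0.8*2 = 15.6
```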
We start with the labor supply characterization employed by the vast majority of the literature, which follows the seminal work of Saez (2010) and Kleven and Waseem (2013). Agents maximize an iso-elastic quasi-linear utility function and choose consumption and labor subject to a piecewise-linear budget set. Well-known models that fit into this category include those of Burtless and Hausman (1978), Best and Kleven (2018), and Einav, Finkelstein, and Schrimpf (2017), among others. For ease of exposition, we focus on budget sets with one kink or one notch in the main text. In supplemental Appendices B.1 and B.2, we generalize the literature to any combination of kinks and notches. Section 3 below briefly discusses new insights for the identification of the elasticity that arise in the problem with multiple kinks and notches.

Consider a population of agents that are heterogeneous with respect to a scalar variable $N^*$, referred to as ability. Ability is distributed according to a continuous probability density function (PDF) $f_{N^*}$, with support $(0,\infty)$, and a cumulative distribution function (CDF) $F_{N^*}$. Agents know their $N^*$, but the econometrician does not observe the distribution of $N^*$. Agents maximize utility by jointly choosing a composite consumption good $C$ and labor supply $L$. Utility is increasing in $C$ and decreasing in $L$. These variables are constrained by a budget set, where the agent may consume all of its labor income net of taxes plus an exogenous endowment $I_0$. For simplicity, we assume the price of labor and consumption are equal to one, such that taxable labor income $Y$ is equal to $L$.

In the budget constraint with a kink, the tax rate increases from $t_0$ to $t_1$ as income increases above the kink value $K$. The budget constraint has a notch when the agent is charged a lump-sum tax of $\Delta > 0$ once income rises above $K$. Agent type $N^*$ maximizes utility $U(C, Y; N^*)$ as follows,

$$\max_{C,Y}\; C - \frac{N^*}{1+1/\varepsilon}\left(\frac{Y}{N^*}\right)^{1+1/\varepsilon} \tag{1}$$

s.t.
$$C = \mathbb{1}\{Y \le K\}\left[I_0 + (1-t_0)Y\right] + \mathbb{1}\{Y > K\}\left[I_1 + (1-t_1)(Y-K)\right], \tag{2}$$

where $\mathbb{1}\{\cdot\}$ is the indicator function; the budget line has intercept $I_0$ and slope $1-t_0$ if $Y \le K$, but intercept $I_1 = I_0 + K(1-t_0) - \Delta$ with slope $1-t_1$ if $Y > K$; and $\varepsilon$ is the elasticity of income $Y$ with respect to one minus the tax rate when the solution is interior. In the case of a kink, $\Delta = 0$, and the budget frontier is continuous; otherwise, in the case of a notch, it has a jump discontinuity of size $\Delta$ at $Y = K$. The solution is always on the budget frontier in Equation 2.

The solution for $Y$ in Problem 1 is well known in the literature, when $K$ is a kink (Saez, 2010) and when $K$ is a notch (Kleven and Waseem, 2013):

$$Y = \begin{cases} N^*(1-t_0)^{\varepsilon}, & \text{if } 0 < N^* < \underline{N} \\ K, & \text{if } \underline{N} \le N^* \le \overline{N} \\ N^*(1-t_1)^{\varepsilon}, & \text{if } \overline{N} < N^*, \end{cases} \tag{3}$$

where the expressions for the thresholds $\underline{N}$ and $\overline{N}$ are given below.

In the case of a kink, $\underline{N} = K(1-t_0)^{-\varepsilon}$ and $\overline{N} = K(1-t_1)^{-\varepsilon}$. The budget frontier is continuous, but its slope suddenly decreases at $Y = K$. For values of $N^*$ inside the bunching interval $[\underline{N}, \overline{N}]$, the agent's indifference curve is never tangent to the budget frontier, and we have the non-interior solution $Y = K$. For values of $N^*$ outside of the bunching interval, the indifference curve is always tangent to some point on the budget frontier.

In the case of a notch, the solution is interior for $N^* < \underline{N} = K(1-t_0)^{-\varepsilon}$, but there are no tangent indifference curves for $N^* \in [K(1-t_0)^{-\varepsilon}, K(1-t_1)^{-\varepsilon}]$, just as in the case of a kink. Although tangency occurs for $N^* > K(1-t_1)^{-\varepsilon}$, some of the resulting utility levels are lower than the utility at the notch point. The budget frontier with a jump-down discontinuity at $Y = K$ has an interval of income values $(K, Y_I]$ that no agent ever chooses. The value $Y_I > K$ corresponds to the interior solution of the agent with $N^* = N_I$; that is, the smallest $N^*$ such that the agent's utility is equal to the utility of the agent choosing $Y = K$.
Thus $\overline{N} = N_I$, and the solution is at $Y = K$ for $N^* \in [\underline{N}, \overline{N}]$. As the ability $N^*$ increases above $N_I$, the utility gets larger than the utility at $K$, and again there is an interior solution. Supplemental Appendix B.2 has a formal definition of $N_I$ in Equation B.3.

To make the solution more tractable, we take the natural logarithm of all variables. Define $y = \log(Y)$, $n^* = \log(N^*)$, $k = \log(K)$, $s_0 = \log(1-t_0)$, and $s_1 = \log(1-t_1)$.

$$y = \begin{cases} n^* + \varepsilon s_0, & \text{if } n^* < \underline{n} \\ k, & \text{if } \underline{n} \le n^* \le \overline{n} \\ n^* + \varepsilon s_1, & \text{if } \overline{n} < n^*. \end{cases} \tag{4}$$

As ability $n^*$ increases, the optimal choice of $y$ increases, except when $n^*$ falls inside the bunching interval $[\underline{n}, \overline{n}]$, in which case $y$ remains constant and equal to $k$.

2.3 Bunching and the Counterfactual Distribution of Income

The solution in the previous section expresses income as a function of the model parameters and $n^*$. For given values of $(t_0, t_1, k, \varepsilon)$, the continuously distributed $n^*$ maps into a mixed continuous-discrete distribution for $y$. The model predicts bunching in the distribution of $y$ at a kink or notch point (i.e., $P(y = k) > 0$), and a continuous distribution of $y$ otherwise. The amount of bunching depends on the elasticity $\varepsilon$ and the unobserved distribution of $n^*$,

$$B \equiv P(y = k) = P(\underline{n} \le n^* \le \overline{n}) = \int_{\underline{n}}^{\overline{n}} f_{n^*}(u)\,du = F_{n^*}(\overline{n}) - F_{n^*}(\underline{n}), \tag{5}$$

where the length of the interval $[\underline{n}, \overline{n}]$ varies with $\varepsilon$.

The literature typically defines $B$ in terms of the counterfactual distribution of income in the scenario without any kinks or notches. Let counterfactual income be $y_0$ in such case. The solution to Problem 1 is simply $y_0 = n^* + \varepsilon s_0$ for every value of $n^*$. The variable $y_0$ has continuous PDF $f_{y_0}$ and CDF $F_{y_0}$. The bunching mass is derived as

$$B = \int_{k}^{k+\Delta y_0} f_{y_0}(u)\,du = F_{y_0}(k + \Delta y_0) - F_{y_0}(k), \tag{6}$$

where $\Delta y_0 = \varepsilon(s_0 - s_1)$. Figure 1, Panels a and b, illustrates the distributions of $y$ and $y_0$, and how they relate to each other, to $B$, and to $f_{n^*}$.

Saez (2010)'s insight is that the mass of agents bunching $B$ is increasing in the elasticity $\varepsilon$ for a given distribution of $y_0$.
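This logic can be checked numerically. The Python sketch below (the normal-in-logs ability distribution and all parameter values are illustrative assumptions, not from the paper) simulates the solution in Equation 4, computes the bunching mass of Equation 5 both analytically and by Monte Carlo, and then inverts Equation 6 to recover the elasticity from $B$ and the counterfactual CDF:

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

def log_income(n_star, eps, k, s0, s1):
    """Optimal log income, Equation 4: interior below and above the
    kink, bunched exactly at k for n* in [k - eps*s0, k - eps*s1]."""
    n_lo, n_hi = k - eps * s0, k - eps * s1
    return np.where(n_star < n_lo, n_star + eps * s0,
                    np.where(n_star > n_hi, n_star + eps * s1, k))

# Illustrative parameters (assumed): eps = 0.5, kink at K = 10 with the
# tax rate rising from 10% to 20%; ability is normal in logs.
eps, k = 0.5, np.log(10.0)
s0, s1 = np.log(1 - 0.10), np.log(1 - 0.20)
n_lo, n_hi = k - eps * s0, k - eps * s1          # bunching interval
f_n = stats.norm(loc=n_lo, scale=0.5)            # assumed f_{n*}

# Equation 5: B = F_{n*}(n_hi) - F_{n*}(n_lo); Monte Carlo cross-check.
B = f_n.cdf(n_hi) - f_n.cdf(n_lo)
draws = f_n.rvs(size=200_000, random_state=0)
B_mc = np.mean(log_income(draws, eps, k, s0, s1) == k)

# Equation 6: here y0 = n* + eps*s0 is normal; given its CDF and B,
# solve for the elasticity, the idea behind the bunching estimator.
f_y0 = stats.norm(loc=n_lo + eps * s0, scale=0.5)
eps_rec = brentq(lambda e: f_y0.cdf(k + e * (s0 - s1)) - f_y0.cdf(k) - B,
                 1e-8, 10.0)
print(B, B_mc, eps_rec)   # eps_rec recovers eps = 0.5
```

The inversion in the last step only works because the distribution of $n^*$ is known here; the sections below discuss exactly what happens when it is not.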
Stated another way, the more agents shift income to the kink point $k$, the more sensitive they are to changes in tax rates. All current bunching and notching estimators use this insight to identify the elasticity. First, the researcher obtains an estimate of the counterfactual distribution of $y_0$ and the bunching mass $B$. Plugging these into Equation 6 allows us to solve for an estimate of the elasticity.

The treatment of the problem thus far abstracts from the existence of optimization and friction errors in the solution of Problem 1. In reality, instead of $y$, researchers typically observe the distribution of $\tilde{y} = y + e$, where $e$ is a random variable accounting for optimization frictions. In this case, the distribution of $\tilde{y}$ has the bunching mass distributed over a range around the kink point, as opposed to being right at the kink.

In a recent survey article, Kleven (2016) summarizes an identification strategy commonly used in the literature to estimate the distribution of $y_0$. The "polynomial strategy" was first proposed by Chetty et al. (2011) (Equations 14-15 and Figures 3-4), and it consists of fitting a flexible polynomial to an estimate of the PDF of $y$. The polynomial regression excludes observations that lie in a range around the kink point. The researcher chooses the range based on the support of the distribution of friction errors. The polynomial fit is then extrapolated to this excluded region as a way of predicting $f_{y_0}$. The procedure is widely used in the bunching literature; see, for example, Figure 6 by Bastani and Selin (2014), Figure 1 by Devereux, Liu, and Loretz (2014), and Figure 4 by Best and Kleven (2018). In supplemental Appendix B.3, we give more details in the context of a simple example, where $n^*$ is uniformly distributed. The example shows that such an identification strategy fails to recover both $B$ and $f_{y_0}$, even when the proposed polynomial fit is perfect.

The strategy fails for two reasons.
First, the distribution of $y$ is observed with error, and a proper deconvolution method must be used to retrieve the distribution of $y$, given the distribution of $\tilde{y}$. Second, even when the distribution of $y$ is known, it is not possible to obtain the distribution of $y_0$ inside the integration domain of Equation 6. Although $y = y_0$ when $n^* < \underline{n}$, we have that $y = k$, while $y_0 = n^* + \varepsilon s_0$ when $n^* \in [\underline{n}, \overline{n}]$ (Figures 1a and 1b). The shape of the distribution of $y_0$ is unidentified when $n^*$ falls in the bunching interval.

The rest of this paper focuses on the second problem of the identification strategy, namely the problem of identifying the elasticity, $\varepsilon$, using the distribution of $y$ instead of the distribution of $\tilde{y} = y + e$. In fact, our methods apply to the many examples of bunching that do not have friction errors, for example, Figure 4 by Glogowsky (2018) and Figure 1 by Goncalves and Mello (2018). The study of identification in the presence of optimization frictions is deferred to future research. In work in progress, Cattaneo et al. (2018) study identification of the distribution of $y$ given the distribution of $\tilde{y}$ plus minimal assumptions on the distribution of $e$.

The general solution to Problem 1 with multiple kinks and notches in supplemental Appendix B.2 brings new insights to the identification of the elasticity, when compared to the particular solution in the case of one kink or notch. First, in a budget set with multiple kinks but no notches, the general solution is simply a combination of solutions local to each kink. The bunching intervals of consecutive kinks do not overlap (Equation B.4). As a result, inference methods for the elasticity that are valid in the case of one kink may still be used locally to each kink.

Second, a notch at $k$ creates an empty interval in the support of the distribution of $y$ right after $k$. Such an empty interval may or may not contain the next tax change point $k' > k$, depending on the value of $\varepsilon$.
For example, eligibility for Medicaid benefits in the United States creates a sizeable notch that may overshadow the next tax change in the budget set of some individuals. In this case, inference methods that focus on kinks without accounting for other notches may produce misleading conclusions about the elasticity.

The rest of this section investigates identification with one notch or one kink. We show that identification is possible with one notch without any restriction on the distribution of $n^*$. On the other hand, identification in the case of a kink is impossible, unless the researcher imposes restrictions on the distribution of $n^*$.

In the problem without optimization error, the existence of one notch produces additional identifying information in the observed distribution of income. However, the identification strategy is different from previous studies, which solely rely on Equation 6. Even when $N^*$ has full support $(0, \infty)$, there exists an empty interval in the distribution of $Y$, to the right of the notch. Following the solution in Equation 3, the empty interval is $(K, Y_I]$, where $Y_I = N_I(1-t_1)^{\varepsilon}$, and $N_I$ is defined above. Once $Y_I$ is identified from the support of the distribution of $Y$, we numerically solve for the $\varepsilon$ that satisfies the indifference condition in Equation 7 below.

Theorem 1. Suppose the support of $N^*$ is equal to $(0, \infty)$, that $K$ is a notch, and that the upper limit of the empty interval in the support of $Y$ to the right of $K$ is equal to $Y_I$. Then the indifference condition that defines $Y_I$ is equivalent to

$$Y_I + \varepsilon K \left(\frac{K}{Y_I}\right)^{1/\varepsilon} = (1 + \varepsilon)\left(\frac{\bar{C} - I_1 + K(1-t_1)}{1-t_1}\right), \tag{7}$$

where $\bar{C}$ is the consumption value on the budget frontier at the notch point. Moreover, there exists a unique $\varepsilon$ that solves Equation 7 as a function of $Y_I$, $K$, $\bar{C}$, $I_1$, $t_1$. Therefore the elasticity is identified.

A proof for this theorem is in Appendix A.1 and all our other proofs are in Appendix A.
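Theorem 1 suggests a simple computational recipe: read $Y_I$ off the support of the income distribution and solve the indifference condition for $\varepsilon$ by root finding. The Python sketch below follows our reading of Equation 7; all parameter values are illustrative assumptions, not the paper's application. To make the example self-validating, it first generates a $Y_I$ consistent with a known elasticity and then inverts it:

```python
from scipy.optimize import brentq

def notch_gap(eps, Y_I, K, Cbar, I1, t1):
    """Zero of the indifference condition in Equation 7."""
    rhs = (1 + eps) * (Cbar - I1 + K * (1 - t1)) / (1 - t1)
    return Y_I + eps * K * (K / Y_I) ** (1 / eps) - rhs

# Illustrative parameters (assumed): notch of size delta at K.
K, t0, t1, I0, delta = 10.0, 0.10, 0.20, 5.0, 1.0
I1 = I0 + K * (1 - t0) - delta      # upper-segment intercept
Cbar = I0 + (1 - t0) * K            # consumption at the notch point

# Forward step: for a known elasticity, solve Equation 7 for Y_I.
eps_true = 0.5
Y_I = brentq(lambda y: notch_gap(eps_true, y, K, Cbar, I1, t1),
             K + 1e-9, 5 * K)

# Inverse step (Theorem 1): the observed Y_I identifies the elasticity.
eps_hat = brentq(lambda e: notch_gap(e, Y_I, K, Cbar, I1, t1), 1e-4, 5.0)
print(eps_hat)   # recovers eps_true up to solver tolerance
```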
Although bunching is increasing in the elasticity for a fixed distribution of $y_0$ or $n^*$, it is also true that, for a fixed elasticity, bunching increases as $f_{n^*}$ becomes more concentrated between $\underline{n}$ and $\overline{n}$. If all we know about $f_{n^*}$ is that it is continuous with full support and that its integral over $[\underline{n}, \overline{n}]$ equals $B$, then there is no way to identify both the elasticity and $f_{n^*}$ using only Equation 5; equivalently, there is no way to identify both the elasticity and the distribution of $y_0$ using only Equation 6. (In this subsection, it is analytically simpler to work with the solution in levels rather than in logs.) Intuitively, identification using only (5) or (6) is impossible because each uses one equation to solve for two unknowns. This is shown by Blomquist et al. (2015) and Blomquist and Newey (2017). We present the impossibility result in this section as a building block to our novel identification strategies in the next sections.

Formally, the data and model comprise five objects: 1) the CDF of earnings $F_y$; 2) the kink point $k$; 3) the slopes of the piecewise-linear constraint $s_0$ and $s_1$; 4) the CDF of the latent variable $F_{n^*}$; and 5) the elasticity $\varepsilon$. Equation 4 is a mapping $T$ that takes objects (2)-(5) and maps them into the CDF of optimal incomes across agents: $F_y = T(k, s_0, s_1, F_{n^*}, \varepsilon)$. The researcher observes objects (1)-(3), but does not observe the last two, $F_{n^*}$ and $\varepsilon$. The problem of identification consists of inverting the mapping $T$ such that the unobserved $\varepsilon$ is a function that only depends on the first three objects $(F_y, k, s_0, s_1)$, regardless of what $F_{n^*}$ may be. We denote the class of admissible distributions of $n^*$ as $\mathcal{F}_{n^*}$. If the class $\mathcal{F}_{n^*}$ contains all possible continuous distributions of $n^*$, then identification of $\varepsilon$ is impossible.

Lemma 1.
Let $\mathcal{F}_{n^*}$ be the class of all CDFs $F_{n^*}$ that have continuous PDFs $f_{n^*}$ with support $(-\infty, \infty)$. Let $\mathcal{F}_y$ be the class of all CDFs $F_y$ that are mixed continuous-discrete with one mass point at $k$, and continuous PDF $f_y$ otherwise. Take $F_y$, $k$, $s_0$, and $s_1$ as given. For every elasticity $\varepsilon \in (0, \infty)$, there exists $F_{n^*,\varepsilon} \in \mathcal{F}_{n^*}$ such that $F_y = T(k, s_0, s_1, F_{n^*,\varepsilon}, \varepsilon)$. Therefore it is impossible to point-identify $\varepsilon$.

Figure 1 provides intuition for the proof of Lemma 1. It illustrates that the observable PDF $f_y$ in Figure 1a is generated by applying Equation 4 to two different combinations of latent variable distributions and elasticities, $(f_{n^*,\varepsilon}, \varepsilon)$ and $(f_{n^*,\varepsilon'}, \varepsilon')$ in Figures 1c and 1d, respectively. Lemma 1 clarifies that current bunching methods are either implicitly restricting $\mathcal{F}_{n^*}$ or simply inconsistent for the true elasticity.

A direct consequence of Lemma 1 is that it is impossible to test restrictions on $\mathcal{F}_{n^*}$. Below we consider a couple of examples of identifying restrictions from the literature.

Example 1.
Saez (2010) implicitly restricts $\mathcal{F}_{n^*}$ when using a trapezoidal approximation to solve the integral in Equation 6, in levels rather than in logs (Saez's Equation 4 on page 186). That is,

$$B = \int_{K}^{K+\Delta Y_0} f_{Y_0}(u)\,du \cong \left(\frac{f_{Y_0}(K + \Delta Y_0) + f_{Y_0}(K)}{2}\right)\Delta Y_0, \tag{8}$$

where $\Delta Y_0 = K\left[\left((1-t_0)/(1-t_1)\right)^{\varepsilon} - 1\right]$. A sufficient condition for the approximation to be true is to assume $f_{Y_0}(u)$ is an affine function of $u$ for values of $u \in [K, K + \Delta Y_0]$. Given that $Y_0 = N^*(1-t_0)^{\varepsilon}$, the PDF $f_{N^*}(u) = f_{Y_0}(u(1-t_0)^{\varepsilon})(1-t_0)^{\varepsilon}$ is restricted to be an affine function of $u$ inside the interval $[K(1-t_0)^{-\varepsilon}, K(1-t_1)^{-\varepsilon}]$. This is equivalent to restricting $f_{n^*}$ to have an exponential shape within $[k - \varepsilon s_0, k - \varepsilon s_1]$.

The rest of Saez's identification strategy uses the fact that $f_{Y_0}(K) = f_Y(K^-)$, and $f_{Y_0}(K + \Delta Y_0) = f_Y(K^+)\left((1-t_1)/(1-t_0)\right)^{\varepsilon}$, where $f_Y$ is the PDF of the continuous portion of the distribution of $Y$, and $f_Y(K^{\pm})$ denotes the side limits $\lim_{Y \to K^{\pm}} f_Y(Y)$. Substituting these into Equation 8,

$$B \cong \frac{1}{2}\left(f_Y(K^+)\left(\frac{1-t_1}{1-t_0}\right)^{\varepsilon} + f_Y(K^-)\right) K \left[\left(\frac{1-t_0}{1-t_1}\right)^{\varepsilon} - 1\right], \tag{9}$$

which is Equation 5 by Saez (2010). It is then possible to solve implicitly for $\varepsilon$ as a function of the side limits of $f_Y$, the tax rates, the kink point, and the bunching mass.

One may argue that the affine assumption is a good approximation to any potentially non-linear density $f_{N^*}$, if the bunching interval $[K(1-t_0)^{-\varepsilon}, K(1-t_1)^{-\varepsilon}]$ is small. The problem with this argument is that the size of the interval is itself a function of the elasticity. It is impossible to state that the interval is small and the linear approximation is a good one without a priori knowledge of the elasticity.

Example 2.
The derivation by Chetty et al. (2011) of Equation 6 on page 761 assumes that the PDF $f_{Y_0}$ is constant inside the bunching interval $[K, K + \Delta Y_0]$. This is equivalent to assuming that $N^*$ is uniformly distributed in that region and thus restricts the class $\mathcal{F}_{n^*}$. For some scalar $a$, assume $F_{Y_0}(u) = a + f_{Y_0}(K)u$ for $u \in [K, K + \Delta Y_0]$, so that the PDF of $Y_0$ is constant and equal to $f_{Y_0}(K)$ in the bunching interval. Then,

$$B = \int_{K}^{K+\Delta Y_0} f_{Y_0}(u)\,du = F_{Y_0}(K + \Delta Y_0) - F_{Y_0}(K) = f_{Y_0}(K)\Delta Y_0 = f_{Y_0}(K) K \left[\left(\frac{1-t_0}{1-t_1}\right)^{\varepsilon} - 1\right] \cong f_{Y_0}(K) K \varepsilon \ln\left(\frac{1-t_0}{1-t_1}\right), \quad \text{so that} \quad \varepsilon \cong \frac{B}{f_{Y_0}(K) K \ln\left(\frac{1-t_0}{1-t_1}\right)}, \tag{10}$$

where the second-to-last approximate equality uses $\left[(1-t_0)/(1-t_1)\right]^{\varepsilon} - 1 \cong \varepsilon \ln\left[(1-t_0)/(1-t_1)\right]$ for small tax changes; and the last approximate equality is Equation 6 by Chetty et al. (2011). The rest of their identification procedure relies on the polynomial strategy to obtain $B$ and $f_{Y_0}(K)$, as described in the supplemental Appendix B.3.

The constant PDF assumption on $f_{Y_0}$ is more restrictive than the affine PDF assumption that justifies Saez's trapezoidal approximation. The trapezoidal approximation allows for $f_{Y_0}$ to have a non-zero slope in the bunching interval, whereas the constant PDF assumption does not.

There are more flexible restrictions one could impose on $\mathcal{F}_{n^*}$. For example, one could say $n^*$ follows a distribution inside a parametric family of distributions.

Example 3.
In general, let $\mathcal{F}_{n^*} = \{G_{n^*}(n; \theta),\ \theta \in \Theta\}$, where the $G_{n^*}$ are CDFs indexed by a $p \times 1$ vector of parameters $\theta$ in a parameter space $\Theta$. Identification of the elasticity requires that the bunching mass and the shape of the distribution of $y$ around the kink point are sufficient to identify $\theta$ and $\varepsilon$. That is, the family of distributions $\mathcal{F}_{n^*}$ is such that, for any feasible choice of $(k, s_0, s_1, \varepsilon, \theta)$, there exists a unique solution $(\bar{\varepsilon}, \bar{\theta}) = (\varepsilon, \theta)$ to the following system of equations:

$$G_{n^*}(k - \varepsilon s_1; \theta) - G_{n^*}(k - \varepsilon s_0; \theta) = G_{n^*}\left(k - \bar{\varepsilon} s_1; \bar{\theta}\right) - G_{n^*}\left(k - \bar{\varepsilon} s_0; \bar{\theta}\right) \tag{11}$$
$$G_{n^*}(u - \varepsilon s_0; \theta) = G_{n^*}(u - \bar{\varepsilon} s_0; \bar{\theta}) \quad \forall u < k \tag{12}$$
$$G_{n^*}(u - \varepsilon s_1; \theta) = G_{n^*}(u - \bar{\varepsilon} s_1; \bar{\theta}) \quad \forall u > k. \tag{13}$$

For example, the family of normal distributions with unknown mean and variance satisfies these conditions (see supplemental Appendix B.4). Identification is also possible in families with more than just two parameters. The objects on the left-hand side (LHS) of the three equations above, evaluated at the true $(\varepsilon, \theta)$, are identified from the data. Thus, if $\mathcal{F}_{n^*}$ satisfies (11)-(13), then the elasticity and $F_{n^*}$ are identified.

The rest of the paper focuses on methods that identify the elasticity in the kink case. We present three types of identification assumptions on the distribution of ability, from less restrictive to more restrictive. We start with a non-parametric shape restriction that bounds the slope magnitude of $f_{n^*}$, which leads to partial identification of $\varepsilon$. Next, we connect bunching to the literature on censored regressions, where $n^*$ is the regression error. It becomes natural to use covariates to explain $n^*$, and we propose two types of semi-parametric restrictions on the distribution of $n^*$ that point-identify the elasticity. The first restricts the distribution of $n^*$, conditional on covariates; and the second restricts a quantile of the distribution of $n^*$, conditional on covariates.
In general, more data variation and structure are needed to provide any information about the elasticity.

4.1 Non-parametric Bounds

Our partial identification approach relies on restricting the class F_n* to PDFs, f_n*, that are Lipschitz continuous with constant M ∈ (0, ∞). In other words, the slope magnitude of any f_n* ∈ F_n* is bounded by M. The following theorem gives the partially identified set for ε as a function of identified quantities and the maximum slope magnitude M.

Theorem 2.
Assume F_n* contains all distributions with PDF f_n* that are Lipschitz continuous with constant M ∈ (0, ∞). Then the elasticity ε ∈ Υ, where

Υ = ∅, if B < |f_y(k+) − f_y(k−)| [f_y(k+) + f_y(k−)] / (2M);
Υ = [ε_L, ε_U], if |f_y(k+) − f_y(k−)| [f_y(k+) + f_y(k−)] / (2M) ≤ B < [f_y(k+)² + f_y(k−)²] / (2M);
Υ = [ε_L, ∞), if [f_y(k+)² + f_y(k−)²] / (2M) ≤ B,

where ∅ is the empty set, and

ε_L = { 2 [f_y(k+)²/2 + f_y(k−)²/2 + MB]^{1/2} − [f_y(k+) + f_y(k−)] } / [M(s_0 − s_1)],
ε_U = { −2 [f_y(k+)²/2 + f_y(k−)²/2 − MB]^{1/2} + [f_y(k+) + f_y(k−)] } / [M(s_0 − s_1)].

Figures 1c and 1d provide the intuition behind the derivation of the bounds in Υ. For a fixed value of ε, the length of the interval [n̲, n̄] is fixed. If the magnitude of the derivative of f_n* is bounded by M, we obtain maximum and minimum areas under f_n* over [n̲, n̄]. We repeat this exercise for every value of ε to get a range of possible areas associated with each ε. Given that the probability of bunching B is the area under the true f_n* over [n̲, n̄], the partially identified set contains all values of ε whose range of possible areas includes B. The partially identified set is empty if M is not big enough to allow for the existence of a continuous function f_n* that connects f_y(k−) = f_n*(k − εs_0) to f_y(k+) = f_n*(k − εs_1). The partially identified set is unbounded if M is large enough to allow f_n* to be zero inside the interval [n̲, n̄].

The expression for the partially identified set depends on the value of M, and the researcher must specify this value to compute the bounds. The uniform approximation in Example 2 says that f_n* has zero slope inside the bunching interval, that is, M = 0. The trapezoidal approximation in Example 1 implicitly chooses M = m̲ such that m̲ is the smallest value of M for which we have bounds that are well defined.
Formally, m̲ solves B = |f_y(k+) − f_y(k−)| [f_y(k+) + f_y(k−)] / (2m̲), which makes ε_L = ε_U and point-identifies ε. Thus the exercise of computing bounds necessarily involves assumptions weaker than the uniform and trapezoidal approximations.

Lemma 1 makes clear that it is impossible to identify, and thus estimate, the value of M. A useful starting point for the magnitude of M comes from the maximum slope magnitude of the continuous part of f_y, say m̄. The PDF f_y is identified and is the shifted PDF of n*. Thus, the maximum slope of f_n* outside of the bunching interval is identified and equal to m̄. If we assume that the slope of f_n* inside the bunching interval is never bigger than outside, then M = m̄.

As a rule of thumb, we recommend that researchers plot the bounds in Theorem 2 as a function of M for a range of values that includes m̲, m̄, and possibly bigger values, e.g., up to 2m̄. Theorem 2 is important to quantify the magnitude of the impossibility problem presented in Lemma 1. If the bounds plotted for a range of M values admit elasticities that are too different in economic terms, then the identifying assumptions play a critical role in determining the elasticity. We give full details and implement this sensitivity analysis in the empirical section using our bunching Stata package (Section 5).

While we assume the PDF has bounded slope, Blomquist and Newey (2017) partially identify the elasticity by assuming the PDF of heterogeneity is monotone. Our approach has three valuable properties. First, the bounds of our partially identified set have closed-form solutions. Second, an observed mass point implies a positive elasticity even for large values of the slope bound M, which is in line with the theoretical prediction that agents respond to a change in incentives.
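The closed-form bounds and the recommended sensitivity sweep over M can be sketched as follows. This is an illustrative implementation of Theorem 2, not the authors' bunching Stata package, and all numerical inputs are hypothetical.

```python
import numpy as np

def bunching_bounds(B, f_minus, f_plus, s0, s1, M):
    """Partially identified set for the elasticity (Theorem 2).

    B: bunching mass; f_minus, f_plus: densities of y at k- and k+;
    s0 > s1: log net-of-tax slopes; M: Lipschitz bound on the PDF slope.
    Returns (lower, upper); upper may be inf; (nan, nan) if the set is empty.
    """
    half_sq = 0.5 * f_minus**2 + 0.5 * f_plus**2
    # Case 1: B too small to be consistent with the jump in f_y -> empty set.
    if B < abs(f_plus - f_minus) * (f_plus + f_minus) / (2 * M):
        return (np.nan, np.nan)
    lo = (2 * np.sqrt(half_sq + M * B) - (f_plus + f_minus)) / (M * (s0 - s1))
    # Case 3: the PDF may touch zero inside the interval -> unbounded above.
    if B >= (f_plus**2 + f_minus**2) / (2 * M):
        return (lo, np.inf)
    hi = (-2 * np.sqrt(half_sq - M * B) + (f_plus + f_minus)) / (M * (s0 - s1))
    return (lo, hi)

# Hypothetical inputs: equal side densities and B consistent with eps = 0.5
# under a flat PDF, so the interval should always cover 0.5.
f, s0, s1, eps_true = 1.0, 1.0, 0.0, 0.5
B = f * eps_true * (s0 - s1)
for M in (0.1, 0.5, 1.0, 2.0):   # sensitivity sweep recommended above
    print(M, bunching_bounds(B, f, f, s0, s1, M))
```

As M grows the interval widens, and once B reaches the third threshold (here at M = 2) the upper bound becomes infinite, matching the three cases of the theorem.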
Third, it nests and is easily comparable to the original bunching estimator based on the trapezoidal approximation.

We end this subsection with the case of a budget set with several kinks k_j, j = 1, …, J, but no notches. One may ask whether the existence of several kinks helps identify the elasticity. As noted above, the bunching intervals do not overlap across kinks, that is, N̲_j = K_j(1 − t_{j−1})^{−ε} < K_j(1 − t_j)^{−ε} = N̄_j. Lemma 1 applies to each kink, and multiple kinks do not necessarily point-identify ε, because the distribution of n* may be very different across different bunching intervals.

Multiple kinks do help with the identification of ε, as long as the researcher restricts the slope of f_n* and believes the model in Equation 1 applies to all individuals. This arises from the fact that every individual is assumed to have the same elasticity parameter ε, and that the bounds of Theorem 2 vary in length as B_j, f_y(k_j±), and s_j vary across cutoffs j = 1, …, J. The partially identified set is narrowed down by the intersection of bounds specific to each one of the multiple kinks.

It is important to clarify that the problem of choosing M is different from the typical problem of choosing a tuning parameter, e.g., a bandwidth or polynomial order in non-parametric estimation. The value of M represents a choice of functional form assumption, while in non-parametric estimation, one typically chooses the tuning parameter to achieve desirable properties of the estimator for a given functional form assumption.

Corollary 1. Assume the conditions of Theorem 2 for each kink k_j, j = 1, …, J. Then the elasticity ε ∈ ∩_{j=1}^J Υ_j, where Υ_j is the partially identified set of Theorem 2 applied to kink k_j.

Identification with kinks is impossible when the distribution of ability n* belongs to the non-parametric class of all continuous distributions.
Parametric functional form assumptions identify the elasticity, but identification relies on fitting such a functional form to non-bunching individuals and extrapolating the functional form to bunching individuals. This section considers alternative identification assumptions that rely on the existence of additional covariates in the dataset. There is strong empirical evidence suggesting that ability is well explained by individual characteristics, such as age, demographics, filing status, etc. For example, the ability distribution of young workers may have a very different mean and variance compared to that of older workers. Extrapolations based on covariates that predict n* are much more reasonable than extrapolations based solely on the shape of the PDF of n*. The key assumption is that covariates that help explain the distribution of n* for non-bunching individuals also help explain the distribution of n* for bunching individuals.

We start by connecting bunching to censored regression models. This allows us to relate to the vast econometrics literature in this area. Consider again the data generating process given by Equation 4. The model for y is a mid-censored model, where the error term is n*, the intercept to the left of the kink is εs_0, the intercept to the right of the kink is εs_1, and the censoring point is k. The main difference between (4) and a typical censored regression model is that the latter has the censoring point at either the minimum or the maximum of the distribution of y (see Equation 15 in the next subsection). Identification, estimation, and inference in these models have been widely studied in econometrics since Tobin (1958). There are many advantages to framing the estimation of ε as estimation of a censored model. Surveys of censoring models and their applications are provided by Maddala (1983), Amemiya (1984), Dhrymes (1986), Long (1997), DeMaris (2005), and Greene (2005). There are straightforward extensions that account for optimizing frictions.
Moreover, censored models are easily estimated with a number of different techniques that are available in many computer packages. Most importantly, it becomes extremely practical to add covariates as explanatory factors for the distribution of n*.

Assume the researcher has access to a vector of covariates X ∈ R^{1×(d+1)}, where X contains an intercept variable and the distribution of X is unrestricted. We build on censoring models with covariates to identify the elasticity by imposing two types of semi-parametric assumptions on the distribution of n*.

The first type of assumption states that the distribution of n* is a mixture of normal distributions averaged over the distribution of covariates. This assumption does not imply conditional normality of n* given X, but it is implied by conditional normality of n*. Although the Tobit likelihood assumes normality of the unobserved distribution conditional on covariates, we demonstrate that the Tobit estimator remains consistent under the semi-parametric class of normal mixtures. In addition, the researcher may estimate a truncated Tobit model on data in a small neighborhood of the kink point, which requires even weaker distributional assumptions for consistency. This robustness property remains true if we replace the normal distribution by another parametric distribution to form the semi-parametric mixture. For example, the maximum likelihood estimator for the elasticity that assumes that n* conditional on X is exponential remains consistent when the unconditional distribution of n* is a mixture of exponentials averaged over X, whether or not the distribution of n* conditional on X is exponential.
In the rest of this section, we keep this first type of assumption in terms of normals for ease of exposition and for practical reasons: the Tobit likelihood is globally concave, and software to estimate Tobit models is ubiquitous. The second type of assumption imposes a parametric functional form on a quantile of the conditional distribution of n*, given X. Sufficient variation in covariates yields point-identification of the elasticity, which is consistently estimated by mid-censored quantile regressions.

The first type of semi-parametric assumption is formally stated in Lemma 2 below. In the meantime, we construct the Tobit estimator by assuming that there exists a unique (β, σ) ∈ R^{d+1} × R_+ such that

F_{n*|X}(n, x) = Φ((n − xβ)/σ),   (14)

where F_{n*|X} denotes the CDF of n* conditional on X, and Φ(·) is the CDF of a standard normal distribution. Assumption 14 does not restrict the distribution of X; thus the unconditional CDF of n* lives in a semi-parametric class and need not be normal. The more variation in covariates one has, the richer is this class of distributions.

The elasticity parameter ε is consistently estimated using a mid-censored Tobit regression. Define the error term U = n* − Xβ and the latent variables y*_0 = εs_0 + Xβ + U and y*_1 = εs_1 + Xβ + U, where y*_1 < y*_0, since ε > 0 and s_0 > s_1. Then y follows a mid-censored Tobit model:

y = y*_0, if y*_1 < y*_0 < k;
y = k, if y*_1 ≤ k ≤ y*_0;
y = y*_1, if k < y*_1 < y*_0;
that is, y = min{y*_0; max{k; y*_1}}.   (15)

This is different from the classic Tobit model, where the censoring point is either at the minimum or at the maximum of the distribution of y. A possible estimation strategy is to adapt the two-step Heckit estimator to our setting (Heckman, 1976, 1979). In the first step, estimate a binary outcome model for bunching and non-bunching individuals, including covariates. In the second step, regress income of non-bunching individuals on covariates and the equivalents of the inverse Mills ratio.
Another extremely practical way of estimating this mid-censored Tobit model is to estimate two classic Tobit models. To see that, construct the variables y_0 = min{y, k} and y_1 = max{k, y}. It turns out that y_0 follows a right-censored Tobit with intercept εs_0 + β_0 and slope coefficients β_1, …, β_d, where β = (β_0, β_1, …, β_d). Similarly, y_1 follows a left-censored Tobit with intercept εs_1 + β_0 and slope coefficients β_1, …, β_d. Thus, the elasticity is consistently estimated by the difference of both intercepts, (εs_0 + β_0) − (εs_1 + β_0), divided by (s_0 − s_1). Despite its practicality, this estimation strategy does not constrain the slope coefficients and variances to be equal on both sides of the cutoff, which translates into a loss of efficiency. The mid-censored Tobit likelihood naturally takes these constraints into account and provides the most efficient estimates. It is therefore our preferred implementation.

Let (y_i, X_i), i = 1, …, n, be an iid sample of observations. The maximum likelihood estimator (MLE) for (ε, β, σ) is constructed by maximizing the log-likelihood function of the sample of y_i's, conditional on the X_i's:

L(y_1, …, y_n | X_1, …, X_n; ε, β, σ)
= (1/n) Σ_{i=1}^n ( I{y_i < k} log[(1/σ) φ((y_i − εs_0 − X_iβ)/σ)]
+ I{y_i = k} log[Φ((k − εs_1 − X_iβ)/σ) − Φ((k − εs_0 − X_iβ)/σ)]
+ I{y_i > k} log[(1/σ) φ((y_i − εs_1 − X_iβ)/σ)] )
≡ (1/n) Σ_{i=1}^n ℓ_i(ε, β, σ).   (16)

Regardless of the true distribution F_{n*|X}, the MLE is consistent for the parameter that maximizes the population average of the log-likelihood function, that is, (ε̄, β̄, σ̄) ∈ arg max E[ℓ_i(ε, β, σ)]. Standard textbook analyses of Tobit models demonstrate uniqueness of (ε̄, β̄, σ̄) as the solution to the maximization problem. We say the elasticity is identified by a mid-censored Tobit when the true parameter ε coincides with ε̄.
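The mid-censored Tobit MLE in (16) can be sketched numerically. The block below (ours, not the authors' bunching package) simulates data from Equation 15 with a single covariate and maximizes the likelihood directly; every parameter value is hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical DGP: one covariate, true elasticity eps = 1.
n, eps, b_true, sigma = 2000, 1.0, np.array([1.0, 0.5]), 0.3
s0, s1, k = np.log(1.34), 0.0, 1.15           # s0 > s1; kink point k
X = np.column_stack([np.ones(n), rng.normal(size=n)])
U = sigma * rng.normal(size=n)
xb = X @ b_true
y = np.minimum(eps*s0 + xb + U, np.maximum(k, eps*s1 + xb + U))  # Equation 15

def negloglik(p):
    """Negative mid-censored Tobit log-likelihood, as in Equation 16."""
    e, b0, b1, log_s = p
    s = np.exp(log_s)                          # keep sigma positive
    m = b0 + b1 * X[:, 1]
    below, at, above = y < k, y == k, y > k
    bunch_prob = (norm.cdf((k - e*s1 - m[at]) / s)
                  - norm.cdf((k - e*s0 - m[at]) / s))
    return -(norm.logpdf(y[below], e*s0 + m[below], s).sum()
             + np.log(np.clip(bunch_prob, 1e-300, None)).sum()
             + norm.logpdf(y[above], e*s1 + m[above], s).sum())

res = minimize(negloglik, x0=[0.5, y.mean(), 0.0, np.log(y.std())],
               method="Nelder-Mead", options={"maxiter": 5000, "fatol": 1e-9})
eps_hat = res.x[0]
print(eps_hat)
```

The estimate should land close to the true elasticity of 1; the two-classic-Tobits trick described above gives a similar but slightly less efficient answer because it does not pool the slope and variance parameters across the two sides of the kink.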
We show that the normality Assumption 14 is not necessary for identification of ε.

Lemma 2.
Let G_n*(n; β, σ, F_X) = E[Φ((n − Xβ)/σ)], where the expectation is taken over the distribution of X with CDF F_X. Assume the true distribution of n* belongs to the semi-parametric family

F_n* = {G_n*(n; β, σ, F_X), (β, σ, F_X) ∈ R^{d+1} × R_+ × F_X},   (17)

where F_X is the class of all CDFs of X. Suppose F_n* satisfies (11)–(13) for the true F_X. Define G_y(y; ε, β, σ, F_X) to be the unconditional CDF of y obtained by transforming G_n*(n; β, σ, F_X) according to Equation 4 and a given value of ε. Let F_y(y) be the true CDF of y. If G_y(y; ε̄, β̄, σ̄, F_X) = F_y(y), then ε̄ equals the true elasticity, regardless of Assumption 14. This remains true if we replace the normal distribution by another parametric distribution to form the semi-parametric mixture in (17).

If the Tobit best-fit distribution of y matches the true distribution of y, Lemma 2 guarantees that the elasticity estimated by the Tobit is consistent for the true elasticity, regardless of whether F_{n*|X} is normal. Essentially, Lemma 2 requires the unconditional distribution of n* to be a mixture of normal distributions, where the average is taken across the distribution of covariates. Standard quasi-MLE asymptotic inference procedures apply here. Namely, the MLE (ε̂, β̂, σ̂) obtained from (16) and centered at (ε̄, β̄, σ̄) is asymptotically normal, with zero mean and the usual variance-covariance matrix in the "sandwich form."

One of the features of bunching estimators is the reliance on data local to the kink point. With the mid-censored Tobit model, the researcher may also restrict the sample to observations of y lying in a small neighborhood of k and estimate a truncated Tobit. The truncated Tobit is an attractive estimation strategy because consistency of ε̂ relies on a much weaker version of Assumption 14.
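To see how much larger the family in (17) is than the normal family, here is a small numerical sketch of our own: with a single binary covariate, the unconditional distribution of n* is a two-component normal mixture, which is bimodal and far from normal even though n* | X is exactly normal. All numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical design: X = (1, W) with W ~ Bernoulli(0.5), and n*|X normal.
n = 200_000
w = rng.binomial(1, 0.5, size=n)
beta = np.array([0.0, 3.0])      # conditional means 0 and 3
sigma = 0.5
n_star = beta[0] + beta[1] * w + sigma * rng.normal(size=n)

# Unconditionally, n* is the mixture 0.5*N(0, 0.25) + 0.5*N(3, 0.25):
# mean 1.5, variance 0.25 + 0.25 * 3^2 = 2.5, and negative excess kurtosis
# (a bimodal shape, flatter in the tails than any single normal).
m, v = n_star.mean(), n_star.var()
kurt = np.mean(((n_star - m) / np.sqrt(v)) ** 4) - 3.0
print(round(m, 2), round(v, 2), round(kurt, 2))
```

A Tobit that conditions on W is still consistent here (Lemma 2), even though a normality test on the pooled n* would reject decisively.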
Moreover, the smaller the truncation window, the easier it is to fit the unconditional distribution of y with a Tobit, and the stronger is the robustness result of Lemma 2. As a matter of routine, we recommend that researchers estimate a truncated Tobit model for various window sizes around the kink point and examine two things: first, the plot of the estimated elasticity as a function of the window size; and second, the best-fit Tobit distribution of y compared to the histogram of y for various sizes of truncation windows. The distribution fit tends to improve as the size of the window decreases. The better the fit, the more likely the conditions of Lemma 2 are met, and the closer is the elasticity to the truth. We illustrate this exercise with simulated data below and with real data in Section 5.

For example, see Hayashi (2000), Section 8.3. The truncated Tobit model has a log-likelihood that is slightly different from (16). Instead of the log-likelihood of y | X, it maximizes the log-likelihood of y | X, k − δ < y < k + δ, for δ > 0, which has a truncated normal distribution.

Consider the following simulation experiment. Let U_1 and U_2 be Bernoulli with probability of success √2/2, and U_3 be normal with mean 1.59, where all three variables are independent. Let the covariates be X_1 = U_1 and X_2 = U_1 U_2, and ability be n* = √2 X_1 + 2X_2 + U_3. This model was chosen to match moments of the real data in Section 5. Generate an iid sample with 500,000 observations of X_1, X_2, and y, according to Equation 4 with ε = 1. As in the EITC example in Section 5, t_0 = −0.34 and t_1 = 0, and the kink point k is set at the first EITC kink.

The first exercise estimates a mid-censored Tobit that is correctly specified with both covariates X_1 and X_2. We start with the full sample of simulated data and produce estimates for truncation windows that are symmetric around the kink point and shrink in size. For example, Figures 2a and 2b show the histogram of simulated data for y, and the best-fit Tobit distributions for two truncation sizes, 100% and 40%. Although f_{n*|X_1,X_2} is normal, it is clear from the figures that f_y is not a censored normal and therefore f_n* is not normal. Figure 2c displays the elasticity estimate as a function of the percentage of data used in each truncated estimation. The elasticity estimate is stable over all truncation windows, because the model is correctly specified. The Tobit fits the distribution of y perfectly for all truncation windows, and the estimated elasticity is approximately equal to the truth.

The second exercise estimates a misspecified model using the same simulated data. Specifically, we drop X_2 out of the model. In this case, f_{n*|X_1} does not have a normal distribution, and Assumption 14 is not satisfied. Estimation using all of the data does not fit the distribution of y (Figure 2d). On the other hand, Figure 2e demonstrates that the truncated Tobit matches the distribution of y perfectly for windows that use 40% of the data or less. In line with Lemma 2, elasticity estimates converge to the truth as the truncation window decreases below 40%.

Another type of semi-parametric assumption on the ability distribution consists of restricting a quantile of the distribution of n*, conditional on X. Namely, for τ ∈ (0, 1), assume there exists β(τ) ∈ R^{d+1} such that

Q_τ(n* | X) = Xβ(τ),   (18)

where Q_τ denotes the τ-th quantile of a distribution. A common choice in applied work is τ = 1/2, the conditional median. The assumption allows for flexible transformations of X on the right-hand side, e.g., polynomials and interaction terms. Equation 15 leads to y = min{εs_0 + n*; max{k; εs_1 + n*}}, which is an increasing and continuous function of n*.
The quantile of an increasing and continuous function of n* is equal to that same function evaluated at the quantile of n*. Using Assumption 18,

Q_τ(y | X) = min{εs_0 + Xβ(τ); max{k; εs_1 + Xβ(τ)}}.   (19)

For those observations such that Xβ(τ) < k − εs_0 or Xβ(τ) > k − εs_1, the quantile Q_τ(y | X) varies linearly with X; otherwise, it is constant and equal to k. Intuitively, if there is enough variation in X for uncensored observations, then the slope coefficients and the intercepts are identified. This leads to identification of ε.

Lemma 3.
Define X̃ = [X, I{Q_τ(y | X) > k}], a random vector in R^{1×(d+2)}. Assume E[I{Q_τ(y | X) ≠ k} X̃′X̃] has full rank and that Assumption 18 holds. Then ε is identified.

In the absence of covariates or restrictions on Q_τ(y | X), the rank condition is never satisfied. This confirms the impossibility demonstrated in Lemma 1. For example, suppose the researcher has two dummy variables, W_1 and W_2. An unrestricted Q_τ(y | W_1, W_2) contains four parameters, because the conditional quantile takes at most four possible values. In the best-case scenario for identification, these four values are all different from k. In terms of Lemma 3, d = 3, and X = [1, W_1, W_2, W_1W_2] is 1 × 4. The matrix E[I{Q_τ(y | X) ≠ k} X̃′X̃] is 5 × 5 but has rank of at most four, so the rank condition fails. The quantile Q_τ(y | X) must be restricted to fewer parameters for identification to be possible.

Theoretical work on estimation and inference of parameters in censored quantile regression (CQR) models dates back to the 1980s (Powell, 1984, 1986). Recent advances include the computationally attractive three-step estimator by Chernozhukov and Hong (2002), and CQR with endogeneity by Chernozhukov et al. (2015). In the simpler case of Q_τ(y | X) = Xβ(τ), Koenker and Bassett (1978) show that a consistent estimator for β(τ) is obtained by the solution to the problem

min_{b ∈ R^{d+1}} Σ_{i=1}^n ρ_τ(y_i − X_i b),   (20)

where (y_i, X_i), i = 1, …, n, is an iid sample and ρ_τ(u) = (τ − I{u ≤ 0})u is the so-called "check function." In our case, the parametric conditional quantile function Q_τ(y | X) is given in Equation 19. The slope and intercept coefficients are estimated by

(b̂(τ), δ̂(τ)) = arg min_{b ∈ R^{d+1}, δ ∈ R} Σ_{i=1}^n ρ_τ(y_i − min{X_i b; max{k; X_i b + δ}}),   (21)

where b̂(τ) is consistent for β(τ) + [εs_0, 0, …, 0]′, and δ̂(τ) is consistent for ε(s_1 − s_0). Therefore the elasticity is consistently estimated by ε̂ = δ̂/(s_1 − s_0), and it is asymptotically normal.

The optimization problem in Equation 21 is computationally difficult. For the left- (or right-) censored case, Chernozhukov and Hong (2002) proposed a fast and practical estimator that consists of three steps. Our case of middle censoring requires a straightforward modification of their method. We delineate practical steps to obtain ε̂ and its standard error using CQR in Section B.5 of the supplemental appendix.

We demonstrate and compare our new methods using bunching behavior created by kinks in the earned income tax credit (EITC). Each method differs in the assumptions it makes about the unobserved distribution to achieve identification. There is no way to
There is no way todetermine which assumption is correct because the unobserved distribution is not fullyidentified. Nevertheless, estimates that are stable across many methods indicate thatdifferent identifying assumptions do not play a major role in the construction of thoseestimates. On the contrary, estimates that are sensitive to different assumptions aredependent on the validity of those assumptions. Patel, Seegert, and Smith (2016) provide anempirical illustration of this sensitivity.First, we use our non-parametric bounds to provide initial information about howsensitive the elasticity estimate is to different shapes of the underlying ability distribution.When the bounds are tight, then the shape of the underlying distribution is not critical. Butwhen the bounds are wide, then the shape is critical. In this case, reducing the range ofpossible elasticities requires either stronger restrictions on the shape of the abilitydistribution or additional data on determinants of ability.Second, we combine observed determinants of ability with our semi-parametric approachto point identify the elasticity. We compare the resulting best-fit Tobit income distributionto the observed distribution for alternative samples that range from using all observations tousing only data local to the kink. When the best-fit Tobit distribution coincides with the23bserved distribution, the estimated elasticity is consistent (Lemma 2). Furthermore, if theTobit elasticity is within narrow non-parametric bounds, then the identifying assumptionsare inconsequential; if within wide bounds, then the identifying assumptions are notcontradictory and the covariates provide point identification. In contrast, if the Tobitelasticity is outside of the bounds, then the elasticity estimate is not robust to the twoalternative identifying assumptions. 
Finally, when the best-fit Tobit distribution does not coincide with the observed distribution, the determinants of ability used for estimation are uninformative or the semi-parametric assumption is inappropriate.

We recommend that researchers examine the sensitivity of elasticity estimates across all available methods as a matter of routine. We illustrate these steps in the context of the EITC in the rest of this section.
We use data from the Individual Public Use Tax Files, constructed by the IRS. The annual cross-section for each year 1995 to 2004 includes sampling weights, which allow interpretation of any estimates as being based on the population of U.S. income tax returns. These data were initially used by Saez (2010) to demonstrate how to use bunching to estimate an elasticity. We replicate some of the estimates in Saez (2010) using publicly available code from the website of the American Economic Journal: Economic Policy and report them in supplemental Appendix B.6.

The income distribution for individuals with one child demonstrates clear bunching around the first EITC kink, where the marginal tax rate increases from −34 percent below the kink point to 0 percent above it. Observed bunching in the distribution of income suggests that people do respond to changes in tax rates. To effectively set tax rates, however, it is imperative to quantify this response precisely. Small variation in elasticity estimates implies large differences in optimal tax rates. For example, variation in the elasticity of taxable income between 0.1 and 0.2 implies an optimal top marginal tax rate of 82% or 69%, respectively. This example comes from Saez (2001). In particular, Equation 9 of that paper states τ̄ = (1 − g)/(1 − g + ε_u + ε_c(a − 1)), where g is the value the government places on the marginal consumption of high-income earners (often set to 0), a is the Pareto parameter (with a baseline value of 2), and ε_c and ε_u are the compensated and uncompensated elasticities of taxable income. For the calculation in the text, we utilize ε_u = ε_c, a Pareto parameter of 2, and a g value of 0.1. As demonstrated above, identifying the elasticity requires information on the amount of bunching and the income distribution. The following sections show how different methods leverage different types of variation to identify the elasticity.

Observed income ỹ may differ from intended income y because of friction errors. As does Saez (2010), we exclude observations that lie within a narrow window of the kink when estimating the CDF F_y(y). The size of the discontinuity equals the bunching mass. We then rely on the fact that y = F_y^{−1}(F_ỹ(ỹ)) and use the estimated CDFs to transform ỹ into y. Our filtering procedure is different from the polynomial strategy discussed in Section 2.3. We simply aim at removing the friction error from the sample, while the polynomial strategy of Example 2 aims to remove the friction error and recover the counterfactual distribution of income, which requires much stronger restrictions according to Lemma 1. Our filtering procedure works well in cases in which the researcher has a good prior on the support of the friction error distribution. Supplemental Appendix B.6 recomputes our estimates using the filtering procedure employed by Saez (2010).

Table 1 reports estimates of the elasticity of taxable income using a classic bunching method, non-parametric bounds, and Tobit models with covariates. Each of these estimates relies on a different set of assumptions to identify the elasticity of taxable income, and together they provide insights into which assumptions are most defensible in the context of the EITC. Column 1 reports our estimates of the elasticity of taxable income using a trapezoidal approximation (Example 1). This method assumes the unobserved PDF is linear in the bunching interval.
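The CDF-transform step of the filtering procedure, y = F_y^{−1}(F_ỹ(ỹ)), can be sketched with empirical CDFs. The block below is an illustrative quantile-mapping implementation of our own on synthetic data, not the exact procedure applied to the tax files.

```python
import numpy as np

rng = np.random.default_rng(2)

def cdf_transform(y_tilde, y_ref):
    """Map each observation through the empirical CDF of y_tilde and then
    through the empirical quantile function of the reference sample,
    i.e. y = F_y^{-1}(F_ytilde(y_tilde))."""
    ranks = y_tilde.argsort().argsort()          # 0, ..., n-1
    u = (ranks + 0.5) / len(y_tilde)             # empirical CDF at each point
    return np.quantile(y_ref, u)                 # empirical inverse CDF of y

# Synthetic check: a friction error blurs intended income y into observed
# income y_tilde; mapping y_tilde back through a clean reference sample
# recovers the reference distribution.
y = rng.normal(0.0, 1.0, size=50_000)            # stand-in for intended income
y_tilde = y + rng.normal(0.0, 0.5, size=y.size)  # friction error added
y_filtered = cdf_transform(y_tilde, y)
print(round(y_tilde.std(), 2), round(y_filtered.std(), 2))
```

The blurred sample has a visibly larger spread than the reference, while the filtered sample matches the reference distribution; in the application, the reference CDF is the estimated F_y with the kink neighborhood excluded.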
We estimate the PDF of the variables in logs rather than in levels, which simplifies the elasticity formula based on the trapezoidal approximation in Example 1.

Columns 2 and 3 report the non-parametric bounds of Theorem 2, which require the researcher to choose a range of values for M that includes the maximum slope magnitude of f_y. We reiterate that M is unidentified and that the slope of f_y provides a starting point. The bunching Stata package consistently estimates the maximum slope of f_y by taking the maximum slope in the histogram of y across all consecutive bins. We find that the estimated slope is small in all subsamples. We report bounds for M = 0.5 and M = 1 in Table 1, Columns 2 and 3, and plot bounds for M up to 2 in Figure 3. The vertical lines in these figures designate the minimum and maximum slope such that both the upper and lower bounds are finite numbers. The first line is the smallest slope that allows a continuous PDF to be consistent with both the bunching mass and the observed income distribution. At the minimum slope, both lower and upper bounds are equal to the estimate based on the trapezoidal approximation, reported in Column 1. As M increases, the set of possible PDF shapes in the bunching region becomes richer. The second line is the maximum slope before the set of possible distributions allows for a PDF that touches zero in the bunching interval. In that case, the bunching mass remains constant for arbitrarily large ε, and the upper bound is infinity (Theorem 2).

A large range between lower and upper bounds in Figure 3 suggests the estimates change substantially with the shape of the unobserved distribution. For example, the bounds are uninformative for the self-employed married sample, even for small values of M. This indicates that the data will not provide precise information on the elasticity unless the researcher imposes further functional form restrictions on the distribution of n*. In contrast, we learn the most in the case of all filers and self-employed not married, where the bounds are narrower than in other subsamples for M = 0.5. The lower bound is always defined for larger choices of M, which gives partial information on the elasticity without the need to be precise with the choice of M. For the exceedingly high value of M = 2, the lower bound is about 0.25 for all filers and 0.5 for the other three subsamples.

Columns 4–7 report our estimates of the Tobit model using the full sample and truncated samples. Comparisons across methods provide insights into the reasonableness of different assumptions used to estimate the elasticity. The trapezoidal approximation is always within the bounds, because its estimate is based on a linear interpolation of the PDF in the bunching region. The slope of such a line equals the minimum slope for which the bounds are defined. In contrast, the Tobit model using 100% of the data is often below the lower bounds, but the Tobit distribution fails to fit the observed distribution of income globally. Truncated Tobit estimates generally enter the bounds as the truncation window decreases and, as a result, the fit of the Tobit distribution improves. For the all filers sample, an M larger than 1 is needed for the bounds to cover the Tobit estimate truncated at 25%. This reiterates our previous discussion that the Tobit fit for all filers is poor until we use 20% or less of the data.

Consider self-employed married and self-employed not married filers. Figures 6 and 7 demonstrate that the bunching mass is larger for self-employed not married individuals than for self-employed married individuals. This difference in bunching mass might lead a researcher to conjecture that the elasticity is larger for self-employed not married individuals. Whether this conjecture is true depends, however, on differences in the underlying distribution of heterogeneity. Estimates based on the trapezoidal approximation contradict that conjecture.
The estimates in column 1 of Table 1 are larger for self-employed married individuals than for self-employed not married individuals. The global distributional assumption of a mixture of normals averaged over covariates produces a larger elasticity for self-employed not married (column 4 in Table 1). However, the credibility of these Tobit estimates is called into question by the poor fit shown in panel a of Figures 6 and 7. Truncating the sample obtains a better fit, and we find that the elasticities are approximately the same for married and not married. The disagreement across methods for these subsamples indicates that assumptions on the distribution of heterogeneity are critical to obtaining informative elasticity estimates.
We show how to use bunching from piecewise-linear budget constraints to identify elasticities under conditions weaker than those used in the literature on kinks and notches. The key theoretical point is that bunching is determined by the elasticity parameter and the shape of an unobserved distribution. Additional assumptions or data are needed to identify the elasticity.
We propose a suite of estimation techniques that allow researchers to tailor their estimation to different assumptions and data variation. These include non-parametric bounds and semi-parametric censored models with covariates. The non-parametric bounds are the least restrictive method and also nest estimators from the previous literature.
These techniques have wide applicability, because piecewise-linear budget constraints are common across fields, from public finance and labor to industrial organization and accounting. Our estimation strategies also provide a foundation for future advances in techniques that will account for different empirical hurdles. Of particular interest are extensions that consider optimization and friction errors, extensive-margin responses, and panel-data methods.
The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board or the Federal Reserve System. We would like to thank Matias Cattaneo, Bill Evans, Roger Gordon, Jim Hines, Dan Hungerman, Michael Jansson, Henrik Kleven, Brian Knight, Erzo Luttmer, Byron Lutz, Dayanand Manoli, Magne Mogstad, Whitney Newey, Andreas Peichl, Emmanuel Saez, Dan Silverman, and Joel Slemrod for valuable comments and discussions. The paper also benefited from feedback received from seminar participants at the UCSD Workshop on Bunching Estimators, Econometric Society, International Association for Applied Econometrics, International Institute of Public Finance, National Tax Association, Dartmouth College, Federal Reserve Board, and University of Michigan. Jessica C. Liu, Michael A. Navarrete, and Alexis M. Payne provided excellent research assistance. All remaining errors are our own. Bertanha acknowledges financial support received while visiting the Kenneth C. Griffin Department of Economics, University of Chicago.

References
Allen, E. J., P. M. Dechow, D. G. Pope, and G. Wu (2017). Reference-Dependent Preferences: Evidence from Marathon Runners. Management Science 63(6), 1657--1672.
Amemiya, T. (1984). Tobit Models: A Survey. Journal of Econometrics 24(1-2), 3--61.
Bastani, S. and H. Selin (2014). Bunching and Non-bunching at Kink Points of the Swedish Tax Schedule. Journal of Public Economics 109, 36--49.
Bertanha, M., A. H. McCallum, and N. Seegert (2018). Better Bunching, Nicer Notching. Working Paper 3144539, SSRN.
Bertanha, M. and M. J. Moreira (2020). Impossible Inference in Econometrics: Theory and Applications. Journal of Econometrics.
Best, M. C. and H. J. Kleven (2018). Housing Market Responses to Transaction Taxes: Evidence From Notches and Stimulus in the UK. Review of Economic Studies 85(1), 157--193.
Blomquist, S., A. Kumar, C.-Y. Liang, and W. Newey (2015). Individual Heterogeneity, Nonlinear Budget Sets, and Taxable Income. Working Paper 21/15, Cemmap.
Blomquist, S., A. Kumar, C.-Y. Liang, and W. Newey (2019). On Bunching and Identification of the Taxable Income Elasticity. Working Paper 53/19, Cemmap.
Blomquist, S. and W. Newey (2017). The Bunching Estimator Cannot Identify the Taxable Income Elasticity. Working Paper 40/17, Cemmap.
Blomquist, S. and W. Newey (2018). The Kink and Notch Bunching Estimators Cannot Identify the Taxable Income Elasticity. Working Paper 2018:4, Uppsala Universitet.
Burtless, G. and J. A. Hausman (1978). The Effect of Taxation on Labor Supply: Evaluating the Gary Negative Income Tax Experiment. Journal of Political Economy 86(6), 1103--1130.
Caetano, C. (2015). A Test of Exogeneity Without Instrumental Variables in Models With Bunching. Econometrica 83(4), 1581--1600.
Caetano, C., G. Caetano, and E. Nielsen (2020a). Correcting Endogeneity Bias in Models with Bunching. Working paper.
Caetano, C., G. Caetano, and E. R. Nielsen (2020b). Should Children Do More Enrichment Activities? Leveraging Bunching to Correct for Endogeneity.
Caetano, G., J. Kinsler, and H. Teng (2019). Towards Causal Estimates of Children's Time Allocation on Skill Development. Journal of Applied Econometrics 34(4), 588--605.
Caetano, G. and V. Maheshri (2018). Identifying Dynamic Spillovers of Crime with a Causal Approach to Model Selection. Quantitative Economics 9(1), 343--394.
Cattaneo, M., M. Jansson, X. Ma, and J. Slemrod (2018). Bunching Designs: Estimation and Inference. Working paper, UCSD Bunching Workshop.
Cattaneo, M. D., M. Jansson, and X. Ma (2019). Simple Local Polynomial Density Estimators. Journal of the American Statistical Association 0(0), 1--7.
Cengiz, D., A. Dube, A. Lindner, and B. Zipperer (2019). The Effect of Minimum Wages on Low-wage Jobs. Quarterly Journal of Economics 134(3), 1405--1454.
Chernozhukov, V., I. Fernández-Val, and A. E. Kowalski (2015). Quantile Regression with Censoring and Endogeneity. Journal of Econometrics 186(1), 201--221.
Chernozhukov, V. and H. Hong (2002). Three-step Censored Quantile Regression and Extramarital Affairs. Journal of the American Statistical Association 97(459), 872--882.
Chetty, R., J. N. Friedman, T. Olsen, and L. Pistaferri (2011). Adjustment Costs, Firm Responses, and Micro vs. Macro Labor Supply Elasticities: Evidence from Danish Tax Records. Quarterly Journal of Economics 126(2), 749--804.
Chetty, R., J. N. Friedman, and E. Saez (2013). Using Differences in Knowledge across Neighborhoods to Uncover the Impacts of the EITC on Earnings. American Economic Review 103(7), 2683--2721.
Dee, T. S., W. Dobbie, B. A. Jacob, and J. Rockoff (2019). The Causes and Consequences of Test Score Manipulation: Evidence from the New York Regents Examinations. American Economic Journal: Applied Economics 11(3), 382--423.
DeMaris, A. (2005). Truncated and Censored Regression Models. In Regression with Social Data: Modeling Continuous and Limited Response Variables, Chapter 9, pp. 314--347. John Wiley & Sons, Ltd.
Devereux, M. P., L. Liu, and S. Loretz (2014). The Elasticity of Corporate Taxable Income: New Evidence from UK Tax Records. American Economic Journal: Economic Policy 6(2), 19--53.
Dhrymes, P. J. (1986). Limited Dependent Variables. In Z. Griliches and M. D. Intriligator (Eds.), The Handbook of Econometrics, Volume 3, Chapter 27, pp. 1567--1631. North Holland.
Einav, L., A. Finkelstein, and P. Schrimpf (2017). Bunching at the Kink: Implications for Spending Responses to Health Insurance Contracts. Journal of Public Economics 146, 27--40.
Garicano, L., C. Lelarge, and J. Van Reenan (2016). Firm Size Distortions and the Productivity Distribution: Evidence from France. American Economic Review 106(11), 3439--3479.
Ghanem, D., S. Shen, and J. Zhang (2019). A Censored Maximum Likelihood Approach to Quantifying Manipulation in China's Air Pollution Data. Working paper, University of California - Davis.
Glogowsky, U. (2018). Behavioral Responses to Wealth Transfer Taxation: Bunching Evidence from Germany. Working Paper 3111993, SSRN.
Goncalves, F. and S. Mello (2018). A Few Bad Apples? Racial Bias in Policing. Working paper, University of California - Los Angeles.
Greene, W. H. (2005). Censored Data and Truncated Distributions. In T. Mills and K. Patterson (Eds.), Palgrave Handbook of Econometrics, Volume 1, Chapter 20, pp. 695--736. London: Palgrave Macmillan.
Grossman, D. and U. Khalil (2019). Neighborhood Networks and Program Participation. Journal of Health Economics 70, 102257.
Hayashi, F. (2000). Econometrics. Princeton University Press.
Heckman, J. J. (1976). The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models. In Annals of Economic and Social Measurement, Volume 5, Number 4, NBER Chapters, pp. 475--492. National Bureau of Economic Research, Inc.
Heckman, J. J. (1979). Sample Selection Bias as a Specification Error. Econometrica 47(1), 153--161.
Ito, K. (2014). Do Consumers Respond to Marginal or Average Price? Evidence from Nonlinear Electricity Pricing. American Economic Review 104(2), 537--563.
Ito, K. and J. M. Sallee (2018). The Economics of Attribute-Based Regulation: Theory and Evidence from Fuel Economy Standards. Review of Economics and Statistics 100(2), 319--336.
Jales, H. (2018). Estimating the Effects of the Minimum Wage in a Developing Country: A Density Discontinuity Design Approach. Journal of Applied Econometrics 33(1), 29--51.
Jales, H. and Z. Yu (2017). Identification and Estimation Using a Density Discontinuity Approach. In M. D. Cattaneo and J. C. Escanciano (Eds.), Regression Discontinuity Designs: Theory and Applications, Volume 38, pp. 29--72. Emerald Publishing Limited.
Khalil, U. and N. Yildiz (2017). A Test of the Selection-on-Observables Assumption Using a Discontinuously Distributed Covariate. Working paper.
Kleven, H. J. (2016). Bunching. Annual Review of Economics 8, 435--464.
Kleven, H. J. and M. Waseem (2013). Using Notches to Uncover Optimization Frictions and Structural Elasticities: Theory and Evidence from Pakistan. Quarterly Journal of Economics 128(2), 669--723.
Koenker, R. and G. Bassett (1978). Regression Quantiles. Econometrica 46(1), 33--50.
Kopczuk, W. and D. Munroe (2015). Mansion Tax: The Effect of Transfer Taxes on the Residential Real Estate Market. American Economic Journal: Economic Policy 7(2), 214--257.
Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables (2nd ed.). SAGE Publications.
Maddala, G. S. (1983). Limited-dependent and Qualitative Variables in Econometrics. Econometric Society Monographs. Cambridge University Press.
Patel, E., N. Seegert, and M. G. Smith (2016). At a Loss: The Real and Reporting Elasticity of Corporate Taxable Income. Working Paper 2608166, SSRN.
Powell, J. L. (1984). Least Absolute Deviations Estimation for the Censored Regression Model. Journal of Econometrics 25(3), 303--325.
Powell, J. L. (1986). Censored Regression Quantiles. Journal of Econometrics 32(1), 143--155.
Saez, E. (2001). Using Elasticities to Derive Optimal Income Tax Rates. Review of Economic Studies 68(1), 205--229.
Saez, E. (2010). Do Taxpayers Bunch at Kink Points? American Economic Journal: Economic Policy 2(3), 180--212.
Sallee, J. M. and J. Slemrod (2012). Car Notches: Strategic Automaker Responses to Fuel Economy Policy. Journal of Public Economics 96(11), 981--999.
Tobin, J. (1958). Estimation of Relationships for Limited Dependent Variables. Econometrica 26(1), 24--36.
Weber, C. (2016). Does the Earned Income Tax Credit Reduce Saving by Low-Income Households? National Tax Journal 69(1), 41--76.

Figure 1: Identification of the Elasticity in the Case of a Kink
[(a) Distribution of Observed Income; (b) Counterfactual Distribution of Income in the Absence of the Kink; (c) Distribution of Ability Consistent with Observed Income and Higher Elasticity; (d) Distribution of Ability Consistent with Observed Income and Lower Elasticity.]
Notes:
Panel 1a plots an example of the PDF of y. The continuous portions are equal to the PDF of ability n* shifted by εs_0 for y < k, and by εs_1 for y > k, respectively. The shaded area represents a discrete mass point with probability B = P(y = k), that is, the probability of bunching. Panel 1b shows the counterfactual PDF of y_0, that is, the distribution of income if tax rates did not change at the kink. The PDF of y_0 is continuous and equals the PDF of n* shifted by εs_0. It is also equal to the PDF of y before the kink, and to the shifted PDF of y after the kink. However, the distribution of y does not reveal the shape of the PDF of y_0 in the bunching region (i.e., φ). The shaded area under φ integrates to the probability of bunching B. The last two panels (Panels 1c-1d) display two different distributions of n* that generate the same distribution of income y (Panel 1a) with two different elasticities, according to Equation 4. The PDF of n* outside of the bunching region is equal to the PDF of y shifted by εs_0 if n* < k − εs_0, or shifted by εs_1 if n* > k − εs_1. Aside from B, the distribution of income does not contain any information about the shape of φ in the PDF of n*. If we assume f_{n*} is Lipschitz continuous with known constant, it is possible to derive upper and lower bounds for φ, which correspond, respectively, to lower and upper bounds on the elasticity (Theorem 2).

Figure 2
[Panels (a)-(b): histograms of simulated earnings density (bins) against earnings (log thousands of 2008 $), with the best-fit Tobit model, using 100% and 40% of the data; panel (c): elasticity estimate by percent of data used for estimation; panels (d)-(f): the same for the misspecified model.]
Notes:
The simulation experiment illustrates the robustness of Tobit estimates to deviations from the normality assumption (Assumption 14). The experiment generates 500,000 observations of y and two covariates (X_1, X_2), assuming ε = 1 and n*|X_1, X_2 ~ Normal(β_0 + β_1 X_1 + β_2 X_2, σ²) (see details in Section 4.2.1). As in the EITC case, the kink point is k = 2, t_0 = −0.34, and t_1 = 0. The first exercise estimates a mid-censored Tobit that is correctly specified with both covariates X_1 and X_2. Panels (a) and (b) show the histogram of simulated data for y and the best-fit Tobit distributions for two truncation sizes, using 100% and 40% of the sample. Panel (c) displays the elasticity estimate as a function of the percentage of data used in each truncated estimation, along with 95% confidence bands. The second exercise drops X_2 and estimates a misspecified model. Panels (d)-(f) are analogous to Panels (a)-(c), except that they use the estimates from the misspecified Tobit model, where n*|X_1 is not normal. The estimation truncated at 40% fits the distribution of y, and the elasticity converges to the true value (Lemma 2).

Table 1: Estimates Using U.S. Tax Returns 1995--2004
Columns: (1) Trapezoidal Approximation; (2) Theorem 2 Bounds, M = 0.5; (3) Theorem 2 Bounds, M = 1; (4) Tobit, Full Sample; (5) Tobit, Trunc. 75%; (6) Tobit, Trunc. 50%; (7) Tobit, Trunc. 25%; (8) Sample details.
All filers (Obs. 189.1m). Elasticity (ε): (1) 0.426; (2) [0.…, 0.…]; (3) [0.…, ∞]; (4) 0.195; (5) 0.280; (6) 0.291; (7) 0.326. Avg. $: ….
Self-employed (Obs. 33.5m). Elasticity (ε): (1) 0.854; (2) [0.…, 0.…]; (3) [0.…, ∞]; (4) 0.603; (5) 0.790; (6) 0.787; (7) 0.796. Avg. $: ….
Self-employed, married (Obs. 24.0m). Elasticity (ε): (1) 1.102; (2) [0.…, ∞]; (3) [0.…, ∞]; (4) 0.373; (5) 0.586; (6) 0.692; (7) 0.722. Avg. $: ….
Self-employed, not married (Obs. 9.6m). Elasticity (ε): (1) 0.784; (2) [0.…, 0.…]; (3) [0.…, 0.…]; (4)-(7) …. Avg. $: ….
The table shows estimates of the elasticity for four different subsamples of the IRS data, using three different approaches discussed in the paper. The first approach (column 1) uses the trapezoidal approximation to point-identify the elasticity (Example 1). We obtained non-parametric estimates of the side limits of f_y at the kink using the method of Cattaneo, Jansson, and Ma (2019). The estimate of the bunching mass equals the sample proportion of y observations equal to the kink point (see the discussion of friction errors in Section 5.1). We obtained standard errors using 100 bootstrap iterations. The second approach (columns 2 and 3) uses the same estimates of the bunching mass and side limits to compute partially identified sets for the elasticity (Theorem 2). Upper and lower bounds are calculated for two choices of M, that is, the maximum slope of the PDF of the unobserved heterogeneity n*. Column 4 has Tobit MLE estimates of the elasticity that utilize the full sample of data, along with robust standard errors. Columns 5 through 7 report truncated Tobit MLE estimates. As we move from column 5 to column 7, we restrict the estimation sample to shrinking symmetric windows around the kink that utilize 75% to 25% of the data. The set of covariates that enters the Tobit estimation is kept constant across different truncation windows. It includes dummy variables such as marital and employment status, year effects, types of deductions or social security benefits received, and whether the filer used tax preparation software.

Figure 3: Partial Identification Bounds for the Elasticity
[Panels (a)-(d) plot elasticity estimates (y-axis) against the maximum slope M of the unobserved density (x-axis), showing upper bound, lower bound, and trapezoidal estimate for: (a) All Filers (M from 0.07 to 0.82); (b) Self-Employed Filers (M from 0.09 to 0.55); (c) Self-employed and Married Filers (M from 0.03 to 0.15); (d) Self-employed and Not Married Filers (M from 0.24 to 1.54).]
Notes:
Panels a through d display partially identified sets for the elasticity for all filers with one child, and for three other subsamples defined by employment and marital status. The y-axis has elasticity values between lower and upper bounds given various choices of M on the x-axis, that is, the maximum slope magnitude of the PDF of the unobserved heterogeneity n* (Theorem 2). Each panel has two vertical lines. The line on the left corresponds to the smallest choice of M for which the bounds are defined. At the smallest M, upper and lower bounds are equal to the elasticity estimate based on the trapezoidal approximation (Example 1). The vertical line on the right corresponds to the largest choice of M for which the upper bound is finite. Higher slopes allow for the possibility of PDFs that are zero in the bunching window. As a result, we may have a finite bunching mass for any arbitrarily large elasticity.

Figure 4: Truncated Tobit - All Filers
[Panels (a)-(e): histograms of earnings density (bins) against earnings (log thousands of 2008 $) with the best-fit Tobit model, using 100%, 80%, 60%, 40%, and 20% of the data; panel (f): elasticity estimate by percent of data used for estimation.]
Notes: The figure displays best-fit Tobit distributions and elasticity estimates for various choices of a symmetric truncation window around the kink point. Estimation uses the following dummy variables as covariates: marital and employment status, year effects, types of deductions or social security benefits received, and whether the filer used tax preparation software. The set of included covariates is kept constant across different truncation windows. Panels a through e show the histogram of income for all filers (bars), along with the best-fit Tobit PDF for each truncation window (line). The best-fit PDF is constructed using the truncated Tobit likelihood averaged over covariate values in the sample. Panel f displays the Tobit elasticity estimate as a function of the percentage of data used in estimation.

Figure 5: Truncated Tobit - Self-employed Filers
[Panels (a)-(e): histograms of earnings density (bins) against earnings (log thousands of 2008 $) with the best-fit Tobit model, using 100%, 80%, 60%, 40%, and 20% of the data; panel (f): elasticity estimate by percent of data used for estimation.]
Notes: The figure displays best-fit Tobit distributions and elasticity estimates for various choices of a symmetric truncation window around the kink point. Estimation uses the following dummy variables as covariates: marital status, year effects, types of deductions or social security benefits received, and whether the filer used tax preparation software. The set of included covariates is kept constant across different truncation windows. Panels a through e show the histogram of income for self-employed filers (bars), along with the best-fit Tobit PDF for each truncation window (line). The best-fit PDF is constructed using the truncated Tobit likelihood averaged over covariate values in the sample. Panel f displays the Tobit elasticity estimate as a function of the percentage of data used in estimation.

Figure 6: Truncated Tobit - Self-employed and Married Filers
[Panels (a)-(e): histograms of earnings density (bins) against earnings (log thousands of 2008 $) with the best-fit Tobit model, using 100%, 80%, 60%, 40%, and 20% of the data; panel (f): elasticity estimate by percent of data used for estimation.]
Notes: The figure displays best-fit Tobit distributions and elasticity estimates for various choices of a symmetric truncation window around the kink point. Estimation uses the following dummy variables as covariates: year effects, types of deductions or social security benefits received, and whether the filer used tax preparation software. The set of included covariates is kept constant across different truncation windows. Panels a through e show the histogram of income for self-employed and married filers (bars), along with the best-fit Tobit PDF for each truncation window (line). The best-fit PDF is constructed using the truncated Tobit likelihood averaged over covariate values in the sample. Panel f displays the Tobit elasticity estimate as a function of the percentage of data used in estimation.

Figure 7: Truncated Tobit - Self-employed and Not Married Filers
[Panels (a)-(e): histograms of earnings density (bins) against earnings (log thousands of 2008 $) with the best-fit Tobit model, using 100%, 80%, 60%, 40%, and 19% of the data; panel (f): elasticity estimate by percent of data used for estimation.]
Notes: The figure displays best-fit Tobit distributions and elasticity estimates for various choices of a symmetric truncation window around the kink point. Estimation uses the following dummy variables as covariates: year effects, types of deductions or social security benefits received, and whether the filer used tax preparation software. The set of included covariates is kept constant across different truncation windows. Panels a through e show the histogram of income for self-employed and not married filers (bars), along with the best-fit Tobit PDF for each truncation window (line). The best-fit PDF is constructed using the truncated Tobit likelihood averaged over covariate values in the sample. Panel f displays the Tobit elasticity estimate as a function of the percentage of data used in estimation.

Appendix

A.1 Identification with a Notch - Proof of Theorem 1
We present the proof of Theorem 1 in the more general case of multiple tax changes with at least one notch (Sections B.1 and B.2 in the supplemental appendix). Let p ∈ {1, . . . , L} be the index of the smallest notch K_p. As explained in the text, the presence of a notch may remove the next tax change K_{p+1} from the solution to the utility maximization problem with multiple kinks and notches (Lemma B.1 in the supplemental appendix). Let q ∈ {p + 1, . . . , L} be the index of the next tax change that appears in the solution. Following the proof of Lemma B.1, the distribution of Y does not have any mass in the interval (K_p, Y_p^I], where Y_p^I = N_p^I (1 − t_{q−1})^ε, and N_p^I is defined as part of the solution in Equation B.4 in the supplemental appendix. The econometrician observes the value of Y_p^I, which is between K_{q−1} and K_q. The goal is to solve for ε using this information.

The proof of Lemma B.1 says that N_p^I satisfies the equation below:

    N_p^I (1 − t_{q−1})^{1+ε} + ε (N_p^I)^{−1/ε} (K_p)^{(1+ε)/ε} = (1 + ε) [C_p − I_{q−1} + K_{q−1}(1 − t_{q−1})].

Use the facts that Y_p^I = N_p^I (1 − t_{q−1})^ε and (Y_p^I)^{−1/ε} (1 − t_{q−1}) = (N_p^I)^{−1/ε}, and substitute these into the equation above to get

    Y_p^I (1 − t_{q−1}) + ε (Y_p^I)^{−1/ε} (1 − t_{q−1}) (K_p)^{(1+ε)/ε} = (1 + ε) [C_p − I_{q−1} + K_{q−1}(1 − t_{q−1})]

    Y_p^I + ε K_p (K_p / Y_p^I)^{1/ε} = (1 + ε) [C_p − I_{q−1} + K_{q−1}(1 − t_{q−1})] / (1 − t_{q−1}).   (A.1)

The elasticity ε is identified if there exists a unique solution for ε in Equation A.1 as a function of Y_p^I, K_p, C_p, I_{q−1}, and t_{q−1}. We know a solution exists, and we show that it must be unique. Consider the left-hand and right-hand sides of (A.1) as functions of ε. The solution occurs at the value of ε where these two functions intersect. Uniqueness is equivalent to single-crossing of these functions.

The function on the right-hand side (RHS) of (A.1) has positive intercept equal to [C_p − I_{q−1} + K_{q−1}(1 − t_{q−1})] / (1 − t_{q−1}). The function on the left-hand side (LHS) has intercept equal to Y_p^I, because ε K_p (K_p / Y_p^I)^{1/ε} converges to zero as ε ↓ 0. The intercept of the LHS is strictly bigger than the intercept of the RHS:

    Y_p^I ≷ [C_p − I_{q−1} + K_{q−1}(1 − t_{q−1})] / (1 − t_{q−1})
    I_{q−1} + (Y_p^I − K_{q−1})(1 − t_{q−1}) ≷ C_p
    C_p^I ≷ C_p,

where C_p^I = I_{q−1} + (Y_p^I − K_{q−1})(1 − t_{q−1}) is the consumption value on the budget frontier when income equals Y_p^I, which is strictly greater than C_p. In fact, the consumer is indifferent between (C_p^I, Y_p^I) and (C_p, Y_p), where Y_p^I > Y_p. Since utility is strictly decreasing in Y and increasing in C, we must have C_p^I > C_p. Therefore, Y_p^I > [C_p − I_{q−1} + K_{q−1}(1 − t_{q−1})] / (1 − t_{q−1}).

The function on the RHS of (A.1) has positive slope equal to [C_p − I_{q−1} + K_{q−1}(1 − t_{q−1})] / (1 − t_{q−1}). The function on the LHS has strictly positive derivative for any positive ε:

    ∂LHS/∂ε = K_p (K_p / Y_p^I)^{1/ε} [1 − (1/ε) ln(K_p / Y_p^I)],

which is strictly positive because K_p > 0, (K_p / Y_p^I)^{1/ε} ∈ (0, 1), and −(1/ε) ln(K_p / Y_p^I) > 0. The derivative is strictly increasing in ε:

    ∂²LHS/∂ε² = K_p (K_p / Y_p^I)^{1/ε} (1/ε³) [ln(K_p / Y_p^I)]²,

which is also strictly positive. The limit of ∂LHS/∂ε as ε → ∞ equals K_p. Therefore, the slope of the LHS is positive and strictly increasing, but always less than K_p. Next, we show that K_p is strictly less than the constant slope of the RHS:

    K_p ≷ [C_p − I_{q−1} + K_{q−1}(1 − t_{q−1})] / (1 − t_{q−1})
    I_{q−1} + (K_p − K_{q−1})(1 − t_{q−1}) ≷ C_p.

The value C_p^* = I_{q−1} + (K_p − K_{q−1})(1 − t_{q−1}) is what consumption would be if income were equal to K_p and the budget segment between K_{q−1} and K_q were extrapolated back to K_p. We know that the indifference curve touches this budget segment at one point, (C_p^I, Y_p^I), and every other point on the extrapolated budget segment has strictly lower utility. We also know that (C_p, K_p) is on that indifference curve, so (C_p, K_p) is strictly preferred to (C_p^*, K_p). Therefore, C_p^* < C_p, so K_p < [C_p − I_{q−1} + K_{q−1}(1 − t_{q−1})] / (1 − t_{q−1}), and the slope of the LHS of (A.1) is always less than the slope of the RHS.

In summary, the intercept of the LHS of (A.1) is greater than the intercept of the RHS. Both functions are strictly increasing: the RHS has constant slope, and the LHS has an increasing slope that is always smaller than the slope of the RHS. Therefore, the intersection of these two functions is unique. □

A.2 Impossibility of Non-parametric Identification of the Elasticity - Proof of Lemma 1
Consider the case of one kink k, with one tax change from s_0 to s_1. It suffices to show that, for every ε > 0, there exists F_{n*,ε} ∈ F_{n*} such that F_y = T(k, s_0, s_1, F_{n*,ε}, ε) for fixed F_y, k, s_0, and s_1. To show the existence of such an F_{n*,ε}, fix an arbitrary ε > 0 and construct F_{n*,ε} as follows:

1. First, define a continuous function φ : [k − εs_0, k − εs_1] → R_{++} such that: (a) φ(k − εs_0) = lim_{u↑k} f_y(u); (b) φ(k − εs_1) = lim_{u↓k} f_y(u); and (c) ∫_{k−εs_0}^{k−εs_1} φ(u) du = F_y(k) − lim_{u↑k} F_y(u).

2. Second, compute the CDF F_{n*,ε} by integrating the following PDF:

    f_{n*,ε}(v) = f_y(εs_0 + v) for v ∈ (−∞, k − εs_0); φ(v) for v ∈ [k − εs_0, k − εs_1]; and f_y(εs_1 + v) for v ∈ (k − εs_1, +∞). □

A.3 Partial Identification with Non-parametric Restrictions - Proof of Theorem 2
First, let’s fix ε >
0. We look at all possible PDFs in F n ∗ and compute the maximumand minimum integrals over the interval [ n, n ]. The length of this interval is ε ( s − s ).Thus, without loss of generality, we restrict our attention to f n ∗ over the interval[0 , ε ( s − s )] such that:(i) f n ∗ is continuous, and it connects the point (0 , f y ( k − )) to ( ε ( s − s ) , f y ( k + )) in the(x,y) plane;(ii) the absolute value of the slope of f n ∗ is bounded by M.First, start with f n ∗ being a line. The magnitude of the slope is | f y ( k + ) − f y ( k − ) | ε ( s − s ) . Supposethis magnitude is bigger than M . Then, any f n ∗ satisfying (i) will have a slope magnitudehigher than M at some point. Therefore, we need to look at ε ≥ ε where ε = | f y ( k + ) − f y ( k − ) | M ( s − s ) .For fixed ε ≥ ε , the slope of the line will be less or equal to M . The maximum possiblearea is attained when the function has the shape of a hat with two line segments that attainthe maximum slope. The first line segment starts at (0 , f y ( k − )) and has slope + M ; thesecond line segment ends at ( ε ( s − s ) , f y ( k + )) and has slope − M . Call this function f n ∗ .These lines intersect at x ∗ where x ∗ = f y ( k + ) − f y ( k − ) + M ε ( s − s )2 M .
Note that x ∗ is always such 0 ≤ x ∗ ≤ ε ( s − s ) because ε ≥ ε . Note that it is impossible tofind another f n ∗ that satisfies (i), it is greater than f n ∗ , and that has slope magnitude less orequal than M . The maximum area is A ( ε ) = (cid:90) ε ( s − s )0 f n ∗ ( v ) dv = (1 / M ) (cid:2) M ε s − M ε s s + M ε s + 2 M εf y ( k − ) s − M εf y ( k − ) s + 2 M εf y ( k + ) s − M εf y ( k + ) s − f y ( k − ) + 2 f y ( k − ) f y ( k + ) − f y ( k + ) ) (cid:3) The function A ( ε ) is strictly increasing with respect to ε over ε ≥ ε . In fact, thederivative is (( s − s )( f y ( k − ) + f y ( k + ) + M ε ( s − s )) / , f y ( k − )) and has slope − M , and another that ends at ( ε ( s − s ) , f y ( k + ))and has slope + M . Differently the hat function, the intersection ( x ∗∗ , y ∗∗ ) of this invertedhat function may or may not be above the x-axis. That is, y ∗∗ may be negative, but f n ∗ isalways positive. In that case, we simply set the function to zero in the region where it wouldbe negative. Call this function f n ∗ .The intersection occurs at x ∗∗ = f y ( k − ) − f y ( k + ) + M ε ( s − s )2 M .
Note that $x^{**}$ always satisfies $0 \leq x^{**} \leq \varepsilon(s_0-s_1)$ because $\varepsilon \geq \underline{\varepsilon}$. The y-value of the intersection is
$$y^{**} = \frac{f_y(k^-) + f_y(k^+) - M\varepsilon(s_0-s_1)}{2},$$
and this is positive as long as $\varepsilon \leq \overline{\varepsilon}$, where $\overline{\varepsilon} = \big(f_y(k^+)+f_y(k^-)\big)/(M(s_0-s_1))$. Note also that $\underline{\varepsilon} < \overline{\varepsilon}$.

For $\underline{\varepsilon} \leq \varepsilon \leq \overline{\varepsilon}$, the minimum area is
$$\underline{A}(\varepsilon) = \int_0^{\varepsilon(s_0-s_1)} \underline{f}_{n^*}(v)\,dv = \frac{1}{4M}\Big[\big(f_y(k^+)-f_y(k^-)\big)^2 + 2M\varepsilon(s_0-s_1)\big(f_y(k^-)+f_y(k^+)\big) - \big(M\varepsilon(s_0-s_1)\big)^2\Big].$$
The function $\underline{A}(\varepsilon)$ is strictly increasing with respect to $\varepsilon$ over $\underline{\varepsilon} \leq \varepsilon < \overline{\varepsilon}$. In fact, the derivative is $(s_0-s_1)\big(f_y(k^-)+f_y(k^+)-M\varepsilon(s_0-s_1)\big)/2$, which is strictly positive for $\varepsilon < \overline{\varepsilon}$. The function $\underline{A}(\varepsilon)$ is constant with respect to $\varepsilon$ over $\varepsilon \geq \overline{\varepsilon}$.

Therefore, we have characterized the maximum and minimum areas $\overline{A}(\varepsilon)$ and $\underline{A}(\varepsilon)$ for any given $\varepsilon$. These areas are undefined if $\varepsilon < \underline{\varepsilon}$; they are equal if $\varepsilon = \underline{\varepsilon}$; they are strictly increasing in $\varepsilon$ with $\underline{A}(\varepsilon) \leq \overline{A}(\varepsilon)$ for $\varepsilon \in (\underline{\varepsilon}, \overline{\varepsilon})$. For $\varepsilon \geq \overline{\varepsilon}$, $\overline{A}(\varepsilon)$ continues to grow with $\varepsilon$, but $\underline{A}(\varepsilon)$ stays constant at $\underline{A}(\overline{\varepsilon})$. The expression for $\underline{A}(\overline{\varepsilon})$ is $\big(f_y(k^-)^2 + f_y(k^+)^2\big)/(2M)$.

Finally, we define the partially identified set.

Case I: If $B < \overline{A}(\underline{\varepsilon}) = \underline{A}(\underline{\varepsilon})$, there does not exist any function $f_{n^*}$ consistent with any elasticity $\varepsilon$, so the set is empty. The expression for $\overline{A}(\underline{\varepsilon}) = \underline{A}(\underline{\varepsilon})$ is $\big(|f_y(k^-) - f_y(k^+)|\,(f_y(k^-)+f_y(k^+))\big)/(2M)$.

Case II:
Suppose $B \geq \overline{A}(\underline{\varepsilon})$ and $B < \underline{A}(\overline{\varepsilon})$. There is an interval of values for $\varepsilon$ such that for any $\varepsilon$ in this interval there exists a function $f_{n^*}$ whose integral equals $B$. The minimum possible elasticity solves $\overline{A}(\varepsilon) = B$, which gives
$$\varepsilon_{\min} = \frac{2\left[f_y(k^+)^2/2 + f_y(k^-)^2/2 + MB\right]^{1/2} - \big(f_y(k^+) + f_y(k^-)\big)}{M(s_0-s_1)}.$$
The maximum possible elasticity solves $\underline{A}(\varepsilon) = B$, which gives
$$\varepsilon_{\max} = \frac{-2\left[f_y(k^+)^2/2 + f_y(k^-)^2/2 - MB\right]^{1/2} + \big(f_y(k^+) + f_y(k^-)\big)}{M(s_0-s_1)}.$$

Case III: Suppose $B \geq \underline{A}(\overline{\varepsilon})$. It is still possible to find a minimum elasticity that solves $\overline{A}(\varepsilon) = B$. However, for any elasticity $\varepsilon \geq \varepsilon_{\min}$ we have $\underline{A}(\varepsilon) \leq B$, so $\varepsilon_{\max}$ is infinity. $\square$

A.4 Tobit Regression - Proof of Identification Lemma 2
By Assumption 17, there exist true values $(\varepsilon, \beta, \sigma)$ such that
$$B = F_y(k^+) - F_y(k^-) = G_{n^*}(k - \varepsilon s_1; \beta, \sigma, F_X) - G_{n^*}(k - \varepsilon s_0; \beta, \sigma, F_X)$$
$$F_y(u) = G_{n^*}(u - \varepsilon s_0; \beta, \sigma, F_X) \quad \forall u < k$$
$$F_y(u) = G_{n^*}(u - \varepsilon s_1; \beta, \sigma, F_X) \quad \forall u > k,$$
where $F_X$ is the true CDF of $X$. The MLE estimator is consistent for $(\bar\varepsilon, \bar\beta, \bar\sigma)$, and we have that $G_y(y; \bar\varepsilon, \bar\beta, \bar\sigma, F_X) = F_y(y)$ $\forall y$. Thus,
$$G_{n^*}(k - \bar\varepsilon s_1; \bar\beta, \bar\sigma, F_X) - G_{n^*}(k - \bar\varepsilon s_0; \bar\beta, \bar\sigma, F_X) = G_{n^*}(k - \varepsilon s_1; \beta, \sigma, F_X) - G_{n^*}(k - \varepsilon s_0; \beta, \sigma, F_X)$$
$$G_{n^*}(u - \bar\varepsilon s_0; \bar\beta, \bar\sigma, F_X) = G_{n^*}(u - \varepsilon s_0; \beta, \sigma, F_X) \quad \forall u < k$$
$$G_{n^*}(u - \bar\varepsilon s_1; \bar\beta, \bar\sigma, F_X) = G_{n^*}(u - \varepsilon s_1; \beta, \sigma, F_X) \quad \forall u > k.$$
The parametric family created by $G_{n^*}(n; \beta, \sigma, F_X)$, with $F_X$ fixed at the truth, satisfies (11)-(13) by assumption. Therefore, the equations above solve uniquely with $(\bar\varepsilon, \bar\beta, \bar\sigma) = (\varepsilon, \beta, \sigma)$. $\square$

A.5 Censored Quantile Regression - Proof of Identification Lemma 3
Call $D = I\{Q_\tau(y|X) \neq k\}$. Let $\beta(\tau) = [\beta_0(\tau), \beta_1(\tau), \ldots, \beta_d(\tau)]'$. Define $\tilde\beta(\tau) = [\beta_0(\tau) + \varepsilon s_0, \beta_1(\tau), \ldots, \beta_d(\tau), \varepsilon(s_1 - s_0)]'$. Multiplying Equation 19 by $D$ yields
$$D\,Q_\tau(y|X) = D\tilde X \tilde\beta(\tau).$$
Pre-multiplying it by $\tilde X'$ and taking expectations leads to
$$D\tilde X' Q_\tau(y|X) = D\tilde X' \tilde X \tilde\beta(\tau)$$
$$E\big[D\tilde X' Q_\tau(y|X)\big] = E\big[D\tilde X' \tilde X\big]\tilde\beta(\tau)$$
$$\tilde\beta(\tau) = E\big[D\tilde X' \tilde X\big]^{-1} E\big[D\tilde X' Q_\tau(y|X)\big]. \qquad \text{(A.2)}$$
An infinite amount of data identifies the joint distribution of $(y, X)$. This identifies the function $Q_\tau(y|X=x)$ for every $x$ in the support of $X$, and hence the joint distribution of $(y, Q_\tau(y|X), X, \tilde X, D)$. Therefore, $\tilde\beta(\tau)$ is identified by Equation A.2. Finally, $\varepsilon = \tilde\beta_{d+1}(\tau)/(s_1 - s_0)$. $\square$
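Before moving to the supplemental appendix, the closed-form objects in the partial-identification proof above ($\underline{\varepsilon}$, $\overline{A}$, $\underline{A}$, and the Case I-III bounds) can be collected into a short numerical sketch. The function name and the example values are ours; this is an illustration of the formulas, not the paper's `bunching` package:

```python
import math

def elasticity_bounds(B, f_minus, f_plus, M, s0, s1):
    """Identified set for the elasticity at a kink, given bunching mass B,
    side limits f_y(k-) = f_minus and f_y(k+) = f_plus, slope bound M,
    and slopes s0 > s1. Returns None (empty set) or (eps_min, eps_max),
    where eps_max may be infinite."""
    ds = s0 - s1
    # Case I: B below the common area at eps_lower -> empty identified set
    if B < abs(f_minus - f_plus) * (f_minus + f_plus) / (2 * M):
        return None
    # eps_min solves Abar(eps) = B (hat-shaped density attains the max area)
    eps_min = (2 * math.sqrt((f_minus**2 + f_plus**2) / 2 + M * B)
               - (f_minus + f_plus)) / (M * ds)
    # Case III: B at or above the cap of the lower envelope -> eps_max infinite
    if B >= (f_minus**2 + f_plus**2) / (2 * M):
        return eps_min, math.inf
    # Case II: eps_max solves Alow(eps) = B (inverted-hat density)
    eps_max = ((f_minus + f_plus)
               - 2 * math.sqrt((f_minus**2 + f_plus**2) / 2 - M * B)) / (M * ds)
    return eps_min, eps_max
```

For example, with $f_y(k^-) = f_y(k^+) = 1$, $M = 1$, and $s_0 - s_1 = 1$, a bunching mass of 10% gives the tight interval $[0.098,\, 0.103]$, illustrating why a kink alone carries little information without restrictions on $M$.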
B Supplemental Appendix for Online Publication

B.1 General Utility Maximization Problem with Multiple Kinks and Notches
To generalize the objective function in Equation 1, we update the budget set to have $J$ different tax regimes that change at cutoff points $0 < K_1 < \ldots < K_J$ on pre-tax labor income $Y$. Each tax regime has income tax rate $t_j$ such that $0 \leq t_0 \leq t_1 \leq \ldots \leq t_J < 1$. There are two possible tax changes. A change in the tax rate is a kink. A lump-sum tax change is called a notch. Agent type $N^*$ maximizes utility $U(C, Y; N^*)$ as follows:
$$\max_{C,Y}\; C - \frac{N^*}{1 + 1/\varepsilon}\left(\frac{Y}{N^*}\right)^{1+1/\varepsilon} \qquad \text{(B.1)}$$
$$\text{s.t.}\quad C = \sum_{j=0}^{J} I\{K_j < Y \leq K_{j+1}\}\big[I_j + (1-t_j)(Y - K_j)\big], \qquad \text{(B.2)}$$
where $K_0 = 0$, $K_{J+1} = \infty$, $I\{\cdot\}$ is the indicator function, the solution is always on the budget frontier (Equation B.2), and we assume the agent resolves indifference by choosing the smallest value of $Y$. The elasticity of income $Y$ with respect to $(1-t_j)$ is equal to $\varepsilon$ when the solution is interior.

The budget frontier is continuous except when there is a notch. The limit of the budget frontier as $Y \downarrow K_j$ is equal to $I_j$, but it equals $I_{j-1} + (1-t_{j-1})(K_j - K_{j-1})$ as $Y \uparrow K_j$. The size of the jump discontinuity at a notch location $K_j$ is equal to $I_j - I_{j-1} - (1-t_{j-1})(K_j - K_{j-1})$. The intercepts $I_j$ and $I_{j-1}$ are assumed to be such that jump discontinuities at notches are negative.

B.2 General Solution with Multiple Kinks and Notches
Lemma B.1 below provides a general solution to Problem B.1 with any combination of kinks and notches.
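Before stating the lemma, the structure of the solution can be previewed numerically. The sketch below grid-searches the budget frontier (B.2) under the utility in (B.1); the tie-break toward the smallest $Y$ is inherited from `argmax` returning the first maximizer. Function names, the grid, and the example kink are our own illustration, not the paper's code:

```python
import numpy as np

def consumption(Y, K, t, I):
    """Consumption on the budget frontier (B.2).
    K = [K_1, ..., K_J]; t = [t_0, ..., t_J]; I = [I_0, ..., I_J]."""
    cuts = [0.0] + list(K)                                # prepend K_0 = 0
    j = max(int(np.searchsorted(cuts, Y, side="left")) - 1, 0)
    return I[j] + (1 - t[j]) * (Y - cuts[j])

def best_income(n_star, eps, K, t, I, y_max=5.0, n_grid=50_001):
    """Grid search for Problem B.1; argmax picks the smallest maximizing Y."""
    Y = np.linspace(1e-6, y_max, n_grid)
    C = np.array([consumption(y, K, t, I) for y in Y])
    U = C - (n_star / (1 + 1 / eps)) * (Y / n_star) ** (1 + 1 / eps)
    return float(Y[np.argmax(U)])
```

With one kink at $K_1 = 1$, $t_0 = 0$, $t_1 = 0.5$, $I_0 = 0$, $I_1 = 1$ (a continuous frontier), and $\varepsilon = 1$, types with $n^*$ between 1 and 2 bunch at $Y = 1$, while types outside that interval locate at interior tangencies $Y = N^*(1-t_j)^{\varepsilon}$.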
Lemma B.1.
Define $\mathcal{N} = \cup_{j=0}^{J}\big(K_j(1-t_j)^{-\varepsilon},\, K_{j+1}(1-t_j)^{-\varepsilon}\big]$ as the set of $N^*$ values for which the indifference curves are tangent to the budget frontier. The function $Y^*: \mathcal{N} \to \mathbb{R}$,
$$Y^*(N^*) = \sum_{j=0}^{J} I\big\{K_j(1-t_j)^{-\varepsilon} < N^* \leq K_{j+1}(1-t_j)^{-\varepsilon}\big\}\, N^*(1-t_j)^{\varepsilon},$$
maps $N^*$ values to the $Y$ values corresponding to such tangency points. Similarly, $C^*(N^*)$ is consumption on the budget frontier (Equation B.2) when $Y = Y^*(N^*)$. Let $C_j$ be the value of $C^*(N^*)$ whenever $Y^*(N^*) = K_j$, $j = 1, \ldots, J$. For a notch point $K_j$, define the value of $N_j^I$ to be that of the first indifference curve tangent to the budget frontier on the right of $Y = K_j$ such that the utility level is equal to the utility of the notch point $K_j$:
$$N_j^I = \min\big\{N^* \in \mathcal{N} : U(C_j, K_j) = U(C^*(N^*), Y^*(N^*))\big\}. \qquad \text{(B.3)}$$
In the case of a kink, the bunching interval is defined as $[\underline{N}_j, \overline{N}_j]$, where $\underline{N}_j = K_j(1-t_{j-1})^{-\varepsilon}$ and $\overline{N}_j = K_j(1-t_j)^{-\varepsilon}$. In the case of a notch, the expression for $\underline{N}_j$ equals that of the kink case, but $\overline{N}_j$ changes to $N_j^I$.

Note that the bunching intervals of two consecutive kinks do not overlap, that is, $K_j(1-t_j)^{-\varepsilon} < K_{j+1}(1-t_j)^{-\varepsilon}$. The same is not true for a kink or a notch $K_{j+1}$ that comes right after a notch $K_j$, because $N_j^I$ may be greater than $K_{j+1}(1-t_j)^{-\varepsilon}$, depending on $\varepsilon$. In this case, $Y = K_{j+1}$ does not appear in the solution. To account for that, construct a subsequence $\{j_l\}_{l=1}^{L}$ of $\{1, \ldots, J\}$ such that: (i) $j_1 = 1$; and (ii) for $l \geq 2$, set $j_l$ to be the smallest $j$ such that $\underline{N}_j > \overline{N}_{j_{l-1}}$. Then, the solution to the maximization problem in (B.1) is given by
$$Y = \begin{cases} N^*(1-t_{j_1-1})^{\varepsilon}, & \text{if } 0 < N^* < \underline{N}_{j_1} \\ K_{j_1}, & \text{if } \underline{N}_{j_1} \leq N^* \leq \overline{N}_{j_1} \\ N^*(1-t_{j_2-1})^{\varepsilon}, & \text{if } \overline{N}_{j_1} < N^* < \underline{N}_{j_2} \\ \quad\vdots \\ N^*(1-t_{j_L-1})^{\varepsilon}, & \text{if } \overline{N}_{j_{L-1}} < N^* < \underline{N}_{j_L} \\ K_{j_L}, & \text{if } \underline{N}_{j_L} \leq N^* \leq \overline{N}_{j_L} \\ N^*(1-t_J)^{\varepsilon}, & \text{if } \overline{N}_{j_L} < N^* < \infty. \end{cases} \qquad \text{(B.4)}$$

Proof.
For every $N^* > 0$, there exists a unique solution on the budget frontier. If the consumer is indifferent between two solutions, we assume the consumer takes the solution with the smaller $Y$. The proof is by induction over $\bar J = 0, 1, \ldots, J$. Denote the budget frontier $BF_{\bar J}$ by
$$C = \sum_{j=0}^{\bar J} I\{\bar K_j < Y \leq \bar K_{j+1}\}\big[I_j + (1-t_j)(Y - \bar K_j)\big],$$
where $\bar K_j = K_j$ for $j = 0, 1, \ldots, \bar J$ and $\bar K_{\bar J+1} = \infty$.

As we change the budget frontier from $BF_{\bar J}$ to $BF_{\bar J+1}$, $K_{\bar J+1}$ takes a finite value strictly greater than $K_{\bar J}$, and $K_{\bar J+2}$ is set to $\infty$. If the solution to Problem B.1 with budget frontier $BF_{\bar J}$ is such that $Y < K_{\bar J+1} < \infty$, then this is also the solution to Problem B.1 with budget frontier $BF_{\bar J+1}$. In fact, points on $BF_{\bar J}$ dominate points on $BF_{\bar J+1}$, and the two frontiers coincide for $Y < K_{\bar J+1}$.

Part I: $\bar J = 0$; solve Problem B.1 with budget $BF_0$.

This is a standard consumer maximization problem where the optimal choice for $Y$ occurs at the point where the indifference curve is tangent to $BF_0$. Therefore, for $N^* > 0$, $Y = N^*(1-t_0)^{\varepsilon}$.

Part II: $\bar J = 1$; solve Problem B.1 with budget $BF_1$.

The budget frontier $BF_1$ has two segments: $BF_1^0$ for $0 < Y \leq K_1$, and $BF_1^1$ for $K_1 < Y$. If $N^* < K_1(1-t_0)^{-\varepsilon}$, then the solution of Part I, $Y = N^*(1-t_0)^{\varepsilon} < K_1$, is also the solution in Part II. It remains to find the solution for $N^* \geq K_1(1-t_0)^{-\varepsilon}$. These solutions must lie on $BF_1$ for $Y \geq K_1$ because they strictly dominate those that lie to the left of $K_1$.

Case I: Suppose $K_1$ is a kink.

Assume $N^*$ is such that $K_1(1-t_0)^{-\varepsilon} \leq N^* \leq K_1(1-t_1)^{-\varepsilon}$. If the solution is interior to $BF_1^1$, then it must be at a tangency point, in which case $Y = N^*(1-t_1)^{\varepsilon}$. However, $Y = N^*(1-t_1)^{\varepsilon} \leq K_1$, a contradiction because this $Y$ falls outside of the interior of $BF_1^1$. Therefore, if $N^*$ is such that $\underline{N}_1 = K_1(1-t_0)^{-\varepsilon} \leq N^* \leq K_1(1-t_1)^{-\varepsilon} = \overline{N}_1$, then the solution is $Y = K_1$. Suppose $N^* > \overline{N}_1$. Then, the solution is in the interior of $BF_1^1$, and it is equal to $Y = N^*(1-t_1)^{\varepsilon}$.
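Collecting Part II, Case I: with a single kink the solution is piecewise in $N^*$, with bunching exactly on $[\underline{N}_1, \overline{N}_1]$. A small closed-form sketch (our own illustrative helper, not the paper's code):

```python
def income_single_kink(n_star, eps, K1, t0, t1):
    """Solution to Problem B.1 with one kink at K1 and rates t0 < t1:
    interior tangency on either side of the kink, bunching in between."""
    N_lower = K1 * (1 - t0) ** (-eps)      # lower end of the bunching interval
    N_upper = K1 * (1 - t1) ** (-eps)      # upper end of the bunching interval
    if n_star < N_lower:
        return n_star * (1 - t0) ** eps    # interior on the first segment
    if n_star <= N_upper:
        return K1                          # the agent bunches at the kink
    return n_star * (1 - t1) ** eps        # interior on the second segment
```

For example, with $\varepsilon = 1$, $K_1 = 1$, $t_0 = 0$, and $t_1 = 0.5$, every type with $n^* \in [1, 2]$ reports $Y = 1$.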
Case II: Suppose $K_1$ is a notch.

There is a jump-down discontinuity in $BF_1$ at $K_1$, and $BF_1$ is continuous from the left. Consider the point $(C, Y) = (C_1, K_1)$ on $BF_1$. Define $Y^D$ to be the value of $Y$ such that the corresponding $C$ value on $BF_1^1$ is equal to $C_1$. The jump-down discontinuity creates a strictly dominated region on $BF_1^1$ because the utility of $(C_1, K_1)$ is strictly greater than the utility of any solution with $Y \in (K_1, Y^D)$. Indifference between $K_1$ and $Y^D$ is resolved towards $K_1$ by assumption. Therefore, we cannot have solutions to Problem B.1 with budget $BF_1$ such that $Y \in (K_1, Y^D]$.

Define $\widetilde N^I$ as the solution of Problem B.3 when the budget is $BF_1^1$ (instead of $BF_1$). This is the smallest $N^*$ for which Problem B.1 with budget $BF_1^1$ has a solution with utility equal to $U(C_1, K_1)$.

First, a solution $\widetilde N^I$ exists. To see that, note that for small $N^*$, the tangency point $Y = N^*(1-t_1)^{\varepsilon}$ along $BF_1^1$ falls in the dominated region $Y \in (K_1, Y^D]$, and the utility is less than $U(C_1, K_1)$; on the other hand, the utility at this tangency point increases with $N^*$, and it eventually equals $U(C_1, K_1)$. The solution is such that $\widetilde N^I \geq Y^D(1-t_1)^{-\varepsilon} > K_1(1-t_1)^{-\varepsilon}$.

Second, the solution $\widetilde N^I$ is unique. To see that, solve for $N^*$ in the equation below:
$$U(C_1, K_1) = U\big(I_1 + N^*(1-t_1)^{\varepsilon+1} - K_1(1-t_1),\; N^*(1-t_1)^{\varepsilon}\big),$$
where $C = I_1 + N^*(1-t_1)^{\varepsilon+1} - K_1(1-t_1)$ is consumption on $BF_1^1$ when $Y = N^*(1-t_1)^{\varepsilon}$. Evaluating and rearranging the equality gives
$$N^*(1-t_1)^{\varepsilon+1} + \varepsilon (N^*)^{-1/\varepsilon} K_1^{\frac{\varepsilon+1}{\varepsilon}} = (1+\varepsilon)\big[C_1 - I_1 + K_1(1-t_1)\big].$$
The solution is unique because the left-hand side is strictly increasing in $N^*$ given $N^* > K_1(1-t_1)^{-\varepsilon}$, while the right-hand side does not vary with $N^*$. Note that $\widetilde N^I$ is the unique solution to Problem B.3 when the budget is $BF_1^1$.

Call $\widetilde Y^I = \widetilde N^I(1-t_1)^{\varepsilon}$. Suppose there is a solution to Problem B.1 with budget $BF_1$ such that $Y^D < Y \leq \widetilde Y^I$.
This solution is interior to budget $BF_1^1$, so we must have $Y = N^*(1-t_1)^{\varepsilon}$ for some $N^*$. But such a solution cannot be a solution to Problem B.1 with budget $BF_1$ because $Y \leq \widetilde Y^I$, and so it is dominated by $(C_1, K_1)$. Therefore, we cannot have solutions to Problem B.1 with budget $BF_1$ such that $Y \in (K_1, \widetilde Y^I]$.

It remains to characterize the solution when $N^*$ is such that $K_1(1-t_0)^{-\varepsilon} \leq N^*$. If $N^*$ is such that $\underline{N}_1 = K_1(1-t_0)^{-\varepsilon} \leq N^* \leq \widetilde Y^I(1-t_1)^{-\varepsilon} = \overline{N}_1$, the solution cannot be in the interior of $BF_1^0$, since $Y = N^*(1-t_0)^{\varepsilon} \geq K_1$; it cannot be in $(K_1, \widetilde Y^I]$ either. Assume it is in the interior of $BF_1^1$ with $Y > \widetilde Y^I$. Since it is interior, it satisfies $Y = N^*(1-t_1)^{\varepsilon}$, but $N^* \leq \widetilde Y^I(1-t_1)^{-\varepsilon}$, which makes $Y \leq \widetilde Y^I$, a contradiction. Therefore, the solution to Problem B.1 with budget $BF_1$ when $N^* \in [\underline{N}_1, \overline{N}_1]$ is $Y = K_1$. Finally, suppose $N^* > \overline{N}_1$. Then, the solution is in the interior of $BF_1^1$, and it is equal to $Y = N^*(1-t_1)^{\varepsilon}$.

Part III:
Assume the solution of Problem B.1 with budget $BF_{\bar J}$ and $1 \leq \bar J < J$ is as in Equation B.4 with $\bar J$. Show that (B.4) with $\bar J + 1$ solves Problem B.1 with budget $BF_{\bar J+1}$.

Consider Problem B.1 with budget $BF_{\bar J}$ and solution B.4 with $L$ being $\bar L$. If $N^*$ is such that $Y < K_{\bar J+1} < \infty$, then $Y$ also solves Problem B.1 with budget $BF_{\bar J+1}$. Therefore, the solutions to Problem B.1 with budget $BF_{\bar J+1}$ or budget $BF_{\bar J}$ coincide for those values of $N^*$. Note also that, if $K_j$ is a notch and $j < j_{\bar L}$, then the value of $\overline{N}_j$ (defined in (B.3)) does not change when the budget changes from $BF_{\bar J}$ to $BF_{\bar J+1}$. If $K_{j_{\bar L}}$ is a notch, then the value $\overline{N}_{j_{\bar L}}$ may change (Case IV below). In what follows, consider the last two budget segments of $BF_{\bar J+1}$: $BF_{\bar J+1}^{\bar J}$ and $BF_{\bar J+1}^{\bar J+1}$.

Case I: $K_{j_{\bar L}}$ is a kink, $K_{\bar J+1}$ is a kink.

In this case, $j_{\bar L+1} = \bar J + 1$ because $\underline{N}_{\bar J+1} = K_{\bar J+1}(1-t_{\bar J})^{-\varepsilon} > \overline{N}_{j_{\bar L}}$, so that $\bar J + 1$ is the smallest $j$ such that $\underline{N}_j > \overline{N}_{j_{\bar L}}$. It is also true that $j_{\bar L} = \bar J$. To see that, note that consecutive intervals $[\underline{N}_j, \overline{N}_j]$ never overlap for kinks because $\overline{N}_j = K_j(1-t_j)^{-\varepsilon} < K_{j+1}(1-t_j)^{-\varepsilon} = \underline{N}_{j+1}$. The upper limit of a kink interval $j$ is strictly smaller than the lower limit of a notch interval $j+1$. However, the upper limit of a notch interval $j$ may be bigger than the lower limit of the next interval $j+1$. Suppose $j_{\bar L} = \bar J$ were not true, that is, $j_{\bar L} < \bar J$. Then, any $j$ such that $j_{\bar L} < j \leq \bar J$ is not in the subsequence $\{j_l\}$ because $K_{j_{\bar L}}$ is a notch, and its interval overlaps with the interval of $j$. But this is a contradiction with $K_{j_{\bar L}}$ being a kink point.

If $N^* < K_{\bar J+1}(1-t_{\bar J})^{-\varepsilon}$, then the solution B.4 with budget $BF_{\bar J}$ has $Y < K_{\bar J+1}$, and $Y$ also solves Problem B.1 with budget $BF_{\bar J+1}$ for that same value of $N^*$.
It remains to characterize the solution when $N^* \geq K_{\bar J+1}(1-t_{\bar J})^{-\varepsilon}$. Assume $N^*$ is such that $\underline{N}_{\bar J+1} = K_{\bar J+1}(1-t_{\bar J})^{-\varepsilon} \leq N^* \leq K_{\bar J+1}(1-t_{\bar J+1})^{-\varepsilon} = \overline{N}_{\bar J+1}$. As seen in Part II, Case I, the solution cannot be interior to $BF_{\bar J+1}^{\bar J+1}$. The solution must be at $K_{\bar J+1}$. Assume $N^* > \overline{N}_{\bar J+1}$. Then, the solution is interior to $BF_{\bar J+1}^{\bar J+1}$, and it equals $Y = N^*(1-t_{\bar J+1})^{\varepsilon}$.

Case II: $K_{j_{\bar L}}$ is a kink, $K_{\bar J+1}$ is a notch.

As seen in Part III, Case I, $j_{\bar L} = \bar J$. We also have $j_{\bar L+1} = \bar J + 1$ because the interval $[\underline{N}_j, \overline{N}_j]$ of a kink does not overlap with the interval of the notch $j+1$. If $N^* < K_{\bar J+1}(1-t_{\bar J})^{-\varepsilon}$, then the solution B.4 with budget $BF_{\bar J}$ has $Y < K_{\bar J+1}$, and $Y$ also solves Problem B.1 with budget $BF_{\bar J+1}$ for that same value of $N^*$. It remains to characterize the solution when $N^* \geq K_{\bar J+1}(1-t_{\bar J})^{-\varepsilon}$. Assume $N^*$ is such that $\underline{N}_{\bar J+1} = K_{\bar J+1}(1-t_{\bar J})^{-\varepsilon} \leq N^* \leq \overline{N}_{\bar J+1}$, where $\overline{N}_{\bar J+1}$ is the solution of Problem B.3 when the budget is $BF_{\bar J+1}$. As seen in Part II, Case II, the solution $Y$ cannot be in $(K_{\bar J+1}, \overline{N}_{\bar J+1}(1-t_{\bar J+1})^{\varepsilon}]$ or in the interior of $BF_{\bar J+1}^{\bar J+1}$. Therefore, the solution is $Y = K_{\bar J+1}$. Assume $N^* > \overline{N}_{\bar J+1}$. Then, the solution is interior to $BF_{\bar J+1}^{\bar J+1}$, and it equals $Y = N^*(1-t_{\bar J+1})^{\varepsilon}$.

Case III: $K_{j_{\bar L}}$ is a notch, $\overline{N}_{j_{\bar L}} < \underline{N}_{\bar J+1}$.

For the notch $K_{j_{\bar L}}$, the solution $\overline{N}_{j_{\bar L}}$ to Problem B.3 when the budget is $BF_{\bar J}$ does not change when the budget becomes $BF_{\bar J+1}$, precisely because $\overline{N}_{j_{\bar L}} < \underline{N}_{\bar J+1}$. In this case, $j_{\bar L+1} = \bar J + 1$. For $N^*$ such that $\overline{N}_{j_{\bar L}} < N^* < K_{\bar J+1}(1-t_{\bar J})^{-\varepsilon}$, the solution B.4 with budget $BF_{\bar J}$ has $Y < K_{\bar J+1}$, and $Y$ also solves Problem B.1 with budget $BF_{\bar J+1}$ for that same value of $N^*$.
It remains to characterize the solution when $N^* \geq K_{\bar J+1}(1-t_{\bar J})^{-\varepsilon}$.

Assume $K_{\bar J+1}$ is a kink, and that $N^*$ is such that $\underline{N}_{\bar J+1} = K_{\bar J+1}(1-t_{\bar J})^{-\varepsilon} \leq N^* \leq K_{\bar J+1}(1-t_{\bar J+1})^{-\varepsilon} = \overline{N}_{\bar J+1}$. As seen in Part II, Case I, the solution cannot be interior to $BF_{\bar J+1}^{\bar J+1}$. The solution must be at $K_{\bar J+1}$. Assume $N^* > \overline{N}_{\bar J+1}$. Then, the solution is interior to $BF_{\bar J+1}^{\bar J+1}$, and it equals $Y = N^*(1-t_{\bar J+1})^{\varepsilon}$.

Assume $K_{\bar J+1}$ is a notch, and that $N^*$ is such that $\underline{N}_{\bar J+1} = K_{\bar J+1}(1-t_{\bar J})^{-\varepsilon} \leq N^* \leq \overline{N}_{\bar J+1}$, where $\overline{N}_{\bar J+1}$ is the solution of Problem B.3 when the budget is $BF_{\bar J+1}$. As seen in Part II, Case II, the solution $Y$ cannot be in $(K_{\bar J+1}, \overline{N}_{\bar J+1}(1-t_{\bar J+1})^{\varepsilon}]$ or in the interior of $BF_{\bar J+1}^{\bar J+1}$. Therefore, the solution is $Y = K_{\bar J+1}$. Assume $N^* > \overline{N}_{\bar J+1}$. Then, the solution is interior to $BF_{\bar J+1}^{\bar J+1}$, and it equals $Y = N^*(1-t_{\bar J+1})^{\varepsilon}$.

Case IV: $K_{j_{\bar L}}$ is a notch, $\overline{N}_{j_{\bar L}} \geq \underline{N}_{\bar J+1}$.

The indifference value for $Y$ at $\overline{N}_{j_{\bar L}}$ is $Y^I_{j_{\bar L}} = \overline{N}_{j_{\bar L}}(1-t_{\bar J})^{\varepsilon} \geq \underline{N}_{\bar J+1}(1-t_{\bar J})^{\varepsilon} = K_{\bar J+1}$. If $\overline{N}_{j_{\bar L}} = \underline{N}_{\bar J+1}$, the solution to Problem B.3 when the budget is $BF_{\bar J}$ remains unchanged when the budget becomes $BF_{\bar J+1}$. If $\overline{N}_{j_{\bar L}} > \underline{N}_{\bar J+1}$, then $Y^I_{j_{\bar L}} > K_{\bar J+1}$, and the solution to Problem B.3 when the budget is $BF_{\bar J}$ changes when the budget becomes $BF_{\bar J+1}$. The value of $\overline{N}_{j_{\bar L}}$ increases such that the new indifference point satisfies $Y^I_{j_{\bar L}} = \overline{N}_{j_{\bar L}}(1-t_{\bar J+1})^{\varepsilon}$.

There does not exist a $j$ such that $\underline{N}_j > \overline{N}_{j_{\bar L}}$ because $K_{\bar J+1}$ is the last tax-change point available and $\underline{N}_{\bar J+1} \leq \overline{N}_{j_{\bar L}}$.
Therefore, when constructing the solution of Problem B.1 with budget $BF_{\bar J+1}$, the last term in the subsequence $\{j_l\}$ remains $j_{\bar L}$.

The point $K_{j_{\bar L}}$ is a notch, so Part II, Case II says that for $N^*$ such that $\underline{N}_{j_{\bar L}} = K_{j_{\bar L}}(1-t_{j_{\bar L}-1})^{-\varepsilon} \leq N^* \leq \overline{N}_{j_{\bar L}}$, the solution $Y$ cannot be in $(K_{j_{\bar L}}, \overline{N}_{j_{\bar L}}(1-t_{\bar J+1})^{\varepsilon}]$ or in the interior of $BF_{\bar J+1}^{\bar J+1}$. Therefore, the solution is $Y = K_{j_{\bar L}}$. Assume $N^* > \overline{N}_{j_{\bar L}}$. Then, the solution is interior to $BF_{\bar J+1}^{\bar J+1}$, and it equals $Y = N^*(1-t_{\bar J+1})^{\varepsilon}$. $\square$

B.3 Friction Errors and Failure of the "Polynomial Strategy"
This section presents a counterexample that illustrates the failure of a common identification strategy used in applied work to estimate the elasticity using kinks. For a review, see Kleven (2016).

First, we set the parameters of the model. The true values are $\varepsilon = 1$, $t_0 = 0.2$, $t_1 > t_0$, and $k = 0$. The bunching interval $[\underline{n}, \overline{n}]$ has length 0.2, and the distribution of ability is uniform, $n^* \sim U[-1.565, 0.435]$, with support of length equal to 2. The probability of bunching, or bunching mass $B$, is therefore equal to 10% in this example. The friction error $e$ is also assumed uniformly distributed, $e \sim U[-0.5, 0.5]$. Observed income is $\tilde y = y + e$, where $y$ is a function of $n^*$, $\varepsilon$, $t_0$, and $t_1$, as described in Equation 4.

In the counterfactual scenario of no tax change, the counterfactual income with friction error is denoted $\tilde y_0$, and the counterfactual income without friction error is $y_0$. Figure B.1a depicts the PDFs of $\tilde y$ and $\tilde y_0$.

A common identification strategy used in applied work is to fit a polynomial to the PDF of $\tilde y$, excluding observations in the neighborhood of the kink $k = 0$ that corresponds to the support of the measurement error (i.e., $[-0.5, 0.5]$). The bunching mass is then estimated by the area between the PDF of $\tilde y$ and the polynomial fit extrapolated into the excluded neighborhood around the kink. Figure B.1b illustrates the procedure. The figure shows that such a strategy fails to identify the true bunching mass, even when the 7th-order polynomial fit is perfect and we assume the researcher knows the support of $e$.

The last part of the estimation strategy uses the extrapolated polynomial to predict the counterfactual PDF of $y_0$. Following Equation 6, identification of $\varepsilon$ requires the counterfactual PDF of $y_0$, without measurement error. Figure B.1c shows that the polynomial strategy fails to retrieve the PDF of $y_0$. The PDF predicted by the polynomial regression does not integrate to one, and thus it is not a PDF. If we divide the polynomial-based PDF in Figures B.1b and B.1c by its integral, the PDF shifts up in the graphs. The re-normalized PDF still misses the true $f_{y_0}$, and the underestimation of $B$ is larger than before.

The polynomial strategy fails for two reasons:

1. The PDF of $\tilde y$ is not simply the PDF of $y$ plus the PDF of $e$ (Figure B.1a), but the convolution of the two PDFs. While $y$ and $e$ have uniform distributions, with flat PDFs, their convolution does not have a flat PDF. As a result, extrapolating the polynomial to find the bunching mass and to predict the PDF of $y_0$ is misleading;

2. The counterfactual distribution required for identification of the elasticity is the PDF of $y_0$, and not the PDF of $\tilde y_0$ (Equation 6). Moreover, even if friction errors were not a problem, it is not possible to use the distribution of $y$ to back out the distribution of $y_0$ for values of $y$ inside $[k, k + (s_0 - s_1)\varepsilon]$. The shape of the distribution of $y_0$ is unidentified when $n^*$ falls in the bunching interval (Figure 1).

B.4 Parametric Gaussian Family Identifies the Elasticity
We demonstrate how to verify conditions (11)-(13) in the parametric Gaussian case. Suppose the distribution of $n^*$ follows a normal distribution with unknown mean $\mu$ and unknown variance $\sigma^2$, such that $F_{n^*}(n) = G_{n^*}(n; \mu, \sigma) = \Phi\left(\frac{n-\mu}{\sigma}\right)$, where $\Phi$ denotes the standard normal CDF.

Take $(k, s_0, s_1, \varepsilon, \mu, \sigma)$ arbitrary. The goal is to show that $\bar\varepsilon = \varepsilon$, $\bar\mu = \mu$, and $\bar\sigma = \sigma$ are the only solutions to the equalities below:
$$\Phi\left(\frac{k - \varepsilon s_1 - \mu}{\sigma}\right) - \Phi\left(\frac{k - \varepsilon s_0 - \mu}{\sigma}\right) = \Phi\left(\frac{k - \bar\varepsilon s_1 - \bar\mu}{\bar\sigma}\right) - \Phi\left(\frac{k - \bar\varepsilon s_0 - \bar\mu}{\bar\sigma}\right) \qquad \text{(B.5)}$$
$$\Phi\left(\frac{u - \varepsilon s_0 - \mu}{\sigma}\right) = \Phi\left(\frac{u - \bar\varepsilon s_0 - \bar\mu}{\bar\sigma}\right) \quad \forall u < k \qquad \text{(B.6)}$$
$$\Phi\left(\frac{u - \varepsilon s_1 - \mu}{\sigma}\right) = \Phi\left(\frac{u - \bar\varepsilon s_1 - \bar\mu}{\bar\sigma}\right) \quad \forall u > k \qquad \text{(B.7)}$$

Take (B.6) and apply $\Phi^{-1}(\cdot)$ to both sides:
$$\frac{u - \varepsilon s_0 - \mu}{\sigma} = \frac{u - \bar\varepsilon s_0 - \bar\mu}{\bar\sigma}, \quad \forall u < k.$$
These are two lines in $u$ that must have the same slope, $1/\sigma = 1/\bar\sigma$, and the same intercept, $(\varepsilon s_0 + \mu)/\sigma = (\bar\varepsilon s_0 + \bar\mu)/\bar\sigma$. These imply that $\bar\sigma = \sigma$ and $\bar\varepsilon s_0 + \bar\mu = \varepsilon s_0 + \mu$. Similarly, (B.7) implies that $\bar\varepsilon s_1 + \bar\mu = \varepsilon s_1 + \mu$. Subtracting this last equation from the previous one gives $\bar\varepsilon(s_0 - s_1) = \varepsilon(s_0 - s_1)$, which yields $\bar\varepsilon = \varepsilon$. Finally, $\bar\varepsilon s_0 + \bar\mu = \varepsilon s_0 + \mu$ gives $\bar\mu = \mu$. $\square$

B.5 Implementation of Censored Quantile Regressions
The optimization problem in Equation 21 is computationally difficult. For the left- (or right-) censored case, Chernozhukov and Hong (2002) proposed a fast and practical estimator that consists of three steps. First, fit a flexible Probit model that explains the probability of no censoring; then select observations whose values of $X$ lead to a predicted probability of no censoring greater than $1 - \tau$. Second, fit a quantile regression of $y$ on $X$ using the observations selected in the first step; then select observations whose values of $X$ lead to a predicted quantile greater than $k$. Third, repeat the second step using the observations selected at the end of the second step. Chernozhukov and Hong (2002) demonstrate consistency and asymptotic normality of their three-step estimator. Moreover, they show that the standard errors computed by the quantile regression in the third step are valid.

Our case of middle censoring requires a straightforward modification of the method proposed by Chernozhukov and Hong (2002). Inspired by their algorithm, we propose the following implementation steps.

1. Create dummies $\delta_i^- = I\{y_i < k\}$ (not censored, left of $k$) and $\delta_i^+ = I\{y_i > k\}$ (not censored, right of $k$). Fit two Probit models to estimate $P[\delta_i^+ | X_i] = \Phi(X_i g^+)$ and $P[\delta_i^- | X_i] = \Phi(X_i g^-)$, where $\Phi$ denotes the CDF of a standard normal distribution and $g^\pm$ are vectors of parameters. You may use powers and interactions of $X_i$ to make this stage as flexible as possible. Select two subsamples as follows. Compute the 10th percentile of the empirical distribution of $\Phi(X_i \hat g^+) - (1-\tau)$ conditional on $\Phi(X_i \hat g^+) > 1 - \tau$. Let $\kappa_0^+(\tau)$ be that 10th percentile. The first subsample is $J_0^+(\tau) = \{i : \Phi(X_i \hat g^+) > 1 - \tau + \kappa_0^+(\tau)\}$.
The second subsample is $J_0^-(\tau) = \{i : \Phi(X_i \hat g^-) > \tau + \kappa_0^-(\tau)\}$, where $\kappa_0^-(\tau)$ is the 10th percentile of the empirical distribution of $\Phi(X_i \hat g^-) - \tau$ conditional on $\Phi(X_i \hat g^-) > \tau$. Create a dummy $W_i = I\{i \in J_0^+(\tau)\}$.

2. Fit the quantile regression model $Q_\tau(y_i | X_i, W_i) = X_i b(\tau) + W_i \delta(\tau)$ using observations in $J_0^-(\tau) \cup J_0^+(\tau)$. Use the estimates of this quantile regression, that is, $\hat b(\tau)$ and $\hat\delta(\tau)$, to create two subsamples as follows. The first subsample is $J_1^+(\tau) = \{i : X_i \hat b(\tau) + \hat\delta(\tau) > k + \kappa_1^+(\tau)\}$, where $\kappa_1^+(\tau)$ is the 3rd percentile of the empirical distribution of $X_i \hat b(\tau) + \hat\delta(\tau) - k$ conditional on $X_i \hat b(\tau) + \hat\delta(\tau) > k$. The second subsample is $J_1^-(\tau) = \{i : X_i \hat b(\tau) < k + \kappa_1^-(\tau)\}$, where $\kappa_1^-(\tau)$ is the 97th percentile of the empirical distribution of $X_i \hat b(\tau) - k$ conditional on $X_i \hat b(\tau) < k$. Create a dummy $W_i = I\{i \in J_1^+(\tau)\}$.

3. Fit the quantile regression model $Q_\tau(y_i | X_i, W_i) = X_i b(\tau) + W_i \delta(\tau)$ using observations in $J_1^-(\tau) \cup J_1^+(\tau)$ to obtain estimates $\hat b(\tau)$ and $\hat\delta(\tau)$. The elasticity estimator is $\hat\varepsilon = \hat\delta(\tau)/(s_1 - s_0)$.

B.6 Estimates with the Filtering Method of Saez (2010)
In this section, we recompute the estimates of Table 1 using a different filtering method. Specifically, we employ the procedure used by Saez (2010) to obtain the bunching mass and the side limits of the distribution of income without friction error, $Y$. The procedure implicitly defines a way to estimate the unobserved distribution of $Y$ given the observed distribution of income with friction error, $\tilde Y$. We refer the reader to Figure 2 of Saez (2010).

The first step is to construct a histogram-based estimate of the PDF $f_{\tilde Y}$, and then average $f_{\tilde Y}$ for $\tilde Y \in [K - 2\delta, K - \delta] \cup [K + \delta, K + 2\delta]$, where $K = 8{,}580$ is the kink point and $\delta = 1{,}500$ defines the excluded region. Call that average $\bar f$. The bunching mass is estimated by the area between the two curves $f_{\tilde Y}$ and $\bar f$. The continuous portion of $f_Y$ equals $f_{\tilde Y}$, except for the excluded region $[K - \delta, K + \delta]$, where $f_Y$ equals $\bar f$. We obtain the CDFs $F_Y$ and $F_{\tilde Y}$ from their PDF estimates. Finally, we rely on $Y = F_Y^{-1}\big(F_{\tilde Y}(\tilde Y)\big)$ to transform $\tilde Y$ into $Y$.

Table B.1: Estimates Using U.S. Tax Returns 1995-2004

Columns: (1) Saez (2010); (2) Theorem 2 bounds, M = 0.5; (3) Theorem 2 bounds, M = 1; (4) Tobit, full sample; (5) Tobit, truncated 75% sample; (6) Tobit, truncated 50% sample; (7) Tobit, truncated 25% sample. Entries marked "·" (including the bound intervals and the average-income rows) were lost in extraction.

All (Obs. 189.1m): Elasticity (ε) 0.235 (1); remaining entries ·. Avg. income ($): ·.
Self-employed (Obs. 33.5m): Elasticity (ε) 0.933 (1); [0.·, ·] (2); [·, ∞] (3); 0.617 (4); 0.809 (5); 0.805 (6); 0.822 (7). Avg. income ($): ·.
Self-employed, married (Obs. 24.0m): Elasticity (ε) 0.391 (1); [0.·, ·] (2); [·, ∞] (3); 0.187 (4); 0.286 (5); 0.330 (6); 0.331 (7). Avg. income ($): ·.
Self-employed, not married (Obs. 9.6m): Elasticity (ε) 1.260 (1); [1.·, ·] (2); [·, ∞] (3); 1.145 (4); 0.991 (5); 1.003 (6); 1.·† (7). Avg. income ($): ·.

Notes:
The table shows estimates of the elasticity for four different subsamples of the IRS data, using three different approaches discussed in the paper. The first approach (column 1) uses the trapezoidal approximation to point-identify the elasticity (Example 1). Estimates and standard errors were computed using the publicly available code by Saez (2010) at the website of the American Economic Journal: Economic Policy. The second approach (columns 2 and 3) computes partially identified sets for the elasticity (Theorem 2), using non-parametric estimates of the side limits of $f_y$ at the kink and of the bunching mass. Side limits were estimated using the method of Cattaneo et al. (2019). The estimate for the bunching mass equals the sample proportion of $y$ observations equal to the kink point (see the discussion in Section B.6 on friction errors). Upper and lower bounds are calculated for two choices of M, that is, the maximum slope of the PDF of the unobserved heterogeneity $n^*$. Column 4 has Tobit MLE estimates of the elasticity that utilize the full sample of data, along with robust standard errors. Columns 5 through 7 report truncated Tobit MLE estimates. As we move from column 5 to column 7, we restrict the estimation sample to shrinking symmetric windows around the kink that utilize 75% to 25% of the data. The set of covariates that enters the Tobit estimation is kept constant across different truncation windows. It includes dummy variables such as marital and employment status, year effects, types of deductions or social security benefits received, and whether the filer used a tax preparation software. † There are too few observations for the maximum likelihood estimator to converge when using 25% of the sample. This estimate uses 27% instead.
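The filtering steps described in Section B.6 can be sketched with a histogram. This is our own minimal simplification (illustrative function name, grid, and simulated data), not Saez's published code:

```python
import numpy as np

def filter_bunching_mass(y_tilde, K, delta, bin_width):
    """Estimate the bunching mass as the area between the histogram PDF of
    observed income and the flat level bar_f, where bar_f averages the
    density over [K-2d, K-d] and [K+d, K+2d] next to the excluded region."""
    edges = np.arange(y_tilde.min(), y_tilde.max() + bin_width, bin_width)
    f, _ = np.histogram(y_tilde, bins=edges, density=True)
    mid = 0.5 * (edges[:-1] + edges[1:])
    side = ((mid >= K - 2 * delta) & (mid <= K - delta)) | \
           ((mid >= K + delta) & (mid <= K + 2 * delta))
    bar_f = f[side].mean()                            # average density beside the window
    excluded = (mid > K - delta) & (mid < K + delta)
    B_hat = (f[excluded] - bar_f).sum() * bin_width   # area between the two curves
    return B_hat, bar_f

# Illustrative check on simulated data: 10% bunchers near a kink at 0,
# the rest uniform on [-4, 4].
rng = np.random.default_rng(0)
n = 200_000
bunch = rng.uniform(size=n) < 0.10
y_tilde = np.where(bunch, rng.normal(0.0, 0.1, n), rng.uniform(-4.0, 4.0, n))
B_hat, bar_f = filter_bunching_mass(y_tilde, K=0.0, delta=1.0, bin_width=0.1)
```

On this simulated sample, the recovered mass is close to the true 10% because the continuous density really is flat near the kink; the point of Section B.3 is that this need not hold once friction errors smooth the distribution.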
Figure B.1: Counterexample where "Polynomial Strategy" Fails

(a) Distribution of Income with Friction Error. (b) Estimation of Bunching Mass (true bunching mass: 0.10; estimated bunching mass: 0.07). (c) Counterfactual Distribution of Income without Friction Error.

[Panels (a)-(c) plot PDFs over income values roughly from -1 to 1.5: $f_{\tilde y}$ and $f_{\tilde y_0}$ in (a), $f_{\tilde y}$ with the fitted polynomial in (b), and the estimated $f_{y_0}$ in (c).]

Notes:
The population model of this example has $\varepsilon = 1$, $t_0 = 0.2$, $t_1 > t_0$, and $k = 0$. The distribution of ability is assumed uniform, $n^* \sim U[-1.565, 0.435]$, and the friction error is $e \sim U[-0.5, 0.5]$. Observed income is $\tilde y = y + e$, where $y$ is a function of $n^*$, $\varepsilon$, $t_0$, and $t_1$, as described in Equation 4. Figure B.1a displays the PDFs of $\tilde y$ and $\tilde y_0$. Figure B.1b displays the 7th-order polynomial fitted to the PDF of $\tilde y$ using observations in $(-\infty, -0.5) \cup (0.5, \infty)$. The bunching mass is estimated by the integral of the difference between $f_{\tilde y}$ and the fitted polynomial inside the excluded region. The polynomial strategy underestimates the true bunching mass and does not retrieve the PDF of $y_0$ (Figure B.1c).
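The first failure reason in Section B.3, that the density of $y + e$ is a convolution and hence not flat even when both components are uniform, can be checked directly. The numbers below are our own illustration, not the paper's calibration:

```python
import numpy as np

# The density of y + e is the convolution of f_y and f_e, not their sum.
# With y ~ U[-2, 2] and e ~ U[-0.5, 0.5], the convolution is flat on
# [-1.5, 1.5] but has linear ramps on [1.5, 2.5] and [-2.5, -1.5], so a
# polynomial fitted to one region and extrapolated to another is misleading.
rng = np.random.default_rng(0)
y = rng.uniform(-2.0, 2.0, 1_000_000)     # stand-in for income without friction
e = rng.uniform(-0.5, 0.5, 1_000_000)     # friction error
y_tilde = y + e                           # observed income

edges = np.linspace(-2.5, 2.5, 101)
hist, _ = np.histogram(y_tilde, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

flat = hist[np.abs(centers) < 1.0]                 # flat interior, height 1/4
ramp = hist[(centers > 1.6) & (centers < 2.4)]     # sloped ramp near the edge
```

Here `flat.std()` is only Monte Carlo noise, while `ramp.std()` reflects the genuine slope of the convolution, so the observed density is demonstrably not flat everywhere even though both inputs are.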