Loss aversion and the welfare ranking of policy interventions
Sergio Firpo, Antonio F. Galvao, Martyna Kobus, Thomas Parker, Pedro Rosa-Dias
April 21, 2020
Abstract
In this paper we develop theoretical criteria and econometric methods to rank policy interventions in terms of welfare when individuals are loss-averse. The new criterion for "loss aversion-sensitive dominance" defines a weak partial ordering of the distributions of policy-induced gains and losses. It applies to the class of welfare functions which model individual preferences with non-decreasing and loss-averse attitudes towards changes in outcomes. We also develop new statistical methods to test loss aversion-sensitive dominance in practice, using nonparametric plug-in estimates. We establish the limiting distributions of uniform test statistics by showing that they are directionally differentiable. This implies that inference can be conducted by a special resampling procedure. Since point-identification of the distribution of policy-induced gains and losses may require very strong assumptions, we also extend comparison criteria, test statistics, and resampling procedures to the partially-identified case. Finally, we illustrate our methods with an empirical application to welfare comparison of two income support programs.
Keywords:
Welfare, Loss Aversion, Policy Evaluation, Stochastic Ordering, Directional Differentiability
JEL codes:
C12, C14, I30

∗ The authors are grateful to Pedro Carneiro, Hide Ichimura, Radosław Kurek, Essie Maasoumi, Piotr Miłoś, Magne Mogstad, Jim Powell, João Santos Silva, Tiemen Woutersen and seminar participants at the University of California Berkeley, University of Arizona, 28th annual meeting of the Midwest Econometrics Group, 35th Meeting of the Canadian Econometric Study Group, and 3rd edition of the Rio-Sao Paulo Econometrics Workshop for useful comments and discussions regarding this paper. Marta Schoch provided excellent research assistance. Computer programs to replicate the numerical analyses are available from the authors. All the remaining errors are ours.
† Insper, Sao Paulo, Brazil. E-mail: [email protected]
‡ Department of Economics, University of Arizona, Tucson, USA. E-mail: [email protected]
§ Institute of Economics, Polish Academy of Sciences, Warsaw, Poland. E-mail: [email protected]
¶ Department of Economics, University of Waterloo, Waterloo, Canada. E-mail: [email protected]
‖ Department of Economics and Public Policy, Imperial College Business School, Imperial College London, UK. E-mail: [email protected]

We suffer more, ... when we fall from a better to a worse situation, than we ever enjoy when we rise from a worse to a better.
Adam Smith, The Theory of Moral Sentiments
The welfare ranking of policy interventions has classically (Atkinson, 1970) been conducted under the Rawlsian principle of the "veil of ignorance": all policies that produce the same marginal outcome distributions are deemed equivalent for the purpose of welfare analysis. From this perspective, individual gains and losses should be irrelevant (Roemer, 1998; Sen, 2000). However, policies often generate heterogeneous effects, potentially giving rise to gains and losses, which can be consequential for several reasons.

More modern approaches to ranking policy interventions greatly emphasize how different individuals are affected by a policy (Heckman and Smith, 1998; Carneiro, Hansen, and Heckman, 2001). A powerful motivation for this lies in the dynamics of political economy. Public support for a policy, and for the authorities that implement it, depends on the balance of gains and losses experienced by different individuals in the electorate. In addition, and in line with the political economy arguments adduced in Carneiro, Hansen, and Heckman (2001), there is mounting empirical evidence corroborating that the electorate exhibits loss aversion, an empirical regularity that has been identified in a wide variety of other contexts (Kahneman and Tversky, 1979; Samuelson and Zeckhauser, 1988; Tversky and Kahneman, 1991; Rabin and Thaler, 2001; Rick, 2011). This aversion to losses among constituents, in turn, drives the actions of policy makers, as documented in contexts as diverse as government support to the steel industry in US trade policy and the repeal of the Affordable Care Act (Freund and Özden, 2008; Alesina and Passarelli, 2019). In this paper we develop new testable criteria and econometric methods to rank distributions of individual policy effects from a welfare standpoint when individuals exhibit loss aversion.
This extends the toolkit available for evaluating the impact of policy interventions.

Our first contribution is to propose criteria for ranking policies when agents are averse to losses by using the standard welfare function approach (Atkinson, 1970), namely, that policies may be evaluated based on a welfare ranking. We use a ranking based on social value functions, which aggregate individual gains and losses evaluated by a cardinal and interpersonally comparable value function, similarly to the standard utilitarian welfare ranking. As is well known, the latter is equivalent to first-order stochastic dominance (FOSD) over distributions of policy outcomes. In a similar spirit, as the first main contribution, we show that the social value function ranking with non-decreasing and loss-averse value functions (Tversky and Kahneman, 1991) is equivalent to a new concept we call loss aversion-sensitive dominance (LASD) over distributions of policy-induced gains and losses. Recall that FOSD requires that the cumulative distribution function of the dominated distribution lies everywhere above the cumulative distribution function of the dominant distribution. In contrast, under LASD it must lie sufficiently above the dominant distribution that potential losses cannot be compensated by potential gains. This is a consequence of loss aversion. Except for the special case of a status quo policy (i.e., a policy of no change), where FOSD and LASD coincide, LASD can generally, as we show, be used to compare policies that are indistinguishable by FOSD.

The LASD criterion relies on gains and losses, which under certain identification conditions could be considered treatment effects. It is well known that point identification of the distribution of treatment effects may require implausible theoretical restrictions such as rank invariance of potential outcomes (Heckman, Smith, and Clements, 1997).
We extend our LASD criteria to a partially-identified setting and establish a sufficient condition to rank alternative policies under partial identification of the distributions of their effects. We use Makarov bounds (Makarov, 1982; Rüschendorf, 1982; Frank, Nelsen, and Schweizer, 1987) to bound the distribution of treatment effects when the joint pre- and post-policy outcome distribution is unknown. This provides a testable criterion that can be used in practice, since the marginal distribution functions from samples observed under various treatments can usually be identified, and Makarov bounds rely only on marginal information for their identification.

The second contribution of this paper is to develop statistical inference procedures to practically test the loss aversion-sensitive dominance condition using sample data. We develop statistical tests for both point-identified and partially-identified distributions of outcomes. The test procedures are designed to assess, uniformly over the two outcome distributions, whether one treatment dominates another in terms of the LASD criterion. Specifically, we suggest Kolmogorov-Smirnov and Cramér-von Mises test statistics that are applied to nonparametric plug-in estimates of the LASD criterion mentioned above. Inference for these statistics uses specially tailored resampling procedures. We show that our procedures control the size of tests uniformly over probability distributions that satisfy the null hypothesis. Our tests are related to the literature on uniform inference for stochastic dominance represented by, e.g., Linton, Song, and Whang (2010); Linton, Maasoumi, and Whang (2005); Barrett and Donald.

The literature on stochastic dominance is vast and spans economics and mathematics; we refer the reader to, e.g., Shaked and Shanthikumar (1994) and Levy (2016) for a review. When dominance curves cross, higher-order or inverse stochastic dominance criteria have been proposed.
The former involves conditions on higher (typically third and fourth) order derivatives of the utility function (e.g., Fishburn (1980) and Chew (1983), to which Eeckhoudt and Schlesinger (2006) provided an interesting interpretation), whereas the latter is related to the rank-dependent theory originally proposed by Weymark (1981) and Yaari (1987, 1988), where social welfare functions are weighted averages of ordered outcomes with weights decreasing with the rank of the outcome (see Aaberge, Havnes, and Mogstad (2018) for a recent refinement of this theory).

L-norm statistics applied to this function are just regular enough that, with some care, resampling can be used to conduct inference. We rely on recent results from Fang and Santos (2019), who built on the work of Dümbgen (1993), to propose an inference procedure that combines standard resampling with an estimate of the way that test statistics depend on underlying data distributions. We contribute to the literature on directionally differentiable test statistics with a new test for LASD. Recent contributions to this literature include, among others, Cattaneo, Jansson, and Nagasawa (2017); Hong and Li (2018); Chetverikov, Santos, and Shaikh (2018); Cho and White (2018); Christensen and Connault (2019); Fang and Santos (2019) and Masten and Poirier (2020). When distributions are only partially identified by bounds, the situation is more challenging, but the problem has a similar solution. This allows us to conduct conservative inference in the partially identified case. Our contribution to this literature is novel because of our focus on uniform tests for dominance in both the point- and partially-identified cases.

Finally, this paper also relates to the strand of literature that develops methods to estimate the optimal treatment assignment policy that maximizes a social welfare function.
Recent developments can be found in Manski (2004), Dehejia (2005), Hirano and Porter (2009), Stoye (2009), Bhattacharya and Dupas (2012), Tetenov (2012), Kitagawa and Tetenov (2018, 2019), among others. These papers focus on the decision-theoretic properties and procedures that map empirical data into treatment choices. In this context, our paper is most closely related to Kasy (2016), which focuses on welfare rankings of policies rather than optimal policy choice.

We empirically illustrate the use of our proposed criteria and tests with a welfare comparison of two well-known income support programs using data from Bitler, Gelbach, and Hoynes (2006). We show that, in the case of a policy with gainers and losers, the use of our loss aversion-sensitive evaluation criteria may lead to a ranking of policy interventions that differs from that obtained when their outcomes are compared using stochastic dominance.

The rest of the paper is organized as follows. Section 2 presents some basic definitions and notation and defines loss aversion-sensitive dominance. Section 3 develops testable criteria for loss aversion-sensitive dominance. Section 4 proposes statistical inference methods for LASD using sample observations. An empirical application appears in Section 5. Finally, Section 6 concludes. One appendix includes auxiliary results and definitions, and a second appendix collects proofs of the results in the text.

In this section, we propose a novel dominance relation for ordering policies under the assumption that social decision makers consider the distribution of individual gains and losses under different policy scenarios. We call this criterion Loss Aversion-Sensitive Dominance (LASD). Suppose a random variable X describes individual gains and losses, and X has cumulative distribution function F, and let $\mathcal{F}$ be the set of cumulative distribution functions with bounded support $\mathcal{X}$. We maintain the assumption throughout that $F \in \mathcal{F}$.
The bounded support assumption is made to avoid technical conditions on the tails of distribution functions. A decision maker has preferences over X that are represented via a continuous social value function (SVF).

Definition 2.1 (Social Value Function). Suppose random variable X has CDF $F \in \mathcal{F}$ and let $W \colon \mathcal{F} \to \mathbb{R}$ denote the following social value function:
$$W(F) = \int_{\mathcal{X}} v(x) \, dF(x), \qquad (1)$$
where $v \colon \mathcal{X} \to \mathbb{R}$ is called a value function.

The social value function defined above is standard in the literature. W(F) is the expected evaluation of the distribution of X by a decision maker who uses value function v(·). The value function v(·) in (1) need not be any agent's actual value function, but simply the utility function that the social planner uses to convert outcomes into an interpersonally-comparable measure of well-being (Gajdos and Weymark, 2012). (Formally speaking, we have $W_v(F)$, but we suppress the subscript v for expositional brevity.) We depart from the standard assumptions on v: v exhibits the following features: (i) agents assign negative value to losses and positive value to gains, (ii) the value function is monotone (increasing), and, our key property, (iii) there is asymmetry in gains and losses, namely, losses hurt an agent more than gains of equivalent magnitude make her happy. These properties are formally listed in the next definition.

Definition 2.2 (Properties of the value function). The value function $v \colon \mathcal{X} \to \mathbb{R}$ satisfies:
1. Disutility of losses and utility of gains: $v(x) \le 0$ for all $x < 0$, $v(0) = 0$, and $v(x) \ge 0$ for all $x > 0$.
2. Non-decreasing: $v'(x) \ge 0$ for all x.
3. Loss-averse: $v'(-x) \ge v'(x)$ for all $x > 0$.

The properties in Definition 2.2 are typically assumed in Prospect Theory, together with the additional requirement of S-shapedness of the value function, which we do not consider (see, e.g., p. 279 of Kahneman and Tversky (1979)). Assumptions 1 and 2 are standard sign and monotonicity conditions.
Assumption 3 expresses the idea that "losses loom larger than corresponding gains" and is a widely accepted definition of loss aversion (Tversky and Kahneman, 1992, p. 303). It is a stronger condition than the one considered by Kahneman and Tversky (1979). The following form of W(F) will be useful in subsequent definitions and results.

Proposition 2.3.
Suppose that $F \in \mathcal{F}$ and v is once differentiable. Then
$$W(F) = -\int_{-\infty}^{0} v'(x) F(x)\, dx + \int_{0}^{\infty} v'(x)\,(1 - F(x))\, dx. \qquad (2)$$

Assume that the decision maker's social value function W depends on v, which satisfies Definition 2.2, and she wishes to compare random variables $X_A$ and $X_B$, which represent gains and losses under two policies labeled A and B. We use the labels $F_A$ and $F_B$ for the distribution functions of $X_A$ and $X_B$. The decision maker prefers $X_A$ over $X_B$ if she evaluates $F_A$ as better than $F_B$ using her SVF; specifically, $X_A$ is preferred to $X_B$ if and only if $W(F_A) \ge W(F_B)$, where W is defined in Definition 2.1. This idea is formalized below.

Definition 2.4 (Loss Aversion-Sensitive Dominance). Let $X_A$ and $X_B$ have distribution functions respectively labeled $F_A, F_B \in \mathcal{F}$. If $W(F_A) \ge W(F_B)$ for all value functions v that satisfy Definition 2.2, we say that $F_A$ dominates $F_B$ in terms of Loss Aversion-Sensitive Dominance, or LASD for short, and we write $F_A \succeq_{LASD} F_B$.

In the next section we relate this abstract notion to a more concrete condition that depends on the cumulative distribution functions of the outcome distributions, $F_A$ and $F_B$.

In this section we formulate conditions for evaluating distributions of gains and losses. We propose criteria that indicate whether one distribution of gains and losses dominates another in the sense described in Definition 2.4. For making comparisons between policies A and B, an econometrician can generally observe three relevant distributions. First, suppose that the control or current distribution of agents' outcomes is represented by the random variable Z, which has marginal distribution function G. Two other random variables, $Z_A$ and $Z_B$, describe outcomes under policies A and B. Assume their marginal distribution functions are $G_A$ and $G_B$, respectively. However, a decision maker who is sensitive to loss considers differences induced by these prospective policies.
The gains and losses due to policies A and B are defined by the random variables $X_A = Z_A - Z$ and $X_B = Z_B - Z$. The decision maker's goal is to compare policies A and B using $F_A$ and $F_B$, the distribution functions of $X_A$ and $X_B$.

The problem with comparing the variables $X_A$ and $X_B$ is well known in the treatment effects literature: $F_A$ and $F_B$ depend on the joint distribution of $(Z, Z_A, Z_B)$, which may not be observable without restrictions imposed by an economic model. In subsection 3.1 we abstract from specific identification conditions and discuss LASD under the assumption that $F_A$ and $F_B$ are identified. In subsection 3.2 we work with a partially identified case where only the marginal distribution functions G, $G_A$ and $G_B$ are observable and no restrictions are made to identify $F_A$ and $F_B$.

The LASD concept in Definition 2.4 requires that one distribution is preferred to another over an entire class of social value functions and is difficult to test directly. The following result relates the LASD concept to a criterion which depends only on marginal distribution functions and orders $F_A$ and $F_B$ according to the class of SVFs allowed in Definition 2.2. In this section we assume that $F_A, F_B \in \mathcal{F}$ are point identified. This may result from a variety of econometric restrictions that deliver identification and are the subject of a large literature.

Theorem 3.1.
Suppose that $F_A, F_B \in \mathcal{F}$. The following are equivalent:
1. $F_A \succeq_{LASD} F_B$.
2. For all $x \ge 0$, $F_A$ and $F_B$ satisfy
$$F_B(-x) - F_A(-x) \ge \max\{0, F_A(x) - F_B(x)\}. \qquad (3)$$
3. For all $x \ge 0$, $F_A$ and $F_B$ simultaneously satisfy
$$F_A(-x) - F_B(-x) \le 0 \qquad (4)$$
and
$$(1 - F_A(x)) - F_A(-x) \ge (1 - F_B(x)) - F_B(-x). \qquad (5)$$

Theorem 3.1 provides two different conditions that can be used to verify whether one distribution of gains and losses dominates the other in the LASD sense. These criteria compare the outcome distributions by examining how the distribution functions $(F_A, F_B)$ assign probabilities to gains and losses of all possible magnitudes. The particular way that they make a comparison is related to the relative importance of gains and losses. Consider condition (3). For $X_B$ to be dominated, its distribution function must lie above the distribution of $X_A$ for losses. $X_B$ can be dominated by $X_A$ even when gains under A are dominated by those under B — that is, when $F_A(x) - F_B(x) \ge 0$ for some $x \ge 0$ — as long as this lack of dominance in gains is compensated by sufficient dominance of $X_A$ over $X_B$ in the losses region. This is a consequence of the asymmetric treatment of gains and losses. On the other hand, consider conditions (4) and (5). Condition (4) is a FOSD condition applied to losses. This is a consequence of loss aversion; note that in the extreme case where only losses matter, we would have (4). Condition (5) is a tail condition on the distributions. It requires that, when balancing the probabilities of gains and losses of absolute magnitudes at least as large as x, $X_A$ provides gains to a higher proportion of agents than does $X_B$. Inequality (3) combines the two inequalities represented by (4) and (5) into a single expression.

It is interesting to note that LASD has one property in common with FOSD, namely, a higher mean is a necessary condition for both types of dominance.

Corollary 3.2. If $F_A \succeq_{LASD} F_B$ then $E[X_A] \ge E[X_B]$.

Note that FOSD cannot rank two distributions that have the same mean — that is, if $F_A \succeq_{FOSD} F_B$ and $E[X_A] = E[X_B]$, then $F_A = F_B$.
This is not the case for LASD in (3), as the next example demonstrates. Therefore, for example, when comparing two distributions with the same average effect, condition (3) may still be used to differentiate between them.

(LASD is a partial order. Over losses, (4) is a partial order because FOSD is a partial order. For the tail condition (5), checking transitivity, we have $(1 - F_A(x)) - F_A(-x) \ge (1 - F_B(x)) - F_B(-x)$ and $(1 - F_B(x)) - F_B(-x) \ge (1 - F_C(x)) - F_C(-x)$, which together imply $(1 - F_A(x)) - F_A(-x) \ge (1 - F_C(x)) - F_C(-x)$. If $F_A(-x) - F_B(-x) = 0$ then $F_A(-x) = F_B(-x)$, and using this in (5) gives anti-symmetry.)
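Criterion (3) and the pair (4)-(5) are straightforward to check on a grid once the two CDFs are in hand. The sketch below is our own illustration, not an example from the paper: the normal location-shift pair is an assumed choice, and by Theorem 3.1 the two checks should agree.

```python
import math

def norm_cdf(t, mu=0.0):
    # CDF of N(mu, 1), built from the error function.
    return 0.5 * (1.0 + math.erf((t - mu) / math.sqrt(2.0)))

def lasd_condition_3(FA, FB, xs):
    # Eq. (3): F_B(-x) - F_A(-x) >= max{0, F_A(x) - F_B(x)} for all x >= 0.
    return all(FB(-x) - FA(-x) >= max(0.0, FA(x) - FB(x)) - 1e-12 for x in xs)

def lasd_conditions_4_5(FA, FB, xs):
    # Eq. (4): F_A(-x) - F_B(-x) <= 0 for all x >= 0, and
    # Eq. (5): (1 - F_A(x)) - F_A(-x) >= (1 - F_B(x)) - F_B(-x) for all x >= 0.
    c4 = all(FA(-x) - FB(-x) <= 1e-12 for x in xs)
    c5 = all((1 - FA(x)) - FA(-x) >= (1 - FB(x)) - FB(-x) - 1e-12 for x in xs)
    return c4 and c5

xs = [0.01 * i for i in range(501)]          # evaluation grid on [0, 5]
FA = lambda t: norm_cdf(t, mu=0.5)           # gains/losses under A, shifted up
FB = lambda t: norm_cdf(t, mu=0.0)           # gains/losses under B

print(lasd_condition_3(FA, FB, xs))          # True: F_A LASD-dominates F_B
print(lasd_condition_3(FB, FA, xs))          # False: the reverse fails
print(lasd_condition_3(FA, FB, xs) == lasd_conditions_4_5(FA, FB, xs))  # True
```

Because the shifted CDF lies below the unshifted one everywhere, condition (3) holds trivially for gains and with slack for losses; reversing the roles of A and B violates (3) already at x = 0.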
Example 3.3.
Consider the family of uniform distributions on $[-1-y, -y] \cup [y, y+1]$ indexed by $y > 0$, and denote the corresponding member distribution functions $F_y$. All members of this family have mean zero, and $F_y \succeq_{LASD} F_{y'}$ whenever $y < y'$. Indeed, note that
$$W(F_y) = \frac{1}{2} \left( \int_{-1-y}^{-y} v(z)\, dz + \int_{y}^{y+1} v(z)\, dz \right)$$
and thus for any v which is loss-averse (see Definition 2.2) we have
$$\frac{d}{dy} W(F_y) = \frac{1}{2} \left( v(-1-y) - v(-y) + v(1+y) - v(y) \right) = \frac{1}{2} \left( -\int_{-1-y}^{-y} v'(z)\, dz + \int_{y}^{1+y} v'(z)\, dz \right) = \frac{1}{2} \int_{y}^{1+y} \left( v'(z) - v'(-z) \right) dz \le 0.$$

It is important to note that LASD is a concept that is specialized to the comparison of distributions that represent gains and losses. Standard FOSD is typically applied to the distribution of outcomes in levels, without regard to whether the outcomes resulted from gains or losses of agents relative to a pre-policy state — in our notation, $G_A$ and $G_B$ are typically compared with FOSD, instead of $F_A$ and $F_B$. FOSD applied to post-policy levels may or may not coincide with LASD applied to changes. This means that even when a strong condition such as FOSD holds for final outcomes, if one took into account how agents value gains and losses it may turn out that the dominant distribution is no longer a preferred outcome. One could apply the FOSD rule to compare distributions of income changes, which implies LASD applied to changes, because FOSD applies to a broader class of value functions. However, this type of comparison would ignore agents' loss aversion, the important qualitative feature that LASD accounts for. The following example shows that the analysis of outcomes in levels using FOSD need not correspond to any LASD ordering of outcomes in changes.

Example 3.4.
Let Z represent outcomes before policies A or B. Suppose Z is distributed uniformly over $\{0, 1, 2, 3\}$. Policy A assigns post-policy outcomes depending on the realized Z according to the schedule
$$Z_A = \begin{cases} 3 & \text{if } Z = 0 \\ 2 & \text{if } Z = 1 \\ 0 & \text{if } Z = 2 \\ 1 & \text{if } Z = 3, \end{cases}$$
so that $X_A = Z_A - Z$ has distribution $P\{X_A = -2\} = 1/2$, $P\{X_A = 1\} = P\{X_A = 3\} = 1/4$. Meanwhile, policy B maintains the status quo: $X_B = Z_B - Z = 0$ with probability 1. It is straightforward to check that $Z_A \sim Z_B$, thus they dominate each other according to FOSD. However, there is no loss aversion-sensitive dominance between $X_A$ and $X_B$. Indeed, we can find two value functions that fulfill the conditions of Definition 2.4 but order $X_A$ and $X_B$ differently. For example, take $v(x) = x^3$. Then $\int v(x)\, dF_A(x) = 3 > \int v(x)\, dF_B(x) = 0$. Next let $v(x) = \mathrm{sgn}(x)|x|^{1/2}$. Then $\int v(x)\, dF_A(x) \approx -0.02 < 0 = \int v(x)\, dF_B(x)$.

In the previous example, policy B left pre-treatment outcomes unchanged, or in other words, maintained a status quo condition — we had $X_B = Z_B - Z \equiv 0$. Suppose generally that $X_B$ has a distribution that is degenerate at 0. Then $F_B(x) = 0$ for all $x < 0$ and $F_B(x) = 1$ for all $x \ge 0$. We define this as a status quo policy distribution, labelled $F_{SQ}$. When the comparison is between a distribution $F_A$ and $F_{SQ}$, LASD and standard FOSD are equivalent.

Corollary 3.5.
Suppose that $F_A \in \mathcal{F}$ and $F_B = F_{SQ}$. Then $F_A \succeq_{LASD} F_{SQ} \iff F_A \succeq_{FOSD} F_{SQ}$.

Remark 3.6.
Although in this paper we focus on the distribution of gains and losses, Kőszegi and Rabin (2006) have developed an interesting preference relation in which individuals derive utility from income and also from gains and losses. In particular, their utility function is additively separable in gains and losses x and income levels z, i.e., $\tilde{v}(x, z) = v_G(x) + v_I(z)$, where $x \in \mathbb{R}$ and $z \in [0, \infty)$. Using Kőszegi and Rabin (2006) preferences, policy A dominates policy B if, in our notation, (4) and (5) are satisfied by $X_A$ and $X_B$, along with the additional condition that $Z_A$ dominates $Z_B$ according to FOSD. A proof of this result is given in Appendix B.

In many situations of interest the cumulative distribution functions of gains and losses, $F_A$ and $F_B$, are not point identified without a model of the relationship between $X_A$ and $X_B$. Without information on the dependence between potential outcomes, we can still make some more circumscribed statements with regard to dominance based on bounds for the distribution functions.

A number of authors have considered functions that bound the distribution functions $F_A$ and $F_B$. Taking $X_A$ as an example, the Makarov bounds (Makarov, 1982; Rüschendorf, 1982; Frank, Nelsen, and Schweizer, 1987) are two functions L and U that satisfy $L(x) \le F_A(x) \le U(x)$ for all $x \in \mathbb{R}$, depend only on the marginal distribution functions G and $G_A$, and are pointwise sharp — for any fixed x there exist some $Z^*$ and $Z_A^*$ such that the resulting $X_A^* = Z_A^* - Z^*$ has a distribution function at x that is equal to one of $L(x)$ or $U(x)$. Williamson and Downs (1990) provide convenient definitions for these bound functions. For any two distribution functions $G_1$, $G_2$, define the maps
$$L(x, G_1, G_2) = \sup_{u \in \mathbb{R}} \left( G_2(u) - G_1(u - x) \right)$$
$$U(x, G_1, G_2) = \inf_{u \in \mathbb{R}} \left( 1 + G_2(u) - G_1(u - x) \right).$$
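The maps L and U above can be approximated from estimated marginal CDFs by brute-force optimization over a grid of u values. The sketch below is our own illustration with simulated normal samples; the grid ranges and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=2000)    # control outcomes Z, with CDF G
za = rng.normal(0.5, 1.0, size=2000)   # outcomes under policy A, with CDF G_A

def ecdf(sample):
    s = np.sort(sample)
    return lambda u: np.searchsorted(s, u, side="right") / len(s)

G, GA = ecdf(z), ecdf(za)

# Makarov bounds for the CDF of X_A = Z_A - Z, taking G_1 = G and G_2 = G_A:
# L(x) = sup_u (G_A(u) - G(u - x)),  U(x) = inf_u (1 + G_A(u) - G(u - x)),
# with the sup/inf approximated over a wide finite grid of u values.
u_grid = np.linspace(-8.0, 8.0, 4001)

def makarov(x):
    diff = GA(u_grid) - G(u_grid - x)
    return float(diff.max()), float(1.0 + diff.min())

x_grid = np.linspace(-4.0, 4.0, 81)
L_vals, U_vals = zip(*(makarov(x) for x in x_grid))

# The bounds behave like CDF envelopes: values in [0, 1] with L <= U pointwise.
print(all(0.0 <= l <= u <= 1.0 for l, u in zip(L_vals, U_vals)))  # True
```

Because the sup and inf run over the whole (widened) real line, where both CDFs are eventually 0 or 1, the raw formulas already land in [0, 1] without extra clipping; both bound functions are also nondecreasing in x, as a CDF envelope should be.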
For convenience, define the policy-specific bound functions for $F_k$, $k \in \{A, B\}$ and all $x \in \mathbb{R}$, which depend on the marginal CDFs G and $G_k$, by
$$L_k(x) = L(x, G, G_k) \qquad (6)$$
$$U_k(x) = U(x, G, G_k). \qquad (7)$$
Using these definitions we obtain a sufficient and a necessary condition for LASD when only bounds of the treatment effects distribution are observable. The next theorem formalizes the result.

Theorem 3.7.
Suppose that $G, G_A, G_B \in \mathcal{F}$ and define the bounding functions using formulas (6) and (7) for $k \in \{A, B\}$.
1. If for all $x \ge 0$,
$$L_B(-x) - U_A(-x) \ge \max\{0, U_A(x) - L_B(x)\}, \qquad (8)$$
then (3) holds.
2. If (3) holds, then for all $x \ge 0$,
$$U_B(-x) - L_A(-x) \ge L_A(x) - U_B(x). \qquad (9)$$

Theorem 3.7 is an extension of Theorem 3.1 from the point-identified to the partially-identified case. Both Theorems 3.1 and 3.7 will play important parts in the inference procedures discussed in the next section. When the comparison is with the status quo distribution, the partially identified conditions simplify. Corollary 3.8 below is an extension of Corollary 3.5 to the partially identified case.

Corollary 3.8.
Suppose that $F_B = F_{SQ}$ and that $G, G_A \in \mathcal{F}$. Define the bound functions $U_A$ and $L_A$ using formulas (6) and (7). Then $U_A(-x) = 0$ for all $x \ge 0 \Rightarrow F_A \succeq_{LASD} F_{SQ}$, and $F_A \succeq_{LASD} F_{SQ} \Rightarrow L_A(-x) = 0$ for all $x \ge 0$.

In this section we propose statistical inference methods for the loss aversion-sensitive dominance (LASD) criteria discussed in previous sections. We consider the null and alternative hypotheses
$$H_0 \colon F_A \succeq_{LASD} F_B \qquad H_1 \colon F_A \not\succeq_{LASD} F_B. \qquad (10)$$
Under the null hypothesis (10), policy A dominates B in the LASD sense, similar to much of the literature on stochastic dominance — see, for example, Linton, Maasoumi, and Whang (2005); Linton, Song, and Whang (2010). We use the dominance criteria discussed in Theorems 3.1 and 3.7 to design nonparametric tests for $H_0$. Because the LASD hypothesis is translated into functional inequalities, which we discuss below, tests must be conducted uniformly over all $x \ge 0$. This uniformity in x and features of the LASD conditions present a challenge for inference.

We consider tests for this null hypothesis given sample data observed under two different identification assumptions. We start with the case where the econometrician can directly observe samples $\{X_{Ai}\}_{i=1}^{n_A}$ and $\{X_{Bi}\}_{i=1}^{n_B}$ which represent agents' gains and losses. In other words, we assume that a model has been imposed on the data so that the distribution functions of $X_A$ and $X_B$ are point-identified and can be estimated using the empirical distribution functions from the two samples. Next we extend these results to the partially-identified case, where we impose no assumption about the joint distribution of potential outcomes under either treatment.
In this case, the econometrician observes three samples, $\{Z_i\}_{i=1}^{n}$, $\{Z_{Ai}\}_{i=1}^{n_A}$ and $\{Z_{Bi}\}_{i=1}^{n_B}$, of outcomes under a control or pre-policy state and outcomes under policies A and B, and tests are based on plug-in estimates of bounds for $X_A = Z_A - Z$ and $X_B = Z_B - Z$.

We consider distribution functions as members of the space of bounded functions on the support $\mathcal{X} \subseteq \mathbb{R}$, denoted $\ell^\infty(\mathcal{X})$, equipped with the supremum norm, defined for $f \colon \mathbb{R}^k \to \mathbb{R}^\ell$ by $\|f\|_\infty = \max_j \{\sup_{x \in \mathbb{R}^k} |f_j(x)|\}$. For real numbers x, let $(x)_+ = \max\{0, x\}$. Given a sequence of bounded functions $\{f_n\}_n$ and limiting random element f, we write $f_n \rightsquigarrow f$ to denote weak convergence in $(\ell^\infty, \|\cdot\|_\infty)$ in the sense of Hoffman-Jørgensen (van der Vaart and Wellner, 1996).

4.1 Inferring dominance from point identified treatment distributions

In this subsection we suppose that the pair of marginal distribution functions $F = (F_A, F_B)$ is identified. To implement a test of the hypotheses (10), we employ the results of Theorem 3.1 to construct maps of F into criterion functions that are used to detect deviations from the hypothesis $H_0$. Specifically, recalling that $(x)_+ = \max\{0, x\}$, for the point-identified case we examine maps $T_1 \colon (\ell^\infty(\mathbb{R}))^2 \to \ell^\infty(\mathbb{R}_+)$ and $T_2 \colon (\ell^\infty(\mathbb{R}))^2 \to (\ell^\infty(\mathbb{R}_+))^2$, defined for each $x \ge 0$ by
$$T_1(F)(x) = (F_A(x) - F_B(x))_+ + F_A(-x) - F_B(-x) \qquad (11)$$
and
$$T_2(F)(x) = \begin{bmatrix} F_A(-x) - F_B(-x) \\ F_A(x) - F_B(x) + F_A(-x) - F_B(-x) \end{bmatrix}. \qquad (12)$$
The functions $T_1(F)$ and $T_2(F)$ are designed so that large positive values indicate a violation of the null. Taking $T_1$ as an example, Theorem 3.1 states that $W(F_A) \ge W(F_B)$ if and only if $F_B(-x) - F_A(-x) \ge (F_A(x) - F_B(x))_+$ for all $x \ge 0$, so tests can be constructed by looking for x where $T_1(F)(x)$ becomes significantly positive.
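To see how the criterion functions behave, one can evaluate (11) and (12) directly for a pair of known CDFs. The normal location-shift pair below is our illustrative choice, not an example from the paper; under it A dominates B, so $T_1$ stays non-positive, while reversing the roles produces positive values.

```python
import math

def Phi(t, mu=0.0):
    # CDF of N(mu, 1).
    return 0.5 * (1.0 + math.erf((t - mu) / math.sqrt(2.0)))

def T1(FA, FB, x):
    # Eq. (11): (F_A(x) - F_B(x))_+ + F_A(-x) - F_B(-x), for x >= 0.
    return max(FA(x) - FB(x), 0.0) + FA(-x) - FB(-x)

def T2(FA, FB, x):
    # Eq. (12): the two coordinates of the vector-valued criterion.
    m1 = FA(-x) - FB(-x)
    return (m1, m1 + FA(x) - FB(x))

xs = [0.05 * i for i in range(100)]
FA = lambda t: Phi(t, mu=0.5)   # A shifts gains and losses up by 0.5
FB = lambda t: Phi(t, mu=0.0)

# Under the null "A LASD-dominates B", T1 is non-positive everywhere ...
print(max(T1(FA, FB, x) for x in xs) <= 1e-12)   # True
# ... while in the reverse direction T1 is positive somewhere.
print(max(T1(FB, FA, x) for x in xs) > 0)        # True
```

Positive values of the map are exactly the deviations the statistics in the next display aggregate, either by their supremum or by their integrated square.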
We will refer to the $T_j$ as maps from pairs of distribution functions to another function space, and also refer to them as functions. The hypotheses (10) can be rewritten in two equivalent forms, depending on whether one uses $T_1$ or $T_2$ to transform distribution functions: letting $\mathcal{X} \subseteq \mathbb{R}_+$ be an evaluation set, we have
$$H_0^{(1)} \colon T_1(F)(x) \le 0 \text{ for all } x \in \mathcal{X}, \qquad H_1^{(1)} \colon T_1(F)(x) > 0 \text{ for some } x \in \mathcal{X} \qquad (13)$$
and
$$H_0^{(2)} \colon T_2(F)(x) \le 0 \text{ for all } x \in \mathcal{X}, \qquad H_1^{(2)} \colon T_2(F)(x) \not\le 0 \text{ for some } x \in \mathcal{X}. \qquad (14)$$
In the second set of hypotheses, 0 is a two-dimensional vector of zeros and inequalities are taken coordinate-wise.

The next step in testing the hypotheses (13) and (14) is to estimate $T_1(F)$ and $T_2(F)$. Let $F_n = (F_{An}, F_{Bn})$ denote the pair of marginal empirical distribution functions, that is, $F_{kn}(x) = \frac{1}{n_k} \sum_{i=1}^{n_k} 1\{X_{ki} \le x\}$ for $k \in \{A, B\}$. These are well-behaved estimators of the components of F. Letting $n = n_A + n_B$, standard empirical process theory shows that $\sqrt{n}(F_n - F)$ converges weakly to a Gaussian process under weak assumptions (van der Vaart, 1998, Example 19.6). In order to conduct inference for loss aversion-sensitive dominance, we use plug-in estimators $T_j(F_n)$ for $j \in \{1, 2\}$. See Remark A.7 in Appendix A for details on the computation of these functions.

In order to detect when $T_j(F_n)$ is significantly positive, we consider statistics based on a one-sided supremum norm or a one-sided $L_2$ norm over $\mathcal{X}$. Kolmogorov-Smirnov (i.e., supremum norm) type statistics are
$$V_{1n} = \sqrt{n} \sup_{x \in \mathcal{X}} (T_1(F_n)(x))_+ \qquad (15)$$
$$V_{2n} = \sqrt{n} \max\left\{ \sup_{x \in \mathcal{X}} (T_{21}(F_n)(x))_+, \; \sup_{x \in \mathcal{X}} (T_{22}(F_n)(x))_+ \right\}, \qquad (16)$$
where $T_{21}$ and $T_{22}$ denote the two coordinates of $T_2$. Meanwhile, Cramér-von Mises (or $L_2$ norm) test statistics are defined by
$$W_{1n} = \sqrt{n} \left( \int_{\mathcal{X}} \left( (T_1(F_n)(x))_+ \right)^2 dx \right)^{1/2} \qquad (17)$$
$$W_{2n} = \sqrt{n} \left( \int_{\mathcal{X}} \left( (T_{21}(F_n)(x))_+ \right)^2 + \left( (T_{22}(F_n)(x))_+ \right)^2 dx \right)^{1/2}. \qquad (18)$$
In the sequel, we assume that all functions used in $L^2$ statistics are square-integrable.

We wish to establish the limiting distributions of $V_{jn}$ and $W_{jn}$, for $j \in \{1, 2\}$, under the null hypothesis $H_0 : F_A \succeq_{LASD} F_B$. This means that we are concerned with the behavior of the empirical criterion function processes $\sqrt{n}(T_j(F_n) - T_j(F))$, which are random functions. Two challenges arise when considering these criterion function processes. First, the form of the null hypothesis as a functional inequality to be tested uniformly over $\bar{\mathcal{X}}$ is a source of irregularity. The assumption that the distribution $P$ satisfies the null hypothesis $F_A \succeq_{LASD} F_B$ implies that the asymptotic distributions of $W_j$ and $V_j$ depend on features of $P$. This is referred to as non-uniformity in $P$ (Linton, Song, and Whang, 2010; Andrews and Shi, 2013), and requires attention when resampling. Second, due to the pointwise maximum function in its definition, $T_1$ is too irregular as a map from the data to the space of bounded functions to establish a limiting distribution for the empirical process $\sqrt{n}(T_1(F_n) - T_1(F))$ using conventional statistical techniques. In contrast, $T_2$ is a linear map of $F$, which implies that $\sqrt{n}(T_2(F_n) - T_2(F))$ has a well-behaved limiting distribution in $(\ell^\infty(\mathbb{R}_+))^2$.

Despite the above challenges, we show that $V_{jn}$ and $W_{jn}$ (for $j \in \{1, 2\}$) have well-behaved asymptotic distributions and, furthermore, that the limiting random variables satisfy $V_2 \sim V_1$ and $W_2 \sim W_1$. This is an important result because it is the foundation for applying bootstrap techniques for inference. Before stating the formal assumptions and asymptotic properties of the tests, we discuss the two difficulties mentioned above in more detail.

The limiting distributions of the $V_{jn}$ and $W_{jn}$ statistics depend on features of the joint probability distribution of $(X_A, X_B)$, which we denote by $P$. Let $\mathcal{P}_0$ be the set of distributions $P$ such that $F_A \succeq_{LASD} F_B$.
These are distributions with marginal distribution functions $F$ such that $T_j(F)(x) \leq 0$ for all $x \geq 0$. To discuss the relationship between these sets of distributions and the test statistics, we relabel the two coordinates of the $T_2$ function as

$$m_1(x) = F_A(-x) - F_B(-x) \qquad (19)$$

and

$$m_2(x) = F_A(-x) - F_B(-x) + F_A(x) - F_B(x). \qquad (20)$$

When $P \in \mathcal{P}_0$, both $m_1(x) \leq 0$ and $m_2(x) \leq 0$ for all $x \geq 0$.

More detail is required about the behavior of the two coordinate functions to determine the limiting distributions of the $V_{jn}$ and $W_{jn}$ statistics. For the $L^2$-norm statistics $W_{1n}$ and $W_{2n}$, we define the following relevant subdomains of $\bar{\mathcal{X}}$, which collect the arguments where $m_1$ or $m_2$ are equal to zero:

$$\mathcal{X}_1(P) = \{x \in \bar{\mathcal{X}} : m_1(x) = 0\} \qquad (21)$$

$$\mathcal{X}_2(P) = \{x \in \bar{\mathcal{X}} : m_2(x) = 0\}. \qquad (22)$$

Denote by $\mathcal{X}_0(P) \subseteq \bar{\mathcal{X}}$ the set of $x$ where $T_1(F)(x) = 0$ or at least one coordinate of $T_2(F)$ equals $0$ for probability distribution $P$. As will be seen below, $\mathcal{X}_0(P)$ is the same for both the $T_1$ and $T_2$ functions. Following Linton, Song, and Whang (2010), we call $\mathcal{X}_0(P)$ the contact set for the distribution $P$. Given the above definitions, under the null hypothesis we can write $\mathcal{X}_0(P) = \mathcal{X}_1(P) \cup \mathcal{X}_2(P)$.

On the other hand, the supremum-norm statistics $V_{1n}$ and $V_{2n}$ need a different family of sets, namely the sets of $\epsilon$-maximizers of $m_1$ and $m_2$. For any $\epsilon \geq 0$ and $k \in \{1, 2\}$, let

$$M_k(\epsilon) = \left\{ x \in \bar{\mathcal{X}} : m_k(x) \geq \sup_{x \in \bar{\mathcal{X}}} m_k(x) - \epsilon \right\}. \qquad (23)$$

An important subset of $\mathcal{P}_0$ consists of those $P$ for which the test statistics have nontrivial limiting distributions under the null hypothesis, that is, distributions not degenerate at 0, which occurs when there is some $x$ such that $T_j(F)(x) = 0$ (note that there are no $x$ such that $T_j(F)(x) > 0$ when $P \in \mathcal{P}_0$). Define $\mathcal{P}_{00} \subset \mathcal{P}_0$ to be the set of all $P$ such that $\mathcal{X}_0(P) \neq \emptyset$.
If $P \in \mathcal{P}_0 \setminus \mathcal{P}_{00}$ then $\mathcal{X}_0(P) = \emptyset$ and, because the distribution satisfies the null hypothesis, $F_A$ strictly dominates $F_B$ everywhere and the criterion functions $T_j$ are strictly negative over $\bar{\mathcal{X}}$. When $P \in \mathcal{P}_0 \setminus \mathcal{P}_{00}$, the test statistics have asymptotic distributions that are degenerate at zero because the test statistics will detect that policy A is strictly better than B over all of $\bar{\mathcal{X}}$. When $P \in \mathcal{P}_{00}$, $T_j(F)$ is zero over $\mathcal{X}_0(P)$ and the test statistics have a nontrivial asymptotic distribution over $\mathcal{X}_0(P)$. Thus, when $F_A \succeq_{LASD} F_B$, the asymptotic behavior of the test statistics depends on whether $P \in \mathcal{P}_{00}$ or $P \in \mathcal{P}_0 \setminus \mathcal{P}_{00}$. Note that when $P \in \mathcal{P}_{00}$, we have $\lim_{\epsilon \searrow 0} M_k(\epsilon) = \mathcal{X}_k(P)$ for whichever coordinate function actually achieves the maximal value zero.

Hadamard differentiability is an analytic tool used to establish the asymptotic distribution of nonlinear maps of the empirical process. Definition A.1 in Appendix A provides a precise statement of the concept. When a map is Hadamard differentiable (for example $T_2$, which is linear as a map from $(\ell^\infty(\mathbb{R}))^2$ to $(\ell^\infty(\mathbb{R}_+))^2$ and is thus trivially differentiable), the functional delta method can be applied to describe its asymptotic behavior as a transformed empirical process, and a chain rule makes the analysis of compositions of several Hadamard-differentiable maps tractable. Also, the Hadamard differentiability of a map implies that resampling is consistent when this map is applied to the resampled empirical process (van der Vaart, 1998, Theorem 23.9); so, for example, the distribution of the resampled criterion processes $\sqrt{n}(T_2(F_n^*) - T_2(F_n))$ is a consistent estimate of the asymptotic distribution of $\sqrt{n}(T_2(F_n) - T_2(F))$ in the space $(\ell^\infty(\mathbb{R}_+))^2$.

On the other hand, consider the $T_1$ map.
The pointwise Hadamard directional derivative of $T_1(f)(x)$ at a given $x \geq 0$ in direction $h(x) = (h_A(x), h_B(x))$ is

$$T'_{1,f}(h)(x) = \begin{cases} h_A(x) - h_B(x) + h_A(-x) - h_B(-x), & f_A(x) > f_B(x) \\ (h_A(x) - h_B(x))_+ + h_A(-x) - h_B(-x), & f_A(x) = f_B(x) \\ h_A(-x) - h_B(-x), & f_A(x) < f_B(x). \end{cases} \qquad (24)$$

This map, thought of as a map between the function spaces $(\ell^\infty(\mathbb{R}))^2$ and $\ell^\infty(\mathbb{R}_+)$, is not differentiable because the pointwise maximum map is only differentiable at each point $x$, but not in the codomain $\ell^\infty(\mathbb{R}_+)$. Despite the lack of differentiability of the map $F \mapsto T_1(F)$, we show in Lemma A.3 in Appendix A that $F \mapsto V_1$ and $F \mapsto W_1$ are Hadamard directionally differentiable, which implies these maps are just regular enough that existing statistical methods can be applied to their analysis. Later in this section we apply the resampling technique recently developed in Fang and Santos (2019), along with this directional differentiability, to test hypotheses using $V_{1n}$ or $W_{1n}$.

Having discussed the difficulties in the relationship between distributions and test statistics, we turn to assumptions on the observations. In order to conduct inference using either $T_1(F_n)$ or $T_2(F_n)$ we make the following assumptions.

A1 The observations $\{X_{Ai}\}_{i=1}^{n_A}$ and $\{X_{Bi}\}_{i=1}^{n_B}$ are iid samples, independent of each other, and are continuously distributed with marginal distribution functions $F_A$ and $F_B$ respectively.

A2 The sample sizes $n_A$ and $n_B$ increase in such a way that $n_k / (n_A + n_B) \to \lambda_k$ as $n_A, n_B \to \infty$, where $0 < \lambda_k < 1$ for $k \in \{A, B\}$. Define $n = n_A + n_B$.

Under these assumptions we establish the asymptotic properties of the test statistics under the null and fixed alternatives. Under the above assumptions, there is a Gaussian process $\mathbb{G}_F$ such that $\sqrt{n}(F_n - F) \rightsquigarrow \mathbb{G}_F$.
We denote the coordinate processes by $\mathbb{G}_{F_A}$ and $\mathbb{G}_{F_B}$, and for convenience define two transformed processes: for each $x \geq 0$ let

$$\mathbb{G}_1(x) = \mathbb{G}_{F_A}(-x) - \mathbb{G}_{F_B}(-x) \qquad (25)$$

$$\mathbb{G}_2(x) = \mathbb{G}_{F_A}(x) - \mathbb{G}_{F_B}(x) + \mathbb{G}_{F_A}(-x) - \mathbb{G}_{F_B}(-x). \qquad (26)$$

These will be used in the theorem below.

Theorem 4.1.
Make assumptions A1-A2. Define the limiting Gaussian processes $\mathbb{G}_1$ and $\mathbb{G}_2$ as above. Then:

1. Suppose that $P \in \mathcal{P}_{00}$. As $n \to \infty$, $V_{1n} \rightsquigarrow V_1$ and $W_{1n} \rightsquigarrow W_1$, where

$$V_1 \sim \max\left\{ 0,\ \sup_{x \in \mathcal{X}_1(P)} \mathbb{G}_1(x) \cdot 1\left\{ \sup_{x \in \bar{\mathcal{X}}} m_1(x) = 0 \right\},\ \sup_{x \in \mathcal{X}_2(P)} \mathbb{G}_2(x) \cdot 1\left\{ \sup_{x \in \bar{\mathcal{X}}} m_2(x) = 0 \right\} \right\}$$

and

$$W_1 \sim \left( \int_{\mathcal{X}_1(P)} \left( (\mathbb{G}_1(x))_+ \right)^2 dx + \int_{\mathcal{X}_2(P)} \left( (\mathbb{G}_2(x))_+ \right)^2 dx \right)^{1/2}.$$
2. Suppose that $P \in \mathcal{P}_{00}$. As $n \to \infty$, $V_{2n} \rightsquigarrow V_2$ and $W_{2n} \rightsquigarrow W_2$, where $V_2 \sim V_1$ and $W_2 \sim W_1$.

3. Suppose that $P \in \mathcal{P}_0 \setminus \mathcal{P}_{00}$. As $n \to \infty$, $P\{V_{jn} > \epsilon\} \to 0$ and $P\{W_{jn} > \epsilon\} \to 0$ for all $\epsilon > 0$, for $j = 1$ or $2$.

4. Suppose that $P \notin \mathcal{P}_0$. As $n \to \infty$, $P\{V_{jn} > c\} \to 1$ and $P\{W_{jn} > c\} \to 1$ for all $c \geq 0$, for $j = 1$ or $2$.

Theorem 4.1 derives the asymptotic properties of the proposed test statistics. Parts 1 and 2 establish the weak limits of $V_{jn}$ and $W_{jn}$ for $j \in \{1, 2\}$ when the null hypothesis is true. Recall that when $P \in \mathcal{P}_{00}$, $\lim_{\epsilon \searrow 0} M_k(\epsilon) = \mathcal{X}_k(P)$, which is why $M_k(\epsilon)$ terms are absent in the first part of the theorem. Remarkably, the test statistics using the $T_1$ and $T_2$ criterion processes have the same asymptotic behavior despite the different appearances of the underlying processes and the irregularity of $T_1$. Part 3 shows that the statistics are asymptotically degenerate at zero when the contact set is empty, that is, when $P$ lies in the interior of the null region. Part 4 shows that the test statistics diverge when the data come from any distribution that does not satisfy the null hypothesis.

The limiting distributions described in Part 1 of Theorem 4.1 are not standard because the distributions of the test statistics depend on features of $P$ through the $\mathcal{X}_k(P)$ terms in each expression. Therefore, to make practical inference feasible, we suggest the use of resampling techniques below.

The proposed test statistics have complex limiting distributions. In this subsection, we present resampling procedures to estimate the limiting distributions of both $V_{jn}$ and $W_{jn}$ for $j \in \{1, 2\}$ under the assumption that $P \in \mathcal{P}_0$. Naive use of bootstrap data generating processes in the place of the original empirical process suffers from distortions due to discontinuities in the directional derivatives of the maps that define the distributions of the test statistics.
In finite samples the plug-in estimate will not find, for example, the region where $F_A(x) - F_B(x) = 0$, where the derivatives exhibit discontinuous behavior. Our procedure involves making estimates of the derivatives involved in the limiting distribution and a standard exchangeable bootstrap routine, as proposed in Fang and Santos (2019). (Given a set of weights $\{W_i\}_{i=1}^n$ that sum to one and are independent of $\{X_i\}_{i=1}^n$, the exchangeable bootstrap measure is a randomly-weighted measure that puts mass $W_i$ at observed sample point $X_i$ for each $i$. This encompasses, for example, the standard bootstrap, the $m$-of-$n$ bootstrap and the wild bootstrap.)

In order to estimate contact sets, define a sequence of constants $\{a_n\}$ such that $a_n \searrow 0$ and $\sqrt{n} a_n \to \infty$, and let $\hat{m}_{1n}(x) = F_{An}(-x) - F_{Bn}(-x)$ and $\hat{m}_{2n}(x) = F_{An}(-x) - F_{Bn}(-x) + F_{An}(x) - F_{Bn}(x)$. Then for the $W_j$ statistics define the estimated contact sets by

$$\hat{\mathcal{X}}_1 = \{x \in \bar{\mathcal{X}} : |\hat{m}_{1n}(x)| \leq a_n\} \qquad (27)$$

$$\hat{\mathcal{X}}_2 = \{x \in \bar{\mathcal{X}} : |\hat{m}_{2n}(x)| \leq a_n\}. \qquad (28)$$

When both sets are empty, replace both estimates by $\bar{\mathcal{X}}$. Meanwhile, for the $V_j$ statistics define estimated $\epsilon$-maximizer sets. For any sequence of constants $\{b_n\}$ such that $b_n \searrow 0$ and $\sqrt{n} b_n \to \infty$, let

$$\hat{M}_1(b_n) = \{x \in \bar{\mathcal{X}} : \hat{m}_{1n}(x) \geq \max \hat{m}_{1n}(x) - b_n\}, \qquad (29)$$

$$\hat{M}_2(b_n) = \{x \in \bar{\mathcal{X}} : \hat{m}_{2n}(x) \geq \max \hat{m}_{2n}(x) - b_n\}. \qquad (30)$$

Using these estimates, the distributions of $V_1$ and $W_1$ can be estimated from sample data (recall that Part 2 of Theorem 4.1 asserts that these are the same distributions as those of $V_2$ and $W_2$). The formulas in part 3 of the steps below are obtained by inserting estimated contact sets and resampled empirical processes in the place of population-level quantities in the functions shown in Part 1 of Theorem 4.1.

Resampling routine to estimate the distributions of $V_{jn}$ and $W_{jn}$ for $j = 1, 2$:
1. If using a Cramér-von Mises statistic, given a sequence of constants $\{a_n\}$, estimate the contact sets $\hat{\mathcal{X}}_1$ and $\hat{\mathcal{X}}_2$. If using a Kolmogorov-Smirnov statistic, given a sequence of constants $\{b_n\}$, estimate the $b_n$-maximizer sets of $\hat{m}_{1n}$ and $\hat{m}_{2n}$.

Next repeat the following two steps for $r = 1, \ldots, R$:

2. Construct the resampled processes

$$F_{1n}^{*r}(x) = \sqrt{n}\left( F_{An}^*(-x) - F_{Bn}^*(-x) - F_{An}(-x) + F_{Bn}(-x) \right)$$

$$F_{2n}^{*r}(x) = \sqrt{n}\left( F_{An}^*(-x) - F_{Bn}^*(-x) - F_{An}(-x) + F_{Bn}(-x) + F_{An}^*(x) - F_{Bn}^*(x) - F_{An}(x) + F_{Bn}(x) \right)$$

using an exchangeable bootstrap.

3. Calculate the resampled test statistic. Letting $\hat{k} = \arg\max_k \{\sup_{x \geq 0} \hat{m}_{kn}(x)\}$ and $\{c_n\} \searrow 0$ satisfy $\sqrt{n} c_n \to \infty$, calculate

$$V_n^{*r} = \begin{cases} \left( \max_{x \in \hat{M}_{\hat{k}}(b_n)} F_{\hat{k}n}^{*r}(x) \right)_+, & |\max \hat{m}_{1n} - \max \hat{m}_{2n}| > c_n \\ \max\left\{ 0,\ \max_{x \in \hat{M}_1(b_n)} F_{1n}^{*r}(x),\ \max_{x \in \hat{M}_2(b_n)} F_{2n}^{*r}(x) \right\}, & |\max \hat{m}_{1n} - \max \hat{m}_{2n}| \leq c_n \end{cases} \qquad (31)$$

or

$$W_n^{*r} = \left( \int_{\hat{\mathcal{X}}_1} \left( (F_{1n}^{*r}(x))_+ \right)^2 dx + \int_{\hat{\mathcal{X}}_2} \left( (F_{2n}^{*r}(x))_+ \right)^2 dx \right)^{1/2}. \qquad (32)$$

Finally,

4. Let $\hat{q}_{V^*}(1 - \alpha)$ and $\hat{q}_{W^*}(1 - \alpha)$ be the $(1 - \alpha)$th sample quantiles of the bootstrap distributions of $\{V_n^{*r}\}_{r=1}^R$ and $\{W_n^{*r}\}_{r=1}^R$, respectively, where $\alpha \in (0, 1)$ is the nominal size of the tests. Reject the null hypothesis (13) or (14) if $V_{jn}$ and $W_{jn}$ defined in (15)-(18) are, respectively, larger than $\hat{q}_{V^*}(1 - \alpha)$ or $\hat{q}_{W^*}(1 - \alpha)$.

The resampled statistics are calculated by imposing the null hypothesis and assuming that the region $\mathcal{X}_j(P)$ is the only part of the domain that provides a nondegenerate contribution to the asymptotic distribution of the statistic under the null. The two cases of each part in the maximum arise from trying to impose the null behavior on the resampled supremum norm statistics, even when it appears the null is violated based on the value of the sample statistic.
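The Cramér-von Mises branch of this routine can be sketched compactly. The implementation below is our own illustration (all function names are ours), using the standard nonparametric bootstrap as a special case of the exchangeable bootstrap and approximating the integral in (32) by a Riemann sum on a grid:

```python
import numpy as np

rng = np.random.default_rng(0)

def ecdf_vals(sample, x):
    """Empirical CDF of `sample` evaluated at the points in `x`."""
    s = np.sort(sample)
    return np.searchsorted(s, x, side="right") / s.size

def cvm_bootstrap(XA, XB, grid, dx, a_n, R=199):
    """Sketch of steps 1-4 for the W statistic: estimate contact sets,
    resample centered coordinate processes, and return the 95% critical value."""
    nA, nB = len(XA), len(XB)
    n = nA + nB
    m1 = ecdf_vals(XA, -grid) - ecdf_vals(XB, -grid)
    m2 = m1 + ecdf_vals(XA, grid) - ecdf_vals(XB, grid)
    # Step 1: estimated contact sets (27)-(28); fall back to the full grid.
    S1, S2 = np.abs(m1) <= a_n, np.abs(m2) <= a_n
    if not (S1.any() or S2.any()):
        S1 = S2 = np.ones_like(m1, dtype=bool)
    stats = []
    for _ in range(R):
        # Step 2: centered resampled coordinate processes.
        XA_s = rng.choice(XA, nA, replace=True)
        XB_s = rng.choice(XB, nB, replace=True)
        m1_s = ecdf_vals(XA_s, -grid) - ecdf_vals(XB_s, -grid) - m1
        m2_s = m1_s + ecdf_vals(XA_s, grid) - ecdf_vals(XB_s, grid) - (m2 - m1)
        # Step 3: resampled statistic (32), restricted to the estimated contact sets.
        w = (np.sum(np.maximum(np.sqrt(n) * m1_s[S1], 0.0) ** 2)
             + np.sum(np.maximum(np.sqrt(n) * m2_s[S2], 0.0) ** 2)) * dx
        stats.append(np.sqrt(w))
    # Step 4: bootstrap critical value at nominal size 5%.
    return np.quantile(stats, 0.95)
```

The sample statistic would then be compared against the returned quantile; the tuning constant `a_n` plays the role of the sequence $\{a_n\}$ above and must be chosen by the user.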
A simple alternative way to conduct inference would be to assume the least-favorable null hypothesis that $F_A \equiv F_B$, and to resample using all of $\bar{\mathcal{X}}$. However, this may result in tests with lower power (Linton, Song, and Whang, 2010): power loss arises in situations where $\mathcal{X}_0(P) \subset \bar{\mathcal{X}}$ (strictly), so that the $T_j$ process is only nondegenerate on a subset, while bootstrapped processes that assume $\mathcal{X}_0(P) = \bar{\mathcal{X}}$ would look over all of $\bar{\mathcal{X}}$ and result in a stochastically larger bootstrap distribution than the true distribution.

The next result shows that our tests based on the resampling schemes described above have accurate size under the null hypothesis. In order to metrize weak convergence we use test functions from the set $BL_1$, which denotes the Lipschitz functions $\mathbb{R} \to \mathbb{R}$ with Lipschitz constant 1 that are bounded by 1.

Theorem 4.2.
Make assumptions A1-A2 and suppose that $P \in \mathcal{P}_0$. Let $\hat{q}_{V_j^*}(1 - \alpha)$ and $\hat{q}_{W_j^*}(1 - \alpha)$ be the $(1 - \alpha)$th sample quantiles of the bootstrap distributions as described in the routines above. Then for $j = 1, 2$, the bootstrap is consistent:

$$\sup_{f \in BL_1} \left| E[f(V_n^*) \mid X] - E[f(V_1)] \right| = o_P(1) \quad \text{and} \quad \sup_{f \in BL_1} \left| E[f(W_n^*) \mid X] - E[f(W_1)] \right| = o_P(1),$$

where $V_1$ and $W_1$ are defined in Theorem 4.1.

The result in the above theorem is stated in terms of the limiting variables $V_1$ and $W_1$ and their bootstrap analogs. By the functional delta method, $V_1$ and $W_1$ are Hadamard directional derivatives of a chain of maps from the marginal distribution functions $F$ to the real line, and the derivatives are most compactly expressed as the definitions in Theorem 4.1. The bootstrap variables combine conventional resampling with finite-sample estimates of the maps defined in Part 1 of Theorem 4.1, which is a resampling approach proposed in Fang and Santos (2019). Their result is actually more general: it states that with a more flexible estimator $V_n^*$, we would obtain bootstrap consistency for $P$ in the null and alternative regions. Because our focus is on testing $F_A \succeq_{LASD} F_B$, however, our resampling scheme, and Theorem 4.2, are carried out under the imposition of the null hypothesis.

The resampling consistency result in Theorem 4.2 implies that our bootstrap tests have asymptotically correct size uniformly over probability distributions in the null region, in the same sense as was stressed in Linton, Song, and Whang (2010). A formal statement of this uniformity over $\mathcal{P}_0$ is given in Theorem A.5 in Appendix A. Along with Part 4 of Theorem 4.1, Theorem A.5 additionally implies that our tests are consistent, that is, that their power to detect violations of the null represented by fixed alternative distributions tends to one. This is because the resampling scheme produces asymptotically bounded critical values, while the test statistics diverge under the alternative.
In this section we extend the dominance tests to the case where the distribution functions $F_A$ and $F_B$ are only partially identified by their Makarov bounds. Suppose that $Z_0$, $Z_A$ and $Z_B$ are random variables with marginal distribution functions $G = (G_0, G_A, G_B)$, but the joint probability distribution $P$ of the vector $(Z_0, Z_A, Z_B)$ is unknown, so that $F_A$ and $F_B$ are not point identified because they are the unknown distribution functions of $X_A = Z_A - Z_0$ and $X_B = Z_B - Z_0$. Nevertheless, we wish to test the hypotheses in (10), which depend on $F_A$ and $F_B$.

4.2.1 Test statistics

Recall equations (8) and (9) from Section 3. Restated in terms of the null hypothesis $F_A \succeq_{LASD} F_B$, condition (8) is sufficient to imply that the null hypothesis is true, while (9) represents a necessary condition for dominance. Denote by $\mathcal{P}_{suf}$ the set of distributions that satisfy (8) and let $\mathcal{P}_{nec}$ collect all distributions that satisfy (9). Then, still using the label $\mathcal{P}_0$ for the set of distributions such that $X_A$ dominates $X_B$, we have the (strict) inclusions $\mathcal{P}_{suf} \subset \mathcal{P}_0 \subset \mathcal{P}_{nec}$. Given this relation, without any further identification conditions, we look for significant violations of the necessary condition, since $P \notin \mathcal{P}_{nec}$ implies
$P \notin \mathcal{P}_0$. This generally results in conservative tests, because distributions $P \in \mathcal{P}_{nec} \setminus \mathcal{P}_0$ will also not be rejected, but it avoids overrejection, which would be the result of using the sufficient condition.

To test the null (10) we employ the inequality specified in equation (9) from Theorem 3.7. For each $x \in \bar{\mathcal{X}}$ let

$$T_3(G)(x) = L_A(-x) + L_A(x) - U_B(-x) - U_B(x), \qquad (33)$$

where $L_A$ and $U_B$ are defined in (6) and (7). To see the explicit dependence of $T_3$ on $G$, rewrite (33), using the identity $\inf f = -\sup(-f)$ in the definition of $U_B$, as

$$T_3(G)(x) = \sup_{u \in \mathbb{R}} (G_A(u) - G_0(u + x)) + \sup_{u \in \mathbb{R}} (G_A(u) - G_0(u - x)) + \sup_{u \in \mathbb{R}} (G_0(u + x) - G_B(u)) + \sup_{u \in \mathbb{R}} (G_0(u - x) - G_B(u)) - 2. \qquad (34)$$

As before, $T_3$ has been written in such a way that a violation of the null hypothesis $F_A \succeq_{LASD} F_B$ is indicated by observing some $x$ such that $T_3(G)(x) > 0$.

The above map shares a similar feature with the $T_1$ map in the previous section: the marginal (in $u$) optimization maps are directionally differentiable at each point $x \geq 0$, but $f(u, x) \mapsto \sup_u f(u, x)$ is not Hadamard differentiable as a map from $\ell^\infty(\mathbb{R} \times \bar{\mathcal{X}})$ to $\ell^\infty(\bar{\mathcal{X}})$. One solution to this problem is to examine the distribution of test functionals applied to the process, which are Hadamard directionally differentiable (shown in Lemma A.4 in Appendix A).

Given observed samples $\{Z_{ki}\}$ for $k \in \{0, A, B\}$, define the marginal empirical distribution functions $G_n = (G_{0n}, G_{An}, G_{Bn})$, where $G_{kn}(z) = n_k^{-1} \sum_i 1\{Z_{ki} \leq z\}$ for $k \in \{0, A, B\}$, and let $L_{An}$ and $U_{Bn}$ be the plug-in estimates of the bounds: for each $x \in \bar{\mathcal{X}}$, let $L_{An}(x) = L(x, G_{0n}, G_{An})$ and $U_{Bn}(x) = U(x, G_{0n}, G_{Bn})$, where $L$ and $U$ were introduced in equations (6) and (7). To estimate $T_3$ in (33) we use the plug-in estimate $T_3(G_n)$.
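To illustrate, the plug-in criterion $T_3(G_n)(x)$ can be approximated by taking the four inner suprema in (34) over a finite grid of $u$ values. This is our own sketch (names are ours, and truncating the suprema to a grid is an approximation):

```python
import numpy as np

def T3_plugin(G0, GA, GB, u_grid, x):
    """Approximate T3(G)(x) from (34): four suprema over u, truncated to a
    finite grid, plus the constant -2 contributed by the two upper bounds."""
    sup1 = np.max(GA(u_grid) - G0(u_grid + x))
    sup2 = np.max(GA(u_grid) - G0(u_grid - x))
    sup3 = np.max(G0(u_grid + x) - GB(u_grid))
    sup4 = np.max(G0(u_grid - x) - GB(u_grid))
    return sup1 + sup2 + sup3 + sup4 - 2.0
```

For example, when $Z_0$, $Z_A$ and $Z_B$ share the same distribution and $x = 0$, all four suprema vanish and the criterion sits well inside the null region.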
As in the previous section, we consider the following Kolmogorov-Smirnov and Cramér-von Mises type test statistics:

$$V_{3n} = \sqrt{n} \sup_{x \in \bar{\mathcal{X}}} (T_3(G_n)(x))_+ \qquad (35)$$

$$W_{3n} = \sqrt{n} \left( \int_{\bar{\mathcal{X}}} \left( (T_3(G_n)(x))_+ \right)^2 dx \right)^{1/2}. \qquad (36)$$

The next subsections establish limiting distributions for $V_{3n}$ and $W_{3n}$ and suggest a resampling procedure to estimate the distributions.

Once again, it is necessary to define the region where the test statistics have nontrivial distributions. Define the contact set for the $T_3$ criterion function by

$$\mathcal{X}_{nec}(P) = \{x \in \bar{\mathcal{X}} : L_A(-x) + L_A(x) - U_B(-x) - U_B(x) = 0\}.$$

We say that a distribution $P \in \mathcal{P}_{nec,0}$ when $\mathcal{X}_{nec}(P) \neq \emptyset$. As mentioned at the beginning of the section, $\mathcal{P}_{nec}$ is not the set of $P$ such that $F_A \succeq_{LASD} F_B$, but rather those that satisfy this necessary condition; in other words, $\mathcal{P}_0 \subset \mathcal{P}_{nec}$. There is no obvious connection between $\mathcal{P}_{00}$ and $\mathcal{P}_{nec,0}$; the $P$ in $\mathcal{P}_{nec,0}$ are simply those that lead to nontrivial asymptotic behavior of the $T_3$ statistic, as will be shown in Theorem 4.3.

Next, we define a few functions that are analogous to the $m_1$ and $m_2$ used in the point-identified case, and which come from separating equation (34) into four sub-functions. Let $m_1(u, x) = G_A(u) - G_0(u + x)$, $m_2(u, x) = G_A(u) - G_0(u - x)$, $m_3(u, x) = G_0(u + x) - G_B(u)$ and $m_4(u, x) = G_0(u - x) - G_B(u)$. These functions are used to define, for $k = 1, \ldots, 4$, for any $x \in \bar{\mathcal{X}}$ and $\epsilon \geq 0$, the set-valued maps

$$M_k(x, \epsilon) = \left\{ u \in \mathbb{R} : m_k(u, x) \geq \sup_{u \in \mathbb{R}} m_k(u, x) - \epsilon \right\}. \qquad (37)$$

Also for the supremum norm statistic another relevant set of $\epsilon$-maximizers exists: for any $\epsilon \geq 0$, let

$$M_{nec}(\epsilon) = \left\{ (u, x) \in \mathbb{R} \times \bar{\mathcal{X}} : \sum_{k=1}^4 m_k(u, x) \geq \sup_{u, x} \sum_{k=1}^4 m_k(u, x) - \epsilon \right\}. \qquad (38)$$

Under the null hypothesis that the supremum is zero, $\lim_{\epsilon \searrow 0} M_{nec}(\epsilon) = \mathcal{X}_{nec}$, as seen in the expression for $V_3$ in the next theorem.

Now we turn to regularity assumptions on the observed data. The only difference between these assumptions and assumptions A1-A2 is that we must now make assumptions for three samples instead of two.

B1 The observations $\{Z_{0i}\}_{i=1}^{n_0}$, $\{Z_{Ai}\}_{i=1}^{n_A}$ and $\{Z_{Bi}\}_{i=1}^{n_B}$ are iid samples, independent of each other, and are continuously distributed with marginal distribution functions $G_0$, $G_A$ and $G_B$ respectively.

B2 The sample sizes $n_0$, $n_A$ and $n_B$ increase in such a way that $n_k / (n_0 + n_A + n_B) \to \lambda_k$ as $n_0, n_A, n_B \to \infty$, for $k \in \{0, A, B\}$, where $0 < \lambda_k < 1$. Let $n = n_0 + n_A + n_B$.

Before stating the next theorem, it is convenient to make some definitions. Under assumptions B1-B2, standard results in empirical process theory show that there is a Gaussian process $\mathbb{G}_G$ such that $\sqrt{n}(G_n - G) \rightsquigarrow \mathbb{G}_G$ (van der Vaart, 1998, Example 19.6). For each $(u, x)$, denote the transformed empirical processes and their (Gaussian) limits

$$\sqrt{n}(G_{An}(u) - G_{0n}(u + x) - G_A(u) + G_0(u + x)) = \mathbb{G}_{1n}(u, x) \rightsquigarrow \mathbb{G}_1(u, x)$$
$$\sqrt{n}(G_{An}(u) - G_{0n}(u - x) - G_A(u) + G_0(u - x)) = \mathbb{G}_{2n}(u, x) \rightsquigarrow \mathbb{G}_2(u, x)$$
$$\sqrt{n}(G_{0n}(u + x) - G_{Bn}(u) - G_0(u + x) + G_B(u)) = \mathbb{G}_{3n}(u, x) \rightsquigarrow \mathbb{G}_3(u, x)$$
$$\sqrt{n}(G_{0n}(u - x) - G_{Bn}(u) - G_0(u - x) + G_B(u)) = \mathbb{G}_{4n}(u, x) \rightsquigarrow \mathbb{G}_4(u, x) \qquad (39)$$

Given the above definitions, the asymptotic behavior of $V_{3n}$ and $W_{3n}$ can be established.

Theorem 4.3.
Under assumptions B1-B2:

1. Suppose that $P \in \mathcal{P}_{nec,0}$. As $n \to \infty$, $V_{3n} \rightsquigarrow V_3$ and $W_{3n} \rightsquigarrow W_3$, where, given the definitions (39) and (37),

$$V_3 = \left( \sup_{x \in \mathcal{X}_{nec}(P)} \sum_{k=1}^4 \lim_{\epsilon \searrow 0} \sup_{u \in M_k(x, \epsilon)} \mathbb{G}_k(u, x) \right)_+$$

and

$$W_3 = \left( \int_{\mathcal{X}_{nec}(P)} \left( \left( \sum_{k=1}^4 \lim_{\epsilon \searrow 0} \sup_{u \in M_k(x, \epsilon)} \mathbb{G}_k(u, x) \right)_+ \right)^2 dx \right)^{1/2}.$$
2. Suppose that $P \in \mathcal{P}_{nec} \setminus \mathcal{P}_{nec,0}$. Then as $n \to \infty$, $P\{V_{3n} > \epsilon\} \to 0$ and $P\{W_{3n} > \epsilon\} \to 0$ for all $\epsilon > 0$.

3. Suppose that $P \notin \mathcal{P}_{nec}$. Then as $n \to \infty$, $P\{V_{3n} > c\} \to 1$ and $P\{W_{3n} > c\} \to 1$ for all $c \geq 0$.

The results of this theorem parallel those in Theorem 4.1. The distributions of these test statistics are complex; therefore a consistent resampling procedure for inference is discussed in the next subsection. The conservatism of these tests is reflected in the second part above. There may be
$P \notin \mathcal{P}_0$ such that $P \in \mathcal{P}_{nec} \setminus \mathcal{P}_{nec,0}$, meaning the test will not detect that such a distribution violates the hypothesis that $F_A \succeq_{LASD} F_B$.

Now we turn to the issue of conducting practical inference using estimated bound functions and the necessary condition for LASD. As before, resampling can be implemented by estimating the derivatives of either $V_3$ or $W_3$. These estimates represent the major difference from the resampling scheme developed in the point identified setting.

The estimates required for tests based on $V_{3n}$ and $W_{3n}$ are similar to those used in the point-identified case. Define a grid of values $X \subset \mathbb{R}$ and let $X_+$ be the sub-grid of nonnegative points such that $X_+ \subset X$. (Otherwise these functions would need to be evaluated over a prohibitive number of points in the support.) For a sequence $a_n$ such that $a_n \searrow 0$ and $\sqrt{n} a_n \to \infty$, define the estimate of the contact set

$$\hat{\mathcal{X}}_{nec} = \left\{ x \in X_+ : |L_{An}(-x) + L_{An}(x) - U_{Bn}(-x) - U_{Bn}(x)| \leq a_n \right\}. \qquad (40)$$

When this estimated set is empty, set $\hat{\mathcal{X}}_{nec} = X_+$. The inner maximization step that occurs in the definition of the test statistics requires an estimate of the $\epsilon$-maximizers of each sub-process, that is, estimates of (37) for $k = 1, \ldots, 4$. For these sets we also use the same sort of estimator: for $\{b_n\}$ such that $b_n \searrow 0$ and $\sqrt{n} b_n \to \infty$, for each $x \in X_+$ let

$$\hat{M}_k(x) = \left\{ u \in X : \hat{m}_{kn}(u, x) \geq \max_{u \in X} \hat{m}_{kn}(u, x) - b_n \right\} \qquad (41)$$

where the $\hat{m}_{kn}$ are plug-in estimators of the $m_k$. Finally, for a sequence $d_n$ such that $d_n \searrow 0$ and $\sqrt{n} d_n \to \infty$, define the estimator

$$\hat{M}_{nec} = \left\{ (u, x) \in X \times X_+ : \sum_{k=1}^4 \hat{m}_{kn}(u, x) \geq \max_{(u, x) \in X \times X_+} \sum_{k=1}^4 \hat{m}_{kn}(u, x) - d_n \right\}. \qquad (42)$$

Putting these estimates together, we find the derivative estimates described in the resampling routine below.
Resampling routine to estimate the distributions of $V_{3n}$ and $W_{3n}$:
1. If using a Cramér-von Mises statistic, given a sequence of constants $\{a_n\}$, estimate the contact set $\hat{\mathcal{X}}_{nec}$. If using a Kolmogorov-Smirnov statistic, given sequences of constants $\{b_n\}$ and $\{d_n\}$, estimate $\hat{M}_k(\cdot)$ for $k = 1, \ldots, 4$ and $\hat{M}_{nec}$.

Next repeat the following two steps for $r = 1, \ldots, R$:

2. Construct the resampled processes $\mathbb{G}_{kn}^* = \sqrt{n}(G_{kn}^* - G_{kn})$ using an exchangeable bootstrap.

3. Calculate the resampled test statistic

$$V_{3n}^{*r} = \left( \max_{x \in \hat{M}_{nec}} \sum_{k=1}^4 \max_{u \in \hat{M}_k(x)} \mathbb{G}_{kn}^*(u, x) \right)_+$$

or

$$W_{3n}^{*r} = \left( \int_{\hat{\mathcal{X}}_{nec}} \left( \left( \sum_{k=1}^4 \max_{u \in \hat{M}_k(x)} \mathbb{G}_{kn}^*(u, x) \right)_+ \right)^2 dx \right)^{1/2}.$$

Finally,

4. Let $\hat{q}_{V^*}(1 - \alpha)$ and $\hat{q}_{W^*}(1 - \alpha)$ be the $(1 - \alpha)$th sample quantiles of the bootstrap distributions of $\{V_{3n}^{*r}\}_{r=1}^R$ and $\{W_{3n}^{*r}\}_{r=1}^R$, respectively, where $\alpha \in (0, 1)$ is the nominal size of the tests. We reject the null hypothesis (13) if $V_{3n}$ or $W_{3n}$, defined in (35) and (36), is, respectively, larger than $\hat{q}_{V^*}(1 - \alpha)$ or $\hat{q}_{W^*}(1 - \alpha)$.

The following theorem guarantees that the resampling scheme is consistent.

Theorem 4.4.
Make assumptions B1-B2 and suppose that $P \in \mathcal{P}_{nec}$. Let $\hat{q}_{V^*}(1 - \alpha)$ and $\hat{q}_{W^*}(1 - \alpha)$ be the $(1 - \alpha)$th sample quantiles of the bootstrap distributions as described in the routines above. Then the bootstrap is consistent:

$$\sup_{f \in BL_1} \left| E[f(V_{3n}^*) \mid X] - E[f(V_3)] \right| = o_P(1) \quad \text{and} \quad \sup_{f \in BL_1} \left| E[f(W_{3n}^*) \mid X] - E[f(W_3)] \right| = o_P(1).$$

The testing procedure based on the $T_3$ criterion function controls size uniformly over $\mathcal{P}_{nec}$, a superset of $\mathcal{P}_0$. The uniform size of the resampling inference scheme over $\mathcal{P}_{nec}$ is stated formally in Theorem A.6 in Appendix A. However, using only a necessary condition for inference comes at a cost, which is the possibility of trivial power against some alternatives $P \notin \mathcal{P}_0$: for any $P \in \mathcal{P}_{nec} \setminus \mathcal{P}_0$, the probability of rejecting the null is also less than or equal to $\alpha$. More generally, results about size and power against various alternatives that can be specified for point identified distributions are not available in the partially identified case. On the other hand, it is remarkable that the test controls size uniformly over the set $\mathcal{P}_0$, which is a set of treatment outcome distributions that cannot be observed directly.

An Online Supplemental Appendix provides Monte Carlo numerical evidence of the finite sample properties of both the point- and partially-identified methods. The simulations show that the tests have empirical size close to the nominal level, and high power against selected alternatives.

In this section we illustrate the use of our proposed methods in a policy evaluation context. We contrast our results with a classical stochastic dominance approach. We use household-level data from an experimental evaluation of two federal assistance programs, named Aid to Families with Dependent Children (AFDC) and Jobs First (JF), to analyze the distributional effects of the policies.
Bitler, Gelbach, and Hoynes (2006) use these data to document substantial heterogeneity in the impacts of this policy change on recipients' total incomes. The authors focus on this policy because of the availability of experimental data, which provides a clear source of identification. Amongst its main findings, the article shows that this heterogeneity generated income gains and losses in different, sizable groups of recipients.

AFDC was one of the largest federal assistance programs in the United States between 1935 and 1996. It consisted of a means-tested income support scheme for low-income families with dependent children, administered at the state level and funded at the federal level. Following criticism that this program discouraged female labor market participation and perpetuated welfare dependency, AFDC was discontinued in 1996 and replaced, in each state, by more [...] A and B in the previous sections). There are quarterly measures for income, earnings and transfers, but we concentrate only on measures of change in total income, comparing quarterly income before and after the households were randomly assigned to one of the groups. Because assignment is random, we assume that the distribution functions of gains and losses under each policy, $F_{JF}$ and $F_{AFDC}$, are point-identified by the differences in incomes before and after random assignment.

[Footnote: Bitler, Gelbach, and Hoynes (2006) conduct a test comparing features of households before random assignment and find that they do not differ significantly in terms of observable characteristics. We check additionally that the income distributions were the same before the experiment split households among the two policies, using a conventional two-sided Cramér-von Mises test for the equality of distributions; its p-value implies that, before the experiment, the distributions are indistinguishable.]
To make welfare decisions in terms of gains and losses, we require data in terms of changes, which we construct using several definitions. First, measurements were taken before random assignment (RA) into one of the two programs, and we call these measurements pre-RA observations. All periods after random assignment are labeled post-RA observations. Next, the Jobs First program stopped supporting individuals at what we call the Time Limit (TL), although quarterly income was observed for these households after the time limit. We call pre-TL observations those that were made after random assignment but before the time limit, while post-TL observations are those made after the JF time limit. We summarize the pre/post-RA and pre/post-TL observations in one of two ways: either by averaging income over all quarters in the relevant time span, or by using the final quarter within the time span. Therefore there are four ways of defining income changes based on all the combinations of time limits and measurement summaries.

Changes in household income due to the AFDC and JF policies were defined using one of two methods. First, the natural log of the average earnings in all post-RA quarters minus the natural log of the average pre-RA quarterly earnings is called the average-RA change. Second, the natural log of the last quarter of post-RA income minus the natural log of the last quarter of pre-RA income is called the last-quarter-RA change. Other changes are defined using data around the Jobs First time limit. The natural log of average post-TL quarterly earnings minus the natural log of average pre-TL quarterly earnings is called the average-TL change. The natural log of the last quarter of post-TL income minus the natural log of the last quarter of pre-TL income is called the last-quarter-TL change.

We conducted formal tests of the hypothesis (10) using $W_{2n}$ statistics (Cramér-von Mises statistics applied to the empirical $T_2$ process).
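Before turning to the results, note that the change definitions above reduce to simple log differences. A minimal sketch (the helper names are ours):

```python
import numpy as np

def average_change(pre, post):
    """Average change: log of average post-period quarterly income minus
    log of average pre-period quarterly income (average-RA or average-TL)."""
    return np.log(np.mean(post)) - np.log(np.mean(pre))

def last_quarter_change(pre, post):
    """Last-quarter change: log of the final post-period quarter minus
    log of the final pre-period quarter."""
    return np.log(post[-1]) - np.log(pre[-1])
```

Applied household by household, these functions produce the samples of gains and losses whose empirical distributions enter the $T_2$ criterion process.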
The results of these tests are presented in the left-hand side of Table 1. First, we consider the results when changes are defined across the random assignment. The tests indicate that we cannot reject the hypothesis that $F_{AFDC} \succeq_{LASD} F_{JF}$ unless we measure outcomes using average-RA changes. In that case AFDC does not appear to dominate the JF policy. We also conducted tests of the hypothesis $F_{JF} \succeq_{LASD} F_{AFDC}$. We cannot reject this null hypothesis using either measure. Because in one of these cases both distributions dominate each other, we double-checked using two-sided tests of distributional equality, that is, of the null that $F_{AFDC} \equiv F_{JF}$. Using average income measures the distributions appear to be different, but using last-quarter measures we cannot reject the null that the distributions are indistinguishable. These tests offer some evidence that income changes across random assignment are indistinguishable between the two policies or better under the JF policy than under the AFDC policy.

Now we consider the case when changes are defined across the time limit (using either averages or last quarters). In this case, we do not reject the hypothesis that $F_{AFDC} \succeq_{LASD} F_{JF}$, and we reject the hypothesis that $F_{JF} \succeq_{LASD} F_{AFDC}$. This is an indication that the continued support from the AFDC policy sustains household incomes across the JF time limit better than the JF policy does, which is to be expected, since the JF policy provides no more support to any household after the time limit, allowing for a higher probability of losses in household income. (Results for the other test statistics are qualitatively the same; they are collected in an Online Supplemental Appendix.)

                       LASD in changes                              FOSD in levels
             F_AFDC ⪰ F_JF  F_JF ⪰ F_AFDC  equality    G_AFDC ⪰ G_JF  G_JF ⪰ G_AFDC  equality
  avg-RA           .              .            .              .              .            .
   p-value         .              .            .              .              .            .
  lastQ-RA         .              .            .              .              .            .
   p-value         .              .            .              .              .            .
  avg-TL           .              .            .              .              .            .
   p-value         .              .            .              .              .            .
  lastQ-TL         .              .            .              .              .            .
   p-value         .              .            .              .              .            .

Table 1: This table presents a number of tests that can be used to infer whether the Jobs First (JF) program would be preferred to the Aid to Families with Dependent Children (AFDC) program or the opposite. Column titles paraphrase the null hypotheses of the tests. The first three columns use changes in income and the last three columns measure income in levels without regard to pre-policy income. Comparisons made before and after assignment or the time limit were measured using the average of all quarters or using the last quarter. 1999 bootstrap repetitions were used in each test.

Figure 1 displays the CDFs of gains and losses under the AFDC and JF policies, and then the way that the two $T$ coordinate processes compare them: in the coordinates of equation (12), $F_A$ corresponds to $F_{AFDC}$ here, so large positive values correspond to a rejection of the hypothesis $F_{AFDC} \succeq_{LASD} F_{JF}$. This figure uses only average-RA change observations. It can be seen in the second and third panels that the presumable reason the AFDC policy does not dominate the JF policy using LASD is that the probability of small losses is higher in the AFDC program and the relation between small gains and small losses is preferable under JF.

Figure 1: The CDFs of changes in post-RA income and the way that they are turned into $T(F)$ coordinate processes. The second and third panels correspond to plug-in estimates of the coordinate functions of equation (12). The large positive values in the second panel drive the rejection of the hypothesis $F_{AFDC} \succeq_{LASD} F_{JF}$ seen in Table 1.

Tests in levels: first-order stochastic dominance

We also conducted an analysis of these data using standard FOSD inference methods.
Tests were used to infer dominance of the AFDC or JF policies using post-randomization levels, that is, without regard to the pre-randomization state. Income in levels is defined in two ways. Post-RA average income is defined as the natural log of the average income in all post-RA quarters. Post-TL average income is defined as the natural log of the average income in only the post-TL quarters. We conduct tests of the null hypotheses that $G_{AFDC} \succeq_{FOSD} G_{JF}$ or $G_{JF} \succeq_{FOSD} G_{AFDC}$, where the notation $G$ is meant as a reminder that these are marginal final income distributions that do not consider a household's pre-policy income. The results of these tests are presented in the right-hand side of Table 1.

Using all post-RA quarters, we can reject the hypothesis that $G_{AFDC} \succeq_{FOSD} G_{JF}$, but cannot reject the hypothesis that $G_{JF} \succeq_{FOSD} G_{AFDC}$. Therefore it seems clear that the JF policy dominates the AFDC policy in terms of final outcome distributions, that is, without regard to the effect that the policies have on any particular household's path from pre- to post-policy income.

When analyzing only the post-TL average income, we cannot reject the hypothesis that $G_{AFDC} \succeq_{FOSD} G_{JF}$ or that $G_{JF} \succeq_{FOSD} G_{AFDC}$, although there is weak evidence that the second relation might be violated. We checked a two-sided test of distributional equality and could not reject that the distributions are indistinguishable. Therefore the marginal post-TL income distributions seem indistinguishable, while the data in changes reveal that households would prefer the AFDC program. The inferences made using data in levels and FOSD can therefore be quite different from those using LASD with data on changes.

The significantly positive part that drives the rejection of the hypothesis $G_{AFDC} \succeq_{FOSD} G_{JF}$ is represented by the spike in the right panel of Figure 2, which reflects the fact that the red AFDC CDF lies significantly above the black JF CDF in the left panel near log income level $x = 8$.

Figure 2: The CDFs of levels of post-RA income and the way that they are used to test first-order stochastic dominance. The large positive values in the second panel drive the rejection of the hypothesis $G_{AFDC} \succeq_{FOSD} G_{JF}$ seen in Table 1.

Public policies often result in gains for some individuals and losses for others. Evidence shows that the way individuals value such gains and losses is a key determinant of public support for these policies. This, in turn, can determine which policies decision makers pursue. Since loss aversion is a well-established empirical regularity, how can the welfare associated with alternative policies be ranked when individuals are loss-averse?

We address this question by defining a social preference relation for distributions of gains and losses caused by a policy: loss aversion-sensitive dominance (LASD). We show that these social preferences are equivalent to criteria that depend solely on distribution functions. The assumption of loss aversion can lead to a welfare ranking of policies that differs from the one that would be obtained if classic utility theory and first-order stochastic dominance were used. We then propose testable conditions for LASD. Because our data come as differences between underlying random variables, we propose a point-identified version of these conditions and also a partially identified analog.

In order to make LASD comparisons using observed data, we propose statistical inference methods to formally test LASD relations in both the point-identified and the partially identified cases. We show that resampling techniques, tailored to specific features of the criterion functions, can be used to conduct inference.
Finally, we illustrate our LASD criterion and inference methods with a simple empirical application that uses data from a well-known evaluation of a large income support policy in the US. This shows that the ranking of policy options depends crucially on whether changes or levels are used and on whether or not one takes individual loss aversion into account.

Appendix

A Results on differentiability, uniform size control and computation

This section includes a definition and short discussion of the Hadamard directional differentiability concept and contains important intermediate results on Hadamard derivatives used to establish the main results in the text. Next we present some results on the control of size over the null region using the proposed resampling methods. Finally, there is one remark regarding the computation of the $T_1$ and $T_2$ processes (the $T$ processes should probably be computed on a grid for the sake of computation time). Proofs of the results discussed in this appendix are collected in Appendix B.4.

The Hadamard derivative is a standard tool used to analyze the asymptotic behavior of nonlinear maps in empirical process theory (van der Vaart, 1998, Section 20.2). We provide a definition here for completeness, along with its directional counterpart.

Definition A.1 (Hadamard differentiability). Let $\mathbb{D}$ and $\mathbb{E}$ be Banach spaces and consider a map $\phi : \mathbb{D}_\phi \subseteq \mathbb{D} \to \mathbb{E}$.

1. $\phi$ is Hadamard differentiable at $f \in \mathbb{D}_\phi$ tangentially to a set $\mathbb{D}_0 \subseteq \mathbb{D}$ if there is a continuous linear map $\phi'_f : \mathbb{D}_0 \to \mathbb{E}$ such that
\[ \lim_{n \to \infty} \left\| \frac{\phi(f + t_n h_n) - \phi(f)}{t_n} - \phi'_f(h) \right\|_{\mathbb{E}} = 0 \]
for all sequences $\{h_n\} \subset \mathbb{D}$ and $\{t_n\} \subset \mathbb{R}$ such that $h_n \to h \in \mathbb{D}_0$, $t_n \to 0$ as $n \to \infty$ and $f + t_n h_n \in \mathbb{D}_\phi$ for all $n$.
2. $\phi$ is Hadamard directionally differentiable at $f \in \mathbb{D}_\phi$ tangentially to a set $\mathbb{D}_0 \subseteq \mathbb{D}$ if there is a continuous map $\phi'_f : \mathbb{D}_0 \to \mathbb{E}$ such that
\[ \lim_{n \to \infty} \left\| \frac{\phi(f + t_n h_n) - \phi(f)}{t_n} - \phi'_f(h) \right\|_{\mathbb{E}} = 0 \]
for all sequences $\{h_n\} \subset \mathbb{D}$ and $\{t_n\} \subset \mathbb{R}_+$ such that $h_n \to h \in \mathbb{D}_0$, $t_n \searrow 0$ as $n \to \infty$ and $f + t_n h_n \in \mathbb{D}_\phi$ for all $n$.

In both cases of the above definition, $\phi'_f$ is continuous, with the addition of linearity in the fully differentiable case (Shapiro, 1990, Proposition 3.1). The two cases also differ in the sequences of admissible $\{t_n\}$, which allows the second definition to encode directions.

Because the pair of marginal distribution functions always occurs as the difference $F_A - F_B$, the next few definitions and lemmas are stated for a single function $f$. For later results, maps will be applied with the function $f = F_A - F_B$. The following maps will be used repeatedly in this section and in the proofs for analyzing more complex directionally differentiable maps. Let $\phi : \mathbb{R} \to \mathbb{R}$ be
\[ \phi(x) = (x)_+ = \max\{0, x\}, \quad (43) \]
and similarly, define $\psi : \mathbb{R}^2 \to \mathbb{R}$ by
\[ \psi(x, y) = \max\{x, y\}. \quad (44) \]
For some domain $\mathcal{X} \subseteq \mathbb{R}^j$ let $\sigma : \ell^\infty(\mathcal{X}) \to \mathbb{R}$ be
\[ \sigma(f) = \sup_{x \in \mathcal{X}} f(x). \quad (45) \]
These are all Hadamard directionally differentiable maps. It can be verified that for all $a \in \mathbb{R}$,
\[ \phi'_x(a) = \begin{cases} a & x > 0 \\ \max\{0, a\} & x = 0 \\ 0 & x < 0, \end{cases} \quad (46) \]
while for pairs $(a, b) \in \mathbb{R}^2$,
\[ \psi'_{x,y}(a, b) = \begin{cases} a & x > y \\ \max\{a, b\} & x = y \\ b & x < y. \end{cases} \]
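The piecewise formula for $\psi'_{x,y}$ can be checked numerically with a finite-difference approximation. The following is a small illustration of ours, not part of the formal development:

```python
def psi(x, y):
    return max(x, y)

def dpsi(x, y, a, b):
    # Directional derivative of (x, y) -> max{x, y} in direction (a, b),
    # per the piecewise formula: a if x > y, b if x < y, max{a, b} at the kink.
    if x > y:
        return a
    if x < y:
        return b
    return max(a, b)

# Finite-difference check at the kink x = y = 0 with direction (a, b) = (1, -2):
t = 1e-8
fd = (psi(0.0 + t * 1.0, 0.0 + t * (-2.0)) - psi(0.0, 0.0)) / t
print(abs(fd - dpsi(0.0, 0.0, 1.0, -2.0)) < 1e-6)  # True
```

At the kink the derivative $\max\{a, b\}$ is not linear in the direction $(a, b)$, which is precisely why $\psi$ is only directionally (and not fully) Hadamard differentiable there.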
For any $\varepsilon \geq 0$, let $M_f(\varepsilon) = \{x \in \mathcal{X} : f(x) \geq \sigma(f) - \varepsilon\}$ be the set of $\varepsilon$-maximizers of $f$. Cárcamo, Cuevas, and Rodríguez (2019) show that for all directions $h \in \ell^\infty(\mathcal{X})$,
\[ \sigma'_f(h) = \lim_{\varepsilon \searrow 0} \sup_{x \in M_f(\varepsilon)} h(x), \quad (47) \]
and they also give conditions under which the limiting operation can be discarded and the supremum of $h$ can be taken over the set of maximizers of $f$.

The next lemma shows that a weighted $L_p$ norm (for $p \geq 1$) applied to the positive part of a function is directionally differentiable. Cramér-von Mises statistics are found by setting $p = 2$. The directional differentiability of the $L_p$ norm with $p = 1$ was shown in Lemma S.4.5 of Fang and Santos (2019). Note that this lemma must be shown for the $L_p$ norm applied to the positive-part map, jointly applied to a function $f$; this is because $f \mapsto (f)_+$ is not differentiable as a map of functions to functions. Nevertheless, the dominated convergence theorem allows one to use pointwise convergence with integrability to find the result.

Lemma A.2.
Suppose $f : \mathcal{X} \subseteq \mathbb{R}^j \to \mathbb{R}^k$ is a bounded and $p$-integrable function. Let $w : \mathcal{X} \to \mathbb{R}^k_+$ be such that $\int w_i(x)\, dx < \infty$ for $i = 1, \ldots, k$. Let $1 \leq p < \infty$ and define the one-sided $L_p$ norm of $f$ by
\[ \lambda(f) = \left( \sum_{i=1}^k \int_{\mathcal{X}} \left( (f_i(x))_+ \right)^p w_i(x)\, dx \right)^{1/p}. \quad (48) \]
For $i = 1, \ldots, k$, define the subdomains $\mathcal{X}_{i-} = \{x \in \mathcal{X} : f_i(x) < 0\}$, $\mathcal{X}_{i0} = \{x \in \mathcal{X} : f_i(x) = 0\}$ and $\mathcal{X}_{i+} = \{x \in \mathcal{X} : f_i(x) > 0\}$ and the index collections $\mathcal{I}_0 = \{i \in 1, \ldots, k : \mu(\mathcal{X}_{i0}) > 0\}$ and $\mathcal{I}_+ = \{i \in 1, \ldots, k : \mu(\mathcal{X}_{i+}) > 0\}$, where $\mu$ is Lebesgue measure. Then $\lambda$ is Hadamard directionally differentiable and its derivative for any bounded, $p$-integrable $h : \mathcal{X} \to \mathbb{R}^k$ is
\[ \lambda'_f(h) = \begin{cases} 0 & \mathcal{I}_+ = \mathcal{I}_0 = \emptyset \\ \left( \sum_{i \in \mathcal{I}_0} \int_{\mathcal{X}_{i0}} ((h_i(x))_+)^p w_i(x)\, dx \right)^{1/p} & \mathcal{I}_+ = \emptyset, \ \mathcal{I}_0 \neq \emptyset \\ \lambda(f)^{1-p} \sum_{i \in \mathcal{I}_+} \int_{\mathcal{X}_{i+}} f_i^{p-1}(x) h_i(x) w_i(x)\, dx & \mathcal{I}_+ \neq \emptyset. \end{cases} \quad (49) \]

The above definitions make it easy, if rather abstract, to state the differentiability of the maps from distribution functions to test statistics that are applied to conduct uniform inference using the $T$ process.

Lemma A.3.
Let $f \in \ell^\infty(\mathcal{X})$ and let
\[ \nu(f) = \sup_{x \in \mathcal{X}} \left( (f(x) + f(-x)) \vee f(-x) \right)_+ \quad (50) \]
and, assuming $f$ is square integrable,
\[ \omega(f) = \left( \int_{\mathcal{X}} \left\{ \left( (f(x) + f(-x)) \vee f(-x) \right)_+ \right\}^2 dx \right)^{1/2}. \quad (51) \]
Then $\nu$ and $\omega$ are Hadamard directionally differentiable, and, letting $f_1(x) = f(-x)$ and $f_2(x) = f(x) + f(-x)$, with directions transformed analogously as $h_1(x) = h(-x)$ and $h_2(x) = h(x) + h(-x)$, their derivatives for any direction $h \in \ell^\infty(\mathcal{X})$ are
\[ \nu'_f(h) = \left( \phi'_{\psi(\sigma(f_1), \sigma(f_2))} \circ \psi'_{\sigma(f_1), \sigma(f_2)} \right) \left( \sigma'_{f_1}(h_1), \sigma'_{f_2}(h_2) \right) \quad (52) \]
and, assuming in addition that $f$ and $h$ are square integrable,
\[ \omega'_f(h) = \left( \lambda'_{\psi(f_1, f_2)} \circ \psi'_{f_1, f_2} \right)(h_1, h_2), \quad (53) \]
where we take the order $p = 2$ and the weight function $w \equiv 1$ in $\lambda'_f$ defined in (49).

Next we turn to results for the partially identified case. Lemma A.4 provides the theoretical tool needed for the analysis of Kolmogorov-Smirnov-type statistics when using Makarov bounds. First define the abstract map $\theta : (\ell^\infty(\mathcal{U} \times \mathcal{X}))^2 \to \mathbb{R}$ by
\[ \theta(f, g) = \sup_{x \in \mathcal{X}} \left( \sup_{u \in \mathcal{U}} f(u, x) + \sup_{u \in \mathcal{U}} g(u, x) \right). \quad (54) \]
To define the directional derivative of this map at some $f$ and $g$, we need to consider $\varepsilon$-maximizers (for any $\varepsilon \geq 0$) of these functions in $u$ for each fixed $x$, which for any $f \in \ell^\infty(\mathcal{U} \times \mathcal{X})$ is the set-valued map
\[ M_f(x, \varepsilon) = \left\{ u \in \mathcal{U} : f(u, x) \geq \sup_{u \in \mathcal{U}} f(u, x) - \varepsilon \right\}. \quad (55) \]
We reserve one special label for the collection of $\varepsilon$-maximizers of the outer maximization problem that defines $\theta$: for any $\varepsilon \geq 0$ let
\[ M_\theta(\varepsilon) = \left\{ x \in \mathcal{X} : \sup_{u \in \mathcal{U}} f(u, x) + \sup_{u \in \mathcal{U}} g(u, x) \geq \theta(f, g) - \varepsilon \right\}. \quad (56) \]
Lemma A.4 ahead discusses derivatives of $\theta$, a functional that imposes two levels of maximization with an intermediate addition step, and shows that this operator is directionally differentiable.
It is similar to the case of maximizing a bounded bivariate function, and its proof follows that of Theorem 2.1 of Cárcamo, Cuevas, and Rodríguez (2019), which dealt with directional differentiability of the supremum functional applied to a bounded function. The statement is for the sum of only two functions as arguments, but it is straightforward to extend it to any finite number of functions, as in Theorem 4.3.

Lemma A.4.
Let $\mathcal{U} \subseteq \mathbb{R}^m$ and $\mathcal{X} \subseteq \mathbb{R}^n$. Suppose that $f, g \in \ell^\infty(\mathcal{U} \times \mathcal{X})$, and let $\theta$ be the map defined in (54). Then $\theta$ is Hadamard directionally differentiable and its derivative at $(f, g)$ for any directions $(h, k) \in (\ell^\infty(\mathcal{U} \times \mathcal{X}))^2$ is
\[ \theta'_{f,g}(h, k) = \lim_{\varepsilon \searrow 0} \sup_{x \in M_\theta(\varepsilon)} \left( \sup_{u \in M_f(x, \varepsilon)} h(u, x) + \sup_{u \in M_g(x, \varepsilon)} k(u, x) \right). \quad (57) \]

The behavior of bootstrap tests under the null and under alternatives is most easily examined using distributions local to $P$. We consider sequences of distributions $P_n$ local to the null distribution $P$ such that, for a mean-zero, square-integrable function $\eta$, the $P_n$ have distribution functions $F_n$ (where $P$ has CDF $F$) that satisfy
\[ \lim_{n \to \infty} \int \left( \sqrt{n} \left( \sqrt{dF_n} - \sqrt{dF} \right) - \eta \sqrt{dF} \right)^2 = 0. \quad (58) \]
The behavior of the underlying empirical process under local alternatives satisfies Assumption 5 of Fang and Santos (2019) in a straightforward way (Wellner, 1992, Theorem 1).

Theorem A.5.
Make Assumptions A1-A2 and suppose that $F_A \succeq_{LASD} F_B$. Suppose that $\mathcal{X}$ is convex. Let $\hat{q}_{V^*_j}(1 - \alpha)$ and $\hat{q}_{W^*_j}(1 - \alpha)$ be the $(1 - \alpha)$th sample quantiles from the bootstrap distributions as described in the routines above. Then for $j = 1, 2$:

1. When $P \in \mathcal{P}_0$ and $\{P_n\}$ satisfy (58) and $T_j(F_n)(x) \leq 0$ for all $x \geq 0$,
\[ \limsup_{n \to \infty} P_n \left\{ V_{jn} > \hat{q}_{V^*_j}(1 - \alpha) \right\} \leq \alpha \quad \text{and} \quad \limsup_{n \to \infty} P_n \left\{ W_{jn} > \hat{q}_{W^*_j}(1 - \alpha) \right\} \leq \alpha. \]

2. When $P \in \mathcal{P}_0$ and $\{P_n\}$ satisfy (58) and $T_j(F_n)(x) \leq 0$ for all $x \geq 0$, and the distribution of $V$ or $W$ is increasing at its $(1 - \alpha)$th quantile,
\[ \lim_{n \to \infty} P_n \left\{ V_{jn} > \hat{q}_{V^*_j}(1 - \alpha) \right\} = \alpha \quad \text{and} \quad \lim_{n \to \infty} P_n \left\{ W_{jn} > \hat{q}_{W^*_j}(1 - \alpha) \right\} = \alpha. \]

Now we consider using the resampling routine outlined above to test the null hypothesis that $F_A \succeq_{LASD} F_B$ when the distributions are only partially identified. It is no longer possible to guarantee exact rejection probabilities because the test is based on a superset of $\mathcal{P}_0$, but we can still show that the test does not overreject.

Theorem A.6.
Make Assumptions B1-B2. Also assume that $\mathcal{X}$ is a convex set. Let $\hat{q}_{V^*}(1 - \alpha)$ and $\hat{q}_{W^*}(1 - \alpha)$ be the $(1 - \alpha)$th sample quantiles from the bootstrap distributions of $\{V^{*r}_n\}_{r=1}^R$ or $\{W^{*r}_n\}_{r=1}^R$ as described in the routine above. When the sequence of alternative distributions $P_n$ satisfies (58) and $T(F_n)(x) \leq 0$ for all $x \geq 0$,
\[ \limsup_{n \to \infty} P_n \left\{ V_n > \hat{q}_{V^*}(1 - \alpha) \right\} \leq \alpha \quad \text{and} \quad \limsup_{n \to \infty} P_n \left\{ W_n > \hat{q}_{W^*}(1 - \alpha) \right\} \leq \alpha. \]

Remark A.7 (A note on computing point-identified criterion functions). Standard empirical distribution functions are used to estimate the marginal distributions $F_A$ and $F_B$. However, the definitions of the $T_1$ and $T_2$ criterion functions contain $F_k(-x)$ terms, making the plug-in $T_j(F_n)$ left-continuous at some sample observations. Therefore some care must be taken when evaluating them, because there may be regions that are relevant for evaluation (i.e., the location of the supremum) that are not attained by any sample observation. This could be dealt with approximately by evaluating the functions on a grid. Instead, we evaluate the functions at all the points where they change value. For example, let $X_n$ denote the pooled sample (of size $n_A + n_B$) of $X_A$ and $X_B$ observations. Then we evaluate $T_j$ at the points $\tilde{X}_n = \{0\} \cup X^+_n \cup \{X_n - \epsilon\}^-$, where $X^+_n$ and $X^-_n$ refer to the positive- and negative-valued elements of the pooled sample $X_n$ and $\epsilon$ is a very small perturbation of each element of $X_n$, for example, the square root of the machine's double-precision accuracy. When evaluating the $L_2$ integrals from an observed sample, the domain can be set to $[0, \tilde{x}_{\max}]$, where $\tilde{x}_{\max}$ is the largest point in the evaluation set $\tilde{X}_n$, because the integrand is identically zero above that point.

B Proofs of results
B.1 Results in Section 2
Proof of Proposition 2.3.
Equation (1) implies that
\[ W(F) = \int_{\mathbb{R}_-} v(x)\, dF(x) + \int_{\mathbb{R}_+} v(x)\, dF(x). \quad (59) \]
For the first part of (59) note that
\[ \int_{\mathbb{R}_-} v(x)\, dF(x) = \lim_{R \to -\infty} \int_R^0 v(x)\, dF(x) = \lim_{R \to -\infty} \left[ v(x) F(x) \Big|_R^0 - \int_R^0 v'(x) F(x)\, dx \right] = -\int_{-\infty}^0 v'(x) F(x)\, dx, \]
using the normalization $v(0) = 0$ noted in Definition 2.2, the assumed bounded support of $F$ and integration by parts.

Similarly,
\[ \int_{\mathbb{R}_+} v(x)\, dF(x) = -\int_{\mathbb{R}_+} v(x)\, d(1 - F)(x) = -\lim_{R \to \infty} \int_0^R v(x)\, d(1 - F)(x) = -\lim_{R \to \infty} \left[ v(x)(1 - F(x)) \Big|_0^R - \int_0^R v'(x)(1 - F(x))\, dx \right] = \int_0^\infty v'(x)(1 - F(x))\, dx. \]
Putting these two parts together yields (2).
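The identity just derived can be verified numerically. The following sketch is our own illustration, using an assumed Uniform[-1, 1] distribution of changes and a simple kinked, loss-averse value function; both sides of the identity agree:

```python
import numpy as np

def integrate(f, a, b, n=200000):
    # Midpoint Riemann sum; accurate here because the integrands are
    # piecewise linear and the kink at 0 falls on a cell boundary.
    x = a + (np.arange(n) + 0.5) * (b - a) / n
    return float(np.sum(f(x)) * (b - a) / n)

# A loss-averse value function with v(0) = 0: losses count double.
v = lambda x: np.where(x < 0, 2.0 * x, 1.0 * x)
vprime = lambda x: np.where(x < 0, 2.0, 1.0)

# X ~ Uniform[-1, 1]: density 1/2 and CDF F(x) = (x + 1)/2 on the support.
F = lambda x: (x + 1.0) / 2.0

lhs = integrate(lambda x: v(x) * 0.5, -1.0, 1.0)  # E[v(X)] computed directly
rhs = -integrate(lambda x: vprime(x) * F(x), -1.0, 0.0) \
      + integrate(lambda x: vprime(x) * (1.0 - F(x)), 0.0, 1.0)

print(round(lhs, 6), round(rhs, 6))  # -0.25 -0.25
```

Note how the losses, weighted twice, pull the welfare level below the mean of zero, which is the effect the integration-by-parts representation makes explicit through the $v'$ weights.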
B.2 Proofs of results in Section 3
Proof of Theorem 3.1.
Notice that (3) is equivalent to both (4) and (5); in this proof we use the latter two conditions. Using Proposition 2.3 we rewrite $W(F_A) \geq W(F_B)$ as the equivalent condition
\[ -\int_{-\infty}^0 v'(z) F_A(z)\, dz + \int_0^\infty v'(z)(1 - F_A(z))\, dz \geq -\int_{-\infty}^0 v'(z) F_B(z)\, dz + \int_0^\infty v'(z)(1 - F_B(z))\, dz. \]
Rearranging terms, we find this is equivalent to
\[ \int_{-\infty}^0 v'(z) F_B(z)\, dz - \int_{-\infty}^0 v'(z) F_A(z)\, dz \geq \int_0^\infty v'(z)(1 - F_B(z))\, dz - \int_0^\infty v'(z)(1 - F_A(z))\, dz \]
or simply
\[ \int_{-\infty}^0 v'(z)(F_B(z) - F_A(z))\, dz \geq \int_0^\infty v'(z)(F_A(z) - F_B(z))\, dz. \]
This is in turn equivalent to
\[ \int_0^\infty v'(-z)(F_B(-z) - F_A(-z))\, dz \geq \int_0^\infty v'(z)(F_A(z) - F_B(z))\, dz \]
or
\[ -\int_0^\infty v'(-z)(F_A(-z) - F_B(-z))\, dz \geq \int_0^\infty v'(z)(F_A(z) - F_B(z))\, dz. \]
Adding $\int_0^\infty v'(z)(F_A(-z) - F_B(-z))\, dz$ to both sides, we find this is equivalent to
\[ \int_0^\infty (v'(z) - v'(-z))(F_A(-z) - F_B(-z))\, dz \geq \int_0^\infty v'(z)(F_A(z) - F_B(z) + F_A(-z) - F_B(-z))\, dz. \quad (60) \]
Utilizing the assumptions of loss aversion and non-decreasingness given in Definition 2.2, (4) and (5) are sufficient for (60) to hold for any $v$. Condition (5) is due to the fact that
\[ F_A(x) - F_B(x) + F_A(-x) - F_B(-x) \leq 0 \quad \forall x \geq 0 \]
is equivalent to the condition
\[ 1 - F_A(x) - F_A(-x) \geq 1 - F_B(x) - F_B(-x) \quad \forall x \geq 0. \]

We now show that conditions (4) and (5) are also necessary by means of a contradiction to (60). To this end, assume that there exists some $x$ such that $F_A(-x) - F_B(-x) > 0$. From the fact that the distribution function is right continuous, it follows that there is a neighbourhood $(a, b)$, $b > a \geq 0$, such that for all $x \in (a, b)$, $F_A(-x) - F_B(-x) > 0$. Consider the value function
\[ v(x) = \begin{cases} a - b & x \leq -b \\ x + a & x \in (-b, -a) \\ 0 & x \geq -a. \end{cases} \]
Note that this $v$ satisfies conditions 1-3 of Definition 2.2. Further, for $x \in (a, b)$, $v'(-x) = 1 > v'(x) = 0$. Therefore
\[ \int_0^\infty (v'(z) - v'(-z))(F_A(-z) - F_B(-z))\, dz < 0, \]
while
\[ \int_0^\infty v'(z)(F_A(z) - F_B(z) + F_A(-z) - F_B(-z))\, dz = 0, \]
because $v'(x) = 0$ for $x \geq 0$. This contradicts (60).

The second condition can be proven similarly. Assume that there exists a neighbourhood $(a, b)$, $0 \leq a < b$, such that for all $x \in (a, b)$, $(1 - F_A(x)) - F_A(-x) < (1 - F_B(x)) - F_B(-x)$. Take $v$ non-decreasing and such that $v'(x) = v'(-x)$, for example
\[ v(x) = \mathrm{sgn}(x) \times \begin{cases} 0 & |x| \in [0, a] \\ |x| - a & |x| \in (a, b) \\ b - a & |x| \in [b, \infty). \end{cases} \]
With this $v$ we find
\[ \int_0^\infty (v'(z) - v'(-z))(F_A(-z) - F_B(-z))\, dz = 0 \]
while
\[ \int_0^\infty v'(z)(F_A(z) - F_B(z) + F_A(-z) - F_B(-z))\, dz > 0, \]
which is a contradiction.

Proof of Corollary 3.2.
Use $v(x) = x$, which belongs to the class of functions used in the first part of Theorem 3.1 and in Definition 2.4.

Proof of Corollary 3.5.
We first notice that $F_A \succeq_{FOSD} F_{SQ}$ is equivalent to the event
\[ \{ F_A \text{ is supported on } \mathbb{R}_+ \}. \quad (61) \]
Property (61) easily implies that $F_A \succeq_{LASD} F_{SQ}$, which follows by Property 1 of Definition 2.2. On the other hand, one checks that
\[ v(x) := \begin{cases} x & x \leq 0 \\ 0 & x > 0 \end{cases} \]
fulfills Definition 2.2. Thus $F_A \succeq_{LASD} F_{SQ}$ implies (61).

Proof of Remark 3.6.
The social value function in this case is the following:
\[ \int_{\mathbb{R} \times [0, \infty)} v(x) + v(y)\, dF(x, y) = \int_0^\infty \int_{-\infty}^0 (v(x) + v(y)) f(x, y)\, dx\, dy + \int_0^\infty \int_0^\infty (v(x) + v(y)) f(x, y)\, dx\, dy. \]
Let us define $\tilde{f}_Y(x, y) = \int_{-\infty}^x f(z, y)\, dz$ and let $F_X$, $F_Y$ denote the marginals of $X$ and $Y$, respectively. Integrating by parts on the negative domain of $x$ we get
\[ \int_0^\infty \left[ (v(x) + v(y)) \tilde{f}_Y(x, y) \Big|_{-\infty}^0 - \int_{-\infty}^0 v'(x) \tilde{f}_Y(x, y)\, dx \right] dy. \]
Using that $v(x) = 0$ for $x = 0$ and that $\tilde{f}_Y(x, y) = 0$ for $x = -\infty$ we get
\[ \left[ \int_0^\infty v(y) \tilde{f}_Y(0, y)\, dy \right] - \int_{-\infty}^0 \left[ \int_0^\infty v'(x) \tilde{f}_Y(x, y)\, dy \right] dx. \]
Performing integration by parts again, and noticing that $v'(x)$ is independent of the integration variable in the second expression, we obtain
\[ \left[ v(y) F(0, y) \Big|_0^\infty - \int_0^\infty v'(y) F(0, y)\, dy \right] - \int_{-\infty}^0 \left[ v'(x) \left( F(x, y) \Big|_0^\infty \right) \right] dx = \left[ v(\infty) F_X(0) - \int_0^\infty v'(y) F(0, y)\, dy \right] - \int_{-\infty}^0 v'(x) F_X(x)\, dx. \]

We now turn to the positive domain of $x$, thus
\[ \int_0^\infty \left[ (v(x) + v(y)) \tilde{f}_Y(x, y) \Big|_0^\infty - \int_0^\infty v'(x) \tilde{f}_Y(x, y)\, dx \right] dy \]
and
\[ \left[ \int_0^\infty (v(\infty) + v(y)) f_Y(y) - v(y) \tilde{f}_Y(0, y)\, dy \right] - \int_0^\infty \left[ \int_0^\infty v'(x) \tilde{f}_Y(x, y)\, dy \right] dx. \]
Finally,
\[ \left[ (v(\infty) + v(y)) F_Y(y) - v(y) F(0, y) \right] \Big|_0^\infty - \left[ \int_0^\infty v'(y) F_Y(y) - v'(y) F(0, y)\, dy \right] - \int_0^\infty \left[ v'(x) \left( F(x, y) \Big|_0^\infty \right) \right] dx, \]
and evaluating the boundary terms, we obtain
\[ \left[ 2v(\infty) - v(\infty) F_X(0) - v(\infty) F_Y(0) \right] - \left[ \int_0^\infty v'(y) F_Y(y) - v'(y) F(0, y)\, dy \right] - \int_0^\infty \left[ v'(x)(F_X(x) - F(x, 0)) \right] dx. \]

Putting together the negative and the positive sides, we obtain
\[ \left[ v(\infty) F_X(0) - \int_0^\infty v'(y) F(0, y)\, dy \right] - \int_{-\infty}^0 v'(x) F_X(x)\, dx + \left[ 2v(\infty) - v(\infty) F_X(0) - v(\infty) F_Y(0) \right] - \left[ \int_0^\infty v'(y) F_Y(y) - v'(y) F(0, y)\, dy \right] - \int_0^\infty \left[ v'(x)(F_X(x) - F(x, 0)) \right] dx. \]
After simplifying, this expression becomes
\[ -\int_{-\infty}^0 v'(x) F_X(x)\, dx + \left[ 2v(\infty) - v(\infty) F_Y(0) \right] - \left[ \int_0^\infty v'(y) F_Y(y)\, dy \right] - \int_0^\infty \left[ v'(x)(F_X(x) - F(x, 0)) \right] dx. \]
Using the fact that $y \in [0, \infty)$, we get
\[ 2v(\infty) - \int_{-\infty}^0 v'(x) F_X(x)\, dx - \int_0^\infty v'(y) F_Y(y)\, dy - \int_0^\infty v'(x) F_X(x)\, dx \]
and
\[ v(0) + 2\int_0^\infty v'(x)\, dx - \int_{-\infty}^0 v'(x) F_X(x)\, dx - \int_0^\infty v'(y) F_Y(y)\, dy - \int_0^\infty v'(x) F_X(x)\, dx. \]
Using $v(0) = 0$ we have
\[ -\int_{-\infty}^0 v'(x) F_X(x)\, dx + \int_0^\infty v'(y)(1 - F_Y(y))\, dy + \int_0^\infty v'(x)(1 - F_X(x))\, dx. \]
The only change in comparison to Theorem 3.1 is the addition of the term $\int_0^\infty v'(y)(1 - F_Y(y))\, dy$, which is quite natural given that not only gains and losses but also incomes are considered. Applying the first part of the proof of Theorem 3.1, the comparison between distributions $F_A$ and $F_B$ comes down to the following inequality:
\[ \int_0^\infty (v'(x) - v'(-x))(F^X_A(-x) - F^X_B(-x))\, dx + \int_0^\infty v'(y) \left( F^Y_B(y) - F^Y_A(y) \right) dy \geq \int_0^\infty v'(x)(F^X_A(x) - F^X_B(x) + F^X_A(-x) - F^X_B(-x))\, dx. \]
In comparison to (60), this inequality additionally includes the comparison of $A$ and $B$ for incomes $y$. Since $v'(y) \geq 0$ (i.e., utility is increasing in income), assuming (4), (5) and additionally that $F^Y_B(y) - F^Y_A(y) \geq 0$ for all $y$ (that is, $A$ dominates $B$ for incomes according to FOSD) is enough to ensure that $A$ is better than $B$.

Proof of Theorem 3.7.
Given the bounds inequality, we have
\[ L_B(-x) - U_A(-x) \leq F_B(-x) - F_A(-x) \leq U_B(-x) - L_A(-x) \]
and
\[ L_A(x) - U_B(x) \leq F_A(x) - F_B(x) \leq U_A(x) - L_B(x), \]
from which it is clear that (8) is a sufficient condition. As a necessary condition we have (9), as otherwise we would have
\[ F_B(-x) - F_A(-x) \leq U_B(-x) - L_A(-x) \leq L_A(x) - U_B(x) \leq F_A(x) - F_B(x). \]

Proof of Corollary 3.8.
Recall that Corollary 3.5 implied that when $F_B$ is a status quo distribution, the FOSD and LASD relations are equivalent. Then $F_A \succeq_{FOSD} F_{SQ}$ implies that $F_A(-x) = 0$ for all $x \geq 0$, because $F_{SQ}(-x) = 0$ for all $x \geq 0$. Therefore a sufficient condition for $F_A \succeq_{LASD} F_{SQ}$ is that $U_A(-x) = 0$ for all $x \geq 0$. Similarly, if $F_A \succeq_{LASD} F_{SQ}$, which is equivalent to $F_A \succeq_{FOSD} F_{SQ}$, then it must be the case that $F_A(-x) = 0$ for all $x \geq 0$, implying that $L_A(-x) = 0$ as well.

B.3 Results in Section 4
Proof of Theorem 4.1.
For Part 1, note that if $P \in \mathcal{P}_0$ then by definition $\mathcal{X}_k(P) \neq \emptyset$ for some $k \in \{1, 2\}$ and, for all $x \in \mathcal{X}_k(P)$, $m_k(x) = 0$. Then the supremum is achieved and $\lim_{\varepsilon \searrow 0} M_k(\varepsilon) = \mathcal{X}_k(P)$ for at least one coordinate, so that suprema are taken over at least one of $\mathcal{X}_1(P)$ and $\mathcal{X}_2(P)$, and whichever coordinate satisfies this condition will contribute to the asymptotic distribution. Note that for all $x \in \mathcal{X}_k(P)$, $\sqrt{n} T_k(F_n)(x) = \sqrt{n}(T_k(F_n) - T_k(F))(x)$. Lemma A.3 and the null hypothesis, which implies $\mathcal{X}_k(P) \neq \emptyset$ for some $k \in \{1, 2\}$, imply the result for $V$ and $W$.

To show Part 2, note that $T$ is a linear map of $F$ and, assuming that $\mathcal{X}_k(P) \neq \emptyset$ for $k \in \{1, 2\}$, its weak limit (for whichever set is nonempty) is $\sup_{x \in \mathcal{X}_k(P)} (T_k(\mathbb{G}_F)(x))_+$ by Lemma A.3. Breaking $\mathcal{X}(P)$ into its two subsets and assuming the null hypothesis is true results in the same behavior as for the supremum-norm statistic from the first part (using the definition of the supremum norm in two coordinates as the maximum of the two suprema). The same reasoning holds for the $L_2$ statistic in Part 2.

Part 3 follows from the behavior of the test statistics over $\{x \in \mathcal{X} : m_1(x) < 0, m_2(x) < 0\}$ described in Lemma A.3. To show Part 4 for $V_n$, suppose that for some $x^*$, $T(F)(x^*) = \xi > 0$. Then $\sup_{x \in \mathcal{X}} \sqrt{n} T(F_n)(x) \geq \sqrt{n}(T(F_n)(x^*) - T(F)(x^*)) + \sqrt{n}\xi$. Then
\[ \liminf_{n \to \infty} P \left\{ \sup_{x \geq 0} \sqrt{n} T(F_n)(x) > c \right\} \geq \lim_{n \to \infty} P \left\{ \sqrt{n}(T(F_n)(x^*) - T(F)(x^*)) > c - \sqrt{n}\xi \right\} = 1, \]
where the last convergence follows from the delta method applied to $\sqrt{n}(F_n(x^*) - F(x^*))$, which converges in distribution to a tight random variable. The proof for the other statistics is analogous.

Proof of Theorem 4.2.
This theorem is an application of Theorem 3.2 of Fang and Santos (2019). Define the statistics $V$ and $W$ as maps from $F$ to the real line using $\nu$ and $\omega$ defined in equations (50) and (51) of Lemma A.3, and let their estimators be defined as in part 3 of the resampling scheme. Their Assumptions 1-3 are satisfied by the definitions of $\nu$ and $\omega$ and Lemma A.3, the standard convergence result $\sqrt{n}(F_n - F) \rightsquigarrow \mathbb{G}_F$ (van der Vaart and Wellner, 1996, Theorem 2.8.4) and the choice of bootstrap weights. We need to show that their Assumption 4 is also satisfied. Write either function as $\|h_1^+\| + \|h_1^+ \vee h_2^+\| + \|h_2^+\|$ using the desired norm. Both norms satisfy a reverse triangle inequality and, using the fact that $|(x)_+ - (y)_+| \leq |x - y|$, the difference for two functions $g$ and $h$ is bounded by $\|g_1 - h_1\| + \|g_1 \vee g_2 - h_1 \vee h_2\| + \|g_2 - h_2\|$. The first difference is bounded by $\|g - h\|$, and the second and the third are bounded by $\|g - h\|$. Rewriting equations (31) and (32) as functionals of differential directions $h$, define
\[ \hat{\nu}'_n(h) = \begin{cases} \left( \max_{x \in \hat{M}_{\hat{k}}(b_n)} h_{\hat{k}}(x) \right)_+ & |\max \hat{m}_{1n} - \max \hat{m}_{2n}| > c_n \\ \max \left\{ 0, \max_{x \in \hat{M}_1(b_n)} h_1(x), \max_{x \in \hat{M}_2(b_n)} h_2(x) \right\} & |\max \hat{m}_{1n} - \max \hat{m}_{2n}| \leq c_n \end{cases} \]
and
\[ \hat{\omega}'_n(h) = \left( \int_{\hat{\mathcal{X}}_1} \left( (h_1(x))_+ \right)^2 dx + \int_{\hat{\mathcal{X}}_2} \left( (h_2(x))_+ \right)^2 dx \right)^{1/2}. \quad (62) \]
Because both $\nu$ and $\omega$ are Lipschitz, Lemma S.3.6 of Fang and Santos (2019) implies we need only check that $|\hat{\nu}'_n(h) - \nu'_F(h)| = o_P(1)$ and $|\hat{\omega}'_n(h) - \omega'_F(h)| = o_P(1)$ for each fixed $h$.
This follows from the consistency of the contact set and ε-argmax estimators. The consistency of these estimators follows from the uniform law of large numbers for the ε-maximizing sets, and from the tightness of the limit G_F for the contact sets, which implies that lim_n P { √n ‖F_n − F‖_∞ ≤ √n a_n } = 1.

Proof of Theorem 4.3.
Consider V first. Note that V_n can be rewritten as

V_n = √n sup (T(G_n))_+ = √n max { 0, sup T(G_n) }.

Lemma A.4, extended to the four parts of the T process, and the condition that X_nec(P) ≠ ∅ imply each of the four inner results. The derivative of the positive-part map discussed in (46), with the hypothesis that P ∈ P_nec, which implies lim_{ε ↘ 0} M_nec(ε) = X_nec, and the chain rule imply the outer part of the derivative, and Theorem 2.1 of Fang and Santos (2019) implies the result. For W_n and W, the finite-sample integrand converges pointwise for each x ∈ X to the limit. By assumption there are no x such that the integrand is positive, which leaves the x in X_nec(P) as the nontrivial part of the integral. Because the limit is assumed square-integrable, dominated convergence, Lemma A.2 and Theorem 2.1 of Fang and Santos (2019) imply the result.

For Part 2, note that by hypothesis X_nec(P) = ∅ and there are no x that result in T(G)(x) > 0. Therefore Theorem 2.1 of Fang and Santos (2019), along with the chain rule, Lemmas A.4 and A.2 and the positive-part map, imply the result. The proof of Part 3 is the same as the analogous part of the proof of Theorem 4.1.

Proof of Theorem 4.4.
For both statistics, Assumptions 1–3 of Fang and Santos (2019) are trivially satisfied (van der Vaart and Wellner, 1996, Theorem 2.8.4) or satisfied by construction in the case of the bootstrap weights. Below we check that their Assumption 4 is also satisfied for both statistics, so that the statement of the theorem follows from their Theorem 3.2.

Consider V_n first, and write the supremum statistic as a function of underlying processes abstractly labeled g: the limiting variable relies (through the delta method) on a map of the form V = V(g) = (φ′_{θ(g)} ∘ θ′_g)(h), where g ∈ (ℓ^∞(R × X))^4, φ′_x is defined in (46) and θ′_g in (57) (extended to four functions as the arguments of the map). V_n uses the sample estimates of these functions. Under the null hypothesis θ(g) = 0, so that we may estimate φ̂′_n(x) = (x)_+, which is Lipschitz because |(x)_+ − (y)_+| ≤ |x − y|. Writing the formula for the estimate of the derivative of θ for just two functions f and g (since the estimator for four functions can be extended immediately from this case), we have, given sequences {b_n} and {d_n},

θ̂′(h, k) = max_{x ∈ M̂_θ} ( max_{u ∈ M̂_f(x)} h(u, x) + max_{u ∈ M̂_g(x)} k(u, x) ).
This map is Lipschitz in (h, k): given any (f, g) pair, suppressing the sets over which maxima are taken and their arguments, we have

|θ̂′(h_1, k_1) − θ̂′(h_2, k_2)|
= | max_{M̂_θ} ( max_{M̂_f} h_1 + max_{M̂_g} k_1 ) − max_{M̂_θ} ( max_{M̂_f} h_2 + max_{M̂_g} k_2 ) |
≤ max_{M̂_θ} | max_{M̂_f} h_1 + max_{M̂_g} k_1 − max_{M̂_f} h_2 − max_{M̂_g} k_2 |
≤ max_{M̂_θ} max_{M̂_f} |h_1 − h_2| + max_{M̂_θ} max_{M̂_g} |k_1 − k_2|
≤ 2 max { ‖h_1 − h_2‖_∞, ‖k_1 − k_2‖_∞ } = 2 ‖(h_1, k_1) − (h_2, k_2)‖_∞.

Because all the maps in the chain that defines V_n are Lipschitz, V_n is itself Lipschitz, and therefore Lemma S.3.6 of Fang and Santos (2019) implies that their Assumption 4 holds if ‖(φ̂′_{θ(g)} ∘ θ̂′_g)(h) − (φ′_{θ(g)} ∘ θ′_g)(h)‖ = o_P(1) (where the arguments g and h are again elements of (ℓ^∞(R × X))^4). This follows from the consistency of the ε-maximizer estimates.

Next consider W_n. For this part simplify the inner part to the sum of two functions, f and g, since the result is a simple generalization. Write W_n = W_n(h, k) = (λ̂′_{μ(f,g)} ∘ μ̂′_{f,g})(h, k), where the marginal (in u) maximization map μ is defined for each x ≥ 0 by μ(f, g)(x) = sup_U f(u, x) + sup_U g(u, x), and for each x ≥ 0,

μ̂′_{f,g}(h, k)(x) = max_{u ∈ M̂_f(x)} h(u, x) + max_{u ∈ M̂_g(x)} k(u, x)

(define M̂_f(x) and M̂_g(x) as in (41)).
First,

‖μ̂′(h_1, k_1) − μ̂′(h_2, k_2)‖_∞ = sup_X | max_{M̂_f(x)} h_1 + max_{M̂_g(x)} k_1 − max_{M̂_f(x)} h_2 − max_{M̂_g(x)} k_2 |
≤ ‖h_1 − h_2‖_∞ + ‖k_1 − k_2‖_∞ ≤ 2 ‖(h_1, k_1) − (h_2, k_2)‖_∞.

Second, for square integrable f and h consider the estimate, assuming P ∈ P_nec, λ̂′(h) = λ(h|_{X̂_0}), where f|_A denotes the restriction of the function f to the set A. On X̂_0 the subadditivity of the norm trivially implies that λ̂′ is Lipschitz there. This implies that λ̂′ is a Lipschitz map, and in turn that λ̂′_{μ(f,g)} ∘ μ̂′_{f,g} is Lipschitz.

Finally, μ̂′_{f,g}(h, k)(x) converges for each x to the pointwise limit

μ′_{f,g}(h, k)(x) = lim_{ε ↘ 0} ( sup_{u ∈ M_f(x,ε)} h(u, x) + sup_{u ∈ M_g(x,ε)} k(u, x) ).

The set estimators X̂_0 and X̂_+ are consistent estimators for X_0 and X_+ using the same argument as above for the supremum norm. Then for square integrable h and k, the dominated convergence theorem implies that for any given f, g,

|(λ̂′_{μ(f,g)} ∘ μ̂′_{f,g})(h, k) − (λ′_{μ(f,g)} ∘ μ′_{f,g})(h, k)| = o_P(1),

and Lemma S.3.6 of Fang and Santos (2019) implies the result.

B.4 Results in Appendix A
Proof of Lemma A.2.
Let {t_n} be a sequence of positive numbers such that t_n ↘ 0 as n → ∞, and let {h_n} ∈ (ℓ^∞(X))^k be a sequence of bounded, p-integrable functions such that h_n → h ∈ (ℓ^∞(X))^k as n → ∞.

Suppose that for all i and all x ∈ X, f_i(x) < 0, or in other words, I_+ = I_0 = ∅. For any point x there exists some N such that for all n > N, (f_i + t_n h_{ni})_+ = 0 because t_n ↘ 0 and h_i
is bounded. Then dominated convergence implies that the p-th power of the L_p norm satisfies

lim_{n → ∞} t_n^{-1} ( Σ_{i=1}^k ∫_{X_{i−}} ((f_i(x) + t_n h_{ni}(x))_+)^p w_i(x) dx − Σ_{i=1}^k ∫_{X_{i−}} ((f_i(x))_+)^p w_i(x) dx ) = 0.

This is also the result for λ in this case, which is the difference of these terms each raised to the power 1/p.

Next suppose I_0 ≠ ∅ and I_+ = ∅, that is, for some i, X_{i0} has positive measure but the measure of the x that make any coordinate of f positive is zero. Then calculate the differences directly:

lim_{n → ∞} t_n^{-1} { ( Σ_{i=1}^k ∫_{X_{i0}} ((f_i(x) + t_n h_{ni}(x))_+)^p w_i(x) dx )^{1/p} − ( Σ_{i=1}^k ∫_{X_{i0}} ((f_i(x))_+)^p w_i(x) dx )^{1/p} }
= lim_{n → ∞} t_n^{-1} ( t_n^p Σ_{i=1}^k ∫_{X_{i0}} ((h_{ni}(x))_+)^p w_i(x) dx )^{1/p}
= ( Σ_{i=1}^k ∫_{X_{i0}} ((h_i(x))_+)^p w_i(x) dx )^{1/p},

using dominated convergence and the p-integrability of h. If the subregions {x : f_i(x) < 0} have positive measure, they contribute 0 to the limit.

Now suppose that I_+ is not empty, that is, there is at least one i such that X_{i+} has positive measure. Then for each x ∈ X_{i+} there exists an N such that for n > N, f_i(x) + t_n h_{ni}(x) > 0 for all i. Then for n > N, for this x,

(f_i(x) + t_n h_{ni}(x))^p − f_i^p(x) = Σ_{j=0}^p (p choose j) f_i^j(x) (t_n h_{ni}(x))^{p−j} − f_i^p(x)
= f_i^p(x) + p t_n f_i^{p−1}(x) h_{ni}(x) + O(t_n^2) − f_i^p(x)
= p t_n f_i^{p−1}(x) h_{ni}(x) + O(t_n^2).
This implies that for n large enough, the inner integral, using the calculations from the previous parts to account for the sets where f_i is zero or negative, satisfies

lim_{n → ∞} t_n^{-1} { Σ_{i=1}^k ∫_X ((f_i(x) + t_n h_{ni}(x))_+)^p w_i(x) dx − Σ_{i=1}^k ∫_X ((f_i(x))_+)^p w_i(x) dx }
= lim_{n → ∞} t_n^{-1} { p t_n Σ_{i=1}^k ∫_{X_{i+}} f_i^{p−1}(x) h_{ni}(x) w_i(x) dx + O(t_n^2) + O(t_n^p) + 0 }
= p Σ_{i=1}^k ∫_{X_{i+}} f_i^{p−1}(x) h_i(x) w_i(x) dx.

Using the expansion (x + t h_t)^{1/p} = x^{1/p} + (1/p) x^{(1−p)/p} t h_t + o(|t h_t|) as t ↘ 0, it can be seen that the Hadamard derivative of x ↦ x^{1/p} is (1/p) x^{(1−p)/p} h. Therefore the chain rule and integrability of f and h imply that the derivative is

λ(f)^{1−p} Σ_{i=1}^k ∫_{X_{i+}} f_i^{p−1}(x) h_i(x) w_i(x) dx.

Proof of Lemma A.3.
For ν write

ν(f) = sup_{x ∈ X} ((f(x))_+ + f(−x))_+ = sup_{x ∈ X} max { 0, (f(x))_+ + f(−x) } = sup_{x ∈ X} max { 0, max { f(−x), f(x) + f(−x) } },

and using the definitions of f_1 and f_2 made in the statement of the lemma and changing the order in which the maxima are computed,

= max { 0, max { sup_{x ∈ X} f_1(x), sup_{x ∈ X} f_2(x) } } = (φ ∘ ψ)(σ(f_1), σ(f_2)).

Then using the chain rule (Shapiro, 1990) the derivative is that given in the statement of the lemma. For ω, assume f and h are square integrable and write

ω(f) = λ((f(x))_+ + f(−x)) = λ(max { f(−x), f(x) + f(−x) }) = (λ ∘ ψ)(f_1, f_2).

Proof of Theorem A.5.
This is an application of Corollary 3.2 in Fang and Santos (2019), and we only sketch the most important details of the proof. After applying the null hypothesis, the derivatives ν′_F and ω′_F shown in (52) and (53) are both convex. For example, in the expression for ν′_F,

( sup_{X_1(P)} (α h_A + (1 − α) h_B) )_+ ≤ α ( sup_{X_1(P)} h_A )_+ + (1 − α) ( sup_{X_1(P)} h_B )_+,

and similar calculations hold for the other two terms. In the case of ω′_F, for example,

∫_{X_1(P)} ((α h_A + (1 − α) h_B)_+)^2 ≤ α ∫_{X_1(P)} ((h_A)_+)^2 + (1 − α) ∫_{X_1(P)} ((h_B)_+)^2,

where the inequality relies on the nonnegativity of the innermost term and the convexity of x ↦ x^2 for x ≥ 0. Then Theorem 3.3 of Fang and Santos (2019) applies. The second part of the theorem is a special case of the first, when the part of the relationship that leads to nondegenerate behavior is not empty.

Proof of Lemma A.4.
First, let s_n = t_n^{-1} and define the finite differences

Δ_n = sup_X ( sup_U (s_n f + h)(u, x) + sup_U (s_n g + k)(u, x) ) − s_n θ(f, g),  (63)

so that for any s_n ↗ ∞, we need to show that Δ_n → θ′_{f,g}(h, k) defined in the statement of the theorem.

Fix an ε > 0. Then for any x ∉ M_θ(ε), note that

sup_U (s_n f + h)(u, x) + sup_U (s_n g + k)(u, x) − s_n θ(f, g) ≤ sup h + sup k − s_n ε.  (64)

Similarly, if u ∉ M_f(x, ε) for any x (the case for u that do not nearly-optimize g(·, x) is symmetric), then also

(s_n f + h)(u, x) + sup_U (s_n g + k)(u, x) − s_n θ(f, g) ≤ sup h + sup k − s_n ε  (65)

for that x. Therefore for any ε > 0,

lim sup_n Δ_n = lim sup_n ( sup_{M_θ(ε)} ( sup_{M_f(x,ε)} (s_n f + h)(u, x) + sup_{M_g(x,ε)} (s_n g + k)(u, x) ) − s_n θ(f, g) )
≤ lim sup_n ( s_n sup_{M_θ(ε)} ( sup_{M_f(x,ε)} f(u, x) + sup_{M_g(x,ε)} g(u, x) ) − s_n θ(f, g) + sup_{M_θ(ε)} ( sup_{M_f(x,ε)} h(u, x) + sup_{M_g(x,ε)} k(u, x) ) )
= sup_{M_θ(ε)} ( sup_{M_f(x,ε)} h(u, x) + sup_{M_g(x,ε)} k(u, x) ),  (66)

so that this inequality holds as ε ↘ 0.

Next, for any ε > 0 define

t̄(ε) = sup_{M_θ(ε)} ( sup_{M_f(x,ε)} h(u, x) + sup_{M_g(x,ε)} k(u, x) ).  (67)

Because this function is nondecreasing in ε, it has a limit as ε ↘ 0, so that for any m ∈ N there exist an x_m ∈ M_θ(1/m) and (u_{fm}, u_{gm}) satisfying the inequality

h(u_{fm}, x_m) + k(u_{gm}, x_m) ≥ t̄(1/m) − 1/m.
Therefore

t̄(1/m) ≤ h(u_{fm}, x_m) + k(u_{gm}, x_m) + 1/m
= s_n f(u_{fm}, x_m) + h(u_{fm}, x_m) + s_n g(u_{gm}, x_m) + k(u_{gm}, x_m) + 1/m − s_n (f(u_{fm}, x_m) + g(u_{gm}, x_m))
≤ sup_X ( sup_U (s_n f + h)(u, x) + sup_U (s_n g + k)(u, x) ) − s_n θ(f, g) + (s_n + 1)/m,  (68)

which implies that

lim_{ε ↘ 0} sup_{M_θ(ε)} ( sup_{M_f(x,ε)} h(u, x) + sup_{M_g(x,ε)} k(u, x) ) = lim_{m → ∞} t̄(1/m) ≤ Δ_n.  (69)

Proof of Theorem A.6. Start by considering V. As in the proof of Theorem 4.2, we simplify the analysis by writing this statistic as a composition of maps that act on just two functional arguments, (φ′_{θ(f,g)} ∘ θ′_{f,g})(h, k), where the positive-part map φ′_x is defined in (46) and θ′_{f,g} is, for any h, k ∈ ℓ^∞(U × X),

θ′_{f,g}(h, k) = lim_{ε ↘ 0} sup_{x ∈ M_θ(ε)} ( lim_{ε ↘ 0} sup_{u ∈ M_f(x,ε)} h(u, x) + lim_{ε ↘ 0} sup_{u ∈ M_g(x,ε)} k(u, x) ),

where M_f(x, ε) and M_θ(ε) are defined in (55) and (56).

It can be verified that for a fixed value of θ(f, g), φ̂′_{θ(f,g)}(x) is convex and nondecreasing. Next consider θ′_{f,g}.
For any ε > 0, consider the map applied to the convex combination of vector-valued functions α(h_1, k_1) + (1 − α)(h_2, k_2):

sup_{M_θ(ε)} ( sup_{M_f(x,ε)} (α h_1 + (1 − α) h_2)(u, x) + sup_{M_g(x,ε)} (α k_1 + (1 − α) k_2)(u, x) )
≤ sup_{M_θ(ε)} ( α ( sup_{M_f(x,ε)} h_1(u, x) + sup_{M_g(x,ε)} k_1(u, x) ) + (1 − α) ( sup_{M_f(x,ε)} h_2(u, x) + sup_{M_g(x,ε)} k_2(u, x) ) )
≤ α sup_{M_θ(ε)} ( sup_{M_f(x,ε)} h_1(u, x) + sup_{M_g(x,ε)} k_1(u, x) ) + (1 − α) sup_{M_θ(ε)} ( sup_{M_f(x,ε)} h_2(u, x) + sup_{M_g(x,ε)} k_2(u, x) ).

Therefore, letting ε ↘ 0, it can be seen that θ′_{f,g} is convex. Because V is the composition of a non-decreasing, convex function with a convex function, V is also a convex map of (h, k) to R (Boyd and Vandenberghe, 2004, eq. 3.11). As mentioned in the text, P ⊆ P_nec. Therefore Corollary 3.2 of Fang and Santos (2019) implies

lim sup_{n → ∞} P_n { V_n > q_{V*}(1 − α) } ≤ α.

Turn next to W. Similarly, write this statistic as a map of pairs of bounded functions to the real line as W_n = (λ′_{μ(f,g)} ∘ μ′_{f,g})(h, k), where for each x ∈ X,

μ(f, g)(x) = sup_U f(u, x) + sup_U g(u, x)

and

μ′_{f,g}(h, k)(x) = lim_{ε ↘ 0} max_{u ∈ M_f(x,ε)} h(u, x) + lim_{ε ↘ 0} max_{u ∈ M_g(x,ε)} k(u, x),

and for any functions f, h ∈ ℓ^∞(X), λ′_f(h) is defined in (49). We show the convexity of this composition directly.
Abbreviate μ(x) = μ(f, g)(x), and for fixed ε > 0 write

μ′_1(x) = sup_{u ∈ M_f(x,ε)} h_1(u, x) + sup_{u ∈ M_g(x,ε)} k_1(u, x),
μ′_2(x) = sup_{u ∈ M_f(x,ε)} h_2(u, x) + sup_{u ∈ M_g(x,ε)} k_2(u, x),
μ̄′(x) = sup_{u ∈ M_f(x,ε)} (α h_1 + (1 − α) h_2)(u, x) + sup_{u ∈ M_g(x,ε)} (α k_1 + (1 − α) k_2)(u, x).

Finally, let X_0 denote the region where μ(x) = 0. Then Lemma A.2 shows that λ′_μ(μ̄′) = λ(μ̄′|_{X_0}), where μ̄′|_{X_0} denotes the restriction of the function μ̄′ to the set X_0. Consider the first term on the right hand side. Inside the integral, it can be seen that

0 ≤ (μ̄′(x))_+ = ( sup_{u ∈ M_f(x,ε)} (α h_1 + (1 − α) h_2)(u, x) + sup_{u ∈ M_g(x,ε)} (α k_1 + (1 − α) k_2)(u, x) )_+
≤ ( α ( sup_{u ∈ M_f(x,ε)} h_1(u, x) + sup_{u ∈ M_g(x,ε)} k_1(u, x) ) + (1 − α) ( sup_{u ∈ M_f(x,ε)} h_2(u, x) + sup_{u ∈ M_g(x,ε)} k_2(u, x) ) )_+
= ( α μ′_1(x) + (1 − α) μ′_2(x) )_+ ≤ α (μ′_1(x))_+ + (1 − α) (μ′_2(x))_+.

Because the integrand is nonnegative, subadditivity of the L_2 norm implies

λ(μ̄′|_{X_0}) ≤ α λ(μ′_1|_{X_0}) + (1 − α) λ(μ′_2|_{X_0}).

This inequality holds as ε ↘ 0 by the assumed square-integrability of the arguments. Therefore Corollary 3.2 of Fang and Santos (2019) implies

lim sup_{n → ∞} P_n { W_n > q_{W*}(1 − α) } ≤ α.

References

Aaberge, R., T. Havnes, and
M. Mogstad (2018): “Ranking Intersecting Distribution Functions,” Journal of Applied Econometrics, forthcoming.

Alesina, A., and F. Passarelli (2019): “Loss Aversion, Politics and Redistribution,” American Journal of Political Science, 63, 936–947.

Andrews, D. W., and X. Shi (2013): “Inference Based on Conditional Moment Inequalities,” Econometrica, 81, 609–666.

Atkinson, A. B. (1970): “On the Measurement of Inequality,” Journal of Economic Theory, 2, 244–263.

Barrett, G. F., and S. G. Donald (2003): “Consistent Tests for Stochastic Dominance,” Econometrica, 71, 71–104.

Bhattacharya, D., and P. Dupas (2012): “Inferring Welfare Maximizing Treatment Assignment under Budget Constraints,” Journal of Econometrics, 167, 168–196.

Bitler, M. P., J. B. Gelbach, and H. W. Hoynes (2006): “What Mean Impacts Miss: Distributional Effects of Welfare Reform Experiments,” American Economic Review, 96, 988–1012.

Boyd, S., and L. Vandenberghe (2004): Convex Optimization. Cambridge University Press, Cambridge.

Cárcamo, J., A. Cuevas, and L.-A. Rodríguez (2019): “Directional Differentiability for Supremum-Type Functionals: Statistical Applications,” arXiv e-prints, arXiv:1902.01136.

Carneiro, P., K. T. Hansen, and J. J. Heckman (2001): “Removing the Veil of Ignorance in Assessing the Distributional Impacts of Social Policies,” Swedish Economic Policy Review, 8, 273–301.

Cattaneo, M. D., M. Jansson, and K. Nagasawa (2017): “Bootstrap-Based Inference for Cube Root Consistent Estimators,” Working paper.

Chetverikov, D., A. Santos, and A. M. Shaikh (2018): “The Econometrics of Shape Restrictions,” Annual Review of Economics, 10, 31–63.

Chew, S. H. (1983): “A Generalization of the Quasilinear Mean with Applications to the Measurement of Income Inequality and Decision Theory Resolving the Allais Paradox,” Econometrica, 51, 1065–1092.

Cho, J. S., and H. White (2018): “Directionally Differentiable Econometric Models,” Econometric Theory, 34, 1101–1131.

Christensen, T., and B. Connault (2019): “Counterfactual Sensitivity and Robustness,” Working paper.

Dehejia, R. (2005): “Program Evaluation as a Decision Problem,” Journal of Econometrics, 125, 141–173.

Dümbgen, L. (1993): “On Nondifferentiable Functions and the Bootstrap,” Probability Theory and Related Fields, 95, 125–140.

Eeckhoudt, L., and H. Schlesinger (2006): “Putting Risk in Its Proper Place,” American Economic Review, 96, 280–289.

Fang, Z., and A. Santos (2019): “Inference on Directionally Differentiable Functions,” Review of Economic Studies, 86, 377–412.

Fishburn, P. C. (1980): “Continua of Stochastic Dominance Relations for Unbounded Probability Distributions,” Journal of Mathematical Economics, 7, 271–285.

Frank, M. J., R. B. Nelsen, and B. Schweizer (1987): “Best-Possible Bounds for the Distribution of a Sum — A Problem of Kolmogorov,” Probability Theory and Related Fields, 74, 199–211.

Freund, C., and Ç. Özden (2008): “Trade Policy and Loss Aversion,” American Economic Review, 98, 1675–1691.

Gajdos, T., and J. A. Weymark (2012): “Introduction to Inequality and Risk,” Journal of Economic Theory, 147, 1313–1330.

Heckman, J. J., J. Smith, and N. Clements (1997): “Making the Most Out of Programme Evaluations and Social Experiments: Accounting for Heterogeneity in Programme Impacts,” Review of Economic Studies, 64, 487–535.

Heckman, J. J., and J. A. Smith (1998): “Evaluating the Welfare State,” in Econometrics and Economic Theory in the Twentieth Century: The Ragnar Frisch Centennial Symposium, ed. by S. Strom. Cambridge University Press, New York.

Hirano, K., and J. Porter (2009): “Asymptotics for Statistical Treatment Rules,” Econometrica, 77, 1683–1701.

Hong, H., and J. Li (2018): “The Numerical Delta Method,” Journal of Econometrics, 206, 379–394.

Kahneman, D., and A. Tversky (1979): “Prospect Theory: An Analysis of Decision Under Risk,” Econometrica, 47, 263–292.

Kasy, M. (2016): “Partial Identification, Distributional Preferences, and the Welfare Ranking of Policies,” Review of Economics and Statistics, 98, 111–131.

Kőszegi, B., and M. Rabin (2006): “A Model of Reference-Dependent Preferences,” Quarterly Journal of Economics, 121, 1133–1165.

Kitagawa, T., and A. Tetenov (2018): “Who Should Be Treated? Empirical Welfare Maximization Methods for Treatment Choice,” Econometrica, 86, 591–616.

(2019): “Equality-Minded Treatment Choice,” Journal of Business and Economic Statistics, forthcoming.

Levy, H. (2016): Stochastic Dominance: Investment Decision Making Under Uncertainty, 3rd edition. Springer International Publishing, Switzerland.

Linton, O., E. Maasoumi, and Y.-J. Whang (2005): “Consistent Testing for Stochastic Dominance Under General Sampling Schemes,” Review of Economic Studies, 72, 735–765.

Linton, O., K. Song, and Y.-J. Whang (2010): “An Improved Bootstrap Test of Stochastic Dominance,” Journal of Econometrics, 154, 186–202.

Makarov, G. (1982): “Estimates for the Distribution Function of a Sum of Two Random Variables when the Marginal Distributions are Fixed,” Theory of Probability and its Applications, 26(4), 803–806.

Manski, C. F. (2004): “Statistical Treatment Rules for Heterogeneous Populations,” Econometrica, 72, 1221–1246.

Masten, M. A., and A. Poirier (2020): “Inference on Breakdown Frontiers,” Quantitative Economics, 11, 41–111.

Rabin, M., and R. H. Thaler (2001): “Anomalies: Risk Aversion,” Journal of Economic Perspectives, 15, 219–232.

Rick, S. (2011): “Losses, Gains, and Brains: Neuroeconomics Can Help to Answer Open Questions about Loss Aversion,” Journal of Consumer Psychology, 21, 453–463.

Roemer, J. E. (1998): Theories of Distributive Justice. Harvard University Press, Cambridge.

Rüschendorf, L. (1982): “Random Variables with Maximum Sums,” Advances in Applied Probability, 14, 623–632.

Samuelson, W., and R. Zeckhauser (1988): “Status Quo Bias in Decision Making,” Journal of Risk and Uncertainty, 1, 7–59.

Sen, A. K. (2000): Freedom, Rationality and Social Choice: The Arrow Lectures and Other Essays. Oxford University Press, Oxford.

Shaked, M., and G. J. Shanthikumar (1994): Stochastic Orders and Their Applications. Academic Press, San Diego, CA.

Shapiro, A. (1990): “On Concepts of Directional Differentiability,” Journal of Optimization Theory and Applications, 66, 477–487.

Stoye, J. (2009): “Minimax Regret Treatment Choice With Finite Samples,” Journal of Econometrics, 151, 70–81.

Tetenov, A. (2012): “Statistical Treatment Choice Based on Asymmetric Minimax Regret Criteria,” Journal of Econometrics, 166, 157–165.

Tversky, A., and D. Kahneman (1991): “Loss Aversion in Riskless Choice: A Reference-Dependent Model,” Quarterly Journal of Economics, 106, 1039–1061.

(1992): “Advances in Prospect Theory: Cumulative Representation of Uncertainty,” Journal of Risk and Uncertainty, 5, 297–323.

van der Vaart, A. W. (1998): Asymptotic Statistics. Cambridge University Press, Cambridge.

van der Vaart, A. W., and J. A. Wellner (1996): Weak Convergence and Empirical Processes. Springer, New York.

Wellner, J. A. (1992): “Empirical Processes in Action: A Review,” International Statistical Review, 60, 247–269.

Weymark, J. A. (1981): “Generalized Gini Inequality Indices,” Mathematical Social Sciences, 1, 409–430.

Williamson, R. C., and T. Downs (1990): “Probabilistic Arithmetic I. Numerical Methods for Calculating Convolutions and Dependency Bounds,” International Journal of Approximate Reasoning, 4, 89–158.

Yaari, M. E. (1987): “The Dual Theory of Choice Under Risk,” Econometrica, 55, 95–115.

(1988): “A Controversial Proposal Concerning Inequality Measurement,” Journal of Economic Theory, 44, 381–397.

Supplemental appendix to “Loss aversion and the welfare ranking of policy interventions”
This supplemental appendix contains numerical Monte Carlo simulations studying the empirical size and power of the statistical methods proposed in the main text, as well as additional results for the empirical application in Section 5 of the main text.
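Several of the results in the appendices above turn on the fact that maps such as the positive part x ↦ (x)_+ are directionally, but not fully (Hadamard), differentiable at the boundary of the null region. The following minimal numerical sketch of this point is ours; the helper `dd` and its names do not come from the paper:

```python
def dd(phi, x, h, t=1e-8):
    """One-sided difference quotient approximating the directional
    derivative of phi at x in direction h."""
    return (phi(x + t * h) - phi(x)) / t

# The positive-part map is the prototypical directionally
# differentiable functional appearing throughout the proofs.
pos = lambda v: max(v, 0.0)

# At x = 0 the directional derivative is h -> (h)_+,
# which is not linear in h:
d_plus = dd(pos, 0.0, 1.0)    # derivative in direction +1 is 1
d_minus = dd(pos, 0.0, -1.0)  # derivative in direction -1 is 0
```

Because (h)_+ is nonlinear in h at the kink, the standard bootstrap is invalid there, which is why the resampling scheme estimates the directional derivative directly (Fang and Santos, 2019).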
C Monte Carlo simulations
In this section, we compare the finite-sample performance of the tests of the LASD null hypothesis proposed in the main text. We describe the results of simulation experiments used to investigate the size and power properties of those tests. There are three simulation settings: a normal location model and a triangular model under point identification, and a normal location model under partial identification.
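As a rough illustration of the simulation design, the sketch below implements the tuning sequences described in Section C.1 together with a generic one-sided sup-norm bootstrap test. All function names are ours; the statistic is a simplified first-order-dominance stand-in for the paper's T-based LASD statistics, and the recentred bootstrap shown here is the naive version that the directional-derivative resampling in the main text refines:

```python
import numpy as np

def tuning(n):
    """Tuning sequences from Section C.1: contact-set tolerance a_n and
    epsilon-maximizer tolerance b_n (c_n is taken equal to b_n)."""
    a_n = 4.0 * np.log(np.log(n)) / np.sqrt(n)
    b_n = np.sqrt(np.log(np.log(n)) / n)
    return a_n, b_n, b_n

def ecdf(sample, grid):
    """Empirical CDF of `sample` evaluated on `grid`."""
    return np.searchsorted(np.sort(sample), grid, side="right") / len(sample)

def sup_test(a, b, n_boot=499, alpha=0.05, seed=0):
    """Naive one-sided KS-type bootstrap test of H0: F_A(x) <= F_B(x)
    for all x (A first-order dominates B); reject for large values."""
    rng = np.random.default_rng(seed)
    grid = np.sort(np.concatenate([a, b]))
    n = min(len(a), len(b))
    diff = ecdf(a, grid) - ecdf(b, grid)
    stat = np.sqrt(n) * diff.max()
    boot = np.empty(n_boot)
    for i in range(n_boot):
        a_s = rng.choice(a, size=len(a), replace=True)
        b_s = rng.choice(b, size=len(b), replace=True)
        # recentre at the sample difference: resample the process
        boot[i] = np.sqrt(n) * ((ecdf(a_s, grid) - ecdf(b_s, grid)) - diff).max()
    crit = np.quantile(boot, 1.0 - alpha)
    return stat, crit, bool(stat > crit)
```

With a_n and b_n in hand, plug-in contact sets {x : |m̂(x)| ≤ a_n} and ε-maximizer sets {x : m̂(x) > max m̂ − b_n} can be formed on the same grid to estimate the directional derivative instead of recentring.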
C.1 Normal model, identified case
In this experiment there are two independent Gaussian random variables that represent point-identified outcomes. The scale of both distributions is set to unity, the location of distribution A is set to zero, and the location of distribution B is allowed to vary. Letting µ_B denote the location of distribution B, tests should not reject the null H_0 : F_A ⪰_LASD F_B when µ_B ≤ 0 and should reject the null when µ_B > 0. This is a case where P is a singleton, which occurs when µ_B = 0.

We select constant sequences in the following way. Let n = n_A + n_B. The estimated contact sets X̂_k = {x ∈ X : |m̂_kn(x)| ≤ a_n} worked well using a_n = 4 log(log(n))/√n. For estimated ε-maximizer sets M̂_k = {x ∈ X : m̂_kn(x) > sup_x m̂_kn(x) − b_n} we used b_n = √(log(log(n))/n). For deciding which coordinate appeared significantly larger than the other, or whether both coordinates reached approximately the same supremum, that is, when estimating |max m̂_1n(x) − max m̂_2n(x)| ≤ c_n, we used the same constant sequence as b_n, that is, c_n = √(log(log(n))/n). These sequences were chosen after preliminary simulations with the normal model, and were used in the other two simulations as well (with n = n_0 + n_A + n_B in the partially-identified setting).

The size and power of the tests is good in this example, as can be seen in Figure 3. The mean of distribution B ranged over multiples of 1/√n on either side of zero, so the alternatives are local to the boundary of the null region. Sample sizes were identical for both samples and set equal to 100, 500 or 1,000. When resampling, the number of bootstrap repetitions was set equal to 499 (for samples of size 100), 999 (for samples of size 500) or 1,999 (for samples of size 1,000). Figure 3 plots empirical rejection probabilities from 1,000 simulation runs.

From Figure 3 it can be seen that the empirical rejection probabilities are relatively close
to the nominal 5% rejection probability at the boundary of the null region when µ_B = 0. The behavior of the supremum norm tests was identical, so only V_1n test results are shown. The W_1n and W_2n results are close, and the differences are due to numerical integration that occurs over one or two dimensions depending on the statistic.

Figure 3: Empirical rejection probabilities of the LASD tests in the point-identified normal location model experiment. The tests are of nominal 5% size, should have exactly 5% rejection probability when µ_B = 0, and should reject when µ_B > 0. V_1n and V_2n tests have identical behavior, so only V_1n results are shown. Samples of sizes 100, 500 and 1000 correspond respectively to 499, 999 and 1999 bootstrap repetitions. Distributions are local to the boundary of the null region, which is where µ_B = 0. 1000 simulation repetitions.

C.2 Triangular model, identified case
In this experiment we use two independent triangular random variables, where we let θ = (α, β, γ) denote the lower endpoint of the support, the mode of the distribution and the upper endpoint of the support. Distribution A uses θ_A = (−1, 0, 1), while the shape of distribution B is allowed to vary. For a parameter ε ∈ [−1/2, 1/2] we let θ_B = (−1 − ε/√n, −ε/√n, 1 + ε/√n), so that all the distributions are local to the boundary of the null region represented by ε = 0. Two distributions are depicted in Figure 4, in which ε = 1/2. This implies that F_A ⪰_LASD F_B. From the right panel of the plot it can be seen that these distributions satisfy an LASD ordering, but they would not be ordered by FOSD.

Figure 4: Triangular model densities and distribution functions. In this example F_A ⪰_LASD F_B (in terms of the description in the text, ε = 1/2 for distribution B). Heuristically, the higher gains under policy B are outweighed by the probability of larger losses, so that distribution A dominates distribution B in the LASD sense, but F_A ⋡_FOSD F_B.

Figure 5 shows the empirical rejection results from the triangular model experiment. We allow ε, which controls the shape of distribution B, to vary between −1/2 and 1/2. The tests in this experiment should reject the null when ε < 0, should equal the nominal size at ε = 0, and should not reject when ε > 0. Because of the restricted supports of the distributions and the relatively small region for ε, the horizontal axis for the power curves shown in Figure 5 is the value of the alternative parameters in absolute scale and not local alternatives. Therefore the power curves show a noticeable change over different values of the sample sizes used.

C.3 Normal model, partially identified case
In this experiment we use three independent normal random variables (Z_0, Z_A, Z_B) with scales set to unity and location parameters µ = (0, 0, µ_B), where µ_B is allowed to vary. We denote this triple of marginal normal CDFs by G(µ_B). Rounding to one decimal place, the null H_0 : F_A ⪰_LASD F_B should be rejected when µ_B exceeds a threshold slightly above 2. We let µ_B vary locally around this approximate boundary point. Figure 6 depicts the T(G(µ_B)) function for three values of µ_B around the boundary. Tests are designed to detect the positive deviation in the right-most panel of the figure, when T(G)(x) > 0 for some x ≥ 0.

Figure 7 shows empirical rejection probabilities for tests with three independent normal distributions. The tests are not conducted under any assumptions about the independence of the samples. The rejection probabilities are different than those in the point-identified experiments: more evidence is needed to detect deviations from the null region than in the identified case, because the bound U_B combines observations from the control sample and sample B. Although more information is necessary, it is important to note that these alternatives (like in the other experiments) are local to the boundary of the P_nec set.

Figure 5: Empirical rejection probabilities of the LASD tests in the point-identified triangular model experiment. The tests are of nominal 5% size, should have exactly 5% rejection probability when ε = 0, and should reject when ε < 0. Samples of sizes 100, 500 and 1000 correspond respectively to 499, 999 and 1999 bootstrap repetitions. Distributions are around the boundary of the null region, which is where ε = 0, but plotted on an absolute, not local, scale.
1000 simulation repetitions.As can be seen in Figure 7, the tests in the partially identified case do not reject the nullwith as high a probability as in the point identified case, which is a direct result of the lackof knowledge about inter-sample correlations that dictates the form of the T function definedin the main text. Also, it appears as though these deviations from the null are not very welldetected by the Cramér-von Mises tests in relation to the Kolmogorov-Smirnov tests. However,it is important to note that in this example, alternatives are local alternatives, and representsmaller and smaller deviations from the null region as sample sizes increase. D Application
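Before turning to the application, the contrast drawn above between sup-norm (Kolmogorov–Smirnov-type) and L2-norm (Cramér–von Mises-type) statistics can be made concrete with a generic two-sample sketch. These are simplified stand-ins rather than the paper's exact $V_n$ and $W_n$ statistics, and the permutation p-value below is valid only under the sharper null of equal distributions; the paper's resampling procedure for directionally differentiable functionals differs.

```python
import numpy as np

rng = np.random.default_rng(0)

def ecdf(sample, grid):
    """Empirical CDF of `sample` evaluated at the points in `grid`."""
    return np.searchsorted(np.sort(sample), grid, side="right") / len(sample)

def one_sided_stats(x, y):
    """Generic one-sided sup-norm (KS-type) and L2-norm (CvM-type)
    two-sample statistics for H0: F_X(t) <= F_Y(t) for all t."""
    n, m = len(x), len(y)
    pooled = np.sort(np.concatenate([x, y]))
    pos = np.maximum(ecdf(x, pooled) - ecdf(y, pooled), 0.0)
    scale = np.sqrt(n * m / (n + m))
    return scale * pos.max(), scale**2 * np.mean(pos**2)

def permutation_pvalue(x, y, stat_index, B=199):
    """Permutation p-value under the exchangeable null F_X = F_Y."""
    observed = one_sided_stats(x, y)[stat_index]
    pooled = np.concatenate([x, y])
    exceed = 0
    for _ in range(B):
        perm = rng.permutation(pooled)
        if one_sided_stats(perm[:len(x)], perm[len(x):])[stat_index] >= observed:
            exceed += 1
    return (1 + exceed) / (1 + B)

x = rng.normal(0.0, 1.0, size=200)   # "policy A" sample (hypothetical)
y = rng.normal(0.3, 1.0, size=200)   # "policy B" sample, shifted location
p_sup = permutation_pvalue(x, y, 0)  # sup-norm test
p_l2 = permutation_pvalue(x, y, 1)   # L2-norm test
print("sup-norm p-value:", p_sup, "L2-norm p-value:", p_l2)
```

The sup-norm statistic reacts to the largest pointwise violation of the ordering, while the L2-norm statistic averages squared violations, which is one mechanical reason the two families can differ in power against local deviations, as seen in the simulations above.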
In this section we present additional test results for the empirical application discussed in Section 5 of the main paper. Table 2 contains results for the sup-norm ($V_n$) statistics based on the T1 process, Table 3 contains sup-norm statistics based on the T2 process, Table 4 contains L2-norm ($W_n$) statistics based on the T1 process, and finally Table 5 reproduces the table of L2-norm (T2) results used in the main text. Each table reports test statistics and p-values for the avg-RA, lastQ-RA, avg-TL and lastQ-TL outcomes, testing $F_{AFDC} \succeq F_{JF}$, $F_{JF} \succeq F_{AFDC}$ and equality in changes (LASD), together with the corresponding orderings of $G_{AFDC}$ and $G_{JF}$ in levels (FOSD). The tables reveal that all the tests lead to very similar qualitative conclusions. Some of the entries are exactly the same across tables, being repetitions of the same tests, but the tables are shown in their entirety to facilitate comparison.

Figure 6: The $T(G(\mu_B))$ function for different values of the location of the marginal distribution function $G_B$. Tests should reject the null hypothesis when $T(G)(x) > 0$ for some $x$, as in the right panel.

Table 2: Table of sup-norm tests. LASD tests use the T1 process.

Table 3: Table of sup-norm tests. LASD tests use the T2 process.
Table 4: Table of L2-norm tests. LASD tests use the T1 process.
Table 5: Table of L2-norm tests. LASD tests use the T2 process.

Figure 7: Empirical rejection probabilities of the LASD tests in the partially identified normal location model experiment. The control and policy $A$ distributions have means set to zero, while the location of policy $B$ is allowed to vary. The tests are of nominal 5% size, should have exactly 5% rejection probability when $\mu_B$ is at the boundary value and should reject when $\mu_B$ exceeds it (alternatives are local to the boundary of the set $\mathcal{P}_{nec}$ described in the text). Samples of sizes 100, 500 and 1000 correspond respectively to 499, 999 and 1999 bootstrap repetitions; 1000 simulation repetitions.

Finally, we note that the example could be used to conduct tests under partial identification, as if we had no knowledge of the longitudinal structure of the data. However, tests using $V_n$ or $W_n$