Inference in Incomplete Models
Alfred Galichon and Marc Henry
Harvard University and Columbia University

First draft: September 15, 2005
This draft: May 26, 2006

Abstract
We provide a test for the specification of a structural model without identifying assumptions. We show the equivalence of several natural formulations of correct specification, which we take as our null hypothesis. From a natural empirical version of the latter, we derive a Kolmogorov-Smirnov statistic for Choquet capacity functionals, which we use to construct our test. We derive the limiting distribution of our test statistic under the null, and show that our test is consistent against certain classes of alternatives. When the model is given in parametric form, the test can be inverted to yield confidence regions for the identified parameter set. The approach can be applied to the estimation of models with sample selection, censored observables and to games with multiple equilibria.
JEL Classification: C10, C12, C13, C14, C52, C61

Keywords: partial identification, specification test, random correspondences, Core, selections, plausibility constraint, Monge-Kantorovich mass transportation problem, Kolmogorov-Smirnov test for capacity functionals.

This research was carried out while the first author was visiting the Bendheim Center for Finance, Princeton University, and financial support from NSF grant SES 0350770 to Princeton University and from the Conseil Général des Mines is gratefully acknowledged. The authors also wish to thank Gary Chamberlain, Xiaohong Chen, Victor Chernozhukov, Pierre-André Chiappori, Ronald Gallant, Peter Hansen, Han Hong, Guido Imbens, Michael Jansson, Massimo Marinacci, Rosa Matzkin, Francesca Molinari, Ulrich Mueller, Alexei Onatski, Ariel Pakes, Jim Powell, Peter Robinson, Bernard Salanié, Thomas Sargent, José Scheinkman, Jay Sethuraman, Azeem Shaikh, Chris Sims, Kyungchul Song and Edward Vytlacil and seminar participants at Berkeley, Columbia, École polytechnique, Harvard, MIT, NYU, Princeton, SAMSI and Stanford for helpful comments (with the usual disclaimer). Correspondence address: Department of Economics, Columbia University, 420 W 118th Street, New York, NY 10027, USA. [email protected]. This paper is now superseded by various papers by the same authors.

Introduction

In many contexts, the ability of econometric models to identify, hence estimate from observed frequencies, the distribution of residual uncertainty often rests on strong prior assumptions that are difficult to substantiate and even to analyze within the economic decision problem.

A recent approach, pioneered by Manski, has been to forgo such prior assumptions, thus giving up the ability to identify a single probability distribution for residual uncertainty, and allow instead for a set of distributions compatible with the empirical setup.
A variety of models have been analyzed in this way, whether partial identification stems from incompletely specified models (typically models with multiple equilibria) or from structural data insufficiencies (typically cases of data censoring). See Manski, 2005 for an up-to-date survey on the topic.

All these models with incomplete identification share the basic structure that the residual uncertainty and the relevant observable quantities are linked by a many-to-many mapping instead of a one-to-one mapping as in the case of identification.

In this paper, we propose a general framework for conducting inference without additional assumptions, such as equilibrium selection mechanisms, necessary to identify the model (i.e. to ensure that the many-to-many mapping is actually one-to-one). The usual terminology for such models is “incomplete” or “partially identified.”

In a parametric setting, the objective of inference in partially identified models is the estimation of the set of parameters (hereafter called identified set) which are compatible with the distribution of the observed data, and an assessment of the quality of that estimation. For the latter objective, two routes have been taken.

Chernozhukov et al., 2002 initiated research to obtain regions that cover the identified set with a prescribed probability. They propose an M-estimation approach with a subsampling procedure to approximate quantiles of the supremum of the criterion function over the identified set. Shaikh, 2005 proposes an alternative M-estimation with subsampling procedure that nests the Chernozhukov et al., 2002 proposal.
M-estimation with subsampling is the only general proposal to date that does not rely on a conservative testing procedure, but the choice of criterion function in the M-estimation procedure is arbitrary, and may have a large effect on the confidence regions.

In related research, a more direct application of random set methods has been taken to achieve the goal of constructing confidence regions for the identified set: Shaikh and Vytlacil, 2005 consider a special model where the identified set is a deterministic mapping of a collection of expectations, and base inference on the sample analogs of these expectations. Beresteanu and Molinari, 2006 propose the use of central limit theorems for random sets to conduct inference in models with set-valued data. However, the adaptation of delta theorems for random sets is required for this approach to attain its full potential.

The second route was initiated by Imbens and Manski, 2004, who considered the different problem of covering each element of the identified set, and demanded uniform coverage. Shaikh, 2005 shows that the M-estimation with subsampling procedure can also be applied to uniform coverage of elements of the identified set. Pakes et al., 2004 consider models that are defined by moment inequalities and propose a conservative procedure to form a confidence region for all parameters in the identified set based on inequalities testing ideas.
The procedure is conservative since the limiting distribution of the test statistic depends on the number of constraints that are actually binding, and unlike in the special one-dimensional treatment response case analyzed by Imbens and Manski, 2004, no superefficient pre-test is available.

Still in the latter spirit, Andrews et al., 2004 consider entry games (and more generally games with discrete strategies) and propose a conservative procedure to form a confidence region for all parameters in the identified set based on the idea that the probability of a certain outcome is no larger than the probability that necessary conditions (such as Nash rationality constraints) are met.

The inference procedure proposed here is in the same spirit as this latter contribution, but it gives a full formalization of the idea in a very general framework, does not restrict the class of distributions of observables (hence allows estimation of games with continuous strategies as well as entry games), does not rely on resampling procedures (though they may be used as alternative quantile approximation devices), and provides an exact test as opposed to the conservative procedures considered above.

After a prelude to expound the ideas developed here in the familiar case of Kolmogorov-Smirnov specification testing, the general set-up is described (with some examples) in section 1. It comprises the specification of a structure (in the Koopmans terminology) with observable and unobservable variables (unobservable to the analyst but not necessarily to the economic agents) related by a many-to-many mapping as opposed to the one-to-one mapping required for identification. The structure is defined by the many-to-many mapping (which can comprise rationality constraints as before, as well as any constraints that are plausible within the theory) and a hypothesized distribution for the unobserved variables.
To fix ideas, we call $\Gamma$ the many-to-many mapping defining the structure, $\nu$ a hypothesized distribution of unobservables and $P$ the true distribution of observables.

Still in section 1, a characterization is given of what we mean by correct specification, viz. compatibility of the structure with the distribution of the observable variables, and it is shown that several natural ways of defining compatibility are in fact equivalent. They include (among other notions) a compatibility notion based on selections $\gamma$ of $\Gamma$ (i.e. functions such that $\gamma \in \Gamma$), a notion based on the existence of a joint probability that admits $\nu$ and $P$ as marginals and is supported on the region where the constraints implied by $\Gamma$ are satisfied, and the notion of maximum plausibility introduced by Dempster, 1967.

Second, in section 2, we show that the characterizations of correct specification of the structure are equivalent to the existence of a zero cost solution to a Monge-Kantorovich mass transportation problem, where mass is transported between distribution $P$ and distribution $\nu$ with zero-one cost associated with violation of the constraints implied by $\Gamma$. Note that a special case of the Monge-Kantorovich transportation problem is the well-known matching problem.

Third, still in section 2, this observation allows us to conduct inference using the empirical version of the mass transportation problem (with the unknown $P$ replaced by the empirical distribution $P_n$). Empirical formulations pertaining to the different characterizations of correct specification of the structure are compared, and several are found to be equivalent, whereas others differ according to the choice of probability metric. It turns out that the dual of the empirical problem yields a statistic that reduces to the familiar Kolmogorov-Smirnov specification test statistic in the identified case where $\Gamma$ is one-to-one.

The properties of this statistic are examined in section 3.
The classical Kolmogorov-Smirnov statistic tests the equality of two probability measures by checking their difference on a good class of sets (large enough to be convergence-determining, but small enough to allow asymptotic treatment). Here our test statistic checks that $P(A)$ is no larger than $\nu(\Gamma(A))$ for all $A$ in a similar class of sets. Since $\nu(\Gamma(A))$ is the probability of the sufficient conditions implied by $A$, we see the strong similarity with the Andrews et al., 2004 approach. Hence the dual empirical problem provides us with a computable test statistic, a distribution to compare it to, and a parallel with the classical case.

We derive the asymptotic distribution of our test statistic and describe how classes of alternatives against which our test has power are related to what we call core-determining classes of sets.

Finally, the fourth section shows simple implementation procedures, and the inversion of the test to construct a confidence region for the elements of the identified set of parameters when both $\Gamma$ and $\nu$ are specified in parametric form. If one is interested in testing structural hypotheses such as extra constraints implied by theory, within the framework of a partially identified model, the constraints should be rejected if the region they imply on the parameter set does not intersect with the identified set. Here the question can be answered directly by incorporating the extra constraints in the model and testing the restricted specification.
If, on the other hand, one is interested in reporting parameter value estimates with confidence bounds for policy analysis, the specification test can be inverted to provide confidence regions that cover the elements of the identified set with pre-determined probability, or confidence regions that cover the identified set itself.

At the end of this section, we discuss semi-nonparametric extensions of our approach to include models which do not specify a parametric family of hypothesized data generating processes for the unobservable variables. This includes as a special case models defined by moment inequalities, the full treatment of which is the subject of the companion paper Galichon and Henry, 2006.

The last section of the main text concludes, whereas proofs and additional results are collected in the appendix.

Prelude: complete model benchmark
Before we define incomplete model specifications, we give a short heuristic univariate description of the benchmark that we use and discuss the Kolmogorov-Smirnov specification test statistic that we are effectively generalizing in this paper.

For ease of notation, we consider observables $y \in \mathbb{R}$ and unobservables $u \in \mathbb{R}$ (also called “unobserved shocks”, “latent variables”, etc.). Abstracting from dependence on an unknown deterministic parameter, we define a “complete” structure as a pair $(\nu, \gamma)$, where $\nu$ is a data generating process for the unobservables, and $\gamma$ is a bijection from the set of observables to the set of unobservables, as in figure 1.

Figure 1: Bijective structure

If we call $P$ the true data-generating process for the observables, we say that the complete structure is well specified if $P(A) = \nu(\gamma(A))$ for all Borel sets $A$, which, by Dynkin’s lemma, is equivalent to $P(A) = \nu(\gamma(A))$ for all cells $A$ of the form $(-\infty, y]$, $y \in \mathbb{R}$, which is immediately seen to be equivalent to

$$\sup_{A \in \mathcal{C}} \left( P(A) - \nu(\gamma(A)) \right) = 0, \tag{1}$$

where $\mathcal{C} = \{ (-\infty, y_1], (y_2, \infty) : (y_1, y_2) \in \mathbb{R}^2 \}$.

(1) is a programming problem, and it will turn out to be very fruitful to consider its Monge-Kantorovich dual formulation

$$\inf_{\pi \in \mathcal{M}(P, \nu)} \int_{\mathbb{R}^2} 1_{\{u \neq \gamma(y)\}} \, \pi(dy, du) = 0, \tag{2}$$

where $1_{\{x \in A\}}$ denotes the indicator function of the set $A$, and the infimum is taken over all joint probability measures with marginals $P$ and $\nu$. The latter is a mass transportation (or “generalized matching”) problem, where mass is transported from the set of observables to the set of unobservables with zero-one cost of transportation associated with violations of the constraint $u = \gamma(y)$.

This formulation can be interpreted as the existence of a probability that is concentrated on the structure, or alternatively, as the existence of a coupling between the random variable $Y$ with law $P$ and the random variable $U$ with law $\nu$, i.e. the existence of $\pi$ with marginals $P$ and $\nu$ such that

$$\pi(U \neq \gamma(Y)) = 0. \tag{3}$$

We shall show that this dual representation of the hypothesis of correct specification has a natural generalization to the case of incomplete structures.

Turning to empirical versions of the problem, we replace $P$ by the empirical distribution $P_n$ of a sample of independent and identically distributed variables with law $P$, and obtain the statistic

$$\inf_{\pi \in \mathcal{M}(P_n, \nu)} \int_{\mathbb{R}^2} 1_{\{u \neq \gamma(y)\}} \, \pi(dy, du), \tag{4}$$

where the infimum is taken over probabilities $\pi$ with marginals $P_n$ and $\nu$. By the above-mentioned duality, the latter is equal to

$$\sup_{A \in \mathcal{B}_Y} \left( P_n(A) - \nu(\gamma(A)) \right),$$

with $\mathcal{B}_Y$ the class of Borel sets.

The last step is to determine a class of sets that is small enough to allow determination of the limiting behaviour of the statistic, i.e. we need the class of sets to be $P$-Donsker, and large enough that the values of $\nu(\gamma(\cdot))$ over all Borel sets are determined by the latter’s values on the restricted class. The class $\mathcal{C}$ satisfies both requirements, and the resulting test statistic is

$$\sup_{A \in \mathcal{C}} \left( P_n(A) - \nu(\gamma(A)) \right) = \sup_{y \in \mathbb{R}} \left| P_n(-\infty, y] - \nu(\gamma(-\infty, y]) \right|, \tag{5}$$

which is exactly the Kolmogorov-Smirnov specification test statistic.

We shall essentially follow these same steps to show equivalence between formulations of the hypothesis of correct specification and to derive a test of specification when the bijection $\gamma$ is replaced by a correspondence $\Gamma$, as in figure 2. Then we shall consider parameterized versions of the structure where both $\Gamma$ and $\nu$ depend on a parameter $\theta$, and form confidence regions with all values of $\theta$ such that the specification of model $(\Gamma_\theta, \nu_\theta)$ is not rejected.

Figure 2: Incomplete structure

We consider a very general econometric specification, thereby posing the problem exactly as in Jovanovic, 1989, which was an inspiration for this work. Variables under consideration are divided into two groups.
• Latent variables, $u \in \mathcal{U}$. The vector $u$ is not observed by the analyst, but some of its components may be observed by the economic actors. $\mathcal{U}$ is a complete, metrizable and separable topological space (i.e. a Polish space).
• Observable variables, $y \in \mathcal{Y} = \mathbb{R}^{d_y}$. The vector $y$ is observed by the analyst.

The Borel sigma-algebras of $\mathcal{Y}$ and $\mathcal{U}$ will be respectively denoted $\mathcal{B}_Y$ and $\mathcal{B}_U$. Call $P$ the Borel probability measure that represents the true data generating process for the observable variables, and $\nu$ the hypothesized data generating process for the latent variables. The structure is given by a relation between observable and latent variables, i.e. a subset of $\mathcal{Y} \times \mathcal{U}$, which we shall write as a multi-valued mapping from $\mathcal{Y}$ to $\mathcal{U}$ denoted by $\Gamma$. Finally, the set of Borel probability measures on $(\mathcal{Y} \times \mathcal{U}, \sigma(\mathcal{B}_Y \times \mathcal{B}_U))$ with marginals $P$ and $\nu$ is denoted by $\mathcal{M}(P, \nu)$. Whenever there is no ambiguity, we shall adopt the de Finetti notation $\mu f$ to denote the integral of $f$ with respect to $\mu$.

1.1 Examples

Example 1: Sample selection and other models with missing counterfactuals.
The typical Heckman sample selection models require very strong and often implausible assumptions to guarantee identification. Weaker assumptions, such as certain forms of monotonicity, are plausible and restrict significantly the identified set without reducing it to a singleton. As an illustration of our formulation in this case, consider for instance the classical set-up in Heckman and Vytlacil, 2001. We observe $(Y, D, Z)$, where $Y$ is the outcome variable, $D$ is an indicator variable for the receipt of treatment, and $Z$ is a vector of instruments (we implicitly condition the model on exogenous observable covariates). The outcome variable is generated as follows:

$$Y = D Y_1 + (1 - D) Y_0,$$

where $Y_0$ is the binary potential outcome if the individual does not receive treatment, and $Y_1$ is the binary potential outcome if the individual does receive treatment. The model is completed with the specification of $D$ as follows:

$$D = 1_{\{g(Z) \geq U\}},$$

where $g$ is a measurable function and $U$ is uniformly distributed on $[0, 1]$ (without loss of generality). The model can be written in the form of a multi-valued mapping $\Gamma$ from observables to unobservables in the following way (with the latent vector ordered as $(u, y_1, y_0)$):

$$\begin{array}{rcl}
(y, d, z) & \mapsto & \{ (u, y_1, y_0) \in \Gamma(y, d, z) \} \\
(1, 1, z) & \mapsto & [0, g(z)] \times \{1\} \times \{0, 1\} \\
(1, 0, z) & \mapsto & (g(z), 1] \times \{0, 1\} \times \{1\} \\
(0, 1, z) & \mapsto & [0, g(z)] \times \{0\} \times \{0, 1\} \\
(0, 0, z) & \mapsto & (g(z), 1] \times \{0, 1\} \times \{0\}
\end{array}$$

Example 2: Returns to schooling.
Consider a general specification for the returns to education, where income $Y$ is a function of years of education $E$, other observable characteristics $X$ and unobserved ability $U$ as $Y = G(E, X, U)$. $G$ can be inverted as a multi-valued mapping to yield a correspondence $U = \Gamma(Y, E, X)$.

Example 3: Censored data structures.
Models with top-censoring or positive censoring such as Tobit models fall in this class. A classic problem where identification fails is regression with interval censored outcomes: the observable variables are the pairs $(Y^*, Y_*, X)$ of upper and lower values for the dependent variable, and the explanatory variables. The correspondence describing the structure is

$$\Gamma_\theta(y^*, y_*, x) = [\, y_* - x'\theta, \; y^* + x'\theta \,].$$

Example 4: Games with multiple equilibria.
Very large classes of economic models become estimable with this approach, when one allows the object of interest to be the identified set of parameters as opposed to single parameter values. A simple class of examples is that of models defined by a set of Nash rationality constraints. Suppose the payoff function for player $j$, $j = 1, \ldots, J$, is given by

$$\Pi_j(S_j, S_{-j}, X_j, U_j; \theta),$$

where $S_j$ is player $j$’s strategy and $S_{-j}$ is their opponents’ strategies. $X_j$ is a vector of observable characteristics of player $j$ and $U_j$ a vector of unobservable determinants of the payoff. Finally, $\theta$ is a vector of parameters. Pure strategy Nash equilibrium conditions

$$\Pi_j(S_j, S_{-j}, X_j, U_j; \theta) \geq \Pi_j(S, S_{-j}, X_j, U_j; \theta), \quad \text{for all } S,$$

define a correspondence $\Gamma_\theta$ from unobservable player characteristics to observable variables $(S, X)$.

Example 5: Entry models.
Consider the special case of example 4 proposed by Jovanovic, 1989. The payoff functions are

$$\Pi_1(x_1, x_2, u) = (\lambda x_2 - u) 1_{\{x_1 = 1\}}, \qquad \Pi_2(x_1, x_2, u) = (\lambda x_1 - u) 1_{\{x_2 = 1\}},$$

where $x_i \in \{0, 1\}$ is firm $i$’s action, and $u$ is an exogenous cost. The firms know their cost; the analyst, however, knows only that $u \in [0, 1]$ and that $\lambda$ is in $(0, 1]$. There are two pure strategy Nash equilibria: the first is $x_1 = x_2 = 0$ for all $u \in [0, 1]$; the second is $x_1 = x_2 = 1$ for all $u \in [0, \lambda]$ and zero otherwise. Since the two firms’ actions are perfectly correlated, we shall denote them by a single binary variable $y = x_1 = x_2$. Hence the structure is described by the multi-valued mapping: $\Gamma(1) = [0, \lambda]$ and $\Gamma(0) = [0, 1]$. Since $y$ is Bernoulli, we can write $P = (1 - p, p)$ with $p$ the probability of a 1. For the distribution of $u$, we consider a parametric exponential family on $[0, 1]$.

We wish to develop a procedure to detect whether the structure $(\Gamma, \nu)$ and the distribution of observables are compatible. First we explain what we mean by compatible. We start by taking $P$, $\Gamma$ and $\nu$ as given and by considering three natural formalizations of compatibility: a first representation based on measurable selections of $\Gamma$, a second based on the existence of a suitable probability measure with marginals $P$ and $\nu$, and a third based on Dempster’s notion of maximal plausibility.

The first representation is very easily understood in the simple case where the link $\Gamma$ between latent and observable variables is parametric and $\Gamma = \gamma$ is measurable and single valued. Defining the image measure of $P$ by $\gamma$ by

$$P\gamma^{-1}(A) = P\{ y \in \mathcal{Y} \mid \gamma(y) \in A \}, \tag{6}$$

for all $A \in \mathcal{B}_U$, we say that the structure is well specified if and only if $\nu = P\gamma^{-1}$. In the general case considered here, $\Gamma$ may not be single valued, and its images may not even be disjoint (which would be the case if it was the inverse image of a single valued mapping from $\mathcal{U}$ to $\mathcal{Y}$, i.e. a traditional function from latent to observable variables).
However, under a measurability assumption on $\Gamma$, we can construct an analogue of the image measure, which will now be a set $\mathrm{Core}(\Gamma, P)$ of Borel probability measures on $\mathcal{U}$ (defined by (10)), and the hypothesis of compatibility of the restrictions on latent variable distributions and on the structures linking latent and observable variables will naturally take the form

$$H_0 : \nu \in \mathrm{Core}(\Gamma, P). \tag{7}$$

Assumption 1: $\Gamma$ has non-empty and closed values, and for each open set $O \subseteq \mathcal{U}$, $\Gamma^{-1}(O) = \{ y \in \mathcal{Y} \mid \Gamma(y) \cap O \neq \emptyset \} \in \mathcal{B}_Y$.

To relate the present case to the intuition of the single-valued case, it is useful to think in terms of single-valued selections of the multi-valued mapping $\Gamma$, as in figure 3.

Figure 3: Selection of a correspondence

A measurable selection $\gamma$ of $\Gamma$ is a measurable function such that $\gamma(y) \in \Gamma(y)$ for all $y \in \mathcal{Y}$. The set of measurable selections of a multi-valued mapping $\Gamma$ that satisfies Assumption 1 is denoted $\mathrm{Sel}(\Gamma)$ (which is known to be non-empty by the Rokhlin-Kuratowski-Ryll-Nardzewski Theorem). To each selection $\gamma$ of $\Gamma$, we can associate the image measure of $P$, denoted $P\gamma^{-1}$, defined as in (6).

It would be tempting to reformulate the compatibility condition as the requirement that at least one selection $\gamma$ in $\mathrm{Sel}(\Gamma)$ is such that $\nu = P\gamma^{-1}$. However, such a requirement implies that $\gamma$ corresponds to the equilibrium that is always selected. Under such a requirement, if for a given observable value the structure does not specify which value of the latent variables gave rise to it, the latter is nonetheless fixed. Hence two identical observed realizations in the sample of observations necessarily arose from the same realization of the latent variables. We argue, however, that if the structure does not specify an equilibrium selection mechanism, there is no reason to assume that each observation is drawn from the same equilibrium.

Allowing endogenous equilibrium selection of unknown form is equivalent to allowing the existence of an arbitrary distribution on the set of $P\gamma^{-1}$ when $\gamma$ spans $\mathrm{Sel}(\Gamma)$ (as opposed to a mass on one particular $P\gamma^{-1}$). A Bayesian formulation of the problem would entail a specification of this distribution.
Here, we stick to the given specification in leaving it completely unspecified.

Hence, we argue that the correct reformulation of the compatibility condition is that $\nu$ can be written as a mixture of probability measures of the form $P\gamma^{-1}$, where $\gamma$ ranges over $\mathrm{Sel}(\Gamma)$. However, as the following example shows, even for the simplest multi-valued mapping, the set of measurable selections is very rich, let alone the set of their mixtures.

Example:
Consider the multi-valued mapping $\Gamma : [0, 1] \rightrightarrows [0, 1]$ defined by $\Gamma(x) = \{0, x\}$ for all $x$. The collection of measurable selections of $\Gamma$ is indexed by the class of Borel subsets of $[0, 1]$: it consists of the functions $\gamma_B$ such that $\gamma_B(x) = x 1_{\{x \in B\}}$ for any Borel subset $B$ of $[0, 1]$, where $1_{\{x \in B\}}$ denotes the indicator function which equals one when $x \in B$ and zero otherwise.

Hence, it will be imperative to give manageable equivalent representations of such a mixture, as is done in Theorem 1 below.

The second natural representation of compatibility of the distribution $P$ of observables and the structure $(\Gamma, \nu)$ is based on the existence of probability measures on the product $\mathcal{Y} \times \mathcal{U}$ that admit $P$ and $\nu$ as marginals.

In the benchmark case of $\Gamma = \gamma$ one-to-one, the structure imposes a stringent constraint on pairs $(y, u)$, namely that $u = \gamma(y)$. So the admissible region of the product space is the graph of $\gamma$, i.e. the set

$$\mathrm{Graph}\, \gamma = \{ (y, u) \in \mathcal{Y} \times \mathcal{U} : u = \gamma(y) \}.$$

The compatibility condition described above, namely $P\gamma^{-1} = \nu$, is equivalent to the existence of a probability measure on the product space that is supported by $\mathrm{Graph}\, \gamma$ (i.e. that gives probability zero outside the constrained region defined by the structure) and admits $P$ and $\nu$ as marginals.

This generalizes immediately to the case of $\Gamma$ multi-valued, as the existence of a probability measure that admits $P$ and $\nu$ as marginals, and that is supported on the constrained region

$$\mathrm{Graph}\, \Gamma = \{ (y, u) \in \mathcal{Y} \times \mathcal{U} : u \in \Gamma(y) \}, \tag{8}$$

in other words, a probability measure that admits $P$ and $\nu$ as marginals and gives probability zero to the event $U \notin \Gamma(Y)$, where $U$ and $Y$ are random elements with probability law $\nu$ and $P$ respectively (namely (12) below).

Dempster, 1967 suggests considering the smallest reliability that can be associated with the event $B \in \mathcal{B}_U$ as the belief function

$$\underline{P}(B) = P\{ y \in \mathcal{Y} \mid \Gamma(y) \subseteq B \}$$

and the largest plausibility that can be associated with the event $B$ as the plausibility function

$$\overline{P}(B) = P\{ y \in \mathcal{Y} \mid \Gamma(y) \cap B \neq \emptyset \},$$

the two being linked by the relation

$$\overline{P}(B) = 1 - \underline{P}(B^c), \tag{9}$$

which prompted some authors to call them conjugates or duals of each other.

A natural way to construct a set of probability measures is to consider all probability measures that do not exceed the largest plausibility that can be associated with a set, and that, as a result of (9), are larger than the smallest reliability associated with a set. We thus form the core of the belief function:

$$\begin{aligned}
\mathrm{Core}(\Gamma, P) &= \{ \mu \in \Delta(\mathcal{U}) \mid \forall B \in \mathcal{B}_U, \; \mu(B) \geq \underline{P}(B) \} \\
&= \{ \mu \in \Delta(\mathcal{U}) \mid \forall B \in \mathcal{B}_U, \; \mu(B) \leq \overline{P}(B) \},
\end{aligned} \tag{10}$$

where the first equality can be taken as a definition, and the second follows immediately from (9).
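For finite sets, the belief function, the plausibility function and the core condition can be checked by direct enumeration. The following is a minimal sketch under assumed inputs: the sets, the probabilities and the function names (`belief`, `plausibility`, `in_core`) are hypothetical and chosen only for illustration, not taken from the paper.

```python
from itertools import combinations

def subsets(items):
    """All subsets of a finite collection, returned as frozensets."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def belief(P, Gamma, B):
    """Belief function: P{y : Gamma(y) is contained in B}."""
    return sum(p for y, p in P.items() if Gamma[y] <= B)

def plausibility(P, Gamma, B):
    """Plausibility function: P{y : Gamma(y) intersects B}."""
    return sum(p for y, p in P.items() if Gamma[y] & B)

def in_core(mu, P, Gamma, U, tol=1e-12):
    """Check mu(B) <= plausibility(B) for every subset B of the finite set U."""
    return all(sum(mu[u] for u in B) <= plausibility(P, Gamma, B) + tol
               for B in subsets(U))

# Hypothetical finite example (all values are illustrative assumptions).
P = {"a1": 0.5, "a2": 0.3, "a3": 0.2}
Gamma = {
    "a1": frozenset({"b1"}),
    "a2": frozenset({"b1", "b2"}),
    "a3": frozenset({"b2", "b3"}),
}
U = frozenset({"b1", "b2", "b3"})

mu_ok = {"b1": 0.6, "b2": 0.3, "b3": 0.1}
mu_bad = {"b1": 0.2, "b2": 0.2, "b3": 0.6}

print(in_core(mu_ok, P, Gamma, U))
print(in_core(mu_bad, P, Gamma, U))
```

In this sketch, `mu_bad` fails because it puts mass 0.6 on `b3`, while only `a3` (with probability 0.2) can give rise to `b3`. Enumeration over all subsets is exponential in the size of the latent set, so it is only meant for small illustrative cases.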
It is well known that $\mathrm{Core}(\Gamma, P)$ is non-empty, and another natural representation of the compatibility of the distribution $P$ of observables with the structure $(\Gamma, \nu)$ is that $\nu$ belongs to $\mathrm{Core}(\Gamma, P)$, in other words, that $\nu$ satisfies $\nu(B) \leq P(\{ y \in \mathcal{Y} : \Gamma(y) \cap B \neq \emptyset \})$ for all $B \in \mathcal{B}_U$. Figure 4 illustrates this requirement in the case of finite sets.

Footnote: The name Core is standard in the literature to denote the set of probability measures satisfying (13). It seems to originate from D. Gillies’ 1953 Princeton PhD thesis on “some theorems on n-person games.” For finite sets, the core is non-empty by the Bondareva-Shapley theorem. In the present more general context, the non-emptiness of the core will follow from the equivalence of (i) and (iv) of Theorem 1 below, and the existence of measurable selections of $\Gamma$ under assumption 1.

Figure 4: Event $\{a_1\}$ always gives rise to the event $\{b_1, b_2\}$, whereas event $\{a_4\}$ never does, so it is natural to constrain the probability of the event $\{b_1, b_2\}$ by the upper bound $P(\{a_1, a_2, a_3\})$ and the lower bound $P(\{a_1\})$.

The following theorem shows that the three representations discussed above are, in fact, equivalent. In addition, two more equivalent formulations are presented that will be used in the empirical formulations in the next section.
Theorem 1:
Under assumption 1, the following statements are equivalent:

(i) $\nu$ is a mixture of images of $P$ by measurable selections of $\Gamma$ (i.e. $\nu$ is in the weak closed convex hull of $\{ P\gamma^{-1} ; \gamma \in \mathrm{Sel}(\Gamma) \}$).

(ii) There exists for $P$-almost all $y \in \mathcal{Y}$ a probability measure $\pi_\nu(y, \cdot)$ on $\mathcal{U}$ with support $\Gamma(y)$, such that

$$\nu(B) = \int_{\mathcal{Y}} \pi_\nu(y, B) \, P(dy), \quad \text{all } B \in \mathcal{B}_U. \tag{11}$$

(iii) If $U$ and $Y$ are random elements with respective distributions $\nu$ and $P$, there exists a probability measure $\pi \in \mathcal{M}(P, \nu)$ that is supported on the admissible region, i.e. such that

$$\pi(U \notin \Gamma(Y)) = 0. \tag{12}$$

(iv) The probability assigned by $\nu$ to an event $B \in \mathcal{B}_U$ is no greater than the largest plausibility associated with $B$ given $P$ and $\Gamma$, i.e.

$$\nu(B) \leq P(\{ y \in \mathcal{Y} : \Gamma(y) \cap B \neq \emptyset \}). \tag{13}$$

(v) For all $A \in \mathcal{B}_Y$, we have

$$P(A) \leq \nu(\Gamma(A)). \tag{14}$$

Remark 1:
The weak topology on $\Delta(\mathcal{U})$, the set of probability measures on $\mathcal{U}$, is the topology of convergence in distribution. $\Delta(\mathcal{U})$ is also Polish, and the weak closed convex hull of $\{ P\gamma^{-1} ; \gamma \in \mathrm{Sel}(\Gamma) \}$ is indeed the collection of arbitrary mixtures of elements of $\{ P\gamma^{-1} ; \gamma \in \mathrm{Sel}(\Gamma) \}$.

Remark 2:
Notice that (11) looks like a disintegration of $\nu$, and indeed, when $\Gamma$ is the inverse image of a single-valued measurable function (i.e. when the structure is given by a single-valued measurable function from latent to observable variables), the probability kernel $\pi_\nu$ is exactly the $(P, \Gamma^{-1})$-disintegration of $\nu$, in other words, $\pi_\nu(y, \cdot)$ is the conditional probability measure on $\mathcal{U}$ under the condition $\Gamma^{-1}(u) = \{y\}$. Hence (11) has the interpretation that a random element with distribution $\nu$ can be generated as a draw from $\pi_\nu(y, \cdot)$ where $y$ is a realization of a random element with distribution $P$.

Remark 3:
As will be explained later, our test statistic will be based on violations of representation (v), which is the dual formulation of (iii) seen as a Monge-Kantorovich optimal mass transportation solution.
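The duality between (iii) and (v) can be illustrated numerically when both sets are finite. The sketch below makes several assumptions that are not in the paper: the sets and probabilities are hypothetical, the zero-one transportation problem is solved through a plain Edmonds-Karp max-flow (the minimal cost equals one minus the largest mass that can be moved along admissible pairs), and the dual is evaluated by brute-force enumeration of subsets. Exact rational arithmetic is used so that the two values can be compared without rounding.

```python
from collections import deque
from fractions import Fraction
from itertools import combinations

def transport_cost(P, nu, Gamma):
    """Minimal mass a coupling of P and nu must put on {u not in Gamma(y)}.

    Computed as 1 - (max flow from observables to latents along admissible
    pairs), using Edmonds-Karp on a source -> y -> u -> sink network.
    """
    src, snk = ("s",), ("t",)
    cap = {}

    def add(a, b, c):
        cap[(a, b)] = cap.get((a, b), 0) + c
        cap.setdefault((b, a), 0)  # residual (reverse) arc

    for y, p in P.items():
        add(src, ("y", y), p)
        for u in Gamma[y]:
            add(("y", y), ("u", u), Fraction(1))  # total mass is 1: never binding
    for u, q in nu.items():
        add(("u", u), snk, q)

    adj = {}
    for a, b in cap:
        adj.setdefault(a, []).append(b)

    flow = Fraction(0)
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {src: None}
        queue = deque([src])
        while queue and snk not in parent:
            a = queue.popleft()
            for b in adj.get(a, []):
                if b not in parent and cap[(a, b)] > 0:
                    parent[b] = a
                    queue.append(b)
        if snk not in parent:
            return 1 - flow
        path, b = [], snk
        while parent[b] is not None:
            path.append((parent[b], b))
            b = parent[b]
        push = min(cap[e] for e in path)
        for a, b in path:
            cap[(a, b)] -= push
            cap[(b, a)] += push
        flow += push

def dual_value(P, nu, Gamma):
    """max over subsets A of the observables of P(A) - nu(Gamma(A))."""
    best = Fraction(0)
    ys = list(P)
    for r in range(len(ys) + 1):
        for A in combinations(ys, r):
            image = set().union(*(Gamma[y] for y in A))
            value = sum(P[y] for y in A) - sum(nu.get(u, 0) for u in image)
            best = max(best, value)
    return best

# Hypothetical finite example (illustrative values only).
P = {"a1": Fraction(1, 2), "a2": Fraction(3, 10), "a3": Fraction(1, 5)}
Gamma = {"a1": {"b1"}, "a2": {"b1", "b2"}, "a3": {"b2", "b3"}}
nu_ok = {"b1": Fraction(3, 5), "b2": Fraction(3, 10), "b3": Fraction(1, 10)}
nu_bad = {"b1": Fraction(1, 5), "b2": Fraction(1, 5), "b3": Fraction(3, 5)}

print(transport_cost(P, nu_ok, Gamma), dual_value(P, nu_ok, Gamma))
print(transport_cost(P, nu_bad, Gamma), dual_value(P, nu_bad, Gamma))
```

Under these assumptions, a zero transportation cost corresponds to a structure compatible with $P$, while a strictly positive cost coincides with the largest violation of $P(A) \leq \nu(\Gamma(A))$ over subsets $A$.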
Remark 4:
Equivalence of (i) and (iii) is a generalization of proposition 1 of Jovanovic, 1989 to the case where $P$ is not necessarily atomless and $\mathcal{U}$ not necessarily compact. Notice that relative to Jovanovic, 1989, the roles of $\mathcal{Y}$ and $\mathcal{U}$ are reversed for the purposes of specification testing. As discussed in the second remark following proposition 1 mentioned above, atomlessness of the distribution of latent variables is innocuous as long as $\mathcal{U}$ is rich enough. However, atomlessness of the distribution of observables is not innocuous, since it rules out many of the relevant applications.

Note that since, as a multivalued function, $\Gamma$ is always invertible, and Assumption 1 holds for $\Gamma$ if and only if it holds for $\Gamma^{-1}$, the roles of $P$ and $\nu$ can be interchanged in the formulations. In some cases, the symmetric formulation, with the roles of $P$ and $\nu$ interchanged, is useful, so we state it for completeness below:

Theorem 1’:
Under assumption 1, the following statements are equivalent, and are also equivalent to each of the statements in Theorem 1:

(i’) $P$ is a mixture of images of $\nu$ by measurable selections of $\Gamma^{-1}$ (i.e. $P$ is in the weak closed convex hull of $\{ \nu\gamma^{-1} ; \gamma \in \mathrm{Sel}(\Gamma^{-1}) \}$).

(ii’) There exists for $\nu$-almost all $u \in \mathcal{U}$ a probability measure $\pi_P(u, \cdot)$ on $\mathcal{Y}$ with support $\Gamma^{-1}(u)$, such that

$$P(A) = \int_{\mathcal{U}} \pi_P(u, A) \, \nu(du), \quad \text{all } A \in \mathcal{B}_Y. \tag{15}$$

(iii’) is identical to Theorem 1(iii).

(iv’) The probability assigned by $P$ to an event $A \in \mathcal{B}_Y$ is no greater than the largest plausibility associated with $A$ given $\nu$ and $\Gamma^{-1}$, i.e.

$$P(A) \leq \nu(\{ u \in \mathcal{U} : \Gamma^{-1}(u) \cap A \neq \emptyset \}). \tag{16}$$

(v’) For all $B \in \mathcal{B}_U$, we have

$$\nu(B) \leq P(\Gamma^{-1}(B)). \tag{17}$$

Remark 1:
The reason for giving this second theorem is that some of the new formulations will be more amenable to forming empirical counterparts.
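As an illustrative aside, the equivalence of formulations (iv') and (v') can be checked by brute force when both $\mathcal{Y}$ and $\mathcal{U}$ are finite. The sketch below is only a sanity check under that finiteness assumption; the dictionary encoding, the function names and the toy structure are hypothetical and chosen for illustration, not taken from the paper.

```python
from fractions import Fraction
from itertools import combinations


def subsets(points):
    """All subsets of a finite collection, including the empty set."""
    pts = sorted(points)
    return [set(c) for r in range(len(pts) + 1) for c in combinations(pts, r)]


def inverse(gamma, us):
    """Gamma^{-1}(u) = {y : u in Gamma(y)} for each u."""
    return {u: {y for y, img in gamma.items() if u in img} for u in us}


def holds_iv_prime(P, nu, gamma):
    """(iv'): P(A) <= nu({u : Gamma^{-1}(u) meets A}) for every A."""
    inv = inverse(gamma, nu)
    return all(
        sum(P[y] for y in A) <= sum(nu[u] for u in nu if inv[u] & A)
        for A in subsets(P)
    )


def holds_v_prime(P, nu, gamma):
    """(v'): nu(B) <= P(Gamma^{-1}(B)) for every B."""
    inv = inverse(gamma, nu)
    return all(
        sum(nu[u] for u in B)
        <= sum(P[y] for y in P if any(y in inv[u] for u in B))
        for B in subsets(nu)
    )


# Hypothetical toy structure: Y = {0, 1}, U = {"a", "b", "c"}.
gamma = {0: {"a", "b"}, 1: {"b", "c"}}
nu = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
compatible = {0: Fraction(7, 10), 1: Fraction(3, 10)}
incompatible = {0: Fraction(9, 10), 1: Fraction(1, 10)}

for P in (compatible, incompatible):
    print(holds_iv_prime(P, nu, gamma), holds_v_prime(P, nu, gamma))
```

In this toy case both conditions agree on each candidate $P$, as the theorem requires. Exact rational arithmetic is used so that binding inequalities are not misclassified by rounding.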
Each of the theoretical formulations of correct specification of the structure given in Theorems 1 and 1' has empirical counterparts, obtained essentially by replacing $P$ by an estimate such as $P_n$ in the formulations. The equivalence of the theoretical formulations does not necessarily entail equivalence of the empirical counterparts, especially in the cases where they rely on a choice of distance on the (metrizable) space of probability measures on $(\mathcal{Y}, \mathcal{B}_{\mathcal{Y}})$ or $(\mathcal{U}, \mathcal{B}_{\mathcal{U}})$. Hence we need to consider the relations existing between the different empirical counterparts. We shall form our test statistic based on the empirical formulation relative to (v), so the reader may jump to section 2.4 without loss of continuity.

For this empirical formulation, we consider (i') from Theorem 1'. We denote by $\mathrm{Core}(\Gamma^{-1}, \nu)$ the set of arbitrary mixtures of $\nu\gamma^{-1}$ when $\gamma$ spans $\mathrm{Sel}(\Gamma^{-1})$. Denoting by $d$ a choice of metric on the space of probability measures on $(\mathcal{Y}, \mathcal{B}_{\mathcal{Y}})$, the null can be reformulated as
$$d(P, \mathrm{Core}(\Gamma^{-1}, \nu)) := \inf_{\mu \in \mathrm{Core}(\Gamma^{-1}, \nu)} d(P, \mu) = 0.$$
Hence the empirical version is obtained by replacing $P$ by an estimate such as $P_n$ to yield $d(P_n, \mathrm{Core}(\Gamma^{-1}, \nu))$. It will naturally depend on the specific choice of metric.
To see the relation between this and other empirical formulations, consider the Kolmogorov-Smirnov metric defined by
$$d_{KS}(\mu_1, \mu_2) = \sup_{A \in \mathcal{B}_{\mathcal{Y}}} (\mu_1(A) - \mu_2(A))$$
for any two probability measures $\mu_1$ and $\mu_2$ on $(\mathcal{Y}, \mathcal{B}_{\mathcal{Y}})$. With this choice of metric, we can derive conditions under which the equalities
$$d_{KS}(P_n, \mathrm{Core}(\Gamma^{-1}, \nu)) = \inf_{\gamma \in \mathrm{Sel}(\Gamma^{-1})} \sup_{A \in \mathcal{B}_{\mathcal{Y}}} (P_n(A) - \nu\gamma^{-1}(A))$$
$$= \sup_{A \in \mathcal{B}_{\mathcal{Y}}} \inf_{\gamma \in \mathrm{Sel}(\Gamma)} (P_n(A) - \nu\gamma(A))$$
$$= \sup_{A \in \mathcal{B}_{\mathcal{Y}}} (P_n(A) - \nu(\Gamma(A)))$$
hold, and therefore this empirical formulation is equivalent to empirical formulations based on (iii), (iv), and (v) below.
We consider (ii) from Theorem 1 and $d$ a metric on the space of probability measures on $(\mathcal{U}, \mathcal{B}_{\mathcal{U}})$. Under the null hypothesis, let $\pi_\nu$ be the family of kernels defined in (ii) of Theorem 1. Denoting by $\mu f$ the integral of a function $f$ by a measure $\mu$, we can write (ii) as $d(\nu, P\pi_\nu) = 0$, which admits $d(\nu, P_n\pi_\nu)$ as empirical counterpart, and the latter is equal to $d(P\pi_\nu, P_n\pi_\nu)$. A notable aspect of this empirical formulation is that for many choices of metric $d$, or indeed pseudo-metric (such as relative entropy), it will take the form of a functional of the empirical process $G_n := \sqrt{n}(P_n - P)$ applied to the functions $y \mapsto \pi_\nu(y)$. Different goodness-of-fit tests can therefore be generalized within a single framework. The difficulty here, of course, is that the kernel $\pi_\nu$ depends on the unknown $P$ in a complicated way through the integral equation (11).

2.3 Empirical representation relative to (iii)
In view of representation (iii) of Theorem 1, i.e. equation (12), the null can be reformulated as the following Monge-Kantorovich mass transportation problem
$$\min_{\pi \in \mathcal{M}(P, \nu)} \int_{\mathcal{Y} \times \mathcal{U}} 1_{\{u \notin \Gamma(y)\}}\, \pi(dy, du) = 0, \tag{18}$$
where the transportation cost function $1_{\{u \notin \Gamma(y)\}}$ is an indicator penalty for violation of the structure.
We now consider the empirical version of this Monge-Kantorovich problem, replacing $P$ by the empirical distribution $P_n$ to yield the functional
$$T^*(P_n, \Gamma, \nu) = \min_{\pi \in \mathcal{M}(P_n, \nu)} \int_{\mathcal{Y} \times \mathcal{U}} 1_{\{u \notin \Gamma(y)\}}\, \pi(dy, du). \tag{19}$$
We shall see below that it is equal to the empirical formulations relative to (iv) and (v).

Since formulations (iv) and (v) from Theorem 1 can be rewritten
$$\sup_{A \in \mathcal{B}_{\mathcal{Y}}} (P(A) - \nu(\Gamma(A))) = 0,$$
the following empirical formulation seems the most natural:
$$\sup_{A \in \mathcal{B}_{\mathcal{Y}}} (P_n(A) - \nu(\Gamma(A))).$$
The following theorem states the equivalence between the latter and the empirical formulation derived from (iii):
Theorem 2:
The following equalities hold:
$$T^*(P_n, \Gamma, \nu) = \max_{f \oplus g \le \varphi} (P_n f + \nu g) \tag{20}$$
$$= \sup_{A \in \mathcal{B}_{\mathcal{Y}}} (P_n(A) - \nu(\Gamma(A))), \tag{21}$$
where $\varphi(y, u) = 1_{\{u \notin \Gamma(y)\}}$, and $f \oplus g \le \varphi$ signifies that the maximum in (20) is taken over all measurable functions $f$ on $\mathcal{Y}$ and $g$ on $\mathcal{U}$ such that for all $(y, u)$, $f(y) + g(u) \le \varphi(y, u)$.
We shall therefore take $T^*(P_n, \Gamma, \nu)$ as our starting point to construct a test statistic in the following section.

3 Specification test
We propose to adopt a test statistic based on the dual Monge-Kantorovich formulation (21), in other words a statistic that penalizes large values of (21). However, $T^*(P_n, \Gamma, \nu)$ seemingly involves checking condition (14) on all sets in $\mathcal{B}_{\mathcal{Y}}$. We need to elicit a reduced class of sets on which to check condition (14). Call such a reduced class $\mathcal{S}$; the resulting statistic is
$$T_{\mathcal{S}}(P_n, \Gamma, \nu) = \sup_{A \in \mathcal{S}} (P_n(A) - \nu(\Gamma(A))). \tag{22}$$
$\mathcal{S}$ is the result of a formal trade-off: it needs to be small enough to allow us to derive a limiting distribution for a suitable re-scaling of $T_{\mathcal{S}}(P_n, \Gamma, \nu)$, and large enough to determine the direction of the inequality $P - \nu\Gamma \le 0$, which corresponds to a requirement that our test retain power against fixed alternatives.
To illustrate these requirements, we start by considering two simple types of structures to be tested. First we shall consider bijective structures (which correspond to our "prelude"), then the case where $\mathcal{Y}$ is finite.
• Bijective structures:
In the case where $\Gamma = \gamma$ is single-valued and bijective, consider the following classes of cells in $\mathbb{R}^{d_y}$:
$$\mathcal{C} = \{(-\infty, y], (y, \infty) : y \in \mathbb{R}^{d_y}\}, \qquad \tilde{\mathcal{C}} = \{(-\infty, y] : y \in \mathbb{R}^{d_y}\}.$$
Notice that
$$\sup_{A \in \mathcal{C}} (P_n(A) - \nu(\gamma(A))) = \sup_{A \in \tilde{\mathcal{C}}} |P_n(A) - \nu(\gamma(A))|$$
and the latter is the classical Kolmogorov-Smirnov specification test statistic. Hence the choice of $\mathcal{C}$ for our reduced class $\mathcal{S}$ is suitable on both counts: we know, as was discussed in the prelude, that $\mathcal{C}$ is a value-determining class for probability measures, hence checking the inequality $P - \nu\gamma \le 0$ on the reduced class is equivalent to checking it on all measurable sets. In addition, from Appendix A1, we know that this class is Vapnik-Červonenkis, and hence that $\sqrt{n}\, T_{\mathcal{C}}(P_n, \gamma, \nu) = \sup_{A \in \mathcal{C}} G_n(A)$ converges weakly to the supremum of a $P$-Brownian bridge, and the test of specification can be constructed based on approximations of the quantiles through simulations of the Brownian bridge or the bootstrap.
• Discrete observables:
In the case where the observables belong to a finite set, the power set $2^{\mathcal{Y}}$ is finite, hence Vapnik-Červonenkis. This will be sufficient to derive the limiting distribution of $\sqrt{n}\, T_{2^{\mathcal{Y}}}(P_n, \Gamma, \nu) = \sqrt{n} \sup_{A \in 2^{\mathcal{Y}}} (P_n(A) - \nu(\Gamma(A)))$. Since the class of all subsets is used, we do not need to worry about the competing requirement that the class determine the direction of the inequality $P - \nu\Gamma \le 0$.
We shall consider the two requirements on the class of sets $\mathcal{S}$ sequentially. First, in the next subsection, we derive the asymptotic distribution of $T_{\mathcal{S}}(P_n, \Gamma, \nu)$ for a given choice of $\mathcal{S}$. Then, in the following subsection, we examine the power of the test based on $T_{\mathcal{S}}(P_n, \Gamma, \nu)$, which amounts to linking the choice of the class of sets $\mathcal{S}$ with classes of alternatives.

We start with a short heuristic description of the behaviour of $T_{\mathcal{S}}(P_n, \Gamma, \nu)$ which will motivate some definitions and constructions. We then give specific sets of conditions for the asymptotic results to hold.
Under the null hypothesis $H_0$, we have $P(A) - \nu(\Gamma(A)) \le 0$ for all $A \in \mathcal{B}_{\mathcal{Y}}$. Recalling that $G_n$ is the empirical process $\sqrt{n}(P_n - P)$, we have
$$\sqrt{n}\, T_{\mathcal{S}}(P_n, \Gamma, \nu) = \sqrt{n} \sup_{A \in \mathcal{S}} (P_n(A) - \nu(\Gamma(A))) = \sup_{A \in \mathcal{S}} \left( G_n(A) + \sqrt{n}\, (P(A) - \nu(\Gamma(A))) \right).$$
Unlike the case of the classical Kolmogorov-Smirnov test, the second term in the previous display does not vanish under the null, since the "regions of indeterminacy" allow $\delta(A) := P(A) - \nu(\Gamma(A))$ to be strictly negative for some sets $A \in \mathcal{S}$. What we know at this stage is that under the null, we have
$$\sqrt{n}\, T_{\mathcal{S}}(P_n, \Gamma, \nu) = \sup_{A \in \mathcal{S}} \left( G_n(A) + \sqrt{n}\, (P(A) - \nu(\Gamma(A))) \right) \le \sup_{A \in \mathcal{S}} G_n(A),$$
but relying on this bound may lead to very conservative inference.
Note that $\delta$ is independent of $n$, so that the scaling factor $\sqrt{n}$ will pull the second term in the previous display to $-\infty$ for all the sets where the inequality is strict.
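For finite $\mathcal{Y}$, the statistic $\sup_{A \in 2^{\mathcal{Y}}} (P_n(A) - \nu(\Gamma(A)))$ can be evaluated by direct enumeration of subsets. The following is a minimal sketch under the additional assumption that $\nu$ has finite support (for instance after discretizing $\mathcal{U}$); the function names and the toy numbers are hypothetical and serve only to illustrate the definition.

```python
from fractions import Fraction
from itertools import chain, combinations


def subsets(points):
    """All subsets of a finite collection, including the empty set."""
    pts = list(points)
    return chain.from_iterable(combinations(pts, r) for r in range(len(pts) + 1))


def image(gamma, A):
    """Gamma(A): union of Gamma(y) over y in A."""
    out = set()
    for y in A:
        out |= gamma[y]
    return out


def capacity_ks_statistic(p_n, nu, gamma):
    """sup over A in 2^Y of P_n(A) - nu(Gamma(A)), by brute force."""
    return max(
        sum(p_n[y] for y in A) - sum(nu[u] for u in image(gamma, A))
        for A in subsets(p_n)
    )


# Hypothetical toy structure: Y = {0, 1}, U = {"a", "b", "c"}.
gamma = {0: {"a", "b"}, 1: {"b", "c"}}
nu = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}

p_ok = {0: Fraction(7, 10), 1: Fraction(3, 10)}   # no set violates the inequality
p_bad = {0: Fraction(9, 10), 1: Fraction(1, 10)}  # A = {0} violates it

print(capacity_ks_statistic(p_ok, nu, gamma))
print(capacity_ks_statistic(p_bad, nu, gamma))
```

Because the empty set is always included, the statistic is never negative; it is zero exactly when no subset violates $P_n(A) \le \nu(\Gamma(A))$. The enumeration is exponential in the cardinality of $\mathcal{Y}$, so this sketch is only practical for small supports.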
This prompts the following definition, illustrated in figure 5:
Figure 5: Examples of sets in $\mathcal{C}_b$ (symbolized by the arrows) in a correctly specified case ($P$ and $\nu$ are uniform, hence correct specification corresponds to the graph of $\Gamma$ containing the diagonal).
Definition 3.1:
We denote the subclass of sets from $\mathcal{S}$ where $P = \nu\Gamma$ by $\mathcal{S}_b$, i.e.
$$\mathcal{S}_b := \{A \in \mathcal{S} : P(A) = \nu(\Gamma(A))\}.$$
If the class $\mathcal{S}$ is a Vapnik-Červonenkis class of sets, the empirical process converges weakly to the $P$-Brownian bridge $G$, i.e. a tight centered Gaussian stochastic process with variance-covariance defined by
$$E\, G(A_1) G(A_2) = P(A_1 \cap A_2) - P(A_1) P(A_2),$$
and the convergence is uniform over the class $\mathcal{S}$ (i.e. the convergence is in $\ell^\infty(\mathcal{F})$, where $\mathcal{F}$ is the class of indicator functions of sets in $\mathcal{S}$), so that by the continuous mapping theorem, the supremum of the empirical process converges weakly to the supremum of the Brownian bridge (for details of the proof, see Appendix A1).
Under (mild) conditions that ensure that the function $\delta$ "takes off" frankly from zero on $\mathcal{S}_b$ to negative values on $\mathcal{S} \setminus \mathcal{S}_b$, the term $\sqrt{n}\, \delta$ dominates the oscillations of the empirical process, and the sets in $\mathcal{S} \setminus \mathcal{S}_b$ drop out from the supremum in the asymptotic expression, so that
$$\sqrt{n}\, T_{\mathcal{S}}(P_n, \Gamma, \nu) \rightsquigarrow \sup_{A \in \mathcal{S}_b} G(A), \tag{23}$$
where $\rightsquigarrow$ denotes weak convergence. Naturally, since $\mathcal{S}_b$ depends on the unknown $P$, we need to find a data dependent class of sets to approximate $\mathcal{S}_b$. By the Law of the Iterated Logarithm (see for instance page 476 of Dudley, 2003), we know that the empirical process $G_n$ is uniformly $O_p(\sqrt{\ln \ln n})$, so that if we construct the data dependent class as in Definition 3.2 below with a bandwidth sequence $h = h_n > 0$ satisfying
$$h_n + h_n^{-1} \sqrt{\frac{\ln \ln n}{n}} \to 0, \tag{24}$$
we shall pick out the sets in $\mathcal{S}_b$ asymptotically.
Definition 3.2:
We denote the data dependent subclass of sets from $\mathcal{S}$ where $P_n \ge \nu\Gamma - h$ by $\hat{\mathcal{S}}_{b,h}$, i.e.
$$\hat{\mathcal{S}}_{b,h} := \{A \in \mathcal{S} : P_n(A) \ge \nu(\Gamma(A)) - h\}.$$
This data dependent class of sets allows us to approximate the distribution of $T_{\mathcal{S}}(P_n, \Gamma, \nu)$ based on the following limiting result
$$\sup_{A \in \hat{\mathcal{S}}_{b,h_n}} G(A) \rightsquigarrow \sup_{A \in \mathcal{S}_b} G(A) \tag{25}$$
under requirement (24) on the bandwidth sequence $h_n$, and the additional requirement that
$$h_n (\ln \ln n) \to 0, \tag{26}$$
which allows us to control local oscillations of the empirical process as well. Note that (24) and (26) are very mild, as they are both satisfied whenever
$$h_n n^{-\zeta} + h_n^{-1} n^{\eta} \to 0, \quad \text{for some } -1/2 < \eta \le \zeta < 0. \tag{27}$$
Hence we shall be able to choose between the following methods for approximating quantiles of the distribution of $T_{\mathcal{S}}(P_n, \Gamma, \nu)$ and constructing rejection regions for our test statistic:
• We can simulate the Brownian bridge and compute the quantiles of the distribution of its supremum over the data dependent class $\hat{\mathcal{S}}_{b,h_n}$ for some choice of $h_n$.
• We can use a subsampling approximation of the quantiles of the distribution of $T_{\mathcal{S}}(P_n, \Gamma, \nu)$. Indeed, $\sup_{A \in \mathcal{S}_b} G(A)$ has continuous distribution function on $[0, +\infty)$, hence the subsampling approximation of quantiles is valid.
Before moving on to specific asymptotic results, we close this heuristic description with a discussion of the cases where the class of saturated sets $\mathcal{S}_b$ is the trivial class $\{\emptyset, \mathcal{Y}\}$. In such cases, the test statistic converges to zero if one chooses the scaling factor $\sqrt{n}$. A refinement of the test will therefore involve a faster rate of convergence, determined through the construction of a local empirical process tailored to the shape of $\nu\Gamma$ close to $\emptyset$ and to $\mathcal{Y}$.
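The simulation approach described above can be sketched for a finite $\mathcal{Y}$ with $\mathcal{S} = 2^{\mathcal{Y}}$. The code below is a rough illustration, not the authors' implementation, and rests on several assumptions that are not in the text: $\nu$ has finite support, the unknown $P$ in the covariance of the Brownian bridge is replaced by $P_n$ (a plug-in), and the bandwidth is set to $h_n = n^{-1/4}$, one hypothetical choice that is compatible with (27). All function names are invented for the example.

```python
import math
import random
from itertools import combinations


def subsets(points):
    """All subsets of a finite collection, as frozensets."""
    pts = list(points)
    return [frozenset(c) for r in range(len(pts) + 1) for c in combinations(pts, r)]


def nu_gamma(nu, gamma, A):
    """nu(Gamma(A)) when nu has finite support."""
    covered = set().union(*(gamma[y] for y in A))
    return sum(nu[u] for u in covered)


def binding_class(p_n, nu, gamma, h):
    """Data dependent class: sets A with P_n(A) >= nu(Gamma(A)) - h."""
    return [A for A in subsets(p_n)
            if sum(p_n[y] for y in A) >= nu_gamma(nu, gamma, A) - h]


def simulated_quantile(p_n, sets, alpha, draws=2000, seed=0):
    """alpha-quantile of sup_{A in sets} G(A), G a Brownian bridge with P replaced by P_n."""
    rng = random.Random(seed)
    ys = list(p_n)
    sups = []
    for _ in range(draws):
        xi = {y: rng.gauss(0.0, 1.0) for y in ys}
        s = sum(math.sqrt(p_n[y]) * xi[y] for y in ys)
        # z has covariance diag(p) - p p', so G(A) = sum of z over A
        # has covariance P(A1 & A2) - P(A1) P(A2).
        z = {y: math.sqrt(p_n[y]) * xi[y] - p_n[y] * s for y in ys}
        sups.append(max(sum(z[y] for y in A) for A in sets))
    sups.sort()
    k = max(0, min(draws - 1, math.ceil(alpha * draws) - 1))
    return sups[k]


# Hypothetical toy inputs: n = 100 observations on Y = {0, 1}.
n = 100
gamma = {0: {"a", "b"}, 1: {"b", "c"}}
nu = {"a": 0.5, "b": 0.3, "c": 0.2}
p_n = {0: 0.9, 1: 0.1}

h_n = n ** -0.25                      # assumed bandwidth choice
cls = binding_class(p_n, nu, gamma, h_n)
q95 = simulated_quantile(p_n, cls, 0.95)
print(sorted(map(sorted, cls)), q95)
```

The test would then compare $\sqrt{n}\, T_{\mathcal{S}}(P_n, \Gamma, \nu)$ with the simulated quantile. Since the empty set always belongs to the selected class and $G(\emptyset) = 0$, the simulated supremum is non-negative in every draw.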
We now turn to specific conditions on the structure $(\Gamma, \nu)$ and the law $P$ of the observables such that results (23), which allows the subsampling approach, and (25), which then also allows the simulation approach, hold.
(a) Case where $\mathcal{Y}$ is finite and $\mathcal{S}$ is the class of all subsets, $\mathcal{S} = 2^{\mathcal{Y}}$.
In that case, we show in Theorem 3a below that both approaches to inference are valid.
Theorem 3a: If $\mathcal{Y}$ is finite and $\mathcal{S} = 2^{\mathcal{Y}}$, (23) and (25) hold.
(b) Case where $\mathcal{Y} = \mathbb{R}^{d_y}$, $P$ is absolutely continuous with respect to Lebesgue measure and $\mathcal{S} = \{(y_1, z_1) \times \ldots \times (y_{d_y}, z_{d_y}) : y_1, \ldots, y_{d_y}, z_1, \ldots, z_{d_y} \in \mathbb{R}\}$ or any subclass, such as the class $\mathcal{C}$ defined above.
As indicated above, the asymptotic results are derived under assumptions such that the function $\delta$ "takes off" frankly from zero. To make this precise, we introduce the following "frank separation" assumption. Recall that if $d$ is the Euclidean metric on $\mathcal{Y}$, the Hausdorff metric $d_H$ between two sets $A_1$ and $A_2$ is defined by
$$d_H(A_1, A_2) = \max\left( \sup_{y \in A_1} \inf_{z \in A_2} d(y, z),\ \sup_{z \in A_2} \inf_{y \in A_1} d(y, z) \right).$$
We need to ensure that on sets that are sufficiently distant from sets in $\mathcal{S}_b$ (where the inequality is binding), $\delta$ is sufficiently negative so that it dominates local oscillations of the empirical process. To formalize this, we define the subclass of $\mathcal{S}$ of sets such that the inequality is nearly binding.
Definition 3.3:
We denote the subclass of sets from $\mathcal{S}$ where $P \ge \nu\Gamma - h$ by $\mathcal{S}_{b,h}$, i.e.
$$\mathcal{S}_{b,h} := \{A \in \mathcal{S} : P(A) \ge \nu(\Gamma(A)) - h\}.$$
Note that since $P$ is absolutely continuous, considering only open intervals is without loss of generality.
We can now state
Assumption FS (Frank Separation):
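There exists $K > 0$ and $0 < \eta < 1$ such that for all $A \in \mathcal{S}_{b,h}$, for $h > 0$ sufficiently small, there exists an $A_b \in \mathcal{S}_b$ such that $A_b \subseteq A$ and $d_H(A, A_b) \le K h^{\eta}$.
Remark 1: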
This assumption is very mild, in the sense that it fails only in pathological cases, such as the case where $\mathcal{Y} = \mathbb{R}$, $\mathcal{S} = \mathcal{C}$, and $y \mapsto P((-\infty, y]) - \nu(\Gamma((-\infty, y]))$ is $C^\infty$ with all derivatives equal to zero at some $y = y_0$ such that $(-\infty, y_0] \in \mathcal{C}$.
Then, we have:
Theorem 3b:
Suppose Assumption FS and (27) hold and that $P$ is absolutely continuous with respect to Lebesgue measure. Then (23) and (25) hold.
The proof is based on the following lemma,
Lemma 3a:
Under the conditions of Theorem 3b, we have
$$\sup_{A \in \mathcal{S}_{b,h_n}} G_n(A) \rightsquigarrow \sup_{A \in \mathcal{S}_b} G(A),$$
which involves bounds on oscillations of the empirical process.

As mentioned before, to ensure consistency of our specification test statistic, we need to derive conditions on the structure $(\Gamma, \nu)$ and the law $P$ of observables such that all violations of the inequality $P \le \nu\Gamma$ will be detected asymptotically with a test based on the statistic $T_{\mathcal{S}}(P_n, \Gamma, \nu)$. Before giving specific results, we shall try to convey the extent of the difficulties involved, in comparison with the case of the classical Kolmogorov-Smirnov test which was developed in our prelude.
When testing the equality of two probability measures, as in the Kolmogorov-Smirnov test, we need a class of sets that will determine the value of the law $P$, since it will ensure that if the equality holds on this class of sets, it holds everywhere. To be more precise, we need a convergence determining class (see section 2.6 page 18 of van der Vaart, 1998) since our test is asymptotic.
When testing the inequality $P \le \nu\Gamma$, the situation is complicated in two ways. First, $\nu\Gamma$ is a set function, but it is generally not additive unless $\Gamma$ is bijective, and a convergence determining class is much harder to come by. Second, determining the value of $\nu\Gamma$ may not be sufficient, since it may not guarantee that the direction of the inequality $P \le \nu\Gamma$ will be maintained from the reduced convergence determining class to all measurable sets. We discuss these two points in the following subsections.

$\nu\Gamma$:
The set function $A \mapsto \nu(\Gamma(A))$ is a Choquet capacity functional (for definitions and properties, see Appendix A2), and the following lemma (Lemma 1.14 of Salinetti and Wets, 1986) provides a convergence determining class in great generality. Recall that a closed ball $B(y, \eta)$ with center $y$ and radius $\eta$ is the set of points in $\mathcal{Y}$ whose distance to $y$ is less than or equal to $\eta$.
Define $\mathcal{S}_{SW}$ as the class of compact subsets of $\mathcal{Y}$ with the following two properties:
(C1) Elements of $\mathcal{S}_{SW}$ are finite unions of closed balls with positive radii,
(C2) Elements of $\mathcal{S}_{SW}$ are continuity sets for the Choquet capacity functional $A \mapsto \nu(\Gamma(A))$; in other words, if $A \in \mathcal{S}_{SW}$, then $\nu(\Gamma(\mathrm{cl}(A))) = \nu(\Gamma(\mathrm{int}(A)))$.
Then we have:
Lemma SW:
The class $\mathcal{S}_{SW}$ is convergence determining.
The class $\mathcal{S}_{SW}$ is not a Vapnik-Červonenkis class of sets since for any finite collection of points, there is a collection of finite unions of balls that shatters it (see Appendix A1). However, there is a natural restriction of this class which is. In the case where $\mathcal{Y} = \mathbb{R}^{d_y}$, $\mathcal{S}_{SW}$ can be redefined with rectangles instead of balls. Take an integer $K$. Define the class of finite unions of at most $K$ rectangles:
$$\mathcal{S}_K = \left\{ \bigcup_{k \le K} (y_k, z_k) : (y_k, z_k) \in \mathbb{R}^{d_y} \right\}.$$
Then we have
Lemma 3b: $\mathcal{S}_K$ is a Vapnik-Červonenkis class of sets.
Hence this class is amenable to asymptotic treatment.

$\nu\Gamma$
The requirement, that we call "core determining", on the class $\mathcal{S}$ that $P(A) \le \nu(\Gamma(A))$ for all $A \in \mathcal{S}$ imply $P(A) \le \nu(\Gamma(A))$ for all measurable $A$ is apparently more stringent than the requirement that the values of the set function $\nu(\Gamma(\cdot))$ on all measurable sets be determined by its values on $\mathcal{S}$.
Definition 3.4:
A class $\mathcal{S}$ of subsets of $\mathcal{Y}$ is core determining for $(\Gamma, \nu)$ if
$$\sup_{\mathcal{S}} (P - \nu\Gamma) = 0 \implies \sup_{\mathcal{B}_{\mathcal{Y}}} (P - \nu\Gamma) = 0.$$
We have noted already the obvious fact:
Fact 1: $\mathcal{S} = 2^{\mathcal{Y}}$ is core determining for observables on a finite set $\mathcal{Y}$.
A close inspection of the proof of Theorem 2 shows the following fact:
Fact 2:
The class $\mathcal{F}_{\mathcal{Y}}$ of closed subsets of $\mathcal{Y}$ is core determining.
We now show that we can actually say much more by linking the core determining property with the convergence determining property, and showing that the class $\tilde{\mathcal{S}}_{SW}$ of finite unions of open balls with positive radii (or alternatively the class of finite unions of open rectangles) is core determining.
First, we need to consider the following assumptions on the structure:
Assumption (CD1): $\mathcal{Y}$ is a compact subset of $\mathbb{R}^{d_y}$, and $\mathcal{U}$ is a compact subset of $\mathbb{R}^{d_u}$.
Assumption (CD2): $P$ and $\nu$ are absolutely continuous with respect to Lebesgue measure.
Assumption (CD3):
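There exists $\gamma \in \mathrm{Sel}(\Gamma)$ such that $P(A) \to 0$ implies $\nu(\gamma(A)) \to 0$. This holds, for instance, in either of the following cases:
• There exists $\gamma \in \mathrm{Sel}(\Gamma)$ injective, such that $\nu\gamma$ (now a probability measure) is absolutely continuous with respect to $P$.
• There exists $\gamma \in \mathrm{Sel}(\Gamma)$ and $\alpha > 0$ such that $\nu(\gamma(A)) \le \alpha P(A)$ for all $A$ measurable.
Assumption (CD4):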
$\Gamma$ is convex-valued, i.e. $\Gamma(y)$ is a convex set for all $y \in \mathcal{Y}$.
This assumption rules out some interesting cases, for instance when the graph of $\Gamma$ (defined in (8)) is the union of the graphs of two functions. However, our conditions are not minimal, and such cases could be treated under a different set of conditions.
We define the upper and lower envelopes of the graph of $\Gamma$ by
Definition 3.5:
The upper (resp. lower) envelope of Graph $\Gamma$ is the function $y \mapsto u(y) = \sup\{\Gamma(y)\}$ (resp. $y \mapsto l(y) = \inf\{\Gamma(y)\}$).
Assumption (CD5):
The upper and lower envelopes $u$ and $l$ of the graph of $\Gamma$ are Lipschitz, i.e. there exists $\kappa \ge 0$ such that for all $y_1, y_2 \in \mathcal{Y}$,
$$\max\left( |u(y_1) - u(y_2)|,\ |l(y_1) - l(y_2)| \right) \le \kappa\, |y_1 - y_2|.$$
To state our last assumption, we need an extra definition:
Definition 3.6:
A forking point of $\Gamma$ is a $y_0$ such that for any $\epsilon > 0$, there exist $y_1$ and $y_2$ in the open ball $B(y_0, \epsilon)$ such that $\Gamma(y_1)$ is a singleton, and $\Gamma(y_2)$ is not.
Assumption (CD6):
$\Gamma$ has at most a finite number of forking points.
Note that this is a technical assumption that is violated only in pathological cases, and that is akin to the Frank Separation Assumption (FS).
We can now state the result:
Theorem 3c:
Under assumptions (CD1)-(CD6), the class $\tilde{\mathcal{S}}_{SW}$ of finite unions of open balls with positive radii (or alternatively the class of finite unions of open rectangles) is core determining.
This result is fundamental in that it reduces the problem of checking consistency of the test based on the statistic $T_{\mathcal{S}}(P_n, \Gamma, \nu)$ to the problem of checking whether $P(A) \le \nu(\Gamma(A))$ for $A$ a finite union of balls (or rectangles) in $\mathbb{R}^{d_y}$ whenever $P \le \nu\Gamma$ on $\mathcal{S}$.
We shall now apply this reasoning to give some conditions on the structure $(\Gamma, \nu)$ under which the test based on statistic $T_{\mathcal{S}}(P_n, \Gamma, \nu)$ is consistent with $\mathcal{S} = \mathcal{C} = \{(-\infty, y], (y, \infty) : y \in \mathbb{R}\}$, such as in figure 6, and conditions under which the class $\mathcal{C}$ may not be core determining, but the class $\mathcal{S} = \mathcal{R} = \{(y, z) : y, z \in \mathbb{R}\}$ is. We thereby define classes of alternatives that our tests based on $T_{\mathcal{C}}(P_n, \Gamma, \nu)$ and $T_{\mathcal{R}}(P_n, \Gamma, \nu)$ have power against in case $\mathcal{Y} = \mathbb{R}$ and $P$ is absolutely continuous with respect to Lebesgue measure.
Figure 6: Violation of the null that can be detected by the class of cells $\mathcal{C}$. Notice in particular that the inequality $P \le \nu\Gamma$ is violated on the set $A$ ($P$ and $\nu$ are uniform).
Theorem 3d:
If assumptions (CD1) and (CD2) are satisfied, and the graph of $\Gamma$ has increasing upper and lower envelopes, then $\mathcal{C}$ is core determining, and hence the specification test based on the statistic $T_{\mathcal{C}}(P_n, \Gamma, \nu)$ is consistent.
In figure 7, we show a case where the null hypothesis does not hold, but a test based on $T_{\mathcal{C}}(P_n, \Gamma, \nu)$ fails to detect it because of the lack of monotonicity of the upper envelope. In that case, we need the larger class of sets $\mathcal{R}$ to detect the departure from the null.
Figure 7: Violation of the null that cannot be detected by the class of cells $\mathcal{C}$, but can be detected by the class of all intervals. Notice in particular that the inequality $P \le \nu\Gamma$ is violated on $A$ but not on $B$ ($P$ and $\nu$ are uniform).

The test of specification that we have developed can be applied to the construction of confidence regions in case the structure depends on unknown parameters. Let $\theta \in \Theta \subseteq \mathbb{R}^{d_\theta}$ be a vector of structural parameters, and let the model be given by $(\Gamma_\theta, \nu_\theta)$.
Definition 4.1:
The identified set $\Theta_I$ is defined as the set of all $\theta \in \Theta$ such that the null hypothesis $H_0(\theta)$ of compatibility of $(\Gamma_\theta, \nu_\theta)$ with $P$ (as defined in Theorems 1 and 1') holds true.
This section is an outline of the application of our testing procedure to the construction of confidence regions for elements of the identified set and for the identified set itself.

To form a confidence region that covers (with at least some pre-determined probability) each parameter value that makes the structure compatible with the distribution of observables, we propose to invert our test statistic to form a confidence region for elements of $\Theta_I$. In other words, for a given $\alpha \in (0, 1)$, we seek a region $CR_n$ such that, for all $\theta \in \Theta_I$,
$$\liminf_n P(\theta \in CR_n) \ge \alpha.$$
The confidence region obtained from inverting the test has the form
$$CR_n = \{\theta \in \Theta : \sqrt{n}\, T_{\mathcal{S}}(P_n, \Gamma_\theta, \nu_\theta) \le \hat{Q}_\alpha(\theta)\},$$
where $\mathcal{S}$ is a class of sets which is core determining for all $\theta \in \Theta$ and $\hat{Q}_\alpha(\theta)$ is an approximation of the $\alpha$ quantile of the distribution of $T_{\mathcal{S}}(P_n, \Gamma_\theta, \nu_\theta)$. A valid approximation can be obtained using either one of the two methods proposed at the end of section 3.1.1.

To form a region that covers the whole identified set with pre-determined probability, we need a region $CR_n^*$ such that
$$\liminf_n P(\Theta_I \subseteq CR_n^*) \ge \alpha.$$
The latter can be obtained using the method proposed by Chernozhukov et al., 2002 applied to the criterion function $\left( \sup_{A \in \mathcal{S}} (P(A) - \nu_\theta(\Gamma_\theta(A))) \right)$ with sample criterion $T_{\mathcal{S}}(P_n, \Gamma_\theta, \nu_\theta)$ (under the condition that C1, C2, C4 and C5 of Chernozhukov et al., 2002 hold). A main contribution of this paper, therefore, is to provide the first natural and general choice of criterion function, and thereby pave the way for a comparison of criteria and a discussion of optimality.

We now spell out our procedures on a very simple example: example 5 of section 1.
The structure is described by the multi-valued mapping: $\Gamma(1) = [0, \lambda]$ and $\Gamma(0) = [0, 1]$. Since $y$ is Bernoulli, we can write $P = (1 - p, p)'$ with $p$ the probability of a 1. For the distribution of $u$, we consider a parametric exponential family on $[0, 1]$, where $\nu_\phi$ has distribution function $u^\phi$, with $\phi > 0$. Our parameter vector is therefore $\theta = (\lambda, \phi)'$.
The null hypothesis in this case is immediately seen to be equivalent to $p \le \lambda^\phi$ for a given value of the parameter vector. Indeed, the easiest formulation to use is probably formulation (v), which requires that $p = P(\{1\}) \le \nu(\Gamma(1)) = \nu[0, \lambda] = \lambda^\phi$. Hence $T_{2^{\{0,1\}}}(P_n, \Gamma_\theta, \nu_\theta) = p_n - \lambda^\phi$. Now, if $p = \lambda^\phi$, then $\mathcal{S}_b = \{\emptyset, \{0\}, \{1\}, \{0, 1\}\}$ and then $\sqrt{n}(p_n - \lambda^\phi)$ converges weakly to a normal random variable with mean zero and variance $p(1 - p)$, whereas if $p < \lambda^\phi$, then $\mathcal{S}_b = \{\emptyset, \{0, 1\}\}$ and $\sqrt{n}(p_n - \lambda^\phi)$ converges to zero. In either case, for a given choice of sequence $h_n$, $\hat{\mathcal{S}}_{b,h_n}$ is equal to $\{\emptyset, \{0\}, \{1\}, \{0, 1\}\}$ if $p_n \ge \lambda^\phi - h_n$ and $\{\emptyset, \{0, 1\}\}$ otherwise.
The $\alpha$ quantile of $\sqrt{n}\, T_{2^{\{0,1\}}}(P_n, \Gamma_\theta, \nu_\theta) = \sqrt{n}(p_n - \lambda^\phi)$ can be approximated with 0 if $p_n < \lambda^\phi - h_n$, and with the $\alpha$ quantile of the normal with mean zero and variance $p_n(1 - p_n)$ if $p_n \ge \lambda^\phi - h_n$. Alternatively, $Q_\alpha(\theta)$ can be approximated using subsampling (though it would be a serious case of overkill). The procedure would then be the following: consider all (or a large number $B_n$ of) the samples of size $b_n$ from the sample of size $n$ with $1/b_n + b_n/n \to 0$, and approximate $Q_\alpha(\theta)$ with
$$\hat{Q}_\alpha(\theta) = \inf\left\{ x : \frac{1}{B_n} \sum_{i=1}^{B_n} 1\left\{ \sqrt{b}\, T_{\mathcal{S}}(P_b^i, \Gamma_\theta, \nu_\theta) \le x \right\} \ge \alpha \right\},$$
where $P_b^i$ is the empirical distribution of the $i$-th subsample (of size $b = b_n$). A confidence region is then
$$CR_n = \{\theta \in [0, 1] \times (0, +\infty) : \sqrt{n}\, T_{\mathcal{S}}(P_n, \Gamma_\theta, \nu_\theta) \le \hat{Q}_\alpha(\theta)\}.$$

Since structures are often given without a specification of the distribution of the unobservable variables, it is customary to assume only moment conditions, such as a given mean (taken to be equal to zero without loss of generality) and finite variance.
This includes as special cases structures defined by moment inequality conditions.
In such cases, a similar approach can be taken where the null is defined as the existence of a joint law supported on the set $\{u \in \Gamma_\theta(y)\}$ with marginal $P$ on $\mathcal{Y}$ and marginal on $\mathcal{U}$ satisfying some moment conditions. Calling $\mathcal{V}$ the set of laws that satisfy the said conditions, the dual formulation delivers a feasible version of the statistic
$$\inf_{\nu \in \mathcal{V}} \sup_{A \in \mathcal{S}} [P(A) - \nu(\Gamma_\theta(A))].$$
This involves a number of difficulties, which are the subject of a companion paper GH:2006. We only give here, as an illustration, the application of the method on a classic special case of example 3.
Suppose one observes income brackets with centers in $\mathcal{Y} = \{y_1, \ldots, y_k\}$ with $y_1 < \ldots < y_k$ and width $\delta$. True income is unobservable, and one is interested in the mean of true income. The model correspondence is given by $\Gamma(y) = (y - \delta/2, y + \delta/2)$. Let $p(y_i)$ (resp. $p_n(y_i)$) denote the true (resp. empirical) probability of $\{Y = y_i\}$.
Consider formulation (v'): $\nu \le P\Gamma^{-1}$ of the null hypothesis. Denoting $\Gamma^u(B) = \{y : \Gamma(y) \subseteq B\}$ for any $B \in \mathcal{B}_{\mathcal{U}}$, and writing $\phi^* = P\Gamma^{-1}$ and $\phi_* = P\Gamma^u$, we have (using Definition A2.6 and Lemma A2.2 in Appendix A2) that under the null, the expectation of any measurable function $f$ of the unobservable variables satisfies
$$\int_{Ch} f\, d\phi_* \le E f \le \int_{Ch} f\, d\phi^*.$$
Denoting by $\phi_n^* = P_n\Gamma^{-1}$ and $\phi_{n*} = P_n\Gamma^u$ the empirical versions of $\phi^*$ and $\phi_*$, the set $\left[ \int_{Ch} f\, d\phi_{n*},\ \int_{Ch} f\, d\phi_n^* \right]$ estimates the identified set $\left[ \int_{Ch} f\, d\phi_*,\ \int_{Ch} f\, d\phi^* \right]$. In the case considered here, where $f$ is the identity, this identified set equals
$$\left[ \sum_{i=1}^{k} (y_i - \delta/2)\, p(y_i),\ \sum_{i=1}^{k} (y_i + \delta/2)\, p(y_i) \right],$$
which is equal to
$$\left[ \sum_{i=1}^{k} (y_i - \delta/2) \left( p_n(y_i) - g_{n,i}/\sqrt{n} \right),\ \sum_{i=1}^{k} (y_i + \delta/2) \left( p_n(y_i) - g_{n,i}/\sqrt{n} \right) \right],$$
from which asymptotically valid confidence regions can be constructed, since $g_n = (g_{n,1}, \ldots, g_{n,k})'$, with $g_{n,i} = \sqrt{n}(p_n(y_i) - p(y_i))$, is asymptotically a Gaussian vector.

Conclusion
We have provided a coherent definition of correct specification of structures with no identifying assumptions. This definition is the result of the equivalence of several natural generalizations of the hypothesis of correct specification in the identified case. These theoretical formulations of correct specification have natural empirical counterparts, several of which are also shown to be equivalent, and a test of specification is based on the latter. When the structure is parameterized, this test can be inverted to yield confidence regions for the set of structural parameters for which the null hypothesis of correct specification is satisfied.
This work has the following natural extensions. First, the whole approach is articulated around the existence of a joint measure with given marginals, hence it is essentially parametric in nature, but can be naturally extended to a problem of existence of a joint probability measure with one marginal given (the distribution of observables) and moment conditions on the other marginal (the distribution of unobservable variables). This natural extension of our work will nest structures defined by moment inequalities, and therefore deliver a way to construct confidence regions in such cases. Second, the statistic we have used to examine correct specification can be derived from the Kolmogorov-Smirnov distance between the empirical distribution and the set of data generating processes implied by the structure. Other distances and pseudo-distances will generate different specification statistics, and relative entropy may be a particularly good candidate, in that it produces optimal inference in the special case of identified structures.

Appendix A: Additional concepts and results
A1: Convergence of the empirical process
We give here definitions and results that we use in our asymptotic analysis. The definition of a Vapnik-Červonenkis class of sets is given in section 2.6.1 page 134 of van der Vaart and Wellner, 1996 and reproduced here for the convenience of the reader.
Definition A1.1:
Let $\mathcal{S}$ be a collection of subsets of a set $\mathcal{X}$. An arbitrary set of $n$ points $\{x_1, \ldots, x_n\}$ possesses $2^n$ subsets. Say that $\mathcal{S}$ picks out a certain subset from $\{x_1, \ldots, x_n\}$ if this can be formed as the set $C \cap \{x_1, \ldots, x_n\}$ for a $C$ in $\mathcal{S}$. The collection $\mathcal{S}$ is said to shatter $\{x_1, \ldots, x_n\}$ if each of its $2^n$ subsets can be picked out in this manner. The Vapnik-Červonenkis index of the class $\mathcal{S}$ is the smallest $n$ for which no set of cardinality $n$ is shattered by $\mathcal{S}$. A Vapnik-Červonenkis class of sets is a class with finite Vapnik-Červonenkis index.
Fact A1:
The class of cells $\mathcal{C}$ is a Vapnik-Červonenkis class of sets (see Example 2.6.1 page 135 of van der Vaart and Wellner, 1996).
Definition A1.2:
The $P$-Brownian bridge is the tight centered Gaussian stochastic process with variance-covariance defined by $E\, G(A_1) G(A_2) = P(A_1 \cap A_2) - P(A_1) P(A_2)$.
Theorem A1.1: If $\mathcal{S}$ is a Vapnik-Červonenkis class of sets, the empirical process converges weakly to the $P$-Brownian bridge $G$, and the convergence is uniform over the class $\mathcal{S}$ (i.e. the convergence is in $\ell^\infty(\mathcal{F})$, where $\mathcal{F}$ is the class of indicator functions of sets in $\mathcal{S}$).
Proof of Theorem A1.1:
We assume that $\mathcal{S}$ is a Vapnik-Červonenkis class of sets. Call $\mathcal{F}$ the class of indicator functions of sets in $\mathcal{S}$, and call $V(\mathcal{F})$ the Vapnik-Červonenkis index of the corresponding class of sets. By Theorem 2.6.4 page 136 of van der Vaart and Wellner, 1996, there exists a constant $C$ such that for all probability measures $Q$ and all $0 < \varepsilon < 1$, the covering number (see Definition 2.2.3 page 98 of van der Vaart and Wellner, 1996) of $\mathcal{F}$ in the $L_2(Q)$ metric, $N(\varepsilon, \mathcal{F}, L_2(Q))$, satisfies
$$N(\varepsilon, \mathcal{F}, L_2(Q)) \le C\, V(\mathcal{F})\, (4e)^{V(\mathcal{F})}\, (1/\varepsilon)^{2V(\mathcal{F}) - 2}.$$
Hence, we have
$$\int_0^\infty \sup_Q \sqrt{\ln N(\varepsilon, \mathcal{F}, L_2(Q))}\, d\varepsilon < \infty.$$
Since $\mathcal{F}$ is a class of indicator functions, the above suffices to satisfy the conditions of Theorem 2.5.2 page 127 of van der Vaart and Wellner, 1996, and $\mathcal{F}$ is $P$-Donsker, which by definition means that $G_n$ converges in $\ell^\infty(\mathcal{F})$.
By the continuous mapping theorem, we immediately have the following corollary:
Corollary A1.1: If $\mathcal{S}$ is a Vapnik-Červonenkis class of sets, then $\sup_{\mathcal{S}} G_n$ converges weakly to $\sup_{\mathcal{S}} G$.

A2: Choquet capacity functionals
We collect here all the definitions, equivalent representations and properties of Choquet capacity functionals (a.k.a. distributions of random sets or infinitely alternating capacities) that are useful for this paper. All the results presented here can be traced back to Choquet, 1953.
Take $X$ a Polish space (complete metrizable and separable topological space) endowed with its Borel $\sigma$-algebra $\mathcal{B}$. For a sequence of numbers, $a_n \uparrow a$ (resp. $a_n \downarrow a$) denotes convergence in increasing (resp. decreasing) values, whereas for a sequence of sets, the notation $A_n \uparrow A$ (resp. $A_n \downarrow A$) denotes $A_n \subseteq A_{n+1}$ for all $n$ and $A = \bigcup_n A_n$ (resp. $A_{n+1} \subseteq A_n$ for all $n$ and $A = \bigcap_n A_n$). Finally, denote $\mathcal{F}$ (resp. $\mathcal{G}$) the set of closed (resp. open) subsets of $X$, and for $A \in \mathcal{B}$, $\mathcal{F}_A = \{F \in \mathcal{F} : F \cap A \neq \emptyset\}$.
Definition A2.1:
A capacity is a set function $\varphi : \mathcal{B} \to \mathbb{R}$ satisfying
(i) $\varphi(\emptyset) = 0$ and $\varphi(X) = 1$,
(ii) for any two Borel sets $A \subseteq B$, we have $\varphi(A) \le \varphi(B)$,
(iii) for all sequences of Borel sets $A_n \uparrow A$, we have $\varphi(A_n) \uparrow \varphi(A)$,
(iv) for all sequences of closed sets $F_n \downarrow F$, we have $\varphi(F_n) \downarrow \varphi(F)$.
Definition A2.2:
A capacity $\varphi$ is called infinitely alternating if for any $n$ and any sequence $A_1, \ldots, A_n$ of Borel sets,
$$\varphi\Big(\bigcap_{i=1}^n A_i\Big) \le \sum_{\emptyset \neq I \subseteq \{1, 2, \ldots, n\}} (-1)^{|I|+1}\, \varphi\Big(\bigcup_{i \in I} A_i\Big).$$
We call Choquet capacity functional an infinitely alternating capacity. Probability measures are special cases of Choquet capacity functionals, for which the alternating inequality of definition A2.2 holds as an equality (known as Poincaré's equality).
The following theorem states that infinite alternation is a characteristic property of distributions of random sets (for a proof, see for instance section 2.1 of Matheron, 1975).
Theorem A2.1: $\varphi$ is a Choquet capacity functional (i.e. an infinitely alternating capacity) if and only if there exists a probability measure $P$ on $\mathcal{F}$ such that, for all $A \in \mathcal{B}$, $\varphi(A) = P(\mathcal{F}_A)$, and such a $P$ is unique.
$\varphi$ is therefore called the distribution of the random set associated with the probability measure $P$, which allows the following definition of convergence determining classes for a Choquet capacity functional:
Definition A2.3:
A class $\mathcal{C}$ of Borel subsets of $X$ is called convergence determining for a Choquet capacity functional $\varphi$ if and only if the class $\{\mathcal{F}_A ; A \in \mathcal{C}\}$ is convergence determining for the probability measure $P$ associated with $\varphi$ as in Theorem A2.1.
We now look at the relation with measurable correspondences, defined as correspondences that satisfy Assumption 1 in the main text. Let $(\Omega, \mathcal{B}, P)$ be a probability space.
Definition A2.4:
A non-empty and closed valued correspondence $\Gamma : \Omega \rightrightarrows X$ is called a measurable correspondence if for each open set $O \subseteq X$, $\Gamma^{-1}(O) = \{\omega \in \Omega \mid \Gamma(\omega) \cap O \neq \emptyset\}$ belongs to $\mathcal{B}$.
If we define $\varphi$ by $\varphi(A) = P\{\omega \in \Omega \mid \Gamma(\omega) \cap A \neq \emptyset\}$, for all $A \in \mathcal{B}$, then $\varphi$ is a Choquet capacity functional (from section 26.8 page 209 of Choquet, 1953), and its core is defined by the following:
Definition A2.5: The core of $\varphi$ defined above is the set of probability measures that are set-wise dominated by $\varphi$, i.e. $\mathrm{Core}(\varphi) := \mathrm{Core}(\Gamma, P) = \{Q : Q(A) \le \varphi(A) \text{ for all measurable } A\}$.
We add useful regularity properties of Choquet capacity functionals:
Lemma A2.1: If $\varphi$ is a Choquet capacity functional, by the Choquet Capacitability Theorem (section 38.2 page 232 of Choquet, 1953), in addition to properties (i)-(iv) of Definition A2.1, it satisfies
(v) $\varphi(A) = \sup\{\varphi(F) : F \subseteq A, F \in \mathcal{F}\}$ for all $A \in \mathcal{B}$,
(vi) $\varphi(A) = \inf\{\varphi(G) : A \subseteq G, G \in \mathcal{G}\}$ for all $A \in \mathcal{B}$.
Several notions extend integration to the case of non-additive measures. We only use explicitly the notion of Choquet integral, which we define below.
Definition A2.6:
The Choquet integral of a bounded measurable function $f$ with respect to a capacity $\varphi$ is defined by
$$\int_{Ch} f \, d\varphi = \int_0^\infty \varphi(\{f \ge x\}) \, dx + \int_{-\infty}^0 \big(\varphi(\{f \ge x\}) - 1\big) \, dx. \tag{28}$$
The Choquet integral reduces to the Lebesgue integral when $\varphi$ is a probability measure. In addition, it has a very simple expression in case $\varphi$ is a Choquet capacity functional (see Theorem 1 of Castaldo et al., 2004).
Lemma A2.2: If $\varphi$ is a Choquet capacity functional, then for all $f$ bounded measurable, the Choquet integral of $f$ with respect to $\varphi$ is given by $\int_{Ch} f \, d\varphi = \sup_{Q \in \mathrm{Core}(\varphi)} \int f \, dQ$.
Appendix B: Proofs of the results in the main text
Reader’s guide to the proofs:
In the proof of Theorem 1, a result very close to (ii) ⇐⇒ (iv) is stated in Wasserman, 1990, but the proof is essentially omitted. The proof of (i) ⇐⇒ (iii) relies on Corollary 1 of Castaldo et al., 2004, which allows us to generalize Proposition 1 of Jovanovic, 1989. The proof of (iv) ⇐⇒ (v) is straightforward, whereas the proof of (iii) ⇐⇒ (v) is similar to Theorem 2. The latter is a simple application of lemma 1, which itself is a simplification of the main generalized Monge-Kantorovich duality theorem of Kellerer, 1984. Lemma 1[a] is lemma 11.8.5 of Dudley, 2003. The proof given here for completeness is due to N. Belili. The rest of Theorem 2 is a specialization of the duality result to zero-one cost, which can also be proved using Proposition (3.3) page 424 of Kellerer, 1984, but we give a direct proof to show that we can specialize to closed sets, a fact that we use in the discussion of the power of the test.
Theorem 3a is straightforward. Theorem 3b is structured around the inequality
$$\sup_{\mathcal{S}_b} \mathbb{G}_n \le \sup_{\hat{\mathcal{S}}_{b,h_n}} \mathbb{G}_n \le \sup_{\mathcal{S}_{b,l_n}} \mathbb{G}_n,$$
which holds on an event of large enough probability, with suitable bandwidth sequences $h_n \ll l_n$. Then, lemma 3a shows that $\sup_{\mathcal{S}_{b,l_n}} \mathbb{G}_n$ converges weakly to the same limit as $\sup_{\mathcal{S}_b} \mathbb{G}_n$, namely $\sup_{\mathcal{S}_b} \mathbb{G}$. Finally, the same reasoning is invoked to show that $\sup_{\hat{\mathcal{S}}_{b,h_n}} \mathbb{G}$ also converges to the same limit (but for this we need to assume that the bandwidth satisfies condition (27) rather than (24) and (26)). Lemma 3a relies on the construction of a local empirical process relative to the thin sets $A \setminus A_b$, where $A$ is in $\mathcal{S}_{b,l_n}$ and $A_b$ is in $\mathcal{S}_b$ and is close to $A$ in terms of Hausdorff metric (hence the term "thin").
Lemma 3b, like Appendix A1, brings together some facts that are scattered in van der Vaart and Wellner, 1996. Theorem 3c uses the regularity properties of Choquet capacity functionals to show that finite unions of balls are core determining.
Given a closed set $F$, using outer regularity of $P$ and a compactness argument, a decreasing sequence of finite unions of open balls is constructed that satisfies two requirements: it converges to $F$ both in $P$-measure and in Hausdorff distance. The regularity properties of the correspondence $\Gamma$ are then used to control the Hausdorff distance between the images by $\Gamma$ of $F$ and the approximating sequence. The absolute continuity of $\nu$ is then invoked to conclude, so that the sign of the inequality is maintained by continuity. Theorem 3d ties in the problem of finding core determining classes with the Monge-Kantorovich dual under zero-one cost: pairs $(1_F, -1_{\Gamma(F)})$ with $F$ in the larger class are shown to be convex combinations of pairs $(1_A, -1_{\Gamma(A)})$ with $A$ in the potential core determining class.
Proof of Theorem 1:
[a] We first show equivalences (i) ⇐⇒ (iv) ⇐⇒ (ii):
Call $\Delta(B)$ the set of all Borel probability measures with support $B$. Under Assumption 1, the map $y \mapsto \Delta(\Gamma(y))$ is a map from $\mathcal{Y}$ to the set of all non-empty convex sets of Borel probability measures on $\mathcal{U}$ which are closed with respect to the weak topology. Moreover, for any $f \in C_b(\mathcal{U})$, the set of all continuous bounded real functions on $\mathcal{U}$, the map
$$y \mapsto \sup\Big\{\int f \, d\mu : \mu \in \Delta(\Gamma(y))\Big\} = \max_{u \in \Gamma(y)} f(u)$$
is $\mathcal{B}_{\mathcal{Y}}$-measurable, so that, by Theorem 3 of Strassen, 1965, for a given $\nu \in \Delta(\mathcal{U})$, there exists $\pi$ satisfying (11) with $\pi(y, \cdot) \in \Delta(\Gamma(y))$ for $P$-almost all $y$ if and only if
$$\int_{\mathcal{U}} f(u) \, \nu(du) \le \int_{\mathcal{Y}} \sup_{u \in \Gamma(y)} f(u) \, P(dy) \tag{29}$$
for all $f \in C_b(\mathcal{U})$. Now, defining $\overline{P}$ as the set function $\overline{P} : B \mapsto P(\{y \in \mathcal{Y} : \Gamma(y) \cap B \neq \emptyset\})$, the right-hand side of (29) is shown in the following sequence of equalities to be equal to the integral of $f$ with respect to $\overline{P}$ in the sense of Choquet (defined by (28)).
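$$\begin{aligned}
\int_{\mathcal{Y}} \sup_{u \in \Gamma(y)} f(u) \, P(dy)
&= \int_0^\infty P\Big\{y \in \mathcal{Y} : \sup_{u \in \Gamma(y)} f(u) \ge x\Big\} \, dx + \int_{-\infty}^0 \Big(P\Big\{y \in \mathcal{Y} : \sup_{u \in \Gamma(y)} f(u) \ge x\Big\} - 1\Big) \, dx \\
&= \int_0^\infty P\big\{y \in \mathcal{Y} : \Gamma(y) \cap \{f \ge x\} \neq \emptyset\big\} \, dx + \int_{-\infty}^0 \Big(P\big\{y \in \mathcal{Y} : \Gamma(y) \cap \{f \ge x\} \neq \emptyset\big\} - 1\Big) \, dx \\
&= \int_0^\infty \overline{P}(\{f \ge x\}) \, dx + \int_{-\infty}^0 \big(\overline{P}(\{f \ge x\}) - 1\big) \, dx \\
&= \int_{Ch} f \, d\overline{P}.
\end{aligned}$$
Here $\overline{P}$ denotes the set function $B \mapsto P(\{y \in \mathcal{Y} : \Gamma(y) \cap B \neq \emptyset\})$, and the second equality uses the fact that the supremum of $f$ over $\Gamma(y)$ is attained.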
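As a numerical sanity check of the chain of equalities above, the following sketch evaluates both ends of the chain on a small finite example. The sets, probabilities and function values are hypothetical and chosen only for illustration, and the names `upper_prob`, `choquet_integral` and `expected_max` are introduced here and do not come from the paper. For a function taking finitely many values $v_1 < \ldots < v_k$, formula (28) reduces to $v_1 + \sum_{j \ge 2} (v_j - v_{j-1})\, \varphi(\{f \ge v_j\})$, which is what the code computes.

```python
from fractions import Fraction as F

# Hypothetical finite example: observed y, latent u, non-empty sets GAMMA[y].
P = {"y1": F(1, 2), "y2": F(1, 3), "y3": F(1, 6)}      # law of y
GAMMA = {"y1": {0, 1}, "y2": {1, 2}, "y3": {3}}          # Gamma(y), subsets of U
f = {0: F(-1), 1: F(2), 2: F(5), 3: F(0)}                # bounded function on U


def upper_prob(B):
    """Capacity B -> P{y : Gamma(y) intersects B}."""
    return sum(p for y, p in P.items() if GAMMA[y] & B)


def choquet_integral(func, capacity):
    """Formula (28) specialised to a function with finitely many values."""
    values = sorted(set(func.values()))
    total = values[0]                      # the capacity of the whole space is 1
    for low, high in zip(values, values[1:]):
        level_set = {u for u in func if func[u] >= high}
        total += (high - low) * capacity(level_set)
    return total


def expected_max(func):
    """Integral over y of the maximum of func over Gamma(y)."""
    return sum(p * max(func[u] for u in GAMMA[y]) for y, p in P.items())


print(choquet_integral(f, upper_prob), expected_max(f))
```

Exact rational arithmetic is used so that the two quantities can be compared with `==` rather than up to rounding error.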
By Theorem 1 of Castaldo et al., 2004, for any $f \in C_b(\mathcal{U})$,
$$\int_{Ch} f \, d\overline{P} = \max_{\gamma \in \mathrm{Sel}(\Gamma)} \int_{\mathcal{U}} f(u) \, P\gamma^{-1}(du),$$
where $\overline{P}$ is the set function $B \mapsto P(\{y \in \mathcal{Y} : \Gamma(y) \cap B \neq \emptyset\})$, so that (29) is equivalent to
$$\max_{\gamma \in \mathrm{Sel}(\Gamma)} \int_{\mathcal{U}} f(u) \, P\gamma^{-1}(du) \ge \int_{\mathcal{U}} f(u) \, \nu(du) \tag{30}$$
for any $f \in C_b(\mathcal{U})$. If $\nu$ is in the weak closure of the set of convex combinations of elements of $\{P\gamma^{-1} : \gamma \in \mathrm{Sel}(\Gamma)\}$, then by linearity of the integral and the definition of weak convergence, (30) holds. Conversely, if $\nu$ satisfies (30), then it satisfies
$$\int_{Ch} f \, d\overline{P} \ge \int_{\mathcal{U}} f(u) \, \nu(du)$$
and by monotone continuity, we have for all $A \in \mathcal{B}_{\mathcal{U}}$, and $1_A$ the indicator function,
$$\int_{\mathcal{U}} 1_A(u) \, \nu(du) \le \int_{Ch} 1_A \, d\overline{P}.$$
Hence $\nu(A) \le \overline{P}(A)$ for all $A \in \mathcal{B}_{\mathcal{U}}$, which by Corollary 1 of Castaldo et al., 2004 implies that $\nu$ is the weak limit of a sequence of convex combinations of elements of $\{P\gamma^{-1} : \gamma \in \mathrm{Sel}(\Gamma)\}$, hence it is a mixture in the desired sense and the proof is complete.
[b] We now show equivalences (iii) ⇐⇒ (iv) ⇐⇒ (v):
Using theorem 2 below, it suffices to show that (13) is equivalent to $\nu(\Gamma(A)) \ge P(A)$ for all $A \in \mathcal{B}_{\mathcal{Y}}$. As previously, define $\overline{P}$ as the set function on $\mathcal{B}_{\mathcal{U}}$
$$\overline{P} : B \mapsto P(\{y \in \mathcal{Y} : \Gamma(y) \cap B \neq \emptyset\}).$$
Define also $\underline{P}$ as the set function
$$\underline{P} : B \mapsto P(\{y \in \mathcal{Y} : \Gamma(y) \subseteq B\}).$$
Since $\underline{P}(B) = 1 - \overline{P}(B^c)$, we have the well known equivalence between $\nu(B) \le \overline{P}(B)$ for all $B \in \mathcal{B}_{\mathcal{U}}$ and $\nu(B) \ge \underline{P}(B)$ for all $B \in \mathcal{B}_{\mathcal{U}}$. In particular, for $B = \Gamma(A)$ for any $A \in \mathcal{B}_{\mathcal{Y}}$, we have $\nu(B) \ge \underline{P}(B) = P(\{y \in \mathcal{Y} : \Gamma(y) \subseteq \Gamma(A)\})$. As $A \subseteq \{y \in \mathcal{Y} : \Gamma(y) \subseteq \Gamma(A)\}$, we have $\nu(\Gamma(A)) \ge P(A)$. Conversely, for any $B \in \mathcal{B}_{\mathcal{U}}$, call $B^* = \{y \in \mathcal{Y} : \Gamma(y) \subseteq B\}$. Then, we have $\underline{P}(B) = P(B^*) \le \nu(\Gamma(B^*))$. The result follows from the observation that $\Gamma(B^*) \subseteq B$.
Proof of Theorem 1':
The proof completely parallels the proof of Theorem 1. The equivalence between 1(iii) and 1’(iii’) drivesthe equivalence of each of the formulations in Theorem 1’ with each of the formulations in Theorem 1.
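The equivalence established in part [b] of the proof of Theorem 1 can be illustrated on a finite example. In the sketch below, all sets, probabilities and function names are hypothetical. It checks by enumeration whether $\nu(B) \le P\{y : \Gamma(y) \cap B \neq \emptyset\}$ holds for all $B$, computes $\max_A (P(A) - \nu(\Gamma(A)))$, and uses a small max-flow routine (our choice for the illustration, not a method taken from the paper) to find the largest mass that a joint law with marginals $P$ and $\nu$ can place on $\{u \in \Gamma(y)\}$. In this finite setting, one minus that mass coincides with $\max_A (P(A) - \nu(\Gamma(A)))$, in line with the duality of Theorem 2 below.

```python
from collections import deque
from fractions import Fraction as F
from itertools import combinations

# Hypothetical finite model: y is observed, u is latent, GAMMA[y] lists admissible u.
P = {"a": F(1, 2), "b": F(1, 2)}
GAMMA = {"a": {0, 1}, "b": {1, 2}}
U = [0, 1, 2]


def subsets(items):
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield set(combo)


def image(A):
    """Gamma(A): union of Gamma(y) over y in A."""
    return set().union(*(GAMMA[y] for y in A))


def in_core(nu):
    """True if nu(B) <= P{y : Gamma(y) intersects B} for every subset B of U."""
    return all(
        sum(nu[u] for u in B) <= sum(p for y, p in P.items() if GAMMA[y] & B)
        for B in subsets(U)
    )


def max_excess(nu):
    """max over A of P(A) - nu(Gamma(A)); it is 0 iff nu(Gamma(A)) >= P(A) for all A."""
    return max(sum(P[y] for y in A) - sum(nu[u] for u in image(A)) for A in subsets(P))


def max_flow(cap, s, t):
    """Edmonds-Karp on a dict-of-dicts residual graph (exact with Fractions)."""
    total = F(0)
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:       # breadth-first search for a path
            v = queue.popleft()
            for w, c in cap[v].items():
                if c > 0 and w not in parent:
                    parent[w] = v
                    queue.append(w)
        if t not in parent:
            return total
        path, w = [], t
        while parent[w] is not None:           # recover the augmenting path
            path.append((parent[w], w))
            w = parent[w]
        push = min(cap[v][w] for v, w in path)
        for v, w in path:
            cap[v][w] -= push
            cap[w][v] = cap[w].get(v, F(0)) + push
        total += push


def max_zero_cost_mass(nu):
    """Largest mass a coupling of (P, nu) can put on pairs with u in Gamma(y)."""
    cap = {"s": {}, "t": {}}
    for y, p in P.items():
        cap["s"][("y", y)] = p
        cap[("y", y)] = {("u", u): F(1) for u in GAMMA[y]}   # never binding
    for u in U:
        cap[("u", u)] = {"t": nu[u]}
    return max_flow(cap, "s", "t")


nu_ok = {0: F(1, 4), 1: F(1, 2), 2: F(1, 4)}
nu_bad = {0: F(3, 4), 1: F(1, 4), 2: F(0)}

for nu in (nu_ok, nu_bad):
    print(in_core(nu), max_excess(nu), 1 - max_zero_cost_mass(nu))
```

With these illustrative numbers, `nu_ok` is dominated by the capacity and admits a coupling supported on the graph of the correspondence, whereas `nu_bad` violates the inequality for $B = \{0\}$ and leaves a mass of $1/4$ that cannot be matched.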
Lemma 1: If $\varphi : \mathcal{Y} \times \mathcal{U} \to \mathbb{R}$ is bounded, non-negative and lower semicontinuous, then
$$\inf_{\pi \in \mathcal{M}(P, \nu)} \pi\varphi = \sup_{f \oplus g \le \varphi} (Pf + \nu g).$$
Proof of Lemma 1:
It can be shown to be a special case of corollary (2.18) of Kellerer, 1984; however, a direct proof is more transparent, so we give it here for completeness. The left-hand side is immediately seen to be always larger than the right-hand side, so we show the reverse inequality.
[a] Case where $\varphi$ is continuous and $\mathcal{U}$ and $\mathcal{Y}$ are compact.
Call $G$ the set of functions on $\mathcal{Y} \times \mathcal{U}$ strictly dominated by $\varphi$ and call $H$ the set of functions of the form $f + g$ with $f$ and $g$ continuous functions on $\mathcal{Y}$ and $\mathcal{U}$ respectively. Call $s(c) = Pf + \nu g$ for $c = f + g \in H$. It is a well defined linear functional, and is not identically zero on $H$. $G$ is convex and sup-norm open. Since $\varphi$ is continuous on the compact $\mathcal{Y} \times \mathcal{U}$, we have
$$s(c) \le \sup f + \sup g < \sup \varphi$$
for all $c \in G \cap H$, which is non empty and convex. Hence, by the Hahn-Banach theorem, there exists a linear functional $\eta$ that extends $s$ on the space of continuous functions such that
$$\sup_G \eta = \sup_{G \cap H} s.$$
By the Riesz representation theorem, there exists a unique finite non-negative measure $\pi$ on $\mathcal{Y} \times \mathcal{U}$ such that $\eta(c) = \pi c$ for all continuous $c$. Since $\eta = s$ on $H$, we have
$$\int_{\mathcal{Y} \times \mathcal{U}} f(y) \, d\pi(y, u) = \int_{\mathcal{Y}} f(y) \, dP(y), \qquad \int_{\mathcal{Y} \times \mathcal{U}} g(u) \, d\pi(y, u) = \int_{\mathcal{U}} g(u) \, d\nu(u),$$
so that $\pi \in \mathcal{M}(P, \nu)$ and
$$\sup_{f \oplus g \le \varphi} (Pf + \nu g) = \sup_{G \cap H} s = \sup_G \eta = \pi\varphi.$$
[b] $\mathcal{Y}$ and $\mathcal{U}$ are not necessarily compact, and $\varphi$ is continuous.
For all $n > 0$, there exist compact sets $K_n$ and $L_n$ such that
$$\max\big(P(\mathcal{Y} \setminus K_n), \nu(\mathcal{U} \setminus L_n)\big) \le \frac{1}{n}.$$
Let $(a, b)$ be an element of $\mathcal{Y} \times \mathcal{U}$ and define two probability measures $\mu_n$ and $\nu_n$ with compact support by
$$\mu_n(A) = P(A \cap K_n) + P(\mathcal{Y} \setminus K_n)\, \delta_a(A), \qquad \nu_n(B) = \nu(B \cap L_n) + \nu(\mathcal{U} \setminus L_n)\, \delta_b(B),$$
where $\delta$ denotes the Dirac measure. By [a] above, there exists $\pi_n$ with marginals $\mu_n$ and $\nu_n$ such that
$$\pi_n \varphi \le \sup_{f \oplus g \le \varphi} (Pf + \nu g) + \frac{\varphi(a, b)}{n}.$$
Since $(\pi_n)$ has weakly converging marginals, it is weakly relatively compact. Hence it contains a weakly converging subsequence with limit $\pi \in \mathcal{M}(P, \nu)$. By Skorohod's almost sure representation (see for instance theorem 11.7.2 page 415 of Dudley, 2003), there exists a sequence of random variables $X_n$ on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$ with law $\pi_n$ and a random variable $X$ on the same probability space with law $\pi$ such that $X$ is the almost sure limit of $(X_n)$. By Fatou's lemma, we then have
$$\liminf \pi_n \varphi = \liminf E\varphi(X_n) \ge E \liminf \varphi(X_n) = E\varphi(X) = \pi\varphi.$$
Hence we have the desired result.
[c] General case.
$\varphi$ is the pointwise supremum of a sequence of continuous bounded functions, so the result follows from upward $\sigma$-continuity of both $\inf_{\pi \in \mathcal{M}(P, \nu)} \pi\varphi$ and $\sup_{f \oplus g \le \varphi} (Pf + \nu g)$ on the space of lower semicontinuous functions, shown in propositions (1.21) and (1.28) of Kellerer, 1984.
Proof of Theorem 2:
Under assumption 1, $\Gamma$ is closed valued, hence $\varphi(y, u) = 1_{\{u \notin \Gamma(y)\}}$ is lower semicontinuous and (20) is a direct application of lemma 1 above.
We now show (21). Since the sup-norm of the cost function is 1 (the cost function is an indicator), the supremum in (20) is attained by pairs of functions $(f, g)$ in $F$, defined by
$$F = \big\{(f, g) \in L^1(P) \times L^1(\nu),\ 0 \le f \le 1,\ -1 \le g \le 0,\ f(y) + g(u) \le 1_{\{u \notin \Gamma(y)\}},\ f \text{ upper semicontinuous}\big\}.$$
Now, $(f, g)$ can be written as a convex combination of pairs $(1_A, -1_B)$ in $F$. Indeed, $f = \int_0^1 1_{\{f \ge x\}} \, dx$ and $g = \int_0^1 -1_{\{g \le -x\}} \, dx$, and for all $x$, $1_{\{f \ge x\}}(y) - 1_{\{g \le -x\}}(u) \le 1_{\{u \notin \Gamma(y)\}}$. Since the functional on the right-hand side of (20) is linear, the supremum is attained on such a pair $(1_A, -1_B)$. Hence, the right-hand side of (20) specializes to
$$\sup_{A \times B \subseteq D} (P(A) - \nu(B)). \tag{31}$$
For $D = \{(y, u) : u \notin \Gamma(y)\}$, $A \times B \subseteq D$ means that if $y \in A$ and $u \in B$, then $u \notin \Gamma(y)$. In other words $u \in B$ implies $u \notin \Gamma(A)$, which can be written $B \subseteq \Gamma(A)^c$. Hence, the dual problem can be written
$$\sup_{\Gamma(A) \subseteq B^c} (P(A) - \nu(B)) = \sup_{\Gamma(A) \subseteq B} (P(A) - \nu(B)),$$
and (21) follows immediately.
Proof of Theorem 3a:
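Let $A_0$ be the subset of $\mathcal{Y}$ that achieves the maximum of $\delta(A) = P(A) - \nu(\Gamma(A))$ over $A \in \mathcal{S} \setminus \mathcal{S}_b$. Call $\delta_0 = \delta(A_0)$, and note that $\delta_0 < 0$. We have
$$\sqrt{n}\, T_{\mathcal{Y}}(P_n, \Gamma, \nu) = \sup_{A \subseteq \mathcal{Y}} \big[\mathbb{G}_n(A) + \sqrt{n}\,(P(A) - \nu(\Gamma(A)))\big] = \max\Big\{\sup_{\mathcal{S}_b} \mathbb{G}_n,\ \sup_{A \subseteq \mathcal{Y},\, A \notin \mathcal{S}_b} \big[\mathbb{G}_n(A) + \sqrt{n}\,(P(A) - \nu(\Gamma(A)))\big]\Big\}.$$
The second term in the maximum of the preceding display is dominated by
$$\sup_{A \subseteq \mathcal{Y},\, A \notin \mathcal{S}_b} \mathbb{G}_n(A) + \sqrt{n}\, \delta_0,$$
whose limsup is almost surely non-positive. Hence (23) follows from the convergence of the empirical process. (25) follows from the fact that, under (24), for all $n$ sufficiently large, $\hat{\mathcal{S}}_{b,h_n}$ is almost surely equal to $\mathcal{S}_b$.
Proof of Theorem 3b: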
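Consider two sequences of positive numbers $l_n$ and $h_n$ such that they both satisfy (27), $l_n > h_n$ and $(l_n - h_n)^{-1} \sqrt{\frac{\ln \ln n}{n}} \to 0$. Notice that $\{\emptyset, \mathcal{Y}\} \subseteq \mathcal{S}_b, \mathcal{S}_{b,h}, \hat{\mathcal{S}}_{b,h}$ for any $h > 0$. Since $\mathbb{G}_n(\mathcal{Y}) = 0$, we therefore have $\sup_{\mathcal{S}_b} \mathbb{G}_n$, $\sup_{\mathcal{S}_{b,l_n}} \mathbb{G}_n$ and $\sup_{\hat{\mathcal{S}}_{b,h_n}} \mathbb{G}_n$ non-negative. Hence, calling $\zeta_n$ the indicator function of the event $\sup_{\mathcal{S}} \mathbb{G}_n \le (l_n - h_n)\sqrt{n}$, we can write
$$\begin{aligned}
\zeta_n \sup_{\mathcal{S}_b} \mathbb{G}_n
&\le \zeta_n \max\Big\{\sup_{\mathcal{S}_b} \big[\mathbb{G}_n + \sqrt{n}(P - \nu\Gamma)\big],\ \sup_{\mathcal{S} \setminus \mathcal{S}_b} \big[\mathbb{G}_n + \sqrt{n}(P - \nu\Gamma)\big]\Big\} \\
&\le \zeta_n \sqrt{n}\, T_{\mathcal{S}}(P_n, \Gamma, \nu) \\
&\le \zeta_n \sup_{\hat{\mathcal{S}}_{b,h_n}} \mathbb{G}_n \\
&\le \zeta_n \sup_{\mathcal{S}_{b,l_n}} \mathbb{G}_n,
\end{aligned}$$
where the first inequality holds because the left-hand side is equal to the first term in the right-hand side, the second inequality holds trivially as an equality since $\mathcal{S} = \mathcal{S}_b \cup (\mathcal{S} \setminus \mathcal{S}_b)$, the third inequality holds because on $\mathcal{S} \setminus \hat{\mathcal{S}}_{b,h_n}$, we have by definition $\mathbb{G}_n + \sqrt{n}(P - \nu\Gamma) = \sqrt{n}(P_n - \nu\Gamma) \le -\sqrt{n}\, h_n \le 0$, and the last inequality holds because on $\{\zeta_n = 1\}$, we have that $A \in \hat{\mathcal{S}}_{b,h_n}$ implies
$$\nu\Gamma(A) \le P_n(A) + h_n = P(A) + (P_n - P)(A) + h_n \le P(A) + \sup_{\mathcal{S}} \mathbb{G}_n / \sqrt{n} + h_n \le P(A) + l_n - h_n + h_n = P(A) + l_n,$$
which implies that $A \in \mathcal{S}_{b,l_n}$.
By Lemma 3a and Appendix A1, we have that both $\sup_{\mathcal{S}_b} \mathbb{G}_n$ and $\sup_{\mathcal{S}_{b,l_n}} \mathbb{G}_n$ converge weakly to $\sup_{\mathcal{S}_b} \mathbb{G}$. It is shown below that $\zeta_n \to_p 1$, so that Slutsky's lemma (lemma 2.8 page 11 of van der Vaart, 1998) yields the weak convergence of $\zeta_n \sup_{\mathcal{S}_b} \mathbb{G}_n$ and $\zeta_n \sup_{\mathcal{S}_{b,l_n}} \mathbb{G}_n$ to the same limit, and hence that of $\zeta_n \sqrt{n}\, T_{\mathcal{S}}(P_n, \Gamma, \nu)$ and $\zeta_n \sup_{\hat{\mathcal{S}}_{b,h_n}} \mathbb{G}_n$. It follows from Slutsky's lemma again that
$$\sqrt{n}\, T_{\mathcal{S}}(P_n, \Gamma, \nu) \rightsquigarrow \sup_{\mathcal{S}_b} \mathbb{G} \quad \text{and} \quad \sup_{\hat{\mathcal{S}}_{b,h_n}} \mathbb{G}_n \rightsquigarrow \sup_{\mathcal{S}_b} \mathbb{G},$$
which proves (23).
We now prove that $\zeta_n \to_p 1$. Indeed, for any $\epsilon > 0$,
$$P(|\zeta_n - 1| > \epsilon) \le P(\zeta_n = 0) = P\Big(\sup_{\mathcal{S}} \mathbb{G}_n > (l_n - h_n)\sqrt{n}\Big) \to 0$$
since $(l_n - h_n)\sqrt{n} \gg \sqrt{\ln \ln n}$ by assumption.
There remains to show (25). Defining $\xi_n$ as the indicator of the set
$$\Big\{-h_n \sqrt{n} \le \sup_{\mathcal{S}} \mathbb{G}_n \le (l_n - h_n)\sqrt{n}\Big\},$$
we have the inequalities
$$\xi_n \sup_{\mathcal{S}_b} \mathbb{G} \le \xi_n \sup_{\hat{\mathcal{S}}_{b,h_n}} \mathbb{G} \le \xi_n \sup_{\mathcal{S}_{b,l_n}} \mathbb{G}.$$
Indeed, the first inequality holds because $\sup_{\mathcal{S}} \mathbb{G}_n \ge -h_n \sqrt{n}$ implies that $P_n(A) \ge P(A) - h_n$ for all $A$, hence that $\mathcal{S}_b \subseteq \hat{\mathcal{S}}_{b,h_n}$; and the second inequality holds because on $\{\xi_n = 1\}$, we have that $A \in \hat{\mathcal{S}}_{b,h_n}$ implies
$$\nu\Gamma(A) \le P_n(A) + h_n = P(A) + (P_n - P)(A) + h_n \le P(A) + \sup_{\mathcal{S}} \mathbb{G}_n / \sqrt{n} + h_n \le P(A) + l_n - h_n + h_n = P(A) + l_n,$$
which implies that $A \in \mathcal{S}_{b,l_n}$.
By Lemma 3a suitably modified to apply to the oscillations of $\mathbb{G}$ instead of the oscillations of $\mathbb{G}_n$, we have that $\sup_{\mathcal{S}_{b,l_n}} \mathbb{G}$ converges weakly to $\sup_{\mathcal{S}_b} \mathbb{G}$. It is shown below that $\xi_n \to_p 1$, so that Slutsky's lemma yields the weak convergence of $\xi_n \sup_{\mathcal{S}_b} \mathbb{G}$ and $\xi_n \sup_{\mathcal{S}_{b,l_n}} \mathbb{G}$ to the same limit, and hence that of $\xi_n \sup_{\hat{\mathcal{S}}_{b,h_n}} \mathbb{G}$. It follows from Slutsky's lemma again that
$$\sup_{\hat{\mathcal{S}}_{b,h_n}} \mathbb{G} \rightsquigarrow \sup_{\mathcal{S}_b} \mathbb{G},$$
which proves (25).
We now prove that $\xi_n \to_p 1$. Indeed, for any $\epsilon > 0$,
$$P(|\xi_n - 1| > \epsilon) \le P(\xi_n = 0) = P\Big(\sup_{\mathcal{S}} \mathbb{G}_n > (l_n - h_n)\sqrt{n} \ \text{ or } \ \sup_{\mathcal{S}} \mathbb{G}_n < -h_n \sqrt{n}\Big) \to 0$$
since $(l_n - h_n)\sqrt{n} \gg \sqrt{\ln \ln n}$ and $h_n \sqrt{n} \gg \sqrt{\ln \ln n}$ by assumption.
Proof of Lemma 3a: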
Take a bandwidth sequence $l_n$ that satisfies (27), and take $\mathcal{S}_{b,l_n}$ as in definition 3.3. Under assumption FS, take $A \in \mathcal{S}_{b,l_n}$ and an $A_b \in \mathcal{S}_b$ such that $d_H(A, A_b) \le \zeta_n = K l_n^\eta$ (we suppress the dependence of $A_b$ on $A$ for ease of notation). As $\mathcal{S}_b \subseteq \mathcal{S}_{b,l_n}$, one has
$$\sup_{A \in \mathcal{S}_b} \mathbb{G}_n(A) \le \sup_{A \in \mathcal{S}_{b,l_n}} \mathbb{G}_n(A). \tag{32}$$
Second, since $A_b \subseteq A$, one has
$$\sup_{A \in \mathcal{S}_{b,l_n}} \mathbb{G}_n(A) = \sup_{A \in \mathcal{S}_{b,l_n}} \big[\mathbb{G}_n(A_b) + \mathbb{G}_n(A \setminus A_b)\big] \le \sup_{A \in \mathcal{S}_{b,l_n}} \big[\mathbb{G}_n(A_b)\big] + \sup_{A \in \mathcal{S}_{b,l_n}} \big[\mathbb{G}_n(A \setminus A_b)\big].$$
If we have that
$$\sup_{A \in \mathcal{S}_{b,l_n}} |\mathbb{G}_n(A \setminus A_b)| = O_{a.s.}\big(\sqrt{\zeta_n \ln \ln n}\big),$$
then
$$\sup_{A \in \mathcal{S}_{b,l_n}} \mathbb{G}_n(A) = \sup_{A \in \mathcal{S}_{b,l_n}} \big[\mathbb{G}_n(A_b)\big] + O_{a.s.}\big(\sqrt{\zeta_n \ln \ln n}\big), \tag{33}$$
noting the dependence of $A_b$ on $A$ in the expression above. But since $A_b \in \mathcal{S}_b$, one has $\sup_{A \in \mathcal{S}_{b,l_n}} [\mathbb{G}_n(A_b)] \le \sup_{A \in \mathcal{S}_b} \mathbb{G}_n(A)$. This fact, along with (32) and (33), yields the result.
We now show that we have indeed that
$$\sup_{A \in \mathcal{S}_{b,l_n}} |\mathbb{G}_n(A \setminus A_b)| = O_{a.s.}\big(\sqrt{\zeta_n \ln \ln n}\big).$$
This relies on the construction of a local empirical process relative to the thin regions $A \setminus A_b$. First consider such a region. If $A \in \mathcal{S}_b$, the result holds trivially, so that we may assume that $A \in \mathcal{S}_{b,l_n} \setminus \mathcal{S}_b$, so that $A \setminus A_b$ is not empty. We distinguish the case where $A$ is a bounded rectangle, and the cases where $A$ is unbounded.
(i) $A$ is a bounded rectangle, i.e. of the form $(y_1, z_1) \times \ldots \times (y_{d_y}, z_{d_y})$, with $y_1, \ldots, y_{d_y}, z_1, \ldots, z_{d_y}$ real. Then, since $d_H(A, A_b) \le \zeta_n$, $A_b$ is also a bounded rectangle, and $A \setminus A_b$ is the union of at least one (since $A$ and $A_b$ are distinct) and at most $f(d_y)$ (the number of faces of a rectangle in $\mathbb{R}^{d_y}$) rectangles with at least one dimension bounded by $\zeta_n$.
(ii) $A$ is an unbounded rectangle, i.e. of the same form as above, except that some of the edges are $+\infty$ or $-\infty$. Then $A_b$ is also an unbounded rectangle, and $A \setminus A_b$ is also the union of a finite number of rectangles with one dimension bounded by $\zeta_n$.
In both cases (i) and (ii), $A \setminus A_b$ is the union of a finite number of rectangles with at least one dimension bounded by $\zeta_n$. Hence if we control the supremum of the empirical process on one of these thin rectangles, when $A$ ranges over $\mathcal{S}_{b,l_n}$, we can control it on $A \setminus A_b$.
Hence, it suffices to prove that
$$\sup_{A \in \mathcal{S}_{b,l_n}} |\mathbb{G}_n(\varphi_n(A))| = O_{a.s.}\big(\sqrt{\zeta_n \ln \ln n}\big),$$
where $\varphi_n$ is the homothety that carries $A$ into one of the thin rectangles described above.
As a homothety, $\varphi_n$ is invertible and bi-measurable, and since $\varphi_n(A)$ has at least one dimension bounded by $\zeta_n$, and $P$ is absolutely continuous with respect to Lebesgue measure, $P(\varphi_n(A)) = O(\zeta_n)$ uniformly when $A$ ranges over $\mathcal{S}_{b,l_n}$. Now, for any $A \in \mathcal{S}_{b,l_n}$, we have
$$\begin{aligned}
\mathbb{G}_n(\varphi_n(A)) &= \sqrt{n}\, \big[P_n(\varphi_n(A)) - P(\varphi_n(A))\big] \\
&= \frac{1}{\sqrt{n}} \sum_{i=1}^n \big(1_{\{\varphi_n(A)\}}(Y_i) - E_P(1_{\{\varphi_n(A)\}}(Y))\big) \\
&= \frac{1}{\sqrt{n}} \sum_{i=1}^n \big(1_A(\varphi_n^{-1}(Y_i)) - E_P(1_A(\varphi_n^{-1}(Y)))\big) \\
&=: \sqrt{\zeta_n}\, L_n(1_A, \varphi_n),
\end{aligned}$$
where $L_n(1_A, \varphi_n)$ is defined as
$$\frac{1}{\sqrt{n \zeta_n}} \sum_{i=1}^n \big(1_A(\varphi_n^{-1}(Y_i)) - E_P(1_A(\varphi_n^{-1}(Y)))\big)$$
to conform with the notation of Einmahl and Mason, 1997.
Conditions A(i)-A(iv) of the latter hold for $a_n = b_n = l_n$ and $a = 0$ under (27), and conditions S(i)-S(iii) and F(ii) and F(iv)-F(viii) hold because $\mathcal{F}$ is here the class of indicator functions of $\mathcal{S}_{b,l_n}$ which, as a subclass of $\mathcal{S}$, is a Vapnik-Červonenkis class of sets. Hence Theorem 1.2 of Einmahl and Mason, 1997 holds, and
$$\sup_{A \in \mathcal{S}_{b,l_n}} |L_n(1_A, \varphi_n)| = O_{a.s.}\big(\sqrt{\ln \ln n}\big),$$
so that the desired result holds.
Proof of Lemma 3b:
Consider the class of cells $\mathcal{S} = \{(y, z) : y, z \in \mathbb{R}^{d_y}\}$. It is a Vapnik-Červonenkis class. Indeed, if $d_y = 1$, its Vapnik-Červonenkis index is three, since $\mathcal{S}$ shatters any set of cardinality 2, but can never pick out the subset $\{x, z\}$ of a set of three elements $x < y < z$. More generally, it can be shown that the Vapnik-Červonenkis index of $\mathcal{S}$ is $2 d_y + 1$ (see Example 2.6.1 page 135 of van der Vaart and Wellner, 1996). Hence the class $\mathcal{S}_K$ is also Vapnik-Červonenkis. The latter follows from lemma 2.6.17(iii) page 147 of van der Vaart and Wellner, 1996 and the fact that it is contained in the $K$-iterated union $\mathcal{S} \sqcup \ldots \sqcup \mathcal{S}$, where the "square union" of two classes of sets $\mathcal{S}_1$ and $\mathcal{S}_2$ is defined by $\mathcal{S}_1 \sqcup \mathcal{S}_2 = \{A_1 \cup A_2 : A_1 \in \mathcal{S}_1, A_2 \in \mathcal{S}_2\}$.
Proof of Theorem 3c:
From Fact 2, we know that we can restrict attention to closed subsets of $\mathcal{Y}$. Take $F$ one such subset. By the outer regularity of Borel probability measures, for all $n$ there is an open set $O'_n$ such that $F \subseteq O'_n$ and $P(O'_n) \le P(F) + 1/n$. Since $O'_n$ is open, for each $y \in F$, there exists $r_y > 0$ such that the open ball $B(y, r_y)$ centered at $y$ with radius $r_y$ is included in $O'_n$, and by construction, the open set $\tilde{O}'_n = \bigcup_{y \in F} B(y, \min(r_y, 1/n))$ covers $F$. As a closed subset of a compact set, $F$ is compact. Hence we can call $O_n$ the finite sub-covering of $F$ extracted from $\tilde{O}'_n$. $O_n$ is therefore a finite union of open balls with positive radii, i.e. it belongs to $\tilde{\mathcal{S}}_{SW}$. By construction of $O_n$, we have $d_H(O_n, F) \le 1/n$, and we know that $\Gamma(F) \subseteq \Gamma(O_n)$, and we shall now show that $\nu(\Gamma(O_n))$ converges to $\nu(\Gamma(F))$ to yield the result that $\tilde{\mathcal{S}}_{SW}$ is core determining.
Consider the following partition $\mathcal{Y} = \mathcal{Y}^I \cup \mathcal{Y}^-_n \cup \mathcal{Y}^+_n$ with:
$$\mathcal{Y}^I = \{y \in \mathcal{Y} : \nu(\Gamma(y)) = 0\}, \quad \mathcal{Y}^-_n = \{y \in \mathcal{Y} : 0 < \nu(\Gamma(y)) < 1/n\}, \quad \mathcal{Y}^+_n = \{y \in \mathcal{Y} : \nu(\Gamma(y)) \ge 1/n\}.$$
Define $F^I = F \cap \mathcal{Y}^I$, $F^-_n = F \cap \mathcal{Y}^-_n$ and $F^+_n = F \cap \mathcal{Y}^+_n$, and similarly for $O_n$, with $O^I_n$ denoting $O_n \cap \mathcal{Y}^I$.
Consider first $O^I_n \setminus F^I$. Assumption (CD3) yields immediately that $\nu(\Gamma(O^I_n \setminus F^I)) \downarrow 0$.
Consider next $O^-_n \setminus F^-_n$. Under assumption (CD6), $\nu(\Gamma(\mathcal{Y}^-_n)) \downarrow 0$, hence $\nu(\Gamma(O^-_n \setminus F^-_n)) \downarrow 0$.
Consider finally $O^+_n \setminus F^+_n$. Consider the disjoint connected components of $\Gamma(O^+_n)$. Their $\nu$ measure is at least $1/n$ by construction, hence by the compactness of $\mathcal{U}$, the number $J_n$ of disjoint connected components of $\Gamma(O^+_n)$ is no greater than $n$. We have shown above that $d_H(O_n, F) \le 1/n$, hence we have $d_H(O^+_n, F^+_n) \le 1/n$. By assumption (CD5), this implies that $d_H(\Gamma(O^+_n), \Gamma(F^+_n)) = O(1/n)$. Hence for $n$ sufficiently large, all the disjoint connected components of $\Gamma(O^+_n)$ intersect $\Gamma(F^+_n)$. Call $(C_j)_{j=1}^{J_n}$ the disjoint connected components of $\Gamma(O^+_n)$. We have
$$\nu(\Gamma(O^+_n)) = \sum_{j=1}^{J_n} \nu(\Gamma(C_j)) = \sum_{j=1}^{J_n} \big(\nu(\Gamma(C_j)) + O(1/n)\big) = \nu(\Gamma(F^+_n)) + O(1/n),$$
where the second equality holds under assumption (CD2). Since $F^+_n \subseteq O^+_n$, we therefore have the desired result $\nu(\Gamma(O^+_n \setminus F^+_n)) \downarrow 0$, which completes the proof.
Proof of Theorem 3d:
From fact 2, we can restrict attention to closed subsets of $\mathcal{Y} = \mathbb{R}$. Call $\mathcal{Y}^I$ the subset of $\mathcal{Y}$ defined by $u(y) = l(y)$ $P$-almost surely (and therefore everywhere since $u$ and $l$ are increasing). Note that the restriction of $\nu\Gamma$ to $\mathcal{Y}^I$ is a probability measure. Consider a closed subset $F$ of $\mathcal{Y}$. Call $F^I = F \cap \mathcal{Y}^I$ (resp. $F^U = F \setminus F^I$) the intersection of $F$ with $\mathcal{Y}^I$ (resp. its complement). Because of the monotonicity of the envelopes, $\nu(\Gamma(F)) = \nu(\Gamma(F^I)) + \nu(\Gamma(F^U))$, hence we only need to prove the result for closed subsets of $\mathcal{Y}^I$ and for closed subsets of $\mathcal{Y} \setminus \mathcal{Y}^I$.
Take $F$ a subset of $\mathcal{Y}^I$. The restriction $\nu\Gamma|_{\mathcal{Y}^I}$ of $\nu\Gamma$ to $\mathcal{Y}^I$ is a probability measure, and the class of sets $\mathcal{C}^I$ defined by $\mathcal{C}^I = \{A \subseteq \mathcal{Y} : A = \tilde{A} \cap \mathcal{Y}^I, \tilde{A} \in \mathcal{C}\}$ is value determining for $\nu\Gamma|_{\mathcal{Y}^I}$. By the monotonicity of the envelopes, we have $\nu(\Gamma(\tilde{A})) = \nu(\Gamma(A)) + \nu(\Gamma(\tilde{A} \setminus A))$ (with the notation of the definition of $\mathcal{C}^I$ above). Hence, if $\nu(\Gamma(A)) \ge P(A)$ for all $A \in \mathcal{C}$, then $\nu(\Gamma(A)) \ge P(A)$ for all $A \subseteq \mathcal{Y}^I$.
We can now restrict attention to the case where the upper and lower envelopes are distinct, in which case, for a closed set $F$, $\Gamma(F)$ has at most a countable number of connected parts, which we denote $C_n$, $n \in \mathbb{Z}$, ordered in the sense that $\inf C_n > \sup C_{n-1}$. By construction, each $C_n$ is the image by $\Gamma$ of a subset $F_n$ of $F$. $\Gamma$ being convex-valued, the monotonicity of the envelopes $u$ and $l$ implies upper-semicontinuity of $l$ and lower-semicontinuity of $u$. Therefore, $C_n = \Gamma(F_n) = \Gamma([\inf F_n, \sup F_n])$, and we deduce that $\nu\Gamma(F) = \nu\Gamma(\bigcup_n I_n)$ where $(I_n)_{n \in \mathbb{Z}}$ is a countable collection of disjoint closed intervals in $\mathbb{R}$. Hence if we show that $\nu\Gamma(I) \ge P(I)$ for any interval $I$, then we have $\nu\Gamma(F) = \sum_n \nu\Gamma(I_n) \ge \sum_n P(I_n) \ge P(F)$, and the inequality holds for $F$.
Now, for any $y_1 < y_2 \in \mathbb{R}$ we have $P(y_1, y_2] = P(y_1, +\infty) + P(-\infty, y_2] - 1 \le \nu\Gamma(y_1, +\infty) + \nu\Gamma(-\infty, y_2] - \nu(u(y) - l(y)) = \nu\Gamma(y_1, y_2]$ where $u$ (resp.
$l$) is the upper (resp. lower) envelope, and the result follows.
References
Andrews, D., Berry, S., & Jia, P. (2004). Confidence regions for parameters in discrete games with multiple equilibria, with an application to discount chain store location [unpublished manuscript].
Beresteanu, A., & Molinari, F. (2006). Asymptotic properties for a class of partially identified models [unpublished manuscript].
Castaldo, A., Maccheroni, F., & Marinacci, M. (2004). Random sets and their distributions. Sankhya (Series A), 409–427.
Chernozhukov, V., Hong, H., & Tamer, E. (2002). Inference on parameter sets in econometric models [unpublished manuscript].
Choquet, G. (1953). Théorie des capacités. Annales de l'Institut Fourier, 131–295.
Dempster, A. P. (1967). Upper and lower probabilities induced by a multi-valued mapping. Annals of Mathematical Statistics, 325–339.
Dudley, R. (2003). Real analysis and probability. Cambridge University Press.
Einmahl, U., & Mason, D. (1997). Gaussian approximation of local empirical processes indexed by functions. Probability Theory and Related Fields, 283–311.
Galichon, A., & Henry, M. (2006). A duality approach to inference in models defined by moment inequalities [unpublished manuscript].
Heckman, J., & Vytlacil, E. (2001). Instrumental variables, selection models and tight bounds on the average treatment effect. In M. Lechner & F. Pfeiffer (Eds.), Econometric Evaluations of Labour Market Policies, 1–16.
Imbens, G., & Manski, C. (2004). Confidence intervals for partially identified parameters. Econometrica, 1845–1859.
Jovanovic, B. (1989). Observable implications of models with multiple equilibria. Econometrica, 1431–1437.
Kellerer, H. (1984). Duality theorems for marginal problems. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 399–432.
Manski, C. (2005). Partial identification in econometrics [New Palgrave Dictionary of Economics, 2nd Edition].
Matheron, G. (1975). Random sets and integral geometry. New York: Wiley.
Pakes, A., Porter, J., Ho, K., & Ishii, J. (2004). Moment inequalities and their application [unpublished manuscript].
Salinetti, G., & Wets, R. (1986). On the convergence in distribution of measurable multifunctions (random sets), normal integrands, stochastic processes and stochastic infima. Mathematics of Operations Research, 385–422.
Shaikh, A. (2005). Inference for a class of partially identified econometric models [unpublished manuscript].
Shaikh, A., & Vytlacil, E. (2005). Threshold crossing models and bounds on treatment effects: A nonparametric analysis [NBER Technical Working Paper 0307].
Strassen, V. (1965). The existence of probability measures with given marginals. Annals of Mathematical Statistics, 423–439.
van der Vaart, A. (1998). Asymptotic statistics. Cambridge University Press.
van der Vaart, A., & Wellner, J. (1996). Weak convergence and empirical processes. New York: Springer.
Wasserman, L. (1990). Prior envelopes based on belief functions. Annals of Statistics, 18