Policy Transforms and Learning Optimal Policies
Thomas M. Russell ∗ Carleton University
December 22, 2020
Abstract
We study the problem of choosing optimal policy rules in uncertain environments using models that may be incomplete and/or partially identified. We consider a policymaker who wishes to choose a policy to maximize a particular counterfactual quantity called a policy transform. We characterize learnability of a set of policy options by the existence of a decision rule that closely approximates the maximin optimal value of the policy transform with high probability. Sufficient conditions are provided for the existence of such a rule. However, learnability of an optimal policy is an ex-ante notion (i.e. before observing a sample), and so ex-post (i.e. after observing a sample) theoretical guarantees for certain policy rules are also provided. Our entire approach is applicable when the distribution of unobservables is not parametrically specified, although we discuss how semiparametric restrictions can be used. Finally, we show possible applications of the procedure to a simultaneous discrete choice example and a program evaluation example.
Keywords: Partial Identification, Decision Theory, Statistical Learning Theory
I thank Jiaying Gu, Ismael Mourifie, Eduardo Souza-Rodrigues, Adam Rosen, Stanislav Volgushev and Yuanyuan Wan for their feedback and encouragement, and I am especially grateful to JoonHwan Cho for many hours of discussion that helped to improve this paper. A previous version of this paper appeared in my doctoral thesis at the University of Toronto. This research was supported by the Social Sciences and Humanities Research Council of Canada. All errors are my own.

∗ Thomas M. Russell, Assistant Professor, Department of Economics, Carleton University, 1125 Colonel By Drive, Ottawa, Ontario, K1S 5B6, Canada. Email: [email protected].

1 Introduction
One of the fundamental goals of econometrics is to credibly translate knowledge of underlying economic mechanisms into models that, when combined with sample data, can be used to understand the effects of counterfactual policy experiments and can help guide policy decisions. In this paper we consider the problem of making policy decisions in settings where the econometric model is partially identified and/or incomplete. The paper is motivated by the fact that credible models are needed to honestly inform policymakers on the impacts of counterfactual policies, even if credible models provide an incomplete description of the true data generating process.

Our framework is general enough to accommodate many existing structural econometric models. Our description of the environment is similar to descriptions found in Jovanovic (1989) and Chesher and Rosen (2017a), which in turn are extensions of the classical foundations for econometric modelling set forth in Koopmans et al. (1950) and Hurwicz (1950), among others. We assume the economic system under consideration manifests as a collection of random variables which can be partitioned into those that are observable (including a vector of observed endogenous variables Y and a vector of exogenous variables Z) and those that are latent or unobservable (denoted by the vector U). We refer colloquially to the variables contained in Y and Z as the "observables," and refer to the variables contained in U as the "unobservables." Unlike most of the existing literature, we do not take the distribution of U as a model primitive.
This is in accordance with the perspective that the latent variable U represents the gap between what can be explained by a theoretical model and what must remain unexplained; that is, "errors in equations" rather than "errors in variables." (These two explanations of the error term are documented by Morgan (1990), Chapter 6. We recommend Qin and Gilbert (2001) for a review of how attitudes towards the latent variables have evolved over time.) As we will demonstrate, such a distinction becomes especially important when performing counterfactual analyses.

The policymaker is assumed to have access to data on the observables, as well as an econometric model that describes how the observables are related to the unobservables. The model may depend on a vector of parameters θ ∈ Θ; here Θ is required only to be a complete and separable metric space, which permits many function spaces used in nonparametric analyses. We then let Γ represent an abstraction of the set of all possible policies under consideration by the policymaker, where γ ∈ Γ denotes one such policy. Each hypothetical policy γ ∈ Γ represents an intervention on the underlying existing economic system, which operates to generate the endogenous variables from the exogenous and unobserved variables. After the economic system is modified, the resulting system may generate a new, or counterfactual, distribution of the endogenous variables. Thus, by altering the underlying economic system, a policy intervention induces a change between the factual (or observed) and counterfactual (hypothetical and unobserved) distributions of the endogenous outcome variables. Latent variables are not affected by the policy, and instead serve as important links between the factual and counterfactual domains. A policymaker's problem is then formulated as the problem of choosing a policy intervention that induces a counterfactual distribution of endogenous outcome variables that is favourable according to some criterion.

We denote the counterfactual endogenous outcome variables as Y*_γ, where the γ index is to emphasize the fact that its distribution will depend on the counterfactual policy experiment γ ∈ Γ under consideration. Under this setup, this paper focuses on a particular class of counterfactual quantities that can be written in the following form:

$$I[\varphi](\gamma) := \int \varphi(v) \, dP_{V_\gamma}. \qquad (1.1)$$

Here ϕ is some function, V_γ := (Y*_γ, Y, Z, U) is a vector of all the random variables that describe the factual and counterfactual domains, P_{V_γ} denotes the distribution of V_γ, and v denotes a realization of V_γ. In particular, the operator I[·](γ) takes a function ϕ of the vector v of endogenous, exogenous, unobserved and counterfactual variables, and maps it to a function I[ϕ](γ) of the policy parameter γ. For this reason, we refer to I[·](γ) as a policy transform. As we will show in our examples on simultaneous discrete choice and program evaluation, counterfactual objects that can be written as policy transforms include counterfactual choice probabilities and counterfactual average effects. If a policymaker's counterfactual object of interest can be written as the policy transform of some function ϕ, then the resulting policy transform gives all the information the policymaker needs to compare various policies and make a policy choice.

Throughout the paper we consider a policymaker who wishes to maximize the value of the policy transform, although our analysis is equally applicable to the case when the policymaker wishes to minimize the value of the policy transform.
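To fix ideas, in an idealized setting where the joint distribution of V_γ can be simulated, the policy transform in (1.1) is just an expectation and can be approximated by plain Monte Carlo. The outcome equation, the role of γ as an intercept shift, and the function ϕ below are hypothetical stand-ins, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_v(gamma, n=200_000):
    """Draw (Y*_gamma, Y, Z, U) for a hypothetical threshold-crossing outcome,
    where the policy gamma acts as an intercept shift (e.g. a subsidy)."""
    z = rng.normal(size=n)                      # observed exogenous variable
    u = rng.logistic(size=n)                    # latent variable (known only because this is a toy)
    y = (z + u > 0).astype(float)               # factual binary outcome
    y_star = (gamma + z + u > 0).astype(float)  # counterfactual outcome under policy gamma
    return y_star, y, z, u

def policy_transform(phi, gamma, n=200_000):
    """Monte Carlo estimate of I[phi](gamma), the integral of phi(v) against P_{V_gamma}."""
    y_star, y, z, u = simulate_v(gamma, n)
    return phi(y_star, y, z, u).mean()

# phi picking out the counterfactual choice probability P(Y*_gamma = 1)
phi_prob = lambda y_star, y, z, u: y_star

for gamma in (0.0, 0.5, 2.0):
    print(gamma, policy_transform(phi_prob, gamma))
```

With full knowledge of P_{V_γ} the policymaker would simply pick the γ maximizing this quantity; the rest of the paper concerns what can be said when P_{V_γ} is only partially identified.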
With perfect knowledge of the distribution of the vector V_γ, the policymaker faces a trivial decision problem and can simply choose the policy γ that obtains the maximum of the policy transform I[ϕ](γ). However, this idealized decision problem is rarely encountered in practice, and instead the policymaker may only have access to a finite sample of the observed random variables. Furthermore, even with an infinite sample the policy transform may not be identified under any credible assumptions. This will be especially true throughout our discussion, since we will not require that the distribution of the unobservables U be parametrically specified.

To make progress, we model the policy decision problem as a decision under ambiguity, where we assume that the "true state of the world" belongs to a state space S × P_{Y,Z}. Here P_{Y,Z} is the set of all Borel probability measures on the observable space Y × Z. Furthermore, each s ∈ S is associated with a pair of conditional distributions (P_{U|Y,Z}, P_{Y*_γ|Y,Z,U}). Taking a pair (s, P_{Y,Z}) ∈ S × P_{Y,Z} to be the true state, the policymaker can evaluate the policy transform in (1.1) corresponding to that state. Keeping the dependence on P_{Y,Z} implicit, we denote the policy transform in state (s, P_{Y,Z}) as I[ϕ](γ, s), and refer to it as the state-dependent policy transform. (From Pearl (2009), p. 211: "The background variables are the main carriers of information from the actual world to the hypothetical world; they serve as the 'guardians of invariance' (or persistence) in the dynamic process that transforms the former into the latter.")

We then consider the policymaker's decision problem when she has access to a finite sample. Let Ψ^n denote the space of all possible n-samples {(y_i, z_i)}_{i=1}^n, and let d : Ψ^n → Γ denote a (measurable) decision rule that maps from sample realizations to policies. Before a sample ψ ∈ Ψ^n is observed, d(ψ) will be a random variable, and the policymaker's problem is then translated into the problem of selecting a decision rule according to some reasonable criteria. However, without knowledge of the true state, it is unclear how the policymaker should (in a prescriptive sense) choose among, or rank, various decision rules. One nearly self-evident requirement on any method of ranking decision rules is that the ranking should respect weak dominance; that is, if for every P_{Y,Z} ∈ P_{Y,Z} we have I[ϕ](d′(ψ), s) ≤ I[ϕ](d(ψ), s) a.s. for every s ∈ S, then d should be preferred to d′. However, it is clear that many decision rules will not be comparable according to this partial ordering.

To progress further, we introduce a preference relation over the space of all decision rules that is motivated by computational learning theory. In particular, fix any κ ∈ (0, 1) and let c_n(d, κ) be the smallest value satisfying:

$$\inf_{P_{Y,Z} \in \mathcal{P}_{Y,Z}} P_{Y,Z}^{\otimes n}\left( \inf_{s \in \mathcal{S}} I[\varphi](d(\psi), s) + c_n(d, \kappa) \;\ge\; \sup_{\gamma \in \Gamma} \inf_{s \in \mathcal{S}} I[\varphi](\gamma, s) \right) \ge \kappa. \qquad (1.2)$$

Then under our framework, a decision rule d : Ψ^n → Γ is weakly preferred to a decision rule d′ : Ψ^n → Γ at level κ and sample size n if c_n(d, κ) ≤ c_n(d′, κ). This preference relation appears to be new, and diverges (to some extent) from the existing literature on frequentist decision theory. However, its close connection to the probably approximately correct (PAC) learning framework from computational learning theory allows us to use a rich set of results from statistical learning theory and empirical process theory to study its theoretical properties. In addition, this preference relation induces a total ordering, and our first result in Section 2 demonstrates that, at a minimum, this preference relation respects weak dominance.

Given this preference relation, throughout the paper we will use the value c_n(d, κ) to measure the "performance" or "quality" of a decision rule d for a given sample size n and confidence level κ. We then provide two sets of theoretical results for the policymaker's decision problem.

In the first set of results, we provide conditions on the decision problem that guarantee the existence of a decision rule d such that c_n(d, κ) tends to zero as the sample size n becomes large. The existence of such a decision rule characterizes the notion of policy space learnability. The definition of policy space learnability appears to be new in economics, although it is adapted from the widely popular PAC learning framework from computer science proposed by Valiant (1984). Our particular analysis deals mostly with the decision-theoretic generalization of the PAC learning model proposed by Haussler (1992), which is referred to as the agnostic PAC learning model.

We show that even in simple environments the policy space may not be learnable.
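To illustrate the criterion in (1.2): holding a single sampling distribution fixed (so that the infimum over distributions is dropped) and discretizing the policy and sub-state spaces, c_n(d, κ) reduces to the κ-quantile of the maximin regret of the chosen policy d(ψ). The value table and the Gaussian sampling noise below are hypothetical, and the rule d is taken to be the empirical maximin rule:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical finite decision problem: 3 policies x 2 sub-states.
# I_true[g, s] plays the role of the state-dependent policy transform I[phi](gamma, s).
I_true = np.array([[0.2, 0.8],
                   [0.5, 0.6],
                   [0.7, 0.3]])
maximin_value = I_true.min(axis=1).max()   # sup_gamma inf_s I[phi](gamma, s)

def c_n(n, kappa, reps=5000):
    """Monte Carlo estimate of c_n(d, kappa) in (1.2) for the empirical
    maximin rule d, with the sampling distribution held fixed."""
    regrets = np.empty(reps)
    for r in range(reps):
        # noisy sample analogue of the value table, noise shrinking at rate 1/sqrt(n)
        I_hat = I_true + rng.normal(scale=1 / np.sqrt(n), size=I_true.shape)
        d = I_hat.min(axis=1).argmax()     # policy chosen from this sample
        regrets[r] = maximin_value - I_true[d].min()
    # smallest c with P( inf_s I[phi](d, s) + c >= maximin value ) >= kappa
    return np.quantile(regrets, kappa)

for n in (25, 100, 400):
    print(n, c_n(n, kappa=0.95))
```

In this toy problem the 95%-quantile of the regret shrinks as n grows, which is exactly the behaviour that policy space learnability formalizes.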
In this case the policymaker's decision problem is still well-defined, but there will be theoretical limitations on how well any given policy can perform, even in large samples. We then provide sufficient conditions for learnability, which are related to certain complexity measures of the class of functions in our problem; in particular, to an entropy growth condition (see Definition 2.3).

Since learnability is an ex-ante notion (i.e. before observing the sample), verifying learnability can be uninformative about the ex-post performance (i.e. after observing the sample) of a given policy rule. Thus, our second set of results provides a means for the policymaker to perform an ex-post analysis of her selected policy rule. First we study the finite sample properties of a particular decision rule, called the ε-maximin empirical (eME) rule, which selects an ε-maximizer of the worst-case (over s ∈ S) empirical version of I[ϕ](γ, s). Using concentration inequalities, we provide an upper bound on the quantity c_n(d, κ) when d is the eME rule, and we demonstrate how the upper bound is affected by various features of the decision problem.

However, the eME rule is only one particular rule, and for many reasons it may not be the policy rule selected by the policymaker. We thus turn to the problem of approximating the set of all policies γ ∈ Γ satisfying:

$$\sup_{\gamma' \in \Gamma} \inf_{s \in \mathcal{S}} I[\varphi](\gamma', s) - \inf_{s \in \mathcal{S}} I[\varphi](\gamma, s) \le \delta, \qquad (1.3)$$

with probability at least κ; note that any decision rule that selects a policy in this set will thus have c_n(d, κ) ≤ δ. We call this set of policies the "δ-level set," and we show how a procedure from the literature on excess risk bounds in statistical learning theory can be adapted to our environment to approximate the δ-level set. Finally, we show that the eME decision rule selects a policy in the δ-level set with high probability for δ sufficiently large, providing further justification for its use.
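As a minimal sketch with made-up numbers: on a finite policy grid, both the eME rule and the δ-level set can be read off directly from the vector of empirical worst-case values, one entry per policy:

```python
import numpy as np

# Hypothetical empirical worst-case values w(gamma) = inf_s I-hat[phi](gamma, s),
# one entry per policy on a 4-point grid.
w = np.array([0.21, 0.49, 0.33, 0.47])

def eme_rule(w, eps=0.0):
    """Index of an eps-maximizer of the empirical worst-case value (eME rule)."""
    candidates = np.flatnonzero(w >= w.max() - eps)
    return int(candidates[0])            # any eps-maximizer is admissible; take the first

def delta_level_set(w, delta):
    """Policies gamma whose worst-case value is within delta of the maximum, cf. (1.3)."""
    return np.flatnonzero(w.max() - w <= delta)

print(eme_rule(w, eps=0.05))
print(delta_level_set(w, delta=0.2))
```

With ε = 0.05 both the second and fourth policies qualify as ε-maximizers, and the δ-level set with δ = 0.2 contains every policy except the first, illustrating how the eME rule always lands inside a sufficiently generous δ-level set.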
Unlike the first, ex-ante analysis of learnability, the results comprising the ex-post analysis do not require the entropy growth condition, or any other sufficient condition for learnability, to be satisfied. Thus, they are applicable whether or not the policy space Γ is learnable, although they are silent about rates of convergence. Taken altogether, we believe our two sets of theoretical results provide a comprehensive means of making and evaluating policy decisions.

This paper also makes a contribution from an identification perspective. Perhaps unsurprisingly, important theoretical objects in our study of policy decisions are the following policy transform envelope functions:

$$I_{\ell b}[\varphi](\gamma) := \inf_{s \in \mathcal{S}} I[\varphi](\gamma, s), \qquad I_{ub}[\varphi](\gamma) := \sup_{s \in \mathcal{S}} I[\varphi](\gamma, s).$$

Regardless of the true (sub-)state s ∈ S, at the true distribution P_{Y,Z} the policy transform in (1.1) can be "sandwiched" between these upper and lower envelope functions. This idea is illustrated in Figure 1. Our ability to provide a tractable characterization of these envelope functions thus turns out to be critical to our ability to provide sufficient conditions for policy learnability, and for our ex-post analysis of the eME rule and the δ-level set.

Figure 1: This figure illustrates the policy transform of some function ϕ, as well as the upper and lower envelope functions I_ub[ϕ](γ) and I_ℓb[ϕ](γ) (resp.). The minimax (over (sub-)states s ∈ S) policy is the policy that minimizes the upper envelope, and the maximin (over (sub-)states s ∈ S) policy is the policy that maximizes the lower envelope.

The envelope functions may not be policy transforms themselves, but under some conditions they can be interpreted as sharp bounds on the policy transform I[ϕ](γ), pointwise in the variable γ. It is here that we make a contribution to the identification literature by showing that the envelope functions can be expressed as the value functions of optimization problems parameterized by the policy variable γ ∈ Γ. The result is derived under assumptions found in the theory of error bounds and exact penalty functions from the literature on optimization, and the resulting optimization problems are closely related to mathematical programs with equilibrium constraints, or MPECs. A remarkable benefit of our optimization approach is that the bounds on the policy transform can be constructed without the need to first estimate the full identified set for θ, the vector of model parameters. This is in contrast to typical approaches to bounding counterfactual quantities, which first estimate the identified set of structural parameters, and then perform a counterfactual for every possible value of the parameter vector in the identified set. A direct implication of our result is that, in either point- or partially identified models, if the policymaker's counterfactual quantity of interest is the policy transform of some function ϕ, then all structural parameters can be treated as nuisance parameters when performing counterfactuals and making policy choices.
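In a discretized version of Figure 1, the envelope functions are simply row-wise extrema of a table of state-dependent policy transform values; a minimal sketch, with a hypothetical value table:

```python
import numpy as np

# 5 policies on a grid, 2 sub-states; I[g, s] stands in for I[phi](gamma, s).
gammas = np.linspace(0.0, 1.0, 5)
I = np.array([[0.10, 0.90],
              [0.30, 0.70],
              [0.40, 0.60],
              [0.20, 0.50],
              [0.15, 0.80]])

I_lb = I.min(axis=1)   # lower envelope: inf_s I[phi](gamma, s)
I_ub = I.max(axis=1)   # upper envelope: sup_s I[phi](gamma, s)

# As in Figure 1: the maximin policy maximizes the lower envelope,
# while the minimax policy minimizes the upper envelope.
maximin_policy = gammas[I_lb.argmax()]
minimax_policy = gammas[I_ub.argmin()]
print(maximin_policy, minimax_policy)
```

Note the two criteria need not agree: in this toy table the maximin policy is γ = 0.5 while the minimax policy is γ = 0.75, even though every policy's true transform lies between its two envelope values.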
These results on identification may be of substantial separate interest.

Finally, throughout the text we discuss a simultaneous discrete choice and a program evaluation example in order to illustrate possible applications of the procedure. The simultaneous discrete choice example

(See Dolgopolik (2016) for a survey of exact penalty functions and their connection to error bounds, and see Luo et al. (1996) for a textbook treatment of MPECs.)
This paper builds on results from a variety of different literatures, including recent work on counterfactuals in structural models, partial identification and random set theory, decision theory and optimal policy choice, and computational and statistical learning theory.

Our approach to modelling and counterfactuals in partially identified models extends the literature using random set theory in econometrics, including Beresteanu et al. (2011), Galichon and Henry (2011), Beresteanu et al. (2012) and Chesher and Rosen (2017a). As mentioned in the introduction, our general environment is similar to descriptions found in Jovanovic (1989) and more recently in Chesher and Rosen (2017a), which in turn are extensions of the classical foundations for econometric modelling set forth in Koopmans et al. (1950) and Hurwicz (1950), among others. The use of random set theory is convenient in order to permit application of the method to a wider class of models. In particular, our framework is applicable to models that may (or may not) be incomplete, which are an important class of models in the literature on partial identification. Incomplete models are now legion, and include entry games with multiple equilibria (Bresnahan and Reiss (1990), Bresnahan and Reiss (1991), Tamer (2003), Jia (2008), Ciliberto et al. (2018)); English auctions (Haile and Tamer (2003), Chesher and Rosen (2017b)); discrete choice models with endogenous regressors or social interactions (Chesher and Rosen (2012), Chesher et al. (2013), Chesher and Rosen (2014)); matching models (Uetake and Watanabe (2019)); friendship networks (Miyauchi (2016)); and selection and treatment effect models (Mourifie et al.
(2018), Russell (2019)).

From the perspective of policy choice, our general approach to the problem of policy decisions is new. However, there is now a large and growing literature on statistical treatment rules in econometrics, including papers by Manski (2004), Hirano and Porter (2009), Stoye (2009), Stoye (2012), Chamberlain (2011), Tetenov (2012), Kasy (2016), Kitagawa and Tetenov (2018) and Mbakop and Tabord-Meehan (2019). In general these papers can be divided according to (i) whether they are frequentist or Bayesian, (ii) whether they take a finite-sample or asymptotic approach, and (iii) whether they consider decision problems under uncertainty or ambiguity (or "Knightian uncertainty"). In the current paper we take a frequentist, finite-sample approach to decision problems under ambiguity. However, unlike previous papers that belong to the same class, our method of evaluating statistical decision rules differs from the procedure proposed by Wald (1950). In the absence of ambiguity arising from the unknown sub-state s ∈ S, our procedure is very similar to the PAC framework for inductive inference that has become enormously popular in the computer science literature. This model of learning was initially proposed in a seminal paper by Valiant (1984), for which he won the prestigious Turing Award. The name "probably approximately correct" seems to have been first used by Angluin and Laird (1988), who extended the model to the case of noisy data. The PAC model and its extensions have now become the dominant model of learning in the theoretical foundations of machine learning; influential textbook treatments that make this connection explicit include Kearns et al. (1994), Vapnik (1995), Vapnik (1998), Vidyasagar (2002), Shalev-Shwartz and Ben-David (2014) and Mohri et al. (2018).
Our particular analysis is most closely related to the decision-theoretic generalization of the PAC learning model proposed by Haussler (1992), as well as the general learning setting considered in Vapnik (1995). Other important papers studying necessary and sufficient conditions for learnability in various machine learning settings include Blumer et al. (1989), Kearns and Schapire (1994), Bartlett et al. (1996), Alon et al. (1997), and Shalev-Shwartz et al. (2010), among others. Our work here on providing sufficient conditions for learnability borrows heavily from this literature. However, the additional ambiguity that arises in relation to possible partial identification of the policy transform differentiates our setting from the statistical learning literature, and our incorporation of this notion of ambiguity into the PAC framework appears to be new. Many of our results are applicable to problems involving risk minimization subject to (stochastic) constraints, and thus may be of separate interest to researchers in machine learning.

Surprisingly, we are unaware of any attempts to formally connect the literature on statistical decision theory with the literature on statistical learning theory. On the one hand, the properties of a Wald-style analysis are (at this point) better understood; see, for example, Stoye (2011) for an axiomatization of Wald's frequentist maximin procedure. On the other hand, we find the PAC-style criterion to be much more amenable to informative ex-post analyses of particular decision rules, mostly due to its connection to the concentration of measure phenomenon, and thus its amenability to analysis using concentration inequalities.

The connections to the statistical learning literature permeate our theoretical results. There are connections of our work to the study of ratio-type empirical processes (e.g. Giné et al. (2003), Giné et al. (2006)), and to the study of fixed-point equations and rates of convergence in risk minimization problems (e.g.
Massart (2000), Koltchinskii and Panchenko (2000), Bousquet et al. (2002), Bartlett et al. (2005), and Koltchinskii (2006)). Overall our work is most closely related to the work of Koltchinskii (2006), and the subsequent textbook treatment in Koltchinskii (2011). As we will see in the section on the ex-post analysis of certain decision rules, a key component of our approach is the use of Rademacher processes to construct data-dependent bounds on certain important empirical processes. This has the benefit of allowing the policymaker to avoid relying on any specific properties of the underlying function class, which are typically difficult to verify, and thus the resulting bounds are applicable whether or not the associated policy space is learnable. Furthermore, the use of data-dependent complexity measures like the empirical Rademacher complexity ensures our finite sample

(Kitagawa and Tetenov (2018) and Mbakop and Tabord-Meehan (2019) make some connections with the statistical learning literature. However, their method of evaluating statistical treatment rules is different from that considered by the PAC model. Some discussion of the links with decision theory can be found in an influential paper by Haussler (1992), although the discussion is very limited and no connection is made with Wald-style frequentist decision theory. As far as we are aware, this remains an open question.)

Schennach (2014) provides a general framework for models with moment conditions that depend on latent variables, and shows that the latent variables can be integrated out of the moment conditions without loss of information using a least-favourable entropy-maximizing distribution. Torgovitsky (2019) shows that when restrictions on the distribution of the latent variables have a certain structure, sharp identified sets for functionals of partially identified parameters can be characterized in terms of optimization problems.
Finally, Li (2019) shows that sharp identified sets for structural and counterfactual parameters can be constructed using a method that essentially profiles the latent variables out of the moment conditions. In the current paper, we use an idea related to Li (2019) to eliminate unobservables from the counterfactual bounding problem. However, in contrast to Li (2019), our focus on policy transforms means our formulation does not require replacing a finite number of moment conditions with a continuum of moment conditions. Furthermore, our approach does not require the policymaker to compute the full identified set of structural parameters. Our specific characterization of the bounds on the policy transform in terms of two parametric optimization problems was designed to be amenable to the theoretical analysis of policy space learnability, and to the analysis of the eME rule and the δ-level sets. Thus, our particular bounding approach is new. Finally, and perhaps most importantly, our focus is primarily on using the bounds to study the problem of policy choice, which is not considered in any of Ekeland et al. (2010), Schennach (2014), Torgovitsky (2019) or Li (2019).

The idea that at least some structural parameters may be seen as nuisance parameters in the policy decision problem goes back at least as far as Marshak (1953). Heckman (2010) refers to this idea as "Marshak's Maxim." At a high level, the identification component of this paper is reminiscent of Ichimura and Taber (2000), who discuss a method for performing ex-ante policy experiments in the treatment effect literature without estimating the structural parameters, and without specifying the error distribution. More recent examples of counterfactual analysis without first estimating the (identified set for the) structural parameters can be found in Syrgkanis et al. (2018), Tebaldi et al. (2019) and Kalouptsidi et al. (2019).

The remainder of the paper will proceed as follows.
Section 2 introduces the notation and main definitions.

(The paper of Ekeland et al. (2010) is related to a string of other papers by the same authors, namely Galichon and Henry (2006), Galichon and Henry (2009) and Galichon and Henry (2011).)
Notation:
Given a subset X of a Polish space (a complete and separable metric space), we use B(X) to denote the Borel σ-algebra on X (note the topology on X is the topology induced by the metric). We will often either leave the metric implicit, or will denote a generic metric by the function d : X × X → R. For two measurable spaces (X, B(X)) and (X′, B(X′)), the product σ-algebra on X × X′ is denoted by B(X) ⊗ B(X′). If X : (Ω, A) → (X, B(X)) is a random variable defined on the probability space (Ω, A, P), then we use P_X to denote the probability measure induced on X by X; that is, for any A ∈ B(X), P_X(A) := P(X⁻¹(A)). We let σ(X) ⊆ A denote the smallest sub-σ-algebra making X a measurable function. Furthermore, we interpret P_{X|X′}(X ∈ A | X′ = x) as a regular conditional probability measure. In many cases we do not explicitly differentiate between the true distribution of the random variable X, say P_X, and some other distribution of the random variable X, say P′_X, and instead leave the distinction to be resolved by context. To keep the notation clean, we will omit the transpose when combining column vectors; that is, if v₁ and v₂ are two column vectors, rather than write v = (v₁⊤, v₂⊤)⊤ we instead write v = (v₁, v₂), where it is understood that v is a column vector unless otherwise specified. Importantly, throughout the paper we use the convention that sup ∅ = −∞ and inf ∅ = +∞. Finally, we will largely ignore measurability issues in the main text, but we note that such issues are non-trivial in our framework, and are discussed and addressed in Appendix B.2.1.

As mentioned in the introduction, the description of the environment follows closely that of Jovanovic (1989) and Chesher and Rosen (2017a), which in turn are extensions of the classical foundations for econometric modelling set forth in Koopmans et al.
(1950) and Hurwicz (1950), among others. However, there are some differences that will be pointed out as they occur. We will also make heavy use of random set theory in this paper. Random set theory has played a major role in the development of methods for partially identified models, for example in the contributions of Beresteanu et al. (2011), Galichon and Henry (2011), Beresteanu et al. (2012) and Chesher and Rosen (2017a), among others. It is useful in our setting because it naturally generalizes many features of complete econometric models to incomplete models (see Chesher and Rosen (2017a)). Since complete models can be seen as special cases of incomplete models, focusing on incomplete models will allow us to construct a method that applies to a broader class of econometric models. Some important definitions from random set theory (including the notion of Effros-measurability, the definition of a random set, the distribution of a random set, and the notion of a selection from a random set) have been moved to Appendix A for brevity. The current section will presume some working knowledge of these concepts.

We begin by specifying the restrictions on the factual and counterfactual domains. First we will fix the probability space and define the unobserved random variables and parameters that are common to both domains.
Assumption 2.1.
There exists a fixed probability space (Ω, A, P), and a random element U : (Ω, A) → (U, B(U)), where U is a compact second-countable Hausdorff space. In addition, the parameter space Θ is a Polish space equipped with the σ-algebra B(Θ).

Fixing the probability space throughout represents a departure from some of the existing literature on partial identification and random set theory in econometrics (e.g. Galichon and Henry (2011), Chesher and Rosen (2017a)). Our reason for doing so is mostly conceptual. This paper is concerned with counterfactuals, and counterfactuals naturally involve some comparison of units between factual and counterfactual states. In any probabilistic framework, the underlying probability space naturally specifies the basic unit of observation (e.g. individuals, firms, types, etc.), so that it is necessary for the units of observation to be the same in both the factual and counterfactual states when performing a counterfactual analysis. The point may seem esoteric, but it will have a major impact on the statement and proofs of most of our results, while also resolving some interpretative difficulties.

The restriction that U is a compact space in Assumption 2.1 may seem overly restrictive; for example, the Euclidean space R^d (d < ∞) with the usual topology is not a compact space. We might consider relaxing Assumption 2.1 by allowing U to be a locally compact second-countable Hausdorff space, of which R^d (with the usual topology) is an example. However, any locally compact Hausdorff space has a one-point compactification; that is, assuming U is locally compact and Hausdorff, there exists a compact space Ũ with U ⊂ Ũ such that Ũ \ U consists of a single point. Furthermore, Ũ is unique up to a homeomorphism. A related argument has been presented in Schennach (2014).
From this perspective, it is difficult to imagine an environment where a policymaker should have strong a priori reasons to model the unobservables using a locally compact Hausdorff space U rather than its one-point compactification Ũ, despite the fact that this is often done in practice. On the other hand, the theoretical benefits of taking U to be compact (or to be the one-point compactification of some locally compact Hausdorff space) are numerous, and we will highlight these benefits as they arise. (See Munkres (2014), Theorem 29.1. Recall that a homeomorphism is a continuous invertible function with a continuous inverse.)

Note also that Assumption 2.1 does not require that the distribution of U belong to a parametric class. This is in keeping with our desire to avoid treating the distribution of U as a model primitive. This perspective is consistent with the idea that the latent variables represent components of the underlying economic system that remain unmodelled, due primarily to the policymaker's ignorance of the process determining U, and thus her inability to construct a complete mathematical description of the economic system under investigation. This interpretation becomes especially meaningful given the role the latent variables play in determining counterfactual outcomes. Instead, as we will see, the distribution of U can be implicitly constrained by the remaining primitives of the model.

Finally, we note that equipping the parameter space with the Borel σ-algebra B(Θ) may seem odd. However, making policy decisions in our framework will require measurability of certain functions to be introduced later on, and primitive conditions for the required measurability will make use of the measure space (Θ, B(Θ)). We return to similar points throughout the paper, and refer readers to Appendix B.2.1 for our results on measurability.

We will now summarize the restrictions on the factual and counterfactual domains, beginning with the factual domain.

Assumption 2.2 (Factual Domain).
The factual domain is represented by random vectors Y : (Ω, A) → (Y, B(Y)) and Z : (Ω, A) → (Z, B(Z)), where Y and Z are Polish spaces. There exists a (possibly multi-valued) map G⁻ : Y × Z × Θ → U which is closed and Effros-measurable, and satisfies:

$$P\left(U \in G^{-}(Y, Z, \theta_0) \,\middle|\, Y = y, Z = z\right) = 1, \qquad (2.1)$$

(y, z)-a.s. for some θ_0 ∈ Θ. Furthermore,

$$\mathbb{E}_{P_{U|Y,Z} \times P_{Y,Z}}\left[m_j(Y, Z, U, \theta_0)\right] \le 0, \qquad j = 1, \dots, J, \qquad (2.2)$$

for some measurable functions m_j : Y × Z × U × Θ → R, j = 1, ..., J, each bounded in absolute value for each θ ∈ Θ.

The first part of the assumption states that the unobserved random vector is a selection from the random set G⁻(Y, Z, θ_0) (see Appendix A for the definition of a selection). Note the assumption requires only that G⁻(·, θ) admits a selection when θ = θ_0. The first part of the assumption can thus be interpreted as a support restriction for the vector of unobservables conditional on the observed data. These support restrictions are derived from the policymaker's econometric model, as we will see in the examples ahead. We also note that the random set G⁻ contains the U-level sets presented in Chesher and Rosen (2017a) as a special case, and thus our framework is applicable to the generalized instrumental variable (GIV) models considered in their work. (An argument similar to the one presented in Appendix B of Chesher and Rosen (2015) can be used to show that this characterization of selectionability conditional on (y, z) a.s. is equivalent to an analogous selectionability criterion for the joint distribution of (Y, Z, U). A similar point will apply later on when we introduce Assumption 2.3.)
In the second part of the assumption we suppose that the factual domain satisfies the moment inequalities in (2.2), which are allowed to depend on the unobserved random variable U. This differs from the moment conditions in the generalized method of moments (GMM), as well as from the typical definition of moment inequalities (cf. Chernozhukov et al. (2007)). This places our paper in the narrow literature in partial identification that allows moments to depend on unobserved random variables with a possibly unknown distribution (cf. Ekeland et al. (2010), Schennach (2014), Torgovitsky (2019) and Li (2019)). The assumption of boundedness of the moment functions may appear to be restrictive, and might be replaced by the weaker assumption that the moment functions are uniformly integrable with respect to the set of probability measures P_{U|Y,Z} × P_{Y,Z} satisfying the other components of Assumption 2.2. However, regardless of how it is weakened, we contend that boundedness of the moment functions remains the most primitive assumption for our purposes. Finally, the fact that there are only a finite number of moment functions may also be restrictive; for example, it prohibits the use of conditional moment inequalities when the conditioning variable is continuous. Our identification result in Section 3 can be extended—under a suitable modification of our assumptions—to handle the case of an infinite number of moment inequalities. However, the same statement is not true of the results in Sections 4 and 5 on policy decisions, which rely more crucially on the fact that the number of moment conditions is finite.
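To fix ideas, the moment restrictions in (2.2) can be checked numerically for any candidate pair (θ, P_{U|Y,Z}) represented by simulated draws. The sketch below is our own construction (the moment functions, data-generating process, and function names are hypothetical, not taken from the paper): it computes the sample analogues of J = 2 inequalities and reports whether all of them hold up to a tolerance.

```python
import numpy as np

def check_moment_inequalities(m_fns, y, z, u, tol=0.0):
    """Check the sample analogues of E[m_j(Y, Z, U, theta)] <= tol.

    y, z are arrays of observables; u holds draws of the latent variable
    paired with each observation (a candidate selection of U).  Each m_j
    is a vectorized function of (y, z, u).  Returns the vector of sample
    moments and a flag for whether every inequality is satisfied."""
    moments = np.array([np.mean(m(y, z, u)) for m in m_fns])
    return moments, bool(np.all(moments <= tol))

# Toy illustration: two moment functions imposing (approximately) that
# U has median zero conditional on the event {Z > 0}.
rng = np.random.default_rng(0)
z = rng.normal(size=1000)
u = rng.uniform(-1.0, 1.0, size=1000)   # candidate latent draws
y = (z >= u).astype(float)
m1 = lambda y, z, u: ((u >= 0).astype(float) - (u <= 0).astype(float)) * (z > 0)
m2 = lambda y, z, u: ((u <= 0).astype(float) - (u >= 0).astype(float)) * (z > 0)
moments, ok = check_moment_inequalities([m1, m2], y, z, u, tol=0.05)
```

In practice the draws of U would come from a candidate selection of the random set G⁻(Y, Z, θ); here they are simply simulated for illustration.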
We also note that both the Effros measurability of G⁻ and the Borel measurability of each moment function m_j with respect to B(Y) ⊗ B(Z) ⊗ B(Θ) (rather than only with respect to B(Y) ⊗ B(Z)) will be required later on to ensure measurability of certain key classes of functions.

Similar to the factual domain, we must specify restrictions on the counterfactual domain, and in doing so we must specify which counterfactuals are under consideration by the policymaker. We index the various counterfactuals by an abstract parameter γ, where a fixed value of γ represents a single counterfactual, and different values of γ correspond to different counterfactuals. Throughout, γ is interpreted as an abstraction of a policy tool under the control of the policymaker. The parameter γ will play an important role in our policy decision procedure presented later in the paper.

Assumption 2.3 (Γ-Counterfactual Domains). The Γ-counterfactual domains are represented by a stochastic process {Y*(ω, γ) : γ ∈ Γ}, where (Γ, B(Γ)) is a measurable space with Γ a Polish space, and where Y*_γ := Y*(·, γ) is such that Y* : (Ω × Γ, A ⊗ B(Γ)) → (Y*, B(Y*)) is measurable, with Y* a Polish space. Furthermore, there exists a (possibly multi-valued) map G* : Y × Z × U × Θ × Γ → Y* which is closed and Effros-measurable, and satisfies:

$$P\left(Y^{\star}_{\gamma} \in G^{\star}(Y, Z, U, \theta_0, \gamma) \,\middle|\, Y = y, Z = z, U = u\right) = 1, \qquad (2.3)$$

(y, z, u)-a.s. for the same θ_0 ∈ Θ from Assumption 2.2, and for all γ ∈ Γ.

Compared to the existing literature, Assumption 2.3 appears to be new. (See, for example, the alternative assumptions given in Ekeland et al. (2010) and Li (2019).) It restricts the set of counterfactuals considered in this paper to those that can be written as modifications of support-like restrictions on the random variables in the model. We contend that this assumption can accommodate most counterfactuals of interest in economics, although it rules out, for example, counterfactuals that modify the distributions of the latent variables. Under this assumption, Y*_γ := Y*(·, γ) is a selection process from the set-valued process G*(Y, Z, U, θ_0, γ), where G* is required to be Effros-measurable with respect to the product σ-algebra. Again, the measurability requirement with respect to both Θ and Γ may seem odd, but it will be required in Sections 4 and 5 when we consider the question of policy choice. Note that—consistent with the remark following Assumption 2.1—the probability spaces in Assumptions 2.2 and 2.3 are assumed to be the same.

[Figure 2: An illustration of the setup implied by Assumptions 2.1, 2.2 and 2.3. In particular, all random variables are assumed to be defined on the same probability space. Furthermore, note the direction of the arrows from the factual domain Y × Z to the latent U to the counterfactual domain Y*, intended to illustrate the process by which information from the factual domain informs on the counterfactual domain.]

Remark 2.1 (The "No Back-Tracking" Principle). From a purely mathematical standpoint there is no reason that the moment functions in Assumption 2.2 cannot also be functions of Y*_γ and/or γ ∈ Γ. However, we omit this extension for interpretive reasons and caution researchers interested in this approach. In particular, if the researcher is not judicious in her formulation of such moment functions, then it is possible to have environments where the counterfactual γ ∈ Γ of interest has "identifying power" for the structural parameters θ ∈ Θ. Such environments are extremely puzzling since, intuitively, in these cases the counterfactual domain γ ∈ Γ under consideration contains "information" on the values of the structural parameters θ ∈ Θ existing in the factual domain.
Environments that avoid such difficulties will be said to satisfy the "no back-tracking principle." We will return to this idea in our example on simultaneous discrete choice models.
The setup implied by Assumptions 2.1, 2.2 and 2.3 is illustrated in Figure 2. Throughout the remainder of the paper, we let V_γ := (Y*_γ, Y, Z, U) denote a random vector with realizations v ∈ V, where V is a product space equipped with the product σ-algebra. (The no back-tracking principle is named in honour of the philosopher David Lewis, who argued against similar "back-tracking counterfactuals" in Lewis (1979).)

2.2 Examples

We now turn to two examples to help illustrate the nature of the assumptions just introduced. The examples will be revisited throughout the remainder of the text. The introduction of the examples is lengthy, and readers may skip to Subsection 2.3 without loss of continuity.

The first example we consider is a simultaneous discrete choice model. Simultaneous discrete choice models have seen a wide number of applications, including empirical entry games (e.g. Tamer (2003)) and discrete choice models with social interactions (e.g. Brock and Durlauf (2001)). It is already known from the work of Chesher and Rosen (2020) that this example falls into the class of GIV models considered by Chesher and Rosen (2017a). For readers familiar with these works, the model will serve as a natural point of comparison. The second example is a program evaluation example which closely mirrors the environment in Heckman and Vytlacil (2005). The example shows a model where the structural parameter is point-identified, but the counterfactual object of interest is partially identified.

Example 1 (Simultaneous Discrete Choice). Consider a simultaneous discrete choice problem. In particular, assume that a binary outcome vector Y := (Y_1, ..., Y_K) has generic element Y_k ∈ {0, 1} determined by the equation:

$$Y_k = \mathbb{1}\{\pi_k(Z_k, Y_{-k}; \theta) \ge U_k\}. \qquad (2.4)$$

Here Z_k is a vector of covariates, U_k is an unobserved random variable, and θ is a vector of model parameters. We define the vectors Z := (Z_1, ..., Z_K) and U := (U_1, ..., U_K), where each variable Z_k has support Z = {z_1, ..., z_L}, a finite subset of Euclidean space, and each U_k has support U = [−1, 1]^{d_u}. For each k, we assume that π_k is a known measurable function of (Z_k, Y_{-k}, θ), mapping to [−1, 1], that is linear in the parameters θ and has a gradient (with respect to θ) bounded away from zero for each (z, y_{-k}). We also assume that θ = (θ_1, ..., θ_K), and that each π_k depends only on the subvector θ_k. For simplicity we will assume that the parameter space Θ is a compact subset of R^{d_θ}, and that U is continuously distributed. To illustrate the use of semiparametric restrictions, we will also assume that each coordinate of the vector U is (i) median zero, and (ii) median independent of (Z_k, Y_{-k}). Finally, we assume all random variables are supported on the same probability space (Ω, A, P). Verification of Assumption 2.1 under these conditions is presented in Appendix C.1.1.

For the factual domain, we have the following multifunction:

$$G^{-}(Y, Z, \theta) := \mathrm{cl}\left\{u \in \mathcal{U} : Y_k = \mathbb{1}\{\pi_k(Z_k, Y_{-k}; \theta) \ge u_k\}, \; k = 1, \dots, K\right\}. \qquad (2.5)$$

(The finiteness of the support Z reflects the fact that the instrument will be assumed to have finite support. Note also that we could instead define U := R^{d_u}, but then 1{π_k(Z_k, Y_{-k}; θ) ≥ U_k} = 1{π̃_k(Z_k, Y_{-k}; θ) ≥ Ũ_k}, where π̃_k(Z_k, Y_{-k}; θ) = tanh(π_k(Z_k, Y_{-k}; θ)) and Ũ_k = tanh(U_k). In other words, the case with U := R^{d_u} is homeomorphic to the case U := [−1, 1]^{d_u}.)

Note the closure in (2.5) is taken to ensure that G⁻(·, θ) is a closed set for each θ. However, this introduces no additional structure and serves merely as a technical simplification, since G⁻(·, θ) as defined above is almost surely equal to the right-hand side of (2.5) without taking the closure, which follows from the assumption that U is continuously distributed.
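For intuition, the (closure of the) level set in (2.5) is simply a product of intervals, one per coordinate: u_k ≤ π_k when Y_k = 1, and u_k ≥ π_k when Y_k = 0. The following minimal sketch (our construction) computes these coordinate-wise bounds for a two-player game, using a hypothetical linear specification of π_k whose parameter names are ours, not the paper's:

```python
def level_set_bounds(y, pi_vals):
    """Coordinate-wise bounds of the closure of
    G^-(y, z, theta) = { u in [-1,1]^K : y_k = 1{pi_k >= u_k}, all k },
    where pi_vals[k] = pi_k(z_k, y_{-k}; theta).  Returns one interval
    (lo_k, hi_k) per coordinate; G^- is the product of these intervals."""
    bounds = []
    for yk, pik in zip(y, pi_vals):
        if yk == 1:
            bounds.append((-1.0, pik))   # y_k = 1  <=>  u_k <= pi_k
        else:
            bounds.append((pik, 1.0))    # y_k = 0  <=>  u_k > pi_k (closure adds pi_k)
    return bounds

# Two-player entry game with a hypothetical linear index
# pi_k = theta_k0 + theta_k1 * z_k + delta_k * y_{-k}:
theta = {"t10": 0.2, "t11": 0.5, "d1": -0.4,
         "t20": 0.1, "t21": 0.3, "d2": -0.6}
z = (1.0, -1.0)
y = (1, 0)
pi1 = theta["t10"] + theta["t11"] * z[0] + theta["d1"] * y[1]
pi2 = theta["t20"] + theta["t21"] * z[1] + theta["d2"] * y[0]
rect = level_set_bounds(y, (pi1, pi2))
```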
To complete the description of the factual domain, we impose the median-zero and median-independence assumptions for each coordinate of the vector U as a sequence of moment conditions. In particular, for k = 1, ..., K, we impose:

$$\mathbb{E}\left[\left(\mathbb{1}\{U_k \ge 0\} - \mathbb{1}\{U_k \le 0\}\right)\mathbb{1}\{Z_k = z, Y_{-k} = y_{-k}\}\right] \le 0, \quad \forall z \in \mathcal{Z}, \; y_{-k} \in \{0,1\}^{K-1}, \qquad (2.6)$$

$$\mathbb{E}\left[\left(\mathbb{1}\{U_k \le 0\} - \mathbb{1}\{U_k \ge 0\}\right)\mathbb{1}\{Z_k = z, Y_{-k} = y_{-k}\}\right] \le 0, \quad \forall z \in \mathcal{Z}, \; y_{-k} \in \{0,1\}^{K-1}. \qquad (2.7)$$

Taken together, (2.6) and (2.7) imply that the latent variables U_k are both median zero and median independent of the covariates Z_k and the outcomes Y_{-k}. Verification of Assumption 2.2, including Effros-measurability of the multifunction (2.5), is provided in Appendix C.1.1.

Turning to the counterfactual domain, there are many possible counterfactuals that may be of interest. For the sake of illustration, we consider counterfactuals of the following form. Let γ_k : Z × {0,1}^{K-1} → Z × {0,1}^{K-1}, γ = (γ_k)_{k=1}^{K}, and Y*_γ := (Y*_{1,γ}, ..., Y*_{K,γ}) with typical element:

$$Y^{\star}_{k,\gamma} = \mathbb{1}\{\pi_k(\gamma_k(Z_k, Y^{\star}_{-k,\gamma}); \theta) \ge U_k\}. \qquad (2.8)$$

For example, our interest may be in the properties of the counterfactual random variable Y*_{k,γ}, such as its mean or its conditional mean. The multifunction for the counterfactual domain is then given by:

$$G^{\star}(Z, U, \theta, \gamma) := \left\{y^{\star} \in \mathcal{Y} : y^{\star}_k = \mathbb{1}\{\pi_k(\gamma_k(Z_k, y^{\star}_{-k}); \theta) \ge U_k\}, \; k = 1, \dots, K\right\}. \qquad (2.9)$$

Note here we take Y* = Y. Verification of Assumption 2.3, including Effros-measurability of the multifunction in (2.9), is provided in Appendix C.1.1.

Example 2 (Program Evaluation). Consider the problem of program evaluation.
In this example, a binary variable D ∈ {0, 1} indicates participation in the treatment or control group for some program, and the observed real-valued outcome is given by:

$$Y = U_0(1 - D) + U_1 D, \qquad (2.10)$$

where U_0 and U_1 are potential outcomes that are never jointly observed. We assume throughout that U_0, U_1 ∈ [Y_ℓb, Y_ub], and thus the outcome Y also takes values in the bounded interval Y := [Y_ℓb, Y_ub]. In the absence of a selection equation determining the values of D, the potential outcome model is incomplete. This case is considered in Russell (2019), and the framework in this paper applies to it as well. Alternatively, we will consider the more popular approach of Heckman and Vytlacil (1999) and Heckman and Vytlacil (2005), and suppose that the treatment is determined by the equation:

$$D = \mathbb{1}\{g(Z) \ge U\}, \qquad (2.11)$$

where U is continuously distributed, and g(·) is an unknown measurable function of the observable covariates Z ∈ Z ⊂ R^{d_z}, where d_z is the dimension of the vector Z. We assume that Z is finite, and allow for the case where the vector Z can be decomposed as Z = (X, Z_1) with (i) U ⊥⊥ Z | X (conditional independence) and (ii) E[U_d | Z] = E[U_d | X] for d ∈ {0, 1} (mean independence). We thus decompose the support Z as Z = Z_1 × X, where Z_1 is the support of Z_1 and X is the support of X. Under these assumptions, it is without loss of generality that U be taken to be uniformly distributed on [0, 1] conditional on Z. (Recall from Example 1 that the median restrictions there imply constraints on the joint distribution of the vector (U_1, ..., U_K); alternatively, we might instead impose only median independence of U_k with Z_k, which restricts only the marginal distribution of U_k.)
As shown in Vytlacil (2002), these assumptions, combined with the additive separability of the selection equation in (2.11), are equivalent to the assumptions required to estimate the local average treatment effect (LATE) of Imbens and Angrist (1994). This connects the model with a large body of empirical work that focuses on obtaining estimates of the LATE.

Set the parameter space as Θ = G × T. Here G can be taken equal to the space of all positive measurable functions on Z, which is a metric space under the sup norm (for example); under finiteness of Z, this space is Polish. Furthermore, we take the component T of the parameter space to be the space of all possible measurable functions on Z. This component of the parameter space will be used in the moment functions below. Finally, we denote a generic pair (g, t) ∈ Θ as θ.

We denote the support of (U_0, U_1, U) as U := [Y_ℓb, Y_ub]² × [0, 1]. We also assume that the random variables in the vector (Y, D, Z, U_0, U_1, U) are all supported on the same probability space (Ω, A, P). Under these conditions, Assumption 2.1 is verified in Appendix C.2.1.

For the factual domain we have the multifunction:

$$G^{-}(Y, D, Z, \theta) := \mathrm{cl}\left\{(u_0, u_1, u) \in \mathcal{U} : Y = u_0(1 - D) + u_1 D, \; D = \mathbb{1}\{g(Z) \ge u\}\right\}. \qquad (2.12)$$

Note the closure is taken to ensure that G⁻(·, θ) is a closed set for each θ. However, this introduces no additional structure and serves merely as a technical simplification, since G⁻(·, θ) as defined above is almost surely equal to the right-hand side of (2.12) without taking the closure, which follows from the assumption that U is continuously distributed. Close inspection of this multifunction provides some simplification:

$$G^{-}(Y, D, Z, \theta) = \begin{cases} \{Y\} \times [Y_{\ell b}, Y_{ub}] \times [g(Z), 1], & \text{if } D = 0, \\ [Y_{\ell b}, Y_{ub}] \times \{Y\} \times [0, g(Z)], & \text{if } D = 1. \end{cases} \qquad (2.13)$$

To complete the description of the factual domain, we impose the independence condition U ⊥⊥ Z | X and the mean independence condition E[U_d | Z] = E[U_d | X], for d ∈ {0, 1}, as a sequence of moment conditions. In particular, since Z is assumed to be finite, let us partition Z into the product Z = Z_1 × X, where Z_1 := {z_1, ..., z_K} and X := {x_1, ..., x_L}.
Now consider the following sequence of moment inequalities:

$$\mathbb{E}\left[(D - g(z_k, x_l))\,\mathbb{1}\{Z_1 = z_k, X = x_l\}\right] \le 0, \quad \forall z_k \in \mathcal{Z}_1, \; x_l \in \mathcal{X}, \qquad (2.14)$$

$$\mathbb{E}\left[(g(z_k, x_l) - D)\,\mathbb{1}\{Z_1 = z_k, X = x_l\}\right] \le 0, \quad \forall z_k \in \mathcal{Z}_1, \; x_l \in \mathcal{X}, \qquad (2.15)$$

and:

$$\mathbb{E}\left[\left(\mathbb{1}\{U \le g(z_k, x_l)\} - g(z_k, x_l)\right)\mathbb{1}\{X = x_l\}\right] \le 0, \quad \forall z_k \in \mathcal{Z}_1, \; x_l \in \mathcal{X}, \qquad (2.16)$$

$$\mathbb{E}\left[\left(g(z_k, x_l) - \mathbb{1}\{U \le g(z_k, x_l)\}\right)\mathbb{1}\{X = x_l\}\right] \le 0, \quad \forall z_k \in \mathcal{Z}_1, \; x_l \in \mathcal{X}. \qquad (2.17)$$

Together (2.14) and (2.15) imply P(D = 1 | Z = z) = g(z) for all z ∈ Z, and (2.16) and (2.17) imply P(U ≤ g(z) | Z = z) = P(U ≤ g(z) | X = x) = g(z) for all z ∈ Z and x ∈ X. Under finiteness of the support Z, these moment inequalities represent the only observable implications of the independence condition U ⊥⊥ Z | X. In addition, we impose the following moment conditions:

$$\mathbb{E}\left[t(z_k, x_l) - \mathbb{1}\{Z_1 = z_k, X = x_l\}\right] \le 0, \quad \forall z_k \in \mathcal{Z}_1, \; x_l \in \mathcal{X}, \qquad (2.18)$$

$$\mathbb{E}\left[\mathbb{1}\{Z_1 = z_k, X = x_l\} - t(z_k, x_l)\right] \le 0, \quad \forall z_k \in \mathcal{Z}_1, \; x_l \in \mathcal{X}, \qquad (2.19)$$

and:

$$\mathbb{E}\left[U_d\left(\mathbb{1}\{Z_1 = z_k, X = x_l\}\sum_{z \in \mathcal{Z}_1} t(z, x_l) - \mathbb{1}\{X = x_l\}\, t(z_k, x_l)\right)\right] \le 0, \quad \forall z_k \in \mathcal{Z}_1, \; x_l \in \mathcal{X}, \; d \in \{0, 1\}, \qquad (2.20)$$

$$\mathbb{E}\left[U_d\left(\mathbb{1}\{X = x_l\}\, t(z_k, x_l) - \mathbb{1}\{Z_1 = z_k, X = x_l\}\sum_{z \in \mathcal{Z}_1} t(z, x_l)\right)\right] \le 0, \quad \forall z_k \in \mathcal{Z}_1, \; x_l \in \mathcal{X}, \; d \in \{0, 1\}. \qquad (2.21)$$

Together (2.18)-(2.21) imply the mean independence condition E[U_d | Z] = E[U_d | X] for d ∈ {0, 1}. In particular, (2.18) and (2.19) ensure t(z_k, x_l) = P(Z_1 = z_k, X = x_l), so that the moment conditions in (2.20) and (2.21) imply:

$$\mathbb{E}\left[U_d\left(\mathbb{1}\{Z_1 = z_k, X = x_l\}\, P(X = x_l) - \mathbb{1}\{X = x_l\}\, P(Z_1 = z_k, X = x_l)\right)\right] = 0, \quad \forall z_k \in \mathcal{Z}_1, \; x_l \in \mathcal{X}, \; d \in \{0, 1\},$$

or equivalently:

$$\mathbb{E}\left[U_d\left(\frac{\mathbb{1}\{Z_1 = z_k, X = x_l\}}{P(Z_1 = z_k, X = x_l)} - \frac{\mathbb{1}\{X = x_l\}}{P(X = x_l)}\right)\right] = 0, \quad \forall z_k \in \mathcal{Z}_1, \; x_l \in \mathcal{X}, \; d \in \{0, 1\}.$$
From here, a full verification of Assumption 2.2 for the factual domain, including Effros measurability of the multifunction (2.13), is provided in Appendix C.2.1.

With this setup, we might be interested in how the outcome variable changes when the factors Z that determine an individual's treatment decision are modified. For example, let Γ denote the set of all measurable functions γ : Z → Z (note that there are at most finitely many). We can then define:

$$Y^{\star}_{\gamma} = U_0(1 - D^{\star}_{\gamma}) + U_1 D^{\star}_{\gamma}, \qquad (2.22)$$

where the random variable D*_γ is given by:

$$D^{\star}_{\gamma} = \mathbb{1}\{g(\gamma(Z)) \ge U\}.$$

Note that, as in Heckman and Vytlacil (1999) and Heckman and Vytlacil (2005), our counterfactual γ ∈ Γ has no direct effect on (U_0, U_1). Our interest is in the properties of the random variable Y*_γ, such as its mean or its conditional mean. The multifunction for the counterfactual domain is given by:

$$G^{\star}(Z, U_0, U_1, U, \theta, \gamma) := \left\{(y^{\star}, d^{\star}) \in \mathcal{Y} \times \{0, 1\} : y^{\star} = U_0(1 - d^{\star}) + U_1 d^{\star}, \; d^{\star} = \mathbb{1}\{g(\gamma(Z)) \ge U\}\right\}. \qquad (2.23)$$

Note here we take Y* = Y. Again, close inspection of this multifunction provides some simplification:

$$G^{\star}(Z, U_0, U_1, U, \theta, \gamma) = \begin{cases} \{(U_1, 1)\}, & \text{if } U \le g(\gamma(Z)), \\ \{(U_0, 0)\}, & \text{if } g(\gamma(Z)) < U. \end{cases} \qquad (2.24)$$

A full verification of Assumption 2.3 for the counterfactual domain, including Effros measurability of the multifunction (2.24), is provided in Appendix C.2.1.
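The case distinctions in (2.13) and (2.24) translate directly into code. The following sketch (our construction, with purely illustrative numbers) returns the factual set as a product of intervals and the counterfactual outcome as a singleton:

```python
def factual_set(y, d, g_z, y_lb, y_ub):
    """G^-(Y, D, Z, theta) from (2.13): a product of three intervals
    for (U_0, U_1, U), given the observed (y, d) and propensity g(Z).
    Degenerate intervals (a, a) denote singletons."""
    if d == 0:
        return ((y, y), (y_lb, y_ub), (g_z, 1.0))
    return ((y_lb, y_ub), (y, y), (0.0, g_z))

def counterfactual_point(u0, u1, u, g_gamma_z):
    """G* from (2.24): once (U_0, U_1, U) and g(gamma(Z)) are fixed,
    the counterfactual (Y*_gamma, D*_gamma) is a singleton."""
    if u <= g_gamma_z:
        return (u1, 1)
    return (u0, 0)

# A treated observation with outcome y = 2.0 and propensity g(Z) = 0.6:
rect = factual_set(y=2.0, d=1, g_z=0.6, y_lb=0.0, y_ub=4.0)
# A policy gamma lowering the propensity to 0.3 pushes a unit with
# latent u = 0.5 out of treatment:
star = counterfactual_point(u0=1.0, u1=2.0, u=0.5, g_gamma_z=0.3)
```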
2.3 Policy Transforms

Throughout the remainder of the paper we build on the environment established in the previous section to present a framework for making policy decisions based on the value of any counterfactual object of interest that can be written as an integral of some function of the vector V_γ. In particular, if ϕ : Ω × Γ → R is some measurable function, then we restrict attention to environments where policymakers are interested in either the policy transform or the conditional policy transform of ϕ, which are defined next.

Definition 2.1 (Policy Transform and Conditional Policy Transform). Let ϕ : Ω × Γ → R be a bounded and measurable function. The policy transform of ϕ is a function I[ϕ](γ) : Γ → R given by:

$$I[\varphi](\gamma) := \int_{\Omega} \varphi(\omega, \gamma) \, dP. \qquad (2.25)$$

Furthermore, if A′ ⊂ A is a σ-algebra, then a conditional policy transform of ϕ given A′ is a function Ĩ[ϕ] : Ω × Γ → R such that (i) Ĩ[ϕ] : Ω × Γ → R is A′ ⊗ B(Γ)-measurable, and (ii) I[Ĩ[ϕ](·, γ)1_A](γ) = I[ϕ 1_A](γ) for every A ∈ A′.

We focus on the unconditional policy transform throughout the remainder of the paper, since analogous results hold for the conditional policy transform. In addition, since the relevant random variables are collected in the vector V_γ, we will abuse notation throughout the paper and instead focus on policy transforms of the form:

$$I[\varphi](\gamma) := \int_{\Omega} \varphi(V_{\gamma}(\omega)) \, dP = \int_{\mathcal{V}} \varphi(v) \, dP_{V_{\gamma}}, \qquad (2.26)$$

which are clearly a special case of the general policy transforms in Definition 2.1. (See Carneiro et al. (2011) for a discussion of other possible parameters under this setting.)

In the remainder of the paper we take as primitive that the policymaker would like to choose γ to maximize the value of the policy transform for some known function ϕ : V → R, although all results apply equally to the case where the policymaker wishes to minimize the policy transform. For pedagogical purposes, it is useful to first consider an idealized decision problem.
In particular, when (i) the true distribution P_{Y,Z} is known, (ii) the conditional distribution P_{U|Y,Z} is known, and (iii) the counterfactual conditional distribution P_{Y*_γ|Y,Z,U} is known, the policymaker's problem becomes trivial: she can simply compute the policy transform of ϕ and choose the maximizing value of γ. Such idealized environments are clearly rare, however. Instead, we consider the more realistic case where the policymaker has access only to an i.i.d. sample of size n from the true distribution P_{Y,Z}, and knows only that Assumptions 2.1, 2.2 and 2.3 are satisfied. In such an environment, the policymaker may be unable to compute the policy transform due to (i) lack of perfect knowledge of P_{Y,Z}, (ii) lack of knowledge of P_{U|Y,Z}, and (iii) lack of knowledge of P_{Y*_γ|Y,Z,U}. All three cases can occur when the structural parameters are point- or partially identified. We are now ready to define the decision problem under consideration.
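Before formalizing the decision problem, the idealized computation described in (i)-(iii) above can be sketched concretely. Suppose (hypothetically) all three distributions were known and given by the simple program-evaluation process below; the policy transform of ϕ(v) = y* could then be computed by Monte Carlo and maximized directly over a finite policy set. All numbers and names here are our own illustrative choices, not the paper's:

```python
import numpy as np

# Hypothetical known data-generating process: a constant treatment
# effect of 0.8 and a uniformly distributed selection unobservable.
rng = np.random.default_rng(1)
n = 200_000
u0 = rng.normal(1.0, 0.5, n)      # potential outcome, untreated
u1 = u0 + 0.8                     # potential outcome, treated
u = rng.uniform(0.0, 1.0, n)      # selection unobservable

def policy_transform(g_gamma):
    """Monte Carlo approximation of I[phi](gamma) with phi(v) = y*:
    the mean counterfactual outcome when the counterfactual
    propensity score is g_gamma."""
    d_star = (u <= g_gamma).astype(float)
    y_star = u0 * (1.0 - d_star) + u1 * d_star
    return y_star.mean()

gamma_grid = [0.0, 0.25, 0.5, 0.75, 1.0]   # a finite set of policy options
values = [policy_transform(g) for g in gamma_grid]
best_gamma = gamma_grid[int(np.argmax(values))]
```

With a uniformly positive treatment effect, the transform is increasing in the counterfactual propensity, so the maximizing policy treats everyone.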
Definition 2.2 (The Decision Problem). The policymaker's decision problem is characterized by:

(i) The population, represented by the probability space (Ω, A, P).

(ii) The action (or policy) space, given by (Γ, B(Γ)).

(iii) The sample space, given by (Ψ_n, Σ_{Ψ_n}, P^{⊗n}_{Y,Z}), where Ψ_n := (Y × Z)^n, with typical element ψ = {(y_i, z_i)}_{i=1}^{n}, equipped with the product Borel σ-algebra Σ_{Ψ_n} := (B(Y) ⊗ B(Z))^{⊗n} and the product measure P^{⊗n}_{Y,Z}.

(iv) The state space, given by S × P_{Y,Z}, where P_{Y,Z} is the set of all Borel probability measures on Y × Z, and S is the set of all triples s = (θ, P_{U|Y,Z}, P_{Y*_γ|Y,Z,U}) such that the pair (s, P_{Y,Z}) satisfies:
(a) θ ∈ Θ,
(b) P_{U|Y,Z}(U ∈ G⁻(Y, Z, θ) | Y = y, Z = z) = 1, (y, z)-a.s.,
(c) P_{Y*_γ|Y,Z,U}(Y*_γ ∈ G*(Y, Z, U, θ, γ) | Y = y, Z = z, U = u) = 1, (y, z, u)-a.s., and
(d) the elements θ ∈ Θ and P_{U|Y,Z} satisfy:

$$\max_{j = 1, \dots, J} \mathbb{E}_{P_{U|Y,Z} \times P_{Y,Z}}\left[m_j(Y, Z, U, \theta)\right] \le 0. \qquad (2.27)$$

(v) The feasible statistical decision rules D, with typical element d, given by the set of all measurable functions d : Ψ_n → Γ.

(vi) The objective function, given by a function I[ϕ] : Γ × S × P_{Y,Z} → R, called the state-dependent policy transform, which has the expression:

$$I[\varphi](\gamma, s) := \int \varphi(v) \, d\left(P_{Y^{\star}_{\gamma}|Y,Z,U} \times P_{U|Y,Z} \times P_{Y,Z}\right), \qquad (2.28)$$

where ϕ : V → R is a measurable function (and where P_{Y,Z} is left implicit when writing I[ϕ](γ, s)).

(After we describe the decision problem, it will be apparent that the policymaker's desire to maximize or minimize the policy transform might be deduced, via an axiomatic approach, from a preference relation over the space of Borel probability measures on V. We find this idea interesting, but will not pursue it here.)

A few remarks on this definition of our statistical decision problem are in order. In parts (i) and (ii), the specification of the population and the action space is somewhat standard, and has been motivated in the previous sections. In part (iii), the sample space is simply taken as the n-fold product of the observable space (Y × Z). The measure on this space is the n-fold product of the true distribution P_{Y,Z}, from which we immediately deduce that the sample ψ ∈ Ψ_n is assumed to be i.i.d.
Motivated by the framework in the previous section, part (iv) indicates that the unobserved state is characterized by a distribution P_{Y,Z} and the triple (θ, P_{U|Y,Z}, P_{Y*_γ|Y,Z,U}), where S corresponds to the set of all such triples that satisfy the model support restrictions and moment conditions introduced in the previous section. In part (v), the feasible decision rules D are characterized by the set of all measurable functions from the sample space Ψ_n to the action space Γ; we will return to this point below. (Note we might instead allow for randomized decision rules by taking D to be the set of all measurable functions from Ψ_n to the set of all distributions on Γ. This is not required for what we have in mind, but is easily accommodated under slightly modified assumptions.) Furthermore, in this paper we use the terms policy rules and decision rules interchangeably. Finally, part (vi) of Definition 2.2 introduces the state-dependent policy transform, which is a generalization of the policy transform that allows its value to depend on the unknown state from part (iv). Evaluated at the true state, the state-dependent policy transform reduces to the policy transform from Definition 2.1.

Ex ante (i.e. before observing the sample) each decision rule d : Ψ_n → Γ is a random variable. Under some measurability conditions, this implies the state-dependent policy transform I[ϕ](d(ψ), s) is also a random variable. The remaining question is how to use the collection {I[ϕ](d(ψ), s) : (s, P_{Y,Z}) ∈ S × P_{Y,Z}} to evaluate a given policy rule. It seems self-evident that a policy rule d ∈ D should be preferred to a policy rule d′ ∈ D if for every P_{Y,Z} ∈ P_{Y,Z} we have I[ϕ](d′(ψ), s) ≤ I[ϕ](d(ψ), s) a.s. for every s ∈ S; in such a case, d delivers a larger value of the policy transform in every state with probability one, regardless of the distribution P_{Y,Z}. Any preference relation over D that satisfies this condition will be said to respect weak dominance. (We refer to Manski (2011) for a similar definition; this point is raised repeatedly in the work of Charles Manski and is summarized in Manski (2011). Also note that our definition implies stochastic dominance of I[ϕ](d(ψ), s) over I[ϕ](d′(ψ), s) for every (s, P_{Y,Z}) ∈ S × P_{Y,Z}. By Strassen's Theorem, our definition would be equivalent to stochastic dominance if we allowed for alternative probability spaces for each (s, P_{Y,Z}) pair.) However, beyond the requirement that a preference relation respect weak dominance, it is not obvious how a policymaker should (in the prescriptive sense) choose among competing policy options given the decision problem in Definition 2.2.
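On a finite approximation of the state space and a finite set of simulated samples, the weak-dominance condition is straightforward to check. A small sketch (our construction; the arrays and names are hypothetical):

```python
import numpy as np

def weakly_dominates(vals_d, vals_dprime):
    """True if I[phi](d'(psi), s) <= I[phi](d(psi), s) for every state
    (row) and every simulated draw of psi (column)."""
    return bool(np.all(vals_dprime <= vals_d))

# Three states, 500 simulated samples; rule d improves on d' by a
# uniform margin of 0.1 in every state and draw.
rng = np.random.default_rng(4)
base = rng.uniform(0.0, 1.0, size=(3, 500))
vals_d = base + 0.1
vals_dp = base
dom = weakly_dominates(vals_d, vals_dp)
```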
Definition 2.3 (PAC Maximin Preference Relation). Fix a sample size n. For any κ ∈ (0, 1) and any d ∈ D, let c_n(·, κ) : D → R_{++} be such that c_n(d, κ) is the smallest value satisfying:

$$\inf_{P_{Y,Z} \in \mathcal{P}_{Y,Z}} P^{\otimes n}_{Y,Z}\left(\inf_{s \in \mathcal{S}} I[\varphi](d(\psi), s) + c_n(d, \kappa) \ge \sup_{\gamma \in \Gamma} \inf_{s \in \mathcal{S}} I[\varphi](\gamma, s)\right) \ge \kappa. \qquad (2.29)$$

Then decision rule d : Ψ_n → Γ is weakly preferred to (or weakly dominates) decision rule d′ : Ψ_n → Γ at level κ and sample size n, denoted by d′ ≼_κ d, if and only if c_n(d, κ) ≤ c_n(d′, κ). The decision rule d : Ψ_n → Γ is strictly preferred to (or strictly dominates) decision rule d′ : Ψ_n → Γ, denoted by d′ ≺_κ d, if and only if c_n(d, κ) < c_n(d′, κ). A decision rule d ∈ D will be called admissible with respect to ≼_κ if there is no decision rule d′ ∈ D that is strictly preferred to (or strictly dominates) d.

This preference relation is named the PAC maximin preference relation given its close connection to the learning framework in the next subsection, which in turn is closely related to the PAC learning model of Valiant (1984) from computational learning theory. We refer readers to Appendix A.2, where we discuss the notion of PAC learnability from computational learning theory, and we will emphasize the connection further in the next subsection.

For a fixed κ ∈ (0, 1), any two decision rules d and d′ can be compared according to ≼_κ. In addition, the relation has an interpretation in terms of quantiles. In particular, suppose for simplicity that P_{Y,Z} contains a single distribution π, and define Q_π(κ, d) as the κ quantile (under the distribution π) of the map:

$$\psi \mapsto \sup_{\gamma \in \Gamma} \inf_{s \in \mathcal{S}} I[\varphi](\gamma, s) - \inf_{s \in \mathcal{S}} I[\varphi](d(\psi), s). \qquad (2.30)$$

Note that the map in (2.30) is always non-negative. Then a decision rule d ∈ D is preferred to a decision rule d′ ∈ D under ≼_κ if and only if Q_π(κ, d) ≤ Q_π(κ, d′).
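To make (2.29)-(2.30) concrete, the following toy computation (entirely our construction; P_{Y,Z} is taken to be a singleton and the worst-case values are stylized numbers) simulates the sampling distribution of the shortfall map in (2.30) for a simple plug-in rule and reads off c_n(d, κ) as its κ-quantile:

```python
import numpy as np

# Gamma = {0, 1}.  The worst-case (inf over s) transform of policy 0 is
# taken as known and equal to 0.4; that of policy 1 equals an unknown
# Bernoulli mean p.  The plug-in rule d picks policy 1 when the sample
# mean exceeds 0.4.
rng = np.random.default_rng(2)
p, n, kappa = 0.5, 100, 0.95
worst_case = {0: 0.4, 1: p}
oracle = max(worst_case.values())        # sup_gamma inf_s I[phi](gamma, s)

def shortfall(sample):
    d = 1 if sample.mean() > 0.4 else 0  # the plug-in decision rule d(psi)
    return oracle - worst_case[d]        # the map in (2.30), evaluated at psi

draws = np.array([shortfall(rng.binomial(1, p, n)) for _ in range(5000)])
c_n = np.quantile(draws, kappa)          # smallest c with P(shortfall <= c) >= kappa
```

Here the rule errs (shortfall 0.1) only when the sample mean falls at or below 0.4, which happens with probability well under 1 − κ, so the simulated c_n is zero.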
Quantile utility maximization has been considered in Manski (1988) and Manski and Tetenov (2014), and axiomatized in Rostek (2010). However, our approach differs from these in major ways, especially with regard to our treatment of the (sub-)states $s \in \mathcal{S}$.

Providing an axiomatization for the preference relation in Definition 2.3 is beyond the scope of this paper. Indeed, a policymaker need not hold exactly the preference relation from Definition 2.3 in order to find the results in this paper useful or interesting. However, the following result shows that, at a minimum, $\preceq_\kappa$ respects weak dominance, as defined above.

Proposition 2.1.
Suppose that Assumptions 2.1, 2.2 and 2.3 hold, and that $\varphi : \mathcal{V} \to [\varphi_{\ell b}, \varphi_{ub}] \subseteq \mathbb{R}$ is a bounded and measurable function. Also, suppose that $\gamma \mapsto \inf_{s \in \mathcal{S}} I[\varphi](\gamma, s)$ is (universally) measurable. Let $d, d' \in \mathcal{D}$ be two decision rules, and suppose that for every $P_{Y,Z} \in \mathcal{P}_{Y,Z}$ we have $I[\varphi](d'(\psi), s) \le I[\varphi](d(\psi), s)$ a.s. for every $s \in \mathcal{S}$. Then for any $\kappa \in (0,1)$ we have $d' \preceq_\kappa d$, where $\preceq_\kappa$ is the preference relation from Definition 2.3; that is, the preference relation $\preceq_\kappa$ respects weak dominance.

Proof. See Appendix B. $\blacksquare$
Remark 2.2. Universal measurability is a weaker requirement than Borel measurability, and is defined in Appendix B.2.1. Also, in Appendix B.2.1 we show that the map $\gamma \mapsto \inf_{s \in \mathcal{S}} I[\varphi](\gamma, s)$ is universally measurable, although the result and proof rely on Assumption 3.1, introduced in the next section. Since Assumption 3.1 has not yet been introduced at this point, we impose (universal) measurability of $\gamma \mapsto \inf_{s \in \mathcal{S}} I[\varphi](\gamma, s)$ as a separate assumption in this proposition.

Our main interest in the preference relation from Definition 2.3—especially versus other preference relations encountered in frequentist decision theory—is its close connection to the PAC learning framework, which allows us to use a rich set of results from statistical learning theory and empirical process theory to study its theoretical properties. Before formally introducing this connection, we first revisit our examples to illustrate the various definitions presented in Definition 2.2.
Example 1 (Simultaneous Discrete Choice (Cont'd)). For the simultaneous discrete choice example, recall that our interest is in the properties of the counterfactual random variable $Y^{\star}_{k,\gamma}$, such as its mean or its conditional mean. For the sake of illustration, we will focus on the quantity:

$$I[\varphi](\gamma) = \int_\Omega \mathbb{1}\{Y^{\star}_{k,\gamma}(\omega) = 1\}\, dP, \quad (2.31)$$

which is a counterfactual choice probability. Note this quantity is the policy transform of the function $\varphi(\omega, \gamma) = \mathbb{1}\{Y^{\star}_{k,\gamma}(\omega) = 1\}$. Without much additional complication, we might instead be interested in the conditional choice probability $E[\mathbb{1}\{Y^{\star}_{k,\gamma}(\omega) = 1\} \mid Z]$; it is easily verified that $\tilde{I}[\varphi](\omega, \gamma) = E[\varphi(\omega, \gamma) \mid Z](\omega)$, with $\varphi(\omega, \gamma) = \mathbb{1}\{Y^{\star}_{k,\gamma}(\omega) = 1\}$, is a conditional policy transform. Indeed, by definition this quantity is measurable with respect to $\sigma(Z)$, and satisfies:

$$I[\tilde{I}[\varphi](\cdot, \gamma)\mathbb{1}_A](\gamma) = \int E[\varphi(\omega, \gamma) \mid Z](\omega)\, \mathbb{1}_A(\omega)\, dP = \int \mathbb{1}\{Y^{\star}_{k,\gamma}(\omega) = 1\}\, \mathbb{1}_A(\omega)\, dP = I[\varphi \mathbb{1}_A](\gamma), \quad (2.32)$$

for every $A \in \sigma(Z)$. Throughout we will suppose the policymaker is interested in selecting the policy $\gamma \in \Gamma$ that maximizes the quantity (2.31). We can now formally define the policymaker's decision problem. The population is given by the probability space $(\Omega, \mathcal{A}, P)$ and the action space is given by $(\Gamma, \mathcal{B}(\Gamma))$, where $\Gamma$ is the set of all functions $\gamma = (\gamma_k)_{k=1}^K$ with $\gamma_k : \mathcal{Z} \times \mathcal{Y}^{K-1} \to \mathcal{Z} \times \mathcal{Y}^{K-1}$, and $\mathcal{B}(\Gamma)$ can be taken as the power set of $\Gamma$. Since $\mathcal{Z}$ and $\mathcal{Y}$ are finite, both $\Gamma$ and $\mathcal{B}(\Gamma)$ contain at most finitely many elements. The sample space in this example is given by $\Psi_n$, the set of all possible realizations of the $n$ vectors $\{(y_i, z_i)\}_{i=1}^n$. Each state of the world is indexed by a pair $(\theta, P_{U|Y,Z})$ satisfying the support restriction given by (2.5) and the moment conditions (2.6) and (2.7). The state-dependent policy transform is given by:

$$I[\varphi](\gamma, s) := \int \mathbb{1}\{U_k \le \pi_k(\gamma(Z_k, Y_{-k}); \theta)\}\, dP_{U|Y,Z}\, dP_{Y,Z}.$$

A feasible statistical decision rule is then any measurable function $d : \Psi_n \to \Gamma$ that selects a policy indexed by $\gamma$ given access to an $n$-sample from $\Psi_n$.

Example 2 (Program Evaluation (Cont'd)). For the program evaluation example, recall that our interest is in the properties of the random variable $Y^{\star}_{\gamma}$, such as its mean or its conditional mean. For the sake of illustration, we will focus on the average outcome under some counterfactual policy $\gamma \in \Gamma$, given by $E[Y^{\star}_{\gamma}]$. Note that taking $\varphi(\omega, \gamma) = Y^{\star}_{\gamma}(\omega)\ (:= Y^{\star}(\omega, \gamma))$, it is then clear that $E[Y^{\star}_{\gamma}] = I[\varphi](\gamma)$, so that the average effect of a counterfactual policy is the policy transform of the random variable $Y^{\star}_{\gamma}(\omega)$. Without much additional complication, we might instead be interested in the conditional average effect $E[Y^{\star}_{\gamma} \mid X]$. It is easily verified that $\tilde{I}[\varphi](\omega, \gamma) = E[\varphi(\omega, \gamma) \mid X](\omega)$, with $\varphi(\omega, \gamma) = Y^{\star}_{\gamma}(\omega)$, is a conditional policy transform. We will assume throughout that the policymaker is interested in maximizing the value of $E[Y^{\star}_{\gamma}]$. We can now formally define the policymaker's decision problem. The population is given by the probability space $(\Omega, \mathcal{A}, P)$ and the action space is given by $(\Gamma, \mathcal{B}(\Gamma))$, where $\Gamma$ is the set of all functions $\gamma : \mathcal{Z} \to \mathcal{Z}$ and $\mathcal{B}(\Gamma)$ is the power set of $\Gamma$.
The sample space is given by $\Psi_n = (\mathcal{Y} \times \{0,1\} \times \mathcal{Z})^n$, with a typical element $\psi = ((y_i, d_i, z_i))_{i=1}^n$. The state space $\mathcal{S}$ consists of elements $s = (\theta, P_{U_0,U_1,U|Y,Z}, P_{Y^{\star}_{\gamma}|U_0,U_1,U,Y,Z})$, where $P_{U_0,U_1,U|Y,Z}$ and $P_{Y^{\star}_{\gamma}|U_0,U_1,U,Y,Z}$ are the distributions of any random variables that satisfy the support restriction (2.12) and the moment conditions (2.14)-(2.19). Indeed, by definition the conditional policy transform above is measurable with respect to $\sigma(X)$, and satisfies:

$$I[\tilde{I}[\varphi](\cdot, \gamma)\mathbb{1}_A](\gamma) = \int E[\varphi(\omega, \gamma) \mid X](\omega)\, \mathbb{1}_A(\omega)\, dP = \int Y^{\star}_{\gamma}(\omega)\, \mathbb{1}_A(\omega)\, dP = I[\varphi \mathbb{1}_A](\gamma), \quad (2.33)$$

for every $A \in \sigma(X)$. Since $\mathcal{Z}$ is finite, both $\Gamma$ and $\mathcal{B}(\Gamma)$ contain at most finitely many elements. Finally, a feasible statistical decision rule is any measurable function $d : \Psi_n \to \Gamma$ that selects a policy indexed by $\gamma$ given access to an $n$-sample from $\Psi_n$.

With the policymaker's decision problem defined in the previous subsection, our upcoming theoretical results can be divided according to whether they are applicable ex-ante (i.e. before observing the sample) or ex-post (i.e. after observing the sample). Recall the preference relation from Definition 2.3. Under this preference relation, the "performance" or "quality" of a decision rule $d$ can be measured using the value $c_n(d, \kappa)$. Thus, the value of $c_n(d, \kappa)$ will be a major focus of both the ex-ante and ex-post theoretical analyses in the remainder of the paper. Our main focus in the ex-ante theoretical results is establishing sufficient conditions for learnability of a policy space, which we will discuss further in this subsection. Our main focus for the ex-post theoretical analysis is establishing bounds on the value of $c_n(d, \kappa)$ for certain decision rules, as well as bounds on the set of decision rules $d \in \mathcal{D}$ that obtain a small value of $c_n(d, \kappa)$.
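Before turning to learnability, the state-dependent policy transform from Example 1 can be made concrete by simulation. The sketch below fixes a single, fully specified (sub-)state $s = (\theta, P_{U|Y,Z})$ with $K = 1$ and approximates $I[\varphi](\gamma, s) = P(Y^{\star}_{\gamma} = 1)$ by Monte Carlo integration. The uniform choice for $P_{U|Y,Z}$ and the particular cutoff function and policy are illustrative assumptions of ours: in the model these objects are only partially identified, which is exactly why the later analysis works with envelopes over states rather than a single state.

```python
import numpy as np

rng = np.random.default_rng(1)

# One fully specified (sub-)state s = (theta, P_{U|Y,Z}) for the K = 1 case.
theta = 0.5
def pi(z):                       # structural cutoff pi(z; theta) = theta * z
    return theta * z

# Illustrative assumption: U | Y, Z ~ Uniform(-1, 1). In the paper P_{U|Y,Z}
# is a free, partially identified model component, not a known distribution.
def draw_u(size):
    return rng.uniform(-1.0, 1.0, size)

# Factual exogenous variable and a counterfactual policy gamma: Z -> Z.
z = rng.choice([-1.0, 1.0], size=200_000)
gamma = lambda z: np.abs(z)      # e.g. map every z to |z| = 1

# State-dependent policy transform I[phi](gamma, s) = P(Y*_gamma = 1),
# approximated by integrating the indicator over draws of (Z, U).
u = draw_u(z.shape)
y_star = (u <= pi(gamma(z))).astype(float)
I_hat = y_star.mean()
print(round(I_hat, 3))           # close to P(U <= 0.5) = 0.75 under these choices
```

Varying $(\theta, P_{U|Y,Z})$ over all states consistent with the data is what produces the envelope functions studied below.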
2.4.1 Policy Space Learnability

To understand the ex-ante theoretical analysis, we must formally introduce the concept of policy space learnability, named for its connection to notions of learnability from computational learning theory. Intuitively, a policy space $\Gamma$ will be learnable if, for some decision rule $d \in \mathcal{D}$, the value $c_n(d, \kappa)$ from Definition 2.3 can be made arbitrarily small as $n$ increases. This concept will be made precise in this subsection. A review of concepts of learnability from computational learning theory is provided in Appendix A.2.

We argue that, under the preference relation from Definition 2.3, the conceptual differences between the problem of policy choice and the problem of selecting an optimal classifier in a statistical learning setting are smaller than they may initially appear. In both settings we wish to select a decision rule based on a finite sample that will perform well, based on similar criteria, in samples yet unseen. The essential difference between the environments is that the performance of a counterfactual policy is unobservable, even for the sample in hand. Of course, this is not an issue if the policymaker has an econometric model that can be used to determine the counterfactual outcomes of the policy experiment. The general model from the previous subsections serves exactly this purpose. Given the preference relation from Definition 2.3, the policymaker is presented with a decision problem that is remarkably similar to a learning problem, which is apparent when the following definition is compared with the definition of PAC learnability from Appendix A.2.

Definition 2.4 (PAMPAC Learnability). Under Assumptions 2.1, 2.2, and 2.3, a policy space $\Gamma$ is policy agnostic maximin PAC-learnable (PAMPAC) with respect to the policy transform of $\varphi : \mathcal{V} \to \mathbb{R}$ if there exists a function $\zeta_\Gamma : \mathbb{R}_{++} \times (0,1) \to \mathbb{N}$ such that, for any $(c, \kappa) \in \mathbb{R}_{++} \times (0,1)$ and any distribution $P_{Y,Z}$ over $\mathcal{Y} \times \mathcal{Z}$, if $n \ge \zeta_\Gamma(c, \kappa)$ then there is some decision procedure $d : \Psi_n \to \Gamma$ satisfying:

$$\inf_{P_{Y,Z} \in \mathcal{P}_{Y,Z}} P^{\otimes n}_{Y,Z}\left( \inf_{s \in \mathcal{S}} I[\varphi](d(\psi), s) + c \ \ge\ \sup_{\gamma \in \Gamma} \inf_{s \in \mathcal{S}} I[\varphi](\gamma, s) \right) \ \ge\ \kappa. \quad (2.34)$$

A nearly identical definition can be given for policy agnostic minimax PAC-learnability, with the exception that the decision procedure $d : \Psi_n \to \Gamma$ must satisfy:

$$\inf_{P_{Y,Z} \in \mathcal{P}_{Y,Z}} P^{\otimes n}_{Y,Z}\left( \sup_{s \in \mathcal{S}} I[\varphi](d(\psi), s) - c \ \le\ \inf_{\gamma \in \Gamma} \sup_{s \in \mathcal{S}} I[\varphi](\gamma, s) \right) \ \ge\ \kappa. \quad (2.35)$$

That is, a policy space is PAMPAC learnable if there exists some decision rule $d : \Psi_n \to \Gamma$ that, in the worst-case (sub-)state $s \in \mathcal{S}$, closely approximates the value:

$$\sup_{\gamma \in \Gamma} \inf_{s \in \mathcal{S}} I[\varphi](\gamma, s),$$

with high probability for a sufficiently large (but finite) sample. In terms of the preference relation from Definition 2.3, PAMPAC learnability implies that, as the sample size grows, every point in $(c, \kappa)$-space must eventually (i.e. for large enough $n$) lie above the function $c_n(d, \cdot) : (0,1) \to \mathbb{R}_{++}$ for some decision rule $d$. This idea is illustrated in Figure 3.

Figure 3: This figure illustrates the idea of PAMPAC learnability from Definition 2.4. Given a pair $(c, \kappa)$, PAMPAC learnability guarantees that there is some finite $n$ and some decision rule $d : \Psi_n \to \Gamma$ such that the graph of $c_n(d, \cdot)$ lies entirely below the point $(c, \kappa)$. For example, for $(c_1, \kappa_1)$ in the figure there exists a sample size $n_1$ and a decision rule $d_1$ such that (2.34) is satisfied; (2.34) is also satisfied for the points $(c_2, \kappa_2)$ and $(c_3, \kappa_3)$ at $n_2$ and $d_2$, and $n_3$ and $d_3$, respectively. To verify PAMPAC learnability, the same must hold for all points $(c, \kappa)$; in particular, in the figure we would need to find a sample size $n_4$ and a decision rule $d_4$ such that the graph of $c_{n_4}(d_4, \cdot)$ lies entirely below the point $(c_4, \kappa_4)$.

Framed in this manner, we see that PAMPAC learnability is not required to determine the admissible decision rules or to make a policy choice. However, there may be substantial ex-ante limitations on the theoretical performance of any given decision rule in environments that are not PAMPAC learnable, making learnability an important object of theoretical analysis. Although PAMPAC learnability may appear to be a weak notion, there are quite simple environments in which a policy space $\Gamma$ may fail to be PAMPAC learnable.

Example 1 (Simultaneous Discrete Choice (cont'd)). Consider the general setup of Example 1. Suppose for simplicity that $K = 1$, and consider the following modifications. Let $\mathcal{Z} = [-1, 1]$ and $\Theta = [-1, 1]$, and let $\pi_k(Z_k, Y_{-k}; \theta) = \pi_k(Z_k; \theta) = \sin(Z_k/\theta)$. Then $Y_k$ is determined by the equation:

$$Y_k = \mathbb{1}\{\sin(Z_k/\theta) \ge U_k\}.$$

Now consider a policy space $\Gamma$ that consists of all functions $\gamma : \mathcal{Z} \to \mathcal{Z}$, and suppose we are interested in the policy transform:

$$I[\varphi](\gamma) := \int_\Omega \varphi(\omega, \gamma)\, dP = \int_\Omega \mathbb{1}\{Y^{\star}_{k,\gamma}(\omega) = 1\}\, dP,$$

where $\varphi(\omega, \gamma) = \mathbb{1}\{Y^{\star}_{k,\gamma}(\omega) = 1\}$ and:

$$Y^{\star}_{k,\gamma} = \mathbb{1}\{\sin(\gamma(Z_k)/\theta) \ge U_k\}.$$

In this case, we claim the policy space $\Gamma$ may not be PAMPAC learnable with respect to the policy transform of $\varphi$.
It is important to realize that the possible failure of PAMPAC learnability does not hinge on the choice of the sine function in this example, which is used for illustrative purposes only. Indeed, the following example shows that the idea is more general.
Example 2 (Program Evaluation (cont'd)). Consider the general setup of Example 2, with the following modifications. Let $\mathcal{Z} = [-1, 1]$, and let $\Theta$ denote the space of continuous functions with values in $[-1, 1]$. Otherwise, keep all other aspects of the factual domain the same. Now consider a policy space $\Gamma$ that consists of all continuous functions $\gamma : \mathcal{Z} \to \mathcal{Z}$. Suppose we are still interested in the policy transform of $\varphi(\omega, \gamma) = Y^{\star}_{\gamma}(\omega)$, where:

$$Y^{\star}_{\gamma} = U_0(1 - D^{\star}_{\gamma}) + U_1 D^{\star}_{\gamma}, \quad (2.36)$$

and where the random variable $D^{\star}_{\gamma}$ is given by:

$$D^{\star}_{\gamma} = \mathbb{1}\{\theta(\gamma(Z)) \ge U\}.$$

In this case, we claim the policy space $\Gamma$ may not be PAMPAC learnable with respect to the policy transform of $\varphi$.

These examples illustrate that there may be limits to which policy spaces are learnable. In the first example, learnability may fail because the structural function determining the counterfactual values of $Y^{\star}_{k,\gamma}$ is too "complex," and so cannot be adequately approximated (or "learned") with any finite amount of data. A similar explanation applies to the second example, in particular to the structural function determining the values of $D^{\star}_{\gamma}$. In the next sections we will explore sufficient conditions for the learnability of a policy space that are precisely related to constraints on the complexity of certain function spaces. After establishing that a particular policy space is learnable, which is an ex-ante (i.e. before observing the sample) notion, we will then discuss how to evaluate particular decision rules, which is an ex-post (i.e. after observing the sample) notion. Both components will be relevant to the theoretical evaluation of the decision problem.
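The sense in which $\sin(z/\theta)$ is too "complex" can be made concrete: the induced classifiers $z \mapsto \mathbb{1}\{\sin(z/\theta) \ge 0\}$ have infinite VC dimension. The sketch below verifies the standard textbook shattering construction from the VC literature (the points $10^{-i}$ and the closed-form choice of $\theta$ are the usual textbook devices, not objects from this paper): every labeling of $n$ points is reproduced exactly by some $\theta$, so no finite sample can pin down the function class.

```python
import math
from itertools import product

# The classifiers z -> 1{sin(z / theta) >= 0} shatter the points 10^{-1}, ..., 10^{-n}:
# for ANY labeling (y_1, ..., y_n) in {0,1}^n there is a theta reproducing it,
# via the standard construction 1/theta = pi * (1 + sum_i (1 - y_i) * 10^i).
n = 6
zs = [10.0 ** (-i) for i in range(1, n + 1)]

def classify(z, theta):
    return 1 if math.sin(z / theta) >= 0 else 0

for labels in product([0, 1], repeat=n):
    omega = math.pi * (1 + sum((1 - y) * 10 ** (i + 1) for i, y in enumerate(labels)))
    theta = 1.0 / omega
    # every label is matched exactly at these points
    assert all(classify(z, theta) == y for z, y in zip(zs, labels))

print(f"all {2 ** n} labelings of {n} points realized")
```

Because the construction works for every $n$, no complexity-based (e.g. VC-type) learning bound applies to this class, which is the mechanism behind the possible failure of PAMPAC learnability in Example 1.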
As suggested by (2.34) in Definition 2.4, and as discussed in the introduction, in order to determine whether a given policy space $\Gamma$ is PAMPAC learnable it is useful to first provide a characterization of the envelope functions:

$$I_{\ell b}[\varphi](\gamma) := \inf_{s \in \mathcal{S}} I[\varphi](\gamma, s), \qquad I_{ub}[\varphi](\gamma) := \sup_{s \in \mathcal{S}} I[\varphi](\gamma, s).$$

Note that, at the true distribution $P_{Y,Z}$, the function $I_{\ell b}[\varphi](\gamma)$ serves as a lower bound on the policy transform $I[\varphi](\gamma)$. Similarly, the function $I_{ub}[\varphi](\gamma)$ serves as an upper bound. Recall that this idea was illustrated in Figure 1 in the introduction.

In the case of PAMPAC learnability, if a tractable characterization of the lower envelope function $I_{\ell b}[\varphi](\gamma)$ can be provided under some conditions, then determining whether a policy space is PAMPAC learnable reduces to the problem of finding a decision rule $d : \Psi_n \to \Gamma$ that satisfies:

$$\inf_{P_{Y,Z} \in \mathcal{P}_{Y,Z}} P^{\otimes n}_{Y,Z}\left( \sup_{\gamma \in \Gamma} I_{\ell b}[\varphi](\gamma) - I_{\ell b}[\varphi](d(\psi)) \ \le\ c \right) \ \ge\ \kappa, \quad (2.37)$$

for large enough (but finite) $n$. Thus in the next section we focus on obtaining a tractable characterization of the envelope functions before returning to the problem of policy choice in Section 4. Once a tractable characterization of the lower (or upper) envelope function is provided, we will then present sufficient conditions for PAMPAC learnability. In addition to its importance to our ex-ante analysis, we will see that a tractable characterization of the envelope functions will also be key to our ex-post analysis of the policymaker's decision problem in Section 5.

In this section we derive a useful characterization of the envelope functions $I_{\ell b}[\varphi](\gamma)$ and $I_{ub}[\varphi](\gamma)$ defined in the previous section. We will show that these envelope functions can be written as the value functions of optimization problems parameterized by $\gamma \in \Gamma$.
Our specific characterization will be important when deriving our learnability results, as well as for our ex-post finite-sample analysis in the next sections. However, for those interested in partial identification, the results in this section may be of substantial independent interest. We first define the identified sets for the structural parameters and the policy transform before presenting our main result for this section. In general, these identified sets must be defined relative to a distribution $P_{Y,Z}$; for notational simplicity this is kept implicit throughout this section. We now begin by introducing some additional notation. For assistance with some of the notation in the next definition, the reader is referred to Appendix A, which discusses the notion of selectionability from a random set.
Definition 3.1 (Distributions of Selections). The collection $\mathcal{P}_{U|Y,Z}(\theta)$ contains all regular conditional probability measures $P_{U|Y,Z}$ such that each $P_{U|Y,Z} \in \mathcal{P}_{U|Y,Z}(\theta)$ is the distribution of some selection $U \in \mathrm{Sel}(G^{-1}(\cdot, \theta))$; that is:

$$\mathcal{P}_{U|Y,Z}(\theta) := \left\{ P_{U|Y,Z} : U \sim P_{U|Y,Z} \text{ for some } U \in \mathrm{Sel}(G^{-1}(\cdot, \theta)) \right\}. \quad (3.1)$$

Furthermore, the collection $\mathcal{P}_{Y^{\star}_{\gamma}|Y,Z,U}(\theta, \gamma)$ contains all regular conditional probability measures $P_{Y^{\star}_{\gamma}|Y,Z,U}$ such that each $P_{Y^{\star}_{\gamma}|Y,Z,U} \in \mathcal{P}_{Y^{\star}_{\gamma}|Y,Z,U}(\theta, \gamma)$ is the distribution of some selection $Y^{\star}_{\gamma} \in \mathrm{Sel}(G^{\star}(\cdot, \theta, \gamma))$; that is:

$$\mathcal{P}_{Y^{\star}_{\gamma}|Y,Z,U}(\theta, \gamma) := \left\{ P_{Y^{\star}_{\gamma}|Y,Z,U} : Y^{\star}_{\gamma} \sim P_{Y^{\star}_{\gamma}|Y,Z,U} \text{ for some } Y^{\star}_{\gamma} \in \mathrm{Sel}(G^{\star}(\cdot, \theta, \gamma)) \right\}. \quad (3.2)$$

(See, for example, Definition 3 in Chesher and Rosen (2017a) and the surrounding discussion. Clearly the collection $\mathcal{P}_{U|Y,Z}(\theta)$ also depends on $P_{Y,Z}$, and the collection $\mathcal{P}_{Y^{\star}_{\gamma}|Y,Z,U}(\theta, \gamma)$ also depends on $P_{Y,Z,U}$, although we suppress this dependence for notational simplicity throughout.)

We will see shortly that compactness of $\mathcal{U}$ from Assumption 2.1 is quite convenient. Indeed, under compactness of $\mathcal{U}$, the collection $\mathcal{P}_{U|Y,Z}(\theta)$ is uniformly tight for any $\theta$. If $\mathcal{P}_{U|Y,Z}(\theta)$ is also closed in the weak$^*$ topology, then the collection $\mathcal{P}_{U|Y,Z}(\theta)$ is compact in the weak$^*$ topology, which allows for a simplification of the statements and proofs of many of the results. However, because $G^{-1}$ is closed, this latter result follows directly from the fact that every selection $U \in \mathrm{Sel}(G^{-1}(\cdot, \theta))$ is supported on a compact set. Thus, throughout our exposition we can use the fact that $\mathcal{P}_{U|Y,Z}(\theta)$ is compact in the weak$^*$ topology. (See Corbae et al. (2009) Theorem 9.9.2 on p. 575, as well as the surrounding discussion.)

Beyond the simplifications that come with this result, it also resolves a meaningful issue related to selections from identically distributed random sets. Indeed, two identically distributed random sets may have different sets of measurable selections, although the weak$^*$ closures of their sets of measurable selections will always coincide. (See Molchanov (2017) Theorem 1.4.3 on p. 79.) The issue is thus entirely resolved by compactness of $\mathcal{U}$, which ensures the collection $\mathcal{P}_{U|Y,Z}(\theta)$ is closed in the weak$^*$ topology; in other words, under Assumptions 2.1 and 2.2, two identically distributed random sets $G^{-1}(Y, Z, \theta)$ and $G^{-1}(Y', Z', \theta)$ (see Definition A.2 in Appendix A) will have the same set of measurable selections.

With the additional notation afforded by Definition 3.1, we now have the following definition of the identified set of structural parameters:

Definition 3.2 (Identified Set of Structural Parameters). Under Assumptions 2.1 and 2.2, the identified set $\Theta^*$ of structural parameters (with respect to the distribution $P_{Y,Z}$) is given by:

$$\Theta^* := \left\{ \theta \in \Theta : \inf_{P_{U|Y,Z} \in \mathcal{P}_{U|Y,Z}(\theta)} \max_{j=1,\dots,J} E_{P_{U|Y,Z} \times P_{Y,Z}}[m_j(y, z, u, \theta)] \le 0 \right\}. \quad (3.3)$$

Compactness of $\mathcal{P}_{U|Y,Z}(\theta)$ in the weak$^*$ topology, combined with boundedness of the moment conditions, ensures that the infimum in the definition of $\Theta^*$ is attained. (This follows from the extreme value theorem after noting that the map $P_{U|Y,Z} \mapsto E_{P_{U|Y,Z} \times P_{Y,Z}}[m_j(y, z, u, \theta)]$ is continuous when the moment function $m_j$ is uniformly bounded.) Although our focus in this paper is not on the identified set of structural parameters, this definition will be helpful when providing a definition of the identified set for the policy transform, as well as in the proofs.

To state the definition of the identified set for the policy transform, it will be useful for us to first define the following function:

$$I^*[\varphi](\theta, \gamma, I, P_{Y^{\star}_{\gamma}|Y,Z,U}, P_{U|Y,Z}) := \max\left\{ \left| E_{P_{Y^{\star}_{\gamma}|Y,Z,U} \times P_{U|Y,Z} \times P_{Y,Z}}[\varphi(V_\gamma) - I] \right|,\ \max_{j=1,\dots,J} E_{P_{U|Y,Z} \times P_{Y,Z}}[m_j(y, z, u, \theta)] \right\}. \quad (3.4)$$

The function in (3.4) is non-positive if and only if (i) the moment conditions are satisfied under the distribution $P_{Y,Z}$ and the pair $(\theta, P_{U|Y,Z})$, and (ii) the point "$I$" is the resulting value of the policy transform for the inputs $(\theta, \gamma, P_{Y^{\star}_{\gamma}|Y,Z,U}, P_{U|Y,Z})$. As such, it represents all the conditions necessary for the point "$I$" to be included in the identified set for the policy transform. We now have the following definition:

Definition 3.3 (Identified Set for Policy Transforms). Under Assumptions 2.1, 2.2, and 2.3, for any $\gamma \in \Gamma$ the identified set for $I[\varphi](\gamma)$ (with respect to the distribution $P_{Y,Z}$) is given by:

$$\mathcal{I}^*[\varphi](\gamma) := \bigcup_{\theta \in \Theta^*} \mathcal{I}[\varphi](\theta, \gamma), \quad (3.5)$$

where:

$$\mathcal{I}[\varphi](\theta, \gamma) := \left\{ I \in \mathbb{R} : \exists\, P_{U|Y,Z} \in \mathcal{P}_{U|Y,Z}(\theta) \text{ and } P_{Y^{\star}_{\gamma}|Y,Z,U} \in \mathcal{P}_{Y^{\star}_{\gamma}|Y,Z,U}(\theta, \gamma) \text{ satisfying } I^*[\varphi]\left(\theta, \gamma, I, P_{Y^{\star}_{\gamma}|Y,Z,U}, P_{U|Y,Z}\right) \le 0 \right\}. \quad (3.6)$$

Our main result in this section provides a more insightful characterization of the identified set for policy transforms, which will also be vital for the problem of policy choice considered in the next section. However, before stating our main identification result, we require the following technical assumption.

Assumption 3.1 (Error Bounds). (i) (Linear Minorant) There exist values $\delta > 0$ and $C_1 > 0$ such that for every $\theta \in \Theta$:

$$\inf_{P_{U|Y,Z} \in \mathcal{P}_{U|Y,Z}(\theta)} \max_{j=1,\dots,J} \left| E_{P_{U|Y,Z} \times P_{Y,Z}}[m_j(y, z, u, \theta)] \right|_+ \ \ge\ C_1 \min\{\delta, d(\theta, \Theta^*)\}. \quad (3.7)$$

(ii) (Local Counterfactual Robustness) There exists a value $C_2 \ge 0$ such that for any $\theta \in \Theta^*_\delta := \{\theta : d(\theta, \Theta^*) \le \delta\}$:

$$\inf_{P_{U|Y,Z} \in \mathcal{P}_{U|Y,Z}(\theta)}\ \inf_{P_{Y^{\star}_{\gamma}|Y,Z,U} \in \mathcal{P}_{Y^{\star}_{\gamma}|Y,Z,U}(\theta,\gamma)} \int \varphi(v)\, dP_{V_\gamma} \ \ge\ \inf_{\theta^* \in \Theta^*}\ \inf_{P_{U|Y,Z} \in \mathcal{P}_{U|Y,Z}(\theta^*)}\ \inf_{P_{Y^{\star}_{\gamma}|Y,Z,U} \in \mathcal{P}_{Y^{\star}_{\gamma}|Y,Z,U}(\theta^*,\gamma)} \int \varphi(v)\, dP_{V_\gamma} - C_2\, d(\theta, \Theta^*), \quad (3.8)$$

and:

$$\sup_{P_{U|Y,Z} \in \mathcal{P}_{U|Y,Z}(\theta)}\ \sup_{P_{Y^{\star}_{\gamma}|Y,Z,U} \in \mathcal{P}_{Y^{\star}_{\gamma}|Y,Z,U}(\theta,\gamma)} \int \varphi(v)\, dP_{V_\gamma} \ \le\ \sup_{\theta^* \in \Theta^*}\ \sup_{P_{U|Y,Z} \in \mathcal{P}_{U|Y,Z}(\theta^*)}\ \sup_{P_{Y^{\star}_{\gamma}|Y,Z,U} \in \mathcal{P}_{Y^{\star}_{\gamma}|Y,Z,U}(\theta^*,\gamma)} \int \varphi(v)\, dP_{V_\gamma} + C_2\, d(\theta, \Theta^*). \quad (3.9)$$

Intuitively, Assumption 3.1 makes two statements. First, part (i) is a global condition requiring that, whenever $\theta \in \Theta \setminus \Theta^*$, there is at least one moment function that can be bounded below by the function on the right side of (3.7). In general this condition is very similar to previous conditions in the literature; see, for example, the "partial identification condition" in Chernozhukov et al. (2007) section 30.2, and see Kaido et al. (2019) for a review of similar conditions. The major difference arises from the fact that the condition must hold for all $P_{U|Y,Z} \in \mathcal{P}_{U|Y,Z}(\theta)$, owing to the fact that the moment conditions in this paper are allowed to depend on the latent variables. Verifying condition (i) can usually be done by first enumerating all scenarios which imply $\theta \notin \Theta^*$, and then verifying that the condition holds for each such scenario. This is exactly the strategy used when verifying the assumption in the examples. Also note that the condition is automatically satisfied if $\mathcal{P}_{U|Y,Z}(\theta)$ is empty—that is, when $G^{-1}(Y, Z, \theta)$ admits no measurable selections—or when none of the moment conditions depend on the structural parameters.

Part (ii) of Assumption 3.1 appears to be entirely new.
Intuitively, (3.8) is a local condition requiring that the smallest value of the integral of $\varphi$ does not decrease too quickly as we move $\theta$ slightly outside of the identified set. In the opposite direction, (3.9) requires that the largest value of the integral of $\varphi$ does not increase too quickly as we move $\theta$ slightly outside of the identified set. These conditions will be violated if, for example, the value of the integral can change discontinuously on the boundary of the identified set. We call the condition the local counterfactual robustness condition because it demands that small changes in the value of the structural parameters do not generate discontinuous changes in the value of the counterfactual quantity of interest. Interestingly, both of the conditions in Assumption 3.1 are related to typical assumptions made in the theory of error bounds in the optimization literature. (See Pang (1997) for an introduction.) Finally, note that the value of $\delta$ in parts (i) and (ii) is the same. However, this is not restrictive, since parts (i) and (ii) can be established for two different values $\delta_{(i)}, \delta_{(ii)} > 0$, and then $\delta$ can be taken as $\delta = \min\{\delta_{(i)}, \delta_{(ii)}\}$.

In practice, part (ii) of Assumption 3.1 can be challenging to verify. Because of this, we introduce the following assumption as an alternative to part (ii) of Assumption 3.1:

Assumption 3.2 (Error Bounds (2)(ii)). For some $\delta > 0$, there exist values $\ell_1, \ell_2 \ge 0$ (possibly depending on $\delta$) such that:

$$d(u, G^{-1}(y, z, \theta)) \le \ell_1 \cdot d(\theta, \Theta^{-1}(y, z, u) \cap \Theta^*_\delta), \quad (y, z)\text{-a.s. for all } u \in \mathcal{U} \text{ and } \theta \in \Theta^*_\delta, \quad (3.10)$$

$$d(y^{\star}, G^{\star}(y, z, u, \theta, \gamma)) \le \ell_2 \cdot d(\theta, \Theta^{\star}(v, \gamma) \cap \Theta^*_\delta), \quad (y, z, u)\text{-a.s. for all } y^{\star} \in \mathcal{Y}^{\star} \text{ and } \theta \in \Theta^*_\delta, \quad (3.11)$$

where $\Theta^{-1}(y, z, u)$ and $\Theta^{\star}(v, \gamma)$ are defined by:

$$\Theta^{-1}(y, z, u) := \left\{ \theta : u \in G^{-1}(y, z, \theta) \right\}, \qquad \Theta^{\star}(v, \gamma) := \left\{ \theta : y^{\star} \in G^{\star}(y, z, u, \theta, \gamma) \right\}.$$

Furthermore, the function $\varphi : \mathcal{V} \to \mathbb{R}$ is bounded, measurable, and Lipschitz continuous in $(u, y^{\star})$ with Lipschitz constant $L_\varphi$.

The following lemma shows that Assumption 3.2 is sufficient for part (ii) of Assumption 3.1. In the process, the lemma establishes an interesting connection between Assumption 3.1 and certain Lipschitzian behaviour of the random sets $G^{-1}$ and $G^{\star}$ with respect to the structural parameters $\theta \in \Theta$.

Lemma 3.1. Suppose that Assumptions 2.1, 2.2 and 2.3 are satisfied, and that $G^{-1}(\cdot, \theta)$ and $G^{\star}(\cdot, \theta, \gamma)$ are almost-surely non-empty for each $\theta \in \Theta^*$. Then Assumption 3.2 implies Assumption 3.1(ii) with $C_2 = L_\varphi \max\{\ell_1, \ell_2\}$.

Proof. See Appendix B. $\blacksquare$
It can be shown that the conditions (3.10) and (3.11) are equivalent to almost-sure versions of Lips-chitz continuity conditions for set-valued maps, where the distance between two sets is measured by thePompeiu–Hausdorff distance. Localized versions of these conditions are called metric regularity conditions,which also have a close connection to constraint qualifications from optimization theory. See Dontchev andRockafellar (2009) Chapter 3.3 and Ioffe (2016) for a discussion.
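For interval-valued sets the Pompeiu–Hausdorff distance has a simple closed form, which makes the kind of set-valued Lipschitz condition underlying (3.10)-(3.11) easy to visualize. The map `G` below is a stylized stand-in of ours (it is not one of the paper's random sets): a parameterized interval that translates with $\theta$, and is therefore Lipschitz in $\theta$ with constant 1 under the Pompeiu–Hausdorff distance.

```python
# Pompeiu-Hausdorff distance between nonempty closed intervals [a1, b1], [a2, b2]:
# d_H = max(|a1 - a2|, |b1 - b2|).
def hausdorff(i1, i2):
    (a1, b1), (a2, b2) = i1, i2
    return max(abs(a1 - a2), abs(b1 - b2))

# A stylized interval-valued map theta -> G(theta) = [theta - 1, theta + 1].
def G(theta):
    return (theta - 1.0, theta + 1.0)

# G is Lipschitz in the Pompeiu-Hausdorff distance with constant 1:
# d_H(G(t1), G(t2)) = |t1 - t2| for this map.
for t1, t2 in [(0.0, 0.3), (-0.5, 1.2), (2.0, 2.0)]:
    assert abs(hausdorff(G(t1), G(t2)) - abs(t1 - t2)) < 1e-12

print("Lipschitz (constant 1) verified on sample pairs")
```

Conditions (3.10)-(3.11) ask for exactly this kind of controlled variation of $G^{-1}$ and $G^{\star}$ as $\theta$ moves near the identified set.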
We can finally turn to our main objective for this section, which is the problem of bounding the policy transform $I[\varphi](\gamma)$. In principle, bounds on $I[\varphi](\gamma)$ can be obtained by solving two (very) complicated constrained optimization problems that search over all distributions $P_{U|Y,Z}$ and $P_{Y^{\star}_{\gamma}|Y,Z,U}$ satisfying our modelling assumptions for the ones that maximize and minimize the policy transform of $\varphi$. However, such optimization problems will be infeasible in most realistic cases. The following result provides a tractable formulation of bounds on policy transforms that will be important for the next section.

Theorem 3.1 (Bounds on the Policy Transform). Suppose that Assumptions 2.1, 2.2, 2.3 and 3.1 all hold. Also, suppose that $\varphi : \mathcal{V} \to [\varphi_{\ell b}, \varphi_{ub}] \subset \mathbb{R}$ is a bounded, measurable function, and that for each $\gamma \in \Gamma$ the random sets $G^{-1}(\cdot, \theta)$ and $G^{\star}(\cdot, \theta, \gamma)$ are almost-surely non-empty for each $\theta \in \Theta^*$. Then $\overline{co}\, \mathcal{I}^*[\varphi](\gamma) = [I_{\ell b}[\varphi](\gamma), I_{ub}[\varphi](\gamma)]$, with:

$$I_{\ell b}[\varphi](\gamma) = \inf_{\theta \in \Theta} \max_{\lambda_j \in \{0,1\}} \int \inf_{u \in G^{-1}(y,z,\theta)}\ \inf_{y^{\star} \in G^{\star}(y,z,u,\theta,\gamma)} \left( \varphi(v) + \mu^* \sum_{j=1}^J \lambda_j m_j(y, z, u, \theta) \right) dP_{Y,Z}, \quad (3.12)$$

$$I_{ub}[\varphi](\gamma) = \sup_{\theta \in \Theta} \min_{\lambda_j \in \{0,1\}} \int \sup_{u \in G^{-1}(y,z,\theta)}\ \sup_{y^{\star} \in G^{\star}(y,z,u,\theta,\gamma)} \left( \varphi(v) - \mu^* \sum_{j=1}^J \lambda_j m_j(y, z, u, \theta) \right) dP_{Y,Z}, \quad (3.13)$$

where $\mu^* \in \mathbb{R}_+$ is any value satisfying:

$$\mu^* \ \ge\ \max\left\{ \frac{C_2}{C_1},\ \frac{\varphi_{ub} - \varphi_{\ell b}}{C_1 \delta} \right\}, \quad (3.14)$$

and where $C_1$, $C_2$ and $\delta$ are from Assumption 3.1.

Proof. See Appendix B. $\blacksquare$
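The structure of (3.12) can be illustrated with a deliberately simple discretized model. Everything in the sketch below is an illustrative assumption of ours, not the paper's implementation: a single threshold equation $Y = \mathbb{1}\{\theta Z \ge U\}$ with $U \in [-1,1]$ and binary $Z$, a moment pair $(m_1, m_2) = (u, -u)$ encoding $E[U] = 0$, an observed distribution generated from $\theta_0 = 0.5$, a sign-flipping policy, and an arbitrary penalty $\mu^* = 10$. In this toy model $G^{\star}$ is a singleton given $(u, \theta)$, so the inner infimum runs only over $u$; the parameter grid stops at $\pm 0.99$ so that $G^{-1}$ is never empty.

```python
import itertools
import numpy as np

# Toy threshold model: Y = 1{theta * Z >= U}, U in [-1, 1], Z in {-1, +1}.
# The counterfactual policy flips Z, and phi(v) = Y*_gamma, so the policy
# transform is the counterfactual choice probability P(Y*_gamma = 1).
u_grid = np.linspace(-1.0, 1.0, 401)        # discretized support of U
theta_grid = np.linspace(-0.99, 0.99, 199)  # discretized Theta (keeps G_inv nonempty)
mu = 10.0                                   # penalty mu* (illustrative value)

# Observed distribution P_{Y,Z}, generated from theta0 = 0.5 with U ~ U(-1, 1).
pz = {1: 0.5, -1: 0.5}
p_y1 = {z: (0.5 * z + 1.0) / 2.0 for z in (1, -1)}
P = {(y, z): pz[z] * (p_y1[z] if y == 1 else 1.0 - p_y1[z])
     for y in (0, 1) for z in (1, -1)}

def G_inv(y, z, theta):
    """Grid points u consistent with observing (y, z) at parameter theta."""
    return u_grid[u_grid <= theta * z] if y == 1 else u_grid[u_grid > theta * z]

def lower_envelope(gamma):
    """Discretized analogue of (3.12): inf over theta, max over lambda in {0,1}^2,
    and an integral of the inner inf over u (G* is a singleton given (u, theta))."""
    vals = []
    for theta in theta_grid:
        lam_totals = []
        for lam1, lam2 in itertools.product((0, 1), repeat=2):
            total = 0.0
            for (y, z), p in P.items():
                u = G_inv(y, z, theta)
                y_star = (theta * gamma(z) >= u).astype(float)  # phi(v) = y*
                # moment pair (m1, m2) = (u, -u) encodes E[U] = 0
                total += p * np.min(y_star + mu * (lam1 - lam2) * u)
            lam_totals.append(total)
        vals.append(max(lam_totals))
    return min(vals)

I_lb = lower_envelope(lambda z: -z)   # policy gamma(z) = -z
print(round(I_lb, 3))                 # -> 0.125 for this toy configuration
```

The max over $\lambda$ penalizes parameter values whose selections violate the moment conditions, exactly the role the exact-penalty term $\mu^*$ plays in the theorem.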
Theorem 3.1 states that the closed convex hull of the identified set $\mathcal{I}^*[\varphi](\gamma)$ from Definition 3.3 for the policy transform $I[\varphi](\gamma)$ can be computed as the solution to two optimization problems. Interestingly, these optimization problems are closely related to problems found in the literature on mathematical programming problems subject to equilibrium constraints (MPECs), which have previously seen applications in economics to social planning problems and Stackelberg games. The upper and lower envelope functions in Theorem 3.1 are perhaps most aptly characterized as penalized optimization problems, with $\mu^*$ in (3.14) serving the role of the penalty parameter. Both the statement of the result and its proof rely on the theory of exact penalty functions from the literature on error bounds in variational analysis. The theorem uses the error bounds in Assumption 3.1 to show that the penalty $\mu^*$ can be taken to be finite. This is very important for the theoretical analysis of the policy decision problem in the sections ahead. Furthermore, Theorem 3.1 implicitly shows that the values of $\lambda_j$ will depend only on the parameter $\theta$, a point which will be used in the next sections.

From an identification perspective, the envelope functions will generally not give sharp bounds on the policy transform. However, under any additional conditions that ensure the identified set $\mathcal{I}^*(\gamma)$ is closed and convex for every $\gamma \in \Gamma$, Theorem 3.1 provides a (point-wise in $\gamma$) sharp characterization of the identified set for the policy transform. Finally, the result is easily modified for the case when the object of interest is a conditional policy transform.

One of the most interesting features of Theorem 3.1 is that, when the counterfactual object of interest takes a particular form, there is no need to compute the identified set $\Theta^*$ of structural parameters in order to bound the counterfactual object of interest.
(For a textbook treatment of MPECs, see Luo et al. (1996); see Dolgopolik (2016) for a review of exact penalty methods.) In addition, the unobservables in the problem are profiled out, and when the identified set $\mathcal{I}^*(\gamma)$ is closed and convex this is without any loss of information. This point also translates into the policy decision problem studied in the next sections. The structural parameters and unobservables intuitively play the role of an intermediary connecting the factual and counterfactual domains. However, after the envelope functions from Theorem 3.1 are computed, they play no further role in the problem of policy choice.

While we will not dwell on measurability issues in the main text, we note that Lemma B.1 in Appendix B.2.1 shows that the integrands in the optimization problems are universally measurable; that is, measurable for the completion of any probability measure $P_{Y,Z}$. The proof of this result relies crucially on the fact that both $G^{-1}$ and $G^{\star}$ are Effros-measurable. Furthermore, Proposition B.1 in Appendix B.2.1 shows that the maps $\gamma \mapsto I_{\ell b}[\varphi](\gamma), I_{ub}[\varphi](\gamma)$ are measurable with respect to the universal $\sigma$-algebra on $\Gamma$ (as generated by the Borel $\sigma$-algebra). These results will be important to keep in mind in the next sections on policy choice.

We now return to the examples presented earlier to discuss our identification result. We will first verify Assumption 3.1 in our examples and will show how Lemma 3.1 can be helpful.

Example 1 (Simultaneous Discrete Choice (cont'd)). Consider again Example 1 on simultaneous discrete choice, and recall that we have imposed a median zero and median independence restriction using the moment conditions in (2.6) and (2.7). This example presents challenges for the verification of Assumption 3.1 because of the discontinuity of the function $\varphi(v) = \mathbb{1}\{\pi_k(\gamma(z, y_{-k}); \theta) \ge u\}$. Indeed, under our current assumptions, Assumption 3.1 is not satisfied. To appreciate the intuition, focus on Assumption 3.1(ii).
The issue for this assumption arises only when for some k ∈ {1, . . . , K} and some z ∈ Z and y−k ∈ Y^{K−1} we have (i) the counterfactual cutoff value πk(γ(z, y−k); θ∗) = 0 at some θ∗ ∈ ∂Θ∗, and (ii) P(Yk = 1 | Zk = z′, Y−k = y′−k) ≠ 0.5, where (z′, y′−k) = γ(z, y−k). In this knife-edge case, a very small change from θ∗ ∈ ∂Θ∗ to some θ ∉ Θ∗ can cause a discontinuous change in P(Y⋆γ,k = 1). A full description of this failure, including illustrations of various cases, is presented in Appendix C.1.2.

However, by slightly strengthening our moment conditions we can satisfy Assumption 3.1 in this example. The key is to introduce additional assumptions on the degree of smoothness of the distribution of Uk around zero. In particular, we will replace the moment conditions in (2.6) and (2.7) with the following conditions:

E[(1{Uk ≤ πk(z′, y′−k; θ)} − max{L πk(z′, y′−k; θ), 0} − 0.5) 1{Zk = z, Y−k = y−k}] ≤ 0,   (3.15)
E[(0.5 − 1{Uk ≤ πk(z′, y′−k; θ)} − max{−L πk(z′, y′−k; θ), 0}) 1{Zk = z, Y−k = y−k}] ≤ 0,   (3.16)

for k = 1, . . . , K, for all z, z′ ∈ Z and all y−k, y′−k ∈ Y^{K−1}. In addition to implying the median zero/median independence assumption, these new moment conditions also limit the amount of probability mass on U that is arbitrarily close to zero, which turns out to be key to satisfying Assumption 3.1.
Also note that, despite the fact that these moment conditions implicitly impose constraints on the obtainable counterfactual choice probabilities, it is easily verified that they do not impose any additional constraints on the set of structural parameters θ ∈ Θ that can rationalize the observed distribution (in the sense of Definition 3.2), and thus do not violate the no-backtracking principle introduced in Remark 2.1.

With these new moment conditions, Assumption 3.1 can be shown to be satisfied. Recall that when first introducing Example 1 we assumed πk is a known measurable function of (Zk, Y−k) that is linear in the parameters θ, and has a gradient (with respect to θ) bounded away from zero for each (z, y−k). We conclude that πk is Lipschitz in θ, and also satisfies a "reverse Lipschitz" condition; that is, for each (z, y−k) we have:

L′k ||θ − θ∗|| ≤ |πk(z, y−k; θ) − πk(z, y−k; θ∗)| ≤ Lk ||θ − θ∗||,

for some L′k, Lk > 0. Now define:

τ := min_k min_(z,y−k) |0.5 − P(Yk = 1 | Z = z, Y−k = y−k)|  s.t.  |0.5 − P(Yk = 1 | Z = z, Y−k = y−k)| > 0.   (3.17)

Then the analysis in Appendix C.1.2 shows that Assumption 3.1 is verified for C1 = L L′, C2 = L L and δ = τ/(L L′), where L = min_k Lk and L′ = min_k L′k. In Theorem 3.1 we can thus take the penalty µ∗ to be any value satisfying:

µ∗ ≥ max{L L′, τ}.

Theorem 3.1 says that the lower and upper envelopes on I[ϕ](γ) = P(Y⋆γ = 1), as a function of γ, are given by (3.12) and (3.13), respectively.

Remark 3.1 (Counterfactual Coherency). Recall that Theorem 3.1 applies only if the random sets G−(·, θ) and G⋆(·, θ, γ) are almost-surely non-empty for each θ ∈ Θ∗.
In the simultaneous discrete choice example, the counterfactual map G⋆(·, θ, γ) can fail to be almost-surely non-empty, which is related to the well-known problem of coherency in these models. In particular, for a given instantiation of the vector of unobservables (u1, . . . , uK), there may not exist any vector of counterfactual endogenous outcome variables (y⋆1,γ, . . . , y⋆K,γ) that solves the system of equations represented by (2.8). However, we note that this issue is unrelated to our particular approach, and might be resolved by (i) conditioning the analysis on the subset of U that ensures a solution to the system of equations in (2.8), or (ii) imposing constraints on the parameter space that ensure the existence of a solution to the system of equations in (2.8). We refer the reader to Chesher and Rosen (2020) for a thorough discussion of this issue. However, whether this "counterfactual coherency" problem can be resolved without violating the no-backtracking principle from Remark 2.1 appears to be an open question.

Example 2 (Program Evaluation (cont'd)). Consider again Example 2 on program evaluation. Verification of Assumption 3.1 is presented in Appendix C.2.2, and uses Lemma 3.1 to verify Assumption 3.1(ii). Remarkably, we show that Assumption 3.1 is satisfied for any value of δ > 0 with C1 = C2 = 1. Thus we can take the penalty µ∗ = 1. Theorem 3.1 then says that the lower and upper envelopes on I[ϕ](γ) = E[Y⋆γ], as a function of γ, are given by (3.12) and (3.13), respectively.

In this section, we provide sufficient conditions for PAMPAC learnability. To begin, the following proposition clarifies the connection between the lower envelope function from the previous section and the notion of PAMPAC learnability.
Proposition 4.1.
Suppose Assumptions 2.1, 2.2, 2.3, and 3.1 hold. Also, suppose that ϕ : V → [ϕℓb, ϕub] ⊂ R is a bounded, measurable function, and that for each γ ∈ Γ, the random sets G−(·, θ) and G⋆(·, θ, γ) are almost-surely non-empty for each θ ∈ Θ∗. Then a policy space Γ is PAMPAC learnable with respect to the policy transform of ϕ if and only if:

inf_{P_Y,Z ∈ P_Y,Z} P⊗n_Y,Z ( sup_{γ∈Γ} Iℓb[ϕ](γ) − Iℓb[ϕ](d(ψ)) ≤ c ) ≥ κ,   (4.1)

where Iℓb[ϕ] : Γ → R is the lower envelope function from Theorem 3.1.

Remark 4.1.
By Proposition B.1 in Appendix B.2.1, the map ψ ↦ Iℓb[ϕ](d(ψ)) is universally measurable; that is, measurable with respect to the completion of any P_Y,Z ∈ P_Y,Z. Thus, the event in (4.1) can always be assigned a unique probability, using outer measures if necessary.
In particular, the lower envelope function completely characterizes PAMPAC learnability of the policy space Γ with respect to ϕ. Thus, it should be unsurprising that our sufficient conditions for a policy space to be PAMPAC learnable will be related to the behaviour of the lower envelope function from Theorem 3.1.

Next we introduce an entropy growth condition, which will be imposed as a constraint on the complexity allowed for both the moment functions and the function ϕ. To introduce the entropy growth condition, we must first define the covering number and metric entropy for a class of functions.

Definition 4.1 (Covering Number, Metric Entropy). Let (T, ρ) be a semi-metric space. A cover of T is any collection of sets whose union contains T as a subset. For any ε > 0, the covering number for T, denoted by N(ε, T, ρ), is the smallest number of ρ-balls needed to form an ε-cover. The metric entropy is the logarithm of the covering number.

Definition 4.2 (Entropy Growth Condition). Let 𝓕 be a measurable class of real-valued functions on a measurable space (X, A_X) with envelope F. The class 𝓕 satisfies the entropy growth condition if:

sup_{Q ∈ Q_n} log N(ε, 𝓕, ||·||_Q) = o(n),   (4.2)

for every ε > 0, with the supremum taken over all discrete probability measures Q_n on X with atoms that have probabilities that are integer multiples of 1/n.

This condition is adapted from a condition in Dudley et al. (1991) (Theorem 6, p. 500) that, in combination with other mild conditions, is shown to be sufficient for a class of functions to be uniform Glivenko-Cantelli. The entropy growth condition essentially says that, for any set X_n of n points (x1, . . . , xn) in some space X, the logarithm of the minimal number of balls of radius ε > 0 needed to cover the set

𝓕|_{X_n} := {(f(x1), . . . , f(xn)) : f ∈ 𝓕} ⊆ R^n,

is of order o(n). Sufficient conditions for this to be the case can be connected to conditions previously used in the literature.
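Before connecting (4.2) to conditions from the literature, a small numerical sketch may help make Definitions 4.1 and 4.2 concrete. The snippet below bounds the covering number of a function class restricted to n sample points with a greedy cover; it is illustrative only, treats a finite collection of functions as vectors in R^n with the empirical L2 distance, and the function names are our own rather than anything from the paper.

```python
import numpy as np

def empirical_covering_number(F_vals, eps):
    """Greedy upper bound on the covering number N(eps, F, rho) of
    Definition 4.1, for a finite class tabulated on n sample points.
    F_vals is an (m, n) array: row k is one function evaluated at the
    points x_1, ..., x_n; rho is the empirical L2 distance."""
    F_vals = np.asarray(F_vals, dtype=float)
    uncovered = np.ones(len(F_vals), dtype=bool)
    n_balls = 0
    while uncovered.any():
        center = F_vals[np.argmax(uncovered)]   # first still-uncovered function
        dist = np.sqrt(((F_vals - center) ** 2).mean(axis=1))
        uncovered &= dist > eps                 # everything in the eps-ball is covered
        n_balls += 1
    return n_balls

# Illustration: indicators 1{x <= t} restricted to n = 10 sample points
x = np.linspace(0.0, 1.0, 10)
F_vals = np.array([(x <= t).astype(float) for t in np.linspace(0.0, 1.0, 200)])
print(empirical_covering_number(F_vals, eps=0.25))  # one ball per distinct 0/1 pattern
```

Restricted to n points, this indicator class produces at most n + 1 distinct 0/1 patterns, so its metric entropy grows like log n = o(n), in line with the entropy growth condition.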
For example, (4.2) is satisfied if the class of functions is of VC-type (c.f. Chernozhukov et al. (2013), Belloni et al. (2019)), if the class satisfies Pollard's manageability criterion (c.f. Pollard (1990), Andrews and Shi (2013), Andrews and Shi (2017)), or if the class of functions is otherwise known to be a uniform Donsker class.

The following theorem shows that if certain classes of functions in the policy analysis problem obey the entropy growth condition, then every policy space is PAMPAC learnable. To state the result, we must first introduce an important class of functions. Let Λ = {0, 1}^J, and for a fixed triple (θ, γ, λ) ∈ Θ × Γ × Λ, let hℓb(·, ·, θ, γ, λ) : Y × Z → R be given by:

hℓb(y, z, θ, γ, λ) := inf_{u ∈ G−(y,z,θ)} ( inf_{y⋆ ∈ G⋆(y,z,u,θ,γ)} ϕ(v) + µ∗ Σ_{j=1}^J λj mj(y, z, u, θ) ).   (4.3)

(Footnote: See also Van Der Vaart and Wellner (1996), Theorem 2.8.1, p. 167.)

The function hℓb(·, ·, θ, γ, λ) is exactly the integrand in the lower envelope function from Theorem 3.1. Now define the class of functions:

Hℓb := {hℓb(·, ·, θ, γ, λ) : Y × Z → R : (θ, γ, λ) ∈ Θ × Γ × Λ}.   (4.4)

Then we have the following result:

Theorem 4.1.
Suppose that Assumptions 2.1, 2.2, 2.3 and 3.1 hold. Also, suppose that ϕ : V → [ϕℓb, ϕub] ⊂ R is a bounded, measurable function, and that for each γ ∈ Γ, the random sets G−(·, θ) and G⋆(·, θ, γ) are almost-surely non-empty for each θ ∈ Θ∗. Fix any ε > 0.

(i) If the class of functions Hℓb satisfies the entropy growth condition, then every policy space is PAMPAC learnable with respect to the policy transform of ϕ. Furthermore, for any c > 0 we have:

sup_{P_Y,Z ∈ P_Y,Z} P⊗n_Y,Z ( sup_{γ∈Γ} inf_{s∈S} I[ϕ](γ, s) − inf_{s∈S} I[ϕ](d(ψ), s) ≥ c ) = O(r(n)),   (4.5)

where:

r(n) := max{ n^{−1/2}, n^{−1/2} sup_{Q ∈ Q_n} √(log N(ε, Hℓb, ||·||_Q)) }.   (4.6)

(ii) If the classes of functions:

Φ := {ϕ(·, u, y⋆) : Y × Z → R : (u, y⋆) ∈ U × Y⋆},   (4.7)
Mj := {mj(·, u, θ) : Y × Z → R : (u, θ) ∈ U × Θ},  j = 1, . . . , J,   (4.8)

are uniformly bounded and satisfy the entropy growth condition, then so does Hℓb. Furthermore, for any c > 0 we have:

sup_{P_Y,Z ∈ P_Y,Z} P⊗n_Y,Z ( sup_{γ∈Γ} inf_{s∈S} I[ϕ](γ, s) − inf_{s∈S} I[ϕ](d(ψ), s) ≥ c ) = O(r(n)),   (4.9)

where:

r(n) := max{ n^{−1/2}, n^{−1/2} sup_{Q ∈ Q_n} √( log N(ε/2, Φ, ||·||_Q) + Σ_{j=1}^J log N(ε/2, Mj, ||·||_Q) ) }.   (4.10)
See Appendix B. ∎
The proof of part (i) proceeds by proposing a specific decision procedure, and then showing that the proposed decision procedure satisfies the requirements of PAMPAC learnability from Definition 2.4 when the class of functions Hℓb satisfies the entropy growth condition. The specific decision procedure proposed in the proof is any procedure that obtains within ε of the maximum of the sample-analog lower envelope function for each sample ψ ∈ Ψn, for some ε > 0. We call this rule the ε-maximin empirical rule, and we will revisit its properties in the next subsection. Here we also finally see the close connection between PAMPAC learnability and the lower envelope function from Theorem 3.1 in the previous section, which has been alluded to throughout the paper. The particular form of the lower envelope function from Theorem 3.1 makes it amenable to analysis using methods from empirical process theory, which are used in the proof of Theorem 4.1. Also note that Assumption 3.1, which was needed to obtain a bound on the penalty µ∗ in Theorem 3.1, is also needed for this result. Without a bound on this penalty, Theorem 4.1 will generally not be true.

The proof of part (ii) of Theorem 4.1 shows that if each "component" of the lower envelope of the policy transform (namely the moment functions and the function ϕ) satisfies the entropy growth condition, then the metric entropy of the class Hℓb can also be controlled. Combined with the result in Proposition 4.1, the proof of part (ii) of Theorem 4.1 then shows that our proposed ε-maximin decision rule can obtain close to the maximum value (over γ ∈ Γ) of the lower envelope of the policy transform with high probability.

It may seem surprising that our learnability result holds for any policy space. However, this is because the complexity of the policy space is tempered by the class of functions Φ from (4.7), since it is only through functions in this class that the policy can affect the policy transform. By imposing that the class Φ satisfy the entropy growth condition, we are implicitly imposing constraints on the complexity of the policy space. Note that the theorem provides only sufficient conditions for PAMPAC learnability, and alternative results that impose complexity constraints on the policy space Γ directly, rather than on Φ, may be possible.

We will now turn to our motivating examples to verify learnability of the involved policy spaces.
Example 1 (Simultaneous Discrete Choice (cont’d)) . Consider again Example 1 on simultaneous discretechoice. In this case we have:
Φ := {1{πk(γ(·); θ) ≥ u} : (u, θ) ∈ U × Θ},   (4.11)

with the moment conditions:

E[(1{Uk ≤ πk(z′, y′−k; θ)} − max{L πk(z′, y′−k; θ), 0} − 0.5) 1{Zk = z, Y−k = y−k}] ≤ 0,   (4.12)
E[(0.5 − 1{Uk ≤ πk(z′, y′−k; θ)} − max{−L πk(z′, y′−k; θ), 0}) 1{Zk = z, Y−k = y−k}] ≤ 0,   (4.13)

for k = 1, . . . , K, for all z, z′ ∈ Z and all y−k, y′−k ∈ Y^{K−1}. Details on the verification of the entropy growth condition for both Φ and the class of moment functions associated with the moment conditions above are presented in Appendix C.1.3. Furthermore, under our assumptions for this example, the rate of convergence derived from Theorem 4.1 is found to be O(n^{−1/2}).

Example 2 (Program Evaluation (cont'd)). Consider again Example 2 on program evaluation. In this case we have:
Φ := {1{g(γ(z)) ≥ u}(u1 − u0) + u0 : (u0, u1, u, g) ∈ U × G},   (4.14)

with the moment conditions:

E[(D − g(Z, X)) 1{Z = z, X = x}] ≤ 0,  ∀ z ∈ Z, x ∈ X,   (4.15)
E[(g(Z, X) − D) 1{Z = z, X = x}] ≤ 0,  ∀ z ∈ Z, x ∈ X,   (4.16)
E[(1{U ≤ g(z, x)} − g(z, x)) 1{X = x}] ≤ 0,  ∀ z ∈ Z, x ∈ X,   (4.17)
E[(g(z, x) − 1{U ≤ g(z, x)}) 1{X = x}] ≤ 0,  ∀ z ∈ Z, x ∈ X,   (4.18)
E[t(z, x) − 1{Z = z, X = x}] ≤ 0,  ∀ z ∈ Z, ∀ x ∈ X,   (4.19)
E[1{Z = z, X = x} − t(z, x)] ≤ 0,  ∀ z ∈ Z, ∀ x ∈ X,   (4.20)

and:

E[ Ud (1{Z = z, X = x} Σ_{z′∈Z} t(z′, x) − 1{X = x} t(z, x)) ] ≤ 0,  ∀ z ∈ Z, x ∈ X, d ∈ {0, 1},   (4.21)
E[ Ud (1{X = x} t(z, x) − 1{Z = z, X = x} Σ_{z′∈Z} t(z′, x)) ] ≤ 0,  ∀ z ∈ Z, x ∈ X, d ∈ {0, 1}.   (4.22)

Details on the verification of the entropy growth condition for both Φ and the class of functions associated with the moment functions above are presented in Appendix C.2.3. Furthermore, under our assumptions for this example, the rate of convergence derived from Theorem 4.1 is found to be O(n^{−1/2}).

Theorem 4.1 gives sufficient conditions for PAMPAC learnability in a given environment. However, while the result shows that it may be possible ex-ante (i.e. before observing a particular sample) to learn a given policy space, it does not provide any useful ex-post (i.e. after observing the sample) information on the performance of our decision rule. This reflects a well-known criticism of PAC learnability, and has given rise to the literature on data-dependent excess risk bounds in the statistical learning literature; see Bartlett et al. (2002), Koltchinskii (2001), and Koltchinskii (2006) for examples, and Boucheron et al. (2005) or Koltchinskii (2011) for a review.
Thus, after establishing learnability of a particular class of policies, it may be of separate interest to evaluate the finite-sample performance of a given decision rule for a given sample. This is accomplished in the next subsections. We will focus our attention on the particular decision rule used in the proof of Theorem 4.1, which was shown to satisfy the requirements of PAMPAC learnability under the assumptions of the theorem. The decision rule used was allowed to be any ε-maximizer of the empirical version of the lower envelope function Iℓb[ϕ](γ), which is why we call it the ε-maximin empirical rule.

Definition 5.1 (ε-maximin empirical welfare). Fix any ε ≥ 0 and let Îℓb[ϕ](γ) denote the lower envelope from Theorem 3.1 evaluated at the empirical measure for (Y, Z). Then d : Ψn → Γ is an ε-maximin empirical (eME) rule if:

Îℓb[ϕ](d(ψ)) + ε ≥ sup_{γ∈Γ} Îℓb[ϕ](γ).   (5.1)

Remark 5.1.
Note that in general the "ε" is necessary (although it can be made arbitrarily small), owing to the fact that the supremum of Îℓb[ϕ](·) may not be attained.

Furthermore, unlike our result on PAMPAC learnability, all of the results in the next subsections are data-dependent, and do not depend on any particular properties (beyond measurability) of the function classes involved in the policy decision problem. Thus, there is no need to verify the entropy growth condition, or any other condition sufficient for learnability, in order to use the results ahead. In practice, we still recommend that the sufficient conditions for learnability of a policy space be verified before using these results.
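As an illustration of Definition 5.1, the sketch below selects an eME policy from a finite grid. It is a sketch only: the empirical lower envelope is passed in as a black-box callable, since its construction (the penalized inner problems of Theorem 3.1 evaluated at the empirical measure) is model-specific, and all names and the toy envelope are our own.

```python
import numpy as np

def eme_rule(policy_grid, lower_envelope_hat, eps):
    """epsilon-maximin empirical (eME) rule of Definition 5.1 over a
    finite grid of policies.  `lower_envelope_hat` stands in for the
    empirical lower envelope gamma -> I_lb[phi](gamma)."""
    values = np.array([lower_envelope_hat(g) for g in policy_grid])
    sup_val = values.max()
    # any policy whose value plus eps weakly exceeds the supremum qualifies
    for g, v in zip(policy_grid, values):
        if v + eps >= sup_val:
            return g

# Toy illustration on a scalar policy space with a placeholder envelope
grid = np.linspace(0.0, 1.0, 101)
envelope = lambda g: -(g - 0.3) ** 2   # hypothetical envelope, peaked at 0.3
print(eme_rule(grid, envelope, eps=0.0))
```

With ε = 0 the rule returns an exact maximizer when one exists on the grid; a strictly positive ε enlarges the acceptable set, which is what guarantees the rule is well defined when the supremum is not attained.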
In this section we obtain a bound on the value of cn(d, κ) for any fixed κ, taking d to be the eME rule. To describe our procedure, we first introduce a data-dependent complexity measure for the class Hℓb. The complexity measure we use is based on the empirical Rademacher complexity, advocated by Bartlett et al. (2002), Koltchinskii (2001), and Koltchinskii (2006) (among others) in the context of empirical risk minimization.

Definition 5.2 (Empirical Rademacher Complexity). Let F be a class of measurable functions f : Y × Z → R. The empirical Rademacher complexity of F is given by:

||Rn||(F) := sup_{f∈F} | (1/n) Σ_{i=1}^n ξi · f(yi, zi) |,   (5.2)

where the ξi are realizations of Rademacher random variables; that is, ξi ∈ {−1, 1} and P(ξi = −1) = P(ξi = 1) = 1/2.

Remark 5.2.
A technical point worth emphasizing is that, when seen as a function of the underlying product probability space, the empirical Rademacher complexity may not be a measurable function. We suppress these difficulties in the statement of our results, although we show in Appendix B.2.1 that the Rademacher complexity ||Rn||(Hℓb) is universally measurable (with respect to the product Borel σ-algebra on (Y × Z)^n), which is sufficient for the purposes of this paper.

In our context, the empirical Rademacher complexity of the class Hℓb depends only on the observed empirical distribution and on n draws of a Rademacher random variable; it can therefore be computed after simulating from the Rademacher distribution. With this new definition in hand, we have the following result:

Theorem 5.1.
Suppose that Assumptions 2.1, 2.2, 2.3, and 3.1 hold. Let ϕ : V → [ϕℓb, ϕub] ⊂ R be a bounded, measurable function, and suppose that for each γ ∈ Γ, the random sets G−(·, θ) and G⋆(·, θ, γ) are almost-surely non-empty for each θ ∈ Θ∗. Let {(yi, zi)}_{i=1}^n be i.i.d. from some distribution P_Y,Z satisfying our assumptions and let d : Ψn → Γ be an eME decision rule for some ε > 0. Furthermore, let H < ∞ satisfy |h| ≤ H for every h ∈ Hℓb, and let:

cn(κ) = 4||Rn||(Hℓb) + √( 72 ln(2/(2 − κ)) H² / n ) + 5ε.   (5.3)

Then for any sample size n, and any κ ∈ (0, 1), we have:

inf_{P_Y,Z ∈ P_Y,Z} P⊗n_Y,Z ( sup_{γ∈Γ} inf_{s∈S} I[ϕ](γ, s) − inf_{s∈S} I[ϕ](d(ψ), s) ≤ cn(κ) ) ≥ κ.   (5.4)
See Appendix B. ∎
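To illustrate the computation behind Theorem 5.1, the sketch below simulates the empirical Rademacher complexity of Definition 5.2 for a class tabulated on the sample, and plugs it into our reading of (5.3); the constants follow the display above and should be checked against the original statement. Representing the class as a finite matrix of function values, and averaging the supremum over several sign draws for stability, are simplifications of this sketch.

```python
import numpy as np

def rademacher_complexity(H_vals, rng, n_draws=200):
    """Simulated ||R_n||(H) in the sense of Definition 5.2.  H_vals is an
    (m, n) array whose rows are functions h in the class evaluated at the
    n sample points (y_i, z_i)."""
    m, n = H_vals.shape
    sups = []
    for _ in range(n_draws):
        xi = rng.choice([-1.0, 1.0], size=n)        # P(xi = -1) = P(xi = 1) = 1/2
        sups.append(np.abs(H_vals @ xi).max() / n)  # sup_h |n^{-1} sum_i xi_i h(y_i, z_i)|
    return float(np.mean(sups))

def c_n(kappa, rad, H, n, eps):
    """Finite-sample slack c_n(kappa), following our reading of (5.3)."""
    return 4.0 * rad + np.sqrt(72.0 * np.log(2.0 / (2.0 - kappa)) * H**2 / n) + 5.0 * eps

rng = np.random.default_rng(0)
H_vals = rng.uniform(-1.0, 1.0, size=(50, 400))     # hypothetical class with |h| <= H = 1
rad = rademacher_complexity(H_vals, rng)
print(c_n(kappa=0.95, rad=rad, H=1.0, n=400, eps=0.01))
```

Note that κ ∈ (0, 1) implies 2/(2 − κ) ∈ (1, 2), so the logarithm in the deviation term is positive, and the bound shrinks as n grows or as the complexity of the class falls.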
Theorem 5.1 shows two closely related results. First, for any fixed value of κ ∈ (0, 1), the theorem shows that, in the worst-case state, the eME rule obtains within cn(κ) of the maximin value of the state-dependent policy transform with probability at least κ. Simple comparative statics show that the value of cn(κ) is smaller when n is larger and/or when ||Rn||(Hℓb) and H are smaller. The only difficult part of computing cn(κ) is computing the Rademacher complexity, which is approximately as difficult computationally as computing the empirical version of the lower bound in Theorem 3.1.

We again see a close connection between PAMPAC learnability and the lower envelope function from Theorem 3.1. The particular form of the lower envelope function from Theorem 3.1 makes it especially amenable to analysis using concentration inequalities, which are used in the proof of Theorem 5.1. Again, Assumption 3.1 is required for this result: without a finite (and known) value for the penalty µ∗, derivation of the finite-sample results in Theorem 5.1 would not be possible.

Finally, we mention again that, unlike Theorem 4.1 on PAMPAC learnability, Theorem 5.1 does not impose any restrictions on the underlying class of functions Hℓb. In particular, this class need not satisfy the entropy growth condition from Definition 4.2, nor any other sufficient condition for learnability, meaning Theorem 5.1 is applicable even when Γ is not PAMPAC learnable. As a result, Theorem 5.1 is able to provide finite-sample guarantees for the eME rule, but necessarily remains silent about rates of convergence.

The previous subsection used a specific rule, the eME rule, and derived finite-sample theoretical guarantees on the performance of this rule. However, the eME rule is only one particular rule, and for a variety of reasons it may not be the rule selected by the policymaker. To complement the results of the previous subsection, in this subsection we provide some theoretical results on alternative policy rules.
To understand the approach, let us define the function:

E∗(γ) := sup_{γ′∈Γ} inf_{s∈S} I[ϕ](γ′, s) − inf_{s∈S} I[ϕ](γ, s) = sup_{γ′∈Γ} Iℓb[ϕ](γ′) − Iℓb[ϕ](γ),   (5.5)

and the set:

G∗(δ) := {γ ∈ Γ : E∗(γ) ≤ δ}.   (5.6)

We call the set G∗(δ) the δ-level set. Our objective in this subsection will be to provide an approximation of the δ-level set that holds with probability at least κ. If we can do so, then by construction any decision rule d : Ψn → Γ that maps into our approximation of the δ-level set will have cn(d, κ) ≤ δ. There may be many decision rules that map into our approximation to the δ-level set, so our theoretical results will be applicable to a large number of decision rules. As a by-product of our analysis, we will also show that for certain values of δ the eME rule will be contained in the δ-level set with probability at least κ. Again, the results of this section do not impose any restrictions on the underlying class of functions Hℓb, and are applicable even when Γ is not PAMPAC learnable.

To introduce our results for the δ-level set, we must first introduce some additional notation. In particular, define:

En(γ) := sup_{γ′∈Γ} inf_{s∈S} Î[ϕ](γ′, s) − inf_{s∈S} Î[ϕ](γ, s) = sup_{γ′∈Γ} Îℓb[ϕ](γ′) − Îℓb[ϕ](γ),   (5.7)

and for δ > 0:

Gn(δ) := {γ ∈ Γ : En(γ) ≤ δ}.   (5.8)

The set Gn(δ) represents the empirical version of the δ-level set. The following theorem shows that, for sufficiently large δ, the δ-level set is contained within an enlargement of, and contains a contraction of, the empirical δ-level set with high probability.

Theorem 5.2.
Suppose that Assumptions 2.1, 2.2, 2.3, and 3.1 hold. Also suppose that ϕ : V → [ϕℓb, ϕub] ⊂ R is a bounded, measurable function, and that for each γ ∈ Γ, the random sets G−(·, θ) and G⋆(·, θ, γ) are almost-surely non-empty for each θ ∈ Θ∗. Let H < ∞ satisfy |h| ≤ H for every h ∈ Hℓb, and suppose that {(yi, zi)}_{i=1}^n is i.i.d. from some distribution P_Y,Z satisfying our assumptions. Define:

H′n,ℓb(δ) := {hℓb(·, ·, θ, γ, λ) − hℓb(·, ·, θ′, γ′, λ′) : θ, θ′ ∈ Θ, γ, γ′ ∈ Gn(δ), λ, λ′ ∈ {0, 1}^J},

where H′n,ℓb(δ) has a uniform bound H′n(δ) ≤ 2H < ∞. Furthermore, let tj := √(c1 log(c2 j)) with c1 = 5 and c2 = (3/(2(1 − κ)))^{1/2}, and let {δj}_{j=0}^∞ be a sequence decreasing to zero with δ0 > 2H. Choose some a ∈ (1, ∞), let b = 2 − 1/a, and let:

Tn(δ) := ||Rn||(H′n,ℓb(b δj)) + tj H′n(b δj)/√n,  if δ ∈ (δj+1, δj] for some j ≥ 0, and Tn(δ) := 0 otherwise,   (5.9)

and:

T♭n(σ) := sup_{δ≥σ} Tn(δ)/δ,   (5.10)
T♯n(η) := inf{σ > 0 : T♭n(σ) ≤ η}.   (5.11)

Finally, set δ∗ > T♯n(1 − 1/a). Then for any δ ≥ a δ∗ we have:

inf_{P_Y,Z ∈ P_Y,Z} P⊗n_Y,Z ( Gn(δ/a) ⊆ G∗(δ) ⊆ Gn(b δ) ) ≥ κ.
See Appendix B. ∎
Theorem 5.2 closely mimics results in the statistical learning literature, namely on bounding excess risk in empirical risk minimization problems. In particular, the proof of the result uses techniques developed by Koltchinskii (2006) and Koltchinskii (2011), where the latter gives a textbook treatment. Theorem 5.2 gives a novel application of these techniques to the problem of policy choice in the presence of partial identification. Similar to the other results in this paper, Theorem 5.2 relies crucially on the form of the lower envelope function from Theorem 3.1. Again, Assumption 3.1 is required, since Theorem 5.2 requires a finite (and known) value for the penalty parameter µ∗.

Intuitively, Theorem 5.2 says that for a suitably large value of δ the δ-level sets Gn(δ) of the function En(·) can be used to approximate the δ-level sets G∗(·) of the function E∗(·). The substantial component of the result is the selection of such a "suitably large value of δ." In particular, the value of δ needed for our approximation to work must be larger than the value of δ∗ from the theorem, where δ∗ is related to the solution of a fixed-point equation. The connection of the functions Tn(·), T♭n(·) and T♯n(·) to fixed-point equations is illustrated in Figure 4 and described in its caption. As illustrated in the figure, the function Tn(δ) is a left-continuous step function that is greater than or equal to zero on the interval [0, δ0], and zero otherwise.

The proof of Theorem 5.2 relies on Lemma 5.1, and the best way to understand Theorem 5.2 is to first understand Lemma 5.1.

Lemma 5.1.
Suppose that the assumptions of Theorem 5.2 all hold. Define:

H′ℓb(δ) := {hℓb(·, ·, θ, γ, λ) − hℓb(·, ·, θ′, γ′, λ′) : θ, θ′ ∈ Θ, γ, γ′ ∈ G∗(δ), λ, λ′ ∈ Λ},

where H′ℓb(δ) has a uniform bound H′(δ) ≤ 2H < ∞. Furthermore, let tj := √(c1 log(c2 j)) with c1 = 5 and c2 = (3/(2(1 − κ)))^{1/2}, and let {δj}_{j=0}^∞ be a sequence decreasing to zero with δ0 > 2H. Also, let:

T(δ) := ||Rn||(H′ℓb(δj)) + tj H′(δj)/√n,  if δ ∈ (δj+1, δj], and T(δ) := 0 otherwise,   (5.12)

and:

T♭(σ) := sup_{δ≥σ} T(δ)/δ,   (5.13)
T♯(η) := inf{σ > 0 : T♭(σ) ≤ η}.   (5.14)

Finally, suppose δ∗∗ > T♯(1 − 1/a) for some a ∈ (1, ∞). Then for any δ ≥ a δ∗∗ we have:

inf_{P_Y,Z ∈ P_Y,Z} P⊗n_Y,Z ( Gn(δ/a) ⊆ G∗(δ) ⊆ Gn((2 − 1/a)δ) ) ≥ κ.

(Footnote: The ♭- and ♯-transforms are taken from Koltchinskii (2006), and the properties of these transforms can be found in Appendix A.3 of Koltchinskii (2011).)

Figure 4: This figure illustrates step (iv) in the procedure to determine the δ-level set. After choosing a decreasing sequence {δj}_{j=0}^∞, the policymaker finds the value δ∗ such that δ∗ > T♯n(1 − 1/a). In the figure, this occurs in one of the intervals (δj+1, δj] (although, of course, this need not be the case). The figure also illustrates the fact that Tn(δ) is a step function. Finally, the figure illustrates how the ♭- and ♯-transforms of Tn(δ) are related to fixed-point equations. In particular, the figure illustrates the fixed point of Tn(δ) = δ, which is given exactly by T♯n(1). In addition, the fixed point of Tn(δ) = δ(1 − 1/a) is given by T♯n(1 − 1/a).
See Appendix B. ∎
Note that Lemma 5.1 is very similar to Theorem 5.2, a major exception being that the class of functions H′ℓb(δ) in Lemma 5.1 differs from the class of functions H′n,ℓb(δ) in Theorem 5.2. The class H′n,ℓb(δ) represents a "feasible version" of H′ℓb(δ), since H′ℓb(δ) depends on the unknown δ-level set G∗(δ), whereas H′n,ℓb(δ) depends on the empirical δ-level set Gn(δ).

A heuristic proof may help provide some sense of how these results work. A necessary step in proving either Theorem 5.2 or Lemma 5.1 is to relate the quantities En(γ) and E∗(γ), which is exactly what is done in the proof of Lemma 5.1. Among other things, the proof of Lemma 5.1 demonstrates that an important object connecting the quantities En(γ) and E∗(γ) is given by:

δ ↦ sup_{θ,θ′∈Θ} sup_{γ,γ′∈G∗(δ)} sup_{λ,λ′∈Λ} |(Pn hℓb(·, θ, γ, λ) − Pn hℓb(·, θ′, γ′, λ′)) − (P hℓb(·, θ, γ, λ) − P hℓb(·, θ′, γ′, λ′))|,   (5.15)

where:

Pn hℓb(·, θ, γ, λ) := (1/n) Σ_{i=1}^n hℓb(yi, zi, θ, γ, λ),   P hℓb(·, θ, γ, λ) := ∫ hℓb(y, z, θ, γ, λ) dP_Y,Z.

The quantity (5.15) is easily seen to be the sup-norm of a particular empirical process. Note that this empirical process depends on unknown population quantities both through G∗(δ) and through the functions P hℓb(·, θ, γ, λ) and P hℓb(·, θ′, γ′, λ′), which depend on the unknown true probability measure. While the dependence on G∗(δ) is unavoidable for now, the dependence on P hℓb(·, θ, γ, λ) and P hℓb(·, θ′, γ′, λ′) can be removed by working with the function T(δ) from (5.12).
Thus, the function T(δ) in Lemma 5.1 (which is slightly different from Tn(δ) in Theorem 5.2) is constructed to serve as an upper envelope of the quantity in (5.15), for every δ ∈ [0, δ0], on some event En with probability at least κ.

(Footnote: Note that, technically speaking, the dependence of (5.15) on P hℓb(·, θ, γ, λ) and P hℓb(·, θ′, γ′, λ′) is removed using a symmetrization inequality (c.f. Van Der Vaart and Wellner (1996), Lemma 2.3.1) and a Hoeffding-type concentration inequality, which leads exactly to the upper bound T(δ), which holds with high probability.)

With (5.15) replaced by its upper bound T(δ), the proof of Lemma 5.1 then shows that, if σ := E∗(γ), the following inequalities hold on the event En:

E∗(γ) ≤ En(γ) + T(σ),   (5.16)
En(γ) ≤ E∗(γ) + T(σ).   (5.17)

Now note that if δ∗∗ = T♯(1 − 1/a) + ε for any ε > 0, then T(δ) ≤ (1 − 1/a)·δ for every δ ≥ δ∗∗. Furthermore, by construction the value of δ∗∗ will be close to the smallest possible value for which this is true. Now fix any γ with σ = E∗(γ) ≥ δ∗∗. Then clearly:

T(σ) ≤ (1 − 1/a) E∗(γ).   (5.18)

Combining this result with (5.16) and (5.17), we obtain that for any γ satisfying E∗(γ) ≥ δ∗∗ we have:

E∗(γ) ≤ a En(γ),   (5.19)
En(γ) ≤ b E∗(γ).   (5.20)

The remainder of the proof of Lemma 5.1 is dedicated to showing that the following inequalities hold for every γ ∈ Γ on the event En:

E∗(γ) ≤ a(En(γ) ∨ δ∗∗),   (5.21)
En(γ) ≤ b(E∗(γ) ∨ δ∗∗).   (5.22)

After these inequalities are established, it is straightforward to argue that Gn(δ/a) ⊆ G∗(δ) ⊆ Gn(b δ) on the event En when δ ≥ a δ∗∗. The proof of Theorem 5.2 then shows, intuitively, that H′ℓb(δ), T(·) (and its ♭- and ♯-transforms) and δ∗∗ defined in Lemma 5.1 can be replaced with their feasible versions H′n,ℓb(δ), Tn(·) (and its ♭- and ♯-transforms) and δ∗ defined in Theorem 5.2.

Theorem 5.2 suggests the following procedure to approximate the δ-level set. The policymaker begins by computing En(γ) as a function of γ (for example, by establishing a grid over Γ). The policymaker fixes some value a ∈ (1, ∞) and constructs a sequence {δj}_{j=0}^∞ decreasing to zero with (1 − 1/a)δ0 > 2H. In general the procedure will give a tighter bound if the sequence {δj}_{j=0}^∞ has small initial increments. The policymaker then computes δ∗ > T♯n(1 − 1/a). This is done by the following procedure:

(i) The policymaker takes n i.i.d. draws of a Rademacher random variable ξ.
(ii) At the j-th step (beginning at step 0), the policymaker uses En(γ) to compute the Rademacher complexity ||Rn||(H′n,ℓb(b δj)) with the formula (5.2).
(iii) The policymaker uses En(γ) to compute a uniform upper bound Hn(δj) for H′n,ℓb(δj) (or she can simply use 2H).
(iv) The policymaker determines if there is any value δ ∈ (δj+1, δj] such that Tn(δj)/δ ≥ 1 − 1/a.
• If so, the policymaker stops and sets $\delta^* = \delta + \eta$, where $\eta > 0$ and $\delta \in (\delta_{j+1}, \delta_j]$ is equal to any value satisfying $T_n(\delta_j)/\delta \leq 1 - 1/a$.

• If not, the policymaker repeats steps (i) and (ii) for iteration $j + 1$.

An illustration of this step is provided in Figure 4. By Theorem 5.2, the policymaker then knows that for every $\delta \geq \delta^*$, the $\delta$-minimal set $G^*(\delta)$ will be contained within the sample analogue $\delta$-minimal set $G_n(b\delta)$, and will contain the sample analogue $\delta$-minimal set $G_n(\delta/a)$, with probability at least $\kappa$. Note that the computational bottleneck in this procedure arises from repeatedly computing the Rademacher complexity.

In addition to being interesting in its own right, Theorem 5.2 also sheds light on the results from the previous subsection. In particular, the proof of Theorem 5.2 and Lemma B.9 lead to the following result, which is stated as a corollary of Theorem 5.2. (Regarding inequalities (5.21) and (5.22), note that the first inequality is trivial, since (5.19) is satisfied when $E^*(\gamma) \geq \delta^{**}$, and if $E^*(\gamma) \leq \delta^{**}$, then $E^*(\gamma) \leq a \, \delta^{**}$, since $a > 1$. The second of these inequalities is non-trivial, and relies on an auxiliary result given by Lemma B.9 in the Appendix.)

Corollary 5.1. Suppose the assumptions of Theorem 5.2 hold, and let $\delta^*$ be as in Theorem 5.2. For any $\varepsilon > 0$ let $\hat{\gamma} \in \Gamma$ be the policy selected by the eME decision rule. If $\delta \geq \delta^* \geq \varepsilon > 0$, then:
$$\inf_{P_{Y,Z} \in \mathcal{P}_{Y,Z}} P_{Y,Z}^{\otimes n} \left( E^*(\hat{\gamma}) \leq \delta \right) \geq \kappa.$$
That is, $\hat{\gamma} \in G^*(\delta)$ with high probability when $\delta \geq \delta^* \geq \varepsilon > 0$.

This result shows that, if $\varepsilon \leq \delta^*$, then our eME rule from the previous subsection will be contained in the $\delta$-level set $G^*(\delta)$ when $\delta \geq \delta^*$ with high probability. This should serve as some additional justification for using the eME rule, since it shows that, when both $\delta^*$ and $\varepsilon$ are small, the procedure suggested by Theorem 5.2 will not lead to decision rules that vastly outperform the eME rule.

Conclusion

The purpose of the paper is to develop a general and novel framework for bounding counterfactual quantities and for making policy decisions. Our framework is applicable in models that are partially identified and/or incomplete. Furthermore, we do not require parametric distributional assumptions for the latent variables, and we allow for moment conditions that depend on latent variables. We introduce the policy transform, and argue that many counterfactual quantities can be written as the policy transform of some function. We then introduce a preference relation that respects weak dominance, and discuss the problem of policy choice using a framework similar to the PAC model of learnability from computational learning theory. Our theoretical results are divided into those that are applicable ex-ante (i.e. before observing the sample) and ex-post (i.e. after observing the sample). For our ex-ante results, we introduce the notion of "learning" a policy space, and provide sufficient conditions for a policy space to be learnable. For our ex-post results, we provide theoretical guarantees on the performance of particular policy rules.
Throughout the paper we also demonstrate how to apply the results to a simultaneous discrete choice example and a program evaluation example.

There are many obvious extensions of this work that might be interesting. This paper has been particularly focused on theoretical developments, with examples serving mainly a pedagogical purpose. Further development of the examples and empirical applications is needed to clearly illustrate and fully investigate the strengths and weaknesses of the method in practice. In addition, the paper has been largely silent on implementation, which may be computationally complex in certain environments. Further development of efficient algorithms to implement the procedures is clearly needed. Finally, the relation between PAC learnability and the literature on frequentist decision theory requires further investigation and clarification. We believe all of these extensions to be fruitful avenues of future research.

References
Aliprantis, C. D. and Border, K. C. (2006). Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer.
Alon, N., Ben-David, S., Cesa-Bianchi, N., and Haussler, D. (1997). Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM (JACM), 44(4):615–631.
Andrews, D. W. and Shi, X. (2013). Inference based on conditional moment inequalities. Econometrica, 81(2):609–666.
Andrews, D. W. and Shi, X. (2017). Inference based on many conditional moment inequalities. Journal of Econometrics, 196(2):275–287.
Angluin, D. and Laird, P. (1988). Learning from noisy examples. Machine Learning, 2(4):343–370.
Bartlett, P. L., Boucheron, S., and Lugosi, G. (2002). Model selection and error estimation. Machine Learning, 48(1-3):85–113.
Bartlett, P. L., Bousquet, O., Mendelson, S., et al. (2005). Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537.
Bartlett, P. L., Long, P. M., and Williamson, R. C. (1996). Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3):434–452.
Belloni, A., Bugni, F. A., and Chernozhukov, V. (2019). Subvector inference in PI models with many moment inequalities. Technical report, cemmap working paper.
Beresteanu, A., Molchanov, I., and Molinari, F. (2011). Sharp identification regions in models with convex moment predictions. Econometrica, 79(6):1785–1821.
Beresteanu, A., Molchanov, I., and Molinari, F. (2012). Partial identification using random set theory. Journal of Econometrics, 166(1):17–32.
Bertsekas, D. P. and Shreve, S. (1978). Stochastic Optimal Control: The Discrete-Time Case. Academic Press.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965.
Boucheron, S., Bousquet, O., and Lugosi, G. (2005). Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375.
Bousquet, O., Koltchinskii, V., and Panchenko, D. (2002). Some local measures of complexity of convex hulls and generalization bounds. In International Conference on Computational Learning Theory, pages 59–73. Springer.
Bresnahan, T. F. and Reiss, P. C. (1990). Entry in monopoly markets. The Review of Economic Studies, 57(4):531–553.
Bresnahan, T. F. and Reiss, P. C. (1991). Empirical models of discrete games. Journal of Econometrics, 48(1-2):57–81.
Brock, W. A. and Durlauf, S. N. (2001). Discrete choice with social interactions. The Review of Economic Studies, 68(2):235–260.
Carneiro, P., Heckman, J. J., and Vytlacil, E. J. (2011). Estimating marginal returns to education. American Economic Review, 101(6):2754–81.
Chamberlain, G. (2011). Bayesian aspects of treatment choice. The Oxford Handbook of Bayesian Econometrics, pages 11–39.
Chernozhukov, V., Chetverikov, D., Kato, K., et al. (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819.
Chernozhukov, V., Hong, H., and Tamer, E. (2007). Estimation and confidence regions for parameter sets in econometric models. Econometrica, 75(5):1243–1284.
Chesher, A. and Rosen, A. (2012). Simultaneous equations models for discrete outcomes: coherence, completeness, and identification. Working paper.
Chesher, A. and Rosen, A. (2015). Characterizations of identified sets delivered by structural econometric models. Technical report, cemmap working paper.
Chesher, A. and Rosen, A. M. (2014). An instrumental variable random-coefficients model for binary outcomes. The Econometrics Journal, 17(2):S1–S19.
Chesher, A. and Rosen, A. M. (2017a). Generalized instrumental variable models. Econometrica, 85(3):959–989.
Chesher, A. and Rosen, A. M. (2017b). Incomplete English auction models with heterogeneity. Technical report, cemmap working paper.
Chesher, A. and Rosen, A. M. (2020). Structural modeling of simultaneous discrete choice. Working paper.
Chesher, A., Rosen, A. M., and Smolinski, K. (2013). An instrumental variable model of multiple discrete choice. Quantitative Economics, 4(2):157–196.
Ciliberto, F., Murry, C., and Tamer, E. T. (2018). Market structure and competition in airline markets. Available at SSRN 2777820.
Cohn, D. L. (2013). Measure Theory. Springer.
Corbae, D., Stinchcombe, M. B., and Zeman, J. (2009). An Introduction to Mathematical Analysis for Economic Theory and Econometrics. Princeton University Press.
Dolgopolik, M. (2016). A unifying theory of exactness of linear penalty functions. Optimization, 65(6):1167–1202.
Dontchev, A. L. and Rockafellar, R. T. (2009). Implicit Functions and Solution Mappings. Springer Monographs in Mathematics. Springer.
Dudley, R. M. (2010). Real Analysis and Probability. Cambridge University Press.
Dudley, R. M. (2014). Uniform Central Limit Theorems. Cambridge University Press.
Dudley, R. M., Giné, E., and Zinn, J. (1991). Uniform and universal Glivenko-Cantelli classes. Journal of Theoretical Probability, 4(3):485–510.
Ekeland, I., Galichon, A., and Henry, M. (2010). Optimal transportation and the falsifiability of incompletely specified economic models. Economic Theory, 42(2):355–374.
Galichon, A. and Henry, M. (2006). Inference in incomplete models. Working paper.
Galichon, A. and Henry, M. (2009). A test of non-identifying restrictions and confidence regions for partially identified parameters. Journal of Econometrics, 152(2):186–196.
Galichon, A. and Henry, M. (2011). Set identification in models with multiple equilibria. The Review of Economic Studies, 78(4):1264–1298.
Giné, E., Koltchinskii, V., et al. (2006). Concentration inequalities and asymptotic results for ratio type empirical processes. The Annals of Probability, 34(3):1143–1216.
Giné, E., Koltchinskii, V., and Wellner, J. A. (2003). Ratio limit theorems for empirical processes. In Stochastic Inequalities and Applications, pages 249–278. Springer.
Haile, P. A. and Tamer, E. (2003). Inference with an incomplete model of English auctions. Journal of Political Economy, 111(1):1–51.
Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150.
Heckman, J. J. (2010). Building bridges between structural and program evaluation approaches to evaluating policy. Journal of Economic Literature, 48(2):356–98.
Heckman, J. J. and Vytlacil, E. (2005). Structural equations, treatment effects, and econometric policy evaluation. Econometrica, 73(3):669–738.
Heckman, J. J. and Vytlacil, E. J. (1999). Local instrumental variables and latent variable models for identifying and bounding treatment effects. Proceedings of the National Academy of Sciences, 96(8):4730–4734.
Himmelberg, C. (1975). Measurable relations. Fundamenta Mathematicae, 87(1):53–72.
Hirano, K. and Porter, J. R. (2009). Asymptotics for statistical treatment rules. Econometrica, 77(5):1683–1701.
Hurwicz, L. (1950). Generalization of the concept of identification. Statistical Inference in Dynamic Economic Models, 10:245–57.
Ichimura, H. and Taber, C. R. (2000). Direct estimation of policy impacts. Working paper.
Imbens, G. W. and Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2):467–475.
Ioffe, A. (2016). Metric regularity—a survey. Part 1. Theory. Journal of the Australian Mathematical Society, 101(2):188–243.
Jia, P. (2008). What happens when Wal-Mart comes to town: An empirical analysis of the discount retailing industry. Econometrica, 76(6):1263–1316.
Jovanovic, B. (1989). Observable implications of models with multiple equilibria. Econometrica: Journal of the Econometric Society, pages 1431–1437.
Kaido, H., Molinari, F., and Stoye, J. (2019). Constraint qualifications in partial identification. arXiv preprint arXiv:1908.09103.
Kalouptsidi, M., Kitamura, Y., Lima, L., and Souza-Rodrigues, E. (2019). Partial identification and inference for dynamic models and counterfactuals. Working paper.
Kasy, M. (2016). Partial identification, distributional preferences, and the welfare ranking of policies. Review of Economics and Statistics, 98(1):111–131.
Kearns, M. J. and Schapire, R. E. (1994). Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48(3):464–497.
Kearns, M. J., Vazirani, U. V., and Vazirani, U. (1994). An Introduction to Computational Learning Theory. MIT Press.
Kitagawa, T. and Tetenov, A. (2018). Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica, 86(2):591–616.
Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914.
Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656.
Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d'Été de Probabilités de Saint-Flour XXXVIII-2008, volume 2033. Springer Science & Business Media.
Koltchinskii, V. and Panchenko, D. (2000). Rademacher processes and bounding the risk of function learning. In High Dimensional Probability II, pages 443–457. Springer.
Koopmans, T. C., Rubin, H., and Leipnik, R. B. (1950). Measuring the equation systems of dynamic economics. Statistical Inference in Dynamic Economic Models, 10.
Kosorok, M. R. (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer.
Lewis, D. (1979). Counterfactual dependence and time's arrow. Noûs, pages 455–476.
Li, L. (2019). Identification of structural and counterfactual parameters in a large class of structural econometric models. Working paper.
Luo, Z.-Q., Pang, J.-S., and Ralph, D. (1996). Mathematical Programs with Equilibrium Constraints. Cambridge University Press.
Manski, C. and Tetenov, A. (2014). The quantile performance of statistical treatment rules using hypothesis tests to allocate a population to two treatments. Technical report, cemmap working paper.
Manski, C. F. (1988). Ordinal utility models of decision making under uncertainty. Theory and Decision, 25(1):79–104.
Manski, C. F. (2004). Statistical treatment rules for heterogeneous populations. Econometrica, 72(4):1221–1246.
Manski, C. F. (2011). Actualist rationality. Theory and Decision, 71(2):195–210.
Marshak, J. (1953). Economic measurements for policy and prediction. Studies in Econometric Method, pages 1–26.
Massart, P. (2000). Some applications of concentration inequalities to statistics. In Annales de la Faculté des Sciences de Toulouse: Mathématiques, volume 9, pages 245–303.
Mbakop, E. and Tabord-Meehan, M. (2019). Model selection for treatment choice: Penalized welfare maximization. arXiv preprint arXiv:1609.03167.
Miyauchi, Y. (2016). Structural estimation of pairwise stable networks with nonnegative externality. Journal of Econometrics, 195(2):224–235.
Mogstad, M., Santos, A., and Torgovitsky, A. (2018). Using instrumental variables for inference about policy relevant treatment parameters. Econometrica, 86(5):1589–1619.
Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of Machine Learning. MIT Press.
Molchanov, I. (2017). Theory of Random Sets. Springer Science & Business Media.
Morgan, M. S. (1990). The History of Econometric Ideas. Cambridge University Press.
Mourifie, I., Henry, M., and Méango, R. (2018). Sharp bounds and testability of a Roy model of STEM major choices. Available at SSRN 2043117.
Mourifie, I. and Wan, Y. (2020). Layered sensitivity analysis in program evaluation using the MTE. Working paper.
Munkres, J. (2014). Topology. Pearson Education.
Pang, J.-S. (1997). Error bounds in mathematical programming. Mathematical Programming, 79(1-3):299–332.
Parthasarathy, K. R. (2005). Probability Measures on Metric Spaces, volume 352. American Mathematical Soc.
Pearl, J. (2009). Causality: Models, Reasoning and Inference. Cambridge University Press.
Pollard, D. (1990). Empirical processes: theory and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics, pages i–86. JSTOR.
Qin, D. and Gilbert, C. L. (2001). The error term in the history of time series econometrics. Econometric Theory, 17(2):424–450.
Rostek, M. (2010). Quantile maximization in decision theory. The Review of Economic Studies, 77(1):339–371.
Russell, T. M. (2019). Sharp bounds on functionals of the joint distribution in the analysis of treatment effects. Journal of Business & Economic Statistics, pages 1–15.
Schennach, S. M. (2014). Entropic latent variable integration via simulation. Econometrica, 82(1):345–385.
Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. (2010). Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11(Oct):2635–2670.
Shreve, S. E. and Bertsekas, D. P. (1978). Alternative theoretical frameworks for finite horizon discrete-time stochastic optimal control. SIAM Journal on Control and Optimization, 16(6):953–978.
Shreve, S. E. and Bertsekas, D. P. (1979). Universally measurable policies in dynamic programming. Mathematics of Operations Research, 4(1):15–30.
Stinchcombe, M. B. and White, H. (1992). Some measurability results for extrema of random functions over random sets. The Review of Economic Studies, 59(3):495–514.
Stoye, J. (2009). Minimax regret treatment choice with finite samples. Journal of Econometrics, 151(1):70–81.
Stoye, J. (2011). Statistical decisions under ambiguity. Theory and Decision, 70(2):129–148.
Stoye, J. (2012). Minimax regret treatment choice with covariates or with limited validity of experiments. Journal of Econometrics, 166(1):138–156.
Syrgkanis, V., Tamer, E., and Ziani, J. (2018). Inference on auctions with weak assumptions on information. arXiv preprint arXiv:1710.03830.
Tamer, E. (2003). Incomplete simultaneous discrete response model with multiple equilibria. The Review of Economic Studies, 70(1):147–165.
Tebaldi, P., Torgovitsky, A., and Yang, H. (2019). Nonparametric estimates of demand in the California health insurance exchange. Technical report, National Bureau of Economic Research.
Tetenov, A. (2012). Statistical treatment choice based on asymmetric minimax regret criteria. Journal of Econometrics, 166(1):157–165.
Torgovitsky, A. (2019). Partial identification by extending subdistributions. Quantitative Economics, 10(1):105–144.
Uetake, K. and Watanabe, Y. (2019). Entry by merger: Estimates from a two-sided matching model with externalities. Available at SSRN 2188581.
Valiant, L. (2013). Probably Approximately Correct: Nature's Algorithms for Learning and Prospering in a Complex World. Basic Books (AZ).
Valiant, L. G. (1984). A theory of the learnable. In Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, pages 436–445. ACM.
Van der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence. In Weak Convergence and Empirical Processes, pages 16–28. Springer.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer Science & Business Media.
Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York.
Vidyasagar, M. (2002). A Theory of Learning and Generalization. Springer-Verlag.
Vytlacil, E. (2002). Independence, monotonicity, and latent index models: An equivalence result. Econometrica, 70(1):331–341.
Wald, A. (1950). Statistical Decision Functions. Wiley.
A Preliminaries
A.1 Preliminaries on Random Set Theory
This Appendix introduces some key elements of random set theory. Since measurability issues play a significant role in random set theory, we begin by providing the definition of an Effros-measurable multifunction, and show its connection with the definition of a random set.
Definition A.1 (Effros-Measurability, Random Set). Let $(\Omega, \mathcal{A}, P)$ be a probability space, let $V$ be a Polish space, and let $\mathcal{O}_V$ denote the collection of all open sets on $V$. A multifunction $\mathcal{V}: \Omega \to \mathcal{F}_V$ is called Effros-measurable if for every $A \in \mathcal{O}_V$ we have $\mathcal{V}^-(A) := \{ \omega \in \Omega : \mathcal{V}(\omega) \cap A \neq \emptyset \} \in \mathcal{A}$. An Effros-measurable closed-valued multifunction on a probability space $(\Omega, \mathcal{A}, P)$ is called a random closed set.

From this definition, we see that a random closed set is an Effros-measurable closed multifunction which takes elements from the underlying probability space to the collection of closed sets on some Polish space $V$. An Effros-measurable closed multifunction is also sometimes called weakly measurable. When the underlying probability space $(\Omega, \mathcal{A}, P)$ is complete, Effros-measurability is equivalent to both (i) $\mathcal{V}^-(B) \in \mathcal{A}$ for all $B \in \mathcal{B}(V)$ (Borel measurability) and (ii) $\mathcal{V}^-(F) \in \mathcal{A}$ for all $F \in \mathcal{F}_V$ (strong measurability). Our main interest in the paper is in the case when $V$ is a subset of finite-dimensional Euclidean space, although the framework is more general.

While Effros-measurability is the proper notion of measurability for many of the results, it can be difficult to verify. There are other conditions that are sufficient for Effros measurability, but we find one condition to be particularly helpful in the examples. Let $d$ denote the metric on a Polish space $V$, and let $\mathcal{V}: \Omega \to \mathcal{F}_V$ be a multifunction. The distance to the set $\mathcal{V}(\omega)$ on $V$ is given by:
$$d(v, \mathcal{V}(\omega)) := \inf\{ d(v, v') : v' \in \mathcal{V}(\omega) \}.$$
By a result of Himmelberg (1975), Effros measurability of the multifunction $\mathcal{V}$ is equivalent to measurability of $d(v, \mathcal{V}(\omega))$ (as a random variable from $\Omega$ to $[0, \infty]$) for each $v \in V$.

Throughout the paper it is also important to understand what it means for two random sets to be identically distributed, which is provided in the next definition.

Definition A.2 (Identically Distributed Random Sets).
Let $(\Omega, \mathcal{A}, P)$ be a probability space, and let $V$ be a Polish space. We say that two random sets $\mathcal{V}$ and $\mathcal{V}^*$ are identically distributed, denoted by $\mathcal{V} \sim \mathcal{V}^*$, if for every $A \in \mathcal{O}_V$ we have
$$P(\omega : \mathcal{V}(\omega) \cap A \neq \emptyset) = P(\omega : \mathcal{V}^*(\omega) \cap A \neq \emptyset).$$

Finally, an important concept in random set theory is that of a selection from a random set. Intuitively, a random set $\mathcal{V}$ can be understood as a collection of random variables $v$ satisfying $v(\omega) \in \mathcal{V}(\omega)$ $P$-a.s. Such random variables are called selections from the random set $\mathcal{V}$, which is made precise in the following definition. (See Aliprantis and Border (2006), Ch. 18, and Molchanov (2017), Theorem 1.3.3, p. 59.)

Definition A.3 (Selections, Conditional Selections). A random element $v: \Omega \to V$ is called a (measurable) selection of $\mathcal{V}$ if $v(\omega) \in \mathcal{V}(\omega)$ for $P$-almost all $\omega \in \Omega$. The family of all measurable selections of a random set $\mathcal{V}$ will be denoted by $\mathrm{Sel}(\mathcal{V})$.

Although it is suppressed in the notation, the family of selections $\mathrm{Sel}(\cdot)$ depends both on the distribution of the random set $\mathcal{V}$ and on the underlying probability space. Indeed, two identically distributed random sets on the same probability space may have different families of selections. However, the weak closed convex hulls of the families of selections from two identically distributed random closed sets on the same probability space coincide. In addition, when the underlying probability space is non-atomic, it is not necessary to take convex hulls. See the discussion following Definition 3.1 in the main text.
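To make these objects concrete, the following toy sketch (our own illustration, not an object from the paper) simulates a simple random closed set $\mathcal{V}(\omega) = [U(\omega), U(\omega) + 1]$ with $U \sim \mathrm{Uniform}(0,1)$, verifies that two candidate selections lie in the set almost surely, evaluates the distance function $d(v, \mathcal{V}(\omega))$ appearing in Himmelberg's measurability criterion, and estimates the hitting probabilities $P(\mathcal{V} \cap A \neq \emptyset)$ that characterize the distribution of the set (cf. Definition A.2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy random closed set: V(omega) = [U(omega), U(omega) + 1], U ~ Uniform(0, 1).
n = 10_000
U = rng.uniform(0.0, 1.0, size=n)              # one omega per entry
lower, upper = U, U + 1.0

# Two measurable selections in Sel(V): each must satisfy v(omega) in V(omega) a.s.
sel_lower = lower                               # lower-endpoint selection
sel_mid = 0.5 * (lower + upper)                 # midpoint selection
assert np.all((sel_lower >= lower) & (sel_lower <= upper))
assert np.all((sel_mid >= lower) & (sel_mid <= upper))

# Distance function from Himmelberg's criterion: for fixed v, the map
# omega -> d(v, V(omega)) is an ordinary random variable (closed form for
# an interval: distance is zero inside, linear outside).
def dist(v):
    return np.maximum(np.maximum(lower - v, v - upper), 0.0)

# For v = 0, d(0, [U, U+1]) = U, so dist(0.0) reproduces U exactly.
assert np.allclose(dist(0.0), U)

# Hitting probabilities P(V ∩ A ≠ ∅) for open intervals A = (a, b):
# [U, U+1] hits (a, b) iff U < b and U + 1 > a.
def hit_prob(a, b):
    return np.mean((upper > a) & (lower < b))

print(hit_prob(0.5, 0.7))                       # ≈ P(U < 0.7) = 0.7
```

Note that the two selections above are very different random variables, yet both are legitimate elements of $\mathrm{Sel}(\mathcal{V})$; the hitting probabilities, by contrast, depend only on the distribution of the set itself.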
A.2 PAC Learnability
As described in the introduction, our definition of learnability is related to the definition of learnability prescribed in Valiant (1984). It will thus be useful to understand the concept of learnability from computational learning theory. We will omit technical details in the pursuit of clarity.

In a supervised learning problem, the researcher is presumed to have an i.i.d. sample $\psi = ((y_i, z_i))_{i=1}^n$ from the true measure $P_{Y,Z}$. The researcher is also assumed to have a class of functions $\mathcal{F}$ in mind, called the hypothesis space. The researcher's objective is to select a function $f: \mathcal{Z} \to \mathcal{Y}$, called a hypothesis (or a classifier or a predictor), from the hypothesis space $\mathcal{F}$ that can accurately predict values in $\mathcal{Y}$ given values in $\mathcal{Z}$. The performance of a given function $f \in \mathcal{F}$ is measured according to a loss function. That is, it is assumed the researcher has some function $L: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ such that $L(y, f(z))$ measures the loss incurred when a prediction $f(z)$ is made and the true value of the outcome is $y$. The problem of selecting a good hypothesis $f$ is then translated into the problem of choosing $f \in \mathcal{F}$ to minimize expected loss, or risk. A decision rule in this context is a measurable map $d: \Psi^n \to \mathcal{F}$ that selects a hypothesis from the hypothesis space; in learning theory, this decision rule is called an algorithm.

So far the reader should note a resemblance to decision problems seen in statistics and econometrics. However, important differences between the fields arise when evaluating a given statistical decision rule. In particular, computer scientists are interested in rules that achieve close to the minimum possible risk with high probability in finite samples. To define this rigorously, let $\hat{f} \in \mathcal{F}$ be the hypothesis selected by some decision rule (or algorithm) $d: \Psi^n \to \mathcal{F}$. Since $\hat{f} \in \mathcal{F}$ depends on the observed sample, ex-ante it will be a random variable.
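The risk-minimization setup just described can be made concrete with a small empirical-risk-minimization sketch. The data-generating process, the finite class of threshold classifiers, and the 0-1 loss below are all illustrative choices of our own, not objects from the paper; the population risk is approximated with a large independent draw:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative supervised-learning problem: Z ~ Uniform(0, 1) and
# Y = 1{Z > 0.4}, with labels flipped with probability 0.1 (so no hypothesis
# can achieve risk much below 0.1 under 0-1 loss).
def draw(m):
    z = rng.uniform(size=m)
    flip = rng.uniform(size=m) < 0.1
    y = (z > 0.4).astype(int) ^ flip.astype(int)
    return y, z

# Finite hypothesis space F of threshold classifiers f_t(z) = 1{z > t}.
thresholds = np.linspace(0.0, 1.0, 101)

def emp_risk(y, z, t):                          # P_n L(y, f_t(z)) under 0-1 loss
    return np.mean(y != (z > t).astype(int))

# The decision rule ("algorithm") d maps the sample to a hypothesis via ERM.
y_tr, z_tr = draw(200)
t_hat = thresholds[np.argmin([emp_risk(y_tr, z_tr, t) for t in thresholds])]

# Approximate population risks with a large independent sample, and compute
# the excess risk: the quantity that (A.1) requires to be small with high
# probability.
y_te, z_te = draw(200_000)
risk_hat = emp_risk(y_te, z_te, t_hat)
risk_star = min(emp_risk(y_te, z_te, t) for t in thresholds)
excess = risk_hat - risk_star
print(excess)                                   # small: ERM is nearly best-in-class
```

Because this $\mathcal{F}$ is finite, a union bound over the 101 thresholds combined with Hoeffding's inequality already yields a guarantee of the form (A.1); richer hypothesis classes require complexity measures such as the Rademacher complexities used in the paper.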
Now fix any values $(c, \kappa) \in \mathbb{R}_{++} \times (0,1)$. We say that $\hat{f}$ closely approximates the performance of the optimal decision rule in finite samples if:
$$\inf_{P_{Y,Z} \in \mathcal{P}_{Y,Z}} P_{Y,Z}^{\otimes n} \left( \left| \inf_{f \in \mathcal{F}} E[L(y, f(z))] - E[L(y, \hat{f}(z))] \right| \leq c \right) \geq \kappa, \tag{A.1}$$
for a small value of $c \in \mathbb{R}_{+}$ and a large value of $\kappa \in (0,1)$ at sample size $n$. Here $\mathcal{P}_{Y,Z}$ is the collection of all Borel probability measures on
$\mathcal{Y} \times \mathcal{Z}$, and thus the performance of a decision rule is uniform over $P_{Y,Z} \in \mathcal{P}_{Y,Z}$. We can now introduce the notion of (agnostic) PAC learnability initially proposed by Haussler (1992). (See Example 1.4.2 in Molchanov (2017), p. 79.)
Definition A.4 (Agnostic PAC Learnability). A hypothesis class $\mathcal{F}$ is (agnostic) probably approximately correct (PAC) learnable with respect to the loss function $L$ if there exists a function $\zeta_{\mathcal{F}}: \mathbb{R}_{++} \times (0,1) \to \mathbb{N}$ such that, for any $(c, \kappa) \in \mathbb{R}_{++} \times (0,1)$, if $n \geq \zeta_{\mathcal{F}}(c, \kappa)$ then there is some decision procedure $d: \Psi^n \to \mathcal{F}$ such that $\hat{f} := d(\psi)$ satisfies (A.1).

Remark A.1.
This definition omits an important component of the original definition of PAC learnability found in the paper of Valiant (1984), which also requires that the algorithm (decision rule) can be processed in polynomial time (relative to the length of its input). For some this may be a serious omission, as the requirement that an algorithm can be efficiently processed is seen as a core component of learnability in computational learning theory.

In other words, a hypothesis space is (agnostic) PAC learnable if we can guarantee that (A.1) holds for any choice of the pair $(c, \kappa) \in \mathbb{R}_{++} \times (0,1)$ for large enough $n$. Here $c$ is called the error tolerance parameter, and $\kappa$ is called the confidence parameter. The "agnostic" component of the definition refers to the fact that the hypothesis class $\mathcal{F}$ may or may not include the true labelling function $f^*: \mathcal{Z} \to \mathcal{Y}$; indeed, such a "true" labelling function may not even exist.

One major advantage of the PAC framework—relative to other frequentist methods of evaluating decision rules—is its analytical tractability and amenability to analysis via concentration inequalities and techniques from empirical process theory. Indeed, in the case when the decision rule $d: \Psi^n \to \mathcal{F}$ corresponds to the empirical risk minimization rule, it is well known that PAC learnability is implied by uniform convergence (over both $\mathcal{P}_{Y,Z}$ and $\mathcal{F}$) of the empirical risk to the population risk. In specific learning problems this uniform convergence is equivalent to learnability (see the discussion in Alon et al. (1997) and Shalev-Shwartz et al. (2010)). This means well-developed tools in empirical process theory can be used to establish the learnability of a particular class of functions. Intuitively, whether or not a particular class of functions $\mathcal{F}$ is learnable depends on the "complexity" of the function class. There are various ways to measure the complexity of $\mathcal{F}$, some of which are encountered in the current paper. In general, classes that exhibit less complexity are easier to learn than classes that exhibit more complexity, and if a class of functions is too complex, it may not be learnable.

B Proofs
Remark B.1 (Common Notation). To avoid repetition we introduce some common notation for use in the proofs of Theorem 4.1, Theorem 5.1, Lemma 5.1 and Lemma B.9. In particular, for any $\theta \in \Theta$ and $\gamma \in \Gamma$, let $\lambda^*(\theta, \gamma)$ and $\hat{\lambda}(\theta, \gamma)$ satisfy:
$$P h_{\ell b}(\cdot, \theta, \gamma, \lambda^*(\theta, \gamma)) = \max_{\lambda \in \Lambda} P h_{\ell b}(\cdot, \theta, \gamma, \lambda), \tag{B.1}$$
$$P_n h_{\ell b}(\cdot, \theta, \gamma, \hat{\lambda}(\theta, \gamma)) = \max_{\lambda \in \Lambda} P_n h_{\ell b}(\cdot, \theta, \gamma, \lambda). \tag{B.2}$$
Now for any $\gamma \in \Gamma$, let $\theta^*$ and $\hat{\theta}$ satisfy:
$$P h_{\ell b}(\cdot, \theta^*(\gamma), \gamma, \lambda^*(\theta^*(\gamma), \gamma)) \leq \inf_{\theta \in \Theta} P h_{\ell b}(\cdot, \theta, \gamma, \lambda^*(\theta, \gamma)) + \varepsilon, \tag{B.3}$$
$$P_n h_{\ell b}(\cdot, \hat{\theta}(\gamma), \gamma, \hat{\lambda}(\hat{\theta}(\gamma), \gamma)) \leq \inf_{\theta \in \Theta} P_n h_{\ell b}(\cdot, \theta, \gamma, \hat{\lambda}(\theta, \gamma)) + \varepsilon. \tag{B.4}$$
Finally, let $\gamma^*$ and $\hat{\gamma}$ satisfy:
$$P h_{\ell b}(\cdot, \theta^*(\gamma^*), \gamma^*, \lambda^*(\theta^*(\gamma^*), \gamma^*)) \geq \sup_{\gamma \in \Gamma} P h_{\ell b}(\cdot, \theta^*(\gamma), \gamma, \lambda^*(\theta^*(\gamma), \gamma)) - \varepsilon, \tag{B.5}$$
$$P_n h_{\ell b}(\cdot, \hat{\theta}(\hat{\gamma}), \hat{\gamma}, \hat{\lambda}(\hat{\theta}(\hat{\gamma}), \hat{\gamma})) \geq \sup_{\gamma \in \Gamma} P_n h_{\ell b}(\cdot, \hat{\theta}(\gamma), \gamma, \hat{\lambda}(\hat{\theta}(\gamma), \gamma)) - \varepsilon. \tag{B.6}$$
With these definitions, it is straightforward to show:
$$\sup_{\gamma \in \Gamma} \inf_{\theta \in \Theta} \max_{\lambda \in \Lambda} P h_{\ell b}(\cdot, \theta, \gamma, \lambda) \leq \inf_{\theta \in \Theta} \max_{\lambda \in \Lambda} P h_{\ell b}(\cdot, \theta, \gamma^*, \lambda) + 3\varepsilon, \tag{B.7}$$
$$\sup_{\gamma \in \Gamma} \inf_{\theta \in \Theta} \max_{\lambda \in \Lambda} P_n h_{\ell b}(\cdot, \theta, \gamma, \lambda) \leq \inf_{\theta \in \Theta} \max_{\lambda \in \Lambda} P_n h_{\ell b}(\cdot, \theta, \hat{\gamma}, \lambda) + 3\varepsilon. \tag{B.8}$$
Furthermore, we can always choose $\gamma^*$ and $\hat{\gamma}$ to satisfy:
$$\inf_{\theta \in \Theta} \max_{\lambda \in \Lambda} P h_{\ell b}(\cdot, \theta, \hat{\gamma}, \lambda) \leq \inf_{\theta \in \Theta} \max_{\lambda \in \Lambda} P h_{\ell b}(\cdot, \theta, \gamma^*, \lambda), \tag{B.9}$$
$$\inf_{\theta \in \Theta} \max_{\lambda \in \Lambda} P_n h_{\ell b}(\cdot, \theta, \gamma^*, \lambda) \leq \inf_{\theta \in \Theta} \max_{\lambda \in \Lambda} P_n h_{\ell b}(\cdot, \theta, \hat{\gamma}, \lambda). \tag{B.10}$$

(Note that taking the outer probability is necessary because the sampling uncertainty from the choice of $\hat{f}$ is not resolved by the inner expectation. This perspective is apparent in Valiant (2013). See, for example, Shalev-Shwartz and Ben-David (2014), Lemma 4.2.)

Remark B.2 (Measurability).
We will not comment on measurability issues in every proof; instead we refer readers to the discussion in Appendix B.2.1 (namely, Proposition B.1 and Corollary B.1). There it is shown that certain quantities in this paper that are not typically (Borel) measurable are still universally measurable. This allows us to use outer measures to resolve measurability issues, although this is left implicit in many of the proofs. However, we also note that all measurability issues can be resolved by restricting Θ and Γ to have at most countably many points.

B.1 Proofs of the Main Results
Proof of Proposition 2.1.
Recall that by assumption the map γ ↦ inf_{s ∈ S} I[φ](γ, s) is universally measurable. By (Borel) measurability of each decision rule d: Ψⁿ → Γ (and thus universal measurability), and the fact that universally measurable functions are closed under composition, the map ψ ↦ inf_{s ∈ S} I[φ](d(ψ), s) is universally measurable. The result then follows from Lemma B.2 after noting that sup_{γ ∈ Γ} inf_{s ∈ S} I[φ](γ, s) is a constant for each P_{Y,Z} ∈ P_{Y,Z} (and thus plays the role of "c(P)" from Lemma B.2). ∎

Proof of Lemma 3.1.
Fix a value of δ > 0. We wish to find a constant C ≥ 0 such that for every θ ∈ Θ*_δ there exists θ* ∈ Θ* satisfying:

∫ inf_{u ∈ G⁻(y,z,θ*)} inf_{y⋆ ∈ G⋆(y,z,u,θ*,γ)} φ(v) dP_{Y,Z} − ∫ inf_{u ∈ G⁻(y,z,θ)} inf_{y⋆ ∈ G⋆(y,z,u,θ,γ)} φ(v) dP_{Y,Z} ≤ C d(θ, Θ*).

Note that this inequality is trivially satisfied for any C ≥ 0 when θ ∈ Θ*. Thus, it suffices to focus on the case when θ ∈ Θ*_δ \ Θ*. Furthermore, for this latter case it suffices to find a value of C ≥ 0 such that:

∫ ( inf_{u ∈ G⁻(y,z,θ₁)} inf_{y⋆ ∈ G⋆(y,z,u,θ₁,γ)} φ(v) − inf_{u ∈ G⁻(y,z,θ₂)} inf_{y⋆ ∈ G⋆(y,z,u,θ₂,γ)} φ(v) ) dP_{Y,Z} ≤ C d(θ₁, θ₂),

for any θ₁, θ₂ ∈ Θ*_δ. However, to find C in the previous display, it suffices to find C such that:

inf_{u ∈ G⁻(y,z,θ₁)} inf_{y⋆ ∈ G⋆(y,z,u,θ₁,γ)} φ(v) − inf_{u ∈ G⁻(y,z,θ₂)} inf_{y⋆ ∈ G⋆(y,z,u,θ₂,γ)} φ(v) ≤ C d(θ₁, θ₂),   (B.11)

(y, z)−a.s. Fix any ε > 0 and let (y, z) ∈ Y × Z be any pair (outside the null sets in (3.10) and (3.11)). For any θ₁, θ₂ ∈ Θ*_δ let u₁*, u₂*, y₁* and y₂* satisfy:

u₁* ∈ G⁻(y, z, θ₁), y₁* ∈ G⋆(y, z, u₁*, θ₁, γ),
u₂* ∈ G⁻(y, z, θ₂), y₂* ∈ G⋆(y, z, u₂*, θ₂, γ),

and:

φ(y, z, u₁*, y₁*) ≤ inf_{u ∈ G⁻(y,z,θ₁)} inf_{y⋆ ∈ G⋆(y,z,u,θ₁,γ)} φ(v) + ε,
φ(y, z, u₂*, y₂*) ≤ inf_{u ∈ G⁻(y,z,θ₂)} inf_{y⋆ ∈ G⋆(y,z,u,θ₂,γ)} φ(v) + ε.

For simplicity we will denote v₁* := (y, z, u₁*, y₁*) and v₂* := (y, z, u₂*, y₂*). Now, by Proposition 3C.1 in Dontchev and Rockafellar (2009), condition (3.11) implies:

d_H(G⋆(y, z, u, θ₁, γ), G⋆(y, z, u, θ₂, γ)) ≤ ℓ₁ d(θ₁, θ₂), ∀ θ₁, θ₂ ∈ Θ*_δ, (y, z, u)−a.s.

Thus, since y₂* ∈ G⋆(y, z, u, θ₂, γ) by assumption, there exists y₁ ∈ G⋆(y, z, u, θ₁, γ) such that d(y₁, y₂*) ≤ ℓ₁ d(θ₁, θ₂).
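The Hausdorff-distance step above can be checked numerically on finite sets. The following minimal sketch is our own illustration (the metric d and the sets A, B are hypothetical stand-ins, not objects from the paper):

```python
# Hausdorff distance between two non-empty finite sets:
# d_H(A, B) = max{ sup_{a in A} inf_{b in B} d(a, b), sup_{b in B} inf_{a in A} d(a, b) }.
def hausdorff(A, B, d):
    directed = lambda X, Y: max(min(d(x, y) for y in Y) for x in X)
    return max(directed(A, B), directed(B, A))

d = lambda a, b: abs(a - b)   # metric on the real line (stand-in)
A = [0.0, 1.0]                # hypothetical finite sets playing the role of the G sets
B = [0.0, 1.0, 3.0]
print(hausdorff(A, B, d))     # 2.0: the point 3.0 is distance 2 from A
```

The asymmetry of the two directed parts is why the outer max is needed: here the direction A → B contributes 0 while B → A contributes 2.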
Furthermore, by Proposition 3C.1 in Dontchev and Rockafellar (2009), condition (3.10) implies:

d_H(G⁻(y, z, θ₁), G⁻(y, z, θ₂)) ≤ ℓ₂ d(θ₁, θ₂), ∀ θ₁, θ₂ ∈ Θ*_δ.

Thus, since u₂* ∈ G⁻(y, z, θ₂) by assumption, there exists u₁ ∈ G⁻(y, z, θ₁) such that d(u₁, u₂*) ≤ ℓ₂ d(θ₁, θ₂). [Footnote: Recall the Hausdorff distance between two non-empty subsets A and B of a metric space (X, d) is given by: d_H(A, B) := max{ sup_{a ∈ A} inf_{b ∈ B} d(a, b), sup_{b ∈ B} inf_{a ∈ A} d(a, b) }.] Set v₁ := (y, z, u₁, y₁). Then we have:

inf_{u ∈ G⁻(y,z,θ₁)} inf_{y⋆ ∈ G⋆(y,z,u,θ₁,γ)} φ(v) − inf_{u ∈ G⁻(y,z,θ₂)} inf_{y⋆ ∈ G⋆(y,z,u,θ₂,γ)} φ(v)
  ≤ φ(v₁*) − φ(v₂*) + ε
  ≤ φ(v₁) − φ(v₂*) + 2ε
  ≤ L_φ d((y₁, u₁), (y₂*, u₂*)) + 2ε
  ≤ L_φ max{ d(y₁, y₂*), d(u₁, u₂*) } + 2ε
  ≤ L_φ max{ ℓ₁, ℓ₂ } d(θ₁, θ₂) + 2ε,

which holds for all θ₁, θ₂ ∈ Θ*_δ. Since ε > 0 was arbitrary, C in (B.11) can be taken equal to L_φ max{ℓ₁, ℓ₂}. This completes the proof. ∎

Proof of Theorem 3.1.
We will show the lower bound, as the proof for the upper bound is symmetric. We will prove the following sequence of equalities and inequalities:

I[φ](γ) := ∫ φ(v) dP_{V_γ}
  ≥ inf_{θ ∈ Θ*} inf_{P_{U|Y,Z} ∈ P_{U|Y,Z}(θ)} inf_{P_{Y⋆_γ|Y,Z,U} ∈ P_{Y⋆_γ|Y,Z,U}(θ,γ)} ∫ φ(v) dP_{V_γ}   (B.12)
  = inf_{θ ∈ Θ} inf_{P_{U|Y,Z} ∈ P_{U|Y,Z}(θ)} ( inf_{P_{Y⋆_γ|Y,Z,U} ∈ P_{Y⋆_γ|Y,Z,U}(θ,γ)} ∫ φ(v) dP_{V_γ} + μ* ∑_{j=1}^J λ_j^{ℓb}(θ, P_{Y,Z}) E_{P_{Y,Z,U}}[m_j(y, z, u, θ)] )   (B.13)
  = inf_{θ ∈ Θ} inf_{P_{U|Y,Z} ∈ P_{U|Y,Z}(θ)} ( ∫ inf_{y⋆ ∈ G⋆(y,z,u,θ,γ)} φ(v) dP_{Y,Z,U} + μ* ∑_{j=1}^J λ_j^{ℓb}(θ, P_{Y,Z}) E_{P_{Y,Z,U}}[m_j(y, z, u, θ)] )   (B.14)
  = inf_{θ ∈ Θ} ∫ ( inf_{u ∈ G⁻(y,z,θ)} inf_{y⋆ ∈ G⋆(y,z,u,θ,γ)} φ(v) + μ* ∑_{j=1}^J λ_j^{ℓb}(θ, P_{Y,Z}) m_j(y, z, u, θ) ) dP_{Y,Z}   (B.15)
  = inf_{θ ∈ Θ} max_{λ_j ∈ {0,1}} ∫ ( inf_{u ∈ G⁻(y,z,θ)} inf_{y⋆ ∈ G⋆(y,z,u,θ,γ)} φ(v) + μ* ∑_{j=1}^J λ_j m_j(y, z, u, θ) ) dP_{Y,Z}.   (B.16)

Inequality (B.12) is obvious. Equality (B.13) follows from Lemma B.4. Equalities (B.14) and (B.15) follow from Lemma B.3. Finally, (B.16) follows from Lemma B.5. ∎

Proof of Theorem 4.1.
Let F be a class of real-valued functions, and let ψ = ((y_i, z_i))_{i=1}^n denote a particular sample of size n. [Footnote: Here we take the product metric to be the sup metric; that is, if (X₁, d₁) and (X₂, d₂) are two metric spaces, then the product metric d_∞ on X₁ × X₂ is defined as d_∞((x₁, x₂), (x₁′, x₂′)) = max{d₁(x₁, x₁′), d₂(x₂, x₂′)}.] For any f, f′ ∈ F define the norm:

||f − f′||_{ψ,2} := ( (1/n) ∑_{i=1}^n (f(y_i, z_i) − f′(y_i, z_i))² )^{1/2}.

Recall that:

h_ℓb(y, z, θ, γ, λ) := inf_{u ∈ G⁻(y,z,θ)} inf_{y⋆ ∈ G⋆(y,z,u,θ,γ)} ( φ(v) + μ* ∑_{j=1}^J λ_j m_j(y, z, u, θ) ).

For notational simplicity we will define:

P_n h_ℓb(·, θ, γ, λ) := (1/n) ∑_{i=1}^n inf_{u_i ∈ G⁻(y_i,z_i,θ)} inf_{y⋆_i ∈ G⋆(y_i,z_i,u_i,θ,γ)} ( φ(v_i) + μ* ∑_{j=1}^J λ_j m_j(y_i, z_i, u_i, θ) ),

P h_ℓb(·, θ, γ, λ) := ∫ inf_{u ∈ G⁻(y,z,θ)} inf_{y⋆ ∈ G⋆(y,z,u,θ,γ)} ( φ(v) + μ* ∑_{j=1}^J λ_j m_j(y, z, u, θ) ) dP_{Y,Z}.

For any decision rule d: Ψⁿ → Γ and any P_{Y,Z} ∈ P_{Y,Z}, we have by Markov's inequality and Theorem 3.1: [Footnote: To be mindful of measurability issues, we can use the outer-measures version of Markov's inequality given in Lemma 6.10 in Kosorok (2008).]

P^{⊗n}_{Y,Z}( sup_{γ ∈ Γ} inf_{s ∈ S} I[φ](γ, s) − inf_{s ∈ S} I[φ](d(ψ), s) ≥ c )
  ≤ (1/c) E( sup_{γ ∈ Γ} inf_{s ∈ S} I[φ](γ, s) − inf_{s ∈ S} I[φ](d(ψ), s) )
  = (1/c) E( sup_{γ ∈ Γ} I_ℓb[φ](γ) − I_ℓb[φ](d(ψ)) ).   (B.17)

Now note by symmetrization (e.g. Van der Vaart and Wellner (1996), Lemma 2.3.1) we have:

sup_{γ ∈ Γ} sup_{θ ∈ Θ} max_{λ ∈ Λ} | E( P_n h_ℓb(·, θ, γ, λ) − P h_ℓb(·, θ, γ, λ) ) |
  ≤ E sup_{γ ∈ Γ} sup_{θ ∈ Θ} max_{λ ∈ Λ} | P_n h_ℓb(·, θ, γ, λ) − P h_ℓb(·, θ, γ, λ) |
  ≤ 2 E ||R_n||(H_ℓb),   (B.18)

where the final outer expectation is a joint expectation that is also taken over the Rademacher random variables. Now let λ*(θ, γ), λ̂(θ, γ), θ*(γ), θ̂(γ), γ* and γ̂ be as in Remark B.1, and set d(ψ) = γ̂. Then we have:

E I_ℓb[φ](d(ψ))
  = E inf_{θ ∈ Θ} max_{λ ∈ Λ} P h_ℓb(·, θ, d(ψ), λ), (by Theorem 3.1),
  = E inf_{θ ∈ Θ} P h_ℓb(·, θ, d(ψ), λ*(θ, d(ψ))), (since λ* is optimal at P for any (θ, γ)),
  ≥ E P h_ℓb(·, θ*(d(ψ)), d(ψ), λ*(θ*(d(ψ)), d(ψ))) − ε, (since θ* is ε-optimal at (P, λ*) for any γ),
  ≥ E P h_ℓb(·, θ*(d(ψ)), d(ψ), λ̂(θ*(d(ψ)), d(ψ))) − ε, (since λ* is optimal at P for any (θ, γ)),
  ≥ E P_n h_ℓb(·, θ*(d(ψ)), d(ψ), λ̂(θ*(d(ψ)), d(ψ))) − 2E||R_n||(H_ℓb) − ε, (by (B.18)),
  ≥ E P_n h_ℓb(·, θ̂(d(ψ)), d(ψ), λ̂(θ̂(d(ψ)), d(ψ))) − 2E||R_n||(H_ℓb) − 2ε, (since θ̂ is ε-optimal at (P_n, λ̂) for any γ),
  ≥ E P_n h_ℓb(·, θ̂(γ*), γ*, λ̂(θ̂(γ*), γ*)) − 2E||R_n||(H_ℓb) − 3ε, (since d(ψ) = γ̂ is ε-optimal at (P_n, λ̂, θ̂)),
  ≥ E P_n h_ℓb(·, θ̂(γ*), γ*, λ*(θ̂(γ*), γ*)) − 2E||R_n||(H_ℓb) − 3ε, (since λ̂ is optimal at P_n for any (θ, γ)),
  ≥ E P h_ℓb(·, θ̂(γ*), γ*, λ*(θ̂(γ*), γ*)) − 4E||R_n||(H_ℓb) − 3ε, (by (B.18)),
  ≥ E P h_ℓb(·, θ*(γ*), γ*, λ*(θ*(γ*), γ*)) − 4E||R_n||(H_ℓb) − 4ε, (since θ* is ε-optimal at (P, λ*) for any γ),
  ≥ E sup_{γ ∈ Γ} P h_ℓb(·, θ*(γ), γ, λ*(θ*(γ), γ)) − 4E||R_n||(H_ℓb) − 5ε, (since γ* is ε-optimal at (P, λ*, θ*)),
  ≥ E sup_{γ ∈ Γ} inf_{θ ∈ Θ} P h_ℓb(·, θ, γ, λ*(θ, γ)) − 4E||R_n||(H_ℓb) − 5ε, (since θ* is ε-optimal at (P, λ*) for any γ),
  = E sup_{γ ∈ Γ} inf_{θ ∈ Θ} max_{λ ∈ Λ} P h_ℓb(·, θ, γ, λ) − 4E||R_n||(H_ℓb) − 5ε, (since λ* is optimal at P for any (θ, γ)),
  = E sup_{γ ∈ Γ} I_ℓb[φ](γ) − 4E||R_n||(H_ℓb) − 5ε, (by Theorem 3.1).

Since ε > 0 was arbitrary, we conclude:

E( sup_{γ ∈ Γ} I_ℓb[φ](γ) − I_ℓb[φ](d(ψ)) ) ≤ 4E||R_n||(H_ℓb).   (B.19)

It thus suffices to bound the Rademacher complexity, given by:

E||R_n||(H_ℓb) = E sup_{γ ∈ Γ} sup_{θ ∈ Θ} max_{λ ∈ Λ} | (1/n) ∑_{i=1}^n ξ_i inf_{u_i ∈ G⁻(y_i,z_i,θ)} ( inf_{y⋆_i ∈ G⋆(y_i,z_i,u_i,θ,γ)} φ(v_i) + μ* ∑_{j=1}^J λ_j m_j(y_i, z_i, u_i, θ) ) |.

If H_ℓb is not closed under symmetry, then redefine it as H_ℓb ∪ (−H_ℓb); for our purposes this is without loss of generality, since this operation can only increase the value of E||R_n||(H_ℓb).
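As an aside, the Rademacher complexity of a finite class can be approximated by straightforward Monte Carlo over the signs ξ_i. The sketch below is our own illustration (it is not the class H_ℓb or an estimator from the paper); it shows the n^{-1/2} decay for a singleton class:

```python
import random

random.seed(1)

# Monte Carlo estimate of E sup_h | n^{-1} sum_i xi_i h(x_i) | for a finite
# class, where `values[k]` lists (h_k(x_1), ..., h_k(x_n)) and xi_i are
# independent Rademacher signs.
def rademacher_complexity(values, reps=1000):
    n = len(values[0])
    total = 0.0
    for _ in range(reps):
        xi = [random.choice((-1.0, 1.0)) for _ in range(n)]   # Rademacher draws
        total += max(abs(sum(s * v for s, v in zip(xi, hv))) / n for hv in values)
    return total / reps

# For the singleton class {h = 1}, the complexity is E|mean(xi)|, which is of
# order sqrt(2/(pi*n)), so a 100x larger n shrinks it roughly tenfold.
small_n = rademacher_complexity([[1.0] * 50])
large_n = rademacher_complexity([[1.0] * 5000], reps=200)
print(small_n, large_n)
```

Larger (richer) classes push the sup up, which is exactly why complexity measures of H_ℓb control learnability here.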
We then have from Lemma B.7 that for any ε > 0:

E||R_n||(H_ℓb) ≤ ε/√n + 2 Diam_{ψ,2}(H_ℓb) √( log N(ε, H_ℓb, ||·||_{ψ,2}) / n ).   (B.20)

Since the class of functions H_ℓb is uniformly bounded, we have Diam_{ψ,2}(H_ℓb) < ∞. It remains to bound the metric entropy. To do so, we will define:

H_I := { h(·, u, θ, γ, λ): Y × Z → R : h(y, z, u, θ, γ, λ) = inf_{y⋆ ∈ G⋆(y,z,u,θ,γ)} φ(v) + μ* ∑_{j=1}^J λ_j m_j(y, z, u, θ), (u, θ, γ, λ) ∈ U × Θ × Γ × Λ },   (B.21)

H_II := { h(·, u, θ, γ): Y × Z → R : h(y, z, u, θ, γ) = inf_{y⋆ ∈ G⋆(y,z,u,θ,γ)} φ(v), (u, θ, γ) ∈ U × Θ × Γ },   (B.22)

H_III := { h(·, u, y⋆): Y × Z → R : h(y, z, u, y⋆) = φ(y, z, u, y⋆), (u, y⋆) ∈ U × Y⋆ },   (B.23)

H_IV := { h(·, u, θ, λ): Y × Z → R : h(y, z, u, θ, λ) = ∑_{j=1}^J λ_j m_j(y, z, u, θ), (u, θ, λ) ∈ U × Θ × Λ }.   (B.24)

By Lemma B.6, we have:

N(ε, H_ℓb, ||·||_{ψ,2}) ≤ N(ε/2, H_I, ||·||_{ψ,2}).

By Lemma B.8 we also have:

N(ε/2, H_I, ||·||_{ψ,2}) ≤ N(ε/4, H_II, ||·||_{ψ,2}) N(ε/4, H_IV, ||·||_{ψ,2}).

Applying Lemma B.6 again we have:

N(ε/4, H_II, ||·||_{ψ,2}) ≤ N(ε/4, H_III, ||·||_{ψ,2}).

Finally, from iterated application of Lemma B.8:

N(ε/4, H_IV, ||·||_{ψ,2}) ≤ ∏_{j=1}^J N(ε/(4J), M_j, ||·||_{ψ,2}).

We conclude that:

log N(ε, H_ℓb, ||·||_{ψ,2}) ≤ log N(ε/4, H_III, ||·||_{ψ,2}) + ∑_{j=1}^J log N(ε/(4J), M_j, ||·||_{ψ,2})
  ≤ sup_{Q ∈ Q_n} log N(ε/4, H_III, ||·||_{Q,2}) + ∑_{j=1}^J sup_{Q ∈ Q_n} log N(ε/(4J), M_j, ||·||_{Q,2}),

with the supremum taken over all discrete probability measures Q_n on X with atoms that have probabilities that are integer multiples of 1/n. Since by assumption H_III and the M_j satisfy the entropy growth condition, the right side of the previous display is of order o(n).
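For simple parametric classes the covering numbers N(ε, ·, ||·||_{ψ,2}) above can be computed directly with a greedy ε-net. The toy example below (threshold functions on a random sample; entirely our own illustration, not a class from the paper) shows log N(ε) staying tiny relative to n, in the spirit of the entropy growth condition:

```python
import math
import random

random.seed(0)

# Empirical L2 norm ||.||_{psi,2} between two vectors of function values.
def empirical_norm(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Greedy eps-net: every vector ends up within eps of some retained center,
# so the number of centers upper-bounds the covering number at radius eps.
def covering_number(vectors, eps):
    centers = []
    for v in vectors:
        if all(empirical_norm(v, c) > eps for c in centers):
            centers.append(v)
    return len(centers)

n = 200
zs = [random.random() for _ in range(n)]
# Threshold functions h_t(z) = 1[z > t] evaluated on the sample psi.
vecs = [[1.0 if z > i / 500 else 0.0 for z in zs] for i in range(501)]

N = covering_number(vecs, 0.25)
print(N, math.log(N) / n)   # log N(eps) is small relative to n
```

For this one-parameter family the squared empirical distance between two thresholds is just the fraction of sample points between them, so N(ε) stays bounded as the threshold grid is refined.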
Combining this with (B.20), we see that for any (c, κ) pair, there exists some n such that 4E||R_n||(H_ℓb) ≤ c(1 − κ). Combining this with (B.19) and (B.17), the proof is complete. ∎

Proof of Theorem 5.1.
Recall that:

h_ℓb(y, z, θ, γ, λ) := inf_{u ∈ G⁻(y,z,θ)} inf_{y⋆ ∈ G⋆(y,z,u,θ,γ)} ( φ(v) + μ* ∑_{j=1}^J λ_j m_j(y, z, u, θ) ).

For notational simplicity we will define:

P_n h_ℓb(·, θ, γ, λ) := (1/n) ∑_{i=1}^n inf_{u_i ∈ G⁻(y_i,z_i,θ)} inf_{y⋆_i ∈ G⋆(y_i,z_i,u_i,θ,γ)} ( φ(v_i) + μ* ∑_{j=1}^J λ_j m_j(y_i, z_i, u_i, θ) ),

P h_ℓb(·, θ, γ, λ) := ∫ inf_{u ∈ G⁻(y,z,θ)} inf_{y⋆ ∈ G⋆(y,z,u,θ,γ)} ( φ(v) + μ* ∑_{j=1}^J λ_j m_j(y, z, u, θ) ) dP_{Y,Z}.

We claim that it suffices to set c_n(κ) = 2c̃_n(ψ, κ) + 5ε, where c̃_n(ψ, κ) satisfies:

sup_{γ ∈ Γ} sup_{θ ∈ Θ} max_{λ ∈ Λ} | P_n h_ℓb(·, θ, γ, λ) − P h_ℓb(·, θ, γ, λ) | ≤ c̃_n(ψ, κ),   (B.25)

with probability at least 1 − (1 − κ)/2. Let λ*(θ, γ), λ̂(θ, γ), θ*(γ), θ̂(γ), γ* and γ̂ be as in Remark B.1 and set d(ψ) = γ̂. Then we have:

I_ℓb[φ](d(ψ))
  = inf_{θ ∈ Θ} max_{λ ∈ Λ} P h_ℓb(·, θ, d(ψ), λ), (by Theorem 3.1),
  = inf_{θ ∈ Θ} P h_ℓb(·, θ, d(ψ), λ*(θ, d(ψ))), (since λ* is optimal at P for any (θ, γ)),
  ≥ P h_ℓb(·, θ*(d(ψ)), d(ψ), λ*(θ*(d(ψ)), d(ψ))) − ε, (since θ* is ε-optimal at (P, λ*) for any γ),
  ≥ P h_ℓb(·, θ*(d(ψ)), d(ψ), λ̂(θ*(d(ψ)), d(ψ))) − ε, (since λ* is optimal at P for any (θ, γ)),
  ≥† P_n h_ℓb(·, θ*(d(ψ)), d(ψ), λ̂(θ*(d(ψ)), d(ψ))) − c̃_n(ψ, κ) − ε, (by (B.25)),
  ≥ P_n h_ℓb(·, θ̂(d(ψ)), d(ψ), λ̂(θ̂(d(ψ)), d(ψ))) − c̃_n(ψ, κ) − 2ε, (since θ̂ is ε-optimal at (P_n, λ̂) for any γ),
  ≥ P_n h_ℓb(·, θ̂(γ*), γ*, λ̂(θ̂(γ*), γ*)) − c̃_n(ψ, κ) − 3ε, (since d(ψ) = γ̂ is ε-optimal at (P_n, λ̂, θ̂)),
  ≥ P_n h_ℓb(·, θ̂(γ*), γ*, λ*(θ̂(γ*), γ*)) − c̃_n(ψ, κ) − 3ε, (since λ̂ is optimal at P_n for any (θ, γ)),
  ≥† P h_ℓb(·, θ̂(γ*), γ*, λ*(θ̂(γ*), γ*)) − 2c̃_n(ψ, κ) − 3ε, (by (B.25)),
  ≥ P h_ℓb(·, θ*(γ*), γ*, λ*(θ*(γ*), γ*)) − 2c̃_n(ψ, κ) − 4ε, (since θ* is ε-optimal at (P, λ*) for any γ),
  ≥ sup_{γ ∈ Γ} P h_ℓb(·, θ*(γ), γ, λ*(θ*(γ), γ)) − 2c̃_n(ψ, κ) − 5ε, (since γ* is ε-optimal at (P, λ*, θ*)),
  ≥ sup_{γ ∈ Γ} inf_{θ ∈ Θ} P h_ℓb(·, θ, γ, λ*(θ, γ)) − 2c̃_n(ψ, κ) − 5ε, (since θ* is ε-optimal at (P, λ*) for any γ),
  = sup_{γ ∈ Γ} inf_{θ ∈ Θ} max_{λ ∈ Λ} P h_ℓb(·, θ, γ, λ) − 2c̃_n(ψ, κ) − 5ε, (since λ* is optimal at P for any (θ, γ)),
  = sup_{γ ∈ Γ} I_ℓb[φ](γ) − 2c̃_n(ψ, κ) − 5ε, (by Theorem 3.1),

where each inequality marked "≥†" holds with probability at least 1 − (1 − κ)/2. Note that this shows:

sup_{γ ∈ Γ} I_ℓb[φ](γ) − I_ℓb[φ](γ̂) ≤ 2c̃_n(ψ, κ) + 5ε = c_n(κ),

with probability at least κ. To satisfy (B.25) it clearly suffices to choose c̃_n(ψ, κ) to satisfy:

sup_{P_{Y,Z} ∈ P_{Y,Z}} P^{⊗n}_{Y,Z}( sup_{γ ∈ Γ} sup_{θ ∈ Θ} max_{λ ∈ Λ} | P_n h_ℓb(·, θ, γ, λ) − P h_ℓb(·, θ, γ, λ) | ≥ c̃_n(ψ, κ) ) ≤ (1 − κ)/2.   (B.26)

From Koltchinskii (2011), Theorem 4.6, we have for any t > 0:

sup_{P_{Y,Z} ∈ P_{Y,Z}} P^{⊗n}_{Y,Z}( sup_{γ ∈ Γ} sup_{θ ∈ Θ} max_{λ ∈ Λ} | P_n h_ℓb(·, θ, γ, λ) − P h_ℓb(·, θ, γ, λ) | ≥ 2||R_n||(H_ℓb) + 3tH/√n ) ≤ exp(−t²/2).

Now set:

c̃_n(ψ, κ) = 2||R_n||(H_ℓb) + √( 18 ln(2/(1 − κ)) H²/n ).

Then we have:

c_n(κ) = 4||R_n||(H_ℓb) + √( 72 ln(2/(1 − κ)) H²/n ) + 5ε,

and we conclude (5.4). ∎
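Numerically, the deviation part of this data-driven bound shrinks at the n^{-1/2} rate. The sketch below evaluates it using the constants as reconstructed above; H and κ are placeholder values, not calibrated quantities from the paper:

```python
import math

# Deviation term sqrt(72 * ln(2/(1-kappa)) * H^2 / n) from the reconstructed
# c_n(kappa); H is the uniform bound on h_lb and kappa the confidence level
# (both placeholders here).
def deviation_term(n, H=1.0, kappa=0.95):
    return math.sqrt(72.0 * math.log(2.0 / (1.0 - kappa)) * H ** 2 / n)

for n in (100, 10_000, 1_000_000):
    print(n, deviation_term(n))   # shrinks exactly tenfold per 100x more data
```

Because the term is proportional to n^{-1/2}, multiplying the sample size by 100 divides it by exactly 10, which makes the trade-off between sample size and error tolerance easy to read off.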
Proof of Theorem 5.2.
Let T, T♭, and T♯ be as defined in Lemma 5.1. In this proof, it is useful to note the following facts:

(i) The functions δ ↦ T_n(δ), T(δ) are non-decreasing left-continuous step functions that are greater than or equal to zero on the interval [0, δ₀], and zero otherwise.
(ii) The functions σ ↦ T♭_n(σ), T♭(σ) are non-increasing and left-continuous with their only possible points of discontinuity at the points {δ_j}_{j=0}^∞.
(iii) The functions η ↦ T♯_n(η), T♯(η) are non-increasing and continuous.

Now for any η > 0, let:

δ* = T♯_n(1 − 1/a) + η′,
δ** = T♯(1 − 1/a) + η,

where η′ = η + ε for some ε > 0. Note that choosing δ** slightly larger than T♯(1 − 1/a) ensures that T♭(δ**) ≤ 1 − 1/a. A similar note applies to δ* and T♯_n(1 − 1/a).

From the proof of Lemma 5.1 we know there exists an event E_n with P^{⊗n}_{Y,Z}(E_n) ≥ κ such that on E_n we have G*(δ) ⊆ G_n(bδ) for every δ ≥ δ**. Thus, for every δ ≥ δ** we have on E_n that T(δ) ≤ T_n(δ), which implies:

T(δ)/δ ≤ T_n(δ)/δ, for all δ ≥ δ**.

Thus, on E_n we have T♭(σ) ≤ T♭_n(σ) for any σ ≥ δ**, and in particular we have:

T♭(δ**) := sup_{δ ≥ δ**} T(δ)/δ ≤ sup_{δ ≥ δ**} T_n(δ)/δ =: T♭_n(δ**).   (B.27)

Recall our choice of δ** ensures that T♭(δ**) ≤ 1 − 1/a. We can now distinguish two cases on the event E_n:

1. We have:

sup_{δ ≥ δ**} T(δ)/δ ≤ 1 − 1/a ≤ sup_{δ ≥ δ**} T_n(δ)/δ.

In this case, we have T♭_n(δ**) ≥ 1 − 1/a, and thus T♯_n(1 − 1/a) ≥ δ**, so that δ* > δ** (see the definitions of δ* and δ** above).

2. We have:

sup_{δ ≥ δ**} T(δ)/δ ≤ sup_{δ ≥ δ**} T_n(δ)/δ < 1 − 1/a.

This implies either (i) T♯(1 − 1/a) ≤ T♯_n(1 − 1/a) < δ**, or (ii) T♯_n(1 − 1/a) < T♯(1 − 1/a) < δ**. In case (i) we clearly have δ* ≥ δ**. In case (ii), let:

c := T♯(1 − 1/a) − T♯_n(1 − 1/a) > 0.

Then:

δ** − δ* = T♯(1 − 1/a) + η − T♯_n(1 − 1/a) − η′ = c − ε,

where the last line follows from the definition of η′. Now suppose that c > ε for our ε > 0. The value of c does not depend on the value of η > 0, so the assumption that c > ε must then hold for every η > 0. If we can show that c < ε for some η > 0, we will have arrived at our desired contradiction.

Recall that on E_n we have T♭(σ) ≤ T♭_n(σ) for any σ ≥ δ**. This implies that, for any r > 0, if T♯(r) ≥ δ** then T♯_n(r) ≥ T♯(r). Now choose a value r_η ∈ R closest to 1 − 1/a such that r_η ≤ 1 − 1/a and:

T♯(r_η) = T♯(1 − 1/a) + η = δ**.

Such a choice is always possible by continuity of T♯, and by the fact that T♯ is non-increasing. By taking η (and thus also δ**) small enough we conclude by continuity of T♯ that the point r_η can also always be chosen arbitrarily close to 1 − 1/a. Recall by continuity of T♯_n that there exists ε′ > 0 such that T♯_n(x) − T♯_n(1 − 1/a) < ε whenever (1 − 1/a) < x + ε′. Now by choosing r_η ≤ 1 − 1/a such that 1 − 1/a < r_η + ε′, we have:

c = T♯(1 − 1/a) − T♯_n(1 − 1/a) < T♯(r_η) − T♯_n(1 − 1/a) ≤ T♯_n(r_η) − T♯_n(1 − 1/a) < ε.

This of course contradicts the fact that c > ε for every choice of η > 0. We conclude that c ≤ ε, and since δ** − δ* = c − ε, we have δ** ≤ δ*.

We conclude in all cases that δ** ≤ δ* on E_n. The result then follows directly from Lemma 5.1. ∎

Proof of Lemma 5.1.
Recall that:

h_ℓb(y, z, θ, γ, λ) := inf_{u ∈ G⁻(y,z,θ)} inf_{y⋆ ∈ G⋆(y,z,u,θ,γ)} ( φ(v) + μ* ∑_{j=1}^J λ_j m_j(y, z, u, θ) ).

For notational simplicity we will define:

P_n h_ℓb(·, θ, γ, λ) := (1/n) ∑_{i=1}^n inf_{u_i ∈ G⁻(y_i,z_i,θ)} inf_{y⋆_i ∈ G⋆(y_i,z_i,u_i,θ,γ)} ( φ(v_i) + μ* ∑_{j=1}^J λ_j m_j(y_i, z_i, u_i, θ) ),

P h_ℓb(·, θ, γ, λ) := ∫ inf_{u ∈ G⁻(y,z,θ)} inf_{y⋆ ∈ G⋆(y,z,u,θ,γ)} ( φ(v) + μ* ∑_{j=1}^J λ_j m_j(y, z, u, θ) ) dP_{Y,Z}.

Define the events:

E_{n,j} := { sup_{θ,θ′ ∈ Θ} sup_{γ,γ′ ∈ G*(δ_j)} sup_{λ,λ′ ∈ Λ} | (P_n h_ℓb(·, θ, γ, λ) − P_n h_ℓb(·, θ′, γ′, λ′)) − (P h_ℓb(·, θ, γ, λ) − P h_ℓb(·, θ′, γ′, λ′)) | ≤ T(δ_j) },

and:

E_n := ∩_{j: δ_j ≥ δ**} E_{n,j}.   (B.28)

Note the value 2H is an upper bound for any function in H′_ℓb(δ) for any δ > 0. By our choice of δ₀ > 2H we have:

sup_{P_{Y,Z} ∈ P_{Y,Z}} P^{⊗n}_{Y,Z}(E^c_{n,0}) = 0.

Furthermore, from the uniform version of Hoeffding's inequality (e.g. Koltchinskii (2011), Theorem 4.6, p. 71) we have:

sup_{P_{Y,Z} ∈ P_{Y,Z}} P^{⊗n}_{Y,Z}(E^c_{n,j}) ≤ exp(−t_j²/2),

for each j ∈ N. We conclude by the union bound that:

inf_{P_{Y,Z} ∈ P_{Y,Z}} P^{⊗n}_{Y,Z}(E_n) ≥ 1 − ∑_{j: δ_j ≥ δ**} exp(−t_j²/2).

Setting c₁ = 5, c₂ = (3/(2(1 − κ)))^{2/5} and t_j = √(c₁ log(c₂ · j)), we have:

∑_{j: δ_j ≥ δ**} exp(−t_j²/2) ≤ ∑_{j=1}^∞ exp(−t_j²/2) = ∑_{j=1}^∞ exp( −c₁ log(c₂ · j)/2 ) = ∑_{j=1}^∞ (c₂ · j)^{−c₁/2} = (2(1 − κ)/3) ∑_{j=1}^∞ (1/j)^{5/2} ≤ (2(1 − κ)/3)(3/2) = 1 − κ.

Thus we conclude:

inf_{P_{Y,Z} ∈ P_{Y,Z}} P^{⊗n}_{Y,Z}(E_n) ≥ κ.   (B.29)

The remainder of the proof proceeds in two parts:

1. We will show that on the event E_n we have, for any γ ∈ Γ, E_n(γ) ≤ (2 − 1/a)(E*(γ) ∨ δ**). We will then use this fact to argue that, on E_n, for any δ ≥ δ** we have G*(δ) ⊆ G_n((2 − 1/a)δ).

2. We will show that on the event E_n we have, for any γ ∈ Γ, E*(γ) ≤ a(E_n(γ) ∨ δ**). We will then use this fact to argue that, on E_n, for any δ ≥ aδ** we have G_n(δ/a) ⊆ G*(δ).

Throughout this proof, let λ*(θ, γ), λ̂(θ, γ), θ*(γ), θ̂(γ), γ* and γ̂ be as in Remark B.1.

Part 1:
We will prove that on the event E_n we have E_n(γ) ≤ (2 − 1/a)(E*(γ) ∨ δ**) for any γ ∈ Γ. First, consider any γ with σ := E*(γ) ≥ δ**. Pick any ε > 0 such that δ** ≥ ε, which is possible since δ** > T♯(1 − 1/a) ≥ 0. Then on the event E_n we have:

E_n(γ) := sup_{γ′ ∈ Γ} inf_{θ ∈ Θ} max_{λ ∈ Λ} P_n h_ℓb(·, θ, γ′, λ) − inf_{θ ∈ Θ} max_{λ ∈ Λ} P_n h_ℓb(·, θ, γ, λ)
  ≤ inf_θ max_λ P_n h_ℓb(·, θ, γ̂, λ) − inf_θ max_λ P_n h_ℓb(·, θ, γ, λ) + 3ε
  = inf_θ max_λ P h_ℓb(·, θ, γ̂, λ) − inf_θ max_λ P h_ℓb(·, θ, γ, λ)
    + ( inf_θ max_λ P h_ℓb(·, θ, γ, λ) − inf_θ max_λ P h_ℓb(·, θ, γ̂, λ) )
    − ( inf_θ max_λ P_n h_ℓb(·, θ, γ, λ) − inf_θ max_λ P_n h_ℓb(·, θ, γ̂, λ) ) + 3ε
  ≤ sup_{γ′ ∈ Γ} inf_θ max_λ P h_ℓb(·, θ, γ′, λ) − inf_θ max_λ P h_ℓb(·, θ, γ, λ)
    + ( inf_θ max_λ P h_ℓb(·, θ, γ, λ) − inf_θ max_λ P h_ℓb(·, θ, γ̂, λ) )
    − ( inf_θ max_λ P_n h_ℓb(·, θ, γ, λ) − inf_θ max_λ P_n h_ℓb(·, θ, γ̂, λ) ) + 3ε
  = E*(γ) + ( inf_θ max_λ P h_ℓb(·, θ, γ, λ) − inf_θ max_λ P h_ℓb(·, θ, γ̂, λ) )
    − ( inf_θ max_λ P_n h_ℓb(·, θ, γ, λ) − inf_θ max_λ P_n h_ℓb(·, θ, γ̂, λ) ) + 3ε.

Now note:

inf_θ max_λ P h_ℓb(·, θ, γ, λ) − inf_θ max_λ P h_ℓb(·, θ, γ̂, λ)
  ≤ inf_θ max_λ P h_ℓb(·, θ, γ, λ) − max_λ P h_ℓb(·, θ*(γ̂), γ̂, λ) + ε
  ≤ max_λ P h_ℓb(·, θ̂(γ), γ, λ) − max_λ P h_ℓb(·, θ*(γ̂), γ̂, λ) + 2ε
  ≤ max_λ P h_ℓb(·, θ̂(γ), γ, λ) − P h_ℓb(·, θ*(γ̂), γ̂, λ̂(θ*(γ̂), γ̂)) + 2ε
  = P h_ℓb(·, θ̂(γ), γ, λ*(θ̂(γ), γ)) − P h_ℓb(·, θ*(γ̂), γ̂, λ̂(θ*(γ̂), γ̂)) + 2ε.

Similarly:

inf_θ max_λ P_n h_ℓb(·, θ, γ̂, λ) − inf_θ max_λ P_n h_ℓb(·, θ, γ, λ)
  ≤ inf_θ max_λ P_n h_ℓb(·, θ, γ̂, λ) − max_λ P_n h_ℓb(·, θ̂(γ), γ, λ) + ε
  ≤ max_λ P_n h_ℓb(·, θ*(γ̂), γ̂, λ) − max_λ P_n h_ℓb(·, θ̂(γ), γ, λ) + 2ε
  ≤ max_λ P_n h_ℓb(·, θ*(γ̂), γ̂, λ) − P_n h_ℓb(·, θ̂(γ), γ, λ*(θ̂(γ), γ)) + 2ε
  = P_n h_ℓb(·, θ*(γ̂), γ̂, λ̂(θ*(γ̂), γ̂)) − P_n h_ℓb(·, θ̂(γ), γ, λ*(θ̂(γ), γ)) + 2ε.

Thus we conclude:

E_n(γ) ≤ E*(γ) + 7ε + P h_ℓb(·, θ̂(γ), γ, λ*(θ̂(γ), γ)) − P h_ℓb(·, θ*(γ̂), γ̂, λ̂(θ*(γ̂), γ̂))
  − ( P_n h_ℓb(·, θ̂(γ), γ, λ*(θ̂(γ), γ)) − P_n h_ℓb(·, θ*(γ̂), γ̂, λ̂(θ*(γ̂), γ̂)) ).

However, γ ∈ G*(σ) by assumption, and by Lemma B.9 we have γ̂ ∈ G*(σ) on the event E_n. Thus, the right side of the previous display can be bounded above:

P h_ℓb(·, θ̂(γ), γ, λ*(θ̂(γ), γ)) − P h_ℓb(·, θ*(γ̂), γ̂, λ̂(θ*(γ̂), γ̂)) − ( P_n h_ℓb(·, θ̂(γ), γ, λ*(θ̂(γ), γ)) − P_n h_ℓb(·, θ*(γ̂), γ̂, λ̂(θ*(γ̂), γ̂)) )
  ≤ sup_{θ,θ′ ∈ Θ} sup_{γ,γ′ ∈ G*(σ)} max_{λ,λ′ ∈ Λ} | (P_n h_ℓb(·, θ, γ, λ) − P_n h_ℓb(·, θ′, γ′, λ′)) − (P h_ℓb(·, θ, γ, λ) − P h_ℓb(·, θ′, γ′, λ′)) |.

Furthermore, for any σ ≥ δ**, on the event E_n this final quantity is bounded above by T(σ); this follows from the definition of T(σ) and the monotonicity of the map:

x ↦ sup_{θ,θ′ ∈ Θ} sup_{γ,γ′ ∈ G*(x)} max_{λ,λ′ ∈ Λ} | (P_n h_ℓb(·, θ, γ, λ) − P_n h_ℓb(·, θ′, γ′, λ′)) − (P h_ℓb(·, θ, γ, λ) − P h_ℓb(·, θ′, γ′, λ′)) |.

Thus on E_n:

E_n(γ) ≤ E*(γ) + T(σ) + 7ε = E*(γ) + (T(σ)/σ)σ + 7ε ≤ E*(γ) + sup_{δ ≥ σ}(T(δ)/δ)σ + 7ε = E*(γ) + T♭(σ)σ + 7ε = E*(γ) + T♭(σ)E*(γ) + 7ε.

Now, since σ ≥ δ** > T♯(1 − 1/a) we have T♭(σ) ≤ T♭(δ**) ≤ 1 − 1/a. Thus, on the event E_n, if γ is such that E*(γ) ≥ δ**, we have:

E_n(γ) ≤ (2 − 1/a)E*(γ) + 7ε.

Since ε > 0 was arbitrary (subject to δ** ≥ ε), and thus can be made arbitrarily small, we conclude that on the event E_n we have for any γ with E*(γ) ≥ δ**:

E_n(γ) ≤ (2 − 1/a)E*(γ).

Now consider the case when σ := E*(γ) ≤ δ**. By the same derivation as above we obtain:

E_n(γ) ≤ E*(γ) + sup_{θ,θ′ ∈ Θ} sup_{γ,γ′ ∈ G*(σ)} max_{λ,λ′ ∈ Λ} | (P_n h_ℓb(·, θ, γ, λ) − P_n h_ℓb(·, θ′, γ′, λ′)) − (P h_ℓb(·, θ, γ, λ) − P h_ℓb(·, θ′, γ′, λ′)) | + 7ε.

By monotonicity, we have:

sup_{θ,θ′ ∈ Θ} sup_{γ,γ′ ∈ G*(σ)} max_{λ,λ′ ∈ Λ} | (P_n h_ℓb(·, θ, γ, λ) − P_n h_ℓb(·, θ′, γ′, λ′)) − (P h_ℓb(·, θ, γ, λ) − P h_ℓb(·, θ′, γ′, λ′)) |
  ≤ sup_{θ,θ′ ∈ Θ} sup_{γ,γ′ ∈ G*(δ**)} max_{λ,λ′ ∈ Λ} | (P_n h_ℓb(·, θ, γ, λ) − P_n h_ℓb(·, θ′, γ′, λ′)) − (P h_ℓb(·, θ, γ, λ) − P h_ℓb(·, θ′, γ′, λ′)) |.

Furthermore, on the event E_n we have:

sup_{θ,θ′ ∈ Θ} sup_{γ,γ′ ∈ G*(δ**)} max_{λ,λ′ ∈ Λ} | (P_n h_ℓb(·, θ, γ, λ) − P_n h_ℓb(·, θ′, γ′, λ′)) − (P h_ℓb(·, θ, γ, λ) − P h_ℓb(·, θ′, γ′, λ′)) | ≤ T(δ**).

Thus, on the event E_n:

E_n(γ) ≤ E*(γ) + T(δ**) + 7ε ≤ E*(γ) + sup_{δ ≥ δ**}(T(δ)/δ)δ** + 7ε = E*(γ) + T♭(δ**)δ** + 7ε ≤ E*(γ) + (1 − 1/a)δ** + 7ε ≤ δ** + (1 − 1/a)δ** + 7ε = (2 − 1/a)δ** + 7ε.

Since ε > 0 was arbitrary (subject to δ** ≥ ε), and thus can be made arbitrarily small, we conclude that on the event E_n we have for any γ:

E_n(γ) ≤ (2 − 1/a)(E*(γ) ∨ δ**).

We will use this result to argue that, on the event E_n, if δ ≥ δ** then E*(γ) ≤ δ ⟹ E_n(γ) ≤ (2 − 1/a)δ. There are two cases:

(i) E*(γ) ≤ δ** ≤ δ, which implies on the event E_n:

E_n(γ) ≤ (2 − 1/a)(E*(γ) ∨ δ**) = (2 − 1/a)δ** ≤ (2 − 1/a)δ.

(ii) δ** ≤ E*(γ) ≤ δ, which implies on the event E_n:

E_n(γ) ≤ (2 − 1/a)(E*(γ) ∨ δ**) = (2 − 1/a)E*(γ) ≤ (2 − 1/a)δ.

Thus we conclude that for any δ ≥ δ**, on E_n we have that E*(γ) ≤ δ ⟹ E_n(γ) ≤ (2 − 1/a)δ. Now recall that we have E*(γ) ≤ δ ⟺ γ ∈ G*(δ) and E_n(γ) ≤ (2 − 1/a)δ ⟺ γ ∈ G_n((2 − 1/a)δ). Thus, we conclude that for any δ ≥ δ**, on the event E_n:

G*(δ) ⊆ G_n((2 − 1/a)δ),

as desired.

Part 2:
We will prove that on the event E n we have E ∗ ( γ ) ≤ a ( E n ( γ ) ∨ δ ∗∗ ) for any γ ∈ Γ. If γ is suchthat E ∗ ( γ ) ≤ δ ∗∗ then this is trivially true (since a > γ with σ := E ∗ ( γ ) ≥ δ ∗∗ . Pickany ε > δ ∗∗ ≥ ε , which is possible since δ ∗∗ > T (cid:93) (1 − / a ) ≥
0. Then on the event E n we have: E ∗ ( γ ) := sup γ ∈ Γ inf θ ∈ Θ max λ ∈ Λ P h (cid:96)b ( · , θ, γ, λ ) − inf θ ∈ Θ max λ ∈ Λ P h (cid:96)b ( · , θ, γ, λ ) ≤ inf θ ∈ Θ max λ ∈ Λ P h (cid:96)b ( · , θ, γ ∗ , λ ) − inf θ ∈ Θ max λ ∈ Λ P h (cid:96)b ( · , θ, γ, λ ) + 3 ε = inf θ ∈ Θ max λ ∈ Λ P h (cid:96)b ( · , θ, γ ∗ , λ ) − inf θ ∈ Θ max λ ∈ Λ P h (cid:96)b ( · , θ, γ, λ )+ (cid:18) sup γ ∈ Γ inf θ ∈ Θ max λ ∈ Λ P n h (cid:96)b ( · , θ, γ, λ ) − inf θ ∈ Θ max λ ∈ Λ P n h (cid:96)b ( · , θ, γ, λ ) (cid:19) (cid:18) sup γ ∈ Γ inf θ ∈ Θ max λ ∈ Λ P n h (cid:96)b ( · , θ, γ, λ ) − inf θ ∈ Θ max λ ∈ Λ P n h (cid:96)b ( · , θ, γ, λ ) (cid:19) + 3 ε = E ∗ ( γ ) + (cid:18) inf θ ∈ Θ max λ ∈ Λ P h (cid:96)b ( · , θ, γ ∗ , λ ) − inf θ ∈ Θ max λ ∈ Λ P h (cid:96)b ( · , θ, γ, λ ) (cid:19) − (cid:18) sup γ ∈ Γ inf θ ∈ Θ max λ ∈ Λ P n h (cid:96)b ( · , θ, γ, λ ) − inf θ ∈ Θ max λ ∈ Λ P n h (cid:96)b ( · , θ, γ, λ ) (cid:19) + 3 ε. Now note: inf θ ∈ Θ max λ ∈ Λ P h (cid:96)b ( · , θ, γ ∗ , λ ) − inf θ ∈ Θ max λ ∈ Λ P h (cid:96)b ( · , θ, γ, λ ) ≤ inf θ ∈ Θ max λ ∈ Λ P h (cid:96)b ( · , θ, γ ∗ , λ ) − max λ ∈ Λ P h (cid:96)b ( · , θ ∗ ( γ ) , γ, λ ) + ε ≤ max λ ∈ Λ P h (cid:96)b ( · , ˆ θ ( γ ∗ ) , γ ∗ , λ ) − max λ ∈ Λ P h (cid:96)b ( · , θ ∗ ( γ ) , γ, λ ) + 2 ε ≤ max λ ∈ Λ P h (cid:96)b ( · , ˆ θ ( γ ∗ ) , γ ∗ , λ ) − P h (cid:96)b ( · , θ ∗ ( γ ) , γ, ˆ λ ( θ ∗ ( γ ) , γ )) + 2 ε ≤ P h (cid:96)b ( · , ˆ θ ( γ ∗ ) , γ ∗ , λ ∗ (ˆ θ ( γ ∗ ) , γ ∗ )) − P h (cid:96)b ( · , θ ∗ ( γ ) , γ, ˆ λ ( θ ∗ ( γ ) , γ )) + 2 ε. 
Similarly: inf θ ∈ Θ max λ ∈ Λ P n h (cid:96)b ( · , θ, γ, λ ) − sup γ ∈ Γ inf θ ∈ Θ max λ ∈ Λ P n h (cid:96)b ( · , θ, γ, λ ) ≤ inf θ ∈ Θ max λ ∈ Λ P n h (cid:96)b ( · , θ, γ, λ ) − inf θ ∈ Θ max λ ∈ Λ P n h (cid:96)b ( · , θ, γ ∗ , λ ) + 3 ε ≤ inf θ ∈ Θ max λ ∈ Λ P n h (cid:96)b ( · , θ, γ, λ ) − max λ ∈ Λ P n h (cid:96)b ( · , ˆ θ ( γ ∗ ) , γ ∗ , λ ) + 4 ε ≤ max λ ∈ Λ P n h (cid:96)b ( · , θ ∗ ( γ ) , γ, λ ) − max λ ∈ Λ P n h (cid:96)b ( · , ˆ θ ( γ ∗ ) , γ ∗ , λ ) + 5 ε ≤ P n h (cid:96)b ( · , θ ∗ ( γ ) , γ, ˆ λ ( θ ∗ ( γ ) , γ )) − P n h (cid:96)b ( · , ˆ θ ( γ ∗ ) , γ ∗ , λ ∗ (ˆ θ ( γ ∗ ) , γ ∗ )) + 5 ε. Thus we conclude: E ∗ ( γ ) ≤ E n ( γ ) + 10 ε + P h (cid:96)b ( · , ˆ θ ( γ ∗ ) , γ ∗ , λ ∗ (ˆ θ ( γ ∗ ) , γ ∗ )) − P h (cid:96)b ( · , θ ∗ ( γ ) , γ, ˆ λ ( θ ∗ ( γ ) , γ )) − (cid:16) P n h (cid:96)b ( · , ˆ θ ( γ ∗ ) , γ ∗ , λ ∗ (ˆ θ ( γ ∗ ) , γ ∗ )) − P n h (cid:96)b ( · , θ ∗ ( γ ) , γ, ˆ λ ( θ ∗ ( γ ) , γ )) (cid:17) . However, γ ∈ G ∗ ( σ ) by assumption, and E ∗ ( γ ∗ ) ≤ ε ≤ E ∗ ( γ ) = σ implies γ ∗ ∈ G ∗ ( σ ). Thus, the right sideof the previous display can be bounded above: P h (cid:96)b ( · , ˆ θ ( γ ∗ ) , γ ∗ , λ ∗ (ˆ θ ( γ ∗ ) , γ ∗ )) − P h (cid:96)b ( · , θ ∗ ( γ ) , γ, ˆ λ ( θ ∗ ( γ ) , γ )) − (cid:16) P n h (cid:96)b ( · , ˆ θ ( γ ∗ ) , γ ∗ , λ ∗ (ˆ θ ( γ ∗ ) , γ ∗ )) − P n h (cid:96)b ( · , θ ∗ ( γ ) , γ, ˆ λ ( θ ∗ ( γ ) , γ )) (cid:17) ≤ sup θ,θ (cid:48) ∈ Θ sup γ,γ (cid:48) ∈ G ∗ ( σ ) max λ,λ (cid:48) ∈ Λ | ( P n h (cid:96)b ( · , θ, γ, λ ) − P n h (cid:96)b ( · , θ (cid:48) , γ (cid:48) , λ (cid:48) )) − ( P h (cid:96)b ( · , θ, γ, λ ) − P h (cid:96)b ( · , θ (cid:48) , γ (cid:48) , λ (cid:48) )) | . 
Furthermore, for any $\sigma \ge \delta^{**}$, on the event $\mathcal{E}_n$ this final quantity is bounded above by $T(\sigma)$; this follows from the definition of $T(\sigma)$ and the monotonicity of the map:
\[ x \mapsto \sup_{\theta,\theta' \in \Theta}\ \sup_{\gamma,\gamma' \in G^*(x)}\ \max_{\lambda,\lambda' \in \Lambda} \left| \left(P_n h_{lb}(\cdot,\theta,\gamma,\lambda) - P_n h_{lb}(\cdot,\theta',\gamma',\lambda')\right) - \left(P h_{lb}(\cdot,\theta,\gamma,\lambda) - P h_{lb}(\cdot,\theta',\gamma',\lambda')\right) \right|. \]
Thus on $\mathcal{E}_n$:
\[ E^*(\gamma) \le E_n(\gamma) + T(\sigma) + 10\varepsilon = E_n(\gamma) + \frac{T(\sigma)}{\sigma}\,\sigma + 10\varepsilon \le E_n(\gamma) + \sup_{\delta \ge \sigma}\left(\frac{T(\delta)}{\delta}\right)\sigma + 10\varepsilon = E_n(\gamma) + T^{\flat}(\sigma)\,\sigma + 10\varepsilon = E_n(\gamma) + T^{\flat}(\sigma)\,E^*(\gamma) + 10\varepsilon. \]
Now, since $\sigma \ge \delta^{**} > T^{\sharp}(1 - 1/a)$ we have $T^{\flat}(\sigma) \le T^{\flat}(\delta^{**}) \le 1 - 1/a$. Thus, on the event $\mathcal{E}_n$, if $\gamma$ is such that $\sigma = E^*(\gamma) \ge \delta^{**}$, we have:
\[ E^*(\gamma) \le E_n(\gamma) + \left(1 - \frac{1}{a}\right) E^*(\gamma) + 10\varepsilon \implies E^*(\gamma) \le a\,E_n(\gamma) + 10a\varepsilon. \]
Since $\varepsilon > 0$ with $\delta^{**} \ge \varepsilon$ was arbitrary, and thus can be made arbitrarily small, we conclude that on the event $\mathcal{E}_n$ we have, for any $\gamma$:
\[ E^*(\gamma) \le a\,(E_n(\gamma) \vee \delta^{**}). \]
We will use this result to argue that, on the event $\mathcal{E}_n$, if $E_n(\gamma) \le \delta/a$ and $\delta/a \ge \delta^{**}$ then $E^*(\gamma) \le \delta$. There are two cases: (i) $E_n(\gamma) \le \delta^{**} \le \delta/a$, which implies on the event $\mathcal{E}_n$:
\[ E^*(\gamma) \le a\,(E_n(\gamma) \vee \delta^{**}) = a\,\delta^{**} \le \delta. \]
(ii) $\delta^{**} \le E_n(\gamma) \le \delta/a$, which implies on the event $\mathcal{E}_n$:
\[ E^*(\gamma) \le a\,(E_n(\gamma) \vee \delta^{**}) = a\,E_n(\gamma) \le \delta. \]
Thus we conclude that for any $\delta/a \ge \delta^{**}$, on $\mathcal{E}_n$ we have that $E_n(\gamma) \le \delta/a \implies E^*(\gamma) \le \delta$. Now recall that we have $E_n(\gamma) \le \delta/a \iff \gamma \in G_n(\delta/a)$ and $E^*(\gamma) \le \delta \iff \gamma \in G^*(\delta)$. Thus, we conclude that for any $\delta \ge a\,\delta^{**}$, on the event $\mathcal{E}_n$:
\[ G_n(\delta/a) \subseteq G^*(\delta), \]
as desired. This completes the proof. $\blacksquare$

B.2 Auxiliary Results and Proofs

B.2.1 On Issues of Measurability
The following discussion mirrors the discussion in Dudley (2010) Section 3.3 and Dudley (2014) Section 5.3. Let $\mathcal{X}$ be a Polish space, and let $\mathcal{B}(\mathcal{X})$ be the Borel $\sigma$-algebra on $\mathcal{X}$. Then $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$ is a measurable space. If $P$ is a probability law on $\mathcal{B}(\mathcal{X})$, then $(\mathcal{X}, \mathcal{B}(\mathcal{X}), P)$ is a probability space. Now for any $B \subset \mathcal{X}$, we can define the outer measure $P^*$ on $B$ as:
\[ P^*(B) := \inf\{ P(C) : B \subset C \text{ and } C \in \mathcal{B}(\mathcal{X}) \}. \]
By Theorem 3.3.1 in Dudley (2010), there always exists $C \in \mathcal{B}(\mathcal{X})$ such that $P^*(B) = P(C)$, and such a set $C$ is called a measurable cover of $B$. Now define the collection of null sets for $P$ as:
\[ Null(P) := \{ A \subset \mathcal{X} : P^*(A) = 0 \}. \]
Furthermore, let $\mathcal{B}^*_P(\mathcal{X})$ denote the smallest $\sigma$-algebra generated by $\mathcal{B}(\mathcal{X}) \cup Null(P)$. By Proposition 3.3.2 in Dudley (2010), we have:
\[ \mathcal{B}^*_P(\mathcal{X}) = \{ B \subset \mathcal{X} : B \Delta C \in Null(P) \text{ for some } C \in \mathcal{B}(\mathcal{X}) \}, \]
where $B \Delta C = (B \setminus C) \cup (C \setminus B)$. We can now extend the measure $P$ from $\mathcal{B}(\mathcal{X})$ to a measure $\overline{P}$ on $\mathcal{B}^*_P(\mathcal{X})$ as follows: if $B \Delta C \in Null(P)$ and $C \in \mathcal{B}(\mathcal{X})$, then set $\overline{P}(B) = P(C)$. Proposition 3.3.3 in Dudley (2010) verifies this is a valid extension; that is, $\overline{P}$ is a measure on $\mathcal{B}^*_P(\mathcal{X})$ and $\overline{P}$ agrees with $P$ for all sets in $\mathcal{B}(\mathcal{X})$. However, note that the collection $\mathcal{B}^*_P(\mathcal{X})$ clearly depends on the probability measure $P$. Indeed, if $Q$ is another measure on $\mathcal{B}(\mathcal{X})$, and $\mathcal{B}^*_Q(\mathcal{X})$ is defined in an analogous manner to $\mathcal{B}^*_P(\mathcal{X})$, then it is possible for the two collections $\mathcal{B}^*_P(\mathcal{X})$ and $\mathcal{B}^*_Q(\mathcal{X})$ to differ because the null sets of $P$ and $Q$ differ. On the other hand, clearly both $\mathcal{B}^*_P(\mathcal{X})$ and $\mathcal{B}^*_Q(\mathcal{X})$ must have many elements in common; for example, both collections must contain the Borel sets $\mathcal{B}(\mathcal{X})$. A set $B \in \mathcal{B}^*_P(\mathcal{X})$ is called measurable for the completion of $P$. If for every probability measure $P$ the set $B$ is measurable for the completion of $P$, then we call $B$ universally measurable.
We will denote the universally measurable sets as $\mathcal{B}^*(\mathcal{X})$; it is easily verified that $\mathcal{B}^*(\mathcal{X})$ is also a $\sigma$-algebra, since an arbitrary intersection of $\sigma$-algebras is a $\sigma$-algebra. By definition, for any two probability measures $P$ and $Q$, both $\mathcal{B}^*_P(\mathcal{X})$ and $\mathcal{B}^*_Q(\mathcal{X})$ contain the universally measurable sets. Also note that, in our example, clearly the Borel sets $\mathcal{B}(\mathcal{X})$ are universally measurable.

A subset $A \subset \mathcal{X}$ of a Polish space $\mathcal{X}$ (with the Borel $\sigma$-field) is called $\mathcal{B}(\mathcal{X})$-analytic if there exists a compact metric space $\mathcal{Y}$ such that $A$ is the projection onto $\mathcal{X}$ of some $B \in \mathcal{B}(\mathcal{X}) \otimes \mathcal{B}(\mathcal{Y})$. We note that this is one of many equivalent definitions of an analytic set; see Chapter 8 of Cohn (2013). Our definition is from Stinchcombe and White (1992). A function $f : A \to [-\infty, \infty]$ is called lower (or upper) semi-analytic if $A$ is an analytic set and $\{x \in A : f(x) < c\}$ (or $\{x \in A : f(x) \ge c\}$) is an analytic set for every $c \in \mathbb{R}$; that is, if the epigraph (or hypograph) of $f$ is an analytic set.
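Since these completion notions can feel abstract, a finite toy computation may help fix ideas. The sketch below is entirely our own example (the space, the $\sigma$-algebra, and the probabilities are invented for illustration): it computes the outer measure and a measurable cover by brute force over a small $\sigma$-algebra generated by a two-cell partition.

```python
# Toy illustration (not from the paper) of the outer measure
# P*(B) = inf{ P(C) : B subset of C, C measurable } and of a measurable cover,
# on X = {1,2,3,4} with the sigma-algebra generated by the partition {{1,2},{3,4}}.
X = {1, 2, 3, 4}
sigma_algebra = [frozenset(), frozenset({1, 2}), frozenset({3, 4}), frozenset(X)]
prob = {frozenset(): 0.0, frozenset({1, 2}): 0.3,
        frozenset({3, 4}): 0.7, frozenset(X): 1.0}

def outer_measure(B):
    # infimum of P(C) over measurable supersets C of B
    covers = [C for C in sigma_algebra if B <= C]
    return min(prob[C] for C in covers)

def measurable_cover(B):
    # a measurable superset attaining the outer measure
    covers = [C for C in sigma_algebra if B <= C]
    return min(covers, key=lambda C: prob[C])

print(outer_measure({1}))             # 0.3: the cheapest measurable cover is {1,2}
print(sorted(measurable_cover({1})))  # [1, 2]
```

The non-measurable set $\{1\}$ has outer measure $0.3$, attained by its measurable cover $\{1,2\}$, exactly as in Theorem 3.3.1 of Dudley (2010).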
Lemma B.1 (Infimum over Random Sets is Lower Semi-Analytic). Suppose that Assumptions 2.1, 2.2 and 2.3 hold. Then for any lower semi-analytic function $f : \mathcal{V} \times \Gamma \times \Theta \times \{0,1\}^J \to \mathbb{R}$, the function $f_{lb,1}(y,z,u,\theta,\gamma,\lambda)$ given by:
\[ f_{lb,1}(y,z,u,\theta,\gamma,\lambda) := \inf_{y^\star \in G^\star(y,z,u,\theta,\gamma)} f(v,\theta,\gamma,\lambda), \tag{B.30} \]
is lower semi-analytic; that is, $\{(y,z,u,\gamma,\theta,\lambda) : f_{lb,1}(y,z,u,\theta,\gamma,\lambda) < r\}$ is an analytic set for every $r \in \mathbb{R}$, and thus is universally measurable. In addition, the function $f_{lb,2}(y,z,\theta,\gamma,\lambda)$ given by:
\[ f_{lb,2}(y,z,\theta,\gamma,\lambda) := \inf_{u \in G^{-1}(y,z,\theta)} f_{lb,1}(y,z,u,\theta,\gamma,\lambda), \tag{B.31} \]
is also lower semi-analytic; that is, $\{(y,z,\theta,\gamma,\lambda) : f_{lb,2}(y,z,\theta,\gamma,\lambda) < r\}$ is an analytic set for every $r \in \mathbb{R}$, and thus is universally measurable.

Remark B.3.
Defining $f_{ub,1}(y,z,u,\theta,\gamma,\lambda)$ and $f_{ub,2}(y,z,\theta,\gamma,\lambda)$ as the analogous functions with the infimum replaced with the supremum, it is possible to show that $f_{ub,1}$ and $f_{ub,2}$ are upper semi-analytic.

Proof of Lemma B.1. Recall that under Assumption 2.3, the multifunction $G^\star(y,z,u,\theta,\gamma)$ is Effros measurable with respect to the product Borel $\sigma$-algebra $\mathcal{B}(\mathcal{Y}) \otimes \mathcal{B}(\mathcal{Z}) \otimes \mathcal{B}(\mathcal{U}) \otimes \mathcal{B}(\Theta) \otimes \mathcal{B}(\Gamma)$. By Molchanov (2017) Theorem 1.3.3 this implies that:
\[ \mathrm{Graph}(G^\star) \in \mathcal{B}(\mathcal{Y}) \otimes \mathcal{B}(\mathcal{Z}) \otimes \mathcal{B}(\mathcal{U}) \otimes \mathcal{B}(\Theta) \otimes \mathcal{B}(\Gamma) \otimes \mathcal{B}(\mathcal{Y}^\star). \]
Thus $\mathrm{Graph}(G^\star)$ is a Borel (and thus also an analytic) set. Now note that $G^\star(y,z,u,\theta,\gamma)$ can be rewritten as:
\[ G^\star(y,z,u,\theta,\gamma) = \{ y^\star \in \mathcal{Y}^\star : (y,z,u,\theta,\gamma,y^\star) \in \mathrm{Graph}(G^\star) \}. \]
The fact that $f_{lb,1} : \mathcal{V} \times \Gamma \times \Theta \times \{0,1\}^J \to \mathbb{R}$ is lower semi-analytic then follows directly from the selection theorem of Shreve and Bertsekas (1978), p. 968. Taking $f_{lb,1}(y,z,u,\theta,\gamma,\lambda)$ as lower semi-analytic, a nearly identical proof shows that $f_{lb,2}(y,z,\theta,\gamma,\lambda)$ is also lower semi-analytic. $\blacksquare$

Proposition B.1.
Suppose the assumptions of Theorem 3.1 hold. Then the maps $\gamma \mapsto I_{lb}[\varphi](\gamma),\ I_{ub}[\varphi](\gamma)$ are universally measurable.

Proof. We will focus on the map $\gamma \mapsto I_{lb}[\varphi](\gamma)$, as the proof for the upper envelope function is symmetric. By Theorem 3.1 we have:
\[ I_{lb}[\varphi](\gamma) = \inf_{\theta \in \Theta} \max_{\lambda \in \{0,1\}^J} \int h_{lb}(y,z,\theta,\gamma,\lambda)\, dP_{Y,Z}, \]
where:
\[ h_{lb}(y,z,\theta,\gamma,\lambda) := \inf_{u \in G^{-1}(y,z,\theta)}\ \inf_{y^\star \in G^\star(y,z,u,\theta,\gamma)} \left( \varphi(v) + \mu^* \sum_{j=1}^J \lambda_j m_j(y,z,u,\theta) \right). \]
Suppose that $h_{lb}(y,z,\theta,\gamma,\lambda)$ is lower semi-analytic (we will return to this in a moment). Then by Proposition 7.46 in Bertsekas and Shreve (1978), the map:
\[ (\theta,\gamma,\lambda) \mapsto \int h_{lb}(y,z,\theta,\gamma,\lambda)\, dP_{Y,Z}, \tag{B.32} \]
is lower semi-analytic. Furthermore, suppose that $g_1 : \mathbb{R} \to \mathbb{R}$ and $g_2 : \mathbb{R} \to \mathbb{R}$ are lower semi-analytic. The function $g(x) = g_1(x) \vee g_2(x)$ satisfies:
\[ g^{-1}((-\infty,r)) = g_1^{-1}((-\infty,r)) \cap g_2^{-1}((-\infty,r)). \]
Since analytic sets are closed under (countable) unions and intersections (Parthasarathy (2005) Theorem 3.1), we have that $g$ is lower semi-analytic whenever $g_1$ and $g_2$ are lower semi-analytic. From this we conclude that the function:
\[ (\theta,\gamma) \mapsto \max_{\lambda \in \{0,1\}^J} \int h_{lb}(y,z,\theta,\gamma,\lambda)\, dP_{Y,Z}, \]
is lower semi-analytic, being the pointwise maximum of at most $2^J$ lower semi-analytic functions of the form (B.32). Finally, by the selection theorem of Shreve and Bertsekas (1978), p. 968 (see also Bertsekas and Shreve (1978) Proposition 7.47, p. 179), the map:
\[ \gamma \mapsto \inf_{\theta \in \Theta} \max_{\lambda \in \{0,1\}^J} \int h_{lb}(y,z,\theta,\gamma,\lambda)\, dP_{Y,Z}, \]
is lower semi-analytic, and thus universally measurable. It thus remains only to show that $h_{lb}(y,z,\theta,\gamma,\lambda)$ is lower semi-analytic.
By Lemma B.1, $h_{lb}(y,z,\theta,\gamma,\lambda)$ will be lower semi-analytic if we can show the function:
\[ (v,\theta,\gamma,\lambda) \mapsto \varphi(v) + \mu^* \sum_{j=1}^J \lambda_j m_j(y,z,u,\theta), \tag{B.33} \]
is lower semi-analytic. Both $\varphi(v)$ and $\{m_j(y,z,u,\theta)\}_{j=1}^J$ are Borel measurable by assumption, and the composition of Borel measurable functions is Borel measurable, so we conclude that (B.33) is Borel measurable. The conclusion then follows from the fact that every Borel measurable function is lower semi-analytic. $\blacksquare$

A nearly identical argument shows that, for every fixed sequence $(\xi_1, \ldots, \xi_n) \in \{-1,1\}^n$, the Rademacher complexity:
\[ ((y_1,z_1), \ldots, (y_n,z_n)) \mapsto ||R_n||(\mathcal{H}_{lb}), \]
is universally measurable. This is stated as a corollary of the previous result for easy reference.

Corollary B.1.
Suppose the assumptions of Theorem 3.1 hold, and suppose that the sequence $(Y_1,Z_1), \ldots, (Y_n,Z_n)$ are the coordinate projections of the product probability space $((\mathcal{Y} \times \mathcal{Z})^n, (\mathcal{B}(\mathcal{Y}) \otimes \mathcal{B}(\mathcal{Z}))^{\otimes n}, P_{Y,Z}^{\otimes n})$. Then the map:
\[ ((Y_1,Z_1), \ldots, (Y_n,Z_n)) \mapsto ||R_n||(\mathcal{H}_{lb}), \]
is universally measurable; that is, it is measurable for the completion of $P_{Y,Z}^{\otimes n}$ for any $P_{Y,Z} \in \mathcal{P}_{Y,Z}$.

B.2.2 Respect for Weak Dominance of the Preference Relation in Definition 2.3

Lemma B.2.
Let $(\Omega, \mathcal{A})$ be a measurable space, and let $X_1, X_2 : \Omega \times \mathcal{T} \to \mathbb{R}$ be two stochastic processes such that $X_1(\cdot,t)$ and $X_2(\cdot,t)$ are measurable for each $t$, and $\omega \mapsto \inf_{t \in \mathcal{T}} X_1(\omega,t),\ \inf_{t \in \mathcal{T}} X_2(\omega,t)$ are universally measurable; that is, measurable with respect to the completion of any probability measure on $(\Omega, \mathcal{A})$. Furthermore, suppose that for any probability measure on $(\Omega, \mathcal{A})$ we have $X_1(\omega,t) \le X_2(\omega,t)$ a.s. for every $t \in \mathcal{T}$, and let $c_0 : \mathcal{P} \to \mathbb{R}_{++}$ be any value depending only on $P$, where $\mathcal{P}$ is the set of all probability measures on $(\Omega, \mathcal{A})$. Finally, let $c_1, c_2 : (0,1] \times \mathcal{P} \to \mathbb{R}_{++}$ be the smallest values satisfying:
\[ P\left( \inf_{t \in \mathcal{T}} X_1(\omega,t) + c_1(\kappa,P) \ge c_0(P) \right) \ge \kappa, \qquad P\left( \inf_{t \in \mathcal{T}} X_2(\omega,t) + c_2(\kappa,P) \ge c_0(P) \right) \ge \kappa, \]
for each $\kappa \in (0,1]$. Then for every $P \in \mathcal{P}$ we have $c_2(\kappa,P) \le c_1(\kappa,P)$ for every $\kappa \in (0,1]$.

Proof. Fix any probability measure $P \in \mathcal{P}$. Then by assumption:
\[ X_1(\omega,t) \le X_2(\omega,t) \text{ a.s. } \forall t \in \mathcal{T}. \]
This implies:
\[ \inf_{t \in \mathcal{T}} X_1(\omega,t) \le X_2(\omega,t) \text{ a.s. } \forall t \in \mathcal{T}, \]
which in turn implies:
\[ \inf_{t \in \mathcal{T}} X_1(\omega,t) \le \inf_{t \in \mathcal{T}} X_2(\omega,t) \text{ a.s.,} \]
and thus:
\[ \inf_{t \in \mathcal{T}} X_1(\omega,t) - c_0(P) \le \inf_{t \in \mathcal{T}} X_2(\omega,t) - c_0(P) \text{ a.s.} \]
Let $N$ denote the null set for which this relation is not true (this set may depend on $P \in \mathcal{P}$). Then we have for every $x \in \mathbb{R}$:
\[ \left\{ \omega : \inf_{t \in \mathcal{T}} X_1(\omega,t) - c_0(P) > x \right\} \cap N^c \subseteq \left\{ \omega : \inf_{t \in \mathcal{T}} X_2(\omega,t) - c_0(P) > x \right\} \cap N^c. \]
By assumption, these events belong to the universal $\sigma$-algebra generated by $\mathcal{A}$, and so are measurable with respect to the completion of any $P \in \mathcal{P}$. This implies that for every $x \in \mathbb{R}$:
\[ P\left( \omega : \inf_{t \in \mathcal{T}} X_1(\omega,t) - c_0(P) > x \right) \le P\left( \omega : \inf_{t \in \mathcal{T}} X_2(\omega,t) - c_0(P) > x \right). \]
Now taking any $\kappa \in (0,1]$ and setting $x = -c_1(\kappa,P)$ we have:
\[ \kappa \le P\left( \omega : \inf_{t \in \mathcal{T}} X_1(\omega,t) + c_1(\kappa,P) \ge c_0(P) \right) \le P\left( \omega : \inf_{t \in \mathcal{T}} X_2(\omega,t) + c_1(\kappa,P) \ge c_0(P) \right). \]
By definition, this implies $c_2(\kappa,P)$ can be no larger than $c_1(\kappa,P)$; that is, $c_2(\kappa,P) \le c_1(\kappa,P)$. Since $\kappa \in (0,1]$ was arbitrary, we conclude that $c_2(\kappa,P) \le c_1(\kappa,P)$ for every $\kappa \in (0,1]$. Since $P \in \mathcal{P}$ was also arbitrary, we conclude that for every $P \in \mathcal{P}$ we have $c_2(\kappa,P) \le c_1(\kappa,P)$ for every $\kappa \in (0,1]$. $\blacksquare$

B.2.3 Results on Interchanging Integrals and Supremum/Infimum

Lemma B.3 (Interchange of Integral and Supremum/Infimum). Let $(\mathcal{V}, \mathcal{B}(\mathcal{V}))$ and $(\mathcal{V}', \mathcal{B}(\mathcal{V}'))$ be measurable spaces with $\mathcal{V}$ and $\mathcal{V}'$ as Polish spaces. Let $V \in \mathcal{V}$ be any random variable defined on the probability space $(\Omega, \mathcal{A}, P)$ with (marginal) distribution $P_V = P \circ V^{-1}$. Furthermore, let $G : \mathcal{V} \to \mathcal{V}'$ be an almost surely non-empty Effros-measurable multifunction. Then for any bounded and measurable function $\varphi : \mathcal{V} \times \mathcal{V}' \to \mathbb{R}$, we have:
\[ \int \sup_{v' \in G(v)} \varphi(v,v')\, dP_V = \sup_{V' \in Sel(G)} \int \varphi(v, V'(v))\, dP_V, \tag{B.34} \]
\[ \int \inf_{v' \in G(v)} \varphi(v,v')\, dP_V = \inf_{V' \in Sel(G)} \int \varphi(v, V'(v))\, dP_V. \tag{B.35} \]
In particular, if:
\[ \mathcal{P}_{V'|V} := \{ P_{V'|V} : V' \sim P_{V'|V},\ V' : \mathcal{V} \to \mathcal{V}' \text{ is measurable and } P_{V'|V}(V' \in G(V) \mid V = v) = 1 \text{ a.s.} \}, \]
then:
\[ \int \sup_{v' \in G(v)} \varphi(v,v')\, dP_V = \sup_{P_{V'|V} \in \mathcal{P}_{V'|V}} \int \varphi(v,v')\, d(P_{V'|V} \times P_V), \tag{B.36} \]
\[ \int \inf_{v' \in G(v)} \varphi(v,v')\, dP_V = \inf_{P_{V'|V} \in \mathcal{P}_{V'|V}} \int \varphi(v,v')\, d(P_{V'|V} \times P_V). \tag{B.37} \]

Proof of Lemma B.3.
Since $G$ is Effros measurable, by Theorem 1.3.3 in Molchanov (2017) we have that $\mathrm{gr}(G) \in \mathcal{B}(\mathcal{V}) \otimes \mathcal{B}(\mathcal{V}')$, and thus $\mathrm{gr}(G)$ is trivially an analytic set. Now define:
\[ \mathrm{gr}_v(G) := \{ v' \in \mathcal{V}' : (v,v') \in \mathrm{gr}(G) \}. \]
Now let:
\[ \varphi^*(v) := \sup_{v' \in G(v)} \varphi(v,v') = \sup_{v' \in \mathrm{gr}_v(G)} \varphi(v,v'), \]
and:
\[ M := \{ v \in \Pi_{\mathcal{V}}(\mathrm{gr}(G)) : \exists\, v' \in \mathrm{gr}_v(G) \text{ s.t. } \varphi(v,v') = \varphi^*(v) \}, \]
where $\Pi_{\mathcal{V}} : \mathcal{V} \times \mathcal{V}' \to \mathcal{V}$ is the projection operator. Fix any $\varepsilon > 0$. By the Exact Selection Theorem (Shreve and Bertsekas (1979), p. 16) there exists a universally measurable function $\tilde{v}' : \mathcal{V} \to \mathcal{V}'$ such that $(v, \tilde{v}'(v)) \in \mathrm{gr}(G)$ for every $v \in \Pi_{\mathcal{V}}(\mathrm{gr}(G))$ and:
\[ \varphi(v, \tilde{v}'(v)) = \varphi^*(v) \text{ if } v \in M, \qquad \varphi(v, \tilde{v}'(v)) \ge \varphi^*(v) - \varepsilon \text{ if } v \notin M. \]
This allows us to write:
\[ \int \sup_{v' \in G(v)} \varphi(v,v')\, dP_V \le \int \varphi(v, \tilde{v}'(v))\, dP_V + \varepsilon. \]
Since $\tilde{v}'$ is a (universally) measurable selection, clearly we have:
\[ \int \varphi(v, \tilde{v}'(v))\, dP_V \le \sup_{V' \in Sel_{u.m.}(G)} \int \varphi(v, V'(v))\, dP_V. \]
It suffices to show:
\[ \sup_{V' \in Sel_{u.m.}(G)} \int \varphi(v, V'(v))\, dP_V \le \int \sup_{v' \in G(v)} \varphi(v,v')\, dP_V. \]
For any $\varepsilon > 0$, let $V'_\varepsilon \in Sel_{u.m.}(G)$ satisfy:
\[ \sup_{V' \in Sel_{u.m.}(G)} \int \varphi(v, V'(v))\, dP_V \le \int \varphi(v, V'_\varepsilon(v))\, dP_V + \varepsilon. \]
Furthermore, let $N := \{ v : V'_\varepsilon(v) \notin G(v) \}$. Then by definition of $Sel_{u.m.}(G)$ we have $P(N) = 0$. Thus:
\[ \int \varphi(v, V'_\varepsilon(v))\, dP_V = \int_{N^c} \varphi(v, V'_\varepsilon(v))\, dP_V \le \int_{N^c} \sup_{v' \in G(v)} \varphi(v,v')\, dP_V \le \int \sup_{v' \in G(v)} \varphi(v,v')\, dP_V. \]
Combining everything we have:
\[ \int \sup_{v' \in G(v)} \varphi(v,v')\, dP_V \le \sup_{V' \in Sel_{u.m.}(G)} \int \varphi(v, V'(v))\, dP_V + \varepsilon \le \int \sup_{v' \in G(v)} \varphi(v,v')\, dP_V + 2\varepsilon. \]
Since $\varepsilon > 0$ was arbitrary, we conclude:
\[ \int \sup_{v' \in G(v)} \varphi(v,v')\, dP_V = \sup_{V' \in Sel_{u.m.}(G)} \int \varphi(v, V'(v))\, dP_V. \]
Since each $V' \in Sel_{u.m.}(G)$ is universally measurable, each $V'$ can be associated with a $\mathcal{B}(\mathcal{V})$-measurable random variable $\tilde{V}'$ such that $V' = \tilde{V}'$ a.s. Thus we can conclude:
\[ \int \sup_{v' \in G(v)} \varphi(v,v')\, dP_V = \sup_{V' \in Sel(G)} \int \varphi(v, V'(v))\, dP_V. \]
To show the final claim, note that for any measurable $V' : \mathcal{V} \to \mathcal{V}'$ we have:
\[ P_{V'|V}(V' = v' \mid V = v) = \mathbb{1}\{ V'(v) = v' \}; \]
i.e. the conditional distribution of $V'$ is degenerate. Thus for any $V' \in Sel(G)$:
\[ \int \int \varphi(v,v')\, d(P_{V'|V} \times P_V) = \int \varphi(v,v')\, \mathbb{1}\{ V'(v) = v' \}\, dP_V = \int \varphi(v, V'(v))\, dP_V. \]
By definition of $\mathcal{P}_{V'|V}$, we conclude that:
\[ \sup_{V' \in Sel(G)} \int \varphi(v, V'(v))\, dP_V = \sup_{P_{V'|V} \in \mathcal{P}_{V'|V}} \int \varphi(v,v')\, d(P_{V'|V} \times P_V). \]
$\blacksquare$

B.2.4 Results on Error Bounds
In the next Lemma we focus on the lower envelope function, although clearly an analogous result is true for the upper envelope function. For notational simplicity, denote:
\[ \varphi^* := \inf_{\theta \in \Theta^*}\ \inf_{P_{U|Y,Z} \in \mathcal{P}_{U|Y,Z}(\theta)}\ \inf_{P_{Y^\star_\gamma|Y,Z,U} \in \mathcal{P}_{Y^\star_\gamma|Y,Z,U}(\theta,\gamma)} \int \varphi(v)\, dP_{V_\gamma}. \tag{B.38} \]
We now have the following result:

Lemma B.4 (Equality Between Primal and Penalized Problems). Suppose the Assumptions of Theorem 3.1 hold. Then there exist functions $\lambda^{lb}_j : \Theta \times \mathcal{P}_{Y,Z} \to \{0,1\}$, $j = 1, \ldots, J$, depending only on $\theta$ and the distribution $P_{Y,Z}$, such that:
\[ \varphi^* = \inf_{\theta \in \Theta}\ \inf_{P_{U|Y,Z} \in \mathcal{P}_{U|Y,Z}(\theta)} \left( \inf_{P_{Y^\star_\gamma|Y,Z,U} \in \mathcal{P}_{Y^\star_\gamma|Y,Z,U}(\theta,\gamma)} \int \varphi(v)\, dP_{V_\gamma} + \mu^* \sum_{j=1}^J \lambda^{lb}_j(\theta, P_{Y,Z})\, \mathbb{E}_{P_{Y,Z,U}}[m_j(y,z,u,\theta)] \right). \]

Remark B.4. Recall that $\mathcal{P}_{Y,Z}$ is the set of all Borel probability measures on $\mathcal{Y} \times \mathcal{Z}$.

Proof of Lemma B.4.
First, note that for any functions λ (cid:96)bj : Θ × P Y,Z → { , } , j = 1 , . . . , J , we have: ϕ ∗ := inf θ ∈ Θ ∗ inf P U | Y,Z ∈P U | Y,Z ( θ ) inf P Y (cid:63)γ | Y,Z,U ∈P Y (cid:63)γ | Y,Z,U ( θ,γ ) (cid:90) ϕ ( v ) dP V γ = inf θ ∈ Θ ∗ inf P U | Y,Z ∈P U | Y,Z ( θ ) inf P Y (cid:63)γ | Y,Z,U ∈P Y (cid:63)γ | Y,Z,U ( θ,γ ) sup λ ∈ R J + (cid:32) (cid:90) ϕ ( v ) dP V γ + µ ∗ J (cid:88) j =1 λ j E P Y,Z,U [ m j ( y, z, u, θ )] (cid:33) ≥ inf θ ∈ Θ ∗ inf P U | Y,Z ∈P U | Y,Z ( θ ) inf P Y (cid:63)γ | Y,Z,U ∈P Y (cid:63)γ | Y,Z,U ( θ,γ ) (cid:32) (cid:90) ϕ ( v ) dP V γ + µ ∗ J (cid:88) j =1 λ (cid:96)bj ( θ, P Y,Z ) E P Y,Z,U [ m j ( y, z, u, θ )] (cid:33) ≥ inf θ ∈ Θ inf P U | Y,Z ∈P U | Y,Z ( θ ) (cid:32) inf P Y (cid:63)γ | Y,Z,U ∈P Y (cid:63)γ | Y,Z,U ( θ,γ ) (cid:90) ϕ ( v ) dP V γ + µ ∗ J (cid:88) j =1 λ (cid:96)bj ( θ, P Y,Z ) E P Y,Z,U [ m j ( y, z, u, θ )] (cid:33) .
81t thus suffices to show that there exists functions λ (cid:96)bj : Θ × P Y,Z → { , } for j = 1 , . . . , J satisfying thereverse inequality. This is done constructively. In particular, define: J ∗ ( θ, P Y,Z ) := (cid:26) j ∈ { , . . . , J } : inf P U | Y,Z ∈P U | Y,Z ( θ ) E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )]= inf P U | Y,Z ∈P U | Y,Z ( θ ) max j =1 ,...,J | E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] | + (cid:27) . That is, the set J ∗ ( θ, P Y,Z ) returns the indices of the weakly positive (i.e. weakly violated) moment functionsthat obtain the inner maximum in the problem:inf P U | Y,Z ∈P U | Y,Z ( θ ) max j =1 ,...,J | E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] | + . Now set: λ (cid:96)bj ( θ, P Y,Z ) := { j ∈ J ∗ ( θ, P Y,Z ) } . (B.39)To show why this choice works, begin by fixing any θ ∈ Θ ∗ δ . By Assumption 3.1(ii) we have: C d ( θ, Θ ∗ ) ≥ ϕ ∗ − inf P U | Y,Z ∈P U | Y,Z ( θ ) inf P Y (cid:63)γ | Y,Z,U ∈P Y (cid:63)γ | Y,Z,U ( θ,γ ) (cid:90) ϕ ( v ) dP V γ . (B.40)Furthermore, from Assumption 3.1(i), since θ ∈ Θ ∗ δ by assumption, we have: C d ( θ, Θ ∗ ) = C min { δ, d ( θ, Θ ∗ ) } (B.41) ≤ inf P U | Y,Z ∈P U | Y,Z ( θ ) max j =1 ,...,J | E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] | + = inf P U | Y,Z ∈P U | Y,Z ( θ ) λ (cid:96)bj ( θ, P Y,Z ) | E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] | + = inf P U | Y,Z ∈P U | Y,Z ( θ ) λ (cid:96)bj ( θ, P Y,Z ) E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] ≤ J (cid:88) j =1 inf P U | Y,Z ∈P U | Y,Z ( θ ) λ (cid:96)bj ( θ, P Y,Z ) E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] ≤ inf P U | Y,Z ∈P U | Y,Z ( θ ) J (cid:88) j =1 λ (cid:96)bj ( θ, P Y,Z ) E P Y,Z,U [ m j ( y, z, u, θ )] . (B.42)Now by construction we have µ ∗ ≥ C /C . Thus: C d ( θ, Θ ∗ ) ≤ µ ∗ C d ( θ, Θ ∗ ) . 
(B.43)Now use (B.43) to combine (B.40) and (B.42) and rearrange to obtain: ϕ ∗ ≤ inf P U | Y,Z ∈P U | Y,Z ( θ ) inf P Y (cid:63)γ | Y,Z,U ∈P Y (cid:63)γ | Y,Z,U ( θ,γ ) (cid:90) ϕ ( v ) dP V γ + µ ∗ inf P U | Y,Z ∈P U | Y,Z ( θ ) J (cid:88) j =1 λ (cid:96)bj ( θ, P Y,Z ) E P Y,Z,U [ m j ( y, z, u, θ )]82 inf P U | Y,Z ∈P U | Y,Z ( θ ) (cid:32) inf P Y (cid:63)γ | Y,Z,U ∈P Y (cid:63)γ | Y,Z,U ( θ,γ ) (cid:90) ϕ ( v ) dP V γ + µ ∗ J (cid:88) j =1 λ (cid:96)bj ( θ, P Y,Z ) E P Y,Z,U [ m j ( y, z, u, θ )] (cid:33) , which holds for all θ ∈ Θ ∗ δ . To complete the proof, consider any θ ∈ Θ \ Θ ∗ δ . Recall from the assumptions ofTheorem 3.1 that ϕ : V → [ ϕ (cid:96)b , ϕ ub ] ⊂ R . Then using Assumption 3.1 we have:inf P U | Y,Z ∈P U | Y,Z ( θ ) inf P Y (cid:63)γ | Y,Z,U ∈P Y (cid:63)γ | Y,Z,U ( θ,γ ) (cid:90) ϕ ( v ) dP V γ + µ ∗ J (cid:88) j =1 λ (cid:96)bj ( θ, P Y,Z ) E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] ≥ ϕ (cid:96)b + µ ∗ inf P U | Y,Z ∈P U | Y,Z ( θ ) (cid:90) J (cid:88) j =1 λ (cid:96)bj ( θ, P Y,Z ) m j ( y, z, u, θ ) d ( P U | Y,Z × P Y,Z ) ≥ ϕ (cid:96)b + µ ∗ inf P U | Y,Z ∈P U | Y,Z ( θ ) max j =1 ,...,J | E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] | + ≥ ϕ (cid:96)b + µ ∗ C min { δ, d ( θ, Θ ∗ ) } = ϕ (cid:96)b + µ ∗ C δ ≥ ϕ ∗ , where the last line follows since µ ∗ ≥ ( ϕ ub − ϕ (cid:96)b ) / ( C δ ) ≥ ( ϕ ∗ − ϕ (cid:96)b ) / ( C δ ). Since we have shown theinequality holds for every θ ∈ Θ, we have:inf θ ∈ Θ inf P U | Y,Z ∈P U | Y,Z ( θ ) (cid:32) inf P Y (cid:63)γ | Y,Z,U ∈P Y (cid:63)γ | Y,Z,U ( θ,γ ) (cid:90) ϕ ( v ) dP V γ + µ ∗ J (cid:88) j =1 λ (cid:96)bj ( θ, P Y,Z ) E P Y,Z,U [ m j ( y, z, u, θ )] (cid:33) ≥ ϕ ∗ . This completes the proof. (cid:4)
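The exact-penalty structure behind Lemma B.4, in which a sufficiently large multiplier $\mu^*$ makes the penalized and constrained problems agree, can be illustrated on a one-dimensional toy problem. The sketch below is ours; the objective, the constraint, and the multiplier value are illustrative assumptions, not the paper's model.

```python
# Exact-penalty toy example: minimize f(theta) subject to m(theta) <= 0
# versus minimizing f(theta) + mu * max(m(theta), 0) for a large mu.
def f(theta):          # toy objective (not from the paper)
    return (theta - 2.0) ** 2

def m(theta):          # toy "moment" constraint m(theta) <= 0, i.e. theta <= 1
    return theta - 1.0

grid = [i / 1000.0 for i in range(-2000, 4001)]

constrained = min(f(t) for t in grid if m(t) <= 0)             # attained at t = 1
mu = 10.0                                                      # large multiplier
penalized = min(f(t) + mu * max(m(t), 0.0) for t in grid)

print(round(constrained, 6))                 # 1.0
print(abs(penalized - constrained) < 1e-6)   # penalized value matches
```

Once $\mu$ exceeds the relevant Lipschitz constant of the objective near the boundary (here $\mu = 10$ is ample), violating the constraint is never profitable, so the penalized minimum coincides with the constrained one, mirroring the role of $\mu^*$ in the lemma.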
Lemma B.5.
Suppose the assumptions of Theorem 3.1 hold, and define: h (cid:96)b ( y, z, θ, γ ) := inf u ∈ G − ( y,z,θ ) inf y (cid:63) ∈ G (cid:63) ( y,z,u,θ,γ ) (cid:32) ϕ ( v ) + µ ∗ J (cid:88) j =1 λ (cid:96)bj ( θ, P Y,Z ) m j ( y, z, u, θ ) (cid:33) . where λ (cid:96)bj : Θ × P Y,Z → { , } , j = 1 , . . . , J , are as from Lemma B.4. Then we have: (cid:90) h (cid:96)b ( y, z, θ, γ ) dP Y,Z ≤ max λ j ∈{ , } (cid:90) inf u ∈ G − ( y,z,θ ) inf y (cid:63) ∈ G (cid:63) ( y,z,u,θ,γ ) (cid:32) ϕ ( v ) + µ ∗ J (cid:88) j =1 λ j m j ( y, z, u, θ ) (cid:33) dP Y,Z , (B.44) with equality holding in (B.44) if θ ∈ Θ ∗ .Proof of Lemma B.5. We have: (cid:90) h (cid:96)b ( y, z, θ, γ ) dP Y,Z := (cid:90) inf u ∈ G − ( y,z,θ ) inf y (cid:63) ∈ G (cid:63) ( y,z,u,θ,γ ) (cid:32) ϕ ( v ) + µ ∗ J (cid:88) j =1 λ (cid:96)bj ( θ, P Y,Z ) m j ( y, z, u, θ ) (cid:33) dP Y,Z
83 max λ j ∈{ , } s.t λ j = λ (cid:96)bj ( θ,P Y,Z ) (cid:90) inf u ∈ G − ( y,z,θ ) inf y (cid:63) ∈ G (cid:63) ( y,z,u,θ,γ ) (cid:32) ϕ ( v ) + µ ∗ J (cid:88) j =1 λ j m j ( y, z, u, θ ) (cid:33) dP Y,Z ≤ max λ j ∈{ , } (cid:90) inf u ∈ G − ( y,z,θ ) inf y (cid:63) ∈ G (cid:63) ( y,z,u,θ,γ ) (cid:32) ϕ ( v ) + µ ∗ J (cid:88) j =1 λ j m j ( y, z, u, θ ) (cid:33) dP Y,Z . The first line holds by definition, the second line holds since the λ j ( θ, P Y,Z ) depends only on θ , and thirdline holds because the unconstrained maximum is always weakly larger than the constrained maximum.It remains only to show that the last inequality holds with equality whenever θ ∈ Θ ∗ . To do so it sufficesto show that for any θ ∈ Θ ∗ : (cid:90) h (cid:96)b ( y, z, θ, γ ) dP Y,Z ≥ max λ j ∈{ , } (cid:90) inf u ∈ G − ( y,z,θ ) inf y (cid:63) ∈ G (cid:63) ( y,z,u,θ,γ ) (cid:32) ϕ ( v ) + µ ∗ J (cid:88) j =1 λ j m j ( y, z, u, θ ) (cid:33) dP Y,Z . (B.45)To this end, note that by Lemma B.3 we have: (cid:90) h (cid:96)b ( y, z, θ, γ ) dP Y,Z = (cid:90) inf u ∈ G − ( y,z,θ ) inf y (cid:63) ∈ G (cid:63) ( y,z,u,θ,γ ) (cid:32) ϕ ( v ) + µ ∗ J (cid:88) j =1 λ (cid:96)bj ( θ, P Y,Z ) m j ( y, z, u, θ ) (cid:33) dP Y,Z = inf P U | Y,Z ∈P U | Y,Z ( θ ) (cid:90) inf y (cid:63) ∈ G (cid:63) ( y,z,u,θ,γ ) (cid:32) ϕ ( v ) + µ ∗ J (cid:88) j =1 λ (cid:96)bj ( θ, P Y,Z ) m j ( y, z, u, θ ) (cid:33) d ( P U | Y,Z × P Y,Z ) . (B.46)Now since the infimum of the sum is larger than the sum of the infimums, we have:inf P U | Y,Z ∈P U | Y,Z ( θ ) (cid:90) inf y (cid:63) ∈ G (cid:63) ( y,z,u,θ,γ ) (cid:32) ϕ ( v ) + µ ∗ J (cid:88) j =1 λ (cid:96)bj ( θ, P Y,Z ) m j ( y, z, u, θ ) (cid:33) d ( P U | Y,Z × P Y,Z ) ≥ inf P U | Y,Z ∈P U | Y,Z ( θ ) (cid:90) inf y (cid:63) ∈ G (cid:63) ( y,z,u,θ,γ ) ϕ ( v ) d ( P U | Y,Z × P Y,Z )+ inf P U | Y,Z ∈P U | Y,Z ( θ ) µ ∗ J (cid:88) j =1 λ (cid:96)bj ( θ, P Y,Z ) E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] . 
(B.47)Now recall that λ (cid:96)bj ( θ, P Y,Z ) = 1 if and only if:inf P U | Y,Z ∈P U | Y,Z ( θ ) E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] = inf P U | Y,Z ∈P U | Y,Z ( θ ) max j =1 ,...,J | E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] | + . From here we conclude: inf P U | Y,Z ∈P U | Y,Z ( θ ) max j =1 ,...,J | E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] | + = inf P U | Y,Z ∈P U | Y,Z ( θ ) max j =1 ,...,J λ (cid:96)bj ( θ, P Y,Z ) E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] ≤ inf P U | Y,Z ∈P U | Y,Z ( θ ) J (cid:88) j =1 λ (cid:96)bj ( θ, P Y,Z ) E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )]84hus: inf P U | Y,Z ∈P U | Y,Z ( θ ) (cid:90) inf y (cid:63) ∈ G (cid:63) ( y,z,u,θ,γ ) ϕ ( v ) dP Y,Z,U + inf P U | Y,Z ∈P U | Y,Z ( θ ) µ ∗ J (cid:88) j =1 λ (cid:96)bj ( θ, P Y,Z ) E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] ≥ inf P U | Y,Z ∈P U | Y,Z ( θ ) (cid:90) inf y (cid:63) ∈ G (cid:63) ( y,z,u,θ,γ ) ϕ ( v ) dP Y,Z,U + µ ∗ inf P U | Y,Z ∈P U | Y,Z ( θ ) max j =1 ,...,J | E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] | + . (B.48)However, since θ ∈ Θ ∗ by assumption, we have:inf P U | Y,Z ∈P U | Y,Z ( θ ) max j =1 ,...,J | E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] | + = 0 . (B.49)Thus, combining (B.46), (B.47), (B.48) and (B.49) we can conclude: (cid:90) h (cid:96)b ( y, z, θ, γ ) dP Y,Z ≥ inf P U | Y,Z ∈P U | Y,Z ( θ ) (cid:90) inf y (cid:63) ∈ G (cid:63) ( y,z,u,θ,γ ) ϕ ( v ) d ( P U | Y,Z × P Y,Z ) . (B.50)Now, applying Lemma B.3 again we have:inf P U | Y,Z ∈P U | Y,Z ( θ ) (cid:90) inf y (cid:63) ∈ G (cid:63) ( y,z,u,θ,γ ) ϕ ( v ) d ( P U | Y,Z × P Y,Z )= inf P U | Y,Z ∈P U | Y,Z ( θ ) inf P Y (cid:63)γ | Y,Z,U ∈P Y (cid:63)γ | Y,Z,U ( θ,γ ) (cid:90) ϕ ( v ) P V γ . 
(B.51)Now note for θ ∈ Θ ∗ :inf P U | Y,Z ∈P U | Y,Z ( θ ) inf P Y (cid:63)γ | Y,Z,U ∈P Y (cid:63)γ | Y,Z,U ( θ,γ ) (cid:90) ϕ ( v ) P V γ = inf P U | Y,Z ∈P U | Y,Z ( θ ) inf P Y (cid:63)γ | Y,Z,U ∈P Y (cid:63)γ | Y,Z,U ( θ,γ ) sup λ j ∈ R + (cid:32) (cid:90) ϕ ( v ) P V γ + µ ∗ J (cid:88) j =1 λ j E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] (cid:33) ≥ inf P U | Y,Z ∈P U | Y,Z ( θ ) inf P Y (cid:63)γ | Y,Z,U ∈P Y (cid:63)γ | Y,Z,U ( θ,γ ) max λ j ∈{ , } (cid:32) (cid:90) ϕ ( v ) P V γ + µ ∗ J (cid:88) j =1 λ j E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] (cid:33) . (B.52)Now by the minimax inequality:inf P U | Y,Z ∈P U | Y,Z ( θ ) inf P Y (cid:63)γ | Y,Z,U ∈P Y (cid:63)γ | Y,Z,U ( θ,γ ) max λ j ∈{ , } (cid:32) (cid:90) ϕ ( v ) P V γ + µ ∗ J (cid:88) j =1 λ j E P U | Y,Z × P Y,Z [ m j ( y, z, u, θ )] (cid:33) ≥ max λ j ∈{ , } inf P U | Y,Z ∈P U | Y,Z ( θ ) inf P Y (cid:63)γ | Y,Z,U ∈P Y (cid:63)γ | Y,Z,U ( θ,γ ) (cid:90) (cid:32) ϕ ( v ) + µ ∗ J (cid:88) j =1 λ j m j ( y, z, u, θ ) (cid:33) dP V γ . (B.53)Finally, by iterated application of Lemma B.3 we have:max λ j ∈{ , } inf P U | Y,Z ∈P U | Y,Z ( θ ) inf P Y (cid:63)γ | Y,Z,U ∈P Y (cid:63)γ | Y,Z,U ( θ,γ ) (cid:90) (cid:32) ϕ ( v ) + µ ∗ J (cid:88) j =1 λ j m j ( y, z, u, θ ) (cid:33) dP V γ max λ j ∈{ , } (cid:90) inf u ∈ G − ( y,z,θ ) inf y (cid:63) ∈ G (cid:63) ( y,z,u,θ,γ ) (cid:32) ϕ ( v ) + µ ∗ J (cid:88) j =1 λ j m j ( y, z, u, θ ) (cid:33) dP Y,Z . (B.54)Combining (B.50), (B.51), (B.52), (B.53), and (B.54) we have: (cid:90) h (cid:96)b ( y, z, θ, γ ) dP Y,Z ≥ max λ j ∈{ , } (cid:90) inf u ∈ G − ( y,z,θ ) inf y (cid:63) ∈ G (cid:63) ( y,z,u,θ,γ ) (cid:32) ϕ ( v ) + µ ∗ J (cid:88) j =1 λ j m j ( y, z, u, θ ) (cid:33) dP Y,Z , whenever θ ∈ Θ ∗ . This concludes the proof. (cid:4) B.2.5 Lemmas Supporting Theorem 4.1 on LearnabilityLemma B.6.
Suppose that $\mathcal{F} := \{ f_\alpha(\cdot, \theta) : \mathcal{X} \to \mathbb{R} : \theta \in \Theta, \alpha \in \mathcal{A} \}$ is a totally bounded parametric class of measurable real-valued functions on the metric space $(\mathcal{X}, d)$, where $(\mathcal{A}, d_a)$ and $(\Theta, d_\theta)$ are also metric spaces. Furthermore let $\mathcal{G}$ be a class of real-valued functions with each element $g(\cdot, \theta) : \mathcal{X} \to \mathbb{R}$ defined by:
\[ g(x, \theta) := \inf_{\alpha \in C(x,\theta)} f_\alpha(x, \theta), \]
for some $f \in \mathcal{F}$, where $C(x,\theta)$ is a non-empty multifunction for each $(x,\theta)$ pair. Then for any probability measure $Q$ we have:
\[ N(\varepsilon, \mathcal{G}, ||\cdot||_{Q,2}) \le N(\varepsilon/2, \mathcal{F}, ||\cdot||_{Q,2}). \]
As a parametric class of functions (parameterized by $(\alpha, \theta)$), the $\varepsilon/2$-cover of $\mathcal{F}$ can be characterized by a collection of points $\{(\alpha_i, \theta_i)\}_{i=1}^n$, where $n = N(\varepsilon/2, \mathcal{F}, ||\cdot||_{Q,2})$. Denote such a collection by $N(\mathcal{F})$. We will show that for any $g \in \mathcal{G}$ there exists a pair $(\alpha', \theta') \in N(\mathcal{F})$ such that:
\[ |g(x,\theta) - f_{\alpha'}(x,\theta')| \le \varepsilon. \]
Since every $g \in \mathcal{G}$ can be expressed as:
\[ g(x,\theta) = \inf_{\alpha \in C(x,\theta)} f_\alpha(x,\theta), \]
it suffices to show there exists a pair $(\alpha', \theta') \in N(\mathcal{F})$ such that:
\[ \left| \inf_{\alpha \in C(x,\theta)} f_\alpha(x,\theta) - f_{\alpha'}(x,\theta') \right| \le \varepsilon. \]
Now let $\alpha^*$ be any value satisfying:
\[ \left| \inf_{\alpha \in C(x,\theta)} f_\alpha(x,\theta) - f_{\alpha^*}(x,\theta) \right| \le \varepsilon/2. \]
That is, $\alpha^*$ is an $\varepsilon/2$-minimizer. Now choose $(\alpha', \theta') \in N(\mathcal{F})$ such that $|f_{\alpha^*}(x,\theta) - f_{\alpha'}(x,\theta')| \le \varepsilon/2$ (which is possible since $N(\mathcal{F})$ is an $\varepsilon/2$-cover of $\mathcal{F}$). Then we have:
\[ |g(x,\theta) - f_{\alpha'}(x,\theta')| = \left| \inf_{\alpha \in C(x,\theta)} f_\alpha(x,\theta) - f_{\alpha'}(x,\theta') \right| \le \left| \inf_{\alpha \in C(x,\theta)} f_\alpha(x,\theta) - f_{\alpha^*}(x,\theta) \right| + |f_{\alpha^*}(x,\theta) - f_{\alpha'}(x,\theta')| \le \varepsilon/2 + \varepsilon/2 = \varepsilon. \]
This completes the proof. $\blacksquare$
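The covering-number transfer in Lemma B.6 rests on the fact that taking an infimum over $\alpha$ is 1-Lipschitz with respect to the supremum norm over $\alpha$, so a net for the parametric class carries over to the inf-class. The following sketch checks this inequality numerically; the function class, the grid of $\alpha$ values, and the perturbation are our own invented example.

```python
import random

# Numerical check (toy example) that |inf_a f(a,x) - inf_a g(a,x)|
# is bounded by sup_a |f(a,x) - g(a,x)|, the key step in Lemma B.6.
random.seed(1)
alphas = [a / 10.0 for a in range(11)]

def f(alpha, x):
    return (x - alpha) ** 2

def f_perturbed(alpha, x, delta):
    return (x - alpha) ** 2 + delta * alpha

for _ in range(1000):
    x = random.uniform(-1, 1)
    delta = random.uniform(-0.1, 0.1)
    gap = abs(min(f(a, x) for a in alphas) -
              min(f_perturbed(a, x, delta) for a in alphas))
    sup_dist = max(abs(f(a, x) - f_perturbed(a, x, delta)) for a in alphas)
    assert gap <= sup_dist + 1e-12

print("inf is 1-Lipschitz in the sup norm over alpha")
```

Because the infimum contracts sup-norm distances, covering the parametric class at a slightly finer scale automatically covers the inf-class, which is exactly how the proof trades $N(\varepsilon, \mathcal{G})$ for a covering number of $\mathcal{F}$.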
Lemma B.7.
Let F be a symmetric class of measurable real-valued functions on a Polish space X, and let ψ = (x_i)_{i=1}^n denote an arbitrary vector of n points from X. Then for any ε > 0:

E‖R_n‖(F) ≤ ε/√n + ( 2 Diam_{ψ,2}(F) √( log N(ε, F, ‖·‖_{ψ,2}) ) )/n.

Proof of Lemma B.7.
Note that:

n E‖R_n‖(F) = n E sup_{f ∈ F} | (1/n) Σ_{i=1}^n ξ_i f(x_i) | = E sup_{f ∈ F} | Σ_{i=1}^n ξ_i f(x_i) |.

Now recall that the Rademacher process Σ_{i=1}^n ξ_i f(x_i) is sub-Gaussian with respect to the euclidean distance between the vectors (f(x_1), ..., f(x_n)) and (f′(x_1), ..., f′(x_n)) for f, f′ ∈ F. We denote this euclidean distance by ‖f − f′‖_{ψ,2} to emphasize that the vector ψ = (x_i)_{i=1}^n is fixed. Fix a minimal ε-net F∗ ⊂ F in the ‖·‖_{ψ,2} norm. There exists at least one function f′ ∈ F∗ such that:

E sup_{f ∈ F} | Σ_{i=1}^n ξ_i f(x_i) | ≤ E sup_{f ∈ F} | Σ_{i=1}^n ξ_i f(x_i) − Σ_{i=1}^n ξ_i f′(x_i) | + ε√n.

(For example, we can always take f′ to be the element of F∗ closest to −f in the ‖·‖_{ψ,2} norm, where −f is an element of F by symmetry.)

Now for any f ∈ F, let f∗(f) ∈ F∗ be a function with ‖f − f∗(f)‖_{ψ,2} ≤ ε. Then:

| Σ_{i=1}^n ξ_i f(x_i) − Σ_{i=1}^n ξ_i f′(x_i) |
  = | Σ_{i=1}^n ξ_i f(x_i) − Σ_{i=1}^n ξ_i f∗(f)(x_i) + Σ_{i=1}^n ξ_i f∗(f)(x_i) − Σ_{i=1}^n ξ_i f′(x_i) |
  ≤ | Σ_{i=1}^n ξ_i f(x_i) − Σ_{i=1}^n ξ_i f∗(f)(x_i) | + | Σ_{i=1}^n ξ_i f∗(f)(x_i) − Σ_{i=1}^n ξ_i f′(x_i) |
  ≤ sup_{‖f−f∗‖_{ψ,2} ≤ ε} | Σ_{i=1}^n ξ_i f(x_i) − Σ_{i=1}^n ξ_i f∗(x_i) | + sup_{f∗,f′ ∈ F∗} | Σ_{i=1}^n ξ_i f∗(x_i) − Σ_{i=1}^n ξ_i f′(x_i) |
  ≤ sup_{‖f−f∗‖_{ψ,2} ≤ ε} Σ_{i=1}^n | f(x_i) − f∗(x_i) | + sup_{f∗,f′ ∈ F∗} | Σ_{i=1}^n ξ_i f∗(x_i) − Σ_{i=1}^n ξ_i f′(x_i) |
  ≤ sup_{‖f−f∗‖_{ψ,2} ≤ ε} √n ‖f − f∗‖_{ψ,2} + sup_{f∗,f′ ∈ F∗} | Σ_{i=1}^n ξ_i f∗(x_i) − Σ_{i=1}^n ξ_i f′(x_i) |
  ≤ √n ε + sup_{f∗,f′ ∈ F∗} | Σ_{i=1}^n ξ_i f∗(x_i) − Σ_{i=1}^n ξ_i f′(x_i) |.

Note we have used the inequality ‖f − f′‖_{ψ,1} ≤ √n ‖f − f′‖_{ψ,2}, where ‖f − f′‖_{ψ,1} denotes the L1 distance between f and f′ at the points ψ = (x_i)_{i=1}^n. Now for any value a > 0:

exp( a E max_{f∗,f′ ∈ F∗} | Σ_{i=1}^n ξ_i f∗(x_i) − Σ_{i=1}^n ξ_i f′(x_i) | )
  ≤ E exp( a max_{f∗,f′ ∈ F∗} | Σ_{i=1}^n ξ_i f∗(x_i) − Σ_{i=1}^n ξ_i f′(x_i) | )
  = E max_{f∗,f′ ∈ F∗} exp( a | Σ_{i=1}^n ξ_i f∗(x_i) − Σ_{i=1}^n ξ_i f′(x_i) | )
  ≤ Σ_{f∗,f′ ∈ F∗} E exp( a ( Σ_{i=1}^n ξ_i f∗(x_i) − Σ_{i=1}^n ξ_i f′(x_i) ) )
  ≤ Σ_{f∗,f′ ∈ F∗} exp( a² Diam²_{ψ,2}(F)/2 )
  ≤ N(ε, F, ‖·‖_{ψ,2})² exp( a² Diam²_{ψ,2}(F)/2 ),

where the absolute values can be dropped in the third step because the maximum runs over all ordered pairs (f∗, f′), and where the second-last inequality follows from the fact that the Rademacher process is sub-Gaussian with parameter Diam_{ψ,2}(F). Taking logs and dividing both sides by a > 0, we have:

E max_{f∗,f′ ∈ F∗} | Σ_{i=1}^n ξ_i f∗(x_i) − Σ_{i=1}^n ξ_i f′(x_i) | ≤ 2 log N(ε, F, ‖·‖_{ψ,2})/a + a Diam²_{ψ,2}(F)/2.

Minimizing the upper bound with respect to “a” yields:

E max_{f∗,f′ ∈ F∗} | Σ_{i=1}^n ξ_i f∗(x_i) − Σ_{i=1}^n ξ_i f′(x_i) | ≤ 2 Diam_{ψ,2}(F) √( log N(ε, F, ‖·‖_{ψ,2}) ).

We conclude that:

n E‖R_n‖(F) ≤ √n ε + 2 Diam_{ψ,2}(F) √( log N(ε, F, ‖·‖_{ψ,2}) ). □

Lemma B.8.
Let G and H be two classes of functions and let F := { g + h : g ∈ G, h ∈ H }. Then:

N(ε, F, ‖·‖) ≤ N(ε/2, G, ‖·‖) · N(ε/2, H, ‖·‖),

where ‖·‖ is any norm.

(Footnotes to the preceding proof: recall that a stochastic process (ω, t) ↦ X(ω, t) on a metric space (T, d) is sub-Gaussian with respect to the metric d if E exp(λ(X_t − X_s)) ≤ exp(λ² d(t, s)²/2); the minimizing value in the final step is a = 2( log N(ε, F, ‖·‖_{ψ,2}) / Diam²_{ψ,2}(F) )^{1/2}.)

Remark B.5. Note that a nearly identical proof of this result can be used to show that:

N(ε, F, ‖·‖) ≤ N(ε·a, G, ‖·‖) · N(ε·b, H, ‖·‖),

where a, b > 0 are any values satisfying a + b = 1.

Proof of Lemma B.8. Suppose that N(ε/2, G, ‖·‖) = n and N(ε/2, H, ‖·‖) = m. It suffices to show N(ε, F, ‖·‖) ≤ nm. Let N(G) denote the centres of the balls that obtain the n-cover of G and let N(H) denote the centres of the balls that obtain the m-cover of H. Enumerate the elements of N(G) as g_1, ..., g_n and the elements of N(H) as h_1, ..., h_m. Now define the following collections:

G_j := { g ∈ G : ‖g − g_j‖ ≤ ε/2 },  H_k := { h ∈ H : ‖h − h_k‖ ≤ ε/2 },

for j = 1, ..., n and k = 1, ..., m. Then {G_j} forms an ε/2-cover of G and {H_k} forms an ε/2-cover of H. Now for any g_j ∈ N(G) and h_k ∈ N(H) let f_jk = g_j + h_k, and define:

F_jk := { f : ‖f − f_jk‖ ≤ ε }.

We will now argue that {F_jk} is an ε-cover of F. Note that if we can establish this, the proof will be complete, since there are only nm sets F_jk. By construction each F_jk is a ‖·‖-ball of radius ε, so it only remains to check that {F_jk} covers F. To do so, fix any f ∈ F. Then by definition f = g + h for some g ∈ G and h ∈ H. Since {G_j} forms an ε/2-cover of G and {H_k} forms an ε/2-cover of H, we know there is some g_j ∈ N(G) and some h_k ∈ N(H) such that ‖g − g_j‖ ≤ ε/2 and ‖h − h_k‖ ≤ ε/2. But since f_jk = g_j + h_k we have that:

‖f − f_jk‖ = ‖(g + h) − (g_j + h_k)‖ ≤ ‖g − g_j‖ + ‖h − h_k‖ ≤ ε/2 + ε/2 = ε,

so that f ∈ F_jk, and so f is an element of the cover {F_jk}. Since f ∈ F was arbitrary, we conclude that {F_jk} covers F. This completes the proof. □

B.2.6 A Lemma Supporting Theorem 5.2 and Lemma 5.1

Lemma B.9.
Let δ∗∗ be as in Lemma 5.1. If δ ≥ δ∗∗ ≥ ε > 0, then:

sup_{P_{Y,Z} ∈ P_{Y,Z}} P⊗n_{Y,Z}( E∗(γ̂) ≥ δ ) ≤ 1 − κ.

That is, γ̂ ∈ G∗(δ) with high probability when δ ≥ δ∗∗ ≥ ε > 0.

Proof. Throughout this proof, let λ∗(θ, γ), λ̂(θ, γ), θ∗(γ), θ̂(γ), γ∗ and γ̂ be as in Remark B.1. Fix any δ > δ∗∗ (the case when δ = δ∗∗ follows from continuity). If σ := E∗(γ̂) ≥ δ ≥ ε > 0, then:
E∗(γ̂) := sup_{γ ∈ Γ} inf_{θ ∈ Θ} max_{λ ∈ Λ} P h_ℓb(·, θ, γ, λ) − inf_{θ ∈ Θ} max_{λ ∈ Λ} P h_ℓb(·, θ, γ̂, λ)
  ≤ inf_{θ ∈ Θ} max_{λ ∈ Λ} P h_ℓb(·, θ, γ∗, λ) − inf_{θ ∈ Θ} max_{λ ∈ Λ} P h_ℓb(·, θ, γ̂, λ) + 3ε
  = inf_{θ ∈ Θ} max_{λ ∈ Λ} P h_ℓb(·, θ, γ∗, λ) − inf_{θ ∈ Θ} max_{λ ∈ Λ} P h_ℓb(·, θ, γ̂, λ)
    + ( inf_{θ ∈ Θ} max_{λ ∈ Λ} P_n h_ℓb(·, θ, γ∗, λ) − inf_{θ ∈ Θ} max_{λ ∈ Λ} P_n h_ℓb(·, θ, γ̂, λ) )
    − ( inf_{θ ∈ Θ} max_{λ ∈ Λ} P_n h_ℓb(·, θ, γ∗, λ) − inf_{θ ∈ Θ} max_{λ ∈ Λ} P_n h_ℓb(·, θ, γ̂, λ) ) + 3ε
  ≤ inf_{θ ∈ Θ} max_{λ ∈ Λ} P h_ℓb(·, θ, γ∗, λ) − inf_{θ ∈ Θ} max_{λ ∈ Λ} P h_ℓb(·, θ, γ̂, λ)
    − ( inf_{θ ∈ Θ} max_{λ ∈ Λ} P_n h_ℓb(·, θ, γ∗, λ) − inf_{θ ∈ Θ} max_{λ ∈ Λ} P_n h_ℓb(·, θ, γ̂, λ) ) + 4ε.

Now note:

inf_{θ ∈ Θ} max_{λ ∈ Λ} P h_ℓb(·, θ, γ∗, λ) − inf_{θ ∈ Θ} max_{λ ∈ Λ} P h_ℓb(·, θ, γ̂, λ)
  ≤ inf_{θ ∈ Θ} max_{λ ∈ Λ} P h_ℓb(·, θ, γ∗, λ) − max_{λ ∈ Λ} P h_ℓb(·, θ∗(γ̂), γ̂, λ) + ε
  ≤ max_{λ ∈ Λ} P h_ℓb(·, θ̂(γ∗), γ∗, λ) − max_{λ ∈ Λ} P h_ℓb(·, θ∗(γ̂), γ̂, λ) + 2ε
  ≤ max_{λ ∈ Λ} P h_ℓb(·, θ̂(γ∗), γ∗, λ) − P h_ℓb(·, θ∗(γ̂), γ̂, λ̂(θ∗(γ̂), γ̂)) + 2ε
  ≤ P h_ℓb(·, θ̂(γ∗), γ∗, λ∗(θ̂(γ∗), γ∗)) − P h_ℓb(·, θ∗(γ̂), γ̂, λ̂(θ∗(γ̂), γ̂)) + 2ε.
Similarly:

inf_{θ ∈ Θ} max_{λ ∈ Λ} P_n h_ℓb(·, θ, γ̂, λ) − inf_{θ ∈ Θ} max_{λ ∈ Λ} P_n h_ℓb(·, θ, γ∗, λ)
  ≤ inf_{θ ∈ Θ} max_{λ ∈ Λ} P_n h_ℓb(·, θ, γ̂, λ) − max_{λ ∈ Λ} P_n h_ℓb(·, θ̂(γ∗), γ∗, λ) + ε
  ≤ max_{λ ∈ Λ} P_n h_ℓb(·, θ∗(γ̂), γ̂, λ) − max_{λ ∈ Λ} P_n h_ℓb(·, θ̂(γ∗), γ∗, λ) + 2ε
  ≤ max_{λ ∈ Λ} P_n h_ℓb(·, θ∗(γ̂), γ̂, λ) − P_n h_ℓb(·, θ̂(γ∗), γ∗, λ∗(θ̂(γ∗), γ∗)) + 2ε
  ≤ P_n h_ℓb(·, θ∗(γ̂), γ̂, λ̂(θ∗(γ̂), γ̂)) − P_n h_ℓb(·, θ̂(γ∗), γ∗, λ∗(θ̂(γ∗), γ∗)) + 2ε.

Thus we have:

E∗(γ̂) ≤ P h_ℓb(·, θ̂(γ∗), γ∗, λ∗(θ̂(γ∗), γ∗)) − P h_ℓb(·, θ∗(γ̂), γ̂, λ̂(θ∗(γ̂), γ̂))
  − ( P_n h_ℓb(·, θ̂(γ∗), γ∗, λ∗(θ̂(γ∗), γ∗)) − P_n h_ℓb(·, θ∗(γ̂), γ̂, λ̂(θ∗(γ̂), γ̂)) ) + 8ε.

Furthermore, σ = E∗(γ̂) ≥ E∗(γ∗) implies that γ̂, γ∗ ∈ G∗(σ). Thus:

E∗(γ̂) ≤ sup_{θ,θ′ ∈ Θ} sup_{γ,γ′ ∈ G∗(σ)} max_{λ,λ′ ∈ Λ} | ( P_n h_ℓb(·, θ, γ, λ) − P_n h_ℓb(·, θ′, γ′, λ′) ) − ( P h_ℓb(·, θ, γ, λ) − P h_ℓb(·, θ′, γ′, λ′) ) | + 8ε.

For each j ∈ N and each ε > 0, define the events:

E_{n,j} := { sup_{θ,θ′ ∈ Θ} sup_{γ,γ′ ∈ G∗(δ_j)} max_{λ,λ′ ∈ Λ} | ( P_n h_ℓb(·, θ, γ, λ) − P_n h_ℓb(·, θ′, γ′, λ′) ) − ( P h_ℓb(·, θ, γ, λ) − P h_ℓb(·, θ′, γ′, λ′) ) | ≤ T(δ_j) },  (B.55)

and:

E_n := ∩_{j : δ_j ≥ δ∗∗} E_{n,j}.

Note that by our choice of δ_0 we have:

sup_{P_{Y,Z} ∈ P_{Y,Z}} P⊗n_{Y,Z}( E^c_{n,0} ) = 0.

Furthermore, from the uniform version of Hoeffding's inequality (e.g.
Koltchinskii (2011), Theorem 4.6, p. 71) we have:

sup_{P_{Y,Z} ∈ P_{Y,Z}} P⊗n_{Y,Z}( E^c_{n,j} ) ≤ exp( −t_j²/2 ),

for each j ∈ N. We conclude by the union bound that:

sup_{P_{Y,Z} ∈ P_{Y,Z}} P⊗n_{Y,Z}( E^c_n ) ≤ Σ_{j : δ_j ≥ δ∗∗} exp( −t_j²/2 ) ≤ Σ_{j=0}^∞ exp( −t_j²/2 ) ≤ 1 − κ.

Now on the event E_n, for every δ ≥ δ∗∗ we have:

sup_{θ,θ′ ∈ Θ} sup_{γ,γ′ ∈ G∗(δ)} max_{λ,λ′ ∈ Λ} | (P_n − P)( h_ℓb(·, θ, γ, λ) − h_ℓb(·, θ′, γ′, λ′) ) | ≤ T(δ).

Now suppose by way of contradiction that {E∗(γ̂) ≥ δ} ∩ E_n ≠ ∅. Then on this event we have:

σ := E∗(γ̂) ≤ sup_{θ,θ′ ∈ Θ} sup_{γ,γ′ ∈ G∗(σ)} max_{λ,λ′ ∈ Λ} | ( P_n h_ℓb(·, θ, γ, λ) − P_n h_ℓb(·, θ′, γ′, λ′) ) − ( P h_ℓb(·, θ, γ, λ) − P h_ℓb(·, θ′, γ′, λ′) ) | ≤ T(σ).

However, note that this implies that σ ≤ δ∗∗ on the event E_n. But since σ ≥ δ > δ∗∗ by assumption, we have a contradiction. We conclude that {E∗(γ̂) ≥ δ} ∩ E_n = ∅, or equivalently that {E∗(γ̂) ≥ δ} ⊆ E^c_n, where the event E^c_n has probability at most 1 − κ. □

C Additional Details for the Examples
C.1 Example 1: Simultaneous Discrete Choice
C.1.1 Verification of Assumptions 2.1, 2.2 and 2.3
We will now proceed to verify Assumptions 2.1, 2.2 and 2.3. First note that Assumption 2.1 is trivially satisfied, since the probability space (Ω, A, P) is complete, and both U and Θ are compact metric spaces with the euclidean norm.

To verify Assumption 2.2, note that the multifunction for the factual domain can be rewritten as:

G⁻(Y, Z, θ) = { u ∈ U : u_k ∈ [π_k(Z_k, Y_{−k}; θ), 1] if Y_k = 0, and u_k ∈ [−1, π_k(Z_k, Y_{−k}; θ)] if Y_k = 1 }.  (C.1)

From here we conclude that, for any u ∈ U:

d(u, G⁻(Y, Z, θ)) = max_k ( 1{Y_k = 0} |π_k(Z_k, Y_{−k}; θ) − u_k|₊ + 1{Y_k = 1} |u_k − π_k(Z_k, Y_{−k}; θ)|₊ ).  (C.2)

Under our assumptions, this distance is the maximum of K measurable functions, and so is itself measurable. Since u ∈ U was arbitrary, by the result of Himmelberg (1975) (see also Theorem 1.3.3 in Molchanov (2017)) this implies that G⁻ is an Effros-measurable multifunction (w.r.t. B(Y) ⊗ B(Z) ⊗ B(Θ)), as desired. It is then easily seen that the conditional distribution of the vector U given (Y, Z) satisfies (2.1) in Assumption 2.2 using the multifunction in (C.1) with θ = θ₀. To complete the verification of Assumption 2.2, note that all the moment functions from the moment conditions in (2.6) and (2.7) are bounded in absolute value and Borel measurable (w.r.t. B(Y) ⊗ B(Z) ⊗ B(Θ)).

We now turn to the verification of Assumption 2.3. Recall the counterfactual multifunction:

G⋆(Z, U, θ, γ) := { y⋆ ∈ Y : y⋆_k = 1{ π_k(γ(Z_k, y⋆_{−k}); θ) ≥ U_k }, k = 1, ..., K }.  (C.3)

Close inspection reveals that:

d(y⋆, G⋆(Z, U, θ, γ)) = max_k | y⋆_k − 1{ π_k(γ(Z_k, y⋆_{−k}); θ) ≥ U_k } |.  (C.4)

Under our assumptions, this distance is also the maximum of K measurable functions, and so is itself measurable.
Since y⋆ ∈ Y⋆ was arbitrary, by the result of Himmelberg (1975) (see also Theorem 1.3.3 in Molchanov (2017)) this implies that G⋆ is an Effros-measurable multifunction (w.r.t. B(Y) ⊗ B(Z) ⊗ B(U) ⊗ B(Θ) ⊗ B(Γ)), as desired. Finally, it is easily seen that the conditional distribution of the vector Y⋆_γ given (Y, Z, U) satisfies (2.3) in Assumption 2.3 using the multifunction in (2.9) with θ = θ₀.

Figure 5: This figure illustrates a case violating Assumption 3.1(ii). The black dots • represent equal probability masses (1/6) assigned by the conditional distribution of U_k given (z, y_{−k}). The red dots • represent equal probability masses (1/6) assigned by the conditional distribution of U_k given (z′, y′_{−k}) = γ(z, y_{−k}). In the upper portion of the figure we have θ∗ ∈ Θ∗, the median zero assumption is satisfied (three black dots • and three red dots • on either side of zero) and the maximum value of P(Y⋆_γ = 1 | Z_k = z, Y_{−k} = y_{−k}) at θ∗ is obtained at 1/6. However, in the bottom portion of the figure a small change in the value of θ∗ ∈ Θ∗ to θ ∉ Θ∗ causes a violation of the median zero assumption for the points (z, y_{−k}) and (z′, y′_{−k}). At the new value θ ∉ Θ∗ the maximum value of P(Y⋆_γ = 1 | Z_k = z, Y_{−k} = y_{−k}) is 1. Note that the scale of the figure can be made arbitrarily small.
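The distance formula (C.2) used in the verification above is simple enough to evaluate directly. The sketch below does so for a toy two-player game; the linear form of the payoff index π_k and all numerical values are hypothetical, chosen only so that u lies inside (or outside) the factual domain G⁻(y, z, θ).

```python
def pi_k(z_k, y_minus_k, theta):
    # Hypothetical linear payoff index pi_k(z_k, y_{-k}; theta); the example
    # only requires pi_k to be known, measurable, and linear in theta.
    return theta[0] * z_k + theta[1] * y_minus_k

def dist_to_factual_domain(u, y, z, theta):
    # d(u, G^-(y, z, theta)) from (C.2): for each player k, u_k must lie in
    # [pi_k, 1] when y_k = 0 and in [-1, pi_k] when y_k = 1.
    parts = []
    for k in range(len(y)):
        p = pi_k(z[k], y[1 - k], theta)       # two-player game: y_{-k} = y[1-k]
        if y[k] == 0:
            parts.append(max(p - u[k], 0.0))  # shortfall below the interval [p, 1]
        else:
            parts.append(max(u[k] - p, 0.0))  # excess above the interval [-1, p]
    return max(parts)

theta = (0.5, -0.3)
y, z = (1, 0), (0.4, -0.2)
p0 = pi_k(z[0], y[1], theta)                         # player 1's cutoff: 0.2
inside = (p0 - 0.1, pi_k(z[1], y[0], theta) + 0.1)   # consistent with y = (1, 0)
outside = (p0 + 0.2, inside[1])                      # violates player 1's restriction
assert dist_to_factual_domain(inside, y, z, theta) == 0.0
assert dist_to_factual_domain(outside, y, z, theta) > 0.0
```

As in the text, the distance is zero exactly on the multifunction and is a maximum of K piecewise-linear (hence measurable) functions of (y, z, θ).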
We will first verify Assumption 3.1(ii) for some C ≥ δ >
0, and then will show that Assumption 3.1(i)is also satisfied for our choice of δ > k ∈ { , . . . , K } and some z ∈ Z and y − k ∈ Y − k we have: (i) the object of interest is P ( Y (cid:63)γ,k = 1 | Z k = z (cid:48) , Y − k = y (cid:48)− k ) or P ( Y (cid:63)γ,k = 1), (ii) the counterfactualcutoff value π k ( γ ( z, y − k ); θ ∗ ) = 0 at some θ ∗ ∈ ∂ Θ ∗ , and (iii) if P ( Y k = 1 | Z k = z (cid:48) , Y − k = y (cid:48)− k ) (cid:54) = 0 . z (cid:48) , y (cid:48)− k ) = γ ( z, y − k ). In this knife-edge case, a very small change in θ ∗ to some θ / ∈ Θ ∗ can cause adiscontinuous change in P ( Y (cid:63)γ,k = 1 | Z k = z (cid:48) , Y − k = y (cid:48)− k ) or P ( Y (cid:63)γ,k = 1).To prevent such discontinuities in the value of the policy transform, we can introduce additional assump-tions on the degree of smoothness of the distribution of U k around zero. In particular, instead of the momentconditions in (2.6) and (2.7) we propose imposing the constraints: P (cid:0) U k ≤ π k ( z (cid:48) , y (cid:48)− k ; θ ) | Z k = z, Y − k = y − k (cid:1) − . ≤ max { L π k ( z (cid:48) , y (cid:48)− k ; θ ) , } , (C.5)0 . − P (cid:0) U k ≤ π k ( z (cid:48) , y (cid:48)− k ; θ ) | Z k = z, Y − k = y − k (cid:1) ≤ max {− L π k ( z (cid:48) , y (cid:48)− k ; θ ) , } , (C.6)for some L >
0, for k = 1 , . . . , K , and for all z, z (cid:48) ∈ Z and y − k , y (cid:48)− k ∈ Y K − . These constraints impose93 igure 6: This figure illustrates a case violating Assumption 3.1(ii). The black dots • represent equal probabilitymasses (1 /
6) assigned by the conditional distribution of U k given ( z, y − k ). The red dots • represent equal probabilitymasses (1 /
6) assigned by the conditional distribution of U k given ( z (cid:48) , y (cid:48)− k ) = γ ( z, y − k ). In the upper portion of thefigure we have θ ∗ ∈ Θ ∗ , the median zero assumption is satisfied (three black dots • and three red dots • on either sideof zero) and the maximum value of P ( Y (cid:63)γ = 1 | Z k = z, Y − k = y − k ) at θ ∗ is obtained at 1 /
2. However, in the bottomportion of the figure a small change in the value of θ ∗ ∈ Θ ∗ to θ / ∈ Θ ∗ causes a violation of the median zero assumptionfor the point ( z (cid:48) , y (cid:48)− k ). At the new value θ / ∈ Θ ∗ we have the maximum value of P ( Y (cid:63)γ = 1 | Z k = z, Y − k = y − k ) is 1.Note that the scale of the figure can be made arbitrarily small. Figure 7:
This figure illustrates a case that does not violate Assumption 3.1(ii). The black dots • represent equal probability masses (1/6) assigned by the conditional distribution of U_k given (z, y_{−k}). The red dots • represent equal probability masses (1/6) assigned by the conditional distribution of U_k given (z′, y′_{−k}) = γ(z, y_{−k}). In the upper portion of the figure we have θ∗ ∈ Θ∗, the median zero assumption is satisfied (three black dots • and three red dots • on either side of zero) and P(Y⋆_γ = 1 | Z_k = z, Y_{−k} = y_{−k}) = 1/6. In the bottom portion of the figure a small change in the value of θ∗ ∈ Θ∗ to θ ∉ Θ∗ causes a violation of the median zero assumption for the point (z, y_{−k}). However, at the new value θ ∉ Θ∗ we still have that the maximum obtainable value of P(Y⋆_γ = 1 | Z_k = z, Y_{−k} = y_{−k}) is 1/6.
a local Lipschitzian constraint on the distribution of U_k around zero. Note that by taking L sufficiently large, these constraints will only be active when π_k(z′, y′_{−k}; θ) is close to zero. It is also easily verified that the new moment conditions implied by (C.5) and (C.6) also satisfy Assumption 2.2.

We claim that the constraints (C.5) and (C.6) imply that U_k is median zero and median independent of (Z, Y_{−k}). To see this, note that U_k has a median of zero given (z, y_{−k}) if and only if:

(I) π_k(z_k, y_{−k}; θ) ≤ 0 and P(U_k ≤ π_k(z_k, y_{−k}; θ) | Z = z_k, Y_{−k} = y_{−k}) ≤ 0.5; or

(II) π_k(z_k, y_{−k}; θ) > 0 and P(U_k > π_k(z_k, y_{−k}; θ) | Z = z_k, Y_{−k} = y_{−k}) ≤ 0.5.

Conversely, U_k does not have a median of zero conditional on (z, y_{−k}) if and only if:

(i) π_k(z_k, y_{−k}; θ) > 0 and P(U_k ≤ π_k(z_k, y_{−k}; θ) | Z = z_k, Y_{−k} = y_{−k}) < 0.5; or

(ii) π_k(z_k, y_{−k}; θ) ≤ 0 and P(U_k > π_k(z_k, y_{−k}; θ) | Z = z_k, Y_{−k} = y_{−k}) < 0.5.

Since (C.5) and (C.6) rule out both (i) and (ii) in every cell, U_k is median zero and median independent of (Z, Y_{−k}). However, note that it is possible that either (I) or (II) is satisfied but one of (C.5) or (C.6) fails, owing to the fact that together (C.5) and (C.6) are stronger than the median zero and median independence restrictions initially imposed in (2.6) and (2.7).

We will now proceed to verify Assumption 3.1. First recall from the discussion in the text that π_k is a known measurable function of (Z_k, Y_{−k}, θ) that is linear in the parameters θ and has a gradient (with respect to θ) bounded away from zero for each (z, y_{−k}). Thus, π_k is Lipschitz in θ, and also satisfies a “reverse Lipschitz” condition; that is, for each (z, y_{−k}) we have:

L′_k ‖θ − θ∗‖ ≤ |π_k(z, y_{−k}; θ) − π_k(z, y_{−k}; θ∗)| ≤ L_k ‖θ − θ∗‖,

for some L′_k, L_k > 0. Now, if one of the constraints (C.5) or (C.6) is violated, we have one of the following inequalities:

P(Ũ_k ≤ π_k(z′, y′_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) − 0.5 > max{ L π_k(z′, y′_{−k}; θ), 0 },  (C.7)

0.5 − P(Ũ_k ≤ π_k(z′, y′_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) > max{ −L π_k(z′, y′_{−k}; θ), 0 }.  (C.8)

Subtracting (C.7) from (C.5) and taking (z′, y′_{−k}) = γ(z, y_{−k}), we have:

P(U_k ≤ π_k(γ(z, y_{−k}); θ∗) | Z_k = z, Y_{−k} = y_{−k}) − P(Ũ_k ≤ π_k(γ(z, y_{−k}); θ) | Z_k = z, Y_{−k} = y_{−k})
  = P(U_k ≤ π_k(z′, y′_{−k}; θ∗) | Z_k = z, Y_{−k} = y_{−k}) − P(Ũ_k ≤ π_k(z′, y′_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k})
  < max{ L π_k(z′, y′_{−k}; θ∗), 0 } − max{ L π_k(z′, y′_{−k}; θ), 0 }
  ≤ max{ L π_k(z′, y′_{−k}; θ∗) − L π_k(z′, y′_{−k}; θ), 0 }
  ≤ L |π_k(z′, y′_{−k}; θ∗) − π_k(z′, y′_{−k}; θ)|
  ≤ L L_k ‖θ − θ∗‖.  (C.9)

Furthermore, subtracting (C.8) from (C.6) and again taking (z′, y′_{−k}) = γ(z, y_{−k}), we have:

P(Ũ_k ≤ π_k(γ(z, y_{−k}); θ) | Z_k = z, Y_{−k} = y_{−k}) − P(U_k ≤ π_k(γ(z, y_{−k}); θ∗) | Z_k = z, Y_{−k} = y_{−k})
  = P(Ũ_k ≤ π_k(z′, y′_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) − P(U_k ≤ π_k(z′, y′_{−k}; θ∗) | Z_k = z, Y_{−k} = y_{−k})
  < max{ −L π_k(z′, y′_{−k}; θ∗), 0 } − max{ −L π_k(z′, y′_{−k}; θ), 0 }
  ≤ max{ L π_k(z′, y′_{−k}; θ) − L π_k(z′, y′_{−k}; θ∗), 0 }
  ≤ L |π_k(z′, y′_{−k}; θ∗) − π_k(z′, y′_{−k}; θ)|
  ≤ L L_k ‖θ − θ∗‖.  (C.10)

From here we can deduce that Assumption 3.1(ii) is satisfied for any δ > 0 with C = L L̄, where L̄ = min_k L_k.

Figure 8: This figure illustrates three scenarios, each involving a different allocation of probability mass for U_k, represented by the 6 dots • of equal probability mass, and a different value of the cutoff π_k(z, y_{−k}; θ). In scenario (A), π_k(z_k, y_{−k}; θ) > 0 and P(Y_k = 0 | Z = z_k, Y_{−k} = y_{−k}) ≤ 0.5. In this case the median zero condition can be satisfied, for example, by the allocation of probability mass displayed in the figure. In scenario (B), π_k(z_k, y_{−k}; θ) > 0 and P(Y_k = 0 | Z = z_k, Y_{−k} = y_{−k}) > 0.5. Here there is no way of satisfying the median zero assumption, since too much mass will always be assigned above zero. In scenario (C), π_k(z_k, y_{−k}; θ) < 0 and P(Y_k = 0 | Z = z_k, Y_{−k} = y_{−k}) > 0.5. In this case the median zero condition can again be satisfied, for example, by the allocation of probability mass displayed in the figure.

To verify Assumption 3.1(i), we will first introduce the following Lemma and provide a sketch of its proof:

Lemma C.1.
Consider the simultaneous discrete choice environment of Example 1, but with the new moment conditions (C.5) and (C.6) in place of (2.6) and (2.7). Now fix some value θ ∈ Θ. If there exists a random variable U with distribution P_{U|Y,Z} ∈ P_{U|Y,Z}(θ) satisfying:

P(U_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) − 0.5 ≤ max{ L π_k(z, y_{−k}; θ), 0 },  (C.11)

0.5 − P(U_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) ≤ max{ −L π_k(z, y_{−k}; θ), 0 },  (C.12)

for k = 1, ..., K and for every (z, y_{−k}) ∈ Z × Y^{K−1}, then θ ∈ Θ∗.

Remark C.1.
Note that, precisely because of the result in this Lemma, the new moment conditions implied by (C.5) and (C.6) satisfy the no-backtracking principle from Remark 2.1. Indeed, this Lemma shows that (C.11) and (C.12) are sufficient to characterize the identified set. Since these moment conditions do not depend on the counterfactual γ of interest, the no-backtracking principle is satisfied.

Proof. Note by assumption there exists a random variable U with distribution P_{U|Y,Z} ∈ P_{U|Y,Z}(θ) satisfying (C.11) and (C.12) for k = 1, ..., K and for every (z, y_{−k}) ∈ Z × Y^{K−1}. Take Ũ to be a random vector satisfying:

P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) = P(U_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}),

for k = 1, ..., K and for every (z, y_{−k}) ∈ Z × Y^{K−1}, so that Ũ satisfies (C.11) and (C.12). We must show that we can fix probabilities of the form P(Ũ_k ≤ π_k(z′, y′_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) for (z′, y′_{−k}) ≠ (z, y_{−k}) in a way that satisfies the remaining constraints from (C.5) and (C.6), as well as the constraints:

P(Ũ_k ≤ π_k(z′, y′_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) ≤ P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}),

if π_k(z′, y′_{−k}; θ) ≤ π_k(z, y_{−k}; θ), and:

P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) ≤ P(Ũ_k ≤ π_k(z′, y′_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}),

if π_k(z, y_{−k}; θ) ≤ π_k(z′, y′_{−k}; θ). However, such an allocation of probability is clearly always possible. □

The contrapositive of this result says that if θ ∉ Θ∗, then there is no random variable U with distribution P_{U|Y,Z} ∈ P_{U|Y,Z}(θ) satisfying (C.11) and (C.12); in other words, if θ ∉ Θ∗, then every distribution P_{U|Y,Z} ∈ P_{U|Y,Z}(θ) violates either (C.11) or (C.12).
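Concretely, Lemma C.1 and its contrapositive reduce membership of θ in Θ∗ to a finite list of inequality checks: under the support restrictions, P(U_k ≤ π_k(z, y_{−k}; θ) | z, y_{−k}) must coincide with the observed conditional choice probability, so (C.11) and (C.12) can be tested cell by cell. A sketch of such a feasibility check follows; the payoff index, the choice probabilities, and the constant L are all hypothetical.

```python
def in_identified_set(theta, ccp, L, pi):
    # ccp[(k, z, y_mk)] = observed P(Y_k = 1 | Z_k = z, Y_{-k} = y_mk).
    # Under the support restrictions this pins down P(U_k <= pi_k | z, y_mk),
    # so by Lemma C.1 theta is in Theta* iff (C.11)-(C.12) hold in every cell.
    for (k, z, y_mk), p in ccp.items():
        cut = pi(k, z, y_mk, theta)
        if p - 0.5 > max(L * cut, 0.0):    # violation of (C.11)
            return False
        if 0.5 - p > max(-L * cut, 0.0):   # violation of (C.12)
            return False
    return True

# Hypothetical linear index and data for a two-player game with binary actions.
pi = lambda k, z, y_mk, theta: theta[0] * z + theta[1] * y_mk
ccp = {(0, 1, 0): 0.6, (0, 1, 1): 0.4, (1, 1, 0): 0.55, (1, 1, 1): 0.45}
L = 1.0

assert in_identified_set((0.2, -0.3), ccp, L, pi)
assert not in_identified_set((0.0, 0.0), ccp, L, pi)
```

Here θ = (0, 0) fails because it forces every cutoff to zero, where (C.11) and (C.12) pin the choice probabilities to exactly 1/2, while the observed probabilities differ from 1/2.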
Thus, the Lemma suggests that when analysing violations of the moment conditions in order to verify Assumption 3.1(i), it suffices to focus on the moment conditions (C.11) and (C.12).

Finally, there is an important property that will be utilized repeatedly when verifying Assumption 3.1(i): for any P_{U|Y,Z} ∈ P_{U|Y,Z}(θ) and any P_{U′|Y,Z} ∈ P_{U|Y,Z}(θ′), we must have:

P(U_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) = P(U′_k ≤ π_k(z, y_{−k}; θ′) | Z_k = z, Y_{−k} = y_{−k})  (C.13)

for k = 1, ..., K and for every (z, y_{−k}) ∈ Z × Y^{K−1}. Indeed, this property follows from the fact that both P_{U|Y,Z} and P_{U′|Y,Z} satisfy the support restrictions for the simultaneous discrete choice model at θ and θ′, respectively, and thus they must both rationalize the same observed conditional choice probabilities.

Now we are prepared to verify Assumption 3.1(i). First fix some value of θ ∉ Θ∗. If P_{U|Y,Z}(θ) is empty, then Assumption 3.1(i) is satisfied for any C, δ > 0. Thus, we will focus attention on the non-trivial case where P_{U|Y,Z}(θ) is non-empty. Note that if P(Y_k = 1 | Z = z, Y_{−k} = y_{−k}) = 0.5 for every k = 1, ..., K and for every (z, y_{−k}) ∈ Z × Y^{K−1}, then (C.11) and (C.12) will be satisfied for any P_{U|Y,Z} ∈ P_{U|Y,Z}(θ). By Lemma C.1 this implies θ ∈ Θ∗, contradicting the fact that θ ∉ Θ∗. We conclude that if P(Y_k = 1 | Z = z, Y_{−k} = y_{−k}) = 0.5 for every k = 1, ..., K and for every (z, y_{−k}) ∈ Z × Y^{K−1}, then θ ∉ Θ∗ implies P_{U|Y,Z}(θ) is empty, a case we have ruled out. Thus, we will take as a starting point that there exists at least one k and one pair (z, y_{−k}) ∈ Z × Y^{K−1} such that P(Y_k = 1 | Z = z, Y_{−k} = y_{−k}) ≠ 0.5. Now define:

τ := min_k min_{(z, y_{−k})} |0.5 − P(Y_k = 1 | Z = z, Y_{−k} = y_{−k})|  s.t.  |0.5 − P(Y_k = 1 | Z = z, Y_{−k} = y_{−k})| > 0.  (C.14)

By assumption and by construction we have τ > 0. We now consider violations of the moment conditions (C.11) and (C.12) in turn.

First, consider a violation of (C.11). In particular, for our fixed value of θ ∉ Θ∗ suppose:

P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) − 0.5 > max{ L π_k(z, y_{−k}; θ), 0 },  (C.15)

for some k and (z, y_{−k}) pair, where Ũ_k is a subvector of Ũ whose distribution is a member of P_{U|Y,Z}(θ). Furthermore, let θ∗ ∈ Θ∗ be the element of Θ∗ closest to θ (such an element exists since Θ∗ will be closed, which follows from continuity of the payoff functions). There are four cases to consider:

1. π_k(z, y_{−k}; θ∗) ≤ 0 and π_k(z, y_{−k}; θ) ≤ 0. Then we have:

max{ L π_k(z, y_{−k}; θ), 0 } = 0.  (C.16)

However, since π_k(z, y_{−k}; θ∗) ≤ 0 we have:

0.5 ≥ P(U_k ≤ π_k(z, y_{−k}; θ∗) | Z_k = z, Y_{−k} = y_{−k}) = P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}),

where we have used property (C.13) and the fact that θ∗ satisfies both (C.5) and (C.6). But then this implies:

P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) − 0.5 ≤ 0.  (C.17)

Combining (C.16) and (C.17) contradicts the assumption of (C.15). Thus, this case is not possible under the assumption of (C.15).

2. π_k(z, y_{−k}; θ∗) ≤ 0 and π_k(z, y_{−k}; θ) > 0. Then we have:

max{ L π_k(z, y_{−k}; θ), 0 } = L π_k(z, y_{−k}; θ).  (C.18)

However, since π_k(z, y_{−k}; θ∗) ≤ 0 we have:

0.5 ≥ P(U_k ≤ π_k(z, y_{−k}; θ∗) | Z_k = z, Y_{−k} = y_{−k}) = P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}),

where we have used property (C.13) and the fact that θ∗ satisfies both (C.5) and (C.6). But then this implies:

P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) − 0.5 ≤ 0.  (C.19)

Combining (C.18) and (C.19) contradicts the assumption of (C.15). Thus, this case is not possible under the assumption of (C.15).

3. π_k(z, y_{−k}; θ∗) > 0 and π_k(z, y_{−k}; θ) ≤ 0. Then we have:

max{ L π_k(z, y_{−k}; θ), 0 } = 0.

Then:

P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) − 0.5 − max{ L π_k(z, y_{−k}; θ), 0 }
  = P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) − 0.5
  ≥ τ,

where the last line follows from the fact that P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) − 0.5 > 0, the fact that π_k(z, y_{−k}; θ) ≤ 0, and by the definition of τ from (3.17).

4. π_k(z, y_{−k}; θ∗) > 0 and π_k(z, y_{−k}; θ) > 0. First note that by assumption we have:

P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) − 0.5 − max{ L π_k(z, y_{−k}; θ), 0 } > 0
  ≥ P(U_k ≤ π_k(z, y_{−k}; θ∗) | Z_k = z, Y_{−k} = y_{−k}) − 0.5 − max{ L π_k(z, y_{−k}; θ∗), 0 }.

Using (C.13) and the fact that π_k(z, y_{−k}; θ∗) > 0 and π_k(z, y_{−k}; θ) > 0, this implies π_k(z, y_{−k}; θ∗) > π_k(z, y_{−k}; θ). Now let θ′ be a convex combination of θ∗ and θ satisfying:

P(U′_k ≤ π_k(z, y_{−k}; θ′) | Z_k = z, Y_{−k} = y_{−k}) − 0.5 − L π_k(z, y_{−k}; θ′) = 0,

for some selection U′_k. Such an element always exists by linearity of π_k. Then:

P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) − 0.5 − max{ L π_k(z, y_{−k}; θ), 0 }
  = P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) − 0.5 − max{ L π_k(z, y_{−k}; θ), 0 } + L π_k(z, y_{−k}; θ′) − L π_k(z, y_{−k}; θ′)
  = L π_k(z, y_{−k}; θ′) − L π_k(z, y_{−k}; θ)
  = L |π_k(z, y_{−k}; θ′) − π_k(z, y_{−k}; θ)|
  ≥ L L′_k ‖θ′ − θ‖
  ≥ L L′_k ‖θ∗ − θ‖.

In the third last line we used the fact that π_k(z, y_{−k}; θ∗) > π_k(z, y_{−k}; θ). In the second last line we have used the reverse Lipschitz condition, and in the final line we have used the fact that θ′ lies between θ and θ∗, by virtue of being a convex combination of these elements.

Next, consider a violation of (C.12). In particular, for our fixed θ ∉ Θ∗ suppose:

0.5 − P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) > max{ −L π_k(z, y_{−k}; θ), 0 },  (C.20)

for some k and (z, y_{−k}) pair, where Ũ_k is a subvector of Ũ whose distribution is a member of P_{U|Y,Z}(θ). Again, let θ∗ ∈ Θ∗ be the element of Θ∗ closest to θ. There are again four cases to consider:

1. π_k(z, y_{−k}; θ∗) ≤ 0 and π_k(z, y_{−k}; θ) ≤ 0. First note that by assumption we have:

0.5 − P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) − max{ −L π_k(z, y_{−k}; θ), 0 } > 0
  ≥ 0.5 − P(U_k ≤ π_k(z, y_{−k}; θ∗) | Z_k = z, Y_{−k} = y_{−k}) − max{ −L π_k(z, y_{−k}; θ∗), 0 }.

Using (C.13) and the fact that π_k(z, y_{−k}; θ∗) ≤ 0 and π_k(z, y_{−k}; θ) ≤ 0, this implies π_k(z, y_{−k}; θ∗) < π_k(z, y_{−k}; θ). Now let θ′ be a convex combination of θ∗ and θ satisfying:

0.5 − P(U′_k ≤ π_k(z, y_{−k}; θ′) | Z_k = z, Y_{−k} = y_{−k}) + L π_k(z, y_{−k}; θ′) = 0,

for some selection U′_k. Such an element always exists by linearity of π_k. Then:

0.5 − P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) − max{ −L π_k(z, y_{−k}; θ), 0 }
  = 0.5 − P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) − max{ −L π_k(z, y_{−k}; θ), 0 } + L π_k(z, y_{−k}; θ′) − L π_k(z, y_{−k}; θ′)
  = L π_k(z, y_{−k}; θ) − L π_k(z, y_{−k}; θ′)
  = L |π_k(z, y_{−k}; θ) − π_k(z, y_{−k}; θ′)|
  ≥ L L′_k ‖θ − θ′‖
  ≥ L L′_k ‖θ − θ∗‖.

In the third last line we used the fact that π_k(z, y_{−k}; θ∗) < π_k(z, y_{−k}; θ). In the second last line we have used the reverse Lipschitz condition, and in the final line we have used the fact that θ′ lies between θ and θ∗, by virtue of being a convex combination of these elements.

2. π_k(z, y_{−k}; θ∗) ≤ 0 and π_k(z, y_{−k}; θ) > 0. Then we have:

max{ −L π_k(z, y_{−k}; θ), 0 } = 0.

Then:

0.5 − P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) − max{ −L π_k(z, y_{−k}; θ), 0 }
  = 0.5 − P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k})
  ≥ τ,

where the last line follows from the fact that 0.5 − P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) > 0, the fact that π_k(z, y_{−k}; θ) > 0, and by the definition of τ from (3.17).

3. π_k(z, y_{−k}; θ∗) > 0 and π_k(z, y_{−k}; θ) ≤ 0. Then we have:

max{ −L π_k(z, y_{−k}; θ), 0 } = −L π_k(z, y_{−k}; θ).  (C.21)

However, since π_k(z, y_{−k}; θ∗) > 0 we have:

0.5 ≤ P(U_k ≤ π_k(z, y_{−k}; θ∗) | Z_k = z, Y_{−k} = y_{−k}) = P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}),

where we have used property (C.13) and the fact that θ∗ satisfies both (C.5) and (C.6). But then this implies:

0.5 − P(Ũ_k ≤ π_k(z, y_{−k}; θ) | Z_k = z, Y_{−k} = y_{−k}) ≤ 0.  (C.22)

Combining (C.21) and (C.22) contradicts the assumption of (C.20). Thus, this case is not possible under the assumption of (C.20).

4. π_k(z, y_{−k}; θ∗) > 0 and π_k(z, y_{−k}; θ) > 0.
0. Then we have:max {− L π k ( z, y − k ; θ ) , } = 0 . (C.23)However, since π k ( z, y − k ; θ ∗ ) > . ≤ P ( U k ≤ π k ( z, y − k ; θ ∗ ) | Z k = z, Y − k = y − k ) = P ( (cid:101) U k ≤ π k ( z, y − k ; θ ) | Z k = z, Y − k = y − k ) , where we have used property (C.13) and the fact that θ ∗ satisfies both (C.5) and (C.6). But then this101mplies: 0 . − P ( (cid:101) U k ≤ π k ( z, y − k ; θ ) | Z k = z, Y − k = y − k ) ≤ . (C.24)Combining (C.23) and (C.24) contradicts the assumption of (C.20). Thus, this case is not possibleunder the assumption of (C.20).Combining everything, we conclude that Assumption 3.1 holds with C = L L (cid:48) and δ = τ / ( L L (cid:48) ), where L (cid:48) = min k L (cid:48) k . C.1.3 Verification of Learnability
By the assumed linearity of π_k with respect to θ, and since π_k depends only on the subvector θ_k of θ, the function (u, θ) ↦ π_k(γ(z, y_{−k}); θ) − u is a hyperplane in R^{d_k} for each (z, y_{−k}), where d_k is the dimension of θ_k. By Lemma 2.6.15 in Van Der Vaart and Wellner (1996), for example, Φ is a Vapnik-Chervonenkis (VC) class with VC dimension at most d_k + 2. Furthermore, recall that Φ can be taken to be uniformly bounded in absolute value by 1. Using, for example, Theorem 2.6.7 in Van Der Vaart and Wellner (1996), we can deduce:

sup_{Q ∈ Q_n} log N(ε, Φ, ‖·‖_{Q,2}) = O(1),

so that Φ easily satisfies the entropy growth condition. Now let j index a generic moment function:

m_j(Y_{−k}, Z, U, θ) = (1{U_k ≤ π_k(z′, y′_{−k}; θ)} − max{L π_k(z′, y′_{−k}; θ), 0} − 0.5) 1{Z_k = z′, Y_{−k} = y′_{−k}},

and let M_j be the associated class of functions:

M_j = {m_j(·, u, θ) : Y × Z → R : (u, θ) ∈ U × Θ}.

Note that the values (z′, y′_{−k}) are not arguments of the function, but instead are associated with the index j. Since π_k takes values in the interval [−1, 1], M_j is uniformly bounded. We claim that there exists no set of size 2 shattered by M_j, implying M_j is a VC-subgraph class. We will prove this by way of contradiction. In particular, suppose that there exist two points (y_1, z_1) and (y_2, z_2), and values t_1, t_2 ∈ R, such that:

|{ (1{m_j(y_1, z_1, u, θ) ≥ t_1}, 1{m_j(y_2, z_2, u, θ) ≥ t_2}) : (u, θ) ∈ U × Θ }| = 4.   (C.25)

In other words, we suppose the set {(y_1, z_1), (y_2, z_2)} is shattered by M_j, and that t_1, t_2 ∈ R witness the shattering. We have:

m_j(y_1, z_1, u, θ) = (1{u_k ≤ π_k(z′, y′_{−k}; θ)} − max{L π_k(z′, y′_{−k}; θ), 0} − 0.5) 1{z_{1,k} = z′, y_{1,−k} = y′_{−k}},
m_j(y_2, z_2, u, θ) = (1{u_k ≤ π_k(z′, y′_{−k}; θ)} − max{L π_k(z′, y′_{−k}; θ), 0} − 0.5) 1{z_{2,k} = z′, y_{2,−k} = y′_{−k}}.

Now consider two cases:

1. 1{z_{1,k} = z′, y_{1,−k} = y′_{−k}} = 1{z_{2,k} = z′, y_{2,−k} = y′_{−k}}: In this case the two functions m_j(y_1, z_1, u, θ) and m_j(y_2, z_2, u, θ) are identical for all (u, θ) ∈ U × Θ. This means (C.25) is impossible, since at least one of the vectors (1, 0) and (0, 1) cannot be picked out by M_j.

2. 1{z_{1,k} = z′, y_{1,−k} = y′_{−k}} ≠ 1{z_{2,k} = z′, y_{2,−k} = y′_{−k}}: In this case at least one of the functions m_j(y_1, z_1, u, θ) or m_j(y_2, z_2, u, θ) is the zero function. Again, this means (C.25) is impossible. For example, if m_j(y_1, z_1, u, θ) is the zero function, then it is impossible for M_j to pick out both (0, 0) and (1, 0), or both (0, 1) and (1, 1).

Since (y_1, z_1) and (y_2, z_2) were arbitrary, we conclude that there exists no set of size 2 shattered by M_j. This implies that M_j is a VC-subgraph class, and using, for example, Theorem 2.6.7 in Van Der Vaart and Wellner (1996), we can deduce:

sup_{Q ∈ Q_n} log N(ε, M_j, ‖·‖_{Q,2}) = O(1).

Thus, M_j easily satisfies the entropy growth condition. Finally, let j′ index a generic moment function:

m_{j′}(Y_{−k}, Z, U, θ) = (0.5 − 1{U_k ≤ π_k(z′, y′_{−k}; θ)} − max{−L π_k(z′, y′_{−k}; θ), 0}) 1{Z_k = z′, Y_{−k} = y′_{−k}},

and let M_{j′} be the associated class of functions:

M_{j′} = {m_{j′}(·, u, θ) : Y × Z → R : (u, θ) ∈ U × Θ}.

A nearly identical argument as for M_j reveals that M_{j′} is a VC-subgraph class and thus trivially satisfies the entropy growth condition. We conclude using Theorem 4.1(ii) that our class of policies Γ is PAMPAC learnable with a rate of convergence of O(n^{−1/2}).

C.2 Example 2: Program Evaluation
C.2.1 Verification of Assumptions 2.1, 2.2 and 2.3
We will now proceed to verify Assumptions 2.1, 2.2 and 2.3. First note that Assumption 2.1 is trivially satisfied, since the probability space (Ω, A, P) is complete, U is a compact subset of euclidean space, and Θ is a Polish space; in particular, since Z (and thus also X) is finite, G can be considered as the set of all positive measurable functions g : Z → [0, 1], so that each g ∈ G has an equivalent representation as a vector in [0, 1]^{|Z|}. The same logic applies to each t ∈ T. Next, let us recall the multifunction:

G⁻(Y, D, Z, θ) := cl{ (U_0, U_1, U) ∈ U : Y = U_0 (1 − D) + U_1 D, D = 1{g(Z) ≥ U} }.   (C.26)

Close inspection of this multifunction shows that:

G⁻(y, d, z, θ) =
  {y} × [Y_L, Y_U] × [g(z), 1], if d = 0,
  [Y_L, Y_U] × {y} × [0, g(z)], if d = 1,   (C.27)

where [Y_L, Y_U] denotes the support of Y. Now for any (u_0, u_1, u) ∈ U we have:

d((u_0, u_1, u), G⁻(Y, D, Z, θ)) = D max{|u_1 − Y|, max{u − g(Z), 0}} + (1 − D) max{|u_0 − Y|, max{g(Z) − u, 0}}.   (C.28)

Since g ∈ G is measurable by definition, from here it is easily verified that the distance above is measurable with respect to B(Y) ⊗ B(D) ⊗ B(Z). Since (u_0, u_1, u) ∈ U was arbitrary, by the result of Himmelberg (1975) (see also Theorem 1.3.3 in Molchanov (2017)) this implies that G⁻ is an Effros-measurable multifunction, as desired. Modulo changes in notation, it is easily seen that the conditional distribution of the vector (U_0, U_1, U) given (Y, D, Z) satisfies (2.1) in Assumption 2.2 using the multifunction in (2.12) with g(·) = g_0(·). Finally, note that all of the moment functions in the moment conditions (2.14) - (2.19) are measurable and bounded by 1, and the moment functions from the moment conditions in (2.20) and (2.21) are measurable and bounded by max{|Y_L|, |Y_U|}.

Turning to the counterfactual domain, recall the multifunction:

G⋆(Z, U_0, U_1, U, θ, γ) := { (Y⋆_γ, D⋆_γ) ∈ Y × {0, 1} : Y⋆_γ = U_0 (1 − D⋆_γ) + U_1 D⋆_γ, D⋆_γ = 1{g(γ(Z)) ≥ U} }.   (C.29)

Note here we take Y⋆ = Y, although this is not necessary. Furthermore, close inspection of this multifunction shows that:

G⋆(z, u_0, u_1, u, θ, γ) =
  {(u_1, 1)}, if u ≤ g(γ(z)),
  {(u_0, 0)}, if g(γ(z)) < u.   (C.30)

The counterfactual map in (2.23) is thus single-valued. In this case, Effros measurability is equivalent to the usual notion of measurability for functions, and measurability of G⋆ follows from familiar arguments after noting that both g and γ are measurable functions. Finally, modulo changes in notation, it is easily seen that the conditional distribution of the vector (Y⋆_γ, D⋆_γ) given (Y, D, Z, U_0, U_1, U) satisfies (2.3) in Assumption 2.3 using the multifunction in (2.23) with g(·) = g_0(·).

C.2.2 Verification of Assumption 3.1

First we focus on (2.14) - (2.17). Since these moments do not depend on t ∈ T, to verify Assumption 3.1 it suffices to focus on the parameter g ∈ G. From the moment conditions (2.14) and (2.15) we have:

g(z_1, x) = P(D = 1 | Z_1 = z_1, X = x)  ⟺  E[(D − g(z_1, x)) 1{Z_1 = z_1, X = x}] ≤ 0 and E[(g(z_1, x) − D) 1{Z_1 = z_1, X = x}] ≤ 0,   (C.31)

and from (2.16) and (2.17) we have:

g(z_1, x) = P(U ≤ g(z_1, x) | X = x)  ⟺  E[(1{U ≤ g(z_1, x)} − g(z_1, x)) 1{X = x}] ≤ 0 and E[(g(z_1, x) − 1{U ≤ g(z_1, x)}) 1{X = x}] ≤ 0.   (C.32)

For notational simplicity, let g(z) := g(z_1, x) for z = (z_1, x). From (C.31) we see that g_0(z) is point-identified. Define:

G* = {g : (g, t) ∈ Θ* for some t ∈ T}.

Then point-identification of g_0 implies that G* is a singleton, and that for any g ∈ G:

d(g, G*) = max_{z ∈ Z} |g(z) − g_0(z)|.

From here it is straightforward to use conditions (C.31) and (C.32) to argue that part (i) of Assumption 3.1 is satisfied with C = 1 for any δ >
0. In particular, suppose g ∉ G*, and that z* ∈ Z satisfies:

d(g, G*) = max_{z ∈ Z} |g(z) − g_0(z)| = |g(z*) − g_0(z*)|.

Without loss of generality, suppose that g(z*) > g_0(z*). Then from (C.31) we have:

E[(g_0(z*) − D) 1{Z = z*}] = 0 < E[(g(z*) − D) 1{Z = z*}].

Thus:

E[(g(z*) − D) 1{Z = z*}] = E[(g(z*) − D) 1{Z = z*}] − E[(g_0(z*) − D) 1{Z = z*}]
= g(z*) − g_0(z*)
= |g(z*) − g_0(z*)| = d(g, G*).

Now to complete the verification of part (i) of Assumption 3.1 we turn to (2.18) - (2.21), which can be written as:

E[t(z_1, x) − 1{Z_1 = z_1, X = x}] = 0,  for all z_1 ∈ Z_1, x ∈ X,   (C.33)

and:

E[ U_d ( 1{Z_1 = z_1, X = x} Σ_{z_1′ ∈ Z_1} t(z_1′, x) − 1{X = x} t(z_1, x) ) ] ≤ 0,  for all z_1 ∈ Z_1, x ∈ X, d ∈ {0, 1}.   (C.34)

Since these moments do not depend on g ∈ G, to verify Assumption 3.1 for these moments it suffices to focus on the parameter t ∈ T. Now define:

T* = {t : (g, t) ∈ Θ* for some g ∈ G}.

From (C.33) it is clear that t_0 is also point-identified. Since g_0 is also point-identified we have Θ* = {g_0} × {t_0}. Because of this, we claim that it suffices to focus on the conditions from (C.33); indeed, t ∉ T* ⟺ t ≠ t_0 implies that t ∉ T* if and only if (C.33) is violated. Now consider any t ∉ T* and let (z_1*, x*) satisfy:

(z_1*, x*) = arg max_{z_1, x} |t(z_1, x) − t_0(z_1, x)|.

Without loss of generality we can suppose t(z_1*, x*) > t_0(z_1*, x*). Then:

E[t(z_1*, x*) − 1{Z_1 = z_1*, X = x*}] = E[t(z_1*, x*) − 1{Z_1 = z_1*, X = x*}] − E[t_0(z_1*, x*) − 1{Z_1 = z_1*, X = x*}]
= t(z_1*, x*) − t_0(z_1*, x*)
= |t(z_1*, x*) − t_0(z_1*, x*)| = d(t, T*).
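The derivation above shows that, for the pmf parameter t, the largest violation of the moment condition (C.33) equals the sup-norm distance to the true value t_0. A minimal numerical sketch (the pmf values are illustrative assumptions, and the moments are computed in population form under t_0):

```python
t0 = {0: 0.2, 1: 0.5, 2: 0.3}    # illustrative true pmf of Z (an assumption)
t = {0: 0.25, 1: 0.45, 2: 0.3}   # a candidate parameter with t != t0

def moment(z):
    # Population moment E[t(z) - 1{Z = z}], with the expectation taken under t0.
    return sum(t0[zp] * (t[z] - (1 if zp == z else 0)) for zp in t0)

max_violation = max(abs(moment(z)) for z in t0)
sup_dist = max(abs(t[z] - t0[z]) for z in t0)
assert abs(max_violation - sup_dist) < 1e-12  # violation = d(t, T*), i.e. C = 1
```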
Combining everything, if J indexes all the moment constraints and if θ ∉ Θ* with θ = (g, t), then we know:

inf_{P_{U_0,U_1,U|Y,D,Z} ∈ P_{U_0,U_1,U|Y,D,Z}(θ)} max_{j ∈ J} (E[m_j(Y, D, Z, U_0, U_1, U, θ)])_+ ≥ max{d(g, G*), d(t, T*)} ≥ d(θ, Θ*).

Conclude that Assumption 3.1 is satisfied with C = 1 for any δ > 0.

We now argue that the constant in part (ii) of Assumption 3.1 can also be taken to be C = 1. To show why, we will apply Lemma 3.1 to our environment. First note that ϕ is the identity function when we are interested in E_P[Y⋆_γ]. Thus L_ϕ = 1 in Lemma 3.1. Next, note from the definition of our support restrictions G⁻ and G⋆ we can deduce that:

d((u_0, u_1, u), G⁻(y, d, z, θ)) =
  max{|u_0 − y|, (g(z) − u)_+}, if d = 0,
  max{|u_1 − y|, (u − g(z))_+}, if d = 1,   (C.35)

d((y⋆, d⋆), G⋆(y, d, z, u_0, u_1, u, θ, γ)) =
  max{|u_0 − y⋆|, (g(γ(z)) − u)_+}, if u > g(γ(z)),
  max{|u_1 − y⋆|, (u − g(γ(z)))_+}, if u ≤ g(γ(z)),   (C.36)

where (a)_+ := max{a, 0}. We now define the sets Θ⁻ and Θ⋆ given in Lemma 3.1 in the context of this example:

Θ⁻(y, d, z, u_0, u_1, u) ∩ Θ*_δ := {θ ∈ Θ*_δ : (u_0, u_1, u) ∈ G⁻(y, d, z, θ)}
=
  {θ ∈ Θ*_δ : g(z) ∈ [0, u]}, if d = 0 and u_0 = y,
  {θ ∈ Θ*_δ : g(z) ∈ [u, 1]}, if d = 1 and u_1 = y,
  ∅, otherwise,   (C.37)

Θ⋆(v, γ) ∩ Θ*_δ := {θ ∈ Θ*_δ : (y⋆, d⋆) ∈ G⋆(y, d, z, u_0, u_1, u, θ, γ)}
=
  {θ ∈ Θ*_δ : g(γ(z)) ∈ [0, u]}, if d⋆ = 0 and y⋆ = u_0,
  {θ ∈ Θ*_δ : g(γ(z)) ∈ [u, 1]}, if d⋆ = 1 and y⋆ = u_1,
  ∅, otherwise.   (C.38)

With these definitions, we have for any θ ∈ Θ*_δ:

d(θ, Θ⁻(y, d, z, u_0, u_1, u) ∩ Θ*_δ) =
  (g(z) − u)_+, if d = 0 and u_0 = y,
  (u − g(z))_+, if d = 1 and u_1 = y,
  +∞, otherwise,   (C.39)

d(θ, Θ⋆(v, γ) ∩ Θ*_δ) =
  (g(γ(z)) − u)_+, if d⋆ = 0 and y⋆ = u_0,
  (u − g(γ(z)))_+, if d⋆ = 1 and y⋆ = u_1,
  +∞, otherwise.
(C.40)

Combining (C.35) with (C.39) we can verify condition (3.10) with ℓ_1 = 1. Furthermore, by combining (C.36) with (C.40) we can verify condition (3.11) with ℓ_2 = 1. Applying Lemma 3.1 then yields the choice C = L_ϕ max{ℓ_1, ℓ_2} = 1, as claimed above. Note also that this value of C works for any δ > 0, and that we can take µ* = 1 in Theorem 3.1. Also, recall the moment functions for this example from equations (2.14) - (2.17). The Theorem then states that the lower and upper bounds on the closed convex hull of the identified set for E[Y⋆_γ] can be computed as the solutions to the problems (3.12) and (3.13). Intuitively, under the assumptions of the Theorem the infimum over θ ∈ Θ and supremum over θ ∈ Θ in problems (3.12) and (3.13) will be obtained at the value θ_0 ∈ Θ.

C.2.3 Verification of Learnability
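The VC-index arguments in this section all rest on a pigeonhole step: a class whose members depend on z only through finitely many values cannot shatter more points than |Z|. A brute-force sketch with |Z| = 2 (the function family and the thresholds below are illustrative, not the paper's objects):

```python
from itertools import product

# With |Z| = 2, any 3 evaluation points must repeat some z, so the subgraph
# class {(z, t) -> 1{h(z) >= t}} cannot shatter them.
grid = [i / 10 for i in range(11)]
H = list(product(grid, repeat=2))        # all h: {0, 1} -> grid, h = (h(0), h(1))

points = [(0, 0.3), (1, 0.5), (0, 0.6)]  # (z_i, t_i) pairs; z repeats by pigeonhole
patterns = {tuple(int(h[z] >= t) for z, t in points) for h in H}
assert len(patterns) < 8                 # the 3 points are not shattered
assert (0, 0, 1) not in patterns         # h(0) < 0.3 and h(0) >= 0.6 is impossible
```

This is exactly the mechanism behind the t_i = t_j, t_i < t_j, t_j < t_i case analysis below: once two points share the same z, some labeling can never be picked out.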
We claim that Φ is a VC class with VC index of at most |Z| + 1. To prove this, we must show that there exists no set of points Z_n = {z_1, ..., z_n} with n = |Z| + 1 shattered by Φ. Let t_1, ..., t_n be arbitrary real numbers. Now define the set:

B := { (1{1{g(γ(z_1)) ≥ u}(u_1 − u_0) + u_0 ≥ t_1}, 1{1{g(γ(z_2)) ≥ u}(u_1 − u_0) + u_0 ≥ t_2}, ..., 1{1{g(γ(z_n)) ≥ u}(u_1 − u_0) + u_0 ≥ t_n}) : (u_0, u_1, u, θ) ∈ U × Θ }.

If B contains the vector b ∈ {0, 1}^n, then we say that Φ "picks out" b. It suffices to show that there always exists at least one vector b ∈ {0, 1}^n that Φ fails to pick out. Since n > |Z|, there exists at least one z ∈ Z that appears twice in the set Z_n. Thus there are some i, j ∈ {1, ..., n} such that z_i = z_j. Then regardless of the values of (u_0, u_1, u, θ) we will always have:

1{g(γ(z_i)) ≥ u}(u_1 − u_0) + u_0 = 1{g(γ(z_j)) ≥ u}(u_1 − u_0) + u_0.

We then have:

1. If t_i = t_j then Φ fails to pick out any vector b ∈ {0, 1}^n with b_i = 0 and b_j = 1.
2. If t_i < t_j then Φ fails to pick out any vector b ∈ {0, 1}^n with b_i = 0 and b_j = 1.
3. If t_j < t_i then Φ fails to pick out any vector b ∈ {0, 1}^n with b_i = 1 and b_j = 0.

Since this covers all possibilities for t_i, t_j ∈ R, we conclude that there always exists at least one binary vector that Φ fails to pick out, and thus Φ shatters no set of size n = |Z| + 1. Now using, for example, Theorem 2.6.7 in Van Der Vaart and Wellner (1996), we can deduce:

sup_{Q ∈ Q_n} log N(ε, Φ, ‖·‖_{Q,2}) = O(1),

so that Φ easily satisfies the entropy growth condition. Now let j index a generic moment function:

m_j(D, Z, θ) = (D − g(z_1′, x′)) 1{Z_1 = z_1′, X = x′},

and let M_j be the associated class of functions:

M_j = {m_j(·, θ) : {0, 1} × Z → R : θ ∈ Θ}.

Note this class indexes the moment functions from the moment conditions (2.14). Also note that (z_1′, x′) are not arguments of the moment function, but are instead associated with the index j.

We claim that there exists no set of size 3 shattered by M_j, implying M_j is a VC-subgraph class. We will prove this by way of contradiction. In particular, suppose that there exist three points (d_1, z_1), (d_2, z_2), and (d_3, z_3) and values t_1, t_2, t_3 ∈ R such that:

|{ (1{m_j(d_1, z_1, θ) ≥ t_1}, 1{m_j(d_2, z_2, θ) ≥ t_2}, 1{m_j(d_3, z_3, θ) ≥ t_3}) : θ ∈ Θ }| = 8.   (C.41)

In other words, we suppose the set {(d_1, z_1), (d_2, z_2), (d_3, z_3)} is shattered by M_j, and that t_1, t_2, t_3 ∈ R witness the shattering. Writing z_i = (z_{1,i}, x_i), we have:

m_j(d_1, z_1, θ) = (d_1 − g(z_1′, x′)) 1{z_{1,1} = z_1′, x_1 = x′},
m_j(d_2, z_2, θ) = (d_2 − g(z_1′, x′)) 1{z_{1,2} = z_1′, x_2 = x′},
m_j(d_3, z_3, θ) = (d_3 − g(z_1′, x′)) 1{z_{1,3} = z_1′, x_3 = x′}.

Now consider two cases:

1. 1{z_{1,1} = z_1′, x_1 = x′} = 1{z_{1,2} = z_1′, x_2 = x′} = 1{z_{1,3} = z_1′, x_3 = x′} = 1: Note that since d_i ∈ {0, 1}, at least two of the functions m_j(d_1, z_1, θ), m_j(d_2, z_2, θ) and m_j(d_3, z_3, θ) are identical for all θ ∈ Θ. This means (C.41) is impossible. For instance, suppose that m_j(d_1, z_1, θ) = m_j(d_2, z_2, θ). Then at least one of the vectors (1, 0, 0) or (0, 1, 0) cannot be picked out by M_j.

2. Either 1{z_{1,1} = z_1′, x_1 = x′} = 0 or 1{z_{1,2} = z_1′, x_2 = x′} = 0 or 1{z_{1,3} = z_1′, x_3 = x′} = 0: In this case at least one of the functions m_j(d_1, z_1, θ), m_j(d_2, z_2, θ) or m_j(d_3, z_3, θ) is equal to zero for all θ ∈ Θ. Again, this means (C.41) is impossible. For example, if m_j(d_1, z_1, θ) is the zero function, then it is impossible for M_j to pick out both (0, 0, 0) and (1, 0, 0).

Since (d_1, z_1), (d_2, z_2), and (d_3, z_3) were arbitrary, we conclude that there exists no set of size 3 shattered by M_j. This implies that M_j is a VC-subgraph class, and using, for example, Theorem 2.6.7 in Van Der Vaart and Wellner (1996), we can deduce:

sup_{Q ∈ Q_n} log N(ε, M_j, ‖·‖_{Q,2}) = O(1).

Thus, M_j easily satisfies the entropy growth condition. Given the relation between the moment functions from (2.14) and (2.15), a nearly identical analysis holds for the moment functions from the moment conditions (2.15).

Now let j′ index a generic moment function:

m_{j′}(X, U, θ) = (1{U ≤ g(z_1′, x′)} − g(z_1′, x′)) 1{X = x′},

and let M_{j′} be the associated class of functions:

M_{j′} = {m_{j′}(·, u, θ) : X → R : (u, θ) ∈ U × Θ}.

Note this class indexes the moment functions from the moment conditions (2.16). Also note that (z_1′, x′) are not arguments of the moment function, but are instead associated with the index j′.

We claim that there exists no set of size 2 shattered by M_{j′}, implying M_{j′} is a VC-subgraph class. We will prove this by way of contradiction. In particular, suppose that there exist two points x_1 and x_2, and values t_1, t_2 ∈ R, such that:

|{ (1{m_{j′}(x_1, u, θ) ≥ t_1}, 1{m_{j′}(x_2, u, θ) ≥ t_2}) : (u, θ) ∈ U × Θ }| = 4.   (C.42)

In other words, we suppose the set {x_1, x_2} is shattered by M_{j′}, and that t_1, t_2 ∈ R witness the shattering. We have:

m_{j′}(x_1, u, θ) = (1{u ≤ g(z_1′, x′)} − g(z_1′, x′)) 1{x_1 = x′},
m_{j′}(x_2, u, θ) = (1{u ≤ g(z_1′, x′)} − g(z_1′, x′)) 1{x_2 = x′}.

Now consider two cases:

1. 1{x_1 = x′} = 1{x_2 = x′} = 1: Then the two functions m_{j′}(x_1, u, θ) and m_{j′}(x_2, u, θ) are identical for all (u, θ) ∈ U × Θ. This means (C.42) is impossible, since at least one of the vectors (1, 0) and (0, 1) cannot be picked out by M_{j′}.

2. Either 1{x_1 = x′} = 0 or 1{x_2 = x′} = 0: In this case at least one of the functions m_{j′}(x_1, u, θ) or m_{j′}(x_2, u, θ) is the zero function. Again, this means (C.42) is impossible. For example, if m_{j′}(x_1, u, θ) is the zero function, then it is impossible for M_{j′} to pick out both (0, 0) and (1, 0).

Since x_1 and x_2 were arbitrary, we conclude that there exists no set of size 2 shattered by M_{j′}. This implies that M_{j′} is a VC-subgraph class, and using, for example, Theorem 2.6.7 in Van Der Vaart and Wellner (1996), we can deduce:

sup_{Q ∈ Q_n} log N(ε, M_{j′}, ‖·‖_{Q,2}) = O(1).

Thus, M_{j′} easily satisfies the entropy growth condition. Given the relation between the moment functions from (2.16) and (2.17), a nearly identical analysis holds for the moment functions from the moment conditions (2.17).

Now let j′′ index a generic moment function:

m_{j′′}(Z, θ) = t(z_1′, x′) − 1{Z_1 = z_1′, X = x′},

and let M_{j′′} be the associated class of functions:

M_{j′′} = {m_{j′′}(·, θ) : Z → R : θ ∈ Θ}.

Note this class indexes the moment functions from the moment conditions (2.18). Also note that (z_1′, x′) are not arguments of the moment function, but are instead associated with the index j′′.

We claim that there exists no set of size 3 shattered by M_{j′′}, implying M_{j′′} is a VC-subgraph class. To see this, note that for any three points {z_1, z_2, z_3}, writing z_i = (z_{1,i}, x_i), we have:

m_{j′′}(z_1, θ) = t(z_1′, x′) − 1{z_{1,1} = z_1′, x_1 = x′},
m_{j′′}(z_2, θ) = t(z_1′, x′) − 1{z_{1,2} = z_1′, x_2 = x′},
m_{j′′}(z_3, θ) = t(z_1′, x′) − 1{z_{1,3} = z_1′, x_3 = x′}.

The conclusion follows from the fact that two of these moment functions must always be the same. This implies that M_{j′′} is a VC-subgraph class, and using, for example, Theorem 2.6.7 in Van Der Vaart and Wellner (1996), we can deduce:

sup_{Q ∈ Q_n} log N(ε, M_{j′′}, ‖·‖_{Q,2}) = O(1).

Thus, M_{j′′} easily satisfies the entropy growth condition.
Given the relation between the moment functions from (2.18) and (2.19), a nearly identical analysis holds for the moment functions from the moment conditions (2.19).

Finally, let j′′′ index a generic moment function:

m_{j′′′}(Z, U_d, θ) = U_d ( 1{Z_1 = z_1′, X = x′} Σ_{z_1 ∈ Z_1} t(z_1, x′) − 1{X = x′} t(z_1′, x′) ),

and let M_{j′′′} be the associated class of functions:

M_{j′′′} = {m_{j′′′}(·, u_d, θ) : Z → R : (u_d, θ) ∈ [Y_L, Y_U] × Θ}.

Note this class indexes the moment functions from the moment conditions (2.20). Also note that (z_1′, x′) are not arguments of the moment function, but are instead associated with the index j′′′.

We claim that there exists no set of size 5 shattered by M_{j′′′}, implying M_{j′′′} is a VC-subgraph class. To see this, note that for any five points {z_1, z_2, z_3, z_4, z_5}, writing z_i = (z_{1,i}, x_i), we have for each i = 1, ..., 5:

m_{j′′′}(z_i, u_d, θ) = u_d ( 1{z_{1,i} = z_1′, x_i = x′} Σ_{z_1 ∈ Z_1} t(z_1, x′) − 1{x_i = x′} t(z_1′, x′) ).

The conclusion follows from the fact that two of these moment functions must always be identical for all θ. This implies that M_{j′′′} is a VC-subgraph class, and using, for example, Theorem 2.6.7 in Van Der Vaart and Wellner (1996), we can deduce:

sup_{Q ∈ Q_n} log N(ε, M_{j′′′}, ‖·‖_{Q,2}) = O(1).

Thus, M_{j′′′} easily satisfies the entropy growth condition.
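Each of the moment classes above satisfies the entropy growth condition, which is what delivers uniform convergence of empirical moments at the n^{−1/2} rate. As a minimal illustration with the simplest VC class (one-dimensional thresholds; the Dvoretzky-Kiefer-Wolfowitz bound invoked in the comment is a standard fact, not a result of this paper):

```python
import random

# For the threshold class {u -> 1{u <= t}}, the uniform deviation
# sup_t |F_n(t) - F(t)| shrinks at the n^{-1/2} rate.
random.seed(1)
n = 10_000
sample = sorted(random.random() for _ in range(n))  # U ~ Uniform[0, 1], F(t) = t
# Kolmogorov-Smirnov statistic via order statistics of the sorted sample.
sup_dev = max(max(abs((i + 1) / n - u), abs(i / n - u)) for i, u in enumerate(sample))
assert sup_dev < 3 / n ** 0.5  # DKW: P(sup_dev > 3 n^{-1/2}) <= 2 exp(-18)
```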