Optimal Decision Rules for Weak GMM
By Isaiah Andrews and Anna Mikusheva

Abstract
This paper derives the limit experiment for nonlinear GMM models with weak and partial identification. We propose a theoretically-motivated class of default priors on a nonparametric nuisance parameter. These priors imply computationally tractable Bayes decision rules in the limit problem, while leaving the prior on the structural parameter free to be selected by the researcher. We further obtain quasi-Bayes decision rules as the limit of sequences in this class, and derive weighted average power-optimal identification-robust frequentist tests. Finally, we prove a Bernstein-von Mises-type result for the quasi-Bayes posterior under weak and partial identification.

Keywords: Limit Experiment, Quasi Bayes, Weak Identification, Nonlinear GMM
JEL Codes: C11, C12, C20
This draft: July 2020.
Weak and partial identification arise in a wide range of empirical settings. The problem of weak identification in linear IV is well-studied, but much less is known about weak identification in nonlinear models. In particular, while there is clear evidence of identification problems in some nonlinear applications, with objective functions that have multiple minima or are close to zero over non-trivial regions of the parameter space, there are not yet commonly-accepted methods for detecting weak identification. Even less is known about optimality: while there are some results on optimal tests for parameters in weak IV settings (e.g. D. Andrews, Moreira, and Stock 2006, Moreira and Moreira 2019), much less is known about optimal inference, let alone optimal estimation or decision-making, in nonlinear models.

(Author contacts: Andrews, Harvard Department of Economics, Littauer Center M18, Cambridge, MA 02138, email [email protected]; support from the National Science Foundation under grant number 1654234 and from the Sloan Research Fellowship is gratefully acknowledged. Mikusheva, Department of Economics, M.I.T., 50 Memorial Drive, E52-526, Cambridge, MA 02142, email [email protected]; financial support from the Castle-Krob Career Development Chair and the Sloan Research Fellowship is gratefully acknowledged.)

This paper develops a theory of optimality under weak and partial identification in nonlinear GMM. We first derive the limit experiment for weakly identified GMM. We then study Bayes decision rules in the limit problem and propose a theoretically-motivated class of priors that implies computationally tractable decision rules. This class yields the quasi-Bayes decision rules studied by Kim (2002) and Chernozhukov and Hong (2003) as their diffuse-prior limit. We further prove a Bernstein-von Mises-type result establishing the asymptotic properties of quasi-Bayes under weak and partial identification.

Our results show that the quasi-Bayes approach has a number of appealing properties regardless of the identification status of the model. Kim (2002) suggested the quasi-Bayes approach based on maximum entropy arguments, while Chernozhukov and Hong (2003) discussed it as a computational device for strongly-identified, point-identified settings, where they showed that quasi-Bayes procedures are asymptotically equivalent to optimally-weighted GMM, and so are efficient in the usual sense. We show that quasi-Bayes is the limit of a sequence of Bayes decision rules for theoretically motivated priors even under weak identification. In addition to quasi-Bayes decision rules, we also derive new weighted average power-optimal, identification-robust frequentist tests.

There are three main results in this paper. The first derives the limit experiment for weakly and partially identified models, laying the foundation for our analysis of optimality. The observation in the limit experiment is a Gaussian process corresponding to the normalized sample average of the GMM moments. Consistent with the semiparametric nature of GMM, the parameter space is infinite-dimensional. The parameter consists of the structural parameter (i.e. the parameter that enters the GMM moments) and the non-parametric mean function of the moments, which lies in a reproducing kernel Hilbert space (RKHS). For convex loss functions, a complete class theorem from Brown (1986) implies that all admissible decision rules in the limit experiment are the pointwise limits of Bayes decision rules, so we focus our attention on Bayesian approaches. (Kaji (2020) studies weakly identified parameters in semiparametric models, and introduces a notion of weak efficiency for estimators. Weak efficiency is necessary, but not in general sufficient, for decision-theoretic optimality, e.g. admissibility, in many contexts.)
Suppose we observe a sample of independent and identically distributed observations {X_i, i = 1, ..., n} from an unknown distribution P*. The true structural parameter value θ* ∈ Θ satisfies the moment equality E_{P*}[φ(X, θ*)] = 0, for φ(·,·) a known function of the data and parameters with φ(x, θ) ∈ R^k. We aim to choose an action a ∈ A, and will incur a loss L(a, θ*) that depends only on a and the structural parameter θ*.

We are interested in settings where identification is weak, in the sense that the mean of the moment function E_{P*}[φ(X, θ)] is close to zero relative to sampling uncertainty, or exactly zero, over a non-trivial part of the parameter space Θ. To obtain asymptotic approximations that reflect this, we adopt a nonparametric version of weak identification asymptotics and model the data generating process as local to identification failure. Specifically, we assume that the true distribution P* is close to some (unknown) distribution P under which the identified set for the structural parameter,

Θ₀ = {θ ∈ Θ : E_P[φ(X, θ)] = 0},

contains at least two distinct elements, and further assume that θ* ∈ Θ₀. (The more general assumption that θ* is local to Θ₀ yields a limit experiment similar to that derived below, at the cost of heavier notation. Hence, we focus on the case with θ* ∈ Θ₀.) To derive results that reflect proximity to identification failure, we embed P* in a sequence of distributions P_{n,f} converging to P in the sense that

∫ [√n (dP_{n,f}^{1/2} − dP^{1/2}) − (1/2) f dP^{1/2}]² → 0   (1)

as n → ∞, where P* = P_{n,f} for the observed sample size n. The measurable function f in equation (1) is called the score, and (1) implies that E_P[f(X)] = 0 and that E_P[f(X)²] is finite (see Van der Vaart and Wellner 1996, Lemma 3.10.10). Denote the space of score functions by T(P), and note that this is a linear subspace of L₂(P).

While θ* is the structural parameter of interest, it does not fully describe the distribution of the data, even in large samples. Instead, asymptotic behavior under P_{n,f} is governed by the score f. Identifying information about θ* then comes from the fact that not all elements of T(P) are consistent with a given θ*. Specifically, one can show that the scaled sample average of the moments has (asymptotic) mean zero at θ* under P_{n,f} if and only if E_P[f(X) φ(X, θ*)] = 0. Correspondingly, define the sub-space of scores consistent with θ* as T_{θ*}(P) = {f ∈ T(P) : E_P[f(X) φ(X, θ*)] = 0}. We are now equipped to define the finite sample statistical experiment.
Definition 1
The finite sample experiment for sample size n, E*_{n,P}, corresponds to observing an i.i.d. sample of random variables X_i, i = 1, ..., n, distributed according to P_{n,f}, with parameter space {(θ*, f) : θ* ∈ Θ, f ∈ T_{θ*}(P)}.

Note that the parameter space for this experiment is infinite-dimensional, consistent with the semiparametric nature of the GMM model. We next introduce two running examples, based on linear and quantile IV respectively. While our focus is on nonlinear models, we include the linear IV example to illustrate the implications of our approach in a more familiar setting.
Example 1. Linear IV.
Assume that the observed data X_i = (Y_i, W_i, Z_i′)′ consist of an outcome variable Y, a scalar regressor W, and a k-dimensional instrument Z. The moment function is φ(X, θ) = Z(Y − θW). Let P be a distribution such that E_P[ZY] = E_P[ZW] = 0, so the mean of the moments is identically zero under P and the structural parameter is unidentified. We model the true distribution of the data as part of a sequence P_{n,f} local to P in the sense of (1). (Prior work by Kaji (2020) also analyzes weak identification using paths of the form (1).) This implies that there exists a k × 1 vector δ (which depends on the score f) such that

(ζ_{1,n}′, ζ_{2,n}′)′ = ( (1/√n) Σ_{i=1}^n Z_i′ Y_i, (1/√n) Σ_{i=1}^n Z_i′ W_i )′ ⇒ (ζ₁′, ζ₂′)′ ∼ N((θ* δ′, δ′)′, Σ_R).

Here Σ_R is a (2k) × (2k) reduced-form covariance matrix that is consistently estimable and unaffected by f. We assume for simplicity that Σ_R is full rank, but impose no other restrictions, and so allow heteroskedastic errors. Hence, in this setting our approach nests the weak-instrument asymptotics introduced by Staiger and Stock (1997). □
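To make the weak-instrument embedding concrete, the following sketch simulates a linear IV design whose first stage drifts to zero at a √n rate and forms the normalized moments ζ_{1,n}, ζ_{2,n} together with an estimate of Σ_R. It is our own minimal illustration: the design constants (C, ρ, standard normal instruments) are illustrative assumptions and not part of the paper.

```python
import numpy as np

def simulate_weak_iv(n, k, theta_star, C, rho=0.5, seed=0):
    """Simulate linear IV data with first stage local to zero: delta_n = C / sqrt(n)."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, k))                       # instruments
    u, e = rng.normal(size=n), rng.normal(size=n)
    v = rho * u + np.sqrt(1 - rho**2) * e             # endogeneity via corr(u, v) = rho
    delta_n = C / np.sqrt(n)                          # drifting first stage
    W = Z @ (delta_n * np.ones(k)) + v
    Y = theta_star * W + u
    return Y, W, Z

n, k, theta_star = 10_000, 3, 1.0
Y, W, Z = simulate_weak_iv(n, k, theta_star, C=2.0)

# Normalized sample moments: zeta_{1,n} = n^{-1/2} sum Z_i Y_i, zeta_{2,n} = n^{-1/2} sum Z_i W_i.
zeta1 = Z.T @ Y / np.sqrt(n)
zeta2 = Z.T @ W / np.sqrt(n)

# Consistent estimate of the (2k) x (2k) reduced-form covariance Sigma_R from the
# stacked per-observation moments (Z_i Y_i, Z_i W_i).
moments = np.hstack([Z * Y[:, None], Z * W[:, None]])
Sigma_R_hat = np.cov(moments, rowvar=False)

print(zeta1, zeta2)   # approximately N((theta* delta', delta')', Sigma_R), delta = C * 1_k
```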
Example 2. Quantile IV.
Consider the moment condition

φ(X, α, β) = (I{Y − α − W′β ≤ 0} − 1/2) Z,   (2)

introduced by Chernozhukov and Hansen (2005). The observed data X_i = (Y_i, W_i, Z_i) consist of an outcome Y, a (p−1)-dimensional vector of endogenous regressors W, and a k-dimensional vector of instruments Z. The structural parameters θ = (α, β) lie in a set Θ ⊂ R^p. A variety of different distributions P give rise to non-trivial identified sets Θ₀ in this model. Correspondingly, there are many ways weak identification may arise. For this example, suppose that the first element of Z is a constant, while the remaining elements of Z can be written as the element-wise product U · Z*, for U a (k−1)-dimensional mean-zero random vector independent of (Y, W, Z*) and Z* a potentially informative, but unobserved, instrument. In this setting, the last k−1 elements of E_P[φ(X, α, β)] are identically zero on Θ, while the first element of E_P[φ(X, α, β)] is zero if and only if α is equal to the median of Y − W′β. Hence the identified set under P is Θ₀ = {θ = (α, β) ∈ Θ : α = median_P(Y − W′β)}. □

This section shows that in order to construct asymptotically optimal decision rules for weakly identified GMM, it suffices to derive optimal decision rules in a limit experiment. This limit experiment corresponds to observing a Gaussian process g(·) with unknown mean function m(·) and known covariance function Σ(·,·), where θ* satisfies m(θ*) = 0. Intuitively, g(·) corresponds to the scaled sample average of the moments, since, as we discuss below, under mild regularity conditions

(1/√n) Σ_{i=1}^n φ(X_i, ·) ⇒ g(·) ∼ GP(m, Σ)   (3)

on Θ under P_{n,f}, where m(·) = E_P[f(X) φ(X, ·)], Σ(θ₁, θ₂) = E_P[φ(X, θ₁) φ(X, θ₂)′], and Σ is consistently estimable.

To derive the limit experiment, we first discuss the parameter space for the mean function m(·) and its connection to the space of scores f. We next discuss a standard non-parametric limit experiment, which we then use to derive our Gaussian process limit experiment. At each stage, we follow the usual limits-of-experiments approach and relate the experiments studied in terms of the attainable risk functions.
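The objects in (3) are straightforward to compute from data. The following sketch (our own illustration; the names phi, g_n, and Sigma_hat are ours) evaluates the normalized sample moment and the sample covariance function for the quantile IV moments in (2).

```python
import numpy as np

def phi(Y, W, Z, alpha, beta):
    """Quantile IV moment (1{Y - alpha - W'beta <= 0} - 1/2) Z, one row per observation."""
    resid = Y - alpha - W @ np.atleast_1d(beta)
    return (np.where(resid <= 0, 1.0, 0.0) - 0.5)[:, None] * Z   # shape (n, k)

def g_n(Y, W, Z, theta):
    """Normalized sample moment g_n(theta) = n^{-1/2} sum_i phi(X_i, theta)."""
    m = phi(Y, W, Z, *theta)
    return m.sum(axis=0) / np.sqrt(len(Y))

def Sigma_hat(Y, W, Z, theta1, theta2):
    """Sample covariance function estimate of Sigma(theta1, theta2)."""
    m1, m2 = phi(Y, W, Z, *theta1), phi(Y, W, Z, *theta2)
    m1 = m1 - m1.mean(axis=0)
    m2 = m2 - m2.mean(axis=0)
    return m1.T @ m2 / len(Y)
```

Evaluating g_n and Sigma_hat on a grid of parameter values gives the plug-in counterparts of g and Σ used throughout this section.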
Functional parameter space. Consider the set of R^k-valued functions of the form Σ_{j=1}^s Σ(·, θ_j) b_j : Θ → R^k, defined for any finite set of constant vectors {b_j} ⊂ R^k, parameters {θ_j} ⊂ Θ, and the covariance function Σ(·,·). Define a scalar product on this set by

⟨ Σ_{j=1}^s Σ(·, θ_j) b_j, Σ_{l=1}^{s̃} Σ(·, θ̃_l) c_l ⟩_H = Σ_{j=1}^s Σ_{l=1}^{s̃} b_j′ Σ(θ_j, θ̃_l) c_l.

Definition 2
The Reproducing Kernel Hilbert Space (RKHS) H associated with Σ is the completion of the space spanned by functions of the form Σ_{j=1}^s Σ(·, θ_j) b_j with respect to the scalar product ⟨·,·⟩_H.

Let H* be the completion of the space spanned by scores of the form f(X) = Σ_{j=1}^s φ(X, θ_j)′ b_j in L₂(P). This is a linear subspace of T(P). Let (H*)⊥ be the orthogonal complement to H* in T(P). For any score f ∈ T(P), denote by f* and f⊥ its projections onto H* and (H*)⊥, respectively.

Lemma 1
Define a linear transformation mapping scores f ∈ T(P) to functions m(·):

m(·) = E_P[f(X) φ(X, ·)].   (4)

The image of T(P) under this transformation is H. The null space of this transformation is (H*)⊥. The transformation (4) restricted to H* establishes an isomorphism between H and H*. In particular, for any two f₁, f₂ ∈ H* and the corresponding m₁(·), m₂(·), we have ⟨m₁, m₂⟩_H = E_P[f₁(X) f₂(X)].

Hence, all mean functions corresponding to scores f ∈ T(P) lie in H, and all mean functions in H correspond to some f ∈ T(P). The correspondence between scores and mean functions is many-to-one, however, as all scores with the same projection f* onto H* imply the same mean function.

Definition of limit experiments.
We next introduce two limit experiments. The first, E*_∞, is a variant of a Gaussian sequence experiment discussed in Van der Vaart (1991), adapted to incorporate the moment restriction. The second, E*_GP, is our final goal. Let {ϕ_j} be a complete orthonormal basis in T(P), and let H_{θ*} = {m ∈ H : m(θ*) = 0} denote the subset of H with a zero at θ*.

Definition 3
The limit experiment E*_∞ corresponds to observing the (infinite) sequence of independent random variables W_j ∼ N(E_P[f(X) ϕ_j(X)], 1), with parameter space {(θ*, f) : θ* ∈ Θ, f ∈ T_{θ*}(P)}.

Definition 4
The Gaussian process experiment E*_GP corresponds to observing a Gaussian process g(·) ∼ GP(m(·), Σ) with known covariance function Σ(·,·), unknown mean m, and parameter space {(θ*, m) : θ* ∈ Θ, m ∈ H_{θ*}}.

The parameter space in E*_∞ is the same as in the finite-sample experiment, while by Lemma 1 the parameter space in E*_GP is smaller. In all experiments the true value θ* corresponds to a zero of the moment function, in the sense that E_P[f(X) φ(X, θ*)] = 0 or m(θ*) = 0.

Attainable risk functions.
Following the literature on limits of experiments (cf. Le Cam 1986), we will compare the experiments described above in terms of attainable risk functions. We begin with an asymptotic representation theorem.
Lemma 2 (Theorem 3.1 in Van der Vaart (1991)) Consider a sequence of statistics S_n which has a limit distribution under E*_{n,P}, in the sense that under any P_{n,f} for f ∈ T(P), S_n(X₁, ..., X_n) ⇒ S_f as n → ∞. Assume there exists a complete separable set S such that P(S_f ∈ S) = 1 for all f ∈ T(P). Then in the experiment E*_∞ there exists a (possibly randomized) statistic S* = s*({W_j}, U), for a random variable U ∼ U[0,1] independent of {W_j}, such that S* ∼ S_f under f for all f ∈ T(P).

Hence, the set of attainable limit distributions, and thus risk functions, in E*_∞ nests that in the finite sample experiment E*_{n,P} asymptotically.

Corollary 1 If L(a, θ*) is bounded and continuous in a for all θ* ∈ Θ, and the sequence of decision rules S_n satisfies the conditions of Lemma 2, then there exists a statistic S* in the limit experiment E*_∞ such that lim_{n→∞} E_{P_{n,f}}[L(S_n, θ*)] = E_f[L(S*, θ*)] for all {(θ*, f) : θ* ∈ Θ, f ∈ T_{θ*}(P)}.

We next compare the experiments E*_∞ and E*_GP. By Lemma 1 any score can be written as f = f* + f⊥, where f⊥ ∈ (H*)⊥ has no effect on the mean function. We can re-write the parameter space of the limit experiment E*_∞ as a Cartesian product

{θ* ∈ Θ, f = (f*, f⊥) ∈ T_{θ*}(P)} = {θ* ∈ Θ, f* ∈ H*_{θ*}} × {f⊥ ∈ (H*)⊥}

for H*_{θ*} = {f ∈ H* : E_P[f(X) φ(X, θ*)] = 0}. The parameter f⊥ is unrelated to the structural parameter θ*, and the restriction of the experiment E*_∞ that fixes this parameter is equivalent to the experiment E*_GP.

Theorem 1
Fix any f⊥ ∈ (H*)⊥. For any statistic S* in E*_∞, there exists a (possibly randomized) statistic S in E*_GP such that for all f* ∈ H*, the distribution of S* under (f*, f⊥) is the same as that of S under m(·) = E_P[f*(X) φ(X, ·)]. Identifying f* and m, the set of risk functions {E_{(f*, f⊥)}[L(S*, θ*)] : θ* ∈ Θ, f* ∈ H*_{θ*}} in E*_∞ is equal to the set of risk functions {E_m[L(S, θ*)] : θ* ∈ Θ, m ∈ H_{θ*}} in E*_GP.

The idea of holding f⊥ fixed is similar to the "slicing" argument of Hirano and Porter (2009). Specifically, f⊥ is a nuisance parameter that neither interacts with the parameter of interest θ*, nor enters the loss function. Thus, to derive optimal procedures it suffices to study optimality holding f⊥ fixed, which in turn implies equivalence with the simpler Gaussian process experiment E*_GP. Hence, the performance of optimal decision rules in E*_GP bounds asymptotic performance in E*_{n,P}. Thus, if a sequence of decision rules S_n has risk converging to an optimal risk function in E*_GP, it must be asymptotically optimal. Theorem 1 gives a criterion which may be checked to verify asymptotic optimality.

In many cases, a plug-in approach further suggests the form of an asymptotically optimal rule. Suppose we know an optimal decision rule S = s(g(·), Σ, U) in the Gaussian process experiment E*_GP, where we now make dependence on the covariance function explicit. If a uniform central limit theorem holds under P and we have a consistent estimator Σ̂ for Σ (e.g. the sample covariance function Σ̂(θ, θ̃) = Ĉov(φ(X_i; θ), φ(X_i; θ̃))), then Le Cam's third lemma implies that the weak convergence (3) holds and Σ̂ remains consistent under P_{n,f}. Hence, provided s(g(·), Σ, U) is almost-everywhere continuous in (g(·), Σ, U), the Continuous Mapping Theorem implies that

S_n = s( (1/√n) Σ_{i=1}^n φ(X_i, ·), Σ̂, U ) ⇒ s(g(·), Σ, U)

under P_{n,f}, so the sequence of rules S_n is asymptotically optimal so long as the loss satisfies the conditions of Corollary 1.

The idea of solving the limit problem in order to derive asymptotically optimal decision rules is of course not new (see e.g. Le Cam 1986). More recently, Mueller (2011) proposed an alternative approach to derive asymptotically optimal tests based on weak convergence conditions like (3). Relative to the approach of Mueller (2011) applied to our setting, the benefits of Theorem 1 are (i) to show that there is, in a sense, no asymptotic information loss from limiting attention to the sample average of the moments, and (ii) the ability to consider general decision problems in addition to tests. The weak convergence (3) was the starting point of Andrews and Mikusheva (2016), where we proposed a general approach to constructing identification-robust, but not necessarily optimal, tests.

Example 1. Linear IV (continued).
The Gaussian process experiment corresponds to observing the linear-in-θ process g(θ) = ζ₁ − θζ₂. For Σ_{ij} the k × k sub-blocks of the (2k) × (2k) covariance matrix Σ_R, this process has covariance function

Σ(θ₁, θ₂) = E_P[(Y − θ₁W) ZZ′ (Y − θ₂W)] = Σ₁₁ − θ₁Σ₂₁ − θ₂Σ₁₂ + θ₁θ₂Σ₂₂.

The corresponding RKHS consists of R^k-valued linear functions of θ. Theorem 1 implies that the performance of optimal decision rules in the limit experiment bounds the attainable asymptotic performance. Moreover, given an optimal rule in the limit experiment we can construct an asymptotically optimal rule by plugging in ζ_{1,n} = (1/√n) Σ_{i=1}^n Z_i Y_i, ζ_{2,n} = (1/√n) Σ_{i=1}^n Z_i W_i, and a covariance estimator Σ̂_R, in place of ζ₁, ζ₂, and Σ_R. □

Example 2. Quantile IV (continued). Our analysis in this section treats the identified set Θ₀ under P as known, and limits attention to θ ∈ Θ₀. Under P_{n,f} the scaled sample average of the moments converges as a process on Θ₀. Specifically,

g_n(β) = (1/√n) Σ_{i=1}^n ( I{Y_i − α(β) − W_i′β ≤ 0} − 1/2 ) Z_i ⇒ g(β), g(·) ∼ GP(m, Σ),

where α(β) = median_P(Y − W′β). According to Theorem 1, to derive asymptotically optimal decision rules in this setting, it suffices to derive optimal decision rules based on observing g(·). We cannot calculate g_n(β) in practice, since P, and thus α(β) and Θ₀, are unknown. Section 4 discusses feasible procedures, showing that they implicitly estimate Θ₀ based on a subset of the moments and behave like their Gaussian process experiment analogs based on the remaining moments. □

Theorem 1 allows us to characterize the class of asymptotically admissible decision rules. A decision rule s(g) in the experiment E*_GP is admissible if there exists no rule s′(g) with weakly lower risk for all parameter values, and strictly lower risk for some. The infinite-dimensional parameter space for E*_GP puts it beyond the scope of many complete class theorems (theorems characterizing the set of admissible rules), but for convex loss functions a result from Brown (1986) applies.

Theorem 2 (Brown, 1986) Suppose that A is closed, with A ⊆ R^{d_a} for some d_a, that L(a, θ) is continuous and strictly convex in a for every θ, and that either A is bounded or lim_{‖a‖→∞} L(a, θ) = ∞. Then for every admissible decision rule s in E*_GP there exists a sequence of priors π_r and corresponding Bayes decision rules s_{π_r},

∫ E_{θ*,m}[L(s_{π_r}(g), θ*)] dπ_r(θ*, m) = min_{s̃} ∫ E_{θ*,m}[L(s̃(g), θ*)] dπ_r(θ*, m),

such that s_{π_r}(g) → s(g) as r → ∞ for almost every g.

Hence, for convex loss functions all admissible decision rules in the limit experiment are pointwise limits of Bayes decision rules.
Priors for the GMM Limit Problem
The previous section shows that we can reduce the search for asymptotically optimal decision rules to a search for optimal rules based on the Gaussian process g(·) ∼ GP(m, Σ), with a known covariance function Σ(·,·) and an unknown mean function m such that m(θ*) = 0. Motivated by the complete class result in Theorem 2, we concentrate our attention on Bayes decision rules.

The parameter in the limit experiment consists of the finite-dimensional parameter of interest θ* and the infinite-dimensional nuisance parameter m that determines the identification status of θ*. Researchers may have prior information about θ*, but it seems impractical to elicit priors about the infinite-dimensional parameter m. We thus aim to propose a class of default priors on the infinite-dimensional component.

We proceed in three steps. First, we re-parameterize the limit experiment to further separate the structural parameter θ* from the infinite-dimensional nuisance parameter. Second, we leave the choice of prior on θ* free, and consider a class of Gaussian process priors on the infinite-dimensional parameter which lead to tractable decision rules. Third, we argue that it is natural to impose a particular invariance property for default priors, and find that this restriction dramatically reduces the class of candidate priors. This leads to our suggested default priors.

We assume from this point on that Θ is compact and Σ is continuous. By Corollary 4 of Berlinet and Thomas-Agnan (2004), this implies that H is a separable space of continuous functions. Lemma 1.3.1 of Adler and Taylor (2007) implies we can take g to be everywhere continuous almost surely.

The parameter space {(θ*, m) : θ* ∈ Θ, m ∈ H_{θ*}} requires that for fixed θ*, the mean function m must lie in the linear subspace H_{θ*}. Hence, the marginal parameter space for m, leaving θ* unrestricted, is the subset of H with at least one zero on Θ. This set is highly non-convex, as it is easy to find pairs of functions, each of which has a zero, such that the average has no zeros.

To simplify the construction of the prior, we re-parameterize the model to disentangle θ* and the infinite-dimensional nuisance parameter. Our reparameterization is based on what we term an anchor functional. Denote by C the space of continuous functions from Θ to R^k, and let A be a linear functional from C to R^k. Let G(·) = g(·) − m(·) denote the mean-zero Gaussian process with covariance function Σ. The regression of the process G on the anchor A(G) defines a Pettis integral

ψ(·) = [ψ₁(·), ..., ψ_k(·)] = E[G(·) A(G)′] (E[A(G) A(G)′])^{-1} ∈ H^k,

where each column is again a function in H (see Van der Vaart and Van Zanten, 2008, for discussion). Since ψ(·) depends only on Σ and A, it is known in the limit experiment. An example of an anchor functional is the point-evaluation functional at a point θ₀ ∈ Θ, A(G) = G(θ₀). For this anchor ψ(·) = Σ(·, θ₀) Σ(θ₀, θ₀)^{-1}.

Let H_µ be the linear sub-space of H orthogonal to {ψ₁(·), ..., ψ_k(·)}. Assume that the k × k matrix ψ(θ) has full rank for all θ ∈ Θ. For each m(·) ∈ H, define µ(·) to be the projection of m on the linear sub-space H_µ. The properties of Pettis integrals imply that ⟨ψ, m⟩_H = A(m) and ⟨ψ, ψ⟩_H = I_k, which yields the orthogonal decomposition m(·) = µ(·) + ψ(·) A(m). For any θ* and m ∈ H_{θ*}, m(θ*) = 0, so A(m) = −[ψ(θ*)]^{-1} µ(θ*). We can consequently re-write m(·) as a function of (θ*, µ),

m(·) = µ(·) − ψ(·) [ψ(θ*)]^{-1} µ(θ*).

This establishes a one-to-one correspondence between {(θ*, m) : θ* ∈ Θ, m ∈ H_{θ*}} and (θ*, µ) ∈ Θ × H_µ. Hence, the transformation from (θ*, m) to (θ*, µ) is a reparameterization of the model. The parameter space in the reparameterized model is a Cartesian product, Θ × H_µ. Moreover, H_µ is the RKHS generated by the covariance function

Σ̃(θ₁, θ₂) = Σ(θ₁, θ₂) − ψ(θ₁) E[A(G) A(G)′] ψ(θ₂)′,

and thus is a linear space. The combination of Cartesian product structure and linearity for the infinite-dimensional component greatly simplifies the task of constructing priors.

There is a stochastic decomposition associated with this re-parameterization. Define the random vector ξ and stochastic process h, respectively, by

ξ = A(g) and h(·) = g(·) − ψ(·) ξ.   (5)

By construction ξ ∼ N(A(m), Σ_ξ) for Σ_ξ = E[A(G) A(G)′], while h(·) ∼ GP(µ, Σ̃). Moreover, ξ and h are jointly normal and uncorrelated, and therefore independent. Note that the distribution of h(·) does not depend on θ*. In Andrews and Mikusheva (2016) we showed that when A is the point evaluation functional at θ₀, h(·) is a sufficient statistic for the nuisance parameter in the problem of testing H₀: θ* = θ₀.
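For the point-evaluation anchor A(g) = g(θ₀), the decomposition (5) is easy to compute once g and Σ are discretized on a grid. The sketch below is our own minimal illustration under that discretization assumption (all names are ours); it returns ξ, h, ψ, and the covariance Σ̃ of h.

```python
import numpy as np

def anchor_decomposition(g_grid, Sigma_grid, idx0):
    """Decompose g into (xi, h) for the point-evaluation anchor A(g) = g(theta_0).

    g_grid:     (T, k) values of the process g on a grid of T parameter values
    Sigma_grid: (T, T, k, k) covariance function on the grid, Sigma_grid[a, b] = Sigma(theta_a, theta_b)
    idx0:       grid index of the anchor point theta_0
    """
    Sigma00 = Sigma_grid[idx0, idx0]                      # Sigma(theta0, theta0), k x k
    # psi(theta) = Sigma(theta, theta0) Sigma(theta0, theta0)^{-1}
    psi = Sigma_grid[:, idx0] @ np.linalg.inv(Sigma00)    # (T, k, k)
    xi = g_grid[idx0]                                     # anchor value, k-vector
    h = g_grid - psi @ xi                                 # h(theta) = g(theta) - psi(theta) xi
    # Sigma_tilde(t1, t2) = Sigma(t1, t2) - psi(t1) Sigma00 psi(t2)'
    Sigma_tilde = Sigma_grid - np.einsum('aij,jk,blk->abil', psi, Sigma00, psi)
    return xi, h, psi, Sigma_tilde
```

For this anchor, Σ_ξ = E[A(G)A(G)′] is simply Σ(θ₀, θ₀).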
The linear process g(θ) = ζ₁ − θζ₂ is a one-to-one transformation of the k × 1 vectors ζ₁ and ζ₂, where the mean vectors of ζ₁ and ζ₂ are proportional, with constant of proportionality θ*. For any anchor, the corresponding process h is also linear in θ, and can be fully described by a single k × 1 vector formed as a linear combination of ζ₁ and ζ₂. The mean µ of h is correspondingly described by a k × 1 vector µ̃, equal to the same linear transformation applied to the means of ζ₁ and ζ₂. See Section S1 of the Supplementary Appendix for details. Different choices of anchor yield different re-parameterizations (θ*, µ̃), where one natural option is to select the anchor that implies µ̃ equal to the first-stage coefficient. □

Example 2. Quantile IV (continued).
Many different anchor functionals may be used in this example. For instance, in many econometric applications β = 0 is a point of particular interest. Correspondingly, one could take the anchor to equal the point-evaluation functional at β = 0, A(g) = g(0). □

We next derive a class of default priors on the infinite-dimensional nuisance parameter µ, leaving the prior on θ* free to be specified based on context-specific knowledge. We seek priors on µ that (i) yield analytically and computationally tractable Bayes decision rules and (ii) behave reasonably when combined with many different priors on θ*.

Our specification of the prior is guided in part by the structure of the likelihood. The last section decomposed the observed process g(·) as g(·) = h(·) + ψ(·)ξ, for ξ = A(g). By construction ξ and h(·) are independent. Thus, the likelihood function ℓ(µ, θ*; g) based on the observed data g(·) factors as

ℓ(µ, θ*; g) = ℓ(µ, θ*; ξ) ℓ(µ; h),

where ℓ(µ, θ*; ξ) and ℓ(µ; h) are the likelihood functions based on ξ and h, with the latter depending only on µ but not on θ*. (All Gaussian processes with covariance function Σ and mean functions in H are mutually absolutely continuous, so we can define the likelihood with respect to any base measure in this class.) Since the loss function depends only on θ*, to derive Bayes decision rules it suffices to construct the marginal posterior distribution for θ*. For analytical tractability we consider only independent priors π(θ*)π(µ) on θ* and µ. Under such priors the marginal posterior for θ* is

π(θ* | ξ, h) = [π(θ*) ∫ ℓ(µ, θ*; ξ) ℓ(µ; h) dπ(µ)] / [∫∫ ℓ(µ, θ; ξ) ℓ(µ; h) dπ(µ) dπ(θ)] = [π(θ*) ∫ ℓ(µ, θ*; ξ) dπ(µ | h) f(h)] / [∫∫ ℓ(µ, θ; ξ) dπ(µ | h) dπ(θ) f(h)].

Here f(h) = ∫ ℓ(µ; h) dπ(µ) denotes the marginal density of h, and π(µ | h) the posterior for µ given h. Prior independence of θ* and µ ensures that f(h) does not depend on θ*, and so drops out. Thus, the posterior simplifies to

π(θ* | g) = π(θ*) ℓ*(θ*) / ∫ π(θ) ℓ*(θ) dθ, for ℓ*(θ) = ∫ ℓ(µ, θ; ξ) dπ(µ | h).   (6)

The Cartesian product structure of the parameter space Θ × H_µ is necessary for independent priors and so plays a crucial role in this result. The restriction to independent priors is made less stringent than it might appear by the freedom to choose the anchor functional A, which in turn determines the content of independence.

We further restrict attention to Gaussian process priors µ ∼ GP(0, Ω), where Ω(·,·) is a continuous covariance function. This allows us to exploit conjugacy results, greatly simplifying the form of the posterior. Specifically, ℓ*(θ*) is based on the distribution of ξ ∼ N(−[ψ(θ*)]^{-1} µ(θ*), Σ_ξ) conditional on the realization of h = µ + GP(0, Σ̃), where µ ∼ GP(0, Ω). For a Gaussian process prior on µ, ℓ*(θ*) corresponds to a Gaussian likelihood for the observation ξ, with mean given by the best linear predictor of ξ based on h. The solution to this linear prediction problem is obtained in Parzen (1962), and details appear in the proof of Theorem 3 stated below. See Berlinet and Thomas-Agnan (2004), Ch. 2.4, for a textbook treatment.

Invariance Restriction

The posterior (6) depends on the researcher-specified prior π(θ*) and the Gaussian likelihood ℓ*(θ*). The latter in turn depends on ξ, along with the best linear predictor for the vector ξ based on the process h. While the best linear predictor is mathematically well-defined, direct calculation involves infinite-dimensional objects and will often be practically unappealing. In most cases one would need to approximate the best linear predictor numerically, for instance using discretization or eigenvector expansions. See e.g. Parzen (1962) for discussion. A further challenge is that the form of the best linear predictor depends on the precise specification of the prior covariance function Ω. The space of such covariance functions is enormous, and it seems challenging to directly evaluate whether a given covariance function is reasonable or not.

To derive default priors, we thus take a different approach, and ask what choices of prior covariance Ω lead to decision rules with desirable properties. Since the prior on θ* may be specified based on application-specific knowledge, it is particularly important that a default prior on µ produce reasonable results when combined with many different choices of π(θ*). To this end, we require that if a researcher rules out some parameter values ex-ante, limiting the support of π(θ*), the implied Bayes decision rules should not depend on the behavior of the moments at the excluded parameter values.

Formally, we require that for priors π(θ*) with restricted support Θ̃ ⊂ Θ, Bayes decision rules based on the prior π(θ*)π(µ) should depend on the data only through ξ and the restriction of g to Θ̃. For this invariance property to hold for all possible priors π(θ*) and all loss functions L(a, θ*), however, it must be that ℓ*(θ) depends on the data only through (ξ, g(θ)) for all θ ∈ Θ. This restriction dramatically narrows the class of candidate covariance functions Ω.
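To see concretely why direct computation of ℓ*(θ) under a generic prior covariance Ω is unappealing, the following sketch implements the discretized calculation for scalar moments (k = 1): it conditions on h at every grid point via standard Gaussian updating and then integrates µ out of the likelihood of ξ. This is our own illustration of the computation described above, under a simple grid discretization. Note that the best linear predictor µ̂ uses the entire realized path of h, which is exactly the dependence that Theorem 3 below rules out for invariant priors.

```python
import numpy as np
from scipy.stats import norm

def l_star(theta_idx, xi, h_grid, psi_grid, Omega, Sigma_tilde, Sigma_xi):
    """Discretized integrated likelihood l*(theta) for scalar moments (k = 1).

    h_grid:      (T,) realized h on a grid of parameter values
    psi_grid:    (T,) psi evaluated on the grid
    Omega:       (T, T) prior covariance of mu on the grid
    Sigma_tilde: (T, T) covariance of h around mu on the grid
    Sigma_xi:    scalar variance of xi
    """
    # Conjugate Gaussian update of mu given h, computed on the grid:
    K = np.linalg.solve((Omega + Sigma_tilde).T, Omega.T).T   # Omega (Omega + Sigma_tilde)^{-1}
    mu_hat = K @ h_grid          # best linear predictor of mu: uses ALL grid points of h
    V = Omega - K @ Omega        # posterior covariance of mu
    # xi | mu, theta ~ N(-mu(theta)/psi(theta), Sigma_xi); integrating mu out gives:
    p = psi_grid[theta_idx]
    mean = -mu_hat[theta_idx] / p
    var = Sigma_xi + V[theta_idx, theta_idx] / p**2
    return norm.pdf(xi, loc=mean, scale=np.sqrt(var))
```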
Theorem 3
Consider the setting described above with a Gaussian process prior on µ, where the covariance function Ω is continuous. For all θ* ∈ Θ such that Σ̃(θ*, θ*) and Ω(θ*, θ*) have full rank, the integrated likelihood ℓ*(θ*) depends on the data only through (ξ, g(θ*)), or equivalently through (ξ, h(θ*)), if and only if

Ω(θ*, θ*)^{-1} Ω(θ*, θ) = Σ̃(θ*, θ*)^{-1} Σ̃(θ*, θ) for all θ ∈ Θ.   (7)

(We provide a formal invariance argument in Section S2 of the Supplementary Appendix.)

Condition (7) is closely related to whether H_Ω, the RKHS generated by Ω, coincides with H_µ, the parameter space for µ. It is natural to require that H_Ω ⊆ H_µ. On the other hand, covariance functions that imply H_Ω strictly contained in H_µ rule out parts of the parameter space a-priori, and, as discussed in Florens and Simoni (2012), this can be understood as a smoothing assumption. Specifically, if H_Ω is strictly contained in H_µ, then H_Ω can be expressed as the image of H_µ under an integral, or smoothing, operator related to the covariance functions Ω and Σ̃. In such cases, it is intuitive that the best linear predictor of µ(θ*) will smooth the realized process h, using values of the process at points other than θ*. Thus, it makes sense that the requirement to use only h(θ*) forces H_Ω = H_µ.

The next lemma shows that invariance generically reduces the class of candidate priors to a one-dimensional family, namely covariance functions Ω proportional to Σ̃. We first recall a definition. A linear subspace V ⊆ R^k is invariant for a linear operator L if for any v ∈ V we have Lv ∈ V. Invariant sub-spaces for a symmetric matrix L are the sub-spaces spanned by subsets of its eigenvectors.

Lemma 3
Fix some θ₀ ∈ Θ such that Σ̃(θ₀, θ₀) is full rank. Assume there does not exist a non-trivial (non-empty, but strictly smaller than R^k) linear subspace V ⊆ R^k that is invariant for the whole family of symmetric operators

D = { D(θ) = R(θ₀, θ) R(θ₀, θ)′, θ ∈ Θ : det(Σ̃(θ, θ)) > 0 },

where R(θ₀, θ) = Σ̃(θ₀, θ₀)^{-1/2} Σ̃(θ₀, θ) Σ̃(θ, θ)^{-1/2} is a correlation function. Then condition (7) is equivalent to Ω(·,·) = λΣ̃(·,·) for some λ > 0.

Two positive-definite matrices share a common invariant subspace if and only if several eigenvectors of one matrix span the same sub-space as several eigenvectors of the other. The set of matrix pairs that share a non-trivial invariant subspace is of lower dimension than the set of positive-definite matrix pairs, so generically (that is, everywhere but on a nowhere-dense subset) two positive-definite matrices share no non-trivial invariant subspace. For the condition of Lemma 3 to fail requires something still stronger, namely that the same subspace be invariant for a whole family of matrices indexed by θ. This often entails special structure on the moment conditions. Such structure arises naturally in some cases. For instance, suppose a researcher forms moments based on two independent datasets, where one dataset is used to form the first block of moments, while the other is used for the rest. In this case Σ will be block-diagonal, and will imply two orthogonal invariant sub-spaces that are common across all θ. If these are the only nontrivial invariant subspaces, the family of Ω satisfying condition (7) is two-dimensional, allowing a researcher to put different coefficients of proportionality on the two invariant sub-spaces.
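The block-diagonal case just described is easy to verify numerically. The sketch below (our own construction, with arbitrary illustrative dimensions) builds a block-diagonal Σ̃ over a small grid, forms Ω by scaling the two blocks by different constants, and checks that condition (7) holds, so a two-parameter family of invariant priors exists in this case.

```python
import numpy as np

def random_cov(T, k, seed):
    """Random positive-definite covariance over a T-point grid for a k-dim block."""
    A = np.random.default_rng(seed).normal(size=(T * k, T * k))
    return A @ A.T + np.eye(T * k)

T, k1, k2 = 5, 2, 2
k = k1 + k2
S1, S2 = random_cov(T, k1, 1), random_cov(T, k2, 2)   # two independent moment blocks

def assemble(l1, l2):
    """Covariance function on the grid with the two blocks scaled by l1, l2."""
    C = np.zeros((T * k, T * k))
    for a in range(T):
        for b in range(T):
            C[a*k:a*k+k1, b*k:b*k+k1] = l1 * S1[a*k1:(a+1)*k1, b*k1:(b+1)*k1]
            C[a*k+k1:(a+1)*k, b*k+k1:(b+1)*k] = l2 * S2[a*k2:(a+1)*k2, b*k2:(b+1)*k2]
    return C

Sigma_tilde, Omega = assemble(1.0, 1.0), assemble(2.0, 5.0)

# Condition (7) at theta* = grid point 0, for every theta = grid point b:
O00, S00 = Omega[:k, :k], Sigma_tilde[:k, :k]
ok = all(np.allclose(np.linalg.solve(O00, Omega[:k, b*k:(b+1)*k]),
                     np.linalg.solve(S00, Sigma_tilde[:k, b*k:(b+1)*k]))
         for b in range(T))
print(ok)   # True: block-wise scaling gives a two-parameter invariant family
```

Example 1. Linear IV (continued).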
Our analysis allows a researcher-selected marginal prior on θ* and an independent Gaussian prior N(0, Ω) on µ̃. Interestingly, in this setting our invariance condition imposes no restrictions, and any k × k covariance matrix Ω is allowed. This reflects the parametric and low-dimensional nature of the model: (ξ, h(θ*)) is an invertible linear transformation of (ζ₁, ζ₂), so ℓ*(θ*) necessarily depends on the data only through (ξ, h(θ*)).

Choosing the anchor such that µ̃ is the first-stage coefficient leads to independent priors on the structural and first-stage parameters (θ*, δ). The class of priors we consider therefore nests some independent priors discussed in the literature, including the MM1 prior of Moreira and Moreira (2019) and (taking the diffuse limit for Ω) the relatively invariant prior of Moreira and Ridder (2019). Our class of priors is wider than this, however, and also allows some forms of dependence between (θ*, δ). See Section S1 of the Supplementary Appendix for details. □

Example 2. Quantile IV (continued). Assume without loss of generality that Var(U) = I_{k−1} and denote Z̃′ = (1, Z*′). Absent further restrictions on the joint distribution of (Y, W, Z*), the covariance function

Σ(β, β̃) = E_P[ (I{Y − α(β) − W′β ≤ 0} − 1/2)(I{Y − α(β̃) − W′β̃ ≤ 0} − 1/2) Z̃Z̃′ ]

in general has no non-trivial invariant subspace and satisfies the assumptions of Lemma 3, so only the proportional prior yields invariant decision rules. However, if P imposes independence of Z* from (Y, W), then Σ(β, β̃) is the product of a scalar function of (β, β̃) with the k × k matrix E_P[Z̃Z̃′]. In this special case, invariance only determines Ω up to a k × k positive definite matrix. □

Bayes Decision Rules for Proportional Priors

Motivated by Theorem 3 and Lemma 3, we focus on proportional prior covariance functions, Ω(·,·) = λΣ̃(·,·). Bayes decision rules minimize the posterior risk,

s(g) = argmin_{a ∈ A} ∫_Θ L(a, θ*) π(θ* | g) dθ*   (8)

for almost every realization of the data (see e.g. Chapter 3 of Lehmann and Casella (1998)), for π(θ* | g) as defined in (6). Under proportional priors,

ℓ*(θ) = ℓ(θ; g, Σ, λ) = |Λ(θ)|^{-1/2} · exp( −(1/2) u(θ)′ Λ(θ)^{-1} u(θ) ),

where u(θ) = (λ/(1+λ)) ψ(θ)^{-1} g(θ) + (1/(1+λ)) ξ, Λ(θ) = (λ/(1+λ)) [ψ(θ)^{-1}] Σ(θ, θ) [ψ(θ)^{-1}]′ + (1/(1+λ)) Var(ξ), and ξ = A(g) is the value of the anchor functional applied to the process g. Hence, for proportional priors and a given choice of λ, the posterior distribution takes a simple form. Standard numerical techniques like Markov chain Monte Carlo can be used to sample from the posterior and implement Bayes decision rules using (8).

Intuitively, the constant of proportionality λ controls the strength of identification under the prior. When λ = 0 the prior implies that the mean function m is zero with probability one, so nothing can be learned from g and the posterior on θ* is equal to the prior. Under the diffuse limit, λ → ∞, by contrast, the prior implies that m diverges everywhere it is nonzero, and dominates sampling uncertainty. Note that

lim_{λ→∞} ℓ(θ; g, Σ, λ) = ℓ(θ; g, Σ, ∞) = |ψ(θ)| · |Σ(θ, θ)|^{-1/2} · exp( −(1/2) g(θ)′ Σ(θ, θ)^{-1} g(θ) ).   (9)

Hence, as λ → ∞, ℓ*(θ) converges to a transformation of the continuously updating GMM objective function, multiplied by factors that do not depend on g and so may be absorbed into the prior. Chernozhukov and Hong (2003) advocate the quasi-likelihood (9) (without the first two terms) as a computational device for point-identified, strongly-identified settings where Bayesian techniques are more tractable than minimization, and show that the resulting estimators are asymptotically efficient. We obtain the same quasi-likelihood as a diffuse-prior limit in a setting that allows for weak and partial identification. Since Bayes decision rules with respect to full-support priors are admissible under mild conditions, quasi-Bayes decision rules based on (9) are the limit of a sequence of admissible decision rules; see Section S3 of the Supplementary Appendix. Indeed, since the limit (9) arises for any choice of anchor A and the term |ψ(θ)| may be absorbed into the prior, a given quasi-Bayes decision rule corresponds to the limit of many different sequences of priors. Given these desirable properties for quasi-Bayes rules, together with the asymptotic results discussed in Section 4 below, we recommend choosing λ = ∞.
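Under proportional priors, the integrated likelihood above and its diffuse limit (9) are simple to evaluate on a grid. The following sketch is a minimal implementation based on our reading of the formulas for u(θ) and Λ(θ) (the weights λ/(1+λ) and 1/(1+λ) reflect our reconstruction of the display above, so treat the code as illustrative rather than definitive); all names are ours.

```python
import numpy as np

def log_l_lambda(theta_idx, g_grid, psi_grid, Sigma_diag, xi, Sigma_xi, lam):
    """log l(theta; g, Sigma, lambda) under the proportional prior Omega = lambda * Sigma_tilde."""
    w = lam / (1.0 + lam)
    psi_inv = np.linalg.inv(psi_grid[theta_idx])
    u = w * psi_inv @ g_grid[theta_idx] + (1.0 - w) * xi
    Lam = w * psi_inv @ Sigma_diag[theta_idx] @ psi_inv.T + (1.0 - w) * Sigma_xi
    _, logdet = np.linalg.slogdet(Lam)
    return -0.5 * logdet - 0.5 * u @ np.linalg.solve(Lam, u)

def log_l_infty(theta_idx, g_grid, psi_grid, Sigma_diag):
    """Diffuse limit (9): log|psi| - (1/2)log|Sigma(theta,theta)| - (1/2) g' Sigma^{-1} g."""
    g, S = g_grid[theta_idx], Sigma_diag[theta_idx]
    _, ld_psi = np.linalg.slogdet(psi_grid[theta_idx])
    _, ld_S = np.linalg.slogdet(S)
    return ld_psi - 0.5 * ld_S - 0.5 * g @ np.linalg.solve(S, g)
```

At λ = 0 the likelihood is constant in θ (the posterior equals the prior), while as λ → ∞ log_l_lambda converges to log_l_infty, consistent with the two limits discussed above.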
Several other papers have justified quasi-Bayes decision rules based on (9) from a Bayesian perspective. Closest to our approach, Florens and Simoni (2019) consider Bayesian inference based on an asymptotic normal approximation to a transformation of the data, and obtain the quasi-likelihood (9) as a diffuse-prior limit. Unlike our analysis, however, they specify a Gaussian process prior on the finite-sample density of the data X, rather than on the mean function in the limit experiment. Earlier work by Kim (2002) obtained the same quasi-likelihood via maximum entropy arguments, while Gallant (2016) obtains it as a Bayesian likelihood based on a coarsened sigma-algebra. Unlike our analysis, none of these papers speak to questions of optimality.

Other authors have considered alternative Bayesian approaches for moment condition models that do not run through the quasi-likelihood (9). Chamberlain and Imbens (2003) consider inference for just-identified moment condition models with discrete data, while Bornn et al. (2019) consider discrete data and potentially over-identified moment conditions. Both procedures have a finite-sample Bayesian justification, unlike our approach. Kitamura and Otsu (2011) and Shin (2015) consider Bayesian approaches based on Dirichlet process priors and exponential tilting arguments, while Schennach (2005) shows that a particular generalized empirical likelihood-type objective arises in the limit for a family of nonparametric priors.

While we focus on Bayes decision rules, the calculations needed to derive weighted average power optimal tests are nearly the same. Such tests provide a natural complement to Bayesian approaches to uncertainty quantification such as credible sets. Hence, we briefly describe how our results may be used to construct optimal tests.

Consider the problem of testing H₀: θ* = θ₀ against the composite alternative H₁: θ* ≠ θ₀. There generally exists no uniformly most powerful test in this setting, so let us instead maximize average power with respect to weights (i.e. a prior) π(θ, µ) over the alternative. The problem is further complicated by the presence of the infinite-dimensional nuisance parameter µ under the null. Andrews and Mikusheva (2016) show that the process h based on the anchor A(g) = g(θ₀) is sufficient for µ under the null. Building on this result, if we limit attention to tests that are similar, with rejection probability equal to α under the null for all values of µ, the optimal test takes a simple form.

Theorem 4
Consider the anchor A(g) = g(θ₀) and h defined in (5). Let π(θ, µ) = π(θ)π(µ) be the weight function over Θ × H_µ. Define the test

φ*(θ₀) = I{ [∫ ℓ*(θ) π(θ) dθ] / ℓ*(θ₀) > c_α(h) }

for testing the null hypothesis H₀: θ* = θ₀, where ℓ*(·) is defined in (6) and c_α(h) is the 1−α quantile of the random variable [∫ ℓ*(θ) π(θ) dθ] / ℓ*(θ₀) conditional on h under the null, provided the latter distribution is almost surely continuous. Then φ*(θ₀) is a similar test, with E_{θ₀,µ}[φ*(θ₀)] = α for all µ ∈ H_µ. Moreover, φ*(θ₀) maximizes π(θ, µ)-weighted average power over the class of similar tests, in the sense that for any other test φ with E_{θ₀,µ}[φ] = α for all µ ∈ H_µ, ∫ E_{θ,µ}[φ*(θ₀) − φ] dπ(θ, µ) ≥ 0.

For further discussion of similarity and the construction of the conditional critical value c_α(h), see Andrews and Mikusheva (2016). Note that while the test φ*(θ₀) depends on the weight function π, it controls the rejection probability for all parameter values consistent with the null. Hence, the confidence set CS = {θ₀ ∈ Θ : φ*(θ₀) = 0} formed by inverting this family of tests will have coverage 1−α no matter the choice of π.

The limit experiment studied in the previous sections treats Θ₀ as known. In practice, however, the structure of weak identification, and thus the set Θ₀, is often unknown. Feasible quasi-Bayes procedures use the normalized sample moment g_n(·) = (1/√n) Σ_{i=1}^n φ(X_i, ·) and estimated covariance Σ̂_n in place of the limit process g and known covariance Σ. The researcher specifies a prior over the whole parameter space Θ, and for Q_n(θ) = (1/2) g_n(θ)′ Σ̂_n(θ)^{-1} g_n(θ) uses the decision rule

s_n(g_n) = argmin_{a ∈ A} [∫_Θ L(a, θ) π(θ) exp{−Q_n(θ)} dθ] / [∫_Θ π(θ) exp{−Q_n(θ)} dθ].   (10)

This section shows that feasible decision rules (10) are asymptotically equivalent to infeasible rules based on knowledge of Θ₀. In particular, the feasible quasi-posterior π(Θ̃ | g_n) = [∫_{Θ̃} π(θ) exp{−Q_n(θ)} dθ] / [∫_Θ π(θ) exp{−Q_n(θ)} dθ] concentrates on neighborhoods of Θ₀. Moreover, there exist infeasible decision rules that are asymptotically equivalent to (10) for a large class of loss functions. These rules correspond to a prior supported on Θ₀ and a transformation of the moments. Specifically, some of the original moments are used to estimate Θ₀, while the remainder are used to form the posterior on Θ₀.
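For squared loss, the decision rule (10) reduces to the quasi-posterior mean. The sketch below computes it by grid integration under a flat prior; the grid-based approximation and all function names are our own choices, and MCMC (as in the application below) would typically replace the grid in higher dimensions.

```python
import numpy as np

def quasi_posterior_weights(theta_grid, g_n, Sigma_hat):
    """Quasi-posterior pi(theta | g_n) on a grid, assuming a flat prior.

    theta_grid: (T, p) grid over Theta
    g_n:        callable theta -> k-vector of normalized sample moments
    Sigma_hat:  callable theta -> k x k covariance estimate
    """
    logw = np.empty(len(theta_grid))
    for t, theta in enumerate(theta_grid):
        g = g_n(theta)
        Q = 0.5 * g @ np.linalg.solve(Sigma_hat(theta), g)   # CUE objective Q_n(theta)
        logw[t] = -Q
    w = np.exp(logw - logw.max())                            # stabilize before normalizing
    return w / w.sum()

def quasi_bayes_posterior_mean(theta_grid, g_n, Sigma_hat):
    """Decision rule (10) under squared loss: the quasi-posterior mean."""
    w = quasi_posterior_weights(theta_grid, g_n, Sigma_hat)
    return w @ theta_grid
```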
Assumption 1
The distribution P is such that Φ(θ) = E_P[φ(X_i; θ)] and Σ(·,·) are continuous, and the determinant of Σ(θ) = Σ(θ, θ) is nonzero. Further, under P,

G_n(θ) = g_n(θ) − √n Φ(θ) ⇒ G ∼ GP(0, Σ),

and the covariance estimator Σ̂_n is uniformly consistent, sup_{θ∈Θ} ‖Σ̂_n(θ) − Σ(θ)‖ →_p 0.

Assumption 2
There exists a continuously differentiable function ϑ(β, γ) : Ξ → Θ* ⊆ Θ, where Θ₀ ⊆ Θ*, Ξ = {(β, γ) : β ∈ B, γ ∈ Γ(β) ⊆ R^{p_γ}} is compact, ϑ(β, γ) ∈ Θ₀ if and only if γ = 0, and γ = 0 lies in the interior of Γ(β) for all β ∈ B. There exists a positive measure π(β, γ) on Ξ such that π(θ) on Θ* is the pushforward of π(β, γ) under ϑ. The conditional prior on γ given β has uniformly bounded density π_γ(γ | β) that is uniformly continuous and positive at γ = 0, and ∫_B dπ(β) > 0. We adopt the shorthand Φ(β, γ) = Φ(ϑ(β, γ)) and Φ(β) = Φ(β, 0), and similarly for all other functions.
Assumption 3
The function Φ(β, γ) is uniformly (over β ∈ B) differentiable in γ at γ = 0. Further, for ∇(β) = (∂/∂γ) Φ(β, 0), J(β) = ∇(β)′ Σ(β)^{-1} ∇(β) is everywhere positive definite.

(Liao and Jiang (2010) establish a similar consistency result for the case where the weighting matrix does not vary with θ. Chen, Christensen, and Tamer (2018) characterize the behavior of the quasi-posterior when the GMM objective depends on a finite-dimensional reduced-form parameter. We focus on unconstrained decision rules for brevity, but a similar analysis applies for tests.)

Assumption 2 requires that there exist some (unknown to the researcher) re-parameterization of the model in terms of β and γ, where β indexes the weakly or partially identified parameter, while γ can be called strongly identified. The set Θ₀ corresponds to γ = 0, and is parameterized by β ∈ B. Han and McCloskey (2019) provide sufficient conditions for such a reparameterization to exist. The mapping from (β, γ) to θ can be many-to-one, and we impose very little structure on the set B, which may, for example, be a collection of points or intervals. We also note that π(β, γ) need not integrate to one, since Θ* may be a strict subset of Θ. Assumption 3 requires that γ be strongly identified, in the sense that the Jacobian of the moments with respect to γ has full rank at γ = 0.

Theorem 5
Assume that P satisfies Assumptions 1, 2, and 3. If the prior π(θ) has bounded density on the set Θ₀, then for any sequence c_n → ∞, under sequences P_{n,f} local to P in the sense of (1) we have

π( {θ ∈ Θ : Φ(θ)′ Σ(θ)^{-1} Φ(θ) ≥ c_n / n} | g_n ) = o_p(1).   (11)

Moreover, for any bounded function c(θ) uniformly continuous at Θ₀,

∫_Θ c(θ) dπ(θ | g_n) − [∫_B c(ϑ(β, 0)) exp{−Q_n^β(β)} dπ*(β)] / [∫_B exp{−Q_n^β(β)} dπ*(β)] →_p 0,   (12)

where dπ*(β) = π_γ(0 | β) |J(β)|^{-1/2} dπ(β), Q_n^β(β) = (1/2) g_n(β)′ M(β) g_n(β), and

M(β) = Σ(β)^{-1} − Σ(β)^{-1} ∇(β) J(β)^{-1} ∇(β)′ Σ(β)^{-1}.

Theorem 5 is a version of the Bernstein-von Mises theorem for weakly and partially identified quasi-Bayesian settings. The GMM objective function Q_n(θ) is bounded on Θ₀ but diverges away from Θ₀. As (11) highlights, this forces the posterior to concentrate on infinitesimal neighborhoods of Θ₀, corresponding to consistent estimation of the strongly identified parameter γ. The rank k − p_γ matrix M(β) then selects the linear combinations of moments orthogonal to those used to estimate γ, and these combinations are used to form the posterior on β. Unlike in the classical Bernstein-von Mises theorem, the prior on Θ₀ (i.e. on B) matters asymptotically, and is adjusted based on the precision of the estimate for γ as measured by J(β). Overall, we obtain that feasible quasi-Bayes posteriors are asymptotically equivalent to infeasible posteriors based on a transformation of the prior and moment conditions. This likewise implies asymptotic equivalence of feasible and infeasible decision rules.
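The algebraic properties of M(β) used in this discussion are easy to check numerically. The sketch below (with an arbitrary randomly generated covariance Σ(β) and Jacobian ∇(β); the numbers are purely illustrative) verifies that M(β) annihilates the directions spanned by ∇(β) and has rank k − p_γ.

```python
import numpy as np

def M_matrix(Sigma, nabla):
    """M = Sigma^{-1} - Sigma^{-1} nabla J^{-1} nabla' Sigma^{-1}, with J = nabla' Sigma^{-1} nabla."""
    Si = np.linalg.inv(Sigma)
    J = nabla.T @ Si @ nabla
    return Si - Si @ nabla @ np.linalg.solve(J, nabla.T @ Si)

rng = np.random.default_rng(0)
k, p_gamma = 4, 2
A = rng.normal(size=(k, k))
Sigma = A @ A.T + np.eye(k)              # positive-definite covariance Sigma(beta)
nabla = rng.normal(size=(k, p_gamma))    # Jacobian of the moments in gamma

M = M_matrix(Sigma, nabla)
print(np.allclose(M @ nabla, 0))                 # True: directions used to estimate gamma drop out
print(np.linalg.matrix_rank(M) == k - p_gamma)   # True: M has rank k - p_gamma
```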
Corollary 2
Let the assumptions of Theorem 5 hold. Assume that the loss function L(a, θ) is Lipschitz in a and continuous in θ over Θ*, and that A is compact. Assume further that for almost all realizations of the process G(β) ∼ GP(0, Σ), the process L̄(a) = ∫_B L(a, ϑ(β, 0)) exp{−(1/2) G(β)′ M(β) G(β)} dπ*(β) has a unique minimizer over A. Then the infeasible rule

s̃_n(g_n) = argmin_{a ∈ A} [∫_B L(a, ϑ(β, 0)) exp{−Q_n^β(β)} dπ*(β)] / [∫_B exp{−Q_n^β(β)} dπ*(β)]

satisfies s̃_n(g_n) − s_n(g_n) →_p 0.

Uniqueness of the minimizer of L̄(a) is guaranteed to hold if the loss function is convex in a. Sufficient conditions for uniqueness in non-convex cases are discussed in Cox (2020). Overall, we obtain that feasible quasi-Bayes decision rules, computed without knowledge of Θ₀, converge to infeasible quasi-Bayes rules based on knowledge of Θ₀ and a transformation of the moments and prior. These rules are, in turn, the limit of sequences of proper-prior Bayes decision rules in the limit problem by our previous results.

Following Kim (2002) and Chernozhukov and Hong (2003), quasi-Bayes procedures have been used in a range of applications. Here, we briefly illustrate our results with an application of quantile IV to data from Graddy (1995) on the demand for fish at the Fulton fish market. Following Chernozhukov et al. (2009), who discuss finite-sample frequentist inference in this setting, we consider quantile IV moment conditions of the form (2) with Y the log quantity of fish purchased, W the log price, and Z a vector of instruments consisting of a constant, a dummy for whether the weather offshore was mixed (with wave height above 3.8 feet and windspeed over 13 knots), and a dummy for whether the weather offshore was stormy (with wave height above 4.5 feet and windspeed over 18 knots). For further details on the data and setting, see Graddy (1995) and Chernozhukov et al. (2009).

The results of Chernozhukov et al. (2009) suggest that the data are not very informative when we consider inference on the 0.25 or 0.75 quantiles, consistent with weak identification. Here we discuss results for the 0.75 quantile. Following Chernozhukov et al. (2009), we restrict attention to α ∈ [0, 30] and β to a bounded interval, and consider a flat prior π(θ*) on θ = (α, β). (These choices are discussed in the working paper version, Chernozhukov et al. 2006.) We calculate the quasi-Bayes posterior distribution discussed in Section 4 using random-walk Metropolis-Hastings with ten million draws. We plot 500 draws from the quasi-Bayes posterior in panel (a) of Figure 1. Panel (b) plots the marginal quasi-Bayes posterior distribution for the price coefficient β, with vertical lines marking the (continuously-updating) GMM estimate and the quasi-posterior mean. Panel (c) plots the 95% highest posterior density set, while panel (d) plots the conditional frequentist confidence set discussed in Section 3.6, along with the GMM estimate.

[Figure 1: Results for Graddy (1995) data]

A few aspects of these results warrant discussion. Given the flat prior, the quasi-posterior is a monotonic transformation of the GMM objective function. Figure 1 thus highlights that the GMM objective is far from quadratic in this case, so conventional strong-identification approximations appear unreliable, consistent with weak identification of the sort discussed in Example 2. Panel (b) shows that the GMM estimate is quite different from the quasi-posterior mean and that the quasi-posterior is highly non-normal, again contrary to what we would expect in the point-identified, strongly-identified case (see Chernozhukov and Hong, 2003). The quasi-posterior mean seems a more reasonable summary than the GMM estimate, as the latter ignores the large region of uncertainty to the right.

We also report credible and confidence sets. Panel (c) reports the quasi-Bayesian highest posterior density credible set, which has no frequentist coverage guarantees in the current setting. (The results of Chen et al. (2018) show that quasi-Bayesian highest posterior density sets have correct asymptotic coverage of the identified set in certain partially identified settings. This result does not extend to general weakly identified models, however, since highest posterior density sets are necessarily a strict subset of the parameter space and so under-cover the identified set when Θ₀ = Θ.) Panel (d) reports the frequentist confidence set obtained through inversion of the weighted average power optimal conditional tests discussed in Section 3.6, using the quasi-Bayes objective. The quasi-Bayesian credible set and the frequentist confidence set have quite similar shapes, but the frequentist confidence set is slightly smaller, covering 4.74% of the parameter space, as compared to 4.82% for the highest posterior density set.
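For completeness, a minimal sketch of the random-walk Metropolis-Hastings sampler described above: it targets the quasi-posterior exp{−Q_n(θ)} under a flat prior on a bounded region. The step size and starting point are illustrative assumptions, and g_n and Σ̂_n can be computed as in the earlier sketches.

```python
import numpy as np

def rw_metropolis_quasi_bayes(log_post, theta0, n_draws, step, seed=0):
    """Random-walk Metropolis-Hastings targeting the quasi-posterior.

    log_post: callable theta -> log pi(theta) - Q_n(theta) (flat prior: just -Q_n,
              with -inf outside the bounded prior region)
    theta0:   starting point, shape (p,)
    step:     proposal standard deviation(s)
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    draws = np.empty((n_draws, len(theta)))
    for t in range(n_draws):
        prop = theta + step * rng.normal(size=theta.shape)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # standard accept/reject step
            theta, lp = prop, lp_prop
        draws[t] = theta
    return draws
```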
References

Adler, R.J., and J.E. Taylor (2007), Random Fields and Geometry, New York: Springer.
Andrews, D.W.K., M. Moreira, and J. Stock (2006), "Optimal Two-Sided Invariant Similar Tests of Instrumental Variables Regression," Econometrica, 74(3), 715-752.
Andrews, I. and A. Mikusheva (2016), "Conditional Inference with a Functional Nuisance Parameter," Econometrica, 84(4), 1571-1612.
Berlinet, A., and C. Thomas-Agnan (2004), Reproducing Kernel Hilbert Spaces in Probability and Statistics, Kluwer Academic Publishers.
Bornn, L., N. Shephard, and R. Solgi (2019), "Moment Conditions and Bayesian Non-Parametrics," Journal of the Royal Statistical Society, Series B, 81(1), 5-43.
Brown, L.D. (1986), Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory, Hayward, CA: Institute of Mathematical Statistics.
Chamberlain, G. and G. Imbens (2003), "Nonparametric Applications of Bayesian Inference," Journal of Business and Economic Statistics, 21(1), 12-18.
Chen, X., T. Christensen, and E. Tamer (2018), "Monte Carlo Confidence Sets for Identified Sets," Econometrica, 86(6), 1965-2018.
Chernozhukov, V. and C. Hansen (2005), "An IV Model of Quantile Treatment Effects," Econometrica, 73(1), 245-261.
Chernozhukov, V., C. Hansen, and M. Jansson (2006), "Finite Sample Inference for Quantile Regression Models," Unpublished Manuscript.
Chernozhukov, V., C. Hansen, and M. Jansson (2009), "Finite Sample Inference for Quantile Regression Models," Journal of Econometrics, 152(2), 93-103.
Chernozhukov, V. and H. Hong (2003), "An MCMC Approach to Classical Estimation," Journal of Econometrics, 115(2), 293-346.
Cox, G. (2020), "Almost Sure Uniqueness of a Global Minimum Without Convexity," Annals of Statistics, 48(1), 584-606.
Florens, J.P. and A. Simoni (2012), "Nonparametric Estimation of an Instrumental Regression: A Quasi-Bayesian Approach Based on Regularized Posterior," Journal of Econometrics, 170(2), 458-475.
Florens, J.P. and A. Simoni (2019), "Gaussian Processes and Bayesian Moment Estimation," Journal of Business and Economic Statistics, Forthcoming.
Gallant, R. (2016), "Reflections on the Probability Space Induced by Moment Conditions with Implications for Bayesian Inference," Journal of Financial Econometrics, 14(2), 284-294.
Graddy, K. (1995), "Testing for Imperfect Competition at the Fulton Fish Market," Rand Journal of Economics, 26(1), 75-92.
Han, S. and A. McCloskey (2019), "Estimation and Inference with a (Nearly) Singular Jacobian," Quantitative Economics, 10(3), 1019-1068.
Hirano, K. and J. Porter (2009), "Asymptotics for Statistical Treatment Rules," Econometrica, 77(5), 1683-1701.
Kaji, T. (2020), "Theory of Weak Identification in Semiparametric Models," Unpublished Manuscript.
Kim, J.Y. (2002), "Limited Information Likelihood and Bayesian Analysis," Journal of Econometrics, 107, 175-193.
Kitamura, Y. and T. Otsu (2011), "Bayesian Analysis of Moment Restriction Models Using Nonparametric Priors," Unpublished Manuscript.
Le Cam, L. (1986), Asymptotic Theory of Statistical Inference, John Wiley & Sons.
Lehmann, E.L. and G. Casella (1998), Theory of Point Estimation, Springer Texts in Statistics.
Liao, Y. and W. Jiang (2010), "Bayesian Analysis in Moment Inequality Models," Annals of Statistics, 38(1), 275-316.
Moreira, H. and M. Moreira (2019), "Optimal Two-Sided Tests for Instrumental Variables Regression with Heteroskedastic and Autocorrelated Errors," Journal of Econometrics, 213(2), 398-433.
Moreira, M. and G. Ridder (2019), "Optimal Invariant Tests in an Instrumental Variables Regression With Heteroskedastic and Autocorrelated Errors," Unpublished Manuscript.
Mueller, U.K. (2011), "Efficient Tests Under a Weak Convergence Assumption," Econometrica, 79(2), 395-435.
Neveu, J. (1968), Processus Aléatoires Gaussiens, Séminaire Math. Sup., Les Presses de l'Université de Montréal.
Parzen, E. (1962), "Extraction and Detection Problems and Reproducing Kernel Hilbert Spaces," J. SIAM Control, Ser. A, 1(1), 35-62.
Schennach, S.M. (2005), "Bayesian Exponentially Tilted Empirical Likelihood," Biometrika, 92(1), 31-46.
Shin, M. (2015), "Bayesian GMM," Dissertation Chapter, University of Pennsylvania.
Staiger, D. and J.H. Stock (1997), "Instrumental Variables Regression with Weak Instruments," Econometrica, 65(3), 557-586.
Van der Vaart, A.W. (1991), "An Asymptotic Representation Theorem," International Statistical Review, 59(1), 97-121.
Van der Vaart, A.W. and J.A. Wellner (1996), Weak Convergence and Empirical Processes, Springer.
Van der Vaart, A.W. and H. Van Zanten (2008), "Reproducing Kernel Hilbert Spaces of Gaussian Processes," in Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, Bertrand Clarke and Subhashis Ghosal, eds., Beachwood, Ohio: Institute of Mathematical Statistics, 200-222.
Proof of Lemma 1. A score $f(X) = \sum_{i=1}^{s}\varphi(X,\theta_i)'a_i$ corresponds to the mean function
$$m(\cdot) = E_P[f(X)\varphi(X,\cdot)] = \sum_{i=1}^{s}E_P\big[\varphi(X,\cdot)\varphi(X,\theta_i)'a_i\big] = \sum_{i=1}^{s}\Sigma(\cdot,\theta_i)a_i.$$
For two scores $f(X) = \sum_{i=1}^{s}\varphi(X,\theta_i)'a_i$ and $\tilde f(X) = \sum_{j=1}^{s^*}\varphi(X,\theta^*_j)'b_j$ with corresponding mean functions $m(\cdot) = \sum_{i=1}^{s}\Sigma(\cdot,\theta_i)a_i$ and $\tilde m(\cdot) = \sum_{j=1}^{s^*}\Sigma(\cdot,\theta^*_j)b_j$, we have
$$E_P[f(X)\tilde f(X)] = \sum_{i=1}^{s}\sum_{j=1}^{s^*}a_i'\,E_P\big[\varphi(X,\theta_i)\varphi(X,\theta^*_j)'\big]\,b_j = \sum_{i=1}^{s}\sum_{j=1}^{s^*}a_i'\,\Sigma(\theta_i,\theta^*_j)\,b_j = \langle m,\tilde m\rangle_{\mathcal H}.$$
This implies that there is an isometric isomorphism between $\mathcal H$ and $H^*$. It remains to show that for any $f_\perp\in T(P)$ that is orthogonal (in the $L_2(P)$ sense) to $H^*$ we have $E_P[f_\perp(X)\varphi(X,\cdot)]\equiv 0\in\mathbb R^k$. Indeed, $f_\perp$ is orthogonal to $a'\varphi(X,\theta)\in H^*$ for any vector $a\in\mathbb R^k$ and any $\theta\in\Theta$, so $E_P[f_\perp(X)\,a'\varphi(X,\theta)] = a'm(\theta) = 0$. $\square$
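The isometry established in Lemma 1 is easy to check numerically. The sketch below is a toy illustration, not the paper's construction: it takes a scalar moment function $\varphi(X,\theta) = \cos(\theta X) - E[\cos(\theta X)]$ with $X$ standard normal, so that $\Sigma$ has a closed form, and verifies by Monte Carlo that $E_P[f(X)\tilde f(X)]$ matches $\langle m,\tilde m\rangle_{\mathcal H}$ for two finite-sum scores.

```python
# Monte Carlo check of the Lemma 1 isometry E_P[f f~] = <m, m~>_H in a toy
# scalar model: X ~ N(0,1), phi(X, theta) = cos(theta*X) - E[cos(theta*X)],
# for which Sigma(t1,t2) has the closed form used below.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(1_000_000)

def phi(x, theta):
    return np.cos(theta * x) - np.exp(-theta**2 / 2)

def Sigma(t1, t2):
    # Cov(cos(t1 X), cos(t2 X)) for X ~ N(0,1)
    return 0.5 * (np.exp(-(t1 - t2)**2 / 2) + np.exp(-(t1 + t2)**2 / 2)) \
           - np.exp(-t1**2 / 2 - t2**2 / 2)

thetas_a, a = np.array([0.5, 1.3]), np.array([1.0, -2.0])
thetas_b, b = np.array([0.8]), np.array([0.7])

f  = sum(ai * phi(X, ti) for ai, ti in zip(a, thetas_a))
ft = sum(bj * phi(X, tj) for bj, tj in zip(b, thetas_b))

lhs = np.mean(f * ft)                      # E_P[f(X) f~(X)]
rhs = sum(ai * Sigma(ti, tj) * bj          # <m, m~>_H
          for ai, ti in zip(a, thetas_a)
          for bj, tj in zip(b, thetas_b))
print(lhs, rhs)  # the two numbers agree up to simulation error
```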
Proof of Theorem 1. Define an orthonormal basis $\{\varphi_j(X)\}$ of $T(P)$ consisting of the union of an orthonormal basis $\{\varphi^*_j(X)\}$ of $H^*$ and an orthonormal basis $\{\varphi^\perp_j(X)\}$ of $(H^*)^\perp$. The limit experiment $\mathcal E^*_\infty$ corresponds to observing the union of two sets of mutually independent random variables,
$$W^*_j\sim N\big(E_P[f(X)\varphi^*_j(X)],\,1\big) \quad\text{and}\quad W^\perp_j\sim N\big(E_P[f(X)\varphi^\perp_j(X)],\,1\big).$$
By Lemma 1, $E_P[f(X)\varphi^*_j(X)] = \langle m,\varphi^*_j\rangle_{\mathcal H}$. The experiment of observing only $W^*_j\sim N(\langle m,\varphi^*_j\rangle_{\mathcal H},1)$ is equivalent to the Gaussian process experiment $\mathcal E^*_{GP}$ by the Karhunen-Loève theorem. By independence, $dP_f(W^*,W^\perp) = dP_{f^*}(W^*)\times dP_{f^\perp}(W^\perp)$. The loss function depends only on $\theta^*$, and the parameter space for $(\theta^*,f^*,f^\perp)$ is the Cartesian product $\{\theta^*\in\Theta,\ f^*\in H^*_{\theta^*}\}\times\{f^\perp\in(H^*)^\perp\}$. The risk of a decision rule $\delta$ is
$$\tilde R(\theta^*,f) = \tilde R(\theta^*,f^*,f^\perp) = E_f\big[L\big(\delta(W^*,W^\perp),\theta^*\big)\big].$$
We claim that for any fixed value $f^\perp$ there exists a decision rule in the experiment $\mathcal E^*_{GP}$ with risk $R(\theta^*,m) = \tilde R(\theta^*,f^*,f^\perp)$ for all $(\theta^*,f^*)\in\{\theta^*\in\Theta,\ f^*\in H^*_{\theta^*}\}$, where $m$ corresponds to $f^*$ as in Lemma 1. Indeed, since the experiment $\mathcal E^*_{GP}$ is equivalent to observing only the $W^*_j$ variables, it is enough, for each realization $W^* = w$, to draw a random variable $W^\perp$ from the distribution $dP_{f^\perp}$ (which is fixed) and produce the randomized decision $\tilde\delta(w) = \delta(w,W^\perp)$. $\square$
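The Karhunen-Loève reduction used above can be visualized in finite dimensions: on a grid, a draw of a mean-zero Gaussian process is the same object as a vector of independent standard normal coordinates in the eigenbasis of its covariance. The sketch below is a minimal illustration under an assumed squared-exponential kernel; the kernel choice is ours, made for the illustration only.

```python
# Finite-dimensional illustration of the Karhunen-Loeve step in the proof of
# Theorem 1: a draw from GP(0, K) on a grid is generated by independent N(0,1)
# coordinates in the eigenbasis of K.
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 200)
K = np.exp(-0.5 * (grid[:, None] - grid[None, :])**2 / 0.1**2)

lam, V = np.linalg.eigh(K)              # eigenpairs of the covariance matrix
lam = np.clip(lam, 0.0, None)           # guard against tiny negative round-off

xi = rng.standard_normal(len(grid))     # the independent coordinates W_j
path = V @ (np.sqrt(lam) * xi)          # KL sum: a draw from N(0, K)

# sanity check: the eigen-expansion reproduces the covariance matrix
print(np.abs(V @ np.diag(lam) @ V.T - K).max())
```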
Proof of Theorem 2. The distribution of $g$ for any $m\in\mathcal H$ is dominated by its distribution under $m = 0$. Moreover, the form of the likelihood ratio for Gaussian processes (see, e.g., Theorem 54 in Berlinet and Thomas-Agnan, 2004) implies that condition (1) in Section 4A.1 of Brown (1986) holds. Our assumptions likewise imply condition (2) of Brown (1986). The result is thus immediate from Theorem 4A.12 of Brown (1986). $\square$
Proof of Theorem 3. We first find the conditional mean of $\xi\sim N\big(-[\psi(\theta^*)]^{-1}\mu(\theta^*),\,\Sigma_\xi\big)$ given the realization of $h = \mu + GP(0,\tilde\Sigma)$, assuming $\mu\sim GP(0,\Omega)$. Neveu (1968) proves that for the Gaussian family the conditional mean coincides with the best linear predictor. Note next that
$$E[\xi\,h(\cdot)'] = -[\psi(\theta^*)]^{-1}\Omega(\theta^*,\cdot) = \rho(\cdot),$$
and $E[h(\theta_1)h(\theta_2)'] = \Omega(\theta_1,\theta_2)+\tilde\Sigma(\theta_1,\theta_2)$. Denote by $\mathcal K$ the RKHS corresponding to the covariance function $\Omega+\tilde\Sigma$, and by $L_2(h)$ the subspace of $L_2(P)$ random variables obtained as the closure of linear combinations of $h$. Define $\xi^*$ as the projection of $\xi$ onto $L_2(h)$. By definition it is the best linear predictor of $\xi$ given $h$, and $E[\xi\,h(\cdot)'] = E[\xi^*h(\cdot)'] = \rho(\cdot)$. Lemma 13 in Berlinet and Thomas-Agnan (2004) implies that $\rho(\cdot)\in\mathcal K$. Denote by $\Psi$ the canonical congruence between $L_2(h)$ and $\mathcal K$, defined by
$$\Psi\Big(\sum_j a_j'h(\theta_j)\Big) = \sum_j a_j'\big(\Omega(\theta_j,\cdot)+\tilde\Sigma(\theta_j,\cdot)\big)\in\mathcal K$$
and extended by continuity. Then $\xi^* = \Psi^{-1}(\rho(\cdot))$. See Section 3 of Berlinet and Thomas-Agnan (2004) for further discussion.

We next fix $\theta^*$, assume that condition (7) holds, and show that the best linear predictor depends on $(\xi, g(\theta^*))$ only. Condition (7) implies that
$$\Omega(\theta^*,\cdot)+\tilde\Sigma(\theta^*,\cdot) = \big(I_k+\tilde\Sigma(\theta^*,\theta^*)\Omega(\theta^*,\theta^*)^{-1}\big)\,\Omega(\theta^*,\cdot).$$
Thus,
$$\rho(\cdot) = -[\psi(\theta^*)]^{-1}\big(I_k+\tilde\Sigma(\theta^*,\theta^*)\Omega(\theta^*,\theta^*)^{-1}\big)^{-1}\big[\Omega(\theta^*,\cdot)+\tilde\Sigma(\theta^*,\cdot)\big],$$
and the canonical congruence has the form
$$\xi^* = \Psi^{-1}(\rho(\cdot)) = -[\psi(\theta^*)]^{-1}\big(I_k+\tilde\Sigma(\theta^*,\theta^*)\Omega(\theta^*,\theta^*)^{-1}\big)^{-1}h(\theta^*),$$
which depends on the data only through $h(\theta^*) = g(\theta^*)-\psi(\theta^*)\xi$.

Finally, we prove the converse, assuming that for each $\theta^*$ the likelihood depends only on $(\xi, g(\theta^*))$ and proving that (7) holds. Since the conditional distribution of $\xi$ given $\theta^*$ and $\xi^*$ is $N(\xi^*,\Sigma_\xi)$, $\xi^*$ must depend only on $(\xi, g(\theta^*))$ or, equivalently, on $(\xi, h(\theta^*))$. Linearity of $\xi^*$ in $h$ then implies that there exists a non-random $k\times k$ matrix $B(\theta^*)$ such that $\Psi^{-1}(\rho(\cdot)) = B(\theta^*)h(\theta^*)$. By the definition of the canonical congruence this implies
$$\rho(\cdot) = B(\theta^*)\big[\Omega(\theta^*,\cdot)+\tilde\Sigma(\theta^*,\cdot)\big].$$
Since $\rho(\cdot) = -[\psi(\theta^*)]^{-1}\Omega(\theta^*,\cdot)$, we obtain $-[\psi(\theta^*)]^{-1}\Omega(\theta^*,\cdot) = B(\theta^*)\big[\Omega(\theta^*,\cdot)+\tilde\Sigma(\theta^*,\cdot)\big]$. Since $\psi(\theta^*)$ has full rank, both sides are invertible when evaluated at $\theta = \theta^*$, and some rearrangement yields (7). $\square$
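The only probabilistic input in this step is the elementary fact that for jointly Gaussian variables the conditional mean is the best linear predictor, which in finite dimensions reads $\xi^* = \mathrm{Cov}(\xi,h)\,\mathrm{Var}(h)^{-1}h$. The toy sketch below (scalar $\psi = 1$, a five-point grid in place of the process $h$, and an assumed exponential kernel, none of which come from the paper) also illustrates the proportional case $\Omega = \lambda\tilde\Sigma$ of condition (7): the predictor loads on $h$ only through $h(\theta^*)$.

```python
# Best-linear-predictor illustration for the proof of Theorem 3, on a grid.
import numpy as np

grid = np.linspace(0.0, 1.0, 5)
Omega = np.exp(-np.abs(grid[:, None] - grid[None, :]))  # prior covariance of mu
Sigma_t = 0.5 * Omega            # proportional case: Omega = 2 * Sigma-tilde

# With psi(theta*) = 1 and theta* = grid[2]: Cov(xi, h) = -Omega(theta*, .)
# and Var(h) = Omega + Sigma-tilde, so the BLP of xi given h has coefficients
V_h = Omega + Sigma_t
rho = -Omega[2, :]
coef = rho @ np.linalg.inv(V_h)

print(np.round(coef, 6))         # loads only on h(theta*): index 2 is non-zero
```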
Proof of Lemma 3. Fix $\theta_1$ and let $B = \Omega(\theta_1,\theta_1)$. Condition (7) implies that
$$\Omega(\theta_1,\theta) = B\,\tilde\Sigma(\theta_1,\theta_1)^{-1}\tilde\Sigma(\theta_1,\theta) \quad\text{and}\quad \Omega(\theta,\theta)^{-1}\Omega(\theta,\theta_1) = \tilde\Sigma(\theta,\theta)^{-1}\tilde\Sigma(\theta,\theta_1).$$
The transposed equations are
$$\Omega(\theta,\theta_1) = \tilde\Sigma(\theta,\theta_1)\tilde\Sigma(\theta_1,\theta_1)^{-1}B \quad\text{and}\quad \Omega(\theta_1,\theta)\Omega(\theta,\theta)^{-1} = \tilde\Sigma(\theta_1,\theta)\tilde\Sigma(\theta,\theta)^{-1}.$$
We can calculate $\Omega(\theta_1,\theta)\Omega(\theta,\theta)^{-1}\Omega(\theta,\theta_1)$ in two ways, so
$$\tilde\Sigma(\theta_1,\theta)\tilde\Sigma(\theta,\theta)^{-1}\tilde\Sigma(\theta,\theta_1)\tilde\Sigma(\theta_1,\theta_1)^{-1}B = B\,\tilde\Sigma(\theta_1,\theta_1)^{-1}\tilde\Sigma(\theta_1,\theta)\tilde\Sigma(\theta,\theta)^{-1}\tilde\Sigma(\theta,\theta_1).$$
Multiply both sides by $\tilde\Sigma(\theta_1,\theta_1)^{-1/2}$ on the left and on the right, denote $D(\theta) = \tilde\Sigma(\theta_1,\theta_1)^{-1/2}\tilde\Sigma(\theta_1,\theta)\tilde\Sigma(\theta,\theta)^{-1}\tilde\Sigma(\theta,\theta_1)\tilde\Sigma(\theta_1,\theta_1)^{-1/2}$, and let $\tilde B = \tilde\Sigma(\theta_1,\theta_1)^{-1/2}B\,\tilde\Sigma(\theta_1,\theta_1)^{-1/2}$. We obtain that $\tilde B$ commutes with the whole family of symmetric matrices $\mathcal D = \{D(\theta):\theta\in\Theta\}$: $D(\theta)\tilde B = \tilde B D(\theta)$.

Assume $\tilde B$ has $r$ distinct eigenvalues. Since $\tilde B$ is symmetric, all eigenvectors corresponding to distinct eigenvalues are orthogonal. Let $V_1,\dots,V_r$ be the orthogonal subspaces spanned by the eigenvectors of $\tilde B$ corresponding to the eigenvalues $\lambda_1,\dots,\lambda_r$, respectively. Consider a symmetric matrix $D(\theta)\in\mathcal D$ that commutes with $\tilde B$. Take any $v_i\in V_i$ and $v_j\in V_j$:
$$v_i'D(\theta)\tilde B v_j = \lambda_j\,v_i'D(\theta)v_j = v_i'\tilde B D(\theta)v_j = \lambda_i\,v_i'D(\theta)v_j.$$
This implies $v_i'D(\theta)v_j = 0$ for any $i\neq j$. Thus $D(\theta)v_j\in V_j$, and $V_1,\dots,V_r$ are invariant subspaces for $D(\theta)$. We have thus proved that $V_1,\dots,V_r$ are invariant subspaces for the whole family of operators $\mathcal D$. Under the conditions of the lemma this implies that $\tilde B$ has a single eigenvalue $\lambda > 0$, so that $\tilde B = \lambda I_k$, hence $\Omega(\theta_1,\theta_1) = \lambda\tilde\Sigma(\theta_1,\theta_1)$ and, plugging this into the first display, $\Omega(\cdot,\cdot) = \lambda\tilde\Sigma(\cdot,\cdot)$. $\square$
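The linear-algebra core of this argument, that a symmetric matrix commuting with a sufficiently rich family of symmetric matrices must be a scalar multiple of the identity, can be checked numerically. The sketch below is an illustration of ours, with a few random symmetric matrices standing in for the family $\mathcal D$; it recovers the common commutant as the null space of the stacked Sylvester-type constraints.

```python
# Numeric check for the commutation argument in the proof of Lemma 3.
import numpy as np

rng = np.random.default_rng(3)
k = 4
Ds = []
for _ in range(3):                 # a few generic symmetric matrices
    A = rng.standard_normal((k, k))
    Ds.append(A + A.T)

# Stack the constraints D B - B D = 0 as (I (x) D - D^T (x) I) vec(B) = 0.
rows = [np.kron(np.eye(k), D) - np.kron(D.T, np.eye(k)) for D in Ds]
M = np.vstack(rows)

_, s, Vt = np.linalg.svd(M)
rank = int(np.sum(s > 1e-10))
null = Vt[rank:]                   # basis of all B commuting with every D
print(null.shape[0])               # 1: the common commutant is one-dimensional
B = null[0].reshape(k, k)
print(np.round(B / B[0, 0], 6))    # proportional to the identity matrix
```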
Proof of Theorem 4. Similarity of $\varphi^*(\theta_0)$ follows from Lemma 1 in Andrews and Mikusheva (2016). Theorem S2.1 in the supplement to Andrews and Mikusheva (2016) implies that any similar test in this setting must be conditionally similar given $h$, with $E_{\theta_0,\mu}[\varphi\mid h] = \alpha$ almost surely. For any $\mu\in\mathcal H_\mu$ and any test $\varphi$,
$$\int\!\!\int E_{\theta,\mu}[\varphi]\,d\pi(\mu)\,d\pi(\theta) = E_\pi[\varphi] = E_{\theta_0,\mu}\left[\frac{\int\ell^*(\theta)\,d\pi(\theta)\,f(h)}{\ell(\mu,\theta_0;\xi)\,\ell(\mu;h)}\,\varphi\right].$$
Since $\xi = g(\theta_0)$, $\ell(\mu,\theta_0;\xi)$ does not depend on $\mu$ and is equal to $\ell^*(\theta_0)$. Lemma 2 of Andrews and Mikusheva (2016) implies that
$$\tilde\varphi(\theta_0) = 1\left\{\frac{\int\ell^*(\theta)\,d\pi(\theta)\,f(h)}{\ell^*(\theta_0)\,\ell(\mu;h)} > \tilde c_\alpha(h)\right\}$$
maximizes $E_\pi[\varphi]$ over the class of size-$\alpha$ similar tests, where $\tilde c_\alpha(h)$ is the $1-\alpha$ quantile of $\frac{\int\ell^*(\theta)\,d\pi(\theta)\,f(h)}{\ell^*(\theta_0)\,\ell(\mu;h)}$ conditional on $h$ under the null. The test statistic in $\tilde\varphi(\theta_0)$ differs from that in $\varphi^*(\theta_0)$ only through terms depending on $h$ alone; these can be absorbed into the critical value, so $\tilde\varphi(\theta_0) = \varphi^*(\theta_0)$. $\square$
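Conditional critical values like $\tilde c_\alpha(h)$ are typically computed by simulation: fix the observed $h$, draw the remaining randomness under the null, and take the $1-\alpha$ quantile of the test statistic. The sketch below is a schematic toy version; the statistic, the scalar stand-in for $h$, and the null law of $\xi$ given $h$ are illustrative assumptions, not the paper's formulas.

```python
# Schematic computation of a conditional critical value by simulation.
import numpy as np

rng = np.random.default_rng(4)
alpha = 0.05

def stat(xi, h):
    # hypothetical test statistic combining xi with the conditioning variable h
    return xi**2 / (1.0 + h**2)

h_obs = 0.8                                            # observed h (held fixed)
xi_null = rng.normal(size=100_000)                     # draws of xi | h under H0
c_alpha = np.quantile(stat(xi_null, h_obs), 1 - alpha) # conditional critical value

xi_obs = 2.3
print(stat(xi_obs, h_obs) > c_alpha)                   # reject if True
```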
Proof of Theorem 5. By contiguity it is enough to prove the statement under $P_0$. Denote $Q(\theta) = \Phi(\theta)'\Sigma(\theta)^{-1}\Phi(\theta)$. Due to Assumption 1, $\hat\Sigma_n(\theta)^{-1}$ is uniformly bounded in probability and $G_n(\cdot)\Rightarrow GP(0,\Sigma)$, thus $\max_{\theta\in\Theta}G_n(\theta)'\hat\Sigma_n(\theta)^{-1}G_n(\theta) = O_p(1)$. Since $g_n(\theta) = \sqrt n\,\Phi(\theta)+G_n(\theta)$, we have
$$\frac12 Q_n(\theta)\le n\,Q(\theta)(1+o_p(1)) + \max_{\theta\in\Theta}G_n(\theta)'\hat\Sigma_n(\theta)^{-1}G_n(\theta),$$
$$\frac12 Q_n(\theta)\ge \frac n4\,Q(\theta)(1+o_p(1)) - \max_{\theta\in\Theta}G_n(\theta)'\hat\Sigma_n(\theta)^{-1}G_n(\theta).$$
Define $\Theta_{\delta,n} = \{\theta\in\Theta: Q(\theta)\le\delta/n\}$ for some $\delta>0$, and $\Theta^c_{c_n} = \{\theta\in\Theta: Q(\theta)\ge c_n/n\}$. Then
$$\pi\big(\Theta^c_{c_n}\mid g_n\big) \le \frac{\int_{\Theta^c_{c_n}}\pi(\theta)\exp\{-\frac12 Q_n(\theta)\}\,d\theta}{\int_{\Theta_{\delta,n}}\pi(\theta)\exp\{-\frac12 Q_n(\theta)\}\,d\theta} \le O_p(1)\cdot\frac{\int_{\Theta^c_{c_n}}\pi(\theta)\exp\{-\frac n4 Q(\theta)\}\,d\theta}{\int_{\Theta_{\delta,n}}\pi(\theta)\,d\theta}.$$
Due to uniform differentiability of $\Phi(\beta,\gamma)$ there exist positive constants $C_1$, $C_2$ and small enough $\varepsilon>0$ such that for all $\theta\in\Theta^*_\varepsilon = \{\theta=\vartheta(\beta,\gamma):\|\gamma\|<\varepsilon\}$ we have $C_1\|\gamma\|^2\le Q(\beta,\gamma)\le C_2\|\gamma\|^2$. For large enough $n$ we have $\Theta_{\delta,n}\subseteq\Theta^*_\varepsilon$. Thus,
$$\int_{\Theta_{\delta,n}}\pi(\theta)\,d\theta = \int_B\int_{Q(\beta,\gamma)\le\delta/n}\pi_\gamma(\gamma|\beta)\,d\gamma\,d\pi(\beta) \ge C\int_{C_2\|\gamma\|^2\le\delta/n}d\gamma \ge C\,n^{-p_\gamma/2}.$$
Divide the integral over $\Theta^c_{c_n}$ into integrals over $\Theta^c_{c_n}\cap\Theta^*_\varepsilon$ and over $\Theta^c_{c_n}\cap(\Theta^*_\varepsilon)^c$, where $(\Theta^*_\varepsilon)^c = \Theta\setminus\Theta^*_\varepsilon$. We have $\Theta^c_{c_n}\cap\Theta^*_\varepsilon\subseteq\{\theta=\vartheta(\beta,\gamma): C_2\|\gamma\|^2\ge c_n/n\}$. Denote by $\bar Q$ the non-zero minimum of $Q(\theta)$ over $(\Theta^*_\varepsilon)^c$. Thus,
$$\frac{\int_{\Theta^c_{c_n}}\pi(\theta)\exp\{-\frac n4 Q(\theta)\}\,d\theta}{\int_{\Theta_{\delta,n}}\pi(\theta)\,d\theta} \le C\,n^{p_\gamma/2}\left(\exp\Big\{-\frac n4\bar Q\Big\} + \int_{C_2\|\gamma\|^2\ge c_n/n}\exp\Big\{-\frac{n\,C_1}{4}\|\gamma\|^2\Big\}\,d\gamma\right) \le o(1) + C\int_{C_2\|y\|^2\ge c_n}\exp\Big\{-\frac{C_1}{4}\|y\|^2\Big\}\,dy \to 0.$$
In the last line we used the change of variables $y = \sqrt n\,\gamma$ and integrability of $\exp\{-\frac{C_1}{4}\|y\|^2\}$; the final convergence holds since $c_n\to\infty$. This proves (11), and implies that for $\pi_{\Theta_{c_n}}$, the prior restricted to $\Theta_{c_n} = \Theta\setminus\Theta^c_{c_n}$, the posterior
$$\pi_{\Theta_{c_n}}(\Upsilon\mid g_n) = \frac{\pi(\Upsilon\cap\Theta_{c_n}\mid g_n)}{\pi(\Theta_{c_n}\mid g_n)},$$
defined on sets $\Upsilon\subseteq\Theta$, is asymptotically the same as $\pi(\Upsilon\mid g_n)$. If $c_n/n\to0$ then, due to compactness of $\Theta^*$, for large enough $n$ we have $\Theta_{c_n}\subseteq\Theta^*$. Thus we can treat the parameterization described in Assumption 2 as applying to the whole parameter space $\Theta$.

The above implies that for any $\varepsilon_1>0$ we can choose $\delta$ large enough that
$$P\left\{\sup_\beta\, n^{p_\gamma/2}\int_{\|\gamma\|\ge\delta/\sqrt n}\exp\Big\{-\frac12 Q_n(\beta,\gamma)\Big\}\,d\gamma\ge\varepsilon_1\right\}\le\varepsilon_1, \qquad(13)$$
and also $\sup_\beta P\{\|N(0,J^{-1}(\beta))\|\ge\delta\}<\varepsilon_1$. Define $\tilde g_n(\beta,\gamma) = g_n(\beta)+\sqrt n\,\nabla(\beta)\gamma$ and $R_n(\beta,\gamma) = g_n(\beta,\gamma)-\tilde g_n(\beta,\gamma)$. Let us show that
$$\sup_\beta\sup_{\|\gamma\|\le\delta/\sqrt n}\|R_n(\beta,\gamma)\|\to_p0. \qquad(14)$$
Indeed, $\|R_n(\beta,\gamma)\|\le\sqrt n\,\|\Phi(\beta,\gamma)-\nabla(\beta)\gamma\| + \|G_n(\beta,\gamma)-G_n(\beta)\|$. We have $\sup_\beta\sup_{\|\gamma\|\le\delta/\sqrt n}\|G_n(\beta,\gamma)-G_n(\beta)\|\to_p0$ by stochastic equicontinuity of $G_n$, and $\sup_\beta\sup_{\|\gamma\|\le\delta/\sqrt n}\sqrt n\,\|\Phi(\beta,\gamma)-\nabla(\beta)\gamma\|\to0$ by uniform differentiability of $\Phi$.

Denote $\tilde Q_n(\beta,\gamma) = \tilde g_n(\beta,\gamma)'\Sigma^{-1}(\beta)\tilde g_n(\beta,\gamma)$. Equation (14) implies that
$$\sup_\beta\sup_{\|\gamma\|\le\delta/\sqrt n}\left|1-\exp\Big\{-\frac12\big(\tilde Q_n(\beta,\gamma)-Q_n(\beta,\gamma)\big)\Big\}\right|\to_p0. \qquad(15)$$
Indeed, the left-hand side is bounded above, up to a factor converging to one, by
$$\sup_\beta\sup_{\|\gamma\|\le\delta/\sqrt n}\big|\tilde Q_n(\beta,\gamma)-Q_n(\beta,\gamma)\big| \le \sup_\beta\sup_{\|\gamma\|\le\delta/\sqrt n}\Big\{\big|(g_n+\tilde g_n)'\hat\Sigma_n^{-1}R_n\big| + \big|\tilde g_n'\big(\hat\Sigma_n^{-1}(\beta,\gamma)-\Sigma^{-1}(\beta)\big)\tilde g_n\big|\Big\}\to_p0.$$
The last convergence follows from continuity of the covariance function $\Sigma$, equation (14), and boundedness in probability of $g_n$, $\tilde g_n$ and $\hat\Sigma_n^{-1}$ over $\{\|\gamma\|\le\delta/\sqrt n\}$.

Denote $Q^\beta_n(\beta) = g_n(\beta)'M(\beta)g_n(\beta)$ and define the projection operator $P(\beta) = \Sigma^{-1/2}(\beta)\nabla(\beta)J(\beta)^{-1}\nabla(\beta)'\Sigma^{-1/2}(\beta)$; notice that $M(\beta) = \Sigma^{-1/2}(\beta)(I_k-P(\beta))\Sigma^{-1/2}(\beta)$. Then
$$\tilde Q_n(\beta,\gamma) = \tilde g_n(\beta,\gamma)'M(\beta)\tilde g_n(\beta,\gamma) + \tilde g_n(\beta,\gamma)'\Sigma^{-1/2}(\beta)P(\beta)\Sigma^{-1/2}(\beta)\tilde g_n(\beta,\gamma) = Q^\beta_n(\beta) + \big(G^*(\beta)+\sqrt n\,\gamma\big)'J(\beta)\big(G^*(\beta)+\sqrt n\,\gamma\big),$$
where $G^*(\beta) = J(\beta)^{-1}\nabla(\beta)'\Sigma^{-1}(\beta)G_n(\beta)$. Integration of the Gaussian pdf gives
$$n^{p_\gamma/2}\int_{\|\gamma\|\le\delta/\sqrt n}\exp\Big\{-\frac12\big(G^*(\beta)+\sqrt n\,\gamma\big)'J(\beta)\big(G^*(\beta)+\sqrt n\,\gamma\big)\Big\}\,d\gamma = |J(\beta)|^{-1/2}\,P\Big\{\big\|N\big(-G^*(\beta),\,J^{-1}(\beta)\big)\big\|\le\delta\Big\},$$
where the probability is evaluated over the normal draw holding $G^*(\beta)$ fixed. Since $G^*(\beta)$ is asymptotically distributed as $N(0,J^{-1}(\beta))$ uniformly in $\beta$, and $\delta$ was chosen large enough that $q(\delta) = P\{\|N(0,J^{-1}(\beta))\|\le\delta\}\ge1-\varepsilon_1$, this probability differs from $q(\delta)$, with probability approaching one, by a term that can be made arbitrarily small by the choice of $\delta$. Thus,
$$\sup_\beta\left|n^{p_\gamma/2}\int_{\|\gamma\|\le\delta/\sqrt n}\exp\Big\{-\frac12\tilde Q_n(\beta,\gamma)\Big\}\,d\gamma - |J(\beta)|^{-1/2}\,q(\delta)\exp\Big\{-\frac12 Q^\beta_n(\beta)\Big\}\right|$$
is asymptotically negligible up to terms vanishing with $\varepsilon_1$. Joining the last statement with equations (13) and (15), and letting $\varepsilon_1\to0$, we get
$$\sup_\beta\left|n^{p_\gamma/2}\int_{\Gamma(\beta)}\exp\Big\{-\frac12 Q_n(\beta,\gamma)\Big\}\,d\gamma - |J(\beta)|^{-1/2}\exp\Big\{-\frac12 Q^\beta_n(\beta)\Big\}\right|\to_p0.$$
Given statement (13), for $c(\beta,\gamma)$ satisfying the assumptions of Theorem 5 we have
$$\sup_\beta\left|n^{p_\gamma/2}\int_{\Gamma(\beta)}c(\beta,\gamma)\exp\Big\{-\frac12 Q_n(\beta,\gamma)\Big\}\,d\gamma - c(\beta,0)\,|J(\beta)|^{-1/2}\exp\Big\{-\frac12 Q^\beta_n(\beta)\Big\}\right|\to_p0.$$
Assumption 2 implies that $\int_B\pi_\gamma(0|\beta)\,|J(\beta)|^{-1/2}\exp\{-\frac12 Q^\beta_n(\beta)\}\,d\pi(\beta)$ is stochastically bounded away from zero. Thus, (12) holds. $\square$

Proof of Corollary 2. For each $a\in\mathcal A$ we can apply (12) to $c(\theta) = L(a,\theta)$. Since $L(a,\theta)$ is Lipschitz in $a$ and $\mathcal A$ is compact, this implies
$$\sup_{a\in\mathcal A}\left|\int_\Theta L(a,\theta)\,d\pi(\theta\mid g_n) - \frac{\int_B L(a,\vartheta(\beta,0))\exp\{-\frac12 Q^\beta_n(\beta)\}\,d\pi(\beta)}{\int_B\exp\{-\frac12 Q^\beta_n(\beta)\}\,d\pi(\beta)}\right|\to_p0.$$
We also have weak convergence of the process
$$\frac{\int_B L(\cdot,\vartheta(\beta,0))\exp\{-\frac12 Q^\beta_n(\beta)\}\,d\pi(\beta)}{\int_B\exp\{-\frac12 Q^\beta_n(\beta)\}\,d\pi(\beta)}\Rightarrow \mathcal L(\cdot)$$
on $\mathcal A$. This implies $\int_\Theta L(\cdot,\theta)\,d\pi(\theta\mid g_n)\Rightarrow\mathcal L(\cdot)$. Due to Theorem 3.2.2 in Van der Vaart and Wellner (1996),
$$\big(s_n(g_n),\,\tilde s_n(g_n)\big)\Rightarrow\Big(\operatorname*{argmin}_{a\in\mathcal A}\mathcal L(a),\ \operatorname*{argmin}_{a\in\mathcal A}\mathcal L(a)\Big).$$
Thus, $s_n(g_n)-\tilde s_n(g_n)\to_p0$. $\square$
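In practice the quasi-Bayes objects appearing in Theorem 5 and Corollary 2 are computed directly from the quasi-posterior proportional to $\pi(\theta)\exp\{-\frac12 Q_n(\theta)\}$, typically by MCMC as in Chernozhukov and Hong (2003). The sketch below is a minimal grid-based toy version for a scalar moment condition of our own choosing; it computes the quasi-Bayes decision under squared-error loss, i.e. the quasi-posterior mean.

```python
# Minimal quasi-Bayes sketch: posterior weights proportional to
# exp{-0.5 * Q_n(theta)} under a flat prior, for a toy scalar moment.
import numpy as np

rng = np.random.default_rng(5)
n = 500
X = rng.normal(loc=1.0, scale=2.0, size=n)      # true theta0 = log E[X] = 0

def Q_n(theta):
    # continuously updated GMM objective for the toy moment phi = X - e^theta
    phi = X - np.exp(theta)
    return n * phi.mean()**2 / phi.var()

grid = np.linspace(-1.0, 1.0, 2001)             # flat prior over the grid
logpost = np.array([-0.5 * Q_n(t) for t in grid])
w = np.exp(logpost - logpost.max())
w /= w.sum()                                    # normalized quasi-posterior

print((grid * w).sum())                         # posterior-mean decision, near 0
```

In applications the grid would be replaced by a sampler, but the decision rule is the same functional of the quasi-posterior.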