An Adversarial Approach to Structural Estimation
Tetsuya Kaji, Elena Manresa, and Guillaume Pouliot
University of Chicago and New York University
July 14, 2020
Abstract
We propose a new simulation-based estimation method, adversarial estimation, for structural models. The estimator is formulated as the solution to a minimax problem between a generator (which generates synthetic observations using the structural model) and a discriminator (which classifies whether an observation is synthetic). The discriminator maximizes the accuracy of its classification while the generator minimizes it. We show that, with a sufficiently rich discriminator, the adversarial estimator attains parametric efficiency under correct specification and the parametric rate under misspecification. We advocate the use of a neural network as a discriminator that can exploit adaptivity properties and attain fast rates of convergence. We apply our method to the elderly's saving decision model and show that including gender and health profiles in the discriminator uncovers the bequest motive as an important source of saving across the wealth distribution, not only for the rich.
JEL Codes: C13, C45.
Keywords: structural estimation, generative adversarial networks, neural networks, simulated method of moments, indirect inference, efficient estimation.
We thank Mariacristina De Nardi and John Jones for sharing the data and codes for the empirical application and for very helpful discussion. We also thank Isaiah Andrews, Manuel Arellano, Stephane Bonhomme, Aureo De Paula, Costas Meghir, Chris Hansen, Koen Jochmans, Whitney Newey, Luigi Pistaferri, and Bernard Salanie, as well as numerous participants in conferences and venues for helpful discussion. Elsie Hoffet, Yijun Liu, Ignacio Ciggliutti, and Marcela Barrios provided superb research assistance. We gratefully acknowledge the support of the NSF by means of the Grant SES-1824304 and the Richard N. Rosett Faculty Fellowship and the Liew Family Faculty Fellowship at the University of Chicago Booth School of Business.

1 INTRODUCTION

Structural estimation is a useful tool to quantify economic mechanisms and learn about the effects of policies that are yet to be implemented. Structural models are naturally articulated as parametric models and, as such, may in principle be estimated using maximum likelihood (MLE). However, likelihood functions arising from economic models are sometimes too complex to evaluate or may not exist in closed form. Meanwhile, generating data from structural models is often feasible, even if it can be computationally intensive. This observation has spurred a large literature on simulation-based estimation methods.

A prominent example of such methods is the simulated method of moments (SMM). If we have particular features of the data we want to reproduce, SMM is an attractive tool to naturally incorporate them. At the same time, a naive strategy to stack a large number of moments is known to yield poor finite sample properties (Altonji and Segal, 1996). This tradeoff is especially pronounced in models with rich heterogeneity, where the number of moments may grow exponentially with the number of covariates, leading to the curse of dimensionality.
While it may be resolved if we can reduce the moments to a handful of informative ones, it is often the case that such a choice is not obvious.

This paper proposes a new simulation-based estimation method, adversarial estimation, that can be used regardless of whether we know which features to match. It is inspired by the generative adversarial networks (GAN), a machine learning algorithm developed by Goodfellow et al. (2014) to generate realistic images. We adopt its adversarial framework to estimate the structural parameters that generate realistic economic data. While maintaining the flexibility of SMM, our method is demonstrated to work well under rich heterogeneity.

The generative adversarial estimation framework is a minimax game between two components, a discriminator and a generator, over classification accuracy:

$$\min_{\text{generator}}\ \max_{\text{discriminator}}\ \text{classification accuracy}.$$

The generator is an algorithm to simulate synthetic data; its objective is to find a data-generating algorithm that confuses the discriminator. The discriminator is a classification algorithm that distinguishes observed data from simulated data; it takes a data point as input and classifies whether it comes from observed data or simulated data; its objective is to maximize the accuracy of its classification.

In the original GAN, both the discriminator and the generator are given as neural networks (hence the name). In this paper, we take the generator to be (derived from) the structural model that we intend to estimate, and the discriminator to be an arbitrary classification algorithm (while our primary choice is still a neural network). For classification accuracy, we use the cross-entropy loss, following Goodfellow et al. (2014).
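In code, the sample value of this game for a fixed discriminator is just the two-sample cross-entropy. A minimal sketch (the names `game_value` and `constant_half` are ours, not the paper's):

```python
import numpy as np

def game_value(D, X_real, X_synth):
    """Sample objective of the minimax game: the discriminator D maximizes
    this value; the generator minimizes it through X_synth."""
    return np.mean(np.log(D(X_real))) + np.mean(np.log(1.0 - D(X_synth)))

# A discriminator that cannot tell the two samples apart can always earn
# 2*log(1/2) by answering 1/2 everywhere.
def constant_half(X):
    return np.full(len(X), 0.5)
```

The value is at most 0 (perfect classification), and whenever the constant function $1/2$ belongs to the discriminator class, the inner maximum is at least $2\log(1/2)$.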
From an econometrics standpoint, the generator is minimizing the distance between observed data and simulated data defined by the choice of the discriminator and the inputs thereto.

Our method leverages not only GAN but also the growing literature on why neural networks excel. In the context of nonparametric regression, Bauer and Kohler (2019) show that a multilayer neural network circumvents the curse of dimensionality when the target function has a low-dimensional structure. Building on their approximation result, we show that the same holds true for the discriminator when the likelihood ratio has a low-dimensional structure. Moreover, we propose a heuristic way to check low-dimensionality using an autoencoder, another seminal machine learning algorithm.

Interestingly, our framework can be viewed as a bridge between SMM and MLE. When we use logistic regression as the discriminator, the resulting estimator is asymptotically equivalent to optimally weighted SMM (Example 2). When we use the oracle discriminator given by a likelihood ratio, the resulting estimator is equivalent to MLE (Example 3). What is interesting is the middle case, in which the oracle discriminator is not available but a sufficiently rich discriminator capable of approximating it is used (Example 4). We show that, under some conditions, the resulting estimator enjoys the best of both ends: (1) the flexibility to choose moments if desired, (2) no closed-form likelihood required, (3) the same asymptotic efficiency as MLE.

Our theoretical development proceeds as follows. First, we establish the rate of convergence of a general discriminator (Theorem 1). Then, we apply this to the discriminator given by a neural network (Proposition 3). Next, we develop the parametric rate of convergence of the generator under possible global misspecification (Theorem 6). Finally, we deduce parametric efficiency of the generator under correct specification (Corollary 7).
To the best of our knowledge, this is the first work to

[Footnote] There are other losses considered in the literature (Nowozin et al., 2016; Arjovsky et al., 2017), which we do not cover.
[Footnote] Low-dimensionality is a feature of some structural models, where a small number of factors drives variation of multiple outcomes (e.g., Cunha et al., 2010).

ratio, which suffers much less from issues related to the tail or the support.

Gallant and Tauchen (1996) propose the generalized method of moments (GMM) using the score of an auxiliary model whose likelihood is available and show that it is efficient when the auxiliary model nests the structural model. This paper differs in not requiring a tractable auxiliary model that approximates the structural model.

Finally, this paper contributes to statistics. As much as statistical characterization of machine learning algorithms is an active area of research, it is also an important problem to characterize the statistical properties when the model is misspecified (Kleijn and van der Vaart, 2006, 2012; Jankowski, 2014). This paper adds to the list of such work by deriving the asymptotic distribution of the adversarial estimator under global misspecification. As stated earlier, some intermediate results in the paper may be useful in various fields.

The rest is organized as follows. Section 2 defines the adversarial framework. Section 3 develops the asymptotic properties of the adversarial estimator. Section 4 discusses implementation of estimation and inference. Section 5 revisits the investigation of the elderly's saving motive by De Nardi et al. (2010). The appendix contains the proofs. The online appendix contains a Monte Carlo exercise on a simplified Roy model, an addendum on equivalence with SMM, and details on the empirical application.

2 ADVERSARIAL ESTIMATION FRAMEWORK

This section defines the adversarial estimation framework.
It accommodates structural models with a finite number of parameters, possibly with covariates.

The estimation problem we consider is one for which likelihood evaluation is not feasible but simulation is. Hence, there are two sets of observations: the actual observations and the synthetic observations. We let $\{X_i\}_{i=1}^n$ represent the actual observations of size $n$ drawn i.i.d. from a measure space $(\mathcal X, \mathcal A, P_0)$ and $\{X_i^\theta\}_{i=1}^m$ the synthetic observations of size $m$ generated i.i.d. from $(\mathcal X, \mathcal A, P_\theta)$, where $\{P_\theta : \theta \in \Theta\}$ is a parametric model over $(\mathcal X, \mathcal A)$. If there is $\theta \in \Theta$ such that $P_\theta = P_0$, the structural model is said to be correctly specified, while we allow for the possibility that this is not the case. Furthermore, we are concerned with the case where $\{X_i^\theta\}_{i=1}^m$ are generated through a set of common latent variables that do not depend on $\theta$; that is, there exists a measure space $(\tilde{\mathcal X}, \tilde{\mathcal A}, \tilde P)$ and i.i.d. observations therefrom $\{\tilde X_i\}_{i=1}^m$ such that $X_i^\theta = T_\theta(\tilde X_i)$ almost surely for a deterministic measurable function $T_\theta : \tilde{\mathcal X} \to \mathcal X$. This implies that $P_\theta$ is a pushforward measure of $\tilde P$ under $T_\theta$, that is, $P_\theta = \tilde P \circ T_\theta^{-1}$.

[Footnote] The latent variables are called common random numbers (Gouriéroux et al., 2010).

This setup arises naturally in complex structural models with dynamic optimization, learning, or latent types that render analytic characterization of the likelihood infeasible. We note that our framework does not cover structural models with a semiparametric component; such an extension is left for future work.

Example 1 (Structural model). Let $\{y_i, x_i\}_{i=1}^n$ be i.i.d. with $y_i \in \mathbb R^{d_y}$ and $x_i \in \mathbb R^{d_x}$. Consider a structural parametric conditional model where individual outcomes are functions of exogenous variables $x_i$, an error $\varepsilon_i \in \mathbb R^{d_\varepsilon}$ with a known distribution given $x_i$, and a finite-dimensional parameter $\theta \in \Theta \subset \mathbb R^K$; that is, $y_i^\theta = f(x_i, \varepsilon_i; \theta)$ for some function $f$.
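For concreteness, the generator in Example 1 can be sketched with a toy linear choice of $f$ (our illustration, not a model from the paper); the latent draws are held fixed once, so the synthetic sample is a deterministic function of $\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)        # exogenous covariates x_i
eps = rng.normal(size=n)      # errors eps_i with known distribution, drawn ONCE

def generate(theta):
    """y_i^theta = f(x_i, eps_i; theta) for a toy linear f(x, e; (a, b)) = a + b*x + e.
    Re-running with the same theta returns the same sample (common random numbers)."""
    a, b = theta
    y = a + b * x + eps
    return np.column_stack([y, x])   # stack outcome and covariate as features
```

Because `x` and `eps` are fixed, `generate` plays the role of the deterministic map $T_\theta$ applied to the latent draws.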
The object of interest is typically a function of the structural parameter $\theta$ such as the effect of a counterfactual policy.

It is often the case that the associated likelihood of a complex structural model is not available in closed form but simulation is feasible; in particular, we have access to an i.i.d. sample $\{(\varepsilon_i, x_i)\}_{i=1}^m$ of size $m$, where in conditional models $\{x_i\}_{i=1}^m$ is typically sampled from the empirical distribution of $\{x_i\}_{i=1}^n$, and for any value of $\theta$ we can map it into $\{(y_i^\theta, x_i)\}_{i=1}^m$.

Let $X_i = G(y_i, x_i) \in \mathbb R^d$ be a set of $d$ functions of $(y_i, x_i)$ representing the features of the data the researcher chooses to use in estimation. Some examples of $X_i$ are a subvector of $(y_i, x_i)$, transformations (like logarithms, growth rates, or interactions), or simply the full vector $(y_i, x_i)$. The simulated counterpart, $X_i^\theta = G(y_i^\theta, x_i)$, is the same transformation now as a function of $y_i^\theta$ and $x_i$. □

If we choose $\theta$ such that $P_\theta$ is very different from $P_0$, it would be easy to distinguish $X_i^\theta$ from $X_i$. Conversely, if $P_\theta$ is close to $P_0$, distinction would be harder. The idea behind our method is to pick a classification algorithm, possibly state-of-the-art machine learning, and search for the value of $\theta$ for which the algorithm can classify the least.

Classification is defined formally as a function $D : \mathcal X \to [0,
1] such that for given $X$, $D(X)$ represents the likeliness of $X$ being an actual observation in the scale of a unity; $D(X) = 1$ means that $X$ is classified as "actual" with certainty; $D(X) = 0$ that $X$ is classified as "synthetic". Let $\mathcal D$ be the class of classification functions realizable in the algorithm, e.g., the class of appropriate neural networks.

The adversarial estimator is defined by the following minimax problem:

$$\hat\theta = \arg\min_{\theta\in\Theta}\ \max_{D\in\mathcal D}\ \frac1n\sum_{i=1}^n \log D(X_i) + \frac1m\sum_{i=1}^m \log\big(1 - D(X_i^\theta)\big). \qquad (1)$$

Since $D$ is a function between 0 and 1, both $\log D$ and $\log(1-D)$ are nonpositive. If $X_i$ and $X_i^\theta$ are very different from each other (which is the case when $P_\theta$ is far from $P_0$), the discriminator may be able to find $D$ that assigns 1 on the support of $X_i$ and 0 on the support of $X_i^\theta$, in which case the inner maximization attains the value of zero. Meanwhile, however close $X_i^\theta$ is to $X_i$, the discriminator can at least pick $D \equiv 1/2$, in which case the maximized value is at least $2\log(1/2)$. Thus, the inner maximum lies between $2\log(1/2)$ and 0, and the closer it is to $2\log(1/2)$, the harder the two samples are to discriminate. The population counterpart of the problem is

$$\min_{\theta\in\Theta}\ \max_{D\in\mathcal D}\ \mathbb E_{X_i\sim P_0}[\log D(X_i)] + \mathbb E_{X_i^\theta\sim P_\theta}[\log(1 - D(X_i^\theta))].$$

[Footnote] This is of course provided that a constant function $1/2$ is contained in $\mathcal D$, which is usually the case.

If we do not have a restriction on $\mathcal D$ (so that any function $D : \mathcal X \to [0,1]$ is allowed), the optimum classification function for the inner maximization is known to be

$$D_\theta(X) := \frac{p_0(X)}{p_0(X) + p_\theta(X)},$$

where $p_0$ and $p_\theta$ are the densities of $P_0$ and $P_\theta$ with respect to some common dominating measure (Goodfellow et al., 2014, Proposition 1). Note here that the objective function with this choice of $D$ is equal to the Jensen-Shannon divergence between $P_0$ and $P_\theta$. If the model is correctly specified, then $\theta_0$ is the unique solution to the outer minimization (Goodfellow et al., 2014, Theorem 1). In turn, when the model is not correctly specified, we set our target parameter, denoted as well by $\theta_0$, to be the pseudo-parameter that minimizes the Jensen-Shannon divergence.

We now look at three examples of $\mathcal D$.

Example 2 (Logistic discriminator). Let $\Lambda(t) = (1 + e^{-t})^{-1}$ and $\mathcal D$ be the class of logistic discriminators $D(X) = \Lambda(\lambda' X)$ for $\lambda \in \mathbb R^d$. The objective function can be interpreted as the log-likelihood of a logistic regression model where the actual observation is associated with outcome 1 and the synthetic with 0. Here, we give a rough intuition that the adversarial estimator matches the moment $\mathbb E[X_i]$.

The first-order condition (FOC) of the inner maximization is

$$\frac1n\sum_{i=1}^n \big(1 - \Lambda(\lambda' X_i)\big) X_i - \frac1m\sum_{i=1}^m \Lambda(\lambda' X_i^\theta)\, X_i^\theta = 0.$$

Thus, the discriminator searches for $\lambda$ that matches the weighted averages of $X_i$ and $X_i^\theta$. If there exists $\theta$ for which $\mathbb E[X_i] = \mathbb E[X_i^\theta]$, then $\lambda(\theta) = 0$ would be a solution, since then $\mathbb E[(1 - \Lambda(0)) X_i] = \mathbb E[X_i]/2 = \mathbb E[X_i^\theta]/2 = \mathbb E[\Lambda(0) X_i^\theta]$. As a matter of fact, by concavity of the objective function with respect to $\lambda$, it is the only solution. Recalling then that the unique outer minimum is attained when the inner maximizer is $D \equiv 1/2$ (i.e., $\lambda = 0$), we find that $\hat\theta$ solves $\frac1n\sum_{i=1}^n X_i = \frac1m\sum_{i=1}^m X_i^{\hat\theta} + o_p(1)$.

[Footnote] This is analogous to MLE estimating a pseudo-parameter that minimizes the Kullback-Leibler divergence under misspecification (Huber, 1967; White, 1982; Patilea, 2001).
Thus, $\hat\theta$ matches the means of $X_i$ and $X_i^{\hat\theta}$. In Appendix S.3, we prove asymptotic equivalence of this $\hat\theta$ and the optimally weighted SMM with moment $\mathbb E[X_i]$. □

Example 3 (Oracle discriminator). Let $\mathcal D$ be the oracle discriminator $D_\theta$. Then, the estimator boils down to the minimizer of the sample Jensen-Shannon divergence

$$\hat\theta = \arg\min_{\theta\in\Theta}\ \frac1n\sum_{i=1}^n \log\frac{p_0(X_i)}{p_0(X_i) + p_\theta(X_i)} + \frac1m\sum_{i=1}^m \log\frac{p_\theta(X_i^\theta)}{p_0(X_i^\theta) + p_\theta(X_i^\theta)}.$$

Taking the FOC reveals that the minimizer matches the scores of the actual and synthetic observations. In particular, the associated estimator is efficient under correct specification as $n/m \to 0$. □

Example 4 (Nonparametric discriminator and neural network). In general, we do not know the oracle $D_\theta$ in closed form, but we may consider a sieve $\mathcal D_n$ of classes of functions that expands as the sample size increases (Chen, 2007). If we choose a sieve of neural networks, $D$ can be written in the following form. Denote the hidden-layer activation function by $\sigma : \mathbb R \to \mathbb R$ and the output activation function by $\Lambda : \mathbb R \to \mathbb R$. Let $L$ be the number of hidden and output layers. Let $w_{ij}^\ell$ be the weight for the $i$th node in the $(\ell+1)$th layer on the $j$th node in the $\ell$th layer; for example, the input to the second node in the first layer is $w_{21}^0 x_1 + \cdots + w_{2U}^0 x_U$, where $X = (x_1, \ldots, x_U)$ is the input to the network. Let $w_i^\ell = (w_{i1}^\ell, \ldots, w_{iU}^\ell)'$ be the column vector of weights for the $i$th node in the $(\ell+1)$th layer. Let $w^\ell = (w_1^\ell, \ldots, w_U^\ell)$ be the matrix with columns $w_i^\ell$; note that for $\ell = L$, $w^L$ is just a column vector as there is only one output. Let $w$ be the vector of all parameters. Then, the discriminator is given by

$$D(X; w) = \Lambda\big(w^{L\prime}\, \sigma(w^{(L-1)\prime}\, \sigma(\cdots\, w^{1\prime}\, \sigma(w^{0\prime} X)))\big),$$

where $\sigma(v)$ for a vector $v$ is elementwise application. There is an enormous literature on why (deep) neural networks do well (Yarotsky, 2017; Bach, 2017; Mhaskar and Poggio, 2020).
Among them, we exploit Bauer and Kohler (2019) in Proposition 3. □

[Footnote] Moreover, estimation based on matching scores can have better properties than estimation based on equating the score to 0. In the dynamic fixed effect panel model, Gouriéroux et al. (2010) show that the resulting estimator is unbiased, while MLE suffers from the incidental parameter problem.
[Footnote] If we include a constant input and a constant node (also known as the "bias" term), it is assumed to be already incorporated in $X$ and $w$.

3 ASYMPTOTIC PROPERTIES

We denote the empirical measures attached to $X_i$ by $P_n$, to $X_i^\theta$ by $P_m^\theta$, and to $\tilde X_i$ by $\tilde P_m$; note that we also have $P_m^\theta = \tilde P_m \circ T_\theta^{-1}$. Let $\mu$ be a measure that dominates $P_0$ and $\{P_\theta\}$ and denote their densities by $p_0$ and $\{p_\theta\}$. We usually omit $d\mu$; for example, $\int f p_0 = \int f p_0\, d\mu = \int f\, dP_0$. We employ the operator notation for expectation, e.g., $P_0 \log D = \mathbb E_{X_i\sim P_0}[\log D(X_i)]$ and $P_m^\theta \log(1-D) = \frac1m\sum_{i=1}^m \log(1 - D(X_i^\theta)) = \tilde P_m \log(1-D)\circ T_\theta$. As a shorthand, we denote the population and sample objective functions by

$$M^\theta(D) := P_0 \log D + P_\theta \log(1-D), \qquad M_{n,m}^\theta(D) := P_n \log D + P_m^\theta \log(1-D).$$

The sample inner maximizer given $\theta$ is denoted by $\hat D_{n,m}^\theta$ and the outer minimizer by $\hat\theta_{n,m}$. The distance of discriminators is measured by a Hellinger-type distance

$$d_\theta(D_1, D_2) := \sqrt{h_\theta(D_1, D_2)^2 + h_\theta(1-D_1, 1-D_2)^2}, \quad \text{where } h_\theta(D_1, D_2) := \sqrt{(P_0 + P_\theta)\big(\sqrt{D_1} - \sqrt{D_2}\big)^2}.$$

The distance of $\theta$ is measured by the Hellinger distance on probability distributions, $h(p, q) := \sqrt{\int (\sqrt p - \sqrt q)^2}$. We use the shorthand $h(\theta_1, \theta_2)$ for $h(p_{\theta_1}, p_{\theta_2})$. We also occasionally use the distance

$$\tilde h(\theta_1, \theta_2) := \Bigg[\tilde P\Bigg(\sqrt{\frac{p_0}{p_{\theta_1}}}\circ T_{\theta_1} - \sqrt{\frac{p_0}{p_{\theta_2}}}\circ T_{\theta_2}\Bigg)^{\!2}\,\Bigg]^{1/2}.$$

The size of the sieve is measured by the bracketing entropy.
Definition (Bracketing number and bracketing entropy integral). The $\varepsilon$-bracketing number $N_{[\,]}(\varepsilon, \mathcal F, d)$ of a set $\mathcal F$ with respect to a premetric $d$ is the minimal number of $\varepsilon$-brackets in $d$ needed to cover $\mathcal F$. The $\delta$-bracketing entropy integral of $\mathcal F$ with respect to $d$ is

$$J_{[\,]}(\delta, \mathcal F, d) := \int_0^\delta \sqrt{1 + \log N_{[\,]}(\varepsilon, \mathcal F, d)}\; d\varepsilon.$$

[Footnote] Note that $h(\theta_1, \theta_2)$ is roughly equal to $[P_0(\sqrt{p_{\theta_1}/p_0} - \sqrt{p_{\theta_2}/p_0})^2]^{1/2}$. Therefore, $h$ and $\tilde h$ are the Hellinger distances measured by $X \sim P_0$ and $\tilde X \sim \tilde P$, respectively, so to speak. A similar Hellinger-like distance is considered in Patilea (2001).
[Footnote] A premetric on $\mathcal F$ is a function $d : \mathcal F \times \mathcal F \to \mathbb R$ that satisfies $d(f, f) = 0$ and $d(f, g) = d(g, f) \ge 0$ for every $f, g \in \mathcal F$. It is also called a "pseudosemimetric".

3.1 Assumptions

On the Sieve
Let $\mathcal D_{n,\delta}^\theta := \{D \in \mathcal D_n : d_\theta(D, D_\theta) \le \delta\}$. The following requires that the sieve does not grow too fast.

Assumption 1 (Entropy of sieve). There exists $\alpha < 2$ such that $J_{[\,]}(\delta, \mathcal D_{n,\delta}^\theta, d_\theta)/\delta^\alpha$ is decreasing in $\delta$ uniformly in $\theta \in \Theta$. There exists $\delta_n = o(n^{-1/4})$ such that $J_{[\,]}(\delta_n, \mathcal D_{n,\delta_n}^\theta, d_\theta) \lesssim \delta_n^2 \sqrt n$ uniformly in $\theta \in \Theta$.

Next is a refinement of the "bounded likelihood ratio" condition used in nonparametric maximum likelihood. It is often trivial if we assume a compact support for $X_i$, which is standard in the neural network literature.

Assumption 2 (Support compatibility). Let $P(X \mid A)$ be $P(X 1\{A\})/P(A)$ if $P(A) > 0$ and 0 otherwise. There exist $\delta_n = o(n^{-1/4})$ and $M$ such that uniformly in $\theta \in \Theta$,

$$\sup_{D \in \mathcal D_{n,\delta_n}^\theta} P_0\bigg(\frac{D_\theta}{D} \,\bigg|\, \frac{D_\theta}{D} \ge 2\bigg) < M, \qquad \sup_{D \in \mathcal D_{n,\delta_n}^\theta} P_\theta\bigg(\frac{1-D_\theta}{1-D} \,\bigg|\, \frac{1-D_\theta}{1-D} \ge 2\bigg) < M.$$

Also, the brackets $\{\ell \le D \le u\}$ in Assumption 1 can be taken so that $(P_0 + P_\theta)\big(\frac{D_\theta}{\ell}(\sqrt u - \sqrt\ell)^2\big)$ and $(P_0 + P_\theta)\big(\frac{1-D_\theta}{1-u}(\sqrt{1-\ell} - \sqrt{1-u})^2\big)$ are $O(d_\theta(u, \ell)^2)$.

The following is a sufficient condition for a particular family of neural network discriminators to satisfy Assumption 1 when $d^* < 2p$. It also accommodates the low-dimensional structure of Bauer and Kohler (2019).

Assumption 3 (Neural network). Let $P_0$ and $P_\theta$ have subexponential tails and finite first moments uniformly in $\theta \in \Theta$. Let $\log(p_0/p_\theta)$ satisfy the assumptions for $m$ in Bauer and Kohler (2019, Theorem 3) uniformly in $\theta \in \Theta$; in particular, $\log(p_0/p_\theta)$ satisfies a $(p, C)$-smooth generalized hierarchical interaction model of order $d^*$ and finite level $l$ with $K$ components for $p = q + s$, $q \in \mathbb N_0$, and $s \in (0, 1]$; all partial derivatives of order $q$ or less of the component functions $g_k$, $f_{j,k}$ of $\log(p_0/p_\theta)$ are bounded; and all $g_k$ are Lipschitz with a positive constant. Let $\mathcal D_n := \{\Lambda(f) : f \in \mathcal H^{(l)}\}$, $\Lambda(x) := 1/(1 + e^{-x})$, be a sieve of neural network discriminators that satisfy the assumptions of Lemma 2. Let $\mathcal H^{(l)}$ satisfy the assumptions for the neural network in Bauer and Kohler (2019, Theorem 3); in particular, $\mathcal H^{(l)}$ is defined as in Bauer and Kohler (2019, (6)) with $K$, $d$, $d^*$ as in the structure of $\log(p_0/p_\theta)$; the activation function is $q$-admissible; and the network size parameters $M^*$ and $\alpha$ are chosen as prescribed therein, with the rate $\delta_n = [(\log n)^{c}/n]^{\frac{p}{2p + d^*}}$ for a constant $c$ depending on $p$, $q$, and $d^*$.

[Footnote] E.g., van der Vaart and Wellner (1996, Theorem 3.4.4) and Ghosal et al. (2000, Lemma 8.7).
[Footnote] The low-dimensional structure in Bauer and Kohler (2019) is related to the target function satisfying a generalized hierarchical interaction representation. See Appendix S.2.4 for the definition.
[Footnote] We say that $P$ on $\mathbb R^d$ has subexponential tails if $\log P(\|X\|_\infty > a) \lesssim -a$ for large $a$.

On the Estimation Procedure
The following allows us to establish results at rates in terms of $n$.

Assumption 4 (Growing synthetic sample size). $n/m$ converges.

The following makes sure that the trained discriminator converges to the true discriminator at a rate fast enough to yield a meaningful estimator for $\theta$.

Assumption 5 (Approximately maximizing discriminator). The trained discriminator $\hat D_{n,m}^\theta \in \mathcal D_n$ satisfies $M_{n,m}^\theta(\hat D_{n,m}^\theta) \ge M_{n,m}^\theta(D_\theta) - o_P(n^{-1/2})$ uniformly over $\theta \in \Theta$.

The following ensures that the derivative of the sample objective function converges to that of the population. This is a standard assumption in $M$-estimation that involves nuisance parameters (Klein and Spady, 1993; Gouriéroux and Monfort, 1997; Fermanian and Salanié, 2004; Nickl and Pötscher, 2010) to obtain a regular estimator for $\theta$ (Newey, 1994). For this, it is important in practice to fix the structural shocks that generate synthetic data as well as random seeds in any stochastic optimization algorithm involved.

Assumption 6 (Approximately minimizing generator and orthogonality). There exists an open $G \subset \Theta$ that contains $\theta_0$ such that the estimator $\hat\theta_{n,m}$ satisfies

$$M_{n,m}^{\hat\theta_{n,m}}(\hat D_{n,m}^{\hat\theta_{n,m}}) \le \inf_{\theta \in G} M_{n,m}^\theta(\hat D_{n,m}^\theta) + o_P(n^{-1}),$$
$$\inf_{\theta \in G}\Big[M_{n,m}^{\hat\theta_{n,m}}(\hat D_{n,m}^{\hat\theta_{n,m}}) - M_{n,m}^\theta(\hat D_{n,m}^\theta)\Big] - \Big[M_{n,m}^{\hat\theta_{n,m}}(D_{\hat\theta_{n,m}}) - M_{n,m}^\theta(D_\theta)\Big] \ge o_P(n^{-1}).$$

On the Structural Model

Assumption 7 (Identification). For every open $G \subset \Theta$ that contains $\theta_0$, we have $\inf_{\theta \notin G} h(\theta, \theta_0) > 0$ and $\inf_{\theta \notin G} M^\theta(D_\theta) > M^{\theta_0}(D_{\theta_0})$.

The following assumes that the entropy of the structural model is low enough to admit a $\sqrt n$-estimator of $\theta$.

Assumption 8 (Hellinger bracketing of generative model). Let $\mathcal P_\delta := \{p_\theta : \theta \in \Theta,\ h(\theta, \theta_0) \le \delta\}$ and $\tilde{\mathcal P}_\delta := \{(p_0/p_\theta)\circ T_\theta : \theta \in \Theta,\ \tilde h(\theta, \theta_0) \le \delta\}$. There exists $r < \infty$ such that $N_{[\,]}(\varepsilon, \mathcal P_\delta, h) \lesssim (\delta/\varepsilon)^r$ and $N_{[\,]}(\varepsilon, \tilde{\mathcal P}_\delta, \tilde h) \lesssim (\delta/\varepsilon)^r$ for $0 < \varepsilon \le \delta$.
Also, $\tilde h(\theta, \theta_0) = O(h(\theta, \theta_0))$ as $\theta \to \theta_0$.

The following assumes a type of twice differentiability that is weaker than the pointwise one. Notably, it can be satisfied by densities with jumps and kinks, which appear in censored models, auctions, search models, and corporate finance (Chernozhukov and Hong, 2004; Strebulaev and Whited, 2011). It builds on Le Cam's differentiability in quadratic mean (Pollard, 1997; van der Vaart, 1998, Chapter 7) and adds local uniformity and twice differentiability. Local uniformity is required as our method involves measuring the distance with both actual and synthetic samples. Twice differentiability is needed to accommodate misspecification. The map $\dot\ell_{\theta_0}$ is the score function for $\theta_0$, and the matrix $I_{\theta_0}$ the Fisher information matrix for $\theta_0$.

Assumption 9 (Uniform and twice differentiability in quadratic mean). The parameter space $\Theta$ is (a subset of) a Euclidean space $\mathbb R^k$. The structural model $\{P_\theta : \theta \in \Theta\}$ is (locally) uniformly differentiable in quadratic mean at $\theta_0$, that is, there exists a $k$-vector of measurable functions $\dot\ell_{\theta_0} : \mathcal X \to \mathbb R^k$ such that for $h, g \in \mathbb R^k$ and $g \to 0$,

$$\int_{\mathcal X} \bigg[\sqrt{p_{\theta_0+h}} - \sqrt{p_{\theta_0+g}} - \frac12 (h - g)' \dot\ell_{\theta_0} \sqrt{p_{\theta_0+g}}\bigg]^2 = o(\|h - g\|^2).$$

It is also twice differentiable in quadratic mean at $\theta_0$, that is, there exists a $k \times k$ matrix of measurable functions $\ddot\ell_{\theta_0} : \mathcal X \to \mathbb R^{k\times k}$ such that for $h \in \mathbb R^k$ and $h \to 0$,

$$\int_{\mathcal X} \bigg[\sqrt{p_{\theta_0+h}} - \sqrt{p_{\theta_0}} - \frac12 h' \dot\ell_{\theta_0} \sqrt{p_{\theta_0}} - \frac14 h' \ddot\ell_{\theta_0} h \sqrt{p_{\theta_0}} - \frac18 h' \dot\ell_{\theta_0} \dot\ell_{\theta_0}' h \sqrt{p_{\theta_0}}\bigg]^2 = o(\|h\|^4)$$

and $I_{\theta_0} := P_{\theta_0} \dot\ell_{\theta_0}\dot\ell_{\theta_0}' = -P_{\theta_0} \ddot\ell_{\theta_0}$. The matrix $\tilde I_{\theta_0} := 2 P_{\theta_0}\big(D_{\theta_0} \dot\ell_{\theta_0}\dot\ell_{\theta_0}' + (\ddot\ell_{\theta_0} + \dot\ell_{\theta_0}\dot\ell_{\theta_0}')\log(1 - D_{\theta_0})\big)$ is positive definite.

Remark.
The matrix $\tilde I_{\theta_0}$ is the curvature of the outer minimization.

Remark.
Under correct specification, the annoying term $(P_m^\theta - P_m^{\theta_0})\log(1 - D_{\theta_0})$ in Lemma 4 goes away, making twice differentiability unnecessary.

We impose very mild smoothness on the simulated data transformation compared to, e.g., Nickl and Pötscher (2010, Assumptions P1–2, R) or Gouriéroux and Monfort (1997, Chapter 2). Importantly, we do not exclude cases where $T_\theta$ is discontinuous. Such situations arise frequently in economics (Frazier et al., 2019) while many existing econometric theories rule them out.

Assumption 10 (Smooth synthetic data generation). For every compact $K \subset \mathbb R^k$,

$$\sqrt{\frac{n}{m}}\ \sup_{h \in K}\ \Big\| \sqrt m\,(\tilde P_m - \tilde P)\, D_{\theta_0} \big(\dot\ell_{\theta_0 + h/\sqrt n} \circ T_{\theta_0 + h/\sqrt n} - \dot\ell_{\theta_0} \circ T_{\theta_0}\big) \Big\| = o_P^*(1).$$

For the rate of convergence, we need that $P_0$ is "close enough" to $P_{\theta_0}$ in the sense that the Hellinger convergence of $P_\theta$ to $P_{\theta_0}$ takes place on the support of $P_0$.

Assumption 11 (Smooth synthetic model and overlapping support with $P_0$). There exists an open $G \subset \Theta$ containing $\theta_0$ in which $M^\theta(D_\theta) - M^{\theta_0}(D_{\theta_0}) \gtrsim h(\theta, \theta_0)^2$. For every compact $K \subset \mathbb R^k$,

$$\sqrt{\frac{n}{m}}\ \sup_{h \in K}\ \bigg| \sqrt m\,\Big[\big(P_m^{\theta_0 + h/\sqrt n} - P_m^{\theta_0}\big) - \big(P_{\theta_0 + h/\sqrt n} - P_{\theta_0}\big)\Big] \log(1 - D_{\theta_0}) \bigg| = o_P^*(1).$$

Also, $h(\theta, \theta_0)^2 = O\big(\int D_{\theta_0}\, (\sqrt{p_\theta} - \sqrt{p_{\theta_0}})^2\big)$ as $\theta \to \theta_0$.

Remark.
The first condition of Assumption 11 is implied by positive definiteness of $\tilde I_{\theta_0}$ in Assumption 9.

The following assumption is required for efficiency.

Assumption 12 (Correct specification). The synthetic model $\{P_\theta : \theta \in \Theta\}$ is correctly specified, that is, $P_{\theta_0} = P_0$ and $D_{\theta_0} \equiv 1/2$.

Remark.
Assumption 12 implies Assumption 11.
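Under Assumption 12, the curvature matrix $\tilde I_{\theta_0}$ as stated in Assumption 9 reduces to the Fisher information; a one-line check, using $D_{\theta_0} \equiv 1/2$ and the identity $I_{\theta_0} = P_{\theta_0}\dot\ell_{\theta_0}\dot\ell_{\theta_0}' = -P_{\theta_0}\ddot\ell_{\theta_0}$:

```latex
\tilde I_{\theta_0}
  = 2\,P_{\theta_0}\!\Big(\tfrac12\,\dot\ell_{\theta_0}\dot\ell_{\theta_0}'
      + \big(\ddot\ell_{\theta_0} + \dot\ell_{\theta_0}\dot\ell_{\theta_0}'\big)\log\tfrac12\Big)
  = I_{\theta_0} + 2\log\tfrac12\,\big(-I_{\theta_0} + I_{\theta_0}\big)
  = I_{\theta_0}.
```

This is consistent with Corollary 7, where the limit variance is $(1 + \lim n/m)\, I_{\theta_0}^{-1}$.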
Theorem 1 (Rate of convergence of discriminator). Under Assumptions 1, 4, and 5, $d_\theta(\hat D_{n,m}^\theta, D_\theta) = o_P^*(n^{-1/4})$ uniformly in $\theta \in \Theta$.

Theorem 2 (Rate of convergence of objective function). Under Assumptions 1, 2, 4, and 5, $M_{n,m}^\theta(\hat D_{n,m}^\theta) - M_{n,m}^\theta(D_\theta) = o_P(n^{-1/2})$ uniformly in $\theta \in \Theta$.

[Footnote] For example, limited dependent variable models satisfy Assumption 10 under Assumption 4.

Proposition 3 (Rate of convergence of neural network discriminator). Under Assumptions 3 to 5, $d_\theta(\hat D_{n,m}^\theta, D_\theta) = O_P^*(\delta_n)$.

Consistency can be proved with different, conceptually weaker assumptions.
Theorem 4 (Consistency of generator). Suppose that for every open $G \subset \Theta$ that contains $\theta_0$, $\inf_{\theta \notin G} M^\theta(D_\theta) > M^{\theta_0}(D_{\theta_0})$, that $\mathcal M_1 := \{\log D_\theta : \theta \in \Theta\}$ and $\mathcal M_2 := \{\log(1 - D_\theta)\circ T_\theta : \theta \in \Theta\}$ are $P_0$- and $\tilde P$-Glivenko-Cantelli respectively, and that the estimator $\hat\theta_{n,m}$ satisfies $M_{n,m}^{\hat\theta_{n,m}}(\hat D_{n,m}^{\hat\theta_{n,m}}) \le \inf_{\theta \in \Theta} M_{n,m}^\theta(\hat D_{n,m}^\theta) + o_P^*(1)$. Then, under the conclusion of Theorem 2' with $\delta_n \to 0$, $h(\hat\theta_{n,m}, \theta_0) \to 0$ in outer probability.

Theorem 5 (Rate of convergence of generator). Under Assumptions 4, 6 to 8, and 11, $h(\hat\theta_{n,m}, \theta_0) = O_P^*(n^{-1/2})$.

Theorem 6 (Asymptotic distribution of generator). Under the conclusion of Theorem 5 and Assumptions 4, 6, 7, and 9 to 11,

$$\sqrt n\,(\hat\theta_{n,m} - \theta_0) = 2\tilde I_{\theta_0}^{-1} \sqrt n\,\Big[P_n\,(1 - D_{\theta_0})\dot\ell_{\theta_0} - P_m^{\theta_0}\, D_{\theta_0}\dot\ell_{\theta_0}\Big] + o_P^*(1) \rightsquigarrow N\bigg(0,\ 4\,\tilde I_{\theta_0}^{-1}\Big[\Big(P_{\theta_0} + \lim_{n\to\infty}\frac{n}{m}\, P_0\Big)\, D_{\theta_0}(1 - D_{\theta_0})\,\dot\ell_{\theta_0}\dot\ell_{\theta_0}'\Big]\, \tilde I_{\theta_0}^{-1}\bigg).$$

Corollary 7 (Efficiency of generator). Under the conclusion of Theorem 6 and Assumption 12,

$$\sqrt n\,(\hat\theta_{n,m} - \theta_0) \rightsquigarrow N\bigg(0,\ \Big[1 + \lim_{n\to\infty}\frac{n}{m}\Big]\, I_{\theta_0}^{-1}\bigg).$$

Remark. If $n/m \to 0$, $\hat\theta_{n,m}$ attains parametric efficiency.

What If $\mathcal D$ Is Not Rich Enough?
Our theory assumes that $\mathcal D$ is a sieve that eventually is capable of representing $D_\theta$. In finite samples, however, we do not know how well $\mathcal D$ approximates $D_\theta$. Therefore, it is interesting to know what happens when $\mathcal D$ is not a sieve but a fixed class of functions. Although the complete treatment of this case is beyond our scope, we examine what happens to the population problem as we enrich $\mathcal D$, e.g., by gradually adding nodes and layers to the neural network.

For simplicity, we maintain Assumptions 2 and 12 and assume that $\mathcal D$ contains a constant function $1/2$. Let $\tilde D_\theta$ be the population maximizer of $M^\theta(D)$ in $\mathcal D$. Since $M^\theta(D) - M^\theta(D_\theta) = -d_\theta(D, D_\theta)^2 + o(d_\theta(D, D_\theta)^2)$ by Theorem 2', $\tilde D_\theta$ is equivalent to a minimizer of $d_\theta(D, D_\theta)^2$ in $\mathcal D$ up to $o(d_\theta(D, D_\theta)^2)$. Under Assumption 12, $\tilde D_{\theta_0} = D_{\theta_0} \equiv 1/2$ and $M^{\theta_0}(1/2) = M^\theta(1/2) = 2\log(1/2)$. Therefore,

$$M^{\theta_0}(\tilde D_{\theta_0}) - M^\theta(\tilde D_\theta) = \big[M^{\theta_0}(D_{\theta_0}) - M^\theta(D_\theta)\big] + \big[M^\theta(D_\theta) - M^\theta(\tilde D_\theta)\big] = -d_\theta(D_{\theta_0}, D_\theta)^2 + d_\theta(\tilde D_\theta, D_\theta)^2 + o(d_\theta(D_{\theta_0}, D_\theta)^2) + o(d_\theta(\tilde D_\theta, D_\theta)^2).$$

Note that by Lemma 7,

$$d_\theta(D_{\theta_0}, D_\theta)^2 = \int\bigg(\sqrt{\frac{p_0 + p_\theta}{2}} - \sqrt{p_0}\bigg)^2 + \int\bigg(\sqrt{\frac{p_0 + p_\theta}{2}} - \sqrt{p_\theta}\bigg)^2 = \frac12\, h(p_0, p_\theta)^2 + o(h(p_0, p_\theta)^2).$$

Thus, we obtain

$$M^{\theta_0}(\tilde D_{\theta_0}) - M^\theta(\tilde D_\theta) = -\frac12\, h(p_0, p_\theta)^2 + d_\theta(\tilde D_\theta, D_\theta)^2 + o(h(p_0, p_\theta)^2) + o(d_\theta(\tilde D_\theta, D_\theta)^2).$$

If $\mathcal D$ contains $D_\theta$, then the second term is zero and the Hellinger curvature allows us to estimate $\theta$ efficiently; if $\mathcal D$ is a singleton set that contains only $1/2$, the first and second terms cancel and the objective function becomes completely flat, rendering estimation of $\theta$ impossible. Therefore, the second term represents the loss in efficiency due to the limited capacity of $\mathcal D$. For the regular logit case, we know that $\mathcal D$ is already rich enough that the curvature admits $\sqrt n$-estimation. Then, as we enrich $\mathcal D$, it becomes more and more capable of minimizing $d_\theta(\tilde D_\theta, D_\theta)^2$, getting closer and closer to efficiency. Of course, such enrichment should not be too fast to avoid overfitting, the conditions of which are characterized above.

4 PRACTICAL ASPECTS

The method requires the choice of inputs $X_i$ and the choice of the discriminator $\mathcal D$. A natural choice of $X_i$ is the entire set of observables, $X_i = (y_i, x_i)$. While our method is intended so that we need not worry about selecting or creating moments, in the event that we want to emphasize specific aspects of the data, we may still do so by dropping a part of the observables or by transforming them. For example, although our theory allows for discontinuous $T_\theta$, we may still want to adopt the fix of Bruins et al. (2018) to accommodate gradient-based optimization methods. At any rate, the choice of inputs must ensure that the parameters of the structural model are identified.

The choice of the discriminator is more nuanced in that there is no natural, obvious choice. However, if a generative model is not computationally demanding, we may test several discriminators on their abilities to recover the generative parameters. In particular, pick an arbitrary $\theta$ as the "true" value and generate data; treat them as the observed data and run adversarial estimation with several choices of $\mathcal D$; then, pick the one that performs the best. (Indeed, this can also be used to try out different choices of inputs.)
If we are worried about severe misspecification, we may also test using the actual data: split the data into two and make sure that the discriminator cannot separate them too well.

In applications where generating synthetic data is very costly (as in our empirical application), we suggest choosing the discriminator based on cross validation as follows. Fix θ at some value; split the actual data into two, say samples 1 and 2; use sample 1 and synthetic data to estimate D for different choices of D; use sample 2 and new synthetic data to evaluate the classification accuracy of each D; pick the one with the highest accuracy. For the value of θ, we may use estimates from a previous study if available, or try a few different values to check for robustness. See Section 5.4 for more on what we did in our empirical application.

We note that the analysis of the estimator taking into account the selection of inputs and the discriminator is left for future work.

It is helpful to fit an autoencoder on X to get a sense of its underlying dimensionality. Proposition 3 shows that the convergence rate of the neural network discriminator depends on the underlying dimension d* (rather than the dimension) of X. The bottleneck of the autoencoder (the middle layer with the smallest number of nodes) is indicative of the underlying dimension. See Appendix S.2.4 for intuition and evidence of reduced dimensionality of X. (The network structure in Assumption 3 depends on unknown constants such as d* and α.)

4.3 Estimation Procedure

We consider an iterative algorithm that solves the optimization problem in (1).
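As a preview of the formal statement below, the loop can be sketched end-to-end on a toy generative model X^θ = θ + X̃ with standard normal shocks. Everything in this sketch (the quadratic-feature logistic discriminator, step sizes, iteration counts) is our own illustrative choice, not the paper's implementation; the empirical application uses a neural network discriminator.

```python
import numpy as np

rng = np.random.default_rng(0)
x_real = rng.normal(1.0, 1.0, size=2000)  # "actual" data; the true theta is 1
shocks = rng.normal(0.0, 1.0, size=2000)  # step i: shocks, fixed once for all iterations

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def feats(x, mu, sd):
    raw = np.column_stack([x, x**2])
    return np.column_stack([np.ones(len(x)), (raw - mu) / sd])

def fit_discriminator(x_r, x_s):
    pooled = np.concatenate([x_r, x_s])
    raw = np.column_stack([pooled, pooled**2])
    mu, sd = raw.mean(0), raw.std(0)  # scaler, held fixed with the discriminator
    f = np.vstack([feats(x_r, mu, sd), feats(x_s, mu, sd)])
    y = np.concatenate([np.ones(len(x_r)), np.zeros(len(x_s))])
    w = np.zeros(f.shape[1])
    for _ in range(300):              # logistic regression by gradient ascent
        w += 0.5 * f.T @ (y - sigmoid(f @ w)) / len(y)
    return w, mu, sd

def objective(theta, w, mu, sd):
    # Cross-entropy objective for a *fixed* discriminator (w, mu, sd).
    d_real = sigmoid(feats(x_real, mu, sd) @ w)
    d_synth = sigmoid(feats(theta + shocks, mu, sd) @ w)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_synth))

theta, xi, h = 3.0, 0.5, 1e-4
for _ in range(100):
    x_synth = theta + shocks                                  # step ii: generate
    w, mu, sd = fit_discriminator(x_real, x_synth)            # step iii: train D
    grad = (objective(theta + h, w, mu, sd)
            - objective(theta - h, w, mu, sd)) / (2 * h)      # step iv: gradient
    theta -= xi * grad                                        # step v: descend
```

Because the shocks are held fixed across iterations, the objective is smooth in θ and a finite-difference gradient suffices in this toy; θ should settle near the true value of 1, where the discriminator can no longer separate the two samples.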
Algorithm (Estimation).
i. Initialize θ = θ(0). Fix a set of random shocks {X̃_i}_{i=1}^m and any random seed if stochastic optimization is used.
ii. For given θ = θ(s), generate {X^{θ(s)}_i}_{i=1}^m using {X̃_i}_{i=1}^m.
iii. Reset the random seed and train D̂^{θ(s)}_{n,m} with {X_i}_{i=1}^n and {X^{θ(s)}_i}_{i=1}^m.
iv. Compute the gradient ∆(θ(s)) of the objective function with respect to θ.
v. Set θ(s+1) = θ(s) − ξ∆(θ(s)), where ξ > 0 is a step size, and repeat ii–v until ∆(θ(s)) ≈ 0.

The asymptotic variance formula given in Theorem 6 is challenging to estimate since we do not have the closed-form likelihood. We advocate the use of the bootstrap, as the crux of the theory is that the estimation error of D̂^θ_{n,m} can be ignored in the asymptotics of θ̂. (There is a relation between D_θ and the score and Hessian, ℓ̇_θ = (1/D_θ) ∂log(1 − D_θ)/∂θ = −(1/(1 − D_θ)) ∂log D_θ/∂θ and ℓ̈_θ + ℓ̇_θℓ̇_θᵀ = (1/(1 − D_θ))[(∂log D_θ/∂θ)(∂log D_θ/∂θᵀ) − ∂²log D_θ/∂θ∂θᵀ], so it is possible to construct the sample counterpart of the variance in Theorem 6, though we do not pursue the proof of its convergence in this paper.) When the standard bootstrap is computationally burdensome, we can use the bootstrap proposed by Honoré and Hu (2017), as we do in Section 5.

Algorithm (Bootstrap).
i. Let {X*_i}_{i=1}^n and {X̃^{θ*}_i}_{i=1}^m be the bootstrap samples of actual and synthetic observations of sizes n and m, drawn randomly with replacement.
ii. Solve (1) with {X*_i}_{i=1}^n and {X̃^{θ*}_i}_{i=1}^m to obtain a bootstrap estimator θ̂*^{(1)}_{n,m}.
iii. Repeat (i)–(ii) S times to obtain S bootstrap estimators {θ̂*^{(1)}_{n,m}, …, θ̂*^{(S)}_{n,m}}.
iv. Use the distribution of {θ̂*^{(s)}_{n,m}}_{s=1}^S to approximate the distribution of θ̂_{n,m}.

5 EMPIRICAL APPLICATION: "WHY DO THE ELDERLY SAVE?"

Using the adversarial framework, we examine the elderly's saving, following De Nardi et al. (2010) (henceforth DFJ). The elderly save for various reasons: uncertainty about survival, the bequest motive, or ever-rising medical expenses as they age. Different motives for saving have different implications for the evaluation of policies such as Medicaid and Medicare. Hence, this is an important and active area of research.

The risk the elderly face is highly heterogeneous, depending on their gender, age, health status, and permanent income. This implies potentially large heterogeneity in the saving motive across individuals; not accounting for it can bias the estimates of the utility parameters. For example, the rich live several years longer than the poor on average. Failure to reflect this difference can make the rich look thriftier than they are. On the other hand, existing estimation methods such as SMM may suffer from a severe lack of precision when rich heterogeneity is introduced. This motivates adversarial estimation with a flexible discriminator that parses information in an adaptive and parsimonious way. Indeed, our adversarial estimates, using the same model and the same data as DFJ, deliver considerable gains in precision.

We focus on the behavior of single, retired individuals of age 70 and older.
In each period, a surviving single retired agent receives utility u(c) from consumption c and, if they die in that period, additional utility φ(e) from leaving estate e, where

u(c) := c^{1−ν}/(1−ν),  φ(e) := ϑ (e + k)^{1−ν}/(1−ν),

and ν is the relative risk aversion and ϑ and k are the intensity and curvature of the bequest motive. Each individual is associated with gender g and permanent income I, and carries six state variables: age t, asset a_t, nonasset income y_t, health status h_t, medical expense shock ζ_t, and survival s_t. Health and survival are binary, where h_t = 1 means they are healthy at age t, and s_t = 1 that they survive to the next period. They face three channels of uncertainty: health, survival, and medical expenses. Health and survival evolve as Markov chains. We denote

π_H(g, h_t, I, t) := Pr(h_{t+1} = 1 | g, h_t, I, t),  π_S(g, h_t, I, t) := Pr(s_{t+1} = 1 | g, h_t, I, t).

The medical expenses they incur are given by

log m_t = m(g, h_t, I, t) + σ(g, h_t, I, t) × ψ_t,

where m and σ are deterministic functions, ψ_t = ζ_t + ξ_t, ξ_t ~ N(0, σ_ξ²), ζ_t = ρζ_{t−1} + ε_t, and ε_t ~ N(0, σ_ε²). The nonasset income evolves deterministically as y_t = y(g, I, t). The asset evolves as

a_{t+1} = a_t + y_n(r a_t + y_t, τ) + b_t − m_t − c_t,

where b_t ≥ 0 is the government transfer, r the risk-free pretax rate of return, y_n(·, τ) the posttax income, and τ the tax structure. The agent faces a borrowing constraint a_{t+1} ≥ 0 and a consumption floor c_t ≥ c̲; the government transfer b_t is positive only when both constraints cannot be satisfied without it. The timing in each period is given as follows.
Health h_t and medical expenses m_t realize; then the individual chooses consumption c_t; then survival s_t realizes; if s_t = 0, they leave the remaining assets as bequest; if s_t = 1, they move on to the next period. Denoting the cash-on-hand by x_t := c_t + a_{t+1}, the agent's Bellman equation is

V_t(x, g, h, I, ζ) = max_{c, x'} u(c) + β[s E_t V_{t+1}(x', g, h', I, ζ') + (1 − s)φ(e)]

subject to x' = (x − c) + y_n(r(x − c) + y', τ) + b' − m', e = (x − c) − max{0, τ̃(x − c − x̃)}, and x ≥ c ≥ c̲. The first constraint is the budget constraint; the second the bequest (taxed at rate τ̃ with deduction x̃); the last the borrowing and consumption constraints. We also look at two transformations: the marginal propensity to consume at the moment of death

MPC := (1 + r)/(1 + r + [βϑ(1 + r)]^{1/ν})

and the implied asset floor a := k/[βϑ(1 + r)]^{1/ν} above which individuals get utility from bequeathing. The marginal propensity to bequeath (MPB) is defined by 1 − MPC.

5.2 Data

We use the same data as DFJ, taken from
Assets and Health Dynamics Among the Oldest Old (AHEAD). The sample consists of non-institutionalized individuals of age 70 and older in 1994. It contains 8,222 individuals in 6,047 households (3,872 singles and 2,175 couples). The survey took place every two years from 1994 to 2006. We focus on 3,259 single retired individuals, of whom 592 are men and 2,667 are women. Of those, 884 were alive in 2006. We drop the first survey in 1994 for reliability, following DFJ. The survey collects information on age t, financial wealth a_t, nonasset income y_t, medical expenses m_t, and health status h_t. Financial wealth includes real estate, autos, several other liquid assets, retirement accounts, etc. Nonasset income includes social security benefits, veteran's benefits, and other benefits. Medical expenses are total out-of-pocket spending; the average yearly expenses are $3,700 with a standard deviation of $13,400. The permanent income is not observed, but we use as a proxy the ranking of individual average income over time. The health status is a binary variable indicating whether the individual perceives themself as healthy.

The health status is a variable that was not used in the moments of DFJ; we argue that it gives additional variation to identify the bequest motive (Kopczuk, 2007). Disentangling the bequest motive from medical expenditure risk is a challenging task. As the bequest is a luxury good, we expect its identifying power to come from wealthy individuals. Meanwhile, wealthy individuals are also the ones with the longest life expectancy, and are thus motivated to save for medical expenses. Indeed, DFJ document that the medical expenditure of the rich skyrockets after age 95, reaching $15,000 by age 100. However, if a health condition diminishes their life expectancy, those with shorter horizons would face much less incentive to save for the coming medical expenses while retaining as much incentive to save for bequests. We find some evidence of this in our dataset.
Figures 1a and 1b show the proportions of individuals who survive for the next five years at ages 85 and 90, conditional on gender and health. We see that health status, along with gender, is a strong predictor of life expectancy in the years when medical expenditure soars. Heterogeneity in survival materializes as a difference in savings. Figures 1c
(Single individuals are those who were neither married nor cohabiting at any point in the analysis.)
[Figure 1 panels: (a) Men's five-year survival rates and (b) Women's five-year survival rates, plotted against age (85 and 90) for healthy vs. unhealthy individuals; (c) Men's and (d) Women's median asset [k$] against median age, for those who never get sick vs. those who get sick by 2002; (e) Men's and (f) Women's median medical expenses [k$]; (g) Men's and (h) Women's median PI quantile.]
Figure 1: Profiles by gender and health. Figures 1c to 1h are for the 4th–5th PIqs in Cohort 3. Solid lines are for those who stay healthy for the duration of their observation; dashed lines for those who are healthy in 1996 and become unhealthy by 2002.
and 1d give the trajectories of the median assets for the 4th and 5th PI quintiles in Cohort 3. The solid lines are those who were healthy throughout the survey periods and the dashed lines are those who were healthy in 1996 but reported unhealthy in 1998, 2000, or 2002. We see that men who were exposed to the health shock (hence the survival shock) dig into their savings much more than healthy men. With higher survival rates, women exhibit the trend to a much lesser degree.

Such difference in the asset profiles seems to be driven neither by the difference in medical expenses nor by survival selection among the rich. Figures 1e and 1f show the median medical expenses during the same periods; we observe similar trajectories across gender and health. Figures 1g and 1h show the median PI quantiles of the survivors; if there were attrition of rich or poor individuals that affects the median assets, we would expect to see a change in the median PI quantiles. However, they do not differ much until at least age 90, while the bifurcation of the asset profiles begins at age 90.

These findings suggest that the difference in the asset profiles is attributable to a change in saving behavior. The health status changes the exposure to the medical expenditure risk through the survival probability, which then induces changes in the saving behavior by shifting the balance between the bequest motive and medical expenditure risk.
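The conditional survival rates behind Figures 1a and 1b amount to a group-by computation. The sketch below uses hypothetical toy records and field names (gender, healthy, survived five years), not the AHEAD variables:

```python
# Toy records in the form (gender, healthy_at_85, survived_5yr); these values
# and field names are hypothetical, not the AHEAD data.
records = [
    ("male", True, True), ("male", True, False), ("male", True, True),
    ("male", False, False), ("male", False, False), ("male", False, True),
    ("female", True, True), ("female", True, True), ("female", False, False),
]

def survival_rate(gender, healthy):
    # Proportion surviving five more years within a gender-by-health cell.
    group = [r for r in records if r[0] == gender and r[1] == healthy]
    return sum(r[2] for r in group) / len(group)

rate_healthy_men = survival_rate("male", True)      # 2/3 in this toy data
rate_unhealthy_men = survival_rate("male", False)   # 1/3 in this toy data
```

The same cell-level tabulation, applied to the panel at ages 85 and 90, produces the proportions plotted in Figure 1.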
Following DFJ, we carry out estimation in two steps: (1) estimate π_H, π_S, m, σ, ρ, σ_ξ, σ_ε (in fact, we borrow the numbers from DFJ); (2) estimate ν, MPC, and k using our adversarial approach. The parameters r, τ, τ̃, and x̃ are fixed as in the original paper, and β is fixed as well; for c̲, we fix it at $4,500 to reflect annual social security payments. (In their preferred specification, DFJ estimate β and c̲ in addition to ν, MPC, and k. Instead, we fix β and c̲ to reasonable values according to the literature. Sensitivity analysis shows that changing c̲ mostly affects the risk aversion parameter.) After the second step, we can also recover ϑ and a.

We consider two different sets of inputs to the discriminator. The first set consists of the log age of an individual in 1996, permanent income (the aforementioned proxy), the profile (full history) of asset holdings, and the profile of survival indicators,

X¹ := (1, log t, I, a_{t₁}, …, a_{t₆}, s_{t₂}, …, s_{t₆}) ∈ R^{14}.

(All individuals are alive in 1996, so we drop s_{t₁}.) This is intended to capture identifying variation similar to DFJ's. The second set is

X² := (X¹, g, h_{t₁}, …, h_{t₆}) ∈ R^{21},

aiming to capture more variation for the bequest motive as explained in Section 5.3. The results on the autoencoders for these inputs are presented in Appendix S.2.4.

We use cross validation to choose the discriminator (Section 4.1). We focus on feed-forward neural networks with sigmoid activation functions with at most two hidden layers. We fix θ at a preliminary estimate; split the actual data into sample 1 (80%) and sample 2 (20%); estimate D with sample 1, varying the numbers of nodes and layers; evaluate their classification accuracy with sample 2; pick the network configuration with the highest accuracy. The selected neural network discriminator consists of two hidden layers, the first with 20 nodes and the second with 10. We compare our estimates with SMM in DFJ.
They use 150 moments consisting of median assets of groups divided by cohort and permanent income quintile in each calendar year. The cohort is defined on a four-year window: Cohort 1 are those who were 72–76 years old in 1996; Cohort 2 were 77–81; Cohort 3 were 82–86; Cohort 4 were 87–91; Cohort 5 were 92 and older. Details are in DFJ. We note that accounting for health and gender is infeasible in SMM since it yields too many moments, while it is effortless in our framework.

Table 1 gives the parameter estimates from DFJ and our adversarial method with specifications X¹ and X². Parenthesized numbers are the standard errors; we use Honoré and Hu (2017) to compute them for the adversarial estimates. (Standard errors for the adversarial estimates are obtained by the poor (wo)man's bootstrap. We use the classification accuracy provided by Keras's ADAM, which is based on thresholding.) The first row is the SMM estimates in DFJ. The second and third rows come from the adversarial estimation; the second uses X¹ (14 variables) and the third X² (21 variables). X¹ is intended to capture identifying variation similar to DFJ's; the inputs X² contain additional variation in gender and health, which is our preferred specification.

Table 1: Parameter estimates.
               β     c̲ [$]    ν      ϑ    k [k$]   MPC    a [$]   Loss
DFJ, Table 3  0.97   2,665   3.84   2,360   273    0.12   36,215   −.
X¹            −.
X²            −.

A major difference between our estimates and DFJ's is the curvature of the utility of bequests k. Our estimate is an order of magnitude smaller, which has an important implication: while DFJ conclude only the super rich would obtain utility from bequeathing, our estimate suggests bequeathing matters across the entire permanent income distribution. A related number is the implied asset floor a. We obtain estimates of $1,320 and $4,243, which are on the lower side of the estimates known in the literature. However, they correspond to the 22nd and 24th percentiles of the distribution of assets one period before death (see Section 5.6) in our sample, respectively.
We interpret these numbers as our method providing a sensible fit of the data. In contrast, DFJ's implied asset floor is $36,215, which corresponds to the 40th percentile.

Overall, the intensity of the bequest motive is minor in DFJ and X¹ but non-negligible in X². While k is low for both X¹ and X², MPC is almost twice as large in X¹ compared to X². Consequently, individuals care about bequests less than their own consumption according to X¹.

The DFJ and adversarial estimates also differ in the risk aversion ν. A large value of risk aversion rationalizes the observed saving patterns when the consumption floor c̲ is fixed at $4,500, a reasonable value in the literature. (DFJ's risk aversion estimate increases from 3.84 to 6.04 in an alternative specification where c̲ is fixed at $5,000. However, according to their criterion, the fit of the model decreases substantially.)

In line with our theory, adversarial estimation provides substantial gains in precision relative to DFJ. The decrease in standard errors reflects that the data are sufficiently informative to conclude the importance of the bequest motive, especially when exploiting additional variation in gender and health.

The last column reports the cross-entropy loss of each set of parameter estimates. To make a fair comparison, we take each set of estimates and solve the inner maximization of (1) using X² as the input. The loss does not improve with X¹ relative to DFJ but does so substantially with X², which is consistent with our observation that gender and health provide useful variation for identifying the bequest motive.

As in DFJ, we look at the assets one period before death to compare the fit and counterfactuals. Individuals who passed away during the survey periods are divided into five groups of permanent income quintiles (PIqs).
We take the assets in the last survey when they were alive and sum these across individuals in each group. Table 2 shows the assets one period before death for the actual data and the simulations. The Adversarial X² baseline and DFJ baseline rows are the simulations of the model with parameters equal to the estimates of our preferred specification and of DFJ, respectively. Our estimates fit the assets for low PIqs well but overestimate those for high PIqs, while DFJ show the opposite pattern. In Appendix S.2.5, we provide additional evidence of the good fit of the data. Next, we perform two counterfactual simulations to measure the elderly's saving motive in terms of (i) bequest and (ii) medical expenditure risk. We simulate the model with the same parameters except that we kill either the bequest incentive, φ ≡
0, or the medical expenditure risk, σ ≡
0. The "(% difference)" rows give the difference between the baseline and the counterfactual relative to the baseline.

The contribution of the bequest motive to the savings differs substantially between our estimates and DFJ's. In our estimates, the lack of the bequest motive decreases the savings by 13.7% to 19.2%, while the DFJ estimates suggest at most a 2.1% decrease. This is largely due to the difference in the estimates of the curvature k. According to our estimates, the bequest motive is an important and substantial source of savings for both the poor and the rich. This finding is consistent with Lockwood (2018), who uses additional data on annuity takeup to identify the bequest motive.

The contribution of the medical expenditure risk is much more similar across the two models. The amount of savings to prepare for uncertain medical expenses is substantial in both predictions. This is because rich individuals live long and hence are at high risk of large medical expenses. Poor individuals do not survive long enough to face it and are more likely to be covered by social insurance programs.

(Trimming the observations above the top 1% of mean assets decreases the discrepancy between X² and the actual data significantly. Results are available upon request. In addition, the gap in the fit between the poor and the rich might be attributed to the rich doing inter vivos transfers more often than the poor, biasing the assets of the rich downwards toward the end of their lives (McGarry, 1999).)

Table 2 notes: "No bequest" sets ϑ = 0 (so φ ≡ 0); "No medical risk" sets σ ≡ 0 (so log m_t = m). Each number is a cross-sectional sum of assets of individuals one period before their death, given in units of k$, a proxy for their intended bequest. Percentages are relative to the corresponding baselines.
Table 2: Assets one period before death (k$), by permanent income quintile.

                          1st      2nd      3rd      4th      5th
Actual data             18,191   25,266   42,006   50,495   85,814
Adversarial X² baseline 20,441   26,366   51,339   62,662  110,385
  No bequest            17,644   21,587   42,586   50,631   95,212
  (% difference)       (13.7%)  (18.1%)  (17.1%)  (19.2%)  (13.7%)
  No medical risk       18,890   23,252   43,789   49,385   90,204
  (% difference)        (7.6%)  (11.8%)  (14.7%)  (21.2%)  (18.3%)
DFJ baseline            16,527   19,672   38,157   42,737   83,814
  No bequest            16,342   19,605   37,387   42,425   83,563
  (% difference)        (1.1%)   (0.3%)   (2.1%)   (0.7%)   (0.5%)
  No medical risk       16,440   19,242   36,157   38,053   76,080
  (% difference)        (0.5%)   (2.2%)   (5.4%)  (11.0%)   (9.4%)

To summarize, our adversarial estimates reveal with precision that the bequest motive contributes in similar magnitudes to the slow decrease in the elderly's savings across PIqs. The uncertainty in medical expenses contributes less for poor individuals.

APPENDIX

A PROOFS

Let m_{p,q} := log((p + q)/(2q)). To derive asymptotic properties of the discriminator, it is helpful to think in terms of the pseudo-objective functions

M̃_θ(D) := P₀ m_{D,D_θ} + P_θ m_{1−D,1−D_θ},  M̃^θ_{n,m}(D) := P_n m_{D,D_θ} + P̃_m (m_{1−D,1−D_θ} ∘ T_θ),

since concavity of the logarithm implies M̃^θ_{n,m}(D̂^θ_{n,m}) − M̃^θ_{n,m}(D_θ) ≥
(1/2)[M^θ_{n,m}(D̂^θ_{n,m}) − M^θ_{n,m}(D_θ)] ≥ −o_P(n^{−1/2}). See, e.g., van der Vaart and Wellner (1996, Section 3.4.1) and van der Vaart (1998, Section 5.5). Throughout, ‖f‖_{P,B} := √(2P(e^{|f|} − 1 − |f|)) denotes the Bernstein "norm," which induces a premetric without the triangle inequality (van der Vaart and Wellner, 1996, p. 324).

A.1 Discriminators
Let M^{θ,1}_{n,δ} := {m_{D,D_θ} : D ∈ D^θ_{n,δ}} and M^{θ,2}_{n,δ} := {m_{1−D,1−D_θ} : D ∈ D^θ_{n,δ}}.

Lemma 1 (Maximal inequality for pseudo-cross-entropy discriminator). For every D ∈ D, M̃_θ(D) − M̃_θ(D_θ) ≤ −d_θ(D, D_θ)²/(1 + √2)². For every δ > 0,

E* sup_{D ∈ D^θ_{n,δ}} √n |(M̃^θ_{n,m} − M̃_θ)(D) − (M̃^θ_{n,m} − M̃_θ)(D_θ)| ≲ J_[](δ, D^θ_{n,δ}, d_θ)[1 + √(n/m)] + [1 + (n/m)] J_[](δ, D^θ_{n,δ}, d_θ)²/(δ²√n).

Proof.
Since log x ≤ 2(√x −
1) for every x > P log DD θ ≤ P (cid:16)q DD θ − (cid:17) = (cid:20) P √ D ( p + p θ ) √ p − Z D ( p + p θ ) − Z p (cid:21) + ( P + P θ )( D − D θ ) = − h θ ( D, D θ ) + ( P + P θ )( D − D θ ) . Similarly, P θ log − D − D θ ≤ − h θ (1 − D, − D θ ) − ( P + P θ )( D − D θ ). Replacing D and1 − D with ( D + D θ ) / − D + 1 − D θ ) / P m DD θ + P θ m − D − D θ ≤ − h θ (cid:16) D + D θ , D θ (cid:17) − h θ (cid:16) − D +1 − D θ , − D θ (cid:17) . Since √ h θ ( p + q , q ) ≤ h θ ( p, q ) ≤ (1 + √ h θ ( p + q , q ) (van der Vaart and Wellner, 1996,Problem 3.4.4), we obtain the first inequality. For the second inequality, observe that √ n h ( ˜ M θn,m − ˜ M θ )( D ) − ( ˜ M θn,m − ˜ M θ )( D θ ) i = √ n ( P n − P ) m DD θ + √ n ( P θm − P θ ) m − D − D θ . Therefore, it suffices to separately bound E ∗ sup D ∈D θn,δ (cid:12)(cid:12)(cid:12) √ n ( P n − P ) m DD θ (cid:12)(cid:12)(cid:12) and q nm E ∗ sup D ∈D θn,δ (cid:12)(cid:12)(cid:12) √ m ( P θm − P θ ) m − D − D θ (cid:12)(cid:12)(cid:12) . Since m DD θ , m − D − D θ ≥ log(1 /
2) and e | x | − − | x | ≤ e x/ − for every x ≥ log(1 / (cid:13)(cid:13)(cid:13) m DD θ (cid:13)(cid:13)(cid:13) P ,B ≤ P (cid:16) e m DDθ / − (cid:17) ≤ h θ (cid:16) D + D θ , D θ (cid:17) ≤ h θ ( D, D θ ) , (cid:13)(cid:13)(cid:13) m − D − D θ (cid:13)(cid:13)(cid:13) P θ ,B ≤ h θ (1 − D, − D θ ) .
28y van der Vaart and Wellner (1996, Lemma 3.4.3), the first supremum is boundedby J [] (2 δ, M θ, n,δ , k · k P ,B )[1 + J [] (2 δ, M θ, n,δ , k · k P ,B ) / (4 δ √ n )] . Let [ ‘, u ] be an ε -bracketin D with respect to d θ . Since u − ‘ ≥ e | x | − − | x | ≤ e x/ − for x ≥ (cid:13)(cid:13)(cid:13) m uD θ − m ‘D θ (cid:13)(cid:13)(cid:13) P ,B ≤ Z (cid:18)r u + D θ ‘ + D θ − (cid:19) p ≤ Z (cid:16) √ u + D θ − √ ‘ + D θ (cid:17) ( p + p θ ) ≤ h θ ( u, ‘ ) ≤ ε . Thus, [ m ‘D θ , m uD θ ] makes a 2 ε -bracket in M θ, with respect to k·k P ,B , so J [] (2 δ, M θ, n,δ , k·k P ,B ) ≤ J [] ( δ, D θn,δ , d θ ). Analogous argument for the second supremum yields thesecond inequality. (cid:4) Now, Theorems 1 and 2 follow immediately from the following general versions.
Theorem 1' (Rate of convergence of discriminator). Suppose Assumption 4 holds and M^θ_{n,m}(D̂^θ_{n,m}) ≥ M^θ_{n,m}(D_θ) − O_P(δ_n²) for a nonnegative sequence δ_n. If J_[](δ_n, D^θ_{n,δ_n}, d_θ) ≲ δ_n²√n and there exists α < 2 such that J_[](δ, D^θ_{n,δ}, d_θ)/δ^α is decreasing in δ, then d_θ(D̂^θ_{n,m}, D_θ) = O*_P(δ_n).

Proof. As noted at the beginning of the section, the condition of the theorem implies M̃^θ_{n,m}(D̂^θ_{n,m}) ≥ M̃^θ_{n,m}(D_θ) − O_P(δ_n²). Then, the theorem follows from van der Vaart and Wellner (1996, Theorem 3.4.1) applied with Lemma 1. ∎

Theorem 2' (Rate of convergence of objective function). Under Assumption 2, M_θ(D) − M_θ(D_θ) = −d_θ(D, D_θ)² + o(d_θ(D, D_θ)²). Under the assumptions of Theorem 1' and Assumption 2, M^θ_{n,m}(D̂^θ_{n,m}) − M^θ_{n,m}(D_θ) = O*_P(δ_n²).

Proof. Note that for every D ∈ D,

M^θ_{n,m}(D) − M^θ_{n,m}(D_θ) = M_θ(D) − M_θ(D_θ) + (P_n − P₀) log(D/D_θ) + (P^θ_m − P_θ) log((1−D)/(1−D_θ)).

Let W₁ := √(D/D_θ) − 1, W₂ := √((1−D)/(1−D_θ)) −
1, and δ := d_θ(D, D_θ). By Taylor's theorem, log(1 + x) = x − x²/2 + x²R(x) where R(x) = O(x) as x →
0. Therefore, M θ ( D ) − M θ ( D θ ) = P log DD θ + P θ log − D − D θ = 2 P log(1 + W ) + 2 P θ log(1 + W )= 2 P W − P W + P W R ( W ) + 2 P θ W − P θ W + P θ W R ( W ) . Note that P W = P ( √ D/D θ − = h θ ( D, D θ ) and P θ W = h θ (1 − D, − D θ ) .29ince W j ≥
0, this implies that W ( X i ) = O P ( δ ) and W ( X θi ) = O P ( δ ). Moreover,2 P W = (cid:20) P √ D ( p + p θ ) √ p − Z D ( p + p θ ) − Z p (cid:21) + ( P + P θ )( D − D θ )= − h θ ( D, D θ ) + ( P + P θ )( D − D θ ) , P θ W = − h θ (1 − D, − D θ ) − ( P + P θ )( D − D θ ) . Thus, 2 P W + 2 P θ W = − d θ ( D, D θ ) and W ( X i ) and W ( X θi ) are o P (1) since | D − D θ | ≤ |√ D − √ D θ | . Also, R ( W ( X i )) and R ( W ( X θi )) are o P (1). For 1 / ≤ c < | P W R ( W ) | ≤ P W | R ( W ) | { W ≤ − c } + P W | R ( W ) | { W > − c }≤ P ( − R ( W ) { W ≤ − c } ) + P W | R ( − c ) ∨ R ( W ) | . Since R ( x ) <
1, the second term is o ( δ ) for every c by the dominated convergencetheorem. By the diagonal argument, there exists a sequence c → D → D θ such that the second term remains o ( δ ). Since 0 < − R ( x ) < − x ) for x ≤ − , P ( − R ( W ) { W ≤ − c } ) ≤ P (log D θ D { W ≤ − c } ) = P ( DD θ log D θ D · D θ D { W ≤ − c } ) ≤ sup x ≥ (1 − c ) − | x log x | · P ( D θ D { W ≤ − c } ) . The first term is o (1) as c →
1. The second term is bounded by P ( D θ D { W ≤− } ) = P ( W ≤ − ) P ( D θ D | D θ D ≥ ) ≤ P ( W ≤ − ) M by Assumption 2.By Markov’s inequality, P ( W ≤ − ) ≤ P W = O ( δ ). Thus, we have shown | P W R ( W ) | = o ( δ ). Similarly, | P θ W R ( W ) | = o ( δ ). Then, the first claim follows.Now, we bound the suprema of the two random terms E ∗ sup D ∈D θn,δn (cid:12)(cid:12)(cid:12) √ n ( P n − P ) log DD θ (cid:12)(cid:12)(cid:12) and E ∗ sup D ∈D θn,δn (cid:12)(cid:12)(cid:12) √ m ( P θm − P θ ) log − D − D θ (cid:12)(cid:12)(cid:12) . Under Assumption 2, it follows from (the remark after) Lemma 5 that for D ∈ D θn,δ n , (cid:13)(cid:13)(cid:13) log DD θ (cid:13)(cid:13)(cid:13) P ,B ≤ M ) h θ ( D, D θ ) , (cid:13)(cid:13)(cid:13) log − D − D θ (cid:13)(cid:13)(cid:13) P θ ,B ≤ M ) h θ (1 − D, − D θ ) . Assumption 2 also implies that an ε -bracket in M θ, induces (cid:13)(cid:13)(cid:13) log uD θ − log ‘D θ (cid:13)(cid:13)(cid:13) P ,B ≤ P (cid:16)q u‘ − (cid:17) = 4( P + P θ ) D θ ‘ ( √ u − √ ‘ ) ≤ Cd θ ( u, ‘ ) , (cid:13)(cid:13) log − ‘ − D θ − log − u − D θ (cid:13)(cid:13)(cid:13) P θ ,B ≤ P + P θ ) − D θ − u ( √ − ‘ − √ − u ) ≤ Cd θ ( u, ‘ ) , for some C >
0. By arguments similar to those in the proof of Theorem 1', the two suprema are of orders √n δ_n² and √m δ_n², respectively. The theorem then follows with Assumption 4. ∎
A.2 Neural Network Discriminators
We establish a bound on the bracketing number of a (possibly sparse) neural networkwith bounded weights and Lipschitz activation functions.
Lemma 2 (Bracketing number of neural network with bounded weights). Let F be a class of neural networks defined in Example 4. Denote the total number of nonzero weights by S and the maximum number of nonzero weights in each node (except for the first layer taking inputs) by Ũ. Assume that σ and Λ are Lipschitz with constant 1 and ‖w‖∞ ≤ C for some C. Assume innocuously that ŨC ≥ 2 and let σ₀ := |σ(0)|. Define F : R^d → R by F(x) := σ₀ + ‖x‖∞. Then, for any premetric d_F and ‖f‖_{d_F} := sup_{g∈F} d_F(g − f/2, g + f/2),

N_[](‖εF‖_{d_F}, F, d_F) ≤ ⌈2(L + 1)(ŨC)^{L+1} d/ε⌉^S.

For a fully connected network, Ũ = U and S = (LU + 1)U + (d − U)U. For a hierarchical network in Bauer and Kohler (2019), S = O(Ũ^{(L+4)/3} d).

Proof. Recall from Example 4 that f(x; w) = Λ(w_L σ(w_{L−1} σ(··· w_1 σ(w_0 x)))). We can bound the outputs of the ℓth layer by

‖σ(w_{ℓ−1} σ(···))‖∞ ≤ σ₀ + ‖w_{ℓ−1} σ(···)‖∞ ≤ σ₀ + ŨC ‖σ(···)‖∞ ≤ [1 + ŨC + ··· + (ŨC)^{ℓ−1}]σ₀ + Ũ^{ℓ−1}C^ℓ d ‖x‖∞ ≤ Ũ^{ℓ−1}C^ℓ (Ũσ₀ + d‖x‖∞) ≤ (ŨC)^ℓ d (σ₀ + ‖x‖∞),

where the fourth inequality holds for ŨC ≥
2. For two sets of weights, w and ˜ w , | f ( x ; w ) − f ( x ; ˜ w ) | ≤ ˜ U k w L − ˜ w L k ∞ ( k σ ( w L − σ ( · · · )) k ∞ ∨ k σ ( ˜ w L − σ ( · · · )) k ∞ )+ ˜ U C k σ ( w L − σ ( · · · )) − σ ( ˜ w L − σ ( · · · )) k ∞ We can write k log DD θ k P ,B ≤ [2(1 + M ) ∨ C ] h θ ( D, D θ ) and k log uD θ − log ‘D θ k P ,B ≤ [2(1 + M ) ∨ C ] d θ ( u, ‘ ) to apply the same argument as in Theorem 1'. The number of nonzero elements in each row of each matrix w ‘ , ‘ ≥
1, is bounded by ˜ U . ˜ U L +1 C L d k w L − ˜ w L k ∞ ( σ + k x k ∞ ) + · · · + ˜ U L +1 C L d k w − ˜ w k ∞ ( σ + k x k ∞ ) + ˜ U L C L d k w − ˜ w k ∞ k x k ∞ ≤ ( L + 1) ˜ U L +1 C L d k w − ˜ w k ∞ ( σ + k x k ∞ ) . Let A := ( L + 1) ˜ U L +1 C L d . Partitioning the weight space [ − C, C ] S into cubes oflength 2 ε/A creates d CA/ε e S cubes. Hence, N ( ε, [ − C, C ] S , k · k ∞ ) ≤ d CA/ε e S . Thebound follows by van der Vaart and Wellner (1996, Theorem 2.7.11), observing thatthe proof thereof works for a premetric with modification of 2 ε k F k to k εF k d F .For a fully connected network, the number of all weights is dU (weights for thefirst layer) plus ( L − U (weights for the remaining hidden layers) plus U (weights inthe output layer), summing to ( LU + 1) U + ( d − U ) U . For a network H (0) in Bauerand Kohler (2019) (in their notation), the number of all weights is A (0) := d (4 d ∗ M ∗ ) +4 d ∗ M ∗ + M ∗ = 4(1+ d ) d ∗ M ∗ + M ∗ . For H (1) , A (1) := A (0) K + K (4 d ∗ M ∗ )+4 d ∗ M ∗ + M ∗ = A (0) K + 4(1 + K ) d ∗ M ∗ + M ∗ . For H ( l ) , A ( l ) := A ( l − K + 4(1 + K ) d ∗ M ∗ + M ∗ = A (0) K l + P l − j =0 K j [4(1+ K ) d ∗ M ∗ + M ∗ ] = 4 d ∗ M ∗ [(1+ d ) K l + − K l − K (1+ K )]+ M ∗ − K l +1 − K = O ( dd ∗ M ∗ K l ). Then use L = 2 + 3 l and ˜ U = M ∗ ∨ (4 d ∗ ) ∨ K . (cid:4) Remark.
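The weight-perturbation inequality derived in the proof, |f(x; w) − f(x; w̃)| ≤ (L + 1)Ũ^{L+1}C^L d ‖w − w̃‖∞ (σ₀ + ‖x‖∞), can be spot-checked numerically. The sketch below is our own illustration (a small fully connected sigmoid network with hypothetical sizes), not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    # Lipschitz with constant 1/4 <= 1, and sigma0 = |sigmoid(0)| = 1/2.
    return 1.0 / (1.0 + np.exp(-a))

d, U, L, C = 3, 4, 2, 1.0  # input dim, width, hidden layers, weight bound
shapes = [(U, d), (U, U), (1, U)]
w = [rng.uniform(-C, C, size=s) for s in shapes]
# Perturb the weights slightly, clipping so the bound ||w||_inf <= C still holds.
w_tilde = [np.clip(wi + rng.uniform(-1e-3, 1e-3, size=wi.shape), -C, C) for wi in w]

def net(ws, x):
    # f(x; w) = Lambda(w_L sigma(... sigma(w_0 x))), with Lambda = sigmoid here.
    h = x
    for wi in ws:
        h = sigmoid(wi @ h)
    return h[0]

x = rng.uniform(-2.0, 2.0, size=d)
diff = abs(net(w, x) - net(w_tilde, x))
dw = max(np.max(np.abs(a - b)) for a, b in zip(w, w_tilde))
sigma0 = 0.5
# For a fully connected network, U~ = U.
bound = (L + 1) * U ** (L + 1) * C ** L * d * dw * (sigma0 + np.max(np.abs(x)))
```

The bound is deliberately loose (it compounds worst cases layer by layer), so the observed difference sits far below it; its role in the lemma is only to turn a grid on the weight space into brackets for the function class.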
Lemma 2 assumes a Lipschitz property for the activation and output func-tions, which accommodates ReLU, softplus, and sigmoid.
Remark.
If a premetric d satisfies the property that ‘ ≤ f ≤ u implies d ( ‘, f ) ≤ d ( ‘, u ), then the ε -covering number of F with respect to d is bounded by N [] ( ε, F , d ).Another popular way to bound the covering number is by the dimension of F (van derVaart and Wellner, 1996, Chapter 2.6; Anthony and Bartlett, 1999, Chapter 12).However, dimension bounds for neural networks often come with strong functional-form assumptions on the activation function (Bartlett and Maass, 2003; Bartlett et al.,2019). Our approach does not require that at the cost of bounded weights. Proof of Proposition 3.
We use Lemma 2 to bound the bracketing number in Theorem 1'. Since $D$ is nonnegative, we can extend $d_\theta$ to accommodate arbitrary functions $f_1$ and $f_2$ by $d_\theta(f_1, f_2) := d_\theta(0 \vee f_1, 0 \vee f_2)$. In the notation of Lemma 2,
$$\|\varepsilon F\|_{d_\theta} = \sup_{D \in \mathcal D} d_\theta(D - \varepsilon F/2,\, D + \varepsilon F/2) \leq h_\theta(0, \varepsilon F) + \tilde h_\theta(0, \varepsilon F) = 2\varepsilon(P + P_\theta)F = 2\varepsilon\big[2\sigma + (P + P_\theta)\|X\|_\infty\big] =: B\varepsilon.$$
(If the network has a bias term, the actual variable weights are slightly fewer, but this does not change the order.) Since $P$ and $P_\theta$ have uniformly bounded first moments, $B < \infty$. Therefore,
$$\log N_{[\,]}(\varepsilon, \mathcal D_n, d_\theta) \leq \log N_{[\,]}\Big(\Big\|\tfrac{\varepsilon}{B}F\Big\|_{d_\theta}, \mathcal D_n, d_\theta\Big) \leq S\log\Big\lceil\frac{2B(L+1)(\tilde UC)^{L+1}d}{\varepsilon}\Big\rceil.$$
The same bound holds for $\log N_{[\,]}(\varepsilon, 1-\mathcal D_n, \tilde h_\theta)$. Observe that for $0 < \delta \leq e^a$,
$$\int_0^\delta\sqrt{a - \log\varepsilon}\,d\varepsilon = \frac{\sqrt\pi}{2}e^a\operatorname{erfc}\big(\sqrt{a-\log\delta}\big) + \delta\sqrt{a-\log\delta} \lesssim \delta\sqrt{a - \log\delta}.$$
Therefore,
$$J_{[\,]}(\delta, \mathcal D_n, h_\theta) \lesssim \int_0^\delta\sqrt{S\big[\log(2B(L+1)(\tilde UC)^{L+1}d) - \log\varepsilon\big]}\,d\varepsilon \lesssim \delta\sqrt{S\big[\log(2B(L+1)(\tilde UC)^{L+1}d) - \log\delta\big]} \lesssim \delta\sqrt{SL\log(\tilde UC) - S\log\delta}.$$
Again, $J_{[\,]}(\delta, 1-\mathcal D_n, \tilde h_\theta)$ is likewise bounded. By Theorem 1' and Assumption 4, this gives rise to the rate
$$\delta_n = O\bigg(\sqrt{\frac{SL\log(\tilde UC) + S\log n}{n}}\bigg). \tag{2}$$
To attain this, the sieve must be rich enough so that $\inf_{D\in\mathcal D_n}d_\theta(D, D_\theta) \lesssim \delta_n$. Since $\mathcal D_n = \Lambda(H^{(l)})$, we use Bauer and Kohler (2019, Theorem 3) to derive the network configuration that attains this rate. For that, we need to choose "$N$, $\eta_n$, $a_n$, $M_n$" in their notation. First, we set $N = q$ and $\eta_n = \delta_n^2$. By subexponentiality, we have $\log P(\|X\|_\infty > a) + \log P_\theta(\|X\|_\infty > a) \lesssim -a$ for large $a$. Therefore, we want $a_n \gg -\log\delta_n$ so that $(P + P_\theta)(\|X\|_\infty > a_n) \lesssim \delta_n^2$. We can do this by setting $a_n = (-\log\delta_n)^2$. Finally, we want to choose $M_n$ so that $a_n^{N+q+3}M_n^{-p} \sim \delta_n$; set $M_n = (\log\delta_n)^{2(N+q+3)/p}/\delta_n^{1/p}$.
Let $A \subset [-a_n, a_n]^d$ be the set for which $(P + P_\theta)(A) \leq c\eta_n$ in Bauer and Kohler (2019, Theorem 3). Then,
$$h_\theta(D, D_\theta)^2 \leq \bigg(\int_{\|x\|_\infty > a_n} + \int_A + \int_{\{\|x\|_\infty\leq a_n\}\setminus A}\bigg)(\sqrt D - \sqrt{D_\theta})^2(p + p_\theta) \leq (P + P_\theta)(\|X\|_\infty > a_n) + (P + P_\theta)(A) + \int_{\{\|x\|_\infty\leq a_n\}\setminus A}(\sqrt D - \sqrt{D_\theta})^2(p + p_\theta).$$
The first two terms are bounded by $\delta_n^2 + c\delta_n^2$. (If we set $a_n \sim -\log\delta_n$, then we can only say $(P + P_\theta)(\|X\|_\infty > a_n) \lesssim \delta_n^c$ for some $c$.) For $D = \Lambda(f)$,
$$\int_{\{\|x\|_\infty\leq a_n\}\setminus A}(\sqrt D - \sqrt{D_\theta})^2(p + p_\theta) = \int_{\{\|x\|_\infty\leq a_n\}\setminus A}\Big(\sqrt{\Lambda(f)} - \sqrt{\Lambda(\Lambda^{-1}\circ D_\theta)}\Big)^2(p + p_\theta) \lesssim \Big\|f - \Lambda^{-1}\circ D_\theta\Big\|_{\infty,\{\|x\|_\infty\leq a_n\}\setminus A}^2 = \Big\|f - \log\tfrac{p}{p_\theta}\Big\|_{\infty,\{\|x\|_\infty\leq a_n\}\setminus A}^2,$$
since $\sqrt{\Lambda(\cdot)}$ is Lipschitz; a similar bound holds for $\tilde h_\theta(1-D, 1-D_\theta)$. By Bauer and Kohler (2019, Theorem 3), $\inf_{f\in H^{(l)}}\|f - \log\frac{p}{p_\theta}\|_{\infty,\{\|x\|_\infty\leq a_n\}\setminus A} \lesssim \delta_n$. Thus, we obtain $\inf_{D\in\mathcal D_n}d_\theta(D, D_\theta) \lesssim \delta_n$.

Meanwhile, substituting $S = O(dd^*M^*K^l) \sim M^*$, $\tilde U = M^* \vee (4d^*) \vee K \sim M^*$, $C = \alpha$, and $L = 2 + 3l = O(1)$ into (2) yields $\delta_n^2 \sim M^*\frac{\log(M^*\alpha) + \log n}{n}$. Here,
$$M^* = \binom{d^* + N}{d^*}(N+1)(M_n + 1)^{d^*} \sim M_n^{d^*} = (\log\delta_n)^{2d^*(N+q+3)/p}\,\delta_n^{-d^*/p},$$
$$\alpha = M_n^{d^* + p(2N+3)+1}\eta_n\log n = (\log\delta_n)^{2(N+q+3)[d^* + p(2N+3)+1]/p}\,\delta_n^{2 - [d^* + p(2N+3)+1]/p}\log n.$$
Thus, $\delta_n \sim \big[(\log n)^{(p + 2d^*(N+q+3))/p}/n\big]^{\frac{p}{2p+d^*}}$. The result follows by substituting $N = q$. $\Box$

A.3 Generators

Proof of Theorem 4.
For simplicity, we omit the subscripts $n, m$. Note that
$$\mathbb M^{\hat\theta}(D_{\hat\theta}) - \inf_{\theta\in\Theta}\mathbb M^\theta(D_\theta) \leq \Big[\mathbb M^{\hat\theta}(\hat D_{\hat\theta}) - \inf_{\theta\in\Theta}\mathbb M^\theta(\hat D_\theta)\Big] + \Big[\mathbb M^{\hat\theta}(D_{\hat\theta}) - \mathbb M^{\hat\theta}(\hat D_{\hat\theta})\Big] + \sup_{\theta\in\Theta}\Big[\mathbb M^\theta(\hat D_\theta) - \mathbb M^\theta(D_\theta)\Big].$$
The first difference is less than $o_P^*(1)$ by assumption; the other two are $o_P^*(1)$ by Theorem 2'. Therefore, $\mathbb M^{\hat\theta}(D_{\hat\theta}) \leq \inf_{\theta\in\Theta}\mathbb M^\theta(D_\theta) + o_P^*(1)$. By the assumption of Glivenko-Cantelli, $\|\mathbb P_n - P\|_{\mathcal M} \to 0$ and $\|\tilde{\mathbb P}_m - \tilde P\|_{\mathcal M} \to 0$ as $n, m \to \infty$. By van der Vaart and Wellner (1996, Corollary 3.2.3 (i)), it follows that $\hat\theta_{n,m} \to \theta_0$ in outer probability. $\Box$

The next theorem is a generalization of Theorem 5 on the rate of convergence of $\hat\theta_{n,m}$. The parametric rate can be achieved if $P_{\theta_0}$ is "close enough" to $P$.

Theorem 5' (Rate of convergence of generator). Suppose
$$\mathbb M^{\hat\theta_{n,m}}_{n,m}(\hat D^{\hat\theta_{n,m}}_{n,m}) \leq \mathbb M^{\theta_0}_{n,m}(\hat D^{\theta_0}_{n,m}) + O_P^*(\kappa_n^2),$$
$$\Big[\mathbb M^{\hat\theta_{n,m}}_{n,m}(\hat D^{\hat\theta_{n,m}}_{n,m}) - \mathbb M^{\theta_0}_{n,m}(\hat D^{\theta_0}_{n,m})\Big] - \Big[\mathbb M^{\hat\theta_{n,m}}_{n,m}(D_{\hat\theta_{n,m}}) - \mathbb M^{\theta_0}_{n,m}(D_{\theta_0})\Big] = O_P^*(\kappa_n^2)$$
for a nonnegative sequence $\kappa_n$. Then, under Assumptions 4, 7, 8, and 11, $h(\hat\theta_{n,m}, \theta_0) \vee \tilde h(\hat\theta_{n,m}, \theta_0) = O_P^*(\kappa_n \vee n^{-1/2})$.

Proof. The displayed condition implies $\mathbb M^{\hat\theta}(D_{\hat\theta}) \leq \mathbb M^{\theta_0}(D_{\theta_0}) + O_P^*(\kappa_n^2)$, so we apply van der Vaart and Wellner (1996, Theorem 3.2.5) to $M^\theta(D_\theta)$. By Assumptions 7 and 11, $M^\theta(D_\theta) - M^{\theta_0}(D_{\theta_0}) \gtrsim h(\theta, \theta_0)^2 \wedge c$ for some $c > 0$ and every $\theta \in \Theta$.

Next, we show the convergence of the sample objective function. Note that
$$(\mathbb M^\theta - M^\theta)(D_\theta) - (\mathbb M^{\theta_0} - M^{\theta_0})(D_{\theta_0}) = (\mathbb P_n - P)\log\frac{D_\theta}{D_{\theta_0}} + (\tilde{\mathbb P}_m - \tilde P)\log\frac{(1-D_\theta)\circ T_\theta}{(1-D_{\theta_0})\circ T_{\theta_0}}.$$
By Lemma 6, $\|\tfrac12\log\frac{D_\theta}{D_{\theta_0}}\|_{P,B} \lesssim h(\theta, \theta_0)$ and $\|\tfrac12\log\frac{(1-D_\theta)\circ T_\theta}{(1-D_{\theta_0})\circ T_{\theta_0}}\|_{\tilde P,B} \lesssim \tilde h(\theta, \theta_0)$. For $\delta > 0$, define
$$\mathcal M_\delta := \Big\{\tfrac12\log\tfrac{D_\theta}{D_{\theta_0}} : h(\theta, \theta_0) \leq \delta\Big\} \quad\text{and}\quad \tilde{\mathcal M}_\delta := \Big\{\tfrac12\log\tfrac{(1-D_\theta)\circ T_\theta}{(1-D_{\theta_0})\circ T_{\theta_0}} : \tilde h(\theta, \theta_0) \leq \delta\Big\}.$$
By van der Vaart and Wellner (1996, Lemma 3.4.3),
$$E^*\sup_{h(\theta,\theta_0)<\delta}\Big|\sqrt n(\mathbb P_n - P)\tfrac12\log\tfrac{D_\theta}{D_{\theta_0}}\Big| \lesssim J_{[\,]}(2\delta, \mathcal M_\delta, \|\cdot\|_{P,B})\bigg[1 + \frac{J_{[\,]}(2\delta, \mathcal M_\delta, \|\cdot\|_{P,B})}{4\delta^2\sqrt n}\bigg].$$
Let $[\ell, u]$ be an $\varepsilon$-bracket in $\{p_\theta\}$ with respect to $h$. Since $u - \ell \geq 0$ and $e^{|x|} - 1 - |x| \leq (e^x - 1)^2$ for every $x \geq 0$,
$$\Big\|\tfrac12\log\tfrac{p + u}{p + p_{\theta_0}} - \tfrac12\log\tfrac{p + \ell}{p + p_{\theta_0}}\Big\|_{P,B}^2 \leq 2\int\Big(\sqrt{\tfrac{p+u}{p+\ell}} - 1\Big)^2 p \leq 2\int(\sqrt{p+u} - \sqrt{p+\ell})^2 \leq 2\int(\sqrt u - \sqrt\ell)^2 = 4h(u, \ell)^2 \leq 4\varepsilon^2.$$
Thus, $[\tfrac12\log\tfrac{p+\ell}{p+p_{\theta_0}}, \tfrac12\log\tfrac{p+u}{p+p_{\theta_0}}]$ makes a $2\varepsilon$-bracket in $\mathcal M$. Hence, $N_{[\,]}(2\varepsilon, \mathcal M_\delta, \|\cdot\|_{P,B}) \leq N_{[\,]}(\varepsilon, \mathcal P_\delta, h) \lesssim (\delta/\varepsilon)^r$ by Assumption 8. This induces $J_{[\,]}(2\delta, \mathcal M_\delta, \|\cdot\|_{P,B}) \lesssim \delta$. Therefore,
$$E^*\sup_{h(\theta,\theta_0)<\delta}\Big|\sqrt n(\mathbb P_n - P)\tfrac12\log\tfrac{D_\theta}{D_{\theta_0}}\Big| \lesssim \delta + \frac{1}{\sqrt n}.$$
Similarly, $E^*\sup_{\tilde h(\theta,\theta_0)<\delta}|\sqrt m(\tilde{\mathbb P}_m - \tilde P)\tfrac12\log\tfrac{(1-D_\theta)\circ T_\theta}{(1-D_{\theta_0})\circ T_{\theta_0}}| \lesssim \delta + \frac{1}{\sqrt m}$. Then, the result follows by van der Vaart and Wellner (1996, Theorem 3.2.5). $\Box$

Lemma 3.
Under Assumption 9, for every $h \in \mathbb R^k$ and $h \to 0$,
$$\int\Big[\sqrt{\tfrac{p_\theta + p_{\theta+h}}{2}} - \sqrt{p_\theta} - \tfrac14 h^\top\dot\ell_\theta\sqrt{p_\theta}\Big]^2 = o(\|h\|^2), \qquad \int\Big[\sqrt{\tfrac{p_\theta + p_{\theta+h}}{2}} - \sqrt{p_{\theta+h}} + \tfrac14 h^\top\dot\ell_\theta\sqrt{p_{\theta+h}}\Big]^2 = o(\|h\|^2).$$

Proof. Denote $p := p_\theta$ and $p_h := p_{\theta+h}$. For the first statement, it suffices to show
$$\int\Big[\Big(\sqrt{\tfrac{p+p_h}{2}} - \sqrt p\Big) - \tfrac12(\sqrt{p_h} - \sqrt p)\Big]^2 = \int\Big(\sqrt{\tfrac{p+p_h}{2}} - \tfrac{\sqrt{p_h} + \sqrt p}{2}\Big)^2 = o(\|h\|^2).$$
For every $\varepsilon > 0$, there exists $M > 0$ such that
$$\int\Big(\sqrt{\tfrac{p+p_h}{2}} - \tfrac{\sqrt{p_h}+\sqrt p}{2}\Big)^2 \leq \varepsilon\|h\|^2 + \int_{p_h/p\leq M}\Big(\sqrt{\tfrac{p+p_h}{2}} - \tfrac{\sqrt{p_h}+\sqrt p}{2}\Big)^2.$$
By Taylor's theorem and concavity of the square root,
$$0 \leq \sqrt{\tfrac{p+p_h}{2}} - \tfrac{\sqrt{p_h}+\sqrt p}{2} \leq \sqrt p + \tfrac{p_h - p}{4\sqrt p} - \tfrac{\sqrt{p_h}+\sqrt p}{2} = \tfrac14(\sqrt{p_h} - \sqrt p)\Big(\sqrt{\tfrac{p_h}{p}} - 1\Big).$$
Thus, one obtains
$$\int_{p_h/p\leq M}\Big(\sqrt{\tfrac{p+p_h}{2}} - \tfrac{\sqrt{p_h}+\sqrt p}{2}\Big)^2 \leq \tfrac{1}{16}\int_{p_h/p\leq M}(\sqrt{p_h} - \sqrt p)^2\Big(\sqrt{\tfrac{p_h}{p}} - 1\Big)^2.$$
For $p_h/p \leq M$, $(\sqrt{p_h/p} - 1)^2$ is bounded by $M$, so the RHS is bounded by a multiple of $h^\top I_\theta h\,M = O(\|h\|^2 M)$. Moreover, $(\sqrt{p_h/p} - 1)^2$ converges to zero almost everywhere as $p_h$ converges to $p$ in DQM; therefore, by the dominated convergence theorem, the RHS is $o(\|h\|^2 M)$. By the diagonal argument, the original integral is $o(\|h\|^2)$.

For the second statement, we have shown $\int[(\sqrt{\tfrac{p+p_h}{2}} - \sqrt{p_h}) - \tfrac12(\sqrt p - \sqrt{p_h})]^2 = o(\|h\|^2)$, which, with Assumption 9, implies $\int[\sqrt{\tfrac{p+p_h}{2}} - \sqrt{p_h} - \tfrac14(-h)^\top\dot\ell_\theta\sqrt{p_h}]^2 = o(\|h\|^2)$. This completes the proof. $\Box$

The following lemma states local convergence of the objective function.
Lemma 4 (Asymptotic distribution of objective function). Under Assumptions 4 and 9, for every compact $K \subset \Theta$, uniformly in $h \in K$,
$$n\Big[\mathbb M^{\theta_0+h/\sqrt n}_{n,m}(D_{\theta_0+h/\sqrt n}) - \mathbb M^{\theta_0}_{n,m}(D_{\theta_0})\Big] = -\sqrt n\,\mathbb P_n h^\top\dot\ell_{\theta_0} + \sqrt n\,(\mathbb P_n + \mathbb P^{\theta_0+h/\sqrt n}_m)D_{\theta_0+h/\sqrt n}h^\top\dot\ell_{\theta_0} + n\Big[(\mathbb P^{\theta_0+h/\sqrt n}_m - P_{\theta_0+h/\sqrt n}) - (\mathbb P^{\theta_0}_m - P_{\theta_0})\Big]\log(1 - D_{\theta_0}) + \tfrac14 h^\top\tilde I_{\theta_0}h + o_P(1).$$
With Assumptions 10 and 11, this reduces to
$$-\sqrt n\,\mathbb P_n h^\top\dot\ell_{\theta_0} + \sqrt n\,(\mathbb P_n + \mathbb P^{\theta_0}_m)D_{\theta_0}h^\top\dot\ell_{\theta_0} + \tfrac14 h^\top\tilde I_{\theta_0}h + o_P(1).$$
These expansions apply uniformly over every compact set of $h$.

Proof. Let $\theta_1 := \theta_0 + h/\sqrt n$, $W := \sqrt{D_{\theta_1}/D_{\theta_0}} - 1$, and $\tilde W := \sqrt{p_{\theta_0}/p_{\theta_1}} - 1$. Observe that
$$n[\mathbb M^{\theta_1}(D_{\theta_1}) - \mathbb M^{\theta_0}(D_{\theta_0})] = n(\mathbb P_n + \mathbb P^{\theta_1}_m)\log\frac{D_{\theta_1}}{D_{\theta_0}} - n\,\mathbb P^{\theta_1}_m\log\frac{p_{\theta_0}}{p_{\theta_1}} + n(\mathbb P^{\theta_1}_m - \mathbb P^{\theta_0}_m)\log(1 - D_{\theta_0}).$$
We examine each term separately. By Assumption 9,
$$n(P_{\theta_1} - P_{\theta_0})\log(1 - D_{\theta_0}) = n\int(\sqrt{p_{\theta_1}} + \sqrt{p_{\theta_0}})(\sqrt{p_{\theta_1}} - \sqrt{p_{\theta_0}})\log(1 - D_{\theta_0}) = \int\Big(\sqrt n\,h^\top\dot\ell_{\theta_0} + \tfrac12 h^\top\ddot\ell_{\theta_0}h + \tfrac12 h^\top\dot\ell_{\theta_0}\dot\ell_{\theta_0}^\top h\Big)p_{\theta_0}\log(1 - D_{\theta_0}) + o(1).$$
The first term is zero, since $M^\theta(D_\theta) - M^{\theta_0}(D_{\theta_0}) \geq 0$ and $M^\theta(D_\theta) - M^{\theta_0}(D_{\theta_0}) = 2\int D_{\theta_0}(\sqrt{p_\theta} - \sqrt{p_{\theta_0}})^2 + o(h(\theta, \theta_0)^2) + (P_\theta - P_{\theta_0})\log(1 - D_{\theta_0})$, in which $P_{\theta_0}h^\top\dot\ell_{\theta_0}\log(1 - D_{\theta_0})$ is the only term linear in $h$; were it nonzero, the difference would be negative for some small $h$. Therefore, $n(P_{\theta_1} - P_{\theta_0})\log(1 - D_{\theta_0}) = \tfrac12 P_{\theta_0}(h^\top\ddot\ell_{\theta_0}h + h^\top\dot\ell_{\theta_0}\dot\ell_{\theta_0}^\top h)\log(1 - D_{\theta_0}) + o(1)$. If Assumption 11 holds, then $n[(\mathbb P^{\theta_1}_m - \mathbb P^{\theta_0}_m) - (P_{\theta_1} - P_{\theta_0})]\log(1 - D_{\theta_0}) = o_P(1 + n/m)$.

Using $\log x = 2(\sqrt x - 1) - (\sqrt x - 1)^2 + (\sqrt x - 1)^2R(\sqrt x - 1)$ for $R(x) = O(x)$,
$$n(\mathbb P_n + \mathbb P^{\theta_1}_m)\log\frac{D_{\theta_1}}{D_{\theta_0}} = 2n(\mathbb P_n + \mathbb P^{\theta_1}_m)W - n(\mathbb P_n + \mathbb P^{\theta_1}_m)W^2 + n(\mathbb P_n + \mathbb P^{\theta_1}_m)W^2R(W).$$
Let $\breve I_{\theta_0} := 2P_{\theta_0}D_{\theta_0}\dot\ell_{\theta_0}\dot\ell_{\theta_0}^\top$. Observe that
$$(P + P_{\theta_1})\Big(\sqrt n\,W + \frac{h^\top\dot\ell_{\theta_0}}{2}(1 - D_{\theta_0})\Big)^2 = n\int\Big[\sqrt{p + p_{\theta_0}} - \sqrt{p + p_{\theta_1}} + \frac{h^\top\dot\ell_{\theta_0}}{2\sqrt n}\sqrt{(1 - D_{\theta_0})p_{\theta_0}}\Big]^2,$$
which is $o(\|h\|^2)$ by Lemma 7 and Assumption 9. Thus, the RHS converges to zero uniformly over every compact $K \subset \Theta$. We draw two observations: (i) the mean and variance of $(\sqrt n\,W + (1 - D_{\theta_0})h^\top\dot\ell_{\theta_0}/2)(X_i)$, $X_i \sim (P + P_{\theta_1})/2$, converge to zero and so does the variance of $\sqrt n(\mathbb P_n + \mathbb P^{\theta_1}_m)(\sqrt n\,W + (1 - D_{\theta_0})h^\top\dot\ell_{\theta_0}/2)$ under Assumption 4; (ii) $(P + P_{\theta_1})|nW^2 - (1 - D_{\theta_0})^2(h^\top\dot\ell_{\theta_0}/2)^2| \to 0$, so $n(\mathbb P_n + \mathbb P^{\theta_1}_m)W^2 = (\mathbb P_n + \mathbb P^{\theta_1}_m)(1 - D_{\theta_0})^2(h^\top\dot\ell_{\theta_0}/2)^2 + o_P(1) \to h^\top I_{\theta_0}h/4 - h^\top\breve I_{\theta_0}h/8$. Next,
$$n(P + P_{\theta_1})W = -n\,h^2(p + p_{\theta_1}, p + p_{\theta_0}) \longrightarrow -\tfrac18 h^\top I_{\theta_0}h + \tfrac1{16}h^\top\breve I_{\theta_0}h,$$
$$\sqrt n(P + P_{\theta_1})(1 - D_{\theta_0})h^\top\dot\ell_{\theta_0} = \sqrt n(P_{\theta_1} - P_{\theta_0})(1 - D_{\theta_0})h^\top\dot\ell_{\theta_0} \to h^\top I_{\theta_0}h - \tfrac12 h^\top\breve I_{\theta_0}h.$$
This implies that the mean of $\sqrt n(\mathbb P_n + \mathbb P^{\theta_1}_m)(\sqrt n\,W + (1 - D_{\theta_0})h^\top\dot\ell_{\theta_0}/2)$ converges to $\tfrac38 h^\top I_{\theta_0}h - \tfrac3{16}h^\top\breve I_{\theta_0}h$. (Note that the vanishing of the second moment above does not imply that this mean converges to zero.) Combining with (i), we find
$$2n(\mathbb P_n + \mathbb P^{\theta_1}_m)W = -\sqrt n(\mathbb P_n + \mathbb P^{\theta_1}_m)(1 - D_{\theta_0})h^\top\dot\ell_{\theta_0} + \tfrac34 h^\top I_{\theta_0}h - \tfrac38 h^\top\breve I_{\theta_0}h + o_P(1).$$
The remainder term $n(\mathbb P_n + \mathbb P^{\theta_1}_m)W^2R(W)$ vanishes by the same logic as van der Vaart (1998, Theorem 7.2).

Next, observe that
$$n\,\mathbb P^{\theta_1}_m\log\frac{p_{\theta_0}}{p_{\theta_1}} = 2n\,\mathbb P^{\theta_1}_m\tilde W - n\,\mathbb P^{\theta_1}_m\tilde W^2 + n\,\mathbb P^{\theta_1}_m\tilde W^2R(\tilde W)$$
and
$$P_{\theta_1}\Big(\sqrt n\,\tilde W + \frac{h^\top\dot\ell_{\theta_0}}{2}\Big)^2 = n\int\Big[\sqrt{p_{\theta_0}} - \sqrt{p_{\theta_1}} + \frac{h^\top\dot\ell_{\theta_0}}{2\sqrt n}\sqrt{p_{\theta_1}}\Big]^2 = o(\|h\|^2).$$
Again, (i) the mean and variance of $(\sqrt n\,\tilde W + h^\top\dot\ell_{\theta_0}/2)(X_i)$, $X_i \sim P_{\theta_1}$, converge to zero and so does the variance of $\sqrt n\,\mathbb P^{\theta_1}_m(\sqrt n\,\tilde W + h^\top\dot\ell_{\theta_0}/2)$ under Assumption 4; (ii) $P_{\theta_1}|n\tilde W^2 - (h^\top\dot\ell_{\theta_0}/2)^2| \to 0$, so $n\,\mathbb P^{\theta_1}_m\tilde W^2 \to P_{\theta_0}(h^\top\dot\ell_{\theta_0}/2)^2 = h^\top I_{\theta_0}h/4$. Next, $nP_{\theta_1}\tilde W = -n\,h^2(\theta_1, \theta_0) \to -h^\top I_{\theta_0}h/8$ and $\sqrt n\,P_{\theta_1}h^\top\dot\ell_{\theta_0}/2 \to h^\top I_{\theta_0}h/2$. This implies that the mean of $\sqrt n\,\mathbb P^{\theta_1}_m(\sqrt n\,\tilde W + h^\top\dot\ell_{\theta_0}/2)$ converges to $3h^\top I_{\theta_0}h/8$. Thus, we find
$$2n\,\mathbb P^{\theta_1}_m\tilde W = -\sqrt n\,\mathbb P^{\theta_1}_m h^\top\dot\ell_{\theta_0} + \tfrac34 h^\top I_{\theta_0}h + o_P(1).$$
Again, we may ignore the remainder term $n\,\mathbb P^{\theta_1}_m\tilde W^2R(\tilde W)$. Altogether,
$$n[\mathbb M^{\theta_1}(D_{\theta_1}) - \mathbb M^{\theta_0}(D_{\theta_0})] = -\sqrt n\,\mathbb P_n h^\top\dot\ell_{\theta_0} + \sqrt n(\mathbb P_n + \mathbb P^{\theta_1}_m)D_{\theta_0}h^\top\dot\ell_{\theta_0} + \tfrac14 h^\top\tilde I_{\theta_0}h + n[(\mathbb P^{\theta_1}_m - P_{\theta_1}) - (\mathbb P^{\theta_0}_m - P_{\theta_0})]\log(1 - D_{\theta_0}) + o_P(1).$$
For the second claim, it remains to show that with Assumption 10,
$$\sqrt n(\mathbb P_n + \mathbb P^{\theta_1}_m)D_{\theta_1}h^\top\dot\ell_{\theta_0} - \sqrt n(\mathbb P_n + \mathbb P^{\theta_0}_m)D_{\theta_0}h^\top\dot\ell_{\theta_0} = o_P(1).$$
Note that $(P + P_{\theta_1})D_{\theta_1}h^\top\dot\ell_{\theta_0} - (P + P_{\theta_0})D_{\theta_0}h^\top\dot\ell_{\theta_0} = 0$. Write the difference as
$$\sqrt n(\mathbb P_n + \mathbb P^{\theta_1}_m)(D_{\theta_1} - D_{\theta_0})h^\top\dot\ell_{\theta_0} + \sqrt n(\mathbb P^{\theta_1}_m - \mathbb P^{\theta_0}_m)D_{\theta_0}h^\top\dot\ell_{\theta_0}.$$
Since $p/(p + x)$ is convex in $x \geq 0$ for $p > 0$,
$$-D_{\theta_0}\frac{p_{\theta_1} - p_{\theta_0}}{p + p_{\theta_0}} \leq D_{\theta_1} - D_{\theta_0} \leq -D_{\theta_1}\frac{p_{\theta_1} - p_{\theta_0}}{p + p_{\theta_1}}$$
by Taylor's theorem. Therefore, by Assumption 9,
$$-(\mathbb P_n + \mathbb P^{\theta_1}_m)D_{\theta_0}(1 - D_{\theta_0})(h^\top\dot\ell_{\theta_0})^2 + o_P(1) \leq \sqrt n(\mathbb P_n + \mathbb P^{\theta_1}_m)(D_{\theta_1} - D_{\theta_0})h^\top\dot\ell_{\theta_0} \leq -(\mathbb P_n + \mathbb P^{\theta_1}_m)D_{\theta_1}(1 - D_{\theta_1})(h^\top\dot\ell_{\theta_0})^2 + o_P(1).$$
Thus, $\sqrt n(\mathbb P_n + \mathbb P^{\theta_1}_m)(D_{\theta_1} - D_{\theta_0})h^\top\dot\ell_{\theta_0}$ converges to $-P_{\theta_0}D_{\theta_0}(h^\top\dot\ell_{\theta_0})^2 = -\tfrac12 h^\top\breve I_{\theta_0}h$, while $\sqrt n(\mathbb P^{\theta_1}_m - \mathbb P^{\theta_0}_m)D_{\theta_0}h^\top\dot\ell_{\theta_0}$ converges to $\tfrac12 h^\top\breve I_{\theta_0}h$ by Assumption 10, so the two cancel. $\Box$

Proof of Theorem 6.
By Theorem 5 and Assumption 7, $\hat\theta$ is consistent and $\sqrt n(\hat\theta - \theta_0)$ is uniformly tight. Assumption 6 implies $\mathbb M^{\hat\theta}(D_{\hat\theta}) \leq \inf_{\theta\in O}\mathbb M^\theta(D_\theta) + o_P^*(n^{-1})$. Let $\mathbb G_n := \sqrt n(\mathbb P_n - P)$, $\mathbb G^\theta_m := \sqrt m(\mathbb P^\theta_m - P_\theta)$, and $\mathbb G^\theta_{n,m}f := \mathbb G_n(1 - D_\theta)f - \sqrt{n/m}\,\mathbb G^\theta_m D_\theta f$. With Assumptions 4 and 9 to 11, Lemma 4 implies that uniformly in $h \in K$ compact,
$$n\Big[\mathbb M^{\theta_0+h/\sqrt n}(D_{\theta_0+h/\sqrt n}) - \mathbb M^{\theta_0}(D_{\theta_0})\Big] = -h^\top\mathbb G^{\theta_0}_{n,m}\dot\ell_{\theta_0} + \tfrac14 h^\top\tilde I_{\theta_0}h + o_P\Big(1 + \frac nm\Big).$$
In particular, this holds for both $\hat h := \sqrt n(\hat\theta - \theta_0)$ and $\breve h := 2\tilde I_{\theta_0}^{-1}\mathbb G^{\theta_0}_{n,m}\dot\ell_{\theta_0}$, so
$$n\Big[\mathbb M^{\theta_0+\hat h/\sqrt n}(D_{\theta_0+\hat h/\sqrt n}) - \mathbb M^{\theta_0}(D_{\theta_0})\Big] = -\hat h^\top\mathbb G^{\theta_0}_{n,m}\dot\ell_{\theta_0} + \tfrac14\hat h^\top\tilde I_{\theta_0}\hat h + o_P^*\Big(1 + \frac nm\Big),$$
$$n\Big[\mathbb M^{\theta_0+\breve h/\sqrt n}(D_{\theta_0+\breve h/\sqrt n}) - \mathbb M^{\theta_0}(D_{\theta_0})\Big] = -(\mathbb G^{\theta_0}_{n,m}\dot\ell_{\theta_0})^\top\tilde I_{\theta_0}^{-1}\mathbb G^{\theta_0}_{n,m}\dot\ell_{\theta_0} + o_P\Big(1 + \frac nm\Big).$$
Since $\hat h$ minimizes $\mathbb M^\theta(D_\theta)$ up to $o_P^*(1/n)$, the LHS of the first equation is no larger than that of the second up to $o_P^*(1)$. Subtracting the two,
$$\tfrac14\Big(\hat h - 2\tilde I_{\theta_0}^{-1}\mathbb G^{\theta_0}_{n,m}\dot\ell_{\theta_0}\Big)^\top\tilde I_{\theta_0}\Big(\hat h - 2\tilde I_{\theta_0}^{-1}\mathbb G^{\theta_0}_{n,m}\dot\ell_{\theta_0}\Big) + o_P^*\Big(1 + \frac nm\Big) \leq 0.$$
Since $\tilde I_{\theta_0}$ is assumed positive definite, $\hat h - 2\tilde I_{\theta_0}^{-1}\mathbb G^{\theta_0}_{n,m}\dot\ell_{\theta_0} = o_P^*(1 \vee \sqrt{n/m})$, proving the first expression. Since $\mathbb P_n$ and $\mathbb P^{\theta_0}_m$ are independent, the asymptotic variance is
$$4\tilde I_{\theta_0}^{-1}\Big[P(1 - D_{\theta_0})^2\dot\ell_{\theta_0}\dot\ell_{\theta_0}^\top + \Big(\lim_{n\to\infty}\frac nm\Big)P_{\theta_0}D_{\theta_0}^2\dot\ell_{\theta_0}\dot\ell_{\theta_0}^\top\Big]\tilde I_{\theta_0}^{-1} = 4\tilde I_{\theta_0}^{-1}\Big[P_{\theta_0}D_{\theta_0}(1 - D_{\theta_0})\dot\ell_{\theta_0}\dot\ell_{\theta_0}^\top + \Big(\lim_{n\to\infty}\frac nm\Big)P D_{\theta_0}(1 - D_{\theta_0})\dot\ell_{\theta_0}\dot\ell_{\theta_0}^\top\Big]\tilde I_{\theta_0}^{-1}. \qquad\Box$$

A.4 Supporting Lemmas
The next lemma allows us to bound the Bernstein "norm" of an arbitrary log likelihood ratio by the Hellinger distance without having to assume a bounded likelihood ratio. This is a major improvement over Ghosal et al. (2000, Lemma 8.7) in that the multiple of the Hellinger distance need not diverge as $h(p_1, p_2) \to 0$.

Lemma 5 (Bernstein "norm" of log likelihood ratio). For any pair of probability measures $P_1$ and $P_2$ such that $P_1(p_1/p_2) < \infty$, and every $c \geq 2$,
$$\Big\|\tfrac12\log\tfrac{p_1}{p_2}\Big\|_{P_1,B}^2 \lesssim h(p_1, p_2)^2\Big[c^2 \vee P_1\Big(\tfrac{p_1}{p_2}\,\Big|\,\tfrac{p_1}{p_2} \geq c^2\Big)\Big] \lesssim h(p_1, p_2)^2\Big[1 \vee P_1\Big(\tfrac{p_1}{p_2}\,\Big|\,\tfrac{p_1}{p_2} \geq 4\Big)\Big],$$
where $P_1(p_1/p_2 \mid p_1/p_2 \geq a) := 0$ if $P_1(p_1/p_2 \geq a) = 0$.

Proof. Using $e^{|x|} - 1 - |x| \leq (e^x - 1)^2$ for $x \geq 0$ and $e^{|x|} - 1 - |x| \leq (e^{-x} - 1)^2$ for $x < 0$,
$$\Big\|\tfrac12\log\tfrac{p_1}{p_2}\Big\|_{P_1,B}^2 \leq 2P_1\Big(\sqrt{\tfrac{p_2}{p_1}} - 1\Big)^2\Big\{\tfrac{p_1}{p_2} < 1\Big\} + 2P_1\Big(\sqrt{\tfrac{p_1}{p_2}} - 1\Big)^2\Big\{1 \leq \tfrac{p_1}{p_2} \leq c^2\Big\} + 2P_1\Big(\sqrt{\tfrac{p_1}{p_2}} - 1\Big)^2\Big\{\tfrac{p_1}{p_2} > c^2\Big\}.$$
The first term is bounded by $2\int(\sqrt{p_1} - \sqrt{p_2})^2 \lesssim h(p_1, p_2)^2$; on the middle event $p_1/p_2 \leq c^2$, so the second term is bounded by $2c^2\int(\sqrt{p_1} - \sqrt{p_2})^2 \lesssim c^2h(p_1, p_2)^2$. For the third,
$$2P_1\Big(\sqrt{\tfrac{p_1}{p_2}} - 1\Big)^2\Big\{\tfrac{p_1}{p_2} > c^2\Big\} \leq 2P_1\Big(\tfrac{p_1}{p_2}\Big\{\tfrac{p_1}{p_2} > c^2\Big\}\Big) = 2P_1\Big(\tfrac{p_1}{p_2} > c^2\Big)\,P_1\Big(\tfrac{p_1}{p_2}\,\Big|\,\tfrac{p_1}{p_2} > c^2\Big),$$
and since $p_2 < p_1/c^2$ on this event, $(\sqrt{p_1} - \sqrt{p_2})^2 \geq (1 - 1/c)^2p_1$ there, so $P_1(p_1/p_2 > c^2) \leq (1 - 1/c)^{-2}\int(\sqrt{p_1} - \sqrt{p_2})^2 \lesssim h(p_1, p_2)^2$ for $c \geq 2$. The first inequality follows; for the second, let $c = 2$. $\Box$

Remark.
Since the Bernstein "norm" dominates the $L_2$-norm, we have $P_1\big(\tfrac12\log\tfrac{p_1}{p_2}\big)^2 \leq \big\|\tfrac12\log\tfrac{p_1}{p_2}\big\|_{P_1,B}^2$, which may be better than Ghosal et al. (2000, Lemma 8.6).

Remark.
Similarly, we have
$$\Big\|\tfrac12\log\tfrac{D}{D_\theta}\Big\|_{P,B}^2 \lesssim h_\theta(D, D_\theta)^2\Big[1 \vee P\Big(\tfrac{D_\theta}{D}\,\Big|\,\tfrac{D_\theta}{D} \geq 4\Big)\Big], \qquad \Big\|\tfrac12\log\tfrac{1 - D}{1 - D_\theta}\Big\|_{P_\theta,B}^2 \lesssim \tilde h_\theta(1 - D, 1 - D_\theta)^2\Big[1 \vee P_\theta\Big(\tfrac{1 - D_\theta}{1 - D}\,\Big|\,\tfrac{1 - D_\theta}{1 - D} \geq 4\Big)\Big].$$

Lemma 6 (Bernstein "norm" of log discriminator ratio). For every $\theta_1, \theta_2 \in \Theta$,
$$\Big\|\tfrac12\log\tfrac{D_{\theta_1}}{D_{\theta_2}}\Big\|_{P,B}^2 \lesssim h(\theta_1, \theta_2)^2, \qquad \Big\|\tfrac12\log\tfrac{(1 - D_{\theta_1})\circ T_{\theta_1}}{(1 - D_{\theta_2})\circ T_{\theta_2}}\Big\|_{\tilde P,B}^2 \lesssim \tilde h(\theta_1, \theta_2)^2.$$

Proof. Since $e^{|x|} - 1 - |x| \leq (e^x - 1)^2$ for $x \geq 0$ and $e^{|x|} - 1 - |x| \leq (e^{-x} - 1)^2$ for $x < 0$,
$$\Big\|\tfrac12\log\tfrac{D_{\theta_1}}{D_{\theta_2}}\Big\|_{P,B}^2 \leq 2P\Big(\sqrt{\tfrac{D_{\theta_1}}{D_{\theta_2}}} - 1\Big)^2\{D_{\theta_1} \geq D_{\theta_2}\} + 2P\Big(\sqrt{\tfrac{D_{\theta_2}}{D_{\theta_1}}} - 1\Big)^2\{D_{\theta_1} < D_{\theta_2}\} \leq 2P\Big(\sqrt{\tfrac{p + p_{\theta_2}}{p + p_{\theta_1}}} - 1\Big)^2 + 2P\Big(\sqrt{\tfrac{p + p_{\theta_1}}{p + p_{\theta_2}}} - 1\Big)^2 \leq 4\int(\sqrt{p + p_{\theta_1}} - \sqrt{p + p_{\theta_2}})^2 \leq 4\int(\sqrt{p_{\theta_1}} - \sqrt{p_{\theta_2}})^2 \lesssim h(\theta_1, \theta_2)^2.$$
Similarly,
$$\Big\|\tfrac12\log\tfrac{(1 - D_{\theta_1})\circ T_{\theta_1}}{(1 - D_{\theta_2})\circ T_{\theta_2}}\Big\|_{\tilde P,B}^2 \leq 2\tilde P\Big(\sqrt{\tfrac{(1 - D_{\theta_1})\circ T_{\theta_1}}{(1 - D_{\theta_2})\circ T_{\theta_2}}} - 1\Big)^2 + 2\tilde P\Big(\sqrt{\tfrac{(1 - D_{\theta_2})\circ T_{\theta_2}}{(1 - D_{\theta_1})\circ T_{\theta_1}}} - 1\Big)^2 \lesssim \tilde h(\theta_1, \theta_2)^2$$
since
$$\tilde P\Big(\sqrt{(1 - D_{\theta_1})\circ T_{\theta_1}} - \sqrt{(1 - D_{\theta_2})\circ T_{\theta_2}}\Big)^2 \leq \tilde P\Big(\sqrt{\tfrac{p}{p_{\theta_1}}}\circ T_{\theta_1} - \sqrt{\tfrac{p}{p_{\theta_2}}}\circ T_{\theta_2}\Big)^2 = \tilde h(\theta_1, \theta_2)^2. \qquad\Box$$

Lemma 7 (Hellinger distance of sums of densities). For arbitrary densities $p_1, p_2, p_3$,
$$h(p_1 + p_3, p_2 + p_3)^2 = \frac12\int\frac{p_1}{p_1 + p_3}(\sqrt{p_1} - \sqrt{p_2})^2 + o(h(p_1, p_2)^2),$$
where $p_1/(p_1 + p_3) := 1$ if $p_1 = p_3 = 0$.

Proof. Since $\sqrt{p_3 + t^2}$ is convex in $t$, by Taylor's theorem,
$$\sqrt{p_1 + p_3} \geq \sqrt{p_2 + p_3} + \sqrt{\frac{p_2}{p_2 + p_3}}(\sqrt{p_1} - \sqrt{p_2}),$$
where $p_2/(p_2 + p_3)$ is defined to be 1 if $p_2 = p_3 = 0$. If $p_1 \geq p_2$, therefore,
$$0 \leq \sqrt{\frac{p_2}{p_2 + p_3}}(\sqrt{p_1} - \sqrt{p_2}) \leq \sqrt{p_1 + p_3} - \sqrt{p_2 + p_3} \leq \sqrt{\frac{p_1}{p_1 + p_3}}(\sqrt{p_1} - \sqrt{p_2}).$$
Thus, we get the following lower and upper bounds:
$$\frac12\int\Big[\frac{p_1}{p_1 + p_3}\wedge\frac{p_2}{p_2 + p_3}\Big](\sqrt{p_1} - \sqrt{p_2})^2 \leq h(p_1 + p_3, p_2 + p_3)^2 \leq \frac12\int\Big[\frac{p_1}{p_1 + p_3}\vee\frac{p_2}{p_2 + p_3}\Big](\sqrt{p_1} - \sqrt{p_2})^2.$$
The claim follows by the dominated convergence theorem. $\Box$
Altonji, J. G. and L. M. Segal (1996): "Small-Sample Bias in GMM Estimation of Covariance Structures,"
Journal of Business & Economic Statistics , 14, 353–366.
Anthony, M. and P. L. Bartlett (1999):
Neural Network Learning: Theoretical Foundations, New York: Cambridge University Press.
Arjovsky, M., S. Chintala, and L. Bottou (2017): “Wasserstein GenerativeAdversarial Networks,” in
Proceedings of the 34th International Conference on Machine Learning, ed. by D. Precup and Y. W. Teh, International Convention Centre, Sydney, Australia: PMLR, vol. 70 of
Proceedings of Machine Learning Research ,214–223.
Athey, S., G. Imbens, J. Metzger, and E. Munro (2020): “Using Wasser-stein Generative Adversarial Networks for the Design of Monte Carlo Simulations,”ArXiv:1909.02210.
Bach, F. (2017): "Breaking the Curse of Dimensionality with Convex Neural Networks,"
Journal of Machine Learning Research , 18, 629–681.
Bartlett, P. L., N. Harvey, C. Liaw, and A. Mehrabian (2019): "Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks,"
Journal of Machine Learning Research , 20, 1–17.
Bartlett, P. L. and W. Maass (2003): “Vapnik-Chervonenkis Dimension ofNeural Nets,” in
The Handbook of Brain Theory and Neural Networks , ed. by M. A.Arbib, Cambridge: MIT Press, 1188–1192, second ed.
Bauer, B. and M. Kohler (2019): "On Deep Learning as a Remedy for the Curse of Dimensionality in Nonparametric Regression,"
Annals of Statistics , 47,2261–2285.
Bennett, A., N. Kallus, and T. Schnabel (2019): "Deep Generalized Method of Moments for Instrumental Variable Analysis," in
Advances in Neural Information Processing Systems 32, ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Curran Associates, Inc., 3564–3574.
Bruins, M., J. A. Duffy, M. P. Keane, and A. A. Smith, Jr. (2018): "Generalized Indirect Inference for Discrete Choice Models,"
Journal of Econometrics, 205, 177–203.
Chen, X. (2007): “Large Sample Sieve Estimation of Semi-Nonparametric Models,”in
Handbook of Econometrics, vol. 6B, chap. 76, 5549–5632.

Chen, X. and X. Shen (1998): "Sieve Extremum Estimates for Weakly Dependent Data,"
Econometrica , 66, 289–314.
Chen, X. and H. White (1999): “Improved Rates and Asymptotic Normality forNonparametric Neural Network Estimators,”
IEEE Transactions on Information Theory, 45, 682–691.
Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018): "Double/Debiased Machine Learning for Treatment and Structural Parameters,"
Econometrics Journal , 21, C1–C68.
Chernozhukov, V. and H. Hong (2004): “Likelihood Estimation and Inferencein a Class of Nonregular Econometric Models,”
Econometrica , 72, 1445–1480.
Cunha, F., J. J. Heckman, and S. M. Schennach (2010): "Estimating the Technology of Cognitive and Noncognitive Skill Formation,"
Econometrica, 78, 883–931.
De Nardi, M., E. French, and J. B. Jones (2010): "Why Do the Elderly Save? The Role of Medical Expenses,"
Journal of Political Economy , 118, 39–75.
Farrell, M. H., T. Liang, and S. Misra (2019): "Deep Neural Networks for Estimation and Inference," ArXiv:1809.09953.
Fermanian, J.-D. and B. Salanié (2004): "A Nonparametric Simulated Maximum Likelihood Estimation Method,"
Econometric Theory , 20, 701–734.
Forneron, J.-J. and S. Ng (2018): "The ABC of Simulation Estimation with Auxiliary Statistics,"
Journal of Econometrics , 205, 112–139.
Frazier, D. T., T. Oka, and D. Zhu (2019): “Indirect Inference with a Non-Smooth Criterion Function,”
Journal of Econometrics , 212, 623–645.
Gallant, A. R. and G. Tauchen (1996): “Which Moments to Match?”
Econometric Theory, 12, 657–681.
Ghosal, S., J. K. Ghosh, and A. W. van der Vaart (2000): "Convergence Rates of Posterior Distributions,"
Annals of Statistics , 28, 500–531.
Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014): "Generative Adversarial Nets," in
Advances in Neural Information Processing Systems , 2672–2680.
Gouriéroux, C. and A. Monfort (1997):
Simulation-Based Econometric Methods, Oxford; New York: Oxford University Press.

Gouriéroux, C., P. C. Phillips, and J. Yu (2010): "Indirect Inference for Dynamic Panel Models,"
Journal of Econometrics , 157, 68–77.
Hartford, J., G. Lewis, K. Leyton-Brown, and M. Taddy (2017): "Deep IV: A Flexible Approach for Counterfactual Prediction," in
Proceedings of the 34th International Conference on Machine Learning, ed. by D. Precup and Y. W. Teh, International Convention Centre, Sydney, Australia: PMLR, vol. 70 of
Proceedingsof Machine Learning Research , 1414–1423.
Honoré, B. E. and L. Hu (2017): “Poor (Wo)man’s Bootstrap,”
Econometrica ,85, 1277–1301.
Huber, P. J. (1967): "The Behavior of Maximum Likelihood Estimates under Nonstandard Conditions," in
Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, ed. by L. M. Le Cam and J. Neyman, Berkeley Symposium on Mathematical Statistics and Probability, Berkeley: University of California Press, vol. 1, 221–233.
Jankowski, H. (2014): "Convergence of Linear Functionals of the Grenander Estimator under Misspecification,"
Annals of Statistics , 42, 625–653.
Kleijn, B. J. K. and A. W. van der Vaart (2006): “Misspecification in Infinite-Dimensional Bayesian Statistics,”
Annals of Statistics, 34, 837–877.

——— (2012): "The Bernstein-Von-Mises Theorem under Misspecification,"
Electronic Journal of Statistics, 6, 354–381.
Klein, R. W. and R. H. Spady (1993): "An Efficient Semiparametric Estimator for Binary Response Models,"
Econometrica , 61, 387–421.
Kopczuk, W. (2007): "Bequest and Tax Planning: Evidence from Estate Tax Returns,"
Quarterly Journal of Economics , 122, 1801–1854.
Kristensen, D. and Y. Shin (2012): "Estimation of Dynamic Models with Nonparametric Simulated Maximum Likelihood,"
Journal of Econometrics , 167, 76–94.
Kuan, C.-M. and H. White (1994): "Artificial Neural Networks: An Econometric Perspective,"
Econometric Reviews , 13, 1–91.
Lewis, G. and V. Syrgkanis (2018): "Adversarial Generalized Method of Moments," ArXiv:1803.07164.
Lockwood, L. M. (2018): “Incidental Bequests and the Choice to Self-Insure Late-Life Risks,”
American Economic Review, 108, 2513–50.

Mackey, L., V. Syrgkanis, and I. Zadik (2018): "Orthogonal Machine Learning: Power and Limitations," in
Proceedings of the 35th International Conference on Machine Learning, ed. by J. Dy and A. Krause, Stockholmsmässan, Stockholm, Sweden: PMLR, vol. 80 of
Proceedings of Machine Learning Research , 3375–3383.
McGarry, K. (1999): “Inter Vivos Transfers and Intended Bequests,”
Journal of Public Economics, 73, 321–351.
Mhaskar, H. N. and T. Poggio (2020): "Function Approximation by Deep Networks,"
Communications on Pure & Applied Analysis , 19, 4085–4095.
Newey, W. K. (1994): “The Asymptotic Variance of Semiparametric Estimators,”
Econometrica , 62, 1349–1382.
Nickl, R. and B. M. Pötscher (2010): "Efficient Simulation-Based Minimum Distance Estimation and Indirect Inference,"
Mathematical Methods of Statistics ,19, 327–364.
Nowozin, S., B. Cseke, and R. Tomioka (2016): "f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization," in Advances in Neural Information Processing Systems 29 (NIPS 2016), Curran Associates, Inc., 271–279.
Patilea, V. (2001): “Convex Models, MLE and Misspecification,”
Annals of Statis-tics , 29, 94–123.
Pollard, D. (1997): “Another Look at Differentiability in Quadratic Mean,” in
Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics, New York: Springer, chap. 19, 305–314.
Schmidt-Hieber, J. (2020): "Nonparametric Regression using Deep Neural Networks with ReLU Activation Function,"
Annals of Statistics , forthcoming.
Strebulaev, I. A. and T. M. Whited (2011): "Dynamic Models and Structural Estimation in Corporate Finance,"
Foundations and Trends® in Finance, 6, 1–163.

van der Vaart, A. W. (1998): Asymptotic Statistics, Cambridge: Cambridge University Press.

van der Vaart, A. W. and J. A. Wellner (1996):
Weak Convergence and Empirical Processes: With Applications to Statistics, New York: Springer.
White, H. (1982): “Maximum Likelihood Estimation of Misspecified Models,”
Econometrica , 50, 1–25.
Yarotsky, D. (2017): “Error Bounds for Approximations with Deep ReLU Net-works,”
Neural Networks, 94, 103–114.
AN ADVERSARIAL APPROACH TO STRUCTURAL ESTIMATION
Online Appendix
Tetsuya Kaji, Elena Manresa, and Guillaume Pouliot
University of Chicago, New York University
July 14, 2020

S.1 MONTE CARLO EXERCISE OF A ROY MODEL

We conduct a simulation of a Roy model with two sectors and two periods. The Roy model encompasses two essential features of economic environments: comparative advantage and selection. It is often estimated with indirect inference, as the likelihood is hard to characterize.
S.1.1 Design
We implement a simplified version of the Roy model with no covariates. There are two sectors in which individuals work for wages. The wage in period 1 is determined by
$$\log w_{i1} = \mu_{d_1(i)} + \varepsilon_{i1,d_1(i)},$$
where $d_1(i) \in \{1, 2\}$ is the sector chosen by individual $i$ in period $t = 1$, $\mu_1$ and $\mu_2$ are the sector-specific mean wages, and $\varepsilon_{i1,d_1(i)}$ is an individual- and sector-specific shock distributed normally. The wage in period 2 is determined by
$$\log w_{i2} = \mu_{d_2(i)} + \gamma_{d_2(i)}1\{d_2(i) = d_1(i)\} + \varepsilon_{i2,d_2(i)},$$
where $d_2(i)$ is the sector chosen by $i$ at $t = 2$ and $\gamma_{d_2(i)}$ is the return to experience if $i$ chooses the same sector; the period-2 shocks are possibly correlated with the previous shocks.

In this model individuals make different choices because they have different comparative advantages in one sector versus the other. There are four different sources of heterogeneity: two idiosyncratic shocks in period 1 for the two sectors and two idiosyncratic shocks in period 2 for the two sectors.

Individuals choose sector $d_1(i)$ to maximize the present value of current and future wages. In period 1, an individual works in sector 1 if the following inequality holds:
$$w_{i1,1} + \beta\,\mathbb E[w_{i2} \mid d_1(i) = 1] > w_{i1,2} + \beta\,\mathbb E[w_{i2} \mid d_1(i) = 2],$$
where $\beta$ is a discount factor, $w_{i2} = \max\{w_{i2,1}, w_{i2,2}\}$, and $w_{i1,d}$ is the potential wage in period 1 and sector $d$. Expectations are taken with respect to the idiosyncratic shocks $(\varepsilon_{i2,1}, \varepsilon_{i2,2})$. Since these shocks are normally distributed, the expectations have closed forms.

In period 2, an individual, conditional on their choice of sector in period 1, observes $\varepsilon_{i2,1}$ and $\varepsilon_{i2,2}$ and chooses the sector based on the maximum wage.

Thus, the sector choice and wage for each period can be written as a function of the structural parameters $\theta = (\mu_1, \mu_2, \gamma_1, \gamma_2, \sigma_1, \sigma_2, \rho_t, \rho_s, \beta)$, where $\rho_t$ is the correlation between period 1 and period 2 in both sectors, and $\rho_s$ is the correlation between sectors.

As actual observations, we generate data for $n = 1{,}000$ individuals with the true parameter $\theta_0 = (1. , , . , , , , . , . , )$.

S.1.2 Estimation
We consider adversarial estimation using 1-hidden-layer neural networks with an increasing number of neurons (from 2 to 100). We follow the two-step iterative algorithm described in Section 4.3. More specifically, we initialize $\theta$ at some value and generate a fixed set of shocks. We pick $m = n$. After training the neural network, we hold fixed the estimated weights and calculate the gradient of (1) for small changes of $\theta$. Then, we update $\theta$ in the direction of the gradient and generate corresponding synthetic data using the same shocks. (While gradient-based methods are not justified in this context, given the discrete nature of some of the outcomes, we did not encounter numerical problems following this strategy.)

The neural networks are specified using a sigmoid activation function in all layers. In addition, we incorporate dropout of 10% of the nodes during training and allow for early stopping. The networks are trained with the R keras package, using stochastic gradient descent and backpropagation. We fix the randomness of the stochastic gradient descent across iterations of the estimation algorithm.

We set $X_i = (w_{i1}, d_1(i), w_{i2}, d_2(i))$, i.e., the vector of all outcomes. For each replication
we use 5 different initial conditions. We define the estimate as the one that minimizes the loss across the 5 minimizations.
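To fix ideas, the iterative scheme and the best-of-5 selection can be sketched in a self-contained toy version. This is only an illustration under strong simplifying assumptions, not the paper's R keras implementation: a one-parameter location model stands in for the Roy model, a logistic (no-hidden-layer) classifier stands in for the neural-network discriminator, and all function names (`adversarial_estimate`, `train_discriminator`, `loss`) are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-max(min(z, 30.0), -30.0)))

def train_discriminator(real, synth, a, b, steps=10, lr=0.2):
    """Gradient-ascent steps on (1/n) sum log D(x) + (1/m) sum log(1 - D(x~)),
    with a logistic discriminator D(x) = sigmoid(a + b*x), warm-started at (a, b)."""
    n, m = len(real), len(synth)
    for _ in range(steps):
        ga = sum(1 - sigmoid(a + b * x) for x in real) / n - \
             sum(sigmoid(a + b * x) for x in synth) / m
        gb = sum((1 - sigmoid(a + b * x)) * x for x in real) / n - \
             sum(sigmoid(a + b * x) * x for x in synth) / m
        a, b = a + lr * ga, b + lr * gb
    return a, b

def loss(real, shocks, theta):
    """Discriminator objective after training, as a function of theta."""
    synth = [theta + e for e in shocks]
    a, b = train_discriminator(real, synth, 0.0, 0.0, steps=50)
    return sum(math.log(sigmoid(a + b * x)) for x in real) / len(real) + \
           sum(math.log(1 - sigmoid(a + b * x)) for x in synth) / len(synth)

def adversarial_estimate(real, shocks, theta0, outer=80, lr=0.2):
    """Alternate: simulate with the fixed shocks, refit the discriminator
    (warm start), then move theta downhill on the discriminator's objective."""
    theta, a, b = theta0, 0.0, 0.0
    for _ in range(outer):
        synth = [theta + e for e in shocks]   # generator: location model
        a, b = train_discriminator(real, synth, a, b)
        # d/dtheta of (1/m) sum log(1 - sigmoid(a + b*(theta + e)))
        grad = -b * sum(sigmoid(a + b * x) for x in synth) / len(synth)
        theta -= lr * grad
    return theta

random.seed(0)
real = [1.5 + random.gauss(0.0, 1.0) for _ in range(80)]   # "actual" data
shocks = [random.gauss(0.0, 1.0) for _ in range(80)]       # fixed simulation shocks
starts = [0.0, 0.75, 1.5, 2.25, 3.0]                       # five initial conditions
estimates = [adversarial_estimate(real, shocks, t0) for t0 in starts]
theta_hat = min(estimates, key=lambda t: loss(real, shocks, t))
```

With the seed and shocks fixed, every run is deterministic, and the run with the smallest trained-discriminator loss is kept, mimicking the best-of-5 rule described above.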
S.1.3 Results
Figure 2 contains 8 panels with the mean estimate of each parameter across 1,000 Monte Carlo simulations. The x-axis represents the number of nodes of the hidden layer, from 2 to 100, with the total number of parameters of the NN in parentheses. The green line denotes the true value of the parameter. The different shades of grey indicate different quantiles of the Monte Carlo distribution.

For all sizes of the NN the estimator is essentially unbiased. However, for smaller NN the variability around the mean can be large. The variability decreases as the size of the NN grows, up until the point where the size of the NN is around 10. This exercise provides evidence that, in line with our theory, a more flexible discriminator delivers estimators with smaller variance. We attribute this finding to the ability of the NN to better approximate the infeasible discriminator, $D_\theta$, which attains the Cramér-Rao bound.

Also worth noting is that for larger NN there seems to be limited increase in variance. This is likely due to the ability of the training algorithms to incorporate regularization through different strategies.

S.2 ADDITIONAL NOTES ON THE EMPIRICAL APPLICATION

S.2.1 Details on Estimation Algorithm
Estimation of GANs in their original formulation (i.e., for training a generative model of images) is notoriously challenging (e.g., see Arjovsky and Bottou, 2017). Two main issues have been raised in the literature: (i) "mode-seeking behavior" of the discriminator due to imbalances between synthetic and actual sample sizes, and (ii) "flat or vanishing gradient" of the objective function in terms of the parameters of the generative model when synthetic and actual samples are easily distinguishable by the discriminator.

Imbalances in the sample size of synthetic versus actual data arise naturally in our context. Indeed, in order to reduce inflation of the variance of structural parameter estimates it is useful to choose $m \gg n$. When this is the case, there is a risk that a "good" discriminator is one that always predicts "synthetic", regardless of the input. However, this is not a useful discriminator for our purposes. Following the recommendations of the machine learning literature, we mitigate this problem by performing data augmentation on the actual sample. In particular, we use a naive bootstrap strategy that resamples individuals' asset histories with replacement until both samples are even.

[Figure 2 (from Section S.1.3): Monte Carlo results across NN sizes; panels (a) µ₁, (b) µ₂, (c) γ₁, (d) γ₂, (e) σ₁, (f) σ₂, (g) ρ_w, (h) ρ_t.]

As for the flat gradient, we argue that this problem is not nearly as pervasive when the generative model is a typical structural economic model (provided the discriminator is parsimonious enough and does not overfit). Indeed, Arjovsky and Bottou (2017) show that the problem of flat gradients is closely related to problems of overlapping support in typical generative models of images (see Lemma 1 and Theorem 2.1 in their paper), where the set of realizable images is measure zero in the space of all possible images.
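The sample-balancing augmentation described above can be sketched as follows. This is a minimal illustration, not the paper's code: `augment_to_match` is a hypothetical helper, and the three-period asset histories are made-up numbers.

```python
import random

def augment_to_match(actual, m, seed=0):
    """Naive bootstrap augmentation: resample units (here, whole asset
    histories) with replacement until the actual sample has m observations,
    matching the size of the synthetic sample."""
    rng = random.Random(seed)
    return [actual[rng.randrange(len(actual))] for _ in range(m)]

# toy example: n = 3 observed histories vs. m = 9 synthetic draws
histories = [[10.0, 9.5, 8.0], [4.0, 4.2, 3.9], [7.5, 6.0, 5.5]]
balanced = augment_to_match(histories, m=9)
```

After augmentation, both samples enter the discriminator's training objective with equal weight, removing the incentive to always predict "synthetic".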
Typical economic models are very different from image generative models: (i) they tend to be embedded in low-dimensional spaces (the space of the endogenous outcomes), and (ii) they tend to be parametrized by low-dimensional vectors, where searching for configurations that provide overlapping support might be computationally feasible. Nonetheless, we could still encounter this problem, especially when outcomes are discrete.

In the context of our empirical application, outcomes are continuous and overlapping support is not a first-order problem. Nonetheless, gradients of the structural parameters tend to be close to 0 when the conditional distribution of the outcomes generated by the model and the actual data are far apart, hence making naive gradient descent a very slow strategy. We implement two speeding strategies that have recently become popular in the context of training neural networks: NAG (Nesterov Accelerated Gradient), an accelerated gradient descent method featuring momentum (Nesterov, 1983), and RPROP, an adaptive learning rate algorithm (Riedmiller and Braun, 1993).

Finally, we now give details on our choice of tuning parameters of the algorithm for training the discriminator. Recall we choose D to be the set of feedforward neural networks with 2 hidden layers with 20 and 10 neurons, respectively, with sigmoid activation functions in both layers. We rely on state-of-the-art estimation algorithms in the R Keras package for training the discriminator. In particular, we use the default ADAM optimization algorithm, which incorporates stochastic gradient descent, and backpropagation for fast computation of gradients. For implementation of stochastic gradient descent, we select a small batch size of 120 samples per gradient calculation, and a large number of epochs (2,000).
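To illustrate the momentum idea, here is a minimal NAG update on a toy quadratic. The objective, learning rate, and momentum value are illustrative and unrelated to the tuning used in the application; the distinctive feature is that the gradient is evaluated at a look-ahead point before stepping:

```python
import numpy as np

def nag_minimize(grad, x0, lr=0.02, momentum=0.9, steps=400):
    """Nesterov accelerated gradient: the gradient is evaluated at the
    look-ahead point x + momentum * v rather than at x itself."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        v = momentum * v - lr * grad(x + momentum * v)
        x = x + v
    return x

# Badly conditioned quadratic f(x) = 0.5 * x'Ax, minimized at the origin,
# where plain gradient descent would crawl along the flat direction.
A = np.diag([1.0, 50.0])
x_star = nag_minimize(lambda x: A @ x, x0=[3.0, -2.0])
```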
As opposed to other implementations of GAN, we train the discriminator "to completion," and we fix the seed of the stochastic gradient to preserve non-randomness of the criterion as a function of the structural parameters. We find this strategy to be the one that delivers the most reliable estimates, albeit at the cost of being computationally intensive. In order to avoid overfitting in the discriminator, we make use of callback options that track the evolution of out-of-sample accuracy measures over epochs.

In S.2.3 below, we provide evidence that our estimation algorithm can successfully recover the true parameters in a Monte Carlo exercise tailored to the empirical application.

S.2.2 Details on Implementation of Poor (Wo)man's Bootstrap
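For concreteness, the discriminator architecture (two sigmoid hidden layers with 20 and 10 neurons and a sigmoid output trained on cross-entropy) can be written out with plain full-batch gradient descent. This dependency-free sketch only illustrates the architecture and objective; the implementation described above instead uses Keras with ADAM, mini-batches, and callbacks:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_discriminator(X, y, sizes=(20, 10), lr=0.5, epochs=500, seed=0):
    """Full-batch gradient descent on a (20, 10) sigmoid network with
    cross-entropy loss; returns the loss path over epochs."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], sizes[0])); b1 = np.zeros(sizes[0])
    W2 = rng.normal(scale=0.5, size=(sizes[0], sizes[1]));   b2 = np.zeros(sizes[1])
    w3 = rng.normal(scale=0.5, size=(sizes[1], 1));          b3 = 0.0
    Y, n, losses = y.reshape(-1, 1), len(y), []
    for _ in range(epochs):
        H1 = sigmoid(X @ W1 + b1)                      # hidden layer 1 (20 nodes)
        H2 = sigmoid(H1 @ W2 + b2)                     # hidden layer 2 (10 nodes)
        P = np.clip(sigmoid(H2 @ w3 + b3), 1e-10, 1 - 1e-10)
        losses.append(float(-np.mean(Y * np.log(P) + (1 - Y) * np.log(1 - P))))
        dZ3 = (P - Y) / n                              # grad w.r.t. output pre-activation
        dZ2 = (dZ3 @ w3.T) * H2 * (1 - H2)             # backprop through layer 2
        dZ1 = (dZ2 @ W2.T) * H1 * (1 - H1)             # backprop through layer 1
        w3 -= lr * H2.T @ dZ3; b3 -= lr * dZ3.sum()
        W2 -= lr * H1.T @ dZ2; b2 -= lr * dZ2.sum(axis=0)
        W1 -= lr * X.T @ dZ1;  b1 -= lr * dZ1.sum(axis=0)
    return losses

# Label actual draws 1 and synthetic draws from a shifted model 0.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(1.0, 1.0, (200, 2))])
y = np.concatenate([np.ones(200), np.zeros(200)])
losses = train_discriminator(X, y)
```

Training "to completion" with a fixed seed, as in the text, corresponds to running this loop until the loss path flattens, so the criterion is a deterministic function of the structural parameters.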
We implement a "fast" bootstrap alternative proposed in Honoré and Hu (2017). Our estimates are based on 50 replications. For each replication we solve 9 different univariate optimization problems.
S.2.3 Monte Carlo Exercise
In order to provide confidence in the results of the empirical application, we conduct a simulation exercise in a design that mimics the DFJ model.

We simulate asset profiles conditional on the real distribution of health, PI, gender, etc. for N = 2,688 individuals according to the DFJ model and the following values of the structural parameters: β = 0. , ν = 5. , c = 4, .23, and k = 13, .

[Table: ν, k [k$]: 13.80, 13.32, 4.40, 6.26, 22.95.] Notes: Mean and standard deviations computed over 250 Monte Carlo replications. ν is the parameter of risk aversion, MPC is the marginal propensity to consume at the moment of death, and k is the curvature of the bequest-motive part of the utility function.

S.2.4 Autoencoder on X

The use of particular multilayer neural networks as sieve estimators for D_θ can achieve faster rates of convergence than other nonparametric methods. A necessary condition, as stated in Proposition 3 in the main text, is that log(p_0/p_θ) admits the following hierarchical representation introduced in Bauer and Kohler (2019):

Definition (Generalized hierarchical interaction model). Let d ∈ N, d* ∈ {1, ..., d}, and m : R^d → R. We say that m admits a generalized hierarchical interaction model of order d* and level 0 if there exist a_1, ..., a_{d*} ∈ R^d and f : R^{d*} → R such that
\[ m(x) = f(a_1' x, \dots, a_{d^*}' x) \]
for all x ∈ R^d. We say that m satisfies a generalized hierarchical interaction model of order d* and level l + 1 if there exist K ∈ N, g_k : R^{d*} → R, and f_{1,k}, ..., f_{d*,k} : R^d → R (k = 1, ..., K) such that f_{1,k}, ..., f_{d*,k} (k = 1, ..., K) satisfy a generalized hierarchical interaction model of order d* and level l and
\[ m(x) = \sum_{k=1}^{K} g_k\big(f_{1,k}(x), \dots, f_{d^*,k}(x)\big) \]
for all x ∈ R^d.

As an example, log(p_0/p_θ) satisfies a generalized hierarchical interaction model of order d* = 1 and level 0 when p_θ corresponds to a conditional binary choice model, such as probit or logit, irrespective of the dimension of the conditioning covariates.

We now provide an intuition on why fitting autoencoders on the inputs, X_i, can be informative of the hierarchical interaction order, d*. We start by giving some background on autoencoders.

Autoencoders are used as dimension-reduction statistical models and have been referred to as the non-linear version of PCA (e.g., see Bishop (2006)).
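The PCA analogy suggests a concrete check. For a linear autoencoder, the optimal reconstruction with a bottleneck of k nodes is the rank-k truncated SVD, so the reconstruction error drops to zero once the bottleneck reaches the intrinsic dimension of the inputs. A toy sketch, assuming data with intrinsic dimension d* = 2 embedded in R^6:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))    # latent factors: intrinsic dimension d* = 2
W = rng.normal(size=(2, 6))      # loadings embedding the factors in R^6
X = Z @ W                        # observed inputs

def linear_autoencoder_error(X, bottleneck):
    """Mean squared reconstruction error of the best *linear* autoencoder
    with `bottleneck` nodes, computed via truncated SVD (equivalent to PCA)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Xhat = (U[:, :bottleneck] * s[:bottleneck]) @ Vt[:bottleneck]
    return float(np.mean((Xc - Xhat) ** 2))

errors = [linear_autoencoder_error(X, k) for k in range(1, 7)]
# the error is positive at bottleneck 1 and essentially zero from bottleneck 2 on
```

Fitting nonlinear autoencoders with increasing bottleneck size and locating where the reconstruction error flattens plays the same role in the nonlinear case.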
Autoencoders are special neural networks that attempt to approximate the inputs, and they have three differentiated parts: encoder, bottleneck, and decoder. The encoder is typically a multilayer feedforward neural network with a decreasing number of nodes in each layer. It forges a compressed representation of the inputs into the bottleneck, the hidden layer with the smallest number of nodes. The decoder takes the neurons from the bottleneck and maps them back to the output layer, increasing the number of nodes in each layer. The output layer has exactly as many nodes as the dimension of the input. Fitting an autoencoder involves minimizing the difference between the output layer and the inputs.

Let X ∈ R^d be a vector that can be perfectly fit into an autoencoder with d*

Figure 4 shows the fit of the model in terms of mean asset profiles conditional on cohort and permanent income quintiles, excluding observations above 1% of the mean asset distribution of the actual data. (Mean assets are sensitive to small changes in the right-hand tail of the distribution. The trimming strategy for simulated observations under the adversarial estimates accounts for less than 1.75% of the observations, while it is less than 1.5% of observations for DFJ.) The fit of both DFJ and Adversarial is good, albeit they tend to do best in different parts of the distribution. Adversarial performs remarkably well for all cohorts for the bottom 3 permanent income quintiles. However, for the upper two permanent income quintiles, Adversarial can overshoot, especially for the younger individuals in the sample.

We also report the fit of the model separately for men and women in Cohort 2 in Figure 5. Matching the distribution conditional on gender is required in Adversarial X, but not in DFJ. We can see that Adversarial X delivers a good fit for men even
at the top of the distribution, while DFJ tends to underestimate men's assets often.

[Figure 4: Fit in terms of mean assets by cohort (rows) and PIq (columns) over years. Red is DFJ, green is Adversarial X, and blue is actual data.]

[Figure 5: Fit in terms of mean assets in cohort 2 separately for men and women by PIq (columns) over years. Red is DFJ, green is Adversarial X, and blue is actual data. Other cohorts exhibit similar patterns.]

S.3 EQUIVALENCE TO SMM WHEN D IS LOGISTIC

We start by discussing the statistical properties of the adversarial estimator when D is a logistic regression under high-level conditions, for any choice of X_i = (1, X̃_i), where X̃_i is the choice of the researcher.

The goal of this section is twofold: first, to develop intuition on the properties of the estimator in a case where we can derive expressions analytically; second, to state the asymptotic equivalence result with an SMM estimator when the moments are sample means of X̃_i and optimally weighted. Hence, in this section we abstract from the conditions that ensure that the logistic regression is a regular M-estimator. In the next section, we will spell out all the formal conditions under which we analyze the adversarial framework.

Recall the FOC given in Example 2. Consistency of θ̂ can be established under standard regularity conditions on M-estimation (for instance, Newey and McFadden, 1994, Theorem 2.1). For simplicity we assume X_i^θ is differentiable with respect to θ.

For any θ, let us define the following limiting discriminator parameter value
\[ \lambda(\theta) = \arg\max_{\lambda} \, \mathbb{E}\big[\log \Lambda(\lambda' X_i)\big] + \mathbb{E}\big[\log\big(1 - \Lambda(\lambda' X_i^{\theta})\big)\big]. \]
We assume the following three high-level conditions:

1. λ(θ) = 0 if and only if θ = θ_0.
2. sup_θ ‖λ̂(θ) − λ(θ)‖ = o_p(1).
3.
\[ \sqrt{n}\,\big(\hat\lambda(\theta_0) - \lambda(\theta_0)\big) \rightsquigarrow N\Big(0, \lim_{m,n\to\infty}\Big[1 + \frac{n}{m}\Big]\Omega_\lambda\Big), \quad \text{where } \Omega_\lambda = \mathbb{E}[X_i X_i']^{-1}\operatorname{Var}(X_i)\,\mathbb{E}[X_i X_i']^{-1}. \]

The first condition can be interpreted as an identification assumption. The second condition is uniform consistency of the logit parameters over the space of θ. The third condition states that λ̂ behaves asymptotically as a regular M-estimator.

Proposition S.1 (Asymptotic equivalence with SMM). Under Assumptions 1, 2, and 3, as n, m → ∞,
\[ \sqrt{n}\,(\hat\theta - \theta_0) \rightsquigarrow N\Big(0, \lim_{m,n\to\infty}\Big[1 + \frac{n}{m}\Big] V\Big), \quad \text{where } V = \left( \mathbb{E}\Big[\frac{\partial X_i^{\theta_0}}{\partial\theta'}\Big]' \, \mathbb{E}[X_i X_i']^{-1} \, \mathbb{E}\Big[\frac{\partial X_i^{\theta_0}}{\partial\theta'}\Big] \right)^{-1}. \]
In addition, the SMM estimator
\[ \tilde\theta = \arg\min_{\theta} \left( \frac{1}{n}\sum_{i=1}^{n} \tilde X_i - \frac{1}{m}\sum_{i=1}^{m} \tilde X_i^{\theta} \right)' \Omega_W \left( \frac{1}{n}\sum_{i=1}^{n} \tilde X_i - \frac{1}{m}\sum_{i=1}^{m} \tilde X_i^{\theta} \right), \]
where Ω_W is the optimal weighting matrix defined in Gouriéroux et al. (1993, Proposition 5), satisfies
\[ \sqrt{n}\,(\tilde\theta - \theta_0) \rightsquigarrow N\Big(0, \lim_{m,n\to\infty}\Big[1 + \frac{n}{m}\Big] V\Big). \]

Proof. Using the properties of the sigmoid function, we have the expansion
\[ \hat\theta - \theta_0 = M(\theta^*)^{-1} \left( \hat\lambda(\theta_0)'\,\frac{1}{m}\sum_{i=1}^{m} \Lambda\big(\hat\lambda(\theta_0)' X_i^{\theta_0}\big)\,\frac{\partial X_i^{\theta_0}}{\partial\theta'} \right)', \]
where θ* lies between θ̂ and θ_0, and
\[ M(\theta) = \frac{\partial\hat\lambda(\theta)'}{\partial\theta}\,\frac{1}{m}\sum_{i=1}^{m} \Lambda\big(\hat\lambda(\theta)' X_i^{\theta}\big)\,\frac{\partial X_i^{\theta}}{\partial\theta'} + \hat\lambda(\theta)'\,\frac{1}{m}\sum_{i=1}^{m} \Lambda\big(\hat\lambda(\theta)' X_i^{\theta}\big)\,\frac{\partial^2 X_i^{\theta}}{\partial\theta\,\partial\theta'} + \frac{1}{m}\sum_{i=1}^{m} \Lambda'\big(\hat\lambda(\theta)' X_i^{\theta}\big)\left[\frac{\partial\hat\lambda(\theta)'}{\partial\theta}\, X_i^{\theta} + \Big(\frac{\partial X_i^{\theta}}{\partial\theta'}\Big)'\hat\lambda(\theta)\right]\hat\lambda(\theta)'\,\frac{\partial X_i^{\theta}}{\partial\theta'}. \]
By consistency of θ̂ and conditions 1 and 2 above, we have λ̂(θ*) = o_p(1). In addition, substituting in the expression of ∂λ̂/∂θ' obtained using the total derivative of the FOC of the logit maximization (omitted here), we have
\[ M(\theta^*) = A(\theta^*)'\, R(\theta^*)^{-1} A(\theta^*) + o_p(1), \]
where
\[ A(\theta) = \frac{1}{m}\sum_{i=1}^{m} \Lambda\big(\hat\lambda(\theta)' X_i^{\theta}\big)\,\frac{\partial X_i^{\theta}}{\partial\theta'}, \qquad R(\theta) = \frac{1}{n}\sum_{i=1}^{n} \Lambda'\big(\hat\lambda(\theta)' X_i\big)\, X_i X_i' + \frac{1}{m}\sum_{i=1}^{m} \Lambda'\big(\hat\lambda(\theta)' X_i^{\theta}\big)\, X_i^{\theta} X_i^{\theta\prime}. \]
Using the block matrix inversion formula and ∂X_i^θ/∂θ' = (0, ∂X̃_i^θ/∂θ')', we see that, as n/m → 0,
\[ A(\theta_0)'\,\Omega_\lambda\, A(\theta_0) = \tfrac{1}{2} M(\theta_0), \]
and hence
\[ \sqrt{n}\,(\hat\theta - \theta_0) = M(\theta^*)^{-1} A(\theta_0)'\,\sqrt{n}\,\big(\hat\lambda(\theta_0) - \lambda(\theta_0)\big) \rightsquigarrow N\Big(0, \lim_{m,n\to\infty}\Big[1 + \frac{n}{m}\Big] V\Big). \]
We now move to show the second part of the proposition. Using the notation in Gouriéroux et al. (1993), we define
\[ Q(\theta; \tau) = -\frac{1}{n}\sum_{i=1}^{n} \big\| \tilde X_i^{\theta} - \tau \big\|^2, \]
where τ is the auxiliary parameter. We have
\[ \hat\tau(\theta) = \frac{1}{n}\sum_{i=1}^{n} \tilde X_i^{\theta}. \]
Using the expression of the asymptotic distribution with the optimal weighting matrix in Gouriéroux et al. (1993, Propositions 4 and 5), we obtain the result. ∎

Remark. When n/m → C, there is inflation of the variance proportional to 1 + C.

REFERENCES

Arjovsky, M. and L. Bottou (2017): "Towards Principled Methods for Training Generative Adversarial Networks," arXiv preprint arXiv:1701.04862.

Bauer, B. and M. Kohler (2019): "On Deep Learning as a Remedy for the Curse of Dimensionality in Nonparametric Regression," Annals of Statistics, 47, 2261–2285.

Bishop, C. M. (2006): Pattern Recognition and Machine Learning, Springer.

De Nardi, M., E. French, and J. B. Jones (2010): "Why Do the Elderly Save? The Role of Medical Expenses," Journal of Political Economy, 118, 39–75.

Gouriéroux, C., A. Monfort, and E. Renault (1993): "Indirect Inference," Journal of Applied Econometrics, 8, S85–S118.

Honoré, B. E. and L. Hu (2017): "Poor (Wo)man's Bootstrap," Econometrica, 85, 1277–1301.

Nesterov, Y. (1983): "A Method of Solving a Convex Programming Problem with Convergence Rate O(1/k²)," in Sov. Math. Dokl., vol. 27.

Newey, W. K. and D. McFadden (1994): "Large Sample Estimation and Hypothesis Testing," in Handbook of Econometrics, Amsterdam: North-Holland, vol. 4, chap. 36, 2111–2245.

Riedmiller, M. and H. Braun (1993): "A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm," in Proceedings of the IEEE International Conference on Neural Networks.