Inference by Stochastic Optimization: A Free-Lunch Bootstrap
Jean-Jacques Forneron ∗ Serena Ng † May 2020
Abstract
Assessing sampling uncertainty in extremum estimation can be challenging when the asymptotic variance is not analytically tractable. Bootstrap inference offers a feasible solution but can be computationally costly, especially when the model is complex. This paper uses iterates of a specially designed stochastic optimization algorithm as draws from which both point estimates and bootstrap standard errors can be computed in a single run. The draws are generated by the gradient and Hessian computed from batches of data that are resampled at each iteration. We show that these draws yield consistent estimates and asymptotically valid frequentist inference for a large class of regular problems. The algorithm provides accurate standard errors in simulation examples and empirical applications at low computational costs. The draws from the algorithm also provide a convenient way to detect data irregularities.
JEL Classification: C2, C3
Keywords: Stochastic gradient descent, Newton-Raphson, Simulation-Based Estimation.

∗ Department of Economics, Boston University, 270 Bay State Rd, MA 02215. Email: [email protected]
† Department of Economics, Columbia University and NBER, 420 W. 118 St. MC 3308, New York, NY 10027. Email: [email protected]

We would like to thank Jessie Li for helpful comments and suggestions, as well as the participants of the Microeconomics Seminar at UC Santa Cruz, the Optimization-Conscious Econometrics Conference, and the econometric workshops held at BU/BC, Columbia University, the University of Wisconsin, and New York University. We would also like to thank Robert Moffitt and Sisi Zhang for their help in replicating some of the empirical results, as well as Wentian Qian for research assistance. Financial support from the National Science Foundation (SES 1558623) is gratefully acknowledged.

1 Introduction
Many questions of economic interest can be expressed as non-linear functions of unknown parameters θ that need to be estimated from a sample of data of size n. The typical econometric routine is to first obtain a consistent estimate θ̂_n of the true value θ_0 by minimizing an objective function Q_n(θ), after which its sampling uncertainty is assessed. Though gradient-free optimizers provide point estimates, the asymptotic variance is often analytically intractable. One remedy is to use bootstrap standard errors, but this requires solving the minimization problem each time the data is resampled, and for complex models this is no simple task. There is a long-standing interest in finding 'short-cuts' that can relieve the computation burden without sacrificing too much accuracy. Examples include Davidson and MacKinnon (1999), Andrews (2002), Kline and Santos (2012), Armstrong et al. (2014) and, more recently, Honoré and Hu (2017). These methods provide standard errors by taking a converged estimate θ̂_n as given. As such, estimation always precedes inference.

This paper proposes a resampling scheme that will deliver both the point estimates of θ and their standard errors within the same optimization framework. Since the standard errors are obtained as a by-product of point estimation, we refer to the procedure as a 'free-lunch bootstrap'. The free lunch is made possible by a specially designed stochastic optimization algorithm that resamples batches of data of size m ≤ n. Given an initial guess, one updates θ_b for b ≥ 0 to θ_{b+1} using the gradient, the inverse Hessian as conditioning matrix, and a suitably chosen learning rate. We first show that the average over B draws of θ_b is equivalent to the mode θ̂_n obtained by classical optimization up to order 1/m. We then show that the distribution of √m (θ_b − θ̂_n) conditional on the original sample of data is first-order equivalent to that of √n (θ̂_n − θ_0) upon rescaling, making it a bootstrap distribution.
Because the conditioning matrix is the inverse Hessian, the procedure is a resampled Newton-Raphson (rnr) algorithm. For other conditioning matrices, the draws from resampling still produce a consistent estimate but cannot be used for inference.

The main appeal of the proposed methodology is its simplicity. If the optimization problem can be solved by our stochastic optimizer, inference can be made immediately without further computations. Natural applications include two-step estimation when the parameters in the two steps are functionally dependent in a complicated way, as well as minimum distance estimation that compares the empirical moments with the model moments expressed

[Footnote: In optimization, the no-free-lunch theorem of Wolpert and Macready (1997) states that, when averaged over all problems, the computation cost of finding a solution is the same across methods. We use the term to refer to the ability to compute the quantities for inference when the estimator is constructed.]
as a function of the parameters. When such a mapping cannot be expressed in closed form, simulation estimation makes progress by using Monte-Carlo methods to approximate the binding function, but computing standard errors of the simulation-based estimates remains a daunting task. Our algorithm provides an automated solution to compute standard errors and removes simulation noise, resulting in more accurate estimates. The algorithm also provides a convenient way to compute clustered standard errors and model diagnostics.

Compared to other stochastic optimization algorithms, we use a learning rate that is fixed rather than vanishing, and though a small m is desirable from a pure computation perspective, valid inference necessitates that m cannot be too small. Compared to conventional bootstrap methods, the simultaneous nature of estimation and inference means that a preliminary θ̂_n is not needed for resampling. Though θ_b is a Markov chain, no prior distribution is required, nor are Bayesian computation tools employed. In simulated examples and applications, our bootstrap standard errors match up well with the asymptotic and bootstrap analogs, but at significantly lower computational costs.

The plan of the paper is as follows. Section 2 begins with a review of classical and stochastic optimization. The proposed free-lunch algorithm is presented in Section 3 and its relation to other resampling procedures is explained. The properties of the draws from the algorithm are derived in Section 4. Simulated and empirical examples are presented in Section 5. Section 6 extends the main results to simulation-based estimation. Appendix A provides derivations of the main results. An on-line supplement provides the R code to implement one of the applications considered, additional results with details for replications, as well as an analytical example for Section 6.

2 Classical and Stochastic Optimization

Consider minimization of the objective function Q_n(θ) with respect to θ, whose true value is θ_0.
The sample gradient and Hessian of Q_n(θ) are defined respectively by

G_n(θ) = ∇Q_n(θ; x) = (1/n) Σ_{i=1}^n ∇Q(θ; x_i),
H_n(θ) = ∇²Q_n(θ; x) = (1/n) Σ_{i=1}^n ∇²Q(θ; x_i).

The necessary conditions for a local minimum are ‖G_n(θ̂_n)‖ = 0 and H_n(θ̂_n) positive semi-definite. The sufficient conditions are ‖G_n(θ̂_n)‖ = 0 and H_n(θ̂_n) positive definite.

[Footnote: The file is available for download at .]

To find θ̂_n, a classical optimizer updates θ_k according to θ_{k+1} = θ_k − γ_k Z_n(θ_k), where γ_k is the step size and Z_n = −∂θ_{k+1}/∂γ_k is the direction of change. Gradient-based methods specify Z_n(θ_k) = P_n(θ_k) G_n(θ_k), where P_n(θ_k) is a conditioning matrix. The updating rule then becomes

θ_{k+1} ≡ θ_k − γ_k P_n(θ_k) G_n(θ_k).    (1)

The method of gradient descent (gd), also known as steepest descent, sets P_n = I_d. Since gd does not involve the Hessian, it is a first-order method and is less costly. Convergence of θ_k to the minimizer θ̂_n is linear under certain conditions, but the rate depends on the eigenvalues of I_d − γH_n(θ_k) lying in (−1, 1), and can be slow when the ratio of the maximum to the minimum eigenvalue of H_n is large. The Newton-Raphson algorithm puts P_n = H_n(θ_k)^{−1}. It is a second-order method since it involves the Hessian matrix. When γ_k = 1, the algorithm converges quadratically if Q_n satisfies certain conditions. A drawback of Newton's algorithm is that it requires computation of the inverse of the Hessian. When strong convexity fails, the Hessian can be non-positive definite for θ away from the minimum. In such cases, it is not uncommon to replace the Hessian by H_n(θ_k) + c·I_d for some c > 0, or to specify P_n = (H_n(θ_k)′H_n(θ_k))^{1/2} to restore positive definiteness around saddle points; see Nocedal and Wright (2006, Chapter 3.4).
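To fix ideas, the updating rule (1) can be run with both choices of P_n on a scalar example. The Poisson-style sample objective below, Q_n(θ) = (1/n) Σ_i [exp(θ) − y_i θ] with minimizer θ̂_n = log(ȳ), is our own illustration, not an example from the paper:

```python
import math
import random

random.seed(0)
y = [random.random() * 4 + 1 for _ in range(500)]   # positive data (illustration only)
n = len(y)
ybar = sum(y) / n

# Q_n(theta) = (1/n) sum_i [exp(theta) - y_i * theta], minimized at log(ybar);
# the gradient and Hessian are scalar here.
def G_n(theta):
    return math.exp(theta) - ybar

def H_n(theta):
    return math.exp(theta)

def gd(theta, gamma=0.3, iters=300):
    # gradient descent: P_n = I_d
    for _ in range(iters):
        theta -= gamma * G_n(theta)
    return theta

def newton(theta, gamma=1.0, iters=50):
    # Newton-Raphson: P_n = H_n(theta)^(-1)
    for _ in range(iters):
        theta -= gamma * G_n(theta) / H_n(theta)
    return theta

theta_gd, theta_nr = gd(0.0), newton(5.0)
```

With γ = 1, Newton-Raphson reaches machine precision in a handful of iterations, while gd takes many more steps but never forms a Hessian.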
Quasi-Newton methods bypass direct computation of the Hessian or its inverse, but analytical convergence results are more difficult to obtain. We focus our theoretical analysis on gradient descent and Newton-Raphson based algorithms but will consider quasi-Newton methods in some simulations.

Stochastic optimization finds the optima from noisy observations using carefully designed recursive algorithms. The idea can be traced to the theory of stochastic approximation, where the goal is to minimize some function Q(θ) with gradient G(θ), which is equivalent to the root-finding problem G(θ) = 0 whose true value is θ_0. A classical optimizer would perform θ_k = θ_{k−1} − γ_k G(θ_{k−1}). Robbins and Monro (1951) consider the situation where we only observe G(θ_k) + e_k with E(e_k) = 0 and suggest updating according to

θ_k = θ_{k−1} − γ_k (G(θ_{k−1}) + e_k).

Robbins and Monro (1951) proved that θ_k →a.s. θ_0 for G non-decreasing, with a step size sequence γ_k ≥ 0 satisfying

(i) Σ_{k=1}^∞ γ_k = +∞,  (ii) Σ_{k=1}^∞ γ_k² < +∞.    (2)

[Footnote: In statistical computing, the convergence of θ_k to θ̂_n is said to be linear if ‖θ_{k+1} − θ̂_n‖/‖θ_k − θ̂_n‖^q < r for some r ∈ (0, 1) when q = 1, and quadratic when q = 2. Convergence is superlinear if lim_{k→∞} ‖θ_{k+1} − θ̂_n‖/‖θ_k − θ̂_n‖ = 0. See Boyd and Vandenberghe (2004), Section 9.3.1, for linear convergence of gradient methods, and Nocedal and Wright (2006, Theorem 3.5) for quadratic convergence of Newton's method when γ = 1 or γ_k → 1 at an appropriate rate. 'Damped Newton' updating with γ_k ∈ (0, 1) has a linear convergence rate; see Boyd and Vandenberghe (2004), Section 9.5.3, and Nesterov (2018), Section 1.2.4.]
The first condition ensures that all possible solutions can be reached with high probability regardless of the starting value, while the second ensures convergence to the true value. Building on the Robbins-Monro algorithm, the Kiefer-Wolfowitz algorithm uses a finite difference approximation

G(θ_k) ≈ G_n(θ_k) = (1/(2ε_k)) [Q_n(θ_k + ε_k) − Q_n(θ_k − ε_k)].

This is often recognized as the first implementation of stochastic gradient descent. Kiefer and Wolfowitz (1952) prove convergence of θ_k to the maximum likelihood estimate θ̂_n assuming that the likelihood Q_n is convex, ε_k goes to zero, and the two conditions stated in (2) hold.

Modern stochastic gradient descent updates according to θ_{k+1} = θ_k − γ_k G_m(θ_k), where G_m(θ_k) = (1/m) Σ_{i=1}^m G(θ_k; x_i) is an estimate of G(θ). It can be seen as Monte-Carlo based since the m observations used to compute G_m(θ_k) are usually chosen from {1, ..., n} randomly. Though m = 1 is computationally inexpensive and is a popular choice, a small γ_k is often needed to compensate for the higher variation. A common rule is to choose γ_k = γ_0 k^{−δ}, where δ ∈ (1/2, 1] and γ_0 > 0 are the choice parameters. Depending on δ, convergence as measured by E(‖θ_k − θ_0‖²) can occur at a 1/k rate or slower. To reduce sensitivity to the tuning parameters, Ruppert (1988) and Polyak and Juditsky (1992) propose to accelerate convergence using what is now known as Polyak-Ruppert averaging: θ̄_k = (1/k) Σ_{i=1}^k θ_i. Importantly, θ̄_k converges at the fastest 1/k rate for all choices of δ ∈ (1/2, 1]. Moulines and Bach (2011) show that the improvements hold even for a finite number of iterations k. We will return to Polyak-Ruppert averaging below.

Stochastic optimization presents an interesting alternative to classical optimization as it approximates the gradient on minibatches of the original data.
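The sgd recursion with Polyak-Ruppert averaging is short enough to simulate directly. In the sketch below (the quadratic loss and all tuning values are our own illustration), the schedule γ_k = γ_0 k^{−δ} with δ = 0.7 ∈ (1/2, 1] satisfies the Robbins-Monro conditions in (2), and the averaged iterate tracks the sample minimizer:

```python
import random

random.seed(2)
x = [random.gauss(2.0, 1.0) for _ in range(100_000)]

gamma0, delta = 0.5, 0.7          # gamma_k = gamma0 * k**(-delta), delta in (1/2, 1]
theta, theta_sum = 0.0, 0.0
for k, xi in enumerate(x, start=1):
    g = theta - xi                # stochastic gradient of (theta - x_i)**2 / 2, batch size m = 1
    theta -= gamma0 * k ** (-delta) * g
    theta_sum += theta

theta_bar = theta_sum / len(x)    # Polyak-Ruppert average of the iterates
xbar = sum(x) / len(x)            # minimizer of the sample objective
```

The averaged iterate theta_bar is close to the sample mean xbar, even though the final un-averaged iterate is noisier.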
This is particularly helpful in large-scale learning problems such as Lasso, support-vector machines, and K-means clustering, where non-linear optimization can be challenging. Improvements to sgd with P_m = I_d include momentum (Polyak, 1964) and accelerated gradient (Nesterov, 1983) methods. Besides its computational appeal, stochastic optimization can improve upon its classical counterpart in non-convex settings.

A variation of sgd, known as stochastic gradient Langevin dynamics (sgld), incorporates Langevin dynamics into a Bayesian sampler. As will be discussed further below, the update is based on the gradient of the posterior distribution. Of note for now is that sgld has two types of noise: an injected noise, and the stochastic gradient noise based on m ≪ n observations. They play different roles in the algorithm. In the early phase of sgld, the stochastic gradient dominates and the algorithm performs optimization. In the second phase, the injected noise dominates and the algorithm behaves like a posterior sampler. The algorithm seamlessly switches from one phase to another for an appropriate choice of the learning rate.

Unlike classical optimization, stochastic Newton-Raphson with the inverse Hessian H_m(θ)^{−1} as conditioning matrix is not popular because the Hessian is often noisy and near singular for m small, rendering the algorithm unstable. We will show that a variation of stochastic Newton-Raphson with larger batches of data can produce draws that not only provide an accurate estimate of θ_0 but also yield a frequentist assessment of sampling uncertainty. It thus integrates numerical optimization with statistical inference. In contrast, other conditioning matrices will yield consistent estimates but would not provide valid inference in our setup.

3 The Free-Lunch Bootstrap

Consider extremum estimation of parameters θ ∈ Θ ⊂ R^d from data x = (x_1, ..., x_n).
Let θ_0 be the minimizer of a twice differentiable population objective function Q(θ) whose sample analog is Q_n(θ) ≡ Q_n(θ; x). The sample extremum estimator is

θ̂_n = argmin_{θ∈Θ} Q_n(θ).

For likelihood estimation, Q_n(θ) = −Σ_{i=1}^n ℓ_i(θ), where ℓ_i is the likelihood of θ at observation x_i. For least squares estimation, Q_n(θ) is the sum of squared residuals Σ_{i=1}^n e_i(θ)². For GMM estimation with positive definite weighting matrix W_n, Q_n(θ) = g_n(θ)′W_n g_n(θ), where E[g_i(θ_0)] = 0. Under regularity conditions stated in Theorem 2.1 of Newey and McFadden (1994), θ̂_n is consistent for θ_0. If, in addition, the assumptions in Theorem 3.1 of Newey and McFadden (1994) hold, θ̂_n is also √n-asymptotically normal:

√n (V)^{−1/2} (θ̂_n − θ_0) →d N(0, I_d),

where V = [H(θ_0)]^{−1} var(√n G_n(θ_0)) [H(θ_0)]^{−1}. Finite-sample inference is typically based on an estimate of V, which can be analytically intractable or costly to compute on the full sample. It is not uncommon to resort to bootstrap inference.

[Footnote: See Goodfellow et al. (2016, Chapter 8) for an overview of sgd. Ge et al. (2015) show that noisy gradient descent can escape all saddle points in polynomial time under a strict saddle property, whereas classical gradient methods converge at saddle points where the gradient is zero. Jin et al. (2017) show that the dimension of θ has a negligible effect on the number of iterations needed to escape saddle points, making it an effective solution even in large optimization problems.]

We consider the m out of n bootstrap with m → ∞ and m/n → c ∈ [0, 1], which samples (x_1^{(b)}, ..., x_m^{(b)}) with replacement from the data (x_1, ..., x_n) for b = 1, ...
, B, and solve B minimization problems:

θ̂_m^{(b)} = argmin_{θ∈Θ} Q_m^{(b)}(θ),

where the resampled objective Q_m^{(b)}(θ) = Q_m^{(b)}(θ; x^{(b)}) is computed over the sample x^{(b)}. Let E⋆ and var⋆ denote the bootstrap expectation and variance, which are taken conditional on the sample data (x_1, ..., x_n), and let →d⋆ denote convergence in distribution conditional on the data. Since we only consider correctly specified regular estimators, the desired convergence is to a Gaussian limit. The m out of n bootstrap can allow for different sampling schemes. Assuming that the resampling scheme is chosen to reflect the dependence structure of the data, it holds in a variety of settings that

√m (V_m)^{−1/2} (θ̂_m^{(b)} − θ̂_n) →d⋆ N(0, I_d),

where V_m = [H_n(θ̂_n)]^{−1} var⋆(√m G_m^{(b)}(θ̂_n)) [H_n(θ̂_n)]^{−1} depends on the inverse Hessian as well as the variance of the resampled score. The result implies that the resampled distribution of θ̂_m^{(b)} can be used to approximate the sampling distribution of θ̂_n.

An m out of n bootstrap with m < n uses smaller samples and performs as well as an n out of n bootstrap in simulations, while requiring similar or weaker conditions (Bickel et al., 2012). Nonetheless, it still makes multiple calls to a classical optimizer. As will be seen below, our proposed algorithm only requires one call to the optimizer.

[Footnote: For two-way clustering, a recommended procedure is to resample over one cluster dimension and reweight along the other (Roodman et al., 2019). See also Cameron et al. (2011) on multiway clustering.]

We propose the following two algorithms and use 'b' to index the iterates of the algorithm. In this notation, G_m^{(b+1)}(θ_b) is the gradient computed on the (b+1)-th batch of resampled data of size m and evaluated at θ_b, the parameter value at the previous draw, and H_m^{(b+1)}(θ_b) is similarly defined.
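Before turning to the algorithms, the m out of n bootstrap above can be sketched in a few lines. The sample mean stands in for the extremum estimator (a minimal example of our own), and the √m draws are rescaled back to the √n scale:

```python
import random
import statistics

random.seed(3)
n, m, B = 2000, 200, 500
x = [random.expovariate(1.0) for _ in range(n)]
theta_hat = sum(x) / n                    # full-sample estimate (mean = argmin of squared loss)

boot = []
for _ in range(B):
    xb = random.choices(x, k=m)           # resample m out of n with replacement
    boot.append(sum(xb) / m)              # solve the b-th (trivial) minimization problem

# sqrt(m)*(theta_m^(b) - theta_hat) mimics sqrt(n)*(theta_hat - theta0), so the
# standard error on the original scale multiplies by sqrt(m/n):
se_boot = statistics.stdev(boot) * (m / n) ** 0.5
se_analytic = statistics.stdev(x) / n ** 0.5
```

The rescaled bootstrap standard error se_boot should be close to the analytic σ̂/√n for this regular estimator.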
[Footnote: For time-series data, block resampling is needed to preserve the dependence structure. For correctly specified GMM models, the bootstrap described above is valid (Hahn, 1996).]

Algorithm 1: Estimation by Resampling
Input: (a) an initial guess θ_0; (b) a bootstrap sample size B and a burn-in period burn; (c) a batch size m ≤ n; (d) a fixed learning rate γ ∈ (0, 1]; (e) a conditioning matrix P_b.
Burn-in and Resample:
for b = 1, ..., burn + B do
  Resample the (b+1)-th batch of data of size m.
  Update P_b and G_b = G_m^{(b+1)}(θ_b).
  Update θ_{b+1} = θ_b − γ P_b G_b.
end for
Discard the first burn draws; re-index θ_{burn+b} to θ_b for b = 1, ..., B.
Let θ̄_re = (1/B) Σ_{b=1}^B θ_b.

Algorithm 2:
The Free-Lunch Bootstrap
Implement Algorithm 1 with P_b = H_b^{−1}.
Let θ̄_rnr = (1/B) Σ_{b=1}^B θ_b and define ˆvar(θ_b) = (1/B) Σ_{b=1}^B (θ_b − θ̄_rnr)(θ_b − θ̄_rnr)′.
Output: θ̄_rnr and V_rnr = (m/φ(γ)) ˆvar(θ_b), where φ(γ) = γ²/(1 − (1 − γ)²).

Algorithm 1 produces an estimate θ̄_re by resampling, hence the acronym re. It works for any conditioning matrix P_b satisfying assumptions to be made precise in Theorem 1. Algorithm 2 produces draws using the inverse Hessian as P_b, as in Newton-Raphson, hence the acronym rnr. The free-lunch aspect relates to the fact that we get both an estimate θ̄_rnr and its standard error in one run. The bootstrap aspect comes from the fact that, under the assumptions of Theorem 2, √m (θ_b − θ̂_n) has the same asymptotic distribution as √n (θ̂_n − θ_0) after an adjustment by φ(γ). The quantity V_rnr is an estimate of the sandwich variance that is computationally costly in classical estimation. A Wald test for H_0: θ = θ† can be constructed as

wald = n (θ̄_rnr − θ†)′ V_rnr^{−1} (θ̄_rnr − θ†),

which has an asymptotic Chi-squared distribution under the null hypothesis. A 95% level bootstrap confidence interval can also be constructed, after adjusting for m/n and φ(γ), by taking the (0.025, 0.975) quantiles of {θ̄_rnr + √(m/(n φ(γ))) (θ_b − θ̄_rnr)}_{b≥1}.

Algorithms 1 and 2 have three features that distinguish them from existing gradient-based stochastic optimizers. First, γ ∈ (0, 1] does not change with b. Fixing γ rather than letting γ_b → 0 potentially permits faster convergence. Second, we sample m out of n observations with m/n → c ∈ [0, 1] and √n/m → 0. This precludes the popular choice in stochastic optimization of m = 1, but admits m = n. We thus accept a higher computation cost to accommodate inference. Third, compared to sgd, Algorithm 2 uses the inverse Hessian as conditioning matrix.
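To make Algorithm 2 concrete, the sketch below runs it on the simplest possible case: mean estimation with Q_m(θ) = (1/(2m)) Σ_i (θ − x_i)², so that H_m = 1 and P_b = H_m^{−1} is trivial. The data-generating process and tuning values (n, m, γ, B, burn) are our own illustrative choices, not the paper's:

```python
import random
import statistics

random.seed(4)
n, m, gamma = 5000, 500, 0.4
burn, B = 200, 2000
x = [random.gauss(0.0, 2.0) for _ in range(n)]

theta, draws = 0.0, []
for b in range(burn + B):
    batch = random.choices(x, k=m)         # resample the (b+1)-th batch of size m
    grad = theta - sum(batch) / m          # G_m(theta_b); here H_m = 1, so P_b = 1
    theta -= gamma * grad                  # theta_{b+1} = theta_b - gamma * P_b * G_b
    if b >= burn:                          # discard the burn-in draws
        draws.append(theta)

theta_rnr = sum(draws) / B                          # point estimate
phi = gamma ** 2 / (1 - (1 - gamma) ** 2)           # phi(gamma) adjustment
V_rnr = (m / phi) * statistics.variance(draws)      # free-lunch variance estimate
se_rnr = (V_rnr / n) ** 0.5                         # standard error for theta_rnr
```

In this example V_rnr should be close to the sample variance of x, so se_rnr approximates the textbook σ̂/√n without a separate bootstrap loop.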
3.1 Intuition from Linear Regression

This subsection uses the linear regression model to gain intuition about the free-lunch bootstrap. The model is y_i = x_i′θ_0 + e_i. Let ê_n = y_n − X_n θ̂_n be the n × 1 vector of least squares residuals evaluated at the solution θ̂_n, and let X_n denote the n × K matrix of regressors. The linear model is of interest because the objective function is quadratic and the quantities required for updating are analytically tractable. The gradient and Hessian of the full sample objective function Q_n(θ) = (y_n − X_n θ)′(y_n − X_n θ)/(2n) are G_n(θ) = −X_n′(y_n − X_n θ)/n and H_n(θ) = X_n′X_n/n. The updates for this linear model evolve as

θ_{k+1} = θ_k + γ P_k X_n′(y_n − X_n θ_k)/n = θ_k + γ P_k X_n′( X_n(θ̂_n − θ_k) + ê_n )/n.

Convergence of θ_k for a given conditioning matrix P_k can be studied by subtracting θ̂_n from both sides of the updating equation and re-arranging terms (see Appendix A.1 for details). Table 1 summarizes convergence of gd, nr, sgd, rgd, and rnr. snr is not considered because X_m′X_m/m is singular for m = 1, so θ_b is not well defined. The left panel of the table gives the updating rule in closed form and the right panel expresses the deviation of the draws from θ̂_n as the sum of a deterministic and a stochastic component.

Table 1: OLS: updating rules and convergence

Method | Conditioning matrix P_k | Update: θ_{k+1} − θ_k = | Convergence: θ_{k+1} − θ̂_n = deterministic + stochastic
gd     | I_d        | −γ G_k            | (I_d − γH_n)(θ_k − θ̂_n)
sgd    | I_d        | −γ_b G_b          | (I_d − γ_b H_b)(θ_b − θ̂_n) − γ_b G_b(θ̂_n)
rgd    | I_d        | −γ G_b            | (I_d − γH_b)(θ_b − θ̂_n) + γH_b(θ̂_m^{(b)} − θ̂_n)
nr     | H_k^{−1}   | −γ H_k^{−1} G_k   | (1 − γ)(θ_k − θ̂_n)
rnr    | H_b^{−1}   | −γ H_b^{−1} G_b   | (1 − γ)(θ_b − θ̂_n) + γ(θ̂_m^{(b)} − θ̂_n)

Note: G_k = G_n^{(k+1)}(θ_k), G_b = G_m^{(b+1)}(θ_b), H_k = H_n^{(k+1)}(θ_k), H_b = H_m^{(b+1)}(θ_b).
As seen from Table 1, gd updates do not depend on the Hessian but convergence does, while for nr the opposite is true. Convergence of nr is achieved after one iteration if γ = 1. In sgd, rgd, and rnr, batch resampling adds a stochastic component to the updates and convergence is no longer deterministic. The deviations θ_b − θ̂_n for sgd and rgd follow a VAR(1) process with varying and fixed coefficient matrices I_d − γ_b H_b and I_d − γH_b, respectively. In contrast, the rnr draws have an AR(1) representation with a fixed coefficient (1 − γ) that is dimension-free and independent of the Hessian. Note that rgd and rnr keep γ fixed and rely on averaging over b for convergence.

Our main result pertains to rnr, so it is useful to have a deeper understanding of how it works. Unlike stochastic optimizers, which require γ_b to vanish, the learning rate γ used to generate the rnr draws is constant. The draws evolve according to

θ_{b+1} − θ̂_n = (1 − γ)(θ_b − θ̂_n) + γ (θ̂_m^{(b+1)} − θ̂_n),    (3)

where θ̂_m^{(b+1)} = (X_m^{(b+1)′} X_m^{(b+1)})^{−1} X_m^{(b+1)′} y_m^{(b+1)} is obtained by classical optimization using the (b+1)-th bootstrap sample (y_i^{(b+1)}, x_i^{(b+1)})_{i=1,...,m}. Being a bootstrap estimate, it holds under regularity conditions that the distribution of √m (θ̂_m^{(b+1)} − θ̂_n) conditional on the data approximates the sampling distribution of √n (θ̂_n − θ_0) →d N(0, V).

Clearly, when γ = 1, (3) implies θ_b = θ̂_m^{(b)}, meaning that each rnr draw equals the bootstrap estimate θ̂_m^{(b)}. We want to show that the draws are still bootstrap estimates when γ ∈ (0, 1). For such γ, θ_{b+1} − θ̂_n is an AR(1) process where, for each b, the innovations γ (θ̂_m^{(b+1)} − θ̂_n) are iid conditional on the original sample.
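The rnr row of Table 1 can be checked mechanically: in the linear model, one rnr step computed from the batch gradient and Hessian coincides exactly with the convex combination (1 − γ)θ_b + γθ̂_m^{(b+1)} implied by (3). The single-regressor simulated data below are our own illustration:

```python
import random

random.seed(7)
n, m, gamma = 1000, 100, 0.3
xs = [random.gauss(1.0, 1.0) for _ in range(n)]
ys = [2.0 * x + random.gauss(0.0, 1.0) for x in xs]   # single-regressor model (illustrative)

theta = 0.0                                            # current draw theta_b
idx = random.choices(range(n), k=m)                    # (b+1)-th bootstrap batch
xb = [xs[i] for i in idx]
yb = [ys[i] for i in idx]

G = -sum(x * (y - x * theta) for x, y in zip(xb, yb)) / m   # batch gradient G_m(theta_b)
H = sum(x * x for x in xb) / m                              # batch Hessian H_m
theta_next = theta - gamma * G / H                          # one rnr step

theta_hat_b = (sum(x * y for x, y in zip(xb, yb))
               / sum(x * x for x in xb))                    # batch OLS estimate
combo = (1 - gamma) * theta + gamma * theta_hat_b           # convex combination in (3)
```

The identity theta_next = combo holds exactly (up to floating-point error), which is why the rnr deviations form an AR(1) process with coefficient (1 − γ).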
Iterating the AR(1) formula backwards to the initial value θ_0, we can decompose the draws θ_{b+1} into two terms:

θ_{b+1} − θ̂_n = (1 − γ)^{b+1} (θ_0 − θ̂_n)  [initialization bias]  +  γ Σ_{j=0}^{b} (1 − γ)^j (θ̂_m^{(b+1−j)} − θ̂_n)  [resampling noise],    (4)

where {θ̂_m^{(b+1−j)}}_{j≥0} are the bootstrap estimates from the previous iterations. The constant learning rate is crucial in achieving this representation.

To show that our estimator θ̄_rnr is √n-consistent for θ̂_n, i.e. θ̄_rnr = θ̂_n + o_{p⋆}(1/√n), we need to evaluate the average of the two terms in (4) over b. The initialization bias in (4) is due to taking an arbitrary starting value θ_0 and is identical to the optimization error in classical Newton-Raphson. For γ ∈ (0, 1], (1/B) Σ_{b=1}^B (1 − γ)^{b+1} = O(1/B) because {(1 − γ)^b}_{b≥0} is a summable geometric series. Another bias term of order O(1/m) arises when E⋆(θ̂_m^{(b)} − θ̂_n) = O(1/m). Since θ̂_n is fixed as b varies, we now have E⋆(θ̄_rnr) = θ̂_n + O(1/B) + O(1/m). Thus E⋆(θ̄_rnr) = θ̂_n + o(1/√n) as required, assuming √n/min(m, B) → 0. Turning to the variance, first note that, by virtue of bootstrapping, {θ̂_m^{(b+1−j)} − θ̂_n}_{j≥0} constitutes a sequence of conditionally iid errors, each with variance that is O(1/m). Since {(1 − γ)^b}_{b≥0} is summable, the variance in θ̄_rnr due to resampling is O(1/(mB)). This becomes o(1/n) when n/(mB) → 0, a sufficient condition being √n/min(m, B) → 0, which is also required for the bias to be negligible.
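The decomposition (4) is a finite algebraic identity, so it can be verified numerically by iterating the recursion (3) and comparing with the closed form; the 'bootstrap estimates' below are hypothetical stand-in numbers, not draws from any actual model:

```python
import random

random.seed(5)
gamma, theta_hat_n, theta_start = 0.3, 1.0, 5.0
boot = [theta_hat_n + random.gauss(0.0, 0.1) for _ in range(50)]  # stand-ins for theta_hat_m^(b)

theta = theta_start
for est in boot:                           # the AR(1) recursion (3)
    theta = (1 - gamma) * theta + gamma * est

b = len(boot) - 1                          # so theta above is theta_{b+1}
closed = (theta_hat_n
          + (1 - gamma) ** (b + 1) * (theta_start - theta_hat_n)       # initialization bias
          + gamma * sum((1 - gamma) ** j * (boot[b - j] - theta_hat_n)
                        for j in range(b + 1)))                         # resampling noise
```

The geometric weights make the initialization bias die out and keep the resampling noise summable, exactly as the consistency argument requires.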
We have thus shown that θ̄_rnr = θ̂_n + o_{p⋆}(1/√n) when √n/min(m, B) → 0, which is a simplified version of Theorem 1 below. Though the result has the flavor of Polyak-Ruppert averaging in stochastic optimization, γ is fixed here and m increases with n.

To show bootstrap validity of rnr, we need to establish that, conditional on the sample of data, the distribution of √m (θ_{b+1} − θ̂_n) is asymptotically equal, up to a constant scaling factor, to that of √m (θ̂_m^{(b+1)} − θ̂_n). This requires that the initialization bias in each θ_b is o(1/√m), which holds when log(m)/b → 0. From (4), √m (θ_{b+1} − θ̂_n) has variance γ² V_m conditional on θ_b, and unconditional variance

var(√m (θ_{b+1} − θ̂_n)) = [(γ² + O([1 − γ]^{b+2}))/(1 − [1 − γ]²)] V_m ≈ φ(γ) V_m,

where φ(γ) = γ²/(1 − [1 − γ]²) and V_m = var⋆(√m (θ̂_m^{(b+1)} − θ̂_n)) is the bootstrap estimate of the sandwich variance V defined above. This establishes that the variance of θ_b is proportional to that of the bootstrap estimate. Combined with asymptotic normality of √m (θ̂_m^{(b+1)} − θ̂_n) for each b and additional conditions to be made precise in Theorem 2, we have

(φ(γ) V_m)^{−1/2} √m (θ_{b+1} − θ̂_n) →d⋆ N(0, I_d).

But asymptotic theory gives the distribution of √n (θ̂_n − θ_0) with sample size n, not m. An adjustment for φ(γ) and m is needed. Let V_rnr = (m/φ(γ)) var⋆(θ_b) = V_m + o(1). For appropriate choices of m and γ,

V_rnr^{−1/2} √n (θ̄_rnr − θ_0) →d⋆ N(0, I_d),

and Algorithm 2 provides a plug-in estimate of V_rnr.

4 Properties of the Draws θ_b

This section studies the properties of the draws θ_b produced by Algorithms 1 and 2 for non-linear models. The proofs are more involved for two reasons. First, an arbitrary γ ∈ (0, 1] may not lead to convergence even for classical optimizers.
Second, whereas in quadratic problems the draws θ_b have a tractable AR(1) representation, for non-quadratic objectives the draws θ_b follow a non-linear process which is more difficult to study. Hence we need to first show, under strong convexity conditions, that there exist fixed values of γ ∈ (0, 1] such that optimization of Q_n(θ) has a globally convergent solution. We then show, using the idea of coupling, that for appropriate choices of m and B, θ_{b+1} can be made very close to a linear AR(1) sequence θ⋆_{b+1} that is constructed as if the objective were quadratic. This allows us to establish consistency of θ̄_re for θ̂_n in Theorem 1 for a large class of P_b, and a distributional result in Theorem 2 that validates inference for a particular choice of P_b.

4.1 Convergence of θ_k from Classical Updating to θ̂_n

Econometric theory typically studies the conditions under which θ̂_n is consistent for θ_0, taking as given that a numerical optimizer exists to produce a convergent solution θ̂_n. From Newey and McFadden (1994), the regularity conditions for consistent estimation of θ_0 are continuity of Q(θ) and uniform convergence of Q_n(θ) to Q(θ). Asymptotic normality further requires smoothness of Q_n(θ), θ_0 being in the interior of the support, and non-singularity of H(θ_0). But classical Newton-type algorithms may only converge to a local minimum, and a globally convergent solution is guaranteed only when the objective function is strongly convex on the parameter space Θ. For gradient-based optimizers to deliver such a solution, the following provides the required conditions.

Assumption 1. Q_n is twice continuously differentiable on Θ, a convex and compact subset of R^d. There exists a constant C < +∞ such that for all θ ∈ Θ:
i. 0 < λ_H ≤ λ_min(H_n(θ)) ≤ λ_max(H_n(θ)) ≤ λ̄_H < +∞,
ii. ‖H_n(θ) − H_n(θ̂_n)‖ ≤ C ‖θ − θ̂_n‖,
iii. 0 < λ_P ≤ λ_min(P_k) ≤ λ_max(P_k) ≤ λ̄_P < +∞.

Condition i.
implies strong convexity of Q_n on Θ. Condition ii. imposes Lipschitz continuity of the Hessian. Assumption 1 implies the following two inequalities, which are known as the Polyak-Łojasiewicz inequalities:

⟨θ − θ̂_n, G_n(θ)⟩ = (θ − θ̂_n)′ H_n(θ̃_n) (θ − θ̂_n) ≥ λ_H ‖θ − θ̂_n‖²,    (5)
‖G_n(θ)‖² = (θ − θ̂_n)′ H_n(θ̃_n)² (θ − θ̂_n) ≤ λ̄_H² ‖θ − θ̂_n‖²,    (6)

where θ̃_n is an intermediate value between θ and θ̂_n. Inequality (5), due to Łojasiewicz (1963) and Polyak (1963), follows from the positive definiteness of H_n(θ̃_n). Together, (5) and (6) ensure that θ̂_n is a unique (or global) minimizer of Q_n.

[Footnote: See Boyd and Vandenberghe (2004), Chapter 9.1. A function Q_n is strongly convex on Θ if for all θ ∈ Θ there exists some λ > 0 such that ∇²Q_n(θ) ≥ λI_d. For bounded Θ, there also exists λ̄ such that ∇²Q_n(θ) ≤ λ̄I_d. Then λ̄/λ is an upper bound on the condition number of ∇²Q_n(θ).]

Assumption 1 also ensures that there exist values of γ such that gradient-based optimization is globally convergent. To see why, consider

‖θ_{k+1} − θ̂_n‖² = ‖θ_k − θ̂_n − γ P_k G_k‖²
  = ‖θ_k − θ̂_n‖² − 2γ ⟨θ_k − θ̂_n, P_k G_k⟩ + γ² ‖P_k G_k‖²
  ≤ (1 − 2γ λ_P λ_H + γ² [λ̄_P λ̄_H]²) ‖θ_k − θ̂_n‖² = A(γ) ‖θ_k − θ̂_n‖²,

where the last inequality is implied by Assumption 1 i. and iii. Since a contraction occurs if A(γ) ∈ [0, 1), global convergence follows. Now at γ = 0, A(0) = 1 and ∂_γ A(0) < 0, so by continuity and local monotonicity of A(·), there exists a nonempty subinterval of the form (0, γ̃] with γ̃ ∈ (0, 1] such that A(γ) ∈ [0, 1) for all γ ∈ (0, γ̃]. This establishes the existence of an interval of values for γ close to zero such that the gradient-based optimizer is globally convergent.
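The behavior of the contraction factor A(γ) = 1 − 2γλ_Pλ_H + γ²[λ̄_Pλ̄_H]² is easy to inspect under hypothetical eigenvalue bounds (the values below are invented for illustration): A(0) = 1, A(γ) dips below one on an interval near zero, and exceeds one again for larger γ:

```python
# Invented eigenvalue bounds, for illustration only
lam_P = lam_P_bar = 1.0          # gd case: P_k = I_d
lam_H, lam_H_bar = 0.5, 2.0

def A(g):
    # A(gamma) = 1 - 2*gamma*lam_P*lam_H + gamma**2 * (lam_P_bar*lam_H_bar)**2
    return 1.0 - 2.0 * g * lam_P * lam_H + g ** 2 * (lam_P_bar * lam_H_bar) ** 2

g_star = lam_P * lam_H / (lam_P_bar * lam_H_bar) ** 2   # minimizer of A(gamma)
g_max = 2.0 * g_star                                    # A(gamma) < 1 iff 0 < gamma < g_max
```

In this quadratic-in-γ form, the contraction interval (0, g_max) and the fastest rate A(g_star) follow directly from the two eigenvalue bounds.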
But depending on λ_P λ_H and λ̄_P λ̄_H, there may exist larger values of γ ∈ (0, 1] with A(γ) ≥ 1 that could frustrate convergence. The following Lemma shows that √A(γ) is the global convergence rate of θ_k to θ̂_n.

Lemma 1.
Suppose Assumption 1 holds; then there exists $\gamma \in (0,1]$ such that $A(\gamma) \in [0,1)$. Let $\gamma_0$ be such that $A(\gamma) = (1-\gamma_0)^2$; then
$$\|\theta_k - \hat\theta_n\| \le (1-\gamma_0)^k\, \|\theta_0 - \hat\theta_n\| \to 0, \quad \text{as } k \to \infty.$$

Proof of Lemma 1:
As discussed above, there exists $\gamma$ such that $A(\gamma) \in [0,1)$. For such $\gamma$, let $\gamma_0(\lambda_P, \lambda_H, \overline\lambda_P, \overline\lambda_H) \in (0,1)$, independent of $k$, be such that $A(\gamma) = (1-\gamma_0)^2 \in [0,1)$. It follows that
$$\|\theta_{k+1} - \hat\theta_n\| \le \sqrt{A(\gamma)}\,\|\theta_k - \hat\theta_n\| \le (1-\gamma_0)\,\|\theta_k - \hat\theta_n\| \le (1-\gamma_0)^{k+1}\,\|\theta_0 - \hat\theta_n\| \to 0, \quad \text{as } k \to \infty. \qquad \square$$

In general, a larger value of $\gamma_0$ results in faster convergence of $\|\theta_k - \hat\theta_n\|$ to zero. The choice of $\gamma$ and the implied $\gamma_0$ in Lemma 1 are typically data-dependent, but further insights can be gained in two special cases. For gd, the largest globally convergent $\gamma$ is $\lambda_H/\overline\lambda_H^2$. In ill-conditioned problems, when this ratio is small, convergence will be slow since $(1-\gamma_0)^2 = 1 - [\lambda_H/\overline\lambda_H]^2$ will be large. For nr, when $P(\theta) = H(\theta)^{-1}$, we can use $\overline\lambda_{PH} \le \overline\lambda_P\overline\lambda_H$ and $\lambda_{PH} \ge \lambda_P\lambda_H$ to obtain a tighter bound. The globally convergent $\gamma$ that minimizes $1 - 2\gamma\lambda_{PH} + \gamma^2\overline\lambda_{PH}^2$ is then $\gamma = \lambda_{PH}/\overline\lambda_{PH}^2$, which is strictly less than 1 for non-quadratic objectives. Since the $(1-\gamma_0)^2 = 1 - [\lambda_{PH}/\overline\lambda_{PH}]^2$ associated with nr is typically smaller than that for gd, nr will converge faster.

3.2 Consistency of $\theta_{re}$

Resampling is usually used for inference, but Algorithm 1 uses resampling for estimation. Unlike classical optimizers, the resampled gradient is noisy. As a consequence, the draws $(\theta_b)_{b \ge 1}$ constructed by Algorithm 1 no longer converge deterministically. The following conditions will be imposed on the resampled objective $Q_m^{(b)}$.

Assumption 2.
Suppose that $m/n \to c \in [0,1]$ as both $m$ and $n \to +\infty$, and that there exist positive and finite constants $C_1, C_2, C_2', C_3$ such that the resampled gradient $G_m^{(b)}(\theta)$ and Hessian $H_m^{(b)}(\theta)$ satisfy the following for all $b \ge 1$ and $\theta \in \Theta$:
i. $\|G_m^{(b)}(\theta) - G_m^{(b)}(\hat\theta_n) - H_m^{(b)}(\hat\theta_n)(\theta - \hat\theta_n)\| \le C_1\|\theta - \hat\theta_n\|^2$,
ii. $0 < \lambda_H \le \lambda_{\min}(H_m^{(b)}(\theta)) \le \lambda_{\max}(H_m^{(b)}(\theta)) \le \overline\lambda_H < +\infty$,
iii. $\left[E^\star\left(\sup_{\theta\in\Theta}\|G_m^{(b)}(\theta) - G_n(\theta)\|^2\right)\right]^{1/2} \le C_2/\sqrt{m}$,
iv. $\|E^\star(G_m^{(b)}(\hat\theta_n))\| \le C_2'/m$,
v. $\left[E^\star\left(\sup_{\theta\in\Theta}\|H_m^{(b)}(\theta) - H_n(\theta)\|^2\right)\right]^{1/2} \le C_3/\sqrt{m}$,
vi. $0 < \lambda_P \le \lambda_{\min}(P_b) \le \lambda_{\max}(P_b) \le \overline\lambda_P < +\infty$.

Assumption 2 i. bounds the remainder term in the Taylor expansion of each resampled gradient around the sample minimizer $\hat\theta_n$. Assumption 2 ii. implies that each resampled objective is also strongly convex. Conditions iii.-v. are tightness conditions on the resampled gradient and Hessian empirical processes. They imply uniform convergence over $\Theta$ at a $\sqrt{m}$-rate. Condition iv. is satisfied with $C_2' = 0$ for MLE and NLS estimators because $G_n$ is a sample mean. For over-identified GMM, $g_n(\hat\theta_n) \ne 0$ and the gradient $G_n(\hat\theta_n) = 2\partial_\theta g_n(\hat\theta_n)' W_n g_n(\hat\theta_n)$ is not a sample mean. Condition iv. requires correct specification in GMM so that $\|g_n(\hat\theta_n)\|$ goes to zero sufficiently fast as $n \to \infty$.

The following lemma shows that $\theta_b$ converges in probability to, and stays within a $1/\sqrt{m}$ neighborhood of, $\hat\theta_n$ as $b$ increases. This is implied by a conditional uniform Central Limit Theorem. See van der Vaart and Wellner (1996, Chapter 2.9) and Kosorok (2007, Chapter 10) for iid data. Chen et al.
(2003) provide high-level conditions for resampling two-step estimators when the first-step estimator can be nonparametric.

Lemma 2. Under Assumptions 1-2 and given $\gamma \in (0,1]$ such that $(1-\gamma_0)^2 = A(\gamma) \in [0,1)$, as defined in Lemma 1, there exists a constant $C_4 = C(C_2, \overline\lambda_P, \gamma)$ such that
$$\left[E^\star\left(\|\theta_{b+1} - \hat\theta_n\|^2\right)\right]^{1/2} \le (1-\gamma_0)^{b+1}\left[E^\star\left(\|\theta_0 - \hat\theta_n\|^2\right)\right]^{1/2} + \frac{C_4}{\gamma_0\sqrt{m}}.$$

Proof of Lemma 2:
For any $\theta \in \Theta$, let $\mathbb{G}_b(\theta) = \sqrt{m}\left(G_m^{(b+1)}(\theta) - G_n(\theta)\right)$. By construction of $\theta_b$, we have $\theta_{b+1} - \hat\theta_n = \theta_b - \hat\theta_n - \gamma P_b G_m^{(b+1)}(\theta_b)$. It follows that
$$\theta_{b+1} - \hat\theta_n = \theta_b - \hat\theta_n - \gamma P_b G_n(\theta_b) - \frac{\gamma}{\sqrt{m}} P_b\, \mathbb{G}_b(\theta_b).$$
Taking the $\|\cdot\|$ norm on both sides, applying the triangle inequality, and using arguments analogous to Lemma 1, we have for $\gamma \in (0,1]$ small enough that the same $A(\gamma) \in [0,1)$ applies:
$$\|\theta_{b+1} - \hat\theta_n\| \le \|\theta_b - \hat\theta_n - \gamma P_b G_n(\theta_b)\| + \frac{\gamma\overline\lambda_P}{\sqrt{m}}\left(\sup_{\theta\in\Theta}\|\mathbb{G}_b(\theta)\|\right) \le (1-\gamma_0)\|\theta_b - \hat\theta_n\| + \frac{\gamma\overline\lambda_P}{\sqrt{m}}\left(\sup_{\theta\in\Theta}\|\mathbb{G}_b(\theta)\|\right).$$
Taking expectations on both sides:
$$\left[E^\star\left(\|\theta_{b+1} - \hat\theta_n\|^2\right)\right]^{1/2} \le (1-\gamma_0)\left[E^\star\left(\|\theta_b - \hat\theta_n\|^2\right)\right]^{1/2} + \frac{\gamma\overline\lambda_P C_2}{\sqrt{m}}.$$
The desired result is then obtained by iterating this recursion, with $C_4 = \gamma\overline\lambda_P C_2$. $\square$

Lemma 2 shows stochastic convergence of $\theta_b$ to $\hat\theta_n$. To study the properties of our estimator $\theta_{re}$, we will use a concept known as coupling. A coupling between two distributions $\mu$ and $\nu$ on an (unrestricted) common probability space is a pair of random variables $X$ and $Y$ such that $X \sim \nu$ and $Y \sim \mu$, and the two are equal, on average, up to a Wasserstein distance of order $p \ge 1$. Precisely, the Wasserstein-Fréchet-Kantorovich coupling distance between two distributions $\nu$ and $\mu$ is defined as:
$$W_p(\nu, \mu)^p = \inf_{(X,Y),\, X\sim\nu,\, Y\sim\mu} E\left(\|X - Y\|^p\right), \quad p \ge 1.$$
Of interest here is the coupling between $\theta_b$ and $\theta^\star_b$, where $\theta^\star_b$ is a linearized sequence of $\theta_b$ defined below. They have different marginal distributions because one is a linear and the other is a non-linear process.
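The coupling idea can be previewed numerically. In the sketch below (our construction; the scalar M-estimation problem, sample sizes, and learning rate are all illustrative choices, not the paper's), the nonlinear resampled Newton-Raphson draws and a linearized AR(1) sequence of the kind defined below are driven by the same resampled batches, and the gap between the two paths stays small after a short burn-in:

```python
import numpy as np

# Coupling sketch: nonlinear r-nr draws t_b vs. the linearized draws t*_b built from
# the SAME resampled batches. Objective: Q(t) = mean((t-z)^2/2 + 0.1*(t-z)^4/4),
# a smooth, strongly convex toy problem of our choosing.
rng = np.random.default_rng(0)
n, m, gamma, B = 5000, 5000, 0.5, 400
z = rng.normal(1.0, 1.0, size=n)

def G(t, s):                                  # gradient of Q on sample s
    d = t - s
    return np.mean(d + 0.1 * d**3)

def Hf(t, s):                                 # Hessian of Q on sample s (always >= 1)
    return np.mean(1.0 + 0.3 * (t - s)**2)

t_hat = 0.0
for _ in range(50):                           # full-sample Newton for theta_hat
    t_hat -= G(t_hat, z) / Hf(t_hat, z)

t, t_lin = 3.0, 3.0                           # same (deliberately far) initialization
gaps = []
for b in range(B):
    s = z[rng.integers(0, n, size=m)]         # one shared resampled batch
    t = t - gamma * G(t, s) / Hf(t, s)        # nonlinear r-nr draw
    # linearized draw: AR(1) around t_hat, resampled gradient at t_hat,
    # full-sample Hessian as conditioning
    t_lin = t_hat + (1 - gamma) * (t_lin - t_hat) - gamma * G(t_hat, s) / Hf(t_hat, z)
    gaps.append(abs(t - t_lin))
print(max(gaps[50:]))                         # small: the two paths are coupled
```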
Nonetheless, they live on the same probability space because they rely on the same source of randomness, originating from the resampled objective $Q_m^{(b)}$. Hence if we can show that $\|\theta_b - \theta^\star_b\|$ is small in probability, then we can work with the distribution of $\theta^\star_b$, which is more tractable.

Precisely, we are interested in a linearized sequence defined as
$$\theta^\star_{b+1} - \hat\theta_n = \Psi(\hat\theta_n)(\theta^\star_b - \hat\theta_n) - \gamma P_m G_m^{(b+1)}(\hat\theta_n), \qquad (7)$$
where $\Psi(\hat\theta_n) = I_d - \gamma P_m H_n(\hat\theta_n)$, and $P_m = I_d$ for r gd and $P_m = [H_n(\hat\theta_n)]^{-1}$ for r nr. We saw earlier from (3) in the linear regression model that $\Psi(\hat\theta_n) = (1-\gamma)I_d$ for r nr. We now provide conditions on $P_b$ for the draws produced in Algorithm 1 to be close to those defined in (7) in non-quadratic settings.

Assumption 3.
Define $d_{0,n} = \left[E^\star\left(\|\theta_0 - \hat\theta_n\|^2\right)\right]^{1/2}$ and let $P_m$ be a symmetric positive definite matrix such that, for $\Psi(\hat\theta_n) = I_d - \gamma P_m H_n(\hat\theta_n)$:
i. $0 \le \lambda_{\max}(\Psi(\hat\theta_n)^2) < 1$,
ii. $\left[E^\star\left(\|I_d - P_b P_m^{-1}\|^2\right)\right]^{1/2} \le C\left(\rho_0^b\, d_{0,n} + \frac{1}{\sqrt{m}}\right)$, for some $\rho_0 \in [0,1)$ and some $C > 0$.

Assumption 3 ii. is needed to ensure that the resampled conditioning matrix $P_b$ converges to the $P_m$ used in (7), and Assumption 3 i. ensures stability of the linearized process (7). These assumptions allow us to study $\theta^\star_{b+1} - \hat\theta_n$ as a VAR process with parameters that depend on the Hessian, the conditioning matrix, and the learning rate, as in the OLS example.

For r gd with $P_b = P_m = I_d$, Condition ii. holds automatically, while Condition i. requires $\gamma < 2/\overline\lambda_H$. For r nr with $P_m = [H_n(\hat\theta_n)]^{-1}$, it will be shown in Theorem 2 that Conditions i.-ii. hold for any $\gamma \in (0,1]$ such that $A(\gamma) \in [0,1)$ under the assumptions of Lemmas 1, 2. This implies that the $\theta^\star_b$ constructed in (7) for r nr is an AR(1) process with autoregressive coefficient $1-\gamma$, as in the OLS case. Now define
$$\rho = \max\left[\sqrt{\lambda_{\max}(\Psi(\hat\theta_n)^2)},\; 1-\gamma,\; \rho_0\right] < 1. \qquad (8)$$
The autoregressive structure of $\theta^\star_b - \hat\theta_n$ and $\theta_b - \hat\theta_n$, together with the assumed convergence of $P_b$ to $P_m$, leads to the following result on the coupling distance between $\theta_b$ and $\theta^\star_b$.

Lemma 3.
Suppose that Lemmas 1 and 2 hold, and there exists a matrix $P_m > 0$ satisfying Assumption 3. Let $\rho$ be defined as in (8). Then $\theta^\star_b$ defined in (7) satisfies:
$$E^\star\left(\|\theta_b - \theta^\star_b\|\right) \le C\left(\frac{1}{m} + \rho^b\left[d_{0,n} + d_{0,n}^2\right]\right).$$

The statement holds for any conditioning matrix $P_b$ evaluated on the subsamples satisfying Assumption 3. Since $\frac{1}{B}\sum_{b=1}^B \rho^b \le \frac{1}{B(1-\rho)}$, Lemma 3 implies
$$E^\star\left(\|\theta_{re} - \theta^\star_{re}\|\right) \le \frac{C}{1-\rho}\left(\frac{1}{m} + \frac{d_{0,n} + d_{0,n}^2}{B}\right). \qquad (9)$$
The result is useful because it implies that our estimator $\theta_{re}$ equals $\theta^\star_{re} = \frac{1}{B}\sum_{b=1}^B \theta^\star_b$ up to vanishing terms. By the triangle inequality:
$$E^\star\left(\|\theta_{re} - \hat\theta_n\|\right) \le E^\star\left(\|\theta_{re} - \theta^\star_{re}\|\right) + E^\star\left(\|\theta^\star_{re} - \hat\theta_n\|\right). \qquad (10)$$
The first term can be bounded by Lemma 3 as discussed above, and $E^\star(\theta^\star_b) = \hat\theta_n$ by construction of $\theta^\star_b$ in (7). Furthermore, Assumption 3 i. implies that the difference $\theta^\star_{re} - \hat\theta_n$ is $O_{p^\star}(1/\sqrt{mB})$ since $\theta^\star_b$ is asymptotically ergodic and its innovations have variance of order $1/m$.

Theorem 1.
Let $\theta^0$ be the population minimizer, $\hat\theta_n$ the estimate obtained by a classical optimizer, and $\{\theta_b\}$ the draws generated by Algorithm 1. Suppose that $\{\sqrt{m}\,P_m G_m^{(b)}(\hat\theta_n)\}_{b\ge 1}$ are iid with a finite and bounded variance-covariance matrix. Under the conditions of Lemma 3,
$$E^\star\left(\|\theta_{re} - \hat\theta_n\|\right) \le C\left(\frac{1}{m} + \frac{d_{0,n} + d_{0,n}^2}{B} + \frac{1}{\sqrt{mB}}\right),$$
where $C$ depends on the constants and the largest eigenvalue of $\mathrm{var}^\star(\sqrt{m}\,P_m G_m^{(b)}(\hat\theta_n))$. Furthermore, suppose that $\sqrt{n}/\min(B,m) \to 0$ and $d_{0,n} = O(1)$; then:
$$\sqrt{n}\left(\theta_{re} - \theta^0\right) = \sqrt{n}\left(\hat\theta_n - \theta^0\right) + o_{p^\star}(1).$$

Theorem 1 says that the average of draws $\theta_{re}$ is a consistent estimate of $\hat\theta_n$ for any choice of conditioning satisfying Assumption 3. The inverse Hessian (r nr) and the identity matrix (r gd) are examples of such conditioning matrices $P_b$.

3.3 r nr for Frequentist Inference

Theorem 1 is valid for any $P_b$ satisfying the assumptions of the analysis. This subsection specializes to the r nr draws produced by Algorithm 2, which uses the inverse Hessian as conditioning matrix. There are two reasons for this choice. First, it implies a faster decline in the initialization bias compared to, e.g., the $P_m = I_d$ used in gradient descent. Second, such a conditioning matrix has a limit $P_m = [H_n(\hat\theta_n)]^{-1}$. For $P_m \ne [H_n(\hat\theta_n)]^{-1}$, the dynamics are approximated by a VAR(1) instead of a simple AR(1). While the variance of the AR(1) is proportional to the desired $V_m$, up to a simple adjustment, this is generally not the case for the VAR(1).

Once Assumption 3 is granted with $P_m = [H_n(\hat\theta_n)]^{-1}$, the general idea for deriving the limiting distribution of the r nr draws is to ensure that the increasing sum $\theta^\star_b = \hat\theta_n - \gamma\sum_{j=0}^{b-1}(1-\gamma)^j [H_n(\hat\theta_n)]^{-1} G_m^{(b-j)}(\hat\theta_n)$ preserves the convergence of each resampled $G_m^{(b-j)}(\hat\theta_n)$.
The autoregressive nature of $\theta_b$ makes the argument somewhat different from the standard setting, where each resampled minimizer $\hat\theta_m^{(b)}$ can usually be expressed as a function of a single $G_m^{(b)}(\hat\theta_n)$ plus negligible terms. In such cases, distributional statements about $\hat\theta_m^{(b)}$ follow from the convergence of each resampled $G_m^{(b)}$. Here, the increasing sum $\theta^\star_b$ depends on the entire history of the independently resampled $\{G_m^{(b-j)}(\hat\theta_n)\}_{j=0,\ldots,b-1}$, for which we need to prove convergence.

Assumption 4.
Let $P_m$ in Theorem 1 be $[H_n(\hat\theta_n)]^{-1}$. Suppose that $\{\sqrt{m}\,P_m G_m^{(b)}(\hat\theta_n)\}_{b \ge 1}$ has a non-singular variance-covariance matrix, denoted $V_m$. For some $\beta \in (0, 1/2]$ and $\|r_m(\tau)\| \le C_\psi\|\tau\|^\alpha$ with $\alpha > 0$, it holds that for $i = \sqrt{-1}$:
$$E^\star\left(\exp\left[\sqrt{m}\, i\, \tau'(V_m)^{-1/2}[H_n(\hat\theta_n)]^{-1} G_m^{(b)}(\hat\theta_n)\right]\right) = \exp\left(-\frac{\|\tau\|^2}{2}\right)\cdot\left(1 + \frac{r_m(\tau)}{m^\beta}\right).$$

Assumption 4 requires non-degeneracy of the variance-covariance matrix, which is needed for Central Limit Theorems (White, 2000, Theorem 5.3). Assumption 4 provides higher-order conditions to ensure that the bootstrap converges in distribution at a sufficiently fast rate. It can be understood as requiring the resampled data to have an Edgeworth expansion, the first term being the characteristic function of the standard normal distribution. This occurs with $\beta = 1/2$, $\alpha = 1$ for averages of iid data with finite third moments (Lahiri, 2013, Chapters 6.2-6.3). By Assumption 4, the error in the Gaussian approximation of $\sqrt{m}[H_n(\hat\theta_n)]^{-1} G_m^{(b)}(\hat\theta_n)$ depends on $\alpha$ through $r_m(\tau)$ and on $\beta$ through the inflation factor $r_m(\tau)/m^\beta$. These two parameters are of significance because the error in the asymptotic approximation for $\theta^\star_b$ inherits the error in $\sqrt{m}[H_n(\hat\theta_n)]^{-1} G_m^{(b)}$. The following theorem takes as given the validity of bootstrap standard errors, i.e. $\sqrt{m}(V_m)^{-1/2}\left(\hat\theta_n - \theta^0\right) \xrightarrow{d} N(0, I_d)$.

Theorem 2.
Let $\{\theta_b\}$ be generated by Algorithm 2 and suppose that the conditions of Lemmas 1, 2 hold; then Assumption 3 holds with $P_m = [H_n(\hat\theta_n)]^{-1}$. Furthermore, suppose Assumption 4 holds and let $\varphi(\gamma) = \gamma^2/(1-(1-\gamma)^2) = \gamma/(2-\gamma)$; then, as $m, b \to +\infty$ with $\log(m)/b \to 0$,
$$\left(\varphi(\gamma)V_m\right)^{-1/2}\sqrt{m}\left(\theta_b - \hat\theta_n\right) \xrightarrow{d^\star} N(0, I_d).$$

The thrust of the Theorem is that $[H_n(\hat\theta_n)]^{-1} G_m^{(b)}(\hat\theta_n)$ is approximately normal when properly standardized by $(V_m)^{-1/2}$ and scaled by $\sqrt{m}$. The summation in the AR(1) representation (4) preserves this property under the stated assumptions but inflates the variance by a factor $\varphi(\gamma)$, which needs to be adjusted for. As pointed out above, the error in the Gaussian approximation of $\theta^\star_b$ is of the same order as that of $\sqrt{m}[H_n(\hat\theta_n)]^{-1} G_m^{(b)}(\hat\theta_n)$, which depends on $\alpha, \beta$ according to Assumption 4. But $r_m(\tau)$ is inflated by a factor of $(2-\gamma)^\alpha/(1-[1-\gamma]^\alpha)$, which is 1 when $\gamma = 1$ and goes to infinity as $\gamma \to 0$. The Gaussian approximation is thus better for larger $\gamma$.

An implication of Theorems 1 and 2 is that
$$V_{rnr}^{-1/2}\sqrt{n}\left(\theta_{rnr} - \theta^0\right) = V_{rnr}^{-1/2}\sqrt{n}\left(\hat\theta_n - \theta^0\right) + o_{p^\star}(1) \xrightarrow{d} N(0, I_d),$$
where $V_{rnr} = m\,\varphi(\gamma)^{-1}\,\mathrm{var}^\star(\theta_b - \hat\theta_n)$, and a plug-in estimator $\hat V_{rnr}$ is defined in Algorithm 2. This implies that standard errors and quantiles computed from the draws $\theta_b$, after adjusting for $m$ and $\varphi(\gamma)$, can be used to make asymptotically valid inference. Confidence intervals can be constructed to test linear and non-linear hypotheses.

Theorem 2 specializes to r nr, where $P_b = [H_m^{(b+1)}(\theta_b)]^{-1}$. Because the AR(1) representation in (7) does not hold for r gd, simple adjustments cannot be designed that would allow r gd to provide valid inference.
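The variance inflation behind this adjustment is easy to verify by simulation. The toy sketch below (our code; all numbers are arbitrary choices) checks that an AR(1) with coefficient $1-\gamma$ and innovations scaled by $\gamma$ has stationary variance $\varphi(\gamma)$ times the innovation variance:

```python
import numpy as np

# Numerical check of phi(gamma) = gamma^2/(1-(1-gamma)^2) = gamma/(2-gamma):
# an AR(1) with coefficient (1-gamma) and innovation -gamma*e_b, e_b iid N(0, s2),
# has stationary variance phi(gamma)*s2. Toy values throughout.
rng = np.random.default_rng(1)
gamma, s2, B = 0.3, 4.0, 200_000
phi = gamma / (2.0 - gamma)
assert np.isclose(phi, gamma**2 / (1.0 - (1.0 - gamma)**2))

theta, draws = 0.0, np.empty(B)
for b in range(B):
    theta = (1.0 - gamma) * theta - gamma * rng.normal(0.0, np.sqrt(s2))
    draws[b] = theta
print(draws[1000:].var(), phi * s2)          # the two should be close
```

Dividing the centered draws by $\sqrt{\varphi(\gamma)}$ therefore recovers the innovation scale, which is the nature of the adjustment referred to above.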
Furthermore, when $\gamma \in (0,1]$ is small enough that r gd converges, it is not uncommon that $\lambda_{\max}(\Psi(\hat\theta_n)^2) \simeq 1$ because of ill-conditioning, and when $\theta_b$ is very persistent, a much larger $B$ will be required.

Given the Markov chain nature of our $\theta_b$, convergence of the chain can be diagnosed using standard tools from the MCMC literature, such as the convergence diagnostics considered in Gelman and Rubin (1992) and Brooks and Gelman (1998). As seen from (7), the draws $\theta_b$ approximately follow $d$ univariate AR(1) processes with the same persistence parameter $(1-\gamma)$, which is user-chosen. This can be used to gauge the quality of our large sample approximation in the data for a given pair $(\gamma, m)$. We will illustrate this feature below.

It is noteworthy that while the appeal of stochastic optimization is the savings from using $m \ll n$, our r nr requires $m$ not to be too small. This can be seen as the cost of valid inference. Nonetheless, several additional shortcuts could improve the numerical performance of r nr. Our algorithm can be modified so that the Hessian is updated every few iterations rather than at each iteration. The draws would still be valid since the assumptions of Theorem 2 would still hold. The Hessian could also be approximated using quasi-Newton methods, which only require computing gradients. However, as shown in Dennis and Moré (1977) and Nocedal and Wright (2006), the analytical properties of the Hessian approximated by bfgs can only be guaranteed under strong conditions for quadratic objectives. Ren-Pu and Powell (1983) show that the bfgs estimate $P_k$ may not converge to the Hessian even for quadratic objectives. Theoretical guarantees can be given for less popular but more tractable methods such as Broyden's method or the Symmetric Rank-1 (SR1) update (Conn et al., 1991).
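The exact-recovery property that distinguishes SR1 from bfgs on quadratic objectives can be checked in a few lines. The sketch below is a toy example of ours, not the paper's setup: after $d$ linearly independent secant updates on a quadratic, SR1 reproduces the Hessian exactly (up to roundoff), provided the standard denominator safeguard holds:

```python
import numpy as np

# SR1 on a quadratic: with secant pairs (s_k, y_k = Hq s_k) and d linearly
# independent steps, the SR1 approximation equals Hq after d updates.
rng = np.random.default_rng(2)
d = 4
A_ = rng.normal(size=(d, d))
Hq = A_ @ A_.T + d * np.eye(d)               # Hessian of a hypothetical quadratic
Bk = np.eye(d)                               # initial Hessian approximation
for _ in range(d):
    s = rng.normal(size=d)                   # random steps are independent a.s.
    r = Hq @ s - Bk @ s                      # residual y - B s, with y = Hq s
    denom = r @ s
    if abs(denom) > 1e-10:                   # standard SR1 safeguard
        Bk = Bk + np.outer(r, r) / denom     # symmetric rank-1 update
print(np.max(np.abs(Bk - Hq)))               # near machine zero: Hq recovered
```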
These, unfortunately, tend to be less stable than bfgs even in classical optimization. Though an extension of Theorems 1 and 2 to quasi-Newton updating is left to future work, the results based on a resampled bfgs procedure are promising, as will be seen below.

3.4 Relation to Other Bootstrap and Quasi-Bayes Methods

Our algorithm is related to several other fast bootstrap methods. As discussed in the introduction, solving the minimization problem $B$ times can be computationally challenging or infeasible. Some shortcuts have been proposed to generate bootstrap draws for inference at a lower cost. Davidson and MacKinnon (1999) (hereafter dmk) propose an $n$ out of $n$ approximate bootstrap that replaces non-linear estimation on each batch of resampled data by a small number of Newton steps using $\hat\theta_n$ as the starting value. In our notation, they perform the Newton-Raphson update $\theta^{(b)}_{dmk,j+1} = \theta^{(b)}_{dmk,j} - [H_n^{(b)}(\theta^{(b)}_{dmk,j})]^{-1} G_n^{(b)}(\theta^{(b)}_{dmk,j})$, with $\theta^{(b)}_{dmk,0} = \hat\theta_n$ and $j = 0, \ldots, k-1$, for each $b = 1, \ldots, B$, and report the draws $\theta^{(b)}_{dmk,k}$. Armstrong et al. (2014) extend this approach to two-step estimation with a finite-dimensional or non-parametric first-step estimator. Kline and Santos (2012) (hereafter ks) suggest a score bootstrap that uses random weights to perturb the score while holding the Hessian at the sample estimate. If the random weights are $\{\omega_i^{(b)}\}$ with $E[\omega_i] = 0$ and $E[\omega_i^2] = 1$, then the distribution of $[H_n(\hat\theta_n)]^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n \omega_i^{(b)} G_i(\hat\theta_n; y_i, x_i)$ conditional on the data is used to approximate that of $\sqrt{n}(\hat\theta_n - \theta^0)$. The appeal is that the Hessian only needs to be computed once. Honoré and Hu (2017) propose an approach in which, for a class of models, the resampled objective is minimized only in a scalar direction.

The methods above all rely on a preliminary converged estimate $\hat\theta_n$, and hence estimation precedes inference.
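The score bootstrap just described is straightforward to sketch for OLS. The example below is our own toy version (the data-generating process and $B$ are arbitrary; Rademacher weights are one choice satisfying the moment conditions on $\omega_i$); it compares the score-bootstrap standard errors with analytic sandwich standard errors:

```python
import numpy as np

# Score-bootstrap sketch for OLS: perturb the per-observation scores with iid
# Rademacher weights, holding the Hessian fixed at the sample estimate.
rng = np.random.default_rng(3)
n, B = 1000, 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n)
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
scores = X * (X @ theta_hat - y)[:, None]    # G_i(theta_hat), one row per observation
H = X.T @ X / n                              # sample Hessian, computed only once
Hinv = np.linalg.inv(H)

draws = np.empty((B, 2))
for b in range(B):
    w = rng.choice([-1.0, 1.0], size=n)      # weights with E[w]=0, E[w^2]=1
    draws[b] = Hinv @ (scores * w[:, None]).mean(axis=0)
se_ks = draws.std(axis=0)                    # score-bootstrap standard errors

Omega = scores.T @ scores / n
se_sand = np.sqrt(np.diag(Hinv @ Omega @ Hinv / n))  # analytic sandwich benchmark
print(se_ks, se_sand)                        # the two should be close
```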
We compute $\theta_{rnr}$ and an estimate of its sampling uncertainty in the same loop, so no further computation is needed once $\theta_{rnr}$ is available. Under our assumptions, the initialization bias will vanish. The practical implication is that, for $B$ large enough, the initial values of r nr can be far away from the global minimum $\hat\theta_n$, and the algorithm will not be sensitive to the usual stopping criteria used in optimization to find $\hat\theta_n$.

Liang and Su (2019) suggest a 'moment-adjusted' algorithm (masgrad) that, in our notation, updates according to $\gamma_b \to 0$ with $P_b = \mathrm{var}(\sqrt{n}\,G_n(\theta))^{-1/2}$, the asymptotic variance-covariance matrix of the sample gradient. In practice, they recommend evaluating this quantity on the full sample. Under the information matrix equality, $E[G_n(\theta_b)G_n(\theta_b)'] = E[H(\theta_b)]$, so the difference amounts to using $H(\theta_b)^{-1/2}$ instead of $H(\theta_b)^{-1}$. While such a conditioning matrix would result in consistent estimates, it would not provide asymptotically valid bootstrap draws, which require $P_b = H(\theta_b)^{-1}$ and $\gamma$ fixed, as shown in our Theorem 2.

The sgld algorithm proposed in Welling and Teh (2011) updates according to
$$\theta_{b+1} = \theta_b + \frac{\gamma_{b+1}}{2}\left(\nabla\log p(\theta_b) + \frac{n}{m}\sum_{i=1}^m \nabla\log p(x_i^{(b)}|\theta_b)\right) + v_{b+1}, \qquad (11)$$
where $v_b \sim N(0, \gamma_b I_d)$ is an injected noise, $\gamma_b$ satisfies (2), $p(\theta_b)$ is the prior distribution evaluated at $\theta_b$, and $\log p(x_i^{(b)}|\theta_b)$ is the log-likelihood of a resampled observation $x_i^{(b)}$ evaluated at $\theta_b$. The update is thus based on the gradient of the log posterior distribution. Like sgld draws, the draws of our free-lunch bootstrap involve two phases: optimization and sampling. First, in the optimization phase, the shape of the objective function dominates the resampling noise until $\theta_b$ attains a neighborhood of $\hat\theta_n$. Then, in the resampling phase, the noise dominates, and the r nr draws have bootstrap properties.
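The two-phase behavior described above is easy to see on OLS. In the sketch below (our construction; the design and tuning values are arbitrary), the distance of the resampled Newton-Raphson draws to the full-sample estimate first contracts geometrically and then plateaus at the level of the resampling noise:

```python
import numpy as np

# Two phases of the resampled Newton-Raphson draws on OLS: an optimization phase
# (geometric contraction toward theta_hat) followed by a sampling phase (the
# distance hovers at the resampling-noise level). Toy values throughout.
rng = np.random.default_rng(5)
n = m = 1000
gamma = 0.5
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

theta = np.array([10.0, -10.0])              # start far from theta_hat
dist = []
for b in range(300):
    i = rng.integers(0, n, size=m)           # resample one batch with replacement
    Xb, yb = X[i], y[i]
    Gb = Xb.T @ (Xb @ theta - yb) / m        # resampled gradient
    Hb = Xb.T @ Xb / m                       # resampled Hessian
    theta = theta - gamma * np.linalg.solve(Hb, Gb)
    dist.append(np.linalg.norm(theta - theta_hat))
print(dist[0], dist[20], max(dist[50:]))     # big, then small, then stays small
```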
Compared to sgld, the noise is not injected exogenously, and our $\gamma$ is fixed. Welling and Teh (2011) show that, with carefully chosen step sizes $\gamma_b$ and noise variance $\sigma_v^2$, sgld draws can be used for Bayesian inference. Our free-lunch algorithm does not involve any prior, and the goal is frequentist inference, as in the Laplace-type inference proposed in Chernozhukov and Hong (2003) (hereafter CH).

Like CH, our goal is also to simplify the estimation of complex models. CH tackles non-smooth and non-convex objective functions by combining a prior with a transformation of the objective function. In principle, we can also handle non-convex objective functions through regularization, but smoothness is an assumption we need to maintain. CH relies on a Laplace approximation to validate the theory, while we use the idea of coupling. By the nature of the Metropolis-Hastings algorithm, not all CH draws are accepted, and the Markov chain is better described as a threshold autoregressive process. All of our draws are accepted, and they constitute a nonlinear but smooth autoregressive process. Valid quasi-Bayes inference requires the optimal weighting matrix $W_n = [\mathrm{var}(\sqrt{n}\,g_n(\hat\theta_n))]^{-1}$, which needs to be estimated. Continuously updating $W_n(\theta)$ can result in local optima, so that the MCMC chain can take significantly more time to converge. Whether or not convexification is required, our approach does not require a specific weighting matrix.

In terms of tuning parameters, CH requires as input the proposal distribution in the Metropolis-Hastings algorithm and the associated hyper-parameters. Our tuning parameters are confined to the fixed learning rate $\gamma$ and the resampling size $m$, which do not depend on the dimension of $\theta$. The complexity of the problem also affects the two algorithms in different ways. As seen from Lemma 1, nr converges at a dimension-free linear rate of $(1-\gamma_0)$, whereas MCMC converges more slowly as the dimension of $\theta$ increases.
For instance, the number of iterations needed for the random walk Metropolis-Hastings algorithm to converge increases quadratically with the condition number of the Hessian of the log-density and linearly in the dimension $d$ of $\theta$. To alleviate this issue, several samplers exploit gradient information (Roberts and Tweedie, 1996; Girolami and Calderhead, 2011; Neal, 2011; Welling and Teh, 2011). While these methods improve upon random walk Metropolis-Hastings, ill-conditioning can still result in slow convergence. Scaling the proposal using Hessian information can reduce the effect of ill-conditioning but requires a preliminary estimate. See Dwivedi et al. (2019), Table 1, for an overview of mixing times in Metropolis-Hastings algorithms.

This section illustrates the properties of the r nr draws using simulated data and data used in published work. Throughout, we use a burn-in period of $burn = 1 + \mathrm{round}(\log(0.01)/\log(1-\gamma))$, so that the bias is approximately less than 1% of the initialization error $\|\theta_0 - \hat\theta_n\|$ (for $\gamma = 0.5$, this gives $burn = 8$). Additional implementation details are given in Appendix C. The set of $\gamma$ values satisfying the conditions of Lemma 1 is data dependent, but in all simulated and empirical examples, $\gamma \in [0.1, 1]$ performed well.

Example 1: Linear Regression. We simulate data from the linear model with intercept $\beta_0 = 1$, slope $\beta_1 = 1$, $x_i \sim \mathcal{E}(2)$, $e_i \sim t(6)$, and $n = 200$. We set $B = 1000$ plus burn-in draws. Homoskedastic standard errors with a degrees-of-freedom adjustment are computed. Table 2 reports estimates and standard errors for one simulated sample. We consider three values of the batch size, $m = 200, 50, 10$, and for each batch size, three values of the learning rate $\gamma$. The results are denoted r nr$_\gamma$ for $\gamma = 1, 0.5$, and $0.1$. The smaller the $\gamma$, the more persistent are the draws. Thus $\gamma = 0.1$ is a case of extreme persistence, and as seen from the analysis of the linear model, the variance of the draws is larger the smaller $\gamma$ is.

Table 2: OLS: Estimates and Standard Errors for $\beta_1$

              Estimates                          Standard Errors
  m      ols     r nr   r nr.5  r nr.1     ase    boot   r nr   r nr.5  r nr.1
  200    1.230   1.236  1.234   1.234      0.180  0.159  0.164  0.155   0.193
  50     -       1.251  1.241   1.262      -      0.184  0.179  0.187   0.161
  10     -       1.288  1.258   1.296      -      0.255  0.270  0.254   0.205

Remark: Results are reported for one simulated sample of size $n = 200$.

The OLS estimator takes the value $\hat\beta_1 = 1.230$ for this simulated sample. We see from Table 2 that the r nr estimate is very close to OLS when $m = 200$ ($= n$), and the choice of $\gamma$ makes little difference. Theorem 1 suggests that the estimation error should be of order $1/\sqrt{m}$. The larger bias associated with a small $m$ is most visible at $m = 10$, which is only a small fraction of $n$. The difference between the OLS and r nr estimates is then nearly a third of a standard error for $\gamma = 1$. The $m$ out of $n$ bootstrap and r nr standard errors are also less accurate with $m = 10$.

Example 2: MA(1)
Consider the estimation of an MA(1) model by non-linear least squares (nlls). The data are generated as $y_t = \mu + e_t + \psi e_{t-1}$. We set $\mu = 0$, $\psi = 0.8$, $n = 500$, and $B = 2{,}000$. In this example, $Q_n(\theta) = \sum_{t=1}^n e_t(\theta)^2$, where the $e_t(\theta)$ are the nlls filtered residuals computed as described in Appendix C. In estimation, the gradient and Hessian are computed analytically. For the standard bootstrap, we implement a state-space resampling algorithm described in Appendix C. For r nr, we initialize at $\theta_0 = (0, 0)$ with the learning rate set to three values $\gamma_1 > \gamma_2 > \gamma_3$, noting that $\gamma = 1$ was too large to get stable results in this example.

Table 3: MA(1): Estimates of $\psi$ and Standard Errors

              Estimates                                Standard Errors
  m      nlls    r nr(γ₁) r nr(γ₂) r nr(γ₃)    ase    boot   dmk    r nr(γ₁) r nr(γ₂) r nr(γ₃)
  500    0.816   0.825    0.822    0.820       0.026  0.027  0.023  0.025    0.029    0.113
  250    -       0.819    0.819    0.814       -      0.028  -      0.034    0.034    0.081
  50     -       0.805    0.786    0.780       -      0.035  -      0.042    0.040    0.050

Remark: Results are reported for one simulated sample of size $n = 500$.

In this synthetic data, the nlls estimator is $\hat\psi = 0.816$. Table 3 shows that when $m = n$, r nr produces a $\theta_{rnr}$ that is very close to the full-sample nlls estimate for all three values of $\gamma$. As in the OLS example above, the bias and the standard errors are larger when $m$ is smaller, as suggested by Theorem 1. The r nr standard errors are very similar to those obtained by the $m$ out of $n$ bootstrap for all values of $m$. The standard errors are quite poor for the smallest learning rate $\gamma_3$, most likely because of the strong persistence of the draws.

This subsection considers three examples: the first concerns probit estimation of labor force participation, the second is covariance structure estimation of earnings dynamics, and the third is structural estimation of a BLP model.
Application 1: Labor Force Participation
The probit model is of interest because the objective function is strictly convex. To illustrate, we estimate the model for female labor force participation considered in Mroz (1987). The data consist of $n = 753$ observations assumed iid. We set $B = 1000$ and $\gamma = 0.1$. Three values of $m$ are considered: $m = n$ and two smaller values, $m_2 > m_3$. Appendix B provides R code for replicating r nr in this example. There are 8 parameters in this exercise; to conserve space, we only report 4 to give a flavor of the results. Table C1 in the on-line Appendix reports all coefficients. As seen from Table 4, the r nr estimates are close to the MLE ones. Furthermore, the r nr standard errors are close to the bootstrap standard errors. Table 4 also shows results for the resampled bfgs, which is labeled r qn. Evidently, the r qn estimates are similar to the r nr ones but are much faster to compute because the Hessian is not computed directly.

Table 4: Labor Force Participation: Estimates and Standard Errors

  Estimates
             mle      r nr(n)  r nr(m₂) r nr(m₃)  r qn(n)  r qn(m₂) r qn(m₃)
  nwifeinc   -0.012   -0.012   -0.013   -0.014    -0.012   -0.011   -0.012
  educ        0.131    0.132    0.138    0.143     0.131    0.129    0.129
  exper       0.123    0.123    0.124    0.123     0.123    0.124    0.125
  exper2     -0.002   -0.002   -0.002   -0.002    -0.002   -0.002   -0.002

  Standard Errors
             ase     boot    dmk     ks      r nr(n) r nr(m₂) r nr(m₃) r qn(n) r qn(m₂) r qn(m₃)
  nwifeinc   0.005   0.005   0.005   0.005   0.005   0.006    0.005    0.005   0.005    0.005
  educ       0.025   0.026   0.026   0.025   0.025   0.027    0.028    0.027   0.025    0.025
  exper      0.019   0.020   0.019   0.019   0.019   0.020    0.021    0.019   0.018    0.017
  exper2     0.001   0.001   0.001   0.001   0.001   0.001    0.001    0.001   0.001    0.001

Panel (a) of Figure 1 illustrates the behavior of the draws produced by r nr. The dashed red line corresponds to the MLE estimate $\hat\theta_n$. The black line corresponds to r nr draws based on resampling the data with replacement. The blue line shows iterates from classical nr with the same $\gamma = 0.1$. The top left panel shows the first draws in the convergence phase, when the classical nr and the proposed r nr should behave similarly.
While in this example r nr converges after 5 draws, classical nr requires more iterations to achieve convergence. The top right panel plots the subsequent draws. Since convergence is achieved after 5 draws, these draws are in the resampling phase. Evidently, the transition between the convergence phase and the resampling phase of r nr is seamless. The AR(1) coefficient on $\theta_{b,educ}$ based on the converged draws (after discarding the first five) is estimated to be close to, and not significantly different from, $1-\gamma$, as predicted by Lemma 3.

Panel (b) of Figure 1 uses the Mroz (1987) data to further illustrate Lemma 3. We compare the r nr draws with two AR(1) sequences generated according to the coupling theory in (7), i.e. $\theta^\star_{b+1} = \hat\theta_n + (1-\gamma)(\theta^\star_b - \hat\theta_n) - \gamma H_n^{-1} G_m^{(b+1)}(\hat\theta_n)$ with $\theta^\star_0 = \theta_0$. For $m = n$, shown in the left panel, the coupling result is very accurate after the short initial convergence phase, as the two series are nearly indistinguishable. The right panel shows that the coupling distance is noticeably greater when $m = 100$.

Panel (c) of Figure 1 illustrates Theorem 2 by comparing the asymptotic Gaussian distribution with the bootstrap, r nr, dmk, and ks distributions for the education coefficient with $m = n$ and $\gamma = 0.1$. The distribution of the r nr draws is rescaled using the simple adjustment $\theta_{rnr} + \sqrt{m/(n\varphi(\gamma))}\,(\theta_b - \theta_{rnr})$ after discarding a burn-in period of draws. The r nr distribution approximates the bootstrap distribution quite well.

Figure 1: Labor force participation: draws for $\theta_{educ}$

Application 2: Earnings Dynamics
Moffitt and Zhang (2018) estimate earnings volatility using a subsample of males in the Panel Study of Income Dynamics (PSID) dataset between 1970 and 2014. Let $y_{iat}$ denote the earnings of individual $i$ in age group $a$ (between 24 and 54) at time $t$. Earnings are assumed to be the sum of a permanent component $\mu_{ia}$ and a transitory component $\nu_{iat}$:
$$y_{iat} = \alpha_t \mu_{ia} + \beta_t \nu_{iat}, \qquad \mu_{ia} = \mu_{i0} + \sum_{s=1}^a \omega_{is}, \qquad \nu_{ia} = \varepsilon_{ia} + \sum_{s=1}^a \psi_{a,a-s}\,\varepsilon_{is}, \quad \text{for } a \ge 1.$$
In Moffitt and Zhang (2018), the variances are modeled via 11 parameters and estimated by a sequential quadratic programming algorithm (SQP). The Hessian in their example has both positive and negative eigenvalues, suggesting that the solution could be a saddle point. To abstract from identification issues, we estimate $\theta = (\nu, \delta, \gamma_1, \gamma_2)$ and fix the remaining 7 parameters. We specify the conditioning matrix as $P_b = (H_b' H_b)^{-1/2}$ to ensure positive definiteness. Algorithm 2 converges using the starting values from the original paper, with $m = n$, $B = 2000$, a fixed learning rate $\gamma$, and resampling at the age-cohort level. We also consider re-weighting instead of resampling, which is denoted r nr$_w$. Though our theory does not cover resampled quasi-Newton methods, we also use an implementation of bfgs that sets $P_b = (H_{b,bfgs}' H_{b,bfgs})^{-1/2}$, where $H_{b,bfgs}$ is the bfgs approximation of the Hessian matrix, and report the results as r qn.

Table 5 shows that the r nr, r nr$_w$, and r qn estimates are very close to the $\hat\theta_n$ obtained by SQP. The r nr$_w$ standard errors are larger than the r nr ones, which are in turn larger than the bootstrap ones, but the differences are not enough to change the conclusion that all four parameters are statistically different from zero.
However, bootstrap inference for $\hat\theta_n$ is time-consuming, requiring 5h48m to produce 2000 draws, even after the original Matlab code was ported to R and C++ using Rcpp for a substantial speedup in computation time. In contrast, rnr produces estimates and standard errors in 1h4m, and rqn in 38m. The time needed for dmk to produce standard errors is comparable to rnr, while rqn is more comparable to ks. However, both have the overhead of first obtaining $\hat\theta_n$, which entails minimization of the objective function by SQP.

In addition to providing standard errors, a by-product of Algorithm 2 is that the draws can be used for model diagnostics. Gentzkow et al. (2017) provide statistics to assess the sensitivity of the parameter estimates to the assumptions of the model, taking the data as given. Our algorithm takes the model assumptions as given, but takes advantage of resampling to shed light on the sensitivity of the estimates to features of the data themselves.

(Footnote: Specifically, $var(\mu_{i})$ is parameterized by $\nu$, $var(\omega_{ir})$ by two parameters $\delta$, $var(\varepsilon_{ir})$ by two parameters $\gamma$ and $k$, and $\psi_{a,a-r}$ by $\pi$, $\lambda$, and three parameters $\eta$. We set $k = 1$, $\lambda = 5$, and all remaining parameters to zero.)

Table 5: Earnings Dynamics: Estimates and Standard Errors

            Estimates                                    Standard Errors
      $\hat\theta_n$  rnr     rnr_w   rqn     rqn_w      boot   dmk    ks     rnr    rnr_w  rqn    rqn_w
  ν
  δ   -5.768  -5.779  -5.779  -5.767  -5.766             0.050  0.062  0.049  0.063  0.060  0.051  0.049
  γ   -1.839  -1.819  -1.819  -1.841  -1.842             0.083  0.101  0.079  0.094  0.091  0.089  0.082
  γ

[Figure: estimates based on resampled data that exclude one age group at a time. Legend: solid blue: full-sample estimate; black line: rnr draws excluding the age group indicated in parentheses; red dashed line: change in the estimate when that age group is excluded.]

The parameter that is least sensitive to age groups appears to be $\gamma$. The parameter $\nu$ tends to be lower when the age group (29-33) is excluded, while $\gamma$ is higher when the age group (44-48) is excluded. The parameter most sensitive to age is $\gamma$, which is evidently smaller in absolute magnitude when the younger age groups are excluded.
For example, it is -1.2 when theyoungest age group is dropped, but is around -2.0 when the oldest age group is excluded. Application 3: Demand for Cereal
We evaluate the rnr algorithm on the BLP model of Berry et al. (1995) using the cereal data generated in Nevo (2000). The sample consists of market shares for 24 cereal products across 94 markets, for a total of 2,256 market/product observations. This example is of interest because bootstrapping the BLP model is demanding. We use the BLPestimatoR
R package, which builds on C++ functions to evaluate the GMM objective and the analytical gradient. In the BLP framework, parameters on terms that enter linearly can be projected out by 2SLS. Of the remaining parameters, we drop some interaction terms from the original paper that may be difficult to identify. This allows us to focus on coefficients that enter the moment conditions non-linearly, since the BLP procedure requires a fixed-point inversion for these, making the moment conditions costly to evaluate. The parameter dimension including market fixed effects is $d = 33$. To control for possible correlations in the unobservables at the market level, we compute cluster-robust standard errors. See Appendix C for details.

The data consist of market shares $s_{gj}$ in market $g \in \{1,\ldots,94\}$ for product $j \in \{1,\ldots,24\}$ with characteristics matrix $X_{gj}$. To resample at the market level, for each $b = 1,\ldots,B$ we draw markets $g^{(b)}_1,\ldots,g^{(b)}_{94}$ from $\{1,\ldots,94\}$ with replacement and take the associated shares and characteristics $\{s_{g^{(b)}_j}, X_{g^{(b)}_j}\}_{j=1,\ldots,24}$ as observations within each market. Since the number of clusters is relatively small, we only consider $m = n$. We fix $\gamma$ and a burn-in period; additional values of $\gamma$ are considered in Table C2.

Both rnr and rqn deliver estimates similar to those obtained from classical optimization, and the standard errors are similar to the bootstrap ones. However, there is a significant difference in computing time. To generate $B = 1000$ draws, the standard bootstrap requires 4h45m while rnr runs in 1h04m, almost 5x faster. The rqn estimate requires only 13m, about 20x faster than the bootstrap. While dmk is similar to rnr, a preliminary estimate of $\theta$ is needed. Estimation of the AR(1) coefficient for the rnr draws associated with the parameters reported in Table 6 finds estimates
with 95% confidence intervals that always include the value $1-\gamma$ predicted by theory.

For this example we also consider the CH quasi-Bayesian estimator implemented using a random-walk Metropolis-Hastings MCMC algorithm. (Footnote: ks is not reported here because it would require a significant rewriting of the BLPestimatoR package. Furthermore, ks only considers just-identified GMM models, as indicated in their footnote 8.) The prior for each of the parameters
is Gaussian with mean zero, and the inverse of the clustered variance-covariance matrix of the moments evaluated at $\hat\theta_n$ is used as the optimal weighting matrix $W_n$. The Markov chain is initialized at $\theta_0 = \hat\theta_n$, and the proposal in the random-walk step is scaled by $H_n(\hat\theta_n)^{-1/2}/\sqrt n$ times a tuning constant chosen to yield a moderate acceptance rate. Though we generate a large number of draws ($B = 50000$), the Markov chain is strongly persistent and the effective sample size, as defined in Gelman et al. (2013), is small. The CH estimates are further away from $\hat\theta_n$ than the rnr estimates, and the standard errors are also smaller than the m out of n bootstrap, the rnr, or the dmk ones.

Table 6: Demand for Cereal: Estimates and Standard Errors (Random Coefficients)

                       Estimates                          Standard Errors
         $\hat\theta_n$   ch       rnr      rqn           ch      boot    dmk     rnr     rqn
 std dev
 const.   0.284   -0.016    0.263    0.277                0.130   0.129   0.127   0.123   0.105
 price    2.032    2.364    2.188    1.917                0.738   1.198   1.026   0.975   0.880
 sugar   -0.008   -0.013   -0.006   -0.006                0.011   0.017   0.012   0.012   0.010
 mushy   -0.077   -0.248   -0.055   -0.042                0.132   0.177   0.168   0.166   0.154
 income
 const.   3.581    4.414    3.464    3.702                0.453   0.666   0.738   0.714   0.636
 price    0.467   -3.255    1.335   -0.295                2.449   3.829   4.275   4.040   3.603
 sugar   -0.172   -0.195   -0.171   -0.174                0.021   0.028   0.028   0.027   0.025
 mushy    0.690    0.888    0.647    0.702                0.203   0.345   0.346   0.339   0.312
 time                                                     1h50m   4h45m   1h4m    1h8m    13m

Simulation-based estimation is routinely used to analyze structural models with analytically intractable likelihoods. Estimators in this class include the simulated method of moments, indirect inference, and the efficient method of moments; following Forneron and Ng (2016), we refer to them generically as simulated minimum distance (smd) estimators. We now show that rnr still provides valid inference.
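The market-level resampling used in the cereal application above can be sketched as follows (a minimal illustration in Python with NumPy; the data layout and function name are our own, not taken from the BLPestimatoR package):

```python
import numpy as np

def resample_markets(shares, X, rng):
    """Draw one market-level bootstrap sample.

    Markets are drawn with replacement; all products within a drawn
    market are kept together, which preserves within-market dependence
    and matches the use of cluster-robust standard errors.
    """
    markets = list(shares)
    drawn = rng.choice(markets, size=len(markets), replace=True)
    s_b = [shares[g] for g in drawn]
    X_b = [X[g] for g in drawn]
    return drawn, s_b, X_b

# Toy data: 5 markets, 3 products each, 2 characteristics per product.
rng = np.random.default_rng(0)
shares = {g: np.full(3, 1.0 / (g + 4)) for g in range(5)}
X = {g: float(g) * np.ones((3, 2)) for g in range(5)}
drawn, s_b, X_b = resample_markets(shares, X, rng)
```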
Valid rnr inference is useful here because computing standard errors for smd estimates is not always straightforward.

The minimum-distance (md) estimator minimizes the distance between a sample auxiliary statistic $\hat\psi_n$ and its expected value $\psi(\theta)$, and is defined as
\[ \hat\theta_{n,md} = \arg\min_\theta \|g_n(\theta)\|^2_{W_n}, \qquad g_n(\theta) = \hat\psi_n - \psi(\theta) \tag{md} \]
where $W_n$ is a weighting matrix. In cases when the binding function $\psi(\cdot)$ that maps $\theta$ to the auxiliary statistic is tractable, Algorithm 2 provides a convenient way to compute standard errors for $\hat\theta_{n,md}$. When $\psi(\theta)$ is not tractable, smd simulates data $y_{i,s}(\theta)$ for given $\theta$ using iid errors $e_{i,s}$ and estimates $\psi(\theta)$ by $\hat\psi_{n,S}(\theta) = \frac1S\sum_{s=1}^S \hat\psi_{n,s}(y_{n,s}(\theta))$. The estimator is
\[ \hat\theta_{n,smd} = \arg\min_\theta \|g_{n,S}(\theta)\|^2_{W_n}, \qquad g_{n,S}(\theta) = \hat\psi_n - \hat\psi_{n,S}(\theta) \tag{smd} \]
To motivate our simulation-based rnr, consider the exactly identified case, in which $\hat\psi_n - \psi(\hat\theta_{n,md}) = 0$. Note that while the vector of auxiliary statistics $\hat\psi^{(b)}_m$ computed from resampled data satisfies $E^\star(\hat\psi^{(b)}_m) = \hat\psi_n$, the statistic $\hat\psi^{(b)}_{m,S}(\theta)$ computed by smd satisfies $E^\star[\hat\psi^{(b)}_{m,S}(\theta)] = \psi(\theta)$ for all $\theta$. The two results together imply that $E^\star[\hat\psi^{(b)}_m - \hat\psi^{(b)}_{m,S}(\theta)]$, evaluated at $\theta = \hat\theta_{n,md}$, equals $\hat\psi_n - \psi(\hat\theta_{n,md})$, which takes the value of zero as in md estimation. This suggests that the resampled simulation-based objective function
\[ Q^{(b)}_{m,S}(\theta) = \|\hat\psi^{(b)}_m - \hat\psi^{(b)}_{m,S}(\theta)\|^2_{W_n}, \qquad g_{m,S}(\theta) = \hat\psi^{(b)}_m - \hat\psi^{(b)}_{m,S}(\theta) \tag{fl,s} \]
will have the same minimizer as the infeasible md, at least to first order. Let the draws be generated according to $\theta_{b+1,S} = \theta_{b,S} - \gamma P_{b+1,S} G^{(b)}_{m,S}(\theta_{b,S})$ with gradient
\[ G^{(b)}_{m,S}(\theta_{b,S}) = -\partial_\theta \hat\psi^{(b)}_{m,S}(\theta_{b,S})'\, W_n\, \big(\hat\psi^{(b)}_m - \hat\psi^{(b)}_{m,S}(\theta_{b,S})\big). \tag{12} \]
By Theorem 1, the mean $\theta_{rnr,S}$ is consistent for $\hat\theta_{n,md}$.
By implication, $\theta_{rnr,S}$ will also be more efficient than $\hat\theta_{n,smd}$, which we verify in simulations below. To analyze $\theta_{rnr,S}$, we need the following:

Assumption 2.iii$'$. There exist finite constants $C$, $C'$ such that for any $S \geq 1$:
a. $\big[E^\star\big(\|\hat\psi^{(b)}_m - \hat\psi_n\|^2\big)\big]^{1/2} \leq \frac{C}{\sqrt m}$; $\big[E^\star\big(\sup_{\theta\in\Theta}\|\hat\psi^{(b)}_{m,S}(\theta) - \psi(\theta)\|^2\big)\big]^{1/2} \leq \frac{C}{\sqrt{mS}}$
b. $\big[E^\star\big(\|\partial_\theta\hat\psi^{(b)}_{m,S}(\hat\theta_n) - \partial_\theta\psi(\hat\theta_n)\|^2\big)\big]^{1/2} \leq \frac{C}{\sqrt{mS}}$.

Assumption 2.iii$'$ implies Assumption 2.iii, where $G_n(\theta)$ is the gradient of md, by taking the difference $G^{(b)}_{m,S}(\theta) - G_n(\theta)$ and using the Cauchy-Schwarz inequality.

It remains to construct the variance of $\theta_{rnr,S}$. The foregoing analysis would suggest that valid inference follows from the variance adjustment defined in Algorithm 2. However, this is not the case. Intuitively, the estimator $\theta_{rnr,S}$ is consistent for $\hat\theta_{n,md}$, whose variance $V$ does not involve simulation noise. But the quantity $V_{rnr}$ defined in Algorithm 2 presumes the presence of simulation noise in the estimate $\theta_{rnr,S}$ and will give standard errors that are, in general, too large. There are many ways to overcome this problem, and most involve running a second chain of draws in parallel with the one used to compute the estimator. For example, taking the difference of two chains with the same simulated samples would work, as the simulation noise offsets.

Our preferred approach is to use a second chain that directly estimates the variance of $\theta_{rnr,S}$. As in the first chain, this second chain is generated as $\tilde\theta_{b+1,S} = (1-\gamma)\tilde\theta_{b,S} - \gamma P_{b+1,S}\tilde G^{(b)}_{m,S}(\tilde\theta_{b,S})$ as in (7), but with gradient
\[ \tilde G^{(b)}_{m,S}(\tilde\theta_{b,S}) = -\partial_\theta \hat\psi^{(b)}_{m,S}(\tilde\theta_{b,S})'\, W_n\, \big(\hat\psi^{(b)}_m - \hat\psi_n\big). \]
(13)

Compared to the first chain defined by (12), the second chain replaces the simulated auxiliary statistics $\hat\psi^{(b)}_{m,S}(\cdot)$ in the residual by the sample statistics $\hat\psi_n$, which are already computed. As all other quantities involved in computing (13) are taken from (12), the computational overhead of generating $\tilde\theta_{b,S}$ is negligible.

Proposition 1.
Suppose that the assumptions for $Q_n$ and $Q^{(b)}_{m,S}$ in Theorems 1 and 2 hold, with Assumption 2.iii replaced by 2.iii$'$. Let $\hat\theta_{n,md}$ be the infeasible minimum-distance estimator. Let $\{\theta_{b,S}\}$ be a chain generated with $G^{(b)}_{m,S}$ defined as in (12), and $\{\tilde\theta_{b,S}\}$ be generated using $\tilde G^{(b)}_{m,S}$ defined as in (13). Let $\theta_{rnr,S} = \frac1B\sum_{b=1}^B\theta_{b,S}$, $P_{b+1,S} = [H^{(b+1)}_{m,S}(\theta_{b,S})]^{-1}$, and define $V_{rnr,S} = \frac{m}{\phi(\gamma)}\,var^\star(\tilde\theta_{b,S})$. Then for any fixed $S \geq 1$:
i. $\sqrt n\,(\theta_{rnr,S} - \theta_0) = \sqrt n\,(\hat\theta_{n,md} - \theta_0) + o_{p^\star}(1)$.
ii. As $m, b \to +\infty$ with $\log(m)/b \to 0$: $V^{-1/2}_{rnr,S}\sqrt n\,(\theta_{rnr,S} - \theta_0) \overset{d^\star}{\to} N(0, I_d)$.

Forneron and Ng (2016, 2018) show that a weighted average of smd estimates with independent simulation draws constitutes a posterior mean which is asymptotically equivalent to the infeasible md estimator. This requires solving the optimization problem many times (i.e. $S > 1$). Part i. of the Proposition shows that this type of statistical efficiency can be achieved by rnr in a single run (i.e. $S = 1$). Resampling by rnr involves taking draws from the joint distribution $F_n \times F_{shocks}$ to produce $\hat\psi^{(b)}_{m,S}(\theta)$, an estimate of the population mapping $\psi(\theta)$. In practice, the simulation and resampling noise in $\hat\psi^{(b)}_{m,S}(\theta)$ and $\hat\psi^{(b)}_m$ are averaged out, so that the variance of $\theta_{rnr,S}$ does not depend on $S$ asymptotically. This contrasts with the smd estimator $\hat\theta_{n,smd}$, whose simulation noise vanishes only as $S \to \infty$ with $n \to \infty$.

Part ii of the Proposition involves a second sequence $\tilde\theta_{b,S}$ which, as noted above, is used to compute the variance of the estimator. To understand its underpinnings, recall that the sandwich variance for $\hat\theta_{n,md}$ has a meat component that is the variance of the score $-\partial_\theta\psi(\hat\theta_{n,md})'\,W_n\,(\hat\psi_n - \psi(\hat\theta_{n,md}))$.
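Since the two chains share every expensive quantity, the gradients (12) and (13) can be computed together. A minimal sketch (Python/NumPy; the names are illustrative, and for simplicity both gradients are evaluated at a single Jacobian and simulated statistic, which is where the negligible-overhead claim comes from):

```python
import numpy as np

def smd_gradient_pair(jac_sim, psi_sim, psi_boot, psi_n, W):
    """Gradients (12) and (13), sharing the simulated Jacobian.

    jac_sim:  Jacobian of the simulated auxiliary statistics in theta
    psi_sim:  simulated auxiliary statistics psi_hat^(b)_{m,S}(theta)
    psi_boot: resampled auxiliary statistics psi_hat^(b)_m
    psi_n:    full-sample auxiliary statistics psi_hat_n
    W:        weighting matrix W_n
    """
    g_est = -jac_sim.T @ W @ (psi_boot - psi_sim)   # eq. (12): estimation chain
    g_var = -jac_sim.T @ W @ (psi_boot - psi_n)     # eq. (13): variance chain
    return g_est, g_var

# At psi_sim = psi_boot the estimation gradient vanishes, while the
# variance chain is still driven by the bootstrap score psi_boot - psi_n.
jac, W = np.eye(2), np.eye(2)
g_est, g_var = smd_gradient_pair(jac, np.array([1.0, 2.0]),
                                 np.array([1.0, 2.0]), np.array([0.0, 0.0]), W)
```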
If the binding function $\psi$ were tractable, a bootstrap draw of the md score would be $-\partial_\theta\psi(\hat\theta_{n,md})'\,W_n\,(\hat\psi^{(b)}_m - \psi(\hat\theta_{n,md}))$. But this is approximately $-\partial_\theta\hat\psi^{(b)}_{m,S}(\tilde\theta_{b,S})'\,W_n\,(\hat\psi^{(b)}_m - \hat\psi_n)$, which is precisely the gradient (13) used to generate $\tilde\theta_{b,S}$. Hence the second chain provides a correct approximation of the variance of the scores. Though two chains are needed in the case of simulation estimation, only $S = 1$ is required. These arguments are further illustrated using a simple example in Appendix D.

Example 3: Dynamic Panel
Consider the dynamic panel regression:
\[ y_{it} = \alpha_i + \rho\, y_{it-1} + x_{it}'\beta + \sigma_e e_{it}, \]
with $\rho = 0.6$, $\beta = 1$, $\sigma_e = 1$, $x_{it} \sim N(0,1)$, $e_{it} \sim N(0,1)$, $n = 1000$ and $T = 5$. Let $A = I_T - \iota_T\iota_T'/T$, the matrix which computes the time-demeaned $y_{it} - \bar y_i$. The Least-Squares Dummy Variable (LSDV) estimator is obtained by regressing $Ay$ on its lag $Ay_{-1}$ and $Ax$. The LSDV estimator is inconsistent when $n \to \infty$ and $T$ is fixed. Gouriéroux et al. (2010) show that indirect inference, which has an automatic bias-correction property, is consistent for fixed $T$. The idea is to match the sample LSDV estimator $\hat\psi_n = \hat\theta_{n,LSDV}$ with a simulated $\hat\psi_{n,S}(\theta) = \hat\theta^{sim}_{n,LSDV}(\theta)$ using $S \geq 1$ simulated samples.

To generate rnr draws, we resample $(y_{i1},\ldots,y_{iT}, x_{i1},\ldots,x_{iT})_{i=1,\ldots,m}$ with replacement over $i$ for given $m$ and compute the resampled moments $\hat\psi^{(b)}_m = \hat\theta^{(b)}_{m,LSDV}$. Using new simulation draws $e^{(b+1)}_{it}$ at each $b$, we simulate $S \geq 1$ panels:
\[ y^{(b+1)}_{it,s} = \rho\, y^{(b+1)}_{it-1,s} + x^{(b+1)\prime}_{it,s}\beta + \sigma_e e^{(b+1)}_{it,s}, \]
for $t = 1,\ldots,T$ and $i = 1,\ldots,m$, and compute the simulated moments $\hat\psi^{(b)}_{m,S} = \tilde\theta^{(b)}_{m,LSDV}(\theta_b)$. An additional moment is needed to estimate $\sigma_e$; we use the standard deviation of the OLS residuals in the LSDV regression. The gradient and Hessian are computed using finite differences. We illustrate with $m = n$ and a smaller $m$, for $n = 1000$.

The LSDV estimate is 0.329, which is significantly downward biased. The indirect inference (ind) estimator corrects the bias, as shown in Gouriéroux et al. (2010), and the estimates in Table 7 for $S = 1$ bear this out. The rnr estimates are closer to $\theta_0$ than ind for $m = n$, and similar for $m = 50$. The ind estimates with $S = 10$ are very close to the rnr estimates obtained over all $\gamma$, $m$ and $S$, including $S = 1$. This implies that rnr achieves the efficiency of ind with large $S$ using just $S = 1$.
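The fixed-$T$ bias of LSDV that indirect inference corrects is easy to reproduce. A small self-contained simulation (Python/NumPy; the regressor $x_{it}$ is omitted for brevity, so the numbers are illustrative rather than a replication):

```python
import numpy as np

def lsdv_rho(y):
    """Within (LSDV) estimate of rho from an n x (T+1) panel y."""
    ycur, ylag = y[:, 1:], y[:, :-1]
    ycur = ycur - ycur.mean(axis=1, keepdims=True)   # demeaning removes fixed effects
    ylag = ylag - ylag.mean(axis=1, keepdims=True)
    return float((ylag * ycur).sum() / (ylag ** 2).sum())

rng = np.random.default_rng(0)
n, T, rho = 2000, 5, 0.6
alpha = rng.normal(size=n)                            # individual fixed effects
y = np.zeros((n, T + 1))
for t in range(1, T + 1):
    y[:, t] = alpha + rho * y[:, t - 1] + rng.normal(size=n)

rho_lsdv = lsdv_rho(y)   # substantially below the true rho = 0.6 when T is small
```

The downward bias arises because demeaning correlates the transformed lag with the transformed error; it does not vanish as $n \to \infty$ with $T$ fixed.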
The rnr standard errors are smaller than those of the other methods, except for $S = 10$. Results for $S = 2$ and $S = 5$ are reported in Table D1.

Table 7: Dynamic Panel: Estimates of ρ and Standard Errors
(the three rnr columns correspond to different values of $\gamma$)

          Estimates                         Standard Errors
  S   m   ind    rnr    rnr    rnr         ase    boot   dmk    rnr    rnr    rnr
           -    0.589  0.588  0.586         -    0.048    -    0.036  0.039  0.037
           -    0.580  0.588  0.581         -    0.050    -    0.037  0.037  0.024
           -    0.589  0.591  0.587         -    0.034    -    0.034  0.037  0.030
           -    0.586  0.589  0.587         -    0.035    -    0.038  0.031  0.032
Remark: results reported for one simulated sample of size $n = 200$, $T = 5$.

We close the analysis with two remarks about the examples. As noted earlier, an ill-conditioned Hessian can render convergence of gradient-based optimizers slow. The condition ratios $\lambda_{min}(H_n)/\lambda_{max}(H_n)$, evaluated at $\theta = \hat\theta_n$, are very small for the probit, earnings dynamics, and BLP examples. Classical gd should therefore converge slowly in these cases, and the applications bear this out. Second, to reinforce the main result that Algorithm 2 provides valid inference, we evaluate the coverage of rnr in all of the simulated examples considered. As seen from Table 8, rnr delivers close to 5% size in almost all cases. Details are given in Appendix C of the online supplement.

Table 8: Size of Confidence Intervals Across Methods and Examples
(methods: ase, boot, dmk, ks, rnr; designs: OLS with $m = n = 200$ and $m = 50$ for $\beta$; $m = n = 500$ and $m = 250$ for $\mu, \psi$; dynamic panel with $S = 1, 2, 5$, $m = n = 1000$ and $m = 100$ for $\rho, \beta, \sigma$. Results based on simulation replications with $B = 1000$ and burn-in = 45.)

Conclusion
In this paper, we design two algorithms to produce draws that, upon averaging, are asymptotically equivalent to the full-sample estimate produced by a classical optimizer. By using the inverse Hessian as conditioning matrix, the draws of Algorithm 2 immediately provide valid standard errors for inference, hence a free lunch. In problems that require $S$ simulations to approximate the binding function, our algorithm achieves the level of efficiency of smd with a large $S$, but at the cost of $S = 1$. Numerical evaluations show that Algorithm 2 produces accurate estimates and standard errors while running significantly faster than the conventional bootstrap and most of the 'short-cut' methods.

References

Andrews, D. W. K. (2002): "Higher-Order Improvements of a Computationally Attractive k-Step Bootstrap for Extremum Estimators,"
Econometrica , 70:1, 119–162.
Armstrong, T. B., M. Bertanha, and H. Hong (2014): “A fast resample method forparametric and semiparametric models,”
Journal of Econometrics , 179, 128–133.
Berry, S., J. Levinsohn, and A. Pakes (1995): “Automobile Prices in Market Equilib-rium,”
Econometrica , 63, 841.
Bickel, P. J., F. Götze, and W. R. van Zwet (2012): “Resampling fewer than nobservations: gains, losses, and remedies for losses,” in
Selected works of Willem van Zwet ,Springer, 267–297.
Boyd, S. and L. Vandenberghe (2004): Convex Optimization, New York, NY, USA: Cambridge University Press.
Brooks, S. P. and A. Gelman (1998): “General Methods for Monitoring Convergence ofIterative Simulations,”
Journal of Computational and Graphical Statistics , 7, 434–455.
Cameron, A. C., J. B. Gelbach, and D. L. Miller (2011): “Robust inference withmultiway clustering,”
Journal of Business & Economic Statistics , 29, 238–249.
Chatterjee, S., A. S. Hadi, et al. (1986): “Influential observations, high leverage points,and outliers in linear regression,”
Statistical science , 1, 379–393.
Chen, X., O. Linton, and I. Van Keilegom (2003): “Estimation of SemiparametricModels when the Criterion Function Is Not Smooth,”
Econometrica , 71, 1591–1608.
Chernozhukov, V. and H. Hong (2003): “An MCMC Approach to Classical Estimation,”
Journal of Econometrics , 115:2, 293–346.
Conlon, C. and J. Gortmaker (2019): “Best practices for differentiated products de-mand estimation with pyblp,”
Unpublished Manuscript . Conn, A. R., N. I. M. Gould, and P. L. Toint (1991): “Convergence of quasi-Newtonmatrices generated by the symmetric rank one update,”
Mathematical Programming , 50,177–195.
Davidson, R. and J. G. MacKinnon (1999): “Bootstrap Testing in Nonlinear Models,”
International Economic Review , 40, 487–508.
Dennis, Jr, J. E. and J. J. Moré (1977): “Quasi-Newton methods, motivation andtheory,”
SIAM review , 19, 46–89.
Dwivedi, R., Y. Chen, M. J. Wainwright, and B. Yu (2019): “Log-concave sampling:Metropolis-Hastings algorithms are fast,”
Journal of Machine Learning Research , 20, 1–42.
Forneron, J.-J. and S. Ng (2016): “A Likelihood Free Reverse Sampler of the PosteriorDistribution,”
G. Gonzalez-Rivera, R. C. Hill and T.-H. Lee (eds), Advances in Econometrics: Essays in Honor of Aman Ullah, 36, 389–415.
——— (2018): "The ABC of simulation estimation with auxiliary statistics," Journal of Econometrics, 205, 112–139.
Ge, R., F. Huang, C. Jin, and Y. Yuan (2015): "Escaping from saddle points: online stochastic gradient for tensor decomposition," in
Conference on Learning Theory , 797–842.
Gelman, A., J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B.Rubin (2013):
Bayesian data analysis , CRC press.
Gelman, A. and D. B. Rubin (1992): “Inference from Iterative Simulation Using MultipleSequences,”
Statist. Sci. , 7, 457–472.
Gentzkow, M., J. M. Shapiro, and I. Andrews (2017): “Measuring the Sensitivity ofParameter Estimates to Estimation Moments,”
The Quarterly Journal of Economics , 132,1553–1592.
Girolami, M. and B. Calderhead (2011): “Riemann manifold langevin and hamilto-nian monte carlo methods,”
Journal of the Royal Statistical Society: Series B (StatisticalMethodology) , 73, 123–214.
Goodfellow, I., Y. Bengio, and A. Courville (2016): “Deep Learning,”
MIT Press . Gouriéroux, C., P. C. Phillips, and J. Yu (2010): “Indirect inference for dynamicpanel models,”
Journal of Econometrics , 157, 68–77.
Hahn, J. (1996): “A note on bootstrapping generalized method of moments estimators,”
Econometric Theory , 12, 187–197.
Honoré, B. E. and L. Hu (2017): “Poor (Wo)man’s Bootstrap,”
Econometrica , 85, 1277–1301.
Jin, C., R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan (2017): “How toescape saddle points efficiently,” in
Proceedings of the 34th International Conference onMachine Learning-Volume 70 , JMLR. org, 1724–1732.
Kiefer, J. and J. Wolfowitz (1952): “Stochastic estimation of the maximum of a re-gression function,”
The Annals of Mathematical Statistics , 23, 462–466.
Kline, P. and A. Santos (2012): “A Score Based Approach to Wild Bootstrap Inference,”
Journal of Econometric Methods , 1.
Kosorok, M. R. (2007):
Introduction to empirical processes and semiparametric inference ,Springer Science & Business Media.
Lahiri, S. N. (2013):
Resampling methods for dependent data , Springer Science & BusinessMedia.
Liang, T. and W. Su (2019): “Statistical Inference for the Population Landscape viaMoment-Adjusted Stochastic Gradients,”
Journal of the Royal Statistical Society, SeriesB , 81-Part 2, 431–456.
Łojasiewicz, S. (1963): “A topological property of real analytic subsets,”
Coll. du CNRS,Les équations aux dérivées partielles , 117, 87–89.
Moffitt, R. and S. Zhang (2018): “Income Volatility and the PSID: Past Research andNew Results,”
AEA Papers and Proceedings, 108, 277–80.
Moulines, E. and F. R. Bach (2011): "Non-asymptotic analysis of stochastic approximation algorithms for machine learning," in
Advances in Neural Information ProcessingSystems , 451–459.
Mroz, T. (1987): “The Sensitivity of an Empirical Model of Married Women’s Hours ofWork to Economic and Statistical Assumptions,”
Econometrica , 55, 765–99.
Neal, R. (2011): “MCMC using Hamiltonian dynamics,”
Handbook of markov chain montecarlo , 2, 2.
Nesterov, Y. (1983): “A method for unconstrained convex minimization problem with therate of convergence o(1/k2),”
Doklady ANSSSR , 543âĂŞ 547.——— (2018):
Lectures on convex optimization , vol. 137, Springer.
Nevo, A. (2000): “A practitioner’s guide to estimation of random-coefficients logit modelsof demand,”
Journal of economics & management strategy , 9, 513–548.
Newey, W. and D. McFadden (1994): “Large Sample Estimation and Hypothesis Test-ing,” in
Handbook of Econometrics , North Holland, vol. 36:4, 2111–2234.
Nocedal, J. and S. Wright (2006):
Numerical Optimization, Springer, second ed.
Polyak, B. T. (1963): “Gradient methods for minimizing functionals,”
Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3, 643–653.
——— (1964): "Some Methods for Speeding Up the Convergence of Iteration Methods," USSR Computational Mathematics and Mathematical Physics, 4:1, 1–17.
Polyak, B. T. and A. B. Juditsky (1992): “Acceleration of stochastic approximation byaveraging,”
SIAM Journal on Control and Optimization , 30, 838–855.
Ren-Pu, G. and M. J. Powell (1983): “The convergence of variable metric matrices inunconstrained optimization,”
Mathematical programming , 27, 123.
Robbins, H. and S. Monro (1951): “A Stochastic Approximation Method,”
The Annalsof Mathematical Statistics , 22, 400–407.
Roberts, G. and R. Tweedie (1996): “Exponential convergence of Langevin distributionsand their discrete approximations,”
Bernoulli , 2, 341–363.
Roodman, D., M. Ø. Nielsen, J. G. MacKinnon, and M. D. Webb (2019): “Fast andwild: Bootstrap inference in Stata using boottest,”
The Stata Journal , 19, 4–60.
Ruppert, D. (1988): “Efficient estimators from a slowly convergent Robbins-Monro proce-dure,”
School of Oper. Res. and Ind. Eng., Cornell Univ., Ithaca, NY, Tech. Rep , 781.
Stoffer, D. S. and K. D. Wall (2004): “Resampling in state space models,”
State Spaceand Unobserved Component Models Theory and Applications , 227, 258. van der Vaart, A. W. and J. A. Wellner (1996):
Weak Convergence and Empirical Processes, Springer Series in Statistics, New York, NY: Springer.
Welling, M. and Y. W. Teh (2011): "Bayesian Learning via Stochastic Gradient Langevin Dynamics,"
Proceedings of the 28th International Conference on Machine Learn-ing , 681–688.
White, H. (2000):
Asymptotic theory for econometricians , Academic press.
Wolpert, D. H. and W. G. Macready (1997): “No Free Lunch Theorems for Optimiza-tion,”
IEEE Transactions on Evolutionary Computation, 1:1, 67–82.

Appendix A
A.1 Derivations for the Least-Squares Example
In this example, $y_n = X_n\hat\theta_n + \hat e_n$, $Q_n(\theta) = \frac{1}{2n}(y_n - X_n\theta)'(y_n - X_n\theta)$, $H_n = X_n'X_n/n$, $G_n = -X_n'\hat e_n/n$, and $Q^{(b)}_m(\theta) = \frac{1}{2m}(y^{(b)}_m - X^{(b)}_m\theta)'(y^{(b)}_m - X^{(b)}_m\theta)$, $H_b = X^{(b)\prime}_m X^{(b)}_m/m$, $G_b(\theta) = -X^{(b)\prime}_m[y^{(b)}_m - X^{(b)}_m\theta]/m$. Let $\hat\theta^{(b)}_m = (X^{(b)\prime}_m X^{(b)}_m)^{-1}X^{(b)\prime}_m y^{(b)}_m$ be the m out of n bootstrap estimate. Orthogonality of the least-squares residuals will be used repeatedly.

Gradient Descent: $\theta_{k+1} = \theta_k - \gamma[-X_n'(y_n - X_n\theta_k)/n]$. Subtract $\hat\theta_n$ on both sides and note that $y_n = X_n\hat\theta_n + \hat e_n$ (full-sample estimates); then
\[ \theta_{k+1} - \hat\theta_n = \theta_k - \hat\theta_n - \gamma\big[-X_n'(X_n\hat\theta_n + \hat e_n - X_n\theta_k)/n\big] = \theta_k - \hat\theta_n - \gamma H_n(\theta_k - \hat\theta_n) + \gamma X_n'\hat e_n/n = (I - \gamma H_n)(\theta_k - \hat\theta_n), \]
since $X_n'\hat e_n = 0$.

Newton-Raphson: $\theta_{k+1} = \theta_k - \gamma[H_n]^{-1}[-X_n'(y_n - X_n\theta_k)/n]$. Subtracting $\hat\theta_n$ on both sides:
\[ \theta_{k+1} - \hat\theta_n = \theta_k - \hat\theta_n - \gamma H_n^{-1}\big(-[X_n'X_n/n][\hat\theta_n - \theta_k] + X_n'\hat e_n/n\big) = (1-\gamma)(\theta_k - \hat\theta_n), \]
since $X_n'\hat e_n = 0$.

Stochastic Gradient Descent: $\theta_{b+1} = \theta_b - \gamma_b\big[-X^{(b)\prime}_m(y^{(b)}_m - X^{(b)}_m\theta_b)/m\big]$. Thus
\[ \theta_{b+1} - \hat\theta_n = \theta_b - \hat\theta_n - \gamma_b\big[-X^{(b)\prime}_m(y^{(b)}_m - X^{(b)}_m\hat\theta_n - X^{(b)}_m[\theta_b - \hat\theta_n])/m\big] = (I - \gamma_b H_b)(\theta_b - \hat\theta_n) + \gamma_b X^{(b)\prime}_m(y^{(b)}_m - X^{(b)}_m\hat\theta_n)/m = (I - \gamma_b H_b)(\theta_b - \hat\theta_n) - \gamma_b G_b(\hat\theta_n). \]

Resampled Gradient Descent: $\theta_{b+1} = \theta_b - \gamma\big[-X^{(b)\prime}_m(y^{(b)}_m - X^{(b)}_m\theta_b)/m\big]$. Subtract $\hat\theta_n$ on both sides and note that $y^{(b)}_m = X^{(b)}_m\hat\theta^{(b)}_m + \hat e^{(b)}_m$ (bootstrap estimates). Then
\[ \theta_{b+1} - \hat\theta_n = \theta_b - \hat\theta_n - \gamma\big[-X^{(b)\prime}_m(X^{(b)}_m[\hat\theta^{(b)}_m - \hat\theta_n] + \hat e^{(b)}_m - X^{(b)}_m[\theta_b - \hat\theta_n])/m\big] = (I - \gamma H_b)(\theta_b - \hat\theta_n) + \gamma H_b(\hat\theta^{(b)}_m - \hat\theta_n), \]
since $X^{(b)\prime}_m\hat e^{(b)}_m = 0$.

Resampled Newton-Raphson: $\theta_{b+1} = \theta_b - \gamma[H_b]^{-1}\big[-X^{(b)\prime}_m(y^{(b)}_m - X^{(b)}_m\theta_b)/m\big]$. Then
\[ \theta_{b+1} - \hat\theta_n = \theta_b - \hat\theta_n - \gamma[H_b]^{-1}\big[-X^{(b)\prime}_m(X^{(b)}_m[\hat\theta^{(b)}_m - \hat\theta_n] + \hat e^{(b)}_m - X^{(b)}_m[\theta_b - \hat\theta_n])/m\big] = (1-\gamma)(\theta_b - \hat\theta_n) + \gamma(\hat\theta^{(b)}_m - \hat\theta_n), \]
since $X^{(b)\prime}_m\hat e^{(b)}_m = 0$.

A.2 Proof of Lemma 3
Note first that, by construction,
\[ \gamma\big(P_b G^{(b+1)}_m(\theta_b) - P_m G^{(b+1)}_m(\hat\theta_n)\big) = \gamma P_m H_n(\hat\theta_n)[\theta_b - \hat\theta_n] + \gamma P_m\big(G^{(b+1)}_m(\theta_b) - G^{(b+1)}_m(\hat\theta_n) - H_n(\hat\theta_n)[\theta_b - \hat\theta_n]\big) \tag{A.1} \]
\[ \qquad + \gamma\big(P_b - P_m\big)\big(G^{(b+1)}_m(\theta_b) - G^{(b+1)}_m(\hat\theta_n)\big). \tag{A.2} \]
From the definitions of $\theta_b$ and $\theta^\star_b$, the difference can be expressed as:
\begin{align*} \theta_{b+1} - \theta^\star_{b+1} &= \big(\theta_b - \gamma P_b G^{(b+1)}_m(\theta_b)\big) - \big(\hat\theta_n + \Psi(\hat\theta_n)(\theta^\star_b - \hat\theta_n) - \gamma P_m G^{(b+1)}_m(\hat\theta_n)\big) \\ &= \Psi(\hat\theta_n)(\theta_b - \theta^\star_b) + (I_d - \Psi(\hat\theta_n))(\theta_b - \hat\theta_n) - \gamma\big(P_b G^{(b+1)}_m(\theta_b) - P_m G^{(b+1)}_m(\hat\theta_n)\big) \\ &= \Psi(\hat\theta_n)(\theta_b - \theta^\star_b) + \gamma P_m H_n(\hat\theta_n)[\theta_b - \hat\theta_n] - \gamma\big(P_b G^{(b+1)}_m(\theta_b) - P_m G^{(b+1)}_m(\hat\theta_n)\big) \\ &= \Psi(\hat\theta_n)(\theta_b - \theta^\star_b) - (A.1) - (A.2), \end{align*}
where the third equality follows from the fact that $I_d - \Psi(\hat\theta_n) = \gamma P_m H_n(\hat\theta_n)$. By Assumption 2 i. and vi. as well as Lemma 2,
\[ E^\star(\|(A.1)\|) \le \gamma\lambda_P C\, E^\star\big(\|\theta_b - \hat\theta_n\|^2\big) \le \gamma\lambda_P C\Big((1-\gamma)^{b+2}d_{0,n}^2 + \frac{C\gamma}{m}\Big). \]
By Assumptions 2 ii., 3 ii., Lemma 2, the mean-value theorem, and the Cauchy-Schwarz inequality,
\[ E^\star(\|(A.2)\|) \le \gamma\big[E^\star\big(\|P_b - P_m\|^2\big)\big]^{1/2}\Big[E^\star\Big(\|H^{(b+1)}_m(\tilde\theta_b)(\theta_b - \hat\theta_n)\|^2\Big)\Big]^{1/2} \le \gamma\lambda_H C\Big(\rho^b d_{0,n} + \frac{1}{\sqrt m}\Big)\Big((1-\gamma)^{b+1}d_{0,n} + \frac{C\gamma}{\sqrt m}\Big), \]
where $\tilde\theta_b$ is some intermediate value between $\theta_b$ and $\hat\theta_n$, and the upper bound is stated in terms of $\rho$ to simplify notation.

The two bounds lead to the following recursion for the coupling distance:
\begin{align*} E^\star\big(\|\theta_{b+1} - \theta^\star_{b+1}\|\big) &\le \rho\, E^\star(\|\theta_b - \theta^\star_b\|) + E^\star(\|(A.1)\|) + E^\star(\|(A.2)\|) \\ &\le \rho\, E^\star(\|\theta_b - \theta^\star_b\|) + C_6\Big(\rho^b[d_{0,n} + d_{0,n}^2] + \frac1m\Big) \le \frac{C_6}{1-\rho}\Big(\rho^b[d_{0,n} + d_{0,n}^2] + \frac1m\Big), \end{align*}
where $C_6$ is a constant which depends on the terms used to bound (A.1) and (A.2). Recall that $\theta_0 = \theta^\star_0$, so that the coupling distance is zero for $b = 0$. Putting $C = C_6/(1-\rho)$ proves the desired result.

A.3 Proof of Theorem 1
To bound $E^\star\big(\|\theta^\star_{re} - \hat\theta_n\|\big)$, we use the recursive representation of (7) and take the average:
\begin{align*} \theta^\star_{re} - \hat\theta_n = \frac1B\sum_{b=1}^B \Psi(\hat\theta_n)^b(\theta_0 - \hat\theta_n) &- \frac{\gamma}{B}\sum_{b=1}^B\sum_{j=0}^{b-1}\Psi(\hat\theta_n)^j P_m E^\star\big(G^{(b-j)}_m(\hat\theta_n)\big) \\ &- \frac{\gamma}{B}\sum_{b=1}^B\sum_{j=0}^{b-1}\Psi(\hat\theta_n)^j P_m\underbrace{\big[G^{(b-j)}_m(\hat\theta_n) - E^\star\big(G^{(b-j)}_m(\hat\theta_n)\big)\big]}_{\Delta^{(b-j)}_m(\hat\theta_n)}. \end{align*}
Assumption 3 i. implies that $\|\Psi(\hat\theta_n)^b(\theta_0 - \hat\theta_n)\| \le \rho^b\|\theta_0 - \hat\theta_n\|$, so the first term is less than $\frac{d_{0,n}}{(1-\rho)B}$ in expectation. Consider now the second term. By Assumption 2 iv., $\|\Psi(\hat\theta_n)^j P_m E^\star(G^{(b-j)}_m(\hat\theta_n))\| \le \rho^j\lambda_P\frac{C'}{m}$, so the second term is less than $\frac{\gamma\lambda_P C'}{(1-\rho)m}$. For the third term, with $\Delta^{(b-j)}_m(\hat\theta_n)$ defined above, we have by conditional independence:
\begin{align*} \Big[E^\star\Big\|\frac{\gamma}{B}\sum_{b=1}^B\sum_{j=0}^{b-1}\Psi(\hat\theta_n)^j P_m\Delta^{(b-j)}_m(\hat\theta_n)\Big\|^2\Big]^{1/2} &= \Big[E^\star\Big\|\frac{\gamma}{B}\sum_{b=1}^B\sum_{j=0}^{B-b}\Psi(\hat\theta_n)^j P_m\Delta^{(b)}_m(\hat\theta_n)\Big\|^2\Big]^{1/2} \\ &= \frac{\gamma}{\sqrt{mB}}\Big[\frac1B\sum_{b=1}^B E^\star\Big\|\sum_{j=0}^{B-b}\Psi(\hat\theta_n)^j\sqrt m\, P_m\Delta^{(b)}_m(\hat\theta_n)\Big\|^2\Big]^{1/2} \\ &\le \frac{\gamma\lambda_P}{(1-\rho)\sqrt{mB}}\Big[\sup_{1\le b\le B} E^\star\big\|\sqrt m\,\Delta^{(b)}_m(\hat\theta_n)\big\|^2\Big]^{1/2} \le \frac{\gamma\lambda_P[C + C']}{(1-\rho)\sqrt{mB}}, \end{align*}
where the first inequality follows from the average being less than the sup, combined with $\|\Psi(\hat\theta_n)^j P_m\Delta^{(b)}_m(\hat\theta_n)\| \le \rho^j\lambda_P\|\Delta^{(b)}_m(\hat\theta_n)\|$, which is summable over $j \ge 0$. The last inequality uses Assumption 2 iii-iv. Recall that Lemma 3 implies (9), which states that $E^\star\big(\|\theta_{re} - \theta^\star_{re}\|\big) \le \frac{C}{1-\rho}\big(\frac1m + \frac{d_{0,n} + d_{0,n}^2}{B}\big)$. Now, putting everything together, we have:
\[ E^\star\big(\|\theta_{re} - \hat\theta_n\|\big) \le E^\star\big(\|\theta_{re} - \theta^\star_{re}\|\big) + E^\star\big(\|\theta^\star_{re} - \hat\theta_n\|\big) \le C\Big(\frac1m + \frac{d_{0,n} + d_{0,n}^2}{B} + \frac{1}{\sqrt{mB}}\Big), \]
which is $o(n^{-1/2})$ when $\frac{\sqrt n}{\min(m,B)} \to 0$ and $d_{0,n} = O(1)$.

A.4 Proof of Theorem 2
The property that $P_m = [H_n(\hat\theta_n)]^{-1}$ when $P_b = [H^{(b+1)}_m(\theta_b)]^{-1}$ is crucial for what is to follow, and it is useful to understand why. Under Assumption 2 vi.,
\[ \Big[E^\star\big(\|I_d - P_b H_n(\hat\theta_n)\|^2\big)\Big]^{1/2} \leq \lambda_P\Big[E^\star\big(\|P_b^{-1} - H_n(\hat\theta_n)\|^2\big)\Big]^{1/2}. \]
Given that $P_b = [H^{(b+1)}_m(\theta_b)]^{-1}$, an application of the triangular inequality, Assumptions 1 ii. and 2 v. together with Lemma 2 give
\[ \Big[E^\star\big(\|P_b^{-1} - H_n(\hat\theta_n)\|^2\big)\Big]^{1/2} = \Big[E^\star\big(\|H^{(b+1)}_m(\theta_b) - H_n(\hat\theta_n)\|^2\big)\Big]^{1/2} \leq \Big[E^\star\big(\|H_n(\theta_b) - H_n(\hat\theta_n)\|^2\big)\Big]^{1/2} + \Big[E^\star\big(\|H^{(b+1)}_m(\theta_b) - H_n(\theta_b)\|^2\big)\Big]^{1/2} \]
\[ \leq C_1\Big[E^\star\big(\|\theta_b - \hat\theta_n\|^2\big)\Big]^{1/2} + \Big[E^\star\Big(\sup_{\theta\in\Theta}\|H^{(b+1)}_m(\theta) - H_n(\theta)\|^2\Big)\Big]^{1/2} \leq (1-\gamma)^b C_1 d_{2,n} + \Big(\frac{C_1 C_2}{\gamma} + C_2\Big)\frac{1}{\sqrt{m}}. \]
This implies that Assumption 3 ii. holds with $C_3 = \max\big(C_1, \frac{C_1 C_2}{\gamma} + C_2\big)$ and $P_m = [H_n(\hat\theta_n)]^{-1}$. Assumption 3 i. automatically holds since we now have $\Psi(\hat\theta_n) = (1-\gamma)I_d$, which has all its eigenvalues in $[0,1)$ for any $\gamma \in (0,1]$.

To prove Theorem 2, we first substitute $\theta_b$ for the linear process $\theta^\star_b$ using:
\[ \frac{\sqrt{m}}{\sqrt{\phi(\gamma)}}(V_m)^{-1/2}(\theta_b - \hat\theta_n) = \frac{\sqrt{m}}{\sqrt{\phi(\gamma)}}(V_m)^{-1/2}(\theta_b - \theta^\star_b) + \frac{\sqrt{m}}{\sqrt{\phi(\gamma)}}(V_m)^{-1/2}(\theta^\star_b - \hat\theta_n). \]
By Lemma 3, $\frac{\sqrt{m}}{\sqrt{\phi(\gamma)}}(V_m)^{-1/2}(\theta_b - \theta^\star_b) = o_{p^\star}(1)$ when $\log(m)/b \to 0$, since this implies $\sqrt{m}\,\gamma^b = \exp\big(b\big[\frac{\log(m)}{2b} + \log(\gamma)\big]\big) \to 0$.
Using the recursion (7), we have:
\[ \frac{\sqrt{m}}{\sqrt{\phi(\gamma)}}(V_m)^{-1/2}(\theta^\star_b - \hat\theta_n) = \frac{\sqrt{m}}{\sqrt{\phi(\gamma)}}(V_m)^{-1/2}(1-\gamma)^b(\theta_0 - \hat\theta_n) - \gamma\sum_{j=0}^{b-1}(1-\gamma)^j\frac{\sqrt{m}}{\sqrt{\phi(\gamma)}}(V_m)^{-1/2}[H_n(\hat\theta_n)]^{-1}G^{(b-j)}_m(\hat\theta_n). \]
Since the $[H_n(\hat\theta_n)]^{-1}G^{(b-j)}_m(\hat\theta_n)$ are independent (conditional on the data) and identically distributed, we have by a convolution argument:
\[ E^\star\Big(\exp\Big(i\tau'\gamma\sum_{j=0}^{b-1}(1-\gamma)^j\frac{\sqrt{m}}{\sqrt{\phi(\gamma)}}(V_m)^{-1/2}[H_n(\hat\theta_n)]^{-1}G^{(b-j)}_m(\hat\theta_n)\Big)\Big) = \prod_{j=0}^{b-1}E^\star\Big(\exp\Big(i\tau'\gamma(1-\gamma)^j\frac{\sqrt{m}}{\sqrt{\phi(\gamma)}}(V_m)^{-1/2}[H_n(\hat\theta_n)]^{-1}G^{(b-j)}_m(\hat\theta_n)\Big)\Big) \]
\[ = \prod_{j=0}^{b-1}\Big[\exp\Big(-\frac{\|\tau\|^2\gamma^2(1-\gamma)^{2j}}{2\phi(\gamma)}\Big)\Big(1 + \frac{r_m(\gamma(1-\gamma)^j\tau/\phi(\gamma))}{m^\beta}\Big)\Big] = \underbrace{\exp\Big(-\frac{\|\tau\|^2\gamma^2[1-(1-\gamma)^{2b}]}{2[1-(1-\gamma)^2]\phi(\gamma)}\Big)}_{=\exp(-\|\tau\|^2/2)(1+o(1))}\ \underbrace{\prod_{j=0}^{b-1}\Big(1 + \frac{r_m(\gamma(1-\gamma)^j\tau/\phi(\gamma))}{m^\beta}\Big)}_{(I)}. \]
To show that the last product is convergent under the stated assumptions, take logs and use the inequality $\frac{x}{1+x} \leq \log(1+x) \leq x$ for $x > -1$. Then
\[ \log(|I|) = \sum_{j=0}^{b-1}\log\Big(1 + \frac{|r_m(\gamma(1-\gamma)^j\tau/\phi(\gamma))|}{m^\beta}\Big) \leq \sum_{j=0}^{b-1}\frac{|r_m(\gamma(1-\gamma)^j\tau/\phi(\gamma))|}{m^\beta} \leq \sum_{j=0}^{b-1}\frac{\|\gamma\tau/\phi(\gamma)\|^\alpha(1-\gamma)^{\alpha j}}{m^\beta} \leq \frac{\|\gamma\tau/\phi(\gamma)\|^\alpha}{[1-(1-\gamma)^\alpha]\,m^\beta}. \]
Note that $\gamma/\phi(\gamma) = 2 - \gamma \geq 1$ for $\gamma \in (0,1]$.
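The constant $\phi(\gamma)$ has a simple interpretation: with $\Psi(\hat\theta_n) = (1-\gamma)I_d$, the draws follow an AR(1) with coefficient $1-\gamma$ and innovation scale $\gamma$, whose stationary variance factor is $\gamma^2/(1-(1-\gamma)^2) = \gamma/(2-\gamma) = \phi(\gamma)$. The following is a quick simulation check of this identity (an illustrative sketch, not from the paper; the value $\gamma = 0.2$ is an arbitrary choice):

```python
import numpy as np

gamma = 0.2
phi = gamma / (2 - gamma)          # = gamma**2 / (1 - (1 - gamma)**2)
rng = np.random.default_rng(0)

# AR(1): theta_b = (1 - gamma) * theta_{b-1} + gamma * Z_b, Z_b ~ N(0, 1)
B = 100_000
theta = np.zeros(B)
z = rng.standard_normal(B)
for b in range(1, B):
    theta[b] = (1 - gamma) * theta[b - 1] + gamma * z[b]

# the long-run variance of the chain should be close to phi(gamma)
var_hat = np.var(theta[1000:])
```

The same geometric sum $\gamma^2\sum_{j\geq 0}(1-\gamma)^{2j}$ is what produces the factor $\phi(\gamma)$ in the characteristic-function computation above.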
Putting everything together, we have:
\[ E^\star\Big(\exp\Big(i\tau'\frac{\sqrt{m}}{\sqrt{\phi(\gamma)}}(V_m)^{-1/2}(\theta_b - \hat\theta_n)\Big)\Big) = \exp\Big(-\frac{\|\tau\|^2}{2}\Big)\Big(1 + O\Big(\frac{\|\tau\|^\alpha(2-\gamma)^\alpha}{m^\beta[1-(1-\gamma)^\alpha]}\Big)\Big), \]
which implies the desired convergence in distribution.

Supplement to "Inference by Stochastic Optimization: A Free-Lunch Bootstrap"
Jean-Jacques Forneron, Serena Ng. May 2020.
This Supplemental Material consists of Appendices C and D to the main text.

Appendix B Implementing rNR in R
To illustrate how rNR is implemented in a real-data setting, we provide detailed commented R code below which estimates a probit model on the Mroz (1987) data. The score function and the header of the rNR routine are:

set.seed(123)

# score (gradient of the probit log-likelihood); yy is the 0/1 participation
# indicator and XX the regressor matrix from the Mroz (1987) data, loaded earlier
# (the function name and accumulation wrapper are reconstructed)
dll_probit <- function(coef, yy, XX) {
  score <- XX %*% coef
  dll <- rep(0, ncol(XX))
  for (i in 1:nrow(XX)) {
    dll <- dll + yy[i] * XX[i, ] * dnorm(score[i]) / pnorm(score[i]) -
      (1 - yy[i]) * XX[i, ] * dnorm(score[i]) / (1 - pnorm(score[i]))
  }
  return(dll)
}

rNR <- function(coef0, learn = 0.1, iter = 500, m = n) {
# generate rNR draws
out_rNR1 = rNR(coef0, learn, b1 + iter_rNR, m1)
out_rNR2 = rNR(coef0, learn, b1 + iter_rNR, m2)
out_rNR3 = rNR(coef0, learn, b1 + iter_rNR, m3)
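For readers outside R, the same resampled Newton-Raphson loop can be sketched in Python. The example below is an illustrative sketch, not the paper's code: the model (a least-squares regression on simulated data) and all names are assumptions. It applies the update $\theta_b = \theta_{b-1} - \gamma[H^{(b)}_m]^{-1}G^{(b)}_m$ on a resampled batch at each iteration, and checks that the average of the post-burn-in draws recovers the full-sample estimate:

```python
import numpy as np

rng = np.random.default_rng(123)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(n)

def rNR(theta0, gamma=0.1, B=2000, m=None):
    """Resampled Newton-Raphson draws for least squares (illustrative)."""
    m = n if m is None else m
    theta = np.array(theta0, dtype=float)
    draws = np.empty((B, theta.size))
    for b in range(B):
        idx = rng.integers(0, n, size=m)    # resample a batch of m rows
        Xb, yb = X[idx], y[idx]
        G = -2 * Xb.T @ (yb - Xb @ theta)   # batch gradient
        H = 2 * Xb.T @ Xb                   # batch Hessian
        theta = theta - gamma * np.linalg.solve(H, G)
        draws[b] = theta
    return draws

draws = rNR(np.zeros(p))
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # full-sample OLS
theta_re = draws[200:].mean(axis=0)                # average after burn-in
```

Because the gradient and Hessian of least squares are available in closed form, each iteration costs a single linear solve; the draws then serve for both the point estimate and the bootstrap standard errors, as in the probit code above.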
Appendix C Additional Empirical and Simulation Results
C.1 Simulated Examples

Example 2: MA(1)
The following provides additional details on computing the estimates found in Table 3. The data generating process is $y_t = \mu + e_t + \psi e_{t-1}$ where the $e_t$ are iid $N(0, \sigma^2)$. For a given $\theta = (\mu, \psi)$, the filtered residuals are computed as $e_t(\theta) = y_t - \mu - \psi e_{t-1}(\theta)$, initialized with $e_0 = 0$. The nlls objective is then $Q_n(\theta) = \sum_{t=1}^{n} e_t(\theta)^2$. To find the gradient of $Q_n$, we compute the Jacobian of $x_t(\theta) = \mu + \psi e_{t-1}(\theta)$:
\[ \nabla x_t(\theta) = \begin{pmatrix} 1 + \psi\frac{de_{t-1}(\theta)}{d\mu} \\ \psi\frac{de_{t-1}(\theta)}{d\psi} + e_{t-1}(\theta) \end{pmatrix}, \]
together with the second-derivative matrix $\nabla^2 x_t(\theta)$, whose entries involve the recursively computed derivatives $de_{t-1}(\theta)/d\mu$ and $de_{t-1}(\theta)/d\psi$. The gradient of $Q_n$ is $G_n(\theta) = -2\sum_{t=1}^{n} e_t(\theta)\nabla x_t(\theta)$, and the first-order condition is $G_n(\hat\theta_n) = 0$. Similarly, the Hessian is $H_n(\theta) = 2\sum_{t=1}^{n}[\nabla x_t(\theta)\nabla' x_t(\theta) - e_t(\theta)\nabla^2 x_t(\theta)]$. The objective is minimized using Newton-Raphson iterations based on the analytical $G_n$, $H_n$. The asymptotic standard errors are computed from the inverse Hessian, based on the information matrix equality.

For the standard bootstrap, we implement a resampling scheme designed for state-space models described in Stoffer and Wall (2004). Given a converged estimate $\hat\theta_n$, compute the filtered $e_t(\hat\theta_n)$. The resampled data is then generated as $y^{(b)}_t = \hat\mu_n + e^{(b)}_t(\hat\theta_n) + \hat\psi_n e^{(b)}_{t-1}(\hat\theta_n)$, where the $e^{(b)}_t(\hat\theta_n)$ are iid draws with replacement taken from $\{\hat e_t(\hat\theta_n)\}_{t=1,\dots,n}$. The resampled nlls objective $Q^{(b)}_n(\theta)$ is then computed and minimized as described above. This procedure is very time-consuming and is implemented in C++ using Rcpp to reduce computation time. Other methods described below are implemented using only R.

To implement dmk, given a converged estimate $\hat\theta_n$, the filtered residuals $e_t(\hat\theta_n)$ and their derivatives, we sample indices $t_{1,b},\dots,t_{n,b}$ with replacement from $\{1,\dots,n\}$ for each $b$ and compute the resampled gradient and Hessian as $G^{(b)}_n = -2\sum_{j=1}^{n} e_{t_{j,b}}(\hat\theta_n)\nabla x_{t_{j,b}}(\hat\theta_n)$ and $H^{(b)}_n = 2\sum_{j=1}^{n}[\nabla x_{t_{j,b}}(\hat\theta_n)\nabla' x_{t_{j,b}}(\hat\theta_n) - e_{t_{j,b}}(\hat\theta_n)\nabla^2 x_{t_{j,b}}(\hat\theta_n)]$. We then generate the draws using one nr iteration: $\theta^{(b)}_{dmk} = \hat\theta_n - [H^{(b)}_n(\hat\theta_n)]^{-1}G^{(b)}_n(\hat\theta_n)$.

To implement rNR with $m \leq n$, sample a block of $m$ observations $(y^{(b)}_1,\dots,y^{(b)}_m) = (y_t, y_{t+1},\dots,y_{t+m-1})$ with $1 \leq t \leq n-m+1$ and compute the filtered residuals $e^{(b)}_t(\theta_{b-1}) = y^{(b)}_t - \mu_{b-1} - \psi_{b-1}e^{(b)}_{t-1}(\theta_{b-1})$ for $t = 1,\dots,m$, where $(\mu_{b-1},\psi_{b-1}) = \theta_{b-1}$ is the previous rNR draw. As above, the filtered residuals are initialized at $e_0 = 0$ and the rNR draws are initialized at $\theta_0 = (0,0)$. Similarly to our implementation of dmk, we then sample indices $t_{1,b},\dots,t_{m,b}$ with replacement from $\{1,\dots,m\}$ and compute the resampled gradient and Hessian $G^{(b)}_m$, $H^{(b)}_m$; the updating equation gives the draws $\theta_b = \theta_{b-1} - \gamma[H^{(b)}_m(\theta_{b-1})]^{-1}G^{(b)}_m(\theta_{b-1})$.

Size of Confidence Intervals in the Simulated Examples

The table below presents the size of confidence intervals across replications in the simulated examples of Section 5. Frequentist confidence intervals (ase) are computed using $\hat\theta_n \pm 1.96\,se(\hat\theta_n)$. Bootstrap confidence intervals for the standard bootstrap (boot), dmk and ks are computed by taking the 0.025 and 0.975 percentiles of the draws $\theta^{(b)}$, except for the dynamic panel as discussed below. For rNR, the confidence intervals are computed by taking the 0.025 and 0.975 percentiles of $\theta_{re} + \sqrt{\frac{m}{n\phi(\gamma)}}(\theta_b - \theta_{re})$, where $\phi(\gamma) = \gamma/(2-\gamma)$.

For the dynamic panel, the standard bootstrap (boot), dmk and ks draws are adjusted so that confidence intervals are computed by taking the 0.025 and 0.975 percentiles of $\hat\theta_{n,smd} + (\theta^{(b)} - \bar\theta_B)$, where $\bar\theta_B$ is the average bootstrap draw.
Without this recentering the confidence intervals display significant size distortions; see Appendix D for a discussion of this recentering. For rNR, we take the 0.025 and 0.975 percentiles of $\theta_{re,S} + \sqrt{\frac{m}{n\phi(\gamma)}}\,\tilde\theta_{b,S}$, where $\tilde\theta_{b,S}$ are the draws from the second chain described in Appendix D and $\theta_{re,S} = \frac{1}{B}\sum_{b=1}^{B}\theta_{b,S}$ after discarding the burn-in draws.

C.2 Empirical Examples

Application 1: Labor Force Participation
The table below presents the estimates and standard errors for all methods and coefficients in the Mroz (1987) application.

Table C1: Labor Force Participation: Estimates and Standard Errors

Estimates
            mle     boot   dmk    ks     rnr     rnr     rnr     rqn     rqn     rqn
wifeinc    -0.012    -      -      -    -0.012  -0.013  -0.014  -0.012  -0.011  -0.012
educ        0.131    -      -      -     0.132   0.138   0.143   0.131   0.129   0.129
exper       0.123    -      -      -     0.123   0.124   0.123   0.123   0.124   0.125
exper2     -0.002    -      -      -    -0.002  -0.002  -0.002  -0.002  -0.002  -0.002
age        -0.053    -      -      -    -0.053  -0.053  -0.055  -0.052  -0.052  -0.052
kidslt6    -0.868    -      -      -    -0.874  -0.892  -0.902  -0.864  -0.855  -0.844
kidsge6     0.036    -      -      -     0.037   0.038   0.041   0.036   0.035   0.032
const.      0.270    -      -      -     0.271   0.216   0.234   0.248   0.256   0.249

Standard Errors
            ase    boot   dmk    ks     rnr     rnr     rnr     rqn     rqn     rqn
wifeinc     0.005  0.005  0.005  0.005  0.005   0.006   0.005   0.005   0.005   0.005
educ        0.025  0.026  0.026  0.025  0.025   0.027   0.028   0.027   0.025   0.025
exper       0.019  0.020  0.019  0.019  0.019   0.020   0.021   0.019   0.018   0.017
exper2      0.001  0.001  0.001  0.001  0.001   0.001   0.001   0.001   0.001   0.001
age         0.008  0.009  0.008  0.008  0.009   0.008   0.009   0.009   0.008   0.008
kidslt6     0.119  0.120  0.118  0.118  0.120   0.119   0.129   0.117   0.113   0.117
kidsge6     0.043  0.046  0.045  0.045  0.045   0.048   0.047   0.044   0.042   0.045
const.      0.509  0.512  0.507  0.505  0.494   0.535   0.544   0.544   0.494   0.506

We also implemented rGD in this application to evaluate its feasibility in a real-data setting. We found that rGD requires a far longer burn-in to converge, and the high persistence of its draws results in a very small effective sample size, while rNR converges quickly and has good mixing properties. This is mainly due to the ill-conditioning of the problem: the ratio $\lambda_{\min}(H_n)/\lambda_{\max}(H_n)$ evaluated at $\theta = \hat\theta_n$ is extremely small, which implies very slow convergence for gd, sgd and rGD.

Application 2: Earnings Dynamics
During the initial convergence phase, some adjustments to the rNR updating equations were required to handle the non-convexity of the objective in the Moffitt and Zhang (2018) application. For rNR and rQN, draws for which the sample objective $Q_n$ increases severalfold or more are discarded, i.e. we only keep $\theta_b$ if $Q_n(\theta_b)$ does not exceed a fixed multiple of $Q_n(\theta_{b-1})$. This never occurs for rNR and rNR$_w$. It occurred twice for rQN and four times for rQN$_w$, but only in the burn-in sample with burn = 50. When a draw is discarded, the bfgs approximation of the Hessian is reset to the Hessian computed using finite differences. These adjustments ensured that rQN converged from the original starting values. For rNR$_w$, we reweight the observations using exponential $E(1)$ draws. For ks, the score is reweighted using draws from the Rademacher distribution. Results presented in Table 5 were computed using cluster nodes with an eight-core 2.6 GHz Intel Xeon E5-2650v2 processor.

Application 3: Demand for Cereal
To side-step possible identification issues, we omit the income∗price interaction as well as the child∗price and age-related coefficients. The results are broadly similar when the child∗price coefficient is included.

The R package BLPestimatoR does not offer a bootstrap option. The 'parametric' bootstrap implemented in the Python pyBLP package of Conlon and Gortmaker (2019) draws from the asymptotic distribution, making Gaussian draws centered at $\hat\theta_n$ with a sandwich variance-covariance matrix. BLPestimatoR implements estimation taking as input a dataset, initial values and a model specification. To implement the standard bootstrap using this package, we simply update the data by resampling at the market level and use the built-in functions to re-estimate, using as initial value the sample estimate $\hat\theta_n$. For dmk, rNR and rQN, the data is updated as described above; built-in functions then provide analytical gradient estimates. The Hessian is computed for dmk and rNR using finite differences.

Table C2 replicates the rNR estimates and standard errors from Section 5 for three different learning rates $\gamma$. The results are similar over a range of $\gamma$, while one of the values considered is less stable and results in large standard errors for the income∗price interaction coefficient.

Table C2: Demand for Cereal: Estimates and Standard Errors for three values of γ

                      Estimates                       Standard Errors
               θ̂_n     rnr     rnr     rnr     boot    rnr     rnr     rnr
st. dev.
  const.       0.284   0.264   0.266   0.260   0.129   0.126   0.127   0.176
  price        2.032   2.191   2.183   2.162   1.198   0.930   1.013   1.689
  sugar       -0.008  -0.006  -0.006  -0.005   0.017   0.011   0.011   0.017
  mushy       -0.077  -0.057  -0.056  -0.057   0.177   0.151   0.163   0.233
income
  const.       3.581   3.475   3.463   3.459   0.666   0.721   0.747   1.451
  price        0.467   1.235   1.360   1.255   3.829   3.744   4.187  16.458
  sugar       -0.172  -0.170  -0.170  -0.166   0.028   0.029   0.028   0.135
  mushy        0.690   0.643   0.634   0.535   0.345   0.353   0.355   1.582

Results presented in Table 5 were computed using cluster nodes with a fourteen-core 2.4 GHz Intel Xeon E5-2680v4 processor. The CH estimates were computed in a different batch job and were assigned at runtime to an eight-core 2.6 GHz Intel Xeon E5-2670 processor.

Appendix D SMD Estimation

Table D1: Dynamic Panel: Estimates of ρ and Standard Errors

          Estimates                     Standard Errors
   ind   rnr    rnr    rnr      ase   boot   dmk   rnr    rnr    rnr
    -    0.589  0.588  0.586     -    0.048   -    0.036  0.039  0.037
    -    0.580  0.588  0.581     -    0.050   -    0.037  0.037  0.024
    -    0.588  0.587  0.589     -    0.041   -    0.033  0.036  0.037
    -    0.588  0.592  0.581     -    0.041   -    0.037  0.038  0.038
    -    0.591  0.589  0.588     -    0.038   -    0.036  0.038  0.037
    -    0.583  0.581  0.581     -    0.038   -    0.037  0.035  0.029
    -    0.589  0.591  0.587     -    0.034   -    0.034  0.037  0.030
    -    0.586  0.589  0.587     -    0.035   -    0.038  0.031  0.032

Remark: Results reported for one simulated sample of size n = 200, T = 5; rows correspond to different (S, m) configurations.

SMD Estimation of a Sample Mean
To illustrate Proposition 1, consider the simple model $y_i \sim N(\theta_0, 1)$. The md estimator of $\theta_0$ is $\hat\theta_{md} = \bar y_n \equiv \hat\psi_n$. For any given $\theta$, let $y^s_i(\theta) = \theta + e^s_i$ where $e^s_i \sim N(0,1)$. The smd estimator is the $\theta$ that equates $\psi(\theta, y^S) = \bar y_{n,S}(\theta)$ to $\hat\psi_n$, and is found to be $\hat\theta_{smd} = \hat\theta_{md} - \bar e_{n,S}$.

The rNR resamples and simulates the binding function to give
\[ \theta_{b+1,S} - \hat\theta_{n,md} = (1-\gamma)(\theta_{b,S} - \hat\theta_{n,md}) + \gamma\big(\hat\theta^{(b)}_{m,md} - \hat\theta_{n,md} - \bar e^{(b)}_{m,S}\big). \]
Note that resampling alone gives $\theta_{b+1} - \hat\theta_{n,md} = (1-\gamma)(\theta_b - \hat\theta_{n,md}) + \gamma(\hat\theta^{(b)}_{m,md} - \hat\theta_{n,md})$. Taking conditional expectations, we have $E^\star(\theta_{b+1,S}) = \hat\theta_{n,md} + (1-\gamma)^{b+1}(\theta_0 - \hat\theta_{n,md})$, so that $E^\star(\theta_{re,S}) = \hat\theta_{n,md} + O(\frac{1}{B})$, as in the OLS example. Furthermore, $var^\star(\theta_{re,S}) = O(\frac{1}{mB} + \frac{1}{mSB})$, where the first term is due to resampling ($\hat\theta^{(b)}_{m,md} - \hat\theta_{n,md}$) and the second is due to simulation noise ($\bar e^{(b)}_{m,S}$). Hence for this example, $\theta_{re,S} = \hat\theta_{n,md} + O_{p^\star}(\frac{1}{B} + \frac{1}{\sqrt{mB}} + \frac{1}{\sqrt{mSB}})$, showing that by averaging over both the resampling and simulation noise, $\theta_{re,S}$ is first-order equivalent to $\hat\theta_{n,md}$ if $\sqrt{n}/\min(m,B) \to 0$, for any $S \geq 1$.

Part ii. of the Proposition involves a second sequence $\tilde\theta_{b,S}$ because the variance of the rNR draws is comprised of two quantities: $var^\star(\hat\theta^{(b)}_m)$ and $var^\star(\bar e^{(b)}_{m,S})$. But
\[ \frac{m}{\phi(\gamma)}var^\star(\theta_{b,S}) = m\,var^\star(\hat\theta^{(b)}_m) + m\,var^\star(\bar e^{(b)}_{m,S}) > m\,var^\star(\hat\theta^{(b)}_m), \]
and as a consequence it overstates the actual sampling uncertainty of $\theta_{re,S}$. Running the second chain in parallel using the same resampled statistic $\{\bar y^{(b)}_m\}_{b=1,\dots,B}$ produces the AR(1) draws $\{\tilde\theta_{b,S}\}$. We can rewrite these draws as:
\[ \tilde\theta_{b+1,S} = (1-\gamma)\tilde\theta_{b,S} + \gamma\big(\hat\theta^{(b)}_{m,md} - \hat\theta_{n,md}\big). \]
This is an AR(1) process that targets the infeasible sampling distribution based on the intractable md objective function.
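The two chains in this example are easy to simulate directly. The sketch below is illustrative Python, not the paper's R code, and the values $\gamma = 0.1$, $n = 1000$, $m = 200$, $S = 1$ are assumptions; it runs the noisy rNR chain and the centered second chain on the same resampled means, and checks that the second chain's variance matches $\phi(\gamma)\,var^\star(\hat\theta^{(b)}_m)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, S, B, gamma = 1000, 200, 1, 50_000, 0.1
phi = gamma / (2 - gamma)

y = rng.standard_normal(n) + 1.0     # data with theta_0 = 1
theta_hat = y.mean()                 # md estimator: the sample mean

chain = np.empty(B)                  # rNR chain, with simulation noise
chain2 = np.empty(B)                 # centered second chain, same resamples
chain[0], chain2[0] = theta_hat, 0.0
for b in range(1, B):
    yb = y[rng.integers(0, n, size=m)].mean()   # resampled mean theta_hat_m^(b)
    eb = rng.standard_normal((m, S)).mean()     # simulation noise e_bar_{m,S}
    chain[b] = (1 - gamma) * chain[b - 1] + gamma * (yb - eb)
    chain2[b] = (1 - gamma) * chain2[b - 1] + gamma * (yb - theta_hat)

# the second chain's variance should be close to phi(gamma) * var*(theta_hat_m),
# where var*(theta_hat_m) = (sample variance of y) / m under resampling
target = phi * y.var() / m
var2 = np.var(chain2[1000:])
```

Because the second chain uses the same resampled means but strips out the simulation noise, its variance isolates the resampling component, which is the quantity needed for the standard errors.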
From the OLS example, we know that
\[ var^\star(\tilde\theta_{b,S}) = \frac{\gamma^2 + o(1)}{1 - [1-\gamma]^2}\,var^\star\big(\hat\theta^{(b)}_{m,md}\big), \]
which is proportional to the desired variance. Hence, $V_{re,S} = \frac{m}{\phi(\gamma)}var^\star(\tilde\theta_{b,S}) = m\,var^\star(\hat\theta^{(b)}_{m,md})(1+o(1))$ yields valid standard errors for $\theta_{re,S}$.

Note also that the smd bootstrap draws $\hat\theta^{(b)}_{m,smd} = \hat\theta^{(b)}_{m,md} - \bar e^{(b)}_{m,S}$ are centered around $\hat\theta_{n,md}$ instead of $\hat\theta_{n,smd}$ because the simulation noise $\bar e^{(b)}_{m,S}$ averages out. Since $E^\star(\hat\theta^{(b)}_{m,smd}) = \hat\theta_{n,md}$, the bootstrap confidence interval must be re-centered around $\hat\theta_{n,smd}$ to have correct size. In contrast with rNR, the variance does not need to be adjusted. In the numerical examples below, the draws were re-centered around $\hat\theta_{n,smd}$. A numerical illustration of this example is given below.

Example 4: Sample Mean
To illustrate that the rNR draws achieve the same efficiency as an smd estimator with $S = \infty$ at the lower computation cost of $S = 1$, we simulate $y_i \sim N(\theta_0, 1)$ with $\theta_0 = 1$ and $n = 1000$. Table D2 illustrates the variance properties of rNR relative to indirect inference, and the size of confidence intervals derived from the quantiles of the draws $\theta_{b,S}$. With $m = 200 < n = 1000$, the variance of rNR is comparable to the method of moments (which has no simulation noise) and to indirect inference with $S = 20$ simulated samples of $n = 1000$ observations. The size of $m$ out of $n$ bootstrap confidence intervals is reported in the last line of the table for each estimator. Size for rNR is again comparable to the method of moments and indirect inference.

Table D2: Mean Estimation: standard deviation and size

         mm     rnr    ind    ind    ind    ind
std      0.031  0.032  0.047  0.035  0.033  0.032
size     0.059  0.056  0.059  0.059  0.044  0.050

Legend: n = 1000; rnr: m = 200, B = 1000; ind_S: indirect inference with S simulated samples, S increasing from 1 to 20 across the four ind columns.
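The rescaling used to form the rNR standard errors in this example can also be sketched directly. The following is illustrative Python, not the paper's replication code, and the values $n = 1000$, $m = 200$, $\gamma = 0.1$, $B = 50{,}000$ are assumptions: the $m$-batch draws are pulled back to the $n$-sample scale by the factor $\sqrt{m/(n\phi(\gamma))}$, after which their standard deviation should match the asymptotic standard error $\hat\sigma/\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(7)
n, m, B, gamma = 1000, 200, 50_000, 0.1
phi = gamma / (2 - gamma)

y = rng.standard_normal(n) + 1.0     # data with theta_0 = 1
theta_hat = y.mean()

draws = np.empty(B)
draws[0] = theta_hat
for b in range(1, B):
    yb = y[rng.integers(0, n, size=m)].mean()   # resampled batch mean
    draws[b] = (1 - gamma) * draws[b - 1] + gamma * yb

burn = 1000
theta_re = draws[burn:].mean()
# rescale the m-batch draws to the n-sample standard error
se_rnr = np.sqrt(m / (n * phi)) * draws[burn:].std()
se_asy = y.std() / np.sqrt(n)
```

The chain's stationary standard deviation is roughly $\sqrt{\phi(\gamma)\hat\sigma^2/m}$, so multiplying by $\sqrt{m/(n\phi(\gamma))}$ recovers $\hat\sigma/\sqrt{n}$; the same rescaling underlies the quantile-based confidence intervals reported in Table D2.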