Estimation and Inference by Stochastic Optimization: Three Examples
By Jean-Jacques Forneron∗ and Serena Ng†

∗ Department of Economics, Boston University, 270 Bay State Rd, Boston, MA 02215. Email: [email protected]
† Department of Economics, Columbia University and NBER, 420 W. 118 St. MC 3308, New York, NY 10027. Email: [email protected]. Support from the National Science Foundation (SES 1558623, 2018368) is gratefully acknowledged. The authors would like to thank Elie Tamer for useful comments and suggestions.

Repeated optimizations required to compute parameter estimates and bootstrap standard errors of complex models can be computationally burdensome. In Forneron and Ng (2020), we design a resampled Newton-Raphson algorithm (rnr) that provides consistent estimates and valid standard errors in one run of the optimizer. The key insight is that the algorithm serves as a resampling device to produce a Markov chain of iterates with desirable properties. In this paper, we illustrate that rnr can speed up BLP estimation from almost five hours using the standard (n out of n) bootstrap to just over an hour, and that the time can be reduced further to fifteen minutes using a resampled quasi-Newton (rqn) algorithm that does not directly compute the Hessian. A Monte-Carlo exercise using Probit IV regressions shows that rnr and rqn provide accurate estimates and coverage. The appeal of the proposed approach goes beyond faster computation. A resampling-based indirect inference estimator not only produces standard errors easily, but is also more efficient than one obtained by classical optimization. This is illustrated by a dynamic panel model example.

I. The Setup
Many economic applications entail minimizing a sample objective function Q_n(θ) with respect to a vector of parameters θ to obtain an estimate θ̂_n = argmin_θ Q_n(θ). Under regularity conditions, θ̂_n is √n-consistent for the true value θ† and (V†)^{−1/2} √n (θ̂_n − θ†) →_d N(0, I_d).
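For reference, V† here is the usual sandwich variance from M-estimation. A one-line statement of that standard identity, with H† denoting the probability limit of the Hessian and Ω† the asymptotic variance of the score √n G_n(θ†) (our notation, not the paper's), is:

```latex
% Sandwich form of the asymptotic variance (standard M-estimation identity):
% both the Hessian H^\dagger and the score variance \Omega^\dagger enter,
% which is why inference requires the gradient and the Hessian below.
V^\dagger = (H^\dagger)^{-1}\,\Omega^\dagger\,(H^\dagger)^{-1}.
```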
The sandwich variance V† required for inference depends on both the gradient and the Hessian, which are often analytically intractable. Bootstrap inference approximates the asymptotic distribution but requires repeated optimization each time a batch of data of size n is resampled. Alternatives are available to speed up computation, but they still necessitate a preliminary estimate θ̂_n.

The Newton-Raphson algorithm computes θ̂_n by iterating until convergence:

θ_{k+1} = θ_k − γ_k [H_n(θ_k)]^{−1} G_n(θ_k),

where γ_k is a learning rate, G_n(θ_k) is the gradient, and the conditioning matrix is set to the inverse of the Hessian H_n(θ_k), so that [H_n(θ_k)]^{−1} G_n(θ_k) determines the direction of the update. In Forneron and Ng (2020), we propose a novel resampled Newton-Raphson algorithm (rnr) that produces an estimate of θ and its standard errors in one run of the optimizer.

Algorithm rnr

Inputs: (a) initial guess θ_0; (b) bootstrap sample size B and burn-in period burn; (c) batch size m ≤ n; and (d) fixed learning rate γ ∈ (0, 1].

Resample:
For b = 1, ..., burn + B:
a. Resample a (b+1)-th batch of data of size m,
b. Update H_b = H_m^{(b+1)}(θ_b) and G_b = G_m^{(b+1)}(θ_b),
c. Update θ_{b+1} = θ_b − γ H_b^{−1} G_b.

Outputs:
Discard the first burn draws. Let φ(γ) = γ^{−2}(1 − (1 − γ)^2) = (2 − γ)/γ and output:
a. θ̄_rnr = (1/B) Σ_{b=1}^{B} θ_b,
b. V_rnr = m φ(γ) var̂(θ_b), where var̂(θ_b) = (1/B) Σ_{b=1}^{B} (θ_b − θ̄_rnr)(θ_b − θ̄_rnr)′.

The main idea of rnr is to combine estimation with inference by exploiting the randomness due to resampling within the optimizer. Each (b+1)-th sample consists of m observations drawn randomly from the original data.¹ Then Q_m^{(b+1)}(θ) is evaluated, and its gradient G_m^{(b+1)}(θ_b) and Hessian H_m^{(b+1)}(θ_b) are used to update θ_b to θ_{b+1}. The estimator is computed by taking the mean over draws after discarding the first burn iterates to reduce the impact of the initial guess θ_0. Standard errors are obtained from the draws after a sample-size and scale adjustment of √(m φ(γ)/n). Like MCMC, inference is sampling-based, but the approach is fully frequentist. Unlike other bootstrap shortcuts,² our approach does not require a preliminary estimate θ̂_n.

¹ For instance, iid data can be resampled at the individual level with replacement. Clustered data, as in Example 1 below, should be resampled at the cluster level with replacement.
² See e.g. Davidson and MacKinnon (1999), Andrews (2002), Kline and Santos (2012), Honoré and Hu (2017).
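To make Algorithm rnr concrete, the following is a minimal R sketch for a toy least-squares problem where the gradient and Hessian are available in closed form. The data-generating process, tuning values, and object names are illustrative assumptions, not the settings or code used in the paper.

```r
set.seed(123)
n <- 1000; d <- 2
X <- cbind(1, rnorm(n))                    # design: intercept + one regressor
y <- X %*% c(1, 2) + rnorm(n)              # toy data with "true" theta = (1, 2)

m <- 200; B <- 2000; burn <- 50; gamma <- 0.1
theta <- rep(0, d)                         # initial guess theta_0
draws <- matrix(NA_real_, B, d)

for (b in 1:(burn + B)) {
  idx <- sample.int(n, m, replace = TRUE)  # step a: resample a batch of size m
  Xb  <- X[idx, , drop = FALSE]; yb <- y[idx]
  G   <- -crossprod(Xb, yb - Xb %*% theta) / m   # step b: gradient of Q_m
  H   <- crossprod(Xb) / m                       #         Hessian of Q_m
  theta <- theta - gamma * solve(H, G)           # step c: Newton-Raphson update
  if (b > burn) draws[b - burn, ] <- theta       # keep post-burn-in draws
}

phi       <- (2 - gamma) / gamma           # phi(gamma) from the output step
theta_rnr <- colMeans(draws)               # output a: point estimate
V_rnr     <- m * phi * var(draws)          # output b: variance estimate
se_rnr    <- sqrt(diag(V_rnr) / n)         # standard errors for theta_hat_n
```

A single pass over the resampled batches thus delivers both the estimate and its standard errors; no preliminary θ̂_n is computed at any point.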
Evaluating the direction of change using small batches of data is in the spirit of stochastic optimization, but there are two important differences. First, while the learning rate γ_b in stochastic optimization declines with each b, our γ ∈ (0, 1] is constant. This allows us to analytically establish that the draws {θ_b}_{b=1}^{B} form a Markov chain with stationary ergodic properties. Second, whereas stochastic optimization uses a fixed m (as small as one) with efficient computation as a goal, we also have inference in mind, which necessitates that m increase faster than √n. Statistical and computational efficiency are conflicting goals in this context.

Under conditions stated in Forneron and Ng (2020), the rnr draws have two properties:

√n (θ̄_rnr − θ̂_n) = o_{p⋆}(1),   (Estimation)
V_rnr^{−1/2} √m (θ_b − θ̂_n) →_{d⋆} N(0, I_d),   (Inference)

where V_rnr is defined in Algorithm rnr above. The first (consistency) result states that the mean estimator θ̄_rnr is first-order equivalent to the classical estimator θ̂_n. The second (inference) result states that the distribution of the draws is first-order equivalent to that of θ̂_m^{(b)}, the bootstrap distribution. A sketch of the argument is as follows. For the same resampling scheme, it is known that the standard bootstrap yields valid inference. We show that for m increasing with n and γ ∈ (0, 1], the distribution of the rnr draws is close to that of the standard bootstrap up to scale, and by implication, close to the limiting distribution of θ̂_n. But unlike the standard bootstrap, which needs repeated optimizations, rnr produces standard errors in the same optimization that produces the estimate θ̄_rnr. This means that upon completion of that single run, a (1 − α) confidence interval for the j-th coefficient can be immediately constructed as

(θ̄_rnr,j + q_{α/2}, θ̄_rnr,j + q_{1−α/2}),

where q_{α/2} is the α/2 quantile of √((m/n) φ(γ)) (θ_{b,j} − θ̄_rnr,j). Wald statistics can also be computed using V_rnr as a plug-in estimate of V†.

The two results above also hold for a faster resampled quasi-Newton algorithm, called rqn, which approximates the Hessian by a least-squares interpolation scheme described in Forneron and Ng (2020). This scheme ensures that the conditioning matrix is both symmetric and positive definite, which is required for inference. Though the consistency result holds for many conditioning matrices, the inferential result only holds for conditioning matrices that approximate the inverse Hessian sufficiently well, because the sandwich variance structure cannot be replicated otherwise. Thus, resampled gradient descent, which uses an identity matrix for conditioning, will give incorrect standard errors but valid estimates.

Algorithms rnr and rqn are especially useful when the model is costly to optimize. But they also have statistical appeal: the draws are immediately available for post-estimation diagnostics, and in the case of simulation estimation, θ̄_rnr can even be more efficient than an estimate obtained from classical optimization. We now illustrate some of these properties.
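Before turning to the examples, the interval and Wald statistic just described can be read off the stored draws of the earlier toy sketch. All object names (draws, theta_rnr, m, n, phi, V_rnr, d) refer to that illustrative block, and theta0 is a hypothetical null value:

```r
alpha <- 0.05; j <- 2
# Center and rescale the draws so their spread matches that of theta_hat_n;
# the alpha/2 and 1 - alpha/2 quantiles give the confidence bounds.
scaled <- sqrt((m / n) * phi) * (draws[, j] - theta_rnr[j])
ci <- theta_rnr[j] + quantile(scaled, c(alpha / 2, 1 - alpha / 2))

# Wald statistic for H0: theta = theta0, with V_rnr as plug-in for V_dagger.
theta0 <- c(1, 2)
dev    <- theta_rnr - theta0
W      <- n * drop(t(dev) %*% solve(V_rnr) %*% dev)
reject <- W > qchisq(1 - alpha, df = d)
```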
II. Example 1: Demand for Cereal

We consider the BLP model of Berry, Levinsohn and Pakes (1995) for the cereal data generated in Nevo (2000). The data consist of market shares s_gj in market g ∈ {1, ..., 94} for product j ∈ {1, ..., 24}. Parameters on terms that enter linearly are projected out by 2SLS. We then drop interaction terms that seem difficult to identify. This leaves us with d = 8 parameters that enter the moment conditions non-linearly. Evaluation of the objective and its gradient is costly because fixed-point iterations are needed to invert market shares.³

We perform m out of n resampling at the market level. This level of clustering controls for possible correlations in the unobservables at the market level. That is, for each b = 1, ..., B we draw markets g_1^{(b)}, ..., g_94^{(b)} from {1, ..., 94} with replacement, taking the associated shares and characteristics {s_{g^{(b)}j}, X_{g^{(b)}j}}_{j=1,...,24} as observations within each market. We use a fixed learning rate γ and burn = 10 draws. Since the number of clusters is relatively small, we set m = n = 94.

Table 1: Demand for Cereal: Estimates and Standard Errors (Random Coefficients)

                |       Estimates        |        Standard Errors
                |  θ̂_n     rnr     rqn  |  boot    dmk     rnr     rqn
std dev  const  |  0.284   0.263   0.273 |  0.129   0.127   0.123   0.120
         price  |  2.032   2.188   1.983 |  1.198   1.026   0.975   0.950
         sugar  | -0.008  -0.006   0.006 |  0.017   0.012   0.012   0.012
         mushy  | -0.077  -0.055  -0.044 |  0.177   0.168   0.166   0.167
income   const  |  3.581   3.464   3.646 |  0.666   0.738   0.714   0.662
         price  |  0.467   1.335   0.111 |  3.829   4.275   4.040   3.569
         sugar  | -0.172  -0.171  -0.174 |  0.028   0.028   0.027   0.031
         mushy  |  0.690   0.647   0.694 |  0.345   0.346   0.339   0.333
time            |                        |  4h36m   58m     1h1m    15m

Table 1 indicates that the rnr estimates are similar to θ̂_n obtained from classical optimization. The standard errors are similar across methods, but the rnr ones are nearly 5x faster to compute than the bootstrap and are comparable to Davidson and MacKinnon (1999), denoted dmk, even excluding the time used to obtain the preliminary estimate that dmk requires. The rqn algorithm further reduces computation time over rnr by a factor of 4.

The estimates based on classical optimization reported above use only 20 integration draws, as in Nevo (2000). More accurate estimates would require more draws, so the reported gains from using rnr and rqn are conservative. Besides inference on the parameters, the rnr draws can also be useful in post-estimation analysis. For instance, in more involved counterfactuals such as merger analyses, the delta-method can be challenging to apply, while re-evaluating counterfactuals on bootstrap draws is straightforward.

³ We use the BLPestimatoR R package, which builds on C++ functions to evaluate the GMM objective and analytical gradient (Brunner et al., 2017).
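For concreteness, the market-level resampling step can be coded as below. The data-frame layout (a df with a market column alongside shares and characteristics) and the function name are illustrative assumptions; this is not the interface of BLPestimatoR.

```r
# Draw m market ids with replacement and stack the products of each drawn
# market; a market drawn twice enters twice, relabeled as a distinct cluster.
resample_markets <- function(df, m) {
  ids   <- sample(unique(df$market), m, replace = TRUE)
  batch <- lapply(seq_along(ids), function(k) {
    block        <- df[df$market == ids[k], , drop = FALSE]
    block$market <- k
    block
  })
  do.call(rbind, batch)
}

# With 94 markets and m = n, each batch has the size of the original sample:
# batch_b <- resample_markets(cereal_df, m = 94)    # cereal_df: hypothetical
```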
III. Example 2: Probit IV Regression

The second example uses simulations to evaluate the finite-sample properties of rnr and rqn. We consider a probit instrumental variable regression model specified as

y_1i = 1{α y_2i + β_0 + β_1 x_i + ρ v_i + u_i > 0},
y_2i = ξ_0 + ξ_1 x_i + π z_i + v_i,

where x_i, z_i are independent and exponentially distributed with rate 1, and v_i, u_i are independent standard normals. The parameter vector θ = (ξ_0, ξ_1, π, α, β_0, β_1, ρ) is estimated from the moment conditions

g_n(θ) = (1/n) Σ_{i=1}^{n} ( r_1i(θ) ⊗ (1, x_i, z_i, r_2i)′
                             r_2i(θ) ⊗ (1, x_i, z_i)′ ),

where ⊗ is the Kronecker product, r_2i(θ) = y_2i − (ξ_0 + ξ_1 x_i + π z_i), and r_1i(θ) = y_1i − Φ(α y_2i + β_0 + β_1 x_i + ρ r_2i). We set n = 500 in each of the 1000 Monte-Carlo replications. The estimates θ̂_n are computed using the bfgs routine in R. For rnr and rqn we use B = 2000 and consider γ ∈ {0.1, 0.2} for several batch sizes m ≤ n, including m = 50 and m = 500. For bfgs, rnr, and rqn the initial guess is θ_0 = (0, ..., 0). The bootstrap repeatedly re-optimizes to recompute θ̂_n, and we only use B = 500 replications as is common practice; even this is slower than rnr and rqn with B = 2000.

Table 2: Probit IV: finite-sample properties in estimation and inference. [Columns: m; Average Estimate (rnr, rqn); Standard Deviation (rnr, rqn); Rejection Rates (boot, rnr, rqn); panels for γ = 0.1 and γ = 0.2.]
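A minimal R sketch of this design, written directly from the equations above, is given below; it is not the authors' code, and the "> 0" threshold convention and function names are assumptions made for illustration.

```r
# Simulate the probit IV design: exponential covariates, normal errors.
sim_probit_iv <- function(n, theta) {
  xi0 <- theta[1]; xi1 <- theta[2]; pi_ <- theta[3]
  a   <- theta[4]; b0  <- theta[5]; b1  <- theta[6]; rho <- theta[7]
  x <- rexp(n); z <- rexp(n)        # exponential(1) covariates
  v <- rnorm(n); u <- rnorm(n)      # independent standard normal errors
  y2 <- xi0 + xi1 * x + pi_ * z + v                          # first stage
  y1 <- as.numeric(a * y2 + b0 + b1 * x + rho * v + u > 0)   # probit outcome
  data.frame(y1 = y1, y2 = y2, x = x, z = z)
}

# Sample moments g_n(theta): r1 interacted with (1, x, z, r2) and r2 with
# (1, x, z), averaged over observations; seven moments for seven parameters.
g_n <- function(theta, d) {
  xi0 <- theta[1]; xi1 <- theta[2]; pi_ <- theta[3]
  a   <- theta[4]; b0  <- theta[5]; b1  <- theta[6]; rho <- theta[7]
  r2 <- d$y2 - (xi0 + xi1 * d$x + pi_ * d$z)                 # 1st-stage resid
  r1 <- d$y1 - pnorm(a * d$y2 + b0 + b1 * d$x + rho * r2)
  colMeans(cbind(r1, r1 * d$x, r1 * d$z, r1 * r2,
                 r2, r2 * d$x, r2 * d$z))
}
```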
Table 2 compares the properties of the estimates and quantile-based confidence intervals for α, the coefficient on the endogenous regressor y_2i, which is typically the parameter of interest. The average estimate with classical estimation θ̂_n is 1.034, with a standard deviation similar to the rnr and rqn values reported in the table. The exception is the m = 50 case, where burn = 50 appears to be too small. For coverage, the usual α̂_n ± 1.96 · se(α̂_n) confidence interval is very close to the 95% level, with a rejection rate near 0.05. With m = 500 and γ = 0.2, the coverage of rnr and rqn is comparable to that of the bootstrap. For m < n, the accuracy of the bootstrap declines, while rnr and rqn are less affected. For γ = 0.1, coverage is closer to the nominal 95% confidence level for the entire range of m values.
IV. Example 3: Simulation-based Estimation

The third example highlights the statistical gains of rnr/rqn for simulation-based estimation. Consider the linear dynamic panel model:

y_it = ρ y_it−1 + β x_it + α_i + σ e_it,   (1)

where t ∈ {1, ..., T}, i ∈ {1, ..., n}, and θ = (ρ, β, σ). The least-squares dummy variable (lsdv) estimator is inconsistent as n → ∞ with T fixed. Gouriéroux, Phillips and Yu (2010) consider simulation estimation of θ using ψ̂_n = θ̂_n,lsdv as auxiliary statistics. Given draws e^s_it and a value of θ, one simulates S panels y^s_it of size (n, T) using (1) and computes the simulated auxiliary statistics ψ̂^s_n(θ) = θ̂^s_n,lsdv. The indirect inference (ind) estimator

θ̂^S_n,ind = argmin_θ ‖ψ̂_n − (1/S) Σ_{s=1}^{S} ψ̂^s_n(θ)‖²

has an automatic bias-correction property and is consistent as n → ∞ even if T is fixed, but its variance is inflated by a factor (1 + 1/S) due to simulation noise. The n out of n bootstrap is often used to obtain standard errors of indirect inference estimates. Throughout the estimation above, the covariates x_it and the simulation draws e^s_it are held fixed while the optimizer solves for θ̂^S_n,ind.

In contrast, rnr and rqn resample m out of n individual paths of (x_it)_{t=1,...,T} and simulate new draws e^{s,b}_it at each iteration b. This has two advantages. First, as in the examples above, it is faster than the conventional bootstrap in producing standard errors. Second, the simulation noise across b averages out, and as a consequence, rnr/rqn achieve the same asymptotic variance as an ind estimator that uses S = ∞ simulations. This statistical efficiency gain comes for free since we only use finitely many (S) simulated samples at each iteration b, and with m possibly less than n. The only proviso is that a second chain of draws is needed to produce correct standard errors and confidence intervals, as shown in Forneron and Ng (2020).

To illustrate, data are simulated with ρ = 0.6; x_it and e_it are iid standard normal, with n = 500 and T = 5. We use m ∈ {100, 500}, B = 2000, S ∈ {1, 10}, burn = 45, and a fixed γ for rnr and rqn. For the bootstrap we only use B = 500, as is common practice. The lsdv estimate ρ̂_n,lsdv is substantially downward biased relative to ρ = 0.6. ind removes the downward bias almost entirely for both values of S, with an average estimate of 0.601 for S = 10. The standard deviation of the ind estimates is 0.032 and 0.024 for S = 1 and 10, respectively.

Table 3: Dynamic Panel: finite-sample properties in estimation and inference

         | Average Estimate | Standard Deviation |   Rejection Rates
    m    |   rnr     rqn    |   rnr     rqn      |  boot    rnr     rqn
S = 1
    500  |  0.599   0.599   |  0.023   0.023     |  0.052   0.044   0.045
    100  |  0.598   0.599   |  0.023   0.023     |  0.054   0.037   0.047
S = 10
    500  |  0.600   0.600   |  0.023   0.023     |  0.049   0.044   0.043
    100  |  0.598   0.598   |  0.023   0.023     |  0.047   0.043   0.041

Table 3 shows that rnr and rqn preserve this bias correction and have smaller standard deviations even with S = 1 and m < n, as predicted by theory. Coverage is close to the nominal 95% level for the usual θ̂_n ± 1.96 · se(θ̂_n) confidence interval, with rejection rates of 0.049 and 0.046 for S = 1 and 10; rnr and rqn have similar coverage. Increasing S has little effect on rnr and rqn but improves the accuracy of ind.
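To fix ideas, the lsdv auxiliary statistic and the ind objective can be sketched in R as follows, written directly from equation (1). Treating the initial values y0 as given and matching only (ρ, β) rather than the full θ̂_lsdv are simplifications for illustration, and the helper names are hypothetical.

```r
# LSDV auxiliary statistic: within-transform to remove alpha_i, then pooled
# OLS of y_it on (y_it-1, x_it). y is n x (T+1) with column 1 = y_i0; x is n x T.
lsdv <- function(y, x) {
  yc <- y[, -1, drop = FALSE]            # y_i1, ..., y_iT
  yl <- y[, -ncol(y), drop = FALSE]      # lags y_i0, ..., y_iT-1
  dm <- function(a) a - rowMeans(a)      # demean within each unit i
  qr.solve(cbind(as.vector(dm(yl)), as.vector(dm(x))), as.vector(dm(yc)))
}

# Simulate one panel from (1) for theta = (rho, beta, sigma), holding the
# covariates x, initial values y0, fixed effects alpha, and shocks e fixed.
sim_panel <- function(theta, x, y0, alpha, e) {
  y <- matrix(0, nrow(x), ncol(x) + 1); y[, 1] <- y0
  for (t in 1:ncol(x))
    y[, t + 1] <- theta[1] * y[, t] + theta[2] * x[, t] + alpha +
                  theta[3] * e[, t]
  y
}

# ind objective: distance between observed and average simulated lsdv
# statistics; x, y0, alpha, and the S shock matrices in e_list stay fixed
# across evaluations, exactly as described in the text.
ind_obj <- function(theta, psi_hat, x, y0, alpha, e_list) {
  psi_sim <- rowMeans(sapply(e_list, function(e)
    lsdv(sim_panel(theta, x, y0, alpha, e), x)))
  sum((psi_hat - psi_sim)^2)
}
```

In the resampled version, each iteration b would evaluate this objective on a fresh draw of m individual paths with new shocks, rather than solving the minimization to convergence on a fixed simulation set.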
REFERENCES

Andrews, Donald W. K. 2002. "Higher-Order Improvements of a Computationally Attractive k-Step Bootstrap for Extremum Estimators." Econometrica, 70(1): 119–162.

Berry, Steven, James Levinsohn, and Ariel Pakes. 1995. "Automobile Prices in Market Equilibrium." Econometrica, 63(4): 841–890.

Brunner, Daniel, Florian Heiss, André Romahn, and Constantin Weiser. 2017. "Reliable Estimation of Random Coefficient Logit Demand Models." DICE Discussion Paper.

Davidson, Russell, and James G. MacKinnon. 1999. "Bootstrap Testing in Nonlinear Models." International Economic Review, 40(2): 487–508.

Forneron, Jean-Jacques, and Serena Ng. 2020. "Inference by Stochastic Optimization: A Free-Lunch Bootstrap." Working paper.

Gouriéroux, Christian, Peter C. B. Phillips, and Jun Yu. 2010. "Indirect Inference for Dynamic Panel Models." Journal of Econometrics, 157(1): 68–77.

Honoré, Bo E., and Luojia Hu. 2017. "Poor (Wo)man's Bootstrap." Econometrica, 85(4): 1277–1301.

Kline, Patrick, and Andres Santos. 2012. "A Score Based Approach to Wild Bootstrap Inference." Journal of Econometric Methods, 1(1): 23–41.

Nevo, Aviv. 2000. "A Practitioner's Guide to Estimation of Random-Coefficients Logit Models of Demand." Journal of Economics & Management Strategy, 9(4): 513–548.