Online Statistical Inference for Gradient-free Stochastic Optimization
Xi Chen [email protected]
Stern School of Business
New York University
Zehua Lai [email protected]
Committee on Computational and Applied Mathematics
University of Chicago
He Li [email protected]
Stern School of Business
New York University
Yichen Zhang [email protected]
Krannert School of Management
Purdue University
Abstract
As gradient-free stochastic optimization has recently gained attention for a wide range of applications, the demand for uncertainty quantification of parameters obtained from such approaches arises. In this paper, we investigate the problem of statistical inference for model parameters based on gradient-free stochastic optimization methods that use only function values rather than gradients. We first present central limit theorem results for Polyak-Ruppert-averaging-type gradient-free estimators. The asymptotic distribution reflects the trade-off between the rate of convergence and function query complexity. We next construct valid confidence intervals for model parameters through the estimation of the covariance matrix in a fully online fashion. We further give a general gradient-free framework for covariance estimation and analyze the role of function query complexity in the convergence rate of the covariance estimator. This provides a one-pass, computationally efficient procedure for simultaneously obtaining an estimator of model parameters and conducting statistical inference. Finally, we provide numerical experiments to verify our theoretical results and illustrate some extensions of our method for various machine learning and deep learning applications.
1. Introduction
Modern machine learning algorithms have achieved state-of-the-art performance on a wide range of tasks. As the predictions we make via these algorithms are random, it is desirable for a reliable machine learning system to be able to quantify the uncertainty of its estimators. This paper explores the problem of statistical inference for model parameters based on the following stochastic optimization problem,

minimize_{x ∈ R^d}  F(x) := E_{ζ∼D}[f(x; ζ)],

where ζ denotes a random sample from some probability distribution D, f(x; ζ) is the loss function, and x ∈ R^d is the parameter of interest.

A widely used optimization method for minimizing the population loss function F(x) is the stochastic gradient descent (SGD) algorithm, dating back to Robbins and Monro (1951) and developed by Polyak and Juditsky (1992) and Nemirovski et al. (2009). Given a starting point x_0, at each step t, the SGD algorithm updates the parameter x_t by,

x_t = x_{t−1} − η_t ∇f(x_{t−1}; ζ_t),

where ζ_t is sampled from the data distribution D, and η_t = η t^{−α} is the step size with some constants η, α > 0. The output of SGD is either the last iterate x_t, or the average of the iterates along the path. The idea of iterate averaging goes back to Ruppert (1988) and Polyak and Juditsky (1992), and is referred to as Polyak-Ruppert averaging.

SGD has been extensively studied and applied in the machine learning community (Zhang, 2004; Nemirovski et al., 2009; Rakhlin et al., 2012) due to its enormous computational advantages over traditional deterministic optimization methods. In particular, SGD visits each data point once, which has a lower computational cost compared to the full gradient descent method. In the meantime, SGD is a one-pass algorithm that does not store the data stream, and naturally fits transient online datasets. Furthermore, SGD achieves the optimal convergence rate of the objective value or of the estimation error of parameters under certain conditions (Bach and Moulines, 2013; Nemirovski et al., 2009; Agarwal et al., 2009).

These advantages of SGD come at the cost of the randomness introduced by its random sampling scheme, which leads to the need to provide uncertainty quantification of model parameters. In particular, we focus on statistical inference based on the limiting distribution of the average of the SGD iterates. The first classical result characterizing this limiting distribution is given by Ruppert (1988) and Polyak and Juditsky (1992),

√t (x̄_t − x*) ⇒ N(0, H⁻¹ S H⁻¹),   (1.1)

where H = ∇²F(x*) is the Hessian matrix of F(x) at x = x*, and S = E[∇f(x*; ζ) ∇f(x*; ζ)^⊤] is the Gram matrix of ∇f(x*; ζ).

Some efforts have been made based on this result or its variants in the recent literature. For example, Toulis and Airoldi (2017) propose the implicit SGD and provide the asymptotic normality of its averaged iterates. Chen et al. (2020) develop asymptotically valid inference via a cancellation method using the path of the SGD iterates. Li et al. (2018) fix the step size and discard intermediate iterates to reduce correlation. Fang et al. (2017) present a perturbation-based resampling procedure to conduct statistical inference. Furthermore, Su and Zhu (2018) adopt a tree structure and split the single thread into several threads to construct confidence intervals.

These existing methods heavily rely on the explicit form of the gradient. Nevertheless, such information may be expensive or infeasible in many machine learning problems. For example, in bandit optimization and adversarial training, we only have black-box access to objective values. Gradient-free SGD methods have been proposed that replace ∇f(x_{t−1}; ζ_t) by estimates using only function values. As gradient-free methods introduce additional bias and randomness into the SGD procedure, it is essential to analyze the asymptotic distributions of the resulting estimators, and to develop approaches to construct confidence intervals based on the estimators of gradient-free SGD when true gradient information is unavailable.

Related Works
Gradient-free methods for stochastic optimization have a long history, going back to the pioneering paper of Kiefer and Wolfowitz (1952), which introduced a gradient-free algorithm for stochastic approximation; Blum (1954) extended it to the multidimensional setting. A detailed literature review can be found in Kushner and Yin (2003) and the references therein.

Recently, there has been renewed interest in optimization and machine learning problems where only function values, rather than first-order gradient information, are available. In particular, Agarwal et al. (2011) and Shamir (2017) give sub-optimal convergence rates under certain conditions; the sub-optimality is due to limited access to function queries (Agarwal et al., 2010; Nesterov and Spokoiny, 2017). Furthermore, Jamieson et al. (2012) present an alternative using only function comparisons instead of function evaluations. Agarwal et al. (2010), Nesterov and Spokoiny (2017), and Ghadimi and Lan (2013) investigate general gradient-free (zeroth-order) methods based on the difference of two function evaluations and provide improved convergence rates. Duchi et al. (2015) propose a mirror descent algorithm in the same setting and characterize the optimal rate of convergence. Liu et al. (2018) present methods based on stochastic variance-reduced gradient descent that further improve the convergence rate at the cost of increased function query complexity. Moreover, Wang et al. (2018) and Golovin et al. (2020) extend these algorithms to high-dimensional zeroth-order optimization.
Contributions
In this paper, we conduct statistical inference of model parameters based on gradient-free SGD. We summarize our key contributions as follows.

• We study the asymptotic distributions of model parameter estimators based on two different types of gradient-free SGD. We show that the Polyak-Ruppert-averaged iterates x̄_t are asymptotically normal.

• We show that the asymptotic distributions depend on the number of function queries. For coordinate-wise gradient estimation with O(d) function queries per step, the asymptotic distribution is the same as that of first-order SGD, and the estimator x̄_t matches the statistical properties of the empirical risk minimizer (ERM). For the random sampling scheme, we show that there is a trade-off between the scale of the asymptotic covariance matrix and the number of function queries.

• We propose two types of methods for estimating the Hessian matrix H = ∇²F(x*) when neither gradient nor Hessian information is available. Based on these, we construct valid confidence intervals for model parameters through estimation of the covariance matrix in a fully online fashion. Our procedure for constructing confidence intervals is one-pass and computationally efficient, and can be performed simultaneously with the gradient-free SGD iterations.

Notations
We first introduce the notations used in this paper. We write vectors in boldface letters (e.g., x and ε) and scalars in lightface letters (e.g., η). For any positive integer t, we use [t] as shorthand for the discrete set {1, 2, ···, t}. Let {e_i}_{i=1}^d be the standard basis of R^d, where e_i has the i-th coordinate equal to 1 and the others equal to 0. Denote by I_d the identity matrix in R^{d×d} and by 0 the zero vector. For convenience, let ‖·‖ denote the standard Euclidean norm for vectors and the spectral norm for matrices; ‖·‖_Fro denotes the Frobenius norm for matrices. We also use E_t to denote the conditional expectation with respect to the natural filtration, i.e., E_t[x_{t+1}] := E[x_{t+1} | F_t], F_t := σ(x_k | k ≤ t). We denote the first-order difference of f(·) by

Δ_u f(x; ζ_t) = f(x + u; ζ_t) − f(x; ζ_t).

Finally, we use the O(·) notation to hide only absolute constants.
2. Gradient-free Stochastic Gradient Descent
Our gradient-free SGD replaces ∇f(x_{t−1}; ζ_t) by an estimator that requires only function values. The gradient-free SGD has the following update rule with the gradient estimate ∇̂f(x_{t−1}; ζ_t),

x_t = x_{t−1} − η_t ∇̂f(x_{t−1}; ζ_t).

A natural two-point random gradient estimator is defined as,

∇̂f(x_{t−1}; ζ_t) = (d/µ_t) Δ_{µ_t u_t} f(x_{t−1}; ζ_t) u_t = (d/µ_t) [f(x_{t−1} + µ_t u_t; ζ_t) − f(x_{t−1}; ζ_t)] u_t,   (2.1)

where u_t is uniformly sampled from the unit sphere in R^d, and µ_t > 0 is the smoothing parameter. Note that ∇̂f(x; ζ) is not unbiased with respect to ∇f(x; ζ), although its bias converges to 0 as µ → 0. Instead, it is unbiased with respect to the gradient of F_µ(x) := E[F(x + µu) | x], which can be seen as a smoothed version of F(x). We will leverage this property in the proofs.

The random gradient estimator (2.1) requires only O(1) function queries on f(·) per step. However, this single-sample procedure on u introduces a large variance into the gradient estimate. In contrast to the single-sample update, we combine m i.i.d. samples {u_t^{(i)}}_{i=1}^m to reduce the variance during gradient estimation. We call this the average random gradient estimator,

∇̂f(x_{t−1}; ζ_t) = d/(m µ_t) Σ_{i=1}^m Δ_{µ_t u_t^{(i)}} f(x_{t−1}; ζ_t) u_t^{(i)}.   (2.2)

Lian et al. (2016) and Liu et al. (2018) also consider the coordinate-wise gradient estimator, where the evaluation directions are fixed,

∇̂f(x_{t−1}; ζ_t) = (1/µ_t) Σ_{i=1}^d Δ_{µ_t e_i} f(x_{t−1}; ζ_t) e_i.   (2.3)

In the following theoretical analysis, we present central limit theorems for all the gradient estimators used in our gradient-free SGD. Before that, we introduce some assumptions on the loss functions F(·) and f(·). Assume that the minimum of F(x) is achieved at x*. Let H and S be the Hessian matrix and the Gram matrix at x*, respectively,

H := ∇²F(x*),   S := E[∇f(x*; ζ) ∇f(x*; ζ)^⊤].

Assumption 1.
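For concreteness, estimators (2.1) to (2.3) can be sketched in NumPy as follows. This is a minimal illustration, not the paper's code; the loss `f(x, zeta)` is an arbitrary black-box function supplied by the user, and the seed is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere(d, size=1):
    """Sample `size` directions uniformly from the unit sphere in R^d."""
    u = rng.normal(size=(size, d))
    return u / np.linalg.norm(u, axis=1, keepdims=True)

def grad_two_point(f, x, zeta, mu):
    """Two-point random gradient estimator (2.1)."""
    d = x.size
    u = sphere(d)[0]
    return d / mu * (f(x + mu * u, zeta) - f(x, zeta)) * u

def grad_avg_random(f, x, zeta, mu, m):
    """Average random gradient estimator (2.2): mean of m two-point estimates."""
    d = x.size
    U = sphere(d, m)                                    # m i.i.d. directions
    diffs = np.array([f(x + mu * u, zeta) - f(x, zeta) for u in U])
    return d / (m * mu) * diffs @ U

def grad_coordinate(f, x, zeta, mu):
    """Coordinate-wise gradient estimator (2.3): forward differences along e_i."""
    d = x.size
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d); e[i] = 1.0
        g[i] = (f(x + mu * e, zeta) - f(x, zeta)) / mu
    return g
```

Each call to `grad_two_point` costs two function queries, `grad_avg_random` costs m + 1, and `grad_coordinate` costs d + 1, matching the query counts discussed in the text.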
The population loss function F(x) is twice continuously differentiable and λ I_d ⪯ ∇²F(x) ⪯ L_f I_d for any x ∈ R^d and some L_f > λ > 0.

Assumption 2.
We have E_{t−1}[∇f(x; ζ_t) − ∇F(x)] = 0 and E_{t−1}[∇f(x*; ζ_t) ∇f(x*; ζ_t)^⊤] = S for any x ∈ R^d. Moreover, for some 0 < δ ≤ 1,

E_{t−1} ‖∇f(x; ζ_t) − ∇F(x)‖^{2+δ} ≤ M (‖x‖^{2+δ} + 1).

Assumption 3.
There are constants L_h and L_p such that for any x, y ∈ R^d,

E_{t−1} ‖∇f(x; ζ_t) − ∇f(y; ζ_t)‖² ≤ L_h² ‖x − y‖²,   E_{t−1} ‖∇²f(x*; ζ_t) − H‖² ≤ L_p².

Assumptions 1 to 3 are standard conditions in the SGD inference literature (Polyak and Juditsky, 1992; Su and Zhu, 2018; Chen et al., 2020). In particular, Assumption 1 requires the population loss function F(·) to be λ-strongly convex and L_f-smooth. Assumption 2 ensures the unbiasedness of ∇f(·); our assumption is weaker than the usual unbiasedness requirement, as we only need unbiasedness under the conditional expectation. Notice that throughout the paper we do not assume any independence of the random samples {ζ_i}_{i=1}^t. Moreover, the 2 + δ moment condition is the classical Lyapunov condition used in asymptotic-normality results for dependent random variables. The statements in Assumption 3 introduce a Lipschitz continuity condition and a concentration condition on the Hessian matrix.

In this subsection, we give the theoretical analysis of our gradient-free SGD. We defer all proofs to the supplementary material. Before presenting our main theoretical findings, we first provide some useful upper bounds on the distance between x_t and the minimizer x*.

Lemma 1.
Let Assumptions 1 and 2 be fulfilled. Set the step size as η_t = η t^{−α} for some constants η, α > 0, and let the smoothing parameter satisfy µ_t ≤ µ η_t for some constant µ. Then for all 1/2 < α < 1, x_t converges to x* almost surely, and we have the following bounds,

E ‖x_t − x*‖² ≤ C t^{−α},   (2.4)
E ‖x_t − x*‖^{2+δ} ≤ C t^{−α(2+δ)/2},   (2.5)

where C is a constant depending on λ, L_f, δ, L_h, L_p, α, η, µ, d, and δ > 0 is defined in Assumption 2.

With Lemma 1 in hand, we can now establish the asymptotic normality of our gradient-free SGD estimators.

Theorem 2.
Let Assumptions 1 to 3 be fulfilled. Using the random gradient estimator (2.1), our gradient-free SGD satisfies, as t → ∞,

√t (x̄_t − x*) ⇒ N(0, H⁻¹ Q H⁻¹),   (2.6)

where x̄_t := (1/t) Σ_{i=1}^t x_i and Q = d/(d+2) (2S + tr(S) I_d). Here ⇒ denotes convergence in distribution.

In the classical central limit theorem for SGD (Polyak and Juditsky, 1992), the covariance matrix is H⁻¹ S H⁻¹. The covariance matrix of our estimator, in contrast, is roughly d H⁻¹ S H⁻¹. The extra randomness is caused by taking the expectation with respect to the random vector u; see Lemma 7 in the supplementary material for details. As shown in the corollary below, the scale of the covariance matrix can be reduced by the other two estimation methods.

Corollary 3.
Let Assumptions 1 to 3 be fulfilled. Using the average random gradient estimator (2.2), when t → ∞, we have,

√t (x̄_t − x*) ⇒ N(0, H⁻¹ Q H⁻¹),   where Q = d/(m(d+2)) (2S + tr(S) I_d) + ((m−1)/m) S.

Similarly, for the coordinate-wise gradient estimator (2.3), we have,

√t (x̄_t − x*) ⇒ N(0, H⁻¹ Q H⁻¹),   where Q = S.

In Corollary 3 above, for the average random gradient estimator (2.2), setting m = 1 recovers the result in (2.6), and when m is sufficiently large, Q ≈ S. Compared with the classical central limit theorem for SGD (1.1), gradient-free methods thus require O(d) function queries per step to achieve the same asymptotic covariance matrix; our estimators then share the same statistical properties as the empirical risk minimizer. Furthermore, when the model is well-specified, the limiting covariance matrix H⁻¹ S H⁻¹ = I⁻¹ achieves the Cramér-Rao lower bound, where I is the Fisher information matrix. This indicates that x̄_t is asymptotically efficient (Van der Vaart, 2000).
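To make the averaging scheme concrete, the following sketch runs gradient-free SGD with the two-point estimator (2.1) and Polyak-Ruppert averaging on a synthetic least-squares stream. The step-size constants, smoothing schedule, data model, and seed here are illustrative assumptions, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def zo_sgd_average(f, x0, T, eta=0.1, alpha=0.7, mu0=0.1):
    """Gradient-free SGD with step size eta*t^(-alpha) and smoothing
    mu_t = mu0*eta_t, returning the Polyak-Ruppert average of the iterates."""
    d = x0.size
    x = x0.copy()
    xbar = np.zeros(d)
    for t in range(1, T + 1):
        eta_t = eta * t ** (-alpha)
        mu_t = mu0 * eta_t
        zeta = (rng.normal(size=d), rng.normal())       # fresh sample (a_t, noise)
        u = rng.normal(size=d); u /= np.linalg.norm(u)  # uniform direction on sphere
        ghat = d / mu_t * (f(x + mu_t * u, zeta) - f(x, zeta)) * u
        x = x - eta_t * ghat                            # two-point ZO update (2.1)
        xbar += (x - xbar) / t                          # running average of iterates
    return xbar

# Illustrative least-squares stream: b = a^T x* + 0.1*noise, loss 0.5*(b - a^T x)^2.
xstar = np.array([1.0, -1.0, 0.5])
def loss(x, zeta):
    a, eps = zeta
    b = a @ xstar + 0.1 * eps
    return 0.5 * (b - a @ x) ** 2

xbar = zo_sgd_average(loss, np.zeros(3), T=30000)
```

Only two function values are queried per step, and the running average `xbar` is the quantity whose asymptotic normality Theorem 2 describes.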
3. Gradient-free SGD Inference
In practice, the above results cannot be used directly for statistical inference, as we have no information on the true Hessian and Gram matrices H and S. One solution is to evaluate the empirical Hessian and Gram matrices with all the data at x̄_t,

Ĥ = (1/t) Σ_{i=1}^t ∇²f(x̄_t; ζ_i),   Ŝ = (1/t) Σ_{i=1}^t ∇f(x̄_t; ζ_i) ∇f(x̄_t; ζ_i)^⊤.

However, this estimation approach is not computationally efficient and is not feasible in an online setting: we would need to store every sample until the gradient-free SGD procedure finishes and then reuse all the samples once again, which requires O(td) storage and at least O(td) computation. A question naturally arises: can we provide an online estimation method for H and S that is computationally efficient? In this section, we provide online gradient-free estimation methods for H and S, and present their non-asymptotic estimation errors.

We first consider gradient-free methods for estimating the Hessian matrix H = ∇²F(x*). Inspired by the average random gradient estimator (2.2) and the coordinate-wise estimator (2.3), there are two corresponding ways to estimate the Hessian: one involves random sampling and the other is coordinate-wise. For the random sampling method, at step i, the gradient-free Hessian estimate is given by,

Ĝ_i = d²/(m µ_i²) Σ_{j=1}^m [Δ_{µ_i v_i^{(j)}} f(x_{i−1} + µ_i u_i^{(j)}; ζ_i) − Δ_{µ_i v_i^{(j)}} f(x_{i−1}; ζ_i)] u_i^{(j)} v_i^{(j)⊤},

where {u_i^{(j)}} and {v_i^{(j)}} are independent random vectors uniformly sampled from the unit sphere in R^d and m > 0. Therefore, our average random Hessian estimator is,

H̃_t = (1/t) Σ_{i=1}^t (Ĝ_i + Ĝ_i^⊤)/2,   (3.1)

where the symmetrization (Ĝ_i + Ĝ_i^⊤)/2 guarantees that H̃_t is symmetric. The function query complexity is O(m) per step for this Hessian estimation.

Another way to estimate the Hessian H is the coordinate-wise method. We first compute,

Ĝ_i = (1/µ_i²) Σ_{j=1}^d Σ_{k=1}^d [Δ_{µ_i e_k} f(x_{i−1} + µ_i e_j; ζ_i) − Δ_{µ_i e_k} f(x_{i−1}; ζ_i)] e_j e_k^⊤.

Notice that the above procedure requires O(d²) function queries per step, which is not computationally desirable. Instead, we randomly select some entries of Ĝ_i to update at each step, so that the estimator becomes G̃_i, where G̃_i^{(jk)} = (1/p) Ĝ_i^{(jk)} B_i^{(jk)}, j, k ∈ [d]. Here the entries of B_i are i.i.d. and follow a Bernoulli distribution, i.e., B_i^{(jk)} ∼ Bernoulli(p), for some fixed p ∈ (0, 1). Therefore, it requires O(pd²) function queries at each step; if we set p = O(1/d²), the computational cost is reduced to O(1) per step. The final coordinate-wise Hessian estimator is defined as,

H̃_t = (1/t) Σ_{i=1}^t (G̃_i + G̃_i^⊤)/2.   (3.2)

The next lemma quantifies the error rates of our Hessian estimators.

Lemma 4.
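A minimal sketch of the one-step coordinate-wise Hessian estimate with Bernoulli entry subsampling follows; the 1/p rescaling keeps the masked matrix unbiased for the full finite-difference matrix. The loss, the value of p, and the seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def hessian_coordinate_masked(f, x, zeta, mu, p):
    """One-step coordinate-wise Hessian estimate with Bernoulli(p) entry masking.
    Each kept (j, k) entry is a second-order finite difference
    [f(x+mu(e_j+e_k)) - f(x+mu*e_j) - f(x+mu*e_k) + f(x)] / mu^2,
    rescaled by 1/p so the masked estimate stays unbiased."""
    d = x.size
    G = np.zeros((d, d))
    B = rng.random((d, d)) < p                     # Bernoulli(p) mask B_i
    E = np.eye(d)
    for j in range(d):
        for k in range(d):
            if B[j, k]:
                G[j, k] = (f(x + mu * (E[j] + E[k]), zeta)
                           - f(x + mu * E[j], zeta)
                           - f(x + mu * E[k], zeta)
                           + f(x, zeta)) / (mu ** 2 * p)
    return G

def running_symmetric_average(G_list):
    """H_tilde_t = (1/t) * sum_i (G_i + G_i^T)/2, as in (3.2)."""
    t = len(G_list)
    return sum(0.5 * (G + G.T) for G in G_list) / t
```

Note that the double difference above is exactly Δ_{µe_k}f(x + µe_j) − Δ_{µe_k}f(x) written out, so on a quadratic loss it recovers the Hessian entry without bias.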
Under Assumptions 1 to 3, we have the following result for the average random Hessian estimator (3.1),

E ‖H̃_t − H‖² ≤ C₁ t^{−α} + C₂ (1 + 1/m) t^{−1}.   (3.3)

The coordinate-wise estimator (3.2) satisfies,

E ‖H̃_t − H‖² ≤ C₁ t^{−α} + C₂ (1 + (1 − p)/p) t^{−1}.   (3.4)

As can be inferred from the above, when t → ∞ the error rates of both Hessian estimators are dominated by the C₁ t^{−α} term; the sampling parameters m and p do not affect the asymptotic rate. In what follows we therefore absorb m and p into the constants, so that E ‖H̃_t − H‖² ≤ C t^{−α}.

To avoid possible singularity of H̃_t when inverting it in the covariance matrix computation (H⁻¹ Q H⁻¹), we consider a thresholded version of H̃_t. Let U Λ U^⊤ be the eigenvalue decomposition of H̃_t, and define

Ĥ_t = U Λ̂ U^⊤,   Λ̂_{ii} = max{κ, Λ_{ii}}, i ∈ [d],

for some small fixed constant 0 < κ < λ, where λ is defined in Assumption 1. By construction, Ĥ_t is guaranteed to be invertible.

Now we consider the estimation of the covariance matrix H⁻¹ Q H⁻¹. The estimation of Q is straightforward, and an online update can be maintained alongside the gradient estimates during the SGD updates, i.e.,

Q̂_t := (1/t) Σ_{i=1}^t ∇̂f(x_{i−1}; ζ_i) ∇̂f(x_{i−1}; ζ_i)^⊤.

With the above construction of Ĥ_t and Q̂_t, our gradient-free covariance matrix estimator is Ĥ_t⁻¹ Q̂_t Ĥ_t⁻¹. Based on Lemma 4, we obtain a non-asymptotic error rate for this estimator.

Theorem 5.
Under Assumptions 1 to 3 with δ = 1, for the gradient estimators (2.1) to (2.3) and the Hessian estimators (3.1) and (3.2), the proposed gradient-free covariance matrix estimator converges to the true covariance matrix H⁻¹ Q H⁻¹ at the following rate,

E ‖Ĥ_t⁻¹ Q̂_t Ĥ_t⁻¹ − H⁻¹ Q H⁻¹‖ ≤ C t^{−α/2}.

The theorem above establishes the consistency of our proposed covariance matrix estimator. For any given direction w ∈ R^d, a confidence interval for the true parameter x* can be constructed by projecting x̄_t and Ĥ_t⁻¹ Q̂_t Ĥ_t⁻¹ onto w. Specifically, for a confidence level q and the corresponding z-score z_{q/2}, we obtain an asymptotically exact confidence interval: as t → ∞,

P{ w^⊤ x* ∈ [ w^⊤ x̄_t − z_{q/2} √(w^⊤ Ĥ_t⁻¹ Q̂_t Ĥ_t⁻¹ w / t),  w^⊤ x̄_t + z_{q/2} √(w^⊤ Ĥ_t⁻¹ Q̂_t Ĥ_t⁻¹ w / t) ] } → 1 − q.
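Putting the pieces together, a plug-in confidence interval for a projection w^⊤x* can be sketched as follows. The inputs x̄_t, H̃_t, and Q̂_t are assumed to come from the online procedures above; the threshold κ and the 95% z-score are user choices in this sketch.

```python
import numpy as np

def threshold_psd(H_tilde, kappa):
    """Floor the eigenvalues of the symmetric estimate H_tilde at kappa > 0,
    guaranteeing invertibility (the thresholding step described above)."""
    vals, vecs = np.linalg.eigh(H_tilde)
    return (vecs * np.maximum(vals, kappa)) @ vecs.T

def projection_ci(xbar, H_tilde, Q_hat, w, t, kappa=0.1, z=1.959964):
    """Two-sided CI for w^T x* based on the sandwich estimate H^-1 Q H^-1."""
    H = threshold_psd(H_tilde, kappa)
    Hinv = np.linalg.inv(H)
    Sigma = Hinv @ Q_hat @ Hinv            # plug-in covariance estimate
    half = z * np.sqrt(w @ Sigma @ w / t)  # z_{q/2} * sqrt(w^T Sigma w / t)
    center = w @ xbar
    return center - half, center + half
```

Both the running average x̄_t and the running matrices H̃_t, Q̂_t admit O(d²) online updates, so the interval can be refreshed at every iteration of the gradient-free SGD loop.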
4. Experiments and Applications
We present numerical results and discussions on our gradient-free inference in this section.
In this subsection, we investigate the empirical performance of the proposed estimator of the asymptotic covariance matrix. We conduct simulations in two settings, linear regression and logistic regression. Specifically, we draw i.i.d. samples {a_i, b_i}_{i=1}^t, where a ∈ R^d, a ∼ N(0, Σ), is the covariate vector and b ∈ R is the response. The true parameter in these models is denoted by x* ∈ R^d. For both settings, we consider two different covariance matrices Σ for a: the identity matrix I_d and the equicorrelation covariance matrix (Equicorr in the tables), i.e., Σ_{i,j} = r for all i ≠ j and Σ_{i,i} = 1.

In the linear regression case, b_i = a_i^⊤ x* + ε_i for all i ∈ [t], where ε_i ∼ N(0, σ²) is the noise term. We use the quadratic loss function, so the population loss is F(x) := E[(b_i − a_i^⊤ x)²].

In the logistic regression case, P(b_i | a_i) = [1 + exp(−b_i a_i^⊤ x*)]^{−1}, b_i ∈ {−1, 1}, for all i ∈ [t]. The population objective function is given by F(x) := E log[1 + exp(−b_i a_i^⊤ x)].

In the following numerical experiments, we fix the step-size exponent α, the number of iterations t, and the noise variance σ², and set the parameter dimension d = 25, 50, 100, 200 in each case. Since x̄_t is a d-dimensional vector, we project it onto different directions to construct confidence intervals. We report the relative estimation error (estimate error) and the relative covariance matrix error (covmat error), respectively,

‖x̄_t − x*‖ / ‖x*‖,   ‖Ĥ_t⁻¹ Q̂_t Ĥ_t⁻¹ − H⁻¹ Q H⁻¹‖ / ‖H⁻¹ Q H⁻¹‖.

To better evaluate our estimators, the oracle coverage rate (oracle rate) is computed using the true covariance matrix H⁻¹ Q H⁻¹.

We present the results for linear regression in Table 1 and for logistic regression in Table 2, based on 500 Monte-Carlo simulations. As shown in both tables, the oracle coverage rates are approximately 95%, and the coverage rates based on our estimators are slightly worse. The linear regression case performs better overall.

Since measuring the quality of confidence intervals requires knowing the true parameters exactly, it is hard to conduct performance tests for our method on real-world datasets. As shown by the theoretical results and the empirical performance on the simulated datasets, our gradient-free inference method has the potential to be applied in real tasks for statistical inference of model parameters. We give a few examples here to illustrate.
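The two simulated data-generating processes just described can be sketched as follows; the values of r, σ, the sample sizes, and the seed are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(3)

def equicorr(d, r):
    """Equicorrelation covariance: Sigma_ij = r for i != j, Sigma_ii = 1."""
    return r * np.ones((d, d)) + (1 - r) * np.eye(d)

def draw_linear(t, d, xstar, Sigma, sigma=1.0):
    """Linear model: b_i = a_i^T x* + eps_i, eps_i ~ N(0, sigma^2)."""
    A = rng.multivariate_normal(np.zeros(d), Sigma, size=t)
    b = A @ xstar + sigma * rng.normal(size=t)
    return A, b

def draw_logistic(t, d, xstar, Sigma):
    """Logistic model: P(b_i = 1 | a_i) = 1/(1 + exp(-a_i^T x*)), b_i in {-1, +1}."""
    A = rng.multivariate_normal(np.zeros(d), Sigma, size=t)
    p = 1.0 / (1.0 + np.exp(-A @ xstar))
    b = np.where(rng.random(t) < p, 1.0, -1.0)
    return A, b
```

Feeding these streams one sample at a time into the gradient-free SGD and covariance updates reproduces the structure of the simulation study (with the paper's own parameter settings substituted in).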
Table 1: Linear regression inference

dimension   Σ          estimate error   covmat error   cover rate (%)   oracle rate (%)
25          Identity   0.0074           0.0837         94.84            94.66
25          Equicorr   0.0110           0.0840         94.98            95.58
50          Identity   0.0099           0.0836         94.85            94.51
50          Equicorr   0.0133           0.0838         93.46            95.38
100         Identity   0.0152           0.0782         93.94            95.08
100         Equicorr   0.0216           0.0817         94.03            95.49
200         Identity   0.0326           0.0910         94.55            94.62
200         Equicorr   0.0564           0.1159         94.37            95.23

Table 2: Logistic regression inference

dimension   Σ          estimate error   covmat error   cover rate (%)   oracle rate (%)
25          Identity   0.0447           0.0326         93.82            95.27
25          Equicorr   0.0512           0.0347         93.46            95.75
50          Identity   0.0697           0.0342         94.13            94.53
50          Equicorr   0.0784           0.0355         94.34            94.94
100         Identity   0.1179           0.0332         94.68            95.31
100         Equicorr   0.1301           0.0343         93.86            94.24
200         Identity   0.2357           0.0307         93.90            94.95
200         Equicorr   0.2606           0.0312         93.17            95.42

Confidence interval for predictions. Given our gradient-free estimator x̄_t, we can make predictions φ(x̄_t; ζ_i) on a fixed sample for some function φ(·): R^d → R. By the delta method (Van der Vaart, 2000), the prediction also inherits asymptotic normality,

√t (φ(x̄_t; ζ_i) − φ(x*; ζ_i)) ⇒ N(0, ∇φ(x̄_t; ζ_i)^⊤ H⁻¹ Q H⁻¹ ∇φ(x̄_t; ζ_i)).

Here we can replace the Hessian matrix H, the Gram matrix Q, and the gradient ∇φ(x̄_t; ζ_i) by their gradient-free estimators and construct confidence intervals accordingly.

Non-convex optimization.
Many non-convex optimization problems have objective functions that are locally convex near their minimizers. Our inference procedure can be run in the last phase of the training process, when the training parameter x_t lies in a locally convex basin. Taking this point as the initializer x_0, we can conduct the covariance matrix estimation during the gradient-free SGD updates.

Adversarial attack detection.
As shown in Li et al. (2018), the idea of statisticalinference can also be applied to detect adversarial attacks on neural networks.
5. Discussion
In this paper, we conduct gradient-free statistical inference for the corresponding stochastic optimization process. We show asymptotic normality of the gradient-free SGD estimator and give a consistent estimator of the covariance matrix of the model parameters. Our theoretical analysis covers different gradient and Hessian estimators. The proposed inference procedure is fully online and runs alongside the gradient-free SGD iterations. Our findings are validated by numerical results, and we further demonstrate how gradient-free inference can be used in real-world applications.

Our results and estimation methods can be adopted in many SGD variants, e.g., signSGD (Bernstein et al., 2018) and zeroth-order signSGD (Liu et al., 2019). An interesting future direction is to study the statistical properties of distributed gradient-free stochastic optimization. It is also important to relax the current assumptions and to consider inference for gradient-free SGD on more challenging optimization problems, e.g., general non-convex optimization.

Acknowledgment
Xi Chen is supported by NSF grant IIS-1845444.
References
Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1-9, 2009.

Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Conference on Learning Theory, pages 28-40, 2010.

Alekh Agarwal, Dean P Foster, Daniel J Hsu, Sham M Kakade, and Alexander Rakhlin. Stochastic convex optimization with bandit feedback. In Advances in Neural Information Processing Systems, pages 1035-1043, 2011.

Patrice Assouad. Espaces p-lisses et q-convexes, inégalités de Burkholder. In Séminaire Maurey-Schwartz 1974-1975: Espaces L^p, applications radonifiantes et géométrie des espaces de Banach, Exp. No. XV, page 8. 1975.

Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773-781, 2013.

Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In Proceedings of the International Conference on Machine Learning, pages 560-569. PMLR, 2018.

Julius R Blum. Multidimensional stochastic approximation methods. The Annals of Mathematical Statistics, pages 737-744, 1954.

Xi Chen, Jason D Lee, Xin T Tong, and Yichen Zhang. Statistical inference for model parameters in stochastic gradient descent. The Annals of Statistics, 48(1):251-273, 2020.

John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788-2806, 2015.

Rick Durrett. Probability: Theory and Examples. Cambridge University Press, Cambridge, fifth edition, 2019.

Yixin Fang, Jinfeng Xu, and Lei Yang. On scalable inference with stochastic gradient descent. arXiv preprint arXiv:1707.00192, 2017.

Xiang Gao, Bo Jiang, and Shuzhong Zhang. On the information-adaptive variants of the ADMM: an iteration complexity perspective. Journal of Scientific Computing, 76(1):327-363, 2018.

Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.

Daniel Golovin, John Karro, Greg Kochanski, Chansoo Lee, Xingyou Song, and Qiuyi Zhang. Gradientless descent: High-dimensional zeroth-order optimization. In International Conference on Learning Representations, 2020.

Kevin G Jamieson, Robert Nowak, and Ben Recht. Query complexity of derivative-free optimization. In Advances in Neural Information Processing Systems, pages 2672-2680, 2012.

Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462-466, 1952.

Harold Kushner and G. George Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer Science and Business Media, 2003.

Tianyang Li, Anastasios Kyrillidis, Liu Liu, and Constantine Caramanis. Approximate Newton-based statistical inference using only stochastic gradients. arXiv preprint arXiv:1805.08920, 2018.

Xiangru Lian, Huan Zhang, Cho-Jui Hsieh, Yijun Huang, and Ji Liu. A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. In Advances in Neural Information Processing Systems, pages 3054-3062, 2016.

Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-order stochastic variance reduction for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 3727-3737, 2018.

Sijia Liu, Pin-Yu Chen, Xiangyi Chen, and Mingyi Hong. signSGD via zeroth-order oracle. In International Conference on Learning Representations, 2019.

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.

Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527-566, 2017.

Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the International Conference on Machine Learning, pages 449-456, 2012.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400-407, 1951.

David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.

Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18(1):1703-1713, 2017.

Weijie J Su and Yuancheng Zhu. Uncertainty quantification for online learning and stochastic approximation via hierarchical incremental gradient descent. arXiv preprint arXiv:1802.04876, 2018.

Panos Toulis and Edoardo M Airoldi. Asymptotic and finite-sample properties of estimators based on stochastic gradients. The Annals of Statistics, 45(4):1694-1727, 2017.

Aad W Van der Vaart. Asymptotic Statistics, volume 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2000.

Yining Wang, Simon Du, Sivaraman Balakrishnan, and Aarti Singh. Stochastic zeroth-order optimization in high dimensions. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 1356-1365. PMLR, 2018.

Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the International Conference on Machine Learning, page 116, 2004.

Supplementary Material for Online Statistical Inference for Gradient-free Stochastic Optimization
In this supplementary material, we provide detailed proofs of all theoretical findings in the main paper. Section A contains the proofs for Section 2, and Section B those for Section 3. Section C collects technical lemmas used throughout the proofs. Throughout, we assume without loss of generality that $F(\cdot)$ achieves its minimum at $x^* = 0$ with $F(0) = 0$, and define the smoothed objective
\[ F_\mu(x) := \mathbb{E}\left[ F(x + \mu u) \mid x \right], \]
where $u \in \mathbb{R}^d$ is uniformly distributed on the unit sphere in $\mathbb{R}^d$. Furthermore, let
\[ \xi_t = \nabla F(x_{t-1}) - \nabla F_{\mu_t}(x_{t-1}), \]
\[ \gamma_t = \nabla F_{\mu_t}(x_{t-1}) - \frac{d}{\mu_t}\left[ F(x_{t-1} + \mu_t u_t) - F(x_{t-1}) \right] u_t, \]
\[ \varepsilon_t = \frac{d}{\mu_t}\left[ F(x_{t-1} + \mu_t u_t) - F(x_{t-1}) \right] u_t - \frac{d}{\mu_t}\left[ f(x_{t-1} + \mu_t u_t; \zeta_t) - f(x_{t-1}; \zeta_t) \right] u_t.

Appendix A. Proofs in Section 2
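Before turning to the proofs, here is a quick Monte Carlo sanity check (a sketch, not part of the paper's code) of the smoothing construction above: the two-point estimator $(d/\mu)\,[F(x+\mu u)-F(x)]\,u$ with $u$ uniform on the unit sphere is an unbiased estimator of $\nabla F_\mu(x)$. For the quadratic $F(x)=\tfrac12\|x\|^2$, one can check directly that $\nabla F_\mu(x) = \nabla F(x) = x$, so the empirical mean should approach $x$. All numerical values (dimension, smoothing radius, sample size) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, mu, n = 5, 0.05, 300_000
x = rng.standard_normal(d)

def F(y):
    # quadratic test objective, grad F(y) = y
    return 0.5 * np.sum(y**2, axis=-1)

# u uniform on the unit sphere, via normalized Gaussians
z = rng.standard_normal((n, d))
u = z / np.linalg.norm(z, axis=1, keepdims=True)

# two-point gradient-free estimates (d/mu) * [F(x + mu*u) - F(x)] * u
g = (d / mu) * (F(x + mu * u) - F(x))[:, None] * u
ghat = g.mean(axis=0)          # approximates grad F_mu(x) = x
print(np.max(np.abs(ghat - x)))  # small Monte Carlo error
```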
A.1 Proof of Lemma 1
Proof.
First, by Lemma 4.1(a) from Gao et al. (2018),
\[ \mathbb{E}_{t-1}[\varepsilon_t \mid u_t] = \mathbb{E}_{t-1}[\gamma_t] = 0. \tag{A.1} \]
Using the fundamental theorem of calculus,
\[ \|\xi_t\| = \left\| \nabla F(x_{t-1}) - \mathbb{E}\, \frac{d}{\mu_t} \int_0^{\mu_t} \langle \nabla F(x_{t-1} + s u_t), u_t \rangle u_t \, \mathrm{d}s \right\| \le C \mu_t. \tag{A.2} \]
The estimation error between $\nabla F_{\mu_t}(x_{t-1})$ and its gradient-free estimator, $\gamma_t$, satisfies
\[ \mathbb{E}_{t-1} \|\gamma_t\|^2 \le \mathbb{E}_{t-1} \left\| \frac{d}{\mu_t} \left[F(x_{t-1} + \mu_t u_t) - F(x_{t-1})\right] u_t \right\|^2 \le C\left(\|x_{t-1}\|^2 + \mu_t^2\right). \tag{A.3} \]
We also have the following fact for $\varepsilon_t$. All of these estimates will be used repeatedly in the proofs below.
\[
\mathbb{E}_{t-1}\left[ \|\varepsilon_t\|^{2+\delta} \mid u_t \right]
= \mathbb{E}_{t-1}\left[ \left\| \frac{d}{\mu_t} \int_0^{\mu_t} \langle \nabla F(x_{t-1}+s u_t) - \nabla f(x_{t-1}+s u_t;\zeta_t), u_t\rangle u_t \,\mathrm{d}s \right\|^{2+\delta} \,\Big|\, u_t \right]
\]
\[
\le \mathbb{E}_{t-1}\left[ \frac{d^{2+\delta}}{\mu_t} \int_0^{\mu_t} \left\|\nabla F(x_{t-1}+s u_t) - \nabla f(x_{t-1}+s u_t;\zeta_t)\right\|^{2+\delta} \mathrm{d}s \,\Big|\, u_t \right]
\le \frac{C}{\mu_t} \int_0^{\mu_t} \left(\|x_{t-1}+s u_t\|^{2+\delta} + 1\right)\mathrm{d}s
\le C\left(\|x_{t-1}\|^{2+\delta} + \mu_t^{2+\delta} + 1\right). \tag{A.4}
\]
Now decompose the update step as follows:
\[ x_t = x_{t-1} - \eta_t \frac{d}{\mu_t}\left[f(x_{t-1}+\mu_t u_t;\zeta_t) - f(x_{t-1};\zeta_t)\right]u_t = x_{t-1} - \eta_t \nabla F(x_{t-1}) + \eta_t(\xi_t + \gamma_t + \varepsilon_t). \]
Therefore, we can derive that
\[
\|x_t\|^2 \le \|x_{t-1}\|^2 - 2\eta_t \langle \nabla F(x_{t-1}), x_{t-1}\rangle + 2\eta_t \langle \xi_t+\gamma_t+\varepsilon_t,\, x_{t-1}\rangle - 2\eta_t^2 \langle \xi_t, \nabla F(x_{t-1})\rangle + 2\eta_t^2 \|\xi_t+\gamma_t+\varepsilon_t - \nabla F(x_{t-1})\|^2.
\]
For the first part on the RHS of the above inequality, the strong convexity property gives
\[ \langle \nabla F(x_{t-1}), x_{t-1}\rangle \ge F(x_{t-1}) + \frac{\lambda}{2}\|x_{t-1}\|^2 \ge \frac{\lambda}{2}\|x_{t-1}\|^2. \tag{A.5} \]
Moreover,
\[ \left|\mathbb{E}_{t-1}\langle \xi_t+\gamma_t+\varepsilon_t, x_{t-1}\rangle\right| = \left|\mathbb{E}_{t-1}\langle \xi_t, x_{t-1}\rangle\right| \le C\mu_t\|x_{t-1}\|, \tag{A.6} \]
\[ \mathbb{E}_{t-1}\|\xi_t+\gamma_t+\varepsilon_t-\nabla F(x_{t-1})\|^2 \le C\left(\|x_{t-1}\|^2+\mu_t^2+1\right), \tag{A.7} \]
\[ \left|\langle \xi_t, \nabla F(x_{t-1})\rangle\right| \le C\mu_t\|x_{t-1}\|, \]
where we use the Cauchy–Schwarz inequality in (A.6), and the Cauchy–Schwarz inequality together with the Lipschitzness of $\nabla F$ in (A.7). Note that $\mu_t \le C\eta_t$, so we further have
\[ \mathbb{E}_{t-1}\|x_t\|^2 \le \left(1-\lambda\eta_t + C\eta_t^2\right)\|x_{t-1}\|^2 + C\eta_t^2. \tag{A.8} \]
There exists some universal constant $t_0 > 0$ such that for all $t > t_0$, $C\eta_t^2 \le \frac{\lambda}{2}\eta_t$ and
\[ \mathbb{E}_{t-1}\|x_t\|^2 \le (1 - C\eta_t)\|x_{t-1}\|^2 + C\eta_t^2. \tag{A.9} \]
For any $t > t_1 \ge t_0$, using the tower rule of expectation, we have
\[
\mathbb{E}_{t_1}\|x_t\|^2 \le \left[\prod_{k=t_1+1}^{t}(1-C\eta_k)\right]\|x_{t_1}\|^2 + C\sum_{k=t_1+1}^{t}\left[\prod_{j=k+1}^{t}(1-C\eta_j)\right]\eta_k^2
\le \exp\left\{-C\sum_{k=t_1+1}^{t}\eta_k\right\}\|x_{t_1}\|^2 + C\eta_t
\le \exp\left\{-C(t-t_1)\eta_t\right\}\|x_{t_1}\|^2 + C\eta_t. \tag{A.10}
\]
From inequality (A.8), we can obtain
\[ \mathbb{E}\|x_{t_0}\|^2 \le C\left(1+\|x_0\|^2\right). \tag{A.11} \]
From inequality (A.10), for all $t \ge 2t_0$,
\[
\mathbb{E}\|x_t\|^2 = \mathbb{E}\,\mathbb{E}_{t/2}\|x_t\|^2 \le \exp\left\{-\frac{Ct\eta_t}{2}\right\}\mathbb{E}\|x_{t/2}\|^2 + C\left(\frac{t}{2}\right)^{-\alpha}
\le t^{-\alpha}\left(\mathbb{E}\|x_{t_0}\|^2 + C\eta_{t_0}\right) + Ct^{-\alpha}, \tag{A.12}
\]
where we reuse the constant $C$ and use that $\exp\{-Ct\eta_t/2\} = \exp\{-C\eta t^{1-\alpha}/2\}$ decays faster than any polynomial in $t$. Combining (A.11) and (A.12), we obtain claim (2.4). By inequality (A.9) and the martingale convergence theorem, $\|x_t\|^2$ converges almost surely; because its second moment converges to 0, it must converge to 0 almost surely.

The proof of claim (2.5) is similar. Because $\mathbb{E}_{t-1}(\varepsilon_t + \gamma_t) = 0$, by the same argument as in Lemma 8, we have
\[
\mathbb{E}_{t-1}\|x_t\|^{2+\delta} \le \mathbb{E}_{t-1}\|x_{t-1}-\eta_t\nabla F(x_{t-1})+\eta_t\xi_t\|^{2+\delta} + C\,\mathbb{E}_{t-1}\|x_{t-1}-\eta_t\nabla F(x_{t-1})+\eta_t\xi_t\|^{1+\delta}\,\mathbb{E}_{t-1}\|\eta_t(\gamma_t+\varepsilon_t)\| + C\,\mathbb{E}_{t-1}\|\eta_t(\gamma_t+\varepsilon_t)\|^{2+\delta}.
\]
By Young's inequality, for any $q > 0$, we can decompose the second term as
\[
C\,\mathbb{E}_{t-1}\|x_{t-1}-\eta_t\nabla F(x_{t-1})+\eta_t\xi_t\|^{1+\delta}\,\mathbb{E}_{t-1}\|\eta_t(\gamma_t+\varepsilon_t)\|
\le q\eta_t\,\mathbb{E}_{t-1}\|x_{t-1}-\eta_t\nabla F(x_{t-1})+\eta_t\xi_t\|^{2+\delta} + C_q\,\eta_t^{1+\delta/2}\,\mathbb{E}_{t-1}\|\gamma_t+\varepsilon_t\|^{2+\delta},
\]
where $C_q$ depends on $q > 0$. Hence,
\[
\mathbb{E}_{t-1}\|x_t\|^{2+\delta} \le (1+q\eta_t)\,\mathbb{E}_{t-1}\|x_{t-1}-\eta_t\nabla F(x_{t-1})+\eta_t\xi_t\|^{2+\delta} + C_q\,\eta_t^{1+\delta/2}\,\mathbb{E}_{t-1}\|\gamma_t+\varepsilon_t\|^{2+\delta}
\]
\[
\le (1+q\eta_t)\|x_{t-1}-\eta_t\nabla F(x_{t-1})\|^{2+\delta} + C_q\eta_t(1+q\eta_t)\mu_t\|x_{t-1}-\eta_t\nabla F(x_{t-1})\|^{1+\delta} + C_q\eta_t^{2+\delta}\mu_t^{2+\delta}(1+q\eta_t) + C_q\,\eta_t^{1+\delta/2}\left(\|x_{t-1}\|^{2+\delta}+\mu_t^{2+\delta}+1\right)
\]
\[
\le (1+q\eta_t)\|x_{t-1}-\eta_t\nabla F(x_{t-1})\|^{2+\delta} + C_q\,\eta_t^{1+\delta/2}\left(1+\|x_{t-1}\|^{2+\delta}\right)
\le \left(1+q\eta_t - C\eta_t + C\eta_t^{1+\delta/2} + C\eta_t^{2+\delta}\right)\|x_{t-1}\|^{2+\delta} + C_q\,\eta_t^{1+\delta/2}.
\]
Because $q$ can be arbitrarily small, we can take some suitable $q$ such that we have the following estimate:
\[ \mathbb{E}_{t-1}\|x_t\|^{2+\delta} \le (1-C\eta_t)\|x_{t-1}\|^{2+\delta} + C\eta_t^{1+\delta/2}. \tag{A.13} \]
With inequality (A.13) and the same argument as in the first case, we obtain claim (2.5).

A.2 Proof of Theorem 2
Proof.
We follow the proof in Polyak and Juditsky (1992). The update step is
\[ x_t = x_{t-1} - \eta_t\nabla F(x_{t-1}) + \eta_t(\xi_t+\gamma_t+\varepsilon_t) = (I_d - \eta_t H)x_{t-1} + \eta_t\left(Hx_{t-1} - \nabla F(x_{t-1}) + \xi_t + \gamma_t + \varepsilon_t\right). \]
By the argument in Polyak and Juditsky (1992), we only need to prove three conditions:
\[ \sum_{i=1}^{\infty} \frac{1}{\sqrt{i}}\,\mathbb{E}\left\|Hx_{i-1} - \nabla F(x_{i-1}) + \xi_i\right\| \tag{A.14} \]
is bounded almost surely,
\[ \mathbb{E}\|\gamma_i + \varepsilon_i\|^2 \tag{A.15} \]
is bounded almost surely, and when $t \to \infty$,
\[ \frac{1}{\sqrt{t}}\sum_{i=1}^{t}(\gamma_i + \varepsilon_i) \Rightarrow \mathcal{N}(0, Q) \tag{A.16} \]
in probability.

By Assumption 3, we know that $\|\nabla^2 F(x) - \nabla^2 F(y)\| \le L_g\|x-y\|$, so $\|Hx_{i-1} - \nabla F(x_{i-1})\| \le C\|x_{i-1}\|^2$. By facts (A.1) to (A.4), we know that $\mathbb{E}\|Hx_{i-1} - \nabla F(x_{i-1}) + \xi_i\| \le Ci^{-\alpha}$, so condition (A.14) holds. Because $\gamma_i$ converges to $0$ almost surely and $\varepsilon_i$ has bounded variance, condition (A.15) holds. To prove condition (A.16), it suffices to verify that
\[ \frac{1}{\sqrt{t}}\sum_{i=1}^{t}\varepsilon_i \Rightarrow \mathcal{N}(0,Q). \]
By the martingale central limit theorem (Durrett, 2019, Theorem 8.2.8), we only need to verify two conditions,
\[ \frac{1}{t}\sum_{i=1}^{t}\mathbb{E}_{i-1}\left[\varepsilon_i\varepsilon_i^\top\right] \to Q, \tag{A.17} \]
\[ \frac{1}{t}\sum_{i=1}^{t}\mathbb{E}\left[\|\varepsilon_i\|^2\,\mathbf{1}_{\{\|\varepsilon_i\| > a\sqrt{t}\}}\right] \to 0, \tag{A.18} \]
in probability for all $a > 0$. To verify condition (A.17), recall that
\[ \varepsilon_t = \frac{d}{\mu_t}\int_0^{\mu_t}\langle \nabla F(x_{t-1}+su_t) - \nabla f(x_{t-1}+su_t;\zeta_t),\, u_t\rangle u_t\,\mathrm{d}s. \]
Let $g(x;\zeta_t) = \nabla F(x) - \nabla f(x;\zeta_t)$ and define the matrix
\[ \Sigma_t(x,y) = \mathbb{E}_{t-1}\,g(x;\zeta_t)g(y;\zeta_t)^\top = \mathbb{E}_{t-1}\left(\nabla F(x)-\nabla f(x;\zeta_t)\right)\left(\nabla F(y)-\nabla f(y;\zeta_t)\right)^\top. \]
From Assumption 3, we can get the following estimate on $g(0;\zeta_t)$:
\[ \mathbb{E}_{t-1}\|g(0;\zeta_t)\|^2 \le C \tag{A.19} \]
for some constant $C$. Moreover,
\[
\mathbb{E}_{t-1}\|g(x;\zeta_t) - g(0;\zeta_t)\|^2 = \mathbb{E}_{t-1}\left\|\nabla F(x)-\nabla f(x;\zeta_t) - \left(\nabla F(0)-\nabla f(0;\zeta_t)\right)\right\|^2
\le C\|x\|^2 + 2\,\mathbb{E}_{t-1}\|\nabla f(x;\zeta_t)-\nabla f(0;\zeta_t)\|^2
\]
\[
= C\|x\|^2 + 2\,\mathbb{E}_{t-1}\left\|\int_0^1 \nabla^2 f(sx;\zeta_t)\,x\,\mathrm{d}s\right\|^2
\le C\|x\|^2 + 2\,\mathbb{E}_{t-1}\int_0^1 \left\|\nabla^2 f(sx;\zeta_t)\,x\right\|^2\mathrm{d}s
\le C\left(\|x\|^2 + \|x\|^2\int_0^1\mathbb{E}_{t-1}\|\nabla^2 f(sx;\zeta_t)\|^2\,\mathrm{d}s\right)
\le C\|x\|^2\left(1+\|x\|^2\right). \tag{A.20}
\]
We can also obtain an estimate for the Hessian matrix $\nabla^2 f(x;\zeta_t)$:
\[ \mathbb{E}_{t-1}\|\nabla^2 f(x;\zeta_t)\|^2 \le 2\,\mathbb{E}_{t-1}\|\nabla^2 f(0;\zeta_t)\|^2 + 2\,\mathbb{E}_{t-1}\left\|\nabla^2 f(x;\zeta_t)-\nabla^2 f(0;\zeta_t)\right\|^2 \le C\left(1+\|x\|^2\right). \tag{A.21} \]
Then with estimates (A.19), (A.20) and (A.21), we have
\[
\|\Sigma_t(x,y) - S\| \le \mathbb{E}_{t-1}\left\|g(x;\zeta_t)g(y;\zeta_t)^\top - g(0;\zeta_t)g(0;\zeta_t)^\top\right\|
\le \mathbb{E}_{t-1}\|g(x;\zeta_t)\|\,\|g(y;\zeta_t)-g(0;\zeta_t)\| + \mathbb{E}_{t-1}\|g(x;\zeta_t)-g(0;\zeta_t)\|\,\|g(0;\zeta_t)\|
\le C(1+\|x\|)\|y\|(1+\|y\|) + C\|x\|(1+\|x\|).
\]
If we rewrite $\mathbb{E}_{t-1}[\varepsilon_t\varepsilon_t^\top \mid u_t]$ as
\[
\mathbb{E}_{t-1}\left[\varepsilon_t\varepsilon_t^\top\mid u_t\right] = \frac{d^2}{\mu_t^2}\,u_tu_t^\top\,\mathbb{E}_{t-1}\left[\int_0^{\mu_t}\!\!\int_0^{\mu_t}\left(\nabla F(x_{t-1}+s_1u_t)-\nabla f(x_{t-1}+s_1u_t;\zeta_t)\right)\left(\nabla F(x_{t-1}+s_2u_t)-\nabla f(x_{t-1}+s_2u_t;\zeta_t)\right)^\top\mathrm{d}s_1\mathrm{d}s_2\,\Big|\,u_t\right]u_tu_t^\top
\]
\[
= \frac{d^2}{\mu_t^2}\,u_tu_t^\top\int_0^{\mu_t}\!\!\int_0^{\mu_t}\Sigma_t(x_{t-1}+s_1u_t,\,x_{t-1}+s_2u_t)\,\mathrm{d}s_1\mathrm{d}s_2\;u_tu_t^\top,
\]
we can derive that
\[ \left\|\mathbb{E}_{t-1}\left[\varepsilon_t\varepsilon_t^\top\mid u_t\right] - d^2\,u_tu_t^\top S\,u_tu_t^\top\right\| \le C\left(1+\|x_{t-1}\|+\mu_t\right)\left(\|x_{t-1}\|+\mu_t\right). \]
By Lemma 7, we have
\[ \left\|\mathbb{E}_{t-1}\left[\varepsilon_t\varepsilon_t^\top\right] - Q\right\| \le C\left(1+\|x_{t-1}\|+\mu_t\right)\left(\|x_{t-1}\|+\mu_t\right). \tag{A.22} \]
Thus $\mathbb{E}_{t-1}[\varepsilon_t\varepsilon_t^\top]$ converges almost surely to $Q$ and condition (A.17) holds.

To verify condition (A.18), we leverage Assumption 2. It can be shown that
\[ \mathbb{E}_{i-1}\left[\|\varepsilon_i\|^2\,\mathbf{1}_{\{\|\varepsilon_i\|>a\sqrt{t}\}}\right] \le \left[\mathbb{E}_{i-1}\|\varepsilon_i\|^{2+\delta}\right]^{\frac{2}{2+\delta}}\left[\mathbb{E}_{i-1}\,\mathbf{1}_{\{\|\varepsilon_i\|>a\sqrt{t}\}}\right]^{\frac{\delta}{2+\delta}}. \]
Note that
\[ \mathbb{E}_{i-1}\,\mathbf{1}_{\{\|\varepsilon_i\|>a\sqrt{t}\}} = \mathbb{P}_{i-1}\left(\|\varepsilon_i\|>a\sqrt{t} \mid x_{i-1}\right) \le \frac{1}{a^2 t}\,\mathbb{E}_{i-1}\|\varepsilon_i\|^2. \]
Therefore, for some given $t$, it satisfies
\[ \mathbb{E}_{i-1}\left[\|\varepsilon_i\|^2\,\mathbf{1}_{\{\|\varepsilon_i\|>a\sqrt{t}\}}\right] \le C\left(\frac{1}{a\sqrt{t}}\right)^{\frac{2\delta}{2+\delta}}\left(1+\|x_i\|^{2+\delta}+\mu_i^{2+\delta}\right)^{\frac{2}{2+\delta}}\left(1+\|x_i\|^2+\mu_i^2\right)^{\frac{\delta}{2+\delta}}, \]
from which we can obtain that
\[ \mathbb{E}\left[\|\varepsilon_i\|^2\,\mathbf{1}_{\{\|\varepsilon_i\|>a\sqrt{t}\}}\right] \le C\left(\frac{1}{a\sqrt{t}}\right)^{\frac{2\delta}{2+\delta}}. \tag{A.23} \]
Taking the average over $t$ in inequality (A.23), we find that condition (A.18) holds when $t$ goes to infinity:
\[ \frac{1}{t}\sum_{i=1}^{t}\mathbb{E}\left[\|\varepsilon_i\|^2\,\mathbf{1}_{\{\|\varepsilon_i\|>a\sqrt{t}\}}\right] \le C\left(\frac{1}{a\sqrt{t}}\right)^{\frac{2\delta}{2+\delta}} \to 0. \]

A.3 Proof of Corollary 3
Proof.
To prove the two claims, we only need to change the respective notation in the above proof of Theorem 2. The only substantial difference is the calculation in inequality (A.22). In the case of the average random gradient estimator (2.2), the calculation of the expectation can be written roughly in the following form:
\[ \mathbb{E}\left(\frac{1}{m}\sum_{i=1}^m u_iu_i^\top\right)S\left(\frac{1}{m}\sum_{i=1}^m u_iu_i^\top\right) = \frac{1}{md(d+2)}\left(2S + \operatorname{tr}(S)I_d\right) + \frac{m-1}{md^2}\,S. \]
In the case of the coordinate-wise estimator (2.3), we do not need to calculate the expectation, as there is no randomness in the evaluation directions.
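The averaged fourth-moment identity used in the calculation above can be checked numerically. The sketch below (illustrative dimensions and sample sizes; not from the paper's code) draws $m$ i.i.d. directions on the unit sphere, forms $M = \frac1m\sum_i u_iu_i^\top$, and compares the Monte Carlo mean of $MSM$ against $\frac{1}{md(d+2)}(2S+\operatorname{tr}(S)I_d) + \frac{m-1}{md^2}S$ for a symmetric test matrix $S$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 4, 3, 400_000
S = rng.standard_normal((d, d))
S = S + S.T                                   # symmetric test matrix

# n independent replicates, each with m uniform directions on the sphere
z = rng.standard_normal((n, m, d))
u = z / np.linalg.norm(z, axis=2, keepdims=True)
M = np.einsum('nmi,nmj->nij', u, u) / m        # (1/m) sum_i u_i u_i^T per replicate

emp = np.einsum('nij,jk,nkl->nil', M, S, M).mean(axis=0)   # Monte Carlo E[M S M]
thy = (2 * S + np.trace(S) * np.eye(d)) / (m * d * (d + 2)) \
      + (m - 1) / (m * d**2) * S
print(np.max(np.abs(emp - thy)))               # small Monte Carlo error
```

Setting $m = 1$ recovers the single-direction identity of Lemma 7; as $m \to \infty$ the expression approaches $S/d^2$, the squared mean $\mathbb{E}[uu^\top]\,S\,\mathbb{E}[uu^\top]$.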
Appendix B. Proofs in Section 3
B.1 Proof of Lemma 4
Proof.
In the case of the average random Hessian estimator (3.1), we first decompose $\widetilde H_t - H$ as follows:
\[
\widetilde H_t - H = \frac{1}{t}\sum_{i=1}^t \frac{\widehat G_i + \widehat G_i^\top}{2} - H
= \frac{1}{t}\sum_{i=1}^t\left(\frac{\widehat G_i+\widehat G_i^\top}{2} - \nabla^2 f(x_{i-1};\zeta_i)\right) + \frac{1}{t}\sum_{i=1}^t\left[\nabla^2 f(x_{i-1};\zeta_i) - \nabla^2 f(0;\zeta_i)\right] + \frac{1}{t}\sum_{i=1}^t \nabla^2 f(0;\zeta_i) - H. \tag{B.1}
\]
For the first term in the decomposition (B.1),
\[
\mathbb{E}_{t-1}\left[\left\|\frac{d^2}{\mu_t^2}\left[f(x_{t-1}+\mu_tu+\mu_tv;\zeta_t)-f(x_{t-1}+\mu_tu;\zeta_t)-f(x_{t-1}+\mu_tv;\zeta_t)+f(x_{t-1};\zeta_t)\right]uv^\top - d^2\,uu^\top\nabla^2 f(x_{t-1};\zeta_t)\,vv^\top\right\|\,\Big|\,u,v\right]
\]
\[
\le C\,\mathbb{E}_{t-1}\left[\left\|\frac{1}{\mu_t^2}\int_0^{\mu_t}\!\!\int_0^{\mu_t}\nabla^2 f(x_{t-1}+s_1u+s_2v;\zeta_t)-\nabla^2 f(x_{t-1};\zeta_t)\,\mathrm{d}s_1\mathrm{d}s_2\right\|\,\Big|\,u,v\right]
\le \frac{C}{\mu_t^2}\int_0^{\mu_t}\!\!\int_0^{\mu_t}\|s_1u+s_2v\|\,\mathrm{d}s_1\mathrm{d}s_2 \le C\mu_t.
\]
Notice that
\[
\mathbb{E}_{t-1}\left\|\left(\frac{d}{m}\sum_{i=1}^m u_iu_i^\top\right)\nabla^2 f(x_{t-1};\zeta_t)\left(\frac{d}{m}\sum_{i=1}^m v_iv_i^\top\right) - \nabla^2 f(x_{t-1};\zeta_t)\right\|
\le \mathbb{E}_{t-1}\left\|\frac{d}{m}\sum_{i=1}^m u_iu_i^\top - I_d\right\|\left\|\nabla^2 f(x_{t-1};\zeta_t)\right\|\left\|\frac{d}{m}\sum_{i=1}^m v_iv_i^\top - I_d\right\|
\le \frac{C}{m}\left(1+\|x_{t-1}\|\right).
\]
Thus by Lemma 7, we have
\[
\mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t\left(\frac{\widehat G_i+\widehat G_i^\top}{2}-\nabla^2 f(x_{i-1};\zeta_i)\right)\right\|
\le \frac{1}{t}\sum_{i=1}^t\left[C\mu_i + \frac{C}{m}\,t^{-1/2}\left(1+\mathbb{E}\|x_{i-1}\|\right)\right] \le Ct^{-\alpha} + \frac{C}{m}\,t^{-1/2}, \tag{B.2}
\]
where in the last inequality we use the fact that $\sum_{i=1}^t i^{-\alpha} \le Ct^{1-\alpha}$.

For the second term in (B.1), it can be shown that
\[
\mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t \nabla^2 f(x_{i-1};\zeta_i)-\nabla^2 f(0;\zeta_i)\right\| \le \frac{1}{t}\sum_{i=1}^t\mathbb{E}\left\|\nabla^2 f(x_{i-1};\zeta_i)-\nabla^2 f(0;\zeta_i)\right\| \le \frac{C}{t}\sum_{i=1}^t\mathbb{E}\|x_{i-1}\| \le Ct^{-\alpha/2}. \tag{B.3}
\]
Moreover, we have
\[
\mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t\nabla^2 f(0;\zeta_i) - H\right\| \le \left[\mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t\nabla^2 f(0;\zeta_i)-H\right\|^2\right]^{1/2} \le Ct^{-1/2}. \tag{B.4}
\]
Combining the previous estimates (B.2), (B.3) and (B.4), our average random Hessian estimator satisfies
\[ \mathbb{E}\left\|\widetilde H_t - H\right\| \le Ct^{-\alpha/2} + C\left(1+\frac{1}{m}\right)t^{-1/2}. \]
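The building block of the estimator bounded above is the second-order finite difference $[f(x+\mu u+\mu v) - f(x+\mu u) - f(x+\mu v) + f(x)]/\mu^2 \approx u^\top\nabla^2 f(x)\,v$, with $O(\mu)$ error when the Hessian is Lipschitz. The sketch below (our own illustrative test, with an assumed quadratic $f$ for which the identity is exact) verifies this numerically.

```python
import numpy as np

rng = np.random.default_rng(2)
d, mu = 6, 0.05
G = rng.standard_normal((d, d))
H = G @ G.T + np.eye(d)            # a fixed positive-definite "Hessian" (assumed, for the test)
b = rng.standard_normal(d)

def f(y):
    # quadratic objective with Hessian exactly H
    return 0.5 * y @ H @ y + b @ y

x, u, v = (rng.standard_normal(d) for _ in range(3))

# second-order finite difference along directions u and v
fd = (f(x + mu*u + mu*v) - f(x + mu*u) - f(x + mu*v) + f(x)) / mu**2
print(fd, u @ H @ v)               # agree up to floating-point error
```

For a non-quadratic $f$, the same check shows the discrepancy shrinking linearly as $\mu \to 0$, matching the $C\mu_t$ bias term in the display above.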
Similarly, for the coordinate-wise Hessian estimator (3.2), we have the following decomposition:
\[
\widetilde H_t - H = \frac{1}{t}\sum_{i=1}^t\frac{\widetilde G_i+\widetilde G_i^\top}{2} - H
= \frac{1}{t}\sum_{i=1}^t\frac{\widetilde G_i+\widetilde G_i^\top - \widehat G_i - \widehat G_i^\top}{2} + \frac{1}{t}\sum_{i=1}^t\left(\frac{\widehat G_i+\widehat G_i^\top}{2}-\nabla^2 f(x_{i-1};\zeta_i)\right) + \frac{1}{t}\sum_{i=1}^t\left[\nabla^2 f(x_{i-1};\zeta_i)-\nabla^2 f(0;\zeta_i)\right] + \frac{1}{t}\sum_{i=1}^t\nabla^2 f(0;\zeta_i) - H. \tag{B.5}
\]
Given $\widehat G_t$, our Bernoulli sampling Hessian estimator $\widetilde G_t$ satisfies
\[
\mathbb{E}\left\|\widetilde G_t - \widehat G_t\right\|^2 = \mathbb{E}\left[\sum_{j=1}^d\sum_{k=1}^d\left(\frac{1}{p}\widehat G_t^{(jk)}B_t^{(jk)} - \widehat G_t^{(jk)}\right)^2\right] = \sum_{j=1}^d\sum_{k=1}^d\mathbb{E}\left(\frac{1}{p}B_t^{(jk)}-1\right)^2\left(\widehat G_t^{(jk)}\right)^2 = \frac{1-p}{p}\sum_{j=1}^d\sum_{k=1}^d\mathbb{E}\left(\widehat G_t^{(jk)}\right)^2 = \frac{1-p}{p}\,\mathbb{E}\left\|\widehat G_t\right\|^2,
\]
where the second equality uses the fact that the $B_i^{(jk)}$ are independent of each other. Therefore,
\[
\mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t\widetilde G_i - \widehat G_i\right\| \le \left[\mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t\widetilde G_i-\widehat G_i\right\|^2\right]^{1/2} \le \left[C\,\frac{1-p}{p}\,t^{-2}\sum_{i=1}^t\mathbb{E}\left\|\widehat G_i\right\|^2\right]^{1/2}.
\]
With $\frac{1}{t}\sum_{i=1}^t\mathbb{E}\|\widehat G_i\|^2 \le C + Ct^{-\alpha}$, the first term in decomposition (B.5) satisfies
\[ \mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t\widetilde G_i-\widehat G_i\right\| \le C\sqrt{\frac{1-p}{p}}\,t^{-1/2}. \tag{B.6} \]
The other terms can be bounded similarly as in the first case:
\[ \mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t\frac{\widehat G_i+\widehat G_i^\top}{2}-\nabla^2 f(x_{i-1};\zeta_i)\right\| \le Ct^{-\alpha}, \tag{B.7} \]
\[ \mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t\nabla^2 f(x_{i-1};\zeta_i)-\nabla^2 f(0;\zeta_i)\right\| \le Ct^{-\alpha/2}, \tag{B.8} \]
\[ \mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t\nabla^2 f(0;\zeta_i)-H\right\| \le Ct^{-1/2}. \tag{B.9} \]
Combining inequalities (B.6), (B.7), (B.8) and (B.9), we obtain the desired result for the coordinate-wise Hessian estimator (3.2).

B.2 Proof of Theorem 5
To prove Theorem 5, we first present the following lemma on the error rate of $\widehat Q_t$.

Lemma 6. Under the conditions of Theorem 5, our online gram matrix estimate $\widehat Q_t$ has the following convergence rate:
\[ \mathbb{E}\left\|\widehat Q_t - Q\right\| \le Ct^{-\alpha/2}. \]

Proof. Recall the update rule
\[ x_t = x_{t-1}-\eta_t\nabla F(x_{t-1})+\eta_t(\xi_t+\gamma_t+\varepsilon_t), \]
and our gram matrix estimate $\widehat Q_t$ is
\[ \widehat Q_t = \frac{1}{t}\sum_{i=1}^t\left(\nabla F(x_{i-1})-\xi_i-\gamma_i-\varepsilon_i\right)\left(\nabla F(x_{i-1})-\xi_i-\gamma_i-\varepsilon_i\right)^\top. \]
It can be seen that we have the following estimates:
\[ \mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t\nabla F(x_{i-1})\nabla F(x_{i-1})^\top\right\| \le \frac{C}{t}\sum_{i=1}^t\mathbb{E}\|x_{i-1}\|^2 \le Ct^{-\alpha}, \]
\[ \mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t\xi_i\xi_i^\top\right\| \le \frac{C}{t}\sum_{i=1}^t\mu_i^2 \le Ct^{-\alpha}, \]
\[ \mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t\gamma_i\gamma_i^\top\right\| \le \frac{C}{t}\sum_{i=1}^t\left(\mathbb{E}\|x_{i-1}\|^2+\mu_i^2\right) \le Ct^{-\alpha}, \]
\[ \mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t\varepsilon_i\varepsilon_i^\top\right\| \le \frac{C}{t}\sum_{i=1}^t\left(\mathbb{E}\|x_{i-1}\|^2+\mu_i^2+1\right) \le C. \]
The cross terms between them can be bounded by the Cauchy–Schwarz inequality. Therefore, we find that all terms in $\widehat Q_t$ except $\frac{1}{t}\sum_{i=1}^t\varepsilon_i\varepsilon_i^\top$ can be bounded by $Ct^{-\alpha/2}$. So it suffices to prove
\[ \mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t\varepsilon_i\varepsilon_i^\top - Q\right\| \le Ct^{-\alpha/2}. \tag{B.10} \]
Define a new sequence $z_t := \varepsilon_t\varepsilon_t^\top - \mathbb{E}_{t-1}\varepsilon_t\varepsilon_t^\top$. Then $z_t$ is a martingale difference sequence and we have
\[ \left\|\varepsilon_t\varepsilon_t^\top - Q\right\| \le \|z_t\| + \left\|\mathbb{E}_{t-1}\varepsilon_t\varepsilon_t^\top - Q\right\| \le \|z_t\| + C\left(\|x_{t-1}\|+\|x_{t-1}\|^2+\mu_t+\mu_t^2\right), \]
where the last inequality leverages inequality (A.22). Now we have
\[
\mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t\varepsilon_i\varepsilon_i^\top - Q\right\| \le \mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t z_i\right\| + \frac{C}{t}\sum_{i=1}^t\mathbb{E}\left(\|x_{i-1}\|+\|x_{i-1}\|^2+\mu_i+\mu_i^2\right) \le \mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t z_i\right\| + Ct^{-\alpha/2}.
\]
Thus we reduce the proof of (B.10) to
\[ \mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t z_i\right\| \le Ct^{-1/2}. \tag{B.11} \]
By Hölder's inequality, it can be derived that
\[ \mathbb{E}_{t-1}\|z_t\|^2 \le C\,\mathbb{E}_{t-1}\|\varepsilon_t\|^4 \le C\left(\|x_{t-1}\|^4+\mu_t^4+1\right). \]
Combining Lemma 8 with Lemma 1, we have
\[ \mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^t z_i\right\|^2 \le \frac{C}{t^2}\sum_{i=1}^t\mathbb{E}\left(\|x_{i-1}\|^4+\mu_i^4+1\right) \le Ct^{-1}. \]
Therefore, condition (B.11) is satisfied through Jensen's inequality.

We now come back to the main proof of Theorem 5.
Proof.
For the thresholded estimator $\widehat H_t$, it is consistent with the rate below:
\[
\mathbb{E}\left\|\widehat H_t - H\right\| \le \mathbb{E}\left\|\widetilde H_t-H\right\| + \mathbb{E}\left\|\widehat H_t-\widetilde H_t\right\|
\le \mathbb{E}\left\|\widetilde H_t-H\right\| + \mathbb{E}\left\|(\widehat H_t-\widetilde H_t)\,\mathbf{1}_{\{\lambda_{\min}(\widetilde H_t)>\kappa\}}\right\| + \mathbb{E}\left\|(\widehat H_t-\widetilde H_t)\,\mathbf{1}_{\{\lambda_{\min}(\widetilde H_t)\le\kappa\}}\right\|
\]
\[
\le \mathbb{E}\left\|\widetilde H_t-H\right\| + C\,\mathbb{P}\left(\lambda_{\min}(\widetilde H_t)\le\kappa\right)
\le \mathbb{E}\left\|\widetilde H_t-H\right\| + C\,\mathbb{P}\left(\left\|H-\widetilde H_t\right\| \ge \lambda_{\min}(H)-\kappa\right)
\le \mathbb{E}\left\|\widetilde H_t-H\right\| + \frac{C}{\lambda-\kappa}\,\mathbb{E}\left\|H-\widetilde H_t\right\| \le Ct^{-\alpha/2}, \tag{B.12}
\]
where the fourth inequality follows from Markov's inequality and the last one from Lemma 4.

By Lemma 9, the inverse matrix error satisfies
\[
\mathbb{E}\left\|\widehat H_t^{-1}-H^{-1}\right\| \le \mathbb{E}\left[\mathbf{1}_{\{\|H^{-1}(\widehat H_t-H)\|\le 1/2\}}\,2\|H^{-1}\|^2\left\|\widehat H_t-H\right\| + \mathbf{1}_{\{\|H^{-1}(\widehat H_t-H)\|\ge 1/2\}}\left\|\widehat H_t^{-1}-H^{-1}\right\|\right]
\]
\[
\le 2\|H^{-1}\|^2\,\mathbb{E}\left\|\widehat H_t-H\right\| + \left(\kappa^{-1}+\lambda_{\min}^{-1}(H)\right)\mathbb{P}\left(\left\|H^{-1}(\widehat H_t-H)\right\|\ge \tfrac12\right)
\le 2\|H^{-1}\|^2\,\mathbb{E}\left\|\widehat H_t-H\right\| + \frac{2}{\lambda}\left(\kappa^{-1}+\lambda_{\min}^{-1}(H)\right)\mathbb{E}\left\|\widehat H_t-H\right\| \le Ct^{-\alpha/2}, \tag{B.13}
\]
where the third inequality follows from Markov's inequality and the last one from (B.12). Also note that
\[ \|Q\| = \frac{d}{d+2}\left\|2S+\operatorname{tr}(S)I_d\right\| \le \frac{d}{d+2}\left(2\|S\| + \left\|\operatorname{tr}(S)I_d\right\|\right) \le d\|S\|. \tag{B.14} \]
We now consider our target term. With our previous results (B.12), (B.13), (B.14) and Lemma 6, we can obtain that
\[
\mathbb{E}\left\|\widehat H_t^{-1}\widehat Q_t\widehat H_t^{-1}-H^{-1}QH^{-1}\right\|
= \mathbb{E}\left\|\widehat H_t^{-1}(\widehat Q_t-Q)\widehat H_t^{-1} + \left(H^{-1}+(\widehat H_t^{-1}-H^{-1})\right)Q\left(H^{-1}+(\widehat H_t^{-1}-H^{-1})\right) - H^{-1}QH^{-1}\right\|
\]
\[
\le \mathbb{E}\left\|\widehat H_t^{-1}(\widehat Q_t-Q)\widehat H_t^{-1}\right\| + \mathbb{E}\left\|H^{-1}Q(\widehat H_t^{-1}-H^{-1})\right\| + \mathbb{E}\left\|(\widehat H_t^{-1}-H^{-1})QH^{-1}\right\| + \mathbb{E}\left\|(\widehat H_t^{-1}-H^{-1})Q(\widehat H_t^{-1}-H^{-1})\right\|
\]
\[
\le \kappa^{-2}\,\mathbb{E}\left\|\widehat Q_t-Q\right\| + 2\lambda^{-1}\|Q\|\,\mathbb{E}\left\|\widehat H_t^{-1}-H^{-1}\right\| + C\|Q\|\,\mathbb{E}\left\|\widehat H_t^{-1}-H^{-1}\right\| \le Ct^{-\alpha/2},
\]
which completes the proof.

Appendix C. Technical Lemmas
Lemma 7.
Let $A \in \mathbb{R}^{d\times d}$ be a symmetric matrix, and let $u$ be a uniform random variable on the unit sphere in $\mathbb{R}^d$. Then
\[ \mathbb{E}\,uu^\top = \frac{1}{d}I_d, \qquad \mathbb{E}\,uu^\top A\,uu^\top = \frac{1}{d(d+2)}\left(2A+\operatorname{tr}(A)I_d\right). \]

Proof. Let $z$ be a random vector with $z\sim\mathcal{N}(0,I_d)$. Then, by computing each entry separately, we have
\[ \mathbb{E}\|z\|^2 = d, \quad \mathbb{E}\|z\|^4 = d^2+2d, \quad \mathbb{E}\left[zz^\top\right]=I_d, \quad \mathbb{E}\left[zz^\top A\,zz^\top\right] = 2A+\operatorname{tr}(A)I_d. \]
On the other hand, the Gaussian vector $z$ can be decomposed into independent radial and spherical parts:
\[ \mathbb{E}\left[zz^\top\right] = \mathbb{E}\left[\|z\|^2\,\frac{z}{\|z\|}\frac{z^\top}{\|z\|}\right] = d\,\mathbb{E}\,uu^\top, \]
\[ \mathbb{E}\left[zz^\top A\,zz^\top\right] = \mathbb{E}\left[\|z\|^4\,\frac{z}{\|z\|}\frac{z^\top}{\|z\|}A\frac{z}{\|z\|}\frac{z^\top}{\|z\|}\right] = \left(d^2+2d\right)\mathbb{E}\,uu^\top A\,uu^\top. \]
Our final results can be derived from the above.

The following lemma is from Assouad (1975); we include the proof here.
Lemma 8.
Let $\{X_n\}$ be a martingale difference sequence, i.e. $\mathbb{E}[X_n \mid X_1,\dots,X_{n-1}]=0$. For any $1 < p \le 2$, there exists a constant $C_p$ such that
\[ \mathbb{E}\left\|\sum_{i=1}^n X_i\right\|^p \le C_p\sum_{i=1}^n\mathbb{E}\left[\|X_i\|^p \mid X_1,\dots,X_{i-1}\right], \]
where $\|\cdot\|$ is any norm on $\mathbb{R}^d$.

Proof. For any $p \ge 1$, there exists a constant $C$ such that for any $a,b\in\mathbb{R}^d$,
\[ \frac12\left(\|a+b\|_2^p+\|a-b\|_2^p\right) \le \|a\|_2^p + C\|b\|_2^p, \]
where $\|\cdot\|_2$ is the 2-norm. To see this, note that in the one-dimensional case the inequality is equivalent to
\[ \frac12\left(|1+x|^p+|1-x|^p\right) \le 1 + C|x|^p. \]
In the limits $x \to \pm\infty$ and $x \to 0$, the required constant $C$ stays bounded, so such a $C$ exists. In the general $d$-dimensional case, note that $\|x\|_2^p$ is twice differentiable for $x \ne 0$, so the same argument holds by using the Taylor expansion, and we have
\[ \frac12\left(\|a+b\|_2^p+\|a-b\|_2^p\right) \le \|a\|_2^p + C\|b\|_2^p. \]
Thus, applying this with $a = \sum_{i=1}^{n-1}X_i$ and $b = X_n$, we can show that
\[
\mathbb{E}_{n-1}\left\|\sum_{i=1}^n X_i\right\|_2^p = \mathbb{E}_{n-1}\left\|\sum_{i=1}^{n-1}X_i + X_n\right\|_2^p
\le 2\left\|\sum_{i=1}^{n-1}X_i\right\|_2^p + C_p\,\mathbb{E}_{n-1}\|X_n\|_2^p - \mathbb{E}_{n-1}\left\|\sum_{i=1}^{n-1}X_i - X_n\right\|_2^p
\le \left\|\sum_{i=1}^{n-1}X_i\right\|_2^p + C_p\,\mathbb{E}_{n-1}\|X_n\|_2^p,
\]
where the last step uses conditional Jensen's inequality together with $\mathbb{E}_{n-1}X_n = 0$, which gives $\mathbb{E}_{n-1}\|\sum_{i=1}^{n-1}X_i - X_n\|_2^p \ge \|\sum_{i=1}^{n-1}X_i\|_2^p$. By induction, it can be obtained that
\[ \mathbb{E}\left\|\sum_{i=1}^n X_i\right\|_2^p \le C_p\sum_{i=1}^n\mathbb{E}\left[\|X_i\|_2^p \mid X_1,\dots,X_{i-1}\right]. \]
For any general norm, there exists a constant $C$ such that
\[ \frac{1}{C}\|X\|_2 \le \|X\| \le C\|X\|_2. \]
Then we have the same result,
\[ \mathbb{E}\left\|\sum_{i=1}^n X_i\right\|^p \le C_p\sum_{i=1}^n\mathbb{E}\left[\|X_i\|^p \mid X_1,\dots,X_{i-1}\right]. \]

We now provide a matrix perturbation inequality from Chen et al. (2020).
Lemma 9.
Let the matrix $B$ have the decomposition $B = A + E$, where $A$ and $B$ are assumed to be invertible. We have
\[ \left\|B^{-1}-A^{-1}\right\| \le \frac{\|A^{-1}\|^2\|E\|}{1-\|A^{-1}E\|}. \]

Proof.
Notice that
\[ B^{-1} = (A+E)^{-1} = A^{-1} - A^{-1}\left(A^{-1}+E^{-1}\right)^{-1}A^{-1} = A^{-1} - A^{-1}E\left(A^{-1}E+I\right)^{-1}A^{-1}. \]
Therefore, the inversion error is
\[
\left\|B^{-1}-A^{-1}\right\| = \left\|A^{-1}E\left(A^{-1}E+I\right)^{-1}A^{-1}\right\| \le \|A^{-1}\|^2\|E\|\,\left\|\left(A^{-1}E+I\right)^{-1}\right\|
\le \frac{\|A^{-1}\|^2\|E\|}{\lambda_{\min}\left(A^{-1}E+I\right)} \le \frac{\|A^{-1}\|^2\|E\|}{1-\|A^{-1}E\|},
\]
where we use Weyl's inequality in the last inequality.
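The perturbation bound of Lemma 9 is easy to probe numerically. The sketch below (our own illustrative check, with an assumed well-conditioned $A$ and a small perturbation $E$ chosen so that $\|A^{-1}E\| < 1$, the bound's precondition) compares the exact inversion error against the bound in spectral norm.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
A = 4 * np.eye(d) + 0.3 * rng.standard_normal((d, d))   # well-conditioned A (assumed)
E = 0.02 * rng.standard_normal((d, d))                  # small perturbation
B = A + E

op = lambda M: np.linalg.norm(M, 2)                     # spectral norm
Ainv = np.linalg.inv(A)
assert op(Ainv @ E) < 1                                 # precondition of the bound

lhs = op(np.linalg.inv(B) - Ainv)                       # actual inversion error
rhs = op(Ainv) ** 2 * op(E) / (1 - op(Ainv @ E))        # Lemma 9 bound
print(lhs, rhs)                                         # lhs is below rhs
```

When $\|A^{-1}E\|$ approaches 1 the bound degrades, reflecting that $B$ may become nearly singular; this is why the thresholding step in the proof of Theorem 5 keeps $\lambda_{\min}(\widehat H_t)$ bounded away from zero.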