Minimax Efficient Finite-Difference Stochastic Gradient Estimators Using Black-Box Function Evaluations
Henry Lam†, Haidong Li‡, Xuhui Zhang§

∗ A preliminary conference version of this work has appeared in [18].
† Department of Industrial Engineering and Operations Research, Columbia University, New York, NY. Email: [email protected]
‡ Department of Industrial Engineering and Management, Peking University, Beijing, China. Email: [email protected]
§ Department of Management Science and Engineering, Stanford University, Stanford, CA. Email: [email protected]

Abstract
We consider stochastic gradient estimation using noisy black-box function evaluations. A standard approach is to use the finite-difference method or its variants. While natural, it is, to our knowledge, an open question whether its statistical accuracy is the best possible. This paper argues that it is, by showing that central finite-difference is a nearly minimax optimal zeroth-order gradient estimator, among both the class of linear estimators and the much larger class of all (nonlinear) estimators.
1 Introduction

Stochastic gradient estimation is of central importance in simulation analysis and optimization. It concerns the estimation of gradients under noisy environments driven by data or Monte Carlo simulation runs. This problem arises as a key ingredient in sensitivity analysis and uncertainty quantification for simulation models, descent-based algorithms in stochastic optimization and machine learning, and other applications such as financial portfolio hedging. For an overview of stochastic gradient estimation and its applications, see, e.g., [19], [11], [13] Chapter 7 and [1] Chapter 7.

In this paper, we consider stochastic gradient estimation when only a noisy simulation oracle to evaluate the function value or model output is available. This corresponds to black-box settings in which it is costly, or even impossible, to utilize the underlying dynamics of a simulation model, or to distort the data collection mechanism in an experiment given the input. In stochastic optimization, such an oracle is also known as a zeroth-order oracle [12, 20]. These gradient estimators are in contrast to unbiased estimators obtained from methods such as infinitesimal perturbation analysis [15, 17], the likelihood ratio or score function method [14, 22, 21], measure-valued differentiation [16] and other variants that require structural information on model dynamics.

Under the above setting, the most natural and common approach is to use the finite-difference (FD) method [26, 9, 10]. This entails simulating the function values at two neighboring inputs and using the first principle to approximate the derivative. The resulting estimator has a bias coming from this derivative approximation, on top of the variance coming from the function evaluation noise. This leads to a subcanonical overall mean squared error (MSE) and a need to rightly tune the perturbation size between the two input values. It is widely known that, for twice continuously differentiable functions, the optimal attainable MSE for central finite-difference (CFD) schemes is of order $O(n^{-2/3})$, where $n$ refers to the number of differencing pairs in the simulation. On the other hand, when one uses forward or backward finite-differences, the optimal MSE deteriorates to $O(n^{-1/2})$.

Although the optimal MSEs within the classes of FD schemes are well-known, a question arises whether these classes are optimal or better compared to other, possibly much larger, classes of gradient estimators. Our goal in this paper is to give a first such study on the optimality on a class level.

Our main results show that, under a general setting, CFD is nearly optimal among all possible gradient estimation schemes. This optimality is in a minimax sense. Namely, we consider the MSE of any gradient estimator over a collection of twice differentiable functions with unknown function characteristics (e.g., function value and higher-order gradients). Subject to this uncertainty of the function, we consider the minimizer of the worst-case MSE over this function collection, giving rise to what we call the minimax risk.
Among the class of linear estimators, we show that, in the one-dimensional case, CFD exactly achieves the minimax risk, whereas in the multi-dimensional case it achieves so up to a multiplicative factor that depends sublinearly on the input dimension. Furthermore, we show that, among the much larger class of all nonlinear estimators, CFD remains nearly minimax optimal up to a multiplicative factor that is polynomial in the dimension.

In terms of methodological contributions, we derive our linear minimax results by using a new elementary proof. We derive our general minimax results via Le Cam's method [24] with the imposition of an adversarially chosen hypothesis test, and the notion of modulus of continuity [5] to obtain the worst-case functions derived from this test. Lastly, we also demonstrate that, without extra knowledge on the gradient, randomized schemes such as simultaneous perturbation will lead to unbounded worst-case risks in general, due to the interaction between the gradient magnitude and the additional variance coming from the random perturbation. This indicates that less conservative analyses along this line would require more information on the magnitude of the gradient of interest.

Our work is related to, and contrasted with, derivative estimation in nonparametric regression (e.g., [6, 7]), which focuses on the estimation of conditional expectations and their derivatives given input values, a similar setting as ours. However, in these studies the data, and in particular the available input values, are often assumed given a priori. In contrast, in stochastic gradient estimation and optimization, one often has the capability to select the input points at which the function evaluation is conducted. This therefore endows more flexibility than nonparametric regression and, correspondingly, translates to superior minimax rates in our setting. For example, [7] has established a minimax risk of order $O(n^{-2/5})$ for nonparametric derivative estimation, which is slower than our $O(n^{-2/3})$. Finally, we note that other works [2, 3, 25] have studied derivative estimation uniformly well over regular or equi-spaced input design points. Moreover, these papers consider asymptotic risks as the number of input points grows, in contrast to the finite-sample results in this work.

The remainder of the paper is as follows. Section 2 focuses on linear minimax risk and the corresponding optimal or nearly optimal estimators. Section 3 focuses on general risks and estimators. Section 4 concludes the paper and discusses future directions.

2 Linear Minimax Risk

In this section we focus on the class of linear estimators. Section 2.1 first presents the single-dimensional case. Section 2.2 generalizes it to higher-dimensional counterparts, and Section 2.3 further studies the use of simultaneous random perturbation in this setting. We will derive bounds on minimax risks and show that CFD is a nearly optimal estimator. Furthermore, these bounds are tight for any finite sample in the single-dimensional case, and also in the multi-dimensional case under additional restrictions.
2.1 Single-Dimensional Case

We first introduce our setting. Let $f(\cdot): \mathbb{R} \to \mathbb{R}$ be a performance measure of interest, where we have access to an unbiased estimate $Y(x)$ for any chosen $x \in \mathbb{R}$. In other words, $Y(\cdot)$ is a family of random variables indexed by $x$ such that $E[Y(x)] = f(x)$ and $Var(Y(x)) = \sigma^2(x)$ for any $x \in \mathbb{R}$. Suppose $x_0$ is the point of interest. Our goal is to estimate the derivative $f'(x_0)$.
Given simulation budget $n \ge 1$, we can simulate independently at input design points $x_0 + \delta_j$, $j = 1, \ldots, n$, with $\delta_j$ of our choice, giving outputs $Y_j(x_0 + \delta_j)$. We consider the class of linear estimators of the form
$$L_n = \sum_{j=1}^n w_j Y_j(x_0 + \delta_j), \qquad (1)$$
where the $w_j$ are the linear coefficients or weights. Note that for even budget $n$ the CFD scheme
$$\bar{L}_n = \frac{1}{n/2}\sum_{j=1}^{n/2}\frac{Y_{2j-1}(x_0 + \delta) - Y_{2j}(x_0 - \delta)}{2\delta} \qquad (2)$$
is an example of a linear estimator, where $\delta_j = (-1)^{j+1}\delta$ and $w_j = (-1)^{j+1}/(n\delta)$, for a perturbation size $\delta > 0$.

We aim to study the optimality within the class of all linear estimators of the form (1), and in particular investigate the role of CFD. We use the MSE as the performance criterion, which depends on the a priori unknown function $f$. Our premise is a minimax framework that seeks schemes to minimize the worst-case MSE, among a suitably wide class of functions $f$ and simulation noises. More precisely, we consider
$$\mathcal{A} = \Big\{ f(\cdot) : f^{(2)}(x_0) \text{ exists}, \ |f^{(2)}(x_0)| \le b, \text{ and } \Big| f(x) - f(x_0) - f'(x_0)(x - x_0) - \frac{f^{(2)}(x_0)}{2}(x - x_0)^2 \Big| \le \frac{a}{6}|x - x_0|^3 \text{ for all } x \Big\} \qquad (3)$$
and $\mathcal{B} = \{\sigma^2(\cdot) : \sigma^2(x) \le d^2 \text{ for all } x\}$, where $a, b, d > 0$. As will be seen, the parameter $b$ does not play a role in our deductions.

Roughly speaking, $\mathcal{A}$ contains all twice differentiable functions whose second-order derivative is absolutely bounded by $b$ and third-order derivative is absolutely bounded by $a$. This characterization is not exact, however, since the Taylor-series type expansion in (3) applies only to the point $x_0$, and thus $\mathcal{A}$ is more general than the aforementioned characterization. The reason for proposing this class, instead of other similar ones, is that $\mathcal{A}$ allows us to obtain a very accurate minimax analysis and derivation of optimal estimators.

We define the linear minimax $L_2$-risk as
$$\mathcal{R}(n, \mathcal{A}, \mathcal{B}) = \inf_{\substack{\delta_j,\, j=1,\ldots,n \\ w_j,\, j=1,\ldots,n}} \ \sup_{f(\cdot) \in \mathcal{A},\, \sigma^2(\cdot) \in \mathcal{B}} E\big(L_n - f'(x_0)\big)^2,$$
which is the minimum worst-case MSE among all functions $f \in \mathcal{A}$ and noise levels in $\mathcal{B}$. The linear estimators are selected through the design points $x_0 + \delta_j$ and the linear coefficients $w_j$.
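For concreteness, the following minimal Python sketch implements the CFD estimator $\bar{L}_n$ in (2) given access to a noisy oracle $Y(\cdot)$; the specific oracle used in the example (a sine function with Gaussian noise) is purely illustrative and not part of the formal setup.

```python
import numpy as np

def cfd_estimator(oracle, x0, delta, n, rng):
    """Central finite-difference estimate of f'(x0) using n noisy
    evaluations (n even), averaging n/2 independent differencing pairs
    as in the scheme (2)."""
    assert n % 2 == 0, "the CFD scheme pairs up evaluations, so n must be even"
    pairs = n // 2
    y_plus = np.array([oracle(x0 + delta, rng) for _ in range(pairs)])
    y_minus = np.array([oracle(x0 - delta, rng) for _ in range(pairs)])
    return np.mean((y_plus - y_minus) / (2.0 * delta))

# Illustrative noisy oracle: Y(x) = sin(x) + N(0, 0.1^2), so f'(x) = cos(x).
def oracle(x, rng):
    return np.sin(x) + 0.1 * rng.standard_normal()

rng = np.random.default_rng(0)
print(cfd_estimator(oracle, x0=1.0, delta=0.1, n=1000, rng=rng))  # approx cos(1) = 0.5403
```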
The following theorem gives the exact expression for the minimax risk, and shows that a suitably calibrated CFD attains this risk level. In other words, CFD is the optimal linear estimator among the problem class specified by $(\mathcal{A}, \mathcal{B})$. The proof of this result involves only elementary inequalities.

Theorem 1. For any $n \ge 1$,
$$\mathcal{R}(n, \mathcal{A}, \mathcal{B}) \ge \left(\frac{3a^2d^4}{16}\right)^{1/3} n^{-2/3}.$$
Besides, if the budget $n$ is even, the CFD estimator $\bar{L}_n$ in (2) with $\delta = (18d^2/a^2)^{1/6}\, n^{-1/6}$ satisfies
$$\sup_{f(\cdot) \in \mathcal{A},\, \sigma^2(\cdot) \in \mathcal{B}} E(\bar{L}_n - f'(x_0))^2 = \left(\frac{3a^2d^4}{16}\right)^{1/3} n^{-2/3}.$$
Thus the estimator $\bar{L}_n$ is the best linear estimator in the class of problems defined by $(\mathcal{A}, \mathcal{B})$.

Proof. For any designs $\delta_j$, $j = 1, \ldots, n$ and linear coefficients $w_j$, $j = 1, \ldots, n$, supposing $f(\cdot) \in \mathcal{A}$ and $f(\cdot) \in C^3(\mathbb{R})$, we have, by Taylor's expansion,
$$f(x_0 + \delta_j) = f(x_0) + f'(x_0)\delta_j + \frac{f^{(2)}(x_0)}{2}\delta_j^2 + \frac{f^{(3)}(x_0 + t_j\delta_j)}{6}\delta_j^3,$$
for any $j = 1, \ldots, n$, where $0 \le t_j \le 1$.
Thus the bias of the estimator $L_n$ is
$$EL_n - f'(x_0) = f(x_0)\sum_{j=1}^n w_j + f'(x_0)\Big(\sum_{j=1}^n w_j\delta_j - 1\Big) + \frac{f^{(2)}(x_0)}{2}\sum_{j=1}^n w_j\delta_j^2 + \sum_{j=1}^n \frac{f^{(3)}(x_0 + t_j\delta_j)}{6}\, w_j\delta_j^3.$$
On the other hand, the variance of the estimator $L_n$ is
$$Var(L_n) = \sum_{j=1}^n w_j^2\, \sigma^2(x_0 + \delta_j).$$
If $\sum_{j=1}^n w_j \ne 0$, we consider the particular cases where $f'(x) = f^{(2)}(x) = f^{(3)}(x) = 0$ for all $x$, and $f(x_0)$ arbitrary, concluding that
$$\sup_{f(\cdot)\in\mathcal{A}} (EL_n - f'(x_0))^2 = \infty.$$
Therefore, for the purpose of deriving a lower bound, we can assume without loss of generality that $\sum_{j=1}^n w_j = 0$. Similarly we can assume $\sum_{j=1}^n w_j\delta_j = 1$. Furthermore, if $\delta_i = \delta_j$, we assume w.l.o.g. $w_i = w_j$ since it leads to smaller variance. Now consider $f(\cdot) \in \mathcal{A}$ such that $f(x_0 + \delta_j) = \frac{a}{6}|\delta_j|^3 \cdot \mathrm{sign}(w_j)$, and $f(x) = 0$ otherwise. In such a case the MSE simplifies to
$$\Big(\frac{a}{6}\sum_{j=1}^n |w_j\delta_j^3|\Big)^2 + \sum_{j=1}^n w_j^2\sigma^2(x_0 + \delta_j).$$
Further considering the case $\sigma^2(x_0 + \delta_j) = d^2$, we get
$$\sup_{f(\cdot)\in\mathcal{A},\,\sigma^2(\cdot)\in\mathcal{B}} E(L_n - f'(x_0))^2 \ge \frac{a^2}{36}\Big(\sum_{j=1}^n |w_j\delta_j^3|\Big)^2 + d^2\sum_{j=1}^n w_j^2.$$
Now since $\sum_{j=1}^n w_j\delta_j = 1$, by Hölder's inequality,
$$1 \le \sum_{j=1}^n |w_j|^{1/3}|\delta_j|\cdot|w_j|^{2/3} \le \Big(\sum_{j=1}^n |w_j\delta_j^3|\Big)^{1/3}\Big(\sum_{j=1}^n |w_j|\Big)^{2/3}$$
and so
$$\sum_{j=1}^n |w_j\delta_j^3| \ge \frac{1}{\big(\sum_{j=1}^n |w_j|\big)^2} \ge \frac{1}{n\sum_{j=1}^n w_j^2},$$
where we used Hölder's inequality $\big(\sum_{j=1}^n |w_j|\big)^2 \le n\sum_{j=1}^n w_j^2$. Thus
$$\sup_{f(\cdot)\in\mathcal{A},\,\sigma^2(\cdot)\in\mathcal{B}} E\big(L_n - f'(x_0)\big)^2 \ge \frac{a^2}{36}\frac{1}{n^2\big(\sum_{j=1}^n w_j^2\big)^2} + d^2\sum_{j=1}^n w_j^2 \ge \left(\frac{3a^2d^4}{16}\right)^{1/3} n^{-2/3},$$
where the lower bound is achieved at
$$\sum_{j=1}^n w_j^2 = \left(\frac{a^2}{18d^2}\right)^{1/3} n^{-2/3}.$$
Since $\delta_j, w_j$ are arbitrary, we conclude that
$$\inf_{\substack{\delta_j,\, j=1,\ldots,n \\ w_j,\, j=1,\ldots,n}} \ \sup_{f(\cdot)\in\mathcal{A},\,\sigma^2(\cdot)\in\mathcal{B}} E\big(L_n - f'(x_0)\big)^2 \ge \left(\frac{3a^2d^4}{16}\right)^{1/3} n^{-2/3}. \qquad (4)$$
On the other hand, supposing the budget $n$ is even, we see that the bias of the estimator $\bar{L}_n$ satisfies
$$|E\bar{L}_n - f'(x_0)| = \left|\frac{f(x_0+\delta) - f(x_0-\delta)}{2\delta} - f'(x_0)\right| = \frac{\big|f(x_0+\delta) - f(x_0-\delta) - 2\delta f'(x_0)\big|}{2\delta}$$
$$= \frac{\Big|\big(f(x_0+\delta) - f(x_0) - f'(x_0)\delta - \frac{f^{(2)}(x_0)}{2}\delta^2\big) - \big(f(x_0-\delta) - f(x_0) + f'(x_0)\delta - \frac{f^{(2)}(x_0)}{2}\delta^2\big)\Big|}{2\delta}$$
$$\le \frac{\big|f(x_0+\delta) - f(x_0) - f'(x_0)\delta - \frac{f^{(2)}(x_0)}{2}\delta^2\big| + \big|f(x_0-\delta) - f(x_0) + f'(x_0)\delta - \frac{f^{(2)}(x_0)}{2}\delta^2\big|}{2\delta} \le \frac{a}{6}\delta^2.$$
Also, the variance satisfies
$$Var(\bar{L}_n) \le \frac{d^2}{n\delta^2}.$$
Thus the MSE satisfies
$$E\big(\bar{L}_n - f'(x_0)\big)^2 \le \frac{a^2}{36}\delta^4 + \frac{d^2}{n\delta^2} = \left(\frac{3a^2d^4}{16}\right)^{1/3} n^{-2/3} \qquad (5)$$
by plugging in $\delta = (18d^2/a^2)^{1/6}\, n^{-1/6}$. We note that the bound (5) holds for any $f(\cdot) \in \mathcal{A}$ and $\sigma^2(\cdot) \in \mathcal{B}$. Combining (4) and (5), we conclude the proof of Theorem 1. □
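As a numerical sanity check of Theorem 1, the following sketch (reusing `cfd_estimator` and `oracle` from the earlier illustration) traces the MSE of $\bar{L}_n$ with the calibrated $\delta = (18d^2/a^2)^{1/6} n^{-1/6}$. The choices $a = 1$ (valid since $|\sin'''| \le 1$) and $d = 0.1$ are our illustrative assumptions; since the theorem bounds the worst case, the empirical MSE for this particular $f$ should decay at the $n^{-2/3}$ rate and stay below the worst-case value.

```python
# Empirical check of the O(n^{-2/3}) behavior in Theorem 1, using the
# illustrative oracle above: f(x) = sin(x), |f'''| <= 1 so a = 1, and d = 0.1.
a, d, x0 = 1.0, 0.1, 1.0
true_grad = np.cos(x0)
for n in [100, 1000, 10000, 100000]:
    delta = (18 * d**2 / a**2) ** (1 / 6) * n ** (-1 / 6)  # calibrated perturbation
    mse = np.mean([(cfd_estimator(oracle, x0, delta, n, rng) - true_grad) ** 2
                   for _ in range(200)])
    worst_case = (3 * a**2 * d**4 / 16) ** (1 / 3) * n ** (-2 / 3)
    print(n, mse, worst_case)  # empirical MSE vs. the worst-case risk level
```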
2.2 Multi-Dimensional Case

We now generalize to the multi-dimensional case. Consider a performance measure with multi-dimensional input, $f(\cdot): \mathbb{R}^p \to \mathbb{R}$, where $p \ge 2$. Analogous to the single-dimensional case, suppose $Y(\cdot)$ is an unbiased estimate of $f(\cdot)$. We would like to estimate $\nabla f(x_0)$, where $x_0 \in \mathbb{R}^p$ is the point of interest. Consider
$$\mathcal{A}_q = \Big\{f(\cdot) : \nabla^2 f(x_0) \text{ exists}, \ \|\nabla^2 f(x_0)\| \le b, \text{ and } \Big|f(x) - f(x_0) - \nabla f(x_0)^T(x-x_0) - \frac{1}{2}(x-x_0)^T\nabla^2 f(x_0)(x-x_0)\Big| \le \frac{a}{6}\|x-x_0\|_q^3 \text{ for all } x\Big\}$$
and $\mathcal{B} = \{\sigma^2(\cdot) : \sigma^2(x) \le d^2 \text{ for all } x\}$, where $a, b, d > 0$ and $\|\cdot\|_q$ denotes the $\ell_q$-norm, $q \in \{1, 2, \infty\}$.
Given simulation budget $n \ge 1$, we simulate independently at input design points $x_0 + \delta_j$, $j = 1, \ldots, n$, and form the vector-valued linear estimator
$$L_n^p = \sum_{j=1}^n w_j Y_j(x_0 + \delta_j),$$
where $w_j \in \mathbb{R}^p$ are the vector-valued linear coefficients.

Like before, we define the linear minimax $L_2$-risk as
$$\mathcal{R}_p(n, \mathcal{A}_q, \mathcal{B}) = \inf_{\substack{\delta_j,\, j=1,\ldots,n \\ w_j,\, j=1,\ldots,n}} \ \sup_{f(\cdot)\in\mathcal{A}_q,\,\sigma^2(\cdot)\in\mathcal{B}} E\|L_n^p - \nabla f(x_0)\|_2^2.$$

Intuitively, the main distinction of the multi-dimensional setting described above, compared to the single-dimensional case, is the restriction on the "third-order gradient" characterized by the $\ell_q$-norm, where the choice of $q$ can affect the resulting risk bounds. The following theorem provides a lower estimate of $\mathcal{R}_p$ and shows that applying CFD on each of the $p$ dimensions matches this lower estimate up to a multiplicative factor depending sublinearly on $p$. This implies in particular that multi-dimensional CFD is rate-optimal in the sample size $n$. Moreover, if the signs of the weights $w_j$ in $L_n^p$ are further restricted to be the same across components, then CFD achieves the exact minimax risk when $q = 1$ or $2$.
Theorem 2. For any $n \ge 1$,
$$\mathcal{R}_p(n, \mathcal{A}_q, \mathcal{B}) \ge p^{4/3}\left(\frac{3a^2d^4}{16}\right)^{1/3} n^{-2/3}, \quad \text{for } q = 1, 2; \qquad (6)$$
$$\mathcal{R}_p(n, \mathcal{A}_q, \mathcal{B}) \ge p^{2/3}\left(\frac{3a^2d^4}{16}\right)^{1/3} n^{-2/3}, \quad \text{for } q = \infty. \qquad (7)$$
Besides, if the budget $n$ is a multiple of both $p$ and $2$, we allocate $n/p$ budget to, and form the CFD estimator with $\delta = (18d^2/a^2)^{1/6}(n/p)^{-1/6}$ on, each dimension. Denote the resulting estimator as $\bar{L}_n^p$. Then
$$\sup_{f(\cdot)\in\mathcal{A}_q,\,\sigma^2(\cdot)\in\mathcal{B}} E\|\bar{L}_n^p - \nabla f(x_0)\|_2^2 = p^{5/3}\left(\frac{3a^2d^4}{16}\right)^{1/3} n^{-2/3}. \qquad (8)$$
Thus the estimator $\bar{L}_n^p$ is optimal in the class of problems defined by $(\mathcal{A}_q, \mathcal{B})$ up to a sublinear multiplicative factor in $p$. Moreover, if we further restrict each coefficient $w_j$ to have the same sign across components, i.e. $\mathrm{sign}((w_j)_k) = \mathrm{sign}((w_j)_l)$ for any $1 \le k, l \le p$, then we have
$$\mathcal{R}_p(n, \mathcal{A}_q, \mathcal{B}) \ge p^{5/3}\left(\frac{3a^2d^4}{16}\right)^{1/3} n^{-2/3}, \quad \text{for } q = 1, 2,$$
$$\mathcal{R}_p(n, \mathcal{A}_q, \mathcal{B}) \ge p\left(\frac{3a^2d^4}{16}\right)^{1/3} n^{-2/3}, \quad \text{for } q = \infty$$
for this restricted class of linear estimators, so that $\bar{L}_n^p$ is exactly optimal when considering the function class $\mathcal{A}_q$ with $q = 1, 2$.

We remark that the $p^{1/3}$ gap between (6) and (8) is due to a technical challenge: the $\ell_\infty \to \ell_2$ matrix norm does not admit an explicit expression. This challenge is bypassed if one restricts the same sign across all components in each coefficient, which recovers the minimax optimality of CFD.
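For concreteness, here is a sketch of the coordinate-wise CFD estimator $\bar{L}_n^p$ of Theorem 2, splitting the budget evenly across dimensions; the oracle interface follows the earlier single-dimensional sketch, now with vector inputs.

```python
import numpy as np

def cfd_gradient(oracle, x0, delta, n, rng):
    """Coordinate-wise CFD estimate of the gradient of f at x0 in R^p:
    the budget n (a multiple of 2p) is split into n/p evaluations per
    dimension, i.e., n/(2p) differencing pairs along each coordinate."""
    p = len(x0)
    assert n % (2 * p) == 0, "need n to be a multiple of both p and 2"
    grad = np.zeros(p)
    for i in range(p):
        e = np.zeros(p)
        e[i] = delta
        diffs = [(oracle(x0 + e, rng) - oracle(x0 - e, rng)) / (2 * delta)
                 for _ in range(n // (2 * p))]
        grad[i] = np.mean(diffs)
    return grad
```

With $\delta = (18d^2/a^2)^{1/6}(n/p)^{-1/6}$ this is exactly the estimator $\bar{L}_n^p$ appearing in (8).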
Proof. Suppose $f(\cdot) \in \mathcal{A}_q$, $f(\cdot) \in C^3(\mathbb{R}^p)$. According to Taylor's expansion,
$$f(x_0 + \delta_j) = f(x_0) + \nabla f(x_0)^T\delta_j + \frac{1}{2}\delta_j^T\nabla^2 f(x_0)\delta_j + \frac{1}{6}\sum_{k_1,k_2,k_3}\big(\nabla^3 f(x_0 + t_j\delta_j)\big)_{k_1k_2k_3}(\delta_j)_{k_1}(\delta_j)_{k_2}(\delta_j)_{k_3},$$
for any $j = 1, \ldots, n$, where $0 \le t_j \le 1$.
Thus the bias of the estimator $L_n^p$ satisfies
$$E(L_n^p)_i - (\nabla f(x_0))_i = f(x_0)\sum_{j=1}^n (w_j)_i + \nabla f(x_0)^T\Big(\sum_{j=1}^n (w_j)_i\delta_j - e_i\Big) + \sum_{j=1}^n \frac{1}{2}(w_j)_i\,\delta_j^T\nabla^2 f(x_0)\delta_j + \sum_{j=1}^n \frac{1}{6}(w_j)_i\sum_{k_1,k_2,k_3}\big(\nabla^3 f(x_0+t_j\delta_j)\big)_{k_1k_2k_3}(\delta_j)_{k_1}(\delta_j)_{k_2}(\delta_j)_{k_3},$$
where $e_i$ is the $i$th standard basis vector in $\mathbb{R}^p$. On the other hand, the variance of the estimator $L_n^p$ is
$$E\|L_n^p - EL_n^p\|_2^2 = \sum_{i=1}^p Var((L_n^p)_i) = \sum_{i=1}^p\sum_{j=1}^n (w_j)_i^2\,\sigma^2(x_0+\delta_j).$$
If $\sum_{j=1}^n (w_j)_i \ne 0$, we consider the particular cases where $(\nabla f(x))_{k_1} = 0$, $(\nabla^2 f(x))_{k_1k_2} = 0$, $(\nabla^3 f(x))_{k_1k_2k_3} = 0$ for all $x$ and $k_1, k_2, k_3$, and $f(x_0)$ arbitrary, concluding that
$$\sup_{f(\cdot)\in\mathcal{A}_q} \big(E(L_n^p)_i - (\nabla f(x_0))_i\big)^2 = \infty.$$
Thus, as in the proof of Theorem 1, for the purpose of deriving a lower bound we can assume without loss of generality that $\sum_{j=1}^n w_j = 0$. Similarly we can assume $\sum_{j=1}^n (w_j)_i\delta_j - e_i = 0$. Furthermore, if $\delta_i = \delta_j$, we assume w.l.o.g. $w_i = w_j$ since it leads to smaller variance. Now consider $f(\cdot) \in \mathcal{A}_q$ such that $f(x_0 + \delta_j) = \frac{a}{6}\|\delta_j\|_q^3 \cdot \mathrm{sign}((w_j)_{i^*})$, and $f(x) = 0$ otherwise, where
$$i^* = \arg\max_{1\le i\le p}\ \sum_{j=1}^n |(w_j)_i|\,\|\delta_j\|_q^3.$$
In such a case the MSE is bounded from below by
$$\Big(\frac{a}{6}\sum_{j=1}^n |(w_j)_{i^*}|\,\|\delta_j\|_q^3\Big)^2 + \sum_{i=1}^p\sum_{j=1}^n (w_j)_i^2\,\sigma^2(x_0+\delta_j).$$
Considering the case $\sigma^2(x_0+\delta_j) = d^2$, we get
$$\sup_{f(\cdot)\in\mathcal{A}_q,\,\sigma^2(\cdot)\in\mathcal{B}} E\|L_n^p - \nabla f(x_0)\|_2^2 \ge \frac{a^2}{36}\Big(\sum_{j=1}^n |(w_j)_{i^*}|\,\|\delta_j\|_q^3\Big)^2 + d^2\sum_{j=1}^n \|w_j\|_2^2. \qquad (9)$$
Now since $\sum_{j=1}^n (w_j)_i(\delta_j)_i = 1$ for each $i$, by Hölder's inequality,
$$p = \sum_{j=1}^n w_j^T\delta_j \le \sum_{j=1}^n \|w_j\|_r^{1/3}\|\delta_j\|_q\,\|w_j\|_r^{2/3} \le \Big(\sum_{j=1}^n \|w_j\|_r\|\delta_j\|_q^3\Big)^{1/3}\Big(\sum_{j=1}^n \|w_j\|_r\Big)^{2/3},$$
where $\frac{1}{q} + \frac{1}{r} = 1$. Thus
$$\frac{p^3}{\big(\sum_{j=1}^n \|w_j\|_r\big)^2} \le \sum_{j=1}^n \|w_j\|_r\|\delta_j\|_q^3 \le \sum_{j=1}^n \|w_j\|_1\|\delta_j\|_q^3 \le p\sum_{j=1}^n |(w_j)_{i^*}|\,\|\delta_j\|_q^3.$$
Since, by Hölder's inequality,
$$\Big(\sum_{j=1}^n \|w_j\|_r\Big)^2 \le \Big(\sum_{j=1}^n \|w_j\|_2\Big)^2 \le n\sum_{j=1}^n \|w_j\|_2^2, \quad r \ge 2,$$
$$\Big(\sum_{j=1}^n \|w_j\|_r\Big)^2 \le p\Big(\sum_{j=1}^n \|w_j\|_2\Big)^2 \le np\sum_{j=1}^n \|w_j\|_2^2, \quad r = 1,$$
we have
$$\sup_{f\in\mathcal{A}_q,\,\sigma^2\in\mathcal{B}} E\|L_n^p - \nabla f(x_0)\|_2^2 \ge \frac{a^2 p^4}{36 n^2\big(\sum_{j=1}^n \|w_j\|_2^2\big)^2} + d^2\sum_{j=1}^n \|w_j\|_2^2 \ge p^{4/3}\left(\frac{3a^2d^4}{16}\right)^{1/3} n^{-2/3}, \quad q = 1, 2, \qquad (10)$$
where the lower bound is achieved at
$$\sum_{j=1}^n \|w_j\|_2^2 = \left(\frac{a^2}{18d^2}\right)^{1/3} p^{4/3}\, n^{-2/3},$$
and
$$\sup_{f\in\mathcal{A}_q,\,\sigma^2\in\mathcal{B}} E\|L_n^p - \nabla f(x_0)\|_2^2 \ge \frac{a^2 p^2}{36 n^2\big(\sum_{j=1}^n \|w_j\|_2^2\big)^2} + d^2\sum_{j=1}^n \|w_j\|_2^2 \ge p^{2/3}\left(\frac{3a^2d^4}{16}\right)^{1/3} n^{-2/3}, \quad q = \infty, \qquad (11)$$
where the lower bound is achieved at
$$\sum_{j=1}^n \|w_j\|_2^2 = \left(\frac{a^2}{18d^2}\right)^{1/3} p^{2/3}\, n^{-2/3}.$$
Since $\delta_j, w_j$ are arbitrary, we conclude that
$$\inf_{\substack{\delta_j,\, j=1,\ldots,n \\ w_j,\, j=1,\ldots,n}} \ \sup_{f\in\mathcal{A}_q,\,\sigma^2\in\mathcal{B}} E\|L_n^p - \nabla f(x_0)\|_2^2 \ge p^{4/3}\left(\frac{3a^2d^4}{16}\right)^{1/3} n^{-2/3}, \quad q = 1, 2,$$
$$\inf_{\substack{\delta_j,\, j=1,\ldots,n \\ w_j,\, j=1,\ldots,n}} \ \sup_{f\in\mathcal{A}_q,\,\sigma^2\in\mathcal{B}} E\|L_n^p - \nabla f(x_0)\|_2^2 \ge p^{2/3}\left(\frac{3a^2d^4}{16}\right)^{1/3} n^{-2/3}, \quad q = \infty.$$
Now suppose in addition that each $w_j$ has the same sign across components. Then instead of (9), we have the sharper lower bound
$$\frac{a^2}{36}\sum_{i=1}^p\Big(\sum_{j=1}^n |(w_j)_i|\,\|\delta_j\|_q^3\Big)^2 + d^2\sum_{j=1}^n \|w_j\|_2^2.$$
Note that by Hölder's inequality,
$$\Big(\sum_{j=1}^n \|w_j\|_1\|\delta_j\|_q^3\Big)^2 \le p\sum_{i=1}^p\Big(\sum_{j=1}^n |(w_j)_i|\,\|\delta_j\|_q^3\Big)^2.$$
Thus the $p^{4/3}$ factor in (10) can be improved to $p^{5/3}$, and the $p^{2/3}$ factor in (11) can be improved to $p$.

Finally, suppose the budget $n$ is a multiple of $p$ and $2$. We allocate $n/p$ budget to each dimension. Then for any $f(\cdot) \in \mathcal{A}_q$ and $\sigma^2(\cdot) \in \mathcal{B}$, we get, by the proof of Theorem 1 (noting that $\|e_i\|_q = 1$ for every $q$, so the restriction of $f$ along each coordinate direction lies in the single-dimensional class $\mathcal{A}$),
$$E\big((\bar{L}_n^p)_i - (\nabla f(x_0))_i\big)^2 \le \left(\frac{3a^2d^4}{16}\right)^{1/3}\left(\frac{n}{p}\right)^{-2/3} = p^{2/3}\left(\frac{3a^2d^4}{16}\right)^{1/3} n^{-2/3}.$$
Thus
$$E\|\bar{L}_n^p - \nabla f(x_0)\|_2^2 \le p^{5/3}\left(\frac{3a^2d^4}{16}\right)^{1/3} n^{-2/3},$$
which completes our proof. □

2.3 Random Perturbation

The discussion above has focused on deterministic designs in which the perturbation sizes $\delta_j$ are fixed. In multi-dimensional gradient estimation, it is also common to use random perturbation, in which a random vector in $\mathbb{R}^p$ is generated and FD is taken simultaneously for all dimensions by projecting the vector onto each direction. This leads to schemes such as simultaneous perturbation [23] and Gaussian smoothing [20], which are frequently used in stochastic optimization. A question arises as to how these randomized schemes perform with respect to our presented risk criterion.

To proceed, let $\delta$ be a random vector in $\mathbb{R}^p$ whose components $\delta_i$, $i = 1, \ldots, p$, are i.i.d., symmetrically distributed about 0, and satisfy some additional properties (described in, e.g., [23], which will not concern us as we will see), and let $\phi(\delta) = (1/\delta_1, \ldots, 1/\delta_p)^T$. Other distributional choices of $\delta$ and the associated $\phi$ have also been suggested (e.g., [20, 8]), for which our subsequent argument follows similarly. Suppose the simulation budget $n$ is even. We choose a scaling parameter $h > 0$,
then repeatedly and independently simulate $\delta_j \in \mathbb{R}^p$ and $Y_j(\cdot)$, and output
$$S_n = \frac{1}{n/2}\sum_{j=1}^{n/2}\frac{Y_{2j-1}(x_0 + h\delta_j) - Y_{2j}(x_0 - h\delta_j)}{2h}\,\phi(\delta_j).$$
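For concreteness, here is a minimal sketch of $S_n$ with Rademacher perturbations, one common choice satisfying the conditions of [23] (for $\pm 1$ components, $\phi(\delta) = (1/\delta_1, \ldots, 1/\delta_p)^T = \delta$); as before, the oracle interface is our illustrative assumption.

```python
import numpy as np

def sp_estimator(oracle, x0, h, n, rng):
    """Simultaneous-perturbation estimate S_n of the gradient of f at x0:
    n/2 independent random directions delta_j, one central difference
    along each, scaled componentwise by phi(delta_j) = 1/delta_j."""
    p = len(x0)
    assert n % 2 == 0
    est = np.zeros(p)
    for _ in range(n // 2):
        delta = rng.choice([-1.0, 1.0], size=p)  # i.i.d. symmetric +-1 components
        diff = (oracle(x0 + h * delta, rng) - oracle(x0 - h * delta, rng)) / (2 * h)
        est += diff / delta                      # phi(delta) applied componentwise
    return est / (n // 2)
```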
The following theorem shows that, without further assumptions on the magnitude of the first-order gradient, the $L_2$-risk of random perturbation can be arbitrarily large.

Theorem 3. For any even $n \ge 2$,
$$\sup_{f(\cdot)\in\mathcal{A}_q,\,\sigma^2(\cdot)\in\mathcal{B}} E\|S_n - \nabla f(x_0)\|_2^2 = \infty.$$
Proof. First note that, by independence,
$$E\|S_n - ES_n\|_2^2 = \frac{2}{n}\,\mathrm{tr}\left(Cov\left(\frac{Y_1(x_0+h\delta) - Y_2(x_0-h\delta)}{2h}\,\phi(\delta)\right)\right) = \frac{2}{n}\sum_{i=1}^p Var\left(\frac{Y_1(x_0+h\delta) - Y_2(x_0-h\delta)}{2h\delta_i}\right).$$
Next, by conditioning on $\delta$, we have
$$Var\left(\frac{Y_1(x_0+h\delta) - Y_2(x_0-h\delta)}{2h\delta_i}\right) = Var\left(\frac{f(x_0+h\delta) - f(x_0-h\delta)}{2h\delta_i}\right) + E\left[\frac{\sigma^2(x_0+h\delta) + \sigma^2(x_0-h\delta)}{4h^2\delta_i^2}\right] \ge Var\left(\frac{f(x_0+h\delta) - f(x_0-h\delta)}{2h\delta_i}\right).$$
Now consider $f(\cdot) \in \mathcal{A}_q$ of the form $f(x) = f(x_0) + \rho\mathbf{1}^T(x - x_0)$, where $\rho \in \mathbb{R}$ and $\mathbf{1} \in \mathbb{R}^p$ denotes the vector of all ones; such a linear $f$ indeed lies in $\mathcal{A}_q$ for any $\rho$, since its Hessian vanishes and the Taylor remainder in the definition of $\mathcal{A}_q$ is zero. We then get
$$Var\left(\frac{f(x_0+h\delta) - f(x_0-h\delta)}{2h\delta_i}\right) = \rho^2\,Var\left(\frac{\mathbf{1}^T\delta}{\delta_i}\right) > 0,$$
since $\mathbf{1}^T\delta/\delta_i = 1 + \sum_{k\ne i}\delta_k/\delta_i$ is non-degenerate for $p \ge 2$. Sending $\rho \to \infty$ we conclude that
$$\sup_{f(\cdot)\in\mathcal{A}_q} Var\left(\frac{f(x_0+h\delta) - f(x_0-h\delta)}{2h\delta_i}\right) = \infty.$$
Finally, note that
$$E\|S_n - \nabla f(x_0)\|_2^2 = \|ES_n - \nabla f(x_0)\|_2^2 + E\|S_n - ES_n\|_2^2 \ge E\|S_n - ES_n\|_2^2 \ge \frac{2}{n}\sum_{i=1}^p Var\left(\frac{f(x_0+h\delta) - f(x_0-h\delta)}{2h\delta_i}\right),$$
which completes our proof. □

The unboundedness of the worst-case $L_2$-risk in Theorem 3 is due to the interaction between the gradient of interest and the variance from the random perturbation. This hints that, in general, to restrain the worst-case $L_2$-risk for such schemes, extra knowledge on the magnitude of the gradient is needed.

3 General Minimax Risk

We now expand our analysis to consider estimators that are possibly nonlinear. Section 3.1 first presents the single-dimensional case. Section 3.2 then presents the generalization to the multi-dimensional counterpart. We derive bounds for the minimax risks and show that, in these expanded classes, the CFD estimators are still nearly optimal.
3.1 Single-Dimensional Case

Adopting the notation from the previous sections, suppose the budget is $n$. We select input design points $x_1, \ldots, x_n$ and for convenience let $Y_j = Y_j(x_j)$ be the independent unbiased noisy function evaluation of $f$ at $x_j$ with simulation variance $\sigma^2(x_j)$. We are interested in estimating $f'$ at $x_0$, which w.l.o.g. we take as the origin $0$. Denote by $\hat\theta = \hat\theta(Y_1, \ldots, Y_n)$ a generic estimator. We consider the class of problems specified by
$$\mathcal{A} = \Big\{f(\cdot) : f^{(2)}(0) \text{ exists}, \ \Big|f(x) - f(0) - f'(0)x - \frac{f^{(2)}(0)}{2}x^2\Big| \le \frac{a}{6}|x|^3 \text{ for all } x, \text{ and } |f^{(2)}(0)| \le b\Big\}$$
and $\mathcal{B} = \{\sigma^2(\cdot) : \sigma^2(x) \le d^2 \text{ for all } x\}$, where $a, b, d > 0$.
Like in Section 2, the parameter $b$ does not play a role subsequently. Define the minimax risk as
$$\mathcal{R}(n, \mathcal{A}, \mathcal{B}) = \inf_{x_j,\, j=1,\ldots,n;\ \hat\theta}\ \sup_{f(\cdot)\in\mathcal{A},\,\sigma^2(\cdot)\in\mathcal{B}} E(\hat\theta - f'(0))^2.$$
Theorem 4. For any $n \ge 1$,
$$\mathcal{R}(n, \mathcal{A}, \mathcal{B}) \ge \frac{3^{2/3}}{16}\,e^{-2/3}\left(\frac{ad^2}{n}\right)^{2/3}. \qquad (12)$$
Consequently, the CFD estimator $\bar{L}_n$ in (2) is optimal up to a constant multiplicative factor.

The last conclusion in Theorem 4 is a simple consequence of combining the risk estimate of $\bar{L}_n$ in Theorem 1 with (12). In contrast to the elementary proof of Theorem 1, here we use Le Cam's method (e.g., [24]) and the notion of modulus of continuity [5] to estimate the minimax risk. Specifically, Le Cam's method derives minimax lower bounds by constructing a hypothesis test and using its error to inform a bound. The error of the hypothesis test, in turn, is analyzable by the Neyman-Pearson lemma. The lower bound provided by Le Cam's method involves the distance between two functions, and tightening the lower bound then becomes a functional optimization problem that can be viewed as the dual or inverse of the formulation to attain the so-called modulus of continuity. Consequently, finding the extremal or worst-case functions for the inverse modulus of continuity gives the resulting lower bound.
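To see how far apart the two sides are, here is a quick numerical comparison of the constant in (12) with the CFD constant from Theorem 1, both in units of $(ad^2/n)^{2/3}$ (under the constants as derived in this paper):

```python
import math

lower_const = (3 ** (2 / 3) / 16) * math.exp(-2 / 3)  # constant in the bound (12)
cfd_const = (3 / 16) ** (1 / 3)                       # CFD constant from Theorem 1
print(lower_const, cfd_const, cfd_const / lower_const)
# approx 0.067, 0.572, and a gap factor of about 8.6
```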
Proof. Consider arbitrary functions $f_1, f_2 \in \mathcal{A}$; we follow Le Cam's method. For any estimator $\hat\theta$, we have
$$\epsilon^2 := (f_1'(0) - f_2'(0))^2 \le 2(\hat\theta - f_1'(0))^2 + 2(\hat\theta - f_2'(0))^2.$$
Define the test statistic $\psi$ by
$$\psi(Y_1, \ldots, Y_n) = \begin{cases} 1 & \text{if } |\hat\theta - f_1'(0)| \le |\hat\theta - f_2'(0)|, \\ 2 & \text{if } |\hat\theta - f_1'(0)| > |\hat\theta - f_2'(0)|. \end{cases}$$
Let $E_k$, $k = 1, 2$, denote the expectation (and $P_k$ and $p_k$ the probability measure and density) under the model
$$Y_j = f_k(x_j) + \eta_j, \quad j = 1, \ldots, n,$$
where $\eta_j$, $j = 1, \ldots, n$, i.i.d. follow the normal distribution with mean zero and variance $d^2$. On the event $\{\psi \ne k\}$ we have $|\hat\theta - f_k'(0)| \ge \epsilon/2$, so
$$E_k(\hat\theta - f_k'(0))^2 \ge E_k\big[(\hat\theta - f_k'(0))^2 I(\psi \ne k)\big] \ge \frac{\epsilon^2}{4}P_k(\psi \ne k).$$
Thus
$$\sup_{f\in\mathcal{A},\,\sigma^2\in\mathcal{B}} E(\hat\theta - f'(0))^2 \ge \max_{k=1,2} E_k(\hat\theta - f_k'(0))^2 \ge \frac{\epsilon^2}{4}\cdot\frac{P_1(\psi = 2) + P_2(\psi = 1)}{2}.$$
Taking the infimum over all possible estimators, we get
$$\inf_{\hat\theta}\sup_{f\in\mathcal{A},\,\sigma^2\in\mathcal{B}} E(\hat\theta - f'(0))^2 \ge \frac{\epsilon^2}{8}\inf_{\psi}\big(P_1(\psi = 2) + P_2(\psi = 1)\big).$$
The right-hand side is minimized by the Neyman-Pearson test, i.e.
$$\psi(y_1, \ldots, y_n) = \begin{cases} 1 & \text{if } p_1(y_1, \ldots, y_n) \ge p_2(y_1, \ldots, y_n), \\ 2 & \text{if } p_1(y_1, \ldots, y_n) < p_2(y_1, \ldots, y_n). \end{cases}$$
Thus by a standard relation (e.g. Lemma 2.6 in [24]),
$$\inf_{\hat\theta}\sup_{f\in\mathcal{A},\,\sigma^2\in\mathcal{B}} E(\hat\theta - f'(0))^2 \ge \frac{\epsilon^2}{8}\int \min\{p_1(y_1,\ldots,y_n),\, p_2(y_1,\ldots,y_n)\}\,dy \ge \frac{\epsilon^2}{16}e^{-KL(P_1,P_2)},$$
where $KL(P_1, P_2)$ denotes the KL divergence between the distributions $P_1, P_2$. Now since $P_1 \sim N(\mu_1, d^2 I_{n\times n})$ and $P_2 \sim N(\mu_2, d^2 I_{n\times n})$, where $\mu_k = (f_k(x_1), \ldots, f_k(x_n))^T$, $k = 1, 2$,
by direct computation,
$$KL(P_1, P_2) = \frac{1}{2d^2}(\mu_1 - \mu_2)^T(\mu_1 - \mu_2) = \frac{1}{2d^2}\|\mu_1 - \mu_2\|_2^2 \le \frac{n}{2d^2}\sup_x |f_1(x) - f_2(x)|^2.$$
To tighten the bound, we therefore seek a pair $(f_1, f_2)$ whose function values stay as close as possible subject to a prescribed gradient separation $\epsilon$:
$$\omega(\epsilon) = \inf\Big\{\sup_x |f_1(x) - f_2(x)| : |f_1'(0) - f_2'(0)| = \epsilon,\ f_1, f_2 \in \mathcal{A}\Big\}. \qquad (13)$$
Solving the optimization in (13) yields the tightest upper bound for $KL(P_1, P_2)$, and further obtains the tightest lower bound for $\frac{\epsilon^2}{16}e^{-KL(P_1,P_2)}$. The decision variables in the outer problem in (13) are a pair of functions $(f_1, f_2)$, where we later denote by $(f_1^*, f_2^*)$ the solutions, which constitute the extremal or worst-case functions. Note that $\omega(\epsilon)$ is the inverse function of the so-called modulus of continuity at the point $x = 0$, which is defined by
$$\epsilon(\omega) = \sup\Big\{|f_1'(0) - f_2'(0)| : \sup_x |f_1(x) - f_2(x)| \le \omega,\ f_1, f_2 \in \mathcal{A}\Big\}.$$
By Lemma 7 of [5], the extremal pair of functions attaining the modulus function $\epsilon(\omega)$ can be chosen in the form $f_1^* = f$ and $f_2^* = -f$ for some $f$. Thus
$$\omega(\epsilon) = 2\inf\Big\{\sup_x |f(x)| : |f'(0)| = \epsilon/2,\ f \in \mathcal{A}\Big\}. \qquad (14)$$
If $f(x)$ solves problem (14), so does $-f(-x)$. As the absolute value is a convex function, $(f(x) - f(-x))/2$ is also a solution, so we may restrict attention to odd $f \in \mathcal{A}$. For odd $f \in \mathcal{A}$ we have $f(0) = 0$ and $f^{(2)}(0) = 0$, so
$$\big|f(x) - f'(0)x\big| = \Big|f(x) - f(0) - f'(0)x - \frac{f^{(2)}(0)}{2}x^2\Big| \le \frac{a}{6}|x|^3.$$
It follows that
$$|f(x)| \ge \big|f'(0)x\big| - \big|f(x) - f'(0)x\big| \ge \frac{\epsilon}{2}|x| - \frac{a}{6}|x|^3.$$
Consider the function $f^*$ which increases with gradient $\epsilon/2$ at $x = 0$ and stays as close to 0 as possible:
$$f^*(x) = \mathrm{sign}(x)\Big[\frac{\epsilon}{2}|x| - \frac{a}{6}|x|^3\Big]_+.$$
It is easy to verify that $f^*(x)$ is an odd function, $f^*(x) \in \mathcal{A}$, and $\sup_x |f(x)| \ge \sup_x |f^*(x)|$ for any odd function $f \in \mathcal{A}$. Therefore, $f^*(x)$ is a solution to problem (14). Since $\omega(\epsilon) = \frac{2\epsilon}{3}\sqrt{\epsilon/a}$, we get
$$\inf_{\hat\theta}\sup_{f\in\mathcal{A},\,\sigma^2\in\mathcal{B}} E(\hat\theta - f'(0))^2 \ge \frac{\epsilon^2}{16}\,e^{-\frac{2n\epsilon^3}{9ad^2}}.$$
Now take $\epsilon = \left(\frac{3ad^2}{n}\right)^{1/3}$; we get
$$\inf_{\hat\theta}\sup_{f\in\mathcal{A},\,\sigma^2\in\mathcal{B}} E(\hat\theta - f'(0))^2 \ge \frac{3^{2/3}}{16}\,e^{-2/3}\left(\frac{ad^2}{n}\right)^{2/3}.$$
We complete our proof by noting that the above bound holds for any design points $x_1, \ldots, x_n$. □

Figure 1 visualizes the above worst-case function
$$f^*(x) = \mathrm{sign}(x)\Big[\frac{\epsilon}{2}|x| - \frac{a}{6}|x|^3\Big]_+$$
in the function class $\mathcal{A}$. Intuitively, when the function evaluations of $f_1$ and $f_2$ are close but their gradients at $x = 0$ are quite different, we cannot easily find an estimator that minimizes the errors in gradient estimation for these two functions at the same time. Such a scenario constitutes the worst case. Consider $(f_1, f_2)$ taken in the form $(f, -f)$, so that the difference between $f_1'(0)$ and $f_2'(0)$ is $\epsilon$. As shown in Figure 1, $f^*$ increases with gradient $\epsilon/2$ at $x = 0$. In order to make the function evaluations of $f_1$ and $f_2$ as close as possible, $f^*$ needs to be as close to 0 as possible. Ultimately, with these worst-case functions indexed by $\epsilon$, we take $\epsilon = \left(\frac{3ad^2}{n}\right)^{1/3}$ in our lower-bound derivation to balance the maximization of the gradient difference $\epsilon$ and the minimization of the function evaluation difference.
Figure 1: Worst-case function in attaining the inverse modulus of continuity.
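The extremal function is straightforward to evaluate; the following short sketch (our own illustration) computes $f^*$ and numerically confirms $\sup_x |f^*(x)| = (\epsilon/3)\sqrt{\epsilon/a}$:

```python
import numpy as np

def f_star(x, eps, a):
    """Worst-case function attaining the inverse modulus of continuity (14)."""
    return np.sign(x) * np.maximum(eps / 2 * np.abs(x) - a / 6 * np.abs(x) ** 3, 0.0)

eps, a = 0.5, 1.0
x = np.linspace(-3.0, 3.0, 100001)
print(np.max(np.abs(f_star(x, eps, a))))  # approx 0.1179
print((eps / 3) * np.sqrt(eps / a))       # = 0.1179, matching the closed form
```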
3.2 Multi-Dimensional Case

Suppose now that the input design points are $x_1, \ldots, x_n \in \mathbb{R}^p$, and $Y_j = Y_j(x_j)$ are independent unbiased noisy function evaluations of $f: \mathbb{R}^p \to \mathbb{R}$ at $x_j$ with simulation variance $\sigma^2(x_j)$. Here we would like to estimate $\nabla f$ at (w.l.o.g.) the origin $x_0 = 0$. Denote by $\hat\theta = \hat\theta(Y_1, \ldots, Y_n)$ a generic $\mathbb{R}^p$-valued estimator like before. We consider the class of problems
$$\mathcal{A}_q = \Big\{f(\cdot) : \nabla^2 f(0) \text{ exists}, \ \|\nabla^2 f(0)\| \le b, \text{ and } \Big|f(x) - f(0) - \nabla f(0)^T x - \frac{1}{2}x^T\nabla^2 f(0)x\Big| \le \frac{a}{6}\|x\|_q^3 \text{ for all } x\Big\}$$
and $\mathcal{B} = \{\sigma^2(\cdot) : \sigma^2(x) \le d^2 \text{ for all } x\}$, where $a, b, d > 0$.
As before, $\|\cdot\|_q$ denotes the $\ell_q$-norm, $q \in \{1, 2, \infty\}$. Define the minimax risk as
$$\mathcal{R}_p(n, \mathcal{A}_q, \mathcal{B}) = \inf_{x_j,\, j=1,\ldots,n;\ \hat\theta}\ \sup_{f(\cdot)\in\mathcal{A}_q,\,\sigma^2(\cdot)\in\mathcal{B}} E\big\|\hat\theta - \nabla f(0)\big\|_2^2.$$
Theorem 5. For any $n \ge 1$,
$$\mathcal{R}_p(n, \mathcal{A}_q, \mathcal{B}) \ge \frac{3^{2/3}}{16}\,e^{-2/3}\left(\frac{ad^2p^{3/2}}{n}\right)^{2/3}, \quad \text{for } q = 1,$$
$$\mathcal{R}_p(n, \mathcal{A}_q, \mathcal{B}) \ge \frac{3^{2/3}}{16}\,e^{-2/3}\left(\frac{ad^2}{n}\right)^{2/3}, \quad \text{for } q = 2, \infty.$$
Consequently, the CFD estimator that divides the budget equally among all dimensions, $\bar{L}_n^p$ described in Theorem 2, is optimal up to a multiplicative factor polynomial in $p$ for $q = 1, 2, \infty$.

Similar to the proof of Theorem 4, Le Cam's method and the modulus of continuity are used to obtain the worst-case hypothesized functions for the general minimax risk. However, here the choice of $q$ in the function class $\mathcal{A}_q$ affects the modulus function, and thus the resulting worst-case functions and risk bounds.
Proof. Following Le Cam's method as in the proof of Theorem 4, and replacing $\epsilon = |f_1'(0) - f_2'(0)|$ by $\epsilon = \|\nabla f_1(0) - \nabla f_2(0)\|_2$ in the multi-dimensional setting, we have, for any $f_1, f_2 \in \mathcal{A}_q$,
$$\mathcal{R}_p(n, \mathcal{A}_q, \mathcal{B}) \ge \frac{\|\nabla f_1(0) - \nabla f_2(0)\|_2^2}{16}\,e^{-\frac{\|\mu_1 - \mu_2\|_2^2}{2d^2}},$$
where $\mu_k = (f_k(x_1), \ldots, f_k(x_n))^T$, $k = 1, 2$. We then consider the modulus of continuity over $\mathcal{A}_q$:
$$\epsilon_{\mathcal{A}_q}(\omega) = \sup\Big\{\|\nabla f_1(0) - \nabla f_2(0)\|_2 : \sup_x |f_1(x) - f_2(x)| \le \omega,\ f_1, f_2 \in \mathcal{A}_q\Big\},$$
which is not only a function of $\omega$ (like that in Theorem 4) but is also affected by the choice of $q$. Here the extremal pair $(f_1, f_2)$ attaining the modulus function will be different for different $q$. First, by Lemma 7 of [5], the extremal pair can be chosen of the form $f_1 = f$ and $f_2 = -f$. Thus
$$\epsilon_{\mathcal{A}_q}(\omega) = 2\sup\Big\{\|\nabla f(0)\|_2 : \sup_x |f(x)| \le \omega/2,\ f \in \mathcal{A}_q\Big\}.$$
It follows that $\epsilon_{\mathcal{A}_q}$ is the inverse function of
$$\omega_{\mathcal{A}_q}(\epsilon) = 2\inf\Big\{\sup_x |f(x)| : \|\nabla f(0)\|_2 = \epsilon/2,\ f \in \mathcal{A}_q\Big\}. \qquad (15)$$
If $f(x)$ solves problem (15), so does $-f(-x)$. As the norm is a convex function, $(f(x) - f(-x))/2$ is also a solution, so we may restrict attention to odd $f \in \mathcal{A}_q$. For such $f$,
$$\big|f(x) - \nabla f(0)^Tx\big| = \Big|f(x) - f(0) - \nabla f(0)^Tx - \frac{1}{2}x^T\nabla^2 f(0)x\Big| \le \frac{a}{6}\|x\|_q^3.$$
It follows that
$$|f(x)| \ge \big|\nabla f(0)^Tx\big| - \big|f(x) - \nabla f(0)^Tx\big| \ge \big|\nabla f(0)^Tx\big| - \frac{a}{6}\|x\|_q^3,$$
and hence $\sup_x |f(x)| \ge \sup_x \big[|\nabla f(0)^Tx| - \frac{a}{6}\|x\|_q^3\big]_+$. Note that the odd function
$$g(x) = \mathrm{sign}\big(\nabla f(0)^Tx\big)\Big[\big|\nabla f(0)^Tx\big| - \frac{a}{6}\|x\|_q^3\Big]_+$$
belongs to $\mathcal{A}_q$, satisfies $\sup_x |f(x)| \ge \sup_x |g(x)|$, and also $\nabla g(0) = \nabla f(0)$. Therefore, we consider functions of the form
$$f(x) = \mathrm{sign}(\xi^Tx)\Big[|\xi^Tx| - \frac{a}{6}\|x\|_q^3\Big]_+ \quad \text{with } \|\xi\|_2 = \epsilon/2,$$
and take $\xi^*$ as a solution to the following problem:
$$\min_{\|\xi\|_2 = \epsilon/2}\ \max_{\|x\|_q = 1,\ \xi^Tx \ge 0}\ \xi^Tx. \qquad (16)$$
We see that
$$f^*(x) = \mathrm{sign}\big((\xi^*)^Tx\big)\Big[|(\xi^*)^Tx| - \frac{a}{6}\|x\|_q^3\Big]_+$$
is a solution to (15). For $q = 1$ and $2$, it is easy to verify that
$$\xi^* = \Big(\frac{\epsilon}{2\sqrt{p}}, \ldots, \frac{\epsilon}{2\sqrt{p}}\Big)^T$$
is a solution to (16). Therefore,
$$f^*(x) = \mathrm{sign}\Big(\sum_{i=1}^p (x)_i\Big)\Big[\frac{\epsilon}{2\sqrt{p}}\Big|\sum_{i=1}^p (x)_i\Big| - \frac{a}{6}\|x\|_q^3\Big]_+.$$
Moreover, taking $x = t(1, \ldots, 1)^T$, we have
$$\sup_x |f^*(x)| = \sup_{t\ge 0}\Big[\frac{\epsilon\sqrt{p}}{2}t - \frac{a}{6}p^3t^3\Big]_+ = \frac{\epsilon}{3}\sqrt{\frac{\epsilon}{a}}\,p^{-3/4}, \quad \text{for } q = 1,$$
$$\sup_x |f^*(x)| = \sup_{t\ge 0}\Big[\frac{\epsilon\sqrt{p}}{2}t - \frac{a}{6}p^{3/2}t^3\Big]_+ = \frac{\epsilon}{3}\sqrt{\frac{\epsilon}{a}}, \quad \text{for } q = 2.$$
For $q = \infty$, it is easy to verify that $\xi^* = \big(\frac{\epsilon}{2}, 0, \ldots, 0\big)^T$ is a solution to (16). Therefore,
$$f^*(x) = \mathrm{sign}\big((x)_1\big)\Big[\frac{\epsilon}{2}|(x)_1| - \frac{a}{6}\|x\|_\infty^3\Big]_+.$$
Moreover, we have
$$\sup_x |f^*(x)| = \sup_{t\ge 0}\Big[\frac{\epsilon}{2}t - \frac{a}{6}t^3\Big]_+ = \frac{\epsilon}{3}\sqrt{\frac{\epsilon}{a}}.$$
In summary,
$$\omega_{\mathcal{A}_q}(\epsilon) = \frac{2\epsilon}{3}\sqrt{\frac{\epsilon}{a}}\,p^{-3/4}, \quad \text{for } q = 1; \qquad \omega_{\mathcal{A}_q}(\epsilon) = \frac{2\epsilon}{3}\sqrt{\frac{\epsilon}{a}}, \quad \text{for } q = 2, \infty.$$
Therefore, we have
$$\mathcal{R}_p(n, \mathcal{A}_q, \mathcal{B}) \ge \frac{\epsilon^2}{16}\,e^{-\frac{n\omega_{\mathcal{A}_q}(\epsilon)^2}{2d^2}} = \frac{\epsilon^2}{16}\,e^{-\frac{2n\epsilon^3}{9ad^2p^{3/2}}}, \quad \text{for } q = 1,$$
$$\mathcal{R}_p(n, \mathcal{A}_q, \mathcal{B}) \ge \frac{\epsilon^2}{16}\,e^{-\frac{n\omega_{\mathcal{A}_q}(\epsilon)^2}{2d^2}} = \frac{\epsilon^2}{16}\,e^{-\frac{2n\epsilon^3}{9ad^2}}, \quad \text{for } q = 2, \infty.$$
Further, by choosing
$$\epsilon = \left(\frac{3ad^2p^{3/2}}{n}\right)^{1/3} \text{ for } q = 1, \qquad \epsilon = \left(\frac{3ad^2}{n}\right)^{1/3} \text{ for } q = 2, \infty,$$
we get
$$\mathcal{R}_p(n, \mathcal{A}_q, \mathcal{B}) \ge \frac{3^{2/3}}{16}\,e^{-2/3}\left(\frac{ad^2p^{3/2}}{n}\right)^{2/3} \approx 0.067\,p\left(\frac{ad^2}{n}\right)^{2/3}, \quad \text{for } q = 1,$$
$$\mathcal{R}_p(n, \mathcal{A}_q, \mathcal{B}) \ge \frac{3^{2/3}}{16}\,e^{-2/3}\left(\frac{ad^2}{n}\right)^{2/3} \approx 0.067\left(\frac{ad^2}{n}\right)^{2/3}, \quad \text{for } q = 2, \infty. \quad \square$$

Figure 2 visualizes the above worst-case functions $f^*$ for a two-dimensional case in the function class $\mathcal{A}_q$ with different $q$. Similar to the discussion for Figure 1, the worst-case functions serve to balance the maximization of the gradient difference and the minimization of the function evaluation difference. Since the $\ell_1$-norm is the largest among the three considered norms, the worst-case function for $q = 1$ descends to 0 most rapidly. The worst-case function for $q = 2$ takes a round shape, and the boundary of its zeros is circular. The worst-case function for $q = \infty$ decreases only when the value in the maximal dimension increases and therefore appears the sharpest. In addition, each dimension of the worst-case function for $q = 1$ or $q = 2$ has the same derivative at the point $x = 0$. However, the worst-case function for $q = \infty$ has a non-zero derivative at $x = 0$ only along one of its dimensions.

Figure 2: Worst-case functions in a two-dimensional case ((a) $q = 1$; (b) $q = 2$; (c) $q = \infty$).
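The multi-dimensional extremal functions can be checked in the same way; the following small sketch (ours) evaluates $f^*$ for $p = 2$ and numerically confirms the three sup values used in the proof:

```python
import numpy as np

def f_star_q(x, eps, a, q, p=2):
    """Worst-case function from the proof of Theorem 5, q in {1, 2, inf}."""
    if q == np.inf:
        xi = np.zeros(p)
        xi[0] = eps / 2
    else:
        xi = np.full(p, eps / (2 * np.sqrt(p)))
    inner = float(xi @ x)
    return np.sign(inner) * max(abs(inner) - a / 6 * np.linalg.norm(x, ord=q) ** 3, 0.0)

eps, a, p = 0.5, 1.0, 2
grid = np.linspace(-3.0, 3.0, 601)
for q, target in [(1, (eps / 3) * np.sqrt(eps / a) * p ** (-0.75)),
                  (2, (eps / 3) * np.sqrt(eps / a)),
                  (np.inf, (eps / 3) * np.sqrt(eps / a))]:
    sup = max(abs(f_star_q(np.array([u, v]), eps, a, q)) for u in grid for v in grid)
    print(q, sup, target)  # grid sup approximately matches the closed form
```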
4 Conclusion

In this paper we studied the minimax optimality of stochastic gradient estimators when only noisy function evaluations are available, with respect to the worst-case MSE over unknown twice differentiable functions. We derived the exact minimax risk for the class of linear estimators, and showed that CFD is optimal within this class, in the single-dimensional case. We extended the analysis to the multi-dimensional case, by showing the optimality of CFD up to a multiplicative factor sublinear in the dimension, and exactly if the signs of the weights in the linear estimator are restricted to be the same across components. We also showed that, without further assumptions on the gradient magnitude, the worst-case risk of random perturbation schemes can be unbounded. Next we approximated the minimax risk over the general class of (nonlinear) estimators and showed that CFD is still nearly optimal over this much larger estimator class. These approximations were shown up to a constant factor in the single-dimensional case and an additional factor depending polynomially on the dimension in the multi-dimensional case. We used elementary techniques in the linear minimax analyses, and Le Cam's method and the modulus of continuity in the general minimax analyses. In future work, we will investigate the use of additional a priori information on the considered function class, and tighten our minimax estimates using potential alternate approaches.
Acknowledgements
We gratefully acknowledge support from the National Science Foundation under grants CAREER CMMI-1653339/1834710 and IIS-1849280.
References

[1] S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and Analysis, volume 57. Springer Science & Business Media, New York, 2007.
[2] W. Dai, T. Tong, and M. G. Genton. Optimal estimation of derivatives in nonparametric regression. The Journal of Machine Learning Research, 17(1):5700–5724, 2016.
[3] K. De Brabanter, J. De Brabanter, B. De Moor, and I. Gijbels. Derivative estimation with local polynomial fitting. The Journal of Machine Learning Research, 14(1):281–301, 2013.
[4] D. L. Donoho. Statistical estimation and optimal recovery. The Annals of Statistics, 22(1):238–270, 1994.
[5] D. L. Donoho and R. C. Liu. Geometrizing rates of convergence, III. The Annals of Statistics, pages 668–701, 1991.
[6] J. Fan. Local linear regression smoothers and their minimax efficiencies. The Annals of Statistics, 21(1):196–216, 1993.
[7] J. Fan, T. Gasser, I. Gijbels, M. Brockmann, and J. Engel. Local polynomial regression: Optimal kernels and asymptotic minimax efficiency. Annals of the Institute of Statistical Mathematics, 49(1):79–99, 1997.
[8] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385–394, Philadelphia, Pennsylvania, 2005. Society for Industrial and Applied Mathematics.
[9] B. L. Fox and P. W. Glynn. Replication schemes for limiting expectations. Probability in the Engineering and Informational Sciences, 3(3):299–318, 1989.
[10] A. Frolov and N. Chentsov. On the calculation of definite integrals dependent on a parameter by the Monte Carlo method. USSR Computational Mathematics and Mathematical Physics, 2(4):802–807, 1963.
[11] M. C. Fu. Gradient estimation. Handbooks in Operations Research and Management Science, 13:575–616, 2006.
[12] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
[13] P. Glasserman. Monte Carlo Methods in Financial Engineering, volume 53. Springer Science & Business Media, New York, 2013.
[14] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
[15] P. Heidelberger, X.-R. Cao, M. A. Zazanis, and R. Suri. Convergence properties of infinitesimal perturbation analysis estimates. Management Science, 34(11):1281–1302, 1988.
[16] B. Heidergott, F. J. Vázquez-Abad, G. Pflug, and T. Farenhorst-Yuan. Gradient estimation for discrete-event systems by measure-valued differentiation. ACM Transactions on Modeling and Computer Simulation, 20(1):5:1–5:28, 2010.
[17] Y.-C. Ho, X. Cao, and C. Cassandras. Infinitesimal and finite perturbation analysis for queueing networks. Automatica, 19(4):439–445, 1983.
[18] H. Lam and X. Zhang. Minimax efficient finite-difference gradient estimators. In 2019 Winter Simulation Conference (WSC), pages 392–403. IEEE, 2019.
[19] P. L'Ecuyer. An overview of derivative estimation. In B. Nelson, W. D. Kelton, and G. M. Clark, editors, Proceedings of the 1991 Winter Simulation Conference, pages 207–217, Piscataway, New Jersey, 1991. Institute of Electrical and Electronics Engineers, Inc.
[20] Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
[21] M. I. Reiman and A. Weiss. Sensitivity analysis for simulations via likelihood ratios. Operations Research, 37(5):830–844, 1989.
[22] R. Y. Rubinstein. The score function approach for sensitivity analysis of computer simulation models. Mathematics and Computers in Simulation, 28(5):351–379, 1986.
[23] J. C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332–341, 1992.
[24] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Science & Business Media, New York, 2009.
[25] W. W. Wang and L. Lin. Derivative estimation based on difference sequence via locally weighted least squares regression. The Journal of Machine Learning Research, 16(1):2617–2641, 2015.
[26] M. A. Zazanis and R. Suri. Convergence rates of finite-difference sensitivity estimates for stochastic systems. Operations Research, 41(4):694–703, 1993.