Zeroth-Order Algorithms for Smooth Saddle-Point Problems
Abdurakhmon Sadiev, Aleksandr Beznosikov, Pavel Dvurechensky, Alexander Gasnikov
Moscow Institute of Physics and Technology, Russia; Sirius University of Science and Technology, Russia; Weierstrass Institute for Applied Analysis and Stochastics, Germany; Institute for Information Transmission Problems RAS, Russia; Caucasus Mathematical Center, Adyghe State University, Russia
Abstract.
In recent years, the importance of saddle-point problems in machine learning has increased. This is due to the popularity of GANs. In this paper, we solve stochastic smooth (strongly) convex-concave saddle-point problems using zeroth-order oracles. Theoretical analysis shows that in the case when the optimization set is a simplex, we lose only a log n factor in the stochastic convergence term. The paper also provides an approach to solving saddle-point problems when the oracle for one of the variables has zeroth order, and for the second – first order. Subsequently, we implement zeroth-order and 1/2th-order methods to solve practical problems.

Keywords: zeroth-order optimization · saddle-point problems · stochastic optimization

The popularity of machine learning is growing every day. Now we can find its application in various areas of human life: from recommendation systems to the detection of diseases, from translation of texts to stores without sellers. Currently, there is great interest in the adversarial approach to network training. In this case, not one model is being trained, but two, and the main goal of the second model is to deceive the first. This approach significantly increases the quantity and quality of the solution for Deep Learning problems. Therefore,
Generative Adversarial Networks (GANs) [8] have made a revolution and have already become one of the most famous and popular models. In fact, a GAN is nothing more than a classic saddle-point problem. The issue of its correct and quick solution is one of the most important and complicated in the optimization community. Our paper is devoted to this question.
* The research of A. Beznosikov was partially supported by RFBR, project number 19-31-51001. The research of A. Gasnikov was partially supported by RFBR, project number 18-29-03071 mk, and by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) 075-00337-20-03, project no. 0714-2020-0005. This work was partially conducted while A. Sadiev and A. Beznosikov were on the project internship in Sirius University of Science and Technology.

Saddle-point problems are used in various fields of science, not only in machine learning. They are a good unified approach for the analysis of Nash equilibria and game problems. There are many ways to solve saddle-point problems under various assumptions. The most famous are the Mirror-Descent algorithm [1] and its modification with an extra step – Mirror-Prox [13,16]. In this paper, we concentrate on solving smooth, convex-concave/strongly-convex-strongly-concave saddle-point problems. Moreover, we want to use the zeroth-order oracle.

The issue of derivative-free optimization is not well studied for saddle-point problems in the literature, although zeroth-order methods have advantages, including in terms of Machine and Deep learning. First of all, derivative-free methods are attractive because of their flexibility: in some problems we do not have access to the gradient or it is difficult to calculate it. In recent years, there has been growing interest in zeroth-order optimization in terms of online learning methods [4,17]. Now such methods have also found their application in the training of neural networks. One of the promising and effective ways to use zeroth-order oracles is
Adversarial Attacks [9,19], in particular the
Black-Box Attacks [15]. In this concept, the attacking model does not have access to the architecture of the main training model, but only to the input and output, in fact, only to the value of the zeroth-order oracle for the loss function. In such a situation, derivative-free methods find application. As the research results show [6,21,7], this approach gives the same quality of training as the more laborious methods of Adversarial Attacks, but with a threefold gain in time [5]. This shows that the zeroth-order concept can be useful for a large number of important and necessary problems. The purpose of this paper is not only to make theoretical estimates of the convergence of derivative-free methods, but also to show their practical importance.
In the first part of the work, we present zeroth-order analogues of the Mirror-Descent [1] and Mirror-Prox [12] methods for stochastic saddle-point problems in the convex and strongly convex cases. We consider various concepts of zeroth-order oracles and various concepts of noise. We also introduce a new class of smooth saddle-point problems – firmly smooth.

In the deterministic case, our methods have a linear oracle complexity in the smooth strongly-convex-strongly-concave case, and a sublinear O(1/N) rate in the convex case. One can note that in some estimates there is a factor of the problem's dimension n, but in others n^{2/q}. This factor q depends on the geometric setup of our problem and gives a benefit when we work in the Hölder, non-Euclidean case (use a non-Euclidean prox), i.e. ‖·‖ = ‖·‖_p with p ∈ [1; 2]; then ‖·‖_* = ‖·‖_q, where 1/p + 1/q = 1. Then q takes values from 2 to ∞; in particular, in the Euclidean case q = 2, but when the optimization set is a simplex, q = ∞ (see Table 1 for a comparison of the oracle complexity of known zeroth-order methods for saddle-point problems in related works).
Method | Assumptions | Complexity in deterministic setup
ZO-GDMSA [20] | NC-SC, UCst-Cst, S | Õ(n κ² / ε²)
ZO-Min-Max [14] | NC-SC, Cst-Cst, S | Õ(n / ε⁶)
zoSPA [3] | C-C, Cst-Cst, BG | O(n^{2/q} M² D² / ε²)
[Alg. 1 and 3] | SC-SC, Cst-Cst, S | Õ(min[n^{2/q} κ², n κ] · log(1/ε))
[Alg. 2] | C-C, Cst-Cst, S | Õ(n L D² / ε)
[Alg. 1] | C-C, Cst-Cst, FS | Õ(n^{2/q} L² D² / ε)*
Table 1. Comparison of oracle complexity in the deterministic setup of different zeroth-order methods with different assumptions on f(x, y): C-C – convex-concave, SC-SC – strongly-convex-strongly-concave, NC-SC – nonconvex-strongly-concave; Cst – the optimization set is constrained, UCst – unconstrained; S – smooth, FS – firmly smooth (see (9)), BG – bounded gradient. Here ε means the accuracy of the solution, D – the diameter of the set, μ – the strong convexity constant, L – the smoothness constant, κ = L/μ, M – the bound of the gradient, n – the dimension of the problem; q = 2 for the Euclidean case and q = ∞ for the setup of the ‖·‖₁-norm. * convergence in terms of (1/N) Σ_{k=1}^N E[‖F(z^k) − F(z*)‖₂²].

Our theoretical analysis shows that the zeroth-order methods have the same sublinear convergence rate in the stochastic part as the first-order methods: O(1/√N) in the convex case and O(1/N) in the strongly-convex case (see Table 2 for a comparison of the oracle complexity in the stochastic part for basic first-order methods and available zeroth-order methods for stochastic saddle-point problems).

The second part of the work is devoted to the use of a mixed-order oracle, i.e. a zeroth-order oracle in one variable and a first-order oracle in the other. First, we analyze a special case when such an approach is appropriate – the Lagrange multiplier method. Then we also present a general approach to this question. The idea of using such an oracle is found in the literature [2], but for the composite problem. As mentioned above, all theoretical results are tested in practice on various classical problems.

We consider the classical saddle-point problem:

min_{x∈X} max_{y∈Y} f(x, y),     (1)
Method | Order | Assumptions | Complexity for stochastic part
EGMP [12] | 1st | C-C, Cst-Cst, S | O(σ² D² / ε²)
PEG [11] | 1st | SC-SC, Cst-Cst, S | O(σ² / (μ² ε))
ZO-SGDMSA [20] | 0th | NC-SC, UCst-Cst, S | Õ(κ n σ² / ε²)
[Alg. 1] | 0th | SC-SC, Cst-Cst, S | O(n^{2/q} σ² / (μ² ε))
[Alg. 2] | 0th | C-C, Cst-Cst, S | O(n σ² D² / ε²)
[Alg. 1] | 0th | C-C, Cst-Cst, FS | O(n^{2/q} σ² D² / ε²)
Table 2. Comparison of oracle complexity for the stochastic part of different first- and zeroth-order methods with different assumptions on f(x, y): see the notation in Table 1. Here σ² is the bound of the variance.

where X ⊂ R^{n_x} and Y ⊂ R^{n_y} are closed convex sets. For simplicity, we introduce the set Z = X × Y, z = (x, y) and the operator F:

F(z) = F(x, y) = (∇_x f(x, y), −∇_y f(x, y)).     (2)

In this paper, we focus on the case when we do not have access to the oracles for ∇_x f(x, y) and ∇_y f(x, y). Additionally, our zeroth-order oracle has stochastic noise and an unknown bounded noise, i.e. we have f̃(z, ξ) = f(z, ξ) + δ(z) with

E[f(z, ξ)] = f(z),  E[F(z, ξ)] = F(z),  E[‖F(z, ξ) − F(z)‖₂²] ≤ σ²,  |δ(z)| ≤ ∆.     (3)

Next, we discuss the concepts of oracles with which we want to replace the value of F.
Random direction oracle. In this strategy, the vectors e_x, e_y are generated uniformly on the unit Euclidean sphere RS(1):

g_d(z, e, τ, ξ) = (n/τ) · ( (f̃(x + τ e_x, y, ξ) − f̃(x, y, ξ)) e_x ,  (f̃(x, y, ξ) − f̃(x, y + τ e_y, ξ)) e_y ).     (4)
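As an illustration, here is a minimal NumPy sketch of the estimator (4) under our own naming: f_noisy stands for the noisy value oracle f̃, the direction e = (e_x, e_y) is drawn jointly from the unit sphere, and for simplicity the noise is re-drawn at every call. This is a sketch of the idea, not the authors' code.

```python
import numpy as np

def g_d(f_noisy, x, y, tau, rng):
    """Random direction estimator (4): two-point feedback along a random
    direction e = (e_x, e_y) drawn uniformly from the unit Euclidean sphere;
    uses three calls to the noisy value oracle f_noisy(x, y)."""
    n = x.size + y.size
    e = rng.normal(size=n)
    e /= np.linalg.norm(e)                           # e is uniform on RS(1)
    e_x, e_y = e[:x.size], e[x.size:]
    f0 = f_noisy(x, y)
    gx = (f_noisy(x + tau * e_x, y) - f0) * e_x      # forward difference in x
    gy = (f0 - f_noisy(x, y + tau * e_y)) * e_y      # sign flipped: y is the max-variable
    return (n / tau) * np.concatenate([gx, gy])

# toy usage: bilinear f(x, y) = y^T C x with small additive value noise
rng = np.random.default_rng(0)
C = rng.normal(size=(3, 3))
f_noisy = lambda x, y: y @ C @ x + 1e-3 * rng.normal()
print(g_d(f_noisy, np.ones(3) / 3, np.ones(3) / 3, tau=1e-3, rng=rng))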
Full coordinates oracle. Here we consider the standard orthonormal basis {h_1, . . . , h_{n_x+n_y}} and form the gradient estimate in the following way:

g_f(z, τ, ξ) = (1/τ) Σ_{i=1}^{n_x} (f̃(z + τ h_i, ξ) − f̃(z, ξ)) h_i + (1/τ) Σ_{i=n_x+1}^{n_x+n_y} (f̃(z, ξ) − f̃(z + τ h_i, ξ)) h_i.     (5)

In this concept, we need to call the f oracle n_x + n_y + 1 times, whereas in the previous case only 3 times.

We use ⟨x, y⟩ := Σ_{i=1}^n x_i y_i to define the inner product of x, y ∈ R^n, where x_i is the i-th component of x in the standard basis in R^n. Hence we get the definition of the ℓ₂-norm in R^n in the following way: ‖x‖₂ = √⟨x, x⟩. We define the ℓ_p-norms as ‖x‖_p := (Σ_{i=1}^n |x_i|^p)^{1/p} for p ∈ (1, ∞), and for p = ∞ we use ‖x‖_∞ := max_{1≤i≤n} |x_i|. The dual norm ‖·‖_q for the norm ‖·‖_p is defined in the following way: ‖y‖_q := max{⟨x, y⟩ | ‖x‖_p ≤ 1}. The operator E[·] denotes the full mathematical expectation and the operator E_ξ[·] denotes the conditional mathematical expectation.

As stated above, throughout the paper we work with an arbitrary norm ‖·‖ = ‖·‖_p, where p ∈ [1; 2], and its conjugate ‖·‖_* = ‖·‖_q with q ∈ [2; +∞) and 1/p + 1/q = 1. Some assumptions will be made later in the Euclidean norm – we will write this explicitly as ‖·‖₂.
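Returning to the oracles for a moment: the full coordinates estimator (5) admits an analogous sketch (again our own illustration with hypothetical names; f_noisy stands for f̃).

```python
import numpy as np

def g_f(f_noisy, x, y, tau):
    """Full coordinates estimator (5): forward differences along every standard
    basis direction; costs n_x + n_y + 1 calls to the noisy value oracle."""
    f0 = f_noisy(x, y)
    gx = np.zeros(x.size)
    gy = np.zeros(y.size)
    for i in range(x.size):              # estimate  d f / d x_i
        xp = x.copy(); xp[i] += tau
        gx[i] = (f_noisy(xp, y) - f0) / tau
    for j in range(y.size):              # estimate -d f / d y_j (y is maximized)
        yp = y.copy(); yp[j] += tau
        gy[j] = (f0 - f_noisy(x, yp)) / tau
    return np.concatenate([gx, gy])
```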
Definition 1. A function d(z): Z → R is called a prox-function if d(z) is 1-strongly convex w.r.t. the ‖·‖-norm and differentiable on Z.
Definition 2. Let d(z): Z → R be a prox-function. For any two points z, w ∈ Z we define the Bregman divergence V_z(w) associated with d(z) as follows:

V_z(w) = d(z) − d(w) − ⟨∇d(w), z − w⟩.

We denote the Bregman diameter Ω_Z of Z w.r.t. V_{z_1}(z_2) as Ω_Z := max{ √(V_{z_1}(z_2)) | z_1, z_2 ∈ Z }.
Definition 3. Let V_z(w) be a Bregman divergence. For all x ∈ Z we define the prox-operator of ξ:

prox_x(ξ) = argmin_{y∈Z} ( V_x(y) + ⟨ξ, y⟩ ).
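For example (our own illustration, not part of the paper's text), with the entropic prox-function on the probability simplex – the setup used later in the experiments – the prox-operator has a closed-form multiplicative update, while in the unconstrained Euclidean setup it reduces to a plain step x − ξ.

```python
import numpy as np

def prox_entropy(x, xi):
    """Prox-operator of Definition 3 for d(z) = sum_i z_i log z_i on the simplex:
    the minimizer of sum_i y_i log(y_i / x_i) + <xi, y> over the simplex is a
    multiplicative update followed by normalization."""
    w = x * np.exp(-xi)
    return w / w.sum()

# Euclidean counterpart: with d(z) = 0.5 * ||z||_2^2 and Z = R^n,
# prox_x(xi) is simply the gradient-type step x - xi.
print(prox_entropy(np.array([0.5, 0.3, 0.2]), np.array([1.0, -1.0, 0.0])))
```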
Next, we present the assumptions that we will use in the convergence analysis.

Assumption 1.
The set Z is bounded w.r.t. ‖·‖ by the constant D_p, i.e.

V_{z_1}(z_2) ≤ D_p²,  ∀ z_1, z_2 ∈ Z.     (6)

D_p is called a diameter in ‖·‖ = ‖·‖_p.

Assumption 2. f(x, y) is convex-concave. It means that f(·, y) is convex for all y and f(x, ·) is concave for all x.

Assumption 2(s). f(x, y) is strongly-convex-strongly-concave. It means that f(·, y) is strongly-convex for all y and f(x, ·) is strongly-concave for all x w.r.t. V_·(·), i.e. for all x_1, x_2 ∈ X and for all y_1, y_2 ∈ Y we have

f(x_1, y) ≥ f(x_2, y) + ⟨∇_x f(x_2, y), x_1 − x_2⟩ + μ ( V_{(x_1,y)}(x_2, y) + V_{(x_2,y)}(x_1, y) ),

−f(x, y_1) ≥ −f(x, y_2) + ⟨−∇_y f(x, y_2), y_1 − y_2⟩ + μ ( V_{(x,y_1)}(x, y_2) + V_{(x,y_2)}(x, y_1) ).     (7)
Assumption 3. f(x, y, ξ) is L₂(ξ)-Lipschitz continuous w.r.t. ‖·‖₂, i.e. for all x_1, x_2 ∈ X, y_1, y_2 ∈ Y and ξ

‖F(x_1, y_1, ξ) − F(x_2, y_2, ξ)‖₂ ≤ L₂(ξ) ‖(x_1, y_1) − (x_2, y_2)‖₂,     (8)

where F(x, y, ξ) = (∇_x f(x, y, ξ), −∇_y f(x, y, ξ)).

Assumption 3(f). f(x, y, ξ) is L(ξ)-firmly Lipschitz continuous w.r.t. ‖·‖₂, i.e. for all x_1, x_2 ∈ X, y_1, y_2 ∈ Y

‖F(x_1, y_1, ξ) − F(x_2, y_2, ξ)‖₂² ≤ L(ξ) ⟨F(x_1, y_1, ξ) − F(x_2, y_2, ξ), (x_1, y_1) − (x_2, y_2)⟩.     (9)

For (8) and (9) we assume that there exist constants L₂ and L such that E[L₂²(ξ)] ≤ L₂² and E[L²(ξ)] ≤ L². In the deterministic case, L₂ and L are equal to the corresponding deterministic constants (without ξ). By the Cauchy-Schwarz inequality, (8) (with L₂(ξ) = L(ξ)) follows from (9). It is easy to see that Assumptions 3 and 3(f) above can be easily rewritten in a more compact form using F(z, ξ). For Assumption 2(s) it is more complicated:

Lemma 1. If f(x, y) is μ-strongly convex in x and μ-strongly concave in y w.r.t. V_·(·), then for F(z) we have

⟨F(z_1) − F(z_2), z_1 − z_2⟩ ≥ μ ( V_{z_1}(z_2) + V_{z_2}(z_1) ),  ∀ z_1, z_2 ∈ Z.

Hereinafter, we do not present the proofs of the lemmas and theorems in the main part of the paper – see the corresponding parts of the appendix.
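To build some intuition for the firmly-smooth class (9) – this is our own illustrative remark, not a statement from the paper – consider the Euclidean setup p = 2 with V_z(w) = 1/2 ‖z − w‖₂². There Lemma 1 gives ⟨F(z_1) − F(z_2), z_1 − z_2⟩ ≥ μ ‖z_1 − z_2‖₂², and combining this with (8) yields

```latex
\|F(z_1)-F(z_2)\|_2^2
  \;\le\; L_2^2\,\|z_1-z_2\|_2^2
  \;\le\; \frac{L_2^2}{\mu}\,\bigl\langle F(z_1)-F(z_2),\, z_1-z_2 \bigr\rangle ,
```

i.e. a strongly-convex-strongly-concave f satisfying Assumption 3 automatically satisfies (9) with constant L = L₂²/μ.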
We can also present some properties of the oracles (4), (5):

Lemma 2.
Let e ∈ RS(1), i.e. uniformly distributed on the unit Euclidean sphere. Randomness comes from the independent variables e, ξ and a point z. The norm ‖·‖_* = ‖·‖_q satisfies q ∈ [2; +∞). We introduce the constant ρ_n:

ρ_n = min{ 2q − 1, 16 log(n) − 8 }.

Then under Assumption 3 or 3(f) the following statements hold:

– for the Random direction oracle

E[‖g_d(z, e, τ, ξ)‖_q²] ≤ 48 n^{2/q} ρ_n E[‖F(z) − F(z*)‖₂²] + 48 n^{2/q} ρ_n ‖F(z*)‖₂² + 48 n^{2/q} ρ_n σ² + 8 n^{2/q+1} ρ_n L₂² τ² + 16 n^{2/q+1} ρ_n ∆²/τ²,

‖E[g_d(z, e, τ, ξ)] − F(z)‖_q ≤ n^{1/q+1/2} √ρ_n L₂ τ + 4 n^{1/q+1/2} √ρ_n ∆/τ;

– for the Full coordinates oracle

E[‖g_f(z, τ, ξ) − F(z)‖_q²] ≤ 3σ² + 3 n L₂² τ² + 6 n ∆²/τ²,

‖E[g_f(z, τ, ξ)] − F(z)‖_q ≤ √n L₂ τ + 2 √n ∆/τ.

In this part, we present methods for solving problem (1) which use only the zeroth-order oracle. First of all, we want to consider the classic version of the Mirror-Descent algorithm.
Algorithm 1 zoVIA
Input: z_0, N, γ, τ. Choose grad to be either g_d or g_f.
for k = 0, 1, 2, . . . , N do
    Sample independent e_k, ξ_k.
    d_k = grad(z_k, e_k, τ, ξ_k).
    z_{k+1} = prox_{z_k}(γ · d_k).
end for
Output: z_{N+1} or z̄_{N+1}.

For a theoretical and practical analysis of this algorithm in the non-smooth case, but with a bounded gradient, see [1] (first order) and [3] (zeroth order). The main problem of this approach is that it is difficult to analyze in the case when f is convex-concave and Lipschitz continuous (Assumptions 2 and 3). But in practice, this algorithm does not differ much from its counterparts, which will be given below.
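For intuition, here is a minimal sketch of Algorithm 1 assuming the unconstrained Euclidean setup, where the prox-step reduces to a plain step (the function and argument names are ours; for a constrained set Z one would replace the step by the corresponding Bregman prox, e.g. prox_entropy above).

```python
import numpy as np

def zoVIA(grad, z0, N, gamma, tau, rng):
    """Sketch of Algorithm 1 (zoVIA) with the Euclidean prox.
    `grad(z, tau, rng)` stands for a zeroth-order estimator such as (4) or (5)."""
    z = z0.copy()
    z_sum = np.zeros_like(z0)
    for k in range(N + 1):
        d = grad(z, tau, rng)        # d_k = grad(z_k, e_k, tau, xi_k)
        z = z - gamma * d            # z_{k+1} = prox_{z_k}(gamma * d_k) in the Euclidean case
        z_sum += z
    return z, z_sum / (N + 1)        # last iterate and the averaged iterate
```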
Let us analyze this algorithm in the convex-concave and strongly-convex-strongly-concave cases with the Random direction oracle:

Theorem 1. By Algorithm 1 with the Random direction oracle

– under Assumptions 1, 2, 3(f) and with γ ≤ 1/(n^{2/q} ρ_n L), we get

(1/N) Σ_{k=1}^N E[‖F(z_k) − F(z*)‖₂²] ≤ (L D_p²)/(γ N) + 48 γ n^{2/q} ρ_n L (‖F(z*)‖₂² + σ²) + 8 γ n^{2/q+1} ρ_n L (L₂² τ² + 2∆²/τ²) + 8 n^{1/q+1/2} √ρ_n L D_p (L₂ τ + 2∆/τ);

– under Assumptions 1, 2(s), 3 and with γ ≤ μ/(n^{2/q} ρ_n L₂²):

E[V_{z_{N+1}}(z*)] ≤ (V_{z_0}(z*))/(γ μ) · exp(−γ μ N) + (3500 n^{2/q} ρ_n)/(μ² N) (‖F(z*)‖₂² + σ²) + (600 n^{2/q+1} ρ_n)/(μ² N) (L₂² τ² + 2∆²/τ²) + (600 n^{1/q+1/2} √ρ_n D_p)/(γ μ² N) (L₂ τ + 2∆/τ).
Remark.
In this theorem and below, we draw attention to the fact that in the main part of the convergence rate there is the deterministic constant L, while in the parts responsible for the noise there is L₂ (see (8), (9)).

Corollary 1.
For Algorithm 1

– under Assumptions 1, 2, 3(f) and with

γ = min{ 1/(n^{2/q} ρ_n L), D_p/(n^{1/q} √ρ_n σ √N) },
τ = Θ( min{ ε/(n^{1/q+1/2} √ρ_n L₂ D_p), max[ √(ε/(n L₂²)), σ/(√n L₂) ] } ),  ∆ = O(L₂ τ²),

the oracle complexity (which coincides with the number of iterations) to find an ε-solution (in terms of the convergence criterion from Theorem 1) is

N = O( max{ n^{2/q} ρ_n L² D_p²/ε, n^{2/q} ρ_n σ² D_p²/ε² } );

– under Assumptions 1, 2(s), 3 and with

γ = μ/(n^{2/q} ρ_n L₂²),
τ = Θ( min{ max[ √(εL/(n L₂²)), σ/(√n L₂) ], max[ εμ/(n^{1/q+1/2} √ρ_n L₂ D_p), σ²/(μ n^{1/q+1/2} √ρ_n L₂ D_p) ] } ),  ∆ = O(L₂ τ²),

the oracle complexity (which coincides with the number of iterations) to find an ε-solution (in terms of the convergence criterion from Theorem 1) can be bounded by

N = Õ( max{ n^{2/q} ρ_n L²/μ² · log(1/ε), n^{2/q} ρ_n σ²/(μ² ε) } ).

Remark.
We analyze only the Random direction oracle. The estimate of the oracle complexity with the Full coordinates oracle has the same form with q = 2.

Next, we consider a standard algorithm for working with smooth saddle-point problems. It builds on the extra-gradient method [13]. The idea of using this approach for saddle-point problems is not new [12]. It has both heuristic advantages (we forestall the properties of the gradient) as well as purely mathematical ones (a cleaner theoretical analysis). We use two versions of this approach: the classic one and the single-call version from [11].

Algorithm 2 zoESVIA
Input: z_0, N, γ, τ. Choose oracle grad from g_d, g_f.
for k = 0, 1, 2, . . . , N do
    Sample independent e_k, e_{k+1/2}, ξ_k, ξ_{k+1/2}.
    d_k = grad(z_k, e_k, τ, ξ_k).
    z_{k+1/2} = prox_{z_k}(γ · d_k).
    d_{k+1/2} = grad(z_{k+1/2}, e_{k+1/2}, τ, ξ_{k+1/2}).
    z_{k+1} = prox_{z_k}(γ · d_{k+1/2}).
end for
Output: z_{N+1} or z̄_{N+1}.

Algorithm 3 zoscESVIA
Input: z_0, N, γ, τ. Choose oracle grad from g_d, g_f.
for k = 0, 1, 2, . . . , N do
    Sample independent e_k, ξ_k. Take d_{k−1/2} from the previous step.
    z_{k+1/2} = prox_{z_k}(γ · d_{k−1/2}).
    d_{k+1/2} = grad(z_{k+1/2}, e_k, τ, ξ_k).
    z_{k+1} = prox_{z_k}(γ · d_{k+1/2}).
end for
Output: z_{N+1} or z̄_{N+1}.

Here z̄_{N+1} = (1/(N+1)) Σ_{i=0}^{N} z_{i+1/2}.
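For intuition, a sketch of one run of Algorithm 2 in the unconstrained Euclidean setup (our own illustration; Algorithm 3 differs only in reusing the previous estimate d_{k−1/2} for the extrapolation step).

```python
import numpy as np

def zoESVIA(grad, z0, N, gamma, tau, rng):
    """Sketch of Algorithm 2 (zoESVIA) with the Euclidean prox: an extrapolation
    step and an update step, both taken from z_k."""
    z = z0.copy()
    z_half_sum = np.zeros_like(z0)
    for k in range(N + 1):
        d = grad(z, tau, rng)                  # d_k at z_k
        z_half = z - gamma * d                 # z_{k+1/2} = prox_{z_k}(gamma * d_k)
        d_half = grad(z_half, tau, rng)        # d_{k+1/2} at z_{k+1/2}
        z = z - gamma * d_half                 # z_{k+1}   = prox_{z_k}(gamma * d_{k+1/2})
        z_half_sum += z_half
    return z, z_half_sum / (N + 1)             # z_{N+1} and the average of the z_{k+1/2}
```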
Next, we deal with the theoretical analysis of convergence.

Theorem 2.

– By Algorithm 2 with the Full coordinates oracle under Assumptions 1, 2, 3 and with γ ≤ 1/(2L), we have

E[ε_sad(z̄_N)] ≤ (D_p²)/(γ N) + 9γ (σ² + n L₂² τ² + 2 n ∆²/τ²) + 2 √n D_p (L₂ τ + 2∆/τ),

where ε_sad(z̄_N) = max_{y'∈Y} f(x̄_N, y') − min_{x'∈X} f(x', ȳ_N), and x̄_N, ȳ_N are defined in the same way as z̄_N.

– By Algorithm 3 with the Full coordinates oracle under Assumptions 1, 2(s), 3 and with p = 2 (V_x(y) = 1/2 ‖x − y‖₂²), γ ≤ 1/(2L):

E[‖z_{N+1} − z*‖₂²] ≤ (L D²)/μ · exp(−μ N/(2L)) + (450)/(μ² N) (σ² + n L₂² τ² + 2 n ∆²/τ²) + (150 D)/(γ μ² N) (√n L₂ τ + 2 √n ∆/τ).

Corollary 2.
Let ε be the accuracy of the solution (in terms of the convergence criterion from Theorem 2).

– For Algorithm 2 with the Full coordinates oracle under Assumptions 1, 2, 3 with γ = min{1/(2L), D_p/(σ√N)} and additionally

τ = O( min{ ε/(√n L₂ D_p), max[ √(εL/(n L₂²)), σ/(√n L₂) ] } ),  ∆ = O(L₂ τ²),

the number of iterations to find an ε-solution is

N = O( max{ L D_p²/ε, σ² D_p²/ε² } ).

– For Algorithm 3 with the Full coordinates oracle under Assumptions 1, 2(s), 3, with p = 2 (V_x(y) = 1/2 ‖x − y‖₂²), γ = 1/(2L) and additionally

τ = O( min{ max[ √(εμL)/(√n L₂), σ/(√n L₂) ], max[ με/(√n L₂ D), σ²/(√n L₂ D) ] } ),  ∆ = O(L₂ τ²),

the number of iterations to find an ε-solution is

N = Õ( max{ (L/μ) log(1/ε), σ²/(μ² ε) } ).

Remark.
The oracle complexity for the Full coordinates oracle is n times greater than the number of iterations. The analysis is carried out only for the Full coordinates oracle. The main problem of using the Random direction oracle is that its variance is tied to the norm of the gradient; therefore, using an extra step does not give any advantages over Algorithm 1. A possible way out of this situation is to use the same direction e within one iteration of Algorithm 2 – this idea is implemented in Appendix F and in the practical part. It is interesting how it works in practice, because in the non-smooth case [3] a gain by the factor n^{2/q} can be obtained.

1/2-Order Methods

In this section, we have access to a first-order oracle in one of the variables, and in the other – only a zeroth-order oracle. For such a case, we suggest using an oracle of the form

g̃(z, τ) = ( [grad(x, y)]_x, −∇_y f(x, y) ),

where [grad(x, y)]_x is one of the zeroth-order approximations in the variable x: (4) or (5). Before proving the general case, we consider one illustrative example.
Let X ⊂ R^n be a convex, compact set and let the functions f(x), g_1(x), . . . , g_m(x) be convex and smooth. We solve the following optimization problem:

min_{x∈X} f(x),  s.t.  g_i(x) ≤ 0  ∀ i ∈ 1, . . . , m.

The dual problem to the original one is

max_{λ∈⊥_m} min_{x∈X} L(x, λ) = f(x) + ⟨λ, g(x)⟩,

where ⊥_m = {y ∈ R^m | y_i ≥ 0} is the positive orthant, L(x, λ) is the Lagrange function, λ is a Lagrange multiplier, and g(x) = (g_1(x), . . . , g_m(x))^T. We get a saddle-point problem that we want to solve using the zeroth-order method, i.e. only function values are available. But it turns out that we have access to ∇_λ L(x, λ) = g(x) completely for free: when we build the "gradient" in x using finite differences, we call the value of g(x) and immediately get the gradient in λ. For such a problem, the zeroth- and first-order oracles can be called the same number of times.
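A small sketch of this Lagrangian setting (our own illustration with hypothetical names f, g, mixed_oracle): the x-part is estimated with finite differences of L(·, λ), while every evaluation of g(x) simultaneously provides the exact gradient in λ.

```python
import numpy as np

def mixed_oracle(f, g, x, lam, tau):
    """Mixed oracle for L(x, lam) = f(x) + <lam, g(x)>: coordinate-wise finite
    differences in x (zeroth order), exact first-order information in lam."""
    L = lambda v: f(v) + lam @ g(v)
    L0 = L(x)
    grad_x = np.zeros(x.size)
    for i in range(x.size):                 # zeroth-order estimate of d L / d x_i
        xp = x.copy(); xp[i] += tau
        grad_x[i] = (L(xp) - L0) / tau
    grad_lam = g(x)                         # grad_lam L(x, lam) = g(x), obtained for free
    return grad_x, grad_lam
```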
In general, it is unprofitable to calculate the gradient as many times as the zeroth-order oracle, and a slightly different result is obtained. Define the Mixed oracle

g̃_f(z, τ) = ( [g_f(x, y)]_x, −∇_y f(x, y) ).

Then the following holds.

Theorem 3. By Algorithm 2 under Assumptions 1, 2, 3 with the Mixed oracle g̃_f and γ ≤ 1/(2L), we get

E[ε_sad(z̄_N)] ≤ (D_p²)/(γ N) + 2 D_p (√n_x L₂ τ + 2 √n_x ∆/τ) + 9γ (σ² + n_x L₂² τ² + 2 n_x ∆²/τ²).

Corollary 3.
To get accuracy ε (in terms of the convergence criterion from Theorem 2) in Algorithm 2 with the Mixed oracle, under Assumptions 1, 2, 3, with γ = min{1/(2L), D_p/(σ√N)},

τ = O( min{ ε/(√n_x L₂ D_p), max[ √(εL/(n_x L₂²)), σ/(√n_x L₂) ] } ),  ∆ = O(L₂ τ²),

we need to call the Full coordinates oracle for x

N = O( max{ L D_p²/ε, σ² D_p²/ε² } )  times.

The main goal of our experiments is to compare Algorithms 1, 2, 3 and 4 (see Appendix F) described in this paper with the Full coordinates and Random direction oracles. We consider the classical bilinear saddle-point problem on a probability simplex:

min_{x∈∆_n} max_{y∈∆_k} [ y^T C x ].     (10)

This problem is often referred to as a matrix game (see Part 5 in [1]). Two players, X and Y, are playing. The goal of player Y is to win as much as possible by correctly choosing an action from 1 to k; the goal of player X is to minimize the gain of player Y using his actions from 1 to n. Each element c_ij of the matrix C is interpreted as the winnings, provided that player X has chosen the i-th strategy and player Y has chosen the j-th strategy.

Let us consider the step of the algorithm. The prox-function is d(x) = Σ_{i=1}^n x_i log x_i (entropy) and V_x(y) = Σ_{i=1}^n x_i log(x_i/y_i) (KL divergence). The result of the proximal operator is u = prox_{z_k}(γ_k grad(z_k, e_k, τ, ξ_k)) = z_k exp(−γ_k grad(z_k, e_k, τ, ξ_k)); by this entry we mean u_i = [z_k]_i exp(−γ_k [grad(z_k, e_k, τ, ξ_k)]_i). Using the Bregman projection onto the simplex, P(x) = x/‖x‖₁, we have

[x_{k+1}]_i = [x_k]_i exp(−γ_k [grad_x(z_k, e_k, τ, ξ_k)]_i) / Σ_{j=1}^n [x_k]_j exp(−γ_k [grad_x(z_k, e_k, τ, ξ_k)]_j),

[y_{k+1}]_i = [y_k]_i exp(γ_k [grad_y(z_k, e_k, τ, ξ_k)]_i) / Σ_{j=1}^k [y_k]_j exp(γ_k [grad_y(z_k, e_k, τ, ξ_k)]_j),

where by grad_x and grad_y we mean the parts of grad which are responsible for x and for y. In the first part of the experiment, we take a 200 × 200 matrix (see Fig. 1).

In this paper, we presented various algorithms for optimizing smooth stochastic saddle-point problems using zeroth-order oracles. For some oracles, we provide a theoretical analysis. We also compare the approaches covered in the work on a practical matrix game. As a continuation of the work, we can distinguish the following directions: convergence estimates for Algorithm 4 (see the appendix), and the study of gradient-free methods for saddle-point problems with a one-point approximation (in this work, we used a two-point one). We also highlight the acceleration of these methods.
[Figure: "Matrix Game, 200x200" – the error criterion built from f(x_N, y*), f(x*, y_N), f(x_0, y*), f(x*, y_0) versus the number of oracle calls N; curves: zoVIA, zoESVIA and zoscESVIA with Full coordinates and Random direction oracles, including the same-direction variant of zoESVIA.]
Fig. 1.
Different algorithms with Full coordinates and Random direction oracles applied to solve the saddle-point problem (10).
References
1. Ben-Tal, A., Nemirovski, A.: Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications (2019)
2. Beznosikov, A., Gorbunov, E., Gasnikov, A.: Derivative-free method for decentralized distributed non-smooth optimization. arXiv preprint arXiv:1911.10645 (2019)
3. Beznosikov, A., Sadiev, A., Gasnikov, A.: Gradient-free methods for saddle-point problem. arXiv preprint arXiv:2005.05913 (2020)
4. Cesa-Bianchi, N., Conconi, A., Gentile, C.: On the generalization ability of on-line learning algorithms. IEEE Trans. Information Theory 50(9), 2050–2057 (2004)
5. Chen, P.Y., Zhang, H., Sharma, Y., Yi, J., Hsieh, C.J.: ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security – AISec '17 (2017). https://doi.org/10.1145/3128572.3140448
6. Croce, F., Hein, M.: A randomized gradient-free attack on ReLU networks (2018)
7. Croce, F., Rauber, J., Hein, M.: Scaling up the randomized gradient-free adversarial attack reveals overestimation of robustness using established attacks (2019)
8. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks (2014)
9. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples (2014)
10. Gorbunov, E., Dvurechensky, P., Gasnikov, A.: An accelerated method for derivative-free smooth stochastic convex optimization. arXiv preprint arXiv:1802.09022 (2018)
11. Hsieh, Y.G., Iutzeler, F., Malick, J., Mertikopoulos, P.: On the convergence of single-call stochastic extra-gradient methods (2019)
12. Juditsky, A., Nemirovskii, A.S., Tauvel, C.: Solving variational inequalities with stochastic mirror-prox algorithm (2008)
13. Korpelevich, G.M.: The extragradient method for finding saddle points and other problems (1976)
14. Liu, S., Lu, S., Chen, X., Feng, Y., Xu, K., Al-Dujaili, A., Hong, M., O'Reilly, U.M.: Min-max optimization without gradients: Convergence and applications to adversarial ML (2019)
15. Narodytska, N., Kasiviswanathan, S.P.: Simple black-box adversarial attacks on deep neural networks. In: CVPR Workshops, pp. 1310–1318. IEEE Computer Society (2017)
16. Nemirovski, A.: Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization 15(1), 229–251 (2004). https://doi.org/10.1137/S1052623403425629
17. Shamir, O.: An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. CoRR abs/1507.08752 (2015), http://arxiv.org/abs/1507.08752
18. Stich, S.U.: Unified optimal analysis of the (stochastic) gradient method (2019)
19. Tramèr, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., McDaniel, P.: Ensemble adversarial training: Attacks and defenses (2017)
20. Wang, Z., Balasubramanian, K., Ma, S., Razaviyayn, M.: Zeroth-order algorithms for nonconvex minimax problems with improved complexities (2020)
21. Ye, H., Huang, Z., Fang, C., Li, C.J., Zhang, T.: Hessian-aware zeroth-order optimization for black-box adversarial attack (2018)
A General facts and technical lemmas
Lemma 3.
For an arbitrary integer m ≥ 1 and an arbitrary set of positive numbers a_1, . . . , a_m we have

( Σ_{i=1}^m a_i )² ≤ m Σ_{i=1}^m a_i².     (11)

Lemma 4.
For q ≥ 2 and arbitrary vectors a ∈ R^n, b ∈ R^m we have

‖(a, b)‖_q² ≤ ‖a‖_q² + ‖b‖_q².     (12)

Lemma 5 (Fact 5.3.2 from [1]).
Given a norm ‖·‖ on the space Z and a prox-function d(z), let z ∈ Z, w ∈ R^n and z_+ = prox_z(w). Then for all u ∈ Z

⟨w, z_+ − u⟩ ≤ V_z(u) − V_{z_+}(u) − V_z(z_+).     (13)

Lemma 6 (see Lemma 1 from [10]).
Let e ∈ RS(1), i.e. a random vector uniformly distributed on the surface of the unit Euclidean sphere in R^n, and let q ∈ [2; +∞). Then, for n ≥ 8,

E[‖e‖_q²] ≤ n^{2/q−1} ρ_n,     (14)

E[⟨s, e⟩² ‖e‖_q²] ≤ n^{2/q−2} ρ_n ‖s‖₂²,  ∀ s ∈ R^n,     (15)

where ρ_n = min{ 2q − 1, 16 log n − 8 }.

Lemma 7 (see Lemma 2 from [18]).
Let us consider a non-negative sequence r_k satisfying r_{k+1} ≤ (1 − aγ) r_k + c γ², where a, c > 0 and γ ≤ 1/d. Then

a r_{N+1} ≤ d r_0 · exp(−aN/d) + 36 c/(a N).     (16)

B Proof of Lemma 1
Lemma. If f(x, y) is μ-strongly convex in x and μ-strongly concave in y w.r.t. V_·(·), then for F(z) we have

⟨F(z_1) − F(z_2), z_1 − z_2⟩ ≥ μ ( V_{z_1}(z_2) + V_{z_2}(z_1) ),  ∀ z_1, z_2 ∈ Z.     (17)

Proof.
By definition of µ -strong convexity w.r.t V · ( · ): f ( x , y ) ≥ f ( x , y )+ (cid:104)∇ x f ( x , y ) , x − x (cid:105) + µ (cid:0) V ( x ,y ) ( x , y ) + V ( x ,y ) ( x , y ) (cid:1) , f ( x , y ) ≥ f ( x , y )+ (cid:104)∇ x f ( x , y ) , x − x (cid:105) + µ (cid:0) V ( x ,y ) ( x , y ) + V ( x ,y ) ( x , y ) (cid:1) , − f ( x , y ) ≥ − f ( x , y )+ (cid:104)−∇ y f ( x , y ) , y − y (cid:105) + µ (cid:0) V ( x ,y ) ( x , y ) + V ( x ,y ) ( x , y ) (cid:1) , − f ( x , y ) ≥ − f ( x , y )+ (cid:104)−∇ y f ( x , y ) , y − y (cid:105) + µ (cid:0) V ( x ,y ) ( x , y ) + V ( x ,y ) ( x , y ) (cid:1) . Let introduce a new definition for sum of Bregman divergences: V = V ( x ,y ) ( x , y ) + V ( x ,y ) ( x , y ) + V ( x ,y ) ( x , y ) + V ( x ,y ) ( x , y )+ V ( x ,y ) ( x , y ) + V ( x ,y ) ( x , y ) + V ( x ,y ( x , y ) + V ( x ,y ) ( x , y ) . Using definition of Bregman divergence and 1-stronge convexity of prox-function d , we get: V = (cid:104)∇ x d ( x , y ) − ∇ x d ( x , y ) , x − x (cid:105) + (cid:104)∇ x d ( x , y ) − ∇ x d ( x , y ) , x − x (cid:105) + (cid:104)∇ y d ( x , y ) − ∇ y d ( x , y ) , y − y (cid:105) + (cid:104)∇ y d ( x , y ) − ∇ y d ( x , y ) , y − y (cid:105) = (cid:104)∇ d ( z ) − ∇ d ( z ) , z − z (cid:105) + (cid:104)∇ d (˜ z ) − ∇ d (˜ z ) , ˜ z − ˜ z (cid:105)≥ V z ( z ) + V z ( z ) , where ˜ z = ( x , y ), ˜ z = ( x , y ) Thus, we have V ≥ V z ( z ) + V z ( z ). Summ-ming up: (cid:104)∇ x f ( x , y ) − ∇ x f ( x , y ) , x − x (cid:105)−(cid:104)∇ y f ( x , y ) − ∇ y f ( x , y ) , y − y (cid:105) + µ V ≤ . Using
V ≥ V z ( z ) + V z ( z ), we have (cid:104)∇ x f ( x , y ) − ∇ x f ( x , y ) , x − x (cid:105) − (cid:104)∇ y f ( x , y ) − ∇ y f ( x , y ) , y − y (cid:105) + µ V z ( z ) + V z ( z )) ≤ , and (cid:104) F ( z ) − F ( z ) , z − z (cid:105) = (cid:104)∇ x f ( x , y ) − ∇ x f ( x , y ) , x − x (cid:105)−(cid:104)∇ y f ( x , y ) − ∇ y f ( x , y ) , y − y (cid:105)≥ µ V z ( z ) + V z ( z )) . (cid:3) C Proof of Lemma 2
Lemma.
Let e ∈ RS (1), i.e. uniformly distributed on the unit Euclideansphere. Randomness comes from independent variables e , ξ and a point z . Norm (cid:107) · (cid:107) ∗ = (cid:107) · (cid:107) q satisfies q ∈ [2; + ∞ ). We introduce the constant ρ n : ρ n = min { q − ,
16 log( n ) − } . Then under Assumption 3 or 3(f) the following statements hold: eroth-Order Algorithms for Smooth Saddle-Point Problems 17 – for Random direction oracle E (cid:2) (cid:107) g d ( z, e, τ, ξ ) (cid:107) q (cid:3) ≤ n /q ρ n E (cid:2) (cid:107) F ( z ) − F ( z ∗ ) (cid:107) (cid:3) + 48 n /q ρ n (cid:107) F ( z ∗ ) (cid:107) +48 n /q ρ n σ + 8 n /q +1 ρ n L τ +16 n /q +1 ρ n ∆ τ , (18) (cid:107) E [ g d ( z, e, τ, ξ )] − F ( z ) (cid:107) q ≤ n /q +1 / √ ρ n Lτ + 4 n /q +1 / √ ρ n ∆τ ; (19) – for Full coordinates oracle E (cid:2) (cid:107) g f ( z, τ, ξ ) − F ( z ) (cid:107) q (cid:3) ≤ σ + 3 nL τ + 6 n∆ τ , (20) (cid:107) E [ g f ( z, τ, ξ )] − F ( z ) (cid:107) q ≤ √ nLτ + 2 √ n∆τ . (21) Proof of (18). E (cid:2) (cid:107) g d ( z, e, τ, ξ ) (cid:107) q (cid:3) (11) ≤ n E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:104)∇ x f ( x, y ) , e x (cid:105) e x (cid:104)−∇ y f ( x, y ) , e y (cid:105) e y (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q +4 n E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:104)∇ x f ( x, y, ξ ) − ∇ x f ( x, y ) , e x (cid:105) e x (cid:104)−∇ y f ( x, y, ξ ) + ∇ y f ( x, y ) , e y (cid:105) e y (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q +4 n τ E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ( f ( x + τ e x , y, ξ ) − f ( x, y, ξ ) − (cid:104)∇ x f ( x, y, ξ ) , τ e x (cid:105) ) e x ( f ( x, y, ξ ) − f ( x, y + τ e y , ξ ) + (cid:104)∇ y f ( x, y, ξ ) , τ e y (cid:105) ) e y (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q +4 n τ E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ( δ ( x + τ e x , y ) − δ ( x, y )) e x ( δ ( x, y ) − δ ( x, y + τ e y )) e y (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q (12) ≤ n E (cid:104) (cid:107)(cid:104)∇ x f ( x, y ) , e x (cid:105) e x (cid:107) q (cid:105) + 4 n E (cid:104) (cid:107)(cid:104)−∇ y f ( x, y ) , e y (cid:105) e y (cid:107) q (cid:105) +4 n E (cid:104) (cid:107)(cid:104)∇ x f ( x, y, ξ ) − ∇ x f ( x, y ) , e x (cid:105) e x (cid:107) q (cid:105) +4 n E (cid:104) (cid:107)(cid:104)−∇ y f ( x, y, ξ ) + ∇ y f ( x, y ) , e y (cid:105) e y (cid:107) q (cid:105) +4 n τ E (cid:20)(cid:13)(cid:13)(cid:13)(cid:16) ˜ f ( x + τ e x , y, ξ ) − ˜ f ( x, y, ξ ) − (cid:104)∇ x f ( x, y, ξ ) , τ e x (cid:105) (cid:17) e x (cid:13)(cid:13)(cid:13) q (cid:21) +4 n τ E (cid:20)(cid:13)(cid:13)(cid:13)(cid:16) ˜ f ( x, y, ξ ) − ˜ f ( x, y + τ e y , ξ ) + (cid:104)∇ y f ( x, y, ξ ) , τ e y (cid:105) (cid:17) e y (cid:13)(cid:13)(cid:13) q (cid:21) +4 n τ E (cid:104) (cid:107) ( δ ( x + τ e x , y ) − δ ( x, y )) e x (cid:107) q (cid:105) +4 n τ E (cid:104) (cid:107) ( δ ( x, y ) − δ ( x, y + τ e y )) e y (cid:107) q (cid:105) . From (8) we get (cid:107)∇ x f ( x , y, ξ ) −∇ x f ( x , y, ξ ) (cid:107) ≤ L (cid:107) x − x (cid:107) and (cid:107)∇ y f ( x, y , ξ ) −∇ y f ( x, y , ξ ) (cid:107) ≤ L (cid:107) y − y (cid:107) for all x, x , x ∈ X , y, y , y ∈ Y . It follows thatfunctions f ( · , y, ξ ) and f ( x, · , ξ ) are L ( ξ )-Lipschitz continuous. 
Then E (cid:2) (cid:107) g d ( z, e, τ, ξ ) (cid:107) q (cid:3) ≤ n E (cid:104) (cid:107)(cid:104)∇ x f ( x, y ) , τ e x (cid:105) e x (cid:107) q (cid:105) + 4 n E (cid:104) (cid:107)(cid:104)−∇ y f ( x, y ) , τ e y (cid:105) e y (cid:107) q (cid:105) +4 n E (cid:104) (cid:107)(cid:104)∇ x f ( x, y, ξ ) − ∇ x f ( x, y ) , τ e x (cid:105) e x (cid:107) q (cid:105) +4 n E (cid:104) (cid:107)(cid:104)−∇ y f ( x, y, ξ ) + ∇ y f ( x, y ) , τ e y (cid:105) e y (cid:107) q (cid:105) +4 n L τ E (cid:104) (cid:107) e x (cid:107) q (cid:105) + 4 n L τ E (cid:104) (cid:107) e y (cid:107) q (cid:105) +8 n ∆ τ E (cid:104) (cid:107) e x (cid:107) q (cid:105) + 8 n ∆ τ E (cid:104) (cid:107) e y (cid:107) q (cid:105) . In the last inequality, we additionally use (3) + (11) and independence of e and ξ . With (14) and (15), one can get the following result: E (cid:2) (cid:107) g d ( z, e, τ, ξ ) (cid:107) q (cid:3) ≤ n /q ρ n E (cid:2) (cid:107)∇ x f ( x, y ) (cid:107) (cid:3) + 24 n /q ρ n E (cid:2) (cid:107) − ∇ y f ( x, y ) (cid:107) (cid:3) +24 n /q ρ n E (cid:104) (cid:107)∇ x f ( x, y, ξ ) − ∇ x f ( x, y ) (cid:107) (cid:105) +24 n /q ρ n E (cid:104) (cid:107)−∇ y f ( x, y, ξ ) + ∇ y f ( x, y ) (cid:107) (cid:105) +8 n /q +1 ρ n L τ + 16 n /q +1 ρ n ∆ τ ≤ n /q ρ n E (cid:2) (cid:107) F ( z ) (cid:107) (cid:3) + 48 n /q ρ n σ + 8 n /q +1 ρ n L τ + 16 n /q +1 ρ n ∆ τ ≤ n /q ρ n E (cid:2) (cid:107) F ( z ) − F ( z ∗ ) (cid:107) (cid:3) + 48 n /q ρ n (cid:107) F ( z ∗ ) (cid:107) +48 n /q ρ n σ + 8 n /q +1 ρ n L τ + 16 n /q +1 ρ n ∆ τ . Proof of (19) . (cid:107) E [ g d ( z, e, τ, ξ )] − F ( z ) (cid:107) q ≤ nτ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E ( f ( x + τ e x , y, ξ ) − f ( x, y, ξ ) − (cid:104)∇ x f ( x, y, ξ ) , τ e x (cid:105) ) e x ( f ( x, y, ξ ) − f ( x, y + τ e y , ξ ) + (cid:104)∇ y f ( x, y, ξ ) , τ e y (cid:105) ) e y (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q + n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E (cid:104)∇ x f ( x, y, ξ ) − ∇ x f ( x, y ) , e x (cid:105) e x (cid:104)−∇ y f ( x, y, ξ ) + ∇ y f ( x, y ) , e y (cid:105) e y (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q + n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E (cid:104)∇ x f ( x, y ) , e x (cid:105) e x (cid:104)−∇ y f ( x, y ) , e y (cid:105) e y − ∇ x f ( x, y ) −∇ y f ( x, y ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q + nτ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E ( δ ( x + τ e x , y ) − δ ( x, y )) e x ( δ ( x, y ) − δ ( x, y + τ e y )) e y (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q . 
eroth-Order Algorithms for Smooth Saddle-Point Problems 19 Taking into account the independence of e and ξ , as well as using their unbi-asedness, we get (cid:107) E [ g d ( z, e, τ, ξ )] − F ( z ) (cid:107) q ≤ nτ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E ( f ( x + τ e x , y ) − f ( x, y ) − (cid:104)∇ x f ( x, y ) , τ e x (cid:105) ) e x ( f ( x, y ) − f ( x, y + τ e y ) + (cid:104)∇ y f ( x, y ) , τ e y (cid:105) ) e y (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q + nτ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E ( δ ( x + τ e x , y ) − δ ( x, y )) e x ( δ ( x, y ) − δ ( x, y + τ e y )) e y (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q (12) ≤ nτ (cid:107) E [( f ( x + τ e x , y ) − f ( x, y ) − (cid:104)∇ x f ( x, y ) , τ e x (cid:105) ) e x ] (cid:107) q + nτ (cid:107) E [( f ( x, y ) − f ( x, y + τ e y ) + (cid:104)∇ y f ( x, y ) , τ e y (cid:105) ) e y ] (cid:107) q + nτ (cid:107) E [( δ ( x + τ e x , y ) − δ ( x, y )) e x ] (cid:107) q + nτ (cid:107) E [( δ ( x, y ) − δ ( x, y + τ e y )) e y ] (cid:107) q . Further, Jensen inequality gives (cid:107) E [ g d ( z, e, τ, ξ )] − F ( z ) (cid:107) q ≤ nτ E (cid:104) | f ( x + τ e x , y ) − f ( x, y ) − (cid:104)∇ x f ( x, y ) , τ e x (cid:105)| (cid:107) e x (cid:107) q (cid:105) + nτ E (cid:104) | f ( x, y ) − f ( x, y + τ e y ) + (cid:104)∇ y f ( x, y ) , τ e y (cid:105)| (cid:107) e y (cid:107) q (cid:105) + nτ E (cid:104) | δ ( x + τ e x , y ) − δ ( x, y ) | (cid:107) e x (cid:107) q (cid:105) + nτ E (cid:104) | δ ( x, y ) − δ ( x, y + τ e y ) | (cid:107) e y (cid:107) q (cid:105) . It remains to use L -Lipschitz continuous of f ( · , y ) and f ( x, · ): (cid:107) E [ g d ( z, e, τ, ξ )] − F ( z ) (cid:107) q ≤ nLτ E (cid:104) (cid:107) e x (cid:107) q (cid:105) + nLτ E (cid:104) (cid:107) e y (cid:107) q (cid:105) + nτ E (cid:104) ( | δ ( x + τ e x , y ) | + | δ ( x, y ) | ) (cid:107) e x (cid:107) q (cid:105) + nτ E (cid:104) ( | δ ( x, y ) | + | δ ( x, y + τ e y ) | ) (cid:107) e y (cid:107) q (cid:105) (3) , (14) ≤ n /q +1 / √ ρ n Lτ + 4 n /q +1 / √ ρ n ∆τ . Proof of (20). E (cid:2) (cid:107) g f ( z, τ, ξ ) − F ( z ) (cid:107) q (cid:3) (5) , (11) ≤ E (cid:34)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) τ n x (cid:88) i =1 ( f ( z + τ h i , ξ ) − f ( z, ξ )) h i + 1 τ n x + n y (cid:88) i = n x +1 ( f ( z, ξ ) − f ( z + τ h i , ξ )) h i − F ( z, ξ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:35) +3 E (cid:104) (cid:107) F ( z, ξ ) − F ( z ) (cid:107) (cid:105) +3 E (cid:34)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n x + n y (cid:88) i =1 ( δ ( z + τ h i ) − δ ( z )) τ h i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:35) (3) , (11) ≤ E (cid:34) n x + n y (cid:88) i =1 (cid:12)(cid:12)(cid:12)(cid:12) ( f ( z + τ h i , ξ ) − f ( z, ξ )) τ − ∂f ( z, ξ ) ∂z i (cid:12)(cid:12)(cid:12)(cid:12) (cid:35) +3 σ + 6 n∆ τ . By the mean value theorem we have that for some | q i | ≤ | τ | : E (cid:2) (cid:107) g f ( z, τ, ξ ) − F ( z ) (cid:107) ∗ (cid:3) ≤ E (cid:34) n (cid:88) i =1 (cid:12)(cid:12)(cid:12)(cid:12) ∂f ( z + q i h i , ξ ) ∂z i − ∂f ( z, ξ ) ∂z i (cid:12)(cid:12)(cid:12)(cid:12) (cid:35) +3 σ + 6 n∆ τ ≤ n (cid:88) i =1 L q i + 3 σ + 6 n∆ τ ≤ nL τ + 3 σ + 6 n∆ τ . Proof of (21). 
Using unbiasedness of ξ : (cid:107) E [ g f ( z, τ, ξ )] − F ( z )] (cid:107) q ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) τ n x (cid:88) i =1 ( f ( z + τ h i ) − f ( z )) h i + 1 τ n x + n y (cid:88) i = n x +1 ( f ( z ) − f ( z + τ h i , )) h i − F ( z ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n x + n y (cid:88) i =1 ( δ ( z + τ h i ) − δ ( z )) τ h i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:118)(cid:117)(cid:117)(cid:116) n (cid:88) i =1 L q i + 2 √ n∆τ ≤ √ nLτ + 2 √ n∆τ . (cid:3) eroth-Order Algorithms for Smooth Saddle-Point Problems 21 D Proof of Theorem 1
Lemma 8.
Let z, g ∈ R^n and Z ⊂ R^n. Then for z_1 = prox_z(g) and for all u ∈ Z we have

⟨g, z − u⟩ ≤ V_z(u) − V_{z_1}(u) + 1/2 ‖g‖_q².     (22)

Proof.
By (13), we have for all u ∈ Z(cid:104) g, z − u (cid:105) = (cid:104) g, z − z + z − u (cid:105) ≤ V z ( u ) − V z ( u ) − V z ( z ) . Making simple transformations: (cid:104) g, z − u (cid:105) ≤ (cid:104) g, z − z (cid:105) + V z ( u ) − V z ( u ) − V z ( z ) ≤ (cid:104) g, z − z (cid:105) + V z ( u ) − V z ( u ) − (cid:107) z − z (cid:107) p . In last inequality we use the property of the Bregman divergence: V x ( y ) ≥ (cid:107) x − y (cid:107) p . With Hlder’s inequality and the fact: ab − b / (cid:54) a / , we get (cid:104) g, z − u (cid:105) ≤ (cid:107) g (cid:107) q (cid:107) z − z (cid:107) p + V z ( u ) − V z ( u ) − (cid:107) z − z (cid:107) p ≤ V z ( u ) − V z ( u ) + 12 (cid:107) g (cid:107) q . (cid:3) Theorem.
By Algorithm 1 with Random direction oracle – under Assumptions 1, 2, 3(f) and with γ ≤ n /q ρ n L , we get1 N N (cid:88) k =1 E (cid:2) (cid:107) F ( z k ) − F ( z ∗ ) (cid:107) (cid:3) ≤ LD p γN + 48 γn /q ρ n L (cid:0) (cid:107) F ( z ∗ ) (cid:107) + σ (cid:1) +8 γn /q +1 ρ n L (cid:18) L τ + 2 ∆ τ (cid:19) +8 n /q +1 / √ ρ n LD p (cid:18) Lτ + 2 ∆τ (cid:19) ; (23) – under Assumptions 1, 2(s), 3 and with γ ≤ µ n /q ρ n L : E (cid:2) V z N +1 ( z ∗ ) (cid:3) ≤ V z ( z ∗ ) γµ exp (cid:18) − γµN (cid:19) + 3500 n /q ρ n µN (cid:0) (cid:107) F ( z ∗ ) (cid:107) + σ (cid:1) + 600 n /q +1 ρ n µN (cid:18) L τ + 2 ∆ τ (cid:19) + 600 n /q +1 / √ ρ n D p γµN (cid:18) Lτ + 2 ∆τ (cid:19) . (24) (cid:3) Proof of (23) . We begin with descent lemma (22): γ (cid:104) g d ( z k , e k , τ, ξ k ) , z k − u (cid:105) ≤ V z k ( u ) − V z k +1 ( u ) + γ (cid:107) g d ( z k , e k , τ, ξ k ) (cid:107) q . Taking u = z ∗ and using convexity - concavity of f ( x, y ) in form (cid:104) F ( z ∗ ) , z k − z ∗ (cid:105) ≥
0, we get γ (cid:104) F ( z k ) − F ( z ∗ ) , z k − u (cid:105) ≤ V z k ( z ∗ ) − V z k +1 ( z ∗ )+ γ (cid:104) F ( z k ) − g d ( z k , e k , τ, ξ k ) , z k − z ∗ (cid:105) + γ (cid:107) g d ( z k , e k , τ, ξ k ) (cid:107) q . With (9), this gives γL (cid:107) F ( z k ) − F ( z ∗ ) (cid:107) ≤ V z k ( z ∗ ) − V z k +1 ( z ∗ )+ γ (cid:104) F ( z k ) − g d ( z k , e k , τ, ξ k ) , z k − u (cid:105) + γ (cid:107) g d ( z k , e k , τ, ξ k ) (cid:107) q . Taking full expectation and using Hlder’s inequality, (18), (19), we have γL E (cid:2) (cid:107) F ( z k ) − F ( z ∗ ) (cid:107) (cid:3) ≤ E [ V z k ( z ∗ )] − E (cid:2) V z k +1 ( u ) (cid:3) +2 γ (cid:18) n /q +1 / √ ρ n Lτ + 4 n /q +1 / √ ρ n ∆τ (cid:19) D p + γ (cid:16) n /q ρ n E (cid:2) (cid:107) F ( z k ) − F ( z ∗ ) (cid:107) (cid:3) + 48 n /q ρ n (cid:107) F ( z ∗ ) (cid:107) (cid:17) + γ (cid:18) n /q ρ n σ + 8 n /q +1 ρ n L τ + 16 n /q +1 ρ n ∆ τ (cid:19) .γ ≤ / n q ρ n L gives γ L E (cid:2) (cid:107) F ( z k ) − F ( z ∗ ) (cid:107) (cid:3) ≤ E [ V z k ( z ∗ )] − E (cid:2) V z k +1 ( z ∗ ) (cid:3) +2 γ (cid:18) n /q +1 / √ ρ n Lτ + 4 n /q +1 / √ ρ n ∆τ (cid:19) D p + γ (cid:16) n /q ρ n (cid:107) F ( z ∗ ) (cid:107) + 48 n /q ρ n σ (cid:17) + γ (cid:18) n /q +1 ρ n L τ + 16 n /q +1 ρ n ∆ τ (cid:19) . It remains to sum up from k = 1 to k = N :1 N N (cid:88) k =1 E (cid:2) (cid:107) F ( z k ) − F ( z ∗ ) (cid:107) (cid:3) ≤ LD p γN + 48 γn /q ρ n L (cid:0) (cid:107) F ( z ∗ ) (cid:107) + σ (cid:1) +8 γn /q +1 ρ n L (cid:18) L τ + 2 ∆ τ (cid:19) +8 n /q +1 / √ ρ n LD p (cid:18) Lτ + 2 ∆τ (cid:19) . eroth-Order Algorithms for Smooth Saddle-Point Problems 23 (cid:3) Proof of (24) . Similarly to the previous proof, we begin with descent lemma(22): γ (cid:104) g ( z k , e k , τ, ξ k ) , z k − u (cid:105) ≤ V z k ( u ) − V z k +1 ( u ) + γ (cid:107) g d ( z k , e k , τ, ξ k ) (cid:107) q . Taking u = z ∗ and using (cid:104) F ( z ∗ ) , z k − z ∗ (cid:105) ≥
0, we get: γ (cid:104) F ( z k ) − F ( z ∗ ) , z k − z ∗ (cid:105) ≤ V z k ( z ∗ ) − V z k +1 ( z ∗ )+ γ (cid:104) F ( z k ) − g d ( z k , e k , τ, ξ k ) , z k − u (cid:105) + γ (cid:107) g ( z k , e k , τ, ξ k ) (cid:107) q . With (17), it gives γµ V z k ( z ∗ ) ≤ V z k ( z ∗ ) − V z k +1 ( z ∗ )+ γ (cid:104) F ( z k ) − g d ( z k , e k , τ, ξ k ) , z k − u (cid:105) + γ (cid:107) g d ( z k , e k , τ, ξ k ) (cid:107) q . Taking full expectation and using (18), (19), we have E (cid:2) V z k +1 ( z ∗ ) (cid:3) ≤ (cid:16) − γµ (cid:17) E [ V z k ( z ∗ )] + 2 γ (cid:18) n /q +1 / √ ρ n Lτ + 4 n /q +1 / √ ρ n ∆τ (cid:19) D p + γ (cid:16) n /q ρ n E (cid:2) (cid:107) F ( z k ) − F ( z ∗ ) (cid:107) (cid:3) + 48 n /q ρ n (cid:107) F ( z ∗ ) (cid:107) (cid:17) + γ (cid:18) n /q ρ n σ + 8 n /q +1 ρ n L τ + 16 n /q +1 ρ n ∆ τ (cid:19) . Using (8) and assuming γ ≤ µ / (cid:16) n /q ρ n L (cid:17) : E (cid:2) V z k +1 ( z ∗ ) (cid:3) ≤ (cid:16) − γµ (cid:17) E [ V z k ( z ∗ )] + 2 γ (cid:32) n /q +1 / √ ρ n Lτγ + 4 n /q +1 / √ ρ n ∆γτ (cid:33) D p + γ (cid:16) n /q ρ n (cid:107) F ( z ∗ ) (cid:107) + 24 n /q ρ n σ (cid:17) + γ (cid:18) n /q +1 ρ n L τ + 8 n /q +1 ρ n ∆ τ (cid:19) . It remains to use (16) and get E (cid:2) V z N +1 ( z ∗ ) (cid:3) ≤ V z ( z ∗ ) γµ exp (cid:18) − γµN (cid:19) ++ 3500 n /q ρ n µN (cid:0) (cid:107) F ( z ∗ ) (cid:107) + σ (cid:1) + 600 n /q +1 ρ n µ N (cid:18) L τ + 2 ∆ τ (cid:19) + 600 n /q +1 / √ ρ n D p γµ N (cid:18) Lτ + 2 ∆τ (cid:19) . (cid:3) E Proof of Theorem 2
Lemma 9.
Let z, g, g_{1/2} ∈ R^n and Z ⊂ R^n. Then for z_{1/2} = prox_z(g) and z_1 = prox_z(g_{1/2}) and for all u ∈ Z we have

⟨g_{1/2}, z_{1/2} − u⟩ ≤ V_z(u) − V_{z_1}(u) + 1/2 ‖g − g_{1/2}‖_q² − 1/2 V_z(z_{1/2}).     (25)

Proof.
Using (13) with z = z , z + = z , w = g / , u = u and with z = z , z + = z / , w = g , u = z : (cid:104) g / , z − u (cid:105) ≤ V z ( u ) − V z ( u ) − V z ( z ) , (cid:104) g, z / − z (cid:105) ≤ V z ( z ) − V z / ( z ) − V z ( z / ) . By summing these two inequalities, we get (cid:104) g / , z / − u (cid:105) ≤ V z ( u ) − V z ( u ) + (cid:104) g − g / , z − z / (cid:105)− V z / ( z ) − V z ( z / ) . Applying Cauchy-Schwartz inequality and property: V z / ( z ) ≥ / (cid:107) z / − z (cid:107) ,we have (cid:104) g / , z / − u (cid:105) ≤ V z ( u ) − V z ( u ) + 12 (cid:107) g − g / (cid:107) q − V z ( z / ) . (cid:3) Theorem.–
By Algorithm 2 with Full coordinates oracle under Assumptions 1, 2, 3 andwith γ ≤ / L , we have E [ ε sad (¯ z N +1 )] ≤ D p γN + 9 γ (cid:18) σ + nL τ + 2 n∆ τ (cid:19) +2 √ nD p (cid:18) Lτ + 2 ∆τ (cid:19) , (26)where ε sad (¯ z N ) = max y (cid:48) ∈Y f (¯ x N , y (cid:48) ) − min x (cid:48) ∈X f ( x (cid:48) , ¯ y N ) , ¯ x N , ¯ y N are defined the same way as ¯ z N . – By Algorithm 3 with Full coordinates oracle under Assumptions 1, 2(s), 3and with p = 2 ( V x ( y ) = / (cid:107) x − y (cid:107) ), γ ≤ / L : E (cid:2) (cid:107) z N +1 − z ∗ (cid:107) (cid:3) ≤ LD µ exp (cid:18) − µN L (cid:19) + 450 µ N (cid:18) σ + nL τ + 2 n∆ τ (cid:19) + 150 D γµ N (cid:18) √ nLτ + 2 √ n∆τ (cid:19) . (27) eroth-Order Algorithms for Smooth Saddle-Point Problems 25 Proof of (26) . We begin with (25) and taking z = z k , g = γg f ( z k , e k , τ, ξ k ), g / = γg f ( z k +1 / , e k +1 / , τ, ξ k +1 / ), then z / = z k +1 / , z = z k +1 and have γ (cid:104) g f ( z k +1 / , e k +1 / , τ, ξ k +1 / ) , z k +1 / − u (cid:105)≤ V z k ( u ) − V z k +1 ( u ) − V z k ( z k +1 / )+ γ (cid:107) g f ( z k +1 / , e k +1 / , τ, ξ k +1 / ) − g f ( z k , e k , τ, ξ k ) (cid:107) q (11) ≤ V z k ( u ) − V z k +1 ( u ) − V z k ( z k +1 / )+ 3 γ (cid:107) F ( z k +1 / ) − F ( z k ) (cid:107) q + 3 γ (cid:107) g f ( z k +1 / , e k +1 / , τ, ξ k +1 / ) − F ( z k +1 / ) (cid:107) q + 3 γ (cid:107) g f ( z k , e k , τ, ξ k ) − F ( z k ) (cid:107) q (8) ≤ V z k ( u ) − V z k +1 ( u ) − V z k ( z k +1 / )+ 3 γ L (cid:107) z k +1 / − z k (cid:107) + 3 γ (cid:107) g f ( z k +1 / , e k +1 / , τ, ξ k +1 / ) − F ( z k +1 / ) (cid:107) q + 3 γ (cid:107) g f ( z k , e k , τ, ξ k ) − F ( z k ) (cid:107) q . Applying the property: V z k ( z k +1 / ) ≥ / (cid:107) z k +1 / − z k (cid:107) ≥ / (cid:107) z k +1 / − z k (cid:107) ,with γ ≤ / L , we get γ (cid:104) g f ( z k +1 / , e k +1 / , τ, ξ k +1 / ) , z k +1 / − u (cid:105)≤ V z k ( u ) − V z k +1 ( u )+ 3 γ (cid:107) g f ( z k +1 / , e k +1 / , τ, ξ k +1 / ) − F ( z k +1 / ) (cid:107) q + 3 γ (cid:107) g f ( z k , e k , τ, ξ k ) − F ( z k ) (cid:107) q , and γ (cid:104) F ( z k +1 / ) , z k +1 / − u (cid:105) ≤ V z k ( u ) − V z k +1 ( u )+ γ (cid:104) F ( z k +1 / ) − g f ( z k +1 / , e k +1 / , τ, ξ k +1 / ) , z k +1 / − u (cid:105) + 3 γ (cid:107) g f ( z k +1 / , e k +1 / , τ, ξ k +1 / ) − F ( z k +1 / ) (cid:107) q + 3 γ (cid:107) g f ( z k , e k , τ, ξ k ) − F ( z k ) (cid:107) q . Taking the full expectation and using (20), (21) with (6): E (cid:2) γ (cid:104) F ( z k +1 / ) , z k +1 / − u (cid:105) (cid:3) ≤ E [ V z k ( u )] − E (cid:2) V z k +1 ( u ) (cid:3) +2 γ (cid:18) √ nLτ + 2 √ n∆τ (cid:19) D p +3 γ (cid:18) σ + 3 nL τ + 6 n∆ τ (cid:19) . Summing over all k from 0 to N , we have E (cid:34) N (cid:88) k =0 (cid:104) F ( z k +1 / ) , z k +1 / − u (cid:105) (cid:35) ≤ D p γ + 2 N (cid:18) √ nLτ + 2 √ n∆τ (cid:19) D p +3 N γ (cid:18) σ + 3 nL τ + 6 n∆ τ (cid:19) . To finish the proof we need to connect N (cid:80) k =0 (cid:104) F ( z k +1 / ) , z k +1 / − u (cid:105) and ε sad (¯ z N +1 ).By the definition of ¯ x N and ¯ y N , Jensen’s inequality and convexity-concavity of f : ε sad (¯ z N +1 ) ≤ max y (cid:48) ∈Y f (cid:32) N + 1 (cid:32) N (cid:88) k =0 x k +1 / (cid:33) , y (cid:48) (cid:33) − min x (cid:48) ∈X f (cid:32) x (cid:48) , N + 1 (cid:32) N (cid:88) k =0 y k +1 / (cid:33)(cid:33) ≤ max y (cid:48) ∈Y N + 1 N (cid:88) k =0 f ( x k +1 / , y (cid:48) ) − min x (cid:48) ∈X N + 1 N (cid:88) k =0 f ( x (cid:48) , y k +1 / ) . 
Given the fact of linear independence of x (cid:48) and y (cid:48) : ε sad (¯ z N ) ≤ max ( x (cid:48) ,y (cid:48) ) ∈Z N + 1 N (cid:88) k =0 (cid:0) f ( x k +1 / , y (cid:48) ) − f ( x (cid:48) , y k +1 / ) (cid:1) . Using convexity and concavity of the function f : ε sad (¯ z N ) ≤ max ( x (cid:48) ,y (cid:48) ) ∈Z N + 1 N (cid:88) k =0 (cid:0) f ( x k +1 / , y (cid:48) ) − f ( x (cid:48) , y k +1 / ) (cid:1) = max ( x (cid:48) ,y (cid:48) ) ∈Z N + 1 N (cid:88) k =0 (cid:0) f ( x k +1 / , y (cid:48) ) − f ( x k +1 / , y k +1 / ) + f ( x k +1 / , y k +1 / ) − f ( x (cid:48) , y k +1 / ) (cid:1) ≤ max ( x (cid:48) ,y (cid:48) ) ∈Z N + 1 N (cid:88) k =0 (cid:0) (cid:104)∇ y f ( x k +1 / , y k +1 / ) , y (cid:48) − y k (cid:105) + (cid:104)∇ x f ( x k +1 / , y k +1 / ) , x k − x (cid:48) (cid:105) (cid:1) ≤ max u ∈Z N + 1 N (cid:88) k =0 (cid:104) F ( z k +1 / ) , z k +1 / − u (cid:105) . (28) eroth-Order Algorithms for Smooth Saddle-Point Problems 27 Proof of (27) . Similarly to the previous proof, let begin with (25) and takefull expectation: E (cid:2) (cid:107) z k +1 − z ∗ (cid:107) (cid:3) ≤ E (cid:2) (cid:107) z k − z ∗ (cid:107) (cid:3) − γ E (cid:2) (cid:104) g f ( z k + / , τ, ξ k + / ) , z k + / − z ∗ (cid:105) (cid:3) + γ E (cid:2) (cid:107) g f ( z k + / , τ, ξ k + / ) − g f ( z k − / , τ, ξ k − / ) (cid:107) (cid:3) − E (cid:2) (cid:107) z k + / − z k (cid:107) (cid:3) . (29)Next we work with E (cid:2) (cid:107) g f ( z k + / , τ, ξ k + / ) − g f ( z k − / , τ, ξ k − / ) (cid:107) (cid:3) : E (cid:2) (cid:107) g f ( z k + / , τ, ξ k + / ) − g f ( z k − / , τ, ξ k − / ) (cid:107) (cid:3) (11) ≤ E (cid:2) (cid:107) g f ( z k + / , τ, ξ k + / ) − F ( z k + / ) (cid:107) (cid:3) +3 E (cid:2) (cid:107) g f ( z k − / , τ, ξ k + / ) − F ( z k − / ) (cid:107) (cid:3) +3 E (cid:2) (cid:107) F ( z k + / ) − F ( z k − / ) (cid:107) (cid:3) (20) , (8) ≤ L E (cid:2) (cid:107) z k + / − z k − / (cid:107) (cid:3) + 6 (cid:18) σ + nL τ + 2 n∆ τ (cid:19) (11) ≤ L E (cid:2) (cid:107) z k + / − z k (cid:107) (cid:3) + 6 L E (cid:2) (cid:107) z k − z k − / (cid:107) (cid:3) +6 (cid:18) σ + nL τ + 2 n∆ τ (cid:19) ≤ L E (cid:2) (cid:107) z k + / − z k (cid:107) (cid:3) +6 γ L E (cid:2) (cid:107) g f ( z k − / , τ, ξ k − / ) − g f ( z k − / , τ, ξ k − / ) (cid:107) (cid:3) +6 (cid:18) σ + nL τ + 2 n∆ τ (cid:19) . In last inequality we use non-expansiveness of Euclidean prox operator. By sim-ple transformation: E (cid:2) (cid:107) g f ( z k + / , τ, ξ k + / ) − g f ( z k − / , τ, ξ k − / ) (cid:107) (cid:3) ≤ L E (cid:2) (cid:107) z k + / − z k (cid:107) (cid:3) +12 γ L E (cid:2) (cid:107) g f ( z k − / , τ, ξ k − / ) − g f ( z k − / , τ, ξ k − / ) (cid:107) (cid:3) − E (cid:2) (cid:107) g f ( z k + / , τ, ξ k + / ) − g f ( z k − / , τ, ξ k − / ) (cid:107) (cid:3) +12 (cid:18) σ + nL τ + 2 n∆ τ (cid:19) . If γ ≤ / L , then 12 γ L ≤ − µγ , and we can rewrite previous inequality: E (cid:2) (cid:107) g f ( z k + / , τ, ξ k + / ) − g f ( z k − / , τ, ξ k − / ) (cid:107) (cid:3) ≤ L E (cid:2) (cid:107) z k + / − z k (cid:107) (cid:3) +(1 − µγ ) E (cid:2) (cid:107) g f ( z k − / , τ, ξ k − / ) − g f ( z k − / , τ, ξ k − / ) (cid:107) (cid:3) − E (cid:2) (cid:107) g f ( z k + / , τ, ξ k + / ) − g f ( z k − / , τ, ξ k − / ) (cid:107) (cid:3) +12 (cid:18) σ + nL τ + 2 n∆ τ (cid:19) . 
(30) Next we consider − γ E (cid:2) (cid:104) g f ( z k + / , τ, ξ k + / ) , z k + / − z ∗ (cid:105) (cid:3) : − γ E (cid:2) (cid:104) g f ( z k + / , τ, ξ k + / ) , z k + / − z ∗ (cid:105) (cid:3) = − γ E (cid:2) (cid:104) F ( z k + / ) , z k + / − z ∗ (cid:105) (cid:3) +2 γ E (cid:2) (cid:104) F ( z k + / ) − g f ( z k + / , τ, ξ k + / ) , z k + / − z ∗ (cid:105) (cid:3) ≤ − γ E (cid:2) (cid:104) F ( z k + / ) , z k + / − z ∗ (cid:105) (cid:3) +4 γ (cid:107) E (cid:2) F ( z k + / ) − g f ( z k + / , τ, ξ k + / ) (cid:3) (cid:107) D ≤ − γ E (cid:2) (cid:104) F ( z k + / ) , z k + / − z ∗ (cid:105) (cid:3) +4 γ (cid:18) √ nLτ + 2 √ n∆τ (cid:19) D ≤ − γµ E (cid:2) (cid:107) z k + / − z ∗ (cid:107) (cid:3) +4 γ (cid:18) √ nLτ + 2 √ n∆τ (cid:19) D ≤ − γµ E (cid:2) (cid:107) z k − z ∗ (cid:107) (cid:3) + 2 γµ E (cid:2) (cid:107) z k + / − z k (cid:107) (cid:3) +4 γ (cid:18) √ nLτ + 2 √ n∆τ (cid:19) D . (31)Combining (29), (30), and (31), we have E (cid:2) (cid:107) z k +1 − z ∗ (cid:107) (cid:3) + E (cid:2) (cid:107) g f ( z k + / , τ, ξ k + / ) − g f ( z k − / , τ, ξ k − / ) (cid:107) (cid:3) ≤ (1 − γµ ) (cid:0) E (cid:2) (cid:107) z k − z ∗ (cid:107) (cid:3) + E (cid:2) (cid:107) g f ( z k − / , τ, ξ k − / ) − g f ( z k − / , τ, ξ k − / ) (cid:107) (cid:3)(cid:1) +(2 γµ + 12 γ L − E (cid:2) (cid:107) z k + / − z k (cid:107) (cid:3) + γ (cid:20) (cid:18) σ + nL τ + 2 n∆ τ (cid:19) + 4 D γ (cid:18) √ nLτ + 2 √ n∆τ (cid:19)(cid:21) . With γ ≤ / L we have 12 γ L ≤ − µγ and E (cid:2) (cid:107) z k +1 − z ∗ (cid:107) (cid:3) + E (cid:2) (cid:107) g f ( z k + / , τ, ξ k + / ) − g f ( z k − / , τ, ξ k − / ) (cid:107) (cid:3) ≤ (1 − γµ ) (cid:0) E (cid:2) (cid:107) z k − z ∗ (cid:107) (cid:3) + E (cid:2) (cid:107) g f ( z k − / , τ, ξ k − / ) − g f ( z k − / , τ, ξ k − / ) (cid:107) (cid:3)(cid:1) + γ (cid:20) (cid:18) σ + nL τ + 2 n∆ τ (cid:19) + 4 D γ (cid:18) √ nLτ + 2 √ n∆τ (cid:19)(cid:21) . It remains to apply (16) and then : E (cid:2) (cid:107) z N +1 − z ∗ (cid:107) (cid:3) ≤ Lµ exp (cid:18) − µN L (cid:19) (cid:0) (cid:107) z − z ∗ (cid:107) + (cid:107) g f ( z , τ, ξ ) − g f ( z , τ, ξ ) (cid:107) (cid:1) + 36 µ N (cid:20) (cid:18) σ + nL τ + 2 n∆ τ (cid:19) + 4 D γ (cid:18) √ nLτ + 2 √ n∆τ (cid:19)(cid:21) . (cid:3) eroth-Order Algorithms for Smooth Saddle-Point Problems 29 F Other approach for e in Algorithm 2 This algorithm is an easy modification of Algorithm 2. The only difference isthat we use the same direction e and random variable ξ within one iteration Algorithm 4 zoESVIA (same direction)
Input: z_0, N, γ, τ. Choose oracle grad from g_d, g_f.
for k = 0, 1, 2, . . . , N do
    Sample independent e_k, ξ_k.
    d_k = grad(z_k, e_k, τ, ξ_k).
    z_{k+1/2} = prox_{z_k}(γ · d_k).
    d_{k+1/2} = grad(z_{k+1/2}, e_k, τ, ξ_k).
    z_{k+1} = prox_{z_k}(γ · d_{k+1/2}).
end for
Output: z_{N+1} or z̄_{N+1}.

G Proof of Theorem 3
Theorem 4.
By Algorithm 2 under assumption 1, 2, 3 with Mixed oracle (cid:101) g f and γ ≤ / L , we get E [ ε sad (¯ z N )] ≤ D p γN + 2 D p (cid:18) √ n x L τ + 2 √ n x ∆τ (cid:19) +9 γ (cid:18) σ + n x L τ + 2 n x ∆ τ (cid:19) . (32) Proof of (32): We begin with (25) and taking z = z k , g = γ (cid:101) g f ( z k , e k , τ, ξ k ), g / = γ (cid:101) g f ( z k +1 / , e k +1 / , τ, ξ k +1 / ), then z / = z k +1 / , z = z k +1 and we get γ (cid:104) (cid:101) g f ( z k +1 / , e k +1 / , τ, ξ k +1 / ) , z k +1 / − u (cid:105)≤ V z k ( u ) − V z k +1 ( u ) − V z k ( z k +1 / )+ γ (cid:107) (cid:101) g f ( z k +1 / , e k +1 / , τ, ξ k +1 / ) − (cid:101) g f ( z k , e k , τ, ξ k ) (cid:107) q (11) ≤ V z k ( u ) − V z k +1 ( u ) − V z k ( z k +1 / )+ 3 γ (cid:107) F ( z k +1 / ) − F ( z k ) (cid:107) q + 3 γ (cid:107) (cid:101) g f ( z k +1 / , e k +1 / , τ, ξ k +1 / ) − F ( z k +1 / ) (cid:107) q + 3 γ (cid:107) (cid:101) g f ( z k , e k , τ, ξ k ) − F ( z k ) (cid:107) q With (8) it gives γ (cid:104) (cid:101) g f ( z k +1 / , e k +1 / , τ, ξ k +1 / ) , z k +1 / − u (cid:105)≤ V z k ( u ) − V z k +1 ( u ) − V z k ( z k +1 / )+ 3 γ L (cid:107) z k +1 / − z k (cid:107) + 3 γ (cid:107) (cid:101) g f ( z k +1 / , e k +1 / , τ, ξ k +1 / ) − F ( z k +1 / ) (cid:107) q + 3 γ (cid:107) (cid:101) g f ( z k , e k , τ, ξ k ) − F ( z k ) (cid:107) q . Applying the property: V z k ( z k +1 / ) ≥ / (cid:107) z k +1 / − z k (cid:107) ≥ / (cid:107) z k +1 / − z k (cid:107) ,with γ ≤ / L , we get γ (cid:104) (cid:101) g f ( z k +1 / , e k +1 / , τ, ξ k +1 / ) , z k +1 / − u (cid:105) ≤ V z k ( u ) − V z k +1 ( u )+ 3 γ (cid:107) (cid:101) g f ( z k +1 / , e k +1 / , τ, ξ k +1 / ) − F ( z k +1 / ) (cid:107) q + 3 γ (cid:107) (cid:101) g f ( z k , e k , τ, ξ k ) − F ( z k ) (cid:107) q . Taking the full expectation and using (20), (21) with (6): E (cid:2) γ (cid:104) F ( z k +1 / ) , z k +1 / − u (cid:105) (cid:3) ≤ E [ V z k ( u )] − E (cid:2) V z k +1 ( u ) (cid:3) +2 γ (cid:18) √ n x L τ + 2 √ n x ∆τ (cid:19) D p +3 γ (cid:18) σ + 3 n x L τ + 6 n x ∆ τ (cid:19) . It remains to sum up from k = 0 to k = N and use 28 and finish the proof ofthis theorem.and use 28 and finish the proof ofthis theorem.