Convergence analysis of the stochastic reflected forward-backward splitting algorithm
Nguyen Van Dung and Bằng Công Vũ
Department of Mathematics, University of Transport and Communications, 3 Cau Giay Street, Hanoi, Vietnam
[email protected]; [email protected]
February 18, 2021
Abstract
We propose and analyze the convergence of a novel stochastic algorithm for solving monotone inclusions that are the sum of a maximally monotone operator and a monotone, Lipschitzian operator. The proposed algorithm requires only unbiased estimates of the Lipschitzian operator. We obtain the rate O(log(n)/n) in expectation for the strongly monotone case, as well as almost sure convergence for the general case. Furthermore, in the context of the application to convex-concave saddle point problems, we derive the convergence rate of the primal-dual gap. In particular, we also obtain the O(1/n) convergence rate of the primal-dual gap in the deterministic setting.

Keywords: monotone inclusion, stochastic optimization, stochastic error, monotone operator, operator splitting, reflected method, Lipschitz, composite operator, duality, primal-dual algorithm, ergodic convergence
Mathematics Subject Classifications (2010): 47H05, 49M29, 49M27, 90C25
1 Introduction

A wide class of problems in monotone operator theory, variational inequalities, convex optimization, image processing and machine learning reduces to the problem of solving monotone inclusions involving Lipschitzian operators; see [2, 4, 3, 5, 8, 11, 22, 32, 23, 33, 35] and the references therein. In this paper, we revisit the generic monotone inclusion of finding a zero point of the sum of a maximally monotone operator A and a monotone, µ-Lipschitzian operator B, acting on a real separable Hilbert space H, i.e.,

find x ∈ H such that 0 ∈ (A + B)x. (1.1)

The first splitting method proposed for solving this problem appeared in [33] and is now known as the forward-backward-forward splitting method (FBFSM). Further investigations of this method led to a new primal-dual splitting method in [4], where B is a linear monotone skew operator in a suitable product space. A main limitation of the FBFSM is its two calls of B per iteration. This issue was recently resolved in [22], where the forward-reflected-backward splitting method (FRBSM) was proposed: for γ ∈ ]0, +∞[,

x_{n+1} = (Id + γA)^{−1}(x_n − 2γBx_n + γBx_{n−1}). (1.2)

An alternative approach to overcoming this issue appeared in [11], where the reflected forward-backward splitting method (RFBSM) was proposed: for γ ∈ ]0, +∞[,

x_{n+1} = (Id + γA)^{−1}(x_n − γB(2x_n − x_{n−1})). (1.3)

It is important to stress that the methods FBFSM, FRBSM and RFBSM are limited to the deterministic setting. The stochastic version of FBFSM was investigated in [35] and recently in [14]. Both works [35] and [14] require two stochastic approximations of B per iteration. Meanwhile, a stochastic version of FRBSM was also considered in [22] for the case when B is a finite sum; however, it still requires evaluating the operator B. The objective of this paper is to avoid the above limitations of [35, 14, 22] by considering the stochastic counterpart of (1.3). At each iteration, we use only one unbiased estimate of B(2x_n − x_{n−1}), and hence the resulting algorithm shares the same structure as the standard stochastic forward-backward splitting [7, 9, 28]. However, it makes it possible to solve a larger class of problems involving non-cocoercive operators.

In Section 2, we recall the basic notions of convex analysis, monotone operator theory and probability theory, and establish the results which will be used in the proof of convergence of the proposed method. We present the proposed method and derive its almost sure convergence and convergence in expectation in Section 3. In the last section, we further apply the proposed algorithm to the convex-concave saddle point problem involving infimal convolutions, and establish the rate of ergodic convergence of the primal-dual gap.

2 Preliminaries

Let H be a separable real Hilbert space endowed with the scalar product ⟨· | ·⟩ and the associated norm ‖·‖. Let (x_n)_{n∈ℕ} be a sequence in H, and let x ∈ H. We denote the strong convergence and the weak convergence of (x_n)_{n∈ℕ} to x by x_n → x and x_n ⇀ x, respectively.

Definition 2.1
Let A: H → 2^H be a set-valued operator.

(i) The domain of A is dom(A) = {x ∈ H | Ax ≠ ∅}.

(ii) The range of A is ran(A) = {u ∈ H | (∃x ∈ H) u ∈ Ax}.

(iii) The graph of A is gra(A) = {(x, u) ∈ H × H | u ∈ Ax}.

(iv) The inverse of A is A^{−1}: u ↦ {x ∈ H | u ∈ Ax}.

(v) The zero set of A is zer(A) = A^{−1}(0).

Definition 2.2
We have the following definitions:

(i) We say that A: H → 2^H is monotone if

(∀(x, u) ∈ gra A)(∀(y, v) ∈ gra A) ⟨x − y | u − v⟩ ≥ 0. (2.1)

(ii) We say that A: H → 2^H is maximally monotone if it is monotone and there exists no monotone operator B such that gra(B) properly contains gra(A).

(iii) We say that A: H → 2^H is φ_A-uniformly monotone at x ∈ dom(A) if there exists an increasing function φ_A: [0, +∞[ → [0, +∞] that vanishes only at 0 such that

(∀u ∈ Ax)(∀(y, v) ∈ gra A) ⟨x − y | u − v⟩ ≥ φ_A(‖y − x‖). (2.2)

If φ_A = ν_A|·|² for some ν_A ∈ ]0, +∞[, then we say that A is ν_A-strongly monotone.

(iv) The resolvent of A is

J_A = (Id + A)^{−1}, (2.3)

where Id denotes the identity operator on H.

(v) A single-valued operator B: H → H is 1-cocoercive, or firmly nonexpansive, if

(∀(x, y) ∈ H × H) ⟨x − y | Bx − By⟩ ≥ ‖Bx − By‖². (2.4)

Let Γ_0(H) be the class of proper lower semicontinuous convex functions from H to ]−∞, +∞].

Definition 2.3
For f ∈ Γ_0(H):

(i) The proximity operator of f is

prox_f: H → H: x ↦ argmin_{y∈H} (f(y) + (1/2)‖x − y‖²) (2.5)

(a numerical illustration is given at the end of this section).

(ii) The conjugate function of f is

f*: a ↦ sup_{x∈H} (⟨a | x⟩ − f(x)). (2.6)

(iii) The infimal convolution of two functions ℓ and g from H to ]−∞, +∞] is

ℓ □ g: x ↦ inf_{y∈H} (ℓ(y) + g(x − y)). (2.7)

Note that prox_f = J_{∂f}; moreover, letting x ∈ H and setting p = prox_f x, we have

(∀y ∈ H) f(p) − f(y) ≤ ⟨y − p | p − x⟩, (2.8)

and

(∀f ∈ Γ_0(H)) (∂f)^{−1} = ∂f*. (2.9)

Following [25], let (Ω, F, P) be a probability space. An H-valued random variable is a measurable function X: Ω → H, where H is endowed with the Borel σ-algebra. We denote by σ(X) the σ-field generated by X. The expectation of a random variable X is denoted by E[X]. The conditional expectation of X given a σ-field A ⊂ F is denoted by E[X | A]. An H-valued random process is a sequence (x_n)_{n∈ℕ} of H-valued random variables. The abbreviation a.s. stands for "almost surely".
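As promised, here is a numerical illustration of the proximity operator (2.5). It is our example, not part of the paper: for f = λ‖·‖₁ on ℝ^d, prox_f is the well-known componentwise soft-thresholding map, which a short Python check probes against the definition. The helper name `prox_l1` and the test data are our choices.

```python
import numpy as np

def prox_l1(x, lam):
    """Proximity operator of f = lam * ||.||_1: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Sanity check against (2.5): prox_f(x) = argmin_y  f(y) + (1/2)*||x - y||^2.
x = np.array([1.5, -0.3, 0.7]); lam = 0.5
p = prox_l1(x, lam)
obj = lambda y: lam * np.abs(y).sum() + 0.5 * np.sum((x - y) ** 2)
# Perturbing the candidate minimizer along coordinate directions never improves
# the objective (a local-optimality probe, not a proof).
assert all(obj(p) <= obj(p + 1e-3 * d) for d in np.vstack([np.eye(3), -np.eye(3)]))
```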
Lemma 2.4 ([27, Theorem 1]) Let (F_n)_{n∈ℕ} be an increasing sequence of sub-σ-algebras of F, and let (z_n)_{n∈ℕ}, (ξ_n)_{n∈ℕ}, (ζ_n)_{n∈ℕ} and (t_n)_{n∈ℕ} be [0, +∞[-valued random sequences such that, for every n ∈ ℕ, z_n, ξ_n, ζ_n and t_n are F_n-measurable. Assume moreover that Σ_{n∈ℕ} t_n < +∞ and Σ_{n∈ℕ} ζ_n < +∞ a.s., and that

(∀n ∈ ℕ) E[z_{n+1} | F_n] ≤ (1 + t_n)z_n + ζ_n − ξ_n a.s.

Then (z_n)_{n∈ℕ} converges a.s. and (ξ_n)_{n∈ℕ} is summable a.s.

Corollary 2.5
Let (F_n)_{n∈ℕ} be an increasing sequence of sub-σ-algebras of F, and let (x_n)_{n∈ℕ} be a [0, +∞[-valued random sequence such that, for every n ∈ ℕ, x_{n−1} is F_n-measurable and

Σ_{n∈ℕ} E[x_n | F_n] < +∞ a.s. (2.10)

Then Σ_{n∈ℕ} x_n < +∞ a.s.

Proof. Let us set (∀n ∈ ℕ) z_n = Σ_{k=1}^{n−1} x_k. Then z_n is F_n-measurable. Moreover,

E[z_{n+1} | F_n] = z_n + E[x_n | F_n].

Hence, it follows from Lemma 2.4 and (2.10) that (z_n)_{n∈ℕ} converges a.s., that is, Σ_{n∈ℕ} x_n < +∞ a.s.

The following lemma can be viewed as a direct consequence of [7, Proposition 2.3].

Lemma 2.6
Let C be a nonempty closed subset of H and let (x_n)_{n∈ℕ} be an H-valued random process. Suppose that, for every x ∈ C, (‖x_n − x‖)_{n∈ℕ} converges a.s. Suppose that the set of weak sequential cluster points of (x_n)_{n∈ℕ} is a subset of C a.s. Then (x_n)_{n∈ℕ} converges weakly a.s. to a C-valued random vector.

3 Algorithm and convergences
We propose the following algorithm for solving (1.1); it requires only unbiased estimates of the monotone, µ-Lipschitzian operator B.

Algorithm 3.1
Let (γ_n)_{n∈ℕ} be a sequence in ]0, +∞[. Let x_0, x_{−1} be H-valued, squared integrable random variables. Iterate

(∀n ∈ ℕ)
 y_n = 2x_n − x_{n−1}
 find r_n such that E[r_n | F_n] = By_n
 x_{n+1} = J_{γ_n A}(x_n − γ_n r_n), (3.1)

where F_n = σ(x_0, x_1, ..., x_n).
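To make the structure of (3.1) concrete, here is a minimal Python sketch of one way the iteration could be implemented. It is an illustration only: the finite-dimensional setting, the helper names (`stochastic_rfb`, `J_A`, `sample_B`) and the toy instance (a box constraint for A and a noisy skew-linear operator for B) are our assumptions, not part of the analysis. In particular, the constant noise level in the example does not satisfy the summability condition (3.17) of Theorem 3.5 below; it is used only to show the call pattern.

```python
import numpy as np

def stochastic_rfb(J_A, sample_B, x0, gamma, n_iters, rng):
    """Sketch of iteration (3.1): y_n = 2 x_n - x_{n-1},
    x_{n+1} = J_{gamma_n A}(x_n - gamma_n r_n), with E[r_n | F_n] = B y_n."""
    x_prev, x = x0.copy(), x0.copy()          # convention x_{-1} = x_0
    for n in range(n_iters):
        y = 2.0 * x - x_prev                  # reflected point y_n
        r = sample_B(y, rng)                  # single stochastic call per iteration
        x_prev, x = x, J_A(x - gamma(n) * r, gamma(n))
    return x

# Toy instance (our assumption): A = normal cone of the box [0,1]^d, so
# J_{gamma A} is the projection onto the box; B x = M x with M skew-symmetric,
# hence monotone and mu-Lipschitz with mu = ||M||.
rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d)); M = M - M.T
mu = np.linalg.norm(M, 2)
J_A = lambda z, g: np.clip(z, 0.0, 1.0)
sample_B = lambda y, rng: M @ y + 0.01 * rng.standard_normal(d)
gamma = lambda n: 0.9 * (np.sqrt(2.0) - 1.0) / mu   # constant step in ]0,(sqrt(2)-1)/mu[
x = stochastic_rfb(J_A, sample_B, np.full(d, 0.5), gamma, 5000, rng)
```

A constant step in ]0, (√2−1)/μ[ corresponds to Corollary 3.7 below; Remark 3.6 describes how nonconstant steps in [cγ, γ] keep the quantity τ of Theorem 3.5 positive.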
Remark 3.2 Here are some remarks.

(i) Algorithm 3.1 is an extension of the reflected forward-backward splitting method of [11], which itself recovers the projected reflected gradient method for monotone variational inequalities of [21] as a special case. Further connections to existing work in the deterministic setting can be found in [11] as well as [21].

(ii) In the special case when A is a normal cone operator, A = N_X for some nonempty closed convex set X, the iteration (3.1) reduces to the one in [13]. However, as we will see in Remark 3.9, our convergence results are completely different from those of [13].

(iii) The proposed algorithm shares the same structure as the stochastic forward-backward splitting in [7, 9, 28]. The main advantage of (3.1) is that the monotonicity and Lipschitz continuity of B are much weaker than the cocoercivity assumption in [7, 9, 28].

(iv) Under the current conditions on A and B, an alternative method, "between forward-backward and forward-reflected-backward", for solving problem (1.1) is presented in [22, Section 6]; it still requires evaluating the operator B as well as its unbiased estimates.

We first prove some lemmas which will be used in the proofs of Theorem 3.5 and Theorem 3.8.

Lemma 3.3
Let (x_n)_{n∈ℕ} and (y_n)_{n∈ℕ} be generated by (3.1). Suppose that A is φ_A-uniformly monotone and B is φ_B-uniformly monotone. Let x ∈ zer(A + B) and set

(∀n ∈ ℕ) ε_n = 2γ_n(φ_A(‖x_{n+1} − x‖) + φ_A(‖x_{n+1} − x_n‖) + φ_B(‖y_n − x‖)). (3.2)

Then the following holds:

‖x_{n+1} − x‖² + ε_n + (3 − γ_n/γ_{n−1})‖x_{n+1} − x_n‖² + (γ_n/γ_{n−1})‖x_{n+1} − y_n‖² + 2γ_n⟨r_n − Bx | x_{n+1} − x_n⟩
≤ ‖x_n − x‖² + 2γ_n⟨r_{n−1} − Bx | x_n − x_{n−1}⟩ + 2γ_n⟨r_{n−1} − r_n | x_{n+1} − y_n⟩ + (γ_n/γ_{n−1})‖x_n − y_n‖² + 2γ_n⟨r_n − By_n | x − y_n⟩. (3.3)

Proof. Let n ∈ ℕ and x ∈ zer(A + B). Set

p_{n+1} = (1/γ_n)(x_n − x_{n+1}) − r_n. (3.4)

Then, by the definition of the resolvent,

p_{n+1} ∈ Ax_{n+1}. (3.5)

Since A is φ_A-uniformly monotone and −Bx ∈ Ax, we obtain

⟨(x_n − x_{n+1})/γ_n − r_n + Bx | x_{n+1} − x⟩ ≥ φ_A(‖x_{n+1} − x‖), (3.6)

which is equivalent to

⟨x_n − x_{n+1} | x_{n+1} − x⟩ − γ_n φ_A(‖x_{n+1} − x‖) ≥ γ_n⟨r_n − Bx | x_{n+1} − x⟩. (3.7)

Let us estimate the right-hand side of (3.7). Using y_n = 2x_n − x_{n−1}, we have

⟨r_n − Bx | x_{n+1} − x⟩
= ⟨r_n − Bx | x_{n+1} − y_n⟩ + ⟨r_n − By_n | y_n − x⟩ + ⟨By_n − Bx | y_n − x⟩
= ⟨r_n − Bx | x_{n+1} − x_n⟩ − ⟨r_n − Bx | x_n − x_{n−1}⟩ + ⟨r_n − By_n | y_n − x⟩ + ⟨By_n − Bx | y_n − x⟩
= ⟨r_n − Bx | x_{n+1} − x_n⟩ − ⟨r_{n−1} − Bx | x_n − x_{n−1}⟩ + ⟨r_{n−1} − r_n | x_n − x_{n−1}⟩ + ⟨r_n − By_n | y_n − x⟩ + ⟨By_n − Bx | y_n − x⟩
= ⟨r_n − Bx | x_{n+1} − x_n⟩ − ⟨r_{n−1} − Bx | x_n − x_{n−1}⟩ + ⟨r_n − r_{n−1} | x_{n+1} − y_n⟩ + ⟨r_{n−1} − r_n | x_{n+1} − x_n⟩ + ⟨r_n − By_n | y_n − x⟩ + ⟨By_n − Bx | y_n − x⟩. (3.8)

Applying the uniform monotonicity of A to p_{n+1} ∈ Ax_{n+1} and p_n ∈ Ax_n, it follows from (3.4) that

⟨(x_n − x_{n+1})/γ_n − r_n − (x_{n−1} − x_n)/γ_{n−1} + r_{n−1} | x_{n+1} − x_n⟩ ≥ φ_A(‖x_{n+1} − x_n‖), (3.9)

which, since x_{n−1} − x_n = x_n − y_n, is equivalent to

⟨r_{n−1} − r_n | x_{n+1} − x_n⟩ ≥ φ_A(‖x_{n+1} − x_n‖) + ‖x_{n+1} − x_n‖²/γ_n + ⟨(x_n − y_n)/γ_{n−1} | x_{n+1} − x_n⟩. (3.10)

We have

2⟨x_n − y_n | x_{n+1} − x_n⟩ = ‖x_{n+1} − y_n‖² − ‖x_n − y_n‖² − ‖x_{n+1} − x_n‖²,
2⟨x_n − x_{n+1} | x_{n+1} − x⟩ = ‖x_n − x‖² − ‖x_n − x_{n+1}‖² − ‖x_{n+1} − x‖². (3.11)

Therefore, we derive from (3.7), (3.8), (3.10), (3.11) and the uniform monotonicity of B that

‖x_n − x‖² − ‖x_n − x_{n+1}‖² − ‖x_{n+1} − x‖² − 2γ_n φ_A(‖x_{n+1} − x‖)
≥ 2γ_n(⟨r_n − Bx | x_{n+1} − x_n⟩ − ⟨r_{n−1} − Bx | x_n − x_{n−1}⟩) + 2γ_n φ_B(‖y_n − x‖) + 2γ_n⟨r_n − r_{n−1} | x_{n+1} − y_n⟩ + 2γ_n φ_A(‖x_{n+1} − x_n‖) + 2‖x_{n+1} − x_n‖² + (γ_n/γ_{n−1})(‖x_{n+1} − y_n‖² − ‖x_n − y_n‖² − ‖x_{n+1} − x_n‖²) + 2γ_n⟨r_n − By_n | y_n − x⟩. (3.12)

Hence,

‖x_{n+1} − x‖² + ε_n + (3 − γ_n/γ_{n−1})‖x_{n+1} − x_n‖² + (γ_n/γ_{n−1})‖x_{n+1} − y_n‖² + 2γ_n⟨r_n − Bx | x_{n+1} − x_n⟩
≤ ‖x_n − x‖² + 2γ_n⟨r_{n−1} − Bx | x_n − x_{n−1}⟩ + 2γ_n⟨r_{n−1} − r_n | x_{n+1} − y_n⟩ + (γ_n/γ_{n−1})‖x_n − y_n‖² + 2γ_n⟨r_n − By_n | x − y_n⟩, (3.13)

which proves (3.3).

We also have the following lemma, in which (3.14) was used in [21] as well as in [11].

Lemma 3.4
For every n ∈ ℕ, we have the estimates

2⟨By_{n−1} − By_n | x_{n+1} − y_n⟩ ≤ μ(1+√2)‖y_n − x_n‖² + μ‖x_n − y_{n−1}‖² + √2 μ‖y_n − x_{n+1}‖², (3.14)

and

T_n = (1/γ_n)‖x_n − x‖² + μ‖x_n − y_{n−1}‖² + (1/γ_{n−1} + μ(1+√2))‖x_n − x_{n−1}‖² + 2α_{n−1} ≥ (1/(2γ_n))‖x_n − x‖², (3.15)

where α_n = ⟨By_n − Bx | x_{n+1} − x_n⟩.

Proof. Let n ∈ ℕ. We have

2⟨By_{n−1} − By_n | x_{n+1} − y_n⟩ ≤ 2‖x_{n+1} − y_n‖ ‖By_{n−1} − By_n‖
≤ 2μ‖x_{n+1} − y_n‖ ‖y_n − y_{n−1}‖
≤ (μ/√2)‖y_n − y_{n−1}‖² + √2 μ‖x_{n+1} − y_n‖²
= (μ/√2)‖y_n − x_n + x_n − y_{n−1}‖² + √2 μ‖x_{n+1} − y_n‖²
≤ (μ/√2)((1 + 1/(√2−1))‖y_n − x_n‖² + (1 + (√2−1))‖x_n − y_{n−1}‖²) + √2 μ‖x_{n+1} − y_n‖²
= μ(1+√2)‖y_n − x_n‖² + μ‖x_n − y_{n−1}‖² + √2 μ‖x_{n+1} − y_n‖².

Since α_{n−1} = ⟨By_{n−1} − Bx | x_n − x_{n−1}⟩, we obtain

2|α_{n−1}| ≤ 2μ‖y_{n−1} − x‖ ‖x_n − x_{n−1}‖
≤ 2μ(‖x_n − y_{n−1}‖ + ‖x_n − x‖)‖x_n − x_{n−1}‖
≤ μ(‖x_n − y_{n−1}‖² + ‖x_n − x‖² + 2‖x_n − x_{n−1}‖²). (3.16)

Therefore, we derive from (3.16) and the definition of T_n that

T_n ≥ (1/(2γ_n))‖x_n − x‖² + (1/(2γ_n) − μ)‖x_n − x‖² + (1/γ_{n−1} + μ(√2−1))‖x_n − x_{n−1}‖² ≥ (1/(2γ_n))‖x_n − x‖²,

where the last inequality uses 2γ_n μ ≤ 2(√2−1) < 1, which holds for step sizes γ_n < (√2−1)/μ. This proves (3.15).

Theorem 3.5 The following hold. (i)
Let (γ_n)_{n∈ℕ} be a nondecreasing sequence in ]0, (√2−1)/μ[ satisfying

τ = inf_{n∈ℕ} (2/γ_n − 1/γ_{n−1} − μ(1+√2)) > 0.

In the setting of Algorithm 3.1, assume that the following condition is satisfied for F_n = σ((x_k)_{0≤k≤n}):

Σ_{n∈ℕ} E[‖r_n − By_n‖² | F_n] < +∞ a.s. (3.17)

Then (x_n)_{n∈ℕ} converges weakly to a random variable x̄: Ω → zer(A + B) a.s.

(ii) Suppose that dom(A) is bounded and that A or B is uniformly monotone. Let (γ_n)_{n∈ℕ} be a monotonically decreasing sequence in ]0, (√2−1)/μ[ such that (γ_n)_{n∈ℕ} ∈ ℓ²(ℕ) ∖ ℓ¹(ℕ) and

Σ_{n∈ℕ} γ_n E[‖r_n − By_n‖² | F_n] < +∞ a.s. (3.18)

Then (x_n)_{n∈ℕ} converges strongly to the unique solution x̄.

Proof. (i): Let n ∈ ℕ. Applying Lemma 3.3 with φ_A = φ_B = 0 and dividing by γ_n, we have

(1/γ_n)‖x_{n+1} − x‖² + (1/γ_{n−1})‖x_{n+1} − y_n‖² + (3/γ_n − 1/γ_{n−1})‖x_n − x_{n+1}‖² + 2δ_n
≤ (1/γ_n)‖x_n − x‖² + (1/γ_{n−1})‖x_n − y_n‖² + 2δ_{n−1} + 2⟨r_{n−1} − r_n | x_{n+1} − y_n⟩ + 2β_n, (3.19)

where δ_n = ⟨r_n − Bx | x_{n+1} − x_n⟩ and β_n = ⟨r_n − By_n | x − y_n⟩. Let χ ∈ ]0, τ/2[. It follows from the Cauchy–Schwarz inequality and (3.14) that

2⟨r_{n−1} − r_n | x_{n+1} − y_n⟩ = 2⟨r_{n−1} − By_{n−1} + By_{n−1} − By_n + By_n − r_n | x_{n+1} − y_n⟩
≤ ‖r_{n−1} − By_{n−1}‖²/χ + χ‖x_{n+1} − y_n‖² + ‖r_n − By_n‖²/χ + χ‖x_{n+1} − y_n‖² + μ(1+√2)‖y_n − x_n‖² + μ‖x_n − y_{n−1}‖² + √2 μ‖y_n − x_{n+1}‖². (3.20)

Hence, we derive from (3.19) and (3.20) that

(1/γ_n)‖x_{n+1} − x‖² + (1/γ_{n−1} − √2 μ − 2χ)‖x_{n+1} − y_n‖² + (3/γ_n − 1/γ_{n−1})‖x_n − x_{n+1}‖² + 2δ_n
≤ (1/γ_n)‖x_n − x‖² + (1/γ_{n−1} + μ(1+√2))‖x_n − y_n‖² + μ‖x_n − y_{n−1}‖² + 2δ_{n−1} + 2β_n + ‖r_{n−1} − By_{n−1}‖²/χ + ‖r_n − By_n‖²/χ. (3.21)

In turn, using γ_n ≤ γ_{n+1} and x_n − y_n = x_{n−1} − x_n, we get

(1/γ_{n+1})‖x_{n+1} − x‖² + μ‖x_{n+1} − y_n‖² + (3/γ_n − 1/γ_{n−1})‖x_n − x_{n+1}‖² + 2δ_n
≤ (1/γ_n)‖x_n − x‖² + μ‖x_n − y_{n−1}‖² + (3/γ_{n−1} − 1/γ_{n−2})‖x_n − x_{n−1}‖² + 2δ_{n−1} + 2β_n
− (1/γ_{n−1} − μ(1+√2) − 2χ)‖x_{n+1} − y_n‖² − (2/γ_{n−1} − 1/γ_{n−2} − μ(1+√2))‖x_n − x_{n−1}‖² + (‖r_{n−1} − By_{n−1}‖² + ‖r_n − By_n‖²)/χ. (3.22)

Let us set

θ_n = (1/γ_n)‖x_n − x‖² + μ‖x_n − y_{n−1}‖² + (3/γ_{n−1} − 1/γ_{n−2})‖x_n − x_{n−1}‖² + 2δ_{n−1} + ‖r_{n−1} − By_{n−1}‖²/χ. (3.23)

We have

2|δ_{n−1}| ≤ 2|⟨r_{n−1} − By_{n−1} | x_n − x_{n−1}⟩| + 2|⟨By_{n−1} − Bx | x_n − x_{n−1}⟩|
≤ ‖r_{n−1} − By_{n−1}‖²/χ + χ‖x_n − x_{n−1}‖² + 2μ‖y_{n−1} − x‖ ‖x_n − x_{n−1}‖ (3.24)
≤ ‖r_{n−1} − By_{n−1}‖²/χ + χ‖x_n − x_{n−1}‖² + μ(‖x_n − y_{n−1}‖² + ‖x_n − x‖² + 2‖x_n − x_{n−1}‖²), (3.25)

whence

θ_n ≥ (1/γ_n − μ)‖x_n − x‖² + (3/γ_{n−1} − 1/γ_{n−2} − χ − 2μ)‖x_n − x_{n−1}‖² ≥ μ‖x_n − x‖² ≥ 0, and E[β_n | F_n] = 0. (3.26)

Therefore, taking the conditional expectation of both sides of (3.22) with respect to F_n, we obtain

E[θ_{n+1} | F_n] ≤ θ_n − (1/γ_{n−1} − μ(1+√2) − 2χ)E[‖x_{n+1} − y_n‖² | F_n] − (2/γ_{n−1} − 1/γ_{n−2} − μ(1+√2))‖x_n − x_{n−1}‖² + (2/χ)E[‖r_n − By_n‖² | F_n]. (3.27)

It follows from our conditions on the step sizes (γ_n)_{n∈ℕ} that

1/γ_{n−1} − μ(1+√2) − 2χ > 0 and 2/γ_{n−1} − 1/γ_{n−2} − μ(1+√2) ≥ τ > 0. (3.28)

Now, in view of Lemma 2.4, we get

θ_n → θ̄ and ‖x_n − x_{n−1}‖ → 0 a.s. (3.29)

From (3.17) and Corollary 2.5, we have

Σ_{n∈ℕ} ‖r_{n−1} − By_{n−1}‖² < +∞, whence ‖r_{n−1} − By_{n−1}‖ → 0 a.s. (3.30)

Since (θ_n)_{n∈ℕ} converges, it is bounded, and therefore, using (3.26), (‖x_n − x‖)_{n∈ℕ} and (x_n)_{n∈ℕ} are bounded. Hence (y_n)_{n∈ℕ} is also bounded.
In turn, from (3.24), we derive

δ_{n−1} → 0 a.s. (3.31)

Moreover,

‖x_n − y_n‖ = ‖x_n − x_{n−1}‖ → 0 and ‖x_n − y_{n−1}‖ ≤ ‖x_n − x_{n−1}‖ + ‖x_{n−1} − y_{n−1}‖ → 0. (3.32)

Therefore, we derive from (3.23), (3.29), (3.30), (3.31), (3.32) and Lemma 2.4 that

(‖x_n − x‖)_{n∈ℕ} converges a.s. (3.33)

Let x* be a weak cluster point of (x_n)_{n∈ℕ}. Then there exists a subsequence (x_{n_k})_{k∈ℕ} which converges weakly to x* a.s. By (3.32), y_{n_k} ⇀ x* a.s. Let us next set

z_n = (Id + γ_n A)^{−1}(x_n − γ_n By_n). (3.34)

Then, since J_{γ_n A} is nonexpansive, we have

‖x_{n+1} − z_n‖ ≤ γ_n‖By_n − r_n‖ → 0. (3.35)

It follows from x_{n_k} ⇀ x* that x_{n_k+1} ⇀ x*, and hence from (3.35) that z_{n_k} ⇀ x* a.s. Since z_{n_k} = (Id + γ_{n_k} A)^{−1}(x_{n_k} − γ_{n_k} By_{n_k}), we have

(x_{n_k} − z_{n_k})/γ_{n_k} − By_{n_k} + Bz_{n_k} ∈ (A + B)z_{n_k}. (3.36)

From (3.32) and (3.35), we have

lim_{k→∞} ‖x_{n_k} − z_{n_k}‖ = lim_{k→∞} ‖y_{n_k} − z_{n_k}‖ = 0. (3.37)

Since B is μ-Lipschitz and (γ_n)_{n∈ℕ} is bounded away from 0, it follows that

(x_{n_k} − z_{n_k})/γ_{n_k} − By_{n_k} + Bz_{n_k} → 0 a.s. (3.38)

By [2, Corollary 25.5], the sum A + B is maximally monotone, and hence its graph is closed in H_weak × H_strong [2, Proposition 20.38]. Therefore, 0 ∈ (A + B)x* a.s., that is, x* ∈ zer(A + B) a.s. By Lemma 2.6, the sequence (x_n)_{n∈ℕ} converges weakly to x̄ ∈ zer(A + B), and the proof of (i) is complete.

(ii): It follows from Lemma 3.3 and (3.14) that

‖x_{n+1} − x‖² + (γ_n/γ_{n−1})‖x_{n+1} − y_n‖² + (3 − γ_n/γ_{n−1})‖x_n − x_{n+1}‖² + 2γ_n α_n + 2γ_n⟨r_n − By_n | x_{n+1} − x_n⟩ + ε_n
≤ ‖x_n − x‖² + 2γ_n⟨r_{n−1} − By_{n−1} + By_{n−1} − By_n + By_n − r_n | x_{n+1} − y_n⟩ + (γ_n/γ_{n−1})‖x_n − y_n‖² + 2γ_n α_{n−1} + 2γ_n⟨r_{n−1} − By_{n−1} | x_n − x_{n−1}⟩ + 2γ_n β_n
≤ ‖x_n − x‖² + γ_n(μ(1+√2)‖y_n − x_n‖² + μ‖x_n − y_{n−1}‖² + √2 μ‖y_n − x_{n+1}‖²) + (γ_n/γ_{n−1})‖x_n − y_n‖² + 2γ_n α_{n−1} + 2γ_n⟨r_{n−1} − By_{n−1} | x_n − x_{n−1}⟩ + 2γ_n β_n + 2γ_n⟨r_{n−1} − By_{n−1} + By_n − r_n | x_{n+1} − y_n⟩. (3.39)

For any η ∈ ]0, (1 − γ_0 μ(1+√2))/10[, using the Cauchy–Schwarz inequality, we have

2γ_n⟨r_{n−1} − By_{n−1} | x_{n+1} − y_n⟩ ≤ (γ_n²/η)‖r_{n−1} − By_{n−1}‖² + η‖x_{n+1} − y_n‖²,
2γ_n⟨By_n − r_n | x_{n+1} − y_n⟩ ≤ (γ_n²/η)‖r_n − By_n‖² + η‖x_{n+1} − y_n‖²,
2γ_n⟨r_{n−1} − By_{n−1} | x_n − x_{n−1}⟩ ≤ (γ_n²/η)‖r_{n−1} − By_{n−1}‖² + η‖x_n − x_{n−1}‖²,
−2γ_n⟨r_n − By_n | x_{n+1} − x_n⟩ ≤ (γ_n²/η)‖r_n − By_n‖² + η‖x_{n+1} − x_n‖². (3.40)

We also have

‖x_{n+1} − y_n‖² ≤ 2‖x_{n+1} − x_n‖² + 2‖x_n − y_n‖². (3.41)

Therefore, we derive from (3.39), (3.40), (3.41), the monotone decrease of (γ_n)_{n∈ℕ} and y_n − x_n = x_n − x_{n−1} that

‖x_{n+1} − x‖² + (γ_n/γ_{n−1} − √2 γ_n μ)‖x_{n+1} − y_n‖² + 2‖x_{n+1} − x_n‖² + 2γ_n α_n + ε_n
≤ ‖x_n − x‖² + γ_n μ‖x_n − y_{n−1}‖² + (1 + γ_0 μ(1+√2))‖x_n − x_{n−1}‖² + 2γ_{n−1} α_{n−1} − 2(γ_{n−1} − γ_n)α_{n−1} + 2γ_n β_n + (2γ_{n−1}²/η)‖r_{n−1} − By_{n−1}‖² + (2γ_n²/η)‖r_n − By_n‖² + 5η(‖x_{n+1} − x_n‖² + ‖x_n − y_n‖²). (3.42)

Since dom(A) is bounded, there exists M > 0 such that (∀n ∈ ℕ) |α_n| ≤ M, and hence (3.42) implies that

‖x_{n+1} − x‖² + γ_n μ‖x_{n+1} − y_n‖² + (2 − 5η)‖x_{n+1} − x_n‖² + 2γ_n α_n
≤ ‖x_n − x‖² + γ_{n−1} μ‖x_n − y_{n−1}‖² + (2 − 5η)‖x_n − x_{n−1}‖² + 2γ_{n−1} α_{n−1}
− γ_n(1/γ_{n−1} − μ(1+√2))‖x_{n+1} − y_n‖² − (1 − γ_0 μ(1+√2) − 10η)‖x_n − x_{n−1}‖² + 2(γ_{n−1} − γ_n)M + (2γ_{n−1}²/η)‖r_{n−1} − By_{n−1}‖² + (2γ_n²/η)‖r_n − By_n‖² − ε_n + 2γ_n β_n. (3.43)

Let us set

p_n = ‖x_n − x‖² + γ_{n−1} μ‖x_n − y_{n−1}‖² + (2 − 5η)‖x_n − x_{n−1}‖² + 2γ_{n−1} α_{n−1} + (2γ_{n−1}²/η)‖r_{n−1} − By_{n−1}‖². (3.44)
Then, taking the conditional expectation of both sides of (3.43) with respect to F_n and using E[r_n | F_n] = By_n (so that E[β_n | F_n] = 0), we get

E[p_{n+1} | F_n] ≤ p_n − γ_n(1/γ_{n−1} − μ(1+√2))E[‖x_{n+1} − y_n‖² | F_n] − (1 − γ_0 μ(1+√2) − 10η)‖x_n − x_{n−1}‖² + 2(γ_{n−1} − γ_n)M + (4γ_n²/η)E[‖r_n − By_n‖² | F_n] − E[ε_n | F_n]. (3.45)

Note that

γ_n(1/γ_{n−1} − μ(1+√2)) > 0, 1 − γ_0 μ(1+√2) − 10η > 0, Σ_{n∈ℕ}(γ_{n−1} − γ_n)M ≤ γ_0 M. (3.46)

As in (3.15), (p_n)_{n∈ℕ} is nonnegative. In turn, Lemma 2.4 and (3.45) give

p_n → p̄, ‖x_n − x_{n−1}‖ → 0 and Σ_{n∈ℕ} E[ε_n | F_n] < +∞ a.s. (3.47)

Using the same argument as in the proof of (i),

lim_{n→∞} ‖x_n − x‖² = p̄. (3.48)

Now, let us consider the case where A is φ_A-uniformly monotone. We then derive from (3.47) that

Σ_{n∈ℕ} γ_n E[φ_A(‖x_{n+1} − x‖) | F_n] < +∞, (3.49)

hence Corollary 2.5 implies that

Σ_{n∈ℕ} γ_n φ_A(‖x_{n+1} − x‖) < +∞ a.s. (3.50)

Since Σ_{n∈ℕ} γ_n = +∞, it follows from (3.50) that lim inf_{n→∞} φ_A(‖x_{n+1} − x‖) = 0. Thus, there exists a subsequence (k_n)_{n∈ℕ} such that φ_A(‖x_{k_n} − x‖) → 0, and hence ‖x_{k_n} − x‖ →
0. Therefore, by (3.48), we obtain x_n → x. We next consider the case where B is φ_B-uniformly monotone. Since y_n = 2x_n − x_{n−1}, the triangle inequality yields

‖x_n − x‖ − ‖x_n − x_{n−1}‖ ≤ ‖y_n − x‖ ≤ ‖x_n − x‖ + ‖x_{n−1} − x_n‖, (3.51)

and by (3.47) we obtain lim_{n→∞} ‖y_n − x‖ = lim_{n→∞} ‖x_n − x‖. Hence, by the same argument as in the case where A is uniformly monotone, we obtain y_n → x and hence x_n → x. The proof of the theorem is complete.

Remark 3.6
For 0 < γ < (√2−1)/μ, let c satisfy 1/(2 − γμ(1+√2)) < c < 1. Then, for every (γ_n)_{n∈ℕ} in [cγ, γ], we have

τ = inf_{n∈ℕ} (2/γ_n − 1/γ_{n−1} − μ(1+√2)) ≥ 2/γ − 1/(cγ) − μ(1+√2) > 0.

Corollary 3.7
Let γ ∈ ]0, (√2−1)/μ[ and let x_0, x_{−1} be H-valued, squared integrable random variables. Iterate

(∀n ∈ ℕ)
 y_n = 2x_n − x_{n−1}
 find r_n such that E[r_n | F_n] = By_n
 x_{n+1} = J_{γA}(x_n − γr_n). (3.52)

Suppose that

Σ_{n∈ℕ} E[‖r_n − By_n‖² | F_n] < +∞ a.s. (3.53)

Then (x_n)_{n∈ℕ} converges weakly to a random variable x̄: Ω → zer(A + B) a.s.

Theorem 3.8
Suppose that A is ν-strongly monotone. Define

(∀n ∈ ℕ) γ_n = 1/(2ν(n+1)). (3.54)

Suppose that there exists a constant c such that

(∀n ∈ ℕ) E[‖r_n − By_n‖² | F_n] ≤ c. (3.55)

Then

(∀n > n_0) E[‖x_n − x‖²] = O(log(n+1)/(n+1)), (3.56)

where n_0 is the smallest integer such that n_0 > 2μ(1+√2)/ν.

Proof. Let n ∈ ℕ. It follows from (3.54) that

1 + 2νγ_n = 2ν(n+2)/(2ν(n+1)) = γ_n/γ_{n+1}. (3.57)

Set

ρ_{1,n} = ⟨r_n − By_n | x − y_n⟩ + ⟨r_{n−1} − By_{n−1} | x_n − x_{n−1}⟩ − ⟨r_n − By_n | x_{n+1} − x_n⟩,
ρ_{2,n} = ⟨r_{n−1} − By_{n−1} − r_n + By_n | x_{n+1} − y_n⟩,
ρ_n = ρ_{1,n} + ρ_{2,n}. (3.58)

Hence, by applying Lemma 3.3 with φ_B = 0 and φ_A = ν|·|², we obtain

(1 + 2νγ_n)‖x_{n+1} − x‖² + (3 + 2νγ_n − γ_n/γ_{n−1})‖x_{n+1} − x_n‖² + (γ_n/γ_{n−1})‖x_{n+1} − y_n‖² + 2γ_n α_n
≤ ‖x_n − x‖² + 2γ_n α_{n−1} + 2γ_n⟨By_{n−1} − By_n | x_{n+1} − y_n⟩ + (γ_n/γ_{n−1})‖x_n − y_n‖² + 2γ_n ρ_n. (3.59)

Dividing by γ_n and using (3.57) and Lemma 3.4, we derive from (3.59) that

(1/γ_{n+1})‖x_{n+1} − x‖² + (1/γ_{n−1} − √2 μ)‖x_{n+1} − y_n‖² + (3/γ_n + 2ν − 1/γ_{n−1})‖x_n − x_{n+1}‖² + 2α_n
≤ (1/γ_n)‖x_n − x‖² + (1/γ_{n−1} + μ(1+√2))‖x_n − x_{n−1}‖² + μ‖x_n − y_{n−1}‖² + 2α_{n−1} + 2ρ_n. (3.60)

Now, using the definition of T_n, we can rewrite (3.60) as

T_{n+1} ≤ T_n + 2ρ_n − (1/γ_{n−1} − μ(1+√2))‖x_{n+1} − y_n‖² − (2/γ_n + 2ν − 1/γ_{n−1} − μ(1+√2))‖x_n − x_{n+1}‖². (3.61)

Let us rewrite ρ_{2,n} as

ρ_{2,n} = ⟨r_{n−1} − By_{n−1} | x_{n+1} − x_n⟩ − ⟨r_{n−1} − By_{n−1} | x_n − x_{n−1}⟩ − ⟨r_n − By_n | x_{n+1} − x_n⟩ + ⟨r_n − By_n | x_n − x_{n−1}⟩, (3.62)

which implies that

ρ_n = ⟨r_n − By_n | x − x_n⟩ + ⟨r_{n−1} − By_{n−1} | x_{n+1} − x_n⟩ − 2⟨r_n − By_n | x_{n+1} − x_n⟩. (3.63)

Taking the conditional expectation with respect to F_n, we obtain

E[T_{n+1} | F_n] ≤ T_n + 2E[ρ_n | F_n] − (1/γ_{n−1} − μ(1+√2))E[‖x_{n+1} − y_n‖² | F_n] − (2/γ_n + 2ν − 1/γ_{n−1} − μ(1+√2))E[‖x_n − x_{n+1}‖² | F_n]. (3.64)

By the definition of ρ_n in (3.63) and E[⟨r_n − By_n | x − x_n⟩ | F_n] = 0, we have

2E[ρ_n | F_n] = 2E[⟨r_{n−1} − By_{n−1} | x_{n+1} − x_n⟩ | F_n] − 4E[⟨r_n − By_n | x_{n+1} − x_n⟩ | F_n]
≤ 2γ_{n−1}E[‖r_{n−1} − By_{n−1}‖² | F_n] + (1/(2γ_{n−1}))E[‖x_{n+1} − x_n‖² | F_n] + 16γ_n E[‖r_n − By_n‖² | F_n] + (1/(4γ_n))E[‖x_{n+1} − x_n‖² | F_n]. (3.65)

In turn, since 2/γ_n + 2ν − 1/γ_{n−1} − 1/(2γ_{n−1}) − 1/(4γ_n) ≥ 1/(4γ_n) for the step sizes (3.54), it follows from (3.64), (3.65) and (3.55) that

E[T_{n+1} | F_n] ≤ T_n − (1/γ_{n−1} − μ(1+√2))E[‖x_{n+1} − y_n‖² | F_n] − (1/(4γ_n) − μ(1+√2))E[‖x_n − x_{n+1}‖² | F_n] + 18γ_{n−1} c. (3.66)

Note that, for n > n_0, we have 1/(4γ_n) − μ(1+√2) ≥
0 and 1/γ_{n−1} − μ(1+√2) ≥ 0. Hence, taking the expectation of both sides of (3.66), we obtain

(∀n > n_0) E[T_{n+1}] ≤ E[T_{n_0}] + 18c Σ_{k=n_0}^{n} γ_{k−1}, (3.67)

which proves the desired result by invoking Lemma 3.4.
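To spell out the last step (the arithmetic is ours, under the reconstruction above): Lemma 3.4 gives T_{n+1} ≥ ‖x_{n+1} − x‖²/(2γ_{n+1}), and with γ_k = 1/(2ν(k+1)) one gets, for n > n_0,

```latex
\mathbb{E}\big[\|x_{n+1}-x\|^{2}\big]
  \le 2\gamma_{n+1}\,\mathbb{E}[T_{n+1}]
  \le \frac{1}{\nu(n+2)}\Big(\mathbb{E}[T_{n_0}]
      + \frac{9c}{\nu}\sum_{k=n_0}^{n}\frac{1}{k}\Big)
  = O\Big(\frac{\log(n+1)}{n+1}\Big),
```

since 18c Σ_{k=n_0}^{n} γ_{k−1} = (9c/ν) Σ_{k=n_0}^{n} 1/k and Σ_{k=n_0}^{n} 1/k ≤ 1 + log n, which is the rate claimed in (3.56).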
Remark 3.9 We have some comparisons with existing work.

(i) Under the standard condition (3.18), we obtain strong almost sure convergence of the iterates when A or B is uniformly monotone, as in the context of the stochastic forward-backward splitting [28]. In the general case, to ensure weak almost sure convergence, we need not only step sizes bounded away from 0 but also the summability condition (3.17). These conditions were used in [7, 9, 29, 30].

(ii) In the case when A is a normal cone in a Euclidean space and B is weakly sharp, as was shown in [13, Proposition 1], strong almost sure convergence of (x_n)_{n∈ℕ} is obtained under condition (3.18). Without imposing additional conditions on B such as weak sharpness [13] or uniform monotonicity [28], the problem of proving almost sure convergence of the iterates under condition (3.18) alone remains open.

(iii) When A is strongly monotone, we obtain the rate O(log(n+1)/(n+1)), which is slower than the rate O(1/(n+1)) of the stochastic forward-backward splitting [28] and its extensions in [12, 34]. The main reason is that the monotonicity and Lipschitz continuity of B are weaker than the cocoercivity assumed in [28, 29].

(iv) In the case when A is the normal cone of a nonempty closed convex set X in a Euclidean space, the work in [13] obtained the rate O(1/√n) for the gap function X ∋ x ↦ sup_{y∈X} ⟨By | x − y⟩. This rate of convergence was first established in [19] for solving variational inequalities with the stochastic mirror-prox algorithm. These results therefore differ from ours in the present paper.

We next provide a generic special case which has been widely studied in stochastic optimization; see [1, 31, 15, 16, 18, 20] for instances.

Corollary 3.10
Let f ∈ Γ_0(H) and let h: H → ℝ be a convex differentiable function with μ-Lipschitz continuous gradient, given in expectation form h(x) = E_ξ[H(x, ξ)]. In the expectation, ξ is a random vector whose probability distribution is supported on a set Ω_P ⊂ ℝ^m, and H: H × Ω_P → ℝ is convex with respect to the variable x. The problem is to

minimize_{x∈H} f(x) + h(x), (3.68)

under the following assumptions:

(i) zer(∂f + ∇h) ≠ ∅.

(ii) It is possible to obtain independent and identically distributed (i.i.d.) samples (ξ_n)_{n∈ℕ} of ξ.

(iii) Given (x, ξ) ∈ H × Ω_P, one can find a point ∇H(x, ξ) such that E[∇H(x, ξ)] = ∇h(x).

Let (γ_n)_{n∈ℕ} be a sequence in ]0, +∞[ and let x_0, x_{−1} be in H. Iterate

(∀n ∈ ℕ)
 y_n = 2x_n − x_{n−1}
 x_{n+1} = prox_{γ_n f}(x_n − γ_n ∇H(y_n, ξ_n)). (3.69)

Then the following hold.

(i) Suppose that f is ν-strongly convex, for some ν ∈ ]0, +∞[, and that there exists a constant c such that

E[‖∇H(y_n, ξ_n) − ∇h(y_n)‖² | ξ_0, ..., ξ_{n−1}] ≤ c. (3.70)

Then, for the step sizes (∀n ∈ ℕ) γ_n = 1/(2ν(n+1)), we obtain

(∀n > n_0) E[‖x_n − x‖²] = O(log(n+1)/(n+1)), (3.71)

where n_0 is the smallest integer such that n_0 > 2μ(1+√2)/ν, and x is the unique solution to (3.68).

(ii) Suppose that f is not strongly convex. Let (γ_n)_{n∈ℕ} be a nondecreasing sequence in ]0, (√2−1)/μ[ satisfying

τ = inf_{n∈ℕ} (2/γ_n − 1/γ_{n−1} − μ(1+√2)) > 0

and

Σ_{n∈ℕ} E[‖∇H(y_n, ξ_n) − ∇h(y_n)‖² | ξ_0, ..., ξ_{n−1}] < +∞ a.s. (3.72)

Then (x_n)_{n∈ℕ} converges weakly to a random variable x̄: Ω → zer(∂f + ∇h) a.s.

Proof. The conclusions follow from Theorems 3.5 and 3.8 applied with

A = ∂f, B = ∇h, and (∀n ∈ ℕ) r_n = ∇H(y_n, ξ_n). (3.73)
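The following Python sketch illustrates one way the scheme (3.69) could be run. It is an illustration under our own assumptions: the finite-dimensional least-squares instance, the helper names (`reflected_sprox`, `grad_H`, `prox_f`) and all constants are ours, and the single-row gradient oracle is only one of many possible unbiased estimators satisfying assumption (iii).

```python
import numpy as np

def reflected_sprox(prox_f, grad_H, x0, gamma, n_iters, rng):
    """Sketch of iteration (3.69): y_n = 2 x_n - x_{n-1},
    x_{n+1} = prox_{gamma_n f}(x_n - gamma_n * grad_H(y_n, xi_n))."""
    x_prev, x = x0.copy(), x0.copy()
    for n in range(n_iters):
        y = 2.0 * x - x_prev
        x_prev, x = x, prox_f(x - gamma(n) * grad_H(y, rng), gamma(n))
    return x

# Illustrative instance of case (i): h(x) = (1/(2N)) * ||A x - b||^2, estimated
# from one row sampled uniformly per iteration, and f = (nu/2) * ||.||^2, which
# is nu-strongly convex with prox_{g f}(z) = z / (1 + g*nu).
rng = np.random.default_rng(0)
N, d, nu = 200, 10, 0.1
A = rng.standard_normal((N, d)); b = rng.standard_normal(N)

def grad_H(y, rng):               # unbiased: E over rows = A.T @ (A y - b) / N
    i = rng.integers(N)
    return A[i] * (A[i] @ y - b[i])

prox_f = lambda z, g: z / (1.0 + g * nu)
gamma = lambda n: 1.0 / (2.0 * nu * (n + 1))   # schedule of Corollary 3.10(i)
x_hat = reflected_sprox(prox_f, grad_H, np.zeros(d), gamma, 20000, rng)
```

The bounded-variance condition (3.70) is plausible for this oracle on bounded regions, but the example is a sketch of the call pattern rather than a verified instance of all hypotheses.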
Remark 3.11 The algorithm (3.69), as well as its convergence results, appears to be new. Algorithm (3.69) differs from the standard stochastic proximal gradient method [1, 31, 15, 16] only in the evaluation of the stochastic gradients at the reflected points (y_n)_{n∈ℕ}.

4 Application to the convex-concave saddle point problem

In this section, we focus on the class of primal-dual problems first investigated in [8]. This structured primal-dual framework covers a wide class of convex optimization problems and has found many applications in image processing and machine learning [8, 10, 26, 6, 24]. We further exploit the duality of this framework to obtain a new stochastic primal-dual splitting method, and we focus on the ergodic convergence of the primal-dual gap.
Problem 4.1
Let f ∈ Γ_0(H), g ∈ Γ_0(G), and let h: H → ℝ be a convex differentiable function with μ_h-Lipschitz continuous gradient, given in expectation form h(x) = E_ξ[H(x, ξ)]. In the expectation, ξ is a random vector whose probability distribution P is supported on a set Ω_P ⊂ ℝ^m, and H: H × Ω_P → ℝ is convex with respect to the variable x. Let ℓ ∈ Γ_0(G) be a convex differentiable function with μ_ℓ-Lipschitz continuous gradient, given in expectation form ℓ(v) = E_ζ[L(v, ζ)]. In the expectation, ζ is a random vector whose probability distribution is supported on a set Ω_D ⊂ ℝ^d, and L: G × Ω_D → ℝ is convex with respect to the variable v. Let K: H → G be a bounded linear operator. The primal problem is to

minimize_{x∈H} h(x) + (ℓ* □ g)(Kx) + f(x), (4.1)

and the dual problem is to

minimize_{v∈G} (h + f)*(−K*v) + g*(v) + ℓ(v), (4.2)

under the following assumptions:

(i) There exists a point (x⋆, v⋆) ∈ H × G such that the primal-dual gap function

G: H × G → ℝ ∪ {−∞, +∞}: (x, v) ↦ h(x) + f(x) + ⟨Kx | v⟩ − g*(v) − ℓ(v) (4.3)

verifies the saddle-point condition

(∀x ∈ H)(∀v ∈ G) G(x⋆, v) ≤ G(x⋆, v⋆) ≤ G(x, v⋆). (4.4)

(ii) It is possible to obtain independent and identically distributed (i.i.d.) samples (ξ_n, ζ_n)_{n∈ℕ} of (ξ, ζ).

(iii) Given (x, v, ξ, ζ) ∈ H × G × Ω_P × Ω_D, one can find a point (∇H(x, ξ), ∇L(v, ζ)) such that

E_{(ξ,ζ)}[(∇H(x, ξ), ∇L(v, ζ))] = (∇h(x), ∇ℓ(v)). (4.5)

Using the standard technique as in [8], we derive from (3.69) the following stochastic primal-dual splitting method, Algorithm 4.2, for solving Problem 4.1. The weak almost sure convergence and the convergence in expectation of the resulting algorithm can be derived easily from Corollary 3.10, and hence we omit them here.

Algorithm 4.2
Let (x_0, x_{−1}) ∈ H² and (v_0, v_{−1}) ∈ G². Let (γ_n)_{n∈ℕ} be a nonnegative sequence. Iterate

(∀n ∈ ℕ)
 y_n = 2x_n − x_{n−1}
 u_n = 2v_n − v_{n−1}
 x_{n+1} = prox_{γ_n f}(x_n − γ_n ∇H(y_n, ξ_n) − γ_n K*u_n)
 v_{n+1} = prox_{γ_n g*}(v_n − γ_n ∇L(u_n, ζ_n) + γ_n Ky_n). (4.6)
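Here is a minimal Python sketch of (4.6), under our own assumptions: finite-dimensional variables, a matrix K (so that `K.T` plays the role of the adjoint K*), user-supplied proximity operators for f and g*, and stochastic gradient oracles for h and ℓ. The function name `stochastic_pd` and the interface are our choices; the sketch also accumulates the weighted ergodic averages defined in (4.8) below.

```python
import numpy as np

def stochastic_pd(prox_f, prox_gstar, grad_H, grad_L, K,
                  x0, v0, gamma, n_iters, rng):
    """Sketch of Algorithm 4.2, returning the ergodic averages (4.8)."""
    x_prev, x = x0.copy(), x0.copy()      # convention x_{-1} = x_0
    v_prev, v = v0.copy(), v0.copy()      # convention v_{-1} = v_0
    x_sum, v_sum, w = np.zeros_like(x0), np.zeros_like(v0), 0.0
    for n in range(n_iters):
        g = gamma(n)
        y = 2.0 * x - x_prev              # primal reflection y_n
        u = 2.0 * v - v_prev              # dual reflection u_n
        x_next = prox_f(x - g * grad_H(y, rng) - g * (K.T @ u), g)
        v_next = prox_gstar(v - g * grad_L(u, rng) + g * (K @ y), g)
        x_sum += g * x_next; v_sum += g * v_next; w += g   # sums for (4.8)
        x_prev, x, v_prev, v = x, x_next, v, v_next
    return x_sum / w, v_sum / w           # (x_hat_N, v_hat_N)
```

Theorem 4.3 below bounds the expected primal-dual gap at the ergodic pair returned by such an implementation.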
Theorem 4.3 Let x_0 = x_{−1} and v_0 = v_{−1}. Set μ = 2 max{μ_h, μ_ℓ} + ‖K‖, and let (γ_n)_{n∈ℕ} be a decreasing sequence in ]0, 1/(2μ)[ such that

e = Σ_{n∈ℕ} γ_n² E[‖∇H(y_n, ξ_n) − ∇h(y_n)‖² + ‖∇L(u_n, ζ_n) − ∇ℓ(u_n)‖²] < +∞. (4.7)

For every N ∈ ℕ, define

x̂_N = (Σ_{n=0}^{N} γ_n x_{n+1})/(Σ_{n=0}^{N} γ_n) and v̂_N = (Σ_{n=0}^{N} γ_n v_{n+1})/(Σ_{n=0}^{N} γ_n). (4.8)

Assume that dom f and dom g* are bounded. Then the following holds:

E[G(x̂_N, v) − G(x, v̂_N)] ≤ ((1/2)‖(x_0, v_0) − (x, v)‖² + γ_0 c(x, v) + e)/(Σ_{k=0}^{N} γ_k), (4.9)

where

c(x, v) = ‖K‖ sup_{n∈ℕ} {E[|⟨x_{n+1} − x | v_{n+1} − v_n⟩|] + E[|⟨x_{n+1} − x_n | v_{n+1} − v⟩|]} < +∞. (4.10)

Proof. We first note that (4.10) holds because of the boundedness of dom f and dom g*. Since ℓ is a convex, differentiable function with μ_ℓ-Lipschitz continuous gradient, the descent lemma yields

ℓ(u) ≤ ℓ(q) + ⟨∇ℓ(q) | u − q⟩ + (μ_ℓ/2)‖u − q‖². (4.11)

Since ℓ is convex, ℓ(q) ≤ ℓ(w) + ⟨∇ℓ(q) | q − w⟩. Adding this inequality to (4.11), we obtain

ℓ(u) ≤ ℓ(w) + ⟨∇ℓ(q) | u − w⟩ + (μ_ℓ/2)‖u − q‖². (4.12)

In particular, applying (4.12) with u = v_{n+1}, w = v and q = u_n, we get

ℓ(v_{n+1}) ≤ ℓ(v) + ⟨∇ℓ(u_n) | v_{n+1} − v⟩ + (μ_ℓ/2)‖v_{n+1} − u_n‖². (4.13)

Moreover, it follows from (4.6) that

−(v_{n+1} − v_n + γ_n ∇L(u_n, ζ_n) − γ_n Ky_n) ∈ γ_n ∂g*(v_{n+1}), (4.14)

and hence, using the convexity of g*,

g*(v) − g*(v_{n+1}) ≥ (1/γ_n)⟨v_{n+1} − v | v_{n+1} − v_n + γ_n ∇L(u_n, ζ_n) − γ_n Ky_n⟩. (4.15)

Therefore, we derive from (4.13), (4.15) and (4.3) that

G(x_{n+1}, v) − G(x_{n+1}, v_{n+1}) = ⟨Kx_{n+1} | v − v_{n+1}⟩ − g*(v) + g*(v_{n+1}) − ℓ(v) + ℓ(v_{n+1})
≤ ⟨Kx_{n+1} | v − v_{n+1}⟩ + (1/γ_n)⟨v − v_{n+1} | v_{n+1} − v_n + γ_n ∇L(u_n, ζ_n) − γ_n Ky_n⟩ + ⟨∇ℓ(u_n) | v_{n+1} − v⟩ + (μ_ℓ/2)‖v_{n+1} − u_n‖²
= ⟨K(x_{n+1} − y_n) | v − v_{n+1}⟩ + (1/γ_n)⟨v − v_{n+1} | v_{n+1} − v_n⟩ + (μ_ℓ/2)‖v_{n+1} − u_n‖² + ⟨∇ℓ(u_n) − ∇L(u_n, ζ_n) | v_{n+1} − v⟩. (4.16)

In the same way, since h is convex differentiable with μ_h-Lipschitz gradient, we have

h(x_{n+1}) − h(x) ≤ ⟨∇h(y_n) | x_{n+1} − x⟩ + (μ_h/2)‖x_{n+1} − y_n‖². (4.17)

Moreover, it follows from (4.6) that

−(x_{n+1} − x_n + γ_n ∇H(y_n, ξ_n) + γ_n K*u_n) ∈ γ_n ∂f(x_{n+1}), (4.18)

and hence, by the convexity of f,

f(x_{n+1}) − f(x) ≤ (1/γ_n)⟨x − x_{n+1} | x_{n+1} − x_n + γ_n ∇H(y_n, ξ_n) + γ_n K*u_n⟩. (4.19)

In turn, using the definition of G in (4.3), we have

G(x_{n+1}, v_{n+1}) − G(x, v_{n+1}) = h(x_{n+1}) − h(x) + ⟨K(x_{n+1} − x) | v_{n+1}⟩ + f(x_{n+1}) − f(x)
≤ ⟨∇h(y_n) | x_{n+1} − x⟩ + (μ_h/2)‖x_{n+1} − y_n‖² + ⟨K(x_{n+1} − x) | v_{n+1}⟩ + (1/γ_n)⟨x − x_{n+1} | x_{n+1} − x_n + γ_n ∇H(y_n, ξ_n) + γ_n K*u_n⟩
= ⟨K(x_{n+1} − x) | v_{n+1} − u_n⟩ + (1/γ_n)⟨x − x_{n+1} | x_{n+1} − x_n⟩ + (μ_h/2)‖x_{n+1} − y_n‖² + ⟨∇h(y_n) − ∇H(y_n, ξ_n) | x_{n+1} − x⟩. (4.20)

Let us set

x̄_{n+1} = prox_{γ_n f}(x_n − γ_n ∇h(y_n) − γ_n K*u_n), v̄_{n+1} = prox_{γ_n g*}(v_n − γ_n ∇ℓ(u_n) + γ_n Ky_n). (4.21)
Then, using the Cauchy–Schwarz inequality and the nonexpansiveness of prox_{γ_n f}, we obtain

⟨∇h(y_n) − ∇H(y_n, ξ_n) | x_{n+1} − x⟩ = ⟨∇h(y_n) − ∇H(y_n, ξ_n) | x_{n+1} − x̄_{n+1}⟩ + ⟨∇h(y_n) − ∇H(y_n, ξ_n) | x̄_{n+1} − x⟩
≤ ‖∇H(y_n, ξ_n) − ∇h(y_n)‖ ‖x_{n+1} − x̄_{n+1}‖ + ⟨∇h(y_n) − ∇H(y_n, ξ_n) | x̄_{n+1} − x⟩
≤ γ_n‖∇H(y_n, ξ_n) − ∇h(y_n)‖² + ⟨∇h(y_n) − ∇H(y_n, ξ_n) | x̄_{n+1} − x⟩. (4.22)

In the same way,

⟨∇ℓ(u_n) − ∇L(u_n, ζ_n) | v_{n+1} − v⟩ ≤ γ_n‖∇L(u_n, ζ_n) − ∇ℓ(u_n)‖² + ⟨∇ℓ(u_n) − ∇L(u_n, ζ_n) | v̄_{n+1} − v⟩. (4.23)

It follows from (4.16), (4.20), (4.22) and (4.23) that

G(x_{n+1}, v) − G(x, v_{n+1}) ≤ (1/γ_n)(⟨v − v_{n+1} | v_{n+1} − v_n⟩ + ⟨x_n − x_{n+1} | x_{n+1} − x⟩) + (μ_h/2)‖x_{n+1} − y_n‖² + ⟨K(x_{n+1} − y_n) | v − v_{n+1}⟩ + ⟨K(x_{n+1} − x) | v_{n+1} − u_n⟩ + (μ_ℓ/2)‖v_{n+1} − u_n‖² + ⟨∇h(y_n) − ∇H(y_n, ξ_n) | x̄_{n+1} − x⟩ + ⟨∇ℓ(u_n) − ∇L(u_n, ζ_n) | v̄_{n+1} − v⟩ + γ_n‖∇H(y_n, ξ_n) − ∇h(y_n)‖² + γ_n‖∇L(u_n, ζ_n) − ∇ℓ(u_n)‖², (4.24)

which is equivalent to

γ_n(G(x_{n+1}, v) − G(x, v_{n+1})) ≤ ⟨v − v_{n+1} | v_{n+1} − v_n⟩ + ⟨x_n − x_{n+1} | x_{n+1} − x⟩ + (μ_h γ_n/2)‖x_{n+1} − y_n‖² + γ_n(⟨K(x_{n+1} − y_n) | v − v_{n+1}⟩ + ⟨K(x_{n+1} − x) | v_{n+1} − u_n⟩) + (μ_ℓ γ_n/2)‖v_{n+1} − u_n‖² + γ_n²(‖∇H(y_n, ξ_n) − ∇h(y_n)‖² + ‖∇L(u_n, ζ_n) − ∇ℓ(u_n)‖²) + γ_n(⟨∇h(y_n) − ∇H(y_n, ξ_n) | x̄_{n+1} − x⟩ + ⟨∇ℓ(u_n) − ∇L(u_n, ζ_n) | v̄_{n+1} − v⟩). (4.25)

For simplicity, set μ_0 = max{μ_h, μ_ℓ} (so that μ = 2μ_0 + ‖K‖), and let us define some notation in the product space H × G, endowed with the usual scalar product and associated norm:

x = (x, v), x_n = (x_n, v_n), y_n = (y_n, u_n), x̄_n = (x̄_n, v̄_n),
r_n = (∇H(y_n, ξ_n), ∇L(u_n, ζ_n)), R_n = (∇h(y_n), ∇ℓ(u_n)), (4.26)

and

S: H × G → H × G: (x, v) ↦ (K*v, −Kx). (4.27)

Then ‖S‖ = ‖K‖ and

⟨K(x_{n+1} − y_n) | v − v_{n+1}⟩ + ⟨K(x_{n+1} − x) | v_{n+1} − u_n⟩
= ⟨S(x_{n+1} − x_n) | x_{n+1} − x⟩ − ⟨S(x_n − x_{n−1}) | x_n − x⟩ − ⟨S(x_n − x_{n−1}) | x_{n+1} − x_n⟩
≤ d_{n+1} − d_n + (‖K‖/2)(‖x_n − x_{n−1}‖² + ‖x_{n+1} − x_n‖²), (4.28)

where we set d_n = ⟨S(x_n − x_{n−1}) | x_n − x⟩. Moreover, we also have

⟨v − v_{n+1} | v_{n+1} − v_n⟩ + ⟨x_n − x_{n+1} | x_{n+1} − x⟩ = ⟨x_{n+1} − x_n | x − x_{n+1}⟩ = (1/2)(‖x_n − x‖² − ‖x_n − x_{n+1}‖² − ‖x_{n+1} − x‖²). (4.29)

Furthermore, using the triangle inequality and x_n − y_n = x_{n−1} − x_n, we obtain

(μ_h γ_n/2)‖x_{n+1} − y_n‖² + (μ_ℓ γ_n/2)‖v_{n+1} − u_n‖² ≤ γ_n μ_0(‖x_n − x_{n+1}‖² + ‖x_n − x_{n−1}‖²). (4.30)

Finally, we can rewrite the last two terms in (4.25) as

γ_n²(‖∇H(y_n, ξ_n) − ∇h(y_n)‖² + ‖∇L(u_n, ζ_n) − ∇ℓ(u_n)‖²) + γ_n(⟨∇h(y_n) − ∇H(y_n, ξ_n) | x̄_{n+1} − x⟩ + ⟨∇ℓ(u_n) − ∇L(u_n, ζ_n) | v̄_{n+1} − v⟩) = γ_n²‖r_n − R_n‖² + γ_n⟨r_n − R_n | x − x̄_{n+1}⟩. (4.31)

Therefore, inserting (4.28), (4.29), (4.30) and (4.31) into (4.25) and rearranging, we get

γ_n(G(x_{n+1}, v) − G(x, v_{n+1})) ≤ (1/2)‖x_n − x‖² − (1/2)‖x_{n+1} − x‖² + γ_n d_{n+1} − γ_n d_n − (1/2 − γ_n μ_0 − γ_n‖K‖/2)‖x_n − x_{n+1}‖² + (γ_n μ_0 + γ_n‖K‖/2)‖x_n − x_{n−1}‖² + γ_n²‖r_n − R_n‖² + γ_n⟨r_n − R_n | x − x̄_{n+1}⟩. (4.32)

Let us set

b_n = (1/2)‖x_n − x‖² + (γ_n μ_0 + γ_n‖K‖/2)‖x_n − x_{n−1}‖² − γ_n d_n. (4.33)
We have

|γ_n d_n| ≤ γ_n‖K‖ ‖x_n − x‖ ‖x_n − x_{n−1}‖ ≤ (γ_n‖K‖/2)(‖x_n − x‖² + ‖x_n − x_{n−1}‖²) ⟹ (∀n ∈ ℕ) b_n ≥ 0.

Then, we can rewrite (4.32) as

γ_n(G(x_{n+1}, v) − G(x, v_{n+1})) ≤ b_n − b_{n+1} − (1/2 − 2γ_n μ_0 − γ_n‖K‖)‖x_n − x_{n+1}‖² + (γ_n − γ_{n+1})d_{n+1} + γ_n²‖r_n − R_n‖² + γ_n⟨r_n − R_n | x − x̄_{n+1}⟩. (4.34)

Now, using our assumption, since x̄_{n+1} is independent of (ξ_n, ζ_n), we have

E[⟨r_n − R_n | x − x̄_{n+1}⟩ | (ξ_0, ζ_0), ..., (ξ_{n−1}, ζ_{n−1})] = 0. (4.35)

Moreover, the condition on the step sizes gives

1/2 − 2γ_n μ_0 − γ_n‖K‖ = 1/2 − γ_n μ ≥ 0 and γ_n − γ_{n+1} ≥ 0. (4.36)

Therefore, taking the expectation of both sides of (4.34), we obtain

γ_n E[G(x_{n+1}, v) − G(x, v_{n+1})] ≤ E[b_n] − E[b_{n+1}] + (γ_n − γ_{n+1})c(x, v) + γ_n² E[‖r_n − R_n‖²]. (4.37)

Now, for any N ∈ ℕ, summing (4.37) from n = 0 to n = N, noting that b_0 = (1/2)‖x_0 − x‖² (since x_0 = x_{−1} and v_0 = v_{−1} give d_0 = 0) and invoking the convexity-concavity of G together with (4.8), we arrive at the desired result.

Remark 4.4 Here are some remarks.

(i) To the best of our knowledge, this is the first work establishing the convergence rate of the primal-dual gap for structured convex optimization involving infimal convolutions.

(ii) The results presented in this section are new even in the deterministic setting. In this case, by setting γ_n ≡ γ, our results share the same O(1/N) convergence rate of the primal-dual gap as in [17]. In the stochastic setting, our results share the same convergence rate of the primal-dual gap as in [30], under the same conditions on (γ_n)_{n∈ℕ} and the variances as in (4.7). However, the work in [30] is limited to the case where ℓ is a constant function.

References

[1] Y. F. Atchadé, G. Fort and E. Moulines, On perturbed proximal gradient algorithms, J. Mach. Learn. Res., Vol. 18, pp. 310–342, 2017.
[2] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn., Springer, New York, 2017.

[3] M. N. Bùi and P. L. Combettes, Multivariate monotone inclusions in saddle form, https://arxiv.org/pdf/2002.06135.pdf.

[4] L. M. Briceño-Arias and P. L. Combettes, A monotone+skew splitting model for composite monotone inclusions in duality, SIAM J. Optim., Vol. 21, pp. 1230–1250, 2011.

[5] L. M. Briceño-Arias and D. Davis, Forward-backward-half forward algorithm for solving monotone inclusions, SIAM J. Optim., Vol. 28, pp. 2839–2871, 2018.

[6] A. Chambolle and T. Pock, An introduction to continuous optimization for imaging, Acta Numer., Vol. 25, pp. 161–319, 2016.

[7] P. L. Combettes and J.-C. Pesquet, Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping, SIAM J. Optim., Vol. 25, pp. 1221–1248, 2015.

[8] P. L. Combettes and J.-C. Pesquet, Primal-dual splitting algorithm for solving inclusions with mixtures of composite, Lipschitzian, and parallel-sum type monotone operators, Set-Valued Var. Anal., Vol. 20, pp. 307–330, 2012.

[9] P. L. Combettes and J.-C. Pesquet, Stochastic approximations and perturbations in forward-backward splitting for monotone operators, Pure Appl. Funct. Anal., Vol. 1, pp. 13–37, 2016.

[10] P. L. Combettes and J.-C. Pesquet, Proximal splitting methods in signal processing, in: Fixed-Point Algorithms for Inverse Problems in Science and Engineering, Springer Optim. Appl., Vol. 49, pp. 185–212, Springer, New York, 2011.

[11] V. Cevher and B. C. Vũ, A reflected forward-backward splitting method for monotone inclusions involving Lipschitzian operators, Set-Valued Var. Anal.

[12] Lecture Notes in Mathematics, Vol. 2227, Springer, Cham, 2018.

[13] S. Cui and U. V. Shanbhag, On the analysis of reflected gradient and splitting methods for monotone stochastic variational inequality problems, in: IEEE 55th Conference on Decision and Control (CDC), Las Vegas, NV, USA, December 12–14, 2016.

[14] S. Cui and U. V. Shanbhag, Variance-reduced proximal and splitting schemes for monotone stochastic generalized equations, https://arxiv.org/abs/2008.11348.

[15] J. Duchi and Y. Singer, Efficient online and batch learning using forward backward splitting, J. Mach. Learn. Res., Vol. 10, pp. 2899–2934, 2009.

[16] A. Defazio, F. Bach and S. Lacoste-Julien, SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives, Adv. Neural Inf. Process. Syst., Vol. 27, pp. 1646–1654, 2014.

[17] Y. Drori, S. Sabach and M. Teboulle, A simple algorithm for a class of nonsmooth convex-concave saddle-point problems, Oper. Res. Lett., Vol. 43, pp. 209–214, 2015.

[18] S. Ghadimi and G. Lan, Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: a generic algorithmic framework, SIAM J. Optim., Vol. 22, pp. 1469–1492, 2012.

[19] A. Juditsky, A. Nemirovski and C. Tauvel, Solving variational inequalities with stochastic mirror-prox algorithm, Stoch. Syst., Vol. 1, pp. 17–58, 2011.

[20] J. T. Kwok, C. Hu and W. Pan, Accelerated gradient methods for stochastic optimization and online learning, Adv. Neural Inf. Process. Syst., Vol. 22, pp. 781–789, 2009.

[21] Y. Malitsky, Projected reflected gradient methods for monotone variational inequalities, SIAM J. Optim., Vol. 25, pp. 502–520, 2015.

[22] Y. Malitsky and M. K. Tam, A forward-backward splitting method for monotone inclusions without cocoercivity, SIAM J. Optim., Vol. 30, pp. 1451–1472, 2020.

[23] J.-C. Pesquet and A. Repetti, A class of randomized primal-dual algorithms for distributed optimization, J. Nonlinear Convex Anal., Vol. 16, pp. 2453–2490, 2015.

[24] H. Ouyang, N. He, L. Tran and A. Gray, Stochastic alternating direction method of multipliers, in: Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 2013.

[25] M. Ledoux and M. Talagrand, Probability in Banach Spaces: Isoperimetry and Processes, Springer, New York, 1991.

[26] N. Komodakis and J.-C. Pesquet, Playing with duality: an overview of recent primal-dual approaches for solving large-scale optimization problems, IEEE Signal Processing Magazine, Vol. 32, pp. 31–54, 2015.

[27] H. Robbins and D. Siegmund, A convergence theorem for non negative almost supermartingales and some applications, in: J. S. Rustagi (ed.), Optimizing Methods in Statistics, Academic Press, New York, pp. 233–257, 1971.

[28] L. Rosasco, S. Villa and B. C. Vũ, Stochastic forward-backward splitting for monotone inclusions, J. Optim. Theory Appl., Vol. 169, pp. 388–406, 2016.

[29] L. Rosasco, S. Villa and B. C. Vũ, A stochastic inertial forward-backward splitting algorithm for multivariate monotone inclusions, Optimization, Vol. 65, pp. 1293–1314, 2016.

[30] L. Rosasco, S. Villa and B. C. Vũ, A first-order stochastic primal-dual algorithm with correction step, Numer. Funct. Anal. Optim., Vol. 38, pp. 602–626, 2017.

[31] L. Rosasco, S. Villa and B. C. Vũ, Convergence of stochastic proximal gradient algorithm, Appl. Math. Optim., 2019. https://link.springer.com/article/10.1007/s00245-019-09617-7

[32] E. K. Ryu and B. C. Vũ, Finding the forward-Douglas-Rachford-forward method, J. Optim. Theory Appl., Vol. 184, pp. 858–876, 2020.

[33] P. Tseng, A modified forward-backward splitting method for maximal monotone mappings, SIAM J. Control Optim., Vol. 38, pp. 431–446, 2000.

[34] A. Yurtsever, B. C. Vũ and V. Cevher, Stochastic three-composite convex minimization, Advances in Neural Information Processing Systems, pp. 4329–4337, 2016.

[35] B. C. Vũ, Almost sure convergence of the forward-backward-forward splitting algorithm, Optim. Lett., Vol. 10, pp. 781–803, 2016.