A first-order primal-dual algorithm with linesearch
Yura Malitsky∗   Thomas Pock∗

∗ Institute for Computer Graphics and Vision, Graz University of Technology, 8010 Graz, Austria. E-mail: [email protected], [email protected]

Abstract
The paper proposes a linesearch for a primal-dual method. Each iteration of the linesearch requires to update only the dual (or primal) variable. For many problems, in particular for regularized least squares, the linesearch does not require any additional matrix-vector multiplications. We prove convergence of the proposed method under standard assumptions. We also show an ergodic O(1/N) rate of convergence for our method. In case one or both of the prox-functions are strongly convex, we modify our basic method to get a better convergence rate. Finally, we propose a linesearch for a saddle point problem with an additional smooth term. Several numerical experiments confirm the efficiency of our proposed methods.
Keywords: Saddle-point problems, first-order algorithms, primal-dual algorithms, linesearch, convergence rates, backtracking
In this work we propose a linesearch procedure for the primal-dual algorithm (PDA) that was introduced in [4]. It is a simple first-order method that is widely used for solving nonsmooth composite minimization problems. Recently, connections of PDA with proximal point algorithms [12] and with ADMM [6] have been established. Some generalizations of the method were considered in [5, 7, 10]. A survey of possible applications of the algorithm can be found in [6, 13].

The basic form of PDA uses fixed step sizes during all iterations. This has several drawbacks. First, we have to compute the norm of the operator, which may be quite expensive for large scale dense matrices. Second, even if we know this norm, one can often use larger steps, which usually yields a faster convergence. As a remedy for the first issue one can use a diagonal preconditioning [16], but there is still no strong evidence that such a preconditioning improves, or at least does not worsen, the speed of convergence of PDA. Regarding the second issue, as we will see in our experiments, the speed improvement gained by using the linesearch can sometimes be significant.

Our proposed analysis of PDA exploits the idea of the recent works [14, 15], where several algorithms are proposed for solving a monotone variational inequality. Those algorithms are different from the PDA; however, they also use a similar extrapolation step. Although our analysis of the primal-dual method is not as elegant as, for example, that in [12], it gives a simple and cheap way to incorporate a linesearch for defining the step sizes. Each inner iteration of the linesearch requires updating only the dual variables. Moreover, the step sizes may increase from iteration to iteration. We prove the convergence of the algorithm under quite general assumptions. Also we show that in many important cases the PDAL (primal-dual algorithm with linesearch) preserves the complexity of one iteration of PDA. In particular, our method, applied to regularized least-squares problems, uses the same number of matrix-vector multiplications per iteration as the forward-backward method or FISTA [3] (both with fixed step size) do, but does not require knowledge of the matrix norm and, in addition, uses adaptive steps.

For the case when the primal or the dual objective is strongly convex, we modify our linesearch procedure in order to construct accelerated versions of PDA. This is done in a similar way as in [4]. The obtained algorithms share the same complexity per iteration as PDAL does, but in many cases substantially outperform PDA and PDAL.

We also consider a more general primal-dual problem which involves an additional smooth function with Lipschitz-continuous gradient (see [7]). For this case we generalize our linesearch so as to avoid knowing that Lipschitz constant.

The authors in [9, 10] also proposed a linesearch for the primal-dual method, with the goal to vary the ratio between primal and dual steps such that the primal and dual residuals remain roughly of the same size. The same idea was used in [11] for the ADMM method. However, we should highlight that this principle is just a heuristic, as it is not clear that it in fact improves the speed of convergence. The linesearch proposed in [9, 10] requires an update of both primal and dual variables, which may make the algorithm much more expensive than the basic PDA.
Also, the authors proved convergence of the iterates only in the case when one of the sequences (x^k) or (y^k) is bounded. Although this is often the case, there are many problems which cannot be encompassed by this assumption. Finally, it is not clear how to derive accelerated versions of that algorithm.

As a byproduct of our analysis, we show how one can use a variable ratio between primal and dual steps and under which circumstances we can guarantee convergence. However, it was not the purpose of this paper to develop new strategies for varying such a ratio during the iterations.

The paper is organized as follows. In the next section we introduce the notation and recall some useful facts. Section 2 presents our basic primal-dual algorithm with linesearch. We prove its convergence, establish its ergodic convergence rate, and consider some particular examples of how the linesearch works. In section 3 we propose accelerated versions of PDAL under the assumption that the primal or the dual problem is strongly convex. Section 4 deals with more general saddle point problems which involve an additional smooth function. In section 5 we illustrate the efficiency of our methods for several typical problems.

Let X, Y be two finite-dimensional real vector spaces equipped with an inner product ⟨·,·⟩ and a norm ‖·‖ = √⟨·,·⟩. We are focusing on the following problem:

  min_{x∈X} max_{y∈Y} ⟨Kx, y⟩ + g(x) − f*(y),   (1)

where
• K : X → Y is a bounded linear operator, with the operator norm L = ‖K‖;
• g : X → (−∞, +∞] and f* : Y → (−∞, +∞] are proper lower semicontinuous (l.s.c.) convex functions;
• problem (1) has a saddle point.

Note that f* denotes the Legendre–Fenchel conjugate of a convex l.s.c. function f. By a slight (but common) abuse of notation, we write K* to denote the adjoint of the operator K. Under the above assumptions, (1) is equivalent to the primal problem

  min_{x∈X} f(Kx) + g(x)   (2)

and the dual problem

  min_{y∈Y} f*(y) + g*(−K*y).   (3)

Recall that for a proper l.s.c. convex function h : X → (−∞, +∞] the proximal operator prox_h is defined as

  prox_h : X → X : x ↦ argmin_z { h(z) + ½‖z − x‖² }.

The following important characteristic property of the proximal operator is well known:

  x̄ = prox_h x  ⇔  ⟨x̄ − x, y − x̄⟩ ≥ h(x̄) − h(y)  ∀y ∈ X.   (4)

We will often use the following identity (cosine rule):

  2⟨a − b, c − a⟩ = ‖b − c‖² − ‖a − b‖² − ‖a − c‖²  ∀a, b, c ∈ X.   (5)

Let (x̂, ŷ) be a saddle point of problem (1). Then by the definition of the saddle point we have

  P_{x̂,ŷ}(x) := g(x) − g(x̂) + ⟨K*ŷ, x − x̂⟩ ≥ 0  ∀x ∈ X,   (6)
  D_{x̂,ŷ}(y) := f*(y) − f*(ŷ) − ⟨Kx̂, y − ŷ⟩ ≥ 0  ∀y ∈ Y.   (7)

The expression G_{x̂,ŷ}(x, y) = P_{x̂,ŷ}(x) + D_{x̂,ŷ}(y) is known as a primal-dual gap. In certain cases, when it is clear which saddle point is considered, we will omit the subscripts in P, D, and G. It is also important to highlight that for fixed (x̂, ŷ) the functions P(·), D(·), and G(·,·) are convex.

Consider the original primal-dual method:

  y^{k+1} = prox_{σf*}(y^k + σK x̄^k)
  x^{k+1} = prox_{τg}(x^k − τK*y^{k+1})
  x̄^{k+1} = x^{k+1} + θ(x^{k+1} − x^k).

In [4] its convergence was proved under the assumptions θ = 1, τ, σ > 0, and τσL² < 1. In the next section we will show how to incorporate a linesearch into this method.

Algorithm 1: Primal-dual algorithm with linesearch
Initialization:
Choose x^0 ∈ X, y^1 ∈ Y, τ_0 > 0, µ ∈ (0,1), δ ∈ (0,1), and β > 0. Set θ_0 = 1.
Main iteration:
1. Compute x^k = prox_{τ_{k−1} g}(x^{k−1} − τ_{k−1} K*y^k).
2. Choose any τ_k ∈ [τ_{k−1}, τ_{k−1}√(1+θ_{k−1})] and run
Linesearch:
  2.a. Compute
    θ_k = τ_k/τ_{k−1},   x̄^k = x^k + θ_k(x^k − x^{k−1}),
    y^{k+1} = prox_{βτ_k f*}(y^k + βτ_k K x̄^k)
  2.b. Break linesearch if
    √β τ_k ‖K*y^{k+1} − K*y^k‖ ≤ δ ‖y^{k+1} − y^k‖   (8)
  Otherwise, set τ_k := τ_k µ and go to 2.a.
End of linesearch
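To make the control flow of Algorithm 1 concrete, the following is a minimal Python sketch of one PDAL iteration (trial step, backtracking by µ, stopping test (8)). The callables prox_g, prox_fs, K, Kt (for K*) and the default parameter values are placeholders for illustration, not part of the paper.

import numpy as np

def pdal_step(x, y, tau_prev, theta_prev, K, Kt, prox_g, prox_fs,
              beta=1.0, mu=0.7, delta=0.99):
    """One iteration of Algorithm 1 (PDAL), written as a plain sketch.

    prox_g(v, t)  should return prox_{t g}(v);
    prox_fs(v, s) should return prox_{s f*}(v);
    K, Kt are callables applying K and K* to a vector.
    Input: x = x^{k-1}, y = y^k; output: x^k, y^{k+1}, tau_k, theta_k.
    """
    # Step 1: primal update with the previous step size
    x_new = prox_g(x - tau_prev * Kt(y), tau_prev)

    # Step 2: trial step (here: the largest allowed one) and backtracking loop
    tau = tau_prev * np.sqrt(1.0 + theta_prev)
    Kty = Kt(y)
    while True:
        theta = tau / tau_prev
        x_bar = x_new + theta * (x_new - x)
        y_new = prox_fs(y + beta * tau * K(x_bar), beta * tau)
        # breaking condition (8)
        if np.sqrt(beta) * tau * np.linalg.norm(Kt(y_new) - Kty) \
                <= delta * np.linalg.norm(y_new - y):
            break
        tau *= mu
    return x_new, y_new, tau, theta

In an efficient implementation one would additionally cache Kx^k and Kx^{k−1} and form Kx̄^k = (1+θ_k)Kx^k − θ_kKx^{k−1} inside the loop, as described in Remark 1 below.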
The primal-dual algorithm with linesearch (PDAL) is summarized in Algorithm 1. Given all information from the current iterate, x^k, y^k, τ_{k−1}, and θ_{k−1}, we first choose some trial step τ_k ∈ [τ_{k−1}, τ_{k−1}√(1+θ_{k−1})]; then during every iteration of the linesearch it is decreased by the factor µ ∈ (0,1). The linesearch stops once (8) holds, and its output is y^{k+1} together with a step size τ_k which will be used to compute the next iterate x^{k+1}. In Algorithm 1 there are two opposite options: always start the linesearch from the largest possible step τ_k = τ_{k−1}√(1+θ_{k−1}), or, on the contrary, never increase τ_k. Step 2 also allows us to choose a compromise between them.

Note that in PDAL the parameter β plays the role of the ratio σ/τ between the fixed steps in PDA. Thus, we can rewrite the stopping criterion (8) as

  τ_k σ_k ‖K*y^{k+1} − K*y^k‖² ≤ δ² ‖y^{k+1} − y^k‖²,

where σ_k = βτ_k. Of course, in PDAL we can always choose fixed steps τ_k ≡ τ, σ_k ≡ σ with τσ ≤ δ²/L² and set θ_k = 1. In this case PDAL coincides with PDA, though our proposed analysis seems to be new. The parameter δ is mainly used for theoretical purposes: in order to be able to control ‖y^{k+1} − y^k‖, we need δ < 1. For the experiments δ should be chosen very close to 1.
Remark 1. It is clear that each iteration of the linesearch requires the computation of prox_{βτ_k f*}(·) and of K*y^{k+1}. As in problem (1) we can always exchange primal and dual variables, it makes sense to choose as the dual variable in PDAL the one for which the respective prox-operator is simpler to compute. Note that during the linesearch we need to compute Kx^k only once and then use that Kx̄^k = (1+θ_k)Kx^k − θ_kKx^{k−1}.

Remark 2. Note that when prox_{σf*} is a linear (or affine) operator, the linesearch becomes extremely simple: it does not require any additional matrix-vector multiplications. We itemize some examples below:

1. f*(y) = ⟨c, y⟩. Then it is easy to verify that prox_{σf*} u = u − σc. Thus, we have
   y^{k+1} = prox_{σ_k f*}(y^k + σ_k Kx̄^k) = y^k + σ_k Kx̄^k − σ_k c,
   K*y^{k+1} = K*y^k + σ_k (K*Kx̄^k − K*c).

2. f*(y) = ½‖y − b‖². Then prox_{σf*} u = (u + σb)/(1 + σ), and we obtain
   y^{k+1} = prox_{σ_k f*}(y^k + σ_k Kx̄^k) = (y^k + σ_k(Kx̄^k + b))/(1 + σ_k),
   K*y^{k+1} = (K*y^k + σ_k(K*Kx̄^k + K*b))/(1 + σ_k).

3. f*(y) = δ_H(y), the indicator function of the hyperplane H = {u : ⟨u, a⟩ = b}. Then prox_{σf*} u = P_H u = u + ((b − ⟨u, a⟩)/‖a‖²) a, and hence
   y^{k+1} = prox_{σ_k f*}(y^k + σ_k Kx̄^k) = y^k + σ_k Kx̄^k + ((b − ⟨a, y^k + σ_k Kx̄^k⟩)/‖a‖²) a,
   K*y^{k+1} = K*y^k + σ_k K*Kx̄^k + ((b − ⟨a, y^k + σ_k Kx̄^k⟩)/‖a‖²) K*a.

Evidently, each iteration of PDAL for all the cases above requires only two matrix-vector multiplications. Indeed, before the linesearch starts we have to compute Kx^k and K*Kx^k, and then during the linesearch we should use the relations

  Kx̄^k = (1 + θ_k) Kx^k − θ_k Kx^{k−1},
  K*Kx̄^k = (1 + θ_k) K*Kx^k − θ_k K*Kx^{k−1},

where Kx^{k−1} and K*Kx^{k−1} should be reused from the previous iteration. All other operations are comparably cheap, and hence the costs per iteration of PDA and PDAL are almost the same. However, this might not be true if the matrix K is very sparse. For example, for a finite differences operator the cost of a matrix-vector multiplication is the same as the cost of a few vector-vector additions.

One simple implication of these facts is that for the regularized least-squares problem

  min_x ½‖Ax − b‖² + g(x),   (9)

our method does not require one to know ‖A‖, but at the same time does not require any additional matrix-vector multiplications, as is the case for standard first order methods with backtracking (e.g., the proximal gradient method, FISTA). In fact, we can rewrite (9) as

  min_x max_y g(x) + ⟨Ax, y⟩ − ½‖y + b‖².   (10)

Here f*(y) = ½‖y + b‖², and hence we are in the situation where the operator prox_{f*} is affine. A minimal code sketch of this case is given below.
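To illustrate Remark 2, here is a hedged Python sketch of the PDAL iteration specialized to problem (10), where the prox of f*(y) = ½‖y+b‖² is affine. The names (A, b, prox_g, the cache layout) are placeholders introduced for this sketch; the point is only that the inner loop below uses vector operations and that two matrix-vector products are formed per outer iteration.

import numpy as np

def pdal_ls_step(x, y, tau_prev, theta_prev, cache, A, b, prox_g,
                 beta=1.0, mu=0.7, delta=0.99):
    """One PDAL iteration for min_x 0.5*||Ax - b||^2 + g(x), cf. (9)-(10).

    `cache` holds (A x^{k-1}, A^T A x^{k-1}, A^T y^k, A^T b) from earlier,
    so the linesearch loop needs no further matrix-vector products.
    prox_g(v, t) = prox_{t g}(v) is a user-supplied placeholder.
    """
    Ax_prev, AtAx_prev, Aty, Atb = cache

    # Step 1: primal update (uses the cached A^T y)
    x_new = prox_g(x - tau_prev * Aty, tau_prev)

    # the only two matrix-vector products of the whole iteration
    Ax = A @ x_new
    AtAx = A.T @ Ax

    tau = tau_prev * np.sqrt(1.0 + theta_prev)           # largest trial step
    while True:
        theta = tau / tau_prev
        sigma = beta * tau
        Axbar = (1.0 + theta) * Ax - theta * Ax_prev      # A x̄^k, no matvec
        AtAxbar = (1.0 + theta) * AtAx - theta * AtAx_prev
        # prox of f*(y) = 0.5*||y + b||^2 is affine: prox_{s f*}(u) = (u - s b)/(1 + s)
        y_new = (y + sigma * (Axbar - b)) / (1.0 + sigma)
        Aty_new = (Aty + sigma * (AtAxbar - Atb)) / (1.0 + sigma)
        # stopping criterion (8)
        if np.sqrt(beta) * tau * np.linalg.norm(Aty_new - Aty) \
                <= delta * np.linalg.norm(y_new - y):
            break
        tau *= mu

    new_cache = (Ax, AtAx, Aty_new, Atb)
    return x_new, y_new, tau, theta, new_cache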
By construction of the algorithm we immediately have the following:

Lemma 1.
(i) The linesearch in PDAL always terminates.
(ii) There exists τ_min > 0 such that τ_k > τ_min for all k.
(iii) There exists θ_max > 0 such that θ_k ≤ θ_max for all k; in fact one can take θ_max = (1+√5)/2.

Proof. (i) In each iteration of the linesearch, τ_k is multiplied by the factor µ < 1. Since (8) is satisfied for any τ_k ≤ δ/(√β L), the inner loop cannot run indefinitely.

(ii) Without loss of generality, assume that τ_0 > δµ/(√β L). Our goal is to show that τ_{k−1} > δµ/(√β L) implies τ_k > δµ/(√β L). Suppose that τ_k = τ_{k−1}√(1+θ_{k−1}) µ^i for some i ∈ Z_+. If i = 0, then τ_k ≥ τ_{k−1} > δµ/(√β L). If i > 0, then τ_k/µ = τ_{k−1}√(1+θ_{k−1}) µ^{i−1} does not satisfy (8). Thus τ_k/µ > δ/(√β L), and hence τ_k > δµ/(√β L).

(iii) From τ_k ≤ τ_{k−1}√(1+θ_{k−1}) and θ_k = τ_k/τ_{k−1} it follows that θ_k ≤ √(1+θ_{k−1}). From this it can be easily concluded that θ_k ≤ (1+√5)/2 for all k ∈ Z_+. In fact, assume the contrary and let r be the smallest index such that θ_r > (1+√5)/2. Since θ_0 = 1, we have r ≥ 1, and hence θ_{r−1} ≥ θ_r² − 1 > (1+√5)/2. This yields a contradiction with the minimality of r.

One might be interested in how many iterations the linesearch needs in order to terminate. Of course, we can give only a priori upper bounds. Assume that (τ_k) is bounded from above by τ_max. Consider in step 2 of Algorithm 1 the two opposite ways of choosing the step size τ_k: maximally increase it, that is, always set τ_k = τ_{k−1}√(1+θ_{k−1}), or never increase it and set τ_k = τ_{k−1}. If the former case holds, then after i iterations of the linesearch we have τ_k ≤ τ_{k−1}√(1+θ_{k−1}) µ^{i−1} ≤ τ_max ((1+√5)/2) µ^{i−1}, where we have used the bound from Lemma 1 (iii). Hence, if i − 1 ≥ log_µ (2δ/((1+√5)√β L τ_max)), then τ_k ≤ δ/(√β L) and the linesearch procedure must terminate. Similarly, if the latter case holds, then after at most 1 + log_µ (δ/(√β L τ_max)) iterations the linesearch stops. However, because in this case (τ_k) is nonincreasing, this number is an upper bound for the total number of linesearch iterations.

The following is our main convergence result.
Theorem 1. Let (x^k, y^k) be a sequence generated by PDAL. Then it is a bounded sequence in X × Y, and all its cluster points are solutions of (1). Moreover, if g|_{dom g} is continuous and (τ_k) is bounded from above, then the whole sequence (x^k, y^k) converges to a solution of (1).

The condition that g|_{dom g} be continuous is not restrictive: it holds for any g with open dom g (this includes all finite-valued functions) or for an indicator δ_C of any closed convex set C. It also holds for any separable convex l.s.c. function (Corollary 9.15, [2]). The boundedness of (τ_k) from above is rather a theoretical issue: clearly, we can easily bound it in PDAL.

Proof. Let (x̂, ŷ) be any saddle point of (1). From (4) it follows that

  ⟨x^{k+1} − x^k + τ_k K*y^{k+1}, x̂ − x^{k+1}⟩ ≥ τ_k (g(x^{k+1}) − g(x̂)),   (11)
  ⟨(1/β)(y^{k+1} − y^k) − τ_k Kx̄^k, ŷ − y^{k+1}⟩ ≥ τ_k (f*(y^{k+1}) − f*(ŷ)).   (12)

Since x^k = prox_{τ_{k−1} g}(x^{k−1} − τ_{k−1}K*y^k), we have again by (4) that for all x ∈ X

  ⟨x^k − x^{k−1} + τ_{k−1}K*y^k, x − x^k⟩ ≥ τ_{k−1}(g(x^k) − g(x)).

Substituting x = x^{k+1} and x = x^{k−1} in the last inequality, we get

  ⟨x^k − x^{k−1} + τ_{k−1}K*y^k, x^{k+1} − x^k⟩ ≥ τ_{k−1}(g(x^k) − g(x^{k+1})),   (13)
  ⟨x^k − x^{k−1} + τ_{k−1}K*y^k, x^{k−1} − x^k⟩ ≥ τ_{k−1}(g(x^k) − g(x^{k−1})).   (14)

Adding (13), multiplied by θ_k = τ_k/τ_{k−1}, and (14), multiplied by θ_k², we obtain

  ⟨x̄^k − x^k + τ_k K*y^k, x^{k+1} − x̄^k⟩ ≥ τ_k ((1+θ_k)g(x^k) − g(x^{k+1}) − θ_k g(x^{k−1})),   (15)

where we have used that x̄^k = x^k + θ_k(x^k − x^{k−1}).

Consider the following identity:

  τ_k⟨K*y^{k+1} − K*ŷ, x̄^k − x̂⟩ − τ_k⟨Kx̄^k − Kx̂, y^{k+1} − ŷ⟩ = 0.   (16)

Summing (11), (12), (15), and (16), we get

  ⟨x^{k+1} − x^k, x̂ − x^{k+1}⟩ + (1/β)⟨y^{k+1} − y^k, ŷ − y^{k+1}⟩ + ⟨x̄^k − x^k, x^{k+1} − x̄^k⟩
  + τ_k⟨K*y^{k+1} − K*y^k, x̄^k − x^{k+1}⟩ − τ_k⟨K*ŷ, x̄^k − x̂⟩ + τ_k⟨Kx̂, y^{k+1} − ŷ⟩
  ≥ τ_k (f*(y^{k+1}) − f*(ŷ) + (1+θ_k)g(x^k) − θ_k g(x^{k−1}) − g(x̂)).   (17)

Using that

  f*(y^{k+1}) − f*(ŷ) − ⟨Kx̂, y^{k+1} − ŷ⟩ = D_{x̂,ŷ}(y^{k+1})   (18)

and

  (1+θ_k)g(x^k) − θ_k g(x^{k−1}) − g(x̂) + ⟨K*ŷ, x̄^k − x̂⟩
   = (1+θ_k)(g(x^k) − g(x̂) + ⟨K*ŷ, x^k − x̂⟩) − θ_k(g(x^{k−1}) − g(x̂) + ⟨K*ŷ, x^{k−1} − x̂⟩)
   = (1+θ_k) P_{x̂,ŷ}(x^k) − θ_k P_{x̂,ŷ}(x^{k−1}),   (19)

we can rewrite (17) as (we will henceforth not write the subscripts of P and D unless it is unclear)

  ⟨x^{k+1} − x^k, x̂ − x^{k+1}⟩ + (1/β)⟨y^{k+1} − y^k, ŷ − y^{k+1}⟩ + ⟨x̄^k − x^k, x^{k+1} − x̄^k⟩
  + τ_k⟨K*y^{k+1} − K*y^k, x̄^k − x^{k+1}⟩ ≥ τ_k ((1+θ_k)P(x^k) − θ_k P(x^{k−1}) + D(y^{k+1})).   (20)

Let ε_k denote the right-hand side of (20). Applying the cosine rule (5) to each term in the first line of (20), we obtain

  ½(‖x^k − x̂‖² − ‖x^{k+1} − x̂‖² − ‖x^{k+1} − x^k‖²)
  + (1/(2β))(‖y^k − ŷ‖² − ‖y^{k+1} − ŷ‖² − ‖y^{k+1} − y^k‖²)
  + ½(‖x^{k+1} − x^k‖² − ‖x̄^k − x^k‖² − ‖x^{k+1} − x̄^k‖²)
  + τ_k⟨K*y^{k+1} − K*y^k, x̄^k − x^{k+1}⟩ ≥ ε_k.   (21)

By (8), the Cauchy–Schwarz inequality, and Cauchy's inequality we get

  τ_k⟨K*y^{k+1} − K*y^k, x̄^k − x^{k+1}⟩ ≤ (δ/√β)‖x^{k+1} − x̄^k‖ ‖y^{k+1} − y^k‖
   ≤ ½‖x^{k+1} − x̄^k‖² + (δ²/(2β))‖y^{k+1} − y^k‖²,

from which we derive that

  ½(‖x^k − x̂‖² − ‖x^{k+1} − x̂‖²) + (1/(2β))(‖y^k − ŷ‖² − ‖y^{k+1} − ŷ‖²)
  − ½‖x̄^k − x^k‖² − ((1−δ²)/(2β))‖y^{k+1} − y^k‖² ≥ ε_k.   (22)

Since (x̂, ŷ) is a saddle point, D(y^{k+1}) ≥ 0 and P(x^k) ≥ 0, and hence (22) yields

  ½(‖x^k − x̂‖² − ‖x^{k+1} − x̂‖²) + (1/(2β))(‖y^k − ŷ‖² − ‖y^{k+1} − ŷ‖²)
  − ½‖x̄^k − x^k‖² − ((1−δ²)/(2β))‖y^{k+1} − y^k‖² ≥ τ_k((1+θ_k)P(x^k) − θ_k P(x^{k−1}))   (23)

or, taking into account θ_kτ_k ≤ (1+θ_{k−1})τ_{k−1},

  ½‖x^{k+1} − x̂‖² + (1/(2β))‖y^{k+1} − ŷ‖² + τ_k(1+θ_k)P(x^k)
  ≤ ½‖x^k − x̂‖² + (1/(2β))‖y^k − ŷ‖² + τ_{k−1}(1+θ_{k−1})P(x^{k−1})
   − ½‖x̄^k − x^k‖² − ((1−δ²)/(2β))‖y^{k+1} − y^k‖².   (24)

From this we deduce that (x^k), (y^k) are bounded sequences, lim_{k→∞}‖x̄^k − x^k‖ = 0, and lim_{k→∞}‖y^{k+1} − y^k‖ = 0. Also notice that

  (x^{k+1} − x^k)/τ_k = (x̄^{k+1} − x^{k+1})/τ_{k+1} → 0 as k → ∞,

since ‖x̄^{k+1} − x^{k+1}‖ → 0 and τ_k is separated from 0 by Lemma 1. Let (x^{k_i}, y^{k_i}) be a subsequence that converges to some cluster point (x*, y*). Passing to the limit in

  ⟨(x^{k_i+1} − x^{k_i})/τ_{k_i} + K*y^{k_i+1}, x − x^{k_i+1}⟩ ≥ g(x^{k_i+1}) − g(x)  ∀x ∈ X,
  ⟨(y^{k_i+1} − y^{k_i})/(βτ_{k_i}) − Kx̄^{k_i}, y − y^{k_i+1}⟩ ≥ f*(y^{k_i+1}) − f*(y)  ∀y ∈ Y,

we obtain that (x*, y*) is a saddle point of (1).

When g|_{dom g} is continuous, g(x^{k_i}) → g(x*), and hence P_{x*,y*}(x^{k_i}) → 0. From (24), applied with the saddle point (x*, y*), it follows that the sequence

  a_k = ½‖x^{k+1} − x*‖² + (1/(2β))‖y^{k+1} − y*‖² + τ_k(1+θ_k)P_{x*,y*}(x^k)

is monotone. Taking into account the boundedness of (τ_k) and (θ_k), we obtain lim_{k→∞} a_k = lim_{i→∞} a_{k_i} = 0, which means that x^k → x*, y^k → y*.

Theorem 2 (ergodic convergence). Let (x^k, y^k) be a sequence generated by PDAL and (x̂, ŷ) be any saddle point of (1). Then for the ergodic sequence (X^N, Y^N) it holds that

  G_{x̂,ŷ}(X^N, Y^N) ≤ (1/s_N)(½‖x^1 − x̂‖² + (1/(2β))‖y^1 − ŷ‖² + τ_1θ_1 P_{x̂,ŷ}(x^0)),

where s_N = Σ_{k=1}^N τ_k, X^N = (τ_1θ_1 x^0 + Σ_{k=1}^N τ_k x̄^k)/(τ_1θ_1 + s_N), Y^N = (Σ_{k=1}^N τ_k y^{k+1})/s_N.

Proof. Summing up (22) from k = 1 to N, we get

  ½(‖x^1 − x̂‖² − ‖x^{N+1} − x̂‖²) + (1/(2β))(‖y^1 − ŷ‖² − ‖y^{N+1} − ŷ‖²) ≥ Σ_{k=1}^N ε_k.   (25)

The right-hand side in (25) can be expressed as

  Σ_{k=1}^N ε_k = τ_N(1+θ_N)P(x^N) + Σ_{k=2}^N [(1+θ_{k−1})τ_{k−1} − θ_kτ_k] P(x^{k−1}) − θ_1τ_1 P(x^0) + Σ_{k=1}^N τ_k D(y^{k+1}).

By convexity of P,

  τ_N(1+θ_N)P(x^N) + Σ_{k=2}^N [(1+θ_{k−1})τ_{k−1} − θ_kτ_k] P(x^{k−1})   (26)
   ≥ (τ_1θ_1 + s_N) P((τ_1θ_1 x^0 + Σ_{k=1}^N τ_k x̄^k)/(τ_1θ_1 + s_N))   (27)
   = (τ_1θ_1 + s_N) P(X^N) ≥ s_N P(X^N),   (28)

where s_N = Σ_{k=1}^N τ_k; here we used that the coefficients in (26) are nonnegative, that they sum to τ_1θ_1 + s_N, and that the corresponding convex combination of x^0, …, x^N equals X^N, since τ_kx̄^k = (1+θ_k)τ_kx^k − θ_kτ_kx^{k−1}. Similarly,

  Σ_{k=1}^N τ_k D(y^{k+1}) ≥ s_N D((Σ_{k=1}^N τ_k y^{k+1})/s_N) = s_N D(Y^N).   (29)

Hence,

  Σ_{k=1}^N ε_k ≥ s_N (P(X^N) + D(Y^N)) − τ_1θ_1 P(x^0),

and we conclude

  G(X^N, Y^N) = P(X^N) + D(Y^N) ≤ (1/s_N)(½‖x^1 − x̂‖² + (1/(2β))‖y^1 − ŷ‖² + τ_1θ_1 P(x^0)).

Clearly, we have the same O(1/N) rate of convergence as in [4, 5, 9], though with the ergodic sequence (X^N, Y^N) defined in a different way.

Analysing the proofs of Theorems 1 and 2, the reader may notice that our argument does not rely on the proximal point interpretation of the PDA [12]. An obvious shortcoming of our approach is that deriving new extensions of the proposed method, like inertial or relaxed versions [5], still requires some nontrivial effort. It would be interesting to obtain a more general framework for the proposed method and its possible extensions.

It is well known that in many cases the speed of convergence of PDA crucially depends on the ratio between primal and dual steps, β = σ/τ. Motivated by this, paper [10] proposed an adaptive strategy for choosing β in every iteration. Although it is not the goal of this work to study strategies for defining β, we show that the analysis of PDAL allows us to incorporate such strategies in a very natural way.
Theorem 3. Let (β_k) ⊂ (β_min, β_max) with β_min, β_max > 0 be a monotone sequence, and let (x^k, y^k) be a sequence generated by PDAL with the variable parameter (β_k). Then the statement of Theorem 1 holds.

Proof. Let (β_k) be nondecreasing. Then, using 1/β_k ≤ 1/β_{k−1}, we get from (24) that

  ½‖x^{k+1} − x̂‖² + (1/(2β_k))‖y^{k+1} − ŷ‖² + τ_k(1+θ_k)P(x^k)
  ≤ ½‖x^k − x̂‖² + (1/(2β_{k−1}))‖y^k − ŷ‖² + τ_{k−1}(1+θ_{k−1})P(x^{k−1})
   − ½‖x̄^k − x^k‖² − ((1−δ²)/(2β_k))‖y^{k+1} − y^k‖².   (30)

Since (β_k) is bounded from above, the conclusion of Theorem 1 follows as before.

If (β_k) is decreasing, the above argument has to be modified. Consider (23), multiplied by β_k:

  (β_k/2)(‖x^k − x̂‖² − ‖x^{k+1} − x̂‖²) + ½(‖y^k − ŷ‖² − ‖y^{k+1} − ŷ‖²)
  − (β_k/2)‖x̄^k − x^k‖² − ((1−δ²)/2)‖y^{k+1} − y^k‖² ≥ β_kτ_k((1+θ_k)P(x^k) − θ_k P(x^{k−1})).   (31)

As β_k < β_{k−1}, we have θ_kβ_kτ_k ≤ (1+θ_{k−1})β_{k−1}τ_{k−1}, which in turn implies

  (β_k/2)‖x^{k+1} − x̂‖² + ½‖y^{k+1} − ŷ‖² + τ_kβ_k(1+θ_k)P(x^k)
  ≤ (β_{k−1}/2)‖x^k − x̂‖² + ½‖y^k − ŷ‖² + τ_{k−1}β_{k−1}(1+θ_{k−1})P(x^{k−1})
   − (β_k/2)‖x̄^k − x^k‖² − ((1−δ²)/2)‖y^{k+1} − y^k‖².   (32)

Due to the given properties of (β_k), the rest is straightforward.

It is natural to ask whether it is possible to use a nonmonotone (β_k). To answer this question, we can easily adopt the strategies from [11], where an ADMM with variable steps was proposed. One way is to use arbitrary β_k during a finite number of iterations and then switch to a monotone (β_k). Another way is to relax the monotonicity of (β_k) to the following: there exists (ρ_k) ⊂ R_+ such that

  Σ_k ρ_k < ∞  and  β_k ≤ β_{k−1}(1 + ρ_k) ∀k ∈ N  or  β_{k−1} ≤ β_k(1 + ρ_k) ∀k ∈ N.

We believe that it should be quite straightforward to prove convergence of PDAL with the latter strategy.
Acceleration

It was shown in [4] that in the case when g or f* is strongly convex, one can modify the primal-dual algorithm and derive a better convergence rate. We show that the same holds for PDAL. The main difference of the accelerated variant APDAL from the basic PDAL is that now we have to vary β in every iteration.

Of course, due to the symmetry of the primal and dual variables in (1), we can always assume that the primal objective is strongly convex. However, from the computational point of view it might not be desirable to exchange g and f* in PDAL because of Remark 2. Therefore, we discuss the two cases separately. Also notice that both accelerated algorithms below coincide with PDAL when the parameter of strong convexity is γ = 0.

g is strongly convex

Assume that g is γ-strongly convex, i.e.,

  g(x₂) − g(x₁) ≥ ⟨u, x₂ − x₁⟩ + (γ/2)‖x₂ − x₁‖²  ∀x₁, x₂ ∈ X, u ∈ ∂g(x₁).

Below we assume that the parameter γ is known. The following algorithm (APDAL) exploits the strong convexity of g. Note that, in contrast to PDAL, we set δ = 1, as in any case we will not be able to prove convergence of (y^k).

Algorithm 2: Accelerated primal-dual algorithm with linesearch (g is strongly convex)
Initialization:
Choose x^0 ∈ X, y^1 ∈ Y, µ ∈ (0,1), τ_0 > 0, and β_0 > 0. Set θ_0 = 1.
Main iteration:
1. Compute
    x^k = prox_{τ_{k−1} g}(x^{k−1} − τ_{k−1} K*y^k),
    β_k = β_{k−1}(1 + γτ_{k−1}).
2. Choose any τ_k ∈ [τ_{k−1}√(β_{k−1}/β_k), τ_{k−1}√((β_{k−1}/β_k)(1+θ_{k−1}))] and run
Linesearch:
  2.a. Compute
    θ_k = τ_k/τ_{k−1},   x̄^k = x^k + θ_k(x^k − x^{k−1}),
    y^{k+1} = prox_{β_kτ_k f*}(y^k + β_kτ_k K x̄^k)
  2.b. Break linesearch if
    √β_k τ_k ‖K*y^{k+1} − K*y^k‖ ≤ ‖y^{k+1} − y^k‖   (33)
  Otherwise, set τ_k := τ_k µ and go to 2.a.
End of linesearch
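For concreteness, here is a hedged sketch of how one APDAL iteration differs from the PDAL sketch given after Algorithm 1: only the β-update and the trial-step interval change, and δ is fixed to 1. The callables and default µ are the same placeholders as before.

import numpy as np

def apdal_step(x, y, tau_prev, theta_prev, beta_prev, gamma,
               K, Kt, prox_g, prox_fs, mu=0.7):
    """One iteration of Algorithm 2 (APDAL), assuming g is gamma-strongly convex."""
    # Step 1: primal update and the new ratio beta_k
    x_new = prox_g(x - tau_prev * Kt(y), tau_prev)
    beta = beta_prev * (1.0 + gamma * tau_prev)

    # Step 2: largest admissible trial step, then backtrack
    tau = tau_prev * np.sqrt(beta_prev / beta * (1.0 + theta_prev))
    Kty = Kt(y)
    while True:
        theta = tau / tau_prev
        x_bar = x_new + theta * (x_new - x)
        y_new = prox_fs(y + beta * tau * K(x_bar), beta * tau)
        # breaking condition (33), i.e. (8) with delta = 1
        if np.sqrt(beta) * tau * np.linalg.norm(Kt(y_new) - Kty) \
                <= np.linalg.norm(y_new - y):
            break
        tau *= mu
    return x_new, y_new, tau, theta, beta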
Instead of (11), one can now use the stronger inequality

  ⟨x^{k+1} − x^k + τ_k K*y^{k+1}, x̂ − x^{k+1}⟩ ≥ τ_k (g(x^{k+1}) − g(x̂) + (γ/2)‖x^{k+1} − x̂‖²).   (34)

In turn, (34) yields a stronger version of (22) (also with β_k instead of β and with δ = 1):

  ½(‖x^k − x̂‖² − ‖x^{k+1} − x̂‖²) + (1/(2β_k))(‖y^k − ŷ‖² − ‖y^{k+1} − ŷ‖²) − ½‖x̄^k − x^k‖²
  ≥ ε_k + (γτ_k/2)‖x^{k+1} − x̂‖²   (35)

or, alternatively,

  ½‖x^k − x̂‖² + (1/(2β_k))‖y^k − ŷ‖² − ½‖x̄^k − x^k‖²
  ≥ ε_k + ((1+γτ_k)/2)‖x^{k+1} − x̂‖² + (β_{k+1}/β_k)(1/(2β_{k+1}))‖y^{k+1} − ŷ‖².   (36)

Note that the algorithm provides that β_{k+1}/β_k = 1 + γτ_k. For brevity, let

  A_k = ½‖x^k − x̂‖² + (1/(2β_k))‖y^k − ŷ‖².

Then from (36) it follows that

  (β_{k+1}/β_k) A_{k+1} + ε_k ≤ A_k,  i.e.,  β_{k+1}A_{k+1} + β_kε_k ≤ β_kA_k.

Thus, summing the above from k = 1 to N, we get

  β_{N+1}A_{N+1} + Σ_{k=1}^N β_kε_k ≤ β_1A_1.   (37)

Using convexity in the same way as in (26)–(29), we obtain

  Σ_{k=1}^N β_kε_k = β_Nτ_N(1+θ_N)P(x^N) + Σ_{k=2}^N [(1+θ_{k−1})β_{k−1}τ_{k−1} − θ_kβ_kτ_k] P(x^{k−1})
   − θ_1β_1τ_1 P(x^0) + Σ_{k=1}^N β_kτ_k D(y^{k+1}) ≥ s_N (P(X^N) + D(Y^N)) − θ_1σ_1 P(x^0),   (38)

where

  σ_k = β_kτ_k,  s_N = Σ_{k=1}^N σ_k,  X^N = (σ_1θ_1 x^0 + Σ_{k=1}^N σ_k x̄^k)/(σ_1θ_1 + s_N),  Y^N = (Σ_{k=1}^N σ_k y^{k+1})/s_N.

Hence,

  β_{N+1}A_{N+1} + s_N G(X^N, Y^N) ≤ β_1A_1 + θ_1σ_1 P(x^0).   (39)

From this we deduce that the sequence (‖y^k − ŷ‖) is bounded and

  G(X^N, Y^N) ≤ (1/s_N)(β_1A_1 + θ_1σ_1 P(x^0)),
  ‖x^{N+1} − x̂‖² ≤ (2/β_{N+1})(β_1A_1 + θ_1σ_1 P(x^0)).

Our next goal is to derive the asymptotics of β_N and s_N. Obviously, (33) holds for any τ_k, β_k such that τ_k ≤ 1/(√β_k L). Since in each iteration of the linesearch we decrease τ_k by the factor µ, τ_k cannot become less than µ/(√β_k L). Hence, we have

  β_{k+1} = β_k(1 + γτ_k) ≥ β_k(1 + γµ/(L√β_k)) = β_k + (γµ/L)√β_k.   (40)

By induction, one can show that there exists C > 0 such that β_k ≥ Ck² for all k. Thus, for some (possibly different) constant C > 0,

  ‖x^{N+1} − x̂‖² ≤ C/(N+1)².

From (40) it follows that β_{k+1} − β_k ≥ (γµ/L)√C k. As σ_k = (β_{k+1} − β_k)/γ, we obtain σ_k ≥ (µ√C/L) k, and thus s_N = Σ_{k=1}^N σ_k grows at least as a multiple of N². This means that for some constant C > 0, G(X^N, Y^N) ≤ C/N². We have shown the following result:
Theorem 4.
Let (x^k, y^k) be a sequence generated by Algorithm 2. Then ‖x^N − x̂‖ = O(1/N) and G(X^N, Y^N) = O(1/N²).

f* is strongly convex

The case when f* is γ-strongly convex can be treated in a similar way.
Algorithm 3: Accelerated primal-dual algorithm with linesearch (f* is strongly convex)
Initialization:
Choose x^0 ∈ X, y^1 ∈ Y, µ ∈ (0,1), τ_0 > 0, and β_0 > 0. Set θ_0 = 1.
Main iteration:
1. Compute
    x^k = prox_{τ_{k−1} g}(x^{k−1} − τ_{k−1} K*y^k),
    β_k = β_{k−1}/(1 + γβ_{k−1}τ_{k−1}).
2. Choose any τ_k ∈ [τ_{k−1}, τ_{k−1}√(1+θ_{k−1})] and run
Linesearch:
  2.a. Compute
    θ_k = τ_k/τ_{k−1},   σ_k = β_kτ_k,   x̄^k = x^k + θ_k(x^k − x^{k−1}),
    y^{k+1} = prox_{σ_k f*}(y^k + σ_k K x̄^k)
  2.b. Break linesearch if
    √β_k τ_k ‖K*y^{k+1} − K*y^k‖ ≤ ‖y^{k+1} − y^k‖
  Otherwise, set τ_k := τ_k µ and go to 2.a.
End of linesearch
Note that again we set δ = 1. Dividing (22) by τ_k and taking into account the strong convexity of f*, which has to be used in (12), we deduce

  (1/(2τ_k))(‖x^k − x̂‖² − ‖x^{k+1} − x̂‖²) + (1/(2σ_k))(‖y^k − ŷ‖² − ‖y^{k+1} − ŷ‖²) − (1/(2τ_k))‖x̄^k − x^k‖²
  ≥ ε_k/τ_k + (γ/2)‖y^{k+1} − ŷ‖²,   (41)

which can be rewritten as

  (1/(2τ_k))‖x^k − x̂‖² + (1/(2σ_k))‖y^k − ŷ‖² − (1/(2τ_k))‖x̄^k − x^k‖²
  ≥ ε_k/τ_k + (τ_{k+1}/τ_k)(1/(2τ_{k+1}))‖x^{k+1} − x̂‖² + (σ_{k+1}/σ_k)(1+γσ_k)(1/(2σ_{k+1}))‖y^{k+1} − ŷ‖².   (42)

Note that, by construction of (β_k) in Algorithm 3, we have

  τ_{k+1}/τ_k = (σ_{k+1}/σ_k)(1 + γσ_k).

Let A_k = (1/(2τ_k))‖x^k − x̂‖² + (1/(2σ_k))‖y^k − ŷ‖². Then (42) is equivalent to

  (τ_{k+1}/τ_k) A_{k+1} + ε_k/τ_k ≤ A_k − (1/(2τ_k))‖x̄^k − x^k‖²,  i.e.,  τ_{k+1}A_{k+1} + ε_k ≤ τ_kA_k − ½‖x̄^k − x^k‖².

Finally, summing the above from k = 1 to N, we get

  τ_{N+1}A_{N+1} + Σ_{k=1}^N ε_k ≤ τ_1A_1 − Σ_{k=1}^N ½‖x̄^k − x^k‖².   (43)

From this we conclude that the sequence (x^k) is bounded, lim_{k→∞}‖x̄^k − x^k‖ = 0, and

  G(X^N, Y^N) ≤ (1/s_N)(τ_1A_1 + θ_1τ_1 P(x^0)),
  ‖y^{N+1} − ŷ‖² ≤ (2σ_{N+1}/τ_{N+1})(τ_1A_1 + θ_1τ_1 P(x^0)) = 2β_{N+1}(τ_1A_1 + θ_1τ_1 P(x^0)),

where X^N, Y^N, s_N are the same as in Theorem 2.

Let us turn to the derivation of the asymptotics of (τ_N) and (s_N). Analogously to the previous subsection, we have τ_k ≥ µ/(√β_k L), and thus

  β_{k+1} = β_k/(1 + γβ_kτ_k) ≤ β_k/(1 + γµ√β_k/L).

Again, it is not difficult to show by induction that β_k ≤ C/k² for some constant C > 0. In fact, as φ(β) = β/(1 + γµ√β/L) is increasing, it is sufficient to show that

  β_{k+1} ≤ β_k/(1 + γµ√β_k/L) ≤ (C/k²)/(1 + (γµ/L)√(C/k²)) ≤ C/(k+1)².

The latter inequality is equivalent to √C ≥ (2 + 1/k) L/(γµ), which obviously holds for C large enough (of course, C must also satisfy the induction basis).

The obtained asymptotics for (β_k) yields τ_k ≥ µ/(√β_k L) ≥ µk/(√C L), from which we deduce s_N = Σ_{k=1}^N τ_k ≥ (µ/(√C L)) Σ_{k=1}^N k ≥ (µ/(2√C L)) N². Finally, we obtain the following result:

Theorem 5.
Let (x^k, y^k) be a sequence generated by Algorithm 3. Then ‖y^N − ŷ‖ = O(1/N) and G(X^N, Y^N) = O(1/N²).

Remark 3.
In the case when both g and f* are strongly convex, one can derive a new algorithm combining the ideas of Algorithms 2 and 3 (see [4] for more details).

A more general problem
In this section we show how to apply the linesearch procedure to the more general problem

  min_{x∈X} max_{y∈Y} ⟨Kx, y⟩ + g(x) − f*(y) − h(y),   (44)

where, in addition to the previous assumptions, we suppose that h : Y → R is a smooth convex function with an L_h-Lipschitz-continuous gradient ∇h. Using the idea of [12], Condat and Vũ [7, 18] proposed an extension of the primal-dual method to solve (44):

  y^{k+1} = prox_{σf*}(y^k + σ(Kx̄^k − ∇h(y^k)))
  x^{k+1} = prox_{τg}(x^k − τK*y^{k+1})
  x̄^{k+1} = 2x^{k+1} − x^k.

This scheme was proved to converge under the condition τσ‖K‖² ≤ 1 − σL_h. Originally, the smooth function was added to the primal part and not to the dual part as in (44). For us it is more convenient to consider precisely this form due to the nonsymmetry of the proposed linesearch procedure. However, simply exchanging max and min in (44), we recover the form which was considered in [7].

In addition to the issues related to the operator norm of K, which motivated us to derive the PDAL, here we also have to know the Lipschitz constant L_h of ∇h. This has several drawbacks. First, its computation might be expensive. Second, our estimate of L_h might be very conservative and will result in smaller steps. Third, using local information about h instead of the global constant L_h often allows the use of larger steps. Therefore, the introduction of a linesearch into the algorithm above is of great practical interest.

The algorithm below (Algorithm 4) exploits the same idea as Algorithm 1 does. However, its stopping criterion is more involved: the interested reader may identify it as a combination of the stopping criterion (8) and the descent lemma for the smooth function h. Note that in the case h ≡ 0, Algorithm 4 corresponds exactly to Algorithm 1.

Algorithm 4: General primal-dual algorithm with linesearch
Initialization:
Choose x^0 ∈ X, y^1 ∈ Y, τ_0 > 0, µ ∈ (0,1), δ ∈ (0,1), and β > 0. Set θ_0 = 1.
Main iteration:
1. Compute x^k = prox_{τ_{k−1} g}(x^{k−1} − τ_{k−1} K*y^k).
2. Choose any τ_k ∈ [τ_{k−1}, τ_{k−1}√(1+θ_{k−1})] and run
Linesearch:
  2.a. Compute
    θ_k = τ_k/τ_{k−1},   σ_k = βτ_k,   x̄^k = x^k + θ_k(x^k − x^{k−1}),
    y^{k+1} = prox_{σ_k f*}(y^k + σ_k(K x̄^k − ∇h(y^k)))
  2.b. Break linesearch if
    τ_kσ_k ‖K*y^{k+1} − K*y^k‖² + 2σ_k [h(y^{k+1}) − h(y^k) − ⟨∇h(y^k), y^{k+1} − y^k⟩] ≤ δ ‖y^{k+1} − y^k‖²   (45)
  Otherwise, set τ_k := τ_k µ and go to 2.a.
End of linesearch
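To make the test (45) concrete, here is a hedged Python sketch of the breaking condition for a given trial step; h, grad_h_y and the cached products are placeholders introduced only for this illustration. Note that the gradient ∇h(y^k) is fixed during the linesearch, so each trial requires one extra evaluation of h and one of K*.

import numpy as np

def alg4_break_condition(tau, beta, y, y_new, Kty, Kty_new, h, grad_h_y, delta=0.99):
    """Stopping test (45) of Algorithm 4 for a trial step tau.

    Kty, Kty_new are K^T y^k and K^T y^{k+1}; grad_h_y is the gradient of h at y^k.
    """
    sigma = beta * tau
    # descent-lemma (Bregman) term of the smooth function h
    bregman_h = h(y_new) - h(y) - np.dot(grad_h_y, y_new - y)
    lhs = tau * sigma * np.linalg.norm(Kty_new - Kty) ** 2 + 2.0 * sigma * bregman_h
    return lhs <= delta * np.linalg.norm(y_new - y) ** 2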
We briefly sketch the proof of convergence. Let (x̂, ŷ) be any saddle point of (44). Similarly to (11), (12), and (15), we get

  ⟨x^{k+1} − x^k + τ_k K*y^{k+1}, x̂ − x^{k+1}⟩ ≥ τ_k (g(x^{k+1}) − g(x̂)),
  ⟨(1/β)(y^{k+1} − y^k) − τ_k Kx̄^k + τ_k∇h(y^k), ŷ − y^{k+1}⟩ ≥ τ_k (f*(y^{k+1}) − f*(ŷ)),
  ⟨θ_k(x^k − x^{k−1}) + τ_k K*y^k, x^{k+1} − x̄^k⟩ ≥ τ_k ((1+θ_k)g(x^k) − g(x^{k+1}) − θ_k g(x^{k−1})).

Summation of the three inequalities above and identity (16) yields

  ⟨x^{k+1} − x^k, x̂ − x^{k+1}⟩ + (1/β)⟨y^{k+1} − y^k, ŷ − y^{k+1}⟩ + θ_k⟨x^k − x^{k−1}, x^{k+1} − x̄^k⟩
  + τ_k⟨K*y^{k+1} − K*y^k, x̄^k − x^{k+1}⟩ − τ_k⟨K*ŷ, x̄^k − x̂⟩ + τ_k⟨Kx̂, y^{k+1} − ŷ⟩ + τ_k⟨∇h(y^k), ŷ − y^{k+1}⟩
  ≥ τ_k (f*(y^{k+1}) − f*(ŷ) + (1+θ_k)g(x^k) − θ_k g(x^{k−1}) − g(x̂)).   (46)

By convexity of h, we have τ_k(h(ŷ) − h(y^k) − ⟨∇h(y^k), ŷ − y^k⟩) ≥ 0. Combining this with inequality (45), divided by 2β, we get

  (δ/(2β))‖y^{k+1} − y^k‖² − (τ_k²/2)‖K*y^{k+1} − K*y^k‖² ≥ τ_k [h(y^{k+1}) − h(ŷ) − ⟨∇h(y^k), y^{k+1} − ŷ⟩].

Adding the above inequality to (46) gives us

  ⟨x^{k+1} − x^k, x̂ − x^{k+1}⟩ + (1/β)⟨y^{k+1} − y^k, ŷ − y^{k+1}⟩ + θ_k⟨x^k − x^{k−1}, x^{k+1} − x̄^k⟩
  + τ_k⟨K*y^{k+1} − K*y^k, x̄^k − x^{k+1}⟩ − τ_k⟨K*ŷ, x̄^k − x̂⟩ + τ_k⟨Kx̂, y^{k+1} − ŷ⟩
  + (δ/(2β))‖y^{k+1} − y^k‖² − (τ_k²/2)‖K*y^{k+1} − K*y^k‖²
  ≥ τ_k ((f* + h)(y^{k+1}) − (f* + h)(ŷ) + (1+θ_k)g(x^k) − θ_k g(x^{k−1}) − g(x̂)).   (47)

Note that for problem (44), instead of (7) we have to use

  D_{x̂,ŷ}(y) := f*(y) + h(y) − f*(ŷ) − h(ŷ) − ⟨Kx̂, y − ŷ⟩ ≥ 0  ∀y ∈ Y,   (48)

which is true by the definition of (x̂, ŷ). Now we can rewrite (47) as

  ⟨x^{k+1} − x^k, x̂ − x^{k+1}⟩ + (1/β)⟨y^{k+1} − y^k, ŷ − y^{k+1}⟩ + θ_k⟨x^k − x^{k−1}, x^{k+1} − x̄^k⟩
  + τ_k⟨K*y^{k+1} − K*y^k, x̄^k − x^{k+1}⟩ + (δ/(2β))‖y^{k+1} − y^k‖² − (τ_k²/2)‖K*y^{k+1} − K*y^k‖²
  ≥ τ_k ((1+θ_k)P(x^k) − θ_k P(x^{k−1}) + D(y^{k+1})).   (49)

Applying the cosine rule to all inner products in the first line of (49), estimating the remaining inner product by the Cauchy–Schwarz inequality, and using that θ_k(x^k − x^{k−1}) = x̄^k − x^k, we obtain

  ½(‖x^k − x̂‖² − ‖x^{k+1} − x̂‖² − ‖x^{k+1} − x^k‖²)
  + (1/(2β))(‖y^k − ŷ‖² − ‖y^{k+1} − ŷ‖² − ‖y^{k+1} − y^k‖²)
  + ½(‖x^{k+1} − x^k‖² − ‖x̄^k − x^k‖² − ‖x^{k+1} − x̄^k‖²)
  + τ_k‖K*y^{k+1} − K*y^k‖‖x^{k+1} − x̄^k‖ + (δ/(2β))‖y^{k+1} − y^k‖² − (τ_k²/2)‖K*y^{k+1} − K*y^k‖²
  ≥ τ_k ((1+θ_k)P(x^k) − θ_k P(x^{k−1}) + D(y^{k+1})).   (50)

Finally, applying Cauchy's inequality in (50) and using that τ_kθ_k ≤ τ_{k−1}(1+θ_{k−1}), we get

  ½‖x^{k+1} − x̂‖² + (1/(2β))‖y^{k+1} − ŷ‖² + τ_k(1+θ_k)P(x^k)
  ≤ ½‖x^k − x̂‖² + (1/(2β))‖y^k − ŷ‖² + τ_{k−1}(1+θ_{k−1})P(x^{k−1})
   − ½‖x̄^k − x^k‖² − ((1−δ)/(2β))‖y^{k+1} − y^k‖²,   (51)

from which the convergence of (x^k) and (y^k) to a saddle point of (44) can be derived in a similar way as in Theorem 1.

Numerical experiments

This section collects several numerical tests that illustrate the performance of the proposed methods. Computations were performed using Python 3 on an Intel Core i3-2350M CPU 2.30GHz running 64-bit Debian Linux 8.7.

For PDAL and APDAL we initialize the input data as µ = 0.…, δ = 0.…, and τ_0 = √(min{m, n})/‖A‖_F. The latter is easy to compute, and it is an upper bound of 1/‖A‖. The parameter β for PDAL is always taken as σ/τ from PDA with fixed steps σ and τ. The trial step τ_k in step 2 is always chosen as τ_k = τ_{k−1}√(1+θ_{k−1}). Codes can be found at https://github.com/ymalitsky/primal-dual-linesearch.

Matrix game

We are interested in the following min-max matrix game:

  min_{x∈Δ_n} max_{y∈Δ_m} ⟨Ax, y⟩,   (52)

where x ∈ R^n, y ∈ R^m, A ∈ R^{m×n}, and Δ_m, Δ_n denote the standard unit simplices in R^m and R^n, respectively.

For this problem we study the performance of PDA, PDAL (Algorithm 1), Tseng's FBF method [17], and PEGM [15]. For comparison we use the primal-dual gap G(x, y), which can be easily computed for a feasible pair (x, y) as G(x, y) = max_i (Ax)_i − min_j (A*y)_j. Since iterates obtained by Tseng's method may be infeasible, we used an auxiliary point (see [17]) to compute the primal-dual gap.

The initial point in all cases was chosen as x^0 = (1/n)(1, …, 1) and y^0 = (1/m)(1, …, 1). For PDA we use τ = σ = 1/‖A‖ = 1/√λ_max(A*A), which we compute in advance. The input data for FBF and PEGM are the same as in [15]. Note that these methods also use a linesearch.

We consider four differently generated samples of the matrix A ∈ R^{m×n}:
1. m = n = 100. All entries of A are generated independently from the uniform distribution in [−1, 1].
2. m = n = 100. All entries of A are generated independently from the normal distribution N(0, 1).
3. m = 500, n = 100. All entries of A are generated independently from the normal distribution N(0, 1).
4. m = 1000, n = 2000. The matrix A is sparse with 10% nonzero elements, generated independently from the uniform distribution in [0, 1].

For every example we plot the primal-dual gap G(x^k, y^k), computed in every iteration, vs. CPU time. The results are presented in Figure 1; a minimal sketch of the gap computation used for these plots is given below.

[Figure 1: Convergence plots for problem (52). Panels (a)–(d) show the primal-dual gap G(x^k, y^k) vs. CPU time (seconds) for PDA, PDAL, FBF, and PEGM on examples 1–4.]
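The following small Python helper illustrates the primal-dual gap computation for (52); it is a straightforward dense-numpy sketch for illustration, not the exact evaluation script from the repository.

import numpy as np

def pd_gap(A, x, y):
    """Primal-dual gap for the matrix game (52):
    G(x, y) = max_i (Ax)_i - min_j (A^T y)_j for feasible x, y on the simplices."""
    return np.max(A @ x) - np.min(A.T @ y)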
ℓ1-regularized least squares

We study the following ℓ1-regularized problem:

  min_x φ(x) := ½‖Ax − b‖² + λ‖x‖_1,   (53)

where A ∈ R^{m×n}, x ∈ R^n, b ∈ R^m. Let g(x) = λ‖x‖_1 and f(p) = ½‖p − b‖². Analogously to (9) and (10), we can rewrite (53) as

  min_x max_y g(x) + ⟨Ax, y⟩ − f*(y),

where f*(y) = ½‖y‖² + (b, y) = ½‖y + b‖² − ½‖b‖². Clearly, the last term does not have any impact on the prox-term, and we can conclude that prox_{λf*}(y) = (y − λb)/(1 + λ). This means that the linesearch in Algorithm 1 does not require any additional matrix-vector multiplications (see Remark 2).

We generate four instances of problem (53), on which we compare the performance of PDA, PDAL, APDA (accelerated primal-dual algorithm), APDAL, FISTA [3], and SpaRSA [19]. The latter method is a variant of the proximal gradient method with an adaptive linesearch. All methods except PDAL and SpaRSA require predefined step sizes. For this we compute in advance ‖A‖ = √λ_max(A*A). For all instances below we use the following parameters:
• PDA: σ = 1/‖A‖, τ = 1/‖A‖;
• PDAL: β = 1/…;
• APDA, APDAL: β = 1, γ = 0.…;
• FISTA: α = 1/‖A‖²;
• SpaRSA: in the first iteration we run a standard Armijo linesearch procedure to define α, and then we run SpaRSA with the parameters described in [19]: M = 5, σ = 0.…, α_max = 1/α_min = 10….

For all cases below we generate a random w ∈ R^n in which s random coordinates are drawn from the uniform distribution in [−10, 10] and the rest are zeros. Then we generate ν ∈ R^m with entries drawn from N(0, 0.1) and set b = Aw + ν. The parameter λ = 0.…. The initial points are x^0 = (0, …, 0) and y^0 = Ax^0 − b.

The matrix A ∈ R^{m×n} is constructed in one of the following ways:
1. n = 1000, m = 200, s = 10. All entries of A are generated independently from N(0, 1).
2. n = 2000, m = 1000, s = 100. All entries of A are generated independently from N(0, 1).
3. n = 5000, m = 1000, s = 50. First, we generate a matrix B with entries from N(0, 1). Then for p ∈ (0, 1) we construct the matrix A by columns A_j, j = 1, …, n, as A_1 = B_1/√(1 − p²), A_j = p·A_{j−1} + B_j. As p increases, A becomes more ill-conditioned (see [1], where this example was considered).
4. The same construction as in example 3, with a larger value of p. In this experiment we take p = 0.… and p = 0.… for examples 3 and 4, respectively.

In Figure 2 we plot φ(x^k) − φ* vs. CPU time. Since in fact the value φ* is unknown, we instead run our algorithms for sufficiently many iterations to obtain the ground truth solution x*. Then we simply set φ* = φ(x*).

[Figure 2: Convergence plots for problem (53). Panels (a)–(d) show φ(x^k) − φ* vs. CPU time for PDA, PDAL, APDA, APDAL, FISTA, and SpaRSA on examples 1–4.]

Next, we consider another regularized least-squares problem:

  min_{x≥0} φ(x) := ½‖Ax − b‖²,   (54)

where A ∈ R^{m×n}, x ∈ R^n, b ∈ R^m. Similarly as before, we can express it as

  min_x max_y g(x) + ⟨Ax, y⟩ − f*(y),   (55)

where g(x) = δ_{R^n_+}(x) and f*(y) = ½‖y + b‖². For all cases below we generate a random matrix A ∈ R^{m×n} with density d ∈ (0, 1]. In order to have φ* = 0, we always generate w as a sparse vector in R^n whose s nonzero entries are drawn from the uniform distribution in [0, …], and set b = Aw.

We test the performance of the same algorithms as in the previous example. For FISTA and SpaRSA we use the same parameters as before. For every instance of the problem, PDA and PDAL use the same β. For APDA and APDAL we always set β = 1 and γ = 0.1. The initial points are x^0 = (0, …, 0) and y^0 = Ax^0 − b = −b.

We generate our data as follows:
1. m = 2000, n = 4000, d = 1, s = 1000; the entries of A are generated independently from the uniform distribution in [−1, 1]; β = 25.
2. m = 1000, n = 2000, d = 0.…, s = 100; the nonzero entries of A are generated independently from the uniform distribution in [0, 1]; β = 25.
3. m = 3000, n = 5000, d = 0.…, s = 100; the nonzero entries of A are generated independently from the uniform distribution in [0, 1]; β = 25.
4. m = 10000, n = 20000, d = 0.…, s = 500; the nonzero entries of A are generated independently from the normal distribution N(0, 1); β = 1.

As (54) is again just a regularized least-squares problem, the linesearch does not require any additional matrix-vector multiplications. The results, with φ(x^k) − φ* vs. CPU time, are presented in Figure 3.

[Figure 3: Convergence plots for problem (54). Panels (a)–(d) show φ(x^k) − φ* vs. CPU time for PDA, PDAL, APDA, APDAL, FISTA, and SpaRSA on examples 1–4.]

Although primal-dual methods may converge faster, they require tuning β = σ/τ. It is also interesting to highlight that sometimes non-accelerated methods with a properly chosen ratio σ/τ can be faster than their accelerated variants.

Conclusion

In this work we have presented several primal-dual algorithms with linesearch. On the one hand, this allows us to avoid the evaluation of the operator norm; on the other hand, it allows us to make larger steps. The proposed linesearch is very simple, and in many important cases it does not require any additional expensive operations (such as matrix-vector multiplications or prox-operators). For all methods we have proved convergence. Our experiments confirm the numerical efficiency of the proposed methods.
Acknowledgements:
The work is supported by the Austrian Science Fund (FWF) under the project "Efficient Algorithms for Nonsmooth Optimization in Imaging" (EANOI) No. I1148. The authors would also like to thank the referees and the SIOPT editor Wotao Yin for their careful reading of the manuscript and their numerous helpful comments.
References

[1] A. Agarwal, S. Negahban, and M. J. Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems, pages 37–45, 2010.
[2] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York, 2011.
[3] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
[4] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis., 40(1):120–145, 2011.
[5] A. Chambolle and T. Pock. On the ergodic convergence rates of a first-order primal-dual algorithm. Math. Program., pages 1–35, 2015.
[6] A. Chambolle and T. Pock. An introduction to continuous optimization for imaging. Acta Numerica, 25:161–319, 2016.
[7] L. Condat. A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. J. Optim. Theory Appl., 158(2):460–479, 2013.
[8] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279, 2008.
[9] T. Goldstein, M. Li, and X. Yuan. Adaptive primal-dual splitting methods for statistical learning and image processing. In Advances in Neural Information Processing Systems, pages 2089–2097, 2015.
[10] T. Goldstein, M. Li, X. Yuan, E. Esser, and R. Baraniuk. Adaptive primal-dual hybrid gradient methods for saddle-point problems. arXiv preprint arXiv:1305.0546, 2013.
[11] B. He, H. Yang, and S. Wang. Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. J. Optim. Theory Appl., 106(2):337–356, 2000.
[12] B. He and X. Yuan. Convergence analysis of primal-dual algorithms for a saddle-point problem: from contraction perspective. SIAM J. Imaging Sci., 5(1):119–149, 2012.
[13] N. Komodakis and J.-C. Pesquet. Playing with duality: an overview of recent primal-dual approaches for solving large-scale optimization problems. IEEE Signal Process. Mag., 32(6):31–54, 2015.
[14] Y. Malitsky. Projected reflected gradient methods for monotone variational inequalities. SIAM J. Optim., 25(1):502–520, 2015.
[15] Y. Malitsky. Proximal extrapolated gradient methods for variational inequalities. Optimization Methods and Software, 33(1):140–164, 2018.
[16] T. Pock and A. Chambolle. Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In 2011 IEEE International Conference on Computer Vision (ICCV), pages 1762–1769. IEEE, 2011.
[17] P. Tseng. A modified forward-backward splitting method for maximal monotone mappings. SIAM J. Control Optim., 38:431–446, 2000.
[18] B. C. Vũ. A splitting algorithm for dual monotone inclusions involving cocoercive operators. Advances in Computational Mathematics, 38(3):667–681, 2013.
[19] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Trans. Signal Process., 57(7):2479–2493, 2009.