Primal-dual proximal splitting and generalized conjugation in non-smooth non-convex optimization
arXiv: 1901.02746v4, date: 2020-03-19. © the authors, CC-BY-SA 4.0.
Christian Clason∗  Stanislav Mazurenko†  Tuomo Valkonen‡

Abstract
We demonstrate that difficult non-convex non-smooth optimization problems, such as Nash equilibrium problems and anisotropic as well as isotropic Potts segmentation models, can be written in terms of generalized conjugates of convex functionals. These, in turn, can be formulated as saddle-point problems involving convex non-smooth functionals and a general smooth but non-bilinear coupling term. We then show through detailed convergence analysis that a conceptually straightforward extension of the primal–dual proximal splitting method of Chambolle and Pock is applicable to the solution of such problems. Under sufficient local strong convexity assumptions on the functionals – but still with a non-bilinear coupling term – we even demonstrate local linear convergence of the method. We illustrate these theoretical results numerically on the aforementioned example problems.
This work is concerned with the numerical solution of non-smooth non-convex saddle-point problems of the form
$$(1.1)\qquad \min_{x \in X} \max_{y \in Y}\ G(x) + K(x, y) - F^*(y),$$
where $G: X \to \overline{\mathbb{R}}$ and $F^*: Y \to \overline{\mathbb{R}}$ are (possibly non-smooth) proper, convex and lower semicontinuous functionals on Hilbert spaces $X$ and $Y$, and $K: X \times Y \to \mathbb{R}$ is smooth but may be non-convex-concave. Such problems arise in many areas of optimal control, inverse problems, and imaging; we will treat two specific examples below. To find a critical point for (1.1), we propose the generalized primal–dual proximal splitting (GPDPS) method:

Algorithm 1.1 (GPDPS).
Given a starting point $(x^0, y^0)$ and step lengths $\tau_i, \omega_i, \sigma_{i+1} > 0$, iterate:
$$x^{i+1} := \operatorname{prox}_{\tau_i G}\bigl(x^i - \tau_i K_x(x^i, y^i)\bigr),$$
$$\bar{x}^{i+1} := x^{i+1} + \omega_i (x^{i+1} - x^i),$$
$$y^{i+1} := \operatorname{prox}_{\sigma_{i+1} F^*}\bigl(y^i + \sigma_{i+1} K_y(\bar{x}^{i+1}, y^i)\bigr),$$
where $\operatorname{prox}_{\tau_i G}(v) = (I + \tau_i \partial G)^{-1}(v)$ is the proximal mapping for $G$, and $K_x$, $K_y$ are the partial Fréchet derivatives of $K$ with respect to $x$ and $y$.

∗ Faculty of Mathematics, University Duisburg-Essen, 45117 Essen, Germany ([email protected], ORCID: 0000-0002-9948-8426)
† Loschmidt Laboratories, Masaryk University, Brno, Czechia; previously Department of Mathematical Sciences, University of Liverpool, United Kingdom ([email protected], ORCID: 0000-0003-3659-4819)
‡ ModeMat, Escuela Politécnica Nacional, Quito, Ecuador and Department of Mathematics and Statistics, University of Helsinki, Finland; previously Department of Mathematical Sciences, University of Liverpool, United Kingdom ([email protected], ORCID: 0000-0001-6683-3572)

A main result of this work is that under suitable conditions on the step length parameters $\tau_i$, $\sigma_i$, and $\omega_i$, this algorithm converges weakly to a critical point of (1.1); see Theorem 6.1. Furthermore, if $\partial G$ and/or $\partial F^*$ is strongly metrically subregular at the saddle point (in particular, if $G$ and/or $F^*$ are strongly convex), we show optimal convergence rates for the standard acceleration strategies; see Theorems 6.3 and 6.4.

In addition, we demonstrate in this work how, through a suitable reformulation, this method can be applied to the following two non-trivial applications:

(i) elliptic Nash equilibrium problems, where $K(x, y)$ is the so-called Nikaido–Isoda function encoding the Nash equilibrium [29, 25, 38]; see Section 2.1 for details.

(ii) (Huber-regularized) $\ell^0$-TV denoising (also referred to as the Potts model) [18, 33, 34], where $K(x, y)$ is used to express the non-convex Potts functional as the generalized $K$-conjugate of a convex indicator function; see Section 2.2 for details.

In particular, the second example demonstrates how the proposed method can be used to solve (some) non-convex non-smooth problems by reformulating them in terms of a convex but non-smooth functional and a smooth but non-convex coupling term. (We stress, however, that we do not claim that this approach is superior to state-of-the-art problem-specific approaches such as the ones mentioned in the cited works for the specific problems; such an investigation is left for the future.)

Related literature.
Our approach is obviously motivated by the well-known primal–dual proximal splitting (PDPS) method of Chambolle and Pock [8] for convex optimization problems of the form $\min_{x \in X} F(Ax) + G(x)$ for $F: Y \to \overline{\mathbb{R}}$ proper, convex, and lower semicontinuous and $A: X \to Y$ linear. The method is based on the equivalent reformulation as the saddle-point problem
$$(1.2)\qquad \min_{x \in X} \max_{y \in Y}\ G(x) + \langle Ax, y \rangle - F^*(y),$$
where $F^*$ is the Fenchel conjugate of $F$. Several other alternative techniques for such optimization problems have also been developed, e.g., using smoothing schemes [28] or a proximal alternating predictor corrector [13]. This approach was extended to allow for nonlinear but Fréchet differentiable $A$ in [35]. Later work [12, 10] applied this to non-convex PDE-constrained optimization problems and derived accelerated variants.

In a broader context, generalized convex conjugation has been studied for many decades with applications in economics; see, e.g., [26, 32, 15] and the references therein. Algorithms for the solution of general saddle-point problems $\min_x \max_y f(x, y)$ have been considered in several seminal papers. In particular, a prox-type method was suggested in [27] for $C^{1,1}$ convex–concave functions, yielding an $O(1/N)$ rate of convergence for an ergodic version of the gap $\max_{y' \in Y} f(x, y') - \min_{x' \in X} f(x', y)$. These results were further extended to allow non-smooth functions in the Mirror Descent method [22], demonstrating an $O(1/\sqrt{N})$ rate of convergence for the ergodic gap, although with a vanishing step size for large $N$. The authors also considered an acceleration of the Mirror Proximal method for the case when the gradient map of $f$ can be split into a Lipschitz-continuous part and a monotone operator [23]. The latter was assumed "simple" in the sense that a solution to a specific variational inequality could be found relatively efficiently.
As a result, the authors obtained an $O(1/N)$ rate of convergence with a possibility for improvement to $O(1/N^2)$ for a strongly concave $f$. Finally, the reformulation of (1.1) with a bilinear $K$ as a monotone inclusion problem was considered in [21]. Algorithms applicable to (1.1) with a genuinely nonlinear $K$ have only started to appear in the literature relatively recently. An abstract convergence result was obtained for an inexact regularized Gauss–Seidel method in [3]. In [20], the authors considered saddle-point representable functions and arrived at a very similar structure to (1.1); specifically, they reformulated this problem as a smooth linearly-constrained saddle point problem by moving the non-smooth terms into the problem domain and applied the Mirror Proximal algorithm mentioned earlier, with a smooth cost function and the $O(1/N)$ convergence rate [27]. Following [21], Kolossoski and Monteiro [24] developed a non-Euclidean hybrid proximal extragradient method with Bregman distances for $G$ and $F^*$, and a general convex–concave $K$. The case of a general convex–concave $K$ in (1.1) (which therefore becomes an overall convex–concave problem) has been recently studied in [19]. Besides being restricted to convex–concave problems, their algorithm differs from Algorithm 1.1 in applying the overrelaxation to $K_y(x^{i+1}, y^i)$ instead of to $x^{i+1}$ in the third step. Finally, problems for general sufficiently smooth $K(x, y)$ were considered in [5] in conjunction with a variant of ADMM; however, no proofs of convergence were given in the general case.

Organization.
To motivate our approach, we start with a more detailed description of the above-mentioned example problems and their reformulation as a saddle-point problem of the form (1.1) in the next Section 2. (This section can be skipped by readers only interested in the convergence analysis for the general Algorithm 1.1.) The following Section 3 then collects basic notation and definitions as well as the fundamental assumptions that will be used throughout the following. We then study the convergence and convergence rates of Algorithm 1.1 in Sections 4 to 6. More precisely, in Section 4 we derive a basic convergence estimate using the "testing" framework introduced in [36, 37] for the study of preconditioned proximal point methods. The results and assumptions depend on the iterates staying in a local neighborhood of a solution. In Section 5 we therefore derive conditions on the step length parameters and initial iterate that ensure that the iterates do not escape from a local neighborhood. Afterwards, we provide in Section 6 exact step length rules for Algorithm 1.1 together with respective weak convergence or convergence rate results: linear under sufficient strong convexity of $G$ and $F^*$, and "accelerated" $O(1/N)$ or $O(1/N^2)$ rates with somewhat lesser assumptions. Finally, we illustrate the applicability and performance of the proposed approach applied to our two example problems in Section 7. Appendices A to C contain further technical results on the assumptions required for convergence, in particular verifying them for the Huber-regularized $\ell^0$-TV denoising example.

Before we begin our analysis of the convergence of Algorithm 1.1, we motivate its generality by discussing two examples of practically relevant problems that can be cast in the form (1.1) and which will be used to numerically illustrate the behavior of the algorithm in Section 7.
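In compact form, one sweep of Algorithm 1.1 with constant step lengths can be sketched in a few lines of NumPy. This is a hypothetical minimal implementation of ours (the function names and the bilinear test problem below are not from the paper); for a bilinear $K(x, y) = \langle Ax, y \rangle$ it reduces to the classical PDPS iteration:

```python
import numpy as np

def gpdps(prox_G, prox_Fstar, K_x, K_y, x, y, tau, sigma, omega=1.0, iters=1000):
    """Sketch of Algorithm 1.1 (GPDPS) with constant step lengths tau, sigma, omega."""
    for _ in range(iters):
        x_next = prox_G(x - tau * K_x(x, y), tau)         # primal proximal step
        x_bar = x_next + omega * (x_next - x)             # over-relaxation
        y = prox_Fstar(y + sigma * K_y(x_bar, y), sigma)  # dual proximal step
        x = x_next
    return x, y

# Bilinear toy instance: K(x, y) = <Ax, y>, G(x) = 0.5*||x - b||^2, F*(y) = 0.5*||y||^2,
# whose unique saddle point satisfies x = (I + A^T A)^{-1} b and y = A x.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4)) / 3
b = rng.standard_normal(4)
x, y = gpdps(
    prox_G=lambda v, t: (v + t * b) / (1 + t),   # prox of 0.5*||. - b||^2
    prox_Fstar=lambda v, s: v / (1 + s),         # prox of 0.5*||.||^2
    K_x=lambda x, y: A.T @ y,
    K_y=lambda x, y: A @ x,
    x=np.zeros(4), y=np.zeros(5), tau=0.5, sigma=0.5,
)
x_exact = np.linalg.solve(np.eye(4) + A.T @ A, b)
```

With both $G$ and $F^*$ strongly convex in this toy instance, the iterates converge linearly to the exact saddle point, consistent with the rates discussed above.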
The idea in each case is to write a non-convex functional $F$ as the generalized $K$-conjugate of a convex functional $F^*$, i.e.,
$$F(x) = \sup_{y \in Y}\ K(x, y) - F^*(y)$$
for a suitable $K$ (depending on $F$).

Our first example is the reformulation of Nash equilibrium problems using the Nikaido–Isoda function following [38]. Consider a non-cooperative game of $n \in \mathbb{N}$ players, each of which has a strategy $x_k \in X_k \subset \mathbb{R}$ and a payout function $\varphi_k: \mathbb{R}^n \to \mathbb{R}$. For convenience, we introduce the vector $x \in \mathbb{R}^n$ of strategies and the notation
$$(x_{-k} \,|\, z) := (x_1, \ldots, x_{k-1}, z, x_{k+1}, \ldots, x_n) \qquad (1 \le k \le n,\ z \in \mathbb{R})$$
for the vector where player $k$ changes their strategy $x_k$ to $z$. We also set $X := X_1 \times \cdots \times X_n$. A vector $x^* \in X$ of strategies is then a Nash equilibrium if
$$(2.1)\qquad \varphi_k(x^*) = \varphi_k(x^*_{-k} \,|\, x^*_k) = \min_{z \in X_k} \varphi_k(x^*_{-k} \,|\, z) \qquad (1 \le k \le n).$$
We now introduce the Nikaido–Isoda function [29] (also called the Ky Fan function [17])
$$\Psi(x, y) = \sum_{k=1}^n \bigl(\varphi_k(x_{-k} \,|\, x_k) - \varphi_k(x_{-k} \,|\, y_k)\bigr) \qquad (x, y \in X)$$
as well as the optimum response function
$$(2.2)\qquad V(x) = \max_{y \in X} \Psi(x, y) \qquad (x \in X).$$
It follows from [38, Thm. 2.2] that $x^* \in X$ is a Nash equilibrium if and only if it is a minimizer of $V$. Using the indicator function of the set $X \subset \mathbb{R}^n$ defined by
$$\delta_X(x) = \begin{cases} 0 & \text{if } x \in X, \\ \infty & \text{if } x \notin X, \end{cases}$$
we see that the generally non-convex response function $V$ is the $\Psi$-preconjugate of the convex functional $\delta_X$ and can characterize a Nash equilibrium $x^* \in X$ as the solution to the saddle-point problem
$$\min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}^n}\ \delta_X(x) + \Psi(x, y) - \delta_X(y).$$
We can therefore solve the Nash equilibrium problem (2.1) by applying Algorithm 1.1 to
$$K(x, y) = \Psi(x, y), \qquad F^* = G = \delta_X.$$
In Section 7.1, we illustrate this exemplarily for the two-player elliptic Nash equilibrium problem from [6].
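As a small numerical illustration of the Nikaido–Isoda construction (the quadratic payouts below are our own hypothetical choice, not the elliptic problem from [6]), one can check that the optimum response function $V$ vanishes exactly at a Nash equilibrium:

```python
import numpy as np

# Hypothetical two-player game with quadratic payouts
#   phi_k(x) = (x_k - a_k)^2 + x_1 * x_2,
# whose best responses x_k = a_k - x_{-k}/2 intersect at the
# unique Nash equilibrium x* = (1, 0) for a = (1, 0.5).
a = np.array([1.0, 0.5])

def phi(k, x):
    return (x[k] - a[k]) ** 2 + x[0] * x[1]

def psi(x, y):
    """Nikaido-Isoda function: sum_k phi_k(x) - phi_k(x_{-k} | y_k)."""
    total = 0.0
    for k in range(2):
        z = np.array(x, dtype=float)
        z[k] = y[k]
        total += phi(k, x) - phi(k, z)
    return total

x_star = np.array([1.0, 0.0])
grid = np.linspace(-2.0, 2.0, 81)  # brute-force grid containing the equilibrium strategies
V = max(psi(x_star, np.array([y1, y2])) for y1 in grid for y2 in grid)
```

Since $x^*_k$ minimizes $\varphi_k(x^*_{-k} \,|\, \cdot)$ for each $k$, every term of $\Psi(x^*, y)$ is nonpositive and the brute-force maximum $V(x^*)$ is (numerically) zero.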
Remark 2.1.
If the set $X_k$ of feasible strategies for each player depends on the strategies of the other players (i.e., $X_k = X_k(x_{-k})$), (2.1) becomes a generalized Nash equilibrium problem (GNEP); see the survey [16] and the literature cited therein. If for all $k$,
$$X_k(x_{-k}) = \{x_k \in \mathbb{R} : (x_{-k} \,|\, x_k) \in Z\} \qquad (1 \le k \le n)$$
for some closed and convex set $Z \subset \mathbb{R}^n$, the GNEP is called jointly convex. In this case, minimization of (2.2) is no longer an equivalent characterization but defines a variational equilibrium [31]; every variational equilibrium is a generalized Nash equilibrium but not vice versa; see, e.g., [16, Thm. 3.9]. Hence Algorithm 1.1 can also be applied to compute (some if not all) solutions to jointly convex GNEPs.

Our next example is concerned with (Huber-regularized) $\ell^0$-TV denoising or segmentation, also referred to as the Potts model. Let $f \in \mathbb{R}^{N_1 \times N_2}$, $N_1, N_2 \in \mathbb{N}$, be a given noisy or to-be-segmented image. We then search for the denoised or segmented image as the solution to
$$(2.3)\qquad \min_{x \in \mathbb{R}^{N_1 \times N_2}}\ \frac{\alpha}{2}\|x - f\|^2 + \|D_h x\|_{p,0},$$
for a regularization parameter $\alpha \ge 0$, a discrete gradient operator $D_h: \mathbb{R}^{N_1 \times N_2} \to \mathbb{R}^{N_1 \times N_2 \times 2}$, and the vectorial $\ell^0$-seminorm
$$(2.4)\qquad \|z\|_{p,0} := \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \bigl|\bigl(|z_{ij1}|_0, |z_{ij2}|_0\bigr)\bigr|_p, \qquad\text{where}\quad |t|_0 = \begin{cases} 0 & \text{if } t = 0, \\ 1 & \text{if } t \ne 0, \end{cases}$$
and $|\cdot|_p$ for $p \in [1, \infty]$ is the usual $p$-norm on $\mathbb{R}^2$; we will discuss the choice of $p$ in detail below. Clearly, $\|\cdot\|_{p,0}$ is a non-convex functional for any $p \in [1, \infty]$. Let us briefly comment on the use of $\ell^0$-TV as a regularizer in imaging.
Intuitively, the functional in (2.4) applied to the discrete gradient counts the number of jumps of the image value between neighboring pixels; it can therefore be expected that minimizers are piecewise constant, and that jumps are penalized even more strongly than by the (convex) total variation model.

To motivate our approach, we first consider a simple scalar (lower semicontinuous) step function, i.e., we consider for $(0, \infty) \subset \mathbb{R}$ the corresponding characteristic function
$$(2.5)\qquad \chi_{(0,\infty)}(t) = \begin{cases} 0 & \text{if } t \le 0, \\ 1 & \text{if } t > 0. \end{cases}$$
To write this non-convex function as the generalized preconjugate of a convex function, let $\rho: \mathbb{R} \to \mathbb{R}$ satisfy $\rho(0) = 0$, $\sup_{t \le 0} \rho(t) = 0$, and $\sup_{t > 0} \rho(t) = 1$. Then a simple case distinction shows that
$$(2.6)\qquad \chi_{(0,\infty)}(t) = \sup_{s \ge 0} \rho(st) = \sup_{s \in \mathbb{R}} \rho(st) - \delta_{[0,\infty)}(s).$$
Setting $\kappa(s, t) := \rho(st)$, we thus obtain that $\chi_{(0,\infty)}$ is the $\kappa$-preconjugate of the convex indicator function $\delta_{[0,\infty)}$. One possible choice for $\rho$ is $\rho = \chi_{(0,\infty)}$; however, we require $\rho$ to be smooth in order to apply Algorithm 1.1. A better choice is therefore
$$(2.7)\qquad \rho(t) = 2t - t^2 \qquad (t \in \mathbb{R}),$$
see Figure 1, which has the advantage that the supremum in (2.6) is always attained at a finite $s \ge 0$.

Since $|t|_0 = 1 - \chi_{\{0\}}(t)$, we can proceed similarly by case distinction to write
$$|t|_0 = \sup_{s \in \mathbb{R}} \rho(st) = \sup_{s \in \mathbb{R}} \rho(st) - 0,$$
i.e., for $\kappa(s, t) = \rho(st)$ as above, $|\cdot|_0$ is the $\kappa$-preconjugate of the zero function $f^* \equiv 0$. In practice, it may be useful to add Huber regularization, i.e., replace $f^*$ by $f^*_\gamma := f^* + \frac{\gamma}{2}|\cdot|^2 = \frac{\gamma}{2}|\cdot|^2$ for some $\gamma > 0$. Since $f^*_\gamma$ and our choice (2.7) are differentiable, an elementary calculus argument shows that the corresponding preconjugate is
$$|t|_\gamma := \sup_{s \in \mathbb{R}} \rho(st) - \frac{\gamma}{2}|s|^2 = \frac{2t^2}{2t^2 + \gamma},$$
which is still a non-convex approximation of $|t|_0$; see Figure 2.

We now turn to the vectorial $\ell^0$ seminorm, where we distinguish between the choices of $p \in [1, \infty]$.

[Figure 1: plot of $\rho$ from (2.7). Figure 2: plot of $|t|_\gamma$ for different values of $\gamma$.]

The case $p = 1$. With this choice, (2.4) reduces to
$$\|z\|_{1,0} = \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \sum_{k=1}^{2} |z_{ijk}|_0,$$
which is the most common choice for the Potts model found in the literature. Here, the Potts functional $\|D_h x\|_{1,0}$ counts for each pixel $(i, j)$ the jumps across each edge of the pixel separately, i.e., the contribution of each pixel is either 0 (no jump), 1 (jump in either horizontal or vertical direction), or 2 (jump in both directions). We thus refer (in a slight abuse of terminology) to this case as the anisotropic Potts model.

Since this functional is completely separable, we can apply the above scalar approach componentwise by taking
$$(2.8)\qquad \kappa_1(z, y) = \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \sum_{k=1}^{2} \rho(z_{ijk} y_{ijk})$$
such that $F = \|\cdot\|_{1,0}$ is the $\kappa_1$-preconjugate of the zero function $F^* \equiv 0$. Correspondingly, the Huber regularization of $F$ is given by
$$F_\gamma(z) = \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \sum_{k=1}^{2} |z_{ijk}|_\gamma.$$

The case $p = \infty$. Now (2.4) reduces to
$$\|z\|_{\infty,0} = \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \max\bigl\{|z_{ij1}|_0, |z_{ij2}|_0\bigr\}.$$
Here, each pixel contributes to the Potts functional only once, even if there is a jump across both edges. Since a simple case distinction shows that $\max\{|a|_0, |b|_0\} = \bigl||(a, b)|_p\bigr|_0$ for any $a, b \in \mathbb{R}$ and $p \in [1, \infty]$, this case is equivalent to
$$|\|z\||_{0,p} := \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \bigl||(z_{ij1}, z_{ij2})|_p\bigr|_0$$
for any $p \in [1, \infty]$, which leads to an alternate definition of the Potts functional sometimes found in the literature. We refer to this case as the isotropic Potts model.
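Both the scalar building block and the anisotropic coupling (2.8) are easy to sanity-check numerically. The sketch below assumes the reading $\rho(t) = 2t - t^2$ of (2.7); `Dh` is our own hypothetical forward-difference discretization of $D_h$, not the one used in the paper's experiments:

```python
import numpy as np

rho = lambda t: 2 * t - t ** 2  # assuming rho(t) = 2t - t^2 as in (2.7)

def Dh(x):
    """Hypothetical forward-difference discrete gradient R^{N1 x N2} -> R^{N1 x N2 x 2}."""
    g = np.zeros(x.shape + (2,))
    g[:-1, :, 0] = x[1:, :] - x[:-1, :]   # vertical jumps
    g[:, :-1, 1] = x[:, 1:] - x[:, :-1]   # horizontal jumps
    return g

def kappa1(z, y):
    """Anisotropic coupling (2.8): sum_{i,j,k} rho(z_{ijk} * y_{ijk})."""
    return float(rho(z * y).sum())

# Piecewise-constant test image with five jumps between neighboring pixels
x = np.array([[0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0],
              [2.0, 2.0, 2.0]])
z = Dh(x)
# Componentwise maximizer of rho(z*y): y = 1/z where z != 0 and y = 0 otherwise,
# so that sup_y kappa1(z, y) recovers the l0 jump count ||Dh x||_{1,0}.
y_opt = np.divide(1.0, z, out=np.zeros_like(z), where=(z != 0))
potts_value = kappa1(z, y_opt)
```

Each nonzero gradient component contributes $\rho(1) = 1$ at the maximizer and each zero component contributes $\rho(0) = 0$, so the supremum equals the number of jumps.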
This functional is only separable with respect to the pixel coordinates $(i, j)$ but not with respect to $k$. We thus extend our preconjugation approach to $\mathbb{R}^2$ by observing for $t \in \mathbb{R}^2$ that
$$\bigl||t|_2\bigr|_0 = \sup_{s \in \mathbb{R}^2} \rho(\langle s, t \rangle) = \sup_{s \in \mathbb{R}^2} \rho(s_1 t_1 + s_2 t_2),$$
since for $t = 0$ we have $\rho(\langle s, t \rangle) = 0$ for all $s \in \mathbb{R}^2$, while for $t \ne 0$ the supremum $1$ is attained by the choice of $\rho$. Setting
$$(2.9)\qquad \kappa_\infty(z, y) = \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \rho\bigl(z_{ij1} y_{ij1} + z_{ij2} y_{ij2}\bigr)$$
makes $F = \|\cdot\|_{\infty,0}$ again the $\kappa_\infty$-preconjugate of the zero function $F^* \equiv 0$. The corresponding Huber regularization can be once more computed by elementary calculus as
$$F_\gamma(z) = \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \bigl||(z_{ij1}, z_{ij2})|_2\bigr|_\gamma.$$

The case $p \in (1, \infty)$. In principle, one could proceed as for $p = \infty$ by constructing a function $\rho_p: \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}$ with
$$\sup_{s \in \mathbb{R}^2} \rho_p(s, t) = \begin{cases} 0 & \text{if } t = 0, \\ 1 & \text{if } t \ne 0,\ t_1 t_2 = 0, \\ 2^{1/p} & \text{if } t_1 \ne 0,\ t_2 \ne 0, \end{cases}$$
and setting $\kappa_p(s, t) = \rho_p(s, t)$. However, since the corresponding Potts functional only differs from the case $p = 1$ in the weight $2^{1/p}$ given to pixels with jumps in both directions, which tends to the weight $1$ of the case $p = \infty$ as $p \to \infty$, we will only consider the extremal cases $p = 1$ and $p = \infty$.

In all cases, we can apply Algorithm 1.1 to
$$K(x, y) = \kappa_p(D_h x, y), \qquad G(x) = \frac{\alpha}{2}\|x - f\|^2, \qquad F^*_\gamma(y) = \frac{\gamma}{2}\|y\|^2$$
for $p \in [1, \infty]$ and $\gamma \ge 0$. We illustrate the application of Algorithm 1.1 for $p \in \{1, \infty\}$ and $\gamma > 0$ in Section 7.2.

Remark 2.2.
We can also apply this approach to $|t|^q$ with $q \in (0, 1)$ using the same $\rho$ as above, writing $|t|^q = \sup_{s \in \mathbb{R}} \kappa(t, s)$ for $\kappa(t, s) := |t|^q \rho(st)$. However, this $\kappa(t, s)$ is not $C^1$; we can achieve that by instead writing $|t|^q = \sup_{s \in \mathbb{R}} \kappa(t, s)$ for $\kappa(t, s) := |t|^q \rho(|st|)$.

We start the development of our proposed method by introducing the necessary notation and overall assumptions. Throughout the rest of this paper, we write $\mathbb{L}(X; Y)$ for the space of bounded linear operators between Hilbert spaces $X$ and $Y$. In what follows, we let $x$ and $y$ denote elements of $X$ and $Y$, respectively, and denote by $u$ a pair $(x, y) \in X \times Y$. For brevity, we will also use this notation for similar tuples, e.g., $u^i := (x^i, y^i)$, without explicit introduction in each case.
For any Hilbert space, $I$ is the identity operator, $\langle x, x' \rangle$ is the inner product in the corresponding space, and $\mathbb{B}(x, r)$ is the closed ball of radius $r$ at $x$. If $H: X \rightrightarrows X$ is a set-valued map, we will frequently use the concise notation $\langle H(x), \tilde{x} \rangle := \{\langle w, \tilde{x} \rangle : w \in H(x)\}$ as well as, e.g., $0 \le \langle H(x), \tilde{x} \rangle$ if the corresponding relation holds for all $w \in H(x)$.

For self-adjoint $T, S \in \mathbb{L}(X; X)$, the inequality $T \ge S$ means that $T - S$ is positive semidefinite. If $T \in \mathbb{L}(X; X)$ is self-adjoint, we further set $\langle x, x' \rangle_T := \langle Tx, x' \rangle$ and $\|x\|_T := \sqrt{\langle x, x \rangle_T}$ (which define an inner product and a norm on $X$, respectively, if $T$ is in addition positive definite). In this case, $T \ge S$ implies that $\|x\|_T \ge \|x\|_S$ for all $x \in X$.

We also recall that $K_x$ and $K_y$ denote the partial Fréchet derivatives of a continuously differentiable operator $K$ with respect to the given variable. Throughout this paper, we make the following fundamental assumptions on (1.1).

Assumption 3.1.
The functionals $G: X \to \overline{\mathbb{R}}$ and $F^*: Y \to \overline{\mathbb{R}}$ are convex, proper, and lower semicontinuous. Furthermore,

(i) there exist a constant $\gamma_G \in \mathbb{R}$ and a neighborhood $X_G$ of $\hat{x}$ such that
$$(3.1)\qquad \langle \partial G(x) + K_x(\hat{x}, \hat{y}), x - \hat{x} \rangle \ge \gamma_G \|x - \hat{x}\|^2 \qquad (x \in X_G);$$

(ii) there exist a constant $\gamma_{F^*} \in \mathbb{R}$ and a neighborhood $Y_{F^*}$ of $\hat{y}$ such that
$$(3.2)\qquad \langle \partial F^*(y) - K_y(\hat{x}, \hat{y}), y - \hat{y} \rangle \ge \gamma_{F^*} \|y - \hat{y}\|^2 \qquad (y \in Y_{F^*}).$$

Let us comment on this assumption. First, since the subgradients $\partial G$ and $\partial F^*$ of convex, proper, and lower semicontinuous functionals are maximally monotone operators [4, Theorem 20.25], Assumption 3.1 always holds with $\gamma_G = \gamma_{F^*} = 0$. This is already sufficient for showing weak convergence of Algorithm 1.1; see Theorem 6.1. For strong convergence with rates, however, we (as usual in nonlinear optimization) need a local superlinear growth condition near the solution that requires taking $\gamma_G$ and/or $\gamma_{F^*}$ strictly positive (unless we can compensate by better properties of $K$ through Assumption 3.2 below); see Theorems 6.3 and 6.4. In this case, Assumption 3.1 (i), for example, coincides with strong metric subregularity of $\partial G$; see [1, 2]. This property holds (at any $\hat{x}$ and $\hat{w} \in \partial G(\hat{x})$) whenever $G$ is strongly convex; however, it is a strictly weaker property since we only require it to hold at the specific $\hat{x}$ and $\hat{w} = -K_x(\hat{x}, \hat{y})$ arising from the first-order necessary optimality conditions (4.1) below. (For example, $\partial g$ for $g(x) = |x|$ is strongly metrically subregular at $x = 0$ for $w \in (-1, 1)$ – but not for $w \in \{-1, 1\}$ – although $g$ is not strongly convex.)

Assumption 3.2.
The functional $K \in C^1(X \times Y)$, and there exist $\rho_x, \rho_y > 0$ such that for all $u, u' \in \mathcal{U}(\rho_x, \rho_y) := (\mathbb{B}(\hat{x}, \rho_x) \cap X_G) \times (\mathbb{B}(\hat{y}, \rho_y) \cap Y_{F^*})$, the following properties hold:

(i) (second partial derivatives) The second partial derivatives $K_{xy}(u)$ and $K_{yx}(u)$ exist and satisfy $K_{xy}(u) = [K_{yx}(u)]^*$.

(ii) (locally Lipschitz gradients) For some functions $L_x(y), L_y(x) \ge 0$ and a constant $L_{yx} \ge 0$,
$$\|K_x(x', y) - K_x(x, y)\| \le L_x(y)\|x' - x\|, \qquad \|K_{yx}(x', y) - K_{yx}(x, y)\| \le L_{yx}\|x' - x\|, \qquad \|K_y(x, y') - K_y(x, y)\| \le L_y(x)\|y' - y\|.$$

(iii) (locally bounded gradient) There exists $R_K > 0$ such that $\sup_{u \in \mathcal{U}(\rho_x, \rho_y)} \|K_{xy}(x, y)\| \le R_K$.

(iv) (three-point condition) There exist $\theta_x, \theta_y > 0$, $\lambda_x, \lambda_y \ge 0$, and $\xi_x, \xi_y \in \mathbb{R}$ such that
$$(3.4a)\qquad \langle K_x(x', \hat{y}) - K_x(\hat{x}, \hat{y}), x - \hat{x} \rangle + \xi_x \|x - \hat{x}\|^2 \ge \theta_x \|K_y(\hat{x}, y) - K_y(x, y) - K_{yx}(x, y)(\hat{x} - x)\| - \lambda_x \|x - x'\|^2,$$
$$(3.4b)\qquad \langle K_y(x, y) - K_y(x, y') + K_y(\hat{x}, \hat{y}) - K_y(\hat{x}, y), y - \hat{y} \rangle + \xi_y \|y - \hat{y}\|^2 \ge \theta_y \|K_x(x', \hat{y}) - K_x(x', y') - K_{xy}(x', y')(\hat{y} - y')\| - \lambda_y \|y - y'\|^2.$$

We again elaborate on this assumption. Assumption 3.2 (i)–(iii) are standard in nonlinear optimization of smooth functions.
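Returning briefly to the parenthetical example given after Assumption 3.1: the strong metric subregularity of $\partial g$ for $g(x) = |x|$ can be checked directly on a grid (a small self-contained sketch with our own choice of constants $w$ and $\gamma$):

```python
import numpy as np

# g(x) = |x| has subdifferential sign(x) for x != 0 and [-1, 1] at x = 0.
# For a fixed w in (-1, 1), the subregularity inequality
#   <partial g(x) - w, x - 0> >= gamma * |x|^2
# holds on the ball |x| <= (1 - |w|) / gamma around the solution x = 0.
w, gamma = 0.5, 1.0
radius = (1 - abs(w)) / gamma

def margin(x):
    """Worst-case slack of the inequality over the subdifferential at x."""
    subgrads = [np.sign(x)] if x != 0 else [-1.0, w, 1.0]  # sample elements at x = 0
    return min((v - w) * x - gamma * x ** 2 for v in subgrads)

worst = min(margin(x) for x in np.linspace(-radius, radius, 101))
```

The worst-case slack is attained (with value zero) at the boundary $x = (1 - |w|)/\gamma$, confirming that the inequality fails on any larger ball; for $w \in \{-1, 1\}$ the radius degenerates to zero, matching the remark above.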
Apart from the estimates in Assumption 3.2 (ii), we make use of the following inequality that is an immediate consequence:
$$(3.5)\qquad \|K_y(x', y) - K_y(x, y) - K_{yx}(x, y)(x' - x)\| \le \frac{L_{yx}}{2}\|x - x'\|^2.$$
The constants $\xi_x$ and $\xi_y$ in Assumption 3.2 (iv) can typically be taken positive by exploiting the strong monotonicity factors $\gamma_G$ and $\gamma_{F^*}$ of $\partial G$ and $\partial F^*$. Indeed, further on in Theorem 4.2, we will require that $\gamma_G - \tilde{\gamma}_G \ge \xi_x$ and $\gamma_{F^*} - \tilde{\gamma}_{F^*} \ge \xi_y$, where $\tilde{\gamma}_G$ and $\tilde{\gamma}_{F^*}$ will be acceleration factors employed to update the step length parameters $\tau_i$, $\omega_i$, and $\sigma_i$ in the algorithm.

In Appendix A we demonstrate that Assumption 3.2 (iv) is closely related to standard second-order optimality conditions, i.e., a positive definite Hessian at the solution $\hat{u}$. In particular, if the primal problem for the saddle-point functional is strongly convex and the dual problem is strongly concave, the constants that ensure Assumption 3.2 (iv) can be found explicitly. Nonetheless, Assumption 3.2 (iv) is more general than simple strong convex-concavity. Indeed, in Appendix C we verify Assumption 3.2 for $K$ arising from combinations of a linear operator with the generalized conjugate representations of the step function and the $\ell^0$ function from Section 2.2.

Since (3.4b) holds for any $\xi_y, \lambda_y \ge 0$ when $K(x, y) = \langle A(x), y \rangle$ for some $A \in C^1(X)$, the conditions (3.4) reduce to the three-point condition for $A$ from [10] with the exponent $p = 1$. In the present work, such an exponent would correspond to exponents $p_x, p_y \in [1, 2]$ over the norms with the factors $\theta_x$ and $\theta_y$ that we consider in Assumption B.1 (iv*). These can sometimes be useful: the exponent $p = 2$ was needed in [10] for the $A$ arising in a phase and amplitude reconstruction problem. For the sake of readability, in the main part of the present work we focus on the case $p_x = p_y = 1$, i.e., Assumption 3.2 (iv), and discuss the changes needed for $p_x, p_y \in (1, 2]$ in Appendix B.

We want to find a critical point $\hat{u} = (\hat{x}, \hat{y}) \in X \times Y$ of the saddle point functional $(x, y) \mapsto G(x) + K(x, y) - F^*(y)$, i.e., satisfying
$$(4.1)\qquad 0 \in H(\hat{u}) \quad\text{for}\quad H(u) := \begin{pmatrix} \partial G(x) + K_x(x, y) \\ \partial F^*(y) - K_y(x, y) \end{pmatrix}.$$
Since $G$ and $F^*$ are proper, convex, and lower semicontinuous, and $K$ is continuously differentiable, using the definition of the saddle point, the Fréchet derivative, and the convex subdifferential, an elementary limiting argument as in, e.g., [9, Prop. 2.2] shows that the inclusion (4.1) is a first-order necessary optimality condition for a saddle point. If $K(x, y) = \langle Ax, y \rangle$ for $A \in \mathbb{L}(X; Y)$, (4.1) reduces to $-A^*\hat{y} \in \partial G(\hat{x})$ and $A\hat{x} \in \partial F^*(\hat{y})$, which coincides with the well-known Fenchel–Rockafellar extremality conditions for (1.2); see [14, Remark 4.2].

To study Algorithm 1.1, we reformulate it in the preconditioned proximal point and testing framework of [36]. Specifically, we write Algorithm 1.1 in implicit proximal point form as solving in each iteration for $u^{i+1} = (x^{i+1}, y^{i+1}) \in X \times Y$ in
$$(\mathrm{IPP})\qquad 0 \in W_{i+1}\tilde{H}_{i+1}(u^{i+1}) + M_{i+1}(u^{i+1} - u^i),$$
where the linearization $\tilde{H}_{i+1}$ of $H$, the linear preconditioner $M_{i+1}$, and the step length operator $W_{i+1}$ are defined as
$$(4.2)\qquad \tilde{H}_{i+1}(u) := \begin{pmatrix} \partial G(x) + K_x(x^i, y^i) + K_{xy}(x^i, y^i)(y - y^i) \\ \partial F^*(y) - K_y\bigl((1 + \omega_i)x - \omega_i x^i, y^i\bigr) - K_{yx}(x^i, y^i)\bigl(x - [(1 + \omega_i)x - \omega_i x^i]\bigr) \end{pmatrix},$$
$$(4.3)\qquad M_{i+1} := \begin{pmatrix} I & -\tau_i K_{xy}(x^i, y^i) \\ -\omega_i \sigma_{i+1} K_{yx}(x^i, y^i) & I \end{pmatrix},$$
$$(4.4)\qquad W_{i+1} := \begin{pmatrix} \tau_i I & 0 \\ 0 & \sigma_{i+1} I \end{pmatrix}.$$
Inserting these definitions into (IPP) and rearranging, we can rewrite the inclusion (IPP) as
$$(4.5)\qquad 0 \in \begin{pmatrix} \tau_i \partial G(x^{i+1}) + \tau_i K_x(x^i, y^i) + x^{i+1} - x^i \\ \sigma_{i+1} \partial F^*(y^{i+1}) - \sigma_{i+1} K_y\bigl((1 + \omega_i)x^{i+1} - \omega_i x^i, y^i\bigr) + y^{i+1} - y^i \end{pmatrix}.$$
Therefore, based on the definitions of the proximal point mapping $\operatorname{prox}_{\tau G}(v) = (I + \tau \partial G)^{-1}(v)$ and of $\bar{x}^{i+1} = (1 + \omega_i)x^{i+1} - \omega_i x^i$, solving (IPP) for $u^{i+1}$ is equivalent to performing one step of Algorithm 1.1. Since proximal mappings of proper, convex and lower semicontinuous functionals are well-defined, single-valued, and Lipschitz continuous [4, Proposition 12.15], and $K$ is twice Fréchet differentiable on $X \times Y$, this also shows that (IPP) always admits a unique solution $u^{i+1}$.

The next step is to "test" the inclusion (IPP) by application of $\langle \,\cdot\,, u^{i+1} - \hat{u} \rangle_{Z_{i+1}}$ for the testing operator
$$Z_{i+1} := \begin{pmatrix} \varphi_i I & 0 \\ 0 & \psi_{i+1} I \end{pmatrix}.$$
This testing operator and the respective primal and dual testing variables $\varphi_i$ and $\psi_{i+1}$ will be seen to encode convergence rates after some rearrangements of the tested inclusions for $i = 0, \ldots, N - 1$: $\|\cdot\|_{Z_{N+1}M_{N+1}}$ forms a local metric that measures the convergence of the iterates, while $\Delta_{i+1}$ can potentially be used to measure function value or gap convergence. In particular, we therefore want $\|u\|_{Z_{N+1}M_{N+1}} \to \infty$ as $N \to \infty$ with a certain rate such that boundedness of $\|u^N - \hat{u}\|_{Z_{N+1}M_{N+1}}$ implies the convergence of $u^N \to \hat{u}$ at the reciprocal rate (see Theorems 6.3 and 6.4).

Theorem 4.1 ([36, Theorem 2.1]).
Suppose (IPP) is solvable, and denote the iterates by $\{u^i\}_{i \in \mathbb{N}}$. If $Z_{i+1}M_{i+1}$ is self-adjoint and for some $\hat{u} \in U$ and $\Delta_{i+1} = \Delta_{i+1}(\hat{u}) \in \mathbb{R}$, for all $i \le N - 1$,
$$(4.6)\qquad \langle Z_{i+1}W_{i+1}\tilde{H}_{i+1}(u^{i+1}), u^{i+1} - \hat{u} \rangle + \Delta_{i+1} \ge \frac{1}{2}\|u^{i+1} - \hat{u}\|^2_{Z_{i+2}M_{i+2} - Z_{i+1}M_{i+1}} - \frac{1}{2}\|u^{i+1} - u^i\|^2_{Z_{i+1}M_{i+1}},$$
then
$$(4.7)\qquad \frac{1}{2}\|u^N - \hat{u}\|^2_{Z_{N+1}M_{N+1}} \le \frac{1}{2}\|u^0 - \hat{u}\|^2_{Z_1 M_1} + \sum_{i=0}^{N-1} \Delta_{i+1}.$$

The next theorem specializes Theorem 4.1 to our specific setup, converting the abstract condition (4.6) into several step length and testing parameter update rules and bounds. Specifically, (4.8a) below couples the primal and dual step lengths $\tau_i$ and $\sigma_i$ and the over-relaxation parameter $\omega_i$ with the testing parameters. Condition (4.8b) determines convergence rates by limiting how fast the testing parameters can grow. This rate is limited through the available strong monotonicity or second-order behavior ($\gamma_G - \xi_x$ and $\gamma_{F^*} - \xi_y$) through (4.8d) and (4.8e) as well as additional step length bounds from (4.8c). We point out that only the latter are specific to our non-convex setting; the remaining conditions are present in the convex setting as well, see [36]. We will further develop these rules and conditions in the next section to obtain specific convergence results; an explicit example for a set of parameters satisfying these rules and conditions will be provided for the $\ell^0$-TV denoising in Section 7.2 and Appendix C. Here and in the following, we use the notation $\bar{x}^{i+1} := x^{i+1} + \omega_i(x^{i+1} - x^i)$ from Algorithm 1.1 for brevity.

Theorem 4.2.
Suppose Assumptions 3.1 and 3.2 hold with the constants θ_x, θ_y > 0; ξ_x, ξ_y ∈ ℝ; λ_x, λ_y ≥ 0; L_{yx} ≥ 0; and R_K > 0. For all i ∈ ℕ, let ū^{i+1} := (x̄^{i+1}, y^i), and suppose u^i, u^{i+1}, û, ū^{i+1} ∈ U(ρ_x, ρ_y) for some ρ_x, ρ_y ≥ 0. Assume for all i ∈ ℕ that ω̄ ≥ ω_i ≥ ω̲ > 0 and that, for some 0 < δ ≤ µ < 1, η_i > 0, and γ̃_G, γ̃_{F^*} ≥ 0,

(4.8a)  ω_i = η_i η_{i+1}^{−1},   η_i = ψ_{i+1}σ_{i+1}ω_i = ϕ_iτ_i,
(4.8b)  ϕ_{i+1} = ϕ_i(1 + 2τ_iγ̃_G),   ψ_{i+2} = ψ_{i+1}(1 + 2σ_{i+1}γ̃_{F^*}),
(4.8c)  1 ≥ σ_{i+1}( R_K²τ_i(1 − µ)^{−1} + λ_yω_i^{−1} ),   τ_i ≤ δ/(2λ_x + L_{yx}(ω_i + 2)ρ_y),
(4.8d)  γ_G ≥ γ̃_G + ξ_x,   θ_y ≥ ω̄ρ_x,
(4.8e)  γ_{F^*} ≥ γ̃_{F^*} + ξ_y,   θ_x ≥ ρ_yω̲^{−1}.

Then (4.6) is satisfied for some Δ_{i+1} ≤ 0.

Proof. We split the proof into several steps.
Step 1 (estimation of Z_{i+1}M_{i+1}). By (4.8a), ϕ_iτ_i = η_i and ψ_{i+1}σ_{i+1}ω_i = η_i, so (4.3) yields

(4.9)  Z_{i+1}M_{i+1} = ( ϕ_i I   −η_i K_{xy}(x^i, y^i) ; −η_i K_{yx}(x^i, y^i)   ψ_{i+1} I ),

which is clearly self-adjoint. Applying Cauchy's and Young's inequalities, we further obtain for any δ > 0, x ∈ X, and y ∈ Y that

−2⟨x, η_iK_{xy}(x^i, y^i)y⟩ ≥ −(1 − δ)ϕ_i‖x‖² − (1 − δ)^{−1}ϕ_i^{−1}η_i²‖K_{xy}(x^i, y^i)y‖²,

implying that

(4.10)  Z_{i+1}M_{i+1} ≥ Q̂_{i+1} := ( δϕ_i I   0 ; 0   ψ_{i+1}I − η_i²ϕ_i^{−1}(1 − δ)^{−1} K_{yx}(x^i, y^i)K_{xy}(x^i, y^i) ).

Step 2 (estimation of Z_{i+2}M_{i+2} − Z_{i+1}M_{i+1}). Expanding Z_{i+2}M_{i+2} − Z_{i+1}M_{i+1} according to (4.9) and then applying (4.8b), we obtain

(4.11)  ½‖u^{i+1} − û‖²_{Z_{i+1}M_{i+1} − Z_{i+2}M_{i+2}} = −η_iγ̃_G‖x^{i+1} − x̂‖² − η_{i+1}γ̃_{F^*}‖y^{i+1} − ŷ‖² + ⟨(η_{i+1}K_{xy}(x^{i+1}, y^{i+1}) − η_iK_{xy}(x^i, y^i))(y^{i+1} − ŷ), x^{i+1} − x̂⟩.
Step 3 (estimation of H̃_{i+1}(u^{i+1})). By (4.2) we have

H̃_{i+1}(u^{i+1}) = ( ∂G(x^{i+1}) + K_x(x^i, y^i) + K_{xy}(x^i, y^i)(y^{i+1} − y^i) ; ∂F^*(y^{i+1}) − K_y(x̄^{i+1}, y^i) − K_{yx}(x^i, y^i)(x^{i+1} − x̄^{i+1}) ).

Since 0 ∈ H(û), we have −K_x(x̂, ŷ) ∈ ∂G(x̂) and K_y(x̂, ŷ) ∈ ∂F^*(ŷ). Using (4.5) multiplied by Z_{i+1}, Assumption 3.1, and (4.8a), we can thus estimate

(4.12)  ⟨H̃_{i+1}(u^{i+1}), u^{i+1} − û⟩_{W_{i+1}Z_{i+1}} ≥ η_iγ_G‖x^{i+1} − x̂‖² + η_{i+1}γ_{F^*}‖y^{i+1} − ŷ‖²
  + η_i⟨K_x(x^i, y^i) − K_x(x̂, ŷ) + K_{xy}(x^i, y^i)(y^{i+1} − y^i), x^{i+1} − x̂⟩
  + η_{i+1}⟨K_y(x̂, ŷ) − K_y(x̄^{i+1}, y^i) − K_{yx}(x^i, y^i)(x^{i+1} − x̄^{i+1}), y^{i+1} − ŷ⟩.

Combining (4.12), (4.11), and (4.10), we arrive at

(4.13)  S_{i+1} := ½‖u^{i+1} − u^i‖²_{Z_{i+1}M_{i+1}} + ½‖u^{i+1} − û‖²_{Z_{i+1}M_{i+1} − Z_{i+2}M_{i+2}} + ⟨H̃_{i+1}(u^{i+1}), u^{i+1} − û⟩_{W_{i+1}Z_{i+1}} ≥ ½‖u^{i+1} − u^i‖²_{Q̂_{i+1}} + D

for

D := η_i(γ_G − γ̃_G)‖x^{i+1} − x̂‖² + η_{i+1}(γ_{F^*} − γ̃_{F^*})‖y^{i+1} − ŷ‖²
  + ⟨(η_{i+1}K_{xy}(x^{i+1}, y^{i+1}) − η_iK_{xy}(x^i, y^i))(y^{i+1} − ŷ), x^{i+1} − x̂⟩
  + η_i⟨K_x(x^i, y^i) − K_x(x̂, ŷ) + K_{xy}(x^i, y^i)(y^{i+1} − y^i), x^{i+1} − x̂⟩
  + η_{i+1}⟨K_y(x̂, ŷ) − K_y(x̄^{i+1}, y^i) − K_{yx}(x^i, y^i)(x^{i+1} − x̄^{i+1}), y^{i+1} − ŷ⟩.
The claim of the theorem is established if we prove that S_{i+1} ≥ 0.

Step 4 (estimation of D). With

D̃_{x+y} := ⟨(η_{i+1}K_{xy}(x^{i+1}, y^{i+1}) − η_iK_{xy}(x^i, y^i))(y^{i+1} − ŷ), x^{i+1} − x̂⟩
  + η_i⟨K_x(x^i, y^i) − K_x(x̂, ŷ) + K_{xy}(x^i, y^i)(y^{i+1} − y^i), x^{i+1} − x̂⟩
  + η_{i+1}⟨K_y(x̂, ŷ) − K_y(x^{i+1}, y^i), y^{i+1} − ŷ⟩

and

D_ω := ⟨K_y(x^{i+1}, y^i) − K_y(x̄^{i+1}, y^i) + K_{yx}(x^{i+1}, y^i)(x̄^{i+1} − x^{i+1}), y^{i+1} − ŷ⟩
  + ⟨[K_{yx}(x^i, y^i) − K_{yx}(x^{i+1}, y^i)](x̄^{i+1} − x^{i+1}), y^{i+1} − ŷ⟩,

we can rewrite

D = η_i(γ_G − γ̃_G)‖x^{i+1} − x̂‖² + η_{i+1}(γ_{F^*} − γ̃_{F^*})‖y^{i+1} − ŷ‖² + D̃_{x+y} + η_{i+1}D_ω.

We rearrange

D̃_{x+y} = η_i⟨K_x(x^i, ŷ) − K_x(x̂, ŷ), x^{i+1} − x̂⟩
  + η_{i+1}⟨K_y(x̂, y^{i+1}) − K_y(x^{i+1}, y^{i+1}) + K_{yx}(x^{i+1}, y^{i+1})(x^{i+1} − x̂), y^{i+1} − ŷ⟩
  + η_{i+1}⟨K_y(x̂, ŷ) − K_y(x̂, y^{i+1}), y^{i+1} − ŷ⟩
  + η_{i+1}⟨K_y(x^{i+1}, y^{i+1}) − K_y(x^{i+1}, y^i), y^{i+1} − ŷ⟩
  − η_i⟨K_x(x^i, ŷ) − K_x(x^i, y^i) − K_{xy}(x^i, y^i)(ŷ − y^i), x^{i+1} − x̂⟩.
Since η_{i+1} = η_iω_i^{−1}, setting

D_x := ξ_x‖x^{i+1} − x̂‖² + ⟨K_x(x^i, ŷ) − K_x(x̂, ŷ), x^{i+1} − x̂⟩
  + ⟨K_y(x̂, y^{i+1}) − K_y(x^{i+1}, y^{i+1}) − K_{yx}(x^{i+1}, y^{i+1})(x̂ − x^{i+1}), y^{i+1} − ŷ⟩ω_i^{−1}

and

D_y := ξ_y‖y^{i+1} − ŷ‖² + ⟨K_y(x̂, ŷ) − K_y(x̂, y^{i+1}), y^{i+1} − ŷ⟩ + ⟨K_y(x^{i+1}, y^{i+1}) − K_y(x^{i+1}, y^i), y^{i+1} − ŷ⟩
  − ω_i⟨K_x(x^i, ŷ) − K_x(x^i, y^i) − K_{xy}(x^i, y^i)(ŷ − y^i), x^{i+1} − x̂⟩,

we can write

D = η_i(γ_G − γ̃_G − ξ_x)‖x^{i+1} − x̂‖² + η_{i+1}(γ_{F^*} − γ̃_{F^*} − ξ_y)‖y^{i+1} − ŷ‖² + η_iD_x + η_{i+1}D_y + η_{i+1}D_ω.

As for the estimate of D_ω, using Assumption 3.2(ii) and (3.5), we obtain

(4.14)  D_ω ≥ −½L_{yx}‖x̄^{i+1} − x^{i+1}‖²‖y^{i+1} − ŷ‖ − L_{yx}‖x^{i+1} − x^i‖‖x̄^{i+1} − x^{i+1}‖‖y^{i+1} − ŷ‖
  ≥ −½L_{yx}ω_i(ω_i + 2)ρ_y‖x^{i+1} − x^i‖²,

using in the last inequality the expansion x̄^{i+1} := x^{i+1} + ω_i(x^{i+1} − x^i) and the bound ‖y^{i+1} − ŷ‖ ≤ ρ_y that follows from the assumed inclusion u^{i+1} ∈ U(ρ_x, ρ_y).

We now use Assumption 3.2(iv) to further bound D_x and D_y.
From (3.4a), we obtain

(4.15)  D_x ≥ θ_x‖K_y(x̂, y^{i+1}) − K_y(x^{i+1}, y^{i+1}) − K_{yx}(x^{i+1}, y^{i+1})(x̂ − x^{i+1})‖ − λ_x‖x^{i+1} − x^i‖²
  − ‖y^{i+1} − ŷ‖‖K_y(x̂, y^{i+1}) − K_y(x^{i+1}, y^{i+1}) − K_{yx}(x^{i+1}, y^{i+1})(x̂ − x^{i+1})‖ω_i^{−1}
  ≥ (θ_x − ρ_yω̲^{−1})‖K_y(x̂, y^{i+1}) − K_y(x^{i+1}, y^{i+1}) − K_{yx}(x^{i+1}, y^{i+1})(x̂ − x^{i+1})‖ − λ_x‖x^{i+1} − x^i‖²
  ≥ −λ_x‖x^{i+1} − x^i‖²,

using in the last two inequalities that u^{i+1} ∈ U(ρ_x, ρ_y) for some ρ_x, ρ_y ≥ 0, that ω_i^{−1} ≤ ω̲^{−1}, and that θ_x ≥ ρ_yω̲^{−1} from (4.8e). Analogously, from (3.4b) and Cauchy's inequality,

(4.16)  D_y ≥ θ_y‖K_x(x^i, ŷ) − K_x(x^i, y^i) − K_{xy}(x^i, y^i)(ŷ − y^i)‖ − λ_y‖y^{i+1} − y^i‖²
  − ω_i‖x^{i+1} − x̂‖‖K_x(x^i, ŷ) − K_x(x^i, y^i) − K_{xy}(x^i, y^i)(ŷ − y^i)‖
  ≥ (θ_y − ρ_xω̄)‖K_x(x^i, ŷ) − K_x(x^i, y^i) − K_{xy}(x^i, y^i)(ŷ − y^i)‖ − λ_y‖y^{i+1} − y^i‖²
  ≥ −λ_y‖y^{i+1} − y^i‖²,

where in the last two inequalities we again used u^{i+1} ∈ U(ρ_x, ρ_y), ω_i ≤ ω̄, and θ_y ≥ ω̄ρ_x from (4.8d). Therefore, combining (4.14), (4.15), and (4.16), we obtain

(4.17)  D = η_iD_x + η_{i+1}D_y + η_{i+1}D_ω + η_i(γ_G − γ̃_G − ξ_x)‖x^{i+1} − x̂‖² + η_{i+1}(γ_{F^*} − γ̃_{F^*} − ξ_y)‖y^{i+1} − ŷ‖²
  ≥ η_i(γ_G − γ̃_G − ξ_x)‖x^{i+1} − x̂‖² + η_{i+1}(γ_{F^*} − γ̃_{F^*} − ξ_y)‖y^{i+1} − ŷ‖²
    − η_iλ_x‖x^{i+1} − x^i‖² − η_{i+1}λ_y‖y^{i+1} − y^i‖² − ½η_iL_{yx}(ω_i + 2)ρ_y‖x^{i+1} − x^i‖²
  ≥ −η_i(λ_x + ½L_{yx}(ω_i + 2)ρ_y)‖x^{i+1} − x^i‖² − η_{i+1}λ_y‖y^{i+1} − y^i‖²,

where we have also used the first bounds of (4.8d) and (4.8e) in the final step. Further using (4.8c) and η_{i+1} = η_iω_i^{−1}, we deduce that D ≥ −½‖u^{i+1} − u^i‖²_{Q̂_{i+1}}. Recalling (4.13), we obtain S_{i+1} ≥
0, i.e., (4.6) holds with some Δ_{i+1} ≤ 0. □

In the subsequent sections, we will also need the following corollary.
Corollary 4.3.
Suppose that Assumption 3.2(iii) and the conditions (4.8) hold. Then (1 − µ)ψ_{i+1} ≥ η_i²ϕ_i^{−1}R_K² and

(4.18)  Z_{i+1}M_{i+1} ≥ ( δϕ_i I   0 ; 0   (µ − δ)(1 − δ)^{−1}ψ_{i+1} I ).

Proof.
Observe that due to (4.8),

(1 − µ)ψ_{i+1} = (1 − µ)η_iσ_{i+1}^{−1}ω_i^{−1} ≥ η_iτ_iR_K² = η_i²ϕ_i^{−1}R_K²,

where the inequality follows from the first bound of (4.8c). This is our first claim. As for the second claim, from Assumption 3.2(iii) we have

η_i²ϕ_i^{−1}(1 − δ)^{−1}K_{yx}(x^i, y^i)K_{xy}(x^i, y^i) ≤ η_i²ϕ_i^{−1}(1 − δ)^{−1}R_K² I ≤ (1 − µ)(1 − δ)^{−1}ψ_{i+1} I.

Inserting this bound into (4.10) in the proof of Theorem 4.2 establishes (4.18). □
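The scalar case of the bound (4.18) can be sanity-checked numerically. The sketch below builds Z_{i+1}M_{i+1} from (4.9) with a scalar coupling derivative k = K_{xy}, |k| ≤ R_K, and verifies positive semidefiniteness of the gap to the diagonal lower bound; the parameter values are illustrative assumptions chosen to satisfy the hypothesis of Corollary 4.3 as reconstructed here.

```python
import math

# Scalar-case check of (4.18): with X = Y = R, K_xy is a number k with
# |k| <= R_K, and Z_{i+1}M_{i+1} from (4.9) is [[phi, -eta*k], [-eta*k, psi]].
# All concrete values are illustrative assumptions.
phi, psi, eta = 2.0, 1.0, 1.0
R_K, mu, delta = 1.0, 0.5, 0.25
assert (1 - mu) * psi >= eta**2 * R_K**2 / phi   # hypothesis of Corollary 4.3

def eig_min(a, b, c):
    """Smallest eigenvalue of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    return 0.5 * (a + c) - math.sqrt(0.25 * (a - c) ** 2 + b * b)

for j in range(21):
    k = -R_K + 2 * R_K * j / 20                  # sample |k| <= R_K
    # Z_{i+1}M_{i+1} minus the diagonal lower bound of (4.18):
    gap_min = eig_min(phi - delta * phi, -eta * k,
                      psi - (mu - delta) / (1 - delta) * psi)
    assert gap_min >= -1e-12                     # the gap is positive semidefinite
```

The check is tight at |k| = R_K, where the gap matrix becomes singular, mirroring that (4.18) cannot be improved without strengthening the bound on K_{xy}.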
In the previous section, we derived step length conditions that we will further develop in Section 6 to prove convergence and convergence rates. However, we implicitly required that all iterates {u^i}_{i∈ℕ} belong to U(ρ_x, ρ_y). In this section, we derive additional step length restrictions to ensure that this holds.

We start with a lemma that bounds the next iterate u^{i+1} given bounds on the current iterate u^i and on the step lengths for the current iteration. Afterwards, we chain these estimates so that only bounds on the initial iterate and the step lengths are required.

Lemma 5.1.
Fix i ∈ ℕ. Suppose Assumption 3.1 and Assumption 3.2(ii) and (iii) hold in U(ρ_x, ρ_y), and that u^{i+1} solves (IPP). For simplicity, assume ω_i ≤ 1. Suppose r_{x,i}, r_{y,i}, δ_x, δ_y > 0 and û ∈ H^{−1}(0) are such that B(x̂, r_{x,i} + δ_x) × B(ŷ, r_{y,i} + δ_y) ⊆ U(ρ_x, ρ_y) and u^i ∈ B(x̂, r_{x,i}) × B(ŷ, r_{y,i}). If

(5.1)  τ_i ≤ δ_x/(2(R_Kr_{y,i} + L_x(ŷ)r_{x,i}))   and   σ_{i+1} ≤ δ_y/(L_y(x̂)r_{y,i} + R_K(r_{x,i} + δ_x)),

then u^{i+1} ∈ B(x̂, r_{x,i} + δ_x) × B(ŷ, r_{y,i} + δ_y) and ‖x̄^{i+1} − x̂‖ ≤ r_{x,i} + δ_x.

Proof. We want to show that the step length conditions (5.1) are sufficient for ‖x^{i+1} − x̂‖ ≤ r_{x,i} + δ_x, ‖x̄^{i+1} − x̂‖ ≤ r_{x,i} + δ_x, and ‖y^{i+1} − ŷ‖ ≤ r_{y,i} + δ_y. We do this by applying the testing argument to the primal and dual variables separately. Multiplying (IPP) by Z*_{i+1}(u^{i+1} − û) with ϕ_i = 1 and ψ_{i+1} =
0, we obtain

0 ∈ τ_i⟨∂G(x^{i+1}) + K_x(x^i, y^i), x^{i+1} − x̂⟩ + ⟨x^{i+1} − x^i, x^{i+1} − x̂⟩.
Using the three-point identity

(5.2)  ⟨x^{i+1} − x^i, x^{i+1} − x̂⟩ = ½‖x^{i+1} − x^i‖² − ½‖x^i − x̂‖² + ½‖x^{i+1} − x̂‖²,

we obtain

½‖x^i − x̂‖² ∈ τ_i⟨∂G(x^{i+1}) + K_x(x^i, y^i), x^{i+1} − x̂⟩ + ½‖x^{i+1} − x^i‖² + ½‖x^{i+1} − x̂‖².

Using further 0 ∈ ∂G(x̂) + K_x(x̂, ŷ) and the monotonicity of ∂G, we arrive at

½‖x^{i+1} − x^i‖² + ½‖x^{i+1} − x̂‖² + τ_i⟨K_x(x^i, y^i) − K_x(x̂, ŷ), x^{i+1} − x̂⟩ ≤ ½‖x^i − x̂‖².

With C_x := τ_i‖K_x(x^i, y^i) − K_x(x̂, ŷ)‖, this implies that

(5.3)  ‖x^{i+1} − x^i‖² + ‖x^{i+1} − x̂‖² ≤ 2C_x‖x^{i+1} − x̂‖ + ‖x^i − x̂‖².

After rearranging the terms and using ‖x^{i+1} − x̂‖ ≤ ‖x^{i+1} − x^i‖ + ‖x^i − x̂‖, we thus have

(‖x^{i+1} − x^i‖ − C_x)² + ‖x^{i+1} − x̂‖² ≤ (‖x^i − x̂‖ + C_x)²,

which leads to

(5.4)  ‖x^{i+1} − x̂‖ ≤ ‖x^i − x̂‖ + C_x.

To estimate the dual variable, we multiply (IPP) by Z*_{i+1}(u^{i+1} − û) with ϕ_i = 0 and ψ_{i+1} =
1. This gives

0 ∈ σ_{i+1}⟨∂F^*(y^{i+1}) − K_y(x̄^{i+1}, y^i), y^{i+1} − ŷ⟩ + ⟨y^{i+1} − y^i, y^{i+1} − ŷ⟩.

Using 0 ∈ ∂F^*(ŷ) − K_y(x̂, ŷ) and following the steps leading to (5.4), we deduce

(5.5)  ‖y^{i+1} − ŷ‖ ≤ ‖y^i − ŷ‖ + C_y   with   C_y := σ_{i+1}‖K_y(x̂, ŷ) − K_y(x̄^{i+1}, y^i)‖.

We now proceed to derive bounds on C_x and C_y with the goal of bounding both (5.4) and (5.5) from above. Using Assumption 3.2(ii), (iii), and the mean value theorem applied to K_x(x^i, ·) and K_y(·, y^i),

C_x ≤ τ_i(‖K_x(x^i, y^i) − K_x(x^i, ŷ)‖ + ‖K_x(x^i, ŷ) − K_x(x̂, ŷ)‖) ≤ τ_i(R_Kr_{y,i} + L_x(ŷ)r_{x,i}) =: R_x

and

C_y ≤ σ_{i+1}(‖K_y(x̂, ŷ) − K_y(x̂, y^i)‖ + ‖K_y(x̂, y^i) − K_y(x̄^{i+1}, y^i)‖) ≤ σ_{i+1}(L_y(x̂)r_{y,i} + R_K(r_{x,i} + δ_x)) =: R_y,

the latter under the assumption that ‖x̄^{i+1} − x̂‖ ≤ r_{x,i} + δ_x, which we now verify. First, by definition,

‖x̄^{i+1} − x̂‖² = ‖x^{i+1} − x̂ + ω_i(x^{i+1} − x^i)‖² = ‖x^{i+1} − x̂‖² + ω_i²‖x^{i+1} − x^i‖² + 2ω_i⟨x^{i+1} − x̂, x^{i+1} − x^i⟩
  = (1 + ω_i)‖x^{i+1} − x̂‖² + ω_i(1 + ω_i)‖x^{i+1} − x^i‖² − ω_i‖x^i − x̂‖²
  ≤ (1 + ω_i)(‖x^{i+1} − x̂‖² + ‖x^{i+1} − x^i‖²) − ω_i‖x^i − x̂‖².
Applying (5.3) and (5.4), we obtain

‖x̄^{i+1} − x̂‖² ≤ (1 + ω_i)(2C_x‖x^{i+1} − x̂‖ + ‖x^i − x̂‖²) − ω_i‖x^i − x̂‖² ≤ 4C_x‖x^{i+1} − x̂‖ + ‖x^i − x̂‖²
  ≤ 4C_x(‖x^i − x̂‖ + C_x) + ‖x^i − x̂‖² ≤ (2C_x + r_{x,i})².

The bound (5.1) on τ_i implies that C_x ≤ R_x ≤ δ_x/2, so that ‖x̄^{i+1} − x̂‖ ≤ r_{x,i} + δ_x. From (5.4) we thus also obtain ‖x^{i+1} − x̂‖ ≤ r_{x,i} + δ_x. The bound (5.1) on σ_{i+1} then implies that C_y ≤ R_y ≤ δ_y, which together with (5.5) completes the proof. □
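The two elementary identities driving this proof — the three-point identity (5.2) and the expansion of ‖x̄^{i+1} − x̂‖² — can be sanity-checked numerically on random vectors. In the sketch below, x0, x1, and xhat stand in for x^i, x^{i+1}, and x̂; the dimensions and relaxation parameters are arbitrary illustrative choices.

```python
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sq(a):
    return dot(a, a)

random.seed(0)
for _ in range(100):
    x0, x1, xhat = ([random.gauss(0, 1) for _ in range(7)] for _ in range(3))
    dx = [a - b for a, b in zip(x1, x0)]          # x^{i+1} - x^i
    # Three-point identity (5.2):
    lhs = dot(dx, [a - b for a, b in zip(x1, xhat)])
    rhs = 0.5 * sq(dx) - 0.5 * sq([a - b for a, b in zip(x0, xhat)]) \
        + 0.5 * sq([a - b for a, b in zip(x1, xhat)])
    assert abs(lhs - rhs) < 1e-9
    # Expansion of the over-relaxed point xbar = x^{i+1} + w*(x^{i+1} - x^i):
    for w in (0.25, 0.5, 1.0):
        xbar = [a + w * d for a, d in zip(x1, dx)]
        lhs2 = sq([a - b for a, b in zip(xbar, xhat)])
        rhs2 = (1 + w) * sq([a - b for a, b in zip(x1, xhat)]) \
            + w * (1 + w) * sq(dx) - w * sq([a - b for a, b in zip(x0, xhat)])
        assert abs(lhs2 - rhs2) < 1e-9
```

Both are exact algebraic identities, so the tolerances only absorb floating-point rounding.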
To chain the applications of Lemma 5.1 over the iterations i ∈ ℕ, we introduce the following assumption, for which we recall the notation of Assumption 3.2 as well as the definition of U(ρ_x, ρ_y) from (3.3).

Assumption 5.2.
Suppose Assumption 3.2 holds near a solution û ∈ H^{−1}(0). Given an initial iterate u^0 ∈ X × Y and initial step length parameters τ_0, σ_1, ω_0 > 0, let, for some 0 < δ ≤ µ < 1,

(5.6)  r_max := √(2δ^{−1}(‖x^0 − x̂‖² + ν^{−1}‖y^0 − ŷ‖²))   with   ν := σ_1ω_0τ_0^{−1}.

We then assume that there exist δ_x, δ_y > 0 and

r_y ≥ r_max√(ν(1 − δ)δ(µ − δ)^{−1})

such that B(x̂, r_max + δ_x) × B(ŷ, r_y + δ_y) ⊆ U(ρ_x, ρ_y), and that for all i ∈ ℕ the step lengths τ_i, σ_{i+1} > 0 satisfy

(5.7)  τ_i ≤ δ_x/(2(R_Kr_y + L_x(ŷ)r_max))   and   σ_{i+1} ≤ δ_y/(L_y(x̂)r_y + R_K(r_max + δ_x)).

Lemma 5.3.
For all i ∈ ℕ, suppose u^{i+1} solves (IPP) and that all the conditions of Theorem 4.2 are satisfied for some ρ_x, ρ_y > 0 and γ̃_G, γ̃_{F^*} ≥ 0, except for the requirement u^i, u^{i+1}, ū^{i+1} ∈ U(ρ_x, ρ_y). If Assumption 5.2 holds, then {u^i}_{i∈ℕ}, {ū^{i+1}}_{i∈ℕ} ⊂ U(ρ_x, ρ_y).

Proof. We define r_{x,i} := (δϕ_i)^{−1/2}‖u^0 − û‖_{Z_1M_1} and

U_i := {(x, y) ∈ X × Y | ‖x − x̂‖² + (ψ_{i+1}/ϕ_i)(µ − δ)((1 − δ)δ)^{−1}‖y − ŷ‖² ≤ r_{x,i}²}.

Since the conditions (4.8) hold, we can apply Corollary 4.3 and the estimate (4.18) on Z_{i+1}M_{i+1} to deduce that

(5.8)  {u ∈ X × Y | ‖u − û‖_{Z_{i+1}M_{i+1}} ≤ ‖u^0 − û‖_{Z_1M_1}} ⊂ U_i.

From (4.8b), we also deduce that ϕ_{i+1} ≥ ϕ_i and hence that r_{x,i+1} ≤ r_{x,i}. Consequently, if r_{x,0} ≤ r_max, then

(5.9)  B(x̂, r_{x,i} + δ_x) × B(ŷ, r_y + δ_y) ⊆ B(x̂, r_max + δ_x) × B(ŷ, r_y + δ_y) ⊆ U(ρ_x, ρ_y),

so it will suffice to show that u^i ∈ B(x̂, r_{x,i} + δ_x) × B(ŷ, r_y + δ_y) for each i ∈ ℕ to prove the claim. We do this in two steps. In the first step, we show that r_{x,i} ≤ r_max and

(5.10)  U_i ⊆ B(x̂, r_{x,i}) × B(ŷ, r_y)   (i ∈ ℕ).

In the second step, we show by induction that u^i ∈ U_i as well as ū^{i+1} ∈ U(ρ_x, ρ_y) for i ∈ ℕ.

Step 1
We first prove (5.10). Since U_i ⊆ B(x̂, r_{x,i}) × Y, we only have to show that U_i ⊆ X × B(ŷ, r_y). First, note that (4.8) and γ̃_G, γ̃_{F^*} ≥ 0 imply ψ_{i+1} ≥ ψ_1 as well as ϕ_i ≥ ϕ_0 = ψ_1σ_1ω_0τ_0^{−1} = νψ_1 for ν defined in (5.6). We then obtain from the definition of r_{x,i}, substituting Z_1M_1 from (4.9), that

r_{x,i}²δϕ_i = ‖u^0 − û‖²_{Z_1M_1} = νψ_1‖x^0 − x̂‖² − 2η_0⟨x^0 − x̂, K_{xy}(x^0, y^0)(y^0 − ŷ)⟩ + ψ_1‖y^0 − ŷ‖².

Using Cauchy's and Young's inequalities, the fact that ϕ_i ≥ νψ_1, and the assumption that ‖K_{xy}(x^0, y^0)‖ ≤ R_K, we arrive at

r_{x,i}² ≤ (2νψ_1‖x^0 − x̂‖² + (ψ_1 + η_0²ϕ_0^{−1}R_K²)‖y^0 − ŷ‖²)(δνψ_1)^{−1}.

We obtain from Corollary 4.3 that η_0²ϕ_0^{−1}R_K² ≤ (1 − µ)ψ_1 ≤ ψ_1 and hence that r_{x,i} ≤ r_max. The assumption on r_y then yields for all i ∈ ℕ that

(5.11)  r_y ≥ r_max√(ν(1 − δ)δ(µ − δ)^{−1}) ≥ r_{x,0}√((ϕ_0/ψ_{i+1})(1 − δ)δ(µ − δ)^{−1}) = r_{x,i}√((ϕ_i/ψ_{i+1})(1 − δ)δ(µ − δ)^{−1}).

Thus (5.10) follows from the definition of U_i.
Step 2
We next show by induction that u^i ∈ U_i and ū^{i+1} ∈ U(ρ_x, ρ_y) for all i ∈ ℕ. Since (5.8) holds for i =
0, we have that u^0 ∈ U_0. Moreover, since by Step 1 we have r_{x,0} ≤ r_max, the bound (5.1) for i = 0 follows from (5.7). Suppose now that u^N ∈ U_N. By (5.10), we have that u^N ∈ B(x̂, r_{x,N}) × B(ŷ, r_y). Since again the bound (5.1) for i = N follows from (5.7), and the bound r_{x,N} ≤ r_max follows from Step 1, we can apply Lemma 5.1 to obtain u^{N+1} ∈ B(x̂, r_{x,N} + δ_x) × B(ŷ, r_y + δ_y) and x̄^{N+1} ∈ B(x̂, r_{x,N} + δ_x). By (5.9), we have B(x̂, r_{x,N} + δ_x) × B(ŷ, r_y + δ_y) ⊆ U(ρ_x, ρ_y) and thus u^{N+1}, ū^{N+1} ∈ U(ρ_x, ρ_y). Theorem 4.2 now implies that (4.6) is satisfied for i ≤ N with Δ_{N+1} ≤
0, which together with (4.7) and (5.8) yields that u^{N+1} ∈ U_{N+1}. This completes the induction step and hence the proof. □

We are now ready to formulate the main convergence results of this paper based on the estimates derived above. First, based on (4.8d) and (4.8e), strong convexity may be required if ξ_x and ξ_y have to be positive for Assumption 3.2 to be satisfied. Moreover, the neighborhood U(ρ_x, ρ_y) has to be small enough, as determined by the assumptions θ_x ≥ ρ_yω^{−1} and θ_y ≥ ωρ_x in the next results. This affects the admissible step lengths and how close we have to initialize u^0 via Assumption 5.2. After the next three main convergence results, we show that Assumption 5.2 is satisfied if we initialize close enough to a root û ∈ H^{−1}(0). Hence, to apply the theorems in practice, we have to find constants for which Assumptions 3.1 and 3.2 are satisfied, use these constants to bound and compute the step lengths as described in the theorems, and initialize close enough to û. In Appendix B we consider a relaxation of Assumption 3.2(iv), which in turn requires larger γ_G and γ_{F^*} instead of θ_x ≥ ρ_yω^{−1} and θ_y ≥ ωρ_x.

The following theorem provides conditions sufficient for weak convergence of the sequence {u^i}_{i∈ℕ} generated by Algorithm 1.1. Apart from the technical requirements of Theorem 4.2, we require additional weak-to-strong continuity of the mapping u ↦ K_{yx}(u)x. While its verification depends on the particular choice of K, it is trivially satisfied in two cases: (i) X and Y are finite-dimensional and K_{yx} is continuous; or (ii) the mapping u ↦ K_{yx}(u)x is linear and compact.

Theorem 6.1 (weak convergence: ω_i ≡ 1). Suppose Assumptions 3.1 to 5.2 hold for some R_K > 0; L_{yx} ≥ 0; λ_x, λ_y, θ_x, θ_y ≥ 0; and ξ_x, ξ_y ∈ ℝ such that

(6.1a)  ξ_x = γ_G,   θ_y ≥ ρ_x,
(6.1b)  ξ_y = γ_{F^*},   θ_x ≥ ρ_y.
For some 0 < δ < µ < 1, choose

(6.2)  τ_i ≡ τ < δ/(2λ_x + 3L_{yx}ρ_y),   σ_i ≡ σ ≤ (R_K²τ(1 − µ)^{−1} + λ_y)^{−1},   and   ω_i ≡ 1.

Furthermore, suppose that

(i) u^i ⇀ ū implies that K_{yx}(u^i)x → K_{yx}(ū)x for all x ∈ X,

and either

(iia) the mapping u ↦ (K_x(u), K_y(u)) is weak-to-strong continuous in U(ρ_x, ρ_y); or

(iib) the mapping u ↦ (K_x(u), K_y(u)) is weak-to-weak continuous, but Assumption 3.1 (monotonicity of ∂G and ∂F^*) and Assumption 3.2(iv) (the three-point condition on K) hold at any weak limit ū = (x̄, ȳ) of {u^i}_{i∈ℕ} for the same choices of θ_x and θ_y.
Then the sequence {u^i}_{i∈ℕ} generated by Algorithm 1.1 converges weakly to some ū ∈ H^{−1}(0) (possibly different from û).

Since it is assumed that θ_x ≥ ρ_y, we can replace ρ_y by θ_x in the bound on τ in (6.2) if the latter is more readily available. For constant τ, σ, and ω =
1, we have to set ψ_i ≡ ψ and ϕ_i ≡ ϕ to satisfy (4.8a). Consequently, applying Corollary 4.3 to bound Z_{i+1}M_{i+1} from below will not help to prove Theorem 6.1. We will instead make use of the following enhanced version of Opial's lemma.

Lemma 6.2 ([10, Lemma A.2]).
Let U be a Hilbert space, Û ⊂ U (not necessarily closed or convex), and {u^i}_{i∈ℕ} ⊂ U. Also let A_i ∈ 𝕃(U; U) be self-adjoint with A_i ≥ ε̂I for some ε̂ > 0 and all i ∈ ℕ. If the following conditions hold, then u^i ⇀ ū in U for some ū ∈ Û:

(i) The sequence {‖u^i − û‖_{A_i}}_{i∈ℕ} is nonincreasing for some û ∈ Û.

(ii) All weak limit points of {u^i}_{i∈ℕ} belong to Û.

(iii) There exists C > 0 such that ‖A_i‖ ≤ C for all i, and for any weakly convergent subsequence {u^{i_k}}_{k∈ℕ} there exists A_∞ ∈ 𝕃(U; U) such that A_{i_k}u → A_∞u strongly in U for all u ∈ U.

Proof of Theorem 6.1. We first verify (4.8) so that we can apply Theorem 4.2 and Lemma 5.3. We set ψ_N ≡ 1, ϕ_N ≡ στ^{−1}, γ̃_G = γ̃_{F^*} = 0, ω̄ = ω̲ = ω = 1, and ξ_x, ξ_y, θ_x, θ_y satisfying (6.1). With the choice ω =
1, the bounds (6.2) thus ensure (4.8c). Hence (4.8) holds, which together with Assumption 5.2 allows us to apply Lemma 5.3 to deduce {u^i}_{i∈ℕ} ∈ U(ρ_x, ρ_y) and {x̄^{i+1}}_{i∈ℕ} ∈ B(x̂, ρ_x). Therefore there exists at least one weak limit point of {u^i}_{i∈ℕ}. Moreover, (4.9) yields self-adjointness of Z_{i+1}M_{i+1}, and since the bounds (6.2) are strict, Theorem 4.2 holds with Δ_{i+1} ≤ −δ̂‖u^{i+1} − u^i‖² for some δ̂ > 0.

We intend to apply Lemma 6.2 with Û = H^{−1}(0) and A_i = Z_{i+1}M_{i+1}. Estimate (4.7) is valid for any starting iterate; thus, taking N = 1 and starting from u^i instead of u^0, we obtain ½‖u^{i+1} − û‖²_{Z_{i+2}M_{i+2}} ≤ ½‖u^i − û‖²_{Z_{i+1}M_{i+1}} + Δ_{i+1} for any Δ_{i+1} ≤ 0, which verifies condition (i) of Lemma 6.2. Condition (iii) follows from the boundedness of Z_{i+1}M_{i+1} together with assumption (i) of the theorem, that is, K_{yx}(u^i)x → K_{yx}(ū)x for all x ∈ X if u^i ⇀ ū. Hence we only need to verify (ii), i.e., that if a subsequence of {u^i}_{i∈ℕ} converges weakly to some ū, then ū ∈ H^{−1}(0).

We note that W_{i+1} ≡ W, and (IPP) implies that v^{i+1} ∈ WA(u^{i+1}) for

A(u^{i+1}) := ( ∂G(x^{i+1}) − γ_G(x^{i+1} − x̄) ; ∂F^*(y^{i+1}) − γ_{F^*}(y^{i+1} − ȳ) )

and

(6.3)  v^{i+1} := W( −K_x(x^{i+1}, y^{i+1}) − γ_G(x^{i+1} − x̄) ; K_y(x^{i+1}, y^{i+1}) − γ_{F^*}(y^{i+1} − ȳ) ) − M_{i+1}(u^{i+1} − u^i)
(6.4)    − W( K_x(x^i, y^i) − K_x(x^{i+1}, y^{i+1}) + K_{xy}(x^i, y^i)(y^{i+1} − y^i) ; K_y(x^{i+1}, y^{i+1}) − K_y(x̄^{i+1}, y^i) − K_{yx}(x^i, y^i)(x^{i+1} − x̄^{i+1}) ).

Therefore it suffices to show that if u^{i_k} ⇀ ū = (x̄, ȳ) for a subsequence, then

v^{i_k} ⇀ v̄ := W( −K_x(x̄, ȳ) ; K_y(x̄, ȳ) )   and   v̄ ∈ WA(ū),

which by construction is equivalent to ū ∈ H^{−1}(0).
Note that A is maximally monotone, since by Assumption 3.1 it only involves (shifted) subgradient mappings of proper, convex, and lower semicontinuous functions. Moreover, further use of (4.7) shows that Σ_{i=0}^∞ δ̂‖u^{i+1} − u^i‖² < ∞ and hence that ‖u^{i+1} − u^i‖ → 0.

(a) If assumption (iia) holds, we obtain that v^{i_k} → v̄, and the required inclusion follows from the fact that the graph of the maximally monotone operator A is sequentially weakly–strongly closed; see [4, Proposition 16.36].

(b) If assumption (iib) holds, then only v^{i_k} ⇀ v̄. In this case, we can apply the Brezis–Crandall–Pazy lemma [4, Corollary 20.59(iii)] to obtain the required inclusion under the additional condition that lim sup_{k→∞}⟨u^{i_k} − ū, v^{i_k} − v̄⟩ ≤
0. In our case, recalling that the last two terms of (6.4) converge strongly to zero, we have that

lim sup_{k→∞}⟨u^{i_k} − ū, v^{i_k} − v̄⟩ ≤ lim sup_{i→∞}⟨u^i − ū, v^i − v̄⟩ = lim sup_{i→∞} q_i

for

q_i := ⟨K_x(x̄, ȳ) − K_x(x^{i+1}, y^{i+1}), x^{i+1} − x̄⟩ + ⟨K_y(x^{i+1}, y^{i+1}) − K_y(x̄, ȳ), y^{i+1} − ȳ⟩ − γ_{F^*}‖y^{i+1} − ȳ‖² − γ_G‖x^{i+1} − x̄‖².

Defining

d_x^i := ⟨K_y(x^{i+1}, y^{i+1}) − K_y(x̄, y^{i+1}) + K_{yx}(x^{i+1}, y^{i+1})(x̄ − x^{i+1}), y^{i+1} − ȳ⟩
  − ⟨K_x(x^i, ȳ) − K_x(x̄, ȳ), x^{i+1} − x̄⟩ − γ_G‖x^{i+1} − x̄‖²

and

d_y^i := ⟨K_x(x^i, ȳ) − K_x(x^i, y^i) − K_{xy}(x^i, y^i)(ȳ − y^i), x^{i+1} − x̄⟩
  − ⟨K_y(x^{i+1}, y^{i+1}) − K_y(x^{i+1}, y^i) + K_y(x̄, ȳ) − K_y(x̄, y^{i+1}), y^{i+1} − ȳ⟩ − γ_{F^*}‖y^{i+1} − ȳ‖²,

we rearrange and estimate

(6.5)  q_i = d_x^i + d_y^i + ⟨K_y(x^{i+1}, y^{i+1}) − K_y(x^{i+1}, y^i), y^{i+1} − ȳ⟩
  + ⟨(K_{xy}(x^{i+1}, y^{i+1}) − K_{xy}(x^i, y^i))(y^i − ȳ), x^{i+1} − x̄⟩
  + ⟨K_x(x^i, y^i) − K_x(x^{i+1}, y^{i+1}) + K_{xy}(x^{i+1}, y^{i+1})(y^{i+1} − y^i), x^{i+1} − x̄⟩
  ≤ d_x^i + d_y^i + O(‖u^{i+1} − u^i‖).

Using ξ_x = γ_G, ξ_y = γ_{F^*}, (3.5), and both Assumption 3.1 and Assumption 3.2(iv) at ū, we estimate q_i ≤ O(‖u^{i+1} − u^i‖), since

d_x^i ≤ (‖y^{i+1} − ȳ‖ − θ_x)‖K_y(x^{i+1}, y^{i+1}) − K_y(x̄, y^{i+1}) + K_{yx}(x^{i+1}, y^{i+1})(x̄ − x^{i+1})‖ ≤ 0,
d_y^i ≤ (‖x^{i+1} − x̄‖ − θ_y)‖K_x(x^i, ȳ) − K_x(x^i, y^i) − K_{xy}(x^i, y^i)(ȳ − y^i)‖ ≤ 0.
In the last bounds we used θ_x ≥ ρ_y and θ_y ≥ ρ_x, as well as that ‖y^{i+1} − ȳ‖ ≤ ρ_y, which follows from the bounds on both ‖y^{i+1} − ŷ‖ and ‖ŷ − ȳ‖; likewise ‖x^{i+1} − x̄‖ ≤ ρ_x. Since ‖u^{i+1} − u^i‖ →
0, we obtain that lim sup_{i→∞} q_i ≤
0. The Brezis–Crandall–Pazy lemma thus yields the desired inclusion v̄ ∈ WA(ū).

Hence in both cases ū ∈ H^{−1}(0), and condition (ii) of Lemma 6.2 is satisfied. Applying Lemma 6.2, we obtain the claim. □

We now provide convergence rates under additional assumptions of strong convexity of G and/or F^*, although we still allow non-convexity of the overall problem through K. To be specific, we require that we can take the acceleration or step length update factors γ̃_G > 0 or γ̃_{F^*} > 0. For the accelerated rates below, we need γ̃_G >
0, which is the case, for instance, when G is strongly convex and (3.4a) holds with ξ_x =
0. Since we obtain a fortiori strong convergence from the rates, we do not require the additional assumptions on K introduced in Theorem 6.1; on the other hand, we only obtain convergence of the primal iterates. Similarly to the linear case of [10], the step length choice follows directly from having to satisfy (4.8b) and the desire to keep the right-hand side of the σ-rule (4.8c) constant.
Theorem 6.3 (convergence rates under acceleration: ω_i ≡ 1). Suppose Assumptions 3.1 to 5.2 hold for some R_K > 0; L_{yx} ≥ 0; λ_x, λ_y, θ_x, θ_y ≥ 0; and ξ_x, ξ_y ∈ ℝ such that for some γ̃_G > 0,

(6.6a)  ξ_x = γ_G − γ̃_G,   θ_y ≥ ρ_x,
(6.6b)  ξ_y = γ_{F^*},   θ_x ≥ ρ_y.

Choose

(6.7)  τ_{i+1} = τ_i(1 + 2γ̃_Gτ_i)^{−1},   σ_{i+1} ≡ σ,   and   ω_i ≡ 1,

satisfying for some 0 < δ ≤ µ < 1 the bounds

(6.8)  0 < τ_0 ≤ δ/(2λ_x + 3L_{yx}ρ_y)   and   0 < στ_0 ≤ (1 − µ)R_K^{−2}.

Then ‖x^N − x̂‖² converges to zero at the rate O(1/N).

Proof. We again first verify (4.8) so that we can apply Theorem 4.2 and Lemma 5.3. Setting ψ_i ≡ 1, η_i ≡ σ, ϕ_i := στ_i^{−1}, and γ̃_{F^*} =
0, (4.8a) follows from the σ-rule of (6.7) and the choice of ψ_i, η_i, and ϕ_i. Using (6.7) and τ_i := σϕ_i^{−1}, we obtain ϕ_{i+1} = (1 + 2γ̃_Gτ_i)ϕ_i, and hence (4.8b) follows. Since τ_i ≤ τ_0 and λ_y ≥
0, (4.8c) follows from (6.8) and ω_i ≡
1. Furthermore, (4.8d) and (4.8e) are satisfied due to the assumed bounds (6.6) on ξ_x, ξ_y, θ_x, and θ_y, taking ω̄ = ω̲ = 1. Hence Theorem 4.2 and Lemma 5.3 apply, and (4.7) holds with Δ_{i+1} =
0. We now estimatethe convergence rate from (4.7) by bounding Z N + M N + from below. Using Corollary 4.3, we obtain δϕ N (cid:107) x N − (cid:98) x (cid:107) ≤ (cid:107) u − (cid:98) u (cid:107) Z M . Moreover, ϕ N + = ( + (cid:101) γ G τ N ) ϕ N = ϕ N + (cid:101) γ G σ = . . . = ϕ + N (cid:101) γ G σ , which yields the claim. (cid:3) Theorem 6.4 (linear convergence: ω i < ). Suppose Assumptions 3.1 to 5.2 hold for some R K > ; L y x ≥ ; λ x , λ y ≥ ; and (cid:101) γ G , (cid:101) γ F ∗ > as well as ξ x = γ G − (cid:101) γ G , θ y ≥ ωρ x , (6.9a) ξ y = γ F ∗ − (cid:101) γ F ∗ , θ x ≥ ρ y ω − (6.9b) with (6.10) τ i ≡ τ , σ i ≡ σ : = τ (cid:101) γ G (cid:101) γ − F ∗ , and ω i ≡ ω : = ( + (cid:101) γ G τ ) − . Assume for some < δ ≤ µ < the bound (6.11) τ ≤ min (cid:40) δλ x + L y x ρ y , (cid:101) γ F ∗ (cid:101) γ − G λ y + (cid:113) λ y + (cid:101) γ F ∗ (cid:101) γ − G ( R K ( − µ ) − + (cid:101) γ G λ y ) (cid:41) . Then (cid:107) u N − (cid:98) u (cid:107) converges to zero with the linear rate O ( ω N ) .Proof. We will use Theorem 4.2 and Lemma 5.3, for both of which we need to verify (4.8) first. We set ω : = ω : = ω , ψ N : = ω ( + σ (cid:101) γ F ∗ ) N = ω ( + (cid:101) γ G τ ) N = ω − N , and ϕ N : = ωστ − ( + τ (cid:101) γ G ) N = ω − N στ − . Clason, Mazurenko, Valkonen Primal–dual proximal splitting and generalized conjugation . . .rxiv: 1901.02746v4, 2020-03-19 page 21 of 37
Then ψ_N σ = ϕ_N τ, verifying (4.8a) and (4.8b). We next observe that, substituting σ_i = τγ̃_G γ̃_F∗^{-1}, the first bound of (4.8c) is tantamount to requiring

τ (τ R_K (1 − µ)^{-1} + λ_y ω^{-1}) ≤ γ̃_F∗ γ̃_G^{-1}.

Substituting ω = (1 + 2γ̃_G τ)^{-1}, this in turn is equivalent to

(R_K (1 − µ)^{-1} + 2γ̃_G λ_y) τ² + λ_y τ − γ̃_F∗ γ̃_G^{-1} ≤ 0,

which after solving a quadratic inequality for τ yields the second bound of (6.11). Since ω ≤ 1, the first bound of (6.11) gives the second bound of (4.8c). Finally, (4.8d) and (4.8e) follow directly from (6.9) with ω̄ = ω.

Since Assumption 5.2 and (4.8) hold, we can apply Lemma 5.3 to obtain {u^i}_{i∈ℕ} ⊂ U(ρ_x, ρ_y) and {x̄^{i+1}}_{i∈ℕ} ⊂ 𝔹(x̂, ρ_x). Moreover, (4.9) yields self-adjointness of Z_{i+1}M_{i+1}. Consequently, we can apply Theorem 4.2 and Lemma 5.3 to arrive at (4.7) with ∆_{i+1} = 0. It remains to bound Z_{N+1}M_{N+1} from below. Using Corollary 4.3, we obtain that

(6.12)  ω^{-N} ( δσωτ^{-1} ∥x^N − x̂∥² + (µ − δ)(1 − δ)^{-1} ∥y^N − ŷ∥² ) ≤ ∥u⁰ − û∥²_{Z₀M₀}.

Since ω ∈ (0, 1), this gives the claimed linear convergence rate through the exponential growth of ω^{-N}. □

Remark 6.5. If K(x, y) = ⟨A(x), y⟩ for some A ∈ C¹(X; Y), then K_x(x, y) = [∇A(x)]∗y and K_y(x, y) = A(x) with L_y(x) = 0 and L_{yx} = L for L a local Lipschitz factor of ∇A. Furthermore, Assumption 3.2, the step-length bounds, and the update rules required in Theorem 6.1 or 6.4 reduce to the corresponding ones introduced in [10] for this case. As for acceleration, Theorem 6.3 now gives a weaker convergence rate of O(1/N) compared to O(1/N²) in [10, Theorem 4.3]. This is due to (4.8c) requiring σ_i to be bounded whenever λ_y > 0, even when τ_i goes to zero.

Before we conclude this section, we refine Assumption 5.2 by showing that its implicit requirements do not add any additional step-length bounds provided the starting point is sufficiently close to û.

Proposition 6.6.
Under the assumptions of Theorem 6.1, 6.3, or 6.4, suppose that ρ_x, ρ_y > 0. Then there exists ε > 0 such that Assumption 5.2 holds whenever the initial iterate u⁰ = (x⁰, y⁰) satisfies

(6.13)  r_max := √( δ^{-1} (∥x⁰ − x̂∥² + ν^{-1} ∥y⁰ − ŷ∥²) ) ≤ ε  with  ν := σ_0 ω_0 τ_0^{-1}.

Proof.
We take µ, δ, σ_i, τ_i, and ω_i as they are defined in the corresponding Theorem 6.1, 6.3, or 6.4, and L_x(ŷ), L_y(x̂), R_K from Assumption 3.2. We need to show that there exist δ_x, δ_y > 0 and r_y ≥ r_max √(ν(1 − δ)δ(µ − δ)^{-1}) such that (5.7) holds and

(6.14)  𝔹(x̂, r_max + δ_x) × 𝔹(ŷ, r_y + δ_y) ⊆ U(ρ_x, ρ_y).

Let ε > 0, and set r_y := ε √(ν(1 − δ)δ(µ − δ)^{-1}) as well as δ_x := √ε and δ_y := ρ_y − r_y. Observing (6.13), we then see both that δ_y > 0 for sufficiently small ε > 0 and that r_max ≤ ε in Lemma 5.3. Let

c_ε := min{ δ_x / (R_K r_y + L_x(ŷ) r_max),  δ_y / (L_y(x̂) r_y + R_K (r_max + δ_x)) }.

Since r_y, r_max = O(ε), δ_x = √ε, and δ_y > ρ_y/2 > 0 for small ε > 0, we have c_ε → ∞ as ε → 0. Comparing the definition of c_ε to (5.7), we therefore see that the latter holds for any given τ_0 > 0 and σ_i ≡ σ > 0 once ε > 0 is small enough. Since τ_i ≤ τ_0, the inequalities (5.7) hold. □
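The accelerated step-length rule of Theorem 6.3 can be sanity-checked numerically. The following sketch assumes the primal update τ_{i+1} = τ_i/(1 + 2γ̃_G τ_i), which is what the relation ϕ_{i+1} = (1 + 2γ̃_G τ_i)ϕ_i with ϕ_i = στ_i^{-1} and constant σ amounts to; the parameter values below are arbitrary illustrations, not from the paper's experiments.

```python
import numpy as np

def accelerated_taus(tau0, sigma, gamma_G, n):
    """Primal step lengths under tau_{i+1} = tau_i / (1 + 2*gamma_G*tau_i).

    Equivalently phi_{i+1} = (1 + 2*gamma_G*tau_i) * phi_i for phi_i = sigma/tau_i,
    so phi grows by the constant increment 2*gamma_G*sigma per iteration.
    """
    taus = [tau0]
    for _ in range(n):
        taus.append(taus[-1] / (1.0 + 2.0 * gamma_G * taus[-1]))
    return np.array(taus)

tau0, sigma, g = 0.9, 0.5, 2.0
taus = accelerated_taus(tau0, sigma, g, 1000)
phi = sigma / taus
N = len(taus) - 1
# phi_N = phi_0 + 2*N*gamma_G*sigma, the linear growth behind the O(1/N) rate
print(phi[-1], sigma / tau0 + 2 * N * g * sigma)
```

The linear growth of ϕ_N is exactly what turns the telescoped estimate (4.7) into the O(1/N) rate for ∥x^N − x̂∥².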
Finally, we illustrate the applicability of the proposed approach for the example applications describedin Section 2. The Julia implementation used to generate the following results is on Zenodo [11].
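To make Algorithm 1.1 concrete before the examples, the following NumPy sketch runs it on a toy saddle-point problem with bilinear coupling K(x, y) = ⟨Ax, y⟩, G(x) = ½∥x − b∥², and F∗(y) = ½∥y∥²; the data A, b and the step lengths are our own illustrative choices, not taken from the Julia code of [11]. For this toy problem the saddle point satisfies (I + AᵀA)x̂ = b and ŷ = Ax̂.

```python
import numpy as np

A = np.array([[0.5, 0.2], [0.1, 0.3]])
b = np.array([1.0, -2.0])
tau, sigma, omega = 0.5, 0.5, 1.0                  # tau*sigma*||A||^2 < 1

prox_G = lambda x: (x + tau * b) / (1.0 + tau)     # prox of tau*G, G(x) = 0.5||x-b||^2
prox_Fs = lambda y: y / (1.0 + sigma)              # prox of sigma*F*, F*(y) = 0.5||y||^2
K_x = lambda x, y: A.T @ y                         # partial gradient of K in x
K_y = lambda x, y: A @ x                           # partial gradient of K in y

x = np.zeros(2); y = np.zeros(2)
for _ in range(500):
    x_new = prox_G(x - tau * K_x(x, y))            # primal proximal step
    x_bar = x_new + omega * (x_new - x)            # over-relaxation
    y = prox_Fs(y + sigma * K_y(x_bar, y))         # dual proximal step
    x = x_new

x_hat = np.linalg.solve(np.eye(2) + A.T @ A, b)
print(np.allclose(x, x_hat, atol=1e-8))
```

With strongly convex G and F∗ on both sides, the iteration contracts linearly, so a few hundred iterations reach machine precision here.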
Our first example illustrates the reformulation from Section 2.1 for the two-player elliptic Nash equilibrium problem from [6]. Here the action space of each player is L²(Ω) for a bounded domain Ω ⊂ ℝ^d with boundary ∂Ω. To avoid confusion with the spatial variable, we will in this subsection denote the primal variable by u and the dual variable by v. The set of admissible strategies is

X_k = { w ∈ L²(Ω) : w(x) ∈ [a, b] a.e. x ∈ Ω }  (k = 1, 2).

For a set of strategies u := (u₁, u₂) ∈ X = X₁ × X₂, the payout function for each player is

ϕ_k(u₁, u₂) = ½ ∥S(u₁, u₂) − z_k∥²_{L²(Ω)} + (α_k/2) ∥B_k u_k∥²_{L²(Ω)}  (k = 1, 2),

where α_k > 0, z_k ∈ L²(Ω) are given target states, S : L²(Ω)² → L²(Ω) maps u = (u₁, u₂) to the solution y of the elliptic boundary value problem

(7.1)  −∆y = B₁u₁ + B₂u₂ + f on Ω,  y = 0 on ∂Ω,

B_k : L²(Ω) → L²(Ω) are control operators, which are here chosen as

[B_k w](x) := w(x) if x ∈ ω_k,  0 if x ∉ ω_k,

for some control domains ω_k ⊂ Ω, and f is a common source term. Following Section 2.1, the corresponding Nash equilibrium problem (2.1) can then be solved by applying Algorithm 1.1 to

G : L²(Ω)² → ℝ,  G(u₁, u₂) = δ_X(u₁, u₂),
F∗ : L²(Ω)² → ℝ,  F∗(v₁, v₂) = δ_X(v₁, v₂),
K : L²(Ω)² × L²(Ω)² → ℝ,  K((u₁, u₂), (v₁, v₂)) = [ϕ₁(u₁, u₂) − ϕ₁(v₁, u₂)] + [ϕ₂(u₁, u₂) − ϕ₂(u₁, v₂)].

To implement the algorithm, we need explicit forms of the proximal mappings for G and F∗ and of the partial derivatives of K. Since G = F∗ = δ_X for X = X₁ × X₂, we have

prox_{τG}(w) = prox_{σF∗}(w) = proj_X(w) = (proj_{X₁}(w₁), proj_{X₂}(w₂))

for the metric projections onto the convex sets X_k, given pointwise almost everywhere by

[proj_{X_k}(w_k)](x) = b if w_k(x) > b,  w_k(x) if w_k(x) ∈ [a, b],  a if w_k(x) < a.

It remains to address the computation of K_u(u, v) and K_v(u, v).
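Since G = F∗ = δ_X, both proximal mappings reduce to the componentwise clipping above; a minimal NumPy sketch (the box bounds a, b are parameters of the illustration):

```python
import numpy as np

def proj_box(w, a=-0.5, b=0.5):
    """Pointwise projection onto {w : w(x) in [a, b] a.e.}.

    Equals prox of the indicator delta_X for any step length,
    since prox_{tau*delta_X} = proj_X.
    """
    return np.clip(w, a, b)

w = np.array([-1.2, 0.1, 0.9])
print(proj_box(w))  # each component clipped into [-0.5, 0.5]
```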
Using adjoint calculus and the linearity of the adjoint equation, we have that

K_u(u, v) = ( p₁(u, v) + α₁u₁,  p₂(u, v) + α₂u₂ ),  K_v(u, v) = ( q₁(u, v) − α₁v₁,  q₂(u, v) − α₂v₂ ),

where p₁(u, v) =: p₁ and p₂(u, v) =: p₂ are the solutions to the equations

−∆p₁ = 2S(u₁, u₂) − S(u₁, v₂) − z₁,  −∆p₂ = 2S(u₁, u₂) − S(v₁, u₂) − z₂,

and q₁(u, v) =: q₁ and q₂(u, v) =: q₂ are the solutions to the equations

−∆q₁ = −S(v₁, u₂) + z₁,  −∆q₂ = −S(u₁, v₂) + z₂,

all with homogeneous Dirichlet conditions. Hence, every iteration of Algorithm 1.1 requires nine solutions of a partial differential equation (recall that K_v is evaluated at (ū^{i+1}, v^i), while K_u is evaluated at (u^i, v^i)). Since S and hence K_u and K_v are affine in u and v, the assumptions of Theorem 6.1 are satisfied for sufficiently small step lengths. Since neither F∗ nor G is strongly convex, no acceleration is possible.

Figure 3: Constructed solution (u₁∗, u₂∗) for the elliptic NEP example.

For our numerical tests we follow [6] and consider a finite-difference discretization of (7.1) on Ω = (0, 1)² with N nodes in each direction, ω₁ = (0, 1) × (0, 1/2) and ω₂ = (0, 1) × (1/2, 1), as well as a = −0.5, b = 0.5, and α₁ = α₂ = 1. Using the method of manufactured solutions, z₁, z₂, and f are chosen such that the solution u∗ = (u₁∗, u₂∗) of the Nash equilibrium problem is known a priori; see Figure 3. By construction, the saddle point then satisfies v∗ = u∗ and hence Ψ(u∗, v∗) = 0. Since the Lipschitz constants for K and its derivatives are not available, we simply take the parameters in Algorithm 1.1 as σ_{i+1} ≡ σ and τ_i ≡ τ = 0.99 as well as ω = 1.0. The results of the algorithm for different values of N are shown in Table 1, which reports the distance of the primal-dual iterates (u^i, v^i) to the exact solution. As can be seen, the iteration converges in each case to machine precision within 5 iterations, and the convergence behavior is virtually identical. This demonstrates the mesh independence expected from an algorithm for which convergence can be shown in function spaces.
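Each evaluation of K_u and K_v requires Poisson solves of the form (7.1); the following is a minimal dense finite-difference Dirichlet solver on the unit square, given as an illustrative sketch rather than the discretization actually used in [6] or in the accompanying Julia code:

```python
import numpy as np

def poisson_solve(rhs, h):
    """Solve -Laplace(y) = rhs on the unit square with homogeneous Dirichlet
    conditions, using the standard 5-point stencil on an n-by-n interior grid."""
    n = rhs.shape[0]
    I = np.eye(n)
    T = 2 * I - np.eye(n, k=1) - np.eye(n, k=-1)   # 1d second-difference matrix
    L = (np.kron(I, T) + np.kron(T, I)) / h**2     # 2d discrete Laplacian
    return np.linalg.solve(L, rhs.ravel()).reshape(n, n)

# manufactured solution y = sin(pi x1) sin(pi x2), so -Laplace(y) = 2 pi^2 y
n = 30; h = 1.0 / (n + 1)
grid = np.arange(1, n + 1) * h
X1, X2 = np.meshgrid(grid, grid, indexing="ij")
y_exact = np.sin(np.pi * X1) * np.sin(np.pi * X2)
y_h = poisson_solve(2 * np.pi**2 * y_exact, h)
print(np.max(np.abs(y_h - y_exact)))               # O(h^2) discretization error
```

A sparse factorization would of course be preferred at realistic grid sizes; the dense solve keeps the sketch self-contained.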
Table 1: Results for the elliptic NEP example for different N: distance of the primal–dual iterates to the exact solution.

ℓ⁰-tv denoising

Our next example concerns the ℓ⁰-TV denoising or segmentation problem from Section 2.2. Recall that we can solve the (Huber-regularized) ℓ⁰-TV problem (2.3) by applying Algorithm 1.1 to

G : ℝ^{N×N} → ℝ,  G(x) = ∥x − f∥² / (2α),
F∗_γ : ℝ^{N×N×2} → ℝ,  F∗_γ(y) = (γ/2) ∥y∥²,
K_p : ℝ^{N×N} × ℝ^{N×N×2} → ℝ,  K_p(x, y) = κ_p(D_h x, y),

for p ∈ {1, ∞} and γ ≥ 0, where D_h : ℝ^{N×N} → ℝ^{N×N×2} is the discrete gradient. We write H_γ for the H defined in (4.1) corresponding to F∗ = F∗_γ. Since G and F∗_γ are quadratic, a simple computation shows that

prox_{τG}(x) = (1 + τα^{-1})^{-1} (x + τα^{-1} f)  and  prox_{σF∗_γ}(y) = (1 + γσ)^{-1} y,

where all operations are to be understood componentwise. For the derivatives of K_p, we have by the chain rule

(7.2)  K_x(x, y) = D_h^T κ_{p,z}(D_h x, y),  K_y(x, y) = κ_{p,y}(D_h x, y),

where D_h^T is the discrete (negative) divergence. For the partial derivatives κ_{p,z}(z, y) and κ_{p,y}(z, y), we again distinguish the cases p = 1 and p = ∞.

For p = 1, we have componentwise

[κ_{1,z}(z, y)]_{ijk} = (1 − z_{ijk} y_{ijk}) y_{ijk},  [κ_{1,y}(z, y)]_{ijk} = (1 − z_{ijk} y_{ijk}) z_{ijk}.

For p = ∞, we have componentwise

[κ_{∞,z}(z, y)]_{ijk} = (1 − z_{ij1} y_{ij1} − z_{ij2} y_{ij2}) y_{ijk},  [κ_{∞,y}(z, y)]_{ijk} = (1 − z_{ij1} y_{ij1} − z_{ij2} y_{ij2}) z_{ijk}.

It remains to choose valid step sizes for Algorithm 1.1, for which the next result gives useful estimates. We recall from [7] that a forward-differences discretization of the gradient operator satisfies ∥D_h∥ ≤ √8/h. Recalling (7.2) and the definitions of G and F∗_γ, a critical point (x̂, ŷ) ∈ H_γ^{-1}(0) satisfies

(7.3)  0 = α^{-1}(x̂ − f) + D_h^T κ_{p,z}(D_h x̂, ŷ)  and  γŷ = κ_{p,y}(D_h x̂, ŷ).
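The two proximal mappings are simple componentwise formulas. The following NumPy sketch assumes G(x) = ∥x − f∥²/(2α) and F∗_γ(y) = (γ/2)∥y∥², consistent with the strong convexity factors α^{-1} and γ appearing in Corollary 7.2, and checks the prox of G by verifying that the gradient of its defining objective vanishes at the returned point:

```python
import numpy as np

def prox_G(x, f, alpha, tau):
    """prox of tau*G for G(x) = ||x - f||^2 / (2*alpha):
    the minimizer of ||z - f||^2/(2*alpha) + ||z - x||^2/(2*tau)."""
    return (x + (tau / alpha) * f) / (1.0 + tau / alpha)

def prox_Fs(y, gamma, sigma):
    """prox of sigma*F* for F*(y) = (gamma/2)*||y||^2: componentwise shrinkage."""
    return y / (1.0 + gamma * sigma)

# first-order check: the prox objective's gradient vanishes at the prox point
rng = np.random.default_rng(0)
x, f = rng.normal(size=5), rng.normal(size=5)
alpha, tau = 0.3, 0.7
z = prox_G(x, f, alpha, tau)
grad = (z - f) / alpha + (z - x) / tau
print(np.max(np.abs(grad)))  # ~ 0 up to rounding
```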
For brevity, we set

m̂_x := max_{ij} |[D_h x̂]_{·ij}|  and  m̂_y := max_{ij} |ŷ_{·ij}|  (p = ∞),
m̂_x := max_{kij} |[D_h x̂]_{kij}|  and  m̂_y := max_{kij} |ŷ_{kij}|  (p = 1).

Using the results of Appendix C, we verify the fundamental Assumption 3.2.
Corollary 7.1.
Let K = K p for either p = or p = ∞ . Choose L ≥ (cid:107) D h (cid:107) and R K > L . Then Assumption 3.2holds for some θ x , θ y > and ρ x , ρ y > with L x ( y ) = L (cid:107) y (cid:107) , L y ( x ) = L (cid:107) x (cid:107) , L y x = L sup y ∈ (cid:130) ( (cid:98) y , ρ y ) (cid:107) y (cid:107) , and the constants ξ x , ξ y > , λ x , λ y ≥ satisfying (7.4) ξ x λ x > L ( L − λ x + (cid:98) m y ) (cid:98) m y and λ y > (cid:98) m x . Proof.
We consider only p = ∞ as the proof for p = (cid:101) R K >
2, Lemma c.1 appliedcomponentwise shows that the operator κ p satisfies Assumption 3.2 for some (cid:101) θ z , (cid:101) θ y > (cid:101) ρ x , (cid:101) ρ y > (cid:101) R K ) when we take (cid:101) L z ( y ) = (cid:107) y (cid:107) , (cid:101) L y ( z ) = (cid:107) z (cid:107) , and (cid:101) L yz = y ∈ (cid:130) ( (cid:98) y , (cid:102) ρ y ) (cid:107) y (cid:107) . Moreover, the constants (cid:101) ξ z , (cid:101) ξ y ∈ (cid:82) and (cid:101) λ z , (cid:101) λ y ≥ (cid:101) ξ z (cid:101) λ z > max ij ( λ z + (cid:107) (cid:98) y · ij (cid:107) )(cid:107) (cid:98) y · ij (cid:107) as well as (cid:101) ξ y > (cid:101) λ y > max ij (cid:107) (cid:98) z · ij (cid:107) for (cid:98) z = D h (cid:98) x .By Lemma c.2 on compositions with a linear operator, we can now take R K = (cid:101) R K L , ρ x = L − (cid:101) ρ x , ρ y = (cid:101) ρ y , ξ x = L (cid:101) ξ z , ξ y = (cid:101) ξ y , λ x = L (cid:101) λ z , λ y = (cid:101) λ y , θ x = (cid:101) θ z , θ y = (cid:101) θ y L − , L x ( y ) = L (cid:101) L z ( y ) , L y ( x ) = (cid:101) L y ( D h x ) , L y x = L (cid:101) L yz . These give the claim. (cid:3)
We now obtain from Theorem 6.4 the following estimate.
Corollary 7.2.
Suppose Assumption 3.1 holds. Choose L ≥ ∥D_h∥. For some γ̃_G ∈ (0, α^{-1}) and γ̃_F∗ ∈ (0, γ), take ξ_x = α^{-1} − γ̃_G and ξ_y = γ − γ̃_F∗ as well as λ_x, λ_y ≥ 0 such that (7.4) holds. For some 0 < δ ≤ µ < 1, take σ = τγ̃_G γ̃_F∗^{-1} and ω := (1 + 2γ̃_G τ)^{-1} as well as

(7.5)  τ < min{ δ/λ_x,  2γ̃_F∗ γ̃_G^{-1} / (λ_y + √(λ_y² + 4γ̃_F∗ γ̃_G^{-1} (2L² (1 − µ)^{-1} + 2γ̃_G λ_y))) }.

Then ∥u^N − û∥² converges to zero with the linear rate O(ω^N) provided u⁰ is close enough to û.

Proof. The assumptions γ̃_G ∈ (0, α^{-1}) and γ̃_F∗ ∈ (0, γ) ensure ξ_x, ξ_y > 0. Since we have assumed (7.4), Corollary 7.1 yields Assumption 3.2 for any R_K > 2L² and some θ_x, θ_y > 0. We next use Theorem 6.4, whose conditions we need to verify. First, taking ρ_x, ρ_y > 0 small enough, we can satisfy θ_x ≥ ρ_y ω^{-1} and θ_y ≥ ωρ_x. Furthermore, the strict inequality in (7.5) implies (6.11) for sufficiently small ρ_y > 0. Finally, Proposition 6.6 ensures that we can satisfy Assumption 5.2 by taking u⁰ sufficiently close to û. The rest of the conditions we have assumed explicitly, so we can apply Theorem 6.4 to finish the proof. □
Recall that Assumption 3.1 is a second-order growth condition at the critical point (x̂, ŷ), which is a common assumption needed to show convergence of algorithms for non-convex optimization problems. To calculate the upper bounds on τ in (7.5), we need to find λ_x, λ_y ≥ 0 satisfying (7.4), and hence estimates of m̂_x and m̂_y. To do this, note that the critical point conditions (7.3) imply

(7.6)  ŷ_{·ij} = [D_h x̂]_{·ij} / (|[D_h x̂]_{·ij}|² + γ)  (p = ∞)  and  ŷ_{kij} = [D_h x̂]_{kij} / (|[D_h x̂]_{kij}|² + γ)  (p = 1).

Since t ↦ t/(t² + γ) is increasing on [0, √γ], we can estimate m̂_y based on m̂_x. Since any solution of the Potts problem should be piecewise constant with very few intensity quantization levels, we can estimate m̂_x as the expected maximal jump between neighboring pixels. We take this as 100% of the dynamic range for safety. In practice, a practical choice of γ > 0 is small, while the bounds (7.4) require control of m̂_y; we therefore use an over-approximation γ̄ ≥ γ in (7.6). We remark that we thus cannot guarantee convergence of Algorithm 1.1 for small γ > 0; however, we demonstrate below that these estimates can still lead to useful step sizes for such cases. Similarly, we do not have an estimate for the unknown local neighborhood of convergence; we compensate for this by taking a small δ and starting from the initial iterate u⁰ = (x⁰, y⁰) with x⁰ = f.

As a test image, we choose "blobs" from the ImageJ framework [30] (cf. Figure 2) and use the accelerated step size rule from Theorem 6.4. To do this, we need to satisfy (7.5) for the primal step length τ. We discretize the problem such that h = 1 and L = √8. Furthermore, we set γ̃_F∗ = γ/100 and γ̃_G = α̃^{-1} for a slightly enlarged α̃ > α. The above estimates then lead to step length parameters τ, σ, and ω for p = 1 and for p = ∞. Since the exact solution (x̂, ŷ) is not available here, we instead use x_max := x^{N_max} and similarly y_max as references for computing errors. The corresponding reference images x_max obtained from Algorithm 1.1 after N_max iterations are shown in Figures 4b and 4c for p = 1 and p = ∞, respectively. While the evaluation of the formulation and the algorithm in the context of image processing is outside the scope of this work, we briefly comment on the difference between p = 1 and p = ∞. As can be seen by comparing the two images, the results are very similar. However, since diagonal jumps are penalized less for p = ∞, the isotropic Huber–Potts model is better able to preserve small light blobs such as the one indicated by the red circles. The edges of the blobs are also noticeably smoother.

The convergence behavior of the method for both choices of p over N_max/2 iterations is given in Figure 5. For the function values, we observe in Figure 5a the usual fast decrease in the beginning of the iteration, after which the values stagnate. Nevertheless, the errors continue to decrease down to machine precision at the predicted linear rate. The convergence behavior for p = 1 and p = ∞ is similar, although the linear convergence for p = ∞ is with a significantly smaller constant. We remark that visually, the iterates in both cases are indistinguishable from the reference images long before the errors reach machine precision. This is consistent with Figure 5b, since the total error is dominated by the dual component, which acts as an edge indicator; small changes of the boundaries of the blobs during the iteration will, even for small gray value changes, lead to large differences in the dual variable.
Figure 4: ℓ⁰-TV denoising: original image f (a) and reference iterates x_max for the anisotropic (p = 1, (b)) and isotropic (p = ∞, (c)) Huber–Potts model.

Figure 5: ℓ⁰-TV denoising: convergence of the function values F_γ(x^N) + G(x^N) (a) and of the errors ∥x^N − x_max∥ + ∥y^N − y_max∥ (b) for p = 1 and p = ∞.

conclusion

Using generalized conjugation, some non-smooth non-convex optimization problems can be transformed into saddle-point problems involving non-smooth convex functionals and a smooth non-convex-concave coupling term. For such problems, a generalized primal–dual proximal splitting method can be applied that converges weakly under step length conditions if a local quadratic growth condition is satisfied near a saddle point. Under additional strong convexity assumptions on the functionals (but not on the coupling term and hence the problem), convergence rates for accelerated algorithms can be shown. This approach can be applied to elliptic Nash equilibrium problems and to the anisotropic and isotropic Huber-regularized Potts models, as the numerical examples illustrate. Future work is concerned with further evaluating and comparing the performance of the proposed algorithm for these examples.

acknowledgments
In the first stages of the research T. Valkonen and S. Mazurenko were supported by the EPSRC FirstGrant EP/P021298/1, “PARTIAL Analysis of Relations in Tasks of Inversion for Algorithmic Leverage”.
Later T. Valkonen was supported by the Academy of Finland grants 314701 and 320022. C. Clason wassupported by the German Science Foundation (DFG) under grant Cl 487/2-1. We thank the anonymousreviewers for insightful comments. a data statement for the epsrc
The source codes for the numerical experiments are on Zenodo at [11]. appendix a reductions of the three-point condition
The following two propositions demonstrate that Assumption 3.2 (iv) is closely related to standard second-order optimality conditions, i.e., that the Hessian is positive definite at the solution û.

Proposition a.1.
Suppose Assumption 3.2 (ii) (locally Lipschitz gradients of K) holds in some neighborhood U of û, and for some ξ_x ∈ ℝ, γ_x > 0,

(a.1)  ξ_x ∥x − x̂∥² + ⟨K_x(x, ŷ) − K_x(x̂, ŷ), x − x̂⟩ ≥ γ_x ∥x − x̂∥²  ((x, y) ∈ U).

Then (3.4a) holds in U with θ_x = 2(γ_x − α) L_{yx}^{-1} and λ_x = L_x(ŷ)² (4α)^{-1} for any α ∈ (0, γ_x].

Proof. An application of Cauchy's and Young's inequalities with any factor α > 0, Assumption 3.2 (ii), and (a.1) yields the estimate

⟨K_x(x′, ŷ) − K_x(x̂, ŷ), x − x̂⟩ + ξ_x ∥x − x̂∥²
  = ⟨K_x(x, ŷ) − K_x(x̂, ŷ), x − x̂⟩ + ξ_x ∥x − x̂∥² + ⟨K_x(x′, ŷ) − K_x(x, ŷ), x − x̂⟩
  ≥ (γ_x − α) ∥x − x̂∥² − L_x(ŷ)² (4α)^{-1} ∥x′ − x∥².

At the same time, using (3.5),

∥K_y(x̂, y) − K_y(x, y) − K_{yx}(x, y)(x̂ − x)∥ ≤ (L_{yx}/2) ∥x − x̂∥².

Therefore (3.4a) holds if we take θ_x ≤ 2(γ_x − α) L_{yx}^{-1} and λ_x = L_x(ŷ)² (4α)^{-1}. □

Proposition a.2.
Suppose Assumption 3.2 (ii) (locally Lipschitz gradients of K) holds in some neighborhood U of û with L_y(x) ≤ L̄_y, and that

∥K_{xy}(x, y′) − K_{xy}(x, y)∥ ≤ L_{xy} ∥y′ − y∥  (u, u′ ∈ U)

for some constant L_{xy} ≥ 0. Assume, moreover, for some ξ_y ∈ ℝ, γ_y > 0 that

(a.2)  ξ_y ∥y − ŷ∥² + ⟨K_y(x̂, ŷ) − K_y(x̂, y), y − ŷ⟩ ≥ γ_y ∥y − ŷ∥²  ((x, y) ∈ U).

Then (3.4b) holds in U with θ_y = 2(γ_y − α₁)(1 + α₂)^{-1} L_{xy}^{-1} and λ_y = ½(L̄_y² (2α₁)^{-1} + (1 + α₂^{-1}) L_{xy} θ_y) for any α₁ ∈ (0, γ_y], α₂ > 0.

Proof. An application of Cauchy's and Young's inequalities with any factor α₁ > 0, Assumption 3.2 (ii), and (a.2) yields the estimate

⟨K_y(x, y) − K_y(x, y′) + K_y(x̂, ŷ) − K_y(x̂, y), y − ŷ⟩ + ξ_y ∥y − ŷ∥²
  ≥ ⟨K_y(x, y) − K_y(x, y′), y − ŷ⟩ + γ_y ∥y − ŷ∥²
  ≥ (γ_y − α₁) ∥y − ŷ∥² − L_y(x)² (4α₁)^{-1} ∥y′ − y∥².

At the same time, using (3.5) and Young's inequality for any α₂ > 0,

∥K_x(x′, ŷ) − K_x(x′, y′) − K_{xy}(x′, y′)(ŷ − y′)∥ ≤ (L_{xy}/2) ∥y′ − ŷ∥²
  ≤ (L_{xy}/2)(1 + α₂) ∥y − ŷ∥² + (L_{xy}/2)(1 + α₂^{-1}) ∥y′ − y∥².

Therefore (3.4b) holds if we take θ_y ≤ 2(γ_y − α₁)(1 + α₂)^{-1} L_{xy}^{-1} and λ_y = L̄_y² (4α₁)^{-1} + θ_y (1 + α₂^{-1}) L_{xy} / 2. □

appendix b relaxations of the three-point condition

In all the results of this paper, Assumption 3.2 (iv) can be generalized to the following three-point condition similar to the one used in [10].
Assumption b.1.
The functional K ∈ C¹(X × Y) and there exists a neighborhood

(b.1)  U(ρ_x, ρ_y) := (𝔹(x̂, ρ_x) ∩ X_G) × (𝔹(ŷ, ρ_y) ∩ Y_F∗)

for some ρ_x, ρ_y > 0 such that for all u′, u ∈ U(ρ_x, ρ_y) the following property holds:

(iv*) (three-point condition) There exist θ_x, θ_y > 0, λ_x, λ_y ≥ 0, ξ_x, ξ_y ∈ ℝ, and p_x, p_y ∈ [1, 2] such that

(b.2a)  ⟨K_x(x′, ŷ) − K_x(x̂, ŷ), x − x̂⟩ + ξ_x ∥x − x̂∥² ≥ θ_x ∥K_y(x̂, y) − K_y(x, y) − K_{yx}(x, y)(x̂ − x)∥^{p_x} − λ_x ∥x − x′∥²,

and

(b.2b)  ⟨K_y(x, y) − K_y(x, y′) + K_y(x̂, ŷ) − K_y(x̂, y), y − ŷ⟩ + ξ_y ∥y − ŷ∥² ≥ θ_y ∥K_x(x′, ŷ) − K_x(x′, y′) − K_{xy}(x′, y′)(ŷ − y′)∥^{p_y} − λ_y ∥y − y′∥².

This assumption introduces p_x and p_y in [1, 2], while in Assumption 3.2 (iv) we had p_x = p_y = 1. For instance, in [10, Appendix B] we verified Assumption b.1 with p_x = 2 for K(x, y) = ⟨A(x), y⟩ for the reconstruction of the phase and amplitude of a complex number. This relaxation mainly affects the proof of Step 4 in Theorem 4.2, which now requires a few intermediate derivations.

Corollary b.2.
The results of Theorem 4.2 continue to hold if Assumption 3.2 (iv) is replaced with Assump-tion b.1 (iv*) for some p x , p y ∈ [ , ] , where in case p y ∈ ( , ] , (4.8d) is replaced by γ G ≥ (cid:101) γ G + ξ x + p y − ( θ y p p y y ρ p y − x ω − ) py − , (b.3a) and in case p x ∈ ( , ] , (4.8e) is replaced by γ F ∗ ≥ (cid:101) γ F ∗ + ξ y + p x − ( ωθ x p p x x ρ p x − y ) px − . (b.3b) Proof.
The beginning of the proof follows the exact same steps as in the proof of Theorem 4.2 up until(4.14). We now use Assumption b.1 (iv*) to further bound D x and D y similarly to (4.15) and (4.16). From(b.2a),(b.4) D x ≥ θ x (cid:107) K y ( (cid:98) x , y i + ) − K y ( x i + , y i + ) − K y x ( x i + , y i + )( (cid:98) x − x i + )(cid:107) p x − λ x (cid:107) x i + − x i (cid:107) − (cid:107) y i + − (cid:98) y (cid:107) (cid:107) K y ( (cid:98) x , y i + ) − K y ( x i + , y i + ) − K y x ( x i + , y i + )( (cid:98) x − x i + )(cid:107) ω − i . Clason, Mazurenko, Valkonen Primal–dual proximal splitting and generalized conjugation . . .rxiv: 1901.02746v4, 2020-03-19 page 30 of 37
The following generalized Young’s inequality for any positive a , b , p and q such that q − + p − = p x ∈ [ , ] :(b.5) ab = (cid:16) ab − pp (cid:17) b p − p ≤ p (cid:16) ab − pp (cid:17) p + q b p − p q = p a p b − p + (cid:18) − p (cid:19) b . Applying this inequality with p = p x , a : = ( ζ x p x ) − / (cid:107) K y ( (cid:98) x , y i + ) − K y ( x i + , y i + ) − K y x ( x i + , y i + )( (cid:98) x − x i + )(cid:107) , and b : = ( ζ x p x ) / (cid:107) y i + − (cid:98) y (cid:107) , for any ζ x > D x ≥ θ x (cid:107) K y ( (cid:98) x , y i + ) − K y ( x i + , y i + ) − K y x ( x i + , y i + )( (cid:98) x − x i + )(cid:107) p x − λ x (cid:107) x i + − x i (cid:107) − (cid:107) y i + − (cid:98) y (cid:107) − p x p p x x ω i ζ p x − x (cid:107) K y ( (cid:98) x , y i + ) − K y ( x i + , y i + ) − K y x ( x i + , y i + )( (cid:98) x − x i + )(cid:107) p x − p x − ω i ζ x (cid:107) y i + − (cid:98) y (cid:107) . We now use u i + ∈ U( ρ x , ρ y ) for some ρ x , ρ y ≥
0, and ω − i ≤ ω − to obtain(b.6) θ x − (cid:107) y i + − (cid:98) y (cid:107) − p x ( p p x x ω i ζ p x − x ) − ≥ θ x − ρ − p x y ( p p x x ωζ p x − x ) − . If p x =
1, we use the assumed inequality θ x ≥ ρ y ω − from (4.8e) to show that the right-hand sideof (b.6) is non-negative for any ζ x >
0. Otherwise we take ζ x : = ( ωθ x p p x x ρ p x − y ) /( − p x ) to ensure theright-hand side of (b.6) is zero. In either case, θ x − ρ − p x y ( p p x x ωζ p x − x ) − ≥ D x ≥ − λ x (cid:107) x i + − x i (cid:107) − ( p x − ) ω − i ζ x (cid:107) y i + − (cid:98) y (cid:107) . Analogously, from (b.2b) and Cauchy’s inequality, D y ≥ θ y (cid:107) K x ( x i , (cid:98) y ) − K x ( x i , y i ) − K xy ( x i , y i )( (cid:98) y − y i )(cid:107) p y − λ y (cid:107) y i + − y i (cid:107) − ω i (cid:107) x i + − (cid:98) x (cid:107) (cid:107) K x ( x i , (cid:98) y ) − K x ( x i , y i ) − K xy ( x i , y i )( (cid:98) y − y i )(cid:107) . This has a structure similar to (b.4) with ω i now as a multiplier. Hence, we apply a similar generalizedYoung’s inequality to the last term with any ζ y >
0. Noting that ω i ≤ ω , we use the following boundsimilar to (b.6): θ y − (cid:107) x i + − (cid:98) x (cid:107) − p y ω i ( p p y y ζ p y − y ) − ≥ θ y − ρ − p y x ω ( p p y y ζ p y − y ) − ≥ . The last inequality holds for any ζ y > p y = θ y ≥ ωρ x from (4.8d); otherwise,we set ζ y : = ( θ y p p y y ρ p y − x ω − ) /( − p y ) . We then obtain that(b.8) D y ≥ − λ y (cid:107) y i + − y i (cid:107) − ( p y − ) ω i ζ y (cid:107) x i + − (cid:98) x (cid:107) . Clason, Mazurenko, Valkonen Primal–dual proximal splitting and generalized conjugation . . .rxiv: 1901.02746v4, 2020-03-19 page 31 of 37
Combining (4.14), (b.7), and (b.8), we can thus bound

(b.9)  D = η_i D_x + η_{i+1} D_y + η_{i+1} D_ω + η_i (γ_G − γ̃_G − ξ_x) ∥x_{i+1} − x̂∥² + η_{i+1} (γ_F∗ − γ̃_F∗ − ξ_y) ∥y_{i+1} − ŷ∥²
  ≥ η_{i+1} (γ_F∗ − γ̃_F∗ − ξ_y − (1 − p_x/2) ζ_x) ∥y_{i+1} − ŷ∥² − η_i λ_x ∥x_{i+1} − x_i∥²
   + η_i (γ_G − γ̃_G − ξ_x − (1 − p_y/2) ζ_y) ∥x_{i+1} − x̂∥² − η_{i+1} λ_y ∥y_{i+1} − y_i∥² − η_i L_{yx} (ω_i + 1) ρ_y / 2 ∥x_{i+1} − x_i∥²
  ≥ −η_i (λ_x + L_{yx} (ω_i + 1) ρ_y / 2) ∥x_{i+1} − x_i∥² − η_{i+1} λ_y ∥y_{i+1} − y_i∥²,

where in the final step, we have also used (b.3) and the selected ζ_x and ζ_y if p_x > 1 or p_y > 1. □

It is worth observing that when p_x ∈ (1, 2] or p_y ∈ (1, 2], the inequalities (b.3) do not directly bound the respective ρ_y or ρ_x. Hence, we do not need to initialize the corresponding variable locally, unlike when p_x = p_y = 1. On the other hand, sufficient strong convexity is required from the corresponding G and F∗.

We start with the lemma ensuring that the iterates stay in the initial neighborhood of the saddle point.

Corollary b.3.
The results of Lemma 5.3 continue to hold if the corresponding conditions of Theorem 4.2are replaced with those in Corollary b.2.Proof.
The proof repeats that of Lemma 5.3, applying Corollary b.2 instead of Theorem 4.2 in Step 2. (cid:3)
We next extend the results of Section 6 to arbitrary choices of both p_x ∈ [1, 2] and p_y ∈ [1, 2]. This mainly consists of verifying (b.3a) when p_y ≠ 1 and (b.3b) when p_x ≠ 1. Note that it is possible to take p_x = 1 and p_y ≠ 1, or vice versa, as long as the corresponding conditions are satisfied.
Corollary b.4. The results of Theorem 6.1 continue to hold if Assumption 3.2 (iv) is replaced with Assumption b.1 (iv*) for some $p_x, p_y \in [1,2]$, where in case $p_y \in (1,2]$, (6.1a) is replaced with

(b.10a)  $\xi_x = \gamma_G - (p_y - 1)\bigl(\theta_y p_y^{p_y} \rho_x^{p_y-2}\bigr)^{\frac{1}{1-p_y}},$

and in case $p_x \in (1,2]$, (6.1b) is replaced with

(b.10b)  $\xi_y = \gamma_{F^*} - (p_x - 1)\bigl(\theta_x p_x^{p_x} \rho_y^{p_x-2}\bigr)^{\frac{1}{1-p_x}}.$

Proof. Conditions (b.10) are sufficient for (b.3) with $\bar\omega_x = \bar\omega_y = 1$. If $p_x > 1$, we now obtain a lower bound on $d_x^i$ by arguing as in (b.4)–(b.6) with $\hat u$ replaced by $\bar u$. Specifically, using (3.5), Assumption b.1 (iv*) at $\bar u$, and the generalized Young's inequality (b.5), we obtain for any $\zeta_x > 0$

$d_x^i \le -\theta_x\|R^{i+1}\|^{p_x} + \|y^{i+1} - \bar y\|\,\|R^{i+1}\| + \lambda_x\|x^{i+1} - x^i\|^2 - (p_y-1)\bigl(\theta_y p_y^{p_y}\rho_x^{p_y-2}\bigr)^{\frac{1}{1-p_y}}\|x^{i+1} - \bar x\|^2$
$\le \Bigl(\frac{\|y^{i+1} - \bar y\|^{2-p_x}}{p_x^{p_x}\zeta_x^{p_x-1}} - \theta_x\Bigr)\|R^{i+1}\|^{p_x} + (p_x-1)\zeta_x\|y^{i+1} - \bar y\|^2 + \lambda_x\|x^{i+1} - x^i\|^2 - (p_y-1)\bigl(\theta_y p_y^{p_y}\rho_x^{p_y-2}\bigr)^{\frac{1}{1-p_y}}\|x^{i+1} - \bar x\|^2,$

where $R^{i+1} := K_y(\bar x, y^{i+1}) - K_y(x^{i+1}, y^{i+1}) - K_{yx}(x^{i+1}, y^{i+1})(\bar x - x^{i+1})$. Inserting $\zeta_x = \bigl(\theta_x p_x^{p_x}\rho_y^{p_x-2}\bigr)^{\frac{1}{1-p_x}}$ and $\|y^{i+1} - \bar y\| \le \rho_y$, we eliminate the first term on the right-hand side. Likewise, if $p_y > 1$, similar steps applied to $d_y^i$ result in

$d_y^i \le (p_y-1)\zeta_y\|x^{i+1} - \bar x\|^2 + \lambda_y\|y^{i+1} - y^i\|^2 - (p_x-1)\bigl(\theta_x p_x^{p_x}\rho_y^{p_x-2}\bigr)^{\frac{1}{1-p_x}}\|y^{i+1} - \bar y\|^2$

for $\zeta_y = \bigl(\theta_y p_y^{p_y}\rho_x^{p_y-2}\bigr)^{\frac{1}{1-p_y}}$. Using $\|u^{i+1} - u^i\| \to 0$ and these choices of $\zeta_x$ and $\zeta_y$, we then obtain the desired estimate

$\limsup_{i\to\infty} q_i := \limsup_{i\to\infty}\bigl(d_x^i + d_y^i + O(\|u^{i+1} - u^i\|^2)\bigr) \le 0.$ □

Corollary b.5.
The results of Theorem 6.3 continue to hold if Assumption 3.2 (iv) is replaced with Assumption b.1 (iv*) for some $p_x, p_y \in [1,2]$, where in case $p_y \in (1,2]$, (6.6a) is replaced for some $\tilde\gamma_G > 0$ with

(b.11a)  $\xi_x = \gamma_G - \tilde\gamma_G - (p_y-1)\bigl(\theta_y p_y^{p_y}\rho_x^{p_y-2}\bigr)^{\frac{1}{1-p_y}},$

and in case $p_x \in (1,2]$, (6.6b) is replaced with

(b.11b)  $\xi_y = \gamma_{F^*} - (p_x-1)\bigl(\theta_x p_x^{p_x}\rho_y^{p_x-2}\bigr)^{\frac{1}{1-p_x}}.$

Proof. Conditions (b.11) are sufficient for (b.3) with $\bar\omega_x = \bar\omega_y = 1$ to hold; therefore, we can repeat the proof of Theorem 6.3, replacing the references to Theorem 4.2 by references to Corollary b.2. □

Corollary b.6.
The results of Theorem 6.4 continue to hold if Assumption 3.2 (iv) is replaced with Assumption b.1 (iv*) for some $p_x, p_y \in [1,2]$, where in case $p_y \in (1,2]$, (6.9a) is replaced for some $\tilde\gamma_G > 0$ with

(b.12a)  $\xi_x = \gamma_G - \tilde\gamma_G - (p_y-1)\bigl(\theta_y p_y^{p_y}\rho_x^{p_y-2}\omega^{-1}\bigr)^{\frac{1}{1-p_y}},$

and in case $p_x \in (1,2]$, (6.9b) is replaced for some $\tilde\gamma_{F^*} > 0$ with

(b.12b)  $\xi_y = \gamma_{F^*} - \tilde\gamma_{F^*} - (p_x-1)\bigl(\omega\theta_x p_x^{p_x}\rho_y^{p_x-2}\bigr)^{\frac{1}{1-p_x}}.$

Proof. Conditions (b.12) are sufficient for (b.3) with $\bar\omega_x = \bar\omega_y = \omega$ to hold; therefore, we can repeat the proof of Theorem 6.4, replacing the references to Theorem 4.2 by references to Corollary b.2. □

Corollary b.7.
The results of Proposition 6.6 continue to hold if the corresponding conditions of Theorem 6.1, 6.3, or 6.4 are replaced with those in Corollary b.4, b.5, or b.6, respectively.

Proof. The proof repeats that of Proposition 6.6. □
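Before turning to the verification arguments for the step function presentation and the Potts model, the method itself can be illustrated concretely. The following self-contained Python sketch is not from the paper: it instantiates Algorithm 1.1 (GPDPS) on a toy scalar problem with $G(x) = x^2/2$, $F^*(y) = y^2/2$, and an illustrative smooth non-bilinear coupling $K(x,y) = xy - (xy)^2/2$; the constant step lengths $\tau = \sigma = 0.1$, the over-relaxation $\omega = 1$, and the starting point are assumptions chosen only for the example.

```python
def prox_sq(u, t):
    # prox of v -> t*v^2/2, i.e. argmin_v { t*v^2/2 + (v - u)^2/2 } = u / (1 + t)
    return u / (1 + t)

def K_x(x, y):
    # partial derivative in x of K(x, y) = x*y - (x*y)**2/2, namely y*(1 - x*y)
    return y * (1 - x * y)

def K_y(x, y):
    # partial derivative in y of the same coupling, namely x*(1 - x*y)
    return x * (1 - x * y)

def gpdps(x, y, tau=0.1, sigma=0.1, omega=1.0, iters=1000):
    # Algorithm 1.1 (GPDPS) with constant step lengths (illustrative choice)
    for _ in range(iters):
        x_next = prox_sq(x - tau * K_x(x, y), tau)     # primal proximal step
        x_bar = x_next + omega * (x_next - x)          # over-relaxed point
        y = prox_sq(y + sigma * K_y(x_bar, y), sigma)  # dual proximal step
        x = x_next
    return x, y

print(gpdps(0.5, 0.5))  # the iterates approach the unique critical point (0, 0)
```

For this instance one checks directly that $(0,0)$ is the only critical point of the saddle-point problem, and the strong convexity of $G$ and $F^*$ dominates the coupling on the region traversed by the iterates, so the iteration contracts towards it, in line with the local convergence theory above.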
appendix c verification of conditions for step function presentation and potts model
Throughout this section, we set $\rho(t) := 2t - t^2$ and $\kappa(x,y) := \rho(\langle x,y\rangle)$ for $x, y \in \mathbb{R}^m$. Then $\rho'(t) = 2(1-t)$, so that

(c.1a)  $\kappa_x(x,y) = 2y(1 - \langle y,x\rangle)$ and $\kappa_{xy}(x,y) = 2\bigl(I - \langle y,x\rangle I - y\otimes x\bigr),$
(c.1b)  $\kappa_y(x,y) = 2x(1 - \langle x,y\rangle)$ and $\kappa_{yx}(x,y) = 2\bigl(I - \langle x,y\rangle I - x\otimes y\bigr),$

where $a\otimes b \in \mathbb{R}^{m\times m}$ is the tensor product between two vectors, producing the matrix of all the combinations of products between their entries.

The following lemma verifies Assumption 3.2 for $K = \kappa$.

Lemma c.1.
Let $R_K > 2$, and suppose $\hat x, \hat y \in \mathbb{R}^m$ for $m \ge 1$ satisfy

(c.2)  $0 \le \langle\hat x, \hat y\rangle I + \hat x\otimes\hat y \le 2I.$

Then the function $K = \kappa$ defined above satisfies Assumption 3.2 for some $\theta_x, \theta_y > 0$ and some $\rho_x, \rho_y > 0$ dependent on $R_K$ with

$L_x(y) = 2|y|^2, \quad L_y(x) = 2|x|^2, \quad L_{yx} = 4(|\hat y| + \rho_y),$

as well as the constants $\xi_x, \xi_y \in \mathbb{R}$ and $\lambda_x, \lambda_y \ge 0$ satisfying $\lambda_x\xi_x > (2\lambda_x + |\hat y|^2)|\hat y|^2$, $\xi_y > 0$, and $\lambda_y > |\hat x|^2$.

Proof. First, Assumption 3.2 (i) holds everywhere since $K \in C^\infty(\mathbb{R}^m\times\mathbb{R}^m)$. To verify Assumption 3.2 (ii), we observe using (c.1) that

(c.3a)  $\kappa_x(x',y) - \kappa_x(x,y) = 2(y\otimes y)(x - x'),$
(c.3b)  $\kappa_{xy}(x,y') - \kappa_{xy}(x,y) = 2\bigl(\langle y - y', x\rangle I + (y - y')\otimes x\bigr),$
(c.3c)  $\kappa_y(x,y') - \kappa_y(x,y) = 2(x\otimes x)(y - y'),$
(c.3d)  $\kappa_{yx}(x',y) - \kappa_{yx}(x,y) = 2\bigl(\langle x - x', y\rangle I + (x - x')\otimes y\bigr).$

Hence $L_x$, $L_y$, and $L_{yx}$ are as claimed. To verify Assumption 3.2 (iii), we first of all observe using (c.2) that

$|\kappa_{xy}(\hat x,\hat y)| = |2(I - \langle\hat y,\hat x\rangle I - \hat y\otimes\hat x)| \le 2.$

Therefore $\sup_{(x,y)\in\mathbb{B}(\hat x,\rho_x)\times\mathbb{B}(\hat y,\rho_y)} |\kappa_{xy}(x,y)| \le R_K$ for some $\rho_x, \rho_y > 0$, since $R_K > 2$. It remains to verify Assumption 3.2 (iv). We start with (3.4a), i.e.,

$\langle\kappa_x(x',\hat y) - \kappa_x(\hat x,\hat y), x - \hat x\rangle + \xi_x|x - \hat x|^2 \ge \theta_x|\kappa_y(\hat x,y) - \kappa_y(x,y) - \kappa_{yx}(x,y)(\hat x - x)|^2 - \lambda_x|x - x'|^2.$

Expanding using (c.1), (c.3), and

$\kappa_y(\hat x,y) - \kappa_y(x,y) - \kappa_{yx}(x,y)(\hat x - x) = 2\hat x(1 - \langle\hat x,y\rangle) - 2x(1 - \langle x,y\rangle) - 2\bigl(I - \langle x,y\rangle I - x\otimes y\bigr)(\hat x - x)$
$= 2\bigl[\langle x,y\rangle x - \langle\hat x,y\rangle\hat x + (\langle x,y\rangle I + x\otimes y)(\hat x - x)\bigr]$
$= 2\bigl[\langle x - \hat x, y\rangle\hat x + (x\otimes y)(\hat x - x)\bigr]$
$= -2\bigl((\hat x - x)\otimes y\bigr)(\hat x - x),$

we require that

(c.4)  $2\langle\hat x - x', x - \hat x\rangle_{\hat y\otimes\hat y} + \xi_x|x - \hat x|^2 \ge 4\theta_x|y|^2|x - \hat x|^4 - \lambda_x|x - x'|^2.$
Taking any $\alpha > 0$, this will hold by Cauchy's and Young's inequalities if

$\xi_x \ge (2 + \alpha)|\hat y|^2 + 4\theta_x|y|^2\rho_x^2 \quad\text{and}\quad \lambda_x \ge \alpha^{-1}|\hat y|^2.$

If $|\hat y| = 0$, clearly these hold for some $\alpha, \theta_x > 0$. Otherwise, solving $\alpha$ from the latter as an equality, i.e., taking $\alpha = \lambda_x^{-1}|\hat y|^2$, the former holds if $\xi_x \ge (2 + \lambda_x^{-1}|\hat y|^2)|\hat y|^2 + 4\theta_x|y|^2\rho_x^2$. If $\lambda_x\xi_x > (2\lambda_x + |\hat y|^2)|\hat y|^2$, this holds for some $\theta_x, \rho_x, \rho_y > 0$ in a neighborhood $\mathbb{B}(\hat x,\rho_x)\times\mathbb{B}(\hat y,\rho_y)$ of $(\hat x,\hat y)$.

It remains to verify (3.4b), i.e.,

$\langle\kappa_y(x,y) - \kappa_y(x,y') + \kappa_y(\hat x,\hat y) - \kappa_y(\hat x,y), y - \hat y\rangle + \xi_y|y - \hat y|^2 \ge \theta_y|\kappa_x(x',\hat y) - \kappa_x(x',y') - \kappa_{xy}(x',y')(\hat y - y')|^2 - \lambda_y|y - y'|^2.$

Again, using (c.1) and (c.3), we expand this as

$2\langle y' - y, y - \hat y\rangle_{x\otimes x} + 2|y - \hat y|^2_{\hat x\otimes\hat x} + \xi_y|y - \hat y|^2 \ge 4\theta_y|x'|^2|y' - \hat y|^4 - \lambda_y|y - y'|^2.$

Rearranging the $\theta_y$-term (using also $|y' - \hat y| \le 1$, valid for $\rho_y$ small enough), we see that this holds if

$2\langle y' - y, y - \hat y\rangle_{x\otimes x - 4\theta_y|x'|^2 I} + 2|y - \hat y|^2_{\hat x\otimes\hat x} + (\xi_y - 4\theta_y|x'|^2)|y - \hat y|^2 \ge (4\theta_y|x'|^2 - \lambda_y)|y' - y|^2.$

Rearranging and estimating the first term as

$2\langle y' - y, y - \hat y\rangle_{x\otimes x - 4\theta_y|x'|^2 I} = 2\langle y' - y, x\rangle\langle y - \hat y, x\rangle - 8\theta_y|x'|^2\langle y' - y, y - \hat y\rangle$
$\ge -|y - \hat y|^2_{x\otimes x} - |y' - y|^2_{x\otimes x} - 4\theta_y|x'|^2|y' - y|^2 - 4\theta_y|x'|^2|y - \hat y|^2,$

and then using Young's inequality on both parts, we obtain the condition

$\bigl(2|y - \hat y|^2_{\hat x\otimes\hat x} - |y - \hat y|^2_{x\otimes x}\bigr) + (\xi_y - 8\theta_y|x'|^2)|y - \hat y|^2 \ge (|x|^2 + 8\theta_y|x'|^2 - \lambda_y)|y' - y|^2.$

If $\xi_y > 0$ and $\lambda_y > |\hat x|^2$, this holds for some $\theta_y, \rho_y, \rho_x > 0$ in a neighborhood $\mathbb{B}(\hat x,\rho_x)\times\mathbb{B}(\hat y,\rho_y)$ of $(\hat x,\hat y)$. □

We comment on the condition (c.2) on the primal–dual solution pair $\hat x, \hat y \in \mathbb{R}^m$. First, for $m = 1$, this condition reduces to $\hat x\hat y \in [0,1]$. This is necessarily satisfied in the case of the step function (where $f^* = \delta_{[0,\infty)}$) and in the case of the $\ell^0$ function (where $f^* = 0$), as in both cases $\hat x\hat y \in \{0,1\}$ by the dual optimality condition $\kappa_y(\hat x,\hat y) \in \partial f^*(\hat y)$. Furthermore, if we take $f^*_\gamma = \frac{\gamma}{2}|\cdot|^2$ for some $\gamma \ge 0$, then for any $m \ge 1$ this optimality condition reads $2\hat x(1 - \langle\hat x,\hat y\rangle) = \gamma\hat y$, i.e., $\hat y = 2\hat x(\gamma + 2|\hat x|^2)^{-1}$, for which (c.2) is easily verified.

The following lemma shows that Assumption 3.2 remains valid if we include a linear operator in the primal component.

Lemma c.2.
Let $K(x,y) = \tilde K(Ax, y)$ for some $A \in \mathbb{L}(X;Z)$ and $\tilde K \in C^1(Z\times Y)$ on Hilbert spaces $X, Y, Z$. Suppose $\tilde K$ satisfies Assumption 3.2 at $(\hat z, \hat y) := (A\hat x, \hat y)$; mark the corresponding constants with a tilde: $\tilde L_z$, $\tilde R_K$, and so on. Then $K$ satisfies Assumption 3.2 with $R_K := \tilde R_K\|A\|$; $\xi_x = \|A\|^2\tilde\xi_z$, $\xi_y = \tilde\xi_y$; $\lambda_x = \|A\|^2\tilde\lambda_z$, $\lambda_y = \tilde\lambda_y$; $\theta_x = \tilde\theta_z$, $\theta_y = \tilde\theta_y\|A\|^{-2}$; $\rho_x = \|A\|^{-1}\tilde\rho_x$, and $\rho_y = \tilde\rho_y$, as well as

(c.5)  $L_x(y) = \|A\|^2\tilde L_z(y), \quad L_y(x) = \tilde L_y(Ax), \quad L_{yx} = \|A\|^2\tilde L_{yz}.$

Proof. Observe first of all that by the chain rule,

$K_x(x,y) = A^*\tilde K_z(Ax,y), \quad K_y(x,y) = \tilde K_y(Ax,y), \quad K_{xy}(x,y) = A^*\tilde K_{zy}(Ax,y),$

and hence Assumption 3.2 (i) holds for $K$ if it holds for $\tilde K$. Let now Assumption 3.2 (ii) hold for $\tilde K$ with $\tilde L_z$, $\tilde L_y$, and $\tilde L_{yz}$. Observing that

(c.6)  $A\,\mathbb{B}(\hat x,\rho_x)\times\mathbb{B}(\hat y,\rho_y) \subset \mathbb{B}(\hat z,\tilde\rho_x)\times\mathbb{B}(\hat y,\tilde\rho_y),$

Assumption 3.2 (ii) thus also holds with the functions of (c.5). Similarly, in Assumption 3.2 (iii) we can take $R_K := \tilde R_K\|A\|$. Finally, we expand Assumption 3.2 (iv) for $K$ as

$\langle\tilde K_z(z',\hat y) - \tilde K_z(\hat z,\hat y), z - \hat z\rangle + \xi_x\|x - \hat x\|^2 \ge \theta_x\|\tilde K_y(\hat z,y) - \tilde K_y(z,y) - \tilde K_{yz}(z,y)(\hat z - z)\|^2 - \lambda_x\|x - x'\|^2$

and

$\langle\tilde K_y(z,y) - \tilde K_y(z,y') + \tilde K_y(\hat z,\hat y) - \tilde K_y(\hat z,y), y - \hat y\rangle + \xi_y\|y - \hat y\|^2 \ge \theta_y\|A^*[\tilde K_z(z',\hat y) - \tilde K_z(z',y') - \tilde K_{zy}(z',y')(\hat y - y')]\|^2 - \lambda_y\|y - y'\|^2,$

where $z = Ax$, $z' = Ax'$, and $\hat z = A\hat x$. Since $\|z - z'\| \le \|A\|\|x - x'\|$, etc., this follows from Assumption 3.2 (iv) for $\tilde K$ with the constants as claimed.
□

Applying this lemma to $\tilde K(z,y) = \sum_{k=1}^n \kappa(z_k, y_k)$, we can thus lift the scalar estimates for $K = \kappa$ as in (c.1) to the corresponding estimates on $K(x,y) := \sum_{k=1}^n \kappa([D_h x]_k, y_k)$ as used in the Potts model example.

references

[1] F. J. Aragón Artacho and M. H. Geoffroy, Characterization of metric regularity of subdifferentials, Journal of Convex Analysis 15 (2008), 365–380.
[2] F. J. Aragón Artacho and M. H. Geoffroy, Metric subregularity of the convex subdifferential in Banach spaces, J. Nonlinear Convex Anal. 15 (2014), 35–47.
[3] H. Attouch, J. Bolte, and B. Svaiter, Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods, Mathematical Programming 137 (2013), 91–129, doi:10.1007/s10107-011-0484-9.
[4] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, Springer, 2nd edition, 2017, doi:10.1007/978-3-319-48311-5.
[5] M. Benning, F. Knoll, C. B. Schönlieb, and T. Valkonen, Preconditioned ADMM with nonlinear operator constraint, in System Modeling and Optimization: 27th IFIP TC 7 Conference, CSMO 2015, Sophia Antipolis, France, June 29–July 3, 2015, Revised Selected Papers, L. Bociu, J. A. Désidéri, and A. Habbal (eds.), Springer International Publishing, 2016, 117–126, doi:10.1007/978-3-319-55795-3_10, arXiv:1511.00425, https://tuomov.iki.fi/m/nonlinearADMM.pdf.
[6] A. Borzì and C. Kanzow, Formulation and numerical solution of Nash equilibrium multiobjective elliptic control problems, SIAM Journal on Control and Optimization 51 (2013), 718–744, doi:10.1137/120864921.
[7] A. Chambolle, An algorithm for total variation minimization and applications, Journal of Mathematical Imaging and Vision 20 (2004), 89–97, doi:10.1023/b:jmiv.0000011325.36760.1e.
[8] A. Chambolle and T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging, Journal of Mathematical Imaging and Vision 40 (2011), 120–145, doi:10.1007/s10851-010-0251-1.
[9] C. Clason and K. Kunisch, A convex analysis approach to multi-material topology optimization, ESAIM: Mathematical Modelling and Numerical Analysis 50 (2016), 1917–1936, doi:10.1051/m2an/2016012.
[10] C. Clason, S. Mazurenko, and T. Valkonen, Acceleration and global convergence of a first-order primal–dual method for nonconvex problems, 2019, doi:10.1137/18m1170194.
[11] C. Clason, S. Mazurenko, and T. Valkonen, Julia codes for "Primal–dual proximal splitting and generalized conjugation in non-smooth non-convex optimization", online resource on Zenodo, 2020, doi:10.5281/zenodo.3647614.
[12] C. Clason and T. Valkonen, Primal-dual extragradient methods for nonlinear nonsmooth PDE-constrained optimization, SIAM Journal on Optimization 27 (2017), 1313–1339, doi:10.1137/16m1080859, arXiv:1606.06219, https://tuomov.iki.fi/m/pdex2_nlpdhgm.pdf.
[13] Y. Drori, S. Sabach, and M. Teboulle, A simple algorithm for a class of nonsmooth convex–concave saddle-point problems, Operations Research Letters 43 (2015), 209–214, doi:10.1016/j.orl.2015.02.001.
[14] I. Ekeland and R. Temam, Convex Analysis and Variational Problems, SIAM, Philadelphia, 1999, doi:10.1137/1.9781611971088.
[15] K. H. Elster and A. Wolf, Recent results on generalized conjugate functions, in Trends in Mathematical Optimization: 4th French-German Conference on Optimization, K. H. Hoffmann, J. Zowe, J. B. Hiriart-Urruty, and C. Lemarechal (eds.), Birkhäuser Basel, 1988, 67–78, doi:10.1007/978-3-0348-9297-1_5.
[16] F. Facchinei and C. Kanzow, Generalized Nash equilibrium problems, Ann. Oper. Res. 175 (2010), 177–211, doi:10.1007/s10479-009-0653-x.
[17] S. D. Flåm and A. S. Antipin, Equilibrium programming using proximal-like algorithms, Math. Programming 78 (1997), 29–41, doi:10.1007/bf02614504.
[18] S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi:10.1109/tpami.1984.4767596.
[19] E. Y. Hamedani and N. S. Aybat, A primal-dual algorithm for general convex-concave saddle point problems, arXiv (2018), arXiv:1803.01401.
[20] N. He, A. Juditsky, and A. Nemirovski, Mirror Prox algorithm for multi-term composite minimization and semi-separable problems, Computational Optimization and Applications 61 (2015), 275–319, doi:10.1007/s10589-014-9723-3.
[21] Y. He and R. D. Monteiro, An accelerated HPE-type algorithm for a class of composite convex-concave saddle-point problems, SIAM Journal on Optimization 26 (2016), 29–56, doi:10.1137/14096757x.
[22] A. Juditsky and A. Nemirovski, First order methods for nonsmooth convex large-scale optimization, I: general purpose methods, in Optimization for Machine Learning, S. Sra, S. Nowozin, and S. J. Wright (eds.), MIT Press, 2011, 121–148.
[23] A. Juditsky and A. Nemirovski, First order methods for nonsmooth convex large-scale optimization, II: utilizing problems structure, in Optimization for Machine Learning, S. Sra, S. Nowozin, and S. J. Wright (eds.), MIT Press, 2011, 149–183.
[24] O. Kolossoski and R. Monteiro, An accelerated non-Euclidean hybrid proximal extragradient-type algorithm for convex–concave saddle-point problems, Optimization Methods and Software 32 (2017), 1244–1272, doi:10.1080/10556788.2016.1266355.
[25] J. B. Krawczyk and S. Uryasev, Relaxation algorithms to find Nash equilibria with economic applications, Environmental Modeling & Assessment, doi:10.1023/a:1019097208499.
[26] J. E. Martinez-Legaz, Generalized convex duality and its economic applications, in Handbook of Generalized Convexity and Generalized Monotonicity, N. Hadjisavvas, S. Komlósi, and S. Schaible (eds.), Springer, 2005, 237–292, doi:10.1007/0-387-23393-8_6.
[27] A. Nemirovski, Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems, SIAM Journal on Optimization 15 (2004), 229–251, doi:10.1137/s1052623403425629.
[28] Y. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming, doi:10.1007/s10107-004-0552-5.
[29] H. Nikaidô and K. Isoda, Note on non-cooperative convex games, Pacific J. Math., doi:10.2140/pjm.1955.5.807, http://projecteuclid.org/euclid.pjm/1171984836.
[30] W. S. Rasband, ImageJ, 1997–2018, https://imagej.nih.gov/ij/.
[31] J. B. Rosen, Existence and uniqueness of equilibrium points for concave n-person games, Econometrica 33 (1965), 520–534, doi:10.2307/1911749.
[32] I. Singer, Duality for Nonconvex Approximation and Optimization, Springer-Verlag New York, 2006, doi:10.1007/0-387-28395-1.
[33] M. Storath, A. Weinmann, and L. Demaret, Jump-sparse and sparse recovery using Potts functionals, IEEE Transactions on Signal Processing 62 (2014), 3654–3666, doi:10.1109/tsp.2014.2329263.
[34] M. Storath, A. Weinmann, J. Frikel, and M. Unser, Joint image reconstruction and segmentation using the Potts model, Inverse Problems 31 (2015), 025003, doi:10.1088/0266-5611/31/2/025003.
[35] T. Valkonen, A primal-dual hybrid gradient method for non-linear operators with applications to MRI, Inverse Problems 30 (2014), 055012, doi:10.1088/0266-5611/30/5/055012, arXiv:1309.5032, https://tuomov.iki.fi/m/nl-pdhgm.pdf.
[36] T. Valkonen, Testing and non-linear preconditioning of the proximal point method, Applied Mathematics and Optimization (2018), doi:10.1007/s00245-018-9541-6, arXiv:1703.05705, https://tuomov.iki.fi/m/proxtest.pdf.
[37] T. Valkonen and T. Pock, Acceleration of the PDHGM on partially strongly convex functions, Journal of Mathematical Imaging and Vision 59 (2017), 394–414, doi:10.1007/s10851-016-0692-2, arXiv:1511.06566, https://tuomov.iki.fi/m/cpaccel.pdf.
[38] A. von Heusinger and C. Kanzow, Optimization reformulations of the generalized Nash equilibrium problem using Nikaido-Isoda-type functions, Comput. Optim. Appl. 43 (2009), 353–377, doi:10.1007/s10589-007-9145-6.