A Framework of Inertial Alternating Direction Method of Multipliers for Non-Convex Non-Smooth Optimization
LE THI KHANH HIEN, DUY NHAT PHAN, AND NICOLAS GILLIS

Abstract.
In this paper, we propose an algorithmic framework, dubbed inertial alternating direction methods of multipliers (iADMM), for solving a class of nonconvex nonsmooth multiblock composite optimization problems with linear constraints. Our framework employs the general majorization-minimization (MM) principle to update each block of variables so as to not only unify the convergence analysis of previous ADMM schemes that use specific surrogate functions in the MM step, but also lead to new efficient ADMM schemes. To the best of our knowledge, in the nonconvex nonsmooth setting, ADMM used in combination with the MM principle to update each block of variables, and ADMM combined with inertial terms for the primal variables, have not been studied in the literature. Under standard assumptions, we prove the subsequential convergence and global convergence of the generated sequence of iterates. We illustrate the effectiveness of iADMM on a class of nonconvex low-rank representation problems.
1. Introduction.
In this paper, we consider the following nonconvex minimization problem with linear constraints:
$$\min_{x,y} \; F(x_1, \dots, x_s) + h(y) \quad \text{such that} \quad \sum_{i=1}^s \mathcal{A}_i x_i + \mathcal{B} y = b, \tag{1}$$
where $y \in \mathbb{R}^q$, $x_i \in \mathbb{R}^{n_i}$, $x := [x_1; \dots; x_s] \in \mathbb{R}^n$, $n = \sum_{i=1}^s n_i$, $\mathcal{A}_i$ is a linear map from $\mathbb{R}^{n_i}$ to $\mathbb{R}^m$, $\mathcal{B}$ is a linear map from $\mathbb{R}^q$ to $\mathbb{R}^m$, $b \in \mathbb{R}^m$, $h : \mathbb{R}^q \to \mathbb{R}$ is a differentiable function, and $F(x) = f(x) + \sum_{i=1}^s g_i(x_i)$, where $f : \mathbb{R}^n \to \mathbb{R}$ is a nonconvex nonsmooth function and the $g_i : \mathbb{R}^{n_i} \to \mathbb{R} \cup \{+\infty\}$ are proper lower semi-continuous functions for $i = 1, \dots, s$. We assume that $F$ satisfies $\partial F(x) = \partial_{x_1} F(x) \times \dots \times \partial_{x_s} F(x)$, where $\partial F$ denotes the limiting subdifferential of $F$ (see the definition in the supplementary document).

Notation.
We denote $[s] := \{1, \dots, s\}$. For the $p$-dimensional Euclidean space $\mathbb{R}^p$, we use $\langle \cdot, \cdot \rangle$ to denote the inner product and $\|\cdot\|$ to denote the corresponding induced norm. For a linear map $\mathcal{M}$, $\mathcal{M}^*$ denotes the adjoint linear map with respect to the inner product and $\|\mathcal{M}\|$ is the induced operator norm of $\mathcal{M}$. We use $\mathcal{I}$ to denote the identity map. For a positive definite self-adjoint operator $\mathcal{Q}$, we denote $\|x\|_{\mathcal{Q}}^2 := \langle x, \mathcal{Q}x \rangle$. We denote the smallest eigenvalue of a symmetric linear self-map (that is, $\mathcal{M} = \mathcal{M}^*$) by $\lambda_{\min}(\mathcal{M})$. We use $\mathrm{Im}(\mathcal{B})$ to denote the image of $\mathcal{B}$.

$^*$ L.T.K. Hien and D.N. Phan contributed equally to this work.
Funding:
L. T. K. Hien and N. Gillis are supported by the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS project no 30468160 (SeLMA), and by the European Research Council (ERC starting grant 679515).

$^\dagger$ Department of Mathematics and Operational Research, University of Mons, Belgium ([email protected], [email protected]). $^\ddagger$ Department of Mathematics and Informatics, HCMC University of Education, Vietnam ([email protected]).

$^1$ This condition is satisfied when $f$ is a sum of a continuously differentiable function and a block separable function that has limiting subdifferential, see [2, Proposition 2.1].

One of the important applications of Problem (1) is the following generalized nonconvex low-rank representation problem: given a data matrix $D \in \mathbb{R}^{d \times n}$, solve
$$\min_{X, Y, Z} \; \sum_{i=1}^{\min(m,n)} r_1(\sigma_i(X)) + r_2(Y) + r_3(Z) \quad \text{subject to} \quad D = A_1 X + Y A_2 + Z, \tag{2}$$
where $X \in \mathbb{R}^{m \times n}$, $Y \in \mathbb{R}^{d \times q}$, $Z \in \mathbb{R}^{d \times n}$, $A_1 \in \mathbb{R}^{d \times m}$, $A_2 \in \mathbb{R}^{q \times n}$, $r_1(\cdot)$ is an increasing concave function that promotes $X$ to be of low rank, $r_2(\cdot)$ is a regularization function, and $r_3(\cdot)$ is a function that models some noise (for example, if we take $r_3(Z) = \|Z\|_F^2$ then $Z$ represents Gaussian noise). Problem (2) generalizes several important problems in machine learning. Let us mention some examples:
(i) When $A_1$ and $A_2$ are identity matrices, $r_1(t) = t^{\chi}$ with $0 < \chi \leq 1$, and $r_2(Y) = \sum_{i=1}^{q-1} \|Y_i - Y_{i+1}\|$, where $Y_i$ is the $i$-th column of $Y$, Problem (2) decomposes the data matrix $D$ into three components, $X$, $Y$ and $Z$. For example, in video surveillance, each column of $D$ is a vectorized image of a video frame, $X$ is a low-rank matrix that plays the role of the background, $Y$ is the foreground that has small variations between its columns (such as slowly moving objects), and $Z$ represents some noise [42].
(ii) When $A_1$ and $A_2$ are identity matrices, $r_1(t) = t$, and $r_2(Y) = \lambda\|Y\|_1$, where $\lambda$ is some constant, Problem (2) recovers the robust principal component analysis model, see, e.g., [10], where $X$ is a low-rank matrix, $Y$ represents a sparse noise, and $Z$ represents additional noise. It is also used for foreground-background separation in video surveillance.
(iii) When $r_1(t) = t$ and $r_2(Y) = \|Y\|_*$, Problem (2) is the latent low-rank representation problem [27]. The authors of [27] used $A_1 = D P_1$ and $A_2 = P_2^* D$, where $P_1$ and $P_2$ are computed by orthogonalizing the columns of $D^*$ and $D$, respectively. We will use this application to illustrate the effectiveness of our proposed framework, iADMM, in Section 3.
Other applications of Problem (1) include statistical learning, see, e.g., [3, 43], and minimization on compact manifolds, see, e.g., [22, 44].

Let $\mathcal{A} := [\mathcal{A}_1 \dots \mathcal{A}_s]$ and $\mathcal{A}x := \sum_{i=1}^s \mathcal{A}_i x_i \in \mathbb{R}^m$. The augmented Lagrangian of Problem (1) is given by
$$\mathcal{L}(x, y, \omega) := F(x) + h(y) + \langle \omega, \mathcal{A}x + \mathcal{B}y - b \rangle + \frac{\beta}{2}\|\mathcal{A}x + \mathcal{B}y - b\|^2, \tag{3}$$
where $\beta > 0$ is a penalty parameter. When $s = 1$, the classical ADMM alternatively updates $x$ and $y$, and then the multiplier $\omega$:
$$x^{k+1} \in \operatorname*{argmin}_x \mathcal{L}(x, y^k, \omega^k), \tag{4a}$$
$$y^{k+1} \in \operatorname*{argmin}_y \mathcal{L}(x^{k+1}, y, \omega^k), \tag{4b}$$
$$\omega^{k+1} = \omega^k + \beta(\mathcal{A}x^{k+1} + \mathcal{B}y^{k+1} - b). \tag{4c}$$
When $s > 1$, the scheme is similar, see for example [42]. The update of $x$ in (4a) (a similar discussion applies to (4b)) can be rewritten as $x^{k+1} \in \operatorname*{argmin}_x F(x) + \varphi_k(x)$, where
$$\varphi_k(x) = \frac{\beta}{2}\|\mathcal{A}x + \mathcal{B}y^k - b\|^2 + \langle \omega^k, \mathcal{A}x + \mathcal{B}y^k - b \rangle. \tag{5}$$
Solving the subproblem (4a) is usually very expensive, especially when $F$ is not smooth. A remedy is to minimize a suitable surrogate function of $\mathcal{L}(\cdot, y^k, \omega^k)$ that allows a more efficient update of $x$. For example, since $\varphi_k(x)$ is upper bounded by
$$\hat\varphi(x) = \varphi_k(x^k) + \langle \nabla\varphi_k(x^k), x - x^k \rangle + \frac{\kappa\beta}{2}\|x - x^k\|^2, \tag{6}$$
where $\kappa \geq \|\mathcal{A}^*\mathcal{A}\|$ (because $\nabla\varphi_k(x)$ is $\beta\|\mathcal{A}^*\mathcal{A}\|$-Lipschitz continuous), $x$ can be updated by
$$x^{k+1} \in \operatorname*{argmin}_x F(x) + \hat\varphi(x), \tag{7}$$
which leads to the linearized ADMM method, see [25, 45]. The update in (7) has a closed form for some nonsmooth $F$; see [34]. When $F = f + g$ and $f$ is $L_f$-smooth, then we can also use the upper bound $\hat F(x) = f(x^k) + \langle \nabla f(x^k), x - x^k \rangle + \frac{L_f}{2}\|x - x^k\|^2 + g(x)$ of $F$ to derive the following update of $x$:
$$x^{k+1} \in \operatorname*{argmin}_x \hat F(x) + \hat\varphi(x). \tag{8}$$
This leads to the proximal linearized ADMM method, see [7, 28]. We note that $\mathcal{L}(\cdot, y^k, \omega^k)$ is always upper bounded by $\mathcal{L}(\cdot, y^k, \omega^k) + D_\phi(x, x^k)$, where $D_\phi$ is the Bregman distance associated with a continuously differentiable convex function $\phi$ on $\mathbb{R}^n$:
$$D_\phi(a, b) := \phi(a) - \phi(b) - \langle \nabla\phi(b), a - b \rangle, \quad \forall a, b \in \mathbb{R}^n. \tag{9}$$
For example, if $\phi(x) = \|x\|_{\mathcal{Q}}^2 = \langle x, \mathcal{Q}x \rangle$ then $D_\phi(a, b) = \|a - b\|_{\mathcal{Q}}^2$. This upper bound leads to the proximal ADMM, see [12, 23].
The above-mentioned upper bound functions are specific examples of surrogate functions of $\mathcal{L}(\cdot, y^k, \omega^k)$ (see Definition 2.1), while each method of updating $x$ corresponds to a majorization-minimization (MM) step. In the convex setting (that is, when $F(x) + h(y)$ is convex), [11] and [20] use the MM principle to unify and generalize the convergence analysis of many ADMM schemes for multi-block problems (that is, $s > 1$). In the nonconvex setting, inertial (extrapolation) techniques have shown significant improvements in practical performance, see for example [32, 46, 47, 35, 17]. Recently, the authors of [18] proposed an inertial block MM framework for solving (1) without the linear coupling constraint. To the best of our knowledge, ADMM with inertial terms for the primal variables has not been studied in the nonconvex setting, although it has been analysed in the convex setting; see [24, 33].
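To make the surrogate viewpoint concrete, the following minimal sketch (ours, in Python/NumPy, with illustrative names) implements a proximal linearized update of the form (8) for a single block in the special case $g = \|\cdot\|_1$: both $f$ and the coupling term $\varphi_k$ are replaced by their quadratic majorizers, so the subproblem reduces to one soft-thresholding step.

```python
import numpy as np

def prox_l1(v, t):
    """Proximal operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_linearized_x_update(x, y, w, A, B, b, grad_f, L_f, beta):
    """One update of the form (8), assuming g = ||.||_1 and dense arrays.
    Both f and phi_k are majorized by Lipschitz-gradient quadratics at x,
    so the minimizer is a single proximal (soft-thresholding) step."""
    kappa = np.linalg.norm(A.T @ A, 2)                 # kappa >= ||A* A||
    grad_phi = A.T @ (w + beta * (A @ x + B @ y - b))  # gradient of phi_k at x
    step = L_f + kappa * beta                          # curvature of the combined majorizer
    return prox_l1(x - (grad_f(x) + grad_phi) / step, 1.0 / step)
```

With $g = 0$ the same step reduces to a plain gradient step on $f + \varphi_k$; other choices of $g$ only change the proximal operator.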
In this paper, we propose iADMM, a framework of inertial alternating direction methods of multipliers, for solving the nonconvex nonsmooth problem (1). When no extrapolation is used, iADMM becomes a general ADMM framework that employs the majorization-minimization principle in each block update. For the first time in the nonconvex nonsmooth setting of Problem (1), we study ADMM and its inertial version combined with the MM principle when updating each block of variables. Moreover, our framework allows the use of an over-relaxation parameter $\alpha \in (0, 2)$ to set $\alpha\beta$ as the constant stepsize for updating the dual variable $\omega$. Note that $\alpha = 1$ is the standard choice in the nonconvex setting, see, e.g., [20, 23, 42]. In the convex setting, [14] showed that $\alpha$ can be chosen in a larger range, and $\alpha \in \big(0, (1+\sqrt{5})/2\big)$ is used in, e.g., [13, 49]. Recently, [7] proposed a proximal ADMM that uses $\alpha \in (0, 2)$ for solving a special case of the nonconvex Problem (1) with $s = 1$ and $\mathcal{A} = -\mathcal{I}$.
Under mild assumptions, we analyse the subsequential convergence guarantees of the sequences generated by iADMM and ADMM in parallel. When $F(x) + h(y)$ satisfies the KŁ property and $\alpha = 1$, we prove the global convergence of the generated sequence. Finally, we apply the proposed framework to solve a class of Problem (2) and report numerical results to illustrate the efficacy of iADMM.
2. An inertial ADMM framework.
In this section, we describe the iADMM framework and prove its subsequential and global convergence. Throughout the paper, we make the following assumptions, which are standard for studying Problem (1) and the convergence of ADMM in the nonconvex setting, see for example [42, 7, 23].
Assumption 1.
(i) $\sigma_B := \lambda_{\min}(\mathcal{B}\mathcal{B}^*) > 0$.
(ii) $F(x) + h(y)$ is lower bounded.
(iii) $h$ is an $L_h$-smooth function (that is, $\nabla h$ is Lipschitz continuous with constant $L_h$).

Let us first formally define a surrogate function. Some examples were given in the introduction.
Definition 2.1. Let $X \subseteq \mathbb{R}^n$. A function $u : X \times X \to \mathbb{R}$ is called a surrogate function of a function $f$ on $X$ if the following conditions are satisfied:
(a) $u(z, z) = f(z)$ for all $z \in X$,
(b) $u(x, z) \geq f(x)$ for all $x, z \in X$.

As we are considering multi-block problems, we need the following definition of a block surrogate function, which is a generalization of Definition 2.1.
Definition 2.2. Let $X_i \subseteq \mathbb{R}^{n_i}$ and $X \subseteq \mathbb{R}^n$. A function $u_i : X_i \times X \to \mathbb{R}$ is called a block $i$ surrogate function of $f$ on $X$ if the following conditions are satisfied:
(a) $u_i(z_i, z) = f(z)$ for all $z \in X$,
(b) $u_i(x_i, z) \geq f(x_i, z_{\neq i})$ for all $x_i \in X_i$ and $z \in X$, where $(x_i, z_{\neq i}) := (z_1, \dots, z_{i-1}, x_i, z_{i+1}, \dots, z_s)$.
The block approximation error is defined as $e_i(x_i, z) := u_i(x_i, z) - f(x_i, z_{\neq i})$.

Algorithm 1: iADMM for solving Problem (1)
Choose $x^0 = x^{-1}$, $y^0 = y^{-1}$, $\omega^0$. Let $u_i$, $i \in [s]$, be block $i$ surrogate functions of $f(x)$ on $\mathbb{R}^n$.
for $k = 0, 1, \dots$ do
  Set $x^{k,0} = x^k$.
  for $i = 1, \dots, s$ do
    Compute $\bar x_i^k = x_i^k + \zeta_i^k (x_i^k - x_i^{k-1})$. Update block $x_i$ by
    $$x_i^{k,i} \in \operatorname*{argmin}_{x_i} \Big\{ u_i(x_i, x^{k,i-1}) + g_i(x_i) + \big\langle \mathcal{A}_i^*\big(\omega^k + \beta(\mathcal{A}\bar x^{k,i-1} + \mathcal{B}y^k - b)\big), x_i \big\rangle + \frac{\kappa_i\beta}{2}\|x_i - \bar x_i^k\|^2 \Big\}, \tag{10}$$
    where $\kappa_i \geq \|\mathcal{A}_i^*\mathcal{A}_i\|$ and $\bar x^{k,i-1} = (x_1^{k+1}, \dots, x_{i-1}^{k+1}, \bar x_i^k, x_{i+1}^k, \dots, x_s^k)$.
    Set $x_j^{k,i} = x_j^{k,i-1}$ for all $j \neq i$.
  end for
  Set $x^{k+1} = x^{k,s}$. Compute $\hat y^k = y^k + \delta_k(y^k - y^{k-1})$. Update $y$ by
  $$y^{k+1} \in \operatorname*{argmin}_y \Big\{ \langle \mathcal{B}^*\omega^k + \nabla h(\hat y^k), y \rangle + \frac{\beta}{2}\|\mathcal{A}x^{k+1} + \mathcal{B}y - b\|^2 + \frac{L_h}{2}\|y - \hat y^k\|^2 \Big\}. \tag{11}$$
  Update $\omega$ by
  $$\omega^{k+1} = \omega^k + \alpha\beta(\mathcal{A}x^{k+1} + \mathcal{B}y^{k+1} - b). \tag{12}$$
end for

The inertial alternating direction method of multipliers (iADMM) framework is described in Algorithm 1. iADMM cyclically updates the blocks $x_1, \dots, x_s$ and $y$. We use $x^{k,i} = (x_1^{k+1}, \dots, x_i^{k+1}, x_{i+1}^k, \dots, x_s^k)$ and $x^{k+1} = x^{k,s}$, where $k$ is the outer iteration index and $i$ the cyclic inner iteration index ($i \in [s]$). The update of block $x_i$ in (10) (note that $x_i^{k+1} = x_i^{k,i}$) means that iADMM chooses a surrogate function of $x_i \mapsto \mathcal{L}(x_i, x^{k,i}_{\neq i}, y^k, \omega^k)$, formed by summing a surrogate function of $x_i \mapsto f(x_i, x^{k,i}_{\neq i}) + g_i(x_i)$ and a surrogate function of $x_i \mapsto \varphi_k(x_i, x^{k,i}_{\neq i})$, where $\varphi_k(x)$ is defined in (5), and then applies extrapolation to the latter surrogate function. To update block $y$, as $h(y)$ is $L_h$-smooth, we apply a Nesterov-type acceleration on $h$ as in (11). It is worth noting that it is possible to embed a general inertial term $\mathcal{G}_i^k$ into the surrogate of $x_i \mapsto \mathcal{L}(x_i, x^{k,i}_{\neq i}, y^k, \omega^k)$ as in [18]; this inertial term may also lead to extrapolation for the block surrogate function of $f(x)$, or for both block surrogates. However, to simplify our analysis, we only consider the effect of the inertial term on the block surrogate of $\varphi_k(x)$.
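The overall control flow of Algorithm 1 can be summarized by the following sketch (ours, in Python/NumPy); the callbacks `prox_block`, `y_update` and `grad_h` are placeholders for the problem-specific pieces (the block surrogate minimizer (10), the $y$-subproblem (11), and $\nabla h$), not functions defined in the paper.

```python
import numpy as np

def iadmm(blocks, prox_block, y_update, grad_h, B, b,
          beta, alpha, zeta, delta, n_iter=100):
    """Skeleton of Algorithm 1 (iADMM). `blocks` is a list of pairs (A_i, x_i^0);
    `prox_block(i, x_bar_i, lin_i, kappa_i_beta)` is assumed to solve (10)."""
    A_list = [A for A, _ in blocks]
    x = [x0.copy() for _, x0 in blocks]
    x_prev = [xi.copy() for xi in x]
    y = np.zeros(B.shape[1]); y_prev = y.copy()
    w = np.zeros(b.shape)

    def A_dot(xs):                                   # A x = sum_i A_i x_i
        return sum(A @ xi for A, xi in zip(A_list, xs))

    for _ in range(n_iter):
        for i, A_i in enumerate(A_list):
            x_bar = x[i] + zeta * (x[i] - x_prev[i])             # extrapolation
            xs_bar = x[:i] + [x_bar] + x[i + 1:]
            lin = A_i.T @ (w + beta * (A_dot(xs_bar) + B @ y - b))
            kappa_i = np.linalg.norm(A_i.T @ A_i, 2)             # kappa_i >= ||A_i* A_i||
            x_prev[i], x[i] = x[i], prox_block(i, x_bar, lin, kappa_i * beta)   # (10)
        y_hat = y + delta * (y - y_prev)                         # extrapolation
        y_prev, y = y, y_update(B.T @ w + grad_h(y_hat), A_dot(x), y_hat, beta) # (11)
        w = w + alpha * beta * (A_dot(x) + B @ y - b)            # (12)
    return x, y, w
```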
Together with Assumption 1, we make the following standard assumption on $u_i$ throughout the paper.

Assumption 2.
(i) The block surrogate function $u_i(x_i, z)$ is continuous.
(ii) Given $z \in \mathbb{R}^n$, for $i \in [s]$, there exists a function $x_i \mapsto \bar e_i(x_i, z)$ such that $\bar e_i(\cdot, z)$ is continuously differentiable at $z_i$, $\bar e_i(z_i, z) = 0$, $\nabla_{x_i}\bar e_i(z_i, z) = 0$, and the block approximation error $x_i \mapsto e_i(x_i, z)$ satisfies
$$e_i(x_i, z) \leq \bar e_i(x_i, z) \quad \text{for all } x_i. \tag{13}$$

The condition in Assumption 2 (ii) is satisfied when we simply choose $u_i(x_i, z) = f(x_i, z_{\neq i})$ (that is, $f(x_i, z_{\neq i})$ is a surrogate function of itself), when $e_i(\cdot, z)$ is continuously differentiable at $z_i$ with $\nabla_{x_i}e_i(z_i, z) = 0$, or when $e_i(x_i, z) \leq c\|x_i - z_i\|^{\epsilon}$ for some $\epsilon > 1$ and $c > 0$; see [18, Lemma 3].
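As a concrete check of Assumption 2 (ii) (a worked example of ours, under the extra assumption that $x_i \mapsto f(x_i, z_{\neq i})$ is $L_i$-smooth), the Lipschitz-gradient quadratic majorizer used in (6) and (8) admits the required error bound:

```latex
% Surrogate: u_i(x_i, z) = f(z) + <\nabla_{x_i} f(z), x_i - z_i> + (L_i/2)\|x_i - z_i\|^2.
% By the descent lemma, its block approximation error satisfies
\begin{align*}
e_i(x_i, z) &= u_i(x_i, z) - f(x_i, z_{\neq i}) \\
  &= \tfrac{L_i}{2}\|x_i - z_i\|^2
     - \big( f(x_i, z_{\neq i}) - f(z) - \langle \nabla_{x_i} f(z),\, x_i - z_i \rangle \big) \\
  &\le \tfrac{L_i}{2}\|x_i - z_i\|^2 + \tfrac{L_i}{2}\|x_i - z_i\|^2
   \;=\; L_i\|x_i - z_i\|^2 \;=:\; \bar e_i(x_i, z),
\end{align*}
% with \bar e_i(z_i, z) = 0 and \nabla_{x_i}\bar e_i(z_i, z) = 0, so Assumption 2 (ii) holds;
% this is the sufficient condition e_i(x_i,z) <= c\|x_i - z_i\|^\epsilon with c = L_i, \epsilon = 2.
```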
Remark. Before proceeding to the convergence analysis of iADMM, we make the following remark. As we target Nesterov-type acceleration in the update of $y$ (note that $h$ is assumed to be $L_h$-smooth), we analyse the update rule (11) for $y$. In case $y$ is updated by $y^{k+1} \in \operatorname*{argmin}_y \mathcal{L}(x^{k+1}, y, \omega^k)$, iADMM still works and the convergence analysis would be simplified, using the same rationale to obtain subsequential as well as global convergence. We hence omit this case in our analysis.

Let us start by defining some additional notation and conventions that will be used later. Let $x^{k,i}$, $y^k$ and $\omega^k$ be the iterates generated by iADMM. We denote $\Delta x_i^k = x_i^k - x_i^{k-1}$, $\Delta y^k = y^k - y^{k-1}$, $\Delta\omega^k = \omega^k - \omega^{k-1}$, $\alpha_1 = \frac{|1-\alpha|}{\alpha\sigma_B(1-|1-\alpha|)}$, $\alpha_2 = \frac{\alpha^2}{\sigma_B(1-|1-\alpha|)}$ and $\mathcal{L}^k = \mathcal{L}(x^k, y^k, \omega^k)$. We let $\nu_i$, $i \in [s]$, and $\nu_y$ be arbitrary constants in $(0, 1)$, with the following conventions:
• If $\zeta_i^k = 0$, that is, when we do not apply extrapolation in the update of $x_i^k$, we take $\zeta_i^k/\nu_i = 0$ and $\nu_i = 0$.
• If $\delta_k = 0$, that is, when we do not apply extrapolation in the update of $y$, we take $\delta_k/\nu_y = 0$ and $\nu_y = 0$.
Now we present our main convergence results. Their proofs can be found in the supplementary material. As iADMM allows extrapolation in the updates of $x_i^k$ and $y^k$, the Lagrangian is not guaranteed to satisfy the sufficient descent property; in fact, it is not guaranteed to decrease at each iteration. Instead, it has the nearly sufficiently decreasing property stated in the following Propositions 2.3 and 2.4.

Proposition 2.3.
(i) Considering the update in (10), in general when $x_i \mapsto u_i(x_i, z) + g_i(x_i)$ is nonconvex, we choose $\kappa_i > \|\mathcal{A}_i^*\mathcal{A}_i\|$. Denote $a_i^k = \beta\zeta_i^k(\kappa_i + \|\mathcal{A}_i^*\mathcal{A}_i\|)$. Then
$$\mathcal{L}(x^{k,i}, y^k, \omega^k) + \eta_i\|\Delta x_i^{k+1}\|^2 \leq \mathcal{L}(x^{k,i-1}, y^k, \omega^k) + \gamma_i^k\|\Delta x_i^k\|^2, \tag{14}$$
where
$$\eta_i = \frac{(1-\nu_i)(\kappa_i - \|\mathcal{A}_i^*\mathcal{A}_i\|)\beta}{2}, \qquad \gamma_i^k = \frac{(a_i^k)^2}{\nu_i(\kappa_i - \|\mathcal{A}_i^*\mathcal{A}_i\|)\beta}. \tag{15}$$
(ii) When $x_i \mapsto u_i(x_i, z) + g_i(x_i)$ is convex, we choose $\kappa_i = \|\mathcal{A}_i^*\mathcal{A}_i\|$ and Inequality (14) is satisfied with
$$\gamma_i^k = \frac{\beta\|\mathcal{A}_i^*\mathcal{A}_i\|(\zeta_i^k)^2}{2}, \qquad \eta_i = \frac{\beta\|\mathcal{A}_i^*\mathcal{A}_i\|}{2}. \tag{16}$$

Proposition 2.4.
Considering the update in (11), we have
$$\mathcal{L}(x^{k+1}, y^{k+1}, \omega^k) + \eta_y\|\Delta y^{k+1}\|^2 \leq \mathcal{L}(x^{k+1}, y^k, \omega^k) + \gamma_y^k\|\Delta y^k\|^2,$$
where $\eta_y = \frac{(1-\nu_y)(\beta\|\mathcal{B}^*\mathcal{B}\| + L_h)}{2}$ and $\gamma_y^k = \frac{(L_h\delta_k)^2}{\nu_y(\beta\|\mathcal{B}^*\mathcal{B}\| + L_h)}$ when $h(y)$ is nonconvex, and $\eta_y = \frac{L_h}{2}$ and $\gamma_y^k = \frac{L_h\delta_k^2}{2}$ when $h(y)$ is convex.

From Proposition 2.3 and Proposition 2.4, we obtain the following recursive inequality for $\{\mathcal{L}^k\}$ in Proposition 2.5, which serves as a cornerstone to derive the bounds on the extrapolation parameters $\zeta_i^k$ and $\delta_k$ in Proposition 2.6.

Proposition 2.5.
We have
$$\begin{aligned}
\mathcal{L}^{k+1} + \eta_y\|\Delta y^{k+1}\|^2 + \sum_{i=1}^s \eta_i\|\Delta x_i^{k+1}\|^2
&\leq \mathcal{L}^k + \sum_{i=1}^s \gamma_i^k\|\Delta x_i^k\|^2 + \gamma_y^k\|\Delta y^k\|^2 + \frac{\alpha_1}{\beta}\big(\|\mathcal{B}^*\Delta\omega^k\|^2 - \|\mathcal{B}^*\Delta\omega^{k+1}\|^2\big) \\
&\quad + \frac{\alpha_2}{\beta}L_h^2\|\Delta y^{k+1}\|^2 + \frac{\alpha_2}{\beta}\big(\bar\delta_k L_h^2\|\Delta y^k\|^2 + 4L_h^2\delta_{k-1}^2\|\Delta y^{k-1}\|^2\big),
\end{aligned} \tag{17}$$
where $\bar\delta_k = 2$ if $\delta_k = 0$ for all $k$, and $\bar\delta_k = 4(1+\delta_k)^2$ otherwise. Now we characterize the chosen parameters of Algorithm 1 in the following proposition.
Proposition 2.6. Let $\eta_y$, $\gamma_y^k$, $\eta_i$, $\gamma_i^k$, $i \in [s]$, and $\bar\delta_k$ be defined in Proposition 2.3 and Proposition 2.4. Denote
$$\mu = \eta_y - \frac{\alpha_2 L_h^2}{\beta}. \tag{18}$$
For $k \geq 1$, suppose the parameters are chosen such that $\mu > 0$, $\eta_i > 0$, and the following conditions are satisfied for some constants $0 < C_x, C_y < 1$:
$$\gamma_i^k \leq C_x\eta_i, \qquad \frac{4\alpha_2 L_h^2\delta_{k-1}^2}{\beta} \leq C_2\mu, \qquad \frac{\alpha_2 L_h^2\bar\delta_k}{\beta} + \gamma_y^k \leq C_1\mu, \tag{19}$$
where $C_1 = C_y$ and $C_2 = 0$ if $\delta_k = 0$ for all $k$, and $0 < C_1 < C_y$ with $C_2 = C_y - C_1$ otherwise.
(i) For $K > 0$ we have
$$\begin{aligned}
&\mathcal{L}^{K+1} + \mu\|\Delta y^{K+1}\|^2 + \sum_{i=1}^s \eta_i\|\Delta x_i^{K+1}\|^2 + \frac{\alpha_1}{\beta}\|\mathcal{B}^*\Delta\omega^{K+1}\|^2 + (1 - C_1)\mu\|\Delta y^K\|^2 \\
&\quad + \sum_{k=1}^{K-1}\Big[(1 - C_y)\mu\|\Delta y^k\|^2 + (1 - C_x)\sum_{i=1}^s \eta_i\|\Delta x_i^{k+1}\|^2\Big] \\
&\leq \mathcal{L}^1 + \frac{\alpha_1}{\beta}\|\mathcal{B}^*\Delta\omega^1\|^2 + C_x\sum_{i=1}^s \eta_i\|\Delta x_i^1\|^2 + \mu\|\Delta y^1\|^2 + C_2\mu\|\Delta y^0\|^2.
\end{aligned} \tag{20}$$
(ii) If we use one of the following methods:
• we choose $\delta_k = 0$ for all $k$, that is, there is no extrapolation in the update of $y$, or
• we use extrapolation in the update of $y$ and choose the parameters such that
$$\beta \geq \frac{L_h}{\alpha\sigma_B(1 - |1-\alpha|)}, \qquad \beta \geq \frac{\alpha L_h}{\mu\sigma_B(1 - |1-\alpha|)}\max\Big\{\,\cdot\,, \frac{\delta_{k-1}}{C_2}\Big\}, \tag{21}$$
then $\{\Delta y^k\}$, $\{\Delta x_i^k\}$ and $\{\Delta\omega^k\}$ converge to $0$.

We will assume that Algorithm 1 generates a bounded sequence in our subsequential and global convergence results. Let us provide a sufficient condition that guarantees this boundedness assumption in the following proposition.
Proposition 2.7. If $b + \mathrm{Im}(\mathcal{A}) \subseteq \mathrm{Im}(\mathcal{B})$, $\lambda_{\min}(\mathcal{B}^*\mathcal{B}) > 0$ and $F(x) + h(y)$ is coercive over the feasible set $\{(x, y) : \mathcal{A}x + \mathcal{B}y = b\}$, then the sequences $\{x^k\}$, $\{y^k\}$ and $\{\omega^k\}$ generated by Algorithm 1 are bounded.

It is important to note that coercivity of $F(x) + h(y)$ over the feasible set is weaker than coercivity of $F(x) + h(y)$ over $x \in \mathbb{R}^n$, $y \in \mathbb{R}^q$. Let us now present the subsequential convergence of the generated sequence.

Theorem 2.8. Suppose the parameters of Algorithm 1 are chosen such that the conditions in (19) of Proposition 2.6 are satisfied. If the generated sequence of Algorithm 1 is bounded, then every limit point of the generated sequence is a critical point of $\mathcal{L}$.

To obtain global convergence, we need the following Kurdyka–Łojasiewicz (KŁ) property of $F(x) + h(y)$.

Definition 2.9.
A function $\phi(\cdot)$ is said to have the KŁ property at $\bar x \in \mathrm{dom}\,\partial\phi$ if there exist $\varsigma \in (0, +\infty]$, a neighborhood $U$ of $\bar x$, and a concave function $\Upsilon : [0, \varsigma) \to \mathbb{R}_+$ that is continuously differentiable on $(0, \varsigma)$, continuous at $0$, with $\Upsilon(0) = 0$ and $\Upsilon'(t) > 0$ for all $t \in (0, \varsigma)$, such that for all $x \in U \cap [\phi(\bar x) < \phi(x) < \phi(\bar x) + \varsigma]$, we have
$$\Upsilon'(\phi(x) - \phi(\bar x))\,\mathrm{dist}(0, \partial\phi(x)) \geq 1, \tag{22}$$
where $\mathrm{dist}(0, \partial\phi(x)) = \min\{\|z\| : z \in \partial\phi(x)\}$. If $\phi(x)$ has the KŁ property at each point of $\mathrm{dom}\,\partial\phi$ then $\phi$ is a KŁ function.

Many non-convex non-smooth functions arising in practical applications belong to the class of KŁ functions, for example real analytic functions, semi-algebraic functions, and locally strongly convex functions; see for example [5, 6].
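As a simple illustration of Definition 2.9 (our own example, not taken from the paper), the function $\phi(x) = x^2/2$ on $\mathbb{R}$ satisfies the KŁ property at $\bar x = 0$ with the desingularizing function $\Upsilon(t) = \sqrt{2t}$:

```latex
% phi(x) = x^2/2, \bar x = 0, so \phi(\bar x) = 0 and \partial\phi(x) = \{x\}.
% Take \Upsilon(t) = \sqrt{2t}: concave, \Upsilon(0) = 0, \Upsilon'(t) = 1/\sqrt{2t} > 0. Then
\[
  \Upsilon'\big(\phi(x) - \phi(\bar x)\big)\,\operatorname{dist}\big(0, \partial\phi(x)\big)
  = \frac{1}{\sqrt{2 \cdot x^2/2}}\,|x| = 1 \;\ge\; 1
  \qquad \text{for all } x \neq 0,
\]
% i.e. (22) holds on any neighborhood of \bar x. Here \Upsilon(t) = c\,t^{1-a} with a = 1/2,
% the exponent associated with the linear-convergence regime discussed at the end of Section 2.
```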
Theorem 2.10. Suppose we do not use extrapolation to update $y$ (that is, $\delta_k = 0$ for all $k$) and we take $\alpha = 1$. Then the conditions in (19) become
$$\gamma_i^k \leq C_x\eta_i, \qquad \frac{2\alpha_2 L_h^2}{\beta} \leq C_y\mu, \tag{23}$$
for some constants $0 < C_x, C_y < 1$. Furthermore, we assume that (i) for any $x, z \in \mathbb{R}^n$, $x_i \in \mathrm{dom}(g_i)$, we have
$$\partial_{x_i}\big(f(x) + g_i(x_i)\big) = \partial_{x_i}f(x) + \partial_{x_i}g_i(x_i), \qquad \partial_{x_i}\big(u_i(x_i, z) + g_i(x_i)\big) = \partial_{x_i}u_i(x_i, z) + \partial_{x_i}g_i(x_i), \tag{24}$$
and (ii) for any $x, z$ in a bounded subset of $\mathbb{R}^n$, if $s_i \in \partial u_i(x_i, z)$, there exists $\xi_i \in \partial_{x_i}f(x)$ such that
$$\|\xi_i - s_i\| \leq L_i\|x - z\| \tag{25}$$
for some constant $L_i$. If the generated sequence of Algorithm 1 is bounded and $F(x) + h(y)$ has the KŁ property, then the whole generated sequence of Algorithm 1 converges to a critical point of $\mathcal{L}$.

We refer the reader to [38, Corollary 10.9] for a sufficient condition for (24) (see the supplementary material for more details). Some specific examples that satisfy (24) include: (i) $g_i = 0$; (ii) the functions $x_i \mapsto f(x)$ and $x_i \mapsto u_i(x_i, z)$ are strictly differentiable (see [38, Exercise 10.10]); (iii) the functions $x_i \mapsto f(x)$ and $x_i \mapsto u_i(x_i, z)$ are convex and the relative interior qualification conditions are satisfied: $\mathrm{ri}(\mathrm{dom}\,f(\cdot, x_{\neq i})) \cap \mathrm{ri}(\mathrm{dom}\,g_i) \neq \emptyset$ and $\mathrm{ri}(\mathrm{dom}\,u_i(\cdot, z)) \cap \mathrm{ri}(\mathrm{dom}\,g_i) \neq \emptyset$. We note that although the condition in (25) is necessary for our convergence proof, the Lipschitz constant $L_i$ does not influence how the parameters of our framework are chosen. We end this section by noting that a convergence rate for the generated sequence of iADMM can be derived using the same technique as in the proof of [1, Theorem 2]. Some examples of using the technique of [1, Theorem 2] to derive convergence rates include [46, Theorem 2.9] and [17, Theorem 3]. Beyond the convergence rate, which appears to be the same in different papers using the technique in [1], determining the KŁ exponent, that is, the coefficient $a$ when $\Upsilon(t) = c\,t^{1-a}$, where $c$ is a constant, is an active and challenging topic. The type of convergence rate depends on the value of $a$: when $a = 0$, the algorithm converges after a finite number of steps; when $a \in (0, 1/2]$ it has linear convergence, and when $a \in (1/2, 1)$ it has sublinear convergence. Determining the value of $a$ is out of the scope of this paper.
3. Numerical results.
In this section, we apply iADMM to solve a latent low-rank representation problem of the form of Problem (2); see Section 1.1. Specifically, we choose $r_1(t) = \lambda_1 t$, $r_3(Z) = \frac{1}{2}\|Z\|_F^2$ (hence $Z$ represents Gaussian noise), and consider a nonconvex regularization function for $Y$, $r_2(Y) = \lambda_2\sum_{i=1}^q \phi(\|Y_i\|)$, where $Y_i$ is the $i$-th column of $Y$ and $\phi(t) = 1 - \exp(-\theta t)$ [9]. In the upcoming experiments, we choose $A_1 = D P_1$ and $A_2 = P_2^* D$ as proposed in [27], where $P_1$ and $P_2$ are computed by orthogonalizing the columns of $D^*$ and $D$, respectively.
Problem (2) in this case takes the form of (1) with $\mathcal{B}$ being the identity operator, $b$ being the data matrix $D$, $x_1$ and $x_2$ being the matrices $X$ and $Y$, $y$ being the matrix $Z$, $f(X, Y) = \lambda_1\|X\|_* + r_2(Y)$, $g_i = 0$ and $h(Z) = \frac{1}{2}\|Z\|_F^2$.
We choose the following block surrogate functions for $f$: $u_1(X, (X^k, Y^k)) = \lambda_1\|X\|_* + r_2(Y^k)$ and $u_2(Y, (X^{k+1}, Y^k)) = r_2(Y^k) + \sum_{i=1}^q \varsigma_i^k\big(\|Y_i\| - \|Y_i^k\|\big) + \lambda_1\|X^{k+1}\|_*$, where $\varsigma_i^k \in \lambda_2\nabla\phi(\|Y_i^k\|)$.
[Figure 1 shows six panels: segmentation error and objective value versus time (s) on Hopkins155, Umist10, and Yaleb10, comparing iADMM-mm, ADMM-mm, and linearizedADMM.]
Fig. 1. Evolution of the segmentation error rate and of the objective function value with respect to time. For Hopkins155, the results are the average values over 156 sequences.

Obviously $u_1$ satisfies Assumption 2 and $u_2$ satisfies Assumption 2 (i). Since $\phi$ is continuously differentiable with Lipschitz gradient on $[0, +\infty)$ and the Euclidean norm is Lipschitz continuous, it follows from Section 4.5 of [18] that $u_2$ satisfies Assumption 2 (ii).
For updating $X$, according to the update (10), $X^{k+1}$ is computed by solving
$$\min_X \; \lambda_1\|X\|_* + \Big\langle A_1^*\big(\beta(A_1\bar X^k + Y^kA_2 + Z^k - D) + W^k\big), X \Big\rangle + \frac{\kappa_1\beta}{2}\|X - \bar X^k\|^2, \tag{26}$$
where $\kappa_1 \geq \|A_1^*A_1\|$ and $\bar X^k = X^k + \zeta_1^k(X^k - X^{k-1})$. The sub-problem (26) has a closed-form solution given by $X^{k+1} = U S_{\lambda_1/(\kappa_1\beta)} V^T$, where $USV^T$ is the SVD of $\bar X^k - A_1^*\big(A_1\bar X^k + Y^kA_2 + Z^k - D + W^k/\beta\big)/\kappa_1$ and $S_{\lambda_1/(\kappa_1\beta)} = \mathrm{diag}([S_{ii} - \lambda_1/(\kappa_1\beta)]_+)$, where $\mathrm{diag}(u)$ is a diagonal matrix whose diagonal elements are the entries of $u$, and $[\cdot]_+$ is the projection onto the nonnegative orthant.
The update (11) for $Y$ is
$$Y^{k+1} \in \operatorname*{argmin}_Y \; \sum_{i=1}^q \varsigma_i^k\|Y_i\| + \Big\langle \big(W^k + \beta(A_1X^{k+1} + \bar Y^kA_2 + Z^k - D)\big)A_2^*, Y \Big\rangle + \frac{\kappa_2\beta}{2}\|Y - \bar Y^k\|^2,$$
where $\kappa_2 \geq \|A_2A_2^*\|$ and $\bar Y^k = Y^k + \zeta_2^k(Y^k - Y^{k-1})$. The sub-problem above has the closed-form solution
$$Y_i^{k+1} = \big[\|P_i^k\| - \varsigma_i^k/(\kappa_2\beta)\big]_+ \frac{P_i^k}{\|P_i^k\|},$$
where $P_i^k$ is the $i$-th column of $\bar Y^k - \big(A_1X^{k+1} + \bar Y^kA_2 + Z^k - D\big)A_2^*/\kappa_2 - W^kA_2^*/(\kappa_2\beta)$.
The updates (11) and (12) for $Z$ and $W$ are respectively given by
$$Z^{k+1} = -\big(W^k + \beta(A_1X^{k+1} + Y^{k+1}A_2 - D)\big)/(1+\beta), \qquad W^{k+1} = W^k + \alpha\beta(A_1X^{k+1} + Y^{k+1}A_2 + Z^{k+1} - D).$$
Let us determine the parameters. Note that $L_h = 1$, $\sigma_B = 1$, and $\delta_k = 0$. Since $h(Z)$ is convex and we do not apply extrapolation for $Z$, by Proposition 2.4 we have $\eta_y = 1/2$ and $\gamma_y^k = 0$. Since $\|X\|_*$ and $\sum_{i=1}^q \varsigma_i^k\|Y_i\|$ are convex, we choose $\kappa_1 = \|A_1^*A_1\|$, $\kappa_2 = \|A_2A_2^*\|$ and the conditions in (23) become $\zeta_i^k \leq \sqrt{C_x}$ (for $i = 1, 2$) and $2(2+C_y)\alpha^2/\beta \leq C_y$. In our experiments, we choose $C_x$ and $C_y$ slightly smaller than $1$, $\alpha = 1$, $\beta = 2(2+C_y)\alpha^2/C_y$, $a_0 = 1$, $a_k = \big(1 + \sqrt{1 + 4a_{k-1}^2}\big)/2$, and $\zeta_i^k = \min\big\{\frac{a_{k-1}-1}{a_k}, \sqrt{C_x}\big\}$.
We compare iADMM without extrapolation, denoted ADMM-mm, and iADMM with extrapolation, denoted iADMM-mm, with a linearized ADMM that differs from ADMM-mm only in the update of $Y$. In particular, the linearizedADMM method updates $Y$ by solving the nonconvex sub-problems
$$\min_{Y_i} \; -\lambda_2\exp(-\theta\|Y_i\|) + \frac{\kappa_2\beta}{2}\|Y_i - V_i^k\|^2,$$
where $V_i^k$ is the $i$-th column of $\bar Y^k - \big(W^k + \beta(A_1X^{k+1} + \bar Y^kA_2 + Z^k - D)\big)A_2^*/(\kappa_2\beta)$. Since these sub-problems do not have closed-form solutions, we employ an MM scheme to solve them.
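For illustration, here is a minimal NumPy sketch (ours) of the two closed-form block updates above, namely singular value thresholding for $X$ and column-wise soft-thresholding for $Y$, together with the extrapolation-weight schedule as we read it from the description above; all names are illustrative and this is not the authors' Matlab code.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: prox of tau*||.||_* at M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def columnwise_shrink(M, tau):
    """Column-wise soft-thresholding: prox of sum_i tau_i * ||column_i||."""
    norms = np.maximum(np.linalg.norm(M, axis=0), 1e-12)
    return M * (np.maximum(norms - tau, 0.0) / norms)

def x_update(X, X_prev, Y, Z, W, D, A1, A2, lam1, beta, zeta):
    """Closed-form X-update (26): extrapolate, take a gradient step on the
    quadratic coupling term, then threshold the singular values."""
    kappa1 = np.linalg.norm(A1.T @ A1, 2)
    X_bar = X + zeta * (X - X_prev)
    G = A1.T @ (A1 @ X_bar + Y @ A2 + Z - D + W / beta) / kappa1
    return svt(X_bar - G, lam1 / (kappa1 * beta))

def extrapolation_weight(a_prev, Cx):
    """FISTA-type weight capped by sqrt(Cx), mirroring the schedule above."""
    a = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * a_prev ** 2))
    return min((a_prev - 1.0) / a, np.sqrt(Cx)), a
```

The $Y$-update is analogous, using `columnwise_shrink` with the per-column thresholds $\varsigma_i^k/(\kappa_2\beta)$.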
Table 1
Comparison of the segmentation error rate and of the final objective function values obtained within the allotted time. Bold values indicate the best results.
Data set | Method | Error (mean ± std) | Obj. value (mean ± std)
Hopkins155 | linearizedADMM | 0.1579 ± | ±
Hopkins155 | ADMM-mm | ± | ±
Hopkins155 | iADMM-mm | ± | ±
Umist10 | linearizedADMM | 0.5170 | 1.0838 ×
Umist10 | ADMM-mm | 0.5170 | 1.0167 ×
Umist10 | iADMM-mm | | ×
Yaleb10 | linearizedADMM | 0.7656 | 5.2317 ×
Yaleb10 | ADMM-mm | 0.7047 | 4.4829 ×
Yaleb10 | iADMM-mm | | ×

To examine the performance of the algorithms, we consider subspace segmentation tasks. In particular, after obtaining $X^*$, we follow the setting in [26] to construct the affinity matrix $Q$ by $Q_{ij} = (\tilde U\tilde U^T)_{ij}$, where $\tilde U$ is formed by $U^*(\Sigma^*)^{1/2}$ with normalized rows and $U^*\Sigma^*(V^*)^T$ is the SVD of $X^*$. Finally, we apply Normalized Cuts [21] on $Q$ to cluster the data into groups.
The experiments are run on three data sets: Hopkins 155, Extended Yale B, and Umist. Hopkins 155 consists of 156 sequences, each of which has from 39 to 550 vectors drawn from two or three motions (one motion corresponds to one subspace). Each sequence is a sole segmentation task and thus there are 156 clustering tasks in total. Yale B contains 2414 frontal face images of 38 classes, while Umist contains 564 images of 20 classes. To avoid computational issues when computing the segmentation error rate, we construct clustering tasks by using only the first 10 classes of these two data sets, as proposed in [29].
All tests are performed using Matlab R2019a on a PC with a 2.3 GHz Intel Core i5 and 8GB of RAM. The code is available from https://github.com/nhatpd/iADMM.
In our experiments, we choose $\theta = 5$, $\lambda_1 = \lambda_2 = 0.01$ for Hopkins 155, and $\lambda_1 = \lambda_2 = 1$ for the two other data sets. We note that we do not optimize the numerical results by tweaking the parameters, as this is beyond the scope of this work. It is important to note that we evaluate the algorithms on the same models. We set the initial points to zero. We run each algorithm for 10, 300, and 500 seconds for each sequence of Hopkins 155, Umist10, and Yaleb10, respectively. We plot the curves of the segmentation error rate and of the objective function value versus the training time in Figure 1, and report the final values in Table 1. Since there are 156 sequences (data sets) in Hopkins 155, we plot the average values, and report the final average results and standard deviations over these sequences.
We observe that iADMM-mm converges the fastest on all the data sets, providing a significant acceleration of ADMM-mm. iADMM-mm achieves not only the best final objective function values but also the best segmentation error rates. This illustrates the usefulness of the acceleration technique. In addition, ADMM-mm outperforms linearizedADMM, which illustrates the usefulness of choosing a proper surrogate function.
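To connect the optimization output to the clustering results above, the following short sketch (ours; illustrative names) mirrors the post-processing described earlier in this section, turning the recovered $X^*$ into an affinity matrix to which Normalized Cuts is then applied:

```python
import numpy as np

def affinity_from_X(X_star, rank_tol=1e-8):
    """Build Q = U_tilde @ U_tilde.T, where U_tilde = U * Sigma^{1/2} with
    normalized rows and U Sigma V^T is the SVD of X*."""
    U, s, _ = np.linalg.svd(X_star, full_matrices=False)
    keep = s > rank_tol                                   # drop numerically zero singular values
    U_tilde = U[:, keep] * np.sqrt(s[keep])
    U_tilde /= np.maximum(np.linalg.norm(U_tilde, axis=1, keepdims=True), 1e-12)
    return U_tilde @ U_tilde.T                            # fed to Normalized Cuts [21]
```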
4. Conclusion.
We have analysed iADMM, a framework of inertial alternating direction methods of multipliers, for solving a class of nonconvex nonsmooth optimization problems with linear constraints. The preliminary computational results on a class of nonconvex low-rank representation problems not only show the efficacy of using inertial terms for ADMM, but also show the advantage of using suitable block surrogate functions that may lead to closed-form solutions in the block updates of ADMM. We conclude the paper by mentioning two important questions that we consider as future research directions:
• Can we extend the cyclic update rule of iADMM to a randomized/non-cyclic setting?
• To guarantee global convergence, iADMM does not allow extrapolation in the update of y; see Theorem 2.10. Can we extend the analysis to allow extrapolation in the update of y?

REFERENCES
[1] H. Attouch and J. Bolte. On the convergence of the proximal algorithm for nonsmooth functions involving analytic features.
Mathematical Programming , 116(1):5–16, Jan 2009.[2] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization andprojection methods for nonconvex problems: An approach based on the Kurdyka-(cid:32)Lojasiewiczinequality.
Mathematics of Operations Research , 35(2):438–457, 2010.[3] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducingpenalties.
Foundations and Trends in Machine Learning , 4, 08 2011.[4] A. Beck and L. Tetruashvili. On the convergence of block coordinate descent type methods.
SIAM Journal on Optimization , 23:2037–2060, 2013.[5] J. Bochnak, M. Coste, and M.-F. Roy.
Real Algebraic Geometry . Springer, 1998.[6] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvexand nonsmooth problems.
Mathematical Programming , 146(1):459–494, Aug 2014.[7] R. I. Bot and D.-K. Nguyen. The proximal alternating direction method of multipliers in thenonconvex setting: Convergence analysis and rates.
Mathematics of Operations Research ,45(2):682–712, 2020.[8] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statisticallearning via the alternating direction method of multipliers.
Found. Trends Mach. Learn. ,3(1):1–122, Jan. 2011.[9] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and supportvector machines. In
Proceeding of international conference on machine learning ICML’98 ,1998.[10] E. J. Cand`es, X. Li, Y. Ma, and J. Wright. Robust principal component analysis?
J. ACM ,58(3), 2011.[11] L. Canyi, J. Feng, S. Yan, and Z. Lin. A unified alternating direction method of multipliers bymajorization minimization.
IEEE transactions on pattern analysis and machine intelligence ,40:527 – 541, 07 2018.[12] W. Deng and W. Yin. On the global and linear convergence of the generalized alternatingdirection method of multipliers.
Rice CAAM tech report TR12-14 , 66, 01 2012.[13] M. Fazel, T. K. Pong, D. Sun, and P. Tseng. Hankel matrix rank minimization with applicationsto system identification and realization.
SIAM Journal on Matrix Analysis and Applications ,34(3):946–977, 2013.[14] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problemsvia finite element approximation.
Computers & Mathematics with Applications , 2(1):17 –40, 1976.[15] R. Glowinski and A. Marroco. Sur l’approximation, par ´el´ements finis d’ordre un, et lar´esolution, par p´enalisation-dualit´e d’une classe de probl`emes de dirichlet non lin´eaires.
ESAIM: Mathematical Modelling and Numerical Analysis - Mod´elisation Math´ematique etAnalyse Num´erique , 9(R2):41–76, 1975.[16] L. Grippo and M. Sciandrone. On the convergence of the block nonlinear gauss–seidel methodunder convex constraints.
Operations Research Letters , 26(3):127 – 136, 2000.[17] L. T. K. Hien, N. Gillis, and P. Patrinos. Inertial block proximal method for non-convexnon-smooth optimization. In
Thirty-seventh International Conference on Machine Learning L. T. K. HIEN, D. N. PHAN, N. GILLIS
ICML 2020 , 2020.[18] L. T. K. Hien, D. N. Phan, and N. Gillis. Inertial block majorization minimization frameworkfor nonconvex nonsmooth optimization. arXiv:2010.12133, 2020.[19] C. Hildreth. A quadratic programming procedure.
Naval Research Logistics Quarterly , 4(1):79–85, 1957.[20] M. Hong, T.-H. Chang, X. Wang, M. Razaviyayn, S. Ma, and Z.-Q. Luo. A block successive upper-bound minimization method of multipliers for linearly constrained convex optimization.
Mathematics of Operations Research , 45(3):833–861, 2020.[21] Jianbo Shi and J. Malik. Normalized cuts and image segmentation.
IEEE Trans. Pattern Anal.Mach. Intell. , 22(8):888–905, 2000.[22] R. Lai and S. Osher. A splitting method for orthogonality constrained problems.
Journal ofScientific Computing , 58, 02 2014.[23] G. Li and T. K. Pong. Global convergence of splitting methods for nonconvex compositeoptimization.
SIAM Journal on Optimization , 25(4):2434–2460, 2015.[24] H. Li and Z. Lin. Accelerated alternating direction method of multipliers: An optimal o(1 / k)nonergodic analysis.
Journal of Scientific Computing , 79:671–699, 05 2019.[25] Z. Lin, R. Liu, and Z. Su. Linearized alternating direction method with adaptive penalty forlow-rank representation. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q.Weinberger, editors,
Advances in Neural Information Processing Systems , volume 24, pages612–620. Curran Associates, Inc., 2011.[26] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures bylow-rank representation.
IEEE Trans. Pattern Anal. Mach. Intell. , 35(1):171–184, 2013.[27] G. Liu and S. Yan. Latent low-rank representation for subspace segmentation and featureextraction. , pages 1615–1622, 2011.[28] Q. Liu, X. Shen, and Y. Gu. Linearized admm for nonconvex nonsmooth optimization withconvergence analysis.
IEEE Access , 7:76131–76144, 2019.[29] C. Lu, J. Tang, S. Yan, and Z. Lin. Nonconvex nonsmooth low rank minimization via iterativelyreweighted nuclear norm.
IEEE Transactions on Image Processing , 25(2):829–839, 2016.[30] J. G. Melo and R. D. C. Monteiro. Iteration-complexity of a jacobi-type non-euclidean admmfor multi-block linearly constrained nonconvex programs, 2017.[31] Y. Nesterov.
Introductory lectures on convex optimization: A basic course . Kluwer AcademicPubl., 2004.[32] P. Ochs. Unifying abstract inexact convergence theorems and block coordinate variable metricipiano.
SIAM Journal on Optimization , 29(1):541–570, 2019.[33] Y. Ouyang, Y. Chen, G. Lan, and E. Pasiliao. An accelerated linearized alternating directionmethod of multipliers.
SIAM Journal on Imaging Sciences , 8(1):644–681, 2015.[34] N. Parikh and S. Boyd. Proximal algorithms.
Foundations and Trends in Optimization ,1(3):127–239, 2014.[35] T. Pock and S. Sabach. Inertial proximal alternating linearized minimization (iPALM) fornonconvex and nonsmooth problems.
SIAM Journal on Imaging Sciences , 9(4):1756–1787,2016.[36] M. J. D. Powell. On search directions for minimization algorithms.
Mathematical Programming ,4(1):193–201, Dec 1973.[37] M. Razaviyayn, M. Hong, and Z. Luo. A unified convergence analysis of block successiveminimization methods for nonsmooth optimization.
SIAM Journal on Optimization ,23(2):1126–1153, 2013.[38] R. T. Rockafellar and R. J.-B. Wets.
Variational Analysis . Springer Verlag, Heidelberg, Berlin,New York, 1998.[39] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternatinglinearization methods. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel,and A. Culotta, editors,
Advances in Neural Information Processing Systems 23 , pages2101–2109. Curran Associates, Inc., 2010.[40] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization.
Journal of Optimization Theory and Applications , 109(3):475–494, Jun 2001.[41] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimiza-tion.
Mathematical Programming , 117(1):387–423, Mar 2009.[42] Y. Wang, W. Yin, and J. Zeng. Global convergence of admm in nonconvex nonsmoothoptimization.
Journal of Scientific Computing , 78:29–63, 01 2019.[43] Y. Wang, J. Zeng, Z. Peng, X. Chang, and Z. Xu. Linear convergence of adaptively iterativethresholding algorithms for compressed sensing.
IEEE Transactions on Signal Processing ,63(11):2957–2971, 2015.[44] Z. Wen and W. Yin. A feasible method for optimization with orthogonality constraints.NERTIAL ALTERNATING DIRECTION METHOD OF MULTIPLIERS Mathematical Programming , 142, 12 2010.[45] M. Xu and T. Wu. A class of linearized proximal alternating direction methods.
J. OptimizationTheory and Applications , 151:321–337, 11 2011.[46] Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimizationwith applications to nonnegative tensor factorization and completion.
SIAM Journal onImaging Sciences , 6(3):1758–1789, 2013.[47] Y. Xu and W. Yin. A globally convergent algorithm for nonconvex optimization based on blockcoordinate update.
Journal of Scientific Computing , 72(2):700–734, Aug 2017.[48] J. Yang, Y. Zhang, and W. Yin. An efficient tvl1 algorithm for deblurring multichannel imagescorrupted by impulsive noise.
SIAM Journal on Scientific Computing , 31(4):2842–2865,2009.[49] L. Yang, T. K. Pong, and X. Chen. Alternating direction method of multipliers for a class ofnonconvex and nonsmooth problems with applications to background/foreground extraction.
SIAM Journal on Imaging Sciences , 10(1):74–110, 2017.[50] W. Yin, S. Osher, D. Goldfarb, and J. Darbon. Bregman iterative algorithms for l(1)-minimization with applications to compressed sensing.
Siam Journal on Imaging Sciences ,1:143–168, 01 2008.
Appendix A. Preliminaries of non-convex non-smooth optimization.
Let g : E → R ∪ { + ∞} be a proper lower semicontinuous function. Definition
A.1. [38, Definition 8.3](i) For any x ∈ dom g, and d ∈ E , we denote the directional derivative of g at x in the direction d by g (cid:48) ( x ; d ) = lim inf τ ↓ g ( x + τ d ) − g ( x ) τ . (ii) For each x ∈ dom g, we denote ˆ ∂g ( x ) as the Frechet subdifferential of g at x which contains vectors v ∈ E satisfying lim inf y (cid:54) = x,y → x (cid:107) y − x (cid:107) ( g ( y ) − g ( x ) − (cid:104) v, y − x (cid:105) ) ≥ . If x (cid:54)∈ dom g, then we set ˆ ∂g ( x ) = ∅ . (iii) The limiting-subdifferential ∂g ( x ) of g at x ∈ dom g is defined as follows: ∂g ( x ) := (cid:110) v ∈ E : ∃ x ( k ) → x, g (cid:16) x ( k ) (cid:17) → g ( x ) , v ( k ) ∈ ˆ ∂g (cid:16) x ( k ) (cid:17) , v ( k ) → v (cid:111) . (iv) The horizon subdifferential ∂ ∞ g ( x ) of g at x is defined as follows: ∂ ∞ g ( x ) := (cid:110) v ∈ E : ∃ λ ( k ) → , λ ( k ) ≥ , λ ( k ) x ( k ) → x,g (cid:16) x ( k ) (cid:17) → g ( x ) , v ( k ) ∈ ˆ ∂g (cid:16) x ( k ) (cid:17) , v ( k ) → v (cid:111) . Definition
A.2.
We call x ∗ ∈ dom F a critical point of F if ∈ ∂F ( x ∗ ) . Definition
A.3. [38, Definition 7.5] A function f : R n → R ∪ { + ∞} is calledsubdifferentially regular at ¯ x if f (¯ x ) is finite and the epigraph of f is Clarke regular at (¯ x, f (¯ x )) as a subset of R n × R (see [38, Definition 6.4] for the definition of Clarkeregularity of a set at a point). Proposition
A.4. [38, Corollary 10.9] Suppose f = f + · + f m for proper lowersemi-continuous function f i : R n → R ∪ { + ∞} and let ¯ x ∈ dom f . Suppose eachfunction f i is subdifferential regular at ¯ x , and the condition that the only combinationof vector ν i ∈ ∂ ∞ f i (¯ x ) with ν + . . . ν m = 0 is ν i = 0 for i ∈ [ m ] . Then we have ∂f (¯ x ) = ∂f (¯ x ) + . . . ∂f m (¯ x ) . L. T. K. HIEN, D. N. PHAN, N. GILLIS
Appendix B. Proofs.
Before proving the propositions, let us give some prelimi-nary results. We use x, z to denote the vectors in R n . Lemma
B.1. [18, Lemma 2.8] If the function x i (cid:55)→ Θ( x i , z ) is ρ -strongly convex,differentiable at z i , and ∇ x i Θ( z i , z ) = 0 then we have Θ( x i , z ) ≥ ρ (cid:107) x i − z i (cid:107) . We recall the notation ( x i , z (cid:54) = i ) = ( z , . . . , z i − , x i , z i +1 , . . . , z s ). Suppose we are tryingto solve min x Ψ( x ) := Φ( x ) + s (cid:88) i =1 g i ( x i ) . Proposition
B.2. [18, Theorem 2.7] Suppose G ki : R n i × R n i → R n i be someextrapolation operator that satisfies G ki ( x ki , x k − i ) ≤ a ki (cid:107) x ki − x k − i (cid:107) . Let u i ( x i , z ) is ablock surrogate function of Φ( x ) . We assume one of the following conditions holds: • x i (cid:55)→ u i ( x i , z ) + g i ( x i ) is ρ i -strongly convex, • the approximation error Θ( x i , z ) := u i ( x i , z ) − Φ( x i , z (cid:54) = i ) satisfying Θ( x i , z ) ≥ ρ i (cid:107) x i − z i (cid:107) for all x i .Note that ρ i may depend on z . Let x k +1 i = argmin x i u i ( x i , x k,i − ) + g i ( x i ) − (cid:104)G ki ( x ki , x k − i ) , x i (cid:105) . Then we have (27) Ψ( x k,i − ) + γ ki (cid:107) x ki − x k − i (cid:107) ≥ Ψ( x k,i ) + η ki (cid:107) x k +1 i − x ki (cid:107) , where γ ki = ( a ki ) νρ i , η ki = (1 − ν ) ρ i , and < ν < is a constant. If we do not apply extrapolation, that is a ki = 0 , then (27) is satisfied with γ ki = 0 and η ki = ρ i / . The following proposition is derived from [17, Remark 3] and [46, Lemma 2.1].
Proposition
B.3.
Suppose x i (cid:55)→ Φ( x ) is a L i -smooth convex function and g i ( x i ) is convex. Define ˆ x ki = x ki + α ki ( x ki − x k − i ) , ¯ x ki = x ki + β ki ( x ki − x k − i ) , and ¯ x k,i − =( x k +11 , . . . , x k +1 i − , ¯ x ki , x ki +1 , . . . , x ks ) . Let x k +1 i = argmin x i (cid:104)∇ Φ(¯ x k,i − ) , x i (cid:105) + g i ( x i ) + L i (cid:107) x i − ˆ x ki (cid:107) . Then we have Inequality (27) is satisfied with γ ki = L i (cid:0) ( β ki ) + ( γ ki − α ki ) ν (cid:1) , η ki = (1 − ν ) L i . If α ki = β ki then we have Inequality (27) is satisfied with γ ki = L i β ki ) , η ki = L i . NERTIAL ALTERNATING DIRECTION METHOD OF MULTIPLIERS B.1. Proof of Proposition 2.3. (i) Suppose we are updating x ki . Let us recallthat L ( x, y, ω ) := f ( x ) + (cid:80) si =1 g i ( x i ) + h ( y ) + ϕ ( x, y, ω ), where(28) ϕ ( x, y, ω ) = β (cid:107)A x + B y − b (cid:107) + (cid:104) ω, A x + B y − b (cid:105) . Denote u i ( x i , z, y, ω ) = u i ( x i , z ) + h ( y ) + ˆ ϕ i ( x i , z, y, ω ) , whereˆ ϕ i ( x i , z, y, ω ) = ϕ ( z, y, ω ) + (cid:104)A ∗ i (cid:0) ω + β ( A z + B y − b ) (cid:1) , x i − z i (cid:105) + κ i β (cid:107) x i − z i (cid:107) . We see that ˆ ϕ i ( x i , z, y, ω ) is a block surrogate function of x (cid:55)→ ϕ ( x, y, ω ) withrespect to block x i , and u i ( x i , z, y, ω ) is a block surrogate function of x (cid:55)→ f ( x ) + h ( y ) + ϕ ( x, y, ω ) with respect to block x i . The update in (10) can be rewritten asfollows.(29) x k +1 i = argmin x i u i ( x i , x k,i − , y k , ω k ) + g i ( x i ) − (cid:104)G ki ( x ki , x k − i ) , x i (cid:105) , where G ki ( x ki , x k − i ) = β A ∗ i A (cid:0) x k,i − − ¯ x k,i − ) (cid:1) + κ i βζ ki ( x ki − x k − i ) . (30)The block approximation error function between u i ( x i , z, y, ω ) and x (cid:55)→ f ( x ) + h ( y ) + ϕ ( x, y, ω ) is defined as e i ( x i , z, y, ω ) = u i ( x i , z, y, ω ) − (cid:0) f ( x i , z (cid:54) = i ) + h ( y ) + ϕ (( x i , z (cid:54) = i ) , y, ω ) (cid:1) = u i ( x i , z ) − f ( x i , z (cid:54) = i ) + ˆ ϕ i ( x i , z, y, ω ) − ϕ (( x i , z (cid:54) = i ) , y, ω ) ≥ θ i ( x i , z, y, ω ) := ϕ ( z, y, ω ) − ϕ (( x i , z (cid:54) = i ) , y, ω )+ (cid:104)A ∗ i (cid:0) ω + β ( A z + B y − b ) (cid:1) , x i − z i (cid:105) + κ i β (cid:107) x i − z i (cid:107) . (31)We have ∇ x i θ i ( x i , z, y, ω ) = κ i β ( x i − z i ) + ∇ x i ϕ ( z, y, ω ) − ∇ x i ϕ (( x i , z (cid:54) = i ) , y, ω ). Hence ∇ x i θ i ( z i , z ) = 0. On the other hand, note that x i (cid:55)→ ϕ (( x i , z (cid:54) = i ) , y k , ω k ) is β (cid:107)A ∗ i A i (cid:107) -smooth. So, x i (cid:55)→ θ i ( x i , z, y, ω ) is a β ( κ i − (cid:107)A ∗ i A i (cid:107) ) - strongly convex function. FromLemma B.1 we have θ i ( x i , z ) ≥ β ( κ i −(cid:107)A ∗ i A i (cid:107) )2 (cid:107) x i − z i (cid:107) . The result follows from (29),(31) and Proposition (B.2).(ii) When x i (cid:55)→ u i ( x i , z ) + g i ( x i ) is convex and we apply the update as in (10), itfollows from Proposition B.3 (see also [18, Remark 4.1]) that u i ( x ki , x k,i − ) + g i ( x ki ) + ϕ ( x k,i − , y k , ω k ) + β (cid:107)A ∗ i A i (cid:107) ζ ki ) (cid:107) x ki − x k − i (cid:107) ≥ u i ( x k +1 i , x k,i − ) + g i ( x k +1 i ) + ϕ ( x k,i , y k , ω k ) + β (cid:107)A ∗ i A i (cid:107) (cid:107) x k +1 i − x ki (cid:107) . (32)On the other hand, note that u i ( x ki , x k,i − ) = f ( x k,i − ) and u i ( x k +1 i , x k,i − ) ≥ f ( x k,i ).The result follows then.8 L. T. K. HIEN, D. N. PHAN, N. GILLIS
B.2. Proof of Proposition 2.4.
Denoteˆ h ( y, y (cid:48) ) = h ( y (cid:48) ) + (cid:104) ω, A x + B y (cid:48) − b (cid:105) + (cid:104)B ∗ ω + ∇ h ( y (cid:48) ) , y − y (cid:48) (cid:105) + L h (cid:107) y − y (cid:48) (cid:107) . Then we have ˆ h ( y, y (cid:48) )+ β (cid:107)A x + B y − b (cid:107) is a surrogate function of y (cid:55)→ h ( y )+ ϕ ( x, y, ω ).Note that the function y (cid:55)→ ˆ h ( y, y (cid:48) ) + β (cid:107)A x + B y − b (cid:107) is ( L h + β (cid:107)B ∗ B(cid:107) )-stronglyconvex. The result follows from Proposition B.2 (see also [18, Section 4.2.1]).Suppose h ( y ) is convex. We note that y (cid:55)→ β (cid:107)A x + B y − b (cid:107) is also convex andplays the role of g i in Proposition B.3. The result follows from Proposition B.3. B.3. Proof of Proposition 2.5.
Note that(33) L ( x k +1 , y k +1 , ω k +1 ) = L ( x k +1 , y k +1 , ω k ) + 1 αβ (cid:104) ω k +1 − ω k , ω k +1 − ω k (cid:105) . From the optimality condition of (11) we have ∇ h (ˆ y k ) + L h ( y k +1 − ˆ y k ) + B ∗ ω k + β B ∗ ( A x k +1 + B y k +1 − b ) = 0 . Together with (12) we obtain(34) ∇ h (ˆ y k ) + L h (∆ y k +1 − δ k ∆ y k ) + B ∗ ω k + 1 α B ∗ ( w k +1 − w k ) = 0 . Hence,(35) B ∗ w k +1 = (1 − α ) B ∗ ω k − α ( ∇ h (ˆ y k ) + L h (∆ y k +1 − δ k ∆ y k )) , which implies that(36) B ∗ ∆ w k +1 = (1 − α ) B ∗ ∆ w k − α ∆ z k +1 , where ∆ z k +1 = z k +1 − z k and z k +1 = ∇ h (ˆ y k ) + L h (∆ y k +1 − δ k ∆ y k ). We now consider2 cases.Case 1: 0 < α ≤
1. From the convexity of (cid:107) · (cid:107) we have(37) (cid:107)B ∗ ∆ w k +1 (cid:107) ≤ (1 − α ) (cid:107)B ∗ ∆ w k (cid:107) + α (cid:107) ∆ z k +1 (cid:107) . Case 2: 1 < α <
2. We rewrite (36) as B ∗ ∆ w k +1 = − ( α − B ∗ ∆ w k − α − α (2 − α )∆ z k +1 . Hence(38) (cid:107)B ∗ ∆ w k +1 (cid:107) ≤ ( α − (cid:107)B ∗ ∆ w k (cid:107) + α (2 − α ) (cid:107) ∆ z k +1 (cid:107) . Combine (37) and (38) we obtain(39) (cid:107)B ∗ ∆ w k +1 (cid:107) ≤ | − α |(cid:107)B ∗ ∆ w k (cid:107) + α − | − α | (cid:107) ∆ z k +1 (cid:107) , which implies(40)(1 −| − α | ) (cid:107)B ∗ ∆ w k +1 (cid:107) ≤ | − α | ( (cid:107)B ∗ ∆ w k (cid:107) −(cid:107)B ∗ ∆ w k +1 (cid:107) )+ α − | − α | (cid:107) ∆ z k +1 (cid:107) . NERTIAL ALTERNATING DIRECTION METHOD OF MULTIPLIERS y we have (cid:107) ∆ z k +1 (cid:107) = (cid:107)∇ h (ˆ y k ) − ∇ h (ˆ y k − ) + L h (∆ y k +1 − δ k ∆ y k ) − L h (∆ y k − δ k − ∆ y k − ) (cid:107) ≤ L h (cid:107) ˆ y k − ˆ y k − (cid:107) + 3 L h (cid:107) ∆ y k +1 (cid:107) + 3 (cid:107) (1 + δ k ) L h ∆ y k − L h δ k − ∆ y k − (cid:107) ≤ L h (cid:2) (1 + δ k ) (cid:107) ∆ y k (cid:107) + δ k − (cid:107) ∆ y k − (cid:107) (cid:3) + 3 L h (cid:107) ∆ y k +1 (cid:107) + 6(1 + δ k ) L h (cid:107) ∆ y k (cid:107) + 6 L h δ k − (cid:107) ∆ y k − (cid:107) = 3 L h (cid:107) ∆ y k +1 (cid:107) + 12(1 + δ k ) L h (cid:107) ∆ y k (cid:107) + 12 L h δ k − (cid:107) ∆ y k − (cid:107) . (41)If we do not use extrapolation for y then we have (cid:107) ∆ z k +1 (cid:107) = (cid:107)∇ h ( y k ) − ∇ h ( y k − ) + L h ∆ y k +1 − L h ∆ y k (cid:107) ≤ L h (cid:107) ∆ y k (cid:107) + 3 L h (cid:107) ∆ y k +1 (cid:107) + 3 L h (cid:107) ∆ y k (cid:107) = 6 L h (cid:107) ∆ y k (cid:107) + 3 L h (cid:107) ∆ y k +1 (cid:107) . (42)Furthermore, note that σ B (cid:107) ∆ w k +1 (cid:107) ≤ (cid:107)B ∗ ∆ w k +1 (cid:107) . Therefore, it follows from (40)that (cid:107) ∆ w k +1 (cid:107) ≤ | − α | σ B (1 − | − α | ) ( (cid:107)B ∗ ∆ w k (cid:107) − (cid:107)B ∗ ∆ w k +1 (cid:107) )+ α L h σ B (1 − | − α | ) ( (cid:107) ∆ y k +1 (cid:107) + ¯ δ k (cid:107) ∆ y k (cid:107) + 4 δ k − (cid:107) ∆ y k − (cid:107) ) . (43)The result is obtained from (43), (33) and Proposition 2.3. B.4. Proof of Proposition 2.6. (i) From Inequality (17) and the conditionsin (19) we have L k +1 + µ (cid:107) ∆ y k +1 (cid:107) + s (cid:88) i =1 η i (cid:107) ∆ x k +1 i (cid:107) + α β (cid:107)B ∗ ∆ w k +1 (cid:107) ≤ L k + C µ (cid:107) ∆ y k (cid:107) + C µ (cid:107) ∆ y k − (cid:107) + C x s (cid:88) i =1 η i (cid:107) ∆ x ki (cid:107) + α β (cid:107)B ∗ ∆ w k (cid:107) . (44)By summing from k = 1 to K Inequality (44) and noting that C + C = C y we obtainInequality (20).(ii) Let us prove { ∆ y k } and { ∆ x ki } converge to 0.Let us first prove the second situation, that is we use extrapolation for the updateof y and Inequality (21) is satisfied. From (35) we have α B ∗ w k +1 = − (1 − α ) B ∗ ∆ ω k +1 − αz k +1 , where z k +1 = ∇ h (ˆ y k ) + L h (∆ y k +1 − δ k ∆ y k ). Using the same technique that derivesInequality (39), we obtain the following(45) ασ B (cid:107) w k +1 (cid:107) ≤ α (cid:107)B ∗ w k +1 (cid:107) ≤ | − α |(cid:107)B ∗ ∆ ω k +1 (cid:107) + α − | − α | (cid:107) z k +1 (cid:107) . On the other hand, we have L k = F ( x k ) + h ( y k ) + β (cid:107)A x k + B y k − b + ω k β (cid:107) − β (cid:107) ω k (cid:107) L. T. K. HIEN, D. N. PHAN, N. GILLIS ≥ F ( x k ) + h ( y k ) − β (cid:107) ω k (cid:107) . 
Together with (45) and (cid:107) z k (cid:107) = (cid:107)∇ h (ˆ y k − ) − ∇ h ( y k ) + ∇ h ( y k ) + L h (∆ y k − δ k − ∆ y k − ) (cid:107) ≤ (cid:107)∇ h (ˆ y k − ) − ∇ h ( y k ) (cid:107) + 4 (cid:107)∇ h ( y k ) (cid:107) + 4 L h (cid:107) ∆ y k (cid:107) + 4 L h δ k − (cid:107) ∆ y k − (cid:107) ≤ L h (cid:107) ∆ y k (cid:107) + 12 δ k − (cid:107) ∆ y k − (cid:107) + 4 (cid:107)∇ h ( y k ) (cid:107) . we obtain L k ≥ F ( x k ) + h ( y k ) − αβσ B (cid:0) | − α |(cid:107) B ∗ ∆ ω k (cid:107) + α − | − α | (cid:107) z k (cid:107) (cid:1) ≥ F ( x k ) + h ( y k ) − | − α | αβσ B (cid:107) B ∗ ∆ ω k (cid:107) − α βσ B (1 − | − α | ) (cid:0) L h (cid:107) ∆ y k (cid:107) + 12 δ k − (cid:107) ∆ y k − (cid:107) + 4 (cid:107)∇ h ( y k ) (cid:107) (cid:1) (46)Since h ( y ) is L h -smooth, for all y ∈ R q and α L > h ( y − α L ∇ f ( y )) ≤ h ( y ) − α L (1 − L h α L (cid:107)∇ h ( y ) (cid:107) . Let us choose α L such that α L (1 − L h α L ) = α βσ B (1 −| − α | ) . Note that this equationhave, solution when β ≥ L h ασ B (1 −| − α | ) . Then we have h ( y k ) − α βσ B (1 − | − α | ) (cid:107)∇ h ( y k ) (cid:107) ≥ h ( y k − α L ∇ f ( y k )) . Together with (46) we get L k ≥ F ( x k ) + h ( y k − α L ∇ f ( y k )) − | − α | αβσ B (cid:107) B ∗ ∆ ω k (cid:107) − α βσ B (1 − | − α | ) (12 L h (cid:107) ∆ y k (cid:107) + 12 δ k − (cid:107) ∆ y k − (cid:107) ) . (47)So from α β ≥ | − α | αβσ B , µ ≥ α L h βσ B (1 −| − α | ) , (1 − C ) µ ≥ α L h δ k βσ B (1 −| − α | ) we have L K +1 + µ (cid:107) ∆ y K +1 (cid:107) + α β (cid:107) B ∗ ∆ w K +1 (cid:107) + (1 − C ) µ (cid:107) ∆ y K (cid:107) ≥ F ( x K +1 ) + h ( y K +1 − α L ∇ f ( y K +1 )) . (48)Hence L K +1 + µ (cid:107) ∆ y K +1 (cid:107) + α β (cid:107) B ∗ ∆ w K +1 (cid:107) + (1 − C ) µ (cid:107) ∆ y K (cid:107) is lower bounded.Furthermore, since η i and µ are positive numbers we derive from Inequality (20)that (cid:80) ∞ k =1 (cid:107) ∆ y k (cid:107) < + ∞ and (cid:80) ∞ k =1 (cid:107) ∆ x ki (cid:107) < + ∞ . Therefore, { ∆ y k } and { ∆ x ki } converge to 0.Let us now consider the first situation when δ k = 0 for all k .From Inequality (17) and the conditions in (19) we have L k +1 + µ (cid:107) ∆ y k +1 (cid:107) + s (cid:88) i =1 η i (cid:107) ∆ x k +1 i (cid:107) + α β (cid:107) B ∗ ∆ w k +1 (cid:107) ≤ L k + C y µ (cid:107) ∆ y k (cid:107) + C x s (cid:88) i =1 η i (cid:107) ∆ x ki (cid:107) + α β (cid:107) B ∗ ∆ w k (cid:107) . (49) NERTIAL ALTERNATING DIRECTION METHOD OF MULTIPLIERS k = 1 to K we obtain L K +1 + C y µ (cid:107) ∆ y K +1 (cid:107) + C x s (cid:88) i =1 η i (cid:107) ∆ x K +1 i (cid:107) + α β (cid:107) B ∗ ∆ w K +1 (cid:107) + K (cid:88) k =1 (cid:2) (1 − C y ) µ (cid:107) ∆ y k +1 (cid:107) + (1 − C x ) s (cid:88) i =1 η i (cid:107) ∆ x k +1 i (cid:107) (cid:3) ≤ L + α β (cid:107) B ∗ ∆ ω (cid:107) + s (cid:88) i =1 η i (cid:107) ∆ x i (cid:107) + Cµ (cid:107) ∆ y (cid:107) . (50)Denote the value of the right side of Inequality (49) by ˆ L k . Note that 0 < C x , C y < { ˆ L k } is non-increasing. It follows from [30,Lemma 2.9] that ˆ L k ≥ ϑ for all k , where ϑ is is the lower bound of F ( x k ) + h ( y k ). Forcompleteness, let us provide the proof in the following. 
We have
\[
\begin{aligned}
\hat{\mathcal L}_k \ge \mathcal L_k &= F(x^k) + h(y^k) + \frac{\beta}{2}\|\mathcal A x^k + \mathcal B y^k - b\|^2 + \frac{1}{\alpha\beta}\langle\omega^k, \omega^k - \omega^{k-1}\rangle\\
&\ge \vartheta + \frac{1}{2\alpha\beta}\big(\|\omega^k\|^2 - \|\omega^{k-1}\|^2 + \|\Delta\omega^k\|^2\big)\\
&\ge \vartheta + \frac{1}{2\alpha\beta}\big(\|\omega^k\|^2 - \|\omega^{k-1}\|^2\big).
\end{aligned}
\tag{51}
\]
Assume that there exists $k_0$ such that $\hat{\mathcal L}_k < \vartheta$ for all $k \ge k_0$. As $\hat{\mathcal L}_k$ is non-increasing,
\[
\sum_{k=1}^{K}(\hat{\mathcal L}_k - \vartheta) \le \sum_{k=1}^{k_0}(\hat{\mathcal L}_k - \vartheta) + (K - k_0)(\hat{\mathcal L}_{k_0} - \vartheta).
\]
Hence $\sum_{k=1}^{\infty}(\hat{\mathcal L}_k - \vartheta) = -\infty$. However, from (51) we have
\[
\sum_{k=1}^{K}(\hat{\mathcal L}_k - \vartheta) \ge \sum_{k=1}^{K}\frac{1}{2\alpha\beta}\big(\|\omega^k\|^2 - \|\omega^{k-1}\|^2\big) = \frac{1}{2\alpha\beta}\big(\|\omega^K\|^2 - \|\omega^{0}\|^2\big) \ge -\frac{1}{2\alpha\beta}\|\omega^{0}\|^2,
\]
which gives a contradiction.

Since $\hat{\mathcal L}_K \ge \vartheta$ and the $\eta_i$ and $\mu$ are positive numbers, we derive from Inequality (20) that $\sum_{k=1}^{\infty}\|\Delta y^k\|^2 < +\infty$ and $\sum_{k=1}^{\infty}\|\Delta x_i^k\|^2 < +\infty$. Therefore, $\{\Delta y^k\}$ and $\{\Delta x_i^k\}$ converge to 0.

Now we prove that $\{\Delta\omega^k\}$ goes to 0. Since $\sum_{k=1}^{\infty}\|\Delta y^k\|^2 < +\infty$, we derive from (41) that $\sum_{k=1}^{\infty}\|\Delta z^k\|^2 < +\infty$. Summing up (39) from $k = 1$ to $K$ we have
\[
(1 - |1-\alpha|)\sum_{k=1}^{K}\|\mathcal B^*\Delta\omega^k\|^2 + \|\mathcal B^*\Delta\omega^{K+1}\|^2 \le \|\mathcal B^*\Delta\omega^{1}\|^2 + \frac{\alpha^2}{1-|1-\alpha|}\sum_{k=1}^{K}\|\Delta z^{k+1}\|^2,
\]
which implies that $\sum_{k=1}^{\infty}\|\mathcal B^*\Delta\omega^k\|^2 < +\infty$. Hence, $\|\mathcal B^*\Delta\omega^k\| \to 0$. Since $\sigma_B > 0$, $\{\Delta\omega^k\}$ goes to 0.
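Remark (numerical sanity check; not part of the original analysis). The elementary facts behind (51) and the contradiction argument above are the identity $2\langle\omega^k,\omega^k-\omega^{k-1}\rangle = \|\omega^k\|^2 - \|\omega^{k-1}\|^2 + \|\Delta\omega^k\|^2$ and the telescoping bound $\sum_{k=1}^{K}(\|\omega^k\|^2-\|\omega^{k-1}\|^2) = \|\omega^K\|^2-\|\omega^0\|^2 \ge -\|\omega^0\|^2$. The following Python sketch simply verifies both on randomly generated vectors; the dimensions and data are arbitrary choices made only for this check.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
dim, K = 5, 50
# omegas[k] plays the role of omega^k, k = 0, ..., K
omegas = [rng.standard_normal(dim) for _ in range(K + 1)]

# identity used in (51):
# 2<w_k, w_k - w_{k-1}> = ||w_k||^2 - ||w_{k-1}||^2 + ||w_k - w_{k-1}||^2
for wk, wkm1 in zip(omegas[1:], omegas[:-1]):
    lhs = 2 * np.dot(wk, wk - wkm1)
    rhs = np.dot(wk, wk) - np.dot(wkm1, wkm1) + np.dot(wk - wkm1, wk - wkm1)
    assert abs(lhs - rhs) < 1e-10

# telescoping lower bound:
# sum_k (||w_k||^2 - ||w_{k-1}||^2) = ||w_K||^2 - ||w_0||^2 >= -||w_0||^2
telescoped = sum(np.dot(w, w) - np.dot(wp, wp)
                 for w, wp in zip(omegas[1:], omegas[:-1]))
assert telescoped >= -np.dot(omegas[0], omegas[0]) - 1e-10
print("identity and telescoping bound verified")
\end{verbatim}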
B.5. Proof of Proposition 2.7.
We remark that we use the idea in the proof of [42, Lemma 6] to prove the proposition. However, our proof is more complicated since in our framework $\alpha \in (0, 2)$, $h$ is linearized, and we use extrapolation for $y$.

Note that, as $\sigma_B > 0$, $\mathcal B$ is surjective. Together with the assumption $b + \mathrm{Im}(\mathcal A) \subseteq \mathrm{Im}(\mathcal B)$, there exists $\bar y^k$ such that $\mathcal A x^k + \mathcal B\bar y^k - b = 0$. Now we have
\[
\begin{aligned}
\mathcal L_k &= F(x^k) + h(y^k) + \frac{\beta}{2}\|\mathcal A x^k + \mathcal B y^k - b\|^2 + \langle\omega^k, \mathcal A x^k + \mathcal B y^k - b\rangle\\
&= F(x^k) + h(y^k) + \frac{\beta}{2}\|\mathcal A x^k + \mathcal B y^k - b\|^2 + \langle\mathcal B^*\omega^k, y^k - \bar y^k\rangle.
\end{aligned}
\tag{52}
\]
From (34) we have
\[
\begin{aligned}
\langle\mathcal B^*\omega^k, y^k - \bar y^k\rangle &= \Big\langle\nabla h(\hat y^k) + L_h(\Delta y^{k+1} - \delta_k\Delta y^k) + \frac{1}{\alpha}\mathcal B^*(\omega^{k+1} - \omega^k),\; \bar y^k - y^k\Big\rangle\\
&\ge \langle\nabla h(y^k), \bar y^k - y^k\rangle\\
&\quad - \Big(\|\nabla h(y^k) - \nabla h(\hat y^k)\| + L_h\|\Delta y^{k+1}\| + L_h\delta_k\|\Delta y^k\| + \frac{1}{\alpha}\|\mathcal B^*\Delta\omega^{k+1}\|\Big)\|\bar y^k - y^k\|.
\end{aligned}
\]
Therefore, it follows from (52) and the $L_h$-smoothness of $h$ that
\[
\mathcal L_k \ge F(x^k) + h(\bar y^k) - \frac{L_h}{2}\|y^k - \bar y^k\|^2 - \Big(2L_h\delta_k\|\Delta y^k\| + L_h\|\Delta y^{k+1}\| + \frac{1}{\alpha}\|\mathcal B^*\Delta\omega^{k+1}\|\Big)\|\bar y^k - y^k\|.
\tag{53}
\]
On the other hand, we have
\[
\|\bar y^k - y^k\| \le \frac{1}{\sqrt{\lambda_{\min}(\mathcal B^*\mathcal B)}}\|\mathcal B(\bar y^k - y^k)\| = \frac{1}{\sqrt{\lambda_{\min}(\mathcal B^*\mathcal B)}}\|\mathcal A x^k + \mathcal B y^k - b\| = \frac{1}{\sqrt{\lambda_{\min}(\mathcal B^*\mathcal B)}}\Big\|\frac{1}{\alpha\beta}\Delta\omega^k\Big\|.
\tag{54}
\]
We have proved in Proposition 2.6 that $\|\Delta\omega^k\|$, $\|\Delta x^k\|$ and $\|\Delta y^k\|$ converge to 0. Furthermore, from Proposition 2.6, $\mathcal L_k$ is upper bounded. Therefore, from (53), (54) and (20), $F(x^k) + h(\bar y^k)$ is upper bounded. So $\{x^k\}$ is bounded. Consequently, $\mathcal A x^k$ is bounded. Furthermore, we have
\[
\|y^k\| \le \frac{1}{\sqrt{\lambda_{\min}(\mathcal B^*\mathcal B)}}\|\mathcal B y^k\| = \frac{1}{\sqrt{\lambda_{\min}(\mathcal B^*\mathcal B)}}\Big\|\frac{1}{\alpha\beta}\Delta\omega^k - \mathcal A x^k + b\Big\|.
\]
Therefore, $\{y^k\}$ is bounded, which implies that $\|\nabla h(\hat y^k)\|$ is also bounded. Finally, from (34) and the assumption $\lambda_{\min}(\mathcal B\mathcal B^*) > 0$, $\{\omega^k\}$ is bounded.
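Remark (numerical sanity check; not part of the original analysis). The linear-algebra bound used in (54) and for $\|y^k\|$ above, namely $\|v\| \le \|\mathcal B v\|/\sqrt{\lambda_{\min}(\mathcal B^*\mathcal B)}$ whenever $\lambda_{\min}(\mathcal B^*\mathcal B) > 0$, can be checked numerically. The sketch below uses a randomly generated matrix of full column rank; the matrix and dimensions are arbitrary choices made only for this check.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
m, q = 8, 5                       # B maps R^q -> R^m; full column rank with probability 1
B = rng.standard_normal((m, q))
lam_min = np.linalg.eigvalsh(B.T @ B).min()

for _ in range(100):
    v = rng.standard_normal(q)    # plays the role of y^k - \bar y^k
    assert np.linalg.norm(v) <= np.linalg.norm(B @ v) / np.sqrt(lam_min) + 1e-10
print("norm bound verified; lambda_min(B^T B) =", lam_min)
\end{verbatim}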
B.6. Proof of Theorem 2.8.

Suppose $(x^{k_n}, y^{k_n}, \omega^{k_n})$ converges to $(x^*, y^*, \omega^*)$. Since $\Delta x_i^k$ goes to 0, $x_i^{k_n+1}$ and $x_i^{k_n-1}$ also converge to $x_i^*$ for all $i \in [s]$. From (29), for all $x_i$ we have
\[
u_i(x_i^{k+1}, x^{k,i-1}, y^k, \omega^k) + g_i(x_i^{k+1}) \le u_i(x_i, x^{k,i-1}, y^k, \omega^k) + g_i(x_i) - \langle\mathcal G_i^k(x_i^k - x_i^{k-1}),\, x_i - x_i^{k+1}\rangle.
\tag{55}
\]
Choosing $x_i = x_i^*$ and $k = k_n - 1$ in (55), and using that $u_i(x_i, z)$ is continuous by Assumption 2(i), we have
\[
\limsup_{n\to\infty}\; u_i(x_i^*, x^*, y^*, \omega^*) + g_i(x_i^{k_n}) \le u_i(x_i^*, x^*, y^*, \omega^*) + g_i(x_i^*).
\]
On the other hand, $g_i(x_i)$ is lower semi-continuous. Hence, $g_i(x_i^{k_n})$ converges to $g_i(x_i^*)$. Now, choosing $k = k_n$ in (55) and letting $n \to \infty$, for all $x_i$ we obtain
\[
\mathcal L(x^*, y^*, \omega^*) + g_i(x_i^*) \le u_i(x_i, x^*, y^*, \omega^*) + g_i(x_i) = \mathcal L(x_i, x^*_{\neq i}, y^*, \omega^*) + e_i(x_i, x^*, y^*, \omega^*) + g_i(x_i),
\tag{56}
\]
where $\mathcal L(x, y, \omega) = f(x) + h(y) + \varphi(x, y, \omega)$ and $e_i$ is the approximation error defined in (31). We have
\[
\begin{aligned}
e_i(x_i, x^*, y^*, \omega^*) &= u_i(x_i, x^*) - f(x_i, x^*_{\neq i}) + \hat\varphi_i(x_i, x^*, y^*, \omega^*) - \varphi\big((x_i, x^*_{\neq i}), y^*, \omega^*\big)\\
&\le \bar e_i(x_i, x^*) + \hat\varphi_i(x_i, x^*, y^*, \omega^*) - \varphi\big((x_i, x^*_{\neq i}), y^*, \omega^*\big).
\end{aligned}
\]
Note that $\bar e_i(x_i^*, x^*) = 0$ by Assumption 2. From (56) we see that $x_i^*$ is a solution of
\[
\min_{x_i}\; \mathcal L(x_i, x^*_{\neq i}, y^*, \omega^*) + g_i(x_i) + \bar e_i(x_i, x^*) + \hat\varphi_i(x_i, x^*, y^*, \omega^*) - \varphi\big((x_i, x^*_{\neq i}), y^*, \omega^*\big).
\]
Writing the optimality condition of this problem we obtain $0 \in \partial_{x_i}\mathcal L(x^*, y^*, \omega^*)$. Entirely similarly, we can prove that $0 \in \partial_y\mathcal L(x^*, y^*, \omega^*)$. On the other hand, we have $\Delta\omega^k = \omega^k - \omega^{k-1} = \alpha\beta(\mathcal A x^k + \mathcal B y^k - b) \to 0$. Hence,
\[
\partial_\omega\mathcal L(x^*, y^*, \omega^*) = \mathcal A x^* + \mathcal B y^* - b = 0.
\]
As we assume $\partial F(x) = \partial_{x_1}F(x) \times \ldots \times \partial_{x_s}F(x)$, we have
\[
\begin{aligned}
\partial\mathcal L(x, y, \omega) &= \partial F(x) + \nabla\Big(h(y) + \langle\omega, \mathcal A x + \mathcal B y - b\rangle + \frac{\beta}{2}\|\mathcal A x + \mathcal B y - b\|^2\Big)\\
&= \partial_{x_1}\mathcal L(x, y, \omega) \times \ldots \times \partial_{x_s}\mathcal L(x, y, \omega) \times \partial_y\mathcal L(x, y, \omega) \times \partial_\omega\mathcal L(x, y, \omega).
\end{aligned}
\]
So $0 \in \partial\mathcal L(x^*, y^*, \omega^*)$.
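Remark (toy illustration; not part of the original analysis). The stationarity conditions just established, $0 \in \partial_{x_i}\mathcal L$, $0 \in \partial_y\mathcal L$ and $\mathcal A x^* + \mathcal B y^* - b = 0$, can be made concrete on a fully smooth toy instance where the limit point is available in closed form. The sketch below uses $\min \tfrac12\|x\|^2 + \tfrac12\|y\|^2$ subject to $x + y = b$ (so $\mathcal A = \mathcal B = I$); the instance and the value of $\beta$ are arbitrary choices made only for this illustration.
\begin{verbatim}
import numpy as np

beta = 1.0
b = np.array([1.0, -2.0, 3.0])
# closed-form KKT point of the toy instance: x* = y* = b/2, omega* = -b/2
x_star = y_star = b / 2.0
omega_star = -b / 2.0

residual = x_star + y_star - b                    # A x* + B y* - b
grad_x = x_star + omega_star + beta * residual    # gradient of the augmented Lagrangian in x
grad_y = y_star + omega_star + beta * residual    # gradient of the augmented Lagrangian in y
print(np.linalg.norm(grad_x), np.linalg.norm(grad_y), np.linalg.norm(residual))  # all zero
\end{verbatim}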
B.7. Proof of Theorem 2.10.

Note that we assume the generated sequence of Algorithm 1 is bounded. The following analysis is carried out on a bounded set that contains the generated sequence of Algorithm 1. We first prove some preliminary results.

(A) The optimality condition of (29) gives us
\[
\mathcal G_i^k(x_i^k - x_i^{k-1}) - \mathcal A_i^*\big(\omega^k + \beta(\mathcal A x^{k,i-1} + \mathcal B y^k - b)\big) - \kappa_i\beta(x_i^{k+1} - x_i^k) \in \partial_{x_i}\big(u_i(x_i^{k+1}, x^{k,i-1}) + g_i(x_i^{k+1})\big).
\tag{57}
\]
As (24) holds, there exist $s_i^{k+1} \in \partial u_i(x_i^{k+1}, x^{k,i-1})$ and $t_i^{k+1} \in \partial g_i(x_i^{k+1})$ such that
\[
\mathcal G_i^k(x_i^k - x_i^{k-1}) - \mathcal A_i^*\big(\omega^k + \beta(\mathcal A x^{k,i-1} + \mathcal B y^k - b)\big) - \kappa_i\beta(x_i^{k+1} - x_i^k) = s_i^{k+1} + t_i^{k+1}.
\tag{58}
\]
As (25) holds, there exists $\xi_i^{k+1} \in \partial_{x_i}f(x^{k+1})$ such that
\[
\|\xi_i^{k+1} - s_i^{k+1}\| \le L_i\|x^{k+1} - x^{k,i-1}\|.
\tag{59}
\]
Denote $\tau_i^{k+1} := \xi_i^{k+1} + t_i^{k+1} \in \partial_{x_i}F(x^{k+1})$ (as (24) holds). Then, from (58) we have
\[
\tau_i^{k+1} = \xi_i^{k+1} + \mathcal G_i^k(x_i^k - x_i^{k-1}) - \mathcal A_i^*\big(\omega^k + \beta(\mathcal A x^{k,i-1} + \mathcal B y^k - b)\big) - \kappa_i\beta(x_i^{k+1} - x_i^k) - s_i^{k+1}.
\tag{60}
\]
On the other hand, we note that
\[
\partial_{x_i}\mathcal L(x^{k+1}, y^{k+1}, \omega^{k+1}) = \partial_{x_i}F(x^{k+1}) + \mathcal A_i^*\big(\omega^{k+1} + \beta(\mathcal A x^{k+1} + \mathcal B y^{k+1} - b)\big).
\tag{61}
\]
Let $d_i^{k+1} := \tau_i^{k+1} + \mathcal A_i^*\big(\omega^{k+1} + \beta(\mathcal A x^{k+1} + \mathcal B y^{k+1} - b)\big) \in \partial_{x_i}\mathcal L(x^{k+1}, y^{k+1}, \omega^{k+1})$. From (60) we have
\[
\|d_i^{k+1}\| = \Big\|\xi_i^{k+1} + \mathcal G_i^k(x_i^k - x_i^{k-1}) - \mathcal A_i^*\big(\omega^k + \beta(\mathcal A x^{k,i-1} + \mathcal B y^k - b)\big) - \kappa_i\beta(x_i^{k+1} - x_i^k) - s_i^{k+1} + \mathcal A_i^*\big(\omega^{k+1} + \beta(\mathcal A x^{k+1} + \mathcal B y^{k+1} - b)\big)\Big\|.
\tag{62}
\]
Together with (59) we obtain
\[
\|d_i^{k+1}\| \le a_i^k\|\Delta x_i^k\| + \beta\|\mathcal A_i^*\mathcal A\|\,\|x^{k+1} - x^{k,i-1}\| + \beta\|\mathcal A_i^*\mathcal B\|\,\|\Delta y^{k+1}\| + \|\mathcal A_i^*\|\,\|\Delta\omega^{k+1}\| + \kappa_i\beta\|\Delta x_i^{k+1}\| + L_i\|x^{k+1} - x^{k,i-1}\|.
\tag{63}
\]
It follows from (11) that
\[
\mathcal B^*\omega^k + \nabla h(\hat y^k) + \beta\mathcal B^*(\mathcal A x^{k+1} + \mathcal B y^{k+1} - b) + L_h(y^{k+1} - \hat y^k) = 0.
\]
Let $d_y^{k+1} := \nabla h(y^{k+1}) + \mathcal B^*\big(\omega^{k+1} + \beta(\mathcal A x^{k+1} + \mathcal B y^{k+1} - b)\big)$. We have $d_y^{k+1} \in \partial_y\mathcal L(x^{k+1}, y^{k+1}, \omega^{k+1})$ and
\[
\begin{aligned}
\|d_y^{k+1}\| &= \|\nabla h(y^{k+1}) - \nabla h(\hat y^k) + \mathcal B^*(\omega^{k+1} - \omega^k) - L_h(y^{k+1} - \hat y^k)\|\\
&\le 2L_h\|y^{k+1} - \hat y^k\| + \|\mathcal B^*\|\,\|\Delta\omega^{k+1}\|\\
&\le 2L_h\big(\|\Delta y^{k+1}\| + \delta_k\|\Delta y^k\|\big) + \|\mathcal B^*\|\,\|\Delta\omega^{k+1}\|.
\end{aligned}
\]
Let $d_\omega^{k+1} := \mathcal A x^{k+1} + \mathcal B y^{k+1} - b$. We have $d_\omega^{k+1} \in \partial_\omega\mathcal L(x^{k+1}, y^{k+1}, \omega^{k+1})$ and
\[
d_\omega^{k+1} = (\omega^{k+1} - \omega^k)/(\alpha\beta) = \Delta\omega^{k+1}/(\alpha\beta).
\]
(B) Let us now prove that $F(x^{k_n})$ converges to $F(x^*)$. This implies that $\mathcal L(x^{k_n}, y^{k_n}, \omega^{k_n})$ converges to $\mathcal L(x^*, y^*, \omega^*)$, since $\mathcal L$ is differentiable in $y$ and $\omega$. We have
\[
F(x^{k_n}) = f(x^{k_n}) + \sum_{i=1}^{s} g_i(x_i^{k_n}) = u_s(x_s^{k_n}, x^{k_n}) + \sum_{i=1}^{s} g_i(x_i^{k_n}).
\]
So $F(x^{k_n})$ converges to $u_s(x_s^*, x^*) + \sum_{i=1}^{s} g_i(x_i^*) = F(x^*)$.

We now proceed to prove the global convergence. Denote $z = (x, y, \omega)$, $\tilde z = (\tilde x, \tilde y, \tilde\omega)$, and $z^k = (x^k, y^k, \omega^k)$. We consider the following auxiliary function
\[
\bar{\mathcal L}(z, \tilde z) = \mathcal L(x, y, \omega) + \sum_{i=1}^{s}\frac{\eta_i + C_x\eta_i}{2}\|x_i - \tilde x_i\|^2 + \frac{(1 + C_y)\mu}{2}\|y - \tilde y\|^2 + \frac{\alpha}{\beta}\|\mathcal B^*(\omega - \tilde\omega)\|^2.
\]
The auxiliary sequence $\bar{\mathcal L}(z^k, z^{k-1})$ has the following properties.
1. Sufficient decrease property. From (49) we have
\[
\bar{\mathcal L}(z^{k+1}, z^k) + \sum_{i=1}^{s}\frac{\eta_i - C_x\eta_i}{2}\big(\|x_i^{k+1} - x_i^k\|^2 + \|x_i^k - x_i^{k-1}\|^2\big) + \frac{(1 - C_y)\mu}{2}\big(\|y^{k+1} - y^k\|^2 + \|y^k - y^{k-1}\|^2\big) \le \bar{\mathcal L}(z^k, z^{k-1}).
\]
2. Boundedness of subgradient. In the proof (A) above, we have proved that
\[
\|d^{k+1}\| \le a\big(\|x^{k+1} - x^k\| + \|x^k - x^{k-1}\| + \|y^{k+1} - y^k\| + \|\omega^{k+1} - \omega^k\|\big)
\]
for some constant $a$ and $d^{k+1} \in \partial\mathcal L(z^{k+1})$. On the other hand, as we use $\alpha = 1$, from (36) we obtain
\[
\begin{aligned}
\sqrt{\sigma_B}\,\|\omega^{k+1} - \omega^k\| &\le \|\mathcal B^*(\omega^{k+1} - \omega^k)\| = \|\Delta z^{k+1}\| = \|\nabla h(y^k) - \nabla h(y^{k-1}) + L_h(\Delta y^{k+1} - \Delta y^k)\|\\
&\le 2L_h\|y^k - y^{k-1}\| + L_h\|y^{k+1} - y^k\|.
\end{aligned}
\tag{64}
\]
Hence,
\[
\|d^{k+1}\| \le a\big(\|x^{k+1} - x^k\| + \|x^k - x^{k-1}\| + \|y^{k+1} - y^k\| + \|y^k - y^{k-1}\|\big)
\]
for some constant $a$. Note that
\[
\partial\bar{\mathcal L}(z, \tilde z) = \partial\mathcal L(z) + \partial\Big(\sum_{i=1}^{s}\frac{\eta_i + C_x\eta_i}{2}\|x_i - \tilde x_i\|^2 + \frac{(1 + C_y)\mu}{2}\|y - \tilde y\|^2 + \frac{\alpha}{\beta}\|\mathcal B^*(\omega - \tilde\omega)\|^2\Big).
\]
Hence, it is not difficult to show that
\[
\|d^{k+1}\| \le a\big(\|x^{k+1} - x^k\| + \|x^k - x^{k-1}\| + \|y^{k+1} - y^k\| + \|y^k - y^{k-1}\|\big)
\]
for some constant $a$ and $d^{k+1} \in \partial\bar{\mathcal L}(z^{k+1}, z^k)$.

3. KŁ property. Since $F(x) + h(y)$ has the KŁ property, $\bar{\mathcal L}(z, \tilde z)$ also has the KŁ property.

4. A continuity condition. Suppose $z^{k_n}$ converges to $(x^*, y^*, \omega^*)$. In the proof (B) above, we have proved that $\mathcal L(z^{k_n})$ converges to $\mathcal L(x^*, y^*, \omega^*)$. Furthermore, in Proposition 2.6 we proved that $\|z^{k+1} - z^k\|$ goes to 0. Hence $z^{k_n-1}$ also converges to $(x^*, y^*, \omega^*)$. So $\bar{\mathcal L}(z^{k_n}, z^{k_n-1})$ converges to $\bar{\mathcal L}(z^*, z^*)$.

Using the same technique as in [6, Theorem 1], see also [17, 32], we can prove that
\[
\sum_{k=1}^{\infty}\big(\|x^{k+1} - x^k\| + \|x^k - x^{k-1}\| + \|y^{k+1} - y^k\| + \|y^k - y^{k-1}\|\big) < \infty,
\]
which implies that $\{(x^k, y^k)\}$ converges to $(x^*, y^*)$. From (64) we obtain
\[
\sum_{k=1}^{\infty}\|\omega^{k+1} - \omega^k\| \le \frac{2L_h}{\sqrt{\sigma_B}}\sum_{k=1}^{\infty}\big(\|y^{k+1} - y^k\| + \|y^k - y^{k-1}\|\big) < \infty.
\]
Hence, $\{\omega^k\}$ also converges to $\omega^*$.
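Remark (toy illustration; not part of the original analysis). The qualitative conclusions above, namely that $\|\Delta y^k\|$ and $\|\Delta\omega^k\|$ vanish and that the iterates converge, can be observed on a small strongly convex instance. The sketch below runs a plain ADMM-type iteration with $\alpha = 1$, exact block minimization and no inertial terms on $\min \tfrac12\|x\|^2 + \tfrac12\|y\|^2$ subject to $x + y = b$; the problem, the parameters and the closed-form block updates are illustrative choices and are not the setting or the algorithm of the paper.
\begin{verbatim}
import numpy as np

beta, alpha = 1.0, 1.0
b = np.array([1.0, -2.0, 3.0])
x = y = omega = np.zeros_like(b)

for k in range(31):
    x = (beta * (b - y) - omega) / (1.0 + beta)       # exact minimization in x
    y_new = (beta * (b - x) - omega) / (1.0 + beta)   # exact minimization in y
    omega_new = omega + alpha * beta * (x + y_new - b)
    if k % 10 == 0:
        print(k, np.linalg.norm(y_new - y), np.linalg.norm(omega_new - omega))
    y, omega = y_new, omega_new
\end{verbatim}
The printed values of $\|y^{k+1} - y^k\|$ and $\|\omega^{k+1} - \omega^k\|$ decrease towards zero along the iterations, in line with Proposition 2.6 and Theorem 2.10.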