A Framework of Inertial Alternating Direction Method of Multipliers for Non-Convex Non-Smooth Optimization
LE THI KHANH HIEN, DUY NHAT PHAN, AND NICOLAS GILLIS

Abstract.
In this paper, we propose an algorithmic framework, dubbed inertial alternating direction methods of multipliers (iADMM), for solving a class of nonconvex nonsmooth multiblock composite optimization problems with linear constraints. Our framework employs the general majorization-minimization (MM) principle to update each block of variables so as to not only unify the convergence analysis of previous ADMM schemes that use specific surrogate functions in the MM step, but also lead to new efficient ADMM schemes. To the best of our knowledge, in the nonconvex nonsmooth setting, ADMM used in combination with the MM principle to update each block of variables, and ADMM combined with inertial terms for the primal variables, have not been studied in the literature. Under standard assumptions, we prove the subsequential convergence and global convergence of the generated sequence of iterates. We illustrate the effectiveness of iADMM on a class of nonconvex low-rank representation problems.
1. Introduction.
In this paper, we consider the following nonconvex minimization problem with linear constraints:
$$\min_{x,y} \; F(x_1, \dots, x_s) + h(y) \quad \text{such that} \quad \sum_{i=1}^s \mathcal{A}_i x_i + \mathcal{B} y = b, \tag{1}$$
where $y \in \mathbb{R}^q$, $x_i \in \mathbb{R}^{n_i}$, $x := [x_1; \dots; x_s] \in \mathbb{R}^n$, $n = \sum_{i=1}^s n_i$, $\mathcal{A}_i$ is a linear map from $\mathbb{R}^{n_i}$ to $\mathbb{R}^m$, $\mathcal{B}$ is a linear map from $\mathbb{R}^q$ to $\mathbb{R}^m$, $b \in \mathbb{R}^m$, $h : \mathbb{R}^q \to \mathbb{R}$ is a differentiable function, and $F(x) = f(x) + \sum_{i=1}^s g_i(x_i)$, where $f : \mathbb{R}^n \to \mathbb{R}$ is a nonconvex nonsmooth function and the $g_i : \mathbb{R}^{n_i} \to \mathbb{R} \cup \{+\infty\}$ are proper lower semi-continuous functions for $i = 1, \dots, s$. We assume that $F$ satisfies $\partial F(x) = \partial_{x_1} F(x) \times \dots \times \partial_{x_s} F(x)$, where $\partial F$ denotes the limiting subdifferential of $F$ (see the definition in the supplementary document).

Notation.
We denote $[s] := \{1, \dots, s\}$. For the $p$-dimensional Euclidean space $\mathbb{R}^p$, we use $\langle \cdot, \cdot \rangle$ to denote the inner product and $\|\cdot\|$ to denote the corresponding induced norm. For a linear map $\mathcal{M}$, $\mathcal{M}^*$ denotes the adjoint linear map with respect to the inner product and $\|\mathcal{M}\|$ is the induced operator norm of $\mathcal{M}$. We use $\mathcal{I}$ to denote the identity map. For a positive definite self-adjoint operator $\mathcal{Q}$, we denote $\|x\|_{\mathcal{Q}}^2 := \langle x, \mathcal{Q}x \rangle$. We denote the smallest eigenvalue of a symmetric linear self-map (that is, $\mathcal{M} = \mathcal{M}^*$) by $\lambda_{\min}(\mathcal{M})$. We use $\mathrm{Im}(\mathcal{B})$ to denote the image of $\mathcal{B}$.

$^*$ L.T.K. Hien and D.N. Phan contributed equally to this work.
Funding:
L. T. K. Hien and N. Gillis are supported by the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS project no 30468160 (SeLMA), and by the European Research Council (ERC starting grant 679515).

$^\dagger$ Department of Mathematics and Operational Research, University of Mons, Belgium ([email protected], [email protected]). $^\ddagger$ Department of Mathematics and Informatics, HCMC University of Education, Vietnam ([email protected]).

$^1$ This condition is satisfied when $f$ is a sum of a continuously differentiable function and a block separable function that has limiting subdifferential, see [2, Proposition 2.1].

One of the important applications of Problem (1) is the following generalized nonconvex low-rank representation problem: given a data matrix $D \in \mathbb{R}^{d \times n}$, solve
$$\min_{X, Y, Z} \; \sum_{i=1}^{\min(m,n)} r_1(\sigma_i(X)) + r_2(Y) + r_3(Z) \quad \text{subject to} \quad D = A_1 X + Y A_2 + Z, \tag{2}$$
where $X \in \mathbb{R}^{m \times n}$, $Y \in \mathbb{R}^{d \times q}$, $Z \in \mathbb{R}^{d \times n}$, $A_1 \in \mathbb{R}^{d \times m}$, $A_2 \in \mathbb{R}^{q \times n}$, $r_1(\cdot)$ is an increasing concave function that promotes $X$ to be of low rank, $r_2(\cdot)$ is a regularization function, and $r_3(\cdot)$ is a function that models some noise (for example, if we take $r_3(Z) = \|Z\|_F^2$ then $Z$ represents Gaussian noise). Problem (2) generalizes several important problems in machine learning. Let us mention some examples:
(i) When $A_1$ and $A_2$ are identity matrices, $r_1(t) = t^{\chi}$ with $0 < \chi \leq 1$, and $r_2(Y) = \sum_{i=1}^{q-1} \|Y_i - Y_{i+1}\|$, where $Y_i$ is the $i$-th column of $Y$, Problem (2) decomposes the data matrix $D$ into three components, $X$, $Y$ and $Z$. For example, in video surveillance, each column of $D$ is a vectorized image of a video frame, $X$ is a low-rank matrix that plays the role of the background, $Y$ is the foreground that has small variations between its columns (such as slowly moving objects), and $Z$ represents some noise [42].
(ii) When $A_1$ and $A_2$ are identity matrices, $r_1(t) = t$, and $r_2(Y) = \lambda\|Y\|_1$, where $\lambda$ is some constant, Problem (2) recovers the robust principal component analysis model, see, e.g., [10], where $X$ is a low-rank matrix, $Y$ represents a sparse noise, and $Z$ represents additional noise. It is also used for foreground-background separation in video surveillance.
(iii) When $r_1(t) = t$ and $r_2(Y) = \|Y\|_*$, Problem (2) is the latent low-rank representation problem [27]. The authors of [27] used $A_1 = D P_1$ and $A_2 = P_2^* D$, where $P_1$ and $P_2$ are computed by orthogonalizing the columns of $D^*$ and $D$, respectively. We will use this application to illustrate the effectiveness of our proposed framework, iADMM, in Section 3.
Other applications of Problem (1) include statistical learning, see, e.g., [3, 43], and minimization on compact manifolds, see, e.g., [22, 44].

Let $\mathcal{A} := [\mathcal{A}_1 \dots \mathcal{A}_s]$ and $\mathcal{A}x := \sum_{i=1}^s \mathcal{A}_i x_i \in \mathbb{R}^m$. The augmented Lagrangian of Problem (1) is given by
$$\mathcal{L}(x, y, \omega) := F(x) + h(y) + \langle \omega, \mathcal{A}x + \mathcal{B}y - b \rangle + \frac{\beta}{2}\|\mathcal{A}x + \mathcal{B}y - b\|^2, \tag{3}$$
where $\beta > 0$ is a penalty parameter. When $s = 1$, the classical ADMM alternatively updates $x$ and $y$, and then the multiplier $\omega$:
$$x^{k+1} \in \operatorname*{argmin}_x \mathcal{L}(x, y^k, \omega^k), \tag{4a}$$
$$y^{k+1} \in \operatorname*{argmin}_y \mathcal{L}(x^{k+1}, y, \omega^k), \tag{4b}$$
$$\omega^{k+1} = \omega^k + \beta(\mathcal{A}x^{k+1} + \mathcal{B}y^{k+1} - b). \tag{4c}$$
When $s > 1$, the scheme is similar, see for example [42]. The update of $x$ in (4a) (a similar discussion applies to (4b)) can be rewritten as $x^{k+1} \in \operatorname*{argmin}_x F(x) + \varphi_k(x)$, where
$$\varphi_k(x) = \frac{\beta}{2}\|\mathcal{A}x + \mathcal{B}y^k - b\|^2 + \langle \omega^k, \mathcal{A}x + \mathcal{B}y^k - b \rangle. \tag{5}$$
Solving the subproblem (4a) is usually very expensive, especially when $F$ is not smooth. A remedy is to minimize a suitable surrogate function of $\mathcal{L}(\cdot, y^k, \omega^k)$ that allows a more efficient update of $x$. For example, since $\varphi_k(x)$ is upper bounded by
$$\hat\varphi(x) = \varphi_k(x^k) + \langle \nabla\varphi_k(x^k), x - x^k \rangle + \frac{\kappa\beta}{2}\|x - x^k\|^2, \tag{6}$$
where $\kappa \geq \|\mathcal{A}^*\mathcal{A}\|$ (because $\nabla\varphi_k(x)$ is $\beta\|\mathcal{A}^*\mathcal{A}\|$-Lipschitz continuous), $x$ can be updated by
$$x^{k+1} \in \operatorname*{argmin}_x F(x) + \hat\varphi(x), \tag{7}$$
which leads to the linearized ADMM method, see [25, 45]. The update in (7) has a closed form for some nonsmooth $F$; see [34]. When $F = f + g$ and $f$ is $L_f$-smooth, then we can also use the upper bound $\hat F(x) = f(x^k) + \langle \nabla f(x^k), x - x^k \rangle + \frac{L_f}{2}\|x - x^k\|^2 + g(x)$ of $F$ to derive the following update of $x$:
$$x^{k+1} \in \operatorname*{argmin}_x \hat F(x) + \hat\varphi(x). \tag{8}$$
This leads to the proximal linearized ADMM method, see [7, 28]. We note that $\mathcal{L}(\cdot, y^k, \omega^k)$ is always upper bounded by $\mathcal{L}(\cdot, y^k, \omega^k) + D_\phi(x, x^k)$, where $D_\phi$ is the Bregman distance associated with a continuously differentiable convex function $\phi$ on $\mathbb{R}^n$:
$$D_\phi(a, b) := \phi(a) - \phi(b) - \langle \nabla\phi(b), a - b \rangle, \quad \forall a, b \in \mathbb{R}^n. \tag{9}$$
For example, if $\phi(x) = \|x\|_{\mathcal{Q}}^2 = \langle x, \mathcal{Q}x \rangle$ then $D_\phi(a, b) = \|a - b\|_{\mathcal{Q}}^2$. This upper bound leads to the proximal ADMM, see [12, 23].
The above-mentioned upper bound functions are specific examples of surrogate functions of $\mathcal{L}(\cdot, y^k, \omega^k)$ (see Definition 2.1), while each method of updating $x$ corresponds to a majorization-minimization (MM) step. In the convex setting (that is, when $F(x) + h(y)$ is convex), [11] and [20] use the MM principle to unify and generalize the convergence analysis of many ADMM schemes for multi-block problems (that is, $s > 1$). In the nonconvex setting, inertial (extrapolation) techniques have shown significant improvements in practical performance, see for example [32, 46, 47, 35, 17]. Recently, the authors of [18] proposed an inertial block MM framework for solving (1) without the linear coupling constraint. To the best of our knowledge, ADMM with inertial terms for the primal variables has not been studied in the nonconvex setting, although it has been analysed in the convex setting; see [24, 33].
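To make the surrogate viewpoint concrete, the following minimal sketch (ours, in Python/NumPy, with illustrative names) implements a proximal linearized update of the form (8) for a single block in the special case $g = \|\cdot\|_1$: both $f$ and the coupling term $\varphi_k$ are replaced by their quadratic majorizers, so the subproblem reduces to one soft-thresholding step.

```python
import numpy as np

def prox_l1(v, t):
    """Proximal operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_linearized_x_update(x, y, w, A, B, b, grad_f, L_f, beta):
    """One update of the form (8), assuming g = ||.||_1 and dense arrays.
    Both f and phi_k are majorized by Lipschitz-gradient quadratics at x,
    so the minimizer is a single proximal (soft-thresholding) step."""
    kappa = np.linalg.norm(A.T @ A, 2)                 # kappa >= ||A* A||
    grad_phi = A.T @ (w + beta * (A @ x + B @ y - b))  # gradient of phi_k at x
    step = L_f + kappa * beta                          # curvature of the combined majorizer
    return prox_l1(x - (grad_f(x) + grad_phi) / step, 1.0 / step)
```

With $g = 0$ the same step reduces to a plain gradient step on $f + \varphi_k$; other choices of $g$ only change the proximal operator.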
In this paper, we propose iADMM, a framework of inertial alternating direction methods of multipliers, for solving the nonconvex nonsmooth problem (1). When no extrapolation is used, iADMM becomes a general ADMM framework that employs the majorization-minimization principle in each block update. For the first time in the nonconvex nonsmooth setting of Problem (1), we study ADMM and its inertial version combined with the MM principle when updating each block of variables. Moreover, our framework allows the use of an over-relaxation parameter $\alpha \in (0, 2)$ to set $\alpha\beta$ as the constant stepsize for updating the dual variable $\omega$. Note that $\alpha = 1$ is the standard choice in the nonconvex setting, see, e.g., [20, 23, 42]. In the convex setting, [14] showed that $\alpha$ can be chosen in a larger range, and $\alpha \in \big(0, (1+\sqrt{5})/2\big)$ is used in, e.g., [13, 49]. Recently, [7] proposed a proximal ADMM that uses $\alpha \in (0, 2)$ for solving a special case of the nonconvex Problem (1) with $s = 1$ and $\mathcal{A} = -\mathcal{I}$.
Under mild assumptions, we analyse the subsequential convergence guarantees of the sequences generated by iADMM and ADMM in parallel. When $F(x) + h(y)$ satisfies the KŁ property and $\alpha = 1$, we prove the global convergence of the generated sequence. Finally, we apply the proposed framework to solve a class of Problem (2) and report numerical results to illustrate the efficacy of iADMM.
2. An inertial ADMM framework.
In this section, we describe the iADMM framework and prove its subsequential and global convergence. Throughout the paper, we make the following assumptions, which are standard for studying Problem (1) and the convergence of ADMM in the nonconvex setting, see for example [42, 7, 23].
Assumption 1.
(i) $\sigma_B := \lambda_{\min}(\mathcal{B}\mathcal{B}^*) > 0$.
(ii) $F(x) + h(y)$ is lower bounded.
(iii) $h$ is an $L_h$-smooth function (that is, $\nabla h$ is Lipschitz continuous with constant $L_h$).

Let us first formally define a surrogate function. Some examples were given in the introduction.
Definition 2.1. Let $X \subseteq \mathbb{R}^n$. A function $u : X \times X \to \mathbb{R}$ is called a surrogate function of a function $f$ on $X$ if the following conditions are satisfied:
(a) $u(z, z) = f(z)$ for all $z \in X$,
(b) $u(x, z) \geq f(x)$ for all $x, z \in X$.

As we are considering multi-block problems, we need the following definition of a block surrogate function, which is a generalization of Definition 2.1.
Definition 2.2. Let $X_i \subseteq \mathbb{R}^{n_i}$ and $X \subseteq \mathbb{R}^n$. A function $u_i : X_i \times X \to \mathbb{R}$ is called a block $i$ surrogate function of $f$ on $X$ if the following conditions are satisfied:
(a) $u_i(z_i, z) = f(z)$ for all $z \in X$,
(b) $u_i(x_i, z) \geq f(x_i, z_{\neq i})$ for all $x_i \in X_i$ and $z \in X$, where $(x_i, z_{\neq i}) := (z_1, \dots, z_{i-1}, x_i, z_{i+1}, \dots, z_s)$.
The block approximation error is defined as $e_i(x_i, z) := u_i(x_i, z) - f(x_i, z_{\neq i})$.

Algorithm 1: iADMM for solving Problem (1)
Choose $x^0 = x^{-1}$, $y^0 = y^{-1}$, $\omega^0$. Let $u_i$, $i \in [s]$, be block $i$ surrogate functions of $f(x)$ on $\mathbb{R}^n$.
for $k = 0, 1, \dots$ do
  Set $x^{k,0} = x^k$.
  for $i = 1, \dots, s$ do
    Compute $\bar x_i^k = x_i^k + \zeta_i^k (x_i^k - x_i^{k-1})$. Update block $x_i$ by
    $$x_i^{k,i} \in \operatorname*{argmin}_{x_i} \Big\{ u_i(x_i, x^{k,i-1}) + g_i(x_i) + \big\langle \mathcal{A}_i^*\big(\omega^k + \beta(\mathcal{A}\bar x^{k,i-1} + \mathcal{B}y^k - b)\big), x_i \big\rangle + \frac{\kappa_i\beta}{2}\|x_i - \bar x_i^k\|^2 \Big\}, \tag{10}$$
    where $\kappa_i \geq \|\mathcal{A}_i^*\mathcal{A}_i\|$ and $\bar x^{k,i-1} = (x_1^{k+1}, \dots, x_{i-1}^{k+1}, \bar x_i^k, x_{i+1}^k, \dots, x_s^k)$.
    Set $x_j^{k,i} = x_j^{k,i-1}$ for all $j \neq i$.
  end for
  Set $x^{k+1} = x^{k,s}$. Compute $\hat y^k = y^k + \delta_k(y^k - y^{k-1})$. Update $y$ by
  $$y^{k+1} \in \operatorname*{argmin}_y \Big\{ \langle \mathcal{B}^*\omega^k + \nabla h(\hat y^k), y \rangle + \frac{\beta}{2}\|\mathcal{A}x^{k+1} + \mathcal{B}y - b\|^2 + \frac{L_h}{2}\|y - \hat y^k\|^2 \Big\}. \tag{11}$$
  Update $\omega$ by
  $$\omega^{k+1} = \omega^k + \alpha\beta(\mathcal{A}x^{k+1} + \mathcal{B}y^{k+1} - b). \tag{12}$$
end for

The inertial alternating direction method of multipliers (iADMM) framework is described in Algorithm 1. iADMM cyclically updates the blocks $x_1, \dots, x_s$ and $y$. We use $x^{k,i} = (x_1^{k+1}, \dots, x_i^{k+1}, x_{i+1}^k, \dots, x_s^k)$ and $x^{k+1} = x^{k,s}$, where $k$ is the outer iteration index and $i$ the cyclic inner iteration index ($i \in [s]$). The update of block $x_i$ in (10) (note that $x_i^{k+1} = x_i^{k,i}$) means that iADMM chooses a surrogate function of $x_i \mapsto \mathcal{L}(x_i, x^{k,i}_{\neq i}, y^k, \omega^k)$, formed by summing a surrogate function of $x_i \mapsto f(x_i, x^{k,i}_{\neq i}) + g_i(x_i)$ and a surrogate function of $x_i \mapsto \varphi_k(x_i, x^{k,i}_{\neq i})$, where $\varphi_k(x)$ is defined in (5), and then applies extrapolation to the latter surrogate function. To update block $y$, as $h(y)$ is $L_h$-smooth, we apply a Nesterov-type acceleration on $h$ as in (11). It is worth noting that it is possible to embed a general inertial term $\mathcal{G}_i^k$ into the surrogate of $x_i \mapsto \mathcal{L}(x_i, x^{k,i}_{\neq i}, y^k, \omega^k)$ as in [18]; this inertial term may also lead to extrapolation for the block surrogate function of $f(x)$, or for both block surrogates. However, to simplify our analysis, we only consider the effect of the inertial term on the block surrogate of $\varphi_k(x)$.
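The overall control flow of Algorithm 1 can be summarized by the following sketch (ours, in Python/NumPy); the callbacks `prox_block`, `y_update` and `grad_h` are placeholders for the problem-specific pieces (the block surrogate minimizer (10), the $y$-subproblem (11), and $\nabla h$), not functions defined in the paper.

```python
import numpy as np

def iadmm(blocks, prox_block, y_update, grad_h, B, b,
          beta, alpha, zeta, delta, n_iter=100):
    """Skeleton of Algorithm 1 (iADMM). `blocks` is a list of pairs (A_i, x_i^0);
    `prox_block(i, x_bar_i, lin_i, kappa_i_beta)` is assumed to solve (10)."""
    A_list = [A for A, _ in blocks]
    x = [x0.copy() for _, x0 in blocks]
    x_prev = [xi.copy() for xi in x]
    y = np.zeros(B.shape[1]); y_prev = y.copy()
    w = np.zeros(b.shape)

    def A_dot(xs):                                   # A x = sum_i A_i x_i
        return sum(A @ xi for A, xi in zip(A_list, xs))

    for _ in range(n_iter):
        for i, A_i in enumerate(A_list):
            x_bar = x[i] + zeta * (x[i] - x_prev[i])             # extrapolation
            xs_bar = x[:i] + [x_bar] + x[i + 1:]
            lin = A_i.T @ (w + beta * (A_dot(xs_bar) + B @ y - b))
            kappa_i = np.linalg.norm(A_i.T @ A_i, 2)             # kappa_i >= ||A_i* A_i||
            x_prev[i], x[i] = x[i], prox_block(i, x_bar, lin, kappa_i * beta)   # (10)
        y_hat = y + delta * (y - y_prev)                         # extrapolation
        y_prev, y = y, y_update(B.T @ w + grad_h(y_hat), A_dot(x), y_hat, beta) # (11)
        w = w + alpha * beta * (A_dot(x) + B @ y - b)            # (12)
    return x, y, w
```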
Together with Assumption 1, we make the following standard assumption on $u_i$ throughout the paper.

Assumption 2.
(i) The block surrogate function $u_i(x_i, z)$ is continuous.
(ii) Given $z \in \mathbb{R}^n$, for $i \in [s]$, there exists a function $x_i \mapsto \bar e_i(x_i, z)$ such that $\bar e_i(\cdot, z)$ is continuously differentiable at $z_i$, $\bar e_i(z_i, z) = 0$, $\nabla_{x_i}\bar e_i(z_i, z) = 0$, and the block approximation error $x_i \mapsto e_i(x_i, z)$ satisfies
$$e_i(x_i, z) \leq \bar e_i(x_i, z) \quad \text{for all } x_i. \tag{13}$$

The condition in Assumption 2 (ii) is satisfied when we simply choose $u_i(x_i, z) = f(x_i, z_{\neq i})$ (that is, $f(x_i, z_{\neq i})$ is a surrogate function of itself), when $e_i(\cdot, z)$ is continuously differentiable at $z_i$ with $\nabla_{x_i}e_i(z_i, z) = 0$, or when $e_i(x_i, z) \leq c\|x_i - z_i\|^{\epsilon}$ for some $\epsilon > 1$ and $c > 0$; see [18, Lemma 3].
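As a concrete check of Assumption 2 (ii) (a worked example of ours, under the extra assumption that $x_i \mapsto f(x_i, z_{\neq i})$ is $L_i$-smooth), the Lipschitz-gradient quadratic majorizer used in (6) and (8) admits the required error bound:

```latex
% Surrogate: u_i(x_i, z) = f(z) + <\nabla_{x_i} f(z), x_i - z_i> + (L_i/2)\|x_i - z_i\|^2.
% By the descent lemma, its block approximation error satisfies
\begin{align*}
e_i(x_i, z) &= u_i(x_i, z) - f(x_i, z_{\neq i}) \\
  &= \tfrac{L_i}{2}\|x_i - z_i\|^2
     - \big( f(x_i, z_{\neq i}) - f(z) - \langle \nabla_{x_i} f(z),\, x_i - z_i \rangle \big) \\
  &\le \tfrac{L_i}{2}\|x_i - z_i\|^2 + \tfrac{L_i}{2}\|x_i - z_i\|^2
   \;=\; L_i\|x_i - z_i\|^2 \;=:\; \bar e_i(x_i, z),
\end{align*}
% with \bar e_i(z_i, z) = 0 and \nabla_{x_i}\bar e_i(z_i, z) = 0, so Assumption 2 (ii) holds;
% this is the sufficient condition e_i(x_i,z) <= c\|x_i - z_i\|^\epsilon with c = L_i, \epsilon = 2.
```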
Remark. Before proceeding to the convergence analysis of iADMM, we make the following remark. As we target Nesterov-type acceleration in the update of $y$ (note that $h$ is assumed to be $L_h$-smooth), we analyse the update rule (11) for $y$. In case $y$ is updated by $y^{k+1} \in \operatorname*{argmin}_y \mathcal{L}(x^{k+1}, y, \omega^k)$, iADMM still works and the convergence analysis would be simplified, using the same rationale to obtain subsequential as well as global convergence. We hence omit this case in our analysis.

Let us start by defining some additional notation and conventions that will be used later. Let $x^{k,i}$, $y^k$ and $\omega^k$ be the iterates generated by iADMM. We denote $\Delta x_i^k = x_i^k - x_i^{k-1}$, $\Delta y^k = y^k - y^{k-1}$, $\Delta\omega^k = \omega^k - \omega^{k-1}$, $\alpha_1 = \frac{|1-\alpha|}{\alpha\sigma_B(1-|1-\alpha|)}$, $\alpha_2 = \frac{\alpha^2}{\sigma_B(1-|1-\alpha|)}$ and $\mathcal{L}^k = \mathcal{L}(x^k, y^k, \omega^k)$. We let $\nu_i$, $i \in [s]$, and $\nu_y$ be arbitrary constants in $(0, 1)$, with the following conventions:
• If $\zeta_i^k = 0$, that is, when we do not apply extrapolation in the update of $x_i^k$, we take $\zeta_i^k/\nu_i = 0$ and $\nu_i = 0$.
• If $\delta_k = 0$, that is, when we do not apply extrapolation in the update of $y$, we take $\delta_k/\nu_y = 0$ and $\nu_y = 0$.
Now we present our main convergence results. Their proofs can be found in the supplementary material. As iADMM allows extrapolation in the updates of $x_i^k$ and $y^k$, the Lagrangian is not guaranteed to satisfy the sufficient descent property; in fact, it is not guaranteed to decrease at each iteration. Instead, it has the nearly sufficiently decreasing property stated in the following Propositions 2.3 and 2.4.

Proposition 2.3.
(i) Considering the update in (10), in general when $x_i \mapsto u_i(x_i, z) + g_i(x_i)$ is nonconvex, we choose $\kappa_i > \|\mathcal{A}_i^*\mathcal{A}_i\|$. Denote $a_i^k = \beta\zeta_i^k(\kappa_i + \|\mathcal{A}_i^*\mathcal{A}_i\|)$. Then
$$\mathcal{L}(x^{k,i}, y^k, \omega^k) + \eta_i\|\Delta x_i^{k+1}\|^2 \leq \mathcal{L}(x^{k,i-1}, y^k, \omega^k) + \gamma_i^k\|\Delta x_i^k\|^2, \tag{14}$$
where
$$\eta_i = \frac{(1-\nu_i)(\kappa_i - \|\mathcal{A}_i^*\mathcal{A}_i\|)\beta}{2}, \qquad \gamma_i^k = \frac{(a_i^k)^2}{\nu_i(\kappa_i - \|\mathcal{A}_i^*\mathcal{A}_i\|)\beta}. \tag{15}$$
(ii) When $x_i \mapsto u_i(x_i, z) + g_i(x_i)$ is convex, we choose $\kappa_i = \|\mathcal{A}_i^*\mathcal{A}_i\|$ and Inequality (14) is satisfied with
$$\gamma_i^k = \frac{\beta\|\mathcal{A}_i^*\mathcal{A}_i\|(\zeta_i^k)^2}{2}, \qquad \eta_i = \frac{\beta\|\mathcal{A}_i^*\mathcal{A}_i\|}{2}. \tag{16}$$

Proposition 2.4.
Considering the update in (11), we have
$$\mathcal{L}(x^{k+1}, y^{k+1}, \omega^k) + \eta_y\|\Delta y^{k+1}\|^2 \leq \mathcal{L}(x^{k+1}, y^k, \omega^k) + \gamma_y^k\|\Delta y^k\|^2,$$
where $\eta_y = \frac{(1-\nu_y)(\beta\|\mathcal{B}^*\mathcal{B}\| + L_h)}{2}$ and $\gamma_y^k = \frac{(L_h\delta_k)^2}{\nu_y(\beta\|\mathcal{B}^*\mathcal{B}\| + L_h)}$ when $h(y)$ is nonconvex, and $\eta_y = \frac{L_h}{2}$ and $\gamma_y^k = \frac{L_h\delta_k^2}{2}$ when $h(y)$ is convex.

From Proposition 2.3 and Proposition 2.4, we obtain the following recursive inequality for $\{\mathcal{L}^k\}$ in Proposition 2.5, which serves as a cornerstone to derive the bounds on the extrapolation parameters $\zeta_i^k$ and $\delta_k$ in Proposition 2.6.

Proposition 2.5.
We have
$$\begin{aligned}
\mathcal{L}^{k+1} + \eta_y\|\Delta y^{k+1}\|^2 + \sum_{i=1}^s \eta_i\|\Delta x_i^{k+1}\|^2
&\leq \mathcal{L}^k + \sum_{i=1}^s \gamma_i^k\|\Delta x_i^k\|^2 + \gamma_y^k\|\Delta y^k\|^2 + \frac{\alpha_1}{\beta}\big(\|\mathcal{B}^*\Delta\omega^k\|^2 - \|\mathcal{B}^*\Delta\omega^{k+1}\|^2\big) \\
&\quad + \frac{\alpha_2}{\beta}L_h^2\|\Delta y^{k+1}\|^2 + \frac{\alpha_2}{\beta}\big(\bar\delta_k L_h^2\|\Delta y^k\|^2 + 4L_h^2\delta_{k-1}^2\|\Delta y^{k-1}\|^2\big),
\end{aligned} \tag{17}$$
where $\bar\delta_k = 2$ if $\delta_k = 0$ for all $k$, and $\bar\delta_k = 4(1+\delta_k)^2$ otherwise. Now we characterize the chosen parameters of Algorithm 1 in the following proposition.
Proposition 2.6. Let $\eta_y$, $\gamma_y^k$, $\eta_i$, $\gamma_i^k$, $i \in [s]$, and $\bar\delta_k$ be defined in Proposition 2.3 and Proposition 2.4. Denote
$$\mu = \eta_y - \frac{\alpha_2 L_h^2}{\beta}. \tag{18}$$
For $k \geq 1$, suppose the parameters are chosen such that $\mu > 0$, $\eta_i > 0$, and the following conditions are satisfied for some constants $0 < C_x, C_y < 1$:
$$\gamma_i^k \leq C_x\eta_i, \qquad \frac{4\alpha_2 L_h^2\delta_{k-1}^2}{\beta} \leq C_2\mu, \qquad \frac{\alpha_2 L_h^2\bar\delta_k}{\beta} + \gamma_y^k \leq C_1\mu, \tag{19}$$
where $C_1 = C_y$ and $C_2 = 0$ if $\delta_k = 0$ for all $k$, and $0 < C_1 < C_y$ with $C_2 = C_y - C_1$ otherwise.
(i) For $K > 0$ we have
$$\begin{aligned}
&\mathcal{L}^{K+1} + \mu\|\Delta y^{K+1}\|^2 + \sum_{i=1}^s \eta_i\|\Delta x_i^{K+1}\|^2 + \frac{\alpha_1}{\beta}\|\mathcal{B}^*\Delta\omega^{K+1}\|^2 + (1 - C_1)\mu\|\Delta y^K\|^2 \\
&\quad + \sum_{k=1}^{K-1}\Big[(1 - C_y)\mu\|\Delta y^k\|^2 + (1 - C_x)\sum_{i=1}^s \eta_i\|\Delta x_i^{k+1}\|^2\Big] \\
&\leq \mathcal{L}^1 + \frac{\alpha_1}{\beta}\|\mathcal{B}^*\Delta\omega^1\|^2 + C_x\sum_{i=1}^s \eta_i\|\Delta x_i^1\|^2 + \mu\|\Delta y^1\|^2 + C_2\mu\|\Delta y^0\|^2.
\end{aligned} \tag{20}$$
(ii) If we use one of the following methods:
• we choose $\delta_k = 0$ for all $k$, that is, there is no extrapolation in the update of $y$, or
• we use extrapolation in the update of $y$ and choose the parameters such that
$$\beta \geq \frac{L_h}{\alpha\sigma_B(1 - |1-\alpha|)}, \qquad \beta \geq \frac{\alpha L_h}{\mu\sigma_B(1 - |1-\alpha|)}\max\Big\{\,\cdot\,, \frac{\delta_{k-1}}{C_2}\Big\}, \tag{21}$$
then $\{\Delta y^k\}$, $\{\Delta x_i^k\}$ and $\{\Delta\omega^k\}$ converge to $0$.

We will assume that Algorithm 1 generates a bounded sequence in our subsequential and global convergence results. Let us provide a sufficient condition that guarantees this boundedness assumption in the following proposition.
Proposition 2.7. If $b + \mathrm{Im}(\mathcal{A}) \subseteq \mathrm{Im}(\mathcal{B})$, $\lambda_{\min}(\mathcal{B}^*\mathcal{B}) > 0$ and $F(x) + h(y)$ is coercive over the feasible set $\{(x, y) : \mathcal{A}x + \mathcal{B}y = b\}$, then the sequences $\{x^k\}$, $\{y^k\}$ and $\{\omega^k\}$ generated by Algorithm 1 are bounded.

It is important to note that coercivity of $F(x) + h(y)$ over the feasible set is weaker than coercivity of $F(x) + h(y)$ over $x \in \mathbb{R}^n$, $y \in \mathbb{R}^q$. Let us now present the subsequential convergence of the generated sequence.

Theorem 2.8. Suppose the parameters of Algorithm 1 are chosen such that the conditions in (19) of Proposition 2.6 are satisfied. If the generated sequence of Algorithm 1 is bounded, then every limit point of the generated sequence is a critical point of $\mathcal{L}$.

To obtain global convergence, we need the following Kurdyka–Łojasiewicz (KŁ) property of $F(x) + h(y)$.

Definition 2.9.
A function $\phi(\cdot)$ is said to have the KŁ property at $\bar x \in \mathrm{dom}\,\partial\phi$ if there exist $\varsigma \in (0, +\infty]$, a neighborhood $U$ of $\bar x$, and a concave function $\Upsilon : [0, \varsigma) \to \mathbb{R}_+$ that is continuously differentiable on $(0, \varsigma)$, continuous at $0$, with $\Upsilon(0) = 0$ and $\Upsilon'(t) > 0$ for all $t \in (0, \varsigma)$, such that for all $x \in U \cap [\phi(\bar x) < \phi(x) < \phi(\bar x) + \varsigma]$, we have
$$\Upsilon'(\phi(x) - \phi(\bar x))\,\mathrm{dist}(0, \partial\phi(x)) \geq 1, \tag{22}$$
where $\mathrm{dist}(0, \partial\phi(x)) = \min\{\|z\| : z \in \partial\phi(x)\}$. If $\phi(x)$ has the KŁ property at each point of $\mathrm{dom}\,\partial\phi$ then $\phi$ is a KŁ function.

Many non-convex non-smooth functions arising in practical applications belong to the class of KŁ functions, for example real analytic functions, semi-algebraic functions, and locally strongly convex functions; see for example [5, 6].
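As a simple illustration of Definition 2.9 (our own example, not taken from the paper), the function $\phi(x) = x^2/2$ on $\mathbb{R}$ satisfies the KŁ property at $\bar x = 0$ with the desingularizing function $\Upsilon(t) = \sqrt{2t}$:

```latex
% phi(x) = x^2/2, \bar x = 0, so \phi(\bar x) = 0 and \partial\phi(x) = \{x\}.
% Take \Upsilon(t) = \sqrt{2t}: concave, \Upsilon(0) = 0, \Upsilon'(t) = 1/\sqrt{2t} > 0. Then
\[
  \Upsilon'\big(\phi(x) - \phi(\bar x)\big)\,\operatorname{dist}\big(0, \partial\phi(x)\big)
  = \frac{1}{\sqrt{2 \cdot x^2/2}}\,|x| = 1 \;\ge\; 1
  \qquad \text{for all } x \neq 0,
\]
% i.e. (22) holds on any neighborhood of \bar x. Here \Upsilon(t) = c\,t^{1-a} with a = 1/2,
% the exponent associated with the linear-convergence regime discussed at the end of Section 2.
```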
Theorem 2.10. Suppose we do not use extrapolation to update $y$ (that is, $\delta_k = 0$ for all $k$) and we take $\alpha = 1$. Then the conditions in (19) become
$$\gamma_i^k \leq C_x\eta_i, \qquad \frac{2\alpha_2 L_h^2}{\beta} \leq C_y\mu, \tag{23}$$
for some constants $0 < C_x, C_y < 1$. Furthermore, we assume that (i) for any $x, z \in \mathbb{R}^n$, $x_i \in \mathrm{dom}(g_i)$, we have
$$\partial_{x_i}\big(f(x) + g_i(x_i)\big) = \partial_{x_i}f(x) + \partial_{x_i}g_i(x_i), \qquad \partial_{x_i}\big(u_i(x_i, z) + g_i(x_i)\big) = \partial_{x_i}u_i(x_i, z) + \partial_{x_i}g_i(x_i), \tag{24}$$
and (ii) for any $x, z$ in a bounded subset of $\mathbb{R}^n$, if $s_i \in \partial u_i(x_i, z)$, there exists $\xi_i \in \partial_{x_i}f(x)$ such that
$$\|\xi_i - s_i\| \leq L_i\|x - z\| \tag{25}$$
for some constant $L_i$. If the generated sequence of Algorithm 1 is bounded and $F(x) + h(y)$ has the KŁ property, then the whole generated sequence of Algorithm 1 converges to a critical point of $\mathcal{L}$.

We refer the reader to [38, Corollary 10.9] for a sufficient condition for (24) (see the supplementary material for more details). Some specific examples that satisfy (24) include: (i) $g_i = 0$; (ii) the functions $x_i \mapsto f(x)$ and $x_i \mapsto u_i(x_i, z)$ are strictly differentiable (see [38, Exercise 10.10]); (iii) the functions $x_i \mapsto f(x)$ and $x_i \mapsto u_i(x_i, z)$ are convex and the relative interior qualification conditions are satisfied: $\mathrm{ri}(\mathrm{dom}\,f(\cdot, x_{\neq i})) \cap \mathrm{ri}(\mathrm{dom}\,g_i) \neq \emptyset$ and $\mathrm{ri}(\mathrm{dom}\,u_i(\cdot, z)) \cap \mathrm{ri}(\mathrm{dom}\,g_i) \neq \emptyset$. We note that although the condition in (25) is necessary for our convergence proof, the Lipschitz constant $L_i$ does not influence how the parameters of our framework are chosen. We end this section by noting that a convergence rate for the generated sequence of iADMM can be derived using the same technique as in the proof of [1, Theorem 2]. Some examples of using the technique of [1, Theorem 2] to derive convergence rates include [46, Theorem 2.9] and [17, Theorem 3]. Beyond the convergence rate, which appears to be the same in different papers using the technique in [1], determining the KŁ exponent, that is, the coefficient $a$ when $\Upsilon(t) = c\,t^{1-a}$, where $c$ is a constant, is an active and challenging topic. The type of convergence rate depends on the value of $a$: when $a = 0$, the algorithm converges after a finite number of steps; when $a \in (0, 1/2]$ it has linear convergence, and when $a \in (1/2, 1)$ it has sublinear convergence. Determining the value of $a$ is out of the scope of this paper.
3. Numerical results.
In this section, we apply iADMM to solve a latent low-rank representation problem of the form of Problem (2); see Section 1.1. Specifically, we choose $r_1(t) = \lambda_1 t$, $r_3(Z) = \frac{1}{2}\|Z\|_F^2$ (hence $Z$ represents Gaussian noise), and consider a nonconvex regularization function for $Y$, $r_2(Y) = \lambda_2\sum_{i=1}^q \phi(\|Y_i\|)$, where $Y_i$ is the $i$-th column of $Y$ and $\phi(t) = 1 - \exp(-\theta t)$ [9]. In the upcoming experiments, we choose $A_1 = D P_1$ and $A_2 = P_2^* D$ as proposed in [27], where $P_1$ and $P_2$ are computed by orthogonalizing the columns of $D^*$ and $D$, respectively.
Problem (2) in this case takes the form of (1) with $\mathcal{B}$ being the identity operator, $b$ being the data matrix $D$, $x_1$ and $x_2$ being the matrices $X$ and $Y$, $y$ being the matrix $Z$, $f(X, Y) = \lambda_1\|X\|_* + r_2(Y)$, $g_i = 0$ and $h(Z) = \frac{1}{2}\|Z\|_F^2$.
We choose the following block surrogate functions for $f$: $u_1(X, (X^k, Y^k)) = \lambda_1\|X\|_* + r_2(Y^k)$ and $u_2(Y, (X^{k+1}, Y^k)) = r_2(Y^k) + \sum_{i=1}^q \varsigma_i^k\big(\|Y_i\| - \|Y_i^k\|\big) + \lambda_1\|X^{k+1}\|_*$, where $\varsigma_i^k \in \lambda_2\nabla\phi(\|Y_i^k\|)$.
[Figure 1 shows six panels: segmentation error and objective value versus time (s) on Hopkins155, Umist10, and Yaleb10, comparing iADMM-mm, ADMM-mm, and linearizedADMM.]
Fig. 1. Evolution of the segmentation error rate and of the objective function value with respect to time. For Hopkins155, the results are the average values over 156 sequences.

Obviously $u_1$ satisfies Assumption 2 and $u_2$ satisfies Assumption 2 (i). Since $\phi$ is continuously differentiable with Lipschitz gradient on $[0, +\infty)$ and the Euclidean norm is Lipschitz continuous, it follows from Section 4.5 of [18] that $u_2$ satisfies Assumption 2 (ii).
For updating $X$, according to the update (10), $X^{k+1}$ is computed by solving
$$\min_X \; \lambda_1\|X\|_* + \Big\langle A_1^*\big(\beta(A_1\bar X^k + Y^kA_2 + Z^k - D) + W^k\big), X \Big\rangle + \frac{\kappa_1\beta}{2}\|X - \bar X^k\|^2, \tag{26}$$
where $\kappa_1 \geq \|A_1^*A_1\|$ and $\bar X^k = X^k + \zeta_1^k(X^k - X^{k-1})$. The sub-problem (26) has a closed-form solution given by $X^{k+1} = U S_{\lambda_1/(\kappa_1\beta)} V^T$, where $USV^T$ is the SVD of $\bar X^k - A_1^*\big(A_1\bar X^k + Y^kA_2 + Z^k - D + W^k/\beta\big)/\kappa_1$ and $S_{\lambda_1/(\kappa_1\beta)} = \mathrm{diag}([S_{ii} - \lambda_1/(\kappa_1\beta)]_+)$, where $\mathrm{diag}(u)$ is a diagonal matrix whose diagonal elements are the entries of $u$, and $[\cdot]_+$ is the projection onto the nonnegative orthant.
The update (11) for $Y$ is
$$Y^{k+1} \in \operatorname*{argmin}_Y \; \sum_{i=1}^q \varsigma_i^k\|Y_i\| + \Big\langle \big(W^k + \beta(A_1X^{k+1} + \bar Y^kA_2 + Z^k - D)\big)A_2^*, Y \Big\rangle + \frac{\kappa_2\beta}{2}\|Y - \bar Y^k\|^2,$$
where $\kappa_2 \geq \|A_2A_2^*\|$ and $\bar Y^k = Y^k + \zeta_2^k(Y^k - Y^{k-1})$. The sub-problem above has the closed-form solution
$$Y_i^{k+1} = \big[\|P_i^k\| - \varsigma_i^k/(\kappa_2\beta)\big]_+ \frac{P_i^k}{\|P_i^k\|},$$
where $P_i^k$ is the $i$-th column of $\bar Y^k - \big(A_1X^{k+1} + \bar Y^kA_2 + Z^k - D\big)A_2^*/\kappa_2 - W^kA_2^*/(\kappa_2\beta)$.
The updates (11) and (12) for $Z$ and $W$ are respectively given by
$$Z^{k+1} = -\big(W^k + \beta(A_1X^{k+1} + Y^{k+1}A_2 - D)\big)/(1+\beta), \qquad W^{k+1} = W^k + \alpha\beta(A_1X^{k+1} + Y^{k+1}A_2 + Z^{k+1} - D).$$
Let us determine the parameters. Note that $L_h = 1$, $\sigma_B = 1$, and $\delta_k = 0$. Since $h(Z)$ is convex and we do not apply extrapolation for $Z$, by Proposition 2.4 we have $\eta_y = 1/2$ and $\gamma_y^k = 0$. Since $\|X\|_*$ and $\sum_{i=1}^q \varsigma_i^k\|Y_i\|$ are convex, we choose $\kappa_1 = \|A_1^*A_1\|$, $\kappa_2 = \|A_2A_2^*\|$ and the conditions in (23) become $\zeta_i^k \leq \sqrt{C_x}$ (for $i = 1, 2$) and $2(2+C_y)\alpha^2/\beta \leq C_y$. In our experiments, we choose $C_x$ and $C_y$ slightly smaller than $1$, $\alpha = 1$, $\beta = 2(2+C_y)\alpha^2/C_y$, $a_0 = 1$, $a_k = \big(1 + \sqrt{1 + 4a_{k-1}^2}\big)/2$, and $\zeta_i^k = \min\big\{\frac{a_{k-1}-1}{a_k}, \sqrt{C_x}\big\}$.
We compare iADMM without extrapolation, denoted ADMM-mm, and iADMM with extrapolation, denoted iADMM-mm, with a linearized ADMM that differs from ADMM-mm only in the update of $Y$. In particular, the linearizedADMM method updates $Y$ by solving the nonconvex sub-problems
$$\min_{Y_i} \; -\lambda_2\exp(-\theta\|Y_i\|) + \frac{\kappa_2\beta}{2}\|Y_i - V_i^k\|^2,$$
where $V_i^k$ is the $i$-th column of $\bar Y^k - \big(W^k + \beta(A_1X^{k+1} + \bar Y^kA_2 + Z^k - D)\big)A_2^*/(\kappa_2\beta)$. Since these sub-problems do not have closed-form solutions, we employ an MM scheme to solve them.
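For illustration, here is a minimal NumPy sketch (ours) of the two closed-form block updates above, namely singular value thresholding for $X$ and column-wise soft-thresholding for $Y$, together with the extrapolation-weight schedule as we read it from the description above; all names are illustrative and this is not the authors' Matlab code.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: prox of tau*||.||_* at M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def columnwise_shrink(M, tau):
    """Column-wise soft-thresholding: prox of sum_i tau_i * ||column_i||."""
    norms = np.maximum(np.linalg.norm(M, axis=0), 1e-12)
    return M * (np.maximum(norms - tau, 0.0) / norms)

def x_update(X, X_prev, Y, Z, W, D, A1, A2, lam1, beta, zeta):
    """Closed-form X-update (26): extrapolate, take a gradient step on the
    quadratic coupling term, then threshold the singular values."""
    kappa1 = np.linalg.norm(A1.T @ A1, 2)
    X_bar = X + zeta * (X - X_prev)
    G = A1.T @ (A1 @ X_bar + Y @ A2 + Z - D + W / beta) / kappa1
    return svt(X_bar - G, lam1 / (kappa1 * beta))

def extrapolation_weight(a_prev, Cx):
    """FISTA-type weight capped by sqrt(Cx), mirroring the schedule above."""
    a = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * a_prev ** 2))
    return min((a_prev - 1.0) / a, np.sqrt(Cx)), a
```

The $Y$-update is analogous, using `columnwise_shrink` with the per-column thresholds $\varsigma_i^k/(\kappa_2\beta)$.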
Table 1
Comparison of the segmentation error rate and of the final objective function values obtained within the allotted time. Bold values indicate the best results.
Data set | Method | Error (mean ± std) | Obj. value (mean ± std)
Hopkins155 | linearizedADMM | 0.1579 ± | ±
Hopkins155 | ADMM-mm | ± | ±
Hopkins155 | iADMM-mm | ± | ±
Umist10 | linearizedADMM | 0.5170 | 1.0838 ×
Umist10 | ADMM-mm | 0.5170 | 1.0167 ×
Umist10 | iADMM-mm | | ×
Yaleb10 | linearizedADMM | 0.7656 | 5.2317 ×
Yaleb10 | ADMM-mm | 0.7047 | 4.4829 ×
Yaleb10 | iADMM-mm | | ×

To examine the performance of the algorithms, we consider subspace segmentation tasks. In particular, after obtaining $X^*$, we follow the setting in [26] to construct the affinity matrix $Q$ by $Q_{ij} = (\tilde U\tilde U^T)_{ij}$, where $\tilde U$ is formed by $U^*(\Sigma^*)^{1/2}$ with normalized rows and $U^*\Sigma^*(V^*)^T$ is the SVD of $X^*$. Finally, we apply Normalized Cuts [21] on $Q$ to cluster the data into groups.
The experiments are run on three data sets: Hopkins 155, Extended Yale B, and Umist. Hopkins 155 consists of 156 sequences, each of which has from 39 to 550 vectors drawn from two or three motions (one motion corresponds to one subspace). Each sequence is a sole segmentation task and thus there are 156 clustering tasks in total. Yale B contains 2414 frontal face images of 38 classes, while Umist contains 564 images of 20 classes. To avoid computational issues when computing the segmentation error rate, we construct clustering tasks by using only the first 10 classes of these two data sets, as proposed in [29].
All tests are performed using Matlab R2019a on a PC with a 2.3 GHz Intel Core i5 and 8GB of RAM. The code is available from https://github.com/nhatpd/iADMM.
In our experiments, we choose $\theta = 5$, $\lambda_1 = \lambda_2 = 0.01$ for Hopkins 155, and $\lambda_1 = \lambda_2 = 1$ for the two other data sets. We note that we do not optimize the numerical results by tweaking the parameters, as this is beyond the scope of this work. It is important to note that we evaluate the algorithms on the same models. We set the initial points to zero. We run each algorithm for 10, 300, and 500 seconds for each sequence of Hopkins 155, Umist10, and Yaleb10, respectively. We plot the curves of the segmentation error rate and of the objective function value versus the training time in Figure 1, and report the final values in Table 1. Since there are 156 sequences (data sets) in Hopkins 155, we plot the average values, and report the final average results and standard deviations over these sequences.
We observe that iADMM-mm converges the fastest on all the data sets, providing a significant acceleration of ADMM-mm. iADMM-mm achieves not only the best final objective function values but also the best segmentation error rates. This illustrates the usefulness of the acceleration technique. In addition, ADMM-mm outperforms linearizedADMM, which illustrates the usefulness of choosing a proper surrogate function.
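To connect the optimization output to the clustering results above, the following short sketch (ours; illustrative names) mirrors the post-processing described earlier in this section, turning the recovered $X^*$ into an affinity matrix to which Normalized Cuts is then applied:

```python
import numpy as np

def affinity_from_X(X_star, rank_tol=1e-8):
    """Build Q = U_tilde @ U_tilde.T, where U_tilde = U * Sigma^{1/2} with
    normalized rows and U Sigma V^T is the SVD of X*."""
    U, s, _ = np.linalg.svd(X_star, full_matrices=False)
    keep = s > rank_tol                                   # drop numerically zero singular values
    U_tilde = U[:, keep] * np.sqrt(s[keep])
    U_tilde /= np.maximum(np.linalg.norm(U_tilde, axis=1, keepdims=True), 1e-12)
    return U_tilde @ U_tilde.T                            # fed to Normalized Cuts [21]
```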
4. Conclusion.
We have analysed iADMM, a framework of inertial alternating direction methods of multipliers, for solving a class of nonconvex nonsmooth optimization problems with linear constraints. The preliminary computational results on a class of nonconvex low-rank representation problems not only show the efficacy of using inertial terms for ADMM, but also show the advantage of using suitable block surrogate functions that may lead to closed-form solutions in the block updates of ADMM. We conclude the paper by mentioning two important questions that we consider as future research directions:
• Can we extend the cyclic update rule of iADMM to a randomized/non-cyclic setting?
• To guarantee global convergence, iADMM does not allow extrapolation in the update of y; see Theorem 2.10. Can we extend the analysis to allow extrapolation in the update of y?

REFERENCES
[1] H. Attouch and J. Bolte. On the convergence of the proximal algorithm for nonsmooth functions involving analytic features.
Mathematical Programming , 116(1):5–16, Jan 2009.[2] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization andprojection methods for nonconvex problems: An approach based on the Kurdyka-(cid:32)Lojasiewiczinequality.
Mathematics of Operations Research , 35(2):438–457, 2010.[3] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducingpenalties.
Foundations and Trends in Machine Learning , 4, 08 2011.[4] A. Beck and L. Tetruashvili. On the convergence of block coordinate descent type methods.
SIAM Journal on Optimization , 23:2037–2060, 2013.[5] J. Bochnak, M. Coste, and M.-F. Roy.
Real Algebraic Geometry . Springer, 1998.[6] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvexand nonsmooth problems.
Mathematical Programming , 146(1):459–494, Aug 2014.[7] R. I. Bot and D.-K. Nguyen. The proximal alternating direction method of multipliers in thenonconvex setting: Convergence analysis and rates.
Mathematics of Operations Research ,45(2):682–712, 2020.[8] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statisticallearning via the alternating direction method of multipliers.
Found. Trends Mach. Learn. ,3(1):1–122, Jan. 2011.[9] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and supportvector machines. In
Proceeding of international conference on machine learning ICML’98 ,1998.[10] E. J. Cand`es, X. Li, Y. Ma, and J. Wright. Robust principal component analysis?
J. ACM ,58(3), 2011.[11] L. Canyi, J. Feng, S. Yan, and Z. Lin. A unified alternating direction method of multipliers bymajorization minimization.
IEEE transactions on pattern analysis and machine intelligence ,40:527 – 541, 07 2018.[12] W. Deng and W. Yin. On the global and linear convergence of the generalized alternatingdirection method of multipliers.
Rice CAAM tech report TR12-14 , 66, 01 2012.[13] M. Fazel, T. K. Pong, D. Sun, and P. Tseng. Hankel matrix rank minimization with applicationsto system identification and realization.
SIAM Journal on Matrix Analysis and Applications ,34(3):946–977, 2013.[14] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problemsvia finite element approximation.
Computers & Mathematics with Applications , 2(1):17 –40, 1976.[15] R. Glowinski and A. Marroco. Sur l’approximation, par ´el´ements finis d’ordre un, et lar´esolution, par p´enalisation-dualit´e d’une classe de probl`emes de dirichlet non lin´eaires.
ESAIM: Mathematical Modelling and Numerical Analysis - Mod´elisation Math´ematique etAnalyse Num´erique , 9(R2):41–76, 1975.[16] L. Grippo and M. Sciandrone. On the convergence of the block nonlinear gauss–seidel methodunder convex constraints.
Operations Research Letters , 26(3):127 – 136, 2000.[17] L. T. K. Hien, N. Gillis, and P. Patrinos. Inertial block proximal method for non-convexnon-smooth optimization. In
Thirty-seventh International Conference on Machine Learning L. T. K. HIEN, D. N. PHAN, N. GILLIS
ICML 2020 , 2020.[18] L. T. K. Hien, D. N. Phan, and N. Gillis. Inertial block majorization minimization frameworkfor nonconvex nonsmooth optimization. arXiv:2010.12133, 2020.[19] C. Hildreth. A quadratic programming procedure.
Naval Research Logistics Quarterly , 4(1):79–85, 1957.[20] M. Hong, T.-H. Chang, X. Wang, M. Razaviyayn, S. Ma, and Z.-Q. Luo. A block successive upper-bound minimization method of multipliers for linearly constrained convex optimization.
Mathematics of Operations Research , 45(3):833–861, 2020.[21] Jianbo Shi and J. Malik. Normalized cuts and image segmentation.
IEEE Trans. Pattern Anal.Mach. Intell. , 22(8):888–905, 2000.[22] R. Lai and S. Osher. A splitting method for orthogonality constrained problems.
Journal ofScientific Computing , 58, 02 2014.[23] G. Li and T. K. Pong. Global convergence of splitting methods for nonconvex compositeoptimization.
SIAM Journal on Optimization , 25(4):2434–2460, 2015.[24] H. Li and Z. Lin. Accelerated alternating direction method of multipliers: An optimal o(1 / k)nonergodic analysis.
Journal of Scientific Computing , 79:671–699, 05 2019.[25] Z. Lin, R. Liu, and Z. Su. Linearized alternating direction method with adaptive penalty forlow-rank representation. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q.Weinberger, editors,
Advances in Neural Information Processing Systems , volume 24, pages612–620. Curran Associates, Inc., 2011.[26] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures bylow-rank representation.
IEEE Trans. Pattern Anal. Mach. Intell. , 35(1):171–184, 2013.[27] G. Liu and S. Yan. Latent low-rank representation for subspace segmentation and featureextraction. , pages 1615–1622, 2011.[28] Q. Liu, X. Shen, and Y. Gu. Linearized admm for nonconvex nonsmooth optimization withconvergence analysis.
IEEE Access , 7:76131–76144, 2019.[29] C. Lu, J. Tang, S. Yan, and Z. Lin. Nonconvex nonsmooth low rank minimization via iterativelyreweighted nuclear norm.
IEEE Transactions on Image Processing , 25(2):829–839, 2016.[30] J. G. Melo and R. D. C. Monteiro. Iteration-complexity of a jacobi-type non-euclidean admmfor multi-block linearly constrained nonconvex programs, 2017.[31] Y. Nesterov.
Introductory lectures on convex optimization: A basic course . Kluwer AcademicPubl., 2004.[32] P. Ochs. Unifying abstract inexact convergence theorems and block coordinate variable metricipiano.
SIAM Journal on Optimization , 29(1):541–570, 2019.[33] Y. Ouyang, Y. Chen, G. Lan, and E. Pasiliao. An accelerated linearized alternating directionmethod of multipliers.
SIAM Journal on Imaging Sciences , 8(1):644–681, 2015.[34] N. Parikh and S. Boyd. Proximal algorithms.
Foundations and Trends in Optimization ,1(3):127–239, 2014.[35] T. Pock and S. Sabach. Inertial proximal alternating linearized minimization (iPALM) fornonconvex and nonsmooth problems.
SIAM Journal on Imaging Sciences , 9(4):1756–1787,2016.[36] M. J. D. Powell. On search directions for minimization algorithms.
Mathematical Programming ,4(1):193–201, Dec 1973.[37] M. Razaviyayn, M. Hong, and Z. Luo. A unified convergence analysis of block successiveminimization methods for nonsmooth optimization.
SIAM Journal on Optimization ,23(2):1126–1153, 2013.[38] R. T. Rockafellar and R. J.-B. Wets.
Variational Analysis . Springer Verlag, Heidelberg, Berlin,New York, 1998.[39] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternatinglinearization methods. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel,and A. Culotta, editors,
Advances in Neural Information Processing Systems 23 , pages2101–2109. Curran Associates, Inc., 2010.[40] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization.
Journal of Optimization Theory and Applications , 109(3):475–494, Jun 2001.[41] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimiza-tion.
Mathematical Programming , 117(1):387–423, Mar 2009.[42] Y. Wang, W. Yin, and J. Zeng. Global convergence of admm in nonconvex nonsmoothoptimization.
Journal of Scientific Computing , 78:29–63, 01 2019.[43] Y. Wang, J. Zeng, Z. Peng, X. Chang, and Z. Xu. Linear convergence of adaptively iterativethresholding algorithms for compressed sensing.
IEEE Transactions on Signal Processing ,63(11):2957–2971, 2015.[44] Z. Wen and W. Yin. A feasible method for optimization with orthogonality constraints.NERTIAL ALTERNATING DIRECTION METHOD OF MULTIPLIERS Mathematical Programming , 142, 12 2010.[45] M. Xu and T. Wu. A class of linearized proximal alternating direction methods.
J. OptimizationTheory and Applications , 151:321–337, 11 2011.[46] Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimizationwith applications to nonnegative tensor factorization and completion.
SIAM Journal onImaging Sciences , 6(3):1758–1789, 2013.[47] Y. Xu and W. Yin. A globally convergent algorithm for nonconvex optimization based on blockcoordinate update.
Journal of Scientific Computing , 72(2):700–734, Aug 2017.[48] J. Yang, Y. Zhang, and W. Yin. An efficient tvl1 algorithm for deblurring multichannel imagescorrupted by impulsive noise.
SIAM Journal on Scientific Computing , 31(4):2842–2865,2009.[49] L. Yang, T. K. Pong, and X. Chen. Alternating direction method of multipliers for a class ofnonconvex and nonsmooth problems with applications to background/foreground extraction.
SIAM Journal on Imaging Sciences , 10(1):74–110, 2017.[50] W. Yin, S. Osher, D. Goldfarb, and J. Darbon. Bregman iterative algorithms for l(1)-minimization with applications to compressed sensing.
Siam Journal on Imaging Sciences ,1:143–168, 01 2008.
Appendix A. Preliminaries of non-convex non-smooth optimization.
Let g : E → R ∪ { + ∞} be a proper lower semicontinuous function. Definition
A.1. [38, Definition 8.3](i) For any x ∈ dom g, and d ∈ E , we denote the directional derivative of g at x in the direction d by g (cid:48) ( x ; d ) = lim inf τ ↓ g ( x + τ d ) − g ( x ) τ . (ii) For each x ∈ dom g, we denote ˆ ∂g ( x ) as the Frechet subdifferential of g at x which contains vectors v ∈ E satisfying lim inf y (cid:54) = x,y → x (cid:107) y − x (cid:107) ( g ( y ) − g ( x ) − (cid:104) v, y − x (cid:105) ) ≥ . If x (cid:54)∈ dom g, then we set ˆ ∂g ( x ) = ∅ . (iii) The limiting-subdifferential ∂g ( x ) of g at x ∈ dom g is defined as follows: ∂g ( x ) := (cid:110) v ∈ E : ∃ x ( k ) → x, g (cid:16) x ( k ) (cid:17) → g ( x ) , v ( k ) ∈ ˆ ∂g (cid:16) x ( k ) (cid:17) , v ( k ) → v (cid:111) . (iv) The horizon subdifferential ∂ ∞ g ( x ) of g at x is defined as follows: ∂ ∞ g ( x ) := (cid:110) v ∈ E : ∃ λ ( k ) → , λ ( k ) ≥ , λ ( k ) x ( k ) → x,g (cid:16) x ( k ) (cid:17) → g ( x ) , v ( k ) ∈ ˆ ∂g (cid:16) x ( k ) (cid:17) , v ( k ) → v (cid:111) . Definition
A.2.
We call x ∗ ∈ dom F a critical point of F if ∈ ∂F ( x ∗ ) . Definition
A.3. [38, Definition 7.5] A function f : R n → R ∪ { + ∞} is calledsubdifferentially regular at ¯ x if f (¯ x ) is finite and the epigraph of f is Clarke regular at (¯ x, f (¯ x )) as a subset of R n × R (see [38, Definition 6.4] for the definition of Clarkeregularity of a set at a point). Proposition
A.4. [38, Corollary 10.9] Suppose f = f + · + f m for proper lowersemi-continuous function f i : R n → R ∪ { + ∞} and let ¯ x ∈ dom f . Suppose eachfunction f i is subdifferential regular at ¯ x , and the condition that the only combinationof vector ν i ∈ ∂ ∞ f i (¯ x ) with ν + . . . ν m = 0 is ν i = 0 for i ∈ [ m ] . Then we have ∂f (¯ x ) = ∂f (¯ x ) + . . . ∂f m (¯ x ) . L. T. K. HIEN, D. N. PHAN, N. GILLIS
Appendix B. Proofs.
Before proving the propositions, let us give some prelimi-nary results. We use x, z to denote the vectors in R n . Lemma
B.1. [18, Lemma 2.8] If the function x i (cid:55)→ Θ( x i , z ) is ρ -strongly convex,differentiable at z i , and ∇ x i Θ( z i , z ) = 0 then we have Θ( x i , z ) ≥ ρ (cid:107) x i − z i (cid:107) . We recall the notation ( x i , z (cid:54) = i ) = ( z , . . . , z i − , x i , z i +1 , . . . , z s ). Suppose we are tryingto solve min x Ψ( x ) := Φ( x ) + s (cid:88) i =1 g i ( x i ) . Proposition
B.2. [18, Theorem 2.7] Suppose G ki : R n i × R n i → R n i be someextrapolation operator that satisfies G ki ( x ki , x k − i ) ≤ a ki (cid:107) x ki − x k − i (cid:107) . Let u i ( x i , z ) is ablock surrogate function of Φ( x ) . We assume one of the following conditions holds: • x i (cid:55)→ u i ( x i , z ) + g i ( x i ) is ρ i -strongly convex, • the approximation error Θ( x i , z ) := u i ( x i , z ) − Φ( x i , z (cid:54) = i ) satisfying Θ( x i , z ) ≥ ρ i (cid:107) x i − z i (cid:107) for all x i .Note that ρ i may depend on z . Let x k +1 i = argmin x i u i ( x i , x k,i − ) + g i ( x i ) − (cid:104)G ki ( x ki , x k − i ) , x i (cid:105) . Then we have (27) Ψ( x k,i − ) + γ ki (cid:107) x ki − x k − i (cid:107) ≥ Ψ( x k,i ) + η ki (cid:107) x k +1 i − x ki (cid:107) , where γ ki = ( a ki ) νρ i , η ki = (1 − ν ) ρ i , and < ν < is a constant. If we do not apply extrapolation, that is a ki = 0 , then (27) is satisfied with γ ki = 0 and η ki = ρ i / . The following proposition is derived from [17, Remark 3] and [46, Lemma 2.1].
Proposition
B.3.
Suppose x i (cid:55)→ Φ( x ) is a L i -smooth convex function and g i ( x i ) is convex. Define ˆ x ki = x ki + α ki ( x ki − x k − i ) , ¯ x ki = x ki + β ki ( x ki − x k − i ) , and ¯ x k,i − =( x k +11 , . . . , x k +1 i − , ¯ x ki , x ki +1 , . . . , x ks ) . Let x k +1 i = argmin x i (cid:104)∇ Φ(¯ x k,i − ) , x i (cid:105) + g i ( x i ) + L i (cid:107) x i − ˆ x ki (cid:107) . Then we have Inequality (27) is satisfied with γ ki = L i (cid:0) ( β ki ) + ( γ ki − α ki ) ν (cid:1) , η ki = (1 − ν ) L i . If α ki = β ki then we have Inequality (27) is satisfied with γ ki = L i β ki ) , η ki = L i . NERTIAL ALTERNATING DIRECTION METHOD OF MULTIPLIERS B.1. Proof of Proposition 2.3. (i) Suppose we are updating x ki . Let us recallthat L ( x, y, ω ) := f ( x ) + (cid:80) si =1 g i ( x i ) + h ( y ) + ϕ ( x, y, ω ), where(28) ϕ ( x, y, ω ) = β (cid:107)A x + B y − b (cid:107) + (cid:104) ω, A x + B y − b (cid:105) . Denote u i ( x i , z, y, ω ) = u i ( x i , z ) + h ( y ) + ˆ ϕ i ( x i , z, y, ω ) , whereˆ ϕ i ( x i , z, y, ω ) = ϕ ( z, y, ω ) + (cid:104)A ∗ i (cid:0) ω + β ( A z + B y − b ) (cid:1) , x i − z i (cid:105) + κ i β (cid:107) x i − z i (cid:107) . We see that ˆ ϕ i ( x i , z, y, ω ) is a block surrogate function of x (cid:55)→ ϕ ( x, y, ω ) withrespect to block x i , and u i ( x i , z, y, ω ) is a block surrogate function of x (cid:55)→ f ( x ) + h ( y ) + ϕ ( x, y, ω ) with respect to block x i . The update in (10) can be rewritten asfollows.(29) x k +1 i = argmin x i u i ( x i , x k,i − , y k , ω k ) + g i ( x i ) − (cid:104)G ki ( x ki , x k − i ) , x i (cid:105) , where G ki ( x ki , x k − i ) = β A ∗ i A (cid:0) x k,i − − ¯ x k,i − ) (cid:1) + κ i βζ ki ( x ki − x k − i ) . (30)The block approximation error function between u i ( x i , z, y, ω ) and x (cid:55)→ f ( x ) + h ( y ) + ϕ ( x, y, ω ) is defined as e i ( x i , z, y, ω ) = u i ( x i , z, y, ω ) − (cid:0) f ( x i , z (cid:54) = i ) + h ( y ) + ϕ (( x i , z (cid:54) = i ) , y, ω ) (cid:1) = u i ( x i , z ) − f ( x i , z (cid:54) = i ) + ˆ ϕ i ( x i , z, y, ω ) − ϕ (( x i , z (cid:54) = i ) , y, ω ) ≥ θ i ( x i , z, y, ω ) := ϕ ( z, y, ω ) − ϕ (( x i , z (cid:54) = i ) , y, ω )+ (cid:104)A ∗ i (cid:0) ω + β ( A z + B y − b ) (cid:1) , x i − z i (cid:105) + κ i β (cid:107) x i − z i (cid:107) . (31)We have ∇ x i θ i ( x i , z, y, ω ) = κ i β ( x i − z i ) + ∇ x i ϕ ( z, y, ω ) − ∇ x i ϕ (( x i , z (cid:54) = i ) , y, ω ). Hence ∇ x i θ i ( z i , z ) = 0. On the other hand, note that x i (cid:55)→ ϕ (( x i , z (cid:54) = i ) , y k , ω k ) is β (cid:107)A ∗ i A i (cid:107) -smooth. So, x i (cid:55)→ θ i ( x i , z, y, ω ) is a β ( κ i − (cid:107)A ∗ i A i (cid:107) ) - strongly convex function. FromLemma B.1 we have θ i ( x i , z ) ≥ β ( κ i −(cid:107)A ∗ i A i (cid:107) )2 (cid:107) x i − z i (cid:107) . The result follows from (29),(31) and Proposition (B.2).(ii) When x i (cid:55)→ u i ( x i , z ) + g i ( x i ) is convex and we apply the update as in (10), itfollows from Proposition B.3 (see also [18, Remark 4.1]) that u i ( x ki , x k,i − ) + g i ( x ki ) + ϕ ( x k,i − , y k , ω k ) + β (cid:107)A ∗ i A i (cid:107) ζ ki ) (cid:107) x ki − x k − i (cid:107) ≥ u i ( x k +1 i , x k,i − ) + g i ( x k +1 i ) + ϕ ( x k,i , y k , ω k ) + β (cid:107)A ∗ i A i (cid:107) (cid:107) x k +1 i − x ki (cid:107) . (32)On the other hand, note that u i ( x ki , x k,i − ) = f ( x k,i − ) and u i ( x k +1 i , x k,i − ) ≥ f ( x k,i ).The result follows then.8 L. T. K. HIEN, D. N. PHAN, N. GILLIS
B.2. Proof of Proposition 2.4.
Denoteˆ h ( y, y (cid:48) ) = h ( y (cid:48) ) + (cid:104) ω, A x + B y (cid:48) − b (cid:105) + (cid:104)B ∗ ω + ∇ h ( y (cid:48) ) , y − y (cid:48) (cid:105) + L h (cid:107) y − y (cid:48) (cid:107) . Then we have ˆ h ( y, y (cid:48) )+ β (cid:107)A x + B y − b (cid:107) is a surrogate function of y (cid:55)→ h ( y )+ ϕ ( x, y, ω ).Note that the function y (cid:55)→ ˆ h ( y, y (cid:48) ) + β (cid:107)A x + B y − b (cid:107) is ( L h + β (cid:107)B ∗ B(cid:107) )-stronglyconvex. The result follows from Proposition B.2 (see also [18, Section 4.2.1]).Suppose h ( y ) is convex. We note that y (cid:55)→ β (cid:107)A x + B y − b (cid:107) is also convex andplays the role of g i in Proposition B.3. The result follows from Proposition B.3. B.3. Proof of Proposition 2.5.
Note that(33) L ( x k +1 , y k +1 , ω k +1 ) = L ( x k +1 , y k +1 , ω k ) + 1 αβ (cid:104) ω k +1 − ω k , ω k +1 − ω k (cid:105) . From the optimality condition of (11) we have ∇ h (ˆ y k ) + L h ( y k +1 − ˆ y k ) + B ∗ ω k + β B ∗ ( A x k +1 + B y k +1 − b ) = 0 . Together with (12) we obtain(34) ∇ h (ˆ y k ) + L h (∆ y k +1 − δ k ∆ y k ) + B ∗ ω k + 1 α B ∗ ( w k +1 − w k ) = 0 . Hence,(35) B ∗ w k +1 = (1 − α ) B ∗ ω k − α ( ∇ h (ˆ y k ) + L h (∆ y k +1 − δ k ∆ y k )) , which implies that(36) B ∗ ∆ w k +1 = (1 − α ) B ∗ ∆ w k − α ∆ z k +1 , where ∆ z k +1 = z k +1 − z k and z k +1 = ∇ h (ˆ y k ) + L h (∆ y k +1 − δ k ∆ y k ). We now consider2 cases.Case 1: 0 < α ≤
1. From the convexity of (cid:107) · (cid:107) we have(37) (cid:107)B ∗ ∆ w k +1 (cid:107) ≤ (1 − α ) (cid:107)B ∗ ∆ w k (cid:107) + α (cid:107) ∆ z k +1 (cid:107) . Case 2: 1 < α <
2. We rewrite (36) as B ∗ ∆ w k +1 = − ( α − B ∗ ∆ w k − α − α (2 − α )∆ z k +1 . Hence(38) (cid:107)B ∗ ∆ w k +1 (cid:107) ≤ ( α − (cid:107)B ∗ ∆ w k (cid:107) + α (2 − α ) (cid:107) ∆ z k +1 (cid:107) . Combine (37) and (38) we obtain(39) (cid:107)B ∗ ∆ w k +1 (cid:107) ≤ | − α |(cid:107)B ∗ ∆ w k (cid:107) + α − | − α | (cid:107) ∆ z k +1 (cid:107) , which implies(40)(1 −| − α | ) (cid:107)B ∗ ∆ w k +1 (cid:107) ≤ | − α | ( (cid:107)B ∗ ∆ w k (cid:107) −(cid:107)B ∗ ∆ w k +1 (cid:107) )+ α − | − α | (cid:107) ∆ z k +1 (cid:107) . NERTIAL ALTERNATING DIRECTION METHOD OF MULTIPLIERS y we have (cid:107) ∆ z k +1 (cid:107) = (cid:107)∇ h (ˆ y k ) − ∇ h (ˆ y k − ) + L h (∆ y k +1 − δ k ∆ y k ) − L h (∆ y k − δ k − ∆ y k − ) (cid:107) ≤ L h (cid:107) ˆ y k − ˆ y k − (cid:107) + 3 L h (cid:107) ∆ y k +1 (cid:107) + 3 (cid:107) (1 + δ k ) L h ∆ y k − L h δ k − ∆ y k − (cid:107) ≤ L h (cid:2) (1 + δ k ) (cid:107) ∆ y k (cid:107) + δ k − (cid:107) ∆ y k − (cid:107) (cid:3) + 3 L h (cid:107) ∆ y k +1 (cid:107) + 6(1 + δ k ) L h (cid:107) ∆ y k (cid:107) + 6 L h δ k − (cid:107) ∆ y k − (cid:107) = 3 L h (cid:107) ∆ y k +1 (cid:107) + 12(1 + δ k ) L h (cid:107) ∆ y k (cid:107) + 12 L h δ k − (cid:107) ∆ y k − (cid:107) . (41)If we do not use extrapolation for y then we have (cid:107) ∆ z k +1 (cid:107) = (cid:107)∇ h ( y k ) − ∇ h ( y k − ) + L h ∆ y k +1 − L h ∆ y k (cid:107) ≤ L h (cid:107) ∆ y k (cid:107) + 3 L h (cid:107) ∆ y k +1 (cid:107) + 3 L h (cid:107) ∆ y k (cid:107) = 6 L h (cid:107) ∆ y k (cid:107) + 3 L h (cid:107) ∆ y k +1 (cid:107) . (42)Furthermore, note that σ B (cid:107) ∆ w k +1 (cid:107) ≤ (cid:107)B ∗ ∆ w k +1 (cid:107) . Therefore, it follows from (40)that (cid:107) ∆ w k +1 (cid:107) ≤ | − α | σ B (1 − | − α | ) ( (cid:107)B ∗ ∆ w k (cid:107) − (cid:107)B ∗ ∆ w k +1 (cid:107) )+ α L h σ B (1 − | − α | ) ( (cid:107) ∆ y k +1 (cid:107) + ¯ δ k (cid:107) ∆ y k (cid:107) + 4 δ k − (cid:107) ∆ y k − (cid:107) ) . (43)The result is obtained from (43), (33) and Proposition 2.3. B.4. Proof of Proposition 2.6. (i) From Inequality (17) and the conditionsin (19) we have L k +1 + µ (cid:107) ∆ y k +1 (cid:107) + s (cid:88) i =1 η i (cid:107) ∆ x k +1 i (cid:107) + α β (cid:107)B ∗ ∆ w k +1 (cid:107) ≤ L k + C µ (cid:107) ∆ y k (cid:107) + C µ (cid:107) ∆ y k − (cid:107) + C x s (cid:88) i =1 η i (cid:107) ∆ x ki (cid:107) + α β (cid:107)B ∗ ∆ w k (cid:107) . (44)By summing from k = 1 to K Inequality (44) and noting that C + C = C y we obtainInequality (20).(ii) Let us prove { ∆ y k } and { ∆ x ki } converge to 0.Let us first prove the second situation, that is we use extrapolation for the updateof y and Inequality (21) is satisfied. From (35) we have α B ∗ w k +1 = − (1 − α ) B ∗ ∆ ω k +1 − αz k +1 , where z k +1 = ∇ h (ˆ y k ) + L h (∆ y k +1 − δ k ∆ y k ). Using the same technique that derivesInequality (39), we obtain the following(45) ασ B (cid:107) w k +1 (cid:107) ≤ α (cid:107)B ∗ w k +1 (cid:107) ≤ | − α |(cid:107)B ∗ ∆ ω k +1 (cid:107) + α − | − α | (cid:107) z k +1 (cid:107) . On the other hand, we have L k = F ( x k ) + h ( y k ) + β (cid:107)A x k + B y k − b + ω k β (cid:107) − β (cid:107) ω k (cid:107) L. T. K. HIEN, D. N. PHAN, N. GILLIS ≥ F ( x k ) + h ( y k ) − β (cid:107) ω k (cid:107) . 
Together with (45) and (cid:107) z k (cid:107) = (cid:107)∇ h (ˆ y k − ) − ∇ h ( y k ) + ∇ h ( y k ) + L h (∆ y k − δ k − ∆ y k − ) (cid:107) ≤ (cid:107)∇ h (ˆ y k − ) − ∇ h ( y k ) (cid:107) + 4 (cid:107)∇ h ( y k ) (cid:107) + 4 L h (cid:107) ∆ y k (cid:107) + 4 L h δ k − (cid:107) ∆ y k − (cid:107) ≤ L h (cid:107) ∆ y k (cid:107) + 12 δ k − (cid:107) ∆ y k − (cid:107) + 4 (cid:107)∇ h ( y k ) (cid:107) . we obtain L k ≥ F ( x k ) + h ( y k ) − αβσ B (cid:0) | − α |(cid:107) B ∗ ∆ ω k (cid:107) + α − | − α | (cid:107) z k (cid:107) (cid:1) ≥ F ( x k ) + h ( y k ) − | − α | αβσ B (cid:107) B ∗ ∆ ω k (cid:107) − α βσ B (1 − | − α | ) (cid:0) L h (cid:107) ∆ y k (cid:107) + 12 δ k − (cid:107) ∆ y k − (cid:107) + 4 (cid:107)∇ h ( y k ) (cid:107) (cid:1) (46)Since h ( y ) is L h -smooth, for all y ∈ R q and α L > h ( y − α L ∇ f ( y )) ≤ h ( y ) − α L (1 − L h α L (cid:107)∇ h ( y ) (cid:107) . Let us choose α L such that α L (1 − L h α L ) = α βσ B (1 −| − α | ) . Note that this equationhave, solution when β ≥ L h ασ B (1 −| − α | ) . Then we have h ( y k ) − α βσ B (1 − | − α | ) (cid:107)∇ h ( y k ) (cid:107) ≥ h ( y k − α L ∇ f ( y k )) . Together with (46) we get L k ≥ F ( x k ) + h ( y k − α L ∇ f ( y k )) − | − α | αβσ B (cid:107) B ∗ ∆ ω k (cid:107) − α βσ B (1 − | − α | ) (12 L h (cid:107) ∆ y k (cid:107) + 12 δ k − (cid:107) ∆ y k − (cid:107) ) . (47)So from α β ≥ | − α | αβσ B , µ ≥ α L h βσ B (1 −| − α | ) , (1 − C ) µ ≥ α L h δ k βσ B (1 −| − α | ) we have L K +1 + µ (cid:107) ∆ y K +1 (cid:107) + α β (cid:107) B ∗ ∆ w K +1 (cid:107) + (1 − C ) µ (cid:107) ∆ y K (cid:107) ≥ F ( x K +1 ) + h ( y K +1 − α L ∇ f ( y K +1 )) . (48)Hence L K +1 + µ (cid:107) ∆ y K +1 (cid:107) + α β (cid:107) B ∗ ∆ w K +1 (cid:107) + (1 − C ) µ (cid:107) ∆ y K (cid:107) is lower bounded.Furthermore, since η i and µ are positive numbers we derive from Inequality (20)that (cid:80) ∞ k =1 (cid:107) ∆ y k (cid:107) < + ∞ and (cid:80) ∞ k =1 (cid:107) ∆ x ki (cid:107) < + ∞ . Therefore, { ∆ y k } and { ∆ x ki } converge to 0.Let us now consider the first situation when δ k = 0 for all k .From Inequality (17) and the conditions in (19) we have L k +1 + µ (cid:107) ∆ y k +1 (cid:107) + s (cid:88) i =1 η i (cid:107) ∆ x k +1 i (cid:107) + α β (cid:107) B ∗ ∆ w k +1 (cid:107) ≤ L k + C y µ (cid:107) ∆ y k (cid:107) + C x s (cid:88) i =1 η i (cid:107) ∆ x ki (cid:107) + α β (cid:107) B ∗ ∆ w k (cid:107) . (49) NERTIAL ALTERNATING DIRECTION METHOD OF MULTIPLIERS k = 1 to K we obtain L K +1 + C y µ (cid:107) ∆ y K +1 (cid:107) + C x s (cid:88) i =1 η i (cid:107) ∆ x K +1 i (cid:107) + α β (cid:107) B ∗ ∆ w K +1 (cid:107) + K (cid:88) k =1 (cid:2) (1 − C y ) µ (cid:107) ∆ y k +1 (cid:107) + (1 − C x ) s (cid:88) i =1 η i (cid:107) ∆ x k +1 i (cid:107) (cid:3) ≤ L + α β (cid:107) B ∗ ∆ ω (cid:107) + s (cid:88) i =1 η i (cid:107) ∆ x i (cid:107) + Cµ (cid:107) ∆ y (cid:107) . (50)Denote the value of the right side of Inequality (49) by ˆ L k . Note that 0 < C x , C y < { ˆ L k } is non-increasing. It follows from [30,Lemma 2.9] that ˆ L k ≥ ϑ for all k , where ϑ is is the lower bound of F ( x k ) + h ( y k ). Forcompleteness, let us provide the proof in the following. 
We have
\[
\begin{aligned}
\hat{\mathcal L}_k \ge \mathcal L_k &= F(x^k) + h(y^k) + \frac{\beta}{2}\|\mathcal A x^k + \mathcal B y^k - b\|^2 + \frac{1}{\alpha\beta}\langle\omega^k, \omega^k - \omega^{k-1}\rangle\\
&\ge \vartheta + \frac{1}{2\alpha\beta}\big(\|\omega^k\|^2 - \|\omega^{k-1}\|^2 + \|\Delta\omega^k\|^2\big)\\
&\ge \vartheta + \frac{1}{2\alpha\beta}\big(\|\omega^k\|^2 - \|\omega^{k-1}\|^2\big).
\end{aligned}
\tag{51}
\]
Assume that there exists $k_0$ such that $\hat{\mathcal L}_k < \vartheta$ for all $k \ge k_0$. As $\hat{\mathcal L}_k$ is non-increasing,
\[
\sum_{k=1}^{K}(\hat{\mathcal L}_k - \vartheta) \le \sum_{k=1}^{k_0}(\hat{\mathcal L}_k - \vartheta) + (K - k_0)(\hat{\mathcal L}_{k_0} - \vartheta).
\]
Hence $\sum_{k=1}^{\infty}(\hat{\mathcal L}_k - \vartheta) = -\infty$. However, from (51) we have
\[
\sum_{k=1}^{K}(\hat{\mathcal L}_k - \vartheta) \ge \sum_{k=1}^{K}\frac{1}{2\alpha\beta}\big(\|\omega^k\|^2 - \|\omega^{k-1}\|^2\big) = \frac{1}{2\alpha\beta}\big(\|\omega^K\|^2 - \|\omega^{0}\|^2\big) \ge -\frac{1}{2\alpha\beta}\|\omega^{0}\|^2,
\]
which gives a contradiction.

Since $\hat{\mathcal L}_K \ge \vartheta$ and the $\eta_i$ and $\mu$ are positive numbers, we derive from Inequality (20) that $\sum_{k=1}^{\infty}\|\Delta y^k\|^2 < +\infty$ and $\sum_{k=1}^{\infty}\|\Delta x_i^k\|^2 < +\infty$. Therefore, $\{\Delta y^k\}$ and $\{\Delta x_i^k\}$ converge to 0.

Now we prove that $\{\Delta\omega^k\}$ goes to 0. Since $\sum_{k=1}^{\infty}\|\Delta y^k\|^2 < +\infty$, we derive from (41) that $\sum_{k=1}^{\infty}\|\Delta z^k\|^2 < +\infty$. Summing up (39) from $k = 1$ to $K$ we have
\[
(1 - |1-\alpha|)\sum_{k=1}^{K}\|\mathcal B^*\Delta\omega^k\|^2 + \|\mathcal B^*\Delta\omega^{K+1}\|^2 \le \|\mathcal B^*\Delta\omega^{1}\|^2 + \frac{\alpha^2}{1-|1-\alpha|}\sum_{k=1}^{K}\|\Delta z^{k+1}\|^2,
\]
which implies that $\sum_{k=1}^{\infty}\|\mathcal B^*\Delta\omega^k\|^2 < +\infty$. Hence, $\|\mathcal B^*\Delta\omega^k\| \to 0$. Since $\sigma_B > 0$, $\{\Delta\omega^k\}$ goes to 0.
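Remark (numerical sanity check; not part of the original analysis). The elementary facts behind (51) and the contradiction argument above are the identity $2\langle\omega^k,\omega^k-\omega^{k-1}\rangle = \|\omega^k\|^2 - \|\omega^{k-1}\|^2 + \|\Delta\omega^k\|^2$ and the telescoping bound $\sum_{k=1}^{K}(\|\omega^k\|^2-\|\omega^{k-1}\|^2) = \|\omega^K\|^2-\|\omega^0\|^2 \ge -\|\omega^0\|^2$. The following Python sketch simply verifies both on randomly generated vectors; the dimensions and data are arbitrary choices made only for this check.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
dim, K = 5, 50
# omegas[k] plays the role of omega^k, k = 0, ..., K
omegas = [rng.standard_normal(dim) for _ in range(K + 1)]

# identity used in (51):
# 2<w_k, w_k - w_{k-1}> = ||w_k||^2 - ||w_{k-1}||^2 + ||w_k - w_{k-1}||^2
for wk, wkm1 in zip(omegas[1:], omegas[:-1]):
    lhs = 2 * np.dot(wk, wk - wkm1)
    rhs = np.dot(wk, wk) - np.dot(wkm1, wkm1) + np.dot(wk - wkm1, wk - wkm1)
    assert abs(lhs - rhs) < 1e-10

# telescoping lower bound:
# sum_k (||w_k||^2 - ||w_{k-1}||^2) = ||w_K||^2 - ||w_0||^2 >= -||w_0||^2
telescoped = sum(np.dot(w, w) - np.dot(wp, wp)
                 for w, wp in zip(omegas[1:], omegas[:-1]))
assert telescoped >= -np.dot(omegas[0], omegas[0]) - 1e-10
print("identity and telescoping bound verified")
\end{verbatim}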
B.5. Proof of Proposition 2.7.
We remark that we use the idea in the proof of [42, Lemma 6] to prove the proposition. However, our proof is more complicated since in our framework $\alpha \in (0, 2)$, $h$ is linearized, and we use extrapolation for $y$.

Note that, as $\sigma_B > 0$, $\mathcal B$ is surjective. Together with the assumption $b + \mathrm{Im}(\mathcal A) \subseteq \mathrm{Im}(\mathcal B)$, there exists $\bar y^k$ such that $\mathcal A x^k + \mathcal B\bar y^k - b = 0$. Now we have
\[
\begin{aligned}
\mathcal L_k &= F(x^k) + h(y^k) + \frac{\beta}{2}\|\mathcal A x^k + \mathcal B y^k - b\|^2 + \langle\omega^k, \mathcal A x^k + \mathcal B y^k - b\rangle\\
&= F(x^k) + h(y^k) + \frac{\beta}{2}\|\mathcal A x^k + \mathcal B y^k - b\|^2 + \langle\mathcal B^*\omega^k, y^k - \bar y^k\rangle.
\end{aligned}
\tag{52}
\]
From (34) we have
\[
\begin{aligned}
\langle\mathcal B^*\omega^k, y^k - \bar y^k\rangle &= \Big\langle\nabla h(\hat y^k) + L_h(\Delta y^{k+1} - \delta_k\Delta y^k) + \frac{1}{\alpha}\mathcal B^*(\omega^{k+1} - \omega^k),\; \bar y^k - y^k\Big\rangle\\
&\ge \langle\nabla h(y^k), \bar y^k - y^k\rangle\\
&\quad - \Big(\|\nabla h(y^k) - \nabla h(\hat y^k)\| + L_h\|\Delta y^{k+1}\| + L_h\delta_k\|\Delta y^k\| + \frac{1}{\alpha}\|\mathcal B^*\Delta\omega^{k+1}\|\Big)\|\bar y^k - y^k\|.
\end{aligned}
\]
Therefore, it follows from (52) and the $L_h$-smoothness of $h$ that
\[
\mathcal L_k \ge F(x^k) + h(\bar y^k) - \frac{L_h}{2}\|y^k - \bar y^k\|^2 - \Big(2L_h\delta_k\|\Delta y^k\| + L_h\|\Delta y^{k+1}\| + \frac{1}{\alpha}\|\mathcal B^*\Delta\omega^{k+1}\|\Big)\|\bar y^k - y^k\|.
\tag{53}
\]
On the other hand, we have
\[
\|\bar y^k - y^k\| \le \frac{1}{\sqrt{\lambda_{\min}(\mathcal B^*\mathcal B)}}\|\mathcal B(\bar y^k - y^k)\| = \frac{1}{\sqrt{\lambda_{\min}(\mathcal B^*\mathcal B)}}\|\mathcal A x^k + \mathcal B y^k - b\| = \frac{1}{\sqrt{\lambda_{\min}(\mathcal B^*\mathcal B)}}\Big\|\frac{1}{\alpha\beta}\Delta\omega^k\Big\|.
\tag{54}
\]
We have proved in Proposition 2.6 that $\|\Delta\omega^k\|$, $\|\Delta x^k\|$ and $\|\Delta y^k\|$ converge to 0. Furthermore, from Proposition 2.6, $\mathcal L_k$ is upper bounded. Therefore, from (53), (54) and (20), $F(x^k) + h(\bar y^k)$ is upper bounded. So $\{x^k\}$ is bounded. Consequently, $\mathcal A x^k$ is bounded. Furthermore, we have
\[
\|y^k\| \le \frac{1}{\sqrt{\lambda_{\min}(\mathcal B^*\mathcal B)}}\|\mathcal B y^k\| = \frac{1}{\sqrt{\lambda_{\min}(\mathcal B^*\mathcal B)}}\Big\|\frac{1}{\alpha\beta}\Delta\omega^k - \mathcal A x^k + b\Big\|.
\]
Therefore, $\{y^k\}$ is bounded, which implies that $\|\nabla h(\hat y^k)\|$ is also bounded. Finally, from (34) and the assumption $\lambda_{\min}(\mathcal B\mathcal B^*) > 0$, $\{\omega^k\}$ is bounded.
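Remark (numerical sanity check; not part of the original analysis). The linear-algebra bound used in (54) and for $\|y^k\|$ above, namely $\|v\| \le \|\mathcal B v\|/\sqrt{\lambda_{\min}(\mathcal B^*\mathcal B)}$ whenever $\lambda_{\min}(\mathcal B^*\mathcal B) > 0$, can be checked numerically. The sketch below uses a randomly generated matrix of full column rank; the matrix and dimensions are arbitrary choices made only for this check.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
m, q = 8, 5                       # B maps R^q -> R^m; full column rank with probability 1
B = rng.standard_normal((m, q))
lam_min = np.linalg.eigvalsh(B.T @ B).min()

for _ in range(100):
    v = rng.standard_normal(q)    # plays the role of y^k - \bar y^k
    assert np.linalg.norm(v) <= np.linalg.norm(B @ v) / np.sqrt(lam_min) + 1e-10
print("norm bound verified; lambda_min(B^T B) =", lam_min)
\end{verbatim}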
B.6. Proof of Theorem 2.8.

Suppose $(x^{k_n}, y^{k_n}, \omega^{k_n})$ converges to $(x^*, y^*, \omega^*)$. Since $\Delta x_i^k$ goes to 0, $x_i^{k_n+1}$ and $x_i^{k_n-1}$ also converge to $x_i^*$ for all $i \in [s]$. From (29), for all $x_i$ we have
\[
u_i(x_i^{k+1}, x^{k,i-1}, y^k, \omega^k) + g_i(x_i^{k+1}) \le u_i(x_i, x^{k,i-1}, y^k, \omega^k) + g_i(x_i) - \langle\mathcal G_i^k(x_i^k - x_i^{k-1}),\, x_i - x_i^{k+1}\rangle.
\tag{55}
\]
Choosing $x_i = x_i^*$ and $k = k_n - 1$ in (55), and using that $u_i(x_i, z)$ is continuous by Assumption 2(i), we have
\[
\limsup_{n\to\infty}\; u_i(x_i^*, x^*, y^*, \omega^*) + g_i(x_i^{k_n}) \le u_i(x_i^*, x^*, y^*, \omega^*) + g_i(x_i^*).
\]
On the other hand, $g_i(x_i)$ is lower semi-continuous. Hence, $g_i(x_i^{k_n})$ converges to $g_i(x_i^*)$. Now, choosing $k = k_n$ in (55) and letting $n \to \infty$, for all $x_i$ we obtain
\[
\mathcal L(x^*, y^*, \omega^*) + g_i(x_i^*) \le u_i(x_i, x^*, y^*, \omega^*) + g_i(x_i) = \mathcal L(x_i, x^*_{\neq i}, y^*, \omega^*) + e_i(x_i, x^*, y^*, \omega^*) + g_i(x_i),
\tag{56}
\]
where $\mathcal L(x, y, \omega) = f(x) + h(y) + \varphi(x, y, \omega)$ and $e_i$ is the approximation error defined in (31). We have
\[
\begin{aligned}
e_i(x_i, x^*, y^*, \omega^*) &= u_i(x_i, x^*) - f(x_i, x^*_{\neq i}) + \hat\varphi_i(x_i, x^*, y^*, \omega^*) - \varphi\big((x_i, x^*_{\neq i}), y^*, \omega^*\big)\\
&\le \bar e_i(x_i, x^*) + \hat\varphi_i(x_i, x^*, y^*, \omega^*) - \varphi\big((x_i, x^*_{\neq i}), y^*, \omega^*\big).
\end{aligned}
\]
Note that $\bar e_i(x_i^*, x^*) = 0$ by Assumption 2. From (56) we see that $x_i^*$ is a solution of
\[
\min_{x_i}\; \mathcal L(x_i, x^*_{\neq i}, y^*, \omega^*) + g_i(x_i) + \bar e_i(x_i, x^*) + \hat\varphi_i(x_i, x^*, y^*, \omega^*) - \varphi\big((x_i, x^*_{\neq i}), y^*, \omega^*\big).
\]
Writing the optimality condition of this problem we obtain $0 \in \partial_{x_i}\mathcal L(x^*, y^*, \omega^*)$. Entirely similarly, we can prove that $0 \in \partial_y\mathcal L(x^*, y^*, \omega^*)$. On the other hand, we have $\Delta\omega^k = \omega^k - \omega^{k-1} = \alpha\beta(\mathcal A x^k + \mathcal B y^k - b) \to 0$. Hence,
\[
\partial_\omega\mathcal L(x^*, y^*, \omega^*) = \mathcal A x^* + \mathcal B y^* - b = 0.
\]
As we assume $\partial F(x) = \partial_{x_1}F(x) \times \ldots \times \partial_{x_s}F(x)$, we have
\[
\begin{aligned}
\partial\mathcal L(x, y, \omega) &= \partial F(x) + \nabla\Big(h(y) + \langle\omega, \mathcal A x + \mathcal B y - b\rangle + \frac{\beta}{2}\|\mathcal A x + \mathcal B y - b\|^2\Big)\\
&= \partial_{x_1}\mathcal L(x, y, \omega) \times \ldots \times \partial_{x_s}\mathcal L(x, y, \omega) \times \partial_y\mathcal L(x, y, \omega) \times \partial_\omega\mathcal L(x, y, \omega).
\end{aligned}
\]
So $0 \in \partial\mathcal L(x^*, y^*, \omega^*)$.
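Remark (toy illustration; not part of the original analysis). The stationarity conditions just established, $0 \in \partial_{x_i}\mathcal L$, $0 \in \partial_y\mathcal L$ and $\mathcal A x^* + \mathcal B y^* - b = 0$, can be made concrete on a fully smooth toy instance where the limit point is available in closed form. The sketch below uses $\min \tfrac12\|x\|^2 + \tfrac12\|y\|^2$ subject to $x + y = b$ (so $\mathcal A = \mathcal B = I$); the instance and the value of $\beta$ are arbitrary choices made only for this illustration.
\begin{verbatim}
import numpy as np

beta = 1.0
b = np.array([1.0, -2.0, 3.0])
# closed-form KKT point of the toy instance: x* = y* = b/2, omega* = -b/2
x_star = y_star = b / 2.0
omega_star = -b / 2.0

residual = x_star + y_star - b                    # A x* + B y* - b
grad_x = x_star + omega_star + beta * residual    # gradient of the augmented Lagrangian in x
grad_y = y_star + omega_star + beta * residual    # gradient of the augmented Lagrangian in y
print(np.linalg.norm(grad_x), np.linalg.norm(grad_y), np.linalg.norm(residual))  # all zero
\end{verbatim}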
B.7. Proof of Theorem 2.10.

Note that we assume the generated sequence of Algorithm 1 is bounded. The following analysis is carried out on a bounded set that contains the generated sequence of Algorithm 1. We first prove some preliminary results.

(A) The optimality condition of (29) gives us
\[
\mathcal G_i^k(x_i^k - x_i^{k-1}) - \mathcal A_i^*\big(\omega^k + \beta(\mathcal A x^{k,i-1} + \mathcal B y^k - b)\big) - \kappa_i\beta(x_i^{k+1} - x_i^k) \in \partial_{x_i}\big(u_i(x_i^{k+1}, x^{k,i-1}) + g_i(x_i^{k+1})\big).
\tag{57}
\]
As (24) holds, there exist $s_i^{k+1} \in \partial u_i(x_i^{k+1}, x^{k,i-1})$ and $t_i^{k+1} \in \partial g_i(x_i^{k+1})$ such that
\[
\mathcal G_i^k(x_i^k - x_i^{k-1}) - \mathcal A_i^*\big(\omega^k + \beta(\mathcal A x^{k,i-1} + \mathcal B y^k - b)\big) - \kappa_i\beta(x_i^{k+1} - x_i^k) = s_i^{k+1} + t_i^{k+1}.
\tag{58}
\]
As (25) holds, there exists $\xi_i^{k+1} \in \partial_{x_i}f(x^{k+1})$ such that
\[
\|\xi_i^{k+1} - s_i^{k+1}\| \le L_i\|x^{k+1} - x^{k,i-1}\|.
\tag{59}
\]
Denote $\tau_i^{k+1} := \xi_i^{k+1} + t_i^{k+1} \in \partial_{x_i}F(x^{k+1})$ (as (24) holds). Then, from (58) we have
\[
\tau_i^{k+1} = \xi_i^{k+1} + \mathcal G_i^k(x_i^k - x_i^{k-1}) - \mathcal A_i^*\big(\omega^k + \beta(\mathcal A x^{k,i-1} + \mathcal B y^k - b)\big) - \kappa_i\beta(x_i^{k+1} - x_i^k) - s_i^{k+1}.
\tag{60}
\]
On the other hand, we note that
\[
\partial_{x_i}\mathcal L(x^{k+1}, y^{k+1}, \omega^{k+1}) = \partial_{x_i}F(x^{k+1}) + \mathcal A_i^*\big(\omega^{k+1} + \beta(\mathcal A x^{k+1} + \mathcal B y^{k+1} - b)\big).
\tag{61}
\]
Let $d_i^{k+1} := \tau_i^{k+1} + \mathcal A_i^*\big(\omega^{k+1} + \beta(\mathcal A x^{k+1} + \mathcal B y^{k+1} - b)\big) \in \partial_{x_i}\mathcal L(x^{k+1}, y^{k+1}, \omega^{k+1})$. From (60) we have
\[
\|d_i^{k+1}\| = \Big\|\xi_i^{k+1} + \mathcal G_i^k(x_i^k - x_i^{k-1}) - \mathcal A_i^*\big(\omega^k + \beta(\mathcal A x^{k,i-1} + \mathcal B y^k - b)\big) - \kappa_i\beta(x_i^{k+1} - x_i^k) - s_i^{k+1} + \mathcal A_i^*\big(\omega^{k+1} + \beta(\mathcal A x^{k+1} + \mathcal B y^{k+1} - b)\big)\Big\|.
\tag{62}
\]
Together with (59) we obtain
\[
\|d_i^{k+1}\| \le a_i^k\|\Delta x_i^k\| + \beta\|\mathcal A_i^*\mathcal A\|\,\|x^{k+1} - x^{k,i-1}\| + \beta\|\mathcal A_i^*\mathcal B\|\,\|\Delta y^{k+1}\| + \|\mathcal A_i^*\|\,\|\Delta\omega^{k+1}\| + \kappa_i\beta\|\Delta x_i^{k+1}\| + L_i\|x^{k+1} - x^{k,i-1}\|.
\tag{63}
\]
It follows from (11) that
\[
\mathcal B^*\omega^k + \nabla h(\hat y^k) + \beta\mathcal B^*(\mathcal A x^{k+1} + \mathcal B y^{k+1} - b) + L_h(y^{k+1} - \hat y^k) = 0.
\]
Let $d_y^{k+1} := \nabla h(y^{k+1}) + \mathcal B^*\big(\omega^{k+1} + \beta(\mathcal A x^{k+1} + \mathcal B y^{k+1} - b)\big)$. We have $d_y^{k+1} \in \partial_y\mathcal L(x^{k+1}, y^{k+1}, \omega^{k+1})$ and
\[
\begin{aligned}
\|d_y^{k+1}\| &= \|\nabla h(y^{k+1}) - \nabla h(\hat y^k) + \mathcal B^*(\omega^{k+1} - \omega^k) - L_h(y^{k+1} - \hat y^k)\|\\
&\le 2L_h\|y^{k+1} - \hat y^k\| + \|\mathcal B^*\|\,\|\Delta\omega^{k+1}\|\\
&\le 2L_h\big(\|\Delta y^{k+1}\| + \delta_k\|\Delta y^k\|\big) + \|\mathcal B^*\|\,\|\Delta\omega^{k+1}\|.
\end{aligned}
\]
Let $d_\omega^{k+1} := \mathcal A x^{k+1} + \mathcal B y^{k+1} - b$. We have $d_\omega^{k+1} \in \partial_\omega\mathcal L(x^{k+1}, y^{k+1}, \omega^{k+1})$ and
\[
d_\omega^{k+1} = (\omega^{k+1} - \omega^k)/(\alpha\beta) = \Delta\omega^{k+1}/(\alpha\beta).
\]
(B) Let us now prove that $F(x^{k_n})$ converges to $F(x^*)$. This implies that $\mathcal L(x^{k_n}, y^{k_n}, \omega^{k_n})$ converges to $\mathcal L(x^*, y^*, \omega^*)$, since $\mathcal L$ is differentiable in $y$ and $\omega$. We have
\[
F(x^{k_n}) = f(x^{k_n}) + \sum_{i=1}^{s} g_i(x_i^{k_n}) = u_s(x_s^{k_n}, x^{k_n}) + \sum_{i=1}^{s} g_i(x_i^{k_n}).
\]
So $F(x^{k_n})$ converges to $u_s(x_s^*, x^*) + \sum_{i=1}^{s} g_i(x_i^*) = F(x^*)$.

We now proceed to prove the global convergence. Denote $z = (x, y, \omega)$, $\tilde z = (\tilde x, \tilde y, \tilde\omega)$, and $z^k = (x^k, y^k, \omega^k)$. We consider the following auxiliary function
\[
\bar{\mathcal L}(z, \tilde z) = \mathcal L(x, y, \omega) + \sum_{i=1}^{s}\frac{\eta_i + C_x\eta_i}{2}\|x_i - \tilde x_i\|^2 + \frac{(1 + C_y)\mu}{2}\|y - \tilde y\|^2 + \frac{\alpha}{\beta}\|\mathcal B^*(\omega - \tilde\omega)\|^2.
\]
The auxiliary sequence $\bar{\mathcal L}(z^k, z^{k-1})$ has the following properties.
1. Sufficient decrease property. From (49) we have
\[
\bar{\mathcal L}(z^{k+1}, z^k) + \sum_{i=1}^{s}\frac{\eta_i - C_x\eta_i}{2}\big(\|x_i^{k+1} - x_i^k\|^2 + \|x_i^k - x_i^{k-1}\|^2\big) + \frac{(1 - C_y)\mu}{2}\big(\|y^{k+1} - y^k\|^2 + \|y^k - y^{k-1}\|^2\big) \le \bar{\mathcal L}(z^k, z^{k-1}).
\]
2. Boundedness of subgradient. In the proof (A) above, we have proved that
\[
\|d^{k+1}\| \le a\big(\|x^{k+1} - x^k\| + \|x^k - x^{k-1}\| + \|y^{k+1} - y^k\| + \|\omega^{k+1} - \omega^k\|\big)
\]
for some constant $a$ and $d^{k+1} \in \partial\mathcal L(z^{k+1})$. On the other hand, as we use $\alpha = 1$, from (36) we obtain
\[
\begin{aligned}
\sqrt{\sigma_B}\,\|\omega^{k+1} - \omega^k\| &\le \|\mathcal B^*(\omega^{k+1} - \omega^k)\| = \|\Delta z^{k+1}\| = \|\nabla h(y^k) - \nabla h(y^{k-1}) + L_h(\Delta y^{k+1} - \Delta y^k)\|\\
&\le 2L_h\|y^k - y^{k-1}\| + L_h\|y^{k+1} - y^k\|.
\end{aligned}
\tag{64}
\]
Hence,
\[
\|d^{k+1}\| \le a\big(\|x^{k+1} - x^k\| + \|x^k - x^{k-1}\| + \|y^{k+1} - y^k\| + \|y^k - y^{k-1}\|\big)
\]
for some constant $a$. Note that
\[
\partial\bar{\mathcal L}(z, \tilde z) = \partial\mathcal L(z) + \partial\Big(\sum_{i=1}^{s}\frac{\eta_i + C_x\eta_i}{2}\|x_i - \tilde x_i\|^2 + \frac{(1 + C_y)\mu}{2}\|y - \tilde y\|^2 + \frac{\alpha}{\beta}\|\mathcal B^*(\omega - \tilde\omega)\|^2\Big).
\]
Hence, it is not difficult to show that
\[
\|d^{k+1}\| \le a\big(\|x^{k+1} - x^k\| + \|x^k - x^{k-1}\| + \|y^{k+1} - y^k\| + \|y^k - y^{k-1}\|\big)
\]
for some constant $a$ and $d^{k+1} \in \partial\bar{\mathcal L}(z^{k+1}, z^k)$.

3. KŁ property. Since $F(x) + h(y)$ has the KŁ property, $\bar{\mathcal L}(z, \tilde z)$ also has the KŁ property.

4. A continuity condition. Suppose $z^{k_n}$ converges to $(x^*, y^*, \omega^*)$. In the proof (B) above, we have proved that $\mathcal L(z^{k_n})$ converges to $\mathcal L(x^*, y^*, \omega^*)$. Furthermore, in Proposition 2.6 we proved that $\|z^{k+1} - z^k\|$ goes to 0. Hence $z^{k_n-1}$ also converges to $(x^*, y^*, \omega^*)$. So $\bar{\mathcal L}(z^{k_n}, z^{k_n-1})$ converges to $\bar{\mathcal L}(z^*, z^*)$.

Using the same technique as in [6, Theorem 1], see also [17, 32], we can prove that
\[
\sum_{k=1}^{\infty}\big(\|x^{k+1} - x^k\| + \|x^k - x^{k-1}\| + \|y^{k+1} - y^k\| + \|y^k - y^{k-1}\|\big) < \infty,
\]
which implies that $\{(x^k, y^k)\}$ converges to $(x^*, y^*)$. From (64) we obtain
\[
\sum_{k=1}^{\infty}\|\omega^{k+1} - \omega^k\| \le \frac{2L_h}{\sqrt{\sigma_B}}\sum_{k=1}^{\infty}\big(\|y^{k+1} - y^k\| + \|y^k - y^{k-1}\|\big) < \infty.
\]
Hence, $\{\omega^k\}$ also converges to $\omega^*$.
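Remark (toy illustration; not part of the original analysis). The qualitative conclusions above, namely that $\|\Delta y^k\|$ and $\|\Delta\omega^k\|$ vanish and that the iterates converge, can be observed on a small strongly convex instance. The sketch below runs a plain ADMM-type iteration with $\alpha = 1$, exact block minimization and no inertial terms on $\min \tfrac12\|x\|^2 + \tfrac12\|y\|^2$ subject to $x + y = b$; the problem, the parameters and the closed-form block updates are illustrative choices and are not the setting or the algorithm of the paper.
\begin{verbatim}
import numpy as np

beta, alpha = 1.0, 1.0
b = np.array([1.0, -2.0, 3.0])
x = y = omega = np.zeros_like(b)

for k in range(31):
    x = (beta * (b - y) - omega) / (1.0 + beta)       # exact minimization in x
    y_new = (beta * (b - x) - omega) / (1.0 + beta)   # exact minimization in y
    omega_new = omega + alpha * beta * (x + y_new - b)
    if k % 10 == 0:
        print(k, np.linalg.norm(y_new - y), np.linalg.norm(omega_new - omega))
    y, omega = y_new, omega_new
\end{verbatim}
The printed values of $\|y^{k+1} - y^k\|$ and $\|\omega^{k+1} - \omega^k\|$ decrease towards zero along the iterations, in line with Proposition 2.6 and Theorem 2.10.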