Block Coordinate Descent Only Converges to Minimizers
Enbin Song∗, Zhubin Shen∗ and Qingjiang Shi†
∗Department of Mathematics, Sichuan University, Chengdu, China
†College of EIE, Nanjing University of Aeronautics and Astronautics, Nanjing, China
Abstract
Given a non-convex twice continuously differentiable cost function with Lipschitz continuous gradient, we prove that block coordinate gradient descent, block mirror descent and proximal block coordinate descent all converge to a local minimizer, almost surely with random initialization. Furthermore, we show that these results hold true even for cost functions with non-isolated critical points.
Keywords:
Block coordinate gradient descent, block mirror descent, proximal block coordinate descent, saddle points, local minimum, non-convex
1. Introduction
A main source of difficulty in non-convex optimization over continuous spaces is the proliferation of saddle points. Indeed, it is easy to construct instances where gradient descent with a bad initialization converges to an unfavorable saddle point (Nesterov, 2004, Section 1.2.3). Although such worst-case instances exist in theory, many simple algorithms, including first-order algorithms and their variants, perform extremely well in terms of the quality of the solutions they produce in continuous optimization.
Related work
Recently, a milestone result on gradient descent was established by Lee et al. (2016): the gradient descent method converges to a local minimizer, almost surely with random initialization, by resorting to tools from the topology of dynamical systems. Lee et al. (2016) assume that the cost function satisfies the strict saddle property; equivalently, each critical point x of f (i.e., each x with ∇f(x) = 0) is either a local minimizer or a "strict saddle", i.e., ∇²f(x) has at least one strictly negative eigenvalue. They demonstrated that if f : Rⁿ → R is a twice continuously differentiable function whose gradient is Lipschitz continuous with constant L, then gradient descent with a sufficiently small constant step size α (i.e., x_{k+1} = x_k − α∇f(x_k) with 0 < α < 1/L) and a random initialization converges to a local minimizer or to negative infinity almost surely.

There is a follow-up work by Panageas and Piliouras (2016), which first proved that the results in Lee et al. (2016) hold true even for cost functions f with non-isolated critical points. One key tool they used is the fact that in Rⁿ every open cover has a countable subcover. Moreover, Panageas and Piliouras (2016) showed that the global Lipschitz assumption can be circumvented as long as the domain is convex and forward invariant with respect to gradient descent. In addition, they also provided an upper bound on the allowable step size (such that those results hold true).

There are some prior works showing that first-order descent methods can indeed escape strict saddle points with the assistance of nearly isotropic noise. Specifically, Pemantle (1990) established convergence of the Robbins-Monro stochastic approximation to local minimizers for strict saddle functions, and Kleinberg et al. (2009) demonstrated that perturbed versions of the multiplicative weights algorithm converge to local minima in generic potential games. In particular, Ge et al. (2015) quantified the convergence rate of noise-added stochastic gradient descent to local minima. Note that the aforementioned methods require the assistance of isotropic noise, which can significantly slow down convergence when the problem dimension is large. In contrast, our setting is deterministic and corresponds to simple implementations of block coordinate gradient descent, block mirror descent and proximal block coordinate descent.

Our contribution
In this paper, under the assumption that f is a twice continuously differentiable function with a Lipschitz continuous gradient over Rⁿ, we prove that all of the following block coordinate descent methods with constant step size,

(1) block coordinate gradient descent;
(2) block mirror descent; and
(3) proximal block coordinate descent,

converge to a local minimizer, almost surely with random initialization. This result even holds for cost functions with non-isolated critical points.

In other words, we not only affirmatively answer the open questions raised in Lee et al. (2016) of whether mirror descent and block coordinate descent avoid saddle points, but also show that the same results hold true for block mirror descent (which includes mirror descent as a special case) and proximal block coordinate descent. In addition, these results also hold true even for cost functions with non-isolated critical points, which generalizes the results in Panageas and Piliouras (2016) as well.

Outline of the proof
Recall the main ideas in Lee et al. (2016). Suppose that g : Rⁿ → Rⁿ is the iterative mapping of an optimization method and that the fixed points of g are exactly the critical points of f. Then the following two key properties of g,

(i) g is a diffeomorphism;
(ii) if x* is a fixed point of g, then at least one eigenvalue of the Jacobian Dg(x*) has magnitude strictly greater than one,

imply that the results in Lee et al. (2016) hold true.

Although the basic idea of the proof presented in this paper comes from Lee et al. (2016), answering the underlying questions for block coordinate type algorithms is not easy, and the existing analysis methods are not applicable. In particular, the eigenvalue analysis of the Jacobian of the iterative mapping needs a nontrivial argument. Compared with Property (ii), Property (i) is easier to verify: we decompose the entire iterative mapping into multiple one-block updates and then, using the composition rule for diffeomorphisms, we prove that the entire update is a diffeomorphism. In contrast, Property (ii) is very challenging because the Jacobian Dg(x*) of the block-updating scheme at a saddle point is a non-symmetric matrix and a complicated polynomial function of the original ∇²f(x*) of degree p (the number of blocks of the decision variables). We overcome these difficulties in two steps. The first step is to transform the original Jacobian Dg_{αf}(x*) into a simpler form which can be handled easily. Based on this simple form of Dg_{αf}(x*), the second step is to prove that Dg_{αf}(x*) has at least one eigenvalue with magnitude strictly greater than one by resorting to Lemma 9.2 in the appendices, which follows essentially from Rouché's Theorem in complex analysis.

Notations and organization
Notations. We denote a complex number z as z = a + bi, where a and b are real numbers and i is the imaginary unit with i² = −1. We also denote a = Re(z) and b = Im(z) as the real and imaginary parts of z, respectively. For a matrix X, we denote eig(X) as the set of eigenvalues of X, X^T as the transpose of X, X^H as the conjugate (Hermitian) transpose of X, ρ(X) as the spectral radius of X (i.e., the maximum modulus of the eigenvalues of X), and ‖X‖ as the spectral norm of X. When X is a real symmetric matrix, let λ_max(X) and λ_min(X) denote the maximum and minimum eigenvalues of X, respectively. Moreover, for two real symmetric matrices X₁ and X₂, X₁ ≻ X₂ (resp. X₁ ⪰ X₂) means X₁ − X₂ is positive definite (resp. positive semi-definite). We use I_n to denote the identity matrix of dimension n, and we simply write I when the dimension is clear from the context. For square matrices X_s ∈ R^{n_s×n_s}, s = 1, 2, . . . , p, we denote Diag(X₁, X₂, . . . , X_p) as the block-diagonal matrix with X_s being the s-th diagonal block. For square matrices X_s ∈ R^{n×n}, s = 1, . . . , p, and t, k ∈ {1, 2, . . . , p}, we use ∏_{s=t}^{k} X_s to denote the continued product X_t · X_{t+1} ⋯ X_{k−1} · X_k if t ≤ k and X_t · X_{t−1} ⋯ X_{k+1} · X_k if t > k, respectively. P_ν denotes the probability with respect to a prior measure ν, which is assumed to be absolutely continuous with respect to the Lebesgue measure.

Organization.
In Section 2, we introduce the basic setting and definitions used throughout the paper. Section 3 provides the main results for the block coordinate gradient descent method. The main results for block mirror descent and proximal block coordinate descent are given in Section 4 and Section 5, respectively. Section 6 provides several lemmas. Finally, we conclude this paper in Section 7. The detailed proofs of some lemmas and propositions are presented in Section 9.
2. Preliminaries
We consider the optimization model

min { f(x) : x ∈ Rⁿ }, (2.1)

where we make the following blanket assumption:

Assumption 2.1 f is a twice continuously differentiable function whose gradient is Lipschitz continuous over Rⁿ, i.e., there exists a parameter L > 0 such that

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for every x, y ∈ Rⁿ. (2.2)

Throughout this paper, in order to introduce the block coordinate descent methods, we assume the vector of decision variables x has the following partition:

x = (x(1)^T, x(2)^T, . . . , x(p)^T)^T, (2.3)

where x(s) ∈ R^{n_s} and n₁, n₂, . . . , n_p are p positive integers satisfying ∑_{s=1}^{p} n_s = n. Moreover, we use the notations in Nesterov (2012) and define the matrices U_s ∈ R^{n×n_s}, the s-th block-column of I_n, s = 1, . . . , p, such that

(U₁, U₂, . . . , U_p) = I_n. (2.4)

Clearly, according to our notations, we have x(s) = U_s^T x for every x ∈ Rⁿ, s = 1, . . . , p. Consequently, x = ∑_{s=1}^{p} U_s x(s), and the derivative with respect to the variables in the block x(s) can be expressed as

∇_s f(x) ≡ U_s^T ∇f(x), s = 1, . . . , p.
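To make the block notation concrete, the following is a minimal NumPy sketch (an illustration of ours; the block sizes, the toy quadratic f and all names such as `U`, `grad_f` and `block_sizes` are assumed for the example, not taken from the paper). It shows that x(s) = U_s^T x and ∇_s f(x) = U_s^T ∇f(x) amount to simple slicing.

```python
import numpy as np

# Block sizes n_1, ..., n_p with n = sum(n_s); chosen arbitrarily for illustration.
block_sizes = [2, 3, 1]
offsets = np.cumsum([0] + block_sizes)          # block boundaries inside x
n = int(offsets[-1])

def U(s):
    """U_s: the s-th block-column of I_n (an n x n_s matrix), as in (2.4)."""
    return np.eye(n)[:, offsets[s]:offsets[s + 1]]

rng = np.random.default_rng(0)
Q = rng.standard_normal((n, n)); Q = (Q + Q.T) / 2   # symmetric toy Hessian

def grad_f(x):
    """Gradient of the toy quadratic f(x) = 0.5 x^T Q x (our assumption)."""
    return Q @ x

x = rng.standard_normal(n)
s = 1
x_s = U(s).T @ x                 # x(s) = U_s^T x
g_s = U(s).T @ grad_f(x)         # grad_s f(x) = U_s^T grad f(x)
# Equivalent slicing, with no explicit U_s needed:
assert np.allclose(x_s, x[offsets[s]:offsets[s + 1]])
assert np.allclose(g_s, grad_f(x)[offsets[s]:offsets[s + 1]])
```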
Below we give some necessary definitions as they appear in Lee et al. (2016) and Panageas and Piliouras (2016).

Definition 2.1
1. A point x* is a critical point of f if ∇f(x*) = 0. We denote C = {x : ∇f(x) = 0} as the set of critical points (which can be uncountably many).
2. A critical point x* is isolated if there is a neighborhood U around x* such that x* is the only critical point in U; otherwise it is called non-isolated.
3. A critical point x* is a local minimum if there is a neighborhood U around x* such that f(x*) ≤ f(x) for all x ∈ U, and a local maximum if f(x*) ≥ f(x) for all x ∈ U.
4. A critical point x* is a saddle point if for all neighborhoods U around x*, there are x, y ∈ U such that f(x) ≤ f(x*) ≤ f(y).

Definition 2.2 (Strict Saddle)
A critical point x* of f is a strict saddle if λ_min(∇²f(x*)) < 0.
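For instance (an illustrative example of ours, not from the paper): the function f(x₁, x₂) = x₁² − x₂² has the origin as its only critical point, with ∇²f(0) = Diag(2, −2). Since λ_min(∇²f(0)) = −2 < 0, the origin is a strict saddle; it is also a saddle point in the sense of Definition 2.1, because f(0, x₂) ≤ f(0) ≤ f(x₁, 0) for all x₁, x₂.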
Definition 2.3 (Global Stable Set)

The global stable set W^s(x*) of a critical point x* is the set of initial conditions of an iterative mapping g of an optimization method that converge to x*:

W^s(x*) = { x : lim_{k→∞} g^k(x) = x* }.
3. The BCGD method
In this section, we will prove that the BCGD method does not converge to saddle points under an appropriate choice of step size, almost surely with random initialization. This lays the ground for the analysis of the BMD and PBCD methods. (Note that if the critical points are isolated, then they are countably many or finite.)

3.1. The BCGD method description

For ease of later reference and also for the sake of clarity, based on the notations in Section 2, we present a detailed description of the BCGD method (Beck and Tetruashvili, 2013) for problem (2.1) below.
Method 3.1 (BCGD)
Input: α < 1/L. Initialization: x₀ ∈ Rⁿ.
General Step (k = 0, 1, . . .): Set x_k^0 = x_k and define recursively

x_k^s = x_k^{s−1} − αU_s∇_s f(x_k^{s−1}), s = 1, . . . , p.

Set x_{k+1} = x_k^p.
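Method 3.1 translates directly into code. Below is a minimal NumPy sketch of ours; the toy saddle quadratic, the block slices and the step size 0.9/L are assumptions for the example, and no stopping test is included.

```python
import numpy as np

def bcgd(grad_f, x0, block_slices, alpha, iters=50):
    """Method 3.1 (BCGD): cyclically update one block at a time.

    grad_f       -- callable returning the full gradient of f
    block_slices -- list of slices, one per block x(1), ..., x(p)
    alpha        -- constant step size, 0 < alpha < 1/L
    """
    x = x0.copy()
    for _ in range(iters):
        for blk in block_slices:
            # x_k^s = x_k^{s-1} - alpha * U_s grad_s f(x_k^{s-1}):
            # only block s changes, always using the freshest iterate.
            x[blk] -= alpha * grad_f(x)[blk]
    return x

# Toy quadratic f(x) = 0.5 x^T Q x with a strict saddle at the origin (our example).
Q = np.diag([1.0, 2.0, -0.5])              # one negative eigenvalue
L = np.max(np.abs(np.linalg.eigvalsh(Q)))  # gradient Lipschitz constant
rng = np.random.default_rng(1)
x = bcgd(lambda z: Q @ z, rng.standard_normal(3),
         [slice(0, 2), slice(2, 3)], alpha=0.9 / L)
# With random initialization, ||x|| grows along the negative-curvature
# direction, so the iterates escape the saddle at 0.
print(np.linalg.norm(x))
```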
In what follows, for a given step size α > 0, we use g_{αf}^s to denote the corresponding gradient mapping with respect to x(s), i.e.,

g_{αf}^s(x) ≜ x − αU_s∇_s f(x), s = 1, . . . , p. (3.5)

It is clear that, given x_k, the above BCGD method generates x_{k+1} in the following manner,

x_{k+1} = g_{αf}(x_k), (3.6)

where the composite mapping

g_{αf}(x) ≜ g_{αf}^p ∘ g_{αf}^{p−1} ∘ ⋯ ∘ g_{αf}^2 ∘ g_{αf}^1(x). (3.7)

By simple computation, the Jacobian of g_{αf}^s is given by

Dg_{αf}^s(x) = I_n − α∇²f(x)U_sU_s^T, (3.8)

where ∇²f(x) = (∂²f(x)/∂x(s)∂x(t))_{1≤s,t≤p}. By using the chain rule, we obtain the following Jacobian of the mapping g_{αf}, i.e.,

Dg_{αf}(x) = Dg_{αf}^1(y₁) × Dg_{αf}^2(y₂) × ⋯ × Dg_{αf}^{p−1}(y_{p−1}) × Dg_{αf}^p(y_p), (3.9)

where y₁ = x and y_s = g_{αf}^{s−1}(y_{s−1}), s = 2, . . . , p.

Given the above basic notations, as mentioned in Section 1, it is sufficient for us to prove that the iterative mapping g_{αf} admits the following two key properties:

(i) g_{αf} is a diffeomorphism;
(ii) if x* is a fixed point of g_{αf}, then at least one eigenvalue of the Jacobian Dg_{αf}(x*) has magnitude strictly greater than one.

In what follows, we will mainly be concerned with proving the above two properties of g_{αf}. Specifically, Subsections 3.2 and 3.3 present their proofs, respectively.

3.2. The iterative mapping g_{αf} of the BCGD method is a diffeomorphism

In this subsection, we first present the following Lemma 3.1, which shows that g_{αf}^s, s = 1, . . . , p, are diffeomorphisms. Based on this lemma, we then further prove that g_{αf} is a diffeomorphism as well.

Lemma 3.1
Under Assumption 2.1, if the step size α < 1/L, then the iterative mappings g_{αf}^s defined by (3.5), s = 1, . . . , p, are diffeomorphisms.

Proof. The proof of Lemma 3.1 is lengthy and has been relegated to the Appendix.

Given the above lemma, we immediately obtain the following result.
Proposition 3.1
Under Assumption 2.1, the mapping g_{αf} defined by (3.7) with step size α < 1/L is a diffeomorphism.

Proof. It follows from Proposition 2.15 in Lee (2013) that the composition of two diffeomorphisms is also a diffeomorphism. Using this fact and the previous Lemma 3.1, the proof is completed.
Remark 3.1
Note that, in order to guarantee the invertibility of Dg_{αf}(x), Eq. (9.145) in the Appendix must hold, which clearly implies that α < 1/L is necessary.

3.3. Eigenvalue analysis of the Jacobian of g_{αf} at a strict saddle point

In this subsection, we will analyze the eigenvalues of the Jacobian of g_{αf} at a strict saddle point and show that it has at least one eigenvalue with magnitude strictly greater than one, which is a crucial part of our entire proof. The proof mainly involves two steps. More specifically, the first step is to transform the original Jacobian Dg_{αf}(x*) into a simpler form which can be dealt with more easily. Based on this simple form of Dg_{αf}(x*), the second step is to prove that Dg_{αf}(x*) has at least one eigenvalue with magnitude strictly greater than one by resorting to Lemma 9.2 in the Appendix, which follows essentially from Rouché's Theorem in complex analysis.

In what follows, we assume that x* ∈ Rⁿ is a strict saddle point. Hence, g_{αf}^s(x*) = x*, s = 1, 2, . . . , p. By the chain rule (3.9), we have

Dg_{αf}(x*) = Dg_{αf}^1(x*) × Dg_{αf}^2(x*) × ⋯ × Dg_{αf}^{p−1}(x*) × Dg_{αf}^p(x*). (3.10)

Since eig(Dg_{αf}(x*)) = eig((Dg_{αf}(x*))^T), it is sufficient for us to analyze the eigenvalues of the matrix (Dg_{αf}(x*))^T. In addition, Eq. (3.10) leads to

(Dg_{αf}(x*))^T = (Dg_{αf}^p(x*))^T × (Dg_{αf}^{p−1}(x*))^T × ⋯ × (Dg_{αf}^2(x*))^T × (Dg_{αf}^1(x*))^T
= (I_n − αU_pU_p^T∇²f(x*)) × (I_n − αU_{p−1}U_{p−1}^T∇²f(x*)) × ⋯ × (I_n − αU_2U_2^T∇²f(x*)) × (I_n − αU_1U_1^T∇²f(x*)), (3.11)

where the second equality is due to (3.8). For the sake of simplicity, we denote the matrix ∇²f(x*) as the matrix A ∈ R^{n×n} with p × p blocks. Specifically,

A ≜ (A_{st})_{1≤s,t≤p}, (3.12)

and its (s, t)-th block is defined as

A_{st} ≜ ∂²f(x*)/∂x*(s)∂x*(t), 1 ≤ s, t ≤ p. (3.13)

Furthermore, we denote the s-th block-row of A as

A_s ≜ (A_{st})_{1≤t≤p}, s = 1, . . . , p. (3.14)

Based on the above notations, we have

U_s^T∇²f(x*) = A_s, 1 ≤ s ≤ p, (3.15)

and

A_sU_t = A_{st}, 1 ≤ s, t ≤ p. (3.16)

Combining (3.11) and (3.15), we have

(Dg_{αf}(x*))^T = (I_n − αU_pA_p) × (I_n − αU_{p−1}A_{p−1}) × ⋯ × (I_n − αU_2A_2) × (I_n − αU_1A_1). (3.17)

In order to analyze the eigenvalues of the above matrix, we further define the matrix

G ≜ (1/α)[I_n − (Dg_{αf}(x*))^T], (3.18)

or equivalently,

(Dg_{αf}(x*))^T = I_n − αG. (3.19)

The above relation (3.19) clearly means that

λ ∈ eig(G) ⇔ 1 − αλ ∈ eig((Dg_{αf}(x*))^T). (3.20)

In the rest of this subsection, our focus is on analyzing the eigenvalues of G. We first introduce the following Lemma 3.2 and Lemma 3.3, which assert a particular relation between G and A.

Lemma 3.2
Assume that x* ∈ Rⁿ is a strict saddle point, G is given by (3.18), and U_s and A_s are defined by (2.4) and (3.14), respectively, s = 1, . . . , p. Then,

U_s^TG = A_s − α∑_{t=1}^{s−1} A_{st}U_t^TG, s = 1, . . . , p. (3.21)

Proof. The proof of Lemma 3.2 is lengthy and has been relegated to the Appendix.

In order to give a clear and simple expression of G, we further define the strictly block lower triangular matrix based on A below:

Ǎ ≜ (Ǎ_{st})_{1≤s,t≤p} (3.22)

with p × p blocks, and its (s, t)-th block is given by

Ǎ_{st} = A_{st} if s > t, and Ǎ_{st} = 0 if s ≤ t. (3.23)

Similarly, we denote the s-th block-row of Ǎ as

Ǎ_s ≜ (Ǎ_{st})_{1≤t≤p}, s = 1, . . . , p. (3.24)

The following lemma plays an important role in this subsection because it gives a simple expression of G in terms of A and Ǎ, which allows us to analyze the eigenvalues of G more easily.

Lemma 3.3
Let x* ∈ Rⁿ be a strict saddle point. Assume that G, A and Ǎ are defined by (3.18), (3.12) and (3.22), respectively. Then,

G = (I_n + αǍ)^{−1}A. (3.25)

Proof. We assume that G has the partition into block-rows G = (G₁; G₂; . . . ; G_p), where G_s ∈ R^{n_s×n} is the s-th block-row of G, s = 1, 2, . . . , p. Consequently, we have

G_s = U_s^TG = A_s − α∑_{t=1}^{s−1} A_{st}U_t^TG = A_s − α∑_{t=1}^{s−1} A_{st}G_t = A_s − αǍ_sG, s = 1, 2, . . . , p,

where in the second equality we use Lemma 3.2 and the last equality is due to the definition of Ǎ_s in (3.24). Since the above equality holds for every s, it immediately follows that

G = A − αǍG,

which is equivalent to

(I_n + αǍ)G = A.

I_n + αǍ is an invertible matrix because Ǎ is a strictly block lower triangular matrix. Premultiplying both sides of the above equality by (I_n + αǍ)^{−1}, we arrive at (3.25). Therefore, the lemma is proved.

The following proposition plays a central role in this subsection because it provides a sufficiently exact description of the distribution of the eigenvalues of G. More importantly, it leads immediately to the subsequent Proposition 3.3, which asserts that there is at least one eigenvalue of the Jacobian of the iterative mapping g_{αf} at a strict saddle point whose magnitude is strictly greater than one.

Proposition 3.2
Assume that x* ∈ Rⁿ is a strict saddle point, and G is defined by (3.18) with α ∈ (0, 1/L), where L is determined by (2.2). Then there exists at least one eigenvalue of G which lies in the closed left half complex plane excluding the origin, i.e.,

∀ α ∈ (0, 1/L) ⇒ ∃ λ ∈ [eig(G) ∩ Ω], (3.26)

where

Ω ≜ { a + bi | a, b ∈ R, a ≤ 0, (a, b) ≠ (0, 0), i = √−1 }. (3.27)

Proof. It follows from Lemma 3.3 that G has the following expression:

G = (I_n + αǍ)^{−1}A, (3.28)

where A and Ǎ are defined by (3.12) and (3.22), respectively. Since A = ∇²f(x*) and x* is a strict saddle point, A has at least one negative eigenvalue. Furthermore, by applying Lemma 6.4 in Section 6 with the identifications A ∼ B, Ǎ ∼ B̌, α ∼ β and ρ(A) ∼ ρ(B), we have

∀ β ∈ (0, 1/ρ(A)) ⇒ ∃ λ ∈ [eig(G) ∩ Ω], (3.29)

where Ω is defined by (3.27). In addition, Lemma 7 in Panageas and Piliouras (2016) implies that the gradient Lipschitz constant satisfies L ≥ ρ(A) = ρ(∇²f(x*)), which amounts to α ∈ (0, 1/L) ⊆ (0, 1/ρ(A)). Hence, (3.29) leads immediately to (3.26).

Based on the above Proposition 3.2, we immediately obtain the key proposition of this subsection.

Proposition 3.3
Assume Assumption 2.1 holds. Suppose that the BCGD iterative mapping g_{αf} is defined by (3.7) with α ∈ (0, 1/L), where L is determined by (2.2), and that x* ∈ Rⁿ is a strict saddle point. Then Dg_{αf}(x*) has at least one eigenvalue whose magnitude is strictly greater than one.

Proof. Recalling Eq. (3.20), we have

λ ∈ eig(G) ⇔ 1 − αλ ∈ eig((Dg_{αf}(x*))^T), (3.30)

where G = (1/α)[I_n − (Dg_{αf}(x*))^T] is defined by (3.18). Combined with (3.30), Proposition 3.2 implies that there exists at least one eigenvalue of (Dg_{αf}(x*))^T which can be expressed as

1 − α(a + bi), (3.31)

where a + bi belongs to Ω defined by (6.130), or equivalently,

a ≤ 0, (a, b) ≠ (0, 0). (3.32)

Consequently, its magnitude is

|1 − α(a + bi)| = √(1 − 2αa + α²a² + α²b²) ≥ √(1 + α²(a² + b²)) > 1,

where the first inequality is due to a ≤ 0 and α > 0, and the second inequality is thanks to (a, b) ≠ (0, 0) and α > 0. Finally, note that the eigenvalues of Dg_{αf}(x*) are the same as those of (Dg_{αf}(x*))^T. Thus, the proof is finished.
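Proposition 3.3 (and Lemma 3.3 along the way) can be checked numerically. The following sketch of ours, with random data and assumed block sizes, forms (Dg_{αf}(x*))^T via the product (3.17), verifies the closed form (3.25), and confirms an eigenvalue of magnitude greater than one.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [2, 2, 3]                        # block sizes n_1, n_2, n_3 (arbitrary)
n = sum(sizes)
off = np.cumsum([0] + sizes)

A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # stands in for the Hessian at x*
assert np.linalg.eigvalsh(A)[0] < 0                  # strict saddle: lambda_min(A) < 0
alpha = 0.9 / np.max(np.abs(np.linalg.eigvalsh(A)))  # alpha < 1/L with L = rho(A)

# (Dg_{alpha f}(x*))^T = (I - alpha U_p A_p) ... (I - alpha U_1 A_1), per (3.17).
J_T = np.eye(n)
for s in range(len(sizes)):
    U_s = np.eye(n)[:, off[s]:off[s + 1]]
    J_T = (np.eye(n) - alpha * U_s @ U_s.T @ A) @ J_T

# G from (3.18) against its closed form (3.25), with A_check the strictly
# block lower triangular part of A.
G = (np.eye(n) - J_T) / alpha
A_check = np.zeros_like(A)
for s in range(len(sizes)):
    for t in range(s):
        A_check[off[s]:off[s + 1], off[t]:off[t + 1]] = A[off[s]:off[s + 1], off[t]:off[t + 1]]
assert np.allclose(G, np.linalg.solve(np.eye(n) + alpha * A_check, A))

# Proposition 3.3: the Jacobian has an eigenvalue of magnitude > 1.
print(np.max(np.abs(np.linalg.eigvals(J_T))))        # strictly greater than one
```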
3.4. Main results of BCGD

We first introduce the following proposition, which asserts that the limit point of the sequence generated by the BCGD method 3.1 is a critical point of f.

Proposition 3.4
Under Assumption 2.1, suppose {x_k}_{k≥0} is generated by the BCGD method 3.1 with 0 < α < 1/L and lim_{k→∞} x_k exists; denote it as x*. Then x* is a critical point of f, i.e., ∇f(x*) = 0.

Proof. Since {x_k}_{k≥0} is generated by the BCGD method 3.1, x_k = g_{αf}^k(x₀), where g_{αf} is defined by (3.7) and g_{αf}^k denotes the composition of g_{αf} with itself k times. Hence, lim_{k→∞} x_k = lim_{k→∞} g_{αf}^k(x₀) = x*. Since g_{αf} is a diffeomorphism, we immediately know that x* is a fixed point of g_{αf}. It follows easily from the definition (3.7) of g_{αf} that ∇f(x*) = 0. Thus the proof is finished.

Armed with the results established in the previous subsections and the above Proposition 3.4, we now state and prove our main theorems for BCGD. Specifically, similar to the proof of Theorem 4 in Lee et al. (2016), the center-stable manifold theorem in Smale (1967); Shub (1987); Hirsch et al. (1977) is a primary tool because it gives a local characterization of the stable set. Hence, we first restate it as follows.

Theorem 3.1 (Shub, 1987, Theorem III.7) Let 0 be a fixed point for the C^r local diffeomorphism φ : U → E, where U is a neighborhood of zero in the Banach space E and ∞ > r ≥ 1. Let E^{sc} ⊕ E^u be the invariant splitting of E into the generalized eigenspaces of Dφ(0) corresponding to eigenvalues of absolute value less than or equal to one, and greater than one. Then there is a local φ-invariant C^r embedded disc W^{sc}_{loc} tangent to E^{sc} at 0 and a ball B around zero in an adapted norm such that φ(W^{sc}_{loc}) ∩ B ⊂ W^{sc}_{loc}, and ∩_{k=0}^{∞} φ^{−k}(B) ⊂ W^{sc}_{loc}.

The following theorem is similar to Theorem 4 in Lee et al. (2016).
Theorem 3.2
Let f be a C² function and x* be a strict saddle. If {x_k} is generated by the BCGD method 3.1 with 0 < α < 1/L, then

P_ν[lim_{k→∞} x_k = x*] = 0.

Proof. Proposition 3.4 implies that if lim_{k→∞} x_k exists, then it must be a critical point. Hence, we consider calculating the Lebesgue measure (or probability with respect to the prior measure ν) of the set [lim_{k→∞} x_k = x*] = W^s(x*) (see Definition 2.3). Furthermore, since Proposition 3.1 means the BCGD iterative mapping g_{αf} is a diffeomorphism, we replace φ and the fixed point 0 by g_{αf} and the strict saddle point x* in the above Stable Manifold Theorem 3.1, respectively. Then the manifold W^{sc}_{loc}(x*) has strictly positive codimension because of Proposition 3.3 and x* being a strict saddle point. Hence, W^{sc}_{loc}(x*) has measure zero. In what follows, we are able to apply the same arguments as in Lee et al. (2016) to finish the proof of the theorem. Since the proof follows a similar pattern, it is omitted.

Given the above Theorem 3.2, we immediately obtain the following Theorem 3.3 and its corollary by the same arguments as in the proofs of Theorem 2 and Corollary 12 in Panageas and Piliouras (2016). Therefore, we omit their proofs.

Theorem 3.3 (Non-isolated)
Let f : Rⁿ → R be a twice continuously differentiable function with sup_{x∈Rⁿ} ‖∇²f(x)‖ ≤ L < ∞. The set of initial points x₀ ∈ Rⁿ, for each of which the BCGD method 3.1 with step size 0 < α < 1/L converges to a strict saddle point, is of (Lebesgue) measure zero, without the assumption that the critical points are isolated.

A straightforward corollary of Theorem 3.3 is given below:
Corollary 3.1
Assume that the conditions of Theorem 3.3 are satisfied and all saddle points of f are strict. Additionally, assume lim_{k→∞} g_{αf}^k(x) exists for all x in Rⁿ. Then

P_ν[lim_{k→∞} g_{αf}^k(x) = x*] = 1,

where g_{αf} is defined by (3.7) and x* is a local minimum.
4. The BMD method
In this section, we will extend the above results to the BMD method in Beck and Teboulle (2003) and Juditsky and Nemirovski (2014). In other words, the BMD method, based on the Bregman divergence, converges to minimizers as well, almost surely with random initialization.
4.1. The BMD method description
For clarity of notation, recall that the vector of decision variables x has been assumed to have the following partition (see (2.3)):

x = (x(1)^T, x(2)^T, . . . , x(p)^T)^T, (4.33)

where x(t) ∈ R^{n_t}, and n₁, n₂, . . . , n_p are p positive integers satisfying ∑_{t=1}^{p} n_t = n. Correspondingly, we assume the variables x_k^s have the following partition as well:

x_k^s ≜ (x_k^s(1)^T, x_k^s(2)^T, . . . , x_k^s(p)^T)^T, s = 1, . . . , p; k = 0, 1, . . . , (4.34)

where x_k^s(t) ∈ R^{n_t}, t, s = 1, . . . , p; k = 0, 1, . . . .

In order to introduce the BMD method, we assume there are p strictly convex and continuously differentiable functions ϕ_t : R^{n_t} → R, t = 1, 2, . . . , p. Furthermore, we make the following assumption.

Assumption 4.1 ϕ_t is a strongly convex and twice continuously differentiable function with parameter µ_t > 0, i.e., for any y(t) and x(t) ∈ R^{n_t},

ϕ_t(y(t)) ≥ ϕ_t(x(t)) + ⟨∇ϕ_t(x(t)), y(t) − x(t)⟩ + (µ_t/2)‖y(t) − x(t)‖², t = 1, 2, . . . , p. (4.35)

The Bregman divergences of the above strongly convex functions, B_{ϕ_t} : R^{n_t} × R^{n_t} → R₊, are defined as

B_{ϕ_t}(x(t), y(t)) ≜ ϕ_t(x(t)) − ϕ_t(y(t)) − ⟨x(t) − y(t), ∇ϕ_t(y(t))⟩, t = 1, 2, . . . , p. (4.36)

The Bregman divergence was initially studied by Bregman (1967) and later by many others (see Auslender and Teboulle (2006); Bauschke et al. (2006); Teboulle (1997) and references therein).

Remark 4.1
We should mention that the Bregman divergence is generally defined for a continuously differentiable and strongly convex function which is not necessarily twice differentiable. Here, the twice differentiability of ϕ_t seems a necessary and reasonable condition for our subsequent analysis, because the existence of the Jacobian of the iterative mapping depends directly on ∇²ϕ_t (see (4.47)).

Let

µ = min{µ₁, µ₂, . . . , µ_p}. (4.37)

Given the above notations, the detailed description of the block mirror descent method for problem (2.1) is given below.

Method 4.1 (BMD)
Input: α < µ/L. Initialization: x₀ ∈ Rⁿ.
General Step (k = 0, 1, . . .): Set x_k^0 = x_k and define recursively for s = 1, 2, . . . , p and t = 1, 2, . . . , p:
If t = s,

x_k^s(t) = argmin_{x(t)} ⟨x(t), ∇_t f(x_k^{s−1})⟩ + (1/α)B_{ϕ_t}(x(t), x_k^{s−1}(t)). (4.38)

Else

x_k^s(t) = x_k^{s−1}(t). (4.39)

End
Set x_{k+1} = x_k^p.

Note that ϕ_s is a strongly convex function. Then it is easily seen from (4.36) that B_{ϕ_s}(x(s), y(s)) is a strongly convex function with respect to x(s) if y(s) is fixed. Hence, let x_k^s(s) be the unique solution of problem (4.38). The KKT condition implies that

0 = ∇_s f(x_k^{s−1}) + (1/α)(∇ϕ_s(x_k^s(s)) − ∇ϕ_s(x_k^{s−1}(s))), (4.40)

which is equivalent to

∇ϕ_s(x_k^s(s)) = ∇ϕ_s(x_k^{s−1}(s)) − α∇_s f(x_k^{s−1}). (4.41)

Combined with the assumption about ϕ_s, Lemma 9.4 in the Appendix leads immediately to the existence of the inverse mapping of ∇ϕ_s. Let [∇ϕ_s]^{−1} denote its inverse. Then x_k^s(s) can be expressed in terms of x_k^{s−1} as

x_k^s(s) = [∇ϕ_s]^{−1}(∇ϕ_s(x_k^{s−1}(s)) − α∇_s f(x_k^{s−1})). (4.42)

It follows from Eqs. (4.34), (4.38), and (4.39) that

x_k^s = (I_n − U_sU_s^T)x_k^{s−1} + U_s[∇ϕ_s]^{−1}(∇ϕ_s(x_k^{s−1}(s)) − α∇_s f(x_k^{s−1})), s = 1, 2, . . . , p, (4.43)

where U_s is defined by (2.4). In what follows, we define ψ_s : Rⁿ → Rⁿ as

ψ_s(x) ≜ (I_n − U_sU_s^T)x + U_s[∇ϕ_s]^{−1}(∇ϕ_s(x(s)) − α∇_s f(x)), s = 1, 2, . . . , p. (4.44)

It is clear that, given x_k, the above BMD method generates x_{k+1} in the following manner,

x_{k+1} = ψ(x_k), (4.45)

where the composite mapping

ψ(x) ≜ ψ_p ∘ ψ_{p−1} ∘ ⋯ ∘ ψ_2 ∘ ψ_1(x). (4.46)

Additionally, it follows from (4.44) that, for each s = 1, 2, . . . , p, the Jacobian of ψ_s is given below:

Dψ_s(x) = (I_n − U_sU_s^T) + D{U_s[∇ϕ_s]^{−1}(∇ϕ_s(x(s)) − α∇_s f(x))}
= (I_n − U_sU_s^T) + D{∇ϕ_s(x(s)) − α∇_s f(x)} × {∇²ϕ_s([∇ϕ_s]^{−1}(∇ϕ_s(x(s)) − α∇_s f(x)))}^{−1}U_s^T
= (I_n − U_sU_s^T) + {U_s∇²ϕ_s(x(s)) − α∇²f(x)U_s}{∇²ϕ_s([∇ϕ_s]^{−1}(∇ϕ_s(x(s)) − α∇_s f(x)))}^{−1}U_s^T, (4.47)

where the first equality is due to the chain rule; the second equality holds because of the chain rule and the inverse function theorem in Spivak (1965); the last equality follows from the definition of U_s and ∇²f(x) = (∂²f(x)/∂x(s)∂x(t))_{1≤s,t≤p}. By using the chain rule, we obtain the following Jacobian of the mapping ψ, i.e.,

Dψ(x) = Dψ_1(y₁) × Dψ_2(y₂) × ⋯ × Dψ_{p−1}(y_{p−1}) × Dψ_p(y_p), (4.48)

where y₁ = x and y_s = ψ_{s−1}(y_{s−1}), s = 2, . . . , p.
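To make the update (4.42) concrete, here is a minimal NumPy sketch of Method 4.1 (our own illustration): we choose quadratic mirror maps ϕ_t(x) = ½x^TQ_tx with Q_t ⪰ I, for which ∇ϕ_t(x) = Q_tx is explicitly invertible; the convex toy f and all names are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [2, 3]
off = np.cumsum([0] + sizes); n = int(off[-1])

# Quadratic mirror maps phi_t(x) = 0.5 x^T Q_t x (our choice): grad phi_t(x) = Q_t x,
# [grad phi_t]^{-1}(y) = Q_t^{-1} y, and mu_t = lambda_min(Q_t) >= 1 here.
Qs = []
for m in sizes:
    M = rng.standard_normal((m, m))
    Qs.append(M @ M.T + np.eye(m))

M = rng.standard_normal((n, n))
H = M @ M.T + np.eye(n)                  # Hessian of a toy convex quadratic f (clean demo)
grad_f = lambda x: H @ x
L = np.max(np.abs(np.linalg.eigvalsh(H)))
mu = min(np.linalg.eigvalsh(Q)[0] for Q in Qs)
alpha = 0.9 * mu / L                     # step size alpha < mu/L

def bmd_step(x):
    """One general step of Method 4.1, via the explicit update (4.42)."""
    x = x.copy()
    for t, Q in enumerate(Qs):
        blk = slice(off[t], off[t + 1])
        # grad phi_t(x_new(t)) = grad phi_t(x(t)) - alpha * grad_t f(x)
        x[blk] = np.linalg.solve(Q, Q @ x[blk] - alpha * grad_f(x)[blk])
    return x

x = rng.standard_normal(n)
for _ in range(300):
    x = bmd_step(x)
print(np.linalg.norm(grad_f(x)))         # small: the iterates approach the minimizer x = 0
```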
4.2. The iterative mapping ψ of BMD is a diffeomorphism

In this subsection, we first present the following Lemma 4.1, which shows that ψ_s, s = 1, . . . , p, are diffeomorphisms. Based on this lemma, we then further prove that ψ is a diffeomorphism as well.

Lemma 4.1
If the step size α < µ/L, then the mappings ψ_s defined by (4.44), s = 1, . . . , p, are diffeomorphisms.

Proof. The proof is lengthy and has been relegated to the Appendix.
Proposition 4.1
The mapping ψ defined by (4.46) with step size α < µ/L is a diffeomorphism.

Proof. It follows from Proposition 2.15 in Lee (2013) that the composition of two diffeomorphisms is also a diffeomorphism. Using this fact and the previous Lemma 4.1, the proof is completed.
4.3. Eigenvalue analysis of the Jacobian of ψ at a strict saddle point

In this subsection, we will analyze the eigenvalues of the Jacobian of ψ at a strict saddle point and show that it has at least one eigenvalue with magnitude strictly greater than one, which is a crucial part of the proof.

Suppose that x* ∈ Rⁿ is a strict saddle point. Then ∇f(x*) = 0 and ψ_s(x*) = x*, s = 1, 2, . . . , p. Hence,

Dψ_s(x*) = (I_n − U_sU_s^T) + {U_s∇²ϕ_s(x*(s)) − α∇²f(x*)U_s}{∇²ϕ_s([∇ϕ_s]^{−1}(∇ϕ_s(x*(s)) − α∇_s f(x*)))}^{−1}U_s^T
= (I_n − U_sU_s^T) + {U_s∇²ϕ_s(x*(s)) − α∇²f(x*)U_s}{∇²ϕ_s([∇ϕ_s]^{−1}(∇ϕ_s(x*(s))))}^{−1}U_s^T
= (I_n − U_sU_s^T) + {U_s∇²ϕ_s(x*(s)) − α∇²f(x*)U_s}{∇²ϕ_s(x*(s))}^{−1}U_s^T
= I_n − α∇²f(x*)U_s{∇²ϕ_s(x*(s))}^{−1}U_s^T, (4.49)

where the second equality is due to ∇f(x*) = 0 and the third equality thanks to [∇ϕ_s]^{−1}(∇ϕ_s(x*(s))) = x*(s).

Furthermore, since the eigenvalues of Dψ(x*) are the same as those of {Dψ(x*)}^T, it suffices to analyze the eigenvalues of {Dψ(x*)}^T. It follows from (4.48) that

{Dψ(x*)}^T = {Dψ_1(x*) × Dψ_2(x*) × ⋯ × Dψ_{p−1}(x*) × Dψ_p(x*)}^T
= {Dψ_p(x*)}^T × {Dψ_{p−1}(x*)}^T × ⋯ × {Dψ_2(x*)}^T × {Dψ_1(x*)}^T
= ∏_{s=p}^{1} {I_n − α∇²f(x*)U_s{∇²ϕ_s(x*(s))}^{−1}U_s^T}^T
= ∏_{s=p}^{1} {I_n − αU_s{∇²ϕ_s(x*(s))}^{−1}U_s^T∇²f(x*)}
= ∏_{s=p}^{1} {I_n − αU_s{∇²ϕ_s(x*(s))}^{−1}A_s}, (4.50)

where the third equality uses (4.49) and the last equality is due to the definition (3.14) of A_s.

Similar to the case of BCGD, we furthermore define

G̃ ≜ (1/α)[I_n − {Dψ(x*)}^T], (4.51)

or equivalently,

{Dψ(x*)}^T = I_n − αG̃. (4.52)

The above relation (4.52) clearly means that

λ ∈ eig(G̃) ⇔ 1 − αλ ∈ eig({Dψ(x*)}^T). (4.53)

In order to achieve a simple form of G̃, we first introduce the following notations. Define

Ψ ≜ Diag(∇²ϕ_1(x*(1)), ∇²ϕ_2(x*(2)), . . . , ∇²ϕ_p(x*(p))). (4.54)

Then

Ψ^{−1} = Diag({∇²ϕ_1(x*(1))}^{−1}, {∇²ϕ_2(x*(2))}^{−1}, . . . , {∇²ϕ_p(x*(p))}^{−1}). (4.55)

Note that A, A_{st} and A_s have been defined by (3.12), (3.13) and (3.14), respectively. Then

A = (A_{st})_{1≤s,t≤p} = (A₁; A₂; . . . ; A_p), (4.56)

where A_{st} ∈ R^{n_s×n_t} and A_s ∈ R^{n_s×n} denote the (s, t)-th block and the s-th block-row of A, respectively. We further define

T ≜ Ψ^{−1}A = ({∇²ϕ_s(x*(s))}^{−1}A_{st})_{1≤s,t≤p} = ({∇²ϕ_1(x*(1))}^{−1}A₁; {∇²ϕ_2(x*(2))}^{−1}A₂; . . . ; {∇²ϕ_p(x*(p))}^{−1}A_p) = (T₁; T₂; . . . ; T_p), (4.57)

where {∇²ϕ_s(x*(s))}^{−1}A_{st} ∈ R^{n_s×n_t} and T_s = {∇²ϕ_s(x*(s))}^{−1}A_s ∈ R^{n_s×n} denote the (s, t)-th block and the s-th block-row of T, respectively. Given the above notations, we further denote the strictly block lower triangular matrix based on T as

Ť ≜ (Ť_{st})_{1≤s,t≤p} (4.58)

with p × p blocks, and its (s, t)-th block is given by

Ť_{st} = T_{st} = {∇²ϕ_s(x*(s))}^{−1}A_{st} if s > t, and Ť_{st} = 0 if s ≤ t. (4.59)

Substituting T_s (see its definition (4.57)) into (4.50), we deduce that

{Dψ(x*)}^T = ∏_{s=p}^{1} {I_n − αU_sT_s}. (4.60)

The following lemma shows that G̃ still has a form similar to that of G defined by (3.18).
In fact, it just replaces A and Ǎ in (3.25) by T and Ť, respectively.

Lemma 4.2
Let x* ∈ Rⁿ be a strict saddle point. Assume that G̃, T and Ť are defined by (4.51), (4.57) and (4.58), respectively. Then

G̃ = (I_n + αŤ)^{−1}T. (4.61)

Proof. With the identifications (4.51) ∼ (3.18), (4.60) ∼ (3.17), (4.57) ∼ (3.12) and (4.58) ∼ (3.22), we are able to apply the same arguments as in the proof of Lemma 3.3 to obtain (4.61). Since the proof follows a similar pattern, it is therefore omitted.

Based on the above Lemma 4.2, along with the definitions (4.55), (4.57), (4.58), (3.12) and (3.22), we further have

G̃ = (I_n + αΨ^{−1}Ǎ)^{−1}Ψ^{−1}A
= (Ψ + αǍ)^{−1}A
= [Ψ^{1/2}(I_n + αΨ^{−1/2}ǍΨ^{−1/2})Ψ^{1/2}]^{−1}A
= Ψ^{−1/2}(I_n + αΨ^{−1/2}ǍΨ^{−1/2})^{−1}Ψ^{−1/2}A
= (Ψ^{−1/2})^T[I_n + αΨ^{−1/2}Ǎ(Ψ^{−1/2})^T]^{−1}Ψ^{−1/2}A, (4.62)

where Ψ^{−1/2} denotes the unique symmetric, positive definite square root matrix of the symmetric, positive definite matrix Ψ^{−1}. Switching the order of the products in (4.62) by moving the first factor to the last, we get a new matrix

Ḡ ≜ [I_n + αΨ^{−1/2}Ǎ(Ψ^{−1/2})^T]^{−1}Ψ^{−1/2}A(Ψ^{−1/2})^T. (4.63)

Note that eig(XY) = eig(YX) for any two square matrices; thus

eig(G̃) = eig(Ḡ), (4.64)

which shows that it suffices to analyze the eigenvalues of Ḡ.

Before presenting Lemma 4.4, the grand result of this subsection, we first provide two special properties of Ψ^{−1/2}A(Ψ^{−1/2})^T which will be used in the subsequent analysis.

Lemma 4.3
Let x* ∈ Rⁿ be a strict saddle point. Assume that A and Ψ are defined by (3.12) and (4.54), respectively. Then,

(i) Ψ^{−1/2}A(Ψ^{−1/2})^T has at least one negative eigenvalue;
(ii) the spectral radius of the symmetric matrix Ψ^{−1/2}A(Ψ^{−1/2})^T is upper bounded by L/µ.

Proof. (i) Since Ψ^{−1/2}A(Ψ^{−1/2})^T is a congruent transformation of A, they have the same index of inertia. In addition, x* being a strict saddle point implies that A = ∇²f(x*) (see its definition (3.12)) has at least one negative eigenvalue. Hence, Ψ^{−1/2}A(Ψ^{−1/2})^T has at least one negative eigenvalue as well.

(ii) In what follows, we will prove that

ρ(Ψ^{−1/2}A(Ψ^{−1/2})^T) ≤ L/µ. (4.65)

Obviously, it suffices to prove that −(L/µ)I_n ⪯ Ψ^{−1/2}A(Ψ^{−1/2})^T ⪯ (L/µ)I_n holds. It follows easily from (2.2) and Lemma 7 in Panageas and Piliouras (2016) that

−LI_n ⪯ ∇²f(x*) = A ⪯ LI_n. (4.66)

In addition, Eq. (4.35) means that

∇²ϕ_s(x*(s)) ⪰ µ_sI_{n_s} ⪰ µI_{n_s}, s = 1, 2, . . . , p, (4.67)

where the last inequality is due to the definition of µ (see (4.37)). Consequently, combined with the definition (4.54), the above inequalities imply that

(1/µ)Ψ ⪰ I_n, (4.68)

which, combined with (4.66), further implies

−(L/µ)Ψ ⪯ A ⪯ (L/µ)Ψ. (4.69)

Multiplying the above inequalities by Ψ^{−1/2} on the left and by (Ψ^{−1/2})^T on the right, we arrive at

−(L/µ)I_n ⪯ Ψ^{−1/2}A(Ψ^{−1/2})^T ⪯ (L/µ)I_n. (4.70)

Thus, the proof is finished.

Based on the above Lemma 4.3 and Lemma 6.4 in Section 6, we will prove that there exists at least one eigenvalue of Ḡ defined by (4.63) which belongs to Ω defined by (6.130).
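Both claims of Lemma 4.3 are easy to confirm numerically; a sketch of ours with random data (block sizes and all names are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(3)
sizes = [2, 3]
off = np.cumsum([0] + sizes); n = int(off[-1])

# Psi = Diag(Hessians of the mirror maps at x*): block diagonal, Psi >= mu I.
Psi = np.zeros((n, n))
for t, m in enumerate(sizes):
    M = rng.standard_normal((m, m))
    Psi[off[t]:off[t + 1], off[t]:off[t + 1]] = M @ M.T + np.eye(m)
mu = np.linalg.eigvalsh(Psi)[0]

A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # Hessian at a strict saddle
L = np.max(np.abs(np.linalg.eigvalsh(A)))

# Symmetric positive definite square root of Psi^{-1}.
w, V = np.linalg.eigh(Psi)
Psi_inv_half = V @ np.diag(w ** -0.5) @ V.T

C = Psi_inv_half @ A @ Psi_inv_half.T
# (i) Congruence preserves the inertia: same number of negative eigenvalues as A.
assert (np.linalg.eigvalsh(C) < 0).sum() == (np.linalg.eigvalsh(A) < 0).sum()
# (ii) Spectral radius bound: rho(C) <= L / mu.
assert np.max(np.abs(np.linalg.eigvalsh(C))) <= L / mu + 1e-9
```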
Lemma 4.4

Suppose that x* ∈ Rⁿ is a strict saddle point, and

Ḡ = [I_n + αΨ^{−1/2}Ǎ(Ψ^{−1/2})^T]^{−1}Ψ^{−1/2}A(Ψ^{−1/2})^T (4.71)

is given by (4.63) with α ∈ (0, µ/L), where L is determined by (2.2); µ, A and Ǎ are given by (4.37), (3.12) and (3.22), respectively; and Ψ^{−1/2} denotes the unique symmetric, positive definite square root matrix of the symmetric, positive definite matrix Ψ^{−1} defined by (4.55). Then, there exists at least one eigenvalue of Ḡ which belongs to Ω defined by (3.27).

Proof. Since x* is a strict saddle point and A = ∇²f(x*) is given by (3.12), Lemma 4.3 clearly implies that Ψ^{−1/2}A(Ψ^{−1/2})^T has at least one negative eigenvalue and that the spectral radius of the symmetric matrix Ψ^{−1/2}A(Ψ^{−1/2})^T is upper bounded by L/µ. Further, note that Ḡ = [I_n + αΨ^{−1/2}Ǎ(Ψ^{−1/2})^T]^{−1}Ψ^{−1/2}A(Ψ^{−1/2})^T and that Ψ^{−1/2}Ǎ(Ψ^{−1/2})^T is the strictly block lower triangular matrix based on Ψ^{−1/2}A(Ψ^{−1/2})^T, where Ǎ is defined by (3.22). Therefore, by applying Lemma 6.4 in Section 6 with the identifications Ψ^{−1/2}A(Ψ^{−1/2})^T ∼ B, Ψ^{−1/2}Ǎ(Ψ^{−1/2})^T ∼ B̌, α ∼ β and ρ(Ψ^{−1/2}A(Ψ^{−1/2})^T) ∼ ρ(B), we know that, for any α ∈ (0, 1/ρ(Ψ^{−1/2}A(Ψ^{−1/2})^T)), there exists at least one eigenvalue of Ḡ belonging to Ω defined by (3.27). In addition, Lemma 4.3 implies that L/µ ≥ ρ(Ψ^{−1/2}A(Ψ^{−1/2})^T), which leads immediately to α ∈ (0, µ/L) ⊆ (0, 1/ρ(Ψ^{−1/2}A(Ψ^{−1/2})^T)). Then, the above arguments clearly imply that the lemma holds true.

The following proposition shows that the Jacobian of the BMD iterative mapping ψ defined by (4.46) at a strict saddle point admits at least one eigenvalue whose magnitude is strictly greater than one.

Proposition 4.2
Assume that the BMD iterative mapping ψ is defined by (4.46) with α ∈ (0, µ/L), and that x* ∈ Rⁿ is a strict saddle point. Then Dψ(x*) has at least one eigenvalue whose magnitude is strictly greater than one.

Proof. Note that (4.64) and (4.63) imply that G̃ and Ḡ have the same eigenvalues. Consequently, Lemma 4.4 means there exists at least one eigenvalue of G̃ lying in Ω defined by (6.130). In what follows, with the identifications Dψ(x*) ∼ Dg_{αf}(x*), G̃ ∼ G and L/µ ∼ L, we are able to apply the same arguments as in the proof of Proposition 3.3 in Subsection 3.3 to finish the proof. Since the proof follows a similar pattern, it is therefore omitted.

4.4. Main results of the BMD method
We first introduce the following proposition, which asserts that the limit point of the sequence generated by the BMD method 4.1 is a critical point of f.

Proposition 4.3
Under Assumption 2.1, suppose {x_k}_{k≥0} is generated by the BMD method 4.1 with 0 < α < µ/L and lim_{k→∞} x_k exists; denote it as x*. Then x* is a critical point of f, i.e., ∇f(x*) = 0.

Proof. First, since {x_k}_{k≥0} is generated by the BMD method 4.1, x_k = ψ^k(x₀), where the BMD iterative mapping ψ is defined by (4.46) and ψ^k denotes the composition of ψ with itself k times. Hence, lim_{k→∞} x_k = lim_{k→∞} ψ^k(x₀) = x*. Since ψ is a diffeomorphism, we immediately know that x* is a fixed point of ψ.

Second, notice that ψ and ψ_s are defined by (4.46) and (4.44), respectively. If x* is a fixed point of ψ, then

ψ(x*) = x*, (4.72)

which, since each ψ_s updates only the s-th block of its argument, implies that

x* = ψ_s(x*) = (I_n − U_sU_s^T)x* + U_s[∇ϕ_s]^{−1}(∇ϕ_s(x*(s)) − α∇_s f(x*)), s = 1, 2, . . . , p. (4.73)

Consequently,

x*(s) = [∇ϕ_s]^{−1}(∇ϕ_s(x*(s)) − α∇_s f(x*)), s = 1, 2, . . . , p. (4.74)

Since Lemma 9.4 in the Appendix asserts that ∇ϕ_s is a diffeomorphism, [∇ϕ_s]^{−1} is a diffeomorphism as well. Thus it follows from (4.74) that

∇ϕ_s(x*(s)) = ∇ϕ_s(x*(s)) − α∇_s f(x*), s = 1, 2, . . . , p. (4.75)

We arrive at

∇_s f(x*) = 0, s = 1, 2, . . . , p, (4.76)

or equivalently, x* is a stationary point of f. Thus the proof is finished.

Armed with the results established in the previous subsections, we now state and prove our main theorem for the BMD method, whose proof is similar to that of Theorem 3.2 in Subsection 3.4. However, its proof is still given below in detail for the sake of completeness.

Theorem 4.1
Let f be a C² function and x* be a strict saddle point. If {x_k} is generated by the BMD method 4.1 with 0 < α < µ/L, then

P_ν[lim_{k→∞} x_k = x*] = 0.

Proof. First, Proposition 4.3 implies that if lim_{k→∞} x_k exists, then it must be a critical point. Hence, we consider calculating the Lebesgue measure (or probability with respect to the prior measure ν) of the set [lim_{k→∞} x_k = x*] = W^s(x*) (see Definition 2.3). Second, since Proposition 4.1 means the BMD iterative mapping ψ is a diffeomorphism, we replace φ and the fixed point 0 by ψ and the strict saddle point x* in the above Stable Manifold Theorem 3.1, respectively. Then the manifold W^{sc}_{loc}(x*) has strictly positive codimension because of Proposition 4.2 and x* being a strict saddle point. Hence, W^{sc}_{loc}(x*) has measure zero. In what follows, we are able to apply the same arguments as in Lee et al. (2016) to finish the proof of the theorem. Since the proof follows a similar pattern, it is therefore omitted.

Given the above Theorem 4.1, we immediately obtain the following Theorem 4.2 and its corollary by the same arguments as in the proofs of Theorem 2 and Corollary 12 in Panageas and Piliouras (2016). Therefore, we omit their proofs.

Theorem 4.2 (Non-isolated)
Let f : Rⁿ → R be a twice continuously differentiable function with sup_{x∈Rⁿ} ‖∇²f(x)‖ ≤ L < ∞. The set of initial conditions x₀ ∈ Rⁿ for which the BMD method 4.1 with step size 0 < α < µ/L converges to a strict saddle point is of (Lebesgue) measure zero, without the assumption that the critical points are isolated.

A straightforward corollary of Theorem 4.2 is given below:
Corollary 4.1
Assume that the conditions of Theorem 4.2 are satisfied and all saddle points of f are strict. Additionally, assume lim_{k→∞} ψ^k(x) exists for all x in Rⁿ. Then

P_ν[lim_{k→∞} ψ^k(x) = x*] = 1,

where ψ is defined by (4.46) and x* is a local minimum.
5. The PBCD method
In this section, we will prove that the PBCD method in Hong et al. (2017); Fercoq and Richtárik (2013); Hong et al. (2015) converges to minimizers as well, almost surely with random initialization.
5.1. The PBCD method description
For clarity of notation, recall that the vector of decision variables x has been assumed to have the following partition (see (2.3)):

x = (x(1)^T, x(2)^T, . . . , x(p)^T)^T, (5.77)

where x(t) ∈ R^{n_t}, and n₁, n₂, . . . , n_p are p positive integers satisfying ∑_{t=1}^{p} n_t = n. Correspondingly, we assume the variables x_k^s have the following partition as well:

x_k^s ≜ (x_k^s(1)^T, x_k^s(2)^T, . . . , x_k^s(p)^T)^T, s = 1, . . . , p; k = 0, 1, . . . , (5.78)

where x_k^s(t) ∈ R^{n_t}, t, s = 1, . . . , p; k = 0, 1, . . . .

Given the above notations, the detailed description of the PBCD method for problem (2.1) is given below.

Method 5.1 (PBCD)
Input: α < 1/L. Initialization: x₀ ∈ Rⁿ.
General Step (k = 0, 1, . . .): Set x_k^0 = x_k and define recursively for s = 1, 2, . . . , p and t = 1, 2, . . . , p:
If t = s,

x_k^s(t) = argmin_{x(t)} { f(x_k^{s−1}(1), . . . , x_k^{s−1}(t−1), x(t), x_k^{s−1}(t+1), . . . , x_k^{s−1}(p)) + (1/(2α))‖x(t) − x_k^{s−1}(t)‖² }. (5.79)

Else

x_k^s(t) = x_k^{s−1}(t). (5.80)

End
Set x_{k+1} = x_k^p.

It follows from α < 1/L that

f(x_k^{s−1}(1), . . . , x_k^{s−1}(s−1), x(s), x_k^{s−1}(s+1), . . . , x_k^{s−1}(p)) + (1/(2α))‖x(s) − x_k^{s−1}(s)‖²

is a strongly convex function with respect to the variable x(s). Hence, let x_k^s(s) be the unique minimizer of problem (5.79). Then by the KKT condition,

0 = ∇_s f(x_k^{s−1}(1), . . . , x_k^{s−1}(s−1), x_k^s(s), x_k^{s−1}(s+1), . . . , x_k^{s−1}(p)) + (1/α)(x_k^s(s) − x_k^{s−1}(s)),

which is equivalent to

x_k^{s−1}(s) = x_k^s(s) + α∇_s f(x_k^{s−1}(1), . . . , x_k^{s−1}(s−1), x_k^s(s), x_k^{s−1}(s+1), . . . , x_k^{s−1}(p)). (5.81)

Combining Eqs. (5.78), (5.80) and (5.81), we have the following relationship between x_k^{s−1} and x_k^s:

x_k^{s−1} = x_k^s + αU_s∇_s f(x_k^s), s = 1, . . . , p. (5.82)
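Each subproblem (5.79) is a strongly convex minimization over a single block and, in general, has no closed form. The following is a minimal sketch of Method 5.1 (ours, for illustration): SciPy's general-purpose minimizer stands in for an exact block solve, and the toy f is an assumption made only for the example.

```python
import numpy as np
from scipy.optimize import minimize

def pbcd(f, x0, block_slices, alpha, iters=100):
    """Method 5.1 (PBCD): each block solves the proximal subproblem (5.79)."""
    x = x0.copy()
    for _ in range(iters):
        for blk in block_slices:
            xc = x.copy()

            def sub(z, blk=blk, xc=xc):
                y = xc.copy()
                y[blk] = z
                # f(..., x(t), ...) + 1/(2 alpha) ||x(t) - x_k^{s-1}(t)||^2
                return f(y) + np.sum((z - xc[blk]) ** 2) / (2 * alpha)

            x[blk] = minimize(sub, xc[blk]).x
    return x

# Toy smooth non-convex f (our example), bounded below by its quartic term.
def f(x):
    return 0.25 * np.sum(x ** 4) - 0.5 * x[0] ** 2 + x[1] * x[2]

rng = np.random.default_rng(4)
x = pbcd(f, rng.standard_normal(3), [slice(0, 2), slice(2, 3)], alpha=0.1)
print(x)   # a critical point of f; by the results of this section, almost surely a local minimizer
```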
Recall that, for a given step size α > 0, we have used g_{αf}^s (see its definition in (3.5)) to denote the gradient mapping of the function f with respect to the variable x(s), i.e.,

g_{αf}^s(x) = x − αU_s∇_s f(x), s = 1, . . . , p. (5.83)

Using the same notation, for a given step size α > 0, the gradient mapping of the function −f with respect to the variable x(s) is

g_{α(−f)}^s(x) = x − αU_s∇_s(−f(x)) = x − αU_s(−∇_s f(x)) = x + αU_s∇_s f(x). (5.84)

It is obvious that Eqs. (5.82) and (5.84) imply that

x_k^{s−1} = g_{α(−f)}^s(x_k^s), s = 1, . . . , p. (5.85)

Note that Lemma 5.2 in Subsection 5.2 implies that g_{α(−f)}^s is a diffeomorphism for α ∈ (0, 1/L). We denote its inverse as [g_{α(−f)}^s]^{−1}, which is a diffeomorphism as well. Then, (5.85) is further equivalent to

x_k^s = [g_{α(−f)}^s]^{−1}(x_k^{s−1}), s = 1, . . . , p. (5.86)

Lemma 5.1
Assume that g_{α(−f)}^s is determined by (5.84), s = 1, . . . , p. Given x_k, the above PBCD method generates x_{k+1} in the following manner,

x_{k+1} = [g_{α(−f)}]^{−1}(x_k), (5.87)

where the composite iterative mapping [g_{α(−f)}]^{−1} denotes the inverse mapping of the following mapping:

g_{α(−f)} ≜ g_{α(−f)}^1 ∘ g_{α(−f)}^2 ∘ ⋯ ∘ g_{α(−f)}^{p−1} ∘ g_{α(−f)}^p. (5.88)

Proof. According to the PBCD method and (5.86), we have x_k^0 = x_k and

x_{k+1} = [g_{α(−f)}^p]^{−1} ∘ [g_{α(−f)}^{p−1}]^{−1} ∘ ⋯ ∘ [g_{α(−f)}^2]^{−1} ∘ [g_{α(−f)}^1]^{−1}(x_k) = [g_{α(−f)}^1 ∘ g_{α(−f)}^2 ∘ ⋯ ∘ g_{α(−f)}^{p−1} ∘ g_{α(−f)}^p]^{−1}(x_k). (5.89)

Thus the proof is finished.

By simple computation, the Jacobian of g_{α(−f)}^s is given by

Dg_{α(−f)}^s(x) = I_n + α∇²f(x)U_sU_s^T, (5.90)

where

∇²f(x) = (∂²f(x)/∂x(s)∂x(t))_{1≤s,t≤p}. (5.91)

Since g_{α(−f)}(x) is defined by (5.88), the chain rule implies that the Jacobian of the mapping g_{α(−f)} is

Dg_{α(−f)}(x) = Dg_{α(−f)}^p(y_p) × Dg_{α(−f)}^{p−1}(y_{p−1}) × ⋯ × Dg_{α(−f)}^2(y₂) × Dg_{α(−f)}^1(y₁), (5.92)

where y₁ = x and y_s = g_{α(−f)}^{s−1}(y_{s−1}), s = 2, . . . , p.

5.2. The iterative mapping [g_{α(−f)}]^{−1} of the PBCD method is a diffeomorphism

In this subsection, we first present the following Lemma 5.2, which shows that g_{α(−f)}^s, s = 1, . . . , p, are diffeomorphisms. Based on this lemma, we then further prove that [g_{α(−f)}]^{−1} is a diffeomorphism as well.

Lemma 5.2
If the step size α < 1/L, then the mappings g_{α(−f)}^s defined by (5.84), s = 1, . . . , p, are diffeomorphisms.

Proof. Note that L is also the Lipschitz constant of the gradient of −f. By replacing f with −f, Lemma 3.1 in Subsection 3.2 implies that g_{α(−f)}^s is a diffeomorphism for α ∈ (0, 1/L); the proof is completed.
The mapping [g_{α(−f)}]^{−1} determined by (5.88) with step size α < 1/L is a diffeomorphism.

Proof. Note that Lemma 5.2 implies that g_{α(−f)}^s, s = 1, . . . , p, defined by (5.84) are diffeomorphisms and that g_{α(−f)} is defined by (5.88) with 0 < α < 1/L. By a similar argument as in the proof of Proposition 3.1 in Subsection 3.2, we know g_{α(−f)} is a diffeomorphism. Thus its inverse is a diffeomorphism as well, and the proof is completed.

5.3. Eigenvalue analysis of the Jacobian of [g_{α(−f)}]^{−1} at a strict saddle point

In this subsection, we consider the eigenvalues of the Jacobian of [g_{α(−f)}]^{−1} at a strict saddle point, which is a crucial part of our entire proof.

If x* ∈ Rⁿ is a strict saddle point, then ∇f(x*) = 0. Consequently,

x* = [g_{α(−f)}^s]^{−1}(x*), s = 1, . . . , p, (5.93)

which implies that

x* = g_{α(−f)}(x*). (5.94)

More importantly, the above Eq. (5.94) implies that the Jacobian of [g_{α(−f)}]^{−1} at x* can be expressed as

D[g_{α(−f)}]^{−1}(x*) = (Dg_{α(−f)}(x*))^{−1}, (5.95)

which is due to the inverse function theorem in Spivak (1965). Hence, in order to show that there is at least one eigenvalue of D[g_{α(−f)}]^{−1}(x*) whose magnitude is strictly greater than one, we first argue that Dg_{α(−f)}(x*) still has a structure similar to that of Dg_{αf}(x*) in Section 3.

Specifically, the chain rule (5.92) and Eq. (5.94) imply

Dg_{α(−f)}(x*) = Dg_{α(−f)}^p(x*) × Dg_{α(−f)}^{p−1}(x*) × ⋯ × Dg_{α(−f)}^2(x*) × Dg_{α(−f)}^1(x*). (5.96)

Moreover, the eigenvalues of Dg_{α(−f)}(x*) are the same as those of its transpose. Hence, we compute

(Dg_{α(−f)}(x*))^T = (Dg_{α(−f)}^1(x*))^T × (Dg_{α(−f)}^2(x*))^T × ⋯ × (Dg_{α(−f)}^{p−1}(x*))^T × (Dg_{α(−f)}^p(x*))^T
= (I_n + αU_1U_1^T∇²f(x*)) × (I_n + αU_2U_2^T∇²f(x*)) × ⋯ × (I_n + αU_{p−1}U_{p−1}^T∇²f(x*)) × (I_n + αU_pU_p^T∇²f(x*)), (5.97)

where the second equality is due to (5.90).

Now we define

H ≜ (1/α)[I_n − (Dg_{α(−f)}(x*))^T], (5.98)

or equivalently,

(Dg_{α(−f)}(x*))^T = I_n − αH. (5.99)

The above relation (5.99) clearly means that

λ ∈ eig(H) ⇔ 1 − αλ ∈ eig((Dg_{α(−f)}(x*))^T). (5.100)

For the sake of clarity, we rewrite A defined by (3.12) below:

A = (A_{st})_{1≤s,t≤p}, (5.101)

and its (s, t)-th block is given by

A_{st} = ∂²f(x*)/∂x*(s)∂x*(t), 1 ≤ s, t ≤ p. (5.102)

Similarly, we denote the strictly block upper triangular matrix based on A as

Â ≜ (Â_{st})_{1≤s,t≤p} (5.103)

with p × p blocks, and its (s, t)-th block is given by

Â_{st} = A_{st} if s < t, and Â_{st} = 0 if s ≥ t. (5.104)

Based on the above notations, we are able to obtain the following Lemma 5.3, which is similar to Lemma 3.3 in Subsection 3.3. It gives a simple expression of H in terms of A and Â.

Lemma 5.3
Let x* ∈ Rⁿ be a strict saddle point. Assume that H, A and Â are defined by (5.98), (5.101) and (5.103), respectively. Then

H = −(I_n − αÂ)^{−1}A. (5.105)

Proof. The proof is similar to that of Lemma 3.3 in Subsection 3.3.

In what follows, we proceed in a way similar to the case of BCGD, although some specific technical difficulties occur in the analysis. The following proposition shows that the Jacobian of the PBCD iterative mapping [g_{α(−f)}]^{−1} determined by (5.88) at a strict saddle point admits at least one eigenvalue whose magnitude is strictly greater than one, which plays a key role in this subsection.

Proposition 5.2
Assume that the PBCD iterative mapping [g_{α(−f)}]^{−1} is determined by (5.88) with α ∈ (0, 1/L), and that x* ∈ Rⁿ is a strict saddle point. Then D[g_{α(−f)}]^{−1}(x*) has at least one eigenvalue whose magnitude is strictly greater than one.

Proof. First, recall that the Jacobian of [g_{α(−f)}]^{−1} at x* can be expressed as

D[g_{α(−f)}]^{−1}(x*) = (Dg_{α(−f)}(x*))^{−1}. (5.106)

Second, recall Eqs. (5.98)-(5.99) as follows:

H ≜ (1/α)[I_n − (Dg_{α(−f)}(x*))^T], (5.107)

or equivalently,

(Dg_{α(−f)}(x*))^T = I_n − αH. (5.108)

It follows from Lemma 5.3 that H has the following expression:

H = −(I_n − αÂ)^{−1}A = (I_n + α(−Â))^{−1}(−A), (5.109)

where A and Â are defined by (5.101) and (5.103). Combining Eqs. (5.106) and (5.108), we have

eig(D[g_{α(−f)}]^{−1}(x*)) = eig({D[g_{α(−f)}]^{−1}(x*)}^T) = eig({(Dg_{α(−f)}(x*))^T}^{−1}) = eig((I_n − αH)^{−1}) = eig({I_n − α[I_n + α(−Â)]^{−1}(−A)}^{−1}), (5.110)

where the last equality is due to (5.109).

Since A = ∇²f(x*) and x* is a strict saddle point, A has at least one negative eigenvalue. Hence, −A has at least one positive eigenvalue. Consequently, by applying Lemma 6.6 with the identifications −A ∼ B, −Â ∼ B̂, α ∼ β and ρ(A) ∼ ρ(B), we know that, for any α ∈ (0, 1/ρ(A)), there exists at least one eigenvalue λ of {I_n − α[I_n + α(−Â)]^{−1}(−A)}^{−1} whose magnitude is strictly greater than one. Furthermore, Lemma 7 in Panageas and Piliouras (2016) implies that the gradient Lipschitz constant satisfies L ≥ ρ(A) = ρ(∇²f(x*)), which leads to α ∈ (0, 1/L) ⊆ (0, 1/ρ(A)). Then, the above arguments and (5.110) imply that the proposition holds true.
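As with BCGD, Proposition 5.2 can be checked numerically. A sketch of ours (random data, block sizes assumed) builds (Dg_{α(−f)}(x*))^T from (5.97) and inverts it as in (5.106):

```python
import numpy as np

rng = np.random.default_rng(5)
sizes = [2, 3]; n = sum(sizes)
off = np.cumsum([0] + sizes)

A = rng.standard_normal((n, n)); A = (A + A.T) / 2    # Hessian at the strict saddle x*
assert np.linalg.eigvalsh(A)[0] < 0
alpha = 0.9 / np.max(np.abs(np.linalg.eigvalsh(A)))   # alpha < 1/L

# (Dg_{alpha(-f)}(x*))^T = (I + alpha U_1 U_1^T A) ... (I + alpha U_p U_p^T A), per (5.97).
J_T = np.eye(n)
for s in range(len(sizes)):
    U_s = np.eye(n)[:, off[s]:off[s + 1]]
    J_T = J_T @ (np.eye(n) + alpha * U_s @ U_s.T @ A)

# D[g_{alpha(-f)}]^{-1}(x*) = (Dg_{alpha(-f)}(x*))^{-1}, per (5.106); its eigenvalues
# coincide with those of the inverse transpose.
eigs = np.linalg.eigvals(np.linalg.inv(J_T))
print(np.max(np.abs(eigs)))    # strictly greater than one (Proposition 5.2)
```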
5.4. Main results of PBCD

We first introduce the following proposition, which asserts that the limit point of the sequence generated by the PBCD method 5.1 is a critical point of f.

Proposition 5.3
Under Assumption 2.1, suppose {x_k}_{k≥0} is generated by the PBCD method 5.1 with 0 < α < 1/L and lim_{k→∞} x_k exists; denote it as x*. Then x* is a critical point of f, i.e., ∇f(x*) = 0.

Proof. Notice that {x_k}_{k≥0} is generated by the PBCD method 5.1. We clearly have x_k = {[g_{α(−f)}]^{−1}}^k(x₀), where [g_{α(−f)}]^{−1} denotes the inverse mapping of g_{α(−f)} defined by (5.88) and {[g_{α(−f)}]^{−1}}^k denotes the composition of [g_{α(−f)}]^{−1} with itself k times. Hence, lim_{k→∞} x_k = lim_{k→∞} {[g_{α(−f)}]^{−1}}^k(x₀) = x*. Since [g_{α(−f)}]^{−1} is a diffeomorphism, we immediately know that x* is a fixed point of [g_{α(−f)}]^{−1}, and hence also a fixed point of g_{α(−f)}. It follows easily from the definition (5.88) of g_{α(−f)} that ∇f(x*) = 0. Thus the proof is finished.

Armed with the results established in the previous subsections and the above Proposition 5.3, we now state and prove our main theorem of PBCD, whose proof is similar to that of Theorem 3.2 in Subsection 3.4. However, its proof is still given below in detail for the sake of completeness.

Theorem 5.1
Let f be a C² function and x* be a strict saddle point. If {x_k}_{k≥0} is generated by the PBCD method 5.1 with 0 < α < 1/L, then

P_ν[lim_{k→∞} x_k = x*] = 0.

Proof. First, Proposition 5.3 implies that if lim_{k→∞} x_k exists, then it must be a critical point. Hence, we consider calculating the Lebesgue measure (or probability with respect to the prior measure ν) of the set [lim_{k→∞} x_k = x*] = W^s(x*) (see Definition 2.3). Second, since Proposition 5.1 means the PBCD iterative mapping [g_{α(−f)}]^{−1} is a diffeomorphism, we replace φ and the fixed point 0 by [g_{α(−f)}]^{−1} and the strict saddle point x* in the above Stable Manifold Theorem 3.1 in Subsection 3.4, respectively. Then the manifold W^{sc}_{loc}(x*) has strictly positive codimension because of Proposition 5.2 and x* being a strict saddle point. Hence, W^{sc}_{loc}(x*) has measure zero. In what follows, we are able to apply the same arguments as in Lee et al. (2016) to finish the proof of the theorem. Since the proof follows a similar pattern, it is therefore omitted.

Given the above Theorem 5.1, we immediately obtain the following Theorem 5.2 and its corollary by the same arguments as in the proofs of Theorem 2 and Corollary 12 in Panageas and Piliouras (2016). Therefore, we omit their proofs.

Theorem 5.2 (Non-isolated)
Theorem 5.2 (Non-isolated)

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a twice continuously differentiable function with $\sup_{x\in\mathbb{R}^n}\|\nabla^2 f(x)\|_2 < \infty$. The set of initial conditions $x \in \mathbb{R}^n$ such that the PBCD method 5.1 with step size $0 < \alpha < \frac{1}{L}$ converges to a strict saddle point is of (Lebesgue) measure zero, without the assumption that critical points are isolated.

A straightforward corollary of Theorem 5.2 is given below:
Corollary 5.1
Assume that the conditions of Theorem 5.2 are satisfied and all saddle points of $f$ are strict. Additionally, assume that $\lim_k\big\{[g^{\alpha}_{(-f)}]^{-1}\big\}^k(x)$ exists for all $x \in \mathbb{R}^n$. Then
\[ \mathbb{P}_{\nu}\Big[\lim_k\big\{[g^{\alpha}_{(-f)}]^{-1}\big\}^k(x) = x^*\Big] = 1, \]
where $[g^{\alpha}_{(-f)}]^{-1}$ denotes the inverse mapping of $g^{\alpha}_{(-f)}$ defined by (5.88) and $x^*$ is a local minimum.

6. Several Technical Lemmas

In what follows, we provide several technical lemmas (Lemmas 6.1-6.6), which form the basis for proving that the Jacobian of the iterative mappings, including $g^{\alpha}_f$ defined by (3.7) in Section 3, $\psi$ defined by (4.54) in Section 4 and $[g^{\alpha}_{(-f)}]^{-1}$ defined by (5.87) in Section 5, has at least one eigenvalue with magnitude strictly greater than one at a strict saddle point. In particular, Lemma 6.4 gives a sufficiently exact description of the distribution of the eigenvalues of a matrix with the same kind of structure as $G$, which is related to $g^{\alpha}_f$ and $\psi$, while Lemmas 6.5 and 6.6 give a similar description for the eigenvalues of a matrix with the same kind of structure as $H$, which is related to $g^{\alpha}_{(-f)}$.

Before presenting the main results of this section, we first introduce two notations. Assume $B \in \mathbb{R}^{n\times n}$ is a symmetric matrix with $p \times p$ blocks. Specifically,
\[ B \triangleq (B_{st})_{1\le s,t\le p}, \tag{6.111} \]
with $(s,t)$-th block
\[ B_{st} \in \mathbb{R}^{n_s\times n_t}, \quad 1 \le s,t \le p, \tag{6.112} \]
where $n_1, n_2, \ldots, n_p$ are $p$ positive integers satisfying $\sum_{s=1}^p n_s = n$. In addition, we denote the strictly block lower triangular matrix based on $B$ as
\[ \check{B} \triangleq (\check{B}_{st})_{1\le s,t\le p} \tag{6.113} \]
with $p \times p$ blocks, whose $(s,t)$-th block is given by
\[ \check{B}_{st} = \begin{cases} B_{st}, & s > t, \\ 0, & s \le t. \end{cases} \tag{6.114} \]
Similarly, we denote the strictly block upper triangular matrix based on $B$ as
\[ \hat{B} \triangleq (\hat{B}_{st})_{1\le s,t\le p} \tag{6.115} \]
with $p \times p$ blocks, whose $(s,t)$-th block is given by
\[ \hat{B}_{st} = \begin{cases} B_{st}, & s < t, \\ 0, & s \ge t. \end{cases} \tag{6.116} \]
Given the above notations, we have the following Lemma 6.1, which asserts that, for any unit vector $\eta \in \mathbb{C}^n$, the real parts of $\eta^H\check{B}\eta$ and $\eta^H\hat{B}\eta$ are both bounded by the spectral radius of $B$.

Lemma 6.1 Assume that $B$, $\check{B}$ and $\hat{B}$ are defined by (6.111), (6.113) and (6.115), respectively. Then for an arbitrary $n$-dimensional vector $\eta \in \mathbb{C}^n$ with $\|\eta\| = 1$, we have
\[ -\rho(B) \le \mathrm{Re}\big(\eta^H\check{B}\eta\big) \le \rho(B) \tag{6.117} \]
and
\[ -\rho(B) \le \mathrm{Re}\big(\eta^H\hat{B}\eta\big) \le \rho(B). \tag{6.118} \]

Proof. We first define a block diagonal matrix based on $B$:
\[ \tilde{B} \triangleq \mathrm{Diag}(B_{11}, B_{22}, \ldots, B_{pp}), \tag{6.119} \]
whose main diagonal blocks are the same as those of $B$. Therefore, $B$ admits the decomposition
\[ B = \check{B} + \tilde{B} + \check{B}^T. \tag{6.120} \]
In addition, it is obvious that
\[ -\rho(B)I_n \preceq B \preceq \rho(B)I_n, \tag{6.121} \]
which, combined with Theorem 4.3.15 in Horn and Johnson (1986), implies
\[ -\rho(B)I_n \preceq \tilde{B} \preceq \rho(B)I_n. \tag{6.122} \]
Assume that $\eta \in \mathbb{C}^n$ and $\|\eta\| = 1$.
On the one hand,
\[
2\rho(B) \ge \eta^H(\rho(B)I_n + B)\eta
= \eta^H\big[\rho(B)I_n + (\check{B} + \tilde{B} + \check{B}^T)\big]\eta
= \eta^H\big(\rho(B)I_n + \tilde{B}\big)\eta + \eta^H\big(\check{B} + \check{B}^T\big)\eta
= \eta^H\big(\rho(B)I_n + \tilde{B}\big)\eta + \eta^H\big(\check{B} + \check{B}^H\big)\eta
= \eta^H\big(\rho(B)I_n + \tilde{B}\big)\eta + 2\,\mathrm{Re}\big(\eta^H\check{B}\eta\big)
\ge 2\,\mathrm{Re}\big(\eta^H\check{B}\eta\big), \tag{6.123}
\]
where the first inequality is due to (6.121); the first equality holds because of (6.120); the third equality follows from the fact that $\check{B}$ is a real matrix; and (6.122) gives the last inequality.

On the other hand, we also have
\[
2\rho(B) \ge \eta^H(\rho(B)I_n - B)\eta
= \eta^H\big[\rho(B)I_n - (\check{B} + \tilde{B} + \check{B}^T)\big]\eta
= \eta^H\big(\rho(B)I_n - \tilde{B}\big)\eta - \eta^H\big(\check{B} + \check{B}^H\big)\eta
= \eta^H\big(\rho(B)I_n - \tilde{B}\big)\eta - 2\,\mathrm{Re}\big(\eta^H\check{B}\eta\big)
\ge -2\,\mathrm{Re}\big(\eta^H\check{B}\eta\big), \tag{6.124}
\]
by the same reasoning. Clearly, Eqs. (6.123) and (6.124) lead to (6.117). The same argument shows that (6.118) holds true.
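Lemma 6.1 is easy to probe numerically. The following sketch (illustrative only, using $1\times 1$ blocks so that $\check{B}$ is simply the strictly lower triangular part) samples random unit vectors $\eta \in \mathbb{C}^n$ and confirms the bound $|\mathrm{Re}(\eta^H\check{B}\eta)| \le \rho(B)$:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 8
B = rng.standard_normal((n, n)); B = (B + B.T) / 2   # symmetric B
B_check = np.tril(B, -1)                             # strictly lower triangular part
rho = np.abs(np.linalg.eigvalsh(B)).max()            # spectral radius of B

worst = 0.0
for _ in range(10000):
    eta = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    eta /= np.linalg.norm(eta)                       # random unit vector in C^n
    worst = max(worst, abs((eta.conj() @ B_check @ eta).real))

print(f"max |Re(eta^H B_check eta)| = {worst:.4f} <= rho(B) = {rho:.4f}")
```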
The following lemma states that, if $B$ is invertible, the real part of the eigenvalues of $B^{-1}(I_n + \beta\check{B})$ does not vanish for any $\beta \in \big(0, \tfrac{1}{\rho(B)}\big)$.

Lemma 6.2 Assume that $B$ and $\check{B}$ are defined by (6.111) and (6.113), respectively. Moreover, suppose $B$ is invertible. For any $\beta \in \big(0, \tfrac{1}{\rho(B)}\big)$ and $t \in [0,1]$, if $\lambda$ is an eigenvalue of $B^{-1}(I_n + t\beta\check{B})$, then $\mathrm{Re}(\lambda) \ne 0$.

Proof. By assumption, $B$ is invertible. Since $I_n + t\beta\check{B}$ is invertible for any $t \in [0,1]$, $B^{-1}(I_n + t\beta\check{B})$ is an invertible matrix. Let $\lambda$ be an eigenvalue of $B^{-1}(I_n + t\beta\check{B})$ and $\xi$ the corresponding eigenvector of unit length. Then $\lambda \ne 0$ and
\[ B^{-1}(I_n + t\beta\check{B})\xi = \lambda\xi, \tag{6.125} \]
which is clearly equivalent to
\[ (I_n + t\beta\check{B})\xi = \lambda B\xi. \tag{6.126} \]
Premultiplying both sides by $\xi^H$, we arrive at
\[ 1 + t\beta\,\xi^H\check{B}\xi = \lambda\,\xi^H B\xi. \tag{6.127} \]
Note that $0 < \beta < \tfrac{1}{\rho(B)}$ and $t \in [0,1]$; by Lemma 6.1, $0 < \mathrm{Re}\big(1 + t\beta\,\xi^H\check{B}\xi\big) < 2$, i.e., the real part of $1 + t\beta\,\xi^H\check{B}\xi$ is a positive real number. We write $1 + t\beta\,\xi^H\check{B}\xi = a + bi$ with $0 < a < 2$. If $\lambda = ci$ ($c \ne 0$) were purely imaginary, then $\lambda\,\xi^H B\xi = (\xi^H B\xi)\,ci$ would also be purely imaginary, because $\xi^H B\xi$ is real for the real symmetric matrix $B$. Hence (6.127) would become
\[ 1 + t\beta\,\xi^H\check{B}\xi = a + bi = (\xi^H B\xi)\,ci, \tag{6.128} \]
which is a contradiction since $a > 0$. Hence $\mathrm{Re}(\lambda) \ne 0$ is proved.
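A quick numerical check of Lemma 6.2 (ours, illustrative, with $1\times 1$ blocks): for a random invertible symmetric $B$ and $\beta$ just below $\tfrac{1}{\rho(B)}$, the eigenvalues of $B^{-1}(I_n + t\beta\check{B})$ stay away from the imaginary axis for all $t \in [0,1]$:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 6
B = rng.standard_normal((n, n)); B = (B + B.T) / 2   # random symmetric B (a.s. invertible)
B_check = np.tril(B, -1)
rho = np.abs(np.linalg.eigvalsh(B)).max()
beta = 0.9 / rho                                     # beta in (0, 1/rho(B))

min_abs_re = np.inf
for t in np.linspace(0.0, 1.0, 21):
    M = np.linalg.solve(B, np.eye(n) + t * beta * B_check)
    min_abs_re = min(min_abs_re, np.abs(np.linalg.eigvals(M).real).min())

print("min over t of min |Re(lambda)| =", min_abs_re)   # expected: strictly positive
```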
The following lemma states that, if $B$ is invertible, the real part of an eigenvalue of $(\beta B)^{-1}(I_n + t\beta\hat{B})$, whenever positive, is strictly larger than $\tfrac{1}{2}$, for any $\beta \in \big(0, \tfrac{1}{\rho(B)}\big)$ and $t \in [0,1]$.

Lemma 6.3 Assume that $B$ and $\hat{B}$ are defined by (6.111) and (6.115), respectively. Moreover, suppose $B$ is invertible. For any $\beta \in \big(0, \tfrac{1}{\rho(B)}\big)$ and $t \in [0,1]$, if $\lambda$ is an eigenvalue of $(\beta B)^{-1}(I_n + t\beta\hat{B})$ and $\mathrm{Re}(\lambda) > 0$, then
\[ \mathrm{Re}(\lambda) \ge \frac{1}{2} + \frac{1-\beta\rho(B)}{\beta\rho(B)} > \frac{1}{2}. \]

Proof. The proof is lengthy and has been relegated to the Appendix.

The following lemma plays a key role because it provides a sufficiently exact description of the distribution of the eigenvalues of $(I_n + \beta\check{B})^{-1}B$, which has the same structure as $G$ defined by (3.18).

Lemma 6.4
Assume that $B$ and $\check{B}$ are defined by (6.111) and (6.113), respectively. If, furthermore, $\lambda_{\min}(B) < 0$, then, for an arbitrary $\beta \in \big(0, \tfrac{1}{\rho(B)}\big)$, there is at least one eigenvalue $\lambda$ of $(I_n + \beta\check{B})^{-1}B$ which lies in the closed left half complex plane excluding the origin, i.e.,
\[ \forall\,\beta \in \Big(0, \frac{1}{\rho(B)}\Big) \;\Rightarrow\; \exists\,\lambda \in \Big[\mathrm{eig}\big((I_n + \beta\check{B})^{-1}B\big) \cap \Omega\Big], \tag{6.129} \]
where
\[ \Omega \triangleq \big\{a + bi \;\big|\; a, b \in \mathbb{R},\; a \le 0,\; (a,b) \ne (0,0),\; i = \sqrt{-1}\big\}. \tag{6.130} \]

Proof. The proof is lengthy and has been relegated to the Appendix.
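The conclusion of Lemma 6.4 can be observed directly. In the sketch below (ours, with $1\times 1$ blocks; names are illustrative), a symmetric $B$ with $\lambda_{\min}(B) < 0$ yields, for each admissible $\beta$, an eigenvalue of $(I_n + \beta\check{B})^{-1}B$ in $\Omega$:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 6
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
B = Q @ np.diag([1.2, 0.8, 0.4, 0.1, -0.3, -0.9]) @ Q.T   # lambda_min(B) < 0
B_check = np.tril(B, -1)
rho = np.abs(np.linalg.eigvalsh(B)).max()

for beta in [0.1 / rho, 0.5 / rho, 0.99 / rho]:           # beta in (0, 1/rho(B))
    eigs = np.linalg.eigvals(np.linalg.solve(np.eye(n) + beta * B_check, B))
    in_omega = [z for z in eigs if z.real <= 0 and abs(z) > 1e-12]
    print(f"beta*rho = {beta*rho:.2f}: eigenvalue in Omega? {bool(in_omega)}")
```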
Similar to Lemma 6.4, the following lemma plays a key role in this case because it provides a sufficiently exact description of the distribution of the eigenvalues of $\beta(I_n + \beta\hat{B})^{-1}B$, which has the same structure as $H$ defined by (5.98).

Lemma 6.5 Assume that $B$ and $\hat{B}$ are defined by (6.111) and (6.115), respectively. If, furthermore, $\lambda_{\max}(B) > 0$, then, for an arbitrary $\beta \in \big(0, \tfrac{1}{\rho(B)}\big)$, there is at least one nonzero eigenvalue $\lambda$ of $\beta(I_n + \beta\hat{B})^{-1}B$ such that
\[ \frac{1}{\lambda} \in \Xi(\beta, B), \tag{6.131} \]
where
\[ \Xi(\beta, B) \triangleq \Big\{a + bi \;\Big|\; a, b \in \mathbb{R},\; \frac{1}{2} + \frac{1-\beta\rho(B)}{\beta\rho(B)} \le a,\; i = \sqrt{-1}\Big\}. \tag{6.132} \]

Proof. The proof is lengthy and has been relegated to the Appendix.

Based on the above lemma, we directly obtain the following Lemma 6.6, which shows that $\big[I_n - \beta(I_n + \beta\hat{B})^{-1}B\big]^{-1}$, a matrix with the same structure as $\big\{D[g^{\alpha}_{(-f)}]^{-1}(x^*)\big\}^T$, has at least one eigenvalue whose magnitude is strictly greater than one.
Lemma 6.6 Assume that $B$ and $\hat{B}$ are defined by (6.111) and (6.115), respectively. Suppose, furthermore, that $\lambda_{\max}(B) > 0$ and that $I_n - \beta(I_n + \beta\hat{B})^{-1}B$ is invertible. Then, for an arbitrary $\beta \in \big(0, \tfrac{1}{\rho(B)}\big)$, there is at least one nonzero eigenvalue of $\big[I_n - \beta(I_n + \beta\hat{B})^{-1}B\big]^{-1}$ whose magnitude is strictly greater than one.

Proof. By assumption, $I_n - \beta(I_n + \beta\hat{B})^{-1}B$ is invertible. Then
\[ \lambda \in \mathrm{eig}\big(\beta(I_n + \beta\hat{B})^{-1}B\big) \;\Leftrightarrow\; \frac{1}{1-\lambda} \in \mathrm{eig}\Big(\big[I_n - \beta(I_n + \beta\hat{B})^{-1}B\big]^{-1}\Big). \tag{6.133} \]
Hence, the statement of the lemma is equivalent to the existence of at least one eigenvalue $\lambda$ of $\beta(I_n + \beta\hat{B})^{-1}B$ such that
\[ \Big|\frac{1}{1-\lambda}\Big| > 1 \;\Leftrightarrow\; \Big|\frac{1}{\lambda}\Big| > \Big|\frac{1}{\lambda} - 1\Big|. \tag{6.134} \]
The above inequality is clearly equivalent to
\[ \frac{1}{2} < \mathrm{Re}\Big(\frac{1}{\lambda}\Big). \tag{6.135} \]
In addition, under the same assumptions, Lemma 6.5 shows that there exists at least one eigenvalue $\lambda$ of $\beta(I_n + \beta\hat{B})^{-1}B$ satisfying
\[ \frac{1}{2} < \frac{1}{2} + \frac{1-\beta\rho(B)}{\beta\rho(B)} \le \mathrm{Re}\Big(\frac{1}{\lambda}\Big). \tag{6.136} \]
Thus, the proof is finished.

7. Conclusion

In this paper, given a non-convex twice continuously differentiable cost function with Lipschitz continuous gradient, we have proved that all of BCGD, BMD and PBCD converge to a local minimizer, almost surely with random initialization. As a by-product, this affirmatively answers the open question raised in Lee et al. (2016) of whether mirror descent or block coordinate descent avoids saddle points. More importantly, our results hold true even for cost functions with non-isolated critical points, which generalizes the results in Panageas and Piliouras (2016) as well.

By similar arguments, it would be interesting to further investigate whether other methods, such as ADMM, and the variants of BCGD, BMD and PBCD for problems with specially structured constraints, admit similar results.
8. Acknowledgements
The authors would like to thank Prof. Zhi-Quan (Tom) Luo from The Chinese University of Hong Kong, Shenzhen, for the helpful discussions on this work.
9. Appendix
9.1. Proof of the diffeomorphism property of $g^{\alpha}_f$

Proof. Firstly, we prove that $g^{\alpha}_f$ with step size $\alpha < \frac{1}{L}$ is a diffeomorphism. The proof proceeds in the following four steps.

(a) We prove that $g^{\alpha}_f$ is injective from $\mathbb{R}^n$ to $\mathbb{R}^n$ for $\alpha < \frac{1}{L}$. Suppose that there exist $x$ and $y$ such that $g^{\alpha}_f(x) = g^{\alpha}_f(y)$. Then $x - y = \alpha U_1U_1^T(\nabla f(x) - \nabla f(y))$ and
\[ \|x - y\| = \alpha\big\|U_1U_1^T(\nabla f(x) - \nabla f(y))\big\| = \alpha\big\|U_1^T(\nabla f(x) - \nabla f(y))\big\| \le \alpha\|\nabla f(x) - \nabla f(y)\| \le \alpha L\|x - y\|, \tag{9.137} \]
where the second equality is due to $U_1^TU_1 = I_{n_1}$. Since $\alpha L < 1$, (9.137) implies $x = y$.

(b) To show that $g^{\alpha}_f$ is surjective, we construct an explicit inverse. Given a point $y \in \mathbb{R}^n$ with partition $y = \big(y_{(1)}^T, y_{(2)}^T, \ldots, y_{(p)}^T\big)^T$, define the $(n-n_1)$-dimensional vector
\[ y_{-(1)} \triangleq \big(y_{(2)}^T, \ldots, y_{(p)}^T\big)^T \tag{9.138} \]
and the function $\hat{f}(\cdot\,; y_{-(1)}) : \mathbb{R}^{n_1} \to \mathbb{R}$,
\[ \hat{f}(x_{(1)}; y_{-(1)}) \triangleq f\Big(\begin{pmatrix} x_{(1)} \\ y_{-(1)} \end{pmatrix}\Big), \]
which is determined by $f$ and the remaining block coordinates $y_{-(1)}$ of $y$. The proximal point mapping of $-\hat{f}(\cdot\,; y_{-(1)})$ centered at $y_{(1)}$ is then
\[ x_{y,(1)} = \arg\min_{x_{(1)}} \frac{1}{2}\big\|x_{(1)} - y_{(1)}\big\|^2 - \alpha\hat{f}(x_{(1)}; y_{-(1)}). \tag{9.139} \]
For $\alpha < \frac{1}{L}$, the objective above is strongly convex with respect to $x_{(1)}$, so there is a unique minimizer $x_{y,(1)}$. By the KKT condition,
\[ y_{(1)} = x_{y,(1)} - \alpha\nabla\hat{f}(x_{y,(1)}; y_{-(1)}) = x_{y,(1)} - \alpha U_1^T\nabla f\Big(\begin{pmatrix} x_{y,(1)} \\ y_{-(1)} \end{pmatrix}\Big), \tag{9.140} \]
where the second equality follows from the definition of $\hat{f}(\cdot\,; y_{-(1)})$. Define
\[ x_y \triangleq \begin{pmatrix} x_{y,(1)} \\ y_{-(1)} \end{pmatrix}, \tag{9.141} \]
where $x_{y,(1)}$ is given by (9.139). Accordingly,
\[ y = \begin{pmatrix} y_{(1)} \\ y_{-(1)} \end{pmatrix} = \begin{pmatrix} x_{y,(1)} - \alpha U_1^T\nabla f(x_y) \\ y_{-(1)} \end{pmatrix} = x_y - \alpha U_1U_1^T\nabla f(x_y) = g^{\alpha}_f(x_y), \]
where the first equality is due to the definition (9.138) of $y_{-(1)}$; the second follows from (9.140); and the third holds because of the definition (9.141) of $x_y$. Hence $x_y$ is mapped to $y$ by the mapping $g^{\alpha}_f$.

(c) In addition, since $\nabla^2 f(x)\,U_1U_1^T \in \mathbb{R}^{n\times n}$ and $U_1^T\nabla^2 f(x)\,U_1 \in \mathbb{R}^{n_1\times n_1}$, Theorem 1.3.20 in Horn and Johnson (1986) implies
\[ \mathrm{eig}\big(\nabla^2 f(x)\,U_1U_1^T\big) \subseteq \mathrm{eig}\big(U_1^T\nabla^2 f(x)\,U_1\big) \cup \{0\}. \tag{9.142} \]
Moreover, since
\[ U_1^T\nabla^2 f(x)\,U_1 = \frac{\partial^2 f(x)}{\partial x_{(1)}\,\partial x_{(1)}} \tag{9.143} \]
is the $n_1$-by-$n_1$ principal submatrix of $\nabla^2 f(x)$, it follows from Theorem 4.3.15 in Horn and Johnson (1986) that
\[ \mathrm{eig}\Big(\frac{\partial^2 f(x)}{\partial x_{(1)}^2}\Big) \subseteq \big[\lambda_{\min}(\nabla^2 f(x)),\, \lambda_{\max}(\nabla^2 f(x))\big] \subseteq [-L, L], \tag{9.144} \]
where the last inclusion holds because of Eq. (2.2) and Lemma 7 in Panageas and Piliouras (2016). Since $\alpha < \frac{1}{L}$, Eqs. (9.142), (9.143) and (9.144) imply that
\[ \mathrm{eig}\big(\alpha\nabla^2 f(x)\,U_1U_1^T\big) \subseteq (-1, 1). \tag{9.145} \]
Hence $Dg^{\alpha}_f(x) = I_n - \alpha\nabla^2 f(x)\,U_1U_1^T$ is invertible for $\alpha < \frac{1}{L}$.

(d) We have shown that $g^{\alpha}_f$ is a bijection and continuously differentiable. Since $Dg^{\alpha}_f(x)$ is invertible for $\alpha < \frac{1}{L}$, the inverse function theorem guarantees that $[g^{\alpha}_f]^{-1}$ is continuously differentiable. Thus, $g^{\alpha}_f$ is a diffeomorphism.

Secondly, it is obvious that similar arguments verify that the block maps $g^{\alpha}_{f,s}$, $s = 2, \ldots, p$, are also diffeomorphisms.
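The surjectivity construction in step (b) can be checked numerically. For a quadratic $f$ the stationarity condition (9.140) is linear in the first block, so the inverse point $x_{y,(1)}$ is available in closed form; the sketch below (ours, illustrative, with block sizes chosen arbitrarily) verifies that $g^{\alpha}_f(x_y) = y$:

```python
import numpy as np

rng = np.random.default_rng(5)

n, n1 = 6, 2                                   # first block has n1 coordinates
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(rng.uniform(-1, 1, n)) @ Q.T   # f(x) = 0.5 * x^T A x
alpha = 0.9 / np.linalg.norm(A, 2)             # alpha in (0, 1/L)

def g(x):                                      # one-block map: x - alpha*U1 U1^T grad f(x)
    out = x.copy()
    out[:n1] -= alpha * (A @ x)[:n1]
    return out

# Inverse via the strongly convex subproblem (9.139); for quadratic f the
# stationarity condition (9.140) reduces to a linear system in the first block.
y = rng.standard_normal(n)
x = y.copy()
x[:n1] = np.linalg.solve(np.eye(n1) - alpha * A[:n1, :n1],
                         y[:n1] + alpha * A[:n1, n1:] @ y[n1:])

print("||g(x_y) - y|| =", np.linalg.norm(g(x) - y))   # expected: ~1e-15
```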
9.2. Proof of the Jacobian identity (3.21)

Proof. First, define recursively
\[ G[s] \triangleq \frac{1}{\alpha}\Big[I_n - \prod_{t=s}^{1}(I_n - \alpha U_tA_t)\Big], \quad 1 \le s \le p, \tag{9.146} \]
where the product is taken in decreasing order of $t$; combined with (3.17) and (3.18), this implies $G = G[p]$. In addition, it is easily seen that
\[ U_s^T(I_n - \alpha U_tA_t) = U_s^T - \alpha U_s^TU_tA_t = U_s^T \tag{9.147} \]
whenever $s \ne t$. If $k < s$, then (9.147) yields
\[ U_s^TG[k] = \frac{1}{\alpha}U_s^T\Big[I_n - \prod_{t=k}^{1}(I_n - \alpha U_tA_t)\Big] = \frac{1}{\alpha}\Big[U_s^T - U_s^T\prod_{t=k}^{1}(I_n - \alpha U_tA_t)\Big] = 0. \tag{9.148} \]
Consequently, if $1 \le s < q \le p$, then
\[
U_s^T(\alpha G[q]) = U_s^T\Big[I_n - \prod_{t=q}^{1}(I_n - \alpha U_tA_t)\Big]
= U_s^TI_n - U_s^T\prod_{t=q}^{s+1}(I_n - \alpha U_tA_t)\prod_{t=s}^{1}(I_n - \alpha U_tA_t)
= U_s^TI_n - U_s^T\prod_{t=s}^{1}(I_n - \alpha U_tA_t)
= U_s^T\Big[I_n - \prod_{t=s}^{1}(I_n - \alpha U_tA_t)\Big]
= U_s^T\,\alpha G[s], \tag{9.149}
\]
where the third equality is due to (9.147) and the last uses the definition (9.146) of $G[s]$. In particular, taking $q = p$, the above equation becomes
\[ U_s^T\,\alpha G[s] = U_s^T\,\alpha G[p] = U_s^T\,\alpha G. \tag{9.150} \]
From (9.150), we further have
\[
\begin{aligned}
U_s^T(\alpha G) &= U_s^T\big[I_n - (I_n - \alpha G[s])\big]\\
&\overset{(a)}{=} U_s^T\big[I_n - (I_n - \alpha U_sA_s)(I_n - \alpha G[s-1])\big]\\
&= \alpha U_s^TG[s-1] + \alpha A_s - \alpha^2A_sG[s-1]\\
&\overset{(b)}{=} \alpha A_s - \alpha^2A_sG[s-1]\\
&\overset{(c)}{=} \alpha A_s - \alpha^2A_s\sum_{t=1}^{p}U_tU_t^TG[s-1]\\
&\overset{(d)}{=} \alpha A_s - \alpha^2A_s\sum_{t=1}^{s-1}U_tU_t^TG[s-1]\\
&\overset{(e)}{=} \alpha A_s - \alpha^2A_s\sum_{t=1}^{s-1}U_tU_t^TG[p]\\
&\overset{(f)}{=} \alpha A_s - \alpha^2A_s\sum_{t=1}^{s-1}U_tU_t^TG
= \alpha A_s - \alpha^2\sum_{t=1}^{s-1}A_sU_tU_t^TG
= \alpha A_s - \alpha^2\sum_{t=1}^{s-1}A_{st}U_t^TG,
\end{aligned}
\]
where (a) uses the definition (9.146) of $G[s]$; (b) is due to (9.148); (c) uses the definition (2.4) of $U_t$; (d) uses (9.148) again; (e) holds because of (9.149); and (f) follows from $G[p] = G$. Dividing both sides of the above equation by $\alpha$, we obtain (3.21).
9.3. Proof of Lemma 6.4

Proof. We divide the proof into two cases.

Case 1: $B$ is an invertible matrix. In this case, we clearly have
\[ \big((I_n + \beta\check{B})^{-1}B\big)^{-1} = B^{-1}(I_n + \beta\check{B}). \tag{9.151} \]
In what follows, we prove that (6.129) is true by using Lemma 9.2 in the Appendix. Firstly, we define an analytic function with $t$ as a parameter:
\[ \mathcal{X}_t(z) \triangleq \det\big\{zI_n - \big[(1-t)B^{-1} + tB^{-1}(I_n + \beta\check{B})\big]\big\} = \det\big\{zI_n - B^{-1}(I_n + t\beta\check{B})\big\}, \quad 0 \le t \le 1. \tag{9.152} \]
Moreover, define a closed rectangle in the complex plane as
\[ \mathcal{D} \triangleq \{a + bi \mid -2\nu \le a \le 0,\; -2\nu \le b \le 2\nu\}, \tag{9.153} \]
with $\nu$ defined below:
\[ \nu \triangleq \|B^{-1}\| + \frac{1}{\rho(B)}\|B^{-1}\check{B}\| \ge \|B^{-1}\| + t\beta\|B^{-1}\check{B}\| \ge \big\|B^{-1}(I_n + t\beta\check{B})\big\|, \quad \forall\,t \in [0,1], \tag{9.154} \]
where the first inequality holds because of $\beta \in \big(0, \tfrac{1}{\rho(B)}\big)$ and $t \in [0,1]$. The boundary $\partial\mathcal{D}$ of $\mathcal{D}$ consists of a finite number of smooth curves. Specifically, define
\[
\gamma_1 \triangleq \{a + bi \mid a = 0,\; -2\nu \le b \le 2\nu\},\qquad
\gamma_2 \triangleq \{a + bi \mid a = -2\nu,\; -2\nu \le b \le 2\nu\},
\]
\[
\gamma_3 \triangleq \{a + bi \mid -2\nu \le a \le 0,\; b = 2\nu\},\qquad
\gamma_4 \triangleq \{a + bi \mid -2\nu \le a \le 0,\; b = -2\nu\}; \tag{9.155}
\]
then
\[ \partial\mathcal{D} = \gamma_1\cup\gamma_2\cup\gamma_3\cup\gamma_4. \tag{9.156} \]
In order to apply Lemma 9.2, we show that
\[ \mathcal{X}_t(z) \ne 0, \quad \forall\,t \in [0,1],\; \forall\,z \in \partial\mathcal{D}. \tag{9.157} \]
On the one hand, since the spectral norm of a matrix is larger than or equal to its spectral radius, inequality (9.154) yields that, for any $t \in [0,1]$, every eigenvalue of $B^{-1}(I_n + t\beta\check{B})$ has magnitude at most $\nu$. Note that $|z| \ge 2\nu$ for an arbitrary $z \in \gamma_2\cup\gamma_3\cup\gamma_4$. Consequently,
\[ \mathcal{X}_t(z) \ne 0, \quad \forall\,t \in [0,1],\; \forall\,z \in \gamma_2\cup\gamma_3\cup\gamma_4. \tag{9.158} \]
On the other hand, it follows from Lemma 6.2 that, for any $t \in [0,1]$, if $\lambda$ is an eigenvalue of $B^{-1}(I_n + t\beta\check{B})$, then $\mathrm{Re}(\lambda) \ne 0$, so $\lambda \notin \gamma_1$; moreover, as noted at the beginning of the proof, $B^{-1}(I_n + t\beta\check{B})$ is invertible, so there are no zero eigenvalues in $\gamma_1$ either. That is,
\[ \mathcal{X}_t(z) \ne 0, \quad \forall\,t \in [0,1],\; \forall\,z \in \gamma_1. \tag{9.159} \]
Combining (9.158) and (9.159), we obtain (9.157).

As a result, it follows from Lemma 9.2 in the Appendix and Eqs. (9.152), (9.153) and (9.157) that $\mathcal{X}_0(z) = \det\{zI_n - B^{-1}\}$ and $\mathcal{X}_1(z) = \det\{zI_n - B^{-1}(I_n + \beta\check{B})\}$ have the same number of zeros in $\mathcal{D}$. Note that $\lambda_{\min}(B) < 0$, so $\tfrac{1}{\lambda_{\min}(B)} < 0$ is an eigenvalue of $B^{-1}$. Recalling the definition (9.154) of $\nu$, we know $\big|\tfrac{1}{\lambda_{\min}(B)}\big| \le \nu$, so $\tfrac{1}{\lambda_{\min}(B)}$ lies inside $\mathcal{D}$. In other words, the number of zeros of $\mathcal{X}_0(z)$ inside $\mathcal{D}$ is at least one, which in turn shows that the number of zeros of $\mathcal{X}_1(z)$ inside $\mathcal{D}$ is at least one as well. Thus, there must exist at least one eigenvalue of $B^{-1}(I_n + \beta\check{B})$ lying inside $\mathcal{D}$. Denoting it by $x + yi$, we have $-2\nu < x < 0$ and $-2\nu < y < 2\nu$. Consequently,
\[ \frac{1}{x + yi} = \frac{x - yi}{x^2 + y^2} \]
is an eigenvalue of $(I_n + \beta\check{B})^{-1}B$ with real part $\tfrac{x}{x^2+y^2} < 0$. Hence $\tfrac{1}{x+yi}$ lies in $\Omega$ defined by (6.130), and the proof is finished in this case.

Case 2: $B$ is a singular matrix. In this case, we apply a perturbation argument based on the result of Case 1 to prove (6.129). Suppose the multiplicity of the zero eigenvalue of $B$ is $m$ and $B$ has an eigenvalue decomposition of the form
\[ B = V\begin{pmatrix}\Theta & 0\\ 0 & 0\end{pmatrix}V^T = V_1\Theta V_1^T, \tag{9.160} \]
where $\Theta = \mathrm{Diag}(\theta_1, \theta_2, \ldots, \theta_{n-m})$, $\theta_s$, $s = 1, \ldots, n-m$, are the nonzero eigenvalues of $B$, and
\[ V = (V_1 \;\; V_2) \tag{9.161} \]
is an orthogonal matrix whose first $n-m$ columns form $V_1$. Denote
\[ \delta \triangleq \min\{|\theta_1|, |\theta_2|, \ldots, |\theta_{n-m}|\}. \tag{9.162} \]
For any $\epsilon \in (0, \delta)$, we define
\[ B(\epsilon) \triangleq B + \epsilon I_n; \tag{9.163} \]
then
\[ \mathrm{eig}(B(\epsilon)) = \{\theta_1+\epsilon, \theta_2+\epsilon, \ldots, \theta_{n-m}+\epsilon, \epsilon\} \not\ni 0, \quad \forall\,\epsilon \in (0, \delta), \tag{9.164} \]
and
\[ \lambda_{\min}(B(\epsilon)) = \lambda_{\min}(B) + \epsilon \le -\delta + \epsilon < 0, \quad \forall\,\epsilon \in (0, \delta), \tag{9.165} \]
where the first inequality is due to the definition of $\delta$ and $\min\{\theta_1, \ldots, \theta_{n-m}\} = \lambda_{\min}(B) < 0$.

Since $B$ is defined by (6.111), $B(\epsilon)$ has a $p\times p$ block form as well. Specifically,
\[ B(\epsilon) = (B(\epsilon)_{st})_{1\le s,t\le p}, \tag{9.166} \]
with $(s,t)$-th block
\[ B(\epsilon)_{st} = \begin{cases} B_{st} + \epsilon I_{n_s}, & s = t,\\ B_{st}, & s \ne t,\end{cases} \tag{9.167} \]
where $n_1, \ldots, n_p$ are $p$ positive integers satisfying $\sum_{s=1}^p n_s = n$. Similar to the definition (6.113) of $\check{B}$, we denote the strictly block lower triangular matrix based on $B(\epsilon)$ by
\[ \check{B}(\epsilon) \triangleq (\check{B}(\epsilon)_{st})_{1\le s,t\le p} \tag{9.168} \]
with $(s,t)$-th block
\[ \check{B}(\epsilon)_{st} = \begin{cases} B(\epsilon)_{st}, & s > t,\\ 0, & s \le t,\end{cases} = \begin{cases} B_{st}, & s > t,\\ 0, & s \le t,\end{cases} = \check{B}_{st}, \tag{9.169} \]
where the second equality holds because of (9.167) and the last is due to (6.114). It follows easily from Eqs. (6.113), (6.114), (9.168) and (9.169) that
\[ \check{B}(\epsilon) = \check{B}. \tag{9.170} \]
Consequently,
\[ \big(I_n + \beta\check{B}(\epsilon)\big)^{-1}B(\epsilon) = \big(I_n + \beta\check{B}\big)^{-1}B(\epsilon) = \big(I_n + \beta\check{B}\big)^{-1}(B + \epsilon I_n), \tag{9.171} \]
where the first equality is due to (9.170) and the second holds because of (9.163). For simplicity, let
\[ \lambda^{\beta}_s(\epsilon), \quad s = 1, \ldots, n, \tag{9.172} \]
be the eigenvalues of $(I_n + \beta\check{B})^{-1}(B + \epsilon I_n)$.

Note that for any $\epsilon \in (0, \delta)$, $B(\epsilon)$ is invertible and $\lambda_{\min}(B(\epsilon)) < 0$. Hence, for $B(\epsilon)$ and $\check{B}(\epsilon)$, a similar argument as in Case 1 applies, with the identifications $B(\epsilon) \sim B$, $\check{B}(\epsilon) \sim \check{B}$, $\beta \sim \beta$ and $\rho(B(\epsilon)) \sim \rho(B)$, to prove that, for any $\beta \in \big(0, \tfrac{1}{\rho(B(\epsilon))}\big)$, there must exist at least one eigenvalue of $(I_n + \beta\check{B}(\epsilon))^{-1}B(\epsilon)$ lying in $\Omega$ defined by (6.130). Taking into account the definition (9.163), we have $\rho(B(\epsilon)) \le \rho(B) + \epsilon$. Hence, for any $\epsilon \in (0, \delta)$ and $\beta \in \big(0, \tfrac{1}{\rho(B)+\epsilon}\big) \subseteq \big(0, \tfrac{1}{\rho(B(\epsilon))}\big)$, there exists at least one index $s(\epsilon) \in \{1, 2, \ldots, n\}$ such that
\[ \lambda^{\beta}_{s(\epsilon)}(\epsilon) \in \Omega. \tag{9.173} \]
Furthermore, it is well known that the eigenvalues of a matrix $M$ are continuous functions of the entries of $M$. Therefore, for any $\beta \in \big(0, \tfrac{1}{\rho(B)}\big)$, $\lambda^{\beta}_s(\epsilon)$ is a continuous function of $\epsilon$ and
\[ \lim_{\epsilon\to 0^+}\lambda^{\beta}_s(\epsilon) = \lambda^{\beta}_s(0), \quad s = 1, \ldots, n, \tag{9.174} \]
where $\lambda^{\beta}_s(0)$ is an eigenvalue of $(I_n + \beta\check{B})^{-1}B$.

In what follows, we prove that (6.129) holds true by contradiction. Suppose, for the sake of contradiction, that there exists a $\beta^* \in \big(0, \tfrac{1}{\rho(B)}\big)$ such that, for any $s \in \{1, \ldots, n\}$, we have
\[ \lambda^{\beta^*}_s(0) \notin \Omega, \tag{9.175} \]
where $\lambda^{\beta^*}_s(0)$ is an eigenvalue of $(I_n + \beta^*\check{B})^{-1}B$.

According to Lemma 9.3 in the Appendix and the assumption that the multiplicity of the zero eigenvalue of $B$ is $m$, the multiplicity of the eigenvalue $0$ of $(I_n + \beta^*\check{B})^{-1}B$ is $m$ as well. Then there are exactly $m$ eigenvalue functions of $\epsilon$ whose limits are $0$ as $\epsilon$ approaches zero from above. Without loss of generality, we assume
\[ \lim_{\epsilon\to 0^+}\lambda^{\beta^*}_s(\epsilon) = \lambda^{\beta^*}_s(0) = 0, \quad s = 1, \ldots, m, \tag{9.176} \]
and
\[ \lim_{\epsilon\to 0^+}\lambda^{\beta^*}_s(\epsilon) = \lambda^{\beta^*}_s(0) \ne 0, \quad s = m+1, \ldots, n. \tag{9.177} \]
Subsequently, under the assumption (9.175), we will prove that there exists a $\delta^* > 0$ such that, for any $\epsilon \in (0, \delta^*)$, $\beta^* \in \big(0, \tfrac{1}{\rho(B)+\epsilon}\big) \subseteq \big(0, \tfrac{1}{\rho(B(\epsilon))}\big)$ and there exists no $s \in \{1, \ldots, n\}$ such that $\lambda^{\beta^*}_s(\epsilon)$ belongs to $\Omega$. This contradicts (9.173). The proof proceeds in the following four steps.
Step (a): Under the assumption (9.175), we first prove that there exists a $\delta_1^* > 0$ such that, for any $\epsilon \in (0, \delta_1^*)$, $\beta^* \in \big(0, \tfrac{1}{\rho(B)+\epsilon}\big) \subseteq \big(0, \tfrac{1}{\rho(B(\epsilon))}\big)$ and there does not exist any $s \in \{m+1, \ldots, n\}$ such that $\lambda^{\beta^*}_s(\epsilon)$ belongs to $\Omega$.

Taking into account the definition of $\Omega$, Eq. (9.175) and $\beta^* \in \big(0, \tfrac{1}{\rho(B)}\big)$, we deduce that there exists a $\bar{\delta} > 0$ such that, for any $\epsilon \in (0, \bar{\delta})$,
\[ \beta^* \in \Big(0, \frac{1}{\rho(B)+\epsilon}\Big) \subseteq \Big(0, \frac{1}{\rho(B(\epsilon))}\Big) \tag{9.178} \]
and
\[ \mathrm{Re}\big(\lambda^{\beta^*}_s(0)\big) > 0, \quad \forall\,s \in \{m+1, \ldots, n\}. \tag{9.179} \]
Moreover, note that $\lambda^{\beta^*}_s(\epsilon)$ is a continuous function of $\epsilon$ and (9.177) holds. Combining this with the above inequalities, we know that there exists a $\delta_1^* > 0$ with $\delta_1^* \le \bar{\delta}$ such that
\[ \big|\lambda^{\beta^*}_s(\epsilon) - \lambda^{\beta^*}_s(0)\big| < \frac{1}{3}\mathrm{Re}\big(\lambda^{\beta^*}_s(0)\big), \quad \forall\,s \in \{m+1, \ldots, n\},\; \forall\,\epsilon \in [0, \delta_1^*], \tag{9.180} \]
which further means that
\[ 0 < \frac{2}{3}\mathrm{Re}\big(\lambda^{\beta^*}_s(0)\big) < \mathrm{Re}\big(\lambda^{\beta^*}_s(\epsilon)\big), \quad \forall\,s \in \{m+1, \ldots, n\},\; \forall\,\epsilon \in [0, \delta_1^*]. \tag{9.181} \]
Hence, we arrive at
\[ \lambda^{\beta^*}_s(\epsilon) \notin \Omega, \quad \forall\,s \in \{m+1, \ldots, n\},\; \forall\,\epsilon \in [0, \delta_1^*]. \tag{9.182} \]
Step (b): In this step, we prove that there exists a $\delta_2^* > 0$ such that, for any $\epsilon \in (0, \delta_2^*]$ and $s \in \{1, \ldots, m\}$,
\[ \mathrm{Re}\big(\lambda^{\beta^*}_s(\epsilon)\big) > 0, \tag{9.183} \]
which immediately implies that
\[ \lambda^{\beta^*}_s(\epsilon) \notin \Omega, \quad \forall\,s \in \{1, \ldots, m\},\; \forall\,\epsilon \in (0, \delta_2^*]. \tag{9.184} \]
For simplicity, let
\[ \check{C}_{ij} \triangleq V_i^T\check{B}V_j, \quad 1 \le i, j \le 2, \tag{9.185} \]
where $V_1$ and $V_2$ are given by (9.161).

In what follows, take an arbitrary $s \in \{1, \ldots, m\}$. Since $\lambda^{\beta^*}_s(\epsilon)$ is an eigenvalue of $(I_n + \beta^*\check{B})^{-1}(B + \epsilon I_n)$, we have, for any $\epsilon \in (0, \delta)$,
\[ \det\big\{(I_n + \beta^*\check{B})^{-1}(B + \epsilon I_n) - \lambda^{\beta^*}_s(\epsilon)I_n\big\} = 0, \]
which is clearly equivalent to
\[ \det\{I_n + \beta^*\check{B}\}\,\det\big\{(I_n + \beta^*\check{B})^{-1}(B + \epsilon I_n) - \lambda^{\beta^*}_s(\epsilon)I_n\big\} = 0, \quad \forall\,\epsilon \in (0, \delta). \tag{9.186} \]
It is easily seen from (9.186) that, for any $\epsilon \in (0, \delta)$,
\[
0 = \det\big\{(B + \epsilon I_n) - \lambda^{\beta^*}_s(\epsilon)(I_n + \beta^*\check{B})\big\}
= \det\Big\{\Big(V\begin{pmatrix}\Theta & 0\\ 0 & 0\end{pmatrix}V^T + \epsilon I_n\Big) - \lambda^{\beta^*}_s(\epsilon)(I_n + \beta^*\check{B})\Big\}
= \det\Big\{\begin{pmatrix}\Theta + \epsilon I_{n-m} & 0\\ 0 & \epsilon I_m\end{pmatrix} - \lambda^{\beta^*}_s(\epsilon)\big(I_n + \beta^*V^T\check{B}V\big)\Big\}
= \det\begin{pmatrix}\Theta + \epsilon I_{n-m} - \lambda^{\beta^*}_s(\epsilon)(I_{n-m} + \beta^*\check{C}_{11}) & -\lambda^{\beta^*}_s(\epsilon)\beta^*\check{C}_{12}\\[2pt] -\lambda^{\beta^*}_s(\epsilon)\beta^*\check{C}_{21} & \epsilon I_m - \lambda^{\beta^*}_s(\epsilon)(I_m + \beta^*\check{C}_{22})\end{pmatrix}, \tag{9.187}
\]
where the second equality is due to (9.160) and the last holds because of Eqs. (9.161) and (9.185). For brevity, write
\[ T(\epsilon) \triangleq \Theta + \epsilon I_{n-m} - \lambda^{\beta^*}_s(\epsilon)\big(I_{n-m} + \beta^*\check{C}_{11}\big). \]

Besides, recalling
\[ \lim_{\epsilon\to 0^+}\lambda^{\beta^*}_s(\epsilon) = \lambda^{\beta^*}_s(0) = 0, \tag{9.188} \]
we have
\[ \lim_{\epsilon\to 0^+}T(\epsilon) = \Theta, \tag{9.189} \]
which is invertible by (9.160). Clearly, (9.189) further implies that there exists a $\delta_1 > 0$ such that, for any $\epsilon \in [0, \delta_1]$,
\[ T(\epsilon) \tag{9.190} \]
is invertible as well, and
\[ \lim_{\epsilon\to 0^+}T(\epsilon)^{-1} = \Theta^{-1}. \tag{9.191} \]
Since the inverse $M^{-1}$ of a matrix $M$ is a continuous function of the elements of $M$, there exists a $\delta_2 > 0$ with $\delta_2 \le \delta_1$ such that, for any $\epsilon \in [0, \delta_2]$,
\[ \big\|T(\epsilon)^{-1}\big\| \le 2\big\|\Theta^{-1}\big\|. \tag{9.192} \]
Consequently, for any $\epsilon \in [0, \delta_2]$, it follows from (9.187) and (9.190) that
\[ 0 = \det\{T(\epsilon)\} \times \det\Big\{\epsilon I_m - \lambda^{\beta^*}_s(\epsilon)\big(I_m + \beta^*\check{C}_{22}\big) - (\beta^*)^2\big(\lambda^{\beta^*}_s(\epsilon)\big)^2\check{C}_{21}T(\epsilon)^{-1}\check{C}_{12}\Big\}, \tag{9.193} \]
which, combined with the invertibility of $T(\epsilon)$ (see (9.190)), means that
\[ \epsilon I_m - \lambda^{\beta^*}_s(\epsilon)\big(I_m + \beta^*\check{C}_{22}\big) - (\beta^*)^2\big(\lambda^{\beta^*}_s(\epsilon)\big)^2\check{C}_{21}T(\epsilon)^{-1}\check{C}_{12} \]
is singular. Therefore, there exists a nonzero vector $\upsilon(\epsilon) \in \mathbb{C}^m$ with $\|\upsilon(\epsilon)\| = 1$ satisfying
\[ \Big\{\epsilon I_m - \lambda^{\beta^*}_s(\epsilon)\big(I_m + \beta^*\check{C}_{22}\big) - (\beta^*)^2\big(\lambda^{\beta^*}_s(\epsilon)\big)^2\check{C}_{21}T(\epsilon)^{-1}\check{C}_{12}\Big\}\upsilon(\epsilon) = 0. \tag{9.194} \]
Moreover, for any $\epsilon \in [0, \delta_2]$, premultiplying both sides of the above equality by $(\upsilon(\epsilon))^H$, we obtain
\[ \epsilon - \lambda^{\beta^*}_s(\epsilon)\big(1 + \beta^*(\upsilon(\epsilon))^H\check{C}_{22}\upsilon(\epsilon)\big) = (\beta^*)^2\big(\lambda^{\beta^*}_s(\epsilon)\big)^2(\upsilon(\epsilon))^H\check{C}_{21}T(\epsilon)^{-1}\check{C}_{12}\upsilon(\epsilon). \tag{9.196} \]
Dividing both sides by $\lambda^{\beta^*}_s(\epsilon)$, which is nonzero for $\epsilon \in (0, \delta_2]$, we have, for any $\epsilon \in (0, \delta_2]$,
\[
0 \le \Big|\frac{\epsilon}{\lambda^{\beta^*}_s(\epsilon)} - \big(1 + \beta^*(\upsilon(\epsilon))^H\check{C}_{22}\upsilon(\epsilon)\big)\Big|
= \Big|(\beta^*)^2\lambda^{\beta^*}_s(\epsilon)\,(\upsilon(\epsilon))^H\check{C}_{21}T(\epsilon)^{-1}\check{C}_{12}\upsilon(\epsilon)\Big|
\le (\beta^*)^2\big|\lambda^{\beta^*}_s(\epsilon)\big|\,\|\check{C}_{21}\|\,\|T(\epsilon)^{-1}\|\,\|\check{C}_{12}\|
\le 2(\beta^*)^2\big|\lambda^{\beta^*}_s(\epsilon)\big|\,\|\check{C}_{21}\|\,\|\Theta^{-1}\|\,\|\check{C}_{12}\|, \tag{9.197}
\]
where the second inequality is due to $\|\upsilon(\epsilon)\| = 1$ and the Cauchy-Schwarz inequality, and the last follows from (9.192). As $\lim_{\epsilon\to 0^+}\lambda^{\beta^*}_s(\epsilon) = 0$, the right-hand side tends to zero. Hence
\[ \lim_{\epsilon\to 0^+}\Big|\frac{\epsilon}{\lambda^{\beta^*}_s(\epsilon)} - \big(1 + \beta^*(\upsilon(\epsilon))^H\check{C}_{22}\upsilon(\epsilon)\big)\Big| = 0. \tag{9.198} \]
Recall that
\[
\mathrm{Re}\big(1 + \beta^*(\upsilon(\epsilon))^H\check{C}_{22}\upsilon(\epsilon)\big)
= \mathrm{Re}\big(1 + \beta^*(\upsilon(\epsilon))^HV_2^T\check{B}V_2\upsilon(\epsilon)\big)
= \mathrm{Re}\big(1 + \beta^*(V_2\upsilon(\epsilon))^H\check{B}V_2\upsilon(\epsilon)\big)
\ge 1 - \beta^*\rho(B) > 0, \tag{9.199}
\]
where the first equality follows from (9.185); the second holds because $V_2$ is a real matrix (see (9.160)); the first inequality follows from Lemma 6.1 and $\|V_2\upsilon(\epsilon)\|^2 = (\upsilon(\epsilon))^HV_2^TV_2\upsilon(\epsilon) = (\upsilon(\epsilon))^H\upsilon(\epsilon) = 1$; and the last inequality holds thanks to $\beta^* \in \big(0, \tfrac{1}{\rho(B)}\big)$. Consequently, there exists a $\delta_3 > 0$ with $\delta_3 \le \delta_2$ such that, for any $\epsilon \in (0, \delta_3]$,
\[
\frac{1}{3}\big(1 - \beta^*\rho(B)\big)
\ge \Big|\frac{\epsilon}{\lambda^{\beta^*}_s(\epsilon)} - \big(1 + \beta^*(\upsilon(\epsilon))^H\check{C}_{22}\upsilon(\epsilon)\big)\Big|
\ge \Big|\mathrm{Re}\Big(\frac{\epsilon}{\lambda^{\beta^*}_s(\epsilon)}\Big) - \mathrm{Re}\big(1 + \beta^*(\upsilon(\epsilon))^H\check{C}_{22}\upsilon(\epsilon)\big)\Big|. \tag{9.200}
\]
The inequalities (9.199) and (9.200) imply that, for any $\epsilon \in (0, \delta_3]$,
\[ 0 < \frac{2}{3}\big(1 - \beta^*\rho(B)\big) \le \mathrm{Re}\Big(\frac{\epsilon}{\lambda^{\beta^*}_s(\epsilon)}\Big) = \frac{\epsilon}{\big|\lambda^{\beta^*}_s(\epsilon)\big|^2}\,\mathrm{Re}\big(\lambda^{\beta^*}_s(\epsilon)\big). \tag{9.201} \]
Since the above argument applies to any $s \in \{1, \ldots, m\}$, there exists a $\delta_2^* > 0$ such that
\[ 0 < \mathrm{Re}\big(\lambda^{\beta^*}_s(\epsilon)\big), \quad \forall\,s \in \{1, \ldots, m\},\; \forall\,\epsilon \in (0, \delta_2^*], \tag{9.202} \]
which further implies
\[ \lambda^{\beta^*}_s(\epsilon) \notin \Omega, \quad \forall\,s \in \{1, \ldots, m\},\; \forall\,\epsilon \in (0, \delta_2^*]. \tag{9.203} \]
Step (c): Combining (9.182) and (9.203), we arrive at
\[ \lambda^{\beta^*}_s(\epsilon) \notin \Omega, \quad \forall\,s \in \{1, \ldots, n\},\; \forall\,\epsilon \in (0, \delta_3^*], \tag{9.204} \]
where $\delta_3^* = \min\{\delta_1^*, \delta_2^*\}$.
Step (d): Let
\[ \delta^* = \min\{\delta, \delta_3^*\}. \tag{9.205} \]
Then, for any $\epsilon \in (0, \delta^*)$, we have
\[ \epsilon \in (0, \delta), \tag{9.206} \]
\[ \beta^* \in \Big(0, \frac{1}{\rho(B)+\epsilon}\Big) \subseteq \Big(0, \frac{1}{\rho(B(\epsilon))}\Big), \tag{9.207} \]
and
\[ \lambda^{\beta^*}_s(\epsilon) \notin \Omega, \quad \forall\,s \in \{1, \ldots, n\}, \tag{9.208} \]
where (9.206) uses the definition (9.205) of $\delta^*$; (9.207) is due to the definition (9.205) of $\delta^*$ (i.e., $\delta^* \le \delta_3^* \le \delta_1^* \le \bar{\delta}$) and (9.178); and (9.208) follows from the definition (9.205) of $\delta^*$ and (9.204). Clearly, this contradicts (9.173). Hence, we conclude that (6.129) holds true.

9.4. Proof of Lemma 4.1

Proof. We first prove that $\psi$ with step size $\alpha < \frac{\mu}{L}$ is a diffeomorphism. The proof proceeds in the following four steps.

(a) We first prove that $\psi$ is injective from $\mathbb{R}^n$ to $\mathbb{R}^n$ for $\alpha < \frac{\mu}{L}$. Suppose that there exist $x$ and $y$ such that $\psi(x) = \psi(y)$, which implies that
\[
\begin{cases}
[\nabla\varphi]^{-1}\big(\nabla\varphi(x_{(1)}) - \alpha U_1^T\nabla f(x)\big) = [\nabla\varphi]^{-1}\big(\nabla\varphi(y_{(1)}) - \alpha U_1^T\nabla f(y)\big),\\
x_{(t)} = y_{(t)}, \quad t = 2, \ldots, p.
\end{cases} \tag{9.209}
\]
Since Lemma 9.4 in the Appendix asserts that $\nabla\varphi$ is a diffeomorphism, $[\nabla\varphi]^{-1}$ is a diffeomorphism as well. Hence, (9.209) is equivalent to
\[
\begin{cases}
\nabla\varphi(x_{(1)}) - \alpha U_1^T\nabla f(x) = \nabla\varphi(y_{(1)}) - \alpha U_1^T\nabla f(y),\\
x_{(t)} = y_{(t)}, \quad t = 2, \ldots, p.
\end{cases} \tag{9.210}
\]
In particular, $\nabla\varphi(x_{(1)}) - \nabla\varphi(y_{(1)}) = \alpha U_1^T(\nabla f(x) - \nabla f(y))$ further implies that
\[
\|x_{(1)} - y_{(1)}\| \le \frac{1}{\mu}\|\nabla\varphi(x_{(1)}) - \nabla\varphi(y_{(1)})\| = \frac{\alpha}{\mu}\big\|U_1^T(\nabla f(x) - \nabla f(y))\big\| \le \frac{\alpha}{\mu}\|\nabla f(x) - \nabla f(y)\| \le \frac{\alpha L}{\mu}\|x - y\| = \frac{\alpha L}{\mu}\|x_{(1)} - y_{(1)}\|, \tag{9.211}
\]
where the first inequality is due to strong convexity (see (4.35)); the third inequality follows from (2.2); and the last equality holds because of (9.210). Since $\frac{\alpha L}{\mu} < 1$, (9.211) implies $x_{(1)} = y_{(1)}$. Combining this with (9.210), we have $x = y$.

(b) To show that $\psi$ is surjective, we construct an explicit inverse. Given a point $y \in \mathbb{R}^n$ with partition
\[ y = \big(y_{(1)}^T, y_{(2)}^T, \ldots, y_{(p)}^T\big)^T, \tag{9.212} \]
define the $(n-n_1)$-dimensional vector
\[ y_{-(1)} \triangleq \big(y_{(2)}^T, \ldots, y_{(p)}^T\big)^T \tag{9.213} \]
and the function $\bar{f}(\cdot\,; y_{-(1)}) : \mathbb{R}^{n_1} \to \mathbb{R}$,
\[ \bar{f}(x_{(1)}; y_{-(1)}) \triangleq f\Big(\begin{pmatrix} x_{(1)} \\ y_{-(1)} \end{pmatrix}\Big), \tag{9.214} \]
determined by $f$ and the remaining block coordinates $y_{-(1)}$ of $y$. Consider the problem
\[ \min_{x_{(1)}}\; B_{\varphi}(x_{(1)}, y_{(1)}) - \alpha\bar{f}(x_{(1)}; y_{-(1)}). \tag{9.215} \]
For $\alpha < \frac{\mu}{L}$, the objective above is strongly convex with respect to $x_{(1)}$, so (9.215) has a unique minimizer $x_{y,(1)}$. By the KKT condition,
\[ \nabla\varphi(y_{(1)}) = \nabla\varphi(x_{y,(1)}) - \alpha\nabla\bar{f}(x_{y,(1)}; y_{-(1)}), \tag{9.216} \]
which is equivalent to
\[ y_{(1)} = [\nabla\varphi]^{-1}\Big(\nabla\varphi(x_{y,(1)}) - \alpha U_1^T\nabla f\Big(\begin{pmatrix} x_{y,(1)} \\ y_{-(1)} \end{pmatrix}\Big)\Big). \tag{9.217} \]
Let
\[ x_y \triangleq \begin{pmatrix} x_{y,(1)} \\ y_{-(1)} \end{pmatrix}, \tag{9.218} \]
where $x_{y,(1)}$ is determined by (9.217). Accordingly,
\[
y = \begin{pmatrix} y_{(1)} \\ y_{-(1)} \end{pmatrix}
= \begin{pmatrix} [\nabla\varphi]^{-1}\big(\nabla\varphi(x_{y,(1)}) - \alpha U_1^T\nabla f(x_y)\big) \\ y_{-(1)} \end{pmatrix}
= \big(I_n - U_1U_1^T\big)x_y + U_1[\nabla\varphi]^{-1}\big(\nabla\varphi(x_{y,(1)}) - \alpha U_1^T\nabla f(x_y)\big)
= \psi(x_y), \tag{9.219}
\]
where the first equality is due to the definition (9.213) of $y_{-(1)}$; the second follows from (9.217); the third holds because of the definition (2.4) of $U_1$; and the last holds since $\psi$ is defined by (4.44). Hence $x_y$ is mapped to $y$ by the mapping $\psi$.

(c) In addition, recalling (4.47) and writing $w \triangleq [\nabla\varphi]^{-1}\big(\nabla\varphi(x_{(1)}) - \alpha U_1^T\nabla f(x)\big)$ for brevity, we have
\[ D\psi(x) = \big(I_n - U_1U_1^T\big) + U_1\big\{\nabla^2\varphi(w)\big\}^{-1}\big(\nabla^2\varphi(x_{(1)})\,U_1^T - \alpha U_1^T\nabla^2 f(x)\big). \]
Only the first block row of the second term is nonzero, so $D\psi(x)$ is block triangular and
\[ \mathrm{eig}(D\psi(x)) = \{1\} \cup \mathrm{eig}\Big(\big\{\nabla^2\varphi(x_{(1)}) - \alpha A_{11}\big\}\big\{\nabla^2\varphi(w)\big\}^{-1}\Big), \]
where $A_{11} = U_1^T\nabla^2 f(x)\,U_1$ (see (2.4) and (3.13)). Moreover,
\[ \nabla^2\varphi(x_{(1)}) - \alpha A_{11} \succeq \nabla^2\varphi(x_{(1)}) - \alpha LI_{n_1} \succ \nabla^2\varphi(x_{(1)}) - \frac{\mu}{L}LI_{n_1} = \nabla^2\varphi(x_{(1)}) - \mu I_{n_1} \succeq 0, \tag{9.220} \]
where the first inequality holds because $A_{11} \preceq LI_{n_1}$ (by (2.2) and Lemma 7 in Panageas and Piliouras (2016)); the second is due to $\alpha < \frac{\mu}{L}$; and the last follows from (4.35). Hence $\nabla^2\varphi(x_{(1)}) - \alpha A_{11}$ is an invertible matrix. Consequently, $D\psi(x)$ is an invertible matrix as well.
(d) We have shown that $\psi$ is a bijection and continuously differentiable. Since $D\psi(x)$ is invertible for $\alpha < \frac{\mu}{L}$, the inverse function theorem guarantees that $[\psi]^{-1}$ is continuously differentiable. Thus, $\psi$ is a diffeomorphism.

Secondly, it is obvious that similar arguments can be applied to verify that $\psi_s$, $s = 2, \ldots, p$, are also diffeomorphisms. Thus, the proof is completed.
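As a sanity check of Lemma 4.1 and Lemma 9.4 (our sketch, illustrative only), the code below uses the strongly convex mirror map $\varphi(u) = \frac{1}{2}\|u\|^2 + \frac{1}{4}\sum_i u_i^4$ (so $\mu = 1$), inverts $\nabla\varphi$ by a componentwise Newton iteration, and applies the first-block mirror step $\psi$; the test function and all names are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(6)

# Mirror map phi(u) = 0.5*||u||^2 + 0.25*sum(u^4): 1-strongly convex,
# with grad phi(u) = u + u^3 applied componentwise.
grad_phi = lambda u: u + u**3

def grad_phi_inv(v, iters=50):
    # Componentwise Newton for u + u^3 = v; the derivative 1 + 3u^2 >= 1
    # keeps the iteration well behaved.
    u = v / 2.0
    for _ in range(iters):
        u -= (u + u**3 - v) / (1.0 + 3.0 * u**2)
    return u

n, n1 = 5, 2
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(rng.uniform(-1, 1, n)) @ Q.T
grad_f = lambda x: A @ x
alpha = 0.5 / np.linalg.norm(A, 2)              # alpha < mu/L with mu = 1

def psi(x):                                     # first-block mirror step, cf. (4.44)
    out = x.copy()
    out[:n1] = grad_phi_inv(grad_phi(x[:n1]) - alpha * grad_f(x)[:n1])
    return out

x = rng.standard_normal(n)
u = grad_phi_inv(grad_phi(x))                   # round trip through the mirror map
print("||inv(grad_phi)(grad_phi(x)) - x|| =", np.linalg.norm(u - x))
print("psi(x) =", psi(x))
```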
9.5. Proof of Lemma 6.3

Proof. Let $\lambda$ be an eigenvalue of $(\beta B)^{-1}(I_n + t\beta\hat{B})$ and $\xi$ the corresponding eigenvector of unit length; then $\lambda \ne 0$ and
\[ (\beta B)^{-1}(I_n + t\beta\hat{B})\xi = \lambda\xi, \tag{9.221} \]
which is clearly equivalent to
\[ (I_n + t\beta\hat{B})\xi = \lambda(\beta B)\xi. \tag{9.222} \]
Premultiplying both sides by $\xi^H$, we arrive at
\[ 1 + t\beta\,\xi^H\hat{B}\xi = \lambda\,\xi^H(\beta B)\xi, \tag{9.223} \]
or equivalently,
\[ \lambda = \frac{1 + t\beta\,\xi^H\hat{B}\xi}{\beta\,\xi^HB\xi}. \tag{9.224} \]
Recalling that $0 < \beta < \tfrac{1}{\rho(B)}$ and $t \in [0,1]$, Lemma 6.1 gives $0 < \mathrm{Re}\big(1 + t\beta\,\xi^H\hat{B}\xi\big) < 2$. Combining this with the assumptions that $\mathrm{Re}(\lambda) > 0$ and that $B$ is a symmetric matrix (so $\xi^HB\xi$ is real), we have
\[ \beta\,\xi^HB\xi > 0. \tag{9.225} \]
We recall $\tilde{B}$ defined by (6.119):
\[ \tilde{B} = \mathrm{Diag}(B_{11}, B_{22}, \ldots, B_{pp}), \tag{9.226} \]
whose main diagonal blocks are the same as those of $B$. Therefore, $B$ admits the decomposition
\[ B = \hat{B} + \tilde{B} + \hat{B}^T. \tag{9.227} \]
In addition, Theorem 4.3.15 in Horn and Johnson (1986) implies
\[ -\rho(B)I_n \preceq \tilde{B} \preceq \rho(B)I_n. \tag{9.228} \]
Hence, if $t \in \big[\tfrac{1}{2}, 1\big]$, then
\[
\begin{aligned}
\mathrm{Re}(\lambda) - \frac{1}{2}
&= \frac{\mathrm{Re}\big(1 + t\beta\,\xi^H\hat{B}\xi\big)}{\beta\,\xi^HB\xi} - \frac{1}{2}
= \frac{2\,\mathrm{Re}\big(1 + t\beta\,\xi^H\hat{B}\xi\big) - \beta\,\xi^HB\xi}{2\beta\,\xi^HB\xi}\\
&= \frac{2 + 2t\beta\,\mathrm{Re}\big(\xi^H\hat{B}\xi\big) - \beta\,\xi^H\big(\hat{B} + \tilde{B} + \hat{B}^T\big)\xi}{2\beta\,\xi^HB\xi}
= \frac{2 + 2(t-1)\beta\,\mathrm{Re}\big(\xi^H\hat{B}\xi\big) - \beta\,\xi^H\tilde{B}\xi}{2\beta\,\xi^HB\xi}\\
&\ge \frac{2 - 2(1-t)\beta\rho(B) - \beta\rho(B)}{2\beta\,\xi^HB\xi}
\ge \frac{2 - 2\beta\rho(B)}{2\beta\rho(B)}
= \frac{1-\beta\rho(B)}{\beta\rho(B)} > 0, \tag{9.229}
\end{aligned}
\]
where the third equality is due to (9.227) and the fact that $\hat{B}$ is a real matrix; the first inequality follows from (6.118), (9.225) and (9.228); the second inequality holds because $t \in \big[\tfrac{1}{2}, 1\big]$ (so that $2(1-t)\beta\rho(B) \le \beta\rho(B)$) and $0 < \xi^HB\xi \le \rho(B)$; and the last inequality holds because of $\beta \in \big(0, \tfrac{1}{\rho(B)}\big)$. If $t \in \big[0, \tfrac{1}{2}\big]$, then
\[
\begin{aligned}
\mathrm{Re}(\lambda) - \frac{1}{2}
&= \frac{2\,\mathrm{Re}\big(1 + t\beta\,\xi^H\hat{B}\xi\big) - \beta\,\xi^HB\xi}{2\beta\,\xi^HB\xi}
= \frac{2 + 2t\beta\,\mathrm{Re}\big(\xi^H\hat{B}\xi\big) - \beta\,\xi^HB\xi}{2\beta\,\xi^HB\xi}\\
&\ge \frac{2 - 2t\beta\rho(B) - \beta\rho(B)}{2\beta\rho(B)}
\ge \frac{2 - 2\beta\rho(B)}{2\beta\rho(B)}
= \frac{1-\beta\rho(B)}{\beta\rho(B)} > 0, \tag{9.230}
\end{aligned}
\]
where the first inequality is due to (6.118), (9.225) and $0 < \xi^HB\xi \le \rho(B)$; the second holds because of $t \in \big[0, \tfrac{1}{2}\big]$; and $\beta \in \big(0, \tfrac{1}{\rho(B)}\big)$ implies the last. In both cases,
\[ \mathrm{Re}(\lambda) \ge \frac{1}{2} + \frac{1-\beta\rho(B)}{\beta\rho(B)} > \frac{1}{2}. \]
Thus, the proof is finished.
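The lower bound of Lemma 6.3 is also easy to observe numerically. In the sketch below (ours, $1\times 1$ blocks, names illustrative), every eigenvalue of $(\beta B)^{-1}(I_n + t\beta\hat{B})$ with positive real part stays above $\tfrac{1}{2} + \tfrac{1-\beta\rho(B)}{\beta\rho(B)}$:

```python
import numpy as np

rng = np.random.default_rng(7)

n = 6
B = rng.standard_normal((n, n)); B = (B + B.T) / 2   # symmetric, a.s. invertible
B_hat = np.triu(B, 1)                                # strictly upper triangular part
rho = np.abs(np.linalg.eigvalsh(B)).max()
beta = 0.8 / rho
bound = 0.5 + (1.0 - beta * rho) / (beta * rho)      # = 0.75 here

for t in np.linspace(0.0, 1.0, 11):
    M = np.linalg.solve(beta * B, np.eye(n) + t * beta * B_hat)
    pos = [z.real for z in np.linalg.eigvals(M) if z.real > 1e-6]
    assert all(re >= bound - 1e-9 for re in pos)

print("all eigenvalues with positive real part satisfy Re >=", bound)
```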
9.6. Proof of Lemma 6.5

Proof. We divide the proof into two cases.

Case 1: $B$ is an invertible matrix. In this case, we have
\[ \big(\beta(I_n + \beta\hat{B})^{-1}B\big)^{-1} = (\beta B)^{-1}\big(I_n + \beta\hat{B}\big), \tag{9.231} \]
which implies that
\[ \lambda \in \mathrm{eig}\big(\beta(I_n + \beta\hat{B})^{-1}B\big) \;\Leftrightarrow\; \frac{1}{\lambda} \in \mathrm{eig}\big((\beta B)^{-1}(I_n + \beta\hat{B})\big). \tag{9.232} \]
For clarity of notation, we use $\sigma$ to denote an eigenvalue of $(\beta B)^{-1}(I_n + \beta\hat{B})$. Hence, it suffices to prove that, for an arbitrary $\beta \in \big(0, \tfrac{1}{\rho(B)}\big)$, there is at least one eigenvalue $\sigma$ of $(\beta B)^{-1}(I_n + \beta\hat{B})$ such that
\[ \sigma \in \Xi(\beta, B). \tag{9.233} \]
Subsequently, we prove that (9.233) is true by using Lemma 9.2 in the Appendix. We first define an analytic function with $t$ as a parameter:
\[ \mathcal{X}_t(z) \triangleq \det\big\{zI_n - \big[(1-t)(\beta B)^{-1} + t(\beta B)^{-1}(I_n + \beta\hat{B})\big]\big\} = \det\big\{zI_n - (\beta B)^{-1}(I_n + t\beta\hat{B})\big\}, \quad 0 \le t \le 1. \tag{9.234} \]
In order to construct a closed region, we define
\[ \nu \triangleq \big\|(\beta B)^{-1}\big\| + \frac{1}{\rho(B)}\big\|(\beta B)^{-1}\hat{B}\big\| \ge \big\|(\beta B)^{-1}\big\| + t\beta\big\|(\beta B)^{-1}\hat{B}\big\| \ge \big\|(\beta B)^{-1}(I_n + t\beta\hat{B})\big\|, \quad \forall\,t \in [0,1], \tag{9.235} \]
where the first inequality holds because of $\beta \in \big(0, \tfrac{1}{\rho(B)}\big)$ and $t \in [0,1]$. Setting $t = 0$, the above also implies that
\[ \nu \ge \big\|(\beta B)^{-1}\big\| \ge \rho\big((\beta B)^{-1}\big) \ge \frac{1}{\beta\rho(B)} > \frac{1}{\beta\rho(B)} - \frac{1}{2} = \frac{1}{2} + \frac{1-\beta\rho(B)}{\beta\rho(B)}, \tag{9.236} \]
where the second inequality is due to the definitions of the spectral norm and the spectral radius, and the third follows from a standard property of the spectral radius.

Thus, given the above $\nu$ satisfying (9.235) and (9.236), we can define a closed rectangle
\[ \mathcal{D} \triangleq \Big\{a + bi \;\Big|\; \frac{1}{2} \le a \le 2\nu,\; -2\nu \le b \le 2\nu\Big\}, \tag{9.237} \]
a closed region in the complex plane whose boundary $\partial\mathcal{D}$ consists of a finite number of smooth curves. Specifically, define
\[
\gamma_1 \triangleq \Big\{a + bi \;\Big|\; a = \frac{1}{2},\; -2\nu \le b \le 2\nu\Big\},\qquad
\gamma_2 \triangleq \{a + bi \mid a = 2\nu,\; -2\nu \le b \le 2\nu\},
\]
\[
\gamma_3 \triangleq \Big\{a + bi \;\Big|\; \frac{1}{2} \le a \le 2\nu,\; b = 2\nu\Big\},\qquad
\gamma_4 \triangleq \Big\{a + bi \;\Big|\; \frac{1}{2} \le a \le 2\nu,\; b = -2\nu\Big\}; \tag{9.238}
\]
then
\[ \partial\mathcal{D} = \gamma_1\cup\gamma_2\cup\gamma_3\cup\gamma_4. \tag{9.239} \]
In order to apply Lemma 9.2, we show that
\[ \mathcal{X}_t(z) \ne 0, \quad \forall\,t \in [0,1],\; \forall\,z \in \partial\mathcal{D}. \tag{9.240} \]
On the one hand, since the spectral norm of a matrix is larger than or equal to its spectral radius, inequality (9.235) yields that, for any $t \in [0,1]$, every eigenvalue of $(\beta B)^{-1}(I_n + t\beta\hat{B})$ has magnitude at most $\nu$. Note that $|z| \ge 2\nu$ for an arbitrary $z \in \gamma_2\cup\gamma_3\cup\gamma_4$. Consequently,
\[ \mathcal{X}_t(z) \ne 0, \quad \forall\,t \in [0,1],\; \forall\,z \in \gamma_2\cup\gamma_3\cup\gamma_4. \tag{9.241} \]
On the other hand, if $\sigma$ is an eigenvalue of $(\beta B)^{-1}(I_n + t\beta\hat{B})$ for some $t \in [0,1]$ with $\mathrm{Re}(\sigma) > 0$, then Lemma 6.3 implies that
\[ \mathrm{Re}(\sigma) \ge \frac{1}{2} + \frac{1-\beta\rho(B)}{\beta\rho(B)} > \frac{1}{2}, \tag{9.242} \]
which immediately implies that
\[ \sigma \notin \gamma_1. \tag{9.243} \]
Hence,
\[ \mathcal{X}_t(z) \ne 0, \quad \forall\,t \in [0,1],\; \forall\,z \in \gamma_1. \tag{9.244} \]
Combining (9.241) and (9.244), we obtain (9.240).

As a result, it follows from Lemma 9.2 in the Appendix, (9.234), (9.237) and (9.240) that $\mathcal{X}_0(z) = \det\{zI_n - (\beta B)^{-1}\}$ and $\mathcal{X}_1(z) = \det\{zI_n - (\beta B)^{-1}(I_n + \beta\hat{B})\}$ have the same number of zeros in $\mathcal{D}$. Note that $\lambda_{\max}(B) > 0$, so $\tfrac{1}{\beta\lambda_{\max}(B)} > 0$ is an eigenvalue of $(\beta B)^{-1}$. Since $\tfrac{1}{\beta\lambda_{\max}(B)} \ge \tfrac{1}{\beta\rho(B)} > \tfrac{1}{2}$ and, by (9.235), $\tfrac{1}{\beta\lambda_{\max}(B)} \le \nu < 2\nu$, the point $\tfrac{1}{\beta\lambda_{\max}(B)}$ must lie inside $\mathcal{D}$. In other words, the number of zeros of $\mathcal{X}_0(z)$ inside $\mathcal{D}$ is at least one, which in turn shows that the number of zeros of $\mathcal{X}_1(z)$ inside $\mathcal{D}$ is at least one as well. Thus, there must exist at least one eigenvalue $\sigma_1$ of $(\beta B)^{-1}(I_n + \beta\hat{B})$ lying inside $\mathcal{D}$; in particular $\mathrm{Re}(\sigma_1) > \tfrac{1}{2}$. Moreover, Lemma 6.3 then gives $\mathrm{Re}(\sigma_1) \ge \tfrac{1}{2} + \tfrac{1-\beta\rho(B)}{\beta\rho(B)}$. Hence $\sigma_1$ lies in $\Xi(\beta, B)$ defined by (6.132), and the proof is finished in this case.

Case 2: $B$ is a singular matrix. In this case, we apply a perturbation argument based on the result of Case 1 to prove (6.131). Suppose the multiplicity of the zero eigenvalue of $B$ is $m$. For clarity of notation, we rewrite the eigenvalue decomposition (9.160) of $B$:
\[ B = V\begin{pmatrix}\Theta & 0\\ 0 & 0\end{pmatrix}V^T = V_1\Theta V_1^T, \tag{9.245} \]
where $\Theta = \mathrm{Diag}(\theta_1, \ldots, \theta_{n-m})$, $\theta_s$, $s = 1, \ldots, n-m$, are the nonzero eigenvalues of $B$, and
\[ V = (V_1 \;\; V_2) \tag{9.246} \]
is an orthogonal matrix whose first $n-m$ columns form $V_1$. Denote
\[ \delta \triangleq \min\{|\theta_1|, \ldots, |\theta_{n-m}|\}. \tag{9.247} \]
For any $\epsilon \in (-\delta, 0)$, we define
\[ B(\epsilon) \triangleq B + \epsilon I_n; \tag{9.248} \]
then
\[ \mathrm{eig}(B(\epsilon)) = \{\theta_1+\epsilon, \ldots, \theta_{n-m}+\epsilon, \epsilon\} \not\ni 0, \quad \forall\,\epsilon \in (-\delta, 0), \tag{9.249} \]
and
\[ \lambda_{\max}(B(\epsilon)) = \lambda_{\max}(B) + \epsilon \ge \delta + \epsilon > 0, \quad \forall\,\epsilon \in (-\delta, 0), \tag{9.250} \]
where the first inequality is due to the definition of $\delta$ and $\max\{\theta_1, \ldots, \theta_{n-m}\} = \lambda_{\max}(B) > 0$.

Since $B$ is defined by (6.111), $B(\epsilon)$ has a $p\times p$ block form as well:
\[ B(\epsilon) = (B(\epsilon)_{st})_{1\le s,t\le p}, \tag{9.251} \]
with $(s,t)$-th block
\[ B(\epsilon)_{st} = \begin{cases} B_{st} + \epsilon I_{n_s}, & s = t,\\ B_{st}, & s \ne t,\end{cases} \tag{9.252} \]
where $n_1, \ldots, n_p$ are $p$ positive integers satisfying $\sum_{s=1}^p n_s = n$. Similar to the definition (6.115), we denote the strictly block upper triangular matrix based on $B(\epsilon)$ by
\[ \hat{B}(\epsilon) \triangleq (\hat{B}(\epsilon)_{st})_{1\le s,t\le p} \tag{9.253} \]
with $(s,t)$-th block
\[ \hat{B}(\epsilon)_{st} = \begin{cases} B(\epsilon)_{st}, & s < t,\\ 0, & s \ge t,\end{cases} = \begin{cases} B_{st}, & s < t,\\ 0, & s \ge t,\end{cases} = \hat{B}_{st}, \tag{9.254} \]
where the second equality holds because of (9.252) and the last is due to (6.116). It follows easily from Eqs. (6.115), (6.116), (9.253) and (9.254) that
\[ \hat{B}(\epsilon) = \hat{B}. \tag{9.255} \]
Consequently,
\[ \beta\big(I_n + \beta\hat{B}(\epsilon)\big)^{-1}B(\epsilon) = \beta\big(I_n + \beta\hat{B}\big)^{-1}B(\epsilon) = \beta\big(I_n + \beta\hat{B}\big)^{-1}(B + \epsilon I_n), \]
where the first equality is due to (9.255) and the second holds because of (9.248). For simplicity, let
\[ \lambda^{\beta}_s(\epsilon), \quad s = 1, \ldots, n, \tag{9.256} \]
be the eigenvalues of $\beta(I_n + \beta\hat{B})^{-1}(B + \epsilon I_n)$.

Note that for any $\epsilon \in (-\delta, 0)$, $B(\epsilon)$ is invertible and $\lambda_{\max}(B(\epsilon)) > 0$. Hence, for $B(\epsilon)$ and $\hat{B}(\epsilon)$, a similar argument as in Case 1 applies, with the identifications $B(\epsilon) \sim B$, $\hat{B}(\epsilon) \sim \hat{B}$, $\beta \sim \beta$ and $\rho(B(\epsilon)) \sim \rho(B)$, to prove that, for any $\beta \in \big(0, \tfrac{1}{\rho(B(\epsilon))}\big)$, there must exist at least one eigenvalue $\lambda$ of $\beta(I_n + \beta\hat{B}(\epsilon))^{-1}B(\epsilon)$ with $\tfrac{1}{\lambda}$ lying in $\Xi(\beta, B(\epsilon))$, defined by (9.258) below. Taking into account the definition (9.248), we have $\rho(B(\epsilon)) \le \rho(B) + |\epsilon|$. Hence, for any $\epsilon \in (-\delta, 0)$ and $\beta \in \big(0, \tfrac{1}{\rho(B)+|\epsilon|}\big) \subseteq \big(0, \tfrac{1}{\rho(B(\epsilon))}\big)$, there exists at least one index $s(\epsilon) \in \{1, \ldots, n\}$ such that
\[ \frac{1}{\lambda^{\beta}_{s(\epsilon)}(\epsilon)} \in \Xi(\beta, B(\epsilon)), \tag{9.257} \]
where
\[ \Xi(\beta, B(\epsilon)) \triangleq \Big\{a + bi \;\Big|\; a, b \in \mathbb{R},\; \frac{1}{2} + \frac{1-\beta\rho(B(\epsilon))}{\beta\rho(B(\epsilon))} \le a,\; i = \sqrt{-1}\Big\}. \tag{9.258} \]
Furthermore, it is well known that the eigenvalues of a matrix $M$ are continuous functions of the entries of $M$. Therefore, for any $\beta \in \big(0, \tfrac{1}{\rho(B)}\big)$, $\lambda^{\beta}_s(\epsilon)$ is a continuous function of $\epsilon$ and
\[ \lim_{\epsilon\to 0^-}\lambda^{\beta}_s(\epsilon) = \lambda^{\beta}_s(0), \quad s = 1, \ldots, n, \tag{9.259} \]
where $\lambda^{\beta}_s(0)$ is an eigenvalue of $\beta(I_n + \beta\hat{B})^{-1}B$.

In what follows, we prove that (6.131) holds true by contradiction. Suppose, for the sake of contradiction, that there exists a $\beta^* \in \big(0, \tfrac{1}{\rho(B)}\big)$ such that, for any $s \in \{1, \ldots, n\}$ with $\lambda^{\beta^*}_s(0) \ne 0$,
\[ \frac{1}{\lambda^{\beta^*}_s(0)} \notin \Xi(\beta^*, B(0)) = \Xi(\beta^*, B), \tag{9.260} \]
where $\lambda^{\beta^*}_s(0)$ is an eigenvalue of $\beta^*(I_n + \beta^*\hat{B})^{-1}B$.

According to Lemma 9.3 in the Appendix and the assumption that the multiplicity of the zero eigenvalue of $B$ is $m$, the multiplicity of the eigenvalue $0$ of $\beta^*(I_n + \beta^*\hat{B})^{-1}B$ is $m$ as well. Then there are exactly $m$ eigenvalue functions of $\epsilon$ whose limits are $0$ as $\epsilon$ approaches zero from below. Without loss of generality, we assume
\[ \lim_{\epsilon\to 0^-}\lambda^{\beta^*}_s(\epsilon) = \lambda^{\beta^*}_s(0) = 0, \quad s = 1, \ldots, m, \tag{9.261} \]
and
\[ \lim_{\epsilon\to 0^-}\lambda^{\beta^*}_s(\epsilon) = \lambda^{\beta^*}_s(0) \ne 0, \quad s = m+1, \ldots, n. \tag{9.262} \]
Subsequently, under the assumption (9.260), we will prove that there exists a $\delta^* > 0$ with $\delta^* \le \delta$ such that, for any $\epsilon \in (-\delta^*, 0)$, $\beta^* \in \big(0, \tfrac{1}{\rho(B)+|\epsilon|}\big) \subseteq \big(0, \tfrac{1}{\rho(B(\epsilon))}\big)$ and there does not exist any $s \in \{1, \ldots, n\}$ such that $\tfrac{1}{\lambda^{\beta^*}_s(\epsilon)}$ belongs to $\Xi(\beta^*, B(\epsilon))$. This contradicts (9.257). The proof proceeds in the following four steps.
Step (a): Under the assumption (9.260), we first prove that there exists a $\delta_1^* > 0$ such that, for any $\epsilon \in (-\delta_1^*, 0)$, $\beta^* \in \big(0, \tfrac{1}{\rho(B)+|\epsilon|}\big) \subseteq \big(0, \tfrac{1}{\rho(B(\epsilon))}\big)$ and there does not exist any $s \in \{m+1, \ldots, n\}$ such that $\tfrac{1}{\lambda^{\beta^*}_s(\epsilon)}$ belongs to $\Xi(\beta^*, B(\epsilon))$.

Taking into account the definition of $\Xi(\beta^*, B(\epsilon))$, Eq. (9.260) and $\beta^* \in \big(0, \tfrac{1}{\rho(B)}\big)$, we deduce that there exists a $\bar{\delta} > 0$ such that, for any $\epsilon \in (-\bar{\delta}, 0)$,
\[ \beta^* \in \Big(0, \frac{1}{\rho(B)+|\epsilon|}\Big) \subseteq \Big(0, \frac{1}{\rho(B(\epsilon))}\Big) \tag{9.263} \]
and
\[ \mathrm{Re}\Big(\frac{1}{\lambda^{\beta^*}_s(0)}\Big) < \frac{1}{2} + \frac{1-\beta^*\rho(B(\epsilon))}{\beta^*\rho(B(\epsilon))}, \quad \forall\,s \in \{m+1, \ldots, n\}. \tag{9.264} \]
Moreover, $\tfrac{1}{\lambda^{\beta^*}_s(\epsilon)}$ is a continuous function of $\epsilon$ and (9.262) holds. Combining this with the above inequalities, we know that there exists a $\delta_1^* > 0$ with $\delta_1^* \le \bar{\delta}$ such that, for any $s \in \{m+1, \ldots, n\}$ and $\epsilon \in (-\delta_1^*, 0)$,
\[ \Big|\frac{1}{\lambda^{\beta^*}_s(\epsilon)} - \frac{1}{\lambda^{\beta^*}_s(0)}\Big| < \frac{1}{3}\Big[\frac{1}{2} + \frac{1-\beta^*\rho(B(\epsilon))}{\beta^*\rho(B(\epsilon))} - \mathrm{Re}\Big(\frac{1}{\lambda^{\beta^*}_s(0)}\Big)\Big], \]
which means that, for any $s \in \{m+1, \ldots, n\}$ and $\epsilon \in (-\delta_1^*, 0)$,
\[
\mathrm{Re}\Big(\frac{1}{\lambda^{\beta^*}_s(\epsilon)}\Big)
< \mathrm{Re}\Big(\frac{1}{\lambda^{\beta^*}_s(0)}\Big) + \frac{1}{3}\Big[\frac{1}{2} + \frac{1-\beta^*\rho(B(\epsilon))}{\beta^*\rho(B(\epsilon))} - \mathrm{Re}\Big(\frac{1}{\lambda^{\beta^*}_s(0)}\Big)\Big]
= \frac{1}{2} + \frac{1-\beta^*\rho(B(\epsilon))}{\beta^*\rho(B(\epsilon))} - \frac{2}{3}\Big[\frac{1}{2} + \frac{1-\beta^*\rho(B(\epsilon))}{\beta^*\rho(B(\epsilon))} - \mathrm{Re}\Big(\frac{1}{\lambda^{\beta^*}_s(0)}\Big)\Big]
< \frac{1}{2} + \frac{1-\beta^*\rho(B(\epsilon))}{\beta^*\rho(B(\epsilon))}.
\]
Hence, we arrive at
\[ \frac{1}{\lambda^{\beta^*}_s(\epsilon)} \notin \Xi(\beta^*, B(\epsilon)), \quad \forall\,s \in \{m+1, \ldots, n\},\; \forall\,\epsilon \in (-\delta_1^*, 0). \tag{9.265} \]
Step (b): In this step, by the same arguments as in Step (b) of the proof of Lemma 6.4 (see (9.185)-(9.203)), we can prove that there exists a $\delta_2^* > 0$ such that, for any $\epsilon \in (-\delta_2^*, 0)$ and $s \in \{1, \ldots, m\}$,
\[ \mathrm{Re}\big(\lambda^{\beta^*}_s(\epsilon)\big) < 0, \tag{9.266} \]
or equivalently,
\[ \mathrm{Re}\Big(\frac{1}{\lambda^{\beta^*}_s(\epsilon)}\Big) < 0, \tag{9.267} \]
which immediately implies that
\[ \frac{1}{\lambda^{\beta^*}_s(\epsilon)} \notin \Xi(\beta^*, B(\epsilon)), \quad \forall\,s \in \{1, \ldots, m\},\; \forall\,\epsilon \in (-\delta_2^*, 0). \tag{9.268} \]
Step (c): Combining (9.265) and (9.268), we have
\[ \frac{1}{\lambda^{\beta^*}_s(\epsilon)} \notin \Xi(\beta^*, B(\epsilon)), \quad \forall\,s \in \{1, \ldots, n\},\; \forall\,\epsilon \in (-\delta_3^*, 0), \tag{9.269} \]
where $\delta_3^* = \min\{\delta_1^*, \delta_2^*\}$.
Step (d): Let
\[ \delta^* = \min\{\delta, \delta_3^*\}. \tag{9.270} \]
Then, for any $\epsilon \in (-\delta^*, 0)$, we have
\[ \epsilon \in (-\delta, 0), \tag{9.271} \]
\[ \beta^* \in \Big(0, \frac{1}{\rho(B)+|\epsilon|}\Big) \subseteq \Big(0, \frac{1}{\rho(B(\epsilon))}\Big), \tag{9.272} \]
and
\[ \frac{1}{\lambda^{\beta^*}_s(\epsilon)} \notin \Xi(\beta^*, B(\epsilon)), \quad \forall\,s \in \{1, \ldots, n\}, \tag{9.273} \]
where (9.271) uses the definition (9.270) of $\delta^*$; (9.272) is due to the definition (9.270) of $\delta^*$ (i.e., $\delta^* \le \delta_3^* \le \delta_1^* \le \bar{\delta}$) and (9.263); and (9.273) follows from the definition (9.270) of $\delta^*$ and (9.269). Clearly, this contradicts (9.257). Hence, we conclude that Eq. (6.131) holds true.

Lemma 9.1 (Rouché's Theorem; Conway (1973)) Suppose $f$ and $g$ are meromorphic in a neighborhood of $\bar{B}(a; R)$ with no zeros or poles on the circle $\gamma = \{z : |z-a| = R\}$. If $Z_f$ and $Z_g$ ($P_f$ and $P_g$) are the numbers of zeros (poles) of $f$ and $g$ inside $\gamma$, counted according to their multiplicities, and if
\[ |f(z) + g(z)| < |f(z)| + |g(z)| \quad \text{on } \gamma, \]
then $Z_f - P_f = Z_g - P_g$.

Lemma 9.2 (Ostrowski (1958)) Assume $M, N \in \mathbb{C}^{n\times n}$ and define
\[ \mathcal{X}_t(z) \triangleq \det\{zI_n - [(1-t)M + tN]\}, \quad 0 \le t \le 1. \tag{9.274} \]
Moreover, suppose that $\mathcal{D}$ is a closed region whose boundary $\partial\mathcal{D}$ consists of a finite number of smooth curves. If
\[ \mathcal{X}_t(z) \ne 0, \quad \forall\,t \in [0,1],\; \forall\,z \in \partial\mathcal{D}, \tag{9.275} \]
then, for any $t_1, t_2 \in [0,1]$, $\mathcal{X}_{t_1}(z)$ and $\mathcal{X}_{t_2}(z)$ have the same number of zeros in $\mathcal{D}$, counted according to their multiplicities.

Proof. For any $t_0 \in [0,1]$, define
\[ u(z) \triangleq \mathcal{X}_{t_0}(z) \tag{9.276} \]
and
\[ p \triangleq \min_{z\in\partial\mathcal{D}}|\mathcal{X}_{t_0}(z)| > 0. \tag{9.277} \]
Moreover, for any $t_0 + \epsilon \in [0,1]$, define
\[ v(z) \triangleq \mathcal{X}_{t_0+\epsilon}(z) - \mathcal{X}_{t_0}(z). \tag{9.278} \]
Since $\mathcal{X}_t(z)$ is a continuous function on the closed set $[0,1]\times\partial\mathcal{D}$, the inequality
\[ \max_{z\in\partial\mathcal{D}}|v(z)| < p \tag{9.279} \]
holds whenever $|\epsilon|$ is sufficiently small. Therefore, the two single-valued analytic functions $u(z)$ and $v(z)$ on the region $\mathcal{D}$ satisfy
\[ |u(z)| > |v(z)|, \quad \forall\,z \in \partial\mathcal{D}. \tag{9.280} \]
According to Rouché's theorem (see Lemma 9.1), $u(z) + v(z) = \mathcal{X}_{t_0+\epsilon}(z)$ and $u(z) = \mathcal{X}_{t_0}(z)$ have the same number of zeros in $\mathcal{D}$. Hence, for any $t$ in a sufficiently small neighborhood of $t_0$, the number of zeros of $\mathcal{X}_t(z)$ in $\mathcal{D}$, denoted $N(t)$, is constant. Thereby $N(t)$ is a continuous function on $[0,1]$ taking only nonnegative integer values; therefore $N(t)$ is a constant function on $[0,1]$.
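The way Lemma 9.2 is used above can be mimicked numerically (an illustrative sketch of ours): moving along the homotopy $(1-t)M + tN$, as long as no eigenvalue touches the boundary of the region, here taken to be the imaginary axis, the eigenvalue count inside the region cannot change:

```python
import numpy as np

rng = np.random.default_rng(8)

n = 5
M = np.diag([-2.0, -1.5, -1.0, 1.0, 2.0])          # 3 eigenvalues in the left half-plane
N = M + 0.05 * rng.standard_normal((n, n))         # small perturbation of M

counts = set()
for t in np.linspace(0.0, 1.0, 51):
    eigs = np.linalg.eigvals((1 - t) * M + t * N)
    assert np.abs(eigs.real).min() > 1e-6          # no eigenvalue on the boundary
    counts.add(int((eigs.real < 0).sum()))

print("left-half-plane eigenvalue counts along the homotopy:", counts)   # expected: {3}
```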
Lemma 9.3 Assume that $B$ and $\hat{B}$ are defined by (6.111) and (6.115), respectively. Then, for any $\beta \in \big(0, \tfrac{1}{\rho(B)}\big)$, the multiplicity of the zero eigenvalue of $(I_n + \beta\hat{B})^{-1}B$ is the same as that of the zero eigenvalue of $B$.

Proof. Without loss of generality, suppose the multiplicity of the zero eigenvalue of $B$ is $m$ and $B$ has an eigenvalue decomposition of the form (9.160). Writing $\hat{C}_{ij} \triangleq V_i^T\hat{B}V_j$, $1 \le i, j \le 2$ (cf. (9.185)), the characteristic polynomial of $(I_n + \beta\hat{B})^{-1}B$ satisfies
\[
\begin{aligned}
\det\{I_n + \beta\hat{B}\}\,\det\Big\{(I_n + \beta\hat{B})^{-1}B - \lambda I_n\Big\}
&= \det\big\{B - \lambda(I_n + \beta\hat{B})\big\}\\
&= \det\Big\{V\begin{pmatrix}\Theta & 0\\ 0 & 0\end{pmatrix}V^T - \lambda(I_n + \beta\hat{B})\Big\}\\
&= \det\Big\{\begin{pmatrix}\Theta & 0\\ 0 & 0\end{pmatrix} - \lambda\big(I_n + \beta V^T\hat{B}V\big)\Big\}\\
&= \det\begin{pmatrix}\Theta - \lambda(I_{n-m} + \beta\hat{C}_{11}) & -\lambda\beta\hat{C}_{12}\\[2pt] -\lambda\beta\hat{C}_{21} & -\lambda(I_m + \beta\hat{C}_{22})\end{pmatrix}\\
&= \det\big\{\Theta - \lambda(I_{n-m} + \beta\hat{C}_{11})\big\}\,\det\Big\{-\lambda\big(I_m + \beta\hat{C}_{22}\big) - \lambda^2\beta^2\hat{C}_{21}\big[\Theta - \lambda(I_{n-m} + \beta\hat{C}_{11})\big]^{-1}\hat{C}_{12}\Big\}\\
&= \lambda^m\,\det\big\{\Theta - \lambda(I_{n-m} + \beta\hat{C}_{11})\big\}\,\det\Big\{-\big(I_m + \beta\hat{C}_{22}\big) - \lambda\beta^2\hat{C}_{21}\big[\Theta - \lambda(I_{n-m} + \beta\hat{C}_{11})\big]^{-1}\hat{C}_{12}\Big\}, \tag{9.281}
\end{aligned}
\]
where the second equality holds because of the eigenvalue decomposition (9.160) of $B$, and the fifth uses the Schur complement, valid for $\lambda$ near zero since $\Theta - \lambda(I_{n-m} + \beta\hat{C}_{11})$ tends to the invertible matrix $\Theta$ as $\lambda \to 0$.

Note that $\Theta$ is invertible, so zero is not a root of the polynomial $\det\{\Theta - \lambda(I_{n-m} + \beta\hat{C}_{11})\}$. In addition, assume that $\lambda_0$ is an eigenvalue of $I_m + \beta\hat{C}_{22}$ and $\upsilon \in \mathbb{C}^m$ is a corresponding eigenvector of unit length. Then
\[
\mathrm{Re}(\lambda_0) = \mathrm{Re}\big(\upsilon^H(I_m + \beta\hat{C}_{22})\upsilon\big) = 1 + \mathrm{Re}\big(\beta\,\upsilon^H\hat{C}_{22}\upsilon\big) = 1 + \mathrm{Re}\big(\beta\,(V_2\upsilon)^H\hat{B}V_2\upsilon\big) \ge 1 - \beta\rho(B) > 0, \tag{9.282}
\]
where the second equality follows from the definition of $\hat{C}_{22}$; the third holds because $V_2$ is a real matrix (see (9.160)); the first inequality follows from Lemma 6.1 and $\|V_2\upsilon\|^2 = (V_2\upsilon)^HV_2\upsilon = \upsilon^H\upsilon = 1$; and the last holds because $\beta \in \big(0, \tfrac{1}{\rho(B)}\big)$. The above inequality shows that the real parts of the eigenvalues of $I_m + \beta\hat{C}_{22}$ are nonzero; hence zero is not one of its eigenvalues, and the last determinant factor in (9.281) is nonzero at $\lambda = 0$.

The above analysis shows that neither $\det\{\Theta - \lambda(I_{n-m} + \beta\hat{C}_{11})\}$ nor the last factor in (9.281) vanishes at $\lambda = 0$. Combined with the expression (9.281), the multiplicity of the zero eigenvalue of $(I_n + \beta\hat{B})^{-1}B$ is exactly $m$.
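Lemma 9.3 can likewise be checked numerically (our sketch, $1\times 1$ blocks, names illustrative): a symmetric $B$ with an $m$-fold zero eigenvalue yields exactly $m$ near-zero eigenvalues of $(I_n + \beta\hat{B})^{-1}B$:

```python
import numpy as np

rng = np.random.default_rng(9)

n, m = 6, 2
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
B = Q @ np.diag([1.5, -1.0, 0.7, -0.4, 0.0, 0.0]) @ Q.T   # zero eigenvalue of multiplicity m = 2
B_hat = np.triu(B, 1)
rho = np.abs(np.linalg.eigvalsh(B)).max()
beta = 0.9 / rho

eigs = np.linalg.eigvals(np.linalg.solve(np.eye(n) + beta * B_hat, B))
print("number of (near-)zero eigenvalues:", int((np.abs(eigs) < 1e-8).sum()))   # expected: 2
```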
Lemma 9.4 Suppose that $\phi(x)$ is a strongly convex, twice continuously differentiable function with parameter $\sigma > 0$, i.e., for any $y, x \in \mathbb{R}^n$,
\[ \phi(x) \ge \phi(y) + \langle x - y, \nabla\phi(y)\rangle + \frac{\sigma}{2}\|x - y\|^2. \tag{9.283} \]
Then its gradient mapping $\nabla\phi : \mathbb{R}^n \to \mathbb{R}^n$ is a diffeomorphism.

Proof. Since $\phi$ is a $\sigma$-strongly convex function, it follows easily from (9.283) that
\[ \langle\nabla\phi(y) - \nabla\phi(x),\, y - x\rangle \ge \sigma\|y - x\|^2. \tag{9.284} \]
We first check that $\nabla\phi$ is injective from $\mathbb{R}^n$ to $\mathbb{R}^n$. Suppose that there exist $x$ and $y$ such that $\nabla\phi(x) = \nabla\phi(y)$. Then we would have
\[ \|y - x\|^2 \le \frac{1}{\sigma}\langle\nabla\phi(y) - \nabla\phi(x),\, y - x\rangle = 0, \]
which means $x = y$.

To show that the gradient map $\nabla\phi$ is surjective, we construct an explicit inverse $(\nabla\phi)^{-1}$ given by
\[ x_y = \arg\min_x\; \phi(x) - \langle y, x\rangle. \]
Since $\phi$ is $\sigma$-strongly convex, the function above is strongly convex with respect to $x$. Hence there is a unique minimizer, denoted $x_y$. By the KKT condition,
\[ \nabla\phi(x_y) = y. \]
Thus, $x_y$ is mapped to $y$ by the mapping $\nabla\phi$. Consequently, $\nabla\phi$ is a bijection; the assumptions also mean that it is continuously differentiable.

In addition, for any $x \in \mathbb{R}^n$,
\[ D\nabla\phi(x) = \nabla^2\phi(x) \succeq \sigma I_n \succ 0, \tag{9.285} \]
where the first inequality holds because of the $\sigma$-strong convexity of $\phi$ (Nesterov, 2004). Therefore $D\nabla\phi(x)$ is invertible, and the inverse function theorem guarantees that $(\nabla\phi)^{-1}$ is continuously differentiable. Thus $\nabla\phi$ is a diffeomorphism.

References
A. Auslender and M. Teboulle. Interior gradient and proximal methods for convex and conic optimization. SIAM Journal on Optimization, 16:697-725, 2006.

H. H. Bauschke, J. M. Borwein, and P. L. Combettes. Bregman monotone optimization algorithms. SIAM Journal on Control and Optimization, 42:596-636, 2006.

A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175, 2003.

A. Beck and L. Tetruashvili. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4):2037-2060, 2013.

L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200-217, 1967.

J. B. Conway. Functions of One Complex Variable I. Springer-Verlag, 1973.

O. Fercoq and P. Richtárik. Accelerated, parallel and proximal coordinate descent. SIAM Journal on Optimization, 25(4), 2013.

R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Proceedings of the Conference on Learning Theory (COLT), 2015.

M. W. Hirsch, C. C. Pugh, and M. Shub. Invariant Manifolds. Springer-Verlag, 1977.

M. Hong, M. Razaviyayn, Z.-Q. Luo, and J.-S. Pang. A unified algorithmic framework for block-structured optimization involving big data. arXiv preprint, 2015.

M. Hong, X. Wang, M. Razaviyayn, and Z.-Q. Luo. Iteration complexity analysis of block coordinate descent methods. Mathematical Programming, 163(1-2):85-114, 2017.

R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, New York, NY, USA, 1986. ISBN 0-521-30586-1.

A. Juditsky and A. S. Nemirovski. Deterministic and stochastic first order algorithms for large-scale convex optimization. YES Workshop on Stochastic Optimization and Online Learning, 2014.

R. Kleinberg, G. Piliouras, and É. Tardos. Multiplicative updates outperform generic no-regret learning in congestion games: extended abstract. In Proceedings of the Annual ACM Symposium on Theory of Computing, pages 533-542, 2009.

J. Lee, M. Simchowitz, M. I. Jordan, and B. Recht. Gradient descent converges to minimizers. In Proceedings of the Conference on Computational Learning Theory (COLT), pages 1246-1257, Columbia University, New York, USA, 2016.

J. M. Lee. Introduction to Smooth Manifolds. Springer Science and Business Media, 2013.

Y. Nesterov. Introductory Lectures on Convex Optimization, volume 87. Springer Science and Business Media, 2004.

Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.

A. Ostrowski. Über die Stetigkeit von charakteristischen Wurzeln in Abhängigkeit von den Matrizenelementen. Jahresbericht der Deutschen Mathematiker-Vereinigung, 60:40-42, 1958.

I. Panageas and G. Piliouras. Gradient descent converges to minimizers: The case of non-isolated critical points. arXiv preprint arXiv:1605.00405, 2016.

R. Pemantle. Nonconvergence to unstable points in urn models and stochastic approximations. The Annals of Probability, pages 698-712, 1990.

M. Shub. Global Stability of Dynamical Systems. Springer Science and Business Media, 1987.

S. Smale. Differentiable dynamical systems. Bulletin of the American Mathematical Society, 73(6):747-817, 1967.

M. Spivak. Calculus on Manifolds: A Modern Approach to Classical Theorems of Advanced Calculus. Benjamin, 1965.

M. Teboulle.