Convex Synthesis of Accelerated Gradient Algorithms
CARSTEN SCHERER∗ AND CHRISTIAN EBENBAUER†

Abstract.
We present a convex solution for the design of generalized accelerated gradient algorithms for strongly convex objective functions with Lipschitz continuous gradients. We utilize integral quadratic constraints and the Youla parameterization from robust control theory to formulate a solution of the algorithm design problem as a convex semi-definite program. We establish explicit formulas for the optimal convergence rates and extend the proposed synthesis solution to extremum control problems.
1. Introduction.
Accelerated gradient algorithms, also referred to as momentum methods, are considered to be among the most widely used optimization algorithms. These methods are applied, e.g., in control or artificial intelligence to train neural networks or to solve online optimization problems arising from receding horizon decision making.

From a control and dynamical systems perspective, accelerated algorithms can be viewed as a linear time-invariant discrete-time system in feedback with the gradient of the to-be-minimized function as a nonlinearity [25, 38, 5, 16, 12]. This perspective provides an immediate link to the so-called absolute stability or Lur'e problem in control and offers the possibility to apply advanced tools from robust control for the analysis and design of accelerated gradient algorithms. It has been shown, e.g., that the concept of integral quadratic constraints and so-called Zames-Falb multipliers makes it possible to recover the well-known bounds for the convergence rates of Nesterov's celebrated acceleration algorithm by semi-definite programming [16]. Moreover, by tuning the parameters of Nesterov's algorithm, these bounds can be improved to obtain the so-called triple momentum algorithm [34].

A more challenging task than the analysis of given algorithms is the design of novel algorithms with the help of convex optimization. In light of the relation to absolute stability and Lur'e problems, algorithm design falls into the area of robust feedback controller synthesis. Some recent works have addressed the synthesis problem (e.g. [17, 22, 10, 35]). However, so far it has been an open problem to formulate the general accelerated gradient algorithm design problem as a genuine convex optimization problem.
In fact, the aligned question of designing robust controllers by a convex search over the controller parameters and the multipliers to certify stability is likewise a long-standing open problem in its full generality.

In this paper, we present a convex solution for a general accelerated gradient algorithm synthesis problem based on semi-definite programming. Specifically, the main contributions are as follows. In Section 2, we reveal that a particular dynamical system structure is inherent to any convergent algorithm. This insight allows us to formulate the algorithm design problem in terms of a robust feedback controller synthesis problem in Section 3. We then show in Section 4 how the special structure of the system can be exploited to convexify the joint search for the algorithm parameters and the dynamic Zames-Falb multipliers which certify convergence. Our approach permits us to derive explicit formulas for the optimal convergence rate that is achievable by synthesis, in analogy to the analysis results for Nesterov's algorithm in [28]. In this fashion, we are able to prove that the convergence rate of the triple momentum algorithm is indeed optimal if the class of causal Zames-Falb multipliers is used to assure convergence.

Another key feature of our approach is its flexibility. We reveal in Section 5 that it extends to extremum control [2], in which the goal is to drive the output of a dynamical system to some steady-state condition at which a given cost function is minimized or maximized. In particular, we establish a fully convex synthesis approach to design extremum controllers with optimal convergence properties, even if the cost functions are structured.

Since we believe that the results in this paper are of interest to both the areas of control and optimization, we have written several sections in a tutorial fashion so that the results are accessible without a special background in robust control theory.
2. Algorithm analysis and design.

∗ Department of Mathematics, University of Stuttgart, Germany, [email protected]. Funded by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy - EXC 2075 - 390740016. We acknowledge the support by the Stuttgart Center for Simulation Science (SimTech).
† Institute for Systems Theory and Automatic Control, University of Stuttgart, Germany, [email protected].

2.1. Systems and Algorithms. Let S_{m,L} be the class of all C¹-functions f : R^d → R that are strongly convex with parameter m and have an L-Lipschitz gradient, where L > m > 0, i.e.,

[∇f(x) − ∇f(y)]^T (x − y) ≥ m‖x − y‖² and ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖

for all x, y ∈ R^d. We denote by S⁰_{m,L} the set of f ∈ S_{m,L} with ∇f(0) = 0. Any f ∈ S_{m,L} admits a unique global minimizer z∗ = argmin_{z∈R^d} f(z) ∈ R^d which is the solution of the equation ∇f(x) = 0. It is well-known that the sequence defined by the gradient descent algorithm

(2.1) z_{k+1} = z_k − α∇f(z_k)

for a fixed step-size α ∈ (0, 2/L) converges to z∗ linearly, i.e., there exist constants K and ρ ∈ (0, 1) such that ‖z_k − z∗‖ ≤ Kρ^k ‖z_0 − z∗‖ holds for all z_0 ∈ R^d and k ∈ N. The worst-case convergence rate is defined as the infimal ρ ∈ (0, 1) for which there exists some K such that linear convergence holds for all f ∈ S_{m,L}. This value depends on the algorithm parameter α and is denoted by ρ_wc(α). Determining upper bounds on ρ_wc(α) and finding an optimal choice for the algorithm parameter α which minimizes ρ_wc(α) has a long history in optimization theory [24].

From the perspective of control, (2.1) simply defines a nonlinear discrete-time dynamical system. Then k ∈ N denotes a time-instant and the sequence (z_k)_{k∈N} is the solution (state-trajectory) of the system. Moreover, z∗ just constitutes a constant trajectory of (2.1) and is, therefore, called an equilibrium (a fixed point) thereof. Linear convergence with rate ρ means that z∗ is globally exponentially stable with rate ρ. The worst-case convergence rate is defined by considering the whole family of systems parameterized by ∇f for f ∈ S_{m,L}. It is common in control that such a family of systems is interpreted as a single so-called uncertain dynamical system with an uncertainty ∇f ∈ ∇S_{m,L}. Also in this field there is a long tradition in estimating ρ_wc(α), which is termed robust stability analysis. Finding a parameter α which minimizes ρ_wc(α) or a tight upper bound thereof is then called robust stability synthesis.

In robust control, a particularly useful step is to separate the description of the known parts of the algorithm from the uncertainty ∇f. This just means to introduce the auxiliary signals x_k := z_k and

(2.2) w_k := ∇f(z_k),

which allows us to rewrite (2.1) as (2.2) together with

(2.3) (x_{k+1}; z_k) = (A B; C 0_d)(x_k; w_k)

where A = I_d, B = −αI_d, C = I_d, and I_d / 0_d denote the identity/zero matrix in R^{d×d}, respectively. By itself, (2.3) defines a linear time-invariant dynamical system that maps an initial condition x_0 ∈ R^d and some input sequence w = (w_k)_{k∈N} through the recursion (2.3) into the output sequence z = (z_k)_{k∈N}.
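As a quick numerical illustration of (2.1) and linear convergence, the following sketch runs gradient descent on a strongly convex quadratic; the values of m, L and the step-size are illustrative, with α = 2/(m + L) being the classical optimal choice on this class of quadratics:

```python
import numpy as np

# Gradient descent (2.1) on f(z) = 1/2 z^T Q z with gradient Q z; the
# strong convexity and Lipschitz constants m, L are the extreme eigenvalues
# of Q, and the minimizer is z* = 0. All numbers are illustrative.
m, L = 1.0, 10.0
Q = np.diag([m, L])                 # an f in S_{m,L}
alpha = 2.0 / (m + L)               # classical optimal step size
rho = (L - m) / (L + m)             # worst-case linear rate for this alpha

z = np.array([1.0, 1.0])
errs = []
for k in range(50):
    errs.append(np.linalg.norm(z))  # ||z_k - z*||
    z = z - alpha * Q @ z           # z_{k+1} = z_k - alpha * grad f(z_k)

# the error decays like rho^k (here with constant K = 1)
assert errs[-1] <= 1.1 * rho**49 * errs[0]
```

On this diagonal example the decay matches the rate ρ = (L − m)/(L + m) exactly, which is consistent with the worst-case bound ρ_wc(α) discussed above.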
The relation (2.2) alone is viewed as a static (nonlinear) system which maps the signal z into w. Considering (2.2)-(2.3) together means that the output (input) of (2.3) is set equal to the input (output) of (2.2). In control, this constitutes the feedback interconnection of (2.2) and (2.3) and motivates visualizing this feedback loop in an intuitive fashion by the block-diagram on the left in Fig. 1. In other words, the system (2.3) involves the algorithm parameters, while the feedback interconnection of (2.2) and (2.3) constitutes the algorithm itself in order to compute z∗ for a particular instance of f ∈ S_{m,L}. Exactly the same interconnection represents an algorithm with a variable step-size α_k if just replacing B = −αI_d in (2.3) with B_k = −α_k I_d, which turns (2.3) into a linear time-varying system.

Now consider (2.3) with general matrices A ∈ R^{n×n}, B ∈ R^{n×d}, C ∈ R^{d×n}. Then the interconnection of (2.2) and (2.3) takes the initial condition x_0 ∈ R^n as its input and generates the unique state- and output-responses (x_k)_{k∈N} and (z_k)_{k∈N} through the recursion

(2.4) x_{k+1} = Ax_k + B∇f(Cx_k), z_k = Cx_k.

The main goal of this work is to determine matrices A, B, C, if existing, by a semi-definite program such that the algorithm (2.4) achieves a given convergence rate ρ ∈ (0, 1) for given m, L and all objective functions f ∈ S_{m,L}.

Fig. 1. Feedback interconnection: the linear system x_{k+1} = Ax_k + Bw_k, z_k = Cx_k in a loop with the nonlinearity w_k = ∇f(z_k).

We work with an operator interpretation of (2.3) with general A ∈ R^{n×n}, B ∈ R^{n×m}, C ∈ R^{k×n} and replacing 0_d by D ∈ R^{k×m}. Moreover, we denote by l^n the vector space of signals x : N → R^n, while l₂^n is the subspace of all square summable sequences equipped with the inner product ⟨x, y⟩ := Σ_{k=0}^∞ x_k^T y_k and the norm ‖x‖ := √⟨x, x⟩ for x, y ∈ l₂^n. For w ∈ l^m and x_0 ∈ R^n, the recursion

(2.5) (x_{k+1}; z_k) = (A B; C D)(x_k; w_k)

defines unique state- and output-responses x ∈ l^n and z ∈ l^k, respectively; for a fixed x_0 (often taken to be 0), the resulting affine (linear) operator is denoted as

(2.6) z = [A B; C D] w

where we suppress the dependence on x_0. All throughout this paper, we reserve square brackets to denote the input-output operator (2.6) defined by the recursion (2.5); the partition lines in (2.6) are always displayed to separate A from the other matrix blocks in (2.5). Consequently, (2.2)-(2.3), (2.4), and

(2.7) w = ∇f(z), z = [A B; C 0] w

express one and the same interconnection as depicted by the block-diagram in Fig. 1. Analyzing the convergence properties of a general algorithm (2.4) then boils down to analyzing the stability properties of the feedback system (2.7). From now on we represent algorithms interchangeably by (2.3) or (2.6). Algorithm convergence means that the signal z in (2.7) converges to the minimizer of f for any initial condition x_0 ∈ R^n.

The analysis of stability of feedback interconnections constitutes one of the fundamental questions studied in control since its beginnings, with many traditional ideas nicely collected in the classical textbook [4]. Polyak was among the first to clearly emphasize the above sketched tight link between the two areas [25], see also [38, 5, 16, 12].
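The feedback reading of (2.4) is easy to mirror in code. A minimal sketch (the objective f and all numbers are illustrative) which also confirms that the loop with A = I_d, B = −αI_d, C = I_d reproduces plain gradient descent (2.1):

```python
import numpy as np

# The interconnection (2.7): the LTI system (2.3) in feedback with the
# gradient nonlinearity (2.2), iterated as in (2.4).
def algorithm(A, B, C, grad, x0, steps):
    x = x0.copy()
    for _ in range(steps):
        z = C @ x                      # output of the linear part
        w = grad(z)                    # static nonlinearity (2.2)
        x = A @ x + B @ w              # state update of (2.3)
    return C @ x

d, alpha = 2, 0.15
# gradient of a strongly convex f with m = 1, L = 1.5 (minimizer z* = 0)
grad = lambda z: z + 0.5 * np.tanh(z)
A, B, C = np.eye(d), -alpha * np.eye(d), np.eye(d)

# the loop above and the plain recursion (2.1) produce identical iterates
z = np.array([2.0, -1.0])
for _ in range(40):
    z = z - alpha * grad(z)
assert np.allclose(algorithm(A, B, C, grad, np.array([2.0, -1.0]), 40), z)
assert np.linalg.norm(z) < 1e-2        # iterates approach the minimizer 0
```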
By arguing with an analogy to mechanical systems, he suggested to replace (2.1) by the heavy-ball method which includes a damping or momentum term as in z_{k+1} = z_k − α∇f(z_k) + β(z_k − z_{k−1}) for some parameters β ∈ [0, 1) and α ∈ (0, 2(1 + β)/L). It is easy to check that the corresponding algorithm is (2.7) for the system matrices

(2.8) (A B; C) = ((1 + β)I_d −βI_d | −αI_d; I_d 0 | 0; (1 + γ)I_d −γI_d | 0)

with γ = 0. In [25] it is shown that the convergence rate is considerably improved over gradient descent, at the cost of sacrificing global algorithm convergence [16]. Nesterov's celebrated accelerated gradient descent algorithm corresponds to the choice γ = β in (2.8) with guaranteed global and fast convergence [24]. The more recently proposed triple momentum algorithm [34] relies on different values of the three parameters in (2.8) with the best-known convergence rate to date.

Let us conclude this section by recalling some basic notions for general linear systems (2.5) or (2.6). With an invertible matrix T ∈ R^{n×n}, a state-coordinate change for (2.5) is defined by ξ_k := T^{−1}x_k. It is easily seen that this transforms the quadruple (A, B, C, D) into (T^{−1}AT, T^{−1}B, CT, D). For a fixed input signal w and the initial conditions x_0 and ξ_0 = T^{−1}x_0, respectively, one can check that the output trajectories of the original and the transformed systems are identical; this is compactly expressed through

(2.9) [A B; C D] = [T^{−1}AT T^{−1}B; CT D].

Furthermore, the series interconnection of two systems

(2.10) y_1 = [A_1 B_1; C_1 D_1] u_1 and y_2 = [A_2 B_2; C_2 D_2] u_2

is defined by using the output signal of the second as an input to the first, which is reflected by u_1 = y_2 (and requires that the signal dimensions match). This is nothing but the composition of the two respective maps, which is as usual denoted as an operator product. It is elementary to verify that the series interconnection of the two systems (2.10) can be described by

(2.11) [A_1 B_1C_2 B_1D_2; 0 A_2 B_2; C_1 D_1C_2 D_1D_2] or [A_2 0 B_2; B_1C_2 A_1 B_1D_2; D_1C_2 C_1 D_1D_2].
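On quadratics f(z) = (q/2)z² the loop (2.4) with the parameterization (2.8) is linear with iteration matrix A + qBC, so its convergence rate is a spectral radius. A numerical sketch (d = 1; the step-size and momentum values are the standard textbook choices, not taken from this paper) comparing gradient descent with Nesterov-type tuning:

```python
import numpy as np

# Matrices (2.8): gradient descent is beta = gamma = 0, the heavy-ball
# method is gamma = 0, and Nesterov's method is gamma = beta.
def matrices(alpha, beta, gamma):
    A = np.array([[1 + beta, -beta], [1.0, 0.0]])
    B = np.array([[-alpha], [0.0]])
    C = np.array([[1 + gamma, -gamma]])
    return A, B, C

def worst_rate(alpha, beta, gamma, m, L):
    # worst spectral radius of A + q*B*C over curvatures q in [m, L]
    A, B, C = matrices(alpha, beta, gamma)
    return max(
        max(abs(np.linalg.eigvals(A + q * B @ C)))
        for q in np.linspace(m, L, 200)
    )

m, L = 1.0, 100.0
kappa = L / m
gd = worst_rate(2 / (m + L), 0.0, 0.0, m, L)
beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
nest = worst_rate(1 / L, beta, beta, m, L)

assert abs(gd - (L - m) / (L + m)) < 1e-6  # classical (kappa-1)/(kappa+1)
assert nest < gd                           # acceleration on quadratics
```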
In case of identical dimensions of the input and output signals in (2.10), the sum of the two maps is the so-called parallel interconnection given by

(2.12) [A_1 B_1; C_1 D_1] + [A_2 B_2; C_2 D_2] = [A_1 0 B_1; 0 A_2 B_2; C_1 C_2 D_1 + D_2].

Further, if D is invertible, the map (2.6) is invertible and its inverse can be represented with [A − BD^{−1}C BD^{−1}; −D^{−1}C D^{−1}].

The system (2.5) is called stable if A is a Schur matrix, i.e., all its eigenvalues are in absolute value strictly smaller than one. Moreover, (2.5) or the pair (A, B) is stabilizable if there exists a matrix M such that A + BM is Schur; similarly, (2.5) or (A, C) is detectable if there exists L such that A + LC is Schur.

Let us now get back to the algorithm (2.7). First, we settle that convergence enforces an important structural constraint on the parameters A, B, C. We start by stressing that there is no benefit in choosing systems (2.3) which are not detectable. Indeed, suppose (A, C) in (2.7) is not detectable. We then follow [42, Sec. 3.3] and perform a state-coordinate change to obtain

(A B; C) = (A_1 0 | B_1; A_21 A_2 | B_2; C_1 0 | 0)

where (A_1, C_1) is detectable. Due to the block structure of the matrices, the set of z-trajectories of (2.4) and of ξ_{k+1} = A_1ξ_k + B_1∇f(C_1ξ_k), z_k = C_1ξ_k are obviously identical. W.l.o.g. we can hence replace the non-detectable system in (2.7) by the one described with the triple (A_1, B_1, C_1) which is detectable.

Note that the least convergence requirement for the algorithm (2.2)-(2.3) is

(2.13) lim_{k→∞} z_k = z∗ and lim_{k→∞} w_k = 0

(for any initial condition x_0 ∈ R^n and some z∗ ∈ R^d). If (A, C) is detectable and we take L such that A + LC is Schur, we infer x_{k+1} = (A + LC)x_k + Bw_k − Lz_k. Therefore, (2.13) also implies the convergence of the state-trajectory x_k to some x∗ for k → ∞ with

(2.14) x∗ = Ax∗ and z∗ = Cx∗.

Most importantly, we now show that (2.13) enforces the following special structure of the algorithm parameters.
Theorem 2.1. Let (A, C) be detectable. If all trajectories of (2.4) satisfy (2.13) for all quadratic functions f ∈ S_{m,L} and all x_0 ∈ R^n, then A + BmC is Schur and there exist A_a, B_a, C_a, D_a such that

(2.15) [A B; C 0] = [A_a B_a; C_a D_a][I_d I_d; I_d 0].

If (A, C) has the structure induced by (2.15), then (2.14) has a unique solution x∗ ∈ R^n for every z∗ ∈ R^d.

Before entering the proof, let us interpret the structural property (2.15) in the state-space. By (2.11), it implies that there exists a state-coordinate change of (2.3) after which the algorithm (2.2)-(2.3) reads

(2.16) (ξ_{k+1}; η_{k+1}; z_k) = (I_d 0 I_d; B_a A_a 0; D_a C_a 0)(ξ_k; η_k; w_k), w_k = ∇f(z_k).

Proof.
We start by observing that, given z∗, there is at most one vector x∗ satisfying (2.14) since

(2.17) rank (A − I; C) = n.

Indeed, because (A, C) is detectable, we can take L such that A + LC is Schur; then (A − I)x = 0, Cx = 0 imply (A + LC − I)x = 0 and thus x = 0, because 1 is not an eigenvalue of A + LC.

Now take f ∈ S_{m,L} as f(z) = ½ z^T(mI_d)z − b^Tz with any b ∈ R^d. Then the trajectories of (2.2)-(2.3) satisfy

(2.18) x_{k+1} = (A + BmC)x_k − Bb, w_k = mCx_k − b.

Due to (2.13) we have argued above that x_k → x∗ for k → ∞ with x∗ satisfying (2.14); we also infer mCx∗ = b.

Let us first take b = 0. By m > 0 we get z∗ = Cx∗ = 0. Since x∗ = 0 satisfies (2.14), we infer (by uniqueness) that x_k → 0 for k → ∞ for any initial condition; hence A + BmC is Schur, which proves the first statement.

Now let b ∈ R^d be general. We then get b = mCx∗ = mC(A + BmC − I_n)^{−1}Bb for all b ∈ R^d and, therefore, I_d − mC(A + BmC − I_n)^{−1}B = 0. With a Schur complement argument [11], this implies

rank (A + BmC − I_n B; mC I_d) = n.

Yet another Schur complement argument shows

(2.19) rank(A − I_n) = n − d.

Now choose T_1 ∈ R^{n×d} with full column rank and T_1^T(A − I_n) = 0. Then T_1^TB ∈ R^{d×d} is invertible; otherwise there exists x ≠ 0 with x^TT_1^TB = 0 and thus (T_1x)^T(A + BmC) = x^TT_1^TA = x^TT_1^T = (T_1x)^T; because T_1x ≠ 0, we infer that 1 is an eigenvalue of A + BmC, which is a contradiction since the latter matrix is Schur. We can hence choose T_1 to also satisfy T_1^TB = I_d and pick T_2 ∈ R^{n×(n−d)} with T_2^TB = 0 such that T = (T_1 T_2) is invertible. We get T^TA = (I_d 0; B_a A_a)T^T and T^TB = (I_d; 0) for suitable matrices A_a ∈ R^{(n−d)×(n−d)}, B_a ∈ R^{(n−d)×d}. With (D_a C_a) := CT^{−T} we infer (2.15) since

(2.20) (T^TAT^{−T} T^TB; CT^{−T} 0) = (I_d 0 I_d; B_a A_a 0; D_a C_a 0).

For the system matrices in (2.16), we finally note that (2.14) is equivalent to

(2.21) (B_a A_a − I; D_a C_a) x∗ = (0; z∗)

and that the matrix in (2.21) is square. Since (2.21) has at most one solution x∗ (as shown at the beginning of the proof), we infer that the matrix in (2.21) is actually invertible, which proves the last statement.

We are now in the position to introduce the precise definition of algorithm convergence with rate ρ.

Definition 2.2. Let the system in (2.7) be detectable and admit the structure (2.15). For ρ ∈ (0, 1), algorithm (2.7) achieves ρ-convergence (for the class S_{m,L}) if there exists some K ≥ 0 such that

(2.22) ‖z_k − z∗‖ ≤ Kρ^k‖x_0 − x∗‖ for all k ∈ N,

for any f ∈ S_{m,L} with minimizer z∗ ∈ R^d, any x_0 ∈ R^n and any x∗ ∈ R^n satisfying (2.14). The infimum of all ρ ∈ (0, 1) such that (2.7) achieves ρ-convergence is the algorithm convergence rate and denoted as ρ_wc (with ρ_wc := ∞ if no such ρ exists).

Note that ρ-convergence is invariant under a state-coordinate change of (2.3).
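The necessary structure in Theorem 2.1 is easy to spot-check numerically: for the momentum parameterization (2.8) with d = 1 (illustrative parameters), A − I_n has rank n − d as in (2.19), and the rank condition (2.17) behind uniqueness of x∗ holds as well.

```python
import numpy as np

# Matrices (2.8) for d = 1; alpha, beta, gamma are illustrative values.
alpha, beta, gamma = 0.1, 0.7, 0.5
A = np.array([[1 + beta, -beta], [1.0, 0.0]])
C = np.array([[1 + gamma, -gamma]])
n, d = A.shape[0], C.shape[0]

# (2.19): the algorithm "contains" the integrator, so rank(A - I) = n - d
assert np.linalg.matrix_rank(A - np.eye(n)) == n - d

# (2.17): stacking A - I over C has full column rank n, so (2.14) has at
# most one solution x* for each z*
assert np.linalg.matrix_rank(np.vstack([A - np.eye(n), C])) == n
```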
Moreover, ρ-convergence implies but is stronger than the convergence property (2.13) for all trajectories of (2.7) with any f ∈ S_{m,L}.
Theorem 2.3. Let the system in (2.7) be detectable and admit the structure (2.15). If (2.7) achieves ρ-convergence for the class S⁰_{m,L}, then it achieves ρ-convergence for the full class S_{m,L} as well.

Proof. By assumption, there exists some K ≥ 0 such that the trajectories of

(2.23) (x̄_{k+1}; z̄_k) = (A B; C 0)(x̄_k; w̄_k), w̄_k = ∇f̄(z̄_k)

for any f̄ ∈ S⁰_{m,L} satisfy

(2.24) ‖z̄_k‖ ≤ Kρ^k‖x̄_0‖ for all k ∈ N.

Now take f ∈ S_{m,L} with minimizer z∗, any x_0 ∈ R^n and consider (2.16). By Theorem 2.1, (2.14) has a unique solution x∗ ∈ R^n which clearly satisfies (x∗; z∗) = (A B; C 0)(x∗; 0). Define x̄_k := x_k − x∗, z̄_k := z_k − z∗, w̄_k := w_k and f̄ := f(· + z∗). By linearity, this yields a trajectory of (2.23). Since f̄ ∈ S⁰_{m,L}, we infer that (2.24) is valid. This is clearly nothing but (2.22), as was to be shown.

In summary, Theorem 2.1 reveals that algorithm convergence requires that the related linear system "contains" a model of the so-called discrete-time integrator

(2.25) (η_{k+1}; y_k) = (I_d I_d; I_d 0)(η_k; u_k)

as a right factor. Conversely, by Theorem 2.3, if the algorithm parameters "contain" the integrator (2.25), ρ-convergence can be induced from ρ-convergence for f ∈ S_{m,L} with a minimizer located at the origin. From a control theory perspective, this is reminiscent of the so-called internal model principle [39].

2.2. ρ-Convergence. From now on we assume that the system in (2.7) is detectable and admits the structure (2.15). The next goal is to relate the question of bounding the algorithm convergence rate ρ_wc to a robust stability analysis problem. In view of Theorems 2.1 and 2.3, it suffices to confine the discussion to the class S⁰_{m,L}. We also map S⁰_{m,L} bijectively onto S⁰_{0,L−m} through f ↦ g where g(z) := f(z) − ½ z^T(mI_d)z for z ∈ R^d.
Then, (2.7) clearly just is the interconnection of (2.3) with w = ∇g(z) + mz. With the transformation

(2.26) (z̄; w̄) := (I_d 0; −mI_d I_d)(z; w),

this interconnection can as well be expressed by

(2.27) (x̄_{k+1}; z̄_k) = (A + BmC B; C 0)(x̄_k; w̄_k), w̄_k = ∇g(z̄_k).

Then ρ_wc is just equal to the convergence rate of (2.27) for the class S⁰_{0,L−m}. Next, for ρ ∈ (0, 1), introduce the multiplication operators ρ_± : l^n → l^n, x ↦ ρ_±(x) = (ρ^{±k}x_k)_{k∈N}, which are bijective and inverse to each other. Then

(2.28) x̌ := ρ_−(x̄), w̌ := ρ_−(w̄) and ž := ρ_−(z̄)

transform (2.27) into

(2.29) (x̌_{k+1}; ž_k) = (ρ^{−1}(A + BmC) ρ^{−1}B; C 0)(x̌_k; w̌_k), w̌_k = ρ^{−k}∇g(ρ^k ž_k).

These steps permit us to relate ρ_wc to a standard robust stability margin for the map x̌_0 ↦ ž defined by (2.29).

Lemma 2.4. Let ρ_rs be the infimal ρ ∈ (0, 1) for which there is a K ≥ 0 such that all trajectories of (2.29) with g ∈ S⁰_{0,L−m} satisfy ‖ž‖ ≤ K‖x̌_0‖. Then the convergence rate ρ_wc of algorithm (2.7) is equal to ρ_rs.

Proof. To show ρ_wc ≤ ρ_rs we can assume ρ_rs < ∞. Let ρ ∈ (ρ_rs, 1) and take any trajectory of (2.27). Then the ρ_−-transformed signals define a trajectory of (2.29) and we hence infer, by using the definition of ρ_rs, that ρ^{−k}‖z̄_k‖ = ‖ž_k‖ ≤ ‖ž‖ ≤ K‖x̌_0‖ = K‖x̄_0‖ for all k ∈ N. This proves ρ-convergence of (2.27) for the class S⁰_{0,L−m} and, therefore, ρ_wc ≤ ρ. Since ρ ∈ (ρ_rs, 1) was arbitrary, we conclude ρ_wc ≤ ρ_rs.

To see ρ_rs ≤ ρ_wc let ρ_wc < ∞ and take ρ ∈ (ρ_wc, 1). Choose ρ_0 ∈ (0, 1) with ρρ_0 ∈ (ρ_wc, ρ). This guarantees ρρ_0-convergence of (2.27) for the class S⁰_{0,L−m}. Hence, there exists some K̄ ≥ 0 with ‖(ρρ_0)^{−k}z̄_k‖ ≤ K̄‖x̄_0‖ and thus ‖ρ^{−k}z̄_k‖ ≤ K̄ρ_0^k‖x̄_0‖ for all k ∈ N. Then any trajectory of (2.29) can be transformed with ρ_+ back into one of (2.27) to get, with x̄_0 = x̌_0, that Σ_{k=0}^∞ ‖ž_k‖² = Σ_{k=0}^∞ ‖ρ^{−k}z̄_k‖² ≤ (Σ_{k=0}^∞ ρ_0^{2k}) K̄²‖x̌_0‖². We conclude ρ_rs ≤ ρ and, hence, ρ_rs ≤ ρ_wc.

In summary, computing tight upper bounds on the convergence rate ρ_wc of algorithm (2.7) is equivalent to determining tight upper bounds on the so-called robust stability margin ρ_rs for the interconnection (2.29) as defined in Lemma 2.4.

Remark 2.5. If ρ_rs < ∞ then ρ^{−1}(A + BmC) is Schur for all ρ ∈ (ρ_rs, 1).

2.3. Zames-Falb Multipliers. In this section we sketch how to compute effective bounds on the margin ρ_wc = ρ_rs by setting up a semi-definite program. This involves a family of so-called Zames-Falb multipliers. These are systems

(2.30) Π(Λ) := [A_f B_f; C_f(Λ) D_f(Λ)] := [0 I_d ⋯ 0 | 0; ⋮ ⋱ ⋮ | ⋮; 0 ⋯ I_d | 0; 0 ⋯ 0 | I_d; Λ_l Λ_{l−1} ⋯ Λ_1 | Λ_0]

(so that (Π(Λ)u)_k = Σ_{i=0}^l Λ_i u_{k−i} with u_j := 0 for j < 0) which are parameterized by a matrix tuple Λ in the set

Λ_ρ := { (Λ_l, …, Λ_1, Λ_0) ∈ (R^{d×d})^{l+1} | Λ_0 − diag(Λ_0) ≤ 0, Λ_i ≤ 0 for i = 1, …, l, (Σ_{i=0}^l Λ_i ρ^i) e ≥ 0, e^T (Σ_{i=0}^l Λ_i ρ^{−i}) ≥ 0 };

the inequalities are read elementwise and e ∈ R^d is the all-ones vector, while diag(A) ∈ R^{d×d} is the diagonal matrix whose diagonal is identical to that of A ∈ R^{d×d}.

Note that (2.30) is a so-called finite-impulse-response filter of length l and of dimension d × d; the latter two parameters are not displayed in Λ_ρ to lighten the notation. The parameters for which the blocks in Λ ∈ Λ_ρ are diagonally repeated are collected in

(2.31) Λ^r_ρ := Λ_ρ ∩ { (λ_l I_d, …, λ_1 I_d, λ_0 I_d) | λ_i ∈ R }.
The introduction of this family is motivated by the robust stability result in [22] for (2.29). This involves the following positivity property for the nonlinearity

(2.32) w̌ = ρ_−∇g(ρ_+ ž).

Lemma 2.6. Let g ∈ S⁰_{0,L−m} and Λ ∈ Λ^r_ρ. Then

(2.33) ⟨Π(Λ)z̃, w̃⟩ ≥ 0

holds for all ž ∈ l₂^d and the output w̌ of (2.32) with

(2.34) (z̃; w̃) := ((L − m)I_d −I_d; 0 I_d)(ž; w̌).

We emphasize that the response Π(Λ)z̃ is defined based on the state-space representation (2.30) with the state's initial condition taken as zero. From now on, we follow this convention in robust control for systems expressed in operator notation as in (2.6).

In systems theory, (2.33) is a so-called passivity property for filtered versions of the input and output signals of (2.32); it is also referred to as an integral quadratic constraint (IQC) [20]. The latter terminology emerged since such results are often formulated for continuous-time systems, for which the l₂-inner product (expressed in terms of sums) is replaced by the inner product on L₂[0, ∞) (involving integrals).

Guaranteeing robust stability of (2.29) involves a related negativity condition for the linear system

(2.35) ž = [ρ^{−1}(A + BmC) ρ^{−1}B; C 0] w̌.

By Remark 2.5, this system needs to be stable. As a consequence, if w̌ ∈ l₂^d is any input with finite l₂-norm, the response of (2.35) satisfies ž ∈ l₂^d. The following filtered strict negativity property then guarantees robust stability for (2.29) as defined in Lemma 2.4 and, thus, assures ρ_rs = ρ_wc ≤ ρ [22, Lemma 3, Theorem 4].
Theorem 2.7. Let ρ ∈ (0, 1). Then ρ_wc ≤ ρ is assured if ρ^{−1}(A + BmC) is Schur and if there exist Λ ∈ Λ^r_ρ, ε > 0 such that for any w̌ ∈ l₂^d and the response ž ∈ l₂^d of (2.35), the signals (2.34) satisfy

(2.36) ⟨Π(Λ)z̃, w̃⟩ ≤ −ε‖w̃‖².

In view of (2.34) and (2.35) and for any w̃ ∈ l₂^d, the trajectories (z̃, w̃) in (2.36) can as well be associated to the system z̃ = G̃w̃ with

(2.37) G̃ := [ρ^{−1}(A + BmC) ρ^{−1}B; (L − m)C −I_d].

If G denotes the series interconnection Π(Λ)G̃, then (2.36) just reads ⟨Gw̃, w̃⟩ ≤ −ε‖w̃‖² for all w̃ ∈ l₂^d, and G is also said to be strictly negative real. Theorem 2.7 just expresses that ρ_wc ≤ ρ is guaranteed by checking that G is stable and strictly negative real.

To verify these properties, we can use the following variant of the celebrated positive real lemma [1].
Lemma 2.8. Consider a system z = Gw with G = [A B; C D] and d inputs and outputs. Then the following statements are equivalent:
1. A is Schur and G is strictly negative real (SNR): There exists some ε > 0 with ⟨Gw, w⟩ ≤ −ε‖w‖² for all w ∈ l₂^d.
2. There exists some X ≻ 0 such that

(2.38) (A B; I 0; C D; 0 I)^T (X 0 0 0; 0 −X 0 0; 0 0 0 I; 0 0 I 0)(A B; I 0; C D; 0 I) ≺ 0.

In here, A ≻ 0 (A ≺ 0) means that the real matrix A is symmetric and positive (negative) definite. Lemma 2.8 allows us to translate stability and strict negative realness of an operator defined by a linear system into a convex finite-dimensional feasibility constraint, which takes the form of a linear matrix inequality (LMI) in the matrix variable X.

Recall that, in Theorem 2.7, this involves the series interconnection of Π(Λ) and G̃ in (2.37) with the state-space description

(2.39) Π(Λ)G̃ = [Ã B̃; C̃(Λ) D̃(Λ)] := [A_f B_f(L − m)C −B_f; 0 ρ^{−1}(A + BmC) ρ^{−1}B; C_f(Λ) D_f(Λ)(L − m)C −D_f(Λ)].

We observe that ρ^{−1}(A + BmC) is Schur iff this holds for Ã. Hence, Theorem 2.7 in combination with Lemma 2.8 leads to the following result.
Corollary 2.9. For ρ ∈ (0, 1), the convergence rate of algorithm (2.7) is bounded as ρ_wc ≤ ρ if there exist Λ ∈ Λ^r_ρ and X that satisfy the constraints

(2.40) X ≻ 0, •^T (X 0 0 0; 0 −X 0 0; 0 0 0 I; 0 0 I 0)(Ã B̃; I 0; C̃(Λ) D̃(Λ); 0 I) ≺ 0.

In (2.40) and later we use "•" as a placeholder for the matrix on the right to save space. For fixed l ∈ N and ρ ∈ (0, 1), the constraints (2.40) are LMIs in X and Λ. Note that Λ^r_ρ is as well described by LMI constraints on Λ. We have thus reduced the verification of ρ-convergence for a given algorithm to a convex feasibility test in terms of LMIs.

However, recall that the main goal of this paper is algorithm design. For a given ρ ∈ (0, 1), one then needs to decide whether there exist algorithm parameters rendering the resulting LMIs feasible, thus certifying ρ-convergence; if existing, one wishes to construct an algorithm based on some solution of the LMI. Recall that we parameterize algorithms by the matrices A_a, B_a, C_a, D_a in (2.16). In Corollary 2.9, these matrices enter the constraints (2.40) via (2.39) in a non-linear fashion. If using Corollary 2.9 for design, we end up with non-convex constraints if viewing the algorithm matrices A_a, B_a, C_a, D_a, the multiplier parameter Λ and the so-called Lyapunov matrix X as decision variables. As a preparation for overcoming this trouble, we recapitulate some essential insights into controller synthesis by LMIs next.
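For stable G, the strict negative realness in Lemma 2.8 is equivalent to the frequency-domain condition that G(e^{jω}) + G(e^{jω})^* is negative definite on the unit circle, which gives a cheap way to sanity-check SNR without solving the LMI. A scalar sketch with an illustrative realization (not taken from the paper):

```python
import numpy as np

# Scalar system G(z) = D + C*B/(z - A) = -1 + 0.5/(z - 0.2); A is Schur.
A, B, C, D = 0.2, 0.5, 1.0, -1.0
assert abs(A) < 1

# frequency-domain test: Re G(e^{jw}) < 0 for all w implies SNR
ws = np.linspace(0.0, np.pi, 2000)
G = D + C * B / (np.exp(1j * ws) - A)
assert np.max(G.real) < 0               # here max Re G = -0.375 at w = 0

# time-domain counterpart: <Gw, w> <= -eps ||w||^2 for finitely supported w
rng = np.random.default_rng(2)
for _ in range(20):
    w = rng.standard_normal(60)
    x, z = 0.0, []
    for wk in w:
        z.append(C * x + D * wk)
        x = A * x + B * wk
    assert np.dot(z, w) <= -0.3 * np.dot(w, w)
```

By Parseval's identity, ⟨Gw, w⟩ is bounded by max_ω Re G(e^{jω}) times ‖w‖², which is why the time-domain inequality holds here with any ε up to 0.375.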
3. Controller Synthesis with LMIs.
Feedback control for systems described by difference equations as in this paper can be abstractly formulated in terms of a given to-be-controlled system, the so-called generalized plant, as described by

(3.1) (z; y) = [A B_1 B; C_1 D E; C F 0](w; u).

This plant has two (vector-valued) input and output signals. Here u is the so-called control input with which the system is actuated, steered or manipulated. On the other hand, y is the so-called measurement output, which is viewed as the available information about the system for the purpose of controlling it. A controller then takes y as its input and generates the control action u as its output through

(3.2) u = [A_c B_c; C_c D_c] y.

The plant and controller form the so-called feedback interconnection, which can be expressed (after a simple calculation [32, Section 2]) as

(3.3) z = [A + BD_cC BC_c B_1 + BD_cF; B_cC A_c B_cF; C_1 + ED_cC EC_c D + ED_cF] w =: [A B; C D] w.

The closed-loop system is affected by the disturbance input w and responds with the controlled output z; these are the signals on which one imposes desired specifications which the controller should achieve. Foremost, controllers are required to (internally) stabilize the plant, i.e., they need to render the closed-loop matrix A Schur. Next to stabilization, many desired so-called performance properties on the map w ↦ z are expressed as

⟨(z; w), P(z; w)⟩ ≤ −ε‖w‖² for all w ∈ l₂^d

(where d is the number of components of w) and some ε > 0. Here P is an indefinite symmetric weighting matrix that is partitioned according to the signals z and w with the properties

P = (Q S; S^T R), Q ≽ 0 and det(P) ≠ 0.

The celebrated KYP lemma (see e.g. [42, 26]) can be used to show that the controller achieves both tasks iff there exists a Lyapunov matrix X that satisfies

(3.4) X ≻ 0, •^T (X 0 0 0; 0 −X 0 0; 0 0 Q S; 0 0 S^T R)(A B; I 0; C D; 0 I) ≺ 0.

Analyzing the desired properties of a fixed controller thus boils down to this convex feasibility test in X. If synthesizing a controller, we view the parameters of (3.2) as additional decision variables. However, (3.4) does not impose a convex constraint on both X and the controller matrices. Despite this trouble, the existence of a controller that achieves (3.4) can still be equivalently expressed as convex constraints.
Theorem 3.1. Let $U$ and $V$ be matrices whose columns form a basis of $\ker(C_2\ \ F)$ and $\ker(B_2^T\ \ E^T)$, respectively. Then there exist a controller (3.2) and an $X$ such that the closed-loop system (3.3) satisfies (3.4) iff there exist symmetric matrices $X$, $Y$ with
$$(3.5)\qquad \begin{pmatrix} X & I\\ I & Y\end{pmatrix} \succ 0,\qquad (\bullet)^T \begin{pmatrix} X & 0 & 0 & 0\\ 0 & -X & 0 & 0\\ 0 & 0 & Q & S\\ 0 & 0 & S^T & R\end{pmatrix}\begin{pmatrix} A & B_1\\ I & 0\\ C_1 & D\\ 0 & I\end{pmatrix} U \prec 0,\qquad (\bullet)^T \begin{pmatrix} Y & 0 & 0 & 0\\ 0 & -Y & 0 & 0\\ 0 & 0 & \tilde Q & \tilde S\\ 0 & 0 & \tilde S^T & \tilde R\end{pmatrix}\begin{pmatrix} I & 0\\ -A^T & -C_1^T\\ 0 & I\\ -B_1^T & -D^T\end{pmatrix} V \succ 0,$$
where $\tilde Q$, $\tilde R$, $\tilde S$ denote the blocks of the inverse $P^{-1}$.

Once the LMIs (3.5) are feasible, a constructive procedure to compute the controller matrices is found in [8], and the dimension of the resulting state matrix $A_c$ equals that of $A$. This result essentially appeared in the seminal works [9, 13] for $Q = I$, $R = -\gamma^2 I$, $S = 0$, related to the so-called $H_\infty$-control problem. The extensions to general performance indices have been suggested in [19, 30], while the current paper is aligned in notation with the exposition in [32, 33]. Here we only use
$$(3.6)\qquad Q = R = 0 \quad\text{and}\quad S = I.$$

Remark 3.2. A stabilizing controller for (3.1) exists iff $(A, B_2)$ is stabilizable and $(A, C_2)$ is detectable [42, Section 17.1].
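The stabilization part of the analysis test (3.4) can be probed numerically: a matrix is Schur exactly when the Stein equation has a positive definite solution, which the convergent series $X = \sum_k (\mathcal A^T)^k \mathcal A^k$ provides. The following is a minimal pure-Python sketch with an illustrative toy matrix (not taken from the paper):

```python
# Sanity check (toy example): for a Schur matrix calA, the Stein equation
# calA^T X calA - X = -I has the positive definite solution
# X = sum_k (calA^T)^k calA^k, certifying stability as required by (3.4).

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def madd(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def stein_solution(A, terms=300):
    """Partial sum of sum_k (A^T)^k A^k, which converges when A is Schur."""
    n = len(A)
    X = [[float(i == j) for j in range(n)] for i in range(n)]  # k = 0 term: I
    P = [row[:] for row in X]                                   # holds (A^T)^k A^k
    At = transpose(A)
    for _ in range(terms):
        P = matmul(At, matmul(P, A))
        X = madd(X, P)
    return X

calA = [[0.5, 0.3], [0.0, 0.4]]   # Schur: eigenvalues 0.5 and 0.4
X = stein_solution(calA)
```

For synthesis the matrix $\mathcal A$ depends on the controller, which is why (3.4) alone is non-convex and the reformulation of Theorem 3.1 is needed.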
4. Convexification of Algorithm Synthesis.

4.1. Algorithm Design by Controller Synthesis.
It is now a natural idea to exploit the general controller synthesis framework of Section 3 for algorithm design based on Corollary 2.9. In order to match (2.40) with (3.4), we choose (3.6) and express the state-space description of (2.39) as the interconnection of a suitable plant (3.1) in feedback with a controller which is determined through the algorithm parameters. In fact, a routine computation shows that the system (2.37) for $(A, B, C)$ from (2.16) is obtained as in (3.1)-(3.3) by the feedback interconnection of the plant
$$(4.1)\qquad \begin{pmatrix}\tilde z\\ y\end{pmatrix} = \left[\begin{array}{c|cc}\rho^{-1}I_d & \rho^{-1}I_d & \rho^{-1}mI_d\\ \hline 0 & -I_d & (L-m)I_d\\ I_d & -mI_d & 0\end{array}\right]\begin{pmatrix}\tilde w\\ u\end{pmatrix}$$
with the controller
$$(4.2)\qquad u = \begin{bmatrix}\rho^{-1}A_a & \rho^{-1}B_a\\ C_a & D_a\end{bmatrix}y.$$
Moreover, the weighted interconnection (2.39) is clearly obtained by closing the loop with the same controller (4.2) around the following filtered version of (4.1):
$$(4.3)\qquad \left[\begin{array}{cc|cc} A_f & 0 & -B_f & B_f(L-m)\\ 0 & \rho^{-1}I_d & \rho^{-1}I_d & \rho^{-1}mI_d\\ \hline C_f(\Lambda) & 0 & -D_f(\Lambda) & D_f(\Lambda)(L-m)\\ 0 & I_d & -mI_d & 0\end{array}\right].$$
Recall that $A_f$ is Schur. Therefore, the structure of $\tilde A$ in (2.39) shows that $\rho^{-1}(A + BmC)$ is Schur iff (4.2) stabilizes (4.3).

For some given $\Lambda \in \Lambda_\rho^r$, we can apply Theorem 3.1 in order to characterize the existence of algorithm parameters that achieve (2.40) for some $X$ as an LMI feasibility test. However, the joint search over $\Lambda \in \Lambda_\rho^r$ and $X$, $Y$ in the resulting inequalities remains non-convex. This is a commonly encountered obstruction in robust controller synthesis (see e.g. [36]), and it has also been noted for algorithm design in the recent paper [17]. Our progress over all existing results is to show how to perform such a simultaneous convex search in (2.40) over the algorithm variables $A_a$, $B_a$, $C_a$, $D_a$, the multiplier parameters $\Lambda$ and the Lyapunov matrix $X$.

Remark 4.1. Once some suitable $\Lambda \in \Lambda_\rho^r$ has been determined, the corresponding algorithm parameters can be directly computed on the basis of Theorem 3.1 as in [8].
Let us include an important structural remark at this point. All matrices involved in (4.3) and (3.6) admit the Kronecker structure $M \otimes I_d$ with suitable matrices $M$ and the standard Kronecker product "$\otimes$". This makes it possible to work w.l.o.g. with $X$ and $Y$ in Theorem 3.1 that admit such a structure, and the steps in [8] generate algorithm parameters that inherit this structure as well. This so-called dimensionality reduction [16] implies that the computational complexity of algorithm design is independent of $d$, and that it generates algorithms that can be applied in arbitrary dimensions $d \in \mathbb{N}$.

Recall that the feedback interconnection of (4.1) and (4.2) is called the closed-loop system. Let us start by showing that all such closed loops obtained with stabilizing controllers can be expressed as
$$(4.4)\qquad \{T_1 + T_2 Q T_3 \mid Q \in \mathcal{Q}\}$$
with the following set of stable systems:
$$(4.5)\qquad \mathcal{Q} := \left\{ Q = \begin{bmatrix} A_Q & B_Q\\ C_Q & D_Q \end{bmatrix} \;\middle|\; A_Q \text{ is Schur} \right\}.$$
For this so-called Youla parametrization we rely on [7, 42] and emphasize that the results directly carry over from continuous-time to discrete-time systems.
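The dimensionality-reduction argument rests on the fact that the Kronecker structure is preserved under products, so in particular $(M \otimes I_d)^k = M^k \otimes I_d$: all computations can be carried out on $M$ alone and lifted afterwards. A quick numerical illustration with an assumed toy matrix $M$ (not from the paper):

```python
# Toy check of the structure behind the dimensionality reduction:
# (M kron I_d)^k equals (M^k) kron I_d, so spectra and powers of the
# d-dependent matrices are determined by M alone, independently of d.

def kron_eye(M, d):
    """Kronecker product M (x) I_d for a square matrix M given as nested lists."""
    n = len(M)
    K = [[0.0] * (n * d) for _ in range(n * d)]
    for i in range(n):
        for j in range(n):
            for r in range(d):
                K[i * d + r][j * d + r] = M[i][j]
    return K

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matpow(A, k):
    n = len(A)
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        P = matmul(P, A)
    return P

M = [[0.9, 0.2], [-0.1, 0.7]]    # illustrative 2x2 block
d, k = 3, 8
lhs = matpow(kron_eye(M, d), k)  # power of the big, d-dependent matrix
rhs = kron_eye(matpow(M, k), d)  # power computed on M alone, then lifted
```

Since the two results coincide entrywise, certifying the Schur property (and solving the structured LMIs) for the lifted matrices reduces to a problem whose size does not grow with $d$.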
Lemma 4.2. The set of all systems (2.37), parameterized by $A_a$, $B_a$, $C_a$, $D_a$ such that $\rho^{-1}(A + BmC)$ is Schur, is equal to (4.4), where
$$\sigma := \frac{L}{m} - 1,\qquad T_1 = \left[\begin{array}{c|c} 0 & \rho^{-1}I_d\\ \hline -\sigma I_d & -I_d \end{array}\right],\qquad T_2 = (L-m)I_d,\qquad T_3 = \left[\begin{array}{cc|c} 0 & \rho^{-1}I_d & 0\\ 0 & 0 & \rho^{-1}I_d\\ \hline -I_d & I_d & 0 \end{array}\right].$$
Moreover, the correspondence between the algorithm parameters and $Q$ is given by
$$\begin{pmatrix} A_a & B_a\\ C_a & D_a \end{pmatrix} = \left(\begin{array}{cc|c} -mD_Q & -mC_Q & -mD_Q\\ \rho B_Q & \rho A_Q & \rho B_Q\\ \hline D_Q & C_Q & D_Q - m^{-1}I_d \end{array}\right).$$

Proof. In control, one associates with a linear system (2.6) its so-called transfer matrix $C(zI - A)^{-1}B + D$, whose entries are real rational and proper functions of $z$. A calculation shows that (4.1) has the transfer matrix
$$P(z) = \begin{pmatrix} P_{11}(z) & P_{12}(z)\\ P_{21}(z) & P_{22}(z) \end{pmatrix} = \begin{pmatrix} -I_d & (L-m)I_d\\ -mI_d + \frac{1}{\rho z - 1}I_d & \frac{m}{\rho z - 1}I_d \end{pmatrix}.$$
With
$$M(z) = \tilde M(z) := \frac{\rho z - 1}{\rho z} I_d,\qquad N(z) = \tilde N(z) := \frac{m}{\rho z} I_d$$
we then infer that $P_{22}(z) = N(z)M(z)^{-1}$. With $X(z) = \tilde X(z) := I_d$ and $Y(z) = \tilde Y(z) := -m^{-1}I_d$ we get the so-called double Bézout identity
$$\begin{pmatrix} \tilde X(z) & -\tilde Y(z)\\ -\tilde N(z) & \tilde M(z) \end{pmatrix}\begin{pmatrix} M(z) & Y(z)\\ N(z) & X(z) \end{pmatrix} = I.$$
This permits us to apply [7, Theorem 1 in Section 4.5]. Specifically, if defining
$$T(z) = P_{11}(z) + P_{12}(z)M(z)\tilde Y(z)P_{21}(z),\qquad \bar T_2(z) = P_{12}(z)M(z),\qquad \bar T_3(z) = \tilde M(z)P_{21}(z),$$
the set of all closed-loop transfer matrices that can be obtained with stabilizing controllers for (4.1) is given by $T(z) + \bar T_2(z)Q(z)\bar T_3(z)$, where $Q(z)$ varies in the set of all transfer matrices associated with the elements of $\mathcal Q$.

Since $M(z)$ is a multiple of the identity matrix, we infer $\bar T_2(z)Q(z) = P_{12}(z)Q(z)M(z)$ and hence $T(z) + \bar T_2(z)Q(z)\bar T_3(z) = T_1(z) + T_2 Q(z) T_3(z)$ with $T_1(z) := T(z)$ and $T_3(z) := M(z)\tilde M(z)P_{21}(z)$. Now note that $T_2 = (L-m)I_d$ and
$$(4.6)\qquad T_1(z) = -I_d - \frac{\sigma}{\rho z}\, I_d \qquad\text{and}\qquad T_3(z) = \frac{\rho z - 1}{(\rho z)^2}\, I_d,$$
which indeed admit the state-space representations given in the lemma.

According to [42, Theorem 12.17], the controller transfer matrix which corresponds to $Q(z)$ is obtained by feedback of $Q(z)$ with the plant whose transfer matrix is
$$\begin{pmatrix} Y(z)X(z)^{-1} & \tilde X(z)^{-1}\\ X(z)^{-1} & -X(z)^{-1}N(z) \end{pmatrix} = \begin{pmatrix} -m^{-1}I_d & I_d\\ I_d & -\frac{m}{\rho z}\, I_d \end{pmatrix}.$$
If $Q(z)$ is the transfer matrix of an element of (4.5), an elementary calculation shows that the related controller has the state-space description
$$\left(\begin{array}{cc|c} -\rho^{-1}mD_Q & -\rho^{-1}mC_Q & -\rho^{-1}mD_Q\\ B_Q & A_Q & B_Q\\ \hline D_Q & C_Q & D_Q - m^{-1}I_d \end{array}\right).$$
Matching this with (4.2) reveals the relation between $Q$ and the algorithm parameters as claimed.

Next we note that $\Pi(\Lambda)$, $T_1$, $T_2$, $T_3$ and $Q$ are all stable, which implies the very same property for
$$(4.7)\qquad \Pi(\Lambda)(T_1 + T_2 Q T_3)$$
due to (2.11) and (2.12). Just by combining Theorem 2.7 with Lemma 4.2, we infer that there exists an algorithm which achieves $\rho$-convergence if there exist $\Lambda \in \Lambda_\rho^r$ and $Q \in \mathcal Q$ such that (4.7) is SNR. The key step towards convexity is the parameter change $Z := \Pi(\Lambda)T_2 Q$, as shown in the next lemma.
Lemma 4.3. Let $\bar\Lambda := \begin{pmatrix} 0 & \cdots & 0 & I\end{pmatrix}$ and suppose that $\mathbf{\Lambda}$ with $\bar\Lambda \in \mathbf{\Lambda} \subset \Lambda_\rho$ is convex. Then the following statements are equivalent:
1. $\Pi(\Lambda)(T_1 + T_2 Q T_3)$ is SNR for some $\Lambda \in \mathbf{\Lambda}$, $Q \in \mathcal{Q}$.
2. $\Pi(\Lambda)T_1 + Z T_3$ is SNR for some $\Lambda \in \mathbf{\Lambda}$, $Z \in \mathcal{Q}$.

Proof. If $\Pi(\Lambda)(T_1 + T_2 Q T_3) = \Pi(\Lambda)T_1 + \Pi(\Lambda)T_2 Q T_3$ is SNR for some $Q \in \mathcal{Q}$, it suffices to observe that $Z := \Pi(\Lambda)T_2 Q \in \mathcal{Q}$, since $Z$ admits the description
$$\left[\begin{array}{cc|c} A_f & B_f(L-m)C_Q & B_f(L-m)D_Q\\ 0 & A_Q & B_Q\\ \hline C_f(\Lambda) & D_f(\Lambda)(L-m)C_Q & D_f(\Lambda)(L-m)D_Q \end{array}\right]$$
and $A_f$ as well as $A_Q$ are Schur. Hence 1. implies 2.

To show that 2. implies 1., pick $\Lambda \in \mathbf{\Lambda}$, $Z \in \mathcal{Q}$ such that $\langle w, (\Pi(\Lambda)T_1 + Z T_3)w\rangle \le -\varepsilon\|w\|^2$ for all $w \in l_2^d$ and some $\varepsilon > 0$. In a first step, we slightly perturb $\Lambda$ to $\Lambda + \bar\varepsilon\bar\Lambda$ in order to render $D_f(\Lambda + \bar\varepsilon\bar\Lambda) = D_f(\Lambda) + \bar\varepsilon D_f(\bar\Lambda)$ invertible; since $D_f(\bar\Lambda) = I$, this is indeed true for all sufficiently small $\bar\varepsilon > 0$. Since $T_1$ is stable, its $l_2$-induced operator norm $\|T_1\|$ is finite. By $\Pi(\bar\Lambda) = I$ we infer for all $w \in l_2^d$ that
$$\langle w, (\Pi(\Lambda + \bar\varepsilon\bar\Lambda)T_1 + Z T_3)w\rangle = \langle w, (\Pi(\Lambda)T_1 + Z T_3)w\rangle + \bar\varepsilon\langle w, T_1 w\rangle \le (-\varepsilon + \bar\varepsilon\|T_1\|)\|w\|^2.$$
All this permits us to fix some small $\bar\varepsilon \in (0, 1)$ such that $\tilde\Lambda := \Lambda + \bar\varepsilon\bar\Lambda \in \mathbf{\Lambda}$ (convexity), $D_f(\tilde\Lambda)$ is invertible and $\Pi(\tilde\Lambda)T_1 + Z T_3$ stays SNR. Therefore, $\Pi(\tilde\Lambda)^{-1}$ exists and can be expressed as
$$\begin{bmatrix} \tilde A_f & \tilde B_f\\ \tilde C_f & \tilde D_f \end{bmatrix} := \begin{bmatrix} A_f - B_f D_f(\tilde\Lambda)^{-1}C_f(\tilde\Lambda) & B_f D_f(\tilde\Lambda)^{-1}\\ -D_f(\tilde\Lambda)^{-1}C_f(\tilde\Lambda) & D_f(\tilde\Lambda)^{-1} \end{bmatrix}.$$
Next we show that $\tilde A_f$ is Schur. To this end we fix $\delta := \frac{L-m}{2} \in (0, L-m)$ and the map $g(x) := \frac{\delta}{2}x^Tx$. Then Lemma 2.6 is valid for the full class $\Lambda_\rho$ [22], and thus in particular for $\tilde\Lambda$. If $\check z \in l_2^d$, we note that (2.32) just gives $\check w = \delta\check z$, and we get $\tilde z = (L-m)\check z - \delta\check z = \delta\check z$ in (2.34); by Lemma 2.6 we hence conclude $\langle\Pi(\tilde\Lambda)\check z, \check z\rangle \ge 0$ for all $\check z \in l_2^d$; this shows
$$\langle\Pi(\tilde\Lambda)z, z\rangle = \langle\Pi(\Lambda)z, z\rangle + \bar\varepsilon\langle\Pi(\bar\Lambda)z, z\rangle \ge \bar\varepsilon\|z\|^2 \quad\text{for all } z \in l_2^d.$$
Since $A_f$ is Schur and $-\Pi(\tilde\Lambda)$, with a state-space description in terms of $(A_f, B_f, -C_f(\tilde\Lambda), -D_f(\tilde\Lambda))$, is SNR, Lemma 2.8 shows that there exists some $X \succ 0$ with
$$(4.8)\qquad (\bullet)^T \begin{pmatrix} X & 0 & 0 & 0\\ 0 & -X & 0 & 0\\ 0 & 0 & 0 & I\\ 0 & 0 & I & 0\end{pmatrix}\begin{pmatrix} A_f & B_f\\ I & 0\\ -C_f(\tilde\Lambda) & -D_f(\tilde\Lambda)\\ 0 & I\end{pmatrix} \prec 0.$$
Now we exploit again that $D_f(\tilde\Lambda)$ is invertible and perform a congruence transformation of (4.8) with
$$\begin{pmatrix} I & 0\\ -D_f(\tilde\Lambda)^{-1}C_f(\tilde\Lambda) & D_f(\tilde\Lambda)^{-1}\end{pmatrix}$$
to get
$$(\bullet)^T \begin{pmatrix} X & 0 & 0 & 0\\ 0 & -X & 0 & 0\\ 0 & 0 & 0 & I\\ 0 & 0 & I & 0\end{pmatrix}\begin{pmatrix} \tilde A_f & \tilde B_f\\ I & 0\\ 0 & -I\\ \tilde C_f & \tilde D_f\end{pmatrix} \prec 0.$$
By inspection, the left-upper block of this inequality reads $\tilde A_f^T X \tilde A_f - X \prec 0$. Because of $X \succ 0$, $\tilde A_f$ is indeed a Schur matrix.

Since $T_2$ is just a real invertible matrix, we can define $Q := T_2^{-1}\Pi(\tilde\Lambda)^{-1}Z$. We infer $Q \in \mathcal Q$, again just by using (2.11). Moreover, $\Pi(\tilde\Lambda)(T_1 + T_2 Q T_3) = \Pi(\tilde\Lambda)T_1 + Z T_3$ shows that $\Pi(\tilde\Lambda)(T_1 + T_2 Q T_3)$ is SNR.

Corollary 2.9 combined with Lemma 4.2 and Lemma 4.3 for $\mathbf{\Lambda} := \Lambda_\rho^r$ leads to the following result.
Corollary 4.4. With $\rho \in (0,1)$, there exists an algorithm whose convergence rate is bounded as $\rho_{wc} \le \rho$ if there exist $\Lambda \in \Lambda_\rho^r$ and $Z \in \mathcal{Q}$ such that $\Pi(\Lambda)T_1 + Z T_3$ is SNR.

Both $\Lambda_\rho^r$ and $\mathcal{Q}$ are convex, and $\Pi(\Lambda)T_1 + Z T_3$ is affine in $\Lambda$ and $Z$. Since the SNR property is a convex constraint, we have shown that the algorithm design problem is indeed convex as a feasibility problem over the infinite-dimensional space $\Lambda_\rho^r \times \mathcal{Q}$.

Testing whether there exist $\Lambda \in \Lambda_\rho^r$ and $Z \in \mathcal{Q}$ for which $\Pi(\Lambda)T_1 + Z T_3$ is SNR can even be turned into a finite-dimensional convex feasibility problem. Towards this end, we represent $Z \in \mathcal{Q}$ as
$$(4.9)\qquad Z = \begin{bmatrix} A_Z & B_Z\\ C_Z & D_Z\end{bmatrix}$$
and use (3.1)-(3.3) to see that $\Pi(\Lambda)T_1 + Z T_3$ results from the feedback interconnection of the plant (3.1) with the controller $u = Zy$ for
$$(4.10)\qquad \left[\begin{array}{c|cc} A & B_1 & B_2\\ \hline C_1(\Lambda) & D_1(\Lambda) & E\\ C_2 & F & 0\end{array}\right] := \left[\begin{array}{ccc|cc} A_f & -\sigma B_f & 0 & -B_f & 0\\ 0 & 0 & 0 & \rho^{-1}I_d & 0\\ 0 & \rho^{-1}I_d & 0 & 0 & 0\\ \hline C_f(\Lambda) & -\sigma D_f(\Lambda) & 0 & -D_f(\Lambda) & I_d\\ 0 & (L-m)I_d & -(L-m)I_d & 0 & 0 \end{array}\right].$$
This viewpoint permits us to derive an LMI solution of the algorithm synthesis problem based on Theorem 3.1, our second main result. The relevant LMIs can be expressed more compactly by using the selection matrix
$$J := \begin{pmatrix} I_{\dim(A_f)} & 0\\ 0 & I_d\\ 0 & I_d \end{pmatrix}.$$
Theorem 4.5. There exist some $\Lambda \in \Lambda_\rho^r$ and a controller (4.9) such that $A_Z$ is Schur and $\Pi(\Lambda)T_1 + Z T_3$ is SNR iff there exist $X$ and $\Lambda$ satisfying
$$(4.11)\qquad \Lambda \in \Lambda_\rho^r,\qquad X \succ 0 \qquad\text{and}\qquad (\bullet)^T \begin{pmatrix} X & 0 & 0 & 0\\ 0 & -X & 0 & 0\\ 0 & 0 & 0 & I_d\\ 0 & 0 & I_d & 0\end{pmatrix}\begin{pmatrix} AJ & B_1\\ J & 0\\ C_1(\Lambda)J & D_1(\Lambda)\\ 0 & I_d\end{pmatrix} \prec 0.$$
If the LMIs (4.11) are feasible, there exists an algorithm whose convergence rate is bounded as $\rho_{wc} \le \rho$.

Proof. Let us abbreviate the interconnection (4.9)-(4.10), as determined according to (3.3), by
$$(4.12)\qquad z = \begin{bmatrix} \tilde A & \tilde B\\ \tilde C(\Lambda) & \tilde D(\Lambda)\end{bmatrix} w$$
(which is an abuse of notation since these matrices differ from those in (2.39)). Since $B_2$ in (4.10) is zero, $A_Z$ is Schur iff $\tilde A$ is Schur. Therefore, $A_Z$ is Schur and (4.9)-(4.10) is SNR iff there exists some $X \succ 0$ satisfying the corresponding inequality (3.4). We apply Theorem 3.1 with the basis matrices
$$U := \begin{pmatrix} I_{\dim(A_f)} & 0 & 0\\ 0 & I_d & 0\\ 0 & I_d & 0\\ 0 & 0 & I_d \end{pmatrix},\qquad V := \begin{pmatrix} I_{\dim(A_f)} & 0 & 0\\ 0 & I_d & 0\\ 0 & 0 & I_d\\ 0 & 0 & 0 \end{pmatrix}$$
of $\ker(C_2\ \ F)$ and $\ker(B_2^T\ \ E^T)$, respectively. Then the second LMI in (3.5) is just identical to the third one in (4.11); since the first LMI in (3.5) implies $X \succ 0$, "only if" follows directly.

To show "if", let (4.11) hold. By the particular choice of $V$, the third inequality in (3.5) simplifies to $Y - AYA^T \succ 0$. Since $A$ is Schur, we can take $Y_0 \succ 0$ with $Y_0 - AY_0A^T = I$ and thus obtain the solution $\alpha Y_0$ of the third LMI in (3.5) for any $\alpha > 0$. Since $X \succ 0$, we can certainly find some large $\alpha > 0$ such that $Y = \alpha Y_0$ also satisfies the first LMI in (3.5). Applying Theorem 3.1 completes the proof.

As for algorithm analysis, the left-hand sides of (4.11) constitute LMI constraints on $X$ and $\Lambda$. Feasibility of these LMIs is equivalent to the existence of $A_a$, $B_a$, $C_a$, $D_a$, $\Lambda \in \Lambda_\rho^r$ and $X$ with (2.40), which is the desired convexification of algorithm synthesis, one of the main goals of this paper.

Let us now establish that one can even eliminate the unknown $X$ in Theorem 4.5.
Corollary 4.6. Let $R := \frac{L}{m}\sum_{k=0}^{l} \rho^{k}\Lambda_k$ for fixed $\Lambda \in \Lambda_\rho^r$. Then there exists $X$ with (4.11) iff
$$(4.13)\qquad \begin{pmatrix} \frac{1}{1-\rho^2}\,(R^T + R) & R^T + \Lambda_0\\ R + \Lambda_0^T & \Lambda_0^T + \Lambda_0 \end{pmatrix} \succ 0.$$

Proof.
Note that the last LMI in (4.11) is a generalized Stein inequality
$$(4.14)\qquad \mathcal{A}^T X \mathcal{A} - \mathcal{E}^T X \mathcal{E} + \mathcal{C}^T N \mathcal{C} \prec 0$$
with
$$\mathcal{A} := \begin{pmatrix} A_f & -\sigma B_f & -B_f\\ 0 & 0 & \rho^{-1}I_d\\ 0 & \rho^{-1}I_d & 0 \end{pmatrix},\qquad \mathcal{E} := \begin{pmatrix} I_{\dim(A_f)} & 0 & 0\\ 0 & I_d & 0\\ 0 & I_d & 0 \end{pmatrix},\qquad \mathcal{C} := \begin{pmatrix} C_f(\Lambda) & -\sigma D_f(\Lambda) & -D_f(\Lambda)\\ 0 & 0 & I_d \end{pmatrix},\qquad N := \begin{pmatrix} 0 & I_d\\ I_d & 0 \end{pmatrix}.$$
We start by determining a congruence transformation of (4.14) in order to render $\mathcal{A}$ block-diagonal. If $U$ is chosen to satisfy the Sylvester equation
$$(4.15)\qquad A_f U - U(\rho^{-1}I_d) + B_f(1+\sigma) = 0,$$
we indeed have
$$P^{-1}\mathcal{A}Q = \bar{\mathcal{A}} := \begin{pmatrix} A_f & 0 & 0\\ 0 & \rho^{-1}I_d & 0\\ 0 & 0 & -\rho^{-1}I_d \end{pmatrix},\qquad P^{-1}\mathcal{E}Q = \mathcal{E}$$
for suitable invertible matrices $P$, $Q$ built from $I$, $U$, $B_f$ and $\pm I_d$, $\pm\rho^{-1}I_d$. A congruence transformation of (4.14) with $Q$ leads to
$$(4.16)\qquad \bar{\mathcal{A}}^T \bar X \bar{\mathcal{A}} - \mathcal{E}^T \bar X \mathcal{E} + \bar{\mathcal{C}}^T N \bar{\mathcal{C}} \prec 0,\qquad \bar X := P^T X P,\qquad \bar{\mathcal{C}} := \mathcal{C}Q = \begin{pmatrix} C_f(\Lambda) & R & -D_f(\Lambda)\\ 0 & -I_d & I_d \end{pmatrix},$$
where $R := C_f(\Lambda)U + D_f(\Lambda)(1+\sigma)$. By (4.15) we have $U = (\rho^{-1}I_d - A_f)^{-1}B_f(1+\sigma)$ and hence
$$R = \bigl[C_f(\Lambda)(\rho^{-1}I_d - A_f)^{-1}B_f + D_f(\Lambda)\bigr](1+\sigma).$$
Due to (2.30) and $\sigma + 1 = L/m$, this indeed matches the definition of $R$ in the corollary.

To show "only if", let $X$ satisfy (4.11). Then $\tilde X := P^T X P \succ 0$, whose right-lower $2d \times 2d$ block is denoted by $Y =: \begin{pmatrix} Y_1 & Y_2\\ Y_2^T & Y_3\end{pmatrix}$ and is still positive definite. Canceling the first block row/column of (4.16) gives
$$(4.17)\qquad \begin{pmatrix} \rho^{-1}I & 0\\ 0 & -\rho^{-1}I \end{pmatrix}^T \begin{pmatrix} Y_1 & Y_2\\ Y_2^T & Y_3 \end{pmatrix} \begin{pmatrix} \rho^{-1}I & 0\\ 0 & -\rho^{-1}I \end{pmatrix} - \begin{pmatrix} Y_1 & 0\\ 0 & 0 \end{pmatrix} + \begin{pmatrix} -R^T - R & R^T + D_f(\Lambda)\\ R + D_f(\Lambda)^T & -D_f(\Lambda)^T - D_f(\Lambda) \end{pmatrix} \prec 0.$$
With a sign change in the off-diagonal blocks and recalling $D_f(\Lambda) = \Lambda_0$, this is equivalent to
$$(4.18)\qquad \begin{pmatrix} R^T + R - \frac{1-\rho^2}{\rho^2}\,Y_1 & R^T + \Lambda_0 - \frac{1}{\rho^2}\,Y_2\\ R + \Lambda_0^T - \frac{1}{\rho^2}\,Y_2^T & \Lambda_0^T + \Lambda_0 - \frac{1}{\rho^2}\,Y_3 \end{pmatrix} \succ 0.$$
If $H := R^T + R - \frac{1-\rho^2}{\rho^2}Y_1 \succ 0$, it follows for $\rho \in (0,1)$ that
$$\frac{1}{1-\rho^2}\,(R + R^T) - \frac{1}{\rho^2}\,Y_1 = \frac{1}{1-\rho^2}\,H \succ H.$$
By $\begin{pmatrix} \frac{1}{\rho^2}Y_1 & \frac{1}{\rho^2}Y_2\\ \frac{1}{\rho^2}Y_2^T & \frac{1}{\rho^2}Y_3 \end{pmatrix} \succ 0$, the inequality (4.18) hence implies (4.13).

To prove the converse, let (4.13) hold and define
$$Y := \begin{pmatrix} \frac{\rho^2}{1-\rho^2}\,(R^T + R - \varepsilon I) & \rho^2(R^T + \Lambda_0)\\ \rho^2(R + \Lambda_0^T) & \rho^2(\Lambda_0^T + \Lambda_0 - \varepsilon I) \end{pmatrix};$$
here we can choose $\varepsilon > 0$ so small that $Y$ is positive definite. Moreover, $Y$ obviously satisfies (4.18) and thus (4.17). Since $A_f$ is Schur, we can choose $\bar X_1 \succ 0$ with $A_f^T \bar X_1 A_f - \bar X_1 = -I$. Let us then define $\bar X := \mathrm{diag}(\alpha\bar X_1, Y) \succ 0$ for $\alpha > 0$ and consider $\bar{\mathcal{A}}^T\bar X\bar{\mathcal{A}} - \mathcal{E}^T\bar X\mathcal{E} + \bar{\mathcal{C}}^TN\bar{\mathcal{C}}$; its right-lower $2d \times 2d$ block equals the left-hand side of (4.17) and is therefore negative definite; moreover, $\alpha$ only affects the left-upper block of this matrix, which actually just equals $A_f^T(\alpha\bar X_1)A_f - \alpha\bar X_1 = -\alpha I$; therefore, we can fix a sufficiently large $\alpha > 0$ to render the whole matrix negative definite. Then $X := P^{-T}\bar XP^{-1} \succ 0$ satisfies (4.11).

Remark 4.7. These computations closely resemble classical solutions of $H_\infty$- and SNR-synthesis problems based on Nevanlinna-Pick interpolation (see e.g. [18, 14]). In fact, Theorem 4.5 concerns $\Pi(\Lambda)T_1 + ZT_3$ in which $T_3$ is a stable system with as many inputs as outputs. The related SNR-synthesis problem is classically said to be of the one-block type. It is also known that the unstable zeros of the transfer matrix of $T_3$ play a key role in characterizing its solvability. Due to (4.6), these are given by $z_1 = \rho^{-1}$ and $z_2 = \infty$. As it turns out after a simple computation, (4.13) is nothing but the so-called Pick matrix
$$(4.19)\qquad \begin{pmatrix} \dfrac{H(z_1)^* + H(z_1)}{1 - \bar z_1^{-1}z_1^{-1}} & \dfrac{H(z_1)^* + H(z_2)}{1 - \bar z_1^{-1}z_2^{-1}}\\[6pt] \dfrac{H(z_2)^* + H(z_1)}{1 - \bar z_2^{-1}z_1^{-1}} & \dfrac{H(z_2)^* + H(z_2)}{1 - \bar z_2^{-1}z_2^{-1}} \end{pmatrix},$$
where $H$ denotes the transfer matrix of $\Pi(\Lambda)T_1$ (with the definition in [3], which permits zeros at infinity).

For the particular class of multipliers (4.6), we can even go one step further and explicitly characterize the set of those parameters $\rho$ for which the LMI (4.13) is feasible.
Corollary 4.8. Let $\kappa := L/m$ and $l \ge 1$. Then there exists some $\Lambda \in \Lambda_\rho^r$ with (4.13) iff $1 - \frac{1}{\sqrt\kappa} < \rho$.

Proof. In view of the Kronecker product structure of the elements of $\Lambda_\rho^r$ and the homogeneity of (4.13), we can fix $\lambda_0 = 1$ and express (4.13) as
$$(4.20)\qquad \begin{pmatrix} \frac{2\kappa}{1-\rho^2}\bigl(1 + \sum_{k=1}^{l}\rho^k\lambda_k\bigr) & \bullet\\ \kappa\bigl(1 + \sum_{k=1}^{l}\rho^k\lambda_k\bigr) + 1 & 2 \end{pmatrix} \succ 0.$$
If we set $\alpha := \frac{1+\rho}{1-\rho} > 1$, we infer that (4.20) holds iff $\beta := \kappa\bigl(1 + \sum_{k=1}^{l}\rho^k\lambda_k\bigr)$ satisfies $\beta \in (\alpha^{-1}, \alpha)$. Moreover, since $l \ge 1$, the set $\bigl\{\sum_{k=1}^{l}\rho^k\lambda_k \mid \sum_{k=1}^{l}\rho^{-k}\lambda_k \ge -1,\ \lambda_k \le 0\bigr\}$ is the interval $[-\rho^2, 0]$. Hence (4.13) is feasible for some $\Lambda \in \Lambda_\rho^r$ iff $(\alpha^{-1}, \alpha) \cap [\kappa(1-\rho^2), \kappa]$ is not empty. The infimal $\rho \in (0,1)$ for which this is true is determined by the equation $\frac{1+\rho}{1-\rho} = \kappa(1-\rho^2)$, which indeed gives $1 - 1/\sqrt\kappa$.

If using Zames-Falb multipliers of any length $l \ge 1$, $1 - 1/\sqrt\kappa$ is hence the optimal rate that is achievable among all algorithms (2.7). In view of [34], this proves for the first time that the triple momentum algorithm is guaranteed to be optimal, even if allowing for Zames-Falb multipliers of length $l > 1$. This also clarifies why various attempts to improve the rate by manual tuning [16] or sum-of-squares optimization [6] of the algorithm parameters have not been successful. Our computation of an explicit optimal rate bound for design is analogous to what has been achieved for the analysis of Nesterov's algorithm in [28]. Our approach brings out the intrinsic system-theoretic reasons for the limits of performance in algorithm design; this holds both for the value of the optimal rate (determined by two zeros of some transfer matrix) and for the insight that algorithms (2.7) with matrices $A$ of dimension larger than two are not beneficial. All this is a consequence of systematically formulating algorithm design as a controller synthesis problem for the plant (4.1).

We emphasize that our algorithm design approach is more powerful than just proving Corollary 4.8. This is illustrated by following [10] and showing how one can exploit additional structural knowledge about the cost functions. Specifically, for given matrices $M_f, L_f \in \mathbb{R}^{d\times d}$ with $0 \prec M_f \prec L_f$, we consider the class $\mathcal{F}$ of functions $f \in C^2(\mathbb{R}^d, \mathbb{R})$ satisfying
$$(4.21)\qquad M_f \preceq \nabla^2 f(x) \preceq L_f \quad\text{for all } x \in \mathbb{R}^d.$$
One could take the triple momentum algorithm and achieve the convergence rate
$$(4.22)\qquad 1 - \frac{1}{\sqrt\kappa} \quad\text{for}\quad \kappa = \frac{\lambda_{\max}(L_f)}{\lambda_{\min}(M_f)}.$$
Instead, we can as well design algorithms based on the matrices $M_f$ and $L_f$ by solving a suitable LMI system. For this purpose, we introduce $T := (L_f - M_f)^{-1/2}$ and define (4.10) by replacing
$$(4.23)\qquad (mI_d,\ LI_d,\ \sigma) \quad\text{with}\quad (M_f,\ L_f,\ L_fM_f^{-1} - I_d).$$
We then arrive at the following convex algorithm design result for the class $\mathcal{F}$.
Theorem 4.9. One can construct an algorithm which achieves the convergence rate $\rho \in (0,1)$ for the class $\mathcal{F}$ if there exist $\Lambda \in T\Lambda_\rho^rT^T$ and $X \succ 0$ which satisfy the LMI (4.11).

Proof. We first observe that all insights in Section 2.2 remain valid after the substitution (4.23). We argue that the same holds for Sections 2.3 and 2.4 with $\tilde G$ constructed based on (4.23). To this end, let $f \in \mathcal{F}$ be taken with $\nabla f(0) = 0$ and define $g(z) := f(Tz) - \frac{1}{2}(Tz)^TM_f(Tz)$ to infer $\nabla^2 g(z) = T^T\nabla^2 f(Tz)T - T^TM_fT$ and thus $0 \preceq \nabla^2 g(z) \preceq T^T(L_f - M_f)T = I$ for all $z \in \mathbb{R}^d$. Hence $g \in S_{0,1}$. With the transformations $\bar w := T^Tw$ and $\bar z := T^{-1}z$ in (2.7) we obtain
$$\bar w = \nabla g(\bar z),\qquad \bar z = \left[\begin{array}{c|c} A + BM_fC & BT^{-T}\\ \hline T^{-1}C & 0 \end{array}\right]\bar w.$$

Fig. 2. Optimal algorithm convergence rates versus the condition number $\kappa$ in (4.22) for the example in [22]: triple momentum (blue), Theorem 4.9 with repeated multipliers $\Lambda_\rho^r$ (yellow) and Theorem 4.9 with full multipliers $\Lambda_\rho$ (purple).

By just following the line of reasoning in Sections 2.3 and 2.4, Theorem 2.7 holds for the convergence rate $\rho_{wc}$ with respect to $\mathcal{F}$ if replacing (2.37) with
$$\begin{bmatrix} \rho^{-1}(A + BM_fC) & \rho^{-1}BT^{-T}\\ (1-0)\,T^{-1}C & -I_d \end{bmatrix};$$
observe that we use $g \in S_{0,1}$ at this point. Since $T^{-1} = T^T(L_f - M_f)$, this can be expressed as $T^T\tilde GT^{-T}$ with
$$\tilde G := \begin{bmatrix} \rho^{-1}(A + BM_fC) & \rho^{-1}B\\ (L_f - M_f)C & -I_d \end{bmatrix}.$$
Theorem 2.7 involves $\Pi(\Lambda)T^T\tilde GT^{-T}$ being stable and SNR, which is equivalent to the same conditions for the congruence-transformed system $T\Pi(\Lambda)T^T\tilde G$ and hence for $\Pi(T\Lambda T^T)\tilde G$; here $T\Pi(\Lambda)T^T = \Pi(T\Lambda T^T)$ is shown with (2.30) and a suitable state coordinate change. All this reveals that Corollary 2.9 persists to hold for the class $\mathcal{F}$ if replacing $\Lambda_\rho^r$ with $T\Lambda_\rho^rT^T$.
The proof is then concluded as that of Theorem 4.5. Once some $\Lambda_* \in T\Lambda_\rho^rT^T$ for which the LMIs (4.11) are feasible in $X$ has been determined, one can find related algorithm parameters $A_a$, $B_a$, $C_a$, $D_a$ as sketched in Section 4.1. Moreover, the comments on dimensionality reduction carry over to the situation that $L_f - M_f$ (and hence $T$) is block-diagonal.

A concrete instance of the current setup are the functions $f(x) = \frac{1}{2}x^TRx + h(Sx - s)$ for $x \in \mathbb{R}^d$ with given $R \in \mathbb{R}^{d\times d}$, $S \in \mathbb{R}^{e\times d}$, $s \in \mathbb{R}^e$ and any $h \in S_{m,L} \cap C^2(\mathbb{R}^e, \mathbb{R})$, where $R$ is positive definite. Indeed, since $\nabla^2 f(x) = R + S^T\nabla^2 h(Sx - s)S$, we infer that (4.21) holds for any small $\varepsilon > 0$ with
$$M_f := R + mS^TS - \varepsilon I \qquad\text{and}\qquad L_f := R + LS^TS.$$
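For such structured costs, the condition number $\kappa = \lambda_{\max}(L_f)/\lambda_{\min}(M_f)$ entering the rate bound (4.22) is easily evaluated. A minimal sketch with purely illustrative numbers (the matrices $R$, $S$ below are assumptions, not taken from the paper):

```python
from math import sqrt

# Illustrative computation (toy data): build M_f = R + m S^T S - eps*I and
# L_f = R + L S^T S as above, then evaluate kappa and the rate bound (4.22).

def eig_sym2(M):
    """Eigenvalues of a symmetric 2x2 matrix, smallest first."""
    a, b, d = M[0][0], M[0][1], M[1][1]
    t = 0.5 * (a + d)
    disc = sqrt((0.5 * (a - d)) ** 2 + b ** 2)
    return t - disc, t + disc

m, L, eps = 1.0, 10.0, 1e-2
R = [[2.0, 0.0], [0.0, 1.0]]     # positive definite quadratic part (assumed)
S = [[1.0, 1.0]]                 # 1 x 2 coupling matrix (assumed)
StS = [[S[0][i] * S[0][j] for j in range(2)] for i in range(2)]

M_f = [[R[i][j] + m * StS[i][j] - eps * (i == j) for j in range(2)] for i in range(2)]
L_f = [[R[i][j] + L * StS[i][j] for j in range(2)] for i in range(2)]

kappa = eig_sym2(L_f)[1] / eig_sym2(M_f)[0]
rate = 1.0 - 1.0 / sqrt(kappa)   # guaranteed triple momentum rate (4.22)
```

Designing directly for the matrix bounds $M_f$, $L_f$ via Theorem 4.9 can improve on this scalar bound, which is exactly the point of the structure-exploiting synthesis.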
As motivated in [22], such cost functions appear in model predictive control when handling the constraint $Sx \le s$ with a relaxed barrier function $b \in S_{m,L} \cap C^2(\mathbb{R}, \mathbb{R})$ for the set $\{x \in \mathbb{R} \mid x \le 0\}$. This results in the choice $h(x) = \sum_{i=1}^{e} b(x_i)$ with $\nabla h(x) = \mathrm{col}(b'(x_1), \ldots, b'(x_e))$, a so-called diagonally repeated nonlinearity. One can exploit this extra structure by using the full multiplier class $\Lambda_\rho$ instead of $\Lambda_\rho^r$ in Theorem 4.9; a proof relies on [22, Theorem 9] and the fact that Lemma 4.3 also applies to $\mathbf{\Lambda} = \Lambda_\rho$. This offers yet another possibility for reducing conservatism in algorithm design.

We pick up the numerical example from [22, Section 6.2] for the latter class. Figure 2 depicts the convergence rates of the triple momentum algorithm (blue) and of the structure-exploiting algorithms from Theorem 4.9 with repeated (yellow) and full multipliers (purple). In contrast to [22] (which relies on non-convex design algorithms), we obtain identical rates for the two multiplier classes with our convex design procedure.
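The baseline in such comparisons, the triple momentum rate $1 - 1/\sqrt\kappa$, can also be checked empirically. The following sketch runs the triple momentum iteration on scalar quadratics $f(x) = cx^2/2$ with $m \le c \le L$; the parameter formulas are the ones reported for the triple momentum method in [34], and all numbers are illustrative:

```python
from math import sqrt

# Empirical check (toy setting): the triple momentum method, with the
# parameters reported in [34], converges on quadratics f(x) = c x^2 / 2
# for every curvature c in [m, L]; rho = 1 - 1/sqrt(kappa) per (4.22).

m, L = 1.0, 10.0
kappa = L / m
rho = 1.0 - 1.0 / sqrt(kappa)
alpha = (1.0 + rho) / L
beta = rho ** 2 / (2.0 - rho)
gamma = rho ** 2 / ((1.0 + rho) * (2.0 - rho))

def tmm(c, iters=100):
    """Run triple momentum on f(x) = c x^2 / 2 and return the final error."""
    grad = lambda x: c * x
    xi_prev, xi = 1.0, 1.0          # start away from the minimizer x* = 0
    for _ in range(iters):
        y = (1.0 + gamma) * xi - gamma * xi_prev
        xi_prev, xi = xi, (1.0 + beta) * xi - beta * xi_prev - alpha * grad(y)
    return abs(xi)

errors = [tmm(c) for c in (m, 0.5 * (m + L), L)]
```

After 100 iterations the error is far below $10^{-8}$ for all tested curvatures, and the momentum rate $\rho$ is strictly smaller than the gradient-descent rate $(\kappa-1)/(\kappa+1)$, consistent with Corollary 4.8.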
Fig. 3. Extremum control: optimization of the output of $G$ (left) or of that of $\bar K$ (right).
5. Generalization: Extremum Control.
We now demonstrate that the proposed framework and the accompanying convexification result have a much wider scope than presented so far. They permit to systematically design optimization and learning algorithms with optimal convergence rates, even in the presence of additional dynamics in the feedback loop. Such dynamics may represent, for example, a model of a communication channel in optimization problems over networks, a noise filter if only a noisy gradient is available, the dynamics of a robot in a source-seeking problem, or the dynamic properties of a hardware architecture as in neuromorphic computing. A particularly nice scenario is extremum control as conceptually mentioned in [2]. For a given system and a cost function, the goal is to design a controller that drives some system output to a steady state in which the cost is minimal.

Among the many concrete instantiations of extremum control, we concentrate on the case where some linear system is given, the cost function $f$ is only known to belong to the class $S_{m,L}$, and the gradient of the cost function can be evaluated [21, 23, 15]. To be concrete, we assume that the system is described as
$$(5.1)\qquad \begin{pmatrix} z\\ v\end{pmatrix} = \begin{pmatrix} G_1\\ G_2\end{pmatrix}u = \left[\begin{array}{c|c} A_G & B_G\\ \hline C_G & D_G\\ C_{G,v} & 0\end{array}\right]u,$$
with an input signal $u \in l_2^{n_u}$ used for control and two output signals $z \in l_2^d$, $v \in l_2^{n_v}$ interpreted as follows. The first one is supposed to be asymptotically steered to some $z_* \in \mathbb{R}^d$ with $\nabla f(z_*) = 0$ for any cost $f \in S_{m,L}$. The second output provides extra information about the system that can be exploited for control; it can be empty, which boils down to $n_v = 0$.

The to-be-constructed dynamic controller (algorithm) is a linear time-invariant system that takes the two signals $w = \nabla f(z)$ and $v$ as its inputs and generates the control signal $u$ as its output:
$$(5.2)\qquad u = \begin{pmatrix} K_1 & K_2\end{pmatrix}\begin{pmatrix} w\\ v\end{pmatrix},\qquad w = \nabla f(z).$$
Altogether, (5.1)-(5.2) define the closed-loop system as depicted on the left in Fig. 3.
With $G := G_1(I - K_2G_2)^{-1}K_1$, a simple calculation shows that it can be described as
$$(5.3)\qquad w = \nabla f(z),\qquad z = Gw.$$

Remark 5.1. With state-space descriptions of $G_1$, $G_2$, $K_1$, $K_2$, one can calculate a representation (2.6) of $z = Gw$. We assume that the "$D$-matrix" of $K_1$ vanishes, which assures $D = 0$ in (2.6) and leads to (2.3).

Given $\rho \in (0,1)$, the controller (5.2) is said to achieve $\rho$-convergence if the interconnection (5.3) does. The infimum of all such $\rho$'s is the extremum control rate $\rho_{ec}$.

In Section 4 we have only been addressing the simple case $G_1 = I$ and $G_2 = [\,]$. Despite the current more general setting, we are in the position to exploit the developed results to their full extent, as seen next. The key is to adopt the generalized plant point of view. In the representation of $z = Gw$ by (2.3) we can assume w.l.o.g. that $(A, C)$ is detectable. By Theorem 2.1, we need to make sure that $G$ admits a factorization $\bar GH_d$ with $H_d$ being the integrator (2.25). To enforce this structure we consider the feedback interconnection
$$(5.4)\qquad \begin{pmatrix} z\\ y\end{pmatrix} = \begin{pmatrix} 0 & G_1\\ H_d & 0\\ 0 & G_2 \end{pmatrix}\begin{pmatrix} w\\ u\end{pmatrix},\qquad u = \begin{pmatrix} \bar K & K_2\end{pmatrix}y,$$
where the controllers $\bar K$ and $K_2$ can be freely chosen. With the abbreviation $\bar G := G_1(I - K_2G_2)^{-1}\bar K$, closing this loop indeed gives $z = \bar GH_dw$, and
$$(5.5)\qquad K_1 := \bar KH_d$$
assures the structure $G = G_1(I - K_2G_2)^{-1}K_1 = \bar GH_d$.

We work again with state-space descriptions (3.1)-(3.2) of the plant and the controller in (5.4), respectively. Based on (5.1), the matrices of the plant can be taken as
$$(5.6)\qquad \left[\begin{array}{c|cc} A & B_1 & B_2\\ \hline C_1 & D & E\\ C_2 & F & 0\end{array}\right] := \left[\begin{array}{cc|cc} I & 0 & I & 0\\ 0 & A_G & 0 & B_G\\ \hline 0 & C_G & 0 & D_G\\ I & 0 & 0 & 0\\ 0 & C_{G,v} & 0 & 0\end{array}\right],$$
while those of the controller are free. Of course, (3.3) then leads to a state-space description of (5.4) and, by inspection, $D$ indeed vanishes. With the choice (5.5), this gives as well a state-space representation of $G$ in (5.3) which is ensured to admit the structure (2.15). For the purpose of synthesis, we need a plant-controller description of the corresponding system $\tilde G$ in (2.37).
Recall that $\tilde G$ was obtained from (2.3) by the signal transformations (2.26), (2.28) and (2.34). If applied to the plant (3.1) with matrices (5.6), these transformations lead to
$$(5.7)\qquad \begin{pmatrix}\tilde z\\ y\end{pmatrix} = \left[\begin{array}{c|cc} \tilde A & \tilde B_1 & \tilde B_2\\ \hline \tilde C_1 & \tilde D & \tilde E\\ \tilde C_2 & \tilde F & 0\end{array}\right]\begin{pmatrix}\tilde w\\ u\end{pmatrix} = \left[\begin{array}{c|cc} \rho^{-1}(A + B_1mC_1) & \rho^{-1}B_1 & \rho^{-1}(B_2 + B_1mE)\\ \hline (L-m)C_1 & -I & (L-m)E\\ C_2 & F & 0\end{array}\right]\begin{pmatrix}\tilde w\\ u\end{pmatrix},$$
corresponding to the former (4.1). A state-space description of $\tilde G$ is obtained by interconnecting (5.7) with (4.2).

Let us now formulate a test whether (5.7) admits a stabilizing controller; it involves so-called uncontrollable and unobservable modes [42, Definition 3.6].
Lemma 5.2. There exists a controller which stabilizes (5.7) iff $(A_G, B_G)$ / $(A_G, C_{G,v})$ have no uncontrollable/unobservable modes in $\{\lambda \in \mathbb{C} \mid |\lambda| \ge \rho\}$ and $\begin{pmatrix} A_G - I & B_G\\ C_G & D_G \end{pmatrix}$ has full row rank.

Proof. Since the triple $(\tilde A, \tilde B_2, \tilde C_2)$ is given by
$$\left(\rho^{-1}\begin{pmatrix} I & mC_G\\ 0 & A_G\end{pmatrix},\ \rho^{-1}\begin{pmatrix} mD_G\\ B_G\end{pmatrix},\ \begin{pmatrix} I & 0\\ 0 & C_{G,v}\end{pmatrix}\right),$$
one easily verifies with the so-called Hautus test that the formulated conditions characterize that $(\tilde A, \tilde B_2)$ / $(\tilde A, \tilde C_2)$ are stabilizable/detectable. This proves the claim by Remark 3.2.

This identifies necessary conditions on (5.1) for achieving an extremum control rate of $\rho \in (0,1)$: $G_1$ needs to be right invertible (requiring $d \le n_u$) and should have no invariant zero at 1 [39].

We assume from now on that (5.7) admits a stabilizing controller. Again, the key to convexification is the description of all stabilized closed-loop systems as in (4.4). Now we follow a classical state-space procedure to construct this Youla parametrization [7, Section 4.5]: Just choose matrices $\tilde M$ and $\tilde L$ such that $\tilde A + \tilde B_2\tilde M$ and $\tilde A + \tilde L\tilde C_2$ are Schur and take
$$(5.8)\qquad \begin{pmatrix} T_1 & T_2\\ T_3 & \bullet \end{pmatrix} := \left[\begin{array}{cc|cc} \tilde A + \tilde B_2\tilde M & -\tilde B_2\tilde M & \tilde B_1 & \tilde B_2\\ 0 & \tilde A + \tilde L\tilde C_2 & \tilde B_1 + \tilde L\tilde F & 0\\ \hline \tilde C_1 + \tilde E\tilde M & -\tilde E\tilde M & \tilde D & \tilde E\\ 0 & \tilde C_2 & \tilde F & 0 \end{array}\right].$$
Note that the left-hand side is a matrix of operators, whose blocks are defined through the single state-space representation on the right-hand side. Moreover, we exploit the structure of the multipliers $\Lambda_\rho^r$ in (2.31) to arrive at the following result.

Lemma 5.3. $\rho \in (0,1)$ satisfies $\rho_{ec} \le \rho$ if there exist $\Lambda \in \Lambda_\rho^r$ and $Z \in \mathcal{Q}$ such that $\Pi(\Lambda)T_1 + T_2ZT_3$ is SNR.

Proof. By (2.30) and (2.31), the multiplier $\Pi(\Lambda)$ admits the diagonal structure $\pi(\Lambda)I_d$, where $\pi(\Lambda)$ is the system with one input and one output whose output matrix collects $\lambda_l, \lambda_{l-1}, \ldots, \lambda_1$ and whose feedthrough is $\lambda_0$. Moreover, $T_2$ has the dimension $d \times n_u$ and can be expressed as an operator matrix with entries $(T_2)_{ij}$ for $i = 1, \ldots, d$, $j = 1, \ldots, n_u$ that are also systems with one input and one output each.
It is well known that such single-input single-output systems commute, i.e. $\pi(\Lambda)(T_2)_{ij} = (T_2)_{ij}\pi(\Lambda)$, which implies
$$\Pi(\Lambda)T_2 = (\pi(\Lambda)I_d)T_2 = T_2(\pi(\Lambda)I_{n_u}).$$
Starting with $\Pi(\Lambda)T_1 + T_2ZT_3$ being SNR, we follow the proof that 2. implies 1. in Lemma 4.3. Due to the structure of $\tilde\Lambda$, we infer $\Pi(\tilde\Lambda) = \pi(\tilde\Lambda)I_d$ and hence $\Pi(\tilde\Lambda)^{-1} = \pi(\tilde\Lambda)^{-1}I_d$. Therefore, $Q := \pi(\tilde\Lambda)^{-1}Z$ satisfies $Q \in \mathcal{Q}$ and assures that $\Pi(\tilde\Lambda)(T_1 + T_2QT_3) = \Pi(\tilde\Lambda)T_1 + T_2ZT_3$ is SNR. Again, the application of Theorem 2.7 completes the proof.

Verifying $\rho_{ec} \le \rho$ according to Lemma 5.3 is convex over $\Lambda_\rho^r \times \mathcal{Q}$ and can again be exactly turned into a finite-dimensional LMI feasibility problem. To see this, we proceed as in Section 4.3 and express $\Pi(\Lambda)T_1 + T_2ZT_3$ as the interconnection
$$(5.9)\qquad \begin{pmatrix} z\\ y\end{pmatrix} = \begin{pmatrix} \Pi(\Lambda)T_1 & T_2\\ T_3 & 0\end{pmatrix}\begin{pmatrix} w\\ u\end{pmatrix},\qquad u = Zy.$$
Based on the realizations of the multiplier (2.30) and of (5.8), it is not difficult to construct a state-space representation of the plant in (5.9). We dispense with writing down the matrices but note that they admit the structure
$$(5.10)\qquad \left[\begin{array}{c|cc} A & B_1 & B_2\\ \hline C_1(\Lambda) & D_1(\Lambda) & E\\ C_2 & F & 0\end{array}\right],$$
in which only the blocks $C_1(\Lambda)$ and $D_1(\Lambda)$ depend on $\Lambda$, and they do so affinely. Note that we abuse notation, since (5.10) and (5.6) certainly are different plants.

We are now in the position to apply (a specialized version of) a convexification procedure that has first been established in [31]; see also [27] for recent extensions. The corresponding LMIs involve a symmetric decision variable $W = \begin{pmatrix} W_1 & W_2\\ W_2^T & W_3\end{pmatrix}$ with the same size and partition as $A$. Let us introduce the following functions of the variables $W$ and $\Lambda$ (where we drop the arguments to save space):
$$(5.11)\qquad \mathbf{W}_1 := \begin{pmatrix} W_1 & W_2\\ 0 & I\end{pmatrix},\quad \mathbf{W}_2 := \begin{pmatrix} I & 0\\ -W_2^T & W_3\end{pmatrix},\quad \mathbf{Y} := \mathbf{W}_1^T\mathbf{W}_2,\quad \begin{bmatrix} \mathbf{A} & \mathbf{B}\\ \mathbf{C} & \mathbf{D}\end{bmatrix} := \begin{bmatrix} \mathbf{W}_1^TA\mathbf{W}_2 & \mathbf{W}_1^TB_1\\ C_1(\Lambda)\mathbf{W}_2 & D_1(\Lambda)\end{bmatrix}.$$
It is crucial, and easily checked by computation, that the bold-face matrices all depend affinely on $W$ and $\Lambda$. Moreover, let $U$, $V_2$ be basis matrices of $\ker(C_2\ \ F)$ and $\ker(B_2^T\ \ E^T)$, respectively; by (5.10) and with a basis $\mathrm{col}(V_1, V_2)$ of $\ker(B_2^T\ \ E^T)$, we can choose $V := \begin{pmatrix} I & 0\\ 0 & V_1\\ 0 & V_2\end{pmatrix}$.
Either one of the following two equivalent conditions imply ρ ec ≤ ρ for ρ ∈ (0 , : There exists a Λ ∈ Λ rρ and a controller (4.9) such that A Z is Schur and Π(Λ) T + T ZT is SNR. There exist X , W and Λ ∈ Λ rρ satisfying (cid:18) X W W T Y (cid:19) (cid:31) and (5.12) • T X − X I I A B I C (Λ) D (Λ)0 I U ≺ , • T Y − Y − I I I − A T − C T I − B T − D T V (cid:31) . ketch of proof. By Lemma 5.3 it suffices to show the equivalence of 1. and 2. To this end wecharacterize 1. with Theorem 3.1 for (3.6) in terms of a non-convex feasibility condition in the variables X , Y and Λ. We then map Y (cid:31) (cid:18) W W W W (cid:19) := (cid:18) Y − − Y − Y − Y Y − Y − Y Y − Y (cid:19) with W (cid:31) W (cid:31)
For (5.11) one easily checks that $Y\mathbf W = \tilde{\mathbf W}$ and $\det(\mathbf W)\neq 0$, $\det(\tilde{\mathbf W})\neq 0$. Hence $Y = \tilde{\mathbf W}\mathbf W^{-1} = \mathbf W^{-T}\tilde{\mathbf W}^T$. Thus, by the third equation in (5.11), we get
$$(5.13)\qquad Y = \mathbf W^{-T}\mathbf Y\,\mathbf W^{-1} = \tilde{\mathbf W}\,\mathbf Y^{-1}\tilde{\mathbf W}^T.$$
Hence, the first inequalities in (3.5) and in 2. are related by a congruence transformation with the matrix $\mathrm{diag}(I, \mathbf W)$. Moreover, the second one in (3.5) and the first in (5.12) are identical. For the third inequality in (3.5) we take the annihilator matrix $\tilde V := \mathrm{diag}(\mathbf W, I)V$. This is fine since $B_2^T\mathbf W = B_2^T$ by the structure of $\mathbf W$, $B_2$, and hence $\begin{pmatrix}B_2^T & E^T\end{pmatrix}\tilde V = \begin{pmatrix}B_2^T\mathbf W & E^T\end{pmatrix}V = \begin{pmatrix}B_2^T & E^T\end{pmatrix}V = 0$. An inspection of the proof of Theorem 3.1 in [32, Section 6.3] reveals that it causes no harm to take some annihilator $\tilde V$ which depends on $Y$. If we then substitute (5.13) in the third inequality of (3.5), we obtain
$$(\bullet)^T\begin{pmatrix}\mathbf Y^{-1} & 0 & 0 & 0\\ 0 & -\mathbf Y^{-1} & 0 & 0\\ 0 & 0 & 0 & I\\ 0 & 0 & I & 0\end{pmatrix}\begin{pmatrix}\mathbf W^{-1} & 0\\ -\mathbf W^T A^T & -\mathbf W^T C(\Lambda)^T\\ 0 & I\\ -B^T & -D(\Lambda)^T\end{pmatrix}\tilde V \succ 0.$$
With $\tilde V := \mathrm{diag}(\mathbf W, I)V$ and (5.11) we right away obtain the second inequality in (5.12). In fact, 2. constitutes convex constraints on all decision variables; genuine LMIs are obtained by taking a Schur complement w.r.t. $\mathbf Y^{-1}$ in the second inequality of (5.12).

Let us collect the steps to solve the extremum control problem for the system (5.1). Note that this encompasses algorithm design for $G = I_d$ and $G = [\,]$ as in the first part of the paper.
1. Choose some $l\in\mathbb N$ in (2.30) and $\rho\in(0,1)$.
2. If the resulting LMIs are feasible, then $\rho$ is achievable for a suitable controller (5.2), constructed as follows. Choose the weighted version of the generalized plant (5.7) given as
$$(5.14)\qquad \left[\begin{array}{cc|ccc} A_f & B_f\tilde C & B_f\tilde D & B_f\tilde E\\ 0 & A & \tilde B_1 & \tilde B_2\\ \hline C_f(\Lambda_*) & D_f(\Lambda_*)\tilde C & D_f(\Lambda_*)\tilde D & D_f(\Lambda_*)\tilde E\\ 0 & C & \tilde F & 0\end{array}\right].$$
For (5.14) and (3.6), the LMIs in Theorem 3.1 are feasible.
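The Schur-complement step that turns the second inequality of (5.12) into a genuine LMI rests on a standard fact: for symmetric $Y\succ 0$, a symmetric block matrix $\begin{pmatrix}M & N\\ N^T & Y\end{pmatrix}$ is positive definite exactly when the Schur complement $M - N Y^{-1} N^T$ is. The sketch below checks this on random data (illustrative matrices, not those of (5.12)):

```python
import numpy as np

rng = np.random.default_rng(1)

def is_pd(S):
    """Positive definiteness via eigenvalues of the symmetric part."""
    return bool(np.all(np.linalg.eigvalsh((S + S.T) / 2) > 0))

n, m = 4, 3
# random symmetric Y > 0 and an arbitrary coupling block N
Q = rng.standard_normal((m, m))
Y = Q @ Q.T + m * np.eye(m)
N = rng.standard_normal((n, m))
M = N @ np.linalg.inv(Y) @ N.T + np.eye(n)   # chosen so the Schur complement is I > 0

block = np.block([[M, N], [N.T, Y]])
schur = M - N @ np.linalg.solve(Y, N.T)

# positivity of the block matrix <=> Y > 0 and positivity of the Schur complement
assert is_pd(block) == (is_pd(Y) and is_pd(schur))
assert is_pd(block)
```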
With a related controller (3.2), and in view of (4.2), (5.5), we infer that
$$\begin{pmatrix}K_1 & K_2\end{pmatrix} = \left[\begin{array}{c|c}\rho A_c & \rho B_c\\ \hline C_c & D_c\end{array}\right]\begin{pmatrix}H_d\\ I_{n_v}\end{pmatrix}$$
achieves $\rho$-convergence for (5.1)-(5.2).

By bisection over $\rho$, one can determine the infimal worst-case convergence rate $\rho_*(l)$ that is achievable with some algorithm. Note that this best rate depends on the length $l$ of the Zames-Falb multiplier (2.30). We emphasize that many papers in the literature revolve around the case $l = 0$, which leads to a multiplier without dynamics and is related to the so-called circle criterion or a version of the small-gain theorem. For example, in [21], controllers are assumed to have an observer structure incorporating an integrator, and the design is split into sequential observer and state-feedback synthesis steps relying on the small-gain theorem, both of which are typically conservative. The paper [23] fixes a control structure and is confined to stability analysis for $l = 0$ only. Dynamic multipliers are generally known to be considerably more powerful (see e.g. [37] for analysis and [29] for synthesis). The test for $l = 1$, termed the off-by-one circle criterion, attracted special attention in [16, 17, 34] for algorithm analysis (see also Corollary 4.8).

Our approach overcomes several of these limitations in that we perform direct output-feedback synthesis and employ dedicated dynamic multipliers for the general extremum control problem.

Fig. 4. Guaranteed rates for the example in Section 5.4.1 with l = 0 (left) and l = 2 (right).

Our numerical examples illustrate that it can be beneficial to work with multipliers of length larger than 1, and that one can analyze the achievable convergence rates depending on suitable system-theoretic properties of (5.1).

We have not made any assumptions on G so far.
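The bisection over $\rho$ mentioned above can be organized as follows; `rate_feasible` is a hypothetical stand-in for the LMI feasibility test (replaced here by a monotone toy predicate with threshold 0.6 so that the sketch runs):

```python
def bisect_rate(rate_feasible, lo=0.0, hi=1.0, tol=1e-4):
    """Bisection for the infimal rho with rate_feasible(rho) True.

    rate_feasible is assumed monotone: feasible for all rho above
    some threshold rho* in (lo, hi)."""
    assert rate_feasible(hi) and not rate_feasible(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rate_feasible(mid):
            hi = mid      # mid is achievable, search below
        else:
            lo = mid      # mid is not achievable, search above
    return hi             # best certified rate

# toy predicate standing in for the LMI feasibility test (threshold 0.6)
rho_star = bisect_rate(lambda rho: rho >= 0.6)
assert abs(rho_star - 0.6) < 1e-3
```

In practice each call to `rate_feasible` would solve the semi-definite program of condition 2, so the cost is a modest number of SDP solves.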
In case that G admits the structure $G = g\, I_{n_u}$ with a single-input single-output system $g$, we emphasize that the extensions described in Section 4.4 go through with ease in the current, more general setting. If, in addition, the second plant block is empty and we work with multipliers that admit a Kronecker structure, the possibility for dimensionality reduction carries over as well (see Remark 4.1). The synthesis procedure will then lead to a controller K that also admits this structure. Concretely, this captures the optimal synthesis of an algorithm for the minimization of f where the information sent to the gradient first needs to pass a communication channel that is modeled by g.

Relating the limits of performance of controlled systems to properties of the underlying uncontrolled one is a classical research topic in control [40]. Our tools put us in the position to explore such limits of performance expressed by the achievable optimal convergence rate. For a numerical illustration, we choose a very simple configuration with an empty second block and a family of systems G that admit the transfer functions $G(z) = \dfrac{z - 0.\cdots}{(z+p)(z+0.\cdots)}$ with the pole $p$ varying in $\{0.2, 0.8, 0.9, 1, 1.1, 1.2, 2\}$. The optimal achievable rates for static ($l = 0$) and dynamic ($l = 2$) multipliers depending on the condition number $L/m$ are plotted in Fig. 4; note that the saturation at 1 results from restricting $\rho$ to the interval $[0, 1)$. The rates improve from $l = 0$ to $l = 2$, but they do not improve any more for $l > 2$.
It is also interesting to observe that the rates do not change when moving the pole $p$ inside the unit disk towards the boundary, but that they do get worse if $p$ moves further into the unstable region. Instead of discussing other interesting aspects of such trade-offs and fundamental performance limitations for the example, we conclude by emphasizing that the key aspect is the mere ability to generate such plots also for many other scenarios.

A particularly interesting case is optimization with delayed gradient information, as it appears in parallel optimization or optimization over networks; see e.g. [41]. There, convergence rates of gradient descent algorithms described as $x_{k+1} = x_k - \alpha\nabla f(x_{k-\nu})$ with step-size $\alpha > 0$ and delay $\nu\in\mathbb N$ are studied. Such a delay in general accelerated gradient descent algorithms can be easily captured by taking $G(z) = z^{-\nu}$ and an empty second block in the configuration on the left in Fig. 3. The achievable guaranteed convergence rates for $\nu = 0, 1, 2$ are depicted in Fig. 5 on the left; $\nu = 0$ corresponds to the triple-momentum method. As expected, a longer lag leads to a lower performance. It is interesting to observe that our approach allows to design an accelerated algorithm with delayed gradients that outperforms the standard gradient descent algorithm (GD, as defined in (4.23)) without delay for larger values of $\kappa$.

Finally, we are not tied to the left configuration in Fig. 3. For example, one might require to optimize the rate of convergence to an optimal steady-state $u_*$ of the controller's output on the right in Fig. 3 (in which the integrator $H_d$ is already displayed explicitly). Let us also assume that $\nabla f$ is diagonally repeated as $\nabla f(z) = \mathrm{col}(b'(z_1), \ldots, b'(z_d))$ with any $b\in\mathcal S_{m,L}\cap C^1(\mathbb R, \mathbb R)$ (see Sec. 4.4).

Fig. 5. Guaranteed rates for the example in Section 5.4.2 (left) and Section 5.4.3 (right).
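The delayed-gradient recursion $x_{k+1} = x_k - \alpha\nabla f(x_{k-\nu})$ discussed above can be simulated directly; the strongly convex quadratic and the step size below are illustrative choices, not taken from the paper:

```python
import numpy as np

def delayed_gd(grad, x0, alpha, nu, steps):
    """Run x_{k+1} = x_k - alpha * grad(x_{k - nu}) with x_j = x0 for j <= 0."""
    xs = [x0] * (nu + 1)                 # history buffer holding x_{k-nu}, ..., x_k
    for _ in range(steps):
        xs.append(xs[-1] - alpha * grad(xs[-(nu + 1)]))
    return xs[-1]

# strongly convex quadratic f(x) = 0.5 x' H x with m = 1, L = 10
H = np.diag([1.0, 10.0])
grad = lambda x: H @ x
x_end = delayed_gd(grad, np.array([1.0, 1.0]), alpha=0.05, nu=2, steps=400)
assert np.linalg.norm(x_end) < 1e-3      # converges to the minimizer 0
```

For too large step sizes the delay destabilizes the recursion, which is exactly the trade-off that the synthesized algorithms optimize over.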
Then algorithm synthesis can be convexified along the discussed lines for both repeated and full dynamic multipliers $\Pi(\Lambda)$, with $\Lambda\in\mathbf\Lambda_\rho^r$ and $\Lambda\in\mathbf\Lambda_\rho$, respectively. If we pick $G(z) = \mathrm{col}\!\left(\dfrac{z-0.\cdots}{z+1.\cdots},\ \dfrac{z-0.\cdots}{z+0.\cdots}\right)$, Fig. 5 on the right depicts the achievable guaranteed rates for repeated multipliers of length $l = 2$ and for full multipliers of lengths $l = 2$ and $l = 4$. These results not only reveal the benefit of higher-order dynamics in the multipliers, but also that of exploiting the structure in the cost function.
6. Conclusions.
Expanding on the point of view discussed in [38], it is one of our key messages that the generalized plant viewpoint adopted in this paper offers otherwise unachievable conceptual and structural insights into the analysis and synthesis of optimization algorithms. We believe that this viewpoint has a high potential to stimulate further research at the interface of systems theory, optimization and machine learning. For example, it seems highly promising to incorporate recent advances in structured controller synthesis (see [27] and references therein) for the convex design of distributed optimization algorithms. Finally, we believe that our approach can be utilized in algorithm design problems where performance properties such as the mitigation of sensitivity against noise come into play.
REFERENCES

[1] B. Anderson and S. Vongpanitlerd. Network Analysis and Synthesis. Prentice Hall, Englewood Cliffs, New Jersey, 1973.
[2] K. Åström and B. Wittenmark. Adaptive Control. Addison-Wesley Series in Electrical Engineering: Control Engineering. Addison-Wesley, 1995.
[3] C. Byrnes, T. Georgiou, and A. Lindquist. A generalized entropy criterion for Nevanlinna-Pick interpolation with degree constraint. IEEE Trans. Aut. Control, 46(5):822–839, 2001.
[4] C. Desoer and M. Vidyasagar. Feedback Systems: Input-Output Approach. Academic Press, London, 1975.
[5] H.-B. Dürr and C. Ebenbauer. On a class of smooth optimization algorithms with applications in control. IFAC Proceedings Volumes, 45(17):291–298, 2012.
[6] M. Fazlyab, M. Morari, and V. M. Preciado. Design of first-order optimization algorithms via sum-of-squares programming. In , 2018.
[7] B. Francis. A Course in H∞ Control Theory. Springer-Verlag, Berlin, 1987.
[8] P. Gahinet. A new parametrization of H∞ suboptimal controllers. International Journal of Control, pages 1031–1051, 1994.
[9] P. Gahinet and P. Apkarian. A linear matrix inequality approach to H∞ control. Int. J. Robust Nonlin., 4:421–448, 1994.
[10] D. Gramlich, C. Ebenbauer, and C. W. Scherer. Convex synthesis of accelerated gradient algorithms for optimization and saddle point problems using Lyapunov functions. arXiv:2006.09946 [math.OC].
[11] R. Horn and C. Johnson. Matrix Analysis. Cambridge University Press, New York, 1985.
[12] B. Hu and L. Lessard. Control interpretations for first-order optimization methods. In , pages 3114–3119, May 2017. ISSN: 2378-5861.
[13] T. Iwasaki and R. Skelton. All controllers for the general H∞ control problem: LMI existence conditions and state space formulas. Automatica, 30:1307–1317, 1994.
[14] H. Kimura. Conjugation, interpolation and model-matching in H∞. International Journal of Control, 49(1):269–307, 1989.
[15] L. S. P. Lawrence, J. W. Simpson-Porco, and E. Mallada. Linear-convex optimal steady-state control.
[16] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
[17] L. Lessard and P. Seiler. Direct synthesis of iterative algorithms with bounds on achievable worst-case convergence rate. In . IEEE, July 2020.
[18] D. J. N. Limebeer and B. D. O. Anderson. An interpolation theory approach to H∞ controller degree bounds. Linear Algebra and its Applications, 98:347–386, Jan. 1988.
[19] I. Masubuchi, A. Ohara, and N. Suda. LMI-based controller synthesis: a unified formulation and solution. Int. J. Robust Nonlin., 8:669–686, 1998.
[20] A. Megretski and A. Rantzer. System analysis via integral quadratic constraints. IEEE T. Automat. Contr., 42:819–830, 1997.
[21] S. Michalowsky and C. Ebenbauer. Extremum control of linear systems based on output feedback. In . IEEE, 2016.
[22] S. Michalowsky, C. Scherer, and C. Ebenbauer. Robust and structure exploiting optimisation algorithms: an integral quadratic constraint approach. International Journal of Control, pages 1–24, 2020.
[23] Z. E. Nelson and E. Mallada. An integral quadratic constraint framework for real-time steady-state optimization of linear time-invariant systems. In , 2018.
[24] Y. Nesterov. Lectures on Convex Optimization, volume 137 of Springer Optimization and Its Applications. Springer International Publishing, 2018.
[25] B. Polyak. Introduction to Optimization. Optimization Software, Inc., New York, 1987.
[26] A. Rantzer. On the Kalman-Yakubovich-Popov lemma. Systems & Control Letters, 28(1):7–10, 1996.
[27] C. A. Rosinger and C. W. Scherer. A flexible synthesis framework of structured controllers for networked systems. IEEE Transactions on Control of Network Systems, 7(1):6–18, 2020.
[28] S. Safavi, B. Joshi, G. Franca, and J. Bento. An explicit convergence rate for Nesterov's method from SDP. In . IEEE, June 2018.
[29] C. Scherer. Gain-scheduling control with dynamic multipliers by convex optimization. SIAM J. Contr. Optim., 53(3):1224–1249, 2015.
[30] C. Scherer, P. Gahinet, and M. Chilali. Multiobjective output-feedback control via LMI optimization. IEEE Transactions on Automatic Control, 42(7):896–911, 1997.
[31] C. W. Scherer. Design of structured controllers with applications. In Proc. 39th IEEE Conf. Decision and Control, Sydney, Australia, 2000.
[32] C. W. Scherer. Robust mixed control and LPV control with full block scalings. In L. El Ghaoui and S. Niculescu, editors, Advances in Linear Matrix Inequality Methods in Control, pages 187–207. SIAM, Philadelphia, 2000.
[33] C. W. Scherer and S. Weiland. Linear Matrix Inequalities in Control. Lecture Notes, Delft University of Technology, 1999.
[34] B. Van Scoy, R. A. Freeman, and K. M. Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Systems Letters, 2(1):49–54, 2018.
[35] A. Taylor and Y. Drori. An optimal gradient method for smooth (possibly strongly) convex minimization. arXiv:2101.09741 [math.OC].
[36] J. Veenman and C. Scherer. A synthesis framework for robust gain-scheduling controllers. Automatica, 50(11):2799–2812, 2014.
[37] J. Veenman, C. W. Scherer, and H. Köroğlu. Robust stability and performance analysis based on integral quadratic constraints. European Journal of Control, 31:1–32, 2016.
[38] J. Wang and N. Elia. A control perspective for centralized and distributed convex optimization. In IEEE Conference on Decision and Control and European Control Conference, Orlando, FL, USA. IEEE, 2011.
[39] W. Wonham. Linear Multivariable Control. Springer-Verlag, Berlin, 3rd edition, 1985.
[40] G. Zames. Feedback and optimal sensitivity: Model reference transformations, multiplicative seminorms, and approximate inverses. IEEE Transactions on Automatic Control, 26(2):301–320, 1981.
[41] S. Zheng, Q. Meng, T. Wang, W. Chen, N. Yu, Z.-M. Ma, and T.-Y. Liu. Asynchronous stochastic gradient descent with delay compensation.
[42] K. Zhou, J. Doyle, and K. Glover.