Comparison of Minimization Methods for Rosenbrock Functions
Comparison of Optimization Methods with Application to a Network Containing Malicious Agents
Iyanuoluwa Emiola, Robson Adem
Abstract—There are different methods of solving unconstrained optimization problems, but there have been disparities in convergence speed for most of these methods. First-order methods such as the steepest descent method are very common in solving unconstrained problems, but second-order methods such as the Newton-type methods enable faster convergence, especially for quadratic functions. In this paper, we compare and analyze first- and second-order methods in solving two unconstrained problems that differ by a multiplicative perturbation parameter, to show how malicious agents in a network can cause disruption. We also explore the advantages and disadvantages of the steepest descent, Newton, and conjugate gradient methods in terms of their convergence attributes by comparing a strictly convex function with a banana-type function.
I. INTRODUCTION
Solutions to unconstrained optimization problems can be applied to multi-agent systems and machine learning problems, especially if the problem is posed in a decentralized or distributed fashion [1], [2], [3], [4]. Sometimes, malicious agents can be present in a network and will slow down convergence to optimal points. Therefore, fast convergence, and the cost associated with it, are usual necessities for recent researchers. The steepest descent method is a good first-order method for obtaining optimal solutions if an appropriate step size is chosen. Methods of choosing step sizes include the fixed step size, the variable step size, the polynomial fit, and the golden section method, which will be discussed in detail in subsequent sections. Nonetheless, these methods have their own merits and demerits. A second-order method such as the Newton method is very suitable for quadratic problems and attains optimality in a small number of iterations [5]. However, the Newton method requires computing the inverse of the Hessian, which is often a bottleneck. Moreover, the Newton method is not suitable for solving large-scale optimization problems. To address these lapses of the Newton method, some remedies have recently been proposed, such as the matrix splitting seen in [6]. Another way of obtaining the fast convergence properties of second-order methods with the structure of first-order methods is the class of so-called quasi-Newton methods. These methods incorporate second-order (curvature) information into first-order approaches. Examples of these methods include BFGS [7] and the Barzilai-Borwein (BB) methods [8], [9]. However, these methods usually require additional assumptions, namely that the objective function is strongly convex and that its gradient is Lipschitz continuous, to improve the convergence rate, as seen in [10].
The authors are with the Electrical & Computer Engineering Department at the University of Central Florida, Orlando, FL 32816, USA. Email: [email protected], [email protected]
The conjugate gradient method, on the other hand, does not have as many restrictions as the Newton method, since it does not compute the inverse of the Hessian. The method computes a search direction at each iteration, expressed as a linear combination of the previous search direction and the gradient at the present iterate. Therefore, the conjugate direction method is more suitable for large-scale optimization problems than the Newton method. Another interesting attribute of the conjugate gradient method is that it admits different ways of calculating the search directions and is not limited to quadratic functions; we will use this while performing simulations with the banana function [5]. Fast convergence is a necessity, as we will explore in an application where malicious agents in a network (with a central coordinator) try to deviate the optimal solution of the regular agents. Some researchers have explored the roles adversarial agents play in such scenarios. One example is seen in [11], where the authors explore convergence properties using a consensus framework. Another is seen in [12], where the author devised a detection metric to identify which nodes within a network are malicious. We use a perturbation method: increasing the coefficient of a strictly convex function, where the new coefficient results in a perturbed optimal solution. In this paper, we explore the convergence properties of the objective functions of the regular agents' and malicious agents' central coordinators by using first-order methods such as the steepest descent with fixed step size, variable step size, polynomial (quadratic) fit, and the golden section method. We also explore convergence properties using the Newton method and the conjugate gradient method.
For both the first- and second-order methods described, we deduce through simulations how the malicious agents' perturbation parameter can either prevent convergence or increase the number of iterations needed to achieve convergence.
A. Contributions
In this paper, we compare and analyze the convergence attributes of a non-quadratic function in a scenario where a multiplicative perturbation parameter denotes the central coordinator of malicious agents solving a different function from the regular agents, with the goal of obstructing the regular agents' pursuit of optimality. We analyze the convergence numerically and show that convergence is attained despite the presence of the malicious agents. We explore the different methods described to analyze convergence without restricting the analysis to quadratic functions. Moreover, this is the first work to compare convergence of perturbed non-quadratic functions in the manner described.

B. Paper Organization
Section II presents the problem formulation, Section III introduces the different first- and second-order methods and their convergence analysis, numerical experiments are performed in Section IV, and we conclude in Section V.
C. Notation
We denote the sets of positive and negative reals as R+ and R−, the transpose of a vector or matrix as (·)^T, and the Euclidean norm of a vector by ||·||. We write the gradient of a function f(·) as ∇f(·), and ⟨·, ·⟩ denotes the inner product of two vectors. x∗ and xa represent the optimal points for the regular agents' central coordinator and the malicious agents' central coordinator, respectively. The variable ε represents the perturbation parameter.

II. PROBLEM FORMULATION
Suppose there is a central coordinator of regular agents in a network trying to solve the following minimization problem:

min f_r = f(x1, x2),   (1)

where f_r is a strictly convex function. To solve problem (1) using gradient methods, we use the iterative equation below:

x(k + 1) = x(k) − α∇f(x(k)),   (2)

where k is the iteration time step, α is the step size that the central coordinator uses to achieve convergence, and ∇f(x(k)) is the gradient of f at each iterate x(k). Different first- and second-order methods for solving problem (1) will be explored in the later part of the paper. The goal of the regular agents is to reach the optimal point x∗.

Suppose there are malicious agents in the network whose central coordinator decides to obstruct the regular agents from reaching their desired location. To facilitate this obstruction, the malicious agents' central coordinator uses a banana-type function of the form:

min f_m = m f_r,   (3)

where m is a positive constant and the malicious objective function f_m is also strictly convex. To solve problem (3) using gradient methods, and noting that the malicious agents are trying to reach the point x∗ + ε, we use the iterative equation below:

x(k + 1) = x(k) − α∇f(x(k)) + ε,   (4)

where ε is the perturbation parameter that the malicious agents' central coordinator uses to distort the usual optimal solution x∗ of the regular agents, k is the iteration time step, α is the step size that the central coordinator of malicious agents uses to achieve convergence, and ∇f(x(k)) is again the gradient of f at each iterate x(k). In the later part of the paper, we will use the banana function with ε = 100 to test how the optimal solutions are shifted and show that convergence is still obtained after some iterations. We now state the assumptions needed before the analysis. Assumption 1.
The decision set X of the coordinators of the malicious and regular agents is bounded. This means there exists some positive constant 0 ≤ B < ∞ such that |X| ≤ B.

Assumption 2. f(x) in problems (1) and (3) is strictly convex and twice differentiable.

III. ANALYSIS OF FIRST- AND SECOND-ORDER METHODS
We now discuss a first-order method known as the steepest descent method for solving problem (1).
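As a concrete reference point, the basic updates (2) and (4) can be sketched on a one-dimensional strictly convex example; the objective, step size, and perturbation values here are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of updates (2) and (4) on f(x) = (x - 3)^2, whose
# unperturbed minimizer is x* = 3.  Setting eps = 0 recovers the
# regular update (2); a nonzero eps shifts the limit point.

def grad(x):
    return 2.0 * (x - 3.0)

def iterate(alpha=0.1, eps=0.0, x0=5.0, iters=500):
    x = x0
    for _ in range(iters):
        x = x - alpha * grad(x) + eps   # update (4); eps = 0 gives (2)
    return x

x_regular = iterate()            # converges to x* = 3
x_shifted = iterate(eps=0.05)    # settles at 3 + eps/(2*alpha) = 3.25
```

The fixed point of the perturbed update satisfies α∇f(x) = ε, so the malicious shift moves the limit away from x∗ by an amount proportional to ε, which is the qualitative behavior the paper studies.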
A. The Steepest Descent Method
To solve a problem such as the one described in (1), a sequence of guesses x(0), x(1), ..., x(k), x(k + 1), ... is generated in a descent manner such that

f(x(0)) > f(x(1)) > ... > f(x(k + 1)).

It can often be tedious to obtain exact optimality after some K iterations, i.e., ∇f(x(K)) = 0. Therefore, it suffices to modify the gradient stopping condition to ||∇f(x(K))|| ≤ ε, where ε > 0 is very small; this is often referred to as the stopping criterion for convergence. At each k-th iterative guess of a descent method, we search along a direction d(k), which is an n × 1 vector, using a step size α(k) according to:

x(k + 1) = x(k) + α(k)d(k).

The goal is to ensure that f(x(k + 1)) ≤ f(x(k)). Moreover, if the direction of the negative gradient is taken, such that d(k) = −∇f(x(k)), we obtain the steepest descent algorithm. Different ways of choosing the step size are explored below:

B. Steepest Descent With a Constant Step Size
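A sketch of steepest descent with a constant step size and the gradient-norm stopping rule described above; the quadratic objective and the step size are illustrative assumptions.

```python
# Steepest descent with d(k) = -grad f(x(k)) and the stopping
# criterion ||grad f(x(K))|| <= tol.

def steepest_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=100000):
    x = list(x0)
    for k in range(max_iter):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 <= tol:   # stopping criterion
            return x, k
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x, max_iter

# Example: f(x1, x2) = (x1 - 1)^2 + 2*(x2 + 2)^2, minimizer (1, -2)
grad_f = lambda v: [2.0 * (v[0] - 1.0), 4.0 * (v[1] + 2.0)]
x_opt, iters = steepest_descent(grad_f, [0.0, 0.0])
```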
With a constant step size, one simply uses a single value of α in all iterations. Although it is relatively easy to implement, the demerit of using a fixed step size is not knowing a priori whether the choice of step size is numerically adequate. In problem (1), if the central coordinator chooses a very small α, the algorithm will be slow, and if it chooses an extremely large α, the algorithm might diverge. To illustrate the fixed step size principle on problem (1), we will pick suitable α values and show numerically how convergence is attained. To demonstrate that a suitable step size results in convergence to a neighborhood of the optimal solution, we state a convergence result for a gradient descent algorithm in the presence of malicious agents, provided the function in problem (1) is strongly convex. This is shown below:

C. Convergence proof for a strongly convex function with malicious perturbation parameter
Theorem 1.
Let Assumptions 1 and 2 hold, where the function in (1) is strongly convex with strong convexity constant µ and gradient Lipschitz constant L, µ ≤ L. Let the perturbation parameter satisfy ε > 0, and let each agent's initial estimate when malicious agents are present satisfy ||x(0) − x∗|| < ε_max. If c1 = 2/(µ + L), c2 = µL/(µ + L), and α < c1, the iterates generated from (2) converge to a neighborhood of the optimal solution x∗.

Proof. Refer to [13] for the proof.

Other methods of choosing the step size in a steepest descent algorithm are shown below:
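The step-size condition in Theorem 1 can be checked numerically on a small strongly convex quadratic; the constants µ, L and the step sizes below are illustrative assumptions.

```python
# f(x) = 0.5*(mu*x1^2 + L*x2^2) has strong convexity constant mu and
# gradient Lipschitz constant L.  A step below 2/(mu + L) converges,
# while a step above 2/L makes the stiff coordinate x2 diverge.

mu, L = 1.0, 9.0

def run(alpha, iters=300):
    x = [4.0, 4.0]
    for _ in range(iters):
        # gradient of f is (mu*x1, L*x2)
        x = [x[0] - alpha * mu * x[0], x[1] - alpha * L * x[1]]
    return x

c1 = 2.0 / (mu + L)     # Theorem 1's step-size bound: 0.2 here
safe = run(0.9 * c1)    # alpha < c1: both coordinates decay to 0
bad  = run(0.25)        # alpha > 2/L (~0.222): x2 blows up
```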
D. Steepest Descent with Variable Step Size
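A sketch of the variable step size rule developed in this subsection: at each iteration several candidate values of α are tried and the one yielding the smallest g(α) = f(x(k + 1)) is kept. The objective and the candidate set are illustrative assumptions.

```python
# Variable-step steepest descent: pick the best of several fixed
# candidate step sizes at every iteration.

def variable_step_descent(f, grad, x0, candidates=(0.001, 0.01, 0.1),
                          iters=200):
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        trials = [[xi - a * gi for xi, gi in zip(x, g)] for a in candidates]
        x = min(trials, key=f)          # keep the trial with smallest f
    return x

f = lambda v: (v[0] - 1.0) ** 2 + 2.0 * (v[1] + 2.0) ** 2
grad_f = lambda v: [2.0 * (v[0] - 1.0), 4.0 * (v[1] + 2.0)]
x_best = variable_step_descent(f, grad_f, [0.0, 0.0])
```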
In the variable step size method, several values of α are chosen at each iteration, and the value that produces the smallest g(α) is selected, where g(α(k)) = f(x(k + 1)). The variable step size algorithm is also easy to implement and has a better chance of convergence than the fixed step size method. The results of simulating problems (1) and (3) via the variable step size method are shown in Section IV. Another variation of the variable step size method is the class of polynomial fit methods, but we will focus mostly on one subclass of the polynomial methods, the quadratic fit method, below:

E. Steepest Descent with Quadratic Fit Method
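A sketch of the fit-and-minimize step developed in this subsection: a parabola is fitted through three sampled pairs (α_i, g(α_i)) and its vertex −b/(2a) is taken as the step size. The sample points and the test function g are illustrative assumptions.

```python
# Fit g(alpha) = a*alpha^2 + b*alpha + c through three samples
# (equations (6)-(8)) by direct elimination, then minimize the model
# (5) at its vertex -b/(2a).

def quadratic_fit_min(a1, a2, a3, g1, g2, g3):
    denom = (a1 - a2) * (a1 - a3) * (a2 - a3)
    a = (a3 * (g2 - g1) + a2 * (g1 - g3) + a1 * (g3 - g2)) / denom
    b = (a3 ** 2 * (g1 - g2) + a2 ** 2 * (g3 - g1)
         + a1 ** 2 * (g2 - g3)) / denom
    return -b / (2.0 * a)        # minimizer of the fitted parabola

# g(alpha) = (alpha - 2)^2 + 1 sampled at alpha = 0, 1, 3
alpha_star = quadratic_fit_min(0.0, 1.0, 3.0, 5.0, 2.0, 2.0)
print(alpha_star)  # 2.0, the exact minimizer of this quadratic g
```

Because g here is itself a quadratic, the fit is exact and one step recovers the true minimizer; for a general g the step is only a model-based estimate and is repeated each iteration.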
For the quadratic fit method, three values of α are guessed at each iteration, and the corresponding g(α(k)) values are computed, where g(α(k)) = f(x(k + 1)). For example, suppose the three chosen values of α are α1, α2, α3. To fit a quadratic model of the form

g(α) = aα² + bα + c,   (5)

we write the quadratic model at the three α values as

g(α1) = aα1² + bα1 + c,   (6)

g(α2) = aα2² + bα2 + c,   (7)

g(α3) = aα3² + bα3 + c,   (8)

where a, b, c are constants. After solving for the constants a, b, c in equations (6), (7), and (8), they are used in equation (5), and the quadratic fit (5) is then minimized. Another method for solving problem (1) and unconstrained optimization problems in general is the golden section search method. We briefly explain this as follows:

F. Steepest Descent with the Golden Section Search
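A sketch of the golden section search over a step-size interval, as described in this subsection; the interval, tolerance, and test function are illustrative assumptions.

```python
import math

# Golden section search: the interval [a, b] shrinks by the golden
# ratio each iteration, keeping the subinterval that must contain the
# minimizer of a unimodal g.

def golden_section(g, lo, hi, tol=1e-7):
    rho = (math.sqrt(5.0) - 1.0) / 2.0       # ~0.618
    a, b = lo, hi
    x1, x2 = b - rho * (b - a), a + rho * (b - a)
    g1, g2 = g(x1), g(x2)
    while b - a > tol:
        if g1 < g2:                          # minimizer lies in [a, x2]
            b, x2, g2 = x2, x1, g1
            x1 = b - rho * (b - a)
            g1 = g(x1)
        else:                                # minimizer lies in [x1, b]
            a, x1, g1 = x1, x2, g2
            x2 = a + rho * (b - a)
            g2 = g(x2)
    return 0.5 * (a + b)

alpha_star = golden_section(lambda t: (t - 0.3) ** 2, 0.0, 1.0)
```

Note that only one new evaluation of g is needed per iteration, since one interior point is reused; this is the main efficiency advantage of the golden ratio spacing.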
In this algorithm, we use a range between two values and divide the range into sections. We then eliminate some of the sections to shrink the region where the minimizer might lie. For this algorithm to be implemented, as we will see in Section IV, the initial region of uncertainty and the stopping criterion have to be defined. An example where a golden section search is applied to minimize a function on a closed interval is seen in [5]. Second-order methods have been an improvement in terms of convergence speed when it comes to solving unconstrained optimization problems such as (1) and (3). We will show by simulations in Section IV that convergence is faster with the second-order methods than with the first-order methods, despite the presence of malicious agents. We now analyze the two second-order methods below:
G. Newton Methods
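A sketch of the Newton update analyzed in this subsection, using a closed-form 2×2 Hessian inverse; the example objective is an illustrative assumption.

```python
# Newton update (9): x+ = x - (hess f(x))^{-1} grad f(x).  On a
# quadratic, a single Newton step lands exactly on the minimizer.

def newton_step(grad, hess, x):
    g = grad(x)
    (h11, h12), (h21, h22) = hess(x)
    det = h11 * h22 - h12 * h21              # nonzero for a PD Hessian
    dx = (h22 * g[0] - h12 * g[1]) / det     # first component of H^{-1} g
    dy = (h11 * g[1] - h21 * g[0]) / det     # second component of H^{-1} g
    return [x[0] - dx, x[1] - dy]

# f(x1, x2) = (x1 - 1)^2 + 4*(x2 + 3)^2, minimizer (1, -3)
grad_f = lambda v: [2.0 * (v[0] - 1.0), 8.0 * (v[1] + 3.0)]
hess_f = lambda v: ((2.0, 0.0), (0.0, 8.0))
x_new = newton_step(grad_f, hess_f, [5.0, 5.0])   # one step reaches (1, -3)
```

For larger problems one would solve the linear system H d = g rather than form the inverse, which is exactly the cost bottleneck discussed above.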
The Newton method is very useful for obtaining fast convergence on an unconstrained problem like (1), especially when the initial starting point is very close to the minimum. The main disadvantage of this method is the cost and difficulty associated with finding the inverse of the Hessian, and also ensuring that the Hessian is positive definite. The update equation for the Newton method is given by:

x(k + 1) = x(k) − (∇²f(x(k)))⁻¹ ∇f(x(k)).   (9)

Some of the methods that approximate the term containing the inverse of the Hessian, (∇²f(x))⁻¹, in equation (9) are the quasi-Newton methods, such as the BFGS and Barzilai-Borwein methods [10].

H. Conjugate Gradient Methods
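A sketch of the conjugate gradient iteration with the Fletcher-Reeves formula on a quadratic; an exact line search is used here for clarity (the paper's comparison uses a fixed step size), and the matrix Q and vector b are illustrative assumptions.

```python
# CG with Fletcher-Reeves beta on f(x) = 0.5*x^T Q x - x^T b.  With an
# exact line search it terminates in n steps for an n x n SPD Q.

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def mat_vec(A, v):
    return [dot(row, v) for row in A]

def cg_fletcher_reeves(Q, b, x0, iters):
    x = list(x0)
    g = [gi - bi for gi, bi in zip(mat_vec(Q, x), b)]   # grad = Qx - b
    d = [-gi for gi in g]
    for _ in range(iters):
        Qd = mat_vec(Q, d)
        alpha = -dot(g, d) / dot(d, Qd)                 # exact line search
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g_new = [gi - bi for gi, bi in zip(mat_vec(Q, x), b)]
        beta = dot(g_new, g_new) / dot(g, g)            # Fletcher-Reeves
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        g = g_new
    return x

Q = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_sol = cg_fletcher_reeves(Q, b, [0.0, 0.0], iters=2)  # n = 2 steps suffice
```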
For the class of quadratic functions f(x) = 0.5 xᵀQx − xᵀb with x ∈ Rⁿ, the conjugate gradient algorithm uses a direction expressed in terms of the current gradient and the previous direction at each iteration, ensuring that the directions are mutually Q-conjugate, where Q is a real symmetric n × n matrix. We note that the directions d(0), d(1), ..., d(m) are Q-conjugate if d(i)ᵀQd(j) = 0 for i ≠ j. The conjugate gradient method also exhibits fast convergence for non-quadratic problems like (1) and (3). In the simulations in Section IV, we use the Fletcher-Reeves formula given by:

β(k) = g(k + 1)ᵀ g(k + 1) / (g(k)ᵀ g(k)),

where g(k) = ∇f(x(k)) and the constants β(k) are picked such that the direction d(k + 1) is Q-conjugate to d(0), d(1), ..., d(k), with iterates updated according to:

x(k + 1) = x(k) + α(k)d(k),

where α(k) is the step size. We will show in Section IV that the conjugate gradient method performs better than the steepest descent method in terms of convergence rate when the same fixed step size is used.

IV. NUMERICAL EXPERIMENTS AND INSIGHTS
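The two test functions compared in this section can be sketched as follows; the starting point, step sizes, and iteration counts are illustrative assumptions rather than the exact experimental values.

```python
# Sketch of the two test functions (10) and (11): the banana function
# and its perturbed variant with coefficient 100, minimized by plain
# steepest descent with a fixed step size.

def grad_banana(x, c):
    # gradient of f(x) = c*(x2 - x1^2)^2 + (x1 - 1)^2
    return [-4.0 * c * x[0] * (x[1] - x[0] ** 2) + 2.0 * (x[0] - 1.0),
            2.0 * c * (x[1] - x[0] ** 2)]

def descend(c, x0, alpha, iters):
    x = list(x0)
    for _ in range(iters):
        g = grad_banana(x, c)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

# Regular objective (10): a moderate fixed step converges toward (1, 1)
x_reg = descend(1.0, [2.0, 2.0], alpha=1e-3, iters=200000)
# Perturbed objective (11): the larger coefficient requires a smaller
# step and more iterations for the same accuracy
x_mal = descend(100.0, [2.0, 2.0], alpha=1e-4, iters=500000)
```

This mirrors the qualitative finding of the experiments below: the perturbed coefficient shrinks the range of fixed step sizes for which first-order methods converge.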
In this section, we compare the methods discussed in Section III in terms of convergence and the number of iterations taken to reach optimality. We let the minimization problem of the regular agents' central coordinator be given by:

f_r = f(x1, x2) = (x2 − x1²)² + (x1 − 1)²,   (10)

and, due to the optimal-solution disruption goal of the malicious agents, we let the problem of the malicious agents' central coordinator be:

f_m = f(x1, x2) = 100(x2 − x1²)² + (x1 − 1)²,   (11)

where f_r and f_m are strictly convex in the neighborhood of the optimal solution. For these two functions, two initial conditions, one with x1 = 2 and one with x1 = 5, and a small stopping tolerance on ||∇f(x(K))|| are used across all the methods discussed in Section III. Here, K is the maximum iteration number to achieve convergence. By inspection, the optimal solutions of both equations (10) and (11) are x1 = x2 = 1. To compare the methods discussed in Section III, a geometric sequence of four fixed step sizes, ordered from largest to smallest, is used for the steepest descent, Newton, and conjugate gradient methods.

A. Case of the largest step size

When the largest step size is used, the iteration in equation (2) generated by (10) converges for the Newton method with both initial conditions, converges for the steepest descent with the first initial condition (x1 = 2), and diverges with the second initial condition (x1 = 5). The iteration also diverges for the conjugate gradient with both starting points. For the iteration generated by (11) with the largest step size, the Newton method converges from both starting points, while the steepest descent and the conjugate gradient methods diverge from both.

B. Case of the second step size

It is expected that the second step size will provide an improvement in convergence compared to the largest one, because it is smaller. When this step size
is used, the iteration in equation (2) generated by (10) converges for the Newton method from both starting points. Using the steepest descent on the first function, the iteration generated by (10) converges from both starting points, an improvement over the largest step size, for which the steepest descent diverges from the second initial point. Using the conjugate gradient method on the first function, convergence is obtained from both starting points, also an improvement over the largest step size, for which it diverges from both. For the iteration generated by (11) with the second step size, there is no improvement in convergence: divergence is obtained by the steepest descent and conjugate gradient methods from both initial points.

C. Case of the third step size

With the third step size, the three methods, conjugate gradient, Newton, and steepest descent, all converge from both starting points for the iteration generated by (10). For the iteration generated by (11), there is an improvement over the second step size: convergence is obtained from the first initial point, compared with the divergence obtained by the steepest descent and conjugate gradient methods at the second step size.

D. Case of the smallest step size

With the smallest step size, the three methods, steepest descent, Newton, and conjugate gradient, all converge, as shown in Figures 1 and 2, for the iterations generated by both (10) and (11) from both initial points. Therefore, we use this step size as a case study to compare the methods.

E. Significance of the Newton Method
The significance of the Newton method should not be overlooked, even for non-quadratic functions such as (10) and (11), because the method attains convergence to the optimal solution for all four fixed step sizes in the geometric sequence. Moreover, it achieves convergence in just a few iterations from both starting points. This affirms the distinctive convergence attribute of the Newton method when the starting point is not far from the optimal solution. To compare the performance of the fixed step sizes with variable-step alternatives, illustrations of the variable step size, quadratic fit, and golden section methods are presented below:

F. Comparison of the variable step size, quadratic fit and golden section with other methods
Starting from the two initial points and using the variable step size method, convergence is obtained for the two functions (10) and (11) when three varying candidate step sizes are used. When the quadratic fit is used, three values of the step size, selected from a fixed interval, are used in each iteration. The results from the quadratic fit show that better convergence is achieved for function (10) but weaker convergence for the second function. This indicates that fluctuations in the random selection of step sizes within a range can alter the convergence rate. Moreover, the perturbation in function (11) can also slow down convergence. For the golden section method, a search interval is used to locate the value of the step size that yields the solution to problems (10) and (11). Starting from the two initial points on the first function (10), the golden section method results in the fastest convergence compared with the steepest descent with fixed step size, the variable step size, and the quadratic fit methods. However, for the second function (11), convergence with the golden section method is slower than with the variable, fixed step size, and quadratic fit methods when the same initial starting conditions are used. The simulations for all of these methods are shown in the figures.

Fig. 1: Contour plots of the Steepest Gradient Descent Method with various choices of step size. The plots also show the trajectory of guesses converging to the minimum.

V. CONCLUSIONS
We analyze the convergence attributes of selected first- and second-order methods, the steepest descent, Newton, and conjugate gradient methods, and apply them to a strictly convex function and a banana-type function. We show, through the different optimization methods applied to functions (10) and (11), that it is still possible for the regular agents to converge to their optimal solution despite the presence of malicious agents, by showing that function (11) converges for some of the methods. We obtain a geometric test of fixed step sizes that guarantees convergence while comparing the optimization methods discussed. Numerical experiments affirm that the Newton method has the fastest convergence rate for the two strictly convex functions used in this paper and that it is most useful when the initial starting point is close to the optimal point, as seen with the starting points used. Of the three optimization methods considered, the conjugate gradient method is the most beneficial, since it does not have the convergence restrictions of the steepest descent and Newton methods.
REFERENCES

[1] A. Nedić, A. Olshevsky, and M. G. Rabbat, "Network topology and communication-computation tradeoffs in decentralized optimization," Proceedings of the IEEE, vol. 106, no. 5, pp. 953–976, 2018.
[2] S. Yang, Q. Liu, and J. Wang, "Distributed optimization based on a multiagent system in the presence of communication delays," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, no. 5, pp. 717–728, 2016.
[3] E. Montijano and A. R. Mosteo, "Efficient multi-robot formations using distributed optimization," IEEE, 2014, pp. 6167–6172.
[4] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, "Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning," IEEE, 2012, pp. 1543–1550.
[5] E. K. Chong and S. H. Żak, An Introduction to Optimization. John Wiley & Sons, 2004.
[6] M. Zargham, A. Ribeiro, A. Ozdaglar, and A. Jadbabaie, "Accelerated dual descent for network flow optimization," IEEE Transactions on Automatic Control, vol. 59, no. 4, pp. 905–920, 2013.
[7] M. Eisen, A. Mokhtari, and A. Ribeiro, "Decentralized quasi-Newton methods," IEEE Transactions on Signal Processing, vol. 65, no. 10, pp. 2613–2628, 2017.
[8] Y.-H. Dai and R. Fletcher, "Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming," Numerische Mathematik, vol. 100, no. 1, pp. 21–47, 2005.
[9] P. E. Gill and W. Murray, "Quasi-Newton methods for unconstrained optimization," IMA Journal of Applied Mathematics, vol. 9, no. 1, pp. 91–108, 1972.
[10] J. Gao, X. Liu, Y.-H. Dai, Y. Huang, and P. Yang, "Geometric convergence for distributed optimization with Barzilai-Borwein step sizes," arXiv preprint arXiv:1907.07852, 2019.
[11] S. Sundaram and B. Gharesifard, "Distributed optimization under adversarial nodes," IEEE Transactions on Automatic Control, vol. 64, no. 3, pp. 1063–1076, 2018.
[12] N. Ravi, A. Scaglione, and A. Nedić, "A case of distributed optimization in adversarial environment," in ICASSP 2019, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5252–5256.
[13] I. Emiola, L. Njilla, and C. Enyioha, "On distributed optimization in the presence of malicious agents," 2021.

Fig. 2: Contour plots of Newton's method and the Conjugate Gradient method. The plots also show the trajectory of guesses converging to the minimum.