AdaSwarm: Augmenting Gradient-Based Optimizers in Deep Learning with Swarm Intelligence
Rohan Mohapatra, Snehanshu Saha, Carlos A. Coello Coello, Anwesh Bhattacharya, Soma S. Dhavala, Sriparna Saha
Rohan Mohapatra∗, Snehanshu Saha†, and Soma S. Dhavala‡
∗Dept. of Computer Science and Engineering, PES University, Bengaluru, India — [email protected]
†Dept. of Computer Science and Information Systems and APPCAIR, BITS Pilani, K K Birla Goa Campus, Goa, India — [email protected]
‡Center for AstroInformatics, Modeling and Simulation (CAMS) & ML Square, Bengaluru, India — [email protected]
Abstract—This paper tackles the age-old question of derivative-free optimization in neural networks. It introduces AdaSwarm, a novel derivative-free optimizer designed to match or exceed the performance of Adam without using gradients. To support AdaSwarm, a novel Particle Swarm Optimization variant, Exponentially weighted Momentum PSO (EM-PSO), is also proposed. EM-PSO is a derivative-free optimizer that tackles constrained and unconstrained single-objective optimization problems; we apply it to benchmark test functions, engineering optimization problems, and habitability scores for exoplanets, demonstrating the speed and convergence of the technique. EM-PSO is then extended by approximating the gradient of a function at any point using the parameters of the particle swarm optimization. This is a novel technique to simulate gradient descent, the workhorse of the back-propagation algorithm, using gradients approximated from the particle swarm optimization parameters. Mathematical proofs of gradient approximation by EM-PSO, thereby bypassing gradient computation, are presented. AdaSwarm is compared with various optimizers, and the theory and algorithmic performance are supported by promising results.
Index Terms—particle swarm optimization, meta-heuristics, gradient descent, neural networks, back-propagation, momentum, Adam
I. INTRODUCTION
Gradient descent is a popular method, used extensively in the backpropagation algorithm to minimise the loss in a neural network. Though gradient descent is reliable and efficient, it is a derivative-based optimisation algorithm. This poses two challenges: it can be applied only to differentiable functions, and computing the derivative for complex, real-world problems can be difficult. As a result of these shortcomings, there has been significant interest in using meta-heuristic algorithms for optimization. One such meta-heuristic algorithm, particle swarm optimization (PSO), was proposed in 1995 as a swarm-intelligence-based evolutionary algorithm for optimization problems; it is inspired by the flocking behavior of birds. In this paper, we explore PSO and a novel variant of PSO as derivative-free techniques to emulate gradient descent and replace it in the backpropagation algorithm. The underlying question being addressed is whether the gradient (read: derivative) of a function can be approximated by a meta-heuristic algorithm such as PSO. We address this by establishing an equivalence between PSO and gradient descent, both theoretically and empirically, such that we can compute the gradient of a function at any point just by knowing the values of the PSO parameters, without having to actually calculate the derivative. In this regard, AdaSwarm is proposed to show that even a "gradient"-free optimizer can give results that match or exceed those of the best optimizers.
A. Motivation
The sources of inspiration to understand and improve optimization techniques in general, and PSO in particular, can come from many places. Said et al. [1] postulate that swarms behave similarly to classical and quantum particles. In fact, their analogy is so striking that one may tend to think that the social and individual intelligence angles in PSO are, after all, useful perspectives, and that there is a neat underlying dynamical system at play. This dynamical-systems perspective was indeed also useful in unifying two almost parallel streams, namely optimization and Markov Chain Monte Carlo sampling [2], [3]. In a seminal paper, Welling and Teh [2] show that an SGD optimization technique can be turned into a sampling technique by just adding noise, governed by Langevin dynamics. Recently, Soma and Sato [4] provide further insights into this connection based on an underlying dynamical system governed by stochastic differential equations (SDEs). While these results are new, the connections between derivative-free optimization techniques based on Stochastic Approximation and Finite Differences are well documented [5]. Such strong connections between these seemingly different sub-fields of optimization and sampling made us ponder: is there a much larger, grander template, of which the aforementioned approaches are special cases? What is missing in the grander puzzle, and where do meta-heuristic algorithms fit? This question is useful in two ways: can we use ideas developed in improving SGD and apply them to PSO, and can we use particle history and their location awareness to offer derivative-free approximations to the gradients? We answer these questions positively in this paper.

B. Our Contribution
The following items are accomplished in the remainder of the paper:
• We propose a novel approach, EM-PSO, to momentum particle swarm optimization by aligning the weighted average with the exploration part of vanilla PSO (please see Section II), thereby needing fewer iterations to reduce the error and reach the optimum. This differs from the existing M-PSO approach (see Appendix A), where exploration and exploitation terms are weighted equally.
• We present the mathematical formulation of EM-PSO and prove theorems (see Section III) establishing a mathematical equivalence between: 1) vanilla PSO and gradient descent; 2) EM-PSO and stochastic gradient descent with momentum.
• The theorems propose a direct gradient approximation method, with precise alternatives to gradient computation, by exploiting the hyperparameters of the vanilla and EM versions of PSO. Since our approach is of approximation type, we prove a result on the order of accuracy of the approximation method.
• We interpret the gradient approximation approach to emulate backpropagation in vanilla feed-forward neural networks and CNNs via EM-PSO and test it on a simulated data set (see Section IV and Fig. 2).
• We also present a novel adaptive gradient-free optimizer, AdaSwarm, use it on neural networks, and present results for the same (see Section VI).
• We present a rotational variant of EM-PSO, called REM-PSO, to train on high-dimensional classification data.
• Since the fulcrum of the work is optimization, we apply our method to benchmark test optimization functions. To establish the efficacy of our approach, we have tested 20 such functions [6] and present a few samples in the paper. Finally, we apply EM-PSO to solve unconstrained and constrained single-objective benchmark engineering optimization problems (see Appendix B).
II. EXPONENTIALLY WEIGHTED MOMENTUM PARTICLE SWARM OPTIMIZATION (EM-PSO)

In this section, we propose a novel approach to momentum particle swarm optimization. The problem with the currently available momentum Particle Swarm Optimization [7] is that the weighted average it computes covers both exploration and exploitation simultaneously. Since PSO searches the space by exploring, it makes more sense to give more weight to the exploration part of the equation. Another problem with that formulation is that it takes more iterations to reduce the error and reach the minimum. (The simulated data set is available at https://gist.github.com/rohanmohapatra/4e7bce4f0d95746d8b993437c257e99b.)
To counter the above-said problems, we mathematically formulate a new Particle Swarm Optimization with momentum as follows:

v_i^{t+1} = M_i^{t+1} + c_1 r_1 (pbest_i − x_i^t) + c_2 r_2 (gbest − x_i^t)   (1)

where

M_i^{t+1} = β M_i^t + (1 − β) v_i^t   (2)

Here β is the momentum factor, and M_i^{t+1} indicates the effect of the momentum. Combining (1) and (2), the update can be rewritten as

v_i^{t+1} = β M_i^t + (1 − β) v_i^t + c_1 r_1 (pbest_i − x_i^t) + c_2 r_2 (gbest − x_i^t)   (3)

PSO is composed of two phases, the exploration phase and the exploitation phase [8]:

v_i^t (exploration) + c_1 r_1 (pbest_i − x_i^t) + c_2 r_2 (gbest − x_i^t) (exploitation)   (4)

With the proposed approach, the exploration phase is determined by the exponentially weighted average of the previous velocities only. The negligible weights applied in the momentum PSO [7] do not help much in providing the acceleration required.
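For concreteness, the following is a minimal sketch of the EM-PSO update of Eqs. (1)-(3). The swarm size, iteration count, coefficient values, and initialization bounds are illustrative choices, not the settings used in the paper's experiments.

```python
import numpy as np

def em_pso(f, dim, n_particles=30, iters=200, c1=0.8, c2=0.9, beta=0.9,
           bounds=(-5.0, 5.0), seed=0):
    """Minimise f with exponentially weighted momentum PSO, Eqs. (1)-(3)."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))   # positions
    v = np.zeros_like(x)                          # velocities
    m = np.zeros_like(x)                          # momentum term M_i^t
    pbest = x.copy()
    pbest_val = np.apply_along_axis(f, 1, x)
    gbest = pbest[pbest_val.argmin()].copy()

    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        m = beta * m + (1.0 - beta) * v                              # Eq. (2)
        v = m + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)        # Eq. (1)
        x = x + v
        vals = np.apply_along_axis(f, 1, x)
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

# Example: the sphere function converges towards the origin.
best_x, best_f = em_pso(lambda z: float(np.sum(z ** 2)), dim=2)
```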
A. Intuition behind the term Exponentially Weighted Average
In the previous section, we defined the momentum (2) and mentioned that it is an exponentially weighted average of the previous velocities seen so far. We show this by expanding M_i^t in (2):

M_i^t = β M_i^{t−1} + (1 − β) v_i^{t−1}   (5)
M_i^t = β [β M_i^{t−2} + (1 − β) v_i^{t−2}] + (1 − β) v_i^{t−1}   (6)
M_i^t = β^2 M_i^{t−2} + β (1 − β) v_i^{t−2} + (1 − β) v_i^{t−1}   (7)
M_i^t = β^2 [β M_i^{t−3} + (1 − β) v_i^{t−3}] + β (1 − β) v_i^{t−2} + (1 − β) v_i^{t−1}   (8)
M_i^t = β^3 M_i^{t−3} + β^2 (1 − β) v_i^{t−3} + β (1 − β) v_i^{t−2} + (1 − β) v_i^{t−1}   (9)

Generalizing, M_i^t can be written as

M_i^t = β^n M_i^{t−n} + β^{n−1} (1 − β) v_i^{t−n} + β^{n−2} (1 − β) v_i^{t−(n−1)} + ... + β (1 − β) v_i^{t−2} + (1 − β) v_i^{t−1}   (10)

From this equation we see that the t-th value of the momentum depends on all the previous velocities 1..t. Each previous velocity is assigned a weight, namely β^i (1 − β) for the (t − i)-th velocity. Because β is less than 1, it becomes even smaller when raised to a positive power, so older velocities receive much smaller weights and therefore contribute less to the overall value of the momentum.

B. REM-PSO: Rotation Accelerated EM-PSO for High-Dimensional Data

A lot of research has demonstrated the effectiveness of PSO in solving and optimizing various discrete and continuous problems, and plenty of applications of PSO, such as neural network training, PID controller tuning, and electric system optimisation, have been studied with good results. However, PSO often fails to find the global optimum when the objective function has a large number of dimensions, mainly because of the amount of computation PSO must perform to reach the global optimum; there is a high chance that it gets stuck in a sub-plane of the whole search space. T. Hatanaka [9] proposed a Rotated Particle Swarm Optimization to improve the performance of PSO in high-dimensional optimization. We can redefine the EM-PSO equation as follows. Let

φ_1 = diag(c_1 r_{1,1}, c_1 r_{1,2}, c_1 r_{1,3}, ..., c_1 r_{1,d})   (11)
φ_2 = diag(c_2 r_{2,1}, c_2 r_{2,2}, c_2 r_{2,3}, ..., c_2 r_{2,d})   (12)

Then the EM-PSO equation can be written as

v_i^{t+1} = β M_i^t + (1 − β) v_i^t + φ_1 (pbest_i − x_i^t) + φ_2 (gbest − x_i^t)   (13)

The EM-PSO algorithm was designed by emulating birds seeking food, with a faster convergence rate. Birds probably never change their seeking strategy according to whether the food lies in the true north or in the northeast; consequently, particles can search for the optimum even if the axes are rotated. We apply a coordinate conversion to the velocity update by using a matrix A. The matrix A is a D × D matrix in which a certain factor determines the number of axes to rotate. If we consider a point (x, y) in the original space, then in the rotated coordinate space it becomes (x', y'), where

x' = x cos(θ) + y sin(θ)   (14)
y' = −x sin(θ) + y cos(θ)   (15)

A^{−1} = A^T since A has an orthonormal basis.
An arbitrary matrix A with orthonormal basis vectors, i.e.

T_A : e_1, e_2, ..., e_N → e'_1, e'_2, ..., e'_N   (16)

can be expressed as the transformation matrix

A = [e'_1, e'_2, ..., e'_N],  e'_j ∈ R^N, j = 1, ..., N   (17)

The rotation matrix rotating the N-dimensional solution space by an angle θ is expressed as

M(θ, N) = Π_{i=1}^{N−1} Π_{j=i+1}^{N} M_{i,j}(θ)   (18)

and each element p^{i,j}_{q,l}(θ) of M_{i,j}(θ) is given by

p^{i,j}_{q,l} = cos(θ)  if q = i, l = i;
             = sin(θ)   if q = i, l = j;
             = −sin(θ)  if q = j, l = i;
             = cos(θ)   if q = j, l = j;
             = 1        if q = l, q ≠ i, q ≠ j;
             = 0        otherwise.

For example, rotating the (1, 2) and (4, 5) coordinate planes of a five-dimensional space gives the matrix M (or A):

M = [ cos(θ)   sin(θ)   0   0        0
      −sin(θ)  cos(θ)   0   0        0
      0        0        1   0        0
      0        0        0   cos(θ)   sin(θ)
      0        0        0   −sin(θ)  cos(θ) ]
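The product form of Eq. (18) can be assembled from planar rotations as in the sketch below; the dimension and angle are illustrative.

```python
import numpy as np

def plane_rotation(theta, n, i, j):
    """Rotation M_{i,j}(theta) acting in the (i, j) coordinate plane of R^n."""
    m = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    m[i, i], m[i, j] = c, s
    m[j, i], m[j, j] = -s, c
    return m

def rotation_matrix(theta, n):
    """M(theta, N) as the product of all planar rotations, Eq. (18)."""
    a = np.eye(n)
    for i in range(n - 1):
        for j in range(i + 1, n):
            a = a @ plane_rotation(theta, n, i, j)
    return a

A = rotation_matrix(np.deg2rad(30), 5)
assert np.allclose(A.T @ A, np.eye(5))   # orthonormal, so A^{-1} = A^T
```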
C. Transformation Invariance

EM-PSO optimization has transformation invariance as follows:
• f ∘ Tr_s ∘ update = f ∘ update ∘ Tr_s (transformation invariance for the solution space)
• Tr_f ∘ f ∘ update = f ∘ update ∘ Tr_f (transformation invariance for the objective function)

Here Tr_s denotes a scale transformation, parallel shift, or rotation of the solution space, and Tr_f denotes a scale transformation, parallel shift, or rotation of the objective function. Invariance under rotation of the solution space: Tr_s : x → G x, where G ∈ R^{N×N} satisfies G^{−1} = G^T and rotates a vector in the solution space. Combining the Rotated Particle Swarm Optimization equation with the EM-PSO equation, we get

v_i^{t+1} = β M_i^t + (1 − β) v_i^t + A^T φ_1 A (pbest_i − x_i^t) + A^T φ_2 A (gbest − x_i^t)   (19)

where

A = [e'_1, e'_2, ..., e'_N],  e'_j ∈ R^N, j = 1, ..., N   (20)

is a coordinate transformation matrix, and e'_1, e'_2, ..., e'_N is the orthonormal basis of eigenvectors of the covariance matrix Σ(Z) ∈ R^{N×N} of the solution set Z.

III. EQUIVALENCE BETWEEN GRADIENT DESCENT AND PARTICLE SWARM OPTIMIZATION
A. Proof of Equivalence between Gradient Descent and Vanilla PSO
Theorem:
Under reasonable assumptions on the global minimum, the following equivalence holds: η = ω and f''(w') = −(c_1 r_1 + c_2 r_2)/η.

Proof: In gradient descent, the weight update rule is given by

w^{(t)} = w^{(t−1)} + η ∂f/∂w   (21)

Here we use the Taylor series expansion of a given f(x) and differentiate the expansion to get an approximation for the gradient ∂f/∂x:

f(x) = f(a) + f'(a)(x − a) + (f''(a)/2!)(x − a)^2 + ... + E_n(x)   (22)

where E_n(x) is the error term specifying the deviation from the actual curve, given by

|E_n(x)| = k (x − a)^{n+1} / (n + 1)!   (23)

Taking d/dx on both sides, the gradient is approximated using the Taylor series expansion. The rationale behind approximating the gradient is that, for a given function, calculating the gradient can be tedious. By approximating the gradient using a Taylor expansion centered around the global minimum, we gain the mathematical convenience of f'(a) = 0:

f'(x) = 0 + f'(a) + f''(a)(x − a) + (f'''(a)/2!)(x − a)^2 + ... + E_{n−1}(x)   (24)

The gradient can then be written as follows at the optimal point w'. This point is the global minimum of the function f(x) (the gradient curve is centered around w', and any point on the gradient curve can be approximated from the expansion):

∂f/∂w |_{w = w'} = f'(w') + f''(w')(w − w') + ... + E_{n−1}(w)   (25)

By combining (21) and (25) we get

w^{(t)} = w^{(t−1)} + η f'(w') + η f''(w')(w − w') + E_{n−1}(w)   (26)

The equation for the basic Particle Swarm Optimization (PSO) (61) for particle i is

x_i^{(t)} = x_i^{(t−1)} + ω v_i^{(t−1)} + c_1 r_1 (pbest_i − x_i^{(t−1)}) + c_2 r_2 (gbest − x_i^{(t−1)})   (27)

We make the following assumptions:
• Around the optimum (global minimum), the exploration phase becomes almost constant, hence the analogy f'(w') = v_i^{(t−1)} holds: f'(w') is a constant value, and when the particles are around the global minimum, the velocity is constant as well.
• Around the global minimum, for most of the particles, pbest_i ≈ gbest.

If these assumptions hold, then we can draw the equivalence

η = ω   (28)
f''(w') = −(c_1 r_1 + c_2 r_2) / η   (29)

For mathematical convenience, we set the learning rate of the weight update rule to η = ω. Using these parameters and equation (25), any gradient can be approximated without having to calculate the gradients of the function itself. Under the above assumptions, we must also factor in the other variables that play a large role in the approximation: as we move farther away from the center w', the error term becomes significantly large. The pbest_i term accounts for points away from the minimum, and hence the equivalence can also be written as

v_i^{(t−1)} = f'(w') + E_{n−1}(w)   (30)
f''(w')(w − w') = −(1/η) c_1 r_1 (pbest_i − x_i^{(t−1)}) − (1/η) c_2 r_2 (gbest − x_i^{(t−1)})   (31)
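The sketch below illustrates the equivalence numerically on a toy one-dimensional function: a vanilla PSO run supplies gbest and the c_1 r_1, c_2 r_2 terms, and the derivative is then approximated from those swarm quantities alone, in the form used for the error gradient later in Eq. (52). The hyperparameter values are illustrative, and the match to the true derivative is only approximate, since the equivalence holds under the stated assumptions near the global minimum.

```python
import numpy as np

f = lambda x: (x - 2.0) ** 2      # toy objective, minimum at w' = 2, so f'(x) = 2 (x - 2)

rng = np.random.default_rng(1)
omega, c1, c2 = 0.5, 0.9, 0.9     # eta = omega, per the equivalence above
n, iters = 30, 100
x = rng.uniform(-5.0, 5.0, n)
v = np.zeros(n)
pbest, pbest_val = x.copy(), f(x)
gbest = pbest[pbest_val.argmin()]

for _ in range(iters):
    r1, r2 = rng.random(n), rng.random(n)
    v = omega * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # vanilla PSO, Eq. (27)
    x = x + v
    vals = f(x)
    better = vals < pbest_val
    pbest[better], pbest_val[better] = x[better], vals[better]
    gbest = pbest[pbest_val.argmin()]

# (c1 r1 + c2 r2)/eta plays the role of the curvature magnitude; average it over the swarm.
curvature = np.mean(c1 * r1 + c2 * r2) / omega
for w in (1.0, 1.5, 2.5, 3.0):
    approx = -curvature * (gbest - w)   # derivative from swarm quantities only (cf. Eq. (52))
    print(f"w = {w:3.1f}   true f'(w) = {2 * (w - 2):+.3f}   approx = {approx:+.3f}")
```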
B. Proof of Equivalence between Stochastic Gradient Descent with Momentum and EM-PSO

Theorem:
Under reasonable assumptions on the global minimum, the following equivalence holds: η = (1 − β), f''(w') = −(c_1 r_1 + c_2 r_2)/η, and α = ηβ/(1 − β).

Proof:
In the previous section, we proved the equivalence between the gradient descent rule and the vanilla Particle Swarm Optimizer; we now extend it to a more advanced version of gradient descent, one that tackles slow convergence and stagnation at local minima. The gradient descent weight update with momentum is given by

w^{(t)} = w^{(t−1)} + η V_{dw}^t   (32)

where

V_{dw}^t = β V_{dw}^{t−1} + (1 − β) ∂f/∂w   (33)

Combining the equations and dividing (33) by (1 − β), we get

w^{(t)} = w^{(t−1)} + α V_{dw}^{t−1} + η ∂f/∂w   (34)

Here η is the learning rate and V_{dw}^{t−1} is the momentum applied to the weight update, where

α = ηβ / (1 − β)   (35)

We apply the same Taylor series expansion to the gradient as defined in (25) and then formulate gradient descent with momentum as

w^{(t)} = w^{(t−1)} + η f'(w') + η f''(w')(w − w') + E_{n−1}(w) + α V_{dw}^{t−1}   (36)

The proposed EM-PSO, defined above, can be written as

x_i^t = x_i^{t−1} + β M_i^{t−1} + (1 − β) v_i^{t−1} + c_1 r_1 (pbest_i − x_i^{t−1}) + c_2 r_2 (gbest − x_i^{t−1})   (37)

Under the same assumptions stated in the previous proof, we define the equivalence as

η = (1 − β)   (38)
f''(w') = −(c_1 r_1 + c_2 r_2) / η   (39)
α = ηβ / (1 − β)   (40)

We find that M_i^{t−1} plays the same role as the momentum term V_{dw}^{t−1} in (32), which helps in smoothing and faster convergence to the minimum.

IV. INTERPRETATION OF GRADIENTS AND EMULATING BACKPROPAGATION IN VANILLA FEED-FORWARD NEURAL NETWORKS
Let us revisit the gradient descent rule used in the backpropagation algorithm for vanilla feed-forward networks:

w^{(t)} = w^{(t−1)} + η ∂f/∂w   (41)

The equivalence between gradient descent and PSO was proved in the previous section, so we can now substitute the gradient with PSO parameters and approximate the gradient values. Computing the gradient of some functions can be very difficult, and some loss functions are non-differentiable; if we can get a good approximation, we can easily emulate gradient descent. At any iteration, the c_1 r_1 and c_2 r_2 values are computed by averaging over the n particles, and v_i^{(t−1)} is taken from the particle that influences gbest. Using the PSO parameters, we get

w^{(t)} = w^{(t−1)} − η [ v_i^{(t−1)} − (1/η) c_1 r_1 (pbest_i − x_i^{(t−1)}) − (1/η) c_2 r_2 (gbest − x_i^{(t−1)}) ]   (42)

Table I shows the emulation using approximated gradients. PSO has been used extensively for training neural networks [10], [11], since gradient descent has a higher chance of getting stuck in local minima. In that approach, each PSO particle consists of the weights of the neural network, and the fitness function is the loss function to be reduced. The caveat is that as the number of weights increases, so does the number of dimensions of a single particle; it is reported that such a system fails to converge because of the enormous weight updates. We propose a rather different idea to train the neural network: we use batch training, and the dimension of the particle vector is now just batch size * number of classes, significantly reducing the amount of computation and increasing training speed over previous approaches. In a neural network, we essentially have a loss function that must be minimized to obtain the set of weights that classifies a given instance. Traditionally, the back-propagation algorithm performs the update as follows (here we consider the loss function to be the MSE (mean squared error), (ŷ − y)^2, where ŷ and y are the target and output values respectively). The equivalence between the error gradient and the EM-PSO approximation is a natural but non-trivial consequence of the insights gained from the theorems proved in Section III. If a derivative can be expressed mathematically using the PSO and the proposed EM-PSO parameters, then the error gradient, whose minimization is critical in backpropagation, can be approximated using PSO parameters. This is a marked departure from existing heuristic-based approaches.

∂E/∂w = (y − ŷ) * D(activation) * x   (43)

Using the equivalence proved above, we can substitute the value of the gradient in (43) and get

∂E/∂w = −((c_1 r_1 + c_2 r_2)/η) (gbest − y) * D(activation) * x   (44)

where D(activation) is the derivative of the activation function used. Using this approximation, we descend to the minimum at a faster rate. From Fig. 1 it can be seen clearly that the descent in slope is steeper than with the traditional gradient approach.

Fig. 1. Error gradient approximation in backpropagation by EM-PSO, as proposed in Eq. (39).
V. EMULATION OF GRADIENT DESCENT IN VANILLA FEED-FORWARD NETWORKS
Here, the claim of Section IV is explained in detail. The backpropagation weight-tuning rule uses the error gradient ∂E/∂w_{ji}. The observation is that weight w_{ji} can influence the rest of the network only through the net input to neuron j, net_j. Let j be an output unit. Using the chain rule,

∂E/∂w_{ji} = (∂E/∂net_j)(∂net_j/∂w_{ji})   (45)
∂E/∂w_{ji} = (∂E/∂net_j) x_{ji}   (46)

Just as w_{ji} influences the network through net_j, net_j influences the network through y_j. Applying the chain rule again,

∂E/∂w_{ji} = (∂E/∂y_j)(∂y_j/∂net_j) x_{ji}   (47)

Calculating the gradient ∂E/∂y_j analytically would mean choosing a differentiable loss function, which eliminates many loss functions. To tackle this problem, we propose approximating the gradient using PSO parameters. Using the theorem, we will show that

∂E/∂y_j = −((c_1 r_1 + c_2 r_2)/η) (gbest − y_j)   (48)

Using the Taylor series expansion, the gradient is approximated as

∂E/∂y_j |_{y_j = y_opt} = E'(y_opt) + E''(y_opt)(y_j − y_opt) + ... + R_{n−1}(y_j)   (49)

Here E(y) is the loss with respect to which the error is minimized so as to get good predictions, R_{n−1} is the remainder, and y_opt is the value at which the loss is lowest. Applying the PSO parameter equivalence proved in the theorem of Section III-B, we get

E''(y_opt) = −(c_1 r_1 + c_2 r_2)/η   (50)

Substituting this value in (49), we get

∂E/∂y_j = E'(y_opt) + ((c_1 r_1 + c_2 r_2)/η)(y_j − y_opt) + ... + R_{n−1}(y)   (51)

Let us state a few assumptions, which are likely to hold in most cases:
• Around the optimum (global minimum), the loss gradient is 0, hence E'(y_opt) = 0.
• The remainder term of the Taylor series expansion can be neglected, as it has little significance in most cases, hence R_{n−1}(y) = 0.
• In Particle Swarm Optimization, gbest is the optimum value found for any function, so if E(y) is a loss function, then under PSO gbest can be substituted in place of y_opt.

We then obtain a rather simple equation for the error gradient:

∂E/∂y_j = −((c_1 r_1 + c_2 r_2)/η) (gbest − y_j)   (52)

Substituting into (47) yields

∂E/∂w_{ji} = −((c_1 r_1 + c_2 r_2)/η) (gbest − y_j) (∂y_j/∂net_j) x_{ji}   (53)

where ∂y_j/∂net_j is the derivative of the activation function used.
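A small sketch of Eqs. (52)-(53) for a single output unit follows, assuming a sigmoid activation; the coefficient values in the example call are placeholders rather than values taken from a trained swarm.

```python
import numpy as np

def approx_error_grad(x, w, gbest, c1r1, c2r2, eta):
    """Approximate dE/dw for one sigmoid output unit via Eqs. (52)-(53).

    x      : input vector feeding the unit
    w      : the unit's weight vector
    gbest  : swarm's best value of the output, standing in for y_opt
    c1r1, c2r2, eta : swarm coefficients used in the approximation
    """
    net = x @ w
    y = 1.0 / (1.0 + np.exp(-net))                 # sigmoid activation
    dy_dnet = y * (1.0 - y)                        # derivative of the activation
    dE_dy = -((c1r1 + c2r2) / eta) * (gbest - y)   # Eq. (52)
    return dE_dy * dy_dnet * x                     # Eq. (53)

# Illustrative call with placeholder values.
g = approx_error_grad(np.array([0.2, -0.7, 1.0]), np.array([0.1, 0.4, -0.3]),
                      gbest=1.0, c1r1=0.45, c2r2=0.45, eta=0.5)
```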
VI. ADASWARM

In this section, we propose a fast, gradient-free optimizer, AdaSwarm. The Adam optimizer was proposed for problems requiring first-order optimization and has little memory requirement; it computes adaptive learning rates for different parameters from estimates of the first and second moments of the gradients. AdaSwarm is based on the Adam optimizer, with the gradient replaced by the approximated gradient: the gradients are approximated using the theorems proposed above, and these approximations replace the gradients in Adam's update rule. The results and experiments discussed below show that AdaSwarm has a lower execution time and optimization comparable to Adam, sometimes performing better. The algorithm below describes AdaSwarm.
Algorithm 1: AdaSwarm

Require: η: learning rate
Require: β_1, β_2 ∈ [0, 1): exponential decay rates for the moment estimates
Require: f(θ): function with parameter θ
Require: θ_0: initial parameter vector
m_0 ← 0; v_0 ← 0; t ← 0
while θ_t not converged do
    t ← t + 1
    g_t ← approximated gradients (gradients w.r.t. the stochastic objective at timestep t)
    m_t ← β_1 m_{t−1} + (1 − β_1) g_t
    v_t ← β_2 v_{t−1} + (1 − β_2) g_t^2
    θ_t ← θ_{t−1} − η m_t / (√v_t + ε)
Return θ_t

Adam is a very popular optimization algorithm used extensively in neural networks and is among the best optimizers; it can be viewed as a combination of stochastic gradient descent with momentum and RMSProp. It leverages momentum by using a moving average of the gradient instead of the gradient itself, like SGD with momentum, and it scales the learning rate by the squared gradients, like RMSProp. When we add to these capabilities the approximate gradients calculated using EM-PSO (a fast-converging particle swarm optimizer), it becomes a truly derivative-free optimizer. AdaSwarm thus combines RMSProp, SGD with momentum, and EM-PSO to provide speed and acceleration when training neural networks.
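A compact sketch of Algorithm 1 follows; the callable that supplies the swarm-approximated gradient (named approx_grad here) is an assumed interface, and bias correction is omitted because the listing above does not include it.

```python
import numpy as np

def adaswarm(theta0, approx_grad, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """AdaSwarm (Algorithm 1): Adam-style update driven by EM-PSO-approximated gradients.

    approx_grad(theta, t) must return the gradient approximated from swarm
    parameters (e.g. via Eq. (52)); no analytic derivative is ever requested.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = approx_grad(theta, t)                 # swarm-based gradient estimate
        m = beta1 * m + (1.0 - beta1) * g         # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g     # second-moment estimate
        theta = theta - eta * m / (np.sqrt(v) + eps)
    return theta
```

Any routine that maps the current parameters to an Eq. (52)-style estimate can be plugged in as approx_grad, which is what makes the update derivative-free.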
A. Order of Accuracy: EMPSO approximation to gradients
We seek a numerical approximation to the exact value of the gradient. The approximation depends on a small parameter h, which can be, for instance, the grid size or time step in a numerical method. We denote the approximation of the gradient by ũ_h, and determine its order of accuracy.

TABLE I
ADASWARM VS OTHER OPTIMIZERS: PROMISING RESULTS OF ADASWARM COULD ADDRESS THE SENSITIVITY (TO INITIALISATION) AND ROBUSTNESS (TO MULTIPLE LOCAL MINIMA) IN CLASSIFICATION DATASETS

Optimizers compared: SGD, Emulation of SGD with PSO parameters, AdaGrad, AdaDelta, RMSProp, AMSGrad, Adam, AdaSwarm.

Iris — Loss: 0.184, 0.133, 0.232, 0.272, 0.199, 0.222, 0.219; Accuracy: 96.223%, 98.223%, 88.444%, 86%, 92.222%, 89.222%, 90.667%
Ionosphere — Loss: 0.665, 0.37, 0.564, 0.545, 0.243, 0.3978, 0.259; Accuracy: 52.429%, 88.857%, 73.571%, 76.142%, 92.143%, 83.425%, 90.857%
Wisconsin Breast Cancer — Loss: 0.560, 0.414, 0.436, 0.422, 0.414, 0.383, 0.405; Accuracy: 81.231%, 84.384%, 83.284%, 83.431%, 84.384%, 84.970%, 83.357%
Sonar — Loss: 0.69, 0.441, 0.439, 0.428, 0.374, 0.3530, 0.33; Accuracy: 58.173%, 80.769%, 81.492%, 81.971%, 83.413%, 85.099%, 86.298%
Wheat Seeds — Loss: 0.612, 0.638, 0.565, 0.586, 0.43, 0.434, 0.434; Accuracy: 66.984%, 66.667%, 66.667%, 73.334%, 80.317%, 78.889%, 81.905%
Banknote Authentication — Loss: 0.228, 0.001, 0.0300, 0.004, 0.0005, 0.002, 0.004; Accuracy: 97.255%, 100%, 100%, 100%, 100%, 100%, 100%
Heart Disease — Loss: 0.494, 0.550, 0.557, 0.517, 0.499, 0.398, 0.564; Accuracy: 77.94%, 71.17%, 70.34%, 74.580%, 76.43%, 82.15%, 72.27%
Haberman's Survival — Loss: 0.5766, 0.5766, 0.534, 0.556, 0.5266, 0.533, 0.5242; Accuracy: 73.856%, 73.529%, 75.163%, 75.490%, 76.143%, 76.470%, 77.00%
Wine — Loss: 0.663, 0.652, 0.385, 0.555, 0.631, 0.603, 0.400; Accuracy: 66.667%, 66.667%, 80.337%, 69.101%, 66.667%, 71.535%, 83.142%
Car Evaluation — Loss: 0.375, 0.268, 0.282, 0.304, 0.286, 0.273, 0.260; Accuracy: 85.011%, 87.355%, 85.894%, 86.038%, 85.156%, 87.340%, 86.850%
TABLE II
ADASWARM VS ADAM FOR COMPUTER VISION DATASETS: PROMISING RESULTS OF ADASWARM COULD ADDRESS THE SENSITIVITY (TO INITIALISATION) AND ROBUSTNESS (TO MULTIPLE LOCAL MINIMA)

Dataset | Best Loss (Adam / AdaSwarm) | Best Training Accuracy (Adam / AdaSwarm) | Testing Accuracy (Adam / AdaSwarm) | Total Execution Time (Adam / AdaSwarm)
MNIST | 0.073 / 0.0727 | 97.09% / 97.3% | 97.2% / 97.3% | ~990 s / ~900 s
CIFAR 10 | 0.234 / 0.223 | 91.058% / 91.6% | 91.074% / 91.447% | ~994 s / ~
CIFAR 100 | 0.036 / 0.038 | 99.122% / 99.100% | 99.007% / 99.039% | ~ / ~

Using the theorems proved above, the gradient of a function f(x) can be approximated as

ũ_h = −((c_1 r_1 + c_2 r_2)/η) (f(x) − gbest)   (54)

Taking the true gradient as the exact value,

u = f'(x)   (55)

Since the value −(c_1 r_1 + c_2 r_2)/η is constant, we replace it by M. Then

|ũ_h − u| = M (f(x) − gbest) − f'(x)   (56)

After Taylor expansion we get

|ũ_h − u| = M (f(x) + h f'(x) + h^2 f''(ξ)/2 − gbest) − f'(x)   (57)

for some ξ ∈ [0, h].

|ũ_h − u| = M f(x) + M h f'(x) + M h^2 f''(ξ)/2 − M gbest − f'(x)   (58)
|ũ_h − u| = M f(x) + (M h − 1) f'(x) + M h^2 f''(ξ)/2 − M gbest   (59)

By theory, gbest is the solution for f(x); after running PSO for a set number of iterations, we obtain the optimum. Hence we can assume gbest ≈ f(x), and if h is significantly small,

|ũ_h − u| = (M − 1) h f'(x) + M h^2 f''(ξ)/2   (60)

Often the error ũ_h − u depends smoothly on h; then there is an error coefficient D such that ũ_h − u = D h^p + O(h^{p+1}). Here the order of accuracy is O(h), with D = (M − 1) and p = 1. This explains the true vs. approximate gradient comparison (empirical) in Fig. 2.

VII. EXPERIMENTS AND DISCUSSIONS
A. Experimental Setup
The Confirmed Exoplanets Catalog maintained by the Planetary Habitability Laboratory (PHL) [12] was used as the dataset. We use the parameters described in Table V. Surface temperature and eccentricity are not recorded in Earth units, so we normalized these values by dividing them by Earth's surface temperature (288 K) and eccentricity (0.017). The PHL-EC records are empty for those exoplanets whose surface temperature is not known; we drop these records from the experiment. We first test the proposed swarm algorithm on the test optimization functions mentioned below, using n = 1000 with a set target error and number of particles. The algorithm was then used to optimize the CDHS and CEESA objective functions. For testing on neural networks, AdaSwarm was compared to various optimizers on datasets such as MNIST, CIFAR-10, CIFAR-100, Iris, Breast Cancer (Wisconsin), and many other classification datasets; the results are presented in this section.
The optimizer is tested on twenty benchmark test optimization problems, as well as on various real-world engineering optimization problems. A comparison among vanilla particle swarm optimization, momentum particle swarm optimization, and exponentially weighted momentum particle swarm optimization is presented: graphs comparing the cost function values are reported, along with the optimum value achieved by each optimizer.
C. True Gradient and Approximate Gradient Comparison
The gradients were calculated for the functions mentioned in Table VIII; the approximate gradient values computed using the PSO parameters, as defined in Section III-B, and the true gradients were plotted. The results are shown in Fig. 2. With this, we can confidently use this approach to compute gradients and replace them in back-propagation.
Fig. 2. True gradient vs approximate gradient calculated using PSO parameters: the quality of the approximation is evidence of the efficacy of EM-PSO and of the order-of-accuracy proof presented in Section VI-A.
D. AdaSwarm vs Optimizers for various classification datasets
Using the approximated derivative, we replace the backward pass of the loss with this derivative; with such a modification, every time the algorithm backpropagates, it uses the custom approximated gradients throughout the layers to update the weights. A new loss was defined: in the forward pass, the binary cross-entropy loss is computed, and in the backward pass, we replace the gradient with our approximated gradient. The dataset was divided into batches; then, for each epoch, batch loss and accuracy were calculated using the optimizer. The losses were kept as a running average over the batches, and the same running average was applied to the accuracies. The epoch loss and accuracy were reported for all the datasets listed. For comparing the different optimizers, the optimizers selected were SGD, SGD emulation with PSO parameters, RMSProp, Adam, and AdaSwarm. The experimental results are reported in Table I and Table II.
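One way to realise such a custom loss, sketched below, is a PyTorch autograd Function whose forward pass computes the usual binary cross-entropy while the backward pass returns the EM-PSO-approximated gradient of Eq. (52). The interface, and the assumption that gbest is available as a tensor from the swarm, are illustrative rather than the authors' exact implementation.

```python
import torch

class EMPSOBCELoss(torch.autograd.Function):
    """BCE in the forward pass; EM-PSO-approximated gradient in the backward pass."""

    @staticmethod
    def forward(ctx, y_pred, y_true, gbest, c1r1, c2r2, eta):
        ctx.save_for_backward(y_pred, gbest)
        ctx.coeff = (c1r1 + c2r2) / eta
        return torch.nn.functional.binary_cross_entropy(y_pred, y_true)

    @staticmethod
    def backward(ctx, grad_output):
        y_pred, gbest = ctx.saved_tensors
        # dE/dy replaced by the Eq. (52) approximation instead of the analytic BCE derivative.
        grad_y = -ctx.coeff * (gbest - y_pred) * grad_output
        return grad_y, None, None, None, None, None

# Hypothetical usage in a training step (gbest supplied by the swarm as a tensor):
# loss = EMPSOBCELoss.apply(model(x), y, gbest, c1r1, c2r2, eta)
# loss.backward(); optimizer.step()
```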
E. CNN: Architecture Used for Comparison between Adam and AdaSwarm
In this subsection, we describe how the custom gradients were implemented; it also gives meaningful insight into the backpropagation algorithm. We define a CNN architecture for testing on benchmark datasets. This simple CNN model, used for training on computer vision datasets, contains two convolution layers with 3x3 kernels, a max-pooling layer, and a flatten layer connected to a 128-neuron hidden layer, with an output layer of 10 neurons for 10 classes. Using this model, the total number of weights sums to 1,199,882 for the MNIST model with a 28x28 image size. As the image size varies for CIFAR-10, the weights vary, but the architecture remains the same. The results are presented in Table II. The approximate gradients are used for every weight in the model.
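A sketch of such a model in PyTorch is shown below. The filter counts (32 and 64) are an assumption not stated in the text, chosen because they reproduce the 1,199,882 trainable parameters quoted above for a 28x28 MNIST input.

```python
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Two 3x3 convolutions, one max-pool, a 128-unit hidden layer, 10 outputs."""

    def __init__(self, in_channels=1, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(64 * 12 * 12, 128), nn.ReLU(),   # 12x12 feature map for a 28x28 input
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# sum(p.numel() for p in SimpleCNN().parameters()) == 1_199_882 for MNIST-sized input.
```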
VIII. CONCLUSION
The paper presents an exponentially averaged momentum-enhanced particle swarm approach to solve unconstrained and constrained optimization problems. It is common knowledge that PSO is a "derivative-free" optimizer, in the sense that gradients or Hessians need not be computed. However, a theoretical relation directly connecting the gradient to the parameters of PSO has not previously been presented; the approximation, as documented in several papers, is purely empirical. This is the cornerstone of our approach, in which we prove theorems expressing the gradient approximation in terms of the PSO parameters and verify the theoretical equivalence empirically. In a nutshell, the derivative of any function with countably finite singularities may be expressed in terms of the PSO and EM-PSO parameters. Additionally, the derivative approximation is seamlessly extended to approximate error gradients in backpropagation in neural networks. This throws interesting light on derivative-free optimization, apart from demonstrating empirical evidence of our method surpassing standard PSO in terms of speed of convergence. A new optimizer, AdaSwarm, based on the widely used Adam optimizer, is also introduced and tested against Adam while training neural networks. In all the above cases, the proposed AdaSwarm algorithm either obtains results comparable to existing optimizers or surpasses their performance; the results have been reported in tabular and visual form in this paper.
TABLE III
UNCONSTRAINED TEST OPTIMIZATION FUNCTIONS [6] (SEE APPENDIX B)

Name | Global Minimum | M-PSO Optimized Value | M-PSO Iterations | EM-PSO Optimized Value | EM-PSO Iterations
Ackley function | 0 | 0.001 | 59 | 0.001 | 47
Rosenbrock 2D function | 0 | – | 215 | 67.14 | 124
Beale function | 0 | – | 68 | 0 | 136
Goldstein-Price function | 3 | 3 | 61 | 3 | 53
Booth function | 0.0 | – | 95 | 0 | 48
Three-hump camel function | 0.0 | – | 59 | 0.0 | 36
Easom function | -1 | – | 45 | -1 | 40
Cross-in-tray function | -2.06261 | -2.06 | 43 | -2.06 | 30

(Cells marked "–" could not be recovered from the source.)
TABLE IV
CONSTRAINED TEST OPTIMIZATION FUNCTIONS [6] (SEE APPENDIX B)

Name | Global Minimum | M-PSO Optimized Value | M-PSO Iterations | EM-PSO Optimized Value | EM-PSO Iterations
Mishra Bird function | -106.76 | -106.76 | 121 | -106.76 | 55
Rosenbrock function constrained with a cubic and a line | 0 | 0.99 | 109 | 0.99 | 64
Rosenbrock function constrained to a disk | 0 | 0 | 69 | 0 | 39
TABLE V
COMPUTED CDHS AND CEESA SCORES BY EM-PSO ARE CLOSE TO THE SCORES COMPUTED BY [13], [14] WITH HIGH PRECISION, AND COMPARISON OF THE TWO ALGORITHMS

Name | Algorithm | [α, β, γ, δ] | [Y_i, Y_s] | CDHS Score | Iterations | [r, d, t, v, e, ρ] | CEESA Score | Iterations
TRAPPIST-1 b | M-PSO | [0.99, 0.01, 0.01, 0.99] | [1.09, 1.38] | 1.234 | 130 | [0.556, 0, 0.398, 0.045, 0, 0.629] | 1.193 | 76
TRAPPIST-1 b | EM-PSO | [0.99, 0.01, 0.01, 0.99] | [1.09, 1.38] | 1.234 | 75 | [0.107, 0.314, 0.578, 0.001, 3.704, 0.999] | 1.126 | 92
TRAPPIST-1 c | M-PSO | [0.99, 0.01, 0.01, 0.99] | [1.17, 1.21] | 1.19 | 65 | [0.117, 0.384, 0.273, 0.225, 0, 0.999] | 1.161 | 54
TRAPPIST-1 c | EM-PSO | [0.99, 0.01, 0.01, 0.99] | [1.17, 1.21] | 1.19 | 80 | [0.053, 0.348, 0.212, 0.386, 0, 0.999] | 1.161 | 60
TRAPPIST-1 e | M-PSO | [0.99, 0.01, 0.01, 0.99] | [0.92, 0.88] | 0.9096 | 30 | [0.455, 0.486, 0.033, 0.027, 5.182, 0.504] | 0.868 | 8
TRAPPIST-1 e | EM-PSO | [0.99, 0.01, 0.2, 0.8] | [0.92, 0.88] | 0.9096 | 69 | [0.264, 0.004, 0.626, 0.105, 0, 0.936] | 0.897 | 27
TRAPPIST-1 f | M-PSO | [0.99, 0.01, 0.95, 0.05] | [1.04, 0.8] | 0.92 | 93 | [0.718, 0, 0.276, 0.006, 0, 0.969] | 0.972 | 77
TRAPPIST-1 f | EM-PSO | [0.99, 0.01, 0.7, 0.3] | [1.04, 0.8] | 0.92 | 59 | [0.382, 0.24, 0.093, 0.284, 5.392, 0.719] | 0.836 | 46
TABLE VI
ENGINEERING OPTIMIZATION PROBLEMS (NA - NO ITERATIONS AVAILABLE)

Each cell reports (minimum value, number of iterations) for the previously best known solution and for W-PSO, M-PSO, EM-PSO, H. Garg, SMPSO, NSGA-2, NSGA-3, and PSO with Momentum [15], on Himmelblau's nonlinear problem [16], the welded beam problem [16], the speed reducer problem [17], the pressure vessel problem [16], and the compression spring problem [17]. The numeric values could not be recovered from the source.
TABLE VII
GRADIENT DESCENT VS EMULATED GRADIENT DESCENT USING PSO PARAMETERS. PARAMETER SET: η = 0.1, with fixed c_1, c_2

Function | Global Minimum | Gradient Descent (Optimum, Iterations) | Emulated Gradient Descent with PSO Parameters (Optimum, Iterations)
(first test function) | -2.25 | (2.6, 31) | (-2.249, 9)
(second test function) | -2.718 | (-2.718, 349) | (-2.718, 9)
(third test function) | – | – | –

TABLE VIII
STOCHASTIC GRADIENT DESCENT WITH MOMENTUM VS EMULATED GRADIENT DESCENT USING EM-PSO PARAMETERS. PARAMETER SET: η = 0.1, with fixed c_1, c_2, β = 0.9

Function | Global Minimum | SGD with Momentum (Optimum, Iterations) | Emulated Gradient Descent with EM-PSO Parameters (Optimum, Iterations)
(first test function) | -2.25 | (-1.45, 667) | (-2.249, 13)
(second test function) | -2.718 | (-2.718, 49) | (-2.706, 8)
(third test function) | – | – | –
ARAMETERS P ARAMETER S ET : η =0.1, C C β =0.9 Function Global Minimum Stochastic Gradient Descent Optimum Iterations Emulated Gradient Descentwith EM-PSO ParamtersOptimum Iterations − (cid:0) x − x (cid:1) -2.25 -1.45 667 -2.249 13 x − x ) + 7 − exp (cid:0) cos (cid:0) x (cid:1)(cid:1) + x -2.718 -2.718 49 -2.706 8 x − sin ( x ) + exp (cid:0) x (cid:1) also been reported in tabular and visual forms in this paper.We conclude by noting that the proposed method provides abackbone to loss functions that are non-differentiable. Usingthe AdaSwarm not only eliminates the need of differential lossfunction, it paves way for non-differentiable loss functions tobe used in Neural Networks.A CKNOWLEDGEMENT
The authors would like to thank the Science and Engineering Research Board (SERB), Department of Science and Technology (DST), Government of India for supporting this research. The project reference number is SERB-EMR/2016/005687.
REFERENCES
[1] S. M. Mikki and A. A. Kishk, Particle Swarm Optimization: A Physics-Based Approach. Morgan & Claypool, 2008.
[2] M. Welling and Y. W. Teh, "Bayesian learning via stochastic gradient Langevin dynamics," in Proceedings of the 28th International Conference on Machine Learning, pp. 681-688, 2011.
[3] Y.-A. Ma, Y. Chen, C. Jin, N. Flammarion, and M. I. Jordan, "Sampling can be faster than optimization," Proceedings of the National Academy of Sciences.
[4] ArXiv, vol. abs/1911.09011, 2019.
[5] J. Spall, Introduction to Stochastic Search and Optimization. Wiley-Interscience, 2003.
[6] M. Jamil and X. S. Yang, "A literature survey of benchmark functions for global optimisation problems," International Journal of Mathematical Modelling and Numerical Optimisation, vol. 4, no. 2, p. 150, 2013. [Online]. Available: http://dx.doi.org/10.1504/IJMMNO.2013.055204
[7] T. Xiang, J. Wang, and X. Liao, "An improved particle swarm optimizer with momentum," pp. 3341-3345, Sept. 2007.
[8] S. Chen and J. Montgomery, "Particle swarm optimization with thresheld convergence," June 2013.
[9] T. Hatanaka, T. Korenaga, N. Kondo, and K. Uosaki, Search Performance Improvement for PSO in High Dimensional Space, Jan. 2009.
[10] G. K. Jha, P. Thulasiraman, and R. K. Thulasiram, "PSO based neural network for time series forecasting," June 2009, pp. 1422-1427.
[11] H. T. Rauf, W. H. Bangyal, J. Ahmad, and S. A. Bangyal, "Training of artificial neural network using PSO with novel initialization technique," Nov. 2018, pp. 1-8.
[12] A. Méndez, "A thermal planetary habitability classification for exoplanets," Planetary Habitability Laboratory @ UPR Arecibo, 2011. [Online]. Available: http://phl.upr.edu/library/notes/athermalplanetaryhabitabilityclassificationforexoplanets
[13] K. Bora, S. Saha, S. Agrawal, M. Safonova, S. Routh, and A. Narasimhamurthy, "CD-HPF: New habitability score via data analytic modeling," submitted to Astronomy and Computing, vol. 17, Apr. 2016.
[14] S. Saha, S. Basak, M. Safonova, K. Bora, S. Agrawal, P. Sarkar, and J. Murthy, "Theoretical validation of potential habitability via analytical and boosted tree methods: An optimistic study on recently discovered exoplanets," Astronomy and Computing, vol. 23, pp. 141-150, Apr. 2018. [Online]. Available: http://dx.doi.org/10.1016/j.ascom.2018.03.003
[15] J. Ren and S. Yang, "A particle swarm optimization algorithm with momentum factor," vol. 1, pp. 19-21, Oct. 2011.
[16] H. Garg, "A hybrid PSO-GA algorithm for constrained optimization problems," Applied Mathematics and Computation, vol. 274, pp. 292-305, Feb. 2015.
[17] A. H. Gandomi and X.-S. Yang, "Benchmark problems in structural optimization," vol. 356, 2011.
[18] J. Kennedy and R. Eberhart, "Particle swarm optimization," Proceedings of ICNN'95 - International Conference on Neural Networks, vol. 4, pp. 1942-1948, Nov. 1995.
[19] A. Theophilus, S. Saha, S. Basak, and J. Murthy, "A novel exoplanetary habitability score via particle swarm optimization of CES production functions," Nov. 2018.

APPENDIX A
PARTICLE SWARM OPTIMIZATION WITH ITS VARIANTS
A. Particle Swarm Optimization with Inertial Weight
The Particle Swarm Optimization algorithm [18] is an optimization algorithm inspired by the flocking behavior of birds. It is characterized by a population of particles in space, which aim to converge to an optimal point. The movement of the particles in space is governed by two equations, namely the velocity and position update equations:

v_i^{t+1} = ω v_i^t + c_1 r_1 (pbest_i − x_i^t) + c_2 r_2 (gbest − x_i^t)   (61)
x_i^{t+1} = x_i^t + v_i^{t+1}   (62)

where ω, c_1, c_2 ≥ 0. Here, x_i^t refers to the position of particle i in the search space at time t, v_i^t refers to the velocity of particle i at time t, pbest_i is the personal best position of particle i, and gbest is the best position among all the particles of the population.

B. Particle Swarm Optimization with Momentum
Inspired by the momentum term in the back-propagation algorithm, a momentum term was introduced in the velocity update equation of PSO [7]:

v_i^{t+1} = (1 − λ)(v_i^t + c_1 r_1 (pbest_i − x_i^t) + c_2 r_2 (gbest − x_i^t)) + λ v_i^{t−1}   (63)

where c_1, c_2, x_i^t, v_i^t, pbest_i, and gbest have the same meaning as in the previous section. The momentum factor is denoted by λ.
APPENDIX B
EM-PSO FOR SINGLE-OBJECTIVE OPTIMIZATION PROBLEMS
We apply the proposed exponentially averaged momentum PSO to benchmark test optimization functions [6], habitability optimization problems in astronomy, and benchmark test optimization problems popular in different branches of engineering. A problem may be constrained or unconstrained depending on the search space being tackled: an unconstrained problem's space is the full search space for the particle swarm, and the difficulty arises only when it is constrained. Theophilus, Saha et al. [19] describe a way to handle constrained optimization; we use the same method to represent the test functions as well as a few standard optimization problems, as sketched below.
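As a stand-in for that constraint-handling scheme, the sketch below shows a generic static-penalty wrapper (not necessarily the exact method of [19]) that turns a problem with inequality constraints g_i(x) <= 0 into an unconstrained fitness that EM-PSO can minimise.

```python
def penalised(f, constraints, penalty=1e6):
    """Wrap objective f with a static penalty for constraints g_i(x) <= 0.

    A generic stand-in for the constraint handling of [19]: infeasible
    particles are penalised in proportion to their constraint violation.
    """
    def fitness(x):
        violation = sum(max(0.0, g(x)) for g in constraints)
        return f(x) + penalty * violation
    return fitness

# Example: Rosenbrock constrained to a disk, x1^2 + x2^2 <= 2 (cf. Table IV).
rosen = lambda x: (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2
disk = [lambda x: x[0] ** 2 + x[1] ** 2 - 2.0]
fitness = penalised(rosen, disk)
```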
A. Standard Test Optimization Functions
In this section, we briefly describe the benchmark optimization functions chosen to evaluate our proposed algorithm and to compare its performance with that of the weighted PSO and momentum PSO described in Appendix A. For the purpose of assessing the performance of the proposed algorithm, we have considered the single-objective unconstrained optimization functions [6] Rastrigin, Ackley, Sphere, Rosenbrock, Beale, Goldstein-Price, Booth, Bukin N.6, Matyas, Levi N.13, Himmelblau's, Three-hump camel, Easom, Cross-in-tray, Eggholder, Holder table, McCormick, Schaffer N.2, Schaffer N.4, and Styblinski-Tang, as well as the constrained optimization functions Rosenbrock constrained with a cubic and a line, Rosenbrock constrained to a disk, and Mishra's bird. The results for the above-mentioned benchmark optimization functions are summarised in Table III and Table IV.

B. Engineering Optimization Problems
In this section, we briefly describe the benchmark engineering optimization problems [16] chosen to evaluate our proposed algorithm and to compare its performance with that of various algorithms. They are formulated based on real-world scenarios.
C. Constrained Single-Objective Optimization Problems from Habitability
We present two problems from exoplanetary habitability score computation [13], [14] which have been formulated as constrained single-objective optimization problems. The habitability scores have previously been computed with gradient ascent/descent-type approaches; we solve these problems using our approach and compare with the scores obtained earlier.

Representing CDHS: The Cobb-Douglas Habitability Score can be cast as a constrained optimization problem where the objective function is

Y = R^α · D^β · V_e^γ · T_s^δ   (64)

where R, D, V_e, and T_s are the radius, density, escape velocity, and surface temperature of a particular exoplanet, and α, β, γ, δ are elasticity coefficients:

maximize over α, β, γ, δ:  Y
subject to  0 < φ < 1, for all φ ∈ {α, β, γ, δ},
            α + β − 1 − τ ≤ 0,   1 − α − β − τ ≤ 0,
            γ + δ − 1 − τ ≤ 0,   1 − γ − δ − τ ≤ 0   (65)

It can be subjected to two scales of production: CRS (constant returns to scale) and DRS (decreasing returns to scale). Equation (64) is concave under constant returns to scale, when α + β = 1 and γ + δ = 1, and also under decreasing returns to scale, when α + β < 1 and γ + δ < 1.

Representing CEESA: The objective function for CEESA to estimate the habitability score of an exoplanet is

maximize over r, d, t, v, e, ρ, η:  Y = (r·R^ρ + d·D^ρ + t·T^ρ + v·V^ρ + e·E^ρ)^{η/ρ}
subject to  0 < φ < 1, for all φ ∈ {r, d, t, v, e},
            0 < ρ ≤ 1,  0 < η < 1,
            (r + d + t + v + e) − 1 − τ ≤ 0,
            1 − (r + d + t + v + e) − τ ≤ 0   (66)

where E represents the orbital eccentricity and τ is the constraint tolerance.
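A sketch of how the CDHS problem of Eqs. (64)-(65) can be posed as a penalised fitness for EM-PSO follows (maximisation of Y turned into minimisation of -Y, with the tolerance constraints handled by a static penalty as above); the exoplanet parameter values in the example call are placeholders, not catalog values.

```python
import numpy as np

def cdhs_fitness(R, D, Ve, Ts, tau=0.01, penalty=1e6):
    """Minimisable fitness for the CDHS problem of Eqs. (64)-(65)."""
    def fitness(p):
        a, b, g, d = p
        if np.any((p <= 0.0) | (p >= 1.0)):          # elasticities must lie in (0, 1)
            return penalty
        y = (R ** a) * (D ** b) * (Ve ** g) * (Ts ** d)          # Eq. (64)
        violation = (max(0.0, a + b - 1 - tau) + max(0.0, 1 - a - b - tau)
                     + max(0.0, g + d - 1 - tau) + max(0.0, 1 - g - d - tau))
        return -y + penalty * violation              # maximise Y by minimising -Y
    return fitness

# Placeholder Earth-normalised inputs (radius, density, escape velocity, surface temperature).
fit = cdhs_fitness(R=1.1, D=0.9, Ve=1.05, Ts=0.98)
# best, val = em_pso(fit, dim=4, bounds=(0.01, 0.99))   # using the EM-PSO sketch from Section II
```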