AdaSwarm: Augmenting Gradient-Based Optimizers in Deep Learning with Swarm Intelligence
Rohan Mohapatra, Snehanshu Saha, Carlos A. Coello Coello, Anwesh Bhattacharya, Soma S. Dhavala, Sriparna Saha
Rohan Mohapatra∗, Snehanshu Saha†, and Soma S. Dhavala‡
∗Dept. of Computer Science and Engineering, PES University, Bengaluru, India — [email protected]
†Dept. of Computer Science and Information Systems and APPCAIR, BITS Pilani, K K Birla Goa Campus, Goa, India — [email protected]
‡Center for AstroInformatics, Modeling and Simulation (CAMS) & ML Square, Bengaluru, India — [email protected]
Abstract—This paper tackles the age-old question of derivative-free optimization in neural networks. It introduces AdaSwarm, a novel derivative-free optimizer designed to match or exceed the performance of Adam without using gradients. To support AdaSwarm, a novel Particle Swarm Optimization variant, Exponentially weighted Momentum PSO (EM-PSO), is also proposed. EM-PSO is a derivative-free optimizer that tackles constrained and unconstrained single-objective optimization problems; we apply it to benchmark test functions, engineering optimization problems, and habitability scores for exoplanets, demonstrating the speed and convergence of the technique. EM-PSO is then extended by approximating the gradient of a function at any point using the parameters of the particle swarm optimization. This is a novel technique to simulate gradient descent, the workhorse of the back-propagation algorithm, using gradients approximated from the particle swarm optimization parameters. Mathematical proofs of gradient approximation by EM-PSO, thereby bypassing gradient computation, are presented. AdaSwarm is compared with various optimizers, and the theory and algorithmic performance are supported by promising results.
Index Terms—particle swarm optimization, meta-heuristics, gradient descent, neural networks, back-propagation, momentum, Adam
I. INTRODUCTION
Gradient descent is a popular method, used extensively in the backpropagation algorithm to minimise the loss in a neural network. Though gradient descent is reliable and efficient, it is a derivative-based optimisation algorithm. This poses two challenges: it can be applied only to differentiable functions, and computing the derivative for complex, real-world problems can be difficult. As a result of these shortcomings, there has been significant interest in using meta-heuristic algorithms for optimization. One such meta-heuristic algorithm, particle swarm optimization (PSO), was proposed in 1995 as a swarm-intelligence-based evolutionary algorithm for optimization problems; it is inspired by the flocking behavior of birds. In this paper, we explore PSO and a novel variant of PSO as derivative-free techniques to emulate gradient descent and replace it in the backpropagation algorithm. The underlying question being addressed is whether the gradient (read: derivative) of a function can be approximated by a meta-heuristic algorithm such as PSO. We address this by establishing an equivalence between PSO and gradient descent, both theoretically and empirically, such that we can compute the gradient of a function at any point just by knowing the values of the PSO parameters, without having to actually calculate the derivative. In this regard, AdaSwarm is proposed to show that even a "gradient"-free optimizer can give results that match or exceed those of the best optimizers.
A. Motivation
The sources of inspiration to understand and improve optimization techniques in general, and PSO in particular, can come from many places. Said et al. [1] postulate that swarms behave similarly to classical and quantum particles. In fact, their analogy is so striking that one may tend to think that the social and individual intelligence angles in PSO are, after all, useful perspectives, and that there is a neat underlying dynamical system at play. This dynamical-systems perspective was indeed also useful in unifying two almost parallel streams, namely optimization and Markov Chain Monte Carlo sampling [2], [3]. In a seminal paper, Welling and Teh [2] show that an SGD optimization technique can be turned into a sampling technique by just adding noise, governed by Langevin dynamics. Recently, Soma and Sato [4] provide further insights into this connection based on an underlying dynamical system governed by stochastic differential equations (SDEs). While these results are new, the connections between derivative-free optimization techniques based on Stochastic Approximation and Finite Differences are well documented [5]. Such strong connections between these seemingly different sub-fields of optimization and sampling made us ponder: is there a much larger, grander template, of which the aforementioned approaches are special cases? What is missing in the grander puzzle, and where do meta-heuristic algorithms fit? This question is useful in two ways: can we use ideas developed in improving SGD and apply them to PSO, and can we use particle history and their location awareness to offer derivative-free approximations to the gradients? We answer these questions positively in this paper.

B. Our Contribution
The following items are accomplished in the remainder of the paper:
• We propose a novel approach, EM-PSO, to momentum particle swarm optimization by aligning the weighted average with the exploration part of vanilla PSO (please see Section II), thereby needing fewer iterations to reduce the error and reach the optimum. This differs from the existing M-PSO approach (see Appendix A), where exploration and exploitation terms are weighted equally.
• We present the mathematical formulation of EM-PSO and prove theorems (see Section III) establishing a mathematical equivalence between: 1) vanilla PSO and gradient descent; 2) EM-PSO and stochastic gradient descent with momentum.
• The theorems propose a direct gradient approximation method, with precise alternatives to gradient computation, by exploiting the hyperparameters of the vanilla and EM versions of PSO. Since our approach is of approximation type, we prove a result on the order of accuracy of the approximation method.
• We interpret the gradient approximation approach to emulate backpropagation in vanilla feed-forward neural networks and CNNs via EM-PSO and test it on a simulated data set (see Section IV and Fig. 2).
• We also present a novel adaptive gradient-free optimizer, AdaSwarm, use it on neural networks, and present results for the same (see Section VI).
• We present a rotational variant of EM-PSO, called REM-PSO, to train on high-dimensional classification data.
• Since the fulcrum of the work is optimization, we apply our method to benchmark test optimization functions. To establish the efficacy of our approach, we have tested 20 such functions [6] and present a few samples in the paper. Finally, we apply EM-PSO to solve unconstrained and constrained single-objective benchmark engineering optimization problems (see Appendix B).
II. EXPONENTIALLY WEIGHTED MOMENTUM PARTICLE SWARM OPTIMIZATION (EM-PSO)

In this section, we propose a novel approach to momentum particle swarm optimization. The problem with the currently available momentum Particle Swarm Optimization [7] is that the weighted average it computes covers both exploration and exploitation simultaneously. Since PSO searches the space by exploring, it makes more sense to give more weight to the exploration part of the equation. Another problem with that formulation is that it takes more iterations to reduce the error and reach the minimum. (The simulated data set is available at https://gist.github.com/rohanmohapatra/4e7bce4f0d95746d8b993437c257e99b.)
To counter the above-said problems, we mathematically formulate a new Particle Swarm Optimization with momentum as follows:

v_i^{t+1} = M_i^{t+1} + c_1 r_1 (pbest_i − x_i^t) + c_2 r_2 (gbest − x_i^t)   (1)

where

M_i^{t+1} = β M_i^t + (1 − β) v_i^t   (2)

Here β is the momentum factor, and M_i^{t+1} indicates the effect of the momentum. Combining (1) and (2), the update can be rewritten as

v_i^{t+1} = β M_i^t + (1 − β) v_i^t + c_1 r_1 (pbest_i − x_i^t) + c_2 r_2 (gbest − x_i^t)   (3)

PSO is composed of two phases, the exploration phase and the exploitation phase [8]:

v_i^t (exploration) + c_1 r_1 (pbest_i − x_i^t) + c_2 r_2 (gbest − x_i^t) (exploitation)   (4)

With the proposed approach, the exploration phase is determined by the exponentially weighted average of the previous velocities only. The negligible weights applied in the momentum PSO [7] do not help much in providing the acceleration required.
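For concreteness, the following is a minimal sketch of the EM-PSO update of Eqs. (1)-(3). The swarm size, iteration count, coefficient values, and initialization bounds are illustrative choices, not the settings used in the paper's experiments.

```python
import numpy as np

def em_pso(f, dim, n_particles=30, iters=200, c1=0.8, c2=0.9, beta=0.9,
           bounds=(-5.0, 5.0), seed=0):
    """Minimise f with exponentially weighted momentum PSO, Eqs. (1)-(3)."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))   # positions
    v = np.zeros_like(x)                          # velocities
    m = np.zeros_like(x)                          # momentum term M_i^t
    pbest = x.copy()
    pbest_val = np.apply_along_axis(f, 1, x)
    gbest = pbest[pbest_val.argmin()].copy()

    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        m = beta * m + (1.0 - beta) * v                              # Eq. (2)
        v = m + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)        # Eq. (1)
        x = x + v
        vals = np.apply_along_axis(f, 1, x)
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

# Example: the sphere function converges towards the origin.
best_x, best_f = em_pso(lambda z: float(np.sum(z ** 2)), dim=2)
```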
A. Intuition behind the term Exponentially Weighted Average
In the previous section, we defined the momentum (2) and mentioned that it is an exponentially weighted average of the previous velocities seen so far. We show this by expanding M_i^t in (2):

M_i^t = β M_i^{t−1} + (1 − β) v_i^{t−1}   (5)
M_i^t = β [β M_i^{t−2} + (1 − β) v_i^{t−2}] + (1 − β) v_i^{t−1}   (6)
M_i^t = β^2 M_i^{t−2} + β (1 − β) v_i^{t−2} + (1 − β) v_i^{t−1}   (7)
M_i^t = β^2 [β M_i^{t−3} + (1 − β) v_i^{t−3}] + β (1 − β) v_i^{t−2} + (1 − β) v_i^{t−1}   (8)
M_i^t = β^3 M_i^{t−3} + β^2 (1 − β) v_i^{t−3} + β (1 − β) v_i^{t−2} + (1 − β) v_i^{t−1}   (9)

Generalizing, M_i^t can be written as

M_i^t = β^n M_i^{t−n} + β^{n−1} (1 − β) v_i^{t−n} + β^{n−2} (1 − β) v_i^{t−(n−1)} + ... + β (1 − β) v_i^{t−2} + (1 − β) v_i^{t−1}   (10)

From this equation we see that the t-th value of the momentum depends on all the previous velocities 1..t. Each previous velocity is assigned a weight, namely β^i (1 − β) for the (t − i)-th velocity. Because β is less than 1, it becomes even smaller when raised to a positive power, so older velocities receive much smaller weights and therefore contribute less to the overall value of the momentum.

B. REM-PSO: Rotation Accelerated EM-PSO for High-Dimensional Data

A lot of research has demonstrated the effectiveness of PSO in solving and optimizing various discrete and continuous problems, and plenty of applications of PSO, such as neural network training, PID controller tuning, and electric system optimisation, have been studied with good results. However, PSO often fails to find the global optimum when the objective function has a large number of dimensions, mainly because of the amount of computation PSO must perform to reach the global optimum; there is a high chance that it gets stuck in a sub-plane of the whole search space. T. Hatanaka [9] proposed a Rotated Particle Swarm Optimization to improve the performance of PSO in high-dimensional optimization. We can redefine the EM-PSO equation as follows. Let

φ_1 = diag(c_1 r_{1,1}, c_1 r_{1,2}, c_1 r_{1,3}, ..., c_1 r_{1,d})   (11)
φ_2 = diag(c_2 r_{2,1}, c_2 r_{2,2}, c_2 r_{2,3}, ..., c_2 r_{2,d})   (12)

Then the EM-PSO equation can be written as

v_i^{t+1} = β M_i^t + (1 − β) v_i^t + φ_1 (pbest_i − x_i^t) + φ_2 (gbest − x_i^t)   (13)

The EM-PSO algorithm was designed by emulating birds seeking food, with a faster convergence rate. Birds probably never change their seeking strategy according to whether the food lies in the true north or in the northeast; consequently, particles can search for the optimum even if the axes are rotated. We apply a coordinate conversion to the velocity update by using a matrix A. The matrix A is a D × D matrix in which a certain factor determines the number of axes to rotate. If we consider a point (x, y) in the original space, then in the rotated coordinate space it becomes (x', y'), where

x' = x cos(θ) + y sin(θ)   (14)
y' = −x sin(θ) + y cos(θ)   (15)

A^{−1} = A^T since A has an orthonormal basis.
An arbitrary matrix A with orthonormal basis vectors, i.e.

T_A : e_1, e_2, ..., e_N → e'_1, e'_2, ..., e'_N   (16)

can be expressed as the transformation matrix

A = [e'_1, e'_2, ..., e'_N],  e'_j ∈ R^N, j = 1, ..., N   (17)

The rotation matrix rotating the N-dimensional solution space by an angle θ is expressed as

M(θ, N) = Π_{i=1}^{N−1} Π_{j=i+1}^{N} M_{i,j}(θ)   (18)

and each element p^{i,j}_{q,l}(θ) of M_{i,j}(θ) is given by

p^{i,j}_{q,l} = cos(θ)  if q = i, l = i;
             = sin(θ)   if q = i, l = j;
             = −sin(θ)  if q = j, l = i;
             = cos(θ)   if q = j, l = j;
             = 1        if q = l, q ≠ i, q ≠ j;
             = 0        otherwise.

For example, rotating the (1, 2) and (4, 5) coordinate planes of a five-dimensional space gives the matrix M (or A):

M = [ cos(θ)   sin(θ)   0   0        0
      −sin(θ)  cos(θ)   0   0        0
      0        0        1   0        0
      0        0        0   cos(θ)   sin(θ)
      0        0        0   −sin(θ)  cos(θ) ]
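The product form of Eq. (18) can be assembled from planar rotations as in the sketch below; the dimension and angle are illustrative.

```python
import numpy as np

def plane_rotation(theta, n, i, j):
    """Rotation M_{i,j}(theta) acting in the (i, j) coordinate plane of R^n."""
    m = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    m[i, i], m[i, j] = c, s
    m[j, i], m[j, j] = -s, c
    return m

def rotation_matrix(theta, n):
    """M(theta, N) as the product of all planar rotations, Eq. (18)."""
    a = np.eye(n)
    for i in range(n - 1):
        for j in range(i + 1, n):
            a = a @ plane_rotation(theta, n, i, j)
    return a

A = rotation_matrix(np.deg2rad(30), 5)
assert np.allclose(A.T @ A, np.eye(5))   # orthonormal, so A^{-1} = A^T
```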
C. Transformation Invariance

EM-PSO optimization has transformation invariance as follows:
• f ∘ Tr_s ∘ update = f ∘ update ∘ Tr_s (transformation invariance for the solution space)
• Tr_f ∘ f ∘ update = f ∘ update ∘ Tr_f (transformation invariance for the objective function)

Here Tr_s denotes a scale transformation, parallel shift, or rotation of the solution space, and Tr_f denotes a scale transformation, parallel shift, or rotation of the objective function. Invariance under rotation of the solution space: Tr_s : x → G x, where G ∈ R^{N×N} satisfies G^{−1} = G^T and rotates a vector in the solution space. Combining the Rotated Particle Swarm Optimization equation with the EM-PSO equation, we get

v_i^{t+1} = β M_i^t + (1 − β) v_i^t + A^T φ_1 A (pbest_i − x_i^t) + A^T φ_2 A (gbest − x_i^t)   (19)

where

A = [e'_1, e'_2, ..., e'_N],  e'_j ∈ R^N, j = 1, ..., N   (20)

is a coordinate transformation matrix, and e'_1, e'_2, ..., e'_N is the orthonormal basis of eigenvectors of the covariance matrix Σ(Z) ∈ R^{N×N} of the solution set Z.

III. EQUIVALENCE BETWEEN GRADIENT DESCENT AND PARTICLE SWARM OPTIMIZATION
A. Proof of Equivalence between Gradient Descent and Vanilla PSO
Theorem:
Under reasonable assumptions on the global minimum, the following equivalence holds: η = ω and f''(w') = −(c_1 r_1 + c_2 r_2)/η.

Proof: In gradient descent, the weight update rule is given by

w^{(t)} = w^{(t−1)} + η ∂f/∂w   (21)

Here we use the Taylor series expansion of a given f(x) and differentiate the expansion to get an approximation for the gradient ∂f/∂x:

f(x) = f(a) + f'(a)(x − a) + (f''(a)/2!)(x − a)^2 + ... + E_n(x)   (22)

where E_n(x) is the error term specifying the deviation from the actual curve, given by

|E_n(x)| = k (x − a)^{n+1} / (n + 1)!   (23)

Taking d/dx on both sides, the gradient is approximated using the Taylor series expansion. The rationale behind approximating the gradient is that, for a given function, calculating the gradient can be tedious. By approximating the gradient using a Taylor expansion centered around the global minimum, we gain the mathematical convenience of f'(a) = 0:

f'(x) = 0 + f'(a) + f''(a)(x − a) + (f'''(a)/2!)(x − a)^2 + ... + E_{n−1}(x)   (24)

The gradient can then be written as follows at the optimal point w'. This point is the global minimum of the function f(x) (the gradient curve is centered around w', and any point on the gradient curve can be approximated from the expansion):

∂f/∂w |_{w = w'} = f'(w') + f''(w')(w − w') + ... + E_{n−1}(w)   (25)

By combining (21) and (25) we get

w^{(t)} = w^{(t−1)} + η f'(w') + η f''(w')(w − w') + E_{n−1}(w)   (26)

The equation for the basic Particle Swarm Optimization (PSO) (61) for particle i is

x_i^{(t)} = x_i^{(t−1)} + ω v_i^{(t−1)} + c_1 r_1 (pbest_i − x_i^{(t−1)}) + c_2 r_2 (gbest − x_i^{(t−1)})   (27)

We make the following assumptions:
• Around the optimum (global minimum), the exploration phase becomes almost constant, hence the analogy f'(w') = v_i^{(t−1)} holds: f'(w') is a constant value, and when the particles are around the global minimum, the velocity is constant as well.
• Around the global minimum, for most of the particles, pbest_i ≈ gbest.

If these assumptions hold, then we can draw the equivalence

η = ω   (28)
f''(w') = −(c_1 r_1 + c_2 r_2) / η   (29)

For mathematical convenience, we set the learning rate of the weight update rule to η = ω. Using these parameters and equation (25), any gradient can be approximated without having to calculate the gradients of the function itself. Under the above assumptions, we must also factor in the other variables that play a large role in the approximation: as we move farther away from the center w', the error term becomes significantly large. The pbest_i term accounts for points away from the minimum, and hence the equivalence can also be written as

v_i^{(t−1)} = f'(w') + E_{n−1}(w)   (30)
f''(w')(w − w') = −(1/η) c_1 r_1 (pbest_i − x_i^{(t−1)}) − (1/η) c_2 r_2 (gbest − x_i^{(t−1)})   (31)
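The sketch below illustrates the equivalence numerically on a toy one-dimensional function: a vanilla PSO run supplies gbest and the c_1 r_1, c_2 r_2 terms, and the derivative is then approximated from those swarm quantities alone, in the form used for the error gradient later in Eq. (52). The hyperparameter values are illustrative, and the match to the true derivative is only approximate, since the equivalence holds under the stated assumptions near the global minimum.

```python
import numpy as np

f = lambda x: (x - 2.0) ** 2      # toy objective, minimum at w' = 2, so f'(x) = 2 (x - 2)

rng = np.random.default_rng(1)
omega, c1, c2 = 0.5, 0.9, 0.9     # eta = omega, per the equivalence above
n, iters = 30, 100
x = rng.uniform(-5.0, 5.0, n)
v = np.zeros(n)
pbest, pbest_val = x.copy(), f(x)
gbest = pbest[pbest_val.argmin()]

for _ in range(iters):
    r1, r2 = rng.random(n), rng.random(n)
    v = omega * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # vanilla PSO, Eq. (27)
    x = x + v
    vals = f(x)
    better = vals < pbest_val
    pbest[better], pbest_val[better] = x[better], vals[better]
    gbest = pbest[pbest_val.argmin()]

# (c1 r1 + c2 r2)/eta plays the role of the curvature magnitude; average it over the swarm.
curvature = np.mean(c1 * r1 + c2 * r2) / omega
for w in (1.0, 1.5, 2.5, 3.0):
    approx = -curvature * (gbest - w)   # derivative from swarm quantities only (cf. Eq. (52))
    print(f"w = {w:3.1f}   true f'(w) = {2 * (w - 2):+.3f}   approx = {approx:+.3f}")
```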
B. Proof of Equivalence between Stochastic Gradient Descent with Momentum and EM-PSO

Theorem:
Under reasonable assumptions on the global minimum, the following equivalence holds: η = (1 − β), f''(w') = −(c_1 r_1 + c_2 r_2)/η, and α = ηβ/(1 − β).

Proof:
In the previous section, we proved the equivalence between the gradient descent rule and the vanilla Particle Swarm Optimizer; we now extend it to a more advanced version of gradient descent, one that tackles slow convergence and stagnation at local minima. The gradient descent weight update with momentum is given by

w^{(t)} = w^{(t−1)} + η V_{dw}^t   (32)

where

V_{dw}^t = β V_{dw}^{t−1} + (1 − β) ∂f/∂w   (33)

Combining the equations and dividing (33) by (1 − β), we get

w^{(t)} = w^{(t−1)} + α V_{dw}^{t−1} + η ∂f/∂w   (34)

Here η is the learning rate and V_{dw}^{t−1} is the momentum applied to the weight update, where

α = ηβ / (1 − β)   (35)

We apply the same Taylor series expansion to the gradient as defined in (25) and then formulate gradient descent with momentum as

w^{(t)} = w^{(t−1)} + η f'(w') + η f''(w')(w − w') + E_{n−1}(w) + α V_{dw}^{t−1}   (36)

The proposed EM-PSO, defined above, can be written as

x_i^t = x_i^{t−1} + β M_i^{t−1} + (1 − β) v_i^{t−1} + c_1 r_1 (pbest_i − x_i^{t−1}) + c_2 r_2 (gbest − x_i^{t−1})   (37)

Under the same assumptions stated in the previous proof, we define the equivalence as

η = (1 − β)   (38)
f''(w') = −(c_1 r_1 + c_2 r_2) / η   (39)
α = ηβ / (1 − β)   (40)

We find that M_i^{t−1} plays the same role as the momentum term V_{dw}^{t−1} in (32), which helps in smoothing and faster convergence to the minimum.

IV. INTERPRETATION OF GRADIENTS AND EMULATING BACKPROPAGATION IN VANILLA FEED-FORWARD NEURAL NETWORKS
Let us revisit the gradient descent rule used in the backpropagation algorithm for vanilla feed-forward networks:

w^{(t)} = w^{(t−1)} + η ∂f/∂w   (41)

The equivalence between gradient descent and PSO was proved in the previous section, so we can now substitute the gradient with PSO parameters and approximate the gradient values. Computing the gradient of some functions can be very difficult, and some loss functions are non-differentiable; if we can get a good approximation, we can easily emulate gradient descent. At any iteration, the c_1 r_1 and c_2 r_2 values are computed by averaging over the n particles, and v_i^{(t−1)} is taken from the particle that influences gbest. Using the PSO parameters, we get

w^{(t)} = w^{(t−1)} − η [ v_i^{(t−1)} − (1/η) c_1 r_1 (pbest_i − x_i^{(t−1)}) − (1/η) c_2 r_2 (gbest − x_i^{(t−1)}) ]   (42)

Table I shows the emulation using approximated gradients. PSO has been used extensively for training neural networks [10], [11], since gradient descent has a higher chance of getting stuck in local minima. In that approach, each PSO particle consists of the weights of the neural network, and the fitness function is the loss function to be reduced. The caveat is that as the number of weights increases, so does the number of dimensions of a single particle; it is reported that such a system fails to converge because of the enormous weight updates. We propose a rather different idea to train the neural network: we use batch training, and the dimension of the particle vector is now just batch size * number of classes, significantly reducing the amount of computation and increasing training speed over previous approaches. In a neural network, we essentially have a loss function that must be minimized to obtain the set of weights that classifies a given instance. Traditionally, the back-propagation algorithm performs the update as follows (here we consider the loss function to be the MSE (mean squared error), (ŷ − y)^2, where ŷ and y are the target and output values respectively). The equivalence between the error gradient and the EM-PSO approximation is a natural but non-trivial consequence of the insights gained from the theorems proved in Section III. If a derivative can be expressed mathematically using the PSO and the proposed EM-PSO parameters, then the error gradient, whose minimization is critical in backpropagation, can be approximated using PSO parameters. This is a marked departure from existing heuristic-based approaches.

∂E/∂w = (y − ŷ) * D(activation) * x   (43)

Using the equivalence proved above, we can substitute the value of the gradient in (43) and get

∂E/∂w = −((c_1 r_1 + c_2 r_2)/η) (gbest − y) * D(activation) * x   (44)

where D(activation) is the derivative of the activation function used. Using this approximation, we descend to the minimum at a faster rate. From Fig. 1 it can be seen clearly that the descent in slope is steeper than with the traditional gradient approach.

Fig. 1. Error gradient approximation in backpropagation by EM-PSO, as proposed in Eq. (39).
V. EMULATION OF GRADIENT DESCENT IN VANILLA FEED-FORWARD NETWORKS
Here, the claim of Section IV is explained in detail. The backpropagation weight-tuning rule uses the error gradient ∂E/∂w_{ji}. The observation is that weight w_{ji} can influence the rest of the network only through the net input to neuron j, net_j. Let j be an output unit. Using the chain rule,

∂E/∂w_{ji} = (∂E/∂net_j)(∂net_j/∂w_{ji})   (45)
∂E/∂w_{ji} = (∂E/∂net_j) x_{ji}   (46)

Just as w_{ji} influences the network through net_j, net_j influences the network through y_j. Applying the chain rule again,

∂E/∂w_{ji} = (∂E/∂y_j)(∂y_j/∂net_j) x_{ji}   (47)

Calculating the gradient ∂E/∂y_j analytically would mean choosing a differentiable loss function, which eliminates many loss functions. To tackle this problem, we propose approximating the gradient using PSO parameters. Using the theorem, we will show that

∂E/∂y_j = −((c_1 r_1 + c_2 r_2)/η) (gbest − y_j)   (48)

Using the Taylor series expansion, the gradient is approximated as

∂E/∂y_j |_{y_j = y_opt} = E'(y_opt) + E''(y_opt)(y_j − y_opt) + ... + R_{n−1}(y_j)   (49)

Here E(y) is the loss with respect to which the error is minimized so as to get good predictions, R_{n−1} is the remainder, and y_opt is the value at which the loss is lowest. Applying the PSO parameter equivalence proved in the theorem of Section III-B, we get

E''(y_opt) = −(c_1 r_1 + c_2 r_2)/η   (50)

Substituting this value in (49), we get

∂E/∂y_j = E'(y_opt) + ((c_1 r_1 + c_2 r_2)/η)(y_j − y_opt) + ... + R_{n−1}(y)   (51)

Let us state a few assumptions, which are likely to hold in most cases:
• Around the optimum (global minimum), the loss gradient is 0, hence E'(y_opt) = 0.
• The remainder term of the Taylor series expansion can be neglected, as it has little significance in most cases, hence R_{n−1}(y) = 0.
• In Particle Swarm Optimization, gbest is the optimum value found for any function, so if E(y) is a loss function, then under PSO gbest can be substituted in place of y_opt.

We then obtain a rather simple equation for the error gradient:

∂E/∂y_j = −((c_1 r_1 + c_2 r_2)/η) (gbest − y_j)   (52)

Substituting into (47) yields

∂E/∂w_{ji} = −((c_1 r_1 + c_2 r_2)/η) (gbest − y_j) (∂y_j/∂net_j) x_{ji}   (53)

where ∂y_j/∂net_j is the derivative of the activation function used.
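A small sketch of Eqs. (52)-(53) for a single output unit follows, assuming a sigmoid activation; the coefficient values in the example call are placeholders rather than values taken from a trained swarm.

```python
import numpy as np

def approx_error_grad(x, w, gbest, c1r1, c2r2, eta):
    """Approximate dE/dw for one sigmoid output unit via Eqs. (52)-(53).

    x      : input vector feeding the unit
    w      : the unit's weight vector
    gbest  : swarm's best value of the output, standing in for y_opt
    c1r1, c2r2, eta : swarm coefficients used in the approximation
    """
    net = x @ w
    y = 1.0 / (1.0 + np.exp(-net))                 # sigmoid activation
    dy_dnet = y * (1.0 - y)                        # derivative of the activation
    dE_dy = -((c1r1 + c2r2) / eta) * (gbest - y)   # Eq. (52)
    return dE_dy * dy_dnet * x                     # Eq. (53)

# Illustrative call with placeholder values.
g = approx_error_grad(np.array([0.2, -0.7, 1.0]), np.array([0.1, 0.4, -0.3]),
                      gbest=1.0, c1r1=0.45, c2r2=0.45, eta=0.5)
```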
VI. ADASWARM

In this section, we propose a fast, gradient-free optimizer, AdaSwarm. The Adam optimizer was proposed for problems requiring first-order optimization and has little memory requirement; it computes adaptive learning rates for different parameters from estimates of the first and second moments of the gradients. AdaSwarm is based on the Adam optimizer, with the gradient replaced by the approximated gradient: the gradients are approximated using the theorems proposed above, and these approximations replace the gradients in Adam's update rule. The results and experiments discussed below show that AdaSwarm has a lower execution time and optimization comparable to Adam, sometimes performing better. The algorithm below describes AdaSwarm.
Algorithm 1: AdaSwarm

Require: η: learning rate
Require: β_1, β_2 ∈ [0, 1): exponential decay rates for the moment estimates
Require: f(θ): function with parameter θ
Require: θ_0: initial parameter vector
m_0 ← 0; v_0 ← 0; t ← 0
while θ_t not converged do
    t ← t + 1
    g_t ← approximated gradients (gradients w.r.t. the stochastic objective at timestep t)
    m_t ← β_1 m_{t−1} + (1 − β_1) g_t
    v_t ← β_2 v_{t−1} + (1 − β_2) g_t^2
    θ_t ← θ_{t−1} − η m_t / (√v_t + ε)
Return θ_t

Adam is a very popular optimization algorithm used extensively in neural networks and is among the best optimizers; it can be viewed as a combination of stochastic gradient descent with momentum and RMSProp. It leverages momentum by using a moving average of the gradient instead of the gradient itself, like SGD with momentum, and it scales the learning rate by the squared gradients, like RMSProp. When we add to these capabilities the approximate gradients calculated using EM-PSO (a fast-converging particle swarm optimizer), it becomes a truly derivative-free optimizer. AdaSwarm thus combines RMSProp, SGD with momentum, and EM-PSO to provide speed and acceleration when training neural networks.
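A compact sketch of Algorithm 1 follows; the callable that supplies the swarm-approximated gradient (named approx_grad here) is an assumed interface, and bias correction is omitted because the listing above does not include it.

```python
import numpy as np

def adaswarm(theta0, approx_grad, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """AdaSwarm (Algorithm 1): Adam-style update driven by EM-PSO-approximated gradients.

    approx_grad(theta, t) must return the gradient approximated from swarm
    parameters (e.g. via Eq. (52)); no analytic derivative is ever requested.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = approx_grad(theta, t)                 # swarm-based gradient estimate
        m = beta1 * m + (1.0 - beta1) * g         # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g     # second-moment estimate
        theta = theta - eta * m / (np.sqrt(v) + eps)
    return theta
```

Any routine that maps the current parameters to an Eq. (52)-style estimate can be plugged in as approx_grad, which is what makes the update derivative-free.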
A. Order of Accuracy: EMPSO approximation to gradients
We seek a numerical approximation to the exact value of the gradient. The approximation depends on a small parameter h, which can be, for instance, the grid size or time step in a numerical method. We denote the approximation of the gradient by ũ_h, and determine its order of accuracy.

TABLE I
ADASWARM VS OTHER OPTIMIZERS: PROMISING RESULTS OF ADASWARM COULD ADDRESS THE SENSITIVITY (TO INITIALISATION) AND ROBUSTNESS (TO MULTIPLE LOCAL MINIMA) IN CLASSIFICATION DATASETS

Optimizers compared: SGD, Emulation of SGD with PSO parameters, AdaGrad, AdaDelta, RMSProp, AMSGrad, Adam, AdaSwarm.

Iris — Loss: 0.184, 0.133, 0.232, 0.272, 0.199, 0.222, 0.219; Accuracy: 96.223%, 98.223%, 88.444%, 86%, 92.222%, 89.222%, 90.667%
Ionosphere — Loss: 0.665, 0.37, 0.564, 0.545, 0.243, 0.3978, 0.259; Accuracy: 52.429%, 88.857%, 73.571%, 76.142%, 92.143%, 83.425%, 90.857%
Wisconsin Breast Cancer — Loss: 0.560, 0.414, 0.436, 0.422, 0.414, 0.383, 0.405; Accuracy: 81.231%, 84.384%, 83.284%, 83.431%, 84.384%, 84.970%, 83.357%
Sonar — Loss: 0.69, 0.441, 0.439, 0.428, 0.374, 0.3530, 0.33; Accuracy: 58.173%, 80.769%, 81.492%, 81.971%, 83.413%, 85.099%, 86.298%
Wheat Seeds — Loss: 0.612, 0.638, 0.565, 0.586, 0.43, 0.434, 0.434; Accuracy: 66.984%, 66.667%, 66.667%, 73.334%, 80.317%, 78.889%, 81.905%
Banknote Authentication — Loss: 0.228, 0.001, 0.0300, 0.004, 0.0005, 0.002, 0.004; Accuracy: 97.255%, 100%, 100%, 100%, 100%, 100%, 100%
Heart Disease — Loss: 0.494, 0.550, 0.557, 0.517, 0.499, 0.398, 0.564; Accuracy: 77.94%, 71.17%, 70.34%, 74.580%, 76.43%, 82.15%, 72.27%
Haberman's Survival — Loss: 0.5766, 0.5766, 0.534, 0.556, 0.5266, 0.533, 0.5242; Accuracy: 73.856%, 73.529%, 75.163%, 75.490%, 76.143%, 76.470%, 77.00%
Wine — Loss: 0.663, 0.652, 0.385, 0.555, 0.631, 0.603, 0.400; Accuracy: 66.667%, 66.667%, 80.337%, 69.101%, 66.667%, 71.535%, 83.142%
Car Evaluation — Loss: 0.375, 0.268, 0.282, 0.304, 0.286, 0.273, 0.260; Accuracy: 85.011%, 87.355%, 85.894%, 86.038%, 85.156%, 87.340%, 86.850%
TABLE II
ADASWARM VS ADAM FOR COMPUTER VISION DATASETS: PROMISING RESULTS OF ADASWARM COULD ADDRESS THE SENSITIVITY (TO INITIALISATION) AND ROBUSTNESS (TO MULTIPLE LOCAL MINIMA)

Dataset | Best Loss (Adam / AdaSwarm) | Best Training Accuracy (Adam / AdaSwarm) | Testing Accuracy (Adam / AdaSwarm) | Total Execution Time (Adam / AdaSwarm)
MNIST | 0.073 / 0.0727 | 97.09% / 97.3% | 97.2% / 97.3% | ~990 s / ~900 s
CIFAR 10 | 0.234 / 0.223 | 91.058% / 91.6% | 91.074% / 91.447% | ~994 s / ~
CIFAR 100 | 0.036 / 0.038 | 99.122% / 99.100% | 99.007% / 99.039% | ~ / ~

Using the theorems proved above, the gradient of a function f(x) can be approximated as

ũ_h = −((c_1 r_1 + c_2 r_2)/η) (f(x) − gbest)   (54)

Taking the true gradient as the exact value,

u = f'(x)   (55)

Since the value −(c_1 r_1 + c_2 r_2)/η is constant, we replace it by M. Then

|ũ_h − u| = M (f(x) − gbest) − f'(x)   (56)

After Taylor expansion we get

|ũ_h − u| = M (f(x) + h f'(x) + h^2 f''(ξ)/2 − gbest) − f'(x)   (57)

for some ξ ∈ [0, h].

|ũ_h − u| = M f(x) + M h f'(x) + M h^2 f''(ξ)/2 − M gbest − f'(x)   (58)
|ũ_h − u| = M f(x) + (M h − 1) f'(x) + M h^2 f''(ξ)/2 − M gbest   (59)

By theory, gbest is the solution for f(x); after running PSO for a set number of iterations, we obtain the optimum. Hence we can assume gbest ≈ f(x), and if h is significantly small,

|ũ_h − u| = (M − 1) h f'(x) + M h^2 f''(ξ)/2   (60)

Often the error ũ_h − u depends smoothly on h; then there is an error coefficient D such that ũ_h − u = D h^p + O(h^{p+1}). Here the order of accuracy is O(h), with D = (M − 1) and p = 1. This explains the true vs. approximate gradient comparison (empirical) in Fig. 2.

VII. EXPERIMENTS AND DISCUSSIONS
A. Experimental Setup
The Confirmed Exoplanets Catalog maintained by the Planetary Habitability Laboratory (PHL) [12] was used as the dataset. We use the parameters described in Table V. Surface temperature and eccentricity are not recorded in Earth units, so we normalized these values by dividing them by Earth's surface temperature (288 K) and eccentricity (0.017). The PHL-EC records are empty for those exoplanets whose surface temperature is not known; we drop these records from the experiment. We first test the proposed swarm algorithm on the test optimization functions mentioned below, using n = 1000 with a set target error and number of particles. The algorithm was then used to optimize the CDHS and CEESA objective functions. For testing on neural networks, AdaSwarm was compared to various optimizers on datasets such as MNIST, CIFAR-10, CIFAR-100, Iris, Breast Cancer (Wisconsin), and many other classification datasets; the results are presented in this section.
The optimizer is tested on twenty benchmark test optimization problems, as well as on various real-world engineering optimization problems. A comparison among vanilla particle swarm optimization, momentum particle swarm optimization, and exponentially weighted momentum particle swarm optimization is presented: graphs comparing the cost function values are reported, along with the optimum value achieved by each optimizer.
C. True Gradient and Approximate Gradient Comparison
The gradients were calculated for the functions mentioned in Table VIII; the approximate gradient values computed using the PSO parameters, as defined in Section III-B, and the true gradients were plotted. The results are shown in Fig. 2. With this, we can confidently use this approach to compute gradients and replace them in back-propagation.
Fig. 2. True gradient vs approximate gradient calculated using PSO parameters: the quality of the approximation is evidence of the efficacy of EM-PSO and of the order-of-accuracy proof presented in Section VI-A.
D. AdaSwarm vs Optimizers for various classification datasets
Using the approximated derivative, we replace the backward pass of the loss with this derivative; with such a modification, every time the algorithm backpropagates, it uses the custom approximated gradients throughout the layers to update the weights. A new loss was defined: in the forward pass, the binary cross-entropy loss is computed, and in the backward pass, we replace the gradient with our approximated gradient. The dataset was divided into batches; then, for each epoch, batch loss and accuracy were calculated using the optimizer. The losses were kept as a running average over the batches, and the same running average was applied to the accuracies. The epoch loss and accuracy were reported for all the datasets listed. For comparing the different optimizers, the optimizers selected were SGD, SGD emulation with PSO parameters, RMSProp, Adam, and AdaSwarm. The experimental results are reported in Table I and Table II.
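One way to realise such a custom loss, sketched below, is a PyTorch autograd Function whose forward pass computes the usual binary cross-entropy while the backward pass returns the EM-PSO-approximated gradient of Eq. (52). The interface, and the assumption that gbest is available as a tensor from the swarm, are illustrative rather than the authors' exact implementation.

```python
import torch

class EMPSOBCELoss(torch.autograd.Function):
    """BCE in the forward pass; EM-PSO-approximated gradient in the backward pass."""

    @staticmethod
    def forward(ctx, y_pred, y_true, gbest, c1r1, c2r2, eta):
        ctx.save_for_backward(y_pred, gbest)
        ctx.coeff = (c1r1 + c2r2) / eta
        return torch.nn.functional.binary_cross_entropy(y_pred, y_true)

    @staticmethod
    def backward(ctx, grad_output):
        y_pred, gbest = ctx.saved_tensors
        # dE/dy replaced by the Eq. (52) approximation instead of the analytic BCE derivative.
        grad_y = -ctx.coeff * (gbest - y_pred) * grad_output
        return grad_y, None, None, None, None, None

# Hypothetical usage in a training step (gbest supplied by the swarm as a tensor):
# loss = EMPSOBCELoss.apply(model(x), y, gbest, c1r1, c2r2, eta)
# loss.backward(); optimizer.step()
```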
E. CNN: Architecture Used for Comparison between Adam and AdaSwarm
In this subsection, we describe how the custom gradients were implemented; it also gives meaningful insight into the backpropagation algorithm. We define a CNN architecture for testing on benchmark datasets. This simple CNN model, used for training on computer vision datasets, contains two convolution layers with 3x3 kernels, a max-pooling layer, and a flatten layer connected to a 128-neuron hidden layer, with an output layer of 10 neurons for 10 classes. Using this model, the total number of weights sums to 1,199,882 for the MNIST model with a 28x28 image size. As the image size varies for CIFAR-10, the weights vary, but the architecture remains the same. The results are presented in Table II. The approximate gradients are used for every weight in the model.
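A sketch of such a model in PyTorch is shown below. The filter counts (32 and 64) are an assumption not stated in the text, chosen because they reproduce the 1,199,882 trainable parameters quoted above for a 28x28 MNIST input.

```python
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Two 3x3 convolutions, one max-pool, a 128-unit hidden layer, 10 outputs."""

    def __init__(self, in_channels=1, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(64 * 12 * 12, 128), nn.ReLU(),   # 12x12 feature map for a 28x28 input
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# sum(p.numel() for p in SimpleCNN().parameters()) == 1_199_882 for MNIST-sized input.
```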
VIII. CONCLUSION
The paper presents an exponentially averaged momentum-enhanced particle swarm approach to solve unconstrained and constrained optimization problems. It is common knowledge that PSO is a "derivative-free" optimizer, in the sense that gradients or Hessians need not be computed. However, a theoretical relation directly connecting the gradient to the parameters of PSO has not previously been presented; the approximation, as documented in several papers, is purely empirical. This is the cornerstone of our approach, in which we prove theorems expressing the gradient approximation in terms of the PSO parameters and verify the theoretical equivalence empirically. In a nutshell, the derivative of any function with countably finite singularities may be expressed in terms of the PSO and EM-PSO parameters. Additionally, the derivative approximation is seamlessly extended to approximate error gradients in backpropagation in neural networks. This throws interesting light on derivative-free optimization, apart from demonstrating empirical evidence of our method surpassing standard PSO in terms of speed of convergence. A new optimizer, AdaSwarm, based on the widely used Adam optimizer, is also introduced and tested against Adam while training neural networks. In all the above cases, the proposed AdaSwarm algorithm either obtains results comparable to existing optimizers or surpasses their performance; the results have been reported in tabular and visual form in this paper.
TABLE III
UNCONSTRAINED TEST OPTIMIZATION FUNCTIONS [6] (SEE APPENDIX B)

Name | Global Minimum | M-PSO Optimized Value | M-PSO Iterations | EM-PSO Optimized Value | EM-PSO Iterations
Ackley function | 0 | 0.001 | 59 | 0.001 | 47
Rosenbrock 2D function | 0 | – | 215 | 67.14 | 124
Beale function | 0 | – | 68 | 0 | 136
Goldstein-Price function | 3 | 3 | 61 | 3 | 53
Booth function | 0.0 | – | 95 | 0 | 48
Three-hump camel function | 0.0 | – | 59 | 0.0 | 36
Easom function | -1 | – | 45 | -1 | 40
Cross-in-tray function | -2.06261 | -2.06 | 43 | -2.06 | 30

(Cells marked "–" could not be recovered from the source.)
TABLE IV
CONSTRAINED TEST OPTIMIZATION FUNCTIONS [6] (SEE APPENDIX B)

Name | Global Minimum | M-PSO Optimized Value | M-PSO Iterations | EM-PSO Optimized Value | EM-PSO Iterations
Mishra Bird function | -106.76 | -106.76 | 121 | -106.76 | 55
Rosenbrock function constrained with a cubic and a line | 0 | 0.99 | 109 | 0.99 | 64
Rosenbrock function constrained to a disk | 0 | 0 | 69 | 0 | 39
TABLE V
COMPUTED CDHS AND CEESA SCORES BY EM-PSO ARE CLOSE TO THE SCORES COMPUTED BY [13], [14] WITH HIGH PRECISION, AND COMPARISON OF THE TWO ALGORITHMS

Name | Algorithm | [α, β, γ, δ] | [Y_i, Y_s] | CDHS Score | Iterations | [r, d, t, v, e, ρ] | CEESA Score | Iterations
TRAPPIST-1 b | M-PSO | [0.99, 0.01, 0.01, 0.99] | [1.09, 1.38] | 1.234 | 130 | [0.556, 0, 0.398, 0.045, 0, 0.629] | 1.193 | 76
TRAPPIST-1 b | EM-PSO | [0.99, 0.01, 0.01, 0.99] | [1.09, 1.38] | 1.234 | 75 | [0.107, 0.314, 0.578, 0.001, 3.704, 0.999] | 1.126 | 92
TRAPPIST-1 c | M-PSO | [0.99, 0.01, 0.01, 0.99] | [1.17, 1.21] | 1.19 | 65 | [0.117, 0.384, 0.273, 0.225, 0, 0.999] | 1.161 | 54
TRAPPIST-1 c | EM-PSO | [0.99, 0.01, 0.01, 0.99] | [1.17, 1.21] | 1.19 | 80 | [0.053, 0.348, 0.212, 0.386, 0, 0.999] | 1.161 | 60
TRAPPIST-1 e | M-PSO | [0.99, 0.01, 0.01, 0.99] | [0.92, 0.88] | 0.9096 | 30 | [0.455, 0.486, 0.033, 0.027, 5.182, 0.504] | 0.868 | 8
TRAPPIST-1 e | EM-PSO | [0.99, 0.01, 0.2, 0.8] | [0.92, 0.88] | 0.9096 | 69 | [0.264, 0.004, 0.626, 0.105, 0, 0.936] | 0.897 | 27
TRAPPIST-1 f | M-PSO | [0.99, 0.01, 0.95, 0.05] | [1.04, 0.8] | 0.92 | 93 | [0.718, 0, 0.276, 0.006, 0, 0.969] | 0.972 | 77
TRAPPIST-1 f | EM-PSO | [0.99, 0.01, 0.7, 0.3] | [1.04, 0.8] | 0.92 | 59 | [0.382, 0.24, 0.093, 0.284, 5.392, 0.719] | 0.836 | 46
TABLE VI
ENGINEERING OPTIMIZATION PROBLEMS (NA - NO ITERATIONS AVAILABLE)

Each cell reports (minimum value, number of iterations) for the previously best known solution and for W-PSO, M-PSO, EM-PSO, H. Garg, SMPSO, NSGA-2, NSGA-3, and PSO with Momentum [15], on Himmelblau's nonlinear problem [16], the welded beam problem [16], the speed reducer problem [17], the pressure vessel problem [16], and the compression spring problem [17]. The numeric values could not be recovered from the source.
TABLE VII
GRADIENT DESCENT VS EMULATED GRADIENT DESCENT USING PSO PARAMETERS. PARAMETER SET: η = 0.1, with fixed c_1, c_2

Function | Global Minimum | Gradient Descent (Optimum, Iterations) | Emulated Gradient Descent with PSO Parameters (Optimum, Iterations)
(first test function) | -2.25 | (2.6, 31) | (-2.249, 9)
(second test function) | -2.718 | (-2.718, 349) | (-2.718, 9)
(third test function) | – | – | –

TABLE VIII
STOCHASTIC GRADIENT DESCENT WITH MOMENTUM VS EMULATED GRADIENT DESCENT USING EM-PSO PARAMETERS. PARAMETER SET: η = 0.1, with fixed c_1, c_2, β = 0.9

Function | Global Minimum | SGD with Momentum (Optimum, Iterations) | Emulated Gradient Descent with EM-PSO Parameters (Optimum, Iterations)
(first test function) | -2.25 | (-1.45, 667) | (-2.249, 13)
(second test function) | -2.718 | (-2.718, 49) | (-2.706, 8)
(third test function) | – | – | –
ARAMETERS P ARAMETER S ET : η =0.1, C C β =0.9 Function Global Minimum Stochastic Gradient Descent Optimum Iterations Emulated Gradient Descentwith EM-PSO ParamtersOptimum Iterations − (cid:0) x − x (cid:1) -2.25 -1.45 667 -2.249 13 x − x ) + 7 − exp (cid:0) cos (cid:0) x (cid:1)(cid:1) + x -2.718 -2.718 49 -2.706 8 x − sin ( x ) + exp (cid:0) x (cid:1) also been reported in tabular and visual forms in this paper.We conclude by noting that the proposed method provides abackbone to loss functions that are non-differentiable. Usingthe AdaSwarm not only eliminates the need of differential lossfunction, it paves way for non-differentiable loss functions tobe used in Neural Networks.A CKNOWLEDGEMENT
The authors would like to thank the Science and Engineering Research Board (SERB), Department of Science and Technology (DST), Government of India for supporting this research. The project reference number is SERB-EMR/2016/005687.
REFERENCES
[1] S. M. Mikki and A. A. Kishk, Particle Swarm Optimization: A Physics-Based Approach. Morgan & Claypool, 2008.
[2] M. Welling and Y. W. Teh, "Bayesian learning via stochastic gradient Langevin dynamics," in Proceedings of the 28th International Conference on Machine Learning, pp. 681-688, 2011.
[3] Y.-A. Ma, Y. Chen, C. Jin, N. Flammarion, and M. I. Jordan, "Sampling can be faster than optimization," Proceedings of the National Academy of Sciences.
[4] ArXiv, vol. abs/1911.09011, 2019.
[5] J. Spall, Introduction to Stochastic Search and Optimization. Wiley-Interscience, 2003.
[6] M. Jamil and X. S. Yang, "A literature survey of benchmark functions for global optimisation problems," International Journal of Mathematical Modelling and Numerical Optimisation, vol. 4, no. 2, p. 150, 2013. [Online]. Available: http://dx.doi.org/10.1504/IJMMNO.2013.055204
[7] T. Xiang, J. Wang, and X. Liao, "An improved particle swarm optimizer with momentum," pp. 3341-3345, Sept. 2007.
[8] S. Chen and J. Montgomery, "Particle swarm optimization with thresheld convergence," June 2013.
[9] T. Hatanaka, T. Korenaga, N. Kondo, and K. Uosaki, Search Performance Improvement for PSO in High Dimensional Space, Jan. 2009.
[10] G. K. Jha, P. Thulasiraman, and R. K. Thulasiram, "PSO based neural network for time series forecasting," June 2009, pp. 1422-1427.
[11] H. T. Rauf, W. H. Bangyal, J. Ahmad, and S. A. Bangyal, "Training of artificial neural network using PSO with novel initialization technique," Nov. 2018, pp. 1-8.
[12] A. Méndez, "A thermal planetary habitability classification for exoplanets," Planetary Habitability Laboratory @ UPR Arecibo, 2011. [Online]. Available: http://phl.upr.edu/library/notes/athermalplanetaryhabitabilityclassificationforexoplanets
[13] K. Bora, S. Saha, S. Agrawal, M. Safonova, S. Routh, and A. Narasimhamurthy, "CD-HPF: New habitability score via data analytic modeling," submitted to Astronomy and Computing, vol. 17, Apr. 2016.
[14] S. Saha, S. Basak, M. Safonova, K. Bora, S. Agrawal, P. Sarkar, and J. Murthy, "Theoretical validation of potential habitability via analytical and boosted tree methods: An optimistic study on recently discovered exoplanets," Astronomy and Computing, vol. 23, pp. 141-150, Apr. 2018. [Online]. Available: http://dx.doi.org/10.1016/j.ascom.2018.03.003
[15] J. Ren and S. Yang, "A particle swarm optimization algorithm with momentum factor," vol. 1, pp. 19-21, Oct. 2011.
[16] H. Garg, "A hybrid PSO-GA algorithm for constrained optimization problems," Applied Mathematics and Computation, vol. 274, pp. 292-305, Feb. 2015.
[17] A. H. Gandomi and X.-S. Yang, "Benchmark problems in structural optimization," vol. 356, 2011.
[18] J. Kennedy and R. Eberhart, "Particle swarm optimization," Proceedings of ICNN'95 - International Conference on Neural Networks, vol. 4, pp. 1942-1948, Nov. 1995.
[19] A. Theophilus, S. Saha, S. Basak, and J. Murthy, "A novel exoplanetary habitability score via particle swarm optimization of CES production functions," Nov. 2018.

APPENDIX A
PARTICLE SWARM OPTIMIZATION WITH ITS VARIANTS
A. Particle Swarm Optimization with Inertial Weight
The Particle Swarm Optimization algorithm [18] is an optimization algorithm inspired by the flocking behavior of birds. It is characterized by a population of particles in space, which aim to converge to an optimal point. The movement of the particles in space is governed by two equations, namely the velocity and position update equations:

v_i^{t+1} = ω v_i^t + c_1 r_1 (pbest_i − x_i^t) + c_2 r_2 (gbest − x_i^t)   (61)
x_i^{t+1} = x_i^t + v_i^{t+1}   (62)

where ω, c_1, c_2 ≥ 0. Here, x_i^t refers to the position of particle i in the search space at time t, v_i^t refers to the velocity of particle i at time t, pbest_i is the personal best position of particle i, and gbest is the best position among all the particles of the population.

B. Particle Swarm Optimization with Momentum
Inspired by the momentum term in the back-propagation algorithm, a momentum term was introduced in the velocity update equation of PSO [7]:

v_i^{t+1} = (1 − λ)(v_i^t + c_1 r_1 (pbest_i − x_i^t) + c_2 r_2 (gbest − x_i^t)) + λ v_i^{t−1}   (63)

where c_1, c_2, x_i^t, v_i^t, pbest_i, and gbest have the same meaning as in the previous section. The momentum factor is denoted by λ.
APPENDIX B
EM-PSO FOR SINGLE-OBJECTIVE OPTIMIZATION PROBLEMS
We apply the proposed exponentially averaged momentum PSO to benchmark test optimization functions [6], habitability optimization problems in astronomy, and benchmark test optimization problems popular in different branches of engineering. A problem may be constrained or unconstrained depending on the search space being tackled: an unconstrained problem's space is the full search space for the particle swarm, and the difficulty arises only when it is constrained. Theophilus, Saha et al. [19] describe a way to handle constrained optimization; we use the same method to represent the test functions as well as a few standard optimization problems, as sketched below.
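As a stand-in for that constraint-handling scheme, the sketch below shows a generic static-penalty wrapper (not necessarily the exact method of [19]) that turns a problem with inequality constraints g_i(x) <= 0 into an unconstrained fitness that EM-PSO can minimise.

```python
def penalised(f, constraints, penalty=1e6):
    """Wrap objective f with a static penalty for constraints g_i(x) <= 0.

    A generic stand-in for the constraint handling of [19]: infeasible
    particles are penalised in proportion to their constraint violation.
    """
    def fitness(x):
        violation = sum(max(0.0, g(x)) for g in constraints)
        return f(x) + penalty * violation
    return fitness

# Example: Rosenbrock constrained to a disk, x1^2 + x2^2 <= 2 (cf. Table IV).
rosen = lambda x: (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2
disk = [lambda x: x[0] ** 2 + x[1] ** 2 - 2.0]
fitness = penalised(rosen, disk)
```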
A. Standard Test Optimization Functions
In this section, we briefly describe the benchmark optimization functions chosen to evaluate our proposed algorithm and to compare its performance with that of the weighted PSO and momentum PSO described in Appendix A. For the purpose of assessing the performance of the proposed algorithm, we have considered the single-objective unconstrained optimization functions [6] Rastrigin, Ackley, Sphere, Rosenbrock, Beale, Goldstein-Price, Booth, Bukin N.6, Matyas, Levi N.13, Himmelblau's, Three-hump camel, Easom, Cross-in-tray, Eggholder, Holder table, McCormick, Schaffer N.2, Schaffer N.4, and Styblinski-Tang, as well as the constrained optimization functions Rosenbrock constrained with a cubic and a line, Rosenbrock constrained to a disk, and Mishra's bird. The results for the above-mentioned benchmark optimization functions are summarised in Table III and Table IV.

B. Engineering Optimization Problems
In this section, we briefly describe the benchmark engineering optimization problems [16] chosen to evaluate our proposed algorithm and to compare its performance with that of various algorithms. They are formulated based on real-world scenarios.
C. Constrained Single-Objective Optimization Problems from Habitability
We present two problems from exoplanetary habitability score computation [13], [14] which have been formulated as constrained single-objective optimization problems. The habitability scores have previously been computed with gradient ascent/descent-type approaches; we solve these problems using our approach and compare with the scores obtained earlier.

Representing CDHS: The Cobb-Douglas Habitability Score can be cast as a constrained optimization problem where the objective function is

Y = R^α · D^β · V_e^γ · T_s^δ   (64)

where R, D, V_e, and T_s are the radius, density, escape velocity, and surface temperature of a particular exoplanet, and α, β, γ, δ are elasticity coefficients:

maximize over α, β, γ, δ:  Y
subject to  0 < φ < 1, for all φ ∈ {α, β, γ, δ},
            α + β − 1 − τ ≤ 0,   1 − α − β − τ ≤ 0,
            γ + δ − 1 − τ ≤ 0,   1 − γ − δ − τ ≤ 0   (65)

It can be subjected to two scales of production: CRS (constant returns to scale) and DRS (decreasing returns to scale). Equation (64) is concave under constant returns to scale, when α + β = 1 and γ + δ = 1, and also under decreasing returns to scale, when α + β < 1 and γ + δ < 1.

Representing CEESA: The objective function for CEESA to estimate the habitability score of an exoplanet is

maximize over r, d, t, v, e, ρ, η:  Y = (r·R^ρ + d·D^ρ + t·T^ρ + v·V^ρ + e·E^ρ)^{η/ρ}
subject to  0 < φ < 1, for all φ ∈ {r, d, t, v, e},
            0 < ρ ≤ 1,  0 < η < 1,
            (r + d + t + v + e) − 1 − τ ≤ 0,
            1 − (r + d + t + v + e) − τ ≤ 0   (66)

where E represents the orbital eccentricity and τ is the constraint tolerance.
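A sketch of how the CDHS problem of Eqs. (64)-(65) can be posed as a penalised fitness for EM-PSO follows (maximisation of Y turned into minimisation of -Y, with the tolerance constraints handled by a static penalty as above); the exoplanet parameter values in the example call are placeholders, not catalog values.

```python
import numpy as np

def cdhs_fitness(R, D, Ve, Ts, tau=0.01, penalty=1e6):
    """Minimisable fitness for the CDHS problem of Eqs. (64)-(65)."""
    def fitness(p):
        a, b, g, d = p
        if np.any((p <= 0.0) | (p >= 1.0)):          # elasticities must lie in (0, 1)
            return penalty
        y = (R ** a) * (D ** b) * (Ve ** g) * (Ts ** d)          # Eq. (64)
        violation = (max(0.0, a + b - 1 - tau) + max(0.0, 1 - a - b - tau)
                     + max(0.0, g + d - 1 - tau) + max(0.0, 1 - g - d - tau))
        return -y + penalty * violation              # maximise Y by minimising -Y
    return fitness

# Placeholder Earth-normalised inputs (radius, density, escape velocity, surface temperature).
fit = cdhs_fitness(R=1.1, D=0.9, Ve=1.05, Ts=0.98)
# best, val = em_pso(fit, dim=4, bounds=(0.01, 0.99))   # using the EM-PSO sketch from Section II
```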