A new sigmoidal fractional derivative for regularization
arXiv [math.GM]

A PREPRINT
Mostafa Rezapour
Department of Mathematics, Washington State University, Pullman, WA 99163
[email protected]
Adebowale Sijuwade
Department of Mathematics, Washington State University, Pullman, WA 99163
[email protected]
Thomas J. Asaki
Department of Mathematics, Washington State University, Pullman, WA 99163
[email protected]
March 18, 2020

ABSTRACT
In this paper, we propose a new fractional derivative, based on a Caputo-type derivative with a smooth kernel. We show that the proposed fractional derivative reduces to the classical derivative and has a smoothing effect which is compatible with $\ell_1$ regularization. Moreover, it satisfies some classical properties.

Keywords: Fractional calculus, Caputo derivative, Regularization
Fractional calculus has undergone significant developments in recent years and has found use in physics, engineering, economics, and elsewhere [1, 2, 3]. Classical results about the Riemann–Liouville and Caputo derivatives, as well as fractional differential equations, can be found in [4, 5, 6]. In [11] and [48], Caputo and Fabrizio suggested a new fractional derivative, whose properties were investigated by Losada and Nieto [18]. This fractional derivative has been utilized in various applications, including the fractional Nagumo equation in Alqahtani et al. [36], coupled systems of time-fractional differential problems in Alsaedi et al. [37], and Fisher's reaction-diffusion equation in Atangana et al. [38]. More applications of the Caputo–Fabrizio fractional derivative can be found in Aydogan et al. [39] and Atangana et al. [40].

For $0 \le \alpha \le 1$, $-\infty < a < t$, $f \in H^1(a,b)$ and $b > a$, the Caputo fractional derivative is defined by
$$ {}^{C}_{a}D^{\alpha}_{t} f(t) = \frac{1}{\Gamma(1-\alpha)} \int_a^t \frac{f'(s)}{(t-s)^{\alpha}}\, ds. \tag{1} $$
By replacing the factor $1/\Gamma(1-\alpha)$ with a normalization constant $M(\alpha)$ satisfying $M(0) = M(1) = 1$ and adjusting the kernel $(t-s)^{-\alpha}$, we obtain the Caputo–Fabrizio fractional derivative
$$ {}^{CF}_{a}D^{\alpha}_{t} f(t) = \frac{M(\alpha)}{1-\alpha} \int_a^t f'(s)\, \exp\!\left( -\frac{\alpha (t-s)}{1-\alpha} \right) ds. \tag{2} $$
The Caputo–Fabrizio fractional derivative of a constant vanishes, as does the usual Caputo derivative; however, the new kernel $\exp\!\left( -\frac{\alpha (t-s)}{1-\alpha} \right)$ is no longer singular at $s = t$. Caputo and Fabrizio extend their definition in [11] to functions in $L^1$ by
$$ {}^{CF}D^{(\alpha)}_{t} f(t) = \frac{M(\alpha)}{1-\alpha} \int_{-\infty}^{t} \big( f(s) - f(t) \big)\, \exp\!\left( -\frac{\alpha (t-s)}{1-\alpha} \right) ds. $$

Alqahtani et al. [36] show that the nonlinear Nagumo equation given by
$$ {}^{CF}D^{\alpha}_{t} u(x,t) + \beta\, u(x,t)^n\, \partial_x u(x,t) = \partial_x\big( \alpha\, u(x,t)^n\, \partial_x u(x,t) \big) + \gamma\, u(x,t)\,(1 - u^m)(u^m - \delta), \tag{3} $$
where $0 < \alpha < 1$ and $\beta, \gamma, \delta$ are constants, subject to the conditions $u(x,0) = f(x)$, $u(0,t) = g(t)$, has an exact solution. The authors show that this PDE can be reformulated in terms of a Lipschitz kernel. Existence of the exact solution is shown using a fixed-point approach, and uniqueness is obtained under suitable assumptions on the Lipschitz constant. Their study claims that an exponential kernel is, in some sense, a better kernel than a power function, since the lack of a singularity provides a better filtering effect. In the context of fractional differential equation applications, since the associated functions are not defined in a Banach space, only approximate solutions to certain fractional differential equations can be investigated. The methods used to handle fractional differential problems such as ${}^{CF}D^{\alpha} f(t) = g(t, f(t))$ cannot be extended to problems resembling ${}^{CF}D^{\alpha} f(t) = g\big(t, f(t), {}^{CF}D^{\alpha} f(t)\big)$.

In Baleanu et al. [17], the Caputo–Fabrizio fractional derivative on the Banach space $C_{\mathbb{R}}[0,1]$ is considered in the context of higher-order series-type fractional integrodifferential equations. More precisely, an extended Caputo–Fabrizio-type fractional derivative of order $0 \le \alpha < 1$ on $C_{\mathbb{R}}[0,b]$ for $b > 0$ is given by
$$ {}^{CFN}D^{\alpha} f(t) = \frac{M(\alpha)}{1-\alpha}\, \big( f(t) - f(0) \big)\, \exp\!\left( -\frac{\alpha t}{1-\alpha} \right) + \frac{\alpha M(\alpha)}{(1-\alpha)^2} \int_0^t \big( f(t) - f(s) \big)\, \exp\!\left( -\frac{\alpha (t-s)}{1-\alpha} \right) ds. $$
These authors use a standard fixed-point approach to establish uniqueness of solutions to fractional series-type differential problems such as
$$ {}^{CFN}D^{\alpha} f(t) = \sum_{j=0}^{\infty} \frac{ {}^{CFN}D^{\rho[j]}\, g\big( t, f(t), (\phi f)(t), h(t)\, {}^{CFN}D^{\gamma} f(t), g(t)\, {}^{CFN}D^{\delta} f(t) \big) }{2^j}, $$
with initial condition $f(0) = 0$ and $\alpha, \gamma, \delta, \rho \in (0,1)$. An extension of this type which is compatible with orders beyond $(0,1)$ has yet to be provided. The Caputo–Fabrizio fractional derivative is discussed in the setting of distributions in [41]. Other types of fractional derivatives can be found in Katugampola [35] and Oliveira et al. [42]. In de Oliveira [12], it is shown that the choice of kernel in a Caputo-type fractional derivative is connected to the Laplace transform via convolution.

Let $I$ denote the Schwartz class of smooth test functions whose derivatives decay at infinity, and let $I'$ denote the space of continuous linear functionals on $I$. The distributional derivative $T'$ is defined, as in [47], by
$$ \int_{\mathbb{R}} T'(t)\, \varphi(t)\, dt = -\int_{\mathbb{R}} T(t)\, \varphi'(t)\, dt, \tag{4} $$
for all smooth compactly supported test functions $\varphi$ on $\mathbb{R}$. The distributional Laplace transform is given by $F(s) = \mathcal{L}(\varphi(t)) = \mathcal{F}\big( \varphi(t) e^{-\sigma t} \big)(\mu)$, where $s = \sigma + i\mu$ and $\varphi(t) e^{-\sigma t} \in I'$. Suppose that $f$ is supported on $(0,\infty)$ with $\sigma > 0$ and $f(t) e^{-\sigma t} \in I'$. It follows that the Laplace transform of the derivative is given by
$$ \mathcal{L}(\varphi'(t))(s) = s\, \mathcal{L}(\varphi(t))(s). $$
Let $\mathcal{L}$ denote the distributional Laplace transform, so that $f'(x) = \mathcal{L}^{-1}\big( s\, \mathcal{L}(f) \big)$.

One can define a more general fractional derivative as follows. Suppose that $\Phi(s,\alpha)$ is a fractional integrodifferential operator and $K(t,s) : \mathbb{R}^2 \to \mathbb{R}$ is a continuous kernel. Let the corresponding operator $\Phi(s,\alpha)$ be defined for some fractional derivative $D^{\alpha}$ such that $\mathcal{L}(D^{\alpha} f(t)) = \Phi(s,\alpha)\, \mathcal{L}(f(t))$, where $\Phi(s,1) = s$, $\Phi(s,-1) = s^{-1}$ and $\Phi(s,0) = 1$. Then, letting $\Phi(s,\alpha) = s\, \mathcal{L}(K(s,t,\alpha))$ and proceeding with the Convolution Theorem, we are left with a Caputo-type fractional operator of the form
$$ {}_{a}D^{\alpha}_{K} f(t) = \int_a^t K(t-s, \alpha)\, f'(s)\, ds, \tag{5} $$
which depends on the choice of kernel $K$. For $f \in H^1(a,b)$ and $n \in \mathbb{N}$, commonly used kernels include the Caputo kernel $K = \frac{1}{\Gamma(\lceil \alpha \rceil - \alpha)} (t-s)^{\lceil \alpha \rceil - \alpha - 1}$, the Caputo–Fabrizio kernel $K = \frac{M(\alpha)}{1-\alpha} \exp\!\left( -\frac{\alpha (t-s)}{1-\alpha} \right)$ and the Gaussian kernel $K = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{t^2}{2\sigma^2} \right)$ [4, 10, 13].

The memory principle for fractional derivatives describes the history of $f(t)$ near the terminal point $t = a$. Let $L$ denote the memory length, satisfying $a + L \le t \le b$. Define the error in approximating the fractional derivative by
$$ E_{L,\alpha,a}(t) = \big| {}_{a}D^{\alpha}_{K} f(t) - {}_{t-L}D^{\alpha}_{K} f(t) \big|, $$
where ${}_{a}D^{\alpha}_{K} f(t)$ is as in (5). If $|f'(t)| \le M$ for $a < t < b$ and $0 < \alpha < 1$, we have the following error estimate for the Caputo fractional derivative:
$$ E_{L,\alpha,a}(t) = \left| \frac{1}{\Gamma(1-\alpha)} \int_{t-L}^{t} \frac{f'(s)}{(t-s)^{\alpha}}\, ds \right| \le \frac{M\, L^{1-\alpha}}{|\Gamma(2-\alpha)|}. $$
For all $\epsilon > 0$, if $E_{L,\alpha,a}(t) \le \epsilon$ with $a + L \le t \le b$, we have
$$ L \ge \left( \frac{M}{\epsilon\, |\Gamma(2-\alpha)|} \right)^{\frac{1}{\alpha - 1}}. \tag{6} $$
Therefore, the Caputo fractional derivative with terminal point $a$ can be approximated by the corresponding fractional derivative with lower limit $t - L$, with the level of accuracy described above.

In this work, we propose a different fractional derivative that has a smooth kernel. Our primary interest in defining this fractional derivative is the improvement of machine learning algorithms. Caputo-type fractional derivatives have been applied in machine learning, as in Pu et al. [10]. In particular, fractional-order gradient methods have been considered in order to improve the performance of integer-order methods.
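Before turning to gradient methods, the kernel form (5) can be made concrete numerically. The sketch below is illustrative (it is not code from the paper): it discretizes (5) with the trapezoidal rule and checks the result against the closed-form Caputo–Fabrizio derivative of $f(t) = t$, with the simplifying assumption $M(\alpha) = 1$.

```python
import math

def frac_deriv_kernel(fprime, kernel, a, t, alpha, n=20000):
    """Trapezoidal approximation of the kernel-form operator (5):
    a_D^alpha_K f(t) = \\int_a^t K(t - s, alpha) f'(s) ds."""
    h = (t - a) / n
    total = 0.5 * (kernel(t - a, alpha) * fprime(a) + kernel(0.0, alpha) * fprime(t))
    for i in range(1, n):
        s = a + i * h
        total += kernel(t - s, alpha) * fprime(s)
    return h * total

def cf_kernel(u, alpha):
    """Caputo-Fabrizio kernel with M(alpha) = 1 (an assumed normalization)."""
    return math.exp(-alpha * u / (1.0 - alpha)) / (1.0 - alpha)

# For f(t) = t (so f'(s) = 1), the CF derivative has the closed form
# (1/alpha) * (1 - exp(-alpha (t - a) / (1 - alpha))).
a, t, alpha = 0.0, 1.0, 0.5
numeric = frac_deriv_kernel(lambda s: 1.0, cf_kernel, a, t, alpha)
exact = (1.0 / alpha) * (1.0 - math.exp(-alpha * (t - a) / (1.0 - alpha)))
```

Swapping `cf_kernel` for the Gaussian kernel reproduces a smoothed derivative in the same way; only the singular Caputo kernel would require special quadrature near $s = t$.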
For example, suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is convex and differentiable with a Lipschitz gradient. Then the integer-order gradient method defined by
$$ x_{k+1} = x_k - \mu\, \nabla f(x_k) $$
has a linear convergence rate. Improving the performance of the integer-order gradient method is critical in optimization problems. In recent literature, fractional calculus has been thought to improve the integer-order gradient method
due to nonlocality and the memory principle. Fractional-order gradient methods based on the Caputo fractional derivative have been proposed that offer competitive convergence rates. For example, in [28], a Caputo fractional gradient method is proposed that is shown to be monotone and to exhibit strong convergence.

Fractional derivatives were used in the backpropagation algorithm for feedforward neural networks and convolutional neural networks in [32, 46]. In both studies, the rate of convergence was shown to exceed that of integer-order methods. Fractional-order methods have been used to investigate complex-valued neural networks in [24] and recurrent neural network models in [44]. In [28] and [22], gradients based on the Caputo fractional derivative are used to update parameters, while integer-order gradients are used to handle backpropagation, allowing for simpler computation. The experiments therein improve the accuracy of the neural network compared to integer-order methods while being equally costly.

In the training of machine learning models, one often needs to obtain the feature weights which best fit the training data. In the case of maximum likelihood training, regularization is typically needed so that the model does not overfit the training data. In $\ell_p$ regularization, the weight vector is penalized by its $\ell_p$ norm. While the cases $p = 1$ and $p = 2$ are both common and result in similar levels of accuracy, $\ell_1$ regularization is often more practical: due to the sparsity it induces, $\ell_1$ regularization is less memory-intensive and more time-effective than $\ell_2$ regularization. On the other hand, $\ell_1$ regularization is problematic in that, during the update process, the gradient of the regularization term is not defined at the origin, as the error function
$$ E_{\ell_1} = E + \lambda \sum_{k=1}^{N} |x_k| \tag{7} $$
has classical derivative
$$ \frac{\partial E_{\ell_1}}{\partial x_j} = \frac{\partial E}{\partial x_j} + \lambda\, \mathrm{sgn}(x_j). $$
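To make the $\ell_1$ update concrete, the sketch below evaluates the penalized error function (7) and its almost-everywhere gradient; the quadratic data term $E$ and all constants are stand-ins chosen purely for illustration.

```python
def l1_objective(x, E, lam):
    """E_{l1}(x) = E(x) + lam * sum_k |x_k|, as in (7)."""
    return E(x) + lam * sum(abs(xk) for xk in x)

def sgn(v):
    return (v > 0) - (v < 0)

def l1_gradient(x, gradE, lam):
    """Classical (a.e.) gradient: dE/dx_j + lam * sgn(x_j).
    The sign term is discontinuous at x_j = 0, which is exactly the
    discontinuity the sigmoidal derivative smooths out later."""
    g = gradE(x)
    return [gj + lam * sgn(xj) for gj, xj in zip(g, x)]

# Toy quadratic data term E(x) = 0.5 * ||x - b||^2 (illustrative only).
b = [1.0, -2.0, 0.0]
E = lambda x: 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))
gradE = lambda x: [xi - bi for xi, bi in zip(x, b)]

g = l1_gradient([0.5, -0.5, 0.0], gradE, lam=0.1)
```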
A typical remedy to this problem is the stochastic gradient descent method, which approximates the gradient using the training data. Although time-efficient for training, when the dimension of the feature space is large the update process slows down significantly. Furthermore, the model becomes less sparse after training. The discontinuity induced by the regularizer proves problematic, as it adjusts the direction of descent. The use of sigmoids in regularization problems has been explored previously, as in Krutikov [43], but not in the context of fractional derivatives. Another remedy is the use of fractional gradients in place of classical descent methods. These methods are still in their infancy and are problematic in that convergence to the local optimum is not always guaranteed, even when the algorithm converges, as in [9]. Furthermore, these methods often require an adjustment to the fractional derivative by truncation, or methods based on the memory principle (6), due to the computational expense and the failure of the Caputo kernel to be smooth.

We would also like our operator to be nonlocal. In [13], it is shown that, unlike the Caputo derivative, the Caputo–Fabrizio fractional derivative is not a nonlocal operator. The linear fractional differential equation
$$ \lambda\, \big( {}^{CF}_{a}D^{\alpha}_{t} f(t) \big) + \nu(t)\, g(t) + \eta(t,t)\, Y(t) = 0 $$
is shown to reduce to a first-order ordinary differential equation. This means that the Caputo–Fabrizio derivative cannot sufficiently describe processes with nonlocality and memory. With the correct choice of kernel, this complication can be avoided.

In this section, we define a new left-sided fractional derivative. We show that the proposed fractional derivative reduces to the $H^1$ derivative as the order approaches 1. In the results to follow, for $0 < \alpha \le 1$, we will let $C(\alpha)$ denote a normalization constant satisfying $2(1-\alpha)\, C(\alpha) \to 1$ as $\alpha \to 1^{-}$.
Definition 2.1. (Left sigmoidal fractional derivative)
Let $0 < \alpha \le 1$, $f \in H^1((a,b))$, $t > a$, and let $\{f(t)\}'$ denote the $H^1$ distributional derivative as in (4). We define a new fractional derivative by
$$ {}^{\sigma}D^{\alpha}_{a} f(t) = C(\alpha) \int_a^t \{f(s)\}'\, \mathrm{sech}\!\left( \frac{s-t}{1-\alpha} \right) ds. \tag{8} $$
We now show that the left sigmoidal fractional derivative reduces to the $H^1$ derivative.

Theorem 2.1. (Reduction to classical derivative)
Suppose $f \in H^1(a,b)$. Then
$$ \lim_{\alpha \to 1^{-}} {}^{\sigma}D^{\alpha}_{a} f(t) = \{f(t)\}'. \tag{9} $$
Proof. Since $\frac{1}{2(1-\alpha)}\, \mathrm{sech}\!\left( \frac{s-t}{1-\alpha} \right) \to \delta(s-t)$ as $\alpha \to 1^{-}$, where $\delta$ is the Dirac distribution, and $2(1-\alpha)\, C(\alpha) \to 1$, we have
$$ \lim_{\alpha \to 1^{-}} {}^{\sigma}D^{\alpha}_{a} f(t) = \lim_{\alpha \to 1^{-}} 2(1-\alpha)\, C(\alpha) \int_a^t \{f(s)\}'\, \frac{1}{2(1-\alpha)}\, \mathrm{sech}\!\left( \frac{s-t}{1-\alpha} \right) ds = \int_a^t \{f(s)\}'\, \delta(s-t)\, ds = \{f(t)\}'. $$

In the following theorem, we show that the left sigmoidal fractional derivative commutes with the classical derivative.

Theorem 2.2.
Suppose that $f$ is at least twice continuously differentiable and ${}^{\sigma}D^{\alpha}_{a} f(t)$ is differentiable. If $f'(a) = 0$, then
$$ {}^{\sigma}D^{\alpha}_{a}\big( {}^{\sigma}D^{1}_{a} f(t) \big) = {}^{\sigma}D^{1}_{a}\big( {}^{\sigma}D^{\alpha}_{a} f(t) \big), \tag{10} $$
where $0 < \alpha < 1$.
Proof.
From (8), integrating by parts and using $f'(a) = 0$ yields
$$ {}^{\sigma}D^{\alpha}_{a}\big( {}^{\sigma}D^{1}_{a} f(t) \big) = C(\alpha) \int_a^t f''(s)\, \mathrm{sech}\!\left( \frac{s-t}{1-\alpha} \right) ds = C(\alpha)\, f'(t) + \frac{C(\alpha)}{1-\alpha} \int_a^t f'(s)\, \mathrm{sech}\!\left( \frac{s-t}{1-\alpha} \right) \tanh\!\left( \frac{s-t}{1-\alpha} \right) ds, \tag{11} $$
so we have
$$ {}^{\sigma}D^{1}_{a}\big( {}^{\sigma}D^{\alpha}_{a} f(t) \big) = \lim_{\gamma \to 1^{-}} {}^{\sigma}D^{\gamma}_{a}\big( {}^{\sigma}D^{\alpha}_{a} f(t) \big) = \frac{d}{dt}\big( {}^{\sigma}D^{\alpha}_{a} f(t) \big) = C(\alpha)\, \frac{d}{dt} \int_a^t f'(s)\, \mathrm{sech}\!\left( \frac{s-t}{1-\alpha} \right) ds
= C(\alpha)\, f'(t) + \frac{C(\alpha)}{1-\alpha} \int_a^t f'(s)\, \mathrm{sech}\!\left( \frac{s-t}{1-\alpha} \right) \tanh\!\left( \frac{s-t}{1-\alpha} \right) ds, \tag{12} $$
appealing to the Leibniz integral rule
$$ \frac{d}{dt} \int_{a(t)}^{b(t)} f(t,s)\, ds = f(t, b(t))\, b'(t) - f(t, a(t))\, a'(t) + \int_{a(t)}^{b(t)} \frac{\partial}{\partial t} f(t,s)\, ds. $$
From (11) and (12), the desired result is obtained.

In the next theorem, we show that the left sigmoidal fractional derivative satisfies a memory principle different from (6): it can be approximated by the corresponding fractional derivative with lower limit $t - L$, with the required memory length $L$ growing with $|C(\alpha)|$.

Theorem 2.3. (Memory principle)
Suppose that $f$ is differentiable on $(a,b)$, $a + L \le t \le b$ and $0 < \alpha < 1$. For every $\epsilon > 0$, if there exists $C > 0$ such that $|f'(t)| \le C$, then $\big| {}^{\sigma}D^{\alpha}_{a} f(t) - {}^{\sigma}D^{\alpha}_{t-L} f(t) \big| \le \epsilon$ whenever
$$ L \ge \frac{2\,(1-\alpha)^2\, |C(\alpha)|\, C}{\epsilon}. \tag{13} $$
Proof.
Making use of the inequality $\cosh(s) \ge \frac{s^2}{2}$, so that $\mathrm{sech}\!\left( \frac{s-t}{1-\alpha} \right) \le \frac{2(1-\alpha)^2}{(s-t)^2}$, we have
$$ \big| {}^{\sigma}D^{\alpha}_{a} f(t) - {}^{\sigma}D^{\alpha}_{t-L} f(t) \big| = \left| C(\alpha) \int_a^{t-L} f'(s)\, \mathrm{sech}\!\left( \frac{s-t}{1-\alpha} \right) ds \right| \le 2\, |C(\alpha)|\, C\, (1-\alpha)^2 \int_a^{t-L} \frac{ds}{(t-s)^2} \le \frac{2\, |C(\alpha)|\, C\, (1-\alpha)^2}{L}, $$
and the result follows.

In the theorem below, we show that our new fractional derivative provides a sigmoidal approximation to functions that have a piecewise linear $H^1$ distributional derivative. In particular, the proposed left sigmoidal fractional derivative is compatible with $\ell_1$ regularization: in the case of the $\ell_1$ norm, it can be used to define a fractional gradient which approximates the classical gradient via a family of sigmoids as $\alpha$ approaches 1. This is promising in the context of gradient descent algorithms.

Theorem 2.4. (Norm-1 compatibility) ${}^{\sigma}D^{\alpha}_{a}$ provides a smooth approximation to the $\ell_1$ norm defined by $\| x \|_1 = \sum_{k=1}^{n} |x_k|$ as $\alpha \to 1$, in the sense that, for the error function $E$ given in (7), ${}^{\sigma}D^{\alpha}_{a} E_{\ell_1}(x_j)$ is given by
$$ {}^{\sigma}D^{\alpha}_{a} E(x_j) + \lambda\, C(\alpha)\,(\alpha - 1)\, \tanh\!\left( \frac{a - x_j}{1-\alpha} \right), $$
where $a > 0$.
Proof.
The result follows from the observation that, for $0 < a < t$,
$$ C(\alpha) \int_a^t \{|s|\}'\, \mathrm{sech}\!\left( \frac{s-t}{1-\alpha} \right) ds = C(\alpha) \int_a^t \big( 2H(s) - 1 \big)\, \mathrm{sech}\!\left( \frac{s-t}{1-\alpha} \right) ds = C(\alpha)\,(\alpha - 1)\, \tanh\!\left( \frac{a - t}{1-\alpha} \right) \to \frac{1}{2}\big( 2H(t) - 1 \big) $$
as $\alpha \to 1^{-}$, where $H(t)$ is the Heaviside function.

Theorem 2.5. (Mittag-Leffler function)
Suppose that $\gamma, \eta > 0$ and $0 < a < t$. Then
$$ {}^{\sigma}D^{\alpha}_{a} E_{\gamma,\eta}(t) \le C(\alpha)\, E_{\gamma,\eta}(t-a), $$
where $E_{\gamma,\eta}(z) = \sum_{k=0}^{\infty} \frac{z^k}{\Gamma(\gamma k + \eta)}$ is the two-parameter Mittag-Leffler function.
Proof. Using $\mathrm{sech}(x) \le 1$,
$$ {}^{\sigma}D^{\alpha}_{a} E_{\gamma,\eta}(t) = C(\alpha) \int_a^t \mathrm{sech}\!\left( \frac{s-t}{1-\alpha} \right) \frac{d}{ds} \sum_{k=0}^{\infty} \frac{s^k}{\Gamma(\gamma k + \eta)}\, ds = C(\alpha) \sum_{k=0}^{\infty} \frac{k}{\Gamma(\gamma k + \eta)} \int_a^t s^{k-1}\, \mathrm{sech}\!\left( \frac{s-t}{1-\alpha} \right) ds \le C(\alpha) \sum_{k=0}^{\infty} \frac{k}{\Gamma(\gamma k + \eta)} \int_a^t s^{k-1}\, ds = C(\alpha) \sum_{k=1}^{\infty} \frac{(t-a)^k}{\Gamma(\gamma k + \eta)} \le C(\alpha)\, E_{\gamma,\eta}(t-a). $$

Theorem 2.6.
Suppose that $f \ge 0$, $1 < p < \infty$, $0 < \alpha < 1$ and $0 < t \le T$. If $f$ is differentiable with $f' \in L^p(\mathbb{R})$ and $\mathcal{M}$ is the maximal operator given by
$$ \mathcal{M}f(x) = \sup_{r > 0} \frac{1}{2r} \int_{x-r}^{x+r} |f(s)|\, ds, $$
then
(a) ${}^{\sigma}D^{\alpha}_{-t} f(t) \le 2T\, C(\alpha)\, \mathcal{M}(|f'|)(0)$;
(b) ${}^{\sigma}D^{\alpha}_{a} f(t)$ is integrable on $\mathbb{R}$.
Proof. (a) Since $\mathrm{sech} \le 1$ and $0 < t \le T$,
$$ {}^{\sigma}D^{\alpha}_{-t} f(t) = C(\alpha) \int_{-t}^{t} f'(s)\, \mathrm{sech}\!\left( \frac{s-t}{1-\alpha} \right) ds \le 2t\, C(\alpha) \cdot \frac{1}{2t} \int_{-t}^{t} |f'(s)|\, ds \le 2T\, C(\alpha)\, \sup_{t > 0} \frac{1}{2t} \int_{-t}^{t} |f'(s)|\, ds = 2T\, C(\alpha)\, \mathcal{M}(|f'|)(0). $$
(b) From Young's convolution inequality,
$$ \| f \star g \|_{L^r} \le \| f \|_{L^p}\, \| g \|_{L^{\frac{rp}{p + r(p-1)}}}. $$
Taking $r = 1$,
$$ \int_{-\infty}^{\infty} \big| {}^{\sigma}D^{\alpha}_{a} f(t) \big|\, dt \le C(\alpha) \int_{-\infty}^{\infty} \int_a^t \left| f'(s)\, \mathrm{sech}\!\left( \frac{s-t}{1-\alpha} \right) \right| ds\, dt = C(\alpha) \left\| f'(t) \star \mathrm{sech}\!\left( \frac{t}{\alpha-1} \right) \right\|_{L^1(\mathbb{R})} \le C(\alpha)\, \| f' \|_{L^p(\mathbb{R})} \left\| \mathrm{sech}\!\left( \frac{t}{\alpha-1} \right) \right\|_{L^{\frac{p}{2p-1}}(\mathbb{R})} < \infty. $$

The next theorem describes the effect of the Laplace and Fourier transforms, which extend to distributions as in de Oliveira [6]. The Convolution Theorem connects our choice of kernel, as in (5), via the operator $\Phi(s,\alpha) = s\, \mathcal{L}(K(s,t,\alpha))$. In this case, $\Phi(s,\alpha)$ depends on the digamma function $\Psi(z) = \frac{\Gamma'(z)}{\Gamma(z)}$. This shows that the left sigmoidal fractional derivative does not reduce to the left-sided Riemann–Liouville fractional derivative.

Theorem 2.7. (Transformations)
Suppose that $0 < \alpha < 1$, $\mathrm{Re}(s) > 0$, $\omega \in \mathbb{R}$, $a \in \mathbb{R}$ and $f$ is a differentiable function of exponential order such that $f(0) = 0$. If $T_1(s)$ and $T_2(\omega)$ are defined by
$$ T_1(s) = \frac{\Psi\!\left( \frac{s+1}{2s} \right) - \Psi\!\left( \frac{1}{2s} \right)}{2s}, \qquad T_2(\omega) = \sqrt{\frac{\pi}{2}}\, \mathrm{sech}\!\left( \frac{\pi \omega}{2} \right), $$
then
(a) $\mathcal{L}\big( {}^{\sigma}D^{\alpha} f(t) \big)(s) = C(\alpha)\, s\,(\alpha - 1)\, T_1\big( (\alpha - 1)\, s \big)\, \mathcal{L}(f)(s)$;
(b) $\mathcal{F}\big( {}^{\sigma}D^{\alpha} f \big)(\omega) = i\, C(\alpha)\, \omega\, |\alpha - 1|\, T_2\big( (\alpha - 1)\, \omega \big)\, \mathcal{F}(f)(\omega)$;
where $\mathcal{L}(f)(s)$ denotes the Laplace transform of $f$ and $\mathcal{F}(f)(\omega)$ denotes the Fourier transform of $f$.
Proof. (a) follows from a standard application of the Convolution Theorem. Using the dilation property $\mathcal{L}(f(at)) = \frac{1}{a} F\!\left( \frac{s}{a} \right)$, we have
$$ \frac{\mathcal{L}\big( {}^{\sigma}D^{\alpha} f(t) \big)(s)}{C(\alpha)} = \mathcal{L}\!\left( f' \star \mathrm{sech}\!\left( \frac{t}{\alpha - 1} \right) \right) = \mathcal{L}(f')\, \mathcal{L}\!\left( \mathrm{sech}\!\left( \frac{t}{\alpha - 1} \right) \right) = s\,(\alpha - 1)\, \mathcal{L}(f)(s)\, \mathcal{L}(\mathrm{sech})\big( s\,(\alpha - 1) \big). $$
The transform $\mathcal{L}(\tanh t)$ is handled as follows:
$$ s\, \mathcal{L}(\tanh t)(s) = s \int_0^{\infty} e^{-st}\, \frac{1 - e^{-2t}}{1 + e^{-2t}}\, dt = s \int_0^{\infty} e^{-st}\, (1 - e^{-2t}) \sum_{k=0}^{\infty} (-e^{-2t})^k\, dt. $$
Since the partial sums of $\sum_{k} (-e^{-2t})^k$ are uniformly bounded for $t > 0$, we can exchange integration and summation. Continuing, we have
$$ s\, \mathcal{L}(\tanh t)(s) = 1 + 2s \sum_{k=1}^{\infty} (-1)^k\, \mathcal{L}\big( e^{-2kt} \big)(s) = 1 + 2s \sum_{k=1}^{\infty} \frac{(-1)^k}{2k + s}, $$
which is expressed through the digamma function by means of the identity
$$ \sum_{k=0}^{\infty} \frac{(-1)^k}{sk + 1} = \frac{\Psi\!\left( \frac{s+1}{2s} \right) - \Psi\!\left( \frac{1}{2s} \right)}{2s} = T_1(s), $$
which comes from the Lerch transcendent, defined by
$$ \Phi(z, s, a) = \sum_{k=0}^{\infty} \frac{z^k}{(a + k)^s}, \qquad |z| < 1,\ a \ne 0, -1, -2, \ldots $$
Using the dilation property once more, the result follows.

(b) We proceed as in (a):
$$ \mathcal{F}\big( {}^{\sigma}D^{\alpha} f(t) \big)(\omega) = \int_{-\infty}^{\infty} \big( {}^{\sigma}D^{\alpha} f(t) \big)\, e^{i \omega t}\, dt = C(\alpha)\, \mathcal{F}(f')\, \mathcal{F}\!\left( \mathrm{sech}\!\left( \frac{t}{\alpha - 1} \right) \right) = i\, C(\alpha)\, \omega\, |\alpha - 1|\, \mathcal{F}(f)(\omega)\, \mathcal{F}(\mathrm{sech})\big( (\alpha - 1)\, \omega \big). $$
To finish the proof, we recall the result $\mathcal{F}(\mathrm{sech}\, t)(\omega) = \sqrt{\frac{\pi}{2}}\, \mathrm{sech}\!\left( \frac{\pi \omega}{2} \right) = T_2(\omega)$.

Theorem 2.8.
Suppose that $f$ is differentiable with $f' \ge 0$ and $0 < \alpha < 1$. Then
$$ \int_a^t f'(s)\, e^{\frac{s-t}{1-\alpha}}\, ds \;\le\; \frac{{}^{\sigma}D^{\alpha}_{a} f(t)}{C(\alpha)} \;\le\; \int_a^t \frac{2(1-\alpha)^2\, f'(s)}{2(1-\alpha)^2 + (s-t)^2}\, ds \;\le\; f(t) - f(a). $$
Proof.
Using the inequality $\cosh x \le e^{|x|}$, we have
$$ e^{\frac{s-t}{1-\alpha}} \le \mathrm{sech}\!\left( \frac{s-t}{1-\alpha} \right), $$
which results in the leftmost inequality. Noticing that $\cosh x \ge 1 + \frac{x^2}{2}$, we have that
$$ \mathrm{sech}\!\left( \frac{s-t}{1-\alpha} \right) \le \frac{2(1-\alpha)^2}{2(1-\alpha)^2 + (s-t)^2} \le 1, $$
which, since $f' \ge 0$, finishes the remaining inequalities.

Theorem 2.9.
The problem ${}^{\sigma}D^{\alpha}_{a} f(t) = G(t)$, $G(0) = 0$, has the solution
$$ f(t) = \frac{g(t)}{C(\alpha)} + f(0), $$
where $G(t) = \int_0^t g(s)\, ds$.
Proof.
Differentiating the differential equation above, the problem reduces to $C(\alpha)\, f'(t) = g'(t)$, which can be integrated to obtain the result.

Theorem 2.10.
Let $0 < \alpha < 1$ and let $g : (a,b) \times \mathbb{R}^2 \to \mathbb{R}$ be a continuous function such that there exists a constant $C > 0$ satisfying
$$ |g(t, x_1, y_1) - g(t, x_2, y_2)| \le C\, \big( |x_1 - x_2| + |y_1 - y_2| \big) $$
for all $t \in (a,b)$ and $x_1, x_2, y_1, y_2 \in \mathbb{R}$, and $|(\alpha - 1)\, C(\alpha)\, C| < 1$. Then the problem
$$ {}^{\sigma}D^{\alpha}_{a} f(t) = g\big( t, f(t), {}^{\sigma}D^{\alpha}_{a} f(t) \big) $$
has a unique solution.
Proof. For $f_1, f_2 \in H^1(a,b)$,
$$ \big| g\big( t, {}^{\sigma}D^{\alpha}_{a} f_1(t) \big) - g\big( t, {}^{\sigma}D^{\alpha}_{a} f_2(t) \big) \big| \le \left| (\alpha - 1)\, C(\alpha)\, \tanh\!\left( \frac{a - t}{1-\alpha} \right) \right| C\, |f_1 - f_2| \le |(\alpha - 1)\, C(\alpha)\, C|\, |f_1 - f_2|. $$
Since $|(\alpha - 1)\, C(\alpha)\, C| < 1$, the map $F : H^1(a,b) \to H^1(a,b)$ defined by $F(f) = C(\alpha)^{-1}\, g\big( t, f(t), {}^{\sigma}D^{\alpha}_{a} f(t) \big)$ is a contraction. By the Banach fixed-point theorem, it has a unique fixed point, finishing the proof.
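The contraction argument above is constructive: Picard iteration converges geometrically, with ratio bounded by the Lipschitz constant of the map. The toy scalar fixed-point problem below only illustrates that mechanism (it is not the operator equation of Theorem 2.10); the map F and its constants are assumptions chosen for the example.

```python
import math

def picard(F, y0, tol=1e-13, max_iter=200):
    """Iterate y_{n+1} = F(y_n); return the limit and the successive gaps
    |y_{n+1} - y_n|, which shrink geometrically for a contraction."""
    y, gaps = y0, []
    for _ in range(max_iter):
        y_next = F(y)
        gaps.append(abs(y_next - y))
        y = y_next
        if gaps[-1] < tol:
            break
    return y, gaps

# F is a contraction with Lipschitz constant at most 0.5 (since |F'| <= 0.5),
# mirroring the smallness condition |(alpha - 1) C(alpha) C| < 1 above.
F = lambda y: 0.5 * math.cos(y) + 0.3
fixed, gaps = picard(F, y0=0.0)
```

The successive gaps contract by at least the Lipschitz factor at every step, which is exactly the estimate used in the uniqueness proof.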
We note that this result is advantageous in that the analogous existence and uniqueness results for fractional differential systems defined by the Caputo derivative depend heavily on initial conditions imposed on the primary function of interest and its classical derivatives [4].

We now shift our attention to a gradient descent method. Suppose that $f(x)$ has a bounded derivative and a unique critical point $t^*$ such that $f'(t^*) = 0$. For $a \le t \le b$ and $0 < \alpha < 1$, define the scalar left sigmoidal fractional gradient descent method by
$$ t_{k+1} = t_k - \mu\, {}^{\sigma}D^{\alpha}_{t_{k-1}} f(t_k), \tag{14} $$
where $0 < \mu < 1$ is the learning rate.

Theorem 2.11. (Fractional Gradient Descent)
Let $f$ be as above, with bounded derivative and unique critical point $t^*$. Then the left sigmoidal fractional-order gradient method defined above converges to the true critical point $t^*$.
Proof.
Denote the Lipschitz constant of $f$ by $L$. For $k \ge N$,
$$ |t_k - t_{k+1}| = \mu\, \big| {}^{\sigma}D^{\alpha}_{t_{k-1}} f(t_k) \big| = C(\alpha)\, \mu \left| \int_{t_{k-1}}^{t_k} f'(s)\, \mathrm{sech}\!\left( \frac{s - t_k}{1-\alpha} \right) ds \right| \le C(\alpha)\, \mu\, L\, |\alpha - 1| \left| \tanh\!\left( \frac{t_{k-1} - t_k}{1-\alpha} \right) \right| \le C(\alpha)\, \mu\, L\, |t_k - t_{k-1}|. $$
Repeating this process, it follows that the $t_k$ form a Cauchy sequence, guaranteeing convergence. To show that the sequence converges to the critical point, suppose for contradiction that the sequence $(t_k)_{k=0}^{\infty}$ converges to a point $\hat{t} \ne t^*$. Then, for every $\epsilon > 0$, there exists $N \in \mathbb{N}$ such that for all $k \ge N$, $|f'(t_k)| > 0$ and $|t_{k-1} - \hat{t}| < \epsilon < |t^* - \hat{t}|$. As a consequence of the update rule and Theorem 2.8, we have
$$ |t_{k+1} - t_k| = C(\alpha)\, \mu \left| \int_{t_{k-1}}^{t_k} f'(s)\, \mathrm{sech}\!\left( \frac{s - t_k}{1-\alpha} \right) ds \right| \ge C(\alpha)\, \mu\, \inf_{k > N} |f'(t_{k-1})| \left| \int_{t_{k-1}}^{t_k} e^{\frac{s - t_k}{1-\alpha}}\, ds \right| \ge M_1\, |t_k - t_{k-1}| \left( 1 - \frac{|t_k - t_{k-1}|}{2(1-\alpha)} \right) \ge M_1 M_2\, |t_k - t_{k-1}|, $$
where $M_1 = C(\alpha)\, \mu\, \inf_{k > N} |f'(t_{k-1})|$ and $M_2 \in (0,1)$. On the other hand, we have the inequality
$$ |t_{k+1} - t_{k-1}| \le |t_{k+1} - \hat{t}| + |\hat{t} - t_{k-1}| < 2\epsilon. $$
Choosing $\epsilon$ small enough then gives $|t_{k+1} - t_k| > |t_k - t_{k-1}|$, contradicting the assumption that the sequence $(t_k)$ is convergent.
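A minimal numerical sketch of the descent scheme: the sigmoidal derivative is evaluated by quadrature after the substitution $u = (t-s)/(1-\alpha)$. Two conventions here are assumptions, not taken from the paper: the normalization $C(\alpha) = \frac{2}{\pi(1-\alpha)}$ (which gives the half-line kernel unit mass, so the operator tends to $f'$ as $\alpha \to 1^{-}$), and a fixed lower limit $a$ in place of $t_{k-1}$, which keeps the illustration close to Theorem 2.1. The objective $f(t) = (t-2)^2$ is illustrative.

```python
import math

def sigma_D(fprime, a, t, alpha, n=2000, umax=40.0):
    """Left sigmoidal derivative (8) via the substitution u = (t-s)/(1-alpha),
    with the ASSUMED normalization C(alpha) = 2 / (pi * (1 - alpha))."""
    U = min((t - a) / (1.0 - alpha), umax)  # sech(u) is negligible past umax
    h = U / n
    total = 0.5 * (fprime(t) + fprime(t - (1.0 - alpha) * U) / math.cosh(U))
    for i in range(1, n):
        u = i * h
        total += fprime(t - (1.0 - alpha) * u) / math.cosh(u)
    return (2.0 / math.pi) * h * total

# Scalar descent t_{k+1} = t_k - mu * sigma_D f(t_k) on f(t) = (t - 2)^2,
# whose unique critical point is t* = 2.
fp = lambda s: 2.0 * (s - 2.0)
a, alpha, mu = -3.0, 0.99, 0.1
t = 0.0
for _ in range(150):
    t -= mu * sigma_D(fp, a, t, alpha)
```

For $\alpha$ near 1 the iterate settles close to $t^* = 2$, with a small one-sided bias of order $1-\alpha$ coming from the half-line kernel; the bias shrinks as $\alpha \to 1^{-}$, in line with Theorem 2.1.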
In this paper, we defined a new sigmoidal fractional derivative which is compatible with certain weakly differentiable functions. We showed that this fractional derivative satisfies forms of several classical properties and is compatible with the $\ell_1$ norm via a sigmoidal approximation. For further research, we will investigate this operator in optimization and machine learning. We note that the left sigmoidal fractional derivative can be applied in the context of gradient descent, which has applications in optimization and machine learning [7, 8]. Recently, backpropagation and convolutional neural networks have been studied in the context of fractional derivatives, with Caputo-type derivatives typically used for gradient descent. This idea is still novel and needs improvement. For example, the gradient descent method has been handled by Sheng et al. [32, 33], Wang et al. [28], Wei et al. [9] and Bao et al. [22]. These methods are still early in development. The following topics still need to be fully addressed: convergence to an extreme point, extending the available range of fractional orders, more complicated neural networks, loss function compatibility, and the usage of the chain rule.

Conflict of interest
The authors declare that there is no conflict of interest.
References

[1] Gorenflo R, Mainardi F. Fractional calculus: integral and differential equations of fractional order. In: Carpinteri A, Mainardi F, editors. Fractals and Fractional Calculus in Continuum Mechanics. Wien and New York: Springer Verlag; 1997. p. 223–276.
[2] Kochubei AN. General fractional calculus, evolution equations, and renewal processes. Integr Equ Oper Theory 2011;71:583–600.
[3] Kiryakova V. Generalised Fractional Calculus and Applications. Pitman Research Notes in Mathematics No 301. Harlow: Longman; 1994.
[4] Podlubny I. Fractional Differential Equations. New York: Academic Press; 2009.
[5] Caputo M. Elasticità e Dissipazione. Bologna: Zanichelli; 1965.
[6] de Oliveira EC, Machado JAT. A review of definitions for fractional derivatives and integral. Math Probl Eng 2014;2014:238459.
[7] Zeng JS, Yin WT. On nonconvex decentralized gradient descent. IEEE Transactions on Signal Processing 2018;66(11):2834–2848.
[8] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436–444.
[9] Wei K, Yin Y. Design of generalized fractional order gradient descent method. 2018. Manuscript submitted for publication.
[10] Pu YF, Zhou, Zhang, Ni, Huang, Siarry. Fractional extreme value adaptive training method: fractional steepest descent approach. IEEE Transactions on Neural Networks and Learning Systems 2013.
[11] Caputo M, Fabrizio M. A new definition of fractional derivative without singular kernel. 2015.
[12] Capelas de Oliveira E, Jarosz S, Vaz J Jr. Fractional calculus via Laplace transform and its application in relaxation processes. Communications in Nonlinear Science and Numerical Simulation 2018;69. doi:10.1016/j.cnsns.2018.09.013.
[13] Tarasov VE. No nonlocality. No fractional derivative. Commun Nonlinear Sci Numer Simul 2018;62:157–63.
[14] Ortigueira MD, Machado JAT. A critical analysis of the Caputo–Fabrizio operator. Commun Nonlinear Sci Numer Simul 2018;59:608–11.
[15] Voronin S, et al. Convolution based smooth approximations to the absolute value function with application to non-smooth regularization. 2014.
[16] Wei Y, Chen Y, Cheng S, Wang Y. A note on short memory principle of fractional calculus. Fractional Calculus and Applied Analysis 2017;20. doi:10.1515/fca-2017-0073.
[17] Baleanu D, Mousalou A, Rezapour S. The extended fractional Caputo–Fabrizio derivative of order 0 ≤ σ < 1. Advances in Difference Equations 2018. doi:10.1186/s13662-018-1696-6.
[18] Losada J, Nieto J. Properties of a new fractional derivative without singular kernel. Prog Fract Differ Appl 2015;1:87–92. doi:10.12785/pfda/010202.
[19] Herzallah MAE. Notes on some fractional calculus operators and their properties. 2014.
[20] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436–444.
[21] Zeng JS, Yin WT. On nonconvex decentralized gradient descent. IEEE Transactions on Signal Processing 2018;66(11):2834–2848.
[22] Bao C, Pu YF, Zhang Y. Fractional-order deep backpropagation neural network. Computational Intelligence and Neuroscience 2018;2018:1–10. doi:10.1155/2018/7361628.
[23] Evirgen F, Yavuz M. An alternative approach for nonlinear optimization problem with Caputo–Fabrizio derivative. 2018.
[24] Wang J, Yang G, Zhang B, Sun Z, Liu Y, Wang J. Convergence analysis of Caputo-type fractional order complex-valued neural networks. IEEE Access 2017. doi:10.1109/ACCESS.2017.2679185.
[25] Jiang S, Fang SC, Nie T, Xing W. A gradient descent based algorithm for ℓp minimization. European Journal of Operational Research 2019. doi:10.1016/j.ejor.2019.11.051.
[26] Cheng S, Wei Y, Chen Y, Li Y, Wang Y. An innovative fractional order LMS based on variable initial value and gradient order. Signal Processing 2016;133. doi:10.1016/j.sigpro.2016.11.026.
[27] He JH, Elagan SK, Li ZB. Geometrical explanation of the fractional complex transform and derivative chain rule for fractional calculus. Physics Letters A 2012;376:257–259. doi:10.1016/j.physleta.2011.11.030.
[28] Wang J, Wen Y, Gou Y, Ye Z, Chen H. Fractional-order gradient descent learning of BP neural networks with Caputo derivative. Neural Networks 2017;89:19–30. doi:10.1016/j.neunet.2017.02.007.
[29] Karci A. Chain rule for fractional order derivatives. 2015.
[30] Tarasov V. On chain rule for fractional derivatives. Communications in Nonlinear Science and Numerical Simulation 2016;30:1–4. doi:10.1016/j.cnsns.2015.06.007.
[31] Jumarie G. On the derivative chain-rules in fractional calculus via fractional difference and their application to systems modelling. Central European Journal of Physics 2013;11. doi:10.2478/s11534-013-0256-7.
[32] Sheng D, Wei Y, Chen Y, Wang Y. Convolutional neural networks with fractional order gradient method. 2019.
[33] Chen YQ, Gao Q, Wei YH, Wang Y. Study on fractional order gradient methods. Applied Mathematics and Computation 2017;314:310–321.
[34] Chen YQ, Wei YH, Wang Y, Chen YQ. Fractional order gradient methods for a general class of convex functions. In: 2018 Annual American Control Conference (ACC), Milwaukee, USA; 2018. p. 3763–3767.
[35] Katugampola UN. A new fractional derivative with classical properties. 2014.
[36] Alqahtani RT. Fixed-point theorem for Caputo–Fabrizio fractional Nagumo equation with nonlinear diffusion and convection. 2016.
[37] Alsaedi A, et al. On coupled systems of time-fractional differential problems by using a new fractional derivative. 2016.
[38] Atangana A. On the new fractional derivative and application to nonlinear Fisher's reaction–diffusion equation. Applied Mathematics and Computation 2016;273:948–956.
[39] Aydogan SM, Baleanu D, Mousalou A, et al. On approximate solutions for two higher-order Caputo–Fabrizio fractional integro-differential equations. Adv Differ Equ 2017;221. doi:10.1186/s13662-017-1258-3.
[40] Atangana A, Gómez-Aguilar JF. Decolonisation of fractional calculus rules: breaking commutativity and associativity to capture more natural phenomena. Eur Phys J Plus 2018;133:166. doi:10.1140/epjp/i2018-12021-3.
[41] Atanackovic, Pilipovic, Zorica. Properties of the Caputo–Fabrizio fractional derivative and its distributional settings. Fractional Calculus and Applied Analysis 2018;21. doi:10.1515/fca-2018-0003.
[42] de Oliveira EC, Machado JAT. A review of definitions for fractional derivatives and integral. Math Probl Eng 2014;2014:238459.
[43] Krutikov V, Kazakovtsev L, Shkaberina G, Kazakovtsev V. New method of training two-layer sigmoid neural networks using regularization. IOP Conference Series: Materials Science and Engineering 2019;537:042055. doi:10.1088/1757-899X/537/4/042055.
[44] Rakkiyappan R, Sivaranjani R, Velmurugan G, Cao J. Analysis of global O(t−α)