TanhSoft -- a family of activation functions combining Tanh and Softplus
Koushik Biswas, Sandeep Kumar, Shilpak Banerjee, Ashish Kumar Pandey
Koushik Biswas
Department of Computer Science, IIIT Delhi, New Delhi, India, 110020. [email protected]

Sandeep Kumar
Department of Computer Science, IIIT Delhi & Department of Mathematics, Shaheed Bhagat Singh College, University of Delhi, New Delhi, India. [email protected], [email protected]

Shilpak Banerjee
Department of Mathematics, IIIT Delhi, New Delhi, India, 110020. [email protected]

Ashish Kumar Pandey
Department of Mathematics, IIIT Delhi, New Delhi, India, 110020. [email protected]
September 9, 2020

Abstract
Deep learning, at its core, contains functions that are compositions of a linear transformation with a non-linear function known as an activation function. In the past few years, there has been an increasing interest in the construction of novel activation functions that result in better learning. In this work, we propose a family of novel activation functions, namely TanhSoft, with four undetermined hyper-parameters, of the form tanh(αx + βe^{γx}) ln(δ + e^x), and tune these hyper-parameters to obtain activation functions that are shown to outperform several well-known activation functions. For instance, replacing ReLU with x tanh(0.6e^x) improves top-1 classification accuracy on CIFAR-10 by 0.46% for DenseNet-169 and 0.7% for Inception-v3, while with tanh(0.87x) ln(1 + e^x) top-1 classification accuracy on CIFAR-100 improves by 1.24% for DenseNet-169 and 2.57% for the SimpleNet model.

Keywords: Activation Function · Neural Networks · Deep Learning
1 Introduction

Artificial neural networks (ANNs) have occupied the center stage in the realm of deep learning in the recent past. ANNs are made up of several hidden layers, and each hidden layer consists of several neurons. At each neuron, an affine linear map is composed with a non-linear function known as an activation function. During the training of an ANN, the linear map is optimized, whereas the activation function is usually fixed at the outset along with the architecture of the ANN. There has been an increasing interest in developing a methodical understanding of activation functions, in particular with regards to the construction of novel activation functions and the identification of mathematical properties that lead to better learning [1].

An activation function is considered good if it speeds up learning and leads to better convergence, which in turn leads to more accurate results. At the early stage of deep learning research, researchers used shallow networks (fewer hidden layers), and tanh or sigmoid were used as activation functions. As research progressed and deeper networks (multiple hidden layers) came into fashion for challenging tasks, the Rectified Linear Unit (ReLU) [2, 3, 4] emerged as the most popular activation function. Despite its simplicity, deep neural networks with ReLU have learned many complex and highly nonlinear functions with high accuracy.

To overcome the shortcomings of ReLU (non-zero mean, negative missing, and unbounded output, to name a few; see [5]) and to increase accuracy considerably in a variety of tasks in comparison to networks with ReLU, many new activation functions have been proposed over the years. Many of these are variants of ReLU, for example, Leaky ReLU [6], Exponential Linear Unit (ELU) [7], Parametric Rectified Linear Unit (PReLU) [8], Randomized Leaky Rectified Linear Units (RReLU) [9], and Inverse Square Root Linear Units (ISRLUs) [10]. In the recent past, some activation functions constructed from tanh or sigmoid have achieved state-of-the-art results on a variety of challenging datasets. Most notably among such activation functions, Swish [11] has emerged as a close favorite to ReLU. Some of these novel activation functions have shown that introducing hyper-parameters in the argument of the functions may yield, for special values of these hyper-parameters, activation functions that outperform those for other values; for example, see [11], [5].

In this article, we propose a family of activation functions with four hyper-parameters of the form

    f(x; α, β, γ, δ) = tanh(αx + βe^{γx}) ln(δ + e^x). (1)

We show that for some specific values of the hyper-parameters, the resulting activation functions outperform several well-known and conventional activation functions, including ReLU and Swish. Moreover, using a hyper-parameterized combination of known activation functions, we attempt to make the search for novel activation functions organized. As indicated above and validated below, such an organized search can often find better-performing activation functions in the vicinity of known ones.
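As a quick concrete reference, the family in equation (1) can be written down directly; the following NumPy sketch is our own illustration (the function name and the sample parameter values are ours, not code from the paper):

    import numpy as np

    def tanhsoft(x, alpha, beta, gamma, delta):
        # TanhSoft family of equation (1): tanh(alpha*x + beta*e^(gamma*x)) * ln(delta + e^x)
        return np.tanh(alpha * x + beta * np.exp(gamma * x)) * np.log(delta + np.exp(x))

    x = np.linspace(-4.0, 4.0, 9)
    print(tanhsoft(x, alpha=0.87, beta=0.0, gamma=1.0, delta=1.0))  # tanh(0.87x) ln(1 + e^x)
    print(tanhsoft(x, alpha=0.0, beta=0.6, gamma=1.0, delta=0.0))   # x tanh(0.6 e^x)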
2 Related work

We give a brief idea of a few of the most widely used activation functions below. All of these functions, along with some members of the TanhSoft family, are plotted in Figure 1; a consolidated code sketch of these baselines is given after Figure 1.

• Sigmoid: The sigmoid activation function, also known as the logistic function, is used in output layers for binary classification problems, and it produces outputs that can be interpreted as probabilities. Sigmoid is a smooth, bounded, non-linear, and differentiable function with range (0, 1). It suffers from the vanishing gradient problem. Sigmoid is defined as

    σ(x) = 1/(1 + e^{-x}). (2)
• Hyperbolic tangent: The hyperbolic tangent function, tanh, is a smooth, non-linear, differentiable function with range (−1, 1), defined as

    tanh(x) = (e^x − e^{-x})/(e^x + e^{-x}). (3)

It is used in recurrent neural networks and in natural language processing [12] and speech processing tasks, but it also suffers from the vanishing gradient problem.
• Rectified Linear Unit (ReLU): The rectified linear unit (ReLU) activation function was first introduced by Nair and Hinton in 2010 [2]. At present, it is one of the most widely used activation functions. ReLU is piecewise linear: it is the identity on the positive axis and zero on the negative axis. One of its best properties is that networks with ReLU learn very fast. However, the gradient of ReLU vanishes on the negative axis, and in some situations it has been observed that up to 50% of neurons are dead because of the zero value there. ReLU [4, 3, 2] is defined as

    f(x) = max(0, x). (4)
• Leaky Rectified Linear Unit: The Leaky Rectified Linear Unit (Leaky ReLU) was proposed by Maas et al. in 2013 [6]. Leaky ReLU introduces a non-zero gradient on the negative axis to overcome the vanishing gradient and dead neuron problems of ReLU. It is defined as

    f(x) = x if x > 0, and 0.01x if x ≤ 0. (5)
• Parametric ReLU: Parametric ReLU (PReLU) [8] is similar to Leaky ReLU, except that where Leaky ReLU has a predetermined negative slope, PReLU has a learnable one. PReLU is defined as

    f(x) = x if x > 0, and ax if x ≤ 0, (6)

where a ≤ 1, so that f(x) is equivalent to max(x, ax).

• Swish:
To construct Swish, researchers from the Google Brain team used a reinforcement-learning-based automatic search technique. It was proposed by Ramachandran et al. in 2017 [11]. Swish is a smooth, non-monotonic function that is bounded below and unbounded above. It is defined as

    f(x) = xσ(x) = x/(1 + e^{-x}). (7)
• E-swish: E-swish [13] is a slightly modified version of the Swish function, introduced by Alcaide in 2018 and defined by

    f(x) = βxσ(x), (8)

where β is a tunable parameter. This function retains the properties of Swish and, as claimed in the paper, can provide better results than Swish for some values of β.
• ELiSH: The Exponential Linear Sigmoid SquasHing (ELiSH) function was proposed by Basirat and Roth in 2018 [14]. It is unbounded above, bounded below, non-monotonic, and smooth, defined by

    f(x) = x/(1 + e^{-x}) if x ≥ 0, and (e^x − 1)/(1 + e^{-x}) if x < 0. (9)
• Softsign: The Softsign function was proposed by Turian et al. in 2009 [15]. Softsign is a rational (quotient-of-polynomials) function. It has been used in regression problems [16] as well as in speech systems [17] and has achieved some promising results. Softsign is defined as

    f(x) = x/(1 + |x|), (10)

where |x| is the absolute value of x.
• Exponential Linear Unit: The Exponential Linear Unit (ELU) was proposed by Clevert et al. in 2015 [7]. ELU is defined in such a way that it overcomes the vanishing gradient problem of ReLU. ELU is a fast learner and generalizes better than ReLU and Leaky ReLU. It is defined as

    f(x) = x if x > 0, and α(e^x − 1) if x ≤ 0, (11)

where α is a hyper-parameter.
• Softplus: Softplus was proposed by Dugas et al. in 2001 [18, 19]. Softplus is a smooth activation function with a non-zero gradient everywhere. It is defined as

    Softplus(x) = ln(1 + e^x). (12)

Figure 1: Plot of various activation functions.
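For side-by-side experimentation, the baselines above fit in a few lines. The following is a minimal NumPy sketch of equations (2)-(12), written by us for illustration (the experiments in this paper use the Keras built-ins instead); tanh in equation (3) is available directly as np.tanh:

    import numpy as np

    def sigmoid(x):        # eq. (2)
        return 1.0 / (1.0 + np.exp(-x))

    def relu(x):           # eq. (4)
        return np.maximum(0.0, x)

    def leaky_relu(x):     # eq. (5), fixed negative slope 0.01
        return np.where(x > 0, x, 0.01 * x)

    def prelu(x, a):       # eq. (6), learnable negative slope a
        return np.where(x > 0, x, a * x)

    def swish(x):          # eq. (7)
        return x * sigmoid(x)

    def e_swish(x, beta):  # eq. (8)
        return beta * x * sigmoid(x)

    def elish(x):          # eq. (9): x*sigmoid(x) for x >= 0, (e^x - 1)*sigmoid(x) otherwise
        return np.where(x >= 0, x, np.exp(x) - 1.0) * sigmoid(x)

    def softsign(x):       # eq. (10)
        return x / (1.0 + np.abs(x))

    def elu(x, alpha=1.0): # eq. (11)
        return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

    def softplus(x):       # eq. (12), log1p for numerical stability
        return np.log1p(np.exp(x))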
3 TanhSoft activation function family

The standard ANN training process involves tuning the weights in the linear part of the network; however, there is merit in the ability to custom design activation functions to better fit the problem at hand. Here, rather than looking at individual activation functions, we propose a family of functions indexed by four hyper-parameters. We refer to this family as the TanhSoft family, as it is created by combining hyper-parametric versions of the tanh and Softplus activation functions. Explicitly, we express it as

    f(x; α, β, γ, δ) = tanh(αx + βe^{γx}) ln(δ + e^x). (13)

Any function in this family can be used as an activation function for hyper-parameter values −∞ < α < ∞, 0 ≤ β < ∞, 0 < γ < ∞, and 0 ≤ δ ≤ 1, though for experimental and practical purposes we restrict the hyper-parameters to small ranges, with δ = 0 or δ = 1. The interplay between the hyper-parameters α, β, and γ plays a major role in the TanhSoft family and controls the slope of the curve on both the positive and negative axes. The hyper-parameter δ acts as a switch that changes the Softplus component of TanhSoft into a linear function (δ = 0), allowing us to cover a larger class of functions.

Note that f(x; 0, 0, γ, δ) recovers the zero function, and f(x; 0, β, 0, 0) becomes the linear function family tanh(β)x. For large values of one hyper-parameter with the others fixed, the TanhSoft family converges pointwise to some known activation functions. For example,

    lim_{γ→∞} f(x; 0, β, γ, 0) = ReLU(x) for all x ∈ ℝ

for any fixed β > 0. With similar hyper-parametric settings, except for very small negative values of α, f(x; α, β, γ, 0) behaves similarly to the Parametric ReLU activation function. Also,

    lim_{β→∞} f(x; 0, β, γ, 1) = Softplus(x) for all x ∈ ℝ

for any fixed γ > 0.

We remark that the Mish [20] activation function is also obtained from tanh and Softplus, but as a composition, whereas we use a hyper-parametric product. It is worth noting that the author of [20] reported unstable training behaviour for the specific function f(x; 1, 0, γ, 1); however, we failed to find any instability during the training process. Also, in [21] the authors mention the function f(x; 0, 1, 1, 0), which arises as an example from the TanhSoft family. In fact, we show that, thanks to the introduction of hyper-parameters, better activation functions of the form f(x; 0, β, γ, 0) can be obtained.

Being a product of two smooth functions, TanhSoft is a family of smooth activation functions. As expected, the monotonicity and boundedness of the functions in the family depend on the specific values of the hyper-parameters. The derivative of the TanhSoft family is given by

    f′(x; α, β, γ, δ) = tanh(αx + βe^{γx}) · e^x/(δ + e^x) + (α + βγe^{γx}) sech²(αx + βe^{γx}) ln(δ + e^x). (14)

A detailed study of the mathematical properties of the TanhSoft family will be presented in later work. Here, we focus on providing several examples of activation functions from the family that perform well on many challenging datasets.

We performed an organized search within the TanhSoft family by varying the values of the hyper-parameters and training and testing the resulting functions with DenseNet-121 [22] and SimpleNet [23] on the CIFAR-10 [24] dataset. Several functions were tested, and we select eight of them as examples to report their performance. The Top-1 and Top-3 accuracies of these eight functions are given in Table 1. All of these functions either outperformed ReLU or came close to its accuracy. Most notably, f(x; 0, 0.6, 1, 0) = x tanh(0.6e^x) and f(x; 0.87, 0, γ, 1) = tanh(0.87x) ln(1 + e^x) consistently outperform ReLU, even with more complex models; detailed results with more complex models and datasets are given in Section 5.
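The pointwise limits and the derivative formula above are easy to sanity-check numerically. The sketch below is ours (illustrative parameter values, TensorFlow assumed): it compares f(x; 0, 1, γ, 0) = x tanh(e^{γx}) with ReLU for growing γ, and checks the α = 0, δ = 0 special case of equation (14) against automatic differentiation:

    import numpy as np
    import tensorflow as tf

    x = tf.constant(np.linspace(-3.0, 3.0, 13), dtype=tf.float64)

    # Pointwise limit: f(x; 0, 1, gamma, 0) approaches ReLU(x) as gamma grows.
    for gamma in (1.0, 5.0, 25.0):
        f = x * tf.tanh(tf.exp(gamma * x))
        print(gamma, float(tf.reduce_max(tf.abs(f - tf.nn.relu(x)))))  # gap shrinks toward 0

    # Derivative formula (14) with alpha = 0, delta = 0 versus autodiff.
    beta, gamma = 0.6, 1.0
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = x * tf.tanh(beta * tf.exp(gamma * x))
    auto = tape.gradient(y, x)
    analytic = (tf.tanh(beta * tf.exp(gamma * x))
                + beta * gamma * x * tf.exp(gamma * x)
                / tf.math.cosh(beta * tf.exp(gamma * x)) ** 2)
    print(float(tf.reduce_max(tf.abs(auto - analytic))))  # ~0 up to floating-point error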
Figure 2: A few novel activation functions from the searches of the TanhSoft family.

Activation function | Top-1 accuracy on DenseNet-121 | Top-3 accuracy on DenseNet-121 | Accuracy on SimpleNet model
ReLU(x) = max(0, x) | – | – | –
f(x; 0, 0.6, 1, 0) = x tanh(0.6e^x) | – | – | –
f(x; 0.87, 0, γ, 1) = tanh(0.87x) ln(1 + e^x) | – | – | –
f(x; 0, 1, 1, 0) = x tanh(e^x) | – | – | –
f(x; 0, 0.__, 1, 0) = x tanh(0.__e^x) | – | – | –
f(x; 0, 1, 1, 1) = tanh(e^x) ln(1 + e^x) | – | – | –
f(x; 0, 0.__, 1, 0) = x tanh(0.__e^x) | – | – | –
f(x; 1, 1, 1, 0) = x tanh(x + e^x) | – | – | –
f(x; 1, 1, 1, 1) = tanh(x + e^x) ln(1 + e^x) | – | – | –

Table 1: Top-1 and Top-3 accuracy of eight example functions from the TanhSoft family, and of ReLU, on CIFAR-10.

Based on these searches, we choose the two sub-families f(x; α, 0, γ, 1) and f(x; 0, β, γ, 0) and call them TanhSoft-1 and
TanhSoft-2. In what follows, we discuss the properties of these sub-families, experiments with more complex models, and comparisons with a few other widely used activation functions.
4 TanhSoft-1 and TanhSoft-2

The functions TanhSoft-1 and TanhSoft-2 are given as

    TanhSoft-1: f(x; α, 0, γ, 1) = tanh(αx) Softplus(x) = tanh(αx) ln(1 + e^x), (15)

    TanhSoft-2: f(x; 0, β, γ, 0) = x tanh(βe^{γx}). (16)

The corresponding derivatives (see equation (14)) are

    Derivative of TanhSoft-1: f′(x; α, 0, γ, 1) = tanh(αx) · e^x/(1 + e^x) + α sech²(αx) ln(1 + e^x), (17)

    Derivative of TanhSoft-2: f′(x; 0, β, γ, 0) = tanh(βe^{γx}) + βγx e^{γx} sech²(βe^{γx}). (18)

Figures 3 and 5 show the graphs of the TanhSoft-1 and TanhSoft-2 activation functions for different values of α, and of β and γ, respectively. If α = 0, then TanhSoft-1 becomes the zero function; similarly, for β = 0, TanhSoft-2 is the zero function. Like ReLU and Swish, TanhSoft-1 and TanhSoft-2 are unbounded above but bounded below. Like Swish, both are smooth, non-monotonic activation functions. Plots of the first derivatives of TanhSoft-1 and TanhSoft-2 are given in Figures 4 and 6 for different values of α, and of β and γ, respectively. A comparison of TanhSoft-1, TanhSoft-2, and Swish, and of their first derivatives, is given in Figures 7 and 8.

TanhSoft-1 and TanhSoft-2 can each be implemented with a single line of code in the Keras library [25] or TensorFlow v2.3.0 [26]:

    tf.keras.activations.tanh(alpha * x) * tf.keras.activations.softplus(x)

and

    x * tf.keras.activations.tanh(beta * tf.math.exp(gamma * x))

for specific values of α, β, and γ, respectively.

Figure 3: TanhSoft-1 activation for different values of α.

Figure 4: First-order derivative of the TanhSoft-1 activation for different values of α.

Figure 5: TanhSoft-2 activation for different values of β.

Figure 6: First-order derivative of the TanhSoft-2 activation for different values of β.
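For completeness, here is a hedged end-to-end sketch showing how such one-liners can be wrapped as custom Keras activations and dropped into a model; the wrapper names, layer sizes, and the default values α = 0.87, β = 0.6, γ = 1 are our own choices for illustration, not the authors' released code:

    import tensorflow as tf

    def tanhsoft1(x, alpha=0.87):
        # TanhSoft-1, equation (15): tanh(alpha*x) * softplus(x)
        return tf.keras.activations.tanh(alpha * x) * tf.keras.activations.softplus(x)

    def tanhsoft2(x, beta=0.6, gamma=1.0):
        # TanhSoft-2, equation (16): x * tanh(beta * e^(gamma*x))
        return x * tf.keras.activations.tanh(beta * tf.math.exp(gamma * x))

    # Any Keras layer accepts a callable as its activation.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation=tanhsoft1),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()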
5 Experiments with TanhSoft-1 and TanhSoft-2

We tested TanhSoft-1 and TanhSoft-2 for different values of the hyper-parameters against widely used activation functions on the CIFAR and MNIST datasets. In particular, TanhSoft-1 with α = 0.87, i.e., f(x; 0.87, 0, γ, 1), and TanhSoft-2 with β = 0.6 and γ = 1, i.e., f(x; 0, 0.6, 1, 0), produced the best results. We observe that f(x; 0.87, 0, γ, 1) and f(x; 0, 0.6, 1, 0) in most cases beat or match the baseline activation functions, and underperform only marginally on rare occasions. Table 2 gives a comparison with the baseline activations ReLU, Leaky ReLU, ELU, Softplus, and Swish. We tested our activation functions in several models, such as DenseNet-121 [22], DenseNet-169 [22], Inception V3 [27], SimpleNet [23], MobileNet [28], and WideResNet 28-10 [29]. The next sections provide details of our experimental framework and results.

Baseline | ReLU | Leaky ReLU | ELU | Swish | Softplus
TanhSoft-1 > baseline | 10 | 7 | 11 | 10 | 10
TanhSoft-1 = baseline | 1 | 2 | 0 | 1 | 0
TanhSoft-1 < baseline | 0 | 2 | 0 | 0 | 1
TanhSoft-2 > baseline | 11 | 9 | 11 | 11 | 11
TanhSoft-2 = baseline | 0 | 1 | 0 | 0 | 0
TanhSoft-2 < baseline | 0 | 1 | 0 | 0 | 0

Table 2: Baseline comparison of TanhSoft-1 and TanhSoft-2 for Top-1 accuracy.

5.1 MNIST
The MNIST [30] database contains images of handwritten digits from 0 to 9: 60k training and 10k testing 28 × 28 grey-scale images. We run a custom 5-layer CNN architecture on the MNIST dataset with different activation functions; the results are reported in Table 3, and accuracy and loss for TanhSoft-1 for different values of α are shown in Figures 9 and 10.

Activation function | 5-fold accuracy on MNIST
TanhSoft-1 (α = 0.87) | 99.0
TanhSoft-2 (β = 0.6, γ = 1) | –
ReLU | 99.0
Leaky ReLU (α = 0.01) | 99.0
ELU | –
Swish | –
Softplus | –

Table 3: 5-fold mean (± standard deviation) accuracy on MNIST.

Figure 9: Accuracy with the TanhSoft-1 activation function on the MNIST dataset for different values of α.

Figure 10: Loss with the TanhSoft-1 activation function on the MNIST dataset for different values of α.
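The text does not specify the exact 5-layer architecture. The following is a plausible minimal sketch of such an MNIST pipeline with a swappable activation; the layer sizes, epoch count, and overall structure are our assumptions, not the authors' released code:

    import tensorflow as tf

    def tanhsoft2(x, beta=0.6, gamma=1.0):
        # TanhSoft-2 with the best-performing values reported in this section.
        return x * tf.tanh(beta * tf.math.exp(gamma * x))

    def build_cnn(activation):
        # A small CNN with five weight-bearing layers; the exact architecture is our guess.
        return tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 3, activation=activation, input_shape=(28, 28, 1)),
            tf.keras.layers.Conv2D(64, 3, activation=activation),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Conv2D(64, 3, activation=activation),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(128, activation=activation),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])

    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train[..., None] / 255.0, x_test[..., None] / 255.0

    model = build_cnn(tanhsoft2)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))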
5.2 CIFAR

The CIFAR [24] dataset consists of 32 × 32 colour images: 60k images in total, divided into 50k training and 10k test images. There are two variants, CIFAR-10 and CIFAR-100: CIFAR-10 has 10 classes with 6000 images per class, while CIFAR-100 has 100 classes with 600 images per class. We report the results of TanhSoft-1 with α = 0.87 and TanhSoft-2 with β = 0.6, γ = 1, along with ReLU, Leaky ReLU, ELU, Softplus, and Swish, on CIFAR-10 with DenseNet-121, DenseNet-169, Inception V3, and SimpleNet, and on CIFAR-100 with DenseNet-121, DenseNet-169, Inception V3, MobileNet [28], WideResNet [29], and SimpleNet. We trained with the Adam optimizer [31], for 100 epochs with DenseNet-121, DenseNet-169, Inception V3, MobileNet, and WideResNet, and for 200 epochs with SimpleNet. We used weight decay in the SimpleNet model, chosen according to [32]. Tables 5 and 6 contain results for CIFAR-10, while Tables 7 and 8 contain results for CIFAR-100. Table 4 contains accuracy and loss on CIFAR-10 with the SimpleNet model and the TanhSoft-2 activation function for different values of β with γ = 1.

Table 4: Accuracy and loss on CIFAR-10 with SimpleNet and TanhSoft-2 for different values of β (γ = 1).

Activation function | DenseNet-121 Top-1 | DenseNet-121 Top-3 | DenseNet-169 Top-1 | DenseNet-169 Top-3
TanhSoft-1 (α = 0.87) | 90.98 | 98.80 | 91.05 | 98.75
TanhSoft-2 (β = 0.6, γ = 1) | – | – | – | –
ReLU | – | – | – | –
Leaky ReLU (α = 0.01) | 90.77 | 98.80 | 90.61 | 98.78
ELU | 90.49 | 98.61 | 90.40 | 98.65
Swish | 90.77 | 98.80 | 90.38 | 98.68
Softplus | 90.45 | 98.65 | 90.51 | 98.69

Table 5: Experimental results on the CIFAR-10 dataset with DenseNet-121 and DenseNet-169.

Activation function | SimpleNet Top-1 | Inception V3 Top-1 | Inception V3 Top-3
TanhSoft-1 (α = 0.87) | 92.07 | 91.93 | 98.84
TanhSoft-2 (β = 0.6, γ = 1) | – | – | –
ReLU | – | – | –
Leaky ReLU (α = 0.01) | 91.05 | 91.84 | 98.93
ELU | 91.19 | 91.01 | 98.79
Swish | 91.59 | 91.26 | 98.75
Softplus | 91.42 | 91.79 | 98.84

Table 6: Experimental results on the CIFAR-10 dataset with SimpleNet and Inception V3.

Figure 11: Train and test accuracy on the CIFAR-100 dataset with the WideResNet 28-10 model.

Figure 12: Train and test loss on the CIFAR-100 dataset with the WideResNet 28-10 model.

Activation function | MobileNet Top-1 | MobileNet Top-3 | Inception V3 Top-1 | Inception V3 Top-3 | WideResNet 28-10 Top-1
TanhSoft-1 (α = 0.87) | 57.56 | 76.57 | 69.19 | 84.63 | 69.40
TanhSoft-2 (β = 0.6, γ = 1) | 57.56 | 76.57 | – | – | –
ReLU | 56.87 | 76.33 | 69.09 | 85.41 | 66.54
Leaky ReLU (α = 0.01) | – | – | – | – | –

Table 7: Experimental results on the CIFAR-100 dataset with MobileNet, Inception V3, and WideResNet 28-10.

Activation function | Top-1 | Top-3
TanhSoft-1 (α = 0.87) | 66.99 | 83.76
TanhSoft-2 (β = 0.6, γ = 1) | – | –
Leaky ReLU (α = 0.01) | – | –

Table 8: Further experimental results on the CIFAR-100 dataset.

6 Conclusion

In this study, we have explored a novel hyper-parametric family of activation functions, TanhSoft, defined mathematically as tanh(αx + βe^{γx}) ln(δ + e^x), where α, β, γ, and δ are tunable hyper-parameters. We have shown that, across different complex models, TanhSoft outperforms ReLU, Leaky ReLU, Swish, ELU, and Softplus on the MNIST, CIFAR-10, and CIFAR-100 datasets, so that TanhSoft can be a good choice to replace ReLU, Swish, and other widely used activation functions. Future work includes applying the proposed activation functions to more challenging datasets such as ImageNet and COCO, and trying other models to achieve state-of-the-art results.

References

[1] Chigozie Nwankpa, Winifred Ijomah, Anthony Gachagan, and Stephen Marshall. Activation functions: Comparison of trends in practice and research for deep learning, 2018.
[2] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Johannes Fürnkranz and Thorsten Joachims, editors,
Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 807–814. Omnipress, 2010.
[3] Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In
IEEE 12th International Conference on Computer Vision, ICCV 2009, Kyoto, Japan, September 27 - October 4, 2009, pages 2146–2153. IEEE Computer Society, 2009.
[4] Richard Hahnloser, Rahul Sarpeshkar, Misha Mahowald, Rodney Douglas, and H. Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit.
Nature, 405:947–51, 07 2000.
[5] Yuan Zhou, Dandan Li, Shuwei Huo, and Sun-Yuan Kung. Soft-root-sign activation function, 2020.
[6] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
[7] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs), 2015.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, 2015.
[9] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network, 2015.
[10] Brad Carlile, Guy Delamarter, Paul Kinney, Akiko Marti, and Brian Whitney. Improving deep learning by inverse square root linear units (ISRLUs), 2017.
[11] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions, 2017.
[12] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks, 2016.
[13] Eric Alcaide. E-swish: Adjusting activations to different network depths, 2018.
[14] Mina Basirat and Peter M. Roth. The quest for the golden activation function, 2018.
[15] Joseph Turian, James Bergstra, and Yoshua Bengio. Quadratic features and deep architectures for chunking. In
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 245–248, Boulder, Colorado, June 2009. Association for Computational Linguistics.
[16] Phong Le and Willem Zuidema. Compositional distributional semantics with long short term memory, 2015.
[17] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. Deep voice 3: Scaling text-to-speech with convolutional sequence learning, 2017.
[18] Hao Zheng, Zhanlei Yang, Wenju Liu, Jizhong Liang, and Yanpeng Li. Improving deep neural networks using softplus units. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–4, 2015.
[19] Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, and René Garcia. Incorporating second-order functional knowledge for better option pricing. In
Proceedings of the 13th International Conference on Neural Information Processing Systems, NIPS'00, pages 451–457, Cambridge, MA, USA, 2000. MIT Press.
[20] Diganta Misra. Mish: A self regularized non-monotonic activation function, 2019.
[21] Xinyu Liu and Xiaoguang Di. TanhExp: A smooth activation function with high convergence speed for lightweight neural networks, 2020.
[22] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks, 2016.
[23] Seyyed Hossein Hasanpour, Mohammad Rouhani, Mohsen Fayyaz, and Mohammad Sabokrou. Lets keep it simple, using simple architectures to outperform deeper and more complex architectures, 2016.
[24] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[25] François Chollet. Keras. https://github.com/fchollet/keras, 2015.
[26] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[27] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision, 2015.
[28] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications, 2017.
[29] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks, 2016.
[30] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database.
ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
[31] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[32] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249–256, 2010.