A New Artificial Neuron Proposal with Trainable Simultaneous Local and Global Activation Function
AA New Artificial Neuron Proposal with Trainable Simultaneous Local andGlobal Activation Function
Tiago A. E. Ferreira a,b, ∗ , Marios Mattheakis b , Pavlos Protopapas b a Universidade Federal Rural de Pernambuco, Departamento de Estatística e Informática, Dois Irmãos, 52171-900 Recife - PE, Brasil b John A. Paulson School of Engineering and Applied Sciences, Harvard UniversityCambridge, Massachusetts 02138, United States
Abstract
The activation function plays a fundamental role in the artificial neural network learning process. However, thereis no obvious choice or procedure to determine the best activation function, which depends on the problem. Thisstudy proposes a new artificial neuron, named global-local neuron, with a trainable activation function composed oftwo components, a global and a local. The global component term used here is relative to a mathematical functionto describe a general feature present in all problem domain. The local component is a function that can represent alocalized behavior, like a transient or a perturbation. This new neuron can define the importance of each activationfunction component in the learning phase. Depending on the problem, it results in a purely global, or purely local,or a mixed global and local activation function after the training phase. Here, the trigonometric sine function wasemployed for the global component and the hyperbolic tangent for the local component. The proposed neuron wastested for problems where the target was a purely global function, or purely local function, or a composition oftwo global and local functions. Two classes of test problems were investigated, regression problems and di ff erentialequations solving. The experimental tests demonstrated the Global-Local Neuron network’s superior performance,compared with simple neural networks with sine or hyperbolic tangent activation function, and with a hybrid networkthat combines these two simple neural networks. Keywords:
Artificial Neuron, Trainable Activation Function, Local and Global Features, Regression Problem, Di ff erentialEquation Solving
1. Introduction
An Artificial Neural Network (ANN) is a universal approximator for mathematical functions [1]. In this way, theANNs are employed in a vast set of problems with many numerical applications, like in regressions problems [2, 3]and di ff erential equations solving [4, 5, 6, 7, 8, 9]. The choice of the activation function plays a vital role in the ANNconvergence process and precision performance. The activation function depends on the study’s problem, and thereis no obvious choice or procedure to determine the best activation function before the training process. Several worksin the literature board the ideal activation function determination problem proposing a trainable or adaptive activationfunction. Chien-Cheng Yu et al. [10] proposed an adaptive activation function and an e ff ective learning method basedon the backpropagation algorithm to adjust the activation function parameters and the ANN weights. Dushko ff andPtucha [11] proposed a multiple activation functions for a convolutional ANN, where each neuron has a specificactivation function. Li et al. [12] presented a tunable activation function employed for Extreme Learning Machine,while Shen et al. [13] applied a similar idea, but extended for a tunable activation function with multiple outputs.Jagtap et al. [5] employed an adaptive activation function for problems of linear and nonlinear partial di ff erentialequations in Physics-Informed Neural Networks. ∗ Corresponding author
Email address: [email protected] (Tiago A. E. Ferreira)
Preprint submitted to Journal Name January 18, 2021 a r X i v : . [ c s . N E ] J a n n this pursuit by the ideal activation function, a very interesting work was developed by Qian et al. [14]. Thisstudy proposed strategies for combining basic activation functions, like ReLU and its variants, in a convolutionalneural network. One of the their proposals was to create a mixed activation function of the form f mix ( x ) = p · f ( x ) + (1 − p ) · f ( x ), where p ∈ [0 ,
1] is a combination coe ffi cient, and f and f are two basic activation functions tobe combined. More specifically, in the reference [14], ReLU, leaky ReLU (LReLU), parametric ReLU (PReLU),exponential linear unit (ELU), and parametric ELU (PELU) were employed as the two activation functions.There are many situations where a mathematical function governs a given problem with two features, global andlocal behavior. These double-feature problems are relatively common in modeling mathematical, physical, and en-gineering problems, such as regression problems and di ff erential equation solving. The global component is somecharacteristic present in its entire problem domain, like an undulatory behavior of the wave signal. The local compo-nent describes behavior located in a particular part of the problem domain, like a transient or a localized perturbation.For this problem class, ANN modeling can improve precision and convergence time if the activation function has bothglobal and local characteristics.Inspired in the work of Qian et al. [14], we propose a new artificial neuron named global-local neuron (GLN),with an activation function composed of two mathematical functions, one with a global behavior and the other witha local behavior. The GLN has the capability to adjust the relative importance of each of these components in itstraining process. A typical gradient descendant algorithm, like the Adam algorithm [15], can be applied. Therefore,the GLN has the capability to adjust the importance of each global and local components. The GLN can choose if theactivation function is purely global, purely local, or a combination of these two components.To demonstrate the GLN versatility, problems with local, global, and a combination of these two componentswere employed in three regression tasks and seven di ff erential equation solving. Three other ANN models were alsoused in the same problems to establish relative baseline performance. A multilayer perceptron (MLP) with the sineactivation function (MLP-Sin), an MLP with hyperbolic tangent activation function (MLP-Tanh), and hybrid MLP,with an MLP-Sin and an MLP-Tanh branches combination, called two branches network (TBN).The article is drawn as follows. Section 2 presents the proposed GLN and its global and local activation functions.Section 3 introduces the basic ideas of the problem classes board here, the regression problem and the di ff erentialequation solving by ANNs. Section 4 exhibits the data and the definitions of the di ff erential equations. Section5 exposes all experimental results, where a total of 2400 computational simulations were done. Finally, Section 6presents the proposed ANN architectures’ final conclusions.
2. Proposed Neuron
This work proposes a new artificial neuron named global-local neuron (GLN), with an activation function com-posed of two mathematical functions, a local and a global function. The main idea is to give the neuron the ability tocombine a function with a global feature, like a trigonometric sine, and a function with a local transient behavior, likea hyperbolic tangent or a sigmoid logistic function.In general, an artificial neuron can be described as a mathematical function with a weighted sum of arguments, asshown in Figure 1. The neuron receives the inputs ( x , x , x , . . . , x n ), multiples by the respective weights ( w , w , w , . . . , w n ),sums them, and adds a bias b . This sum is the argument of the activation function F ( · ) and the result of the activationfunction processing is the output of the neuron.The proposed GLN has a composed activation function F ( · ) with local and global components combined with aweight 0 ≤ α ≤ F ( x ) = ( α · global( x ) + (1 − α ) · local( x )) − Bias , (1)where global( · ) is the global function component, and local( · ) is the local function component. The weight α and theBias are adjusted in the ANN training process. The scheme of GLN is shown in Figure 2. Worth note that α and Biasare part of the activation function. In this way, if the same activation function is used in a layer, the adjustment ofthese parameters is unique for all neuron of the layer. In the development process of GLN, the α combination weightwas implemented as α = sig( z ) , z ∈ R (2)2 igure 1: Scheme of a simple artificial neuron.Figure 2: Scheme of the Global-Local Neuron proposed. where sig( · ) is the sigmoid logistic function given by,sig( z ) = + exp( − z ) (3)The implementation employed the sine as the global component and hyperbolic tangent as the local component.All the network’s hidden layers have the same activation function. Therefore, with the proposed GLN, an ANN ofthe MLP type was built, where all hidden neurons are GLN. This multilayer perceptron of GLN (MLP-GLN) has thesame activation in each hidden layer (implying an α weight per hidden layer), while in the last layer (the output layer)all the GLN outputs are weighted summed in a neuron with a linear activation function.
3. Test Problems
In this work, two types of problems were employed to test the proposed GLN, the regression problem, and di ff er-ential equations solving. For both classes of problems, the idea is to test the MLP-GLN capability to solve them bycombining the global (sine function) and local (hyperbolic tangent) components in its activation function. Typically, when a neural network is applied to solve a regression problem, it is sought to find some functionaldescription of given data. The regression of a dependent variable Y , given an independent variable X , consists of3nding the model to determine the most probable value of Y for each value of X , based on a finite data set ( X , Y ).In this way, the ANN maps the independent variable X to the dependent variable Y . In other words, the ANNlearns functional structure G ( · ), where for a given value of X , it is possible to compute Y = G ( X ). ff erential Equations Solving with Neural Networks Di ff erential equations can model many scientific and engineering problems. However, for many physical systemsof practical interest, these di ff erential equations are analytically intractable. Consequently, there is a great interest indeveloping computational techniques to solve di ff erential equations numerically.ANN had been applied to solve di ff erential equations [4, 5, 16, 17]. The basic principle for solving di ff erentialequations with ANNs is to look at the problem as an optimization problem. Defining a general di ff erential equationin the form, D ( u ) − F = D ( · ) is a di ff erential operator, u is a possible solution of D ( · ), and F is a known forcing function.Let (cid:98) u the output of the ANN. If (cid:98) u is a solution for the di ff erential equation D , then the residual is given by R ( (cid:98) u ) = D ( (cid:98) u ) − F (5)The main idea is to use the R ( (cid:98) u ) as the loss function in the ANN training process, where the solving di ff erentialequation problem is reduced to a minimization problem.To guarantee that the initial conditions are satisfied, (cid:98) u can be substituted by a modified solution (cid:101) u . For example,if a certain di ff erential equation in space x and time t has a initial condition in t = t given by function u t ( x ), thesolution can be written as [18], (cid:101) u ( x , t ) = u t ( x ) + (cid:16) − e − ( t − t ) (cid:17)(cid:98) u ( x , t ) (6)implying that when t = t , the trial solution is exactly u t ( x ).This is implemented in the Python library NeuroDiffEq [18] with many other initial and boundary conditions inthe form, (cid:101) u ( x , t ) = A ( x , t ; x boundary , t ) (cid:98) u ( x , t ) , (7)where A ( x , t ; x boundary , t ) is selected so that (cid:101) u ( x , t ) has the correct initial and boundary conditions.
4. Data Sets for Regression and Di ff erential Equations In this study, the proposed GLN were tested in two types of problems. The first problem is the regression problem,and the second problem is the di ff erential equation solving. The data sets employed to test the MLP-GLN in the regression problem are presented in the next subsections. Twoof them are simulated data set, and one is a real-world data set. The artificial data sets encompass features with globaland local components, with an addictive and multiplicative composition of the global and local characteristics. Thereal-world data, the sunspot yearly observation, also has global (a quasi-periodic behavior) and local (fluctuations)components.
The first artificial data set developed to test the proposed GLN network was generated by adding two Gaussianfunctions (the local components) and a trigonometric sine function (the global component). Given these functionsaddition, this data set was called EES (
Exponential, Exponential, Sine ) and generated by the equation,
EES ( x ) = E exp (cid:32) − ( x − a ) σ (cid:33) + E exp (cid:32) − x σ (cid:33) + sin( ω x ) (8)4 able 1: The parameters values employed to build the EES data set. Parameters Numeric Values E E a σ ω x Figure 3: EES data set from Equation 8 with the parameters values from Table 1. where E and E are constants representing each Gaussian function’s amplitude, a is the mean, σ is the constantstandard deviation for both exponential, and ω is the angular frequency of the sine function. Here, to build the EESdata set, the values of these parameters are presented in Table 1. Figure 3 shows the plot of Equation 8, where weobserve global behavior with a local fluctuation. The domain used here was x ∈ [ − , The second data set developed here was generated by multiplying a sine function with an exponential function.This data set was named SE and is given by the equation,
S E ( x ) = E sin( ω x ) exp (cid:32) − x σ (cid:33) (9)where one more time E is constant, σ is a constant standard deviation, and ω is sine function’s angular frequency. InTable 2 are shown the values of these parameters used to build the SE data set. In Figure 4 is shown the plot of theEquation 9, where now it is possible to note a local behavior dominance. The domain used was x ∈ [ − , Table 2: The parameters values employed to build the SE data set.
Parameters Numeric Values E σ ω Figure 4: SE Data set from Equation 9 with the parameters’ values from Table 2.
Sunspots are spots darker than the surrounding areas on the Sun’s photosphere. These spots are temporary phe-nomena that may last anywhere from a few days to a few months. Besides the solar activity study, the sunspot oc-currence can influence the space weather, impacting the earth’s ionosphere state, generating interference on satellitecommunications, and a ff ecting the conditions of short-wave radio propagation [19, 20, 21].The third data set was the yearly mean total sunspot number from 1700 until 2019. This data set consists of 320points. The sunspot data used here were downloaded from the website SILSO (Sunspot Index and Long-term SolarObservations) . The Figure 5 shows the plot of the Sunspot Data set. This natural phenomenon presents a combinationof global and local behavior. ff erential Equations The di ff erential equations used to test the proposed GLN are presented in the next sections. The equations selectedhere have possible solutions with only global features, only local features, and both global and local features. The first di ff erential equation investigated is, dudt = − u . (10)For the initial condition u ( t = = .
0, the solution is given by u ( t ) = e − t . A catenary function [22] is a mathematical curve that an idealized hanging rope assumes under its weight whensupported only at its ends, like an electrical cable between two posts in an overhead electrical power. The di ff erentialequation whose solution is a catenary function can be written as, d udx − (cid:115) + (cid:32) dudx (cid:33) = . (11)The solution is u ( x ) = cosh( x ) (the catenary). The boundary conditions used here were u ( x = − = u ( x = = . http: // / silso / datafiles . ears Figure 5: Plot of the Sunspot data set.
A simple harmonic oscillator [23] in one dimension can be described by the di ff erential equation, d udt + u = u ( t ) = sin( t ), for the initial conditions of u ( t = = . dudt | t = = . A damped harmonic oscillator [23] in one dimension can be modeled by the di ff erential equation, d udt + dudt + u = u ( t ) = exp (cid:16) − t (cid:17) sin( t ), for the initial conditions u ( t = = . du ( t ) dt | t = = . The Laplace equation is a second order partial di ff erential equation, where in two dimensions can be written as, ∂ u ∂ x + ∂ u ∂ y = u ( x , y ) is a function of two variables. For the boundary conditions u ( x = , y ) = sin( π y ), u ( x = , y ) = . u ( x , y = = .
0, and u ( x , y = =
0, an analytic solution for the Laplace equation in two dimensions is, u ( x , y ) = sin( π y ) sinh( π (1 − x ))sinh( π ) . (15) The heat equation describes how the heat di ff uses through a given region. In one spatial dimension, the heatequation is: ∂ u ∂ t − k ∂ u ∂ x = , (16)7 igure 6: The scheme of the Two Branches Network. where k is the di ff usivity of the medium. This equation models the flow of heat in a homogeneous and isotropicmedium, where u ( x , t ) represents the temperature at the point x in time t .For the initial conditions u ( x , t = = sin( π x ), ∂ u ( x , t ) ∂ x | x = = π e − k π t , and ∂ u ( x , t ) ∂ x | x = = − π e − k π t , the analytical solutionis given by, u ( x , t ) = sin (cid:18) π xL (cid:19) exp (cid:32) − k π tL (cid:33) , (17)where L is the size of the heat propagation medium. The Kuramoto-Sivashinsky Equation is a nonlinear fourth-order partial di ff erential equation applied to the studyof many continuous medium physical systems with instabilities and chaotic behavior [24, 25]. The Kuramoto-Sivashinsky Equation is given by, ∂ u ∂ t + u ∂ u ∂ x + β ∂ u ∂ x + γ ∂ u ∂ x = β and γ are constants. Here, it was used β = γ = .
0. To solve this equation, it was employed the initialcondition u ( x , t = = e − x , and the boundary conditions u ( x = − , t ) = u ( x = , t ) = .
5. Experiments and Results
All problems presented in Section 4 were employed to assess the performance of the MLP-GLN.Standard MLP networks were also applied to the same benchmark tests for the same conditions used in experi-ments with the MLP-GLN. Three di ff erent ANN models were used to create a comparative analysis. The first archi-tecture was an MLP with the same architecture of the MLP-GLN, but replacing the GLN’s double activation functionby the sine function (MLP-Sin). The second architecture replaced the GLN’s activation function by the hyperbolictangent function (MLP-Tanh). The third architecture was a hybrid architecture, combining an MLP-Sin branch and anMLP-Tanh branch. The ANN resultant from this combination was called Two Branch Network (TBN), and its schemecan be viewed in Figure 6. To maintain the same number of hidden neurons for all networks in the TBN network,the hidden layers of each branch have half the number MLP-GLN’s hidden neurons. For example, if the MLP-GLNarchitecture is 1 − − − To test the regression performance of the proposed GLN, the three data sets presented in Section 4 were used. Theproposed benchmark is to fit the data set, learning the respective functions in a regression task.Two di ff erent architectures were used for the full conected MLP-GLN, MLP-Sin, and MLP-Tanh. The first archi-tecture was 1 − −
1, i.e., one input, one hidden layer with 20 neurons, and one output. The second architecture was8 − − −
1, namely, one input, two hidden layers with 20 neurons each, and one output. For the TBN network,each branch had the respective architectures of 1 − − − − −
1, corresponding to the same numberof hidden neurons. These ANN architectures were defined after the performance of the preliminary experiments toguarantee a minimum su ffi cient capability to solve the test problems.Initially, a data repository with 2000 points was created for the EES and SE artificial data sets. For the sunspotdata, the data repository is the observations themselves, 320 points. For each data repository, three disjoints sets,training with 50% of the data, validation with 25%, and test sets with 25% were created. For the training phase, themini-batch scheme was employed with a batch size of 64.Each pattern in the data sets were formed by a pair ( x , G ( x )) (as definided in Section 3.1), where x is the ANNinput and G ( · ) is the target (or ANN wished output) for the input x , the respective data set function or data observation.The computational implementation was done with the pyTorch library (Python3 Programming Language). Thetraining algorithm employed was the Adam algorithm [15], with a learning rate of 10 − . The stopping criteria werethe maximum number of epochs of 200 , (cid:98) G ( x ) and the real value of the corresponding data set, G ( x ), given by theequation, MS E = N batch N batch (cid:88) i = (cid:16) (cid:98) G ( x i ) − G ( x i ) (cid:17) (19)where N batch is the size of the mini-batch. Figure 7 presents the test set MSE distribution for all ANN models studied for the EES data set. Figures 7a and7b refer to the architecture 1 − −
1, while Figures 7c and 7d correspond to the architectures 1 − − −
1. Figure8 presents the epoch distributions for the training process, where Figure 8a refers to the architecture 1 − − − − −
1. Table 3 shows the descriptive statistics for the test set MSE distributions andfor the epochs distribution in the training, minimum value observed (Min.), maximum value observed (Max.), mean,median, standard deviation (Std.), and the coe ffi cient of variation (CV). The CV is defined as the ratio between thestandard deviation and the mean (CV = S tdMean ). Here, the CV can be seen as stability criterion, where smaller the CV,greater the stability.Observing Figures 7a and 7b, and Table 3, for the experiments with architecture 1 − − ff erences in both locationand shape of the two samples’ empirical cumulative distribution functions, independent of the distributions. The nullhypothesis is that the MSE distributions of the two models are from the same continuous distribution . The alternativehypothesis is that the MSE distributions of the two models are not from the same continuous distribution . At the 5%significance level, the KS test does not reject the null hypothesis, reaching a p -value of 0 . − − − −
1, Figure 8a demonstrates an apparent similaritybetween the models MLP-GLN, MLP-Sin, and TBN. Applying the KS test at the 5% significance level, the modelsMLP-GLN and MLP-Sin have the same statistical behavior for the number of training epochs, where the KS testagain does not reject the null hypothesis, reaching a p -value of 0 . − −
1, the MLP-GLN together with the TBN model reached the bestregression performance. The MLP-GLN was more e ffi cient than the TBN model with respect to the expected numberof epochs to train the model. In this way, the MLP-GLN model was the better option among those tested.Analyzing the MLP-GLN α values, defined in Equation 1, it is possible to verify the importance given to eachactivation function component, where α = α = NN Models (a) 1 − − ANN Models (b) 1 − − ANN Models (c) 1 − − − ANN Models (d) 1 − − − Figure 7: The test set MSE box plot for all ANN models studied and both architectures for EES data set. In (a) and (c) are presented the MSEdistributions for the 30 repetitions for each ANN model for architectures 1 − − − − −
1. In (b) and (d) are shown a zoom scale tovisualize the better performance models for both architectures, where the dashed lines are the median values references.
The distribution of the α weight values can be viewed in Figure 9a. With a mean value of α of 0 .
566 and a medianof 0 . ff ect caused by the Gaussian function, where apparently the global behavior is dominant (seeFigure 3). However, the perturbation created by the Gaussian function in the sine is su ffi cient to demand a strongimportance of the local component in the activation function composition.For the experiments with two hidden layers (architecture 1 − − −
1) there will two e ff ect. One is thecombination of the global and local components in the hidden layer, and the other is the composition of the globaland local activation function of the two hidden layers. For this architecture, the test set MSE distributions are shownin Figures 7c and 7d, with the descriptive statistics in Table 3. In general, the MLP-GLN reached a better relative testset MSE performance when compared with other analyzed models, but with the biggest CV value. Higher the CVvalue, greater the relative dispersion, indicating that the MLP-GLN has the smallest relative stability. However, theworse MSE value reached by the MLP-GLN, the maximum MSE value, is less than all other models’ maximum MSEvalues.The MLP-GLN had the smallest MSE value but it required the largest number of training epochs, as it is shown byFigure 8b and Table 3. Comparing the models MLP-GLN and MLP-Sin (the second-best model for the test set MSE10 NN Models (a) 1 − − ANN Models (b) 1 − − − Figure 8: The epoch distributions in the training for all ANN models with the EES Data set. (a) is related to architecture 1 − −
1, and (b) isrelated to architecture 1 − − − Statistics ANN Models
MLP-GLN MLP-Sin MLP-Tanh TBN − − M S E Min. 1 . · − . · − . · − . · − Max. 2 . · − . · − . · − . · − Mean . · − . · − . · − . · − Median . · − . · − . · − . · − Std. . · − . · − . · − . · − CV 0 . . .
908 0 . E po c h s Min. 10697 . .
667 38452 .
000 27992 . . .
000 35336 .
500 24343 . .
650 6943 . . . . . .
431 0 . − − − M S E Min. . · − . · − . · − . · − Max. . · − . · − . · − . · − Mean . · − . · − . · − . · − Median . · − . · − . · − . · − Std. . · − . · − . · − . · − CV 1 .
890 0 . . . E po c h s Min.
Mean 47492 .
633 39489 .
867 38569 . . Median 45434 .
000 35611 . . . .
527 6960 .
878 7149 . . CV 0 .
399 0 .
176 0 . . performance), the mean number of training epochs of the MLP-GLN is 20% higher than the MLP-Sin epochs number.However, observing the MSE performance, the MLP-GLN mean MSE distribution is approximately ten times lessthan the MLP-Sin mean MSE distribution, which may justify the choice of the MLP-GLN model to the detrimentof the MLP-Sin model. The α values distribution for the two hidden layers are shown in Figure 9b. In general,11 able 4: Two-sample Kolmogorov-Smirnov Test at the 5% significance level for the MSE and number of epochs distributions in the trainingbetween the MLP-GLN and all other models for both architectures studied – EES data set. Tested Model KS Test Resultswhit MLP-GLN
Statistically Similar p -values − − M S E MLP-Sin No 1 . · − MLP-Tanh No 6 . · − TBN Yes 0 . E po c h s MLP-Sin Yes 0 . . · − TBN No 0 . − − − M S E MLP-Sin No 8 . · − MLP-Tanh No 1 . · − TBN No 1 . · − E po c h s MLP-Sin No 0 . . . · − Hidden Layer (a) 1 − − Hidden Layers (b) 1 − − − Figure 9: The α distribution for the MLP-GLN models after the training process – EES data ser. for both hidden layers, the MLP-GLN gives more importance to the global component than the local component ofits activation functions. For the first hidden layer, the mean α value was 0 .
691 and the median of 0 . α value of 0 .
709 and the median of 0 . α values demonstrate the globalbehavior of the EES data set.The KS test was also applied to the 1 − − − The same experimental procedure applied to the EES data set was also employed for the SE data set. The test setMSE distributions for all ANN models are outlined in Figure 10. Figure 10a presents the MSE distributions for thearchitecture with one hidden layer (1 − − − − ffi cient than MLP-Tanh. However, applying the two-12ample KS test at the 5% significance level, the MSE distributions of the MLP-GLN and MLP-Tanh are statisticallysimilar, as presented in Table 6. The other two ANN models are not statistically similar to the MLP-GLN. ANN Models (a) 1 − − ANN Models (b) 1 − − − ANN Models (c) 1 − − − Figure 10: The test set MSE box plot for all ANN models studied and both architectures for the SE data set. In (a) are presented the MSEdistributions (30 repetitions) for each ANN model for architecture 1 − −
1. In (b) and (c) are shown the results for architecture 1 − − − Observing Figure 10a, it is possible to note that the MLP-Tanh had a better MSE performance than the MLP-Sin,and the TBN had an intermediary MSE performance between the MLP-Sin and MLP-Tanh. In this way, the TBN canbe viewed as a mean between the MLP-Sin and MLP-Tanh, where its dispersion will be something equivalent to thesum of the two models’ dispersion. The MLP-GLN gave more importance to the local component of its activationfunction, as shown in Figure 12a. In this way, the MLP-GLN managed to do a more e ffi cient combination of the sineand hyperbolic tangent functions than the TBN.The distribution of the number of epochs for each ANN model (architecture 1 − −
1) is shown in Figure 11a. TheMLP-GLN is more e ffi cient than other models when looking at the measure of central tendency, mean, and median.Besides these measures of the central location, the MLP-GLN also has the lowest CV, indicating that the MLP-GLNhas the lowest dispersion for the number of trained epochs.For the architecture with two hidden layers, 1 − − −
1, the MLP-GLN and the MLP-Sin are statistically similaras for the MSE performance as for the number of epochs trained, according to Figure 10c and the KS test presentedin Table 4. In particular, for the number of trained epochs, the most e ffi cient ANN is the MLP-Tanh. However, theMSE performance reached by the MLP-Tanh is the worst among the analyzed models. In this way, the MLP-GLN and13 able 5: The descriptive statistics for all ANN models analyzed. All MSE measures are relative to the test set of the SE data set. The epochsmeasures are referent to the training process. The best results are highlighted in bold-face. Statistics ANN Models
MLP-GLN MLP-Sin MLP-Tanh TBN − − M S E Min. . .
058 0 .
016 0 . . .
145 0 .
141 0 . . .
097 0 .
066 0 . . .
094 0 .
066 0 . . . .
033 0 . . . .
492 0 . E po c h s Min. 10061 10319 . .
833 20754 .
133 17071 . . .
000 20803 .
000 15950 . . .
269 7791 .
903 5560 . . .
336 0 .
375 0 . − − − M S E Min. . · − . · − . · − . · − Max. 2 . · − . · − .
052 2 . · − Mean . · − . · − .
017 3 . · − Median 1 . · − . · − .
016 2 . · − Std. 4 . · − . · − .
013 4 . · − CV 1 .
597 1 . . . E po c h s Min. 23100 31700 .
233 62887 . . . .
000 59996 . . . .
494 21061 . . . . .
335 0 .
528 0 . ANN Models (a) 1 − − ANN Models (b) 1 − − − Figure 11: The epoch distributions in the training for all ANN models for the SE data set. (a) is related to architecture 1 − −
1, and (b) is relatedto architecture 1 − − − MLP-Sin need approximately twice the MLP-Tanh’s number of epochs to be trained, but the MSE reached by thesetwo models is likely 10 more accurate than the MLP-Tanh.The α weight distribution for the MLP-GLN with two hidden layers is shown in Figure 12b. The MLP-GLN gave14 able 6: Two-sample Kolmogorov-Smirnov Test at the 5% significance level for the MSE and number of epochs distributions between the MLP-GLN and all other models for both architectures studied – SE data set. Tested Model KS Test Resultswhit MLP-GLN
Statistically Similar p -values − − M S E MLP-Sin No 1 . · − MLP-Tanh Yes 0 . . E po c h s MLP-Sin Yes 0 . . . − − − M S E MLP-Sin Yes 0 . . · − TBN No 4 . · − E po c h s MLP-Sin Yes 0 . . · − TBN No 0 . HiddenLayer (a) 1 − − Hidden Layers (b) 1 − − − Figure 12: The α distributions for the MLP-GLN models after the training process – SE data set. The MLP-GLN model performance investigation on a real-world data set, the Sunspot data set, was done here.Figure 13 presents the MSE distribution for the analyzed models, and Figure 14, the distribution for the number ofepochs.For the architecture 1 − −
1, although the MLP-GLN model has reached the best values of mean and median forMSE, presented in Table 7, the statistical behavior of the MLP-Sin and TBN models is similar to the MLP-GLN model,for the MSE and epochs distributions. Table 8 presents the results of the two-sample KS test at 5% of significancelevel employed to verify whether the sample distribution of the analyzed models comes from the same populationdistribution or not. Only the MLP-Tanh was considered as a model with a distinct statistical behavior when comparedwith the MLP-GLN.Figure 15a shows the α values distribution for the MLP-GLN model with one hidden layer. The mean and the15 NN Models (a) 1 − − ANN Models (b) 1 − − − ANN Models (c) 1 − − − Figure 13: The test set MSE box-plot for all ANN models and both architectures for the Sunspot data set. In (a) are presented the MSE distributionsfor the 30 repetitions for each ANN model, architecture 1 − −
1. In (b) and (c) are shown the results for architecture 1 − − − median values of the α distribution are a little less than 0 .
5, demonstrating an approximate equilibrium between theglobal and local components, with a slight predominance for the local component in the activation function.For the architecture 1 − − −
1, the MLP-GLN had the best results for the MSE performance, except for the CVvalue, presenting the most prominent relative MSE distribution dispersion, as can be viewed in Table 7. Applying thetwo-sample KS test to verify if the other models have the same population distribution of the MLP-GLN, the MLP-Sinpresents the same statistical behavior for the MSE and number of epochs distribution, as it is demonstrated in Table8. Observing the quantity of epochs to train the model, MLP-GLN is computationally more costly than MLP-Tanh.It was needed about 79% more epochs to train the MLP-GLN than to train the MLP-Tanh. However, when the MLP-GLN MSE distribution is compared with the MLP-Tanh MSE distribution, the MLP-GLN reached a mean MSE value10 times less than the MLP-Tanh model.Figure 15b presents the α distributions for the two hidden layers. For the first hidden layer, the MLP-GLN gavemore importance to the global component (sine function) than the local component, where the mean α value and themedian reached were respectively 0 .
786 and 0 . α value of 0 .
478 and a median of 0 . NN Models (a) 1 − − ANN Models (b) 1 − − − Figure 14: The epochs distributions for all ANN models – Sunspot Data set. (a) is related to architecture 1 − −
1, and (b) is related to architecture1 − − − Statistics ANN Models
MLP-GLN MLP-Sin MLP-Tanh TBN − − M S E Min. . · − .
012 0 .
037 0 . .
058 0 . .
103 0 . . .
034 0 .
059 0 . . .
035 0 .
057 0 . . .
012 0 . . .
567 0 . . . E po c h s Min. 10785 11013 .
833 22222 . . . .
500 20078 . . . .
901 8454 . . . . .
380 0 .
415 0 . − − − M S E Min. . · − . · − .
042 4 . · − Max. 1 . · − . · − . . · − Mean . · − . · − .
059 1 . · − Median . · − . · − .
057 1 . · − Std. 3 . · − . · − . . · − CV 0 .
662 0 . . . E po c h s Min. 10421 10441 .
733 19764 . . . .
500 17675 . . . .
981 7265 . . . .
374 0 . . . able 8: Two-sample Kolmogorov-Smirnov Test at the 5% significance level for the MSE and number of epochs distributions between the MLP-GLN and all other models for both architectures studied – Sunspot data set. Tested Model KS Test Resultswhit MLP-GLN
Statistically Similar p -values − − M S E MLP-Sin Yes 0 . . · − TBN Yes 0 . E po c h s MLP-Sin Yes 0 . . . − − − M S E MLP-Sin No 1 . · − MLP-Tanh No 1 . · − TBN No 1 . · − E po c h s MLP-Sin No 1 . · − MLP-Tanh No 4 . · − TBN Yes 0 . Hidden Layer (a) 1 − − Hidden Layers (b) 1 − − − Figure 15: The α distributions for the MLP-GLN models after the training process – Sunspot data set. ff erential Equation Solving Problem In this second class of tests applied to assess the performance of the proposed neuron, we employed the sameexperimental structure used previously for the regression test of the architectures 1 − − − − −
1. Itwas used 30 repetitions for each experiment with all ANN models, where the Adam training algorithm [15] with alearning rate of 10 − was employed.The experiments to solve the di ff erential equations presented in Section 4.2 were implemented in Python3 with thelibrary NeuroDiffEq [18]. For each di ff erential equation solving experiment, a set of initial or boundary conditionsis presented, as defined in Section 4.2. The maximum number of epochs employed in the training process for eachdi ff erential equation solving was defined based on the initial experimental investigations. With these parameters, thelibrary NeuroDiffEq has trained the ANN models until the maximum number of epochs is set, returning the trainedANN models defined in the training point where the smallest validation loss occurs. The
NeuroDiffEq library doesnot return information about the number of epochs used to select the ANN model. For this reason, any analysis of thenumber of epochs employed to train the ANN models was not done for the di ff erential equations solving experiments. The NeuroDi ff Eq library is available in https: // github.com / odegym / neurodi ff eq . .2.1. Exponential Decay Equation Results As presented in Section 4.2.1, the Decay di ff erential equation with initial condition u ( t = = u ( t ) = e − t . The idea is to generate a solution u ( t ) based on ANN models to describe this analyticalsolution combined with the initial condition of u ( t = =
1. All experiments used the domain of t ∈ [0 ,
3] with amaximum number of epochs of 500 to training the ANN models.
ANN Models (a) 1 − − ANN Models (b) 1 − − − ANN Models (c) 1 − − − Figure 16: The test set MSE box plot for all ANN models studied and both architectures for solving the Exponential Decay di ff erential equation.In (a) is presented the MSE distributions for the 30 repetitions for each ANN model for architectures 1 − −
1. In (b) and (c) are shown the resultsfor architecture 1 − − − Figure 16 shows the MSE distribution for Decay equation ANN solution, where Figure 16a is relative to theexperiments with 1 − − − − − α value distributions for both architectures are shown in Figure 17. Figure 17a refers to the 1 − − .
460 and median value of 0 .
416 indicate a slight tendency to the local component.19 able 9: The descriptive statistics for all ANN models analyzed. All MSE measures are relative to the Exponential Decay solving. The best resultsare highlighted in bold-face.
MSE ANN Models
MLP-GLN MLP-Sin MLP-Tanh TBN − − Min. 9 . · − . · − . · − . · − Max. . · − . · − . · − . · − Mean . · − . · − . · − . · − Median . · − . · − . · − . · − Std. . · − . · − . · − . · − CV 2 . . .
643 1 . − − − Min. . · − . · − . · − . · − Max. 6 . · − . · − . · − . · − Mean . · − . · − . · − . · − Median . · − . · − . · − . · − Std. 1 . · − . · − . · − . · − CV 2 . . .
877 1 . Table 10: Two-sample Kolmogorov-Smirnov Test at the 5% significance level for the MSE distributions between the MLP-GLN and all othermodels for both architectures studied for solving the Exponential Decay di ff erential equation. Tested Model KS Test Resultswhit MLP-GLN
Statistically Similar p -values − − M S E MLP-Sin No 6 . · − MLP-Tanh No 0 . . − − − M S E MLP-Sin No 4 . · − MLP-Tanh No 1 . · − TBN No 1 . · − Hidden Layer (a) 1 − − Hidden Layers (b) 1 − − − Figure 17: The α distributions for the MLP-GLN models after the training process for solving the Exponential Decay di ff erential equation. NN Models (a) 1 − − ANN Models (b) 1 − − ANN Models (c) 1 − − − ANN Models (d) 1 − − − Figure 18: The MSE test set box-plot for all ANN models studied and both architectures to solve the Catenary di ff erential equation. In (a) and (b)are presented the MSE distributions for the 30 repetitions for each ANN model for architectures 1 − − − − − Figure 17b corresponds to the 1 − − − α mean and median are,respectively, 0 .
483 and 0 . .
525 and 0 . The catenary curve is the di ff erential equation solution presented in Section 4.2.2. The ANN models for both1 − − − − − ff erential equation (Equation11). For all experiments, the domain of x ∈ [ − ,
2] was used to discover an approximate solution of the di ff erentialequation. The maximum number of epochs of 700 was employed for all ANN models training phases.Figure 18 shows the MSE distribution for all analyzed ANN models. Figures 18a and 18b are relative to the1 − − − − − ff erential equation, the MLP-Sin, MLP-Tanh, and TBN models were not statistically similar to theMLP-GLN when tested with the two-sample KS test. Table 12 exhibits the results of the KS test at 5% significance,21 able 11: The descriptive statistics for ANN models analyzed. All MSE measures are relative to the Catenary di ff erential equation solving. Thebest results are highlighted in bold-face. MSE ANN Models
MLP-GLN MLP-Sin MLP-Tanh TBN − − Min. . · − . · − . · − . · − Max. 7 . · − . · − . · − . · − Mean 6 . · − . · − . · − . · − Median 1 . · − . · − . · − . · − Std. 1 . · − . · − . · − . · − CV 2 . . .
506 1 . − − − Min. . · − . · − . · − . · − Max. 1 . · − . · − . · − . · − Mean . · − . · − . · − . · − Median . · − . · − . · − . · − Std. 2 . · − . · − . · − . · − CV 1 . . .
686 0 . Table 12: Two-sample Kolmogorov-Smirnov Test at the 5% significance level for the MSE distributions between the MLP-GLN and all othermodels for both architectures studied for the Catenary di ff erential equation solving. Tested Model KS Test Resultswhit MLP-GLN
Statistically Similar p -values − − M S E MLP-Sin No 1 . · − MLP-Tanh No 0 . . · − − − − M S E MLP-Sin No 1 . · − MLP-Tanh No 6 . · − TBN No 1 . · − indicating that MLP-GLN has a di ff erent population distribution from all other ANN models’ population distributions.Figure 19 presents the α weight value distributions. Figure 19a is for the 1 − − α weight values, with a total variation from 0 . . .
568 and 0 . α values can tell that both components have similar importance. The MLP-GLN found a solution withthree possible activation function behavior, convergence, purely local, purely global, and a mix of these components.For the architecture 1 − − −
1, the α values distributions were more stable. Figure 19b shows the α weightdistributions for both hidden layers. The first hidden layer selected values of α greater than 0 . α ingeneral less than 0 . The di ff erential equation for a simple harmonic oscillator, presented in Section 4.2.3, has an analytical solutiondescribed by a global function, like a sine. Thus, it is expected that ANN models with sine activation function aremore appropriate to describe this di ff erential equation’s solution.22 idden Layer (a) 1 − − Hidden Layers (b) 1 − − − Figure 19: The α distributions for the MLP-GLN models after the training process for solving the Catenary di ff erential equation. For all experiments, the di ff erential equation solutions were sought in the domain t ∈ [0 , π ], and all ANN modelswere trained with maximum epoch number of 700.Figure 20 presents the test set MSE distributions for all ANN models. Figures 20a and 20b are referent to 1 − − − − − − −
1, the models MLP-GLN, MLP-Sin, and TBN obtained the best MSE performances.Looking at the mean values of the MSE distributions, the MLP-GLN model was the best, with a value of 3 . · − . However, observing the MSE distributions’ median, we read that the TBN is the best model, with a value of1 . · − , and with the best results for the second and third quarterlies. The MLP-Sin performance stayed betweenthese two models. If the MSE distribution dispersion is observed, the MLP-GLN presents a CV of 1 .
420 against theMLP-Sin and TBN CV values of 2 .
034 and 2 . Table 13: The descriptive statistics for all ANN models analyzed. All MSE measures are relative to the Simple Harmonic Oscillator di ff erentialequation solving. The best results are highlighted in bold-face. MSE ANN Models
MLP-GLN MLP-Sin MLP-Tanh TBN − − Min. . · − . · − . · − . · − Max. . · − . · − . · − . · − Mean . · − . · − . · − . · − Median 2 . · − . · − . · − . · − Std. . · − . · − . · − . · − CV 1 .
420 2 . . . − − − Min. . · − . · − . · − . · − Max. 6 . · − . · − . · − . · − Mean . · − . · − . · − . · − Median . · − . · − . · − . · − Std. 1 . · − . · − . · − . · − CV 1 .
046 0 . . . − − −
1, there was a similar behavior, the MLP-GLN, MLP-Sin, and TBN models23
NN Models (a) 1 − − ANN Models (b) 1 − − ANN Models (c) 1 − − − ANN Models (d) 1 − − − Figure 20: The test set MSE box-plot for all ANN models studied and both architectures for the Harmonic Oscillator di ff erential equation solving.In (a) and (b) are presented the MSE distributions for the 30 repetitions for each ANN model for architecture 1 − −
1. In (c) and (d) is shown theresults for architecture 1 − − − idden Layer (a) 1 − − Hidden Layers (b) 1 − − − Figure 21: The α distributions for the MLP-GLN models after the training process for solving the Simple Harmonic Oscillator di ff erential equation. have the best MSE performance. For this architecture with two hidden layers, the MLP-GLN reached the best MSEperformance for both mean and median measures, whereas the MLP-Sin presented the smallest CV measure amongthese three models.It worth noting that the MLP-Sin was a better option than the MLP-Tanh. This result was already expectedsince trigonometric functions directly describe this di ff erential equation’s solution. However, the ANN functionalapproximator process was more e ffi cient with the use of the MLP-GLN double activation function with local andglobal components. Table 14: Two-sample Kolmogorov-Smirnov Test at the 5% significance level for the MSE distributions between the MLP-GLN and all othermodels for both architectures studied for solving the Catenary di ff erential equation. Tested Model KS Test Resultswhit MLP-GLN
Statistically Similar p -values − − M S E MLP-Sin No 0 . . · − TBN No 0 . − − − M S E MLP-Sin No 2 . · − MLP-Tanh No 1 . · − TBN No 4 . · − The MLP-GLN was tested against all other ANN models to verify if there is some ANN model with the sameMSE population distribution. The two-sample KS test was applied, and the MLP-GLN presents a di ff erent MSEpopulational distribution. The two-sample KS test results at 5% significance are shown in Table 14.Figure 21 shows the α weight values distribution for both MLP-GLN architectures. Observing the Figure 21a,the α values distribution reached by the architecture 1 − − . . − − −
1, the both hidden layers obtained mean α valuesgreater than 0 . .
552 and 0 . .
573 and 0 . The di ff erential equation for a damped harmonic oscillator, as defined in Section 4.2.4, has an analytical solutionformed by a mathematical product of two functions, an exponential and a trigonometric function. All the ANN modelswere employed to solve the damped harmonic oscillator equation in the domain t ∈ [0 , π ], where it was applied thenumber of epochs of 1000 to execute all ANN training processes. ANN Models (a) 1 − − ANN Models (b) 1 − − − ANN Models (c) 1 − − − Figure 22: The MSE test set box-plot for all ANN models studied and both architectures for the Damped Harmonic Oscillator di ff erential equationsolving. In (a) is presented the MSE distribution for the 30 repetitions for each ANN model for architectures 1 − −
1. In (b) and (c) are shownthe results for architecture 1 − − − Figure 22 shows the MSE distributions for all ANN models. The MSE distributions for architecture 1 − − − − − − −
1, the two best ANN models were the MLP-GLN and the TBN, demonstrating thatthe local and global function combinations have a better performance than the ANN with a single activation functionfor this problem. The MLP-GLN reached the best MSE mean value, while the TBN had better MSE median result.Both these ANN models had very similar statistical central measures results, but the MLP-GLN had a CV value of0 .
784 against a TBN CV value of 0 . able 15: The descriptive statistics for all ANN models analyzed. All MSE measures are relative to the Damped Harmonic Oscillator di ff erentialequation solving. The best results are highlighted in bold-face. MSE ANN Models
MLP-GLN MLP-Sin MLP-Tanh TBN − − Min. . · − . · − . · − . · − Max. . · − . · − . · − . · − Mean . · − . · − . · − . · − Median 3 . · − . · − . · − . · − Std. . · − . · − . · − . · − CV 0 .
784 0 . . . − − − Min. . · − . · − . · − . · − Max. . · − . · − . · − . · − Mean . · − . · − . · − . · − Median . · − . · − . · − . · − Std. 9 . · − . · − . · − . · − CV 0 . . .
579 1 . − − −
1, three ANN models stood out, the MLP-GLN, MLP-Sin, and TBN, as it can beviewed in Figure 22b. The MLP-GLN reached the better MSE performance from these three ANN models, as viewedin Table 15. For this architecture, the MLP-GLN has a statistically di ff erent MSE distribution, as the two-sample KStest a ffi rms in Table 16. Thus, the MLP-GLN reached better MSE performance from these three highlighted ANNmodels. Table 16: Two-sample Kolmogorov-Smirnov Test at the 5% significance level for the MSE distributions between the MLP-GLN and all othermodels for both architectures studied for the Damped Harmonic Oscillator di ff erential equation solving. Tested Model KS Test Resultswhit MLP-GLN
Statistically Similar p -values − − M S E MLP-Sin No 2 . · − MLP-Tanh No 1 . · − TBN Yes 0 . − − − M S E MLP-Sin No 0 . . · − TBN No 0 . α values distribution for the MLP-GLN can be viewed in Figure 23. For the architecture with a hidden layer,Figure 23a shows an α distribution with mean value of 0 . . − − − α mean value of 0 . . α values distribution with a mean of 0 .
696 and a median of 0 . α values distributions for both damped and simple harmonic oscillators. It is possible to notethat, for the 1 − − ff erential equation cases have the same behavior, but for the damped oscillator case the α distribution27 idden Layer (a) 1 − − Hidden Layers (b) 1 − − − Figure 23: The α distributions for the MLP-GLN models after the training process for the Damped Harmonic Oscillator di ff erential equationsolving. presented an accentuated asymmetric, extending its tail for smaller values of α . This behavior indicates that theMLP-GLN introduces a more significant contribution to the local component in the activation function compositionfor the damped harmonic oscillator. This behavior makes sense because there is an exponential component in thedamped oscillator solution, whereas for a simple oscillator, there is only the sine component. For the architecture1 − − −
1, the α values distributions for both damped and simple oscillators have similar behavior. However,for the damped oscillator ANN solver, the α distribution has less dispersion, indicating a finer specialization foreach hidden layer, where the hyperbolic tangent contribution was increased in the first hidden layer and the sinecontribution in the second hidden layer. One more time, this behavior of the α values points to more local informationrepresentation in the first hidden layer and a more global representation in the second hidden layer. The Laplace di ff erential equation has an analytical solution given by Equation 15, as defined in Section 4.2.5.This solution is a mathematical product between sine and hyperbolic sines. The analytical solution is a 2-dimensionalfunction, u ( x , y ). The mathematical domain analyzed here was x ∈ [ − ,
2] and y ∈ [ − , − −
1. It is possible to note that the MLP-GLN and MLP-Tanh are the two models with the best MSE performance. The MLP-GLN reached an MSE distribution with a meanvalue of 2 . · − and a median of 1 . · − , the MLP-Tanh a mean of 1 . · − and a media of 1 . · − .All descriptive statistics are presented in Table 17. Looking at Figure 24b, apparently the MLP-Tanh has better MSEperformance, since its MSE distribution CV is 0 . . − − − α values distribution reached by MLP-GLN for both architec-tures. Figure 25a for architecture 1 − −
1, and Figure 25b for architecture 1 − − −
1. For the architecture withone hidden layer, the α values distribution obtained both mean and median values of 0 . NN Models (a) 1 − − ANN Models (b) 1 − − ANN Models (c) 1 − − − ANN Models (d) 1 − − − Figure 24: The test set MSE box-plot for all ANN models studied and both architectures for the Laplace di ff erential equation solving. In (a) and(b) are presented the MSE distributions for the 30 repetitions for each ANN model for architectures 1 − −
1. In (c) and (d) is shown the resultsfor architecture 1 − − −
1. The dashed lines are the median values references.
Hidden Layer (a) 1 − − Hidden Layers (b) 1 − − − Figure 25: The α distributions for the MLP-GLN models after the training process for the Laplace di ff erential equation solving. able 17: The descriptive statistics for all ANN models analyzed. All MSE measures are relative to the Laplace di ff erential equation solving. Thebest results are highlighted in bold-face. MSE ANN Models
MLP-GLN MLP-Sin MLP-Tanh TBN − − Min. . · − . · − . · − . · − Max. 5 . · − . · − . · − . · − Mean 2 . · − . · − . · − . · − Median . · − . · − . · − . · − Std. 1 . · − . · − . · − . · − CV 0 . . .
382 0 . − − − Min. . · − . · − . · − . · − Max. 4 . · − . · − . · − . · − Mean . · − . · − . · − . · − Median . · − . · − . · − . · − Std. 9 . · − . · − . · − . · − CV 0 . . .
358 0 . Table 18: Two-sample Kolmogorov-Smirnov Test at the 5% significance level for the MSE distributions between the MLP-GLN and all othermodels for both architectures studied for the Laplace di ff erential equation solving. Tested Model KS Test Resultswhit MLP-GLN
Statistically Similar p -values − − M S E MLP-Sin No 1 . · − MLP-Tanh Yes 0 . . − − − M S E MLP-Sin No 8 . · − MLP-Tanh Yes 0 . . · − ever, for the architecture, 1 − − −
1, the α values distribution was dominated by values greater than 0 .
5. Forboth hidden layers, the global component of the activation function had more importance than the local component,where the mean and median values for the first hidden layer were 0 .
573 and 0 . .
621 and 0 . The heat di ff erential equation defined in Section 4.2.6 with respective boundary conditions, has a possible analyt-ical solution presented in Equation 17. This solution is a mathematical product of the sine and exponential functions,where in the context of this article, a solution with global and local components. For the experiments, the parametersused were a medium di ff usivity k = .
3, a size of the heat propagation medium L = .
0, a time-domain of t ∈ [0 , x ∈ [0 , L ]. All ANN models were trained with the number of epochs of 300.Figure 26 shows the MSE distributions. For the architecture 1 − −
1, Figures 26a and 26b, the MLP-GLN andTBN models had a MSE performance better than pure ANN models. Table 19 presents all descriptive statistics forthose MSE distributions. The MLP-GLN and TBN MSE distributions came from the same population distributionaccording to the two-sample KS test Table 20. The MLP-GLN and TBN reached statistical central measures verysimilar, with MSE mean values of 6 . · − and 6 . · − , respectively, and a median of 4 . · − and5 . · − . 30 NN Models (a) 1 − − ANN Models (b) 1 − − ANN Models (c) 1 − − − ANN Models (d) 1 − − − Figure 26: The MSE test set box-plot for all ANN models studied and both architectures for the Heat di ff erential equation solving. In (a) and (b)are presented the MSE distributions for the 30 repetitions for each ANN model for architectures 1 − −
1, where the dashed lines are the medianvalues references. In (c) and (d) are shown the results for architecture 1 − − − able 19: The descriptive statistics for all ANN models analyzed. All MSE measures are relative to the Heat di ff erential equation solving. The bestresults are highlighted in bold-face. MSE ANN Models
MLP-GLN MLP-Sin MLP-Tanh TBN − − Min. 1 . · − . · − . · − . · − Max. 2 . · − . · − . · − . · − Mean 6 . · − . · − . · − . · − Median . · − . · − . · − . · − Std. 5 . · − . · − . · − . · − CV 0 .
792 1 . . . − − − Min. . · − . · − . · − . · − Max. . · − . · − . · − . · − Mean . · − . · − . · − . · − Median . · − . · − . · − . · − Std. . · − . · − . · − . · − CV 0 . . .
700 0 . − − −
1, the MLP-GLN presented the best MSE performance, as demonstrated by Figures26c and 26d. For this architecture, the MLP-GLN MSE distribution came from a distinct population distribution whencompared with the other analyzed ANN models, as can be viewed in Table 20 with the two sample KS test results.
Table 20: Two-sample Kolmogorov-Smirnov Test at the 5% significance level for the MSE distributions between the MLP-GLN and all othermodels for both architectures studied for the Heat di ff erential equation solving. Tested Model KS Test Resultswhit MLP-GLN
Statistically Similar p -values − − M S E MLP-Sin No 0 . . · − TBN Yes 0 . − − − M S E MLP-Sin No 5 . · − MLP-Tanh No 0 . . · − Observing the α values distribution, Figure 27, it is possible to verify the expectation for the activation functioncomponents combination. Figure 27a shows the α values distribution for the architecture with one hidden layer. Inthis case, the MLP-GLN gave more importance to the global component, with an α mean value of 0 .
881 and a medianvalue of 0 . − − −
1, Figure 27b, both hidden layers presented a tendency to enhancethe global component, with a mean value of 0 .
594 and a median value of 0 .
578 for the first hidden layer, and 0 . .
550 for the second hidden layer.
The Kuramoto-Sivashinsky di ff erential equation presented in Section 4.2.7 is a very complex nonlinear di ff erentialequation. The analysis employed here, given the set of initial and boundary conditions, compares the MSE perfor-mance of the pure ANN models (MLP-Sin and MLP-Tanh) and the ANN composite models (MLP-GLN, and TBN).The experiments were executed in the domain of x ∈ [ − ,
40] and t ∈ [0 , − −
1, Figures 28a and 28b, the two best models were the MLP-GLN and MLP-Tan. The MSE32 idden Layer (a) 1 − − Hidden Layers (b) 1 − − − Figure 27: The α distributions for the MLP-GLN models after the training process for the Heat di ff erential equation solving. distributions’ mean values for these two better ANN models were 0 .
028 and 0 . .
027 and 0 . . . − − −
1, Figure 28c, the MLP-GLN presents the best MSE performance, with bothmean and median values of 0 . Table 21: The descriptive statistics for all ANN models analyzed. All test set MSE measures are relative to the Kuramoto-Sivashinsky di ff erentialequation solving. The best results are highlighted in bold-face. MSE ANN Models
MLP-GLN MLP-Sin MLP-Tanh TBN − − Min. . .
126 0 .
023 0 . .
047 0 . . . .
028 0 . . . .
027 0 . . . . · − . · − . · − . · − CV 0 .
174 0 . . . − − − Min. . .
015 0 .
016 0 . . .
022 0 .
024 0 . . .
018 0 .
019 0 . . .
018 0 .
019 0 . . · − . · − . · − . · − CV 0 .
147 0 . . . α values distribution for both MLP-GLN architectures. For the architecture with only onehidden layer, Figure 29a, the MLP-GLN reaches a behavior where the activation function local component has more33 NN Models (a) 1 − − ANN Models (b) 1 − − − ANN Models (c) 1 − − − Figure 28: The test set MSE box plot for all ANN models studied and both architectures for the Kuramoto-Sivashinsky di ff erential equation solving.In (a) and (b) are presented the MSE distributions for the 30 repetitions for each ANN model for architectures 1 − −
1. In (c) is shown the resultsfor architecture 1 − − −
1. The dashed lines are the median values references.Table 22: Two-sample Kolmogorov-Smirnov Test at the 5% significance level for the MSE distributions between the MLP-GLN and all othermodels for both architectures studied for solving the Kuramoto-Sivashinsky di ff erential equation. Tested Model KS Test Resultswhit MLP-GLN
Statistically Similar p -values − − M S E MLP-Sin No 1 . · − MLP-Tanh No 0 . . − − − M S E MLP-Sin No 2 . · − MLP-Tanh No 8 . · − TBN No 8 . · − idden Layer (a) 1 − − Hidden Layers (b) 1 − − − Figure 29: The α distributions for the MLP-GLN models after the training process for the Kuramoto-Sivashinsky di ff erential equation solving. importance than the global component. With a mean value of 0 .
015 and a median of 0 . − − −
1, Figure 29b, the MLP-GLN presents a first hiddenlayer with predominate local component, with α mean value of 0 .
138 and a media value of 0 . α mean value of 0 .
800 and a median value of 0 . ff erent layers.
6. Conclusions
This article proposes a new artificial neuron with an activation function composed of two di ff erent mathematicalfunctions, a function with local behavior and a function with global features. Here, the local function used was thehyperbolic tangent, and the global was the trigonometrical sine. The term local for the tanh( x ) was employed becausethe relevant variation of the function is located in a narrow range around x =
0, and it tends to 1 or − x → ∞ or x → −∞ . On the other hand, the sin( x ) is called global component because it has the same variation behavior forall domain x ∈ [ −∞ , ∞ ].The sin( x ) and tanh( x ) were chosen here because the problems studied in general have global and local features,or some combination of those. Other mathematical functions could also be used, where the proposed neuron wouldcombine the characteristics of the chosen mathematical functions.It was done a massive experimental test with the proposed neuron approaching two types of problems, the regres-sion problem and di ff erential equation solving, totaling 2400 simulations. These problem classes were chosen becauseit is common to find situations where the problem solution combines a local and global feature. More specifically, tenindividual problems were studied, three regression problems and seven di ff erential equations.In general, the proposed MLP-GLN reached a better MSE performance distribution for the tested problems. How-ever, in some problems, the MLP-GLN had statistically similar behavior with a traditional MLP, with only sine func-tion or hyperbolic tangent function, or it had statistically identical behavior to a combination of these two traditionalMLP, the TBN model. Thus, the MLP-GLN could adapt its activation function composition to e ffi ciently solve theproblems.All MLP-GLN experiments used the same activation function per hidden layer. In this way, the α weight, thatcombines the local and global activation function components, is the same for all neurons in the same hidden layer.Therefore, it was possible to verify each hidden layer tendency by choosing the importance of the local and globalcomponents and thus, extract information about the importance of each global and local component in the analyzedproblem. 35ince the output layer of the MLP-GLN is only one neuron with a linear activation function, the architecture withone hidden layer makes a combination of the local and global functions without doing a mathematical compositionof these two functions. Thus, the MLP-GLN could define an e ffi cient and automatic combination of these two localand global functions in such a way to reach a better or equivalent performance to the ANN with a single activationfunction or with the linear combination of the two sine and hyperbolic tangent ANN, the TBN model.The MLP-GLN demonstrated the best MSE performance for all experiments done for the architecture with twohidden layers. For this architecture, the MLP-GLN makes a mathematical composition of the first hidden layeractivation function with the activation function of the second hidden layer. Combining the two components plus thecomposition of the activation functions of the hidden layers gave MLP-GLN the best versatility to solve the problemwith better MSE precision than using one layer. Thus, a deeper architecture yields better performance.Therefore, the MLP-GLN presents a possibility to reach better results than the ANN with single activation func-tions or a linear combination of these ANNs, with automatic adjusting of the relative importance between the activationfunction components. In this way, many other function types with di ff erent features can be also combined in pairs tobetter describe problems containing di ff erent feature components. Acknowledges
To the Science and Technology Support Foundation of Pernambuco (FACEPE) Brazil, Brazilian National Councilfor Scientific and Technological Development (CNPq), Coordenação de Aperfeiçoamento de Pessoal de Nível Supe-rior - Brasil (CAPES) - Finance Code 001 by financial support for the development of this research, and the Institutefor Applied Computational Science (IACS) at Harvard University for host the professor Tiago A. E. Ferreira.
References [1] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems 2 (1989) 303–314.URL: https://doi.org/10.1007/BF02551274 . doi: .[2] O. F. Ertugrul, A novel type of activation function in artificial neural networks: Trained activation function, Neural Networks 99 (2018)148 – 157. URL: . doi: .[3] W.-J. Niu, Z.-K. Feng, B.-F. Feng, Y.-W. Min, C.-T. Cheng, J.-Z. Zhou, Comparison of multiple linear regression, artificial neural network,extreme learning machine, and support vector machine in deriving operation rule of hydropower reservoir, Water 11 (2019) 88. URL: . doi: .[4] I. E. Lagaris, A. Likas, D. I. Fotiadis, Artificial neural networks for solving ordinary and partial di ff erential equations, IEEE Transactions onNeural Networks 9 (1998) 987–1000.[5] A. D. Jagtap, K. Kawaguchi, G. E. Karniadakis, Adaptive activation functions accelerate convergence in deep and physics-informed neuralnetworks, Journal of Computational Physics 404 (2020) 109136. URL: . doi: .[6] X. Meng, G. E. Karniadakis, A composite neural network that learns from multi-fidelity data: Application to function approximationand inverse pde problems, Journal of Computational Physics 401 (2020) 109020. URL: . doi: .[7] M. Mattheakis, D. Sondak, A. S. Dogra, P. Protopapas, Hamiltonian neural networks for solving di ff erential equations, arXiv (2020). arXiv:2001.11107 .[8] A. Paticchio, T. Scarlatti, M. Mattheakis, P. Protopapas, M. Brambilla, Semi-supervised neural networks solve an inverse problem formodeling covid-19 spread, arXiv (2020). arXiv:2010.05074 .[9] H. Jin, M. Mattheakis, P. Protopapas, Unsupervised neural networks for quantum eigenvalue problems, arXiv (2020). arXiv:2010.05075 .[10] Chien-Cheng Yu, Yun-Ching Tang, Bin-Da Liu, An adaptive activation function for multilayer feedforward neural networks, in: 2002 IEEERegion 10 Conference on Computers, Communications, Control and Power Engineering. TENCOM ’02. Proceedings., volume 1, 2002, pp.645–650 vol.1.[11] M. Dushko ff , R. Ptucha, Adaptive activation functions for deep networks, in: 2016 Electronic Imaging, volume 16, 2016, pp. 1–5.[12] B. Li, Y. Li, X. Rong, The extreme learning machine learning algorithm with tunable activation function, Neural Comput & Applic 22 (2013)531–539. URL: https://doi.org/10.1007/s00521-012-0858-9 . doi: .[13] Y. Shen, B. Wang, F. Chen, L. Cheng, A new multi-output neural model with tunable activation function and its applications, NeuralProcessing Letters 20 (2004) 85–104. URL: https://doi.org/10.1007/s11063-004-0637-4 .[14] S. Qian, H. Liu, C. Liu, S. Wu, H. S. Wong, Adaptive activation functions in convolutional neural networks, Neurocomputing 272 (2018)204 – 212. URL: . doi: .[15] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2014. URL: https://arxiv.org/abs/1412.6980 . arXiv:1412.6980 .
16] D. Randle, P. Protopapas, D. Sondak, Unsupervised learning of solutions to di ff erential equations with generative adversarial networks, 2020. arXiv:2007.11133 .[17] M. D. Giovanni1, D. Sondak, P. Protopapas, M. Brambilla1, Finding multiple solutions of odes with neural networks, in: J. Lee, E. F. Darve,P. K. Kitanidis, M. W. Farthing, T. Hesser (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Artificial Intelligence andMachine Learning with Physical Sciences (AAAI-MLPS 2020), volume 2587, Stanford, CA, USA, 2020. URL: http://ceur-ws.org/Vol-2587/ .[18] F. Chen, D. Sondak, P. Protopapas, M. Mattheakis, S. Liu, D. Agarwal, M. D. Giovanni, Neurodi ff eq: A python package for solvingdi ff erential equations with neural networks, Journal of Open Source Software 5 (2020) 1931. URL: https://doi.org/10.21105/joss.01931 . doi: .[19] A. Banerjee, S. Das, A. B. Bhattacharya, Sunspot occurrences and the probable e ff ect of interference with hf radio communications in theearth’s ionosphere, in: 2016 International Conference on Computer, Electrical Communication Engineering (ICCECE), 2016, pp. 1–4.[20] Z. Hou, Z. Huang, L. Xia, B. Li, H. Fu, Observations of upward propagating waves in the transition region and corona above sunspots, TheAstrophysical Journal 855 (2018) 65. URL: https://doi.org/10.3847%2F1538-4357%2Faaab5a . doi: .[21] P. Bhowmik, D. Nandy, Prediction of the strength and timing of sunspot cycle 25 reveal decadal-scale space environmental conditions, NatureCommunications 9 (2018) 5209. URL: https://doi.org/10.1038/s41467-018-07690-0 . doi: .[22] E. H. Lockwood, A Book of Curves, Cambridge University Press, Cambride, UK, 1961.[23] J. B. M. Stephen T. Thornton, Classical dynamics of particles and systems, 5th ed. ed., Brooks / Cole, Southbank, Victoria, Australia ; Belmont,CA, 2004.[24] L. N. T. Aly-Khan Kassam, Fourth-order time-stepping for sti ff pdes, SIAM Journal on Scientific Computing 26 (2005) 1214–1233. URL: https://doi.org/10.1137/S1064827502410633 . doi: .[25] M. Lakestani, M. Dehghan, Numerical solutions of the generalized kuramoto–sivashinsky equation using b-spline functions, AppliedMathematical Modelling 36 (2012) 605 – 617. URL: .doi: .[26] J. D. Gibbons, S. Chakraborti, Nonparametric Statistical Inference, fourth ed., Marcel Dekker, 2003..[26] J. D. Gibbons, S. Chakraborti, Nonparametric Statistical Inference, fourth ed., Marcel Dekker, 2003.