Multi-Activation Hidden Units for Neural Networks with Random Weights
Ajay M. Patrikar [email protected]
Abstract—Single-layer feedforward networks with random weights are successful in a variety of classification and regression problems. These networks are known for their non-iterative, fast training algorithms. A major drawback of these networks is that they require a large number of hidden units. In this paper, we propose the use of multi-activation hidden units. Such units increase the number of tunable parameters and enable the formation of complex decision surfaces without increasing the number of hidden units. We experimentally show that multi-activation hidden units can be used either to improve classification accuracy or to reduce computations.
Keywords—Machine learning, feedforward neural networks, neural networks with random weights, random vector functional link networks, extreme learning machines

I. INTRODUCTION
Single-layer feedforward networks with random weights have been studied since the early nineties [1-4] and have been successfully applied to a large number of pattern classification and regression problems over the last two decades. A survey of these networks can be found in [5]. In the literature, they have often been referred to as random vector functional link (RVFL) networks [6-10] or extreme learning machines (ELM) [11-12]. In this paper, we will refer to them as neural networks with random weights (NNRW). These networks are characterized by random assignment of the hidden-unit weights, which are not trained. The weights between the hidden layer and the output layer are obtained analytically using non-iterative training algorithms. These algorithms [5] are known to be much faster than the iterative, backpropagation-based training algorithms of conventional neural networks. A known drawback of NNRW is the large number of hidden units required to achieve good accuracy. This can result in a longer running time during inference, which can limit their use on platforms with limited computational power such as embedded systems, Internet of Things devices, smartphones, and drones. Machine learning algorithms are increasingly being adopted on such platforms, emphasizing the need for efficient models. Several attempts to reduce the number of hidden units have been reported in the literature [11-19]. These methods often depend on incrementally adding hidden units to, or pruning them from, the network. In this paper, we take a very different approach: we attempt to reduce the number of hidden units by using multiple activations per hidden unit. A similar method based on activation ensembles has recently been investigated in the context of deep networks [28, 29]. Our proposed method is simpler and is focused on NNRW. Another method, which uses multiple activations in an NNRW ensemble, was proposed in [30].
Our proposed method differs in that it does not use an NNRW ensemble; instead, it uses multiple activations per hidden unit in a single NNRW. The use of multiple activation functions allows for the formation of varied decision surfaces. We experimentally show that multiple activations lead to improved classification accuracy. Alternatively, the method can be used to reduce the number of hidden units, and hence the computations. The paper is organized as follows. Section II introduces NNRW with multi-activation hidden units. In Section III, experimental results are presented on a number of benchmark machine learning problems. Section IV gives a summary and conclusions.

II. MULTI-ACTIVATION HIDDEN UNITS
A single-layer feedforward network is shown in Fig. 1. The hidden layer has been modified to include two or more activation functions per summation unit. There are no weights associated with the links between the summation units and the activation functions. In the case of NNRW, the weights between the input and hidden layers are chosen randomly, and only the weights between the hidden and output layers are trained. If only one activation function is used per summation unit, this model reduces to the traditional implementation of NNRW. In some implementations of NNRW there are direct connections between the input and output layers (e.g. RVFL). While the proposed method is applicable to those implementations, we focus our analysis on the architecture shown in Fig. 1.

Let x be the input feature vector, and let g_in(x) be the output of the n-th activation function f_n( ) of the i-th hidden unit:

g_in(x) = f_n(a_i · x + b_i)    (1)

where a_i is the random weight vector and b_i is the bias term associated with the i-th hidden unit. We construct a vector h(x) as follows:

h(x) = [g_11(x), ..., g_1N_A(x), g_21(x), ..., g_2N_A(x), ..., g_N1(x), ..., g_NN_A(x)].    (2)

The dimension of this vector is N_A · N, where N is the number of summation units and N_A is the number of activation functions per summation unit. We define the output function for the n-th class to be

F_n(x) = h(x) · β_n    (3)

where β_n = [w_1, w_2, ..., w_{N_A·N}]^T is the vector of output weights for the n-th class. Our goal is to determine the output weights β_n for each class.

Fig. 1. Single-layer feedforward network with random weights and multi-activation hidden units
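The construction of the multi-activation feature vector h(x) in Eqs. (1)-(2) can be sketched in a few lines of NumPy (a minimal sketch; the array shapes and the particular choice of sigmoid and Gaussian activations are illustrative assumptions, not part of the paper's specification):

```python
import numpy as np

rng = np.random.default_rng(0)

d, N = 4, 10                        # input dimension, number of summation units
A = rng.standard_normal((N, d))     # random weight vectors a_i (not trained)
b = rng.standard_normal(N)          # random bias terms b_i

# N_A = 2 activation functions shared by every summation unit
activations = [
    lambda y: 1.0 / (1.0 + np.exp(-y)),   # sigmoid
    lambda y: np.exp(-y ** 2),            # Gaussian
]

def h(x):
    """Multi-activation feature vector h(x) of Eq. (2).

    Each summation unit computes y_i = a_i . x + b_i once (Eq. (1));
    all N_A activations are then applied to the same y_i, so h(x)
    has N_A * N components while the random projection is done once.
    """
    y = A @ x + b                   # shared random projection, N values
    return np.concatenate([f(y) for f in activations])

x = rng.standard_normal(d)
print(h(x).shape)                   # (20,) = (N_A * N,)
```

Note that the summation y = A @ x + b is computed only once and reused by every activation, which is the source of the computational savings discussed below.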
Given L training samples {(x_l, t_l)}, l = 1, ..., L, we seek a solution to the following learning problem:

H B = T    (4)

where T = [t_1, ..., t_L]^T is the matrix of target labels, H = [h(x_1), ..., h(x_L)]^T is the matrix of hidden-unit output vectors, and B = [β_1, ..., β_M] is the output weight matrix. There are M classes. The output weights B can be calculated as

B = H† T    (5)

where H† is the Moore-Penrose generalized inverse of the matrix H. There are several methods for the calculation of H†, including the orthogonal projection method, the orthogonalization method, iterative methods, and singular value decomposition [20-21]. Another alternative is to use ridge regression [22-23], for which a solution is given by

B = (H^T H + λI)^(-1) H^T T    (6)

where I is the (N_A · N) × (N_A · N) identity matrix and λ is a tunable parameter. We use this method in our experiments with λ set to 0.01.

For a network with d inputs, N hidden units, N_A activations per hidden unit, and M outputs, the number of multiply-and-accumulate arithmetic operations during inference is approximately (d · N + M · N_A · N), if the bias terms are ignored. We use this formula to compare network computations.

In NNRW, the random projection performed by the hidden layer usually does not contain any information specific to the classification or regression problem that the network is trying to solve. However, a significant amount of computation is associated with the random projection step. By sharing the summation units between activations, we limit the number of random parameters and the associated computations. At the same time, we increase the number of tunable weights B by introducing multiple activations in the hidden layer. An increase in tunable weights often leads to better performance, until overfitting causes the performance to deteriorate. Using different types of activation functions has another advantage: it allows the formation of more complex decision surfaces, which can enhance the classification capabilities of the network.
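The ridge-regression solution of Eq. (6) and the operation-count formula can be sketched as follows (a minimal NumPy sketch; the matrix shapes and the one-hot target encoding are our assumptions, with λ = 0.01 as used in the paper's experiments):

```python
import numpy as np

def train_output_weights(H, T, lam=0.01):
    """Ridge-regression solution of Eq. (6): B = (H^T H + lam*I)^(-1) H^T T.

    H : (L, N_A*N) matrix whose rows are the hidden-layer vectors h(x_l)
    T : (L, M) target matrix (assumed one-hot class labels)
    """
    k = H.shape[1]
    # Solve the regularized normal equations instead of forming an inverse
    return np.linalg.solve(H.T @ H + lam * np.eye(k), H.T @ T)

def mac_count(d, N, N_A, M):
    """Approximate multiply-accumulate count per inference, biases ignored."""
    return d * N + M * N_A * N

# Doubling N_A while halving N keeps the number of tunable weights
# (N_A * N * M) fixed but halves the random-projection cost d * N:
print(mac_count(36, 1000, 1, 6))   # 42000
print(mac_count(36, 500, 2, 6))    # 24000
```

With the SatImage dimensions (d = 36, M = 6), the second configuration needs roughly 43% fewer operations than the first, consistent with the comparison reported in Section III.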
In NNRW, the activation functions need not be differentiable; therefore, there are many nonlinear functions to choose from. The popular ones include sigmoid, tanh, Gaussian, rectified linear units (ReLU), leaky ReLU, etc. Other functions such as hardlim, sine, tribas, cubic, and signed-quadratic functions have also been used [6, 20]. An entirely new class of activation functions for deep networks has recently been investigated in [24, 29]. In the next section, we experimentally verify that factors such as more tunable parameters and more complex decision surfaces lead to either better accuracy or a smaller network.

TABLE I. TRANSFER FUNCTIONS FOR VARIOUS ACTIVATIONS
Activation Function    Formula
Sigmoid                f(y) = 1 / (1 + e^(-y))
Gaussian               f(y) = e^(-y^2)
Leaky ReLU             f(y) = y for y > 0; 0.2y for y ≤ 0
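The transfer functions of Table I can be written directly as vectorized functions (a small sketch; we read the Gaussian as f(y) = exp(-y²), the form commonly used in the RVFL/ELM literature):

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def gaussian(y):
    # assumed form f(y) = exp(-y^2), common in the RVFL/ELM literature
    return np.exp(-y ** 2)

def leaky_relu(y, alpha=0.2):
    # slope of 0.2 on the negative side, as in Table I
    return np.where(y > 0, y, alpha * y)

y = np.array([-1.0, 0.0, 2.0])
print(sigmoid(y))     # values in (0, 1)
print(gaussian(y))    # peak value of 1 at y = 0
print(leaky_relu(y))  # [-0.2, 0.0, 2.0]
```

Because none of these functions needs to be differentiated during training, any of them (or further nonlinearities) can be paired freely in a multi-activation hidden unit.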
III. EXPERIMENTAL RESULTS
In this section, we describe experiments on three benchmark machine learning problems. In each case, a two-activation NNRW is compared with baseline NNRW models, which use only one activation function per summation unit. The activation functions used in our experiments are chosen from Table I.

A. Results of the SatImage Problem
NNRW classifiers were trained for the Landsat satellite image (SatImage) problem from the Statlog [25] collection. This problem contains 36 attributes, six classes, 4,435 training samples, and 2,000 test samples. Twenty-five trials were conducted with different random initializations, and the average classification accuracy was calculated. Fig. 2 shows the average classification accuracy as a function of the number of hidden units. The best accuracy of 90.68% was obtained using a two-activation NNRW (Sigmoid + Gaussian) with 500 hidden units; the performance drop beyond 500 units for this network was due to overfitting. The highest accuracy obtained by the single-activation NNRWs was 90.51% with 1,000 units. The top-performing two-activation NNRW requires 43% fewer computations during inference than the top-performing single-activation NNRW.
Fig. 2. Results of the SatImage classification problem

B. Results of the UCI Letter Recognition Problem
The UCI letter recognition problem [25] contains 16 attributes and 26 classes. The data consist of 20,000 samples. For each trial, the training and test sets were randomly generated from the overall database: 13,333 samples were used for training and 6,667 for testing. Twenty-five trials were conducted with different random initializations as well as data partitions, and the average classification accuracy was calculated. The results are shown in Fig. 3. The best accuracy obtained by the two-activation NNRW (Sigmoid + Gaussian) was 96.74% with 2,600 hidden units. It can be seen from Fig. 3 that the two-activation NNRW consistently outperforms the single-activation NNRWs, whose best accuracy was 96.19% with 2,600 hidden units. In comparison, a two-activation NNRW with 1,400 hidden units had an accuracy of 96.22%, which corresponds to about a 14% reduction in computations during inference.
Fig. 3. Results of the UCI letter recognition problem

C. Results of the MNIST Classification Problem
MNIST is a benchmark problem for handwritten digit recognition [26]. It consists of 10 classes, 60,000 training images, and 10,000 test images. The images are 28×28 pixels; we used the original MNIST dataset without any distortions, so the dimensionality of the feature vector was 784. For this problem, we initialized the hidden-layer weights using shaped input weights, as described in [22, 27], which are known to provide better accuracy. The results averaged over twenty-five trials are shown in Fig. 4. The best accuracy obtained was 98.96% using a two-activation NNRW with 7,000 hidden units. It can be seen that the two-activation NNRW consistently outperforms the single-activation NNRWs, whose best accuracy was 98.76% with 7,000 hidden units. In comparison, a two-activation NNRW had an accuracy of 98.71% with 3,000 hidden units, which corresponds to about a 56% reduction in computations during inference.
Fig. 4. Results of the MNIST classification problem

IV. CONCLUSION
We have experimentally shown that using multi-activation hidden units in NNRW results in overall superior performance. The proposed method can be used either to improve accuracy or to reduce computations. While our experiments are limited to two activation functions, it is certainly possible to use more. The activation functions used in NNRW need not be differentiable; therefore, there are many nonlinear functions to choose from. Further research is needed to determine which activation functions are complementary to each other.

REFERENCES

[1] Y.-H. Pao, Y. Takefuji, "Functional-link net computing: theory, system architecture, and functionalities," Computer 25 (5), 1992, pp. 76-79.
[2] W. F. Schmidt, M. A. Kraaijveld, R. P. Duin, "Feedforward neural networks with random weights," in: Proceedings of the 11th IAPR International Conference on Pattern Recognition, Vol. II, Conference B: Pattern Recognition Methodology and Systems, IEEE, 1992, pp. 1-4.
[3] Y.-H. Pao, S. M. Phillips, "The functional link net and learning optimal control," Neurocomputing 9 (2), 1995, pp. 149-164.
[4] Y.-H. Pao, G.-H. Park, D. J. Sobajic, "Learning and generalization characteristics of the random vector functional-link net," Neurocomputing 6 (2), 1994, pp. 163-180.
[5] W. Cao, X. Wang, Z. Ming, J. Gao, "A review on neural networks with random weights," Neurocomputing 275, 2018, pp. 278-287.
[6] L. Zhang, P. Suganthan, "A comprehensive evaluation of random vector functional link networks," Information Sciences 367, 2016, pp. 1094-1105.
[7] M. Li, D. Wang, "Insights into randomized algorithms for neural networks: Practical issues and common pitfalls," Information Sciences 382, 2017, pp. 170-178.
[8] L. Zhang, P. Suganthan, "A survey of randomized algorithms for training neural networks," Information Sciences 364, 2016, pp. 146-155.
[9] S. Scardapane, D. Wang, "Randomness in neural networks: an overview," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 7 (2), 2017.
[10] R. Katuwal, P. N. Suganthan, M. Tanveer, "Random Vector Functional Link Neural Network based Ensemble Deep Learning," arXiv:1907.00350, 2019.
[11] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, "Extreme learning machine: a new learning scheme of feedforward neural networks," in: IEEE International Joint Conference on Neural Networks, 2004, vol. 2, pp. 985-990.
[12] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, "Extreme learning machine: theory and applications," Neurocomputing 70 (1), 2006, pp. 489-501.
[13] G. Huang, G.-B. Huang, S. Song, K. You, "Trends in extreme learning machines: A review," Neural Networks 61, 2015, pp. 32-48.
[14] G.-B. Huang, L. Chen, "Convex incremental extreme learning machine," Neurocomputing 70 (16), 2007, pp. 3056-3062.
[15] G.-B. Huang, L. Chen, "Enhanced random search based incremental extreme learning machine," Neurocomputing 71 (16), 2008, pp. 3460-3468.
[16] G.-B. Huang, L. Chen, C.-K. Siew, "Universal approximation using incremental constructive feedforward networks with random hidden nodes," IEEE Transactions on Neural Networks 17 (4), 2006, pp. 879-892.
[17] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, A. Lendasse, "OP-ELM: Optimally pruned extreme learning machine," IEEE Transactions on Neural Networks 21 (1), 2010, pp. 158-162.
[18] H.-J. Rong, Y.-S. Ong, A.-H. Tan, Z. Zhu, "A fast pruned-extreme learning machine for classification problem," Neurocomputing 72 (1), 2008, pp. 359-366.
[19] N. Wang, M. J. Er, M. Han, "Parsimonious extreme learning machine using recursive orthogonal least squares," IEEE Transactions on Neural Networks and Learning Systems, 2014.
[20] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, "Extreme Learning Machine for Regression and Multiclass Classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 42 (2), April 2012, pp. 513-529.
[21] C. R. Rao, S. K. Mitra, Generalized Inverse of Matrices and Its Applications, Wiley, New York, 1971.
[22] M. D. McDonnell, M. D. Tissera, T. Vladusich, A. van Schaik, J. Tapson, "Fast, simple and accurate handwritten digit classification by training shallow neural network classifiers with the 'extreme learning machine' algorithm," PLOS One 10, 2015, pp. 1-20.
[23] D. W. Marquardt, R. D. Snee, "Ridge regression in practice," The American Statistician 29, 1975, pp. 3-20.
[24] P. Ramachandran, B. Zoph, Q. V. Le, "Searching for Activation Functions," arXiv:1710.05941, 2017.
[26] Y. LeCun et al., "Gradient-Based Learning Applied to Document Recognition," Proceedings of the IEEE 86 (11), 1998, pp. 2278-2324.
[27] https://github.com/McDonnell-Lab/Matlab-ELM-toolbox
[28] D. Klabjan, M. Harmon, "Activation ensembles for deep neural networks," in: 2019 IEEE International Conference on Big Data, Los Angeles, CA, USA, December 9-12, 2019, IEEE, pp. 206-214.
[29] A. Apicella, F. Donnarumma, F. Isgrò, R. Prevete, "A survey on modern trainable activation functions," arXiv:2005.00817, 2020.