On the loss of learning capability inside an arrangement of neural networks
Ivan Arraut
The Open University of Hong Kong, 30 Good Shepherd St., Kowloon, Hong Kong, China — [email protected]
Diana Diaz
Wayne State University, 5057 Woodward Ave., Detroit, MI, USA — [email protected]
Abstract
We analyze the loss of information and the loss of learning capability inside an arrangement of neural networks. Our method is new and based on the formulation of non-unitary Bogoliubov transformations in order to connect the information between different points of the arrangement. This can be done after expanding the activation function in a Fourier series and then assuming that its information is stored inside a Quantum scalar field.
1 Introduction

Artificial neural networks were first proposed in 1943 as an attempt to simulate the way the human brain operates [1]. At that time, logic, understood as the way input information is interpreted, was the key ingredient in the formulation of this concept. Subsequently, the notions of Hebbian learning were developed [2]; the Hebbian network was later analyzed in [3]. The perceptron was proposed in 1958 by Rosenblatt, who assumed that there exists a bridge between psychology and biophysics. The perceptron is based on three fundamental questions: 1. How is information about the physical world sensed or detected by the biological system? 2. In what form is information stored or remembered? 3. How does information contained in storage or memory influence recognition and behavior?

By complementing perceptrons with connections (synapses in biological terms), and by taking the output of one neuron as the input of another, we create a primitive arrangement of neural networks. Although this is a good starting point, it is difficult for an arrangement of neural networks based on perceptrons to learn in the sense of Machine Learning. The reason is that when we use perceptrons, a small change in the input can create huge changes in the output, destroying in many cases the possibility of learning [4]. When a system of neural networks learns, it normally makes improvements based on variations of the weights (the importance of the information transmitted through the synapses) and the bias (related to the threshold). Examples of this can be found in the implementation of the method of gradient descent to minimize the cost function [5]. In such a case, perceptrons would operate terribly, because any attempt to learn some specific output information from a single output neuron might easily destroy what has been learned by other output neurons.
This situation can be solved by introducing an activation function f(x) different from the standard step function used in the perceptron scenario. The most common non-trivial activation function is the sigmoid function, and here we will take it as the ideal output of a single neuron. The biggest advantage of the sigmoid function is its high sensitivity in responding to small changes in the inputs: it maps small input changes into small output changes. Those small changes might be entered through variations of the weights and bias. In this way, the neural network will be able to learn by updating these values until it gets the desired result. Normally the process of learning is based on the minimization of the cost function. Once the system learns something, ideally it does not forget it and the information is in principle preserved. However, there are cases where information might be lost due to different effects, and this might affect the capability of the system to learn.

Second Workshop on Machine Learning and the Physical Sciences (NeurIPS 2019), Vancouver, Canada.

In this paper, we introduce a novel method for analyzing the loss of learning capability and/or the loss of information inside a neural network arrangement. Our method is based on the Fourier expansion of the activation functions, taken as scalar Quantum fields [6]. We then employ the concept of Bogoliubov transformations [7] to relate the information going from one neuron to another, as was proposed in [8]. In this scenario, we interpret the loss of information as the loss of unitarity in the transformation. In this way, we evaluate the conditions under which the loss of unitarity transforms our sigmoid response functions (a system able to learn) into a standard step function (a system unable to learn), destroying the capability of the network to learn.
Interestingly, we understand that the amount of information stored is also connected with the learning capability of the system, just as happens in biological systems. The paper is organized as follows: In Sec. (2), we introduce the basic concept of perceptrons, understanding that the activation function is a step function in such a case. In Sec. (3), we explain the neural network arrangement using the sigmoid as an activation function. In Sec. (4), we analyze the evolution of the information in the system, including losses, by storing the information locally inside a scalar Quantum field. We then analyze how much of this information is lost after crossing through the synapse toward another arbitrarily selected neuron. This is achieved by using the famous Bogoliubov transformations [7]. Finally, in Sec. (5), we conclude.

[Figure 1: Perceptron neuron with three input variables and a single output, 0 or 1. The inputs are represented by x_1, x_2 and x_3. Here ω_i are the weights corresponding to each input.]

2 Perceptrons

Let's start with the basic definition of neural networks, by considering each neuron to be equivalent to a perceptron in the form originally proposed in [1]. A perceptron takes different inputs, defined by a set of variables x_1, x_2, x_3, ..., x_n, and it reproduces a single binary output [4]. The neuron in such a case reproduces some specific output m_j, defined by either m_j = 0 or m_j = 1 (j = 1, 2, 3, ... labels the output neurons), depending on whether or not the weighted sum of inputs is larger than some specified threshold. The threshold can also appear in the form of a bias. Specifically, the condition imposed on the perceptrons is

$$A = \begin{cases} 0, & \text{if } \sum_j \omega_j x_j + b \le 0 \\ 1, & \text{if } \sum_j \omega_j x_j + b > 0. \end{cases} \qquad (1)$$

Here b corresponds to the bias of the perceptron and it is equivalent to the threshold of the neuron. Figure (1) illustrates the standard behavior of a perceptron neuron. We can perceive the perceptron as a neuron able to make a decision based on weighted input evidence. From the perspective of biological neurons, the weights ω_j defined in eq. (1) represent the importance which the synapse gives to the different input patterns. One example with two weights can be the decision of wearing a certain color of clothes. We can assume that the output A = 0 means dressing in black or blue and the output A = 1 means dressing in any color different from black or blue. We can then take the first binary input as x_1 = 1 if there is at least one protest in the streets, and x_1 = 0 if there are no protests. Similarly, we might take the second input as x_2 = 1 if the protesters dress in black and the policemen dress in blue (or vice-versa), and x_2 = 0 if the protesters and the policemen dress in any color different from black or blue. Based on this, our system has to decide if it is good to dress in black or blue during a protest season like the one lived in Hong Kong. Our system could give weights to the pair of binary inputs; let's take the same weights for both inputs, ω_1 = ω_2 = 2. This means that both inputs are equally important for the system. If we take the bias to be b = −3, then in agreement with eq. (1), the system will decide to dress in black or blue only if there are no protests on the streets, or if the protesters and police (in case of having protests on the streets) are not dressing in the mentioned colors (black or blue). On the other hand, the decision of not dressing in those colors is only taken if both situations appear, namely, there is a protest and the police and protesters are dressing in the mentioned colors. In this simple situation, a system operating with perceptrons will be enough.
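As a minimal sketch of the decision rule in eq. (1), the snippet below implements the clothing example. The weights are the ones given in the text; the bias value of −3 is our assumption, chosen so that the perceptron fires only when both inputs are active, matching the behavior described above.

```python
def perceptron(x1, x2, w1=2, w2=2, b=-3):
    """Decision rule of eq. (1): output 1 iff the weighted sum plus bias is positive."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

# A = 1 (avoid black/blue) only when there is a protest AND the colors clash;
# in every other case A = 0 (dressing black or blue is fine).
table = {(x1, x2): perceptron(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
print(table)  # {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
```

Any bias in the open interval (−4, −2] would give the same truth table with these weights; the choice only fixes where the threshold sits.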
More complex arrangements can be made if we create a more complex network like the one in Figure (2). To show the difficulties that perceptron neurons have in learning some specific tasks appropriately, we study sigmoid neurons and focus on this important aspect of artificial neural networks.

3 Sigmoid neurons
Sigmoid neurons are different from perceptrons because the response function of the neurons is not a step function, but rather a sigmoid function defined as

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad (2)$$

where z = Σ_j ω_j x_j + b, which is the same argument used for the perceptrons in eq. (1). Note that the response function can take any real value between 0 and 1, which marks a significant difference from the perceptron case. This corresponds to a huge advantage for the process of learning when compared with the perceptron. Note that there are two limits of the sigmoid function where the behavior corresponds to perceptrons; this happens when z = ±∞. The biggest advantage of the sigmoid function for learning is that small changes in the inputs correspond to small changes in the outputs. This can be quantified mathematically as

$$\Delta m \approx \sum_j \frac{\partial m}{\partial \omega_j}\Delta\omega_j + \frac{\partial m}{\partial b}\Delta b. \qquad (3)$$

This is only the ordinary chain rule of Calculus. The superiority in learning of neural networks using the sigmoid as a response function can be seen in the example of a simple network classifying handwritten digits with a three-layer neural network. The input layer of the network contains neurons encoding the values of the input pixels, and the output layer contains ten neurons, each one representing a different digit from 0 to 9. If the first neuron is excited, then the system identifies a 0; if the second output neuron is the one excited, then the system naturally identifies a 1, and so on. Let's assume that the system has a problem identifying the number 9, i.e. the output neuron corresponding to the digit 9 does not get excited even if the input image is a 9. Then the system would have to be adjusted by gradually changing the weights and bias, such that the last neuron fires in the appropriate case. This is what we call the process of learning. The best way of doing this is to make small changes in the weights and the bias until the system can identify the number 9.
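The sensitivity claim in eq. (3) can be checked numerically. The sketch below compares the true change of a single sigmoid output under a small weight perturbation with the first-order estimate from the chain rule; the particular weights, bias and perturbation sizes are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, 2.0])   # weights (illustrative values)
b = -3.0                   # bias
x = np.array([1.0, 1.0])   # inputs

m = sigmoid(w @ x + b)

# Small perturbation of the weights...
dw = np.array([1e-3, -2e-3])
m_new = sigmoid((w + dw) @ x + b)

# ...versus the first-order estimate of eq. (3): dm/dw_j = sigma'(z) * x_j,
# with sigma'(z) = sigma(z) * (1 - sigma(z)).
s = sigmoid(w @ x + b)
approx = (s * (1 - s) * x) @ dw

print(m_new - m, approx)   # two nearly identical small numbers
```

The agreement to several decimal places is exactly what makes gradient-based weight updates work; a step-function response has zero derivative almost everywhere, so the estimate in eq. (3) carries no information there.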
The small changes in weights and bias will be detected as small changes in the outputs, in agreement with eq. (3). This then gives the advantage of testing all the other numbers simultaneously; equivalently, in this way we avoid damaging the adjustments of the other output neurons corresponding to the identification of the other numbers. This clever way of learning is impossible if we use perceptrons, because with perceptrons we could have drastic changes in the outputs for a finite change in the inputs. It is for this reason that the possible transition from the sigmoid response to the perceptron response, as a consequence of the loss of information, deserves careful attention.

4 Evolution of the information in the system

Intuitively it is expected (although not obvious) that the loss of information in a system must be related to the loss of the system's ability to learn. Here we use the method proposed in [8] and extend it to real scenarios. First, we redefine the response functions as Quantum scalar fields storing some amount of information. In this way, we define the Quantum field for the Heaviside function as

$$\langle 0|\hat{\phi}(z)|0\rangle = f_p(z) = \sum_{k=-\infty}^{z}\delta[k] = \langle 0|\sum_k \left(p_k(z)\hat{b}_k + \bar{p}_k(z)\hat{b}^+_k\right)|0\rangle. \qquad (4)$$

Here $\delta[k]$ is the standard Dirac delta function, which characterizes the step-function (perceptron) behavior, and $|0\rangle$ corresponds to the standard vacuum state. Note, however, that the vacuum $|0\rangle$ is not the vacuum for the modes $p_k(z)$, as we will explain in a moment. In eq. (4), the right-hand side defines the function as a field expanded in terms of positive and negative frequency modes $p_k(z)$ and $\bar{p}_k(z)$, with the corresponding annihilation and creation operators $\hat{b}_k$ and $\hat{b}^+_k$ respectively. Figure (2) shows a point in the network where the information stored in the system obeys the distribution (4). Note that we are taking the step-function behavior as the total output of the system.
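Read as a discrete sum, eq. (4) says the perceptron response is a cumulative sum of deltas. A small numerical sketch recovers the Heaviside step; note that treating δ[k] as a Kronecker delta centered at k = 0, and truncating the infinite lower limit at a finite value, are our reading of the formula, not part of the paper's construction.

```python
import numpy as np

def f_p(z, k_min=-50):
    # Discrete reading of eq. (4): cumulative sum of Kronecker deltas delta[k],
    # truncated below at k_min to approximate the infinite lower limit.
    ks = np.arange(k_min, int(np.floor(z)) + 1)
    return float(np.sum(ks == 0))

print(f_p(-3.0), f_p(-0.5), f_p(0.0), f_p(4.0))  # 0.0 0.0 1.0 1.0
```

The sum only ever picks up the single delta at k = 0, so the output jumps from 0 to 1 exactly at the threshold, reproducing the perceptron behavior of eq. (1).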
Locally, the operators obey the local algebra

$$[\hat{b}_k, \hat{b}^+_{k'}] = \delta_{k,k'}, \quad \text{information preserved}; \qquad [\hat{b}_k, \hat{b}^+_{k'}] \neq \delta_{k,k'}, \quad \text{information lost}. \qquad (5)$$

For the output neuronal response function, we define a vacuum state as $\hat{b}_k|\bar{0}\rangle = 0$. In the same way, we can expand the sigmoid function by identifying it with the scalar field

$$\langle 0|\hat{\phi}(z)|0\rangle = f_\sigma(z) = \frac{1}{1+e^{-z}} = \langle 0|\sum_k \left(f_k(z)\hat{a}_k + \bar{f}_k(z)\hat{a}^+_k\right)|0\rangle. \qquad (6)$$

[Figure 2: A standard neural network with input, hidden and output layers. The information flows from the input to the output. There might be loss of information during the transmission through the synapses, as the figure illustrates. We take the input information as stored in a Quantum field; the same occurs for the output information. The loss of information is reduced to zero if $[\hat{b}_k, \hat{b}^+_{k'}] = \delta_{k,k'} = [\hat{a}_k, \hat{a}^+_{k'}]$. Information is lost at the output when $[\hat{b}_k, \hat{b}^+_{k'}] \neq [\hat{a}_k, \hat{a}^+_{k'}]$.]

In a standard form, we will take the algebra of creation and annihilation operators for this field as fixed and equal to

$$[\hat{a}_k, \hat{a}^+_{k'}] = \delta_{k,k'}. \qquad (7)$$

In Figure (2), we can see that the sigmoid information appears complete at the input neurons. This means that we are taking the Quantum field (6) as a field containing all the information; there are no losses at this point. Here we also define another local vacuum condition, $\hat{a}_k|0\rangle = 0$. Note that this vacuum $|0\rangle$ is different from the vacuum $|\bar{0}\rangle$ defined for the $\hat{b}_k$ operators. This means that at different points of the network we have a different amount of information, and at each point we define "empty" or "no information" in a different way. We can then connect the operators defined in eq. (4) with those defined in eq. (6) via Bogoliubov transformations.
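To make the role of the Bogoliubov coefficients concrete, the toy sketch below works with a single-mode transformation of the form $\hat{b} = \bar\alpha\,\hat{a} - \bar\beta\,\hat{a}^+$ on a truncated Fock space; the truncation size and the parameter values are our assumptions, chosen purely for illustration. It checks that the commutator of eq. (5) is preserved only when $|\alpha|^2 - |\beta|^2 = 1$, and that for $\beta \neq 0$ the $\hat{b}$ modes see excitations in the $\hat{a}$ vacuum.

```python
import numpy as np

N = 12                                        # Fock-space truncation (toy choice)
a = np.diag(np.sqrt(np.arange(1, N)), k=1)    # annihilation operator
ad = a.conj().T                               # creation operator

# Unitary case: alpha = cosh(r), beta = sinh(r) gives alpha^2 - beta^2 = 1.
r = 0.3
alpha, beta = np.cosh(r), np.sinh(r)
b = alpha * a - beta * ad                     # single-mode Bogoliubov transform
comm = b @ b.conj().T - b.conj().T @ b        # [b, b^+]
print(comm[0, 0])                             # 1.0: the algebra of eq. (5) holds

# The a-vacuum is not the b-vacuum: <0| b^+ b |0> = beta^2 > 0, so the
# b-modes register excitations where the a-modes see nothing.
vac = np.zeros(N); vac[0] = 1.0
n_b = vac @ (b.conj().T @ b) @ vac
print(n_b)                                    # beta**2

# Non-unitary case: rescaling beta breaks the commutator (information loss).
b2 = alpha * a - 1.5 * beta * ad
comm2 = b2 @ b2.conj().T - b2.conj().T @ b2
print(comm2[0, 0])                            # != 1
```

In this toy model the matrix element `comm[0, 0]` plays the role of $\delta_{k,k'}$ in eq. (5): it equals 1 exactly when the transformation is unitary, and drifts away from 1 as the transformation loses unitarity.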
Then we have the relation

$$\hat{b}_k = \sum_j \left(\bar{\alpha}_{kj}\hat{a}_j - \bar{\beta}_{kj}\hat{a}^+_j\right). \qquad (8)$$

For our purposes, this is the important relation, because it is the one which establishes the connection between the local vacua defined at each point of the arrangement. Note that if $\bar{\beta}_{kj} = 0$ in the previous equation, then the information is preserved and $|0\rangle = |\bar{0}\rangle$ unambiguously. This is the case because in these special circumstances both operators $\hat{a}_k$ and $\hat{b}_k$ annihilate the same vacuum; then $[\hat{a}_k, \hat{a}^+_{k'}] = \delta_{k,k'} = [\hat{b}_k, \hat{b}^+_{k'}]$. In this case the sigmoid response function will never change into a step function. On the other hand, if $\bar{\beta}_{kj} \neq 0$, then we have loss of information, and then $[\hat{a}_k, \hat{a}^+_{k'}] = \delta_{k,k'} \neq [\hat{b}_k, \hat{b}^+_{k'}]$. If we want to search for the information loss, we can expand the sigmoid function (6) in terms of the modes of the step function (4) as

$$\langle 0|\hat{\phi}(z)|0\rangle = f_\sigma(z) = \frac{1}{1+e^{-z}} = \langle 0|\sum_k \left(p_k(z)\hat{b}_k + \bar{p}_k(z)\hat{b}^+_k + q_k(z)\hat{c}_k + \bar{q}_k(z)\hat{c}^+_k\right)|0\rangle. \qquad (9)$$

The above result suggests that, to reproduce the information of the sigmoid function using the step response function as a starting point, it is necessary to add some extra modes $q_k$, with their corresponding operators $\hat{c}_k$, to the step function. Then it is possible to conclude that the modes leaving the system and representing the loss of information are

$$\langle 0|\hat{\phi}_\sigma(z) - \hat{\phi}_p(z)|0\rangle = f_\sigma(z) - f_p(z) = \langle 0|\sum_k \left(q_k(z)\hat{c}_k + \bar{q}_k(z)\hat{c}^+_k\right)|0\rangle \qquad (10)$$
$$= \frac{1}{2}\left(\tanh(z/2) + 1\right) - \frac{\pi}{\sinh(z/2)}\int_{-\infty}^{\infty} dz \left(\cos(kz) - i\sin(kz)\right).$$

This expansion represents the information escaping the system as it flows from the inputs toward the outputs. The conditions for getting a step function from an initial input sigmoid are then reduced to the analysis of the matrix elements $\bar{\beta}_{kj}$ in eq. (8). Note that the difference between equations (9) and (10) is the presence of the modes expanded by the operators $\hat{b}_k$, which are the modes perceived by the system after losing information. The $\hat{b}_k$ modes can be quantified in a similar way as Hawking quantified the modes escaping the event horizon of a Black-Hole [9]. Then our work, besides contributing to the understanding of the loss of learning capability of a neural network arrangement, also provides the opportunity of testing possible scenarios where we can find analogue models in which Hawking radiation and similar effects could be tested. As a final remark, note that all the vacuum expectation values are evaluated with respect to a common vacuum $|0\rangle$. This corresponds to the vacuum for the modes $f_k(z)$. Only if we use a common vacuum can we compare the amount of information stored at different locations of the network. This technique is original and it is the main contribution of our paper.

5 Conclusions

Here, we have presented a novel method for analyzing the loss of information in a neural network. The method also connects the loss of information with the loss of the system's learning capability. It consists of promoting the response functions to Quantum fields expanded as a function of positive and negative frequencies. If the information is preserved in the system, the Bogoliubov coefficient $\beta_{kj}$ vanishes and the algebra of commutators is preserved; in Quantum mechanics this is known as unitarity. On the other hand, a non-trivial value of $\beta_{kj}$ would imply the non-conservation of information and then the possibility of losing the advantages of the sigmoid response function.
A similar effect appears in the evaporation of a Black-Hole, and it is known as Hawking radiation [9]. This effect can be cumulative, and the system could effectively behave as a standard perceptron even if the neurons were originally configured with a sigmoid response function. The take-home message of this work is that if a system stores information in agreement with a bosonic field, its loss of information can be quantified by a non-unitary Bogoliubov transformation. The non-unitarity registers the amount of information that we can never recover again, and in addition it affects the learning process of the system. Note that the loss of information explored here is different from the information analyzed in [10, 11]. In fact, the extraction of information based on the Bottleneck principle promoted in [10, 11] consists of the elimination of what we interpret as initial noise. We could use techniques similar to the ones shown in this paper for explaining such an effect, but we will leave this part for a future paper. Our paper instead deals with the loss of useful information, namely, the information which disappears assuming that we are able to eliminate the noise in advance. More details about this will also be given in a more formal paper. In subsequent works, we will also deliver more details of our analysis, including possible algorithms, which will take time to get ready but are in progress. The limitations in this task are connected to the limitations in the understanding of the storage of information using bosonic systems. Bosonic systems are able to store more information than qubits, but they can also lose it more easily. This means that our analysis of the loss of information in this paper is extremely important for the development of more efficient systems for storing information.

References

[1] W. McCulloch and W. Pitts,
A Logical Calculus of Ideas Immanent in Nervous Activity, Bull. of Math. Biophys. 5(4): 115–133, (1943).
[2] D. Hebb, The Organization of Behavior, New York: Wiley, ISBN 978-1-135-63190-1, (1949).
[3] B. G. Farley and W. A. Clark, Simulation of Self-Organizing Systems by Digital Computer, IRE Transactions on Information Theory 4(4): 76–84, (1954).
[4] M. Nielsen, Neural Networks and Deep Learning, online book available at http://neuralnetworksanddeeplearning.com/.
[5] O. Goudet, B. Duval and J. K. Hao, Gradient Descent based Weight Learning for Grouping Problems: Application on Graph Coloring and Equitable Graph Coloring, arXiv:1909.02261 [cs.LG]; N. J. A. Harvey, C. Liaw and S. Randhawa, Simple and optimal high-probability bounds for strongly-convex stochastic gradient descent, arXiv:1909.00843 [cs.LG]; Y. Xie, X. Wu and R. Ward, Linear Convergence of Adaptive Stochastic Gradient Descent, arXiv:1908.10525 [stat.ML].
[6] M. E. Peskin and D. V. Schroeder, An Introduction to Quantum Field Theory, CRC Press, Taylor and Francis Group, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742, (2018).
[7] J. G. Valatin, Comments on the theory of superconductivity, Il Nuovo Cimento 7(6): 843–857, (1958); N. N. Bogoliubov, On a new method in the theory of superconductivity, Il Nuovo Cimento 7(6): 794–805, (1958).
[8] I. Arraut, Black-hole evaporation from the perspective of neural networks, EPL 124 (2018) no. 5, 50002.
[9] S. W. Hawking, Particle creation by black holes, Commun. Math. Phys. 43, 199–220, (1975).
[10] N. Tishby and N. Zaslavsky, Deep Learning and the Information Bottleneck Principle, arXiv:1503.02406 [cs.LG].
[11] R. Shwartz-Ziv and N. Tishby, Opening the Black Box of Deep Neural Networks via Information, arXiv:1703.00810 [cs.LG].