Derivation of the Backpropagation Algorithm Based on Derivative Amplification Coefficients
Yiping Cheng
School of Electronic and Information Engineering, Beijing Jiaotong University, Beijing 100044, China
[email protected]
ABSTRACT.
The backpropagation algorithm for neural networks is widely felt to be hard to understand, despite the existence of some well-written explanations and derivations. This paper provides a new derivation of this algorithm based on the concept of derivative amplification coefficients. First proposed by this author for fully connected cascade networks, this concept is found to carry over well to conventional feedforward neural networks, and it paves the way for the use of mathematical induction in establishing a key result that enables backpropagation for derivative amplification coefficients. We then establish the connection between derivative amplification coefficients and error coefficients (commonly referred to as errors in the literature), and show that the same backpropagation procedure can be used for error coefficients. The entire derivation is thus rigorous, simple, and elegant.
Keywords: neural networks; backpropagation; derivative amplification coefficients; machine learning
1 Introduction

The backpropagation algorithm is extremely efficient for computing the gradient of error functions in machine learning using neural networks. Combined with the stochastic gradient descent method for optimization, it has long been the cornerstone of efficient neural network training. However, this algorithm is widely considered by practitioners in the field to be complicated, as they often feel frustrated when trying to gain a full understanding of it. To be fair, there are several well-written explanations and derivations of backpropagation, such as the ones in [1, 2], but a large portion of readers remain confused, myself included. To be frank, I was nearly totally lost when reading the derivation in [1], and I also felt frustrated reading the "How backpropagation works" chapter of [2], since it relies on an additional mathematical tool, the Hadamard product. I wished for a simple derivation, and I believed there must be one.

I am not a complete newcomer to neural networks. In fact, I have some research experience with an unconventional neural network architecture, fully connected cascade networks, and I have published a paper [3] in that area. In that paper I proposed a backpropagation algorithm for the gradient of fully connected cascade networks. However, when I wrote that paper I did not realize that the concepts proposed in it could be used to write a new derivation of the well-known backpropagation algorithm for conventional feedforward neural networks (multilayer perceptrons). Only recently did I come to the idea that the concept of derivative amplification coefficients may carry over to multilayer perceptrons, and that a new derivation could be written which may be more accessible than the existing ones to a wide audience of neural network practitioners. This paper is the result of that effort.

The rest of this paper is structured as follows.
Section 2 introduces neural networks as nonlinear functions, which also sets out the notation. Section 3 explains why we need to compute the partial derivatives. Section 4 defines derivative amplification coefficients for multilayer perceptrons. Section 5 establishes key results that enable backpropagation for derivative amplification coefficients. Section 6 defines error coefficients and shows that the same backpropagation can be used for error coefficients, which actually appear in the algorithm. We conclude this paper with Section 7.
2 Neural Networks as Nonlinear Functions

In this paper neural networks are understood as conventional feedforward neural networks, i.e. multilayer perceptrons. There is nothing mysterious in a neural network. It is merely a nonlinear function composed of layers, as depicted below. There is an input layer, a number of hidden layers, and an output layer in a neural network. Each hidden layer receives input from its previous layer and sends its output to its next layer. The input layer has no previous layer and the output layer has no next layer. The input layer does not do any processing, so it is not included when counting the number of layers of a neural network. Therefore when we speak of a "three-layer neural network", we mean that the number of hidden layers in the neural network is two.

Consider a neural network with $L$ layers. We designate the input layer as the $0$-th layer, the hidden layers as the $1$-st to $(L-1)$-th layers, and the output layer as the $L$-th layer. Let us denote the input, which is a vector, by
$$x = (x_1, \cdots, x_n), \tag{1}$$
and the output, which is also a vector, by
$$y = (y_1, \cdots, y_m). \tag{2}$$
We denote the output of the $l$-th layer, which is again a vector, by
$$z_l = (z_{l,1}, \cdots, z_{l,h_l}). \tag{3}$$
Then we have
$$z_0 = x, \quad h_0 = n, \tag{4}$$
$$z_L = y, \quad h_L = m. \tag{5}$$
Each hidden layer and the output layer produces its output in two steps. The first step is a weighted sum and the second step is activation. Let us denote the $l$-th layer weighted sum by
$$u_l = (u_{l,1}, \cdots, u_{l,h_l}). \tag{6}$$
Then for each $l$ with $1 \le l \le L$ and $i$ with $1 \le i \le h_l$, we have
$$u_{l,i} = w_{l,i,0} + \sum_{j=1}^{h_{l-1}} w_{l,i,j} \, z_{l-1,j}, \tag{7}$$
$$z_{l,i} = \phi_l(u_{l,i}), \tag{8}$$
where $\phi_l$ is the activation function for the $l$-th layer. All hidden layers typically use the same activation function, which must be nonlinear. The output layer will typically use a different activation function from the hidden layers, depending on the purpose of the network.

Eqs. (7, 8) describe a processing unit which turns input from the previous layer into one component of this layer's output vector. It is called a node, or a neuron, and is represented in the above figure by a dot. Thus the $l$-th layer has $h_l$ neurons. The power of neural networks lies in the fact that a neural network can approximate any smooth function arbitrarily well, provided that it has at least one hidden layer and a sufficient number of neurons.

The weights $w_{l,i,j}$ are the real parameters of the neural network; together with the integer parameters $L$ and $n, h_1, \cdots, h_L$, they uniquely determine the behavior of the network, i.e. the function that it represents. For notational convenience, let us now group the weights into a vector
$$w = (w_1, \cdots, w_L), \tag{9}$$
where for each $l = 1, \ldots, L$,
$$w_l = (w_{l,1}, \cdots, w_{l,h_l}), \tag{10}$$
where for each $i = 1, \ldots, h_l$,
$$w_{l,i} = (w_{l,i,0}, w_{l,i,1}, \cdots, w_{l,i,h_{l-1}}). \tag{11}$$
Thus, $w$ consists of $\sum_{l=1}^{L} (1 + h_{l-1}) h_l$ weights.

3 Why We Need to Compute the Partial Derivatives

Since a neural network defines a multi-input multi-output function $y = f(x; w)$ ($x$ being the input vector and $w$ being the vector of weights), it has a Jacobian at any point in the Euclidean space, with respect to both $x$ and $w$. Computing the partial derivatives that form the Jacobian with respect to $w$ is important for all applications of neural networks, as we will shortly show. Let us here take the regression application of neural networks as an example.

A regression problem is: given a set of input-output data $\{(x^{(k)}, d^{(k)})\}_{k=1}^{N}$, where $N$ is the number of data points, $x^{(k)} \in \mathbb{R}^n$ and $d^{(k)} \in \mathbb{R}^m$, find a function $f$ from a particular function space that best matches the data. Now suppose that we have chosen the function space to be neural networks with $L$ layers and $h_1, \cdots, h_{L-1}$ neurons for the hidden layers respectively. Now it remains to find the weights.
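Before turning to the problem of finding the weights, the forward computation (7)-(8) can be sketched in code. The following Python fragment is an illustrative sketch, not part of the paper: the nested-list encoding of the weights (`weights[l-1][i-1]` holding $(w_{l,i,0}, \ldots, w_{l,i,h_{l-1}})$, with 0-based Python indices) is a convention assumed here for concreteness.

```python
def forward(weights, phis, x):
    """Forward pass of a multilayer perceptron per Eqs. (7)-(8).

    weights[l-1][i-1] holds (w_{l,i,0}, w_{l,i,1}, ..., w_{l,i,h_{l-1}}),
    phis[l-1] is the activation phi_l of layer l, and x is the input vector.
    Returns (u, z): per-layer weighted sums and outputs, with z[0] = x.
    """
    z = [list(x)]        # z_0 = x  (Eq. 4)
    u = [None]           # the input layer has no weighted sum
    for l, layer in enumerate(weights):
        # u_{l,i} = w_{l,i,0} + sum_j w_{l,i,j} * z_{l-1,j}   (7)
        u_l = [row[0] + sum(wj * zj for wj, zj in zip(row[1:], z[-1]))
               for row in layer]
        u.append(u_l)
        # z_{l,i} = phi_l(u_{l,i})                            (8)
        z.append([phis[l](v) for v in u_l])
    return u, z
```

For instance, a single neuron with identity activation, bias $1$, weights $(2, 3)$ and input $(1, 1)$ gives $u_{1,1} = 1 + 2 + 3 = 6$.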
Reasonably, an optimal set of weights should minimize the following error function:
$$E(w) = \frac{1}{2} \sum_{k=1}^{N} \| f(x^{(k)}; w) - d^{(k)} \|^2 = \frac{1}{2} \sum_{k=1}^{N} \sum_{o=1}^{m} [f_o(x^{(k)}; w) - d_o^{(k)}]^2. \tag{12}$$
Using such a criterion, the regression problem reduces to an optimization problem. There are two classes of optimization methods: gradient-based and non-gradient-based. For neural network regression problems, gradient-based methods are much faster than non-gradient-based methods. In fact, the method currently prevalent in neural network optimization is stochastic gradient descent, abbreviated as SGD. In SGD, an initial weight vector is randomly chosen, and the algorithm continuously updates it incrementally in the direction of steepest descent based on the newly arrived data $(x^{(k)}, d^{(k)})$. The gradient of (12) is additive, i.e., the gradient of the error involving many data samples is the sum of the gradients of the errors each involving a single data sample. Therefore in the sequel let us consider only one data sample $(x^{(k)}, d^{(k)})$. In addition, for the sake of notational brevity, let us also drop the $\cdot^{(k)}$ superscript. So let us assume that we have data $(x, d)$, the current weight vector is $w$, and the error function is
$$E(x; w) = \frac{1}{2} \| f(x; w) - d \|^2 = \frac{1}{2} \sum_{o=1}^{m} [f_o(x; w) - d_o]^2. \tag{13}$$
Then the gradient of $E$ with respect to $w$ is composed of the following partial derivatives:
$$\frac{\partial E}{\partial w_{l,i,j}}(x; w) = \sum_{o=1}^{m} [f_o(x; w) - d_o] \, \frac{\partial f_o}{\partial w_{l,i,j}}(x; w). \tag{14}$$
We then see that computing the partial derivatives of the error with respect to the weights boils down to computing the partial derivatives of the network output functions with respect to the weights.

4 Derivative Amplification Coefficients

It is perceived by the present author that the existing derivations often confuse functions with their output variables.
It is believed by the present author that the variable is not an official mathematical concept but the function is, and that the chain rule should be applied to functions, not variables. In the opinion of the present author, purported applications of the chain rule to variables may make the results problematic or unconvincing, especially since neural network functions involve many levels of composition. Therefore, to be mathematically precise and rigorous, we choose to adopt a pure function point of view. So in the following, we define the functions that map $(x; w)$ to $u_{l,i}$ and $z_{l,i}$. It is done in a recursive manner.

For each $i$ with $1 \le i \le n$,
$$f_i^{[0]}(x; w) \stackrel{\mathrm{def}}{=} x_i. \tag{15}$$
And for each $l$ with $1 \le l \le L$ and $i$ with $1 \le i \le h_l$,
$$g_i^{[l]}(x; w) \stackrel{\mathrm{def}}{=} w_{l,i,0} + \sum_{j=1}^{h_{l-1}} w_{l,i,j} \, f_j^{[l-1]}(x; w), \tag{16}$$
$$f_i^{[l]}(x; w) \stackrel{\mathrm{def}}{=} \phi_l(g_i^{[l]}(x; w)). \tag{17}$$
Since the $L$-th layer is the output layer, for all $o$ with $1 \le o \le m$,
$$f_o^{[L]} = f_o. \tag{18}$$
The output variables of the above functions with $(x; w)$ as input are so frequently used in this paper that we give them simplified notations. We denote
$$u_{l,i} \stackrel{\mathrm{def}}{=} g_i^{[l]}(x; w) \quad \text{and} \quad u_l = (u_{l,1}, \cdots, u_{l,h_l}), \tag{19}$$
and
$$z_{l,i} \stackrel{\mathrm{def}}{=} f_i^{[l]}(x; w) \quad \text{and} \quad z_l = (z_{l,1}, \cdots, z_{l,h_l}). \tag{20}$$
These two notation rules (19-20) are also consistent with our informal notation rules (3-8).

Now we are in a position to define the derivative amplification coefficients. Like $u_{l,i}$ and $z_{l,i}$, they are also outputs of functions with $(x; w)$ as input, so they vary as $(x; w)$ varies.
But we drop this argument $(x; w)$ for notational simplicity. For all $1 \le l \le r \le L$ and $1 \le i \le h_l$, $1 \le t \le h_r$, the derivative amplification coefficient from node $(l, i)$ to node $(r, t)$, denoted by $\alpha_{l,i \to r,t}$, is defined recursively as follows:
$$\alpha_{l,i \to l,t} \stackrel{\mathrm{def}}{=} \begin{cases} 1, & \text{if } i = t, \\ 0, & \text{otherwise;} \end{cases} \tag{21}$$
and if $l < r$,
$$\alpha_{l,i \to r,t} \stackrel{\mathrm{def}}{=} \sum_{j=1}^{h_{r-1}} w_{r,t,j} \, \phi'_{r-1}(u_{r-1,j}) \, \alpha_{l,i \to r-1,j}. \tag{22}$$
The role that derivative amplification coefficients play in the computation of partial derivatives is seen in the following theorem.

Theorem 1.
For $1 \le l \le r \le L$ and $1 \le i \le h_l$, $0 \le j \le h_{l-1}$, $1 \le t \le h_r$,
$$\frac{\partial g_t^{[r]}}{\partial w_{l,i,j}}(x; w) = \alpha_{l,i \to r,t} \cdot \frac{\partial g_i^{[l]}}{\partial w_{l,i,j}}(x; w). \tag{23}$$

Proof.
The result is proved by mathematical induction on $r$. For the base case, i.e. $r = l$, the result obviously holds. Now as induction hypothesis suppose that the result holds when $r = q \ge l$. Then for all $1 \le t \le h_{q+1}$, by (16, 17),
$$g_t^{[q+1]}(x; w) = w_{q+1,t,0} + \sum_{s=1}^{h_q} w_{q+1,t,s} \, \phi_q(\underbrace{g_s^{[q]}(x; w)}_{u_{q,s}}).$$
Note that in the above equation the index of summation is changed from $j$ to $s$, in order to avoid a symbol collision with the $j$ in $w_{l,i,j}$. Then
$$\begin{aligned}
\frac{\partial g_t^{[q+1]}}{\partial w_{l,i,j}}(x; w) &= \sum_{s=1}^{h_q} w_{q+1,t,s} \, \phi'_q(u_{q,s}) \, \frac{\partial g_s^{[q]}}{\partial w_{l,i,j}}(x; w) && \text{by chain rule} \\
&= \sum_{s=1}^{h_q} w_{q+1,t,s} \, \phi'_q(u_{q,s}) \, \alpha_{l,i \to q,s} \cdot \frac{\partial g_i^{[l]}}{\partial w_{l,i,j}}(x; w) && \text{by induction hypothesis} \\
&= \alpha_{l,i \to q+1,t} \cdot \frac{\partial g_i^{[l]}}{\partial w_{l,i,j}}(x; w). && \text{by definition (22)}
\end{aligned}$$
So the result also holds for $r = q + 1$, and the inductive step is complete.

Most readers will be surprised, or at least find it strange, that the derivative amplification coefficients are about $g$, not $f$. The answer is that we could also define derivative amplification coefficients about $f$, but the final algorithm resulting from that choice would be less elegant and slightly less efficient, and would also differ from the commonly known backpropagation algorithm. So we have made the better choice.

5 Backpropagation for Derivative Amplification Coefficients

Apparently, $\alpha_{l,i \to L,o}$ can be used for computing $\frac{\partial g_o^{[L]}}{\partial w_{l,i,j}}(x; w)$ and further $\frac{\partial f_o}{\partial w_{l,i,j}}(x; w)$. There are $m \sum_{l=1}^{L-1} h_l$ such derivative amplification coefficients. What is more, if we compute them naively based on the definition, there is a huge computational burden. Thus, it is extremely desirable to find structural relationships between the coefficients, in the hope that the computational cost can be reduced. For inspiration let us look at a 2-3-2-1 network.
For this network we have
$$g_1^{[2]}(x; w) = w_{2,1,0} + \sum_{j=1}^{3} w_{2,1,j} \, \phi_1(\underbrace{g_j^{[1]}(x; w)}_{u_{1,j}}),$$
$$g_2^{[2]}(x; w) = w_{2,2,0} + \sum_{j=1}^{3} w_{2,2,j} \, \phi_1(\underbrace{g_j^{[1]}(x; w)}_{u_{1,j}}),$$
$$g_1^{[3]}(x; w) = w_{3,1,0} + \sum_{j=1}^{2} w_{3,1,j} \, \phi_2(\underbrace{g_j^{[2]}(x; w)}_{u_{2,j}}).$$
And
$$\alpha_{1,1 \to 2,1} = w_{2,1,1} \, \phi'_1(u_{1,1}), \qquad \alpha_{1,1 \to 2,2} = w_{2,2,1} \, \phi'_1(u_{1,1}),$$
$$\alpha_{1,1 \to 3,1} = \sum_{j=1}^{2} w_{3,1,j} \, \phi'_2(u_{2,j}) \, \alpha_{1,1 \to 2,j} = w_{3,1,1} \, \phi'_2(u_{2,1}) \, w_{2,1,1} \, \phi'_1(u_{1,1}) + w_{3,1,2} \, \phi'_2(u_{2,2}) \, w_{2,2,1} \, \phi'_1(u_{1,1}),$$
$$\alpha_{2,1 \to 3,1} = w_{3,1,1} \, \phi'_2(u_{2,1}), \qquad \alpha_{2,2 \to 3,1} = w_{3,1,2} \, \phi'_2(u_{2,2}).$$
It is a simple matter to verify that
$$\alpha_{1,1 \to 3,1} = \phi'_1(u_{1,1}) \left( w_{2,1,1} \, \alpha_{2,1 \to 3,1} + w_{2,2,1} \, \alpha_{2,2 \to 3,1} \right).$$
It turns out that this equation is not a mere coincidence, and in general we have the following theorem.
Theorem 2.
For $1 \le l < r \le L$ and $1 \le i \le h_l$, $1 \le t \le h_r$,
$$\alpha_{l,i \to r,t} = \phi'_l(u_{l,i}) \sum_{s=1}^{h_{l+1}} w_{l+1,s,i} \, \alpha_{l+1,s \to r,t}. \tag{24}$$

Proof.
The result is proved by mathematical induction on $r$.

Base case, i.e. $r = l + 1$: By definitions (21, 22),
$$\alpha_{l,i \to r,t} = \alpha_{l,i \to l+1,t} = \sum_{j=1}^{h_l} w_{l+1,t,j} \, \phi'_l(u_{l,j}) \, \alpha_{l,i \to l,j} = w_{l+1,t,i} \, \phi'_l(u_{l,i}),$$
$$\phi'_l(u_{l,i}) \sum_{s=1}^{h_{l+1}} w_{l+1,s,i} \, \alpha_{l+1,s \to r,t} = \phi'_l(u_{l,i}) \sum_{s=1}^{h_{l+1}} w_{l+1,s,i} \, \alpha_{l+1,s \to l+1,t} = w_{l+1,t,i} \, \phi'_l(u_{l,i}).$$
So the result holds for the base case.

Inductive step: Suppose that the result holds when $r = q \ge l + 1$. That is, for $1 \le i \le h_l$, $1 \le j \le h_q$, we have
$$\alpha_{l,i \to q,j} = \phi'_l(u_{l,i}) \sum_{s=1}^{h_{l+1}} w_{l+1,s,i} \, \alpha_{l+1,s \to q,j}. \tag{25}$$
Then
$$\begin{aligned}
\alpha_{l,i \to q+1,t} &= \sum_{j=1}^{h_q} w_{q+1,t,j} \, \phi'_q(u_{q,j}) \, \alpha_{l,i \to q,j} && \text{by definition (22)} \\
&= \sum_{j=1}^{h_q} w_{q+1,t,j} \, \phi'_q(u_{q,j}) \, \phi'_l(u_{l,i}) \sum_{s=1}^{h_{l+1}} w_{l+1,s,i} \, \alpha_{l+1,s \to q,j} && \text{by induction hypothesis (25)} \\
&= \phi'_l(u_{l,i}) \sum_{j=1}^{h_q} \sum_{s=1}^{h_{l+1}} w_{q+1,t,j} \, \phi'_q(u_{q,j}) \, w_{l+1,s,i} \, \alpha_{l+1,s \to q,j} \\
&= \phi'_l(u_{l,i}) \sum_{s=1}^{h_{l+1}} w_{l+1,s,i} \sum_{j=1}^{h_q} w_{q+1,t,j} \, \phi'_q(u_{q,j}) \, \alpha_{l+1,s \to q,j} \\
&= \phi'_l(u_{l,i}) \sum_{s=1}^{h_{l+1}} w_{l+1,s,i} \, \alpha_{l+1,s \to q+1,t}. && \text{by definition (22)}
\end{aligned}$$
So the result also holds for $r = q + 1$, and the inductive step is complete.

For our purposes here only the derivative amplification coefficients to the final output layer are of interest. Therefore only the special case $r = L$ of the above theorem is actually used:

Corollary 3.
For $1 \le l < L$ and $1 \le i \le h_l$, $1 \le o \le m$,
$$\alpha_{l,i \to L,o} = \phi'_l(u_{l,i}) \sum_{s=1}^{h_{l+1}} w_{l+1,s,i} \, \alpha_{l+1,s \to L,o}. \tag{26}$$
Based on Corollary 3, the derivative amplification coefficients to the nodes in the output layer can be computed incrementally, layer by layer in the backward direction, with each layer doing much less work than if working directly from the definition. The computational cost can therefore be dramatically reduced.

6 Error Coefficients

As said, to find all the partial derivatives $\frac{\partial f_o}{\partial w_{l,i,j}}(x; w)$ for all $l, i, j, o$, a total of $m \sum_{l=1}^{L-1} h_l$ derivative amplification coefficients need to be maintained and computed. Can we make this number smaller? If $m > 1$, the answer is yes. The key idea is that we do not actually need the partial derivatives of the output functions with respect to the weights; what we really need are the partial derivatives of the error with respect to the weights. So in this section we shall develop an error-oriented modification of backpropagation. When $m = 1$, this error-oriented modification brings no loss (albeit no improvement either), so it is used in that case as well, for the sake of consistency of style.

In virtually all applications of neural networks, the error (or loss, cost, penalty, etc.) $E$ that we currently want to minimize is a function of the network output $f(x; w)$. Therefore in general we have
$$\frac{\partial E}{\partial w_{l,i,j}}(x; w) = \sum_{o=1}^{m} \varepsilon_o(x; w) \cdot \frac{\partial f_o}{\partial w_{l,i,j}}(x; w). \tag{27}$$
We have seen one particular example of (27), which is (14) in the regression setting. The $\varepsilon_o(x; w)$'s can be computed before backpropagation begins. Now, as a result of (27, 17),
$$\frac{\partial E}{\partial w_{l,i,j}}(x; w) = \sum_{o=1}^{m} \varepsilon_o(x; w) \cdot \phi'_L(\underbrace{g_o^{[L]}(x; w)}_{u_{L,o}}) \cdot \frac{\partial g_o^{[L]}}{\partial w_{l,i,j}}(x; w). \tag{28}$$
Now, let us give a definition of error coefficients for the nodes of the neural network. For all $1 \le l \le L$ and $1 \le i \le h_l$, the error coefficient of node $(l, i)$, denoted by $\delta_{l,i}$, is defined as
$$\delta_{l,i} \stackrel{\mathrm{def}}{=} \sum_{o=1}^{m} \varepsilon_o(x; w) \cdot \phi'_L(u_{L,o}) \cdot \alpha_{l,i \to L,o}. \tag{29}$$
Note that $\delta_{l,i}$, like $\alpha_{l,i \to r,t}$, should also be regarded as having an implicit argument $(x; w)$.

Theorem 4.
For $1 \le l \le L$, $1 \le i \le h_l$, and $0 \le j \le h_{l-1}$,
$$\frac{\partial E}{\partial w_{l,i,j}}(x; w) = \begin{cases} \delta_{l,i}, & \text{if } j = 0, \\ \delta_{l,i} \cdot z_{l-1,j}, & \text{otherwise.} \end{cases} \tag{30}$$

Proof.
It follows from Theorem 1 that for each $o$ with $1 \le o \le m$,
$$\frac{\partial g_o^{[L]}}{\partial w_{l,i,j}}(x; w) = \alpha_{l,i \to L,o} \cdot \frac{\partial g_i^{[l]}}{\partial w_{l,i,j}}(x; w). \tag{31}$$
Thus
$$\begin{aligned}
\frac{\partial E}{\partial w_{l,i,j}}(x; w) &= \sum_{o=1}^{m} \varepsilon_o(x; w) \cdot \phi'_L(u_{L,o}) \cdot \frac{\partial g_o^{[L]}}{\partial w_{l,i,j}}(x; w) && \text{by (28)} \\
&= \sum_{o=1}^{m} \varepsilon_o(x; w) \cdot \phi'_L(u_{L,o}) \cdot \alpha_{l,i \to L,o} \cdot \frac{\partial g_i^{[l]}}{\partial w_{l,i,j}}(x; w) && \text{by (31)} \\
&= \delta_{l,i} \cdot \frac{\partial g_i^{[l]}}{\partial w_{l,i,j}}(x; w) && \text{by definition (29)} \\
&= \begin{cases} \delta_{l,i}, & \text{if } j = 0, \\ \delta_{l,i} \cdot z_{l-1,j}, & \text{otherwise.} \end{cases} && \text{by (16, 20)}
\end{aligned}$$
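Theorem 4 turns the error coefficients directly into gradient entries. As an illustrative sketch (the dictionary-based encoding with 1-based keys `(l, i, j)` for weights and `(l, i)` for node quantities is a convention assumed here, not the paper's notation), the gradient assembly of Eq. (30) can be written:

```python
def assemble_gradients(delta, z, h, L):
    """Gradient assembly per Theorem 4, Eq. (30).

    delta[(l, i)]: error coefficient of node (l, i);
    z[(l, j)]: layer outputs, with z[(0, j)] = x_j;
    h[l]: width of layer l; L: number of layers.
    Returns grad[(l, i, j)] = dE/dw_{l,i,j}(x; w).
    """
    grad = {}
    for l in range(1, L + 1):
        for i in range(1, h[l] + 1):
            grad[(l, i, 0)] = delta[(l, i)]          # bias weight: j = 0
            for j in range(1, h[l - 1] + 1):
                grad[(l, i, j)] = delta[(l, i)] * z[(l - 1, j)]
    return grad
```

Note that this step needs only the error coefficients and the layer outputs already produced by the forward pass, which is why it can be merged into the backward sweep if desired.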
For $1 \le l < L$ and $1 \le i \le h_l$,
$$\delta_{l,i} = \phi'_l(u_{l,i}) \sum_{s=1}^{h_{l+1}} w_{l+1,s,i} \, \delta_{l+1,s}. \tag{32}$$

Proof.
$$\begin{aligned}
\delta_{l,i} &= \sum_{o=1}^{m} \varepsilon_o(x; w) \cdot \phi'_L(u_{L,o}) \cdot \alpha_{l,i \to L,o} && \text{by definition (29)} \\
&= \sum_{o=1}^{m} \varepsilon_o(x; w) \cdot \phi'_L(u_{L,o}) \cdot \phi'_l(u_{l,i}) \sum_{s=1}^{h_{l+1}} w_{l+1,s,i} \, \alpha_{l+1,s \to L,o} && \text{by Corollary 3} \\
&= \phi'_l(u_{l,i}) \sum_{s=1}^{h_{l+1}} w_{l+1,s,i} \sum_{o=1}^{m} \varepsilon_o(x; w) \cdot \phi'_L(u_{L,o}) \cdot \alpha_{l+1,s \to L,o} \\
&= \phi'_l(u_{l,i}) \sum_{s=1}^{h_{l+1}} w_{l+1,s,i} \, \delta_{l+1,s}. && \text{by definition (29)}
\end{aligned}$$

Finally, to start the backpropagation we need the error coefficients of the output layer. By definitions (29) and (21), we have for $1 \le o \le m$,
$$\delta_{L,o} = \varepsilon_o(x; w) \cdot \phi'_L(u_{L,o}). \tag{33}$$
Based on all the foregoing results, we are now able to provide a full description of the backpropagation algorithm for the regression application.

Algorithm 1:
BP REG
Input: $L$; $n = h_0, h_1, \cdots, h_{L-1}$, $m = h_L$; $w \in \mathbb{R}^{\sum_{l=1}^{L}(1+h_{l-1})h_l}$; $x \in \mathbb{R}^n$; $d \in \mathbb{R}^m$
Output: $\frac{\partial E}{\partial w_{l,i,j}}(x; w)$ for $1 \le l \le L$, $1 \le i \le h_l$, $0 \le j \le h_{l-1}$, where $E$ is defined by (13)

1. Compute $u_{l,i}$ and $z_{l,i}$ for $1 \le l \le L$, $1 \le i \le h_l$ based on (15-20)
2. for $o = 1, \ldots, m$ do
3.     $\delta_{L,o} = (z_{L,o} - d_o) \, \phi'_L(u_{L,o})$
4. for $l = L-1, \ldots, 1$ do
5.     for $i = 1, \ldots, h_l$ do
6.         $\delta_{l,i} = \phi'_l(u_{l,i}) \sum_{s=1}^{h_{l+1}} w_{l+1,s,i} \, \delta_{l+1,s}$
7. for $l = 1, \ldots, L$ do
8.     for $i = 1, \ldots, h_l$ do
9.         $\frac{\partial E}{\partial w_{l,i,0}}(x; w) = \delta_{l,i}$
10.        for $j = 1, \ldots, h_{l-1}$ do
11.            $\frac{\partial E}{\partial w_{l,i,j}}(x; w) = \delta_{l,i} \cdot z_{l-1,j}$

We have two remarks on the algorithm:
• This algorithm description is for regression. For other applications of neural networks one can suitably modify line 3. This should not be difficult if one understands the underlying mathematics of the algorithm.
• Line 1 constitutes the forward propagation part. Lines 2-6 constitute the backpropagation part, which computes the $\sum_{l=1}^{L} h_l$ error coefficients. Lines 7-11 constitute the third, finishing part, and we have much greater freedom in choosing its internal order. The order of the three major parts cannot be altered, except that the third part can be merged into the backpropagation part.

7 Conclusion

We have presented a new derivation of the classic backpropagation algorithm based on the concept of derivative amplification coefficients, first proposed in [3]. The use of derivative amplification coefficients is essential as a theoretical tool, because without it we would be unable to use mathematical induction, which is essential in establishing Theorem 2, the center of our derivation. Our new derivation is rigorous, simple, and elegant, and we believe it will be helpful to a large portion of practitioners in the neural networks community.
References

[1] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition, 1999.
[2] M. A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.
[3] Y. Cheng. Backpropagation for fully connected cascade networks.