An Analytic Layer-wise Deep Learning Framework with Applications to Robotics
Huu-Thiet Nguyen, Chien Chern Cheah, and Kar-Ann Toh
School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798
School of Electrical and Electronic Engineering, Yonsei University, Seoul, Korea 03722
Abstract
Deep learning has achieved great success in many applications, but it has been less well analyzed from the theoretical perspective. To deploy deep learning algorithms in a predictable and stable manner is particularly important in robotics, as robots are active agents that need to interact safely with the physical world. This paper presents an analytic deep learning framework for fully connected neural networks, which can be applied for both regression problems and classification problems. Examples for regression and classification problems include online robot control and robot vision. We present two layer-wise learning algorithms such that the convergence of the learning systems can be analyzed. Firstly, an inverse layer-wise learning algorithm for multilayer networks with convergence analysis for each layer is presented to understand the problems of layer-wise deep learning. Secondly, a forward progressive learning algorithm where the deep networks are built progressively by using single hidden layer networks is developed to achieve better accuracy. It is shown that the progressive learning method can be used for fine-tuning of weights from the convergence point of view. The effectiveness of the proposed framework is illustrated based on classical benchmark recognition tasks using the MNIST and CIFAR-10 datasets and the results show a good balance between performance and explainability. The proposed method is subsequently applied for online learning of robot kinematics, and experimental results on kinematic control of a UR5e robot with an unknown model are presented.
Artificial neural networks (ANNs), or simply neural networks (NNs), have been widely deployed in data-related problems such as regression analysis, classification, data processing, control and robotics. In robotic applications, the use of ANNs can be traced back to the late 1980s [1]. ANNs have been proved to be universal approximators [2, 3], and their great potential in identification and control of dynamic systems was discussed in [4]. Hence, they have been regarded as a potential approach to deal with nonlinearities, modeling uncertainties and disturbances in robot control systems.

For years, many results in feedback control of robots have been obtained by focusing on regression problems based on simple shallow networks. Most studies are based on one hidden layer networks such as the single-hidden layer feedforward networks (SLFNs) and radial basis function networks (RBFNs). There are two notable approaches in learning of these networks in robotics: the first approach is training only the output weights, which has remained popular until recently [5–11]; the second approach focuses on training both input and output weights of the network [12–15].

In the first approach of training only the output weights, by using linear output activation functions, the algorithms for updating the last layer of weights of these networks resemble those adaptive control techniques where the output equation is linear with tunable parameters. Similar to adaptive control, the convergence and stability of these algorithms can be ensured by using the Lyapunov method. Among the early studies in NN-based control, Sanner and Slotine [5] analyzed the approximation capability of Gaussian networks and employed them in control of systems with dynamic uncertainties. The RBFNs were employed in an indirect controller of one subsystem and in an adaptive NN model reference controller of another subsystem in underactuated wheeled inverted pendulums [6].
While robot control problems are usually formulated for trajectory tracking tasks, a region adaptive NN controller with a unified objective bound was synthesized for robot control in task space [7]. In [8], He et al. proposed an adaptive NN control technique for robots with full-state constraints which guaranteed asymptotic tracking. Global stability was ensured by using a NN-based controller for dual-arm robots with dynamic uncertainties [9]. Recently, the approach has been adopted for indirect herding [10]. For robots with unknown Jacobian matrix, [11] proposed SLFN-based controllers that guaranteed the stability of the system.
In the second approach of training weights in both layers, besides the output weights, the input weights of the network are also adjusted. In the 1990s, Chen and Khalil [12] provided convergence analysis of a learning algorithm that was based on the backpropagation (BP) and gradient descent (GD) in multilayer NN control of nonlinear systems. In 1996, Lewis et al. [13] proposed a learning algorithm for updating two layers of weights (input weights and output weights) in a SLFN with Lyapunov-based convergence analysis. The study paved the way for more research works applying SLFNs in dynamic systems like robotics. In [14], control of nonholonomic mobile robots was studied and a NN controller was introduced to deal with disturbances and unmodeled dynamics. In [15], an adaptive NN-based controller was also proposed for manipulators with kinematic and dynamic uncertainties. However, all these works focused on theoretical analysis of shallow networks for dynamic systems (of the form ż = f(z)) and the output activation functions for the networks were also assumed to be linear.

Deep networks [16, 17] can achieve better performance as compared to their shallow counterparts when it comes to the number of tunable parameters and overfitting problems [18, 19]. Deep NNs have been shown to be more powerful in function approximation and classification than single hidden layer networks [20]. The current boom of machine learning (ML) applications in many aspects of our life is greatly attributed to deep learning (DL) algorithms [21, 22] in which backpropagation (BP) [23] plays a major role. DL has replaced many conventional learning algorithms which saw disadvantages in processing raw data [24]. Many unprecedented successes in image recognition have been achieved by convolutional neural networks (CNNs) [24, 25]. However, DL has not
However, DL has notless analyzed from theoretical perspective and DL models remain difficult to understand despitetremendous successes [26].Various attempts have been done to understand the properties of deep networks. Layer-wiselearning is one of the methods to dissect a large network into smaller pieces. One method of trainingnetwork layer-by-layer is using matrix pseudoinverse as in [27] and together with functional inverse,as developed in [28]. The method [28] does not require any computation of the gradient informationand can be applied to both regression and classification problems. However, its performance stilllags behind the state-of-art gradient descent DL algorithms in many applications. Employing the2ayer-wise method in [28], an iterative learning algorithm for offline regression problem of robotkinematics was developed [29] but it was again limited to shallow networks with one or two hiddenlayers. The algorithm [29] was built and analyzed in continuous time and hence could not begeneralized for classification problems. In addition, the learning of the input layer was ignoredand the weights obtained were time-varying which thus required averaging. Another approach fortraining deep networks layer-wisely is greedy layer-wise learning which can be found in networkpre-training [30, 31] and forward thinking algorithm [32]. In this methodology [30–32], trainingof the multilayer NN is performed by adding the layers in a forward manner, starting from thefirst layer based on the training of shallow networks. Each hidden layer is added a time, and eachtraining step involves only a single hidden layer feedforward network. After training the SLFNin each layer, the input weights are kept while the output weights are discarded. A new hiddenlayer is then added on top of the hidden layer of the last SLFN to create a new SLFN. 
However,despite the good performance especially in classification problems, there is no convergence analysison these algorithms [30–32], which is a common problem in the ML literature.Robots are active agents which interact physically with the real world, and applying DL toolsin robot control need careful consideration [33]. When employing a deep network in control ofrobotic systems, one should guarantee the stability, convergence and robustness because robotsneed to be operated in a safe and predictable manner. Most DL algorithms used in ML communitylack theoretical supports for convergence analysis. Therefore, in spite of many elegant results fromML research, very few can be directly used in the area of robot control for the reason of safety.Recently, there has been an increasing need of building interpretable and understandable deepneural networks [34]. As a result, the field of explainable artificial intelligence (XAI) has begunto attract more attention from academics [35]. An XAI project launched by DARPA has aimedfor developing accountable AI with a compromise between performance and explainability [36].Therefore establishing a reliable theoretical framework for constructing and training deep networks,which ensures the convergence, could open up many XAI applications to robotics.This paper aims to develop a theoretical framework for multilayer NNs which can be efficientlyapplied in operations of robotic systems. Our main focus is on the study of learning algorithmsand training methodologies for multilayer NNs to make them perform in a reliable and explainablemanner, which is desirable for control of active agents like robots. To achieve that, an inverselearning algorithm is first formulated to understand the issues and difficulties of establishing analyticlayer-wise deep learning and based on this study, an analytic forward progressive algorithm is finallyproposed to overcome the problems. 
The main contributions of this paper are:

(i) The development of a theoretical framework to ensure the convergence of the layer-wise deep learning algorithms. To the best of our knowledge, there is currently no theoretical result for analysis of deep learning of fully connected neural networks to ensure convergence for safe operation of robot systems.

(ii) The development of a forward progressive layer-wise learning algorithm for deep networks in which the general input-output function y = f(z) is also considered. Based on the convergence analysis, it is shown that the proposed algorithm can be used for fine-tuning of weights.

(iii) The development of a systematic learning or training methodology in which deep networks can be built gradually for reliable operations in both online and offline robotic applications.

The proposed framework is applied to two recognition tasks using the MNIST [37] and CIFAR-10 [38] databases and an online kinematic robot control task using the UR5e manipulator. Experimental results are presented to illustrate the performance of the proposed algorithm. It is shown that forward progressive learning can achieve similar accuracy as compared to the gradient descent method, but the main advantage is that the convergence of the proposed algorithm can be established in a systematic way.

Consider a mapping between the input variable z ∈ R^m and the output variable y ∈ R^p

y = f(z)   (1)

The function f : R^m → R^p is assumed to be unknown, but can be approximated by available input and output (target) data which are referred to as training data. Our objective is to develop a theoretical framework to achieve an approximation (model) of the function f based on the training data, so that this model can predict well on unknown new data. Based on the output variable y, the problem can be divided into two main types:

• When y is a continuous variable, the problem is known as a regression problem.
• When y is a categorical variable, the problem is known as a classification problem.

In the area of robotics, both types of problem can be found. For instance, when a robot needs to identify (and label) the objects within its workspace, a classification or recognition task should be done; but how the robot makes movement by rotating its joints to reach the position of the object would be a regression problem.

In order to approximate f, a multilayer feedforward neural network (MLFN) is used. In this paper, we present two techniques for training the MLFNs. The first one is called inverse layer-wise learning, which is presented in section 3. The second one is called forward progressive learning, which is presented in section 4.

An illustration of an n-layer MLFN is shown in Fig. 1. The number of neurons in the j-th hidden layer (1 ≤ j ≤ n − 1) is denoted as h_j; the activation functions for the j-th hidden layer are denoted as φ_j: φ_j = [φ_{j,1}, φ_{j,2}, . . . , φ_{j,h_j}]^T; the activation functions for the output layer are denoted as φ_n: φ_n = σ = [σ_1, σ_2, . . . , σ_p]^T (and hence h_n ≜ p). The n weight matrices are denoted (from input layer to output layer) as W_1, W_2, . . . , W_n, where W_1 is the matrix of input weights and W_n is the matrix of output weights. The output of the MLFN as shown in Fig. 1 is given as follows

y_NN = φ_n(W_n φ_{n−1}(W_{n−1} φ_{n−2}(. . . W_3 φ_2(W_2 φ_1(W_1 z)) . . .)))   (2)

Since the weights are the tunable parameters, the output of the MLFN in (2) can be written as

y_NN = y_NN(W_j|_{j=1}^n, z)   (3)

The denotation shows the dependence of the network output on the weights W_j (j = 1, . . . , n) and the input variable z. The output after the (j − 1)-th hidden layer (activation values after φ_{j−1}) is given as

y_NN(j−1)(W_l|_{l=1}^{j−1}, z) = φ_{j−1}(W_{j−1} φ_{j−2}(. . . W_3 φ_2(W_2 φ_1(W_1 z)) . . .))   (4)

Figure 1: Structure of an n-layer feedforward neural network with input z and target output y.

According to the universal approximation theorem of neural networks [2, 3], the networks can approximate any function at any accuracy with a sufficiently large number of hidden neurons.

In this approach, the MLFN is trained layer-by-layer to ensure the convergence, which means one layer of weights is learned at a time. In [28], by using functional inverse and matrix pseudoinverse, equation (2) has been treated as follows with y_NN = y

y = φ_n(W_n φ_{n−1}(W_{n−1} φ_{n−2}(. . . W_3 φ_2(W_2 φ_1(W_1 z)) . . .)))   (5)
→ W^†_n φ^{−1}_n(y) = φ_{n−1}(W_{n−1} φ_{n−2}(. . . W_3 φ_2(W_2 φ_1(W_1 z)) . . .))   (6)
...
→ W^†_2 φ^{−1}_2(· · · W^†_{n−1} φ^{−1}_{n−1}(W^†_n φ^{−1}_n(y)) · · ·) = φ_1(W_1 z)   (7)

where W^†_n, W^†_{n−1}, . . .
, W^†_1 are the Moore-Penrose inverses (or pseudoinverses) of the matrices W_n, W_{n−1}, . . . , W_1, respectively; φ^{−1}_n, φ^{−1}_{n−1}, . . . , φ^{−1}_1 are vectors of the inverse functions of the respective φ_n, φ_{n−1}, . . . , φ_1.

The inverse layer-wise learning is conducted through two stages in sequence: backward and forward. The learning process starts with the backward stage (subsection 3.1), where the MLFN is trained layer-wisely from the output layer to the input layer. After that, the forward stage takes place (subsection 3.2), where the network is trained layer-wisely in the forward direction, from the input layer to the output layer. Unlike in [28] where the author used the kernel and range space to solve linear equations, we propose nonlinear update laws to find the weight matrices incrementally so that the convergence is ensured.

3.1 Backward Stage of Inverse Layer-wise Learning

In the first stage of inverse layer-wise learning, the network is trained layer-by-layer from W_n to W_1. That is, W_n is trained first. Then come W_{n−1}, W_{n−2}, . . . . The backward stage ends with the learning of the input weights W_1. Let us now look into the details of how the weights in each layer are trained, starting from W_n. Prior to that, all the weight matrices W_1, W_2, . . . , W_{n−1}, W_n are first randomly initialized. Let W̄*_1, W̄*_2, . . . , W̄*_{n−1}, W̄*_n be the respective values of these matrices after initialization.

3.1.1 Learning of the output weights W_n

During learning of W_n (or the n-th layer), the other weight matrices are frozen at their initialized values W̄*_1, W̄*_2, . . . , W̄*_{n−1}. Using these values, we can compute the input of the n-th layer, which is also the output after the (n − 1)-th hidden layer, by using (4)

φ̄*_{n−1} = φ_{n−1}(W̄*_{n−1} φ_{n−2}(. . . φ_2(W̄*_2 φ_1(W̄*_1 z)) . . .
))   (8)

Hence, the output of this n-th layer (and also of the whole MLFN), denoted as y_NNn, is given as

y_NNn(W_n, φ̄*_{n−1}) = φ_n(W_n φ̄*_{n−1})   (9)

This is actually equal to the right-hand side of (5) when setting W_1, . . . , W_{n−1} as W̄*_1, . . . , W̄*_{n−1}, respectively. Hence, the target for learning of W_n, denoted as y_n, is the direct target y of the whole MLFN as seen on the left-hand side of (5). That is

y_n = y   (10)

Given W̄*_1, W̄*_2, . . . , W̄*_{n−1}, there exists a weight matrix W_n such that the target provided in (10) can be approximated by the network whose output is given in (9). This is feasible if the number of neurons h_{n−1} is sufficiently large. We have

y_n = y_NNn(W_n, φ̄*_{n−1}) = φ_n(W_n φ̄*_{n−1})   (11)

For learning of W_n, an incremental learning update law, referred to as the one-layer update, is developed to update the weights at that layer. In this algorithm, the weights in the matrix W_n are updated incrementally without inverting the activation functions in φ_n (and hence the update law is also called nonlinear). In each step of training, we use one example of the training data. At the k-th step, (z(k), y(k)) are used, and the target for training W_n in (10) is given as

y_n(k) = y(k)   (12)

Equation (11) can be rewritten as

y_n(k) = y_NNn(W_n, φ̄*_{n−1}(k)) = φ_n(W_n φ̄*_{n−1}(k))   (13)

where

φ̄*_{n−1}(k) = φ_{n−1}(W̄*_{n−1} φ_{n−2}(. . . φ_2(W̄*_2 φ_1(W̄*_1 z(k))) . . .
))   (14)

Let Ŵ_n(k) denote the estimated weight matrix at the k-th step of training. The estimated output ŷ_n(k) at this k-th step is constructed as the direct output of this n-th layer when its weight matrix is set at Ŵ_n(k):

ŷ_n(k) = y_NNn(Ŵ_n(k), φ̄*_{n−1}(k)) = φ_n(Ŵ_n(k) φ̄*_{n−1}(k))   (15)

The output estimation error in learning of W_n at the k-th step is defined as e_n(k) = y_n(k) − ŷ_n(k). Hence, from (13) and (15) we have

e_n(k) = y_NNn(W_n, φ̄*_{n−1}(k)) − y_NNn(Ŵ_n(k), φ̄*_{n−1}(k))   (16)

or

e_n(k) = φ_n(W_n φ̄*_{n−1}(k)) − φ_n(Ŵ_n(k) φ̄*_{n−1}(k))   (17)

Let

δ_n(k) ≜ W_n φ̄*_{n−1}(k) − Ŵ_n(k) φ̄*_{n−1}(k) = ΔW_n(k) φ̄*_{n−1}(k)   (18)

where ΔW_n(k) = W_n − Ŵ_n(k).

Let us consider the relationship between e_n(k) and δ_n(k). If φ_n is chosen as a vector of monotonically increasing activation functions whose derivatives are bounded by f_{φn}, we have:

(i) The corresponding elements of the two vectors e_n(k) and δ_n(k) have the same sign, i.e.

e_{n,i}(k) δ_{n,i}(k) ≥ 0, ∀i = 1..h_n   (19)

(ii) The absolute value of an element of e_n(k) is less than or equal to f_{φn} times the absolute value of the corresponding element of δ_n(k), i.e.

|e_{n,i}(k)| ≤ f_{φn} |δ_{n,i}(k)|, ∀i = 1..h_n   (20)

The incremental learning law (one-layer update) to update the estimated weights based on the output estimation error is proposed as

Ŵ_n(k + 1) = Ŵ_n(k) + L_n(k) e_n(k) φ̄*^T_{n−1}(k)   (21)

where L_n(k) ∈ R^{h_n×h_n} is a positive diagonal matrix; φ̄*_{n−1}(k) is calculated using (14); e_n(k) = y_n(k) − ŷ_n(k) with y_n(k) and ŷ_n(k) given in (12) and (15), respectively.
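As a concrete illustration, the one-layer update law (21) can be sketched in a few lines of NumPy. This is a sketch under our own assumptions, not the paper's code: the helper names are ours, the sigmoid is used for every activation, and the diagonal gain L_n(k) is stored as a vector.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_frozen(frozen_weights, z):
    """Propagate z through the frozen layers W*_1 ... W*_{n-1} to obtain
    phi_bar*_{n-1}(k), the input of the n-th layer (eq. (14))."""
    a = z
    for W in frozen_weights:
        a = sigmoid(W @ a)
    return a

def one_layer_step(W_hat, gain, phi_bar, y_target):
    """One step of the one-layer update law (21):
    W_hat(k+1) = W_hat(k) + L_n(k) e_n(k) phi_bar*^T_{n-1}(k),
    with the diagonal gain L_n(k) stored as the vector `gain`."""
    e = y_target - sigmoid(W_hat @ phi_bar)      # output estimation error e_n(k)
    W_hat = W_hat + np.outer(gain * e, phi_bar)  # rank-one weight update
    return W_hat, e
```

With a gain small enough to satisfy the condition on L_n(k) given below (for the sigmoid, the derivative bound is f_{φn} = 1/4), repeating `one_layer_step` over the training pairs drives e_n(k) down, mirroring the convergence argument that follows.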
Denoting w_{n,i} as the i-th column vector of the matrix W_n, ŵ_{n,i}(k) the i-th column vector of Ŵ_n(k), and φ̄*_{n−1,i}(k) the i-th element of the vector φ̄*_{n−1}(k), the update law (21) can be rewritten in the vector form as

ŵ_{n,i}(k + 1) = ŵ_{n,i}(k) + φ̄*_{n−1,i}(k) L_n(k) e_n(k)   (22)

To show the convergence, we define an objective function as

V(k) = Σ_{i=1}^{h_{n−1}} Δw^T_{n,i}(k) Δw_{n,i}(k)   (23)

where Δw_{n,i}(k) = w_{n,i} − ŵ_{n,i}(k). The objective function at the (k + 1)-th step is

V(k + 1) = Σ_{i=1}^{h_{n−1}} Δw^T_{n,i}(k + 1) Δw_{n,i}(k + 1)
         = Σ_{i=1}^{h_{n−1}} (Δw_{n,i}(k) − φ̄*_{n−1,i}(k) L_n(k) e_n(k))^T (Δw_{n,i}(k) − φ̄*_{n−1,i}(k) L_n(k) e_n(k))   (24)

The change of the objective function value when the learning step goes from the k-th to the (k + 1)-th is

ΔV(k) = V(k + 1) − V(k)
      = Σ_{i=1}^{h_{n−1}} ( −φ̄*_{n−1,i}(k) Δw^T_{n,i}(k) L_n(k) e_n(k) − φ̄*_{n−1,i}(k) e^T_n(k) L^T_n(k) Δw_{n,i}(k) + φ̄*²_{n−1,i}(k) e^T_n(k) L^T_n(k) L_n(k) e_n(k) )   (25)

From (18), we have δ_n(k) = ΔW_n(k) φ̄*_{n−1}(k) = Σ_{i=1}^{h_{n−1}} Δw_{n,i}(k) φ̄*_{n−1,i}(k), hence

ΔV(k) = −δ^T_n(k) L_n(k) e_n(k) − e^T_n(k) L^T_n(k) δ_n(k) + Σ_{i=1}^{h_{n−1}} ( φ̄*²_{n−1,i}(k) e^T_n(k) L^T_n(k) L_n(k) e_n(k) )   (26)

From the properties stated in (19) and (20), we have the following inequality since L_n(k) is a positive diagonal matrix

δ^T_n(k) L_n(k) e_n(k) ≥ (1/f_{φn}) e^T_n(k) L_n(k) e_n(k)   (27)

which finally gives

ΔV(k) ≤ −e^T_n(k) ( (2/f_{φn}) L_n(k) − Σ_{i=1}^{h_{n−1}} φ̄*²_{n−1,i}(k) L^T_n(k) L_n(k) ) e_n(k)   (28)

When L_n(k) is chosen such that

(2/f_{φn}) L_n(k) − Σ_{i=1}^{h_{n−1}} φ̄*²_{n−1,i}(k) L^T_n(k) L_n(k) > 0   (29)

ΔV(k) is negative if e_n(k) is non-zero.
That means, the value of the objective function keeps decreasing, V(k + 1) < V(k). Moreover, since the function V(k) is non-negative, which means it is bounded from below, V(k) converges as k increases and hence ΔV(k) → 0. Thus, from (28), e_n(k) converges to zero as k increases.

3.1.2 Learning of the hidden weights W_j with n − 1 ≥ j ≥ 2

After W_n has been learned and its value in this backward stage has been obtained as W̄^b_n, the target for the (n − 1)-th layer is calculated based on the left-hand side of (6) as y_{n−1} = W̄^{b†}_n φ^{−1}_n(y_n). Generally, the target for learning of W_j (or the j-th layer), denoted as y_j, can be achieved by calculating backwardly from the target for the last layer y_n = y, using the left-hand sides of the equations from (5) to (7). We have

y_j = W̄^{b†}_{j+1} φ^{−1}_{j+1}(y_{j+1}) with n − 1 ≥ j ≥ 2   (30)

Figure 2: Backward transmission of the target in inverse layer-wise learning, from the output layer to the inner layers.

The input of the j-th layer is computed similarly to (8) using (4)

φ̄*_{j−1} ≜ φ_{j−1}(W̄*_{j−1} φ_{j−2}(. . . φ_2(W̄*_2 φ_1(W̄*_1 z)) . . .))   (31)

Hence, the output of this j-th layer is given as

y_NNj(W_j, φ̄*_{j−1}) = φ_j(W_j φ̄*_{j−1})   (32)

There exists a weight matrix W_j such that the target provided in (30) can be approximated by the network whose output is given in (32)

y_j = y_NNj(W_j, φ̄*_{j−1}) = φ_j(W_j φ̄*_{j−1})   (33)

Similar to the learning of W_n, the weights in the matrix W_j are updated incrementally without inverting the activation functions φ_j. The functions φ_j are chosen to be monotonically increasing activation functions and their derivatives are bounded by f_{φj}.
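The backward transmission (30) is a pseudoinverse applied to the element-wise inverse activation. A minimal NumPy sketch, assuming sigmoid layers; the function names and the clipping constant `eps` are implementation choices for the sketch, not from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inv_sigmoid(y, eps=1e-9):
    """Inverse of the sigmoid; it is defined only on (0, 1), so values are clipped."""
    y = np.clip(y, eps, 1.0 - eps)
    return np.log(y / (1.0 - y))

def backward_target(W_b_next, y_next):
    """Target for the j-th layer from the (j+1)-th target (eq. (30)):
    y_j = pinv(W^b_{j+1}) @ phi^{-1}_{j+1}(y_{j+1})."""
    return np.linalg.pinv(W_b_next) @ inv_sigmoid(y_next)
```

When W̄^b_{j+1} has full row rank, multiplying the transmitted target by W̄^b_{j+1} recovers φ^{−1}_{j+1}(y_{j+1}) exactly, so the new target is consistent with the inversion in (6).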
At the k-th step of training, equation (30) can be rewritten as

y_j(k) = W̄^{b†}_{j+1} φ^{−1}_{j+1}(y_{j+1}(k))   (34)

and equation (33) can be rewritten as

y_j(k) = y_NNj(W_j, φ̄*_{j−1}(k)) = φ_j(W_j φ̄*_{j−1}(k))   (35)

where

φ̄*_{j−1}(k) = φ_{j−1}(W̄*_{j−1} φ_{j−2}(. . . φ_2(W̄*_2 φ_1(W̄*_1 z(k))) . . .))   (36)

Let Ŵ_j(k) denote the estimated weight matrix at the k-th step of training. The estimated output ŷ_j(k) at the k-th step is constructed as the direct output of this j-th layer when its weight matrix is set at Ŵ_j(k):

ŷ_j(k) = y_NNj(Ŵ_j(k), φ̄*_{j−1}(k)) = φ_j(Ŵ_j(k) φ̄*_{j−1}(k))   (37)

The output estimation error in learning of W_j at the k-th step is defined as e_j(k) = y_j(k) − ŷ_j(k). Hence,

e_j(k) = y_NNj(W_j, φ̄*_{j−1}(k)) − y_NNj(Ŵ_j(k), φ̄*_{j−1}(k))   (38)

The incremental learning law to update the estimated weights based on the output estimation error is proposed as

Ŵ_j(k + 1) = Ŵ_j(k) + L_j(k) e_j(k) φ̄*^T_{j−1}(k)   (39)

where L_j(k) ∈ R^{h_j×h_j} is a positive diagonal matrix that satisfies the following condition

(2/f_{φj}) L_j(k) − Σ_{i=1}^{h_{j−1}} φ̄*²_{j−1,i}(k) L^T_j(k) L_j(k) > 0   (40)

with φ̄*_{j−1,i}(k) being the i-th element of the vector φ̄*_{j−1}(k); φ̄*_{j−1}(k) is calculated using (36); e_j(k) = y_j(k) − ŷ_j(k) with y_j(k) and ŷ_j(k) given in (34) and (37), respectively. Similarly to the case of W_n, it can be shown that e_j(k) converges as k increases.

Learning of W_1 is done in the same way as the learning of W_j above by setting j = 1. One point to take note of is that the denotations of φ̄*_{j−1} and h_{j−1} in the above demonstration would become φ̄*_0 and h_0, which do not exist.
However, it is possible to consider that φ̄*_0 ≜ z (φ̄*_{0,i} ≜ z_i) and h_0 ≜ m (m is the dimension of the input vector z). After learning, we get W̄^b_1 and the backward stage stops.

3.2 Forward Stage of Inverse Layer-wise Learning

In the forward stage, the network is trained forwardly from the input layer to the output layer. The inverse layer-wise learning goes forwards from W_2 until the output weights W_n are retrained. In this stage, the learning can be conducted similarly for every layer from W_2 to W_n. The only difference is that in the backward stage, the input of the j-th layer was defined based on the initialized values of W̄*_1, W̄*_2, . . . , W̄*_{j−1} as in (31), but it is now defined based on the values of the weights W̄_1, W̄_2, . . . , W̄_{j−1} that have been relearned before W_j

φ̄_{j−1} = φ_{j−1}(W̄_{j−1} φ_{j−2}(. . . φ_2(W̄_2 φ_1(W̄_1 z)) . . .)) with 2 ≤ j ≤ n   (41)

noting that W̄_1 = W̄^b_1 since there is no forward learning for W_1.

This part summarizes the inverse layer-wise learning of MLFNs. In each layer, the weights are updated incrementally by a one-layer update law. The details of the inverse layer-wise learning algorithm are as follows:

(i) Initialization: Randomly assign W_1, W_2, . . . , W_n.

(ii) Train W_n using the update law (21) → obtain W̄^b_n.

(iii) Backward looping: learning of W_j (from W_{n−1} to W_1)
  (a) Set j = n − 1
  (b) Train W_j using the update law (39) → obtain W̄^b_j
  (c) Decrease j by 1 and go to step (b) if j ≥ 1

(iv) Forward looping: train W_j again (from W_2 to W_n)
  (d) Set j = 2
  (e) Train W_j using the update law (39), noting that φ̄_{j−1} in (41) is used instead of φ̄*_{j−1} → obtain W̄_j
  (f) Increase j by 1 and go to step (e) if j ≤ n

In the learning of each layer, the number of training steps can exceed the number of training examples. This is because after all the examples in the training data have been used, the full training set can be used again. Each time a full set of training data is used is called a loop. In practice, the training of each layer can take many loops.
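Steps (i)–(iv) can be sketched end-to-end. This is an illustrative NumPy reading of the procedure, assuming sigmoid activations throughout and a conservative gain inside the bounds of (29)/(40); the function names, loop counts, and clipping constant are our assumptions, not the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inv_sigmoid(y, eps=1e-9):
    y = np.clip(y, eps, 1.0 - eps)       # the sigmoid is only invertible on (0, 1)
    return np.log(y / (1.0 - y))

def train_layer(weights, j, data, targets, loops=3, f_phi=0.25):
    """Train W_j with the one-layer law (21)/(39); all other layers stay frozen."""
    for _ in range(loops):               # each pass over the data is one 'loop'
        for z, y_j in zip(data, targets):
            phi = z                      # phi_bar_{j-1}(k); phi_bar_0 = z
            for W in weights[:j]:
                phi = sigmoid(W @ phi)
            e = y_j - sigmoid(weights[j] @ phi)
            l = 1.0 / (f_phi * max(float(phi @ phi), 1e-12))  # inside the gain bound
            weights[j] = weights[j] + l * np.outer(e, phi)
    return weights

def inverse_layer_wise(weights, data, y):
    """Backward stage W_n -> W_1 with targets transmitted by (30),
    then forward stage W_2 -> W_n (steps (i)-(iv))."""
    n = len(weights)
    targets = {n - 1: list(y)}           # y_n = y
    for j in range(n - 1, -1, -1):       # backward stage
        weights = train_layer(weights, j, data, targets[j])
        if j > 0:                        # transmit the target backwards, eq. (30)
            P = np.linalg.pinv(weights[j])
            targets[j - 1] = [P @ inv_sigmoid(t) for t in targets[j]]
    for j in range(1, n):                # forward stage: inputs use relearned layers
        weights = train_layer(weights, j, data, targets[j])
    return weights
```

The dictionary of per-layer targets is computed once in the backward stage and reused in the forward stage, which matches the description above: only the inputs φ̄_{j−1} change between the two stages.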
The MNIST database was used for assessing the performance of the inverse layer-wise learning algorithm and for understanding the issues associated with it.
The network built has the following properties:

• A 3-hidden-layer network with structure 784-300-100-50-10 (300 units, 100 units, and 50 units in the hidden layers). The activation functions at the hidden layers are the modified softplus f(x) = log(0.1 + e^x) as suggested in [28], which has its inverse function as log(e^x − 0.1).

• The activation functions at the output layer are the sigmoid f(x) = 1/(1 + e^{−x}).

In this network, there are 4 weight matrices to be learned: W_1, W_2, W_3 and W_4. Before training, all of the weights W_1 to W_4 were randomly initialized. After that, the learning process began with learning of W_4 → learning of W_3 → learning of W_2 → learning of W_1 → relearning of W_2 → relearning of W_3 → relearning of W_4. In the learning of each layer, the gain matrix L was not a predefined matrix. Instead, it was calculated from the respective conditions in (29) and (40) to ensure the convergence of the learning process. Since the matrix is diagonal and the inequalities are quite straightforward, the gain matrix L can be directly calculated. The number of loops for the training of each layer was 20.

The results in Table 1 show that although convergence can be achieved, the accuracy after training by the inverse layer-wise learning algorithm is not desirable as compared to the stochastic gradient descent (SGD) method (92.04% compared to 98.36% on the test set).

Table 1: Training & testing accuracies (%) by inverse layer-wise learning vs. SGD

                              Training  Testing
Inverse layer-wise learning     92.37    92.04
Stochastic gradient descent     99.80    98.36

Noting that for the purpose of comparing with the best performance, the results for the SGD method in this paper were achieved by observing the accuracy directly on the test set while training, and no validation set was used.
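Because L is diagonal, conditions (29) and (40) reduce to per-entry scalar bounds, which is why the gain can be directly calculated as described above: with s(k) = Σ_i φ̄²_i(k) and derivative bound f_φ, any diagonal entry 0 < ℓ < 2/(f_φ s(k)) satisfies the condition. A sketch, assuming the condition reduces to the scalar bound below; the function name and the `safety` factor are illustrative:

```python
import numpy as np

def diagonal_gain(phi_bar, f_phi, safety=0.5):
    """Return a diagonal entry l for the gain matrix L(k) strictly inside the
    bound implied by (29)/(40): 0 < l < 2 / (f_phi * sum_i phi_bar_i^2).
    For such l, (2/f_phi)*l - sum_i(phi_bar_i^2)*l^2 > 0 holds entrywise."""
    s = float(np.sum(phi_bar ** 2))
    return safety * 2.0 / (f_phi * s)
```

For sigmoid outputs, the derivative bound is f_φ = 1/4, so the entry is simply `safety * 8 / s(k)`.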
The inverse layer-wise learning method is non-error-based, which means that the error at the output layer of the MLFN is not directly used to adjust all layers of weights (the output error is only used directly for training the last layer). Instead, the target is transmitted backwards to the previous layers in a backward transmission process (please refer to (30) and Fig. 2). Though convergence can now be ensured in learning, this backward transmission of the target causes some possible problems that lead to a trade-off in accuracy as compared to backpropagation (like SGD) methods. One of the reasons can stem from the fact that the modified softplus and sigmoid are only invertible within their ranges. This causes distortions in the target values when they are transmitted backwards. Even when all the nonlinear functions are fully invertible and the targets for the previous layers are the right values, training only a layer of weights may not help in fitting the target for that layer. Therefore, the inverse layer-wise learning method can be used in cases where a trade-off in performance is acceptable while ensuring the convergence of the learning systems. In the next section, we present a learning method to achieve a good balance between accuracy and convergence.
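The range-of-invertibility issue discussed above can be seen directly: a hidden-layer target produced by (30) is not guaranteed to lie in the range of the next activation, so inverting it requires clipping, and the clipped components are never recovered. A small illustration with the sigmoid; the numbers are made up for the example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inv_sigmoid(y, eps=1e-6):
    """The sigmoid is only invertible on (0, 1); out-of-range values
    must be clipped, which distorts the transmitted target."""
    y = np.clip(y, eps, 1.0 - eps)
    return np.log(y / (1.0 - y))

# A hypothetical layer target: the entries 1.3 and -0.2 fall outside (0, 1).
t = np.array([0.7, 1.3, -0.2])
recovered = sigmoid(inv_sigmoid(t))   # invert, then map forward again
# Only the in-range entry 0.7 survives the round trip; the others are distorted.
```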
In this section, we develop a forward progressive learning (FPL) method based on the layer-wise methodology in [30-32]. Unlike those works [30-32], where no convergence analysis is given, the convergence of our proposed FPL method can be analyzed. Our main aim is to develop an output-error-based layer-wise learning algorithm so as to overcome the drawback of the inverse layer-wise learning method in section 3.

Fig. 3 illustrates the processing details of the algorithm. The overall structure of the deep network with n weight matrices to be learned is shown in Fig. 3(a). The FPL starts with learning of the weights in the first layer W_1 based on an SLFN (or two-layer network) in which the first hidden layer is directly connected to the output, as shown in Fig. 3(b). Two weight matrices, an input weight matrix W_1 and a pseudo output weight matrix W_1^▷, are learned simultaneously by a two-layer algorithm. After learning, the matrix W̄_1 is kept to form the new input for the next layer while the pseudo output weight matrix W̄_1^▷ is discarded. Fig. 3(c) shows the learning of the second layer W_2 based on a second SLFN with two weight matrices W_2 and W_2^▷. The new input z_2 of this SLFN is formed by passing z through the fixed W̄_1. Similarly, after training we keep W̄_2 and discard W̄_2^▷. Fig. 3(d) shows the learning of the j-th layer W_j with input z_j and target output y.

Figure 3: Forward progressive learning (FPL) method, where an n-layer network is trained layer-wise through learning of (n − 1) two-layer networks. Each two-layer network is trained in 2 phases: pre-training (subsection 4.1) and fine-tuning (subsection 4.2) so as to guarantee the convergence.

The FPL continues until the learning of W_{n−1} takes place as shown in Fig. 3(e), where the pseudo output weights are no longer needed. Instead, the true output weights W_n of the deep network are used and trained together with W_{n−1}. After training this SLFN, the FPL ends.

We now consider the general case where the j-th hidden layer is added. The structure of the SLFN in this case is shown in Fig. 4. Since W̄_1, . . . , W̄_{j−1} have been trained, the input to this SLFN can be calculated similarly to (31):

z_j ≜ φ̄_{j−1} = φ_{j−1}(W̄_{j−1} φ_{j−2}(. . . φ_2(W̄_2 φ_1(W̄_1 z)) . . .))   (42)

The output of the network shown in Fig. 4 can be expressed as follows:

y_NN(W_j, W_j^▷, z_j) = σ(W_j^▷ φ_j(W_j z_j))   (43)

Each SLFN is trained in two phases. In the first phase, called pre-training, the proposed one-layer update law in section 3 is adopted to pre-train the SLFN. In the second phase, called fine-tuning, a two-layer update law is developed to fine-tune the weights of the SLFN.

Figure 4: The single hidden layer feedforward network (or two-layer network) for the newly added j-th hidden layer in forward progressive learning. The input z_j of this network is calculated by forward-propagating z through the previously learned layers, while its target is the overall target y directly.

The purpose of the pre-training phase is to achieve a sufficiently small estimation error before further improvement is achieved in the fine-tuning phase. The proposed one-layer update law for training the last layer of an MLFN is used to pre-train the SLFN.
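The data flow of (42)-(43) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's training code: the layer sizes, the random weights, and the choice of ReLU hidden activations with a sigmoid output are assumptions for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_input(z, frozen_Ws):
    # Eq. (42): z_j is obtained by passing the raw input z through the
    # already-trained (frozen) layers W̄_1, ..., W̄_{j-1}.
    for W in frozen_Ws:
        z = relu(W @ z)
    return z

def slfn_output(W_j, W_pseudo, z_j):
    # Eq. (43): y_NN = σ(W_j^▷ φ_j(W_j z_j)), where W_pseudo plays the role of
    # the pseudo output weight matrix that is discarded after this layer is trained.
    return sigmoid(W_pseudo @ relu(W_j @ z_j))

rng = np.random.default_rng(0)
frozen = [rng.standard_normal((5, 4)), rng.standard_normal((6, 5))]  # W̄_1, W̄_2
z = rng.standard_normal(4)
z_j = forward_input(z, frozen)            # input to the newly added layer
y_hat = slfn_output(rng.standard_normal((7, 6)),
                    rng.standard_normal((3, 7)), z_j)
print(y_hat.shape)
```

The frozen layers only reshape the input; all learning for the new layer happens inside the small SLFN.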
The steps of training are as follows:

• Initialization: Randomly assign W_j and W_j^▷ as initial estimates W̄_j^⋆ and W̄_j^▷⋆, respectively.
• W_j^▷ is trained using update law (21) to obtain W̄_j^▷.

After W_j^▷ has been pre-trained, the entire SLFN is trained once more in a fine-tuning phase. Note that the inverse layer-wise algorithm presented in the previous section could also be used to achieve this aim.

In this subsection, a two-layer update law is developed to update both the input weights W_j and the output weights W_j^▷ of the SLFN concurrently. When the number of neurons h_j in the hidden layer is sufficiently large, there exist weight matrices W_j and W_j^▷ such that the target given in (1) can be approximated by the network whose output is given in (43):

y(k) = y_NN(W_j, W_j^▷, z_j(k)) = σ(W_j^▷ φ_j(W_j z_j(k)))   (44)

As W_j and W_j^▷ are unknown, they are updated incrementally by two learning laws. Let Ŵ_j(k) and Ŵ_j^▷(k) denote the estimates of the weight matrices W_j and W_j^▷ at the k-th step of learning; the estimated output vector ŷ(k) at the k-th step is constructed as the output of the SLFN when its weights are set to Ŵ_j(k) and Ŵ_j^▷(k):

ŷ(k) = y_NN(Ŵ_j(k), Ŵ_j^▷(k), z_j(k)) = σ(Ŵ_j^▷(k) φ_j(Ŵ_j(k) z_j(k)))   (45)

The output estimation error at the k-th step is defined as e(k) = y(k) − ŷ(k).
Hence,

e(k) = y_NN(W_j, W_j^▷, z_j(k)) − y_NN(Ŵ_j(k), Ŵ_j^▷(k), z_j(k))   (46)
     = σ(W_j^▷ φ_j(W_j z_j(k))) − σ(Ŵ_j^▷(k) φ_j(Ŵ_j(k) z_j(k)))   (47)

Let

δ(k) ≜ W_j^▷ φ_j(W_j z_j(k)) − Ŵ_j^▷(k) φ_j(Ŵ_j(k) z_j(k))   (48)

which can be expressed as

δ(k) = Ŵ_j^▷(k) Δφ_j(k) + ΔW_j^▷(k) φ̂_j(k) + ΔW_j^▷(k) Δφ_j(k)   (49)

where Δφ_j(k) ≜ φ_j(W_j z_j(k)) − φ_j(Ŵ_j(k) z_j(k)), φ̂_j(k) ≜ φ_j(Ŵ_j(k) z_j(k)) and ΔW_j^▷(k) ≜ W_j^▷ − Ŵ_j^▷(k).

Consequently, the incremental learning laws (two-layer update) for updating the estimated weights based on the output estimation error e(k) are proposed as:

Ŵ_j^▷(k + 1) = Ŵ_j^▷(k) + α₁ L(k) e(k) φ̂_j^T(k)   (50)
Ŵ_j(k + 1) = Ŵ_j(k) + α₂ P(k) e(k) z_j^T(k)   (51)

where α₁ and α₂ are positive scalars, L(k) ∈ R^{p×p} is a positive diagonal matrix, and P(k) ∈ R^{h_j×p} is a matrix depending on the learning step k.

In (50), let w_{j,i}^▷ denote the i-th column vector of W_j^▷, ŵ_{j,i}^▷(k) the i-th column vector of Ŵ_j^▷(k), and φ̂_{j,i}(k) the i-th element of the vector φ̂_j(k). In (51), let w_{j,ı} denote the ı-th column vector of W_j, ŵ_{j,ı}(k) the ı-th column vector of Ŵ_j(k), and z_{j,ı}(k) the ı-th element of the vector z_j(k).
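The decomposition (49) is an algebraic identity, which can be verified numerically. Below is a small NumPy check with random weights; the dimensions and the ReLU choice for φ_j are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
p, h, m = 3, 6, 4                        # output, hidden, input dimensions (assumed)
W_true, W_hat = rng.standard_normal((h, m)), rng.standard_normal((h, m))
Wo_true, Wo_hat = rng.standard_normal((p, h)), rng.standard_normal((p, h))
z = rng.standard_normal(m)
phi = lambda x: np.maximum(0.0, x)       # element-wise hidden activation φ_j

# Eq. (48): δ = W^▷ φ(W z) − Ŵ^▷ φ(Ŵ z)
delta = Wo_true @ phi(W_true @ z) - Wo_hat @ phi(W_hat @ z)

d_phi = phi(W_true @ z) - phi(W_hat @ z)   # Δφ_j(k)
phi_hat = phi(W_hat @ z)                   # φ̂_j(k)
d_Wo = Wo_true - Wo_hat                    # ΔW_j^▷(k)
# Eq. (49): δ = Ŵ^▷ Δφ + ΔW^▷ φ̂ + ΔW^▷ Δφ
delta_decomp = Wo_hat @ d_phi + d_Wo @ phi_hat + d_Wo @ d_phi
print(np.allclose(delta, delta_decomp))  # True
```

The first term is linear in the activation error, the second in the output-weight error, and the third is the second-order cross term that later gets neglected after pre-training.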
The update laws (50) and (51) can be rewritten in vector form as:

ŵ_{j,i}^▷(k + 1) = ŵ_{j,i}^▷(k) + α₁ φ̂_{j,i}(k) L(k) e(k)   (52)
ŵ_{j,ı}(k + 1) = ŵ_{j,ı}(k) + α₂ z_{j,ı}(k) P(k) e(k)   (53)

To show the convergence, we define an objective function

V(k) = (1/α₁) Σ_{i=1}^{h_j} Δw_{j,i}^▷T(k) Δw_{j,i}^▷(k) + (1/α₂) Σ_{ı=1}^{h_{j−1}} Δw_{j,ı}^T(k) Δw_{j,ı}(k)   (54)

where Δw_{j,i}^▷(k) = w_{j,i}^▷ − ŵ_{j,i}^▷(k) and Δw_{j,ı}(k) = w_{j,ı} − ŵ_{j,ı}(k). From (52) and (53), the objective function at the (k + 1)-th step can be written as

V(k + 1) = (1/α₁) Σ_{i=1}^{h_j} Δw_{j,i}^▷T(k + 1) Δw_{j,i}^▷(k + 1) + (1/α₂) Σ_{ı=1}^{h_{j−1}} Δw_{j,ı}^T(k + 1) Δw_{j,ı}(k + 1)

Using (52) and (53), the change of the objective function value as the training step goes from k to (k + 1) can therefore be derived as

ΔV(k) = V(k + 1) − V(k)
      = − φ̂_j^T(k) ΔW_j^▷T(k) L(k) e(k) − z_j^T(k) ΔW_j^T(k) P(k) e(k)
        − e^T(k) L^T(k) ΔW_j^▷(k) φ̂_j(k) − e^T(k) P^T(k) ΔW_j(k) z_j(k)
        + e^T(k) (α₁ Σ_{i=1}^{h_j} φ̂_{j,i}²(k) L^T(k) L(k) + α₂ Σ_{ı=1}^{h_{j−1}} z_{j,ı}²(k) P^T(k) P(k)) e(k)   (55)

From (49), we have

ΔW_j^▷(k) φ̂_j(k) = δ(k) − Ŵ_j^▷(k) Δφ_j(k) − ΔW_j^▷(k) Δφ_j(k)   (56)

Next, substituting (56) into (55) gives

ΔV(k) = − δ^T(k) L(k) e(k) − e^T(k) L^T(k) δ(k)
        + (Δφ_j^T(k) Ŵ_j^▷T(k) − z_j^T(k) ΔW_j^T(k) P(k) L^{−1}(k)) L(k) e(k)
        + e^T(k) L^T(k) (Ŵ_j^▷(k) Δφ_j(k) − L^{−T}(k) P^T(k) ΔW_j(k) z_j(k))
        + e^T(k) (α₁ Σ_{i=1}^{h_j} φ̂_{j,i}²(k) L^T(k) L(k) + α₂ Σ_{ı=1}^{h_{j−1}} z_{j,ı}²(k) P^T(k) P(k)) e(k)
        + Δφ_j^T(k) ΔW_j^▷T(k) L(k) e(k) + e^T(k) L^T(k) ΔW_j^▷(k) Δφ_j(k)   (57)

After the pre-training phase, the errors are sufficiently small, and hence the last two terms, which are of second order in the errors, are negligible compared to the other terms, which are of first order. Also, let the matrix P(k) be chosen so that ξ(k) ≜ Ŵ_j^▷(k) Δφ_j(k) − L^{−T}(k) P^T(k) ΔW_j(k) z_j(k) is zero or sufficiently small; then equation (57) becomes

ΔV(k) = − δ^T(k) L(k) e(k) − e^T(k) L^T(k) δ(k)
        + e^T(k) (α₁ Σ_{i=1}^{h_j} φ̂_{j,i}²(k) L^T(k) L(k) + α₂ Σ_{ı=1}^{h_{j−1}} z_{j,ı}²(k) P^T(k) P(k)) e(k)   (58)

Similarly to the one-layer update, if the activation functions in σ are monotonically increasing and their derivatives are bounded by f'_σ, then comparing e(k) in (47) with δ(k) in (48), we have:

(i) the corresponding elements of e(k) and δ(k) have the same sign, i.e.

e_i(k) δ_i(k) ≥ 0, ∀i = 1..p   (59)

(ii) the absolute values of the elements of e(k) are less than or equal to f'_σ times the corresponding elements of δ(k), i.e.

|e_i(k)| ≤ f'_σ |δ_i(k)|, ∀i = 1..p   (60)

From the properties stated in (59) and (60), the following inequality is assured:

ΔV(k) ≤ − (2/f'_σ) e^T(k) L(k) e(k)
        + e^T(k) (α₁ Σ_{i=1}^{h_j} φ̂_{j,i}²(k) L^T(k) L(k) + α₂ Σ_{ı=1}^{h_{j−1}} z_{j,ı}²(k) P^T(k) P(k)) e(k)   (61)

When L(k) is chosen such that

(2/f'_σ) L(k) − (α₁ Σ_{i=1}^{h_j} φ̂_{j,i}²(k) L^T(k) L(k) + α₂ Σ_{ı=1}^{h_{j−1}} z_{j,ı}²(k) P^T(k) P(k)) > 0   (62)

ΔV(k) is negative whenever e(k) is non-zero. That means the value of the objective function keeps decreasing, V(k + 1) < V(k). Moreover, since V(k) is non-negative and hence bounded from below, V(k) converges as k increases, so ΔV(k) tends to zero. Thus, from (61), e(k) converges as k increases. However, achieving ξ(k) = Ŵ_j^▷(k) Δφ_j(k) − L^{−T}(k) P^T(k) ΔW_j(k) z_j(k) ≈ 0 requires an appropriate choice of P(k) based on Δφ_j(k). Let us now analyze the choice of P(k).
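Properties (59) and (60) follow from the monotonicity and bounded derivative of the output activation. A quick numerical illustration for the standard sigmoid (whose derivative is bounded by f'_σ = 1/4): e(k) and δ(k) here are scalar pre/post-activation differences with randomly drawn pre-activations, an assumption made only for this check.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
F_SIGMA_PRIME = 0.25                 # sup of the sigmoid derivative

rng = np.random.default_rng(2)
a = 3.0 * rng.standard_normal(1000)  # stands in for W^▷ φ(W z)   (true pre-activation)
b = 3.0 * rng.standard_normal(1000)  # stands in for Ŵ^▷ φ(Ŵ z)  (estimated pre-activation)
e, delta = sigmoid(a) - sigmoid(b), a - b

same_sign = np.all(e * delta >= 0)                              # property (59)
bounded = np.all(np.abs(e) <= F_SIGMA_PRIME * np.abs(delta) + 1e-12)  # property (60)
print(same_sign, bounded)
```

Both properties hold for every sample, by the mean value theorem applied to the sigmoid.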
Discussions:
Setting P(k) = Θ^T(k) Ŵ_j^▷T(k) L(k), where Θ(k) ∈ R^{h_j×h_j}, we have

ξ(k) = Ŵ_j^▷(k) (Δφ_j(k) − Θ(k) ΔW_j(k) z_j(k))   (64)

To achieve ξ(k) ≈ 0, the matrix Θ(k) should be chosen such that Δφ_j(k) − Θ(k) ΔW_j(k) z_j(k) ≈ 0, that is,

Δφ_j(k) = φ_j(W_j z_j(k)) − φ_j(Ŵ_j(k) z_j(k)) ≈ Θ(k) ΔW_j(k) z_j(k)   (65)

Noting that the activation functions in the vector φ_j act element-wise, it is clearer to look at each element of the vector Δφ_j(k). Its i-th element is denoted as Δφ_{j,i}(k) = φ_{j,i}(w_{j,ri} z_j(k)) − φ_{j,i}(ŵ_{j,ri}(k) z_j(k)), where w_{j,ri} and ŵ_{j,ri}(k) are the i-th rows of the matrices W_j and Ŵ_j(k), respectively. Now let us look at the choice of Θ(k) for different activation functions φ_j.

It is interesting to consider the case where the activation functions in the vector φ_j are chosen as ReLUs, defined as φ(x) = x if x > 0 and φ(x) = 0 otherwise. Hence, we have

Δφ_{j,i}(k) =
• Δw_{j,ri}(k) z_j(k) if w_{j,ri} z_j(k) ≥ 0 and ŵ_{j,ri}(k) z_j(k) ≥ 0
• 0 if w_{j,ri} z_j(k) ≤ 0 and ŵ_{j,ri}(k) z_j(k) ≤ 0
• w_{j,ri} z_j(k) if w_{j,ri} z_j(k) ≥ 0 and ŵ_{j,ri}(k) z_j(k) ≤ 0
• −ŵ_{j,ri}(k) z_j(k) if w_{j,ri} z_j(k) ≤ 0 and ŵ_{j,ri}(k) z_j(k) ≥ 0

The values of w_{j,ri} z_j(k) and ŵ_{j,ri}(k) z_j(k) are close to each other after the pre-training phase. Hence, in the last two cases above, where they are of opposite signs, both must be sufficiently small, and it can therefore be considered that for these last two cases Δφ_{j,i}(k) ≈ 0. Hence, the matrix Θ(k) in (65) can be chosen as a diagonal matrix diag{θ_1(k), θ_2(k), . . . , θ_{h_j}(k)} whose diagonal elements are

θ_i(k) = 1 if ŵ_{j,ri}(k) z_j(k) ≥ 0, and 0 otherwise   (66)

Looking back at (66), the value of θ_i(k) is exactly the derivative of the ReLU function:

θ_i(k) = φ̂'_{j,i}(k) = dφ_{j,i}(x(k))/dx(k) |_{x(k) = ŵ_{j,ri}(k) z_j(k)}   (67)

Generalizing to any differentiable activation function, we set Θ(k) = Φ'_j(k) with Φ'_j(k) = diag{θ_1(k), θ_2(k), . . . , θ_{h_j}(k)}, where θ_i(k) is defined in (67). With this first-order approximation, we have

φ_j(W_j z_j(k)) − φ_j(Ŵ_j(k) z_j(k)) ≈ Φ'_j(k) ΔW_j(k) z_j(k)

Hence, equation (65) is satisfied. Now, substituting this choice of P(k) into (51), the full update laws in (50) and (51) can be rewritten as

Ŵ_j^▷(k + 1) = Ŵ_j^▷(k) + α₁ L(k) e(k) φ̂_j^T(k)   (68)
Ŵ_j(k + 1) = Ŵ_j(k) + α₂ Φ'^T_j(k) Ŵ_j^▷T(k) L(k) e(k) z_j^T(k)   (69)

It can be seen that these update laws are similar to first-order gradient descent when the activation functions at the output layer are linear. The complete condition (62) can now be written as

(2/f'_σ) L(k) − (α₁ Σ_{i=1}^{h_j} φ̂_{j,i}²(k) L^T(k) L(k) + α₂ Σ_{ı=1}^{h_{j−1}} z_{j,ı}²(k) L^T(k) Ŵ_j^▷(k) Φ'_j(k) Φ'^T_j(k) Ŵ_j^▷T(k) L(k)) > 0   (70)

The overall algorithm for learning each SLFN in the FPL is as follows:

(i) The first phase: pre-train the SLFN following the steps in subsection 4.1.
(ii) The second phase: fine-tune the SLFN:
  (a) For each data sample (z_j(k), y(k)) in the training set:
    (1) Calculate ŷ(k) using (45).
    (2) Calculate e(k) = y(k) − ŷ(k).
    (3) Update the weight matrices using update laws (68) and (69), where L(k) should satisfy (70).
  (b) Move to the next training sample and repeat (a) (steps (1) to (3)) until all the samples in the training set have been used.
  (c) When all the data samples in the training set have been used, one loop is finished.
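One fine-tuning step of (68)-(69) with a ReLU hidden layer and sigmoid output can be sketched as follows. This is a simplified illustration: the dimensions, gains α₁, α₂ and L, the random initialization, and the repetition of a single training sample are assumptions for the demo, and the automatic monitoring of condition (70) is omitted for brevity.

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def fine_tune_step(W, Wo, z, y, L, a1=0.5, a2=0.5):
    """One two-layer update, Eqs. (68)-(69), on the sample (z, y).
    W : input weights Ŵ_j;  Wo : pseudo output weights Ŵ_j^▷;  L : diagonal gain."""
    pre = W @ z
    phi = relu(pre)                                  # φ̂_j(k)
    e = y - sigmoid(Wo @ phi)                        # output estimation error e(k)
    Phi_p = np.diag((pre >= 0).astype(float))        # Φ'_j(k): ReLU derivatives, Eq. (67)
    Wo_new = Wo + a1 * (L @ e)[:, None] * phi[None, :]                 # Eq. (68)
    W_new = W + a2 * (Phi_p @ Wo.T @ L @ e)[:, None] * z[None, :]      # Eq. (69)
    return W_new, Wo_new, e

rng = np.random.default_rng(3)
h, m, p = 8, 4, 2                                    # hidden, input, output sizes (assumed)
W, Wo = 0.1 * rng.standard_normal((h, m)), 0.1 * rng.standard_normal((p, h))
L = np.eye(p)
z = rng.standard_normal(m)
y = sigmoid(rng.standard_normal(p))                  # a reachable target in (0, 1)

err = [np.linalg.norm(y - sigmoid(Wo @ relu(W @ z)))]
for _ in range(200):
    W, Wo, e = fine_tune_step(W, Wo, z, y, L)
    err.append(np.linalg.norm(e))
print(err[0], err[-1])                               # initial vs. final error
```

With modest gains the error on this repeated sample decreases, consistent with the Lyapunov-style argument above; in the actual algorithm, L(k) would additionally be reduced whenever (70) is violated.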
The training can take more than one loop if necessary.

Online Kinematic Control of Robot Manipulators
In this section, we show how the results can be adapted for online learning of robot kinematics without any modeling. The rate of change of the joint variables q̇ is related to the rate of change of the position and orientation of the end effector in sensory space ẋ as

ẋ = J(q) q̇   (71)

where J(q) is the overall Jacobian matrix from joint space to sensory task space. The relationship in equation (71) can be approximated by a multilayer network whose output is given in equation (3):

ẋ = J(q) q̇ = y_NN(W_j|_{j=1}^n, q, q̇)   (72)

It can be seen that the learning algorithms in sections 3 and 4 can be applied directly for offline learning by setting y = ẋ and feeding q and q̇ into the network. However, for online robot control, a desired trajectory is specified, and hence the learning algorithms need to be adapted for online learning.

At the k-th step of online learning, we have

ẋ(k) = J(q(k)) q̇(k) = y_NN(W_j|_{j=1}^n, q(k), q̇(k))   (73)

The estimated output of the network at the k-th step of online learning is given as

ẋ̂(k) = Ĵ(q(k), Ŵ_Σ(k)) q̇(k) = y_NN(Ŵ_j(k)|_{j=1}^n, q(k), q̇(k))   (74)

where Ŵ_Σ(k) stands for all the estimated weight matrices Ŵ_j(k) for j = 1, 2, . . . , n. Let the reference joint velocity q̇(k) based on the sensory task-space feedback be proposed as follows:

q̇(k) = Ĵ†(q(k), Ŵ_Σ(k)) (ẋ_d(k) − α Δx(k))   (75)

where Ĵ†(q(k), Ŵ_Σ(k)) is the pseudoinverse of the estimated Jacobian Ĵ(q(k), Ŵ_Σ(k)); α is a positive scalar; Δx(k) = x(k) − x_d(k); and x_d(k) and ẋ_d(k) are, respectively, the desired position and velocity of the end effector in the sensory task space.
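The reference velocity (75) is a standard resolved-rate law built on the estimated Jacobian. Below is a minimal NumPy sketch; the toy Jacobian, the states, and the gain α = 2 are invented for the example. It also checks the key consequence of (76)-(77): when the Jacobian estimate is exact, the online feedback error ε = Δẋ + αΔx vanishes.

```python
import numpy as np

def reference_joint_velocity(J_hat, x, x_d, xdot_d, alpha=1.0):
    """Eq. (75): q̇ = Ĵ†(q, Ŵ_Σ)(ẋ_d − α Δx), with Δx = x − x_d."""
    return np.linalg.pinv(J_hat) @ (xdot_d - alpha * (x - x_d))

# Toy check with a known (square, invertible) Jacobian standing in for Ĵ = J.
J = np.array([[1.0, 0.2],
              [0.0, 0.8]])
x = np.array([0.3, -0.1])          # current end-effector position
x_d = np.array([0.2, 0.0])         # desired position
xdot_d = np.array([0.05, 0.0])     # desired velocity
alpha = 2.0

qdot = reference_joint_velocity(J, x, x_d, xdot_d, alpha)
xdot = J @ qdot                                    # actual task-space velocity, Eq. (71)
eps = (xdot - xdot_d) + alpha * (x - x_d)          # online feedback error ε(k), Eq. (77)
print(np.allclose(eps, 0.0))                       # True when Ĵ equals J
```

With an imperfect Ĵ, ε(k) is non-zero and, as shown in (78), equals the network's output estimation error, which is what drives the online weight updates.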
Premultiplying (75) by Ĵ(q(k), Ŵ_Σ(k)) gives

Ĵ(q(k), Ŵ_Σ(k)) q̇(k) = ẋ_d(k) − α Δx(k)   (76)

Subtracting (76) from (73) gives

J(q(k)) q̇(k) − Ĵ(q(k), Ŵ_Σ(k)) q̇(k) = ẋ(k) − ẋ_d(k) + α Δx(k) = Δẋ(k) + α Δx(k)   (77)

Let ε(k) ≜ Δẋ(k) + α Δx(k) be the online feedback error in online learning. From (73) and (74), we have

ε(k) = y_NN(W_j|_{j=1}^n, q(k), q̇(k)) − y_NN(Ŵ_j(k)|_{j=1}^n, q(k), q̇(k))   (78)

Hence, the online feedback error ε(k) is equal to the output estimation error as seen in (46). By using ε(k) in the update laws of section 4 to train the network, the convergence of this error can be guaranteed.

Remark:
To simplify the analysis and presentation so as to gain better understanding, this paper considers deep networks with a sufficiently large number of neurons so that the reconstruction error is negligible. With the advances in hardware and computational technology in recent years, a large number of neurons and layers can now be implemented on computers with GPUs. However, it is important to note that, in the presence of approximation errors, convergence to a bound can still be analyzed by extending the time-domain results in [39], with the size of the bound depending on the approximation errors.
In this section, we present three case studies to illustrate the performance of the proposed learningalgorithms. The first two are classification problems with the classical MNIST and CIFAR-10databases. The third one is a regression problem with an online tracking control task for a UR5emanipulator.
We considered the same network structure as in subsection 3.4, namely 784-300-100-50-10, which we call the full network. The activation functions at the hidden layers of this full network were ReLU, f(x) = max(0, x), and at the output layer sigmoid, f(x) = 1/(1 + e^{−x}).

With the structure 784-300-100-50-10, the full network was trained by training three smaller single hidden layer feedforward networks (SLFNs) sequentially: 784-300-10 (net I), 300-100-10 (net II), and 100-50-10 (net III). The activation functions at the hidden layer of all three nets were the same as those at the hidden layers of the full network (ReLU), and the output activation functions of the three nets were also the same as those of the full network (sigmoid). Each SLFN was trained in 2 phases: pre-training using the one-layer update and fine-tuning using the two-layer update. In the pre-training phase of each net the identity output was used, and in the fine-tuning phase the sigmoid output was used. After net I (784-300-10) had been fully trained, its input weights (the 784-300 layer) were kept and frozen, while its output weights (300-10) were discarded. This process creates a modified input layer of dimension 300. This new input was fed into net II (300-100-10) and the training of net II began. Similarly, after net II had been trained, a modified input layer of dimension 100 was created, and the training of net III (100-50-10) took place afterwards.

The gain matrices in the pre-training phase of the three nets could be chosen as in subsection 3.4, by calculating from the condition in (29). In the fine-tuning phase, the gain matrix should satisfy the condition stated in (70). This condition suggests that the training at the ending or fine-tuning stage should be done with a small gain so as to ensure convergence.
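The progressive construction of the 784-300-100-50-10 network from nets I-III can be sketched as a data-flow skeleton. The `train_slfn` function here is a placeholder that only draws random weights (the real procedure is the two-phase training above); the batch size and seed are arbitrary.

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)
hidden_sizes = [300, 100, 50]            # hidden structure of the full 784-300-100-50-10 net
rng = np.random.default_rng(4)

def train_slfn(Z, n_hidden, n_out=10):
    # Placeholder for pre-training + fine-tuning of one SLFN; only the shapes
    # and the data flow are meaningful here.
    W_in = 0.05 * rng.standard_normal((n_hidden, Z.shape[1]))
    W_out = 0.05 * rng.standard_normal((n_out, n_hidden))  # pseudo output weights
    return W_in, W_out

Z = rng.standard_normal((32, 784))       # a batch of flattened 28x28 inputs
frozen = []
for n_hidden in hidden_sizes:            # nets I, II, III in turn
    W_in, W_out = train_slfn(Z, n_hidden)
    frozen.append(W_in)                  # keep and freeze the input weights
    Z = relu(Z @ W_in.T)                 # W_out is discarded; Z becomes the new input
print([W.shape for W in frozen], Z.shape)
```

After the loop, `frozen` holds the 784-300, 300-100, and 100-50 layers, and `Z` is the 50-dimensional representation that the final output layer would be trained on.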
In practice, the gain was initially set to some value, and then adjusted automatically by monitoring the condition at each update and reducing the gain when necessary.

The proposed FPL algorithm was compared with the SGD method, in which all layers of the network were trained together. We first tested the convergence of the FPL and SGD algorithms on an SLFN whose structure was the same as net I above, using various gains (matrix L in FPL and learning rate in SGD). Since at this stage FPL iterates over one training example at a time, the batch size for SGD was also set to 1 for consistency. Table 2 (convergence of FPL and SGD on net I with different gains) shows a summary of the results. Note that for FPL the gain matrix was only an initial setting, since it is adjusted automatically by checking condition (70) during training. It is seen from the table that FPL guarantees convergence for a wide range of learning gains, while divergence can occur in SGD when the gain is large. Another case study, on online kinematic control of a robot, is presented in subsection 6.3.1 to illustrate the performance of FPL as compared to SGD in dealing with new tracking tasks or new circumstances.

After testing the convergence, we compared the training and testing accuracies of the two methods. The learning rate (LR) for SGD was chosen small enough that the loss function converged. The LR was initially set at 0.05, the number of epochs was 100, and the batch size was 1. The LR was reduced by a factor of 2 after half of the epochs and by a factor of 4 after three quarters of the epochs. The accuracies (%) over 5 runs on MNIST were as follows:

Run    FPL train  FPL test  SGD train  SGD test
1      99.94      98.59     99.81      98.47
2      99.93      98.37     99.81      98.30
3      99.91      98.43     99.78      98.38
4      99.92      98.38     99.81      98.26
5      99.89      98.42     99.78      98.41
Mean   99.92      98.44     99.80      98.36

For the CIFAR-10 experiment reported in the next subsection, the accuracies (%) over 5 runs were:

Run    FPL train  FPL test  SGD train  SGD test
1      94.49      88.24     94.41      88.20
2      94.14      88.22     94.85      88.17
3      93.98      88.12     94.17      88.14
4      94.17      88.19     94.62      88.12
5      94.41      88.19     94.78      88.20
Mean   94.24      88.19     94.57      88.17
For the CIFAR-10 dataset, we used a pre-trained convolutional neural network (ResNet-18) to obtain the output of the convolutional part. This is a common technique in transfer learning, where the convolutional layers are fixed as a feature extractor and only the classifier layers are trained for the specific task. The extracted features were then treated as the input of the classifier part, which was a fully connected network with several layers. The output of the convolutional part had a dimension of 512. The structure of the fully connected network was as follows:

• A 2-hidden-layer network with structure 512-200-80-10 (200 and 80 units in the hidden layers). The activation functions at the hidden layers were ReLU, f(x) = max(0, x), and at the output layer sigmoid, f(x) = 1/(1 + e^{−x}).

With the structure 512-200-80-10, the full network was trained by training two smaller SLFNs sequentially: 512-200-10 (net I) and 200-80-10 (net II). The training and testing accuracies of the proposed method were compared with the SGD method. SGD trained all layers of each network together, with batch size set to 1; it took 300 full epochs with LR = 0.01 initially. For the FPL algorithm, in the pre-training phase of the two nets, the number of loops for training the last layer was 2. In the fine-tuning phases, diagonal gain matrices were set initially for net I and net II.

Although it is good to ensure convergence of the deep learning systems in classification problems, so as to establish a systematic method for selecting learning gains instead of trial and error, it may be argued that convergence analysis is not critical in classification problems, since there is no harm in redoing the training if divergence occurs. However, for online training of robots, convergence is crucial in assuring safe operation at all times.
In this section, we first show the importance of convergence in online learning for robot control by comparing the performance of SGD and FPL on the simulator. After that, we show how a deep network can be built progressively using the FPL method such that the convergence of the online feedback error is guaranteed on the real robot. By repeating the operations, we shall show that the robot can gradually learn to execute a task based on feedback errors of the end effector without any knowledge of the kinematic model.

The tracking control task was drawing a circle in 3D space. The desired trajectory in sensory space is a circle (C1) specified as

x₁ = −0.·· − 0.06 cos(ωt) − 0.18 sin(ωt)
x₂ = 0.·· + 0.·· cos(ωt) − 0.02 sin(ωt)
x₃ = 0.·· + 0.02 cos(ωt) + 0.06 sin(ωt)   (79)

The units for the coordinates are meters (m). The circle (C1) has a center at [−0.··, ··, 0.··5] and a radius of 0.2 m.
For safety reasons, we tested SGD only on the simulator, as there is no guarantee of convergence when using SGD in online learning. For the purpose of comparison, FPL was also tested on the same simulator before it was finally implemented on the actual robot (subsection 6.3.2).

We considered the major axes, which include the first three joints q₁, q₂, q₃. A single hidden layer network was built to approximate the Jacobian matrix of the UR5e robot through the relationship in (72). To do that, we first learned an SLFN with structure 3-12-3: 3 input nodes (for q₁, q₂, q₃), 12 nodes in the hidden layer, and 3 output nodes (for ẋ₁, ẋ₂, ẋ₃). The activation functions used for the hidden layer were the modified softplus (as defined in section 3) and for the output layer the identity f(x) = x.

Since the kinematic model is unknown, we needed to first move the robot manually around the desired trajectory in order to collect data for offline training of the network. The data q, q̇, and ẋ were collected during the manual movement. After collecting the data, we trained the network offline using the SGD method. Differently from the convergence test in the classification task, the learning rate here was chosen such that the offline learning converged. The obtained weights were then adopted as a starting point for the learning of the Jacobian matrix in the online training.

We performed the online training using both SGD and FPL on the simulator of the UR5e. The robot was first moved to an initial position so that the initial error was zero. The training was then conducted by using the online q̇ command as constructed in (75):

q̇(k) = Ĵ†(q(k), Ŵ_Σ(k)) (ẋ_d(k) − α Δx(k))

The weights of the network were subsequently updated online, using SGD with the estimation error ẋ(k) − ẋ̂(k) and FPL with the online feedback error ε(k) in (77).
The matrix Ĵ(q(k), Ŵ_Σ(k)) can be calculated from the current joint variables q(k) and the current weights Ŵ_Σ(k). For both methods, the same gain (learning rate) as in the offline learning phase was used. To test the performance of the robot system in tracking a new task, the desired speed of the end effector was increased (3-5 times) compared with the original speed used when moving the robot manually.

Figure 5: Desired and actual trajectories of the robot end-effector in the convergence test of online learning using the SGD and FPL methods: a) SGD - divergence occurred even though the same learning rate as in the offline learning phase was used; b) FPL - convergence was guaranteed by using the proposed online adjustment of the gain matrix.

Fig. 5a shows the actual path of the robot end effector when SGD is used to approximate the Jacobian matrix. It can be seen that the end effector deviates significantly from the desired path during the online training. The program was terminated after some time, as the learning of the network became unstable and interrupted the calculation of Ĵ†(q(k), Ŵ_Σ(k)). The results in Fig. 5b show that convergence can be ensured by using FPL to achieve safe online training.

With the real robot, the real values of the joint variables q and joint velocities q̇ were collected using the internal communication channel of the robot. The positions x of the end effector in sensory space were recorded using a Kinect RGB-D camera. The sampling time was about 0.07 s. The velocities ẋ were then calculated from the positions x. For the experiment on the real robot, the desired trajectory was also the circle (C1) specified in (79). The angular frequency (or angular speed) ω in (79) was planned in 5 phases. In the first 5 seconds, the desired ω is 0 rad/s, which keeps the robot at the initial position.
The next 30 seconds (5 s - 35 s) are the acceleration period, when the desired angular frequency increases gradually from rest at 0 rad/s to full speed at 2π/30 rad/s (2 rpm). After that, the robot end effector moves 3 revolutions at full speed in 90 seconds (35 s - 125 s), before decelerating from full speed to 0 rad/s in the next 30 seconds (125 s - 155 s). Finally, the robot is at rest for the last 5 seconds.
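The five-phase speed plan above can be written down directly as a trapezoidal profile; the sketch below also confirms that 90 s at the full speed of 2π/30 rad/s corresponds to exactly three revolutions (6π rad). The piecewise function is a straightforward transcription of the stated timing, not code from the paper.

```python
import numpy as np

OMEGA_MAX = 2 * np.pi / 30          # full speed: 2π/30 rad/s (2 rpm)

def omega(t):
    """Desired angular speed for circle (C1), planned in 5 phases."""
    if t < 5:
        return 0.0                                  # hold at the initial position
    if t < 35:
        return OMEGA_MAX * (t - 5) / 30             # accelerate over 30 s
    if t < 125:
        return OMEGA_MAX                            # full speed for 90 s
    if t < 155:
        return OMEGA_MAX * (155 - t) / 30           # decelerate over 30 s
    return 0.0                                      # rest for the last 5 s

# Trapezoidal integration of ω over the full-speed window (35 s - 125 s).
t = np.linspace(35.0, 125.0, 9001)
w = np.array([omega(ti) for ti in t])
angle = np.sum(0.5 * (w[:-1] + w[1:]) * np.diff(t))
print(angle / (2 * np.pi))          # number of revolutions at full speed
```

Since ω is constant on that window, the integral is OMEGA_MAX · 90 s = 6π rad, i.e. 3 revolutions, matching the experiment description.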
Training of the first hidden layer network
We aimed to build a two-hidden-layer network to approximate the Jacobian matrix of the UR5e robot through the relationship in (72). To do that, we first learned an SLFN with structure 3-12-3 (net I), similar to the network used in the simulator above.

Though the proposed online kinematic control algorithm in section 5 can be applied directly, the transient performance at the initial stage of learning may be poor since the controller is completely model-free at that stage. To overcome this issue, we adopt a combination of offline and online training so that real-time feedback control using deep networks can eventually be established.

Figure 6: Desired and actual trajectories of the robot end-effector in sensory space for the training circle (C1).

Figure 7: Performance of the online kinematic control task on the real UR5e robot. The first figure is for the training circle (C1): a, tracking error (with respect to time) of every coordinate. The last 2 figures are for the testing circle (C2): b, tracking errors when the weights of the network are fixed and c, tracking errors when the last layer of the network is updated.

After collecting the manual data, we first trained the network offline by one-layer pre-training (as in subsection 4.1) and two-layer fine-tuning (as in subsection 4.2). The obtained weights were then adopted as a starting point for the learning of the Jacobian matrix in the online training. For the online training, the robot was first moved to an initial position so that the initial error was zero. The training was then conducted by using the online q̇ command as constructed in (75). The weights of the network were subsequently updated online using the two-layer update with the error ε(k).
The matrix Ĵ(q(k), Ŵ_Σ(k)) can be calculated from the current joint variables q(k) and the current weights Ŵ_Σ(k).

Training of the second hidden layer network
After the first net with one hidden layer had been trained, we discarded its output weights and added one new hidden layer of 12 neurons, so the structure of the whole network became 3-12-12-3. Because the input weights of this network were frozen, the training process continued with a net of structure 12-12-3 (net II). The learning of this net was similar to that of net I. Again, offline training was performed first to avoid poor transient performance. The training data for offline training of net II were a combination of half of the manual data and half of the new data generated during the online training of net I. After offline training, the online control task was carried out in the same way as for net I.

The desired and actual trajectories for online learning are shown in Fig. 6, and the tracking errors are shown in Fig. 7a. It can be observed that the errors are very small and the actual trajectory follows the desired one very closely. This shows that building a network using FPL can guarantee the convergence of the tracking errors in online learning control.
Testing of the trained network
To test the generalization property of the network, we usedthe Jacobian matrix obtained after online training of net II above for a tracking control task witha new trajectory which was also a circle (C2) with radius of 0.15 m. This circle was on a new planewhich is 0.1 m lower along the x -axis compared with (C1). The maximum speed of the movementwas at 2 π/
20 rad/s (or 3 rpm) and the direction of the movement was opposite.Fig. 7b and 7c show the tracking errors for the new trajectory (C2). The initial errors for x -axis are large (about 0.1 m) as the robot did not start on the new desired trajectory. The initialposition of the end effector was set as the same as the old trajectory (C1) used for training. Fig.7b shows the tracking errors when the Jacobian is used directly without any update of the weights.It can be seen that the peaks in the full-speed period (35 s - 125 s) are similar to each other (atabout 0.01 m), which means that the errors are the same for each evolution of the movement ofthe end effector. This is understandable as the weights are kept constant during the new trackingcontrol task. Fig. 7c shows the tracking errors when the Jacobian is updated during the onlinecontrol. Only the weights of the last layer were trained during online learning. It is observed thatthe peaks in the full-speed period (35 s - 125 s) now are much smaller than the previous case.Hence, from this case study, we can see that the FPL framework ensures the convergence of thetracking errors in the online learning. The framework also gives quite good generalization in thisexperiment and only the weights of the last layer are updated during online control task so as toachieve better tracking performance for the new trajectory. In this paper, we have presented a layer-wise deep learning framework in which a multilayer feed-forward network can be built and trained such that the convergence of the algorithm is ensured. Ithas been shown that a robot can learn to execute an online kinematic control task in a safe and pre-dictable manner without any modeling. The case studies of classification tasks using MNIST andCIFAR-10 databases have shown that using the learning framework can result in similar accuraciesas compared to the gradient descent method while ensuring convergence. 
The proposed method would widen the potential applications of deep learning in the areas of robotics and control.
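The last-layer-only update used in the online control experiment can be illustrated in spirit as follows. This is a minimal sketch, not the paper's exact update law: the frozen hidden layers, the random weights, and the toy data are all hypothetical stand-ins. The only idea carried over from the text is that the hidden layers are kept fixed, so the network reduces to a fixed feature map followed by a linear layer whose weights admit an analytic regularized least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen hidden layers: two random ReLU layers standing in for
# layers trained offline by the layer-wise algorithm (illustrative only).
W1 = rng.standard_normal((6, 16))
W2 = rng.standard_normal((16, 16))

def hidden(x):
    """Fixed feature map h(x) given by the frozen hidden layers."""
    h = np.maximum(x @ W1, 0.0)
    return np.maximum(h @ W2, 0.0)

def fit_last_layer(X, Y, lam=1e-3):
    """Analytic last-layer update: W = (H^T H + lam*I)^{-1} H^T Y."""
    H = hidden(X)
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ Y)

# Toy data standing in for collected (input, target) pairs.
X = rng.standard_normal((200, 6))
Y = np.tanh(X @ rng.standard_normal((6, 3)))

W_out = fit_last_layer(X, Y)
pred = hidden(X) @ W_out
print(pred.shape)  # (200, 3)
```

Because the last layer is linear in its weights, this single solve replaces iterative gradient steps, which is what makes the convergence of the update amenable to analysis.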