Learning Humanoid Robot Motions Through Deep Neural Networks
Luckeciano Carvalho Melo, Marcos Ricardo Omena Albuquerque Maximo, Adilson Marques da Cunha
Luckeciano C. Melo
Autonomous Computational Systems Lab, Computer Science Division, Aeronautics Institute of Technology, São José dos Campos, Brazil
[email protected]

Marcos R. O. A. Maximo
Autonomous Computational Systems Lab, Computer Science Division, Aeronautics Institute of Technology, São José dos Campos, Brazil
[email protected]

Adilson Marques da Cunha
Computer Science Division, Aeronautics Institute of Technology, São José dos Campos, Brazil
[email protected]
January 3, 2019

Abstract
Controlling a high-degrees-of-freedom humanoid robot is acknowledged as one of the hardest problems in Robotics. Due to the lack of mathematical models, an approach frequently employed is to rely on human intuition to design keyframe movements by hand, usually aided by graphical tools. In this paper, we propose a learning framework based on neural networks in order to mimic humanoid robot movements. The developed technique does not make any assumption about the underlying implementation of the movement, therefore both keyframe and model-based motions may be learned. The framework was applied in the RoboCup 3D Soccer Simulation domain and promising results were obtained using the same network architecture for several motions, even when copying motions from other teams.

Keywords: Robotics, Machine Learning, Neural Networks.
1 Introduction

RoboCup Soccer 3D Simulation League (Soccer 3D) is a particularly interesting challenge concerning humanoid robot soccer. It consists of a simulation environment of a soccer match with two teams, each one composed of up to 11 simulated NAO robots [1], the official robot used for the RoboCup Standard Platform League since 2008. Soccer 3D is interesting for robotics research since it involves high-level multi-agent cooperative decision making while providing a physically realistic environment which requires control and signal processing techniques for robust low-level skills.

At the current level of evolution of Soccer 3D, motion control is a key factor in a team's performance. Indeed, controlling a humanoid robot with a high number of degrees of freedom is acknowledged as one of the hardest problems in Robotics. Much effort has been devoted to humanoid robot walking, where researchers have been very successful in designing control algorithms which reason about reduced-order mathematical models based on the Zero Moment Point (ZMP) concept, such as the linear inverted pendulum model [2]. Nevertheless, these techniques restrict the robot to operate under a small region of its dynamics, where the assumptions of the simplified models are still valid [3, 4].

Therefore, model-based techniques are hard to use for designing highly dynamic movements, such as a long distance kick or a goalkeeper's dive to defend the goal from a fast moving ball. In the robot soccer domain, a common approach
for these movements is to employ keyframe movements, where the motion is composed of a sequence of robot postures. In this case, the movement is designed off-line and executed in an open-loop fashion at execution time.

Due to the lack of mathematical models, an approach frequently employed is to rely on human intuition to design keyframe movements by hand, usually aided by graphical tools. However, this process is difficult and time consuming, and it is often unable to obtain high performance motions given the high dimensionality of the search space. Another possible solution is to use motion capture data from humans [5], which has its own challenges due to the fact that the kinematic and dynamic properties of a humanoid robot differ greatly from those of a human.

Therefore, many works have experimented with using machine learning and optimization algorithms to develop high performance keyframe movements, showing promising results. Rei et al. describe an algorithm which is able to mimic movements observed from other agents and improve these learnt motions through an evolutionary strategy [6]. Other works have also used optimization to improve the performance of existing keyframe movements [4, 7]. Moreover, Peng et al. have developed control policies through deep reinforcement learning that mimic reference motions [8].

A common fact regarding the aforementioned machine learning and optimization approaches is that they rely on reference motions for learning. Due to the high dimensionality of the search space, direct optimization from a completely random point most likely fails to find a useful movement. In this work, we contribute by showing that a neural network may be taught through supervised learning to mimic an existing movement. From an engineering standpoint, having a keyframe movement represented as a neural network does not provide advantages by itself. However, our intention in a future work is to use this neural network as a seed for reinforcement learning methods such as the one shown in [8].

The remainder of this work is organized as follows. Section 2 provides theoretical background. In Section 3, the methodology and tools used in this work are explained. Furthermore, Section 4 presents simulation results to validate our approach. Finally, Section 5 concludes and shares our ideas for future work.

2 Background

2.1 Keyframe Movements

Definition 1. A keyframe k = [j_1, j_2, ..., j_n]^T ∈ K ⊆ R^n is an ordered set of joint angular positions, where K and n are the joint space and the number of degrees of freedom of the robot, respectively.

Definition 2. A keyframe step is an ordered pair s = (k, t) ∈ S = K × R, where k is a keyframe and t is the time when the keyframe must be achieved with respect to the beginning of the movement.

Definition 3. A keyframe movement, or simply a movement, is defined as m = (s_1, s_2, ..., s_γ, r) ∈ M = S^γ × R, where γ and r are the number of keyframe steps and the speed rate of the movement, respectively. In this representation, we assume the movement starts at time 0 and the first keyframe step represents the robot posture at the beginning of the movement. Therefore, t_1 = 0 and each time t_i, for i ≥ 2, is measured with respect to the beginning of the movement.
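Definitions 1-3 map naturally onto simple data structures. The sketch below is only an illustration of this representation in Python; the class and field names are ours, not part of any specific codebase.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class KeyframeStep:
    """A keyframe (n joint angles) and the time t at which it must be
    reached, measured from the beginning of the movement."""
    joints: List[float]   # k = [j_1, ..., j_n]^T, in radians
    time: float           # t, with t_1 = 0 for the first step

@dataclass
class Movement:
    """A keyframe movement: gamma keyframe steps plus a speed rate r."""
    steps: List[KeyframeStep]
    speed_rate: float = 1.0
```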
Keyframe movements are executed in an open-loop fashion, where joint positions are computed through interpolation of keyframe steps based on the current time. If the interface to the robot joints is not position-based, local controllers may be used to track the position references issued by the keyframe. For example, in the SimSpark simulator, the simulated NAO has speed-controlled joints, therefore we use simple proportional controllers for each joint to track the desired joint positions. To obtain smooth joint trajectories, we interpolate keyframe steps using cubic splines [9], which are functions of class C^2.

2.2 Neural Networks

Neural Networks are a learning representation whose goal is to approximate some function f*. The data collected from an environment encodes an underlying function y = f*(x) that maps an input x to an output y, which may be a category from a classifier or a continuous value in regression problems. The neural network defines an approximate mapping y = f(x; θ) by learning the values of the parameters θ which result in the best function approximation. Figure 1 shows a neural network and an artificial neuron in detail.

These networks are typically represented by composing together many different functions, which are associated with a directed acyclic graph describing a computational model. For example, we might have three layers (each of them representing a function f^(1), f^(2), and f^(3), respectively), connected in a chain, resulting in a final representation f(x) = f^(3)(f^(2)(f^(1)(x))).
Figure 1: Artificial neuron and feed-forward artificial neural network [10].

During neural network training, the objective is to adjust f(x) to match f*(x) using the training dataset, which provides noisy examples of f*(x) evaluated at different points. The training examples specify directly what the output layer must do at each point x, but the learning algorithm must decide how to use all layers to produce this desired output [11].

Additionally, we must also choose a learning algorithm to tune this function approximation. In the context of neural networks, gradient-based algorithms are broadly used, especially those based on the backpropagation idea [12]. The purpose of these algorithms is to propagate the gradient of a cost function through the whole network, in order to minimize that cost function. Most modern neural networks perform this optimization using maximum likelihood, i.e., by minimizing the cross-entropy between the training data and the model distribution:

J(θ) = −E_{x,y ∼ p̂_data} log p_model(y | x).   (1)

In this work, we used the mean squared error loss function in order to fit the dataset. Indeed, we may show that both cost functions are closely related. Let us consider normally distributed errors:

p_model(y | x) = N(y; f(x; θ), σ^2 I),   (2)

where f(x; θ) and σ^2 I are the mean and covariance of this distribution, respectively. Substituting Eq. (2) into Eq. (1):

J(θ) = (1/2) E_{x,y ∼ p̂_data} ||y − f(x; θ)||^2 + const.   (3)

The constant term does not depend on θ and may be dropped. By explicitly evaluating the expectation in Eq. (3), we arrive at the mean squared error cost function:

J(θ) = (1/(2m)) Σ_{i=1}^{m} ||y_i − f(x_i; θ)||^2.   (4)

Lastly, the gradient of the loss function is taken and propagated through the hidden layers by the chain rule. For example, given Y = g(X) and z = f(Y), the chain rule states:

∇_X z = Σ_j (∇_X Y_j) ∂z/∂Y_j.   (5)

This equation is applied recursively until the gradient is propagated to all layers of the neural network.

3 Methodology

3.1 Dataset Acquisition

In order to use supervised learning for learning keyframe motions using neural networks, we first need to construct a dataset. A dataset consists of samples of keyframe steps. The samples were collected within the Soccer 3D environment with a frequency of 50 Hz. We acquired these samples in two different ways.
In the first one, we commanded an agent of our team to execute specific motions and sampled the reference joint positions computed by our code. In this case, we sampled the kick and get up keyframe motions [4]. Notice that for this approach to be successful, one needs access to the source code.

The second approach involved changing the Soccer 3D server source code to provide the current joint positions of a given robot, in a similar way as described in [13]. This allowed us to acquire motion datasets from other teams, without any knowledge of how these movements are implemented. In this case, we collected two types of kicks based on keyframes and sampled joint values of the walking engine [14].
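As an illustration of how such logs can be turned into training data, the sketch below assembles samples into input/output arrays. The column layout (time instant followed by 22 joint angles and a "has ended" flag) and the file name are assumptions for illustration, not the exact format produced by our logger.

```python
import numpy as np

# Hypothetical log format: one row per 20 ms sample (50 Hz), with columns
# [time, joint_1, ..., joint_22, has_ended]; the file name is illustrative.
data = np.loadtxt("kick_motion_log.csv", delimiter=",")

X = data[:, 0:1]   # input: time instant since the beginning of the motion
Y = data[:, 1:]    # targets: 22 joint angles + "has ended" flag (23 values)

assert Y.shape[1] == 23, "expected 22 joints plus one termination flag"
print(f"{X.shape[0]} samples collected at 50 Hz")
```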
The neural network has to be able to learn how to interpolate between samples, which indeed happens. The architecture that performed best – in terms of mean absolute error minimization and simplicity – is shown in Figure 2. A deep neural network with 2 hidden, fully connected layers of 75 and 50 neurons was used. The output layer has 23 regression neurons, which represent the 22 joint angles and a neuron whose output indicates whether the motion has ended or not. The neurons in each hidden layer use the LeakyReLU activation function [15]:

f(x) = αx, if x < 0;  x, if x ≥ 0,

where α is a small constant. This activation function was used to improve the representation capacity of the neural network, adding support for non-linear functions.
Figure 2: Architecture of the neural network designed to learn motions (input: time instant; hidden layers of 75 and 50 neurons; outputs: 22 joint angles and a "has ended" flag).

This architecture results in thousands of parameters to optimize, as shown in Table 1, which is a very high number when compared to more traditional optimization approaches [14]. Notice that increasing the number of parameters usually allows representing movements better.

Table 1: Network Summary
Layer   Neurons   Activation   Parameters
Dense   75        LeakyReLU    150
Dense   50        LeakyReLU    3800
Dense   23        Linear       1173
Total parameters: 5123
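The architecture above can be expressed directly in Keras. The snippet below is a minimal sketch consistent with Figure 2 and Table 1; the value of α for LeakyReLU and the names used are assumptions, since the text does not report them.

```python
from keras.models import Sequential
from keras.layers import Dense, LeakyReLU

def build_motion_network(alpha=0.1):
    """Builds the 1 -> 75 -> 50 -> 23 regression network of Figure 2.

    alpha is the LeakyReLU slope for x < 0; its exact value is not
    reported in the text, so 0.1 here is only an assumption.
    """
    model = Sequential()
    model.add(Dense(75, input_dim=1))   # input: a single time instant
    model.add(LeakyReLU(alpha=alpha))
    model.add(Dense(50))
    model.add(LeakyReLU(alpha=alpha))
    # 22 joint angles + 1 "has ended" flag, linear activation for regression
    model.add(Dense(23, activation="linear"))
    return model

model = build_motion_network()
model.summary()  # parameter counts should match Table 1 (5123 in total)
```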
Since keyframe motions are executed in an open-loop fashion, the sequence of joint positions is always the same across repetitions, independently of the robot's state. Therefore, adding samples of multiple executions of the same motion would not make our dataset richer, so we decided to use only one repetition of each movement for faster training. In the case of the walking motion, we collected samples within one walking period.

During training, we used 50 thousand epochs divided into 5 training phases, where the learning rate was decreased between phases in order to achieve better performance. First, we executed 30000 epochs using a learning rate of 0.001. The other phases had 5000 epochs each, and we decreased the learning rate by 0.0002 in each phase.
Furthermore, we used Adam optimization [16] during the whole training. The loss function used was the mean squared error, as explained in Subsec. 2.2. We consider this loss function adequate for this problem because it strongly penalizes large errors, which can collapse the whole motion.
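A minimal sketch of this phased training schedule is given below, assuming the model and the X, Y arrays from the sketches above; the batch size and other fit arguments are not reported in the text and are assumptions here.

```python
from keras.optimizers import Adam

# Phase schedule from the text: 30000 epochs at lr = 0.001, then four phases
# of 5000 epochs each, decreasing the learning rate by 0.0002 per phase.
phases = [(30000, 0.001), (5000, 0.0008), (5000, 0.0006),
          (5000, 0.0004), (5000, 0.0002)]

for epochs, lr in phases:
    # Re-compiling at each phase resets the optimizer state, which explains
    # the transient error peaks at phase transitions discussed in the results.
    model.compile(optimizer=Adam(lr=lr), loss="mean_squared_error",
                  metrics=["mean_absolute_error"])
    model.fit(X, Y, epochs=epochs, batch_size=32, verbose=0)  # batch size assumed
```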
In order to perform the network design and the training procedure, we used the Keras [17] framework with TensorFlow [18] as backend. After training, the weights were frozen and converted to a specific format which is readable using the TensorFlow C++ API integrated within the agent's code. Hence, the training is performed outside the environment, but the agent actually computes network inferences during simulation execution.
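The freezing step can be done with the standard TensorFlow 1.x graph utilities, as sketched below; the output file name and directory are illustrative, and the exact export format consumed by the agent may differ.

```python
import tensorflow as tf
from keras import backend as K

# Freeze the trained Keras model into a single protobuf graph that the
# TensorFlow C++ API can load inside the agent (TensorFlow 1.x style).
sess = K.get_session()
output_names = [out.op.name for out in model.outputs]
frozen_graph = tf.graph_util.convert_variables_to_constants(
    sess, sess.graph.as_graph_def(), output_names)
tf.train.write_graph(frozen_graph, "export", "motion_network.pb", as_text=False)
```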
4 Results

All results and logs obtained, as well as the code used for neural network training, are available in the project repository (https://goo.gl/MjRWAH) for reproducibility.
Figure 3: Plots of the mean squared error and mean absolute error during training.

The initial results come from the training procedure, outside the simulation environment. Figure 3 presents training curves for the kick keyframe dataset. In this case, the plots show the mean squared error and mean absolute error metrics, respectively. In both metrics, the value decreases drastically in the first epochs. This same behavior was present in the other training procedures as well. However, only after thousands of epochs did the network achieve an error low enough to reproduce the motion successfully, which shows how sensitive keyframes are to small joint errors, given that they are open-loop motions. The peaks during training happen at the learning rate transition instants, but they do not hurt the training procedure itself. This is because we re-compile the model at each training phase, which resets the optimizer state. Therefore, training suffers a little at the beginning of each phase until the optimizer state (e.g., the moment estimates) is rebuilt, but there is no damage to the weights.
The final mean absolute error is very small (in radians), and the motion is visually indistinguishable from the original one, as can be seen in Figure 4 (kick results video: https://streamable.com/gpltm). In this figure, snapshots from both motions were taken. Figure 5 shows several plots of joint angles comparing the original and learned kick motions. As we may see, the learned motion has fitted the movement with minor errors.
Figure 4: Kick motion. The first row of figures shows the original kick motion. The second row shows the learned kick motion. The motions are visually indistinguishable.

In order to evaluate the learned kick motion in the RoboCup Soccer 3D domain, we created a statistical test. In the test scenario, the ball is initially placed at the center of the field with an agent near it. The only action of the agent is to kick the ball in the goal direction. After the kick, the agent runs until reaching the ball and kicks it again, repeating this process until scoring a goal. When a goal occurs, the same scenario is restarted. The whole test was conducted for thirty minutes of clock time and the following data was collected: total number of kicks, number of successful kicks, mean distance traveled by the ball, and the standard deviation of this measure. The results for the original and learned kicks are shown in Table 2.
Figure 5: Joint values comparing the original and learned kicks (leftHipPitch, leftArmRoll, leftShoulderPitch, leftHipRoll, leftFootPitch, leftShoulderYaw, leftHipYawPitch, leftFootRoll, and leftKneePitch). The neural network was able to fit the joint trajectories with small errors.

Although both kicks have similar results, the original kick is slightly better in this scenario. Comparing this with Figure 5, we can conclude that even with an almost identical representation, the kick loses part of its efficiency, which again shows how sensitive movements based on keyframe data are.
Table 2: Kick Comparison

Type            Accuracy (%)   Distance Mean (m)   Distance Std (m)
Original Kick   64.5           8.92                3.82
Neural Kick     52.6           7.16                4.06
Using the modified server described in Subsec. 3.1, a dataset with samples of the UT Austin Villa walking motion [13] was acquired. This team is the current champion of the RoboCup Soccer 3D competition [19]. The objective is to mimic the walk motion as a keyframe and use it in our agent. The previously described framework used for learning our own kick motion was also used in this training, including the neural network architecture and its hyperparameters.

The results from this training are shown in Figure 6. Similarly to Figure 5, it shows the joint angles throughout the walking motion period for the original and learned walk. Additionally, it shows the actual joint values attained by the movement in the server. These joints were chosen because they are the most dynamic in the walk motion and therefore the hardest to learn.
Figure 6: Joint positions during a period of the walking motion for the original walk, the learned walk, and the joint positions effectively attained during the learned walking motion (rightHipPitch, rightKneePitch, rightFootPitch, rightFootRoll, rightHipRoll, and rightShoulderYaw).

The learned motion has fitted the dataset very well. However, these values are just desired joint positions: they are used as references for the joint controllers and are attenuated due to joint dynamics. Furthermore, this motion is operated in an open-loop fashion, so the agent is not able to correct its own trajectory, and the walk becomes biased even in the simple task of walking straight forward.

Despite the facts above, the motion works well in a non-competitive scenario (walk results video: https://streamable.com/m5w1d), as shown by the metrics collected in the Forward Walk test scenario – the agent walking forward from the goal post until the center line of the field – in Table 3 and by the visual comparison in Figure 7.

This same framework was used to learn other keyframe motions originated by our agent itself, such as the get up motion. As in the cases previously described, the resulting neural network was capable of mimicking the keyframe, including its interpolation. Hence, all of our keyframe motions can be replaced by neural motions with similar performance.
Table 3: Walk Comparison - Forward Walk

Type            Velocity Mean (m/s)   Velocity Std (m/s)   Y Error Mean (m)   Y Error Std (m)
Original Walk   0.87                  0.01                 -                  -
Learned Walk    0.23                  0.01                 0.96               2.63
Figure 7: Walking motions comparison. Figure (a) shows our agent performing its regular walk. Figure (b) shows the same agent mimicking the UT Austin Villa walk. Figure (c) shows the UT Austin Villa agent itself, performing its own walking motion.

However, the main benefit of this method lies in mimicking other teams' motions. In the Soccer 3D environment, movements like kicking and walking have a large impact on a team's performance. With this learning framework, our agent is able to mimic multiple movements from several teams. As an example, we collected data from the UT Austin Villa kick, which was originally optimized using Deep Reinforcement Learning techniques [20]. Our agent learned this kick without any additional optimization strategy: we just used samples collected from the modified server.
5 Conclusions and Future Work

In this work, we presented a method for learning humanoid robot movements using datasets composed of joint values at each time instant. The learning framework was capable of learning several types of motion, including walking and kicking, without any change in network architecture or hyperparameters. Moreover, the learned motions have performance similar to the original ones. Furthermore, this framework was able to learn other teams' motions, without any knowledge about the underlying implementation – using only the joint values provided by a modified version of the server. This is a significant step towards obtaining improved motions, as our agent can mimic other teams' motions using this machine learning technique.

As future work, we plan to apply Deep Reinforcement Learning algorithms to obtain faster and more robust kicks, using the neural networks obtained in this work as a "seed". Another track to be followed is to transfer the learning of this network to a new network that represents the motion policy itself (i.e., a network whose inputs are the current state of the robot, including joint and link states, besides the current time instant), and to optimize this motion policy in order to obtain closed-loop walking and kicking motions that can correct themselves. As a long term goal, we intend to create model-free kick and walking engines.
Acknowledgments

We thank our sponsors ITAEx, Altium, Intel, Mathworks, Metinjo, Micropress, Poliedro, Polimold, Poupex, FHC, Rapid, and Solidworks. We are also grateful to ITA for supporting our work. We would also like to show our gratitude to Patrick MacAlpine from the UT Austin Villa team for sharing their ideas and code regarding the Soccer 3D simulation server modification. Finally, we would like to acknowledge the whole ITAndroids team, especially the Soccer 3D simulation team members, for the hard work in the development of the base code.
References

[1] D. Gouaillier, V. Hugel, P. Blazevic, C. Kilner, J. Monceaux, P. Lafourcade, B. Marnier, J. Serre, and B. Maisonnier. Mechatronic design of NAO humanoid. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 769-774, May 2009.

[2] Shuuji Kajita, Fumio Kanehiro, Kenji Kaneko, Kazuhito Yokoi, and Hirohisa Hirukawa. The 3D Linear Inverted Pendulum Mode: A simple modeling for a biped walking pattern generation. In Proceedings of the 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems, Hawaii, USA, October 2001. IEEE.

[3] Steven Collins, Andy Ruina, Russ Tedrake, and Martijn Wisse. Efficient bipedal robots based on passive dynamic walkers. Science, 307:1082-1085, February 2005.

[4] F. Muniz, M. R. O. A. Maximo, and C. H. C. Ribeiro. Keyframe movement optimization for simulated humanoid robot using a parallel optimization framework. In Proceedings of the 2016 Latin American Robotics Symposium and Brazilian Robotics Symposium (LARS/SBR), pages 79-84, Oct 2016.

[5] Aaron P. Shon, Keith Grochow, and Rajesh P. N. Rao. Robotic imitation from human motion capture using Gaussian processes. In Proceedings of the IEEE/RAS International Conference on Humanoid Robots (Humanoids), 2005.

[6] Mike Depinet, Patrick MacAlpine, and Peter Stone. Keyframe sampling, optimization, and behavior integration: Towards long-distance kicking in the RoboCup 3D simulation league. In H. Levent Akin, Reinaldo A. C. Bianchi, Subramanian Ramamoorthy, and Komei Sugiura, editors, RoboCup 2014: Robot Soccer World Cup XVIII. Springer Verlag, 2015.

[7] Abbas Abdolmaleki, David Simões, Nuno Lau, Luis Paulo Reis, and Gerhard Neumann. Learning a humanoid kick with controlled distance. In Sven Behnke, Raymond Sheh, Sanem Sarıel, and Daniel D. Lee, editors, RoboCup 2016: Robot World Cup XX, pages 45-57, Cham, 2017. Springer International Publishing.

[8] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (Proc. SIGGRAPH 2018), 37(4), 2018.

[9] Richard H. Bartels, John C. Beatty, and Brian A. Barsky. An Introduction to Splines for Use in Computer Graphics and Geometric Modeling. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1987.

[10] Dejan Tanikic and Vladimir Despotovic. Artificial intelligence techniques for modelling of temperature in the metal cutting process. In Yogiraj Pardhi, editor, Metallurgy, chapter 7. IntechOpen, Rijeka, 2012.

[11] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[12] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533-536, 1986.

[13] Patrick MacAlpine, Nick Collins, Adrian Lopez-Mobilia, and Peter Stone. UT Austin Villa: RoboCup 2012 3D simulation league champion. In Xiaoping Chen, Peter Stone, Luis Enrique Sucar, and Tijn van der Zant, editors, RoboCup 2012: Robot Soccer World Cup XVI, pages 77-88, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.

[14] Patrick MacAlpine, Samuel Barrett, Daniel Urieli, Victor Vu, and Peter Stone. Design and optimization of an omnidirectional humanoid walk: A winning approach at the RoboCup 2011 3D simulation competition. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI), July 2012.

[15] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint, 2015.

[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint, Dec 2014.

[17] François Chollet et al. Keras. https://keras.io, 2015.

[18] Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[19] Patrick MacAlpine and Peter Stone. UT Austin Villa: RoboCup 2017 3D simulation league competition and technical challenges champions. In Claude Sammut, Oliver Obst, Flavio Tonidandel, and Hidehisa Akyama, editors, RoboCup 2017: Robot Soccer World Cup XXI, Lecture Notes in Artificial Intelligence. Springer, 2018.

[20] Patrick MacAlpine and Peter Stone. UT Austin Villa: RoboCup 2017 3D simulation league competition and technical challenges champions. In Claude Sammut, Oliver Obst, Flavio Tonidandel, and Hidehisa Akyama, editors,